Accessing web pages with Python

Published: 2019-09-07 08:03:53 | Editor: auto

    Python version: 3

    Fetching a page:

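    import urllib.request

    url = "https://blog.csdn.net/qq_33160790"
    # Build the request and open it; urlopen returns a response object
    req = urllib.request.Request(url)
    resp = urllib.request.urlopen(req)
    # read() returns bytes, so decode them (the page is assumed to be UTF-8)
    data = resp.read().decode('utf-8')

    print(data)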

    Result: the script prints the HTML source of the page.
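The snippet above assumes that the page is UTF-8 and that the request succeeds. As a minimal sketch of a more defensive variant, the hypothetical helper `fetch_text` below adds a timeout, uses the charset declared in the response headers when there is one, and returns `None` if the request fails. The helper name, the 10-second timeout, and the `User-Agent` value are assumptions for illustration and are not part of the original post.

```python
import urllib.error
import urllib.request


def fetch_text(url, timeout=10.0):
    """Return the decoded body of `url`, or None if the request fails."""
    # Assumption: some sites reject the default Python User-Agent,
    # so a browser-like value is sent instead.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            # Prefer the charset from the Content-Type header; fall back to UTF-8.
            charset = resp.headers.get_content_charset() or "utf-8"
            return resp.read().decode(charset, errors="replace")
    except urllib.error.URLError as exc:
        # URLError also covers HTTPError (4xx/5xx responses).
        print("request failed:", exc)
        return None
```

Calling `fetch_text("https://blog.csdn.net/qq_33160790")` should then return the page source as a string, or `None` if the site cannot be reached.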


    Scraping the article links from a CSDN page:
    For an introduction to XPath syntax, see this article:
    http://www.w3school.com.cn/xpath/xpath_syntax.asp

    from lxml import etree
    import requests

    url = 'https://blog.csdn.net/qq_33160790'
    resp = requests.get(url)
    if resp.status_code == requests.codes.ok:
        # Parse the HTML and select the href of every article-title link
        html = etree.HTML(resp.text)
        hrefs = html.xpath('//span[@class="link_title"]/a/@href')
        for href in hrefs:
            print(href)  # print is a function in Python 3
    

    Result: the script prints one article link per line.
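To see what the XPath expression selects without touching the network, here is a small sketch that runs a similar query against an inline snippet. It uses the standard library's `xml.etree.ElementTree` rather than lxml, so it only supports a subset of XPath: the path must start with `.//`, the markup must be well-formed, and attributes are read with `.get()` instead of `/@href`. The snippet itself is a made-up stand-in for a CSDN list page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a CSDN article list page.
SNIPPET = """
<div>
  <span class="link_title"><a href="/article/details/1">First post</a></span>
  <span class="link_view"><a href="/stats/1">Views</a></span>
  <span class="link_title"><a href="/article/details/2">Second post</a></span>
</div>
"""

root = ET.fromstring(SNIPPET)

# ".//span" searches all descendants of the root element; the predicate keeps
# only spans whose class attribute equals "link_title"; "/a" steps to child <a>.
anchors = root.findall(".//span[@class='link_title']/a")

# ElementTree cannot select attributes with "/@href", so read them explicitly.
hrefs = [a.get("href") for a in anchors]
print(hrefs)
```

The `link_view` span is skipped because its class does not match the predicate. With lxml, appending `/@href` to the expression returns the attribute values directly, as in the script above.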


    Printing the URLs of all articles:

    from lxml import etree
    import requests

    # range(1, 23) visits list pages 1 to 22; set the upper bound to page count + 1
    for i in range(1, 23):
        url = 'https://blog.csdn.net/qq_33160790/article/list/' + str(i)
        resp = requests.get(url)
        if resp.status_code == requests.codes.ok:
            html = etree.HTML(resp.text)
            hrefs = html.xpath('//span[@class="link_title"]/a/@href')
            for href in hrefs:
                print(href)
    

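The upper bound 23 in the loop above is hard-coded. One possible way to make its relationship to the page count explicit is to build the list-page URLs from that count. In this sketch, `list_page_urls` and its parameters are hypothetical names; the `/article/list/<n>` pattern is taken from the script above, and the page count of 22 is assumed from its `range(1, 23)`.

```python
def list_page_urls(base_url, page_count):
    """Return the list-page URLs for pages 1..page_count inclusive."""
    prefix = base_url.rstrip("/") + "/article/list/"
    # range() excludes its upper bound, hence page_count + 1.
    return [prefix + str(page) for page in range(1, page_count + 1)]


urls = list_page_urls("https://blog.csdn.net/qq_33160790", 22)
print(urls[0])    # first list page
print(len(urls))  # number of pages that will be visited
```

Each URL in `urls` can then be fetched and parsed in the same way as in the loop above.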


    Script for inflating CSDN article click counts:
    PS: adjust url and 23 to match your own blog.

    from lxml import etree
    import requests
    import urllib.request

    # range(1, 23) visits list pages 1 to 22; set the upper bound to page count + 1
    for i in range(1, 23):
        url = 'https://blog.csdn.net/qq_33160790/article/list/' + str(i)
        resp = requests.get(url)
        if resp.status_code == requests.codes.ok:
            html = etree.HTML(resp.text)
            hrefs = html.xpath('//span[@class="link_title"]/a/@href')
            for href in hrefs:
                print(href)
                # Request each article page once; the body is read and discarded
                req = urllib.request.Request(href)
                data = urllib.request.urlopen(req).read()
    
