VII Python (7): Web Crawlers

Published: 2019-09-17 07:46:55  Editor: auto  Views: 1963


     

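    Web crawlers (also called web spiders):

    Accessing the internet from Python:

    The urllib and urllib2 modules (Python 2.* ships both urllib and urllib2; Python 3, e.g. 3.4.1, merges them into a single package named urllib. Note that in version 3 it is a package, not a module);

    The json module (JSON is a lightweight data-interchange format; here it is used to wrap Python data structures as strings);

    General format of a URL:

    protocol://hostname[:port]/path/to/file

    Protocols include: http, https, ftp, file, ed2k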
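
The URL format above can be checked by splitting a URL into its parts. This is a minimal sketch in Python 3 (the article's own sessions use Python 2), and the URL is a made-up example, not one from the article.

```python
# Python 3 sketch: split a URL into the parts named in the general format.
from urllib.parse import urlsplit

# Hypothetical URL used only for illustration.
parts = urlsplit('http://www.example.com:8080/path/to/file')

print(parts.scheme)    # the protocol
print(parts.hostname)  # the hostname
print(parts.port)      # the optional port (None when the URL omits it)
print(parts.path)      # the path to the file
```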

     

    In [1]: import urllib

    In [2]: dir(urllib)

    ……

     'urlopen',

     'urlretrieve']

    In [6]: help(urllib.urlopen)

    urlopen(url, data=None, proxies=None)

       Create a file-like object for the specified URL to read from.

    In [18]: help(urllib.urlretrieve)

    urlretrieve(url, filename=None, reporthook=None, data=None)

    In [19]: help(urllib.urlencode)

    urlencode(query, doseq=0)

       Encode a sequence of two-element tuples or dictionary into a URL query string.
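
To make the urlencode() docstring concrete, here is a small sketch. It assumes Python 3, where the function lives in urllib.parse rather than in urllib itself; the field values are illustrative.

```python
# Python 3 sketch: build a query string from pairs or from a dictionary.
from urllib.parse import urlencode

# A sequence of two-element tuples; spaces are encoded as '+'.
pairs = urlencode([('type', 'AUTO'), ('i', 'hello world')])
print(pairs)

# With doseq=True, a list value is expanded into one key=value pair per item.
multi = urlencode({'smartresult': ['dict', 'rule', 'ugc']}, doseq=True)
print(multi)
```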

     

    In [1]: import urllib2

    In [2]: help(urllib2.urlopen)

    urlopen(url, data=None, timeout=<object object>)

    In [8]: help(urllib2.Request)

    __init__(self, url, data=None, headers={}, origin_req_host=None, unverifiable=False)

    add_header(self, key, val)

    In [19]: help(urllib2.ProxyHandler)

    __init__(self, proxies=None)

    proxy_open(self, req, proxy, type)

     

    In [23]: import json

    In [24]: json.<TAB>

    json.JSONDecoder  json.decoder      json.dumps        json.load         json.scanner      

    json.JSONEncoder  json.dump         json.encoder      json.loads 

    In [24]: help(json.loads)

    loads(s, encoding=None, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)

       Deserialize ``s`` (a ``str`` or ``unicode`` instance containing a JSON

       document) to a Python object.
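
As a quick illustration of how json wraps Python data structures as strings, the following sketch round-trips a dictionary. The record contents are invented for the example.

```python
# Sketch: serialize a Python structure to a JSON string and read it back.
import json

record = {'name': 'cat', 'sizes': [500, 600], 'ok': True}  # illustrative data

text = json.dumps(record)     # Python object -> JSON string
restored = json.loads(text)   # JSON string -> Python object

print(text)
print(restored == record)
```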

     

    In [10]: import time

    In [11]: time.<TAB>

    time.accept2dyear  time.clock         time.gmtime        time.sleep         time.struct_time   time.tzname

    time.altzone       time.ctime         time.localtime     time.strftime      time.time          time.tzset

    time.asctime       time.daylight      time.mktime        time.strptime      time.timezone     

    In [11]: help(time.sleep)

    sleep(...)

       sleep(seconds)

     

    Example 1:

    In [13]: response=urllib.urlopen('http://www.FishC.com')

    In [14]: html=response.read()

    In [15]: print html   # If the printed content (the same code you see with Inspect Element in a browser) looks garbled, decode it using the site's encoding, e.g. html=html.decode('utf-8')

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"

             "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

    <!--

    (c) 2011 ubomr Krupa, CC BY-ND 3.0

     --> 

    <html xmlns="http://www.w3.org/1999/xhtml">

             <head>

                       <meta http-equiv="content-type" content="text/html; charset=utf-8" />

    ……

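    In [19]: response.<TAB>   # Methods and attributes available on the opened page: geturl() returns the URL that was fetched, info() returns the response headers as an httplib.HTTPMessage object, and getcode() returns the HTTP status code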

    response.close      response.fp         response.headers    response.read       response.url       

    response.code       response.getcode    response.info       response.readline  

    response.fileno     response.geturl     response.next       response.readlines

    In [19]: response.geturl()

    Out[19]: 'http://www.FishC.com'

    In [20]: response.info()

    Out[20]: <httplib.HTTPMessage instance at 0x16a7b48>

    In [21]: print response.info

    <bound method addinfourl.info of <addinfourl at 23755304 whose fp = <socket._fileobject object at 0x15abbd0>>>

    In [22]: response.getcode()

    Out[22]: 200
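
Example 1 mentions decoding the page according to the site's encoding. One way to avoid hard-coding 'utf-8' is to read the charset from the Content-Type header. The sketch below assumes Python 3; decode_body is a hypothetical helper name, and the header value and sample bytes are made up so that no network access is needed.

```python
# Python 3 sketch: decode a response body using the charset in Content-Type.
from email.message import Message

def decode_body(raw, content_type, default='utf-8'):
    """Decode raw bytes with the declared charset, falling back to `default`."""
    msg = Message()
    msg['Content-Type'] = content_type
    charset = msg.get_content_charset() or default
    return raw.decode(charset, errors='replace')

# Simulated GBK-encoded body (hypothetical data, two Chinese characters).
sample = '\u5973\u5b69'.encode('gbk')
print(decode_body(sample, 'text/html; charset=GBK'))
```

In Python 3 the headers object of a real response is itself an email-style message, so response.headers.get_content_charset() can supply the charset directly.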

     

    Example 2 (save an image from placekitten.com):

    [root@localhost ~]# vim download_cat.py

    -----------------------script start-----------------------
    #!/usr/bin/python2.7
    #filename:download_cat.py
    import urllib

    # Fetch the image; read() returns the raw bytes of the response body.
    response=urllib.urlopen('http://placekitten.com/g/500/600')
    cat_img=response.read()

    # Write in binary mode so the image bytes are stored unchanged.
    with open('cat_500_600.jpg','wb') as f:
        f.write(cat_img)
    ----------------------script end--------------------------

    [root@localhost ~]# chmod 755 download_cat.py

    [root@localhost ~]# python2.7 download_cat.py

    [root@localhost ~]# ll cat_500_600.jpg

    -rw-r--r--. 1 root root 26590 Jun 19 22:10 cat_500_600.jpg

     

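    Example 3 (imitate the browser-based online translator):

    In the web page, right-click and choose Inspect Element --> Network, then locate the translation request; the information we need is under Headers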


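    Under Headers: the Request URL in the General section (translation only works with this address), the User-Agent in the Request Headers section (the server uses it to judge whether the visitor is human, though this value can be set freely), and Form Data (the main content submitted by POST)

    Note: GET requests data from the server; POST submits data to the specified server for processing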
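
The GET/POST distinction can be seen without sending anything: a request object with no data is a GET, and one carrying data is a POST. This is a Python 3 sketch (urllib.request replaces urllib2), and the endpoint URL is a placeholder.

```python
# Python 3 sketch: the presence of data decides between GET and POST.
from urllib.parse import urlencode
from urllib.request import Request

url = 'http://www.example.com/translate'  # hypothetical endpoint

get_req = Request(url)
# In Python 3 the POST body must be bytes, so the encoded form is encoded again.
post_req = Request(url, data=urlencode({'i': 'girl'}).encode('utf-8'))

print(get_req.get_method())
print(post_req.get_method())
```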

    [root@localhost ~]# vim translation.py

    ---------------------------script start------------------------
    #!/usr/bin/python2.7
    #filename:translation.py
    import urllib
    import json

    content=raw_input('please input translate content: ')
    url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=dict2.index'

    # Form Data fields copied from the browser's request.
    data={}
    data['type']='AUTO'
    data['i']=content
    data['doctype']='json'
    data['xmlVersion']='1.8'
    data['keyfrom']='fanyi.web'
    data['ue']='UTF-8'
    data['action']='FY_BY_CLICKBUTTON'
    data['typoResult']='true'
    data=urllib.urlencode(data)

    # Passing data makes urlopen send a POST request.
    response=urllib.urlopen(url,data)
    html=response.read()

    # The response body is JSON; the translated text is nested under translateResult.
    target=json.loads(html)
    print 'Translate the result: %s' %(target['translateResult'][0][0]['tgt'])
    -----------------------------script end---------------------------

    [root@localhost ~]# python2.7 translation.py

    please input translate content: girl

    Translate the result: 女孩
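
The script indexes the parsed JSON as target['translateResult'][0][0]['tgt'], which raises an exception if the server returns anything else. A defensive variant might look like the sketch below. The sample body is hypothetical: only its nesting is taken from the script's indexing, and extract_translation is an illustrative helper name.

```python
# Sketch: pull the translated text out of the JSON body without crashing.
import json

# Hypothetical response body, shaped only by the indexing used in the script.
html = '{"translateResult": [[{"tgt": "\\u5973\\u5b69"}]]}'

def extract_translation(body):
    """Return the first translated segment, or None if the shape is unexpected."""
    try:
        return json.loads(body)['translateResult'][0][0]['tgt']
    except (ValueError, KeyError, IndexError, TypeError):
        return None

print(extract_translation(html))
print(extract_translation('{}'))
```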

     

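    Notes:

    Improvements to this script:

    The code can be placed in a while loop that exits when the user enters quit or q;

    This script is not suitable for production use, because the server uses the User-Agent to tell human visits from scripted ones and will block the client after too many scripted requests. The fix is to disguise the script by changing the User-Agent: (1) define head={'User-Agent':'……'} beforehand and pass it to urllib2.Request(url,data,head); (2) after creating the request with req=urllib2.Request(url,data), add the header with req.add_header();

    Changing the User-Agent works, but the server also counts requests per IP. Once the count exceeds a threshold, the server treats the client as a crawler and may demand a CAPTCHA. A human user can solve the CAPTCHA, but the script above cannot and will be blocked. Fixes: (1) use the time module to delay each submission with time.sleep(3), so that the script (crawler) looks like a human browsing normally; (2) use proxy IPs (the recommended method)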
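
The two ways of setting the User-Agent described above can be compared side by side without sending a request. This is a Python 3 sketch (urllib.request.Request takes the place of urllib2.Request), and the URL is a placeholder.

```python
# Python 3 sketch: two equivalent ways to attach a User-Agent header.
from urllib.request import Request

url = 'http://www.example.com/'  # hypothetical address
ua = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36')

# Method 1: pass a headers dictionary when creating the request.
req1 = Request(url, headers={'User-Agent': ua})

# Method 2: create the request first, then add the header.
req2 = Request(url)
req2.add_header('User-Agent', ua)

# Request stores header names capitalized, so the lookup key is 'User-agent'.
print(req1.get_header('User-agent') == req2.get_header('User-agent'))
```

Note that a misspelled header name such as 'User-Agend' is sent as an unrelated header, and the library's default User-Agent is still used.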

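    Note:

    Three steps for using a proxy IP:

    1) proxy_support=urllib2.ProxyHandler({'http':'112.111.53.173:8888'}). Note that the argument must be a dictionary, in the form urllib2.ProxyHandler({'type':'proxy_ip:port'});

    2) Build a customized opener: opener=urllib2.build_opener(proxy_support);

    3) Install the opener with urllib2.install_opener(opener) so that later urllib2.urlopen() calls use it, or call opener.open(url) directly;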
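
The three steps above can be sketched as follows in Python 3, where the same classes live in urllib.request. The proxy address is the one quoted in step 1 and is very likely no longer reachable, so the sketch only builds the opener and does not send a request.

```python
# Python 3 sketch of the three proxy steps; no request is actually sent.
import urllib.request

# Step 1: the handler takes a dictionary of {'type': 'proxy_ip:port'}.
proxy_support = urllib.request.ProxyHandler({'http': '112.111.53.173:8888'})

# Step 2: build an opener that routes http requests through the proxy.
opener = urllib.request.build_opener(proxy_support)

# Step 3: install it globally so urllib.request.urlopen() uses it.
# Calling opener.open(url) directly would work without installing.
urllib.request.install_opener(opener)

print(proxy_support in opener.handlers)
```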

     

     

    Example 4 (improve Example 3 by changing the User-Agent, using method 1):

    [root@localhost ~]# vim translation.py

    ----------------------script start--------------------
    #!/usr/bin/python2.7
    #filename:translation.py
    import urllib
    import urllib2
    import json

    while True:
        content=raw_input('please input translate content: ')
        if content=='q':
            break

        url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=dict2.index'

        # Method 1: put the User-Agent in a dictionary and pass it to urllib2.Request.
        head={}
        head['User-Agent']='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36'

        data={}
        data['type']='AUTO'
        data['i']=content
        data['doctype']='json'
        data['xmlVersion']='1.8'
        data['keyfrom']='fanyi.web'
        data['ue']='UTF-8'
        data['action']='FY_BY_CLICKBUTTON'
        data['typoResult']='true'
        data=urllib.urlencode(data)

        req=urllib2.Request(url,data,head)
        response=urllib2.urlopen(req)
        html=response.read()
        target=json.loads(html)
        print 'Translate the result: %s' %(target['translateResult'][0][0]['tgt'])
    ------------------------------script end----------------------

    [root@localhost ~]# python2.7 translation.py

    please input translate content: ladies

    Translate the result: 女士们

    please input translate content: gentleman

    Translate the result: 绅士

    please input translate content: q

     

    Example 5 (improve Example 3 by changing the User-Agent, using method 2):

    [root@localhost ~]# vim translation.py

    ------------------------script start---------------------
    #!/usr/bin/python2.7
    #filename:translation.py
    import urllib
    import urllib2
    import json

    while True:
        content=raw_input('please input translate content: ')
        if content=='q':
            break

        url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=dict2.index'

        #head={}
        #head['User-Agent']='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36'

        data={}
        data['type']='AUTO'
        data['i']=content
        data['doctype']='json'
        data['xmlVersion']='1.8'
        data['keyfrom']='fanyi.web'
        data['ue']='UTF-8'
        data['action']='FY_BY_CLICKBUTTON'
        data['typoResult']='true'
        data=urllib.urlencode(data)

        # Method 2: create the request first, then add the User-Agent header.
        req=urllib2.Request(url,data)
        req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36')
        response=urllib2.urlopen(req)
        html=response.read()
        target=json.loads(html)
        print 'Translate the result: %s' %(target['translateResult'][0][0]['tgt'])
    ----------------------------script end---------------------------

    [root@localhost ~]# python2.7 translation.py

    please input translate content: cat

    Translate the result:

    please input translate content: dog

    Translate the result:

    please input translate content: q

     

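    Example 6 (improve Example 3: when code hits the translation server frequently, keep our IP from being blocked; method 1, delay each submission so that the next item can only be translated 3 s after the previous one):

    [root@localhost ~]# vim translation.py

    ----------------script start----------------
    #!/usr/bin/python2.7
    #filename:translation.py
    import urllib
    import urllib2
    import json
    import time

    while True:
    ……
        time.sleep(3)   # pause 3 seconds before accepting the next item
    -----------------script end----------------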

    [root@localhost ~]# python2.7 translation.py

    please input translate content: chinese

    Translate the result: 中国

    please input translate content: japanese

    Translate the result: 日本

    please input translate content: q
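
A fixed time.sleep(3) always waits the full 3 seconds, even when the user took longer than that to type the next word. One possible refinement, shown here as a Python 3 sketch with a hypothetical Throttle class, only sleeps for the part of the interval that has not yet elapsed. The demo uses 0.2 seconds instead of 3 to keep it short.

```python
# Python 3 sketch: enforce a minimum gap between consecutive requests.
import time

class Throttle:
    """Hypothetical helper: wait() blocks until min_interval has passed."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()

throttle = Throttle(0.2)  # the example above uses 3 seconds
start = time.monotonic()
for word in ['chinese', 'japanese']:
    throttle.wait()
    # A real script would send the translation request for `word` here.
elapsed = time.monotonic() - start
print(elapsed >= 0.15)
```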

     

    Example 7 (access a web page through a proxy):

    Preparation: use http://www.whatismyip.com.tw/ to see the IP currently in use, and http://www.xicidaili.com/ to obtain proxy IPs

    [root@localhost ~]# vim proxy_egg.py

    ---------------------script start--------------------
    #!/usr/bin/python2.7
    #filename:proxy_egg.py
    import urllib2
    import random

    url='http://www.whatismyip.com.tw'
    ip_list=['110.6.35.181:8888','122.193.55.64:81']

    # Pick one proxy at random and build an opener that routes http requests through it.
    proxy_support=urllib2.ProxyHandler({'http':random.choice(ip_list)})
    opener=urllib2.build_opener(proxy_support)
    #opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36')]

    # After install_opener, urllib2.urlopen uses this opener.
    urllib2.install_opener(opener)
    response=urllib2.urlopen(url)
    html=response.read()
    print html
    -------------------------script end------------------------

    [root@localhost ~]# python2.7 proxy_egg.py

    <html>

     <head>

       <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>

       <meta name="description" content="我的IP查詢"/>

       <meta name="keywords" content="ip,ip查詢,查我的ip,我的ip位址,我的ip位置,偵測我的ip,查詢我的ip,查看我的ip,顯示我的ip,whatis my IP,whatismyip,my IP address,my IP proxy"/>

       <title>我的IP位址查詢</title>

      </head>

     <body>

    <h1>IP位址</h1> <h2>122.193.55.64</h2>

     

    <script type="text/javascript">

    var sc_project=6392240;

    var sc_invisible=1;

    var sc_security="65d86b9d";

    var scJsHost = (("https:" == document.location.protocol) ? "https://secure." : "http://www.");

    document.write("<sc"+"ript type='text/javascript' src='" + scJsHost + "statcounter.com/counter/counter.js'></"+"script>");

    </script>

    <noscript><div class="statcounter"><a title="website statistics" href="http://statcounter.com/" target="_blank"><img class="statcounter" src="http://c.statcounter.com/6392240/0/65d86b9d/1/" alt="website statistics"></a></div></noscript>

     

     </body>

    </html>

     

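    Example 8 (improve Example 3: when the script hits the translation server frequently, keep the server from blocking our IP; method 2, use proxy IPs):

    Note: free proxy IPs are very unstable, so add as many proxy IPs to ip_list as possible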

    [root@localhost ~]# vim translation.py

    -----------------------script start-------------------
    #!/usr/bin/python2.7
    #filename:translation.py
    import urllib
    import urllib2
    import json
    import random

    while True:
        content=raw_input('please input translate content: ')
        if content=='q':
            break

        url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=dict2.index'
        ip_list=['123.185.109.86:8888','124.235.47.141:8888']
        proxy_support=urllib2.ProxyHandler({'http':random.choice(ip_list)})   # pick a random proxy for each request
        opener=urllib2.build_opener(proxy_support)
        opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36')]
        urllib2.install_opener(opener)

        data={}
        data['type']='AUTO'
        data['i']=content
        data['doctype']='json'
        data['xmlVersion']='1.8'
        data['keyfrom']='fanyi.web'
        data['ue']='UTF-8'
        data['action']='FY_BY_CLICKBUTTON'
        data['typoResult']='true'
        data=urllib.urlencode(data)

        req=urllib2.Request(url,data)
        response=urllib2.urlopen(req)
        html=response.read()
        target=json.loads(html)
        print 'Translate the result: %s' %(target['translateResult'][0][0]['tgt'])
    ----------------script end----------------

    [root@localhost ~]# python2.7 translation.py

    please input translate content: boy

    Translate the result: 男孩

    please input translate content: girl

    Translate the result: 女孩

    please input translate content: man

    Traceback (most recent call last):

     File "translation.py", line 32, in <module>

       response=urllib2.urlopen(req)

     File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 127, in urlopen

       return _opener.open(url, data, timeout)

     File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 404, in open

       response = self._open(req, data)

     File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 422, in _open

       '_open', req)

     File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 382, in _call_chain

       result = func(*args)

     File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 1214, in http_open

       return self.do_open(httplib.HTTPConnection, req)

     File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 1184, in do_open

       raise URLError(err)

    urllib2.URLError: <urlopen error [Errno 111] Connection refused>
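
The traceback shows the script dying as soon as the randomly chosen proxy refuses the connection. One way to cope with unstable free proxies is to catch URLError and move on to another proxy. The following is a Python 3 sketch, not the article's method: fetch_with_fallback is a hypothetical helper, and fake_fetch stands in for a real request so that the example runs without network access.

```python
# Python 3 sketch: try several proxies and return the first successful result.
import random
from urllib.error import URLError

def fetch_with_fallback(proxies, fetch, attempts=3):
    """Call fetch(proxy) on up to `attempts` distinct random proxies."""
    candidates = random.sample(proxies, min(attempts, len(proxies)))
    last_error = URLError('no proxy available')
    for proxy in candidates:
        try:
            return fetch(proxy)
        except URLError as err:
            last_error = err  # remember the failure and try the next proxy
    raise last_error

def fake_fetch(proxy):
    # Stand-in for a real request; pretends the first proxy is down.
    if proxy == '123.185.109.86:8888':
        raise URLError('Connection refused')
    return 'ok via ' + proxy

ip_list = ['123.185.109.86:8888', '124.235.47.141:8888']
result = fetch_with_fallback(ip_list, fake_fetch)
print(result)
```

In a real script, fetch could build an opener with ProxyHandler for the given proxy and return opener.open(req).read().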

     

     

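    Example (download the images on a given web page; they are saved to the current directory by default, using urllib.urlretrieve() to store each file locally):

    Limitation of this script: it only downloads images from the specified page and cannot follow the site to download its newest images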

    [root@localhost ~]# vim download_pic.py

    ------------------script start-------------------
    #!/usr/bin/python2.7
    #filename:download_pic.py
    import urllib
    import urllib2
    import re

    url='http://jandan.net/ooxx'

    def getHtml(url):
        # Send a browser-like User-Agent and return the page source.
        req=urllib2.Request(url)
        req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')
        response=urllib2.urlopen(req)
        html=response.read()
        return html

    def getImg(html):
        # Collect every src value ending in .jpg and save them as 1.jpg, 2.jpg, ...
        imglist=re.findall(r'src="(.*?\.jpg)"',html)
        #print imglist
        x=1
        for imgurl in imglist:
            urllib.urlretrieve(imgurl,'%s.jpg' % x)
            x+=1

    html=getHtml(url)
    #print html
    getImg(html)
    --------------------script end------------------

    [root@localhost ~]# python2.7 download_pic.py

    [root@localhost ~]# ll

    total 31664

    -rw-r--r--. 1 root   root     174584 Jun 21 23:18 10.jpg

    -rw-r--r--. 1 root   root     153359 Jun 21 23:18 11.jpg

    -rw-r--r--. 1 root   root     125877 Jun 21 23:18 12.jpg

    -rw-r--r--. 1 root   root     152194 Jun 21 23:18 13.jpg

    -rw-r--r--. 1 root   root      91847 Jun 21 23:18 14.jpg

    -rw-r--r--. 1 root   root      78389 Jun 21 23:18 15.jpg

    -rw-r--r--. 1 root   root      68577 Jun 21 23:18 16.jpg

    -rw-r--r--. 1 root   root      99573 Jun 21 23:18 17.jpg

    -rw-r--r--. 1 root   root      32444 Jun 21 23:18 18.jpg

    -rw-r--r--. 1 root   root      79730 Jun 21 23:18 19.jpg

    -rw-r--r--. 1 root   root     144334 Jun 21 23:18 1.jpg

    ……
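
The script hands each matched src value straight to urlretrieve(), which only works when the value is an absolute URL. The sketch below, in Python 3, resolves relative and scheme-relative values against the page URL first. It also uses a slightly tightened pattern, [^"]+? instead of .*?, so that a match cannot run across several tags. The sample HTML is invented for the illustration, and find_images is a hypothetical helper name.

```python
# Python 3 sketch: extract .jpg links and resolve them against the page URL.
import re
from urllib.parse import urljoin

# [^"]+? keeps each match inside a single src="..." attribute.
IMG_RE = re.compile(r'src="([^"]+?\.jpg)"')

def find_images(html, base_url):
    """Return absolute URLs for every src attribute ending in .jpg."""
    return [urljoin(base_url, src) for src in IMG_RE.findall(html)]

# Hypothetical page fragment with one non-jpg image and two jpg images.
sample = ('<img src="c.png">'
          '<img src="//img.example.com/a.jpg">'
          '<img src="/pics/b.jpg">')

print(find_images(sample, 'http://jandan.net/ooxx'))
```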

     

     

