Python Web Scraping, urllib Module: p

Published: 2019-06-30 16:53:12

    This walkthrough uses a request to 'http://httpbin.org/post' as its example.

    Outline:

      Import urllib.request

      Import urllib.parse

      URL-encode the form data, then convert it to UTF-8 bytes: data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding = 'utf-8')

      Open the page: response = urllib.request.urlopen('URL', data = data)

      Read the response body: html = response.read()

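      Print it:

          1. Without decode

          print(html) # html is a bytes object, so it prints as one long b'...' literal with line breaks shown as \n; hard to read

          2. With decode

          print(html.decode()) # decodes the bytes to str (UTF-8 by default), so line breaks and indentation display normally; much easier to read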

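      Check the result of the request:

          a. response.status # 200 means the request succeeded

          b. response.getcode() # returns the same status code as response.status

          Note: for an error status such as 404 (page not found), urlopen raises urllib.error.HTTPError instead of returning a response.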
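    The outline relies on the fact that supplying data is what turns the request into a POST. A minimal sketch of that behavior, assuming a urllib.request.Request object is built by hand (the outline itself passes the URL straight to urlopen); nothing is sent over the network here:

```python
import urllib.parse
import urllib.request

# Build the same form body as in the outline above.
data = urllib.parse.urlencode({'word': 'hello'}).encode('utf-8')

# Constructing a Request object does not touch the network.
get_req = urllib.request.Request('http://httpbin.org/post')
post_req = urllib.request.Request('http://httpbin.org/post', data=data)

print(get_req.get_method())   # GET  (no data supplied)
print(post_req.get_method())  # POST (data supplied)
print(post_req.data)          # b'word=hello'
```

    Either Request object could then be handed to urllib.request.urlopen in place of the URL string.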



    1. The program without decode:

    import urllib.request
    import urllib.parse

    # URL-encode the form data, then convert the str to UTF-8 bytes
    data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding = 'utf-8')
    # Passing data makes urlopen send a POST request
    response = urllib.request.urlopen('http://httpbin.org/post', data = data)
    html = response.read()

    print(html)  # prints the raw bytes object
    print("------------------------------------------------------------------")
    print("------------------------------------------------------------------")
    print(response.status)
    print(response.getcode())


    Output:

    [Figure: screenshot of the program output]
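    The difference between printing the raw bytes and printing the decoded text can be reproduced without a network connection. The sketch below uses a short, hypothetical byte string as a stand-in for what response.read() returns; it is not the actual httpbin response.

```python
# Hypothetical stand-in for the bytes returned by response.read().
html = b'{\n  "form": {\n    "word": "hello"\n  }\n}\n'

print(type(html))     # <class 'bytes'>
print(html)           # a single line; line breaks appear as the two characters \n

text = html.decode()  # UTF-8 is the default encoding
print(type(text))     # <class 'str'>
print(text)           # the same content, now spread over several lines
```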


    2. The program with decode:

    import urllib.request
    import urllib.parse

    # URL-encode the form data, then convert the str to UTF-8 bytes
    data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding = 'utf-8')
    # Passing data makes urlopen send a POST request
    response = urllib.request.urlopen('http://httpbin.org/post', data = data)
    html = response.read()

    print(html.decode())  # decode the bytes to str before printing
    print("------------------------------------------------------------------")
    print("------------------------------------------------------------------")
    print(response.status)
    print(response.getcode())


    Output:

    {
      "args": {}, 
      "data": "", 
      "files": {}, 
      "form": {
        "word": "hello"
      }, 
      "headers": {
        "Accept-Encoding": "identity", 
        "Connection": "close", 
        "Content-Length": "10", 
        "Content-Type": "application/x-www-form-urlencoded", 
        "Host": "httpbin.org", 
        "User-Agent": "Python-urllib/3.4"
      }, 
      "json": null, 
      "origin": "106.14.17.222", 
      "url": "http://httpbin.org/post"
    }
    
    ------------------------------------------------------------------
    ------------------------------------------------------------------
    200
    200
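    The decoded body is JSON, so it can be turned into a Python dict rather than only printed. A minimal sketch, assuming the standard json module and an abbreviated copy of the output above pasted in as a string (in the real program the string would come from html.decode()):

```python
import json

# Abbreviated copy of the decoded response body shown above.
body = '''
{
  "args": {},
  "form": {
    "word": "hello"
  },
  "json": null,
  "url": "http://httpbin.org/post"
}
'''

result = json.loads(body)      # JSON object -> dict, JSON null -> None
print(result['form'])          # {'word': 'hello'}
print(result['form']['word'])  # hello
print(result['json'])          # None
```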


    Why convert with bytes?

    Because the request fails without it:

    data = urllib.parse.urlencode({'word': 'hello'})  # not converted with bytes, so data is a str
    response = urllib.request.urlopen('http://httpbin.org/post', data = data )
    html = response.read()

    Error message:

    Traceback (most recent call last):
      File "/usercode/file.py", line 15, in <module>
        response = urllib.request.urlopen('http://httpbin.org/post', data = data )
      File "/usr/lib/python3.4/urllib/request.py", line 153, in urlopen
        return opener.open(url, data, timeout)
      File "/usr/lib/python3.4/urllib/request.py", line 453, in open
        req = meth(req)
      File "/usr/lib/python3.4/urllib/request.py", line 1104, in do_request_
        raise TypeError(msg)
    TypeError: POST data should be bytes or an iterable of bytes. It cannot be of type str.

    As this shows, the body of a POST request must be passed to urlopen as bytes, not as str.
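    The type involved can be checked directly: urlencode returns a str, and either bytes(..., encoding = 'utf-8') or the str.encode method converts it. The sketch below is only an illustration and sends nothing over the network.

```python
import urllib.parse

encoded = urllib.parse.urlencode({'word': 'hello'})
print(type(encoded), encoded)  # <class 'str'> word=hello

as_bytes = bytes(encoded, encoding='utf-8')  # the form used in this article
same_bytes = encoded.encode('utf-8')         # an equivalent spelling

print(as_bytes)                # b'word=hello'
print(as_bytes == same_bytes)  # True
print(len(as_bytes))           # 10
```

    The length of 10 bytes matches the "Content-Length": "10" header echoed back in the output shown earlier.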

    class bytes([source[, encoding[, errors]]])

        Return a new "bytes" object, which is an immutable sequence of integers in the range 0 <= x < 256. bytes is an immutable version of bytearray: it has the same non-mutating methods and the same indexing and slicing behavior.

        Accordingly, constructor arguments are interpreted as for bytearray().
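    A few of the properties in that description can be tried out directly. The sketch below is an illustration of the constructor and of immutability, not an exhaustive tour of the bytes API.

```python
b1 = bytes('hello', encoding='utf-8')  # from a str; the encoding is required
b2 = bytes([104, 101, 108, 108, 111])  # from an iterable of ints in range(256)
b3 = bytes(3)                          # an int gives that many zero bytes

print(b1 == b2)  # True
print(b3)        # b'\x00\x00\x00'
print(b1[0])     # 104, indexing gives an int
print(b1[1:3])   # b'el', slicing gives bytes

try:
    b1[0] = 72   # bytes is immutable, so item assignment fails
except TypeError as exc:
    print('TypeError:', exc)

ba = bytearray(b1)  # bytearray is the mutable counterpart
ba[0] = 72
print(ba)           # bytearray(b'Hello')

try:
    bytes([256])    # 256 is outside the allowed range
except ValueError as exc:
    print('ValueError:', exc)
```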

