python爬虫系列三:html解析大法

发布时间:2019-09-23 17:04:21编辑:auto阅读(1708)

    Beautiful Soup 是一个可以从HTML或XML文件中提取数据的Python库。
    它能够通过你喜欢的转换器实现惯用的文档导航,查找,修改文档的方式。
    在爬虫开发中主要用的是Beautiful Soup的查找提取功能。
    Beautiful Soup是第三方模块,需要额外下载
    下载命令:pip install bs4
    安装解析器:pip install lxml
    这里写图片描述
    这里写图片描述
    这里写图片描述

    from bs4 import BeautifulSoup
    
    html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little
    sisters; and their names were
    <a href="http://example.com/elsie" class="sister"
    id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister"
    id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister"
    id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    #创建一个bs对象
    #默认不指定的情况,bs会选择python内部的解析器
    #因此指定lxml作为解析器
    soup=BeautifulSoup(html_doc,"lxml")
    ----------
    
     1. 解析网页后的类型及格式化
    
    print(type(soup))  #<class 'bs4.BeautifulSoup'>
    print(soup.prettify())  #格式化 答案如下:
    
    <html>
     <head>
      <title>
       The Dormouse's story
      </title>
     </head>
     <body>
      <p class="title">
       <b>
        The Dormouse's story
       </b>
      </p>
      <p class="story">
       Once upon a time there were three little
    sisters; and their names were
       <a class="sister" href="http://example.com/elsie" id="link1">
        Elsie
       </a>
       ,
       <a class="sister" href="http://example.com/lacie" id="link2">
        Lacie
       </a>
       and
       <a class="sister" href="http://example.com/tillie" id="link3">
        Tillie
       </a>
       ;
    and they lived at the bottom of a well.
      </p>
      <p class="story">
       ...
      </p>
     </body>
    </html>
    
     2. 基本操作
    
    #1.标签选择法 缺点:只能猎取到符合条件的第一个
    #(1)获取p标签
    print(soup.p) #<p class="title"><b>The Dormouse's story</b></p>
    print(soup.a) #<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
    #(2)获取a标签的herf属性:因为标签下的内容是以键对形式体现出来
    print(soup.a["href"]) #http://example.com/elsie
    #(3)获取标签的内容
    print(soup.title.string) #The Dormouse's story
    print(soup.a.string)#Elsie
    
    
    ----------
    3.遍历文档树
    
    #1.操作子节点:contents ,children(子节点)
    print(soup.body.contents) #得到文档对象中body中所有的子节点,返回的列表,包括换行符
    print(soup.body.children) #<list_iterator object at 0x0000000002ACC668>
    #这个意思是意味着返回的是列表迭代器,需要用for 将列表中的内容遍历出来
    for i in soup.body.children:
        print(i)
    
    answer:
    <p class="title"><b>The Dormouse's story</b></p>
    
    
    <p class="story">Once upon a time there were three little
    sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    
    <p class="story">...</p>
    
    #2.操作父节点:parent(父节点),parents(祖父节点)
    
    print(soup.a.parent)
    print(soup.a.parents)
    
    <p class="story">Once upon a time there were three little
    sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    
    #兄弟节点
    next_siblings 
    next_sibling
    previous_siblings 
    previous_sinbling
    ----------
    4、搜索文档树
    
    #搜索文档树
    #(1)查询指定的标签
    res1=soup.find_all("a")
    print(res1) #返回所有的a标签,返回列表
    
    answer:
    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    #(2)正则表达式
    #compile():声明正则表达式对象
    #查找包含d字符的便签
    import re
    res2=soup.find_all(re.compile("d+"))
    print(res2)
    
    answer:
    [<head><title>The Dormouse's story</title></head>, <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little
    sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    </body>]
    #(3)列表
    # 查询所有title标签或者a标签
    res3=soup.find_all(["title","a"])
    print(res3)
    
    answer:
    [<title>The Dormouse's story</title>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    #(4)关键字参数
    #查询属性id="link2"的标签
    res4=soup.find_all(id="link2")
    print(res4)
    
    answer:
    [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
    
    #(5)内容匹配
    res5=soup.find_all(text="Elsie")
    print(res5)
    
    answer:
    ['Elsie']
    
    find_all()
    find():返回的匹配结果的第一个元素
    find_parents()               find_parent()
    find_next_siblings()         find_next_sibling()
    find_previous_siblings()     find_previous_sibling()
    find_all_next()              find_next()
    find_all_previous()          find_previous()
    
    
    ----------
    5.CSS选择器
    
    使用CSS选择器,只需要调用SeleCt()方法,传入相应的CSS选择器,返回类型是 liSt
    
    CSS语法:
    
     - 标签名不加任何修饰
     - 类名前加点
     - id名前加 #
     - 标签a,标签b : 表示找到所有的节点a和节点c
     - 标签a 标签b : 表示找到节点a下的所有节点b
     - get_text() :获取节点内的文本内容
    
    #(1)通过id查询标签对象
    res2=soup.select("#link3")
    print(res2)
    #[<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    
    #(2)根据class属性查询标签对象
    res3=soup.select(".sister")
    print(res3)
    
    #(3)属性选择
    res4=soup.select("a[herf='http://example.com/elsie']")
    print(res4)
    
    #(4)包含选择器
    res5=soup.select("p a#link3")#p标签下的a标签下的link3
    print(res5)
    answer:
    [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    
    #(5)得到标签内容
    res6=soup.select("p a.sister")
    print(res6)
    #获取标下下的内容
    print(res6[0].get_text())
    print(res6[0].string)
    
    answer:
    [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
    Elsie
    Elsie
    
    
    
    

关键字

上一篇: python 插入mysql数据

下一篇: python创造矩阵