[Learning Python Web Scraping] Python

Published: 2019-10-15 09:05:16 | Editor: auto

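    1. Install with pip: pip install scrapy
    2. Possible problems:
      Problem/solution: error: Microsoft Visual C++ 14.0 is required. (pip is trying to compile a C extension; installing the Microsoft C++ Build Tools normally resolves this.)
    3. Example demo tutorial; Chinese tutorial documentation
      Step 1: create the project directory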

      scrapy startproject tutorial

      Step 2: enter the tutorial directory and create the spider

      scrapy genspider baidu www.baidu.com

      Step 3: create the storage container by copying the project's items.py and renaming the copy to BaiduItems.py

      # -*- coding: utf-8 -*-
      
      # Define here the models for your scraped items
      #
      # See documentation in:
      # https://doc.scrapy.org/en/latest/topics/items.html
      
      import scrapy
      
      class BaiduItems(scrapy.Item):
          # define the fields for your item here like:
          # name = scrapy.Field()
          title = scrapy.Field()
          link = scrapy.Field()
          desc = scrapy.Field()

      Step 4: edit spiders/baidu.py to extract the data with XPath

      # -*- coding: utf-8 -*-
      import scrapy
      # Import the item container
      from tutorial.BaiduItems import BaiduItems
      
      class BaiduSpider(scrapy.Spider):
          name = 'baidu'
          allowed_domains = ['www.readingbar.net']
          start_urls = ['http://www.readingbar.net/']
          def parse(self, response):
              for sel in response.xpath('//ul/li'):
                  item = BaiduItems()
                  item['title'] = sel.xpath('a/text()').extract()
                  item['link'] = sel.xpath('a/@href').extract()
                  item['desc'] = sel.xpath('text()').extract()
                  yield item
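
      The three XPath expressions in the spider can be tried out without running a crawl. The sketch below is only an illustration: it uses the standard library's ElementTree (which supports a small XPath subset) instead of Scrapy's selectors, and the sample markup is hypothetical, not taken from the real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from the real site)
SAMPLE = """
<html><body><ul>
  <li><a href="/book/1">First book</a> a short description</li>
  <li><a href="/book/2">Second book</a> another description</li>
</ul></body></html>
"""

def extract_rows(markup):
    """Return one dict per <li>, mirroring the title/link/desc fields."""
    root = ET.fromstring(markup)
    rows = []
    for li in root.findall('.//ul/li'):       # like response.xpath('//ul/li')
        a = li.find('a')                      # the link inside this <li>
        if a is None:
            continue                          # skip list items without a link
        rows.append({
            'title': a.text,                  # like sel.xpath('a/text()')
            'link': a.get('href'),            # like sel.xpath('a/@href')
            # like sel.xpath('text()') here: the only direct text of the
            # <li> is the text that follows the link
            'desc': (a.tail or '').strip(),
        })
    return rows

rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

      Unlike this sketch, Scrapy's extract() returns a list of strings for every field, so each item field in the spider holds a list rather than a single value.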

      Step 5: fix the blank result when scraping the Baidu homepage by editing settings.py

      # Set the user agent
      USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36'
      
      # Do not obey robots.txt (resolves the robots.txt-related debug messages)
      ROBOTSTXT_OBEY = False
      # Fix garbled characters in the data Scrapy saves
      FEED_EXPORT_ENCODING = 'utf-8'
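
      The "garbled" output that FEED_EXPORT_ENCODING addresses is usually a run of \uXXXX escapes in the JSON file. The sketch below reproduces that effect with the standard json module; it illustrates the escaping only and is not Scrapy's exporter code, and the sample title is an assumed value.

```python
import json

# Assumed sample item; extract() stores each field as a list of strings
item = {'title': ['百度一下，你就知道']}

escaped = json.dumps(item)                        # default: non-ASCII is escaped
readable = json.dumps(item, ensure_ascii=False)   # characters are written as-is

print(escaped)
print(readable)

# Both strings decode to the same data; only the on-disk form differs
same_data = json.loads(escaped) == json.loads(readable)
```

      Either form loads back to identical data, so the setting matters mainly when the exported file is meant to be read by people.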

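      Final step: run the crawl command and save the data to a specified file
      Running it may fail with: No module named 'win32api'. This can be fixed by downloading and installing the matching version (win32api is provided by the pywin32 package).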

      scrapy crawl baidu -o baidu.json
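
      Because every field was filled with extract(), each value in baidu.json is a list of strings. A minimal sketch of flattening such records after the crawl follows; the sample feed and the helper names first_or_none and flatten are hypothetical.

```python
import json

# Hypothetical sample shaped like the output of the spider above
SAMPLE_FEED = '''[
  {"title": ["Home"], "link": ["/"], "desc": [" main page "]},
  {"title": [], "link": ["/about"], "desc": ["  ", "About us"]}
]'''

def first_or_none(values):
    """Return the first non-blank string in a list, stripped, or None."""
    for value in values:
        if value.strip():
            return value.strip()
    return None

def flatten(records):
    """Turn every list-valued field into a single string (or None)."""
    return [{key: first_or_none(vals) for key, vals in rec.items()}
            for rec in records]

records = flatten(json.loads(SAMPLE_FEED))
for rec in records:
    print(rec)
```

      For the real file, the same helper could be applied to json.load(open('baidu.json', encoding='utf-8')).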
    4. Deep-crawl the Baidu homepage and the pages linked from its navigation menu

      # -*- coding: utf-8 -*-
      import scrapy
      
      from scrapyProject.BaiduItems import BaiduItems
      
      class BaiduSpider(scrapy.Spider):
          name = 'baidu'
          # The nav tabs point to other domains, which must be listed here or they will not be crawled
          allowed_domains = [
              'www.baidu.com',
              'v.baidu.com',
              'map.baidu.com',
              'news.baidu.com',
              'tieba.baidu.com',
              'xueshu.baidu.com'
          ]
          start_urls = ['https://www.baidu.com/']
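          def parse(self, response):
              item = BaiduItems()
              item['title'] = response.xpath('//title/text()').extract()
              yield item
              # Note: BaiduItems must also declare nav and href fields,
              # otherwise the assignments below raise KeyError
              for sel in response.xpath('//a[@class="mnav"]'):
                  item = BaiduItems()
                  item['nav'] = sel.xpath('text()').extract()
                  item['href'] = sel.xpath('@href').extract()
                  yield item
                  # Build a new request from each extracted nav URL and run the callback;
                  # urljoin resolves relative and protocol-relative links, and the
                  # guard skips anchors that have no href
                  if item['href']:
                      yield scrapy.Request(response.urljoin(item['href'][0]), callback=self.parse_newpage)
          # Deep extraction: collect the title of each tab page
          def parse_newpage(self, response):
              item = BaiduItems()
              item['title'] = response.xpath('//title/text()').extract()
              yield item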
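
      Two details decide whether a nav link is actually followed: the href has to be turned into an absolute URL, and its host has to fall under allowed_domains. The sketch below shows both with the standard library. The is_allowed helper is a hypothetical approximation of Scrapy's offsite filtering, not its real implementation, and the href values are assumed examples.

```python
from urllib.parse import urljoin, urlparse

ALLOWED_DOMAINS = ['www.baidu.com', 'news.baidu.com', 'tieba.baidu.com']

def resolve(base_url, href):
    """Turn a relative or protocol-relative href into an absolute URL."""
    return urljoin(base_url, href)

def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """True if the URL's host equals an allowed domain or is a subdomain of it."""
    host = urlparse(url).hostname or ''
    return any(host == d or host.endswith('.' + d) for d in allowed_domains)

base = 'https://www.baidu.com/'
# Assumed example hrefs: protocol-relative, absolute, relative, off-site
for href in ['//news.baidu.com', 'http://tieba.baidu.com', '/more/', 'https://example.com/']:
    url = resolve(base, href)
    print(url, is_allowed(url))
```

      A request whose host fails this kind of check is dropped before it is downloaded, which is why every tab domain is listed in the spider.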
    5. Getting past login in order to crawl
      a. Handle image CAPTCHAs with pytesseract
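
      A minimal sketch of what the pytesseract step might look like is given below. It assumes Pillow, pytesseract and the Tesseract binary are installed; the 4-character length, the threshold of 140 and both function names are assumptions for illustration, and real CAPTCHAs often need more preprocessing than this.

```python
import re

def clean_captcha_text(raw, length=4):
    """Keep only letters and digits; return None if the length is unexpected."""
    text = re.sub(r'[^0-9A-Za-z]', '', raw)
    return text if len(text) == length else None

def read_captcha(path, length=4):
    """OCR an image file; needs Pillow, pytesseract and the Tesseract binary."""
    # Imported lazily so clean_captcha_text works without these packages
    from PIL import Image
    import pytesseract

    image = Image.open(path).convert('L')                  # grayscale
    image = image.point(lambda p: 255 if p > 140 else 0)   # crude threshold (assumed value)
    # --psm 7 asks Tesseract to treat the image as a single line of text
    raw = pytesseract.image_to_string(image, config='--psm 7')
    return clean_captcha_text(raw, length)

print(clean_captcha_text(' a B1-2\n'))
```

      The recognized text would then be submitted together with the login form fields; a None result signals that the image should be requested again.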
