Python Scrapy Study Notes (2)

Published: 2019-09-08 09:10:54 · Editor: auto · Views: 2074

Batch scraping with Scrapy, based on http://python.jobbole.com/87155

I. Create the project

# scrapy startproject comics

Directory structure after the project is created:

.
├── comics
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

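II. Create the Spider class

start_requests: called when the spider starts. By default (in the Scrapy 1.x releases current when these notes were written) it calls make_requests_from_url to build a request for each link in start_urls. Override it to customize the initial requests, for example to log in first or to read URLs from a database. Once it is overridden, start_urls is no longer used by default. The method must return an iterable of Request objects.

start_urls is an attribute provided by the framework: a list of the target page URLs. When start_urls is set there is no need to override start_requests; the spider crawls each address in start_urls in turn and automatically calls parse as the callback once each request completes.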

# cd comics/spiders
# vim comic.py

#!/usr/bin/python
#coding:utf-8

import scrapy

class Comics(scrapy.Spider):

    name = "comics"

    def start_requests(self):
        urls = ['http://www.xeall.com/shenshi']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.log(response.text)

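III. Start crawling the comics

The spider's main job is to download the images of every comic in the list. After finishing the current list page it moves on to the next list page and continues, looping until every comic has been crawled. The approach: get the URL of each comic, open it to read the comic's title and the URLs of all its images, download them in batch, and repeat.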

1. Get the URL of every comic on the current page, plus the URL of the next page

URL of a single comic

# Find all comic URLs on the list page
def parse(self, response):

    # One <li> per comic; response.xpath avoids the separate Selector import
    comics = response.xpath("//div[@class='mainleft']/ul/li")

    # Handle every comic on the current page
    for li in comics:
        href = li.xpath("./a/@href").get()
        if href:
            # The hrefs are site-relative, so resolve them against the page URL
            yield scrapy.Request(url=response.urljoin(href), callback=self.comics_parse)

    # Get the URL of the next page from the pager
    pages = response.xpath("//div[@class='mainleft']/div[@class='pages']/ul/li")
    next_url = response.xpath("//div[@class='mainleft']/div[@class='pages']/ul/li[{}]/a/@href".format(len(pages)-3)).extract()

    # print('pager items: {}, next page: {}'.format(len(pages), next_url))

    # On the last page there is no next link, so the crawl stops here
    if next_url:
        next_page = 'http://www.xeall.com/shenshi/' + next_url[0]
        yield scrapy.Request(next_page, callback=self.parse)
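
The parse method turns relative hrefs into absolute URLs with two different prefixes. The sketch below uses the standard library's urljoin to show why the prefix matters; the hrefs "/shenshi/123.html" and "p2.html" are hypothetical examples, not values taken from the site.

```python
from urllib.parse import urljoin

# Hypothetical hrefs, for illustration only; the real site's links may differ
LIST_PAGE = "http://www.xeall.com/shenshi"


def absolute(base, href):
    """Resolve href against base the way a browser would."""
    return urljoin(base, href)


# A root-relative href ignores the path of the base URL
root_relative = absolute(LIST_PAGE, "/shenshi/123.html")

# A plain relative href replaces the last path segment of the base,
# so a trailing slash on the base changes the result
with_slash = absolute(LIST_PAGE + "/", "p2.html")
without_slash = absolute(LIST_PAGE, "p2.html")

print(root_relative)
print(with_slash)
print(without_slash)
```

If the pager links are plain relative names, resolving them against a base without the trailing slash would drop the "shenshi" segment, which may be why the next-page URL in parse is built from the fixed prefix 'http://www.xeall.com/shenshi/'.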

2. Walk through every page of a comic

Source of the different pages

Title and URL of the current comic

# Extract the data of each comic
def comics_parse(self, response):

    # Current page number
    page_num = response.xpath("//div[@class='dede_pages']/ul/li[@class='thisclass']/a/text()").extract()

    # URL of the first image
    current_url = response.xpath("//div[@class='mhcon_left']/ul/li/p/img/@src").extract()

    # Comic title
    comic_name = response.xpath("//div[@class='mhcon_left']/ul/li/p/img/@alt").extract()

    # self.log('img url: ' + current_url[0])

    # Save the image to disk
    self.save_img(page_num[0], comic_name[0], current_url[0])

    # The last <li> of the pager is the "next page" link; on the last page of a comic its href is '#'
    pages = response.xpath("//div[@class='dede_pages']/ul/li")
    next_page = response.xpath("//div[@class='dede_pages']/ul/li[{}]/a/@href".format(len(pages))).extract()

    # extract() returns a list, so compare its first element rather than the list itself
    if not next_page or next_page[0] == '#':
        print('parse comics: ' + comic_name[0] + ' finished.')
    else:
        next_page = 'http://www.xeall.com/shenshi/' + next_page[0]
        yield scrapy.Request(next_page, callback=self.comics_parse)
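
The last-page check is easy to get wrong because extract() returns a list, not a string. A minimal sketch of that decision as a standalone function, which can be tested without running a crawl; next_page_url and the sample hrefs are hypothetical names introduced here for illustration.

```python
def next_page_url(hrefs, prefix):
    """Return the absolute URL of the next page, or None on the last page.

    hrefs is the list returned by .extract(); it may be empty.
    """
    if not hrefs or hrefs[0] == "#":
        return None
    return prefix + hrefs[0]


PREFIX = "http://www.xeall.com/shenshi/"

# A list never equals a string, so comparing the whole list to '#' is always False
list_equals_string = (["#"] == "#")

middle_page = next_page_url(["123_2.html"], PREFIX)  # hypothetical href
last_page = next_page_url(["#"], PREFIX)
no_pager = next_page_url([], PREFIX)
```

Inside comics_parse, the result would be checked for None before yielding the next Request.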

3. Persist the data

# Takes the page number, the comic title and the image URL as arguments
# (needs `import os` and `import requests` at the top of comic.py)
def save_img(self, img_num, title, img_url):
    # Save the image to disk
    # self.log('saving pic: ' + img_url)

    # Folder that holds all comics
    document = os.path.join(os.getcwd(), 'cartoon')

    # Each comic gets its own folder, named after its title
    comics_path = os.path.join(document, title)

    if not os.path.exists(comics_path):
        print('create document: ' + title)
        os.makedirs(comics_path)

    # Each image is named after its page number
    pic_name = os.path.join(comics_path, img_num + '.jpg')

    # Skip images that have already been downloaded
    if os.path.exists(pic_name):
        print('pic exists: ' + pic_name)
        return

    try:
        # stream=True lets iter_content read the body in chunks
        response = requests.get(img_url, timeout=30, stream=True)
        # Fail on 4xx/5xx instead of saving an error page as a .jpg
        response.raise_for_status()

        with open(pic_name, 'wb') as f:
            for chunk in response.iter_content(chunk_size=1024):
                if chunk:
                    f.write(chunk)

        print('save image finished: ' + pic_name)

    except Exception as e:
        print('save image error.')
        print(e)
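
One weakness of the "skip if the file exists" check is that a download interrupted halfway leaves a partial file behind, which is then skipped on the next run. A minimal sketch of one way around this, assuming it is acceptable to write to a temporary ".part" file first; write_chunks and the ".part" suffix are hypothetical choices, not part of the original project.

```python
import os
import tempfile


def write_chunks(path, chunks):
    """Write an iterable of byte chunks to path; return False if path already exists."""
    if os.path.exists(path):
        return False
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    tmp_path = path + ".part"
    try:
        with open(tmp_path, "wb") as f:
            for chunk in chunks:
                if chunk:
                    f.write(chunk)
        # Rename only after every chunk was written successfully
        os.replace(tmp_path, path)
    finally:
        # Clean up the partial file if writing failed midway
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return True


# Demo with in-memory chunks instead of a real HTTP response
demo_dir = tempfile.mkdtemp()
target = os.path.join(demo_dir, "cartoon", "some-title", "1.jpg")
first = write_chunks(target, [b"abc", b"", b"def"])
second = write_chunks(target, [b"ignored"])
```

In save_img this could be called as write_chunks(pic_name, response.iter_content(chunk_size=1024)), keeping the network code and the file handling separate.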

Full source code: https://github.com/yaoliang83/Scrapy-for-Comics
