Python Crawler Beginner Tutorial 40-100

Published: 2019-02-28 18:39:04 | Editor: auto | Views: 2946

A Few Words Before Crawling

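Post number 40 sounds the horn: this time we crawl blog posts from cnblogs. In the end this post collected 370,000+ articles dated from January 1, 2010 to January 7, 2019, which leaves plenty to analyze later.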

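Regular readers know that each column on cnblogs only goes 200 pages deep. Anything beyond that is not shown, so at most 4,000 posts are visible per column. How to get as much blog data as possible is the small core question this post looks at; how much you take away from it is up to you.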


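Crawling column by column will not work, because the extra posts are simply not displayed. Change the approach and look at the search page instead: it lets you filter by date, and that is the key.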

Look closely at the URL:

https://zzk.cnblogs.com/s/blogpost?Keywords=python&datetimerange=Customer&from=2019-01-01&to=2019-01-01

With this link in hand, a fairly simple idea is enough to collect every python-related article: iterate over the dates, one day at a time.
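As a minimal sketch of that idea, the snippet below generates one search URL per day using only the query parameters visible in the link above. The function name `daily_search_urls` and the constant `BASE_URL` are made up for illustration, and the sketch leaves out the `pageindex` parameter that the spider adds later.

```python
import datetime
from urllib.parse import urlencode

# Search endpoint taken from the URL shown above
BASE_URL = "https://zzk.cnblogs.com/s/blogpost"


def daily_search_urls(keyword, start, end):
    """Yield one search URL per day from start to end, inclusive.

    Hypothetical helper for illustration; not part of the spider itself.
    """
    day = start
    while day <= end:
        stamp = day.isoformat()  # formats the date as YYYY-MM-DD
        query = urlencode({
            "Keywords": keyword,
            "datetimerange": "Customer",
            "from": stamp,
            "to": stamp,  # same day on both ends: a one-day window
        })
        yield f"{BASE_URL}?{query}"
        day += datetime.timedelta(days=1)


urls = list(daily_search_urls("python",
                              datetime.date(2019, 1, 1),
                              datetime.date(2019, 1, 3)))
for u in urls:
    print(u)
```

Keeping each window to a single day keeps the result count per query small, which is what lets the crawl get past the 200-page limit.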
Now for the core code. I have pulled out the few points that matter most:

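The search page is protected by a verification check, so you need the cookie from your own browser session. It is easy to obtain.
This is a good moment to brush up on dict comprehension syntax.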
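For readers who want that refresher, here is a small standalone sketch of turning a browser `Cookie` header string into a dict with a dict comprehension. The helper name `parse_cookie_header` and the sample cookie names and values are invented for illustration; real cookies have to come from your own session.

```python
def parse_cookie_header(cookie_str):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}.

    Hypothetical helper. str.partition splits on the first '=' only,
    so a value that itself contains '=' stays intact.
    """
    return {
        name: value
        for name, _, value in (
            pair.strip().partition("=") for pair in cookie_str.split(";")
        )
        if name  # skip empty fragments, e.g. after a trailing ';'
    }


# Made-up sample values, not real cnblogs cookies
sample = "sessionid=abc123; token=x=y=z"
cookies = parse_cookie_header(sample)
print(cookies)
```

The resulting dict can be passed as the `cookies` argument of a Scrapy `Request`.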

import scrapy
from scrapy import Request,Selector
import time
import datetime

class BlogsSpider(scrapy.Spider):
    name = 'Blogs'
    allowed_domains = ['zzk.cnblogs.com']
    start_urls = ['http://zzk.cnblogs.com/']
    from_time = "2010-01-01"
    end_time = "2010-01-01"
    keywords = "python"
    page =1
    url = "https://zzk.cnblogs.com/s/blogpost?Keywords={keywords}&datetimerange=Customer&from={from_time}&to={end_time}&pageindex={page}"
    custom_settings = {
        "DEFAULT_REQUEST_HEADERS":{
            "HOST":"zzk.cnblogs.com",
            "TE":"Trailers",
            "referer": "https://zzk.cnblogs.com/s/blogpost?w=python",
            "upgrade-insecure-requests": "1",
            "user-agent": "Mozilla/5.0 Gecko/20100101 Firefox/64.0"

        }
    }


    def start_requests(self):
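        # Paste the Cookie header from your own browser session here
        cookie_str = "obtain this yourself"
        # split("=", 1) splits on the first "=" only, so values containing "=" stay intact
        self.cookies = {item.split("=", 1)[0]: item.split("=", 1)[1] for item in cookie_str.split("; ")}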
        yield Request(self.url.format(keywords=self.keywords,from_time=self.from_time,end_time=self.end_time,page=self.page),cookies=self.cookies,callback=self.parse)

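Once a page has been fetched it has to be parsed to work out how many result pages there are, and the date is advanced by one day. In the code below, pay particular attention to how the date is incremented.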

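    def parse(self, response):
        print("Crawling", response.url)
        # Number of search results for this day; a missing counter is treated as zero
        count = int(response.css('#CountOfResults::text').extract_first() or 0)
        if count > 0:
            # Ceiling division (assuming 10 results per page) avoids requesting an empty extra page
            for page in range(1, (count + 9) // 10 + 1):
                # Fetch each result page for the detailed data
                yield Request(self.url.format(keywords=self.keywords,from_time=self.from_time,end_time=self.end_time,page=page),cookies=self.cookies,callback=self.parse_detail,dont_filter=True)

        # Crude throttle; note that time.sleep blocks Scrapy's event loop
        # (the DOWNLOAD_DELAY setting is the usual alternative)
        time.sleep(2)
        # Move on to the next date
        d = datetime.datetime.strptime(self.from_time, '%Y-%m-%d')
        delta = datetime.timedelta(days=1)
        d = d + delta
        if d.date() > datetime.date.today():
            return  # stop once the crawl has caught up with today
        self.from_time = d.strftime('%Y-%m-%d')
        self.end_time = self.from_time
        yield Request(
            self.url.format(keywords=self.keywords, from_time=self.from_time, end_time=self.end_time, page=self.page),
            cookies=self.cookies, callback=self.parse, dont_filter=True)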
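The two bits of arithmetic in `parse`, the one-day step and the number of result pages, can be tried in isolation. The sketch below is only an illustration: the helper names `next_day` and `page_count` are hypothetical, and the page size of 10 is an assumption taken from the division by 10 in the spider.

```python
import datetime

# Assumption: the search page shows 10 results per page
RESULTS_PER_PAGE = 10


def next_day(date_str):
    """Return the day after date_str, both in YYYY-MM-DD format."""
    d = datetime.datetime.strptime(date_str, "%Y-%m-%d")
    return (d + datetime.timedelta(days=1)).strftime("%Y-%m-%d")


def page_count(result_count, per_page=RESULTS_PER_PAGE):
    """Number of pages needed to show result_count items (ceiling division)."""
    return -(-result_count // per_page)


print(next_day("2018-12-31"))  # month and year rollover is handled by timedelta
print(page_count(11))
```

Going through `timedelta` rather than adding 1 to the day number means month ends and leap years need no special handling.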

Parsing the Pages and Storing the Results

Nothing in this part is complicated; just write it step by step. Run the code, let it go, wait a while, and then check MongoDB:

db.getCollection('dict').count({})

372,352 records


    def parse_detail(self,response):
        items = response.xpath('//div[@class="searchItem"]')
        for item in items:
            title = item.xpath('h3[@class="searchItemTitle"]/a//text()').extract()
            title = "".join(title)

            author = item.xpath(".//span[@class='searchItemInfo-userName']/a/text()").extract_first()
            public_date = item.xpath(".//span[@class='searchItemInfo-publishDate']/text()").extract_first()
            pv = item.xpath(".//span[@class='searchItemInfo-views']/text()").extract_first()
            if pv:
                # Drop the 3-character label prefix and the trailing character, keeping only the number
                pv = pv[3:-1]
            url = item.xpath(".//span[@class='searchURL']/text()").extract_first()
            #print(title,author,public_date,pv)
            yield {
                "title":title,
                "author":author,
                "public_date":public_date,
                "pv":pv,
                "url":url
            }

Storing the Data
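The post does not show the code that writes items to MongoDB, so the following is only a sketch of what a Scrapy item pipeline for this could look like, not the author's actual implementation. The collection name `dict` comes from the query shown earlier; the connection URI, the database name `cnblogs`, and the class name `MongoPipeline` are assumptions.

```python
class MongoPipeline:
    """Write every scraped item into a MongoDB collection (illustrative sketch)."""

    def __init__(self, uri="mongodb://localhost:27017",
                 db_name="cnblogs", collection_name="dict"):
        # uri and db_name are assumed values; adjust them to your setup
        self.uri = uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Imported here so the class can be loaded without pymongo installed
        import pymongo
        self.client = pymongo.MongoClient(self.uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Each dict yielded by parse_detail becomes one document
        self.collection.insert_one(dict(item))
        return item
```

To use something like this, the class would be registered under `ITEM_PIPELINES` in the project settings, for example `{"myproject.pipelines.MongoPipeline": 300}`, where the module path is hypothetical.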

After that flurry of work, the data is in hand. Some simple data analysis can come next; see you in that post.
