Asynchronous Web Scraping with aiohttp

Published: 2025-03-10 09:30:26 | Editor: 123

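asyncio runs asynchronous tasks on a single-threaded event loop, which makes it particularly efficient at handling large numbers of concurrent network connections. Asynchronous code is written with Python's async/await syntax, which asyncio builds on; this style is more direct and easier to maintain than the older callback-based approach.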
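
As a minimal sketch of what the single-threaded event loop provides, the stdlib-only example below runs three simulated I/O waits concurrently. The names `fake_io` and `run_demo` and the delays are illustrative assumptions, not part of the script in this post; `asyncio.sleep` stands in for a pending network call.

```python
import asyncio
import time


async def fake_io(name, delay):
    # asyncio.sleep hands control back to the event loop,
    # much like waiting on a network response would.
    await asyncio.sleep(delay)
    return name


async def run_demo():
    start = time.perf_counter()
    # gather runs the three coroutines concurrently on one loop and
    # returns their results in the order they were passed in.
    results = await asyncio.gather(
        fake_io("a", 0.2),
        fake_io("b", 0.2),
        fake_io("c", 0.2),
    )
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = asyncio.run(run_demo())
    print(results, f"{elapsed:.2f}s")
```

Because the three waits overlap, the total time should be close to the longest single delay rather than the sum of all three.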

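asyncio provides the core framework and tools for asynchronous programming in Python, but as requirements grew, developers needed more functionality to build asynchronous services. This led to asynchronous networking libraries such as aiohttp, which is built on top of asyncio and provides an asynchronous HTTP client and server.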

Target site: https://spa5.scrape.center/

Example code:

import asyncio
import aiohttp
import json

# Target site: https://spa5.scrape.center/
index_url = 'https://spa5.scrape.center/api/book/?limit=18&offset={offset}'  # list page API
detail_url = 'https://spa5.scrape.center/api/book/{id}'  # book detail API
page_size = 18  # items per page
concurrency = 10  # maximum number of concurrent requests
page_number = 20  # number of pages to scrape (the site actually has 503)

semaphore = asyncio.Semaphore(concurrency)
session = None

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36"}

async def scrape_api(url):
    async with semaphore:
        try:
            async with session.get(url=url, headers=headers) as response:
                await asyncio.sleep(0.5)
                return await response.json()
        except aiohttp.ClientError:
            # Report the failed URL and return None so callers can skip it.
            print(f'error: {url}')
            return None

async def scrape_index(page):
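    """
    Fetch one list page from the index API.
    :param page: 1-based page number
    :return: parsed JSON of the list page, or None if the request failed
    """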
    url = index_url.format(offset=str(page_size * (page - 1)))
    return await scrape_api(url)

async def scrape_detail(id):
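    """
    Fetch the detail record for one book.
    :param id: book id taken from the list page results
    :return: parsed JSON of the detail record, or None if the request failed
    """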
    url = detail_url.format(id=id)
    data = await scrape_api(url)
    return data

async def main():
    global session
    session = aiohttp.ClientSession()
    scrape_index_tasks = [asyncio.ensure_future(scrape_index(page)) for page in range(1, page_number + 1)]
    results = await asyncio.gather(*scrape_index_tasks)
    ids = []
    for index_data in results:
        if not index_data:
            continue
        for item in index_data.get('results') or []:  # tolerate a missing 'results' key
            ids.append(item.get('id'))
    scrape_detail_tasks = [asyncio.ensure_future(scrape_detail(i)) for i in ids]
    ids_results = await asyncio.gather(*scrape_detail_tasks)
    await session.close()
    with open('data.json', 'a', encoding='utf-8') as file:
        file.write(json.dumps(ids_results, ensure_ascii=False))
        file.write("\n")


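if __name__ == '__main__':
    # asyncio.run creates and closes the event loop itself; calling
    # asyncio.get_event_loop() outside a running loop is deprecated in
    # recent Python releases. On Python < 3.10, create the semaphore
    # inside main() rather than at module level.
    asyncio.run(main())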
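
The script above caps the number of in-flight requests with `asyncio.Semaphore(concurrency)`. The following stdlib-only sketch makes that cap visible by recording the peak number of simulated requests running at once. The names `run_limited` and `fake_request` and the counters are hypothetical helpers for illustration; `asyncio.sleep` stands in for `session.get(...)`.

```python
import asyncio


async def run_limited(total, limit):
    # Creating the semaphore inside the coroutine ties it to the running loop.
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0  # simulated requests currently holding the semaphore
    peak = 0       # highest value in_flight ever reached

    async def fake_request(i):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for the real HTTP call
            in_flight -= 1
            return i

    # All tasks are scheduled at once; the semaphore decides how many proceed.
    results = await asyncio.gather(*(fake_request(i) for i in range(total)))
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(run_limited(total=50, limit=10))
    print(len(results), peak)
```

With 50 tasks and a limit of 10, the recorded peak never exceeds 10, while `gather` still returns every result in submission order.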

Author: Run