Published: 2019-05-14 22:33:07 | Editor: Run | Views: 4973
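Analyzing the JD page:
Click the image upload button and upload a small image. The upload fails, but that does not matter: the browser's Network panel still records
the image?op=upload request, and opening it reveals the image upload endpoint.
Endpoint: https://search.jd.com/image?op=upload
The POST request also needs some header fields, all of which can be copied from the request details in the Network panel.
Use requests-html to send the POST request to the endpoint. The code is as follows: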
from requests_html import HTMLSession

session = HTMLSession()
post_url = 'https://search.jd.com/image?op=upload'
headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "zh-CN,zh;q=0.9",
    # "content-length" is left out: requests computes it from the actual multipart body
    "origin": "https://search.jd.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"
}

# Open the image in binary mode; the file is closed once the upload finishes
with open('333.jpg', 'rb') as f:
    r = session.post(post_url, headers=headers, files={'file': f}, timeout=30)
print(r.text)
Response:
<script>document.domain="jd.com";parent.$o.uploader.callback("jfs/t28462/331/1256269893/74388/84637f95/5cdace08N104202e7.jpg");</script>
The callback function receives an image path as its argument: jfs/t28462/331/1256269893/74388/84637f95/5cdace08N104202e7.jpg
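Pulling the path out of that response can be tried offline against the sample above. This is a minimal sketch: `CALLBACK_RE` and `extract_callback_arg` are hypothetical names introduced here for illustration, and the pattern assumes the response always wraps the path in `callback("...")` as in the sample.

```python
import re

# Assumption: the path is the double-quoted argument of callback(...),
# as in the sample response shown above.
CALLBACK_RE = re.compile(r'callback\("([^"]*)"\)')


def extract_callback_arg(body):
    """Return the string passed to callback(...), or None if it is absent."""
    match = CALLBACK_RE.search(body)
    return match.group(1) if match else None


sample = (
    '<script>document.domain="jd.com";parent.$o.uploader.callback('
    '"jfs/t28462/331/1256269893/74388/84637f95/5cdace08N104202e7.jpg");</script>'
)
print(extract_callback_arg(sample))
```

Returning None when nothing matches lets the caller handle an unexpected response without an IndexError.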
Back to the page analysis:
This time, upload a normal image.
The URL changes to:
https://search.jd.com/image?path=jfs%2Ft10162%2F273%2F2841925085%2F74388%2F84637f95%2F5cdacee0Nf87545c9.jpg&op=search
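The browser URL-encodes the address automatically, so paste it into an online URL decoder (the original post used 站长工具) to decode it.
After decoding, the value following path= turns out to be the path returned by the upload POST request.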
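The same decoding can be done locally with Python's standard library instead of an online tool. A small sketch, using the search URL shown above as input:

```python
from urllib.parse import parse_qs, urlparse

search_url = (
    "https://search.jd.com/image?path=jfs%2Ft10162%2F273%2F2841925085"
    "%2F74388%2F84637f95%2F5cdacee0Nf87545c9.jpg&op=search"
)

# parse_qs splits the query string and percent-decodes every value
params = parse_qs(urlparse(search_url).query)
decoded_path = params["path"][0]
print(decoded_path)
print(params["op"][0])
```

The decoded `path` value has the same `jfs/...` shape as the string passed to the upload callback.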
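To summarize the approach: first send the POST request to get the image path, then build the search URL from that path and request it to get the image recognition results.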
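The URL-building step can be sketched on its own, without any network access. `build_search_url` is a hypothetical helper, not part of requests-html. It assumes the search page takes exactly the `path` and `op` parameters seen in the browser URL, and it uses `urllib.parse.urlencode` so the slashes in the path are percent-encoded the way the browser does it.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://search.jd.com/image"


def build_search_url(image_path):
    """Build the image search URL for a path returned by the upload step."""
    # urlencode percent-encodes "/" as %2F, matching the browser's URL
    query = urlencode({"path": image_path, "op": "search"})
    return "{}?{}".format(SEARCH_ENDPOINT, query)


url = build_search_url(
    "jfs/t10162/273/2841925085/74388/84637f95/5cdacee0Nf87545c9.jpg"
)
print(url)
```

Encoding the path explicitly keeps the request correct even if a returned path ever contains characters that are not safe in a query string.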
The complete code:
#!/usr/bin/env python
# coding: utf-8
import re

from requests_html import HTMLSession

session = HTMLSession()
post_url = 'https://search.jd.com/image?op=upload'
headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
    "accept-encoding": "gzip, deflate, br",
    "accept-language": "zh-CN,zh;q=0.9",
    # "content-length" is left out: requests computes it from the actual multipart body
    "origin": "https://search.jd.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36"
}

# Upload the image; the file is closed once the request completes
with open('333.jpg', 'rb') as f:
    r = session.post(post_url, headers=headers, files={'file': f}, timeout=30)

# Use a regular expression to extract the callback argument (the image path)
matches = re.findall(r'[(]["](.*)["][)]', r.text)
ret = matches[0] if matches else None

if ret is None or ret == 'ERROR.UPLOAD_FORMAT':
    # Image recognition failed
    print(False)
else:
    s = session.get("https://search.jd.com/image?path={}&op=search".format(ret))
    # Keep the first three product links from the result page
    url_list = s.html.xpath("//div[@class='p-img']/a/@href")[0:3]
    for x in url_list:
        print("https:" + x)
        rr = session.get("https:" + x)
        # Product description
        jd_describe = rr.html.xpath("//div[@class='sku-name']/text()")[0].replace(' ', '')
        print(jd_describe)
        # Product category breadcrumb, skipping empty links
        jd_classification_list = []
        for y in rr.html.xpath("//div[@class='crumb fl clearfix']//a"):
            if y.text.strip():
                jd_classification_list.append(y.text.strip())
        five = rr.html.xpath("//div[@class='crumb fl clearfix']//div[@class='item ellipsis']/text()")
        if five:
            jd_classification_list.append(five[0])
        print(jd_classification_list)
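Run result: for each of the first three matching products, the script prints the product URL, then its description, then its list of categories.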