My Self-Taught Web Scraping Journey

Published: 2019-03-13 22:28:32

    E-book reference: https://pan.baidu.com/s/15R08yEjLDj8FxrBwnUaTyA  Note: for online study and discussion only; if this infringes any rights, please contact me.

    Let's learn together ┏(^0^)┛

    A quick self-introduction: I'm a little ant that has made it past the Python basics and is now wandering the road of self-taught web scraping. Many tech blogs have helped me along the long, dry road of programming, and out of gratitude I want to record my own experience too. This is my first blog post, so please bear with any rough edges, thanks~ Oh, and if you're also a self-taught beginner, give hackerrank.com a try; I just need a teammate~ You'll find it a different experience ^_^

    Installing third-party libraries often fails with: error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

    Download: https://download.microsoft.com/download/5/f/7/5f7acaeb-8363-451f-9425-68a90f98b238/visualcppbuildtools_full.exe?fixForIE=.exe  The install takes a while, but it solves the problem once and for all, right? Haha.

    To install selenium, chromedriver.exe can be downloaded from: http://chromedriver.storage.googleapis.com/index.html?path=2.41/

    I'm on Windows; if you drop the file into the python/Scripts directory, you don't need to configure any environment variables. This post only scrapes with Chrome.
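
    To check that the setup works, here is a minimal smoke test (just a sketch, assuming chromedriver.exe is reachable on PATH, e.g. via python/Scripts, and using baidu.com purely as an example page):

    from selenium import webdriver

    driver = webdriver.Chrome()            # finds chromedriver.exe on PATH and launches Chrome
    driver.get('https://www.baidu.com')    # any page will do; this URL is just an example
    print(driver.title)                    # prints the page title if everything is wired up
    driver.quit()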

    After following a tutorial to scrape the Maoyan movie rankings and still feeling like I understood nothing, I took on a daunting task from a friend: scraping Zhilian Zhaopin, a job-listing site. [tears streaming down my face]

    I haven't learned many libraries yet, but at least it's a first step. Some of the code's behavior still puzzles me, so I'd love to hear your thoughts~

    from urllib.parse import urlencode
    import requests
    import json
    import csv
    import time
    
    
    def get_one_page(page):
        # Fetch one page of search results from Zhilian's JSON API.
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
        }
        # urlencode() would mangle a nested dict, so lastUrlQuery is built
        # separately and serialized to a JSON string with json.dumps below
        last_url_query = {"p": page,
                          "pageSize": "60",
                          "jl": "489",
                          "kw": "数据分析师",
                          "kt": "3"}
        params = {
            'pageSize': '60',
            'cityId': '489',
            'workExperience': '-1',
            'education': '-1',
            'companyType': '-1',
            'employmentType': '-1',
            'jobWelfareTag': '-1',
            'kw': '数据分析师',
            'kt': '3',
        }
        if page == 0:
            # the first page carries neither a 'start' offset nor a page number
            last_url_query.pop('p')
        else:
            params['start'] = 60 * (page - 1)
        params['lastUrlQuery'] = json.dumps(last_url_query)
        base_url = 'https://fe-api.zhaopin.com/c/i/sou?'
        url = base_url + urlencode(params)
        # print(url)

        try:
            # the request itself can raise, so it belongs inside the try block
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                return response.json()
        except Exception as e:
            print('Error:', e)
    
    
    def parse_page(result):
        # parameter renamed from 'json' so it doesn't shadow the json module
        if result and result.get('data'):
            items = result.get('data').get('results')
            data_list = []
            for item in items:
                job_name = item.get('jobName')
                salary = item.get('salary')
                company = item.get('company').get('name')
                welfare = item.get('welfare')
                city = item.get('city').get('name')
                work = item.get('workingExp').get('name')
                edu_level = item.get('eduLevel').get('name')
                data_list.append([job_name, company, welfare, salary, city, work, edu_level])
            print(data_list)
            return data_list
        return []  # keep the return type consistent when a request failed
    
    
    def save_data(datas):
        # open in append mode so each page adds rows instead of overwriting
        # the file; newline='' prevents blank lines in the CSV on Windows
        with open('data_zhilian_findjob.csv', 'a', newline='', encoding='utf-8-sig') as csvfile:
            writer = csv.writer(csvfile)
            for row in datas:
                writer.writerow(row)
    
    
    def main():
        # write the header row once, then append one page of results at a time
        with open('data_zhilian_findjob.csv', 'w', newline='', encoding='utf-8-sig') as csvfile:
            csv.writer(csvfile).writerow(
                ['job_name', 'company', 'welfare', 'salary', 'city', 'workingExp', 'edu_level'])
        for page in range(20):
            result = get_one_page(page)
            data = parse_page(result)
            # print(data)
            time.sleep(0.8)  # pause between requests to go easy on the server
            save_data(data)


    if __name__ == '__main__':
        main()
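
    If you want to spot-check a single page before running the whole loop, a quick sketch like this works (it simply relies on the same JSON field names, 'jobName' and 'salary', that the parser above already assumes):

    result = get_one_page(1)
    if result:
        first = result['data']['results'][0]   # first job posting on the page
        print(first['jobName'], first['salary'])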