This article explains how to collect recruitment information data with a hands-on Python crawler. The content is straightforward and easy to follow; work through it step by step to learn how to scrape job-listing data in practice.
The main points of this article:
The basic process of a crawler
Using the requests module
Saving data to CSV
Visual analysis and display
Environment introduction
Python 3.8
PyCharm 2021 (Professional Edition)
Jupyter Notebook
PyCharm is the editor used to write the code (it makes writing code easier and more comfortable).
Python is the interpreter that runs the Python code.
Goal of this session
The crawler code uses the following built-in modules:
import pprint  # formatted output module
import csv  # save data to a csv file
import re  # regular expressions
import time  # time module
Third-party modules:
import requests  # data request module (install with pip install requests)
Press Win + R, type cmd, and in the command prompt run pip install followed by the module name (for example, pip install requests).
If the installation fails (red error text appears), it is usually because the network connection timed out; switch to a domestic mirror source instead.
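For example, using the Tsinghua PyPI mirror (not mentioned in the original article, just one common choice) the install command looks like this:

pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple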
Code implementation steps (the basic steps of a crawler):
Send a request
Get data
Parsing data
Save data
Start code

Import modules:

import requests  # data request module, third-party: pip install requests
import pprint  # formatted output module
import csv  # save data to a csv file
import time  # time module

Send a request:

url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
# the request headers disguise the Python code as a normal browser request, so it is not identified as a crawler and blocked
# user-agent: the basic identity of the browser
headers = {
    'cookie': '...',  # the original article uses a long session cookie copied from the author's logged-in browser; paste your own cookie string here
    'referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36',
}
data = {
    'first': 'false',
    'pn': page,  # page number, supplied by the page-turning loop shown below
    'kd': 'python',
    'sid': 'bf8ed05047294473875b2c8373df0357',
}
# response is just a custom variable name; any name can be used
response = requests.post(url=url, data=data, headers=headers)
Get the response data returned by the server
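A quick way to check what the server returned before parsing it (a small sketch, not part of the original code):

print(response.status_code)  # 200 means the request succeeded
pprint.pprint(response.json())  # pretty-print the returned JSON to see its structure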
Parsing data
JSON data is the easiest kind to parse: values are taken from the dictionary by key, one key-value pair at a time.
result = response.json()['content']['positionResult']['result']
# loop to extract the elements one by one from the result list
for index in result:
    # pprint.pprint(index)
    href = f'https://www.lagou.com/jobs/{index["positionId"]}.html'
    dit = {
        'title': index['positionName'],
        'region': index['city'],
        'company name': index['companyFullName'],
        'salary': index['salary'],
        'education': index['education'],
        'experience': index['workYear'],
        'company tag': ','.join(index['companyLabelList']),  # join() converts the list (e.g. ['free shuttle bus', ...]) to a string
        'details page': href,
    }
    csv_writer.writerow(dit)
    print(dit)

Add page turning:

for page in range(1, 31):
    print(f'--- crawling page {page} ---')
    time.sleep(1)

Save data:

f = open('recruitment data.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'title', 'region', 'company name', 'salary',
    'education', 'experience', 'company tag', 'details page',
])
csv_writer.writeheader()  # write the header row

Run the code to get the data. Note that in the assembled script the file and csv_writer are created first, and the request, parsing and writing code sit inside the for page loop, since data uses the page variable.
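The outline at the top also lists "Visual analysis and display" (done in Jupyter Notebook), but that code is not shown in this article. A minimal sketch of what it could look like, assuming pandas and matplotlib are installed and using the "recruitment data.csv" file written above:

import pandas as pd
import matplotlib.pyplot as plt

# load the csv file saved by the crawler
df = pd.read_csv('recruitment data.csv')

# count postings per city and draw a simple bar chart
df['region'].value_counts().head(10).plot(kind='bar', title='Python jobs by city')
plt.tight_layout()
plt.show()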
Thank you for reading. The above is the content of "how to collect recruitment information data with a Python crawler in practice". After studying this article, I believe you have a deeper understanding of how to collect recruitment data with a Python crawler; the specific usage still needs to be verified in practice.