This article explains how to collect recruitment information data with a hands-on Python crawler. The content is straightforward and easy to follow; work through it step by step to learn how to scrape job-listing data in practice.
The main points of this article:
The basic process of a crawler
Using the requests module
Saving data to CSV
Visual analysis and display
Environment introduction
Python 3.8
PyCharm 2021 (Professional Edition)
Jupyter Notebook
PyCharm is the editor used to write the code (it makes writing code easier and more comfortable).
Python is the interpreter that runs the Python code.
Goal of this session
The crawler code uses the following built-in modules:
import pprint  # formatted output module
import csv  # save data to a csv file
import re  # regular expressions
import time  # time module
Third-party modules:
import requests  # data request module (install with pip install requests)
Press Win + R, type cmd, and in the command prompt run pip install followed by the module name (for example, pip install requests).
If the installation fails (red error text appears), it is usually because the network connection timed out; switch to a domestic mirror source instead.
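For example, using the Tsinghua PyPI mirror (not mentioned in the original article, just one common choice) the install command looks like this:

pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple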
Code implementation steps (the basic steps of a crawler):
Send a request
Get data
Parsing data
Save data
Start code

Import modules:

import requests  # data request module, third-party: pip install requests
import pprint  # formatted output module
import csv  # save data to a csv file
import time  # time module

Send a request:

url = 'https://www.lagou.com/jobs/positionAjax.json?needAddtionalResult=false'
# the request headers disguise the Python code as a normal browser request, so it is not identified as a crawler and blocked
# user-agent: the basic identity of the browser
headers = {
    'cookie': '...',  # the original article uses a long session cookie copied from the author's logged-in browser; paste your own cookie string here
    'referer': 'https://www.lagou.com/jobs/list_python?labelWords=&fromSearch=true&suginput=',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36',
}
data = {
    'first': 'false',
    'pn': page,  # page number, supplied by the page-turning loop shown below
    'kd': 'python',
    'sid': 'bf8ed05047294473875b2c8373df0357',
}
# response is just a custom variable name; any name can be used
response = requests.post(url=url, data=data, headers=headers)
Get the response data returned by the server
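A quick way to check what the server returned before parsing it (a small sketch, not part of the original code):

print(response.status_code)  # 200 means the request succeeded
pprint.pprint(response.json())  # pretty-print the returned JSON to see its structure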
Parsing data
JSON data is the easiest kind to parse: values are taken from the dictionary by key, one key-value pair at a time.
result = response.json()['content']['positionResult']['result']
# loop to extract the elements one by one from the result list
for index in result:
    # pprint.pprint(index)
    href = f'https://www.lagou.com/jobs/{index["positionId"]}.html'
    dit = {
        'title': index['positionName'],
        'region': index['city'],
        'company name': index['companyFullName'],
        'salary': index['salary'],
        'education': index['education'],
        'experience': index['workYear'],
        'company tag': ','.join(index['companyLabelList']),  # join() converts the list (e.g. ['free shuttle bus', ...]) to a string
        'details page': href,
    }
    csv_writer.writerow(dit)
    print(dit)

Add page turning:

for page in range(1, 31):
    print(f'--- crawling page {page} ---')
    time.sleep(1)

Save data:

f = open('recruitment data.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'title', 'region', 'company name', 'salary',
    'education', 'experience', 'company tag', 'details page',
])
csv_writer.writeheader()  # write the header row

Run the code to get the data. Note that in the assembled script the file and csv_writer are created first, and the request, parsing and writing code sit inside the for page loop, since data uses the page variable.
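The outline at the top also lists "Visual analysis and display" (done in Jupyter Notebook), but that code is not shown in this article. A minimal sketch of what it could look like, assuming pandas and matplotlib are installed and using the "recruitment data.csv" file written above:

import pandas as pd
import matplotlib.pyplot as plt

# load the csv file saved by the crawler
df = pd.read_csv('recruitment data.csv')

# count postings per city and draw a simple bar chart
df['region'].value_counts().head(10).plot(kind='bar', title='Python jobs by city')
plt.tight_layout()
plt.show()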
Thank you for reading. The above is the content of "how to collect recruitment information data with a Python crawler in practice". After studying this article, I believe you have a deeper understanding of how to collect recruitment data with a Python crawler; the specific usage still needs to be verified in practice.