Today I will talk to you about how to use a Python crawler to scrape Zhaopin recruitment listings. Many people may not know much about this, so to help you understand better, I have summarized the following content. I hope you get something out of this article.
Every office worker changes jobs a few times over a career. How do you find the job you want online, and how do you prepare for the interview in advance? Today we will scrape recruitment information from Zhaopin to help you change jobs successfully!
Running platform: Windows
Python version: Python 3.6
IDE: Sublime Text
Other tools: Chrome browser
1. Web page analysis
1.1 Analyze the request address
Take python engineer positions in Haidian District, Beijing, as an example to analyze the web page. Open the Zhaopin home page, select the Beijing area, enter "python engineer" in the search box, and click "Search Jobs":
Next, the page jumps to the search results. Press "F12" to open the developer tools, then select "Haidian" in the "Hot Areas" column and take a look at the address bar:
From the second half of the address bar, searchresult.ashx?jl=Beijing&kw=python engineer&sm=0&isfilter=1&p=1&re=2005, we can see the address we have to construct ourselves. Next, in the developer tools, follow the steps shown in the figure to find the data we need: Request Headers and Query String Parameters:
Construct the request address:

paras = {
    'jl': 'Beijing',            # search city
    'kw': 'python engineer',    # search keyword
    'isadv': 0,                 # whether to turn on more detailed search options
    'isfilter': 1,              # whether to filter the results
    'p': 1,                     # page number
    're': 2005                  # re is short for region; 2005 represents Haidian
}
url = 'https://sou.zhaopin.com/jobs/searchresult.ashx?' + urlencode(paras)
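Just to see what this produces, here is a minimal sketch assuming the paras dictionary above (urlencode comes from Python's standard urllib.parse; note that the space in the keyword is encoded as "+"):

from urllib.parse import urlencode

paras = {'jl': 'Beijing', 'kw': 'python engineer', 'isadv': 0, 'isfilter': 1, 'p': 1, 're': 2005}
print(urlencode(paras))
# jl=Beijing&kw=python+engineer&isadv=0&isfilter=1&p=1&re=2005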
Request header:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
    'Host': 'sou.zhaopin.com',
    'Referer': 'https://www.zhaopin.com/',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9'
}
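With the parameters and request headers ready, fetching a results page is a single GET request. This is only a minimal sketch (assuming the requests library and the paras, headers and url built above); the complete get_one_page() function in section 4 adds error handling:

import requests

response = requests.get(url, headers=headers)   # url and headers as constructed above
if response.status_code == 200:                 # 200 means the page was returned successfully
    html = response.text                        # raw HTML of the search results page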
1.2 Analyze the useful data
Next, we need to analyze the useful data. From the search results we need four things: the job name, the company name, the company details page URL, and the monthly salary:
Locate these items in the HTML by inspecting the page elements, as shown in the following figure:
Extract these four items with regular expressions:
# regular expression for parsing
pattern = re.compile(
    '<a style=.*? target="_blank">(.*?)</a>.*?'                        # match the position information
    '<td class="gsmc"><a href="(.*?)" target="_blank">(.*?)</a>.*?'    # match the company website and company name
    '<td class="zwyx">(.*?)</td>', re.S)                               # match the monthly salary

# match all qualifying content
items = re.findall(pattern, html)
Note: some of the parsed job names carry HTML tags (the search keyword is highlighted with <b> tags), as shown in the following figure:
So after parsing, the data needs to be processed to strip these tags, which is done with the following code:
for item in items:
    job_name = item[0]
    job_name = job_name.replace('<b>', '')
    job_name = job_name.replace('</b>', '')
    yield {
        'job': job_name,
        'website': item[1],
        'company': item[2],
        'salary': item[3]
    }
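To make the effect of the two replace() calls concrete, here is a minimal sketch on a made-up job title (the string itself is only an illustration):

job_name = 'Senior <b>python engineer</b>'                    # made-up example title
job_name = job_name.replace('<b>', '').replace('</b>', '')    # strip the highlight tags
print(job_name)                                               # Senior python engineer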
2. Write to file
The data we obtain has the same fields for every position, so it could be written to a database, but this article writes it to a csv file instead. Here is the explanation from Baidu Encyclopedia:
Comma-separated values (Comma-Separated Values, CSV, sometimes called character-separated values because the delimiter does not have to be a comma) files store tabular data (numbers and text) in plain text. Plain text means the file is a sequence of characters and contains no data that must be interpreted as binary numbers.
Because Python has a built-in csv library for manipulating csv files, this is convenient:
import csv

def write_csv_headers(path, headers):
    '''Write the header row'''
    with open(path, 'a', encoding='gb18030', newline='') as f:
        f_csv = csv.DictWriter(f, headers)
        f_csv.writeheader()

def write_csv_rows(path, headers, rows):
    '''Write the data rows'''
    with open(path, 'a', encoding='gb18030', newline='') as f:
        f_csv = csv.DictWriter(f, headers)
        f_csv.writerows(rows)
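Here is a quick usage sketch of these two helpers (the file name and the sample row are made up for illustration; the headers match the ones used in the complete code in section 4). Because both helpers open the file in append mode, the header should only be written once per file:

headers = ['job', 'website', 'company', 'salary']
rows = [{'job': 'python engineer', 'website': 'https://example.com',
         'company': 'Example Co.', 'salary': '10000-15000'}]   # made-up sample row

write_csv_headers('demo.csv', headers)        # write the header line once
write_csv_rows('demo.csv', headers, rows)     # append the data rows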
3. Progress display
If we want to find an ideal job, we have to screen many positions, so the amount of data we crawl will be large: dozens, hundreds, or even thousands of pages. To keep track of how far along we are, we add a progress bar.
This article uses tqdm for the progress display. Let's first take a look at the cool effect (image from the web):
Execute the following command to install: pip install tqdm.
Simple example:
from tqdm import tqdm
from time import sleep

for i in tqdm(range(1000)):
    sleep(0.01)    # pause briefly so the progress bar is visible
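The complete code below wraps the page loop with tqdm in exactly the same way. If you want to label the bar, tqdm also accepts an optional desc argument; the label here is only an example:

from tqdm import tqdm

for page in tqdm(range(10), desc='pages'):   # 'pages' is an illustrative label
    pass                                     # fetch and parse one page here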
4. Complete code
The above analyzed each of the pieces; the following is the complete code:
# -*- coding: utf-8 -*-

import re
import csv
import requests
from tqdm import tqdm
from urllib.parse import urlencode
from requests.exceptions import RequestException


def get_one_page(city, keyword, region, page):
    '''Fetch the html content of the web page and return it'''
    paras = {
        'jl': city,         # search city
        'kw': keyword,      # search keyword
        'isadv': 0,         # whether to turn on more detailed search options
        'isfilter': 1,      # whether to filter the results
        'p': page,          # page number
        're': region        # re is short for region; 2005 represents Haidian
    }

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',
        'Host': 'sou.zhaopin.com',
        'Referer': 'https://www.zhaopin.com/',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9'
    }

    url = 'https://sou.zhaopin.com/jobs/searchresult.ashx?' + urlencode(paras)
    try:
        # fetch the web page content and return the html data
        response = requests.get(url, headers=headers)
        # use the status code to determine whether the request succeeded
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    '''Parse the HTML code, extract the useful information and return it'''
    # regular expression for parsing
    pattern = re.compile(
        '<a style=.*? target="_blank">(.*?)</a>.*?'                        # match the position information
        '<td class="gsmc"><a href="(.*?)" target="_blank">(.*?)</a>.*?'    # match the company website and company name
        '<td class="zwyx">(.*?)</td>', re.S)                               # match the monthly salary

    # match all qualifying content
    items = re.findall(pattern, html)

    for item in items:
        job_name = item[0]
        job_name = job_name.replace('<b>', '')
        job_name = job_name.replace('</b>', '')
        yield {
            'job': job_name,
            'website': item[1],
            'company': item[2],
            'salary': item[3]
        }


def write_csv_file(path, headers, rows):
    '''Write both the headers and the rows to the csv file'''
    # the gb18030 encoding prevents errors when writing Chinese text
    # the newline parameter prevents an extra blank line after every row
    with open(path, 'a', encoding='gb18030', newline='') as f:
        f_csv = csv.DictWriter(f, headers)
        f_csv.writeheader()
        f_csv.writerows(rows)


def write_csv_headers(path, headers):
    '''Write the header row'''
    with open(path, 'a', encoding='gb18030', newline='') as f:
        f_csv = csv.DictWriter(f, headers)
        f_csv.writeheader()


def write_csv_rows(path, headers, rows):
    '''Write the data rows'''
    with open(path, 'a', encoding='gb18030', newline='') as f:
        f_csv = csv.DictWriter(f, headers)
        f_csv.writerows(rows)


def main(city, keyword, region, pages):
    '''Main function'''
    filename = 'zl_' + city + '_' + keyword + '.csv'
    headers = ['job', 'website', 'company', 'salary']
    write_csv_headers(filename, headers)
    for i in tqdm(range(pages)):
        # get all the position information on this page and write it to the csv file
        jobs = []
        html = get_one_page(city, keyword, region, i)
        items = parse_one_page(html)
        for item in items:
            jobs.append(item)
        write_csv_rows(filename, headers, jobs)


if __name__ == '__main__':
    main('Beijing', 'python engineer', 2005, 10)
The execution effect of the above code is shown in the figure:
After execution completes, a file named zl_Beijing_python engineer.csv is generated in the same folder as the .py file. The results are as follows:
After reading the above, do you have a better understanding of how to use a Python crawler to scrape Zhaopin recruitment listings? If you want to learn more, please follow the industry information channel. Thank you for your support.