This article explains how to crawl 51job recruitment information with Python. Many people have questions about this task, so the steps below walk through a simple, practical workflow from start to finish; follow along and try it yourself.
Basic development environment
Python 3.6
PyCharm
Use of related modules
requests
parsel
csv
re
Install Python, add it to the PATH environment variable, and install the required third-party modules with pip.
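For reference, one possible way to install the third-party modules from a terminal (csv and re are part of the Python standard library and need no installation):

pip install requests
pip install parsel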
First, define the requirements
Crawl 51job recruitment information
Crawl content:
Recruitment title
Company
Salary
Urban area
Work experience requirements
Education requirements
Recruitment headcount
Release time
Company benefits
Job responsibilities and requirements
Second, request the web page and get the detail-page URL of every posting

Inspecting the page with the browser's developer tools shows that the loaded content is garbled, which means the response will need to be transcoded later in the crawl. It also means you cannot tell at a glance whether the page you want is in the returned data; to check, copy a piece of data from the rendered page and search for it in the page source.

That search returns no results, so search instead for the ID that appears in each detail-page link.

The source contains both the ID and the full detail-page URL. Match the ID with a regular expression and then concatenate it into the URL; if you match the full URL instead, you will have to decode the escaped characters afterwards.

Special note:

On this site, the detail-page URLs differ only in the ID. If the ID were not the only changing value, it would be better to extract the full URL instead.
import requests
import re


def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def get_id(html_url):
    response = get_response(html_url)
    # match the job IDs embedded in the page source
    result = re.findall(r'"jobid":"(\d+)"', response.text)
    print(response.text)
    print(result)
    return result  # returned so later steps can build the detail-page URLs


if __name__ == '__main__':
    url = 'https://search.51job.com/list/010000%252C020000%252C030200%252C040000%252C090200,000000,0000,00,9,99,python,2,1.html'
    get_id(url)
Simple summary

Print response.text, then use PyCharm's regex-enabled search on the output to test whether the pattern actually matches the data.
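If you prefer to test the pattern without PyCharm, here is a minimal sketch in plain Python; the sample string is a made-up stand-in for response.text, used only to exercise the regular expression:

import re

# hypothetical fragment standing in for response.text
sample = '{"jobid":"123456789","issuedate":"2021-01-25"}'

ids = re.findall(r'"jobid":"(\d+)"', sample)
print(ids)  # prints ['123456789'] when the pattern matches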
Third, analyze the recruitment data and extract the content
The first posting on each page carries no salary. Postings without a salary should be skipped, so check for the salary field before extracting anything else.

Also, as mentioned earlier, the returned page content is garbled and needs to be transcoded.
import parsel


def get_content(html_url):
    result = get_id(html_url)
    for i in result:
        page_url = f'https://jobs.51job.com/shanghai-xhq/{i}.html?s=01&t=0'
        response = get_response(page_url)
        # transcode: let requests detect the real encoding of the page
        response.encoding = response.apparent_encoding
        html_data = response.text
        selector = parsel.Selector(html_data)
        # salary
        money = selector.css('.cn strong::text').get()
        # only continue extracting if the posting lists a salary
        if money:
            # title
            title = selector.css('.cn h1::attr(title)').get()
            # company
            cname = selector.css('.cname a:nth-child(1)::attr(title)').get()
            # e.g. "Shanghai-Xuhui District | 5-7 years experience | bachelor | recruit 1 person | published 01-25"
            info_list = selector.css('p.msg.ltype::attr(title)').get().split('|')
            city = info_list[0]    # city
            exp = info_list[1]     # experience requirements
            edu = info_list[2]     # academic requirements
            people = info_list[3]  # recruitment headcount
            date = info_list[4]    # release time
            # benefits
            boon_list = selector.css('.t1 span::text').getall()
            boon_str = '|'.join(boon_list)
            # job responsibilities and requirements
            position_list = selector.css('.job_msg p::text').getall()
            position = '\n'.join(position_list)
            dit = {
                'title': title,
                'company': cname,
                'city': city,
                'experience requirements': exp,
                'academic requirements': edu,
                'salary': money,
                'benefits': boon_str,
                'recruitment': people,
                'release time': date,
                'detailed address': page_url,
            }

Fourth, save the data (data persistence)
Save the structured fields (title, company, salary, address, and so on) to a CSV file, and save the job responsibilities and requirements to text files; this is easier to read.

Save to CSV
import csv

f = open('python recruitment.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'title', 'company', 'city', 'experience requirements', 'academic requirements',
    'salary', 'benefits', 'recruitment', 'release time', 'detailed address'])
csv_writer.writeheader()
Save to txt

# the 'Job responsibilities' folder must already exist
txt_filename = 'Job responsibilities\\' + f'{cname} recruitment {title} information.txt'
with open(txt_filename, mode='a', encoding='utf-8') as f:
    f.write(position)

Fifth, multi-page data crawling

if __name__ == '__main__':
    """
    page 1: https://search.51job.com/list/010000%252c020000%252c030200%252c040000%252c090200,000000,0000,00,9,99,python,2,1.html
    page 2: https://search.51job.com/list/010000%252c020000%252c030200%252c040000%252c090200,000000,0000,00,9,99,python,2,2.html
    page 3: https://search.51job.com/list/010000%252c020000%252c030200%252c040000%252c090200,000000,0000,00,9,99,python,2,3.html
    only the trailing page number changes, so loop over it
    """
    for page in range(1, 11):
        url = f'https://search.51job.com/list/010000%252C020000%252C030200%252C040000%252C090200,000000,0000,00,9,99,python,2,{page}.html'
        get_content(url)

Implementation effect
Supplementary code
Use a regular expression to replace special characters that are illegal in file names
def change_title(title):
    # characters that Windows forbids in file names: \ / : * ? " < > |
    pattern = re.compile(r'[\\/:*?"<>|]')
    new_title = re.sub(pattern, '_', title)  # replace each one with an underscore
    return new_title
Main function code
def main(html_url):
    result = get_id(html_url)
    for i in result:
        page_url = f'https://jobs.51job.com/shanghai-xhq/{i}.html?s=01&t=0'
        response = get_response(page_url)
        response.encoding = response.apparent_encoding
        html_data = response.text
        selector = parsel.Selector(html_data)
        # salary
        money = selector.css('.cn strong::text').get()
        # only continue extracting if the posting lists a salary
        if money:
            # title
            title = selector.css('.cn h1::attr(title)').get()
            # company
            cname = selector.css('.cname a:nth-child(1)::attr(title)').get()
            # e.g. "Shanghai-Xuhui District | 5-7 years experience | bachelor | recruit 1 person | published 01-25"
            info_list = selector.css('p.msg.ltype::attr(title)').get().split('|')
            # some postings omit fields, so only keep ones with all five parts
            if len(info_list) == 5:
                city = info_list[0]    # city
                exp = info_list[1]     # experience requirements
                edu = info_list[2]     # academic requirements
                people = info_list[3]  # recruitment headcount
                date = info_list[4]    # release time
                # benefits
                boon_list = selector.css('.t1 span::text').getall()
                boon_str = '|'.join(boon_list)
                # job responsibilities and requirements
                position_list = selector.css('.job_msg p::text').getall()
                position = '\n'.join(position_list)
                dit = {
                    'title': title,
                    'company': cname,
                    'city': city,
                    'experience requirements': exp,
                    'academic requirements': edu,
                    'salary': money,
                    'benefits': boon_str,
                    'recruitment': people,
                    'release time': date,
                    'detailed address': page_url,
                }
                new_title = change_title(title)
                # the 'Job responsibilities' folder must already exist
                txt_filename = 'Job responsibilities\\' + f'{cname} recruitment {new_title} information.txt'
                with open(txt_filename, mode='a', encoding='utf-8') as f:
                    f.write(position)
                csv_writer.writerow(dit)
                print(dit)

This concludes the study of how Python crawls 51job recruitment information. Theory works best when paired with practice, so go ahead and try it yourself!