This article explains how to crawl 51job recruitment information with Python. Many people have questions about this task, so the steps below walk through a simple, practical workflow from start to finish; follow along and try it yourself.
Basic development environment
Python 3.6
PyCharm
Use of related modules
requests
parsel
csv
re
Install Python, add it to the PATH environment variable, and install the required third-party modules with pip.
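For reference, one possible way to install the third-party modules from a terminal (csv and re are part of the Python standard library and need no installation):

pip install requests
pip install parsel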
First, define the requirements
Crawl 51job recruitment information
Crawl content:
Recruitment title
Company
Salary
Urban area
Work experience requirements
Education requirements
Recruitment headcount
Release time
Company benefits
Job responsibilities and requirements
Second, request the web page and get the detail-page URL of every posting

Inspecting the page with the browser's developer tools shows that the loaded content is garbled, which means the response will need to be transcoded later in the crawl. It also means you cannot tell at a glance whether the page you want is in the returned data; to check, copy a piece of data from the rendered page and search for it in the page source.

That search returns no results, so search instead for the ID that appears in each detail-page link.

The source contains both the ID and the full detail-page URL. Match the ID with a regular expression and then concatenate it into the URL; if you match the full URL instead, you will have to decode the escaped characters afterwards.

Special note:

On this site, the detail-page URLs differ only in the ID. If the ID were not the only changing value, it would be better to extract the full URL instead.
import requests
import re


def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def get_id(html_url):
    response = get_response(html_url)
    # match the job IDs embedded in the page source
    result = re.findall(r'"jobid":"(\d+)"', response.text)
    print(response.text)
    print(result)
    return result  # returned so later steps can build the detail-page URLs


if __name__ == '__main__':
    url = 'https://search.51job.com/list/010000%252C020000%252C030200%252C040000%252C090200,000000,0000,00,9,99,python,2,1.html'
    get_id(url)
Simple summary

Print response.text, then use PyCharm's regex-enabled search on the output to test whether the pattern actually matches the data.
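If you prefer to test the pattern without PyCharm, here is a minimal sketch in plain Python; the sample string is a made-up stand-in for response.text, used only to exercise the regular expression:

import re

# hypothetical fragment standing in for response.text
sample = '{"jobid":"123456789","issuedate":"2021-01-25"}'

ids = re.findall(r'"jobid":"(\d+)"', sample)
print(ids)  # prints ['123456789'] when the pattern matches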
Third, analyze the recruitment data and extract the content
The first posting on each page carries no salary. Postings without a salary should be skipped, so check for the salary field before extracting anything else.

Also, as mentioned earlier, the returned page content is garbled and needs to be transcoded.
import parsel


def get_content(html_url):
    result = get_id(html_url)
    for i in result:
        page_url = f'https://jobs.51job.com/shanghai-xhq/{i}.html?s=01&t=0'
        response = get_response(page_url)
        # transcode: let requests detect the real encoding of the page
        response.encoding = response.apparent_encoding
        html_data = response.text
        selector = parsel.Selector(html_data)
        # salary
        money = selector.css('.cn strong::text').get()
        # only continue extracting if the posting lists a salary
        if money:
            # title
            title = selector.css('.cn h1::attr(title)').get()
            # company
            cname = selector.css('.cname a:nth-child(1)::attr(title)').get()
            # e.g. "Shanghai-Xuhui District | 5-7 years experience | bachelor | recruit 1 person | published 01-25"
            info_list = selector.css('p.msg.ltype::attr(title)').get().split('|')
            city = info_list[0]    # city
            exp = info_list[1]     # experience requirements
            edu = info_list[2]     # academic requirements
            people = info_list[3]  # recruitment headcount
            date = info_list[4]    # release time
            # benefits
            boon_list = selector.css('.t1 span::text').getall()
            boon_str = '|'.join(boon_list)
            # job responsibilities and requirements
            position_list = selector.css('.job_msg p::text').getall()
            position = '\n'.join(position_list)
            dit = {
                'title': title,
                'company': cname,
                'city': city,
                'experience requirements': exp,
                'academic requirements': edu,
                'salary': money,
                'benefits': boon_str,
                'recruitment': people,
                'release time': date,
                'detailed address': page_url,
            }

Fourth, save the data (data persistence)
Save the structured fields (title, company, salary, address, and so on) to a CSV file, and save the job responsibilities and requirements to text files; this is easier to read.

Save to CSV
import csv

f = open('python recruitment.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'title', 'company', 'city', 'experience requirements', 'academic requirements',
    'salary', 'benefits', 'recruitment', 'release time', 'detailed address'])
csv_writer.writeheader()
Save to txt

# the 'Job responsibilities' folder must already exist
txt_filename = 'Job responsibilities\\' + f'{cname} recruitment {title} information.txt'
with open(txt_filename, mode='a', encoding='utf-8') as f:
    f.write(position)

Fifth, multi-page data crawling

if __name__ == '__main__':
    """
    page 1: https://search.51job.com/list/010000%252c020000%252c030200%252c040000%252c090200,000000,0000,00,9,99,python,2,1.html
    page 2: https://search.51job.com/list/010000%252c020000%252c030200%252c040000%252c090200,000000,0000,00,9,99,python,2,2.html
    page 3: https://search.51job.com/list/010000%252c020000%252c030200%252c040000%252c090200,000000,0000,00,9,99,python,2,3.html
    only the trailing page number changes, so loop over it
    """
    for page in range(1, 11):
        url = f'https://search.51job.com/list/010000%252C020000%252C030200%252C040000%252C090200,000000,0000,00,9,99,python,2,{page}.html'
        get_content(url)

Implementation effect
Supplementary code
Use a regular expression to replace special characters that are illegal in file names
def change_title(title):
    # characters that Windows forbids in file names: \ / : * ? " < > |
    pattern = re.compile(r'[\\/:*?"<>|]')
    new_title = re.sub(pattern, '_', title)  # replace each one with an underscore
    return new_title
Main function code
def main(html_url):
    result = get_id(html_url)
    for i in result:
        page_url = f'https://jobs.51job.com/shanghai-xhq/{i}.html?s=01&t=0'
        response = get_response(page_url)
        response.encoding = response.apparent_encoding
        html_data = response.text
        selector = parsel.Selector(html_data)
        # salary
        money = selector.css('.cn strong::text').get()
        # only continue extracting if the posting lists a salary
        if money:
            # title
            title = selector.css('.cn h1::attr(title)').get()
            # company
            cname = selector.css('.cname a:nth-child(1)::attr(title)').get()
            # e.g. "Shanghai-Xuhui District | 5-7 years experience | bachelor | recruit 1 person | published 01-25"
            info_list = selector.css('p.msg.ltype::attr(title)').get().split('|')
            # some postings omit fields, so only keep ones with all five parts
            if len(info_list) == 5:
                city = info_list[0]    # city
                exp = info_list[1]     # experience requirements
                edu = info_list[2]     # academic requirements
                people = info_list[3]  # recruitment headcount
                date = info_list[4]    # release time
                # benefits
                boon_list = selector.css('.t1 span::text').getall()
                boon_str = '|'.join(boon_list)
                # job responsibilities and requirements
                position_list = selector.css('.job_msg p::text').getall()
                position = '\n'.join(position_list)
                dit = {
                    'title': title,
                    'company': cname,
                    'city': city,
                    'experience requirements': exp,
                    'academic requirements': edu,
                    'salary': money,
                    'benefits': boon_str,
                    'recruitment': people,
                    'release time': date,
                    'detailed address': page_url,
                }
                new_title = change_title(title)
                # the 'Job responsibilities' folder must already exist
                txt_filename = 'Job responsibilities\\' + f'{cname} recruitment {new_title} information.txt'
                with open(txt_filename, mode='a', encoding='utf-8') as f:
                    f.write(position)
                csv_writer.writerow(dit)
                print(dit)

This concludes the study of how Python crawls 51job recruitment information. Theory works best when paired with practice, so go ahead and try it yourself!