Shulou (Shulou.com), SLTechnology News&Howtos, 06/01 report. Updated 2025-01-16.
This article shows how to crawl 51job recruitment data with Python, and then how to clean, analyze, and visualize it. The methods below are simple and easy to follow, so let's work through them step by step.
1. Crawl data
Target URL:
https://www.51job.com/
Search 51job for the keyword python to list the relevant job postings. Paging through the results reveals the pattern by which the URL changes from one page to the next.
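Once the paging pattern is known, the list of page URLs can be generated up front. The sketch below only illustrates the idea: `URL_TEMPLATE` and `page_urls` are hypothetical names, and the template points at a placeholder host because the article does not give the actual search URL.

```python
from urllib.parse import quote

# Hypothetical template: the article does not give the real search URL,
# so a placeholder host and path are used here.
URL_TEMPLATE = 'https://example.com/search/{keyword}/{page}.html'


def page_urls(keyword, pages):
    """Return the URLs for result pages 1..pages of a keyword search."""
    return [
        URL_TEMPLATE.format(keyword=quote(keyword), page=page)
        for page in range(1, pages + 1)
    ]


if __name__ == '__main__':
    for url in page_urls('python', 3):
        print(url)
```

Replace the template with the pattern observed in the browser's address bar while paging through real results.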
Check the source code of the web page to find the data you want to extract.
Part of the crawler code is shown below; the original article offers the full code as a download.
import logging
import re

import pandas as pd

# Method of the crawler class (the rest of the class is not shown)
async def parse(self, text):
    # Extract the fields with regular expressions
    try:
        job_name = re.findall('"job_name":"(.*?)",', text)  # job title
        company_name = re.findall('"company_name":"(.*?)",', text)  # company name
        salary = re.findall('"providesalary_text":"(.*?)",', text)
        salary = [i.replace('\\', '') for i in salary]  # strip backslashes from the salary
        city = re.findall('"workarea_text":"(.*?)",', text)  # city
        job_welfare = re.findall('"jobwelf":"(.*?)",', text)  # benefits
        attribute_text = re.findall('"attribute_text":(.*?),"companysize_text"', text)
        # eval() turns each matched list literal into a list; it is risky on
        # untrusted input, and json.loads() is a safer alternative
        attribute_text = ['|'.join(eval(i)) for i in attribute_text]
        companysize = re.findall('"companysize_text":"(.*?)",', text)  # company size
        category = re.findall('"companyind_text":"(.*?)",', text)
        category = [i.replace('\\', '') for i in category]  # strip backslashes from the industry
        datas = pd.DataFrame({'company_name': company_name, 'job_name': job_name,
                              'companysize': companysize, 'city': city, 'salary': salary,
                              'attribute_text': attribute_text, 'category': category,
                              'job_welfare': job_welfare})
        datas.to_csv('job_info.csv', mode='a+', index=False, header=True)
        logging.info({'company_name': company_name, 'job_name': job_name,
                      'company_size': companysize, 'city': city, 'salary': salary,
                      'attribute_text': attribute_text, 'category': category,
                      'job_welfare': job_welfare})
    except Exception as e:
        print(e)
The run output is as follows:
Crawled 200 pages of recruitment data, 10,000 job postings in total, in 49.919 s.
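The speed comes from requesting many pages concurrently rather than one after another. The following is a minimal sketch of that pattern using only `asyncio`; `fetch_page` is a hypothetical stand-in for the real HTTP request, and the concurrency limit of 20 is an assumption rather than a value taken from the article.

```python
import asyncio


async def fetch_page(page):
    """Hypothetical stand-in for the real HTTP request for one result page."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return f'<html for page {page}>'


async def crawl(pages, limit=20):
    """Fetch all pages concurrently, with at most `limit` requests in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(page):
        async with semaphore:
            return await fetch_page(page)

    # gather() returns the results in the same order as the pages
    return await asyncio.gather(*(bounded(page) for page in pages))


if __name__ == '__main__':
    texts = asyncio.run(crawl(range(1, 6)))
    print(len(texts))
```

Each returned page text would then be handed to a parsing step such as the `parse` method above. Capping concurrency with a semaphore keeps the crawler from opening hundreds of connections at once.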
2. Data viewing and preprocessing
import pandas as pd

df = pd.read_csv('job_info.csv')
# The asynchronous crawler appends with
# datas.to_csv('job_info.csv', mode='a+', index=False, header=True),
# so the header row is repeated in the file; drop the repeated header rows.
# .copy() avoids pandas' SettingWithCopyWarning in the assignments below.
df1 = df[df['salary'] != 'salary'].copy()
# View the first 10 rows
df1.head(10)

# Reduce the city column to the city name:
# split on '-' with expand=True and assign column 0 back to df1['city']
df1['city'] = df1['city'].str.split('-', expand=True)[0]
df1.head(10)

# The experience and education requirements are in the attribute_text column
df['attribute_text'].str.split('|', expand=True)

df1['experience'] = df1['attribute_text'].str.split('|', expand=True)[1]
df1['education'] = df1['attribute_text'].str.split('|', expand=True)[2]
df1
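The repeated header rows come from appending with `header=True` on every write. An alternative is to write the header only when the file is new, so no cleanup is needed afterwards. The sketch below shows the idea with the standard `csv` module; `append_rows` is a hypothetical helper, not part of the original crawler.

```python
import csv
import os


def append_rows(path, rows, fieldnames):
    """Append dict rows to a CSV file, writing the header only once."""
    # The header is needed only if the file is missing or still empty
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


if __name__ == '__main__':
    fields = ['city', 'salary']
    append_rows('demo_job_info.csv', [{'city': 'Shanghai', 'salary': '1-1.5'}], fields)
    append_rows('demo_job_info.csv', [{'city': 'Beijing', 'salary': '2-3'}], fields)
```

With pandas the same effect can be had by passing `header=not os.path.exists(path)` to `to_csv` when appending.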
Save the cleaned data
df1.to_csv('cleaned data.csv', index=False)
View the index, data types, and memory usage
df2 = pd.read_csv('cleaned data.csv')
df2.info()
3. Data analysis and visualization
(1) Bar chart of the top 10 cities by number of job postings.
The code is as follows:
import random

import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('cleaned data.csv')
# Some postings are listed as remote recruitment; filter them out
data = df[df['city'] != 'remote recruitment']['city'].value_counts()
city = list(data.index)[:10]   # cities
nums = list(data.values)[:10]  # number of postings
print(city)
print(nums)
# The second color code is garbled in the source; '#0000CD' is an assumed stand-in
colors = ['#FF0000', '#0000CD', '#00BFFF', '#008000', '#FF1493',
          '#FFD700', '#FF4500', '#00FA9A', '#191970', '#9932CC']
random.shuffle(colors)
# Set the figure size and resolution
plt.figure(figsize=(9, 6), dpi=100)
# Use a font that can display Chinese
mpl.rcParams['font.family'] = 'SimHei'
# Draw the bar chart; width sets the bar width, color gives each bar its own color
plt.bar(city, nums, width=0.5, color=colors)
# Add the title and axis labels
plt.title('Cities with the most job postings')
plt.xlabel('city', fontsize=12)
plt.ylabel('number of posts', fontsize=12)
# Show the figure
plt.show()
The output is as follows (only nine of the ten counts are legible in the source):
['Shanghai', 'Shenzhen', 'Guangzhou', 'Beijing', 'Hangzhou', 'Chengdu', 'Wuhan', 'Nanjing', 'Suzhou', 'Changsha']
[2015, 1359, 999, 674, 550, 466, 444, 320, 211]
There are many jobs in Shanghai, Shenzhen, Guangzhou and Beijing, as well as a considerable number of jobs in Hangzhou, Chengdu, Wuhan, Nanjing and other cities.
(2) Convert the salaries to a common unit of K/month (thousands per month), divide them into salary ranges, count the postings in each range, and show the distribution as a pie chart.
The code is as follows:
# nums, colors and part_interval come from the salary conversion and binning
# step; the article does not include that code
import matplotlib as mpl
import matplotlib.pyplot as plt

# Use a font that can display Chinese
mpl.rcParams['font.family'] = 'SimHei'
# Set the figure size and resolution
plt.figure(figsize=(9, 6), dpi=100)
plt.axes(aspect='equal')  # keep the pie chart circular
explodes = [0, 0, 0, 0.1, 0.2, 0.3, 0.4, 0.5]
plt.pie(nums, pctdistance=0.75, shadow=True, colors=colors, autopct='%.2f%%',
        explode=explodes, startangle=15, labeldistance=1.1)
# Add the legend and adjust its position
plt.legend(part_interval, bbox_to_anchor=(1.0, 1.0))
plt.title('salary distribution of recruitment posts', fontsize=15)
plt.show()
The result is as follows:
Most postings fall in the 5K-10K and 10K-15K ranges, and a small share of high-paying jobs pay above 50K.
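The conversion and binning step is not shown in the article, so here is one way it might look. Everything in this sketch is an assumption: the salary text formats ('1-1.5万/月', '8-10千/月', '20-30万/年'), the use of the range midpoint, and the eight bin edges (chosen only to match the eight entries in `explodes` and the ranges mentioned above).

```python
import re
from bisect import bisect_right
from collections import Counter

# Assumed salary formats, e.g. '1-1.5万/月', '8-10千/月', '20-30万/年'
PATTERN = re.compile(r'(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)(万|千)/(月|年)')
UNIT_IN_K = {'千': 1.0, '万': 10.0}  # value of one unit in thousands
MONTHS = {'月': 1, '年': 12}         # months covered by the pay period

# Assumed bin edges in K/month; each bin includes its lower edge
EDGES = [5, 10, 15, 20, 30, 40, 50]
LABELS = ['<5K', '5K-10K', '10K-15K', '15K-20K',
          '20K-30K', '30K-40K', '40K-50K', '>50K']


def to_k_per_month(text):
    """Return the midpoint of a salary range in K/month, or None if unparsable."""
    match = PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    low, high, unit, period = match.groups()
    midpoint = (float(low) + float(high)) / 2
    return midpoint * UNIT_IN_K[unit] / MONTHS[period]


def salary_distribution(salaries):
    """Count salaries per bin, in the order of LABELS; unparsable values are skipped."""
    counts = Counter()
    for text in salaries:
        value = to_k_per_month(text)
        if value is not None:
            counts[LABELS[bisect_right(EDGES, value)]] += 1
    return [counts[label] for label in LABELS]


if __name__ == '__main__':
    sample = ['1-1.5万/月', '8-10千/月', '20-30万/年', 'negotiable']
    print(salary_distribution(sample))
```

In this sketch, `salary_distribution` would supply `nums` and `LABELS` would play the role of `part_interval` in the pie chart code.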
(3) View the education requirements of the postings, shown as a horizontal bar chart.
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv(r'cleaned data.csv')['education']
data = df.value_counts()
labels = ['junior college', 'undergraduate', 'master', 'doctor']
nums = [data[i] for i in labels]
print(labels)
print(nums)
colors = ['cyan', 'red', 'yellow', 'blue']
# Use a font that can display Chinese
mpl.rcParams['font.family'] = 'SimHei'
# Set the plot style
plt.style.use('ggplot')
# Set the figure size and resolution
plt.figure(figsize=(8, 6), dpi=100)
# Draw the horizontal bar chart
plt.barh(labels, nums, height=0.36, color=colors)
plt.title('requirements for academic qualifications for recruitment positions', fontsize=16)
plt.xlabel('number of posts', fontsize=12)
plt.show()
The output is as follows:
['junior college', 'undergraduate', 'master', 'doctor']
[2052, 6513, 761, 45]
(4) View the work experience requirements of the postings, shown as a horizontal bar chart.
As the data in the work experience column is not standardized, special treatment needs to be done in statistics.
The code is as follows:
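# labels, nums and colors come from the experience-counting step;
# the article does not include that code
import matplotlib as mpl
import matplotlib.pyplot as plt

# Use a font that can display Chinese
mpl.rcParams['font.family'] = 'SimHei'
# Set the plot style
plt.style.use('ggplot')
# Set the figure size and resolution
plt.figure(figsize=(9, 6), dpi=100)
# Draw the horizontal bar chart
plt.barh(labels, nums, height=0.5, color=colors)
plt.title('recruitment requirements for work experience', fontsize=16)
plt.xlabel('number of posts', fontsize=12)
plt.show()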
The output is as follows. The raw value_counts() result for the experience column (Name: experience, dtype: int64) begins with the rows below; the remaining rows are garbled in the source and include stray values such as degrees and headcounts that ended up in this column.
3-4 years experience    3361
2 years experience      2114
1 year experience       1471
5-7 years experience    1338
...
After the special handling, the labels and counts are:
['no experience', '1 year experience', '2 years experience', '3-4 years experience', '5-7 years experience', '8-9 years experience', 'more than 10 years experience']
[1260, 1530, 2114, 3372, 1338, 105, 64]
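The article does not show the special handling itself. One possible approach, sketched below, keeps the recognized experience labels and folds every other value (degrees, headcounts, missing entries) into 'no experience'. That folding rule is an assumption and may not reproduce the author's exact counts; `count_experience` is a hypothetical helper.

```python
from collections import Counter

LABELS = ['no experience', '1 year experience', '2 years experience',
          '3-4 years experience', '5-7 years experience',
          '8-9 years experience', 'more than 10 years experience']


def count_experience(values):
    """Count postings per experience label, in the order of LABELS.

    Assumption: any value that is not a recognized label (a degree, a
    headcount, a missing entry) is counted as 'no experience'.
    """
    counts = Counter()
    for value in values:
        label = value.strip() if isinstance(value, str) else ''
        counts[label if label in LABELS else 'no experience'] += 1
    return [counts[label] for label in LABELS]


if __name__ == '__main__':
    raw = ['2 years experience', 'master', None,
           '3-4 years experience ', 'recruit 2 people']
    print(count_experience(raw))
```

The returned list lines up with `LABELS`, so both can be passed directly to `plt.barh` as in the chart code above.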
(5) View the distribution of industries that the recruiting companies belong to, shown as a word cloud.
The code is as follows:
import collections

import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud

df = pd.read_csv(r'cleaned data.csv')['category']
data = list(df.values)
word_list = []
# A company can list several industries separated by '/'; count each one
for i in data:
    x = i.split('/')
    for j in x:
        word_list.append(j)
word_counts = collections.Counter(word_list)
# Draw the word cloud
my_cloud = WordCloud(
    background_color='white',  # background color (the default is black)
    width=900, height=500,
    font_path='simhei.ttf',    # font that can display Chinese
    max_font_size=120,         # maximum font size
    min_font_size=15,          # minimum font size
    random_state=60            # seed for the random layout and colors
).generate_from_frequencies(word_counts)
# Show the generated word cloud
plt.imshow(my_cloud, interpolation='bilinear')
# Hide the axes
plt.axis('off')
plt.show()
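The counting loop above calls `split` on every value, which fails if the category column contains missing entries (pandas reads those as NaN floats). A slightly more defensive version of the same counting step might look like this; `industry_counts` is a hypothetical helper, and skipping missing or empty entries is an assumption about how such rows should be treated.

```python
from collections import Counter


def industry_counts(categories):
    """Count industries, splitting entries like 'A/B' into separate industries."""
    counts = Counter()
    for entry in categories:
        if not isinstance(entry, str):
            continue  # skip missing values such as NaN
        # Strip whitespace and ignore empty parts left by stray slashes
        counts.update(part.strip() for part in entry.split('/') if part.strip())
    return counts


if __name__ == '__main__':
    sample = ['Internet/E-commerce', 'Computer software', float('nan'),
              'Internet/E-commerce']
    print(industry_counts(sample).most_common())
```

The resulting `Counter` can be passed to `generate_from_frequencies` in place of `word_counts`.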
The running effect is as follows:
That concludes this walkthrough of how to crawl 51job recruitment data with Python. Combining theory with practice is the best way to learn, so go and try it yourself.