How can Python be used to crawl recruitment-website data and build an interactive visual dashboard? This article analyzes the problem and walks through a solution in detail, in the hope of giving readers who face the same task a simple and practical approach.
Project background
With the rapid development of technology, data is growing explosively. Nobody can avoid dealing with data, and the demand for "data" talent keeps rising. So what kind of talent are companies currently recruiting, and what skills do they require? These are questions worth answering for both students and job seekers.
Based on these questions, this article crawls nationwide postings for big data, data analysis, data mining, machine learning, artificial intelligence and other related positions from the 51job recruitment website. It then compares salary and education requirements across positions, compares the demand for such talent across regions and industries, and compares the knowledge and skill requirements of different positions.
The results of the project after completion are as follows:
Crawling of information
Positions crawled: big data, data analysis, machine learning, artificial intelligence and other related positions
Fields crawled: company name, job name, work location, salary, release date, job description, company type, company size, industry
Explanation: on the 51job recruitment website we searched for "data" jobs nationwide, which returned roughly 2,000 pages of results. The crawled fields include not only information from the first-level (list) pages but also some information from the second-level (detail) pages.
Crawling approach: first parse the data of a single first-level page, then parse the second-level pages, and finally handle page turning.
Tools used: Python + requests + lxml + pandas + time
Page parsing method: XPath
1. Import related libraries
import requests
import pandas as pd
from pprint import pprint
from lxml import etree
import time
import warnings
warnings.filterwarnings("ignore")
2. Notes on page turning
# characteristics of page 1
https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE,2,1.html?
# characteristics of page 2
https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE,2,2.html?
# characteristics of page 3
https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE,2,3.html?
Note: comparing these URLs, only the page number just before ".html" changes from page to page, so we only need simple string concatenation and a loop to crawl every page.
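As a quick illustration of that concatenation (a minimal sketch; the variable names are only illustrative):

# only the page number before ".html" changes, so the URL can be stitched together
url_start = "https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE,2,"
url_end = ".html?"
for i in range(1, 4):
    url = url_start + str(i) + url_end   # pages 1, 2, 3, ...
    print(url)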
3. Complete crawling code
import requests
import pandas as pd
from pprint import pprint
from lxml import etree
import time
import warnings
warnings.filterwarnings("ignore")

for i in range(1, 1501):
    print("crawling page", i)
    ...  # the rest of the crawling loop (request, parsing, per-page saving) is truncated here
As you can see, we crawled more than 1,000 pages of data for the final analysis. We therefore save the data after every single page is crawled, so that a one-off save at the very end cannot fail and lose everything. In our own tests, saving the data of certain pages fails; so that this does not interrupt the rest of the code, we wrap each page in try-except exception handling (a minimal sketch of one loop iteration is given after the field descriptions below).
On the first-level (list) page, we crawled the fields "job name", "company name", "work location", "salary", "release date", and the URL of the second-level (detail) page.
On the second-level (detail) page, we crawled the fields "experience and education requirements", "job description", "company type", "company size" and "industry".
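To make the loop structure concrete, here is a minimal sketch of one iteration: fetch a list page, pull the first-level fields with XPath, and immediately append that page's rows to the CSV inside try-except so that one bad page does not stop the crawl. The XPath expressions, request header and output file name are illustrative assumptions rather than the article's exact code.

import requests
import pandas as pd
from lxml import etree
import time

base = "https://search.51job.com/list/000000,000000,0000,00,9,99,%25E6%2595%25B0%25E6%258D%25AE,2,{}.html?"
headers = {"User-Agent": "Mozilla/5.0"}               # illustrative request header

for i in range(1, 1501):
    print("crawling page", i)
    try:
        resp = requests.get(base.format(i), headers=headers, timeout=10)
        resp.encoding = resp.apparent_encoding        # let requests guess the page encoding
        dom = etree.HTML(resp.text)

        rows = []
        # '//div[@class="el"]' is an illustrative XPath for one job card on the list page
        for card in dom.xpath('//div[@class="el"]'):
            rows.append({
                "job name":     "".join(card.xpath('./p/span/a/@title')),
                "company name": "".join(card.xpath('./span[@class="t2"]/a/@title')),
                "workplace":    "".join(card.xpath('./span[@class="t3"]/text()')),
                "salary":       "".join(card.xpath('./span[@class="t4"]/text()')),
                "release date": "".join(card.xpath('./span[@class="t5"]/text()')),
                "detail url":   "".join(card.xpath('./p/span/a/@href')),
            })

        # append this page's records immediately, so one failed save does not lose everything
        pd.DataFrame(rows).to_csv("job_info1.csv", mode="a", header=False, index=False)
    except Exception as e:
        print("page", i, "failed:", e)
    time.sleep(1)                                     # small pause between requests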
Data preprocessing
From the sample of crawled data shown above, we can see that the data is very messy. Messy data is not conducive to analysis, so we need to preprocess it according to the research goals in order to obtain data that can finally be used for the visual dashboard.
1. Import of related libraries and reading of data
import re
import numpy as np
import pandas as pd

df = pd.read_csv(r"G:\8 Teddy\python_project\51_job\job_info1.csv", engine="python", header=None)
# specify the row index of the DataFrame
df.index = range(len(df))
# specify the column index of the DataFrame
df.columns = ["job name", "company name", "workplace", "salary", "release date",
              "experience and education", "company type", "company size", "industry", "job description"]
2. Data deduplication
We regard two records as duplicates when both the company name and the posted job name are the same. Therefore, we use the drop_duplicates(subset=[...]) function to remove duplicates based on the "job name" and "company name" columns.
# number of records before deduplication
print("number of records before deduplication:", df.shape)
# drop duplicate records
df.drop_duplicates(subset=["company name", "job name"], inplace=True)
# number of records after deduplication
print("number of records after deduplication:", df.shape)
3. Processing the job name field
① Exploring the job name field
Df ["position name"] .value_counts () df ["position name"] = df ["position name"] .apply (lambda x:x.lower ())
Explanation: first we count the frequency of each job title, and it is clear that the "job name" field is far too messy for direct statistical analysis. We then convert uppercase letters in the job names to lowercase, so that, for example, "AI" and "Ai" are treated as the same thing.
② Construct the target positions to analyze and filter the data
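A minimal sketch of this keyword filter, assuming the deduplicated DataFrame is df with a "job name" column, that the seven target keywords sit in a list called target_job (the keywords listed here are only illustrative), and that the filtered result is stored as job_info, which the standardization code in ③ then works on:

import numpy as np

# illustrative target keywords -- the article's exact seven keywords may differ
target_job = ["data analysis", "data mining", "algorithm", "big data",
              "machine learning", "artificial intelligence", "deep learning"]

# count() checks how often each keyword appears in every job name
index = np.array([df["job name"].str.count(job) for job in target_job]).sum(axis=0) > 0

# keep records containing at least one keyword, drop the rest
job_info = df[index]
print("records left after filtering:", job_info.shape)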
Explanation: first we construct the keywords for the seven types of target positions above. Then we use the count() function to check whether each record contains any of these keywords; a record is kept if it does and deleted if it does not. Finally, we check how many records are left after the filter.
③ Standardize the target job names (the job titles are too messy, so we need to unify them)
job_list = ['data analysis', 'data statistics', 'data specialist', 'data mining', 'algorithm',
            'big data', 'development engineer', 'operation', 'software engineering', 'front-end development',
            'deep learning', 'ai', 'database', 'data product', 'customer service',
            'java', '.net', 'andrio', 'artificial intelligence', 'c++',
            'data management', 'testing', 'operation and maintenance']
job_list = np.array(job_list)

def rename(x=None, job_list=job_list):
    # replace a job name with the first keyword from job_list that it contains
    index = [i in x for i in job_list]
    if sum(index) > 0:
        return job_list[index][0]
    else:
        return x

job_info["job name"] = job_info["job name"].apply(rename)
job_info["job name"].value_counts()

# group "data statistics" and "data specialist" into "data analysis"
job_info["job name"] = job_info["job name"].apply(lambda x: re.sub("data specialist", "data analysis", x))
job_info["job name"] = job_info["job name"].apply(lambda x: re.sub("data statistics", "data analysis", x))
Description: first we define the list of target job names, job_list, that we want to map onto, and convert it to an ndarray. We then define a function: if a record contains one keyword from job_list, the record is replaced by that keyword; if it contains several keywords, it is replaced by the first matching one only. Next, value_counts() is used to count the frequency of each job name after replacement. Finally, we merge "data specialist" and "data statistics" into "data analysis".
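As a quick illustration of the "first matching keyword wins" behaviour, assuming the rename function and job_list from the block above have been defined:

# 'big data' appears before 'development engineer' in job_list, so it wins
print(rename("senior big data development engineer"))   # -> big data
# a title containing no keyword is left unchanged
print(rename("sales manager"))                           # -> sales manager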
4. Processing the salary field
The salary field contains strings in formats such as "20-30万/年" (ten-thousand yuan per year), "2.5-3万/月" (ten-thousand yuan per month) and "3.5-4.5千/月" (thousand yuan per month). We need to unify them: convert every value to yuan per month, extract the two numbers, and take their average.
Job_info ["wages"]. Str [- 1] .value _ counts () job_info ["wages"]. Str [- 3] .value _ counts () index1 = job_info ["wages"]. Str [- 1] .isin (["year", "month"]) index2 = job_info ["wages"]. Str [- 3] .isin (["ten thousand") "thousands"]) job_info = job_ info [index1 & index2] def get_money_max_min (x): try: if x [- 3] = "ten thousand": Z = [float (I) * 10000 for i in re.findall ("[0-9] +\. [0-9] *" X)] elif x [- 3] = "thousands": Z = [float (I) * 1000 for i in re.findall ("[0-9] +\. [0-9] *" X)] if x [- 1] = "year": Z = [I for i in z 12] return z except: return xsalary = job_info ["wage"] .apply (get_money_max_min) job_info ["minimum wage"] = salary.str [0] job_info ["maximum wage"] = salary.str [1] job_info ["wage level"] = job_info ["minimum wage" "maximum wage"] .mean (axis=1)
Description: first we filter the data: a record is kept only if the last character of the salary string is "年" (year) or "月" (month) and the third character from the end is "万" (ten thousand) or "千" (thousand); otherwise it is dropped. Then a function is defined to convert every salary to a common unit of yuan per month. Finally, the minimum and maximum salary are averaged to obtain the final "salary level" field.
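As a worked example, assuming the get_money_max_min function above has been defined, a yearly salary quoted in 万/年 and a monthly salary quoted in 千/月 convert like this:

# "20-30万/年" -> [200000/12, 300000/12] ≈ [16666.7, 25000.0] yuan per month
print(get_money_max_min("20-30万/年"))
# "3.5-4.5千/月" -> [3500.0, 4500.0] yuan per month
print(get_money_max_min("3.5-4.5千/月"))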
5. Processing of the workplace field
Since the data covers the whole country, a large number of cities are involved. We define a list of common target cities and use it to unify the workplace field.
# job_info ["place of work"]. Value_counts () address_list = [Beijing, Shanghai, Guangzhou, Shenzhen, Hangzhou, Suzhou, Changsha, Wuhan, Tianjin, Chengdu, Xi'an, Dongguan, Hefei, Foshan, Ningbo, Nanjing Chongqing, Changchun, Zhengzhou, Changzhou, Fuzhou, Shenyang, Jinan, Ningbo, Xiamen, Guizhou, Zhuhai, Qingdao, Zhongshan, Dalian, Kunshan, Huizhou, Harbin, Kunming, Nanchang "Wuxi"] address_list= np.array (address_list) def rename: index = [i in x for i in address_list] if sum (index) > 0: return address_ list [index] [0] else: return xjob_info ["workplace"] = job_info ["workplace"] .apply (rename)
Description: first we define a list of target cities and convert it to an ndarray. Then we define a function that replaces each original workplace value with the matching target city.
6. Processing of company type field
This is easy, so I won't elaborate on it.
# treat company-type values that are too short to be valid as missing
job_info.loc[job_info["company type"].apply(lambda x: len(x) < 6), "company type"] = np.nan