
How to use Python to crawl recruitment information for operation and maintenance (ops) positions


This article introduces how to use Python to crawl recruitment information for operation and maintenance positions. The walkthrough is detailed and should serve as a useful reference; interested readers are encouraged to follow along.

The article proceeds in three steps:

Crawler

Data cleaning

Data visualization and analysis

1. Crawler part

This article crawls ops-related job postings from 51job. Page parsing is done with XPath, data cleaning with the pandas library, and visualization with the pyecharts library.

Relevant comments are included in the code. For readability, only part of the code is shown here; the complete code can be found at the end of the original text.
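The snippets below assume two parsed documents: dom for a search-results (list) page and dom_test for a job-detail page. The excerpt does not show how they are built; here is a minimal sketch using requests and lxml, where the URL template, headers, and detail URL are assumptions rather than the author's confirmed code:

import requests
from lxml import etree

headers = {"User-Agent": "Mozilla/5.0"}  # pretend to be a regular browser

# Hypothetical 51job search URL for ops ("运维") postings, page 1
list_url = "https://search.51job.com/list/000000,000000,0000,00,9,99,%E8%BF%90%E7%BB%B4,2,1.html"
resp = requests.get(list_url, headers=headers)
resp.encoding = "gbk"          # 51job pages are GBK-encoded
dom = etree.HTML(resp.text)    # list page, queried below via dom.xpath(...)

# Each detail page (one of the deep_url links extracted below) is parsed the same way;
# the URL here is a placeholder
detail_resp = requests.get("https://jobs.51job.com/somewhere/123456789.html", headers=headers)
detail_resp.encoding = "gbk"
dom_test = etree.HTML(detail_resp.text)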

# 1. Job title
job_name = dom.xpath('//div[@class="dw_table"]/div[@class="el"]//p/span/a[@target="_blank"]/@title')
# 2. Company name
company_name = dom.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t2"]/a[@target="_blank"]/@title')
# 3. Work location
address = dom.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t3"]/text()')
# 4. Salary
salary_mid = dom.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t4"]')
salary = [i.text for i in salary_mid]
# 5. Release date
release_time = dom.xpath('//div[@class="dw_table"]/div[@class="el"]/span[@class="t5"]/text()')
# 6. Secondary (detail-page) URL
deep_url = dom.xpath('//div[@class="dw_table"]/div[@class="el"]//p/span/a[@target="_blank"]/@href')
# 7. Experience and education: crawled as a single combined field, named random_all,
#    to be split apart later during data cleaning
random_all = dom_test.xpath('//div[@class="tHeader tHjob"]//div[@class="cn"]/p[@class="msg ltype"]/text()')
# 8. Job description
job_describe = dom_test.xpath('//div[@class="tBorderTop_box"]//div[@class="bmsg job_msg inbox"]/p/text()')
# 9. Company type
company_type = dom_test.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[1]/@title')
# 10. Company size (headcount)
company_size = dom_test.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[2]/@title')
# 11. Industry (of the company)
industry = dom_test.xpath('//div[@class="tCompany_sidebar"]//div[@class="com_tag"]/p[3]/@title')
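The persistence step is not shown in the excerpt. Here is a minimal sketch, assuming each job is appended as one row to the only_yun_wei.csv file that the cleaning stage reads back; the column order and GBK encoding are inferred from the read_csv call below, and i is assumed to be the job's position on the list page:

import csv

# Column order matches the indexes assigned during cleaning: job title, company name,
# work location, salary, release date, experience/education, company type, company size,
# industry, job description
row = [job_name[i], company_name[i], address[i], salary[i], release_time[i],
       " ".join(random_all), company_type[0], company_size[0],
       industry[0], " ".join(job_describe)]

with open("only_yun_wei.csv", "a", newline="", encoding="gbk") as f:
    csv.writer(f).writerow(row)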

2. Data cleaning

1) Read the data

import pandas as pd
import numpy as np
import re
import jieba

df = pd.read_csv("only_yun_wei.csv", encoding="gbk", header=None)
df.head()

2) Set new row and column indexes for data

# Assign the row index
df.index = range(len(df))
# Assign the column index
df.columns = ["Job Title", "Company Name", "Work Location", "Salary", "Release Date",
              "Experience and Education", "Company Type", "Company Size", "Industry",
              "Job Description"]
df.head()

3) Deduplication

# Number of records before deduplication
print("Number of records before deduplication:", df.shape)
# Deduplicate on company name, job title, and work location
df.drop_duplicates(subset=["Company Name", "Job Title", "Work Location"], inplace=True)
# Number of records after deduplication
print("Number of records after deduplication:", df.shape)
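drop_duplicates keeps the first occurrence by default, so a posting crawled from several result pages survives exactly once. Dropping rows leaves gaps in the integer index, so re-numbering (an extra step, not in the original) keeps the later boolean filtering tidy:

# Re-number the rows so the index is contiguous again after deduplication
df.reset_index(drop=True, inplace=True)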

4) Processing the job title field

# ① Explore the job title field
df["Job Title"].value_counts()
df["Job Title"] = df["Job Title"].apply(lambda x: x.lower())

# ② Build the list of target jobs to analyze, then filter the data with it
df.shape
target_job = ['ops', 'linux ops', 'ops development', 'devops', 'application ops',
              'system ops', 'database ops', 'ops security', 'network ops', 'desktop ops']
index = [df["Job Title"].str.count(i) for i in target_job]
index = np.array(index).sum(axis=0) > 0
job_info = df[index]
job_info.shape

# ③ Normalize each matching title to a standard label (first hit in job_list wins)
job_list = ['linux ops', 'ops development', 'devops', 'application ops', 'system ops',
            'database ops', 'ops security', 'network ops', 'desktop ops', 'it ops',
            'software ops', 'ops engineer']
job_list = np.array(job_list)

def rename(x=None, job_list=job_list):
    index = [i in x for i in job_list]
    if sum(index) > 0:
        return job_list[index][0]
    else:
        return x

job_info["Job Title"] = job_info["Job Title"].apply(rename)
job_info["Job Title"].value_counts()[:10]
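A quick sanity check of the normalization above, using hypothetical titles rather than real rows from the dataset:

# Titles containing a standard label collapse to the first match in job_list;
# anything else passes through unchanged
print(rename("senior linux ops engineer"))  # -> 'linux ops'
print(rename("devops platform engineer"))   # -> 'devops'
print(rename("frontend developer"))         # -> 'frontend developer' (unchanged)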

5) Processing the salary field

job_info["Salary"].str[-1].value_counts() job_info["Salary"].str[-3].value_counts() index1 = job_info["Salary"].str[-1].isin (["Year","Month"]) index2 = job_info["Salary"].str[-3].isin (["million","thousand"]) job_info = job_info[index1 & index2] job_info["salary"].str[-3:].value_counts() def get_money_max_min(x): try: if x[-3] == "million": z = [float(i)*10000 for i in re.findall("[0-9]+\.? [0-9]*",x)] elif x[-3] == " thousand ": z = [float(i) * 1000 for i in re.findall("[0-9]+\.? [0-9]*", x)] if x[-1] == " year ": z = [i/12 for i in z] return z except: return x salary = job_info["salary"].apply (get_money_max_min) job_info["minimum wage"] = salary.str[0] job_info["maximum wage"] = salary.str[1] job_info["wage level"] = job_info["minimum wage","maximum wage"].mean(axis=1)

6) Processing the work location field

address_list = ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hangzhou', 'Suzhou',
                'Changsha', 'Wuhan', 'Tianjin', 'Chengdu', 'Xian', 'Dongguan', 'Hefei',
                'Foshan', 'Ningbo', 'Nanjing', 'Chongqing', 'Changchun', 'Zhengzhou',
                'Changzhou', 'Fuzhou', 'Shenyang', 'Jinan', 'Xiamen', 'Guizhou', 'Zhuhai',
                'Qingdao', 'Zhongshan', 'Dalian', 'Kunshan', 'Huizhou', 'Harbin', 'Kunming',
                'Nanchang', 'Wuxi']
address_list = np.array(address_list)

def rename(x=None, address_list=address_list):
    index = [i in x for i in address_list]
    if sum(index) > 0:
        return address_list[index][0]
    else:
        return x

job_info["Work Location"] = job_info["Work Location"].apply(rename)
job_info["Work Location"].value_counts()

7) Processing the company type field

job_info.loc[job_info["Company Type"].apply(lambda x: len(x)
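The line above is cut off in the excerpt (the complete code is said to be available at the end of the original text). A plausible completion, assuming the intent is to mark overly short, malformed company-type values as missing and strip wrapper characters from the rest; the length cutoff and the slice are guesses, not the author's confirmed code:

# Hypothetical completion of the truncated line above
job_info.loc[job_info["Company Type"].apply(lambda x: len(x) < 6), "Company Type"] = np.nan
job_info["Company Type"] = job_info["Company Type"].str[2:-2]
job_info["Company Type"].value_counts()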
