2025-01-16 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article introduces how to use Python to analyze matchmaking website data. It walks through a simple, easy-to-follow workflow: viewing and preprocessing the data, then analyzing and charting it.
I. Preface
In this article, we use Python to analyze matchmaking information collected city by city across all regions, and build a portrait of the men and women on blind dates.
II. Data viewing and preprocessing
Import the libraries used
    import pandas as pd
    import re
Read the data and view the first five rows
    df = pd.read_excel('marriage.xlsx')
    df.head()
View the index, data types, and memory usage
    df.info()
You can see that there are no missing values in the data.
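The absence of missing values can also be checked directly instead of being read off the `info()` summary. The sketch below is only an illustration: it uses a small made-up DataFrame (its column names and values are assumptions, not the article's data) and counts missing cells per column with `isna().sum()`.

```python
import pandas as pd

# Toy stand-in for the real table; columns and values are illustrative assumptions.
toy = pd.DataFrame({
    "gender": ["male", "female", "female"],
    "age": [31, 28, None],  # one missing value on purpose
    "residence": ["Beijing Haidian", "Shanghai", "Guangdong Shenzhen"],
})

# Number of missing cells in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True only when no column contains a missing cell.
has_no_missing = not toy.isna().any().any()
print(has_no_missing)
```

Run against the real `df`, a result of all zeros from `df.isna().sum()` would confirm the observation above.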
In the raw data, the place of residence is recorded at the level of individual localities. To simplify the analysis, it needs to be mapped to the provincial-level administrative region. The education / monthly salary column is mixed: some entries give a monthly salary and others give an academic qualification, so it is split into two columns. For entries that give a qualification, the qualification is extracted and the monthly salary is marked "unknown"; for entries that give a salary, the figures are extracted and averaged, and the qualification is marked "unknown".
    # Get the names of the 34 provincial-level administrative regions: 23 provinces,
    # 5 autonomous regions, 4 municipalities directly under the central government,
    # and 2 special administrative regions
    with open('region.txt', 'r', encoding='utf-8') as f:
        area = f.read().split('\n')
    print(area)
    print(len(area))
The results are as follows:
    ['Beijing', 'Shanghai', 'Tianjin', 'Chongqing', 'Heilongjiang', 'Jilin', 'Liaoning', 'Inner Mongolia', 'Hebei', 'Xinjiang', 'Gansu', 'Qinghai', 'Shaanxi', 'Ningxia', 'Henan', 'Shandong', 'Shanxi', 'Anhui', 'Hubei', 'Hunan', 'Jiangsu', 'Sichuan', 'Guizhou', 'Yunnan', 'Guangxi', 'Xizang', 'Zhejiang', 'Jiangxi', 'Guangdong', 'Fujian', 'Taiwan', 'Hainan', 'Hong Kong', 'Macau']
    34
Map each place of residence to its provincial-level region, then read the list of academic qualifications
    areas_list = []
    for i in df['residence']:
        for j in area:
            if j in i:
                areas_list.append(j)
                break
        else:
            # no region name was found in this entry
            areas_list.append('unknown')
    df['residence'] = areas_list
    df.head()

    with open('academic qualifications.txt', 'r', encoding='utf-8') as fp:
        edu = fp.read().split('\n')
    print(edu)
The results are as follows:
    ['doctor', 'master', 'undergraduate', 'junior college', 'technical secondary school', 'senior high school', 'junior high school', 'primary school']
Split the education / monthly salary column into two lists
    salary_list = []
    edu_list = []
    for item in df['education / monthly salary']:
        if 'yuan' in item:  # this entry is a monthly salary
            data = re.findall(r'\d+', item)
            data = [int(x) for x in data]
            salary = int(sum(data) / len(data))  # average of the figures, truncated to an integer
            salary_list.append(salary)
            edu_list.append('unknown')
        else:  # this entry is an academic qualification
            salary_list.append('unknown')
            for e in edu:
                if e in item:
                    edu_list.append(e)
                    break
            else:
                edu_list.append('unknown')
    print(len(edu_list))
    print(len(salary_list))
    df['academic qualifications'] = edu_list
    df['monthly salary'] = salary_list
    df.head()
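The region lookup and the education / salary split both follow the same pattern: take the first known name that occurs in a cell, otherwise fall back to "unknown". As a minimal sketch of that idea, the two hypothetical helpers below (`first_match` and `split_edu_salary` are names introduced here, not part of the original code) can be tried without the Excel file. The sample cell format "8000-12000 yuan" is an assumption about how salaries appear in the data.

```python
import re


def first_match(text, candidates, default="unknown"):
    """Return the first candidate that occurs in text, else the default."""
    return next((c for c in candidates if c in text), default)


def split_edu_salary(item, edu_levels):
    """Split one mixed cell into an (education, monthly_salary) pair."""
    if "yuan" in item:
        numbers = [int(x) for x in re.findall(r"\d+", item)]
        if not numbers:
            # "yuan" without any digits: nothing to average
            return "unknown", "unknown"
        # Average of the figures in the cell, truncated to an integer.
        return "unknown", int(sum(numbers) / len(numbers))
    return first_match(item, edu_levels), "unknown"


# Shortened illustrative lists; the article reads the full ones from text files.
regions = ["Beijing", "Shanghai", "Guangdong"]
edu_levels = ["doctor", "master", "undergraduate", "junior college"]

print(first_match("Guangdong Shenzhen", regions))
print(split_edu_salary("8000-12000 yuan", edu_levels))
print(split_edu_salary("undergraduate", edu_levels))
```

With helpers like these, the new columns could be built with `df['residence'].map(...)` rather than explicit loops, though the loop version above behaves the same way.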
Once the data is processed, you can delete the education / monthly salary column and save the result to a new Excel file.
    del df['education / monthly salary']
    df
    df.to_excel('processed data.xlsx', index=False)
III. Data analysis
What is the proportion of men and women on blind dates?
    # -*- coding: UTF-8 -*-
    """
    @File: male-female ratio.py
    @Author: Ye Tingyun
    @CSDN: https://yetingyun.blog.csdn.net/
    """
    import pandas as pd
    import collections
    from pyecharts.charts import Pie
    from pyecharts import options as opts
    from pyecharts.globals import ThemeType, CurrentConfig

    # Render with local js resources
    CurrentConfig.ONLINE_HOST = 'D:/python/pyecharts-assets-master/assets/'

    # Load the processed data
    df = pd.read_excel('processed data.xlsx')
    gender = list(df['gender'])
    # Count the number of men and women
    gender_count = collections.Counter(gender).most_common()
    gender_count = [(k, v) for k, v in gender_count]

    pie = Pie(init_opts=opts.InitOpts(theme=ThemeType.MACARONS))
    # Ring chart with rich-text labels
    pie.add(
        "gender",
        data_pair=gender_count,
        radius=["40%", "55%"],
        label_opts=opts.LabelOpts(
            position="outside",
            formatter="{a|{a}}{abg|}\n{hr|}\n {b|{b}: }{c}  {per|{d}%}  ",
            background_color="#eee",
            border_color="#aaa",
            border_width=1,
            border_radius=4,
            rich={
                "a": {"color": "#999", "lineHeight": 22, "align": "center"},
                "abg": {
                    "backgroundColor": "#e3e3e3",
                    "width": "100%",
                    "align": "right",
                    "height": 22,
                    "borderRadius": [4, 4, 0, 0],
                },
                "hr": {
                    "borderColor": "#aaa",
                    "width": "100%",
                    "borderWidth": 0.5,
                    "height": 0,
                },
                "b": {"fontSize": 16, "lineHeight": 33},
                "per": {
                    "color": "#eee",
                    "backgroundColor": "#334455",
                    "padding": [2, 4],
                    "borderRadius": 2,
                },
            },
        ),
    )
    pie.set_global_opts(title_opts=opts.TitleOpts(title='Proportion of men and women on blind dates'))
    pie.set_colors(['red', 'blue'])  # set the colors
    pie.render('male-female ratio.html')
There are more women than men on blind dates: 25,910 men (45.72%) and 30,767 women (54.28%).
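The percentages follow directly from the two counts. As a quick arithmetic check, the sketch below rebuilds a gender list from the reported totals (the labels "male" and "female" are assumptions about how the column is encoded) and computes each share with `collections.Counter`, the same tool the chart script uses.

```python
import collections

# Rebuild a gender column from the counts reported in the text;
# the labels "male" and "female" are illustrative assumptions.
gender = ["male"] * 25910 + ["female"] * 30767

gender_count = collections.Counter(gender)
total = sum(gender_count.values())

# Percentage share of each label, rounded to two decimals.
shares = {k: round(100 * v / total, 2) for k, v in gender_count.items()}
print(total)
print(shares)
```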
What is the age distribution of men and women on blind dates?
    # -*- coding: UTF-8 -*-
    """
    @File: age distribution.py
    @Author: Ye Tingyun
    @CSDN: https://yetingyun.blog.csdn.net/
    """
    import pandas as pd
    import collections
    from pyecharts.charts import Bar
    from pyecharts.globals import ThemeType, CurrentConfig
    from pyecharts import options as opts

    CurrentConfig.ONLINE_HOST = 'D:/python/pyecharts-assets-master/assets/'

    df = pd.read_excel('processed data.xlsx')
    age = list(df['age'])
    age_count = collections.Counter(age).most_common()
    # Sort by age
    age_count.sort(key=lambda x: x[0])
    age = [x[0] for x in age_count]
    nums = [y[1] for y in age_count]
    # print(age_count)

    bar = Bar(init_opts=opts.InitOpts(theme=ThemeType.MACARONS))
    bar.add_xaxis(age)
    bar.add_yaxis('number', nums)
    bar.set_global_opts(title_opts=opts.TitleOpts(title='Age distribution of men and women on blind dates'))
    # With this many bars, hide the value labels; mark the maximum, minimum,
    # and average points and draw an average line
    bar.set_series_opts(
        label_opts=opts.LabelOpts(is_show=False),
        markpoint_opts=opts.MarkPointOpts(
            data=[
                opts.MarkPointItem(type_="max", name="maximum"),
                opts.MarkPointItem(type_="min", name="minimum"),
                opts.MarkPointItem(type_="average", name="average"),
            ]
        ),
        markline_opts=opts.MarkLineOpts(
            data=[opts.MarkLineItem(type_="average", name="average")]
        ),
    )
    bar.render('age distribution.html')
Age 31 has the largest number of people on blind dates, with 2,637. Every age group has a certain number of people. We can extract the records of those aged 20 or younger and those aged 70 or older.
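A filter of this kind can be written with boolean masks in pandas. The sketch below is a minimal illustration on made-up rows (the values and the variable names `young`, `old`, and `extremes` are assumptions; only the `age` column name follows the article).

```python
import pandas as pd

# Toy rows standing in for the processed data; values are illustrative assumptions.
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female", "male", "female"],
    "age": [19, 20, 31, 45, 70, 82],
})

# Rows aged 20 or younger, and rows aged 70 or older.
young = toy[toy["age"] <= 20]
old = toy[toy["age"] >= 70]

# Both ends at once: combine the two masks with | (element-wise OR).
extremes = toy[(toy["age"] <= 20) | (toy["age"] >= 70)]
print(extremes)
```

Each comparison needs its own parentheses when masks are combined, because `|` binds more tightly than `<=` and `>=` in Python.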
    import pandas as pd
    df = pd.read_excel('processed data.xlsx')
    df1 = df[df['age'] [...]

    [...] 1:
            result_list.append(word)
    print(result_list)
    # Count the words that passed the filter
    word_counts = collections.Counter(result_list)
    mask_ = 255 - np.array(Image.open('woman_mask.png'))
    # Draw the word cloud
    my_cloud = WordCloud(
        background_color='white',  # background color; the default is black
        mask=mask_,
        font_path='simhei.ttf',    # font that can display Chinese
        max_font_size=112,         # maximum font size
        min_font_size=12,          # minimum font size
        random_state=88            # random seed, so the layout and colors are reproducible
    ).generate_from_frequencies(word_counts)
    # Show the generated word cloud
    plt.figure(figsize=(8, 5), dpi=200)
    plt.imshow(my_cloud, interpolation='bilinear')
    # Hide the axes
    plt.axis('off')
    plt.savefig('woman_cloud.png', dpi=200)
    plt.show()
In the word cloud image, [...]
This concludes the study of "how to use Python to analyze data from dating sites". I hope it resolves your doubts. Combining theory with practice is the best way to learn, so go and try it!