2025-01-17 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article shows how to carry out distribution analysis in Python. It is intended as a reference for interested readers; the walkthrough follows.
Preface
Distribution analysis groups data according to the purpose of the analysis and studies how the values are distributed across the groups. Data can be grouped in two ways: with equidistant (equal-width) intervals or with non-equidistant intervals.
Distribution analysis is widely used in practice, for example to study the distribution of users by gender, age, or spending.
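To make the two grouping styles concrete, here is a minimal sketch using pandas' pd.cut. The sample ages and the labels are made up for illustration and do not come from the article's data set.

```python
import pandas as pd

# Hypothetical sample ages, made up for illustration.
ages = pd.Series([18, 22, 27, 33, 38, 45])

# Equidistant grouping: ask pd.cut for 3 bins of equal width.
equal_width = pd.cut(ages, bins=3)

# Non-equidistant grouping: supply the bin edges explicitly.
custom = pd.cut(ages, bins=[0, 25, 40, 100], labels=["<=25", "26-40", "41+"])

print(equal_width.value_counts().sort_index())
print(custom.value_counts().sort_index())
```

With equal-width bins pandas derives the edges from the data range, while explicit edges let the analyst choose boundaries that match the business question.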
Distribution analysis
1. Import the required libraries
import pandas as pd
import matplotlib.pyplot as plt
import math
2. Data processing
>>> df = pd.read_csv('UserInfo.csv')
>>> df.info()
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
UserId        1000000 non-null int64
CardId        1000000 non-null int64
LoginTime     1000000 non-null object
DeviceType    1000000 non-null object
dtypes: int64(2), object(2)
memory usage: 30.5+ MB
The next step is an age distribution analysis, but the info() output shows that the source data has no age field, so we need to generate one ourselves.
3. Calculate age
Because the data was collected offline and its validity has not been verified, the dates of birth are checked before the age is calculated.
# extract the month and day from the date of birth
>>> df[['month', 'day']] = df['DateofBirth'].str.split('-', expand=True).loc[:, 1:2]
# select the short months and check whether any of them has a 31st
# (month list assumed: February plus the 30-day months)
>>> df_small_month = df[df['month'].isin(['02', '04', '06', '09', '11'])]
# invalid data
>>> df_small_month[df_small_month['day'] == '31']
# delete them all; every one is invalid
>>> df.drop(df_small_month[df_small_month['day'] == '31'].index, inplace=True)
# in the same way, check February
>>> df_2 = df[df['month'] == '02']
# the February check could be more careful: first determine whether the year is a leap year, then delete
>>> df_2[df_2['day'].isin(['29'])]
# delete them all
>>> df.drop(df_2[df_2['day'].isin(['29'])].index, inplace=True)
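As an alternative to checking months and days by hand, the validation can be sketched with pd.to_datetime and errors="coerce", which also handles leap years. This is a minimal sketch, not the article's method. The DateofBirth column name follows the article; the sample rows and the assumed YYYY-MM-DD format are illustrative.

```python
import pandas as pd

# Hypothetical sample; the DateofBirth column name follows the article,
# the values are made up for illustration.
df = pd.DataFrame({
    "UserId": [1, 2, 3, 4],
    "DateofBirth": ["1990-04-31", "1992-02-29", "1991-02-29", "1988-11-30"],
})

# errors="coerce" turns impossible calendar dates into NaT instead of raising.
parsed = pd.to_datetime(df["DateofBirth"], format="%Y-%m-%d", errors="coerce")

# Keep only rows whose date of birth is a real calendar date.
valid = df[parsed.notna()]
print(valid)
```

Here 1992-02-29 is kept because 1992 is a leap year, while 1991-02-29 and 1990-04-31 are dropped.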
# calculate the age
# (pd.datetime is not available in pandas 2.0 and later; pd.Timestamp.now() works across versions)
# method 1
>>> df['Age'] = df['DateofBirth'].apply(lambda x: math.floor((pd.Timestamp.now() - pd.to_datetime(x)).days / 365))
# method 2
>>> df['DateofBirth'].apply(lambda x: pd.Timestamp.now().year - pd.to_datetime(x).year)
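Both methods above are approximations: dividing by 365 ignores leap days, and subtracting years ignores whether the birthday has already passed. A minimal sketch of an exact whole-year age follows. The helper name age_on and the sample dates are hypothetical, and a fixed reference date is used so the result is reproducible.

```python
import pandas as pd

def age_on(birth: pd.Series, today: pd.Timestamp) -> pd.Series:
    """Age in whole years: year difference, minus one if the birthday
    has not yet occurred in the reference year."""
    had_birthday = (birth.dt.month < today.month) | (
        (birth.dt.month == today.month) & (birth.dt.day <= today.day)
    )
    return today.year - birth.dt.year - (~had_birthday).astype(int)

# Hypothetical dates of birth, made up for illustration.
birth = pd.to_datetime(pd.Series(["1990-06-15", "1990-06-16", "2000-01-01"]))
today = pd.Timestamp("2020-06-15")  # fixed reference date, not "now"
print(age_on(birth, today).tolist())  # [30, 29, 20]
```

The function works on the whole column at once, which avoids a Python-level apply over a million rows.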
4. Age distribution
# check the age range to decide on the bins
>>> df['Age'].max(), df['Age'].min()
# (45, 18)
>>> bins = [0, 18, 25, 30, 35, 40, 100]
>>> labels = ['18 and under', '19 to 25', '26 to 30', '31 to 35', '36 to 40', '41 and over']
>>> df['Age stratification'] = pd.cut(df['Age'], bins, labels=labels)
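The behaviour at the bin boundaries can be checked on a few edge values. This sketch assumes the edges 0, 18, 25, 30, 35, 40, 100 implied by the six labels; the test ages are made up for illustration.

```python
import pandas as pd

bins = [0, 18, 25, 30, 35, 40, 100]
labels = ["18 and under", "19 to 25", "26 to 30", "31 to 35", "36 to 40", "41 and over"]

# pd.cut needs exactly one label per interval, i.e. len(bins) - 1 labels.
assert len(labels) == len(bins) - 1

# Intervals are right-closed by default: (0, 18], (18, 25], ...
edge_ages = pd.Series([18, 19, 25, 26, 40, 41])
strata = pd.cut(edge_ages, bins, labels=labels).tolist()
print(strata)
```

Because each interval includes its right edge, an age of 18 falls in "18 and under" and an age of 25 in "19 to 25", which matches the wording of the labels.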
Because this data records user logins, it is bound to contain duplicate rows for the same user. pandas provides the nunique() method, which counts distinct values and therefore ignores the duplicates.
# check for duplicate values
>>> df.duplicated('UserId').sum()
# 47681
# total number of rows
>>> df.count()
# 980954
After grouping, count() can also be used to compute the distribution, but only when there are no duplicate rows. When values repeat, use nunique(), which counts each distinct value once.
>>> df.groupby('Age stratification')['UserId'].count()
Age stratification
18 and under     25262
19 to 25        254502
26 to 30        181751
31 to 35        181417
36 to 40           ...
41 and over     156433
Name: UserId, dtype: int64
# summing shows that duplicate rows are counted as well
>>> df.groupby('Age stratification')['UserId'].count().sum()
# 980954
>>> df.groupby('Age stratification')['UserId'].nunique()
Age stratification
18 and under     24014
19 to 25        242199
26 to 30        172832
31 to 35        172608
36 to 40        172804
41 and over     148816
Name: UserId, dtype: int64
>>> df.groupby('Age stratification')['UserId'].nunique().sum()
# 933273 = 980954 (total) - 47681 (duplicates)
# calculate the age distribution
>>> result = df.groupby('Age stratification')['UserId'].nunique() / df.groupby('Age stratification')['UserId'].nunique().sum()
>>> result
Age stratification
18 and under    0.025731
19 to 25        0.259516
26 to 30        0.185189
31 to 35        0.184949
36 to 40        0.185159
41 and over     0.159456
Name: UserId, dtype: float64
# format as percentages
>>> result = round(result, 4) * 100
>>> result.map("{:.2f}%".format)
Age stratification
18 and under     2.57%
19 to 25        25.95%
26 to 30        18.52%
31 to 35        18.49%
36 to 40        18.52%
41 and over     15.95%
Name: UserId, dtype: object
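The difference between count() and nunique() is easiest to see on a tiny table. This is a minimal sketch: the column names follow the article, and the five rows are made up for illustration.

```python
import pandas as pd

# Hypothetical login log: user 1 appears twice. Column names follow the
# article; the rows are made up for illustration.
log = pd.DataFrame({
    "UserId": [1, 1, 2, 3, 4],
    "Age stratification": ["19 to 25", "19 to 25", "19 to 25", "26 to 30", "26 to 30"],
})

grouped = log.groupby("Age stratification")["UserId"]
rows = grouped.count()      # counts every login row, duplicates included
users = grouped.nunique()   # counts each UserId once per group

# Share of distinct users per group.
share = users / users.sum()
print(rows.to_dict(), users.to_dict(), share.to_dict())
```

count() reports 3 rows for "19 to 25" while nunique() reports 2 users, so only the nunique()-based shares describe the user population rather than the login volume.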