2025-03-26 Update From: SLTechnology News&Howtos
This article introduces how to use Python for data cleaning, walking through each operation with a worked example. The methods are simple, fast, and practical, and should help you solve the problem of cleaning messy data.
Data cleaning is a necessary part of data analysis: in practice, a lot of data fails to meet the requirements of analysis because of duplicated, erroneous, missing, or abnormal values.
01 Duplicate Value Processing
Duplicate records can be produced during data entry and data integration, and direct deletion is the main way of handling them. Pandas provides the duplicated and drop_duplicates methods to detect and remove duplicate data. Take the following data as an example:
>>> import numpy as np
>>> import pandas as pd
>>> sample = pd.DataFrame({
...     'group': [1, 1, 1, 2, 1, 2],
...     'id': [1, 1, 1, 3, 4, 5],
...     'name': ['Bob', 'Bob', 'Mark', 'Miki', 'Sully', 'Rose'],
...     'score': [99, 99, 87, 77, 77, np.nan]})
>>> sample
   group  id   name  score
0      1   1    Bob   99.0
1      1   1    Bob   99.0
2      1   1   Mark   87.0
3      2   3   Miki   77.0
4      1   4  Sully   77.0
5      2   5   Rose    NaN
Duplicate rows are detected with the duplicated method, which returns a boolean mask that can be used to view them:
>>> sample[sample.duplicated()]
   group  id   name  score
1      1   1    Bob   99.0
To remove the duplicates, use the drop_duplicates method:
>>> sample.drop_duplicates()
   group  id   name  score
0      1   1    Bob   99.0
2      1   1   Mark   87.0
3      2   3   Miki   77.0
4      1   4  Sully   77.0
5      2   5   Rose    NaN
The drop_duplicates method can also deduplicate by a given column, for example keeping only the first record for each value of the id column:
>>> sample.drop_duplicates('id')
   group  id   name  score
0      1   1    Bob   99.0
3      2   3   Miki   77.0
4      1   4  Sully   77.0
5      2   5   Rose    NaN
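Beyond a single column, drop_duplicates also accepts the keyword parameters subset (a list of columns to compare) and keep ('first', 'last', or False). A minimal sketch, using the same sample data as above:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'group': [1, 1, 1, 2, 1, 2],
    'id': [1, 1, 1, 3, 4, 5],
    'name': ['Bob', 'Bob', 'Mark', 'Miki', 'Sully', 'Rose'],
    'score': [99, 99, 87, 77, 77, np.nan]})

# Deduplicate on the combination of id and name,
# keeping the last occurrence of each duplicated pair
deduped = sample.drop_duplicates(subset=['id', 'name'], keep='last')
```

With keep='last', the second Bob row (index 1) survives instead of the first, so the result keeps indices 1 through 5.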
02 Missing Value Processing
Missing values are a common problem in data cleaning. They are generally represented as NA, and certain principles should be followed when handling them.
First, handle missing values in light of the business context: find out whether the values are missing deliberately or at random, and fill them accordingly based on business experience. Generally speaking, when less than 20% of a variable is missing, a continuous variable can be filled with its mean or median; a categorical variable can be filled with its mode, or left unfilled by treating "missing" as a category of its own.
When between 20% and 80% is missing, the filling methods are the same as above, and in addition each variable with missing values can generate an indicator dummy variable to participate in subsequent modeling. When more than 80% is missing, each such variable should only contribute an indicator dummy variable to subsequent modeling, and the original variable should not be used.
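The rules above can be sketched as a small helper function. This is a minimal illustration of the strategy, not a library API; the function name and thresholds simply follow the text:

```python
import numpy as np
import pandas as pd

def handle_missing(df, col):
    """Apply the missing-rate rules from the text to one numeric column.

    <20% missing:   fill with the median.
    20%-80%:        fill with the median and add a 0/1 indicator column.
    >80%:           keep only the indicator and drop the original column.
    """
    rate = df[col].isnull().mean()
    if rate > 0.2:
        # Indicator dummy variable: 1 where the value was missing
        df[col + '_missing'] = df[col].isnull().astype(int)
    if rate > 0.8:
        df = df.drop(columns=col)
    else:
        df[col] = df[col].fillna(df[col].median())
    return df

df = pd.DataFrame({'score': [99, np.nan, 87, 77, 77, np.nan]})
df = handle_missing(df, 'score')
```

Here 2 of 6 values are missing (about 33%), so the column is filled with its median (82.0) and a score_missing indicator is added.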
The process of filling missing values with the median and generating missing-value indicator variables is demonstrated step by step below. Pandas provides the fillna method to replace missing values, which is used much like the replace method introduced earlier. For example, take the following data:
>>> sample
   group   id   name  score
0    1.0  1.0    Bob   99.0
1    1.0  1.0    Bob    NaN
2    NaN  1.0   Mark   87.0
3    2.0  3.0   Miki   77.0
4    1.0  4.0  Sully   77.0
5    NaN  NaN    NaN    NaN
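For readers following along, one way the table above could be constructed (the NaN positions are taken from the printed output; np.nan is assumed for the missing cells):

```python
import numpy as np
import pandas as pd

# Rebuild the sample table shown above; numeric columns become
# float because NaN forces a floating-point dtype
sample = pd.DataFrame({
    'group': [1, 1, np.nan, 2, 1, np.nan],
    'id': [1, 1, 1, 3, 4, np.nan],
    'name': ['Bob', 'Bob', 'Mark', 'Miki', 'Sully', np.nan],
    'score': [99, np.nan, 87, 77, 77, np.nan]})
```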
The missing values are viewed and filled step by step as follows:
1. Check for missing values
Before data analysis, you generally need to know how much data is missing. In Python, you can construct a lambda function to compute the missing ratio of each column: sum(col.isnull()) counts how many rows are missing in the current column, and col.size is the column's total number of rows:
>>> sample.apply(lambda col: sum(col.isnull()) / col.size)
group    0.333333
id       0.166667
name     0.166667
score    0.333333
dtype: float64
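The same per-column missing ratio can be computed without a lambda: since isnull() yields booleans, taking their mean gives the fraction of True values directly. A short sketch using the same sample data:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'group': [1, 1, np.nan, 2, 1, np.nan],
    'id': [1, 1, 1, 3, 4, np.nan],
    'name': ['Bob', 'Bob', 'Mark', 'Miki', 'Sully', np.nan],
    'score': [99, np.nan, 87, 77, 77, np.nan]})

# mean() over the boolean mask equals (number missing) / (total rows)
missing_ratio = sample.isnull().mean()
```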
2. Fill with a specified value
The pandas DataFrame provides the fillna method to fill missing values, for example filling the score column of the sample table with its mean:
>>> sample.score.fillna(sample.score.mean())
0    99.0
1    85.0
2    87.0
3    77.0
4    77.0
5    85.0
Name: score, dtype: float64
Of course, other statistics such as the median can be used instead:
>>> sample.score.fillna(sample.score.median())
0    99.0
1    82.0
2    87.0
3    77.0
4    77.0
5    82.0
Name: score, dtype: float64
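fillna also accepts a dict mapping column names to fill values, so different columns can be filled with different statistics in one call. A sketch, assuming the mode is used for the categorical-style group column and the median for score:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'group': [1, 1, np.nan, 2, 1, np.nan],
    'score': [99, np.nan, 87, 77, 77, np.nan]})

# Per-column fill values: mode for group, median for score.
# mode() returns a Series, so take its first element.
filled = sample.fillna({'group': sample.group.mode()[0],
                        'score': sample.score.median()})
```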
3. Missing value indicator variable
A pandas DataFrame or Series can call the isnull method directly to generate a missing-value indicator variable, for example for the score column:
>>> sample.score.isnull()
0    False
1     True
2    False
3    False
4    False
5     True
Name: score, dtype: bool
If you want to convert it to a numeric 0/1 indicator variable, you can use the apply method, where int casts each boolean element to an integer:
>>> sample.score.isnull().apply(int)
0    0
1    1
2    0
3    0
4    0
5    1
Name: score, dtype: int64
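An equivalent and arguably more idiomatic alternative to apply(int) is astype(int), which casts the whole boolean Series at once. A quick sketch:

```python
import numpy as np
import pandas as pd

score = pd.Series([99, np.nan, 87, 77, 77, np.nan], name='score')

# Vectorized cast of the boolean mask to a 0/1 integer indicator
indicator = score.isnull().astype(int)
```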
This concludes the introduction to how to clean data with Python. Thank you for reading.