2025-03-26 Update From: SLTechnology News&Howtos
This article introduces how to use Python for data cleaning, walking through each operation with a worked example. The methods are simple, fast, and practical, and should help you solve the problem of cleaning messy data.
Data cleaning is a necessary part of data analysis: in practice, a lot of data fails to meet the requirements of analysis because of duplicated, erroneous, missing, or abnormal values.
01 Duplicate Value Processing
Duplicate records can be produced during data entry and data integration, and direct deletion is the main way of handling them. Pandas provides the duplicated and drop_duplicates methods to detect and remove duplicate data. Take the following data as an example:
>>> import numpy as np
>>> import pandas as pd
>>> sample = pd.DataFrame({
...     'group': [1, 1, 1, 2, 1, 2],
...     'id': [1, 1, 1, 3, 4, 5],
...     'name': ['Bob', 'Bob', 'Mark', 'Miki', 'Sully', 'Rose'],
...     'score': [99, 99, 87, 77, 77, np.nan]})
>>> sample
   group  id   name  score
0      1   1    Bob   99.0
1      1   1    Bob   99.0
2      1   1   Mark   87.0
3      2   3   Miki   77.0
4      1   4  Sully   77.0
5      2   5   Rose    NaN
Duplicate rows are detected with the duplicated method, which returns a boolean mask that can be used to view them:
>>> sample[sample.duplicated()]
   group  id   name  score
1      1   1    Bob   99.0
To remove the duplicates, use the drop_duplicates method:
>>> sample.drop_duplicates()
   group  id   name  score
0      1   1    Bob   99.0
2      1   1   Mark   87.0
3      2   3   Miki   77.0
4      1   4  Sully   77.0
5      2   5   Rose    NaN
The drop_duplicates method can also deduplicate by a given column, for example keeping only the first record for each value of the id column:
>>> sample.drop_duplicates('id')
   group  id   name  score
0      1   1    Bob   99.0
3      2   3   Miki   77.0
4      1   4  Sully   77.0
5      2   5   Rose    NaN
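Beyond a single column, drop_duplicates also accepts the keyword parameters subset (a list of columns to compare) and keep ('first', 'last', or False). A minimal sketch, using the same sample data as above:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'group': [1, 1, 1, 2, 1, 2],
    'id': [1, 1, 1, 3, 4, 5],
    'name': ['Bob', 'Bob', 'Mark', 'Miki', 'Sully', 'Rose'],
    'score': [99, 99, 87, 77, 77, np.nan]})

# Deduplicate on the combination of id and name,
# keeping the last occurrence of each duplicated pair
deduped = sample.drop_duplicates(subset=['id', 'name'], keep='last')
```

With keep='last', the second Bob row (index 1) survives instead of the first, so the result keeps indices 1 through 5.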
02 Missing Value Processing
Missing values are a common problem in data cleaning. They are generally represented as NA, and certain principles should be followed when handling them.
First, handle missing values in light of the business context: find out whether the values are missing deliberately or at random, and fill them accordingly based on business experience. Generally speaking, when less than 20% of a variable is missing, a continuous variable can be filled with its mean or median; a categorical variable can be filled with its mode, or left unfilled by treating "missing" as a category of its own.
When between 20% and 80% is missing, the filling methods are the same as above, and in addition each variable with missing values can generate an indicator dummy variable to participate in subsequent modeling. When more than 80% is missing, each such variable should only contribute an indicator dummy variable to subsequent modeling, and the original variable should not be used.
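The rules above can be sketched as a small helper function. This is a minimal illustration of the strategy, not a library API; the function name and thresholds simply follow the text:

```python
import numpy as np
import pandas as pd

def handle_missing(df, col):
    """Apply the missing-rate rules from the text to one numeric column.

    <20% missing:   fill with the median.
    20%-80%:        fill with the median and add a 0/1 indicator column.
    >80%:           keep only the indicator and drop the original column.
    """
    rate = df[col].isnull().mean()
    if rate > 0.2:
        # Indicator dummy variable: 1 where the value was missing
        df[col + '_missing'] = df[col].isnull().astype(int)
    if rate > 0.8:
        df = df.drop(columns=col)
    else:
        df[col] = df[col].fillna(df[col].median())
    return df

df = pd.DataFrame({'score': [99, np.nan, 87, 77, 77, np.nan]})
df = handle_missing(df, 'score')
```

Here 2 of 6 values are missing (about 33%), so the column is filled with its median (82.0) and a score_missing indicator is added.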
The process of filling missing values with the median and generating missing-value indicator variables is demonstrated step by step below. Pandas provides the fillna method to replace missing values, which is used much like the replace method introduced earlier. For example, take the following data:
>>> sample
   group   id   name  score
0    1.0  1.0    Bob   99.0
1    1.0  1.0    Bob    NaN
2    NaN  1.0   Mark   87.0
3    2.0  3.0   Miki   77.0
4    1.0  4.0  Sully   77.0
5    NaN  NaN    NaN    NaN
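For readers following along, one way the table above could be constructed (the NaN positions are taken from the printed output; np.nan is assumed for the missing cells):

```python
import numpy as np
import pandas as pd

# Rebuild the sample table shown above; numeric columns become
# float because NaN forces a floating-point dtype
sample = pd.DataFrame({
    'group': [1, 1, np.nan, 2, 1, np.nan],
    'id': [1, 1, 1, 3, 4, np.nan],
    'name': ['Bob', 'Bob', 'Mark', 'Miki', 'Sully', np.nan],
    'score': [99, np.nan, 87, 77, 77, np.nan]})
```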
The missing values are viewed and filled step by step as follows:
1. Check for missing values
Before data analysis, you generally need to know how much data is missing. In Python, you can construct a lambda function to compute the missing ratio of each column: sum(col.isnull()) counts how many rows are missing in the current column, and col.size is the column's total number of rows:
>>> sample.apply(lambda col: sum(col.isnull()) / col.size)
group    0.333333
id       0.166667
name     0.166667
score    0.333333
dtype: float64
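The same per-column missing ratio can be computed without a lambda: since isnull() yields booleans, taking their mean gives the fraction of True values directly. A short sketch using the same sample data:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'group': [1, 1, np.nan, 2, 1, np.nan],
    'id': [1, 1, 1, 3, 4, np.nan],
    'name': ['Bob', 'Bob', 'Mark', 'Miki', 'Sully', np.nan],
    'score': [99, np.nan, 87, 77, 77, np.nan]})

# mean() over the boolean mask equals (number missing) / (total rows)
missing_ratio = sample.isnull().mean()
```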
2. Fill with a specified value
The pandas DataFrame provides the fillna method to fill missing values, for example filling the score column of the sample table with its mean:
>>> sample.score.fillna(sample.score.mean())
0    99.0
1    85.0
2    87.0
3    77.0
4    77.0
5    85.0
Name: score, dtype: float64
Of course, other statistics such as the median can be used instead:
>>> sample.score.fillna(sample.score.median())
0    99.0
1    82.0
2    87.0
3    77.0
4    77.0
5    82.0
Name: score, dtype: float64
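fillna also accepts a dict mapping column names to fill values, so different columns can be filled with different statistics in one call. A sketch, assuming the mode is used for the categorical-style group column and the median for score:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'group': [1, 1, np.nan, 2, 1, np.nan],
    'score': [99, np.nan, 87, 77, 77, np.nan]})

# Per-column fill values: mode for group, median for score.
# mode() returns a Series, so take its first element.
filled = sample.fillna({'group': sample.group.mode()[0],
                        'score': sample.score.median()})
```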
3. Missing value indicator variable
A pandas DataFrame or Series can call the isnull method directly to generate a missing-value indicator variable, for example for the score column:
>>> sample.score.isnull()
0    False
1     True
2    False
3    False
4    False
5     True
Name: score, dtype: bool
If you want to convert it to a numeric 0/1 indicator variable, you can use the apply method, where int casts each boolean element to an integer:
>>> sample.score.isnull().apply(int)
0    0
1    1
2    0
3    0
4    0
5    1
Name: score, dtype: int64
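An equivalent and arguably more idiomatic alternative to apply(int) is astype(int), which casts the whole boolean Series at once. A quick sketch:

```python
import numpy as np
import pandas as pd

score = pd.Series([99, np.nan, 87, 77, 77, np.nan], name='score')

# Vectorized cast of the boolean mask to a 0/1 integer indicator
indicator = score.isnull().astype(int)
```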
This concludes the introduction to how to clean data with Python. Thank you for reading.