2025-01-17 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/03 Report--
This article explains in detail how to use pandas for data cleaning. The content is shared for your reference; I hope you will come away with a solid understanding of the topic after reading.
We have the following data and will use it for a simple cleaning exercise.
This is membership data from a clothing store. The top row holds the column indices and the leftmost column the row indices. Column 0 is the serial number, column 1 the member's name, column 2 the age, column 3 the weight, columns 4-6 the bust-waist-hip (BWH) measurements of male members, and columns 7-9 the BWH measurements of female members.
The data cleaning rules can be summarized under four key headings; let's go through them one by one:
Integrity: whether individual records contain null values and whether all expected fields are present.
Comprehensiveness: look at all the values of a column as a whole. In Excel, for example, selecting a column shows its average, maximum, and minimum; common sense then tells us whether something is wrong with the column's definition, its units, or the values themselves.
Legitimacy: whether the type, content, and range of the data are valid. Examples of violations: non-ASCII characters in a name field, an unknown gender, or an age over 150.
Uniqueness: whether records are duplicated. Because data is usually merged from different sources, repetition is common. Both rows and columns need to be unique: a person should not be recorded twice, and the same metric (such as weight) should not appear in more than one column.
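The four checks above can be rehearsed on a small hypothetical table (the names and values below are invented for illustration, not taken from the article's dataset):

```python
import numpy as np
import pandas as pd

# hypothetical membership-style table with deliberate problems
df = pd.DataFrame({
    'name':   ['Mickey Mouse', 'Donald Duck', np.nan, 'Mini Mouse', 'Donald Duck'],
    'age':    [56, 34, np.nan, 16, 34],
    'weight': ['70kgs', '154.89lbs', np.nan, '78kgs', '154.89lbs'],
})

# Integrity: how many values are missing in each column?
missing = df.isnull().sum()

# Legitimacy: are any ages outside a plausible range?
implausible_ages = (df['age'] > 150).sum()

# Uniqueness: how many rows are exact duplicates of an earlier row?
n_duplicates = df.duplicated().sum()
```

Here `missing` reports one gap per column, and `n_duplicates` flags the repeated Donald Duck row.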
1. Integrity
1.1 Missing values
In general, because datasets are large, some values inevitably go uncollected during data gathering, leaving missing data. There are three common ways to deal with this:
Delete: drop the records that contain missing values.
Mean: fill with the mean of the column.
Highest frequency: fill with the most frequent value in the column.
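The deletion strategy is not shown in code later on; a minimal sketch, assuming the column is named Age as in the examples below:

```python
import numpy as np
import pandas as pd

# toy frame with one missing age (values invented)
df = pd.DataFrame({'name': ['A', 'B', 'C'], 'Age': [22.0, np.nan, 35.0]})

# drop every record whose Age is missing, then rebuild a clean index
df = df.dropna(subset=['Age']).reset_index(drop=True)
```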
For example, to fill the missing values in df['Age'] with the average age, we can write:
df['Age'].fillna(df['Age'].mean(), inplace=True)
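As a self-contained check on a toy column (values invented): note that assigning the result back, rather than relying on inplace=True, is the spelling recent pandas versions prefer for this kind of column-level fill:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [20.0, np.nan, 40.0]})

# NaN is skipped when computing the mean, so the fill value is (20 + 40) / 2 = 30
df['Age'] = df['Age'].fillna(df['Age'].mean())
```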
To fill with the most frequent value instead, first obtain the most frequent age age_maxf of the Age field via value_counts(), then fill the missing entries of the Age field with it:
age_maxf = train_features['Age'].value_counts().index[0]
train_features['Age'].fillna(age_maxf, inplace=True)
1.2 Blank rows
We find a blank row in the data in which every value except the index is NaN. Because the row still carries an index value, read_csv() loads it as a row of NaNs rather than skipping it, so we remove it with dropna() after the data has been read in.
# delete the fully empty rows
df.dropna(how='all', inplace=True)
2. Comprehensiveness: non-uniform column units
Sometimes the units within a column are not uniform: in the weight column, for example, some values are in kilograms (kgs) and some in pounds (lbs).
Here we use kilograms as a unified unit of measurement to convert pounds into kilograms:
# get the rows whose weight is recorded in lbs
rows_with_lbs = df['weight'].str.contains('lbs').fillna(False)
print(df[rows_with_lbs])
# convert lbs to kgs, 2.2 lbs = 1 kg
for i, lbs_row in df[rows_with_lbs].iterrows():
    # strip the trailing 'lbs' by keeping everything up to the last three characters
    weight = int(float(lbs_row['weight'][:-3]) / 2.2)
    df.at[i, 'weight'] = '{}kgs'.format(weight)
3. Legitimacy: non-ASCII characters
Suppose some non-ASCII characters appear in the Firstname and Lastname fields of the dataset. We can solve this by either deleting or replacing them; here we delete them, using the replace method with an empty replacement string:
# delete non-ASCII characters
df['first_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True)
df['last_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True)
4. Uniqueness
4.1 Multiple values in one column
Suppose the Name column actually holds two values, the first name and the last name. To keep the data clean, we split the Name column into Firstname and Lastname fields. We use str.split(expand=True), which expands the split results into new columns, and then drop the original Name column.
# split the name and delete the source column
df[['first_name', 'last_name']] = df['name'].str.split(expand=True)
df.drop('name', axis=1, inplace=True)
4.2 Duplicate data
Let's check whether there are duplicate records in the data. If there are duplicate records, use drop_duplicates () provided by Pandas to remove the duplicate data.
# delete duplicate rows
df.drop_duplicates(['first_name', 'last_name'], inplace=True)
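It can be useful to count the duplicates with duplicated() before dropping them; a short sketch with invented names:

```python
import pandas as pd

df = pd.DataFrame({
    'first_name': ['Donald', 'Mickey', 'Donald'],
    'last_name':  ['Duck', 'Mouse', 'Duck'],
})

# True for each row whose (first_name, last_name) pair appeared earlier
dupes = df.duplicated(['first_name', 'last_name'])

# remove the duplicates, keeping the first occurrence of each pair
df = df.drop_duplicates(['first_name', 'last_name']).reset_index(drop=True)
```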
With that, the membership data from the case above has been cleaned up; compare the result with the original table to see the effect.
That concludes this walkthrough of data cleaning with pandas. I hope the content above has been helpful and that you have learned something from it; if you found the article useful, feel free to share it with others.