How to realize data cleaning with Python

2025-07-02 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

In this article, the editor shares how to clean data with Python. I hope you learn something from it, so let's go through it together!

Data cleaning kit

In the following code snippets, the data cleaning code is wrapped in functions so that the purpose of each piece is easy to see. You can also use the code directly, without putting it into functions, after a few small changes to the parameters.

1. Delete multi-column data

    def drop_multiple_col(col_names_list, df):
        '''
        AIM    -> Drop multiple columns based on their column names
        INPUT  -> List of column names, df
        OUTPUT -> updated df with dropped columns
        '''
        df.drop(col_names_list, axis=1, inplace=True)
        return df

Sometimes, not every column is useful for our data analysis work. In that case, "df.drop" makes it easy to delete the columns you select.
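
As a rough illustration, here is a minimal sketch of dropping columns on made-up data. The DataFrame and its column names are assumptions for the example only. It uses the non-inplace form, so the original frame stays untouched, and shows the errors="ignore" option for names that may not exist.

```python
import pandas as pd

# Hypothetical sample data; the column names are invented for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "score": [0.5, 0.7, 0.9],
    "notes": ["a", "b", "c"],
    "tmp": [None, None, None],
})

# columns=... is equivalent to passing the labels with axis=1.
# Without inplace=True a new DataFrame is returned and df is unchanged.
trimmed = df.drop(columns=["notes", "tmp"])

# errors="ignore" skips labels that are not present instead of raising KeyError.
trimmed_safe = df.drop(columns=["notes", "missing_col"], errors="ignore")

print(list(trimmed.columns))
print(list(trimmed_safe.columns))
```

Returning a new frame instead of modifying in place makes it easier to compare the data before and after the drop.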

2. Convert dtypes

    def change_dtypes(col_int, col_float, df):
        '''
        AIM    -> Changing dtypes to save memory
        INPUT  -> List of column names (int, float), df
        OUTPUT -> updated df with smaller memory
        '''
        df[col_int] = df[col_int].astype('int32')
        df[col_float] = df[col_float].astype('float32')
        return df

When we are faced with larger data sets, we need to convert "dtypes" to save memory. If you are interested in learning how to use "Pandas" to deal with big data, I highly recommend that you read the article "Why and How to Use Pandas with Large Data" (https://towardsdatascience.com/why-and-how-to-use-pandas-with-large-data-9594dda2ea4c).
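
To get a feel for the saving, the sketch below measures memory before and after a conversion on made-up data. The column names and sizes are assumptions for illustration. Note that narrower types come with a cost: int32 overflows for values beyond roughly plus or minus 2.1 billion, and float32 keeps fewer significant digits.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one int64 column and one float64 column, 1000 rows each.
df = pd.DataFrame({
    "count": np.arange(1000, dtype="int64"),
    "price": np.linspace(0.0, 1.0, 1000),
})

# memory_usage returns bytes per column (plus the index); sum gives the total.
before = df.memory_usage(deep=True).sum()

df["count"] = df["count"].astype("int32")
df["price"] = df["price"].astype("float32")

after = df.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```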

3. Convert categorical variables to numeric variables

    def convert_cat2num(df):
        # Convert categorical variable to numerical variable
        num_encode = {'col_1': {'YES': 1, 'NO': 0},
                      'col_2': {'WON': 1, 'LOSE': 0, 'DRAW': 0}}
        df.replace(num_encode, inplace=True)

Some machine learning models require their input variables to be numeric. In that case, we need to convert the categorical variables into numeric variables before using them as inputs to the model. For data visualization tasks, I recommend keeping the categorical variables, so that the results are clearer and easier to interpret.
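
The sketch below shows one possible way to apply such an encoding on made-up data. It swaps in Series.map per column instead of DataFrame.replace, which is an alternative technique rather than the article's own. The sample values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data using the same column names as the mapping.
df = pd.DataFrame({
    "col_1": ["YES", "NO", "YES"],
    "col_2": ["WON", "LOSE", "DRAW"],
})

num_encode = {
    "col_1": {"YES": 1, "NO": 0},
    "col_2": {"WON": 1, "LOSE": 0, "DRAW": 0},
}

# Work on a copy so the original categorical columns stay available for plotting.
encoded = df.copy()
for col, mapping in num_encode.items():
    # Series.map looks each value up in the dict; values missing from
    # the dict become NaN, which makes unexpected categories easy to spot.
    encoded[col] = encoded[col].map(mapping)

print(encoded)
```

Keeping both df and encoded around fits the advice above: the numeric copy feeds the model, while the original labels remain for visualization.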

4. Check the missing data

    def check_missing_data(df):
        # check for any missing data in the df (display in descending order)
        return df.isnull().sum().sort_values(ascending=False)

If you want to check how much missing data is in each column, this is probably the fastest way. This approach will give you a clearer idea of which columns have more missing data and help you decide what to do next in data cleaning and data analysis.
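
As an illustration, the following sketch runs the same check on a small made-up DataFrame and also computes the share of missing values per column, which can be easier to compare across columns. The data and column names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", None, "z", "w"],
    "c": [1, 2, 3, 4],
})

# Number of missing values per column, largest first.
missing_count = df.isnull().sum().sort_values(ascending=False)

# The mean of a boolean mask is the fraction of True values,
# so multiplying by 100 gives the percentage of missing entries.
missing_pct = (df.isnull().mean() * 100).sort_values(ascending=False)

print(missing_count)
print(missing_pct)
```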

5. Remove strings from a column

    def remove_col_str(df):
        # remove a portion of string in a dataframe column - col_1 (here: newline characters)
        df['col_1'] = df['col_1'].replace('\n', '', regex=True)
        # remove all the characters after &# (including &#) for column - col_1
        df['col_1'] = df['col_1'].replace('&#.*', '', regex=True)

Sometimes you may see newline characters or some strange symbols in a string column. You can easily deal with this using df['col_1'].replace, where "col_1" is a column in the data frame df.
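
Here is a minimal sketch of this kind of cleanup on made-up strings. The sample values are assumptions for illustration. With regex=True, Series.replace substitutes inside each string rather than matching whole cell values.

```python
import pandas as pd

# Hypothetical messy strings: an embedded newline and an HTML entity remnant.
df = pd.DataFrame({"col_1": ["apple\npie", "tea&#39;s ready", "plain"]})

# Remove newline characters anywhere in the string.
cleaned = df["col_1"].replace("\n", "", regex=True)

# Remove "&#" and everything that follows it.
cleaned = cleaned.replace("&#.*", "", regex=True)

print(cleaned.tolist())
```

Assigning the result to a new variable (or back to the column) avoids relying on inplace=True, which may not update the parent DataFrame in recent pandas versions.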

6. Remove leading whitespace from a column

    def remove_col_white_space(df, col):
        # remove white space at the beginning of string
        df[col] = df[col].str.lstrip()

When the data is very chaotic, a lot of unexpected things will happen. It is common to have some spaces at the beginning of a string. Therefore, this method is useful when you want to remove spaces at the beginning of a string in a column.
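
For comparison, the sketch below contrasts lstrip, which only removes leading whitespace, with strip, which removes it on both sides. The column name and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical names with stray spaces before and/or after the text.
df = pd.DataFrame({"name": ["  alice", "bob  ", "  carol  "]})

# lstrip removes whitespace only at the beginning of each string.
left = df["name"].str.lstrip()

# strip removes whitespace at both the beginning and the end.
both = df["name"].str.strip()

print(left.tolist())
print(both.tolist())
```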

7. Concatenate two string columns (under a condition)

    def concat_col_str_condition(df):
        # concat 2 columns with strings if the last 3 letters of the first column are 'pil'
        mask = df['col_1'].str.endswith('pil', na=False)
        col_new = df.loc[mask, 'col_1'] + df.loc[mask, 'col_2']
        # replace the 'pil' with an empty space
        col_new = col_new.replace('pil', ' ', regex=True)
        return col_new

This method is useful when you want to combine two string columns under a certain condition. For example, you may want to concatenate the first and second columns only when the first column ends with specific letters. Depending on your needs, you can also delete those ending letters after the concatenation is done.
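
One possible variant is sketched below on made-up data: it removes only the trailing three letters by slicing, then writes the result into a new column while leaving non-matching rows as they were. The column name "combined" and the sample values are assumptions, not part of the original code.

```python
import pandas as pd

# Hypothetical data; only the first row ends with "pil".
df = pd.DataFrame({
    "col_1": ["pupil", "teacher", None],
    "col_2": ["_2024", "_2023", "_2022"],
})

# na=False treats missing values as "does not end with 'pil'".
mask = df["col_1"].str.endswith("pil", na=False)

# Start from the first column, then overwrite only the matching rows.
df["combined"] = df["col_1"]
# .str[:-3] drops the last three characters (the "pil" suffix) before joining.
df.loc[mask, "combined"] = df.loc[mask, "col_1"].str[:-3] + df.loc[mask, "col_2"]

print(df)
```

Slicing off the suffix avoids accidentally replacing "pil" when it also appears in the middle of a string.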

8. Convert timestamps (from string to "datetime" format)

    import pandas as pd

    def convert_str_datetime(df):
        '''
        AIM    -> Convert datetime(String) to datetime(format we want)
        INPUT  -> df
        OUTPUT -> updated df with new datetime format
        '''
        df.insert(loc=2, column='timestamp',
                  value=pd.to_datetime(df.transdate, format='%Y-%m-%d %H:%M:%S.%f'))

When dealing with time series data, you may encounter timestamp columns stored as strings. We then have to convert those strings to the "datetime" format we need, so that the data can be used for meaningful analysis and presentation.
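
A minimal sketch of this conversion on made-up data follows. The column names other than "transdate" and "timestamp", and all the values, are assumptions for illustration. It also shows errors="coerce", which turns unparseable strings into NaT instead of raising an error.

```python
import pandas as pd

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Hypothetical transactions with the date stored as a string.
df = pd.DataFrame({
    "id": [1, 2],
    "amount": [9.5, 3.0],
    "transdate": ["2019-01-31 08:15:30.250000", "2019-02-01 23:59:59.000001"],
})

# Insert the parsed datetime column at position 2 (the third column).
df.insert(loc=2, column="timestamp",
          value=pd.to_datetime(df["transdate"], format=FMT))

# With errors="coerce", strings that do not match the format become NaT.
bad = pd.to_datetime(pd.Series(["not a date"]), format=FMT, errors="coerce")

print(df.dtypes)
print(bad)
```

Once the column has a datetime dtype, accessors such as df["timestamp"].dt.year become available for grouping and filtering.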

After reading this article, I believe you have a certain understanding of "how to achieve data cleaning in Python". If you want to know more about it, you are welcome to follow the industry information channel. Thank you for reading!
