2025-03-31 Update From: SLTechnology News & Howtos
Shulou (Shulou.com) 05/31 Report
This article explains in detail how to use Pandas and NumPy for data cleaning in Python. The content is laid out step by step; I hope it helps resolve your doubts about the topic.

Many data scientists estimate that obtaining and cleaning data accounts for 80% of the work: a great deal of time goes into cleaning datasets and whittling them down to a usable form.

So if you are new to the field, or planning to enter it, it is important to be able to deal with messy data, whether that means missing values, inconsistent formats, malformed records, or meaningless outliers.

We will use Python's Pandas and NumPy libraries to clean the data.
Preparatory work

```python
import pandas as pd
import numpy as np
```

After importing the modules, we can start the data preprocessing proper.

Deleting DataFrame columns
You will often find that not all data categories in the dataset are useful. For example, you might have a dataset that contains student information (name, grade, standard, parent name, and address), but you want to focus on analyzing student performance. In this case, the address or the parents' names are not important. Keeping this unwanted data will take up unnecessary space.
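As a minimal sketch of dropping unneeded columns, using a made-up student dataset (all names and columns below are hypothetical, invented for illustration):

```python
import pandas as pd

# Hypothetical student data: for a performance analysis,
# 'Parent Name' and 'Address' are dead weight
students = pd.DataFrame({
    "Name": ["Ann", "Ben"],
    "Grade": [90, 85],
    "Parent Name": ["Carol", "Dave"],
    "Address": ["1 Elm St", "2 Oak St"],
})

# axis=1 drops columns rather than rows; without inplace=True,
# drop() returns a new DataFrame and leaves the original intact
slim = students.drop(["Parent Name", "Address"], axis=1)
print(list(slim.columns))  # ['Name', 'Grade']
```

Returning a new frame (rather than passing `inplace=True`) makes it easy to keep the raw data around while you experiment.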
We will work with the BL-Flickr-Images-Book.csv dataset.

```python
df = pd.read_csv('data science essential Pandas, NumPy for data cleaning/BL-Flickr-Images-Book.csv')
df.head()
```
You can see that columns such as Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Issuance type and Shelfmarks carry no information that helps our analysis, so we can drop them in one batch.

```python
to_drop_column = ['Edition Statement', 'Corporate Author', 'Corporate Contributors',
                  'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
df.drop(to_drop_column, inplace=True, axis=1)
df.head()
```
Changing the DataFrame index

The Pandas index extends the capabilities of NumPy arrays to allow more general slicing and labeling. In many cases it is helpful to use a uniquely-valued identifying field of the data as its index.

First check that the Identifier column really is unique:

```python
df['Identifier'].is_unique
# True
```
Replace the default index with the Identifier column:

```python
df = df.set_index('Identifier')
df.head()
```
206 is the first label of the new index; the same row can still be accessed with the location-based index df.iloc[0].
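A minimal sketch of the label-versus-position distinction, on a toy frame with hypothetical identifiers:

```python
import pandas as pd

# Toy frame; the identifier values are made up for illustration
df_toy = pd.DataFrame({"Identifier": [206, 216, 218],
                       "Title": ["A", "B", "C"]})

# A good index candidate should be unique
assert df_toy["Identifier"].is_unique

df_toy = df_toy.set_index("Identifier")

# Label-based (.loc) and position-based (.iloc) access reach the same row
print(df_toy.loc[206, "Title"])   # 'A'
print(df_toy.iloc[0]["Title"])    # 'A'
```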
Tidying up DataFrame data fields
Clean up specific columns and convert them to a uniform format to better understand the dataset and enforce consistency.
Inspecting the Date of Publication column, we find that its format is not uniform.
```python
df.loc[1905:, 'Date of Publication'].head(10)
```

```
Identifier
1905           1888
1929    1839, 38-54
2836           1897
2854           1865
2956        1860-63
2957           1873
3017           1866
3131           1899
4598           1814
4884           1820
Name: Date of Publication, dtype: object
```
We can use regular expressions to directly extract four consecutive numbers.
```python
extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
extr.head()
```

```
Identifier
206    1879
216    1868
218    1869
472    1851
480    1857
Name: Date of Publication, dtype: object
```
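The same extraction can be sketched on a toy series (the messy date strings below are made up, but mimic the formats seen in the column):

```python
import pandas as pd

# Hypothetical messy dates: extra years, ranges, clean values
dates = pd.Series(["1879 [1878]", "1839, 38-54", "1897", "1860-63"])

# ^(\d{4}) captures exactly four digits at the start of each string;
# expand=False returns a Series rather than a DataFrame
extr_toy = pd.to_numeric(dates.str.extract(r"^(\d{4})", expand=False))
print(extr_toy.tolist())  # [1879, 1839, 1897, 1860]
```

Anchoring the pattern with `^` keeps trailing digits (like the bracketed second year) from being picked up.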
Finally, convert the column to a numeric type:

```python
df['Date of Publication'] = pd.to_numeric(extr)
```

Combining the str methods with NumPy to clean up columns
df['Date of Publication'].str — the .str attribute is the way to access Pandas's fast vectorized string operations, which largely mimic operations on native Python strings or compiled regular expressions, such as .split(), .replace(), and .capitalize().
To clean up the Place of Publication field, we can combine Pandas's str methods with NumPy's np.where function, which is essentially a vectorized form of Excel's IF() formula.

```python
np.where(condition, then, else)
```

Here condition is an array-like object or a Boolean mask, then is the value used wherever the condition evaluates to True, and else is the value used otherwise.

Essentially, np.where() checks each element of the condition and returns an ndarray containing the corresponding then or else value, depending on which applies. Calls can be nested into a compound if-then chain, allowing values to be computed from multiple conditions.
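A minimal sketch of nesting np.where, using made-up place-of-publication strings:

```python
import numpy as np
import pandas as pd

# Hypothetical messy place names, loosely modeled on the dataset
pub_toy = pd.Series(["London", "Newcastle upon Tyne",
                     "pp. 40. G. Bryan & Co: Oxford, 1898", "London]"])

london_mask = pub_toy.str.contains("London")
oxford_mask = pub_toy.str.contains("Oxford")

# Nested np.where works like a vectorized if/elif/else:
# London wins, then Oxford, otherwise keep the original string
cleaned = np.where(london_mask, "London",
          np.where(oxford_mask, "Oxford", pub_toy))
print(list(cleaned))  # ['London', 'Newcastle upon Tyne', 'Oxford', 'London']
```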
Now let's deal with the Place of Publication data.
```python
df['Place of Publication'].head(10)
```

```
Identifier
206                                  London
216                London; Virtue & Yorston
218                                  London
472                                  London
480                                  London
481                                  London
519                                  London
667     pp. 40. G. Bryan & Co: Oxford, 1898
874                                 London]
1143                                 London
Name: Place of Publication, dtype: object
```
Use the str.contains method to build a Boolean mask for the rows we want:

```python
pub = df['Place of Publication']
london = pub.str.contains('London')
london[:5]
```

```
Identifier
206    True
216    True
218    True
472    True
480    True
Name: Place of Publication, dtype: bool
```
We can then normalize the column with np.where (note that str.replace takes both the pattern and its replacement):

```python
df['Place of Publication'] = np.where(london, 'London', pub.str.replace('-', ' '))
df['Place of Publication']
```

```
Identifier
206                     London
216                     London
218                     London
472                     London
480                     London
                  ...
4158088                 London
4158128                  Derby
4159563                 London
4159587    Newcastle upon Tyne
4160339                 London
Name: Place of Publication, Length: 8287, dtype: object
```

Cleaning the whole dataset with the apply function
In some cases you want to apply a custom function to every cell or element of a DataFrame. The Pandas .apply() method is similar to the built-in map() function, but applies the function to all elements.
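A minimal sketch of .apply() on a toy series (the values and the hypothetical tag_year helper below are made up for illustration):

```python
import pandas as pd

# Toy numeric dates, including a missing value
years = pd.Series([1879.0, 1868.0, float("nan")])

def tag_year(x):
    # int() raises ValueError on NaN, so missing values pass through unchanged
    try:
        return str(int(x)) + " year"
    except (ValueError, TypeError):
        return x

tagged = years.apply(tag_year)
print(tagged.tolist())  # ['1879 year', '1868 year', nan]
```

Catching only the exceptions you expect (rather than a bare `except:`) keeps genuine bugs from being silently swallowed.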
For example, to reformat the publication dates into an "xxxx year" format, we can use apply.
```python
def clean_date(text):
    try:
        return str(int(text)) + " year"
    except (ValueError, TypeError):
        return text

df["new_date"] = df["Date of Publication"].apply(clean_date)
df["new_date"]
```

```
Identifier
206        1879 year
216        1868 year
218        1869 year
472        1851 year
480        1857 year
             ...
4158088    1838 year
4158128    1831 year
4159563          NaN
4159587    1834 year
4160339    1834 year
Name: new_date, Length: 8287, dtype: object
```

Skipping rows when reading a DataFrame

```python
olympics_df = pd.read_csv('data science essential Pandas, NumPy for data cleaning/olympics.csv')
olympics_df.head()
```
You can pass a parameter when reading the data to skip unwanted rows; here header=1 tells read_csv to use the second line of the file as the column names, discarding the row at index 0.

```python
olympics_df = pd.read_csv('data science essential Pandas, NumPy for data cleaning/olympics.csv', header=1)
olympics_df.head()
```
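The effect of header= can be sketched with a small in-memory CSV (the contents below are made up, standing in for olympics.csv):

```python
import io
import pandas as pd

# First line is junk metadata; the real header is on the second line
csv_text = "junk1,junk2\nCountry,Gold\nGreece,30\n"

# header=1 uses the second physical line as the column names,
# effectively skipping the first row of the file
df_h = pd.read_csv(io.StringIO(csv_text), header=1)
print(list(df_h.columns))       # ['Country', 'Gold']
print(df_h.iloc[0]["Country"])  # 'Greece'
```

For more irregular files, read_csv also accepts a skiprows parameter that takes explicit row numbers.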
Renaming DataFrame columns

```python
new_names = {'Unnamed: 0': 'Country',
             '? Summer': 'Summer Olympics',
             '01 !': 'Gold',
             '02 !': 'Silver',
             '03 !': 'Bronze',
             '? Winter': 'Winter Olympics',
             '01 !.1': 'Gold.1',
             '02 !.1': 'Silver.1',
             '03 !.1': 'Bronze.1',
             '? Games': '# Games',
             '01 !.2': 'Gold.2',
             '02 !.2': 'Silver.2',
             '03 !.2': 'Bronze.2'}
olympics_df.rename(columns=new_names, inplace=True)
olympics_df.head()
```
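A minimal sketch of rename on a toy frame (the awkward auto-generated column names below are invented to mirror the pattern above):

```python
import pandas as pd

# Toy frame with unhelpful column names, as read_csv sometimes produces
medals = pd.DataFrame({"Unnamed: 0": ["Greece"], "01 !": [30]})

# rename() maps old names to new ones; unmapped columns are left alone
medals = medals.rename(columns={"Unnamed: 0": "Country", "01 !": "Gold"})
print(list(medals.columns))  # ['Country', 'Gold']
```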
That concludes this introduction to using Pandas and NumPy for data cleaning in Python. To really master the material, you still need to practice it yourself. For more related articles, you are welcome to follow the industry information channel.
© 2024 shulou.com SLNews company. All rights reserved.