How to deal with missing values with pandas

2025-04-16 Update From: SLTechnology News&Howtos

This article explains in detail how to handle missing values with pandas: how they are represented, how to filter them out, and how to fill them in.

All descriptive statistics on pandas objects exclude missing values by default.
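As a minimal sketch of that default, the snippet below computes a few statistics on a small Series that contains one NaN. The Series and its values are made up for illustration; skipna is a standard parameter of these reduction methods.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three observations, one of them missing.
values = pd.Series([1.0, np.nan, 3.0])

# By default the NaN is skipped, so only 1.0 and 3.0 contribute.
total = values.sum()
average = values.mean()
count = values.count()  # number of non-missing values

# Passing skipna=False makes the missing value propagate instead.
strict_total = values.sum(skipna=False)

print(total, average, count, strict_total)
```

If a missing value should invalidate the result rather than be ignored, skipna=False is one way to ask for that.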

The way missing values are represented in pandas objects is not perfect, but it is useful for most users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing values. We call NaN a sentinel value because it can be easily detected:

In:

import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

Out:

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In:

string_data.isnull()

Out:

0    False
1    False
2     True
3    False
dtype: bool

pandas adopts a convention from the R programming language and refers to missing data as NA, which stands for not available. In statistical applications, NA data may be data that does not exist or data that exists but was not observed (for example, because of problems in the data collection process). When cleaning data for analysis, it is often important to analyze the missing data itself in order to identify data collection problems or potential bias caused by the missing data.

Python's built-in None value is also treated as NA in object arrays:

In:

string_data[0] = None
string_data.isnull()

Out:

0     True
1    False
2     True
3    False
dtype: bool
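The same treatment of None can be seen with numeric data. The sketch below uses a made-up Series (the name numeric is only for illustration) and shows that None is stored as NaN, which turns the Series into floating point.

```python
import pandas as pd

# Hypothetical numeric data that contains a None.
numeric = pd.Series([1, None, 3])

# pandas stores the None as NaN, so the values become floats.
print(numeric.dtype)
print(numeric.isnull().tolist())

# The top-level function also accepts a scalar.
print(pd.isnull(None))
```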

The pandas project continues to improve the internal details of how missing values are handled, but user-facing API functions such as pandas.isnull abstract away many of the tedious details. The functions related to missing-data handling are listed below:

dropna: filters axis labels based on whether the values for each label contain missing data, with a configurable threshold for how much missing data to tolerate

fillna: fills in missing data with a given value or with an interpolation method (such as 'ffill' or 'bfill')

isnull: returns Boolean values indicating which values are missing

notnull: the negation of isnull
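The four functions listed above can be tried side by side. This is a minimal sketch on a made-up Series; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

missing = s.isnull()    # True where a value is missing
present = s.notnull()   # element-wise negation of isnull
dropped = s.dropna()    # keeps only index labels 0 and 2
filled = s.fillna(0.0)  # replaces the NaN with 0.0

print(missing.tolist(), present.tolist())
print(dropped.index.tolist(), filled.tolist())
```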

01 Filtering out missing values

There are several ways to filter out missing values. Although you can always do it by hand with pandas.isnull and Boolean indexing, dropna is usually more convenient. Called on a Series, dropna returns a Series containing only the non-null data and their index values:

In:

from numpy import nan as NA

data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()

Out:

0    1.0
2    3.5
4    7.0
dtype: float64

The above example is equivalent to the following code:

In:

data[data.notnull()]

Out:

0    1.0
2    3.5
4    7.0
dtype: float64
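The equivalence can also be checked programmatically. The sketch below rebuilds the same Series and compares the two results with Series.equals; the names via_dropna and via_mask are only for illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

via_dropna = data.dropna()
via_mask = data[data.notnull()]

# Both approaches keep the same values under the same index labels.
print(via_dropna.equals(via_mask))
print(via_dropna.index.tolist())
```

Note that the original index labels are preserved in both cases, so the result is not renumbered from zero.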

With DataFrame objects, things are a little more complicated. You may want to drop rows or columns that are entirely NA, or only those that contain any NA. By default, dropna drops every row that contains a missing value:

In:

data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data

Out:

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

In:

cleaned

Out:

     0    1    2
0  1.0  6.5  3.0

Passing how='all' drops only the rows in which every value is NA:

In:

data.dropna(how='all')

Out:

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0

To drop columns in the same way, pass axis=1:

In:

data[4] = NA
data

Out:

     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN

In:

data.dropna(axis=1, how='all')

Out:

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
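The combinations of how and axis shown above can be compared in one place. This is a minimal sketch that rebuilds the same frame under the hypothetical name frame; the comments state what each call is expected to keep.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.0, 6.5, 3.0],
                      [1.0, np.nan, np.nan],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 6.5, 3.0]])
frame[4] = np.nan  # add a column that is entirely missing

any_rows = frame.dropna()                   # every row now has an NA
all_rows = frame.dropna(how='all')          # drops only row 2
all_cols = frame.dropna(axis=1, how='all')  # drops only column 4
any_cols = frame.dropna(axis=1)             # every column has an NA

print(any_rows.shape, all_rows.index.tolist())
print(all_cols.columns.tolist(), any_cols.shape)
```

Once the all-NA column is added, the default call removes every row, which is why how='all' or a column-wise drop is often the safer first step.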

A related way of filtering DataFrame rows often comes up with time series data. Suppose you want to keep only the rows that contain at least a certain number of observations. You can express this with the thresh parameter:

In:

df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

Out:

          0         1         2
0 -0.204708       NaN       NaN
1 -0.555730       NaN       NaN
2  0.092908       NaN  0.769023
3  1.246435       NaN -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

In:

df.dropna()

Out:

          0         1         2
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

In:

df.dropna(thresh=2)

Out:

          0         1         2
2  0.092908       NaN  0.769023
3  1.246435       NaN -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
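To make the meaning of thresh concrete, the sketch below compares dropna(thresh=2) with a hand-written filter that counts the non-NA values in each row. The frame and its column names are made up for illustration and use fixed values instead of random ones.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [np.nan, np.nan, 5.0, 6.0],
                   'c': [np.nan, 7.0, np.nan, 8.0]})

# Keep rows that have at least two non-NA values.
kept = df.dropna(thresh=2)

# The same selection written out by hand.
non_na_per_row = df.notnull().sum(axis=1)
manual = df[non_na_per_row >= 2]

print(kept.index.tolist())
print(kept.equals(manual))
```

Row 0 has only one observation and is dropped; the other rows have two or three and are kept.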

02 Filling in missing values

Rather than filtering out missing data (and possibly discarding other data along with it), you may sometimes want to fill in the "holes" in one of several ways.

In most cases, fillna is the main method to use. Calling fillna with a constant replaces the missing values with that constant:

In:

df.fillna(0)

Out:

          0         1         2
0 -0.204708  0.000000  0.000000
1 -0.555730  0.000000  0.000000
2  0.092908  0.000000  0.769023
3  1.246435  0.000000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

Calling fillna with a dictionary lets you use a different fill value for each column:

In:

df.fillna({1: 0.5, 2: 0})

Out:

          0         1         2
0 -0.204708  0.500000  0.000000
1 -0.555730  0.500000  0.000000
2  0.092908  0.500000  0.769023
3  1.246435  0.500000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741

fillna returns a new object, but you can also modify the existing object in place:
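The dictionary form works the same way with named columns. In this sketch the column names price, quantity and note are hypothetical; a column that is not mentioned in the dictionary is left untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [10.0, np.nan, 12.0],
                   'quantity': [np.nan, 4.0, np.nan],
                   'note': [np.nan, np.nan, 1.0]})

# Keys are column labels, values are the fill value for that column.
filled = df.fillna({'price': 0.5, 'quantity': 0})

print(filled['price'].tolist())
print(filled['quantity'].tolist())
print(int(filled['note'].isnull().sum()))  # 'note' was not in the dict
```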

In:

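_ = df.fillna(0, inplace=True)
df

Out:

          0         1         2
0 -0.204708  0.000000  0.000000
1 -0.555730  0.000000  0.000000
2  0.092908  0.000000  0.769023
3  1.246435  0.000000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741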
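The difference between the two calling styles can be sketched on a small made-up frame: the plain call leaves the original untouched, while inplace=True changes it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan], 'y': [np.nan, 2.0]})

# Without inplace, a new object is returned and df keeps its NaN values.
copy_filled = df.fillna(0)
still_missing = int(df.isnull().sum().sum())

# With inplace=True, df itself is modified.
df.fillna(0, inplace=True)
remaining = int(df.isnull().sum().sum())

print(still_missing, remaining)
```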

The same interpolation methods that are available for reindexing can also be used with fillna:

In:

df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

Out:

          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772       NaN  1.343810
3 -0.713544       NaN -2.370232
4 -1.860761       NaN       NaN
5 -1.265934       NaN       NaN

In:

df.fillna(method='ffill')

Out:

          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772  0.124121  1.343810
3 -0.713544  0.124121 -2.370232
4 -1.860761  0.124121 -2.370232
5 -1.265934  0.124121 -2.370232

In:

df.fillna(method='ffill', limit=2)

Out:

          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772  0.124121  1.343810
3 -0.713544  0.124121 -2.370232
4 -1.860761       NaN -2.370232
5 -1.265934       NaN -2.370232
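Newer pandas releases (2.1 and later, as far as the release notes indicate) deprecate the method argument of fillna in favor of the dedicated ffill and bfill methods. The sketch below shows forward and backward filling with those methods on a made-up Series; treat it as an illustration under that assumption and not as a statement about the version used in this article.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, np.nan])

# Forward fill: carry the last valid value forward.
full = s.ffill()

# limit=2 fills at most two consecutive missing values per gap.
limited = s.ffill(limit=2)

# Backward fill: use the next valid value; the trailing NaN has none.
backward = s.bfill()

print(full.tolist())
print(limited.isnull().tolist())
print(backward.isnull().tolist())
```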

With fillna you can do many other things with a little creativity. For example, you can fill the missing values with the mean or median of a Series:

In:

data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())

Out:

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
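The median variant mentioned above looks almost identical. This sketch rebuilds the same Series and fills it both ways; both statistics are computed from the three non-missing values only.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 3.5, np.nan, 7])

by_mean = data.fillna(data.mean())      # mean of 1, 3.5 and 7
by_median = data.fillna(data.median())  # median of 1, 3.5 and 7

print(by_mean.tolist())
print(by_median.tolist())
```

The median is less sensitive to outliers than the mean, which can matter when a few extreme values are present.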

The parameters of fillna are as follows:

value: a scalar value or dict-like object used to fill missing values

method: the interpolation method, such as 'ffill' or 'bfill', used when no fill value is given

axis: the axis to fill along; the default is axis=0

inplace: modify the calling object instead of producing a copy

limit: for forward and backward filling, the maximum number of consecutive missing values to fill
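The effect of the axis parameter is easiest to see on a small frame. The sketch below uses the ffill method as a stand-in for fillna with method='ffill' (an assumption made because newer pandas releases appear to deprecate the method argument); ffill accepts the same axis parameter. The data is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, 3.0],
                   [np.nan, 5.0, np.nan]])

# axis=0 (the default) fills downward within each column.
down = df.ffill(axis=0)

# axis=1 fills left to right within each row.
across = df.ffill(axis=1)

print(down.values.tolist())
print(across.values.tolist())
```

A value with nothing before it along the chosen axis stays missing, for example the first entry of the second row when filling across.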
