2025-01-18 Update From: SLTechnology News&Howtos (shulou)
Shulou (Shulou.com), 06/01 report
This article explains how to handle missing values with pandas. It is fairly detailed and should serve as a useful reference.
All descriptive statistics on pandas objects exclude missing values by default.
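As a small illustration of this default, the sketch below computes a few statistics on a Series that contains one NaN. The variable names are made up for the example; skipna is the pandas parameter that controls whether missing values are skipped.

```python
import numpy as np
import pandas as pd

# Illustrative data: one of the four values is missing.
values = pd.Series([1.0, np.nan, 3.0, 5.0])

mean_default = values.mean()             # NaN is skipped: (1 + 3 + 5) / 3
count_non_na = values.count()            # counts only the non-missing values
mean_strict = values.mean(skipna=False)  # NaN propagates when skipping is disabled

print(mean_default, count_non_na, mean_strict)
```

Passing skipna=False is a simple way to notice that a column contains missing values instead of silently averaging around them.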
The way missing data is represented in pandas objects is not perfect, but it works well for most users. For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing values. We call NaN a sentinel value that can be easily detected:
In:
import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data
Out:
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
In:
string_data.isnull()
Out:
0    False
1    False
2     True
3    False
dtype: bool
pandas adopts a convention from the R language and refers to missing data as NA, which stands for not available. In statistical applications, NA data may be data that does not exist, or data that exists but was not observed (for example, because of problems during data collection). When cleaning data for analysis, it is often important to analyze the missing data itself in order to identify data collection problems or potential biases caused by the missing data.
The built-in Python None value is also treated as NA in object arrays:
In:
string_data[0] = None
string_data.isnull()
Out:
0     True
1    False
2     True
3    False
dtype: bool
The pandas project continues to improve the internal details of how missing data is handled, but user-facing API functions such as pandas.isnull abstract away many of the tedious details. The functions related to missing data handling are listed below:
dropna: filters axis labels based on whether the values for each label have missing data, with a configurable threshold for how much missing data to tolerate
fillna: fills in missing data with a given value or with an interpolation method (such as 'ffill' or 'bfill')
isnull: returns Boolean values indicating which values are missing
notnull: the negation of isnull
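The four functions above can be tried side by side on a small Series. This is only a minimal sketch; the data and variable names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative Series with two missing entries (NaN and None).
s = pd.Series([1.0, np.nan, 3.0, None])

missing_mask = s.isnull()    # True where a value is missing
present_mask = s.notnull()   # element-wise negation of isnull
dropped = s.dropna()         # keeps only the non-missing entries
filled = s.fillna(0.0)       # replaces missing entries with a constant

print(missing_mask.tolist())
print(dropped.index.tolist())
print(filled.tolist())
```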
01 Filtering out missing values
There are several ways to filter out missing values. You can always do it by hand using pandas.isnull and Boolean indexing, but dropna is usually more convenient. On a Series, dropna returns the Series with only the non-null data and its index values:
In:
from numpy import nan as NA
data = pd.Series([1, NA, 3.5, NA, 7])
data.dropna()
Out:
0    1.0
2    3.5
4    7.0
dtype: float64
The above example is equivalent to the following code:
In:
data[data.notnull()]
Out:
0    1.0
2    3.5
4    7.0
dtype: float64
With DataFrame objects, things are a little more complicated. You may want to drop rows or columns that are all NA, or only those that contain any NA. By default, dropna drops every row that contains a missing value:
In:
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()
data
Out:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
In:
cleaned
Out:
     0    1    2
0  1.0  6.5  3.0
Passing how='all' drops only the rows in which every value is NA:
In:
data.dropna(how='all')
Out:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
To drop columns in the same way, pass axis=1:
In:
data[4] = NA
data
Out:
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
In:
data.dropna(axis=1, how='all')
Out:
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
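To compare the how and axis options discussed above in one place, here is a minimal sketch on a small frame with labeled rows and columns. The labels and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame: row "c" is entirely NA and column "z" is entirely NA.
frame = pd.DataFrame(
    {"x": [1.0, 1.0, np.nan], "y": [6.5, np.nan, np.nan], "z": [np.nan] * 3},
    index=["a", "b", "c"],
)

rows_any = frame.dropna()                   # drop rows with at least one NA
rows_all = frame.dropna(how="all")          # drop rows that are entirely NA
cols_all = frame.dropna(axis=1, how="all")  # drop columns that are entirely NA

print(rows_any.shape, rows_all.shape, cols_all.shape)
```

Because column "z" is all NA, the default call removes every row here, which shows why how='all' or axis=1 is sometimes the safer choice.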
A related way of filtering DataFrame rows often comes up with time series data. Suppose you want to keep only the rows that contain at least a certain number of observations. You can indicate this with the thresh parameter:
In:
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df
Out:
          0         1         2
0 -0.204708       NaN       NaN
1 -0.555730       NaN       NaN
2  0.092908       NaN  0.769023
3  1.246435       NaN -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
In:
df.dropna()
Out:
          0         1         2
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
In:
df.dropna(thresh=2)
Out:
          0         1         2
2  0.092908       NaN  0.769023
3  1.246435       NaN -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
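Since the example above uses random numbers, a deterministic sketch may make the meaning of thresh easier to verify: a row is kept when it has at least thresh non-NA values. The frame below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame with a known number of non-NA values per row.
obs = pd.DataFrame(
    [[1.0, np.nan, np.nan],   # 1 observation
     [2.0, 3.0, np.nan],      # 2 observations
     [4.0, 5.0, 6.0]],        # 3 observations
)

kept_2 = obs.dropna(thresh=2)   # keep rows with at least 2 non-NA values
kept_3 = obs.dropna(thresh=3)   # keep rows with at least 3 non-NA values

print(kept_2.index.tolist(), kept_3.index.tolist())
```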
02 Filling in missing values
Rather than filtering out missing values (and possibly discarding other data along with them), you may sometimes want to fill in the "holes" in one of several ways.
In most cases, fillna is the method to use. Calling fillna with a constant replaces the missing values with that constant:
In:
df.fillna(0)
Out:
          0         1         2
0 -0.204708  0.000000  0.000000
1 -0.555730  0.000000  0.000000
2  0.092908  0.000000  0.769023
3  1.246435  0.000000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
Calling fillna with a dictionary lets you set a different fill value for each column:
In:
df.fillna({1: 0.5, 2: 0})
Out:
          0         1         2
0 -0.204708  0.500000  0.000000
1 -0.555730  0.500000  0.000000
2  0.092908  0.500000  0.769023
3  1.246435  0.500000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
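The dictionary form works the same way with named columns. In the sketch below, the column names and fill values are hypothetical; it also shows that a column not mentioned in the dictionary is left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical columns; each has exactly one missing value.
readings = pd.DataFrame({
    "temp": [20.5, np.nan, 22.0],
    "visits": [3.0, 4.0, np.nan],
    "humidity": [np.nan, 0.4, 0.5],
})

# One fill value per listed column; "humidity" is not listed, so it keeps its NaN.
filled = readings.fillna({"temp": 20.0, "visits": 0})
print(filled)
```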
fillna returns a new object, but you can also modify the existing object in place:
In:
_ = df.fillna(0, inplace=True)
df
Out:
          0         1         2
0 -0.204708  0.000000  0.000000
1 -0.555730  0.000000  0.000000
2  0.092908  0.000000  0.769023
3  1.246435  0.000000 -1.296221
4  0.274992  0.228913  1.352917
5  0.886429 -2.001637 -0.371843
6  1.669025 -0.438570 -0.539741
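The difference between the two calling styles can be checked directly by counting the missing values that remain. This is a minimal sketch with an invented two-column frame.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, 4.0]})

# Default: a new object is returned and the original keeps its NaNs.
copy_filled = original.fillna(0)
missing_before = int(original.isnull().sum().sum())

# inplace=True: the original object itself is modified.
original.fillna(0, inplace=True)
missing_after = int(original.isnull().sum().sum())

print(missing_before, missing_after)
```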
The same interpolation methods that are available for reindexing can also be used with fillna:
In:
df = pd.DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df
Out:
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772       NaN  1.343810
3 -0.713544       NaN -2.370232
4 -1.860761       NaN       NaN
5 -1.265934       NaN       NaN
In:
df.fillna(method='ffill')
Out:
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772  0.124121  1.343810
3 -0.713544  0.124121 -2.370232
4 -1.860761  0.124121 -2.370232
5 -1.265934  0.124121 -2.370232
In:
df.fillna(method='ffill', limit=2)
Out:
          0         1         2
0  0.476985  3.248944 -1.021228
1 -0.577087  0.124121  0.302614
2  0.523772  0.124121  1.343810
3 -0.713544  0.124121 -2.370232
4 -1.860761       NaN -2.370232
5 -1.265934       NaN -2.370232
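Recent pandas releases deprecate the method argument of fillna in favor of the dedicated ffill and bfill methods, so the same idea may be written as in the sketch below. The Series is invented for illustration; limit has the same meaning as above.

```python
import numpy as np
import pandas as pd

# Illustrative Series with a run of three missing values.
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

forward = s.ffill()              # propagate the last valid value forward
forward_cap = s.ffill(limit=2)   # fill at most 2 consecutive missing values
backward = s.bfill(limit=1)      # propagate the next valid value back by 1 step

print(forward.tolist())
print(forward_cap.tolist())
print(backward.tolist())
```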
With a little creativity, fillna can do much more. For example, you can fill the missing values with the mean or median of a Series:
In:
data = pd.Series([1., NA, 3.5, NA, 7])
data.fillna(data.mean())
Out:
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
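The same idea extends to a DataFrame: passing the Series of column means (or medians) fills each column with its own statistic. This is a minimal sketch; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical score table with one missing value per column.
scores = pd.DataFrame({"math": [80.0, np.nan, 100.0], "art": [np.nan, 70.0, 90.0]})

# scores.mean() is a Series indexed by column name, so each column
# is filled with its own mean (likewise for the median).
by_mean = scores.fillna(scores.mean())
by_median = scores.fillna(scores.median())

print(by_mean)
```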
The parameters of fillna are as follows:
value: scalar value or dict-like object used to fill missing values
method: fill method, such as 'ffill' or 'bfill'; it defaults to None, so either a value or a method must be supplied
axis: axis to fill along, default axis=0
inplace: modify the calling object instead of producing a copy
limit: for forward and backward filling, the maximum number of consecutive missing values to fill
© 2024 shulou.com SLNews company. All rights reserved.