How to use pandas to solve common preprocessing tasks

2025-07-03 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article shows how to use pandas to solve common preprocessing tasks. The content is short and easy to follow, and each task comes with the pandas code that handles it.

Common data preprocessing steps include finding outliers, handling missing values, fixing unsuitable values, removing duplicate rows, binning, grouping, ranking, and converting categories to numeric values. Here, pandas is used to solve the most common of these tasks.

There are two common ways to find outliers:

Standard deviation method: values more than 1.96 standard deviations above or below the mean are outliers.

Quartile method: with Q1 and Q3 as the first and third quartiles and IQR = Q3 - Q1, values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are outliers.
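The two rules above can also be sketched as a pair of small helpers that return boolean masks, which keeps the bounds calculation separate from the decision to drop or keep rows. The function names and the sample scores are hypothetical and only serve as an illustration.

```python
import pandas as pd


def std_outlier_mask(s: pd.Series, k: float = 1.96) -> pd.Series:
    """True where a value lies more than k standard deviations from the mean."""
    mean, std = s.mean(), s.std()
    return (s < mean - k * std) | (s > mean + k * std)


def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Hypothetical scores; 300 is an obvious outlier.
scores = pd.Series([70, 72, 75, 78, 80, 82, 85, 88, 90, 300])

std_outliers = scores[std_outlier_mask(scores)]
iqr_outliers = scores[iqr_outlier_mask(scores)]
```

Comparisons against NaN evaluate to False, so missing values are never flagged by either mask and have to be handled separately.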

Skill 1: standard deviation method

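    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': [1, 3, np.nan], 'b': [4, np.nan, np.nan]})

    # Values more than 1.96 standard deviations above or below the mean are outliers
    meangrade = df['a'].mean()
    stdgrade = df['a'].std()
    toprange = meangrade + stdgrade * 1.96
    botrange = meangrade - stdgrade * 1.96

    # Drop the outliers from a copy so that df itself stays unchanged
    copydf = df.copy()
    copydf = copydf.drop(copydf[copydf['a'] > toprange].index)
    copydf = copydf.drop(copydf[copydf['a'] < botrange].index)
    copydf

Skill 2: quantile method

    q1 = df['a'].quantile(.25)
    q3 = df['a'].quantile(.75)
    iqr = q3 - q1
    toprange = q3 + iqr * 1.5
    botrange = q1 - iqr * 1.5

    copydf = df.copy()
    copydf = copydf.drop(copydf[copydf['a'] > toprange].index)
    copydf = copydf.drop(copydf[copydf['a'] < botrange].index)
    copydf

Skill 3: handling null values

np.nan is the common null value in pandas. Use dropna to filter out nulls. axis=0 works by row and axis=1 by column. how defaults to 'any', which drops a row or column as soon as it contains a single NaN; 'all' drops it only when every value is NaN.

    # axis=0 works by row; how='all' drops a row only when all of its values are NaN
    df.dropna(axis=0, how='all')

Skill 4: filling null values

Nulls are usually filled with a statistic such as the mean, mode, or median, using the fillna function:

    # Fill the nulls in column a with the column mean and assign the result back.
    # (Calling fillna with inplace=True on a selected column triggers a
    # chained-assignment warning in recent pandas versions.)
    df["a"] = df["a"].fillna(df["a"].mean())

Skill 5: fixing unsuitable values

Suppose the highest possible score for a course is 100. Values such as -2 or 120 are clearly unreasonable. Use a boolean Series to select and overwrite them:

    df.loc[df['a'] < 0, 'a'] = 0
    df.loc[df['a'] > 100, 'a'] = 100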
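The null-handling and range-fixing steps above can be tried end to end on a small frame. This is a minimal sketch on made-up scores. It clamps the out-of-range values before computing the fill value, which is an assumption about the desired order (otherwise -2 and 120 would distort the mean), and it uses clip as a compact alternative to two boolean assignments.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: row 2 is entirely NaN, row 4 has one missing value,
# and rows 0 and 1 hold out-of-range scores.
df = pd.DataFrame({
    'a': [-2.0, 120.0, np.nan, 80.0, np.nan],
    'b': [4.0, 5.0, np.nan, 7.0, 8.0],
})

# Drop rows in which every value is NaN (row 2 here).
df = df.dropna(axis=0, how='all')

# Clamp scores to the valid range [0, 100]; NaN values pass through unchanged.
df['a'] = df['a'].clip(lower=0, upper=100)

# Fill the remaining NaN in column a with the mean of the clamped scores.
# Assigning the result back avoids an inplace call on a column selection.
df['a'] = df['a'].fillna(df['a'].mean())
```

After these steps column a holds 0, 100, 80 and 60, where 60 is the mean of the three clamped scores.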

Skill 6: filter duplicate values

To remove rows with duplicate values in a column, use the drop_duplicates method. The first parameter is the list of column names, and keep='last' keeps the row in which each value occurs last:

    df.drop_duplicates(['Names'], keep='last')
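A small worked example makes the effect of keep='last' visible. The Score column and the sample rows are hypothetical; only the Names column appears in the call above.

```python
import pandas as pd

# Hypothetical data: Ann and Bob each appear twice.
df = pd.DataFrame({
    'Names': ['Ann', 'Bob', 'Ann', 'Cid', 'Bob'],
    'Score': [60, 70, 65, 80, 75],
})

# Keep only the last row for each name. drop_duplicates returns a new
# DataFrame and leaves df itself unchanged.
deduped = df.drop_duplicates(['Names'], keep='last')
```

The surviving rows keep their original index labels (2, 3 and 4 here); call reset_index(drop=True) if a fresh 0-based index is needed.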

Skill 7: element-wise apply to remove special characters

When the cells of a column contain special characters such as punctuation, use the element-wise method apply to strip them:

    import string

    exclude = set(string.punctuation)

    def remove_punctuation(x):
        x = ''.join(ch for ch in x if ch not in exclude)
        return x

    # Original df
    In [26]: df
    Out[26]:
          a       b
    0   c,d  edc.rc
    1     3       3
    2  d ef       4

    # Remove punctuation from column a
    In [27]: df.a = df.a.apply(remove_punctuation)

    In [28]: df
    Out[28]:
          a       b
    0    cd  edc.rc
    1     3       3
    2  d ef       4
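A self-contained variant of the punctuation filter is sketched below. It swaps the character-by-character join for str.translate, which is a different technique from the one in the article, and it skips non-string cells so that numbers or NaN in the column do not raise. The sample frame is hypothetical.

```python
import string

import pandas as pd

# Hypothetical frame of strings; only column a is cleaned.
df = pd.DataFrame({'a': ['c,d', '3', 'd! ef'], 'b': ['edc.rc', '3', '4']})

# Translation table that deletes every ASCII punctuation character.
table = str.maketrans('', '', string.punctuation)


def remove_punctuation(x):
    # Leave non-string cells (numbers, NaN) untouched instead of raising.
    return x.translate(table) if isinstance(x, str) else x


df['a'] = df['a'].apply(remove_punctuation)
```

Column b is left as it was, so its 'edc.rc' value still contains the dot.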

Skill 8: binning data with cut

Convert scores on a 100-point scale into the four grades A, B, C, and D. bins is [0, 60, 75, 90, 100] and labels is ['D', 'C', 'B', 'A']:

    # Generate 20 random integers in [1, 100)
    In [30]: a = np.random.randint(1, 100, 20)

    In [31]: a
    Out[31]:
    array([48, 22, 46, 84, 13, 52, 36, 35, 27, 99, 31, 37, 15, 31,  5, 46, 98,
           99, 60, 43])

    # Bin the values
    In [33]: pd.cut(a, [0, 60, 75, 90, 100], labels=['D', 'C', 'B', 'A'])
    Out[33]:
    [D, D, D, B, D, ..., D, A, A, D, D]
    Length: 20
    Categories (4, object): [D < C < B < A]
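Bin edges are easy to get wrong at the boundaries, so the sketch below runs cut on fixed scores instead of random ones. By default the intervals are closed on the right, which puts a score of 60 into grade D. The second call assumes a grading rule in which 60 already counts as a C; that rule and the extended top edge of 101 are illustrative assumptions, not part of the original article.

```python
import numpy as np
import pandas as pd

# Fixed scores so that the result is reproducible.
scores = np.array([0, 48, 59, 60, 75, 84, 90, 99, 100])
bins = [0, 60, 75, 90, 100]
labels = ['D', 'C', 'B', 'A']

# Default: right-closed intervals (0, 60], (60, 75], (75, 90], (90, 100].
# include_lowest=True also places a score of 0 in the first bin.
right_closed = pd.cut(scores, bins, labels=labels, include_lowest=True)

# Assumed rule: 60 counts as C. Left-closed intervals
# [0, 60), [60, 75), [75, 90), [90, 101) express that.
left_closed = pd.cut(scores, [0, 60, 75, 90, 101], labels=labels, right=False)
```

Without include_lowest=True, a score of exactly 0 would fall outside the first right-closed interval and come back as NaN.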

Skill 9: ranking with rank

The rank method generates a numeric ranking. With ascending=False, the highest score gets rank 1:

    In [36]: df = pd.DataFrame({'a': [46, 98, 99, 60, 43]})

    In [53]: df['a'].rank(ascending=False)
    Out[53]:
    0    4.0
    1    2.0
    2    1.0
    3    3.0
    4    5.0
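The ranking above contains no ties. When scores repeat, the method parameter of rank decides how tied values are numbered, which the following sketch illustrates on hypothetical scores.

```python
import pandas as pd

# Hypothetical scores with a tie at 98.
scores = pd.Series([46, 98, 98, 60, 43])

# Default method='average': tied values share the mean of their positions.
avg_rank = scores.rank(ascending=False)

# method='min': tied values share the best position, as in competition ranking.
min_rank = scores.rank(ascending=False, method='min')
```

Here the two 98s receive 1.5 each under the default and 1.0 each under method='min'; the remaining scores are ranked 3, 4 and 5 in both cases.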

Skill 10: converting a categorical column to numeric values

A column may take only a limited set of enumerated values, which often need to be converted to numbers. Use get_dummies or define your own function:

    pd.get_dummies(df['a'])

Custom function, combined with apply:

    def c2n(x):
        if x == 'A':
            return 95
        if x == 'B':
            return 80

    df['a'].apply(c2n)

The above covers ten small data preprocessing tasks and their corresponding implementations in pandas.
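Both conversions can be compared side by side on a small grade column. The sketch below uses a dictionary with map in place of the custom function, a substitution that keeps the lookup table in one place. The scores for A and B follow the function above, while the grade C and its score of 65 are assumed values added for illustration.

```python
import pandas as pd

# Hypothetical grade column.
df = pd.DataFrame({'a': ['A', 'B', 'A', 'C']})

# One-hot encoding: one indicator column per distinct grade.
dummies = pd.get_dummies(df['a'])

# Lookup table for the numeric conversion; 65 for C is an assumed value.
grade_to_score = {'A': 95, 'B': 80, 'C': 65}
numeric = df['a'].map(grade_to_score)
```

Grades missing from the dictionary come back as NaN from map, which makes unmapped categories easy to spot with isna.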

