Data cleaning, merging, transformation and reconstruction


2025-07-01 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/03 Report--

Data cleaning

Data cleaning is a key step in data analysis, and it directly affects the results of the processing that follows. Does the data need to be modified? Is there anything that needs to be changed? How should the data be adjusted to suit the subsequent analysis and mining? Cleaning is an iterative process, and a real project may need to perform these operations more than once.

Dealing with missing data: DataFrame.fillna(), DataFrame.dropna()

Data connection (pd.merge)

pd.merge joins the rows of different DataFrames based on one or more keys, similar to a database join operation.

Sample code:

import pandas as pd
import numpy as np

# left table: 7 rows with repeated keys
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': np.random.randint(0, 10, 7)})
# right table: 3 rows, one key ('d') that is absent from the left table
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data2': np.random.randint(0, 10, 3)})

print(df_obj1)
print(df_obj2)

Running result:

   data1 key
0      8   b
1      8   b
2      3   a
3      5   c
4      4   a
5      9   a
6      6   b
   data2 key
0      9   a
1      0   b
2      3   d

1. By default, the overlapping column names are used as the "foreign key"

Sample code:

# by default, the overlapping column names are used as the "foreign key"
print(pd.merge(df_obj1, df_obj2))

Running result:

   data1 key  data2
0      8   b      0
1      8   b      0
2      6   b      0
3      3   a      9
4      4   a      9
5      9   a      9

2. on explicitly specifies the "foreign key"

Sample code:

# on explicitly specifies the "foreign key"
print(pd.merge(df_obj1, df_obj2, on='key'))

Running result:

   data1 key  data2
0      8   b      0
1      8   b      0
2      6   b      0
3      3   a      9
4      4   a      9
5      9   a      9

3. left_on is the "foreign key" of the left data, right_on is the "foreign key" of the right data

Sample code:

# left_on and right_on specify the "foreign key" of the left and right data, respectively

# change the column names so the two tables no longer share a key name
df_obj1 = df_obj1.rename(columns={'key': 'key1'})
df_obj2 = df_obj2.rename(columns={'key': 'key2'})

print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2'))

Running result:

   data1 key1  data2 key2
0      8    b      0    b
1      8    b      0    b
2      6    b      0    b
3      3    a      9    a
4      4    a      9    a
5      9    a      9    a

The default is an inner join ("inner"), that is, the keys in the result are the intersection of the keys on both sides.

how specifies the join method
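To compare the join methods without random data, here is a minimal sketch using two small hand-written frames. The names left, right and merged_keys are illustrative and not from the article; the sketch only lists which key values survive each value of how.

```python
import pandas as pd

# Small fixed frames (illustrative data, not the article's random run)
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'data1': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': [10, 20, 30]})

def merged_keys(how):
    """Return the sorted key values that survive a merge with the given method."""
    result = pd.merge(left, right, on='key', how=how)
    return sorted(result['key'])

for how in ['inner', 'outer', 'left', 'right']:
    print(how, merged_keys(how))
# inner ['a', 'b']
# outer ['a', 'b', 'c', 'd']
# left ['a', 'b', 'c']
# right ['a', 'b', 'd']
```

The inner join keeps only the keys present on both sides, the outer join keeps every key, and the left and right joins keep every key of the corresponding table.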

4. Outer join ("outer"): the keys in the result are the union of the keys on both sides

Sample code:

# outer join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='outer'))

Running result:

   data1 key1  data2 key2
0    8.0    b    0.0    b
1    8.0    b    0.0    b
2    6.0    b    0.0    b
3    3.0    a    9.0    a
4    4.0    a    9.0    a
5    9.0    a    9.0    a
6    5.0    c    NaN  NaN
7    NaN  NaN    3.0    d

5. Left join ("left")

Sample code:

# left join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='left'))

Running result:

   data1 key1  data2 key2
0      8    b    0.0    b
1      8    b    0.0    b
2      3    a    9.0    a
3      5    c    NaN  NaN
4      4    a    9.0    a
5      9    a    9.0    a
6      6    b    0.0    b

6. Right join ("right")

Sample code:

# right join
print(pd.merge(df_obj1, df_obj2, left_on='key1', right_on='key2', how='right'))

Running result:

   data1 key1  data2 key2
0    8.0    b      0    b
1    8.0    b      0    b
2    6.0    b      0    b
3    3.0    a      9    a
4    4.0    a      9    a
5    9.0    a      9    a
6    NaN  NaN      3    d

7. Deal with duplicate column names

suffixes: defaults to _x, _y
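As a quick illustration of the default, the sketch below merges two hand-written frames that share a non-key column named data. The frame names and values are illustrative assumptions; the column names are sorted before printing so the result does not depend on column order.

```python
import pandas as pd

# Two frames that share the non-key column name 'data' (illustrative values)
left = pd.DataFrame({'key': ['a', 'b'], 'data': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'data': [3, 4]})

# Without suffixes, pandas appends _x (left) and _y (right) to the clashing names
default_cols = sorted(pd.merge(left, right, on='key').columns)
# With suffixes, the clashing names get the given endings instead
custom_cols = sorted(
    pd.merge(left, right, on='key', suffixes=('_left', '_right')).columns
)

print(default_cols)  # ['data_x', 'data_y', 'key']
print(custom_cols)   # ['data_left', 'data_right', 'key']
```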

Sample code:

# deal with duplicate column names
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                        'data': np.random.randint(0, 10, 3)})

print(pd.merge(df_obj1, df_obj2, on='key', suffixes=('_left', '_right')))

Running result:

   data_left key  data_right
0          9   b           1
1          5   b           1
2          1   b           1
3          2   a           8
4          2   a           8
5          5   a           8

8. Join by index

left_index=True or right_index=True
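Both flags can also be set at once, so that the row indexes of both tables act as the join keys. The following is a minimal sketch with hand-written frames (left and right are illustrative names); it also compares the result with DataFrame.join, which joins on the index by default.

```python
import pandas as pd

# Both frames carry their join keys in the index (illustrative data)
left = pd.DataFrame({'data1': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'data2': [10, 20]}, index=['a', 'b'])

# Use the index on both sides as the join key (inner join by default)
by_merge = pd.merge(left, right, left_index=True, right_index=True)

# DataFrame.join joins on the index; how='inner' matches merge's default
by_join = left.join(right, how='inner')

print(by_merge)
print(by_merge.equals(by_join))
```

Only the labels a and b appear in both indexes, so the row labelled c is dropped from the result.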

Sample code:

# join by index
df_obj1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': np.random.randint(0, 10, 7)})
df_obj2 = pd.DataFrame({'data2': np.random.randint(0, 10, 3)},
                       index=['a', 'b', 'd'])

# the left table joins on its 'key' column, the right table on its index
print(pd.merge(df_obj1, df_obj2, left_on='key', right_index=True))

Running result:

   data1 key  data2
0      3   b      6
1      4   b      6
6      8   b      6
2      6   a      0
4      3   a      0
5      0   a      0

Data merging (pd.concat)

Merges multiple objects together along an axis.

1. NumPy's concat: np.concatenate

Sample code:

import numpy as np
import pandas as pd

arr1 = np.random.randint(0, 10, (3, 4))
arr2 = np.random.randint(0, 10, (3, 4))

print(arr1)
print(arr2)

# stack along rows (axis=0 is the default)
print(np.concatenate([arr1, arr2]))
# stack along columns
print(np.concatenate([arr1, arr2], axis=1))

Running result:

# print(arr1)
[[3 3 0 8]
 [2 0 3 1]
 [4 8 8 2]]

# print(arr2)
[[6 8 7 3]
 [1 6 8 7]
 [1 4 7 1]]

# print(np.concatenate([arr1, arr2]))
[[3 3 0 8]
 [2 0 3 1]
 [4 8 8 2]
 [6 8 7 3]
 [1 6 8 7]
 [1 4 7 1]]

# print(np.concatenate([arr1, arr2], axis=1))
[[3 3 0 8 6 8 7 3]
 [2 0 3 1 1 6 8 7]
 [4 8 8 2 1 4 7 1]]

2. pd.concat

Pay attention to the axis direction; the default is axis=0.
join specifies the merge method and defaults to outer.
When merging Series, check whether the row index contains duplicates.

1) The index has no duplicates

Sample code:

# the indexes do not overlap
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(0, 5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(5, 9))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(9, 12))

print(ser_obj1)
print(ser_obj2)
print(ser_obj3)

print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))

Running result:

# print(ser_obj1)
0    1
1    8
2    4
3    9
4    4
dtype: int64

# print(ser_obj2)
5    2
6    6
7    4
8    2
dtype: int64

# print(ser_obj3)
9     6
10    2
11    7
dtype: int64

# print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
0     1
1     8
2     4
3     9
4     4
5     2
6     6
7     4
8     2
9     6
10    2
11    7
dtype: int64

# print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1))
      0    1    2
0   1.0  NaN  NaN
1   5.0  NaN  NaN
2   3.0  NaN  NaN
3   2.0  NaN  NaN
4   4.0  NaN  NaN
5   NaN  9.0  NaN
6   NaN  8.0  NaN
7   NaN  3.0  NaN
8   NaN  6.0  NaN
9   NaN  NaN  2.0
10  NaN  NaN  3.0
11  NaN  NaN  3.0

2) The index has duplicates
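When the indexes overlap, pd.concat keeps the duplicate labels by default. The sketch below uses two small hand-written Series (s1 and s2 are illustrative names) to show that default next to two standard pd.concat options: ignore_index=True, which renumbers the result, and verify_integrity=True, which raises an error instead of producing duplicates.

```python
import pandas as pd

# Two Series whose indexes overlap completely (illustrative data)
s1 = pd.Series([1, 2], index=[0, 1])
s2 = pd.Series([3, 4], index=[0, 1])

# Default: duplicate index labels are kept as they are
kept = pd.concat([s1, s2])
print(list(kept.index))        # [0, 1, 0, 1]

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([s1, s2], ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2, 3]

# verify_integrity=True refuses to build a result with duplicate labels
try:
    pd.concat([s1, s2], verify_integrity=True)
    raised = False
except ValueError:
    raised = True
print(raised)                  # True
```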

Sample code:

# the indexes overlap
ser_obj1 = pd.Series(np.random.randint(0, 10, 5), index=range(5))
ser_obj2 = pd.Series(np.random.randint(0, 10, 4), index=range(4))
ser_obj3 = pd.Series(np.random.randint(0, 10, 3), index=range(3))

print(ser_obj1)
print(ser_obj2)
print(ser_obj3)

print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
# join='inner' keeps only the labels shared by all three Series
print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner'))

Running result:

# print(ser_obj1)
0    0
1    3
2    7
3    2
4    5
dtype: int64

# print(ser_obj2)
0    5
1    1
2    9
3    9
dtype: int64

# print(ser_obj3)
0    8
1    7
2    9
dtype: int64

# print(pd.concat([ser_obj1, ser_obj2, ser_obj3]))
0    0
1    3
2    7
3    2
4    5
0    5
1    1
2    9
3    9
0    8
1    7
2    9
dtype: int64

# print(pd.concat([ser_obj1, ser_obj2, ser_obj3], axis=1, join='inner'))
# join='inner' keeps only the labels shared by all objects, so rows or columns that would contain NaN are dropped
   0  1  2
0  0  5  8
1  3  1  7
2  7  9  9

3) When merging DataFrames, check both the row index and the column index for duplicates

Sample code:

df_obj1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)),
                       index=['a', 'b', 'c'], columns=['A', 'B'])
df_obj2 = pd.DataFrame(np.random.randint(0, 10, (2, 2)),
                       index=['a', 'b'], columns=['C', 'D'])

print(df_obj1)
print(df_obj2)

print(pd.concat([df_obj1, df_obj2]))
print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))

Running result:

# print(df_obj1)
   A  B
a  3  3
b  5  4
c  8  6

# print(df_obj2)
   C  D
a  1  9
b  6  8

# print(pd.concat([df_obj1, df_obj2]))
     A    B    C    D
a  3.0  3.0  NaN  NaN
b  5.0  4.0  NaN  NaN
c  8.0  6.0  NaN  NaN
a  NaN  NaN  1.0  9.0
b  NaN  NaN  6.0  8.0

# print(pd.concat([df_obj1, df_obj2], axis=1, join='inner'))
   A  B  C  D
a  3  3  1  9
b  5  4  6  8

Data reconstruction

1. stack

Rotates the column index into the row index, producing a hierarchical index.
DataFrame -> Series

Sample code:

import numpy as np
import pandas as pd

df_obj = pd.DataFrame(np.random.randint(0, 10, (5, 2)),
                      columns=['data1', 'data2'])
print(df_obj)

# move the column labels into an inner level of the row index
stacked = df_obj.stack()
print(stacked)

Running result:

# print(df_obj)
   data1  data2
0      7      9
1      7      8
2      8      9
3      4      1
4      1      2

# print(stacked)
0  data1    7
   data2    9
1  data1    7
   data2    8
2  data1    8
   data2    9
3  data1    4
   data2    1
4  data1    1
   data2    2
dtype: int64

2. unstack

Expands the hierarchical index.
Series -> DataFrame
By default it operates on the inner index level, that is, level=-1.
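The relationship between stack and unstack can be checked on a tiny fixed frame. This is a minimal sketch with illustrative values (df, inner and outer are names chosen here, not from the article): unstacking the inner level restores the original layout, while level=0 moves the original row labels into the columns.

```python
import pandas as pd

# A 2x2 frame with fixed values (illustrative data)
df = pd.DataFrame({'data1': [7, 8], 'data2': [9, 1]})

# stack: column labels become the inner level of a two-level row index
stacked = df.stack()

# unstack() works on the inner level (level=-1) and restores the original layout
inner = stacked.unstack()
# unstack(level=0) moves the outer level (the original row labels) into the columns
outer = stacked.unstack(level=0)

print(inner.values.tolist())  # [[7, 9], [8, 1]]
print(outer.values.tolist())  # [[7, 8], [9, 1]]
```

The second result is the transpose of the first, since the row and column labels have swapped roles.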

Sample code:

# by default, operate on the inner index level
print(stacked.unstack())

# use level to choose which index level to operate on
print(stacked.unstack(level=0))

Running result:

# print(stacked.unstack())
   data1  data2
0      7      9
1      7      8
2      8      9
3      4      1
4      1      2

# print(stacked.unstack(level=0))
       0  1  2  3  4
data1  7  7  8  4  1
data2  9  8  9  1  2

Data conversion

I. Processing duplicate data

1. duplicated() returns a Boolean Series indicating whether each row is a duplicate row

Sample code:

import numpy as np
import pandas as pd

df_obj = pd.DataFrame({'data1': ['a'] * 4 + ['b'] * 4,
                       'data2': np.random.randint(0, 4, 8)})
print(df_obj)

# True marks a row that repeats an earlier row
print(df_obj.duplicated())

Running result:

# print(df_obj)
  data1  data2
0     a      3
1     a      2
2     a      3
3     a      3
4     b      1
5     b      0
6     b      3
7     b      …

# print(df_obj.duplicated())
0    False
1    False
2     True
3     True
4    False
5    False
6    False
7     True
dtype: bool

2. drop_duplicates() filters out duplicate rows

By default, all columns are compared.

Specific columns can also be chosen for the comparison.
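The difference between comparing whole rows and comparing selected columns is easier to see with fixed data. Below is a minimal sketch on a hand-written frame (df and the result names are illustrative); it also uses the standard keep parameter to retain the last occurrence instead of the first.

```python
import pandas as pd

# Rows 0 and 1 are identical; rows 0, 1 and 2 share the same data2 value
df = pd.DataFrame({'data1': ['a', 'a', 'b', 'b'], 'data2': [3, 3, 3, 1]})

# Compare whole rows: only row 1 repeats an earlier row
all_cols = df.drop_duplicates()
# Compare only data2 and keep the first occurrence of each value
by_data2 = df.drop_duplicates(subset='data2')
# Same comparison, but keep the last occurrence instead
by_data2_last = df.drop_duplicates(subset='data2', keep='last')

print(list(all_cols.index))       # [0, 2, 3]
print(list(by_data2.index))       # [0, 3]
print(list(by_data2_last.index))  # [2, 3]
```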

Sample code:

# compare all columns
print(df_obj.drop_duplicates())
# compare only the data2 column
print(df_obj.drop_duplicates('data2'))

Running result:

# print(df_obj.drop_duplicates())
  data1  data2
0     a      3
1     a      2
4     b      1
5     b      0
6     b      3

# print(df_obj.drop_duplicates('data2'))
  data1  data2
0     a      3
1     a      2
4     b      1
5     b      0

3. Transforming values with the function passed to map

A Series transforms each of its elements according to the function passed to map.
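Besides a function, Series.map also accepts a dict that is used as a lookup table. The sketch below shows both forms on small hand-written Series (all names and values are illustrative); values that have no entry in the dict become NaN.

```python
import pandas as pd

# Function form: the function is applied to every element
numbers = pd.Series([1, 2, 3])
squared = numbers.map(lambda x: x ** 2)
print(squared.tolist())  # [1, 4, 9]

# Dict form: each value is looked up in the dict; 'c' has no entry
letters = pd.Series(['a', 'b', 'c'])
mapped = letters.map({'a': 1, 'b': 2})
print(mapped.isna().tolist())  # [False, False, True]
```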

Sample code:

ser_obj = pd.Series(np.random.randint(0, 10, 10))
print(ser_obj)

# square every element
print(ser_obj.map(lambda x: x ** 2))

Running result:

# print(ser_obj)
0    1
1    4
2    8
3    6
4    8
5    6
6    6
7    4
8    7
9    3
dtype: int64

# print(ser_obj.map(lambda x: x ** 2))
0     1
1    16
2    64
3    36
4    64
5    36
6    36
7    16
8    49
9     9
dtype: int64

II. Data replacement

replace substitutes values according to their content.

Sample code:

# a single value replaces a single value
print(ser_obj.replace(1, -100))

# one value replaces multiple values
print(ser_obj.replace([6, 8], -100))

# multiple values replace multiple values, matched by position
print(ser_obj.replace([4, 7], [-100, -200]))

Running result:

# print(ser_obj.replace(1, -100))
0   -100
1      4
2      8
3      6
4      8
5      6
6      6
7      4
8      7
9      3
dtype: int64

# print(ser_obj.replace([6, 8], -100))
0      1
1      4
2   -100
3   -100
4   -100
5   -100
6   -100
7      4
8      7
9      3
dtype: int64

# print(ser_obj.replace([4, 7], [-100, -200]))
0      1
1   -100
2      8
3      6
4      8
5      6
6      6
7   -100
8   -200
9      3
dtype: int64
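When several values each need their own replacement, replace also accepts a dict, which keeps every old value next to its new value. This is a minimal sketch on a hand-written Series (ser and replaced are illustrative names and values).

```python
import pandas as pd

ser = pd.Series([1, 4, 8, 6, 4])

# Each dict key is a value to find, each dict value is its replacement
replaced = ser.replace({4: -100, 8: -200})
print(replaced.tolist())  # [1, -100, -200, 6, -100]
```

The dict form is equivalent to passing two lists of the same length, but it avoids having to keep the two lists aligned by position.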

