Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

What are the Pandas data merge functions in Python

2025-01-15 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Share

Shulou(Shulou.com)06/01 Report--

This article mainly introduces what are the Pandas data merge functions in Python. It is very detailed and has a certain reference value. Friends who are interested must finish reading it.

1. Concat

Concat is a function dedicated to data connection merging in pandas. It is very powerful and supports vertical merging and horizontal merging. By default, vertical merging can be set through parameters.

Pd.concat (objs: 'Iterable [NDFrame] | Mapping [Hashable, NDFrame]', axis=0, join='outer', ignore_index: 'bool' = False, keys=None, levels=None, names=None, verify_integrity:' bool' = False, sort: 'bool' = False, copy:' bool' = True,)-> 'FrameOrSeriesUnion'

In the function method, the meanings of the parameters are as follows:

Objs: data for connection, which can be a list of DataFrame or Series

Axis=0: the way to connect. Default is 0, which means vertical connection. Optional 1 is horizontal connection.

Join='outer': merging method. Default is inner, which means intersection. Optional outer is union.

Ignore_index: whether to keep the original index

Keys=None: join relationship, using the passed value as the primary index

Levels=None: used to construct multi-level indexes

Names=None: the name of the index

Verify_integrity: check whether the index is duplicated. If it is True, an error will be reported if the index is duplicated.

Sort: sort columns in union and merge mode

Copy: whether to make a deep copy

Next, we will demonstrate the function of this function.

Basic connection

In [1]: import pandas as pdIn [2]: S1 = pd.Series (['asides,' b']) In [3]: S2 = pd.Series (['croup,' d']) In [4]: s1Out [4]: 0A1 bdtype: objectIn [5]: s2Out [5]: 0c1 ddtype: objectIn [6]: pd.concat ([S1, S2]) Out [6]: 0a1 b0c1 ddtype: objectIn [7]: df1 = pd.DataFrame ([['a'') 1], ['baked, 2]],...: columns= [' letter', 'number']) In [8]: df2 = pd.DataFrame ([' letter', 'number']],...: columns= [' letter', 'number']) In [9]: pd.concat ([df1]) Df2]) Out [9]: letter number0 a 11 b 20 c 31 d 4

Horizontal connection

In [10]: pd.concat ([df1, df2], axis=1) Out [10]: letter number letter number0 a 1 c 31 b 2 d 4

By default, concat is union, and if one of the two data does not correspond to a row or column, it is populated with a null value of NaN.

Merge intersection

In [11]: df3 = pd.DataFrame ([['letter', 3,' cat'], ['dumped, 4,' dog']],...: columns= ['letter',' number'] 'animal']) In [12]: df1Out [12]: letter number0 a 11 b 2In [13]: df3Out [13]: letter number animal0 c 3 cat1 d 4 dogIn [14]: pd.concat ([df1, df3], join='inner') Out [14]: letter number0 a 11 b 20 c 31 d 4

Index reset (do not retain the original index)

In [15]: pd.concat ([df1, df3], join='inner', ignore_index=True) Out [15]: letter number0 a 11 b 22 c 33 d equivalent to the above output In [16]: pd.concat ([df1, df3] Join='inner') .reset_index (drop=True) Out [16]: letter number0 a 11 b 22 c 33 d 4

Specify index

In [17]: pd.concat ([df1, df3], keys= ['df1','df3']) Out [17]: letter number animaldf1 0 a 1 NaN 1 b 2 NaNdf3 0 c 3 cat 1 d 4 dogIn [18]: pd.concat ([df1, df3], keys= [' df1','df3'], names= ['df name' 'row ID']) Out [18]: letter number animaldf name line ID df1 0 a 1 NaN 1 b 2 NaNdf3 0 c 3 cat 1 d 4 dog

Detect duplicates

If the index is duplicated, it fails the test and an error is reported.

In [19]: pd.concat ([df1, df3], verify_integrity=True) Traceback (most recent call last):... ValueError: Indexes have overlapping values: Int64Index ([0,1], dtype='int64')

Columns sorting under merge and union

In [21]: pd.concat ([df1, df3], sort=True) Out [21]: animal letter number0 NaN a 11 NaN b 20 cat c 31 dog d 4

Merger of DataFrame and Series

In [22]: pd.concat ([df1, S1]) Out [22]: letter number 00 a 1.0 NaN1 b 2.0 NaN0 NaN NaN A1 NaN NaN bIn [23]: pd.concat ([df1, S1] Axis=1) Out [23]: letter number 00a1a1b 2b# In [24]: df1.assign (new column = S1) Out [24]: letter number new column 0a1a1b 2 bIn [25]: df1 ['new column'] = s1In [26]: df1Out [26]: letter number new column 0a 1 a1 b 2 b

These are some of the functions of the concat function method. By contrast, another function, append, can also be used for data appending (vertical merging).

2. Append

Append is mainly used to append data, which is a relatively simple and direct way of data merging.

Df.append (other, ignore_index: 'bool' = False, verify_integrity:' bool' = False, sort: 'bool' = False,)->' DataFrame'

In the function method, the meanings of the parameters are as follows:

Other: data for append, which can be DataFrame or Series or a list of components

Ignore_index: whether to keep the original index

Verify_integrity: check whether the index is duplicated. If it is True, an error will be reported if the index is duplicated.

Sort: sort columns in union and merge mode

Next, we will demonstrate the function of this function.

Basic addition

In [41]: df1.append (df2) Out [41]: letter number0 a 11 b 20 c 31 d 4In [42]: df1.append ([df1,df2] Df3]) Out [42]: letter number animal0 a 1 NaN1 b 2 NaN0 a 1 NaN1 b 2 NaN0 c 3 NaN1 d 4 NaN0 c 3 cat1 d 4 dog

Columns reset (do not retain the original index)

In [43]: df1.append ([df1,df2,df3], ignore_index=True) Out [43]: letter number animal0 a 1 NaN1 b 2 NaN2 a 1 NaN3 b 2 NaN4 c 3 NaN5 d 4 NaN6 c 3 cat7 d 4 dog

Detect duplicates

If the index is duplicated, it fails the test and an error is reported.

In [44]: df1.append ([df1,df2], verify_integrity=True) Traceback (most recent call last):... ValueError: Indexes have overlapping values: Int64Index ([0,1], dtype='int64')

Index sort

In [46]: df1.append ([df1,df2,df3], sort=True) Out [46]: animal letter number0 NaN a 11 NaN b 20 NaN a 11 NaN b 20 NaN c 31 NaN d 40 cat c 31 dog d 4

Append Series

In [49]: s = pd.Series ({'letter':'s1','number':9}) In [50]: sOut [50]: letter s1number 9dtype: objectIn [51]: df1.append (s) Traceback (most recent call last):... TypeError: Can only append a Series if ignore_index=True or if the Series has a nameIn [53]: df1.append (s, ignore_index=True) Out [53]: letter number0 a 11 b 22 S19

Additional dictionary

This works better when crawling. Every time a piece of data is crawled, it is merged into DataFrame-like data and stored.

In [54]: dic = {'letter':'s1','number':9} In [55]: df1.append (dic, ignore_index=True) Out [55]: letter number0 a 11 b 22 s 93. Merge

The merge function method is similar to join in SQL and can be pd.merge or df.merge, except that the data to be merged in the latter is

Pd.merge (left: 'DataFrame | Series', right:' DataFrame | Series', how: 'str' =' inner', on: 'IndexLabel | None' = None, left_on:' IndexLabel | None' = None, right_on: 'IndexLabel | None' = None, left_index:' bool' = False, right_index: 'bool' = False, sort:' bool' = False Suffixes: 'Suffixes' = (' _ x','_ y'), copy: 'bool' = True, indicator:' bool' = False, validate: 'str | None' = None,)->' DataFrame'

In the function method, the key parameters have the following meanings:

Left: left side data for connection

Right: data on the right side of the connection

How: data connection method. Default is inner. Optional outer, left and right

On: connect key fields, both in the left and right data need to exist, otherwise use left_on and right_on

Left_on: the key field of the data on the left for the connection

Right_on: the key field of the data on the right for the connection

Left_index: True indicates that the left index is a join key field

Right_index: True indicates that the right index is a join key field

Suffixes: 'Suffixes' = (' _ x _ names,'_ y'), which can be freely specified, that is, the column names will display the suffix after merging the same column names.

Indicator: whether to show the home source of a row of data after merging

Next, we will demonstrate the function of this function.

Basic merger

In [55]: df1 = pd.DataFrame ({'key': [' foo', 'bar',' bal'],...: 'value2': [1,2,3]}) In [56]: df2 = pd.DataFrame ({' key': ['foo',' bar', 'baz'],.:' value1': [5,6] 7]}) In [57]: df1.merge (df2) Out [57]: key value2 value10 foo 1 51 bar 2 6

Other connection methods

In [58]: df1.merge (df2, how='left') Out [58]: key value2 value10 foo 1 5.01 bar 2 6.02 bal 3 NaNIn [59]: df1.merge (df2, how='right') Out [59]: key value2 value10 foo 1.051 bar 2.0 62 baz NaN 7In [60]: df1.merge (df2) How='outer') Out [60]: key value2 value10 foo 1.05.01 bar 2.06.02 bal 3.0 NaN3 baz NaN 7.0In [61]: df1.merge (df2 How='cross') Out [61]: key_x value2 key_y value10 foo 1 foo 51 foo 1 bar 62 foo 1 baz 73 bar 2 foo 54 bar 2 bar 65 bar 2 baz 76 bal 3 foo 57 bal 3 bar 68 bal 3 baz 7

Specify connection key

You can specify a single connection key or multiple connection keys

In [62]: df1 = pd.DataFrame ({'lkey1': [' foo', 'bar',' bal'],...: 'lkey2': [' await, 'baked,' c'],...: 'value2': [1,2,3]}) In [63]: df2 = pd.DataFrame ({' rkey1': ['foo',' bar') 'baz'],...:' rkey2': ['await,' baked,'c'],...: 'value2': [5,6 7]}) In [64]: df1Out [64]: lkey1 lkey2 value20 foo a 11 bar b 22 bal c 3In [65]: df2Out [65]: rkey1 rkey2 value20 foo a 51 bar b 62 baz c 7In [66]: df1.merge (df2, left_on='lkey1') Right_on='rkey1') Out [66]: lkey1 lkey2 value2_x rkey1 rkey2 value2_y0 foo a 1 foo a 51 bar b 2 bar b 6In [67]: df1.merge (df2, left_on= ['lkey1','lkey2'], right_on= [' rkey1'] 'rkey2']) Out [67]: lkey1 lkey2 value2_x rkey1 rkey2 value2_y0 foo a 1 foo a 51 bar b 2 bar b 6

Specify the index as the key

Out [68]: df1.merge (df2, left_index=True, right_index=True) Out [68]: lkey1 lkey2 value2_x rkey1 rkey2 value2_y0 foo a 1 foo a 51 bar b 2 bar b 62 bal c 3 baz c 7

Set duplicate column suffix

In [69]: df1.merge (df2, left_on='lkey1', right_on='rkey1', suffixes= ['left', 'right']) Out [69]: lkey1 lkey2 value2 left rkey1 rkey2 value2 right 0 foo a 1 foo a 51 bar b 2 bar b 6

Connection indication

Add a new column to display data sources

In [70]: df1.merge (df2, left_on='lkey1', right_on='rkey1', suffixes= ['left', 'right'], how='outer' ...: indicator=True...:) Out [70]: lkey1 lkey2 value2 left rkey1 rkey2 value2 right _ merge0 foo a 1.0 foo a 5.0 both1 bar b 2.0 bar b 6.0 both2 bal c 3.0 NaN left_only3 NaN baz c 7.0 right_only4. Join

Join is a bit like append to concat, for data consolidation.

Df.join (other: 'FrameOrSeriesUnion', on:' IndexLabel | None' = None, how: 'str' =' left', lsuffix: 'str' ='', rsuffix: 'str' ='', sort: 'bool' = False,)->' DataFrame'

In the function method, the key parameters have the following meanings:

Other: right side data for merging

On: connect key fields, both in the left and right data need to exist, otherwise use left_on and right_on

How: data connection method. Default is inner. Optional outer, left and right

Lsuffix: the suffix of the same name on the left

Rsuffix: the suffix of the same name column on the right

Next, we will demonstrate the function of this function.

In [71]: df = pd.DataFrame ({'key': [' K0','K1','K2','K3','K4','K5'],...:'A3','A1','A2','A3','A4','A5']}) In [72]: other = pd.DataFrame ({'key': [' K0','K1') In [73]: dfOut [73]: key A0 K0 A01 K1 A12 K2 A23 K3 34 K4 A45 K5 A5In [74]: otherOut [74]: key B0K0B01 K1 B12 K2 B2In [75]: df.join (other, on='key') Traceback (most recent call last):... ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

If you want to use the key keyword, you need key to be an index.

Specify key

In [76]: df.set_index ('key') .join (other.set_index (' key')) Out [76]: a Bkey K0 A0 B0K1 A1 B1K2 A2 B2K3 A3 NaNK4 A4 NaNK5 A5 NaNIn [77]: df.join (other.set_index ('key'), on='key') Out [77]: key A B0 K0 A0 B01 K1 A1 B2 K23 K3 NaN4 K4 NaN5 K5 A5 NaN

Specify duplicate column suffix

In [78]: df.join (other, lsuffix='_ left', rsuffix=' right') Out [78]: key_ left A key right B0 K0 A0 K0 B01 K1 A1 K1 B12 K2 A2 K2 B23 K3 A3 NaN NaN4 K4 NaN NaN5 K5 A5 NaN NaN

Other parameters will not be introduced, which is basically the same as merge.

5. Combine

In the process of data merging, we may need to calculate the values of the corresponding positions. Pandas provides combine and combine_first function methods for cooperative operations in this respect.

Df.combine (other: 'DataFrame', func, fill_value=None, overwrite:' bool' = True,)-> 'DataFrame'

For example, when merging data, take the smallest value of the cell.

In [79]: df1 = pd.DataFrame ({'Aids: [0,0],' bundles: [4,4]}) In [80]: df2 = pd.DataFrame ({'Aids: [1,1],' bones: [3,3]}) In [81]: df1Out [81]: a B0041 10 4In [82]: df2Out [82]: a B01 31 1 3In [83]: take_smaller = lambda S1, S2: S1 if s1.sum ()

< s2.sum() else s2In [84]: df1.combine(df2, take_smaller)Out[84]: A B0 0 31 0 3# 也可以调用numpy的函数In [85]: import numpy as npIn [86]: df1.combine(df2, np.minimum)Out[86]: A B0 0 31 0 3 fill_value填充缺失值 In [87]: df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})In [87]: df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})In [88]: df1Out[88]: A B0 0 NaN1 0 4.0In [89]: df2Out[89]: A B0 1 31 1 3In [90]: df1.combine(df2, take_smaller, fill_value=-88)Out[90]: A B0 0 -88.01 0 4.0 overwrite=False保留 In [91]: df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})In [92]: df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1], }, index=[1, 2])In [93]: df1Out[93]: A B0 0 41 0 4In [94]: df2Out[94]: B C1 3 -102 3 1In [95]: df1.combine(df2, take_smaller)Out[95]: A B C0 NaN NaN NaN1 NaN 3.0 -10.02 NaN 3.0 1.0# 保留A列原有的值In [96]: df1.combine(df2, take_smaller, overwrite=False)Out[96]: A B C0 0.0 NaN NaN1 0.0 3.0 -10.02 NaN 3.0 1.0 另外一个combine_first df.combine_first(other: 'DataFrame') ->

'DataFrame'

When the element in df is empty, it is replaced by the one in other, and the result is union merging.

In [97]: df1 = pd.DataFrame ({'Aids: [None, 0],' Barrier: [None, 4]}) In [98]: df2 = pd.DataFrame ({'Aids: [1,1],' baked: [3] 3]}) In [99]: df1Out [99]: a B0 NaN NaN1 0.0 4.0In [100]: df2Out [100]: a B0 1 31 11 3In [101]: df1.combine_first (df2) Out [101]: a B0 1.03.01 0.0 4.0In [102]: df1 = pd.DataFrame ({'Aids: [None, 0],' Bread: [4] None]}) In [103]: df2 = pd.DataFrame ({'Barrier: [3,3],' index=: [1,1]}, index= [1] In: df1Out: a B0 NaN 4.01 0.0 NaNIn: df2Out: BC1 1312 3 1In: df1.combine_first (df2) Out: a B C 0 NaN 4.0 NaN1 0.0 1.02 NaN 3.0 is all the content of the article "what are the Pandas data merging functions in Python" Thank you for reading! Hope to share the content to help you, more related knowledge, welcome to follow the industry information channel!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Development

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report