How to merge and concatenate data with pandas

2025-07-04 Update. From: SLTechnology News&Howtos

Shulou(Shulou.com)06/02 Report--

This article explains in detail how to merge and concatenate data with pandas. It is a practical topic, so it is shared here as a reference; hopefully you will find it useful.

The merge, join, and concat methods in pandas handle data merging and concatenation. merge combines two DataFrames mainly on their common columns, join combines two DataFrames mainly on their indexes, and concat stacks Series or DataFrame objects along rows or columns.

1. Merge method

The merge method in pandas joins two DataFrames on a common column. Its main parameters are:

left/right: the DataFrame in the left / right position.

how: the type of merge. left: use only the keys from the left DataFrame; right: use only the keys from the right DataFrame; outer: use the union of the keys from both DataFrames; inner: use the intersection of the keys from both DataFrames. The default is 'inner'.

on: the column name(s) to join on. The column(s) must exist in both DataFrames.

left_on/right_on: the column names to join on in the left / right DataFrame; these can also be index level names, or arrays or lists with the same length as the DataFrame.

left_index/right_index: whether to use the index of the left / right DataFrame as the join key; True means yes.

sort: whether to sort the result by the join keys. The default is False.

suffixes: if both DataFrames have a column with the same name that is not a join key, suffixes sets the suffix added to each side's column name; it is usually a tuple or list.
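The left_on/right_on and suffixes parameters described above can be sketched as follows. This is a minimal illustration, not part of the original article: the orders and customers frames, their column names, and their values are all hypothetical.

```python
import pandas as pd

# Hypothetical frames (assumed for illustration): the key column has a
# different name on each side, and both sides carry a 'note' column.
orders = pd.DataFrame({'customer_id': [1, 2, 2, 3],
                       'amount': [10.0, 20.0, 5.0, 7.5],
                       'note': ['a', 'b', 'c', 'd']})
customers = pd.DataFrame({'id': [1, 2, 4],
                          'name': ['Ann', 'Bob', 'Dee'],
                          'note': ['x', 'y', 'z']})

# left_on/right_on name the key column on each side;
# suffixes disambiguates the non-key 'note' column found in both frames.
merged = pd.merge(orders, customers, how='left',
                  left_on='customer_id', right_on='id',
                  suffixes=('_order', '_customer'))
print(merged)
```

Because the key columns have different names, both customer_id and id are kept in the result; the order with customer_id 3 has no matching customer, so its name is NaN.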

merge selects how the two DataFrames are joined through the how parameter: inner join, outer join, left join, or right join. The following examples show what each join means.

1.1 Inner join

With how='inner', the DataFrames are combined with an inner join: the result is built from the intersection of the values in the common column, and the on parameter names that common column.

# Single-column inner join
import pandas as pd
import numpy as np

# Define df1 (alpha and feature1 values are assumed for illustration:
# six rows, with 'B' appearing twice)
df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                    'feature1': [1, 1, 2, 3, 3, 1],
                    'feature2': ['low', 'medium', 'medium', 'high', 'low', 'high']})
# Define df2 (alpha values are assumed for illustration:
# four rows, with 'A' appearing twice)
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                    'pazham': ['apple', 'orange', 'pine', 'pear'],
                    'kilo': ['high', 'low', 'high', 'medium'],
                    'price': np.array([5, 6, 5, 7])})
# print(df1)
# print(df2)
# Inner join on the common column alpha
df3 = pd.merge(df1, df2, how='inner', on='alpha')
df3

The result keeps only the rows whose alpha value appears in both DataFrames, that is, the intersection of the alpha values.

1.2 Outer join

With how='outer', the DataFrames are combined with an outer join: the result is built from the union of the values in the common column, and the on parameter names that common column.

# Single-column outer join
# Define df1 (alpha and feature1 values are assumed for illustration)
df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                    'feature1': [1, 1, 2, 3, 3, 1],
                    'feature2': ['low', 'medium', 'medium', 'high', 'low', 'high']})
# Define df2 (alpha values are assumed for illustration)
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                    'pazham': ['apple', 'orange', 'pine', 'pear'],
                    'kilo': ['high', 'low', 'high', 'medium'],
                    'price': np.array([5, 6, 5, 7])})
# Outer join on the common column alpha
df4 = pd.merge(df1, df2, how='outer', on='alpha')
df4

For rows whose alpha value appears in only one of the two DataFrames, the columns that come from the other DataFrame are set to NaN.
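To see which side each row of an outer join came from, the indicator parameter of merge can help. The sketch below uses small hypothetical frames (left and right are names chosen for this example, not from the article) and should be read as an illustration only.

```python
import pandas as pd

# Hypothetical frames (assumed for illustration)
left = pd.DataFrame({'alpha': ['A', 'B', 'C'], 'feature1': [1, 2, 3]})
right = pd.DataFrame({'alpha': ['B', 'C', 'D'], 'price': [5, 6, 7]})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only', or 'both' for each row.
outer = pd.merge(left, right, how='outer', on='alpha', indicator=True)
print(outer)
```

Rows marked left_only have NaN in the columns taken from the right frame, and rows marked right_only have NaN in the columns taken from the left frame.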

1.3 Left join

With how='left', the DataFrames are combined with a left join: the result is built from the key values of the left DataFrame, and the on parameter names the common column.

# Single-column left join
# Define df1 (alpha and feature1 values are assumed for illustration)
df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                    'feature1': [1, 1, 2, 3, 3, 1],
                    'feature2': ['low', 'medium', 'medium', 'high', 'low', 'high']})
# Define df2 (alpha values are assumed for illustration)
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                    'pazham': ['apple', 'orange', 'pine', 'pear'],
                    'kilo': ['high', 'low', 'high', 'medium'],
                    'price': np.array([5, 6, 5, 7])})
# Left join on the common column alpha
df5 = pd.merge(df1, df2, how='left', on='alpha')
df5

Because the join column alpha of df2 has two 'A' values, the left-joined df5 has two rows with 'A'. Rows of df1 whose alpha value does not appear in df2 have NaN in the columns that come from df2.
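The row duplication caused by repeated keys can be checked directly. The following sketch uses hypothetical frames and also shows the validate parameter of merge, which is not discussed in the article but can be used to raise an error when keys are unexpectedly duplicated.

```python
import pandas as pd

# Hypothetical frames (assumed for illustration): 'A' is duplicated on the right
left = pd.DataFrame({'alpha': ['A', 'B'], 'feature1': [1, 2]})
right = pd.DataFrame({'alpha': ['A', 'A', 'C'], 'price': [5, 6, 7]})

# 'A' matches two rows on the right, so it appears twice in the result;
# 'B' has no match, so its price is NaN.
joined = pd.merge(left, right, how='left', on='alpha')
print(joined)

# validate='one_to_one' asks pandas to raise MergeError
# if the join keys are not unique on both sides.
try:
    pd.merge(left, right, how='left', on='alpha', validate='one_to_one')
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True
print(duplicate_keys_detected)
```

Whether to guard a merge with validate is a design choice; it is mainly useful when duplicated keys would indicate a data problem rather than an intended one-to-many relationship.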

1.4 Right join

With how='right', the DataFrames are combined with a right join: the result is built from the key values of the right DataFrame, and the on parameter names the common column.

# Single-column right join
# Define df1 (alpha and feature1 values are assumed for illustration)
df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                    'feature1': [1, 1, 2, 3, 3, 1],
                    'feature2': ['low', 'medium', 'medium', 'high', 'low', 'high']})
# Define df2 (alpha values are assumed for illustration)
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                    'pazham': ['apple', 'orange', 'pine', 'pear'],
                    'kilo': ['high', 'low', 'high', 'medium'],
                    'price': np.array([5, 6, 5, 7])})
# Right join on the common column alpha
df6 = pd.merge(df1, df2, how='right', on='alpha')
df6

Because the join column alpha of df1 has two 'B' values, the right-joined df6 has two rows with 'B'. Rows of df2 whose alpha value does not appear in df1 have NaN in the columns that come from df1.

1.5 Joins on multiple columns

A join on multiple columns works the same way as a join on a single column. This section only covers the inner join and the right join on multiple columns; readers can write the outer join and the left join themselves by following the same pattern.

Inner join on multiple columns:
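For readers who want a starting point, the left and outer joins on multiple columns can be sketched as below. The frames are small hypothetical ones chosen for this example, not the article's data.

```python
import pandas as pd

# Hypothetical frames (assumed for illustration);
# only the key pair ('B', 'b') exists on both sides.
left = pd.DataFrame({'alpha': ['A', 'B', 'B'],
                     'beta': ['a', 'a', 'b'],
                     'feature1': [1, 2, 3]})
right = pd.DataFrame({'alpha': ['B', 'B', 'F'],
                      'beta': ['b', 'c', 'f'],
                      'price': [5, 6, 7]})

# A row matches only when BOTH alpha and beta are equal.
left_join = pd.merge(left, right, how='left', on=['alpha', 'beta'])
outer_join = pd.merge(left, right, how='outer', on=['alpha', 'beta'])
print(left_join)
print(outer_join)
```

The left join keeps the three rows of left, with price filled only for the matching pair; the outer join keeps every distinct (alpha, beta) pair from either side.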

# Inner join on multiple columns
# Define df1 (alpha, beta and feature1 values are assumed for illustration)
df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                    'beta': ['a', 'a', 'b', 'c', 'c', 'e'],
                    'feature1': [1, 1, 2, 3, 3, 1],
                    'feature2': ['low', 'medium', 'medium', 'high', 'low', 'high']})
# Define df2 (alpha and beta values are assumed for illustration)
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                    'beta': ['d', 'd', 'b', 'f'],
                    'pazham': ['apple', 'orange', 'pine', 'pear'],
                    'kilo': ['high', 'low', 'high', 'medium'],
                    'price': np.array([5, 6, 5, 7])})
# Inner join on the common columns alpha and beta
df7 = pd.merge(df1, df2, on=['alpha', 'beta'], how='inner')
df7

Right join on multiple columns:

# Right join on multiple columns
# Define df1 (alpha, beta and feature1 values are assumed for illustration)
df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                    'beta': ['a', 'a', 'b', 'c', 'c', 'e'],
                    'feature1': [1, 1, 2, 3, 3, 1],
                    'feature2': ['low', 'medium', 'medium', 'high', 'low', 'high']})
# Define df2 (alpha and beta values are assumed for illustration)
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                    'beta': ['d', 'd', 'b', 'f'],
                    'pazham': ['apple', 'orange', 'pine', 'pear'],
                    'kilo': ['high', 'low', 'high', 'medium'],
                    'price': np.array([5, 6, 5, 7])})
print(df1)
print(df2)
# Right join on the common columns alpha and beta
df8 = pd.merge(df1, df2, on=['alpha', 'beta'], how='right')
df8

1.6 Joins on the index

The sections above covered joins on columns; merge can also join DataFrames on their index.

# Inner join between a column and an index
# Define df1 (alpha, beta and feature1 values are assumed for illustration)
df1 = pd.DataFrame({'alpha': ['A', 'B', 'B', 'C', 'D', 'E'],
                    'beta': ['a', 'a', 'b', 'c', 'c', 'e'],
                    'feature1': [1, 1, 2, 3, 3, 1],
                    'feature2': ['low', 'medium', 'medium', 'high', 'low', 'high']})
# Define df2 (alpha and index values are assumed for illustration)
df2 = pd.DataFrame({'alpha': ['A', 'A', 'B', 'F'],
                    'pazham': ['apple', 'orange', 'pine', 'pear'],
                    'kilo': ['high', 'low', 'high', 'medium'],
                    'price': np.array([5, 6, 5, 7])},
                   index=['d', 'd', 'b', 'f'])
print(df1)
print(df2)
# Join the beta column of df1 with the index of df2
df9 = pd.merge(df1, df2, how='inner', left_on='beta', right_index=True)
df9

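The result is an inner join between the beta column of df1 and the index of df2. Because both DataFrames have an alpha column that is not a join key, pandas appends the default suffixes _x and _y to tell the two columns apart.

Set the suffixes parameter to change the suffixes added to same-named columns that are not join keys.

# Inner join between the beta column of df1 and the index of df2, with custom suffixes
df9 = pd.merge(df1, df2, how='inner', left_on='beta', right_index=True,
               suffixes=('_df1', '_df2'))
df9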

2. Join method

The join method joins DataFrames on their indexes, whereas merge joins mainly on columns. join supports the same join types as merge: inner, outer, left, and right.

Joining index to index:

caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                       'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})
print(caller)
print(other)
# lsuffix and rsuffix set the suffixes for column names found in both DataFrames
caller.join(other, lsuffix='_caller', rsuffix='_other', how='inner')

join can also join on a column, by first setting that column as the index:

caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                       'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})
print(caller)
print(other)
# Join on the key column
caller.set_index('key').join(other.set_index('key'), how='inner')

join and merge therefore work in similar ways, so join is not covered further here; the merge method is recommended.
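The similarity between the two methods can be illustrated with a small sketch. It assumes two hypothetical frames that already use the key as their index, and compares an index-to-index join with the equivalent merge call using left_index and right_index.

```python
import pandas as pd

# Hypothetical frames (assumed for illustration), keyed by their index
caller = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['K0', 'K1', 'K2'])
other = pd.DataFrame({'B': ['B0', 'B1']}, index=['K0', 'K2'])

# Index-to-index inner join written with join ...
via_join = caller.join(other, how='inner')
# ... and the same join written with merge
via_merge = pd.merge(caller, other, how='inner',
                     left_index=True, right_index=True)

print(via_join)
print(via_join.equals(via_merge))
```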

3. Concat method

The concat method concatenates pandas objects along rows or along columns. Row-wise concatenation is the default, and the default join is outer (union). The objects being concatenated must be pandas objects (Series or DataFrame).

3.1 Concatenating Series

Row-wise concatenation:

# Index labels are assumed for illustration: the two Series share 'i2' and 'i3'
df1 = pd.Series([1.1, 2.2, 3.3], index=['i1', 'i2', 'i3'])
df2 = pd.Series([4.4, 5.5, 6.6], index=['i2', 'i3', 'i4'])
print(df1)
print(df2)
# Row-wise concatenation
pd.concat([df1, df2])

If the row-wise result contains duplicate index labels, the keys parameter can add an outer index level that records which input each row came from.

# Row-wise concatenation, grouped by an outer index level
pd.concat([df1, df2], keys=['fea1', 'fea2'])

Column-wise concatenation:

By default the result is the union of the indexes:
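Once keys has added an outer level, the result has a two-level index and each group can be selected by its key. A minimal sketch, using hypothetical Series named s1 and s2 with assumed index labels:

```python
import pandas as pd

# Hypothetical Series (assumed for illustration) with overlapping labels
s1 = pd.Series([1.1, 2.2, 3.3], index=['i1', 'i2', 'i3'])
s2 = pd.Series([4.4, 5.5, 6.6], index=['i2', 'i3', 'i4'])

# keys adds an outer index level: ('fea1', 'i1'), ..., ('fea2', 'i4')
grouped = pd.concat([s1, s2], keys=['fea1', 'fea2'])

# Select everything that came from the second input
print(grouped.loc['fea2'])
# Select a single value by (key, original label)
print(grouped.loc[('fea1', 'i2')])
```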

# Column-wise concatenation; the default is union (outer)
pd.concat([df1, df2], axis=1)

To keep only the intersection of the indexes:

# Column-wise concatenation (intersection)
pd.concat([df1, df2], axis=1, join='inner')

To set the column names of the result:

# Column-wise concatenation (intersection), with column names set through keys
pd.concat([df1, df2], axis=1, join='inner', keys=['fea1', 'fea2'])

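To concatenate on a specified index, reindex the result. Older pandas versions accepted a join_axes argument for this, but it was removed in pandas 1.0:

# Column-wise concatenation restricted to the index labels i1, i2, i3
pd.concat([df1, df2], axis=1).reindex(['i1', 'i2', 'i3'])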

3.2 Concatenating DataFrames

Row-wise concatenation:

df1 = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
df2 = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'B': ['B0', 'B1', 'B2']})
print(df1)
print(df2)
# Row-wise concatenation
pd.concat([df1, df2])
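Row-wise concatenation keeps each input's original row labels and takes the union of the columns. The sketch below, on smaller hypothetical frames, shows the repeated labels and one possible way to renumber them with the ignore_index parameter, which the article does not cover.

```python
import pandas as pd

# Hypothetical frames (assumed for illustration)
df1 = pd.DataFrame({'key': ['K0', 'K1'], 'A': ['A0', 'A1']})
df2 = pd.DataFrame({'key': ['K0'], 'B': ['B0']})

# The original row labels are kept, so label 0 appears twice;
# columns missing from one input are filled with NaN.
stacked = pd.concat([df1, df2])
print(stacked)

# ignore_index=True discards the original labels and renumbers the rows
renumbered = pd.concat([df1, df2], ignore_index=True)
print(renumbered)
```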

Column-wise concatenation:

# Column-wise concatenation
pd.concat([df1, df2], axis=1)

With verify_integrity=True, concat raises an error if the concatenated axis would contain duplicate labels, such as a repeated column name:

# Check for duplicate column names and raise an error if any are found
pd.concat([df1, df2], axis=1, verify_integrity=True)

ValueError: Indexes have overlapping values: ['key']
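The check can be wrapped in a try/except block, and the overlap avoided by moving the shared column into the index before concatenating. This is only a sketch on small hypothetical frames; other ways of resolving the duplicate column are equally valid.

```python
import pandas as pd

# Hypothetical frames (assumed for illustration); both contain a 'key' column
df1 = pd.DataFrame({'key': ['K0', 'K1'], 'A': ['A0', 'A1']})
df2 = pd.DataFrame({'key': ['K0', 'K1'], 'B': ['B0', 'B1']})

# verify_integrity=True raises ValueError because 'key' would appear twice
try:
    pd.concat([df1, df2], axis=1, verify_integrity=True)
    overlap_found = False
except ValueError as err:
    overlap_found = True
    print(err)

# One possible fix: use 'key' as the index on both sides, then concatenate
combined = pd.concat([df1.set_index('key'), df2.set_index('key')], axis=1)
print(combined)
```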

This concludes the article on how to merge and concatenate data in pandas. Hopefully it is helpful; if you found it useful, please share it so that more people can see it.
