How to realize merging and splicing in python pandas 07/16 Update SLTechnology News&Howtos

How to realize merging and splicing in python pandas

2025-07-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Development >

Shulou(Shulou.com)06/01 Report--

Editor to share with you how to achieve merging and splicing in python pandas. I hope you will get something after reading this article. Let's discuss it together.

Preface

In many applications, the data may come from different channels, and these data sets often need to be combined and spliced in the process of data processing to form more abundant data sets. Pandas provides a variety of methods to meet the common needs of data processing. Specifically, it includes join, merge, concat, append and so on.

Generally speaking

The method shows that join is the simplest, mainly used for horizontal merging merge based on index, and horizontal merging concat based on specified columns is the most powerful for main users. Horizontal and vertical merging append is mainly used to vertically append combine_first merge overlapping data, fill the missing value update to update the value of one data set to another data set.

Let's introduce each method one by one.

1. Join

Join is mainly used for horizontal merging and splicing based on index.

Before introducing pandas's join, let's take a look at several patterns of SQL versus dataset join. If you are familiar with SQL, you should be impressed by the various merging and stitching of SQL operational datasets. The methods of various JOIN in SQL are as follows:

Pandas's join implements four join methods commonly used by left join, right jion, inner join and out jion.

Description of parameters from the official website:

Dataframe.join (other, # another dataset to be merged on=None, # key how='left', # connection method: 'left',' right', 'outer',' inner' defaults to the suffix rsuffix='' of the same key in the dataset on the left (first) of left lsuffix='', # # suffix sort=False of the key of the second dataset) # whether to sort by the key of the connection Default False

Let's take a look at an example. There are two data sets, one is the name of the person, the other is the salary of the person.

Left=pd.DataFrame (['Zhang San','Li Si', 'Wang Wu', 'Zhao Liu', 'Qian Qi'), index= [3LEC4, 5je, 6], columns= ['name']) right=pd.DataFrame ([13000, 15000, 9000, 8600, 10000], index= [3pd4, 5pyr8], columns= ['salary])

Note that index is specified for both left and right datasets, because join is mainly used for index-based horizontal merge stitching.

1. Left joinleft.join (right) # default how='left'

The jion operation defaults to the operation of left jion. You can see that the left index is 7 and the name is Qian Qi. There is no corresponding index 7 in right, so the name of left is displayed, but the salary of right is NaN,right. The data with index 8 in left is not displayed, so it is not displayed. Left join merges data from left

Left join is shown in the following figure

2. Right joinleft.join (right,how='right')

When you merge the right link, you can see that the left dataset does not have an item with an index of 8, so the item with an index of 8 displays the salary data of the right dataset with the name NaN, and the item with index 7 in left is not displayed because it does not exist in the right. Right join merges data from right

Right join is shown in the following figure

3. Inner joinleft.join (right,how='inner')

When the inner link is merged, you can see that the index in the left dataset is 7 and the name is Qian Qi because the corresponding index cannot be found in the right dataset, and the index 8 in the right dataset cannot find the corresponding index in left, so when the inner link is merged, indexes 7 and 8 are not merged. Inner join merges only the data shared by the two datasets.

Inner join is shown in the following figure

4. Out joinleft.join (right,how='outer')

When external links are merged, you can see that both the data in left and the data in right are merged. Right join merges all data in two datasets.

Outer join is shown in the following figure

Join is simple, but it has limitations because it can only be merged based on indexes. You cannot specify a key to merge. For example, if I want to merge according to the number and name, join is more difficult. But pandas provides a way to merge, and you can specify columns for merge stitching.

II. Merge

Merge is the most commonly used. The main users are based on specified columns and horizontal merge and stitching. The syntax is as follows:

The pd.merge (left, right, how='inner', on=None,left_ on=None, right_on=None,left_index=False, right_index=False, sort=True,suffixes= ('_ x','_ y'), copy=True) parameter name indicates that left/right is two different DataFrame objects. On specifies the key used for the connection (that is, the name of the column label), which must exist in both the left and right DataFrame. If it is not specified and other parameters are not specified, then the intersection of the column names of the two DataFrame will be used as the join key. Left_on specifies the name of the column in the left DataFrame that acts as the join key. This parameter is useful when the left and right column signatures are different, but the meaning is the same. Right_on specifies the name of the column in the left DataFrame that acts as the join key. Left_ index Boolean parameter, which defaults to False. If True, the row index of the left DataFrame is used as the join key, and if the DataFrame has a multi-tier index (MultiIndex), the number of layers must be equal to the number of join keys. Right_ index Boolean parameter, which defaults to False. If True, the row index of the left DataFrame is used as the join key. The type of merge to be performed by how, taking a value from {'left',' right', 'outer',' inner'}, which defaults to the connection within "inner". The sort Boolean parameter, which defaults to True, sorts the merged data; if set to False, it sorts by the parameter value given by how. A tuple of suffixes strings. When the same column name exists in the left and right DataFrame, this parameter allows you to append the suffix to the same column name, which defaults to ('_ x _ parallel _ correcy'). Copy defaults to True, which means that the data is copied.

Let's take a look at the following dataset, in which the employee number is added in the left dataset and the number and name are added in the right dataset. The index is indexed by default.

Left=pd.DataFrame ([[3dint 'Zhang San'], [4jing'Li Si'], [5pi 'Wang Wu'], [6pm 'Zhao Liu'], [7pm 'Qian Qi'], columns= ['No.', 'name']) right=pd.DataFrame ([[3dome 'Zhang San', 13000], [4pme'Li Si', 15000], [5pcr 'Wang Wu', 9000], [6m 'Zhao Liu', 8600], [8pi 'Sun Ba', 10000]], columns= ['serial number'] 'name', 'salary'])

Pd.merge (left,right)

No connection key is specified, overlapping column names are used by default, connection method is not specified, and the default connection within inner (take the intersection of left and right numbers and names)

Like join, you can specify the connection method through how, such as:

Pd.merge (left,right,how='left')

How is connected in the same way as join supports left, right, inner, outer

Merge can also specify multiple columns to merge links, that is, set multiple associated columns like SQL.

Pd.merge (left,right,how='outer',on= ['number', 'name'])

If the column names of the two objects are different, you can specify them separately using left_on,right_on. For example, if we change the "Encoding" column label of the right dataset to "ID", if we need to associate the "number" of the left dataset with the "ID" of the right dataset

Right=pd.DataFrame ([3jing 'Zhang San', 13000], [4dre'Li Si', 15000], [5dre 'Wang Wu', 9000], [6dre 'Zhao Liu', 8600], [8ling 'Sun Ba', 10000]], columns= ['ID',' name', 'salary']) pd.merge (left,right,how='outer',left_on=' number', right_on='ID')

Although merge is already powerful, pandas is willing to give you more. It provides concat, which can be merged and spliced horizontally and vertically. That is to say, not only join in SQL but also union is realized.

III. Concat

The concat () function is used to perform join operations along a specific axis, with the following syntax:

The name of the pd.concat (objs,axis=0,join='outer',join_axes=None,ignore_index=False) parameter indicates that objs is a sequence or a Series or DataFrame object. Axis indicates the axis direction (row or column) in which the join operation occurs, and the default axis=0 indicates the row direction. Join specifies the connection method. The value is {"inner", "outer"}. By default, outer represents union, and inner represents intersection. The ignore_index Boolean parameter, which defaults to False, and if True, indicates that the index is not used on the connected axis. Join_axes represents a list of indexed objects.

Let's take a look at specific examples.

Left2=pd.DataFrame ([[1Jing 'Chen Yi'], [2Jing 'Tuesday']], columns= ['number', 'name'])

1. Vertical merger

Concat default vertical stitching, we want to merge the left1 data set on the basis of the left2 data set, it is very simple to merge directly with concat.

Df=pd.concat ([left,left2])

2. Horizontal merge df_outer=pd.concat ([left,right], axis=1,join='outer') # external links df_inner=pd.concat ([left,right], axis=1,join='inner') # internal links

Note: because concat links are merged through indexes like join, and cannot be specified to be merged through a specific column, the numbers and names of the merged datasets left and right are misplaced.

If you want to associate according to the number, you can specify the number as the index and then merge horizontally, so there is no problem.

Left.index=left ['number'] .valuesright.index = right ['number'] .valuesdf_outer=pd.concat ([left,right], axis=1,join='outer') df_inner=pd.concat ([left,right], axis=1,join='inner')

IV. Append

Df.append can append other rows to the end of the caller and return a new object. It is the simplest and most commonly used method of data merging. The syntax is as follows:

Df.append (self, other, ignore_index=False,verify_integrity=False, sort=False)

Where:

Other is the other DataFrame or similar sequence content to be appended.

Ignore_index re-index naturally if it is True

Verify_integrity encountered repeated indexing if it is True. Time error.

Sort by sort

Take a look at the following example:

1. Append data with the same structure

Append the data of the same structure to the original data, and append the left2 dataset to the left dataset. The contents of the left2 dataset are as follows:

Left2=pd.DataFrame ([[1Jing 'Chen Yi'], [2Jing 'Tuesday']], columns= ['number', 'name']) left2

Left.append (left2)

2. Append data with different structures

If different structural data is appended, the columns without the original data will be increased, and the columns without corresponding content will be empty NaN.

For example, left3's dataset is listed as "number", "name" and "salary"

Left3=pd.DataFrame ([8Jing 'Sun Ba', 10000], [9JI'he Jiu', 15000], columns= ['serial number', 'name', 'salary']) left3

Left.append (left3)

When left3 is appended to left, the "salary column" is added to the data set, and it will be empty if there is no corresponding content.

3. Append and merge multiple datasets

The append parameter can take a list of datasets, and multiple datasets can be appended to the original dataset

For example, we append both left2 and left3 to left

Left.append ([left2,left3])

5. Combine_first

Combine_first can be used to merge duplicate data and populate data that is not available with other datasets. If there is missing data in one DataFrame dataset, it can be populated with data from other DataFrame datasets. The syntax format is as follows:

Combine_first (other) # has only one parameter, other, which is used to receive DataFrame objects populated with missing values.

If there is no "salary" data in the left dataset, we can use the data in the right dataset to populate the data in the left dataset.

Left.combine_first (right) # fill left with right

VI. Update

Update is similar to combine_first, except that:

1. The filling and merging methods are slightly different.

Combine_first: if the value of c in S1 is empty, replace it with the value of S2, otherwise keep the value of S1

Update: if the value in S2 is not empty, then replace S1, otherwise keep the value of S1

2. Update updates the original data, and combine_first returns a new filled data set without updating the original data.

Left.update (right) # updates the data in left with the data from right.

After reading this article, I believe you have some understanding of "how to merge and splice in python pandas". If you want to know more about it, you are welcome to follow the industry information channel. Thank you for reading!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.