This article explains grouping and aggregation in Pandas. It is quite practical, so it is shared here as a reference; follow along with the examples below.
One: grouping (groupby)
Split the data set into groups, then compute statistics on each group.
SQL can filter and aggregate data in groups.
Pandas can use groupby to perform more complex grouping operations.
Grouping follows a split -> apply -> combine process. split: partition the data into groups according to a key; apply: run a computation on each group; combine: merge the per-group results into a single object.
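To make the three steps concrete, here is a minimal sketch (using a small made-up DataFrame, not the df_obj built below) that performs split, apply, and combine by hand and compares the result with the groupby one-liner:
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'val': [1, 2, 3, 4]})

# split: iterate the groups; apply: compute each group's mean; combine: assemble a Series
manual = pd.Series({name: group['val'].mean() for name, group in df.groupby('key')})
print(manual)
print(df.groupby('key')['val'].mean())  # same values, computed in one step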
Sample code:
import pandas as pd
import numpy as np

dict_obj = {'key1': ['a', 'b', 'a', 'b',
                     'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three',
                     'two', 'two', 'one', 'three'],
            'data1': np.random.randn(8),
            'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)
Running result:
      data1     data2 key1   key2
0  0.974685 -0.672494    a    one
1 -0.214324  0.758372    b    one
2  1.508838  0.392787    a    two
3  0.522911  0.630814    b  three
4  1.347359 -0.177858    a    two
5 -0.264616  1.017155    b    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three
First, GroupBy objects: DataFrameGroupBy and SeriesGroupBy. 1. The grouping operation: groupby() performs the grouping. The GroupBy object does not run any actual computation; it only holds the intermediate grouping data. Group by column name: obj.groupby('label').
Sample code:
print(type(df_obj.groupby('key1')))  # the DataFrame is grouped by key1
print(type(df_obj['data1'].groupby(df_obj['key1'])))  # the data1 column is grouped by key1
Running result:
<class 'pandas.core.groupby.DataFrameGroupBy'>
<class 'pandas.core.groupby.SeriesGroupBy'>
2. Grouping computations: calling a computation such as mean() on the GroupBy object applies it to every group; non-numeric columns are excluded from the computation.
Sample code:
grouped1 = df_obj.groupby('key1')  # group the whole DataFrame
print(grouped1.mean())
grouped2 = df_obj['data1'].groupby(df_obj['key1'])  # group a single column
print(grouped2.mean())
Running result:
         data1     data2
key1
a     0.437389 -0.230101
b     0.014657  0.802114
key1
a    0.437389
b    0.014657
Name: data1, dtype: float64
size() returns the number of elements in each group.
Sample code:
print(grouped1.size())
print(grouped2.size())
Running result:
key1
a    5
b    3
dtype: int64
key1
a    5
b    3
dtype: int64
3. Grouping by a custom key: obj.groupby(self_def_key). The custom key can be a list, or a list of keys for multi-level grouping: obj.groupby(['label1', 'label2']) yields a DataFrame with a multi-level index.
Sample code:
self_def_key = [0, 1, 2, 3, 3, 4, 5, 7]  # group by a custom key (a list)
print(df_obj.groupby(self_def_key).size())
print(df_obj.groupby([df_obj['key1'], df_obj['key2']]).size())  # group by a custom key (a list of Series)
grouped2 = df_obj.groupby(['key1', 'key2'])  # group by multiple columns -> multi-level index
print(grouped2.size())
grouped3 = df_obj.groupby(['key2', 'key1'])  # the index levels follow the key order
print(grouped3.mean())
print(grouped3.mean().unstack())  # unstack converts the multi-level index result into a single-level DataFrame
Running result:
0    1
1    1
2    1
3    2
4    1
5    1
7    1
dtype: int64
key1  key2
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64
key1  key2
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64
               data1     data2
key2  key1
one   a     0.174988 -0.110804
      b    -0.214324  0.758372
three a    -1.019229 -1.143825
      b     0.522911  0.630814
two   a     1.428099  0.107465
      b    -0.264616  1.017155
          data1               data2
key1          a         b         a         b
key2
one    0.174988 -0.214324 -0.110804  0.758372
three -1.019229  0.522911 -1.143825  0.630814
two    1.428099 -0.264616  0.107465  1.017155
Second, GroupBy objects are iterable: each iteration yields a tuple (group_name, group_data), which can be used to operate on individual groups. 1. Single-level grouping
Sample code:
for group_name, group_data in grouped1:  # iterate the single-level grouping by key1
    print(group_name)
    print(group_data)
Running result:
a
      data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three
b
      data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two
2. Multi-level grouping
Sample code:
for group_name, group_data in grouped2:  # iterate the multi-level grouping by key1 and key2
    print(group_name)
    print(group_data)
Running result:
('a', 'one')
      data1     data2 key1 key2
0  0.974685 -0.672494    a  one
6 -0.624708  0.450885    a  one
('a', 'three')
      data1     data2 key1   key2
7 -1.019229 -1.143825    a  three
('a', 'two')
      data1     data2 key1 key2
2  1.508838  0.392787    a  two
4  1.347359 -0.177858    a  two
('b', 'one')
      data1     data2 key1 key2
1 -0.214324  0.758372    b  one
('b', 'three')
      data1     data2 key1   key2
3  0.522911  0.630814    b  three
('b', 'two')
      data1     data2 key1 key2
5 -0.264616  1.017155    b  two
Third, GroupBy objects can be converted into a list or a dict.
Sample code:
print(list(grouped1))        # convert the GroupBy object to a list
print(dict(list(grouped1)))  # convert the GroupBy object to a dict
Running result:
[('a',       data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three), ('b',       data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two)]
{'a':       data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three, 'b':       data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two}
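The dict form makes it convenient to look up one group's rows by name; the GroupBy method get_group offers the same lookup without materializing every group. A small sketch, reusing grouped1 from above:
groups = dict(list(grouped1))
print(groups['a'])              # rows of group 'a', taken from the dict
print(grouped1.get_group('a'))  # equivalent lookup directly on the GroupBy object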
1. Grouping by columns: the columns can be grouped by data type.
Sample code:
print(df_obj.dtypes)
print(df_obj.groupby(df_obj.dtypes, axis=1).size())  # group the columns by data type
print(df_obj.groupby(df_obj.dtypes, axis=1).sum())
Running result:
data1    float64
data2    float64
key1      object
key2      object
dtype: object
float64    2
object     2
dtype: int64
    float64  object
0  0.302191    aone
1  0.544048    bone
2  1.901626    atwo
3  1.153725  bthree
4  1.169501    atwo
5  0.752539    btwo
6 -0.173823    aone
7 -2.163054  athree
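As a side note, when the goal is only to select columns of a given data type rather than aggregate across them, select_dtypes is the more direct tool; a small sketch reusing df_obj:
print(df_obj.select_dtypes(include=['float64']))  # only the numeric columns
print(df_obj.select_dtypes(include=['object']))   # only the string columns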
2. Other grouping methods
Sample code:
df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['A', 'B', 'C', 'D', 'E'])
df_obj2.iloc[1, 1:4] = np.nan  # .ix is deprecated; .iloc does the positional assignment
print(df_obj2)
Running result:
   a    b    c    d  e
A  7  2.0  4.0  5.0  8
B  4  NaN  NaN  NaN  1
C  3  2.0  5.0  4.0  6
D  3  1.0  9.0  7.0  3
E  6  1.0  6.0  8.0  1
3. Group by dictionary
Sample code:
mapping_dict = {'a': 'Python', 'b': 'Python', 'c': 'Java', 'd': 'C', 'e': 'Java'}  # group the columns via a dict
print(df_obj2.groupby(mapping_dict, axis=1).size())
print(df_obj2.groupby(mapping_dict, axis=1).count())  # count of non-NaN values
print(df_obj2.groupby(mapping_dict, axis=1).sum())
Running result:
C         1
Java      2
Python    2
dtype: int64
   C  Java  Python
A  1     2       2
B  0     1       1
C  1     2       2
D  1     2       2
E  1     2       2
     C  Java  Python
A  5.0  12.0     9.0
B  NaN   1.0     4.0
C  4.0  11.0     5.0
D  7.0  12.0     4.0
E  8.0   7.0     7.0
4. Grouping by a function: the function is called with each row index (or column index), and its return value becomes the group key.
Sample code:
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['AA', 'BBB', 'CC', 'D', 'EE'])  # group by a function

def group_key(idx):
    """
    idx is a row index (or a column index)
    """
    # return idx
    return len(idx)

print(df_obj3.groupby(group_key).size())
# the custom function above is equivalent to
# df_obj3.groupby(len).size()
Running result:
1    1
2    3
3    1
dtype: int64
5. Grouping by index level
Sample code:
columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
                                     ['A', 'A', 'B', 'C', 'B']],
                                    names=['language', 'index'])  # group by index level
df_obj4 = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
print(df_obj4)
print(df_obj4.groupby(level='language', axis=1).sum())  # group by the language level
print(df_obj4.groupby(level='index', axis=1).sum())  # group by the index level
Running result:
language Python Java Python Java Python
index         A    A      B    C      B
0             2    7      8    4      3
1             5    2      6    1      2
2             6    4      4    5      2
3             4    7      4    3      1
4             7    4      3    4      8
language  Java  Python
0           11      13
1            3      13
2            9      12
3           10       9
4            8      18
index   A   B  C
0       9  11  4
1       7   8  1
2      10   6  5
3      11   5  3
4      11  11  4
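Grouping by index level works on the row index as well; a minimal sketch with a hypothetical Series carrying a two-level row index:
idx = pd.MultiIndex.from_arrays([['x', 'x', 'y', 'y'], [1, 2, 1, 2]],
                                names=['outer', 'inner'])
s = pd.Series([10, 20, 30, 40], index=idx)
print(s.groupby(level='outer').sum())  # x -> 30, y -> 70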
Two: aggregation
Aggregation is the process of producing a scalar from an array, e.g. mean() or count().
It is most often used to compute statistics on grouped data.
Sample code:
dict_obj = {'key1': ['a', 'b', 'a', 'b',
                     'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three',
                     'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)
Running result:
   data1  data2 key1   key2
0      3      7    a    one
1      1      5    b    one
2      7      4    a    two
3      2      4    b  three
4      6      4    a    two
5      9      9    b    two
6      3      5    a    one
7      8      4    a  three
1. Built-in aggregate functions: sum(), mean(), max(), min(), count(), size(), describe().
Sample code:
print(df_obj5.groupby('key1').sum())
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1').mean())
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5.groupby('key1').describe())
Running result:
      data1  data2
key1
a        27     24
b        12     18
      data1  data2 key2
key1
a         8      7  two
b         9      9  two
      data1  data2 key2
key1
a         3      4  one
b         1      4  one
      data1  data2
key1
a       5.4    4.8
b       4.0    6.0
key1
a    5
b    3
dtype: int64
      data1  data2  key2
key1
a         5      5     5
b         3      3     3
               data1     data2
key1
a    count  5.000000  5.000000
     mean   5.400000  4.800000
     std    2.302173  1.303840
     min    3.000000  4.000000
     25%    3.000000  4.000000
     50%    6.000000  4.000000
     75%    7.000000  5.000000
     max    8.000000  7.000000
b    count  3.000000  3.000000
     mean   4.000000  6.000000
     std    4.358899  2.645751
     min    1.000000  4.000000
     25%    1.500000  4.500000
     50%    2.000000  5.000000
     75%    5.500000  7.000000
     max    9.000000  9.000000
2. Custom aggregate functions: grouped.agg(func), where func receives the records belonging to each group.
Sample code:
def peak_range(df):
    """
    Return the range (max - min) of the values.
    """
    # print(type(df))  # the argument is the set of records for one group
    return df.max() - df.min()

print(df_obj5.groupby('key1').agg(peak_range))  # custom aggregate function
print(df_obj.groupby('key1').agg(lambda df: df.max() - df.min()))
Running result:
      data1  data2
key1
a         5      3
b         8      5
         data1     data2
key1
a     2.528067  1.594711
b     0.787527  0.386341
3. Applying multiple aggregate functions at once: pass a list of functions to agg().
Sample code:
# apply multiple aggregate functions at the same time
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', peak_range]))  # column names default to the function names
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', ('range', peak_range)]))  # a (name, function) tuple supplies a new column name
Running result:
         data1                                data2
          mean       std count peak_range      mean       std count peak_range
key1
a     0.437389  1.174151     5   2.528067 -0.230101  0.686488     5   1.594711
b     0.014657  0.440878     3   0.787527  0.802114  0.196850     3   0.386341
         data1                             data2
          mean       std count     range      mean       std count     range
key1
a     0.437389  1.174151     5  2.528067 -0.230101  0.686488     5  1.594711
b     0.014657  0.440878     3  0.787527  0.802114  0.196850     3  0.386341
4. Applying different aggregate functions to different columns: pass a dict to agg().
Sample code:
dict_mapping = {'data1': 'mean',
                'data2': 'sum'}  # apply a different aggregate function to each column
print(df_obj.groupby('key1').agg(dict_mapping))
dict_mapping = {'data1': ['mean', 'max'],
                'data2': 'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))
Running result:
         data1     data2
key1
a     0.437389 -1.150505
b     0.014657  2.406341
         data1               data2
          mean       max       sum
key1
a     0.437389  1.508838 -1.150505
b     0.014657  0.522911  2.406341
5. Commonly used built-in aggregate functions
Sample code:
import pandas as pd
import numpy as np

dict_obj = {'key1': ['a', 'b', 'a', 'b',
                     'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three',
                     'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)

# group by key1, sum data1 and data2, and prefix the aggregated column names
k1_sum = df_obj.groupby('key1').sum().add_prefix('sum_')
print(k1_sum)
Running result:
   data1  data2 key1   key2
0      5      1    a    one
1      7      8    b    one
2      1      9    a    two
3      2      6    b  three
4      9      8    a    two
5      8      3    b    two
6      3      5    a    one
7      8      3    a  three
      sum_data1  sum_data2
key1
a            26         26
b            17         17
Aggregation changes the shape of the original data. How can the aggregated result be kept aligned with the original shape? 1. merge: join the aggregated result back onto the original rows with merge; this works but is relatively cumbersome.
Sample code:
k1_sum_merge = pd.merge(df_obj, k1_sum, left_on='key1', right_index=True)  # method 1: use merge
print(k1_sum_merge)
Running result:
   data1  data2 key1   key2  sum_data1  sum_data2
0      5      1    a    one         26         26
2      1      9    a    two         26         26
4      9      8    a    two         26         26
6      3      5    a    one         26         26
7      8      3    a  three         26         26
1      7      8    b    one         17         17
3      2      6    b  three         17         17
5      8      3    b    two         17         17
2. transform: the result of transform keeps the shape of the original data, e.g. grouped.transform(np.sum).
Sample code:
# method 2: use transform, whose result is aligned with the original index
k1_sum_tf = df_obj.groupby('key1').transform(np.sum).add_prefix('sum_')
df_obj[k1_sum_tf.columns] = k1_sum_tf
print(df_obj)
Running result:
   data1  data2 key1   key2 sum_data1 sum_data2           sum_key2
0      5      1    a    one        26        26  onetwotwoonethree
1      7      8    b    one        17        17        onethreetwo
2      1      9    a    two        26        26  onetwotwoonethree
3      2      6    b  three        17        17        onethreetwo
4      9      8    a    two        26        26  onetwotwoonethree
5      8      3    b    two        17        17        onethreetwo
6      3      5    a    one        26        26  onetwotwoonethree
7      8      3    a  three        26        26  onetwotwoonethree
You can also pass in a custom function
Sample code:
def diff_mean(s):
    """
    Return the difference between each value and its group mean.
    """
    return s - s.mean()

print(df_obj.groupby('key1').transform(diff_mean))  # a custom function passed to transform
Running result:
      data1     data2  sum_data1  sum_data2
0 -0.200000 -4.200000          0          0
1  1.333333  2.333333          0          0
2 -4.200000  3.800000          0          0
3 -3.666667  0.333333          0          0
4  3.800000  2.800000          0          0
5  2.333333 -2.666667          0          0
6 -2.200000 -0.200000          0          0
7  2.800000 -2.200000          0          0
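A common use of this pattern is per-group standardization: because transform preserves the original shape, each value can be replaced by its z-score within its own group. A minimal sketch, reusing df_obj5 from the aggregation examples:
def zscore(s):
    # standardize within the group: subtract the group mean, divide by the group std
    return (s - s.mean()) / s.std()

print(df_obj5.groupby('key1')[['data1', 'data2']].transform(zscore))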
groupby.apply(func): apply calls func separately on each group, then assembles the results into one object with pd.concat.
Sample code:
import pandas as pd
import numpy as np

dataset_path = './starcraft.csv'
df_data = pd.read_csv(dataset_path, usecols=['LeagueIndex', 'Age', 'HoursPerWeek',
                                             'TotalHours', 'APM'])

def top_n(df, n=3, column='APM'):
    """
    Return the top n rows of the group, sorted by column.
    """
    return df.sort_values(by=column, ascending=False)[:n]

print(df_data.groupby('LeagueIndex').apply(top_n))
Running result:
                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex
1           2214            1  20.0          12.0       730.0  172.9530
            2246            1  27.0           8.0       250.0  141.6282
            1753            1  20.0          28.0       100.0  139.6362
2           3062            2  20.0           6.0       100.0  179.6250
            3229            2  16.0          24.0       110.0  156.7380
            1520            2  29.0           6.0       250.0  151.6470
3           1557            3  22.0           6.0       200.0  226.6554
            484             3  19.0          42.0       450.0  220.0692
            2883            3  16.0           8.0       800.0  208.9500
4           2688            4  26.0          24.0       990.0  249.0210
            1759            4  16.0           6.0        75.0  229.9122
            2637            4  23.0          24.0       650.0  227.2272
5           3277            5  18.0          16.0       950.0  372.6426
            93              5  17.0          36.0       720.0  335.4990
            202             5  37.0          14.0       800.0  327.7218
6           734             6  16.0          28.0       730.0  389.8314
            2746            6  16.0          28.0      4000.0  350.4114
            1810            6  21.0          14.0       730.0  323.2506
7           3127            7  23.0          42.0      2000.0  298.7952
            104             7  21.0          24.0      1000.0  286.4538
            1654            7  18.0          98.0       700.0  236.0316
8           3393            8   NaN           NaN         NaN  375.8664
            3373            8   NaN           NaN         NaN  364.8504
            3372            8   NaN           NaN         NaN  355.3518
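For this particular top-n task, DataFrame.nlargest can replace the sort-then-slice idiom inside the custom function; a sketch assuming the same df_data:
print(df_data.groupby('LeagueIndex').apply(lambda g: g.nlargest(3, 'APM')))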
1. apply produces a hierarchical index: the outer level is the group name and the inner level is the original row index of df_data.
Sample code:
print(df_data.groupby('LeagueIndex').apply(top_n, n=2, column='Age'))  # extra parameters to apply are passed through to the custom function
Running result:
                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex
1           3146            1  40.0          12.0       150.0   38.5590
            3040            1  39.0          10.0       500.0   29.8764
2           920             2  43.0          10.0       730.0   86.0586
            2437            2  41.0           4.0       200.0   54.2166
3           1258            3  41.0          14.0       800.0   77.6472
            2972            3  40.0          10.0       500.0   60.5970
4           1696            4  44.0           6.0       500.0   89.5266
            1729            4  39.0           8.0       500.0   86.7246
5           202             5  37.0          14.0       800.0  327.7218
            2745            5  37.0          18.0      1000.0  123.4098
6           3069            6  31.0           8.0       800.0  133.1790
            2706            6  31.0           8.0       700.0   66.9918
7           2813            7  26.0          36.0      1300.0  188.5512
            1992            7  26.0          24.0      1000.0  219.6690
8           3340            8   NaN           NaN         NaN  189.7404
            3341            8   NaN           NaN         NaN  287.8128
2. Disable the hierarchical index with group_keys=False.
Sample code:
print(df_data.groupby('LeagueIndex', group_keys=False).apply(top_n))
Running result:
      LeagueIndex   Age  HoursPerWeek  TotalHours       APM
2214            1  20.0          12.0       730.0  172.9530
2246            1  27.0           8.0       250.0  141.6282
1753            1  20.0          28.0       100.0  139.6362
3062            2  20.0           6.0       100.0  179.6250
3229            2  16.0          24.0       110.0  156.7380
1520            2  29.0           6.0       250.0  151.6470
1557            3  22.0           6.0       200.0  226.6554
484             3  19.0          42.0       450.0  220.0692
2883            3  16.0           8.0       800.0  208.9500
2688            4  26.0          24.0       990.0  249.0210
1759            4  16.0           6.0        75.0  229.9122
2637            4  23.0          24.0       650.0  227.2272
3277            5  18.0          16.0       950.0  372.6426
93              5  17.0          36.0       720.0  335.4990
202             5  37.0          14.0       800.0  327.7218
734             6  16.0          28.0       730.0  389.8314
2746            6  16.0          28.0      4000.0  350.4114
1810            6  21.0          14.0       730.0  323.2506
3127            7  23.0          42.0      2000.0  298.7952
104             7  21.0          24.0      1000.0  286.4538
1654            7  18.0          98.0       700.0  236.0316
3393            8   NaN           NaN         NaN  375.8664
3373            8   NaN           NaN         NaN  364.8504
3372            8   NaN           NaN         NaN  355.3518
Thank you for reading! This concludes the article on grouping and aggregation in Pandas. I hope it has been helpful; if you found it useful, please share it with others.