Pandas grouping and aggregation
Grouping (groupby) splits a data set into groups and then computes statistics on each group. Just as SQL can filter rows and aggregate with GROUP BY, pandas can use groupby to perform these and more complex grouping operations.
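For readers coming from SQL, the correspondence is direct. A minimal sketch (the table and column names are illustrative only):
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'b', 'a'], 'data1': [1, 2, 5]})

# SQL:    SELECT key1, AVG(data1) FROM df GROUP BY key1;
# pandas: the equivalent aggregation via groupby
print(df.groupby('key1')['data1'].mean())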
The grouping operation follows the process: split -> apply -> combine
1. Split: divide the data into groups according to the grouping key
2. Apply: run the calculation rule on each group
3. Combine: merge the calculation results of the groups
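The three steps can also be spelled out by hand; the groupby calls in the examples below do all of them in one call. A minimal sketch with made-up data:
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a'], 'val': [1, 2, 5]})

# 1. split: break the data into groups by key
pieces = {name: group for name, group in df.groupby('key')}
# 2. apply: run the calculation rule (here: mean) on each group
results = {name: group['val'].mean() for name, group in pieces.items()}
# 3. combine: merge the per-group results
print(pd.Series(results))  # a -> 3.0, b -> 2.0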
Sample code:
import pandas as pd
import numpy as np

dict_obj = {'key1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'data1': np.random.randn(8),
            'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)
Running result:
      data1     data2 key1   key2
0  0.974685 -0.672494    a    one
1 -0.214324  0.758372    b    one
2  1.508838  0.392787    a    two
3  0.522911  0.630814    b  three
4  1.347359 -0.177858    a    two
5 -0.264616  1.017155    b    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three
I. GroupBy objects: DataFrameGroupBy and SeriesGroupBy
1. The grouping operation: groupby() performs the grouping. The GroupBy object does not run any actual computation; it only holds the intermediate grouping data.
Grouping by column name: obj.groupby('label')
Sample code:
# dataframe grouped by key1
print(type(df_obj.groupby('key1')))
# the data1 column of the dataframe grouped by key1
print(type(df_obj['data1'].groupby(df_obj['key1'])))
Running result:
<class 'pandas.core.groupby.DataFrameGroupBy'>
<class 'pandas.core.groupby.SeriesGroupBy'>
2. Grouping computations: run an operation such as mean() on the GroupBy object to compute it for every group.
Non-numeric columns are excluded from the grouped computation.
Sample code:
# grouping computation
grouped1 = df_obj.groupby('key1')
print(grouped1.mean())

grouped2 = df_obj['data1'].groupby(df_obj['key1'])
print(grouped2.mean())
Running result:
         data1     data2
key1
a     0.437389 -0.230101
b     0.014657  0.802114
key1
a    0.437389
b    0.014657
Name: data1, dtype: float64
size() returns the number of elements in each group.
Sample code:
# size()
print(grouped1.size())
print(grouped2.size())
Running result:
key1
a    5
b    3
dtype: int64
key1
a    5
b    3
dtype: int64
3. Grouping by a custom key: obj.groupby(self_def_key)
The custom key can be a list or a list of lists.
obj.groupby(['label1', 'label2']) -> a dataframe with a multi-level index
Sample code:
# grouping by a custom key: a list
self_def_key = [0, 1, 2, 3, 3, 4, 5, 7]
print(df_obj.groupby(self_def_key).size())

# grouping by a custom key: a list of lists
print(df_obj.groupby([df_obj['key1'], df_obj['key2']]).size())

# multi-level grouping by multiple columns
grouped2 = df_obj.groupby(['key1', 'key2'])
print(grouped2.size())

# multi-level grouping in the order key2, key1
grouped3 = df_obj.groupby(['key2', 'key1'])
print(grouped3.mean())
# unstack converts the multi-level index result into a single-level dataframe
print(grouped3.mean().unstack())
Running result:
0    1
1    1
2    1
3    2
4    1
5    1
7    1
dtype: int64
key1  key2
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64
key1  key2
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64
               data1     data2
key2  key1
one   a     0.174988 -0.110804
      b    -0.214324  0.758372
three a    -1.019229 -1.143825
      b     0.522911  0.630814
two   a     1.428099  0.107465
      b    -0.264616  1.017155
          data1               data2
key1          a         b         a         b
key2
one    0.174988 -0.214324 -0.110804  0.758372
three -1.019229  0.522911 -1.143825  0.630814
two    1.428099 -0.264616  0.107465  1.017155
II. GroupBy objects support iteration; each iteration returns a tuple (group_name, group_data),
which can be used to process each group's data individually.
1. Single-level grouping
Sample code:
# single-level grouping, by key1
for group_name, group_data in grouped1:
    print(group_name)
    print(group_data)
Running result:
a
      data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three
b
      data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two
2. Multi-level grouping
Sample code:
# multi-level grouping, by key1 and key2
for group_name, group_data in grouped2:
    print(group_name)
    print(group_data)
Running result:
('a', 'one')
      data1     data2 key1  key2
0  0.974685 -0.672494    a   one
6 -0.624708  0.450885    a   one
('a', 'three')
      data1     data2 key1   key2
7 -1.019229 -1.143825    a  three
('a', 'two')
      data1     data2 key1  key2
2  1.508838  0.392787    a   two
4  1.347359 -0.177858    a   two
('b', 'one')
      data1     data2 key1  key2
1 -0.214324  0.758372    b   one
('b', 'three')
      data1     data2 key1   key2
3  0.522911  0.630814    b  three
('b', 'two')
      data1     data2 key1  key2
5 -0.264616  1.017155    b   two
III. GroupBy objects can be converted into lists or dictionaries
Sample code:
# convert the GroupBy object to a list
print(list(grouped1))
# convert the GroupBy object to a dict
print(dict(list(grouped1)))
Running result:
[('a',       data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three), ('b',       data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two)]
{'a':       data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three, 'b':       data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two}
1. Grouping by column: columns can be grouped by their data type.
Sample code:
# the dtype of each column
print(df_obj.dtypes)

# grouping columns by data type
print(df_obj.groupby(df_obj.dtypes, axis=1).size())
print(df_obj.groupby(df_obj.dtypes, axis=1).sum())
Running result:
data1    float64
data2    float64
key1      object
key2      object
dtype: object
float64    2
object     2
dtype: int64
    float64  object
0  0.302191    aone
1  0.544048    bone
2  1.901626    atwo
3  1.153725  bthree
4  1.169501    atwo
5  0.752539    btwo
6 -0.173823    aone
7 -2.163054  athree
2. Other grouping methods
Sample code:
df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['A', 'B', 'C', 'D', 'E'])
df_obj2.iloc[1, 1:4] = np.nan  # set row B, columns b to d to NaN (.iloc replaces the removed .ix indexer)
print(df_obj2)
Running result:
   a    b    c    d  e
A  7  2.0  4.0  5.0  8
B  4  NaN  NaN  NaN  1
C  3  2.0  5.0  4.0  6
D  3  1.0  9.0  7.0  3
E  6  1.0  6.0  8.0  1
3. Grouping through a dictionary
Sample code:
# grouping through a dictionary that maps columns to group names
mapping_dict = {'a': 'Python', 'b': 'Python', 'c': 'Java', 'd': 'C', 'e': 'Java'}
print(df_obj2.groupby(mapping_dict, axis=1).size())
print(df_obj2.groupby(mapping_dict, axis=1).count())  # the number of non-NaN values
print(df_obj2.groupby(mapping_dict, axis=1).sum())
Running result:
C         1
Java      2
Python    2
dtype: int64
   C  Java  Python
A  1     2       2
B  0     1       1
C  1     2       2
D  1     2       2
E  1     2       2
     C  Java  Python
A  5.0  12.0     9.0
B  NaN   1.0     4.0
C  4.0  11.0     5.0
D  7.0  12.0     4.0
E  8.0   7.0     7.0
4. Grouping through a function: the argument passed to the function is the row index or the column index.
Sample code:
# grouping through a function
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['AA', 'BBB', 'CC', 'D', 'EE'])
# print(df_obj3)

def group_key(idx):
    """idx is a row index or a column index"""
    # return idx
    return len(idx)

print(df_obj3.groupby(group_key).size())
# the custom function above is equivalent to
# df_obj3.groupby(len).size()
Running result:
1    1
2    3
3    1
dtype: int64
5. Grouping by index level
Sample code:
# grouping by index level
columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
                                     ['A', 'A', 'B', 'C', 'B']],
                                    names=['language', 'index'])
df_obj4 = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
print(df_obj4)

# grouping by the language level
print(df_obj4.groupby(level='language', axis=1).sum())
# grouping by the index level
print(df_obj4.groupby(level='index', axis=1).sum())
Running result:
language Python Java Python Java Python
index         A    A      B    C      B
0             2    7      8    4      3
1             5    2      6    1      2
2             6    4      4    5      2
3             4    7      4    3      1
4             7    4      3    4      8
language  Java  Python
0           11      13
1            3      13
2            9      12
3           10       9
4            8      18
index   A   B  C
0       9  11  4
1       7   8  1
2      10   6  5
3      11   5  3
4      11  11  4
Aggregation: an operation that produces a scalar, such as mean() or count(); it is commonly used to compute statistics on the grouped data.
Sample code:
dict_obj = {'key1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)
Running result:
   data1  data2 key1   key2
0      3      7    a    one
1      1      5    b    one
2      7      4    a    two
3      2      4    b  three
4      6      4    a    two
5      9      9    b    two
6      3      5    a    one
7      8      4    a  three
1. Built-in aggregation functions: sum(), mean(), max(), min(), count(), size(), describe()
Sample code:
print(df_obj5.groupby('key1').sum())
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1').mean())
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5.groupby('key1').describe())
Running result:
      data1  data2
key1
a        27     24
b        12     18
      data1  data2 key2
key1
a         8      7  two
b         9      9  two
      data1  data2 key2
key1
a         3      4  one
b         1      4  one
      data1  data2
key1
a       5.4    4.8
b       4.0    6.0
key1
a    5
b    3
dtype: int64
      data1  data2  key2
key1
a         5      5     5
b         3      3     3
               data1     data2
key1
a    count  5.000000  5.000000
     mean   5.400000  4.800000
     std    2.302173  1.303840
     min    3.000000  4.000000
     25%    3.000000  4.000000
     50%    6.000000  4.000000
     75%    7.000000  5.000000
     max    8.000000  7.000000
b    count  3.000000  3.000000
     mean   4.000000  6.000000
     std    4.358899  2.645751
     min    1.000000  4.000000
     25%    1.500000  4.500000
     50%    2.000000  5.000000
     75%    5.500000  7.000000
     max    9.000000  9.000000
2. Custom aggregation functions: pass a function to the agg method, grouped.agg(func).
The argument of func is the group of records corresponding to each groupby key.
Sample code:
# custom aggregation function
def peak_range(df):
    """returns the range (max minus min) of the values"""
    # print(type(df))  # the argument is the group of records for one key
    return df.max() - df.min()

print(df_obj5.groupby('key1').agg(peak_range))
print(df_obj.groupby('key1').agg(lambda df: df.max() - df.min()))
Running result:
      data1  data2
key1
a         5      3
b         8      5
         data1     data2
key1
a     2.528067  1.594711
b     0.787527  0.386341
3. Applying multiple aggregation functions at once, using a list of functions
Sample code:
# apply multiple aggregation functions at once
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', peak_range]))  # the default column name is the function name
# provide new column names through ('name', function) tuples
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', ('range', peak_range)]))
Running result:
         data1                                 data2
          mean       std count peak_range      mean       std count peak_range
key1
a     0.437389  1.174151     5   2.528067 -0.230101  0.686488     5   1.594711
b     0.014657  0.440878     3   0.787527  0.802114  0.196850     3   0.386341
         data1                            data2
          mean       std count     range   mean       std count     range
key1
a     0.437389  1.174151     5  2.528067 -0.230101  0.686488     5  1.594711
b     0.014657  0.440878     3  0.787527  0.802114  0.196850     3  0.386341
4. Applying different aggregation functions to different columns, using a dict
Sample code:
# apply different aggregation functions to different columns
dict_mapping = {'data1': 'mean', 'data2': 'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))

dict_mapping = {'data1': ['mean', 'max'], 'data2': 'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))
Running result:
         data1     data2
key1
a     0.437389 -1.150505
b     0.014657  2.406341
         data1               data2
          mean       max       sum
key1
a     0.437389  1.508838 -1.150505
b     0.014657  0.522911  2.406341
5. Commonly used built-in aggregation functions
Function name: description
count: number of non-NA values in the group
sum: sum of non-NA values
mean: mean of non-NA values
median: arithmetic median of non-NA values
std, var: unbiased (denominator n - 1) standard deviation and variance
min, max: minimum and maximum of non-NA values
prod: product of non-NA values
first, last: first and last non-NA values
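A quick sketch exercising a few of these that the examples above do not show (median, prod, first, last), reusing df_obj5 from the aggregation example; the exact values depend on the random data:
print(df_obj5.groupby('key1')['data1'].median())  # arithmetic median per group
print(df_obj5.groupby('key1')['data1'].prod())    # product per group
print(df_obj5.groupby('key1')['data1'].first())   # first value per group
print(df_obj5.groupby('key1')['data1'].last())    # last value per group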
Grouped computation on data
Sample code:
import pandas as pd
import numpy as np

dict_obj = {'key1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)

# after grouping by key1, compute statistics for data1 and data2,
# append them to the original table, and add a prefix to the headers
k1_sum = df_obj.groupby('key1').sum().add_prefix('sum_')
print(k1_sum)
Running result:
   data1  data2 key1   key2
0      5      1    a    one
1      7      8    b    one
2      1      9    a    two
3      2      6    b  three
4      9      8    a    two
5      8      3    b    two
6      3      5    a    one
7      8      3    a  three
      sum_data1  sum_data2
key1
a            26         26
b            17         17
The aggregation operation changes the shape of the original data.
How to maintain the shape of the original data?
1. merge: join the aggregated result back onto the original table with merge, which is relatively cumbersome.
Sample code:
# method 1: use merge
k1_sum_merge = pd.merge(df_obj, k1_sum, left_on='key1', right_index=True)
print(k1_sum_merge)
Running result:
   data1  data2 key1   key2  sum_data1  sum_data2
0      5      1    a    one         26         26
2      1      9    a    two         26         26
4      9      8    a    two         26         26
6      3      5    a    one         26         26
7      8      3    a  three         26         26
1      7      8    b    one         17         17
3      2      6    b  three         17         17
5      8      3    b    two         17         17
2. transform: the result computed by transform keeps the shape of the original data,
e.g. grouped.transform(np.sum)
Sample code:
# method 2: use transform
k1_sum_tf = df_obj.groupby('key1').transform(np.sum).add_prefix('sum_')
df_obj[k1_sum_tf.columns] = k1_sum_tf
print(df_obj)
Running result:
   data1  data2 key1   key2  sum_data1  sum_data2           sum_key2
0      5      1    a    one         26         26  onetwotwoonethree
1      7      8    b    one         17         17        onethreetwo
2      1      9    a    two         26         26  onetwotwoonethree
3      2      6    b  three         17         17        onethreetwo
4      9      8    a    two         26         26  onetwotwoonethree
5      8      3    b    two         17         17        onethreetwo
6      3      5    a    one         26         26  onetwotwoonethree
7      8      3    a  three         26         26  onetwotwoonethree
A custom function can also be passed to transform.
Sample code:
# passing a custom function to transform
def diff_mean(s):
    """returns the difference between the data and its mean"""
    return s - s.mean()

print(df_obj.groupby('key1').transform(diff_mean))
Running result:
      data1     data2  sum_data1  sum_data2
0 -0.200000 -4.200000          0          0
1  1.333333  2.333333          0          0
2 -4.200000  3.800000          0          0
3 -3.666667  0.333333          0          0
4  3.800000  2.800000          0          0
5  2.333333 -2.666667          0          0
6 -2.200000 -0.200000          0          0
7  2.800000 -2.200000          0          0
groupby.apply(func)
apply calls func on each group separately, then assembles the per-group results with pd.concat.
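That equivalence can be checked directly. A minimal sketch with made-up data (group_keys=False keeps apply from adding the group names as an outer index level, so it matches the plain concat):
import pandas as pd

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [2, 1, 3]})

def top1(g):
    # keep the single row with the largest val in this group
    return g.sort_values(by='val', ascending=False)[:1]

via_apply = df.groupby('key', group_keys=False).apply(top1)
via_concat = pd.concat([top1(g) for _, g in df.groupby('key')])
print(via_apply.equals(via_concat))  # True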
Sample code:
import pandas as pd
import numpy as np

dataset_path = './starcraft.csv'
df_data = pd.read_csv(dataset_path,
                      usecols=['LeagueIndex', 'Age', 'HoursPerWeek', 'TotalHours', 'APM'])

def top_n(df, n=3, column='APM'):
    """returns the top n records of each group, ordered by column"""
    return df.sort_values(by=column, ascending=False)[:n]

print(df_data.groupby('LeagueIndex').apply(top_n))
Running result:
                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex
1           2214            1  20.0          12.0       730.0  172.9530
            2246            1  27.0           8.0       250.0  141.6282
            1753            1  20.0          28.0       100.0  139.6362
2           3062            2  20.0           6.0       100.0  179.6250
            3229            2  16.0          24.0       110.0  156.7380
            1520            2  29.0           6.0       250.0  151.6470
3           1557            3  22.0           6.0       200.0  226.6554
            484             3  19.0          42.0       450.0  220.0692
            2883            3  16.0           8.0       800.0  208.9500
4           2688            4  26.0          24.0       990.0  249.0210
            1759            4  16.0           6.0        75.0  229.9122
            2637            4  23.0          24.0       650.0  227.2272
5           3277            5  18.0          16.0       950.0  372.6426
            93              5  17.0          36.0       720.0  335.4990
            202             5  37.0          14.0       800.0  327.7218
6           734             6  16.0          28.0       730.0  389.8314
            2746            6  16.0          28.0      4000.0  350.4114
            1810            6  21.0          14.0       730.0  323.2506
7           3127            7  23.0          42.0      2000.0  298.7952
            104             7  21.0          24.0      1000.0  286.4538
            1654            7  18.0          98.0       700.0  236.0316
8           3393            8   NaN           NaN         NaN  375.8664
            3373            8   NaN           NaN         NaN  364.8504
            3372            8   NaN           NaN         NaN  355.3518
1. A hierarchical index is generated: the outer level is the group name, the inner level is the row index of df_data.
Sample code:
# extra arguments to apply are passed through to the custom function
print(df_data.groupby('LeagueIndex').apply(top_n, n=2, column='Age'))
Running result:
                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex
1           3146            1  40.0          12.0       150.0   38.5590
            3040            1  39.0          10.0       500.0   29.8764
2           920             2  43.0          10.0       730.0   86.0586
            2437            2  41.0           4.0       200.0   54.2166
3           1258            3  41.0          14.0       800.0   77.6472
            2972            3  40.0          10.0       500.0   60.5970
4           1696            4  44.0           6.0       500.0   89.5266
            1729            4  39.0           8.0       500.0   86.7246
5           202             5  37.0          14.0       800.0  327.7218
            2745            5  37.0          18.0      1000.0  123.4098
6           3069            6  31.0           8.0       800.0  133.1790
            2706            6  31.0           8.0       700.0   66.9918
7           2813            7  26.0          36.0      1300.0  188.5512
            1992            7  26.0          24.0      1000.0  219.6690
8           3340            8   NaN           NaN         NaN  189.7404
            3341            8   NaN           NaN         NaN  287.8128
2. Disabling the hierarchical index: group_keys=False
Sample code:
print(df_data.groupby('LeagueIndex', group_keys=False).apply(top_n))
Running result:
      LeagueIndex   Age  HoursPerWeek  TotalHours       APM
2214            1  20.0          12.0       730.0  172.9530
2246            1  27.0           8.0       250.0  141.6282
1753            1  20.0          28.0       100.0  139.6362
3062            2  20.0           6.0       100.0  179.6250
3229            2  16.0          24.0       110.0  156.7380
1520            2  29.0           6.0       250.0  151.6470
1557            3  22.0           6.0       200.0  226.6554
484             3  19.0          42.0       450.0  220.0692
2883            3  16.0           8.0       800.0  208.9500
2688            4  26.0          24.0       990.0  249.0210
1759            4  16.0           6.0        75.0  229.9122
2637            4  23.0          24.0       650.0  227.2272
3277            5  18.0          16.0       950.0  372.6426
93              5  17.0          36.0       720.0  335.4990
202             5  37.0          14.0       800.0  327.7218
734             6  16.0          28.0       730.0  389.8314
2746            6  16.0          28.0      4000.0  350.4114
1810            6  21.0          14.0       730.0  323.2506
3127            7  23.0          42.0      2000.0  298.7952
104             7  21.0          24.0      1000.0  286.4538
1654            7  18.0          98.0       700.0  236.0316
3393            8   NaN           NaN         NaN  375.8664
3373            8   NaN           NaN         NaN  364.8504
3372            8   NaN           NaN         NaN  355.3518
apply can also be used to fill missing data differently in each group, for example filling with the group's mean.
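As a sketch of that closing idea (an assumed usage, reusing df_data from above), each group's missing values can be filled with that group's own mean via apply:
# fill the missing values in each group with the group's mean
def fill_mean(g):
    return g.fillna(g.mean())

print(df_data.groupby('LeagueIndex', group_keys=False).apply(fill_mean))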