This article explains grouping and aggregation in Pandas. It is quite practical, so it is shared here as a reference; follow along with the examples below.
One: grouping (groupby)
Split the data set into groups, then compute statistics on each group.
SQL can filter and aggregate data in groups.
Pandas can use groupby to perform more complex grouping operations.
Grouping follows a split -> apply -> combine process. split: partition the data into groups according to a key; apply: run a computation on each group; combine: merge the per-group results into a single object.
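To make the three steps concrete, here is a minimal sketch (using a small made-up DataFrame, not the df_obj built below) that performs split, apply, and combine by hand and compares the result with the groupby one-liner:
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'], 'val': [1, 2, 3, 4]})

# split: iterate the groups; apply: compute each group's mean; combine: assemble a Series
manual = pd.Series({name: group['val'].mean() for name, group in df.groupby('key')})
print(manual)
print(df.groupby('key')['val'].mean())  # same values, computed in one step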
Sample code:
import pandas as pd
import numpy as np

dict_obj = {'key1': ['a', 'b', 'a', 'b',
                     'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three',
                     'two', 'two', 'one', 'three'],
            'data1': np.random.randn(8),
            'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)
Running result:
      data1     data2 key1   key2
0  0.974685 -0.672494    a    one
1 -0.214324  0.758372    b    one
2  1.508838  0.392787    a    two
3  0.522911  0.630814    b  three
4  1.347359 -0.177858    a    two
5 -0.264616  1.017155    b    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three
First, GroupBy objects: DataFrameGroupBy and SeriesGroupBy. 1. The grouping operation: groupby() performs the grouping. The GroupBy object does not run any actual computation; it only holds the intermediate grouping data. Group by column name: obj.groupby('label').
Sample code:
print(type(df_obj.groupby('key1')))  # the DataFrame is grouped by key1
print(type(df_obj['data1'].groupby(df_obj['key1'])))  # the data1 column is grouped by key1
Running result:
<class 'pandas.core.groupby.DataFrameGroupBy'>
<class 'pandas.core.groupby.SeriesGroupBy'>
2. Grouping computations: calling a computation such as mean() on the GroupBy object applies it to every group; non-numeric columns are excluded from the computation.
Sample code:
grouped1 = df_obj.groupby('key1')  # group the whole DataFrame
print(grouped1.mean())
grouped2 = df_obj['data1'].groupby(df_obj['key1'])  # group a single column
print(grouped2.mean())
Running result:
         data1     data2
key1
a     0.437389 -0.230101
b     0.014657  0.802114
key1
a    0.437389
b    0.014657
Name: data1, dtype: float64
size() returns the number of elements in each group.
Sample code:
print(grouped1.size())
print(grouped2.size())
Running result:
key1
a    5
b    3
dtype: int64
key1
a    5
b    3
dtype: int64
3. Grouping by a custom key: obj.groupby(self_def_key). The custom key can be a list, or a list of keys for multi-level grouping: obj.groupby(['label1', 'label2']) yields a DataFrame with a multi-level index.
Sample code:
self_def_key = [0, 1, 2, 3, 3, 4, 5, 7]  # group by a custom key (a list)
print(df_obj.groupby(self_def_key).size())
print(df_obj.groupby([df_obj['key1'], df_obj['key2']]).size())  # group by a custom key (a list of Series)
grouped2 = df_obj.groupby(['key1', 'key2'])  # group by multiple columns -> multi-level index
print(grouped2.size())
grouped3 = df_obj.groupby(['key2', 'key1'])  # the index levels follow the key order
print(grouped3.mean())
print(grouped3.mean().unstack())  # unstack converts the multi-level index result into a single-level DataFrame
Running result:
0    1
1    1
2    1
3    2
4    1
5    1
7    1
dtype: int64
key1  key2
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64
key1  key2
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64
               data1     data2
key2  key1
one   a     0.174988 -0.110804
      b    -0.214324  0.758372
three a    -1.019229 -1.143825
      b     0.522911  0.630814
two   a     1.428099  0.107465
      b    -0.264616  1.017155
          data1               data2
key1          a         b         a         b
key2
one    0.174988 -0.214324 -0.110804  0.758372
three -1.019229  0.522911 -1.143825  0.630814
two    1.428099 -0.264616  0.107465  1.017155
Second, GroupBy objects are iterable: each iteration yields a tuple (group_name, group_data), which can be used to operate on individual groups. 1. Single-level grouping
Sample code:
for group_name, group_data in grouped1:  # iterate the single-level grouping by key1
    print(group_name)
    print(group_data)
Running result:
a
      data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three
b
      data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two
2. Multi-level grouping
Sample code:
for group_name, group_data in grouped2:  # iterate the multi-level grouping by key1 and key2
    print(group_name)
    print(group_data)
Running result:
('a', 'one')
      data1     data2 key1 key2
0  0.974685 -0.672494    a  one
6 -0.624708  0.450885    a  one
('a', 'three')
      data1     data2 key1   key2
7 -1.019229 -1.143825    a  three
('a', 'two')
      data1     data2 key1 key2
2  1.508838  0.392787    a  two
4  1.347359 -0.177858    a  two
('b', 'one')
      data1     data2 key1 key2
1 -0.214324  0.758372    b  one
('b', 'three')
      data1     data2 key1   key2
3  0.522911  0.630814    b  three
('b', 'two')
      data1     data2 key1 key2
5 -0.264616  1.017155    b  two
Third, GroupBy objects can be converted into a list or a dict.
Sample code:
print(list(grouped1))        # convert the GroupBy object to a list
print(dict(list(grouped1)))  # convert the GroupBy object to a dict
Running result:
[('a',       data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three), ('b',       data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two)]
{'a':       data1     data2 key1   key2
0  0.974685 -0.672494    a    one
2  1.508838  0.392787    a    two
4  1.347359 -0.177858    a    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three, 'b':       data1     data2 key1   key2
1 -0.214324  0.758372    b    one
3  0.522911  0.630814    b  three
5 -0.264616  1.017155    b    two}
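The dict form makes it convenient to look up one group's rows by name; the GroupBy method get_group offers the same lookup without materializing every group. A small sketch, reusing grouped1 from above:
groups = dict(list(grouped1))
print(groups['a'])              # rows of group 'a', taken from the dict
print(grouped1.get_group('a'))  # equivalent lookup directly on the GroupBy object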
1. Grouping by columns: the columns can be grouped by data type.
Sample code:
print(df_obj.dtypes)
print(df_obj.groupby(df_obj.dtypes, axis=1).size())  # group the columns by data type
print(df_obj.groupby(df_obj.dtypes, axis=1).sum())
Running result:
data1    float64
data2    float64
key1      object
key2      object
dtype: object
float64    2
object     2
dtype: int64
    float64  object
0  0.302191    aone
1  0.544048    bone
2  1.901626    atwo
3  1.153725  bthree
4  1.169501    atwo
5  0.752539    btwo
6 -0.173823    aone
7 -2.163054  athree
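As a side note, when the goal is only to select columns of a given data type rather than aggregate across them, select_dtypes is the more direct tool; a small sketch reusing df_obj:
print(df_obj.select_dtypes(include=['float64']))  # only the numeric columns
print(df_obj.select_dtypes(include=['object']))   # only the string columns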
2. Other grouping methods
Sample code:
df_obj2 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['A', 'B', 'C', 'D', 'E'])
df_obj2.iloc[1, 1:4] = np.nan  # .ix is deprecated; .iloc does the positional assignment
print(df_obj2)
Running result:
   a    b    c    d  e
A  7  2.0  4.0  5.0  8
B  4  NaN  NaN  NaN  1
C  3  2.0  5.0  4.0  6
D  3  1.0  9.0  7.0  3
E  6  1.0  6.0  8.0  1
3. Group by dictionary
Sample code:
mapping_dict = {'a': 'Python', 'b': 'Python', 'c': 'Java', 'd': 'C', 'e': 'Java'}  # group the columns via a dict
print(df_obj2.groupby(mapping_dict, axis=1).size())
print(df_obj2.groupby(mapping_dict, axis=1).count())  # count of non-NaN values
print(df_obj2.groupby(mapping_dict, axis=1).sum())
Running result:
C         1
Java      2
Python    2
dtype: int64
   C  Java  Python
A  1     2       2
B  0     1       1
C  1     2       2
D  1     2       2
E  1     2       2
     C  Java  Python
A  5.0  12.0     9.0
B  NaN   1.0     4.0
C  4.0  11.0     5.0
D  7.0  12.0     4.0
E  8.0   7.0     7.0
4. Grouping by a function: the function is called with each row index (or column index), and its return value becomes the group key.
Sample code:
df_obj3 = pd.DataFrame(np.random.randint(1, 10, (5, 5)),
                       columns=['a', 'b', 'c', 'd', 'e'],
                       index=['AA', 'BBB', 'CC', 'D', 'EE'])  # group by a function

def group_key(idx):
    """
    idx is a row index (or a column index)
    """
    # return idx
    return len(idx)

print(df_obj3.groupby(group_key).size())
# the custom function above is equivalent to
# df_obj3.groupby(len).size()
Running result:
1    1
2    3
3    1
dtype: int64
5. Grouping by index level
Sample code:
columns = pd.MultiIndex.from_arrays([['Python', 'Java', 'Python', 'Java', 'Python'],
                                     ['A', 'A', 'B', 'C', 'B']],
                                    names=['language', 'index'])  # group by index level
df_obj4 = pd.DataFrame(np.random.randint(1, 10, (5, 5)), columns=columns)
print(df_obj4)
print(df_obj4.groupby(level='language', axis=1).sum())  # group by the language level
print(df_obj4.groupby(level='index', axis=1).sum())  # group by the index level
Running result:
language Python Java Python Java Python
index         A    A      B    C      B
0             2    7      8    4      3
1             5    2      6    1      2
2             6    4      4    5      2
3             4    7      4    3      1
4             7    4      3    4      8
language  Java  Python
0           11      13
1            3      13
2            9      12
3           10       9
4            8      18
index   A   B  C
0       9  11  4
1       7   8  1
2      10   6  5
3      11   5  3
4      11  11  4
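Grouping by index level works on the row index as well; a minimal sketch with a hypothetical Series carrying a two-level row index:
idx = pd.MultiIndex.from_arrays([['x', 'x', 'y', 'y'], [1, 2, 1, 2]],
                                names=['outer', 'inner'])
s = pd.Series([10, 20, 30, 40], index=idx)
print(s.groupby(level='outer').sum())  # x -> 30, y -> 70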
Two: aggregation
Aggregation is the process of producing a scalar from an array, e.g. mean() or count().
It is most often used to compute statistics on grouped data.
Sample code:
dict_obj = {'key1': ['a', 'b', 'a', 'b',
                     'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three',
                     'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj5 = pd.DataFrame(dict_obj)
print(df_obj5)
Running result:
   data1  data2 key1   key2
0      3      7    a    one
1      1      5    b    one
2      7      4    a    two
3      2      4    b  three
4      6      4    a    two
5      9      9    b    two
6      3      5    a    one
7      8      4    a  three
1. Built-in aggregate functions: sum(), mean(), max(), min(), count(), size(), describe().
Sample code:
print(df_obj5.groupby('key1').sum())
print(df_obj5.groupby('key1').max())
print(df_obj5.groupby('key1').min())
print(df_obj5.groupby('key1').mean())
print(df_obj5.groupby('key1').size())
print(df_obj5.groupby('key1').count())
print(df_obj5.groupby('key1').describe())
Running result:
      data1  data2
key1
a        27     24
b        12     18
      data1  data2 key2
key1
a         8      7  two
b         9      9  two
      data1  data2 key2
key1
a         3      4  one
b         1      4  one
      data1  data2
key1
a       5.4    4.8
b       4.0    6.0
key1
a    5
b    3
dtype: int64
      data1  data2  key2
key1
a         5      5     5
b         3      3     3
               data1     data2
key1
a    count  5.000000  5.000000
     mean   5.400000  4.800000
     std    2.302173  1.303840
     min    3.000000  4.000000
     25%    3.000000  4.000000
     50%    6.000000  4.000000
     75%    7.000000  5.000000
     max    8.000000  7.000000
b    count  3.000000  3.000000
     mean   4.000000  6.000000
     std    4.358899  2.645751
     min    1.000000  4.000000
     25%    1.500000  4.500000
     50%    2.000000  5.000000
     75%    5.500000  7.000000
     max    9.000000  9.000000
2. Custom aggregate functions: grouped.agg(func), where func receives the records belonging to each group.
Sample code:
def peak_range(df):
    """
    Return the range (max - min) of the values.
    """
    # print(type(df))  # the argument is the set of records for one group
    return df.max() - df.min()

print(df_obj5.groupby('key1').agg(peak_range))  # custom aggregate function
print(df_obj.groupby('key1').agg(lambda df: df.max() - df.min()))
Running result:
      data1  data2
key1
a         5      3
b         8      5
         data1     data2
key1
a     2.528067  1.594711
b     0.787527  0.386341
3. Applying multiple aggregate functions at once: pass a list of functions to agg().
Sample code:
# apply multiple aggregate functions at the same time
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', peak_range]))  # column names default to the function names
print(df_obj.groupby('key1').agg(['mean', 'std', 'count', ('range', peak_range)]))  # a (name, function) tuple supplies a new column name
Running result:
         data1                                data2
          mean       std count peak_range      mean       std count peak_range
key1
a     0.437389  1.174151     5   2.528067 -0.230101  0.686488     5   1.594711
b     0.014657  0.440878     3   0.787527  0.802114  0.196850     3   0.386341
         data1                             data2
          mean       std count     range      mean       std count     range
key1
a     0.437389  1.174151     5  2.528067 -0.230101  0.686488     5  1.594711
b     0.014657  0.440878     3  0.787527  0.802114  0.196850     3  0.386341
4. Applying different aggregate functions to different columns: pass a dict to agg().
Sample code:
dict_mapping = {'data1': 'mean',
                'data2': 'sum'}  # apply a different aggregate function to each column
print(df_obj.groupby('key1').agg(dict_mapping))
dict_mapping = {'data1': ['mean', 'max'],
                'data2': 'sum'}
print(df_obj.groupby('key1').agg(dict_mapping))
Running result:
         data1     data2
key1
a     0.437389 -1.150505
b     0.014657  2.406341
         data1               data2
          mean       max       sum
key1
a     0.437389  1.508838 -1.150505
b     0.014657  0.522911  2.406341
5. Commonly used built-in aggregate functions
Sample code:
import pandas as pd
import numpy as np

dict_obj = {'key1': ['a', 'b', 'a', 'b',
                     'a', 'b', 'a', 'a'],
            'key2': ['one', 'one', 'two', 'three',
                     'two', 'two', 'one', 'three'],
            'data1': np.random.randint(1, 10, 8),
            'data2': np.random.randint(1, 10, 8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)

# group by key1, sum data1 and data2, and prefix the aggregated column names
k1_sum = df_obj.groupby('key1').sum().add_prefix('sum_')
print(k1_sum)
Running result:
   data1  data2 key1   key2
0      5      1    a    one
1      7      8    b    one
2      1      9    a    two
3      2      6    b  three
4      9      8    a    two
5      8      3    b    two
6      3      5    a    one
7      8      3    a  three
      sum_data1  sum_data2
key1
a            26         26
b            17         17
Aggregation changes the shape of the original data. How can the aggregated result be kept aligned with the original shape? 1. merge: join the aggregated result back onto the original rows with merge; this works but is relatively cumbersome.
Sample code:
k1_sum_merge = pd.merge(df_obj, k1_sum, left_on='key1', right_index=True)  # method 1: use merge
print(k1_sum_merge)
Running result:
   data1  data2 key1   key2  sum_data1  sum_data2
0      5      1    a    one         26         26
2      1      9    a    two         26         26
4      9      8    a    two         26         26
6      3      5    a    one         26         26
7      8      3    a  three         26         26
1      7      8    b    one         17         17
3      2      6    b  three         17         17
5      8      3    b    two         17         17
2. transform: the result of transform keeps the shape of the original data, e.g. grouped.transform(np.sum).
Sample code:
# method 2: use transform, whose result is aligned with the original index
k1_sum_tf = df_obj.groupby('key1').transform(np.sum).add_prefix('sum_')
df_obj[k1_sum_tf.columns] = k1_sum_tf
print(df_obj)
Running result:
   data1  data2 key1   key2 sum_data1 sum_data2           sum_key2
0      5      1    a    one        26        26  onetwotwoonethree
1      7      8    b    one        17        17        onethreetwo
2      1      9    a    two        26        26  onetwotwoonethree
3      2      6    b  three        17        17        onethreetwo
4      9      8    a    two        26        26  onetwotwoonethree
5      8      3    b    two        17        17        onethreetwo
6      3      5    a    one        26        26  onetwotwoonethree
7      8      3    a  three        26        26  onetwotwoonethree
You can also pass in a custom function
Sample code:
def diff_mean(s):
    """
    Return the difference between each value and its group mean.
    """
    return s - s.mean()

print(df_obj.groupby('key1').transform(diff_mean))  # a custom function passed to transform
Running result:
      data1     data2  sum_data1  sum_data2
0 -0.200000 -4.200000          0          0
1  1.333333  2.333333          0          0
2 -4.200000  3.800000          0          0
3 -3.666667  0.333333          0          0
4  3.800000  2.800000          0          0
5  2.333333 -2.666667          0          0
6 -2.200000 -0.200000          0          0
7  2.800000 -2.200000          0          0
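A common use of this pattern is per-group standardization: because transform preserves the original shape, each value can be replaced by its z-score within its own group. A minimal sketch, reusing df_obj5 from the aggregation examples:
def zscore(s):
    # standardize within the group: subtract the group mean, divide by the group std
    return (s - s.mean()) / s.std()

print(df_obj5.groupby('key1')[['data1', 'data2']].transform(zscore))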
groupby.apply(func): apply calls func separately on each group, then assembles the results into one object with pd.concat.
Sample code:
import pandas as pd
import numpy as np

dataset_path = './starcraft.csv'
df_data = pd.read_csv(dataset_path, usecols=['LeagueIndex', 'Age', 'HoursPerWeek',
                                             'TotalHours', 'APM'])

def top_n(df, n=3, column='APM'):
    """
    Return the top n rows of the group, sorted by column.
    """
    return df.sort_values(by=column, ascending=False)[:n]

print(df_data.groupby('LeagueIndex').apply(top_n))
Running result:
                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex
1           2214            1  20.0          12.0       730.0  172.9530
            2246            1  27.0           8.0       250.0  141.6282
            1753            1  20.0          28.0       100.0  139.6362
2           3062            2  20.0           6.0       100.0  179.6250
            3229            2  16.0          24.0       110.0  156.7380
            1520            2  29.0           6.0       250.0  151.6470
3           1557            3  22.0           6.0       200.0  226.6554
            484             3  19.0          42.0       450.0  220.0692
            2883            3  16.0           8.0       800.0  208.9500
4           2688            4  26.0          24.0       990.0  249.0210
            1759            4  16.0           6.0        75.0  229.9122
            2637            4  23.0          24.0       650.0  227.2272
5           3277            5  18.0          16.0       950.0  372.6426
            93              5  17.0          36.0       720.0  335.4990
            202             5  37.0          14.0       800.0  327.7218
6           734             6  16.0          28.0       730.0  389.8314
            2746            6  16.0          28.0      4000.0  350.4114
            1810            6  21.0          14.0       730.0  323.2506
7           3127            7  23.0          42.0      2000.0  298.7952
            104             7  21.0          24.0      1000.0  286.4538
            1654            7  18.0          98.0       700.0  236.0316
8           3393            8   NaN           NaN         NaN  375.8664
            3373            8   NaN           NaN         NaN  364.8504
            3372            8   NaN           NaN         NaN  355.3518
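For this particular top-n task, DataFrame.nlargest can replace the sort-then-slice idiom inside the custom function; a sketch assuming the same df_data:
print(df_data.groupby('LeagueIndex').apply(lambda g: g.nlargest(3, 'APM')))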
1. apply produces a hierarchical index: the outer level is the group name and the inner level is the original row index of df_data.
Sample code:
print(df_data.groupby('LeagueIndex').apply(top_n, n=2, column='Age'))  # extra parameters to apply are passed through to the custom function
Running result:
                  LeagueIndex   Age  HoursPerWeek  TotalHours       APM
LeagueIndex
1           3146            1  40.0          12.0       150.0   38.5590
            3040            1  39.0          10.0       500.0   29.8764
2           920             2  43.0          10.0       730.0   86.0586
            2437            2  41.0           4.0       200.0   54.2166
3           1258            3  41.0          14.0       800.0   77.6472
            2972            3  40.0          10.0       500.0   60.5970
4           1696            4  44.0           6.0       500.0   89.5266
            1729            4  39.0           8.0       500.0   86.7246
5           202             5  37.0          14.0       800.0  327.7218
            2745            5  37.0          18.0      1000.0  123.4098
6           3069            6  31.0           8.0       800.0  133.1790
            2706            6  31.0           8.0       700.0   66.9918
7           2813            7  26.0          36.0      1300.0  188.5512
            1992            7  26.0          24.0      1000.0  219.6690
8           3340            8   NaN           NaN         NaN  189.7404
            3341            8   NaN           NaN         NaN  287.8128
2. Disable the hierarchical index with group_keys=False.
Sample code:
print(df_data.groupby('LeagueIndex', group_keys=False).apply(top_n))
Running result:
      LeagueIndex   Age  HoursPerWeek  TotalHours       APM
2214            1  20.0          12.0       730.0  172.9530
2246            1  27.0           8.0       250.0  141.6282
1753            1  20.0          28.0       100.0  139.6362
3062            2  20.0           6.0       100.0  179.6250
3229            2  16.0          24.0       110.0  156.7380
1520            2  29.0           6.0       250.0  151.6470
1557            3  22.0           6.0       200.0  226.6554
484             3  19.0          42.0       450.0  220.0692
2883            3  16.0           8.0       800.0  208.9500
2688            4  26.0          24.0       990.0  249.0210
1759            4  16.0           6.0        75.0  229.9122
2637            4  23.0          24.0       650.0  227.2272
3277            5  18.0          16.0       950.0  372.6426
93              5  17.0          36.0       720.0  335.4990
202             5  37.0          14.0       800.0  327.7218
734             6  16.0          28.0       730.0  389.8314
2746            6  16.0          28.0      4000.0  350.4114
1810            6  21.0          14.0       730.0  323.2506
3127            7  23.0          42.0      2000.0  298.7952
104             7  21.0          24.0      1000.0  286.4538
1654            7  18.0          98.0       700.0  236.0316
3393            8   NaN           NaN         NaN  375.8664
3373            8   NaN           NaN         NaN  364.8504
3372            8   NaN           NaN         NaN  355.3518
Thank you for reading! This concludes the article on grouping and aggregation in Pandas. I hope it has been helpful; if you found it useful, please share it with others.