Pandas Group By Guide – 3 Methods

Pandas Group By, the foundation of any data analysis. This is a MUST know function when working with the pandas library. 95% of analysis will require some form of grouping and aggregating data.

This post will focus directly on how to do a group by in Pandas. To learn what is a group by check out our future business analytics post.

Pandas Group By will aggregate your data around distinct values within your ‘group by’ columns. The result will apply a function (an aggregate function) to your data.

When it comes to group by functions, you’ll need two things from pandas

  1. The group by function – The function that tells pandas how you would like to consolidate your data.
  2. An aggregate function – The function that tells pandas what you would like to do with your consolidated data.

We are going to go over 3 ways to do Group Bys in Pandas. They are listed from quickest to most complete. Pick the one that works best for your situation.

1. pd.DataFrame.groupby('column_to_group_by') \
						['col_to_agg'] \
                  		.your_aggregate_function()
2. pd.DataFrame.groupby('column_to_group_by'].agg({
	'col_to_agg1' : aggregate_function1,
  	'col_to_agg2' : aggregate_function2,
	})
3. pd.DataFrame.groupby('column_to_group_by'].agg(
	new_column_name1=pd.NamedAgg(column='col_to_agg1', aggfunc=aggfunc1),
 	new_column_name2=pd.NamedAgg(column='col_to_agg2', aggfunc=aggfunc2)
	)

Pseudo code: For a given column, find the distinct groups within that column. Then combine all of the values in those groups from another column and apply a function.

Pandas Group By – 3 Methods

Method 1 – Single Aggregate Function

Pandas Group By With Single Aggregate Function - The most simple way to do a group by in Pandas
1. pd.DataFrame.groupby('column_to_group_by') \
						['col_to_agg'] \
                  		.your_aggregate_function()

In method 1 we are doing the most simple type of group by in pandas. This method only has 1 aggregate function. You start by defining the column (or columns) you’d like to group by, then the column you’d like to aggregate, then specify your aggregate function.

This method works great when you’re looking for a quick group by. The result returned will be a Pandas Series.

Method 2 – Multiple Aggregate Functions

Pandas Group By with Multiple columns
2. pd.DataFrame.groupby('column_to_group_by').agg({
	'col_to_agg1' : aggregate_function1,
  	'col_to_agg2' : aggregate_function2,
	})

Say you have different aggregate functions you’d like to apply to different columns. The way we like to do this is with method 2: using a dictionary within .agg(). Unfortunately with this method you can not specify your new column names.

To do this, you’ll need to call the column you want to group by, the column(s) you want to aggregate, and then finally an aggregate function for each column. If you want to see a list of potential aggregate functions, check out the Pandas Series documentation.

Method 3 – Multiple Aggregate Functions with new column names

Pandas Group By With Multiple Columns and new column names. pd.DataFrame.groupby() with aggregate functions
3. pd.DataFrame.groupby('column_to_group_by').agg(
	new_column_name1=pd.NamedAgg(column='col_to_agg1', aggfunc=aggfunc1),
 	new_column_name2=pd.NamedAgg(column='col_to_agg2', aggfunc=aggfunc2)
	)

In method 3 you’ll get to specify your new column names. This method is the most explicit and flexible. It takes the longest to write out, but the output is clear so it’s a winner in our book.

Let’s take a look at code examples


In [1]:
import pandas as pd

Pandas Group By

We have 3 main methods we use to do group bys. Each has their own advantages and disadvantages. Check out each one and see which works best for your case.

We will run through 3 examples:

  1. Method 1 - Quick and simple group by
  2. Method 1 - Quick and simple group by with multiple columns
  3. Method 2 - Different columns with different aggregate functions
  4. Method 3 - Different columns with different aggregate functions and new column names

But first, let's create our DataFrame

In [17]:
df = pd.DataFrame([('Liho Liho', 200, 45.32),
                  ('Chambers', 350, 65.33),
                  ('The Square', 15, 12.45),
                  ('Tosca Cafe', 35, 180.34),
                  ('Liho Liho', 98, 145.42),
                  ('Chambers', 205, 25.35)],
           columns=('name', 'Customers', 'AvgBill')
                 )
df
Out[17]:
nameCustomersAvgBill
0Liho Liho20045.32
1Chambers35065.33
2The Square1512.45
3Tosca Cafe35180.34
4Liho Liho98145.42
5Chambers20525.35

1. Method 1 - Quick and simple group by.

The simplest group by takes a single 'group by column,' single 'column to aggregate' and finally an aggregate function. A series is returned to you

  • Group By Column = 'name'
  • Column To Aggregate = 'AvgBill'
  • Aggregate function = .sum()
In [18]:
df.groupby('name')['AvgBill'].sum()
Out[18]:
name
Chambers       90.68
Liho Liho     190.74
The Square     12.45
Tosca Cafe    180.34
Name: AvgBill, dtype: float64

2. Method 1 - Quick and simple group by with multiple columns

In order to do an aggregate function on multiple columns, simply pass a list of columns into your 'columns to aggregate.'

  • Group By Column = 'name'
  • Column To Aggregate = ['Customers', 'AvgBill']
  • Aggregate function = .sum()
In [20]:
df.groupby('name')[['Customers', 'AvgBill']].sum()
Out[20]:
CustomersAvgBill
name
Chambers55590.68
Liho Liho298190.74
The Square1512.45
Tosca Cafe35180.34

3. Method 2 - Different columns with different aggregate functions

In order to apply different aggregate functions to different columns, you'll need to use the .agg() function. This helpful function allows you to specify each column and the specific function you'd like to apply to it.

You'll need to pass a dictionary to your .agg() function. Keys=Column name you'd like to aggregate, values=aggregate function.

Here I'm taking the MAX number from the Customers column and the mean of the AvgBill Column

In [26]:
df.groupby('name').agg({
    'Customers' : max,
    'AvgBill' : pd.Series.mean
})
Out[26]:
CustomersAvgBill
name
Chambers35045.34
Liho Liho20095.37
The Square1512.45
Tosca Cafe35180.34

4. Method 3 - Different columns with different aggregate functions and new column names

This final method, although long, is nice because you can rename the output columns.

Example for the first columns New column name = max_customers (Note: I agree it's weird that a name of something is pulled from a variable vs a string) column = Column to aggregate: Customers aggfunc = The agg function you'd like to apply - max

In [30]:
df.groupby('name').agg(
    max_customers = pd.NamedAgg(column='Customers', aggfunc=max),
    mean_avg_bill = pd.NamedAgg(column='AvgBill', aggfunc=pd.Series.mean),
)
Out[30]:
max_customersmean_avg_bill
name
Chambers35045.34
Liho Liho20095.37
The Square1512.45
Tosca Cafe35180.34

Link to code above

Check out more Pandas functions on our Pandas Page

Official Documentation