Pandas Merge – Join Data – pd.DataFrame.merge()

Pandas Merge is another Top 10 Pandas function you must know. Simply, if you have two datasets that are related together, how do you bring them together?

Pandas Merge will join two DataFrames together resulting in a single, final dataset. You have full control how your two datasets are combined.

In this post, we’ll review the mechanics of Pandas Merge and go over different scenarios to use it on. For a tutorial on the different types of joins, check out our future post on Data Joins.

There are 3 core ideas you’ll need to grasp

  1. What data do you want to join? – This will be determined by your ‘right’ and ‘left’ datasets.
  2. How do you want to join them? – You have different ways to join two datasets – left, right, full, inner. It’s up to you depending on your final output requirements.
  3. What do you want to join them with? – You also need to tell Pandas how your two datasets are related. What is the common column between the two?
1. pd.DataFrame.merge(other_dataset,
                      how='type_of_join',
                      on='columns_to_join_on')

Pseudo code: For a given DataFrame, join another related DataFrame that results in a new, combined dataset.

Pandas Merge

Pandas Merge - Join Two DataFrames together with Pandas Merge. You can select which dataframes to merge and how you should merge them.

Pandas Merge Tip

When I merge two DataFrames, there are often columns I don’t want to merge in either dataset. For example, say I have two DataFrames with 100 columns distinct columns each, but I only care about 3 columns from each one.

Instead of joining two entire DataFrames together, I’ll only join a subset of columns together. See the code below.

# This join brings together the entire DataFame
df.merge(df2)

# This join only brings together a subset of columns
# 'col1' is my key in this case
df[['col1', 'col2']].merge(df2[['col1', 'col3']], on='col1')

Merge Parameters

  • right: The DataFrame you’re calling .merge() is considered your ‘left’ dataset. You need to specify your other dataset in the right parameter. This can be another DataFrame or named Series.
  • how (‘left’, ‘right’, ‘outer’, ‘inner’, default= ‘inner’): How will determine ‘how’ to join your two datasets together. Do you want to keep all of your samples from your left df? Or your right? Maybe just where they have common rows. For a tutorial on different types of joins, check out this resource.
  • on: Sometimes called merge/join ‘key’. The common column between your two datasets that you’d like to join on. You’ll use on when the two columns have the same name. If they aren’t named the same, then try left_ or right_ on
  • left_on/right_on: If your columns to join on do not have the same name, no problem, simply pass their names into left_on (for your left dataset) and right_on (for your right dataset).
  • left_index/right_index: Alternatively, instead of specifying a column, if you column to join on sits within a DataFrame’s index, you can set left_index/right_index=True.
  • suffixes (Default=(‘_x’, ‘_y’): If you wanted to add a suffix (to help tell which columns came from which DataFrame) to the end of your newly-merged columns you can add them here.
  • indicator (Default = False): Helpful function that will attribute each row to a specific DataFrame or both. Simply, was this row only on the left side, right side, or was it shared by both?
  • validate: Do a sanity check on both of your datasets before you merge. One of the ‘gotchas’ with merges is when there are unknown/unintended duplicates in either of your datasets. The validate parameter will help you check for these.
    • one_to_one” or “1:1”: checks if the merge column is unique in both left and right datasets.
    • one_to_many” or “1:m”: checks if the merge column is unique in left dataset
    • many_to_one” or “m:1”: checks if the merge column is unique in right dataset
    • many_to_many” or “m:m”: does not check duplicates in either column

Let’s take a look at a code sample


In [1]:
import pandas as pd

Pandas Merge

Pandas Merge is an extremely useful and important function within the Pandas library. It is essential for all data analysts to know. Here we'll show how to bring two different datasets together via .merge().

Let's run through 4 examples:

  1. 'hello world' of merges
  2. Merge with different column names
  3. Merge a subset of columns
  4. Different types of merges

First, let's create 2 DataFrames

In [2]:
df1 = pd.DataFrame([('Foreign Cinema', 'Restaurant'),
                   ('Liho Liho', 'Restaurant'),
                   ('500 Club', 'bar'),
                   ('The Square', 'bar')],
           columns=('name', 'type')
                 )

df2 = pd.DataFrame([('Foreign Cinema', 289, 5),
                   ('Liho Liho', 343, 4.5),
                   ('500 Club', 122, 4.7),
                   ('The Square', 45, 2.4)],
           columns=('name', 'AvgBill', 'Rating')
                 )

df1
Out[2]:
nametype
0Foreign CinemaRestaurant
1Liho LihoRestaurant
2500 Clubbar
3The Squarebar
In [3]:
df2
Out[3]:
nameAvgBillRating
0Foreign Cinema2895.0
1Liho Liho3434.5
2500 Club1224.7
3The Square452.4

1. 'Hello World' Of Merges

Let's run through a simple example of merging. Here we will bring together our two DataFrames via their 'name' column.

Check out how the distinct columns from each dataframe are joined together in one.

In [4]:
df1.merge(df2, on='name')
Out[4]:
nametypeAvgBillRating
0Foreign CinemaRestaurant2895.0
1Liho LihoRestaurant3434.5
2500 Clubbar1224.7
3The Squarebar452.4

2. Merge with different column names

Say you have two DataFrames that share a common column, but unfortunately that column has a different name on either df. You could rename the columns to be the same then join. However, we prefer to just specify a left_on and right_on to help us.

Let's create our DataFrames again

In [5]:
df3 = pd.DataFrame([('Foreign Cinema', 'Restaurant'),
                   ('Liho Liho', 'Restaurant'),
                   ('500 Club', 'bar'),
                   ('The Square', 'bar')],
           columns=('name', 'type')
                 )

df4 = pd.DataFrame([('Foreign Cinema', 289, 5),
                   ('Liho Liho', 343, 4.5),
                   ('500 Club', 122, 4.7),
                   ('The Square', 45, 2.4)],
           columns=('Restname', 'AvgBill', 'Rating')
                 )

df3
Out[5]:
nametype
0Foreign CinemaRestaurant
1Liho LihoRestaurant
2500 Clubbar
3The Squarebar
In [6]:
df4
Out[6]:
RestnameAvgBillRating
0Foreign Cinema2895.0
1Liho Liho3434.5
2500 Club1224.7
3The Square452.4

Here I'll merge our two dataframes. I want to merge on 'name' on the left side, and 'Restname' on the right side.

In [7]:
df3.merge(df4, left_on='name', right_on='Restname')
Out[7]:
nametypeRestnameAvgBillRating
0Foreign CinemaRestaurantForeign Cinema2895.0
1Liho LihoRestaurantLiho Liho3434.5
2500 Clubbar500 Club1224.7
3The SquarebarThe Square452.4

3. Merge a subset of columns

If you don't want to merge your entire dataframe, then I like to call .merge() on a subset of columns. Let's see here.

Notice how the 'Rating' column does not get merged since we only took a subset of columns

In [8]:
df1.merge(df2[['name', 'AvgBill']], on='name')
Out[8]:
nametypeAvgBill
0Foreign CinemaRestaurant289
1Liho LihoRestaurant343
2500 Clubbar122
3The Squarebar45

4. Different types of merges

In order to specify the type of merge you would like to do, pass a string of the type name to 'how'. You can choose left, right, outer, inner.

Let's create new DataFrames again

In [9]:
df5 = pd.DataFrame([('Foreign Cinema', 'Restaurant'),
                   ('Liho Liho', 'Restaurant'),
                   ('Chambers', 'Bar')],
           columns=('name', 'type')
                 )

df6 = pd.DataFrame([('Foreign Cinema', 289, 5),
                   ('Liho Liho', 343, 4.5),
                   ('500 Club', 122, 4.7),
                   ('The Square', 45, 2.4)],
           columns=('name', 'AvgBill', 'Rating')
                 )

df5
Out[9]:
nametype
0Foreign CinemaRestaurant
1Liho LihoRestaurant
2ChambersBar
In [10]:
df6
Out[10]:
nameAvgBillRating
0Foreign Cinema2895.0
1Liho Liho3434.5
2500 Club1224.7
3The Square452.4
In [11]:
# Include all items on the left side
df5.merge(df6, on='name', how='left')
Out[11]:
nametypeAvgBillRating
0Foreign CinemaRestaurant289.05.0
1Liho LihoRestaurant343.04.5
2ChambersBarNaNNaN
In [12]:
# Include all items on the right side
df5.merge(df6, on='name', how='right')
Out[12]:
nametypeAvgBillRating
0Foreign CinemaRestaurant2895.0
1Liho LihoRestaurant3434.5
2500 ClubNaN1224.7
3The SquareNaN452.4
In [13]:
# Include all items from both DataFrames
df5.merge(df6, on='name', how='outer')
Out[13]:
nametypeAvgBillRating
0Foreign CinemaRestaurant289.05.0
1Liho LihoRestaurant343.04.5
2ChambersBarNaNNaN
3500 ClubNaN122.04.7
4The SquareNaN45.02.4
In [14]:
# Include all items that both DataFrames share
df5.merge(df6, on='name', how='inner')
Out[14]:
nametypeAvgBillRating
0Foreign CinemaRestaurant2895.0
1Liho LihoRestaurant3434.5

Link to code above

Check out more Pandas functions on our Pandas Page

Official Documentation