Pandas Drop Duplicates – pd.DataFrame.drop_duplicates()

Do you ever have repeated rows in your data when you don’t want them? Pandas drop duplicates will remove these for you.

Pandas DataFrame.drop_duplicates() will remove any duplicate rows (or rows that are duplicates across a subset of columns) from your DataFrame. It is super helpful when you want to make sure your data has a unique key or unique rows.

YourDataFrame.drop_duplicates()

This function is a combination of DataFrame.drop() and DataFrame.duplicated().
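To make that combination concrete, here is a minimal sketch on a small made-up DataFrame. It assumes a unique index, since the manual version drops rows by index label; it illustrates the idea and is not how pandas implements the function internally.

```python
import pandas as pd

# Small illustrative DataFrame (made up for this sketch)
df = pd.DataFrame({
    "brand": ["Jet Boil", "Jet Boil", "Osprey"],
    "equipment": ["Stove", "Stove", "Backpack"],
})

# duplicated() flags every repeat of an earlier row with True
mask = df.duplicated()

# drop() removes the flagged rows by their index labels
manual = df.drop(index=df.index[mask])

# drop_duplicates() does both steps in one call
builtin = df.drop_duplicates()

same = manual.equals(builtin)
print(mask.tolist())
print(same)
```

Both approaches should leave the same two rows, so in practice you can reach straight for drop_duplicates().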

I use this function most when I have a column that represents a unique id of an object. I’ll run .drop_duplicates() specifying my unique column as the subset.
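As a rough sketch of that workflow, the example below uses a hypothetical users table where user_id is supposed to be a unique key. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: 'user_id' is meant to be unique, but 102 appears twice
users = pd.DataFrame({
    "user_id": [101, 102, 102, 103],
    "name": ["Ana", "Ben", "Benjamin", "Cy"],
})

# Only compare the 'user_id' column when looking for duplicates
deduped = users.drop_duplicates(subset="user_id")

# Series.is_unique confirms the key is now unique
is_unique_key = deduped["user_id"].is_unique
print(deduped)
print(is_unique_key)
```

Because keep defaults to 'first', the first row seen for each user_id is the one that survives.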

Pseudo code: Look at all the rows (or a subset of columns within your rows) and see if there are duplicates. If so, drop ’em.


.drop_duplicates() is pretty straightforward. The two decisions you’ll have to make are 1) What subset of your data do you want pandas to evaluate for duplicates? and 2) Do you want to keep the first, the last, or none of your duplicates?

Duplicate Parameters

  • subset: By default, pandas will look at your entire row to see if it is a duplicate of any other entire row. However, you can tell pandas to only look at a subset of columns instead of the whole row. The subset parameter specifies which columns you would like pandas to evaluate.
  • keep (Default: 'first'): If you have duplicate rows, you can also tell pandas which one(s) to keep. keep='first' will keep the first duplicate and drop the rest. keep='last' will keep the last duplicate and drop the rest. keep=False will drop all of them.
  • inplace (Default: False): If True, the operation is done in place (it writes over your current DataFrame) and nothing is returned. If False, a new DataFrame is returned to you and the original is left unchanged.
  • ignore_index (Default: False): If True, the returned DataFrame is relabeled with a clean index 0, 1, 2, …, n-1. If False, the surviving rows keep their prior index labels, so the labels of the dropped rows will be missing.
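The parameters that the walkthrough does not cover can be sketched quickly. This is a minimal example on a made-up three-row DataFrame; note that ignore_index requires pandas 1.0 or newer.

```python
import pandas as pd

# Small illustrative DataFrame: rows 0 and 1 are identical
df = pd.DataFrame({
    "brand": ["Jet Boil", "Jet Boil", "Osprey"],
    "rating": [3, 3, 5],
})

# keep=False drops every row that has a duplicate, keeping none of them
no_dupes_at_all = df.drop_duplicates(keep=False)

# ignore_index=True relabels the result 0, 1, ..., n-1
relabeled = df.drop_duplicates(keep="last", ignore_index=True)

# inplace=True modifies the DataFrame itself and returns None
working = df.copy()
result = working.drop_duplicates(inplace=True)

print(no_dupes_at_all.index.tolist())
print(relabeled.index.tolist())
print(result, len(working))
```

A common slip is writing df = df.drop_duplicates(inplace=True), which replaces df with None. Use either the assignment or inplace=True, not both.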

Here’s a Jupyter notebook showing how to drop duplicates in Pandas

In [1]:
import pandas as pd

Pandas Drop Duplicates

We will run through 3 examples:

  1. Dropping rows from duplicate rows
  2. Dropping rows from duplicate subset of columns
  3. Keeping the last duplicate instead of the default first

Let's create our DataFrame

In [2]:
df = pd.DataFrame({
    'brand': ['Jet Boil', 'Jet Boil', 'Osprey', 'Osprey', 'Osprey'],
    'equipment': ['Stove', 'Stove', 'Backpack', 'Waterbottle', 'Backpack'],
    'rating': [3, 3, 5.5, 8.6, 7]
})
df
Out[2]:
      brand    equipment  rating
0  Jet Boil        Stove     3.0
1  Jet Boil        Stove     3.0
2    Osprey     Backpack     5.5
3    Osprey  Waterbottle     8.6
4    Osprey     Backpack     7.0

1. Dropping rows from duplicate rows

When we call the default drop_duplicates, we are asking pandas to find all the duplicate rows, and then keep only the first ones.

Notice below, we call drop duplicates and row 2 (index=1) gets dropped because it is the 2nd instance of a duplicate row.

In [3]:
df.drop_duplicates()
Out[3]:
      brand    equipment  rating
0  Jet Boil        Stove     3.0
2    Osprey     Backpack     5.5
3    Osprey  Waterbottle     8.6
4    Osprey     Backpack     7.0

2. Dropping rows from duplicate subset of columns

When we specify a subset of columns, drop duplicates will only look at that column (or those multiple columns) to see if a row's values there match the same columns in any other row. If so, then those duplicates will get dropped.

Here we are specifying a subset to only look at the column 'brand'. All duplicates within the brand column will get dropped except for the 1st ones (because keep defaults to 'first').

In [4]:
df.drop_duplicates(subset='brand')
Out[4]:
      brand equipment  rating
0  Jet Boil     Stove     3.0
2    Osprey  Backpack     5.5

You can also use multiple columns as a subset by passing a list:

In [5]:
df.drop_duplicates(subset=['brand', 'equipment'])
Out[5]:
      brand    equipment  rating
0  Jet Boil        Stove     3.0
2    Osprey     Backpack     5.5
3    Osprey  Waterbottle     8.6
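Before dropping on a subset, it can help to see which rows pandas considers duplicates. The sketch below rebuilds the same DataFrame and uses DataFrame.duplicated() with the same subset; the variable names are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Jet Boil', 'Jet Boil', 'Osprey', 'Osprey', 'Osprey'],
    'equipment': ['Stove', 'Stove', 'Backpack', 'Waterbottle', 'Backpack'],
    'rating': [3, 3, 5.5, 8.6, 7]
})

cols = ['brand', 'equipment']

# keep=False marks every member of a duplicate group, not just the repeats
in_dup_group = df.duplicated(subset=cols, keep=False)
dup_rows = df[in_dup_group]

# With the default keep='first', True marks exactly the rows that would be dropped
n_to_drop = int(df.duplicated(subset=cols).sum())

print(dup_rows)
print(n_to_drop)
```

Here rows 0, 1, 2 and 4 belong to a duplicate group, and two of them (index 1 and 4) would be dropped, which matches the three rows left in the output above.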

3. Keeping the last duplicate instead of the default first

By default, .drop_duplicates() will keep the first duplicate it finds. However, if you want to switch it up and keep the last one, you can specify keep='last'.

Here we are running the same command as the first example, but keep='last'. Notice how row 1 (index=0) gets dropped. We keep the last duplicate only.

In [6]:
df.drop_duplicates(keep='last')
Out[6]:
      brand    equipment  rating
1  Jet Boil        Stove     3.0
2    Osprey     Backpack     5.5
3    Osprey  Waterbottle     8.6
4    Osprey     Backpack     7.0
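One place keep='last' earns its keep is when it is paired with a sort. The sketch below is one possible pattern, not something from the notebook above: it sorts by rating first so that the "last" row in each brand/equipment group is the highest rated one.

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Jet Boil', 'Jet Boil', 'Osprey', 'Osprey', 'Osprey'],
    'equipment': ['Stove', 'Stove', 'Backpack', 'Waterbottle', 'Backpack'],
    'rating': [3, 3, 5.5, 8.6, 7]
})

# Sort ascending by rating (mergesort is stable, so ties keep their order),
# then keep the last row of each brand/equipment pair, i.e. the highest rated
best = (
    df.sort_values('rating', kind='mergesort')
      .drop_duplicates(subset=['brand', 'equipment'], keep='last')
      .sort_index()
)

print(best)
```

With this data, the Osprey Backpack rated 7.0 is kept over the one rated 5.5.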
