Do you ever have repeat rows in your data when you don’t want to? Pandas Drop duplicates will remove these for you.
Pandas DataFrame.drop_duplicates()
will remove any duplicate rows (or rows that are duplicates across a subset of columns) from your DataFrame. It is super helpful when you want to make sure your data has a unique key or unique rows.
YourDataFrame.drop_duplicates()
This function is a combination of DataFrame.drop() and DataFrame.duplicated().
I use this function most when I have a column that represents a unique id of an object. I’ll run .drop_duplicates() specifying my unique column as the subset.
Pseudo code: Look at all the rows (or subset of columns with your rows) and see if there are duplicates. If so, drop ’em.
Pandas Drop Duplicates
.drop_duplicates() is pretty straightforward. The two decisions you'll have to make are: 1) what subset of your data do you want pandas to evaluate for duplicates, and 2) do you want to keep the first, the last, or none of your duplicates?
Duplicate Parameters
- subset: By default, pandas compares entire rows to decide whether one row is a duplicate of another. The subset parameter tells pandas to evaluate only a subset of columns when looking for duplicates, rather than the whole row.
- keep (Default: 'first'): If you have duplicate rows, you can also tell pandas which one(s) to keep. keep='first' will keep the first duplicate and drop the rest. keep='last' will keep the last duplicate and drop the rest. keep=False will drop all of them.
- inplace (Default: False): If True, the operation is done in place (your current DataFrame is overwritten and the method returns None). If False, a new DataFrame is returned to you.
- ignore_index (Default: False): If True, the index returned to you will be relabeled 0, 1, 2, ..., n-1. If False, the prior index is used, minus the labels of the dropped rows.
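To illustrate the less commonly shown parameters, here's a quick sketch of keep=False and ignore_index on a toy DataFrame (the data here is made up purely for illustration):

```python
import pandas as pd

# Toy data with one pair of fully duplicated rows (illustrative only)
df = pd.DataFrame({'id': [1, 1, 2, 3], 'value': ['a', 'a', 'b', 'c']})

# keep=False drops every member of a duplicate group
no_dupes_at_all = df.drop_duplicates(keep=False)
print(no_dupes_at_all.index.tolist())  # [2, 3]

# ignore_index=True relabels the result 0..n-1 instead of keeping the old labels
relabeled = df.drop_duplicates(ignore_index=True)
print(relabeled.index.tolist())  # [0, 1, 2]
```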
Here’s a Jupyter notebook showing how to drop duplicates in Pandas
import pandas as pd
We will run through 3 examples:
- Dropping rows from duplicate rows
- Dropping rows from duplicate subset of columns
- Keeping the last duplicate instead of the default first
Let's create our DataFrame
df = pd.DataFrame({
'brand': ['Jet Boil', 'Jet Boil', 'Osprey', 'Osprey', 'Osprey'],
'equipment': ['Stove', 'Stove', 'Backpack', 'Waterbottle', 'Backpack'],
'rating': [3, 3, 5.5, 8.6, 7]
})
df
1. Dropping rows from duplicate rows
When we call the default drop_duplicates, we are asking pandas to find all the duplicate rows, and then keep only the first ones.
Notice below: when we call drop duplicates, row 2 (index=1) gets dropped because it is the 2nd instance of a duplicate row.
df.drop_duplicates()
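As a quick sanity check, only the second 'Jet Boil' row should be gone (the DataFrame is rebuilt here so the snippet runs on its own):

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Jet Boil', 'Jet Boil', 'Osprey', 'Osprey', 'Osprey'],
    'equipment': ['Stove', 'Stove', 'Backpack', 'Waterbottle', 'Backpack'],
    'rating': [3, 3, 5.5, 8.6, 7]
})

# The default call keeps the first row of each duplicate group
deduped = df.drop_duplicates()
print(deduped.index.tolist())  # [0, 2, 3, 4] -- index 1 was dropped
```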
2. Dropping rows from duplicate subset of columns
When we specify a subset of columns, drop duplicates will only look at that column (or multiple columns) to see if it duplicates the same subset of columns in any other row. If so, those duplicate rows will get dropped.
Here we are specifying a subset to only look at the column 'brand.' All duplicates within the brand column will get dropped except for the 1st ones (because keep defaults to 'first').
df.drop_duplicates(subset='brand')
You can also do multiple columns as a subset by passing a list
df.drop_duplicates(subset=['brand', 'equipment'])
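To confirm what the two subset calls keep (same DataFrame rebuilt so this snippet stands alone):

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Jet Boil', 'Jet Boil', 'Osprey', 'Osprey', 'Osprey'],
    'equipment': ['Stove', 'Stove', 'Backpack', 'Waterbottle', 'Backpack'],
    'rating': [3, 3, 5.5, 8.6, 7]
})

# subset='brand': only the first row per brand survives
by_brand = df.drop_duplicates(subset='brand')
print(by_brand.index.tolist())  # [0, 2]

# subset=['brand', 'equipment']: the second Osprey Backpack (index 4) drops too
by_pair = df.drop_duplicates(subset=['brand', 'equipment'])
print(by_pair.index.tolist())  # [0, 2, 3]
```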
3. Keeping the last duplicate instead of the default first
By default, .drop_duplicates() will keep your first duplicate it finds. However, if you wanted to switch it up and keep the last one you can specify keep='last'
Here we are running the same command as the first example, but keep='last'. Notice how row 1 (index=0) gets dropped. We keep the last duplicate only.
df.drop_duplicates(keep='last')
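With keep='last', the surviving row of the duplicate pair is index 1 rather than index 0 (DataFrame rebuilt so the snippet runs on its own):

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['Jet Boil', 'Jet Boil', 'Osprey', 'Osprey', 'Osprey'],
    'equipment': ['Stove', 'Stove', 'Backpack', 'Waterbottle', 'Backpack'],
    'rating': [3, 3, 5.5, 8.6, 7]
})

# keep='last' keeps the last occurrence of each duplicate group
deduped = df.drop_duplicates(keep='last')
print(deduped.index.tolist())  # [1, 2, 3, 4] -- index 0 was dropped
```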
Check out more Pandas functions on our Pandas Page