Pandas Duplicated – pd.Series.duplicated()

You may want to know if you have duplicate values in your DataFrame or Series. That’s where pandas duplicated, or pd.Series.duplicated(), comes in.

You use pandas duplicated when you want to remove repeated values or flag them for further analysis.

To do this, simply call YourSeries.duplicated(). It returns a boolean Series that is True wherever a value has already appeared earlier in the Series.
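As a minimal sketch of that call, the snippet below flags repeats in a small Series and pulls out the flagged rows. The Series `s` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical example data; any Series works the same way.
s = pd.Series(["a", "b", "a", "c", "b", "a"])

# With the default keep='first', a row is True when its value
# has already appeared at an earlier position.
mask = s.duplicated()
print(mask.tolist())

# Use the boolean mask to look at only the flagged rows.
flagged = s[mask]
print(flagged)
```

The mask has the same index as the original Series, so it can be used directly for boolean indexing.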

But you may want to treat your duplicates differently. Do you want to know about the first duplicate, or the last? Pandas lets you pick with the keep parameter. However, it’s a bit counterintuitive, so let’s look at the options.

  • Method 1 – keep='first' (default): For when you want to mark all duplicates as True…EXCEPT for the first one.
  • Method 2 – keep='last': For when you want to mark all duplicates as True…EXCEPT for the last one.
  • Method 3 – keep=False: For when you want to mark all duplicates as True.

[Image: Pandas duplicated]

Check out how the different options below match up against each other. We have a pandas Series listing out different cities in the US. San Francisco and Dallas appear multiple times and therefore are duplicates. New York and Miami only appear once and are not duplicates.

Notice how New York and Miami never return True with .duplicated(). This is because there aren’t any duplicates!

However, San Francisco and Dallas do return True. Well, sometimes, depending on the “keep” you choose.
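The comparison described above can be sketched in code. The exact order of the cities in the original Series is not given, so the ordering below is an assumption made only for illustration, as are the column labels.

```python
import pandas as pd

# Hypothetical ordering of the cities; only the set of cities
# and which ones repeat comes from the text above.
cities = pd.Series(
    ["San Francisco", "Dallas", "New York", "San Francisco",
     "Miami", "Dallas", "San Francisco"],
    name="city",
)

# One column per keep option, side by side with the values.
comparison = pd.DataFrame(
    {
        "city": cities,
        "keep='first'": cities.duplicated(keep="first"),
        "keep='last'": cities.duplicated(keep="last"),
        "keep=False": cities.duplicated(keep=False),
    }
)
print(comparison)
```

New York and Miami are False in every column, while the San Francisco and Dallas rows change depending on the keep option.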

Here’s another example in a Jupyter notebook:

In [1]:
import pandas as pd

Pandas Duplicated

Oftentimes you'll have a Series with duplicate values and you'll want to know where they are. Pandas duplicated is the method for the job.

First let's create a series with some data

In [30]:
my_series = pd.Series(["Kanye West", "Drake" , "Mac Miller", "Drake", "Beyonce", "Kanye West", "Drake"], name='artists')
my_series
Out[30]:
0    Kanye West
1         Drake
2    Mac Miller
3         Drake
4       Beyonce
5    Kanye West
6         Drake
Name: artists, dtype: object

Looks like Kanye West and Drake both have duplicate values in the series above. This is an easy example, but what if you have 100K data points? You'll need a quicker way to locate duplicates than eyeballing it.
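For a large Series, one quick first check is whether any value repeats at all, and which ones. Here is a minimal sketch using `is_unique` and `value_counts()` on the same data; the variable names `has_repeats`, `counts`, and `repeated` are just for illustration.

```python
import pandas as pd

my_series = pd.Series(
    ["Kanye West", "Drake", "Mac Miller", "Drake",
     "Beyonce", "Kanye West", "Drake"],
    name="artists",
)

# is_unique is False as soon as any value appears more than once.
has_repeats = not my_series.is_unique

# Count how often each value occurs, then keep only the repeated ones.
counts = my_series.value_counts()
repeated = counts[counts > 1]

print(has_repeats)
print(repeated)
```

This tells you which values repeat, but not where. Locating the individual rows is what duplicated() is for.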

Before you get started finding duplicates, you have one decision to make: Which duplicates do you want to flag? All but the first, all but the last, or all of them?

Method 1 - keep='first' (default): For when you want to mark all duplicates as True...EXCEPT for the first one.

In [31]:
series_duplicates_first = my_series.duplicated(keep='first') # Finding the duplicates
series_duplicates_first.name = 'duplicates' # Giving the series a name to view later
series_duplicates_first # View your duplicates
Out[31]:
0    False
1    False
2    False
3     True
4    False
5     True
6     True
Name: duplicates, dtype: bool

Let's merge the two series together with pd.concat() to easily view them

In [32]:
pd.concat([my_series, series_duplicates_first], axis=1)
Out[32]:
      artists  duplicates
0  Kanye West       False
1       Drake       False
2  Mac Miller       False
3       Drake        True
4     Beyonce       False
5  Kanye West        True
6       Drake        True

Notice how all of the duplicates ("Kanye West"s and "Drake"s) are marked as True (meaning they are duplicates), except for the first one!
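Once you have the mask, removing the flagged rows is one step. The sketch below inverts the mask with `~` and checks that the result matches `drop_duplicates()`. The name `first_only` is an illustrative choice.

```python
import pandas as pd

my_series = pd.Series(
    ["Kanye West", "Drake", "Mac Miller", "Drake",
     "Beyonce", "Kanye West", "Drake"],
    name="artists",
)

# Keep only the rows that are NOT flagged, i.e. the first occurrence of each value.
first_only = my_series[~my_series.duplicated(keep="first")]
print(first_only)

# drop_duplicates() with the same keep option gives the same result.
same_as_drop = first_only.equals(my_series.drop_duplicates(keep="first"))
print(same_as_drop)
```

Using the mask directly is handy when you also want to inspect or log the rows you are about to drop.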

Method 2 - keep='last': For when you want to mark all duplicates as True...EXCEPT for the last one.

In [33]:
series_duplicates_last = my_series.duplicated(keep='last') # Finding the duplicates
series_duplicates_last.name = 'duplicates' # Giving the series a name to view later
series_duplicates_last # View your duplicates

pd.concat([my_series, series_duplicates_last], axis=1) # View your duplicates next to your values
Out[33]:
      artists  duplicates
0  Kanye West        True
1       Drake        True
2  Mac Miller       False
3       Drake        True
4     Beyonce       False
5  Kanye West       False
6       Drake       False

Notice how all of the duplicates ("Kanye West"s and "Drake"s) are marked as True (meaning they are duplicates), except for the last one!
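One place keep='last' tends to be useful is a DataFrame where later rows are newer and you want the most recent row per key. The sketch below assumes a hypothetical DataFrame with made-up `artist` and `plays` columns, and uses the `subset` argument of `DataFrame.duplicated()` to compare on one column only.

```python
import pandas as pd

# Hypothetical data: later rows are assumed to be more recent.
df = pd.DataFrame(
    {
        "artist": ["Drake", "Kanye West", "Drake", "Beyonce", "Drake"],
        "plays": [10, 20, 30, 40, 50],
    }
)

# Flag every row whose artist appears again further down.
older = df.duplicated(subset="artist", keep="last")

# Keep only the last row for each artist.
latest = df[~older]
print(latest)
```

Without `subset`, DataFrame.duplicated() compares entire rows, so none of these rows would be flagged because the `plays` values differ.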

Method 3 - keep=False: For when you want to mark all duplicates as True.

In [29]:
series_duplicates_false = my_series.duplicated(keep=False) # Finding the duplicates
series_duplicates_false.name = 'duplicates' # Giving the series a name to view later
series_duplicates_false # View your duplicates

pd.concat([my_series, series_duplicates_false], axis=1) # View your duplicates next to your values
Out[29]:
      artists  duplicates
0  Kanye West        True
1       Drake        True
2  Mac Miller       False
3       Drake        True
4     Beyonce       False
5  Kanye West        True
6       Drake        True

Notice how all of the duplicates ("Kanye West"s and "Drake"s) are marked as True (meaning they are duplicates) now.
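With keep=False, the mask splits the Series cleanly into values that repeat and values that appear exactly once. A short sketch, with illustrative variable names:

```python
import pandas as pd

my_series = pd.Series(
    ["Kanye West", "Drake", "Mac Miller", "Drake",
     "Beyonce", "Kanye West", "Drake"],
    name="artists",
)

all_dupes = my_series.duplicated(keep=False)

# Every row whose value appears more than once.
repeated_rows = my_series[all_dupes]

# Only the rows whose value appears exactly once.
unique_rows = my_series[~all_dupes]

print(repeated_rows)
print(unique_rows)
```

This is the option to reach for when you want to review every copy of a repeated value before deciding which one to keep.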
