Pandas Value Counts – pd.Series.value_counts()

Often when you’re doing exploratory data analysis (EDA), you’ll need to get a better feel for a column. One of the best ways to do this is to understand the distribution of values within your column. This is where Pandas Value Counts comes in.

The Pandas Series.value_counts() function returns a Series containing the counts of the unique values in your Series. By default the resulting Series is sorted in descending order, so the first element is the most frequent value.

YourDataFrame['your_column'].value_counts()
YourSeries.value_counts()

I usually do this when I want to get a bit more intimate with my data. My workflow goes:

  1. Run pandas.Series.nunique() first – This counts how many unique values I have. If it’s 100K+, it’ll slow down my computer once I call value_counts
  2. Run pandas.Series.value_counts() – This tells me which values appear most frequently (sketched below)
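
As a minimal sketch of that workflow (using a made-up DataFrame df with a hypothetical column my_column):

import pandas as pd

df = pd.DataFrame({'my_column': ['a', 'b', 'a', 'c', 'a']})

# Step 1: how many unique values? Cheap to run, and tells you whether
# the value_counts output will even be readable
print(df['my_column'].nunique())       # 3

# Step 2: count how often each unique value appears
print(df['my_column'].value_counts())  # a: 3, b: 1, c: 1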

Pseudo code: Take a DataFrame column (or Series) and find the distinct values. Then count how many times each distinct value occurs.
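
In plain Python, that pseudo code looks roughly like the sketch below using collections.Counter (an illustration of the idea, not how pandas implements it internally):

from collections import Counter

values = ['bar', 'Restaurant', 'bar', 'bar', 'Restaurant']

# Find the distinct values, count each one's occurrences,
# and order the result most-frequent-first
counts = Counter(values).most_common()
print(counts)  # [('bar', 3), ('Restaurant', 2)]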

Hint: You can also do this across unique rows in a DataFrame by calling pandas.DataFrame.value_counts()
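
For example (DataFrame.value_counts was added in pandas 1.1, so you’ll need a reasonably recent version; the columns here are just for illustration):

import pandas as pd

df = pd.DataFrame({'type': ['bar', 'bar', 'Restaurant'],
                   'city': ['SF', 'SF', 'SF']})

# Counts each unique (type, city) row combination
print(df.value_counts())
# type        city
# bar         SF      2
# Restaurant  SF      1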

Pandas Value Counts

Pandas Value Counts - This function will count the distinct values in a Series and return a Series of the number of times each unique value appears.

You don’t need to pass any parameters when counting values; the defaults cover the common case. Let’s take a look at the different parameters you can pass to pd.Series.value_counts():

  • normalize (Default: False): If True, you’ll get back the relative frequencies of the unique values. Instead of counts, the Series returned will hold the proportion each unique value makes up of the whole Series.
  • sort (Default: True): This will return your values in frequency order. The exact direction is determined by the next parameter (ascending).
  • ascending (Default: False): If True, your values come back in ascending order (lowest counts on top). By default your highest counts appear first.
  • bins: Sometimes you’re working with a continuous variable (think a range of numbers vs discrete labels). In that case you’ll have too many unique values to pull signal from your data. If you set bins (Ex: [0, .25, .5, .75, 1]), each value is assigned a bin based on where it falls, and value_counts will count the bin frequency instead of the distinct value frequency. Check out the video or code below for more.
  • dropna (Default: True): By default, NaNs in your Series are left out of the counts; set dropna=False to count them too (see the sketch after this list).
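
dropna is the only parameter above that isn’t covered in the examples later on, so here’s a quick sketch of it (with a made-up Series):

import pandas as pd
import numpy as np

s = pd.Series(['bar', 'Restaurant', np.nan, 'bar'])

print(s.value_counts())              # NaN left out: bar 2, Restaurant 1
print(s.value_counts(dropna=False))  # NaN counted too: bar 2, Restaurant 1, NaN 1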

Here’s a Jupyter notebook showing how to use value_counts in Pandas

In [1]:
import pandas as pd

Pandas Value Counts

Pandas Value Counts will count the frequency of the unique values in your series. Or simply, "count how many times each value occurs."

We will run through 3 examples:

  1. Counting frequency of unique values in a series
  2. Counting relative frequency of unique values in a series (normalizing)
  3. Counting a continuous series using bins.

First, let's create our DataFrame

In [2]:
df = pd.DataFrame([('Foreign Cinema', 'Restaurant', 289.0),
                   ('Liho Liho', 'Restaurant', 224.0),
                   ('500 Club', 'bar', 80.5),
                   ('The Square', 'bar', 25.30),
                   ('Liho Liho', 'Restaurant', 124.0),
                   ('The Square', 'bar', 53.30),
                   ('Liho Liho', 'Restaurant', 324.0),
                   ('500 Club', 'bar', 40.5),
                   ('Salzburg', 'bar', 123.5)],
                  columns=('name', 'type', 'AvgBill'))
df
Out[2]:
             name        type  AvgBill
0  Foreign Cinema  Restaurant    289.0
1       Liho Liho  Restaurant    224.0
2        500 Club         bar     80.5
3      The Square         bar     25.3
4       Liho Liho  Restaurant    124.0
5      The Square         bar     53.3
6       Liho Liho  Restaurant    324.0
7        500 Club         bar     40.5
8        Salzburg         bar    123.5

Counting frequency of unique values in a series

Then let's call value_counts on our "name" column. This will look at the distinct values within that column, and count how many times they appear.

In [3]:
df['name'].value_counts()
Out[3]:
Liho Liho         3
500 Club          2
The Square        2
Foreign Cinema    1
Salzburg          1
Name: name, dtype: int64

We could also have the series returned in reverse order (lowest values first) by setting ascending=True. Remember, ascending means to go up, so you'll start low and go up to the highest values.

In [4]:
df['name'].value_counts(ascending=True)
Out[4]:
Salzburg          1
Foreign Cinema    1
The Square        2
500 Club          2
Liho Liho         3
Name: name, dtype: int64

Counting relative frequency of unique values in a series (normalizing)

Say you didn't want the count of each unique value, but rather how frequently each value appears compared to the whole series. In order to do this, you'll set normalize=True.

In [5]:
df['name'].value_counts(normalize=True)
Out[5]:
Liho Liho         0.333333
500 Club          0.222222
The Square        0.222222
Foreign Cinema    0.111111
Salzburg          0.111111
Name: name, dtype: float64

Let's break this down quickly. There are a total of 9 items in the Series (run "len(df)" if you don't believe me.)

From value_counts above, we saw that "Liho Liho" appeared 3 times. Since it appears 3 times out of 9 rows, we can do 3 / 9, which equals .333. This is the relative frequency of "Liho Liho" in this series.
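
You can check that arithmetic yourself; normalize=True is simply each count divided by the length of the Series (using the df we built above):

counts = df['name'].value_counts()

# Dividing the raw counts by the number of rows reproduces normalize=True
print(counts / len(df))
print(counts['Liho Liho'] / len(df))  # 0.3333...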

Counting a continuous series using bins

Now let's say we have a longer series of continuous values. Think of continuous values as a list of numbers that don't serve as labels. For example: [.2, .23, .43, .85, .13]. Say we thought that .2 and .23 were close enough and wanted to count them together. Unfortunately, if we ran value_counts as-is, we would count .2 and .23 as separate values.

If you want to group them together, this is where bins comes in. In order to create a list of random continuous numbers, I'm going to use numpy

In [14]:
import numpy as np
np.random.seed(seed=42) # To make sure the same values appear each time

random_numbers = np.random.random(size=(10, 1))
random_numbers = pd.DataFrame(random_numbers, columns=['rand_num'])
random_numbers
Out[14]:
   rand_num
0  0.374540
1  0.950714
2  0.731994
3  0.598658
4  0.156019
5  0.155995
6  0.058084
7  0.866176
8  0.601115
9  0.708073

Now I want to split my data into 3 bins and count how many values fall within each bin.

In [15]:
random_numbers['rand_num'].value_counts(bins=3)
Out[15]:
(0.653, 0.951]     4
(0.356, 0.653]     3
(0.0562, 0.356]    3
Name: rand_num, dtype: int64

In this case, bins returns buckets that are evenly spaced. But what if you wanted to create your own buckets? No problem, just pass a list of values that describe your bucket edges.

In [17]:
random_numbers['rand_num'].value_counts(bins=[0, .2, .6, 1])
Out[17]:
(0.6, 1.0]       5
(-0.001, 0.2]    3
(0.2, 0.6]       2
Name: rand_num, dtype: int64
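
If it helps your mental model, this is roughly what bucketing with pd.cut followed by a plain value_counts does. A sketch (I believe value_counts applies include_lowest=True when cutting, which is why the first interval showed up as (-0.001, 0.2] above):

# Bucket each value into a custom interval, then count per bucket
binned = pd.cut(random_numbers['rand_num'], bins=[0, .2, .6, 1],
                include_lowest=True)
print(binned.value_counts())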

Link to code above

Check out more Pandas functions on our Pandas Page

Official Documentation