Pandas Get Dummies – pd.get_dummies()

When you’re doing machine learning, you’ll work with algorithms that cannot process categorical variables directly. In this case, you need to turn your column of labels (Ex: [‘cat’, ‘dog’, ‘bird’, ‘cat’]) into separate columns of 0s and 1s. This is called creating dummy columns (also known as one-hot encoding).

Pandas pd.get_dummies() will turn your categorical column (column of labels) into indicator columns (columns of 0s and 1s).

1. pd.get_dummies(your_data)

This function is heavily used to prepare data for machine learning algorithms. For instance, random forest doesn’t do great with columns that have labels. It’s best to turn these into dummy indicator columns.

[Figure: Pandas Get Dummies turns a categorical “Name” column into many indicator columns, one per distinct value.]

In the above scenario, we are creating dummy columns from our “Name” column. A new column is created for every distinct value we had in our original “Name” column.

Pseudo code: For each distinct value in your original categorical column, create a new column with an indicator (0 or 1).
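The pseudo code above can be sketched by hand and compared with the built-in function. This is only an illustration: the `animal` Series and the variable names are made up for the example, and `dtype=int` is passed because recent pandas versions return True/False columns unless you ask for integers.

```python
import pandas as pd

# Hypothetical column of labels, used only for this sketch
labels = pd.Series(["cat", "dog", "bird", "cat"], name="animal")

# For each distinct value, build one 0/1 column marking the rows that hold it
manual = pd.DataFrame(
    {
        f"animal_{value}": (labels == value).astype(int)
        for value in sorted(labels.unique())
    }
)

# The built-in equivalent; dtype=int requests 0/1 integers
built_in = pd.get_dummies(labels, prefix="animal", dtype=int)

print(manual)
print(built_in)
```

Both frames should hold the same three columns (one each for bird, cat and dog), which is all pd.get_dummies() is doing conceptually.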


Be careful: if your categorical column has too many distinct values in it, you’ll quickly explode the number of new dummy columns. Before you run pd.get_dummies(), make sure to run pd.Series.nunique() on the column to see how many new columns you’ll create.
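One way to apply that check is to count the distinct values first and only create dummies when the count is reasonable. This is a minimal sketch: the `MAX_DUMMY_COLUMNS` threshold is a hypothetical value picked for the example, not a pandas setting.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Foreign Cinema", "Liho Liho", "500 Club", "Foreign Cinema"],
    "Amount": [289.0, 224.0, 80.5, 25.3],
})

# Hypothetical limit; choose whatever makes sense for your data
MAX_DUMMY_COLUMNS = 50

# nunique() counts distinct non-missing values, i.e. the dummy columns you'd get
n_new_columns = df["name"].nunique()

if n_new_columns <= MAX_DUMMY_COLUMNS:
    dummies = pd.get_dummies(df, columns=["name"])
else:
    dummies = None
    print(f"'name' has {n_new_columns} distinct values, skipping get_dummies")
```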

Get Dummies Parameters

  • data: The data that you want to create dummy indicator columns from. This will be your DataFrame or Series
  • prefix (Default: None): You’d use this parameter if you wanted to add a prefix (string at the beginning) to your new column names. This can be helpful for identifying which columns are dummies afterward. When you pass a DataFrame, the original column name is used as the prefix by default.
  • prefix_sep (Default: “_”): When you want to get fancy, you could specify what you want between your prefix and column names. The default is “_”. I doubt you’ll ever change this. If you do, please tweet about it and @ DataIndependent.
  • dummy_na (Default: False): Used if you want to create a dummy column for your NA values.
  • columns (Default: None): Defines which columns from your DataFrame you’d like to get dummies for. By default it’s every column with object, string, or category dtype (no ints, floats, etc.)
  • sparse (Default: False): Sometimes your dummy columns are sparse. This means there are a TON of 0s because you have a high number of distinct values in your original column. sparse=True stores the dummy columns in a sparse format, which can save a lot of memory in this case.
  • drop_first (Default: False): Advanced option – only use this if you know what you’re doing. It drops the dummy column for your first category. This is possible because if every other dummy column in a row is 0, then your first value must have been the 1. What you remove in redundancy, you gain in confusion.
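The dummy_na and sparse parameters are easier to see on a small example. The sketch below uses a made-up `pets` Series with one missing value; `dtype=int` is passed so the indicators are 0/1 integers regardless of pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical labels with one missing value
pets = pd.Series(["cat", "dog", np.nan, "cat"])

# Default (dummy_na=False): the missing row is 0 in every dummy column
without_na = pd.get_dummies(pets, dtype=int)

# dummy_na=True adds one extra indicator column for the missing values
with_na = pd.get_dummies(pets, dummy_na=True, dtype=int)

# sparse=True stores each dummy column with a sparse dtype instead of a dense one
sparse_dummies = pd.get_dummies(pets, sparse=True, dtype=int)

print(without_na)
print(with_na)
print(sparse_dummies.dtypes)
```

On four rows the sparse version saves nothing noticeable. The benefit only shows up when there are many distinct values and therefore many 0s.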

Here’s a Jupyter notebook showing how to get dummies in Pandas

In [1]:
import pandas as pd

Pandas Get Dummies

Pandas Get Dummies will turn your categorical variables into many dummy indicator variables. This means you'll go from a Series of labels (['Bob', 'Fred', 'Katie']) to one indicator column per label, so each row becomes a list of indicators (for example, 'Fred' becomes [0, 1, 0]).

Let's run through 3 examples:

  1. Creating Dummy Indicator columns
  2. Creating Dummy Indicator columns with prefix
  3. Creating Dummy Indicator columns and dropping the first variable

First, let's create a DataFrame

In [2]:
df = pd.DataFrame([('Foreign Cinema', 289.0),
                   ('Liho Liho', 224.0),
                   ('500 Club', 80.5),
                   ('Foreign Cinema', 25.30)],
                  columns=('name', 'Amount'))

df
Out[2]:
             name  Amount
0  Foreign Cinema   289.0
1       Liho Liho   224.0
2        500 Club    80.5
3  Foreign Cinema    25.3

1. Creating Dummy Indicator columns

To create dummy columns, I need to tell pandas which DataFrame I want to use and which columns I want to create dummies on. Here I want to create dummies on the 'name' column.

Notice how there are 3 new columns, one for every distinct value within our old 'name' column. Each new column holds 1s and 0s showing whether that row had the column's value. (Recent pandas versions show True/False here unless you pass dtype=int.)

In [3]:
pd.get_dummies(df, columns=['name'])
Out[3]:
   Amount  name_500 Club  name_Foreign Cinema  name_Liho Liho
0   289.0              0                    1               0
1   224.0              0                    0               1
2    80.5              1                    0               0
3    25.3              0                    1               0
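Since 'name' is the only text column in this DataFrame, leaving out the columns parameter should give the same result, because by default pandas only converts object, string, and category columns. A small sketch to check that, rebuilding the same DataFrame so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame([('Foreign Cinema', 289.0),
                   ('Liho Liho', 224.0),
                   ('500 Club', 80.5),
                   ('Foreign Cinema', 25.30)],
                  columns=('name', 'Amount'))

# Explicitly name the column to convert
explicit_cols = pd.get_dummies(df, columns=['name'])

# Let pandas pick the columns: the float 'Amount' column is left alone
default_cols = pd.get_dummies(df)

print(list(default_cols.columns))
```

Passing columns explicitly is still the safer habit, since a DataFrame with several text columns would have all of them converted.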

2. Creating Dummy Indicator columns with prefix

See how above all of my new columns start with "name_"? Well, I don't like it. I want to switch the prefix to something else. You can do this by specifying the "prefix" parameter.

In [4]:
pd.get_dummies(df, columns=['name'], prefix="dmy")
Out[4]:
   Amount  dmy_500 Club  dmy_Foreign Cinema  dmy_Liho Liho
0   289.0             0                   1              0
1   224.0             0                   0              1
2    80.5             1                   0              0
3    25.3             0                   1              0

You know what else I don't like? The _ that is in the middle of my prefix and column names. I'll switch it to an * by specifying the prefix_sep.

In [5]:
pd.get_dummies(df, columns=['name'], prefix="dmy", prefix_sep="*")
Out[5]:
   Amount  dmy*500 Club  dmy*Foreign Cinema  dmy*Liho Liho
0   289.0             0                   1              0
1   224.0             0                   0              1
2    80.5             1                   0              0
3    25.3             0                   1              0

3. Creating Dummy Indicator columns and dropping the first variable

Notice above how every row has exactly one "1" within it? This is because every value is accounted for with a True (1) indicator in its own column. However, what if a row was all 0s? This is also a way to identify one of your values. drop_first allows you to drop the column for your first value and identify it through all other columns being 0.

Notice how the "500 Club" column has been removed, and the row where "500 Club" used to be now has 0s in both "Foreign Cinema" and "Liho Liho".

It's a bit confusing. If you come up with a valid use case...@ me on Twitter.

In [6]:
pd.get_dummies(df, columns=['name'], drop_first=True)
Out[6]:
   Amount  name_Foreign Cinema  name_Liho Liho
0   289.0                    1               0
1   224.0                    0               1
2    80.5                    0               0
3    25.3                    1               0
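To convince yourself that no information is lost, you can recover the dropped value from the remaining columns: any row whose dummy columns are all 0 must have held the first category. A minimal sketch, assuming the same DataFrame as above (the `dummy_cols` and `is_dropped_value` names are just for illustration):

```python
import pandas as pd

df = pd.DataFrame([('Foreign Cinema', 289.0),
                   ('Liho Liho', 224.0),
                   ('500 Club', 80.5),
                   ('Foreign Cinema', 25.30)],
                  columns=('name', 'Amount'))

# dtype=int keeps the indicators as 0/1 integers
dropped = pd.get_dummies(df, columns=['name'], drop_first=True, dtype=int)

# The remaining dummy columns all carry the "name_" prefix
dummy_cols = [col for col in dropped.columns if col.startswith('name_')]

# Rows with no 1 in any remaining dummy column held the dropped value
is_dropped_value = dropped[dummy_cols].sum(axis=1) == 0

print(df.loc[is_dropped_value, 'name'].tolist())
```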
