When you're doing machine learning, you'll work with algorithms that cannot process categorical variables directly. In this case, you need to turn your column of labels (Ex: ['cat', 'dog', 'bird', 'cat']) into separate columns of 0s and 1s. In pandas, this is called getting dummies.
pd.get_dummies() will turn your categorical column (column of labels) into indicator columns (columns of 0s and 1s).
This function is heavily used when preparing data for machine learning algorithms. For instance, random forest doesn't do great with columns that have labels. It's best to turn these into dummy indicator columns.
Say we are creating dummy columns from a "Name" column. A new column is created for every distinct value in the original "Name" column.
Pseudo code: For each distinct value in your original categorical column, create a new column with an indicator (0 or 1).
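The pseudo code above can be sketched by hand and compared with the built-in function. This is only an illustration: the `people` DataFrame and its values are made up, and `dtype=int` is passed so the result is 0/1 rather than True/False on newer pandas versions.

```python
import pandas as pd

# Hypothetical example data with a "Name" column of labels.
people = pd.DataFrame({"Name": ["Bob", "Fred", "Katie", "Bob"]})

# For each distinct value, build a 0/1 column marking the rows that hold it.
manual = pd.DataFrame(index=people.index)
for value in sorted(people["Name"].unique()):
    manual[f"Name_{value}"] = (people["Name"] == value).astype(int)

# The built-in function does the same job in one call.
built_in = pd.get_dummies(people["Name"], prefix="Name", dtype=int)

print(manual)
print(built_in)
```

Both printouts should show the same three indicator columns, one per distinct name.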
Pandas Get Dummies
Be careful: if your categorical column has too many distinct values in it, the number of new dummy columns will quickly explode. Before you run pd.get_dummies(), make sure to run pd.Series.nunique() on the column to see how many new columns you'll create.
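Here is a rough sketch of that check. The `orders` DataFrame, its column names, and the `MAX_DUMMIES` threshold are all made up for illustration; pick a limit that suits your own data.

```python
import pandas as pd

# Hypothetical data: one ID-like column with many distinct values, one small one.
orders = pd.DataFrame({
    "customer_id": [f"cust_{i}" for i in range(1000)],
    "channel": ["web", "store"] * 500,
})

# nunique() reports how many dummy columns each column would produce.
print(orders["customer_id"].nunique())  # 1000
print(orders["channel"].nunique())      # 2

# Illustrative threshold only: encode just the columns that stay under it.
MAX_DUMMIES = 50
safe_columns = [c for c in orders.columns if orders[c].nunique() <= MAX_DUMMIES]
dummies = pd.get_dummies(orders[safe_columns], columns=safe_columns)
print(dummies.shape)  # (1000, 2)
```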
Get Dummies Parameters
- data: The data that you want to create dummy indicator columns from. This will be your DataFrame or Series
- prefix (Default: None): You'd use this parameter if you wanted to add a prefix (string at the beginning) to your new column names. This can be helpful for identifying which columns are dummy afterward.
- prefix_sep (Default: "_"): When you want to get fancy, you could specify what you want between your prefix and column names. The default is "_". I doubt you'll ever change this. If you do, please tweet about it and @ DataIndependent.
- dummy_na (Default: False): Used if you want to create a dummy column for your NA values.
- columns (Default: None): Defines which columns from your DataFrame you'd like to get dummies for. By default it's every column of object or category type (no ints, floats, etc.)
- sparse (Default: False): Sometimes your dummy columns are sparse. This means there are a TON of 0s because you have a high number of distinct values in your original column. sparse=True stores the dummy columns in a sparse format, which cuts memory use in this case.
- drop_first (Default: False): Advanced option: only use this if you know what you're doing. Dropping the dummy column for your first categorical value is possible because if every other dummy column is 0, the row must hold that first value. What you remove in redundancy, you gain in confusion.
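A few of these parameters are easier to see in action. The sketch below uses a made-up `pets` DataFrame, and passes `dtype=int` so the indicators are 0/1 on every pandas version.

```python
import pandas as pd

# Hypothetical data with one missing label and one numeric column.
pets = pd.DataFrame({
    "animal": ["cat", "dog", None, "cat"],
    "age": [3, 5, 2, 7],
})

# With columns left at its default, only the label column is encoded; "age" is left alone.
default = pd.get_dummies(pets, dtype=int)
print(list(default.columns))  # ['age', 'animal_cat', 'animal_dog']

# dummy_na=True adds one extra indicator column for the missing value.
with_na = pd.get_dummies(pets, dummy_na=True, dtype=int)
print(with_na.shape[1] - default.shape[1])  # 1

# sparse=True keeps the same columns but stores the dummies in a sparse format.
sparse = pd.get_dummies(pets, sparse=True, dtype=int)
print(sparse.dtypes)
```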
Here’s a Jupyter notebook showing how to get dummies in Pandas
import pandas as pd
Pandas Get Dummies
Pandas Get Dummies will turn your categorical variables into many dummy indicator variables. This means you'll go from a Series of labels (['Bob', 'Fred', 'Katie']) to a list of indicators ([0,1,0,0]).
Let's run through 3 examples:
- Creating Dummy Indicator columns
- Creating Dummy Indicator columns with prefix
- Creating Dummy Indicator columns and dropping the first variable
First, let's create a DataFrame
df = pd.DataFrame([('Foreign Cinema', 289.0),
                   ('Liho Liho', 224.0),
                   ('500 Club', 80.5),
                   ('Foreign Cinema', 25.30)],
                  columns=('name', 'Amount'))
df
1. Creating Dummy Indicator columns
To create dummy columns, I need to tell pandas which DataFrame I want to use, and which columns I want to create dummies on. Here I want to create dummies on the 'name' column.
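The call for this step is presumably the same one used in the later examples, minus the extra parameters. The sketch below rebuilds the DataFrame so it runs on its own; `dtype=int` is an addition here so the indicators print as 0/1, since newer pandas versions return True/False by default.

```python
import pandas as pd

# Same restaurant data as the DataFrame created above.
df = pd.DataFrame(
    [('Foreign Cinema', 289.0), ('Liho Liho', 224.0),
     ('500 Club', 80.5), ('Foreign Cinema', 25.30)],
    columns=['name', 'Amount'],
)

# Tell pandas which DataFrame to use and which column to create dummies on.
dummies = pd.get_dummies(df, columns=['name'], dtype=int)
print(dummies)
```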
Notice how there are 3 new columns, one for every distinct value within our old 'name' column. Each of these new columns holds 1s and 0s showing whether the original row had that column's value.
| Amount | name_500 Club | name_Foreign Cinema | name_Liho Liho |
2. Creating Dummy Indicator columns with prefix
See how above all of my new columns start with "name_"? Well I don't like it. I want to switch the prefix to something else. You can do this by specifying the "prefix" parameter.
pd.get_dummies(df, columns=['name'], prefix="dmy")
| Amount | dmy_500 Club | dmy_Foreign Cinema | dmy_Liho Liho |
You know what else I don't like? The _ that is in the middle of my prefix and column names. I'll switch it to an asterisk ("*") by specifying prefix_sep.
pd.get_dummies(df, columns=['name'], prefix="dmy", prefix_sep="*")
| Amount | dmy*500 Club | dmy*Foreign Cinema | dmy*Liho Liho |
3. Creating Dummy Indicator columns and dropping the first variable
Notice above how every row has exactly one "1" across the new dummy columns? This is because every value is accounted for with a True (1) indicator. However, what if a row was all 0s? That is also a way to identify one of your values. drop_first allows you to drop the column for your first value and identify it through all other columns being 0.
Notice how the "500 Club" column has been removed, and the row where "500 Club" used to be now has 0s in both "Foreign Cinema" and "Liho Liho".
It's a bit confusing. If you come up with a valid use case...@ me on Twitter.
pd.get_dummies(df, columns=['name'], drop_first=True)
| Amount | name_Foreign Cinema | name_Liho Liho |
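To check that nothing is lost with drop_first, here is a small sketch that finds the rows where every remaining dummy is 0 and confirms they belong to the dropped value. The DataFrame is rebuilt so the snippet runs on its own, and `dtype=int` is added so the indicators are 0/1 on every pandas version.

```python
import pandas as pd

# Same restaurant data as in the examples above.
df = pd.DataFrame(
    [('Foreign Cinema', 289.0), ('Liho Liho', 224.0),
     ('500 Club', 80.5), ('Foreign Cinema', 25.30)],
    columns=['name', 'Amount'],
)

reduced = pd.get_dummies(df, columns=['name'], drop_first=True, dtype=int)

# Rows where every remaining dummy column is 0 belong to the dropped value.
dummy_cols = [c for c in reduced.columns if c.startswith('name_')]
is_dropped = reduced[dummy_cols].sum(axis=1) == 0
print(df.loc[is_dropped, 'name'].tolist())  # ['500 Club']
```

One place this commonly comes up is linear models with an intercept, where keeping every dummy column makes the columns perfectly collinear.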
Check out more Pandas functions on our Pandas Page