Pandas Query (Filter Data) – df.query()

Pandas Query, the way to filter your data you haven’t heard of. Well I guess you have because you’re here.

Pandas DataFrame.query() will filter the rows of your DataFrame with a True/False (boolean) expression. This is super helpful when filtering your data.

pandas.DataFrame.query('your_query_expression')

Pseudo Code: Evaluate the expression give, return only the true rows.

Pandas Query

.query() is simple, but the magic lies in how creative you get with your expression. Check out a few examples below.

  • expr – The string query that pandas will evaluate. Awesomely, you can also use variables within your string by starting them with ‘@’. Ex: ‘@myvariable’
  • inplace (Default: False) – Whether or not you want the DataFrame to be modified directly, or returned to you.
  • **kwargs – You can go crazy by adding a ton of other parameters from pandas.eval(). I won’t list them out here, explore for yourself!

Why isn’t this function more popular? I’m guessing it is because Pandas education revolves around handing Series and doing series manipulation. Filtering the your dataframe the traditional way (by putting a boolean series in your df) forces students to manually munge their data.

Here is a traditional filter for reference

filter_1 = df['column_1'] > df['column_2']
df[filter_1] # Your filtered data

Now a code sample for .query()

In [1]:
import pandas as pd
import numpy as np
np.random.seed(seed=42)

Pandas Query

Pandas Query is the other way to filter data, the one that you don't usually use but you might want to consider.

We will run through 3 examples:

  1. Simple filter for a column
  2. Filtering columns based off of each other
  3. Using an environment variable to filter

First, let's create our DataFrame

In [2]:
df = pd.DataFrame.from_dict({"Name": ['Liho Liho', 'Tompkins', 'The Square', 'Chambers'],
                             "Mon": np.random.randint(10,200, size=(1,4))[0],
                             "Tues": np.random.randint(12,200, size=(1,4))[0],
                             "Wed": np.random.randint(12,200, size=(1,4))[0],
                             "Thurs": np.random.randint(12,200, size=(1,4))[0]}, orient='columns')
df
Out[2]:
NameMonTuesWedThurs
0Liho Liho112118133111
1Tompkins1898386115
2The Square1023299163
3Chambers24114128142

1. Simple filter for a column

To query (filter) your data, all you need to do is pass a string with a conditional expression. This is super similar to writing a forumla in an excel cell.

Notice here I'm querying my data for the rows where the "Mon" column is greater then the 90. There are only two rows that satisfy this filter, and they are returned.

In [3]:
df.query('Mon > 90')
Out[3]:
NameMonTuesWedThurs
0Liho Liho112118133111
1Tompkins1898386115
2The Square1023299163

Notice that all the other rows which don't satisfies this query are not returned

2. Filtering columns based off of each other

You can also filter two columns based off of each other. In this case I'm querying for data where the Monday column is greater than the Tuesday column.

In [4]:
df.query('Mon > Tues')
Out[4]:
NameMonTuesWedThurs
1Tompkins1898386115
2The Square1023299163

You can also achieve this result via the traditional filtering method.

In [5]:
filter_1 = df['Mon'] > df['Tues']
df[filter_1]
Out[5]:
NameMonTuesWedThurs
1Tompkins1898386115
2The Square1023299163

3. Using an environment variable to filter

If needed you can also use an environment variable to filter your data. Make sure to put an "@" sign in front of your variable within the string.

In [6]:
dinner_limit=120
df.query('Thurs > @dinner_limit')
Out[6]:
NameMonTuesWedThurs
2The Square1023299163
3Chambers24114128142

Here I set a variable, dinner_limit to 85. I used this variable within my query by calling "@dinner_limit" and evaluating the 'Thurs' column against it.

Link to code above

Check out more Pandas functions on our Pandas Page

Official Documentation