Pandas Diff – Difference Your Data – pd.DataFrame.diff()

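Pandas Diff will difference your data. This means calculating the change in each value relative to the value a set number of periods earlier, along your rows or columns. Put simply, pandas diff subtracts an earlier cell's value from the current cell's value, within the same column (when differencing down rows) or within the same row (when differencing across columns).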

Diff is very helpful when calculating changes over time. For example, if you have temperature readings per day, calculating the difference will tell you how the temperatures have changed Day-Over-Day.
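As a quick illustration of the Day-Over-Day idea, here is a minimal sketch on a single Series. The temperature values and the variable names (temps, day_over_day) are made up for this example.

```python
import pandas as pd

# Hypothetical daily temperature readings (illustrative values only)
temps = pd.Series([70, 72, 68, 75], index=['Mon', 'Tues', 'Wed', 'Thurs'])

# Each value minus the value from the previous day.
# Monday has no previous day, so its result is NaN.
day_over_day = temps.diff()
print(day_over_day)
```

Tuesday's result is 72 - 70 = 2.0, Wednesday's is 68 - 72 = -4.0, and Thursday's is 75 - 68 = 7.0.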

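You can also think of this as a discrete version of the derivative: the change in the data from one observation to the next. Note that diff returns the raw change per period; it does not divide by the time elapsed. This is also helpful when working with time series data and calculating Week-Over-Week changes.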
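For Week-Over-Week changes on daily data, one approach is to difference by 7 periods. The sketch below assumes one observation per day with no gaps; the dates and values are invented for illustration.

```python
import pandas as pd

# Two weeks of hypothetical daily values that rise by 1 each day
dates = pd.date_range('2024-01-01', periods=14, freq='D')
daily = pd.Series(range(100, 114), index=dates)

# Compare each day with the same weekday one week (7 rows) earlier
week_over_week = daily.diff(periods=7)
print(week_over_week)
```

The first 7 rows are NaN because they have no observation a week earlier. If your data has missing days, 7 rows back is no longer 7 days back, so this row-based approach would need adjusting.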

There is 1 core concept you'll need to grasp:

  1. Period = How many observations back do you want to difference your data by? Most of the time this will be a 1 period diff, but you can select as many as you want.

The call looks like: pd.DataFrame.diff(periods=1)

Pseudo code: For a given DataFrame or Series, find the difference (or rate of change) between rows/columns.
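One way to make the pseudo code concrete: differencing by rows gives the same result as subtracting a shifted copy of the data from itself. This is a sketch for intuition, not a description of how pandas implements diff internally. The frame df_demo and its values are made up.

```python
import pandas as pd

# Small illustrative frame (values are arbitrary)
df_demo = pd.DataFrame({'a': [1.0, 4.0, 9.0, 16.0],
                        'b': [10.0, 8.0, 5.0, 1.0]})

# Manual version: each row minus the row one position earlier
manual = df_demo - df_demo.shift(1)

# Built-in version
builtin = df_demo.diff(periods=1)

# Raises an AssertionError if the two results differ
pd.testing.assert_frame_equal(manual, builtin)
print(builtin)
```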

Figure: Pandas Diff - Difference your DataFrame data by rows or columns. Set the periods to the number of rows you'd like to difference.

With the default periods=1, the first row of your resulting diff DataFrame will be NaN. This is because there is no earlier observation to subtract from it. If you had periods=2, then the first 2 rows would be NaN.
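To check the relationship between periods and the number of leading NaNs, a small sketch like the following can help. The Series s and its values are arbitrary.

```python
import pandas as pd

# Arbitrary illustrative values
s = pd.Series([5, 8, 12, 17, 23])

# Count the missing values produced for a few different period settings
for p in (1, 2, 3):
    n_missing = s.diff(periods=p).isna().sum()
    print(f"periods={p}: {n_missing} NaN value(s)")
```

Each setting produces as many NaN values as periods, since that many leading rows have nothing earlier to be compared with.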

Diff Parameters

  • periods (Default=1): You can select how many periods you’d like to difference by via the periods parameter. An easier way to think about this is, ‘how many rows back would you like to difference each cell against?’ With periods=1, we take the difference between each cell and the neighboring cell above it.
  • axis (Default=0): We usually talk about differencing down rows (axis=0), but pandas also allows you to difference across columns (axis=1).

Let’s take a look at a code sample


In [3]:
import pandas as pd
import numpy as np


Pandas Diff will return the difference between rows or columns on your DataFrame. You have the option to select how many rows/columns you'd like to difference via the 'periods' parameter.

We will run through 3 examples:

  1. Default differencing
  2. Two Period Differencing
  3. Column Differencing

First, let's create our DataFrame

In [22]:
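# Seed NumPy's random generator so the example is reproducible
np.random.seed(seed=42)

# Seven days of simulated temperatures (mean 70, std 10) for three cities
df = pd.DataFrame(
    data=np.random.normal(loc=70, scale=10, size=(7, 3)),
    columns=('San Francisco', 'San Diego', 'Los Angeles'),
    index=['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'],
)
df = df.round()
df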
Out[22]:
       San Francisco  San Diego  Los Angeles
Mon             75.0       69.0         76.0
Tues            85.0       68.0         68.0
Wed             86.0       78.0         65.0
Thurs           75.0       65.0         65.0
Fri             72.0       51.0         53.0
Sat             64.0       60.0         73.0
Sun             61.0       56.0         85.0

1. Default differencing

By default, Pandas will difference by 1 row. Let's see how this looks for our cities.

Notice how the first row in the result is NaN. This is because the first row in the original DataFrame does not have a previous row to difference against, so pandas returns NaN. You'll always have as many NaN rows as periods differenced (as long as your data has at least that many rows).

In [23]:
df.diff()
Out[23]:
       San Francisco  San Diego  Los Angeles
Mon              NaN        NaN          NaN
Tues            10.0       -1.0         -8.0
Wed              1.0       10.0         -3.0
Thurs          -11.0      -13.0          0.0
Fri             -3.0      -14.0        -12.0
Sat             -8.0        9.0         20.0
Sun             -3.0       -4.0         12.0

2. Two Period Differencing

Say instead of differencing your data by 1 period, you wanted to do it by 2 periods. To do this, set your periods=2.

In [24]:
df.diff(periods=2)
Out[24]:
       San Francisco  San Diego  Los Angeles
Mon              NaN        NaN          NaN
Tues             NaN        NaN          NaN
Wed             11.0        9.0        -11.0
Thurs          -10.0       -3.0         -3.0
Fri            -14.0      -27.0        -12.0
Sat            -11.0       -5.0          8.0
Sun            -11.0        5.0         32.0
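The periods parameter also accepts negative values, in which case each row is compared with a later row instead of an earlier one, and the NaNs land at the end. The sketch below uses a small stand-alone frame (temps_demo, a hypothetical name) with the first three San Francisco values from above.

```python
import pandas as pd

# First three San Francisco readings from the example above
temps_demo = pd.DataFrame({'San Francisco': [75.0, 85.0, 86.0]},
                          index=['Mon', 'Tues', 'Wed'])

# periods=-1: each row minus the row that FOLLOWS it
forward = temps_demo.diff(periods=-1)
print(forward)
```

Here Mon is 75 - 85 = -10.0, Tues is 85 - 86 = -1.0, and Wed is NaN because no row follows it.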

3. Column Differencing

Did you know you can also do column differencing? This would be helpful if your columns represent dates or other items you'd like to compare.

To do this, set axis=1

In [25]:
df.diff(periods=1, axis=1)
Out[25]:
       San Francisco  San Diego  Los Angeles
Mon              NaN       -6.0          7.0
Tues             NaN      -17.0          0.0
Wed              NaN       -8.0        -13.0
Thurs            NaN      -10.0          0.0
Fri              NaN      -21.0          2.0
Sat              NaN       -4.0         13.0
Sun              NaN       -5.0         29.0
