Pandas Resample is an amazing function that does more than you think. This powerful tool will help you transform and clean up your time series data.
Pandas Resample will convert your time series data into different frequencies. Think of it like a group by function, but for time series data.
Example: Imagine you have a data point every 5 minutes from 10am to 11am. What if you wanted to convert it to one data point every 20 minutes? Or every minute?
For a full range of frequencies to convert with, check out the official pandas table.
Pandas DataFrame.resample() takes data indexed by a DatetimeIndex (or another datetime-like index) and, once you apply an aggregation or fill method, returns data that has been converted to a new time frequency.
Pseudo Code: Convert a DataFrame time range into a different time frequency.
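As a rough sketch of the 5-minute example above, the snippet below converts one hour of 5-minute data to 20-minute and 1-minute frequencies. The column name `value`, the date, and the choice of `mean()` and `ffill()` are illustrative assumptions rather than part of the original example. It uses the 'min' frequency alias, which newer pandas versions prefer over 'T'.

```python
import pandas as pd

# Hypothetical data: one point every 5 minutes from 10:00 to 11:00 (13 points).
index = pd.date_range('2020-02-01 10:00', '2020-02-01 11:00', freq='5min')
five_min = pd.DataFrame({'value': range(len(index))}, index=index)

# Down sample: group the rows into 20-minute bins and average each bin.
twenty_min = five_min.resample('20min').mean()

# Up sample: create one row per minute and carry the last known value forward.
one_min = five_min.resample('1min').ffill()

print(twenty_min)
print(len(one_min))
```

Down sampling always needs an aggregation (here `mean()`), while up sampling needs a rule for the newly created empty rows (here `ffill()`).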
.resample() is one of those functions that can be intimidating when you first look at the documentation. We suggest mastering the rule, closed, label, and convention parameters before anything else.
- Up Sampling – Going from a coarser time grain to a finer one. Example: Going from yearly data to monthly data. It’s “up” because you’re going “up” in the number of bins you have.
- Down Sampling – Going from a finer time grain to a coarser one. Example: Months to Years. It’s “down” because you end up with fewer bins.
Resample Main Parameters
- rule – How you want to resample your data. Do you want to convert your time series into minute groups? 5 minute groups? You pick! Check out the official pandas documentation for frequencies to resample.
- axis (Default: 0) – Which axis do you want to resample along? This will almost always be rows. But set axis=1 if columns are your time index.
- closed (Default: None) – Do you want to include the data on the edge of your time sample? This sets which side of the bin interval is closed, meaning which edge is included in the bin. With closed='left', a point sitting exactly on the left edge belongs to the bin, and a point on the right edge falls into the next bin. Check out the samples below.
- label (Default: None) – How do you want your new bins to be labeled? By definition, a bin has two sides, the start (label=left) and the end (label=right).
- convention (Default: start) – Where to put your data points when up sampling period data. Say you’re going from years to months. Do you want to put your yearly data points on the last month? Or the first month?
- Other parameters – There are a few other parameters, but in our experience, they don’t get used often. Feel free to check them out.
Now the fun part, let’s take a look at a code sample
import pandas as pd
Resample is an amazing function that will convert your time series data into a different frequency (or time intervals). This is most often used when converting your granular data into larger buckets.
Running through examples:
- Resampling minute data to 5 minute data
- Resampling minute data to 5 minute data - changing the "closed" side
- Resampling minute data to 5 minute data - changing the "label" side
- Up resampling quarterly data to monthly data with convention: start/end
- Bonus: Combine closed/label parameters together
First create a DataFrame with a Datetime Index. That's a fancy way of saying that Pandas recognizes the index as time points.
# Here I'm first creating a date range, then creating a DataFrame with the date range as the index.
index = pd.date_range('2/1/2020', periods=9, freq='T')
df = pd.DataFrame(data=range(9), index=index, columns=['count'])
df
1. Resampling minute data to 5 minute data
First off, we are going to down sample our data from 1 minute frequency to 5 minute frequency. It's called 'down sampling' because you're going down in the number of samples.
You need to ask yourself:
- What new frequency do I want?
- What do I want to do with the data points in the old frequency? What aggregate function do I want to apply? This is very similar to .groupby() agg functions.
Here I'm setting the frequency to "5T", which means 5 minutes (newer pandas versions spell this "5min"). Then I'm taking the sum of the data points. Notice how:
- The labels of the new frequency start at 00:00:00. This is known as the 'left' side of the bin.
- The data point under 00:05:00 is not included in the first bucket. This means the new bin is 'closed' on the left: the left edge (00:00:00) is included in the bucket and the right edge (00:05:00) is not. In interval notation, the first bucket is [00:00:00, 00:05:00).
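The call described above might look like the following sketch. It rebuilds the 9-row DataFrame so that it runs on its own, and it spells the frequency as '5min', which is equivalent to the "5T" used in the text.

```python
import pandas as pd

# Rebuild the example data: 9 rows at 1-minute frequency, counts 0 through 8.
index = pd.date_range('2/1/2020', periods=9, freq='min')
df = pd.DataFrame(data=range(9), index=index, columns=['count'])

# Down sample to 5-minute bins and sum the counts in each bin.
five_min_sum = df.resample('5min').sum()
print(five_min_sum)
# The 00:00 bin holds 0+1+2+3+4 = 10 and the 00:05 bin holds 5+6+7+8 = 26.
```

Other aggregations such as `.mean()` or `.max()` can be swapped in for `.sum()` in the same way.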
2. Resampling minute data to 5 minute data - changing the "closed" side
Now let's change the 'closed' side. Say you wanted to include the 00:05:00 data point within the bucket that starts at 00:00:00. By default the closed side is the left for most frequencies (month-end, quarter-end, year-end, and weekly frequencies default to the right).
Here we set closed='right'. Notice that we get another label, 23:55:00. This is because the old 00:00:00 data point needed somewhere to go. It used to be included within the 00:00:00 bucket when closed='left', but now that we chose closed='right' the 0 is in its own bucket.
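A minimal sketch of that call, assuming the same 9-row DataFrame as before (rebuilt here so the snippet is self-contained, with 'min' standing in for 'T'):

```python
import pandas as pd

# Rebuild the example data: 9 rows at 1-minute frequency, counts 0 through 8.
index = pd.date_range('2/1/2020', periods=9, freq='min')
df = pd.DataFrame(data=range(9), index=index, columns=['count'])

# Each bin now includes its right edge and excludes its left edge.
right_closed = df.resample('5min', closed='right').sum()
print(right_closed)
# Bins: (23:55, 00:00] -> 0, (00:00, 00:05] -> 1+2+3+4+5 = 15, (00:05, 00:10] -> 6+7+8 = 21.
# The labels are still the left edges: 23:55, 00:00, 00:05.
```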
3. Resampling minute data to 5 minute data - changing the "label" side
See how after we down sampled our original data frame, the resulting index labels were on the left side of the bin? This is because the label defaults to the left. However, we can change this to the right.
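As a sketch, changing only the label (and leaving closed at its default of left) could look like this. The DataFrame is rebuilt so the snippet stands alone, and 'min' is used in place of 'T'.

```python
import pandas as pd

# Rebuild the example data: 9 rows at 1-minute frequency, counts 0 through 8.
index = pd.date_range('2/1/2020', periods=9, freq='min')
df = pd.DataFrame(data=range(9), index=index, columns=['count'])

# Same bins as the default call, but each bin is named after its right edge.
right_labeled = df.resample('5min', label='right').sum()
print(right_labeled)
# The sums are unchanged (10 and 26); only the labels move to 00:05 and 00:10.
```

Note that label only renames the bins. It does not change which rows fall into them.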
4. Up resampling quarterly data to monthly data with convention: start/end
So far we have down sampled our data. But what about up sampling? No problem, but we need to choose where we want to put our data points. By definition, since we are 'zooming in' on our data, we need to tell pandas where to put the previous data points.
Let's create another DataFrame of quarters with a period range. Think of period ranges as representing intervals, while date ranges represent specific points in time.
# Here I'm first creating a period range, then creating a DataFrame with the period range as the index.
index = pd.period_range('1/1/2020', periods=3, freq='Q')
df = pd.DataFrame(data=range(1, 4), index=index, columns=['count'])
df
Now say I want to turn this quarterly data into monthly data. All we need to do is call .resample() and pass 'M' for months!
Notice how the data below is placed at the start of the period
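A sketch of the start convention, assuming `.asfreq()` is used to leave the new months empty rather than fill them. The quarterly DataFrame is rebuilt so the snippet runs on its own. Depending on your pandas version, resampling a PeriodIndex may emit a deprecation warning.

```python
import pandas as pd

# Rebuild the quarterly data: 2020Q1, 2020Q2, 2020Q3 with counts 1, 2, 3.
index = pd.period_range('1/1/2020', periods=3, freq='Q')
df = pd.DataFrame(data=range(1, 4), index=index, columns=['count'])

# Up sample to months, placing each quarterly value on the quarter's first month.
monthly_start = df.resample('M', convention='start').asfreq()
print(monthly_start)
# Values land on 2020-01, 2020-04 and 2020-07; the other months are NaN.
```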
and here the data is placed at the end of the period
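The end convention can be sketched the same way, again assuming `.asfreq()` and a rebuilt quarterly DataFrame. Some pandas versions may emit a deprecation warning for resampling a PeriodIndex.

```python
import pandas as pd

# Rebuild the quarterly data: 2020Q1, 2020Q2, 2020Q3 with counts 1, 2, 3.
index = pd.period_range('1/1/2020', periods=3, freq='Q')
df = pd.DataFrame(data=range(1, 4), index=index, columns=['count'])

# Up sample to months, placing each quarterly value on the quarter's last month.
monthly_end = df.resample('M', convention='end').asfreq()
print(monthly_end)
# Values land on 2020-03, 2020-06 and 2020-09; the months in between are NaN.
```

To fill the empty months instead of leaving NaN, `.asfreq()` could be replaced with a fill method such as `.ffill()`.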
5. Bonus: Combine closed/label parameters together
Here I'm going to take a 3 minute time sample and change it to a 7 minute time sample, with both label and closed set to the right side of the bins.
index = pd.date_range('2/1/2020', periods=9, freq='3T')
df = pd.DataFrame(data=range(9), index=index, columns=['count'])
df
df.resample('7T', label='right', closed='right').sum()
Check out how our data is now in 7 minute intervals, with each bin including its right edge and each label taken from the right edge of its bin. Nice.