Pandas Describe – pd.DataFrame.describe()

A data teacher once told me, “You need to get intimate with your data.” One of the best ways to do this is through pandas describe.

pandas.DataFrame.describe()
pandas.Series.describe()

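Pandas Describe does exactly what it sounds like: it describes your data. Called on a Series, describe returns a Series of descriptive statistics. Called on a DataFrame, it returns a DataFrame with one column of statistics for each column it describes. Depending on the data type, the result will tell you: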

  • The count of values
  • The number of unique values
  • The top (most frequent) value
  • The frequency of your top value
  • The mean, standard deviation, min and max values
  • The percentiles of your data: 25%, 50%, 75% by default

Pseudo Code: With your Series or DataFrame, return a summary that tells us what the distribution of values looks like.
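
To make the list above concrete, here is a minimal sketch on a tiny made-up DataFrame (the toy column names and values are assumptions for illustration, not the dataset used later). It shows that a numeric column gets the mean, std, min/max, and percentiles, while a text column gets count, unique, top, and freq.

```python
import pandas as pd

# Toy data invented for illustration only.
toy = pd.DataFrame({
    "height": [1.0, 2.0, 3.0, 4.0],
    "species": ["oak", "oak", "elm", "fir"],
})

# Numeric column: count, mean, std, min, percentiles, max.
numeric_summary = toy["height"].describe()

# Text column: count, unique, top (most frequent value), freq.
text_summary = toy["species"].describe()

print(numeric_summary)
print(text_summary)
```

Both calls return a Series whose index holds the statistic names, so you can pull out a single value with something like numeric_summary["mean"].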

Pandas Describe

In order to evaluate a dataset, you need to get a feel for your data. This means you need to get an intuitive sense of how your data is distributed and what spectrum of values you have. This is the first step to launching a successful data analysis.

Oftentimes, the process of ‘getting to know your data’ is called Exploratory Data Analysis (EDA).


Pandas Describe Parameters

The describe function is pretty standard, but you may want to play with a few items.

  • percentiles = By default, pandas will include the 25th, 50th, and 75th percentiles. However, you can ask for whichever ones you want. Simply pass a list of numbers between 0 and 1 (for example, 0.1 for the 10th percentile) to percentiles and pandas will do the rest.
  • include = You may want to ‘describe’ all of your columns, or you may just want to do the numeric columns. By default, pandas will only describe your numeric columns. Pass ‘all’ to include all columns, or pass a list of data types to include only those.
  • exclude = The inverse of include. You can tell pandas which column data types you would like to leave out. Simply pass a list of data types you would like to exclude here.
  • datetime_is_numeric = In the pandas version used here, datetimes are treated like objects by default, meaning pandas will not calculate things like ‘average time/date’. However, if you set datetime_is_numeric=True, pandas will calculate the mean, min, max, and percentiles of your datetimes. Note that this parameter was removed in pandas 2.0, where datetimes are always summarized like numbers.
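
As a quick warm-up before the real dataset, here is a small sketch of the percentiles, include, and exclude parameters. The toy DataFrame and its column names are made up for illustration. Using np.number as the data type is one common choice, not the only one.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration only.
toy = pd.DataFrame({
    "tree_id": [1, 2, 3, 4, 5],
    "diameter": [3.0, 7.0, 12.0, 5.0, 9.0],
    "species": ["oak", "elm", "oak", "fir", "oak"],
})

# percentiles: ask for the 10th and 90th instead of the default 25/50/75.
custom = toy.describe(percentiles=[0.1, 0.9])

# include: keep only the numeric columns (this matches the default here).
numeric_only = toy.describe(include=[np.number])

# exclude: drop the numeric columns, leaving only the text column.
non_numeric = toy.describe(exclude=[np.number])

print(custom)
print(numeric_only)
print(non_numeric)
```

The percentile rows are labeled as percentages, so 0.1 shows up as the row named 10%.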

Now the fun part: let’s take a look at a code sample.

In [1]:
import pandas as pd

Pandas Describe

Pandas Describe will do all of the hard work for you. Well...most of it. Calling .describe() on your dataset will produce a series of descriptive statistics that allow you to get to know your data better.

We will run through 3 examples:

  1. Default Describe - Let's see what comes out by default
  2. Including all columns via 'include'
  3. Treating datetimes like numbers via datetime_is_numeric=True

But first, let's use our San Francisco Tree dataset as our DataFrame. You can download this dataset at the GitHub link below. Watch out, it's 193K rows.

In [2]:
df = pd.read_csv('../data/Street_Tree_List.csv', parse_dates=['PlantDate'])
df = df[['TreeID', 'qSpecies', 'PlantDate', 'DBH']]
df.rename(mapper={'DBH':"tree_depth"}, axis=1, inplace=True)

df.head()
Out[2]:
   TreeID                                    qSpecies   PlantDate  tree_depth
0   46534                                  Tree(s) ::  2002-04-01         NaN
1  121399     Corymbia ficifolia :: Red Flowering Gum         NaT         NaN
2   85269  Arbutus 'Marina' :: Hybrid Strawberry Tree  2007-07-24         NaN
3  121227       Sequoia sempervirens :: Coast Redwood         NaT         NaN
4   45986                                  Tree(s) ::  2001-12-06         NaN

1. Default Describe - Let's see what comes out by default

By default, .describe() returns a set of descriptive statistics. Let's see what they are.

You can see that although we have 4 columns in our dataset, only 2 of them are returned by default. This is because .describe() only summarizes the numeric columns by default.

In [3]:
df.describe()
Out[3]:
              TreeID     tree_depth
count  193940.000000  151614.000000
mean   126960.027674       9.927665
std     79504.829131      29.318932
min         1.000000       0.000000
25%     52836.750000       3.000000
50%    121171.500000       7.000000
75%    203348.250000      12.000000
max    262465.000000    9999.000000

2. Including all columns via 'include'

If you wanted to include all columns in describe, then set include='all'.

You'll notice that pandas needs to put 'NaN' for descriptive statistics that do not apply to non-numeric columns like strings. For example: 'qSpecies' does not have a 25th percentile.

In [4]:
df.describe(include='all')
/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: FutureWarning: Treating datetime data as categorical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify `datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
  """Entry point for launching an IPython kernel.
Out[4]:
               TreeID    qSpecies            PlantDate     tree_depth
count   193940.000000      193940                68911  151614.000000
unique            NaN         571                 8945            NaN
top               NaN  Tree(s) ::  2000-06-23 00:00:00            NaN
freq              NaN       11734                  314            NaN
first             NaN         NaN  1955-09-19 00:00:00            NaN
last              NaN         NaN  2020-07-30 00:00:00            NaN
mean    126960.027674         NaN                  NaN       9.927665
std      79504.829131         NaN                  NaN      29.318932
min          1.000000         NaN                  NaN       0.000000
25%      52836.750000         NaN                  NaN       3.000000
50%     121171.500000         NaN                  NaN       7.000000
75%     203348.250000         NaN                  NaN      12.000000
max     262465.000000         NaN                  NaN    9999.000000
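
If the big table is hard to read, the same NaN pattern shows up on something small. This is a sketch with a made-up three-row DataFrame (the column names and values are assumptions, not the tree data). It checks which cells pandas fills with NaN when include='all' mixes numeric and text columns.

```python
import pandas as pd

# Toy data invented for illustration only.
toy = pd.DataFrame({
    "diameter": [3.0, 7.0, 12.0],
    "species": ["oak", "elm", "oak"],
})

summary = toy.describe(include="all")

# A text column has no mean, so pandas fills that cell with NaN.
mean_of_text = summary.loc["mean", "species"]

# A numeric column has no 'unique' statistic in describe, so that is NaN too.
unique_of_numbers = summary.loc["unique", "diameter"]

print(summary)
print(pd.isna(mean_of_text), pd.isna(unique_of_numbers))
```

One design note: because one table has to hold both kinds of statistics, every row that does not apply to a column ends up as NaN. Describing numeric and text columns separately avoids that.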

3. Treating datetimes like numbers via datetime_is_numeric=True

Finally, let's end by calling .describe() on a Series. We'll do it on our 'PlantDate' column and see the difference between treating dates like objects and treating them like numbers.

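Notice how the first example gives us unique, top, freq, first, and last, but no mean or percentiles. The second example drops those categorical statistics and gives us the mean, min, max, and percentiles instead.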

In [5]:
df['PlantDate'].describe()
/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:1: FutureWarning: Treating datetime data as categorical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify `datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
  """Entry point for launching an IPython kernel.
Out[5]:
count                   68911
unique                   8945
top       2000-06-23 00:00:00
freq                      314
first     1955-09-19 00:00:00
last      2020-07-30 00:00:00
Name: PlantDate, dtype: object
In [6]:
df['PlantDate'].describe(datetime_is_numeric=True)
Out[6]:
count                            68911
mean     2000-12-02 22:19:59.122334464
min                1955-09-19 00:00:00
25%                1995-01-30 00:00:00
50%                2001-07-24 00:00:00
75%                2008-11-21 00:00:00
max                2020-07-30 00:00:00
Name: PlantDate, dtype: object
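
Since datetime_is_numeric only exists in some pandas versions, here is a hedged sketch of a small wrapper that passes it only when the installed pandas accepts it. The helper name describe_dates and the toy dates are hypothetical, and the sketch assumes pandas 1.1 or later.

```python
import inspect

import pandas as pd

# Toy dates invented for illustration only.
dates = pd.Series(
    pd.to_datetime(["2001-01-01", "2001-01-03", "2001-01-05"]),
    name="PlantDate",
)


def describe_dates(series):
    """Hypothetical helper: summarize datetimes like numbers where possible."""
    params = inspect.signature(pd.Series.describe).parameters
    if "datetime_is_numeric" in params:
        # Versions that still have the flag: opt in explicitly.
        return series.describe(datetime_is_numeric=True)
    # Versions without the flag (assumed to treat datetimes as numeric).
    return series.describe()


summary = describe_dates(dates)
print(summary)
```

Either way, the result should contain mean, min, max, and percentiles rather than top and freq.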

Link to code above


Official Documentation