Exploratory Data Analysis – Know Your Data

Exploratory Data Analysis (EDA) is the act of getting intimate with your data.

This means you get a feeling for your data. You don’t simply know its characteristics (# rows, # columns, distributions, etc.)…you actually feel it.

It may sound a bit corny, but after working with data for long enough, you gain the ability to understand a dataset on an intuitive level.

EDA is the process of initial exploration. Imagine you are in a deep, dark cave and all you have is a flashlight. You illuminate sections of the walls and the ground, and you head down passages. EDA is the same process for exploring data.

Whenever we do Exploratory Data Analysis, you can bet we are analyzing:

  • # rows, # columns
  • Column cardinality (how many unique elements are there in each column?)
  • Correlations (which columns relate to each other?)
  • What are the min/max of each column?
  • What do outliers (if any) say about the data?
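
In pandas, each of these checks is roughly a one-liner. Here's a minimal sketch of what that looks like, using a small made-up DataFrame (the names here are just for illustration, not part of the tree dataset we'll use below):

import pandas as pd

# A tiny made-up DataFrame just to illustrate the checks above
df = pd.DataFrame({
    "species": ["oak", "oak", "maple", "pine"],
    "height_m": [12.0, 9.5, 30.1, 18.2],
    "diameter_cm": [40, 31, 95, 60],
})

print(df.shape)                              # (# rows, # columns)
print(df.nunique())                          # column cardinality
print(df.corr(numeric_only=True))            # correlations between numeric columns
print(df.min(numeric_only=True))             # per-column minimums
print(df.max(numeric_only=True))             # per-column maximums
df["height_m"].plot.box()                    # quick look at spread and outliers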

There isn’t a single right answer when doing EDA. The goal is to find a launching point that leads to more analysis. You’ll know you’re done when you are sufficiently inspired to take the next step in your analysis.

Let’s take a look at a Python EDA sample.


Python Exploratory Data Analysis

Let's get to know our dataset a bit more. We will perform some basic analyses that describe our data. Let's start with:

  • # Rows, # Columns
  • Column cardinality (how many unique elements are there in each column?)
  • What are the min/max of each column?
  • What do outliers (if any) say about the data?
In [14]:
import pandas as pd
df = pd.read_csv('../data/Street_Tree_List.csv', parse_dates=['PlantDate'])

First, let's look at a few sample rows

In [2]:
df.head()
Out[2]:
   TreeID  qLegalStatus    qSpecies                                     qAddress           SiteOrder  qSiteInfo                     PlantType  qCaretaker  qCareAssistant  PlantDate               ...  Neighborhoods (old)
0   46534  Permitted Site  Tree(s) ::                                   73 Summer St       7.0        Sidewalk: Curb side : Cutout  Tree       Private     NaN             04/01/2002 12:00:00 AM  ...  NaN
1  121399  DPW Maintained  Corymbia ficifolia :: Red Flowering Gum      349X Cargo Way     1.0        Sidewalk: Curb side : Cutout  Tree       DPW         NaN             NaN                     ...  NaN
2   85269  Permitted Site  Arbutus 'Marina' :: Hybrid Strawberry Tree   1000 Edinburgh St  3.0        Sidewalk: Curb side : Cutout  Tree       Private     NaN             07/24/2007 12:00:00 AM  ...  NaN
3  121227  DPW Maintained  Sequoia sempervirens :: Coast Redwood        4299x 17th St      3.0        Front Yard : Yard             Tree       DPW         NaN             NaN                     ...  NaN
4   45986  Permitted Site  Tree(s) ::                                   NaN                226.0      Sidewalk: Curb side : Cutout  Tree       Private     NaN             12/06/2001 12:00:00 AM  ...  NaN

5 rows × 23 columns
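
head() always shows the same first five rows. If you want a less biased peek (the top of a file is often sorted or otherwise unrepresentative), df.sample() pulls random rows instead. An optional extra, not a cell from the original notebook:

df.sample(5, random_state=42)   # five random rows; random_state makes the draw repeatable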

Then let's get the count of rows and columns

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193940 entries, 0 to 193939
Data columns (total 23 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   TreeID                     193940 non-null  int64  
 1   qLegalStatus               193883 non-null  object 
 2   qSpecies                   193940 non-null  object 
 3   qAddress                   192450 non-null  object 
 4   SiteOrder                  192230 non-null  float64
 5   qSiteInfo                  193940 non-null  object 
 6   PlantType                  193940 non-null  object 
 7   qCaretaker                 193940 non-null  object 
 8   qCareAssistant             24478 non-null   object 
 9   PlantDate                  68911 non-null   object 
 10  DBH                        151614 non-null  float64
 11  PlotSize                   143755 non-null  object 
 12  PermitNotes                52455 non-null   object 
 13  XCoord                     191066 non-null  float64
 14  YCoord                     191066 non-null  float64
 15  Latitude                   191066 non-null  float64
 16  Longitude                  191066 non-null  float64
 17  Location                   191066 non-null  object 
 18  Fire Prevention Districts  190815 non-null  float64
 19  Police Districts           190865 non-null  float64
 20  Supervisor Districts       190929 non-null  float64
 21  Zip Codes                  190923 non-null  float64
 22  Neighborhoods (old)        190925 non-null  float64
dtypes: float64(11), int64(1), object(11)
memory usage: 34.0+ MB
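
df.info() gives us the row and column counts along with dtypes and non-null counts. If all you want is the dimensions, df.shape is the direct route (a small aside, not a cell from the notebook above):

print(df.shape)                 # (193940, 23)
n_rows, n_cols = df.shape
print("Rows: {}, Columns: {}".format(n_rows, n_cols))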

Then let's find out how many unique values sit within each column

In [6]:
df.apply(lambda x: [x.nunique()])
Out[6]:
   TreeID  qLegalStatus  qSpecies  qAddress  SiteOrder  qSiteInfo  PlantType  qCaretaker  qCareAssistant  PlantDate  ...  XCoord  YCoord  Latitude  Longitude  Location  Fire Prevention Districts  Police Districts  Supervisor Districts  Zip Codes  Neighborhoods (old)
0  193940            10       571     86242        311         31          3          22              15       8945  ...  161069  161510    162948     162881    162959                         15                10                    11         29                   41

1 rows × 23 columns
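
The apply/lambda version works, but pandas has this built in: df.nunique() returns the same counts as a Series, one row per column, which reads a bit more naturally. An equivalent alternative, not from the original notebook:

df.nunique()                   # unique-value count per column
df.nunique().sort_values()     # sorted, so low-cardinality columns (like PlantType) surface first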

Then let's look at the min and max of a few columns

In [11]:
print("Min Tree ID: {}".format(df['TreeID'].min()))
print("Max Tree ID: {}".format(df['TreeID'].max()))
Min Tree ID: 1
Max Tree ID: 262465
In [15]:
print("Min Tree Date: {}".format(df['PlantDate'].min()))
print("Max Tree Date: {}".format(df['PlantDate'].max()))
Min Tree Date: 1955-09-19 00:00:00
Max Tree Date: 2020-07-30 00:00:00
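
Checking columns one at a time is fine for a couple of fields. For a whole-table view of min/max (plus count, mean, and quartiles) in one shot, df.describe() is the usual shortcut. A general pandas pattern, not a cell from the notebook:

df.describe()                                                  # summary stats for every numeric column
df[['TreeID', 'Latitude', 'Longitude']].agg(['min', 'max'])    # or target just the columns you care about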

Finally, let's get a brief feel for the outliers in the location columns.

I like to start this off with a simple box plot. It'll show me the quartiles plus the outliers.

Without going further, I can already tell I'm going to need to take care of these distracting data points...

In [17]:
df['Latitude'].plot.box();
In [18]:
df['Longitude'].plot.box();
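
If you want to go one step beyond eyeballing the box plots and actually count those distracting points, the 1.5 * IQR rule (the same fence the box-plot whiskers use) is a reasonable first pass. A rough sketch, not part of the original notebook:

def count_iqr_outliers(series):
    """Count values falling outside the 1.5 * IQR whiskers of a box plot."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return ((series < lower) | (series > upper)).sum()

for col in ['Latitude', 'Longitude']:
    print("{}: {} flagged points".format(col, count_iqr_outliers(df[col].dropna())))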

Link to code above

Check out more Python Vocabulary on our Glossary Page