Exploratory Data Analysis (EDA) is the act of getting intimate with your data.

This means you get a feeling for your data. You don’t simply know its characteristics (# rows, columns, distributions, etc.)…you actually feel it.

It may sound a bit corny, but after working with data for long enough, you gain the ability to understand a dataset on an intuitive level.

EDA is the process of initial exploration. Imagine you are in a deep dark cave and all you have is a flashlight. You illuminate sections of the walls and the ground, and head down passages. EDA is the same process for exploring data.

Whenever we do Exploratory Data Analysis, you can bet we are analyzing:

• # rows, # columns
• Column cardinality (how many unique elements are there in each group?)
• Correlations (which columns relate to each other?)
• What are the min/max of each column?
• What do outliers (if any) say about the data?
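Each of these checks is one or two lines in pandas. Here is a minimal sketch on a toy DataFrame (the column names and values are invented for illustration):

```python
import pandas as pd

# Toy data standing in for a real dataset (names are illustrative only)
df = pd.DataFrame({
    "species": ["oak", "oak", "maple", "pine"],
    "height_m": [12.0, 15.5, 9.0, 30.0],
})

n_rows, n_cols = df.shape                   # rows, columns
cardinality = df.nunique()                  # unique values per column
correlations = df.corr(numeric_only=True)   # pairwise correlation of numeric columns
col_min, col_max = df["height_m"].min(), df["height_m"].max()
print(n_rows, n_cols, cardinality["species"], col_min, col_max)
```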

There isn’t a right answer when doing EDA. The goal is for you to have a launching point that will lead to more analysis. You’ll know when you are done when you are sufficiently inspired to take the next step in your analysis.

Let’s take a look at a Python EDA sample.

## Python Exploratory Data Analysis

Let's get to know our dataset a bit more. We will perform basic analyses that describe our data. Let's start with:

• # Rows, # Columns
• Column cardinality (how many unique elements are there in each group?)
• What are the min/max of each column?
• What do outliers (if any) say about the data?
In [14]:
```import pandas as pd

# Read in the dataset (the file name here is an assumption; point this at your local copy)
df = pd.read_csv("Street_Tree_List.csv")
```

### First, let's look at a few sample rows

In [2]:
```df.head()
```
Out[2]:
| | TreeID | qLegalStatus | qSpecies | qAddress | SiteOrder | qSiteInfo | PlantType | qCaretaker | qCareAssistant | PlantDate | ... |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 46534 | Permitted Site | Tree(s) :: | 73 Summer St | 7.0 | Sidewalk: Curb side : Cutout | Tree | Private | NaN | 04/01/2002 12:00:00 AM | ... |
| 1 | 121399 | DPW Maintained | Corymbia ficifolia :: Red Flowering Gum | 349X Cargo Way | 1.0 | Sidewalk: Curb side : Cutout | Tree | DPW | NaN | NaN | ... |
| 2 | 85269 | Permitted Site | Arbutus 'Marina' :: Hybrid Strawberry Tree | 1000 Edinburgh St | 3.0 | Sidewalk: Curb side : Cutout | Tree | Private | NaN | 07/24/2007 12:00:00 AM | ... |
| 3 | 121227 | DPW Maintained | Sequoia sempervirens :: Coast Redwood | 4299x 17th St | 3.0 | Front Yard : Yard | Tree | DPW | NaN | NaN | ... |
| 4 | 45986 | Permitted Site | Tree(s) :: | NaN | 226.0 | Sidewalk: Curb side : Cutout | Tree | Private | NaN | 12/06/2001 12:00:00 AM | ... |

(The trailing location columns, XCoord through Neighborhoods (old), are all NaN for these rows.)

5 rows × 23 columns

### Then let's get the count of rows and columns

In [3]:
```df.info()
```
```<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193940 entries, 0 to 193939
Data columns (total 23 columns):
#   Column                     Non-Null Count   Dtype
---  ------                     --------------   -----
0   TreeID                     193940 non-null  int64
1   qLegalStatus               193883 non-null  object
2   qSpecies                   193940 non-null  object
4   SiteOrder                  192230 non-null  float64
5   qSiteInfo                  193940 non-null  object
6   PlantType                  193940 non-null  object
7   qCaretaker                 193940 non-null  object
8   qCareAssistant             24478 non-null   object
9   PlantDate                  68911 non-null   object
10  DBH                        151614 non-null  float64
11  PlotSize                   143755 non-null  object
12  PermitNotes                52455 non-null   object
13  XCoord                     191066 non-null  float64
14  YCoord                     191066 non-null  float64
15  Latitude                   191066 non-null  float64
16  Longitude                  191066 non-null  float64
17  Location                   191066 non-null  object
18  Fire Prevention Districts  190815 non-null  float64
19  Police Districts           190865 non-null  float64
20  Supervisor Districts       190929 non-null  float64
21  Zip Codes                  190923 non-null  float64
22  Neighborhoods (old)        190925 non-null  float64
dtypes: float64(11), int64(1), object(11)
memory usage: 34.0+ MB
```
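The non-null counts above can be flipped into missing-value counts directly, which is often easier to scan. A minimal sketch on made-up data (column names borrowed from the output above, values invented):

```python
import pandas as pd

# Illustrative frame with gaps, mimicking a sparse column like PlantDate above
df = pd.DataFrame({
    "TreeID": [1, 2, 3, 4],
    "PlantDate": ["04/01/2002", None, "07/24/2007", None],
})

missing = df.isna().sum()        # count of nulls per column
missing_frac = df.isna().mean()  # fraction of nulls per column
print(missing["PlantDate"], missing_frac["PlantDate"])
```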

### Then let's find out how many unique values sit within each column

In [6]:
```df.apply(lambda x: [x.nunique()])
```
Out[6]:
| | TreeID | qLegalStatus | qSpecies | qAddress | SiteOrder | qSiteInfo | PlantType | qCaretaker | qCareAssistant | PlantDate | ... | XCoord | YCoord | Latitude | Longitude | Location | Fire Prevention Districts | Police Districts | Supervisor Districts | Zip Codes | Neighborhoods (old) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 193940 | 10 | 571 | 86242 | 311 | 31 | 3 | 22 | 15 | 8945 | ... | 161069 | 161510 | 162948 | 162881 | 162959 | 15 | 10 | 11 | 29 | 41 |

1 rows × 23 columns
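The apply/lambda above works, but pandas offers `DataFrame.nunique()` directly, which returns a Series of per-column cardinalities. A small sketch on toy data:

```python
import pandas as pd

# Toy frame with column names borrowed from the dataset above, values invented
df = pd.DataFrame({
    "PlantType": ["Tree", "Tree", "Shrub"],
    "qCaretaker": ["Private", "DPW", "DPW"],
})

cardinality = df.nunique()  # one count of unique values per column
print(cardinality.to_dict())
```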

### Then let's look at min and max of a few columns

In [11]:
```print("Min Tree ID: {}".format(df['TreeID'].min()))
print("Max Tree ID: {}".format(df['TreeID'].max()))
```
```Min Tree ID: 1
Max Tree ID: 262465
```
In [15]:
```# PlantDate is stored as text (dtype object), so parse it before taking min/max
df['PlantDate'] = pd.to_datetime(df['PlantDate'])
print("Min Tree Date: {}".format(df['PlantDate'].min()))
print("Max Tree Date: {}".format(df['PlantDate'].max()))
```
```Min Tree Date: 1955-09-19 00:00:00
Max Tree Date: 2020-07-30 00:00:00
```
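Taking min/max on the raw text column would compare strings lexicographically, which is why parsing matters here. A self-contained sketch on sample date strings in the same format as the table above:

```python
import pandas as pd

# Sample strings in the dataset's date format (values chosen for illustration)
dates = pd.Series([
    "04/01/2002 12:00:00 AM",
    "09/19/1955 12:00:00 AM",
    "07/30/2020 12:00:00 AM",
])
parsed = pd.to_datetime(dates, format="%m/%d/%Y %I:%M:%S %p")
print(parsed.min(), parsed.max())
```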

### Finally, let's get a brief feel for the outliers of location.

I like to start this off with a simple box plot. It'll show me the quartiles plus the outliers.

Without going further, I can already tell I'm going to need to take care of these distracting data points...

In [17]:
```df['Latitude'].plot.box();
```
In [18]:
```df['Longitude'].plot.box();
```