Exploratory Data Analysis – Playing with the Data
Shreegowri M B
SE Analyst, Dell EMC
Knowledge Sharing Article © 2017 Dell Inc. or its subsidiaries.
Dell EMC Proven Professional Knowledge Sharing 2
Table of Contents
Preface
Introduction to EDA
    EDA v/s classical data analysis
EDA Graphical Techniques
    Box and Whisker Plot
    Histogram
    Scatter Plot
EDA and Big Data
Conclusion
Bibliography
Disclaimer: The views, processes or methodologies published in this article are those of the
author. They do not necessarily reflect Dell EMC’s views, processes or methodologies.
Preface
“We have lots of data – what can we do with that? How can we unlock the real value from our
data?” With the vast amount of data available, companies in almost every industry are focusing
on exploiting data on hand to gain competitive advantage and business value. For any company
that wishes to enhance their business by being more data driven, data science is the key.
A standard definition from Wikipedia says, “Data science is an interdisciplinary field about
scientific methods, processes and systems to extract knowledge or insights from data in various
forms, which is a continuation of some of the data analysis fields such as statistics,
classification, clustering, machine learning, data mining and predictive analytics”. Data science
is a multidisciplinary blend of data inference, computer science, and algorithm development.
The data science process includes the steps shown in Figure 1.
Figure 1: Data science process flowchart [1]
This paper focuses on one of those steps, Exploratory Data Analysis (EDA). We will look at what
EDA is, how it differs from classical data analysis, some basic EDA techniques, and how EDA
relates to Big Data.
Introduction to EDA
The phrase “Exploratory Data Analysis” was introduced by John Tukey. EDA is an
approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to:
1. Maximize insight into a data set
2. Uncover underlying structure
3. Extract important variables
4. Detect outliers and anomalies
5. Test underlying assumptions
6. Develop parsimonious models
7. Determine optimal factor settings [2]
EDA is an approach – not a set of techniques – because it allows the data itself to reveal its
underlying structure and model, unlike the classical methods, which start with an assumed
model for the data.
EDA v/s classical data analysis
Both EDA and classical data analysis start with a scientific problem and end with a conclusion,
but they differ in the sequence of the intermediate steps.
Classical Data Analysis: Problem → Data → Model → Analysis → Conclusion
EDA: Problem → Data → Analysis → Model → Conclusion
The classical approach imposes models on the data, whereas the EDA approach allows the data to
suggest the models that best fit the data.
In classical analysis, the focus is on estimating the parameters of the model and predicting
values from it. On the other hand, for EDA, the focus is on the data – its structure, outliers,
gaps, and models suggested by the data.
In the real world, data analysts put all the data analysis approaches together to get a much
better understanding of the data.
EDA Graphical Techniques
EDA emphasizes graphical techniques, while classical data analysis emphasizes quantitative
techniques. EDA and classical techniques are not mutually exclusive and can be used in a
complementary fashion. For example, the analysis can start with some simple graphical
techniques such as the 4-plot, followed by classical confirmatory methods such as measures
of location, measures of scale, Bartlett’s test, and Grubbs’ test, to provide more
rigorous statements about the conclusions.
Here we will discuss three basic EDA graphical techniques and see how the graph created for
a data set reveals or suggests the model to be used in the next step of the analysis. Classical
quantitative techniques are outside the scope of this paper.
Box and Whisker Plot
Box and whisker plots are graphical methods based on Tukey’s five-number summary. We will
explain the box and whisker plot using a small data set. This data set contains the times, in
seconds, that women and men took to complete a 100-meter race. We will compare the times of
the women and the men by creating a box plot for each gender. Multiple box plots can be drawn
side by side to compare multiple data sets (here, the women’s and men’s times) or to compare
groups within the same data set.
Women’s Time (seconds):
10, 10, 10, 11, 11, 11, 11, 11, 11, 12, 12, 12, 13, 13, 14, 14, 14, 15, 15, 15, 20

Men’s Time (seconds):
9, 10, 10, 10, 10, 10, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 14, 14, 14, 15, 15
First we need to calculate the lower quartile (25th percentile), the upper quartile (75th
percentile), and the median (50th percentile) of each data set. For the women’s times, the 25th
percentile is 11, the 75th percentile is 14, and the 50th percentile is 12. For the men’s times,
the 25th percentile is 10, the 75th percentile is 13, and the 50th percentile is 12.
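The quartiles quoted above can be checked with a few lines of Python. This is only a sketch: the article does not say which quartile convention it used, so the `method="inclusive"` setting of the standard-library `statistics.quantiles` function is an assumption, and the helper name `quartiles` is ours.

```python
from statistics import quantiles

# Race times in seconds, copied from the data set above.
women = [10, 10, 10, 11, 11, 11, 11, 11, 11, 12, 12, 12,
         13, 13, 14, 14, 14, 15, 15, 15, 20]
men = [9, 10, 10, 10, 10, 10, 11, 11, 11, 12, 12, 12, 12,
       13, 13, 13, 14, 14, 14, 15, 15]


def quartiles(values):
    """Return (lower quartile, median, upper quartile) of the values."""
    # Assumed convention: "inclusive" interpolates between the sorted
    # observations, treating the minimum and maximum as the 0th and
    # 100th percentiles.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


print("women:", quartiles(women))
print("men:  ", quartiles(men))
```

Other quartile conventions can give slightly different values on other data sets, so it is worth stating the convention whenever quartiles are reported.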
For each gender (data set), a box is drawn between the lower and upper quartiles, covering the
central 50% of the data, and a line is drawn across the box at the median. Whiskers, which are
vertical lines ending in a horizontal stroke, are then drawn from the bottom and top of the box
to the lower and upper adjacent values (10 and 15 for the women’s data). The mean of the data
set is marked with a plus sign. Outside values are marked with a small “o” symbol; the women’s
data has an outside value of 20, whereas the men’s data has none. Figure 2 shows the box and
whisker plot for the women’s times, using a different color for each step of the construction.
Figure 3 shows the box and whisker plots for both genders.
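The numbers behind such a plot can be sketched in code. The article does not state how the adjacent and outside values were chosen; the sketch below assumes Tukey's usual rule, in which a value is "outside" when it lies more than 1.5 times the interquartile range beyond a quartile, and the adjacent values are the most extreme observations that are not outside. The function name `box_plot_summary` and the parameter `k` are illustrative.

```python
from statistics import mean, quantiles

# Race times in seconds, copied from the data set above.
women = [10, 10, 10, 11, 11, 11, 11, 11, 11, 12, 12, 12,
         13, 13, 14, 14, 14, 15, 15, 15, 20]
men = [9, 10, 10, 10, 10, 10, 11, 11, 11, 12, 12, 12, 12,
       13, 13, 13, 14, 14, 14, 15, 15]


def box_plot_summary(values, k=1.5):
    """Compute the quantities needed to draw a box and whisker plot."""
    data = sorted(values)
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumed rule: fences sit k * IQR beyond the quartiles.
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outside = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "lower_adjacent": inside[0],   # end of the lower whisker
        "upper_adjacent": inside[-1],  # end of the upper whisker
        "mean": mean(data),            # drawn as a plus sign
        "outside": outside,            # drawn as small "o" symbols
    }


for name, times in (("women", women), ("men", men)):
    print(name, box_plot_summary(times))
```

Under this assumed rule the sketch reproduces the values quoted in the text: whiskers ending at 10 and 15 for the women, with 20 as the only outside value, and no outside values for the men.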
Figure 2 Figure 3
Thus, the box plot identifies the middle 50% of the data, the median, and the extreme points.
The completed box and whisker plot (Figure 3) shows general information about the distribution
of the data. Box plots are good for detecting and showing changes in location and variation
between different data sets. From the plot we can see that the men generally ran faster than the
women: the men’s quartiles (10 and 13 seconds) are lower than the women’s (11 and 14 seconds),
although both medians are 12 seconds.
Histogram
A histogram is the most basic graphical method for displaying the shape of a distribution.
It is a simple way to learn a lot about the data on hand, including its shape, outliers,
central tendency, spread, and modes.
Let’s begin with an example data set containing the weights of 220 students in a class. The
students’ weights range from 78 to 223 pounds. The first step in building the histogram is to
create a frequency table or relative frequency table. With a large number of observations, a
table listing every distinct value would grow too big, running to hundreds or thousands of
rows. To simplify, the data must be grouped into classes before the frequency table is
formed. A common form of histogram is obtained by dividing the data into equal-sized classes
(also called bins). Below is the frequency table for the 220 students, grouped into 8 classes.
In a histogram, class frequencies (counts) or relative frequencies (count divided by the number
of observations) are represented by vertical bars. The height of each bar corresponds to its
class frequency or relative frequency. In our example, the class frequency is plotted on the
Y axis. The histogram in Figure 4 shows that most of the weights are in the middle of the
distribution. We can also see that the distribution is not symmetric: the weights extend farther
to the right than they do to the left. The distribution is therefore said to be right skewed.
Class Interval Frequency
75-95 15
95-115 28
115-135 75
135-155 67
155-175 23
175-195 10
195-215 1
215-235 1
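As a rough illustration, the relative frequencies and a text-only version of the histogram can be derived directly from the frequency table. The raw weights are not given in the article, so the sketch starts from the class counts; the bar scale of one `#` per 5 students is an arbitrary choice made for display.

```python
import math

# Class intervals (pounds) and frequencies copied from the table above.
frequency_table = [
    ("75-95", 15), ("95-115", 28), ("115-135", 75), ("135-155", 67),
    ("155-175", 23), ("175-195", 10), ("195-215", 1), ("215-235", 1),
]


def relative_frequencies(table):
    """Return (class label, count / number of observations) pairs."""
    n = sum(count for _, count in table)
    return [(label, count / n) for label, count in table]


total = sum(count for _, count in frequency_table)
modal_class = max(frequency_table, key=lambda row: row[1])[0]

print(f"observations: {total}, modal class: {modal_class}")
for (label, count), (_, rel) in zip(frequency_table,
                                    relative_frequencies(frequency_table)):
    # One '#' per 5 students, rounded up so small classes stay visible.
    bar = "#" * math.ceil(count / 5)
    print(f"{label:>8} {count:4d} {rel:6.3f} {bar}")
```

Even in this crude form, the bars thin out more slowly above the tallest class than below it, which is the right skew described in the text.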
Figure 4
The recommended next steps for right skewed histograms are:
1. Quantitatively summarize the data by computing and reporting the sample mean, the
sample median, and the sample mode.
2. Determine the best-fit distribution (skewed-right) from the
a. Weibull family (for the maximum)
b. Gamma family
c. Chi-square family
d. Lognormal family
e. Power lognormal family
3. Consider a normalizing transformation such as the Box-Cox transformation [2].
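To make step 3 concrete, here is a minimal sketch of the one-parameter Box-Cox transformation, (x^λ − 1)/λ for λ ≠ 0 and ln x for λ = 0. The λ values used in the demonstration are arbitrary illustrations; in practice λ is estimated from the data, which this sketch does not attempt.

```python
import math


def box_cox(x, lam):
    """One-parameter Box-Cox transform of a single positive value."""
    if x <= 0:
        raise ValueError("the Box-Cox transformation needs positive values")
    if lam == 0:
        return math.log(x)          # limiting case as lam approaches 0
    return (x ** lam - 1) / lam


# The smallest and largest weights mentioned in the text, in pounds.
lightest, heaviest = 78, 223

for lam in (1, 0.5, 0):             # illustrative lambda values only
    low, high = box_cox(lightest, lam), box_cox(heaviest, lam)
    print(f"lambda={lam}: {low:.3f} .. {high:.3f} (range {high - low:.3f})")
```

Values of λ below 1 compress large observations more than small ones, which is why such a transformation can pull in the long right tail of a right-skewed distribution.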
If the histogram is symmetric (the classical bell shape), then the next step is to consider a
normal distribution for the data. Hence, our next steps in the data analysis are determined by
the shape of the graph.
Scatter Plot
A scatter plot is a basic bivariate graphical EDA technique. It has one variable on the x-axis,
another on the y-axis, and a point for each observation in the data set.
As usual, let’s consider an example. Below, on the left, a table shows the heights and weights
of the students in a school soccer team. On the right, the same data are displayed as a scatter
plot. Each dot on the scatter plot represents one soccer player.
Figure 5
A scatter plot uncovers the relationship between the variables in the data set.
Scatter plots are used to analyze bivariate data patterns in terms of linearity, slope, and
strength.
Linearity refers to whether the data pattern is linear or nonlinear.
Slope refers to the direction of change in variable Y when variable X gets bigger. If
variable Y also gets bigger, the slope is positive; but if variable Y gets smaller, the slope
is negative.
Strength refers to the degree of "scatter" in the plot. If the dots are widely spread, the
relationship between variables is weak. If the dots are concentrated around a line, the
relationship is strong.
Scatter plots can also reveal other features of a data set, such as clusters, gaps, and outliers.
Figure 6 illustrates some common patterns.
Height (Inches)    Weight (Pounds)
65                 148
67                 155
69                 170
70                 180
72                 240
74                 210
75                 230
77                 250
Figure 6 [3]
The example plot (Figure 5) reveals a positive correlation and a linear relationship between the
two variables, indicating that a linear regression model might be appropriate in the later stages
of the analysis.
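The visual impression can be backed up with a quick calculation. The sketch below computes the Pearson correlation coefficient and an ordinary least-squares line for the height and weight table; these are standard choices for quantifying slope and strength, not a method prescribed by the article, and the function name `pearson_and_line` is ours.

```python
import math

# Heights (inches) and weights (pounds) of the soccer players,
# copied from the table for Figure 5.
heights = [65, 67, 69, 70, 72, 74, 75, 77]
weights = [148, 155, 170, 180, 240, 210, 230, 250]


def pearson_and_line(xs, ys):
    """Return (correlation r, slope, intercept) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)      # strength and direction
    slope = sxy / sxx                   # pounds per additional inch
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


r, slope, intercept = pearson_and_line(heights, weights)
print(f"r = {r:.3f}, weight = {slope:.2f} * height + {intercept:.1f}")
```

A coefficient that is positive and close to 1 agrees with the reading of the plot: weight tends to rise with height, and the points lie fairly close to a straight line.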
EDA and Big Data
Though John Tukey wrote his EDA book in the 1970s – before the era of big data – we can still
relate EDA and its techniques to big data. Let’s see how EDA addresses the three main
characteristics of big data (the 3 V’s): volume, velocity, and variety.
When we have big data, visual displays, which represent huge volumes of data in graphical
formats, help analysts identify patterns. With big data we have to look beyond the obvious
patterns. EDA helps us explore the data and discover hidden patterns using the vast set of
techniques and tools available.
Big data’s most challenging characteristics are the velocity of data flow in applications that
require near-real-time analysis, and the complexity or variety of the data. When an analyst
examines any complex data, especially data of high dimensionality (such as big data), the first
step is EDA. EDA involves looking at the data from many different angles, which helps to discover
insights that were not previously known. EDA can be broadly used in initial exploratory studies
for both data discovery and hypothesis testing. In the next step, a detailed analysis can find
significant parameter correlations among masses of complex, high-dimensional data.
Thus, EDA can be used for big data along with its use in data mining and data analysis of small
data sets.
Conclusion
Classical statistics assumes that we already have hypotheses and understand the data;
hence it tests the hypotheses already held about the problem. Conversely, EDA allows the data to
suggest the structure, model, or hypothesis that might have generated it.
EDA is the most important step in data science/analysis, as it often determines the success or
failure of an analytical project. There are a number of EDA techniques that give greater insight
into the data sets beyond suggesting a model. EDA graphical techniques help in identifying new
patterns in the data and save a great deal of the scientist’s time. An EDA tool, used properly,
can save hours of useless research and facilitate unique discoveries that are beyond known
correlations and projected results. The Butler Scientifics website reports that EDA-enabled
research led to the quick discovery (in less than 2 hours) of not only all the correlations “that
the science team identified during their 8-weeks intensive work but also several key correlations
that, with a further confirmatory phase, confirmed their original hypothesis”.
EDA is not merely a collection of techniques; it’s an approach and philosophy of data analysis.
Since in EDA we start with the data, without prior knowledge of the model or the correlations in
the data, we just need to play with, experiment with, and explore the data. We should use our
domain knowledge (graphical or quantitative techniques) along with our imagination, and be
creative, to get a sense of the data and find new things in it. As it is the first look at the
data in the analysis process, EDA is challenging, but it is fun!
Bibliography
1. https://en.wikipedia.org/wiki/Data_science
2. http://www.itl.nist.gov/div898/handbook/index.htm
3. http://stattrek.com/tutorials/ap-statistics-tutorial.aspx
4. Basics of Statistics – Jarkko Isotalo
Dell EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” DELL EMC MAKES NO
REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE
INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying and distribution of any Dell EMC software described in this publication requires an
applicable software license.
Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.