Exploratory Data Analysis – Playing with the Data

Shreegowri M B, SE Analyst, Dell EMC

Knowledge Sharing Article © 2017 Dell Inc. or its subsidiaries.

Dell EMC Proven Professional Knowledge Sharing

Table of Contents

Preface
Introduction to EDA
    EDA v/s classical data analysis
EDA Graphical Techniques
    Box and Whisker Plot
    Histogram
    Scatter Plot
EDA and Big Data
Conclusion
Bibliography

Disclaimer: The views, processes or methodologies published in this article are those of the

author. They do not necessarily reflect Dell EMC’s views, processes or methodologies.


Preface

“We have lots of data – what can we do with that? How can we unlock the real value from our
data?” With the vast amount of data available, companies in almost every industry are focusing
on exploiting the data on hand to gain competitive advantage and business value. For any company
that wishes to enhance its business by being more data-driven, data science is the key.

A standard definition from Wikipedia says, “Data science is an interdisciplinary field about

scientific methods, processes and systems to extract knowledge or insights from data in various

forms, which is a continuation of some of the data analysis fields such as statistics,

classification, clustering, machine learning, data mining and predictive analytics”. Data science

is a multidisciplinary blend of data inference, computer science, and algorithm development.

The data science process includes the steps shown in Figure 1.

Figure 1: Data science process flowchart [1]

In this paper, we look specifically at Exploratory Data Analysis (EDA) among these steps. We
cover what EDA is, the differences between EDA and classical data analysis, some basic EDA
techniques, and how EDA relates to Big Data.


Introduction to EDA

The phrase “Exploratory Data Analysis” was defined by John Tukey. EDA is an

approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to

1. Maximize insight into a data set

2. Uncover underlying structure

3. Extract important variables

4. Detect outliers and anomalies

5. Test underlying assumptions

6. Develop parsimonious models

7. Determine optimal factor settings [2]

EDA is an approach, not a set of techniques, as it allows the data itself to reveal the
underlying structure and model, unlike the classical methods, which start with an assumed
model for the data.

EDA v/s classical data analysis

Both EDA and classical data analysis start with the science problem and yield a conclusion,
but they differ in the sequence of the intermediate steps.

Classical data analysis: Problem → Data → Model → Analysis → Conclusion

EDA: Problem → Data → Analysis → Model → Conclusion

The classical approach imposes models on the data, whereas the EDA approach allows the data to
suggest the models that best fit it.


In classical analysis, the focus is on estimating the parameters of the model and predicting

values from it. On the other hand, for EDA, the focus is on the data – its structure, outliers,

gaps, and models suggested by the data.

In the real world, data analysts put all the data analysis approaches together to get a much

better understanding of the data.

EDA Graphical Techniques

EDA emphasizes graphical techniques while classical data analysis emphasizes quantitative

techniques. EDA and classical techniques are not mutually exclusive and can be used in a

complementary fashion. For example, the analysis can start with some simple graphical
techniques such as the 4-plot, followed by classical confirmatory methods such as measures
of location, measures of scale, Bartlett’s test, and Grubbs’ test, to provide more
rigorous statements about the conclusions.

Here we will discuss three basic EDA graphical techniques and see how the graph created for
the data set reveals or suggests the model to be used in the next step of the analysis. Classical
quantitative techniques are outside the scope of this paper.

Box and Whisker Plot

Box and whisker plots are graphical methods based on Tukey’s five-number summary. We will
explain the box and whisker plot using a small data set containing the times, in seconds, that
women and men took to complete a 100-meter race. We will compare the times of the women
and the men by creating a box plot for each gender. Multiple box plots can be drawn side by side
to compare multiple data sets (here, the times of the men and the women) or to compare groups
within the same data set.


Women’s Time (seconds): 10, 10, 10, 11, 11, 11, 11, 11, 11, 12, 12, 12, 13, 13, 14, 14, 14, 15, 15, 15, 20

Men’s Time (seconds): 9, 10, 10, 10, 10, 10, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 14, 14, 14, 15, 15

First we need to calculate the lower quartile (25th percentile), the upper quartile (75th
percentile), and the median (50th percentile) for each data set. For the women’s times, the 25th
percentile is 11, the 75th percentile is 14, and the 50th percentile is 12. For the men’s times,
the 25th percentile is 10, the 75th percentile is 13, and the 50th percentile is 12.
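The quartiles quoted above can be checked with a few lines of Python. This is a minimal sketch that uses `statistics.quantiles` from the standard library with `method="inclusive"`. That interpolation rule is an assumption, since the article does not say which quartile convention it uses, and other conventions can give slightly different values on other data. The helper name `quartiles` is illustrative.

```python
from statistics import quantiles

# Race times in seconds, copied from the two lists above.
women = [10, 10, 10, 11, 11, 11, 11, 11, 11, 12, 12, 12,
         13, 13, 14, 14, 14, 15, 15, 15, 20]
men = [9, 10, 10, 10, 10, 10, 11, 11, 11, 12, 12, 12, 12,
       13, 13, 13, 14, 14, 14, 15, 15]


def quartiles(times):
    """Return (lower quartile, median, upper quartile) of the times.

    method="inclusive" is an assumed convention; the article does not name one.
    """
    q1, q2, q3 = quantiles(times, n=4, method="inclusive")
    return q1, q2, q3


for label, times in (("Women", women), ("Men", men)):
    q1, q2, q3 = quartiles(times)
    print(f"{label}: Q1={q1}, median={q2}, Q3={q3}")
```

With this convention the computed quartiles agree with the values stated in the text for both groups.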

For each gender (data set), a box is drawn between the lower and upper quartiles, covering the
central 50% of the data, and a line is drawn across the box at the median. Whiskers, which are
vertical lines ending in a horizontal stroke, are then drawn from the bottom and top of the box to
the lower and upper adjacent values (10 and 15 for the women’s data). The mean of the data set is
marked with a plus sign. Outside values are marked with a small “o” symbol; the women’s data has
an outside value of 20, whereas the men’s data has none. Figure 2 shows the box and whisker plot
for the women’s times, using a different color for each step of the construction. Figure 3 shows
the box and whisker plots for both genders.


Figure 2 Figure 3

Thus the box plot identifies the middle 50% of the data, the median and the extreme points.

The completed box and whisker plot (Figure 3) shows general information about the distribution

of data. Box plots are good for detecting and showing location and variation changes between

different data sets. From the plot we can see that the men generally ran faster than the women.
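The numbers behind the box and whisker plot can be sketched as follows. The 1.5 × IQR rule for the fences is Tukey's usual convention and is assumed here. The article does not state the rule, although its adjacent values (10 and 15) and outside value (20) for the women's data are consistent with it. The helper name `box_plot_summary` and the quartile method are likewise assumptions.

```python
from statistics import mean, quantiles

# Race times in seconds, copied from the article's data set.
women = [10, 10, 10, 11, 11, 11, 11, 11, 11, 12, 12, 12,
         13, 13, 14, 14, 14, 15, 15, 15, 20]
men = [9, 10, 10, 10, 10, 10, 11, 11, 11, 12, 12, 12, 12,
       13, 13, 13, 14, 14, 14, 15, 15]


def box_plot_summary(times, k=1.5):
    """Compute the values needed to draw one box and whisker plot.

    k=1.5 is the assumed fence multiplier (Tukey's usual choice).
    """
    q1, median, q3 = quantiles(times, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Adjacent values: the most extreme observations still inside the fences.
    inside = [t for t in times if lower_fence <= t <= upper_fence]
    # Outside values: observations beyond the fences, drawn as "o".
    outside = [t for t in times if t < lower_fence or t > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_adjacent": min(inside),
        "upper_adjacent": max(inside),
        "mean": mean(times),  # drawn as "+"
        "outside": outside,
    }


for label, times in (("Women", women), ("Men", men)):
    print(label, box_plot_summary(times))
```

Under these assumptions the women's summary has whiskers ending at 10 and 15 with 20 as the only outside value, and the men's summary has no outside values, which matches the description of Figures 2 and 3.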

Histogram

A histogram is the most basic graphical method for displaying the shape of a distribution.
It is a simple way to learn a lot about the data on hand, including its shape, outliers,
central tendency, spread, and modes.

Let’s begin with an example data set containing the weights of 220 students in a class. The
students’ weights range from 78 to 223 pounds. The first step in building the histogram is to
create a frequency table or relative frequency table. With a large number of observations, a
table of individual values would grow too big, running to hundreds or thousands of rows. To
simplify, the data are grouped into classes before the frequency table is formed. A common form
of histogram is obtained by dividing the data into equal-sized classes (also called bins). The
frequency table below groups the 220 students into 8 classes.
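The grouping step can be sketched in Python. Because the 220 individual weights are not listed in the article, this example reuses the women's race times from the box plot section as input. The function name `frequency_table`, the class width of 2 seconds, and the rule that each class includes its lower bound but not its upper bound are all assumptions made for illustration.

```python
# Women's race times in seconds, reused from the box plot example.
women = [10, 10, 10, 11, 11, 11, 11, 11, 11, 12, 12, 12,
         13, 13, 14, 14, 14, 15, 15, 15, 20]


def frequency_table(values, start, width, n_classes):
    """Group values into equal-width classes and count each class.

    Assumption: a class [low, high) includes its lower bound and excludes
    its upper bound, so a value on a shared boundary goes to the upper class.
    """
    counts = [0] * n_classes
    for value in values:
        index = int((value - start) // width)
        if not 0 <= index < n_classes:
            raise ValueError(f"{value} falls outside the class range")
        counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, count)
            for i, count in enumerate(counts)]


for low, high, count in frequency_table(women, start=10, width=2, n_classes=6):
    print(f"{low}-{high}: {count}")
```

The empty classes between 16 and 20 show up as zero counts, which is how a gap before an outlying value appears in a frequency table.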


In a histogram, class frequencies (counts) or relative frequencies (count divided by the number
of observations) are represented by vertical bars. The height of each bar corresponds to its
class frequency or relative class frequency. In our example the Y axis shows the class frequency.

The histogram in Figure 4 shows that most of the weights are in the middle of the distribution.
We can also see that the distribution is not symmetric: the weights extend farther to the right
than they do to the left. Therefore, the distribution is said to be right skewed.

Class Interval    Frequency
75-95             15
95-115            28
115-135           75
135-155           67
155-175           23
175-195           10
195-215           1
215-235           1


Figure 4
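As a rough stand-in for Figure 4, the frequency table can be turned into relative frequencies and a text histogram. This is only a sketch: the bars are drawn horizontally rather than vertically, and the function name `text_histogram` and the scale of two observations per `#` are illustrative choices, not part of the article.

```python
# (lower bound, upper bound, frequency) for each class, copied from the table.
classes = [
    (75, 95, 15), (95, 115, 28), (115, 135, 75), (135, 155, 67),
    (155, 175, 23), (175, 195, 10), (195, 215, 1), (215, 235, 1),
]


def text_histogram(classes, scale=2):
    """Return one text line per class: interval, count, relative frequency, bar.

    Each '#' stands for `scale` observations, rounded up so that classes
    with a single observation remain visible.
    """
    total = sum(freq for _, _, freq in classes)
    lines = []
    for low, high, freq in classes:
        bar = "#" * -(-freq // scale)  # ceiling division
        lines.append(f"{low:>3}-{high:<3} {freq:>3} {freq / total:6.1%} {bar}")
    return lines


for line in text_histogram(classes):
    print(line)
```

The long bars for the two middle classes and the thin tail toward the heavier classes give the same right-skewed picture described for Figure 4.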

The recommended next steps for right skewed histograms are:

1. Quantitatively summarize the data by computing and reporting the sample mean, the

sample median, and the sample mode.

2. Determine the best-fit distribution (skewed-right) from the

a. Weibull family (for the maximum)

b. Gamma family

c. Chi-square family

d. Lognormal family

e. Power lognormal family

3. Consider a normalizing transformation such as the Box-Cox transformation [2].
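Step 1 can be approximated directly from the frequency table. The sketch below assumes every observation sits at the midpoint of its class when computing the mean and interpolates linearly inside the median class. These are standard grouped-data approximations, not exact sample statistics, because the article lists only the class counts. The function name `grouped_summary` is hypothetical.

```python
# (lower bound, upper bound, frequency) for each class, copied from the table.
classes = [
    (75, 95, 15), (95, 115, 28), (115, 135, 75), (135, 155, 67),
    (155, 175, 23), (175, 195, 10), (195, 215, 1), (215, 235, 1),
]


def grouped_summary(classes):
    """Approximate the mean, median, and modal class from grouped data."""
    n = sum(freq for _, _, freq in classes)

    # Mean: treat each observation as if it were at its class midpoint.
    mean = sum((low + high) / 2 * freq for low, high, freq in classes) / n

    # Median: find the class holding the n/2-th observation, then
    # interpolate linearly within that class.
    cumulative = 0
    median = None
    for low, high, freq in classes:
        if cumulative + freq >= n / 2:
            median = low + (n / 2 - cumulative) / freq * (high - low)
            break
        cumulative += freq

    # Mode: report the class with the highest frequency.
    modal_low, modal_high, _ = max(classes, key=lambda c: c[2])
    return mean, median, (modal_low, modal_high)


mean, median, modal_class = grouped_summary(classes)
print(f"mean ~ {mean:.1f}, median ~ {median:.1f}, modal class {modal_class}")
```

The approximate mean comes out slightly above the approximate median, which is the ordering usually seen in a right-skewed distribution.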

If the histogram is symmetric (the classical bell shape), then the next step is to consider a
normal distribution for the data. Hence, the next steps in the data analysis are determined by
the shape of the graph.


Scatter Plot

A scatter plot is a basic bivariate graphical EDA technique that has one variable on the x-axis,
one on the y-axis, and a point for each observation in the data set.

As usual, let’s consider an example. The table of heights and weights below lists the students
in a school soccer team, and Figure 5 displays the same data as a scatter plot. Each dot on the
scatter plot represents one soccer player.

Figure 5

A scatter plot uncovers the relationship between the variables in a data set. Scatter plots are
used to analyze bivariate data patterns in terms of linearity, slope, and strength.

Linearity refers to whether the data pattern is linear or nonlinear.

Slope refers to the direction of change in variable Y when variable X gets bigger. If

variable Y also gets bigger, the slope is positive; but if variable Y gets smaller, the slope

is negative.

Strength refers to the degree of "scatter" in the plot. If the dots are widely spread, the

relationship between variables is weak. If the dots are concentrated around a line, the

relationship is strong.

Scatter plots can also reveal other features of a data set, such as clusters, gaps, and outliers.

Figure 6 illustrates some common patterns.

Height and weight data plotted in Figure 5:

Height (Inches)    Weight (Pounds)
65                 148
67                 155
69                 170
70                 180
72                 240
74                 210
75                 230
77                 250


Figure 6 [3]

The example plot (Figure 5) reveals a positive correlation and a linear relationship between the
two variables, indicating that a linear regression model might be appropriate in the later stages
of the analysis.
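The visual impression from Figure 5 can be backed by two simple numbers: the Pearson correlation coefficient for strength and the slope of a least-squares line for direction. The sketch below computes both from the eight height and weight pairs in the article's table, using only the standard library. The function name `pearson_and_line` is hypothetical, and fitting a straight line is offered as one possible follow-up rather than as the article's prescribed method.

```python
from math import sqrt

# Height and weight of the eight soccer players, from the article's table.
heights = [65, 67, 69, 70, 72, 74, 75, 77]          # inches
weights = [148, 155, 170, 180, 240, 210, 230, 250]  # pounds


def pearson_and_line(xs, ys):
    """Return (correlation r, slope, intercept) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)           # strength and sign of the relationship
    slope = sxy / sxx                   # pounds gained per extra inch
    intercept = mean_y - slope * mean_x
    return r, slope, intercept


r, slope, intercept = pearson_and_line(heights, weights)
print(f"r = {r:.2f}, slope = {slope:.2f} lb/in, intercept = {intercept:.1f} lb")
```

A correlation close to +1 with a positive slope agrees with the reading of Figure 5 as a strong, positive, roughly linear relationship. With only eight points, the fit should be treated as exploratory.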


EDA and Big Data

Though John Tukey wrote his EDA book in the 1970s, before the era of big data, we can still
relate EDA and its techniques to big data. Let’s see how EDA addresses the three main
characteristics of big data (the 3 V’s): volume, velocity, and variety.

With big data, visual displays that represent a huge volume of data in graphical form help
analysts identify patterns. In big data we have to look beyond the obvious patterns. EDA helps
exploit the data and discover hidden patterns using the vast set of techniques and tools
available.

Big data’s most challenging characteristics are the velocity of data flow in applications
requiring near-real-time analysis and the complexity or variety of the data. When analyzing
complex data, especially data with high dimensionality (such as big data), an analyst’s first
step is EDA. EDA involves looking at the data from many different angles, which helps to uncover
insights that were not previously known. EDA can be used broadly in initial exploratory studies
for both data discovery and hypothesis testing. In the next step, a detailed analysis can
identify the significant correlations between parameters among masses of complex,
high-dimensional data.

Thus, EDA can be used for big data along with its use in data mining and data analysis of small

data sets.


Conclusion

Classical statistics assumes that we already have hypotheses and we understand the data,

hence it tests the hypotheses already held about the problem. Conversely, EDA allows data to

suggest the structure, model, or hypothesis that worked to generate this data.

EDA is the most important step in data science/analysis, as it often determines the success or
failure of an analytical project. There are a number of EDA techniques which give greater insight
into the data sets beyond suggesting a model. EDA graphical techniques help in identifying new
patterns in the data and save a huge amount of scientists’ time. An EDA tool, used properly, can
save hours of useless research and facilitate unique discoveries that go beyond known
correlations and projected results. The Butler Scientifics website reports that EDA-enabled
research led to the quick discovery (in less than 2 hours) of all the correlations “that the
science team identified during their 8-weeks intensive work but also several key correlations
that, with a further confirmatory phase, confirmed their original hypothesis”.

EDA is not merely a collection of techniques; it’s an approach and philosophy of data analysis.
Since in EDA we start with the data, without prior knowledge of the model or the correlations in
the data, we need to play with, experiment with, and explore the data. We should use our domain
knowledge and the available techniques (graphical or quantitative) along with our imagination
and creativity to get a sense of the data and find new things in it. As it is the first look at
the data in the analysis process, EDA is challenging, but it is fun!

Bibliography

1. https://en.wikipedia.org/wiki/Data_science

2. http://www.itl.nist.gov/div898/handbook/index.htm

3. http://stattrek.com/tutorials/ap-statistics-tutorial.aspx

4. Basics of Statistics – Jarkko Isotalo

5. http://www.butlerscientifics.com/#%21realsuccesscases/colc



Dell EMC believes the information in this publication is accurate as of its publication date. The

information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” DELL EMC MAKES NO

REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE

INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED

WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying and distribution of any Dell EMC software described in this publication requires an

applicable software license.

Dell, EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.
