STAT 248: Exploratory Data Analysis
Handout 2
GSI: Gido van de Ven
September 10th, 2010
1 Introduction
Today’s section will focus on Exploratory Data Analysis (EDA), a set of techniques developed to make a dataset more easily and effectively handleable by human minds. This handout first discusses some fundamental issues in EDA, and then focuses on some basic data visualization techniques: the histogram, the stem-and-leaf plot, the boxplot and probability plots (QQ-plots).
During section, we won’t go over all the concepts mentioned in this handout, as it is meant more as a reference. We will start the section with a discussion of the goal of Statistics and where EDA fits in. After that we will go over some examples in R.
2 Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach to data analysis originally developed by Tukey. Four major ingredients of EDA stand out:
1. Displays visually reveal the behaviour of the data and the structure of the analyses;
2. Residuals focus attention on what remains of the data after some analysis;
3. Re-expressions by means of simple mathematical functions, such as the logarithm and the square root, help to simplify behaviour and clarify analyses; and
4. Resistance ensures that a few extraordinary data values do not unduly influence the results of an analysis.
The seminal work in EDA is “Exploratory Data Analysis” (Tukey, 1977).
3 Basic Concepts
Depths: The depth of a data value in an ordered batch describes how far that value is from the nearer end of the batch. We define the depth of each data value as its position in an enumeration of values that starts at the nearer end of the batch. Each extreme value is the first value in its enumeration and therefore has depth 1; the second largest and second smallest values each have depth 2; and so on. In general, in a batch of size n, two data values have depth i: the ith and the (n + 1 − i)th.
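As a quick illustration, the depth of every value in a batch can be computed in R from its rank (the function name depth below is ours, not a built-in):

```r
# Depth of each value in a batch: its position counted from the nearer
# end of the ordered batch, so both extremes get depth 1.
depth <- function(x) {
  n <- length(x)
  r <- rank(x)          # position counted from the low end
  pmin(r, n + 1 - r)    # take whichever end is nearer
}

depth(c(47, 103, 130, 192, 221))  # 1 2 3 2 1
```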
Median: If n is odd, there is a deepest data value, one as far from either end of the ordered batch as from the other. This is the median (M), and it marks the middle of the batch: exactly half of the remaining n − 1 numbers in the batch are less than or equal to it, and exactly half are greater than or equal to it. The depth of the median, denoted by d(M), is calculated as:
d(M) = (n + 1)/2
If n is even, d(M) will have a fractional part equal to a half. In that case, the usual convention is to use the average of the two observations with depth just above and just below d(M) as the median M.
Hinges: The median splits an ordered batch in half. We might naturally ask next about the middle of each of the halves. The hinges (H) are the summary values in the middle of each half of the data. They are denoted by LH (lower hinge) and UH (upper hinge), and are about a quarter of the way in from each end of the ordered batch. Each hinge is at depth d(H):
d(H) = ([d(M)] + 1)/2
where [x] denotes the largest integer less than or equal to x (i.e. if d(M) contains a 1/2, you drop it).
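The two depth formulas translate directly into a small R function; a sketch (the helper name hinges is ours), whose three values should match the middle three values returned by fivenum():

```r
# Tukey hinges from d(M) = (n + 1)/2 and d(H) = ([d(M)] + 1)/2.
hinges <- function(x) {
  x <- sort(x)
  n <- length(x)
  dM <- (n + 1) / 2
  dH <- (floor(dM) + 1) / 2
  # a depth ending in .5 means averaging the two neighbouring order statistics
  at <- function(d) (x[floor(d)] + x[ceiling(d)]) / 2
  c(LH = at(dH), M = at(dM), UH = at(n + 1 - dH))
}

hinges(c(1, 2, 3, 4, 5, 6, 7, 8))  # LH = 2.5, M = 4.5, UH = 6.5
```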
Quartiles: The hinges are closely related to the quartiles, which are defined so that one quarter of the data lies below the lower quartile and one quarter of the data lies above the upper quartile. The main difference between hinges and quartiles is that the depth of the hinges is calculated from the depth of the median, with the result that the hinges often lie closer to the median than the quartiles do. Note that the hinges equal the quartiles for odd n, and differ slightly for even n.
5-number summary: Introduced by Tukey, this summary consists of the median, the two hinges and the two extreme values.
Spread: The spread of a dataset reflects the variability of the data. One value that tells us about the spread is the interquartile range (IQR, also called the H-spread), which is the distance between the two hinges (i.e. IQR = UH − LH). Another measure of the variability in the data is the range, which is simply the largest value minus the smallest.
Outliers: Let’s first define the inner fences (IF) and the outer fences (OF) of a batch of data:
InnerFences = LH − 1.5 ∗ (H-spread) and UH + 1.5 ∗ (H-spread)
OuterFences = LH − 3 ∗ (H-spread) and UH + 3 ∗ (H-spread)
Any data value beyond either inner fence is termed outside (an outlier), and any value beyond either outer fence is called far outside (an extreme outlier).
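These fence definitions can be sketched in R as follows (flag_outliers is a name we made up; fivenum() returns the extremes, hinges and median):

```r
# Classify values as outside (beyond an inner fence) or far outside
# (beyond an outer fence), using the hinges from fivenum().
flag_outliers <- function(x) {
  f <- fivenum(x)              # min, LH, median, UH, max
  LH <- f[2]; UH <- f[4]
  spread <- UH - LH            # the H-spread
  list(outside     = x[x < LH - 1.5 * spread | x > UH + 1.5 * spread],
       far_outside = x[x < LH - 3 * spread | x > UH + 3 * spread])
}

flag_outliers(c(1:10, 50))  # 50 lies beyond both fences
```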
Boxplot: A boxplot is a box, marked off with solid lines, running from hinge to hinge, with the median shown as a solid line in the middle. A dashed whisker runs from the box to the value on each end that is still not beyond the corresponding inner fence (these two values are known as the adjacent values). Outliers are shown individually, and if possible they are labelled.
Histogram: A histogram can be used to illustrate the shape, or the distribution, of data. Histograms consist of side-by-side bars (also called bins), where each data value is represented by an equal amount of area in its bar. The height of the bins represents either counts or proportions. In the latter case the histogram is said to be a “true” histogram, or to be in “density scale”. Histograms can be used to immediately judge whether the collection of data is approximately symmetric or skewed. Also, we can see whether the histogram has one single peak — then the data is said to be unimodal — or multiple peaks — a multimodal distribution.
Stem-and-leaf Plot: A variation on the histogram is the stem-and-leaf plot. In this plot, a certain number of the digits at the beginning of each data value serve as the basis for sorting, and the next digit appears in the display. Note that in a stem-and-leaf plot, apart perhaps from the original order of the data values, no information is lost.
3.1 R functions
median(), fivenum(), quantile(), IQR(), range(), boxplot(), hist(), stem()
3.2 Example
The following example, to illustrate the above concepts, is taken from Velleman and Hoaglin’s book. New Jersey has 21 different counties; sorted into increasing order, the areas in square miles of these counties are:
47, 103, 130, 192, 221, 228,
234, 267, 307, 312, 329, 362,
365, 423, 468, 476, 500, 527,
569, 642, 819.
Applying the definitions laid out above gives:
n = 21        d(M) = (n + 1)/2 = 11        median = 329
d(H) = ([d(M)] + 1)/2 = 6        LH = 228        UH = 476
IQR = H-spread = UH − LH = 248
IF_down = LH − 1.5 ∗ IQR = 228 − 1.5 ∗ 248 = −144
IF_up = UH + 1.5 ∗ IQR = 476 + 1.5 ∗ 248 = 848
OF_down = LH − 3 ∗ IQR = 228 − 3 ∗ 248 = −516
OF_up = UH + 3 ∗ IQR = 476 + 3 ∗ 248 = 1220
Adjacent values = {47, 819}
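These hand computations are easy to verify in R, assuming the 21 county areas are stored in a vector called New.Jersey (as in the code below):

```r
New.Jersey <- c(47, 103, 130, 192, 221, 228, 234, 267, 307, 312, 329,
                362, 365, 423, 468, 476, 500, 527, 569, 642, 819)

fivenum(New.Jersey)  # 47 228 329 476 819 (extremes, hinges, median)
median(New.Jersey)   # 329
IQR(New.Jersey)      # 248
```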
Figure 1: New Jersey Counties. Top: Boxplot. Bottom: Histogram and kernel density plot.
> par(mfrow = c(2,1))
> boxplot(New.Jersey, horizontal = TRUE,
+         main = "Boxplot: area of the New Jersey counties",
+         xlab = "Area (in square miles)")
> hist(New.Jersey, prob = TRUE, col = "red",
+      main = "Histogram: area of the New Jersey counties",
+      xlab = "Area (in square miles)")
> lines(density(New.Jersey))
4 Probability Plots
Probability plots are an extremely useful graphical tool for qualitatively assessing the fit of data to atheoretical distribution.
Consider a sample of size n from a uniform distribution U[0, 1]. Denote the ordered sample values by X_(1) < X_(2) < ... < X_(n). These values are order statistics. It can be shown that:

E(X_(j)) = j/(n + 1)

This suggests plotting the ordered observations X_(1), ..., X_(n) against their expected values 1/(n + 1), ..., n/(n + 1).
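This expectation is easy to check by simulation; a small sketch:

```r
# Simulate many ordered U[0,1] samples and compare the average of each
# order statistic with j/(n + 1).
set.seed(1)
n <- 9
ord <- replicate(10000, sort(runif(n)))  # each column is one ordered sample
round(rowMeans(ord), 2)                  # close to 0.1, 0.2, ..., 0.9
```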
If the underlying distribution is uniform, the plot should look roughly linear. Now suppose you have the hypothesis that the observations follow a certain distribution F different from the uniform distribution.
Proposition: Let X be a continuous random variable with a strictly increasing cumulative distribution function F_X. If Y = F_X(X), then Y has a uniform distribution on [0, 1].
Given a sample X_1, ..., X_n we can plot the ordered observations (which may be viewed as the observed or empirical quantiles) versus the quantiles of the theoretical distribution:

X_(k) vs F^(−1)(k/(n + 1))

where F^(−1)(k/(n + 1)) is the k/(n + 1) quantile of the distribution F; that is, it is the point such that the probability that a random variable with distribution F is less than that point is k/(n + 1). If the underlying distribution is the one that we hypothesized, the plot should look roughly linear.
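This recipe can be carried out by hand in R for, say, the normal distribution (note that the built-in qqnorm() uses a slightly different plotting-position formula, via ppoints(), so the two plots agree only approximately):

```r
# Hand-made normal QQ-plot: ordered data versus F^(-1)(k/(n + 1)).
set.seed(2)
x <- rnorm(100, mean = 5, sd = 2)
n <- length(x)
theoretical <- qnorm((1:n) / (n + 1))   # theoretical standard normal quantiles
plot(theoretical, sort(x),
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
abline(lm(sort(x) ~ theoretical))       # roughly linear if x is normal
```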
Figure 2: Left: QQ-plot of the areas of the New Jersey counties. Right: QQ-plot of simulated Uniform[0,1] data.

> qqnorm(New.Jersey)
> qqline(New.Jersey)
> U = runif(1000)
> theoretical = seq(from = 1/1001, to = 1000/1001, length.out = 1000)
> qqplot(theoretical, U, ylab = "Sample Quantiles",
+        xlab = "Theoretical Quantiles", main = "QQ-plot against Uniform-distr")
> abline(lm(U[order(U)] ~ theoretical))
4.1 Backup
Suppose that F is the cdf of a continuous random variable, that F is strictly increasing on some interval I (which may be unbounded), and that F = 0 to the left of I and F = 1 to the right of I. Under this assumption the inverse function F^(−1) is well defined: x = F^(−1)(y) if y = F(x). The pth quantile of the distribution F is defined to be the value x_p such that F(x_p) = p, or P(X ≤ x_p) = p. Under the preceding assumptions, x_p is uniquely defined as x_p = F^(−1)(p). Special cases are p = 1/2, which corresponds to the median of F, and p = 1/4 and p = 3/4, which correspond to the lower and upper quartiles of F.
Proposition: Let X be a random variable with a cdf F that satisfies the above assumptions, and let Z = F(X). Then Z has a uniform distribution on [0, 1].
Proof: P (Z ≤ z) = P (F (X) ≤ z) = P (X ≤ F−1(z)) = F (F−1(z)) = z
Proposition: Let U be uniform on [0, 1] and let X = F−1(U). Then the cdf of X is F.
Proof: P (X ≤ x) = P (F−1(U) ≤ x) = P (U ≤ F (x)) = F (x)
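Both propositions can be illustrated by simulation, for example with the exponential distribution (pexp() and qexp() play the roles of F and F^(−1) here):

```r
# If X ~ F then F(X) is uniform on [0,1]; if U is uniform then F^(-1)(U) ~ F.
set.seed(3)
X <- rexp(10000, rate = 2)
Z <- pexp(X, rate = 2)    # proposition 1: Z should look like U[0,1]
U <- runif(10000)
Y <- qexp(U, rate = 2)    # proposition 2: Y should look like Exp(2)
c(mean(Z), mean(Y))       # both close to 1/2 (the mean of U[0,1] and of Exp(2))
```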
5 Re-expression of the Data
Usually, it is preferable if the distribution of your data is symmetric. For example, if you have heavily skewed data, a histogram or a boxplot won’t be very informative. And not only for visualization: many models or parametric tests assume that the data are (at least approximately) normally distributed, and if your data is heavily skewed you can be sure this assumption is violated. Hence, it is sometimes better to transform your data prior to analyzing it. Note that linear transformations (of the form x → a + bx) do not change the shape of the distribution, so for this purpose only non-linear transformations are useful.

An important class of such transformations is Tukey’s Ladder of Powers, given by y → y^p, for some power p. This class of transformations changes the distribution of the data, but does not affect the ordering of the data points. If you use p < 1, the upper tail of the distribution will be pulled in and the lower tail stretched out; a value p > 1 has the reverse effect. (Note that for p = 1 the transformed data equal the original data.) The farther the value of p is from one, the stronger this effect. A special case is p = 0. Usually y^0 is defined to be 1, but it would be useless to transform all data values to 1. However, it turns out that when the powers p are ordered according to the strength of their reshaping effect, the logarithm falls naturally in the place of the zero power (i.e. y → log(y)). For p < 0, the re-expression in the Ladder of Powers is given by y → −y^p, where the extra minus sign is added in order to preserve the original ordering of the data.
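The whole ladder fits in one small R function; a sketch (the name ladder is ours):

```r
# Tukey's Ladder of Powers: y^p for p > 0, log(y) for p = 0,
# and -(y^p) for p < 0 (the minus sign preserves the ordering).
ladder <- function(y, p) {
  if (p > 0) y^p
  else if (p == 0) log(y)
  else -(y^p)
}

y <- c(1, 10, 100)
ladder(y, 0.5)  # square root: pulls in the upper tail
ladder(y, 0)    # logarithm
ladder(y, -1)   # -1/y: still increasing in y
```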
The three main objectives for transforming data are:
• Normalize the distribution: Many statistical models and parametric tests assume that the data is (approximately) normal. Common transformations to bring data closer to normality are the logarithm transformation and the inverse probability transformation1.
• Stabilize the variances: Besides normality, many parametric tests also require equal variances of the data points. A typical example of a variance-stabilizing transformation is the square root transformation: y → √y.
• Linearize the trend: Regression analysis techniques require the assumption of linearity. For non-linear data there exist non-linear regression approaches, but in many cases it is easier to apply a linearizing transformation. A typical example of such a re-expression is the logarithmic transformation.
1This method is based on the fact that if X is distributed according to cdf F (with F−1(.) well defined), then Y = Φ−1(F(X)) has a standard normal distribution. (Note that Φ(.) is the cdf of the standard normal distribution.)
6 Example: Exploratory Data Analysis
Last week’s handout contained an exercise about the dataset ”jj.dat”, which contains the quarterly earnings per share of the company Johnson and Johnson for the period 1960 to 1980. This time series, which is taken from the website of Shumway and Stoffer’s book, can be downloaded from the Section Website. Below, this time series will be used to illustrate some of today’s concepts using R.
The decimal point is at the |
0 | 466667778888899900022333445566899
2 | 123334784667
4 | 03349003688
6 | 0146899778
8 | 379955
10 | 0369
12 | 120
14 | 078
16 | 02
Figure 3: Stem-and-leaf plot of the quarterly earnings per Johnson & Johnson share.

> stem(jj.ts)
Figure 4: Quarterly earnings per Johnson & Johnson share. Left: Boxplot. Right: Histogram.

> par(mfrow = c(1,2))
> boxplot(jj.ts, ylab = "Earnings per Share", main = "Boxplot")
> hist(jj.ts, xlab = "Earnings per Share", main = "Histogram", prob = TRUE)
As these three plots illustrate, the dataset is skewed to the right. We want to apply a transformation to make the distribution of this data more symmetric. As the data has a large right tail, we can use a transformation from Tukey’s Ladder of Powers, choosing p < 1. Let’s choose p = 0, i.e. the logarithmic transformation.
Figure 5: Boxplot and histogram of the logarithmic transformation of the quarterly earnings per Johnson & Johnson share.

> par(mfrow = c(1,2))
> boxplot(log(jj.ts), ylab = "log(Earnings per Share)", main = "Boxplot")
> hist(log(jj.ts), xlab = "log(Earnings per Share)", main = "Histogram", prob = TRUE)
Figure 6: Normal QQ-plots. Left: Original data. Right: After log-transformation.

> par(mfrow = c(1,2))
> qqnorm(jj.ts, main = "Normal qq-plot, earnings per J&J-share")
> qqline(jj.ts)
> qqnorm(log(jj.ts), main = "Normal qq-plot, log(earnings per J&J-share)")
> qqline(log(jj.ts))
Figure 7: Time series plots. Left: Original data. Right: After log-transformation.

> par(mfrow = c(1,2))
> plot(jj.ts, ylab = "Earnings per Share", main = "Johnson & Johnson")
> plot(log(jj.ts), ylab = "log(Earnings per Share)", main = "Johnson & Johnson")
Note that the logarithmic transformation not only symmetrizes the data, it also makes the initial exponential trend in the time series more or less linear.
7 Bibliography
This handout is based on handouts prepared by Irma Hernandez-Magallanes, a previous GSI for this course. Additional sources that were used, and that could be useful for you:
• “Exploratory Data Analysis” by John W. Tukey
• “Applications, Basics and Computing of Exploratory Data Analysis” by Paul F. Velleman and David C. Hoaglin
• “Modern Applied Statistics with S” by W.N. Venables and B.D. Ripley
• “Time Series Analysis and Its Applications: With R Examples” by Robert Shumway & David Stoffer