STAT 248: Exploratory Data Analysis
Handout 2
GSI: Gido van de Ven
September 10th, 2010
1 Introduction
Today’s section will focus on Exploratory Data Analysis (EDA), a set of techniques developed to make a dataset more easily and effectively handleable by human minds. This handout first discusses some fundamental issues in EDA, and then focuses on some basic data visualization techniques: the histogram, the stem-and-leaf plot, the boxplot and probability plots (QQ-plots).
During section, we won’t go over all the concepts mentioned in this handout, as it is meant more as a reference. We will start the section with a discussion of the goal of Statistics and where EDA fits in. After that we will go over some examples in R.
2 Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach to data analysis originally developed by Tukey. Four major ingredients of EDA stand out:
1. Displays visually reveal the behaviour of the data and the structure of the analyses;
2. Residuals focus attention on what remains of the data after some analysis;
3. Re-expressions by means of simple mathematical functions, such as the logarithm and the square root, help to simplify behaviour and clarify analyses; and
4. Resistance ensures that a few extraordinary data values do not unduly influence the results of an analysis.
The seminal work in EDA is “Exploratory Data Analysis” (Tukey, 1977).
3 Basic Concepts
Depths: The depth of a data value in an ordered batch describes how far that value is from the nearer end of the batch. We define the depth of each data value as its position in an enumeration of values that starts at the nearer end of the batch. Each extreme value is the first value in its enumeration and therefore has depth 1; the second largest and second smallest values each have depth 2; and so on. In general, in a batch of size n, two data values have depth i: the ith and the (n + 1 − i)th.
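As a quick illustration, the depth of every value in a batch can be computed in R from its rank (the function name depth below is ours, not a built-in):

```r
# Depth of each value in a batch: its position counted from the nearer
# end of the ordered batch, so both extremes get depth 1.
depth <- function(x) {
  n <- length(x)
  r <- rank(x)          # position counted from the low end
  pmin(r, n + 1 - r)    # take whichever end is nearer
}

depth(c(47, 103, 130, 192, 221))  # 1 2 3 2 1
```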
Median: If n is odd, there is a deepest data value, one as far from either end of the ordered batch as from the other. This is the median (M), and it marks the middle of the batch: exactly half of the remaining n − 1 numbers in the batch are less than or equal to it, and exactly half are greater than or equal to it. The depth of the median, denoted by d(M), is calculated as:
d(M) = (n + 1)/2
If n is even, d(M) will have a fractional part equal to a half. In that case, the usual convention is to use the average of the two observations with depth just above and just below d(M) as the median M.
Hinges: The median splits an ordered batch in half. We might naturally ask next about the middle of each of the halves. The hinges (H) are the summary values in the middle of each half of the data. They are denoted by LH (lower hinge) and UH (upper hinge), and are about a quarter of the way in from each end of the ordered batch. Each hinge is at depth d(H):
d(H) = ([d(M)] + 1)/2
where [x] denotes the largest integer less than or equal to x (i.e. if d(M) contains a 1/2, you drop it).
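The two depth formulas translate directly into a small R function; a sketch (the helper name hinges is ours), whose three values should match the middle three values returned by fivenum():

```r
# Tukey hinges from d(M) = (n + 1)/2 and d(H) = ([d(M)] + 1)/2.
hinges <- function(x) {
  x <- sort(x)
  n <- length(x)
  dM <- (n + 1) / 2
  dH <- (floor(dM) + 1) / 2
  # a depth ending in .5 means averaging the two neighbouring order statistics
  at <- function(d) (x[floor(d)] + x[ceiling(d)]) / 2
  c(LH = at(dH), M = at(dM), UH = at(n + 1 - dH))
}

hinges(c(1, 2, 3, 4, 5, 6, 7, 8))  # LH = 2.5, M = 4.5, UH = 6.5
```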
Quartiles: The hinges are closely related to the quartiles, which are defined so that one quarter of the data lies below the lower quartile and one quarter of the data lies above the upper quartile. The main difference between hinges and quartiles is that the depth of the hinges is calculated from the depth of the median, with the result that the hinges often lie closer to the median than the quartiles do. Note that the hinges equal the quartiles for odd n, and differ slightly for even n.
5-number summary: Introduced by Tukey, this summary consists of the median, the two hinges and the two extreme values.
Spread: The spread of a dataset reflects the variability of the data. One value that tells us about the spread is the interquartile range (IQR, also called the H-spread), which is the distance between the two hinges (i.e. IQR = UH − LH). Another measure of the variability in the data is the range, which is simply the largest value minus the smallest.
Outliers: Let’s first define the inner fences (IF) and the outer fences (OF) of a batch of data:
InnerFences = LH − 1.5 ∗ (H-spread) and UH + 1.5 ∗ (H-spread)
OuterFences = LH − 3 ∗ (H-spread) and UH + 3 ∗ (H-spread)
Any data value beyond either inner fence is termed outside (an outlier), and any value beyond either outer fence is called far outside (an extreme outlier).
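These fence definitions can be sketched in R as follows (flag_outliers is a name we made up; fivenum() returns the extremes, hinges and median):

```r
# Classify values as outside (beyond an inner fence) or far outside
# (beyond an outer fence), using the hinges from fivenum().
flag_outliers <- function(x) {
  f <- fivenum(x)              # min, LH, median, UH, max
  LH <- f[2]; UH <- f[4]
  spread <- UH - LH            # the H-spread
  list(outside     = x[x < LH - 1.5 * spread | x > UH + 1.5 * spread],
       far_outside = x[x < LH - 3 * spread | x > UH + 3 * spread])
}

flag_outliers(c(1:10, 50))  # 50 lies beyond both fences
```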
Boxplot: A boxplot is a box, marked off with solid lines, running from hinge to hinge, with the median shown as a solid line in the middle. A dashed whisker runs from the box to the value on each end that is still not beyond the corresponding inner fence (these two values are known as the adjacent values). Outliers are shown individually, and if possible they are labelled.
Histogram: A histogram can be used to illustrate the shape, or the distribution, of data. Histograms consist of side-by-side bars (also called bins), where each data value is represented by an equal amount of area in its bar. The height of the bins represents either counts or proportions. In the latter case the histogram is said to be a “true” histogram, or to be in “density scale”. Histograms can be used to immediately judge whether the collection of data is approximately symmetric or skewed. Also, we can see whether the histogram has one single peak — then the data is said to be unimodal — or multiple peaks — a multimodal distribution.
Stem-and-leaf Plot: A variation on the histogram is the stem-and-leaf plot. In this plot, a certain number of the digits at the beginning of each data value serve as the basis for sorting, and the next digit appears in the display. Note that in a stem-and-leaf plot, apart perhaps from the original order of the data values, no information is lost.
3.1 R functions
median(), fivenum(), quantile(), IQR(), range(), boxplot(), hist(), stem()
3.2 Example
The following example, to illustrate the above concepts, is taken from Velleman and Hoaglin’s book. New Jersey has 21 different counties; sorted into increasing order, the areas in square miles of these counties are:
47, 103, 130, 192, 221, 228,
234, 267, 307, 312, 329, 362,
365, 423, 468, 476, 500, 527,
569, 642, 819.
Applying the definitions laid out above gives:
n = 21        d(M) = (n + 1)/2 = 11        median = 329
d(H) = ([d(M)] + 1)/2 = 6        LH = 228        UH = 476
IQR = H-spread = UH − LH = 248
IF_down = LH − 1.5 ∗ IQR = 228 − 1.5 ∗ 248 = −144
IF_up = UH + 1.5 ∗ IQR = 476 + 1.5 ∗ 248 = 848
OF_down = LH − 3 ∗ IQR = 228 − 3 ∗ 248 = −516
OF_up = UH + 3 ∗ IQR = 476 + 3 ∗ 248 = 1220
Adjacent values = {47, 819}
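These hand computations are easy to verify in R, assuming the 21 county areas are stored in a vector called New.Jersey (as in the code below):

```r
New.Jersey <- c(47, 103, 130, 192, 221, 228, 234, 267, 307, 312, 329,
                362, 365, 423, 468, 476, 500, 527, 569, 642, 819)

fivenum(New.Jersey)  # 47 228 329 476 819 (extremes, hinges, median)
median(New.Jersey)   # 329
IQR(New.Jersey)      # 248
```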
Figure 1: New Jersey Counties. Top: Boxplot. Bottom: Histogram and kernel density plot.
> par(mfrow = c(2,1))
> boxplot(New.Jersey, horizontal = TRUE,
+         main = "Boxplot: area of the New Jersey counties",
+         xlab = "Area (in square miles)")
> hist(New.Jersey, prob = TRUE, col = "red",
+      main = "Histogram: area of the New Jersey counties",
+      xlab = "Area (in square miles)")
> lines(density(New.Jersey))
4 Probability Plots
Probability plots are an extremely useful graphical tool for qualitatively assessing the fit of data to atheoretical distribution.
Consider a sample of size n from a uniform distribution U[0, 1]. Denote the ordered sample values by X_(1) < X_(2) < ... < X_(n). These values are order statistics. It can be shown that:

E(X_(j)) = j/(n + 1)

This suggests plotting the ordered observations X_(1), ..., X_(n) against their expected values 1/(n + 1), ..., n/(n + 1).
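This expectation is easy to check by simulation; a small sketch:

```r
# Simulate many ordered U[0,1] samples and compare the average of each
# order statistic with j/(n + 1).
set.seed(1)
n <- 9
ord <- replicate(10000, sort(runif(n)))  # each column is one ordered sample
round(rowMeans(ord), 2)                  # close to 0.1, 0.2, ..., 0.9
```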
If the underlying distribution is uniform, the plot should look roughly linear. Now suppose you have the hypothesis that the observations follow a certain distribution F different from the uniform distribution.
Proposition: Let X be a continuous random variable with a strictly increasing cumulative distribution function F_X. If Y = F_X(X), then Y has a uniform distribution on [0, 1].
Given a sample X_1, ..., X_n we can plot the ordered observations (which may be viewed as the observed or empirical quantiles) versus the quantiles of the theoretical distribution:

X_(k) vs F^(−1)(k/(n + 1))

where F^(−1)(k/(n + 1)) is the k/(n + 1) quantile of the distribution F; that is, it is the point such that the probability that a random variable with distribution F is less than that point is k/(n + 1). If the underlying distribution is the one that we hypothesized, the plot should look roughly linear.
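This recipe can be carried out by hand in R for, say, the normal distribution (note that the built-in qqnorm() uses a slightly different plotting-position formula, via ppoints(), so the two plots agree only approximately):

```r
# Hand-made normal QQ-plot: ordered data versus F^(-1)(k/(n + 1)).
set.seed(2)
x <- rnorm(100, mean = 5, sd = 2)
n <- length(x)
theoretical <- qnorm((1:n) / (n + 1))   # theoretical standard normal quantiles
plot(theoretical, sort(x),
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
abline(lm(sort(x) ~ theoretical))       # roughly linear if x is normal
```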
Figure 2: Left: QQ-plot of the areas of the New Jersey counties. Right: QQ-plot of simulated Uniform[0,1] data.

> qqnorm(New.Jersey)
> qqline(New.Jersey)
> U = runif(1000)
> theoretical = seq(from = 1/1001, to = 1000/1001, length.out = 1000)
> qqplot(theoretical, U, ylab = "Sample Quantiles",
+        xlab = "Theoretical Quantiles", main = "QQ-plot against Uniform-distr")
> abline(lm(U[order(U)] ~ theoretical))
4.1 Backup
Suppose that F is the cdf of a continuous random variable, that F is strictly increasing on some interval I (which may be unbounded), and that F = 0 to the left of I and F = 1 to the right of I. Under this assumption the inverse function F^(−1) is well defined: x = F^(−1)(y) if y = F(x). The pth quantile of the distribution F is defined to be the value x_p such that F(x_p) = p, or P(X ≤ x_p) = p. Under the preceding assumptions, x_p is uniquely defined as x_p = F^(−1)(p). Special cases are p = 1/2, which corresponds to the median of F, and p = 1/4 and p = 3/4, which correspond to the lower and upper quartiles of F.
Proposition: Let X be a random variable with a cdf F that satisfies the above assumptions, and let Z = F(X). Then Z has a uniform distribution on [0, 1].
Proof: P (Z ≤ z) = P (F (X) ≤ z) = P (X ≤ F−1(z)) = F (F−1(z)) = z
Proposition: Let U be uniform on [0, 1] and let X = F−1(U). Then the cdf of X is F.
Proof: P (X ≤ x) = P (F−1(U) ≤ x) = P (U ≤ F (x)) = F (x)
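Both propositions can be illustrated by simulation, for example with the exponential distribution (pexp() and qexp() play the roles of F and F^(−1) here):

```r
# If X ~ F then F(X) is uniform on [0,1]; if U is uniform then F^(-1)(U) ~ F.
set.seed(3)
X <- rexp(10000, rate = 2)
Z <- pexp(X, rate = 2)    # proposition 1: Z should look like U[0,1]
U <- runif(10000)
Y <- qexp(U, rate = 2)    # proposition 2: Y should look like Exp(2)
c(mean(Z), mean(Y))       # both close to 1/2 (the mean of U[0,1] and of Exp(2))
```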
5 Re-expression of the Data
Usually, it is preferable if the distribution of your data is symmetric. For example, if you have heavily skewed data, a histogram or a boxplot won’t be very informative. And not only for visualization: many models or parametric tests assume that the data are (at least approximately) normally distributed, and if your data is heavily skewed you can be sure this assumption is violated. Hence, it is sometimes better to transform your data prior to analyzing it. Note that linear transformations (of the form x → a + bx) do not change the shape of the distribution, so for this purpose only non-linear transformations are useful.

An important class of such transformations is Tukey’s Ladder of Powers, given by y → y^p, for some power p. This class of transformations changes the distribution of the data, but does not affect the ordering of the data points. If you use p < 1, the upper tail of the distribution will be pulled in and the lower tail stretched out; a value p > 1 has the reverse effect. (Note that for p = 1 the transformed data equal the original data.) The farther the value of p is from one, the stronger this effect. A special case is p = 0. Usually y^0 is defined to be 1, but it would be useless to transform all data values to 1. However, it turns out that when the powers p are ordered according to the strength of their reshaping effect, the logarithm falls naturally in the place of the zero power (i.e. y → log(y)). For p < 0, the re-expression in the Ladder of Powers is given by y → −y^p, where the extra minus sign is added in order to preserve the original ordering of the data.
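The whole ladder fits in one small R function; a sketch (the name ladder is ours):

```r
# Tukey's Ladder of Powers: y^p for p > 0, log(y) for p = 0,
# and -(y^p) for p < 0 (the minus sign preserves the ordering).
ladder <- function(y, p) {
  if (p > 0) y^p
  else if (p == 0) log(y)
  else -(y^p)
}

y <- c(1, 10, 100)
ladder(y, 0.5)  # square root: pulls in the upper tail
ladder(y, 0)    # logarithm
ladder(y, -1)   # -1/y: still increasing in y
```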
The three main objectives for transforming data are:
• Normalize the distribution: Many statistical models and parametric tests assume that the data is (approximately) normal. Common transformations to bring data closer to normality are the logarithm transformation and the inverse probability transformation1.
• Stabilize the variances: Besides normality, many parametric tests also require equal variances of the data points. A typical example of a variance-stabilizing transformation is the square root transformation: y → √y.
• Linearize the trend: Regression analysis techniques require the assumption of linearity. For non-linear data there exist non-linear regression approaches, but in many cases it is easier to apply a linearizing transformation. A typical example of such a re-expression is the logarithmic transformation.
1This method is based on the fact that if X is distributed according to cdf F (with F−1(.) well defined), then Y = Φ−1(F(X)) has a standard normal distribution. (Note that Φ(.) is the cdf of the standard normal distribution.)
6 Example: Exploratory Data Analysis
Last week’s handout contained an exercise about the dataset ”jj.dat”, which contains the quarterly earnings per share of the company Johnson and Johnson for the period 1960 to 1980. This time series, which is taken from the website of Shumway and Stoffer’s book, can be downloaded from the Section Website. Below, this time series will be used to illustrate some of today’s concepts using R.
The decimal point is at the |
0 | 466667778888899900022333445566899
2 | 123334784667
4 | 03349003688
6 | 0146899778
8 | 379955
10 | 0369
12 | 120
14 | 078
16 | 02
Figure 3: Stem-and-leaf plot of the quarterly earnings per Johnson & Johnson share.

> stem(jj.ts)
Figure 4: Quarterly earnings per Johnson & Johnson share. Left: Boxplot. Right: Histogram.

> par(mfrow = c(1,2))
> boxplot(jj.ts, ylab = "Earnings per Share", main = "Boxplot")
> hist(jj.ts, xlab = "Earnings per Share", main = "Histogram", prob = TRUE)
As these three plots illustrate, the dataset is skewed to the right. We want to apply a transformation to make the distribution of this data more symmetric. As the data has a large right tail, we can use a transformation from Tukey’s Ladder of Powers, choosing p < 1. Let’s choose p = 0, i.e. the logarithmic transformation.
Figure 5: Boxplot and histogram of the logarithmic transformation of the quarterly earnings per Johnson & Johnson share.

> par(mfrow = c(1,2))
> boxplot(log(jj.ts), ylab = "log(Earnings per Share)", main = "Boxplot")
> hist(log(jj.ts), xlab = "log(Earnings per Share)", main = "Histogram", prob = TRUE)
Figure 6: Normal QQ-plots. Left: Original data. Right: After log-transformation.

> par(mfrow = c(1,2))
> qqnorm(jj.ts, main = "Normal qq-plot, earnings per J&J-share")
> qqline(jj.ts)
> qqnorm(log(jj.ts), main = "Normal qq-plot, log(earnings per J&J-share)")
> qqline(log(jj.ts))
Figure 7: Time series plots. Left: Original data. Right: After log-transformation.

> par(mfrow = c(1,2))
> plot(jj.ts, ylab = "Earnings per Share", main = "Johnson & Johnson")
> plot(log(jj.ts), ylab = "log(Earnings per Share)", main = "Johnson & Johnson")
Note that the logarithmic transformation not only symmetrizes the data, it also makes the initial exponential trend in the time series more or less linear.
7 Bibliography
This handout is based on handouts prepared by Irma Hernandez-Magallanes, a previous GSI for this course. Additional sources that were used, and that could be useful for you:
• “Exploratory Data Analysis” by John W. Tukey
• “Applications, Basics and Computing of Exploratory Data Analysis” by Paul F. Velleman and David C. Hoaglin
• “Modern Applied Statistics with S” by W.N. Venables and B.D. Ripley
• “Time Series Analysis and Its Applications: With R Examples” by Robert Shumway & David Stoffer