outline - r-project.org

Letter Value BoxplotHeike Hofmann, Karen Kafadar, Hadley Wickham

IOWA STATE UNIVERSITY

Outline

• Boxplots: Definition, Strengths & Weaknesses

• Letter Value Statistics

• Letter Value Boxplots

• Examples

• Conclusion

Boxplots

• Early Version: Tukey 1972 (Snedecor Festzeitschrift, at Iowa State University)

• Most common version in EDA (1977):

• Median (Center Line), Fourths (Box Edges), adjacent values (ends of whiskers) and extreme values

• All marks correspond to actual data values

0 1 2 3 4 5 6 7

Boxplot: Strengths

• Quick summary without overwhelming amount of detail

• Approximate location, spread, shape of distribution

• Outlier identification

• Associations among variables

0 1 2 3 4 5 6 7

Boxplots: Weaknesses

• Expected rate of labeled outliers approx 0.4+ 0.007n

• For n = 100000 expect approx. 700 outliers!

Exponential Distribution,n= 100, 1000, 10000, 100000 0

12

34

~

5

02

46

02

46

8

~~

02

46

810

12

Modifications

• Notched box-and-whisker (McGill, Larsen, Tukey 1987)

• Nonparametric density estimates

• Vase plots (Benjamini, 1988)

• Violin plots (Hintze, Nelson 1998)

• Box-percentile plots (Esty, Banfield 2003)Implementations: S routines (David James), package vioplot (Adler, Romain), package HMisc bpplot (Harre#, Banfield), examples at R Graph Ga#ery

Letter Value Statistics

• Estimate quantiles corresponding to tail areas 2-j

• Median (1/2): depth =

• Fourths (1/4): depth =

• Eights (1/8): depth =

• Boxplots show median, fourths

• Large Data Sets: tail quantiles become more reliable include LVs beyond Fourths

dM = (1 + n)/2

dF = (1 + !dM")/2

dE = (1 + !dF ")/2

k = !log2 n" − 4

k =⌈log2 n − log2

(4z2

1−α/2

)⌉+ 1

1

Letter Value Boxplot

• How many boxes to show?

• Outlier identification?

• All marks are based on actual data values

x

-3 -2 -1 0 1 2 3

LVboxplot(rnorm(1000))

Stopping Rules & Outliers

• EDA: 5-8 outliers

• Percentage of data, e.g. 0.5-1%

• uncertainty in LVi extends beyond or into LVi-1 (i.e. upper limit for LVi crosses LVi-1)

dM = (1 + n)/2

dF = (1 + !dM")/2

dE = (1 + !dF ")/2

k = !log2 n" − 4


(4z2

1−α/2

)⌉+ 1

1

dM = (1 + n)/2

dF = (1 + !dM")/2

dE = (1 + !dF ")/2

k = !log2 n" − 4


(4z2

1−α/2

)⌉+ 1

1

Rules lead to similar answers

... Examples

Gaussian, Exponential & Normal

~/Documents/students/Maneesha/CEL files~/Documents/students/Maneesha/CEL files

Gaussian, n=10000

x

-3 -2 -1 0 1 2 3 -3 -2 -1 0

~/Documents/students/Maneesha/CEL files

1 2 3

Exponential, n=10000

x

0 2 4 6 8 0 2 4 6 8

Uniform, n=10000

x

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

Gene Expression Values


T1-1

x

68

10

12

14

T1-2

x

68

10

12

14

T1-3

x

68

10

12

14


T2-1

x

68

10

12

14

T2-2

x

68

10

12

14

T2-3

x

68

10

12

14

WT-1

x

68

10

12

14

WT-2

x

68

10

12

14

WT-3

x

68

10

12

14

Conclusion

• appropriate for large number of values

• based on actual data values

• simple to compute

• reduce number of labeled outliers shown in conventional boxplots

• do not depend on a smoothing parameter

Letter Value Boxplots are

Download (for now) at http://www.public.iastate.edu/~hofmann

Graphical Displays of Large Data Sets

• Quick summary without overwhelming amount of detail

• Approximate location, spread, shape of distribution

• Outlier identification

• Associations among variables

“The greatest value of a picture us when it forces us to notice what we never expected to see” (Tukey 1977)