exploratory data analysis (wilks ch. 3) -...



Page 1: Exploratory Data Analysis (Wilks Ch. 3) - UMDekalnay/syllabi/AOSC630/AOSC630Class2DebraBaker.p… · 15 Cov > 0: If one variable rises so does the other. Cov < 0: If one variable

Debra Baker

AOSC 630: Class #2

January 30, 2008

Exploratory Data Analysis (Wilks Ch. 3)

From: http://pubs.usgs.gov/wri/wri034126/htdocs/wrir03-4126.html

Robustness

Numerical Summaries

Graphical Summaries

Correlation

Higher-Dimensional Data

Common assumptions: “normal” distribution

Robust: performs reasonably well for most types of data

Resistant: not unduly influenced by a small number of outliers

A good analysis method is insensitive to the assumptions about the data set.

From: http://www.chemicool.com/definition/gaussian_distribution.html

Location: the central tendency of the data set.

Spread: the dispersion of the data set around a central value.

Symmetry: how the data is distributed about the central value.

There are three key features used to numerically describe a data set.

From: http://www.physics.csbsju.edu/stats/display.distribution.html

Mean: the average of all data points

Median: the center value in an ordered data set

Mode: the most frequently occurring value

Which of these measures are robust?

Which of these measures are resistant?

The first common numerical summary of a data set is a measure of its location.

From: http://billkosloskymd.typepad.com/lexicillin_qd/2007/09/mean-vs-median-.html
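
One way to explore the resistance question above is to compute the three location measures on a small sample with and without a single extreme value. The sketch below uses Python's standard `statistics` module; the sample values and the helper name `location_summary` are hypothetical and chosen only for illustration.

```python
from statistics import mean, median, mode

# Hypothetical samples: identical except for one extreme value.
clean = [2, 3, 3, 4, 5]
with_outlier = [2, 3, 3, 4, 50]

def location_summary(data):
    """Return (mean, median, mode) for a list of numbers."""
    return mean(data), median(data), mode(data)

print(location_summary(clean))         # mean 3.4, median 3, mode 3
print(location_summary(with_outlier))  # mean 12.4, median 3, mode 3
```

In this sample the single outlier moves the mean from 3.4 to 12.4 while the median and mode stay at 3, which suggests that the mean is not resistant.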

First quartile: the middle of the data between the median and minimum.

Third quartile: the middle of the data between the median and maximum.

Are quartiles robust and resistant?

Quartiles are an example of a quantile, which can be based on any divisor (e.g., 10%).

Quartiles divide the data set into four equal parts to describe its distribution.

From: http://www.physics.csbsju.edu/stats/display.distribution.html
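
A minimal sketch of a quantile calculation follows. It assumes linear interpolation between order statistics, which is only one of several conventions in use, so other software may return slightly different values. The function name `quantile` and the sample data are hypothetical.

```python
def quantile(data, p):
    """Sample quantile for 0 <= p <= 1, by linear interpolation
    between the two nearest order statistics (one common convention)."""
    xs = sorted(data)
    pos = p * (len(xs) - 1)          # fractional index into the sorted data
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] + frac * (xs[hi] - xs[lo])

data = [9, 1, 8, 2, 7, 3, 6, 4, 5]   # hypothetical sample

q25 = quantile(data, 0.25)   # first quartile
q50 = quantile(data, 0.50)   # median
q75 = quantile(data, 0.75)   # third quartile
q10 = quantile(data, 0.10)   # a decile, to show that any divisor works
print(q25, q50, q75, q10)
```

For this nine-point sample the quartiles land on data values (3, 5 and 7), while the 10% quantile is interpolated between the two smallest values.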

Standard Deviation: the square root of the averaged squared distance between data points and the mean.

Interquartile Range: specifies the range of the center 50% of the data.

Are these measures robust?

Are these measures resistant?

The second common numerical summary of a data set is a measure of its spread.

$$ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2} $$

$$ \mathrm{IQR} = q_{0.75} - q_{0.25} $$

Equations 3.5 and 3.6 from Wilks (2006), pp. 26-27.
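
The two spread measures can be compared on a sample with and without one extreme value. This is a sketch only: the quantile helper assumes linear interpolation between order statistics (an assumption, not necessarily the convention used in Wilks), and all names and data are hypothetical.

```python
import math

def sample_std(data):
    """Sample standard deviation with the n - 1 divisor."""
    n = len(data)
    xbar = sum(data) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))

def quantile(data, p):
    """Quantile by linear interpolation between order statistics."""
    xs = sorted(data)
    pos = p * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def iqr(data):
    """Interquartile range: q0.75 minus q0.25."""
    return quantile(data, 0.75) - quantile(data, 0.25)

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]          # hypothetical sample
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]  # same sample, one extreme value

print(sample_std(clean), iqr(clean))
print(sample_std(with_outlier), iqr(with_outlier))
```

In this example the IQR is 4 for both samples, while the standard deviation grows by roughly an order of magnitude once the outlier is included.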

Positive Skewness: distribution has a long right tail.

Negative Skewness: distribution has a long left tail.

Positive Kurtosis: distribution has a tall, narrow peak.

Negative Kurtosis: distribution has a flat, low peak.

The third common numerical summary of a data set is a measure of its symmetry.

From: http://mvpprograms.com/help/mvpstats/distributions/SkewnessKurtosis

Skewness Coefficient: a moments-based measure of symmetry.

Yule-Kendall Index: compares the distance between the median and each of the two quartiles.

Are these measures robust?

Are these measures resistant?

There are two important measures of skewness.

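$$ \gamma_{YK} = \frac{\left( q_{0.75} - q_{0.5} \right) - \left( q_{0.5} - q_{0.25} \right)}{\mathrm{IQR}} $$

$$ \gamma = \frac{\frac{1}{n-1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^3}{s^3} $$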

Equations 3.9 and 3.10 from Wilks (2006), p. 28.
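
Both skewness measures can be sketched in a few lines. The quantile helper again assumes linear interpolation between order statistics, and the function names and sample values are hypothetical.

```python
import math

def quantile(data, p):
    """Quantile by linear interpolation between order statistics."""
    xs = sorted(data)
    pos = p * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def skewness_coefficient(data):
    """Moments-based skewness: averaged cubed deviations over s cubed."""
    n = len(data)
    xbar = sum(data) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
    third_moment = sum((x - xbar) ** 3 for x in data) / (n - 1)
    return third_moment / s ** 3

def yule_kendall(data):
    """Yule-Kendall index: quartile distances from the median, over the IQR."""
    q25, q50, q75 = (quantile(data, p) for p in (0.25, 0.5, 0.75))
    return ((q75 - q50) - (q50 - q25)) / (q75 - q25)

symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]      # hypothetical symmetric sample
right_skewed = [1, 1, 2, 2, 3, 4, 6, 9, 20]  # hypothetical long right tail

print(skewness_coefficient(symmetric), yule_kendall(symmetric))
print(skewness_coefficient(right_skewed), yule_kendall(right_skewed))
```

Both measures are zero for the symmetric sample and positive for the right-skewed one. Because the Yule-Kendall index uses only quartiles, changing the largest value (20) to something even larger would leave it unchanged, whereas the skewness coefficient would grow.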

We will look specifically at boxplots, histograms, and cumulative frequency distributions.

The characteristics of a data set are best conveyed using graphical summary techniques.

From: http://learnsigma.com/statistical-software-is-not-six-sigma/.

This type of boxplot is called a schematic plot, which highlights unusual extreme values.

The boxplot is a simple graphic based on quartile calculations.

From: http://www.scc.ms.unimelb.edu.au/whatisstatistics/weather.html
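
The numbers behind a boxplot and a schematic plot can be sketched without any plotting library. The example assumes the common convention of placing fences 1.5 × IQR beyond the quartiles and flagging points outside them; that multiplier, the function names, and the data are assumptions made for illustration.

```python
def quantile(data, p):
    """Quantile by linear interpolation between order statistics."""
    xs = sorted(data)
    pos = p * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def schematic_summary(data, k=1.5):
    """Quartiles, whisker ends and flagged points for a schematic plot.
    k is the fence multiplier (1.5 is assumed here)."""
    xs = sorted(data)
    q25, q50, q75 = (quantile(xs, p) for p in (0.25, 0.5, 0.75))
    spread = q75 - q25
    lower_fence = q25 - k * spread
    upper_fence = q75 + k * spread
    inside = [x for x in xs if lower_fence <= x <= upper_fence]
    return {
        "q25": q25,
        "median": q50,
        "q75": q75,
        "whisker_low": inside[0],    # most extreme point inside the lower fence
        "whisker_high": inside[-1],  # most extreme point inside the upper fence
        "outliers": [x for x in xs if x < lower_fence or x > upper_fence],
    }

summary = schematic_summary([1, 2, 3, 4, 5, 6, 7, 8, 30])  # hypothetical data
print(summary)
```

Here the box spans 3 to 7, the whiskers run from 1 to 8, and the value 30 is plotted separately as an unusual extreme.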

Unlike a boxplot, a histogram can show if the data is multimodal.

A histogram divides the data into bins and then charts the relative frequency distribution.

From: http://www.climateaudit.org/?p=1022
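
A rough sketch of the binning step behind a histogram is shown below. It assumes equal-width bins over a fixed range; the function name `histogram`, the bin edges, and the data are hypothetical.

```python
def histogram(data, lo, hi, n_bins):
    """Count values in n_bins equal-width bins spanning [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        if lo <= x <= hi:
            # A value equal to hi is placed in the last bin.
            idx = min(int((x - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts

data = [1, 2, 2, 3, 7, 8, 8, 9]            # hypothetical two-cluster sample
counts = histogram(data, 0, 10, 5)         # bins of width 2 from 0 to 10
relative = [c / len(data) for c in counts]

for i, (c, r) in enumerate(zip(counts, relative)):
    print(f"bin {i}: {'#' * c} ({r:.3f})")
```

The counts come out as 1, 3, 0, 1, 3: two peaks separated by an empty bin. A boxplot of the same data would show only the quartiles and would hide this bimodal shape.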

This graphic illustrates wheat yield from climate simulations changing the precipitation variance (PV).

The cumulative frequency distribution includes the number of points in all lower bins.

From: http://clima.casaccia.enea.it/staff/pona/Rep_Dec99/index.html
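
The accumulation over lower bins can be sketched with a running sum. The bin counts below are hypothetical, and the helper name `cumulative_frequency` is an assumption for illustration.

```python
from itertools import accumulate

def cumulative_frequency(counts):
    """Fraction of points in each bin plus all lower bins."""
    total = sum(counts)
    return [running / total for running in accumulate(counts)]

counts = [1, 3, 0, 1, 3]            # hypothetical histogram bin counts
cumulative = cumulative_frequency(counts)
print(cumulative)                   # [0.125, 0.5, 0.5, 0.625, 1.0]
```

The curve never decreases and always ends at 1, and a flat segment (here between the second and third bins) marks an empty bin.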

Scatterplots

Covariance

Pearson Correlation

Rank Correlation

Lag Correlation

Autocorrelation

For paired data sets, numerical and graphical summaries show the relationship of two variables.

From: Wilks (2006) p. 54

Trends: Does the paired data show a recognizable pattern?

Curvature: Do the points show a positive or negative slope?

Clustering: How close together are the points?

Spread: What is the range of the data?

Outliers: Are there any unusually extreme values?

A scatterplot puts the paired data on an x-y grid.

Cov > 0: If one variable rises, so does the other.

Cov < 0: If one variable rises, the other falls.

This figure shows the estimated covariance of surface ozone in the Eastern U.S. between the grid point indicated by the red dot and the rest of the grid points for one location.

The covariance measures to what extent paired variables vary together.

$$ \mathrm{Cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} \left[ \left( x_i - \bar{x} \right) \left( y_i - \bar{y} \right) \right] $$

From: http://www.image.ucar.edu/GSP/Projects/ResearchNuggets.shtml
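
A direct transcription of the sample covariance into code might look like the sketch below. The function name and the three short series are hypothetical.

```python
def covariance(x, y):
    """Sample covariance of two equal-length series (n - 1 divisor)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]            # hypothetical series
y_rising = [2, 4, 6, 8, 10]    # rises when x rises
y_falling = [10, 8, 6, 4, 2]   # falls when x rises

print(covariance(x, y_rising))    # 5.0
print(covariance(x, y_falling))   # -5.0
```

The sign matches the two cases listed above. The magnitude depends on the units of both variables, so it is hard to interpret on its own.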

r = 1: perfect positive linear association

r = 0: no linear association

r = -1: perfect negative linear association

This figure shows the Pearson correlation of the Southern Oscillation Index and sea surface temperatures six months later.

Is this measure robust and resistant?

The Pearson Correlation (i.e., standard correlation) is a product-moment coefficient of linear correlation.

$$ r_{xy} = \frac{\mathrm{Cov}(x, y)}{s_x s_y} $$

From: http://ingrid.ldeo.columbia.edu/dochelp/StatTutorial/Correlation/
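
The Pearson correlation follows from the covariance by dividing out the two standard deviations. The sketch below also changes one value in a perfectly linear series to probe the resistance question; all names and data are hypothetical.

```python
import math

def covariance(x, y):
    """Sample covariance of two equal-length series (n - 1 divisor)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

def pearson(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

x = [1, 2, 3, 4, 5]
y_linear = [2, 4, 6, 8, 10]     # hypothetical, exactly linear in x
y_outlier = [2, 4, 6, 8, -40]   # same series with one extreme value

print(pearson(x, y_linear))     # 1.0
print(pearson(x, y_outlier))    # negative
```

In this example a single extreme point is enough to turn a perfect positive correlation into a negative one, which suggests that the Pearson correlation is not resistant.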

0.88

What is the Pearson Correlation Coefficient for thesample paired data?

0.61

Will the correlation for each be positive or negative?

Which set will have the higher magnitude correlation?

Is the correlation meaningful?

The x and y values are ordered separately.

The ordering is done from smallest to largest.

The value’s rank number is then substituted for the value in the pairing.

Is this measure robust and resistant?

Does beetle mating correlate with temperature? r_xy = 0.83 and r_rank = 0.11

The Spearman Rank Correlation is the Pearson correlation of the rank of the data, not its value.

$$ r_{rank} = 1 - \frac{6 \sum_{i=1}^{n} D_i^2}{n \left( n^2 - 1 \right)} $$

From: Lauren Silbert’s Penn State dissertation: The Aroid-Scarab Mutualism: Importance of Floral Temperature for Scarab Attraction and Copulation, available at http://www.pennscience.org/?view=past&year=2003&issue=1/2/&article=28
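
The ranking steps and the rank-difference formula can be sketched as follows, where D_i is taken to be the difference between the ranks of the i-th pair. The sketch assumes no tied values (ties would need averaged ranks, which are not handled here); the function names and data are hypothetical.

```python
def ranks(values):
    """Rank each value from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation from the squared rank differences."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

x = [1, 2, 3, 4, 5]
y_cubic = [1, 8, 27, 64, 125]   # hypothetical: monotonic in x but not linear
y_outlier = [1, 2, 3, 4, -20]   # hypothetical: linear except one extreme value

print(spearman(x, y_cubic))     # 1.0
print(spearman(x, y_outlier))   # 0.0
```

Any monotonic relationship gives a rank correlation of 1, even when it is far from linear. An extreme value only changes a rank, not a magnitude, so its influence on the result is bounded.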

1.00

What is the Spearman Rank Correlation Coefficient for the sample paired data?

0.018

Will the rank correlation be similar to the Pearson correlation?

Since Set I has a monotonic relationship between x and y, what will its rank correlation be?

It is useful when one variable has a delayed effect on the other variable.

The persistence of a single variable over time can be measured by lag correlation.

This relationship of a variable with itself separated by a time interval is called autocorrelation.

Lag correlation measures the association of x and y when they are separated by a time interval.

Contours every 0.1, red indicates correlations > 0.1, blue indicates correlations < -0.1.

From: http://ugamp.nerc.ac.uk/hot/um/mjo.html

$$ r_k = \rho \left( x_t, y_{t-k} \right) $$
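
A minimal sketch of lag correlation follows. It assumes the convention of pairing x at time t with y at time t - k and using only the overlapping part of the two series; the opposite sign convention is also common. The function names and the series are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length series."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def lag_correlation(x, y, k):
    """Correlate x[t] with y[t - k] for k >= 0 (assumed convention)."""
    return pearson(x[k:], y[:len(y) - k])

def autocorrelation(x, k):
    """Lag correlation of a series with itself."""
    return lag_correlation(x, x, k)

# Hypothetical driver series y, and a response x that copies y two steps later.
y = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
x = [0, 0] + y[:-2]

for k in range(4):
    print(k, round(lag_correlation(x, y, k), 3))
```

The correlation peaks at k = 2, the delay built into the example, which is how a lag correlation can reveal a delayed effect of one variable on another.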

Covariance Matrix

Correlation Matrix

Correlation Maps

The relationship between multiple variables can be evaluated using more advanced techniques.

From: http://www.dfanning.com/tips/scatter3d.html

The covariance matrix is always a square matrix.

It has one row/column for each variable.

The variance for each variable is on the diagonal.

The matrix is symmetric about the diagonal.

If all variables are of unit variance, then the covariance matrix is the same as the correlation matrix.

From: http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/Covariance_Matrix_popup.htm

The covariance matrix is a collection of many covariances in a d x d matrix.
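
The properties listed above can be checked on a small example. This sketch builds the matrix as a list of lists from d series; the variable names and values are hypothetical.

```python
def covariance(x, y):
    """Sample covariance of two equal-length series (n - 1 divisor)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)

def covariance_matrix(variables):
    """Build the d x d covariance matrix from a list of d series."""
    return [[covariance(a, b) for b in variables] for a in variables]

# Hypothetical observations of d = 3 variables at the same 5 times.
temperature = [1, 2, 3, 4, 5]
pressure = [2, 4, 6, 8, 10]
wind = [5, 3, 4, 1, 2]

S = covariance_matrix([temperature, pressure, wind])
for row in S:
    print(row)
```

The result has one row and one column per variable, the variances (2.5, 10.0, 2.5) sit on the diagonal, and each off-diagonal entry equals its mirror image across the diagonal.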

The main diagonal is the correlation of the variable with itself so it is always unity.

Many methods of multivariate analysis depend on it, such as principal component analysis.

This figure is an example of a correlation matrix among 20 climate model biases at certain spatial locations.

A correlation matrix simultaneously displays correlations among matched multivariate data.

From: http://www.image.ucar.edu/GSP/Projects/ResearchNuggets.shtml
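
A correlation matrix can be obtained from a covariance matrix by dividing each entry by the two corresponding standard deviations. The sketch below assumes a small hypothetical 3 x 3 covariance matrix; the function name is also an assumption.

```python
import math

def correlation_matrix(S):
    """Scale a covariance matrix S so that
    R[i][j] = S[i][j] / sqrt(S[i][i] * S[j][j])."""
    d = len(S)
    return [[S[i][j] / math.sqrt(S[i][i] * S[j][j]) for j in range(d)]
            for i in range(d)]

# Hypothetical covariance matrix for three variables.
S = [[2.5, 5.0, -2.0],
     [5.0, 10.0, -4.0],
     [-2.0, -4.0, 2.5]]

R = correlation_matrix(S)
for row in R:
    print([round(r, 3) for r in row])
```

Every diagonal entry of the result is 1, and the off-diagonal entries fall between -1 and 1 regardless of the units of the original variables.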

With large variable sets, a correlation matrix becomes unwieldy.

Atmospheric data is often too numerous when it is from a large number of locations.

A one-point correlation map shows the correlation between the variable at one location with the variable at all the other locations on the map.

A correlation map arranges the correlation information by geographical location.

From: http://www.climateaudit.org/?p=858

Correlation between the sea surface temperature in the Pacific Warm Pool and surface air temperature at other locations
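
The calculation behind a one-point correlation map can be sketched on a tiny grid. The 2 x 2 grid, its time series, and the function names are hypothetical; a real map would hold one series per grid point of an atmospheric field.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length series."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def one_point_correlation_map(field, reference):
    """Correlate the series at the reference point with every grid point.
    field maps (row, col) to the time series observed at that location."""
    base = field[reference]
    return {loc: pearson(base, series) for loc, series in field.items()}

# Hypothetical 2 x 2 grid with a five-step time series at each point.
field = {
    (0, 0): [1, 2, 3, 4, 5],     # reference point
    (0, 1): [2, 4, 6, 8, 10],    # varies in phase with the reference
    (1, 0): [5, 4, 3, 2, 1],     # varies out of phase with the reference
    (1, 1): [3, 1, 4, 1, 5],     # loosely related
}

corr_map = one_point_correlation_map(field, (0, 0))
for row in range(2):
    print([round(corr_map[(row, col)], 2) for col in range(2)])
```

The reference point always correlates perfectly with itself. Arranging the remaining values by grid position, rather than in a large matrix, is what makes the pattern readable as a map.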

A teleconnection is a strong statistical relationship between weather in different parts of the globe.

This figure shows the teleconnections of a La Niña event.

Recurring and persistent teleconnection patterns that span vast areas are called modes of variability.

From: http://www.fas.usda.gov/pecad2/highlights/2000/04/eth/eth_drought.htm

Correlation maps are often used to detect teleconnection patterns.

Robustness is the ability of an analysis method to provide reasonable results for different data sets.

Numerical Summaries include the mean, median, quartiles, standard deviation, interquartile range, skewness coefficient, and Yule-Kendall index.

Graphical Summaries include the boxplot, schematic plot, histograms, and cumulative frequency distributions.

Correlation measures the association between paired variables.

Higher-Dimensional Data can be analyzed using correlation and covariance matrices and maps.

In summary, Exploratory Data Analysis employs a variety of techniques to characterize data sets.