exploratory data analysis (wilks ch. 3) -...

Post on 14-Feb-2020

1 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

1

Debra BakerAOSC 630: Class #2January 30, 2008

Exploratory Data Analysis (Wilks Ch. 3)

From: http://pubs.usgs.gov/wri/wri034126/htdocs/wrir03-4126.html

Robustness

Numerical Summaries

Graphical Summaries

Correlation

Higher-Dimensional Data

2

Common assumptions:“normal” distribution

Robust: performsreasonably well for mosttypes of data

Resistant: not undulyinfluenced by a smallnumber of outliers

A good analysis method is insensitive to theassumptions about the data set.

From: http://www.chemicool.com/definition/gaussian_distribution.html

3

Location: the centraltendency of the data set.

Spread: the dispersion ofthe data set around acentral value.

Symmetry: how the data isdistributed about the centralvalue.

There are three key features used to numericallydescribe a data set.

From: http://www.physics.csbsju.edu/stats/display.distribution.html

4

Mean: the average of alldata points

Median: the center value inan ordered data set

Mode: the most frequentlyoccurring value

Which of these measuresare robust?

Which of these measuresare resistant?

The first common numerical summary of a data setis a measure of its location.

From: http://billkosloskymd.typepad.com/lexicillin_qd/2007/09/mean-vs-median-.html

5

First quartile: the middle of thedata between the median andminimum.

Third quartile: the middle of thedata between the median andmaximum.

Are quartiles robust andresistant?

Quartiles are an example of aquantile,which can be based onany divisor (e.g., 10%).

Quartiles divide the data set into four equal parts todescribe its distribution.

From: http://www.physics.csbsju.edu/stats/display.distribution.html

6

Standard Deviation: thesquare root of the averagedsquare distance betweendata points and the mean.

Interquartile Range:specifies the range of thecenter 50% of the data.

Are these measuresrobust?

Are these measuresresistant?

The second common numerical summary of a dataset is a measure of its spread.

s =1

n !1x1! x( )

2

i=1

n

"

IQR = q0.75

! q0.25

Equations 3.5 and 3.6 from Wilks (2006), pp. 26-27.

7

Positive Skewness:distribution has a long right tail.

Negative Skewness:distribution has a long left tail.

Positive Kurtosis: distributionhas a tall narrow peak.

Negative Kurtosis: distributionhas flat low peak.

The third common numerical summary of a dataset is a measure of its symmetry.

From: http://mvpprograms.com/help/mvpstats/distributions/SkewnessKurtosis

8

Skewness Coefficient: amoments-based measureof symmetry.

Yule-Kendall Index:compares the distancebetween the median andeach of the two quartiles.

Are these measuresrobust?

Are these measuresresistant?

There are two important measures of skewness.

! YK =q0.75

" q0.5( ) " q

0.5" q

0.25( )IQR

! =

1

n "1xi" x( )

3

i=1

n

#

s3

Equations 3.9 and 3.10 from Wilks (2006), p. 28.

9

We will look specifically at boxplots, histograms,and cumulative frequency distributions.

The characteristics of a data set are best conveyedusing graphical summary techniques.

From:http://learnsigma.com/statistical-software-is-not-six-sigma/.

10

This type of boxplot is called a schematic plot,which highlights unusual extreme values.

The boxplot is a simple graphic based onquartile calculations.

From:http://www.scc.ms.unimelb.edu.au/whatisstatistics/weather.html

11

Unlike a boxplot, a histogram can show if the datais multimodal.

A histogram divides the data into bins and thencharts the relative frequency distribution.

From:http://www.climateaudit.org/?p=1022

12

This graphic illustrates wheat yield from climatesimulations changing the precipitation variance (PV)

The cumulative frequency distribution includes thenumber of points in all lower bins.

From:http://clima.casaccia.enea.it/staff/pona/Rep_Dec99/index.html

13

Scatterplots

Covariance

Pearson Correlation

Rank Correlation

Lag Correlation

Autocorrelation

For paired data sets, numerical and graphicalsummaries show the relationship of two variables.

From: Wilks (2006) p. 54

14

Trends: Does the paireddata show a recognizablepattern?

Curvature: Do the pointsshow a positive or negativeslope?

Clustering: How closetogether are the points?

Spread: What is the rangeof the data?

Outliers: Are there anyunusually extreme values?

A scatterplot puts the paired data on an x-y grid.

15

Cov > 0: If onevariable rises so doesthe other.

Cov < 0: If onevariable rises, theother falls.

This figure shows theestimated covarianceof surface ozone inthe Eastern U.S.between the gridpoint indicated by thered dot and the restof grid points for onelocation.

The covariance measures to what extent pairedvariables vary together.

Cov(x, y) =1

n !1xi ! x( ) yi ! y( )"# $%

i=1

n

&

From: http://www.image.ucar.edu/GSP/Projects/ResearchNuggets.shtml

16

r = 1: perfect positive linearassociationr = 0: no associationr = -1: perfect negative linearassociation

This figure shows thePearson correlation of theSouthern Oscillation Indexand sea surface temperaturessix months later.

Is this measure robust andresistant?

The Pearson Correlation (i.e., standard correlation)is a product-moment coefficient of linear correlation.

rxy =Cov(x, y)

sxsy

From: http://ingrid.ldeo.columbia.edu/dochelp/StatTutorial/Correlation/

17

0.88

What is the Pearson Correlation Coefficient for thesample paired data?

0.61

Will the correlation for each be positive or negative?

Which set will have the higher magnitude correlation?

Is the correlation meaningful?

18

The x and y values areordered separately.

The ordering is done fromsmallest to largest.

The value’s rank number isthen substituted for thevalue in the pairing.

Is this measure robust andresistant?

Does beetle matingcorrelate with temperature?rxy = 0.83 and rrank = 0.11

The Spearman Rank Correlation is the Pearsoncorrelation of the rank of the data, not its value.

rrank

= 1!

6 Di

2

i=1

n

"

n n2!1( )

From: Lauren Silbert’s Penn State dissertation: The Aroid-ScarabMutualism: Importance of Floral Temperature for Scarab Attraction andCopulation, available at http://www.pennscience.org/?view=past&year=2003&issue=1/2/&article=28

19

1.00

What is the Spearman Rank Correlation Coefficientfor the sample paired data?

0.018

Will the rank correlation be similar to the Pearson correlation?

Since Set I has a monotonic relationship between x and y, whatwill its rank correlation be?

20

It is useful when onevariable has a delayedeffect on the othervariable.

The persistence of asingle variable over timecan be measured by lagcorrelation.

This relationship of avariable with itselfseparated by a timeinterval is calledautocorrelation.

Lag correlation measures the association of x andy when they are separated by a time interval.

Contours every 0.1, red indicates correlations > 0.1,blue indicates correlations < -0.1.From: http://ugamp.nerc.ac.uk/hot/um/mjo.html

rk = ! xt , yt" k( )

21

Covariance Matrix

Correlation Matrix

Correlation Maps

The relationship between multiple variables can beevaluated using more advanced techniques.

From: http://www.dfanning.com/tips/scatter3d.html

22

The covariance matrix is alwaysa square matrix.

It has one row/column for eachvariable.

The variance for each variable ison the diagonal.

The matrix is symmetric aboutthe diagonal.

If all variables are of unit variance,then the covariance matrix is thesame as the correlation matrix.

From: http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/Covariance_Matrix_popup.htm

The covariance matrix is a collection of manycovariances in a d x d matrix

23

The main diagonal is thecorrelation of the variablewith itself so it is alwaysunity.

Many methods of multivariateanalysis depend on it, such asprincipal componentanalysis.

This figure is an example of acorrelation matrix among 20climate model biases atcertain spatial locations.

A correlation matrix simultaneously displayscorrelations among matched multivariate data.

From: http://www.image.ucar.edu/GSP/Projects/ResearchNuggets.shtml

24

With large variable sets, acorrelation matrix becomesunwieldy.

Atmospheric data is oftentoo numerous when it isfrom a large number oflocations.

A one-point correlationmap shows the correlationbetween the variable at onelocation with the variable atall the other locations onthe map.

A correlation map arranges the correlationinformation by geographical location.

From: http://www.climateaudit.org/?p=858

Correlation between the sea surface temperaturein the Pacific Warm Pool and surface air

temperature at other locations

25

A teleconnection is astrong statisticalrelationship betweenweather in differentparts of the globe.

This figure shows theteleconnections of a LaNina event.

Recurring and persistentteleconnection patternsthat spans vast areas arecalled modes ofvariability.

From: http://www.fas.usda.gov/pecad2/highlights/2000/04/eth/eth_drought.htm

Correlation maps are often used to detectteleconnections patterns.

26

Robustness is the ability of an analysis method toprovide reasonable results for different data sets.

Numerical Summaries include the mean, median,quartiles, standard deviation, interquartile range,skewness coefficient, and Yule-Kendall index.

Graphical Summaries include the boxplot,schematic plot, histograms, and cumulativefrequency distributions.

Correlation measures the association betweenpaired variables.

Higher-Dimensional Data can be analyzed usingcorrelation and covariance matrices and maps.

In summary, Exploratory Data Analysis employs avariety of techniques to characterize data sets.

top related