Chocolate Cake Seminar Series on Statistical Applications Today’s Talk: Be an Explorer with Exploratory Data Analysis! By David Ramirez

Upload: lara-pulsipher

Post on 29-Mar-2015



Chocolate Cake Seminar Series on Statistical Applications

Today’s Talk: Be an Explorer with Exploratory Data Analysis!

By David Ramirez


Outline of Presentation

• Exploratory v. Confirmatory Data Analyses
• Exploratory Data Analysis Techniques
• Examples of Graphical Techniques
• Examples of Non-graphical Techniques


What is Exploratory Data Analysis (EDA)?

• John Tukey (1915-2000), American statistician: "It is important to understand what you CAN DO before you learn to measure how WELL you seem to have DONE it."

• Definition: EDA consists of methods for discovering unanticipated patterns and relationships in a data set by summarizing the data quantitatively or presenting them visually.


Exploratory v. Confirmatory

• Exploratory Data Analysis
  – Descriptive Statistics - Inductive Approach
    • Look for flexible ways to examine data without preconceptions
    • Heavy reliance on graphical displays
    • Let data suggest questions
  – Advantages
    • Flexible ways to generate hypotheses
    • Does not require more than data can support
    • Promotes deeper understanding of processes
  – Disadvantages
    • Usually does not provide definitive answers
    • Requires judgment - cannot be cookbooked


Exploratory v. Confirmatory

• Confirmatory Data Analysis
  – Inferential Statistics - Deductive Approach
    • Hypothesis tests and formal confidence interval estimation
    • Hypotheses determined at outset
    • Heavy reliance on probability models
    • Look for definite answers to specific questions
    • Emphasis on numerical calculations
  – Advantages
    • Provide precise information in the right circumstances
    • Well-established theory and methods
  – Disadvantages
    • Misleading impression of precision in less than ideal circumstances
    • Analysis driven by preconceived ideas
    • Difficult to notice unexpected results


EDA Techniques

• Graphical presentation of distribution
  – Continuous variables (stem-and-leaf plot, box plot, histogram, bivariate scatterplot)
  – Categorical variables (bar graph, pie chart)

• Non-graphical summary of distribution
  – Continuous variables (mean, median, mode, variance, standard deviation, range, correlation coefficient, linear regression)
  – Categorical variables (frequency table, cross-tabulation)


Stem-and-Leaf Plot

• What is it?
  – A plot where each data value is split into a "leaf" (usually the last digit) and a "stem" (the other digits).

• Useful for describing distributions in terms of
  – Symmetry or skewness (right-skewed = long right tail, left-skewed = long left tail)
  – Unimodality, bimodality or multimodality (one, two, or more peaks)
  – Presence of outliers (a few very large or very small observations)
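The stem/leaf split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stem_and_leaf` and the sample values are hypothetical, the data are non-negative integers, and each stem gets a single row (SPSS may split a stem into two rows).

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Split each non-negative integer into a stem (all digits but the last)
    and a leaf (the last digit), and return one text row per stem."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    lines = []
    # Walk every stem between the smallest and largest so that empty stems
    # (gaps in the distribution) stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(d) for d in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines

# Hypothetical data, not the rainfall values used in the talk.
sample = [18, 25, 28, 30, 31, 31, 35, 36, 42, 43, 60]
for line in stem_and_leaf(sample):
    print(line)
```

Reading the rows top to bottom gives the same impression as a histogram turned on its side: the long row is the peak, and an isolated row far from the rest points to a possible outlier.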


How To Create Stem-and-Leaf Plot

• Syntax
    EXAMINE VARIABLES=Rain /PLOT BOXPLOT STEMLEAF.

• By Mouse
  – Analyze -> Descriptive Statistics -> Explore -> Plots -> Stem-and-leaf


Example: Stem-and-leaf Plot

• We use SPSS to construct a stem-and-leaf plot for rainfall in US metropolitan areas.

 Frequency    Stem &  Leaf
      4.00 Extremes    (=<15)
      1.00        1 .  8
       .00        2 .
      2.00        2 .  58
     10.00        3 .  0001111234
     15.00        3 .  555556666677889
     16.00        4 .  0011222223333344
      7.00        4 .  5555566
      4.00        5 .  0234
      1.00 Extremes    (>=60)


Box Plot

• What is it?
  – A way of graphically depicting groups of numerical data through their five-number summaries: the smallest observation (sample minimum), lower quartile (Q1), median (Q2), upper quartile (Q3), and largest observation (sample maximum). A box plot may also indicate which observations, if any, might be considered outliers.

• Useful in visualizing the following:
  – Location
  – Spread
  – Skewness
  – Outliers
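A minimal Python sketch of the five-number summary behind a box plot follows. Two things here are assumptions and not from the talk: the quartile method (several conventions exist, and SPSS may use a different one) and the 1.5 x IQR rule for flagging outliers, a common convention. The function names and the data are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).
    Quartiles use Python's "inclusive" method, one of several conventions."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

def flag_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3 (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

sample = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # hypothetical data
print(five_number_summary(sample))  # (1, 3.0, 5.0, 7.0, 30)
print(flag_outliers(sample))        # [30]
```

The box spans Q1 to Q3, the line inside it marks the median, and the flagged value is what a box plot would draw as a separate point beyond the whisker.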


How To Create Box Plot

• Syntax
    EXAMINE VARIABLES=Rain /PLOT=BOXPLOT.

• By Mouse
  – Graphs -> Legacy Dialogs -> Boxplot -> click Summaries of separate variables -> scale variable -> optional: Label Cases by -> OK


Example: Box Plot

• Using the previous data on precipitation, we would like to understand the distribution of the rain and check for any outliers.


Example: Multiple Box Plots

• Side-by-side box plots below display the population distribution of large cities in 1960.


How To Create Box Plots

• Syntax
    EXAMINE VARIABLES=Population BY Country /PLOT=BOXPLOT /ID=City.

• By Mouse
  – Graphs -> Legacy Dialogs -> Boxplot -> click Summaries for groups of cases -> Define -> Variable (scale) -> Category Axis (the variable that defines the groups) -> Label Cases by (IDs or names, optional)


Histogram

• What is it?
  – A diagram consisting of rectangles whose area is proportional to the frequency of a continuous variable and whose width is equal to the class interval (bin).

• Useful for describing distributions in terms of
  – Symmetry or skewness
  – Unimodality, bimodality or multimodality
  – Presence of outliers


How To Create Histogram

• Automatically chosen bins

• Syntax
    GRAPH /HISTOGRAM(NORMAL)=Population.

• By Mouse
  – Graphs -> Histogram -> Variable (scale) -> OK


How To Create Histogram

• User-selected number of bins

• Syntax
    GGRAPH
      /GRAPHDATASET NAME="graphdataset" VARIABLES=Population MISSING=LISTWISE REPORTMISSING=NO
      /GRAPHSPEC SOURCE=INLINE.
    BEGIN GPL
      SOURCE: s=userSource(id("graphdataset"))
      DATA: Population=col(source(s), name("Population"))
      GUIDE: axis(dim(1), label("Population"))
      GUIDE: axis(dim(2), label("Frequency"))
      ELEMENT: interval(position(summary.count(bin.rect(Population, binCount(5)))), shape.interior(shape.square))
    END GPL.

• By Mouse
  – Graphs -> Chart Builder -> Histogram -> drag the scale variable to the x-axis -> Set Parameters -> Custom -> Number of intervals -> Continue -> OK


How To Create Histogram

• User-selected bin width

• Syntax
    * Chart Builder.
    GGRAPH
      /GRAPHDATASET NAME="graphdataset" VARIABLES=Population MISSING=LISTWISE REPORTMISSING=NO
      /GRAPHSPEC SOURCE=INLINE.
    BEGIN GPL
      SOURCE: s=userSource(id("graphdataset"))
      DATA: Population=col(source(s), name("Population"))
      GUIDE: axis(dim(1), label("Population"))
      GUIDE: axis(dim(2), label("Frequency"))
      ELEMENT: interval(position(summary.count(bin.rect(Population, binWidth(1)))), shape.interior(shape.square))
    END GPL.

• By Mouse
  – Graphs -> Chart Builder -> Histogram -> drag the scale variable to the x-axis -> Set Parameters -> Custom -> Interval width -> Continue -> OK


Example: Histogram

• A researcher may need to choose the bins by hand to see the shape of the distribution more clearly and to judge what type of distribution the data follow.
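To see why the choice of bins matters, here is a small Python sketch that counts the same values with two different bin widths. The helper `bin_counts`, its parameters, and the data are hypothetical; it only mimics the counting step of a histogram, not how SPSS chooses bin boundaries.

```python
def bin_counts(values, bin_width, start):
    """Count the values falling in each half-open bin [left, left + bin_width).
    Keys of the returned dict are the left edges of the non-empty bins."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)
        left = start + index * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

sample = [1, 2, 2, 3, 7, 8, 8, 9, 9, 10]  # hypothetical data
print(bin_counts(sample, bin_width=5, start=0))  # coarse bins: {0: 4, 5: 5, 10: 1}
print(bin_counts(sample, bin_width=2, start=0))  # finer bins: {0: 1, 2: 3, 6: 1, 8: 4, 10: 1}
```

With the wide bins the two clusters look like neighbouring bars of similar height; with the narrow bins the gap between them becomes visible (no bin starting at 4 has any count).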


Scatterplot

• What is it?
  – A scatterplot is a plot of data points in the xy-plane that displays the strength, direction and shape of the relationship between two variables.

• Used for
  – Analyzing relationships between two variables
  – Looking to see if there are any outliers in the data


How To Create Scatterplot

• Syntax
    GRAPH /SCATTERPLOT(BIVAR)=Height WITH Wieght /MISSING=LISTWISE.

• By Mouse
  – Graphs -> Legacy Dialogs -> Scatter/Dot -> Simple Scatter -> Y Axis (outcome) -> X Axis (predictor) -> OK


Example: Scatterplot

• Researchers wanted to see if there is a link between Height and Weight.


Bar Graph

• What is it?
  – A diagram consisting of rectangles whose area is proportional to the frequency of each level of a categorical variable.
  – A bar graph is similar to a histogram, but for categorical variables.

• Used for
  – Comparison of frequencies for different levels


How To Create Bar Graph

• Syntax
    GRAPH /BAR(SIMPLE)=COUNT BY Gender.

• By Mouse
  – Graphs -> Legacy Dialogs -> Bar -> Categorical Variable -> Category Axis -> OK


Example: Bar Graph

• Experimenters wanted to make sure they had a roughly equal number of males and females in a study.


Pie chart

• What is it?
  – A type of graph in which a circle is divided into sectors, one for each level of a categorical variable, each illustrating the numerical proportion for that level.

• Used for
  – Comparison of proportions for different levels


How To Create Pie Chart

• Syntax
    GRAPH /PIE=COUNT BY Bindedage.

• By Mouse
  – Graphs -> Legacy Dialogs -> Pie -> Summaries for groups of cases -> Define -> categorical variable -> category axis -> OK


Example: Pie Chart

• A researcher wants to partition the age variable into a categorical variable in terms of mental development (College Age, Older Young Adult, Young Middle age, Middle Middle Age and up).


Non-Graphical Techniques

Measures of Central Tendency

• Central tendency is the location of the middle value.
  – Mean = sum of all data values divided by the number of values (arithmetic average).


Measures of Central Tendency

  – Median = the middle value after all the values are put in an ordered list (50% of observations lie below the median and 50% above).
  – If there are two middle observations, the median is the average of the two.
  – Mode = the most frequently occurring value.
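The three measures can be checked by hand on a tiny data set. The sketch below uses Python's standard `statistics` module; the data are hypothetical and chosen to have an even number of values, so the median is the average of the two middle ones.

```python
import statistics

values = [2, 3, 3, 5, 7, 10]  # hypothetical data, already in order

mean = statistics.mean(values)        # (2 + 3 + 3 + 5 + 7 + 10) / 6 = 5
median = statistics.median(values)    # average of the two middle values, 3 and 5
modes = statistics.multimode(values)  # all most-frequent values, in case of ties

print(mean, median, modes)
```

Here the mean (5) sits above the median (4) because the single large value 10 pulls it up, which is a quick numerical hint of right skew.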


Measures of Spread

• Spread is how far observations lie from each other.

  – Variance = average of the squared distances from the mean (the sample variance reported by SPSS divides by n - 1 instead of n).
  – Standard deviation = square root of the variance.
  – Range = maximum - minimum.
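As a worked example of the spread measures, the following Python sketch computes them with the standard `statistics` module on hypothetical data. It shows both divisors for the variance, since software usually reports the sample version (divide by n - 1).

```python
import statistics

values = [2, 4, 4, 6, 9]  # hypothetical data; the mean is 5

# Squared distances from the mean: 9, 1, 1, 1, 16 (sum = 28).
sample_variance = statistics.variance(values)       # 28 / (5 - 1) = 7
population_variance = statistics.pvariance(values)  # 28 / 5 = 5.6
sample_sd = statistics.stdev(values)                # square root of 7
value_range = max(values) - min(values)             # 9 - 2 = 7

print(sample_variance, population_variance, sample_sd, value_range)
```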


How to Compute Measures of Central Tendency and Spread

• Syntax
    FREQUENCIES VARIABLES=MORT
      /STATISTICS=STDDEV VARIANCE RANGE MEAN MEDIAN MODE
      /ORDER=ANALYSIS.

• By Mouse
  – Analyze -> Descriptive Statistics -> Frequencies -> select a scale variable -> click Statistics -> select Mean, Median, Mode, Std. deviation, Variance and Range


Example: Central Tendency and Spread

• We use SPSS to figure out the Central Tendency and Spread of the Mortality rates in the 1960s.

Statistics
MORT
N  Valid                60
   Missing               0
Mean              940.3650
Median            943.7000
Mode               790.70a
Std. Deviation    62.20482
Variance          3869.439
Range               322.30


Correlation Coefficient

• What is it?
  – A numeric measure of the linear relationship between two continuous variables.

• Properties of the correlation coefficient r:
  – Ranges between -1 and 1
  – The closer it is to -1 or 1, the stronger the linear relationship
  – If r = 0, the two variables are not linearly correlated
  – If r is positive, the relationship is described as positive (larger values of one variable tend to accompany larger values of the other variable)
  – If r is negative, the relationship is described as negative (larger values of one variable tend to accompany smaller values of the other variable)


Correlation

• Slight warning:
  – Correlation measures only the linear relationship; two variables may be related along a curve, in which case r can be small even though the relationship is strong.
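The warning can be made concrete with a small Python sketch. The function `pearson_r` below is a hand-written version of the usual Pearson formula (its name and the data are hypothetical); it gives r = 1 for points on a straight line and r = 0 for points on a perfect parabola.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y divided by the
    square root of the product of their individual variations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]        # hypothetical data
line = [2 * x + 1 for x in xs]  # exact straight line
curve = [x ** 2 for x in xs]    # exact parabola

print(pearson_r(xs, line))   # 1.0
print(pearson_r(xs, curve))  # 0.0
```

In the second case y is completely determined by x, yet r is zero, which is why a scatterplot should be inspected alongside the coefficient.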


Linear Regression

• What is it?
  – A statistical technique of fitting a linear function to data points in an attempt to describe the relationship between two variables.

• Used for
  – Prediction
  – Interpretation of coefficients (change in y for a unit increase in x)
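A minimal sketch of fitting a line by ordinary least squares, written in Python for illustration. It assumes a single predictor; the function name `fit_line` and the data are hypothetical, and this is the textbook closed-form solution, not a description of how SPSS computes it internally.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line
    y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx                      # change in y per unit increase in x
    intercept = mean_y - slope * mean_x    # the line passes through the point of means
    return intercept, slope

xs = [1, 2, 3, 4, 5]    # hypothetical data
ys = [3, 5, 7, 9, 11]   # lies exactly on y = 1 + 2x

print(fit_line(xs, ys))  # (1.0, 2.0)
```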


How To Find Correlation and Fitted Regression Line

• By Syntax
    REGRESSION
      /DESCRIPTIVES MEAN STDDEV CORR SIG N
      /MISSING LISTWISE
      /STATISTICS COEFF OUTS R ANOVA
      /CRITERIA=PIN(.05) POUT(.10)
      /NOORIGIN
      /DEPENDENT Wieght
      /METHOD=ENTER Height.

• By Mouse
  – Analyze -> Regression -> move Y (the variable we want to predict) to Dependent -> move X (the variable we are using to predict Y) to Independent ->


Example: Correlation

• Referring to our weight and height scatterplot, the researchers want to check how strongly these two variables are related.

Correlations
                               Wieght   Hieght
Pearson Correlation   Wieght    1.000     .717
                      Hieght     .717    1.000
Sig. (1-tailed)       Wieght              .000
                      Hieght     .000
N                     Wieght      507      507
                      Hieght      507      507


Example: Regression

• Researchers want to create a linear model using the height as an independent variable (predictor) and weight as a dependent variable (outcome or response).

• The fitted line can be written as Weight = -105.011 + 1.018 (Height)

Coefficients(a)
                  Unstandardized Coefficients   Standardized Coefficients
Model             B           Std. Error        Beta                          t         Sig.
1  (Constant)     -105.011    7.539                                           -13.928   .000
   Hieght         1.018       .044              .717                          23.135    .000
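To show how the fitted line is used for prediction and how the slope is read, here is a short Python sketch that plugs a value into the equation above. The two coefficients come from the table; the function name and the input height of 170 are hypothetical, and the units are whatever the data set uses (the slides do not state them).

```python
INTERCEPT = -105.011  # B for (Constant) in the coefficients table
SLOPE = 1.018         # B for Hieght in the coefficients table

def predict_weight(height):
    """Predicted weight from the fitted line Weight = -105.011 + 1.018 * Height."""
    return INTERCEPT + SLOPE * height

h = 170  # hypothetical height, in the units of the data set
print(round(predict_weight(h), 3))                          # 68.049
print(round(predict_weight(h + 1) - predict_weight(h), 3))  # 1.018, the slope
```

The second line is the interpretation of the coefficient: each additional unit of height is associated with a predicted increase of 1.018 units of weight.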


Frequency Table

• What is it?
  – A table that shows the frequency (count) for each level of a categorical variable.

• Used for
  – Comparison of frequencies for different levels


How To Find Frequency Table

• Syntax
    FREQUENCIES VARIABLES=EDUbinned /ORDER=ANALYSIS.

• By Mouse
  – Analyze -> Descriptive Statistics -> Frequencies -> Variable -> Display frequency tables -> OK


Example: Frequency Table

• We want to know the frequencies of different educational levels in US metropolitan areas in the 1960s. We first have to use visual binning to define the bins. Using the range, we create bins for 9th, 10th, 11th, and 12th grade and up.

  – Syntax
    * Visual Binning.
    *EDU.
    RECODE EDU (MISSING=COPY) (12 THRU HI=4) (11 THRU HI=3) (10 THRU HI=2) (LO THRU HI=1) (ELSE=SYSMIS) INTO EDUbins.
    VARIABLE LABELS EDUbins 'EDU (Binned)'.
    FORMATS EDUbins (F5.0).
    VALUE LABELS EDUbins 1 '9th Grade' 2 '10th Grade' 3 '11th Grade' 4 '12th grade and up'.
    VARIABLE LEVEL EDUbins (ORDINAL).

  – By Mouse
    • Transform -> Visual Binning -> select the variable we want to turn into an ordinal variable -> OK -> Make Cutpoints -> enter the number of cutpoints and the width -> Apply -> OK
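The RECODE line relies on its ranges being checked from left to right, with the first match deciding the new value. The Python sketch below mirrors that logic so the overlapping "THRU HI" ranges are easier to follow. The function name `recode_edu` and the test values are hypothetical; the cut points and labels are copied from the syntax above.

```python
def recode_edu(value):
    """Mirror the RECODE rules above; the first matching range wins."""
    if value is None:   # MISSING=COPY: a missing value stays missing
        return None
    if value >= 12:     # 12 THRU HI = 4
        return 4
    if value >= 11:     # 11 THRU HI = 3 (only reached when value < 12)
        return 3
    if value >= 10:     # 10 THRU HI = 2 (only reached when value < 11)
        return 2
    return 1            # LO THRU HI = 1 catches everything that is left

LABELS = {1: "9th Grade", 2: "10th Grade", 3: "11th Grade", 4: "12th grade and up"}

for edu in [9.5, 10.9, 11.0, 12.3]:  # hypothetical EDU values
    print(edu, "->", LABELS[recode_edu(edu)])
```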


Example: Frequency Table

EDU (Binned)
                             Frequency   Percent   Valid Percent   Cumulative Percent
Valid   9th Grade                    9      15.0            15.0                 15.0
        10th Grade                  19      31.7            31.7                 46.7
        11th Grade                  20      33.3            33.3                 80.0
        12th grade and up           12      20.0            20.0                100.0
        Total                       60     100.0           100.0
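The Percent and Cumulative Percent columns follow directly from the counts. As a check, this Python sketch recomputes them from the four frequencies in the table; the function name `frequency_table` is hypothetical.

```python
counts = {
    "9th Grade": 9,
    "10th Grade": 19,
    "11th Grade": 20,
    "12th grade and up": 12,
}  # frequencies taken from the table above

def frequency_table(counts):
    """Return rows of (level, frequency, percent, cumulative percent)."""
    total = sum(counts.values())
    rows = []
    running = 0
    for level, count in counts.items():
        running += count
        percent = round(100 * count / total, 1)
        cumulative = round(100 * running / total, 1)
        rows.append((level, count, percent, cumulative))
    return rows

for row in frequency_table(counts):
    print(row)
```

Because the cumulative column is computed from the running count and not from the rounded percentages, it reproduces 46.7 for the second row, matching the SPSS output.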


Cross-tabulation

• What is it?
  – A two-way table containing frequencies (counts) for different levels of the column and row variables.

• Used for
  – Comparison of frequencies for different levels of the variables (chi-squared test)


How To Find Cross-tabulation

• Syntax
    CROSSTABS
      /TABLES=EDUbins BY US
      /FORMAT=AVALUE TABLES
      /STATISTICS=CHISQ
      /CELLS=COUNT
      /COUNT ROUND CELL.

• By Mouse
  – Analyze -> Descriptive Statistics -> Crosstabs -> select variable for row -> select variable for column -> Statistics -> Chi-square -> Continue -> OK


Example: Cross-tabulation

• Researchers wish to understand whether the educational levels from the SMSA data were equally distributed across the regions of the US.

• Looking at the p-value (.002 for the Pearson chi-square), we can see that the distribution of educational levels differs among the regions of the US.

EDU (Binned) * US Crosstabulation
Count
                                         US
                             1.00   2.00   3.00   4.00   Total
EDU (Binned)  9th Grade         5      1      3      0       9
              10th Grade        8      6      5      0      19
              11th Grade        7      6      5      2      20
              12th grade and up 1      3      1      7      12
Total                          21     16     14      9      60

Chi-Square Tests
                                 Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square               26.078a    9   .002
Likelihood Ratio                 25.377     9   .003
Linear-by-Linear Association      9.893     1   .002
N of Valid Cases                 60
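The Pearson chi-square value can be reproduced from the cross-tabulation by hand. The Python sketch below does this with the counts from the table above; the function name `pearson_chi_square` is hypothetical, and the p-value is not computed because that needs the chi-square distribution, which is outside the standard library.

```python
observed = [
    [5, 1, 3, 0],   # 9th Grade
    [8, 6, 5, 0],   # 10th Grade
    [7, 6, 5, 2],   # 11th Grade
    [1, 3, 1, 7],   # 12th grade and up
]  # rows: EDU (Binned); columns: US regions 1 to 4

def pearson_chi_square(table):
    """Sum (observed - expected)^2 / expected over all cells, where
    expected = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

chi2, df = pearson_chi_square(observed)
print(round(chi2, 3), df)  # 26.078 9
```

Most of the statistic comes from one cell: region 4 has 7 cases in "12th grade and up" where only 1.8 would be expected if education and region were unrelated.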


Recommended Readings/Citations

• Hartwig, F., & Dearing, B. E. (1979). Exploratory Data Analysis. Beverly Hills: Sage Publications.

• Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (1983). Understanding Robust and Exploratory Data Analysis. New York: John Wiley & Sons Inc.

• Pampel, F. C. (2004). Exploratory Data Analysis. In M. S. Lewis-Beck, A. Bryman, & T. Futing Liao, The SAGE Encyclopedia of Social Science Research Methods (pp. 359-360). Thousand Oaks, California: Sage Publications.

• Vogt, W. P. (1999). Exploratory Data Analysis. In W. P. Vogt, Dictionary of Statistics & Methodology: A Nontechnical Guide for the Social Sciences (pp. 104-105). Thousand Oaks, California: SAGE Publications Inc.