eda stats 2010

Module 2 - Exploratory Data Analysis (EDA) Central Tendency and Variability Text: Field, A. 2009 2nd edition -Chapter 1: 1.7 -Chapter 2: 2.1 – 2.5 -Chapter 4: 4.1 – 4.9

Upload: teganalexis

Post on 28-Nov-2014


Page 1: EDA stats 2010

Module 2 - Exploratory Data Analysis (EDA)

Central Tendency and Variability

Text: Field, A. 2009 2nd edition
– Chapter 1: 1.7
– Chapter 2: 2.1 – 2.5
– Chapter 4: 4.1 – 4.9

Page 2: EDA stats 2010

Describing a Population/Sample

• Statistics is the study of data that have some element of random variation; a quantity that varies in this way is called a random variable.

• This variation in the variable under study can be conceptualised as a frequency or probability distribution.

• An example - Distribution of a normal random variable (x)

• The properties of this distribution can be described in several ways - Central tendency, Position, Variability

[Figure: distribution curve of a normal random variable x]

Page 3: EDA stats 2010

Describing a Population/Sample

• Central Tendency or “Average”
  – Mode
  – Median
  – Mean

• Position
  – Quantiles
  – Quartiles
  – Percentiles

• Variability or Dispersion
  – Range, Interquartile Range (IQR)
  – Variance, Standard Deviation
  – Standard Error of the Sample Mean

[Figure: histogram; x-axis: height, 16 to 32; y-axis: Frequency, 0 to 15; Mean = 23.03, Std. Dev. = 2.7412, N = 50]

Page 4: EDA stats 2010

Working With an Example

Note that for the following definitions, we will be working with the following data set (n=23) of individual weights (kg)

73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5

Page 5: EDA stats 2010

Central Tendency - Mode

• The mode is the most common value

• It has the highest frequency in the dataset

• You can see that the example dataset has two modes:

65.5kg and 73kg both have a frequency of 2

• This dataset is bimodal

Value   Frequency
39      1
52.5    1
61      1
61.5    1
65.5    2
68.5    1
69.5    1
71.5    1
73      2
74.5    1
75.6    1
76      1
78.5    1
80      1
80.5    1
83      1
86.5    1
87      1
93      1
98      1
101     1
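The frequency count above can be reproduced with a short script. This is only an illustrative sketch in Python (the course itself uses SPSS); the names `weights_kg` and `modes` are made up for this example.

```python
from collections import Counter

# The 23 individual weights (kg) from the worked example.
weights_kg = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
              61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]


def modes(values):
    """Return every value that shares the highest frequency, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes(weights_kg))  # [65.5, 73] -> two modes, so the data set is bimodal
```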

Page 6: EDA stats 2010

Central Tendency - Median

• The median is the middle value in an ordered list of n numbers

• 50% of the data lie on either side of this value

• It is also represented as Q2 (2nd Quartile)

• The position of Q2 can be calculated by using the following:

$$\text{position of } Q_2 = \frac{n + 1}{2}$$

Page 7: EDA stats 2010

Calculating the Median

In our example the dataset contains 23 numbers:

$$\frac{n + 1}{2} = \frac{23 + 1}{2} = 12\text{th number}$$

Order   Number
1       39
2       52.5
3       61
4       61.5
5       65.5
6       65.5
7       68.5
8       69.5
9       71.5
10      73
11      73
12      74.5
13      75.6
14      76
15      78.5
16      80
17      80.5
18      83
19      86.5
20      87
21      93
22      98
23      101

Therefore the 12th number in the ascending data set will be the median (Q2 = 74.5kg)
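As a check on the position rule, the following Python sketch sorts the data and reads off the value at position (n + 1)/2. The function name `median_by_position` is hypothetical, and averaging the two neighbouring values when the position is fractional (even n) is the usual convention rather than something stated on the slide.

```python
import math

weights_kg = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
              61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]


def median_by_position(values):
    """Return (position, median) using the (n + 1) / 2 rule on the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                 # 1-based position of Q2
    lower = ordered[math.floor(position) - 1]
    upper = ordered[math.ceil(position) - 1]
    return position, (lower + upper) / 2   # lower == upper when n is odd


position, median = median_by_position(weights_kg)
print(position, median)  # 12.0 74.5
```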

Page 8: EDA stats 2010

Central Tendency - Mean

• Sample mean
  – Represented by $\bar{x}$

• Population mean
  – Represented by $\mu$

• Note that $\sum$ means
  – sum all values from 1 to n

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Page 9: EDA stats 2010

Calculating the Mean

• The summation of all of our data values = 1714.1 kg.

$$\sum_{i=1}^{23} x_i = 1714.1$$

• Divided by the number of values (n = 23)

• So the mean is

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{1714.1}{23} = 74.5 \text{ kg}$$
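The same arithmetic can be sketched in Python. The variable names are illustrative only; `math.fsum` is used simply to keep floating-point rounding out of the total.

```python
import math

weights_kg = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
              61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

total = math.fsum(weights_kg)   # sum of all values
n = len(weights_kg)             # number of values
mean = total / n                # sample mean

print(f"sum = {total:.1f} kg, n = {n}, mean = {mean:.1f} kg")
# sum = 1714.1 kg, n = 23, mean = 74.5 kg
```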

Page 10: EDA stats 2010

Position

• Quantiles
  – General name for measures of position that divide the distribution (or ranked data) into equal groups, for example quarters, tenths, or hundredths.

• Quartiles
  – Measures of position that divide the distribution (or ranked data) into quarters.

• Percentiles
  – Measures of position that divide the distribution (or ranked data) into 100 equal subsets.

Page 11: EDA stats 2010

Central Tendency vs. Variability

• The mean, median, and mode all tell us about the central tendency of a distribution.

• They cannot tell us about the spread of the distribution (variability).

Page 12: EDA stats 2010

Variability - Range

• The Range of the distribution of data is given by the difference between the maximum value and the minimum value

Range = Max - Min

• A measurement of variability that usually accompanies the Median.
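For the weights data set used in this module, the range can be computed directly. This is a minimal Python sketch with illustrative variable names.

```python
weights_kg = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
              61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

# Range = Max - Min
data_range = max(weights_kg) - min(weights_kg)
print(data_range)  # 101 - 39 = 62 (kg)
```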

Page 13: EDA stats 2010

Variability - Interquartile Range

• Quartiles are the three points (Q1, Q2, Q3) in the distribution defining four equal quarters.

• The quartiles cut the data distribution into four sections each containing 25% of the data.

[Figure: distribution divided at Q1, Q2 and Q3 into four sections, each containing 25% of the data]

Page 14: EDA stats 2010

Variability - Interquartile Range

• The Interquartile Range (IQR) is represented by the difference between the lower quartile (Q1) and the upper quartile (Q3)

• These quartile positions can be calculated via

$$\text{position of } Q_1 = \frac{n + 1}{4} \qquad \text{position of } Q_3 = \frac{3(n + 1)}{4}$$

• The IQR can then be calculated using the values at these positions

• A measurement of variability that usually accompanies the Median.

Page 15: EDA stats 2010

Calculating the Interquartile Range

$$\text{position of } Q_1 = \frac{n + 1}{4} = \frac{23 + 1}{4} = 6\text{th number}$$

$$\text{position of } Q_3 = \frac{3(n + 1)}{4} = \frac{3(23 + 1)}{4} = 18\text{th number}$$

Order   Number
1       39
2       52.5
3       61
4       61.5
5       65.5
6       65.5    (Q1)
7       68.5
8       69.5
9       71.5
10      73
11      73
12      74.5    (Q2 or Median)
13      75.6
14      76
15      78.5
16      80
17      80.5
18      83      (Q3)
19      86.5
20      87
21      93
22      98
23      101

Q1 is therefore 65.5kg.

Q3 is therefore 83.0kg.

Page 16: EDA stats 2010

Calculating the Interquartile Range

Q1 = 65.5kg (6th number in the ordered data set)

Q3 = 83.0kg (18th number in the ordered data set)

IQR = Q3 - Q1 = 83.0 - 65.5 = 17.5kg
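The position rules for Q1 and Q3 can be sketched in Python as follows. The helper `value_at_position` is hypothetical, and the linear interpolation it applies when a position is fractional is an assumption (for n = 23 both positions are whole numbers, so it is not exercised here). Statistical packages may use other quartile rules and give slightly different answers.

```python
import math

weights_kg = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
              61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]


def value_at_position(ordered, position):
    """Value at a 1-based position; interpolate linearly if it is fractional."""
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower - 1] + fraction * (ordered[upper - 1] - ordered[lower - 1])


ordered = sorted(weights_kg)
n = len(ordered)

q1 = value_at_position(ordered, (n + 1) / 4)       # 6th number
q3 = value_at_position(ordered, 3 * (n + 1) / 4)   # 18th number
iqr = q3 - q1

print(q1, q3, iqr)  # 65.5 83.0 17.5
```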

Page 17: EDA stats 2010

Variability Around the Mean

[Figure: sample data points plotted around a horizontal line marking the mean; y-axis from 30 to 80]

Variation around the mean can be described as the difference (or distance) between the data point and the mean:

$$x - \bar{x}$$

Page 18: EDA stats 2010

Variability Around the Mean

We cannot simply add up the differences between each number and the mean, because the sum of these differences will always be zero: the positive differences cancel out the negative differences.

Number   Mean       Number - Mean
73       74.52609   -1.526086957
93       74.52609   18.47391304
68.5     74.52609   -6.026086957
101      74.52609   26.47391304
65.5     74.52609   -9.026086957
78.5     74.52609   3.973913043
83       74.52609   8.473913043
80       74.52609   5.473913043
80.5     74.52609   5.973913043
87       74.52609   12.47391304
73       74.52609   -1.526086957
75.6     74.52609   1.073913043
61       74.52609   -13.52608696
86.5     74.52609   11.97391304
61.5     74.52609   -13.02608696
65.5     74.52609   -9.026086957
39       74.52609   -35.52608696
98       74.52609   23.47391304
69.5     74.52609   -5.026086957
52.5     74.52609   -22.02608696
71.5     74.52609   -3.026086957
76       74.52609   1.473913043
74.5     74.52609   -0.026086957

Total               0

Page 19: EDA stats 2010

• If we square the differences, none of them is negative, so they no longer cancel out
  – the total of the squared differences is known as the sum of squares (SS)
  – this can be represented by the following equation

$$SS = \sum (x - \bar{x})^2$$

  – Where:
    $\bar{x}$ represents the mean
    $x$ represents each individual number

Number   Mean       Number - Mean   Difference Squared
73       74.52609   -1.526086957    2.328941
93       74.52609   18.47391304     341.2855
68.5     74.52609   -6.026086957    36.31372
101      74.52609   26.47391304     700.8681
65.5     74.52609   -9.026086957    81.47025
78.5     74.52609   3.973913043     15.79198
83       74.52609   8.473913043     71.8072
80       74.52609   5.473913043     29.96372
80.5     74.52609   5.973913043     35.68764
87       74.52609   12.47391304     155.5985
73       74.52609   -1.526086957    2.328941
75.6     74.52609   1.073913043     1.153289
61       74.52609   -13.52608696    182.955
86.5     74.52609   11.97391304     143.3746
61.5     74.52609   -13.02608696    169.6789
65.5     74.52609   -9.026086957    81.47025
39       74.52609   -35.52608696    1262.103
98       74.52609   23.47391304     551.0246
69.5     74.52609   -5.026086957    25.26155
52.5     74.52609   -22.02608696    485.1485
71.5     74.52609   -3.026086957    9.157202
76       74.52609   1.473913043     2.17242
74.5     74.52609   -0.026086957    0.000681

Total               0               4386.944
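Both totals in the table (the deviations summing to zero, and the sum of squares) can be checked numerically. A minimal Python sketch, with illustrative variable names:

```python
import math

weights_kg = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
              61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

mean = math.fsum(weights_kg) / len(weights_kg)

# Difference between each number and the mean.
deviations = [x - mean for x in weights_kg]

# Positive and negative deviations cancel, so their sum is zero
# (up to floating-point rounding).
sum_of_deviations = math.fsum(deviations)

# Squaring removes the sign, giving the sum of squares (SS).
sum_of_squares = math.fsum(d ** 2 for d in deviations)

print(f"sum of deviations = {sum_of_deviations:.6f}")
print(f"sum of squares    = {sum_of_squares:.3f}")  # about 4386.944
```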

Page 20: EDA stats 2010

Variability Around the Mean

• Although useful in some calculations, the sum of squares does not take into account the number of observations (is dependent on sample size).

• There are some important ways that the spread of the data around the mean can be represented (based on sum of squares).
  – The Variance ($s^2$).
  – The Standard Deviation ($s$).
  – The Standard Error of the Sample Mean (S.E. or $s_{\bar{x}}$).

Page 21: EDA stats 2010

Variability - Sample Variance

• The Variance uses the Sum of Squares adjusted for the number of “independent” observations in the sample: an “average” variation

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

• We can use the Sum of Squares calculated in the previous slide:

$$s^2 = \frac{4386.944}{23 - 1} = 199.4 \text{ kg}^2$$

Notice that we are in squared units

Page 22: EDA stats 2010

Variability - Sample Standard Deviation

• The sample’s Standard Deviation is the square root of the Variance:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{199.4} = 14.12 \text{ kg}$$

Notice that we are now back in our original units

Page 23: EDA stats 2010

The Standard Error of the Sample Mean

• The Std. Dev. divided by the square root of n is called the Standard Error of the sample mean - we will encounter this measure later on in the course.

$$s_{\bar{x}} = \sqrt{\frac{s^2}{n}} = \frac{s}{\sqrt{n}} = \sqrt{\frac{199.4}{23}} = \frac{14.12}{\sqrt{23}} = 2.94 \text{ kg}$$
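The variance, standard deviation and standard error for the weights data can be reproduced with Python's standard `statistics` module, whose `variance` and `stdev` functions use the n - 1 denominator shown on the slides. This is an illustrative sketch only; the variable names are arbitrary.

```python
import math
import statistics

weights_kg = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
              61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

n = len(weights_kg)

variance = statistics.variance(weights_kg)   # SS / (n - 1), in kg squared
std_dev = statistics.stdev(weights_kg)       # square root of the variance, in kg
std_error = std_dev / math.sqrt(n)           # standard error of the sample mean, in kg

print(f"s^2 = {variance:.1f} kg^2")   # s^2 = 199.4 kg^2
print(f"s   = {std_dev:.2f} kg")      # s   = 14.12 kg
print(f"SE  = {std_error:.2f} kg")    # SE  = 2.94 kg
```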

Page 24: EDA stats 2010

Sample VS Population

Sample                          Population
$\bar{x}$ = sample mean         $\mu$ = population mean
$s$ = sample std dev            $\sigma$ = population std dev
$s^2$ = sample variance         $\sigma^2$ = population var.
n = sample size                 N = population size

Sample Only

Standard error of the sample mean (S.E.): $s_{\bar{x}}$

Page 25: EDA stats 2010

Module 2 - Exploratory Data Analysis (EDA)

Graphical Methods

Text: Field, A. 2009 2nd edition
– Chapter 1: 1.7
– Chapter 2: 2.1 – 2.5
– Chapter 4: 4.1 – 4.9

Page 26: EDA stats 2010

Graphical Methods & SPSS

• Graphical methods are a good way of summarising information and are useful to visualise patterns within your data.

• Various methods can be used depending on the measurement scale of the variables.

• SPSS is the statistical package that you will be using this semester and has a similar spreadsheet format to Microsoft Excel.

• Generally, when entering data into SPSS, each column contains a different variable.

Page 27: EDA stats 2010

Graphs for Discrete Variables

• Measurement scale - nominal or ordinal
  – Other terms - categorical, binned, class, qualitative
  – Examples - gender, age group, trap type

• Common graphical methods are:
  – Pie charts for proportions, percentages, or values that sum to a fixed value
  – Bar charts for most other discrete variables

• Data can be entered into SPSS in two forms
  – Each case (row) represents a single observation
  – Each case (row) represents the count, percentage, or proportion of each level of the discrete variable

Page 28: EDA stats 2010

Data Entry for Discrete Variables

Data entry type 1:
– Can create charts directly using this type of data

Data entry type 2:
– First tell SPSS that each discrete level has been counted

Page 29: EDA stats 2010

An Example - Mass (%) of Each Element Within a Star

• The data is entered into SPSS as in data entry type 2

• You must then tell SPSS to weight each observation (case) by the variable “mass”

• You will need to do this for a pie chart and for a bar graph

Page 30: EDA stats 2010

Making a Pie Chart in SPSS

Page 31: EDA stats 2010

The Pie Chart

[Figure: pie chart with slices Hydrogen, Helium and Other; cases weighted by MASS]

Page 32: EDA stats 2010

Making a Bar Chart in SPSS

Page 33: EDA stats 2010

Simple Bar Chart

[Figure: simple bar chart; x-axis: Element (Hydrogen, Helium, Other); y-axis: Count, 0 to 80; cases weighted by MASS]

One variable with three categories

Page 34: EDA stats 2010

Clustered Bar Chart

[Figure: clustered bar chart; x-axis: Smoking Status (Smoker, Non Smoker); y-axis: Count, 0 to 500; clusters: Cancer Status (Cancer, No Cancer); cases weighted by freq]

Two variables with two categories each

Page 35: EDA stats 2010

Graphs for Continuous Variables

• Measurement scale - Scale
  – Other terms - quantitative
  – Examples - Length, Temperature, Species Richness

• Common graphical methods are:
  – For a single sample - Histograms, Box and Whisker plots, Error Bar plots, Q-Q plots.
  – For 2 or more samples - Clustered Box and Whisker plots, Clustered Error Bar plots.
  – For 2 scale variables - Scatter plots.

Page 36: EDA stats 2010

An Example - Plant Heights

We will be using the following data set of plant heights (cm) to construct a histogram.

21     24.5   20     23.5   24.5
20     26     21     24     25
21.5   23.5   21     20     28
23     24.5   22.5   21     28
21     25     21.5   22     26
21.5   26.5   22.5   21.5   25
24     21.5   23     16.5   29
25.5   23     25     19     31
20.5   22.5   23     19     21.5
24     23.5   23     19.5   22.5

Page 37: EDA stats 2010

Histogram

To create a histogram by hand, we need to create a series of “bins” or categories.

– The data ranges from 16.5 to 31.0.
– We can use the following groups to classify the data.

You can see that the “bins” have been organised so that each datum belongs to a unique group.

Bin         Tally   Frequency
16 – 17.9
18 – 19.9
20 – 21.9
22 – 23.9
24 – 25.9
26 – 27.9
28 – 29.9
30 – 31.9
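The tallying step can also be sketched in Python as a way of checking a hand count. The function name `tally_bins` is hypothetical, and each bin is treated as a half-open interval (for example 16 up to but not including 18), which is equivalent to the “16 – 17.9” labels for data recorded to the nearest 0.5 cm.

```python
# The 50 plant heights (cm) from the example.
heights_cm = [21, 24.5, 20, 23.5, 24.5,   20, 26, 21, 24, 25,
              21.5, 23.5, 21, 20, 28,     23, 24.5, 22.5, 21, 28,
              21, 25, 21.5, 22, 26,       21.5, 26.5, 22.5, 21.5, 25,
              24, 21.5, 23, 16.5, 29,     25.5, 23, 25, 19, 31,
              20.5, 22.5, 23, 19, 21.5,   24, 23.5, 23, 19.5, 22.5]


def tally_bins(values, start=16, width=2, n_bins=8):
    """Count how many values fall in each bin [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)  # which bin this value belongs to
        counts[index] += 1
    return counts


counts = tally_bins(heights_cm)
for k, count in enumerate(counts):
    low = 16 + 2 * k
    print(f"{low} – {low + 1.9}: {count}")
```

The frequencies add up to N = 50, and the tallest bin (20 – 21.9) contains 15 plants.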

Page 38: EDA stats 2010

Histogram

Histogram of Plant height

[Figure: histogram; x-axis: Height Categories (or Bins), 16 – 17.9 to 30 – 31.9; y-axis: Frequency, 0 to 16]

Page 39: EDA stats 2010

Histogram - Here’s One We Prepared Earlier

[Figure: Histogram of Plant height; x-axis: Height Categories (or Bins); y-axis: Frequency, 0 to 16]

Page 40: EDA stats 2010

Histogram Using SPSS

• SPSS will create the bins, work out frequencies and create the histogram for you

• The data needs to be entered in a single column

Page 41: EDA stats 2010

Histogram Using SPSS

Page 42: EDA stats 2010

Histogram Using SPSS

[Figure: SPSS histogram; x-axis: height, 16 to 32; y-axis: Frequency, 0 to 15; Mean = 23.03, Std. Dev. = 2.7412, N = 50]

Single sample, variable height (8 bins)

Page 43: EDA stats 2010

Histogram Using SPSS

[Figure: SPSS histogram; x-axis: height, 16 to 32; y-axis: Frequency, 0 to 12; Mean = 23.03, Std. Dev. = 2.7412, N = 50]

Single sample, variable height (16 bins)

Page 44: EDA stats 2010

Q-Q Plot

• For a single sample

• Plots the quantiles of a variable's distribution (observed - unknown distribution) against the quantiles of a test distribution (expected - e.g. Normal Dist.).

• The test distribution (expected values) has the same mean and standard deviation as the observed data.

• Available test distributions include Beta, Chi-square, Exponential, Gamma, Logistic, Lognormal, Normal, Student’s t, and Uniform.

Page 45: EDA stats 2010

Q-Q Plot

• Probability plots are generally used to determine whether the distribution of a variable (observed - unknown distribution) matches a given distribution (expected - e.g. Normal Dist.).

• If the selected variable matches the test distribution, the points line up on a 45° line (observed = expected).

• Note, if using a sample from a population the sample size needs to be reasonably large.

• An alternative is the P-P plot (percentile plot)
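To make the idea concrete, the points of a normal Q-Q plot for the plant heights can be computed by hand. The sketch below uses Python's standard `statistics.NormalDist`; the function name `normal_qq_points` is hypothetical, and the plotting position (i - 0.5)/n is an assumption made for illustration. SPSS may estimate the proportions with a different formula, so its expected values can differ slightly.

```python
from statistics import NormalDist, mean, stdev

heights_cm = [21, 24.5, 20, 23.5, 24.5,   20, 26, 21, 24, 25,
              21.5, 23.5, 21, 20, 28,     23, 24.5, 22.5, 21, 28,
              21, 25, 21.5, 22, 26,       21.5, 26.5, 22.5, 21.5, 25,
              24, 21.5, 23, 16.5, 29,     25.5, 23, 25, 19, 31,
              20.5, 22.5, 23, 19, 21.5,   24, 23.5, 23, 19.5, 22.5]


def normal_qq_points(values):
    """Pair each ordered observation with the matching quantile of a normal
    distribution that has the same mean and standard deviation as the data."""
    ordered = sorted(values)
    n = len(ordered)
    test_dist = NormalDist(mu=mean(ordered), sigma=stdev(ordered))
    points = []
    for i, observed in enumerate(ordered, start=1):
        p = (i - 0.5) / n                      # assumed plotting position
        points.append((observed, test_dist.inv_cdf(p)))
    return points


points = normal_qq_points(heights_cm)
for observed, expected in points[:3]:
    print(f"observed = {observed:5.1f}   expected = {expected:5.2f}")
```

If the heights are close to normally distributed, plotting `expected` against `observed` gives points near the 45° line.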

Page 46: EDA stats 2010

Q-Q Plot

Normal Q-Q Plot of HEIGHT

[Figure: Q-Q plot; x-axis: Observed Value, 16 to 32; y-axis: Expected Normal Value, 16 to 30]

– Observed Value: observed quantiles from our sample of plant heights
– Expected Normal Value: expected quantiles for a normal distribution with the same mean and standard deviation as the observed distribution

Page 47: EDA stats 2010

Box and Whisker Plots

• The Box includes
  – The Median
  – Q1 and Q3 as the edges of the box

• The Whiskers
  – either (method 1) – “5 number summary”
    • Max and the Min are the ends of the whiskers
  – or (method 2) – default method used in SPSS
    • Q3 + 1.5 IQR and Q1 - 1.5 IQR are the limits of the whiskers (a whisker stops at the max or min if that value falls inside the limit)
    • Q3 + 3.0 IQR and Q1 - 3.0 IQR are the borders between outliers and extreme outliers
    • symbols used for outliers (O) and extreme outliers (*)
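The method 2 limits can be applied to the weights data from the first half of this module. The sketch below is in Python with illustrative variable names, and it takes Q1 = 65.5 and Q3 = 83.0 from the earlier worked example; a statistical package may compute the quartiles slightly differently, so its plot may not match exactly.

```python
weights_kg = [73, 93, 68.5, 101, 65.5, 78.5, 83, 80, 80.5, 87, 73, 75.6,
              61, 86.5, 61.5, 65.5, 39, 98, 69.5, 52.5, 71.5, 76, 74.5]

q1, q3 = 65.5, 83.0          # quartiles taken from the worked example
iqr = q3 - q1                # 17.5

# Whisker limits and the border between outliers and extreme outliers.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 39.25 and 109.25
outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr   # 13.0 and 135.5

extreme_outliers = sorted(v for v in weights_kg
                          if v < outer_low or v > outer_high)
outliers = sorted(v for v in weights_kg
                  if outer_low <= v < inner_low or inner_high < v <= outer_high)

# Each whisker ends at the most extreme value that is still inside the limits.
inside = [v for v in weights_kg if inner_low <= v <= inner_high]
whisker_low, whisker_high = min(inside), max(inside)

print("outliers (o):", outliers)                  # [39]
print("extreme outliers (*):", extreme_outliers)  # []
print("whiskers:", whisker_low, "to", whisker_high)  # 52.5 to 101
```

With these quartiles the 39 kg observation falls just below Q1 - 1.5 IQR = 39.25, so it would be drawn as an outlier rather than as the end of the lower whisker.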

Page 48: EDA stats 2010

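Box and Whisker Plot Method 1 - 5 Number Summary

[Figure: box and whisker plot labelled, from top to bottom, Max, Q3, Q2 (Median), Q1, Min; the box spans the IQR and the whiskers span the Range]

This type of Box and Whisker Plot is the simplest.

It is based on a five number summary: Max, Q3, Q2, Q1, Min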

Page 49: EDA stats 2010

Box and Whisker Plot Method 2 - SPSS (Boxplot)

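[Figure: SPSS boxplot with reference levels labelled, from top to bottom: Q3 + 3 IQR, Q3 + 1.5 IQR (or max), Q3, Q2 (Median), Q1, Q1 - 1.5 IQR (or min), Q1 - 3 IQR; points marked Outlier (o), Outliers (oo) and Extreme Outlier (*)]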

Page 50: EDA stats 2010

Making a Boxplot in SPSS

Page 51: EDA stats 2010

SPSS Clustered Boxplot

[Figure: clustered boxplot of GALLS (y-axis, -10 to 70) for SITES 1 to 5 (x-axis); N = 8 at each site]

Note: Outlier present in second site (sample)

Several samples

Page 52: EDA stats 2010

Error Bar Plot

The Error Bar plot is used to represent

• The mean

• Plus a measure of variation around the mean
  – Confidence Interval of the Sample Mean
  – The Standard Error of the Sample Mean
  – The Standard Deviation of the sample

• The most common form of the Error Bar Plot
  – Is the Standard Error Plot
  – Mean ± 1 Standard Error of the Sample Mean

Page 53: EDA stats 2010

Error Bar Plot in SPSS

Make sure you select the correct measure of variability

The default multiplier is 2 so make sure that you always change it to 1

Page 54: EDA stats 2010

SPSS Clustered Error Bar Plot

[Figure: error bar plot of Mean ± 1 SE GALLS (y-axis, 0 to 40) for SITES 1 to 5 (x-axis); N = 8 at each site]

Note: Mean ± 1 S.E.

Several samples

Page 55: EDA stats 2010

Scatter Plot

Two scale variables

[Figure: two scatter plots of Oxygen Concentration (y-axis, 2.00 to 5.00) against Temperature (x-axis, -20.00 to 20.00); the second adds the line of best fit or linear regression model, R Sq Linear = 0.979]

Page 56: EDA stats 2010

Scatter Plot

Three scale variables

[Figure: scatter plot of three scale variables, including Turbidity]