
Post on 15-Jan-2016


Chapter 2: Looking at Data - Relationships

http://www.forbes.com/sites/erikaandersen/2012/03/23/true-fact-the-lack-of-pirates-is-causing-global-warming/


General Procedure

1. Plot the data.
2. Look for the overall pattern.
3. Calculate a numeric summary.
4. Answer the question (which will be defined shortly).


2.1: Relationships - Goals

• Be able to define what is meant by an association between variables.
• Be able to categorize whether a variable is a response variable or an explanatory variable.
• Be able to identify the key characteristics of a data set.


Questions

• What objects do the data describe?
• What variables are present and how are they measured?
• Are all of the variables quantitative?
• Are the variables associated with each other?


Association (cont.)

Two variables are associated if knowing the values of one of the variables tells you something about the values of the other variable.

1. Do you want to explore the association?
2. Do you want to show causality?


Variable Types

• Response variable (Y): outcome of the study
• Explanatory variable (X): explains or causes changes in the response variable


Key Characteristics of Data

• Cases: Identify what they are and how many there are.
• Label: Identify what the label variable is (if present).
• Categorical or quantitative: Classify each variable as categorical or quantitative.
• Values: Identify the possible values for each variable.
• Explanatory or response: Classify each variable as explanatory or response.


2.2: Scatterplots - Goals

• Be able to create a scatterplot (lab).
• Be able to interpret a scatterplot.
  – Pattern
  – Outliers
  – Form, direction and strength of a relationship
• Be able to interpret scatterplots which have categorical variables.


Scatterplot - Procedure

1. Decide which variable is the explanatory variable and put it on the x axis. The response variable goes on the y axis.
2. Label and scale your axes.
3. Plot the (x, y) pairs.


Example: Scatterplot

The following data were collected to determine the relationship between age and the change in systolic blood pressure (BP, mm Hg) after 24 hours in response to a particular treatment.

a) Draw a scatterplot of these data.

Obs    1    2    3    4    5    6    7    8    9   10   11
Age   70   51   65   70   48   70   45   48   35   48   30
BP   -28  -10   -8  -15   -8  -10  -12    3    1   -5    8
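The scatterplot procedure can be tried without any plotting library. The sketch below is only an illustration: the function name `ascii_scatter`, the axis limits (borrowed from the tick marks on the slides) and the 2 mm Hg row height are all assumptions, and a real lab would use proper plotting software. Age is treated as the explanatory variable (x axis) and BP change as the response (y axis).

```python
# Data from the example: age in years, change in systolic BP in mm Hg.
ages = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp_change = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]


def ascii_scatter(xs, ys, x_min=25, x_max=75, y_min=-30, y_max=10, y_step=2):
    """Return text rows; each '*' marks one (x, y) pair (hypothetical helper)."""
    n_rows = (y_max - y_min) // y_step + 1
    n_cols = x_max - x_min + 1
    grid = [[" "] * n_cols for _ in range(n_rows)]
    for x, y in zip(xs, ys):
        row = (y_max - y) // y_step  # top row holds the largest y values
        col = x - x_min              # one column per year of age
        grid[row][col] = "*"
    rows = []
    for i, cells in enumerate(grid):
        y_label = y_max - i * y_step
        rows.append(f"{y_label:>4} |" + "".join(cells))
    rows.append("     +" + "-" * n_cols)
    rows.append(f"      {x_min}" + " " * (n_cols - 4) + f"{x_max}   (Age)")
    return rows


plot_rows = ascii_scatter(ages, bp_change)
print("\n".join(plot_rows))
```

Reading the printed grid from left to right, the marks drift downward, which is the negative direction discussed under Pattern.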


Example: Scatterplot (cont)

[Figure: scatterplot of BP (y axis, -30 to 10) against Age (x axis, 25 to 75)]


Pattern

• Form
• Direction
• Strength
• Outliers


Pattern

Linear

Nonlinear

No relationship


Outliers


Example: Scatterplot (cont)

[Figure: scatterplot of BP (y axis, -30 to 10) against Age (x axis, 25 to 75)]


Scatterplot with Categorical Variables

http://statland.org/Software_Help/Minitab/MTBpul2.htm


I am a Turkey, not Tukey!
Thank you for not eating me!


2.3: Correlation - Goals

• Be able to use (and calculate) the correlation to describe the direction and strength of a linear relationship.
• Be able to recognize the properties of the correlation.
• Be able to determine when (and when not) you can use correlation to measure the association.


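Sample correlation, r (Pearson’s Sample Correlation Coefficient)

\[
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2}\,\sqrt{\sum (y_i - \bar{y})^2}}
  = \frac{SS_{xy}}{\sqrt{SS_{xx}}\,\sqrt{SS_{yy}}}
  = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,s_x s_y}
  = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)
\]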
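As a check on the formula, here is a minimal sketch that computes r for the age and blood pressure data through the sums of squares. The helper names `sums_of_squares` and `correlation` are illustrative choices, not something the slides define.

```python
from math import sqrt

# Data from the scatterplot example: age in years, change in BP in mm Hg.
ages = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp_change = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]


def sums_of_squares(xs, ys):
    """Return (SS_xx, SS_yy, SS_xy), using deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return ss_xx, ss_yy, ss_xy


def correlation(xs, ys):
    """Pearson's sample correlation: r = SS_xy / (sqrt(SS_xx) * sqrt(SS_yy))."""
    ss_xx, ss_yy, ss_xy = sums_of_squares(xs, ys)
    return ss_xy / (sqrt(ss_xx) * sqrt(ss_yy))


r = correlation(ages, bp_change)
print(f"r = {r:.5f}")
```

The result agrees with the value r = -0.76951 quoted later in the regression example.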


Sum of Squares
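The body of this slide did not survive extraction. Matching the SS notation used in the correlation formula, the sums of squares are presumably the usual deviation sums; the shortcut forms on the right are standard algebra rather than something taken from the slide.

```latex
\[
SS_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2, \qquad
SS_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2,
\]
\[
SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}.
\]
```

With these definitions the sample variances are \(s_x^2 = SS_{xx}/(n-1)\) and \(s_y^2 = SS_{yy}/(n-1)\), which is what links the different forms of r.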


Properties of Correlation

• r > 0 ==> positive association
  r < 0 ==> negative association
• r is always a number between -1 and 1.
• The strength of the linear relationship increases as |r| moves to 1.
  – |r| = 1 only occurs if there is a perfect linear relationship.
  – r = 0 ==> x and y are uncorrelated.


Positive/Negative Correlation


Example: Positive/Negative Correlation

1) Would the correlation between the age of a used car and its price be positive or negative? Why?

2) Would the correlation between the weight of a vehicle and miles per gallon be positive or negative? Why?



Variety of Correlation Values


Value of r



Comments about Correlation

• Correlation makes no distinction between explanatory and response variables.
• r has no units and does not change when the units of x and y change.
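Both comments can be checked numerically. The sketch below swaps the roles of x and y and then rescales the units (years to months, mm Hg to kPa); the conversion factor and the helper name `correlation` are assumptions made for the illustration.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson's sample correlation from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return ss_xy / (sqrt(ss_xx) * sqrt(ss_yy))


ages_years = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp_mmhg = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]

# Same measurements expressed in different units.
ages_months = [12 * a for a in ages_years]
bp_kpa = [0.133322 * b for b in bp_mmhg]  # assumed factor: 1 mm Hg is about 0.133322 kPa

r_original = correlation(ages_years, bp_mmhg)
r_swapped = correlation(bp_mmhg, ages_years)   # x and y exchanged
r_rescaled = correlation(ages_months, bp_kpa)  # units changed

print(r_original, r_swapped, r_rescaled)
```

All three values agree up to floating-point rounding, because positive rescaling multiplies the numerator and denominator of r by the same factor.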


Cautions about Correlation

• Correlation requires that both variables be quantitative.
• Correlation measures the strength of LINEAR relationships only.
• The correlation is not resistant to outliers.
• Correlation is not a complete summary of bivariate data.


Datasets with r = 0.816


Questions about Correlation

• Does a small r indicate that x and y are NOT associated?

• Does a large r indicate that x and y are linearly associated?
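Two small made-up data sets suggest that the answer to both questions is "not necessarily". The numbers below are constructed purely for illustration and do not come from the slides.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson's sample correlation from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return ss_xy / (sqrt(ss_xx) * sqrt(ss_yy))


# Case 1: y is completely determined by x (y = x^2), yet r is zero
# because the relationship is symmetric about x = 0 and not linear.
xs_symmetric = [-3, -2, -1, 0, 1, 2, 3]
ys_parabola = [x ** 2 for x in xs_symmetric]
r_parabola = correlation(xs_symmetric, ys_parabola)

# Case 2: the same curve restricted to positive x gives a large r,
# even though the relationship is still curved rather than linear.
xs_positive = list(range(1, 11))
ys_curve = [x ** 2 for x in xs_positive]
r_curve = correlation(xs_positive, ys_curve)

print(f"symmetric parabola: r = {r_parabola:.3f}")
print(f"one-sided curve:    r = {r_curve:.3f}")
```

A small r rules out only a linear association, and a large r does not by itself show that a straight line is the right form, which is why plotting comes first.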


2.4: Least-Squares Regression - Goals

• Be able to generally describe the method of ‘Least Squares Regression’.
• Be able to calculate and interpret the regression line.
• Using the least-squares regression line, be able to predict the value of y for any appropriate value of x.
• Be able to calculate r².
• Be able to explain the meaning of r².
  – Be able to discern what r² does NOT explain.


Regression Line

A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.

We can use a regression line to predict the value of y for a given value of x.


Idea of Linear Regression


Linear Regression

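\[
b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{SS_{xy}}{SS_{xx}} = r\,\frac{s_y}{s_x}
\]

\[
b_0 = \bar{y} - b_1 \bar{x}
\]

\[
\hat{y} = b_0 + b_1 x
\]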


Example: Regression Line

[Figure: scatterplot of BP (y axis, -30 to 10) against Age (x axis, 25 to 75) with the fitted regression line]

ŷ = 20.11 - 0.526x


Example: Regression Line

The following data were collected to determine the relationship between age and the change in systolic blood pressure (BP, mm Hg) after 24 hours in response to a particular treatment.

x̄ = 52.727, ȳ = -7.636, s_x = 14.164, s_y = 9.688, r = -0.76951

b) What is the regression line for these data?
c) What would the predicted value be for someone who is 51 years old?

Obs    1    2    3    4    5    6    7    8    9   10   11
Age   70   51   65   70   48   70   45   48   35   48   30
BP   -28  -10   -8  -15   -8  -10  -12    3    1   -5    8
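Parts b) and c) can be worked from the raw data. This is a sketch under the usual least-squares formulas (slope SS_xy / SS_xx, intercept ȳ minus slope times x̄); the function name `least_squares_line` is an illustrative choice.

```python
ages = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp_change = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]


def least_squares_line(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx       # slope
    b0 = y_bar - b1 * x_bar  # intercept
    return b0, b1


b0, b1 = least_squares_line(ages, bp_change)
predicted_51 = b0 + b1 * 51  # part c): prediction for a 51-year-old

print(f"b) y-hat = {b0:.2f} + ({b1:.3f}) x")
print(f"c) predicted BP change at age 51: {predicted_51:.2f} mm Hg")
```

The fitted coefficients agree, up to rounding, with the line printed on the regression-line figure, and age 51 lies inside the observed range of 30 to 70, so the prediction is not an extrapolation.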


Facts about Least-Squares Regression

1. Slope: A change of one standard deviation in x corresponds to a change of r standard deviations in y.
2. Intercept: the predicted value of y when x = 0.
3. The line passes through the point (x̄, ȳ).
4. There is an inherent difference between x and y: the regression of y on x is not the same line as the regression of x on y.
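Facts 1 and 3 can be confirmed on the blood pressure data. The sketch below uses the standard library's `statistics` module for the means and sample standard deviations; the variable names are illustrative.

```python
from statistics import mean, stdev

ages = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp_change = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]

x_bar, y_bar = mean(ages), mean(bp_change)
s_x, s_y = stdev(ages), stdev(bp_change)  # sample standard deviations (n - 1)

ss_xx = sum((x - x_bar) ** 2 for x in ages)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, bp_change))

b1 = ss_xy / ss_xx       # least-squares slope
b0 = y_bar - b1 * x_bar  # least-squares intercept
r = ss_xy / ((len(ages) - 1) * s_x * s_y)

# Fact 1: the slope equals r * s_y / s_x, so moving one s_x in x
# moves the prediction by r * s_y in y.
slope_from_r = r * s_y / s_x

# Fact 3: plugging x-bar into the line returns y-bar.
prediction_at_mean = b0 + b1 * x_bar

print(b1, slope_from_r)
print(prediction_at_mean, y_bar)
```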


r²

• Coefficient of determination.
• Fraction of the variation of the values of y that is explained by the least-squares regression of y on x.


Example: Regression Line

The following data were collected to determine the relationship between age and the change in systolic blood pressure (BP, mm Hg) after 24 hours in response to a particular treatment.

d) What percent of the variation in y is explained by the regression line?

Obs    1    2    3    4    5    6    7    8    9   10   11
Age   70   51   65   70   48   70   45   48   35   48   30
BP   -28  -10   -8  -15   -8  -10  -12    3    1   -5    8
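For part d), r² can be computed two ways: by squaring r, or as one minus the ratio of the residual sum of squares to SS_yy. The sketch below does both so the two routes can be compared; the variable names are illustrative.

```python
ages = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp_change = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(bp_change) / n
ss_xx = sum((x - x_bar) ** 2 for x in ages)
ss_yy = sum((y - y_bar) ** 2 for y in bp_change)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, bp_change))

# Route 1: square the correlation.
r_squared = ss_xy ** 2 / (ss_xx * ss_yy)

# Route 2: fraction of the variation in y left after fitting the line.
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(ages, bp_change))
explained_fraction = 1 - sse / ss_yy

print(f"r^2 = {r_squared:.3f}, i.e. about {100 * r_squared:.1f}% of the variation in BP change")
```

Both routes give roughly 59%, so a little over half of the variation in BP change is accounted for by the linear relationship with age.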


Beware of interpretation of r²

• Linearity
• Outliers
• Good prediction


2.5: Cautions about Correlation and Regression - Goals

• Be able to calculate the residuals.
• Be able to use a residual plot to assess the fit of a regression line.
• Be able to identify outliers and influential observations by looking at scatterplots and residual plots.
• Be able to determine when you can predict a new value.
• Be able to identify lurking variables that can influence the relationship between two variables.
• Be able to explain the difference between association and causation.


Residuals


Example: Regression Line

The following data were collected to determine the relationship between age and the change in systolic blood pressure (BP, mm Hg) after 24 hours in response to a particular treatment.

e) What is the residual for someone who is 51 years old?

Obs    1    2    3    4    5    6    7    8    9   10   11
Age   70   51   65   70   48   70   45   48   35   48   30
BP   -28  -10   -8  -15   -8  -10  -12    3    1   -5    8
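Taking a residual to be the observed value minus the predicted value (the usual definition, assumed here because the Residuals slide lost its content), part e) can be sketched as follows. Observation 2 is the 51-year-old, with an observed BP change of -10.

```python
ages = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp_change = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(bp_change) / n
ss_xx = sum((x - x_bar) ** 2 for x in ages)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, bp_change))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar

# residual = observed y - predicted y, one per observation
residuals = [y - (b0 + b1 * x) for x, y in zip(ages, bp_change)]

residual_age_51 = residuals[ages.index(51)]  # observation 2
print(f"residual at age 51: {residual_age_51:.2f} mm Hg")
print(f"sum of residuals:   {sum(residuals):.10f}")
```

The negative residual means this person's blood pressure dropped more than the line predicts. The residuals of a least-squares line always sum to zero, up to rounding, which is a handy check on the arithmetic.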


Residual Plots

[Figure: two residual plots. Panels: Good; Linearity violation]


Residual Plots

[Figure: two residual plots. Panels: Good; Constant variance violation]


Residual Plots – BP

[Figure: two residual plots, Residual (y axis) against Age (x axis, 25 to 75). Panels: Original; Y outlier]


Residual Plots – BP

[Figure: two residual plots, Residual (y axis) against Age (x axis). Panels: Original (Age 25 to 75); X outlier (Age 25 to 100)]


Influential Point

• An outlier is an observation that lies outside the overall pattern of the other observations.

• An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation.
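One way to probe influence is to refit the line with an observation left out and compare slopes. The sketch below drops observation 1 (age 70, BP change -28) from the blood pressure data; choosing that point, and the helper name `slope`, are assumptions made for the illustration, and whether the resulting change counts as "marked" is a judgment call.

```python
ages = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]
bp_change = [-28, -10, -8, -15, -8, -10, -12, 3, 1, -5, 8]


def slope(xs, ys):
    """Least-squares slope b1 = SS_xy / SS_xx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return ss_xy / ss_xx


slope_all = slope(ages, bp_change)

# Refit without observation 1 (index 0): age 70, BP change -28.
slope_without_first = slope(ages[1:], bp_change[1:])

print(f"slope with all 11 observations: {slope_all:.3f}")
print(f"slope without observation 1:    {slope_without_first:.3f}")
```

Removing that single point moves the slope from about -0.53 to about -0.40, a change of roughly a quarter of its size.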


Cautions about Correlation and Regression: Extrapolation

[Figure: scatterplot of BP (y axis, -30 to 10) against Age (x axis, 25 to 75)]
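A simple guard against extrapolation is to refuse predictions outside the range of observed x values. The sketch below uses the coefficients as printed on the regression-line slide (20.11 and -0.526); the function name `predict_bp_change` and the choice to raise an error are hypothetical design decisions, not part of the slides.

```python
ages = [70, 51, 65, 70, 48, 70, 45, 48, 35, 48, 30]

# Coefficients as printed on the regression-line slide (rounded).
B0, B1 = 20.11, -0.526


def predict_bp_change(age, observed_ages=ages):
    """Predict BP change, but only inside the observed age range."""
    low, high = min(observed_ages), max(observed_ages)
    if not low <= age <= high:
        raise ValueError(
            f"age {age} is outside the observed range {low} to {high}; "
            "the line is not supported by data there"
        )
    return B0 + B1 * age


inside = predict_bp_change(51)  # within 30 to 70, so this is allowed

try:
    predict_bp_change(100)      # far beyond the oldest subject
    extrapolation_blocked = False
except ValueError as err:
    extrapolation_blocked = True
    print(err)

print(f"prediction at age 51: {inside:.2f} mm Hg")
```

Nothing in the data says the straight-line pattern continues past age 70 or below age 30, so a prediction there rests on the model alone.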


Cautions about Correlation and Regression:

• Both describe linear relationships.
• Both are affected by outliers.
• Always PLOT the data.
• Beware of extrapolation.
• Beware of lurking variables.
  – Lurking variables are important in the study, but are not included.
  – Confounding variables confuse the issue.
• Correlation (association) does NOT imply causation!


Lurking Variables

In each of these cases, identify the lurking variable.

1. For children, there is an extremely strong correlation between shoe size and math scores.

2. There is a very strong correlation between ice cream sales and number of deaths by drowning.

3. There is a very strong correlation between number of churches in a town and number of bars in a town.
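To see how a lurking variable can manufacture a correlation, here is a toy construction for the first case. Every number is invented for illustration: age drives both shoe size and math score, while within any single age the two are built to be unrelated.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson's sample correlation from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return ss_xy / (sqrt(ss_xx) * sqrt(ss_yy))


# Hypothetical children aged 6 to 12, four per age. Shoe size and math
# score each track age plus a small offset; the offsets are balanced so
# that shoe size and score are unrelated among children of the same age.
offsets = [(-1, -5), (-1, 5), (1, -5), (1, 5)]
children = [
    (age, age + shoe_off, 10 * age + score_off)
    for age in range(6, 13)
    for shoe_off, score_off in offsets
]

shoe_sizes = [c[1] for c in children]
math_scores = [c[2] for c in children]
r_all_ages = correlation(shoe_sizes, math_scores)

nine_year_olds = [c for c in children if c[0] == 9]
r_within_age_9 = correlation(
    [c[1] for c in nine_year_olds], [c[2] for c in nine_year_olds]
)

print(f"all ages together: r = {r_all_ages:.3f}")
print(f"nine-year-olds:    r = {r_within_age_9:.3f}")
```

Across all ages the correlation is strong, but holding age fixed it vanishes, which is the signature of age acting as the lurking variable.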


What is the lurking variable?

http://www.forbes.com/sites/erikaandersen/2012/03/23/true-fact-the-lack-of-pirates-is-causing-global-warming/


2.6: Data Analysis for Two-Way Tables - Goals

Statements
• The distribution of two random variables (bivariate) is called a joint distribution.
• Two random variables are similar to two events in that they can have conditional probabilities and be independent of each other.

Goal
• Interpret examples of Simpson’s paradox.


Simpson’s Paradox

An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson’s paradox.


Simpson’s Paradox

Consider the acceptance rates for the following groups of men and women who applied to college.

Counts     Accepted   Not accepted   Total
Men           198         162         360
Women          88         112         200
Total         286         274         560

Percents   Accepted   Not accepted
Men           55%         45%
Women         44%         56%


Simpson’s Paradox

• Business School

• Art School

Counts     Accepted   Not accepted   Total
Men            18         102         120
Women          24          96         120
Total          42         198         240

Percents   Accepted   Not accepted
Men           15%         85%
Women         20%         80%

Counts     Accepted   Not accepted   Total
Men           180          60         240
Women          64          16          80
Total         244          76         320

Percents   Accepted   Not accepted
Men           75%         25%
Women         80%         20%
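The reversal can be verified directly from the counts. In the sketch below the two school tables are labelled "table 1" and "table 2" in the order they appear, because the extracted slide does not show which table belongs to the Business School and which to the Art School; the helper name `acceptance_rate` is illustrative.

```python
# (accepted, not accepted) counts, in the order the tables appear.
tables = {
    "table 1": {"Men": (18, 102), "Women": (24, 96)},
    "table 2": {"Men": (180, 60), "Women": (64, 16)},
}


def acceptance_rate(accepted, not_accepted):
    """Fraction of applicants accepted."""
    return accepted / (accepted + not_accepted)


# Acceptance rates within each school.
per_table = {
    name: {sex: acceptance_rate(*counts) for sex, counts in table.items()}
    for name, table in tables.items()
}

# Acceptance rates after pooling both schools.
combined = {}
for sex in ("Men", "Women"):
    accepted = sum(table[sex][0] for table in tables.values())
    not_accepted = sum(table[sex][1] for table in tables.values())
    combined[sex] = acceptance_rate(accepted, not_accepted)

for name, rates in per_table.items():
    print(name, {sex: f"{rate:.0%}" for sex, rate in rates.items()})
print("combined", {sex: f"{rate:.0%}" for sex, rate in combined.items()})
```

Women have the higher acceptance rate in each school separately, yet men have the higher rate overall. The pooled totals hide that most men applied to the school that accepts most of its applicants, while most women applied to the one that accepts few.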


2.7: The Question of Causation - Goals

• Be able to explain an association.
  – Causation
  – Common response
  – Confounding variables
• Apply the criteria for establishing causation.


Causation

Association does not mean causation!


Establishing Causation

Perform an experiment!

What do we need for causation?

1. The association is strong.
2. The association is consistent.
   – The connection happens in repeated trials.
   – The connection happens under varying conditions.
3. Higher doses are associated with stronger responses.
4. The alleged cause precedes the effect.
5. The alleged cause is plausible.