10/03/06 Lecture 10

STAT 155 Introductory Statistics

Lecture 10: Cautions about Regression and Correlation, Causation

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Review

• Least-Squares Regression Lines
• Equation and interpretation of the line
• Prediction using the line
• Correlation and Regression
• Coefficient of Determination

Regression Diagnostics

• Look at residuals (errors):
– A residual is the difference between an observed value of the response variable and the value predicted by the regression line, i.e., $\text{residual} = y - \hat{y}$.
– The sum of the least-squares residuals is always zero. Why?
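One way to see why: the least-squares intercept equals ȳ − b·x̄, so the fitted line passes through the point (x̄, ȳ) and the residuals average out to zero. The sketch below checks this numerically. The data values and the helper name `fit_line` are made up for illustration and do not come from the lecture.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # forces the line through (x_bar, y_bar)
    return intercept, slope


# Made-up illustrative data (not from the lecture).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]

a, b = fit_line(xs, ys)

# residual = observed y minus predicted y-hat
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
residual_sum = sum(residuals)

# The sum is zero up to floating-point rounding.
print(abs(residual_sum) < 1e-9)
```

Any other data set should give the same result, because the zero sum follows from how the intercept is chosen rather than from these particular numbers.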

Residual Plots

• A residual plot is a scatterplot of the regression residuals against the explanatory variable.

• Residual plots help us assess the fit of a regression line.

Age vs. Height

Residual Plot

• If the regression line catches the overall pattern of the data, there should be no pattern in the residuals.

[Figure: residual plot labeled "totally random"]

[Figure panels labeled "nonlinear" and "nonconstant variation"]
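A small sketch of the "nonlinear" case, assuming hypothetical curved data y = x² that is not from the lecture. When a straight line is fitted to such data, the residuals show a clear pattern instead of random scatter. The helper `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical curved data: y = x^2 for x = 0, 1, ..., 10.
xs = [float(x) for x in range(11)]
ys = [x ** 2 for x in xs]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Text stand-in for a residual plot: the sign of each residual, ordered by x.
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)  # ++-------++
```

The residuals are positive at both ends and negative in the middle. This systematic curve is the kind of pattern that signals a straight line is the wrong model.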

Diabetes Patient: FPG vs. HbA

• FPG: fasting plasma glucose.
• HbA: percent of red blood cells that have a glucose molecule attached.
• Both measure blood glucose.
• We expect a positive association.
• 18 subjects, r = 0.4819.
• See the scatterplot on the next page.

[Figure: scatterplot, Diabetes Patient: FPG vs. HbA]

Outliers and Influential Observations

• An outlier is a point that lies outside the overall pattern of the other points.
– Outliers in the y direction have large residuals, but other outliers may not.
• An influential observation is a point whose removal would markedly change the fitted regression line.
– Outliers in the x direction are often influential points.
– But not always…

Diabetes Patient: FPG vs. HbA

Outliers & Influential Obs.

• Outliers in the y direction can be spotted from the residual plot.
• Influential points are more serious. They can be identified by fitting regression lines with and without those points.
– Cannot be identified from the residual plot alone.
– The scatterplot gives us some hint.
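The with/without comparison can be sketched in a few lines. The six data points below are made up for illustration and are not the diabetes data from the lecture: five points with almost no trend, plus one point far out in the x direction. The helper `fit_line` is an illustrative name.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up data: the last point lies far from the others in x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
ys = [2.0, 1.0, 3.0, 2.0, 2.0, 12.0]

a_all, b_all = fit_line(xs, ys)            # fit with every point
a_wo, b_wo = fit_line(xs[:-1], ys[:-1])    # fit without the far point

# Residuals from the fit that includes the far point.
residuals = [y - (a_all + b_all * x) for x, y in zip(xs, ys)]
far_point_residual = abs(residuals[-1])
largest_other_residual = max(abs(r) for r in residuals[:-1])

print(f"slope with the far point:     {b_all:.3f}")
print(f"slope without the far point:  {b_wo:.3f}")
print(f"|residual| of the far point:  {far_point_residual:.3f}")
print(f"largest |residual| elsewhere: {largest_other_residual:.3f}")
```

In this toy data set the slope moves from about 0.10 to about 0.78 when the far point is included. That point's own residual is still smaller than the largest residual among the other points, because the point pulls the line toward itself. This is why a residual plot alone can miss an influential observation.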

Cautions about correlation and regression

• Linear only
• DO NOT extrapolate
• Not resistant
• Beware lurking variables
• Beware correlations based on averaged data
• The restricted-range problem

Lurking Variable

• A lurking (hidden) variable is a variable that has an important effect on the relationship among the variables in a study, but is not included among the variables being studied.

• Examples:
– SAT scores and college grades

• Lurking variable: IQ

Lurking variables can create nonsense correlations.

• For the world’s nations, let x be the number of TVs/person and y be the average life expectancy.
• A high positive correlation: nations with more TV sets have higher life expectancies.
– Could we lengthen the lives of people in Rwanda by shipping them more TVs?
• Lurking variable: wealth of the nation
– Rich nations: more TV sets.
– Rich nations: longer life expectancies because of better nutrition, clean water, and better health care.
• There is no cause-and-effect tie between TV sets and length of life.
• Association vs. causation.
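A simulated sketch of this situation, using made-up numbers rather than real national data. The hypothetical variables `tvs` and `life` each depend on `wealth` but not on each other. They still correlate strongly, and the correlation largely disappears once the linear effect of wealth is removed from both. The helper names are illustrative.

```python
import random


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals_after_fit(xs, ys):
    """Residuals of ys after fitting a least-squares line on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return [(y - y_bar) - slope * (x - x_bar) for x, y in zip(xs, ys)]


random.seed(0)  # fixed seed so the run is repeatable
n = 1000

# Simulated, standardized quantities (assumptions, not real measurements).
wealth = [random.gauss(0, 1) for _ in range(n)]
tvs = [w + random.gauss(0, 0.5) for w in wealth]   # depends on wealth only
life = [w + random.gauss(0, 0.5) for w in wealth]  # depends on wealth only

r_raw = correlation(tvs, life)
r_adjusted = correlation(
    residuals_after_fit(wealth, tvs),
    residuals_after_fit(wealth, life),
)

print(f"r(tvs, life), ignoring wealth:       {r_raw:.2f}")
print(f"r(tvs, life), wealth effect removed: {r_adjusted:.2f}")
```

In real studies the lurking variable is often unmeasured, so this adjustment is usually not available. The simulation only shows how a common cause can produce a strong correlation with no causal tie between x and y.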

Misleading correlation (two clusters)

Beware correlations based on averaged data

• A correlation based on averages over many individuals is usually higher than the correlation between the same variables based on data for individuals.

• Age vs. Height
• (Basketball) score % vs. practice time
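A simulated sketch of the age vs. height case, with made-up numbers that are not from the lecture. Individual heights vary a lot around the age trend, but the average height at each age does not, so the correlation computed from the averages is higher. The growth rate, noise level, and group size are all assumptions.

```python
import random


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(1)  # fixed seed so the run is repeatable

age_values = list(range(1, 11))  # ages 1..10 (hypothetical)
ages = []          # one entry per individual
heights = []       # one entry per individual
mean_heights = []  # one entry per age group

for age in age_values:
    # 50 simulated children per age, with large individual variation.
    group = [50 + 6 * age + random.gauss(0, 15) for _ in range(50)]
    ages.extend([age] * len(group))
    heights.extend(group)
    mean_heights.append(sum(group) / len(group))

r_individuals = correlation(ages, heights)
r_averages = correlation(age_values, mean_heights)

print(f"r from individual data: {r_individuals:.2f}")
print(f"r from group averages:  {r_averages:.2f}")
```

Averaging cancels most of the individual variation. The higher r describes how well age predicts the average height, not how well it predicts any one child's height.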

The restricted-range problem

• A restricted-range problem occurs when one does not get to observe the full range of the variables.

• When data suffer from restricted range, r and r² are lower than they would be if the full range could be observed.
• SAT scores vs. College GPA
– Princeton vs. Generic State College (Ex 2.22)
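A simulated sketch of the restricted-range effect. It uses made-up SAT-like and GPA-like numbers and does not reproduce Ex 2.22. The cutoff of 1200, the coefficients, and the noise level are all assumptions. The same relationship holds for everyone, yet r drops when only high scorers are observed.

```python
import random


def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


random.seed(2)  # fixed seed so the run is repeatable
n = 2000

# Simulated scores; values are not clipped to a real SAT or GPA scale.
sat = [random.gauss(1000, 200) for _ in range(n)]
gpa = [0.8 + 0.002 * s + random.gauss(0, 0.5) for s in sat]

# Restrict the range: keep only students above a hypothetical cutoff.
selective = [(s, g) for s, g in zip(sat, gpa) if s > 1200]
sat_sel = [s for s, _ in selective]
gpa_sel = [g for _, g in selective]

r_full = correlation(sat, gpa)
r_restricted = correlation(sat_sel, gpa_sel)

print(f"r over the full range:       {r_full:.2f}")
print(f"r over the restricted range: {r_restricted:.2f}")
```

The slope relating the two variables is the same in both groups. Only the spread of x shrinks, which leaves the noise relatively larger and pulls r and r² down.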

Causation vs. Association

• Some studies aim to establish causation.
• Examples of causation:
– Increased drinking of alcohol causes a decrease in coordination.
– Smoking and lung cancer.
• Examples of association:
– The above two examples.
– SAT scores and freshman-year GPA.

Association does not imply causation.

• An association between two variables x and y can reflect many types of relationship among x, y, and one or more lurking variables.

• An association between a predictor x and a response y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.

Explaining Association

Explaining Association: Causation

• Cause-and-effect
• Examples
– Amount of fertilizer and yield of corn
– Weight of a car and its MPG
– Dosage of a drug and the survival rate of the mice

Explaining Association: Common Response

• Lurking variables
• Both x and y change in response to changes in z, the lurking variable.
• There may not be a direct causal link between x and y.
• Examples:
– SAT scores vs. College GPA (IQ, Attitude)
– Monthly flow of money into stock mutual funds vs. rate of return for the stock market (Market Condition, Investor Attitude)

Explaining Association: Confounding

• Two variables are confounded when their effects on a response variable are mixed together.

• One explanatory variable may be confounded with other explanatory variables or lurking variables.

• Examples:
– More education leads to higher income.
• Family background…
– Religious people live longer.
• Lifestyle…

Establishing causation

• The only compelling method: Designed experiment (More in Chapter 3)

• Hot disputes:
– Does gun control reduce violent crime?
– Does meat consumption in your diet cause heart disease?
– Does smoking cause lung cancer?

Does smoking CAUSE lung cancer?

• causation: smoking causes lung cancer.
• common response: people who have a genetic predisposition to lung cancer also have a genetic predisposition to smoking.

• confounding: people who drink too much, don't exercise, eat unhealthy foods, etc. are more likely to get lung cancer as a result of their lifestyle. Such people may be more likely to be smokers as well.

Some guidelines when a designed experiment is impossible:

• strong association
• association consistent across various studies
• higher dose associated with stronger responses
• the cause precedes the effect in time
• plausibility

Take Home Message

• Residual Plots
• Outliers and Influential Observations
• Lurking Variables
• Cautions about Correlation and Regression
• Explaining associations:
– Causation
– Common response
– Confounding

• How to establish causation?