
Page 1: STAT 111 Introductory Statistics

STAT 111 Introductory Statistics

Lecture 3: Regression

May 20, 2004

Page 2: STAT 111 Introductory Statistics

Today’s Topics

• Regression line
– Fitting a line
– Prediction
– Least-squares
– Interpretation

• Correlation and regression

• Causation

• Transforming variables (briefly)

Page 3: STAT 111 Introductory Statistics

Review: The Scatterplot

• The scatterplot shows the relationship between two quantitative variables.

• It plots the observations of different individuals in a two-dimensional graph.

• Each point in a scatterplot corresponds to an observation of two variables of the same individual.

Page 4: STAT 111 Introductory Statistics

The Regression Line

• A regression line is a straight line that summarizes the linear relationship between two variables.

• It describes how a response variable y changes as an explanatory variable x changes.

• A regression line is often used as a model to predict the value of the response y for a given value of the explanatory variable x.

Page 5: STAT 111 Introductory Statistics

The Regression Line (cont.)

• We fit a line to data by drawing the line that comes as close as possible to the points.

• Once we have a regression line, we can predict the y for a specific value of x. Accuracy depends on how scattered the data are about the line.

• Using the regression line for prediction far outside the range of values of x used to obtain the line is called extrapolation. This is generally not advised, since such predictions will be inaccurate.

Page 6: STAT 111 Introductory Statistics

Example: Predicting SAT Math Scores using SAT Verbal Scores

• Making a regression line using JMP:Analyze → Fit Y by X → Put the response variable into Y, explanatory variable into X → Hit OK → Double-click the red triangle above the scatterplot → Fit line

• Mathematically, a straight line has an equation of the form y = a + bx, where b is the slope and a is the intercept. But how do we determine the value of these two numbers?

Page 7: STAT 111 Introductory Statistics

The Least-Squares Regression Line

• The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.

• Mathematically, the line is determined by minimizing

Σᵢ (yᵢ − (a + bxᵢ))²

Page 8: STAT 111 Introductory Statistics

The Least-Squares Regression Line (cont.)

• The equation of the least-squares regression line of y on x is ŷ = a + bx.

• The slope is determined using the formula b = r · (s_y / s_x), where s_y and s_x are the standard deviations of y and x.

• The intercept is calculated using a = ȳ − b·x̄.
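A minimal sketch of these formulas in Python with NumPy (the data here is made up for illustration, not the lecture's SAT file):

```python
import numpy as np

# Illustrative data, not the SAT data set used in lecture
x = np.array([480.0, 510.0, 530.0, 560.0, 600.0, 640.0])
y = np.array([520.0, 500.0, 560.0, 570.0, 610.0, 600.0])

r = np.corrcoef(x, y)[0, 1]              # correlation r
b = r * y.std(ddof=1) / x.std(ddof=1)    # slope: b = r * (s_y / s_x)
a = y.mean() - b * x.mean()              # intercept: a = y-bar - b * x-bar

# Cross-check against NumPy's own least-squares fit
b_check, a_check = np.polyfit(x, y, 1)
print(b, a)              # formula-based estimates
print(b_check, a_check)  # should agree up to rounding
```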

Page 9: STAT 111 Introductory Statistics

Interpreting the Regression Line

• The slope b tells us that along the regression line, a change of 1 unit in x corresponds to a change of b units in y.

• The least-squares regression line always passes through the point (x̄, ȳ).

• If both x and y are standardized variables, then the slope of the least-squares regression line will be r, and the line will pass through the origin (0, 0).
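A quick numerical check of this fact on simulated data (a sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Standardize both variables: subtract the mean, divide by the SD
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

slope, intercept = np.polyfit(zx, zy, 1)
print(slope, np.corrcoef(x, y)[0, 1])  # slope equals r
print(intercept)                       # intercept is ~0
```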

Page 10: STAT 111 Introductory Statistics

Interpreting the Regression Line (cont.)

• Since standard deviation can never be negative, the signs of r and b will always be the same.

• Hence, if our slope is positive, we have a positive association between our explanatory variable and our response.

• On the other hand, if our slope is negative, then we have a negative association between our explanatory variable and our response.

Page 11: STAT 111 Introductory Statistics

Example: SAT Scores Again

• In our SAT data, the math score is the response, and the verbal score is the explanatory variable. The least-squares regression line as reported by JMP is

math = 498.00765 + 0.3167866 verbal

• Hence, in the context of the SAT, if a student’s verbal score is 10 points higher, the predicted math score is higher by a little more than 3 points (10 × 0.3167866 ≈ 3.17).

Page 12: STAT 111 Introductory Statistics

Example: SAT Scores (cont.)

• Suppose we want to predict using our regression line a student’s math score given that his verbal score was 550.

• The predicted math score then would be

498.00765 + 0.3167866 (550) ≈ 672

• Remember not to extrapolate when you make your predictions.
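As a sketch, the prediction (with a guard against extrapolation) in Python; the coefficients are the ones JMP reported, while the 200 to 800 bounds are the SAT scale, standing in for the actual range of verbal scores in the sample:

```python
def predict_math(verbal, lo=200, hi=800):
    """Predict a math score from the fitted line.

    lo/hi are stand-in bounds; in practice, use the range of
    verbal scores actually observed in the data.
    """
    if not lo <= verbal <= hi:
        raise ValueError("verbal score outside fitted range: extrapolation")
    return 498.00765 + 0.3167866 * verbal

print(round(predict_math(550)))  # 672, as on the slide
```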

Page 13: STAT 111 Introductory Statistics

Example: SAT Scores (cont.)

• Now, suppose we instead wanted to use a regression line to predict verbal scores using math scores, and suppose that one student had a math score of 670.

• Naively, we would predict the verbal score by taking the inverse of our existing regression line, in which case we would predict a verbal score between 540 and 550.

• It is not quite as simple as this.

Page 14: STAT 111 Introductory Statistics

Example: SAT Scores (cont.)

• What we would need to do is re-fit the regression line using math scores as our explanatory variable and verbal scores as our response.

• The new regression line is (from JMP)

verbal = 408.37653 + 0.3901289 math

• So, our predicted verbal score given a math score of 670 would be

408.37653 + 0.3901289 (670) ≈ 670
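A simulation (illustrative data, not the lecture's) showing why refitting differs from inverting the first line: the slope from regressing x on y is r²/b, not 1/b.

```python
import numpy as np

rng = np.random.default_rng(1)
verbal = rng.normal(600, 60, size=500)
math = 0.3 * verbal + rng.normal(420, 50, size=500)

b_mv, a_mv = np.polyfit(verbal, math, 1)  # math on verbal
b_vm, a_vm = np.polyfit(math, verbal, 1)  # verbal on math (refit)

# Inverting the first line would give slope 1/b_mv; the refit slope
# is r**2 / b_mv, which is smaller in magnitude whenever |r| < 1.
print(1 / b_mv, b_vm)
```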

Page 15: STAT 111 Introductory Statistics

Correlation and Regression

• The square of the correlation, r², is the proportion of the variation in the data that is explained by our least-squares regression line.

• r² is always between 0 and 1.

• If r = ±0.7, then r² = 0.49, or about ½ of the variation.

• In our SAT data, r² = 0.1236 (it is the same for both regressions), so our regression line only captures about 12% of the response’s variation.

Page 16: STAT 111 Introductory Statistics

Understanding r²

• Let’s look at the SAT line (verbal as x, math as y) once again.

• The variance in our observed math values is (61.262875)² = 3753.14

• If the only variability in the observed math scores were due to the linear fit, then the math scores would lie exactly on our line.

• In other words, the math scores would be identical to our predicted math scores.

Page 17: STAT 111 Introductory Statistics

Understanding r² (cont.)

• After computing the predicted math scores, we have that the variance in our predicted values is (21.53698)² = 463.84

• If we divide the variance of our predicted values by the variance of our actual values, we have

463.84 / 3753.14 = .1236

• For least-squares regression, it is always true that r² gives us the variance of the predicted responses as a fraction of the variance of the actual responses.
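A sketch verifying this identity on simulated data (not the lecture's SAT file):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 0.5 * x + rng.normal(size=300)

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x                          # predicted responses

r = np.corrcoef(x, y)[0, 1]
ratio = y_hat.var(ddof=1) / y.var(ddof=1)  # var(predicted) / var(actual)
print(r**2, ratio)                         # the two values agree
```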

Page 18: STAT 111 Introductory Statistics

Diagnosis (How Good is our Model?)

• Although we are most interested in the overall pattern as described by the regression line, deviations from this pattern are also important.

• In the regression setting, the deviations we consider are the vertical distances from the actual points to the least-squares regression line.

• These distances represent the variation left in the response after fitting the line and are called residuals.

Page 19: STAT 111 Introductory Statistics

Residuals

• A residual is the difference between an observed value and the predicted value.

• Residual = observed y − predicted y = y − ŷ

• The sum of the residuals of a regression line is always equal to 0.

• A residual plot is a scatterplot of regression residuals against the explanatory variable and is used to assess the fit of a regression line.
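A minimal sketch of computing residuals and drawing a residual plot (simulated data; matplotlib assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 3 + 2 * x + rng.normal(size=100)

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)   # observed y minus predicted y
print(residuals.sum())        # ~0, up to floating-point error

plt.scatter(x, residuals)     # residual plot: residuals against x
plt.axhline(0, color="gray")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```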

Page 20: STAT 111 Introductory Statistics

Simplified Patterns of Least-squares Residuals

[Three schematic plots of residual versus x, one for each pattern: linear relationship, nonlinear relationship, and nonconstant prediction error.]

Page 21: STAT 111 Introductory Statistics

Outliers and Influential Observations

• An outlier is an observation that lies outside the overall pattern of the other observations.

• Points that are outliers in the y direction have large regression residuals, but that need not be the case for all outliers.

• An influential observation is one that would significantly change the regression line if removed. An outlier in the x direction is often influential for the least-squares regression line.
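A small simulation of an influential observation: a single point far out in the x direction pulls the least-squares slope toward itself.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(10, 1, size=30)
y = 2 * x + rng.normal(size=30)

# Add one point far out in the x direction, well off the line y = 2x
x_out = np.append(x, 30.0)
y_out = np.append(y, 40.0)   # the line would predict about 60 here

print(np.polyfit(x, y, 1))          # slope near 2
print(np.polyfit(x_out, y_out, 1))  # the single point drags the slope down
```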

Page 22: STAT 111 Introductory Statistics

Example: Age at First Word and Gesell Score

• Does the age at which a child begins to talk predict a later score on a test of mental ability?

• The age in months at which the first word was spoken and the score on an ability test taken much later were recorded for 21 children.

• Fitting a line to all data reveals a negative linear relationship: early talkers tend to have higher test scores than those who start talking later.

Page 23: STAT 111 Introductory Statistics

Example: First Word and Gesell Score (cont.)

[Scatterplot: Bivariate Fit of Score by Age, with observations 18 and 19 marked and two linear fits, one including and one excluding observation 18.]

Page 24: STAT 111 Introductory Statistics

Example: First Word and Gesell Score (cont.)

• In the scatterplot, we see that observations 18 and 19 are unusual.

• Observation 18 is far out in the x direction; observation 19 is far out in the y direction.

• The red line is the regression line we obtained by including 18; the green is obtained by excluding 18.

• 18 is pulling the line towards itself; hence it is influential.

Page 25: STAT 111 Introductory Statistics

Extreme Example: Random Data

[Scatterplot: Bivariate Fit of Column 2 by Column 3 (randomly generated data), with two linear fits.]

Page 26: STAT 111 Introductory Statistics

Causation vs Association

• Example of causation: Increased consumption of alcohol causes a decrease in coordination and reflexes.

• Example of association: A high SAT score in senior year of high school is typically associated with a high GPA in freshman year of college.

• In general, an association between an explanatory variable x and a response y is not sufficient evidence to prove that x causes y.

Page 27: STAT 111 Introductory Statistics

Causation vs Association (cont.)

• Examples:
– High SAT math scores tend to be accompanied by high SAT verbal scores, but does this mean a high math score causes a high verbal score?
– Nations in which people have easy access to the Internet tend to have higher life expectancies. Does better access to the Internet cause people to live longer?
– The divorce rate tends to be positively correlated with the quantity of bananas imported. Does importing more bananas cause more people to get divorced?

Page 28: STAT 111 Introductory Statistics

Lurking Variables

• A lurking variable is one that is not among the explanatory or response variables in a study, but may influence the interpretation of relationships among those variables.

• In each of our three cases mentioned previously, there is likely a lurking variable at work.

• Give an example of one for each of the scenarios.

Page 29: STAT 111 Introductory Statistics

Lurking Variables (cont.)

• Lurking variables can create “nonsense correlations”: associations that suggest that changing one variable causes changes in the other when no causal link actually exists.

• In addition, lurking variables can hide a true relationship between explanatory and response variables.

Page 30: STAT 111 Introductory Statistics

Causation

• In many cases, we wish to determine whether changes in an explanatory variable cause changes in the response variable.

• Even in the presence of strong association, it is difficult to decide whether this is due to a causal link.

• There are three main ways to explain an association between two variables.

Page 31: STAT 111 Introductory Statistics

Explaining Association

• The association between an explanatory and a response variable may be due to
– Causation, when there is a direct cause-and-effect link between these two variables.
– Common response, when there is a lurking variable whose changes cause both the explanatory variable and the response variable to change.
– Confounding, when there are multiple influences at work that are getting mixed up.

Page 32: STAT 111 Introductory Statistics

Explaining Association (cont.)

Page 33: STAT 111 Introductory Statistics

• Formally, two variables are considered confounded when their effects on a response variable cannot be distinguished from each other.

• Confounded variables can be either explanatory or lurking.

• Even a very strong association between two variables is not sufficient evidence that there is a cause-and-effect link between the variables.

• The best way to establish that an association is due to causation is with a carefully designed experiment – more on this later.

Page 34: STAT 111 Introductory Statistics

Transformations of Relationships

• In some situations, the values of a quantitative variable are quite spread out, with a few isolated points; the rest of the data is then compressed into a small part of the plot, making it difficult to examine.

• Situations like this suggest using a function of the original variable; for example, we might use a function that will shrink the distance between values. This is what we call transforming the data.

Page 35: STAT 111 Introductory Statistics

Transformations of Relationships (cont.)

• Transforming data changes the original scale of measurement. Our most common transformations are linear (˚F → ˚C, lb → kg).

• Linear transformations cannot straighten curved relationships, though; to do that, we need a nonlinear transformation (e.g., powers, exponentials, logarithms).

• The most common transformations of our explanatory variable x are power transformations of the form xᵖ.

Page 36: STAT 111 Introductory Statistics

Transformations of Relationships (cont.)

• We call a function f(x) monotone if its values move in only one direction as x increases.

• For positive values of x, power functions with positive p (and the logarithm function) are monotonic increasing and preserve the order of observations.

• For negative p, the power functions are monotonic decreasing and reverse the order of observations.

Page 37: STAT 111 Introductory Statistics

• If we believe that there is some mathematical model that describes our data, then transformations will be quite effective.

• For example, the exponential growth model y = a · bˣ can be written as a linear model if we take the logarithm of y (log y = log a + x log b).

• On the other hand, a power law growth model y = a · xᵖ can be written as a linear model if we take the logarithm of both x and y (log y = log a + p log x).
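A sketch of both linearizations with NumPy (simulated data with multiplicative noise):

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(1.0, 10.0, 50)

# Exponential growth y = a * b**x: linear in x after taking log of y
y_exp = 3.0 * 1.4**x * rng.lognormal(0.0, 0.05, size=x.size)
slope, intercept = np.polyfit(x, np.log(y_exp), 1)
print(np.exp(intercept), np.exp(slope))  # recovers a ~ 3 and b ~ 1.4

# Power law y = a * x**p: linear after taking logs of both x and y
y_pow = 3.0 * x**2.5 * rng.lognormal(0.0, 0.05, size=x.size)
p, log_a = np.polyfit(np.log(x), np.log(y_pow), 1)
print(np.exp(log_a), p)                  # recovers a ~ 3 and p ~ 2.5
```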

Page 38: STAT 111 Introductory Statistics

Transformations of Relationships (cont.)

• In practice, our decision to make a transformation is governed by what we know about the data.

• This also holds true in terms of what type of transformation we decide to make.

• For example, animal populations and values of investments are often well described by an exponential growth model, though we do not always know the values of the parameters.