Linear Regression (LSRL)

Upload: ainsley-shark

Post on 14-Dec-2015


Category:

Documents



TRANSCRIPT

Page 1: Linear Regression (LSRL). Bivariate data x – variable: is the independent or explanatory variable y- variable: is the dependent or response variable Use

Linear Regression (LSRL)

Page 2

Bivariate data

• x-variable: the independent or explanatory variable

• y-variable: the dependent or response variable

• Use x to predict y

Page 3

$\hat{y} = a + bx$

b is the slope
– it is the amount by which y increases when x increases by 1 unit

a is the y-intercept
– it is the height of the line when x = 0
– in some situations, the y-intercept has no meaning

$\hat{y}$ (y-hat) means the predicted y

Be sure to put the hat on the y

Page 4

Least Squares Regression Line (LSRL)

• The line that gives the best fit to the data set

• The line that minimizes the sum of the squares of the deviations from the line

Page 5

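Points: (0,0), (3,10), (6,2)

Candidate line: $\hat{y} = .5x + 4$

At (0,0): $\hat{y}$ = .5(0) + 4 = 4, so $y - \hat{y}$ = 0 - 4 = -4

At (3,10): $\hat{y}$ = .5(3) + 4 = 5.5, so $y - \hat{y}$ = 10 - 5.5 = 4.5

At (6,2): $\hat{y}$ = .5(6) + 4 = 7, so $y - \hat{y}$ = 2 - 7 = -5

Sum of the squares = 61.25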

Page 6

Points: (0,0), (3,10), (6,2)

Use a calculator to find the line of best fit: $\hat{y} = \frac{1}{3}x + 3$

Find $y - \hat{y}$: -3, 6, -3

Sum of the squares = 54

What is the sum of the deviations from the line?

Will it always be zero?

The line that minimizes the sum of the squares of the deviations from the line is the LSRL.
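The comparison on the last two slides can be checked numerically. The sketch below is only an illustration (the function names are made up for this example): it computes the deviations, their sum, and the sum of their squares for the candidate line ŷ = .5x + 4 and for the calculator's line ŷ = (1/3)x + 3.

```python
# The three points used on the slides.
points = [(0, 0), (3, 10), (6, 2)]

def deviations(intercept, slope, pts):
    """Vertical deviations y - y_hat for the line y_hat = intercept + slope * x."""
    return [y - (intercept + slope * x) for x, y in pts]

def sum_of_squares(devs):
    """Sum of the squared deviations."""
    return sum(d ** 2 for d in devs)

guess_devs = deviations(4, 0.5, points)    # candidate line y_hat = .5x + 4
lsrl_devs = deviations(3, 1 / 3, points)   # calculator line y_hat = (1/3)x + 3

sse_guess = sum_of_squares(guess_devs)
sse_lsrl = sum_of_squares(lsrl_devs)

print("candidate line: deviations", guess_devs, "sum of squares", sse_guess)
print("LSRL:           deviations", lsrl_devs, "sum of squares", sse_lsrl)
print("sum of deviations:", sum(guess_devs), "(candidate) vs", sum(lsrl_devs), "(LSRL)")
```

The deviations sum to zero for the LSRL but not for the candidate line, which answers the question on the slide: the sum is zero for the least-squares line, not for every line.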

Page 7

Interpretations

Slope: For each unit increase in x, there is an approximate increase/decrease of b in y.

Correlation coefficient: There is a [direction], [strength], [type] association between x and y.

Page 8

The ages (in months) and heights (in inches) of seven children are given.

x 16 24 42 60 75 102 120

y 24 30 35 40 48 56 60

Find the LSRL.

Interpret the slope and correlation coefficient in the context of the problem.

Page 9

Correlation coefficient:

There is a strong, positive, linear association between the age and height of children.

Slope: For an increase in age of one month, there is an approximate increase of 0.34 inches in the height of children.
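The slides expect a calculator for this step. As a rough equivalent, here is a minimal Python sketch that fits the LSRL to the seven children and computes r from the usual sums of squares. The helper name `fit_lsrl` is an assumption made for this example, not something from the slides.

```python
from math import sqrt

age_months = [16, 24, 42, 60, 75, 102, 120]
height_in = [24, 30, 35, 40, 48, 56, 60]

def fit_lsrl(xs, ys):
    """Return (a, b, r) for the least-squares line y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = y_bar - b * x_bar      # y-intercept
    r = sxy / sqrt(sxx * syy)  # correlation coefficient
    return a, b, r

a, b, r = fit_lsrl(age_months, height_in)
print(f"y_hat = {a:.2f} + {b:.3f}x, r = {r:.3f}")
```

The slope comes out near 0.34 inches per month, and r is close to 1, which matches the "strong, positive, linear" wording of the interpretation.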

Page 10

The ages (in months) and heights (in inches) of seven children are given.

x 16 24 42 60 75 102 120

y 24 30 35 40 48 56 60

Predict the height of a child who is 4.5 years old.

Predict the height of someone who is 20 years old.

Page 11

Extrapolation

• The LSRL should not be used to predict y for values of x outside the data set.

• It is unknown whether the pattern observed in the scatterplot continues outside this range.
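To make the extrapolation warning concrete, the sketch below predicts the height at 4.5 years (54 months) and at 20 years (240 months) from the children's data, flagging any x outside the observed range. The helper names `predict` and `in_data_range` are assumptions made for illustration only.

```python
age_months = [16, 24, 42, 60, 75, 102, 120]
height_in = [24, 30, 35, 40, 48, 56, 60]

# Fit the least-squares line y_hat = a + b*x.
n = len(age_months)
x_bar = sum(age_months) / n
y_bar = sum(height_in) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(age_months, height_in)) / sum(
    (x - x_bar) ** 2 for x in age_months
)
a = y_bar - b * x_bar

def predict(x):
    """Predicted height (inches) at age x (months)."""
    return a + b * x

def in_data_range(x):
    """True when x lies between the smallest and largest observed ages."""
    return min(age_months) <= x <= max(age_months)

for years in (4.5, 20):
    months = years * 12
    note = "" if in_data_range(months) else "  <- extrapolation, not trustworthy"
    print(f"{years} years ({months:.0f} months): {predict(months):.1f} inches{note}")
```

The 20-year prediction comes out at more than 100 inches, which shows why the linear pattern cannot be assumed to continue beyond 120 months.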

Page 12

The ages (in months) and heights (in inches) of seven children are given.

x 16 24 42 60 75 102 120

y 24 30 35 40 48 56 60

Calculate $\bar{x}$ & $\bar{y}$.

Plot the point ($\bar{x}$, $\bar{y}$) on the LSRL.

Will this point always be on the LSRL?
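A quick numerical check of that question, as a sketch using the children's data: because the intercept is computed as a = ȳ - b·x̄, plugging x̄ into the line gives back ȳ, so the point of means lies on the LSRL.

```python
age_months = [16, 24, 42, 60, 75, 102, 120]
height_in = [24, 30, 35, 40, 48, 56, 60]

n = len(age_months)
x_bar = sum(age_months) / n
y_bar = sum(height_in) / n

# Least-squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(age_months, height_in)) / sum(
    (x - x_bar) ** 2 for x in age_months
)
a = y_bar - b * x_bar

# Predicted y at x_bar: equals y_bar up to floating-point rounding.
y_hat_at_mean = a + b * x_bar
print(f"x_bar = {x_bar:.3f}, y_bar = {y_bar:.3f}, y_hat(x_bar) = {y_hat_at_mean:.3f}")
```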

Page 13

The correlation coefficient and the LSRL are both non-resistant measures.

Page 14

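Formulas – on chart

$\hat{y} = b_0 + b_1 x$

$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$

$b_0 = \bar{y} - b_1 \bar{x}$

$b_1 = r \frac{s_y}{s_x}$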

Page 15

The following statistics are found for the variables posted speed limit and the average number of accidents.

$\bar{x} = 40,\ s_x = 11.6,\ \bar{y} = 18,\ s_y = 8.4,\ r = .9981$

Find the LSRL & predict the number of accidents for a posted speed limit of 50 mph.

$\hat{y} = -10.92 + .723x$

$\hat{y} = 25.23$ accidents
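The LSRL can be built from summary statistics alone, using b = r·s_y/s_x and a = ȳ - b·x̄. The sketch below reads the slide's statistics as x̄ = 40, s_x = 11.6, ȳ = 18, s_y = 8.4 and r = .9981. That reading is an assumption, but it reproduces the slide's answers.

```python
# Summary statistics as read from the slide (speed limit = x, accidents = y).
x_bar, s_x = 40, 11.6
y_bar, s_y = 18, 8.4
r = 0.9981

b = r * s_y / s_x      # slope
a = y_bar - b * x_bar  # y-intercept

def predict(speed_limit):
    """Predicted average number of accidents at the given posted speed limit."""
    return a + b * speed_limit

pred_50 = predict(50)
print(f"y_hat = {a:.2f} + {b:.3f}x")
print(f"predicted accidents at 50 mph: {pred_50:.2f}")
```

The unrounded intercept is about -10.91; the slide's -10.92 comes from rounding the slope to .723 before computing the intercept.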

Page 16

Correlation (r)

Page 17

Suppose we found the age and weight of a sample of 10 adults.

Create a scatterplot of the data below.

Is there any relationship between the age and weight of these adults?

Age 24 30 41 28 50 46 49 35 20 39

Wt 256 124 320 185 158 129 103 196 110 130

Page 18

Suppose we found the height and weight of a sample of 10 adults.

Create a scatterplot of the data below.

Is there any relationship between the height and weight of these adults?

Ht 74 65 77 72 68 60 62 73 61 64

Wt 256 124 320 185 158 129 103 196 110 130

Is it positive or negative? Weak or strong?

Page 19

The closer the points in a scatterplot are to a straight line, the stronger the relationship.

The farther the points are from a straight line, the weaker the relationship.

Page 20

Identify as having a positive association, a negative association, or no association.

1. Heights of mothers & heights of their adult daughters: +

2. Age of a car in years and its current value: -

3. Weight of a person and calories consumed: +

4. Height of a person and the person’s birth month: NO

5. Number of hours spent in safety training and the number of accidents that occur: -

Page 21

Correlation Coefficient (r)

• A quantitative assessment of the strength & direction of the linear relationship between bivariate, quantitative data

• Pearson’s sample correlation is used most

• parameter: ρ (rho)

• statistic: r

$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$

Page 22

Calculate r. Interpret r in context.

Speed Limit (mph) 55 50 45 40 30 20

Avg. # of accidents (weekly) 28 25 21 17 11 6

There is a strong, positive, linear relationship between speed limit and average number of accidents per week.
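One way to carry out the calculation by hand is the standardized-score form of Pearson's r: average the products of the z-scores of x and y, dividing by n - 1. The sketch below applies it to the speed-limit data; the function name `correlation` is an assumption for this example.

```python
from statistics import mean, stdev

speed_limit = [55, 50, 45, 40, 30, 20]
accidents = [28, 25, 21, 17, 11, 6]

def correlation(xs, ys):
    """Pearson's r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

r = correlation(speed_limit, accidents)
print(f"r = {r:.4f}")
```

A value this close to 1 supports the "strong, positive, linear" interpretation given on the slide.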

Page 23

Properties of r (correlation coefficient)

• legitimate values of r are [-1,1]

[Figure: number line for r marked at -1, -.8, -.5, 0, .5, .8, 1, with regions labeled No Correlation, Weak correlation, Moderate Correlation, Strong correlation]

Page 24

• The value of r does not depend on the unit of measurement for either variable

x (in mm) 12 15 21 32 26 19 24

y 4 7 10 14 9 8 12

Find r.

Change to cm & find r.

The correlations are the same.

Page 25

• The value of r does not depend on which of the two variables is labeled x

x 12 15 21 32 26 19 24

y 4 7 10 14 9 8 12

Switch x & y & find r.

The correlations are the same.
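Both properties (changing units and switching x & y) can be confirmed with a few lines of code. This is a sketch only; `correlation` is a helper written for the example, and the conversion from mm to cm simply divides each x by 10.

```python
from math import sqrt

x_mm = [12, 15, 21, 32, 26, 19, 24]
y = [4, 7, 10, 14, 9, 8, 12]

def correlation(xs, ys):
    """Pearson's r from sums of squares and cross-products."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

x_cm = [v / 10 for v in x_mm]        # same lengths, different unit

r_mm = correlation(x_mm, y)
r_cm = correlation(x_cm, y)          # units changed
r_swapped = correlation(y, x_mm)     # x and y switched

print(f"r (mm) = {r_mm:.4f}, r (cm) = {r_cm:.4f}, r (swapped) = {r_swapped:.4f}")
```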

Page 26

• The value of r is non-resistant

x 12 15 21 32 26 19 24

y 4 7 10 14 9 8 22

Find r.

Outliers affect the correlation coefficient
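To see how much a single point matters, the sketch below computes r for this data set twice: once with the last y-value equal to 12 (as on the earlier slides) and once with it changed to 22 (the outlier version). The helper `correlation` is written for this example.

```python
from math import sqrt

x = [12, 15, 21, 32, 26, 19, 24]
y_original = [4, 7, 10, 14, 9, 8, 12]
y_outlier = [4, 7, 10, 14, 9, 8, 22]   # last value changed from 12 to 22

def correlation(xs, ys):
    """Pearson's r from sums of squares and cross-products."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

r_original = correlation(x, y_original)
r_outlier = correlation(x, y_outlier)
print(f"r without outlier = {r_original:.3f}, r with outlier = {r_outlier:.3f}")
```

Changing one of seven values moves r from a strong correlation to a moderate one, which is what "non-resistant" means here.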

Page 27

• The value of r is a measure of the extent to which x & y are linearly related

A value of r close to zero does not rule out any strong relationship between x and y.

r = 0, but has a definite relationship!
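A small made-up data set illustrates the point (the slide's own plot is not recoverable, so the numbers below are an assumption chosen for the example): y = x² over x-values symmetric about zero is a perfect curved relationship, yet r comes out as zero.

```python
from math import sqrt

# Hypothetical data: y is completely determined by x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

def correlation(xs, ys):
    """Pearson's r from sums of squares and cross-products."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

r = correlation(x, y)
print(f"r = {r:.4f}")
```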

Page 28

Minister data:(Data on Elmo)

r = .9999

So does an increase in ministers cause an increase in consumption of rum?

Page 29

Correlation does not imply causation


Page 30

Residuals, Residual Plots, & Influential points

Page 31

Residuals (error) -

• The vertical deviation between the observations & the LSRL

• the sum of the residuals is always zero

• error = observed - expected

residual $= y - \hat{y}$
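As a sketch of the definition, the code below fits the LSRL to a small made-up data set (the x and y values are hypothetical, chosen only for illustration), computes each residual as observed minus predicted, and confirms that the residuals sum to zero up to rounding.

```python
# Hypothetical data for illustration only.
x = [1, 2, 4, 5, 8]
y = [3, 4, 8, 7, 12]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar

# residual = observed y - predicted y
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
residual_sum = sum(residuals)

print("residuals:", [round(res, 3) for res in residuals])
print("sum of residuals:", round(residual_sum, 10))
```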

Page 32

Residual plot

• A scatterplot of the (x, residual) pairs.

• Residuals can be graphed against other statistics besides x

• Purpose is to tell if a linear association exists between the x & y variables

• If no pattern exists between the points in the residual plot, then the association is linear.

Page 33

[Figure: two residual plots (residuals vs. x), one labeled "Linear" and one labeled "Not linear"]

Page 34

Age Range of Motion

35 154

24 142

40 137

31 133

28 122

25 126

26 135

16 135

14 108

20 120

21 127

30 122

One measure of the success of knee surgery is post-surgical range of motion for the knee joint following a knee dislocation. Is there a linear relationship between age & range of motion?

Sketch a residual plot.

Since there is no pattern in the residual plot, there is a linear relationship between age and range of motion.

[Figure: residual plot, residuals vs. x]

Page 35

Age Range of Motion

35 154

24 142

40 137

31 133

28 122

25 126

26 135

16 135

14 108

20 120

21 127

30 122

Plot the residuals against the y-hats. How does this residual plot compare to the previous one?

[Figure: residual plot, residuals vs. $\hat{y}$]

Page 36

Residual plots show the same pattern whether the residuals are plotted against x or against y-hat.

[Figure: two residual plots side by side, residuals vs. x and residuals vs. $\hat{y}$]

Page 37

Coefficient of determination

• r²

• gives the proportion of variation in y that can be attributed to an approximate linear relationship between x & y

• remains the same no matter which variable is labeled x

Page 38

Age Range of Motion

35 154

24 142

40 137

31 133

28 122

25 126

26 135

16 135

14 108

20 120

21 127

30 122

Let’s examine r².

Suppose you were going to predict a future y but you didn’t know the x-value. Your best guess would be the overall mean of the existing y’s.

Now, find the sum of the squared residuals (errors). L3 = (L2-130.0833)^2. Do 1VARSTAT on L3 to find the sum.

Sum of the squared residuals (errors) using the mean of y:

$SSE_{\bar{y}} = 1564.917$

Page 39

Age Range of Motion

35 154

24 142

40 137

31 133

28 122

25 126

26 135

16 135

14 108

20 120

21 127

30 122

Now suppose you were going to predict a future y but you DO know the x-value. Your best guess would be the point on the LSRL for that x-value (y-hat). Find the LSRL & store in Y1. In L3 = Y1(L1) to calculate the predicted y for each x-value.

Now, find the sum of the squared residuals (errors). In L4 = (L2-L3)^2. Do 1VARSTAT on L4 to find the sum.

Sum of the squared residuals (errors) using the LSRL:

$SSE_{\hat{y}} = 1085.735$

Page 40

Age Range of Motion

35 154

24 142

40 137

31 133

28 122

25 126

26 135

16 135

14 108

20 120

21 127

30 122

By what percent did the sum of the squared error go down when you went from just an “overall mean” model to the “regression on x” model?

$SSE_{\hat{y}} = 1085.735$

$SSE_{\bar{y}} = 1564.917$

$\frac{SSE_{\bar{y}} - SSE_{\hat{y}}}{SSE_{\bar{y}}} = \frac{1564.91667 - 1085.735}{1564.91667} = .3062$

This is r²: the proportion of the variation in the y-values that is explained by the x-values.
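The calculator steps from the last three slides can be mirrored in a short script. This sketch computes the squared error around the mean of y, the squared error around the LSRL, and the proportional reduction between them for the knee-surgery data. The variable names are chosen for this example.

```python
age = [35, 24, 40, 31, 28, 25, 26, 16, 14, 20, 21, 30]
range_of_motion = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 122]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(range_of_motion) / n

# Least-squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, range_of_motion)) / sum(
    (x - x_bar) ** 2 for x in age
)
a = y_bar - b * x_bar

# Squared error when every prediction is the overall mean of y.
sse_mean = sum((y - y_bar) ** 2 for y in range_of_motion)

# Squared error when predictions come from the LSRL.
sse_lsrl = sum((y - (a + b * x)) ** 2 for x, y in zip(age, range_of_motion))

r_squared = (sse_mean - sse_lsrl) / sse_mean

print(f"SSE about the mean = {sse_mean:.3f}")
print(f"SSE about the LSRL = {sse_lsrl:.3f}")
print(f"r^2 = {r_squared:.4f}")
```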

Page 41

Age Range of Motion

35 154

24 142

40 137

31 133

28 122

25 126

26 135

16 135

14 108

20 120

21 127

30 122

How well does age predict the range of motion after knee surgery?

Approximately 30.6% of the variation in range of motion after knee surgery can be explained by the linear regression of range of motion on age.

Page 42

Interpretation of r²

Approximately r²% of the variation in y can be explained by the LSRL of x & y.

Page 43

Computer-generated regression analysis of knee surgery data:

Predictor Coef Stdev T P

Constant 107.58 11.12 9.67 0.000

Age 0.8710 0.4146 2.10 0.062

s = 10.42 R-sq = 30.6% R-sq(adj) = 23.7%

What is the equation of the LSRL? Find the slope & y-intercept.

$\hat{y} = 107.58 + .8710x$

What are the correlation coefficient and the coefficient of determination?

$r = .5532$, $r^2 = .306$

NEVER use adjusted r²!

Be sure to convert r² to decimal before taking the square root!
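Recovering r from the printout takes two steps: convert R-sq from a percent to a decimal, then take the square root and give it the sign of the slope. A minimal sketch, using the values from the printout above:

```python
from math import copysign, sqrt

r_sq_percent = 30.6   # "R-sq" from the printout (not the adjusted value)
slope = 0.8710        # coefficient of Age from the printout

r_sq = r_sq_percent / 100          # convert percent to decimal first
r = copysign(sqrt(r_sq), slope)    # r takes the sign of the slope

print(f"r^2 = {r_sq:.3f}, r = {r:.4f}")
```

Taking the square root of 30.6 directly would give about 5.53, which is not a legitimate value of r; that is the mistake the slide warns against.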

Page 44

Outlier –

• In a regression setting, an outlier is a data point with a large residual

Page 45

Influential point-

• A point that influences where the LSRL is located

• If removed, it will significantly change the slope of the LSRL

Page 46

Racket | Resonance (Hz) | Acceleration (m/sec/sec)

1 105 36.0

2 106 35.0

3 110 34.5

4 111 36.8

5 112 37.0

6 113 34.0

7 113 34.2

8 114 33.8

9 114 35.0

10 119 35.0

11 120 33.6

12 121 34.2

13 126 36.2

14 189 30.0

One factor in the development of tennis elbow is the impact-induced vibration of the racket and arm at ball contact.

Sketch a scatterplot of these data.

Calculate the LSRL & correlation coefficient.

Does there appear to be an influential point? If so, remove it and then calculate the new LSRL & correlation coefficient.
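A sketch of the exercise in code: fit the LSRL and compute r for all 14 rackets, then again with racket 14 (resonance 189 Hz) removed. The helper name `fit_lsrl` is an assumption made for this example.

```python
from math import sqrt

resonance = [105, 106, 110, 111, 112, 113, 113, 114, 114, 119, 120, 121, 126, 189]
acceleration = [36.0, 35.0, 34.5, 36.8, 37.0, 34.0, 34.2, 33.8, 35.0, 35.0,
                33.6, 34.2, 36.2, 30.0]

def fit_lsrl(xs, ys):
    """Return (a, b, r) for the least-squares line y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b, sxy / sqrt(sxx * syy)

# All 14 rackets.
a_all, b_all, r_all = fit_lsrl(resonance, acceleration)

# Racket 14 removed (the last entry in each list).
a_trim, b_trim, r_trim = fit_lsrl(resonance[:-1], acceleration[:-1])

print(f"all rackets:       y_hat = {a_all:.2f} + {b_all:.4f}x, r = {r_all:.3f}")
print(f"without racket 14: y_hat = {a_trim:.2f} + {b_trim:.4f}x, r = {r_trim:.3f}")
```

Removing the one racket roughly halves the slope and weakens the correlation from fairly strong to weak, so racket 14 behaves as an influential point.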

Page 47

Which of these measures are resistant?

• LSRL
• Correlation coefficient
• Coefficient of determination

NONE – all are affected by outliers

Page 48

Regression

Page 49

[Figure: weight vs. height, height axis marked 60, 62, 64, 66, 68]

How much would an adult female weigh if she were 5 feet tall?

She could weigh varying amounts. In other words, there is a distribution of weights for adult females who are 5 feet tall.

This distribution is normally distributed. (we hope)

What would you expect for other heights?

Where would you expect the TRUE LSRL to be?

$\mu_y = \alpha + \beta x$

What about the standard deviations of all these normal distributions?

We want the standard deviations of all these normal distributions to be the same.

Page 50

Regression Model

• The mean response $\mu_y$ has a straight-line relationship with x: $\mu_y = \alpha + \beta x$
– Where: slope β and intercept α are unknown parameters

• For any fixed value of x, the response y varies according to a normal distribution. Repeated responses of y are independent of each other.

• The standard deviation of y ($\sigma_y$) is the same for all values of x. ($\sigma_y$ is also an unknown parameter)

Page 51

• The slope b of the LSRL is an unbiased estimator of the true slope β.

• The intercept a of the LSRL is an unbiased estimator of the true intercept α.

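• The standard error s is an unbiased estimator of the true standard deviation of y ($\sigma_y$).

We use $\hat{y} = a + bx$ to estimate $\mu_y = \alpha + \beta x$

$s = \sqrt{\frac{\sum \text{residuals}^2}{n-2}} = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}}$

Note: df = n-2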
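As a worked instance of the formula for s, the sketch below applies it to the knee-surgery data used earlier in the deck (12 patients, so df = 10). The result can be compared with the "s = 10.42" line of the computer printout shown for that data.

```python
from math import sqrt

age = [35, 24, 40, 31, 28, 25, 26, 16, 14, 20, 21, 30]
range_of_motion = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 122]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(range_of_motion) / n

# Least-squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, range_of_motion)) / sum(
    (x - x_bar) ** 2 for x in age
)
a = y_bar - b * x_bar

# s = sqrt( sum of squared residuals / (n - 2) )
sum_sq_residuals = sum((y - (a + b * x)) ** 2 for x, y in zip(age, range_of_motion))
df = n - 2
s = sqrt(sum_sq_residuals / df)

print(f"df = {df}, s = {s:.2f}")
```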

Page 52

Do sampling distribution of slopes activity

Page 53

[Figure: weight vs. height, height axis marked 60, 62, 64, 66, 68, with lines labeled b]

Suppose you took many samples of the same size from this population & calculated the LSRL for each.

Using the slope from each of these LSRLs, we can create a sampling distribution for the slope of the true LSRL.

What shape will this distribution have?

What is the mean of the sampling distribution equal to?

$\mu_b = \beta$

What is the standard deviation of the sampling distribution?

$s_b = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$, where $s = \sqrt{\frac{1}{n-2} \sum (y_i - \hat{y}_i)^2}$
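The standard error of the slope can be computed directly from data as s divided by the square root of Σ(xᵢ - x̄)². The sketch below does this for the knee-surgery data from earlier in the deck; the results can be compared with the "Stdev" and "T" columns of the Age row in that printout (0.4146 and 2.10).

```python
from math import sqrt

age = [35, 24, 40, 31, 28, 25, 26, 16, 14, 20, 21, 30]
range_of_motion = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 122]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(range_of_motion) / n
sxx = sum((x - x_bar) ** 2 for x in age)

# Least-squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(age, range_of_motion)) / sxx
a = y_bar - b * x_bar

# Standard error of the residuals, with n - 2 degrees of freedom.
s = sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(age, range_of_motion)) / (n - 2))

# Standard error of the slope, and the t statistic for a hypothesized slope of 0.
s_b = s / sqrt(sxx)
t = b / s_b

print(f"b = {b:.4f}, s_b = {s_b:.4f}, t = {t:.2f}")
```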

Page 54

Assumptions for inference on slope

• The observations are independent
– Check that you have an SRS

• The true relationship is linear
– Check the scatter plot & residual plot

• The standard deviation of the response is constant.
– Check the scatter plot & residual plot

• The responses vary normally about the true regression line.
– Check a histogram or boxplot of residuals

Page 55

Formulas:

• Confidence Interval: $b \pm t^* s_b$

• Hypothesis test: $t = \frac{b - \beta}{s_b}$

df = n - 2

Because there are two unknowns, α & β

Page 56

Hypotheses

H0: β = 0
– This implies that there is no relationship between x & y
– Or that x should not be used to predict y

Ha: β > 0

Ha: β < 0

Ha: β ≠ 0

What would the slope equal if there were a perfect relationship between x & y?

1

Be sure to define β!

Page 57

Example: It is difficult to accurately determine a person’s body fat percentage without immersing him or her in water. Researchers hoping to find ways to make a good estimate immersed 20 male subjects, and then measured their weights.

a) Find the LSRL, correlation coefficient, and coefficient of determination.

Body fat = -27.376 + 0.250 weight

r = 0.697

r² = 0.485

Page 58

b) Explain the meaning of slope in the context of the problem.

There is an increase of approximately 0.25 percentage points in body fat for every one-pound increase in weight.

c) Explain the meaning of the coefficient of determination in context.

Approximately 48.5% of the variation in body fat can be explained by the regression of body fat on weight.

Page 59

d) Estimate α, β, and σ.

a = -27.376

b = 0.25

s = 7.049

$s = \sqrt{\frac{\sum \text{residuals}^2}{n-2}}$

e) Create a scatter plot and residual plot for the data.

[Figure: scatter plot of body fat vs. weight]

[Figure: residual plot, residuals vs. weight]

Page 60

f) Is there sufficient evidence that weight can be used to predict body fat?

Assumptions:
• Have an SRS of male subjects
• Since the residual plot is randomly scattered, weight & body fat are linear
• Since the points are evenly spaced across the LSRL on the scatterplot, $\sigma_y$ is approximately equal for all values of weight
• Since the boxplot of residuals is approximately symmetrical, the responses are approximately normally distributed.

H0: β = 0
Ha: β ≠ 0
Where β is the true slope of the LSRL of weight & body fat

$t = \frac{b - \beta}{s_b} = \frac{.25 - 0}{.0607} = 4.12$, p-value = .0006, df = 18, α = .05

Since the p-value < α, I reject H0. There is sufficient evidence to suggest that weight can be used to predict body fat.

Page 61

g) Give a 95% confidence interval for the true slope of the LSRL.

Assumptions:
• Have an SRS of male subjects
• Since the residual plot is randomly scattered, weight & body fat are linear
• Since the points are evenly spaced across the LSRL on the scatterplot, $\sigma_y$ is approximately equal for all values of weight
• Since the boxplot of residuals is approximately symmetrical, the responses are approximately normally distributed.

Be sure to show all graphs!

$b \pm t^* s_b = .25 \pm 2.101(.0607) = (.122, .377)$, df = 18

We are 95% confident that the true slope of the LSRL of weight & body fat is between 0.12 and 0.38.
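The interval and the test statistic for the body-fat example can be reproduced from the slope estimate and its standard error. In this sketch the estimates come from the example's computer output and the critical value t* = 2.101 (95%, df = 18) is the one used in the example, taken here as given rather than computed. The variable names are chosen for illustration.

```python
# Values from the body-fat example.
n = 20
b = 0.2498741        # estimated slope (body fat per unit of weight)
s_b = 0.060653996    # standard error of the slope
t_star = 2.101       # critical t for 95% confidence, df = 18 (taken as given)

df = n - 2

# Test statistic for H0: beta = 0.
t_stat = (b - 0) / s_b

# 95% confidence interval for the true slope.
margin = t_star * s_b
ci_low, ci_high = b - margin, b + margin

# For a two-sided test at alpha = .05, |t| > t* is equivalent to p-value < alpha.
reject_h0 = abs(t_stat) > t_star

print(f"df = {df}, t = {t_stat:.2f}, reject H0: {reject_h0}")
print(f"95% CI for the slope: ({ci_low:.3f}, {ci_high:.3f})")
```

The interval does not contain 0, which agrees with rejecting H0: β = 0 at the 5% level.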

Page 62

h) Here is the computer-generated result from the data:

Sample size: 20
R-square = 43.83%
s = 7.0491323

Parameter | Estimate | Std. Err.

Intercept | -27.376263 | 11.547428

Weight | 0.24987414 | 0.060653996

df?

Correlation coefficient? Be sure to write as decimal first!

What does “s” represent (in context)?

What do these numbers represent?

What does this number represent?