Introduction to regression
Regression describes how one variable (response) depends on
another variable (explanatory variable).
◦ Response variable: the variable of interest; it measures the outcome of a study.
◦ Explanatory variable: explains (or even causes) changes in the response variable.
Examples:
◦ Hearing difficulties:
response - sound level (decibels), explanatory - age (years)
◦ Real estate market:
response - listing price ($), explanatory - house size (sq. ft.)
◦ Salaries:
response - salary ($), explanatory - experience (years), education, sex
Introduction to regression
Example: Food expenditures and income
Data: Sample of 20 households
[Scatterplot: food expenditure vs. income for the 20 households]
Questions:
◦ How does food expenditure (Y) depend on income (X)?
◦ Suppose we know that X = x₀; what can we tell about Y?
Linear regression:
If the response Y depends linearly on the explanatory variable X, we can use a straight line (the regression line) to predict Y from X.
Least squares regression
How to find the regression line
[Two scatterplots: food expenditure vs. income with the regression line, and a zoomed view marking the observed y, the predicted ŷ, and the difference y − ŷ]
Since we intend to predict Y from X, the errors of interest are mispredictions of Y for fixed X.
The least squares regression line of Y on X is the line that
minimizes the sum of squared errors.
For observations (x₁, y₁), …, (xₙ, yₙ), the regression line is given by

ŷ = a + b·x

where

b = r · sy/sx   and   a = ȳ − b·x̄

(r: the correlation coefficient; sx, sy: the standard deviations; x̄, ȳ: the means)
Least squares regression
Example: Food expenditure and income

X |  28   26   32   24   54   59   44   30   40   82
Y | 5.2  5.1  5.6  4.6 11.3  8.1  7.8  5.8  5.1 18.0
X |  42   58   28   20   42   47  112   85   31   26
Y | 4.9 11.8  5.2  4.8  7.9  6.4 20.0 13.7  5.1  2.9
The summary statistics are:
◦ x̄ = 45.50
◦ ȳ = 7.97
◦ sx = 23.96
◦ sy = 4.66
◦ r = 0.946
The regression coefficients are:
b = r · sy/sx = 0.946 · 4.66/23.96 = 0.184
a = ȳ − b·x̄ = 7.97 − 0.184 · 45.5 = −0.402
[Scatterplot: food expenditure vs. income with the fitted regression line ŷ = −0.402 + 0.184·x]
Interpreting the regression model
◦ The response in the model is denoted Ŷ to indicate that these are predicted Y values, not the true Y values. The “hat” denotes prediction.
◦ The slope of the line indicates how much Ŷ changes for a unit change in X.
◦ The intercept is the value of Ŷ for X = 0. It may or may not have a physical interpretation, depending on whether or not X can take values near 0.
◦ To make a prediction for an unobserved X, just plug it in and calculate Ŷ (see the worked example below).
◦ Note that the line need not pass through the observed data points. In fact, it often will not pass through any of them.
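For instance, plugging income x₀ = 50 into the fit ŷ = −0.402 + 0.184·x from the previous slide predicts a food expenditure of ŷ = −0.402 + 0.184 · 50 ≈ 8.8.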
Regression and correlation
Correlation analysis:
We are interested in the joint distribution of two (or more) quantitative variables.
Example: Heights of 1,078 fathers and sons
[Scatterplot: son’s height (inches) vs. father’s height (inches) for the 1,078 pairs, with the SD line]
Points are scattered around the SD line:
◦ (y − ȳ) = (sy/sx) · (x − x̄)
◦ goes through the center (x̄, ȳ)
◦ has slope sy/sx
The correlation r measures how much the points spread around
the SD line.
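For comparison, in the food-expenditure data the SD line would have slope sy/sx = 4.66/23.96 ≈ 0.194, whereas the least squares regression line has the flatter slope b = r · sy/sx ≈ 0.184.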
Regression and correlation
Regression analysis:
We are interested in how the distribution of one response variable depends on one (or more) explanatory variables.
Example: Heights of 1,078 fathers and sons
[Scatterplot: son’s height (inches) vs. father’s height (inches), together with density plots of son’s height in the vertical strips at father’s height = 64, 68, and 72 inches]

In each vertical strip, the points are distributed around the regression line.
Properties of least squares regression
◦ The distinction between explanatory and response variables is essential: since the fit minimizes vertical deviations, interchanging the axes changes the regression line.
[Scatterplot: son’s height vs. father’s height showing both lines: y = a + bx (regression of Y on X) and x = a′ + b′y (regression of X on Y)]
◦ A change of 1 SD in X corresponds to a change of r SDs in Y.
◦ The least squares regression line always passes through the
point (x, y).
◦ r² (the square of the correlation) is the fraction of the variation in the values of y that is explained by the least squares regression on x.

When reporting the results of a linear regression, you should report r².
These properties depend on the least-squares fitting criterion and
are one reason why that criterion is used.
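For the food-expenditure regression, for instance, r² = 0.946² ≈ 0.89, so roughly 89% of the variation in food expenditure is explained by the least squares regression on income.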
The regression effect
Regression effect
In virtually all test-retest situations, the bottom group on the first test will, on average, show some improvement on the second test, and the top group will, on average, fall back. This is the regression effect. The statistician and geneticist Sir Francis Galton (1822-1911) called this effect “regression to mediocrity”.
[Scatterplot: son’s height vs. father’s height with the regression line, illustrating the regression effect]
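The regression effect follows from the slope property on the previous slide: a father whose height is 1 SD above average has a son whose predicted height is only r SDs above average, and hence closer to the mean.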
Regression fallacy
Thinking that the regression effect must be due to something
important, not just the spread around the line, is the regression
fallacy.
Regression in STATA
. infile food income size using food.txt

. graph twoway scatter food income || lfit food income, legend(off)
> ytitle(food)

. regress food income

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =  151.97
       Model |  369.572965     1  369.572965           Prob > F      =  0.0000
    Residual |  43.7725361    18  2.43180756           R-squared     =  0.8941
-------------+------------------------------           Adj R-squared =  0.8882
       Total |  413.345502    19  21.7550264           Root MSE      =  1.5594

------------------------------------------------------------------------------
        food |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .1841099   .0149345    12.33   0.000     .1527336    .2154862
       _cons |  -.4119994   .7637666    -0.54   0.596    -2.016613    1.192615
------------------------------------------------------------------------------
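Up to rounding, the estimated coefficients agree with the hand computation above (b = 0.184 for income, a = −0.402 for the constant), and R-squared = 0.8941 = Model SS / Total SS = 369.57/413.35 agrees with r² = 0.946².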
[Scatterplot: Food expenditure vs. Income with the fitted line]
This graph has been generated using the graphical user interface of STATA.
The complete command is:
. twoway (scatter food income, msymbol(circle) msize(medium) mcolor(black))
> (lfit food income, range(0 120) clcolor(black) clpat(solid) clwidth(medium)),
> ytitle(Food expenditure, size(large)) ylabel(, valuelabel angle(horizontal)
> labsize(medlarge)) xtitle(Income, size(large)) xscale(range(0 120))
> xlabel(0(20)120, labsize(medlarge)) legend(off) ysize(2) xsize(3)
Residual plots
Residuals: the difference between observed and predicted values

eᵢ = observed y − predicted y = yᵢ − ŷᵢ = yᵢ − (a + b·xᵢ)
For a least squares regression, the residuals always have mean zero.
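In STATA, fitted values and residuals are obtained with predict after regress. A minimal sketch (assuming the food data are in memory; the variable names yhat and ehat are arbitrary):

. regress food income
. predict yhat
. predict ehat, residuals
. summarize ehat

predict with no option returns the fitted values ŷᵢ; with the residuals option it returns eᵢ = yᵢ − ŷᵢ; summarize then confirms that the residuals have mean zero up to rounding.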
Residual plot
A residual plot is a scatterplot of the residuals against the
explanatory variable. It is a diagnostic tool to assess the fit of
the regression line.
Patterns to look for:
◦ Curvature indicates that the relationship is not linear.
◦ Increasing or decreasing spread indicates that the prediction
will be less accurate in the range of explanatory variables where
the spread is larger.
◦ Points with large residuals are outliers in the vertical direction.
◦ Points that are extreme in the x direction are potential high
influence points.
Influential observations are individuals with extreme x values
that exert a strong influence on the position of the regression line.
Removing them would significantly change the regression line.
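STATA provides the corresponding diagnostic plots after regress. A minimal sketch (rvpplot plots residuals against a named explanatory variable, rvfplot against the fitted values):

. regress food income
. rvpplot income, yline(0)
. rvfplot, yline(0)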
Regression Diagnostics
Example: First data set
[Three panels: Y vs. X; residuals vs. fitted values; residuals vs. X]

The residuals are scattered randomly around zero: the linear fit is adequate.
Regression Diagnostics
Example: Second data set
[Three panels: Y vs. X; residuals vs. fitted values; residuals vs. X]

The residuals show systematic curvature: the functional relationship is other than linear.
Regression Diagnostics
Example: Third data set
[Three panels: Y vs. X; residuals vs. fitted values; residuals vs. X]

An outlier: the regression line misfits the majority of the data.
Regression Diagnostics
Example: Fourth data set
[Three panels: Y vs. X; residuals vs. fitted values; residuals vs. X]

Heteroscedasticity: the spread of the residuals is not constant.