Introduction to regression
Regression describes how one variable (response) depends on
another variable (explanatory variable).
◦ Response variable: the variable of interest; it measures the outcome of a study.
◦ Explanatory variable: explains (or even causes) changes in the response variable.
Examples:
◦ Hearing difficulties:
response - sound level (decibels), explanatory - age (years)
◦ Real estate market:
response - listing price ($), explanatory - house size (sq. ft.)
◦ Salaries:
response - salary ($), explanatory - experience (years), education, sex
Introduction to regression
Example: Food expenditures and income
Data: Sample of 20 households
[Scatterplot: food expenditure vs. income for the 20 households]
Questions:
◦ How does food expenditure (Y) depend on income (X)?
◦ Suppose we know that X = x₀; what can we tell about Y?
Linear regression:
If the response Y depends linearly on the explanatory variable X, we can use a straight line (the regression line) to predict Y from X.
Least squares regression
How to find the regression line
[Two scatterplots: food expenditure vs. income with the regression line, and a zoomed view marking the observed y, the predicted ŷ, and the difference y − ŷ]
Since we intend to predict Y from X, the errors of interest are mispredictions of Y for fixed X.
The least squares regression line of Y on X is the line that
minimizes the sum of squared errors.
For observations (x₁, y₁), …, (xₙ, yₙ), the regression line is given by

ŷ = a + b·x

where

b = r · sy/sx   and   a = ȳ − b·x̄

(r: the correlation coefficient; sx, sy: the standard deviations; x̄, ȳ: the means)
Least squares regression
Example: Food expenditure and income

X |  28   26   32   24   54   59   44   30   40   82
Y | 5.2  5.1  5.6  4.6 11.3  8.1  7.8  5.8  5.1 18.0
X |  42   58   28   20   42   47  112   85   31   26
Y | 4.9 11.8  5.2  4.8  7.9  6.4 20.0 13.7  5.1  2.9
The summary statistics are:
◦ x̄ = 45.50
◦ ȳ = 7.97
◦ sx = 23.96
◦ sy = 4.66
◦ r = 0.946
The regression coefficients are:
b = r · sy/sx = 0.946 · 4.66/23.96 = 0.184
a = ȳ − b·x̄ = 7.97 − 0.184 · 45.5 = −0.402
[Scatterplot: food expenditure vs. income with the fitted regression line ŷ = −0.402 + 0.184·x]
Interpreting the regression model
◦ The response in the model is denoted Ŷ to indicate that these are predicted Y values, not the true Y values. The “hat” denotes prediction.
◦ The slope of the line indicates how much Ŷ changes for a unit change in X.
◦ The intercept is the value of Ŷ for X = 0. It may or may not have a physical interpretation, depending on whether or not X can take values near 0.
◦ To make a prediction for an unobserved X, just plug it in and calculate Ŷ (see the worked example below).
◦ Note that the line need not pass through the observed data points. In fact, it often will not pass through any of them.
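For instance, plugging income x₀ = 50 into the fit ŷ = −0.402 + 0.184·x from the previous slide predicts a food expenditure of ŷ = −0.402 + 0.184 · 50 ≈ 8.8.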
Regression and correlation
Correlation analysis:
We are interested in the joint distribution of two (or more) quantitative variables.
Example: Heights of 1,078 fathers and sons
[Scatterplot: son’s height (inches) vs. father’s height (inches) for the 1,078 pairs, with the SD line]
Points are scattered around the SD line:
◦ (y − ȳ) = (sy/sx) · (x − x̄)
◦ goes through the center (x̄, ȳ)
◦ has slope sy/sx
The correlation r measures how much the points spread around
the SD line.
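For comparison, in the food-expenditure data the SD line would have slope sy/sx = 4.66/23.96 ≈ 0.194, whereas the least squares regression line has the flatter slope b = r · sy/sx ≈ 0.184.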
Regression and correlation
Regression analysis:
We are interested in how the distribution of one response variable depends on one (or more) explanatory variables.
Example: Heights of 1,078 fathers and sons
[Scatterplot: son’s height (inches) vs. father’s height (inches), together with density plots of son’s height in the vertical strips at father’s height = 64, 68, and 72 inches]

In each vertical strip, the points are distributed around the regression line.
Properties of least squares regression
◦ The distinction between explanatory and response variables is essential: since the fit minimizes vertical deviations, interchanging the axes changes the regression line.
[Scatterplot: son’s height vs. father’s height showing both lines: y = a + bx (regression of Y on X) and x = a′ + b′y (regression of X on Y)]
◦ A change of 1 SD in X corresponds to a change of r SDs in Y.
◦ The least squares regression line always passes through the
point (x, y).
◦ r² (the square of the correlation) is the fraction of the variation in the values of y that is explained by the least squares regression on x.

When reporting the results of a linear regression, you should report r².
These properties depend on the least-squares fitting criterion and
are one reason why that criterion is used.
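For the food-expenditure regression, for instance, r² = 0.946² ≈ 0.89, so roughly 89% of the variation in food expenditure is explained by the least squares regression on income.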
The regression effect
Regression effect
In virtually all test-retest situations, the bottom group on the first test will, on average, show some improvement on the second test, and the top group will, on average, fall back. This is the regression effect. The statistician and geneticist Sir Francis Galton (1822-1911) called this effect “regression to mediocrity”.
[Scatterplot: son’s height vs. father’s height with the regression line, illustrating the regression effect]
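The regression effect follows from the slope property on the previous slide: a father whose height is 1 SD above average has a son whose predicted height is only r SDs above average, and hence closer to the mean.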
Regression fallacy
Thinking that the regression effect must be due to something
important, not just the spread around the line, is the regression
fallacy.
Regression in STATA
. infile food income size using food.txt

. graph twoway scatter food income || lfit food income, legend(off)
> ytitle(food)

. regress food income

      Source |       SS       df       MS              Number of obs =      20
-------------+------------------------------           F(  1,    18) =  151.97
       Model |  369.572965     1  369.572965           Prob > F      =  0.0000
    Residual |  43.7725361    18  2.43180756           R-squared     =  0.8941
-------------+------------------------------           Adj R-squared =  0.8882
       Total |  413.345502    19  21.7550264           Root MSE      =  1.5594

------------------------------------------------------------------------------
        food |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      income |   .1841099   .0149345    12.33   0.000     .1527336    .2154862
       _cons |  -.4119994   .7637666    -0.54   0.596    -2.016613    1.192615
------------------------------------------------------------------------------
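Up to rounding, the estimated coefficients agree with the hand computation above (b = 0.184 for income, a = −0.402 for the constant), and R-squared = 0.8941 = Model SS / Total SS = 369.57/413.35 agrees with r² = 0.946².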
[Scatterplot: Food expenditure vs. Income with the fitted line]
This graph has been generated using the graphical user interface of STATA.
The complete command is:
. twoway (scatter food income, msymbol(circle) msize(medium) mcolor(black))
> (lfit food income, range(0 120) clcolor(black) clpat(solid) clwidth(medium)),
> ytitle(Food expenditure, size(large)) ylabel(, valuelabel angle(horizontal)
> labsize(medlarge)) xtitle(Income, size(large)) xscale(range(0 120))
> xlabel(0(20)120, labsize(medlarge)) legend(off) ysize(2) xsize(3)
Residual plots
Residuals: the difference between observed and predicted values

eᵢ = observed y − predicted y = yᵢ − ŷᵢ = yᵢ − (a + b·xᵢ)
For a least squares regression, the residuals always have mean zero.
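In STATA, fitted values and residuals are obtained with predict after regress. A minimal sketch (assuming the food data are in memory; the variable names yhat and ehat are arbitrary):

. regress food income
. predict yhat
. predict ehat, residuals
. summarize ehat

predict with no option returns the fitted values ŷᵢ; with the residuals option it returns eᵢ = yᵢ − ŷᵢ; summarize then confirms that the residuals have mean zero up to rounding.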
Residual plot
A residual plot is a scatterplot of the residuals against the
explanatory variable. It is a diagnostic tool to assess the fit of
the regression line.
Patterns to look for:
◦ Curvature indicates that the relationship is not linear.
◦ Increasing or decreasing spread indicates that the prediction
will be less accurate in the range of explanatory variables where
the spread is larger.
◦ Points with large residuals are outliers in the vertical direction.
◦ Points that are extreme in the x direction are potential high
influence points.
Influential observations are individuals with extreme x values
that exert a strong influence on the position of the regression line.
Removing them would significantly change the regression line.
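STATA provides the corresponding diagnostic plots after regress. A minimal sketch (rvpplot plots residuals against a named explanatory variable, rvfplot against the fitted values):

. regress food income
. rvpplot income, yline(0)
. rvfplot, yline(0)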
Regression Diagnostics
Example: First data set
[Three panels: Y vs. X; residuals vs. fitted values; residuals vs. X]

The residuals are scattered randomly around zero: the linear fit is adequate.
Regression Diagnostics
Example: Second data set
[Three panels: Y vs. X; residuals vs. fitted values; residuals vs. X]

The residuals show systematic curvature: the functional relationship is other than linear.
Regression Diagnostics
Example: Third data set
[Three panels: Y vs. X; residuals vs. fitted values; residuals vs. X]

An outlier: the regression line misfits the majority of the data.
Regression Diagnostics
Example: Fourth data set
[Three panels: Y vs. X; residuals vs. fitted values; residuals vs. X]

Heteroscedasticity: the spread of the residuals is not constant.