ACE 562, University of Illinois at Urbana-Champaign 4-1
ACE 562 Fall 2005
Lecture 4: Simple Linear Regression Model: Specification and Estimation
by
Professor Scott H. Irwin
Required Reading: Griffiths, Hill and Judge. "Simple Regression: Economic and Statistical Model Specification and Estimation," Ch. 5 in Learning and Practicing Econometrics
Notation Warning: I previously distinguished random variables (Y) from their realized sample values (y). Following HGJ, I will no longer do this. The context of each equation should make the distinction clear.
Overview
Previously, we examined one economic variable at a time
We will now focus on two economic variables
• Primary objective of economic analysis is to understand the relationship between economic variables
Key question: How to use the information contained in samples of economic data to learn about unknown parameters of economic relationships?
• When it is believed that values of one variable are systematically determined by another variable, simple linear regression can be used to model the relationship
• Simple linear regression model is a specification of the process that we believe describes the relationship between two variables
Start with two variables $y_t$ and $x_t$
• For each observation, we assume that $x_t$ is generated "outside" the process given by the simple linear regression model
• For each level of $x_t$, we assume that $y_t$ is generated by the following simple linear regression model
$$y_t = \beta_1 + \beta_2 x_t + e_t$$
Since we only actually observe one sample of data, our objective is to estimate the parameters ($\beta_1$ and $\beta_2$) of the above model
The Problem
What is the relationship between average household expenditure on food and income?
To answer this question, we must extend the economic and statistical models considered in Lecture 3
• Lecture 3 focused on food expenditure of households of size 3 with annual income of $25,000
• Population of interest is now all households of size 3, regardless of income level
Why is knowledge of the relationship between food expenditure and income important?
• Micro-level decisions?
• Macro-level decisions?
The Economic Model
Define household expenditure on food as $y$ and household income as $x$
Economic theory shows that
$$y = f(x)$$
and we expect the relationship to be positive
We need to be more precise about the relationship in order to ultimately estimate the parameters of the relationship
• Linear vs. non-linear
• In practice, never know exact form of the relation
• Use economic theory and information in the sample to make a reasonable choice
For simplicity, let's assume a linear relationship is reasonable,
$$y = \beta_1 + \beta_2 x$$
Contrast this model with the one considered earlier,
$$y = \beta$$
The linear economic model has two parameters
• $\beta_1$: intercept, which shows the level of food expenditure when income is zero
• $\beta_2$: slope, which shows how much food expenditure changes when income changes
Often interested in relationship between $y$ and $x$ in percentage terms, or elasticity
Point elasticity formula,
$$\eta_y = \frac{dy}{dx} \cdot \frac{x}{y} = \beta_2 \cdot \frac{x}{y}$$
where $\eta_y$ is the elasticity of food expenditure with respect to income
The Statistical Model
The linear economic model predicts that food expenditure for a given level of income will be the same for all consumers
• No scatter of points around the line in Figure 5.2
Recognize that actual expenditure for a given level of income will not be the same for all consumers,
$$y_t = \beta_1 + \beta_2 x_t + e_t, \quad t = 1, \ldots, T$$
where,
• $y_t$ is the dependent variable
• $x_t$ is the independent, or explanatory, variable
• $e_t$ is the error, or disturbance, term
• $T$ is the number of consumers in the sample
Note that $x_t$ is assumed to be the same for all consumers
Motivation for adding the error term is similar to earlier arguments, but more detail is helpful
Combined effect of other influences
• In reality, a large number of independent variables in addition to income affect food expenditure
• Assume the other independent variables are unobservable, or we would include them in the economic model
Approximation error
• Linear form of model may only be an approximation of the true relationship between income and food expenditure
Random component of human behavior
• Knowledge of all variables that influence an individual's food expenditure may not be sufficient to explain that expenditure
To complete the statistical model, we must specify the assumptions about the error term,
• $E(e_t) = 0$
• $\mathrm{var}(e_t) = E[(e_t - E(e_t))^2] = E[e_t^2] = \sigma^2$
• $e_t$ are independent so that $\mathrm{cov}(e_t, e_s) = 0$ for all $t \neq s$
• $e_t$ follows a normal distribution
This can be summarized using the following notation,
$$e_t \sim N(0, \sigma^2), \quad t = 1, \ldots, T$$
As we saw in Lecture 3, this is referred to as the iid normality assumption, which is shorthand for independent, identically distributed normal random variables
Note: The simple linear regression model does not require the assumption that the error term is normally distributed. However, it is typically assumed.
Now, let's explore some of the implications of the statistical model
$$y_t = \beta_1 + \beta_2 x_t + e_t, \quad t = 1, \ldots, T$$
Sometimes the statistical model is referred to as the "data generating process" for $y$
For a given observation $t$, $y_t$ can be thought of as having two components
• A systematic component $\beta_1 + \beta_2 x_t$, that is determined by an economic process
• A random component $e_t$, that is determined by a probabilistic process
Another way of saying the same thing is that the random variable $y_t$ is simply a linear transformation of the random variable $e_t$
$$y_t = a + b e_t \quad \text{where } a = \beta_1 + \beta_2 x_t \text{ and } b = 1$$
(Note that $a$ is a parameter for each observation because $x_t$ is fixed for that observation, but $a$ takes on different values for different observations)
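The "data generating process" idea can be illustrated with a short simulation. This is only a sketch: the parameter values (`beta1`, `beta2`, `sigma`) and the income grid are hypothetical choices made for illustration, not values from the lecture.

```python
import numpy as np

# Hypothetical values chosen only for illustration (not from the lecture)
beta1, beta2, sigma = 7.0, 0.25, 6.0
T = 40

rng = np.random.default_rng(seed=0)

# x_t is treated as fixed, i.e. generated "outside" the model
x = np.linspace(20.0, 120.0, T)

# Systematic component: beta1 + beta2 * x_t
systematic = beta1 + beta2 * x

# Random component: e_t ~ N(0, sigma^2), drawn independently for each t
e = rng.normal(loc=0.0, scale=sigma, size=T)

# Observed value: y_t = a + b * e_t with a = beta1 + beta2 * x_t and b = 1
y = systematic + e

print(np.column_stack([x, systematic, e, y])[:5])
```

Each draw of `e` gives a different sample of `y` for the same fixed `x`, which is the sense in which only one sample is ever observed in practice.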
We can examine the statistical properties more formally by considering the expected value of food expenditure,
$$E[y_t] = E[\beta_1 + \beta_2 x_t + e_t]$$
$$E[y_t] = E[\beta_1] + E[\beta_2 x_t] + E[e_t]$$
$$E[y_t] = \beta_1 + \beta_2 x_t$$
Shows that the expected value of food expenditure, or "average" expenditure, is a linear function of income
Now, reconsider the original statistical model,
$$y_t = \beta_1 + \beta_2 x_t + e_t$$
From above, we can substitute for $\beta_1 + \beta_2 x_t$ as follows,
$$y_t = E[y_t] + e_t$$
Allows a new interpretation of the statistical model
Writing the statistical model as,
$$y_t = E[y_t] + e_t$$
allows us to think of observed food expenditure as consisting of two components,
• $E[y_t]$: expected, or mean, food expenditure, which will be the same for all consumers at a given level of income
• $e_t$: a random component that is unique to each consumer
We generated the same interpretation in the earlier constant mean model
Now, the crucial difference is that the mean component is a function, rather than a constant
• The mean component varies linearly with the level of income
• $E[y_t] = \beta_1 + \beta_2 x_t$
Now, consider the variance of food expenditure
$$\mathrm{var}[y_t] = E\left[(y_t - E[y_t])^2\right]$$
$$\mathrm{var}[y_t] = E\left[(y_t - \beta_1 - \beta_2 x_t)^2\right]$$
$$\mathrm{var}[y_t] = E\left[e_t^2\right] = \sigma^2$$
This result is equivalent to saying that the variance of food expenditure (or the error term) is not related to the level of income
Next, consider the covariance of food expenditure between two values $y_t$ and $y_s$,
$$\mathrm{cov}[y_t, y_s] = E\left[(y_t - E[y_t])(y_s - E[y_s])\right]$$
$$\mathrm{cov}[y_t, y_s] = E\left[(y_t - \beta_1 - \beta_2 x_t)(y_s - \beta_1 - \beta_2 x_s)\right]$$
$$\mathrm{cov}[y_t, y_s] = E[e_t e_s] = 0, \quad t \neq s$$
Hence, if the errors are independent, as implied by random sampling, then the selection of one consumer does not influence whether another will be selected
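The mean, variance and covariance results above can be checked approximately by simulation. The sketch below assumes hypothetical parameter values and two hypothetical income levels (`x_t`, `x_s`); it repeatedly draws independent normal errors and compares sample moments with the theoretical ones.

```python
import numpy as np

# Hypothetical values for illustration only (not from the lecture)
beta1, beta2, sigma = 7.0, 0.25, 6.0
x_t, x_s = 40.0, 100.0          # two fixed income levels
n_rep = 200_000                 # number of simulated draws

rng = np.random.default_rng(seed=1)
e_t = rng.normal(0.0, sigma, n_rep)   # errors for observation t
e_s = rng.normal(0.0, sigma, n_rep)   # independent errors for observation s

y_t = beta1 + beta2 * x_t + e_t
y_s = beta1 + beta2 * x_s + e_s

mean_t, mean_s = y_t.mean(), y_s.mean()   # should be near beta1 + beta2 * x
var_t, var_s = y_t.var(), y_s.var()       # should both be near sigma**2
cov_ts = np.cov(y_t, y_s)[0, 1]           # should be near zero

print(mean_t, mean_s, var_t, var_s, cov_ts)
```

The means differ across the two income levels, while the variances are roughly the same and the covariance is roughly zero, matching the three derivations.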
Assumptions of the Simple Linear Regression Model
SR1. $y_t = \beta_1 + \beta_2 x_t + e_t, \quad t = 1, \ldots, T$
SR2. $E(e_t) = 0 \Leftrightarrow E(y_t) = \beta_1 + \beta_2 x_t$
SR3. $\mathrm{var}(e_t) = \mathrm{var}(y_t) = \sigma^2$
SR4. $\mathrm{cov}[e_t, e_s] = \mathrm{cov}[y_t, y_s] = 0, \quad t \neq s$
SR5. The variable $x_t$ is not random and must take on at least two different values
SR6. $e_t \sim N(0, \sigma^2) \Leftrightarrow y_t \sim N(\beta_1 + \beta_2 x_t, \sigma^2)$
Estimating the Parameters for the Statistical Model of the Food Expenditure and Income Relationship
The statistical model "explains" how the sample of household expenditure data is generated
The problem at hand is how to use the sample information on $y_t$ and $x_t$ to estimate the unknown parameters $\beta_1$ and $\beta_2$
One approach is to simply draw a line through the scatter of points that seems to "best fit" the data
• "Eyeball econometrics"
While "eyeball" analysis may be useful as a starting point, there are several weaknesses to this approach:
• Highly subjective; two researchers looking at the same graph may choose different lines
• Tendency to ignore outliers
• For most researchers, will work only for one y and one x (two dimensions)
Just as in the constant mean model, we need a rule to systematically estimate $\beta_1$ and $\beta_2$ based on the observed sample of data
• Notice that for a given level of $x_t$, $E[y_t] = \beta_1 + \beta_2 x_t$, or the "center" of the pdf for $y_t$
• Suggests the "center" of sample data may yield good estimates of the population parameters $\beta_1$ and $\beta_2$
• The only difference from the constant mean model is that the "center" varies with the level of income
The principle of least squared distance can again be used to find the desired estimates
Minimize the sum of squares of the vertical distances between the line and the sample observations
[Figure: Minimizing the sum of squared errors (axes: x and y)]
To begin the formal derivation, let's restate the statistical model,
$$y_t = \beta_1 + \beta_2 x_t + e_t$$
which can be re-written as,
$$e_t = y_t - \beta_1 - \beta_2 x_t$$
Then, given the sample observations on y and x, our objective is to minimize the following function,
$$S(\beta_1, \beta_2) = \sum_{t=1}^{T} e_t^2 = \sum_{t=1}^{T} (y_t - \beta_1 - \beta_2 x_t)^2$$
Since the values for $y_t$ and $x_t$ are known, S is solely a function of the unknown parameters $\beta_1$ and $\beta_2$
Expanding the square, we obtain,
$$S(\beta_1, \beta_2) = \sum_{t=1}^{T} \left( y_t^2 + \beta_1^2 + \beta_2^2 x_t^2 - 2\beta_1 y_t - 2\beta_2 x_t y_t + 2\beta_1 \beta_2 x_t \right)$$
With further re-arranging,
$$S(\beta_1, \beta_2) = \sum_{t=1}^{T} y_t^2 + T\beta_1^2 + \beta_2^2 \sum_{t=1}^{T} x_t^2 - 2\beta_1 \sum_{t=1}^{T} y_t - 2\beta_2 \sum_{t=1}^{T} x_t y_t + 2\beta_1 \beta_2 \sum_{t=1}^{T} x_t$$
For the sample of 40 households,
$$T = 40, \quad \sum_{t=1}^{T} x_t = 2792, \quad \sum_{t=1}^{T} y_t = 943.78$$
$$\sum_{t=1}^{T} x_t y_t = 69435.0404, \quad \sum_{t=1}^{T} x_t^2 = 210206.2302$$
$$\sum_{t=1}^{T} y_t^2 = 24875.065$$
and based on these computations the sum of squares relationship is,
$$S(\beta_1, \beta_2) = 24875.07 + 40\beta_1^2 + 210206.23\beta_2^2 - 1887.56\beta_1 - 138870.08\beta_2 + 5584\beta_1\beta_2$$
This function is a quadratic in terms of the unknown parameters $\beta_1$ and $\beta_2$
• "bowl-shaped" function
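To get a feel for the "bowl" shape, the sum of squares function can be evaluated at a few trial parameter values using only the sample sums reported for the 40 households. This is a sketch; the trial points in the loop are arbitrary choices for illustration.

```python
# Sample sums reported in the lecture for the 40 households
T = 40
sum_x = 2792.0
sum_y = 943.78
sum_xy = 69435.0404
sum_x2 = 210206.2302
sum_y2 = 24875.065


def S(beta1, beta2):
    """Sum of squared errors written in terms of the sample sums."""
    return (sum_y2 + T * beta1**2 + beta2**2 * sum_x2
            - 2 * beta1 * sum_y - 2 * beta2 * sum_xy
            + 2 * beta1 * beta2 * sum_x)


# Arbitrary trial values of (beta1, beta2), chosen only for illustration
for b1_try, b2_try in [(0.0, 0.0), (5.0, 0.3), (7.4, 0.23), (10.0, 0.2)]:
    print(b1_try, b2_try, round(S(b1_try, b2_try), 2))
```

Trial points closer to the bottom of the bowl give a smaller value of S, which is what the calculus in the next step makes exact.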
The minimum value of the function is found by taking the partial derivatives of S with respect to $\beta_1$ and $\beta_2$,
$$\frac{\partial S}{\partial \beta_1} = 2 \sum_{t=1}^{T} (y_t - \beta_1 - \beta_2 x_t)(-1)$$
$$\frac{\partial S}{\partial \beta_2} = 2 \sum_{t=1}^{T} (y_t - \beta_1 - \beta_2 x_t)(-x_t)$$
The values of $\beta_1$ and $\beta_2$ that make the partial derivatives equal zero are the least squares estimators, which are denoted $b_1$ and $b_2$
Substituting and setting each partial derivative equal to zero,
$$2 \sum_{t=1}^{T} (y_t - b_1 - b_2 x_t)(-1) = 0$$
$$2 \sum_{t=1}^{T} (y_t - b_1 - b_2 x_t)(-x_t) = 0$$
If we divide both sides of each equation by 2 and multiply the factors $(-1)$ and $(-x_t)$ through the parentheses, they can be re-written in the following form:
$$\sum_{t=1}^{T} (-y_t + b_1 + b_2 x_t) = 0$$
$$\sum_{t=1}^{T} (-y_t x_t + b_1 x_t + b_2 x_t^2) = 0$$
and,
$$-\sum_{t=1}^{T} y_t + T b_1 + b_2 \sum_{t=1}^{T} x_t = 0$$
$$-\sum_{t=1}^{T} y_t x_t + b_1 \sum_{t=1}^{T} x_t + b_2 \sum_{t=1}^{T} x_t^2 = 0$$
With a little more re-arranging, we can arrive at the following equations,
$$T b_1 + b_2 \sum_{t=1}^{T} x_t = \sum_{t=1}^{T} y_t$$
$$b_1 \sum_{t=1}^{T} x_t + b_2 \sum_{t=1}^{T} x_t^2 = \sum_{t=1}^{T} y_t x_t$$
The previous two equations are known as the normal equations in least squares regression
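Because the normal equations are two linear equations in $b_1$ and $b_2$, they can also be solved numerically as a 2-by-2 linear system. The sketch below uses the sample sums reported for the 40 households; the use of NumPy's `linalg.solve` is simply one convenient way to do this, not the method used in the lecture.

```python
import numpy as np

# Sample sums reported in the lecture for the 40 households
T = 40
sum_x, sum_y = 2792.0, 943.78
sum_xy, sum_x2 = 69435.0404, 210206.2302

# Normal equations in matrix form:
#   [ T       sum_x  ] [b1]   [ sum_y  ]
#   [ sum_x   sum_x2 ] [b2] = [ sum_xy ]
A = np.array([[T, sum_x],
              [sum_x, sum_x2]])
rhs = np.array([sum_y, sum_xy])

b1, b2 = np.linalg.solve(A, rhs)
print(round(b1, 4), round(b2, 4))
```

The algebraic solution derived next gives the same two numbers in closed form.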
Now, we have two equations in two unknowns, and we can solve for $b_1$ and $b_2$
Multiply the first normal equation by $\sum_{t=1}^{T} x_t$ and the second by $T$,
$$T \sum_{t=1}^{T} x_t \, b_1 + b_2 \left( \sum_{t=1}^{T} x_t \right)^2 = \sum_{t=1}^{T} x_t \sum_{t=1}^{T} y_t$$
$$T \sum_{t=1}^{T} x_t \, b_1 + T \sum_{t=1}^{T} x_t^2 \, b_2 = T \sum_{t=1}^{T} y_t x_t$$
Now subtract the first of the above two equations from the second,
$$\left[ T \sum_{t=1}^{T} x_t^2 - \left( \sum_{t=1}^{T} x_t \right)^2 \right] b_2 = T \sum_{t=1}^{T} y_t x_t - \sum_{t=1}^{T} x_t \sum_{t=1}^{T} y_t$$
or,
$$b_2 = \frac{T \sum_{t=1}^{T} y_t x_t - \sum_{t=1}^{T} x_t \sum_{t=1}^{T} y_t}{T \sum_{t=1}^{T} x_t^2 - \left( \sum_{t=1}^{T} x_t \right)^2}$$
which is the least squares estimator for the slope
Now, having found the least squares estimator of the slope, $b_2$, let's solve for $b_1$
The first normal equation is,
$$T b_1 + b_2 \sum_{t=1}^{T} x_t = \sum_{t=1}^{T} y_t$$
Simply divide this equation by $T$,
$$b_1 + b_2 \frac{1}{T} \sum_{t=1}^{T} x_t = \frac{1}{T} \sum_{t=1}^{T} y_t$$
or,
$$b_1 + b_2 \bar{x} = \bar{y}$$
$$b_1 = \bar{y} - b_2 \bar{x}$$
To summarize, the least squares estimators for the intercept and slope of the regression line are,
$$b_1 = \bar{y} - b_2 \bar{x}$$
$$b_2 = \frac{T \sum_{t=1}^{T} y_t x_t - \sum_{t=1}^{T} x_t \sum_{t=1}^{T} y_t}{T \sum_{t=1}^{T} x_t^2 - \left( \sum_{t=1}^{T} x_t \right)^2}$$
You may see the formula for $b_2$ stated in several different forms
The derivation for one widely used formula can be found by substituting the formula for $b_1$ into the second normal equation as follows,
$$(\bar{y} - b_2 \bar{x}) \sum_{t=1}^{T} x_t + b_2 \sum_{t=1}^{T} x_t^2 = \sum_{t=1}^{T} y_t x_t$$
Expanding and re-arranging,
$$\bar{y} \sum_{t=1}^{T} x_t + b_2 \left( \sum_{t=1}^{T} x_t^2 - \bar{x} \sum_{t=1}^{T} x_t \right) = \sum_{t=1}^{T} y_t x_t$$
Or,
$$b_2 = \frac{\sum_{t=1}^{T} y_t x_t - \bar{y} \sum_{t=1}^{T} x_t}{\sum_{t=1}^{T} x_t^2 - \bar{x} \sum_{t=1}^{T} x_t}$$
Recognizing that $\sum_{t=1}^{T} x_t = T\bar{x}$, we can then write the following solution for $b_2$, which is known as the computational formula,
$$b_2 = \frac{\sum_{t=1}^{T} y_t x_t - T \bar{y} \bar{x}}{\sum_{t=1}^{T} x_t^2 - T \bar{x}^2}$$
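As a check, the computational formula can be applied to the sample sums reported earlier for the 40 households ($T = 40$, $\sum x_t = 2792$, $\sum y_t = 943.78$, $\sum x_t y_t = 69435.0404$, $\sum x_t^2 = 210206.2302$). The final value is rounded to four decimals.

$$\bar{x} = \frac{2792}{40} = 69.8, \qquad \bar{y} = \frac{943.78}{40} = 23.5945$$

$$b_2 = \frac{69435.0404 - (40)(23.5945)(69.8)}{210206.2302 - (40)(69.8)^2} = \frac{69435.0404 - 65875.844}{210206.2302 - 194881.6} = \frac{3559.1964}{15324.6302} \approx 0.2323$$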
Another version of the formula is,
$$b_2 = \frac{\sum_{t=1}^{T} (x_t - \bar{x})(y_t - \bar{y})}{\sum_{t=1}^{T} (x_t - \bar{x})^2}$$
which is equivalent to,
$$b_2 = \frac{\frac{1}{T-1} \sum_{t=1}^{T} (x_t - \bar{x})(y_t - \bar{y})}{\frac{1}{T-1} \sum_{t=1}^{T} (x_t - \bar{x})^2}$$
$$b_2 = \frac{\hat{\sigma}_{xy}}{\hat{\sigma}_x^2} = \frac{\text{sample covariance of } x \text{ and } y}{\text{sample variance of } x}$$
This is often a helpful interpretation of the least squares slope estimator
A good exercise is to derive this form of the formula from the first one presented
Finally, the formula is often stated in "deviations from the mean" form
Let,
$$x_t^* = x_t - \bar{x} \quad \text{and} \quad y_t^* = y_t - \bar{y}$$
then,
$$b_2 = \frac{\sum_{t=1}^{T} y_t^* x_t^*}{\sum_{t=1}^{T} x_t^{*2}}$$
Important Points
• All of the different formulas for the least squares estimator $b_2$ are equivalent
• Assuming arithmetic accuracy, all formulas will yield exactly the same numerical values for a given sample of x and y
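The equivalence of the different slope formulas can be confirmed numerically. The sketch below uses a small hypothetical sample (the `x` and `y` arrays are made up for illustration) and computes $b_2$ four ways.

```python
import numpy as np

# Hypothetical sample, made up for illustration (not the lecture data)
x = np.array([25.0, 40.0, 55.0, 70.0, 85.0, 100.0])
y = np.array([12.0, 18.5, 19.0, 27.0, 26.5, 33.0])
T = len(x)
x_bar, y_bar = x.mean(), y.mean()

# (1) Formula obtained directly from the normal equations
b2_sums = (T * np.sum(x * y) - x.sum() * y.sum()) / (T * np.sum(x**2) - x.sum()**2)

# (2) Computational formula
b2_comp = (np.sum(x * y) - T * y_bar * x_bar) / (np.sum(x**2) - T * x_bar**2)

# (3) Deviations-from-the-mean form
x_star, y_star = x - x_bar, y - y_bar
b2_dev = np.sum(y_star * x_star) / np.sum(x_star**2)

# (4) Sample covariance over sample variance (both with divisor T - 1)
b2_cov = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

print(b2_sums, b2_comp, b2_dev, b2_cov)
```

Any differences between the four values are at the level of floating-point rounding, which is the "arithmetic accuracy" caveat in practice.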
Three Notable Properties of LS Estimates
1. Sum of the estimated errors always equals zero
First, note that the estimated error for each observation is simply the actual observation on y minus the value projected by the estimated regression line, or
$$\hat{e}_t = y_t - b_1 - b_2 x_t$$
Condition "enforced" by the first normal equation
$$2 \sum_{t=1}^{T} (y_t - b_1 - b_2 x_t)(-1) = 0$$
or
$$\sum_{t=1}^{T} \hat{e}_t = 0$$
2. Estimated regression line must pass through the sample means of x and y (centroid)
Shown by first noting that $b_1 = \bar{y} - b_2 \bar{x}$, which can be re-written as $\bar{y} = b_1 + b_2 \bar{x}$
3. Zero correlation between the estimated errors and $x_t$, the explanatory variable
Condition "enforced" by the second normal equation
$$2 \sum_{t=1}^{T} (y_t - b_1 - b_2 x_t)(-x_t) = 0$$
or
$$\sum_{t=1}^{T} \hat{e}_t x_t = 0$$
⇒ No tendency of estimated errors for observations above (below) the mean of x to be positive (negative) and vice versa
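The three properties can be verified on any sample. The sketch below uses a small hypothetical data set (made up for illustration), fits the line with the formulas above, and checks each property up to floating-point rounding.

```python
import numpy as np

# Hypothetical sample, made up for illustration (not the lecture data)
x = np.array([25.0, 40.0, 55.0, 70.0, 85.0, 100.0])
y = np.array([12.0, 18.5, 19.0, 27.0, 26.5, 33.0])
x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates
b2 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b1 = y_bar - b2 * x_bar

# Estimated errors (residuals)
e_hat = y - b1 - b2 * x

sum_e = e_hat.sum()                         # property 1: should be ~0
centroid_gap = y_bar - (b1 + b2 * x_bar)    # property 2: should be ~0
sum_ex = np.sum(e_hat * x)                  # property 3: should be ~0

print(sum_e, centroid_gap, sum_ex)
```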
Estimates for the Household Expenditure Function
Based on the data from 40 randomly selected households, we can compute estimates of $\beta_1$ and $\beta_2$ as follows,
$$b_1 = \bar{y} - b_2 \bar{x} = 23.5945 - (0.232253)(69.8) = 7.3832$$
$$b_2 = \frac{T \sum_{t=1}^{T} y_t x_t - \sum_{t=1}^{T} x_t \sum_{t=1}^{T} y_t}{T \sum_{t=1}^{T} x_t^2 - \left( \sum_{t=1}^{T} x_t \right)^2} = \frac{(40)(69435.04) - (2792)(943.78)}{(40)(210206.23) - (2792)^2} = 0.2323$$
It is useful to report the estimates in terms of the estimated relationship between $y_t$ and $x_t$,
$$\hat{y}_t = 7.3832 + 0.2323 x_t$$
where $\hat{y}_t$ is the estimate of the expected (mean) food expenditure for a given level of income
• Sometimes $\hat{y}_t$ is called the "fitted value" of $y_t$
• $\hat{y}_t$ is a point on the LS line for a given $x_t$
Sample Regression Output from Excel
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.563096017
R Square            0.317077125
Adjusted R Square   0.29910547
Standard Error      6.844922384
Observations        40

ANOVA
              df    SS             MS          F          Significance F
Regression    1     826.6352172    826.6352    17.64318   0.000155136
Residual      38    1780.412573    46.85296
Total         39    2607.04779

              Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept     7.383217543    4.008356335      1.841956   0.073296   -0.731275911   15.497711
X Variable 1  0.23225333     0.055293429      4.200378   0.000155   0.120317631    0.34418903
Interpreting the Least Squares Estimates
$$\hat{y}_t = 7.3832 + 0.2323 x_t$$
Intercept ($b_1$) = 7.3832 literally is an estimate of the expected (average) level of food expenditure per week when income is zero
• Caution needs to be exercised when interpreting intercept estimates
• Usually few, if any, observations around zero for the independent variable
• Suggests estimate may not be very reliable in this range of the independent variable
Slope ($b_2$) = 0.2323 is an estimate of the expected (average) change in food expenditure per week when income increases by one unit
• In this case, slope estimate indicates food expenditure per week is expected to increase by $0.23 when income per week increases by $1
• Important to emphasize that slope indicates expected change on average when income increases one unit, not the actual change
• Remember that the actual change will in all likelihood differ from the expected change due to the error term
• We can think of the expected change as being "on the line" while the actual change is "off the line"
Income elasticity of food expenditure also may be of interest
Recall the formula,
$$\eta_y = \frac{dy}{dx} \cdot \frac{x}{y} = \beta_2 \cdot \frac{x}{y}$$
Replacing $\beta_2$ with $b_2$ we can estimate the income elasticity as,
$$\hat{\eta}_y = b_2 \cdot \frac{x}{y}$$
This still leaves the question of what levels of x and y to use in estimating the elasticity
It is conventional to use the sample means, since that is a representative point on the regression line,
$$\hat{\eta}_y = b_2 \cdot \frac{\bar{x}}{\bar{y}}$$
In this case, the estimated income elasticity is,
$$\hat{\eta}_y = 0.2323 \cdot \frac{69.800}{23.595} = 0.687$$
Elasticity $\hat{\eta}_y = 0.687$ is an estimate of the expected (average) percent change in food expenditure per week when income increases by one percent
• In this case, elasticity estimate indicates food expenditure per week is expected to increase by 0.687 percent when income per week increases by 1 percent
• Remember that income elasticity estimate will vary for different points on the estimated regression line
Linearity and Other Functional Forms
In the simple regression example considered here, only a "straight-line" relationship between food expenditure and income was considered
• A non-linear relationship may well be more appropriate
The simple regression model is more flexible than it appears
• x and y variables can be transformations, such as logarithms, squares, cubes, or reciprocals
This raises the question of what we mean when we state that the simple regression model is linear
There are two definitions of linearity
Linearity in variables: only a power of one on $x_t$ or $y_t$,
$y_t = \beta_1 + \beta_2 x_t + e_t$ Yes!
$y_t = \beta_1 + \beta_2 x_t^4 + e_t$ No!
$y_t^2 = \beta_1 + \beta_2 x_t + e_t$ No!
$\ln(y_t) = \beta_1 + \beta_2 \ln(x_t) + e_t$ No!
Linearity in parameters: only a power of one on $\beta_1$ or $\beta_2$, but higher powers and/or transformations are allowed on $x_t$ and/or $y_t$,
$y_t = \beta_1 + \beta_2 x_t^2 + e_t$ Yes!
$y_t = \beta_1 + \beta_2^2 x_t + e_t$ No!
$y_t = \beta_1 + \beta_2^3 x_t + e_t$ No!
$\ln(y_t) = \beta_1 + \beta_2 \ln(x_t) + e_t$ Yes!
$y_t = \beta_1 + \ln(\beta_2) x_t + e_t$ No!
The definition of linearity used in the simple linear regression model is linear in parameters
• Allows considerable flexibility in the specification of the functional form of the model
• We will study alternative functional forms that are linear in parameters next semester
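To illustrate why linearity in parameters is what matters, the sketch below fits a log-log model with exactly the same least squares formulas, applied to transformed variables. The data are simulated under hypothetical values (intercept 1.0, slope 0.7, error standard deviation 0.1), and `least_squares` is a helper name introduced here for illustration.

```python
import numpy as np

# Hypothetical data generated from ln(y) = 1.0 + 0.7 * ln(x) + e
rng = np.random.default_rng(seed=2)
x = np.linspace(20.0, 120.0, 40)
ln_y = 1.0 + 0.7 * np.log(x) + rng.normal(0.0, 0.1, size=x.size)


def least_squares(x_var, y_var):
    """Simple regression estimates (b1, b2) from the deviations form."""
    x_bar, y_bar = x_var.mean(), y_var.mean()
    b2 = np.sum((x_var - x_bar) * (y_var - y_bar)) / np.sum((x_var - x_bar) ** 2)
    b1 = y_bar - b2 * x_bar
    return b1, b2


# The model is non-linear in the variables but linear in the parameters,
# so the same estimator applies once x and y are replaced by their logs.
b1, b2 = least_squares(np.log(x), ln_y)
print(round(b1, 3), round(b2, 3))
```

The estimator itself is unchanged; only the variables passed to it are transformed.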