lecture 4-simple linear regression model-specification and

ACE 562, University of Illinois at Urbana-Champaign

ACE 562 Fall 2005

Lecture 4: Simple Linear Regression Model: Specification and Estimation

by

Professor Scott H. Irwin

Required Reading: Griffiths, Hill and Judge. "Simple Regression: Economic and Statistical Model Specification and Estimation," Ch. 5 in Learning and Practicing Econometrics

Notation Warning: I previously distinguished random variables (Y) from their realized sample values (y). Following HGJ, I will no longer do this. The context of the equation should make the distinction clear.


Overview

Previously, we examined one economic variable at a time. We will now focus on two economic variables.

• Primary objective of economic analysis is to understand the relationship between economic variables

Key question: How to use the information contained in samples of economic data to learn about unknown parameters of economic relationships?

• When it is believed that values of one variable are systematically determined by another variable, simple linear regression can be used to model the relationship

• The simple linear regression model is a specification of the process that we believe describes the relationship between two variables


Start with two variables $y_t$ and $x_t$

• For each observation, we assume that $x_t$ is generated "outside" the process given by the simple linear regression model

• For each level of $x_t$, we assume that $y_t$ is generated by the following simple linear regression model

$$y_t = \beta_1 + \beta_2 x_t + e_t$$

Since we only actually observe one sample of data, our objective is to estimate the parameters ($\beta_1$ and $\beta_2$) of the above model


The Problem

What is the relationship between average household expenditure on food and income? To answer this question, we must extend the economic and statistical models considered in Lecture 3

• Lecture 3 focused on food expenditure of households of size 3 with annual income of $25,000

• Population of interest is now all households of size 3, regardless of income level

Why is knowledge of the relationship between food expenditure and income important?

• Micro-level decisions?

• Macro-level decisions?


The Economic Model

Define household expenditure on food as $y$ and household income as $x$. Economic theory shows that

$$y = f(x)$$

and we expect the relationship to be positive. We need to be more precise about the relationship in order to ultimately estimate the parameters of the relationship

• Linear vs. non-linear

• In practice, we never know the exact form of the relationship

• Use economic theory and information in the sample to make a reasonable choice


For simplicity, let's assume a linear relationship is reasonable,

$$y = \beta_1 + \beta_2 x$$

Contrast this model with the one considered earlier,

$$y = \beta$$

The linear economic model has two parameters

• $\beta_1$: intercept, which shows the level of food expenditure when income is zero

• $\beta_2$: slope, which shows how much food expenditure changes when income changes


Often we are interested in the relationship between $y$ and $x$ in percentage terms, or elasticity. Point elasticity formula,

$$\eta_y = \frac{dy}{dx} \cdot \frac{x}{y} = \beta_2 \cdot \frac{x}{y}$$

where $\eta_y$ is the elasticity of food expenditure with respect to income


The Statistical Model

The linear economic model predicts that food expenditure for a given level of income will be the same for all consumers

• No scatter of points around the line in Figure 5.2

Recognize that actual expenditure for a given level of income will not be the same for all consumers,

$$y_t = \beta_1 + \beta_2 x_t + e_t, \quad t = 1, \ldots, T$$

where,

• $y_t$ is the dependent variable

• $x_t$ is the independent, or explanatory, variable

• $e_t$ is the error, or disturbance, term

• $T$ is the number of consumers in the sample

Note that $x_t$ is assumed to be the same for all consumers


Motivation for adding the error term is similar to earlier arguments, but more detail is helpful

Combined effect of other influences

• In reality, a large number of independent variables in addition to income affect food expenditure

• Assume the other independent variables are unobservable, or we would include them in the economic model

Approximation error

• Linear form of model may only be an approximation of the true relationship between income and food expenditure

Random component of human behavior

• Knowledge of all variables that influence an individual's food expenditure may not be sufficient to explain that expenditure


To complete the statistical model, we must specify the assumptions about the error term,

• $E(e_t) = 0$

• $\mathrm{var}(e_t) = E[e_t - E(e_t)]^2 = E[e_t^2] = \sigma^2$

• $e_t$ are independent, so that $\mathrm{cov}(e_t, e_s) = 0$ for all $t \neq s$

• $e_t$ follows a normal distribution

This can be summarized using the following notation,

$$e_t \sim N(0, \sigma^2), \quad t = 1, \ldots, T$$

As we saw in Lecture 3, this is referred to as the iid normality assumption, which is shorthand for independent and identically distributed normal random variables

Note: The simple linear regression model does not require the assumption that the error term is normally distributed. However, it is typically assumed.


Now, let's explore some of the implications of the statistical model

$$y_t = \beta_1 + \beta_2 x_t + e_t, \quad t = 1, \ldots, T$$

Sometimes the statistical model is referred to as the "data generating process" for $y$

For a given observation $t$, $y_t$ can be thought of as having two components

• A systematic component, $\beta_1 + \beta_2 x_t$, that is determined by an economic process

• A random component, $e_t$, that is determined by a probabilistic process

Another way of saying the same thing is that the random variable $y_t$ is simply a linear transformation of the random variable $e_t$

$$y_t = a + b e_t, \quad \text{where } a = \beta_1 + \beta_2 x_t \text{ and } b = 1$$

(Note that $a$ is a parameter for each observation because $x_t$ is fixed for that observation, but $a$ takes on different values for different observations)
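The "data generating process" idea can be made concrete with a small simulation. The following is only a sketch: the parameter values, the income levels, and the function name `simulate_sample` are hypothetical choices for illustration and are not taken from the lecture data.

```python
import random

def simulate_sample(beta1, beta2, sigma, x_values, seed=0):
    """Draw one sample from y_t = beta1 + beta2 * x_t + e_t, e_t ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    sample = []
    for x_t in x_values:
        systematic = beta1 + beta2 * x_t   # systematic component, fixed given x_t
        e_t = rng.gauss(0.0, sigma)        # random component
        sample.append((x_t, systematic + e_t))
    return sample

# Hypothetical parameter values and income levels (assumptions, not lecture data)
x_values = [20, 40, 60, 80, 100]
sample = simulate_sample(beta1=7.0, beta2=0.25, sigma=5.0, x_values=x_values)
for x_t, y_t in sample:
    print(f"x_t = {x_t:5.1f}   y_t = {y_t:7.3f}")
```

Each call with a different `seed` plays the role of drawing a different sample from the same population; in practice only one such sample is observed.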


We can examine the statistical properties more formally by considering the expected value of food expenditure,

$$E[y_t] = E[\beta_1 + \beta_2 x_t + e_t]$$

$$E[y_t] = E[\beta_1] + E[\beta_2 x_t] + E[e_t]$$

$$E[y_t] = \beta_1 + \beta_2 x_t$$

This shows that the expected value of food expenditure, or "average" expenditure, is a linear function of income. Now, reconsider the original statistical model,

$$y_t = \beta_1 + \beta_2 x_t + e_t$$

From above, we can substitute for $\beta_1 + \beta_2 x_t$ as follows,

$$y_t = E[y_t] + e_t$$

This allows a new interpretation of the statistical model


Writing the statistical model as,

$$y_t = E[y_t] + e_t$$

allows us to think of observed food expenditure as consisting of two components,

• $E[y_t]$: expected, or mean, food expenditure, which will be the same for all consumers at a given level of income

• $e_t$: a random component that is unique to each consumer

We generated the same interpretation in the earlier constant mean model. Now, the crucial difference is that the mean component is a function, rather than a constant

• The mean component varies linearly with the level of income

• $E[y_t] = \beta_1 + \beta_2 x_t$


Now, consider the variance of food expenditure

$$\mathrm{var}[y_t] = E\left[(y_t - E[y_t])^2\right]$$

$$\mathrm{var}[y_t] = E\left[(y_t - \beta_1 - \beta_2 x_t)^2\right]$$

$$\mathrm{var}[y_t] = E\left[e_t^2\right] = \sigma^2$$

This result is equivalent to saying that the variance of food expenditure (or the error term) is not related to the level of income

Next, consider the covariance of food expenditure between two values $y_t$ and $y_s$,

$$\mathrm{cov}[y_t, y_s] = E\left[(y_t - E[y_t])(y_s - E[y_s])\right]$$

$$\mathrm{cov}[y_t, y_s] = E\left[(y_t - \beta_1 - \beta_2 x_t)(y_s - \beta_1 - \beta_2 x_s)\right]$$

$$\mathrm{cov}[y_t, y_s] = E[e_t e_s] = 0, \quad t \neq s$$

Hence, if the errors are uncorrelated, as implied by random sampling (the selection of one consumer does not influence whether another will be selected), then the food expenditures of any two consumers are also uncorrelated
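These moment results can be checked numerically. The sketch below assumes hypothetical parameter values and two hypothetical income levels (60 and 90); it repeatedly draws expenditures for two consumers with independent normal errors and compares the sample moments with $\beta_1 + \beta_2 x_t$, $\sigma^2$, and zero.

```python
import random
import statistics

rng = random.Random(42)
beta1, beta2, sigma = 7.0, 0.25, 5.0   # hypothetical values, not lecture estimates
n = 50_000                              # number of simulated draws per consumer

# Two consumers at fixed (hypothetical) income levels with independent errors
y_t = [beta1 + beta2 * 60.0 + rng.gauss(0.0, sigma) for _ in range(n)]
y_s = [beta1 + beta2 * 90.0 + rng.gauss(0.0, sigma) for _ in range(n)]

mean_t = statistics.fmean(y_t)
mean_s = statistics.fmean(y_s)
var_t = statistics.pvariance(y_t)
cov_ts = sum((a - mean_t) * (b - mean_s) for a, b in zip(y_t, y_s)) / n

print(f"sample mean of y_t: {mean_t:.3f}  (theory: {beta1 + beta2 * 60.0:.3f})")
print(f"sample var of y_t:  {var_t:.3f}  (theory: {sigma ** 2:.3f})")
print(f"sample cov(y_t, y_s): {cov_ts:.3f}  (theory: 0)")
```

The sample values will differ slightly from the theoretical ones because of simulation noise, which shrinks as `n` grows.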


Assumptions of the Simple Linear Regression Model

SR1. $y_t = \beta_1 + \beta_2 x_t + e_t, \quad t = 1, \ldots, T$

SR2. $E(e_t) = 0 \Leftrightarrow E(y_t) = \beta_1 + \beta_2 x_t$

SR3. $\mathrm{var}(e_t) = \mathrm{var}(y_t) = \sigma^2$

SR4. $\mathrm{cov}[e_t, e_s] = \mathrm{cov}[y_t, y_s] = 0, \quad t \neq s$

SR5. The variable $x_t$ is not random and must take on at least two different values

SR6. $e_t \sim N(0, \sigma^2) \Leftrightarrow y_t \sim N(\beta_1 + \beta_2 x_t, \sigma^2)$


Estimating the Parameters for the Statistical Model of the Food Expenditure and Income Relationship

The statistical model "explains" how the sample of household expenditure data is generated. The problem at hand is how to use the sample information on $y_t$ and $x_t$ to estimate the unknown parameters $\beta_1$ and $\beta_2$

One approach is to simply draw a line through the scatter of points that seems to "best fit" the data

• "Eyeball econometrics"

While "eyeball" analysis may be useful as a starting point, there are several weaknesses to this approach:

• Highly subjective; two researchers looking at the same graph may choose different lines

• Tendency to ignore outliers

• For most researchers, will work only for one y and one x (two dimensions)


Just as in the constant mean model, we need a rule to systematically estimate $\beta_1$ and $\beta_2$ based on the observed sample of data

• Notice that for a given level of $x_t$, $E[y_t] = \beta_1 + \beta_2 x_t$ is the "center" of the pdf for $y_t$

• Suggests the "center" of the sample data may yield good estimates of the population parameters $\beta_1$ and $\beta_2$

• The only difference from the constant mean model is that the "center" varies with the level of income

The principle of least squared distance can again be used to find the desired estimates: minimize the sum of squares of the vertical distances between the line and the sample observations


[Figure: Minimizing the sum of squared errors (vertical axis y, horizontal axis x)]


To begin the formal derivation, let's restate the statistical model,

$$y_t = \beta_1 + \beta_2 x_t + e_t$$

which can be re-written as,

$$e_t = y_t - \beta_1 - \beta_2 x_t$$

Then, given the sample observations on $y$ and $x$, our objective is to minimize the following function,

$$S(\beta_1, \beta_2) = \sum_{t=1}^{T} e_t^2 = \sum_{t=1}^{T} (y_t - \beta_1 - \beta_2 x_t)^2$$

Since the values for $y_t$ and $x_t$ are known, $S$ is solely a function of the unknown parameters $\beta_1$ and $\beta_2$. Expanding the square, we obtain,

$$S(\beta_1, \beta_2) = \sum_{t=1}^{T} \left( y_t^2 + \beta_1^2 + \beta_2^2 x_t^2 - 2\beta_1 y_t - 2\beta_2 x_t y_t + 2\beta_1 \beta_2 x_t \right)$$


With further re-arranging,

$$S(\beta_1, \beta_2) = \sum_{t=1}^{T} y_t^2 + T\beta_1^2 + \beta_2^2 \sum_{t=1}^{T} x_t^2 - 2\beta_1 \sum_{t=1}^{T} y_t - 2\beta_2 \sum_{t=1}^{T} x_t y_t + 2\beta_1 \beta_2 \sum_{t=1}^{T} x_t$$

For the sample of 40 households,

$$T = 40, \quad \sum_{t=1}^{T} x_t = 2792, \quad \sum_{t=1}^{T} y_t = 943.78$$

$$\sum_{t=1}^{T} x_t y_t = 69435.0404, \quad \sum_{t=1}^{T} x_t^2 = 210206.2302$$

$$\sum_{t=1}^{T} y_t^2 = 24875.065$$

and based on these computations the sum of squares relationship is,

$$S(\beta_1, \beta_2) = 24875.07 + 40\beta_1^2 + 210206.23\beta_2^2 - 1887.56\beta_1 - 138870.08\beta_2 + 5584\beta_1 \beta_2$$

This function is a quadratic in terms of the unknown parameters $\beta_1$ and $\beta_2$

• "bowl-shaped" function
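To see the "bowl" numerically, the sum of squares function can be evaluated at a few candidate parameter pairs using the sample sums reported for the 40 households. This is a minimal sketch: the function name `S` mirrors the lecture's notation, and the trial points other than the lecture's least squares estimates (7.3832, 0.232253) are arbitrary choices for illustration.

```python
# Sample sums reported in the lecture for the 40 households
T = 40
sum_x, sum_y = 2792.0, 943.78
sum_xy, sum_x2, sum_y2 = 69435.0404, 210206.2302, 24875.065

def S(beta1, beta2):
    """Sum of squared errors written in terms of the sample sums."""
    return (sum_y2 + T * beta1 ** 2 + beta2 ** 2 * sum_x2
            - 2 * beta1 * sum_y - 2 * beta2 * sum_xy
            + 2 * beta1 * beta2 * sum_x)

# Arbitrary trial points plus the least squares estimates reported in the lecture
trial_points = [(0.0, 0.0), (5.0, 0.30), (7.3832, 0.232253), (10.0, 0.20)]
for beta1, beta2 in trial_points:
    print(f"S({beta1:7.4f}, {beta2:8.6f}) = {S(beta1, beta2):12.4f}")
```

Among these points, the value of `S` is smallest at the least squares estimates, which is what the bottom of the bowl corresponds to.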


The minimum value of the function is found by taking the partial derivatives of $S$ with respect to $\beta_1$ and $\beta_2$,

$$\frac{\partial S}{\partial \beta_1} = 2 \sum_{t=1}^{T} (y_t - \beta_1 - \beta_2 x_t)(-1)$$

$$\frac{\partial S}{\partial \beta_2} = 2 \sum_{t=1}^{T} (y_t - \beta_1 - \beta_2 x_t)(-x_t)$$

The values of $\beta_1$ and $\beta_2$ that make the partial derivatives equal to zero are the least squares estimators, which are denoted $b_1$ and $b_2$. Substituting and setting each partial derivative equal to zero,

$$2 \sum_{t=1}^{T} (y_t - b_1 - b_2 x_t)(-1) = 0$$

$$2 \sum_{t=1}^{T} (y_t - b_1 - b_2 x_t)(-x_t) = 0$$
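One way to gain confidence in the partial derivatives is to compare them with finite-difference approximations of $S$. The sketch below uses the sample sums reported for the 40 households; the helper names `grad_S` and `numeric_grad`, the step size, and the trial point (5.0, 0.30) are assumptions made for illustration.

```python
# Sample sums reported in the lecture for the 40 households
T = 40
sum_x, sum_y = 2792.0, 943.78
sum_xy, sum_x2, sum_y2 = 69435.0404, 210206.2302, 24875.065

def S(beta1, beta2):
    """Sum of squared errors in terms of the sample sums."""
    return (sum_y2 + T * beta1 ** 2 + beta2 ** 2 * sum_x2
            - 2 * beta1 * sum_y - 2 * beta2 * sum_xy
            + 2 * beta1 * beta2 * sum_x)

def grad_S(beta1, beta2):
    """Analytic partial derivatives, with the sums over t carried out."""
    d_beta1 = -2.0 * (sum_y - T * beta1 - beta2 * sum_x)
    d_beta2 = -2.0 * (sum_xy - beta1 * sum_x - beta2 * sum_x2)
    return d_beta1, d_beta2

def numeric_grad(beta1, beta2, h=1e-4):
    """Central finite-difference approximation of the same derivatives."""
    d_beta1 = (S(beta1 + h, beta2) - S(beta1 - h, beta2)) / (2 * h)
    d_beta2 = (S(beta1, beta2 + h) - S(beta1, beta2 - h)) / (2 * h)
    return d_beta1, d_beta2

analytic = grad_S(5.0, 0.30)
numeric = numeric_grad(5.0, 0.30)
print("analytic:", analytic)
print("numeric: ", numeric)

# At the least squares solution both partial derivatives should be (near) zero
b2 = (T * sum_xy - sum_x * sum_y) / (T * sum_x2 - sum_x ** 2)
b1 = sum_y / T - b2 * sum_x / T
print("gradient at (b1, b2):", grad_S(b1, b2))
```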


If we divide both sides of each equation by 2 and carry the factor (-1) or ($-x_t$) inside the summation, they can be re-written in the following form:

$$\sum_{t=1}^{T} (-y_t + b_1 + b_2 x_t) = 0$$

$$\sum_{t=1}^{T} (-y_t x_t + b_1 x_t + b_2 x_t^2) = 0$$

and,

$$-\sum_{t=1}^{T} y_t + T b_1 + b_2 \sum_{t=1}^{T} x_t = 0$$

$$-\sum_{t=1}^{T} y_t x_t + b_1 \sum_{t=1}^{T} x_t + b_2 \sum_{t=1}^{T} x_t^2 = 0$$

With a little more re-arranging, we can arrive at the following equations,

$$T b_1 + b_2 \sum_{t=1}^{T} x_t = \sum_{t=1}^{T} y_t$$

$$b_1 \sum_{t=1}^{T} x_t + b_2 \sum_{t=1}^{T} x_t^2 = \sum_{t=1}^{T} y_t x_t$$

The previous two equations are known as the normal equations in least squares regression
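As a worked illustration, substituting the sample sums reported earlier for the 40 households ($T = 40$, $\sum x_t = 2792$, $\sum y_t = 943.78$, $\sum x_t y_t = 69435.0404$, $\sum x_t^2 = 210206.2302$) turns the normal equations into a pair of linear equations in $b_1$ and $b_2$:

```latex
\begin{aligned}
40\, b_1 + 2792\, b_2 &= 943.78 \\
2792\, b_1 + 210206.2302\, b_2 &= 69435.0404
\end{aligned}
```

Solving this pair gives approximately $b_2 = 0.2323$ and $b_1 = 7.3832$, the estimates reported for the household expenditure function.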


Now, we have two equations in two unknowns, and we can solve for $b_1$ and $b_2$

Multiply the first normal equation by $\sum_{t=1}^{T} x_t$ and the second by $T$,

$$T \sum_{t=1}^{T} x_t \, b_1 + \left( \sum_{t=1}^{T} x_t \right)^2 b_2 = \sum_{t=1}^{T} x_t \sum_{t=1}^{T} y_t$$

$$T \sum_{t=1}^{T} x_t \, b_1 + T \sum_{t=1}^{T} x_t^2 \, b_2 = T \sum_{t=1}^{T} y_t x_t$$


Now subtract the first of the above two equations from the second,

$$\left[ T \sum_{t=1}^{T} x_t^2 - \left( \sum_{t=1}^{T} x_t \right)^2 \right] b_2 = T \sum_{t=1}^{T} y_t x_t - \sum_{t=1}^{T} x_t \sum_{t=1}^{T} y_t$$

or,

$$b_2 = \frac{T \sum_{t=1}^{T} y_t x_t - \sum_{t=1}^{T} x_t \sum_{t=1}^{T} y_t}{T \sum_{t=1}^{T} x_t^2 - \left( \sum_{t=1}^{T} x_t \right)^2}$$

which is the least squares estimator for the slope


Now, having found the least squares estimator of the slope, $b_2$, let's solve for $b_1$. The first normal equation is,

$$T b_1 + b_2 \sum_{t=1}^{T} x_t = \sum_{t=1}^{T} y_t$$

Simply divide this equation by $T$,

$$b_1 + b_2 \frac{1}{T} \sum_{t=1}^{T} x_t = \frac{1}{T} \sum_{t=1}^{T} y_t$$

or,

$$b_1 + b_2 \bar{x} = \bar{y}$$

$$b_1 = \bar{y} - b_2 \bar{x}$$


To summarize, the least squares estimators for the intercept and slope of the regression line are,

$$b_1 = \bar{y} - b_2 \bar{x}$$

$$b_2 = \frac{T \sum_{t=1}^{T} y_t x_t - \sum_{t=1}^{T} x_t \sum_{t=1}^{T} y_t}{T \sum_{t=1}^{T} x_t^2 - \left( \sum_{t=1}^{T} x_t \right)^2}$$

You may see the formula for $b_2$ stated in several different forms. The derivation for one widely used formula can be found by substituting the formula for $b_1$ into the second normal equation as follows,

$$(\bar{y} - b_2 \bar{x}) \sum_{t=1}^{T} x_t + b_2 \sum_{t=1}^{T} x_t^2 = \sum_{t=1}^{T} y_t x_t$$

Expanding and re-arranging,

$$\bar{y} \sum_{t=1}^{T} x_t + b_2 \left( \sum_{t=1}^{T} x_t^2 - \bar{x} \sum_{t=1}^{T} x_t \right) = \sum_{t=1}^{T} y_t x_t$$


Or,

$$b_2 = \frac{\sum_{t=1}^{T} y_t x_t - \bar{y} \sum_{t=1}^{T} x_t}{\sum_{t=1}^{T} x_t^2 - \bar{x} \sum_{t=1}^{T} x_t}$$

Recognizing that $\sum_{t=1}^{T} x_t = T \bar{x}$, we can then write the following solution for $b_2$, which is known as the computational formula,

$$b_2 = \frac{\sum_{t=1}^{T} y_t x_t - T \bar{y} \bar{x}}{\sum_{t=1}^{T} x_t^2 - T \bar{x}^2}$$


Another version of the formula is,

$$b_2 = \frac{\sum_{t=1}^{T} (x_t - \bar{x})(y_t - \bar{y})}{\sum_{t=1}^{T} (x_t - \bar{x})^2}$$

which is equivalent to,

$$b_2 = \frac{\frac{1}{T-1} \sum_{t=1}^{T} (x_t - \bar{x})(y_t - \bar{y})}{\frac{1}{T-1} \sum_{t=1}^{T} (x_t - \bar{x})^2} = \frac{\hat{\sigma}_{xy}}{\hat{\sigma}_x^2} = \frac{\text{sample covariance of } x \text{ and } y}{\text{sample variance of } x}$$

This is often a helpful interpretation of the least squares slope estimator. A good exercise is to derive this form of the formula from the first one presented


Finally, the formula is often stated in "deviations from the mean" form. Let,

$$x_t^* = x_t - \bar{x} \quad \text{and} \quad y_t^* = y_t - \bar{y}$$

then,

$$b_2 = \frac{\sum_{t=1}^{T} y_t^* x_t^*}{\sum_{t=1}^{T} x_t^{*2}}$$

Important Points

• All of the different formulas for the least squares estimator $b_2$ are equivalent

• Assuming arithmetic accuracy, all formulas will yield exactly the same numerical values for a given sample of x and y
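The equivalence of the slope formulas can be checked on a small data set. The five (x, y) pairs below are hypothetical values chosen only for illustration; the variable names are likewise assumptions.

```python
# Hypothetical data, chosen only to illustrate the equivalence of the formulas
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [9.0, 13.0, 14.0, 20.0, 21.0]

T = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
x_bar, y_bar = sum_x / T, sum_y / T

# 1. Formula obtained by solving the normal equations
b2_sums = (T * sum_xy - sum_x * sum_y) / (T * sum_x2 - sum_x ** 2)

# 2. Computational formula
b2_comp = (sum_xy - T * y_bar * x_bar) / (sum_x2 - T * x_bar ** 2)

# 3. Deviations-from-the-mean form
dev_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
dev_x2 = sum((a - x_bar) ** 2 for a in x)
b2_dev = dev_xy / dev_x2

# 4. Sample covariance over sample variance (the 1/(T-1) factors cancel)
b2_cov = (dev_xy / (T - 1)) / (dev_x2 / (T - 1))

print(b2_sums, b2_comp, b2_dev, b2_cov)
```

Any differences between the four printed values come only from floating-point rounding, which is the "arithmetic accuracy" caveat in practice.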

Three Notable Properties of LS Estimates


1. Sum of the estimated errors always equals zero

First, note that the estimated error for each observation is simply the actual observation on $y$ minus the value projected by the estimated regression line, or

$$\hat{e}_t = y_t - b_1 - b_2 x_t$$

Condition "enforced" by the first normal equation

$$2 \sum_{t=1}^{T} (y_t - b_1 - b_2 x_t)(-1) = 0$$

or

$$\sum_{t=1}^{T} \hat{e}_t = 0$$


2. Estimated regression line must pass through the sample means of x and y (centroid)

Shown by first noting that $b_1 = \bar{y} - b_2 \bar{x}$, which can be re-written as $\bar{y} = b_1 + b_2 \bar{x}$

3. Zero correlation between the estimated errors and $x_t$, the explanatory variable

Condition "enforced" by the second normal equation

$$2 \sum_{t=1}^{T} (y_t - b_1 - b_2 x_t)(-x_t) = 0$$

or

$$\sum_{t=1}^{T} \hat{e}_t x_t = 0$$

⇒ No tendency of estimated errors for observations above (below) the mean of x to be positive (negative) and vice versa
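The three properties can be verified numerically on any data set. The sketch below uses five hypothetical (x, y) pairs, not the household data, and fits the line with the deviations-from-the-mean formula.

```python
# Hypothetical data used only to illustrate the three properties
x = [10.0, 20.0, 30.0, 40.0, 50.0]
y = [9.0, 13.0, 14.0, 20.0, 21.0]

T = len(x)
x_bar = sum(x) / T
y_bar = sum(y) / T

# Least squares estimates (deviations-from-the-mean form)
b2 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
      / sum((a - x_bar) ** 2 for a in x))
b1 = y_bar - b2 * x_bar

# Estimated errors: e_hat_t = y_t - b1 - b2 * x_t
residuals = [b - b1 - b2 * a for a, b in zip(x, y)]

sum_resid = sum(residuals)                                  # property 1
line_at_mean = b1 + b2 * x_bar                              # property 2
sum_resid_x = sum(e * a for e, a in zip(residuals, x))      # property 3

print(f"sum of estimated errors:        {sum_resid:.2e}")
print(f"line at x_bar vs y_bar:         {line_at_mean:.4f} vs {y_bar:.4f}")
print(f"sum of estimated errors * x_t:  {sum_resid_x:.2e}")
```

The first and third quantities are zero up to floating-point rounding, and the fitted line evaluated at the mean of x reproduces the mean of y.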


Estimates for the Household Expenditure Function

Based on the data from 40 randomly selected households, we can compute estimates of $\beta_1$ and $\beta_2$ as follows,

$$b_1 = \bar{y} - b_2 \bar{x} = 23.5945 - (0.232253)(69.8) = 7.3832$$

$$b_2 = \frac{T \sum_{t=1}^{T} y_t x_t - \sum_{t=1}^{T} x_t \sum_{t=1}^{T} y_t}{T \sum_{t=1}^{T} x_t^2 - \left( \sum_{t=1}^{T} x_t \right)^2} = \frac{(40)(69435.04) - (2792)(943.78)}{(40)(210206.23) - (2792)^2} = 0.2323$$

It is useful to report the estimates in terms of the estimated relationship between $y_t$ and $x_t$,

$$\hat{y}_t = 7.3832 + 0.2323 x_t$$

where $\hat{y}_t$ is the estimate of the expected (mean) food expenditure for a given level of income

• Sometimes $\hat{y}_t$ is called the "fitted value" of $y_t$

• $\hat{y}_t$ is a point on the LS line for a given $x_t$
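The arithmetic behind these estimates can be reproduced directly from the sample sums reported for the 40 households. This is a minimal sketch; the helper name `fitted` is an assumption introduced for illustration.

```python
# Sample sums reported in the lecture for the 40 households
T = 40
sum_x, sum_y = 2792.0, 943.78
sum_xy, sum_x2 = 69435.0404, 210206.2302

x_bar = sum_x / T
y_bar = sum_y / T

# Slope from the solved normal equations, then intercept from b1 = y_bar - b2 * x_bar
b2 = (T * sum_xy - sum_x * sum_y) / (T * sum_x2 - sum_x ** 2)
b1 = y_bar - b2 * x_bar

def fitted(x_t):
    """Estimated mean food expenditure at income level x_t."""
    return b1 + b2 * x_t

print(f"x_bar = {x_bar:.4f}, y_bar = {y_bar:.4f}")
print(f"b1 = {b1:.4f}, b2 = {b2:.6f}")
print(f"fitted value at x_bar: {fitted(x_bar):.4f}")
```

Evaluating `fitted` at the sample mean of income returns the sample mean of food expenditure, since the estimated line passes through the point of means.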


Sample Regression Output from Excel

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.563096017
R Square             0.317077125
Adjusted R Square    0.29910547
Standard Error       6.844922384
Observations         40

ANOVA
              df    SS             MS          F           Significance F
Regression    1     826.6352172    826.6352    17.64318    0.000155136
Residual      38    1780.412573    46.85296
Total         39    2607.04779

                Coefficients    Standard Error    t Stat      P-value     Lower 95%       Upper 95%
Intercept       7.383217543     4.008356335       1.841956    0.073296    -0.731275911    15.497711
X Variable 1    0.23225333      0.055293429       4.200378    0.000155    0.120317631     0.34418903


Interpreting the Least Squares Estimates

$$\hat{y}_t = 7.3832 + 0.2323 x_t$$

Intercept ($b_1$) = 7.3832 literally is an estimate of the expected (average) level of food expenditure per week when income is zero

• Caution needs to be exercised when interpreting intercept estimates

• Usually few, if any, observations around zero for the independent variable

• Suggests estimate may not be very reliable in this range of the independent variable

Slope ($b_2$) = 0.2323 is an estimate of the expected (average) change in food expenditure per week when income increases by one unit

• In this case, slope estimate indicates food expenditure per week is expected to increase by $0.23 when income per week increases by $1

• Important to emphasize that slope indicates expected change on average when income increases one unit, not the actual change

• Remember that the actual change will in all likelihood differ from the expected change due to the error term

• We can think of the expected change as being "on the line" while the actual change is "off the line"

Income elasticity of food expenditure also may be of interest. Recall the formula,

$$\eta_y = \frac{dy}{dx} \cdot \frac{x}{y} = \beta_2 \cdot \frac{x}{y}$$

Replacing $\beta_2$ with $b_2$ we can estimate the income elasticity as,

$$\hat{\eta}_y = b_2 \cdot \frac{x}{y}$$

This still leaves the question of what levels of x and y to use in estimating the elasticity

It is conventional to use the sample means, since that is a representative point on the regression line,

$$\hat{\eta}_y = b_2 \cdot \frac{\bar{x}}{\bar{y}}$$

In this case, the estimated income elasticity is,

$$\hat{\eta}_y = 0.2323 \cdot \frac{69.800}{23.595} = 0.687$$

Elasticity $\hat{\eta}_y = 0.687$ is an estimate of the expected (average) percent change in food expenditure per week when income increases by one percent

• In this case, elasticity estimate indicates food expenditure per week is expected to increase by 0.687 percent when income per week increases by 1 percent

• Remember that the income elasticity estimate will vary for different points on the estimated regression line
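The dependence of the elasticity on the point of evaluation can be illustrated with the rounded estimates from the lecture. In this sketch the income levels 30 and 110 are arbitrary illustrative values, 69.8 is the sample mean of income, and the function name `elasticity` is an assumption.

```python
# Rounded least squares estimates reported in the lecture
b1, b2 = 7.3832, 0.2323

def elasticity(x):
    """Estimated income elasticity at a point (x, y_hat) on the fitted line."""
    y_hat = b1 + b2 * x
    return b2 * x / y_hat

# 69.8 is the sample mean of income; 30 and 110 are arbitrary illustrative levels
for x in (30.0, 69.8, 110.0):
    print(f"income = {x:6.1f}   estimated elasticity = {elasticity(x):.3f}")
```

Because the intercept estimate is positive, the estimated elasticity rises with income along this particular line while staying below one.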


Linearity and Other Functional Forms

In the simple regression example considered here, only a "straight-line" relationship between food expenditure and income was considered

• A non-linear relationship may well be more appropriate

The simple regression model is more flexible than it appears

• x and y variables can be transformations, such as logarithms, squares, cubes, or reciprocals

This raises the question of what we mean when we state that the simple regression model is linear. There are two definitions of linearity


Linearity in variables: only a power of one on $x_t$ or $y_t$,

$y_t = \beta_1 + \beta_2 x_t + e_t$ Yes!

$y_t = \beta_1 + \beta_2 x_t^4 + e_t$ No!

$y_t^2 = \beta_1 + \beta_2 x_t + e_t$ No!

$\ln(y_t) = \beta_1 + \beta_2 \ln(x_t) + e_t$ No!

Linearity in parameters: only a power of one on $\beta_1$ or $\beta_2$, but higher powers and/or transformations are allowed on $x_t$ and/or $y_t$,

$y_t = \beta_1 + \beta_2 x_t^2 + e_t$ Yes!

$y_t = \beta_1 + \beta_2^2 x_t + e_t$ No!

$y_t = \beta_1 + \beta_2^3 x_t + e_t$ No!

$\ln(y_t) = \beta_1 + \beta_2 \ln(x_t) + e_t$ Yes!

$y_t = \beta_1 + \ln(\beta_2) x_t + e_t$ No!


The definition of linearity used in the simple linear regression model is linear in parameters

• Allows considerable flexibility in the specification of the functional form of the model

• We will study alternative functional forms that are linear in parameters next semester
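As a small illustration of linearity in parameters, a log-log relationship can be estimated with the same least squares formulas after transforming both variables. The data below are hypothetical and noise-free, generated from $y = 2x^{0.5}$, so the fit should recover a slope of 0.5 and an intercept of $\ln 2$; the function name `least_squares` is an assumption.

```python
import math

def least_squares(x, y):
    """Return (b1, b2) from the deviations-from-the-mean formulas."""
    T = len(x)
    x_bar = sum(x) / T
    y_bar = sum(y) / T
    b2 = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
          / sum((a - x_bar) ** 2 for a in x))
    b1 = y_bar - b2 * x_bar
    return b1, b2

# Hypothetical, noise-free data generated from y = 2 * x**0.5
x = [1.0, 4.0, 9.0, 16.0, 25.0]
y = [2.0 * math.sqrt(v) for v in x]

# The model ln(y) = beta1 + beta2 * ln(x) is linear in the parameters
ln_x = [math.log(v) for v in x]
ln_y = [math.log(v) for v in y]
b1, b2 = least_squares(ln_x, ln_y)

print(f"b1 = {b1:.4f}  (ln 2 = {math.log(2):.4f})")
print(f"b2 = {b2:.4f}")
```

The relationship is non-linear in the variables x and y, yet the estimation step is unchanged because the parameters still enter with a power of one.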
