1. Basic Econometrics Revision - Econometric Modelling


Advanced Econometrics

Dr. Uma Kollamparambil

Today's Agenda

• A quick round-up of basic econometrics
• Econometric modelling

Regression analysis

• Theory specifies the functional relationship
• Measurement of the relationship uses regression analysis to arrive at values of a and b: Y = a + bX + e
• Components: dependent and independent variables, intercept (a), coefficient (b), error term (e)
• A regression is simple or multivariate according to the number of independent variables

Requirements
• Model specification: the relationship between dependent and independent variables
– scatter plot
– specify the function that best fits the scatter
• Sufficient data for estimation
– cross-sectional
– time series
– panel

Some Important terminology

– Least squares regression: Y = a + bX + e
– Estimation
• point estimates
• interval estimates
– Inference
• t-statistic
• R-square or coefficient of determination
• F-statistic

Estimation -- OLS

Ordinary Least Squares (OLS)
• We have a set of data points and want to fit a line to the data
• The most "efficient" estimator can be shown to be OLS: it minimizes the squared distance between the line and the actual data points

[Figure: scatter plot of the data with the fitted OLS regression line]

• How do we estimate a and b in the linear equation?
• The OLS estimator solves:

Min over a, b of Σ [Yi − a − bXi]²

• This minimization problem can be solved using calculus
• The result is the OLS estimators of a and b

Regression Analysis -- OLS

• The basic equation: Yj = a + b·Xj + εj

• OLS estimator of b:  b̂ = Σ (Xi − X̄)(Yi − Ȳ) / Σ (Xi − X̄)²

• OLS estimator of a:  â = Ȳ − b̂X̄

Here a hat denotes an estimator, and a bar a sample mean.

Regression Analysis -- OLS

These are the estimated coefficients for the data:

Period  Demand (Y)  Price (X)
1       410         1
2       370         5
3       240         8
4       230         10
5       160         15
6       150         23
7       100         25

Coefficients
Intercept  384.98
Q (X)      -11.89
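As a check, the slide's coefficients can be reproduced directly from the OLS formulas above. A minimal sketch in Python, using the seven demand/price pairs from the table:

```python
# Reproducing the OLS estimates by hand, using
# b̂ = Σ(Xi − X̄)(Yi − Ȳ) / Σ(Xi − X̄)²  and  â = Ȳ − b̂X̄.
Y = [410, 370, 240, 230, 160, 150, 100]  # demand
X = [1, 5, 8, 10, 15, 23, 25]            # price

n = len(Y)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# Sums of squares and cross-products around the means
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
Sxx = sum((x - x_bar) ** 2 for x in X)

b_hat = Sxy / Sxx               # slope
a_hat = y_bar - b_hat * x_bar   # intercept

print(f"a = {a_hat:.2f}, b = {b_hat:.2f}")  # a = 384.98, b = -11.90
```

The hand formulas give the intercept 384.98 and slope −11.8952, matching the coefficients reported above.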

Regression Analysis -- Inference

Here, the R-squared is a measure of the goodness of fit of our model, while the standard deviation of b̂ gives us a measure of confidence for our estimate of b.

R² = Σ (Ŷi − Ȳ)² / Σ (Yi − Ȳ)²

S²_b̂ = [Σ (Yi − Ŷi)² / (n − k)] / Σ (Xi − X̄)²

Regression Analysis -- Confidence

These are "goodness of fit" measures reported by Excel for our example data.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.976786811
R Square            0.954112475
Adjusted R Square   0.94493497
Standard Error      27.08645377
Observations        7

ANOVA
            df  SS            MS        F         Significance F
Regression  1   76274.47725   76274.48  103.9621  0.000155729
Residual    5   3668.379888   733.676
Total       6   79942.85714

            Coefficients  Standard Error  t Stat    P-value
Intercept   87.10614525   17.92601129     4.859204  0.004636
Q (X)       12.2122905    1.19773201      10.19618  0.000156

Hypothesis testing

• Hypothesis formulation• Test:

– Confidence interval method: construct an interval for the estimated b at the desired level of confidence using the SE of b. Check whether the hypothesized b falls within it; if it does, accept the null hypothesis.

– Test of significance method: estimate the t-value of b and compare it with the table t-value. If the former is less than the latter, accept the null hypothesis.
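A sketch of the test-of-significance method for the demand/price example, assuming a two-tailed 5% test with n − 2 = 5 degrees of freedom (critical value 2.571 from a standard t-table):

```python
# t-test of H0: b = 0 for the demand/price regression.
Y = [410, 370, 240, 230, 160, 150, 100]
X = [1, 5, 8, 10, 15, 23, 25]
n = len(Y)
x_bar, y_bar = sum(X) / n, sum(Y) / n
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
Sxx = sum((x - x_bar) ** 2 for x in X)
Syy = sum((y - y_bar) ** 2 for y in Y)

b_hat = Sxy / Sxx
rss = Syy - b_hat * Sxy       # residual sum of squares
s2 = rss / (n - 2)            # error variance estimate (2 estimated parameters)
se_b = (s2 / Sxx) ** 0.5      # standard error of b̂

t_stat = b_hat / se_b
print(f"t = {t_stat:.3f}")    # t = -5.618
# |t| exceeds the critical value, so H0: b = 0 is rejected
print("reject H0" if abs(t_stat) > 2.571 else "accept H0")
```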

Hypothesis testing

The t-ratio: combined with critical values from a "Student's t" distribution, this ratio tells us how confident we are that a value is significantly different from zero.

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.976786811
R Square            0.954112475
Adjusted R Square   0.94493497
Standard Error      27.08645377
Observations        7

ANOVA
            df  SS            MS        F         Significance F
Regression  1   76274.47725   76274.48  103.9621  0.000155729
Residual    5   3668.379888   733.676
Total       6   79942.85714

            Coefficients  Standard Error  t Stat    P-value
Intercept   87.10614525   17.92601129     4.859204  0.004636
Q (X)       12.2122905    1.19773201      10.19618  0.000156

t = b̂ / S_b̂

Analysis of Variance: the F ratio
• The F ratio tests the overall significance of the regression:

F = [explained variation / (k − 1)] / [unexplained variation / (n − k)]
  = [R² / (k − 1)] / [(1 − R²) / (n − k)]

• It tests the marginal contribution of a new variable:

F_new = [(ESS_new − ESS_old) / no. of new X] / [RSS of new model / (n − k)]

• It tests for structural change in the data (Chow test):

F = [(RSS_R − RSS_UR) / k] / [RSS_UR / (n1 + n2 − 2k)]
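The overall-significance formula can be checked against the Excel output shown earlier: plugging the reported R² (0.954112, with n = 7 observations and k = 2 estimated parameters) into F = [R²/(k−1)] / [(1−R²)/(n−k)] recovers the reported F statistic:

```python
# Overall significance of the regression from R² alone.
R2 = 0.954112475
n, k = 7, 2

F = (R2 / (k - 1)) / ((1 - R2) / (n - k))
print(f"F = {F:.2f}")  # F = 103.96, matching the ANOVA table
```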

Multivariate regression

Stacking the n observations of Yi = β1 + β2X2i + … + βkXki + ui gives

y = Xβ + u

where y is an n×1 vector of observations on the dependent variable, X is an n×k matrix of observations on the regressors (its first column a column of ones for the intercept), β is a k×1 vector of coefficients and u is an n×1 vector of disturbances. The OLS estimator is

β̂ = (X'X)⁻¹X'Y

Assumptions of OLS regression

– The model is correctly specified and is linear in parameters
– X values are fixed in repeated sampling and Y values are continuous and stochastic
– Each ui is normally distributed with mean E(ui) = 0
– Equal variance of ui (homoscedasticity)
– No autocorrelation: no correlation between ui and uj
– Zero covariance between Xi and ui
– No multicollinearity, Cov(Xi, Xj) = 0, in multivariate regression
– Under the assumptions of the CNLRM, the estimates are BLUE

Regression Analysis: Some Problems

• Autocorrelation: covariance between error terms
– Identification: DW d test, ranging 0-4 (near 2 indicates no autocorrelation)
– R² is overestimated
– t and F tests misleading
– Missing variable: specify the model correctly
– Consider an AR scheme

• Heteroscedasticity: non-constant variance
– Detection: scatter plot of error terms, Park test, Goldfeld-Quandt test, White test, etc.
– t and F tests misleading
– Remedial measures include transformation of variables through WLS

• Multicollinearity: covariance between various X variables
– Detection: high R² but insignificant t tests, high pair-wise correlation between explanatory variables
– t and F tests misleading
– Remedies: remove model over-specification, use pooled data, transform variables

Model Specification

Sources of misspecification
• Omission of a relevant variable
• Inclusion of unnecessary variables
• Wrong functional form
• Errors of measurement
• Incorrect specification of the stochastic error term

Model Specification Errors: Omitting Relevant Variables and Including Irrelevant Variables

• To properly estimate a regression model, we need to have specified the correct model

• A typical specification error occurs when the estimated model does not include the correct set of explanatory variables

• This specification error takes two forms
– Omitting one or more relevant explanatory variables
– Including one or more irrelevant explanatory variables

• Either form of specification error results in problems with OLS estimates

Model Specification Errors: Omitting Relevant Variables

• Example: two-factor model of stock returns
• Suppose that the true model that explains a particular stock's returns is given by a two-factor model with the growth of GDP and the inflation rate as factors:

rt = β0 + β1GDPt + β2INFt + εt

• Suppose instead that we estimated the following model:

rt = β0 + β1GDPt + ε*t

• Thus, the error term of this model is actually equal to

ε*t = β2INFt + εt

• If there is any correlation between the omitted variable (INF) and the explanatory variable (GDP), then there is a violation of the classical assumption Cov(ui, Xi) = 0

Model Specification Errors:Omitting Relevant Variables

• This means that the explanatory variable and the error term are correlated

• If that is the case, the OLS estimate of β1 (the coefficient of GDP) will be biased

• As in the above example, it is highly likely that there will be some correlation between two financial (or economic) variables

• If, however, the correlation is low or the true coefficient of the omitted variable is zero, then the specification error is very small

• When Cov(X1, X2) ≠ 0, estimates of both the constant and the slope are biased

• The bias persists even in larger samples
• When Cov(X1, X2) = 0, the constant is biased but the slope is unbiased
• The variance of the error is incorrectly estimated
• Consequently, the variance of the slope estimator is biased
• This leads to misleading conclusions from confidence intervals and hypothesis tests about the statistical significance of the estimated parameters

• Forecasts based on the mis-specified model will therefore be unreliable

Model Specification Errors: Omitting Relevant Variables

● To avoid omitted variable bias, a simple solution is to add the omitted variable back to the model; the difficulty with this solution is detecting which variable has been omitted

● Omitted variable bias is hard to detect, but there could be some obvious indications of this specification error.

● The best way to detect the omitted variable specification bias is to rely on the theoretical arguments behind the model.

- Which variables does the theory suggest should be included?
- What are the expected signs of the coefficients?
- Have we omitted a variable that most other similar studies include in the estimated model?

● Note, though, that a significant coefficient with the unexpected sign can also occur due to a small sample size

● However, most of the data sets used in empirical finance are large enough that this most likely is not the cause of the specification bias.


Model Specification Errors:Including Irrelevant Variables

• Example: Going back to the two-factor model, suppose that we include a third explanatory variable in the model, for example, the degree of wage inequality (INEQ)

• So, we estimate the following model

• The estimated coefficients (both constant and slope) are unbiased

• The variance of the error term is estimated accurately

rt = β0 + β1GDPt + β2INFt + β3INEQt + εt

• However, the coefficient estimators are inefficient: their variances are larger than necessary

• The inclusion of an irrelevant variable (INEQ) in the model increases the standard errors of the estimated coefficients and, thus, decreases the t-statistics

• This implies that it will be more difficult to reject a null hypothesis that a coefficient of one of the explanatory variables is equal to zero


Model Specification Errors:Including Irrelevant Variables

• Also, the inclusion of an irrelevant variable will usually decrease the adjusted R-sq (but not the R-sq)

• An overspecified model is considered a lesser evil than an underspecified model

• But it brings other problems, such as multicollinearity and loss of degrees of freedom

Model Specification Criteria
• To decide whether an explanatory variable belongs in a regression model, we can test whether most of the following conditions hold

– The importance of theory: Is the decision to include an explanatory variable in the model theoretically sound?

– t-Test: Is the variable statistically significant and does it have the expected coefficient sign?

– Adjusted R2: Does the overall fit of the model improve when we add the explanatory variable?

– Bias: Do the coefficients of the other variables change significantly (sign or statistical significance) when we add the variable to the model?

Problems with Specification Searches
• In an attempt to find the "right" or "desired" model, a researcher may estimate numerous models until an estimated model with the desired properties is obtained

• The wrong approach to model specification is data mining
– In this case, the researcher would estimate every possible model and choose to report only those that produce desired results

• The researcher should try to minimize the number of estimated models and guide the selection of variables mainly on theory and not purely on statistical fit

Sequential Model Specification Searches

• In an effort to find the appropriate regression model, it is common to begin with a benchmark (or base) specification and then sequentially add or drop variables

• The base specification can rely on theory and then add or drop variables based on adjusted R2 and t-statistics

• In this effort, it is important to follow the principle of parsimony: try to find the simplest model that best fits the data

• Make use of the F test for incremental contribution of variables

F test for incremental contribution of variables

• Very useful test in deciding if a new variable should be retained in the model

• e.g. Return on a stock is a function of GDP and inflation of the country.

• Question is should we include inflation in the model.

• Estimate the model without inflation and get R²(old); re-estimate including inflation and get R²(new).

Ho: Addition of new variable does not improve the model

H1: Addition of new variable improves the model

If estimated F is higher than critical F table value, reject null hypothesis. It means inflation needs to be included in the above example.

F = [(R²_new − R²_old) / no. of new parameters] / [(1 − R²_new) / (n − k_new)]
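A numeric sketch of the incremental F test, with hypothetical R² values and sample size (not taken from the slides):

```python
# A model with R²_old = 0.60 gains one regressor (say, inflation) and
# improves to R²_new = 0.70, with n = 30 and k_new = 4 estimated parameters.
R2_old, R2_new = 0.60, 0.70
n, k_new = 30, 4
m = 1  # number of new parameters

F = ((R2_new - R2_old) / m) / ((1 - R2_new) / (n - k_new))
print(f"F = {F:.2f}")  # F = 8.67
# Compare with the critical F(1, 26) value at the 5% level (about 4.23);
# here F exceeds it, so the new variable should be retained.
```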


Nominal vs. True Level of Significance
• A model derived from data mining should not be assessed at conventional levels of significance (α) such as 1, 5 or 10%

• If there were c candidate regressors, of which k are selected after data mining, the true level of significance (α*) is related to the nominal significance level (α) as:

α* = 1 − (1 − α)^(c/k)

• If c = 2, k = 1 and α = 5%, then α* ≈ 10%
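The c = 2, k = 1 example can be verified directly:

```python
# True significance level after data mining: α* = 1 − (1 − α)^(c/k).
c, k, alpha = 2, 1, 0.05
alpha_star = 1 - (1 - alpha) ** (c / k)
print(f"true significance = {alpha_star:.4f}")  # 0.0975, i.e. about 10%
```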

Model Specification:Choosing the Functional Form

• One of the assumptions to derive the nice properties of OLS estimates is that the estimated model is linear

• What if the relationship between two variables is not linear?

• OLS maintains its nice properties of unbiased and minimum variance estimates if we transform the non-linear relationship into a model that is linear in the coefficients

• An interesting case: the double-log (log-log) form

Model Specification: Choosing the Functional Form

●Example: A well-known model of nominal exchange rate determination is the Purchasing Power Parity (PPP) model

s = P/P*

s = nominal exchange rate (e.g. rand/$), P = price level in South Africa, P* = price level in the US

●Taking natural logs, we can estimate the following model

ln(s) = β0 + β1ln(P) + β2ln(P*) + εi

●Property of double-log model: Estimated coefficients show elasticities between dependent and explanatory variables

●Example: A 1% change in P will result in a β1% change in the nominal exchange rate (s).
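A quick numeric check of this elasticity interpretation (A and β1 are illustrative values, not estimates from the slides):

```python
# In s = A·P^β1 (holding P* fixed), a 1% rise in P should change s
# by approximately β1 percent.
A, beta1 = 2.0, 0.8
P = 100.0

s0 = A * P ** beta1
s1 = A * (1.01 * P) ** beta1          # P raised by 1%
pct_change = (s1 - s0) / s0

print(f"{pct_change:.4%}")            # close to 0.80%, i.e. β1 percent
```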

●How do we know if we’ve gotten the right functional form for our model?

- Expected coefficient signs, R2, t-stat and DW d-stat

Model Specification:Choosing the Functional Form

• If the fit is not satisfactory:
– examine the error terms
– use economic theory to guide you

• We've seen that a linear regression can fit nonlinear relationships

• Can use logs on the RHS, LHS or both
• Can use quadratic forms of the x's
• Can use interactions of the x's

Model Specification:Choosing the Functional Form

How to choose Functional Form

• Think about the interpretation.

• Does it make more sense for x to affect y in percentage (use logs) or absolute terms?

• Does it make more sense for the derivative of x1 to vary with x1 (quadratic) or with x2 (interactions) or to be fixed?

How to choose Functional Form (cont'd)
• We already know how to test joint exclusion restrictions to see if higher-order terms or interactions belong in the model

• It can be tedious to add and test extra terms, and we may find a square term matters when really using logs would be even better

• A test of functional form is Ramsey's regression specification error test (RESET)

DW test for model misspecification

• You suspect that a relevant variable Z (which might be a polynomial of the existing X) was omitted from the assumed model.

• From the assumed model, obtain OLS residuals.

• Order residuals according to increasing values of Z

• Compute d stat from thus ordered residuals.

• If autocorrelation is noticed, then the model is mis-specified.

Ramsey's RESET
• Regression Specification Error Test
• Estimate the assumed model and derive ŷ
• Then, estimate y = β0 + β1x1 + … + βkxk + δ1ŷ² + δ2ŷ³ + error and test

• H0: δ1 = 0, δ2 = 0

• If H0 is rejected, the model is mis-specified
• Advantage: in RESET you don't have to specify the correct alternative model
• Disadvantage: it doesn't help in attaining the right model


Lagrange Multiplier Test for Adding Variables

• Y = b0 + b1X1  … Eq. 1 (restricted)
• Y = b0 + b1X1 + b2X2 + b3X3  … Eq. 2 (unrestricted)

• Obtain the residuals from Eq. 1 and regress them on all the X's in Eq. 2, including those in Eq. 1:

• ui = a0 + a1X1 + a2X2 + a3X3

• If the estimated chi-square exceeds the critical chi-square, reject the restricted regression.

n·R² ~ χ²(no. of restrictions)
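A numeric sketch of the decision rule, with a hypothetical auxiliary R² (the χ²(2) critical value of 5.991 at the 5% level comes from a standard table):

```python
# Residuals from the restricted model regressed on all regressors of the
# unrestricted model give an auxiliary R² of 0.15, with n = 50 observations
# and 2 restrictions (X2 and X3 excluded) — hypothetical numbers.
n, aux_R2 = 50, 0.15

LM = n * aux_R2          # LM statistic, distributed χ²(no. of restrictions)
chi2_crit = 5.991        # χ²(2) critical value at the 5% level

print(f"LM = {LM:.1f}")  # LM = 7.5
print("reject restricted model" if LM > chi2_crit else "keep restricted model")
```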

Nested vs. Non-nested Models

• Nested:
– Y = a + b1X1 + b2X2 + b3X3 + b4X4
– Y = a + b1X1 + b2X2

• Specification tests and the restricted F test can be used to test for model specification errors

• Non-nested:
– Y = a + b1X1 + b2X2
– Y = c0 + c1Z1 + c2Z2

Tests for Non-nested Models

• 1) Discrimination approach: simply select better model based on goodness of fit– Rsq, Adj-Rsq, AIC, SIC, SBC

• 2)Discerning approach: make use of information provided by other models as well along with the initial model

AIC = e^(2k/n) · RSS/n

SIC = n^(k/n) · RSS/n

Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k),  where R² = ESS/TSS
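A sketch of the discrimination approach using the AIC and SIC forms above, with hypothetical RSS and k values:

```python
import math

# Two candidate models on the same n = 50 observations (hypothetical values):
n = 50
rss_a, k_a = 120.0, 3   # model A: smaller, slightly worse raw fit
rss_b, k_b = 110.0, 6   # model B: larger, slightly better raw fit

def aic(rss, k, n):
    # AIC = e^(2k/n) · RSS/n
    return math.exp(2 * k / n) * rss / n

def sic(rss, k, n):
    # SIC = n^(k/n) · RSS/n
    return n ** (k / n) * rss / n

for name, rss, k in [("A", rss_a, k_a), ("B", rss_b, k_b)]:
    print(f"model {name}: AIC = {aic(rss, k, n):.3f}, SIC = {sic(rss, k, n):.3f}")
# Both criteria penalise model B's three extra parameters enough that the
# smaller model A is preferred (lower value wins).
```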

Non-nested Discerning Tests
• If the models have the same dependent variable but non-nested x's, we could simply form a giant model with the x's from both and test the joint exclusion restrictions that lead to one model or the other.

• Y = a + b1X1 + b2X2
• Y = c0 + c1Z1 + c2Z2
• Y = a + b1X1 + b2X2 + c1Z1 + c2Z2

• Use the F test, taking each equation in turn as the reference model.

F = [(R²_new − R²_old) / no. of new parameters] / [(1 − R²_new) / (n − k_new)]

Davidson-MacKinnon J test

• An alternative, the Davidson-MacKinnon test, uses ŷ from one model as regressor in the second model and tests for significance.

• Y = a + b1X1 + b2X2  … (A)

• Y = c0 + c1Z1 + c2Z2  … (B)

• Estimate B and obtain ŷB

• Y = a + b1X1 + b2X2 + b3·ŷB

● Use a t-test: if b3 = 0 is not rejected, we accept model A

● Reverse the models and re-do steps

● More difficult if one model uses y and the other uses ln(y)

● Can follow same basic logic and transform predicted ln(y) to get ŷ for the second step

● In any case, Davidson-MacKinnon test may reject neither or both models rather than clearly preferring one specification


Measurement Error

• Sometimes we have the variable we want, but we think it is measured with error

• Examples: A survey asks how many hours did you work over the last year, or how many weeks you used child care when your child was young

• Consequences of Measurement error in y different from measurement error in x

Measurement Error: Dependent Variable

• y* is not directly measurable; it is measured with error as y = y* + e

• Thus, we are really estimating y = β0 + β1x1 + … + βkxk + (u + e)

• When will OLS produce unbiased results?

• Only if E(e) = E(u) = 0 and e is uncorrelated with the xj and with u is β̂ unbiased

• But β̂ has larger variances than with no measurement error

Measurement Error: Explanatory Variable

• x1* is not directly measurable; it is measured with error as x1 = x1* + e1

• Define the measurement error as e1 = x1 − x1*

• Since the true model is y = β0 + β1x1* + u = β0 + β1(x1 − e1) + u,

• we are really estimating y = β0 + β1x1 + (u − β1e1)

● Assume E(e1) = 0, cov(ei, ej) = 0, cov(ei, ui) = 0

● The effect of measurement error on the OLS estimates depends on our assumption about the correlation between e1 and x1

● If Cov(x1, e1) ≠ 0, the OLS estimates are biased and their variances larger

●Use Proxy or IV variables
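The direction of this bias can be made concrete with the standard errors-in-variables attenuation result (a textbook result, not derived in the slides): when e1 is uncorrelated with x1*, plim β̂1 = β1·Var(x1*)/(Var(x1*) + Var(e1)), so the OLS slope is biased toward zero.

```python
# Classical errors-in-variables attenuation with illustrative values.
beta1 = 2.0        # true slope (assumed)
var_x_star = 4.0   # variance of the correctly measured regressor (assumed)
var_e = 1.0        # variance of the measurement error (assumed)

plim_b = beta1 * var_x_star / (var_x_star + var_e)
print(f"plim of OLS slope = {plim_b:.2f}")  # 1.60 < 2.0: attenuated toward zero
```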


Proxy Variables
• What if the model is mis-specified because no data is available on an important x variable?

• It may be possible to avoid omitted variable bias by using a proxy variable

• A proxy variable must be related to the unobservable variable –

• But must be uncorrelated with the error term

• Sargan test

Lagged Dependent Variables

• What if there are unobserved variables, and you can’t find reasonable proxy variables?

• May be possible to include a lagged dependent variable to account for omitted variables that contribute to both past and current levels of y

• Obviously, you must think past and current y are related for this to make sense

Missing Data – Is it a Problem?

• If any observation is missing data on one of the variables in the model, it can’t be used

• If data is missing at random, using a sample restricted to observations with no missing values will be fine

• A problem can arise if the data is missing systematically – say high income individuals refuse to provide income data

Non-random Samples

• If the sample is chosen on the basis of an x variable, then estimates are unbiased

• If the sample is chosen on the basis of the y variable, then we have sample selection bias

• Sample selection can be more subtle

• Say looking at wages for workers – since people choose to work this isn’t the same as wage offers

Outliers

• Sometimes an individual observation can be very different from the others, and can have a large effect on the outcome

• Sometimes this outlier will simply be due to errors in data entry; this is one reason why looking at summary statistics is important

• Sometimes the observation will just truly be very different from the others

Outliers (cont'd)

• Not unreasonable to fix observations where it’s clear there was just an extra zero entered or left off, etc.

• Not unreasonable to drop observations that appear to be extreme outliers, although readers may prefer to see estimates with and without the outliers

• Can use Stata to investigate outliers

Model Selection Criteria
• Be data admissible: predictions must be realistic

• Be consistent with theory

• Have weakly exogenous regressors

• Exhibit parameter constancy: values and signs must be consistent

• Exhibit data coherency: white noise residuals

• Be encompassing

Matrix Approach to OLS

• y = Xβ + u, where y is n×1, X is n×(k+1), β is (k+1)×1 and u is n×1

• β̂ = (X'X)⁻¹X'Y
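The matrix formula β̂ = (X'X)⁻¹X'Y can be verified on the earlier demand/price data with a small pure-Python sketch (the Gaussian-elimination solver is an illustrative helper, not from the slides):

```python
def solve(A, b):
    """Solve A·x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

Y = [410, 370, 240, 230, 160, 150, 100]
# X has a column of ones (intercept) and the price column
X = [[1.0, 1], [1.0, 5], [1.0, 8], [1.0, 10], [1.0, 15], [1.0, 23], [1.0, 25]]

k = len(X[0])
# Normal equations: (X'X) β̂ = X'Y
XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
XtY = [sum(row[i] * y for row, y in zip(X, Y)) for i in range(k)]

b_hat = solve(XtX, XtY)  # [intercept, slope]
print([round(v, 2) for v in b_hat])  # [384.98, -11.9]
```

This reproduces the intercept 384.98 and slope −11.89 found earlier with the scalar formulas.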

Assumptions

• E(u) = 0, where u and 0 are n×1 column vectors, 0 being a null vector

• E(uu') = σ²I, where I is an n×n identity matrix (homoscedasticity and no autocorrelation)

• The n×k matrix X is non-stochastic

● The rank of X is ρ(X) = k, where k is the number of columns in X and k is less than the number of observations, n (no multicollinearity)

● Equivalently, λ'x = 0 only when λ = 0, where λ' is a 1×k row vector and x is a k×1 column vector

● The u vector has a multivariate normal distribution, i.e. u ~ N(0, σ²I)

0=xIλ

In matrix notation:

var-cov(β̂) = σ²(X'X)⁻¹,  estimated with σ̂² = û'û/(n − k) = Σûᵢ²/(n − k)

R² = (β̂'X'y − nȲ²) / (y'y − nȲ²)