
Applied Time-Series: A Very Short Introduction

Professor: Alan G. Isaac

April 4, 2018

Contents

1 Description
  1.1 Normal Distribution
  1.2 Skewness and Kurtosis
  1.3 Correlation
    1.3.1 Sample Correlation
  1.4 Correlation and Linear Regression

2 Difference Equations

3 Stochastic Process
  3.1 Stationarity
  3.2 White Noise
  3.3 Moving Average Process
  3.4 AR Process
  3.5 Stationary Case
  3.6 Half-Life to Convergence

4 Unit Root Tests
  4.1 Some Perspective
  4.2 Deterministic Regressors

5 Augmented Dickey-Fuller Tests
  5.1 Lag Length Selection
  5.2 Nelson and Plosser
  5.3 Selection of Deterministic Regressors
    5.3.1 Phillips-Perron Test
  5.4 Two Roots
  5.5 ARMA Processes
    5.5.1 Stationarity and Stability
    5.5.2 The Role of the MA Term
    5.5.3 The Autocorrelation Function
  5.6 Model Selection Criteria
    5.6.1 Finite Prediction Error (FPE)
    5.6.2 Information Criteria
  5.7 Diagnostics
    5.7.1 Likelihood Ratio Test

6 Spurious Regression

7 Testing for Unit Roots

8 Cointegration
  8.0.1 A Multivariate Generalization

9 The Engle-Granger Cointegration Test

10 Error Correction
  10.1 Vector Error Correction Models

11 The Johansen Cointegration Test

12 Structural Breaks

Bibliography

These notes are extremely preliminary. Please report typos and other problems.

    1 Description

    1.1 Normal Distribution

A prominent example of a continuous distribution is the Gaussian or Normal distribution. Let Y be a normally distributed random variable with mean µ and variance σ². By definition, its probability density function (pdf) is

$$p_Y(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{(y-\mu)^2}{2\sigma^2} \right\} \qquad (1)$$

A standard shorthand for this is Y ∼ N(µ, σ²). If we think of p_Y(y) as telling us roughly how likely we are to see Y near y, then we see that a normally distributed random variable is most likely to take on values near its mean µ. We also see that when the variance σ² is large, the "penalty" for a given deviation from the mean is reduced.

    Exercise 1 (Descriptive Statistics 1)

    Generate 300 independent realizations of a standard normal random variable and

    examine the sample properties: the number of observations, the minimum and max-

    imum values, the mean, and the variance.

    The standard normal distribution has mean of zero and variance of unity. If X

    is the random variable, we write X ∼ N(0, 1), where the arguments give the mean

    and the variance. Now multiply each realization of your first series by 2 to produce

    a new series. Look at your descriptive statistics. What happens to the mean? What

    happens to the variance?

xs = RandomVariate[NormalDistribution[0, 1], 300];
Through[{Min, Max, Mean, Variance, Length}[xs]]
Through[{Min, Max, Mean, Variance, Length}[2*xs]]
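A minimal Python counterpart to this Mathematica snippet may be helpful; it assumes only numpy, and the seed and variable names are illustrative. Doubling each realization leaves the sample mean essentially unchanged but multiplies the sample variance by four.

import numpy as np

rng = np.random.default_rng(0)          # seed chosen only for reproducibility
xs = rng.standard_normal(300)           # 300 draws from N(0, 1)

for series in (xs, 2 * xs):
    # number of observations, min, max, mean, and (biased) variance
    print(len(series), series.min(), series.max(),
          series.mean(), series.var(ddof=0))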

    1.2 Skewness and Kurtosis

We often use a sample mean and variance as estimates of the true ("population") mean and variance. The third moment about the sample mean, $\frac{1}{T}\sum_t (X_t - \bar{X})^3$, might seem like a natural place to look for asymmetry in the distribution. Instead we use the closely related concept of sample skewness:

$$S(X) = \frac{T^{1/2}\sum_t (X_t - \bar{X})^3}{\left(\sum_t (X_t - \bar{X})^2\right)^{3/2}} = \frac{1}{T}\sum_t (X_t - \bar{X})^3 \big/ s^3 \qquad (2)$$

where s is based on the biased estimator of the variance (Bickel and Doksum 1977, p.388). The skewness of a symmetric distribution, such as the normal distribution, is zero, so S(X) should be near zero for samples from a symmetric distribution. If the upper tail of the distribution is thicker, S(X) > 0. If the lower tail of the distribution is thicker, S(X) < 0.

The fourth sample moment about the mean would seem, like the second, to offer an indication of dispersion. It does, but it puts even more weight on observations far from the mean. We take advantage of this to ask whether the sample is relatively peaked and heavy-tailed (leptokurtic) or relatively flat and thin-tailed (platykurtic). The statistic we use is the fourth standardized moment, known as sample kurtosis:

$$K(X) = \frac{T\sum_t (X_t - \bar{X})^4}{\left(\sum_t (X_t - \bar{X})^2\right)^{2}} = \frac{1}{T}\sum_t (X_t - \bar{X})^4 \big/ s^4 \qquad (3)$$

where again s is based on the biased estimator of the variance (Bickel and Doksum 1977, p.388). Kurtosis measures the peakedness of the distribution of the series; K = 3 for the normal distribution. Distributions that are more peaked and have thicker tails have greater kurtosis; distributions that are less peaked and have thinner tails have smaller kurtosis. If the kurtosis exceeds 3, the distribution is peaked (leptokurtic) relative to the normal; if the kurtosis is less than 3, the distribution is flat (platykurtic) relative to the normal.
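Equations (2) and (3) are straightforward to compute directly. Here is a minimal Python sketch (numpy assumed), using the biased 1/T variance estimator as in the text; note that both statistics are unchanged when the series is rescaled.

import numpy as np

def sample_skewness(x):
    d = np.asarray(x, dtype=float) - np.mean(x)
    s = np.sqrt((d ** 2).mean())        # biased (1/T) standard deviation
    return (d ** 3).mean() / s ** 3     # equation (2)

def sample_kurtosis(x):
    d = np.asarray(x, dtype=float) - np.mean(x)
    s = np.sqrt((d ** 2).mean())
    return (d ** 4).mean() / s ** 4     # equation (3); about 3 for normal samples

rng = np.random.default_rng(0)
xs = rng.standard_normal(300)
print(sample_skewness(xs), sample_kurtosis(xs))
print(sample_skewness(2 * xs), sample_kurtosis(2 * xs))   # unchanged by rescaling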

    Exercise 2 (Descriptive Statistics 2)

Produce the skewness and kurtosis of your two normal series above. What do you notice about these two statistics?

Through[{Skewness, Kurtosis}[xs]]
Through[{Skewness, Kurtosis}[2*xs]]

    Exercise 3 (Descriptive Statistics 3)

    Now let us conduct the same examinations for a uniform random variable.

    We notice a clear difference in the CDF plots for the two distributions, and we

    see the relative peakedness of the normal distribution in its higher sample kurtosis.

    This observation suggests that we might test whether our sample is generated by

    draws from a normal distribution. Informally, we might look at whether the mean

and median are approximately equal. We could also check whether S(X) ≈ 0 and K(X) ≈ 3 hold approximately. Finally, we can turn to formal tests.

    The Jarque-Bera statistic allows a simple test for normality:

$$JB(X) = \frac{T}{6}\left[ S^2(X) + \frac{[K(X)-3]^2}{4} \right] \qquad (4)$$

The test statistic summarizes the deviation of the skewness and kurtosis of the series from those of the normal distribution. If the series is drawn from a normal distribution, we should find JB(X) near zero. Under the null hypothesis of a normal distribution, the Jarque-Bera statistic is distributed as χ² with 2 degrees of freedom. The resulting p-value is the probability under the null that a Jarque-Bera statistic exceeds the observed value; a small p-value leads to rejection of the null hypothesis of a normal distribution.

If the series we are testing is itself constructed from estimated coefficients (for example, regression residuals), we need an adjustment k for the number of estimated coefficients used to create the series:

$$JB(X) = \frac{T-k}{6}\left[ S^2(X) + \frac{[K(X)-3]^2}{4} \right] \qquad (5)$$

    (Note: Eviews 3.0 did not adjust the JB statistic on its residuals as it should.)
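As a sketch of equations (4) and (5) in Python, scipy is assumed for the χ²(2) distribution and for skewness and kurtosis computed with the biased 1/T conventions used above; the helper name jarque_bera is ours.

import numpy as np
from scipy.stats import chi2, kurtosis, skew

def jarque_bera(x, k=0):
    # JB statistic of equation (5) and its chi-square(2) p-value;
    # k is the number of estimated coefficients used to create the series (0 for raw data)
    x = np.asarray(x, dtype=float)
    T = x.size
    S = skew(x, bias=True)                         # matches equation (2)
    K = kurtosis(x, fisher=False, bias=True)       # matches equation (3), not excess kurtosis
    jb = (T - k) / 6.0 * (S ** 2 + (K - 3.0) ** 2 / 4.0)
    return jb, chi2.sf(jb, df=2)                   # p-value under the null of normality

rng = np.random.default_rng(0)
print(jarque_bera(rng.standard_normal(300)))       # normal draws: typically a large p-value
print(jarque_bera(rng.uniform(-1, 1, size=300)))   # thin-tailed uniform: typically rejects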

    1.3 Correlation

    We are often interested in the joint distribution of random variables. Two r.v.s are

    independent when the value that one takes does not depend on the value the other

    takes. Consider the following tables of probabilities for two discrete r.v.s X and Y :

First table:              Second table:

        x1    x2                  x1    x2
  y1   0.4   0.1            y1   0.2   0.3
  y2   0.2   0.3            y2   0.2   0.3

    In each table, all the probabilities add up to one. In the first table, if we are told

    Y = y1 then we judge X = x1 four times as likely as X = x2, while if we know

Y = y2 then we find it only two-thirds as likely. In the second table, the probability of X = x1 is 0.4 no matter what we know about Y.

The first table is the case that interests us here: X and Y tend to vary together rather than independently. This introduces the notion of the covariance between two random variables:

    cov(X, Y ) = E(X − EX)(Y − EY ) (6)

    Covariance is a measure of the linear association between two r.v.s. If X and Y tend

    to be above their means together, then the covariance is positive. If one tends to be

    high when the other is low, then their covariance is negative.

One problem with covariance as a measure of relatedness is that it depends on the units of measurement. We therefore more often attend to the correlation between two variables. Correlation is just normalized covariance: we "deflate" the covariance between two random variables by the product of their standard deviations. That is, the correlation coefficient is defined as

$$\mathrm{cor}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} \qquad (7)$$

This deflation or normalization ensures that the measurement is scale free. The correlation coefficient always lies between −1 and 1.

    1.3.1 Sample Correlation

    We would like to be able to use the sample correlation as an estimate of the true

    correlation between variables.

$$\widehat{\mathrm{cor}}(X, Y) = \frac{\widehat{\mathrm{cov}}(X, Y)}{s_X s_Y} \qquad (8)$$

where

$$\widehat{\mathrm{cov}}(X, Y) = \frac{1}{T}\sum_{t=1}^{T} (X_t - \bar{X})(Y_t - \bar{Y}) \qquad (9)$$

Our covariance estimate in (9) is biased: dividing by T does not account for the degree of freedom lost in estimating the means. The sample correlation, however, is unaffected by this choice, because the same divisor appears in the covariance and in both standard deviations and so cancels in the ratio.
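A quick numerical check of this cancellation, assuming numpy: whichever divisor we use, the same factor appears in the covariance and in both standard deviations, so the estimated correlation is identical.

import numpy as np

def corr(x, y, ddof):
    # the divisor (T - ddof) is used for the covariance and for both variances
    cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - ddof)
    return cov / np.sqrt(x.var(ddof=ddof) * y.var(ddof=ddof))

rng = np.random.default_rng(0)
x = rng.standard_normal(300)
y = x + rng.standard_normal(300)
print(corr(x, y, ddof=0), corr(x, y, ddof=1))   # identical
print(np.corrcoef(x, y)[0, 1])                  # numpy agrees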

    Exercise 4 (Correlation)

    We will generate correlated random variables and look for the correlation in their

    scatter plots. Proceed as follows:

- generate two independent series, n1 and n2, each of 300 independent draws from a standard normal distribution;
- create the following related series: n3 = n2 + n1, n4 = n2 − n1, and n5 = 50n2 − n1;
- determine the theoretical and empirical correlations between these series;
- create scatter plots of each ni against n1. Can you see the predicted/actual correlation?

{ns01, ns02} = RandomVariate[NormalDistribution[0, 1], {2, 300}];
{ns03, ns04, ns05} = {{1, 1}, {-1, 1}, {-1, 50}}.{ns01, ns02};
Correlation[{ns01, ns02, ns03, ns04, ns05} // Transpose] // MatrixForm

Figure 1: Scatter Plots as Correlation Diagnostic

We can do some quick calculations to see what correlations and covariances we should have found in our exercise. When we considered two independent N(0, 1) variables n1 and n2, we should have found zero covariance. When we then set n3 = n1 + n2, we should have found

$$E\,n_3 = E\,n_1 + E\,n_2 = 0$$
$$\mathrm{var}\,n_3 = E(n_3 - 0)^2 = E(n_1^2 + n_2^2 + 2 n_1 n_2) = \sigma_1^2 + \sigma_2^2 + 0 = 2$$
$$\mathrm{cov}(n_1, n_3) = E(n_1^2 + n_1 n_2) = \sigma_1^2 + 0 = 1$$
$$\mathrm{cor}(n_1, n_3) = \frac{1}{\sqrt{2}}$$

    When we check this out, we get very close (but of course our results are not exact).

    1.4 Correlation and Linear Regression

    The correlation coefficient, ρ ∈ [−1, 1], is the most common measure of linear relat-

    edness in two series. If all the data points lie on a line, then correlation is perfect

    (and ρ = 1 or ρ = −1). While two variables must be correlated for them to appear

    related in a linear regression, the regression coefficient is not simply the correlation

    coefficient. The correlation coefficient is symmetrical: since it deflates the covariance

    by the product of the standard deviations of the two variables, it does not matter

    whether you consider the correlation of X with Y or that of Y with X. In contrast,

    the regression coefficient deflates the covariance by the sample variance of the RHS

    variable. So if you regress Y on X (and a constant) you get an estimated slope of

$$\hat{\beta} = \frac{\widehat{\mathrm{cov}}(X, Y)}{s_X^2} \qquad (10)$$

The asymmetry comes because the least squares estimator minimizes the average

    squared deviation of the LHS variable from the fitted line. Thus regressing X on Y

    does not simply invert the parameter estimate. (Indeed, if the variables are completely

    uncorrelated the regression coefficient should be zero in both regressions.)

    Let us quickly confirm this by example, as it will be of some interest when we

    discuss cointegrating regressions.

    Exercise 5 (Inverse Regression)

Here is a quick reminder that it matters for the estimated relationship what we put on the LHS. Look at the results of regressing n3 on n1 (constructed as above), including a constant in the regression. Then look at the results of regressing n1 on n3, again including a constant. Compare the two slopes and the correlation coefficient.

# Inverse Regression (Python)
# using n1 and n3 from above
import math
import numpy
# ls is assumed to be the least-squares helper used in these notes,
# providing OLS(y, x) with estimated coefficients in .coefs
n31reg = ls.OLS(n3, n1)
n13reg = ls.OLS(n1, n3)
b31 = n31reg.coefs[0]
b13 = n13reg.coefs[0]
corr13 = numpy.corrcoef([n1, n3])[0, 1]
print("n31reg, correlation, n13reg")
print(b31, corr13, b13)
print("Make slopes based on correlation and variances")
v1 = n1.var()
v3 = n3.var()
print(corr13 * math.sqrt(v1 / v3))
print(corr13 * math.sqrt(v3 / v1))
print()
print(n3.mean() - b31 * n1.mean())
print(n31reg)
print(n1.mean() - b13 * n3.mean())
print(n13reg)

Since the variance of n3 is twice the variance of n1, we should get a slope about half the size when n3 is our RHS variable. We do. And here is one other take on this:

    you can think of the regression coefficient as the correlation coefficient adjusted for

    the relative variability of the RHS variable. That is, for the regression of Y on X

$$\hat{\beta} = \widehat{\mathrm{cor}}(X, Y)\,\frac{s_Y}{s_X} \qquad (11)$$

    So the correlation coefficient will always fall between the regression and inverse re-

    gression coefficients, as you can verify from your example.

    2 Difference Equations

    Up to now we have been reviewing some fairly familiar statistical concepts. As we

    move toward time series modeling, we need to add some familiarity with difference

    equations.

    For the most part, time series techniques have been concerned with linear dif-

    ference equations with constant coefficients. That will be our focus. A difference

    equation is simply an expression of the current value of a variable as a function of its

    past values and some other forcing process. The applications are manifold.

    An example:

    We might expect speculation in the stock market to keep today’s prices very close to

    tomorrow’s expected prices. With daily data and assuming expectations are pretty

    good, we might then find the random walk model

    Pt+1 = Pt + et+1 (12)

    to be a reasonable model of the evolution of stock prices. All changes are unantic-

    ipated, and captured in the innovation et+1. Note that we can also represent this

process as

    ∆Pt = et (13)

    This suggests one test of our model would be running the regression

    ∆Pt = a0 + a1Pt−1 + et (14)

and testing the hypothesis that a0 = a1 = 0. We will have lots to say about this later.

    When we talk about difference equations, we want to be able to speak about series

    that take on real values at each point in time. We will use the notation {yt} to refer

    to these series, where t is allowed to take on any value from the infinite past to the

    infinite future.

    The order of a difference equation is the largest difference in time subscripts. Thus

our random walk model is a first-order difference equation. A p-th order difference

    equation can therefore be written as

    yt = a0 + a1yt−1 + · · ·+ apyt−p + xt (15)

    where xt is an arbitrary “forcing process”. A solution to a difference equation ex-

    presses yt just in terms of xt and t and, to pin down a unique path for the process,

    some constraints on its location (e.g., initial conditions). The solution to a differ-

    ence equation is therefore a function that tells us how yt evolves over time. First

order difference equations have particularly simple dynamics, which we can easily demonstrate.

    Exercise 6 (Deterministic and Stochastic Difference Equations)

We will use an arbitrary initial value of 15, with various positive rates of decay. (What

    happens if we give negative values to ρ?)

Figure 2: Deterministic and Stochastic Difference Equations: AR(1)

    EViews (Difference Equations)

    ’Difference Equations

    ’pick File, New, Program

    new workfile temp u 50 ’a new undated workfile with 50 obs

    series y0=0 ’y0 is just a "place holder"

    y0(1)=15

    series yd ’the "deterministic" series

    series ys ’the "stochastic" series

    rndseed 20

    series e=nrnd ’the stochastic "shocks"

    scalar rho ’the "autoregressive" parameter

    %mygraphs=""

    for !ct=0 to 3

    yd=y0

    ys=y0

    rho=0.9+(!ct)/20

    smpl @first+1 @last

    yd=rho*yd(-1)

    ys=rho*ys(-1)+e

    smpl @all

    group yg yd ys

    freeze(graph{!ct}) yg.line

    %header ="rho="+@str(rho)

    graph{!ct}.addtext(0.5,2) %header

%mygraphs=%mygraphs+" graph"+@str(!ct)

    next

    graph g4.merge %mygraphs

    show g4

    ’pick SaveAs and give it a title

    ’pick Run

    GAUSS

    new;library pgraph;graphset;begwind;window(2,2,1);

    y0=5;

    rho=.1~.4~.7~1;

    rndnseed=20;

    e=rndns(150,1,rndnseed);

    yd=recserar(zeros(150,4),y0~y0~y0~y0,rho);

    ys=recserar(e~e~e~e,y0~y0~y0~y0,rho);

    myseq=seqa(1,1,150);

    ct=1;

    do while ct

reset
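For readers without EViews or GAUSS, here is a rough Python counterpart to the EViews program above (numpy and matplotlib are assumed; it uses the same initial value of 15 and the same values of rho).

import numpy as np
import matplotlib.pyplot as plt

nobs, y0 = 50, 15.0
rng = np.random.default_rng(20)
e = rng.standard_normal(nobs)                 # the stochastic "shocks"

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, rho in zip(axes.flat, (0.90, 0.95, 1.00, 1.05)):
    yd = np.empty(nobs)                       # deterministic: yd[t] = rho*yd[t-1]
    ys = np.empty(nobs)                       # stochastic:    ys[t] = rho*ys[t-1] + e[t]
    yd[0] = ys[0] = y0
    for t in range(1, nobs):
        yd[t] = rho * yd[t - 1]
        ys[t] = rho * ys[t - 1] + e[t]
    ax.plot(yd, label="deterministic")
    ax.plot(ys, label="stochastic")
    ax.set_title("rho=" + str(rho))
ax.legend()
plt.tight_layout()
plt.show()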


3 Stochastic Process

    A stochastic process represents the probabilistic evolution of a system over time.

It is also called a time-series process. Contrast this with a deterministic system.

    For example, a system of linear difference equations may represent a deterministic

    dynamical system. If we add a stochastic forcing function, it becomes a stochastic

    process. We find a link between our stability conditions in the deterministic case and

    the conditions for stationarity in the stochastic case.

    A stochastic process X(t, ω) is a random variable for each admissible t ∈ T , where

    T is an index set we think of as “time.” That is, a stochastic process is just an ordered

    sequence of random variables, and the natural order for us will be time.

    A time series is a sample from a stochastic process. When we plot a time series

    X(t, ω), conceptually we are holding ω constant and allowing t to vary. The resulting

    function of t is called a realization or sample function.

Given a finite set of random variables $X = \{X_{t_1}, X_{t_2}, \ldots, X_{t_n}\}$, the joint distribution function is

$$F_{X_{t_1},\ldots,X_{t_n}}(x_{t_1},\ldots,x_{t_n}) = P\{\omega \mid X(t_1,\omega) \le x_{t_1}, \ldots, X(t_n,\omega) \le x_{t_n}\} \qquad (16)$$

    3.1 Stationarity

    We call a time series strictly stationary if

$$F_{X_{t_1},\ldots,X_{t_n}}(x_{t_1},\ldots,x_{t_n}) = F_{X_{t_1+h},\ldots,X_{t_n+h}}(x_{t_1},\ldots,x_{t_n}) \qquad (17)$$

    So the distribution function is the same at each time, and the joint distribution is

determined only by separation in time and not by the absolute date. So, for example, if the expected value or variance is defined at any point in time, it must take the same value at every point in time.

We generally work with a weaker notion of stationarity. We say that a time series is covariance stationary if

- Xt has a constant expected value, which we generally treat as 0 for convenience;
- the covariance matrix of $(X_{t_1}, X_{t_2}, \ldots, X_{t_n})$ depends only on the distances between observations and not on the absolute date (so it is identical to the covariance matrix of $(X_{t_1+h}, X_{t_2+h}, \ldots, X_{t_n+h})$).

    So a stochastic process Xt is covariance stationary if it has a time-independent finite

    mean and time-independent finite auto-covariances. That is, for all t, s, and j,

$$E(X_t) = E(X_{t-j})$$
$$E[(X_t - \mu)(X_{t-s} - \mu)] = E[(X_{t-j} - \mu)(X_{t-j-s} - \mu)]$$

    This allows us to write the autocovariance of Xt and Xt+h as a function only of

    the distance between the two observations. Letting the mean be zero for notational

    simplicity:

    cov(Xt, Xt+h) = E{XtXt+h} = γ(h) (18)
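Given a single realization, γ(h) is typically estimated by averaging products of deviations h periods apart. A minimal Python sketch (numpy assumed; dividing by T is the usual convention):

import numpy as np

def autocovariance(x, h):
    # sample estimate of gamma(h) = cov(X_t, X_{t+h}) for a stationary series
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return (d[:len(d) - h] * d[h:]).sum() / len(d)

rng = np.random.default_rng(0)
eps = rng.standard_normal(500)
# gamma(0) is the variance; for independent draws the other autocovariances are near zero
print([round(autocovariance(eps, h), 3) for h in range(4)])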

    3.2 White Noise

    We build our first time-series models out of white-noise processes. A particularly

simple stochastic process is white noise. A stochastic process εt is a white-noise

    process if it is characterized by

- zero mean;
- constant variance;
- no serial correlation: $E\{\varepsilon_t \varepsilon_{t-h}\} = 0$ for all $h \neq 0$ and all $t$.

Figure 3: White Noise (Coin Flip)

    White noise is clearly a stationary stochastic process. The time-independent mean

    and autocovariances are part of its definition. The lack of serial correlation is just

another way of noting that the autocovariances are zero at any nonzero lag.

    As an example of how we might build up stochastic processes from white noise,

let us suppose you play a coin-toss game in which you win $1 for a head and lose $1 for a tail. Assigning the values (1, -1) to (H, T), and assuming a fair coin, we

    sample the process and plot the resulting time series.

nobs = 200;
xs = RandomChoice[{-1, 1}, nobs];
ListLinePlot[xs, Frame -> True]

    Visually, Figure 3 looks like white noise: mean zero with no serial correlation.

    Later we will consider how to test that.


3.3 Moving Average Process

A simple MA(q) process is built up from a white-noise process by averaging the current shock with the previous q shocks. Let $\{\varepsilon_t\}$ be white noise, and let

$$x_t = \sum_{j=0}^{q} \theta_j \varepsilon_{t-j} \qquad (19)$$

When an MA process is estimated, we typically adopt the normalization θ0 = 1. At

    the moment, however, we simply want to consider how the “average shock” over the

    last few periods is changing over time. Think of this as a way of considering how

    limited memory might affect one’s estimate of the mean of a process.

    First of all, notice that an MA(q) process will clearly display serial correlation.

    (For example, xt and xt−1 are both affected by εt−1.) So even though we are building

    up our MA process from white noise, it will not be a white-noise process.

Let us return to our coin-flipping example. We can look for "hot streaks" in a simple moving average of the most recent winnings.

ListLinePlot[MovingAverage[xs, 7], Frame -> True]

    Even though there is no serial correlation in the simple coin flip outcomes, we

    clearly observe serial correlation in the moving average.
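A quick Python check of this claim (numpy assumed): the lag-one sample autocorrelation of the raw flips is near zero, while that of their 7-term moving average is large (6/7 in theory for equal weights).

import numpy as np

def lag1_autocorr(x):
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return (d[:-1] * d[1:]).sum() / (d ** 2).sum()

rng = np.random.default_rng(0)
flips = rng.choice([-1, 1], size=200)                    # the coin-flip series
ma7 = np.convolve(flips, np.ones(7) / 7, mode="valid")   # 7-term moving average
print(lag1_autocorr(flips), lag1_autocorr(ma7))          # roughly 0 versus roughly 6/7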

    3.4 AR Process

An AR(p) process depends directly on its own past values. Let $\{\varepsilon_t\}$ be white noise. We can build up an AR(p) process from a white-noise process in the following way:

$$x_t = \sum_{i=1}^{p} \alpha_i x_{t-i} + \varepsilon_t \qquad (20)$$

Figure 4: Hot Streaks (MA(7) of Coin Flipping Outcomes)

    Usually we express this more succinctly as

    α(L)xt = εt (21)

    where

$$\alpha(L) = 1 - \sum_{i=1}^{p} \alpha_i L^i \qquad (22)$$

    This is just a linear difference equation with a white-noise forcing function. Ques-

    tions of stability are therefore unchanged from the analysis of difference equations, but

they now become questions of the stationarity of the process. When we ask whether

    an ARMA process is “stationary”, we are just raising the question of stability of the

    difference equation.

    Just as a stable difference equation implies very different behavior from a similar

    unstable difference equation, a stationary stochastic process implies very different

behavior from a non-stationary stochastic process.

Figure 5: Deterministic and Stochastic AR(1) with Additive Shocks

Consider the following three AR processes:

    xt = 0.99xt−1 + εt (23)

    xt = 1.00xt−1 + εt (24)

    xt = 1.01xt−1 + εt (25)

    where εt is white noise. If we ignore the stochastic term, we have deterministic

    processes. We easily recognize that (given a nonzero initial value) the first case

    converges to zero, the second case is constant, and the last case is explosive. In the

    presence of the stochastic term, these differences persist, but some new considerations

    arise.

    Macroeconomic time series tend to be highly autocorrelated, but usually they do

not appear explosive. This has focused attention on the question of whether ρ is

    almost 1 or actually equals 1.

    3.5 Stationary Case

    We will focus on the simplest AR(1) model:

    yt = a1yt−1 + εt (26)

    where εt is a white-noise process. In this section we focus on the case 0 ≤ a1 < 1,

    which ensures stationarity. In this case, y has an unconditional mean of 0 and variance

of $\sigma_\varepsilon^2/(1-a_1^2)$. To see this, repeatedly substitute for lagged y to get the moving average representation

$$y_t = \sum_{i=0}^{\infty} a_1^i \varepsilon_{t-i} \qquad (27)$$

    Suppose we have a time-series of observations on y, (y1, . . . , yT ). Then we could

    produce an OLS estimate of a1 as

$$\hat{a}_1 = \frac{\sum_{t=2}^{T} y_t y_{t-1}}{\sum_{t=2}^{T} y_{t-1}^2} \qquad (28)$$

    Substituting from (26) for yt we get

$$\hat{a}_1 = \frac{a_1 \sum_{t=2}^{T} y_{t-1}^2 + \sum_{t=2}^{T} \varepsilon_t y_{t-1}}{\sum_{t=2}^{T} y_{t-1}^2} = a_1 + \frac{\sum_{t=2}^{T} \varepsilon_t y_{t-1}}{\sum_{t=2}^{T} y_{t-1}^2} \qquad (29)$$

    Since the term on the right has nonzero expectation, our estimator â1 is biased

    (Keele and Kelly, 2006). This is just the standard problem of lagged dependent

variables in an OLS regression. But if we consider its probability limit we get

$$\operatorname*{plim}_{T\to\infty} \hat{a}_1 = a_1 + \frac{\operatorname*{plim}_{T\to\infty} \frac{1}{T}\sum_{t=2}^{T} \varepsilon_t y_{t-1}}{\operatorname*{plim}_{T\to\infty} \frac{1}{T}\sum_{t=2}^{T} y_{t-1}^2} \qquad (30)$$

Our assumption 0 ≤ a1 < 1 assures stationarity and thus a finite value for the

    denominator. The numerator is the mean of products with zero expected value, so

    we find â1 is a consistent estimator of a1.
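A small Monte Carlo sketch of this result (Python with numpy assumed, true a1 = 0.9): the OLS estimate is biased downward in short samples but approaches the true value as T grows.

import numpy as np

def ols_a1(y):
    # OLS slope from regressing y_t on y_{t-1}, as in equation (28)
    return (y[1:] * y[:-1]).sum() / (y[:-1] ** 2).sum()

def simulate_ar1(a1, T, rng):
    y = np.zeros(T)
    eps = rng.standard_normal(T)
    for t in range(1, T):
        y[t] = a1 * y[t - 1] + eps[t]
    return y

rng = np.random.default_rng(0)
for T in (25, 100, 1000):
    estimates = [ols_a1(simulate_ar1(0.9, T, rng)) for _ in range(2000)]
    print(T, np.mean(estimates))   # the mean estimate rises toward 0.9 as T grows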

    3.6 Half-Life to Convergence

    The (stochastic) steady state of our simple AR(1) is 0. Suppose we are at the steady

    state and then shocked. How long do we expect it to take us to get half-way back to

the steady state? Equivalently, given y0, how long do we expect it to take to get

    to y0/2?

For t > 0, we will have $y_t = a_1^t y_0 + \sum_{i=0}^{t-1} a_1^i \varepsilon_{t-i}$. Since the expected value of future shocks is zero,

$$E_0\, y_t = a_1^t y_0 \qquad (31)$$

So the half-life question is asking: when is

$$a_1^t = 1/2,$$
$$t \ln(a_1) = -\ln(2),$$
$$t = -\ln(2)/\ln(a_1). \qquad (32)$$
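For example, with a1 = 0.9 equation (32) gives a half-life of about 6.6 periods, and with a1 = 0.99 about 69 periods. A one-line check in Python:

import math

def half_life(a1):
    # periods until an initial deviation is expected to decay to half its size
    return -math.log(2) / math.log(a1)

print(half_life(0.9), half_life(0.99))   # roughly 6.6 and 69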

    4 Unit Root Tests

    Dickey and Fuller (1979) present Monte Carlo results allowing a test of the null

hypothesis that γ = 0 in the regression (82). They found that the critical values for the test of γ = 0 varied not only with the sample size but also differed between the trend-stationary and difference-stationary models.

    The first approach to testing for unit roots relies on critical values tabulated by

    David Dickey and reproduced in Wayne Fuller (1976). Suppose we have a model

    yt = a1yt−1 + et (33)

    where et is white noise. Rewrite this as

    ∆yt = γyt−1 + et (34)

    where γ = a1−1. The hypothesis that a1 = 1 is thus the same as the hypothesis that

    γ = 0. We can estimate γ using OLS.

    Problem: under the null hypothesis

    H0 : ρ = 1 (35)

    you are regressing a stationary variable on a non-stationary variable. The non-

    stationarity of yt under the null hypothesis implies that the standard t-ratio does

not have the Student's t distribution. Instead, we must use specially tabulated critical values, yielding what are often called τ tests. For example, consider the following table

    of critical values for the one-sided test of H0 : γ = 0 versus H1 : γ < 0:

For example, consider a sample of 50 observations. The 5% and 1% critical values for a one-tailed t-test are approximately -1.68 and -2.41. As the sample size increases, these approach the standard normal critical values of -1.65 and -2.33. Comparing these with the τ values in Table 1, we see a tendency to evaluate the null hypothesis incorrectly if we use the t or standard normal tables.


Model: ∆y_t = γ y_{t−1} + ε_t,   y_0 = 0
DGP:   ∆y_t = ε_t,   y_0 = 0

              1%               5%               10%
#Obs       t       τ        t       τ        t       τ
  25     -2.49   -2.66    -1.71   -1.95    -1.32   -1.60
  50     -2.41   -2.62    -1.68   -1.95    -1.30   -1.61
 100     -2.37   -2.60    -1.66   -1.95    -1.29   -1.61
  ∞      -2.33   -2.58    -1.65   -1.95    -1.28   -1.62

Table 1: Tabulated τ and t One-Sided Critical Values

    Two examples will make this point.

    First, suppose we estimate γ and get a t-statistic of 1.00. Applying the t or

    standard normal critical values, we would not reject the null hypothesis of a unit root

    against the explosive alternative that ρ > 1. Yet using the tabulated critical values,

    we would reject.

    Second, suppose we get a t-statistic of -2.5. We would reject the unit root using

    the t or normal critical values, but we would not reject using the tabulated values.
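A minimal sketch of the test regression (34) in Python (numpy assumed): estimate γ by OLS, form the usual t-ratio, and compare it with the τ column of Table 1 rather than with the Student's t table. Packaged routines, such as adfuller in statsmodels, automate this and the augmented variants discussed later.

import numpy as np

def df_tau(y):
    # no-constant Dickey-Fuller regression: delta y_t = gamma * y_{t-1} + e_t;
    # returns (gamma_hat, t_ratio); compare the t_ratio with the tau critical values
    y = np.asarray(y, dtype=float)
    dy, ylag = np.diff(y), y[:-1]
    gamma = (ylag * dy).sum() / (ylag ** 2).sum()
    resid = dy - gamma * ylag
    s2 = (resid ** 2).sum() / (len(dy) - 1)       # residual variance (one parameter estimated)
    se = np.sqrt(s2 / (ylag ** 2).sum())
    return gamma, gamma / se

rng = np.random.default_rng(0)
rw = np.cumsum(rng.standard_normal(100))          # a driftless random walk: true gamma = 0
print(df_tau(rw))   # under the null, the t-ratio falls below -1.95 only about 5% of the time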

    4.1 Some Perspective

    When we say a variable follows a random walk, especially if it is a variable we are

    interested in, we are offering a statement of our ignorance. We are saying that we

    have no more reason to believe that it will change in one direction than in the other.

    It may sound like a solid conclusion to say that we cannot reject the hypothesis that

    a variable follows a random walk. But as Frankel (1990, ch.6 of OER) has noted, we

    could just as well say that after studying the variable we have “absolutely nothing to

    say that would help to predict its movements.”

Second, note that the most popular tests for unit roots ask whether we can reject the hypothesis that the series has a unit root. If not, then the practice is to continue our analysis under the assumption of a unit root. In contrast with traditional econometric

    practice where we sought statistical significance as a basis for our conclusions, finding

    nothing is now satisfactory. As Frankel has noted, it is easier to find nothing than to

    find something.

    4.2 Deterministic Regressors

    A problem with (34) is that it is so restrictive. We are testing the null hypothesis of a

    random walk without drift against a simple first-order autoregressive process without

    a constant term. This does not encompass many applications of economic interest.

    Many series seem to display a trend, and we care about determining whether this

    is an artifact or part of the nature of the series. So most applications of unit root

    tests include these possibilities.

    One of the interesting aspects of unit root tests is that allowing for these possibil-

    ities changes the critical values, even though the underlying DGP is unchanged. This

is clear in Table 2.

DGP: ∆y_t = ε_t,   y_0 = 0

Model:                ∆y_t = γ y_{t−1} + ε_t    ∆y_t = a_0 + γ y_{t−1} + ε_t    ∆y_t = a_0 + γ y_{t−1} + a_2 t + ε_t
Test statistic:       τ                          τ_µ                             τ_τ
5% critical values:   -1.95 / +0.91              -2.93 / -0.03                   -3.50 / -0.87
1% critical values:   -2.62 / +2.08              -3.58 / +0.66                   -4.15 / -0.24

Table 2: Tabulated Critical Values for 50 Observations
