Professor: Alan G. Isaac, American University
Applied Time-Series: A Very Short Introduction
April 4, 2018
Contents
1 Description
  1.1 Normal Distribution
  1.2 Skewness and Kurtosis
  1.3 Correlation
    1.3.1 Sample Correlation
  1.4 Correlation and Linear Regression
2 Difference Equations
3 Stochastic Process
  3.1 Stationarity
  3.2 White Noise
  3.3 Moving Average Process
  3.4 AR Process
  3.5 Stationary Case
  3.6 Half-Life to Convergence
4 Unit Root Tests
  4.1 Some Perspective
  4.2 Deterministic Regressors
5 Augmented Dickey-Fuller Tests
  5.1 Lag Length Selection
  5.2 Nelson and Plosser
  5.3 Selection of Deterministic Regressors
    5.3.1 Phillips-Perron Test
  5.4 Two Roots
  5.5 ARMA Processes
    5.5.1 Stationarity and Stability
    5.5.2 The Role of the MA Term
    5.5.3 The Autocorrelation Function
  5.6 Model Selection Criteria
    5.6.1 Finite Prediction Error (FPE)
    5.6.2 Information Criteria
  5.7 Diagnostics
    5.7.1 Likelihood Ratio Test
6 Spurious Regression
7 Testing for Unit Roots
8 Cointegration
    8.0.1 A Multivariate Generalization
9 The Engle-Granger Cointegration Test
10 Error Correction
  10.1 Vector Error Correction Models
11 The Johansen Cointegration Test
12 Structural Breaks
Bibliography
These notes are extremely preliminary. Please report typos and other problems.
1 Description
1.1 Normal Distribution
A prominent example of a continuous distribution is the Gaussian or Normal
distribution. Let Y be a normally distributed random variable with mean µ and variance
σ². By definition, its probability density function (pdf) is
$$p_Y(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{(y-\mu)^2}{2\sigma^2} \right\} \tag{1}$$
A standard shorthand for this is Y ∼ N(µ, σ²). If we think of $p_Y(y)$ as telling us
roughly how likely we are to see Y near y, then we can see that a normally distributed
variable is most likely to take on values near the mean µ. We also see that when the
variance σ² is large, the “penalty” for a given deviation from the mean is reduced.
Exercise 1 (Descriptive Statistics 1)
Generate 300 independent realizations of a standard normal random variable and
examine the sample properties: the number of observations, the minimum and max-
imum values, the mean, and the variance.
The standard normal distribution has mean of zero and variance of unity. If X
is the random variable, we write X ∼ N(0, 1), where the arguments give the mean
and the variance. Now multiply each realization of your first series by 2 to produce
a new series. Look at your descriptive statistics. What happens to the mean? What
happens to the variance?
xs = RandomVariate[NormalDistribution[0, 1], 300];
Through[{Min, Max, Mean, Variance, Length}[xs]]
Through[{Min, Max, Mean, Variance, Length}[2*xs]]
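As a cross-check on the exercise, here is a minimal Python sketch of the same experiment, using only the standard library. The seed, the variable names, and the `describe` helper are our own illustrative choices and are not part of the exercise code above.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, only so that the run is repeatable

# 300 independent draws from N(0, 1), and the same draws multiplied by 2
xs = [random.gauss(0.0, 1.0) for _ in range(300)]
xs2 = [2.0 * x for x in xs]


def describe(series):
    """Return (n, min, max, mean, sample variance) for a list of floats."""
    return (
        len(series),
        min(series),
        max(series),
        statistics.fmean(series),
        statistics.variance(series),  # divides by n - 1
    )


stats_x = describe(xs)
stats_2x = describe(xs2)
print("x :", stats_x)
print("2x:", stats_2x)
```

Whatever draws we happen to get, doubling each realization doubles the sample mean and multiplies the sample variance by four.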
1.2 Skewness and Kurtosis
We often use a sample mean and variance as estimates of the true (“population”)
mean and variance. The “third moment” about the sample mean,
$\frac{1}{T}\sum_t (X_t - \bar{X})^3$, might seem like a natural place to look for
asymmetry in the distribution. Instead we use the closely related concept of sample
skewness:
$$S(X) = \frac{T^{1/2} \sum_t (X_t - \bar{X})^3}{\left( \sum_t (X_t - \bar{X})^2 \right)^{3/2}} = \frac{1}{T} \sum_t (X_t - \bar{X})^3 / s^3 \tag{2}$$
where s is based on the biased estimator for the variance (Bickel and Doksum 1977,
p.388). The skewness of a symmetric distribution, such as the normal distribution, is
zero, so S(X) should be near zero for a sample from a symmetric distribution. If the
upper tail of the distribution is thicker, S(X) > 0. If the lower tail of the
distribution is thicker, S(X) < 0.
The fourth sample moment about the mean would seem, like the second, to offer
an indication of dispersion. It does, but it puts even more weight on observations far
from the mean. We take advantage of this to ask whether the sample is relatively
peaked with thick tails (leptokurtic) or flat with thin tails (platykurtic). The
statistic we use is the fourth standardized moment, known as sample kurtosis:
$$K(X) = \frac{T \sum_t (X_t - \bar{X})^4}{\left( \sum_t (X_t - \bar{X})^2 \right)^{2}} = \frac{1}{T} \sum_t (X_t - \bar{X})^4 / s^4 \tag{3}$$
where again s is based on the biased estimator for the variance (Bickel and Doksum
1977, p.388). Kurtosis measures the peakedness or flatness of the distribution of the
series. K = 3 for the normal distribution. Distributions that are more peaked and
have thicker tails will have greater kurtosis. Distributions that are less peaked and
have thinner tails will have lower kurtosis. If the kurtosis exceeds 3, the distribution
is peaked (leptokurtic) relative to the normal; if the kurtosis is less than 3, the
distribution is flat (platykurtic) relative to the normal.
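Equations (2) and (3) translate almost line for line into code. The following Python sketch (standard library only) is one way to compute them; the function names, the seed, and the sample size of 5000 are our own assumptions, chosen only for illustration.

```python
import random


def central_moment(xs, k):
    """k-th sample moment about the sample mean, dividing by T (biased form)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** k for x in xs) / n


def skewness(xs):
    """Sample skewness S(X) as in equation (2): third moment over s^3."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def kurtosis(xs):
    """Sample kurtosis K(X) as in equation (3): fourth moment over s^4."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2


random.seed(1)  # arbitrary seed for repeatability
normal_draws = [random.gauss(0.0, 1.0) for _ in range(5000)]
uniform_draws = [random.random() for _ in range(5000)]

print("normal :", skewness(normal_draws), kurtosis(normal_draws))
print("uniform:", skewness(uniform_draws), kurtosis(uniform_draws))
```

Both statistics are scale free, so multiplying a series by 2 leaves them unchanged. The uniform distribution has a population kurtosis of 1.8, well below the normal value of 3.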
Exercise 2 (Descriptive Statistics 2)
Produce the skewness and kurtosis of your two normal series above. What do you
notice about these two statistics?
Through[{Skewness, Kurtosis}[xs]]
Through[{Skewness, Kurtosis}[2*xs]]
Exercise 3 (Descriptive Statistics 3)
Now let us conduct the same examinations for a uniform random variable.
We notice a clear difference in the CDF plots for the two distributions, and we
see the relative peakedness of the normal distribution in its higher sample kurtosis.
This observation suggests that we might test whether our sample is generated by
draws from a normal distribution. Informally, we might look at whether the mean
and median are approximately equal. We could also check for S(X)=0 and K(X)=3
to be approximately satisfied. Finally, we can turn to formal tests.
The Jarque-Bera statistic allows a simple test for normality:
$$JB(X) = \frac{T}{6} \left[ S^2(X) + \frac{[K(X) - 3]^2}{4} \right] \tag{4}$$
The test statistic summarizes the deviation of the skewness and kurtosis of the
series from those of the normal distribution. If the series is drawn from a normal
distribution, we should find JB(X) near zero. Under the null hypothesis of a normal
distribution, the Jarque-Bera statistic is asymptotically distributed as χ² with 2
degrees of freedom. The resulting p-value is the probability under the null that a
Jarque-Bera statistic exceeds the observed value; a small probability value leads to the
rejection of the null hypothesis of a normal distribution.
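A minimal Python sketch of the test in equation (4) follows, using only the standard library. It relies on the fact that the upper-tail probability of a χ² variable with 2 degrees of freedom is exp(-x/2), so no statistics package is needed. The function name, seed, and sample sizes are our own illustrative assumptions.

```python
import math
import random


def jarque_bera(xs):
    """Return (JB statistic, asymptotic p-value) for a list of floats."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square with 2 degrees of freedom
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


random.seed(2)  # arbitrary seed for repeatability
normal_draws = [random.gauss(0.0, 1.0) for _ in range(2000)]
uniform_draws = [random.random() for _ in range(2000)]

jb_normal, p_normal = jarque_bera(normal_draws)
jb_uniform, p_uniform = jarque_bera(uniform_draws)
print("normal :", jb_normal, p_normal)
print("uniform:", jb_uniform, p_uniform)
```

With a sample of this size the uniform draws should be rejected decisively, since their kurtosis is far from 3.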
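If the series we are testing is itself generated (for example, as the residuals from an
estimated model), we need an adjustment (k) for the number of estimated coefficients
used to create the series.
$$JB(X) = \frac{T - k}{6} \left[ S^2(X) + \frac{[K(X) - 3]^2}{4} \right] \tag{5}$$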
(Note: Eviews 3.0 did not adjust the JB statistic on its residuals as it should.)
1.3 Correlation
We are often interested in the joint distribution of random variables. Two r.v.s are
independent when the value that one takes does not depend on the value the other
takes. Consider the following tables of probabilities for two discrete r.v.s X and Y :
        x1     x2
  y1    0.4    0.1
  y2    0.2    0.3

        x1     x2
  y1    0.2    0.3
  y2    0.2    0.3

In each table, all the probabilities add up to one. In the first table, if we are told
Y = y1 then we judge X = x1 four times as likely as X = x2, while if we know
Y = y2 then we find it only two-thirds as likely. In the second table, the probability
of X = x1 is 40% no matter what we know about Y.
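The conditional probabilities quoted above can be checked mechanically. The Python sketch below stores each table as a list of rows (one row per value of Y) and tests independence by comparing each joint probability with the product of its marginals. The function names and the tolerance are our own assumptions.

```python
# Joint probability tables: rows are y1, y2; columns are x1, x2
table_dependent = [[0.4, 0.1], [0.2, 0.3]]
table_independent = [[0.2, 0.3], [0.2, 0.3]]


def conditional_x_given_y(table):
    """Return rows of P(X = x_j | Y = y_i), one row for each y_i."""
    return [[p / sum(row) for p in row] for row in table]


def is_independent(table, tol=1e-12):
    """True if every joint probability equals the product of its marginals."""
    row_marginals = [sum(row) for row in table]
    col_marginals = [sum(col) for col in zip(*table)]
    return all(
        abs(table[i][j] - row_marginals[i] * col_marginals[j]) <= tol
        for i in range(len(row_marginals))
        for j in range(len(col_marginals))
    )


for name, table in (("first", table_dependent), ("second", table_independent)):
    print(name, conditional_x_given_y(table), is_independent(table))
```

In the first table the conditional distribution of X changes with Y, so the variables are dependent; in the second it does not.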
The first table is the case that interests us here: X and Y tend to vary together
rather than independently. This introduces the notion of the covariance between two
random variables:
$$\operatorname{cov}(X, Y) = E\left[ (X - EX)(Y - EY) \right] \tag{6}$$
Covariance is a measure of the linear association between two r.v.s. If X and Y tend
to be above their means together, then the covariance is positive. If one tends to be
high when the other is low, then their covariance is negative.
One problem with covariance as a measure of relatedness is that it depends on
the units of measurement. We therefore more often attend to the correlation between
two variables. Correlation is just normalized covariance. We “deflate” the covariance
between two random variables by the product of their standard deviations. That is,
the correlation coefficient is defined as
$$\operatorname{cor}(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} \tag{7}$$
This deflation or normalization ensures that the measurement is scale free. The
correlation coefficient will always lie between minus one and one.
1.3.1 Sample Correlation
We would like to be able to use the sample correlation as an estimate of the true
correlation between variables.
$$\widehat{\operatorname{cor}}(X, Y) = \frac{\widehat{\operatorname{cov}}(X, Y)}{s_X s_Y} \tag{8}$$
where
$$\widehat{\operatorname{cov}}(X, Y) = \frac{1}{T} \sum_{t=1}^{T} (X_t - \bar{X})(Y_t - \bar{Y}) \tag{9}$$
Unlike our estimate of the variance, our correlation estimate does not depend on the
degrees-of-freedom correction. The covariance estimate is biased, because we do not
accommodate the loss of one degree of freedom implied by estimating the means, but
the same factor affects the normalization and so cancels in the ratio.
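To see the cancellation concretely, the following Python sketch computes the sample correlation of equations (8) and (9) with either divisor, T or T - 1. The `ddof` argument, the seed, and the test series are our own assumptions for illustration.

```python
import math
import random


def sample_corr(xs, ys, ddof=0):
    """Sample correlation using divisor T - ddof for covariance and variances."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    divisor = n - ddof
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / divisor
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / divisor)
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / divisor)
    return cov_xy / (s_x * s_y)


random.seed(3)  # arbitrary seed for repeatability
xs = [random.gauss(0.0, 1.0) for _ in range(300)]
ys = [x + random.gauss(0.0, 1.0) for x in xs]

print(sample_corr(xs, ys, ddof=0))  # divisor T, as in equation (9)
print(sample_corr(xs, ys, ddof=1))  # divisor T - 1
```

The two printed values agree up to floating-point rounding, because the divisor appears once in the numerator and once (as a product of two square roots) in the denominator.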
Exercise 4 (Correlation)
We will generate correlated random variables and look for the correlation in their
scatter plots. Proceed as follows:
• generate two independent series, n1 and n2, each of 300 independent draws from
  a standard normal distribution
• create the following related series: n3 = n2 + n1, n4 = n2 − n1, and n5 = 50n2 − n1.
• determine the theoretical and empirical correlation between these series.
• create scatter plots of ni against n1. Can you see the predicted/actual correlation?
{ns01, ns02} = RandomVariate[NormalDistribution[0, 1], {2, 300}];
{ns03, ns04, ns05} = {{1, 1}, {-1, 1}, {-1, 50}} . {ns01, ns02};
Correlation[{ns01, ns02, ns03, ns04, ns05} // Transpose] // MatrixForm
Figure 1: Scatter Plots as Correlation Diagnostic

We can do some quick calculations to see what correlations and covariances we
should have found in our exercise. When we considered two independent N(0, 1)
variables n1 and n2, we should have found zero covariance. When we then set
n3 = n1 + n2, we should have found
$$\begin{aligned}
E n_3 &= E n_1 + E n_2 = 0 \\
\operatorname{var} n_3 &= E(n_3 - 0)^2 \\
&= E(n_1^2 + n_2^2 + 2 n_1 n_2) \\
&= \sigma_1^2 + \sigma_2^2 + 0 = 2 \\
\operatorname{cov}(n_1, n_3) &= E(n_1^2 + n_1 n_2) = \sigma_1^2 + 0 = 1 \\
\operatorname{cor}(n_1, n_3) &= \frac{1}{\sqrt{2}}
\end{aligned}$$
When we check this out, we get very close (but of course our results are not exact).
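The same calculation works for any series of the form w1·n1 + w2·n2: its covariance with n1 is w1 and its variance is w1² + w2², so the correlation with n1 is w1/√(w1² + w2²). The Python sketch below compares these theoretical values with sample values for n3, n4 and n5. The helper names and the seed are our own assumptions.

```python
import math
import random


def corr(xs, ys):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov_xy / math.sqrt(var_x * var_y)


def theoretical_corr_with_n1(w1, w2):
    """cor(n1, w1*n1 + w2*n2) for independent unit-variance n1 and n2."""
    return w1 / math.sqrt(w1 ** 2 + w2 ** 2)


random.seed(4)  # arbitrary seed for repeatability
n1 = [random.gauss(0.0, 1.0) for _ in range(300)]
n2 = [random.gauss(0.0, 1.0) for _ in range(300)]

# (weight on n1, weight on n2) for each constructed series
weights = {"n3": (1, 1), "n4": (-1, 1), "n5": (-1, 50)}
results = {}
for name, (w1, w2) in weights.items():
    series = [w1 * a + w2 * b for a, b in zip(n1, n2)]
    results[name] = (theoretical_corr_with_n1(w1, w2), corr(n1, series))
    print(name, "theory:", results[name][0], "sample:", results[name][1])
```

Note how weak the correlation of n5 with n1 is: the n1 component is swamped by the much more variable 50·n2 term.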
1.4 Correlation and Linear Regression
The correlation coefficient, ρ ∈ [−1, 1], is the most common measure of linear relat-
edness in two series. If all the data points lie on a line, then correlation is perfect
(and ρ = 1 or ρ = −1). While two variables must be correlated for them to appear
related in a linear regression, the regression coefficient is not simply the correlation
coefficient. The correlation coefficient is symmetrical: since it deflates the covariance
by the product of the standard deviations of the two variables, it does not matter
whether you consider the correlation of X with Y or that of Y with X. In contrast,
the regression coefficient deflates the covariance by the sample variance of the RHS
variable. So if you regress Y on X (and a constant) you get an estimated slope of
$$\hat{\beta} = \frac{\widehat{\operatorname{cov}}(X, Y)}{s_X^2} \tag{10}$$
The asymmetry comes because the least squares estimator minimizes the average
squared deviation of the LHS variable from the fitted line. Thus regressing X on Y
does not simply invert the parameter estimate. (Indeed, if the variables are completely
uncorrelated the regression coefficient should be zero in both regressions.)
Let us quickly confirm this by example, as it will be of some interest when we
discuss cointegrating regressions.
Exercise 5 (Inverse Regression)
Here is a quick reminder that it matters for the estimated relationship what we put
on the LHS. Look at the results of regressing n3 on n1 (constructed as above).
(Include a constant in the regression.) Look at the results of regressing n1 on n3.
(Include a constant in the regression.) Compare the two slopes and the correlation
coefficient.
# Inverse Regression (Python)
# using n1 and n3 from above (as numpy arrays)
# NOTE: `ls` is a least-squares module whose import is not shown in these notes;
# we assume it provides OLS(y, x) with a `coefs` attribute, as used below.
import math
import numpy

n31reg = ls.OLS(n3, n1)   # regress n3 on n1
n13reg = ls.OLS(n1, n3)   # regress n1 on n3
b31 = n31reg.coefs[0]
b13 = n13reg.coefs[0]
corr13 = numpy.corrcoef([n1, n3])[0, 1]
print("n31reg, correlation, n13reg")
print(b31, corr13, b13)
print("Make slopes based on correlation and variances")
v1 = n1.var()
v3 = n3.var()
print(corr13 * math.sqrt(v1 / v3))   # slope of n1 on n3
print(corr13 * math.sqrt(v3 / v1))   # slope of n3 on n1
print()
print(n3.mean() - b31 * n1.mean())   # implied intercept of n3 on n1
print(n31reg)
print(n1.mean() - b13 * n3.mean())   # implied intercept of n1 on n3
print(n13reg)
Since the variance of n3 is twice the variance of n1, we should get a slope about
half the size when n3 is our RHS variable. We do. And here is one other take on this:
you can think of the regression coefficient as the correlation coefficient adjusted for
the relative variability of the RHS variable. That is, for the regression of Y on X
$$\hat{\beta} = \widehat{\operatorname{cor}}(X, Y) \frac{s_Y}{s_X} \tag{11}$$
So the correlation coefficient will always fall between the regression and inverse re-
gression coefficients, as you can verify from your example.
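For readers without the least-squares module used in the exercise, here is a self-contained Python sketch of the same comparison. It uses the fact, from equation (10), that the slope in a regression with a constant is the sample covariance divided by the sample variance of the RHS variable. The helper names and the seed are our own assumptions.

```python
import random


def cov(xs, ys):
    """Sample covariance with divisor T, as in equation (9)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def ols_slope(lhs, rhs):
    """Slope from regressing lhs on rhs and a constant, as in equation (10)."""
    return cov(rhs, lhs) / cov(rhs, rhs)


random.seed(5)  # arbitrary seed for repeatability
n1 = [random.gauss(0.0, 1.0) for _ in range(300)]
n2 = [random.gauss(0.0, 1.0) for _ in range(300)]
n3 = [a + b for a, b in zip(n1, n2)]

b31 = ols_slope(n3, n1)  # regress n3 on n1
b13 = ols_slope(n1, n3)  # regress n1 on n3
r13 = cov(n1, n3) / (cov(n1, n1) * cov(n3, n3)) ** 0.5

print("b31, r13, b13:", b31, r13, b13)
# Equation (11): slope = correlation * (s_LHS / s_RHS)
print("b31 via (11):", r13 * (cov(n3, n3) / cov(n1, n1)) ** 0.5)
print("b31 * b13 and r13 squared:", b31 * b13, r13 ** 2)
```

The product of the two slopes equals the squared correlation, which is why the correlation lies between the two slopes here.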
2 Difference Equations
Up to now we have been reviewing some fairly familiar statistical concepts. As we
move toward time series modeling, we need to add some familiarity with difference
equations.
For the most part, time series techniques have been concerned with linear dif-
ference equations with constant coefficients. That will be our focus. A difference
equation is simply an expression of the current value of a variable as a function of its
past values and some other forcing process. The applications are manifold.
An example:
We might expect speculation in the stock market to keep today’s prices very close to
tomorrow’s expected prices. With daily data and assuming expectations are pretty
good, we might then find the random walk model
$$P_{t+1} = P_t + e_{t+1} \tag{12}$$
to be a reasonable model of the evolution of stock prices. All changes are
unanticipated, and captured in the innovation $e_{t+1}$. Note that we can also represent
this process as
$$\Delta P_t = e_t \tag{13}$$
This suggests one test of our model would be running the regression
$$\Delta P_t = a_0 + a_1 P_{t-1} + e_t \tag{14}$$
and testing the hypothesis that $a_0 = a_1 = 0$. We will have lots to say about this later.
When we talk about difference equations, we want to be able to speak about series
that take on real values at each point in time. We will use the notation {yt} to refer
to these series, where t is allowed to take on any value from the infinite past to the
infinite future.
The order of a difference equation is the largest difference in time subscripts. Thus
our random walk model is a first order difference equation. A p-th order difference
equation can therefore be written as
$$y_t = a_0 + a_1 y_{t-1} + \cdots + a_p y_{t-p} + x_t \tag{15}$$
where $x_t$ is an arbitrary “forcing process”. A solution to a difference equation
expresses $y_t$ just in terms of $x_t$ and t and, to pin down a unique path for the process,
some constraints on its location (e.g., initial conditions). The solution to a
difference equation is therefore a function that tells us how $y_t$ evolves over time. First
order difference equations have particularly simple dynamics, which we can easily
demonstrate.
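As an illustration of what a solution looks like, consider the first order case of (15). Repeated substitution gives y_t = a1^t·y0 + Σ_{i=0}^{t−1} a1^i·(a0 + x_{t−i}), which depends only on the initial condition and the forcing process. The Python sketch below checks this closed form against direct iteration. The function names and the example forcing values are our own assumptions.

```python
def iterate(a0, a1, y0, forcing):
    """Iterate y_t = a0 + a1*y_{t-1} + x_t, where forcing[t-1] holds x_t."""
    path = [y0]
    for x in forcing:
        path.append(a0 + a1 * path[-1] + x)
    return path


def closed_form(a0, a1, y0, forcing, t):
    """Solution y_t = a1^t*y0 + sum_{i=0}^{t-1} a1^i * (a0 + x_{t-i})."""
    return a1 ** t * y0 + sum(
        a1 ** i * (a0 + forcing[t - 1 - i]) for i in range(t)
    )


# Deterministic case: no forcing, initial value 15, decay rate 0.9
zero_forcing = [0.0] * 20
det_path = iterate(0.0, 0.9, 15.0, zero_forcing)

# A case with a constant and an arbitrary forcing sequence
forcing = [1.0, -2.0, 0.5, 0.0, 3.0, -1.0]
forced_path = iterate(0.5, 0.9, 15.0, forcing)

for t in range(len(forcing) + 1):
    print(t, forced_path[t], closed_form(0.5, 0.9, 15.0, forcing, t))
```

With no forcing and no constant, the solution collapses to y_t = a1^t·y0, so the path decays geometrically toward zero whenever |a1| < 1.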
Exercise 6 (Deterministic and Stochastic Difference Equations)
We will use an arbitrary initial value of 15, with various positive rates of decay. (What
happens if we give negative values to ρ?)
Figure 2: Deterministic and Stochastic Difference Equations: AR(1)
EViews (Difference Equations)
'Difference Equations
'pick File, New, Program
new workfile temp u 50 'a new undated workfile with 50 obs
series y0=0 'y0 is just a "place holder"
y0(1)=15
series yd 'the "deterministic" series
series ys 'the "stochastic" series
rndseed 20
series e=nrnd 'the stochastic "shocks"
scalar rho 'the "autoregressive" parameter
%mygraphs=""
for !ct=0 to 3
yd=y0
ys=y0
rho=0.9+(!ct)/20
smpl @first+1 @last
yd=rho*yd(-1)
ys=rho*ys(-1)+e
smpl @all
group yg yd ys
freeze(graph{!ct}) yg.line
%header ="rho="+@str(rho)
graph{!ct}.addtext(0.5,2) %header
%mygraphs=%mygraphs+" graph"+@str(!ct)
next
graph g4.merge %mygraphs
show g4
'pick SaveAs and give it a title
'pick Run
GAUSS
new;library pgraph;graphset;begwind;window(2,2,1);
y0=5;
rho=.1~.4~.7~1;
rndnseed=20;
e=rndns(150,1,rndnseed);
yd=recserar(zeros(150,4),y0~y0~y0~y0,rho);
ys=recserar(e~e~e~e,y0~y0~y0~y0,rho);
myseq=seqa(1,1,150);
ct=1;
do while ct
3 Stochastic Process
A stochastic process represents the probabilistic evolution of a system over time.
It is also called a time-series process. Contrast this with a deterministic system.
For example, a system of linear difference equations may represent a deterministic
dynamical system. If we add a stochastic forcing function, it becomes a stochastic
process. We find a link between our stability conditions in the deterministic case and
the conditions for stationarity in the stochastic case.
A stochastic process X(t, ω) is a random variable for each admissible t ∈ T , where
T is an index set we think of as “time.” That is, a stochastic process is just an ordered
sequence of random variables, and the natural order for us will be time.
A time series is a sample from a stochastic process. When we plot a time series
X(t, ω), conceptually we are holding ω constant and allowing t to vary. The resulting
function of t is called a realization or sample function.
Given a finite set of random variables $X = \{X_{t_1}, X_{t_2}, \ldots, X_{t_n}\}$, the joint
distribution function is
$$F_{X_{t_1}, X_{t_2}, \ldots, X_{t_n}}(x_{t_1}, x_{t_2}, \ldots, x_{t_n}) = P\{\omega \mid X(t_1, \omega) \le x_{t_1}, \ldots, X(t_n, \omega) \le x_{t_n}\} \tag{16}$$
3.1 Stationarity
We call a time series strictly stationary if, for every finite set of dates
$t_1, \ldots, t_n$ and every shift h,
$$F_{X_{t_1}, X_{t_2}, \ldots, X_{t_n}}(x_{t_1}, x_{t_2}, \ldots, x_{t_n}) = F_{X_{t_1+h}, X_{t_2+h}, \ldots, X_{t_n+h}}(x_{t_1}, x_{t_2}, \ldots, x_{t_n}) \tag{17}$$
So the distribution function is the same at each time, and the joint distribution is
determined only by separation in time and not by the absolute date. So for example,
if the expected value or variance are defined at any point in time, they must have
that same value at every point in time.
We generally work with a weaker notion of stationarity. We say that a time series
is covariance stationary if
• $X_t$ has a constant expected value, which we generally treat as 0 for convenience.
• the covariance matrix of $(X_{t_1}, X_{t_2}, \ldots, X_{t_n})$ depends only on the distance
  between observations and not on the absolute date. (So it is identical to the
  covariance matrix of $(X_{t_1+h}, X_{t_2+h}, \ldots, X_{t_n+h})$.)
So a stochastic process $X_t$ is covariance stationary if it has a time-independent finite
mean and time-independent finite auto-covariances. That is, for all t, s, and j,
$$E(X_t) = E(X_{t-j}) = \mu$$
$$E[(X_t - \mu)(X_{t-s} - \mu)] = E[(X_{t-j} - \mu)(X_{t-j-s} - \mu)]$$
This allows us to write the autocovariance of $X_t$ and $X_{t+h}$ as a function only of
the distance between the two observations. Letting the mean be zero for notational
simplicity:
$$\operatorname{cov}(X_t, X_{t+h}) = E\{X_t X_{t+h}\} = \gamma(h) \tag{18}$$
3.2 White Noise
We build our first time-series models out of white-noise processes. A particularly
simple stochastic process is white noise. A stochastic process $\varepsilon_t$ is a white-noise
process if it is characterized by
• zero mean
• constant variance
• no serial correlation: $E\{\varepsilon_t \varepsilon_{t-h}\} = 0 \quad \forall h \ne 0, t$

Figure 3: White Noise (Coin Flip)
White noise is clearly a stationary stochastic process. The time-independent mean
and autocovariances are part of its definition. The lack of serial correlation is just
another way of noting that the autocovariances are zero at any lag.
As an example of how we might build up stochastic processes from white noise,
let us suppose that you play a coin toss game where you win $1 for a head and
lose $1 for a tail. Assigning the values (1,-1) for (H,T), and assuming a fair coin, we
sample the process and plot the resulting time series.
nobs = 200;
xs = RandomChoice[{-1, 1}, nobs];
ListLinePlot[xs, Frame -> True]
Visually, Figure 3 looks like white noise: mean zero with no serial correlation.
Later we will consider how to test that.
3.3 Moving Average Process
A simple MA(q) process is built up from a white noise process by averaging the
current shock with the previous q shocks. Let $\{\varepsilon_t\}_{t=1}^{T}$ be white noise, and let
$$x_t = \sum_{j=0}^{q} \theta_j \varepsilon_{t-j} \tag{19}$$
When an MA process is estimated, we typically adopt the normalization $\theta_0 = 1$. At
the moment, however, we simply want to consider how the “average shock” over the
last few periods is changing over time. Think of this as a way of considering how
limited memory might affect one’s estimate of the mean of a process.
First of all, notice that an MA(q) process will clearly display serial correlation.
(For example, xt and xt−1 are both affected by εt−1.) So even though we are building
up our MA process from white noise, it will not be a white-noise process.
Let us return to our coin-flipping example. We can look for “hot streaks” in a
simple moving average of most recent winnings.
ListLinePlot[MovingAverage[xs, 7], Frame -> True]
Even though there is no serial correlation in the simple coin flip outcomes, we
clearly observe serial correlation in the moving average.
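We can put numbers on this observation. The Python sketch below computes the lag-1 sample autocorrelation of the raw coin flips and of their 7-period moving average. For an equally weighted average of 7 uncorrelated shocks, adjacent values share 6 of their 7 terms, so the theoretical lag-1 autocorrelation is 6/7. The helper names, the seed, and the longer sample of 2000 flips are our own assumptions.

```python
import random


def autocorr(xs, lag):
    """Sample autocorrelation of xs at the given positive lag."""
    n = len(xs)
    mean = sum(xs) / n
    denom = sum((x - mean) ** 2 for x in xs)
    numer = sum((xs[t] - mean) * (xs[t - lag] - mean) for t in range(lag, n))
    return numer / denom


def moving_average(xs, window):
    """Equally weighted averages of each run of `window` consecutive values."""
    return [sum(xs[i:i + window]) / window for i in range(len(xs) - window + 1)]


random.seed(6)  # arbitrary seed for repeatability
flips = [random.choice((-1, 1)) for _ in range(2000)]
streaks = moving_average(flips, 7)

print("lag-1 autocorrelation of flips  :", autocorr(flips, 1))
print("lag-1 autocorrelation of MA(7)  :", autocorr(streaks, 1))
print("theoretical value for the MA(7) :", 6 / 7)
```

The flips themselves show essentially no serial correlation, while the moving average is strongly autocorrelated purely by construction.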
3.4 AR Process
An AR(p) process depends directly on its own past values. Let $\{\varepsilon_t\}_{t=1}^{T}$ be white noise.
We can build up an AR(p) process from a white noise process in the following way.
$$x_t = \sum_{i=1}^{p} \alpha_i x_{t-i} + \varepsilon_t \tag{20}$$

Figure 4: Hot Streaks (MA(7) of Coin Flipping Outcomes)

Usually we express this more succinctly as
$$\alpha(L) x_t = \varepsilon_t \tag{21}$$
where
$$\alpha(L) = 1 - \sum_{i=1}^{p} \alpha_i L^i \tag{22}$$
This is just a linear difference equation with a white-noise forcing function.
Questions of stability are therefore unchanged from the analysis of difference equations, but
they now become questions of the stationarity of the process. When we ask whether
an ARMA process is “stationary”, we are just raising the question of stability of the
difference equation.
Just as a stable difference equation implies very different behavior from a similar
unstable difference equation, a stationary stochastic process implies very different
behavior from a non-stationary stochastic process. Consider the following three AR
processes:
$$x_t = 0.99 x_{t-1} + \varepsilon_t \tag{23}$$
$$x_t = 1.00 x_{t-1} + \varepsilon_t \tag{24}$$
$$x_t = 1.01 x_{t-1} + \varepsilon_t \tag{25}$$

Figure 5: Deterministic and Stochastic AR(1) with Additive Shocks (panels: rho=0.99, rho=1.0, rho=1.01)
where εt is white noise. If we ignore the stochastic term, we have deterministic
processes. We easily recognize that (given a nonzero initial value) the first case
converges to zero, the second case is constant, and the last case is explosive. In the
presence of the stochastic term, these differences persist, but some new considerations
arise.
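A small simulation makes the three cases concrete. The Python sketch below iterates each process for 100 periods from an initial value of 1, once with no shocks and once with a common set of Gaussian shocks. The initial value, the shock standard deviation of 0.05, the seed, and the helper name are our own assumptions and are not taken from the figure.

```python
import random


def ar1_path(rho, x0, shocks):
    """Iterate x_t = rho*x_{t-1} + e_t from x0 over the given shocks."""
    path = [x0]
    for e in shocks:
        path.append(rho * path[-1] + e)
    return path


n_steps = 100
no_shocks = [0.0] * n_steps
random.seed(7)  # arbitrary seed for repeatability
shocks = [random.gauss(0.0, 0.05) for _ in range(n_steps)]

deterministic = {}
stochastic = {}
for rho in (0.99, 1.0, 1.01):
    deterministic[rho] = ar1_path(rho, 1.0, no_shocks)
    stochastic[rho] = ar1_path(rho, 1.0, shocks)
    print(rho, deterministic[rho][-1], stochastic[rho][-1])
```

Without shocks the end points are 0.99^100 (a little over a third of the starting value), exactly 1, and 1.01^100 (about 2.7 times the starting value), which shows how much a 1% difference in the root matters over 100 periods.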
Macroeconomic time series tend to be highly autocorrelated, but usually they do
not appear explosive. This has focused attention on the question of whether ρ is
almost 1 or actually equals 1.
3.5 Stationary Case
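We will focus on the simplest AR(1) model:
$$y_t = a_1 y_{t-1} + \varepsilon_t \tag{26}$$
where $\varepsilon_t$ is a white-noise process. In this section we focus on the case $0 \le a_1 < 1$,
which ensures stationarity. In this case, y has an unconditional mean of 0 and variance
of $\sigma_\varepsilon^2 / (1 - a_1^2)$. To see this, repeatedly substitute for lagged y to get the moving
average representation
$$y_t = \sum_{i=0}^{\infty} a_1^i \varepsilon_{t-i} \tag{27}$$
Since the shocks have mean zero and are uncorrelated, $E y_t = 0$ and
$\operatorname{var}(y_t) = \sigma_\varepsilon^2 \sum_{i=0}^{\infty} a_1^{2i} = \sigma_\varepsilon^2 / (1 - a_1^2)$.
Suppose we have a time-series of observations on y, $(y_1, \ldots, y_T)$. Then we could
produce an OLS estimate of $a_1$ as
$$\hat{a}_1 = \frac{\sum_{t=2}^{T} y_t y_{t-1}}{\sum_{t=2}^{T} y_{t-1}^2} \tag{28}$$
Substituting from (26) for $y_t$ we get
$$\hat{a}_1 = \frac{a_1 \sum_{t=2}^{T} y_{t-1}^2 + \sum_{t=2}^{T} \varepsilon_t y_{t-1}}{\sum_{t=2}^{T} y_{t-1}^2} = a_1 + \frac{\sum_{t=2}^{T} \varepsilon_t y_{t-1}}{\sum_{t=2}^{T} y_{t-1}^2} \tag{29}$$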
Since the last term on the right has nonzero expectation, our estimator $\hat{a}_1$ is biased
(Keele and Kelly, 2006). This is just the standard problem of lagged dependent
variables in an OLS regression. But if we consider its probability limit we get
$$\operatorname*{plim}_{T \to \infty} \hat{a}_1 = a_1 + \frac{\operatorname*{plim}_{T \to \infty} \frac{1}{T} \sum_{t=2}^{T} \varepsilon_t y_{t-1}}{\operatorname*{plim}_{T \to \infty} \frac{1}{T} \sum_{t=2}^{T} y_{t-1}^2} \tag{30}$$
Our assumption $0 \le a_1 < 1$ assures stationarity and thus a finite, nonzero value for
the denominator. The numerator is the mean of products with zero expected value, so
we find $\hat{a}_1$ is a consistent estimator of $a_1$.
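Both properties show up in a small Monte Carlo experiment. The Python sketch below applies the estimator in (28) to many short samples, where the average estimate falls below the true value, and to one long sample, where the estimate is close to the true value. The choices a1 = 0.5, standard normal shocks, y0 = 0, the sample sizes, and the seed are all our own assumptions.

```python
import random


def simulate_ar1(a1, n_obs, rng):
    """Simulate y_t = a1*y_{t-1} + e_t with y_0 = 0 and N(0, 1) shocks."""
    y = [0.0]
    for _ in range(n_obs):
        y.append(a1 * y[-1] + rng.gauss(0.0, 1.0))
    return y


def ols_ar1(y):
    """OLS estimate of a1 without a constant, as in equation (28)."""
    numer = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    denom = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return numer / denom


rng = random.Random(8)  # arbitrary seed for repeatability
a1 = 0.5

# Many short samples: the average estimate reveals the small-sample bias
small_estimates = [ols_ar1(simulate_ar1(a1, 25, rng)) for _ in range(2000)]
mean_small = sum(small_estimates) / len(small_estimates)

# One long sample: the estimate should be close to the true value
long_sample = simulate_ar1(a1, 20000, rng)
big_estimate = ols_ar1(long_sample)
long_variance = sum(v * v for v in long_sample) / len(long_sample)

print("mean estimate, T = 25   :", mean_small)
print("estimate, T = 20000     :", big_estimate)
print("sample variance of y    :", long_variance)
print("sigma^2 / (1 - a1^2)    :", 1.0 / (1.0 - a1 ** 2))
```

The last two lines compare the sample variance of the long series with the unconditional variance formula given above.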
3.6 Half-Life to Convergence
The (stochastic) steady state of our simple AR(1) is 0. Suppose we are at the steady
state and then shocked. How long do we expect it to take us to get half-way back to
the steady state? Equivalently, given $y_0$, how long do we expect it to take to get
to $y_0/2$?
For t > 0, we will have $y_t = a_1^t y_0 + \sum_{i=0}^{t-1} a_1^i \varepsilon_{t-i}$. Since the expected value of
future shocks is zero,
$$E_0 y_t = a_1^t y_0 \tag{31}$$
So the half life question is asking, when is
$$\begin{aligned}
a_1^t &= 1/2 \\
t \ln(a_1) &= -\ln(2) \\
t &= -\ln(2) / \ln(a_1)
\end{aligned} \tag{32}$$
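Equation (32) is a one-line calculation. The Python sketch below wraps it in a small function so that we can see how quickly the half-life grows as a1 approaches one. The function name and the example values of a1 are our own assumptions.

```python
import math


def half_life(a1):
    """Periods until the expected deviation is halved, from equation (32)."""
    if not 0.0 < a1 < 1.0:
        raise ValueError("half-life is defined here only for 0 < a1 < 1")
    return -math.log(2.0) / math.log(a1)


for a1 in (0.5, 0.9, 0.99):
    print(a1, half_life(a1))
```

With a1 = 0.5 the half-life is exactly one period, while with a1 = 0.99 it is roughly 69 periods, which is one way to see why highly persistent series are hard to distinguish from unit-root processes in short samples.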
4 Unit Root Tests
Dickey and Fuller (1979) present Monte Carlo results allowing a test of the null
hypothesis that γ = 0 in the regression (82). They found that the critical values for
the test of γ = 0 varied not only with sample size but also were different for the trend
stationary and difference stationary models.
The first approach to testing for unit roots relies on critical values tabulated by
David Dickey and reproduced in Wayne Fuller (1976). Suppose we have a model
$$y_t = a_1 y_{t-1} + e_t \tag{33}$$
where $e_t$ is white noise. Rewrite this as
$$\Delta y_t = \gamma y_{t-1} + e_t \tag{34}$$
where $\gamma = a_1 - 1$. The hypothesis that $a_1 = 1$ is thus the same as the hypothesis that
$\gamma = 0$. We can estimate $\gamma$ using OLS.
Problem: under the null hypothesis
$$H_0 : \rho = 1 \tag{35}$$
(where ρ denotes the autoregressive coefficient $a_1$) you are regressing a stationary
variable on a non-stationary variable. The non-stationarity of $y_t$ under the null
hypothesis implies that the standard t-ratio does not have the Student’s t distribution.
Instead, we must use tabulated critical values, yielding what are often called τ tests.
Table 1 gives critical values for the one-sided test of $H_0 : \gamma = 0$ versus $H_1 : \gamma < 0$.
For example, consider a sample of 50 observations. The 5% and 1% critical values for
a one-tailed t-test are approximately -1.68 and -2.41. As the sample size increases,
these approach the standard normal critical values of -1.65 and -2.33. The tabulated τ
values are more negative, so there is a tendency to reject the unit-root null too often
if we use the t or standard normal tables.
Model: ∆y_t = γ y_{t-1} + ε_t,  y_0 = 0
DGP:   ∆y_t = ε_t,  y_0 = 0

              1%               5%               10%
#Obs       t      τ         t      τ         t      τ
25       -2.49  -2.66     -1.71  -1.95     -1.32  -1.60
50       -2.41  -2.62     -1.68  -1.95     -1.30  -1.61
100      -2.37  -2.60     -1.66  -1.95     -1.29  -1.61
∞        -2.33  -2.58     -1.65  -1.95     -1.28  -1.62

Table 1: Tabulated τ and t One-Sided Critical Values
Two examples will make this point.
First, suppose we estimate γ and get a t-statistic of 1.00. Applying the t or
standard normal critical values, we would not reject the null hypothesis of a unit root
against the explosive alternative that ρ > 1. Yet using the tabulated critical values,
we would reject.
Second, suppose we get a t-statistic of -2.5 with 50 observations. At the 1% level we
would reject the unit root using the t or normal critical values, but we would not
reject using the tabulated values.
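The tabulated values come from simulation, and a rough version of the experiment is easy to reproduce. The Python sketch below generates random walks under the null, estimates the no-constant regression (34) on each, and reads off the 5% quantile of the resulting t-ratios. It is only a sketch in the spirit of the Dickey-Fuller tabulation: the Gaussian shocks, the 5000 replications, the seed, and the function names are our own assumptions.

```python
import math
import random


def df_t_ratio(y):
    """t-ratio on gamma in dy_t = gamma*y_{t-1} + e_t (no constant)."""
    dy = [y[t] - y[t - 1] for t in range(1, len(y))]
    ylag = y[:-1]
    sum_sq = sum(v * v for v in ylag)
    gamma = sum(d * v for d, v in zip(dy, ylag)) / sum_sq
    resid = [d - gamma * v for d, v in zip(dy, ylag)]
    s2 = sum(r * r for r in resid) / (len(dy) - 1)  # one estimated coefficient
    return gamma / math.sqrt(s2 / sum_sq)


def simulate_tau(n_obs, n_reps, rng):
    """Sorted t-ratios from random walks with y_0 = 0 and N(0, 1) shocks."""
    ratios = []
    for _ in range(n_reps):
        y = [0.0]
        for _ in range(n_obs):
            y.append(y[-1] + rng.gauss(0.0, 1.0))
        ratios.append(df_t_ratio(y))
    ratios.sort()
    return ratios


rng = random.Random(9)  # arbitrary seed for repeatability
taus = simulate_tau(50, 5000, rng)
q05 = taus[int(0.05 * len(taus))]
print("simulated 5% critical value, 50 observations:", q05)
```

The simulated quantile should land near the tabulated τ value of -1.95 rather than the Student's t value of -1.68, up to Monte Carlo error.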
4.1 Some Perspective
When we say a variable follows a random walk, especially if it is a variable we are
interested in, we are offering a statement of our ignorance. We are saying that we
have no more reason to believe that it will change in one direction than in the other.
It may sound like a solid conclusion to say that we cannot reject the hypothesis that
a variable follows a random walk. But as Frankel (1990, ch.6 of OER) has noted, we
could just as well say that after studying the variable we have “absolutely nothing to
say that would help to predict its movements.”
Second, note that the most popular tests for unit roots ask whether we can reject
the hypothesis that the series has a unit root. If not, then practice is to continue our
analysis under the assumption of a unit root. In contrast with traditional econometric
practice where we sought statistical significance as a basis for our conclusions, finding
nothing is now satisfactory. As Frankel has noted, it is easier to find nothing than to
find something.
4.2 Deterministic Regressors
A problem with (34) is that it is so restrictive. We are testing the null hypothesis of a
random walk without drift against a simple first-order autoregressive process without
a constant term. This does not encompass many applications of economic interest.
Many series seem to display a trend, and we care about determining whether this
is an artifact or part of the nature of the series. So most applications of unit root
tests include these possibilities.
One of the interesting aspects of unit root tests is that allowing for these
possibilities changes the critical values, even though the underlying DGP is unchanged.
This is clear in Table 2.

DGP: ∆y_t = ε_t,  y_0 = 0

Model (∆y_t =)        γ y_{t-1} + ε_t    a_0 + γ y_{t-1} + ε_t    a_0 + γ y_{t-1} + a_2 t + ε_t
Test Statistic        τ                  τ_µ                      τ_τ
5% Critical Values    -1.95/+0.91        -2.93/-0.03              -3.50/-0.87
1% Critical Values    -2.62/+2.08        -3.58/+0.66              -4.15/-0.24

Table 2: Tabulated Critical Values for 50 Observations