
Applied Time-Series: A Very Short Introduction

Professor: Alan G. Isaac

April 4, 2018

Contents

1 Description
  1.1 Normal Distribution
  1.2 Skewness and Kurtosis
  1.3 Correlation
    1.3.1 Sample Correlation
  1.4 Correlation and Linear Regression

2 Difference Equations

3 Stochastic Process
  3.1 Stationarity
  3.2 White Noise
  3.3 Moving Average Process
  3.4 AR Process
  3.5 Stationary Case
  3.6 Half-Life to Convergence

4 Unit Root Tests
  4.1 Some Perspective
  4.2 Deterministic Regressors

5 Augmented Dickey-Fuller Tests
  5.1 Lag Length Selection
  5.2 Nelson and Plosser
  5.3 Selection of Deterministic Regressors
    5.3.1 Phillips-Perron Test
  5.4 Two Roots
  5.5 ARMA Processes
    5.5.1 Stationarity and Stability
    5.5.2 The Role of the MA Term
    5.5.3 The Autocorrelation Function
  5.6 Model Selection Criteria
    5.6.1 Finite Prediction Error (FPE)
    5.6.2 Information Criteria
  5.7 Diagnostics
    5.7.1 Likelihood Ratio Test

6 Spurious Regression

7 Testing for Unit Roots

8 Cointegration
  8.0.1 A Multivariate Generalization

9 The Engle-Granger Cointegration Test

10 Error Correction
  10.1 Vector Error Correction Models

11 The Johansen Cointegration Test

12 Structural Breaks

Bibliography

These notes are extremely preliminary. Please report typos and other problems.

    1 Description

    1.1 Normal Distribution

A prominent example of a continuous distribution is the Gaussian or Normal distribution. Let Y be a normally distributed random variable with mean µ and variance σ². By definition, its probability density function (pdf) is

$$p_Y(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{(y-\mu)^2}{2\sigma^2} \right\} \qquad (1)$$

A standard shorthand for this is Y ∼ N(µ, σ²). If we think of p_Y(y) as telling us roughly how likely we are to see Y near y, then we see that a normally distributed random variable is most likely to take on values near its mean µ. We also see that when the variance σ² is large, the "penalty" for a given deviation from the mean is reduced.

    Exercise 1 (Descriptive Statistics 1)

    Generate 300 independent realizations of a standard normal random variable and

    examine the sample properties: the number of observations, the minimum and max-

    imum values, the mean, and the variance.

    The standard normal distribution has mean of zero and variance of unity. If X

    is the random variable, we write X ∼ N(0, 1), where the arguments give the mean

    and the variance. Now multiply each realization of your first series by 2 to produce

    a new series. Look at your descriptive statistics. What happens to the mean? What

    happens to the variance?

xs = RandomVariate[NormalDistribution[0, 1], 300];
Through[{Min, Max, Mean, Variance, Length}[xs]]
Through[{Min, Max, Mean, Variance, Length}[2*xs]]
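A minimal Python counterpart to this Mathematica snippet may be helpful; it assumes only numpy, and the seed and variable names are illustrative. Doubling each realization leaves the sample mean essentially unchanged but multiplies the sample variance by four.

import numpy as np

rng = np.random.default_rng(0)          # seed chosen only for reproducibility
xs = rng.standard_normal(300)           # 300 draws from N(0, 1)

for series in (xs, 2 * xs):
    # number of observations, min, max, mean, and (biased) variance
    print(len(series), series.min(), series.max(),
          series.mean(), series.var(ddof=0))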

    1.2 Skewness and Kurtosis

We often use a sample mean and variance as estimates of the true ("population") mean and variance. The third moment about the sample mean, $\frac{1}{T}\sum_t (X_t - \bar{X})^3$, might seem like a natural place to look for asymmetry in the distribution. Instead we use the closely related concept of sample skewness:

$$S(X) = \frac{T^{1/2}\sum_t (X_t - \bar{X})^3}{\left(\sum_t (X_t - \bar{X})^2\right)^{3/2}} = \frac{1}{T}\sum_t (X_t - \bar{X})^3 \big/ s^3 \qquad (2)$$

where s is based on the biased estimator of the variance (Bickel and Doksum 1977, p.388). The skewness of a symmetric distribution, such as the normal distribution, is zero, so S(X) should be near zero for samples from a symmetric distribution. If the upper tail of the distribution is thicker, S(X) > 0. If the lower tail of the distribution is thicker, S(X) < 0.

The fourth sample moment about the mean would seem, like the second, to offer an indication of dispersion. It does, but it puts even more weight on observations far from the mean. We take advantage of this to ask whether the sample is relatively peaked and heavy-tailed (leptokurtic) or relatively flat and thin-tailed (platykurtic). The statistic we use is the fourth standardized moment, known as sample kurtosis:

$$K(X) = \frac{T\sum_t (X_t - \bar{X})^4}{\left(\sum_t (X_t - \bar{X})^2\right)^{2}} = \frac{1}{T}\sum_t (X_t - \bar{X})^4 \big/ s^4 \qquad (3)$$

where again s is based on the biased estimator of the variance (Bickel and Doksum 1977, p.388). Kurtosis measures the peakedness of the distribution of the series; K = 3 for the normal distribution. Distributions that are more peaked and have thicker tails have greater kurtosis; distributions that are less peaked and have thinner tails have smaller kurtosis. If the kurtosis exceeds 3, the distribution is peaked (leptokurtic) relative to the normal; if the kurtosis is less than 3, the distribution is flat (platykurtic) relative to the normal.
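Equations (2) and (3) are straightforward to compute directly. Here is a minimal Python sketch (numpy assumed), using the biased 1/T variance estimator as in the text; note that both statistics are unchanged when the series is rescaled.

import numpy as np

def sample_skewness(x):
    d = np.asarray(x, dtype=float) - np.mean(x)
    s = np.sqrt((d ** 2).mean())        # biased (1/T) standard deviation
    return (d ** 3).mean() / s ** 3     # equation (2)

def sample_kurtosis(x):
    d = np.asarray(x, dtype=float) - np.mean(x)
    s = np.sqrt((d ** 2).mean())
    return (d ** 4).mean() / s ** 4     # equation (3); about 3 for normal samples

rng = np.random.default_rng(0)
xs = rng.standard_normal(300)
print(sample_skewness(xs), sample_kurtosis(xs))
print(sample_skewness(2 * xs), sample_kurtosis(2 * xs))   # unchanged by rescaling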

    Exercise 2 (Descriptive Statistics 2)

Produce the skewness and kurtosis of your two normal series above. What do you notice about these two statistics?

Through[{Skewness, Kurtosis}[xs]]
Through[{Skewness, Kurtosis}[2*xs]]

    Exercise 3 (Descriptive Statistics 3)

    Now let us conduct the same examinations for a uniform random variable.

    We notice a clear difference in the CDF plots for the two distributions, and we

    see the relative peakedness of the normal distribution in its higher sample kurtosis.

    This observation suggests that we might test whether our sample is generated by

    draws from a normal distribution. Informally, we might look at whether the mean

and median are approximately equal. We could also check whether S(X) ≈ 0 and K(X) ≈ 3 hold approximately. Finally, we can turn to formal tests.

    The Jarque-Bera statistic allows a simple test for normality:

$$JB(X) = \frac{T}{6}\left[ S^2(X) + \frac{[K(X)-3]^2}{4} \right] \qquad (4)$$

The test statistic summarizes the deviation of the skewness and kurtosis of the series from those of the normal distribution. If the series is drawn from a normal distribution, we should find JB(X) near zero. Under the null hypothesis of a normal distribution, the Jarque-Bera statistic is distributed as χ² with 2 degrees of freedom. The resulting p-value is the probability under the null that a Jarque-Bera statistic exceeds the observed value; a small p-value leads to rejection of the null hypothesis of a normal distribution.

If the series we are testing is itself constructed from estimated coefficients (for example, regression residuals), we need an adjustment k for the number of estimated coefficients used to create the series:

$$JB(X) = \frac{T-k}{6}\left[ S^2(X) + \frac{[K(X)-3]^2}{4} \right] \qquad (5)$$

    (Note: Eviews 3.0 did not adjust the JB statistic on its residuals as it should.)
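As a sketch of equations (4) and (5) in Python, scipy is assumed for the χ²(2) distribution and for skewness and kurtosis computed with the biased 1/T conventions used above; the helper name jarque_bera is ours.

import numpy as np
from scipy.stats import chi2, kurtosis, skew

def jarque_bera(x, k=0):
    # JB statistic of equation (5) and its chi-square(2) p-value;
    # k is the number of estimated coefficients used to create the series (0 for raw data)
    x = np.asarray(x, dtype=float)
    T = x.size
    S = skew(x, bias=True)                         # matches equation (2)
    K = kurtosis(x, fisher=False, bias=True)       # matches equation (3), not excess kurtosis
    jb = (T - k) / 6.0 * (S ** 2 + (K - 3.0) ** 2 / 4.0)
    return jb, chi2.sf(jb, df=2)                   # p-value under the null of normality

rng = np.random.default_rng(0)
print(jarque_bera(rng.standard_normal(300)))       # normal draws: typically a large p-value
print(jarque_bera(rng.uniform(-1, 1, size=300)))   # thin-tailed uniform: typically rejects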

    1.3 Correlation

    We are often interested in the joint distribution of random variables. Two r.v.s are

    independent when the value that one takes does not depend on the value the other

    takes. Consider the following tables of probabilities for two discrete r.v.s X and Y :

First table:              Second table:

        x1    x2                  x1    x2
  y1   0.4   0.1            y1   0.2   0.3
  y2   0.2   0.3            y2   0.2   0.3

    In each table, all the probabilities add up to one. In the first table, if we are told

    Y = y1 then we judge X = x1 four times as likely as X = x2, while if we know

Y = y2 then we find it only two-thirds as likely. In the second table, the probability of X = x1 is 0.4 no matter what we know about Y.

The first table is the case that interests us here: X and Y tend to vary together rather than independently. This introduces the notion of the covariance between two random variables:

    cov(X, Y ) = E(X − EX)(Y − EY ) (6)

    Covariance is a measure of the linear association between two r.v.s. If X and Y tend

    to be above their means together, then the covariance is positive. If one tends to be

    high when the other is low, then their covariance is negative.

One problem with covariance as a measure of relatedness is that it depends on the units of measurement. We therefore more often attend to the correlation between two variables. Correlation is just normalized covariance: we "deflate" the covariance between two random variables by the product of their standard deviations. That is, the correlation coefficient is defined as

$$\mathrm{cor}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} \qquad (7)$$

This deflation or normalization ensures that the measurement is scale free. The correlation coefficient always lies between −1 and 1.

    1.3.1 Sample Correlation

    We would like to be able to use the sample correlation as an estimate of the true

    correlation between variables.

$$\widehat{\mathrm{cor}}(X, Y) = \frac{\widehat{\mathrm{cov}}(X, Y)}{s_X s_Y} \qquad (8)$$

where

$$\widehat{\mathrm{cov}}(X, Y) = \frac{1}{T}\sum_{t=1}^{T} (X_t - \bar{X})(Y_t - \bar{Y}) \qquad (9)$$

Our covariance estimate in (9) is biased: dividing by T does not account for the degree of freedom lost in estimating the means. The sample correlation, however, is unaffected by this choice, because the same divisor appears in the covariance and in both standard deviations and so cancels in the ratio.
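A quick numerical check of this cancellation, assuming numpy: whichever divisor we use, the same factor appears in the covariance and in both standard deviations, so the estimated correlation is identical.

import numpy as np

def corr(x, y, ddof):
    # the divisor (T - ddof) is used for the covariance and for both variances
    cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - ddof)
    return cov / np.sqrt(x.var(ddof=ddof) * y.var(ddof=ddof))

rng = np.random.default_rng(0)
x = rng.standard_normal(300)
y = x + rng.standard_normal(300)
print(corr(x, y, ddof=0), corr(x, y, ddof=1))   # identical
print(np.corrcoef(x, y)[0, 1])                  # numpy agrees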

    Exercise 4 (Correlation)

    We will generate correlated random variables and look for the correlation in their

    scatter plots. Proceed as follows:

- generate two independent series, n1 and n2, each of 300 independent draws from a standard normal distribution;
- create the following related series: n3 = n2 + n1, n4 = n2 − n1, and n5 = 50n2 − n1;
- determine the theoretical and empirical correlations between these series;
- create scatter plots of each ni against n1. Can you see the predicted/actual correlation?

{ns01, ns02} = RandomVariate[NormalDistribution[0, 1], {2, 300}];
{ns03, ns04, ns05} = {{1, 1}, {-1, 1}, {-1, 50}}.{ns01, ns02};
Correlation[{ns01, ns02, ns03, ns04, ns05} // Transpose] // MatrixForm

Figure 1: Scatter Plots as Correlation Diagnostic

We can do some quick calculations to see what correlations and covariances we should have found in our exercise. When we considered two independent N(0, 1) variables n1 and n2, we should have found zero covariance. When we then set n3 = n1 + n2, we should have found

$$E\,n_3 = E\,n_1 + E\,n_2 = 0$$
$$\mathrm{var}\,n_3 = E(n_3 - 0)^2 = E(n_1^2 + n_2^2 + 2 n_1 n_2) = \sigma_1^2 + \sigma_2^2 + 0 = 2$$
$$\mathrm{cov}(n_1, n_3) = E(n_1^2 + n_1 n_2) = \sigma_1^2 + 0 = 1$$
$$\mathrm{cor}(n_1, n_3) = \frac{1}{\sqrt{2}}$$

    When we check this out, we get very close (but of course our results are not exact).

    1.4 Correlation and Linear Regression

    The correlation coefficient, ρ ∈ [−1, 1], is the most common measure of linear relat-

    edness in two series. If all the data points lie on a line, then correlation is perfect

    (and ρ = 1 or ρ = −1). While two variables must be correlated for them to appear

    related in a linear regression, the regression coefficient is not simply the correlation

    coefficient. The correlation coefficient is symmetrical: since it deflates the covariance

    by the product of the standard deviations of the two variables, it does not matter

    whether you consider the correlation of X with Y or that of Y with X. In contrast,

    the regression coefficient deflates the covariance by the sample variance of the RHS

    variable. So if you regress Y on X (and a constant) you get an estimated slope of

$$\hat{\beta} = \frac{\widehat{\mathrm{cov}}(X, Y)}{s_X^2} \qquad (10)$$

The asymmetry comes because the least squares estimator minimizes the average

    squared deviation of the LHS variable from the fitted line. Thus regressing X on Y

    does not simply invert the parameter estimate. (Indeed, if the variables are completely

    uncorrelated the regression coefficient should be zero in both regressions.)

    Let us quickly confirm this by example, as it will be of some interest when we

    discuss cointegrating regressions.

    Exercise 5 (Inverse Regression)

Here is a quick reminder that it matters for the estimated relationship what we put on the LHS. Look at the results of regressing n3 on n1 (constructed as above), including a constant in the regression. Then look at the results of regressing n1 on n3, again including a constant. Compare the two slopes and the correlation coefficient.

# Inverse Regression (Python)
# using n1 and n3 from above
import math
import numpy
# ls is assumed to be the least-squares helper used in these notes,
# providing OLS(y, x) with estimated coefficients in .coefs
n31reg = ls.OLS(n3, n1)
n13reg = ls.OLS(n1, n3)
b31 = n31reg.coefs[0]
b13 = n13reg.coefs[0]
corr13 = numpy.corrcoef([n1, n3])[0, 1]
print("n31reg, correlation, n13reg")
print(b31, corr13, b13)
print("Make slopes based on correlation and variances")
v1 = n1.var()
v3 = n3.var()
print(corr13 * math.sqrt(v1 / v3))
print(corr13 * math.sqrt(v3 / v1))
print()
print(n3.mean() - b31 * n1.mean())
print(n31reg)
print(n1.mean() - b13 * n3.mean())
print(n13reg)

Since the variance of n3 is twice the variance of n1, we should get a slope about half the size when n3 is our RHS variable. We do. And here is one other take on this:

    you can think of the regression coefficient as the correlation coefficient adjusted for

    the relative variability of the RHS variable. That is, for the regression of Y on X

$$\hat{\beta} = \widehat{\mathrm{cor}}(X, Y)\,\frac{s_Y}{s_X} \qquad (11)$$

    So the correlation coefficient will always fall between the regression and inverse re-

    gression coefficients, as you can verify from your example.

    2 Difference Equations

    Up to now we have been reviewing some fairly familiar statistical concepts. As we

    move toward time series modeling, we need to add some familiarity with difference

    equations.

    For the most part, time series techniques have been concerned with linear dif-

    ference equations with constant coefficients. That will be our focus. A difference

    equation is simply an expression of the current value of a variable as a function of its

    past values and some other forcing process. The applications are manifold.

    An example:

    We might expect speculation in the stock market to keep today’s prices very close to

    tomorrow’s expected prices. With daily data and assuming expectations are pretty

    good, we might then find the random walk model

    Pt+1 = Pt + et+1 (12)

    to be a reasonable model of the evolution of stock prices. All changes are unantic-

    ipated, and captured in the innovation et+1. Note that we can also represent this

process as

    ∆Pt = et (13)

    This suggests one test of our model would be running the regression

    ∆Pt = a0 + a1Pt−1 + et (14)

and testing the hypothesis that a0 = a1 = 0. We will have lots to say about this later.

    When we talk about difference equations, we want to be able to speak about series

    that take on real values at each point in time. We will use the notation {yt} to refer

    to these series, where t is allowed to take on any value from the infinite past to the

    infinite future.

    The order of a difference equation is the largest difference in time subscripts. Thus

our random walk model is a first-order difference equation. A p-th order difference

    equation can therefore be written as

    yt = a0 + a1yt−1 + · · ·+ apyt−p + xt (15)

    where xt is an arbitrary “forcing process”. A solution to a difference equation ex-

    presses yt just in terms of xt and t and, to pin down a unique path for the process,

    some constraints on its location (e.g., initial conditions). The solution to a differ-

    ence equation is therefore a function that tells us how yt evolves over time. First

order difference equations have particularly simple dynamics, which we can easily demonstrate.

    Exercise 6 (Deterministic and Stochastic Difference Equations)

We will use an arbitrary initial value of 15, with various positive rates of decay. (What

    happens if we give negative values to ρ?)

Figure 2: Deterministic and Stochastic Difference Equations: AR(1)

    EViews (Difference Equations)

    ’Difference Equations

    ’pick File, New, Program

    new workfile temp u 50 ’a new undated workfile with 50 obs

    series y0=0 ’y0 is just a "place holder"

    y0(1)=15

    series yd ’the "deterministic" series

    series ys ’the "stochastic" series

    rndseed 20

    series e=nrnd ’the stochastic "shocks"

    scalar rho ’the "autoregressive" parameter

    %mygraphs=""

    for !ct=0 to 3

    yd=y0

    ys=y0

    rho=0.9+(!ct)/20

    smpl @first+1 @last

    yd=rho*yd(-1)

    ys=rho*ys(-1)+e

    smpl @all

    group yg yd ys

    freeze(graph{!ct}) yg.line

    %header ="rho="+@str(rho)

    graph{!ct}.addtext(0.5,2) %header

%mygraphs=%mygraphs+" graph"+@str(!ct)

    next

    graph g4.merge %mygraphs

    show g4

    ’pick SaveAs and give it a title

    ’pick Run

    GAUSS

    new;library pgraph;graphset;begwind;window(2,2,1);

    y0=5;

    rho=.1~.4~.7~1;

    rndnseed=20;

    e=rndns(150,1,rndnseed);

    yd=recserar(zeros(150,4),y0~y0~y0~y0,rho);

    ys=recserar(e~e~e~e,y0~y0~y0~y0,rho);

    myseq=seqa(1,1,150);

    ct=1;

    do while ct

reset
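For readers without EViews or GAUSS, here is a rough Python counterpart to the EViews program above (numpy and matplotlib are assumed; it uses the same initial value of 15 and the same values of rho).

import numpy as np
import matplotlib.pyplot as plt

nobs, y0 = 50, 15.0
rng = np.random.default_rng(20)
e = rng.standard_normal(nobs)                 # the stochastic "shocks"

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, rho in zip(axes.flat, (0.90, 0.95, 1.00, 1.05)):
    yd = np.empty(nobs)                       # deterministic: yd[t] = rho*yd[t-1]
    ys = np.empty(nobs)                       # stochastic:    ys[t] = rho*ys[t-1] + e[t]
    yd[0] = ys[0] = y0
    for t in range(1, nobs):
        yd[t] = rho * yd[t - 1]
        ys[t] = rho * ys[t - 1] + e[t]
    ax.plot(yd, label="deterministic")
    ax.plot(ys, label="stochastic")
    ax.set_title("rho=" + str(rho))
ax.legend()
plt.tight_layout()
plt.show()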


3 Stochastic Process

    A stochastic process represents the probabilistic evolution of a system over time.

It is also called a time-series process. Contrast this with a deterministic system.

    For example, a system of linear difference equations may represent a deterministic

    dynamical system. If we add a stochastic forcing function, it becomes a stochastic

    process. We find a link between our stability conditions in the deterministic case and

    the conditions for stationarity in the stochastic case.

    A stochastic process X(t, ω) is a random variable for each admissible t ∈ T , where

    T is an index set we think of as “time.” That is, a stochastic process is just an ordered

    sequence of random variables, and the natural order for us will be time.

    A time series is a sample from a stochastic process. When we plot a time series

    X(t, ω), conceptually we are holding ω constant and allowing t to vary. The resulting

    function of t is called a realization or sample function.

Given a finite set of random variables $X = \{X_{t_1}, X_{t_2}, \ldots, X_{t_n}\}$, the joint distribution function is

$$F_{X_{t_1},\ldots,X_{t_n}}(x_{t_1},\ldots,x_{t_n}) = P\{\omega \mid X(t_1,\omega) \le x_{t_1}, \ldots, X(t_n,\omega) \le x_{t_n}\} \qquad (16)$$

    3.1 Stationarity

    We call a time series strictly stationary if

$$F_{X_{t_1},\ldots,X_{t_n}}(x_{t_1},\ldots,x_{t_n}) = F_{X_{t_1+h},\ldots,X_{t_n+h}}(x_{t_1},\ldots,x_{t_n}) \qquad (17)$$

    So the distribution function is the same at each time, and the joint distribution is

determined only by separation in time and not by the absolute date. So, for example, if the expected value or variance is defined at any point in time, it must take the same value at every point in time.

We generally work with a weaker notion of stationarity. We say that a time series is covariance stationary if

- Xt has a constant expected value, which we generally treat as 0 for convenience;
- the covariance matrix of $(X_{t_1}, X_{t_2}, \ldots, X_{t_n})$ depends only on the distances between observations and not on the absolute date (so it is identical to the covariance matrix of $(X_{t_1+h}, X_{t_2+h}, \ldots, X_{t_n+h})$).

    So a stochastic process Xt is covariance stationary if it has a time-independent finite

    mean and time-independent finite auto-covariances. That is, for all t, s, and j,

$$E(X_t) = E(X_{t-j})$$
$$E[(X_t - \mu)(X_{t-s} - \mu)] = E[(X_{t-j} - \mu)(X_{t-j-s} - \mu)]$$

    This allows us to write the autocovariance of Xt and Xt+h as a function only of

    the distance between the two observations. Letting the mean be zero for notational

    simplicity:

    cov(Xt, Xt+h) = E{XtXt+h} = γ(h) (18)
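Given a single realization, γ(h) is typically estimated by averaging products of deviations h periods apart. A minimal Python sketch (numpy assumed; dividing by T is the usual convention):

import numpy as np

def autocovariance(x, h):
    # sample estimate of gamma(h) = cov(X_t, X_{t+h}) for a stationary series
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return (d[:len(d) - h] * d[h:]).sum() / len(d)

rng = np.random.default_rng(0)
eps = rng.standard_normal(500)
# gamma(0) is the variance; for independent draws the other autocovariances are near zero
print([round(autocovariance(eps, h), 3) for h in range(4)])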

    3.2 White Noise

    We build our first time-series models out of white-noise processes. A particularly

simple stochastic process is white noise. A stochastic process εt is a white-noise

    process if it is characterized by

- zero mean;
- constant variance;
- no serial correlation: $E\{\varepsilon_t \varepsilon_{t-h}\} = 0$ for all $h \neq 0$ and all $t$.

Figure 3: White Noise (Coin Flip)

    White noise is clearly a stationary stochastic process. The time-independent mean

    and autocovariances are part of its definition. The lack of serial correlation is just

another way of noting that the autocovariances are zero at any nonzero lag.

    As an example of how we might build up stochastic processes from white noise,

let us suppose you play a coin-toss game in which you win $1 for a head and lose $1 for a tail. Assigning the values (1, -1) to (H, T), and assuming a fair coin, we

    sample the process and plot the resulting time series.

nobs = 200;
xs = RandomChoice[{-1, 1}, nobs];
ListLinePlot[xs, Frame -> True]

    Visually, Figure 3 looks like white noise: mean zero with no serial correlation.

    Later we will consider how to test that.


3.3 Moving Average Process

A simple MA(q) process is built up from a white-noise process by averaging the current shock with the previous q shocks. Let $\{\varepsilon_t\}$ be white noise, and let

$$x_t = \sum_{j=0}^{q} \theta_j \varepsilon_{t-j} \qquad (19)$$

When an MA process is estimated, we typically adopt the normalization θ0 = 1. At

    the moment, however, we simply want to consider how the “average shock” over the

    last few periods is changing over time. Think of this as a way of considering how

    limited memory might affect one’s estimate of the mean of a process.

    First of all, notice that an MA(q) process will clearly display serial correlation.

    (For example, xt and xt−1 are both affected by εt−1.) So even though we are building

    up our MA process from white noise, it will not be a white-noise process.

Let us return to our coin-flipping example. We can look for "hot streaks" in a simple moving average of the most recent winnings.

ListLinePlot[MovingAverage[xs, 7], Frame -> True]

    Even though there is no serial correlation in the simple coin flip outcomes, we

    clearly observe serial correlation in the moving average.
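A quick Python check of this claim (numpy assumed): the lag-one sample autocorrelation of the raw flips is near zero, while that of their 7-term moving average is large (6/7 in theory for equal weights).

import numpy as np

def lag1_autocorr(x):
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return (d[:-1] * d[1:]).sum() / (d ** 2).sum()

rng = np.random.default_rng(0)
flips = rng.choice([-1, 1], size=200)                    # the coin-flip series
ma7 = np.convolve(flips, np.ones(7) / 7, mode="valid")   # 7-term moving average
print(lag1_autocorr(flips), lag1_autocorr(ma7))          # roughly 0 versus roughly 6/7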

    3.4 AR Process

An AR(p) process depends directly on its own past values. Let $\{\varepsilon_t\}$ be white noise. We can build up an AR(p) process from a white-noise process in the following way:

$$x_t = \sum_{i=1}^{p} \alpha_i x_{t-i} + \varepsilon_t \qquad (20)$$

Figure 4: Hot Streaks (MA(7) of Coin Flipping Outcomes)

    Usually we express this more succinctly as

    α(L)xt = εt (21)

    where

$$\alpha(L) = 1 - \sum_{i=1}^{p} \alpha_i L^i \qquad (22)$$

    This is just a linear difference equation with a white-noise forcing function. Ques-

    tions of stability are therefore unchanged from the analysis of difference equations, but

they now become questions of the stationarity of the process. When we ask whether

    an ARMA process is “stationary”, we are just raising the question of stability of the

    difference equation.

    Just as a stable difference equation implies very different behavior from a similar

    unstable difference equation, a stationary stochastic process implies very different

behavior from a non-stationary stochastic process.

Figure 5: Deterministic and Stochastic AR(1) with Additive Shocks

Consider the following three AR processes:

    xt = 0.99xt−1 + εt (23)

    xt = 1.00xt−1 + εt (24)

    xt = 1.01xt−1 + εt (25)

    where εt is white noise. If we ignore the stochastic term, we have deterministic

    processes. We easily recognize that (given a nonzero initial value) the first case

    converges to zero, the second case is constant, and the last case is explosive. In the

    presence of the stochastic term, these differences persist, but some new considerations

    arise.

    Macroeconomic time series tend to be highly autocorrelated, but usually they do

not appear explosive. This has focused attention on the question of whether ρ is

    almost 1 or actually equals 1.

    3.5 Stationary Case

    We will focus on the simplest AR(1) model:

    yt = a1yt−1 + εt (26)

    where εt is a white-noise process. In this section we focus on the case 0 ≤ a1 < 1,

    which ensures stationarity. In this case, y has an unconditional mean of 0 and variance

of $\sigma_\varepsilon^2/(1-a_1^2)$. To see this, repeatedly substitute for lagged y to get the moving average representation

$$y_t = \sum_{i=0}^{\infty} a_1^i \varepsilon_{t-i} \qquad (27)$$

    Suppose we have a time-series of observations on y, (y1, . . . , yT ). Then we could

    produce an OLS estimate of a1 as

$$\hat{a}_1 = \frac{\sum_{t=2}^{T} y_t y_{t-1}}{\sum_{t=2}^{T} y_{t-1}^2} \qquad (28)$$

    Substituting from (26) for yt we get

$$\hat{a}_1 = \frac{a_1 \sum_{t=2}^{T} y_{t-1}^2 + \sum_{t=2}^{T} \varepsilon_t y_{t-1}}{\sum_{t=2}^{T} y_{t-1}^2} = a_1 + \frac{\sum_{t=2}^{T} \varepsilon_t y_{t-1}}{\sum_{t=2}^{T} y_{t-1}^2} \qquad (29)$$

    Since the term on the right has nonzero expectation, our estimator â1 is biased

    (Keele and Kelly, 2006). This is just the standard problem of lagged dependent

variables in an OLS regression. But if we consider its probability limit we get

$$\operatorname*{plim}_{T\to\infty} \hat{a}_1 = a_1 + \frac{\operatorname*{plim}_{T\to\infty} \frac{1}{T}\sum_{t=2}^{T} \varepsilon_t y_{t-1}}{\operatorname*{plim}_{T\to\infty} \frac{1}{T}\sum_{t=2}^{T} y_{t-1}^2} \qquad (30)$$

Our assumption 0 ≤ a1 < 1 assures stationarity and thus a finite value for the

    denominator. The numerator is the mean of products with zero expected value, so

    we find â1 is a consistent estimator of a1.
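A small Monte Carlo sketch of this result (Python with numpy assumed, true a1 = 0.9): the OLS estimate is biased downward in short samples but approaches the true value as T grows.

import numpy as np

def ols_a1(y):
    # OLS slope from regressing y_t on y_{t-1}, as in equation (28)
    return (y[1:] * y[:-1]).sum() / (y[:-1] ** 2).sum()

def simulate_ar1(a1, T, rng):
    y = np.zeros(T)
    eps = rng.standard_normal(T)
    for t in range(1, T):
        y[t] = a1 * y[t - 1] + eps[t]
    return y

rng = np.random.default_rng(0)
for T in (25, 100, 1000):
    estimates = [ols_a1(simulate_ar1(0.9, T, rng)) for _ in range(2000)]
    print(T, np.mean(estimates))   # the mean estimate rises toward 0.9 as T grows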

    3.6 Half-Life to Convergence

    The (stochastic) steady state of our simple AR(1) is 0. Suppose we are at the steady

    state and then shocked. How long do we expect it to take us to get half-way back to

the steady state? Equivalently, given y0, how long do we expect it to take to get

    to y0/2?

For t > 0, we will have $y_t = a_1^t y_0 + \sum_{i=0}^{t-1} a_1^i \varepsilon_{t-i}$. Since the expected value of future shocks is zero,

$$E_0\, y_t = a_1^t y_0 \qquad (31)$$

So the half-life question is asking: when is

$$a_1^t = 1/2,$$
$$t \ln(a_1) = -\ln(2),$$
$$t = -\ln(2)/\ln(a_1). \qquad (32)$$
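For example, with a1 = 0.9 equation (32) gives a half-life of about 6.6 periods, and with a1 = 0.99 about 69 periods. A one-line check in Python:

import math

def half_life(a1):
    # periods until an initial deviation is expected to decay to half its size
    return -math.log(2) / math.log(a1)

print(half_life(0.9), half_life(0.99))   # roughly 6.6 and 69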

    4 Unit Root Tests

    Dickey and Fuller (1979) present Monte Carlo results allowing a test of the null

hypothesis that γ = 0 in the regression (82). They found that the critical values for the test of γ = 0 varied not only with the sample size but also differed between the trend-stationary and difference-stationary models.

    The first approach to testing for unit roots relies on critical values tabulated by

    David Dickey and reproduced in Wayne Fuller (1976). Suppose we have a model

    yt = a1yt−1 + et (33)

    where et is white noise. Rewrite this as

    ∆yt = γyt−1 + et (34)

    where γ = a1−1. The hypothesis that a1 = 1 is thus the same as the hypothesis that

    γ = 0. We can estimate γ using OLS.

    Problem: under the null hypothesis

    H0 : ρ = 1 (35)

    you are regressing a stationary variable on a non-stationary variable. The non-

    stationarity of yt under the null hypothesis implies that the standard t-ratio does

not have the Student's t distribution. Instead, we must use specially tabulated critical values, yielding what are often called τ tests. For example, consider the following table

    of critical values for the one-sided test of H0 : γ = 0 versus H1 : γ < 0:

For example, consider a sample of 50 observations. The 5% and 1% critical values for a one-tailed t-test are approximately -1.68 and -2.41. As the sample size increases, these approach the standard normal critical values of -1.65 and -2.33. Comparing these with the τ values in Table 1, we see a tendency to evaluate the null hypothesis incorrectly if we use the t or standard normal tables.


Model: ∆y_t = γ y_{t−1} + ε_t,   y_0 = 0
DGP:   ∆y_t = ε_t,   y_0 = 0

              1%               5%               10%
#Obs       t       τ        t       τ        t       τ
  25     -2.49   -2.66    -1.71   -1.95    -1.32   -1.60
  50     -2.41   -2.62    -1.68   -1.95    -1.30   -1.61
 100     -2.37   -2.60    -1.66   -1.95    -1.29   -1.61
  ∞      -2.33   -2.58    -1.65   -1.95    -1.28   -1.62

Table 1: Tabulated τ and t One-Sided Critical Values

    Two examples will make this point.

    First, suppose we estimate γ and get a t-statistic of 1.00. Applying the t or

    standard normal critical values, we would not reject the null hypothesis of a unit root

    against the explosive alternative that ρ > 1. Yet using the tabulated critical values,

    we would reject.

    Second, suppose we get a t-statistic of -2.5. We would reject the unit root using

    the t or normal critical values, but we would not reject using the tabulated values.
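A minimal sketch of the test regression (34) in Python (numpy assumed): estimate γ by OLS, form the usual t-ratio, and compare it with the τ column of Table 1 rather than with the Student's t table. Packaged routines, such as adfuller in statsmodels, automate this and the augmented variants discussed later.

import numpy as np

def df_tau(y):
    # no-constant Dickey-Fuller regression: delta y_t = gamma * y_{t-1} + e_t;
    # returns (gamma_hat, t_ratio); compare the t_ratio with the tau critical values
    y = np.asarray(y, dtype=float)
    dy, ylag = np.diff(y), y[:-1]
    gamma = (ylag * dy).sum() / (ylag ** 2).sum()
    resid = dy - gamma * ylag
    s2 = (resid ** 2).sum() / (len(dy) - 1)       # residual variance (one parameter estimated)
    se = np.sqrt(s2 / (ylag ** 2).sum())
    return gamma, gamma / se

rng = np.random.default_rng(0)
rw = np.cumsum(rng.standard_normal(100))          # a driftless random walk: true gamma = 0
print(df_tau(rw))   # under the null, the t-ratio falls below -1.95 only about 5% of the time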

    4.1 Some Perspective

    When we say a variable follows a random walk, especially if it is a variable we are

    interested in, we are offering a statement of our ignorance. We are saying that we

    have no more reason to believe that it will change in one direction than in the other.

    It may sound like a solid conclusion to say that we cannot reject the hypothesis that

    a variable follows a random walk. But as Frankel (1990, ch.6 of OER) has noted, we

    could just as well say that after studying the variable we have “absolutely nothing to

    say that would help to predict its movements.”

Second, note that the most popular tests for unit roots ask whether we can reject the hypothesis that the series has a unit root. If not, then the practice is to continue our analysis under the assumption of a unit root. In contrast with traditional econometric

    practice where we sought statistical significance as a basis for our conclusions, finding

    nothing is now satisfactory. As Frankel has noted, it is easier to find nothing than to

    find something.

    4.2 Deterministic Regressors

    A problem with (34) is that it is so restrictive. We are testing the null hypothesis of a

    random walk without drift against a simple first-order autoregressive process without

    a constant term. This does not encompass many applications of economic interest.

    Many series seem to display a trend, and we care about determining whether this

    is an artifact or part of the nature of the series. So most applications of unit root

    tests include these possibilities.

    One of the interesting aspects of unit root tests is that allowing for these possibil-

    ities changes the critical values, even though the underlying DGP is unchanged. This

is clear in Table 2.

DGP: ∆y_t = ε_t,   y_0 = 0

Model:                ∆y_t = γ y_{t−1} + ε_t    ∆y_t = a_0 + γ y_{t−1} + ε_t    ∆y_t = a_0 + γ y_{t−1} + a_2 t + ε_t
Test statistic:       τ                          τ_µ                             τ_τ
5% critical values:   -1.95 / +0.91              -2.93 / -0.03                   -3.50 / -0.87
1% critical values:   -2.62 / +2.08              -3.58 / +0.66                   -4.15 / -0.24

Table 2: Tabulated Critical Values for 50 Observations
