


NONLINEAR TIME SERIES MODELLING: AN INTRODUCTION

Simon M. Potter

Federal Reserve Bank of New York

Abstract. Recent developments in nonlinear time series modelling are reviewed. Three main types of nonlinear model are discussed: Markov Switching, Threshold Autoregression and Smooth Transition Autoregression. Classical and Bayesian estimation techniques are described for each model. Parametric tests for nonlinearity are reviewed with examples from the three types of model. Finally, forecasting and impulse response analysis are developed.

Keywords. Markov Switching; Threshold Autoregression; Smooth Transition Autoregression

1. Introduction

It is now ten years since Jim Hamilton's seminal paper on nonlinear modelling of U.S. output was published. These ten years have witnessed an explosion of interest amongst econometricians in the testing, estimation, specification and properties of nonlinear models. The purpose of this paper is to give a non-technical survey of the main developments and some observations on the difficulties of successful nonlinear modelling in macroeconomics.1 It is useful to begin by giving some motivation for the need for nonlinear modelling.

The 1970s and 1980s saw economists adopt many of the time series techniques introduced by Box and Jenkins. The basis for such modelling approaches was the Wold Representation: any covariance stationary time series can be expressed as a moving average function of present and past innovations:

$$Y_t = \sum_{i=0}^{\infty} \psi_i U_{t-i}, \quad \text{with } \sum_{i=0}^{\infty} \psi_i^2 < \infty, \; \psi_0 = 1,$$

where $E[U_t U_{t-i}] = 0$ for all $i \neq 0$ and $E[U_t^2] = \sigma_u^2$.

0950-0804/99/05 0505–24 JOURNAL OF ECONOMIC SURVEYS Vol. 13, No. 5 © Blackwell Publishers Ltd. 1999, 108 Cowley Rd., Oxford OX4 1JF, UK and 350 Main St., Malden, MA 02148, USA.

I would like to thank Gary Koop for numerous helpful discussions. The views expressed in this paper are those of the author and do not necessarily reflect the views of the Federal Reserve Bank of New York or the Federal Reserve System.


This infinite moving average can nearly always be well approximated by low order autoregressive processes, perhaps with some moving average components. Further, the dynamics of the time series could be `read off' from the Wold Representation, since ψ_i represents the impulse response function at horizon i.

It might appear at first that there is no need for nonlinear modelling given the Wold Representation. However, as well illustrated in the appropriately titled `Forecasting White Noise' by Clive Granger, lack of autocorrelation in a time series does not imply that the time series cannot be predicted. Indeed, some perfectly predictable time series have zero autocorrelations at all lags.2
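This point is easy to verify numerically. The sketch below (illustrative Python, not from the paper) uses the fully chaotic logistic map, a standard example of a deterministic, perfectly predictable sequence whose autocorrelations are zero at every lag:

```python
import numpy as np

# Deterministic logistic map x_{t+1} = 4 x_t (1 - x_t):
# perfectly predictable, yet linearly it looks like white noise.
rng = np.random.default_rng(0)
T = 20000
y = np.empty(T)
y[0] = rng.uniform(0.01, 0.99)          # generic starting point
for t in range(1, T):
    y[t] = 4.0 * y[t - 1] * (1.0 - y[t - 1])

yc = y - y.mean()
# sample autocorrelations at lags 1..5 are all close to zero
acf = np.array([yc[k:] @ yc[:-k] / (yc @ yc) for k in range(1, 6)])
# the one-step 'forecast' is exact because the law of motion is known
pred_err = np.max(np.abs(y[1:] - 4.0 * y[:-1] * (1.0 - y[:-1])))
```

The series passes linear white-noise diagnostics while being perfectly forecastable one step ahead, which is exactly Granger's point.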

Further, while the Wold Representation gives the impulse response function directly, it imposes some strong restrictions on it. First, the impulse response function does not depend on the recent history of the time series. Thus, for example, the response to a positive innovation of 1% is the same whether last period's growth rate was 8% or −8%. Second, the response to innovations is restricted to be homogeneous of degree 1. That is, once the response to a shock of size 1 has been found, all other shocks are given by simple scalings of this response.

Successful nonlinear time series modelling would improve forecasts and produce a richer notion of business cycle dynamics than linear time series models allow. For this to happen two conditions are necessary. First, economic time series must contain nonlinearities. Second, we need reliable statistical methods to summarize and understand these nonlinearities, suitable for time series of the typical macroeconomic length. Unfortunately, the second condition is needed to evaluate the veracity of the first condition and, as we shall see, it is not clear that we have yet found reliable statistical methods.

The organization of the paper is as follows: I start by describing the three types of models most widely used in the economics literature and Classical and Bayesian estimation techniques in simple cases; the testing problem is then discussed with respect to these three models; finally, simulation of conditional expectations is described and its use in the construction of forecasts and impulse response functions.

2. Three models

In this section I discuss the three types of models that have most commonly been used in nonlinear modelling, particularly for aggregate output measures and unemployment. I will use a common notation across all models. Y_t will be a univariate covariance stationary time series, and y^t = (Y_1, Y_2, ..., Y_t) will be the history of the time series up to time t. V_t will be a sequence of independent and identically distributed random variables with unit variance. When likelihood based methods are discussed one can assume that V_t has a standard normal distribution. The Greek letters α, φ, σ will respectively refer to the intercepts, autoregressive coefficients and the scaling of the time series innovation. φ(L) is a polynomial in the lag operator of the form:

$$\phi(L) = \phi_1 L + \phi_2 L^2 + \cdots + \phi_p L^p.$$


Below we will make use of the fact that if V_t ~ N(μ, 1) and the prior belief on μ is flat over the real line, then the posterior belief about μ after observing a sample v_1, ..., v_T of size T is

$$\mu \sim N\!\left(\frac{1}{T}\sum_{t=1}^{T} v_t,\; \frac{1}{T}\right).$$

2.1. Markov switching

It is best to begin with Hamilton's model from Econometrica 1989. His original motivation was to model long swings in the growth rate of output, but instead he found evidence for discrete switches in the growth rate at business cycle frequencies. Output growth was modelled as the sum of a discrete Markov chain and a Gaussian autoregression:

$$Y_t = Z_t + X_t,$$

where

$$Z_t = \alpha_0 + \alpha_1 S_t, \quad S_t = 0 \text{ or } 1,$$

and

$$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \phi_3 X_{t-3} + \phi_4 X_{t-4} + \sigma V_t,$$

and $P[S_t = 1 \mid S_{t-1} = 1] = p$, $P[S_t = 0 \mid S_{t-1} = 0] = q$, $V_t \sim N(0, 1)$.

The major estimation difficulty with the model is the lack of separate observability of Z_t and X_t. A simple variation on the model is:

$$Y_t = Z_t + \phi(L) Y_{t-1} + \sigma V_t,$$

where only S_t is now unobservable. Note that the original model can also be written in this form: multiplying both sides by (1 − φ(L)) we have:

$$Y_t = (1 - \phi(L)) Z_t + \phi(L) Y_{t-1} + \sigma V_t. \tag{1}$$

Expanding out the term (1 − φ(L))Z_t using the lag operator, we see that the two state Markov chain is transformed into a tightly parameterized 32 state chain.

A slightly different model is produced by allowing all of the parameters to switch with the Markov chain:

$$Y_t = \alpha_{s(t)} + \phi_{s(t)}(L) Y_{t-1} + \sigma_{s(t)} V_t. \tag{2}$$

Three approaches to estimating the model have been taken.3 In Hamilton's original article he developed a nonlinear filter to evaluate the likelihood function of the model and then directly maximized the likelihood function. Hamilton (1990) constructed an EM algorithm that is particularly useful for the case where all the parameters switch. Finally, Albert and Chib (1993) developed a Bayesian approach to estimation that was later refined using results due to Chang-Jin Kim. The recent monograph by Kim and Nelson (1999) contains an excellent


discussion of both Classical and Bayesian estimation of Markov switching models. The idea behind all three approaches can be illustrated in the following simple model:

$$Y_t = \alpha_0 (1 - S_t) + \alpha_1 S_t + V_t, \quad t = 1, 2, \ldots, T.$$

For simplicity suppose we know that S_0 = 1; then entering the next period, P[S_1 = 1 | S_0 = 1] = p. The observation Y_1 is either drawn from a normal distribution with mean α_0 and variance 1 or a normal distribution with mean α_1 and variance 1, and the likelihood is given by

$$f(y_1; \alpha_0, \alpha_1, p, q, s_0 = 1) = \frac{p \exp(-0.5(y_1 - \alpha_1)^2) + (1 - p)\exp(-0.5(y_1 - \alpha_0)^2)}{\sqrt{2\pi}}.$$

Assume for the moment that the two mean parameters are known. Then, given the realization of Y_1, Bayes rule can be used to update the probability that S_1 = 1; denote this by b_1:

$$P[S_1 = 1 \mid Y_1, S_0 = 1, \alpha_0, \alpha_1, p, q] = b_1 = \frac{p \exp(-0.5(y_1 - \alpha_1)^2)}{p \exp(-0.5(y_1 - \alpha_1)^2) + (1 - p)\exp(-0.5(y_1 - \alpha_0)^2)}.$$

Now b_1 can be used to produce a prediction of the state next period; denote this by b̂_2:

$$P[S_2 = 1 \mid Y_1, S_0 = 1, \alpha_0, \alpha_1, p, q] = \hat{b}_2 = p b_1 + (1 - q)(1 - b_1).$$

Using this prediction we can weight together the two possible likelihood functions depending on the state, to produce a likelihood function that averages out over the value of S_2:

$$f(y_2 \mid y_1; \alpha_0, \alpha_1, p, q, s_0 = 1) = \frac{\hat{b}_2 \exp(-0.5(y_2 - \alpha_1)^2) + (1 - \hat{b}_2)\exp(-0.5(y_2 - \alpha_0)^2)}{\sqrt{2\pi}},$$

and so on. This process then continues up through the last observation T to obtain the overall likelihood function:

$$f(y_1, \ldots, y_T; \alpha_0, \alpha_1, p, q, s_0 = 1) = \prod_{t=1}^{T} \frac{\hat{b}_t \exp(-0.5(y_t - \alpha_1)^2) + (1 - \hat{b}_t)\exp(-0.5(y_t - \alpha_0)^2)}{\sqrt{2\pi}},$$

with b̂_1 = p given the initial condition. Numerical optimization techniques can be used to find the maximum with respect to α_0, α_1, p, q. Further, one can also treat the probability that the initial state s_0 = 1 as a parameter to be estimated.
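The filter recursion just described can be sketched in a few lines (illustrative Python, not the author's code; `switching_loglik` is a hypothetical name):

```python
import numpy as np

def switching_loglik(y, a0, a1, p, q, b0=1.0):
    """Log likelihood of the simple two-state switching-mean model,
    computed with the filter described above; b0 = P[S_0 = 1]
    (the text conditions on S_0 = 1, i.e. b0 = 1)."""
    ll = 0.0
    b = b0                                        # filtered P[S_{t-1}=1 | y^{t-1}]
    for yt in y:
        bhat = p * b + (1.0 - q) * (1.0 - b)      # predicted P[S_t = 1]
        f1 = np.exp(-0.5 * (yt - a1) ** 2)
        f0 = np.exp(-0.5 * (yt - a0) ** 2)
        f = (bhat * f1 + (1.0 - bhat) * f0) / np.sqrt(2.0 * np.pi)
        ll += np.log(f)
        b = bhat * f1 / (bhat * f1 + (1.0 - bhat) * f0)   # Bayes update
    return ll
```

A useful check: when α_0 = α_1 the regime is irrelevant and the log likelihood collapses to that of an i.i.d. Gaussian model, whatever values p and q take.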


For both the EM algorithm and for Bayesian analysis and inference about the unobserved Markov chain, one needs to `smooth' the estimate of the Markov state b_t so that it contains information from the whole sample. Only the last probability b_T contains information on the whole observed time series. Using the Markov property and the exogeneity of the Markov chain, we know that, conditional on observing tomorrow's state s_{t+1}, all future realizations of the observed time series {Y_s : s > t} are not relevant for the estimate of today's state. Using this restriction, and ignoring the dependence on estimated parameters, we have:

$$P[S_{T-1} = 1, S_T = 1 \mid Y^T] = P[S_{T-1} = 1 \mid S_T = 1, Y^{T-1}]\, P[S_T = 1 \mid Y^T] = \frac{p\, b_{T-1}}{\hat{b}_T}\, b_T, \tag{3}$$

since

$$P[S_{T-1} = 1 \mid S_T = 1, Y^{T-1}] = \frac{P[S_{T-1} = 1, S_T = 1 \mid Y^{T-1}]}{P[S_T = 1 \mid Y^{T-1}]} = \frac{P[S_{T-1} = 1 \mid Y^{T-1}]\, P[S_T = 1 \mid S_{T-1} = 1]}{\hat{b}_T} = \frac{b_{T-1}\, p}{\hat{b}_T}. \tag{4}$$

Thus, denoting the smoothed probability by b̃_{T-1} (with b̃_T = b_T), we need to average out the value of S_T in (3) by performing a similar calculation for the case that S_T = 0:

$$\tilde{b}_{T-1} = b_{T-1} \left( \frac{p\, \tilde{b}_T}{\hat{b}_T} + \frac{(1 - p)(1 - \tilde{b}_T)}{1 - \hat{b}_T} \right).$$

This process continues until we arrive at time 1 or 0, depending on the assumption made on the initial condition. In the EM algorithm the smoothed probabilities are used to produce estimates of the unknown parameters as follows:

$$\hat{\alpha}_0^{\,i} = \frac{\sum_{t=1}^{T} y_t (1 - \tilde{b}_t)}{\sum_{t=1}^{T} (1 - \tilde{b}_t)}; \quad \hat{\alpha}_1^{\,i} = \frac{\sum_{t=1}^{T} y_t \tilde{b}_t}{\sum_{t=1}^{T} \tilde{b}_t};$$


and using (3):

$$\hat{p}^{\,i} = \frac{\sum_{t=1}^{T} P[S_{t-1} = 1, S_t = 1 \mid Y^T]}{\sum_{t=1}^{T} P[S_{t-1} = 1 \mid Y^T]} = \frac{\sum_{t=1}^{T} \hat{p}^{\,i-1}\,(b_{t-1}/\hat{b}_t)\,\tilde{b}_t}{\sum_{t=1}^{T} \tilde{b}_{t-1}};$$

$$\hat{q}^{\,i} = \frac{\sum_{t=1}^{T} P[S_{t-1} = 0, S_t = 0 \mid Y^T]}{\sum_{t=1}^{T} P[S_{t-1} = 0 \mid Y^T]} = \frac{\sum_{t=1}^{T} \hat{q}^{\,i-1}\,((1 - b_{t-1})/(1 - \hat{b}_t))\,(1 - \tilde{b}_t)}{\sum_{t=1}^{T} (1 - \tilde{b}_{t-1})}.$$

Thus, the intercepts are estimated by weighting observations by the likelihood that they are in regime 0 or 1, and the transition probabilities by a pseudo count of the number of times the Markov chain stayed in the same state.

These updated parameter values are then used to re-run the filter and smoother on the observed data. This produces new parameter values, and the iteration continues until a fixed point is achieved. This fixed point will be a local maximum of the likelihood function. By considering different starting parameter values for the algorithm one can check which local maximum is the global one. Of course, one local maximum occurs at the least squares estimate of the mean of the time series with no transitions amongst states; it is easy to discard (but, as we shall see later, it causes inference problems). Consider the case of the EM algorithm where all the smoothed probabilities were 1 or 0 and there were some transitions amongst states. In this case the smoother would have correctly identified the movement of the Markov chain and the data would be grouped in the appropriate manner. The Bayesian approach works off this intuition by using the filter probabilities to generate realizations of the Markov chain.

Starting from b_T, a value of s_T is drawn using standard inversion techniques

smoother on the observed data. This produces new parameter values and theiteration continues until a fixed point is achieved. This fixed point will be a localmaxima of the likelihood function. By considering different starting parametervalues for the algorithm one can check which of one local maxima is the globalone. Of course one local maximum occurs at the least squares estimate of themean of the time series with no transitions amongst states and is easy to discard(but we shall see later it causes inference problems). Consider the case of theEM algorithm where all the smoothed probabilities were 1 or 0 and there weresome transitions amongst states. Then in this case the smoother would havecorrectly identified the movement of the Markov chain and the data would begrouped in the appropriate manner. The Bayesian approach works off thisintuition by using the filter probabilities to generate realizations of the Markovchain.Starting from bT a value of sT is drawn using standard inversion techniques

(that is, generate a uniform random number; if it is less than or equal to b_T, then s_T = 1, otherwise s_T = 0). Then, using (4) if S_T = 1 is drawn (or its obvious complement if S_T = 0 is drawn), a value of S_{T-1} is drawn, and this process continues until the whole sequence of the realization of the Markov chain has been drawn, {s_t^i}. Using this sequence of values for the Markov chain it is possible to split the observed data directly into two regimes. That is,

$$\hat{\alpha}_0^{\,i} = \frac{1}{T_0^i} \sum_{t=1}^{T} y_t (1 - s_t^i), \quad T_0^i = \sum_{t=1}^{T} (1 - s_t^i);$$

$$\hat{\alpha}_1^{\,i} = \frac{1}{T_1^i} \sum_{t=1}^{T} y_t s_t^i, \quad T_1^i = \sum_{t=1}^{T} s_t^i.$$

In the case that flat independent priors are used for α_0, α_1, their posterior densities are normal with means α̂_0^i, α̂_1^i and variances 1/T_0^i, 1/T_1^i respectively. These posterior distributions can then be used to draw realizations of α_0, α_1.


The posterior of the transition parameters can also be simply found under the assumption that the priors are independent Beta distributions:

$$f(p) \propto p^{a_1 - 1} (1 - p)^{a_2 - 1}; \quad f(q) \propto q^{c_1 - 1} (1 - q)^{c_2 - 1};$$

where a_1, a_2, c_1, c_2 are all strictly positive, and a standard uniform prior is obtained for a_1 = a_2 = c_1 = c_2 = 1. Given the Beta prior, the posterior will also have a Beta form with parameters:

$$\bar{a}_1^{\,i} = a_1 + \hat{p}^{\,i} \sum_{t=1}^{T} P[S_{t-1} = 1 \mid Y^T]; \quad \bar{a}_2^{\,i} = a_2 + (1 - \hat{p}^{\,i}) \sum_{t=1}^{T} P[S_{t-1} = 1 \mid Y^T];$$

$$\bar{c}_1^{\,i} = c_1 + \hat{q}^{\,i} \sum_{t=1}^{T} P[S_{t-1} = 0 \mid Y^T]; \quad \bar{c}_2^{\,i} = c_2 + (1 - \hat{q}^{\,i}) \sum_{t=1}^{T} P[S_{t-1} = 0 \mid Y^T];$$

where p̂^i, q̂^i are defined in the same manner as in the EM algorithm. Again it is easy to draw from these posterior densities to obtain the realizations p^i, q^i. Combined with the realizations for the intercepts, the new values can be used to run the nonlinear filter and smoother again and obtain a fresh draw of the Markov chain.

This is a Gibbs sampling algorithm for the Markov switching model. One important issue is how to initialize the algorithm; a good choice is the maximum likelihood estimates. Unlike the EM algorithm, the output of the Gibbs sampler is a collection of draws from the posterior of the Markov switching model. In addition to draws of the parameters, this also includes the draws of the full sample smoother. Features of the posterior distribution can then be found by averaging. For example, the estimate of the path of the Markov chain would be obtained by averaging the draws of the full sample smoother:

$$P[S_t = 1 \mid Y^T] \approx \frac{1}{I} \sum_{i=1}^{I} \tilde{b}_t^{\,i}.$$
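The pieces of this subsection, the forward filter, the backward smoother and the backward draw of the state sequence, can be collected in a short sketch (illustrative Python under the simple switching-mean model; the function names are hypothetical, not the paper's):

```python
import numpy as np

def filter_probs(y, a0, a1, p, q, b0=1.0):
    """Forward filter: filtered b_t = P[S_t=1|y^t] and
    predicted bhat_t = P[S_t=1|y^{t-1}]."""
    T = len(y)
    b, bhat = np.empty(T), np.empty(T)
    prev = b0
    for t in range(T):
        bhat[t] = p * prev + (1.0 - q) * (1.0 - prev)
        f1 = bhat[t] * np.exp(-0.5 * (y[t] - a1) ** 2)
        f0 = (1.0 - bhat[t]) * np.exp(-0.5 * (y[t] - a0) ** 2)
        b[t] = f1 / (f1 + f0)
        prev = b[t]
    return b, bhat

def smooth_probs(b, bhat, p, q):
    """Backward recursion for the smoothed btil_t = P[S_t=1|y^T]."""
    T = len(b)
    btil = np.empty(T)
    btil[-1] = b[-1]
    for t in range(T - 2, -1, -1):
        btil[t] = b[t] * (p * btil[t + 1] / bhat[t + 1]
                          + (1.0 - p) * (1.0 - btil[t + 1]) / (1.0 - bhat[t + 1]))
    return btil

def draw_states(b, bhat, p, rng):
    """One Gibbs draw of the state sequence, sampled backwards
    from the filtered probabilities via inversion."""
    T = len(b)
    s = np.empty(T, dtype=int)
    s[-1] = int(rng.random() <= b[-1])
    for t in range(T - 2, -1, -1):
        if s[t + 1] == 1:
            prob1 = p * b[t] / bhat[t + 1]
        else:
            prob1 = (1.0 - p) * b[t] / (1.0 - bhat[t + 1])
        s[t] = int(rng.random() <= prob1)
    return s
```

With well separated regime means, the smoothed probabilities and the drawn states pin down the regime of each observation almost exactly.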

2.2. Threshold models

Threshold autoregressive (TAR) models are perhaps the simplest generalization of linear autoregressions. They were introduced to the time series literature by Howell Tong (see his 1983 and 1990 monographs for descriptions and extensive background on nonlinear time series).4 Various different threshold models have been successfully applied to US GDP/GNP by Beaudry and Koop (1993), Potter (1995) and Potter and Pesaran (1997). The general form is as follows:

$$Y_t = \alpha_{j(t)} + \phi_{j(t)}(L) Y_{t-1} + \sigma_{j(t)} V_t,$$

where j(t) = 1 if Y_{t-d} < r_1, j(t) = 2 if r_1 ≤ Y_{t-d} < r_2, ..., j(t) = J if r_{J-1} ≤ Y_{t-d}, and it is possible that the length of the autoregression varies across the regimes. The parameters r_j are called the thresholds and d is called the delay.


Although this model looks very similar to the Markov switching one in (2), there is a crucial difference. In the threshold model regimes are defined by the past values of the time series itself; in the Markov switching case regimes are defined by the exogenous state of the Markov chain. Filardo (1994) constructed an intermediate case where the transition probabilities of the Markov chain vary with the history of the observed time series.

Suppose {r_j}, {p_j}, d were known; then the model could be estimated by separating the data into groups by regime and finding the least squares estimates for the parameters in each regime. Unfortunately these parameters are not known, and standard nonlinear least squares algorithms are not useful, since the sum of squares function is not differentiable with respect to these parameters. For the discrete parameters of the delay and the order of the autoregressive lags it is easy to repeat the least squares estimation for each choice. In the case of the threshold parameters one needs to evaluate the sum of squares for a finite number of choices.

Estimation can be illustrated in a simplified model similar to the Markov switching one:

$$Y_t = \alpha_0 1(Y_{t-1} < r) + \alpha_1 1(Y_{t-1} \geq r) + V_t,$$

where 1(Y_{t-1} < r) = 1 if the inequality holds and is zero otherwise. The first step is to organize the data into the matrix:

$$\begin{bmatrix} Y_1 & Y_0 \\ Y_2 & Y_1 \\ \vdots & \vdots \\ Y_T & Y_{T-1} \end{bmatrix};$$

then sort this matrix from smallest to largest according to the values in the second column:

$$\begin{bmatrix} Y_t^{\{1\}} & Y_{t-1}^{\{1\}} \\ Y_t^{\{2\}} & Y_{t-1}^{\{2\}} \\ \vdots & \vdots \\ Y_t^{\{T\}} & Y_{t-1}^{\{T\}} \end{bmatrix}.$$

Now note that if r < Y_{t-1}^{\{1\}} then all the data are in regime 1 (α̂_1 would be the sample average, α_0 is not identified), while if r > Y_{t-1}^{\{T\}} then all of the data are in regime 0 (α̂_0 would be the sample average, α_1 is not identified). If Y_{t-1}^{\{1\}} ≤ r < Y_{t-1}^{\{2\}} then

$$\hat{\alpha}_0^{\{1\}} = Y_t^{\{1\}}; \quad \hat{\alpha}_1^{\{1\}} = \frac{1}{T-1} \sum_{s=2}^{T} Y_t^{\{s\}};$$


and we have the (least squares) recursion:

$$\hat{\alpha}_0^{\{i+1\}} = \frac{T_0^{\{i\}}}{T_0^{\{i+1\}}}\, \hat{\alpha}_0^{\{i\}} + \frac{1}{T_0^{\{i+1\}}}\, Y_t^{\{i+1\}}, \quad T_0^{\{i+1\}} = i + 1;$$

$$\hat{\alpha}_1^{\{i+1\}} = \frac{T_1^{\{i\}}}{T_1^{\{i+1\}}}\, \hat{\alpha}_1^{\{i\}} - \frac{1}{T_1^{\{i+1\}}}\, Y_t^{\{i+1\}}, \quad T_1^{\{i+1\}} = T - i - 1.$$

These estimates give the sum of squared errors function in terms of the threshold:

$$SSE_0^{\{i\}} = \sum_{s=1}^{T_0^{\{i\}}} (Y_t^{\{s\}} - \hat{\alpha}_0^{\{i\}})^2; \quad SSE_1^{\{i\}} = \sum_{s=T_0^{\{i\}}+1}^{T} (Y_t^{\{s\}} - \hat{\alpha}_1^{\{i\}})^2.$$

The obvious least squares estimate of r is the interval associated with the smallest sum of squared errors (this is also the case when the delay is estimated). Note that any estimate within this interval is equally valid. Under the assumption of Gaussianity of the errors this would also be the maximum likelihood estimate. Chan (1993) showed that the estimate of the threshold (and delay) converges at a sufficiently fast rate that, conditioning on the least squares/maximum likelihood estimate of the threshold, one could ignore its sampling variability in the asymptotic inference about the other parameters.

The Bayesian approach to estimating the threshold model (under the assumption of Gaussian errors) is similar in terms of the regime coefficients. Continuing with our simple example and assuming flat independent priors on the two intercepts, conditional on the threshold the intercepts would have Normal distributions centered at the least squares estimates α̂_0^{\{i\}}, α̂_1^{\{i\}} with variances 1/T_0^{\{i\}}, 1/T_1^{\{i\}} respectively. In order to find the marginal posterior of the threshold, consider the case of a flat prior on the thresholds. Then the joint posterior is proportional to the likelihood function:

$$p(\alpha_0, \alpha_1, r \mid Y^T) \propto \frac{1}{(\sqrt{2\pi})^{T_0^{\{i\}}}} \exp\!\left(-0.5 \sum_{s=1}^{T_0^{\{i\}}} (Y_t^{\{s\}} - \alpha_0)^2\right) \times \frac{1}{(\sqrt{2\pi})^{T_1^{\{i\}}}} \exp\!\left(-0.5 \sum_{s=T_0^{\{i\}}+1}^{T} (Y_t^{\{s\}} - \alpha_1)^2\right).$$

Subtracting and adding the least squares estimates of the intercepts in each squared term, and re-arranging using the orthogonality of least squares estimates, we have:

$$p(\alpha_0, \alpha_1, r \mid Y^T) \propto \exp\!\left(-0.5\left[\sum_{s=1}^{T_0^{\{i\}}} (Y_t^{\{s\}} - \hat{\alpha}_0^{\{i\}})^2 + \sum_{s=T_0^{\{i\}}+1}^{T} (Y_t^{\{s\}} - \hat{\alpha}_1^{\{i\}})^2\right]\right) \times \exp(-0.5[T_0^{\{i\}}(\hat{\alpha}_0^{\{i\}} - \alpha_0)^2 + T_1^{\{i\}}(\hat{\alpha}_1^{\{i\}} - \alpha_1)^2]).$$


Now, integrating out over α_0, α_1 we have:

$$p(r \mid Y^T) \propto \frac{1}{\sqrt{T_0^{\{i\}} T_1^{\{i\}}}} \exp(-0.5[SSE_0^{\{i\}} + SSE_1^{\{i\}}])\,(Y_{t-1}^{\{i\}} - Y_{t-1}^{\{i-1\}}), \tag{5}$$

for i = 2, ..., T, where the intervals (Y_{t-1}^{\{i\}} − Y_{t-1}^{\{i-1\}}) represent the fact that there is no information on the threshold between data points.

The Bayesian modal estimate of the threshold is not likely to be the same as the

classical one (i.e., the same interval), even under the assumption of a flat prior on the thresholds. Although the exponential term will be maximized at the same threshold value as the total sum of squared errors, the lead term involving the inverse of the square roots of the sample sizes in each regime will affect the location of the maximum. Further, the mean and median of the posterior distribution of the threshold are very unlikely to be at the mode unless T is very large and the sum of squared errors term dominates.

Marginal inference about the regime coefficients is very different than in the classical case (see Hansen 1999 for a method to improve the classical approach). In the Bayesian case, inference about the intercepts would be based on weighting the individual normal distributions by the posterior probability of the particular threshold interval, so threshold uncertainty would affect inference about individual regime coefficients.
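Equation (5) is straightforward to evaluate numerically. The sketch below (illustrative Python, hypothetical function name) returns the posterior probability of each threshold interval for the two-intercept model under flat priors:

```python
import numpy as np

def threshold_posterior(y):
    """Posterior probabilities over threshold intervals from (5):
    (T0*T1)^{-1/2} * exp(-0.5*(SSE0 + SSE1)) * interval width,
    normalized over the candidate intervals."""
    pairs = np.column_stack([y[1:], y[:-1]])
    pairs = pairs[np.argsort(pairs[:, 1])]       # sort on Y_{t-1}
    yt, ylag = pairs[:, 0], pairs[:, 1]
    T = len(yt)
    logw, intervals = [], []
    for i in range(1, T):
        t0, t1 = i, T - i
        sse = ((yt[:i] - yt[:i].mean()) ** 2).sum() \
            + ((yt[i:] - yt[i:].mean()) ** 2).sum()
        width = ylag[i] - ylag[i - 1]            # no information inside an interval
        if width <= 0.0:                         # tied lag values: zero prior mass
            logw.append(-np.inf)
        else:
            logw.append(-0.5 * np.log(t0 * t1) - 0.5 * sse + np.log(width))
        intervals.append((ylag[i - 1], ylag[i]))
    logw = np.array(logw)
    w = np.exp(logw - logw.max())                # stabilized exponentiation
    return intervals, w / w.sum()
```

Because the exponential term dominates, the posterior concentrates on the interval that minimizes the total sum of squared errors, with the width and sample-size terms shifting mass as discussed above.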

2.3. Smooth transition autoregressions

For many, the abrupt regime changes in the threshold model are unrealistic. Consider the case where one is forecasting US GDP and the initial release is slightly below the threshold but the subsequent revision is above the threshold. A threshold model would imply large changes in the forecast of the future for this small change in initial conditions. Further, the difficulties of the non-standard likelihood/least squares functions are a distraction.

As originally suggested by Chan and Tong (1986), and subsequently developed by Timo Teräsvirta and his various co-authors (see, for example, his 1993 monograph with Clive Granger), if one introduces smooth transitions between regimes, standard nonlinear estimation techniques can be used. Since smooth transition models have a more traditional structure, Teräsvirta has been able to implement a model specification, estimation and diagnostic cycle very similar to the Box and Jenkins approach (see his 1994 JASA paper). The models were successfully applied to a wide range of industrial production series by Teräsvirta and Anderson (1992).

In the simple threshold model above, imagine changing from the indicator function to a smooth cumulative distribution function:

$$Y_t = \alpha_0 (1 - F(Y_{t-1}; \gamma, r)) + \alpha_1 F(Y_{t-1}; \gamma, r) + V_t,$$

where $F(-\infty; \gamma, r) = 0$ and $F(\infty; \gamma, r) = 1$.


The simplest smooth transition function is of a logistic type:

$$F(Y_{t-1}; \gamma, r) = \frac{1}{1 + \exp(-\gamma(Y_{t-1} - r))},$$

where the parameter γ > 0 determines the abruptness of the transition at r. For example, for very large γ the smooth transition model might effectively be the same as a threshold model for certain values of r, since for the pair of observations of Y_{t-1} on either side of r the transition might be complete. On the other hand, if γ ≈ 0 then the logistic function hardly varies away from 0.5 and there is really only one regime.

Once again, if r and γ were known ex ante, simple least squares methods could be used to estimate the remaining parameters. Thus, one can concentrate the least squares function/likelihood function with respect to r and γ and use standard nonlinear optimizers to estimate these parameters. In practice, because of numerical instability issues it makes some sense to limit the variation in one of these two parameters. One choice, favored by Teräsvirta, is to normalize (Y_{t-1} − r) by the standard deviation of the delay variable. Another is to examine a finite set of thresholds, as in the threshold autoregression case, thus leaving only γ to be directly estimated by the nonlinear optimizer.5
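Concentration is simple to implement: for each (γ, r) pair on a grid, the intercepts are obtained by ordinary least squares and the pair with the smallest sum of squared errors is kept. A sketch (illustrative Python, hypothetical function name; a nonlinear optimizer over γ could replace the inner grid):

```python
import numpy as np

def star_fit(y, gammas, thresholds):
    """Concentrated least squares for
    Y_t = a0*(1 - F) + a1*F + V_t, with F logistic in Y_{t-1}.
    Grid over (gamma, r); intercepts solved by OLS at each grid point."""
    yt, ylag = y[1:], y[:-1]
    best = (np.inf, None, None)
    for g in gammas:
        for r in thresholds:
            F = 1.0 / (1.0 + np.exp(-g * (ylag - r)))
            X = np.column_stack([1.0 - F, F])    # regressors for (a0, a1)
            coef, *_ = np.linalg.lstsq(X, yt, rcond=None)
            sse = float(((yt - X @ coef) ** 2).sum())
            if sse < best[0]:
                best = (sse, (g, r), coef)
    return best
```

On data with an abrupt regime change, the search settles on a large γ and intercepts close to the two regime means, illustrating how the threshold model arises as a limiting case.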

A more general version of the model is as follows:

$$Y_t = \alpha_1 + \phi_1(L) Y_{t-1} + (\alpha_2 + \phi_2(L) Y_{t-1}) F(Y_{t-d}; \gamma, r) + \sigma V_t. \tag{6}$$

One difference from the other two models is that less attention is focused on possible changes in the variance of the innovations across regimes. In addition to the logistic smooth transition function, there has also been considerable attention paid to the possibility of symmetric transitions away from the threshold:

$$F^s(Y_{t-1}; \gamma, r) = 1 - \exp(-\gamma(Y_{t-1} - r)^2).$$

Lubrano (1999) discusses Bayesian estimation of smooth transition autoregressions. As in the threshold autoregression case, one can integrate out the intercepts, autoregressive coefficients and variance. This leaves a 3-dimensional posterior distribution. One dimension is the delay parameter and is discrete; the other two involve the threshold and the speed of transition. One choice is to find their posterior using standard numerical techniques. Another choice is to use a Metropolis-Hastings algorithm.

3. Testing

Perhaps the greatest theoretical progress in the last ten years has been in ourunderstanding of testing for nonlinearity in economic time series. On the otherhand, perhaps the least empirical progress has been made in finding evidence fornonlinearity in economic time series given the new theoretical tools available. Forexample, a range of statistical tests have been applied to aggregate output in theUnited States. At first the results were encouraging but as shown by Hansen


(1992, 1996a), some of the encouragement was of the wishful thinking variety. First, direct tests of Hamilton's Markov switching model suggested that it was not statistically significant. Subsequent searches over different Markov switching specifications (Chib 1995, Hansen 1992) were motivated by this failure, and their success should be qualified. Second, direct tests of threshold models also indicate that the nonlinear terms are not highly statistically significant. Tests for smooth transition models usually do not take into account the effects of searching over different values of the delay parameter. These classical statistical results have also been supported by the Bayesian analysis of Koop and Potter (1999a). Overall there is probably less evidence for nonlinearity in US output at the end of the 1990s than researchers thought at the start of the decade, but still considerable evidence that the behavior of output at business cycle turning points is not well captured by linear models.6

In Hamilton's (1989) paper he was careful to point out that testing the null hypothesis of a linear model against his particular nonlinear model was not standard and appeared to be very difficult. There were two main problems. First, under the null hypothesis the transition parameters of the Markov chain in the alternative hypothesis were not pinned down. Consider the likelihood value from above:

$$f(y_1; \alpha_0, \alpha_1, p, q, s_0 = 1) = \frac{p \exp(-0.5(y_1 - \alpha_1)^2) + (1 - p)\exp(-0.5(y_1 - \alpha_0)^2)}{\sqrt{2\pi}}.$$

In the case that α_1 = α_0, the value of p has no effect on the likelihood function. Second, the likelihood function for the nonlinear model has a local maximum at the parameter values of the linear model with p set at the boundary value of 1, since this implies that all the realizations of the Markov chain will be in state 1, given that this is the initial condition.

The first problem is common to nearly all nonlinear models, with the notable exception of the smooth transition class, where the following reparametrization is available:

$$F^*(Y_{t-d}; \gamma, r) = \frac{1}{1 + \exp(-\gamma(Y_{t-d} - r))} - 0.5.$$

Now testing for nonlinearity can be undertaken by allowing γ to take on all real values under the alternative and fixing the delay at a particular value. The null hypothesis of linearity is captured by the restriction γ = 0. As described in Teräsvirta (1994), a Lagrange Multiplier test can be developed. However, in the general case where the delay is unknown, the problem crops up again.

In the general case, the imposition of the null hypothesis of linearity leaves some of the parameters describing the nonlinear model free. In statistics this is known as the Davies problem. In order to understand its seriousness, consider that under the null hypothesis, for a fixed choice of the free parameters, the likelihood ratio will have a Chi-squared distribution in large samples. But one can


vary these `free' parameters to find the largest and smallest likelihood ratios. Again, under the null hypothesis these are both draws from the same Chi-squared distribution. Obviously, if the minimum value still exceeds the critical value, rejection of the null hypothesis is warranted. Such an outcome is possible in the case where some restrictions are placed on the parameters present only under the alternative: for example, in the threshold case, requiring each regime to contain at least 15% of the data; or in the Markov switching case, ruling out boundary values for the transition probabilities and defining them on a closed subset of the unit interval. One solution is to choose the nuisance parameters at random. Another, more powerful, one is to examine the properties of the maximum across the parameter space, as described in the accompanying article by Bruce Hansen.

All of the previous solutions ignored some information in the behavior of the test statistic as the parameters describing the nonlinear model are varied. In a large enough sample it is safe to do this and concentrate on the most powerful tests involving the largest test statistic. However, in a typical macroeconomic application, where the likelihood function might have many local maxima, such an approach can be dangerous. One obvious solution would be to examine some average of the test statistics. This intuition was formalized by Andrews and Ploberger (1994), who showed that under certain conditions averaged test statistics were the most powerful.

In order to illustrate these ideas I will work with the simple threshold autoregression as the alternative nonlinear model:

Yt = α0 1(Yt−1 < r) + α1 1(Yt−1 ≥ r) + Vt

and the sequence of observed data y0 = −0.2, y1 = 2, y2 = −0.1, y3 = −1.9. The initial observation will be treated as fixed, and for both the linear and nonlinear models the innovation variance is assumed to be 1. For the linear model we have a sample average of 0 and a resulting sum of squared errors of 2² + 0.1² + 1.9² = 7.62.

For the threshold model we shall assume a priori that the threshold value is in

the interval [−1, 1]. Thus we have 3 possible sum of squared error functions:

1. If −1 ≤ r ≤ −0.2, then the sum of squared error function is the same as for the linear model, since all the observations are drawn from the same regime.

2. If −0.2 < r ≤ −0.1, then observations 2 and 3 are drawn from the upper regime and observation 1 is drawn from the lower regime. Thus, α̂0 = 2, α̂1 = (−0.1 + (−1.9))/2 = −1 and the sum of squared errors is 0² + 0.9² + 0.9² = 1.62.

3. If −0.1 < r ≤ 1, then observations 1 and 3 are drawn from the lower regime and observation 2 is drawn from the upper regime. Thus, α̂0 = (2 + (−1.9))/2 = 0.05, α̂1 = −0.1 and the sum of squared errors is 1.95² + 0² + 1.95² = 7.605.

We have three log likelihood ratio statistics: 0, 6, 0.03 (in this example the log likelihood ratio is just the difference in the sum of squared errors). Clearly, an estimate of r in the interval −0.2 < r ≤ −0.1 is the maximum likelihood estimate, and the associated test statistic of 6 is large relative to a Chi-squared distribution
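The regime assignments and sums of squared errors above are easy to verify numerically; a minimal sketch in Python (the variable and function names are mine, not the paper's):

```python
import numpy as np

# Observed data: y0 is the fixed initial condition, y1..y3 are modelled.
y = np.array([-0.2, 2.0, -0.1, -1.9])
targets, lags = y[1:], y[:-1]          # y_t and its threshold variable y_{t-1}

# Linear model: one common mean for all three observations.
sse_linear = np.sum((targets - targets.mean()) ** 2)   # 2^2 + 0.1^2 + 1.9^2 = 7.62

def sse_threshold(r):
    """Sum of squared errors of the two-regime intercept model at threshold r."""
    lower = lags < r
    sse = 0.0
    for mask in (lower, ~lower):
        if mask.any():
            sse += np.sum((targets[mask] - targets[mask].mean()) ** 2)
    return sse

# One representative threshold from each of the three intervals of [-1, 1].
for r in (-0.5, -0.15, 0.5):
    print(r, round(sse_threshold(r), 3))
```

Running this reproduces the three values 7.62, 1.62 and 7.605, with the minimum reached on the middle interval, as in the text.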


with one degree of freedom that one would use to measure statistical significance in large samples. On the other hand, the minimum statistic of 0 is obviously not significant compared to a Chi-squared distribution. If we average the test statistics against a uniform distribution for the threshold in the given interval, the value is 0.4×0 + 0.05×6 + 0.55×0.03 = 0.317. This is also the expected value of a randomized test, but such a test would have considerable variability depending on which interval was chosen.

Hansen (1996a) provides a general method of calculating sampling distributions

under the null hypothesis for such operations on the family of test statistics. The procedure in this case works as follows (note that in this special case the procedure gives an exact small sample result; this is not true in general). Generate 3 standard normal random variables and without loss of generality assume they are ordered: V1 ≤ V2 < V3. The likelihood ratio statistic obtained by plugging in the maximum likelihood estimate of the threshold is equivalent to the following minimization problem:

(4/9)(V1² + V2² + V3²) − (1/4) min{V1² + V2², V2² + V3²}.

The 95th percentile of this distribution is approximately 2.9; thus on this measure there is significant evidence of the threshold effect (the p-value for the observed statistic is 0.1%). This gives significance at better than the 1% level. However, if we consider the average likelihood ratio statistic, using the observed frequency of likelihood ratios different from zero:

0.6[(4/9)(V1² + V2² + V3²) − (1/4)(V1² + V2²) − (1/4)(V2² + V3²)];

the 95th percentile is approximately 1.4. Thus, on this measure the threshold effect is not statistically significant.

An alternative to the classical approach is to use Bayes factors to compare the

linear and nonlinear models (see Koop and Potter, 1999a). The Bayes factor in this case is the ratio of the average value (over the prior distribution of the parameters) of the likelihood function for the nonlinear model to the average value (over the prior distribution of the parameters) of the likelihood function for the linear model. In order for the Bayes factor to be useful, informative priors on the parameters are required. In this simple case we shall assume that the prior on the mean for the linear model is uniform over [−m, m] and that the prior for the nonlinear model is uniform and independent for both intercepts over the same interval of length 2m. Then, using our previous result on the posterior for the threshold and the numbers for the differences in sums of squared errors from above, we have:

(√(2π)/(2m)) [0.4×1 + 0.05 (√3/√2) exp(3) + 0.55 (√3/√2) exp(0.015)] = (1.2533/m) × 2.314, if m ≥ 2.
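The arithmetic can be checked directly; a sketch using the weights and exponents from the display above:

```python
import math

c = math.sqrt(3) / math.sqrt(2)   # constant from integrating out the two intercepts
bracket = 0.4 * 1 + 0.05 * c * math.exp(3) + 0.55 * c * math.exp(0.015)

def bayes_factor(m):
    """Posterior odds for the nonlinear model (equal prior model probabilities)."""
    return math.sqrt(2 * math.pi) / (2 * m) * bracket

print(round(bracket, 3))                                   # 2.314
print(round(bayes_factor(2), 2))                           # 1.45, the best case m = 2
print(round(bayes_factor(2) / (1 + bayes_factor(2)), 2))   # 0.59 posterior weight
```

These match the figures quoted in the text: odds of about 1.45 at m = 2, i.e. roughly 60% posterior weight on the nonlinear model.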


Here, to keep things simple, I have assumed that the prior is sufficiently flat that boundary value problems do not occur. The interpretation of the Bayes factor is different from that of classical statistical tests. In this case it gives (assuming the linear and nonlinear models were equally likely ex ante) the posterior odds in favor of the nonlinear model. Notice that if 2 ≤ m < 1.253×2.314 = 2.899 they are favorable. For an ignorant prior (m → ∞) the Bayes factor would always favor the simpler linear model. The highest posterior odds for the nonlinear model are obtained at m = 2 and are equal to about 1.45; that is, about 60% of the posterior weight is placed on the nonlinear model and about 40% on the linear model.

Obviously the main drawback of the Bayesian approach is its sensitivity to the

prior distributions on the parameters. For example, if the prior on the threshold had been uniform on [−10, 10], then the Bayes factor would be:

(√(2π)/(2m)) [(9.8/20)×1 + (0.1/20) (√3/√2) exp(3) + (10.1/20) (√3/√2) exp(0.015)] = (1.253/m) × 1.241, if m ≥ 2,

and at best (m = 2) the posterior odds in favor of nonlinearity would be 44%. The benefit of this drawback is that the Bayes factor imposes a very big penalty on more complicated models, something that classical tests are unable to do. For example, in the case where both the threshold and the intercepts are assumed to be uniformly distributed over a wide interval, we are placing most of the ex ante weight on very strong forms of nonlinearity. Suppose m = 10 and r ~ U[−10, 10]; then most threshold models produced would have bimodal distributions with modes very far apart, and values like y2 = −0.1 are improbable.

Bayesian methods have a strong advantage when it comes to finding tests for

Markov switching models. Hansen (1992, 1996b) provides a bounds test that deals with both the Davies' problem and the zero scores problem. However, this is a very computationally intensive test, and it only provides a bound on the size. Garcia (1998) adopts the approach of Andrews and Ploberger (1994) and Hansen (1996a) by ignoring the zero scores problem. His simulation results suggest that zero scores might not be a problem in practice. In contrast, Bayesian testing of Markov switching models is simple and direct. In order to illustrate, consider the simple Markov switching model:

Yt = α1 St + Vt.

Here linearity is given by the restriction that α1 = 0. As discussed in Koop and Potter (1999a), the Bayes factor in this Markov switching case can be written as the ratio of the posterior density to the prior density for α1 evaluated at zero, i.e.,

p(α1 = 0 | YT) / b(α1 = 0),

where b(·) denotes the prior density.


Once again suppose that α1 has a uniform prior on the interval [−m, m]; using our previous results we have a conditional Bayes factor of:

p(α1 = 0 | YT, {s_t^i}, p^i, q^i) / b(α1 = 0) = √(T1^i) (2m/√(2π)) exp[−0.5 T1^i (α̂1^i)²].

The overall Bayes factor is found by replacing the conditional posterior density with its average across posterior draws:

p(α1 = 0 | YT) ≈ (1/I) Σ_{i=1}^{I} √(T1^i/(2π)) exp[−0.5 T1^i (α̂1^i)²],

which requires minimal changes to existing computer code.

Once again, if the initial ignorance about α1 is high (large values of m) there is

less chance of finding evidence of nonlinearity. Alternatively, if the priors on p and q lead to most runs of {s_t^i} consisting of zeros, there is also little chance of finding nonlinearity. Chib (1995) discusses a slightly more computationally intensive method of constructing the average likelihood for the Markov switching model that is easier to implement when all the parameters of the model switch with the Markov variable, as in (2).
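In code the overall Bayes factor really is a minimal change: average the conditional density across stored Gibbs draws and divide by the prior density. A sketch with synthetic draws (the regime counts T1 and conditional means alpha1_hat are invented for illustration; in practice they come out of the sampler):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for stored Gibbs output: for each draw i, the number of periods
# with s_t = 1 (T1) and the conditional posterior mean of alpha_1 (alpha1_hat).
I = 1000
T1 = rng.integers(20, 60, size=I).astype(float)
alpha1_hat = rng.normal(0.8, 0.1, size=I)

def bayes_factor_linear(m):
    """Savage-Dickey ratio: average conditional posterior density of alpha_1
    at zero, divided by the uniform prior density on [-m, m]."""
    post_density = np.mean(np.sqrt(T1 / (2 * np.pi))
                           * np.exp(-0.5 * T1 * alpha1_hat ** 2))
    prior_density = 1.0 / (2 * m)
    return float(post_density / prior_density)

print(bayes_factor_linear(2), bayes_factor_linear(10))
```

With these invented draws (α̂1 well away from zero) the factor is far below one, so linearity is rejected; note that a more diffuse prior (larger m) mechanically scales the factor up toward the linear model, exactly the sensitivity discussed in the text.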

4. Constructing conditional expectations

Once a final nonlinear model is arrived at, there remains the issue of understanding the estimated dynamics and forecasting capabilities. Since the primary objective of nonlinear modelling is to obtain the true conditional expectation function, there is still a substantial task remaining.

The task is easiest for the Markov switching models. As an illustration consider

Hamilton's original model in the form given in (1). We introduce the following notation: P represents the transition matrix of the 32 state Markov chain, bt is a vector representing the filter probabilities for each of the 32 individual states, and s* is a vector containing the 32 possible values of (1 − φ(L))Zt.

Et[Yt+h] = Et[Z*t+h] + Et[φ(L)Yt+h−1].

The second term on the RHS can be evaluated using standard linear recursions. The first term on the RHS requires more care. Assume for the moment that S*t was in the information set at time t. Then one could find Et[S*t+h] using the estimated probability transition matrix as follows:

E[S*t+h | S*t = sj] = s*′(P^h)′ej,

where ej is a vector of zeros except for the jth row, which contains a 1. Of course, the state of the Markov chain is not known at time t, but one can replace ej with the filter probabilities over the states of the Markov chain at time t:

Et[S*t+h] = s*′(P^h)′bt.
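For a chain with a small number of states the forecast is one line of linear algebra. A sketch with an illustrative two-state chain (the numbers are mine, not Hamilton's estimates); with a row-stochastic P, the expression s*′(P^h)′bt becomes bt′ P^h s*:

```python
import numpy as np

# Row-stochastic transition matrix: P[i, j] = Pr(next state = j | current state = i).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
s_star = np.array([1.0, -1.0])   # value associated with each state
b_t = np.array([0.7, 0.3])       # filtered probabilities of the current state

def state_forecast(h):
    """E_t of the state value h periods ahead: b_t' P^h s*."""
    return float(b_t @ np.linalg.matrix_power(P, h) @ s_star)

print(state_forecast(1), state_forecast(20))
```

As h grows the forecast decays toward the unconditional mean of the chain (here 1/3, from the stationary distribution (2/3, 1/3)).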


For the TAR and STAR models obtaining the conditional expectation nearly always requires the use of simulation once the forecast horizon exceeds the length of the delay lag. If the horizon is less than the delay lag, then the conditional expectation is given by iterating on the nonlinear difference equation. For example, in the simple model:

Yt = −1(Yt−4 < 0) + 1(Yt−4 ≥ 0) + Vt,

the conditional expectation function up to horizon 4 is given by

E[Yt+h | Yt] = −1(Yt+h−4 < 0) + 1(Yt+h−4 ≥ 0), if h ≤ 4.

Suppose the conditional expectation of Yt+1 at time t was 1 (i.e., Yt−3 > 0). It would be incorrect to use this value in the indicator function to forecast Yt+5 = 1: since Vt+1 is standard normal, there is a non-zero probability that Yt+1 < 0. Continuing with this example, there is approximately a 16% probability that Yt+1 < 0. Thus,

E[Yt+5 | Yt−3 > 0] = −P[Yt+1 < 0 | Yt−3 > 0] + P[Yt+1 ≥ 0 | Yt−3 > 0] = −0.16 + 0.84 = 0.68.

Similar results would be available up to forecast horizon 8, where things become more complicated:

E[Yt+9 | Yt−3 > 0] = −P[Yt+5 < 0 | Yt−3 > 0] + P[Yt+5 ≥ 0 | Yt−3 > 0].

In order to evaluate the probabilities on the RHS one can iterate forward the distribution for Yt+5, using the fact that Yt+5 is drawn from a N(1, 1) with probability 0.84 and from a N(−1, 1) with probability 0.16.

Notice that the forecasts after horizon 4 were crucially dependent on the assumption that the innovations were Gaussian and on the size of their variance, unlike the linear case. Further, the calculations were greatly simplified by the fact that there were no autoregressive lags. Consider another simple model:

Yt = 0.5Yt−1 1(Yt−1 ≥ 0) + Vt.

We have E[Yt+1 | Yt] = 0.5Yt 1(Yt ≥ 0) and at horizon 2

E[Yt+2 | Yt] = 0.5E[Yt+1 1(Yt+1 ≥ 0) | Yt],

which can be evaluated using the fact that Yt+1 1(Yt+1 ≥ 0) is a (conditionally) truncated normal with parameters 0.5Yt 1(Yt ≥ 0) and 1, to obtain

E[Yt+2 | Yt] = 0.25Yt 1(Yt ≥ 0) + 0.5 [φ(−0.5Yt 1(Yt ≥ 0)) / (1 − Φ(−0.5Yt 1(Yt ≥ 0)))],

where φ(z) and Φ(z) are the standard normal density and cumulative distribution functions respectively. Notice that for Yt ≫ 0 the forecasts are very similar to those of a first order linear autoregression with zero intercept and autoregressive coefficient of


0.5. For smaller values of Yt the forecasts are very different from such a linear autoregression. However,

E[Yt+3 | Yt] = 0.5E[Yt+2 1(Yt+2 ≥ 0) | Yt]

is much less tractable, since Yt+2 does not have a conditional normal distribution.

Instead of directly attempting to calculate this expectation, consider using the time series model to simulate a large number of time series for each particular history. Thus, we know that Yt+1 ~ N(0.5Yt 1(Yt ≥ 0), 1), and this fact can be used to generate K realizations from this distribution. Now take these realizations and generate K realizations of Yt+2 using K draws of Vt+2 and the equation:

y_{t+2}^k = 0.5 y_{t+1}^k 1(y_{t+1}^k ≥ 0) + v_{t+2}^k.

We can then approximate E[Yt+3 | Yt] by:

(1/K) Σ_{k=1}^{K} 0.5 y_{t+2}^k 1(y_{t+2}^k ≥ 0),

which by the Law of Large Numbers will converge to the conditional expectation as K → ∞.

For both the Threshold and Smooth Transition models, dynamic simulation of time series paths allows one to calculate good approximations to the conditional expectation function for a wide range of specifications.
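For the model just discussed the whole simulation is a few lines; a sketch (the history Yt = 2 and the number of replications K are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def conditional_expectation(y_t, horizon, K=200_000):
    """Monte Carlo estimate of E[Y_{t+horizon} | Y_t] for the model
    Y_t = 0.5 Y_{t-1} 1(Y_{t-1} >= 0) + V_t with standard normal V_t."""
    y = np.full(K, float(y_t))
    for _ in range(horizon):
        y = 0.5 * y * (y >= 0) + rng.standard_normal(K)
    return float(y.mean())

for h in (1, 2, 3):
    print(h, round(conditional_expectation(2.0, h), 3))
```

At horizon 1 the estimate should be close to the exact value 0.5 × 2 = 1; horizons 2 and 3, where the indicator interacts with shock uncertainty, are where the simulation earns its keep.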

4.1. Forecasting with known parameters

Given a method to calculate the conditional expectation function and a model with no unknown parameters, the production of forecasts is straightforward. The comparison of these forecasts with those from linear models is less straightforward. First, in out of sample comparisons it is important that the nonlinear feature found in the historical sample is also present in the evaluation period. For example, many of the nonlinear models of U.S. output focus on the dynamics entering and recovering from recessions. The last recession in the United States ended in early 1991; thus it has been difficult to verify the nonlinearities out of sample.

This lack of variation in the out of sample period can be compensated for by

various experiments within the sample. In particular, unlike for linear models, useful information can be generated by considering in-sample multistep ahead prediction. Consider first a linear first order autoregression with zero intercept, estimated first order coefficient φ and residuals {Ut}. We have the following identities:

Yt+1 = φYt + Ut+1;
Yt+2 = φYt+1 + Ut+2 = φ²Yt + φUt+1 + Ut+2; ...;
Yt+H = φYt+H−1 + Ut+H = φ^H Yt + Σ_{h=0}^{H−1} φ^h Ut+H−h.


Thus, for large sample sizes the in-sample one-step ahead forecast variance is σ², the two-step ahead variance is σ²(1 + φ²), and the H-step ahead variance is σ² Σ_{h=0}^{H−1} φ^{2h}.

For nonlinear models such a recursion does not exist. In fact it is not even possible to show that the RMSE of a forecast grows monotonically with the horizon, as in a linear model, for all histories. This cost in terms of computational complexity is a benefit in terms of model evaluation, since simulating the h-step ahead conditional expectation function in sample can provide a diagnostic check on the nonlinear model. By definition, for any forecast horizon, the variance of the forecast error using the best linear predictor has to be greater than or equal to the variance of the forecast error using the true conditional expectation function. Thus, a good diagnostic check on the estimated nonlinear model is whether this holds in the observed sample.

There are two ways to formalize the notion of the best linear predictor. One is to use the linear model used in the testing phase and iterate it as above to obtain multi-step predictions. In this case it might prove useful to adjust for the loss of observations as the steps ahead of the prediction increase. For Markov switching models this is an easy check, since no simulation is required. For the other models it becomes more computationally intensive once the forecast horizon exceeds the delay lag of the threshold variable.

Using the iterated properties of a linear model is only a weak check, since if nonlinearity is present the model will only produce the best linear forecasts one step ahead. A stronger check, in the case where nonlinearity is present but not necessarily of the type estimated, is to estimate different linear models for each prediction horizon. If the true nonlinearity is different from that estimated, then such adaptive linear models should start to outperform the nonlinear model.
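A sketch of this horizon-specific ("direct") linear benchmark against the iterated one, on data simulated from a threshold model (the data generating process and sample size are my choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a stationary threshold autoregression (illustrative parameters).
T = 500
y = np.zeros(T)
for t in range(1, T):
    y[t] = -1.0 * (y[t - 1] < 0) + 0.3 * y[t - 1] + rng.standard_normal()

# Iterated two-step forecast: fit an AR(1) without intercept, square the slope.
x, z = y[:-1], y[1:]
phi = (x @ z) / (x @ x)
iterated_sse = float(np.sum((y[2:] - phi ** 2 * y[:-2]) ** 2))

# Direct two-step forecast: regress y_{t+2} on y_t with its own coefficient.
x2, z2 = y[:-2], y[2:]
beta = (x2 @ z2) / (x2 @ x2)
direct_sse = float(np.sum((z2 - beta * x2) ** 2))

print(iterated_sse, direct_sse, direct_sse <= iterated_sse)
```

In sample the direct projection can never do worse here, since it minimizes exactly this sum of squares over all coefficients on y_t; the informative comparison is whether it also beats the fitted nonlinear model's multi-step forecasts.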

4.2. Forecasting with parameter uncertainty

It is typical in linear time series forecasting applications to provide some measure of the effect of parameter uncertainty on the forecast. There are two main ways of doing this using classical statistical methods. One can use asymptotic approximations for functions of the parameters of the model, or one can simulate random draws from the approximately normal sampling distributions of the parameters and construct the forecasts from the draws. These methods are only directly applicable to the STAR model. Even in this case the situation is far more complicated than the linear one, since we need to calculate the conditional expectation function, as described above, for each set of parameter values.

In the case of the threshold model, the threshold and delay estimates converge at faster rates than the other parameters; hence in a large enough sample they can be ignored. Unfortunately, the adequacy of this large sample approximation in time series of the typical length in macroeconomics is very much open to question. Further, unlike the mild effects that a poor approximation might have in a linear model, the effects in threshold models can be drastic. Consider the case where the threshold variable in the information set for the forecast is close to the estimate of the threshold. By treating the threshold and


delay as known, the forecast will be very different depending on which side of the threshold the observed data falls. However, this is a false precision, since the value of the threshold or delay is not known with certainty.

Dacco and Satchell (1999) find that the regime misclassification introduced can lead to a nonlinear model having inferior forecast performance to a linear model even when the nonlinear model is true and all parameters but the threshold are known. This issue can be illustrated using the simple intercept shift threshold autoregression under the assumption that α0 and α1 are known. In this case the posterior distribution for the threshold is given by:

p(r | YT) ∝ exp(−0.5[SSE_0^i + SSE_1^i])(Y_{t−1}^{(i)} − Y_{t−1}^{(i−1)}).

Denoting the cumulative distribution function of the posterior of r by G, we have for the one step ahead forecast:

ET[YT+1] = α0 + (α1 − α0)G(yT).

Obviously, as the sample size increases G(yT) will start to get closer to an indicator function, but in typical macroeconomic samples it will embody considerable uncertainty about the true location of the threshold.

One implication of the above analysis is that for forecasting, Bayesian estimation of threshold models has a distinct advantage (although it might be possible to adapt some of the results of Hansen (1999) to produce a classical analog). This is also true for the Markov switching models. In this case the difficulty is that the estimate of the current Markov state depends on the whole set of parameter estimates. Any change in the parameter estimates would affect the estimate of the Markov state. One could simulate from the asymptotic distribution of the parameter estimates and then rerun the filter at these parameter values to obtain some feeling for how the estimate of the current state would change. But this is an inconsistent approach: the new filter values would imply different parameter estimates (by definition, only maxima of the likelihood function will not have this problem; see the discussion of the EM algorithm above). The Bayesian solution to the forecasting problem is to use the output of the Gibbs sampler. Recall that for each complete draw from the Gibbs sampler we have draws of the unknown parameter values and of the filter probability of the most recent state. These values can be used to form a forecast. Repeating this exercise across all the draws from the posterior and averaging will produce the conditional expectation allowing for parameter uncertainty.
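The resulting forecast rule is a one-line change from the plug-in forecast. A sketch, with the posterior draws of the threshold stubbed in by a normal sample (α0, α1, yT and the posterior shape are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha0, alpha1 = -1.0, 1.0
y_T = 0.05                                     # threshold variable near the threshold
r_draws = rng.normal(0.0, 0.2, size=10_000)    # stand-in for posterior draws of r

# G(y_T): posterior probability that the threshold lies at or below y_T.
G = float(np.mean(r_draws <= y_T))

bayes_forecast = alpha0 + (alpha1 - alpha0) * G
plugin_forecast = alpha0 + (alpha1 - alpha0) * float(0.0 <= y_T)  # r treated as known

print(round(bayes_forecast, 3), plugin_forecast)
```

With the threshold variable this close to the threshold, the posterior-averaged forecast sits near zero while the plug-in forecast commits fully to the upper regime; this is exactly the false precision described above.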

4.3. Impulse response functions

Once one is satisfied with the nonlinear model, there is the remaining question of describing how its dynamics differ from those of linear models fit to the same time series. Since most economists describe the dynamics of linear models using impulse response functions, it is important to generalize impulse response functions to nonlinear time series. As described in the introduction, the linear dynamics of a time series are given by the Wold Representation. The coefficients


in the Wold Representation can be thought of as producing the same answer to the following four questions:

1. What is the response to a unit impulse today when all future shocks are set to zero?

2. What is the response to a unit impulse today when all future shocks are integrated out?

3. What is the derivative of the predictor of the future?

4. How does the forecast of the future change between today and yesterday?

For nonlinear models there will be different answers to each question. To illustrate, consider the simple threshold model:

Yt = −1(Yt−1 < 0) + 1(Yt−1 ≥ 0) + Vt

and the case of a positive unit impulse for the first two questions.

1. If Yt ≥ 0 then the response is 0 for all horizons; if −1 ≤ Yt < 0 then the response is 2 for all horizons, since the impulse permanently moves the time series into the upper regime given the assumption of no future shocks; if Yt < −1 then the response is 0 for all horizons.

2. At horizon 1 the response is the same as the answer to question 1, but as the horizon increases we have the difference:

E[Yt+h | Yt = yt + 1] − E[Yt+h | Yt = yt],

which must converge to zero by the stationarity of the underlying time series.

3. The derivative is either 0, or is not defined if Yt = 0.

4. Consider the case where Et−1[Yt] = 1 and the realized value of Yt is 1. Then the initial shock is 0 but

Et[Yt+1] − Et−1[Yt+1] = 1 − 0.68 = 0.32.

Or the case where Et−1[Yt] = 1 and the realized value of Yt is 5. Then the initial shock is 4 but Et[Yt+1] − Et−1[Yt+1] is still equal to 0.32.
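The 0.68 and 0.32 used in these examples come straight from the standard normal CDF; a quick check:

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Given the information at t-1, Y_t ~ N(1, 1), so P(Y_t >= 0) = Phi(1).
p_up = Phi(1.0)                      # about 0.84
two_step = -(1.0 - p_up) + p_up      # E_{t-1}[Y_{t+1}], about 0.68
revision = 1.0 - two_step            # E_t[Y_{t+1}] - E_{t-1}[Y_{t+1}], about 0.32

print(round(two_step, 2), round(revision, 2))
```

The revision is the same whether the realized Yt is 1 or 5, which is the point of the example: the size of the initial shock does not map one-for-one into the forecast revision.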

It should be immediately apparent that questions 1 and 3 are not particularly useful questions to ask of a nonlinear time series. This leaves a choice between the more traditional definition of the impulse response function, defined by the answer to question 2, and the forecast revision function, defined by the answer to question 4. In order to choose between the two possibilities, observe that both the initial condition and the magnitude and sign of the impulse are important in describing the dynamics of nonlinear models. This is problematic, since one can choose values of the initial condition or shock that produce atypical responses. In answering question 4 the properties of the impulse are defined directly by the time series model, that is:

Et[Yt] − Et−1[Yt],


or Vt in our examples. Further, in order to define the initial conditions one can use the history of the time series or random draws from its simulated distribution. In answering question 2 there is no direct way of defining the relevant set of perturbations away from the initial condition using the properties of the time series model. Note that using the time series innovation to define the perturbation is not correct, since this innovation represents the unforecastable change between time t − 1 and t.

Koop, Pesaran and Potter (1996) and Potter (1999) call the forecast revision function a generalized impulse response function and develop its properties. Of particular importance is the fact that the generalized impulse response function is a random variable on the same probability space as the time series. Thus, in order to measure the size of a response at a particular horizon one needs to measure the size of a random variable. In the cases where the response averages to zero (the standard one for the forecast revision function) one can use the concept of second order stochastic dominance. Some other experiments lead to responses with non-zero means, where size can be measured by the mean. Perhaps the most interesting nonlinear feature found from impulse response functions has been a lower persistence of negative shocks in recessions than in expansions (see Beaudry and Koop, 1993).

In the linear time series literature there is a considerable body of work on inference for impulse response functions. As the previous discussion of forecasting under parameter uncertainty would indicate, such inference is more difficult for nonlinear time series models. Koop (1996) develops a Bayesian approach to the analysis of generalized impulse response functions that allows directly for parameter uncertainty. Koop and Potter (1999b) combine this analysis of impulse response functions under parameter uncertainty with Bayesian measures of model uncertainty (see the discussion at the end of Section 3) to present impulse response functions that are less reliant on particular model specifications.

5. Concluding remarks

Three basic nonlinear time series models have been reviewed, and both classical and Bayesian approaches to estimation and inference have been described. The focus has been on univariate time series models. The three basic models do generalize to the multiple time series case, but some of the difficulties concerning inference highlighted in this article are compounded in higher dimensions. Given the difficulties discussed in interpreting test statistics for linearity vs. nonlinearity, it seems important to shift the focus to differences in forecasting and dynamics that allow for both parameter and model uncertainty.

Acknowledgements

I would like to thank Gary Koop for numerous helpful discussions. The views expressed in this paper are those of the author and do not necessarily reflect the views of the Federal Reserve Bank of New York or the Federal Reserve System.


Notes

1. There is also a vast literature in finance. While many of the developments and problems are similar, the higher frequency of observation of financial time series has allowed a much greater emphasis on flexible non-parametric methods. An excellent example of such methods is in Gallant, Rossi and Tauchen (1993).

2. The classic example is Brock and Chamberlain's 1984 working paper which, like Granger's paper, has a title that gives the result. In the late 1980s nonlinear modelling was strongly associated with the study of chaotic systems. Such systems are less amenable to statistical techniques than the nonlinear time series models considered here.

3. Links to software for all three types of estimation can be found at http://weber.ucsd.edu/~jhamilto/sofware.htm#Markov

4. Software for classical estimation and inference of threshold autoregressions can be obtained from www.ssc.wisc.edu/~bhansen. Software for Bayesian estimation and inference can be obtained from http://emlab.Berkeley.EDU/Software/abstracts/potter0898.html. These latter algorithms make heavy use of recursive least squares algorithms that can produce big speed ups in estimation time.

5. Smooth transition autoregressions can be estimated using standard econometric packages with nonlinear estimation options. In addition there will be a website available soon with Gauss software implementing the full approach taken by Teräsvirta.

6. For a review of earlier testing approaches see Brock and Potter (1993).

References

Albert, J. and Chib, S. (1993) Bayesian inference via Gibbs sampling of autoregressive time series subject to Markov mean and variance shifts. Journal of Business and Economic Statistics, 11, 1–15.

Andrews, D. and Ploberger, W. (1994) Optimal tests when a nuisance parameter is present only under the alternative. Econometrica, 62, 1383–1414.

Beaudry, P. and Koop, G. (1993) Do recessions permanently change output? Journal of Monetary Economics, 31, 149–163.

Brock, W. and Chamberlain, G. (1984) Spectral analysis cannot tell a macroeconometrician whether his time series came from a stochastic economy or a deterministic economy. SSRI WP 8419, University of Wisconsin, Madison.

Brock, W. and Potter, S. (1993) Nonlinear time series and macroeconometrics. In G. S. Maddala et al. (eds), Handbook of Statistics, Amsterdam: North Holland.

Chan, K. S. (1993) Consistency and limiting distribution of the least squares estimator of a threshold autoregression. Annals of Statistics, 21, 520–5.

Chan, K. S. and Tong, H. (1986) On estimating thresholds in autoregressive models. Journal of Time Series Analysis, 7, 178–190.

Chib, S. (1995) Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90, 1313–1321.

Dacco, R. and Satchell, S. (1999) Why do regime-switching models forecast so badly? Journal of Forecasting, 18, 1–16.

Filardo, A. (1994) Business cycle phases and their transitional dynamics. Journal of Business and Economic Statistics, 12, 299–308.

Gallant, A. R., Rossi, P. E. and Tauchen, G. (1993) Nonlinear dynamic structures. Econometrica, 61, 871–908.


Garcia, R. (1998) Asymptotic null distribution of the likelihood ratio test in Markov switching models. International Economic Review, 39, 763–788.

Granger, C. W. J. (1983) Forecasting white noise. In A. Zellner (ed.), Applied Time Series Analysis of Economic Data, Proceedings of the Conference on Applied Time Series Analysis of Economic Data, U.S. Government Printing Office.

Granger, C. W. J. and Teräsvirta, T. (1993) Modelling Nonlinear Economic Relationships. Oxford: Oxford University Press.

Hamilton, J. (1989) A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57, 357–384.

Hamilton, J. (1990) Analysis of time series subject to changes in regime. Journal of Econometrics, 45, 39–70.

Hansen, B. (1992) The likelihood ratio test under non-standard conditions: testing the Markov switching model. Journal of Applied Econometrics, 7, S61–S82.

Hansen, B. (1996a) Inference when a nuisance parameter is not identified under the null hypothesis. Econometrica, 64, 413–430.

Hansen, B. (1996b) Erratum. Journal of Applied Econometrics, 11, 195–198.

Hansen, B. (1999) Sample splitting and threshold estimation. Econometrica, forthcoming.

Judge, G., Griffiths, W., Hill, R. C. and Lee, T.-C. (1985) The Theory and Practice of Econometrics, second edition. New York: John Wiley and Sons.

Kim, C.-J. and Nelson, C. (1999) State-Space Models with Regime Switching: Classical and Gibbs-Sampling Approaches with Applications. Cambridge: MIT Press.

Koop, G. (1996) Parameter uncertainty and impulse response analysis. Journal of Econometrics, 72, 135–149.

Koop, G., Pesaran, M. H. and Potter, S. M. (1996) Impulse response analysis in nonlinear multivariate models. Journal of Econometrics, 74, 119–148.

Koop, G. and Potter, S. M. (1999a) Bayes factors and nonlinearity: evidence from economic time series. Journal of Econometrics, forthcoming.

Koop, G. and Potter, S. M. (1999b) Asymmetries in US unemployment. Journal of Business and Economic Statistics, forthcoming.

Lubrano, M. (1999) Bayesian analysis of nonlinear time series models with a threshold. To appear in W. Barnett et al. (eds), Nonlinear Econometric Modelling, Cambridge: Cambridge University Press.

Pesaran, M. H. and Potter, S. (1997) A floor and ceiling model of US output. Journal of Economic Dynamics and Control, 21, 661–695.

Poirier, D. (1995) Intermediate Statistics and Econometrics, Cambridge: The MIT Press.

Potter, S. (1995) A nonlinear approach to US GNP. Journal of Applied Econometrics, 10, 109–125.

Potter, S. (1999) Nonlinear impulse response functions. Journal of Economic Dynamics and Control, forthcoming.

Priestley, M. B. (1988) Non-Linear and Non-Stationary Time Series, New York: Academic Press.

Teräsvirta, T. and Anderson, H. (1992) Characterising nonlinearities in business cycles using smooth transition autoregressive models. Journal of Applied Econometrics, 7, S119–S136.

Teräsvirta, T. (1994) Specification, estimation and evaluation of smooth transition autoregressive models. Journal of the American Statistical Association, 89, 208–218.

Tong, H. (1983) Threshold models in non-linear time series analysis. Lecture Notes in Statistics, No. 21, Heidelberg: Springer.

Tong, H. (1990) Non-linear Time Series: A Dynamical System Approach. Oxford: Clarendon Press.
