

Graphical Models in Time Series Analysis

Michael Eichler

INAUGURAL-DISSERTATION
zur Erlangung der Doktorwürde
der Naturwissenschaftlich-Mathematischen Gesamtfakultät
der Ruprecht-Karls-Universität Heidelberg

1999


Contents

1 Introduction 1

1.1 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Graphical models for time series 4

2.1 Conditional correlation graphs . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2 Causality graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2.2 Markov properties . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.3 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3 Nonparametric analysis 20

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.2 Testing for interrelation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.2.1 Asymptotic null distribution . . . . . . . . . . . . . . . . . . . . . . 24

3.2.2 Asymptotic local power of the test . . . . . . . . . . . . . . . . . . 37

3.2.3 Finite sample performance . . . . . . . . . . . . . . . . . . . . . . . 40

3.3 Time domain analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.3.1 Partial correlation functions . . . . . . . . . . . . . . . . . . . . . . 42

3.3.2 Empirical partial spectral processes . . . . . . . . . . . . . . . . . . 45

3.3.3 Interrelation analysis in the time domain: An example . . . . . . . 52

4 Selection of graphical interaction models 57

4.1 Model fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

4.2 Asymptotical efficiency of a model selection . . . . . . . . . . . . . . . . . 63

4.3 Asymptotically efficient model selection . . . . . . . . . . . . . . . . . . . . 71

4.4 Proofs and auxiliary results . . . . . . . . . . . . . . . . . . . . . . . . . . 75


5 Selection of causal graphical models 79

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5.2 Asymptotic properties of the final prediction error . . . . . . . . . . . . . . 84

5.3 Asymptotically efficient model selection . . . . . . . . . . . . . . . . . . . . 88

5.3.1 Asymptotic efficiency of CT(p,G) . . . . . . . . . . . . . . . . . . . 89

5.3.2 Other model selection criteria . . . . . . . . . . . . . . . . . . . . . 93

5.3.3 Model selection with estimated Σ . . . . . . . . . . . . . . . . . . . 98

5.4 Proofs and auxiliary results . . . . . . . . . . . . . . . . . . . . . . . . . . 100

Appendix 110

A Properties of L-Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

B Matrices and Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

References 113


Chapter 1

Introduction

The origins of graphical models can be traced back to the beginning of the century when Gibbs (1902) used local neighbourhood relationships to describe the interactions in large systems of particles. Another source has been in genetics, where Wright (1921, 1934) developed the so-called path analysis for the study of hereditary properties, linking parents and children graphically by arrows. In statistics, Bartlett (1935) studied the notion of interaction in three-way contingency tables and arrived at a description of interaction similar to that in statistical physics. But it has been only within the last decades that the similarities between these different methods and also newer developments have been recognized (e.g. Darroch et al., 1980; Wermuth, 1976), which then has led to the unified theory of graphical models for multivariate data. Since then graphical models have become increasingly popular.

Graphical models allow one to describe and manipulate conditional independence relations between variables in multivariate data. These relations are typically visualized by undirected graphs where the vertices represent the variables and edges between the vertices indicate conditional dependence. In situations where one variable is regarded as a response and the other as an explanatory variable, directed acyclic graphs and chain graphs can be used to incorporate such hypotheses into the statistical model (e.g. Wermuth and Lauritzen, 1990). Recently, directed acyclic graphs have also been used for the interpretation, conjecture, and discovery of causal relations between variables (e.g. Pearl, 1995; Spirtes et al., 1993).

In the analysis of stationary time series, Brillinger (1996) and independently Dahlhaus (1996) have introduced graphical models as a tool for visualizing the interaction structure between components of a multivariate process. Their approach is based on partialized frequency domain statistics like the partial spectral coherence, which are obtained by removing from two fixed component processes the linear effects of all other components. If the partial spectral coherence vanishes at all frequencies, the components are conditionally uncorrelated given all other components. This leads to the undirected conditional correlation graph, where each component of the process is represented by one vertex in the graph. In Dahlhaus et al. (1997), conditional correlation graphs for stochastic point processes are used to detect synaptic connections in neural networks, which also includes


the identification of the direction of connections.

When concerned with data, the true conditional correlation graph of the process must be estimated. Dahlhaus (1996) and Dahlhaus et al. (1997) suggest estimating the partial spectral coherence nonparametrically, e.g. by spectral kernel estimates. The missing edges in the graph are then determined by a series of tests. Alternatively, we could also think of fitting a parametric model to the data. The estimation of the conditional correlation graph then becomes a problem of model selection. Here the best approximating graph is determined by minimizing an appropriate model selection criterion.

While the extension of undirected conditional independence (or correlation) graphs for multivariate data to the time series case is straightforward, an appropriate definition of directed graphs for time series seems to be much harder to obtain. The reason for this is that time should play the key role in the definition of directed edges when concerned with time series. One possible approach has been presented by Lynggaard and Walther (1993), who have used chain graphs for conditional Gaussian distributions to model the dependence structure of a time series. This approach leads to graphs where each vertex represents only one component at a fixed time.

The major problem in the definition of directed graphs is the meaning of the direction. In the case of directed acyclic graphs or chain graphs for multivariate data, the directions of the edges are normally determined by prior information or research hypotheses. This concept can also be applied to time series, but does not take advantage of the fact that the data are measured over time. Instead, we can base our notion of direction on one of the several definitions of causality which have been proposed for time series (Granger, 1969; Sims, 1972; Pierce and Haugh, 1977). Here, the approach of Lynggaard and Walther has the disadvantage that it does not correspond to any of these concepts of causality.

1.1 Outline of the thesis

In this thesis, we will discuss both undirected and directed graphs for time series. First, in Section 2.2, we will introduce a new class of directed graphs, which are based on the concept of Granger-causality. In these causality graphs, each vertex represents one component process, as in conditional correlation graphs. Unlike the approach by Lynggaard and Walther, these graphs do not fit into the usual framework of directed acyclic graphs or chain graphs. In particular, causality graphs allow more than one edge between two vertices. We investigate the properties of these new graphs and show that they are related to conditional correlation graphs.

For the prediction of a process from sampled values it seems natural to use only those variables which have a substantial influence or, in other words, are causal for a component. We therefore consider in Chapter 5 the problem of selecting the best fitting autoregressive model under constraints due to some causality graph, where we measure the approximation by the final prediction error (Akaike, 1969, 1970). This leads to the concept of asymptotically efficient model selection as it has been considered by Shibata (1980), Taniguchi (1980), and several other authors. As in the paper by Shibata, we derive an asymptotic lower bound for the final prediction error and prove the asymptotic efficiency of the AIC.


In Chapters 3 and 4, we discuss the estimation of the conditional correlation graph of a process, starting with the nonparametric approach by Dahlhaus (1996) and Dahlhaus et al. (1997). In Section 3.2 we consider the problem of testing for the presence of an edge in the graph and present a new test based on the integrated partial spectral coherence. In simulations we show that the new test performs better than the one suggested by Dahlhaus et al. (1997).

In the second part of Chapter 3, we briefly discuss the problem of identification of synaptic connections in neural networks. In the analysis of neurophysiological data, correlation analysis is still an important, widely used tool (e.g. Melssen and Epping, 1987). The advantage of such time domain based methods compared with the frequency domain approach is the interpretability of the curves, which yield information about the direction and the type of connections. We propose a new time domain statistic which combines the advantages of time domain analysis and the interrelation analysis suggested by Rosenberg et al. (1989) and Dahlhaus et al. (1997). We prove a functional central limit theorem for the new statistic.

Finally, in Chapter 4, we consider the model selection problem for conditional correlation graphs. Again we restrict ourselves to the case of autoregressive models, which we parametrize in terms of the inverse covariances of the process. For a given graph, this parametrization will lead to simple constraints on the parameters. We show that the Whittle estimate (cf. Whittle, 1953, 1954) is then determined by equations similar to those for the maximum likelihood estimates in Gaussian graphical models. We derive an asymptotic lower bound for the Kullback-Leibler distance and prove the asymptotic efficiency of the corresponding version of the AIC.

The Appendix summarizes a few results about L-functions and matrix norms. We note that throughout this thesis, the standard norm for matrices and vectors will be the operator norm and the Euclidean norm, respectively.

In conclusion I would like to thank all those who generously contributed to this work: my supervisor Professor Rainer Dahlhaus for his suggestions and his support; Wolfgang, Michael, and Martin for many helpful discussions and careful proof-reading; and Karin for her kindness and constant encouragement.


Chapter 2

Graphical models for time series

Graphical models for stationary time series were first discussed by Brillinger (1996) and Dahlhaus (1996) as a tool for visualizing the interaction structure between the components of a multivariate process. It has been shown that undirected graphs for multivariate data can be generalized to the time series case, leading to the concept of conditional correlation graphs, where each component of the process is represented by one vertex in the graph. For directed graphs, no similar theory exists for time series: unlike in the case of multivariate data, time should play a key role in the definition of such directed graphs, and therefore new concepts have to be developed.

This chapter splits into two parts. In the first section, we introduce the basic concepts of conditional correlation graphs for weakly stationary time series. As an example of a graphical model, we consider graphical autoregressive models with given conditional correlation graph.

In the second part of the chapter, we use the concept of Granger-causality to define semi-directed graphs for time series, where arrows between the vertices indicate causation and lines represent conditional contemporaneous correlation. We investigate the properties of these causality graphs and their relation to conditional correlation graphs.

2.1 Conditional correlation graphs

Let L²(Ω, A, P) be the Hilbert space of all square integrable random variables on some probability space (Ω, A, P). We consider countable sets {X_i}_{i∈I}, {Y_j}_{j∈J}, and {Z_k}_{k∈K} of random variables in L²(Ω, A, P). Let sp{X_i, i ∈ I} be the closed linear subspace in L²(Ω, A, P) generated by the X_i. Then, putting M = sp{1, Z_k, k ∈ K}, we say that {X_i} and {Y_j} are conditionally orthogonal given {Z_k} if the projections of the X_i and Y_j onto the orthogonal complement of M are orthogonal. In the following we adapt the notation of Dawid (1979) for conditional independence to denote conditional orthogonality. With this notation we have

{X_i} ⫫ {Y_j} | {Z_k} ⇔ X − P_M X ⊥ Y − P_M Y for all X ∈ sp{X_i, i ∈ I} and Y ∈ sp{Y_j, j ∈ J},


where P_M is the orthogonal projection onto M. Since P_M X is the best linear predictor of X in M, this implies that X_i and Y_j are uncorrelated for all i ∈ I and j ∈ J after removing the linear effects of {Z_k}.
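A small numerical illustration of this definition (a sketch only, assuming numpy; the coefficients are arbitrary): two variables that both load on a common variable Z are strongly correlated, yet their residuals after projection onto M = sp{1, Z} are uncorrelated, i.e. X ⫫ Y | Z.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
Z = rng.standard_normal(n)
X = 2.0 * Z + rng.standard_normal(n)   # X depends on Z plus independent noise
Y = -1.5 * Z + rng.standard_normal(n)  # Y depends on Z plus independent noise

# Project X and Y onto M = sp{1, Z} via least squares and take residuals.
M = np.column_stack([np.ones(n), Z])
resid = lambda W: W - M @ np.linalg.lstsq(M, W, rcond=None)[0]

# X and Y themselves are strongly (negatively) correlated, but the residuals
# are nearly uncorrelated: X and Y are conditionally orthogonal given Z.
r_marginal = np.corrcoef(X, Y)[0, 1]
r_partial = np.corrcoef(resid(X), resid(Y))[0, 1]
print(f"marginal corr: {r_marginal:.2f}, partial corr: {r_partial:.4f}")
```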

The relation X ⫫ Y | Z has the following properties, where f(X) is a linear functional.

(C1) if X ⫫ Y | Z then Y ⫫ X | Z;

(C2) if X ⫫ Y | Z then f(X) ⫫ Y | Z;

(C3) if X ⫫ Y | Z then X ⫫ Y | (Z, f(X));

(C4) if X ⫫ Y | Z and X ⫫ W | (Y, Z) then X ⫫ (W, Y) | Z;

(C5) if X ⫫ Y₁ | (Z, Y₂) and X ⫫ Y₂ | (Z, Y₁) then X ⫫ (Y₁, Y₂) | Z.

Property (C5), however, does not hold universally, but only under additional assumptions. For multivariate processes, such a condition can be formulated in terms of the spectral matrix of the process. Suppose that {X_a(t), a = 1, ..., d} is a weakly stationary process with spectral matrix f(λ). Then (C5) holds if all eigenvalues of f(λ) are positive and bounded, i.e. there exist constants c₁, c₂ with 0 < c₁ ≤ c₂ < ∞ such that f(λ) satisfies the following boundedness condition:

c₁ 1_d ≤ f(λ) ≤ c₂ 1_d for all λ ∈ [−π, π]. (2.1.1)

Here, the matrix inequality A ≤ B means that B − A is non-negative definite.

In the following we consider simple undirected graphs G = (V, E), where V denotes the set of vertices and E ⊆ {(i, j) ∈ V × V | i ≠ j} the set of edges. For simplicity we assume that (i, j) ∈ E if (j, i) ∈ E. We can now define the conditional correlation graph of a weakly stationary multivariate process {X(t), t ∈ Z}.

Definition 2.1.1 (Conditional correlation graph) Let {X(t), t ∈ Z} be a d-vector-valued weakly stationary stochastic process. Then the conditional correlation graph of {X(t)} is the simple undirected graph G = (V, E) with vertices V = {1, ..., d} and edges E such that

(i, j) ∉ E ⇔ {X_i(t)} ⫫ {X_j(t)} | {X_{V∖{i,j}}(t)}

for all i ≠ j ∈ V.

Using (C1) to (C5), we can now derive properties similar to those of conditional independence graphs for multivariate data. In particular, we obtain the following important separation theorem (Dahlhaus, 1996). Consider a fixed graph G = (V, E). For disjoint subsets A, B, and S of V we say that S separates A and B if there exists no sequence of edges (v_{i−1}, v_i) ∈ E, i = 1, ..., k, such that v₀ ∈ A, v_k ∈ B, and v_i ∉ S for all i = 1, ..., k − 1; that is, there is no path from A to B which does not contain at least one element of S.

Proposition 2.1.2 (Separation theorem) Let {X(t), t ∈ Z} be a vector-valued weakly stationary process such that condition (2.1.1) holds. Further let G = (V, E) be the conditional correlation graph of {X(t)} and suppose that A, B, and S are disjoint subsets of V. Then S separates A and B in G if and only if

{X_A(t)} ⫫ {X_B(t)} | {X_S(t)}.


Consider now fixed components X_a(t) and X_b(t), and let C_ab = V ∖ {a, b}. Then the orthogonal projection of X_a(t) onto sp{1, X_{C_ab}(t), t ∈ Z} is obtained by minimizing

E( X_a(t) − μ_a − Σ_{s∈Z} Σ_{c∈C_ab} φ_c(t − s) X_c(s) )².

This leads to the partial error process

ε_{a|C_ab}(t) = X_a(t) − μ*_a − Σ_{s∈Z} Σ_{c∈C_ab} φ*_c(t − s) X_c(s),

where the transfer function φ̂*_{C_ab}(λ) = f_{a C_ab}(λ) f_{C_ab C_ab}(λ)⁻¹, the Fourier transform of the filter φ*_{C_ab} (cf. Brillinger, 1981), and μ*_a = E X_a(t) − φ̂*_{C_ab}(0) E X_{C_ab}(t) are the optimal values. Then, if the processes {X_a(t)} and {X_b(t)} are conditionally orthogonal given all other components {X_{C_ab}(t)}, it follows that

corr( ε_{a|C_ab}(t), ε_{b|C_ab}(t + u) ) = 0

for all t, u ∈ Z. Equivalently, we obtain in the frequency domain

f_{ab|C_ab}(λ) = f_{ε_{a|C_ab} ε_{b|C_ab}}(λ) = 0

for all frequencies λ ∈ [−π, π], where f_{ε_{a|C_ab} ε_{b|C_ab}}(λ) denotes the cross-spectrum of the two partial error processes. We call f_{ab|C_ab}(λ) the partial spectrum of {X_a(t)} and {X_b(t)} given {X_{C_ab}(t)}. It follows from the form of φ̂*_{C_ab} that the partial spectrum is given by

f_{ab|C_ab}(λ) = f_{ab}(λ) − f_{a C_ab}(λ) f_{C_ab C_ab}(λ)⁻¹ f_{C_ab b}(λ). (2.1.2)

For an analysis of the interaction structure, the partial spectrum is typically standardized, which leads to the partial spectral coherence

R_{ab|C_ab}(λ) = f_{ab|C_ab}(λ) / ( f_{aa|C_ab}(λ) f_{bb|C_ab}(λ) )^{1/2}. (2.1.3)

The quantity |R_{ab|C_ab}(λ)|² is bounded between 0 and 1; it is 0 if {X_a(t)} and {X_b(t)} are conditionally orthogonal given {X_{C_ab}(t)}, while the value 1 indicates a perfect linear relation between the partialized variables.

The next proposition, due to Dahlhaus (1996), states an important relation between the partial spectral coherences and the inverse of the spectral matrix. It allows an efficient computation of all frequency domain statistics needed for estimating the conditional correlation graph.

Proposition 2.1.3 Suppose that {X(t), t ∈ Z} is a vector-valued weakly stationary process such that condition (2.1.1) holds. Then if g(λ) denotes the inverse spectral matrix, we have

(i) f_{aa|C_a}(λ) = 1 / g_{aa}(λ),

(ii) R_{ab|C_ab}(λ) = − g_{ab}(λ) / √( g_{aa}(λ) g_{bb}(λ) ),

(iii) f_{aa|C_ab}(λ) = f_{aa|C_a}(λ) / ( 1 − |R_{ab|C_ab}(λ)|² ),

(iv) f_{ab|C_ab}(λ) = R_{ab|C_ab}(λ) √( f_{aa|C_a}(λ) f_{bb|C_b}(λ) ) / ( 1 − |R_{ab|C_ab}(λ)|² ).

Proof. (i) and (ii) have been proved by Dahlhaus (1996). From the inverse variance lemma (e.g. Whittaker, 1990, Proposition 5.7.3) it further follows that

g_{aa}(λ) = f_{bb|C_ab}(λ) / ( f_{aa|C_ab}(λ) f_{bb|C_ab}(λ) − f_{ab|C_ab}(λ) f_{ba|C_ab}(λ) ),

which together with (i) and the definition of R_{ab|C_ab}(λ) completes the proof.

The proposition not only provides an efficient method of computing the partialized frequency domain statistics, but also allows a new characterization of conditional orthogonality and thus of the missing edges of the conditional correlation graph. For this, let R = ( R(u − v) )_{u,v∈Z} with R(u) = E( X(t) X(t + u)′ ) be the infinite-dimensional covariance matrix of {X(t)}. We then denote by R^{(i)} the inverse of the covariance matrix. It then follows that R^{(i)}(u) is the Fourier transform of the inverse spectral matrix (e.g. Shaman, 1975, 1976), that is

R^{(i)}_{ab}(u) = (1/(2π)²) ∫_Π f⁻¹_{ab}(λ) exp(iλu) dλ. (2.1.4)

Proposition 2.1.4 Let {X_V(t), t ∈ Z} be a vector-valued weakly stationary process such that condition (2.1.1) holds. Then the following two statements are equivalent:

(i) X_a(s) ⫫ X_b(s′) | {X_{V∖{a,b}}(t), t ∈ Z} for all s, s′ ∈ Z;

(ii) X_a(s) ⫫ X_b(s′) | {X_V(t), t ∈ Z} ∖ {X_a(s), X_b(s′)} for all s, s′ ∈ Z.

Proof. It follows directly from Proposition 2.1.3 that f_{ab|C_ab}(λ) = 0 if and only if f⁻¹_{ab}(λ) = 0 for all frequencies λ ∈ [−π, π]. Since by (2.1.4) the latter is equivalent to R^{(i)}_{ab}(u) = 0 for all u ∈ Z, the assertion now follows by application of the variance-covariance lemma (e.g. Whittaker, 1990, Chapter 5).

As an example of graphical models for time series, we consider autoregressive processes under the restrictions of a conditional correlation graph. Here, we can use the last proposition to reparametrize the model in order to obtain feasible parameter constraints.

Example 2.1.5 Let {X(t), t ∈ Z} be a d-vector-valued stationary autoregressive process of order p,

X(t) = Σ_{h=1}^p A(h) X(t − h) + ε(t),


where the A(h) are d × d matrices and the ε(t) are independent and identically normally distributed with mean E ε(t) = 0 and covariance matrix E( ε(t) ε(t)′ ) = Σ of full rank d. We further assume that the spectral matrix of {X(t)} exists and satisfies the boundedness condition (2.1.1). Then the inverse spectral matrix f⁻¹(λ) of {X(t)} is (cf. Dahlhaus, 1996)

f⁻¹(λ) = 2π A(e^{iλ})′ Ψ A(e^{−iλ}), (2.1.5)

where Ψ = Σ⁻¹ and A(z) = 1_d − A(1)z − ... − A(p)z^p is the characteristic polynomial of the process.

Now suppose that {X(t)} has conditional correlation graph G = (V, E). Then for all (i, j) ∉ E it follows that

Σ_{k,l=1}^d Σ_{u,v=0}^p Ψ_{kl} A_{ki}(u) A_{lj}(v) exp(iλ(v − u)) = 0.

In the special case where Σ = σ² 1_d, we still have

σ² Σ_{k=1}^d Σ_{u,v=0}^p A_{ki}(u) A_{kj}(v) exp(iλ(v − u)) = 0,

which yields the following 2p + 1 restrictions on the parameters:

Σ_{k=1}^d Σ_{u=0}^p A_{ki}(u) A_{kj}(u + h) = 0 for all h ∈ {−p, ..., p},

where A(0) = −1_d and A(u) = 0 if u < 0 or u > p. It is clear from the above expressions that it is difficult to work with these restrictions, especially if more than one edge is missing from the graph.

We suggest another approach to the problem. It is well known that for autoregressive models of order p the inverse covariances R^{(i)}(u) vanish for |u| > p (e.g. Bhansali, 1980; Battaglia, 1984). Because of the uniqueness of the factorization in (2.1.5) (cf. Masani, 1966), an AR(p) process is also determined by the set of inverse covariances

θ = ( vech(R^{(i)}(0))′, vec(R^{(i)}(1))′, ..., vec(R^{(i)}(p))′ )′,

where the vech operator stacks only the elements contained in the lower triangular submatrix. Then θ again consists of pd² + d(d + 1)/2 parameters.

The restrictions imposed on the parameters by the conditional correlation graph G now have a simple formulation in terms of the inverse covariances R^{(i)}(u):

R^{(i)}_{ij}(u) = R^{(i)}_{ji}(u) = 0 for all (i, j) ∉ E.

Thus, we can parametrize a graphical autoregressive model of order p with graph G by a parameter vector θ ∈ R^{k(p,G)} with k(p,G) = (2p + 1)|E| + (p + 1)d, where |E| denotes the number of edges in E.


Remark 2.1.6 The definition of conditional correlation graphs can also be generalized to the case of time-continuous stochastic processes such as point processes, which we will briefly discuss in Section 3.3. Consider a random (signed) measure μ on R. Then we replace in the definition of conditional orthogonality the space sp{X_i, i ∈ I} by the closed subspace M_μ generated by the set

{ ∫_R φ(t) dμ(t) | φ : R → R is continuous with bounded support }.

Then two random measures μ₁ and μ₂ are conditionally orthogonal given ν₁, ..., ν_d if

X − P_{M_{1,ν₁,...,ν_d}} X ⊥ Y − P_{M_{1,ν₁,...,ν_d}} Y

for all X ∈ M_{μ₁} and Y ∈ M_{μ₂}.

2.2 Causality graphs

In the analysis of multivariate data, graphical models based on directed acyclic graphs or chain graphs have been used to model and detect causal effects between the variables under study (e.g. Wermuth and Lauritzen, 1990; Pearl, 1995). This approach for detecting causation relies on research hypotheses which define an ordering of the variables. Thus it is assumed that we already know the possible directions for causation. Without such prior information it becomes unclear how causation can be inferred from patterns of association.

In the case of stochastic time series, the ordering of the variables in time provides a natural basis for the definition of causality. The most frequently used concept of causality has been introduced by Granger (1969). Here, one process {X(t)} is said to be causal for another process {Y(t)} if the prediction of Y(t) using all relevant information available at time t − 1 apart from {X(t)} can be improved by adding the available information about {X(t)}.

In this section, we introduce a new class of graphs which visualize the causal relationships between the components of a multivariate stationary time series. In these graphs, the vertices are connected by arrows and lines corresponding to the presence of Granger-causality and instantaneous causality, respectively. We investigate the Markov properties of these graphs and show their relation to the conditional correlation graphs as defined in the previous section.

2.2.1 Definition

The original definition of causality by Granger (1969) was formulated in terms of mean squared prediction error. Here, we will adopt a slightly weaker definition in terms of conditional orthogonality in the Hilbert space of square integrable random variables, which is due to Hosoya (1977) and Florens and Mouchart (1985).

Consider two weakly stationary stochastic processes {X(t)} and {Y(t)} on a probability space (Ω, A, P). In this section we denote by X̄(t) = {X(s), s < t} the set of all past values of {X(t)} at time t. Further, we set X̿(t) = {X(s), s ≤ t} and define A(t) as the set of all relevant information accumulated since time t − 1.


Definition 2.2.1 (Linear causality) {Y(t)} is not (linearly) causal for {X(t)} (relative to A(t)) if and only if

X(t) ⫫ Ȳ(t) | Ā(t) ∖ Ȳ(t).

Further, there is no instantaneous (linear) causality between {X(t)} and {Y(t)} (relative to A(t)) if and only if

X(t) ⫫ Y(t) | A̿(t) ∖ {X(t), Y(t)}.

We note that in the literature instantaneous causality has been defined without conditioning on the present of A(t). However, in the framework of graphical models the above definition, where instantaneous noncausality corresponds to conditional noncorrelation of the increments of the process, is more appropriate.

In a multivariate setting, different causality patterns such as direct and indirect causality, feedback, or spurious causality are possible (cf. Hsiao, 1982). These patterns can be visualized by graphs where arrows between vertices correspond to causality and lines correspond to instantaneous causality. We therefore consider mixed graphs (graphs with both directed and undirected edges) G = (V, E_d, E_u), where V is the set of vertices, E_d ⊆ {(u, v) ∈ V × V | u ≠ v} is the set of directed edges, and E_u ⊆ {(u, v) ∈ V × V | u ≠ v} is the set of undirected edges. For simplicity we assume that if (i, j) ∈ E_u then also (j, i) ∈ E_u.

Definition 2.2.2 (Causality graph) Let {X(t), t ∈ Z} be a d-vector-valued weakly stationary stochastic process. Then the (linear) causality graph of {X(t)} (relative to A(t)) is the graph G = (V, E_d, E_u) with vertices V = {1, ..., d} which satisfies for all i ≠ j ∈ V the conditions

(i, j) ∉ E_d ⇔ X_j(t) ⫫ X̄_i(t) | Ā(t) ∖ X̄_i(t) (2.2.1)

and

(i, j) ∉ E_u ⇔ X_j(t) ⫫ X_i(t) | A̿(t) ∖ {X_i(t), X_j(t)}. (2.2.2)

The definition of a causality graph clearly depends on the information set A(t). Here, we will be concerned only with multivariate processes and always use A(t) = X(t). However, it has been noted before (Granger, 1969; Hsiao, 1982) that the omission of confounding variables can lead to spurious causality. If the causal relations between the measured variables and the confounding variables are known, it is possible to determine the set of variables that need to be included in the analysis in order to assess the direct causal relation between two specific variables (cf. Pearl, 1995). In general, when the causality graph is unknown and constructed from the data, such methods are not available and spurious causality cannot be ruled out. We will not discuss this any further.

Example 2.2.3 (AR processes) Let {X(t), t ∈ Z} be a d-vector-valued stationary autoregressive process of order p,

X(t) = Σ_{j=1}^p A(j) X(t − j) + ε(t),


Figure 2.2.1: Causality graph for the autoregressive process in Example 2.2.3.

where the A(j) are d × d matrices and the ε(t) are independent and identically distributed errors with mean zero and covariance matrix Σ. Then it is well known (cf. Tjøstheim, 1981; Hsiao, 1982) that {X_i(t)} does not cause {X_j(t)} if and only if the corresponding components A_{ji}(h) vanish for all h, that is

X_j(t) ⫫ X̄_i(t) | X̄_{V∖{i}}(t) ⇔ A_{ji}(h) = 0 for all h ∈ {1, ..., p}.

Further, {X_i(t)} and {X_j(t)} are not instantaneously causal if and only if the corresponding error components ε_i(t) and ε_j(t) are conditionally orthogonal given ε_{V∖{i,j}}(t). Thus, instantaneous causality can be expressed in terms of the inverse covariance matrix Ψ = Σ⁻¹. We have the characterization

X_i(t) ⫫ X_j(t) | X̄_V(t), X_{V∖{i,j}}(t) ⇔ ε_i(t) ⫫ ε_j(t) | ε_{V∖{i,j}}(t) ⇔ ψ_{ij} = ψ_{ji} = 0.

As an example, we consider the following AR(1) process

X(t) = A X(t − 1) + ε(t)

with parameters

    A = ( a₁₁   0    0    0    0
           0   a₂₂  a₂₃   0    0
          a₃₁  a₃₂  a₃₃  a₃₄   0
           0    0    0   a₄₄  a₄₅
           0    0   a₅₃   0   a₅₅ )

and

    Σ⁻¹ = ( ψ₁₁  ψ₁₂  ψ₁₃   0    0
            ψ₂₁  ψ₂₂  ψ₂₃   0    0
            ψ₃₁  ψ₃₂  ψ₃₃  ψ₃₄   0
             0    0   ψ₄₃  ψ₄₄   0
             0    0    0    0   ψ₅₅ ).

The causality graph for this process is shown in Figure 2.2.1. From this graph we can see, e.g., that {X_1(t)} does not cause {X_5(t)} relative to the full process. However, a more intuitive interpretation of the graph suggests that {X_1(t)} causes {X_5(t)} only via {X_3(t)}, that is

X_5(t) ⊥ X_1(t) | X̄_{{3,5}}(t).
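The characterization above can be checked mechanically: the directed edges of the causality graph are read off the zero pattern of A, the undirected edges off the zero pattern of Ψ = Σ^{−1}. The following sketch does this for the example; the 0/1 patterns merely mirror the zero structure displayed above and carry no numeric meaning.

```python
# Derive the edges of the causality graph of Example 2.2.3 from the
# sparsity patterns of A and Psi = Sigma^{-1} (1 = possibly nonzero).

# pattern of A_{j,i} (row j, column i), vertices 1..5
A_pattern = [
    [1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1],
]
# pattern of psi_{i,j}
Psi_pattern = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]

# directed edge i -> j iff A_{j,i} != 0 for some lag (here: pattern entry), i != j
directed = {(i + 1, j + 1)
            for j in range(5) for i in range(5)
            if i != j and A_pattern[j][i]}
# undirected edge i -- j iff psi_{i,j} != 0, i < j
undirected = {(i + 1, j + 1)
              for i in range(5) for j in range(i + 1, 5)
              if Psi_pattern[i][j]}

print(sorted(directed))    # six directed edges, e.g. 1 -> 3 and 3 -> 5
print(sorted(undirected))  # four undirected edges
```

In particular, 1 → 5 is absent, matching the noncausality statement above.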


In the following section we investigate the properties of causality graphs and show that such an intuitive interpretation is indeed possible within the framework of causality graphs.

2.2.2 Markov properties

In this section, we investigate the properties of causality graphs. We start with a version of the block independence lemma (cf. Whittaker, 1990), which allows us to consider blocks of variables and derive causality relations between these blocks.

Lemma 2.2.4 Let {X(t), t ∈ Z} be a vector-valued weakly stationary process which satisfies condition (2.1.1). Then we have for I_1, I_2, J ⊆ V and I = I_1 ∪ I_2:

(i) X_J(t) ⊥ X̄_{I_r}(t) | X̄_{V\I_r}(t), r = 1, 2 ⇔ X_J(t) ⊥ X̄_I(t) | X̄_{V\I}(t);

(ii) X_{I_r}(t) ⊥ X̄_J(t) | X̄_{V\J}(t), r = 1, 2 ⇔ X_I(t) ⊥ X̄_J(t) | X̄_{V\J}(t);

(iii) X_{I_r}(t) ⊥ X_J(t) | X̄(t), X_{V\(I_r∪J)}(t), r = 1, 2 ⇔ X_I(t) ⊥ X_J(t) | X̄(t), X_{V\(I∪J)}(t).

Proof. For (i) and (iii) this is an immediate consequence of (C5). For (ii) this follows from the linearity of the covariance in each argument.

The next theorem focuses on the relation between causality and conditional orthogonality of the processes as used in the previous section. For this, we first need some more notation from graph theory. Let G = (V, E_d, E_u) be a mixed graph. If two vertices v and u are connected by an edge in E_u, we call v and u neighbours. The set of neighbours of v will be denoted by ne(v). Further, if there exists an edge (v, u) ∈ E_d, then v is a parent of u and u is a child of v. The corresponding sets of children and parents of a vertex v are denoted by ch(v) and pa(v), respectively. Additionally, we define p̄a(v) = pa(v) ∪ {v} and similarly n̄e(v) and c̄h(v).
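A minimal sketch of this notation in code; the concrete edge sets below are a reading of the graph in Figure 2.2.1 and should be treated as an illustrative assumption.

```python
# Mixed-graph helpers for the notation just introduced: ne(v) from the
# undirected edges E_u, pa(v) and ch(v) from the directed edges E_d.

def neighbours(v, Eu):
    """ne(v): vertices joined to v by an undirected edge."""
    return {a if b == v else b for (a, b) in Eu if v in (a, b)}

def parents(v, Ed):
    """pa(v): vertices u with a directed edge (u, v)."""
    return {u for (u, w) in Ed if w == v}

def children(v, Ed):
    """ch(v): vertices w with a directed edge (v, w)."""
    return {w for (u, w) in Ed if u == v}

# illustrative edge sets, read off Figure 2.2.1 / Example 2.2.3
Ed = {(1, 3), (2, 3), (3, 2), (4, 3), (3, 5), (5, 4)}
Eu = {(1, 2), (1, 3), (2, 3), (3, 4)}

print(parents(3, Ed))     # {1, 2, 4}
print(children(3, Ed))    # {2, 5}
print(neighbours(3, Eu))  # {1, 2, 4}
```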

Theorem 2.2.5 Let {X(t), t ∈ Z} be a vector-valued weakly stationary stochastic process with causality graph G = (V, E_d, E_u). Further suppose that X(t) satisfies condition (2.1.1). If for i ≠ j ∈ V the causality graph satisfies the following conditions:

(i) i ∉ ch(j) and j ∉ ch(i),

(ii) i ∉ ne(j),

(iii) ne(i) ∩ ch(j) = ∅ and ne(j) ∩ ch(i) = ∅,

(iv) ne(v) ∩ ch(j) = ∅ ∀ v ∈ ch(i),

then the processes {X_i(t)} and {X_j(t)} satisfy

{X_i(t), t ∈ Z} ⊥ {X_j(t), t ∈ Z} | {X_{V\{i,j}}(t), t ∈ Z},

that is, the processes {X_i(t)} and {X_j(t)} are conditionally orthogonal given all other components {X_{V\{i,j}}(t)}.

Page 17: Graphical Models in Time Series Analysisgalton.uchicago.edu/~eichler/thesis.pdf · also think of tting a parametric model to the data. The estimation of the conditional correlation

2.2. Causality graphs 13

Proof. According to Proposition 2.1.4 it is sufficient to show that for all t, h ∈ Z

X_i(t) ⊥ X_j(t + h) | {X_V(s), s ∈ Z}\{X_i(t), X_j(t + h)}.

We can assume that h ≥ 0 (otherwise we swap the indices i and j). First, consider h > 0. Since by (i) and (iii) the vertices in ne(j) are not children of i, it follows from Lemma 2.2.4 (ii) that

X_{ne(j)}(t + h) ⊥ X_i(t) | X̄_V(t + h)\{X_i(t)},

which implies

X_j(t + h) ⊥ X_i(t) | X̄_V(t + h)\{X_i(t)}, X_{ne(j)}(t + h).

Since we further have by Lemma 2.2.4

X_j(t + h) ⊥ X_{V\ne(j)}(t + h) | X̄_V(t + h), X_{ne(j)}(t + h),

we obtain by (C4)

X_j(t + h) ⊥ X_i(t) | X̄_V(t + h + 1)\{X_i(t), X_j(t + h)}.

In the case h = 0, this follows directly from (ii). Now assume that

X_j(t + h) ⊥ X_i(t) | X̄_V(s)\{X_i(t), X_j(t + h)} (2.2.3)

for some s > t + h. Let K = V\(ch(i) ∪ ch(j)). By condition (iv), X_k(t) and X_l(t) are not instantaneously causal for all k ∈ ch(i) and l ∈ ch(j), and we get by Lemma 2.2.4 (iii)

X_{ch(i)}(s) ⊥ X_{ch(j)}(s) | X̄_V(s), X_K(s). (2.2.4)

Since ch(j) and K ∪ ch(i) are disjoint, we have

X_j(t + h) ⊥ X_{ch(i)∪K}(s) | X̄_V(s)\{X_j(t + h)}

and thus by (2.2.3)

X_j(t + h) ⊥ X_i(t) | X̄_V(s)\{X_i(t), X_j(t + h)}, X_{ch(i)∪K}(s). (2.2.5)

On the other hand, we similarly obtain

X_i(t) ⊥ X_{ch(j)}(s) | X̄_V(s)\{X_i(t)}, X_K(s)

and by (2.2.4)

X_i(t) ⊥ X_{ch(j)}(s) | X̄_V(s)\{X_i(t)}, X_{ch(i)∪K}(s).

Together with (2.2.5) this implies

X_j(t + h) ⊥ X_i(t) | X̄_V(s + 1)\{X_i(t), X_j(t + h)}.

The assertion of the theorem now follows by induction over s.

The theorem motivates the following definition of a moral graph, which differs from the definition in the case of multivariate data.


Definition 2.2.6 Let G = (V, E_d, E_u) be the causality graph of a weakly stationary process X_V(t). Then the moral graph of G is the undirected simple graph G^m = (V, E^m) such that (i, j) ∉ E^m if and only if conditions (i) to (iv) in Theorem 2.2.5 hold.

It now follows immediately from Theorem 2.2.5 that the causality graph of a process X(t) is related to the conditional correlation graph of X(t).

Corollary 2.2.7 Let G = (V, E_d, E_u) be the causality graph of a weakly stationary process X(t) satisfying (2.1.1) and G^m = (V, E^m) the corresponding moral graph. Then if G^i = (V, E^i) is the conditional correlation graph of the process X(t), we have E^i ⊆ E^m.

For a subset A of V we define the set of ancestors an(A) of A as the smallest set B ⊆ V such that A ⊂ B and pa(B) ⊆ B. Further, we denote by G_{an(A)} the causality graph of the subprocess X_{an(A)}(t).
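The fixed-point characterization of an(A) translates directly into a small procedure; the edge set is again the illustrative one read off Figure 2.2.1.

```python
# an(A): smallest superset of A closed under taking parents, computed by
# iterating pa(.) to a fixed point.

def ancestors(A, Ed):
    """Smallest B with A contained in B and pa(B) contained in B."""
    B = set(A)
    changed = True
    while changed:
        new = {u for (u, w) in Ed if w in B} - B
        changed = bool(new)
        B |= new
    return B

# illustrative directed edges of Figure 2.2.1
Ed = {(1, 3), (2, 3), (3, 2), (4, 3), (3, 5), (5, 4)}

print(ancestors({5}, Ed))  # {1, 2, 3, 4, 5}: every vertex feeds into 5
print(ancestors({1}, Ed))  # {1}: vertex 1 has no parents
```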

Definition 2.2.8 (Markov properties) Let {X(t), t ∈ Z} be a vector-valued weakly stationary stochastic process. Further, let G = (V, E_d, E_u) be a mixed graph with directed edges E_d and undirected edges E_u.

(i) X(t) satisfies the causal pairwise Markov property with respect to G if for all (i, j) ∉ E_d

X_j(t) ⊥ X_i(t) | X̄_{V\{i}}(t)

and for all (i, j) ∉ E_u

X_j(t) ⊥ X_i(t) | X̄(t), X_{V\{i,j}}(t).

(ii) X(t) satisfies the causal local Markov property with respect to G if for all i ∈ V

X_i(t) ⊥ X̄_{V\pa(i)}(t) | X̄_{pa(i)}(t)

and

X_i(t) ⊥ X_{V\n̄e(i)}(t) | X̄(t), X_{ne(i)}(t).

(iii) X(t) satisfies the causal global Markov property with respect to G if for all disjoint subsets A, B, S ⊂ V such that S separates A and B in (G_{an(A∪B∪S)})^m, that is, in the moral graph of the ancestral set of A ∪ B ∪ S,

X_B(t) ⊥ X̄_A(t) | X̄_{B∪S}(t)

and

X_B(t) ⊥ X_A(t) | X̄_{A∪B∪S}(t), X_S(t).

The causality graphs have been defined such that the causal pairwise Markov property is satisfied. However, it is the global Markov property which allows the intuitive interpretation of the graphs. The next theorem states that under the boundedness condition on the spectral matrix the causal pairwise Markov property already implies the global one.


Theorem 2.2.9 Let {X(t), t ∈ Z} be a vector-valued weakly stationary stochastic process. Further, let G = (V, E_d, E_u) be the causality graph for X(t). Then if condition (2.1.1) holds, X(t) satisfies the causal global Markov property with respect to G.

Proof. Assume that A, B, and S are disjoint subsets such that S separates A and B in the moral graph G^m_{an(A∪B∪S)}. Let V* = an(A ∪ B ∪ S). By definition of the ancestral set, we have

X_{V*}(t) ⊥ X̄_{V\V*}(t) | X̄_{V*}(t). (2.2.6)

Therefore, we get for all (i, j) ∉ E_d* = {(i, j) ∈ E_d | i, j ∈ V*}

X_j(t) ⊥ X_i(t) | X̄_{V*\{i}}(t).

Further, let E_u* be the set of undirected edges (i, j) with i, j ∈ V* such that i and j are not separated by V*\{i, j} in the undirected graph (V, E_u). It then follows for (i, j) ∉ E_u* from the global Markov property for ordinary graphical models that

X_i(t) ⊥ X_j(t) | X̄(t), X_{V*\{i,j}}(t)

and further by (2.2.6)

X_i(t) ⊥ X_j(t) | X̄_{V*}(t), X_{V*\{i,j}}(t).

Therefore G_{V*} = (V*, E_d*, E_u*) is the causality graph of the subprocess X_{V*}(t). Let E^m be the set of edges in the moral graph G^m_{V*}. Analogously to the proof of Theorem 2.2.5 we now find that if (i, j) ∉ E^m then

X_i(t_1) ⊥ X_j(t_2) | X̄_V(t)\{X_i(t_1), X_j(t_2)}

for any t_1, t_2 ≤ t. Since S separates A and B in E^m, there exists a partition {A, B, S, V_1, V_2} of V* such that

X̄_{B∪V_1}(t) ⊥ X̄_{A∪V_2}(t) | X̄_S(t). (2.2.7)

This implies the second part of the defined global Markov property. Further, there exists a partition {S_1, S_2} of S such that

X̄_{B∪V_1}(t) ⊥ X_{S_1}(t) | X̄_{A∪V_2∪S_2}(t), X_{S_1}(t)

and

X_{A∪V_2}(t) ⊥ X_{S_2∪B∪V_1}(t) | X̄_{B∪V_1∪S}(t),

since otherwise there would exist vertices b ∈ B ∪ V_1 and a ∈ A ∪ V_2 which were married and thus connected in E^m, in contradiction to (2.2.7). The first relation implies

X̄_{B∪V_1}(t) ⊥ X_{A∪V_2}(t) | X̄_S(t), X_{S_2}(t).


Figure 2.2.2: Process with direct and indirect feedback: (a) causality graph and (b) moral graph. [Figure: two panels on the vertex sets A, B, C, S.]

Together with the second relation, we obtain from this

X̄_{B∪V_1}(t), X_{S_2}(t) ⊥ X_{A∪V_2}(t) | X̄_S(t)

and finally

X_B(t) ⊥ X̄_A(t) | X̄_{S∪B}(t).

This completes the proof.

The separation criterion for noncausality is symmetric in the sets A and B. Therefore it can only be used to detect noncausality in both directions, that is, {X_A(t)} does not cause {X_B(t)} relative to X_{A∪B∪S}(t) and vice versa. When concerned with directed acyclic graphs or chain graphs, this criterion is sufficient, as two variables can only be connected in one direction. In the case of causality graphs, we cannot assume such an ordering of the variables (without loss of generality).

As an example, we consider a process X(t) with the causality graph shown in Figure 2.2.2. The graph suggests that {X_A(t)} causes {X_B(t)} or {X_C(t)} only via {X_S(t)}. On the other hand, {X_B(t)} and {X_C(t)} both have a causal effect on {X_A(t)} which is not mediated by {X_S(t)}. Consequently, S does not separate A from B and C in the moral graph.

Moral graphs as defined above basically visualize conditional orthogonality of the component processes for the set of ancestors of the variables under study. For the investigation whether {X_A(t)} causes {X_B(t)}, however, we are more interested in the conditional orthogonality between the present of X_B(t) and the past of X_A(t). The problem can be solved by inserting a new vertex B* into the graph, which represents X_B(t), while all other vertices stand for the past (A for X̄_A(t), etc.). By moralizing this extended graph, we obtain an extended moral graph G^m_{an(A∪B∪B*∪S)}. Figure 2.2.3 shows the extended moral graphs G^m_{an(A∪B∪B*∪S)} and G^m_{an(A∪C∪C*∪S)} for the example in Figure 2.2.2. Since S ∪ B separates B* and A, we find that X_A(t) and X_B(t) are conditionally orthogonal given X̄_{B∪S}(t) and thus that the process {X_A(t)} does not cause {X_B(t)} relative to X_{A∪B∪S}(t). For the vertices A and C we obtain a similar result.

The next theorem shows that the extension of causality graphs just described can indeed be used for the identification of unidirectional noncausality. For simplicity we consider only block graphs, where the target variables, X_{B_1}(t), . . . , X_{B_k}(t) say, have been combined into one vertex B. Otherwise the vertices B*_1 to B*_k need to be connected in line with the rules for obtaining the reduced graph G_{an(A∪B∪C)} applied to the full extended graph with vertices 1, . . . , d, 1*, . . . , d*.


Figure 2.2.3: Extended moral graphs for the causality graph in Fig. 2.2.2: (a) {X_A(t)} does not cause {X_B(t)} relative to X_S(t); (b) {X_A(t)} does not cause {X_C(t)} relative to X_S(t).

Figure 2.2.4: Extended moral graph for the causality graph in Figure 2.2.1. [Figure: undirected graph on the vertices 1, . . . , 5 and 5*.]

Theorem 2.2.10 Let {X(t), t ∈ Z} be a vector-valued weakly stationary stochastic process such that condition (2.1.1) holds. Suppose that A, B, and S are disjoint subsets of V such that S ∪ B separates B* and A in the extended moral graph G^m_{an(A∪B∪B*∪S)}. Then the process {X_A(t)} does not cause {X_B(t)} relative to X_{A∪B∪S}(t), that is

X_B(t) ⊥ X̄_A(t) | X̄_{S∪B}(t).

Proof. Let V* = an(A ∪ B ∪ S). Clearly the moral graph G^m_{an(A∪B∪S)} is a subgraph of the extended moral graph. Therefore, if two vertices i, j ≠ B* are not connected, it follows as in the proof of Theorem 2.2.5 that

X_i(t) ⊥ X_j(t) | X̄_{V*\{i,j}}(t).

Since S ∪ B separates B* from A, there exists a partition {A, B, S, V_1, V_2} of V* such that S ∪ B separates V_1 and A ∪ V_2 and pa(B) ⊆ S ∪ V_1. Then

X̄_{V_1}(t) ⊥ X̄_{A∪V_2}(t) | X̄_{S∪B}(t).

Since S ∪ B now also separates B* from A ∪ V_2, we further get

X_B(t) ⊥ X̄_{A∪V_2}(t) | X̄_{S∪B}(t),

from which the assertion of the theorem follows.

It now follows from this result that the intuitive interpretation of the graph in Figure 2.2.1 was correct. As we can see in Figure 2.2.4, the set {3, 5} separates 1 and 5*, which leads to X_5(t) ⊥ X_1(t) | X̄_{{3,5}}(t).
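The separation statements used throughout this section reduce to reachability in an undirected graph: S separates A and B if no path from A to B avoids S. A breadth-first sketch follows, with a hypothetical edge list standing in for the extended moral graph of Figure 2.2.4 (the exact edges are an assumption for illustration, not taken from the figure).

```python
# Separation check for undirected graphs: BFS from A that never enters S;
# S separates A and B iff the search never reaches B.

from collections import deque

def separates(S, A, B, edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen = set(A) - set(S)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        if u in B:
            return False           # found an S-avoiding path into B
        for v in adj.get(u, set()) - set(S) - seen:
            seen.add(v)
            queue.append(v)
    return True

# hypothetical edges of an extended moral graph with extra vertex "5*"
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (3, 5), (3, "5*"), (5, "5*")]
print(separates({3, 5}, {1}, {"5*"}, edges))  # True: {3, 5} blocks 1 -> 5*
```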


2.2.3 Concluding remarks

In this section we have considered linear causality graphs for weakly stationary processes. The question arises whether it is possible to generalize the definition such that nonlinear causal relationships between the processes can be handled.

Remark 2.2.11 Florens and Mouchart (1982) and Bouissou et al. (1986) have also defined Granger-causality in terms of conditional independence. While properties (C1) to (C4) still hold if we replace conditional orthogonality by conditional independence, stronger assumptions on the process are needed to guarantee also property (C5). Even with such assumptions, Lemma 2.2.4 (ii) additionally requires that noncausality for the single component processes {X_i(t)}, i ∈ I,

X_i(t) ⊥ X̄_J(t) | X̄_{V\J}(t) ∀ i ∈ I,

implies noncausality for the joint vector process {X_I(t)},

X_I(t) ⊥ X̄_J(t) | X̄_{V\J}(t).

We list three examples for which this condition holds.

(i) X(t) is a Gaussian process. Then conditional independence corresponds to conditional orthogonality, for which we have proved Lemma 2.2.4.

(ii) There is no instantaneous causality present between the components of X(t). Then we have trivially

X_{I_r}(t) ⊥ X_{I\I_r}(t) | X̄(t)

for all sets I_1, I_2 ⊆ V and I = I_1 ∪ I_2. Together with the left-hand side in Lemma 2.2.4 (ii), this implies

X_{I_r}(t) ⊥ X̄_J(t) | X̄_{V\J}(t), X_{I\I_r}(t),

from which the block independence follows by (C5).

(iii) X(t) is an autoregressive process of the form

X_i(t) = f_i(X̄(t), ε_i(t)) ∀ i = 1, . . . , K, (2.2.8)

where the f_i are measurable functions monotone in ε_i(t) and ε(t) ⊥ X̄(t). Then if

X_i(t) ⊥ X̄_j(t) | X̄_{V\{j}}(t),

the function f_i(X̄(t), ε_i(t)) is constant in X̄_j(t) almost surely. To show this we write f_{x_j}(y) = f_i(x, y) for any X̄(t) = x to denote that we leave all components of X̄(t) except x̄_j(t) fixed. Since ε_i(t) is independent of the past of X(t), we have

P(X_i(t) ≤ y | X̄(t) = x) = P(f_{x_j}(ε_i(t)) ≤ y | X̄(t) = x) = P(f_{x_j}(ε_i(t)) ≤ y) = F_{ε_i(t)}(f_{x_j}^{-1}(y)),


where F_{ε_i(t)} is the distribution function of ε_i(t). The conditional independence then implies that the left-hand side does not depend on x̄_j. Consequently

f_{x_j}^{-1}(y) = f_{x'_j}^{-1}(y) P-a.s.

for all x̄_j, x̄'_j, and therefore f_{x_j}(y) = f_{x'_j}(y).

Since the dependence of X(t) on the past X̄(t) is defined for each component separately, the pairwise independence implies the mutual independence.

In this section we have shown that Granger-causality can be used to define directed graphs for weakly stationary time series. These causality graphs can be interpreted in terms of causality and instantaneous causality (or conditional contemporaneous correlation), which gives an intuitive meaning to the directed edges in the graph. Another advantage of causality graphs is their relation to autoregressive models, which are an important tool in time series analysis. On the other hand, the derivation of noncausality relations from the graph seems to be more difficult than, e.g., for chain graphs. We note, however, that in the absence of instantaneous causality the causality graph of a process can be obtained from the chain graph as defined by Lynggaard and Walther (1993) by forming blocks for each component process.


Chapter 3

Nonparametric analysis

Partial spectral coherences are a well-known tool for the nonparametric investigation of functional relationships between the components of multivariate stochastic processes (e.g. Brillinger, 1981; Rosenberg et al., 1989). The results of such an interrelation analysis can be visualized by conditional correlation graphs, which allow an intuitive interpretation of the obtained dependence structure.

As we have seen in the last chapter, the conditional correlation graph of a stochastic process is constructed by determining the pairs of components for which the partial spectral coherence is zero at all frequencies. However, when concerned with data, this decision must be based on estimates of the partial spectral coherence, which are only approximately zero even if there is no direct interrelation. Therefore tests need to be employed for building the conditional correlation graph from the data.

The first part of this chapter deals with the problem of testing for the presence of an edge in the otherwise complete conditional correlation graph. In Section 3.2, we consider the integrated partial spectral coherence as a test statistic and prove its asymptotic normality under the null hypothesis that the edge is missing. We also derive its asymptotic distribution under a sequence of contiguous alternatives. A simulation study shows that the new test has good power against small deviations from the null hypothesis and performs better than existing global tests.

The frequency domain methods discussed in this chapter have also been used for the analysis of stationary point processes such as neuronal spike trains (e.g. Rosenberg et al., 1989; Dahlhaus et al., 1997). For the identification of synaptic connections in a neuronal net, we are interested not only in the strength of a connection, which is measured by the partial spectral coherence, but also in information about the direction and the type of connection (excitatory or inhibitory). Although the direction can be identified with the help of spectral phase curves (cf. Dahlhaus et al., 1997), frequency domain methods do not allow one to distinguish excitatory and inhibitory connections. In Section 3.3, we suggest an extended analysis using partialized time domain statistics. We prove a functional central limit theorem for the new statistics. Examples with simulated data show that the new statistics allow the correct identification of the type and direction of a connection while retaining information about direct and indirect association between components.


3.1 Introduction

In this section, we set down the basic definitions concerning the estimation of frequency domain statistics. We consider a d-dimensional stationary time series {X(t), t ∈ Z} such that E|X_a(t)|^k < ∞ for all a = 1, . . . , d and k ∈ N. Then

c_{a_1,...,a_k}(u_1, . . . , u_{k−1}) = cum{X_{a_1}(t + u_1), . . . , X_{a_{k−1}}(t + u_{k−1}), X_{a_k}(t)}

is the kth order cumulant of the process. If a function f_{a_1,...,a_k}: Π^{k−1} → C with the property

c_{a_1,...,a_k}(u_1, . . . , u_{k−1}) = ∫_{Π^{k−1}} f_{a_1,...,a_k}(λ_1, . . . , λ_{k−1}) exp(i ∑_{j=1}^{k−1} u_j λ_j) dλ_1 ⋯ dλ_{k−1}

exists, where Π = [−π, π], we call it the kth order cumulant spectrum. We make the following assumptions on X(t).

Assumption 3.1.1 {X(t), t ∈ Z} is a d-dimensional stationary stochastic process defined on a probability space (Ω, A, P).

(i) X(t) has mean zero and spectral density matrix f(λ) = (f_{ij}(λ))_{i,j=1,...,d} which satisfies the boundedness condition

a_1 1_d ≤ f(λ) ≤ a_2 1_d ∀ λ ∈ [−π, π]

for constants a_1 and a_2 with 0 < a_1 ≤ a_2 < ∞.

(ii) The kth order cumulants of X(t) satisfy the mixing conditions

∑_{u_1,...,u_{k−1}∈Z} (1 + |u_j|^2) |c_{a_1,...,a_k}(u_1, . . . , u_{k−1})| < ∞

for all j = 1, . . . , k − 1.

The nonparametric estimation of the spectral densities f_{ab}(λ) is usually based on the periodogram, which has the form

I^{(T)}_{ab}(λ) = (2π H^{(T)}_2(0))^{−1} d^{(T)}_a(λ) d^{(T)}_b(−λ),

where

d^{(T)}_a(λ) = ∑_{t=1}^{T} h^{(T)}(t) X_a(t) exp(−iλt)

is the finite Fourier transform of the ath component of the process and

H^{(T)}_k(λ) = ∑_{t=1}^{T} h^{(T)}(t)^k exp(−iλt)

are the Fourier transforms of the data taper h^{(T)}(t) = h((t − 1/2)/T). The taper function h: R → [0, 1] is of bounded variation and vanishes outside the interval [0, 1]. Further, it


should be smooth with h(0) = h(1) = 0 in order to improve the small sample properties of the estimates.
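The periodogram formula above can be transcribed directly; the cosine taper h(x) = sin²(πx) used here is only one common choice satisfying the stated smoothness conditions, not a choice made in the text.

```python
# Tapered periodogram I^{(T)}(lambda): finite Fourier transform d_a^{(T)}
# with taper h((t - 1/2)/T), normalized by 2*pi*H_2^{(T)}(0).

import numpy as np

def taper(T):
    # h(x) = sin(pi x)^2: smooth, in [0, 1], with h(0) = h(1) = 0
    x = (np.arange(1, T + 1) - 0.5) / T
    return np.sin(np.pi * x) ** 2

def periodogram(X, freqs):
    """X: (T, d) real array; returns (len(freqs), d, d) matrices I^{(T)}(lambda)."""
    T, d = X.shape
    h = taper(T)
    H2 = np.sum(h ** 2)                       # H_2^{(T)}(0)
    t = np.arange(1, T + 1)
    out = np.empty((len(freqs), d, d), dtype=complex)
    for m, lam in enumerate(freqs):
        dft = (h[:, None] * X * np.exp(-1j * lam * t)[:, None]).sum(axis=0)
        # for real data d_b(-lambda) = conj(d_b(lambda))
        out[m] = np.outer(dft, dft.conj()) / (2 * np.pi * H2)
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 2))
I = periodogram(X, np.linspace(0, np.pi, 8))
print(I.shape)  # (8, 2, 2)
```

Each I^{(T)}(λ) is Hermitian of rank one, as the formula dictates.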

For the estimation of the spectral densities f_{ij}(λ) we consider kernel estimates of the form

f^{(T)}_{ij}(λ) = ∫_Π w^{(T)}(λ − α) I^{(T)}_{ij}(α) dα

for a kernel w^{(T)}(λ) = M_T w(M_T λ). We need the following assumptions.

Assumption 3.1.2 The kernel function w(λ) is bounded, symmetric, nonnegative, and Lipschitz continuous with

∫_R w(λ) dλ = 1 and ∫_R λ^2 w(λ) dλ < ∞.

Further, w(λ) has continuous Fourier transform ŵ(α) such that

∫_R ŵ(α)^2 dα < ∞ and ∫_R ŵ(α)^4 dα < ∞.

Assumption 3.1.3 The sequence (M_T)_{T∈N} satisfies M_T = O(T^β) with 1/4 < β < 1/2.

It follows from these assumptions that

f^{(T)}_{ab}(λ) − f_{ab}(λ) = O_P(√(M_T/T)) (3.1.1)

and

E f^{(T)}_{ab}(λ) = ∫_Π f_{ab}(α) w^{(T)}(λ − α) dα = f_{ab}(λ) + O(1/M_T^2) (3.1.2)

uniformly in λ ∈ [−π, π] (e.g. Brillinger, 1981, Theorems 7.4.2 and 7.4.4). The conditions on M_T guarantee that both the stochastic variation in (3.1.1) and the bias in (3.1.2) tend to zero sufficiently fast.

Substituting the kernel estimates f^{(T)}_{ab}(λ) for the true spectra f_{ab}(λ) in (2.1.2) and (2.1.3), we now obtain consistent estimates for the partial spectra f_{ab|C_{ab}}(λ) and the partial spectral coherences R_{ab|C_{ab}}(λ). However, the identification of the conditional correlation graph requires the computation of these statistics for all pairs a, b ∈ {1, . . . , d}, which is computationally impracticable for large d. Here, we can use Lemma 2.1.3, which allows us to compute the same estimates efficiently by an inversion of the estimated spectral matrix f^{(T)}(λ).

3.2 Testing for interrelation

Having estimated the partial spectral coherences as described above, the conditional correlation graph can be identified by testing for the presence of each single edge. Since the


partial spectral coherence is identically zero if the corresponding edge in the graph is missing, we are interested in testing the null hypothesis

H_0: R_{ab|C_{ab}}(λ) ≡ 0 against H_1: R_{ab|C_{ab}}(λ) ≢ 0. (3.2.1)

Tests for this problem have been used quite frequently for interrelation analysis in multivariate processes. The conditional correlation graph now summarizes and visualizes the results of such an analysis. However, the intuitive interpretation of the conditional correlation graph uses the global Markov property, which deals with the presence or absence of sets of edges. Therefore, the correct approach for the identification would be to consider the test problem under additional constraints due to edges which have already been deleted from the graph. Another problem which clearly arises is that of multiple testing. These problems will not be discussed in this work.

Dahlhaus et al. (1997) suggested rejecting the null hypothesis H_0 if the partial spectral coherence |R^{(T)}_{ab|C_{ab}}(λ)|^2 exceeds an appropriate threshold. Thus the test statistic has the form

S^{(T)} = sup_{λ∈[0,π]} (2T/(µ M_T)) |R^{(T)}_{ab|C_{ab}}(λ)|^2,

where

µ = (2π H_4/H_2^2) ∫_R w(α)^2 dα.

The partial spectral coherence rescaled as above is asymptotically χ²_2-distributed, but the exact distribution of the supremum is difficult to obtain. For an approximation we assume that the kernel function has compact support, [−π/2, π/2] say. Since the partial coherences at frequencies λ_1 and λ_2 which are separated widely enough that the corresponding smoothing intervals are non-overlapping are approximately independent, S^{(T)} can then be approximated by the maximum over M_T independent χ²_2-distributed random variables.
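Under this approximation the critical value is the (1 − α)^{1/M_T}-quantile of the χ²_2-distribution, which has a closed form because χ²_2 is an exponential distribution with mean 2; a small sketch:

```python
# Closed-form chi^2_2 quantiles and the resulting sup-test threshold.

import math

def chi2_2_quantile(p):
    """p-quantile of the chi-squared distribution with 2 df:
    chi^2_2 = Exp(rate 1/2), so F^{-1}(p) = -2 log(1 - p)."""
    return -2.0 * math.log1p(-p)

def sup_test_threshold(alpha, M_T):
    """Critical value chi^2_{2, (1-alpha)^{1/M_T}}, treating the supremum
    as a maximum of M_T independent chi^2_2 variables."""
    return chi2_2_quantile((1.0 - alpha) ** (1.0 / M_T))

print(round(chi2_2_quantile(0.95), 3))  # 5.991
print(round(sup_test_threshold(0.05, 20), 3))
```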

Thus, the null hypothesis is rejected at significance level α if S^{(T)} ≥ χ²_{2,(1−α)^{1/M_T}}, where χ²_{2,p} denotes the p-quantile of the χ²_2-distribution.

Taniguchi et al. (1996) considered instead the integrated partial spectral coherence

S^{(T)} = (1/2π) ∫_Π |R^{(T)}_{ab|C_{ab}}(λ)|^2 dλ (3.2.2)

as a test statistic for the test problem in (3.2.1). In a more general setting, asymptotic normality is established for test statistics of the form

S^{(T)*} = √T (∫_Π K(f^{(T)}(λ)) dλ − c),

where K(·) is a holomorphic function. The proof is based on a first-order Taylor expansion. Thus with (3.1.1) and setting K(f(λ)) = (2π)^{−1} |R_{ab|C_{ab}}(λ)|^2, we get for the integrated partial spectral coherence

S^{(T)} = ∫_Π K(f(λ)) dλ + ∑_{i,j=1}^{d} ∫_Π (∂K(f(λ))/∂f_{ij}) (f^{(T)}_{ij}(λ) − f_{ij}(λ)) dλ + O_P(M_T/T). (3.2.3)


Under the null hypothesis H_0 the first term vanishes since K(f(λ)) ≡ 0. Further, we can rewrite the first derivatives as

∂K(f(λ))/∂f_{ij} = ∑_{k,l∈{a,b}} (∂K(f(λ))/∂f_{kl|C_{ab}}) (∂f_{kl|C_{ab}}/∂f_{ij})(λ).

With the abbreviation D_{ab}(λ) = 1/(2π f_{aa|C_{ab}}(λ) f_{bb|C_{ab}}(λ)), we have for k, l ∈ {a, b} such that k ≠ l

∂K(f(λ))/∂f_{kl|C_{ab}} = f_{lk|C_{ab}}(λ) D_{ab}(λ)

and

∂K(f(λ))/∂f_{kk|C_{ab}} = −|R_{ab|C_{ab}}(λ)|^2 / f_{kk|C_{ab}}(λ).

It follows from these expressions that under the null hypothesis H_0 the second term in (3.2.3) is also zero and accordingly, together with Assumption 3.1.3, we get S^{(T)} = o_P(T^{−1/2}). Thus the central limit theorem in Taniguchi et al. (1996) does not hold for the integrated partial spectral coherence if the null hypothesis in (3.2.1) is considered.

In the following, we derive the correct asymptotic distribution of S^{(T)} under the null hypothesis H_0 and under a class of local alternatives.

3.2.1 Asymptotic null distribution

The derivation of the asymptotic distribution of S^{(T)} is based on the following quadratic approximation:

S^{(T)}_2 = ∫_Π K(f(λ)) dλ + ∑_{i,j=1}^{d} ∫_Π (∂K(f(λ))/∂f_{ij}) (f^{(T)}_{ij}(λ) − f_{ij}(λ)) dλ
  + (1/2) ∑_{i,j,k,l=1}^{d} ∫_Π (∂²K(f(λ))/∂f_{ij}∂f_{kl}) (f^{(T)}_{ij}(λ) − f_{ij}(λ)) (f^{(T)}_{kl}(λ) − f_{kl}(λ)) dλ. (3.2.4)

It follows from Assumption 3.1.1 (i) that the third derivative of K(·) is bounded in a neighbourhood of f. Therefore we find by (3.1.1) and Assumption 3.1.3 that

S^{(T)} = S^{(T)}_2 + o_P(√M_T/T). (3.2.5)

Under the null hypothesis H_0, the first two terms in (3.2.4) vanish. Furthermore, we obtain

∂²K(f(λ))/(∂f_{ab|C_{ab}} ∂f_{ba|C_{ab}}) = D_{ab}(λ),


while all other second derivatives are zero. Further, it follows from Lemma 3.2.1 below that

∑_{i,j=1}^{d} ∫_Π D_{ab}(λ) (∂f_{ab|C_{ab}}/∂f_{ij})(λ) (∂f_{ba|C_{ab}}/∂f_{kl})(λ) f_{ij}(λ) (f^{(T)}_{kl}(λ) − f_{kl}(λ)) dλ = 0.

Therefore, under the null hypothesis H_0, S^{(T)}_2 takes the form

S^{(T)}_2 = ∑_{i,j,k,l=1}^{d} ∫_Π D_{ab}(λ) (∂f_{ab|C_{ab}}/∂f_{ij})(λ) (∂f_{ba|C_{ab}}/∂f_{kl})(λ) f^{(T)}_{ij}(λ) f^{(T)}_{kl}(λ) dλ.

Lemma 3.2.1 Under Assumption 3.1.1 we have the following identities:

(i) ∑_{i,j=1}^{d} (∂f_{ab|C_{ab}}/∂f_{ij})(λ) f_{ij}(λ) = f_{ab|C_{ab}}(λ);

(ii) ∑_{i,j,k,l=1}^{d} (∂f_{ab|C_{ab}}/∂f_{ij})(λ) (∂f_{ba|C_{ab}}/∂f_{kl})(λ) f_{ik}(λ) f_{lj}(λ) = f_{ab|C_{ab}}(λ) f_{ba|C_{ab}}(λ);

(iii) ∑_{i,j,k,l=1}^{d} (∂f_{ab|C_{ab}}/∂f_{ij})(λ) (∂f_{ba|C_{ab}}/∂f_{kl})(λ) f_{il}(λ) f_{kj}(λ) = f_{aa|C_{ab}}(λ) f_{bb|C_{ab}}(λ).

Proof. From the definition of the partial spectral density we get the derivatives

∂f_{ab|C_{ab}}/∂f_{ab}(λ) = 1,

∂f_{ab|C_{ab}}/∂f_{as}(λ) = −∑_{j≠a,b} g_{sj}(λ) f_{jb}(λ) ∀ s ≠ a, b,

∂f_{ab|C_{ab}}/∂f_{rb}(λ) = −∑_{j≠a,b} f_{aj}(λ) g_{jr}(λ) ∀ r ≠ a, b,

∂f_{ab|C_{ab}}/∂f_{rs}(λ) = (∂f_{ab|C_{ab}}/∂f_{as})(λ) · (∂f_{ab|C_{ab}}/∂f_{rb})(λ) ∀ r, s ≠ a, b,

where g(λ) is the inverse of the spectral matrix f_{C_{ab}C_{ab}}(λ). All other derivatives are equal to zero. Substituting these expressions for the derivatives in (i) to (iii), the sums yield the terms on the right side.

We now state the main theorem of this section.

Theorem 3.2.2 Suppose that Assumptions 3.1.1 to 3.1.3 hold. Then under the null hypothesis H_0,

(T S^{(T)} − M_T µ)/(√M_T σ) →^D N(0, 1),

where

µ = (2π H_4/H_2^2) ∫_R w(α)^2 dα and σ^2 = (4π H_4^2/H_2^4) ∫_R w(α)^4 dα.


Proof. In Lemmas 3.2.3, 3.2.7, and 3.2.9 we prove the convergence of the cumulants of first, second, and higher order of S^{(T)}_2 to the corresponding cumulants of the limit distribution. The result then follows from (3.2.5).

For the proofs we need the following function, which was introduced by Dahlhaus (1983). Let L^{(T)}: R → R be the periodic extension (with period 2π) of

L^{(T)}(λ) = { T if |λ| ≤ 1/T;  1/|λ| if 1/T < |λ| ≤ π }. (3.2.6)

The properties of this function are summarized in the appendix. Under the stated assumptions on the taper function we then have

|H^{(T)}_k(λ)| ≤ C L^{(T)}(λ)

with a constant C ∈ R independent of T and λ. Similarly, we obtain by Assumption 3.1.2 for the kernel function

w^{(T)}(λ) ≤ C L^{(M_T)}(λ)^2 / M_T.

Further, we define the sequence {Φ^{(T)}_2}_{T∈N} of functions

Φ^{(T)}_2(λ) = |H^{(T)}_2(λ)|^2 / (2π H^{(T)}_4(0)), (3.2.7)

which is an approximate identity (cf. Dahlhaus, 1983).

Lemma 3.2.3 Suppose that Assumptions 3.1.1 to 3.1.3 hold. Under the null hypothesis H_0 we have

E(S^{(T)}_2) = (M_T/T) (2π H_4/H_2^2) ∫_R w(α)^2 dα + o(√M_T/T). (3.2.8)

Proof. For fixed i, j, k, l let

g(λ) = D_{ab}(λ) (∂f_{ab|C_{ab}}/∂f_{ij})(λ) (∂f_{ba|C_{ab}}/∂f_{kl})(λ).

Noting that cum{d^{(T)}_a(α)} = 0, it then follows from the product theorem for cumulants


(cf. Brillinger, 1981, Theorem 2.3.2) that

E[∫_Π g(λ) f^{(T)}_{ij}(λ) f^{(T)}_{kl}(λ) dλ]
= (2π H^{(T)}_2(0))^{−2} ∫_{Π³} g(λ) w^{(T)}(λ − α_1) w^{(T)}(λ − α_2)
  · E{d^{(T)}_i(α_1) d^{(T)}_j(−α_1) d^{(T)}_k(α_2) d^{(T)}_l(−α_2)} dα_1 dα_2 dλ
= (2π H^{(T)}_2(0))^{−2} ∫_{Π³} g(λ) w^{(T)}(λ − α_1) w^{(T)}(λ − α_2)
  · [ cum{d^{(T)}_i(α_1), d^{(T)}_j(−α_1), d^{(T)}_k(α_2), d^{(T)}_l(−α_2)}
    + cum{d^{(T)}_i(α_1), d^{(T)}_j(−α_1)} cum{d^{(T)}_k(α_2), d^{(T)}_l(−α_2)}
    + cum{d^{(T)}_i(α_1), d^{(T)}_k(α_2)} cum{d^{(T)}_j(−α_1), d^{(T)}_l(−α_2)}
    + cum{d^{(T)}_i(α_1), d^{(T)}_l(−α_2)} cum{d^{(T)}_j(−α_1), d^{(T)}_k(α_2)} ] dα_1 dα_2 dλ. (3.2.9)

By Theorem 4.3.2 of Brillinger (1981) we have

cum{d^{(T)}_{a_1}(α_1), . . . , d^{(T)}_{a_k}(α_k)} = (2π)^{k−1} H^{(T)}(α_1 + · · · + α_k) f_{a_1...a_k}(α_1, . . . , α_{k−1}) + O(1) (3.2.10)

uniformly in α_1, . . . , α_k ∈ Π. Substituting this into (3.2.9), the first term becomes

(C/T) ∫_{Π³} g(λ) w^{(T)}(λ − α_1) w^{(T)}(λ − α_2) f_{ijkl}(α_1, −α_1, α_2) dα_1 dα_2 dλ + O(1/T²) = O(1/T).

Further, it follows from (3.1.2) and Lemma 3.2.1 that

∫_Π (∂f_{ab|C_{ab}}/∂f_{ij})(λ) f_{ij}(α − λ) w^{(T)}(α) dα = O(1/M_T^2), (3.2.11)

and thus the leading term of the second term becomes

∫_{Π³} g(λ) w^{(T)}(λ − α_1) w^{(T)}(λ − α_2) f_{ij}(α_1) f_{kl}(α_2) dα_1 dα_2 dλ = O(1/M_T^4). (3.2.12)

With the above bounds for H^{(T)}_2(λ) and w^{(T)}(λ) we obtain for the third term

(C/T²) |∫_{Π³} g(λ) w^{(T)}(λ − α_1) w^{(T)}(λ − α_2) |H^{(T)}_2(α_1 + α_2)|² f_{ik}(α_1) f_{jl}(−α_1) dα_1 dα_2 dλ|
  ≤ (C/(T² M_T²)) ∫_{Π³} L^{(M_T)}(λ − α_1)² L^{(M_T)}(λ − α_2)² L^{(T)}(α_1 + α_2)² dα_1 dα_2 dλ
  ≤ (C/(T M_T²)) ∫_{Π²} L^{(M_T)}(λ + α)² L^{(M_T)}(λ − α)² dα dλ = O(1/T).

The last term now can be rewritten as

2πH4

TH22

∫Π3

g(λ+ α)w(T )(λ)w(T )(λ+ β)Φ(T )2 (β)fil(α)fkj(α)dαdβdλ.


In order to prove the convergence to the term on the right-hand side of (3.2.8), we first show that the differences

    | ∫_{Π³} g(λ+α) w^(T)(λ) w^(T)(λ+β) Φ^(T)_2(β) f_il(α) f_kj(α) dα dβ dλ
      − ∫_{Π²} g(λ+α) w^(T)(λ)² f_il(α) f_kj(α) dα dλ |

and

    | ∫_{Π²} g(λ+α) w^(T)(λ)² f_il(α) f_kj(α) dα dλ
      − ∫_{Π²} w^(T)(λ)² g(α) f_il(α) f_kj(α) dα dλ |

are both of order o(√M_T). Since w is Lipschitz continuous, the first difference is bounded by

    C ∫_{Π³} w^(T)(λ) |w^(T)(λ+β) − w^(T)(λ)| Φ^(T)_2(β) dα dβ dλ
      ≤ C M_T ∫_{Π²} w(λ) |w(λ + M_T β) − w(λ)| Φ^(T)_2(β) dβ dλ
      ≤ C M_T² ∫_Π |β| Φ^(T)_2(β) dβ
      ≤ (C M_T²/T) ∫_Π L^(T)(β) dβ ≤ C M_T² log(T)/T,

which is of the desired order, since by Assumption 3.1.3 √M_T = O(T^{β/2}). For the second difference we note that by Assumption 3.1.1 (ii) the spectral densities and, together with (i), also the inverse spectra are continuous. Thus g is Lipschitz continuous and we get

    | ∫_{Π²} g(λ+α) w^(T)(λ)² f_il(α) f_kj(α) dα dλ − ∫_{Π²} w^(T)(λ)² g(α) f_il(α) f_kj(α) dλ dα |
      ≤ C ∫_{Π²} |g(λ+α) − g(α)| w^(T)(λ)² dα dλ
      ≤ (C/M_T²) ∫_Π |λ| L^(M_T)(λ)⁴ dλ ≤ C.

Thus we have shown

    E(S^(T)_2) = (M_T/T) (2πH₄/H₂²) ∫_ℝ w(α)² dα
                 · ∫_Π Σ_{i,j,k,l=1}^d D_ab(λ) ∂f_{ab|C_ab}/∂f_ij(λ) ∂f_{ba|C_ab}/∂f_kl(λ) f_il(λ) f_kj(λ) dλ + o(√M_T/T).

The assertion of the lemma now follows from Lemma 3.2.1.

For the derivation of the covariance of S^(T) we define the sequence {Ψ^(T)}_{T∈ℕ} of functions

    Ψ^(T)(α₁, …, α₅) = (1/C^(T)_Ψ) w^(T)(α₁) ⋯ w^(T)(α₄) Φ^(T)_2(α₅) Φ^(T)_2(α₁ + α₂ − α₃ − α₄ + α₅)

with

    C^(T)_Ψ = ∫_{Π⁵} w^(T)(α₁) ⋯ w^(T)(α₄) Φ^(T)_2(α₅) Φ^(T)_2(α₁ + α₂ − α₃ − α₄ + α₅) dα₁ ⋯ dα₅.

Lemma 3.2.4 Let w satisfy Assumption 3.1.2. Then

(i) Ψ^(T) is an approximate identity, and

(ii) lim_{T→∞} (1/M_T) C^(T)_Ψ = ∫_Π w(α)⁴ dα.

Proof. (i) It follows immediately from the definition of Ψ^(T) that

    ∫_{Π⁵} Ψ^(T)(α₁, …, α₅) dα₁ ⋯ dα₅ = 1

and

    ∫_{Π⁵} |Ψ^(T)(α₁, …, α₅)| dα₁ ⋯ dα₅ ≤ K < ∞.

Further, we have for any δ > 0

    ∫_{U_δ(0)^C} |Ψ^(T)(α₁, …, α₅)| dα₁ ⋯ dα₅ ≤ Σ_{i=1}^5 ∫_{|α_i|>δ} ∫_{Π⁴} |Ψ^(T)(α₁, …, α₅)| dα₁ ⋯ dα₅,   (3.2.13)

where U_δ(0) = {α ∈ ℝ⁵ : ‖α‖_∞ ≤ δ}. For i = 1 we obtain

    ∫_{|α₁|>δ} ∫_{Π⁴} |Ψ^(T)(α₁, …, α₅)| dα₁ ⋯ dα₅
      ≤ (C/(M_T δ²)) ∫_{Π⁵} w^(T)(α₂) ⋯ w^(T)(α₄) Φ^(T)_2(α₅) Φ^(T)_2(α₁+α₂−α₃−α₄+α₅) dα₁ ⋯ dα₅
      ≤ C/(M_T δ²) → 0  as T → ∞.

The cases i = 2, …, 5 can be treated similarly. Therefore (3.2.13) tends to zero asymptotically.

(ii) To prove the second part of the lemma, we first note that

    ∫_Π w(α)⁴ dα = ∫_{Π³} w(α₁) w(α₂) w(α₃) w(α₁+α₂−α₃) dα₁ dα₂ dα₃.

Thus we get for δ_T > 0

    | (1/M_T) C^(T)_Ψ − ∫_{Π³} w(α₁) w(α₂) w(α₃) w(α₁+α₂−α₃) dα₁ dα₂ dα₃ |
      ≤ ∫_{Π⁵} w(α₁) w(α₂) w(α₃) |w(α₁+α₂−α₃+M_T(α₅−α₄)) − w(α₁+α₂−α₃)| Φ^(T)_2(α₄) Φ^(T)_2(α₅) dα₁ ⋯ dα₅
      ≤ 2‖w‖_∞ ∫_{Π³} ∫_{|α₄|>δ_T ∨ |α₅|>δ_T} w(α₁) w(α₂) w(α₃) Φ^(T)_2(α₄) Φ^(T)_2(α₅) dα₁ ⋯ dα₅
        + C M_T ∫_{Π³} ∫_{|α₄|≤δ_T ∧ |α₅|≤δ_T} w(α₁) w(α₂) w(α₃) |α₅−α₄| Φ^(T)_2(α₄) Φ^(T)_2(α₅) dα₁ ⋯ dα₅.

For the second term we find for each ε > 0 some δ > 0 such that for δ_T = δ/M_T the term is bounded by

    C M_T δ_T ∫_{Π⁵} w(α₁) w(α₂) w(α₃) Φ^(T)_2(α₄) Φ^(T)_2(α₅) dα₁ ⋯ dα₅ = Cδ ≤ ε/2.

Then there exists T₀ > 0 such that for all T > T₀ the first term is bounded by

    C ∫_Π ∫_{|α₄|>δ_T} Φ^(T)_2(α₄) Φ^(T)_2(α₅) dα₄ dα₅ ≤ (C/T) ∫_{|α|>δ_T} L^(T)(α)² dα ≤ C M_T²/(T δ²) ≤ ε/2,

since M_T²/T → 0 as T → ∞. This proves (ii).

Lemma 3.2.5 Let g: ℝ → ℝ and h: ℝ³ → ℝ be integrable functions. Then

    | ∫_{Π⁶} g(λ) h(α₁+α₂+α₃−λ, λ−α₁, λ−α₂) Ψ^(T)(α₁, …, α₅) dα₁ ⋯ dα₅ dλ − ∫_Π g(λ) h(−λ, λ, λ) dλ | = o(1).

Proof. This follows, e.g., from Theorem 2.10 in Alt (1992).

For the derivation of the cumulants of second and higher order we have to consider cumulants of the form

    cum{ d^(T)_{i_{j,1}}(α_{j,1}) d^(T)_{i_{j,2}}(−α_{j,1}) d^(T)_{i_{j,3}}(α_{j,2}) d^(T)_{i_{j,4}}(−α_{j,2}) | j = 1, …, k }.   (3.2.14)

Let Σ_{i.p.} denote the sum over all indecomposable partitions {P₁, …, P_m} of the table

    α_{1,1}   −α_{1,1}   α_{1,2}   −α_{1,2}
      ⋮          ⋮          ⋮          ⋮                                                  (3.2.15)
    α_{k,1}   −α_{k,1}   α_{k,2}   −α_{k,2}

with p_j = |P_j|, P_j = {γ_{j,1}, …, γ_{j,p_j}} and γ_j = γ_{j,1} + … + γ_{j,p_j}. We call two sets P_i and P_j hooked if there exist an index l ∈ {1, …, k} and variables γ_{ir} ∈ P_i and γ_{ir′} ∈ P_j such that γ_{ir} and γ_{ir′} are both contained in the lth row {α_{l,1}, −α_{l,1}, α_{l,2}, −α_{l,2}}. Thus, in an indecomposable partition every set P_j is hooked to at least one other set P_i. Each partition {P₁, …, P_m} of the table (3.2.15) can also be interpreted as a partition of the table

    α_{1,1}   −α_{1,1}
    α_{1,2}   −α_{1,2}
      ⋮          ⋮                                                                        (3.2.16)
    α_{k,2}   −α_{k,2}


This partition, however, might now be decomposable into two or more indecomposable subpartitions. We call these subpartitions non-hooked, as two sets from different subpartitions are not hooked. Further, a subpartition is said to cover only one variable if it consists of exactly one row of the table (3.2.16).

Further, for each partition {P₁, …, P_m} we denote by Q_j = {ν_{j,1}, …, ν_{j,p_j}} the sets of the corresponding partition of the table of indices

    i_{1,1}   i_{1,2}   i_{1,3}   i_{1,4}
      ⋮          ⋮          ⋮          ⋮                                                  (3.2.17)
    i_{k,1}   i_{k,2}   i_{k,3}   i_{k,4}

Now suppose that we select, for each of the s−u subpartitions which cover more than one variable, two variables α_{l_{j,1}, i_{j,1}} and α_{l_{j,2}, i_{j,2}} such that l_{j,1} ≠ l_{j,2} for all j = 1, …, s−u. Then a sequence (l_{j₁,1}, l_{j₁,2}), …, (l_{j_r,1}, l_{j_r,2}) in the set {(l_{j,1}, l_{j,2}) | j = 1, …, s−u} is called a circle if

    l_{j_u,2} = l_{j_{u+1},1}  ∀ u = 1, …, r−1   and   l_{j_r,2} = l_{j₁,1}.

As the next lemma shows, the variables can be selected such that not more than one circle is obtained.

Lemma 3.2.6 Let J₁, …, J_{s−u} denote the non-hooked subpartitions which cover more than one variable. Then for each subpartition J_r there exist two variables α_{l_{r,1}, i_{r,1}} and α_{l_{r,2}, i_{r,2}} covered by J_r such that the set {(l_{j,1}, l_{j,2}) | j = 1, …, s−u} does not contain more than one circle.

Proof. Unifying the sets within each subpartition, we obtain again an indecomposable partition {Q₁, …, Q_s} of the table (3.2.15). Since the u subpartitions which cover only one variable do not link different rows, they can be omitted without destroying indecomposability. Now suppose there are two circles, represented by the sets Q_{j₁}, …, Q_{j_r} and Q_{j_{r+1}}, …, Q_{j_{r′}}, respectively. Then the sets A = Q_{j₁} ∪ … ∪ Q_{j_r}, B = Q_{j_{r+1}} ∪ … ∪ Q_{j_{r′}}, and Q_{j_{r′+1}}, …, Q_{j_{s−u}} form a new indecomposable partition. Therefore there exists a sequence A, Q₁, …, Q_q, B such that two consecutive sets are hooked. Choosing the variables correspondingly, we obtain a selection with at least one circle less. As there can be only a finite number of circles, we can apply this scheme repeatedly.

We note that the lemma remains true if we consider only a subset of the non-hooked subpartitions.

By the product theorem for cumulants, (3.2.14) can now be written as

    cum{ d^(T)_{i_{j,1}}(α_{j,1}) d^(T)_{i_{j,2}}(−α_{j,1}) d^(T)_{i_{j,3}}(α_{j,2}) d^(T)_{i_{j,4}}(−α_{j,2}) | j = 1, …, k }
      = Σ_{i.p.} ∏_{j=1}^m cum{ d^(T)_{ν_{j,1}}(γ_{j,1}), …, d^(T)_{ν_{j,p_j}}(γ_{j,p_j}) }


and further by (3.2.10)

      = Σ_{i.p.} ∏_{j=1}^m [ (2π)^{p_j−1} H^(T)_{p_j}(γ_j) f_{ν_{j,1}…ν_{j,p_j}}(γ_{j,1}, …, γ_{j,p_j−1}) + O(1) ]
      = Σ_{i.p.} ∏_{j=1}^m (2π)^{p_j−1} H^(T)_{p_j}(γ_j) f_{ν_{j,1}…ν_{j,p_j}}(γ_{j,1}, …, γ_{j,p_j−1}) + R_T,   (3.2.18)

where the remainder term R_T is given by

    R_T = Σ_{i.p.} Σ_{J ⊊ {1,…,m}} ∏_{j∈J} (2π)^{p_j−1} H^(T)_{p_j}(γ_j) f_{ν_{j,1}…ν_{j,p_j}}(γ_{j,1}, …, γ_{j,p_j−1}).

Lemma 3.2.7 Suppose that Assumptions 3.1.1 to 3.1.3 hold. Then under the null hypothesis H₀

    var(S^(T)_2) = (M_T/T²) (4πH₄²/H₂⁴) ∫_Π w(λ)⁴ dλ + o(M_T/T²).

Proof. For the sake of simplicity, we use the abbreviation

    g_j(λ) = D_ab(λ) ∂f_{ab|C_ab}/∂f_{i_{j,1} i_{j,2}}(λ) ∂f_{ba|C_ab}/∂f_{i_{j,3} i_{j,4}}(λ).

From (3.2.18) we obtain for the variance

    var(S^(T)_2) = Σ_{i_{j,1},…,i_{j,4}=1; j∈{1,2}}^d ∫_{Π⁶} ∏_{j=1}^2 [ g_j(λ_j) w^(T)(λ_j−α_{j,1}) w^(T)(λ_j−α_{j,2}) ]
        · cum{ I^(T)_{i_{1,1} i_{1,2}}(α_{1,1}) I^(T)_{i_{1,3} i_{1,4}}(α_{1,2}), I^(T)_{i_{2,1} i_{2,2}}(α_{2,1}) I^(T)_{i_{2,3} i_{2,4}}(α_{2,2}) } dα_{1,1} ⋯ dα_{2,2} dλ₁ dλ₂

      = (1/(2πH₂T)⁴) Σ_{i.p.} Σ_{i_{j,1},…,i_{j,4}=1; j∈{1,2}}^d ∫_{Π⁶} ∏_{j=1}^2 [ g_j(λ_j) w^(T)(λ_j−α_{j,1}) w^(T)(λ_j−α_{j,2}) ]
        · ∏_{j=1}^m (2π)^{p_j−1} H^(T)_{p_j}(γ_j) f_{ν_{j,1}…ν_{j,p_j}}(γ_{j,1}, …, γ_{j,p_j−1}) dα_{1,1} ⋯ dα_{2,2} dλ₁ dλ₂ + R_T,   (3.2.19)

where the remainder term R_T is of smaller order than the main term. We evaluate the term for the different partitions separately. We start with all partitions of the form

    {α_{1,1}, ±α_{2,σ₁}}, {−α_{1,1}, ∓α_{2,σ₁}}, {α_{1,2}, ±α_{2,σ₂}}, {−α_{1,2}, ∓α_{2,σ₂}}   (3.2.20)


with σ₁ ≠ σ₂. First, we consider the partition {α_{1,1}, α_{2,1}}, {−α_{1,1}, −α_{2,1}}, etc., that is, σ₁ = 1, σ₂ = 2 with the upper signs; the other cases can be treated similarly. We then obtain terms of the form

    (1/(H₂⁴ T⁴)) ∫_{Π⁶} ∏_{j=1}^2 [ g_j(λ_j) w^(T)(λ_j−α_{j,1}) w^(T)(λ_j−α_{j,2}) ] |H^(T)_2(α_{1,1}+α_{2,1})|² |H^(T)_2(α_{1,2}+α_{2,2})|²
      · f_{i_{1,1} i_{2,1}}(α_{1,1}) f_{i_{1,2} i_{2,2}}(−α_{1,1}) f_{i_{1,3} i_{2,3}}(α_{1,2}) f_{i_{1,4} i_{2,4}}(−α_{1,2}) dα_{1,1} ⋯ dα_{2,2} dλ₁ dλ₂

    = (C^(T)_Ψ H₄² / (H₂⁴ T²)) ∫_{Π⁶} g₁(λ₁) g₂(λ₂ − λ₁ + α_{1,1} + α_{2,1})
      · f_{i_{1,1} i_{2,1}}(λ₁−α_{1,1}) f_{i_{1,2} i_{2,2}}(α_{1,1}−λ₁) f_{i_{1,3} i_{2,3}}(λ₁−α_{1,2}) f_{i_{1,4} i_{2,4}}(α_{1,2}−λ₁)
      · Ψ^(T)(α_{1,1}, α_{1,2}, α_{2,1}, α_{2,2}, λ₂) dα_{1,1} ⋯ dα_{2,2} dλ₁ dλ₂

and further by Lemma 3.2.5

    = (C^(T)_Ψ H₄² / (H₂⁴ T²)) ∫_Π g₁(λ) g₂(−λ) f_{i_{1,1} i_{2,1}}(λ) f_{i_{1,2} i_{2,2}}(−λ) f_{i_{1,3} i_{2,3}}(λ) f_{i_{1,4} i_{2,4}}(−λ) dλ + o(M_T/T²).

Inserting the definition of g_j(λ) and observing that

    ∂f_{ab|C_ab}/∂f_ij(−λ) = ∂f_{ba|C_ab}/∂f_ji(λ),

the corresponding term in (3.2.19) becomes

    (C^(T)_Ψ H₄² / (T² H₂⁴)) Σ_{i_{j,1},…,i_{j,4}=1; j∈{1,2}}^d ∫_Π D_ab(λ)² ∂f_{ab|C_ab}/∂f_{i_{1,1} i_{1,2}}(λ) ∂f_{ba|C_ab}/∂f_{i_{2,2} i_{2,1}}(λ) f_{i_{1,1} i_{2,1}}(λ) f_{i_{2,2} i_{1,2}}(λ)
      · ∂f_{ba|C_ab}/∂f_{i_{1,3} i_{1,4}}(λ) ∂f_{ba|C_ab}/∂f_{i_{2,4} i_{2,3}}(λ) f_{i_{1,3} i_{2,3}}(λ) f_{i_{2,4} i_{1,4}}(λ) dλ + o(M_T/T²)

and further by Lemmas 3.2.1 (iii) and 3.2.4 (ii)

    = (M_T/T²) (2πH₄²/H₂⁴) ∫_Π w(λ)⁴ dλ + o(M_T/T²).

For the partition with {α_{1,1}, −α_{2,1}}, etc. (the lower signs) we obtain the same result, while for the other two cases (σ₁ = 2, σ₂ = 1) we get by Lemma 3.2.1 (ii)

    (M_T/T²) (H₄²/H₂⁴) ∫_Π w(λ)⁴ dλ ∫_Π |R_{ab|C_ab}(λ)|⁴ dλ + o(M_T/T²) = o(M_T/T²),   (3.2.21)

since the first term is zero.

Next we show that for all other partitions the corresponding term in (3.2.19) is of smaller order. First, if the partition consists of only one or two sets, we directly get the upper bound

    (C/T⁴) ∫_{Π⁶} w^(T)(λ₁−α_{1,1}) ⋯ w^(T)(λ₂−α_{2,2}) T² dα_{1,1} ⋯ dα_{2,2} dλ₁ dλ₂ ≤ C/T².


Second, if there exists only one non-hooked subpartition (i.e. all sets are hooked in the table (3.2.15)), we obtain

    (C/(T⁴ M_T⁴)) ∫_{Π⁶} L^(M_T)(λ₁−α_{1,1})² ⋯ L^(M_T)(λ₂−α_{2,2})² ∏_{j=1}^m L^(T)(γ_j) dα_{1,1} ⋯ dα_{2,2} dλ₁ dλ₂
      ≤ (C/(T⁴ M_T²)) ∫_{Π⁴} L^(M_T)(α_{1,1}−α_{1,2})² L^(M_T)(α_{2,1}−α_{2,2})² ∏_{j=1}^m L^(T)(γ_j) dα_{1,1} ⋯ dα_{2,2}
      ≤ (C M_T²/T⁴) ∫_{Π⁴} ∏_{j=1}^m L^(T)(γ_j) dα_{1,1} ⋯ dα_{2,2}
      ≤ C M_T² log(T)^{m−2}/T³ = o(M_T/T²).

Third, we consider partitions with two non-hooked subpartitions. If the partition consists of four sets, either the sets are of the form (3.2.20) or one subpartition consists of one set, P₁ say, of the form {α, −α}. In the latter case we can apply (3.2.11), which yields a factor of order O(M_T^{−2}). Thus we get

    (C/(T⁴ M_T⁵)) ∫_{Π⁵} L^(M_T)(λ₁−α_{1,2})² ⋯ L^(M_T)(λ₂−α_{2,2})² T ∏_{j=2}^m L^(T)(γ_j) dα_{1,2} ⋯ dα_{2,2} dλ₁ dλ₂.

Integrating over λ₁ and λ₂ and using L^(M_T)(λ)² ≤ M_T², this is bounded by

    (C/(T³ M_T)) ∫_{Π³} ∏_{j=2}^m L^(T)(γ_j) dα_{1,2} ⋯ dα_{2,2} ≤ C log(T)/(T² M_T).   (3.2.22)

Fourth, if a partition of length m = 3 consists of two non-hooked subpartitions, there are two possible cases: either one set is of the form {α, −α} (this case can be handled as above), or one set, P₁ say, spans two variables and the two hooked sets have two elements each. It follows that γ₂ = −γ₃, and we therefore obtain

    (C/(T⁴ M_T⁴)) ∫_{Π⁶} L^(M_T)(λ₁−α_{1,1})² ⋯ L^(M_T)(λ₂−α_{2,2})² T L^(T)(γ₂)² dα_{1,2} ⋯ dα_{2,2} dλ₁ dλ₂

and further, by integrating over λ₁, λ₂ and the variables in P₁,

    ≤ (C/T³) ∫_Π L^(T)(γ₂)² dγ₂ ≤ C/T².

Next we consider partitions of length m = 4 with three non-hooked subpartitions. Since the partition is indecomposable, two subpartitions must be of the form {α_{1,i}, −α_{1,i}} and {α_{2,j}, −α_{2,j}}, respectively, with i ≠ j. From (3.2.11) we then get the upper bound

    (C/(T² M_T⁴)) ∫_{Π²} L^(T)(α_{1,i*} ± α_{2,j*})² dα_{1,i*} dα_{2,j*} ≤ C/(T M_T⁴) ≤ C/T²,   (3.2.23)

where i* and j* are determined by i* ≠ i and j* ≠ j.

Finally, a partition of length m = 3 with three non-hooked subpartitions again must contain two sets of the form {α_{1,i}, −α_{1,i}} and {α_{2,j}, −α_{2,j}}. Similarly to the above we see that the terms are of order O(T^{−2}).


Lemma 3.2.8 Suppose that {P₁, …, P_m} is an indecomposable partition of the table

    α₁   −α₁
     ⋮     ⋮
    α_k  −α_k

If m = k, then for any k−2 variables α_{i₁}, …, α_{i_{k−2}} we obtain

    ∫_{Π^{k−2}} ∏_{j=1}^k L^(T)(γ_j) dα_{i₁} ⋯ dα_{i_{k−2}} ≤ C L^(T)(α_{i_{k−1}} ± α_{i_k})² log(T)^{k−2}.

If m < k, then there exist k−2 variables α_{i₁}, …, α_{i_{k−2}} such that

    ∫_{Π^{k−2}} ∏_{j=1}^m L^(T)(γ_j) dα_{i₁} ⋯ dα_{i_{k−2}} ≤ C T log(T)^{k−2}.

Proof. The first part follows from the indecomposability of the partition and the properties of the L^(T)-function. For the second part we note that, because of the indecomposability of the partition, there exist an ordering P_{j₁}, …, P_{j_m} and variables α_{i₁}, …, α_{i_{m−1}} such that α_{i_r} ∈ ⋃_{i=1}^r P_{j_i} and −α_{i_r} ∈ P_{j_{r+1}}. Therefore we have

    ∫_{Π^{k−2}} L^(T)(γ₁) ⋯ L^(T)(γ_{m−1}) dα_{i₁} ⋯ dα_{i_{k−2}}
      ≤ C log(T) ∫_{Π^{k−3}} L^(T)(γ₁+γ₂) ⋯ L^(T)(γ_{m−1}) dα_{i₂} ⋯ dα_{i_{k−2}}
      ≤ C log(T)^{m−1} L^(T)(γ₁ + … + γ_m) = C T log(T)^{m−1}.

Lemma 3.2.9 Suppose that Assumptions 3.1.1 to 3.1.3 hold. Then under the null hypothesis H₀ the cumulants of kth order of S^(T)_2 satisfy, for all k ≥ 3,

    cum_k(S^(T)_2) = o(M_T^{k/2}/T^k).

Proof. Defining g_j(λ) as in the proof of Lemma 3.2.7, we obtain from (3.2.18) for the kth order cumulant, similarly as for the variance,

    |cum_k(S^(T)_2)| ≤ (C/T^{2k}) Σ_{i.p.} | Σ_{i_{j,1},…,i_{j,4}=1; j∈{1,…,k}}^d ∫_{Π^{3k}} ∏_{j=1}^k [ g_j(λ_j) w^(T)(λ_j−α_{j,1}) w^(T)(λ_j−α_{j,2}) ]
        · ∏_{j=1}^m H^(T)_{p_j}(γ_j) f_{ν_{j,1}…ν_{j,p_j}}(γ_{j,1}, …, γ_{j,p_j−1}) dα_{1,1} ⋯ dα_{k,2} dλ₁ ⋯ dλ_k | + R_T,   (3.2.24)

where the remainder term R_T is of smaller order than the main term. Therefore it suffices to show that each summand of the first sum in the main term tends to zero asymptotically with rate M_T^{k/2}/T^k.


Let {P₁, …, P_m} be an indecomposable partition of table (3.2.15) which consists of s non-hooked subpartitions. Further suppose that u subpartitions, P_{m−u+1}, …, P_m say, cover only one variable, that is, these subpartitions are of the form {α, −α}. Then application of (3.2.11) to each of the u subpartitions yields a factor of order O(M_T^{−2u}). Renumbering the variables, the corresponding summand in (3.2.24) is therefore bounded by

    (C/(T^{2k−u} M_T^{2k+u})) ∫_{Π^{3k−u}} ∏_{j=1}^k L^(M_T)(λ_j−α_{j,1})² ∏_{j=1}^{k−u} L^(M_T)(λ_j−α_{j,2})²
      · ∏_{j=1}^{m−u} L^(T)(γ_j) dα_{1,1} ⋯ dα_{k,1} dα_{1,2} ⋯ dα_{k−u,2} dλ₁ ⋯ dλ_k.

Next, suppose there are s′ subpartitions P_{i₁}, …, P_{i_r} such that p_{i_j} = 2 for all j = 1, …, r. Then according to Lemma 3.2.6 we can select two variables α_{l_{j,1}, i_{j,1}} and α_{l_{j,2}, i_{j,2}} such that the set of row indices {(l_{1,1}, l_{1,2}), …, (l_{s′,1}, l_{s′,2})} does not contain more than one circle.

Similarly, for each of the remaining s − s′ − u subpartitions we can choose two variables such that the remaining variables in the set satisfy the conditions of Lemma 3.2.8. Now we can bound L^(M_T)(·)² by M_T² for each of the 2k − 2s + u variables which are left. Integrating over these variables, we thus have by Lemma 3.2.8

    (C log(T)^ρ M_T^{2k−4s+u} / T^{2k−s+s′}) ∫_{Π^{k+2(s−u)}} ∏_{j=1}^{s−u} [ L^(M_T)(λ_{l_{j,1}} − α_{l_{j,1}, i_{j,1}})² L^(M_T)(λ_{l_{j,2}} − α_{l_{j,2}, i_{j,2}})² ]
      · ∏_{j=1}^{s′} L^(T)(α_{l_{j,1}, i_{j,1}} ± α_{l_{j,2}, i_{j,2}})² dα_{l_{1,1}, i_{1,1}} ⋯ dα_{l_{s−u,2}, i_{s−u,2}} dλ₁ ⋯ dλ_k

for some ρ ∈ ℕ. Integrating over α_{l_{s′+1,1}, i_{s′+1,1}}, …, α_{l_{s−u,2}, i_{s−u,2}} we further obtain

    (C log(T)^ρ M_T^{2(k−s−s′)} / (T^{2k−s+s′} M_T^u)) ∫_{Π^{k+2s′}} ∏_{j=1}^{s′} [ L^(M_T)(λ_{l_{j,1}} − α_{l_{j,1}, i_{j,1}})² L^(M_T)(λ_{l_{j,2}} − α_{l_{j,2}, i_{j,2}})² ]
      · ∏_{j=1}^{s′} L^(T)(α_{l_{j,1}, i_{j,1}} ± α_{l_{j,2}, i_{j,2}})² dα_{l_{1,1}, i_{1,1}} ⋯ dα_{l_{s′,2}, i_{s′,2}} dλ₁ ⋯ dλ_k.

Integrating over the remaining α_{l_{j,1}, i_{j,1}} we get

    (C log(T)^ρ M_T^{2(k−s)−s′} / (T^{2k−s} M_T^u)) ∫_{Π^k} ∏_{j=1}^{s′} L^(M_T)(λ_{l_{j,2}} ∓ λ_{l_{j,1}})² dλ₁ ⋯ dλ_k.

Since the pairs (l_{j,1}, l_{j,2}) were chosen such that at most one circle exists, integration over λ₁, …, λ_k yields an additional factor of order O(M_T^{s′+1}). Thus for any partition {P₁, …, P_m} with s non-hooked subpartitions the corresponding summand in (3.2.24) is bounded by

    C (M_T^{k/2}/T^k) (M_T²/T)^{k−s} M_T log(T)^ρ / M_T^{k/2+u}.


For s ≤ k this is of order o(M_T^{k/2}/T^k), since M_T²/T → 0 and k ≥ 3. If s = k+1, the upper bound can be rewritten as

    C (M_T^{k/2}/T^k) (T/M_T⁴) M_T³ log(T)^ρ / M_T^{k/2+u}.

Noting that T/M_T⁴ → 0 and u ≥ 2 since the partition was indecomposable, this is again of order o(M_T^{k/2}/T^k).

3.2.2 Asymptotic local power of the test

We now investigate the asymptotic power of the test based on the test statistic S^(T) under a class of local alternatives. More precisely, we derive the asymptotic distribution of S^(T) under the series of alternatives

    H_{1,T}: f_{ab|C_ab}(λ) = c_T f₀(λ)  ∀ λ ∈ [−π, π],

where f₀ is a complex-valued continuous function on [−π, π] and {c_T}_{T∈ℕ} is a real-valued sequence such that c_T → 0 and c_T √T → ∞ as T → ∞. Under these conditions the alternatives H_{1,T} converge to the null hypothesis H₀.

As in the previous section, we consider the Taylor expansion of S^(T),

    S^(T) = ∫_Π K(f(λ)) dλ + Σ_{i,j=1}^d ∫_Π ∂K(f(λ))/∂f_ij (f^(T)_ij(λ) − f_ij(λ)) dλ
            + (1/2) Σ_{i,j,k,l=1}^d ∫_Π ∂²K(f(λ))/(∂f_ij ∂f_kl) (f^(T)_ij(λ) − f_ij(λ)) (f^(T)_kl(λ) − f_kl(λ)) dλ
            + O_P((M_T/T)^{3/2}).                                                        (3.2.25)

Under the alternative H_{1,T} the constant term becomes

    ∫_Π K(f(λ)) dλ = (1/4π) ∫_Π |R_{ab|C_ab}(λ)|² dλ = c_T² ∫_Π |f₀(λ)|² D_ab(λ) dλ.

The summands in the second term in (3.2.25) can be written as

    Σ_{k,l ∈ {a,b}} ∫_Π ∂K(f(λ))/∂f_{kl|C_ab} · ∂f_{kl|C_ab}/∂f_ij(λ) (f^(T)_ij(λ) − f_ij(λ)) dλ,

where the first derivatives of K(f(λ)) with respect to the partial spectra are of the form

    ∂K(f(λ))/∂f_{ab|C_ab} = c_T f₀(−λ) D_ab(λ),   ∂K(f(λ))/∂f_{ba|C_ab} = c_T f₀(λ) D_ab(λ)

and

    ∂K(f(λ))/∂f_{aa|C_ab} = ∂K(f(λ))/∂f_{bb|C_ab} = O(c_T²).


It then follows from

    ∫_Π f₀(λ) D_ab(λ) ∂f_{ab|C_ab}/∂f_ij(λ) (f^(T)_ij(λ) − f_ij(λ)) dλ = o_P(T^{−1/2})   (3.2.26)

(cf. Taniguchi et al., 1996; similar integrals are also treated in Section 3.3) and (3.1.1) that the first-order term in the Taylor expansion is of order o_P(c_T²). Further, the second derivatives in the third term in (3.2.25) can be expressed in terms of second derivatives with respect to the partial spectra. We have

    ∂²K(f(λ)) / (∂f_{ab|C_ab} ∂f_{ba|C_ab}) = D_ab(λ),

while all other second derivatives are of order O(c_T). Substituting these second derivatives into the third term in (3.2.25), we find that all summands corresponding to second derivatives of order O(c_T) are of order o_P(c_T²) by (3.1.1). Therefore the third term takes the form

    Σ_{i,j,k,l=1}^d ∫_Π D_ab(λ) ∂f_{ab|C_ab}/∂f_ij(λ) ∂f_{ba|C_ab}/∂f_kl(λ) (f^(T)_ij(λ) − f_ij(λ)) (f^(T)_kl(λ) − f_kl(λ)) dλ + o_P(c_T²)
      = Σ_{i,j,k,l=1}^d ∫_Π D_ab(λ) ∂f_{ab|C_ab}/∂f_ij(λ) ∂f_{ba|C_ab}/∂f_kl(λ) [f^(T)_ij(λ) f^(T)_kl(λ) − f_ij(λ) f_kl(λ)] dλ + o_P(c_T²),

since by Lemma 3.2.1 and (3.2.26)

    Σ_{i,j} ∫_Π D_ab(λ) ∂f_{ab|C_ab}/∂f_ij(λ) f_ij(λ) ∂f_{ba|C_ab}/∂f_kl(λ) (f^(T)_kl(λ) − f_kl(λ)) dλ
      = c_T ∫_Π 2π f₀(λ) D_ab(λ)² ∂f_{ba|C_ab}/∂f_kl(λ) (f^(T)_kl(λ) − f_kl(λ)) dλ = o_P(c_T²).

Thus, under the series of local alternatives H_{1,T}, we can write S^(T) as

    S^(T) = c_T² ν(f₀) + Σ_{i,j,k,l} ∫_Π D_ab(λ) ∂f_{ab|C_ab}/∂f_ij(λ) ∂f_{ba|C_ab}/∂f_kl(λ) [f^(T)_ij(λ) f^(T)_kl(λ) − f_ij(λ) f_kl(λ)] dλ
            + o_P(c_T²) + O_P((M_T/T)^{3/2}),                                            (3.2.27)

where

    ν(f₀) = ∫_Π |f₀(λ)|² D_ab(λ) dλ.

Theorem 3.2.10 Suppose that Assumptions 3.1.1 to 3.1.3 hold. If c_T = M_T^{1/4}/T^{1/2}, then under the series of local alternatives H_{1,T} the test statistic S^(T) has the asymptotic distribution

    (T S^(T) − M_T μ) / (√M_T σ)  →_D  N(ν(f₀), 1),

where μ and σ² are defined as in Theorem 3.2.2.
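As a numerical illustration (not part of the thesis), the standardization in the theorem can be sketched as follows. The function names and all numeric inputs are hypothetical, and μ and σ are assumed to have been computed from Theorem 3.2.2, which is not restated here:

```python
from math import erf, sqrt

def standardized_statistic(S, T, M, mu, sigma):
    # Standardize T*S^(T) as in Theorem 3.2.10:
    # z = (T*S^(T) - M_T*mu) / (sqrt(M_T)*sigma).
    return (T * S - M * mu) / (sqrt(M) * sigma)

def upper_tail_p_value(z):
    # P(Z > z) for Z ~ N(0, 1), via the error function.
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

# Hypothetical values: under H0 the standardized statistic is
# asymptotically N(0, 1); under H_{1,T} its mean shifts to nu(f0),
# so large values of z are evidence against H0.
z = standardized_statistic(S=0.0105, T=4096, M=50, mu=0.8, sigma=1.2)
print(z, upper_tail_p_value(z))
```

The one-sided rejection rule "reject H₀ if z exceeds the standard normal quantile" matches the fact that the noncentrality ν(f₀) is nonnegative.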


Proof. We show that Lemmas 3.2.3, 3.2.7, and 3.2.9 also hold for the second term in (3.2.27). For this, it is sufficient to incorporate a few changes into the proofs.

First, we note that instead of (3.2.12) we have

    ∫_{Π³} g(λ) w^(T)(λ−α₁) w^(T)(λ−α₂) [f_ij(α₁) f_kl(α₂) − f_ij(λ) f_kl(λ)] dα₁ dα₂ dλ = O(1/M_T⁴).

For the rest of the proofs, the additional term −g(λ) f_ij(λ) f_kl(λ) in the integrand has no effect.

In the proof of Lemma 3.2.7 we have made use of the null hypothesis in equation (3.2.21), which under H_{1,T} takes the form

    (M_T/T²) ∫_Π |R_{ab|C_ab}(λ)|⁴ dλ = O(M_T²/T⁴) = o(M_T/T²),

since M_T²/T → 0. Further, we have under H_{1,T}

    Σ_{i,j=1}^d ∫_Π ∂f_{ab|C_ab}/∂f_ij(λ) w^(T)(α−λ) f_ij(α) dλ = O(M_T^{1/4}/T^{1/2})   (3.2.28)

instead of order O(M_T^{−2}) as under the null hypothesis. We have made use of this in the derivation of the upper bounds (3.2.22) and (3.2.23). Substituting the correct rates under H_{1,T}, we obtain

    C (M_T/T²) log(T) M_T^{5/4}/T^{5/2} = o(M_T/T²)

instead of (3.2.22), and C M_T^{1/2}/T instead of (3.2.23).

To prove the convergence of the cumulants of higher order, we consider the case u ≥ k−1 separately. First assume that u = k−1. Because of the indecomposability of the partition, the u subpartitions must lie in different rows of the table (3.2.15). Suppose that α_{2,i₂}, …, α_{k,i_k} are the corresponding variables. Integrating first over α_{2,i₂}, …, α_{k,i_k} and then over λ₂, …, λ_k yields

    (C T^{u/2} M_T^{u/4} / (T^{2k} M_T²)) ∫_{Π^{k+2}} L^(M_T)(λ₁−α_{1,1})² L^(M_T)(λ₁−α_{1,2})² ∏_{j=1}^{m*} L^(T)(γ_j) dα_{1,1} ⋯ dα_{k,1} dα_{1,2} dλ₁
      ≤ (K T^{u/2} M_T^{u/4} / (T^{2k} M_T²)) log(T)^{m*−2} ∫_{Π³} L^(M_T)(λ₁−α_{1,1})² L^(M_T)(λ₁−α_{1,2})²
          · L^(T)(α_{1,1} ± α_{1,2})² dα_{1,1} dα_{1,2} dλ₁
      ≤ (K T^{(k−1)/2} M_T^{(k−1)/4} / T^{2k}) log(T)^{m*−2} T M_T
      ≤ C (M_T^{k/2}/T^k) (M_T²/T)^{(k−1)/2} (M_T^{(k+3)/4} / M_T^{3k/2−1}) log(T)^{m*−2} = o(M_T^{k/2}/T^k).


In the case u = k we get analogously the upper bound

    (C T^{u/2} M_T^{u/4} / T^{2k}) ∫_{Π^k} ∏_{j=1}^{m*} L^(T)(γ_j) dα_{1,1} ⋯ dα_{k,1}
      ≤ C (M_T^{k/2}/T^k) (M_T²/T)^{(k−2)/2} log(T)^{m*−1} / M_T^{5k/4−2}.

For the partitions with u ≤ k−2 we can modify the upper bound derived in the proof of Lemma 3.2.9 by multiplying with M_T^{9u/4}/T^{u/2}, which is the change of rate due to having (3.2.28) instead of (3.2.11). Thus we find that the corresponding summand in (3.2.24) is bounded by

    C (M_T^{k/2}/T^k) (M_T²/T)^{k−s+u/2} M_T^{1+u/4} log(T)^ρ / M_T^{k/2},

which is of the desired rate since 1 + u/4 < k/2.

3.2.3 Finite sample performance

We now examine the finite sample performance of the integrated partial spectral coherence S^(T) in comparison with the maximum partial spectral coherence S̄^(T) from Dahlhaus et al. (1997). For this purpose, we consider the following three-dimensional autoregressive process

    X₁(t) = 0.5 X₁(t−1) − 0.3 X₁(t−2) + ε₁(t),
    X₂(t) = 0.5 X₂(t−1) + 0.06 X₁(t−1) − 0.3 X₂(t−2) + ε₂(t),                            (3.2.29)
    X₃(t) = 0.5 X₃(t−1) + 0.03 X₂(t−1) − 0.3 X₃(t−2) + ε₃(t)

with ε(t) iid ∼ N(0, I₃). In Figure 3.2.1 the partial spectral coherences of the process are shown. Since |R_{13|2}(λ)|² is identically zero, the edge between vertices 1 and 3 is missing in the corresponding conditional correlation graph.
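A minimal simulation of the process (3.2.29) can be sketched as follows; the code is illustrative (the thesis does not give simulation code), and the burn-in length is an arbitrary choice to remove the influence of the zero initial values:

```python
import random

def simulate_ar_process(T, burn_in=500, seed=0):
    # Simulate the three-dimensional AR(2) process (3.2.29) with
    # standard normal innovations: X1 drives X2 (coefficient 0.06),
    # X2 drives X3 (coefficient 0.03), and there is no direct link
    # between X1 and X3.
    rng = random.Random(seed)
    n = T + burn_in
    x1, x2, x3 = [0.0] * n, [0.0] * n, [0.0] * n
    for t in range(2, n):
        x1[t] = 0.5 * x1[t-1] - 0.3 * x1[t-2] + rng.gauss(0, 1)
        x2[t] = 0.5 * x2[t-1] + 0.06 * x1[t-1] - 0.3 * x2[t-2] + rng.gauss(0, 1)
        x3[t] = 0.5 * x3[t-1] + 0.03 * x2[t-1] - 0.3 * x3[t-2] + rng.gauss(0, 1)
    return x1[burn_in:], x2[burn_in:], x3[burn_in:]

x1, x2, x3 = simulate_ar_process(T=4096)
```

Since the AR(2) polynomial 1 − 0.5z + 0.3z² has both roots outside the unit circle, each component is stationary and the burn-in suffices in practice.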

For the identification of the conditional correlation graph, we test for the presence of each possible edge in the graph. We therefore consider the test problems

    H₀^(a,b): |R_{ab|C_ab}(λ)|² ≡ 0   against   H₁^(a,b): |R_{ab|C_ab}(λ)|² ≢ 0          (3.2.30)

[Figure: three panels (a)-(c), each plotting partial coherence (vertical axis, 0 to 0.02) against frequency (horizontal axis, 0 to π).]

Figure 3.2.1: Partial spectral coherences for X(t): (a) |R_{12|3}(λ)|², (b) |R_{13|2}(λ)|², (c) |R_{23|1}(λ)|².


                H₀^(1,3) vs. H₁^(1,3)   H₀^(2,3) vs. H₁^(2,3)   H₀^(1,2) vs. H₁^(1,2)
     T      B_T    S^(T)     S̄^(T)        S^(T)     S̄^(T)        S^(T)     S̄^(T)
    --------------------------------------------------------------------------------
     4096  0.005    12.9      0.5          17.4      0.2          36.1      0.6
           0.01      8.6      3.6          15.6      3.9          40.5      6.0
           0.02      8.4      8.0          17.0     10.1          53.8     17.4
           0.04      8.9     11.2          22.0     16.9          71.3     35.9
     8192  0.005     9.1      2.4          16.7      2.6          58.2      4.5
           0.01      7.2      6.4          20.0      8.4          74.6     17.6
           0.02      6.9      9.3          26.2     14.0          89.5     39.0
           0.04      8.2     12.3          38.3     23.4          97.1     66.0
    16384  0.005     7.3      5.1          26.4      8.3          92.7     17.7
           0.01      6.8      7.7          36.4     15.4          98.3     43.2
           0.02      7.1     10.4          53.6     23.7          99.9     74.7
           0.04      8.4     12.1          70.6     37.6         100.0     95.5
    32768  0.005     6.3      7.1          53.5     13.9          99.9     47.2
           0.01      6.5      9.3          74.2     24.9         100.0     84.5
           0.02      7.4     10.6          89.4     41.2         100.0     99.2

Table 3.2.1: Rejection rates (in %) out of 5000 replications of the integrated partial spectral coherence S^(T) and the maximum partial spectral coherence S̄^(T) for the test problems H₀^(a,b) against H₁^(a,b) at significance level α = 5%. The true model is given by (3.2.29).

for (a, b) ∈ {(1, 2), (1, 3), (2, 3)}. For the edge (1, 3) the null hypothesis H₀^(1,3) is true, and the two components X₁(t) and X₃(t) are not directly related, but only indirectly via the second component X₂(t). For the other two edges (1, 2) and (2, 3), the partial spectral coherences are nonzero, although the deviations from the null hypothesis are small. We note that the partial spectral coherence |R_{23|1}(λ)|² deviates much less from the null hypothesis than |R_{12|3}(λ)|².

The process X(t) has been simulated with sample sizes T = 4096, 8192, 16384, and 32768. To examine the effect of the smoothing bandwidth B_T = 1/M_T, different bandwidths B_T = 0.005, 0.01, 0.02, and 0.04 were used for the smoothing of the periodogram. Table 3.2.1 reports the performance of the two test statistics S^(T) and S̄^(T) for each of the three test problems in (3.2.30) at significance level α = 0.05, based on 5000 replications.

The results show that under the null hypothesis H₀^(1,3) both statistics S^(T) and S̄^(T) in general exhibit little over-rejection, which diminishes with increasing sample size. For both statistics the type I error varies with the bandwidth, although S̄^(T) seems to be more affected by the choice of bandwidth. In particular, for small bandwidth and small sample size, the test based on S̄^(T) seems to break down completely, as its rejection rate falls below 5% globally under the null hypothesis and the local alternatives.

Comparing the performance under the alternatives, we find that in general S^(T) is more than twice as powerful as S̄^(T). For both statistics, the power increases for higher bandwidths, but the integrated partial spectral coherence seems to be more robust against undersmoothing and rejects the alternatives more often than the null hypothesis.

In summary, the simulation study shows that the new test performs reasonably well, although it tends to over-rejection under the null hypothesis. It has better power than the test suggested by Dahlhaus et al. (1997) and seems to be less affected by over- or undersmoothing.

We note that in the case d = 2 we obtain the integrated spectral coherence and thus a test for independence between two time series. The test is a frequency domain version of a test proposed by Haugh (1976) and Hong (1996), which is based on the residual cross-correlation function after prewhitening the two time series. The test statistic S^(T) has the advantage that no prewhitening of the series is necessary for the application of the test. This is of importance if, for example, point processes are considered, where no similar prewhitening methods exist. Since the frequency domain analysis of point processes is quite similar to the time series case (cf. Brillinger, 1972), we conjecture that the same results hold, with minor changes, for point processes.

3.3 Time domain analysis

In neurophysiology, the identification of synaptic connections is typically based on the recorded times of discharges of the neurons under study. The association between such neural spike trains is commonly measured by the cross-correlation histogram, which is the histogram of the times of occurrence of all output spikes relative to the times of occurrence of all input spikes. The peaks and troughs in such histograms indicate the change in probability for the occurrence of an output spike due to the presence of an input spike, thus revealing the type and the time delay of the synaptic connection. However, when analysing the structure of a larger neural ensemble, we cannot infer from the cross-correlation histogram to what extent these changes are due to a direct connection between the two neurons, to indirect connections via other neurons, or to common inputs.
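For concreteness, the (unpartialized) cross-correlation histogram described above can be computed from two sorted spike trains as in the following sketch; the function name and its parameters are illustrative, not taken from the thesis:

```python
from bisect import bisect_left, bisect_right

def cross_correlation_histogram(input_spikes, output_spikes, max_lag, bin_width):
    # Count output spikes at lag t - s relative to each input spike s,
    # collected into bins covering [-max_lag, max_lag).
    # Both spike trains must be sorted in increasing time order.
    n_bins = int(round(2 * max_lag / bin_width))
    counts = [0] * n_bins
    for s in input_spikes:
        lo = bisect_left(output_spikes, s - max_lag)
        hi = bisect_right(output_spikes, s + max_lag)
        for t in output_spikes[lo:hi]:
            b = int((t - s + max_lag) / bin_width)
            if 0 <= b < n_bins:
                counts[b] += 1
    return counts

# A peak at small positive lags suggests an excitatory connection from
# the input to the output neuron; a trough suggests an inhibitory one.
counts = cross_correlation_histogram([0.0, 10.0], [0.003, 10.004],
                                     max_lag=0.05, bin_width=0.005)
```

The binary search keeps the cost near-linear in the number of spikes, which matters for long recordings.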

In Brillinger et al. (1976), Rosenberg et al. (1989), and Dahlhaus et al. (1997), the frequency domain approach discussed in the previous sections has been applied to neurophysiological data. Although the method allows one to distinguish direct from indirect connections, no distinction is made between excitatory and inhibitory connections.

In this section, we propose a new partialized time domain statistic which combines the advantages of both of these methods. In particular, it can be interpreted in the same way as the cross-correlation histogram and thus allows one to distinguish between excitatory and inhibitory connections and to identify effects due to common inputs from neurons (or other stimuli) which have not been recorded.

3.3.1 Partial correlation functions

The method discussed in this section is entirely based on the relation between the sequences of times of occurrence of discharges recorded from different neurons. Therefore these neural spike trains can be represented by stochastic point processes. Here a point process on ℝ is defined as a random counting process N, where N(A) denotes the number of point events (representing the times of discharges) occurring in some Borel set A ⊆ ℝ. For the details about point processes we refer to Daley and Vere-Jones (1988).

Consider a multivariate stationary point process $N = (N_1,\ldots,N_d)'$ on $\mathbb{R}$ with finite moments of all orders. Then the reduced cumulant measure of $k$th order exists and is given by
\[
\operatorname{cum}\bigl\{N_{a_1}(A_1),\ldots,N_{a_k}(A_k)\bigr\}
= \int_{A_1}\cdots\int_{A_k} dC'_{a_1,\ldots,a_k}(t_1-t_k,\ldots,t_{k-1}-t_k)\,dt_k.
\]
Further, if the function $f_{a_1,\ldots,a_k}:\mathbb{R}^{k-1}\to\mathbb{C}$ given by
\[
f_{a_1,\ldots,a_k}(\lambda_1,\ldots,\lambda_{k-1})
= (2\pi)^{1-k}\int_{\mathbb{R}^{k-1}} \exp\Bigl(i\sum_{i=1}^{k-1}\lambda_i u_i\Bigr)\, dC'_{a_1,\ldots,a_k}(u_1,\ldots,u_{k-1})
\]
exists, we call it the $k$th order cumulant spectrum of $N$.

As in the time series case, we can investigate the direct association between two components $N_a$ and $N_b$ by considering partialized statistics with the linear effects of all remaining components $N_{C_{ab}}$ removed. Minimizing the mean squared error
\[
\mathbb{E}\Bigl[\int_{\mathbb{R}}\varphi(t)\,dN_a(t)
- \int_{\mathbb{R}}\varphi(t)\Bigl(\mu + \int_{\mathbb{R}}\gamma(t-u)\,dN_{C_{ab}}(u)\Bigr)dt\Bigr]^2
\]
with respect to $\gamma$ and $\mu$ for all integrable functions $\varphi$, we obtain the partial error process
\[
d\varepsilon_{a|C_{ab}}(t) = dN_a(t) - \mu^*\,dt - \int_{\mathbb{R}}\gamma^*(t-u)\,dN_{C_{ab}}(u)\,dt,
\]
where $\gamma^*$ has Fourier transform $\hat\gamma^*(\lambda) = f_{aC_{ab}}(\lambda)f_{C_{ab}C_{ab}}(\lambda)^{-1}$ and $\mu^* = p_a - \hat\gamma^*(0)p_{C_{ab}}$. The partialized statistics are then defined as statistics of the partial error processes $\varepsilon_{a|C_{ab}}(t)$ and $\varepsilon_{b|C_{ab}}(t)$. In particular, we obtain the partial spectrum
\[
f_{ab|C_{ab}}(\lambda) = f_{\varepsilon_{a|C_{ab}}\varepsilon_{b|C_{ab}}}(\lambda)
= f_{ab}(\lambda) - f_{aC_{ab}}(\lambda)f_{C_{ab}C_{ab}}(\lambda)^{-1}f_{C_{ab}b}(\lambda).
\]
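The defining formula for $f_{ab|C_{ab}}$ is, at each fixed frequency, a Schur complement of the spectral matrix. A minimal numerical sketch (NumPy; the helper name `partial_spectrum` is ours, not from the text) evaluates it and checks it against the equivalent route via the inverse spectral matrix that underlies Proposition 2.1.3:

```python
import numpy as np

def partial_spectrum(f, a, b):
    """Partial cross-spectrum f_{ab|C_ab} at one frequency.

    f    : (d, d) complex Hermitian spectral matrix f(lambda)
    a, b : distinct component indices; C_ab collects the remaining ones
    Implements f_{ab|C} = f_ab - f_aC f_CC^{-1} f_Cb.
    """
    d = f.shape[0]
    C = [k for k in range(d) if k not in (a, b)]
    if not C:                        # bivariate process: nothing to partial out
        return f[a, b]
    schur = f[a, b] - f[np.ix_([a], C)] @ np.linalg.solve(f[np.ix_(C, C)],
                                                          f[np.ix_(C, [b])])
    return schur.item()

# consistency check: the partialized 2x2 block equals the inverse of the
# corresponding block of g = f^{-1} (Schur-complement identity)
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5)) + 1j * rng.normal(size=(5, 5))
f = A @ A.conj().T + 5 * np.eye(5)           # Hermitian positive definite
g = np.linalg.inv(f)
block = np.linalg.inv(g[np.ix_([0, 1], [0, 1])])
assert np.allclose(partial_spectrum(f, 0, 1), block[0, 1])
```

The check reflects the identity used throughout this chapter: partialization with respect to all remaining components can be read off from the inverse of the full spectral matrix.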

Noting that by Lemma 3.2.1
\[
\hat\gamma^*_i(\lambda) = -\frac{\partial f_{ab|C_{ab}}}{\partial f_{ib}}(\lambda),
\]
we can rewrite the partial cumulant spectra of fourth order in terms of the derivatives of $f_{ab|C_{ab}}$ and the cumulant spectra $f_{ijkl}$. In particular we have
\[
f_{abab|C_{ab}}(\lambda,-\lambda,\mu)
= \sum_{i,j,k,l=1}^{d} \frac{\partial f_{ab|C_{ab}}}{\partial f_{ij}}(\lambda)\,
\frac{\partial f_{ab|C_{ab}}}{\partial f_{kl}}(\mu)\, f_{ijkl}(\lambda,-\lambda,\mu). \tag{3.3.1}
\]

Instead of considering frequency domain statistics, we can also use the usual time domain statistics for point processes. We define the partial covariance density as the density of the absolutely continuous part of the reduced cumulant measure $C'_{\varepsilon_{a|C_{ab}}\varepsilon_{b|C_{ab}}}$, that is, we have for $a \neq b$
\[
\int_{\mathbb{R}^2}\varphi_1(t_1)\varphi_2(t_2)\,q_{ab|C_{ab}}(t_1-t_2)\,dt_1\,dt_2
= \operatorname{cov}\Bigl\{\int_{\mathbb{R}}\varphi_1(t_1)\,d\varepsilon_{a|C_{ab}}(t_1),
\int_{\mathbb{R}}\varphi_2(t_2)\,d\varepsilon_{b|C_{ab}}(t_2)\Bigr\}
\]
and
\[
\int_{\mathbb{R}^2}\varphi_1(t_1)\varphi_2(t_2)\bigl[q_{aa|C_{ab}}(t_1-t_2) + p_a\delta_{t_1t_2}\bigr]\,dt_1\,dt_2
= \operatorname{cov}\Bigl\{\int_{\mathbb{R}}\varphi_1(t_1)\,d\varepsilon_{a|C_{ab}}(t_1),
\int_{\mathbb{R}}\varphi_2(t_2)\,d\varepsilon_{a|C_{ab}}(t_2)\Bigr\},
\]
with a similar equation for the case $a = b$. Naturally, the partial covariance density is related to the partial spectral density by
\[
q_{ab|C_{ab}}(u) = \int_{\mathbb{R}}\bigl(f_{ab|C_{ab}}(\lambda) - p_a\delta_{ab}\bigr)\exp(-i\lambda u)\,d\lambda.
\]

We note that it is not possible to standardize the partial covariance density properly such that its values lie in $[-1,1]$, since the reduced covariance measure $C_{aa|C_{ab}}$ has mass $p_a$ at the origin due to the discreteness of the sample paths of $N_a$. Instead, we suggest the scaled partial covariance density
\[
\rho_{ab|C_{ab}}(u) = \frac{q_{ab|C_{ab}}(u)}{\sqrt{p_a p_b}}
\]
as a measure of ``correlation'', since the variance of $N_a$ is $p_a$ (though on a different scale). For point processes such as Hawkes' self-exciting processes (Hawkes, 1971a, 1971b) with linear relations between the components, this function has the property that it stays constant if the mean intensities of all components are increased by one constant factor.

If the process has been observed on the interval $[0,T]$, the spectral densities $f_{ab}(\lambda)$ can be estimated by the periodogram
\[
I^{(T)}_{ab}(\lambda) = \bigl(2\pi H^{(T)}_2(0)\bigr)^{-1} d^{(T)}_a(\lambda)\, d^{(T)}_b(-\lambda),
\]
where
\[
d^{(T)}_a(\lambda) = \int_{\mathbb{R}} h^{(T)}(t)\exp(-i\lambda t)\bigl[dN_a(t) - p^{(T)}_a\,dt\bigr]
\]
is the finite Fourier transform of the point process, $h^{(T)}(t) = h(t/T)$ is a data taper with Fourier transforms
\[
H^{(T)}_k(\lambda) = \int_{\mathbb{R}} h^{(T)}(t)^k \exp(-i\lambda t)\,dt,
\]
and
\[
p^{(T)}_a = \bigl(H^{(T)}_1(0)\bigr)^{-1}\int_{\mathbb{R}} h^{(T)}(t)\,dN_a(t)
\]
is an estimate for the mean intensity $p_a$ of the $a$th component of the process. We assume that the taper function $h:\mathbb{R}\to\mathbb{R}$ is of bounded variation and vanishes outside the interval $[0,1]$. We further define $H_k = \int_{\mathbb{R}} h(t)^k\,dt$.
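For a spike train, the integrals above reduce to sums over the event times, so the tapered finite Fourier transform can be evaluated directly. A minimal sketch (NumPy; function names are ours) using the rectangular taper $h = 1_{[0,1]}$ as default:

```python
import numpy as np

def _trap(vals, grid):
    """Trapezoidal rule, avoiding version-dependent numpy helpers."""
    return np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(grid))

def point_process_dft(times, T, freqs, h=None):
    """Tapered finite Fourier transform d_a^{(T)}(lambda) of a spike train.

    times : event times in [0, T];  freqs : angular frequencies lambda
    h     : taper on [0, 1] (default: rectangular taper)
    """
    if h is None:
        h = lambda x: np.ones_like(x, dtype=float)
    times = np.asarray(times, dtype=float)
    grid = np.linspace(0.0, 1.0, 2049)
    ht = h(times / T)                                  # h^{(T)}(t_j) at events
    H1_0 = T * _trap(h(grid), grid)                    # H_1^{(T)}(0)
    p_hat = ht.sum() / H1_0                            # mean-intensity estimate
    d = np.empty(len(freqs), dtype=complex)
    for n, lam in enumerate(freqs):
        # sum over events minus the integral of the tapered mean intensity
        mean = p_hat * T * _trap(h(grid) * np.exp(-1j * lam * grid * T), grid)
        d[n] = np.sum(ht * np.exp(-1j * lam * times)) - mean
    return d

rng = np.random.default_rng(1)
T = 100.0
spikes = np.sort(rng.uniform(0.0, T, size=250))        # toy Poisson-like data
d = point_process_dft(spikes, T, freqs=[0.0, 0.5, 1.0])
assert abs(d[0]) < 1e-8    # mean correction makes d vanish at lambda = 0
```

The final assertion illustrates the role of the mean correction: at $\lambda = 0$ the centred transform vanishes by construction of $p^{(T)}_a$.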

As in the case of time series, smoothing of the periodogram leads to a consistent spectral estimate,
\[
f^{(T)}_{ab}(\lambda) = \int_{\mathbb{R}} I^{(T)}_{ab}(\alpha)\, w^{(T)}(\lambda-\alpha)\,d\alpha,
\]
where $w^{(T)}(\lambda) = M_T\, w(M_T\lambda)$ is a kernel function. The partial spectra and partial spectral coherences can then be estimated from the inverse of the estimated spectral matrix by Proposition 2.1.3. Further, we can estimate the partial covariance density $q_{ab|C_{ab}}(u)$ for $a \neq b$ by
\[
q^{(T)}_{ab|C_{ab}}(u) = \int_{\mathbb{R}} f^{(T)}_{ab|C_{ab}}(\lambda)\exp(-iu\lambda)\,\zeta(\lambda)\,d\lambda,
\]
where $\zeta(\lambda)$ is some convergence factor decreasing to zero as $|\lambda| \to \infty$. Similarly we obtain estimates for $q_{aa|C_{ab}}(u)$ and $q_{aa|C_a}(u)$. Finally, the scaled partial covariance densities can be estimated by substituting $p^{(T)}_a$ for the mean intensities $p_a$.
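The damped inverse transform can be sketched numerically: on a frequency grid, multiply the estimated partial spectrum by the convergence factor and integrate. Below, a Gaussian $\zeta$ is used purely for illustration (all names are ours), and the routine is checked on a constant spectrum, where the integral is known in closed form:

```python
import numpy as np

def partial_cov_density(f_vals, lam, u, zeta):
    """q^{(T)}_{ab|C_ab}(u) = int f(lambda) e^{-iu lambda} zeta(lambda) dlambda,
    approximated by the trapezoidal rule on the grid `lam`."""
    integrand = f_vals * np.exp(-1j * u * lam) * zeta(lam)
    return np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(lam))

lam = np.linspace(-40.0, 40.0, 40001)
zeta = lambda x: np.exp(-x**2 / 2.0)      # illustrative convergence factor
c = 0.3
q = partial_cov_density(np.full_like(lam, c), lam, u=1.5, zeta=zeta)
# for a constant spectrum f = c, the damped inversion gives
# c * sqrt(2 pi) * exp(-u^2 / 2)
expected = c * np.sqrt(2 * np.pi) * np.exp(-1.5**2 / 2.0)
assert abs(q - expected) < 1e-6
```

In practice `f_vals` would hold the estimated partial spectrum $f^{(T)}_{ab|C_{ab}}$ on the grid, and the scaled density is obtained by dividing by $\sqrt{p^{(T)}_a p^{(T)}_b}$.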

In the next section we derive (in a more general context) the asymptotic normality of the partial covariance density and show that
\[
\sqrt{T}\bigl(q^{(T)}_{ab|C_{ab}}(u) - q^{(\zeta)}_{ab|C_{ab}}(u)\bigr),
\qquad\text{where}\qquad
q^{(\zeta)}_{ab|C_{ab}}(u) = \int_{\mathbb{R}} f_{ab|C_{ab}}(\lambda)\exp(-iu\lambda)\,\zeta(\lambda)\,d\lambda,
\]
converges weakly to a Gaussian process. For the ordinary covariance density, a similar result has been proved by Eichler (1995).

3.3.2 Empirical partial spectral processes

For some measurable function $\zeta$ let $\mathscr{L}^2_\zeta(\mathbb{R})$ be the space of all complex-valued functions $g$ on $\mathbb{R}$ such that the seminorm
\[
\rho_\zeta(g) = \Bigl(\int_{\mathbb{R}} |g(\lambda)|^2\,\zeta(\lambda)\,d\lambda\Bigr)^{1/2}
\]
is finite. Further, let $\mathscr{G}$ be a subset of $\mathscr{L}^2_\zeta(\mathbb{R})$ and $\mathscr{X}$ be the space of all bounded, complex-valued functions on $\mathscr{G}$ which are uniformly continuous with respect to the seminorm. We equip $\mathscr{X}$ with the Borel field $\mathscr{B}_{\mathscr{X}}$ generated by the open sets corresponding to the uniform norm $\|x\|_\infty = \sup_{g\in\mathscr{G}} |x(g)|$ for $x \in \mathscr{X}$.

We consider empirical partial spectral processes
\[
E^{(T)}(g) = \sqrt{T}\int_{\mathbb{R}} g(\lambda)\bigl[f^{(T)}_{ab|C_{ab}}(\lambda) - f_{ab|C_{ab}}(\lambda)\bigr]\zeta(\lambda)\,d\lambda
\]


indexed by $g \in \mathscr{G}$. If the partial spectrum $f_{ab|C_{ab}}(\lambda)$ is bounded, it follows from the Cauchy-Schwarz inequality and the almost sure boundedness of $f^{(T)}_{ab|C_{ab}}(\lambda)$ for a fixed realisation of $N$ and fixed $T$ that the sample paths of $E^{(T)}(g)$ are almost surely uniformly continuous with respect to $\rho_\zeta$ and thus lie in $\mathscr{X}$.

The limit process of E(T )(g) for T → ∞ is defined by its finite dimensional distri-butions. We call a stochastic process E(g) a partial spectral process if its sample pathsare in X almost surely and its finite dimensional distributions are normal with mean zeroand covariances

covE(g), E(h)

=2πH4

H22

∫R2

g(λ)ζ(λ)h(µ)ζ(µ)fabab|Cab(λ,−λ, µ)dλdµ

+2πH4

H22

∫R

g(λ)ζ(λ)[h(λ)ζ(λ)faa|Cab(λ)fbb|Cab(λ)

+ h(−λ)ζ(λ)fab|Cab(λ)fab|Cab(λ)]dλ

and by Lemma 3.2.1 and (3.3.1)

=2πH4

H22

d∑i,j,k,l=1

∫R2

g(λ)ζ(λ)h(µ)ζ(µ)∂fab|Cabfij

(λ)∂fab|Cabfkl

(µ)fijkl(λ,−λ, µ)dλdµ

+2πH4

H22

d∑i,j,k,l=1

∫R

g(λ)ζ(λ)[h(λ)ζ(λ)

∂fab|Cabfij

(λ)∂fba|Cabflk

(λ)fik(λ)flj(λ)

+ h(−λ)ζ(λ)∂fab|Cabfij

(λ)∂fab|Cabfkl

(λ)fil(λ)fkj(λ)]dλ.

For the results in this section, we need to impose conditions on the strength of the dependence of the data and on the size of the index class $\mathscr{G}$. The latter is determined by the covering number of $\mathscr{G}$, which we denote by
\[
N(\delta, \rho_\zeta, \mathscr{G})
= \inf\Bigl\{m \in \mathbb{N} \,\Big|\, \exists\, g_1,\ldots,g_m \in \mathscr{L}^2_\zeta(\mathbb{R})\
\forall\, g \in \mathscr{G}: \min_{1\le k\le m} \rho_\zeta(g - g_k) < \delta\Bigr\}
\]
(e.g. Pollard, 1984). If $\mathscr{G}$ is a totally bounded subset of $\mathscr{L}^2_\zeta(\mathbb{R})$, then $N(\delta, \rho_\zeta, \mathscr{G})$ is finite for all $\delta > 0$. We make the following assumptions.

Assumption 3.3.1 $N = (N_1,\ldots,N_d)'$ is an orderly multivariate stationary point process on $\mathbb{R}$ such that the following conditions hold.

(i) $N$ possesses finite moments of all orders and reduced cumulant measures $C'_{a_1,\ldots,a_k}$ such that there exists a constant $C$ with
\[
\int_{\mathbb{R}^{k-1}} \bigl(1 + |u_j|\bigr)\,\bigl|dC'_{a_1,\ldots,a_k}(u_1,\ldots,u_{k-1})\bigr| \le C^k \tag{3.3.2}
\]
for all $j \in \{1,\ldots,k-1\}$ and $k \ge 2$.


(ii) The spectral matrix $f(\lambda)$ of $N$ satisfies the boundedness condition
\[
a_1 1_d \le f(\lambda) \le a_2 1_d \qquad \forall\,\lambda \in \mathbb{R}
\]
for constants $a_1$ and $a_2$ such that $0 < a_1 \le a_2 < \infty$.

Assumption 3.3.2 $w:\mathbb{R}\to\mathbb{R}$ is a nonnegative, bounded, and symmetric function with compact support $[-1,1]$ such that
\[
\int_{\mathbb{R}} w(\alpha)\,d\alpha = 1.
\]
Further, $\{M_T\}_{T\in\mathbb{N}}$ is a real-valued sequence such that $M_T = O(T^\beta)$ with $\tfrac14 < \beta < \tfrac12$.

Assumption 3.3.3 The function $\zeta \in \mathscr{L}(\mathbb{R})$ is nonnegative, bounded, symmetric, and monotonically decreasing on $\mathbb{R}_+$. Further we assume that $\zeta_0(\lambda) = \zeta(\lambda)^{1/2}$ is also integrable.

Assumption 3.3.4 $\mathscr{G}$ is a totally bounded subset of $\mathscr{L}^2_\zeta(\mathbb{R})$ such that for all $g \in \mathscr{G}$ the product $g\cdot\zeta$ is bounded and the covering numbers of $\mathscr{G}$ satisfy
\[
\int_0^1 \Bigl[\log\Bigl(\frac{N(u,\rho_\zeta,\mathscr{G})^2}{u}\Bigr)\Bigr]^2\,du < \infty.
\]

By Assumption 3.3.1 the partial spectrum is a bounded and twice continuously differentiable function of the ordinary spectra. Further, under the conditions of Assumption 3.3.2 the kernel estimates $f^{(T)}_{ij}(\lambda)$ are consistent. We therefore can rewrite the empirical partial spectral process with a Taylor expansion as
\[
\begin{aligned}
E^{(T)}(g) &= \sqrt{T}\sum_{i,j=1}^{d}\int_{\mathbb{R}} g(\lambda)\,
\frac{\partial f_{ab|C_{ab}}}{\partial f_{ij}}(\lambda)
\bigl[f^{(T)}_{ij}(\lambda) - f_{ij}(\lambda)\bigr]\zeta(\lambda)\,d\lambda + R_T \\
&= \sqrt{T}\sum_{i,j=1}^{d}\int_{\mathbb{R}^2} g(\lambda)\zeta(\lambda)\,
\frac{\partial f_{ab|C_{ab}}}{\partial f_{ij}}(\lambda)\,w^{(T)}(\alpha-\lambda)
\bigl[I^{(T)}_{ij}(\alpha) - f_{ij}(\alpha)\bigr]d\alpha\,d\lambda \\
&\quad+ \sqrt{T}\sum_{i,j=1}^{d}\int_{\mathbb{R}^2} g(\lambda)\zeta(\lambda)\,
\frac{\partial f_{ab|C_{ab}}}{\partial f_{ij}}(\lambda)\,w^{(T)}(\alpha-\lambda)
\bigl[f_{ij}(\alpha) - f_{ij}(\lambda)\bigr]d\alpha\,d\lambda + R_T,
\end{aligned}
\]
where the remainder term is of the form
\[
R_T = T^{1/2}\sum_{i,j,k,l=1}^{d}\int_{\mathbb{R}}
\frac{\partial^2 f_{ab|C_{ab}}(\lambda)}{\partial f_{ij}\,\partial f_{kl}}\bigg|_{f=\bar f}\,
g(\lambda)\zeta(\lambda)
\bigl(f^{(T)}_{ij}(\lambda) - f_{ij}(\lambda)\bigr)
\bigl(f^{(T)}_{kl}(\lambda) - f_{kl}(\lambda)\bigr)\,d\lambda.
\]
We note that (3.1.1) and (3.1.2) hold also for point processes. Application of the Cauchy-Schwarz inequality then yields for the remainder
\[
|R_T(g)| \le C\rho_\zeta(g)\,O_P\Bigl(\frac{M_T}{\sqrt{T}}\Bigr),
\]
while by (3.1.2) the bias term is bounded by
\[
C\sqrt{T}\,\rho_\zeta(g)\Bigl(\int_{\mathbb{R}}\Bigl|\int_{\mathbb{R}}
w^{(T)}(\alpha-\lambda)\bigl[f_{ij}(\alpha) - f_{ij}(\lambda)\bigr]d\alpha\Bigr|^2
\zeta(\lambda)\,d\lambda\Bigr)^{1/2}
\le C\rho_\zeta(g)\,\frac{\sqrt{T}}{M_T^2}.
\]

Thus the empirical partial spectral process can be approximated by
\[
E^{(T)}(g) = \sum_{i,j=1}^{d} E^{(T)}_{ij,\zeta_0}\bigl(\varphi^{(T)}_{ij}(g)\bigr) + o_P\bigl(\rho_\zeta(g)\bigr), \tag{3.3.3}
\]
where
\[
E^{(T)}_{ij,\zeta_0}\bigl(\varphi^{(T)}_{ij}(g)\bigr)
= \sqrt{T}\int_{\mathbb{R}} \varphi^{(T)}_{ij}(g,\lambda)
\bigl[I^{(T)}_{ij}(\lambda) - f_{ij}(\lambda)\bigr]\zeta_0(\lambda)\,d\lambda
\]
are empirical spectral processes (cf. Eichler, 1995) with arguments
\[
\varphi^{(T)}_{ij}(g,\lambda) = \frac{1}{\zeta_0(\lambda)}\int_{\mathbb{R}}
\frac{\partial f_{ab|C_{ab}}}{\partial f_{ij}}(\alpha)\,g(\alpha)\zeta(\alpha)\,
w^{(T)}(\alpha-\lambda)\,d\alpha.
\]
Similarly we define
\[
\varphi_{ij}(g,\lambda) = \frac{1}{\zeta_0(\lambda)}\,
\frac{\partial f_{ab|C_{ab}}}{\partial f_{ij}}(\lambda)\,g(\lambda)\zeta(\lambda).
\]
Then both functions are bounded with $\|\varphi^{(T)}_{ij}(g)\|_\infty \le \|\varphi_{ij}(g)\|_\infty \le C$ and further
\[
\int_{\mathbb{R}} \bigl|\varphi^{(T)}_{ij}(g,\lambda)\bigr|\zeta_0(\lambda)\,d\lambda
\le \int_{\mathbb{R}} \bigl|\varphi_{ij}(g,\lambda)\bigr|\zeta_0(\lambda)\,d\lambda \le C. \tag{3.3.4}
\]
Noting that $w^{(T)}(\lambda)$ is an approximate identity, we also obtain
\[
\int_{\mathbb{R}} \bigl|\varphi^{(T)}_{ij}(g,\lambda) - \varphi_{ij}(g,\lambda)\bigr|\zeta_0(\lambda)\,d\lambda = o(1). \tag{3.3.5}
\]

Theorem 3.3.5 Suppose that Assumptions 3.3.1 to 3.3.4 hold. Then the finite dimensional distributions of $\{E^{(T)}(g), g \in \mathscr{G}\}$ converge to the finite dimensional distributions of the partial spectral process $\{E(g), g \in \mathscr{G}\}$.

Proof. Because of (3.3.3) it is sufficient to show that the sum of empirical spectral processes on the right side is asymptotically normal. For this, we show the convergence of the cumulants of first, second, and higher order to the corresponding cumulants of the limiting distribution. Since the arguments $\varphi^{(T)}_{ij}(g)$ depend on $T$, we cannot apply the results in Eichler (1995), although the line of proof is similar.

First, we consider the Fourier transforms
\[
\tilde d^{(T)}_a(\lambda) = \int_{\mathbb{R}} h^{(T)}(t)\exp(-i\lambda t)\bigl[dN_a(t) - p_a\,dt\bigr],
\]
which are related to $d^{(T)}_a(\lambda)$ by the equation
\[
d^{(T)}_a(\lambda) = \tilde d^{(T)}_a(\lambda) - H^{(T)}_1(\lambda)H^{(T)}_1(0)^{-1}\tilde d^{(T)}_a(0).
\]
For the evaluation of the cumulants of $d^{(T)}_a(\lambda)$, we need the following nonperiodic extension of the $L^{(T)}$-function,
\[
L^{(T)}(\lambda) = \begin{cases} T, & |\lambda| \le 1/T, \\[2pt] \dfrac{1}{|\lambda|}, & |\lambda| > 1/T, \end{cases} \tag{3.3.6}
\]
which has been introduced by Eichler (1995) for handling the cumulants of time-continuous processes. Its properties are summarized in the appendix. We then have
\[
\operatorname{cum}\bigl\{d^{(T)}_a(\lambda), d^{(T)}_b(\mu)\bigr\}
= \operatorname{cum}\bigl\{\tilde d^{(T)}_a(\lambda), \tilde d^{(T)}_b(\mu)\bigr\}
+ O\bigl(T^{-1}L^{(T)}(\lambda)L^{(T)}(\mu)\bigr), \tag{3.3.7}
\]
which immediately leads to
\[
\bigl|\mathbb{E}\,E^{(T)}_{ij,\zeta_0}\bigl(\varphi^{(T)}_{ij}(g)\bigr)\bigr|
\le \sqrt{T}\int_{\mathbb{R}} \bigl|\varphi^{(T)}_{ij}(g,\lambda)\bigr|\,
\bigl|\mathbb{E}\,I^{(T)}_{ij}(\lambda) - f_{ij}(\lambda)\bigr|\,\zeta_0(\lambda)\,d\lambda
\le \frac{C}{\sqrt{T}}.
\]

A similar equation to (3.3.7), with remainder of order $O\bigl(L^{(T)}(\lambda) + L^{(T)}(\mu)\bigr)$, holds for the cumulant $\operatorname{cum}\bigl\{d^{(T)}_i(\lambda), d^{(T)}_j(-\lambda), d^{(T)}_k(\mu), d^{(T)}_l(-\mu)\bigr\}$. Thus we find for the cumulants of second order
\[
\begin{aligned}
&\operatorname{cum}\bigl\{E^{(T)}_{ij,\zeta_0}\bigl(\varphi^{(T)}_{ij}(g_1)\bigr),
E^{(T)}_{kl,\zeta_0}\bigl(\varphi^{(T)}_{kl}(g_2)\bigr)\bigr\} \\
&\qquad= T\int_{\mathbb{R}^2} \varphi^{(T)}_{ij}(g_1,\lambda)\,\varphi^{(T)}_{kl}(g_2,\mu)\,
\operatorname{cum}\bigl\{I^{(T)}_{ij}(\lambda), I^{(T)}_{kl}(\mu)\bigr\}\,
\zeta_0(\lambda)\zeta_0(\mu)\,d\lambda\,d\mu \\
&\qquad= \frac{2\pi H_4}{H_2^2}\int_{\mathbb{R}^2}\varphi^{(T)}_{ij}(g_1,\lambda)\,
\varphi^{(T)}_{kl}(g_2,\mu)\,\zeta_0(\lambda)\zeta_0(\mu)\,
f_{ijkl}(\lambda,-\lambda,\mu)\,d\lambda\,d\mu \\
&\qquad\quad+ \frac{2\pi H_4}{H_2^2}\int_{\mathbb{R}^2}\varphi^{(T)}_{ij}(g_1,\lambda)\,
\varphi^{(T)}_{kl}(g_2,\mu)\,\zeta_0(\lambda)\zeta_0(\mu)
\bigl[\Phi^{(T)}_2(\lambda+\mu)f_{ik}(\lambda)f_{jl}(-\lambda) \\
&\qquad\qquad+ \Phi^{(T)}_2(\lambda-\mu)f_{il}(\lambda)f_{jk}(-\lambda)\bigr]\,d\lambda\,d\mu + o(1),
\end{aligned}
\]
where $\Phi^{(T)}_2(\lambda)$ is defined as in (3.2.7). The first term converges to
\[
\frac{2\pi H_4}{H_2^2}\int_{\mathbb{R}^2}\varphi_{ij}(g_1,\lambda)\,\varphi_{kl}(g_2,\mu)\,
\zeta_0(\lambda)\zeta_0(\mu)\,f_{ijkl}(\lambda,-\lambda,\mu)\,d\lambda\,d\mu
\]
since by (3.3.4) and (3.3.5)
\[
\begin{aligned}
&\int_{\mathbb{R}^2}\bigl|\varphi^{(T)}_{ij}(g_1,\lambda)\varphi^{(T)}_{kl}(g_2,\mu)
- \varphi_{ij}(g_1,\lambda)\varphi_{kl}(g_2,\mu)\bigr|\,
\zeta_0(\lambda)\zeta_0(\mu)\,f_{ijkl}(\lambda,-\lambda,\mu)\,d\lambda\,d\mu \\
&\qquad\le C\int_{\mathbb{R}}\bigl|\varphi^{(T)}_{ij}(g_1,\lambda)
- \varphi_{ij}(g_1,\lambda)\bigr|\zeta_0(\lambda)\,d\lambda
\int_{\mathbb{R}}\bigl|\varphi^{(T)}_{kl}(g_2,\mu)\bigr|\zeta_0(\mu)\,d\mu \\
&\qquad\quad+ C\int_{\mathbb{R}}\bigl|\varphi^{(T)}_{kl}(g_2,\mu)
- \varphi_{kl}(g_2,\mu)\bigr|\zeta_0(\mu)\,d\mu
\int_{\mathbb{R}}\bigl|\varphi_{ij}(g_1,\lambda)\bigr|\zeta_0(\lambda)\,d\lambda = o(1).
\end{aligned}
\]


Next, we define
\[
h_{ij,r}(\lambda) = \frac{\partial f_{ab|C_{ab}}}{\partial f_{ij}}(\lambda)\,\zeta(\lambda)\,g_r(\lambda)
\]
and
\[
g^{(T)}(\lambda) = \int_{\mathbb{R}^2} h_{ij,1}(\alpha+\beta+\lambda)\,
f_{ik}(\lambda+\beta)\,f_{jl}(-\lambda-\beta)\,w^{(T)}(\alpha)\,w^{(T)}(\beta)\,d\alpha\,d\beta.
\]
We note that $\|g^{(T)}\|_\infty$ is uniformly bounded in $T$. Then, since $\Phi^{(T)}_2(\lambda)$ and $w^{(T)}(\lambda)$ are approximate identities, we can rewrite the first summand in the second term as
\[
\begin{aligned}
&\int_{\mathbb{R}^2}\varphi^{(T)}_{ij}(g_1,\lambda)\,\varphi^{(T)}_{kl}(g_2,\mu)\,
\zeta_0(\lambda)\zeta_0(\mu)\,\Phi^{(T)}_2(\lambda+\mu)\,f_{ik}(\lambda)f_{lj}(\lambda)\,d\lambda\,d\mu \\
&\qquad= \int_{\mathbb{R}^2} g^{(T)}(\lambda)\,h_{kl,2}(\mu-\lambda)\,\Phi^{(T)}_2(\mu)\,d\lambda\,d\mu \\
&\qquad= \int_{\mathbb{R}^3} h_{ij,1}(\alpha+\lambda)\,h_{kl,2}(\beta-\lambda)\,
f_{ik}(\lambda)f_{lj}(\lambda)\,w^{(T)}(\alpha)\,w^{(T)}(\beta)\,d\alpha\,d\beta\,d\lambda + o(1) \\
&\qquad= \int_{\mathbb{R}} h_{ij,1}(\lambda)\,h_{kl,2}(-\lambda)\,
f_{ik}(\lambda)f_{lj}(\lambda)\,d\lambda + o(1).
\end{aligned}
\]
The convergence of the second summand can be obtained similarly. Substituting the formulas for $h_{ij,1}$ and $h_{kl,2}$ and summing over the indices $i, j, k, l$, we obtain the cumulants of second order of the limiting process by application of Lemma 3.2.1.

Finally, using the same notation as in the previous section, the cumulants of higher order are bounded by
\[
\begin{aligned}
&\bigl|\operatorname{cum}\bigl\{E^{(T)}_{i_1j_1,\zeta_0}\bigl(\varphi^{(T)}_{i_1j_1}(g_1)\bigr),
\ldots, E^{(T)}_{i_kj_k,\zeta_0}\bigl(\varphi^{(T)}_{i_kj_k}(g_k)\bigr)\bigr\}\bigr| \\
&\qquad\le \sum_{\text{i.p.}}\sum_{J\subseteq M} O(T^{-k/2})
\int_{\mathbb{R}^k}\prod_{r=1}^{k}\bigl|\varphi^{(T)}_{i_rj_r}(g_r,\lambda_r)\bigr|\zeta_0(\lambda_r)
\prod_{r\in J} L^{(T)}(\beta_r)\,d\lambda_1\cdots d\lambda_k = o(1),
\end{aligned}
\]
which can be shown as in Eichler (1995), since the functions $\varphi^{(T)}_{i_rj_r}(g_r,\lambda_r)$ are uniformly bounded in $T$.

Lemma 3.3.6 Suppose that Assumptions 3.3.1 to 3.3.4 hold. The cumulants of the processes $\{E^{(T)}_{ij,\zeta_0}(\varphi^{(T)}_{ij}(g)), g \in \mathscr{G}\}$ are uniformly bounded by
\[
\operatorname{cum}_k\bigl(E^{(T)}_{ij,\zeta_0}\bigl(\varphi^{(T)}_{ij}(g)\bigr)\bigr)
\le (2k)!\,c_0^k\,\rho_\zeta(g)^k.
\]

Proof. Similarly as in the proof of Theorem 2.3 in Eichler (1995), we have
\[
\operatorname{cum}_k\bigl(E^{(T)}_{ij,\zeta_0}\bigl(\varphi^{(T)}_{ij}(g)\bigr)\bigr)
\le (2k)!\,c_0^k\,\rho_{\zeta_0}\bigl(\varphi^{(T)}_{ij}(g)\bigr)^k. \tag{3.3.8}
\]
Now since $w^{(T)}$ has bounded support $[-B_T, B_T]$ with $B_T = M_T^{-1}$, we have
\[
\int_{\mathbb{R}}\frac{\zeta(\lambda)}{\zeta_0(\alpha)}\,w^{(T)}(\alpha-\lambda)\,d\alpha
\le \sup_{\xi\in[-B_T,B_T]}\Bigl|\frac{\zeta(\lambda)}{\zeta_0(\lambda-\xi)}\Bigr| \le C
\]
and thus by Jensen's inequality
\[
\begin{aligned}
\int_{\mathbb{R}}\bigl|\varphi^{(T)}_{ij}(g,\alpha)\bigr|^2\zeta_0(\alpha)\,d\alpha
&\le \int_{\mathbb{R}^2}|g(\lambda)|^2
\Bigl|\frac{\partial f_{ab|C_{ab}}}{\partial f_{ij}}(\lambda)\Bigr|^2
\zeta(\lambda)^2\,\zeta_0(\alpha)^{-1}\,w^{(T)}(\lambda-\alpha)\,d\lambda\,d\alpha \\
&\le \Bigl\|\frac{\partial f_{ab|C_{ab}}}{\partial f_{ij}}\Bigr\|_\infty^2
\int_{\mathbb{R}}|g(\lambda)|^2\zeta(\lambda)\,d\lambda\cdot
\sup_{\lambda\in\mathbb{R}}\int_{\mathbb{R}}\frac{\zeta(\lambda)}{\zeta_0(\alpha)}\,
w^{(T)}(\lambda-\alpha)\,d\alpha \\
&\le c\,\rho_\zeta(g)^2.
\end{aligned}
\]
Therefore we have $\rho_{\zeta_0}\bigl(\varphi^{(T)}_{ij}(g)\bigr) \le c\,\rho_\zeta(g)$. Substituting into (3.3.8), we obtain the stated uniform bound of the cumulants.

Lemma 3.3.7 Suppose that Assumptions 3.3.1 to 3.3.4 hold. The empirical partial spectral process $\{E^{(T)}(g), g \in \mathscr{G}\}$ is stochastically equicontinuous, that is, for each $\eta > 0$ and $\varepsilon > 0$ there exists $\delta > 0$ such that
\[
\limsup_{T\to\infty} P\Bigl\{\sup_{[\delta]}\bigl|E^{(T)}(g-h)\bigr| > \eta\Bigr\} < \varepsilon,
\]
where $[\delta] = \{(g,h) \in \mathscr{G}^2 \,|\, \rho_\zeta(g-h) < \delta\}$.

Proof. The process can be written as
\[
E^{(T)}(g) = \sum_{i,j=1}^{d} E^{(T)}_{ij,\zeta_0}\bigl(\varphi^{(T)}_{ij}(g)\bigr) + R_T(g).
\]
It follows from Lemma 3.3.6, similarly to the proof in Eichler (1995), that the processes $E^{(T)}_{ij,\zeta_0}(\varphi^{(T)}_{ij}(g))$ are stochastically equicontinuous. Further, the remainder term is of order $o_P(\rho_\zeta(g))$ uniformly in $g \in \mathscr{G}$. Thus we get
\[
\begin{aligned}
P\Bigl\{\sup_{[\delta]}\bigl|E^{(T)}(g-h)\bigr| > \eta\Bigr\}
&\le P\Bigl\{\sum_{i,j}\sup_{[\delta]}\bigl|E^{(T)}_{ij,\zeta_0}\bigl(\varphi^{(T)}_{ij}(g-h)\bigr)\bigr| > \frac{\eta}{2}\Bigr\}
+ P\Bigl\{\sup_{[\delta]}\bigl|R_T(g) - R_T(h)\bigr| > \frac{\eta}{2}\Bigr\} \\
&\le \sum_{i,j} P\Bigl\{\sup_{[\delta]}\bigl|E^{(T)}_{ij,\zeta_0}\bigl(\varphi^{(T)}_{ij}(g-h)\bigr)\bigr| > \frac{\eta}{2d^2}\Bigr\}
+ P\Bigl\{\sup_{g\in\mathscr{G}}\bigl|R_T(g)\bigr| > \frac{\eta}{4}\Bigr\}.
\end{aligned}
\]
As we can choose $T$ and $\delta$ for any fixed $\varepsilon > 0$ and $\eta > 0$ such that both terms are smaller than $\varepsilon/2$, this proves the lemma.

Lemma 3.3.8 Suppose that Assumptions 3.3.1 to 3.3.4 hold. Then the empirical partial spectral process $\{E^{(T)}(g), g \in \mathscr{G}\}$ is measurable.

Proof. Since under the assumptions the sample paths of the empirical partial spectral process are almost surely uniformly continuous, the measurability can be proved as in the case of empirical spectral processes (cf. Eichler, 1995).


Theorem 3.3.9 Suppose that Assumptions 3.3.1 to 3.3.4 hold. Then the empirical partial spectral process $\{E^{(T)}(g), g \in \mathscr{G}\}$ converges weakly on $\mathscr{X}$ to the partial spectral process $\{E(g), g \in \mathscr{G}\}$.

Proof. We have proved the stochastic equicontinuity and the measurability of the empirical partial spectral process and the weak convergence of its finite dimensional distributions to those of the limiting process. The weak convergence of $\{E^{(T)}(g), g \in \mathscr{G}\}$ then follows by Theorem 10.2 in Pollard (1990), in which the outer measure $P^*$ can be replaced by the probability measure $P$ due to the measurability of the empirical partial spectral process.

3.3.3 Interrelation analysis in the time domain: An example

In this section, we illustrate the properties of the proposed correlation function as a tool for the identification of connectivity in multivariate point processes. For this purpose, we have generated data from a mutually exciting nonlinear point process which allows excitatory and inhibitory connections between the components.

For a multivariate point process $N = (N_1,\ldots,N_d)'$ on $\mathbb{R}$, let $\mathscr{H}_t$ denote the $\sigma$-algebra generated by the process up to time $t$. We then consider processes such that the conditional intensity functions satisfy
\[
P\bigl\{dN_a(t) = 1 \,\big|\, \mathscr{H}_t\bigr\} = \exp\Bigl(\mu_a + \sum_{b=1}^{d}\mu_{ab}(t)\Bigr)\,dt,
\]
where
\[
\mu_{ab}(t) = \int_{\mathbb{R}} \gamma_{ab}(t-u)\,dN_b(u).
\]
The constant $\mu_a$ characterizes the spontaneous activity of process $N_a$, while $\mu_{ab}(t)$ signifies the change of activity induced by process $N_b$. The link function $\gamma_{ab}$ determines the nature of the connection from $N_b$ to $N_a$. In particular, $\gamma_{ab}(u) < 0$ represents inhibition, while positive values indicate excitation. We suppose that $\gamma_{ab}(u) = 0$ for all $u < 0$, which expresses that only events from the past can have an influence. Apart from the fact that such processes do not allow for the modeling of refractory periods, this seems to be a reasonable model for synaptic interactions.
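Such a process can be simulated approximately by discretizing time into small bins and drawing, in each bin, an event with probability $\exp\bigl(\mu_a + \sum_b \mu_{ab}(t)\bigr)\Delta$. The sketch below is a crude discretization, not the simulation scheme actually used here (an exact alternative would be Ogata's thinning algorithm); it generates a two-component toy process with one excitatory connection:

```python
import numpy as np

def simulate(mu, gamma, T, dt=1e-3, seed=0):
    """Bin-wise approximation of a nonlinear mutually exciting point process.

    mu    : length-d baseline log-intensities mu_a
    gamma : d x d table of link functions gamma_ab(u) (callables or None)
    Returns one array of spike times per component.
    """
    rng = np.random.default_rng(seed)
    d = len(mu)
    spikes = [[] for _ in range(d)]
    for t in np.arange(dt, T, dt):
        for a in range(d):
            s = mu[a]
            for b in range(d):
                if gamma[a][b] is not None:
                    # mu_ab(t): sum of gamma_ab(t - u) over past events of N_b
                    s += sum(gamma[a][b](t - u) for u in spikes[b])
            if rng.random() < min(1.0, np.exp(s) * dt):
                spikes[a].append(t)
    return [np.array(s) for s in spikes]

# toy network: N_2 excited by N_1 with delay 0.025 s (parameter values chosen
# to mimic the exponential link functions described in the text)
link = lambda u: 1.2 * np.exp(-100.0 * (u - 0.025)) if u >= 0.025 else 0.0
gamma = [[None, None], [link, None]]
spikes = simulate(mu=[np.log(10.0), np.log(5.0)], gamma=gamma, T=5.0)
assert all(np.all(np.diff(s) > 0) for s in spikes)   # strictly increasing times
assert all((s >= 0).all() and (s <= 5.0).all() for s in spikes)
```

The bin width `dt` must be small relative to the delay and decay parameters of the link functions; otherwise the discretization distorts the conditional intensities.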

Figure 3.3.1: (a) Connectivity of the simulated multivariate point process (vertices 1 to 5; excitatory and inhibitory connections are distinguished in the legend). (b) Estimated conditional correlation graph.


Figure 3.3.2: Below the diagonal: estimated partial spectral coherences $|R^{(T)}_{ab|C_{ab}}(\lambda)|^2$ (solid) and spectral coherences $|R^{(T)}_{ab}(\lambda)|^2$ (dotted) for the simulated data, plotted against frequency [Hz]. The horizontal dashed lines represent critical thresholds at significance level $\alpha = 5\%$ for the maximum partial spectral coherence $S^{(T)}$ under the hypothesis that $|R_{ab|C_{ab}}(\lambda)|^2 \equiv 0$. Above the diagonal: estimated partial phase spectra $\varphi^{(T)}_{ab}(\lambda)$ with 95%-confidence bands.

We have generated a five-dimensional process with connectivity structure as shown in Figure 3.3.1 (a) and sample length $T = 30$. For the simulation we have used exponential link functions of the form
\[
\gamma_{ab}(u) = \alpha_{ab}\exp\bigl(-\beta_{ab}(u - u_{ab})\bigr),
\]
where $\alpha_{ab} = 1.2$ and $-2.0$ for excitatory and inhibitory connections, respectively, $\beta_{ab} = 100.0$, and $u_{ab} = 0.025$, which determines the time delay of the connection.

Figure 3.3.2 shows the estimated partial spectral coherences for the simulated data. From these we obtain the conditional correlation graph in Figure 3.3.1 by connecting two vertices with an edge whenever the corresponding partial spectral coherence exceeds the threshold. Apart from one additional edge between vertices 2 and 3, this graph corresponds to the connectivity structure of the process. A better understanding of the underlying structure can be gained by applying the identification procedure suggested by Dahlhaus et al. (1997). Noting that the slope of the partial phase curve $\varphi_{ab|C_{ab}}(\lambda)$ indicates the time delay of the connection, we can identify the directions of all edges but (2,3). By the directions we can now identify those edges which might be due to a marrying parents effect; these are then reexamined with all successors deleted from the graph. For the edge (2,3) we therefore have to compute $|R^{(T)}_{23|1}(\lambda)|^2$ (Fig. 3.3.3), from which it follows that the additional edge is indeed due to a marrying parents effect. Thus, we have correctly identified the connections and their directions. However, this frequency domain analysis provides no information about the type of the connections.

Figure 3.3.3: Estimated partial spectral coherence $|R^{(T)}_{23|1}(\lambda)|^2$, plotted against frequency [Hz].

Next, we analyse the same data by use of scaled partial covariance densities. The estimates of the scaled partial covariance densities are given in Figure 3.3.4. The curves, having peaks or troughs at specific times while being otherwise flat, are typical of the kind of plots obtained for neurophysiological data sets. Here, direct association between the components of the process is marked by significant deviations of the scaled partial covariance density from zero. In the plots, the horizontal dotted lines represent the pointwise critical values centered around zero. These are calculated from the limiting distribution of $\rho^{(T)}_{ab|C_{ab}}(u)$ under the null hypothesis $\rho_{ab|C_{ab}}(u) = 0$. Since the values are not simultaneous critical values, we have to make allowances for exceeding these thresholds. Simultaneous confidence bands can be obtained from Theorem 3.3.9, similarly as in Eichler (1995, Example 3.2).

Figure 3.3.4: Estimated scaled partial covariance densities $\rho^{(T)}_{ab|C_{ab}}(u)$ and estimated scaled covariance densities $\rho^{(T)}_{ab}(u)$ for the simulated data, plotted against time [sec]. The horizontal dotted lines represent pointwise critical values for the hypothesis $\rho_{ab|C_{ab}}(u) = 0$ at significance level $\alpha = 5\%$.

Figure 3.3.5: Estimated scaled partial covariance densities (a) $\rho^{(T)}_{23|1}(u)$ and (b) $\rho^{(T)}_{34|12}(u)$, plotted against time [sec].

Examining the plots, we identify the same edges in the conditional correlation graph as by the frequency domain approach above. Further, we can estimate the time delay, and thus the direction of a connection, by the distance of the peak or trough from the origin. For all edges except (2,3) we obtain approximately the correct time delay of 25 milliseconds, although one curve, $\rho^{(T)}_{34|125}(u)$, has a flat second peak at the origin. For the remaining edge (2,3) we find a significant trough at the origin. Since synaptic connections between neurons cannot have zero delay, such significant deviations at the origin typically indicate a common input or, in the case of partialized statistics, a marrying parents effect if this is suggested by the directions of the other edges. In order to decide between these two possibilities, we can apply the same identification procedure as in the frequency domain. Thus, we have to compute $\rho^{(T)}_{23|1}(u)$ and $\rho^{(T)}_{34|12}(u)$. As we can see from the plots in Figure 3.3.5, there is no common input, and the peak and the trough were only due to a marrying parents effect.

Finally, the type of a connection, excitation or inhibition, is indicated by peaks and troughs, respectively. Examining the plots, we can detect two inhibitory connections, between $N_1$ and $N_2$ and between $N_2$ and $N_5$, while all other connections are excitatory. Therefore, the connectivity structure of the process as given in Figure 3.3.1 has been completely identified.

In Figure 3.3.4, we have also included the scaled covariance densities without partialization. These curves are equivalent to the cross-correlation histogram, which is a widely used tool in the analysis of neurophysiological data. More precisely, the cross-correlation histogram is an estimate for the renewal density of the process, which can be obtained from the covariance density by a linear transformation. As expected, these curves do not provide sufficient information for the identification of the connectivity structure of the process. Particularly interesting is the scaled covariance density between $N_3$ and $N_5$, which exhibits both a peak and a trough. Consequently, it could not be used for inference about the type of the connection even if the frequency domain method were employed for the identification of connections and their directions.

In summary, the example has shown that the newly proposed scaled partial covariance density can be used for the identification of the connectivity structure of multivariate point processes. The new statistic provides information about the type and the direction of a connection and allows one to distinguish direct connections from indirect connections and common inputs. It can be interpreted in the same way as the widely used cross-correlation histogram and therefore might lead to a better acceptance of partialization methods in neurophysiology. The statistic can be computed efficiently by fast Fourier transforms for point processes (cf. Rigas, 1991, 1992) and can therefore be used as an extension of the partialization analysis in the frequency domain.


Chapter 4

Selection of graphical interaction models

In the previous chapter, we have seen that conditional independence graphs can be used to summarize and visualize the findings of nonparametric interrelation analysis in a concise way. However, it is unclear in which way the estimated dependence structure is related to the probability distribution of the observed process and whether it is (in some sense) the best estimate.

An alternative to the nonparametric approach is the fitting of parametric graphical models where the parameters are constrained with respect to conditional correlation graphs. The problem of estimating the dependence structure of the process now becomes a problem of model selection, where the best approximating model minimizes some chosen model distance such as the Kullback-Leibler information divergence.

In this chapter we discuss the problem of model selection for the class of graphical autoregressive models introduced in Example 2.1.5. The aim is to derive the asymptotic efficiency of a version of the AIC criterion with respect to the Kullback-Leibler information divergence. In the first section we derive implicit equations for the Whittle estimate which are similar to the equations for the maximum likelihood estimate in the case of ordinary Gaussian graphical models. In Section 4.2 we investigate the asymptotic behaviour of the Kullback-Leibler information and show that it can be approximated by a deterministic function. This is exploited in Section 4.3 to derive the asymptotic efficiency of the proposed model selection criterion.

4.1 Model fitting

A fundamental, information theoretic measure for the separation or distance between two probability distributions is the Kullback-Leibler information (Kullback and Leibler, 1951), which gives the mean information per observation for the discrimination between the true and a fitted distribution. Let $X(1),\ldots,X(T)$ be observations from a multivariate Gaussian stationary process specified by some infinite parameter $\theta_0$. Then for density functions $p_{\theta_0}$ and $p_\theta$ and spectral matrices $f_{\theta_0}$ and $f_\theta$, the Kullback-Leibler information between the process and a fitted model specified by the parameter $\theta$ is given by

\[
\begin{aligned}
I(\theta, \theta_0) &= \lim_{T\to\infty}\frac{1}{T}\,
\mathbb{E}_{\theta_0}\log\Bigl(\frac{p_{\theta_0}(X(1),\ldots,X(T))}{p_\theta(X(1),\ldots,X(T))}\Bigr) \\
&= \frac{1}{4\pi}\int_\Pi\Bigl(\log\Bigl(\frac{\det f_\theta(\lambda)}{\det f_{\theta_0}(\lambda)}\Bigr)
+ \operatorname{tr}\bigl[f_{\theta_0}(\lambda)f_\theta^{-1}(\lambda) - 1_d\bigr]\Bigr)d\lambda
\end{aligned}
\]
(cf. Parzen, 1983). Minimization of $I(\theta, \theta_0)$ with respect to $\theta$ is equivalent to minimizing
\[
\mathscr{L}(\theta) = \frac{1}{4\pi}\int_\Pi\Bigl(\log\det f_\theta(\lambda)
+ \operatorname{tr}\bigl[f_{\theta_0}(\lambda)f_\theta^{-1}(\lambda)\bigr]\Bigr)d\lambda.
\]
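For a finite-order model, the integral can be evaluated on a frequency grid. The sketch below (our own illustration, not code from the text) computes $\mathscr{L}(\theta)$ for bivariate VAR(1) models, whose spectral matrix is $f_\theta(\lambda) = \frac{1}{2\pi}(1_d - A e^{-i\lambda})^{-1}\Sigma\,\overline{(1_d - A e^{-i\lambda})^{-1}}'$, and checks that the true parameter yields a smaller value than a perturbed one, as the Kullback-Leibler inequality requires:

```python
import numpy as np

def var1_spectrum(A, Sigma, lam):
    """Spectral matrix of a VAR(1) process X(t) = A X(t-1) + eps(t)."""
    d = A.shape[0]
    B = np.linalg.inv(np.eye(d) - A * np.exp(-1j * lam))
    return B @ Sigma @ B.conj().T / (2 * np.pi)

def whittle_distance(A, Sigma, A0, Sigma0, n_freq=512):
    """L(theta) = (1/4pi) int ( log det f_theta + tr[f_theta0 f_theta^{-1}] ),
    approximated by a Riemann sum over [-pi, pi)."""
    lams = np.linspace(-np.pi, np.pi, n_freq, endpoint=False)
    total = 0.0
    for lam in lams:
        f_th = var1_spectrum(A, Sigma, lam)
        f_0 = var1_spectrum(A0, Sigma0, lam)
        total += np.log(np.linalg.det(f_th)).real \
               + np.trace(f_0 @ np.linalg.inv(f_th)).real
    return total * (2 * np.pi / n_freq) / (4 * np.pi)

A0 = np.array([[0.5, 0.2], [0.0, 0.3]])
Sigma0 = np.eye(2)
L_true = whittle_distance(A0, Sigma0, A0, Sigma0)
L_wrong = whittle_distance(A0 + 0.1, Sigma0, A0, Sigma0)
assert L_true < L_wrong        # the true model minimizes L(theta)
```

The assertion mirrors the statement in the text: $\mathscr{L}(\theta) - \mathscr{L}(\theta_0)$ is a positive multiple of $I(\theta,\theta_0)$, which vanishes only when $f_\theta = f_{\theta_0}$.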

In the following we assume that $\{X(t), t \in \mathbb{Z}\}$ is an autoregressive process of infinite order, to which we will fit graphical autoregressive models of finite order $p$. Allowing the order $p$ to diverge to infinity for increasing sample sizes, the process can asymptotically be fitted by the correct model, which is crucial in our investigation of the asymptotic properties of the Kullback-Leibler information.

Assumption 4.1.1 {X(t), t ∈ Z} is a d-vector-valued stochastic process defined on a probability space (Ω, A, P) such that the following conditions hold.

(i) X(t) is a stationary Gaussian autoregressive process,

$$X(t) = \sum_{h=1}^{\infty} A_h X(t-h) + \varepsilon(t), \qquad (4.1.1)$$

with d × d coefficient matrices A_h such that ‖A_h‖ ≠ 0 for infinitely many h ∈ N, for any matrix norm ‖·‖. The innovations ε(t), t ∈ Z, are independent and normally distributed with mean E(ε(t)) = 0 and regular covariance matrix E(ε(t)ε(t)′) = Σ.

(ii) The spectral matrix f(λ) of X(t) exists and satisfies the boundedness condition

$$a_1 1_d \le f(\lambda) \le a_2 1_d \qquad \forall \lambda \in [-\pi, \pi]$$

for constants a_1 and a_2 such that 0 < a_1 ≤ a_2 < ∞.

(iii) There exists β > 1 such that the covariances R(u) of X(t) satisfy

$$\sum_{u \in \mathbb{Z}} |u|^{\beta}\, \| R(u) \| < \infty.$$

(iv) X(t) has conditional independence graph G_0 = (V, E_0).
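Condition (iii) is easy to verify numerically for simple finite-order models. The sketch below (illustrative, with hypothetical coefficients; a stable VAR(1) rather than the AR(∞) process of (4.1.1)) computes R(0) from the Lyapunov equation and shows that the weighted sums of |u|^β‖R(u)‖ converge, since R(u) = A^u R(0) decays geometrically.

```python
# Geometric decay of the covariances of a stable VAR(1), illustrating the
# summability condition (iii).  All numbers are hypothetical.
import numpy as np

A = np.array([[0.5, 0.1],
              [0.0, 0.4]])          # eigenvalues 0.5 and 0.4: stable
Sigma = np.eye(2)                    # regular innovation covariance

# R(0) solves the Lyapunov equation R(0) = A R(0) A' + Sigma; iterate to convergence.
R0 = Sigma.copy()
for _ in range(200):
    R0 = A @ R0 @ A.T + Sigma

# Partial sums of sum_u |u|^beta ||R(u)|| with R(u) = A^u R(0) and R(-u) = R(u)'.
beta = 2.0
Au, total, last_term = np.eye(2), 0.0, 1.0
for u in range(1, 80):
    Au = Au @ A                      # A^u
    last_term = 2 * u**beta * np.linalg.norm(Au @ R0)
    total += last_term

print(np.isfinite(total), last_term < 1e-12)  # True True
```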

As in Example 2.1.5 we parametrize graphical autoregressive models by the inverse covariances R^{(i)}_{ij}(u). Thus we get infinite-dimensional parameter vectors

$$\theta = \big( \operatorname{vech}(R^{(i)}(0))',\ \operatorname{vec}(R^{(i)}(1))',\ \operatorname{vec}(R^{(i)}(2))',\ \ldots \big)',$$

where the vech operator stacks only the elements contained in the lower triangular submatrix. We denote the spectral matrices, covariances, and inverse covariances specified by the parameter θ by f_θ(λ), R_θ(u), and R^{(i)}_θ(u), respectively.
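The stacking convention above can be made concrete. The following helper (illustrative only; the names `vech` and `stack_parameters` are not from the thesis) builds the parameter vector from a symmetric lag-0 matrix and full lag matrices.

```python
# Build theta = ( vech(R(0))', vec(R(1))', ..., vec(R(p))' )' where vech stacks
# the lower-triangular part (R(0) is symmetric) and vec stacks full columns.
import numpy as np

def vech(M):
    """Stack the lower-triangular part of a symmetric matrix column-wise."""
    i, j = np.tril_indices(M.shape[0])
    return M[i, j]

def stack_parameters(R_list):
    """R_list[0] is R(0) (symmetric); R_list[1:] are R(1), ..., R(p)."""
    parts = [vech(R_list[0])] + [R.flatten(order="F") for R in R_list[1:]]
    return np.concatenate(parts)

R0 = np.array([[2.0, 0.5], [0.5, 1.0]])
R1 = np.array([[0.3, 0.0], [0.1, 0.2]])
theta = stack_parameters([R0, R1])
print(theta.tolist())  # [2.0, 0.5, 1.0, 0.3, 0.1, 0.0, 0.2]
```

For d series, vech contributes d(d+1)/2 entries per symmetric matrix and vec contributes d² per lag matrix, which is why the lag-0 block is shorter.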


Assumption 4.1.2 Θ is a subset of ℓ²(R) such that the following conditions hold.

(i) The spectral matrices f_θ satisfy for all θ ∈ Θ the boundedness condition

$$b_1 1_d \le f_\theta(\lambda) \le b_2 1_d \qquad \forall \lambda \in [-\pi, \pi]$$

for constants b_1 and b_2 such that 0 < b_1 ≤ b_2 < ∞.

(ii) There exists a constant C > 0 such that the covariances R_θ(u) satisfy

$$\sum_{u \in \mathbb{Z}} |u|^{\beta}\, \| R_\theta(u) \| < C$$

for all θ ∈ Θ(p,G), where β is the same as in Assumption 4.1.1.

(iii) There exists θ_0 in Θ such that f_{θ_0}(λ) = f(λ) for all λ ∈ [−π, π], and θ_0 belongs to the interior of Θ.

Next, let 𝒢 denote the set of all graphs G = (V, E) such that V = {1, …, d} and E ⊆ {(i, j) ∈ V² | i ≠ j}. For p ∈ N and G ∈ 𝒢, the AR(p,G) model is now given by the parameter space

$$\Theta(p,G) = \big\{ \theta \in \Theta \,\big|\, R^{(i)}_{ij,\theta}(u) = 0 \ \text{if}\ (i,j) \notin E \ \text{or}\ |u| > p \big\}.$$

Let I_{p,G} denote the set of indices at which Θ(p,G) is not constrained to zero, and π_{p,G} the projection of ℓ²(R) onto the subspace spanned by Θ(p,G).

Minimization of the Kullback-Leibler information I(θ, θ0), or equivalently L(θ), with respect to θ ∈ Θ(p,G) yields the best AR(p,G) approximation of X(t), which we denote by the parameter

$$\theta_0(p,G) = \mathop{\operatorname{argmin}}_{\theta \in \Theta(p,G)} L(\theta).$$

We require that θ_0(p,G) exists and is uniquely defined.

Assumption 4.1.3 The best approximation θ_0(p,G) in Θ(p,G) with respect to the Kullback-Leibler information I(θ, θ0) is unique and belongs to the relative interior of Θ(p,G) with respect to the seminorm ‖·‖_{π_{p,G}}.

From Lemma B.3 we obtain the following derivatives of L(θ):

$$\frac{\partial L(\theta)}{\partial \theta_k} = \frac{1}{4\pi} \int_\Pi \operatorname{tr}\Big[ \big( f_{\theta_0}(\lambda) - f_\theta(\lambda) \big) \frac{\partial f_\theta^{-1}(\lambda)}{\partial \theta_k} \Big]\, d\lambda. \qquad (4.1.2)$$

Since the inverse spectral matrix is linear in the parameters, we get an explicit formula for its derivatives. Let θ_k correspond to R^{(i)}_{ab}(u). Then

$$\frac{\partial f^{-1}_{ij,\theta}(\lambda)}{\partial \theta_k} = \begin{cases} 2\pi\, \delta_{ia}\delta_{ja} & \text{if } a = b \text{ and } u = 0, \\ 2\pi \big[ \delta_{ia}\delta_{jb} \exp(-i\lambda u) + \delta_{ib}\delta_{ja} \exp(i\lambda u) \big] & \text{otherwise.} \end{cases}$$


Substituting this into (4.1.2), we therefore get

$$\frac{\partial L(\theta)}{\partial \theta_k} = 0 \quad\Leftrightarrow\quad \int_\Pi \big( f_{ab,\theta_0}(\lambda) - f_{ab,\theta}(\lambda) \big) \exp(i\lambda u)\, d\lambda = 0. \qquad (4.1.3)$$

This leads to the following set of equations, which characterize the best AR(p,G) approximation θ_0(p,G):

$$R_{ij,\theta_0(p,G)}(u) = R_{ij,\theta_0}(u) \quad \forall (i,j) \in E,\ \forall u \in \{-p, \ldots, p\},$$
$$R^{(i)}_{ij,\theta_0(p,G)}(u) = 0 \quad \forall (i,j) \notin E,\ \forall u \in \{-p, \ldots, p\}, \qquad (4.1.4)$$

and additionally R^{(i)}_{\theta_0(p,G)}(u) = 0 for all |u| > p.

In the following we will also need the second and third derivatives of L(θ). By Lemma B.3 and the linearity of f_θ^{-1}(λ) in the parameters, we obtain for the second derivatives

$$\frac{\partial^2 L(\theta)}{\partial \theta_i \partial \theta_j} = \frac{1}{4\pi} \int_\Pi \operatorname{tr}\Big[ \big( f_{\theta_0}(\lambda) - f_\theta(\lambda) \big) \frac{\partial^2 f_\theta^{-1}(\lambda)}{\partial \theta_i \partial \theta_j} \Big]\, d\lambda - \frac{1}{4\pi} \int_\Pi \operatorname{tr}\Big[ \frac{\partial f_\theta(\lambda)}{\partial \theta_i} \frac{\partial f_\theta^{-1}(\lambda)}{\partial \theta_j} \Big]\, d\lambda = \frac{1}{4\pi} \int_\Pi \operatorname{tr}\Big[ f_\theta(\lambda) \frac{\partial f_\theta^{-1}(\lambda)}{\partial \theta_i} f_\theta(\lambda) \frac{\partial f_\theta^{-1}(\lambda)}{\partial \theta_j} \Big]\, d\lambda,$$

where the first integral vanishes because f_θ^{-1}(λ) is linear in θ, and ∂f_θ(λ)/∂θ_i = −f_θ(λ) (∂f_θ^{-1}(λ)/∂θ_i) f_θ(λ) has been substituted in the second. Similarly, for the third derivatives,

$$\frac{\partial^3 L(\theta)}{\partial \theta_i \partial \theta_j \partial \theta_k} = \frac{1}{4\pi} \int_\Pi \operatorname{tr}\Big[ f_\theta(\lambda) \frac{\partial f_\theta^{-1}(\lambda)}{\partial \theta_i} f_\theta(\lambda) \frac{\partial f_\theta^{-1}(\lambda)}{\partial \theta_j} f_\theta(\lambda) \frac{\partial f_\theta^{-1}(\lambda)}{\partial \theta_k} \Big]\, d\lambda.$$

We will denote the vector of first derivatives by

$$\nabla L(\theta) = \Big( \frac{\partial L(\theta)}{\partial \theta_i} \Big)_{i \in \mathbb{N}},$$

and the matrix of second derivatives by

$$\nabla^2 L(\theta) = \Big( \frac{\partial^2 L(\theta)}{\partial \theta_i \partial \theta_j} \Big)_{i,j \in \mathbb{N}}.$$

The linearity of f_θ^{-1}(λ) in θ also implies that

$$\theta_1' \nabla^2 L(\theta)\, \theta_2 = \frac{1}{4\pi} \int_\Pi \operatorname{tr}\big[ f_\theta(\lambda) f_{\theta_1}^{-1}(\lambda) f_\theta(\lambda) f_{\theta_2}^{-1}(\lambda) \big]\, d\lambda. \qquad (4.1.5)$$

In practice, model distances such as the Kullback-Leibler information need to be estimated, as they depend on the unknown parameter θ0. Akaike (1973) pointed out that the Kullback-Leibler information is related to the method of maximum likelihood. Therefore, given observations X(1), ..., X(T) from the process X(t), minimum distance estimates can be obtained by maximizing the Gaussian likelihood function or, equivalently, minimizing the −1/T log-likelihood function

$$L_T^*(\theta) = \frac{1}{2}\log(2\pi) + \frac{1}{2T} \log\det R_{\theta,T} + \frac{1}{2T} X_T' R_{\theta,T}^{-1} X_T, \qquad (4.1.6)$$

where R_{θ,T} = (R_θ(u − v))_{u,v = 1,…,T}. A more favourable choice for fitting graphical autoregressive models is the likelihood approximation suggested by Whittle (1953, 1954). Approximating the matrix R_{θ,T}^{-1} by the corresponding matrix of inverse covariances (cf. Shaman, 1975, 1976), together with the Szegő identity (cf. Grenander and Szegő, 1958), leads to the Whittle likelihood

$$L_T(\theta) = \frac{1}{4\pi} \int_\Pi \Big( \log\det f_\theta(\lambda) + \operatorname{tr}\big[ I^{(T)}(\lambda) f_\theta(\lambda)^{-1} \big] \Big)\, d\lambda,$$

which estimates L(θ) consistently. Thus we get as a minimum distance estimate the Whittle estimate

$$\theta_T(p,G) = \mathop{\operatorname{argmin}}_{\theta \in \Theta(p,G)} L_T(\theta).$$
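The mechanics of Whittle minimization can be illustrated in the simplest possible setting. The sketch below (an assumption-laden toy: univariate AR(1), population spectrum substituted for the periodogram I^{(T)}(λ), grid search instead of a proper optimizer) shows that the Whittle criterion is minimized at the true coefficient.

```python
# Population Whittle criterion for AR(1):
#   L(a) = (1/4pi) Int [ log f_a(lam) + f_{a0}(lam)/f_a(lam) ] dlam,
# minimized over a grid of candidate coefficients a.
import numpy as np

def ar1_spec(a, lam, s2=1.0):
    return s2 / (2 * np.pi * np.abs(1 - a * np.exp(-1j * lam)) ** 2)

lam = np.linspace(-np.pi, np.pi, 2048, endpoint=False)
a0 = 0.6
f0 = ar1_spec(a0, lam)            # true spectrum (stands in for the periodogram)

def whittle(a):
    f = ar1_spec(a, lam)
    return np.mean(np.log(f) + f0 / f) / 2.0   # (1/4pi)*integral = mean/2

grid = np.linspace(-0.9, 0.9, 181)
a_hat = grid[np.argmin([whittle(a) for a in grid])]
print(round(float(a_hat), 2))  # 0.6
```

Each frequency contributes a term x ↦ log x + f0/x, minimized at x = f0(λ), so the criterion is minimized exactly when the fitted spectrum matches the true one.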

The first derivative of the Whittle likelihood is

$$\frac{\partial L_T(\theta)}{\partial \theta_i} = \frac{1}{4\pi} \int_\Pi \operatorname{tr}\Big[ \big( I^{(T)}(\lambda) - f_\theta(\lambda) \big) \frac{\partial f_\theta^{-1}(\lambda)}{\partial \theta_i} \Big]\, d\lambda. \qquad (4.1.7)$$

Since f_θ^{-1}(λ) is linear in θ, the data-dependent term vanishes in the second derivative, and we find for all θ ∈ Θ

$$\nabla^2 L_T(\theta) = \nabla^2 L(\theta). \qquad (4.1.8)$$

Consequently, the third derivatives of L_T(θ) and L(θ) are also equal. Setting the first derivative to zero leads to the following characterization of the Whittle estimates in the AR(p,G) model.

Theorem 4.1.4 Suppose that Assumptions 4.1.1 and 4.1.2 hold. Then the Whittle estimate θ_T(p,G) in the graphical autoregressive model AR(p,G) is given by the equations

$$R_{ij,\theta_T(p,G)}(u) = \hat{R}_{ij}(u) \quad \forall (i,j) \in E,\ \forall u \in \{-p, \ldots, p\},$$
$$R^{(i)}_{ij,\theta_T(p,G)}(u) = 0 \quad \forall (i,j) \notin E,\ \forall u \in \{-p, \ldots, p\},$$

and R^{(i)}_{\theta_T(p,G)}(u) = 0 for all |u| > p, where \hat{R}_{ij}(u) is defined as

$$\hat{R}_{ij}(u) = \int_\Pi I^{(T)}_{ij}(\lambda) \exp(i\lambda u)\, d\lambda.$$

Proof. The result follows from the arguments leading to (4.1.4), applied to the first derivative in (4.1.7).

These equations are similar to the equations for the maximum likelihood estimates in ordinary Gaussian graphical models (cf. Lauritzen, 1996). More precisely, they are the restrictions for a Gaussian graphical model in which the set of vertices consists of the entire process {X(t), t ∈ Z}. This, however, is not surprising given the way the Whittle likelihood approximates the likelihood function in (4.1.6): the Whittle likelihood mainly neglects edge effects, due to observing only a finite horizon, by substituting asymptotic approximations for the finite-sample quantities det R_{θ,T} and R_{θ,T}^{-1}.

The asymptotic properties of the Whittle estimate in general are well known (e.g. Dzhaparidze and Yaglom, 1983). For example, we have the following central limit theorem.

Theorem 4.1.5 Under Assumptions 4.1.1 to 4.1.3 we have

$$\sqrt{T} \big( \theta_T(p,G) - \theta_0(p,G) \big) \xrightarrow{\mathcal{D}} \mathcal{N}\big( 0,\ c_h\, \Gamma(p,G)^{-1} \Gamma_0(p,G) \Gamma(p,G)^{-1} \big),$$

where c_h = H_4 / H_2², Γ_0(p,G) = π_{p,G} ∇²L(θ_0) π_{p,G}, and Γ(p,G) = π_{p,G} ∇²L(θ_0(p,G)) π_{p,G}, with Γ(p,G)^{-1} = π_{p,G} Γ(p,G)^{-} π_{p,G} for any generalized inverse Γ(p,G)^{-}.

Proof. See e.g. Dzhaparidze and Yaglom (1983, Section 5.6).

From the Whittle estimate θ_T(p,G) we can finally compute estimates for the parameters A_1, …, A_p and Σ in (4.1.1).

(a) From the estimates R^{(i)}_{θ_T(p,G)}(u) for the inverse covariances we can obtain the covariances R_{θ_T(p,G)}(u) via computation of f^{-1}_{θ_T(p,G)} and f_{θ_T(p,G)}. Then estimates for the matrices A_1, …, A_p and Σ can be determined by solving the Yule-Walker equations

$$\sum_{u=0}^{p} A_u R_{\theta_T(p,G)}(u - v) = \delta_{v0}\, \Sigma, \qquad v = 0, \ldots, p,$$

where A_0 = −1_d.

(b) The parameters A_1, …, A_p and Σ are related to the inverse covariances by the equation system

$$R^{(i)}_{\theta_T(p,G)}(v) = \sum_{u=0}^{p-v} A_u' \Sigma^{-1} A_{u+v},$$

where again A_0 = −1_d. This problem is equivalent to the estimation of moving average parameters from the covariances of a process. An iterative algorithm for solving such an equation system has been suggested, e.g., by Tunnicliffe Wilson (1972).

We are now interested in the AR(p,G) model, with G ∈ 𝒢 and p selected from a given range 1 ≤ p ≤ P_T with P_T ≤ T, which minimizes the Kullback-Leibler information I(θ_T(p,G), θ_0) between the fitted model and the true process. A model selection (p_T, G_T) which has this optimality property at least asymptotically is called asymptotically efficient.

Definition 4.1.6 (Asymptotically efficient model selection) A selection of models (p_T, G_T)_{T∈N} with 1 ≤ p_T ≤ P_T and G_T ∈ 𝒢 is called asymptotically efficient if

$$\frac{I(\theta_T(p_T, G_T), \theta_0)}{\min_{1 \le p \le P_T} \min_{G \in \mathcal{G}} I(\theta_T(p, G), \theta_0)} \xrightarrow{P} 1.$$


The derivation of model selection criteria with this optimality property is based upon an approximation of the Kullback-Leibler information by some deterministic function. For this, it is necessary that asymptotically the process X(t) can be fitted by the correct model, which implies that P_T must diverge to infinity. On the other hand, the approximation holds only if the stochastic variation in I(θ_T(p,G), θ_0) due to the estimate θ_T(p,G) vanishes asymptotically for all p ≤ P_T. More precisely, we require that the following condition holds.

Assumption 4.1.7 {P_T}_{T∈N} is an integer-valued sequence such that P_T → ∞ and P_T^{5/2} log(T)² / T → 0 as T → ∞.

4.2 Asymptotical efficiency of a model selection

In this section, we investigate the asymptotic behaviour of the Kullback-Leibler information I(θ_T(p,G), θ_0) between a fitted and the true model. In particular, we derive an asymptotic lower bound for I(θ_T(p,G), θ_0), which is important for establishing the asymptotic efficiency of model selection criteria.

Taniguchi (1980) discussed the asymptotics of the final prediction error for fitting spectral models to univariate time series. Although our line of proof is similar, the conditions of Taniguchi do not hold for the parametrization by inverse covariances (even in the univariate case). In the next two lemmas we show that weaker statements about the second and third derivatives and about the convergence of |f_{θ_0(p,G)}(λ) − f_{θ_0}(λ)| can be derived under Assumptions 4.1.1 to 4.1.3.

Lemma 4.2.1 Suppose that Assumptions 4.1.1 and 4.1.2 hold. Then

(i) there exist constants c_1 and c_2 such that

$$\| \nabla^2 L(\theta) \| \le c_1 < \infty \quad\text{and}\quad \| \nabla^2 L(\theta) \|_{\inf} \ge c_2 > 0 \qquad (4.2.1)$$

uniformly in θ ∈ Θ;

(ii) for all η ∈ ℓ²(R) and all ζ ∈ π_{p,G}(R^∞),

$$\Big| \sum_{i,j,k=1}^{\infty} \frac{\partial^3 L(\theta)}{\partial \theta_i \partial \theta_j \partial \theta_k}\, \eta_i \eta_j \zeta_k \Big| \le C \sqrt{k(p,G)}\, \|\eta\|^2 \|\zeta\|$$

uniformly in θ ∈ Θ, 1 ≤ p ≤ P_T, and G ∈ 𝒢. For ζ ∈ ℓ¹(R) the term is bounded by C‖η‖²‖ζ‖₁.

Proof. The result is proved in Section 4.4.


Lemma 4.2.2 Suppose that Assumptions 4.1.1 to 4.1.3 hold. Then for all graphs G such that G_0 ⊆ G we have

$$\int_\Pi \big\| f^{-1}_{\theta_0(p,G)}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big\|_2^2\, d\lambda = O\big( p^{-(2\beta+1)} \big) \qquad (4.2.2)$$

and

$$\int_\Pi \big\| f_{\theta_0(p,G)}(\lambda) - f_{\theta_0}(\lambda) \big\|_2^2\, d\lambda = O\big( p^{-(2\beta+1)} \big), \qquad (4.2.3)$$

where ‖A‖₂ = (tr(A*A))^{1/2}.

Proof. The result is proved in Section 4.4.

The methods in this section are based on Taylor expansions of the Kullback-Leibler information. For this we need the following lemma, which states the uniform consistency of the Whittle estimates.

Lemma 4.2.3 Suppose that Assumptions 4.1.1 to 4.1.3 and 4.1.7 hold. Then we have

$$\max_{1 \le p \le P_T} \big\| \theta_T(p,G) - \theta_0(p,G) \big\| = o_P(1).$$

Proof. We first consider the difference

$$\big| L_T(\theta) - L(\theta) \big| = \frac{1}{4\pi} \Big| \int_\Pi \operatorname{tr}\big[ \big( I^{(T)}(\lambda) - f_{\theta_0}(\lambda) \big) f_\theta^{-1}(\lambda) \big]\, d\lambda \Big| \le \sum_{i,j=1}^{d} \sum_{|u| \le p} \Big| \int_\Pi \big( I^{(T)}_{ij}(\lambda) - f_{ij,\theta_0}(\lambda) \big) R^{(i)}_{ij,\theta}(u) \exp(i\lambda u)\, d\lambda \Big|$$

and further, since R^{(i)}_{ij,θ}(u) does not depend on λ,

$$\le \max_{1 \le i,j \le d} \max_{|u| \le p} \big| R^{(i)}_{ij,\theta}(u) \big| \sum_{i,j=1}^{d} \sum_{|u| \le p} \Big| \int_\Pi \big( I^{(T)}_{ij}(\lambda) - f_{ij,\theta_0}(\lambda) \big) \exp(i\lambda u)\, d\lambda \Big|.$$

By Assumption 4.1.2 and the definition of R^{(i)}_θ(u), the first factor can be bounded by (4π²b_1)^{-1} uniformly over θ ∈ Θ. Further, by Lemma 4.4.1 each summand in the second factor has second moment of order O(T^{-1}) uniformly in |u| ≤ P_T. Thus

$$E\Big[ \max_{1 \le p \le P_T} \sum_{|u| \le p} \sum_{i,j=1}^{d} \Big| \int_\Pi \big( I^{(T)}_{ij}(\lambda) - f_{ij,\theta_0}(\lambda) \big) \exp(i\lambda u)\, d\lambda \Big| \Big]^2 = O\Big( \frac{P_T^2}{T} \Big).$$

We therefore have

$$\max_{1 \le p \le P_T} \sup_{\theta \in \Theta(p,G)} \big| L_T(\theta) - L(\theta) \big| = O_P\Big( \frac{P_T}{\sqrt{T}} \Big),$$

which implies that

$$L_T(\theta_0(p,G)) \xrightarrow{P} L(\theta_0(p,G)) \quad\text{and}\quad \big| L_T(\theta_T(p,G)) - L(\theta_T(p,G)) \big| \xrightarrow{P} 0$$

uniformly in 1 ≤ p ≤ P_T. Together with the inequalities L_T(θ_T(p,G)) ≤ L_T(θ_0(p,G)) and L(θ_0(p,G)) ≤ L(θ_T(p,G)), this leads to the uniform convergence

$$\max_{1 \le p \le P_T} \big| L(\theta_T(p,G)) - L(\theta_0(p,G)) \big| = o_P(1).$$

Since the difference |L(θ_T(p,G)) − L(θ_0(p,G))| can be bounded from below by

$$L(\theta_T(p,G)) - L(\theta_0(p,G)) = \nabla L(\theta_0(p,G))' \big( \theta_T(p,G) - \theta_0(p,G) \big) + \frac{1}{2} \big\| \theta_T(p,G) - \theta_0(p,G) \big\|^2_{\nabla^2 L(\bar\theta_T)} \ge \frac{1}{2} \big\| \theta_T(p,G) - \theta_0(p,G) \big\|^2 \big\| \nabla^2 L(\bar\theta_T) \big\|_{\inf},$$

where θ̄_T = θ_0(p,G) + ξ(θ_T(p,G) − θ_0(p,G)) for some ξ ∈ [0,1] and the first-order term vanishes since π_{p,G}∇L(θ_0(p,G)) = 0, it follows that θ_T(p,G) converges to θ_0(p,G) uniformly in 1 ≤ p ≤ P_T.

The Kullback-Leibler information can now be studied with the help of Taylor expansions. We first note that I(θ_T(p,G), θ_0) can be written as

$$I(\theta_T(p,G), \theta_0) = \big[ L(\theta_T(p,G)) - L(\theta_0(p,G)) \big] + I(\theta_0(p,G), \theta_0)$$

and, with a Taylor expansion of L(θ_T(p,G)) about θ_0(p,G),

$$= \frac{1}{2} \big\| \theta_T(p,G) - \theta_0(p,G) \big\|^2_{\nabla^2 L(\theta_0(p,G))} + I(\theta_0(p,G), \theta_0) + O_P\big( p^{1/2} \| \theta_T(p,G) - \theta_0(p,G) \|^3 \big), \qquad (4.2.4)$$

where the first-order term vanishes because π_{p,G} ∇L(θ_0(p,G)) = 0. In this decomposition of I(θ_T(p,G), θ_0), the first term represents the variance due to estimating θ_0(p,G) by θ_T(p,G), while the second term can be approximated by ½‖θ_0(p,G) − θ_0‖²_{∇²L(θ_0)} and thus represents the bias due to fitting an incorrect model.

In the following we investigate the asymptotic behaviour of the variance term. A Taylor expansion of the first derivative of L_T about θ_0(p,G) yields

$$\nabla L_T(\theta_T(p,G)) - \nabla L_T(\theta_0(p,G)) = \nabla^2 L_T(\theta_0(p,G)) \big( \theta_T(p,G) - \theta_0(p,G) \big) + Z(p,G),$$

where

$$Z_i(p,G) = \sum_{j,k \in I_{p,G}} \frac{\partial^3 L_T(\bar\theta_T)}{\partial \theta_i \partial \theta_j \partial \theta_k} \big( \theta_T(p,G)_j - \theta_0(p,G)_j \big) \big( \theta_T(p,G)_k - \theta_0(p,G)_k \big)$$

and θ̄_T = θ_0(p,G) + ξ(θ_T(p,G) − θ_0(p,G)) for some ξ ∈ [0,1]. By the definition of θ_T(p,G) and θ_0(p,G) we have π_{p,G}(∇L_T(θ_T(p,G))) = π_{p,G}(∇L(θ_0(p,G))) = 0. Noting that by (4.1.8) the second derivative of L_T(θ) can be replaced by that of L(θ), we get

$$\pi_{p,G}\big( \nabla L_T(\theta_T(p,G)) - \nabla L_T(\theta_0(p,G)) \big) = \Gamma(p,G) \big( \theta_T(p,G) - \theta_0(p,G) \big) + \pi_{p,G}\big( Z(p,G) \big),$$

Page 70: Graphical Models in Time Series Analysisgalton.uchicago.edu/~eichler/thesis.pdf · also think of tting a parametric model to the data. The estimation of the conditional correlation

66 Chapter 4. Selection of graphical interaction models

where Γ(p,G) = π_{p,G} ∇²L(θ_0(p,G)) π_{p,G} as in Theorem 4.1.5. This leads to the equation

$$\big\| \nabla L_T(\theta_0(p,G)) - \nabla L(\theta_0(p,G)) \big\|^2_{\Gamma(p,G)^{-1}} = \big[ \nabla L_T(\theta_0(p,G)) - \nabla L(\theta_0(p,G)) \big]' \big[ \big( \theta_T(p,G) - \theta_0(p,G) \big) + \Gamma(p,G)^{-1} Z(p,G) \big] = \big\| \theta_T(p,G) - \theta_0(p,G) \big\|^2_{\Gamma(p,G)} + 2\, Z(p,G)' \big( \theta_T(p,G) - \theta_0(p,G) \big) + \big\| Z(p,G) \big\|^2_{\Gamma(p,G)^{-1}}.$$

Noting that in the first term Γ(p,G) can be replaced by ∇²L(θ_0(p,G)), we finally have by Lemma 4.2.1

$$\big\| \theta_T(p,G) - \theta_0(p,G) \big\|^2_{\nabla^2 L(\theta_0(p,G))} = \big\| \nabla L_T(\theta_0(p,G)) - \nabla L(\theta_0(p,G)) \big\|^2_{\Gamma(p,G)^{-1}} + O_P\big( p^{1/2} \|\tilde\theta\|^3 + p\, \|\tilde\theta\|^4 \big) \qquad (4.2.5)$$

with the abbreviation θ̃ = θ_T(p,G) − θ_0(p,G). In the next lemmas we prove that the moments of the first term on the right-hand side can be approximated by the moments of a χ²-distribution up to the fourth order.

Lemma 4.2.4 Suppose that Assumptions 4.1.1 to 4.1.3 and 4.1.7 hold. Then

$$E\Big[ T \big\| \nabla L_T(\theta_0(p,G)) - \nabla L(\theta_0(p,G)) \big\|^2_{\Gamma(p,G)^{-1}} \Big] = c_h \operatorname{tr}\big[ \Gamma(p,G)^{-1} \Gamma_0(p,G) \big] + O\Big( \frac{p^{5/2} \log(T)}{T} \Big).$$

If the graph G contains the true graph G_0, we further have

$$\operatorname{tr}\big[ \Gamma(p,G)^{-1} \Gamma_0(p,G) \big] = k(p,G) + O\big( p^{1-\beta} \big).$$

Proof. Noting that ‖Γ(p,G)^{-1}‖₁ ≤ C p^{3/2} by Lemma B.1 and ‖Γ(p,G)^{-1}‖ ≤ C, we obtain with Lemma 4.4.1

$$T\, E \big\| \nabla L_T(\theta_0(p,G)) - \nabla L(\theta_0(p,G)) \big\|^2_{\Gamma(p,G)^{-1}} = \sum_{i_1, i_2 \in \mathbb{N}} T\, E\Big[ \prod_{k=1}^{2} \Big( \frac{\partial L_T(\theta_0(p,G))}{\partial \theta_{i_k}} - \frac{\partial L(\theta_0(p,G))}{\partial \theta_{i_k}} \Big) \Big] \Gamma^{-1}_{i_1 i_2}(p,G) = \sum_{i_1, i_2 \in \mathbb{N}} \Big[ c_h\, \Gamma_{i_1 i_2, 0}(p,G) + O\Big( \frac{p \log(T)}{T} \Big) \Big] \Gamma^{-1}_{i_1 i_2}(p,G) = c_h \operatorname{tr}\big( \Gamma_0(p,G) \Gamma(p,G)^{-1} \big) + O\Big( \frac{p^{5/2} \log(T)}{T} \Big),$$

where by the definition of Γ(p,G)^{-1} all summands with i_1, i_2 ∉ I_{p,G} are zero, which leaves O(p²) nonzero summands.


For the second part we note that

$$\Big| \operatorname{tr}\Big[ f_{\theta_0(p,G)}(\lambda) \frac{\partial f_\theta^{-1}(\lambda)}{\partial \theta_i} f_{\theta_0(p,G)}(\lambda) \frac{\partial f_\theta^{-1}(\lambda)}{\partial \theta_j} \Big] - \operatorname{tr}\Big[ f_{\theta_0}(\lambda) \frac{\partial f_\theta^{-1}(\lambda)}{\partial \theta_i} f_{\theta_0}(\lambda) \frac{\partial f_\theta^{-1}(\lambda)}{\partial \theta_j} \Big] \Big| \le \Big| \operatorname{tr}\Big[ \big( f_{\theta_0(p,G)}(\lambda) - f_{\theta_0}(\lambda) \big) \frac{\partial f_\theta^{-1}(\lambda)}{\partial \theta_i} f_{\theta_0(p,G)}(\lambda) \frac{\partial f_\theta^{-1}(\lambda)}{\partial \theta_j} \Big] \Big| + \Big| \operatorname{tr}\Big[ f_{\theta_0}(\lambda) \frac{\partial f_\theta^{-1}(\lambda)}{\partial \theta_i} \big( f_{\theta_0(p,G)}(\lambda) - f_{\theta_0}(\lambda) \big) \frac{\partial f_\theta^{-1}(\lambda)}{\partial \theta_j} \Big] \Big| \le C \big\| f_{\theta_0(p,G)}(\lambda) - f_{\theta_0}(\lambda) \big\|_2 \big( \| f_{\theta_0(p,G)}(\lambda) \| + \| f_{\theta_0}(\lambda) \| \big).$$

It follows by Lemma 4.2.2 and the Cauchy-Schwarz inequality that

$$\operatorname{tr}\big( \Gamma(p,G)^{-1} \big[ \Gamma_0(p,G) - \Gamma(p,G) \big] \big) \le C\, \| \Gamma(p,G)^{-1} \|_1 \Big( \int_\Pi \big\| f_{\theta_0(p,G)}(\lambda) - f_{\theta_0}(\lambda) \big\|_2^2\, d\lambda \Big)^{1/2} = O\big( p^{1-\beta} \big),$$

which completes the proof.

Lemma 4.2.5 Suppose that Assumptions 4.1.1 to 4.1.3 and 4.1.7 hold. If G_0 ⊆ G, then

$$E\Big[ \frac{T}{c_h} \big\| \nabla L_T(\theta_0(p,G)) - \nabla L(\theta_0(p,G)) \big\|^2_{\Gamma(p,G)^{-1}} - k(p,G) \Big]^4 = 48\, k(p,G) + 12\, k(p,G)^2 + O\big( p^{4-\beta} \big) + O\Big( \frac{p^{11/2} \log(T)^2}{T} \Big);$$

otherwise, if G_0 ⊄ G,

$$E\Big[ \frac{T}{c_h} \big\| \nabla L_T(\theta_0(p,G)) - \nabla L(\theta_0(p,G)) \big\|^2_{\Gamma(p,G)^{-1}} \Big]^2 = O(p^2).$$

Proof. First assume that G_0 ⊆ G. We evaluate, for m = 2, 3, 4,

$$T^m E \big\| \nabla L_T(\theta_0(p,G)) - \nabla L(\theta_0(p,G)) \big\|^{2m}_{\Gamma(p,G)^{-1}} = T^m \sum_{i_1, \ldots, i_{2m} = 1}^{\infty} E\Big[ \prod_{k=1}^{2m} \Big( \frac{\partial L_T(\theta_0(p,G))}{\partial \theta_{i_k}} - \frac{\partial L(\theta_0(p,G))}{\partial \theta_{i_k}} \Big) \Big] \prod_{k=1}^{m} \Gamma^{-1}_{i_{2k-1} i_{2k}}(p,G). \qquad (4.2.6)$$

Setting

$$Y_{i_k} = \sqrt{\frac{T}{c_h}} \Big( \frac{\partial L_T(\theta_0(p,G))}{\partial \theta_{i_k}} - \frac{\partial L(\theta_0(p,G))}{\partial \theta_{i_k}} \Big)$$


we obtain from the product theorem for cumulants and Lemma 4.4.2

$$E\Big[ \prod_{k=1}^{2m} Y_{i_k} \Big] = \sum_{n=1}^{2m} \sum_{\{\pi_1, \ldots, \pi_n\}} \prod_{r=1}^{n} \operatorname{cum}\big\{ Y_{i_{r,1}}, \ldots, Y_{i_{r,n_r}} \big\} = \sum_{\substack{\{\pi_1, \ldots, \pi_n\} \\ n_r = 2}} \prod_{r=1}^{m} \operatorname{cum}\big\{ Y_{i_{r,1}}, Y_{i_{r,2}} \big\} + \sum_{\substack{i_{1,1}, i_{1,2} = 1 \\ i_{1,1} \ne i_{1,2}}}^{2m} \operatorname{cum}\big\{ Y_{i_{1,1}}, Y_{i_{1,2}} \big\}\, O\Big( \frac{\log(T)^2}{T} \Big) + O\Big( \frac{\log(T)^4}{T^2} \Big). \qquad (4.2.7)$$

Lemma 4.4.1 yields for the first term

$$\sum_{\substack{\{\pi_1, \ldots, \pi_n\} \\ n_r = 2}} \prod_{r=1}^{m} \Big[ \Gamma_{i_{r,1} i_{r,2}, 0}(p,G) + O\Big( \frac{p \log(T)}{T} \Big) \Big] = \sum_{\substack{\{\pi_1, \ldots, \pi_n\} \\ n_r = 2}} \prod_{r=1}^{m} \Gamma_{i_{r,1} i_{r,2}, 0}(p,G) + \sum_{\substack{\{\pi_1, \ldots, \pi_n\} \\ n_r = 2}} \sum_{k=1}^{m} \prod_{\substack{r=1 \\ r \ne k}}^{m} \Gamma_{i_{r,1} i_{r,2}, 0}(p,G)\, O\Big( \frac{p \log(T)}{T} \Big) + O\Big( \frac{p^2 \log(T)^2}{T^2} \Big).$$

Substituting (4.2.7) into (4.2.6), we then obtain with B_{p,G} = Γ_0(p,G) Γ(p,G)^{-1}

$$E\Big[ \frac{T}{c_h} \big\| \nabla L_T(\theta_0(p,G)) - \nabla L(\theta_0(p,G)) \big\|^2_{\Gamma(p,G)^{-1}} \Big]^4 = \big( \operatorname{tr}(B_{p,G}) \big)^4 + 12 \operatorname{tr}(B_{p,G}^2) \big( \operatorname{tr}(B_{p,G}) \big)^2 + 12 \big( \operatorname{tr}(B_{p,G}^2) \big)^2 + 32 \operatorname{tr}(B_{p,G}^3) \operatorname{tr}(B_{p,G}) + 48 \operatorname{tr}(B_{p,G}^4) + O\Big( \frac{p^{11/2} \log(T)^2}{T} \Big)$$

and further, in the same way as in the proof of Lemma 4.2.4,

$$= k(p,G)^4 + 12\, k(p,G)^3 + 44\, k(p,G)^2 + 48\, k(p,G) + O\big( p^{4-\beta} \big) + O\Big( \frac{p^{11/2} \log(T)^2}{T} \Big).$$

Similarly, we get

$$E\Big[ \frac{T}{c_h} \big\| \nabla L_T(\theta_0(p,G)) - \nabla L(\theta_0(p,G)) \big\|^2_{\Gamma(p,G)^{-1}} \Big]^3 = k(p,G)^3 + 6\, k(p,G)^2 + 8\, k(p,G) + O\big( p^{3-\beta} \big) + O\Big( \frac{p^{9/2} \log(T)^2}{T} \Big)$$

and

$$E\Big[ \frac{T}{c_h} \big\| \nabla L_T(\theta_0(p,G)) - \nabla L(\theta_0(p,G)) \big\|^2_{\Gamma(p,G)^{-1}} \Big]^2 = \big( \operatorname{tr}(B_{p,G}) \big)^2 + 2 \operatorname{tr}(B_{p,G}^2) + O\Big( \frac{p^{7/2} \log(T)^2}{T} \Big) \qquad (4.2.8)$$
$$= k(p,G)^2 + 2\, k(p,G) + O\big( p^{2-\beta} \big) + O\Big( \frac{p^{7/2} \log(T)^2}{T} \Big).$$


Together with Lemma 4.4.1, this proves the first part of the lemma. The second part follows directly from equation (4.2.8).
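The χ²-moment structure underlying Lemma 4.2.5 can be verified by direct computation. The following sketch (illustrative, not thesis code) uses the recursion E[X^m] = (k + 2(m−1)) E[X^{m−1}] for X ~ χ²_k to confirm the second and fourth moments quoted above, including E[(X − k)⁴] = 12k² + 48k.

```python
# Raw and central moments of the chi-square distribution with k degrees of
# freedom, matching the leading terms in Lemma 4.2.5.
def chi2_raw_moments(k, m_max=4):
    """Return [E X^0, E X^1, ..., E X^m_max] for X ~ chi^2_k."""
    moments = [1.0]
    for m in range(1, m_max + 1):
        moments.append((k + 2 * (m - 1)) * moments[-1])
    return moments

k = 5
m = chi2_raw_moments(k)
central4 = m[4] - 4*k*m[3] + 6*k**2*m[2] - 4*k**3*m[1] + k**4*m[0]
print(m[2] == k**2 + 2*k, central4 == 12*k**2 + 48*k)  # True True
```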

It follows now from (4.2.5) and Lemma 4.2.1 that

$$\big\| \theta_T(p,G) - \theta_0(p,G) \big\|^2 \le C \big\| \theta_T(p,G) - \theta_0(p,G) \big\|^2_{\nabla^2 L(\theta_0(p,G))} = O_P\Big( \frac{k(p,G)}{T} \Big).$$

This implies that equation (4.2.4) for the Kullback-Leibler information can be rewritten as

$$I(\theta_T(p,G), \theta_0) = I(\theta_0(p,G), \theta_0) + \frac{1}{2} \big\| \theta_T(p,G) - \theta_0(p,G) \big\|^2_{\nabla^2 L(\theta_0(p,G))} + o_P\Big( \frac{k(p,G)}{T} \Big).$$

Further, Lemma 4.2.5 suggests that I(θ_T(p,G), θ_0) can be approximated by the following deterministic function:

$$L_T(p,G) = \frac{k(p,G)\, c_h}{2T} + I(\theta_0(p,G), \theta_0).$$

The next results show that this approximation holds uniformly for all 1 ≤ p ≤ P_T, which allows us to reformulate asymptotic efficiency in terms of the sequence (p*_T, G*_T) which minimizes L_T(p,G).

Definition 4.2.6 (p*_T, G*_T) is the sequence of models which attains the minimum of L_T(p,G),

$$(p^*_T, G^*_T) = \mathop{\operatorname{argmin}}_{1 \le p \le P_T,\ G \in \mathcal{G}} L_T(p,G),$$

for all T ∈ N.

Under Assumption 4.1.7, L_T(P_T, G_0) and therefore also L_T(p*_T, G*_T) converge to zero as T → ∞. But the second term of L_T(p,G) does not vanish if the approximating model is wrongly specified, which is the case for finite p or if G does not contain the true graph G_0. It follows that p*_T diverges to infinity as T → ∞ and that G_0 ⊆ G*_T for almost all T ∈ N.

Lemma 4.2.7 Suppose that Assumptions 4.1.1 to 4.1.3 and 4.1.7 hold. Then

$$\max_{1 \le p \le P_T} \max_{G \in \mathcal{G}} \Big| \frac{T \big\| \nabla L_T(\theta_0(p,G)) - \nabla L(\theta_0(p,G)) \big\|^2_{\Gamma(p,G)^{-1}} - k(p,G)\, c_h}{T\, L_T(p,G)} \Big| = o_P(1).$$

Proof. Since 𝒢 is finite, it is sufficient to prove the convergence for fixed G ∈ 𝒢. First


consider the case G_0 ⊆ G. We then get with Lemma 4.2.5

$$E\Big[ \max_{1 \le p \le P_T} \Big| \frac{\frac{T}{c_h} \big\| \nabla L_T(\theta_0(p,G)) - \nabla L(\theta_0(p,G)) \big\|^2_{\Gamma(p,G)^{-1}} - k(p,G)}{T\, L_T(p,G)} \Big| \Big]^4 \le \sum_{1 \le p \le P_T} E\Big[ \frac{\frac{T}{c_h} \big\| \nabla L_T(\theta_0(p,G)) - \nabla L(\theta_0(p,G)) \big\|^2_{\Gamma(p,G)^{-1}} - k(p,G)}{T\, L_T(p,G)} \Big]^4 \le \sum_{1 \le p \le P_T} \frac{48\, k(p,G) + 12\, k(p,G)^2 + O(p^{4-\beta})}{T^4 L_T(p,G)^4} + O\Big( \frac{P_T^{5/2} \log(T)^2}{T} \Big) \le \sum_{1 \le p \le p^*_T} \Big( \frac{C p^2}{p^{*4}_T} + \frac{C p^{4-\beta}}{p^{*4}_T} \Big) + \sum_{p^*_T < p \le P_T} \Big( \frac{C}{p^2} + C p^{-\beta} \Big) + O\Big( \frac{P_T^{5/2} \log(T)^2}{T} \Big),$$

which tends to zero by the assumptions on P_T and β and since p*_T → ∞. If, on the other hand, G does not contain G_0, then I(θ_0(p,G), θ_0) does not vanish for p → ∞, and therefore L_T(p,G) is bounded away from zero uniformly in p ∈ N. It then follows from the second part of Lemma 4.2.5 that

$$E\Big[ \max_{1 \le p \le P_T} \big\| \nabla L_T(\theta_0(p,G)) - \nabla L(\theta_0(p,G)) \big\|^2_{\Gamma(p,G)^{-1}} \Big]^2 = O\Big( \frac{P_T^3}{T^2} \Big),$$

which completes the proof.

Theorem 4.2.8 Suppose that Assumptions 4.1.1 to 4.1.3 and 4.1.7 hold. Then

$$\max_{1 \le p \le P_T} \max_{G \in \mathcal{G}} \Big| \frac{I(\theta_T(p,G), \theta_0)}{L_T(p,G)} - 1 \Big| = o_P(1).$$

Proof. Let G be fixed. We first note that equation (4.2.5) together with Lemma 4.2.7 implies that

$$\max_{1 \le p \le P_T} \big\| \theta_T(p,G) - \theta_0(p,G) \big\|^2 = O_P\Big( \frac{P_T}{T} \Big).$$

Since further p / (T L_T(p,G)) ≤ C uniformly in p ∈ N, it follows from (4.2.4) and (4.2.5) that

$$\max_{1 \le p \le P_T} \Big| \frac{I(\theta_T(p,G), \theta_0) - L_T(p,G)}{L_T(p,G)} \Big| = \max_{1 \le p \le P_T} \Big| \frac{T \big\| \theta_T(p,G) - \theta_0(p,G) \big\|^2_{\nabla^2 L(\theta_0(p,G))} - k(p,G)\, c_h}{2 T\, L_T(p,G)} \Big| + O_P\Big( \frac{P_T}{\sqrt{T}} \Big) \le \max_{1 \le p \le P_T} \Big| \frac{T \big\| \nabla L_T(\theta_0(p,G)) - \nabla L(\theta_0(p,G)) \big\|^2_{\Gamma(p,G)^{-1}} - k(p,G)\, c_h}{2 T\, L_T(p,G)} \Big| + o_P(1),$$

from which the assertion follows by Lemma 4.2.7.

The uniform approximation of the Kullback-Leibler information by L_T(p,G) now leads to an asymptotic lower bound for I(θ_T, θ_0) and a new characterization of asymptotic efficiency.


Theorem 4.2.9 Suppose that Assumptions 4.1.1 to 4.1.3 and 4.1.7 hold. If (p_T, G_T)_{T∈N} is a random sequence such that 1 ≤ p_T ≤ P_T and G_T ∈ 𝒢, then we have for all ε > 0

$$\lim_{T \to \infty} P\Big( \frac{I(\theta_T(p_T, G_T), \theta_0)}{L_T(p^*_T, G^*_T)} \ge 1 - \varepsilon \Big) = 1.$$

Proof. The result is a direct consequence of the inequality

$$\frac{I(\theta_T(p_T, G_T), \theta_0)}{L_T(p^*_T, G^*_T)} \ge \frac{I(\theta_T(p_T, G_T), \theta_0)}{L_T(p_T, G_T)}$$

and Theorem 4.2.8.

Corollary 4.2.10 Suppose that Assumptions 4.1.1 to 4.1.3 and 4.1.7 hold. Then a random sequence (p_T, G_T)_{T∈N} such that 1 ≤ p_T ≤ P_T and G_T ∈ 𝒢 is asymptotically efficient if and only if it satisfies

$$\frac{I(\theta_T(p_T, G_T), \theta_0)}{L_T(p^*_T, G^*_T)} \xrightarrow{P} 1.$$

Proof. Let (p̃_T, G̃_T)_{T∈N} be the (random) sequence which minimizes I(θ_T(p,G), θ_0). Since accordingly I(θ_T(p̃_T, G̃_T), θ_0) ≤ I(θ_T(p*_T, G*_T), θ_0), we get for all ε > 0

$$P\Big( \frac{I(\theta_T(\tilde{p}_T, \tilde{G}_T), \theta_0)}{L_T(p^*_T, G^*_T)} \ge 1 + \varepsilon \Big) \le P\Big( \frac{I(\theta_T(p^*_T, G^*_T), \theta_0)}{L_T(p^*_T, G^*_T)} \ge 1 + \varepsilon \Big),$$

which converges to zero as T → ∞ by Theorem 4.2.8. Together with the lower bound in Theorem 4.2.9, this yields I(θ_T(p̃_T, G̃_T), θ_0) / L_T(p*_T, G*_T) →_P 1. Therefore

$$\frac{I(\theta_T(p_T, G_T), \theta_0)}{I(\theta_T(\tilde{p}_T, \tilde{G}_T), \theta_0)} \xrightarrow{P} 1 \quad\Leftrightarrow\quad \frac{I(\theta_T(p_T, G_T), \theta_0)}{L_T(p^*_T, G^*_T)} \xrightarrow{P} 1,$$

from which the assertion follows.

4.3 Asymptotically efficient model selection

Model distances in general depend on the unknown distribution of the observations and therefore cannot be minimized directly for model selection. Empirical model distances such as the log-likelihood function can be used for the estimation of parameters within each model, but they typically do not provide good overall estimates of the theoretical model distance and therefore need to be corrected (e.g. Shibata, 1997).


Akaike (1973) considered model selection by minimizing the expected Kullback-Leibler information, which leads to a simple bias correction term for the log-likelihood function. In our situation, we have the following Taylor approximations for L(θ) and L_T(θ):

$$L(\theta_T(p,G)) \approx L(\theta_0(p,G)) + \frac{1}{2} \big\| \theta_T(p,G) - \theta_0(p,G) \big\|^2_{\Gamma(p,G)},$$
$$L_T(\theta_T(p,G)) \approx L_T(\theta_0(p,G)) - \frac{1}{2} \big\| \theta_T(p,G) - \theta_0(p,G) \big\|^2_{\Gamma(p,G)}.$$

Taking expectations, this leads to the following version of Akaike's AIC criterion:

$$C_T(p,G) = L_T(\theta_T(p,G)) + \frac{c_h\, k(p,G)}{T}.$$

In the following we show that the minimizing sequence

$$(p_T, G_T) = \mathop{\operatorname{argmin}}_{1 \le p \le P_T,\ G \in \mathcal{G}} C_T(p,G)$$

is asymptotically efficient with respect to the Kullback-Leibler information. For this, we first rewrite the criterion as

$$C_T(p,G) = L_T(p,G) + \big[ L_T(\theta_0(p,G)) - L(\theta_0(p,G)) \big] + \frac{k(p,G)\, c_h}{2T} + \big[ L_T(\theta_T(p,G)) - L_T(\theta_0(p,G)) \big] + L(\theta_0).$$

C_T(p,G) estimates the Kullback-Leibler information well if the second to fourth terms on the right-hand side are negligible compared with the first term L_T(p,G).
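The logic of penalized selection can be demonstrated with a toy analogue (an illustrative sketch under strong assumptions: a scalar AR process, population covariances in place of data, and the classical AIC form log(σ̂²_p) + 2p/T rather than the thesis's C_T(p,G)): the penalty discourages overfitting while the likelihood term rules out underfitting, so the criterion picks the correct order.

```python
# Toy order selection for a true AR(2) process, using exact covariances and a
# penalized log innovation-variance criterion.  All numbers are hypothetical.
import numpy as np

a1, a2, s2 = 0.5, -0.3, 1.0
# Population covariances R(0..5) of the AR(2) process.
r1c = a1 / (1 - a2)
R = [s2 / (1 - a1 * r1c - a2 * (a1 * r1c + a2))]
R.append(r1c * R[0])
for v in range(2, 6):
    R.append(a1 * R[v - 1] + a2 * R[v - 2])

def yule_walker_sigma2(p):
    """Innovation variance of the best AR(p) fit to the covariances."""
    G = np.array([[R[abs(u - v)] for v in range(p)] for u in range(p)])
    a = np.linalg.solve(G, np.array(R[1:p + 1]))
    return R[0] - a @ np.array(R[1:p + 1])

T = 500
crit = [np.log(yule_walker_sigma2(p)) + 2 * p / T for p in range(1, 6)]
p_hat = 1 + int(np.argmin(crit))
print(p_hat)  # 2
```

Orders p ≥ 2 fit exactly (σ̂²_p = σ²), so only the penalty grows with p, while order 1 pays a bias cost that exceeds the penalty saving.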

Lemma 4.3.1 Suppose that Assumptions 4.1.1 to 4.1.3 and 4.1.7 hold. Then

$$\max_{1 \le p \le P_T} \max_{G \in \mathcal{G}} \Big| \frac{\frac{k(p,G)\, c_h}{2T} + \big[ L_T(\theta_T(p,G)) - L_T(\theta_0(p,G)) \big]}{L_T(p,G)} \Big| = o_P(1).$$

Proof. First, we obtain by a Taylor expansion of L_T(θ_0(p,G)) about θ_T(p,G)

$$L_T(\theta_0(p,G)) - L_T(\theta_T(p,G)) = \nabla L_T(\theta_T(p,G))' \big( \theta_0(p,G) - \theta_T(p,G) \big) + \frac{1}{2} \big\| \theta_T(p,G) - \theta_0(p,G) \big\|^2_{\nabla^2 L_T(\bar\theta_T(p,G))} + O_P\big( p^{1/2} \| \theta_T(p,G) - \theta_0(p,G) \|^3 \big) = \frac{1}{2} \big\| \theta_T(p,G) - \theta_0(p,G) \big\|^2_{\nabla^2 L(\theta_0(p,G))} + O_P\big( p^{1/2} \| \theta_T(p,G) - \theta_0(p,G) \|^3 \big),$$

where we have used the identity in (4.1.8), and further a Taylor expansion for the second derivative together with Lemma 4.2.1(ii) to get, for θ ∈ ℓ²(R),

$$\theta' \nabla^2 L(\bar\theta_T(p,G))\, \theta = \theta' \nabla^2 L(\theta_0(p,G))\, \theta + O\big( p^{1/2} \| \theta_T(p,G) - \theta_0(p,G) \|\, \|\theta\|^2 \big).$$

The result now follows, as in the proof of Theorem 4.2.8, from (4.2.5) and Lemma 4.2.7.


The lemma shows that the third term is uniformly negligible compared with L_T(p,G). In contrast, the fourth term is constant, and the second term is of order O_P(√(p/T)), which follows from the proof of Lemma 4.2.2. However, the behaviour of the minimum (p_T, G_T) depends only on the differences

$$C_T(p,G) - C_T(p^*_T, G^*_T).$$

Here the fourth term obviously cancels, and it is therefore sufficient to show that

$$\big[ L_T(\theta_0(p,G)) - L(\theta_0(p,G)) \big] - \big[ L_T(\theta_0(p^*_T, G^*_T)) - L(\theta_0(p^*_T, G^*_T)) \big]$$

is uniformly negligible compared with L_T(p,G).

Lemma 4.3.2 Suppose that Assumptions 4.1.1 to 4.1.3 and 4.1.7 hold. Then

$$\max_{1 \le p \le P_T} \max_{G \in \mathcal{G}} \frac{\big[ L_T(\theta_0(p,G)) - L(\theta_0(p,G)) \big] - \big[ L_T(\theta_0(p^*_T, G^*_T)) - L(\theta_0(p^*_T, G^*_T)) \big]}{L_T(p,G)}$$

tends to zero in probability.

Proof. With the abbreviations θ_{p,G} = θ_0(p,G) and θ*_T = θ_0(p*_T, G*_T) we have

$$\big[ L_T(\theta_{p,G}) - L(\theta_{p,G}) \big] - \big[ L_T(\theta^*_T) - L(\theta^*_T) \big] = \frac{1}{4\pi} \int_\Pi \operatorname{tr}\big[ \big( f^{-1}_{\theta_{p,G}}(\lambda) - f^{-1}_{\theta^*_T}(\lambda) \big) \big( I^{(T)}(\lambda) - f_{\theta_0}(\lambda) \big) \big]\, d\lambda = \frac{1}{4\pi} \int_\Pi \operatorname{tr}\Big[ \Big( \big( f^{-1}_{\theta_{p,G}}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big) - \big( f^{-1}_{\theta^*_T}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big) \Big) \big( I^{(T)}(\lambda) - f_{\theta_0}(\lambda) \big) \Big]\, d\lambda.$$

Noting that L_T(p,G) ≥ L_T(p*_T, G*_T), we therefore obtain

$$\max_{1 \le p \le P_T} \max_{G \in \mathcal{G}} \Big| \frac{\big[ L_T(\theta_{p,G}) - L(\theta_{p,G}) \big] - \big[ L_T(\theta^*_T) - L(\theta^*_T) \big]}{L_T(p,G)} \Big| \le \max_{1 \le p \le P_T} \max_{G \in \mathcal{G}} \frac{C}{L_T(p,G)} \Big| \int_\Pi \operatorname{tr}\big[ \big( f^{-1}_{\theta_{p,G}}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big) \big( I^{(T)}(\lambda) - f_{\theta_0}(\lambda) \big) \big]\, d\lambda \Big| + \frac{C}{L_T(p^*_T, G^*_T)} \Big| \int_\Pi \operatorname{tr}\big[ \big( f^{-1}_{\theta^*_T}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big) \big( I^{(T)}(\lambda) - f_{\theta_0}(\lambda) \big) \big]\, d\lambda \Big| \le \max_{1 \le p \le P_T} \max_{G \in \mathcal{G}} \frac{C}{L_T(p,G)} \Big| \int_\Pi \operatorname{tr}\big[ \big( f^{-1}_{\theta_{p,G}}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big) \big( I^{(T)}(\lambda) - f_{\theta_0}(\lambda) \big) \big]\, d\lambda \Big|.$$

Since 𝒢 is finite, it is now sufficient to show that for every G ∈ 𝒢 the sum

$$\sum_{p=1}^{P_T} \frac{1}{L_T(p,G)^4}\, E\Big[ \int_\Pi \operatorname{tr}\big[ \big( f^{-1}_{\theta_{p,G}}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big) \big( I^{(T)}(\lambda) - f_{\theta_0}(\lambda) \big) \big]\, d\lambda \Big]^4 \qquad (4.3.1)$$


converges to zero. By the product theorem for cumulants, the assumption of normality, and Lemma 4.4.1, we find for the mean

$$E\Big[ \int_\Pi \operatorname{tr}\big[ \big( f^{-1}_{\theta_{p,G}}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big) \big( I^{(T)}(\lambda) - f_{\theta_0}(\lambda) \big) \big]\, d\lambda \Big]^4 = 3 \Big( \operatorname{cum}\Big\{ \int_\Pi \operatorname{tr}\big[ \big( f^{-1}_{\theta_{p,G}}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big) \big( I^{(T)}(\lambda) - f_{\theta_0}(\lambda) \big) \big]\, d\lambda,\ \int_\Pi \operatorname{tr}\big[ \big( f^{-1}_{\theta_{p,G}}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big) \big( I^{(T)}(\lambda) - f_{\theta_0}(\lambda) \big) \big]\, d\lambda \Big\} \Big)^2 + O\Big( \frac{\log(T)^3}{T^3} \Big) = \frac{C}{T^2} \Big[ \int_\Pi \operatorname{tr}\big[ f_{\theta_0}(\lambda) \big( f^{-1}_{\theta_{p,G}}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big) f_{\theta_0}(\lambda) \big( f^{-1}_{\theta_{p,G}}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big) \big]\, d\lambda \Big]^2 + O\Big( \frac{\log(T)^3}{T^3} \Big).$$

Next, we derive a lower bound of similar form for L_T(p,G). We first consider the second term of L_T(p,G), for which a Taylor expansion of L(θ_{p,G}) about θ_0 yields

$$I(\theta_0(p,G), \theta_0) = L(\theta_{p,G}) - L(\theta_0) = \frac{1}{2} \big\| \theta_{p,G} - \theta_0 \big\|^2_{\nabla^2 L(\theta_0)} + O\big( \| \theta_{p,G} - \theta_0 \|_1\, \| \theta_{p,G} - \theta_0 \|^2 \big).$$

To show that the remainder is of smaller order than the quadratic term, we note that

$$\| \theta_{p,G} - \theta_0 \|_1 \le \sum_{|u| \le p} \int_\Pi \big\| f^{-1}_{\theta_{p,G}}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big\|_1\, d\lambda + \sum_{|u| > p} \big\| R^{(i)}_{\theta_0}(u) \big\|_1.$$

By Lemma 4.2.2 the first term is of order O(p^{1/2−β}), while the second term vanishes for p → ∞ since R^{(i)}_{θ_0}(u) is absolutely summable. Therefore ‖θ_{p,G} − θ_0‖₁ converges to zero. L_T(p,G) can now be rewritten as

$$L_T(p,G) = \frac{k(p,G)\, c_h}{2T} + \frac{1}{2} \big\| \theta_{p,G} - \theta_0 \big\|^2_{\nabla^2 L(\theta_0)} + o\big( \| \theta_{p,G} - \theta_0 \|^2 \big) \ge \frac{1}{8\pi} \int_\Pi \operatorname{tr}\big[ f_{\theta_0}(\lambda) \big( f^{-1}_{\theta_{p,G}}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big) f_{\theta_0}(\lambda) \big( f^{-1}_{\theta_{p,G}}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big) \big]\, d\lambda + o\big( \| \theta_{p,G} - \theta_0 \|^2 \big)$$

and thus

$$\frac{1}{L_T(p,G)^2} \Big[ \int_\Pi \operatorname{tr}\big[ f_{\theta_0}(\lambda) \big( f^{-1}_{\theta_{p,G}}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big) f_{\theta_0}(\lambda) \big( f^{-1}_{\theta_{p,G}}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big) \big]\, d\lambda \Big]^2 \le C$$

uniformly in 1 ≤ p ≤ P_T. With this we obtain for (4.3.1)

$$\sum_{p=1}^{P_T} \frac{C}{T^2 L_T(p,G)^4} \Big[ \int_\Pi \operatorname{tr}\big[ f_{\theta_0}(\lambda) \big( f^{-1}_{\theta_{p,G}}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big) f_{\theta_0}(\lambda) \big( f^{-1}_{\theta_{p,G}}(\lambda) - f^{-1}_{\theta_0}(\lambda) \big) \big]\, d\lambda \Big]^2 \le \sum_{p=1}^{P_T} \frac{C}{T^2 L_T(p,G)^2} \le \sum_{p=1}^{p^*_T} \frac{C}{p^{*2}_T} + \sum_{p = p^*_T + 1}^{P_T} \frac{C}{p^2}.$$

Since p*_T diverges to infinity, both sums tend to zero as T → ∞, and the proof is complete.


Theorem 4.3.3 Suppose that Assumptions 4.1.1 to 4.1.3 and 4.1.7 hold. Then the sequence (p_T, G_T)_{T∈N} is asymptotically efficient, that is,

$$\frac{I(\theta_T(p_T, G_T), \theta_0)}{L_T(p^*_T, G^*_T)} \xrightarrow{P} 1.$$

Proof. In Lemmas 4.3.1 and 4.3.2 we have shown that

$$\frac{C_T(p,G) - C_T(p^*_T, G^*_T)}{L_T(p,G)} - \frac{L_T(p,G) - L_T(p^*_T, G^*_T)}{L_T(p,G)}$$

converges to zero in probability uniformly in 1 ≤ p ≤ P_T. It then follows from the inequalities C_T(p_T, G_T) ≤ C_T(p*_T, G*_T) and L_T(p_T, G_T) ≥ L_T(p*_T, G*_T) that for all ε > 0

$$\lim_{T \to \infty} P\Big( \frac{L_T(p^*_T, G^*_T)}{L_T(p_T, G_T)} \ge 1 - \varepsilon \Big) = 1.$$

The result now follows from Theorem 4.2.8 and Corollary 4.2.10.

4.4 Proofs and auxiliary results

Proof of Lemma 4.2.1.

(i) For η ∈ ℓ²(R) we define matrix-valued functions $h_\eta$ on [−π, π] by
\[
h_\eta(\lambda) = \sum_{u\in\mathbb{Z}} R^{(i)}_\eta(u)\exp(-i\lambda u). \tag{4.4.1}
\]

These matrices are Hermitian, but for general η ∈ ℓ²(R) not necessarily positive definite. By (4.1.5), Assumption 4.1.2 (i), and Lemma B.2

\[
\begin{aligned}
\eta'\nabla^2\mathcal{L}(\theta)\eta &= \int_\Pi \operatorname{tr}\bigl[f_\theta(\lambda)h_\eta(\lambda)f_\theta(\lambda)h_\eta(\lambda)\bigr]d\lambda\\
&\le \int_\Pi \|f_\theta(\lambda)\|^2 \operatorname{tr}\bigl[h_\eta(\lambda)h_\eta(\lambda)^*\bigr]d\lambda\\
&\le \sup_{\lambda\in[-\pi,\pi]}\|f_\theta(\lambda)\|^2 \int_\Pi \bigl\|h_\eta(\lambda)\bigr\|_2^2\,d\lambda\\
&= \sup_{\lambda\in[-\pi,\pi]}\|f_\theta(\lambda)\|^2 \sum_{u\in\mathbb{Z}}\bigl\|R^{(i)}_\eta(u)\bigr\|_2^2 \le C\|\eta\|^2.
\end{aligned}
\]

This proves the first inequality in (4.2.1). The second inequality is proven similarly with $\|f_\theta(\lambda)\|_{\inf}$ substituted for $\|f_\theta(\lambda)\|$.


(ii) Suppose that η, ζ ∈ ℓ²(R) and let $h_\eta(\lambda)$ and $h_\zeta(\lambda)$ be defined as in (4.4.1). Since $f^{-1}_\theta(\lambda)$ is linear in θ, we obtain
\[
\begin{aligned}
\biggl|\sum_{i,j,k\in\mathbb{N}} \frac{\partial^3\mathcal{L}(\theta)}{\partial\theta_i\partial\theta_j\partial\theta_k}\,\eta_i\eta_j\zeta_k\biggr| &\le \int_\Pi \Bigl|\operatorname{tr}\bigl[f_\theta(\lambda)h_\eta(\lambda)f_\theta(\lambda)h_\eta(\lambda)f_\theta(\lambda)h_\zeta(\lambda)\bigr]\Bigr|\,d\lambda\\
&\le \int_\Pi \|f_\theta(\lambda)\|^3\|h_\zeta(\lambda)\|\operatorname{tr}\bigl[h_\eta(\lambda)h_\eta(\lambda)^*\bigr]d\lambda\\
&\le b_2^3 \sup_{\lambda\in[-\pi,\pi]}\|h_\zeta(\lambda)\|\int_\Pi \bigl\|h_\eta(\lambda)\bigr\|_2^2\,d\lambda.
\end{aligned}
\]

Noting that for $\zeta \in \pi_{p,G}(R^\infty) \subseteq \ell^2(R)$
\[
\|h_\zeta(\lambda)\| \le \|h_\zeta(\lambda)\|_1 \le \sum_{u\in\mathbb{Z}}\bigl\|R^{(i)}_\zeta(u)\bigr\|_1 \le C\|\zeta\|_1 \le C\sqrt{k(p,G)}\,\|\zeta\|
\]
uniformly for all λ ∈ [−π, π], and that $\int_\Pi\|h_\eta(\lambda)\|_2^2\,d\lambda \le 2\|\eta\|_2^2$, the second part of the lemma follows.

Proof of Lemma 4.2.3. If G₀ ⊆ G, then θ₀(p,G) converges to θ₀ as p → ∞. Therefore we have the following Taylor expansion for $\mathcal{L}(\theta)$:
\[
\mathcal{L}(\theta_0(p,G)) - \mathcal{L}(\theta_0) = \nabla\mathcal{L}(\theta_0)'\bigl(\theta_0(p,G)-\theta_0\bigr) + \frac{1}{2}\bigl\|\theta_0(p,G)-\theta_0\bigr\|^2_{\nabla^2\mathcal{L}(\bar\theta)},
\]
where $\bar\theta = \theta_0 + \xi(\theta_0(p,G)-\theta_0)$ for some ξ ∈ [0, 1]. The first term is zero since θ₀ minimizes $\mathcal{L}(\theta)$. For the second term we obtain by (4.1.5) and Lemma 4.2.1 the lower bound

\[
\begin{aligned}
&\frac{1}{2}\int_\Pi \operatorname{tr}\Bigl[f_{\bar\theta}(\lambda)\bigl(f^{-1}_{\theta_0(p,G)}(\lambda)-f^{-1}_{\theta_0}(\lambda)\bigr)f_{\bar\theta}(\lambda)\bigl(f^{-1}_{\theta_0(p,G)}(\lambda)-f^{-1}_{\theta_0}(\lambda)\bigr)\Bigr]d\lambda\\
&\quad\ge \frac{1}{2}\int_\Pi \|f_{\bar\theta}(\lambda)\|_{\inf}\operatorname{tr}\Bigl[\bigl(f^{-1}_{\theta_0(p,G)}(\lambda)-f^{-1}_{\theta_0}(\lambda)\bigr)f_{\bar\theta}(\lambda)\bigl(f^{-1}_{\theta_0(p,G)}(\lambda)-f^{-1}_{\theta_0}(\lambda)\bigr)\Bigr]d\lambda\\
&\quad\ge \frac{1}{2}\int_\Pi \|f_{\bar\theta}(\lambda)\|_{\inf}^2\operatorname{tr}\Bigl[\bigl(f^{-1}_{\theta_0(p,G)}(\lambda)-f^{-1}_{\theta_0}(\lambda)\bigr)\bigl(f^{-1}_{\theta_0(p,G)}(\lambda)-f^{-1}_{\theta_0}(\lambda)\bigr)\Bigr]d\lambda\\
&\quad\ge \frac{b_1^2}{2}\int_\Pi \bigl\|f^{-1}_{\theta_0(p,G)}(\lambda)-f^{-1}_{\theta_0}(\lambda)\bigr\|_2^2\,d\lambda. \qquad\text{(4.4.2)}
\end{aligned}
\]

Similarly we get the upper bound
\[
\mathcal{L}(\theta_0(p,G)) - \mathcal{L}(\theta_0) \le \frac{b_2^2}{2}\int_\Pi \bigl\|f^{-1}_{\theta_0(p,G)}(\lambda)-f^{-1}_{\theta_0}(\lambda)\bigr\|_2^2\,d\lambda. \tag{4.4.3}
\]

Now let $G_S$ be the saturated graph with all edges included. Then by the equations in (4.1.4) the Fourier coefficients $R_{\theta_0(p,G_S)}(u)$ and $R_{\theta_0}(u)$ are equal for all |u| ≤ p. Thus by Assumptions 4.1.1 and 4.1.2
\[
\int_\Pi \bigl\|f_{\theta_0(p,G_S)}(\lambda)-f_{\theta_0}(\lambda)\bigr\|_2^2\,d\lambda = \sum_{|u|>p}\bigl\|R_{\theta_0(p,G_S)}(u)-R_{\theta_0}(u)\bigr\|_2^2 = O\bigl(p^{-(2\beta+1)}\bigr). \tag{4.4.4}
\]


For the inverse spectral matrices the same holds since
\[
\begin{aligned}
\bigl\|f^{-1}_{\theta_0(p,G_S)}(\lambda)-f^{-1}_{\theta_0}(\lambda)\bigr\|_2 &= \bigl\|f^{-1}_{\theta_0(p,G_S)}(\lambda)\bigl(f_{\theta_0}(\lambda)-f_{\theta_0(p,G_S)}(\lambda)\bigr)f^{-1}_{\theta_0}(\lambda)\bigr\|_2 \\
&\le \bigl\|f^{-1}_{\theta_0(p,G_S)}(\lambda)\bigr\|\,\bigl\|f^{-1}_{\theta_0}(\lambda)\bigr\|\,\bigl\|f_{\theta_0(p,G_S)}(\lambda)-f_{\theta_0}(\lambda)\bigr\|_2,
\end{aligned} \qquad\text{(4.4.5)}
\]

where $\|f^{-1}_{\theta_0(p,G_S)}(\lambda)\|$ and $\|f^{-1}_{\theta_0}(\lambda)\|$ are bounded uniformly in λ by Assumptions 4.1.1 and 4.1.2. Now consider the inverse spectral matrix $f^{-1}_{\theta^*}(\lambda)$ with components
\[
f^{-1}_{ij,\theta^*}(\lambda) = \begin{cases} f^{-1}_{ij,\theta_0(p,G_S)}(\lambda) & \text{if } (i,j)\in E,\\[2pt] 0 & \text{if } (i,j)\notin E, \end{cases}
\]
where $\theta^* = \pi_{p,G}(\theta_0(p,G_S))$. Since $\theta_0(p,G_S)\to\theta_0$ also implies $\theta^*\to\theta_0$ and θ₀ belongs to the interior of Θ, there exists p₀ ∈ N such that $\theta^*\in\Theta(p,G)\subseteq\Theta$ for all p ≥ p₀.

Since G₀ ⊆ G it trivially holds that $\bigl|f^{-1}_{ij,\theta^*}(\lambda)-f^{-1}_{ij,\theta_0}(\lambda)\bigr| = 0$ for all (i,j) ∉ E. Therefore we also have that
\[
\int_\Pi \bigl\|f^{-1}_{\theta^*}(\lambda)-f^{-1}_{\theta_0}(\lambda)\bigr\|_2^2\,d\lambda = O\bigl(p^{-(2\beta+1)}\bigr).
\]

Since θ₀(p,G) minimizes $\mathcal{L}(\theta)$ over θ ∈ Θ(p,G), it then follows from (4.4.2) and (4.4.3) that
\[
\begin{aligned}
\int_\Pi \bigl\|f^{-1}_{\theta_0(p,G)}(\lambda)-f^{-1}_{\theta_0}(\lambda)\bigr\|_2^2\,d\lambda &\le C\bigl|\mathcal{L}(\theta_0(p,G))-\mathcal{L}(\theta_0)\bigr|\\
&\le C\bigl|\mathcal{L}(\theta^*)-\mathcal{L}(\theta_0)\bigr| \le C\int_\Pi \bigl\|f^{-1}_{\theta^*}(\lambda)-f^{-1}_{\theta_0}(\lambda)\bigr\|_2^2\,d\lambda,
\end{aligned}
\]

which together with (4.4.4) proves (4.2.2). The second part of the lemma then follows by the same argument as used in (4.4.5) and the uniform boundedness of $\|f^{-1}_{\theta_0(p,G_S)}(\lambda)\|$ and $\|f_{\theta_0}(\lambda)\|$.

Lemma 4.4.1 Suppose that Assumptions 4.1.1, 4.1.2 and 4.1.7 hold. Then we have, with $f(\lambda) = f_{\theta_0}(\lambda)$,
\[
\begin{aligned}
&T\,E\int_{\Pi^2}\bigl(I^{(T)}_{ij}(\lambda)-f_{ij}(\lambda)\bigr)\bigl(I^{(T)}_{kl}(\mu)-f_{kl}(\mu)\bigr)\exp(i\lambda u + i\mu v)\,d\lambda\,d\mu\\
&\quad= \frac{2\pi H_4}{H_2^2}\int_\Pi \bigl[f_{ik}(\lambda)f_{lj}(\lambda)\exp(i\lambda(u-v)) + f_{il}(\lambda)f_{kj}(\lambda)\exp(i\lambda(u+v))\bigr]d\lambda + O\Bigl(\frac{p\log T}{T}\Bigr)
\end{aligned}
\]
uniformly in |u|, |v| ≤ p.

Proof. By standard arguments we find that the mean above is equal to
\[
\frac{2\pi H_4}{H_2^2}\int_{\Pi^2}\bigl[f_{ik}(\lambda)f_{lj}(\lambda)\Phi^{(T)}_2(\lambda+\mu) + f_{il}(\lambda)f_{kj}(\lambda)\Phi^{(T)}_2(\lambda-\mu)\bigr]\exp(i\lambda u + i\mu v)\,d\lambda\,d\mu + O\Bigl(\frac{\log T}{T}\Bigr),
\]


where $\Phi^{(T)}_2$ is defined as in (3.2.7) and the error term is uniformly bounded in |u|, |v| ≤ p.

Noting that $\exp(i\lambda v)$ is Lipschitz continuous in λ with constant C|v|, we have
\[
\int_\Pi \bigl|\exp(iv(\lambda+\mu)) - \exp(iv\lambda)\bigr|\,\Phi^{(T)}_2(\mu)\,d\mu \le \frac{C}{T}\int_\Pi |v\mu|\,L^{(T)}(\mu)\,d\mu \le \frac{C|v|\log T}{T},
\]

which proves the lemma.

Lemma 4.4.2 Under Assumptions 4.1.1 and 4.1.2 we have
\[
\Bigl|\int_\Pi \bigl(\operatorname{cum}\bigl\{I^{(T)}_{ij}(\lambda)\bigr\} - f_{ij,\theta_0}(\lambda)\bigr)\exp(-i\lambda u)\,d\lambda\Bigr| = O\Bigl(\frac{1}{T}\Bigr)
\]
and for k ≥ 2
\[
\Bigl|\int_{\Pi^k} \operatorname{cum}\bigl\{I^{(T)}_{i_1j_1}(\lambda_1),\ldots,I^{(T)}_{i_kj_k}(\lambda_k)\bigr\}\exp\Bigl(-i\sum_{l=1}^k \lambda_l u_l\Bigr)\,d\lambda_1\cdots d\lambda_k\Bigr| = O\Bigl(\frac{\log(T)^{k-2}}{T^{k-1}}\Bigr)
\]
uniformly in u, u₁, …, u_k ∈ Z.

Proof. The first part follows directly from $\bigl|E(I^{(T)}_{ij}(\lambda)) - f_{ij,\theta_0}(\lambda)\bigr| = O(T^{-1})$ uniformly in λ ∈ [−π, π]. For the second part we note that by the normality assumption
\[
\operatorname{cum}\bigl\{d^{(T)}_{i_1}(\lambda_1),\ldots,d^{(T)}_{i_k}(\lambda_k)\bigr\} = 0
\]

if k ≥ 3. Therefore it follows from the product theorem for cumulants and (3.2.10) that
\[
\int_{\Pi^k} \bigl|\operatorname{cum}\bigl\{I^{(T)}_{i_1j_1}(\lambda_1),\ldots,I^{(T)}_{i_kj_k}(\lambda_k)\bigr\}\bigr|\,d\lambda_1\cdots d\lambda_k \le \frac{C}{T^k}\sum_{\text{i.p.}}\int_{\Pi^k}\prod_{r=1}^k L^{(T)}(\gamma_r)\,d\lambda_1\cdots d\lambda_k \le \frac{C\log(T)^{k-2}}{T^{k-1}},
\]

where we have used the notation from Section 3.2.


Chapter 5

Selection of causal graphical models

The first step in model selection is to choose an appropriate model distance. Using the final prediction error instead of the Kullback-Leibler distance, we focus on predicting future values of the sampled process from its past. We therefore would like to use only those variables which lead to a substantial reduction of the prediction error. But this is just the definition of causality. It therefore seems natural to consider causal graphical time series models when choosing models for prediction.

In the first section, we use a modified version of the final prediction error (cf. Akaike, 1969, 1970; Nakano and Tagami, 1987) to derive the minimum distance estimate in the class of finite autoregressive models under the restriction of causal graphs. In order to derive asymptotic efficiency for a number of model selection criteria such as AIC (Akaike, 1973, 1974) or MFPE (Akaike, 1971), we investigate in Section 2 the asymptotic behaviour of the final prediction error and derive an asymptotic lower bound for it. In Section 3, we derive an asymptotically efficient model selection criterion. We further show that AIC and MFPE are asymptotically efficient for an appropriately chosen model distance. As this model distance implicitly depends on the covariance matrix of the innovations, we briefly discuss the case where an estimate of the covariance matrix is used for the estimation of the parameters.

5.1 Introduction

In this section, we set up the framework for the selection of causal graphical autoregressive models. As in the last chapter, we assume that the true process X(t) is an autoregressive process of infinite order and thus that the class of approximating models asymptotically includes the data generating model. Although the model selection method does not depend on this assumption, it is essential for the analysis in Sections 2 and 3, where we derive asymptotic optimality for some model selection criteria. We first state the assumptions on the process X(t).


Assumption 5.1.1 {X(t)}, t ∈ Z is a d-vector-valued stochastic process defined on a probability space (Ω, A, P) such that the following conditions hold.

(i) X(t) is a stationary Gaussian autoregressive process,
\[
X(t) = \sum_{h=1}^{\infty} A_h X(t-h) + \varepsilon(t), \tag{5.1.1}
\]
with d × d coefficient matrices $A_h$ such that $\|A_h\| \ne 0$ for infinitely many h ∈ N and
\[
\sum_{h=1}^{\infty} \|A_h\| < \infty
\]
for any matrix norm ‖·‖. The innovations ε(t), t ∈ Z are independent and normally distributed with mean $E(\varepsilon(t)) = 0$ and regular covariance matrix $E(\varepsilon(t)\varepsilon(t)') = \Sigma$.

(ii) The spectral matrix f(λ) of X(t) exists and satisfies the boundedness condition
\[
c_1 1_d \le f(\lambda) \le c_2 1_d \quad \forall\,\lambda\in[-\pi,\pi],
\]
for constants c₁, c₂ such that 0 < c₁ ≤ c₂ < ∞.

(iii) X(t) has causality graph $G_0 = (V, E^d_0, E^u_0)$.

It is well known (cf. Brillinger, 1981, Section 3.8) that under conditions (i) and (ii) of Assumption 5.1.1 the process X(t) has the moving average representation
\[
X(t) = \sum_{h=0}^{\infty} B_h \varepsilon(t-h), \qquad B_0 = 1_d, \tag{5.1.2}
\]
and the coefficient matrices $B_h$ are again absolutely summable,
\[
\sum_{h=0}^{\infty} \|B_h\| < \infty.
\]
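As a small numerical aside (not part of the thesis), the matrices $B_h$ in (5.1.2) can be obtained from the $A_h$ by the standard recursion $B_0 = 1_d$, $B_h = \sum_{k=1}^{h} A_k B_{h-k}$; a minimal sketch, with illustrative names:

```python
import numpy as np

def ma_coefficients(A, H):
    """First H moving-average matrices B_0, ..., B_{H-1} of (5.1.2),
    computed from the AR matrices A = [A_1, A_2, ...] via the recursion
    B_h = sum_{k=1}^{h} A_k B_{h-k} with B_0 = identity."""
    d = A[0].shape[0]
    B = [np.eye(d)]
    for h in range(1, H):
        Bh = np.zeros((d, d))
        for k in range(1, min(h, len(A)) + 1):
            Bh += A[k - 1] @ B[h - k]
        B.append(Bh)
    return B

# for a VAR(1) the recursion collapses to B_h = A_1^h
A1 = np.array([[0.5, 0.1], [0.0, 0.4]])
B = ma_coefficients([A1], 4)
print(np.allclose(B[3], np.linalg.matrix_power(A1, 3)))  # True
```

Absolute summability of the $B_h$ then corresponds to the spectral radius of the companion form being smaller than one.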

Furthermore, let R denote the infinite dimensional covariance matrix $(R(u-v))_{u,v\in\mathbb{N}}$, where $R(u) = E(X(t)X(t+u)')$ with components $r_{ij}(u)$ for i, j = 1, …, d. Note that R is the Toeplitz matrix of the spectral matrix f, that is,
\[
R = B(f) = \Bigl(\int_\Pi f(\lambda)\exp(i\lambda(u-v))\,d\lambda\Bigr)_{u,v\in\mathbb{N}}.
\]
It then follows from condition (ii) of Assumption 5.1.1 that R satisfies a similar boundedness condition, that is, we have ‖R‖ < ∞ and ‖R‖_inf > 0. The same also holds for the inverse covariance matrix $R^{-1}$, as its eigenvalues are the inverses of the eigenvalues of R and thus are also bounded and bounded away from zero.
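For intuition (an illustration, not the thesis' notation), a finite block section of R is just a block-Toeplitz matrix built from the autocovariances; the helper below is a hypothetical sketch:

```python
import numpy as np

def block_toeplitz(R):
    """Assemble the block-Toeplitz matrix (R(u-v))_{u,v=1..p} from the
    autocovariances R = [R(0), R(1), ..., R(p-1)], using R(-u) = R(u)'."""
    p, d = len(R), R[0].shape[0]
    Rp = np.zeros((p * d, p * d))
    for u in range(p):
        for v in range(p):
            blk = R[u - v] if u >= v else R[v - u].T
            Rp[u * d:(u + 1) * d, v * d:(v + 1) * d] = blk
    return Rp

# hypothetical bivariate autocovariances
R0 = np.array([[2.0, 0.5], [0.5, 1.0]])
R1 = np.array([[0.6, 0.1], [0.2, 0.3]])
Rp = block_toeplitz([R0, R1])
print(np.allclose(Rp, Rp.T))  # True: the section is symmetric
```

The eigenvalue bounds stated above for R carry over to every such finite section.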


Before we define the model distance, we introduce some notation which will be used throughout the chapter. First we define for the process X(t) the infinite dimensional vectors
\[
X^{(\infty)}(t) = \operatorname{vec}\bigl(X(t-1), X(t-2), \ldots\bigr), \qquad X^{(p)}(t) = \operatorname{vec}\bigl(X(t-1), \ldots, X(t-p), 0, \ldots\bigr).
\]
With these definitions we have
\[
R_p = E\bigl(X^{(p)}(t)X^{(p)}(t)'\bigr), \quad r_p = E\bigl(X(t)X^{(p)}(t)'\bigr), \quad\text{and}\quad r = E\bigl(X(t)X^{(\infty)}(t)'\bigr).
\]
Further, for a sequence of d × d matrices $\Phi_h$, h ∈ N, we define the infinite dimensional vector φ = vec(Φ), where Φ denotes the d × ∞ matrix Φ = (Φ₁, Φ₂, …). Since d is fixed, a vector φ ∈ R^∞ uniquely specifies a sequence of matrices $\Phi_h$. Therefore we will use interchangeably the vector form, denoted by small letters φ, a, a(p,G), etc., and the matrix form, denoted by the corresponding capital letters Φ, A, A(p,G), etc.

The model selection of causal graphical models will be based upon a weighted form of the final prediction error, which was introduced by Akaike (1969, 1970) for the order selection in autoregressive models. The version defined below has been proposed by Nakano and Tagami (1987), again for the selection of the order of multivariate autoregressive models. The distance depends on a weighting matrix Q, which is supposed to be fixed. We further need the following assumption on Q.

Assumption 5.1.2 Q is a symmetric, positive definite, real-valued d× d matrix.

Suppose now that {Y(t)}, t ∈ Z is an independent realization of the process X(t). We can then assess the approximation of X(t) by a Gaussian autoregressive model with parameter vector φ by the weighted prediction error
\[
\begin{aligned}
\Delta(\phi) &= E_Y\bigl[\bigl(Y(t)-\Phi Y^{(p)}(t)\bigr)'Q\bigl(Y(t)-\Phi Y^{(p)}(t)\bigr)\bigr]\\
&= E_Y\operatorname{tr}\bigl[Q\bigl(Y(t)-\Phi Y^{(p)}(t)\bigr)\bigl(Y(t)-\Phi Y^{(p)}(t)\bigr)'\bigr]\\
&= \operatorname{tr}\bigl[Q(\Phi-A)R(\Phi-A)'\bigr] + \operatorname{tr}(Q\Sigma)\\
&= \|\phi-a\|_H^2 + \operatorname{tr}(Q\Sigma),
\end{aligned}
\]
where $H = (R'\otimes Q)$ and $E_Y$ denotes the expectation with respect to the process Y(t).

The independent realization is used as we are particularly interested in the distance between the true model and a model fitted from the data, in which case φ depends on a realization of the process X(t).
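The decomposition $\Delta(\phi) = \|\phi-a\|_H^2 + \operatorname{tr}(Q\Sigma)$ can be checked numerically; the small simulation below (illustrative only: the bivariate VAR(1), the sample size, and Q = Σ = 1_d are hypothetical choices) shows that the empirical weighted prediction error of the true coefficient approaches tr(QΣ), while any other coefficient incurs the additional bias term:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical bivariate VAR(1) with Sigma = 1_d and Q = 1_d
A1 = np.array([[0.5, 0.2], [0.0, 0.3]])
T, d = 50_000, 2
X = np.zeros((T, d))
for t in range(1, T):
    X[t] = A1 @ X[t - 1] + rng.standard_normal(d)

def pred_error(Phi):
    """Empirical weighted one-step prediction error with Q = 1_d."""
    E = X[1:] - X[:-1] @ Phi.T
    return np.mean(np.sum(E * E, axis=1))

print(pred_error(A1))                # close to tr(Q Sigma) = 2
print(pred_error(np.zeros((d, d))))  # larger by the bias term ||phi - a||_H^2
```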

We note that this model distance does not depend on the covariance matrix Σ in the fitted model. Therefore it cannot be used to select graphs which have an incomplete set of undirected edges. However, when concerned with prediction, the fitting of the instantaneous causality structure of the process seems to be of minor importance. One possibility for selecting the full structure of causal graphs might be a two step procedure, where in the first step the best subset of directed edges is selected according to the above model distance. Estimating the covariance matrix by the sample covariance matrix of


the residuals, an ordinary Gaussian graphical model can then be fitted to the residuals in the second step. A similar method has been proposed by Swanson and Granger (1997), who fit directed and undirected (ordinary) graphical models to the residuals in order to find an appropriate ordering of the variables for impulse response and forecast error variance decomposition analysis. We will not discuss this matter any further.

Next, let G be the set of all mixed graphs G = (V, E^d, E^u) with V = {1, …, d} such that the set of undirected edges E^u is complete, that is, E^u = {(i,j) ∈ V² | i ≠ j}. We will denote such complete sets of edges by $E^d_S$ and $E^u_S$, as they correspond to the saturated model with all possible edges included. In Example 2.2.3 we have seen that the coefficient matrices $\Phi_h$ of an autoregressive process of order p which has the causal graph $G = (V, E^d, E^u_S)$ are restricted by the following constraints:
\[
\Phi_{ji,h} = 0 \quad \forall\,(i,j)\notin E^d\ \forall\,h\in\mathbb{N} \qquad\text{and}\qquad \Phi_h = 0 \quad \forall\,h > p. \tag{5.1.3}
\]

We call an autoregressive model with constraints (5.1.3) on the parameters an AR(p,G) model.
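The zero pattern (5.1.3) is easy to materialize as a boolean mask; a minimal sketch (the function name and the 3-dimensional example are hypothetical), whose number of free entries equals p(|E^d| + d):

```python
import numpy as np

def arpg_mask(d, p, directed_edges):
    """Boolean mask of the free coefficients of an AR(p, G) model under
    (5.1.3): entry (j, i) of Phi_h may be nonzero only if i == j or the
    directed edge (i, j) is in E^d; all lags h > p are zero."""
    mask = np.zeros((p, d, d), dtype=bool)
    for h in range(p):
        for i in range(d):
            mask[h, i, i] = True            # own lagged values are always free
        for (i, j) in directed_edges:       # edge i --> j frees Phi_{ji,h}
            mask[h, j, i] = True
    return mask

# hypothetical example: d = 3 with directed edges 1 --> 2 and 3 --> 2 (0-based)
mask = arpg_mask(d=3, p=4, directed_edges=[(0, 1), (2, 1)])
print(int(mask.sum()))  # p * (|E^d| + d) = 4 * (2 + 3) = 20
```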

Defining the diagonal projection matrix $\pi_{p,G}$ with diagonal elements $(\pi_{p,G}(k))_{k\in\mathbb{N}}$ such that $\pi_{p,G}\bigl(d(dh-d+j-1)+i\bigr) = 1$ if h ≤ p and either (j,i) ∈ E^d or i = j, and zero elsewhere, it is now clear that the parameter vector φ of an AR(p,G) model lies in the subspace Θ(p,G) of the vector space
\[
\Theta = \bigl\{\phi \in \mathbb{R}^\infty \bigm| \|\phi\|_H < \infty\bigr\},
\]
which is given by the projection $\pi_{p,G}$. The dimension of this subspace Θ(p,G) is $k(p,G) = p(|E^d|+d)$. It then follows that the best approximation a(p,G) of the process X(t) under the constraints of an AR(p,G) model is given by the projection of a onto the subspace Θ(p,G), that is, a(p,G) is uniquely defined by the equations
\[
\pi_{p,G} H \pi_{p,G}\, a(p,G) = \pi_{p,G} H a \qquad\text{and}\qquad \bigl(1 - \pi_{p,G}\bigr) a(p,G) = 0. \tag{5.1.4}
\]

With the definitions
\[
H_{p,G} = (R_p' \otimes_G Q) = \pi_{p,G}(R' \otimes Q)\pi_{p,G} \qquad\text{and}\qquad \operatorname{vec}_G(QAR_p) = \pi_{p,G}\operatorname{vec}(QAR),
\]
we obtain from (5.1.4) for the best approximation
\[
a(p,G) = H^{-1}_{p,G}\operatorname{vec}_G\bigl(QAR_p\bigr) = H^{-1}_{p,G}\operatorname{vec}_G\bigl(Qr_p\bigr), \tag{5.1.5}
\]
where $H^{-1}_{p,G} = \pi_{p,G}(H_{p,G})^-\pi_{p,G}$ for any generalized inverse $(H_{p,G})^-$ of $H_{p,G}$.

It follows from (5.1.5) and Assumption 5.1.1 that a(p,G) is bounded with respect to the Euclidean vector norm uniformly in p ∈ N. However, for our analysis we need the coefficients to be absolutely summable.

Assumption 5.1.3 For any causal graph G ∈ G such that G₀ ⊆ G, the projections a(p,G) of the parameter vector a onto Θ(p,G) satisfy
\[
\limsup_{p\to\infty}\|a(p,G)\|_1 = \limsup_{p\to\infty}\|A(p,G)\|_1 < \infty.
\]


For the saturated graph $G_S$, which contains all possible edges, a multivariate generalization of Baxter's inequality (cf. Hannan and Deistler, 1988; Cheng and Pourahmadi, 1993) yields that the projections $a(p,G_S)$ are uniformly bounded with respect to ‖·‖₁. However, for finite predictors satisfying the restrictions due to a causal graph with missing edges, a similar result is not known.

Given observations X(1), …, X(T) from the process X(t), we can fit an AR(p,G) model, where G ∈ G and p is selected from a given range 1 ≤ p ≤ P_T with P_T < T, by minimizing the empirical model distance
\[
\Delta_T(\phi(p,G)) = \frac{1}{T_0}\sum_{t=P_T+1}^{T}\bigl(X(t)-\Phi(p,G)X^{(p)}(t)\bigr)'Q\bigl(X(t)-\Phi(p,G)X^{(p)}(t)\bigr)
\]
with respect to φ(p,G) ∈ Θ(p,G). Here $T_0 = T - P_T$. Defining the sample covariances by

\[
\hat r_{ij}(u,v) = \frac{1}{T_0}\sum_{t=P_T+1}^{T} X_i(t-u)X_j(t-v),
\]
with similar definitions for $\hat R(u)$, $\hat R_p$, and $\hat r_p$, we can rewrite the empirical model distance as
\[
\Delta_T(\phi(p,G)) = \operatorname{tr}\bigl[Q\hat R(0)\bigr] - 2\operatorname{tr}\bigl[Q\Phi(p,G)\hat r_p'\bigr] + \operatorname{tr}\bigl[Q\Phi(p,G)\hat R_p\Phi(p,G)'\bigr].
\]

Therefore the minimum distance estimate for a(p,G) is given by
\[
\hat a(p,G) = \hat H^{-1}_{p,G}\operatorname{vec}_G\bigl(Q\hat r_p\bigr). \tag{5.1.6}
\]

Further, the covariance matrix Σ can be estimated by
\[
\hat\Sigma(p,G) = \frac{1}{T_0}\sum_{t=P_T+1}^{T} \varepsilon_{\hat a(p,G)}(t)\,\varepsilon_{\hat a(p,G)}(t)',
\]
where $\varepsilon_{\hat a(p,G)}(t) = X(t) - \hat A(p,G)X^{(p)}(t)$ are the residuals in the fitted model. Having specified a model distance and a method for fitting the models to data, we

are now interested in the fitted model which approximates the true process X(t) best. Thus the aim is to find the AR(p,G) model with 1 ≤ p ≤ P_T and G ∈ G such that the distance between the fitted and the true model,
\[
\Delta(\hat a(p,G)) = \bigl\|\hat a(p,G)-a\bigr\|_H^2 + \operatorname{tr}(Q\Sigma),
\]
is minimized. A model selection $(\hat p_T,\hat G_T)$ which has this optimality property at least asymptotically is called asymptotically efficient.
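A minimal numerical sketch of the fitting step above (an illustration under simplifying assumptions, not the thesis' exact estimator: it takes Q = 1_d, so that the weighted criterion decouples into one least-squares regression per component, and it sums over t = p+1, …, T instead of t = P_T+1, …, T):

```python
import numpy as np

def fit_arpg(X, p, mask):
    """Least-squares fit of an AR(p, G) model. X is a T x d data array;
    mask is a p x d x d boolean array of free coefficients as in (5.1.3)."""
    T, d = X.shape
    T0 = T - p
    # row t of Z holds vec(X(t-1), ..., X(t-p))
    Z = np.hstack([X[p - h:T - h] for h in range(1, p + 1)])
    Y = X[p:]
    A = np.zeros((d, d * p))
    for j in range(d):
        # regress component j on its permitted lagged regressors only
        free = np.concatenate([mask[h, j, :] for h in range(p)])
        coef, *_ = np.linalg.lstsq(Z[:, free], Y[:, j], rcond=None)
        A[j, free] = coef
    resid = Y - Z @ A.T
    return A, resid.T @ resid / T0   # coefficients and innovation covariance
```

Coefficients $\Phi_{ji,h}$ with (i,j) ∉ E^d are exactly zero in the returned matrix, so components without incoming edges are regressed on their own past only.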

Definition 5.1.4 (Asymptotically efficient model selection) A selection of models $(\hat p_T,\hat G_T)_{T\in\mathbb{N}}$ with $1 \le \hat p_T \le P_T$ and $\hat G_T \in G$ is called asymptotically efficient if
\[
\frac{\bigl\|\hat a(\hat p_T,\hat G_T)-a\bigr\|_H^2}{\displaystyle\min_{1\le p\le P_T}\min_{G\in G}\bigl\|\hat a(p,G)-a\bigr\|_H^2} \xrightarrow{P} 1.
\]


In order to minimize the distance asymptotically, the maximal order P_T of the models obviously must diverge to infinity as T → ∞. In the next section, we show that the distance can be approximated by a deterministic function uniformly for all p ≤ P_T if P_T does not diverge too fast. Thus we need the following assumption.

Assumption 5.1.5 $\{P_T\}_{T\in\mathbb{N}}$ is an integer-valued sequence such that $P_T \to \infty$ and $P_T^2/T \to 0$ as T → ∞.

5.2 Asymptotic properties of the final prediction error

In this section, we investigate the asymptotic behaviour of the model distance $\Delta(\hat a(p,G))$ between a fitted and the true model and, as in the last chapter, derive an asymptotic lower bound for it. We first note that by the orthogonality of $\hat a(p,G)-a(p,G)$ and $a(p,G)-a$ we obtain the following decomposition of the model distance:
\[
\begin{aligned}
\Delta(\hat a(p,G)) &= \|\hat a(p,G)-a\|_H^2 + \operatorname{tr}(Q\Sigma)\\
&= \|\hat a(p,G)-a(p,G)\|_H^2 + \|a(p,G)-a\|_H^2 + \operatorname{tr}(Q\Sigma). \tag{5.2.1}
\end{aligned}
\]

Here, the first term measures the estimation error due to sampling variation, while the second term describes the bias due to the fitting of an incorrect model. The third term signifies the stochastic noise and is irrelevant for the minimization problem. Since the variance term is typically of order p/T (cf. Linhart and Zucchini, 1986) and the bias term decreases with an increasing number of parameters, the two terms must be balanced in order to minimize the model distance.

We consider now the first, stochastic term in (5.2.1). By equation (5.1.6) we find
\[
\hat a(p,G) - a(p,G) = \hat H^{-1}_{p,G}\bigl[\operatorname{vec}_G\bigl(Q\hat r_p\bigr) - \hat H_{p,G}\,a(p,G)\bigr] = \hat H^{-1}_{p,G}\operatorname{vec}_G\bigl(Q\hat r_p - QA(p,G)\hat R_p\bigr).
\]

Let $\varepsilon_{a(p,G)}(t) = X(t) - A(p,G)X^{(p)}(t)$ be the residuals when fitting the best approximating AR(p,G) model to the data. Defining
\[
v(p,G) = \operatorname{vec}_G\bigl(Q\hat r_p - QA(p,G)\hat R_p\bigr) = \operatorname{vec}_G\Bigl(\frac{1}{T_0}\sum_{t=P_T+1}^{T} Q\,\varepsilon_{a(p,G)}(t)\,X^{(p)}(t)'\Bigr),
\]
we therefore have

\[
\|\hat a(p,G)-a(p,G)\|_H^2 = \|v(p,G)\|^2_{\hat H^{-1}_{p,G}H_{p,G}\hat H^{-1}_{p,G}} = \|v(p,G)\|^2_{H^{-1}_{p,G}} + \|v(p,G)\|^2_{\Delta\hat H^{-1}_{p,G}}
\]
with $\Delta\hat H^{-1}_{p,G} = \hat H^{-1}_{p,G}H_{p,G}\hat H^{-1}_{p,G} - H^{-1}_{p,G}$. By Lemma 5.4.1 we find that

\[
\max_{1\le p\le P_T}\biggl|\frac{\|v(p,G)\|^2_{\Delta\hat H^{-1}_{p,G}}}{\|v(p,G)\|^2_{H^{-1}_{p,G}}}\biggr| \le \max_{1\le p\le P_T}\|H\|\,\bigl\|\hat H^{-1}_{p,G}-H^{-1}_{p,G}\bigr\|\bigl(1+\|\hat H^{-1}_{p,G}\|\,\|H\|\bigr) \tag{5.2.2}
\]


converges to zero in probability. Thus the second term is negligible compared with the first term.

In the next step, we replace the residuals εa(p,G)(t) in v(p,G) by the innovations ε(t).

Lemma 5.2.1 Suppose that Assumptions 5.1.1 to 5.1.3 and 5.1.5 hold. Then
\[
E\,\Bigl\|\operatorname{vec}_G\Bigl(\frac{1}{T_0}\sum_{t=P_T+1}^{T} Q\bigl(\varepsilon_{a(p,G)}(t)-\varepsilon(t)\bigr)X^{(p)}(t)'\Bigr)\Bigr\|^2 = O\Bigl(\frac{p}{T_0}\,\|a(p,G)-a\|^2\Bigr).
\]

Proof. The result is proved in Section 5.4.

Lemma 5.2.2 Suppose that Assumptions 5.1.1 to 5.1.3 and 5.1.5 hold. Then we have
\[
\begin{aligned}
&E\Bigl[T_0\Bigl\|\operatorname{vec}_G\Bigl(\frac{1}{T_0}\sum_{t=P_T+1}^{T} Q\,\varepsilon(t)X^{(p)}(t)'\Bigr)\Bigr\|^2_{H^{-1}_{p,G}} - \operatorname{tr}\bigl(B_{p,G}\bigr)\Bigr]^4\\
&\quad= 48\operatorname{tr}\bigl(B_{p,G}^4\bigr) + 12\bigl[\operatorname{tr}\bigl(B_{p,G}^2\bigr)\bigr]^2 + O\Bigl(\frac{p^5}{T_0}\Bigr),
\end{aligned} \qquad\text{(5.2.3)}
\]
where $B_{p,G} = (R_p' \otimes_G Q\Sigma Q)H^{-1}_{p,G}$.

Proof. The result is proved in Section 5.4.

With the definition
\[
v_0(p,G) = \operatorname{vec}_G\Bigl(\frac{1}{T_0}\sum_{t=P_T+1}^{T} Q\,\varepsilon(t)X^{(p)}(t)'\Bigr)
\]

we can now rewrite $\|\hat a(p,G)-a(p,G)\|_H^2$ in terms of $v_0(p,G)$ and $v(p,G)$ as follows:
\[
\begin{aligned}
\|\hat a(p,G)-a(p,G)\|_H^2 ={}& \|v_0(p,G)\|^2_{H^{-1}_{p,G}} + 2v_0(p,G)'H^{-1}_{p,G}\bigl(v(p,G)-v_0(p,G)\bigr)\\
&+ \|v_0(p,G)-v(p,G)\|^2_{H^{-1}_{p,G}} + \|v(p,G)\|^2_{\Delta\hat H^{-1}_{p,G}}.
\end{aligned} \qquad\text{(5.2.4)}
\]

If the graph G contains the true graph G₀, then $\|a(p,G)-a\|_H$ tends to zero as p → ∞. Then, according to Lemma 5.2.1, the second and the third term are negligible compared with the first one, and together with (5.2.2) and Lemma 5.2.2 we have
\[
\|\hat a(p_T,G)-a(p_T,G)\|_H^2 - \frac{1}{T_0}\operatorname{tr}\bigl(B_{p_T,G}\bigr) = o_P\Bigl(\frac{p_T}{T}\Bigr)
\]
for any sequence $p_T \in \{1,\ldots,P_T\}$ diverging to infinity. If $G_0 \nsubseteq G$, we still have
\[
\|\hat a(p,G)-a(p,G)\|_H^2 = O_P\Bigl(\frac{p}{T_0}\Bigr).
\]

Combining this result with the decomposition (5.2.1), we find that $\|\hat a(p,G)-a\|_H^2$ can be approximated by the deterministic function
\[
L_T(p,G) = \frac{1}{T_0}\operatorname{tr}\bigl[(R_p' \otimes_G Q\Sigma Q)H^{-1}_{p,G}\bigr] + \|a(p,G)-a\|_H^2.
\]


Here, the first term basically depends on the number of parameters: from the evaluation $\operatorname{tr}(R_p' \otimes_G Q\Sigma Q) = p\sum_{(i,j)\in E^d} r_{jj}(0)\,Q_i'\Sigma Q_i$ it follows that there exist constants c₁, c₂ > 0 such that
\[
c_1\,k(p,G) \le \operatorname{tr}(B_{p,G}) \le c_2\,k(p,G). \tag{5.2.5}
\]
In the important case where $Q\Sigma = \sigma^2 1_d$ we have equality with $c_1 = c_2 = \sigma^2$. The next result shows that the approximation holds uniformly in 1 ≤ p ≤ P_T. As in the last chapter, this result allows us to reformulate the definition of asymptotic efficiency in terms of the sequence $(p_T^*, G_T^*)$ which minimizes $L_T(p,G)$.

Definition 5.2.3 $(p_T^*, G_T^*)$ is the sequence of models which attains the minimum of $L_T(p,G)$,
\[
(p_T^*, G_T^*) = \operatorname*{argmin}_{1\le p\le P_T,\,G\in G} L_T(p,G),
\]
for all T ∈ N.

Under Assumption 5.1.5, $L_T(P_T, G_0)$ and therefore also $L_T(p_T^*, G_T^*)$ converge to zero as T → ∞. But the second term of $L_T(p,G)$ does not vanish if the approximating model is wrongly specified, which is the case for finite p or if G does not contain the true causal graph G₀. Thus it follows that $p_T^*$ diverges to infinity as T → ∞ and $G_0 \subseteq G_T^*$ for almost all T ∈ N.

Theorem 5.2.4 Suppose that Assumptions 5.1.1 to 5.1.3 and 5.1.5 hold. Then
\[
\max_{G\in G}\,\max_{1\le p\le P_T}\biggl|\frac{\|\hat a(p,G)-a\|_H^2}{L_T(p,G)} - 1\biggr| = o_P(1).
\]

Proof. Since the set of graphs G is finite, it is sufficient to prove the result for any fixed graph G. From (5.2.4) we get the identity
\[
\begin{aligned}
\|\hat a(p,G)-a\|_H^2 - L_T(p,G) ={}& \|\hat a(p,G)-a(p,G)\|_H^2 - \frac{1}{T_0}\operatorname{tr}\bigl(B_{p,G}\bigr)\\
={}& \|v_0(p,G)\|^2_{H^{-1}_{p,G}} - \frac{1}{T_0}\operatorname{tr}\bigl(B_{p,G}\bigr)\\
&+ 2v_0(p,G)'H^{-1}_{p,G}\bigl(v(p,G)-v_0(p,G)\bigr)\\
&+ \|v_0(p,G)-v(p,G)\|^2_{H^{-1}_{p,G}} + \|v(p,G)\|^2_{\Delta\hat H^{-1}_{p,G}}.
\end{aligned} \qquad\text{(5.2.6)}
\]

We prove the convergence for the four terms in (5.2.6) separately. For the first term we have with Lemma 5.2.2
\[
\begin{aligned}
E\biggl[\max_{1\le p\le P_T}\biggl|\frac{T_0\|v_0(p,G)\|^2_{H^{-1}_{p,G}} - \operatorname{tr}\bigl(B_{p,G}\bigr)}{T_0\,L_T(p,G)}\biggr|\biggr]^4 &\le \sum_{p=1}^{P_T} E\biggl[\frac{T_0\|v_0(p,G)\|^2_{H^{-1}_{p,G}} - \operatorname{tr}\bigl(B_{p,G}\bigr)}{T_0\,L_T(p,G)}\biggr]^4\\
&\le \sum_{p=1}^{p_T^*}\frac{C\,k(p,G)^2}{k(p_T^*,G_T^*)^4} + \sum_{p=p_T^*+1}^{P_T}\frac{C'}{k(p,G)^2} + O\Bigl(\frac{P_T^2}{T_0}\Bigr),
\end{aligned} \qquad\text{(5.2.7)}
\]


which converges to zero as T → ∞ since $p_T^*$ diverges to infinity. Further, the second and the third term in (5.2.6) are bounded by
\[
C\|v(p,G)-v_0(p,G)\|^2 + C'\|v_0(p,G)\|\,\|v(p,G)-v_0(p,G)\|.
\]

According to Lemma 5.2.1 we have
\[
\begin{aligned}
E\biggl[\max_{1\le p\le P_T}\biggl|\frac{\|v(p,G)-v_0(p,G)\|}{L_T(p,G)^{1/2}}\biggr|\biggr]^2 &\le \sum_{p=1}^{P_T} E\biggl[\frac{\|v(p,G)-v_0(p,G)\|^2}{L_T(p,G)}\biggr]\\
&\le \sum_{p=1}^{P_T}\frac{Cp}{T_0}\,\frac{\|a(p,G)-a\|^2}{L_T(p,G)} \le C\,\frac{P_T^2}{T_0},
\end{aligned}
\]

since H is positive definite and therefore $\|a(p,G)-a\|^2 \le L_T(p,G)/\|H\|_{\inf}$. Therefore $\|v(p,G)-v_0(p,G)\|^2/L_T(p,G)$ converges to zero in probability uniformly in 1 ≤ p ≤ P_T. The same also holds for the second term in (5.2.6), since it follows from (5.2.7) that

\[
\max_{1\le p\le P_T}\frac{\|v_0(p,G)\|}{L_T(p,G)^{1/2}} = O_P(1).
\]

For the last term in (5.2.6), we obtain from (5.2.2) the upper bound
\[
\|v(p,G)\|^2_{\Delta\hat H^{-1}_{p,G}} \le C\,\|v(p,G)\|^2_{H^{-1}_{p,G}}\,\|\hat H^{-1}_{p,G}\|\,\bigl\|\hat H^{-1}_{p,G}-H^{-1}_{p,G}\bigr\|.
\]

Noting that, because of $\|v(p,G)\|_{H^{-1}_{p,G}} \le \|v_0(p,G)\|_{H^{-1}_{p,G}} + \|v(p,G)-v_0(p,G)\|_{H^{-1}_{p,G}}$,
\[
\max_{1\le p\le P_T}\biggl|\frac{\|v(p,G)\|^2_{H^{-1}_{p,G}}}{L_T(p,G)}\biggr|
\]
is bounded in probability, the desired convergence then follows from Lemma 5.4.1.

Theorem 5.2.5 Suppose that Assumptions 5.1.1 to 5.1.3 and 5.1.5 hold. Then for any random variables $(\hat p_T,\hat G_T)$ such that $1 \le \hat p_T \le P_T$ and $\hat G_T \in G$ and for all ε > 0 we have
\[
\lim_{T\to\infty} P\biggl(\frac{\|\hat a(\hat p_T,\hat G_T)-a\|_H^2}{L_T(p_T^*,G_T^*)} \ge 1-\varepsilon\biggr) = 1.
\]
Furthermore, $(\hat p_T,\hat G_T)$ is an asymptotically efficient model selection if and only if
\[
\frac{\|\hat a(\hat p_T,\hat G_T)-a\|_H^2}{L_T(p_T^*,G_T^*)} \xrightarrow{P} 1.
\]

Proof. The first part follows directly from Theorem 5.2.4, since by definition of $(p_T^*, G_T^*)$
\[
\frac{\|\hat a(\hat p_T,\hat G_T)-a\|_H^2}{L_T(p_T^*,G_T^*)} \ge \frac{\|\hat a(\hat p_T,\hat G_T)-a\|_H^2}{L_T(\hat p_T,\hat G_T)}.
\]
The proof of the second part is the same as for Corollary 4.2.10.


5.3 Asymptotically efficient model selection

In Section 5.1 we formulated the model selection problem in terms of the model distance $\Delta(\hat a(p,G))$. In order to derive a model selection criterion which is optimal in the sense of Definition 5.1.4, we need to estimate $\Delta(\hat a(p,G))$, as $\Delta(\hat a(p,G))$ itself is a random variable which depends on the unknown distribution of the true process. Finding good estimators, however, is a main problem in the theory of model selection (e.g. Linhart and Zucchini, 1986; Shibata, 1997). The simplest approach is to restrict oneself to the estimation of the expected model distance $E\bigl(\Delta(\hat a(p,G))\bigr)$.

First, we define the covariance matrices
\[
S(p,G) = \frac{1}{T_0}\sum_{t=P_T+1}^{T} \varepsilon_{a(p,G)}(t)\varepsilon_{a(p,G)}(t)' \qquad\text{and}\qquad \Sigma(p,G) = E\bigl(\varepsilon_{a(p,G)}(t)\varepsilon_{a(p,G)}(t)'\bigr).
\]
From these definitions we can see immediately that the matrices S(p,G) and Σ(p,G) are related to $\hat\Sigma(p,G)$ and Σ, respectively, by

\[
\operatorname{tr}\bigl(QS(p,G)\bigr) = \|\hat a(p,G)-a(p,G)\|^2_{\hat H_{p,G}} + \operatorname{tr}\bigl(Q\hat\Sigma(p,G)\bigr), \tag{5.3.1}
\]
\[
\operatorname{tr}\bigl(Q\Sigma(p,G)\bigr) = \|a(p,G)-a\|^2_H + \operatorname{tr}(Q\Sigma). \tag{5.3.2}
\]

Then the model distances $\Delta(\hat a(p,G))$ and $\Delta_T(\hat a(p,G))$ can be rewritten as
\[
\begin{aligned}
\Delta(\hat a(p,G)) &= \operatorname{tr}\bigl(Q\Sigma(p,G)\bigr) + \|\hat a(p,G)-a(p,G)\|_H^2,\\
\Delta_T(\hat a(p,G)) &= \operatorname{tr}\bigl(QS(p,G)\bigr) - \|\hat a(p,G)-a(p,G)\|^2_{\hat H_{p,G}}\\
&\approx \operatorname{tr}\bigl(QS(p,G)\bigr) - \|\hat a(p,G)-a(p,G)\|^2_H.
\end{aligned}
\]

Taking expectations, we thus find
\[
\begin{aligned}
E\bigl(\Delta_T(\hat a(p,G))\bigr) &= E\bigl(\Delta(\hat a(p,G))\bigr) - 2E\bigl(\|\hat a(p,G)-a(p,G)\|_H^2\bigr)\\
&= E\bigl(\Delta(\hat a(p,G))\bigr) - \frac{2}{T_0}\operatorname{tr}\bigl[(R_p' \otimes_G Q\Sigma Q)H^{-1}_{p,G}\bigr],
\end{aligned}
\]

which leads to the following model selection criterion:
\[
C_T(p,G) = T_0\operatorname{tr}\bigl(Q\hat\Sigma(p,G)\bigr) + 2\operatorname{tr}\bigl[(\hat R_p' \otimes_G Q\hat\Sigma(p,G)Q)\hat H^{-1}_{p,G}\bigr]. \tag{5.3.3}
\]
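A hedged sketch of using such a criterion for selection (illustrative only: it treats the simplified case $Q\hat\Sigma(p,G) \approx \hat\sigma^2 1_d$, in which, by (5.2.5), the penalty term reduces to $2\hat\sigma^2 k(p,G)$, instead of evaluating the $\otimes_G$ trace in (5.3.3) exactly):

```python
import numpy as np

def criterion(Sigma_hat, Q, k, T0):
    """Simplified C_T(p, G): T0 * tr(Q Sigma_hat) + 2 * sigma2 * k, where
    sigma2 = tr(Q Sigma_hat) / d stands in for the sigma^2 of the case
    Q Sigma = sigma^2 * 1_d (an assumption of this sketch)."""
    sigma2 = np.trace(Q @ Sigma_hat) / Sigma_hat.shape[0]
    return T0 * np.trace(Q @ Sigma_hat) + 2.0 * sigma2 * k

# hypothetical fitted models: residual covariance and parameter count k(p, G)
candidates = {
    "p=2, full":   (np.diag([1.000, 1.000]), 2 * (2 + 2)),
    "p=2, sparse": (np.diag([1.000, 1.005]), 2 * (1 + 2)),
}
T0, Q = 500, np.eye(2)
best = min(candidates,
           key=lambda m: criterion(candidates[m][0], Q, candidates[m][1], T0))
print(best)  # "p=2, sparse": the tiny fit loss is outweighed by the penalty
```

The criterion thus trades the fit term $T_0\operatorname{tr}(Q\hat\Sigma(p,G))$ against a complexity penalty proportional to the number of free parameters, exactly the balance of variance and bias discussed in Section 5.2.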

In Section 5.3.1 we prove the asymptotic efficiency of the above criterion $C_T(p,G)$. In Section 5.3.2 we show that the criteria AIC and MFPE are also asymptotically efficient if the weighting matrix Q is chosen as the inverse of the covariance matrix Σ. As the covariance matrix is unknown, it must be estimated from the data. In Section 5.3.3, we therefore briefly discuss the parameter estimation if an estimate is used for Q.


5.3.1 Asymptotic efficiency of CT (p,G)

In this section, we consider the criterion function $C_T(p,G)$ defined by (5.3.3). Noting (5.3.1) and (5.3.2), we can rewrite $C_T(p,G)$ as
\[
\begin{aligned}
C_T(p,G) ={}& T_0\,L_T(p,G) + T_0\operatorname{tr}(Q\Sigma)\\
&+ T_0\operatorname{tr}\bigl[Q\bigl(S(p,G)-\Sigma(p,G)\bigr)\bigr]\\
&+ \Bigl\{\operatorname{tr}\bigl[(R_p' \otimes_G Q\Sigma Q)H^{-1}_{p,G}\bigr] - T_0\|\hat a(p,G)-a(p,G)\|^2_{\hat H_{p,G}}\Bigr\}\\
&+ 2\operatorname{tr}\bigl[\bigl(R_p' \otimes_G Q\bigl\{\hat\Sigma(p,G)-\Sigma\bigr\}Q\bigr)H^{-1}_{p,G}\bigr]\\
&+ 2\operatorname{tr}\bigl[\bigl((\hat R_p-R_p)' \otimes_G Q\hat\Sigma(p,G)Q\bigr)\hat H^{-1}_{p,G}\bigr]\\
&+ 2\operatorname{tr}\bigl[\bigl(R_p' \otimes_G Q\hat\Sigma(p,G)Q\bigr)\bigl(\hat H^{-1}_{p,G}-H^{-1}_{p,G}\bigr)\bigr].
\end{aligned} \qquad\text{(5.3.4)}
\]

In order to show that the sequence $(\hat p_T,\hat G_T)_{T\in\mathbb{N}}$ which minimizes $C_T(p,G)$ for every T ∈ N is asymptotically efficient, we have to show that, compared with the first term, all other terms are negligible.

Lemma 5.3.1 Under Assumptions 5.1.1 to 5.1.3 and 5.1.5 we have, for any symmetric infinite dimensional matrix M such that ‖M‖ < ∞,
\[
\max_{1\le p\le P_T}\frac{\|\hat a(p,G)-a(p,G)\|_M^2}{L_T(p,G)} = O_P(1).
\]

Proof. From the decomposition in (5.2.1) we get
\[
\frac{\|\hat a(p,G)-a(p,G)\|_H^2}{L_T(p,G)} \le \biggl|\frac{\|\hat a(p,G)-a\|_H^2}{L_T(p,G)} - 1\biggr| + \biggl|\frac{\operatorname{tr}\bigl[(R_p' \otimes_G Q\Sigma Q)H^{-1}_{p,G}\bigr]}{T_0\,L_T(p,G)}\biggr|.
\]
By the definition of $L_T(p,G)$ the second term is bounded, while the first term converges to zero in probability uniformly in 1 ≤ p ≤ P_T by Theorem 5.2.4. Further, since $\|\hat a(p,G)-a(p,G)\|_M^2 \le \|\hat a(p,G)-a(p,G)\|_H^2\,\|M\|/\|H\|_{\inf}$, the same result also holds with M substituted for H.

Lemma 5.3.2 Suppose that Assumptions 5.1.1 to 5.1.3 and 5.1.5 hold. Then for all G ∈ G:

(i) $\max_{1\le p\le P_T} \|\hat\Sigma(p,G)-S(p,G)\|\big/\sqrt{L_T(p,G)} = o_P(1)$ and $\max_{1\le p\le P_T} \|\hat\Sigma(p,G)-S(p,G)\|_2 = o_P(1)$;

(ii) $\max_{1\le p\le P_T} \|S(p,G)-\Sigma(p,G)\|\big/\sqrt{L_T(p,G)} = o_P(1)$ and $\max_{1\le p\le P_T} \|S(p,G)-\Sigma(p,G)\|_2 = o_P(1)$;

(iii) there exists a constant C independent of G such that
\[
\max_{1\le p\le P_T}\frac{\|\Sigma(p,G)-\Sigma\|}{L_T(p,G)} \le C.
\]

Proof. The result is proved in Section 5.4.


Lemma 5.3.3 Under Assumptions 5.1.1 to 5.1.3 and 5.1.5 we have
\[
\max_{1\le p\le P_T}\biggl|\frac{\operatorname{tr}\bigl[(R_p' \otimes_G Q\Sigma Q)H^{-1}_{p,G}\bigr] - T_0\|\hat a(p,G)-a(p,G)\|^2_{\hat H_{p,G}}}{T_0\,L_T(p,G)}\biggr| = o_P(1).
\]

Proof. Using the decomposition for $\|\hat a(p,G)-a\|_H^2$ in (5.2.1) and the identity $\|\hat a(p,G)-a(p,G)\|_{H_{p,G}} = \|\hat a(p,G)-a(p,G)\|_H$, the numerator in the lemma can be written as
\[
\begin{aligned}
&\operatorname{tr}\bigl[(R_p' \otimes_G Q\Sigma Q)H^{-1}_{p,G}\bigr] - T_0\|\hat a(p,G)-a(p,G)\|^2_{\hat H_{p,G}}\\
&\quad= \operatorname{tr}\bigl[(R_p' \otimes_G Q\Sigma Q)H^{-1}_{p,G}\bigr] + T_0\|a(p,G)-a\|_H^2\\
&\qquad- T_0\|\hat a(p,G)-a\|_H^2 - T_0\|\hat a(p,G)-a(p,G)\|^2_{\hat H_{p,G}-H_{p,G}}\\
&\quad= T_0\,L_T(p,G) - T_0\|\hat a(p,G)-a\|_H^2 - T_0\|\hat a(p,G)-a(p,G)\|^2_{\hat H_{p,G}-H_{p,G}}.
\end{aligned} \qquad\text{(5.3.5)}
\]

For the first two terms we get by Theorem 5.2.4
\[
\max_{1\le p\le P_T}\biggl|\frac{L_T(p,G) - \|\hat a(p,G)-a\|_H^2}{L_T(p,G)}\biggr| = o_P(1),
\]
while the last term is bounded by
\[
\|\hat a(p,G)-a(p,G)\|^2_{\hat H_{p,G}-H_{p,G}} \le \|H^{-1}_{p,G}\|\,\bigl\|\hat H_{p,G}-H_{p,G}\bigr\|\,\|\hat a(p,G)-a(p,G)\|_H^2.
\]

Here, the last factor is bounded in probability uniformly in 1 ≤ p ≤ P_T by Lemma 5.3.1. Thus the convergence of the last term in (5.3.5) follows from Lemma 5.4.1.

Lemma 5.3.4 Under Assumptions 5.1.1 to 5.1.3 and 5.1.5 we have for all G ∈ G
\[
\max_{1\le p\le P_T}\biggl|\frac{\operatorname{tr}\bigl[\bigl(R_p' \otimes_G Q\bigl(\hat\Sigma(p,G)-\Sigma\bigr)Q\bigr)H^{-1}_{p,G}\bigr]}{T_0\,L_T(p,G)}\biggr| = o_P(1).
\]

Proof. By Lemma B.2 we get for the numerator
\[
\bigl|\operatorname{tr}\bigl[\bigl(R_p' \otimes_G Q\bigl(\hat\Sigma(p,G)-\Sigma\bigr)Q\bigr)H^{-1}_{p,G}\bigr]\bigr| \le \bigl|\operatorname{tr}\bigl(R_p' \otimes_G 1_d\bigr)\bigr|\,\bigl\|Q\bigl(\hat\Sigma(p,G)-\Sigma\bigr)Q\bigr\|\,\|H^{-1}_{p,G}\| \le Cp\,\|\hat\Sigma(p,G)-\Sigma\|.
\]
The last factor can be split into three terms:
\[
\|\hat\Sigma(p,G)-\Sigma\| \le \|\hat\Sigma(p,G)-S(p,G)\| + \|S(p,G)-\Sigma(p,G)\| + \|\Sigma(p,G)-\Sigma\|.
\]

By Lemma 5.3.2 the first two terms converge to zero in probability uniformly in 1 ≤ p ≤ P_T, while for the last term the ratio $\|\Sigma(p,G)-\Sigma\|/L_T(p,G)$ is uniformly bounded. It now follows from $p/(T_0 L_T(p,G)) \le 1/c_1$ for all p ∈ N, where c₁ is given by (5.2.5), and from $P_T/T_0 \to 0$ that
\[
\max_{1\le p\le P_T}\biggl[\frac{p}{T_0}\,\frac{\|\hat\Sigma(p,G)-\Sigma\|}{L_T(p,G)}\biggr] = o_P(1),
\]
which proves the lemma.


Lemma 5.3.5 Under Assumptions 5.1.1 to 5.1.3 and 5.1.5 we have for all $G\in\mathcal{G}$
\[
\max_{1\le p\le P_T}\biggl|\frac{\operatorname{tr}\bigl[\bigl((\hat R_p-R_p)'\otimes_G Q\hat\Sigma(p,G)Q\bigr)H_{p,G}^{-1}\bigr]}{T_0\,L_T(p,G)}\biggr| = o_P(1) \tag{5.3.6}
\]
and
\[
\max_{1\le p\le P_T}\biggl|\frac{\operatorname{tr}\bigl[\bigl(R_p'\otimes_G Q\hat\Sigma(p,G)Q\bigr)\bigl(\hat H_{p,G}^{-1}-H_{p,G}^{-1}\bigr)\bigr]}{T_0\,L_T(p,G)}\biggr| = o_P(1). \tag{5.3.7}
\]

Proof. Simple multiplication shows that $A\otimes_G B = (A\otimes_G 1)(1\otimes_G B)$ for $dp\times dp$ matrices $A$ and $d\times d$ matrices $B$. Since $1_{dp}\otimes_G Q\hat\Sigma(p,G)Q$ is positive definite, we find accordingly
\[
\bigl|\operatorname{tr}\bigl[\bigl((\hat R_p-R_p)'\otimes_G Q\hat\Sigma(p,G)Q\bigr)H_{p,G}^{-1}\bigr]\bigr|
\le \|\hat R_p-R_p\|\,\|H_{p,G}^{-1}\|\operatorname{tr}\bigl[1_{dp}\otimes_G Q\hat\Sigma(p,G)Q\bigr]
\le C\,p\,\|\hat R_p-R_p\|\,\|\hat\Sigma(p,G)\|.
\]
From Lemma 5.3.2 we obtain for $\|\hat\Sigma(p,G)\|$
\[
\|\hat\Sigma(p,G)\| \le \|\Sigma(p,G)\| + \|\hat\Sigma(p,G)-\Sigma(p,G)\| = \|\Sigma(p,G)\| + o_P(1)
\]
uniformly in $p$. Further, $\|\Sigma(p,G)\|$ is uniformly bounded by
\[
\|\Sigma(p,G)\| \le C_1\|R(0)\| + C_2\|a(p,G)\|\,\|r_p\|_2 + C_3\|a(p,G)\|^2\|R_p\| \le C
\]
for some constant $C$. Thus $\|\hat\Sigma(p,G)\|$ is bounded in probability uniformly in $1\le p\le P_T$. Then (5.3.6) follows from Lemma 5.4.1 and the inequality $p/\bigl(T_0 L_T(p,G)\bigr) \le 1/c_1$.

For the numerator in (5.3.7) we similarly get the upper bound
\[
C\,p\,\|R_p\|\,\|\hat\Sigma(p,G)\|\,\|\hat H_{p,G}^{-1}-H_{p,G}^{-1}\|.
\]
Since $\|R_p\|$ is also uniformly bounded in $1\le p\le P_T$, (5.3.7) follows directly from Lemma 5.4.1.

As we can see from the proof of Lemma 5.3.2, the term $\operatorname{tr}\bigl[Q\bigl(\hat S(p,G)-\hat\Sigma(p,G)\bigr)\bigr]$ is of the same order as $L_T(p,G)$ if $G$ contains the true graph $G_0$, and therefore the corresponding ratio in (5.3.4) does not converge to zero as $T\to\infty$. For the minimization of $C_T(p,G)$, however, it is sufficient to consider differences $C_T(p,G)-C_T(p',G')$. Taking $(p',G') = (p_T^*,G_T^*)$ we then get the following lemma.

Lemma 5.3.6 Under Assumptions 5.1.1 to 5.1.3 and 5.1.5 we have for all $G\in\mathcal{G}$, with $(p_T^*,G_T^*)$ from Definition 5.2.3,
\[
\max_{1\le p\le P_T}\biggl|\frac{\operatorname{tr}\bigl[Q\bigl(\hat S(p_T^*,G_T^*)-\hat\Sigma(p_T^*,G_T^*)\bigr)\bigr]-\operatorname{tr}\bigl[Q\bigl(\hat S(p,G)-\hat\Sigma(p,G)\bigr)\bigr]}{L_T(p,G)}\biggr| = o_P(1).
\]


Proof. By the definition of $\hat S(p,G)$ and $\hat\Sigma(p,G)$, the numerator can be rewritten as
\begin{align*}
&2\operatorname{tr}\bigl[Q\bigl(\hat A(p_T^*,G_T^*)-\hat A(p,G)\bigr)\bigl(\hat r_{P_T}-r_{P_T}\bigr)'\bigr]\\
&\quad+\operatorname{tr}\bigl[Q\bigl(\hat A(p_T^*,G_T^*)-\hat A(p,G)\bigr)\bigl(\hat R_{P_T}-R_{P_T}\bigr)\bigl(\hat A(p_T^*,G_T^*)+\hat A(p,G)\bigr)'\bigr]. \tag{5.3.8}
\end{align*}
By Lemma 5.4.2, noting that $\|H\|_{\inf}>0$ and $L_T(p,G)\ge L_T(p_T^*,G_T^*)$, we get for the first term
\begin{align*}
&\mathbb{E}\Bigl[\max_{1\le p\le P_T}\Bigl|\frac{\operatorname{tr}\bigl[Q\bigl(\hat A(p_T^*,G_T^*)-\hat A(p,G)\bigr)(\hat r_{P_T}-r_{P_T})'\bigr]}{L_T(p,G)}\Bigr|\Bigr]^4\\
&\quad\le \sum_{p=1}^{P_T}\frac{16}{L_T(p,G)^4}\,\mathbb{E}\Bigl[\operatorname{tr}\bigl[Q\bigl(A(p,G)-A\bigr)(\hat r_{P_T}-r_{P_T})'\bigr]\Bigr]^4\\
&\quad\le \sum_{p=1}^{P_T}\Bigl[\frac{C\,\|a(p,G)-a\|^4_H}{T_0^2\,L_T(p,G)^4}+\frac{C'\,\|a(p,G)-a\|^2_H\,\|a(p,G)-a\|^2_1}{T_0^3\,L_T(p,G)^4}\Bigr]
\end{align*}
and, with $\|a(p,G)-a\|^2_1 \le k(p,G)\|a(p,G)-a\|^2+\|a\|^2_1 \le Cp$,
\[
\le \sum_{p=1}^{P_T}\Bigl[\frac{C}{T_0^2\,L_T(p,G)^2}+\frac{Cp}{T_0^3\,L_T(p,G)^3}\Bigr]
\le \sum_{p=1}^{p_T^*}\frac{C}{{p_T^*}^2}+\sum_{p=p_T^*+1}^{P_T}\frac{C}{p^2},
\]
which converges to zero since $p_T^*$ diverges to infinity. If $G_0\subseteq G$, the second term in (5.3.8) can be treated in the same way, only with different constants, since by Assumption 5.1.3 $\|a(p,G)+a(p_T^*,G_T^*)\|_1$ is uniformly bounded in $1\le p\le P_T$ as $T\to\infty$. If on the other hand $G_0\not\subseteq G$, $L_T(p,G)$ is bounded away from zero and the assertion follows from Lemma 5.3.2.

In Lemmas 5.3.3 to 5.3.6 we have proved that the difference
\[
\frac{C_T(p,G)-C_T(p_T^*,G_T^*)}{T_0\,L_T(p,G)} - \frac{L_T(p,G)-L_T(p_T^*,G_T^*)}{L_T(p,G)} \tag{5.3.9}
\]
converges to zero in probability uniformly in $1\le p\le P_T$. From this we obtain the main result of this section.

Theorem 5.3.7 Suppose that Assumptions 5.1.1 to 5.1.3 and 5.1.5 hold. Then
\[
\frac{\|\hat a(\hat p_T,\hat G_T)-a\|^2_H}{L_T(p_T^*,G_T^*)} \xrightarrow{P} 1,
\]
that is, $(\hat p_T,\hat G_T)$ is an asymptotically efficient selection of an AR$(p,G)$ model.


Proof. Since by the definition of $(\hat p_T,\hat G_T)$ and $(p_T^*,G_T^*)$ we have on the one hand $C_T(\hat p_T,\hat G_T)\le C_T(p_T^*,G_T^*)$ and on the other hand $L_T(\hat p_T,\hat G_T)\ge L_T(p_T^*,G_T^*)$, the uniform convergence of (5.3.9) to zero implies that for all $\varepsilon>0$
\[
\lim_{T\to\infty} P\Bigl(\frac{L_T(p_T^*,G_T^*)}{L_T(\hat p_T,\hat G_T)}\ge 1-\varepsilon\Bigr) = 1.
\]
The assertion of the theorem now follows from Theorem 5.2.5.

5.3.2 Other model selection criteria

The model distance $\Delta(\hat a(p,G))$, and thus the derived model selection criterion, depends on the choice of the weighting matrix $Q$. In this section we mainly discuss the case where $Q\Sigma = \sigma^2 1_d$. We further show that the AIC and the MFPE are asymptotically efficient with respect to the model distance $\Delta(\hat a(p,G))$ for this choice of $Q$.

Case I: Suppose that $Q\Sigma = \sigma^2 1_d$. Then
\[
\operatorname{tr}\bigl[(R_p'\otimes_G Q\Sigma Q)H^{-1}_{p,G}\bigr] = \sigma^2\operatorname{tr}\bigl[(R_p'\otimes_G Q)H^{-1}_{p,G}\bigr] = \sigma^2 k(p,G).
\]

Case II: Let $Q = 1_d$, and suppose further that the diagonal elements $\sigma_{ii}$ of the covariance matrix $\Sigma$ are of the form $\sigma_{ii} = \sigma^2$. By Lemma B.2 we then get
\[
\operatorname{tr}\bigl[(R_p'\otimes_G\Sigma)H^{-1}_{p,G}\bigr] = \operatorname{tr}\bigl[(1_{dp}\otimes_G\Sigma)(R_p'\otimes_G 1_d)H^{-1}_{p,G}\bigr] = \sigma^2 k(p,G).
\]

For both choices of $Q$ the function $L_T(p,G)$ becomes
\[
L_T(p,G) = \frac{k(p,G)\,\sigma^2}{T_0} + \|a(p,G)-a\|^2_H = \frac{k(p,G)\operatorname{tr}(Q\Sigma)}{T_0\,d} + \operatorname{tr}\bigl[Q\bigl(\Sigma(p,G)-\Sigma\bigr)\bigr].
\]

Estimating both $\operatorname{tr}(Q\Sigma)$ and $\operatorname{tr}\bigl[Q\Sigma(p,G)\bigr]$ by $\operatorname{tr}\bigl[Q\hat\Sigma(p,G)\bigr]$ and correcting for the bias, we then get the following version of the model selection criterion $C_T(p,G)$:
\[
C_T(p,G) = \Bigl(T_0+\frac{2k(p,G)}{d}\Bigr)\operatorname{tr}\bigl[Q\hat\Sigma(p,G)\bigr],
\]
which has the usual penalty term depending on the number of parameters used for fitting the model.

Rewriting $C_T(p,G)$ as in the last section, we obtain
\begin{align*}
C_T(p,G) &= T_0\,L_T(p,G) + T_0\operatorname{tr}(Q\Sigma) + T_0\operatorname{tr}\bigl[Q\bigl(\hat S(p,G)-\hat\Sigma(p,G)\bigr)\bigr]\\
&\quad + k(p,G)\,\sigma^2 - T_0\bigl\|\hat a(p,G)-a(p,G)\bigr\|^2_{\hat H_{p,G}} + \frac{2k(p,G)}{d}\operatorname{tr}\bigl[Q\bigl(\hat\Sigma(p,G)-\Sigma\bigr)\bigr].
\end{align*}
From this representation it now follows, in the same way as in the previous section, that minimizing $C_T(p,G)$ leads to an asymptotically efficient model selection $(\hat p_T,\hat G_T)$.


For the selection of the order of a multivariate autoregressive process, Nakano and Tagami (1987) proposed a similar criterion,
\[
\mathrm{FPE}_Q(p,G) = \Bigl(1+\frac{p|G|}{dT_0}\Bigr)\Bigl(1-\frac{p|G|}{dT_0}\Bigr)^{-1}\operatorname{tr}\bigl(Q\hat\Sigma(p,G)\bigr),
\]
which also depends on a weighting matrix $Q$. Rewriting the criterion as
\[
\Bigl(1+\frac{2k(p,G)}{dT_0}+\frac{2k(p,G)^2}{dT_0\bigl(dT_0-k(p,G)\bigr)}\Bigr)\operatorname{tr}\bigl(Q\hat\Sigma(p,G)\bigr)
\]
suggests that the criterion has the same properties as $C_T(p,G)$. To prove this, we consider more generally the criterion function
\[
C^{(\delta)}_T(p,G) = \Bigl(T_0+\delta_T(p,G)+\frac{2k(p,G)}{d}\Bigr)\operatorname{tr}\bigl[Q\hat\Sigma(p,G)\bigr].
\]
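The rewriting of $\mathrm{FPE}_Q$ above rests on the elementary algebraic identity $(1+x)/(1-x) = 1+2x+2x^2/(1-x)$ with $x = k(p,G)/(dT_0)$; a one-line numerical sanity check of this identity (the values are arbitrary, not from the thesis):

```python
# Check (1+x)/(1-x) == 1 + 2x + 2x^2/(1-x), used above with x = k(p,G)/(d*T0).
for k, d, T0 in [(4, 2, 100), (12, 3, 500), (1, 1, 50)]:
    x = k / (d * T0)
    lhs = (1 + x) / (1 - x)
    rhs = 1 + 2 * x + 2 * x**2 / (1 - x)
    assert abs(lhs - rhs) < 1e-12
print("identity verified")
```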

The next theorem states the asymptotic efficiency of the minimizing model sequence.

Theorem 5.3.8 Suppose that Assumptions 5.1.1, 5.1.3, and 5.1.5 hold and that additionally the conditions of either case I or case II are satisfied. Further let $\delta_T(p,G)$ be a random or deterministic function such that
\[
\max_{1\le p\le P_T}\Bigl|\frac{\delta_T(p,G)}{T_0}\Bigr| \xrightarrow{P} 0
\quad\text{and}\quad
\max_{1\le p\le P_T}\Bigl|\frac{\delta_T(p,G)-\delta_T(p_T^*,G_T^*)}{T_0\,L_T(p,G)}\Bigr| \xrightarrow{P} 0
\]
for all $G\in\mathcal{G}$. Then the model selection $(\hat p^{(\delta)}_T,\hat G^{(\delta)}_T)$ which minimizes $C^{(\delta)}_T(p,G)$ is also asymptotically efficient.

Proof. The proof is omitted as it is the same as for Theorem 4.2 in Shibata (1980).

Many other criteria have been suggested for model selection for multivariate time series (see e.g. Reinsel, 1993; Lütkepohl, 1985), most of which depend on the determinant of $\hat\Sigma(p,G)$. The most prominent criteria are:

MFPE (Akaike, 1971):
\[
\mathrm{MFPE}(p,G) = \Bigl(1+\frac{k(p,G)}{dT_0}\Bigr)^d\Bigl(1-\frac{k(p,G)}{dT_0}\Bigr)^{-d}\det\bigl(\hat\Sigma(p,G)\bigr)
\]

AIC (Akaike, 1973, 1974):
\[
\mathrm{AIC}(p,G) = \log\det\bigl(\hat\Sigma(p,G)\bigr)+\frac{2k(p,G)}{T}
\]

BIC (Schwarz, 1978; Rissanen, 1978):
\[
\mathrm{BIC}(p,G) = \log\det\bigl(\hat\Sigma(p,G)\bigr)+\frac{k(p,G)\log(T)}{T}
\]


In order to show the asymptotic efficiency of the MFPE and the AIC, we first consider the following criterion:
\[
\tilde C_T(p,G) = \bigl(T_0 d + 2k(p,G)\bigr)\bigl[\det\bigl(Q\hat\Sigma(p,G)\bigr)\bigr]^{\frac{1}{d}}. \tag{5.3.10}
\]
From Lemma B.3 it follows that $\det(\Sigma_1)$ is twice continuously differentiable as a function of $\Sigma_1$, by which we get the following Taylor expansion of $\det(Q\Sigma_1)^{1/d}$ about $\Sigma_0$ with $Q\Sigma_0 = \sigma^2 1_d$:
\begin{align*}
\det\bigl(Q\Sigma_1\bigr)^{\frac1d} &= \det\bigl(Q\Sigma_0\bigr)^{\frac1d} + \frac1d\det\bigl(Q\Sigma_0\bigr)^{\frac1d}\operatorname{tr}\bigl[\Sigma_0^{-1}(\Sigma_1-\Sigma_0)\bigr] + \rho(\Sigma_1,\Sigma_0)\\
&= \frac1d\operatorname{tr}\bigl(Q\Sigma_1\bigr) + \rho(\Sigma_1,\Sigma_0)
\end{align*}
with remainder term
\[
\rho(\Sigma_1,\Sigma_0) = \frac{1}{d^2}\det\bigl(Q\bar\Sigma\bigr)^{\frac1d}\bigl(\operatorname{tr}\bigl[\bar\Sigma^{-1}(\Sigma_1-\Sigma_0)\bigr]\bigr)^2 - \frac{1}{d}\det\bigl(Q\bar\Sigma\bigr)^{\frac1d}\operatorname{tr}\bigl[\bar\Sigma^{-1}(\Sigma_1-\Sigma_0)\bar\Sigma^{-1}(\Sigma_1-\Sigma_0)\bigr],
\]
where $\bar\Sigma = \Sigma_0 + \xi(\Sigma_1-\Sigma_0)$ for some $\xi\in[0,1]$. Since matrix inversion is continuous, $\bar\Sigma^{-1}$ is bounded in a neighbourhood of $\Sigma_0$. Thus there exist constants $C,\eta>0$ such that for all $\Sigma_1$ with $\|\Sigma_1-\Sigma_0\|\le\eta$
\[
\bigl|\rho(\Sigma_1,\Sigma_0)\bigr| \le C\,\|\Sigma_1-\Sigma_0\|^2. \tag{5.3.11}
\]
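The first-order expansion above, $\det(Q\Sigma_1)^{1/d} = \operatorname{tr}(Q\Sigma_1)/d + \rho(\Sigma_1,\Sigma_0)$ with $|\rho| = O(\|\Sigma_1-\Sigma_0\|^2)$, can be checked numerically; the matrices below are arbitrary illustrations, not from the thesis:

```python
import numpy as np

# Check det(Q @ S1)**(1/d) = tr(Q @ S1)/d + O(||S1 - S0||^2)
# around S0, where Q is chosen so that Q @ S0 = sigma^2 * I.
rng = np.random.default_rng(0)
d, sigma2 = 3, 2.0
A = rng.standard_normal((d, d))
S0 = A @ A.T + d * np.eye(d)        # a positive definite "covariance"
Q = sigma2 * np.linalg.inv(S0)      # ensures Q @ S0 = sigma^2 * I

for eps in [1e-1, 1e-2, 1e-3]:
    D = rng.standard_normal((d, d))
    D = eps * (D + D.T)             # small symmetric perturbation
    S1 = S0 + D
    lhs = np.linalg.det(Q @ S1) ** (1.0 / d)
    rhs = np.trace(Q @ S1) / d
    # the remainder shrinks quadratically in the perturbation size
    assert abs(lhs - rhs) <= 10 * np.linalg.norm(D) ** 2
print("expansion verified")
```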

Lemma 5.3.9 Suppose that Assumptions 5.1.1, 5.1.3, and 5.1.5 hold, and let $\tilde C_T$ be the criterion given by (5.3.10) with $Q\Sigma = \sigma^2 1_d$. If $(p_T,G_T)_{T\in\mathbb{N}}$ is a sequence such that $1\le p_T\le P_T$, $G_T\in\mathcal{G}$, and for all $p_0\in\mathbb{N}$
\[
\lim_{T\to\infty} P\bigl(p_T\le p_0\bigr) = 0 \tag{5.3.12}
\]
and
\[
\lim_{T\to\infty} P\bigl(G_0\not\subseteq G_T\bigr) = 0, \tag{5.3.13}
\]
then for all $\varepsilon>0$
\[
\lim_{T\to\infty} P\Bigl(\frac{\bigl|\tilde C_T(p_T,G_T)-C_T(p_T,G_T)\bigr|}{T_0\,L_T(p_T,G_T)}\ge\varepsilon\Bigr) = 0.
\]

Proof. It follows from (5.3.11) that for $\|\hat\Sigma(p,G)-\Sigma\|\le\eta$
\[
\frac{\bigl|\tilde C_T(p,G)-C_T(p,G)\bigr|}{T_0\,L_T(p,G)} = \frac{\bigl(T_0d+2k(p,G)\bigr)\bigl|\rho\bigl(\hat\Sigma(p,G),\Sigma\bigr)\bigr|}{T_0\,L_T(p,G)} \le \frac{C\,\|\hat\Sigma(p,G)-\Sigma\|^2}{L_T(p,G)},
\]
where
\[
\|\hat\Sigma(p,G)-\Sigma\|^2 \le 3\|\hat\Sigma(p,G)-\hat S(p,G)\|^2 + 3\|\hat S(p,G)-\Sigma(p,G)\|^2 + 3\|\Sigma(p,G)-\Sigma\|^2.
\]


For the first two terms we immediately get by Lemma 5.3.2
\[
\max_{1\le p\le P_T}\frac{\|\hat\Sigma(p,G)-\hat S(p,G)\|^2}{L_T(p,G)} = o_P(1) \tag{5.3.14}
\]
and
\[
\max_{1\le p\le P_T}\frac{\|\hat S(p,G)-\Sigma(p,G)\|^2}{L_T(p,G)} = o_P(1). \tag{5.3.15}
\]
Also by Lemma 5.3.2, together with (5.3.2), we get for the last term
\[
\frac{\|\Sigma(p,G)-\Sigma\|^2}{L_T(p,G)} \le \frac{C\,\|a(p,G)-a\|^4_H}{L_T(p,G)} \le C\,\|a(p,G)-a\|^2_H,
\]
which obviously does not converge to zero uniformly in $1\le p\le P_T$. However, if $G_0\subseteq G$ and $p\to\infty$, the term $\|a(p,G)-a\|^2_H$ converges to zero, and therefore for every $\varepsilon>0$ there exists $p_0\in\mathbb{N}$ such that
\[
\max_{p>p_0}\;\max_{G:\,G_0\subseteq G}\|a(p,G)-a\|^2_H < \varepsilon.
\]
We then get
\begin{align*}
P\bigl(\|a(p_T,G_T)-a\|^2_H\ge\varepsilon\bigr)
&\le P\bigl(\|a(p_T,G_T)-a\|^2_H\ge\varepsilon\,\bigm|\,p_T>p_0,\,G_0\subseteq G_T\bigr)\\
&\quad + P\bigl(p_T\le p_0\bigr) + P\bigl(G_0\not\subseteq G_T\bigr).
\end{align*}
The first term is zero, while the other two terms converge to zero as $T\to\infty$ by the assumptions on $p_T$ and $G_T$. Thus we have shown that for every $\varepsilon>0$
\[
\lim_{T\to\infty} P\Bigl(\frac{C\,\|\Sigma(p_T,G_T)-\Sigma\|^2}{L_T(p_T,G_T)}\ge\varepsilon\Bigr) = 0,
\]

which together with (5.3.14) and (5.3.15) implies that $\|\hat\Sigma(p_T,G_T)-\Sigma\|$ converges to zero in probability as $T\to\infty$. The assertion of the lemma now follows from
\begin{align*}
P\Bigl(\frac{\bigl|\tilde C_T(p_T,G_T)-C_T(p_T,G_T)\bigr|}{T_0\,L_T(p_T,G_T)}\ge\varepsilon\Bigr)
&\le P\Bigl(\frac{C\,\bigl|\rho\bigl(\hat\Sigma(p_T,G_T),\Sigma\bigr)\bigr|}{L_T(p_T,G_T)}\ge\varepsilon \,\wedge\, \|\hat\Sigma(p_T,G_T)-\Sigma\|\le\eta\Bigr) + P\bigl(\|\hat\Sigma(p_T,G_T)-\Sigma\|>\eta\bigr)\\
&\le P\Bigl(\frac{C\,\|\hat\Sigma(p_T,G_T)-\Sigma\|^2}{L_T(p_T,G_T)}\ge\varepsilon\Bigr) + P\bigl(\|\hat\Sigma(p_T,G_T)-\Sigma\|>\eta\bigr),
\end{align*}
where both terms converge to zero as $T\to\infty$ by the evaluations above.


Theorem 5.3.10 Suppose that Assumptions 5.1.1, 5.1.3, and 5.1.5 hold. Further let $Q\Sigma = \sigma^2 1_d$ and let $\tilde C_T$ be given by (5.3.10). Then the model selection
\[
(\hat p_T,\hat G_T) = \operatorname*{argmin}_{1\le p\le P_T,\,G\in\mathcal{G}} \tilde C_T(p,G)
\]
is asymptotically efficient.

Proof. We first show that $\hat p_T$ and $\hat G_T$ satisfy the conditions (5.3.12) and (5.3.13) in Lemma 5.3.9. For this we define
\[
C^{(as)}_T(p,G) = \bigl(T_0 d + 2k(p,G)\bigr)\det\bigl(Q\Sigma(p,G)\bigr)^{\frac1d}.
\]
According to Lemma 5.3.2, $\|\hat\Sigma(p,G)-\Sigma(p,G)\|$ converges to zero in probability uniformly in $1\le p\le P_T$. Thus, since $\tilde C_T(p,G)$ is continuous in $\hat\Sigma(p,G)$,
\[
\max_{1\le p\le P_T}\Bigl|\frac{\tilde C_T(p,G)-C^{(as)}_T(p,G)}{T_0\,d}\Bigr| = o_P(1). \tag{5.3.16}
\]
Consider now the sequence $(p^{(as)}_T,G^{(as)}_T)$ which minimizes $C^{(as)}_T(p,G)$. Noting that by the assumptions on the process $\Sigma(p,G)-\Sigma$ is positive definite for all $p\in\mathbb{N}$ and $G\in\mathcal{G}$, we get
\[
\frac{C^{(as)}_T(p,G)}{T_0\,d} > \det\bigl(Q\Sigma\bigr)^{\frac1d} = \sigma^2.
\]
Since $C^{(as)}_T(P_T,G_0)/(T_0 d)$ attains asymptotically the minimum $\sigma^2$, $C^{(as)}_T(p^{(as)}_T,G^{(as)}_T)/(T_0 d)$ also converges to $\sigma^2$, and accordingly $p^{(as)}_T$ diverges to infinity as $T\to\infty$. Furthermore, if $G$ does not contain the true graph $G_0$ we have $\lim_{p\to\infty}\det\bigl(Q\Sigma(p,G)\bigr)^{1/d} > \sigma^2$, which implies $G_0\subseteq G^{(as)}_T$ for almost all $T$. Since by (5.3.16) $\tilde C_T(p,G)$ converges to $C^{(as)}_T(p,G)$ uniformly in $p$ and $G$, it follows that $\hat p_T$ and $\hat G_T$ fulfill conditions (5.3.12) and (5.3.13).

As in the proof of Theorem 5.3.7, we now consider the term
\begin{align*}
&\Bigl|\frac{\tilde C_T(\hat p_T,\hat G_T)-\tilde C_T(p_T^*,G_T^*)}{T_0\,L_T(\hat p_T,\hat G_T)}-\frac{L_T(\hat p_T,\hat G_T)-L_T(p_T^*,G_T^*)}{L_T(\hat p_T,\hat G_T)}\Bigr|\\
&\quad\le \Bigl|\frac{\tilde C_T(\hat p_T,\hat G_T)-C_T(\hat p_T,\hat G_T)}{T_0\,L_T(\hat p_T,\hat G_T)}\Bigr| + \Bigl|\frac{\tilde C_T(p_T^*,G_T^*)-C_T(p_T^*,G_T^*)}{T_0\,L_T(p_T^*,G_T^*)}\Bigr|\\
&\qquad + \Bigl|\frac{C_T(\hat p_T,\hat G_T)-C_T(p_T^*,G_T^*)}{T_0\,L_T(\hat p_T,\hat G_T)}-\frac{L_T(\hat p_T,\hat G_T)-L_T(p_T^*,G_T^*)}{L_T(\hat p_T,\hat G_T)}\Bigr|.
\end{align*}
By Lemma 5.3.9 and (5.3.9) the terms on the right-hand side converge to zero in probability as $T\to\infty$. Since further $(\hat p_T,\hat G_T)$ minimizes $\tilde C_T(p,G)$ and $(p_T^*,G_T^*)$ minimizes $L_T(p,G)$, it follows that for all $\varepsilon>0$
\[
\lim_{T\to\infty} P\Bigl(\frac{L_T(p_T^*,G_T^*)}{L_T(\hat p_T,\hat G_T)}\ge 1-\varepsilon\Bigr) = 1.
\]
Together with Theorem 5.2.4 this proves the theorem.

The asymptotic efficiency of the AIC and the MFPE with respect to the model distance $\Delta(\hat a(p,G))$ with $Q = \Sigma^{-1}$ can now be derived by showing that small changes to $\tilde C_T(p,G)$, as in Theorem 5.3.8, do not destroy the asymptotic optimality.


5.3.3 Model selection with estimated Σ

As we have seen in the last section, the model selection based on the criteria AIC andMFPE is asymptotically efficient with respect to ∆(a(p,G)) if the weighting matrix Q ischosen as the inverse covariance matrix Σ−1. As the parameter estimation is based onthe minimization of the empirical model distance ∆T , it also depends on Q. Therefore,as the covariance matrix is in general unknown, we have to use an estimate for Q. Inthis section, we show that the concept of asymptotic efficiency transfers unchanged to themodified situation.

We first note that in the saturated model the estimation of the parameters is independent of $Q$, since
\[
\hat a(p,G_S) = \hat H^{-1}_{p,G_S}\operatorname{vec}\bigl(Q\hat r_p\bigr) = \bigl(\hat R_p^{-1}\otimes Q^{-1}\bigr)\bigl(1_{dp}\otimes Q\bigr)\operatorname{vec}(\hat r_p) = \bigl(\hat R_p^{-1}\otimes 1_d\bigr)\operatorname{vec}(\hat r_p) = \operatorname{vec}\bigl(\hat r_p\hat R_p^{-1}\bigr).
\]
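The last equality is the standard identity $\operatorname{vec}(ABC) = (C'\otimes A)\operatorname{vec}(B)$ applied with the symmetric matrix $\hat R_p$; a quick numerical sanity check (illustrative only, with arbitrary matrices):

```python
import numpy as np

# Check (R^{-1} (x) 1_d) vec(r) = vec(r R^{-1}) for symmetric R,
# where vec stacks columns (Fortran order in numpy).
rng = np.random.default_rng(2)
d, dp = 2, 6
M = rng.standard_normal((dp, dp))
R = M @ M.T + np.eye(dp)                 # symmetric positive definite
r = rng.standard_normal((d, dp))
Rinv = np.linalg.inv(R)
lhs = np.kron(Rinv, np.eye(d)) @ r.flatten(order="F")
rhs = (r @ Rinv).flatten(order="F")
assert np.allclose(lhs, rhs)
print("vec identity verified")
```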

This suggests the following procedure for estimating the parameters and selecting an optimal model.

Algorithm 5.3.11 (Model fitting and selection)

(a) Estimate the parameters in the AR$(P_T,G_S)$ model by $\hat a(P_T,G_S) = \operatorname{vec}\bigl(\hat r_{P_T}\hat R^{-1}_{P_T}\bigr)$.

(b) Estimate $\Sigma$ by $\hat\Sigma(P_T,G_S)$ and $Q$ by $\hat Q_T = \hat\Sigma(P_T,G_S)^{-1}$.

(c) Estimate the parameters in the model AR$(p,G)$ by minimizing
\[
\Delta^*_T(\varphi(p,G)) = \frac{1}{T_0}\sum_{t=P_T+1}^{T}\bigl(X(t)-\Phi(p,G)X^{(p)}(t)\bigr)'\hat Q_T\bigl(X(t)-\Phi(p,G)X^{(p)}(t)\bigr).
\]

(d) Select the best model by minimizing AIC$(p,G)$ or MFPE$(p,G)$.
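A schematic numerical illustration of steps (a)–(d), not thesis code: it treats only the saturated graph $G_S$, for which (as noted above) the weighted and unweighted least-squares estimates coincide, so the procedure reduces to ordinary least squares plus AIC over the order $p$; the helper `fit_var_ls` and the simulated VAR(1) path are hypothetical:

```python
import numpy as np

def fit_var_ls(X, p):
    """Least-squares fit of a saturated VAR(p); X has shape (T, d).
    Returns coefficients A of shape (d, d*p) and the residual covariance."""
    T, d = X.shape
    # regressor X^{(p)}(t) = (X(t-1)', ..., X(t-p)')'
    Z = np.hstack([X[p - lag - 1:T - lag - 1] for lag in range(p)])  # (T-p, d*p)
    Y = X[p:]                                                        # (T-p, d)
    A = np.linalg.lstsq(Z, Y, rcond=None)[0].T                       # (d, d*p)
    resid = Y - Z @ A.T
    sigma = resid.T @ resid / len(Y)
    return A, sigma

def aic(sigma, k, T):
    return np.log(np.linalg.det(sigma)) + 2 * k / T

# steps (a)/(b): saturated fit at the maximal order; Q_T = inverse covariance
rng = np.random.default_rng(1)
d, T = 2, 400
X = np.zeros((T, d))
for t in range(1, T):                     # simple VAR(1) sample path
    X[t] = np.array([[0.5, 0.1], [0.0, 0.4]]) @ X[t - 1] + rng.standard_normal(d)
_, sigma_sat = fit_var_ls(X, p=5)
Q_T = np.linalg.inv(sigma_sat)            # weighting matrix estimate

# steps (c)/(d): here only over the order p (saturated graph), selected by AIC
best_p = min(range(1, 6), key=lambda p: aic(fit_var_ls(X, p)[1], k=d * d * p, T=T))
print("selected order:", best_p)
```

With graph restrictions present, step (c) would instead minimize the $\hat Q_T$-weighted distance over the constrained coefficient set, which is where the estimated weighting matrix actually matters.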

Minimization of the modified empirical model distance $\Delta^*_T(\varphi(p,G))$ leads to the estimate $\hat a(p,G) = \tilde H^{-1}_{p,G}\operatorname{vec}_G\bigl(\hat Q_T\hat r_p\bigr)$, where $\tilde H_{p,G} = \hat R_p'\otimes_G\hat Q_T$. Thus we get
\[
\hat a(p,G) - a(p,G) = \tilde H^{-1}_{p,G}\operatorname{vec}_G\bigl(\hat Q_T\hat r_p - \hat Q_T A(p,G)\hat R_p\bigr).
\]
Following the arguments in Section 5.2, it is sufficient to show that
\begin{align*}
\|\hat a(p,G)-a(p,G)\|^2_H
&= \bigl\|\operatorname{vec}_G\bigl(\hat Q_T\hat r_p - \hat Q_T A(p,G)\hat R_p\bigr)\bigr\|^2_{\tilde H^{-1}_{p,G}H_{p,G}\tilde H^{-1}_{p,G}}\\
&\approx \bigl\|\operatorname{vec}_G\bigl(\hat Q_T\hat r_p - \hat Q_T A(p,G)\hat R_p\bigr)\bigr\|^2_{H^{-1}_{p,G}}\\
&\approx \bigl\|\operatorname{vec}_G\bigl(Q\hat r_p - QA(p,G)\hat R_p\bigr)\bigr\|^2_{H^{-1}_{p,G}}.
\end{align*}
Noting that by Lemma 5.3.2 and the inequality $\|\Sigma(P_T,G_S)-\Sigma\|\le\|a(P_T,G_S)-a\|^2_H$ we have
\[
\|\hat\Sigma(P_T,G_S)-\Sigma\| = o_P(1),
\]
we get, similarly as in Lemma 5.4.1, that
\[
\|\hat Q_T - Q\| = \|\hat\Sigma(P_T,G_S)^{-1}-\Sigma^{-1}\| = o_P(1). \tag{5.3.17}
\]
It then follows that Lemma 5.4.1 also holds for the modified $\tilde H_{p,G}$.


Theorem 5.3.12 Suppose that Assumptions 5.1.1, 5.1.3, and 5.1.5 hold and that $Q = \Sigma^{-1}$. Further let $\hat a(p,G)$ be the parameter estimate in Algorithm 5.3.11. Then we have
\[
\max_{G\in\mathcal{G}}\;\max_{1\le p\le P_T}\Bigl|\frac{\|\hat a(p,G)-a\|^2_H}{L_T(p,G)}-1\Bigr| = o_P(1).
\]

Proof. We follow the line of proof in Section 5.2. Defining
\[
\tilde v(p,G) = \operatorname{vec}_G\bigl(\hat Q_T\hat r_p - \hat Q_T A(p,G)\hat R_p\bigr),
\]
we obtain, according to the definition of $\operatorname{vec}_G$,
\[
\tilde v(p,G) - v(p,G) = \pi_{p,G}\bigl(1_{dp}\otimes(\hat Q_T-Q)Q^{-1}\bigr)\operatorname{vec}\bigl(Q\hat r_p - QA(p,G)\hat R_p\bigr).
\]
Therefore we can rewrite the stochastic part of $\Delta(\hat a(p,G))$ as
\begin{align*}
\|\hat a(p,G)-a(p,G)\|^2_H
&= \|\tilde v(p,G)\|^2_{H^{-1}_{p,G}} + \|\tilde v(p,G)\|^2_{\Delta H^{-1}_{p,G}}\\
&= \|v(p,G)\|^2_{H^{-1}_{p,G}} + \|v(p,G)\|^2_{I} + \|\tilde v(p,G)\|^2_{\Delta H^{-1}_{p,G}}, \tag{5.3.18}
\end{align*}
where $v(p,G) = \operatorname{vec}_G\bigl(Q\hat r_p - QA(p,G)\hat R_p\bigr)$ and
\[
I = \bigl(1_{dp}\otimes(\hat Q_T+Q)\bigr)\pi_G'\,H^{-1}_{p,G}\,\pi_G\bigl(1_{dp}\otimes(\hat Q_T-Q)\bigr).
\]
By the same arguments as in Section 5.2, the last term is, uniformly in $1\le p\le P_T$, negligible compared with the first two terms. For the second term we obtain the upper bound
\[
\|v(p,G)\|^2_{I} \le \|v(p,G)\|^2\,\|I\| \le C\,\|v(p,G)\|^2\,\|\hat Q_T-Q\|\,\|\hat Q_T\|.
\]
Defining the vector
\[
v_0(p,G) = \operatorname{vec}\Bigl(\frac{1}{T_0}\sum_{t=P_T+1}^{T} Q\varepsilon(t)X^{(p)}(t)'\Bigr),
\]
we further get
\[
\|v(p,G)\|^2 \le 2\|v_0(p,G)\|^2 + 2\|v(p,G)-v_0(p,G)\|^2. \tag{5.3.19}
\]
It now follows from Lemma 5.2.2, with $H_{p,G}$ replaced by $1_{d^2p}$, that $T_0\|v_0(p,G)\|^2$ has mean $\operatorname{tr}\bigl(R_p'\otimes Q\Sigma Q\bigr)$, and further, as in the proof of Theorem 5.2.4,
\[
\max_{1\le p\le P_T}\Bigl|\frac{T_0\|v_0(p,G)\|^2 - \operatorname{tr}\bigl(R_p'\otimes Q\Sigma Q\bigr)}{T_0\,L_T(p,G)}\Bigr| = o_P(1). \tag{5.3.20}
\]
This implies that $\|v_0(p,G)\|^2$ is bounded in probability uniformly in $1\le p\le P_T$.

For the second term in (5.3.19) we would like to apply Lemma 5.2.1. However, the components of $v(p,G)-v_0(p,G)$ are not set to zero corresponding to the restrictions due


to the graph $G$, and therefore the first term in (5.4.5) does not vanish. Still, we can bound this term by $C\|a(p,G)-a\|^2$, which gives
\[
\max_{1\le p\le P_T}\frac{\|v(p,G)-v_0(p,G)\|^2}{L_T(p,G)} = O_P(1).
\]
Together with (5.3.20) and (5.3.17) we therefore have
\[
\max_{1\le p\le P_T}\frac{\|v(p,G)\|^2_{I}}{L_T(p,G)} = o_P(1).
\]
Thus we have shown that $\|\hat a(p,G)-a(p,G)\|^2_H$ can be approximated by $\|v(p,G)\|^2_{H^{-1}_{p,G}}$ up to terms of smaller order. The assertion of the theorem then follows as in the proof of Theorem 5.2.4.

The results of Sections 5.3.1 and 5.3.2 remain valid only if the correct weighting matrix $Q = \Sigma^{-1}$ is used instead of an estimate. But as we have seen in the previous section, this can be done without actually knowing the weighting matrix, by using $\det\bigl(\hat\Sigma(p,G)\bigr)^{1/d}$ instead of $\operatorname{tr}\bigl(Q\hat\Sigma(p,G)\bigr)/d$. Therefore model selection based on either the MFPE or the AIC is still asymptotically efficient.

Corollary 5.3.13 Let $(\hat p_T,\hat G_T)$ be the sequence which minimizes the criterion AIC$(p,G)$ or MFPE$(p,G)$. Under the assumptions of Theorem 5.3.12, $(\hat p_T,\hat G_T)$ is asymptotically efficient.

Proof. The assertion follows from Theorems 5.3.12 and 5.3.10.

5.4 Proofs and auxiliary results

Lemma 5.4.1 Suppose that Assumptions 5.1.1 and 5.1.5 hold. Then we have the following convergence results:
\begin{align}
\max_{1\le p\le P_T}\|\hat R_p-R_p\| &= o_P(1), \tag{5.4.1}\\
\max_{1\le p\le P_T}\|\hat R_p^{-1}-R_p^{-1}\| &= o_P(1), \tag{5.4.2}\\
\max_{1\le p\le P_T}\|\hat H_{p,G}-H_{p,G}\| &= o_P(1), \tag{5.4.3}\\
\max_{1\le p\le P_T}\|\hat H_{p,G}^{-1}-H_{p,G}^{-1}\| &= o_P(1). \tag{5.4.4}
\end{align}

Proof. The first assertion of the lemma follows from
\[
\max_{1\le p\le P_T}\|\hat R_p-R_p\|^2 \le \max_{1\le p\le P_T}\|\hat R_p-R_p\|^2_2 = \|\hat R_{P_T}-R_{P_T}\|^2_2
\]


and
\begin{align*}
\mathbb{E}\bigl(\|\hat R_{P_T}-R_{P_T}\|^2_2\bigr)
&= \sum_{i,j=1}^{d}\sum_{u,v=1}^{P_T}\mathbb{E}\bigl(\hat r_{ij}(u,v)-r_{ij}(u-v)\bigr)^2\\
&= \frac{1}{T_0^2}\sum_{i,j=1}^{d}\sum_{u,v=1}^{P_T}\sum_{t,s=P_T+1}^{T}\bigl(r_{ii}(t-s)\,r_{jj}(t-s)+r_{ij}(t-s+u-v)\,r_{ji}(t-s-u+v)\bigr)
\le \frac{P_T^2}{T_0}\,C.
\end{align*}
The result for the inverses can be derived from (5.4.1) as in the proof of Lemma 3 in Berk (1974) by noting that
\[
\|\hat R_p^{-1}-R_p^{-1}\| = \|\hat R_p^{-1}(R_p-\hat R_p)R_p^{-1}\| \le \bigl(\|\hat R_p^{-1}-R_p^{-1}\|+\|R_p^{-1}\|\bigr)\|R_p^{-1}\|\,\|\hat R_p-R_p\|.
\]
Next, by the definition of $\hat H_{p,G}$ and $H_{p,G}$, we have
\[
\|\hat H_{p,G}-H_{p,G}\| \le \|(\hat R_p-R_p)\otimes_G Q\|_2 \le \|(\hat R_p-R_p)\otimes Q\|_2 \le \|\hat R_p-R_p\|_2\,\|Q\|_2.
\]
As we have proved (5.4.1) also for the norm $\|\cdot\|_2$, this implies (5.4.3). The last assertion of the lemma is proved in the same way as (5.4.2).

With the lemma and the triangle inequality it follows that
\[
\max_{1\le p\le P_T}\|\hat R_p\|,\quad \max_{1\le p\le P_T}\|\hat R_p^{-1}\|,\quad \max_{1\le p\le P_T}\|\hat H_{p,G}\|,\quad\text{and}\quad \max_{1\le p\le P_T}\|\hat H_{p,G}^{-1}\|
\]
are all bounded in probability.

Proof of Lemma 5.2.1. The proof is only notationally more complex than the proof of Lemma 3.1 in Shibata (1980). Putting $\alpha(p,G) = A(p,G)-A$ we have
\begin{align*}
&\mathbb{E}\Bigl\|\operatorname{vec}_G\Bigl(\frac{1}{T_0}\sum_{t=P_T+1}^{T} Q\bigl(\varepsilon_{a(p,G)}(t)-\varepsilon(t)\bigr)X^{(p)}(t)'\Bigr)\Bigr\|^2\\
&\quad\le \frac{1}{T_0^2}\sum_{t,s=P_T+1}^{T}\sum_{\substack{i,j=1\\(i,j)\in E_d}}^{d}\sum_{q=1}^{p}\mathbb{E}\bigl[Q_i\,\alpha(p,G)X^{(\infty)}(t)X_j(t-q)X_j(s-q)X^{(\infty)}(s)'\alpha(p,G)'Q_i'\bigr]\\
&\quad\le \frac{1}{T_0^2}\sum_{t,s=P_T+1}^{T}\sum_{\substack{i,j=1\\(i,j)\in E_d}}^{d}\sum_{q=1}^{p}\bigl(Q_i\alpha(p,G)R_{qd+j}\bigr)^2 + \bigl(Q_i\alpha(p,G)R^{(t,s)}\alpha(p,G)'Q_i'\bigr)\,r_{jj}(t-s)\\
&\qquad\qquad + \bigl(Q_i\alpha(p,G)R_{(t-s+q)d+j}\bigr)\bigl(Q_i\alpha(p,G)R_{(s-t+q)d+j}\bigr), \tag{5.4.5}
\end{align*}


where $M_k$ denotes the $k$-th column vector of a matrix $M$ and $R^{(t,s)}$ is the infinite-dimensional matrix with components $r_{ij}(t-u,s-v)$. For $(i,j)\in E_d$, it follows from (5.1.4) that
\[
Q_i\alpha(p,G)R_{qd+j} = \sum_{k,l=1}^{d}\sum_{u=1}^{\infty} Q_{ik}\bigl(a_{kl,u}(p,G)-a_{kl,u}\bigr)r_{lj}(q-u) = 0.
\]
For the second term we find
\[
Q_i\alpha(p,G)R^{(t,s)}\alpha(p,G)'Q_i' \le \|Q_i\alpha(p,G)\|^2\|R\| \le \|Q_i\|^2\|\alpha(p,G)\|^2\|R\|,
\]
while the third term is bounded by
\begin{align*}
\sum_{t,s=P_T+1}^{T}\sum_{j=1}^{d}\bigl(Q_i\alpha(p,G)R_{(t-s+q)d+j}\bigr)\bigl(Q_i\alpha(p,G)R_{(s-t+q)d+j}\bigr)
&\le T_0\sum_{t=P_T+1}^{T}\sum_{j=1}^{d}\bigl(Q_i\alpha(p,G)R_{(t-s+q)d+j}\bigr)^2\\
&\le \|Q_i\|^2\|\alpha(p,G)\|^2\|R\|^2.
\end{align*}
Thus we obtain as an upper bound for (5.4.5)
\[
\frac{p}{T_0}\,\|Q\|_2^2\,\|a(p,G)-a\|^2\,\|R\|\Bigl(\sum_{j=1}^{d}\sum_{u\in\mathbb{Z}}|r_{jj}(u)| + \|R\|\Bigr),
\]
which completes the proof.

Proof of Lemma 5.2.2. We first evaluate the mean
\[
\mathbb{E}\Bigl\|\operatorname{vec}_G\Bigl(\sum_{t=P_T+1}^{T} Q\varepsilon(t)X^{(p)}(t)'\Bigr)\Bigr\|^2_{H^{-1}_{p,G}}
= \sum_{t_1,t_2=P_T+1}^{T}\mathbb{E}\Bigl[\operatorname{tr}\Bigl(\bigl(X^{(p)}(t_2)X^{(p)}(t_1)'\otimes_G Q\varepsilon(t_1)\varepsilon(t_2)'Q\bigr)H^{-1}_{p,G}\Bigr)\Bigr]. \tag{5.4.6}
\]
The sum can be reduced to terms of the form
\[
\mathbb{E}\bigl[\varepsilon_{j_1}(t_1)X_{k_1}(t_1-u_1)\,\varepsilon_{j_2}(t_2)X_{k_2}(t_2-u_2)\bigr], \tag{5.4.7}
\]
which by the product theorem for cumulants and the normality assumption split into the sum of the products of moments of pairs. Since $\mathbb{E}\bigl(\varepsilon_j(t)X_k(s)\bigr)=0$ for all $t>s$, and either $t_1>t_2-u_2$ or $t_2>t_1-u_1$, (5.4.7) reduces to
\[
\mathbb{E}\bigl[\varepsilon_{j_1}(t_1)\varepsilon_{j_2}(t_2)\bigr]\,\mathbb{E}\bigl[X_{k_1}(t_1-u_1)X_{k_2}(t_2-u_2)\bigr] = \delta_{t_1t_2}\,\sigma_{j_1j_2}\,r_{k_1k_2}(u_1-u_2). \tag{5.4.8}
\]
Substituted into (5.4.6), we thus obtain
\[
\mathbb{E}\Bigl\|\operatorname{vec}_G\Bigl(\sum_{t=P_T+1}^{T} Q\varepsilon(t)X^{(p)}(t)'\Bigr)\Bigr\|^2_{H^{-1}_{p,G}} = T_0\operatorname{tr}(B_{p,G}). \tag{5.4.9}
\]


Next, we evaluate the higher moments
\[
\mathbb{E}\Bigl\|\operatorname{vec}_G\Bigl(\sum_{t=P_T+1}^{T} Q\varepsilon(t)X^{(p)}(t)'\Bigr)\Bigr\|^{2m}_{H^{-1}_{p,G}}
= \sum_{t_1,\dots,t_{2m}=P_T+1}^{T}\mathbb{E}\Bigl[\prod_{n=1}^{m}\operatorname{tr}\Bigl(\bigl(X^{(p)}(t_{n,2})X^{(p)}(t_{n,1})'\otimes_G Q\varepsilon(t_{n,1})\varepsilon(t_{n,2})'Q\bigr)H^{-1}_{p,G}\Bigr)\Bigr] \tag{5.4.10}
\]
for $2m = 4, 6, 8$. For this, let $\sum_{\pi_1,\dots,\pi_n}$ denote the sum over all partitions $\{\pi_1,\dots,\pi_n\}$ of $\{1,\dots,2m\}$ with $n_r=|\pi_r|\ge 2$. If $\pi_r=\{\pi_{r,1},\dots,\pi_{r,n_r}\}$ we set $i_{r,\nu}=i_{\pi_{r,\nu}}$ for $1\le\nu\le n_r$, with analogous definitions for the other indices. For the partition of $\{1,\dots,2m\}$ into the pairs $\{2n-1,2n\}$ this convention reads $i_{n,1}=i_{2n-1}$ and $i_{n,2}=i_{2n}$ for $1\le n\le m$; that is, the sequence $i_{1,1},\dots,i_{m,2}$ then corresponds to the non-permuted sequence $i_1,\dots,i_{2m}$ of the indices. Further let $\sum_{i.p.(\pi_r)}$ denote the sum over all indecomposable partitions $\{Y_1,\dots,Y_{n_r}\}$ with $Y_s=\{Y_{s,1},Y_{s,2}\}$ of the table
\[
\begin{matrix}
\varepsilon_{j_{r,1}}(t_{r,1}) & X_{k_{r,1}}(t_{r,1}-u_{r,1})\\
\vdots & \vdots\\
\varepsilon_{j_{r,n_r}}(t_{r,n_r}) & X_{k_{r,n_r}}(t_{r,n_r}-u_{r,n_r}).
\end{matrix}
\]
Then, by the product theorem for cumulants and the normality assumption, and noting that $\mathbb{E}\bigl(\varepsilon_j(t)X_k(t-u)\bigr)=0$, we have
\[
\sum_{t_1,\dots,t_{2m}=P_T+1}^{T}\mathbb{E}\Bigl[\prod_{n=1}^{2m}\varepsilon_{j_n}(t_n)X_{k_n}(t_n-u_n)\Bigr]
= \sum_{n=1}^{m}\sum_{\pi_1,\dots,\pi_n}\prod_{r=1}^{n}\sum_{i.p.(\pi_r)}\sum_{t_{r,1},\dots,t_{r,n_r}}\prod_{s=1}^{n_r}\mathbb{E}\bigl(Y_{s,1}Y_{s,2}\bigr)
\]
and further, using (5.4.8) for all partitions $\pi_r$ with $n_r=2$,
\[
= T_0^m\sum_{P_1,\dots,P_m}\prod_{n=1}^{m}\sigma_{j_{n,1}j_{n,2}}\,r_{k_{n,1}k_{n,2}}(u_{n,1}-u_{n,2}) + R_T, \tag{5.4.11}
\]
where $\sum_{P_1,\dots,P_m}$ runs over all partitions of $\{1,\dots,2m\}$ into pairs $P_1,\dots,P_m$ and the remainder term is of the form
\[
R_T = \sum_{n=1}^{m-1}\sum_{\pi_1,\dots,\pi_n}\prod_{n_r=2} T_0\,\sigma_{j_{r,1}j_{r,2}}\,r_{k_{r,1}k_{r,2}}(u_{r,1}-u_{r,2})\prod_{n_r>2}\sum_{i.p.(\pi_r)}\sum_{t_{r,1},\dots,t_{r,n_r}}\prod_{s=1}^{n_r}\mathbb{E}\bigl(Y_{s,1}Y_{s,2}\bigr).
\]
For a fixed set $\pi_r$ we put $t^*_{s,v}=t_{r,\nu}$ if $Y_{s,v}=\varepsilon_{j_{r,\nu}}(t_{r,\nu})$ or $Y_{s,v}=X_{k_{r,\nu}}(t_{r,\nu}-u_{r,\nu})$, for $1\le s\le n_r$ and $v\in\{1,2\}$. The terms $\mathbb{E}(Y_{s,1}Y_{s,2})$ can take the form
\begin{align*}
\mathbb{E}\bigl(\varepsilon_{j_1}(t_1)\varepsilon_{j_2}(t_2)\bigr) &= \delta_{t_1t_2}\,\sigma_{j_1j_2},\\
\mathbb{E}\bigl(X_{k_1}(t_1-u_1)X_{k_2}(t_2-u_2)\bigr) &= r_{k_1k_2}(t_2-t_1+u_1-u_2),\\
\mathbb{E}\bigl(\varepsilon_{j_1}(t_1)X_{k_2}(t_2-u_2)\bigr) &= \sum_{l=1}^{d} b_{k_2l,\,t_2-t_1-u_2}\,\sigma_{lj_1},
\end{align*}


where $B_u=(b_{ij,u})_{ij}$ are the coefficient matrices of the moving average representation in (5.1.2), with the extension $B_u=0$ for $u<0$. Since in each case the right-hand side is absolutely summable in any of $t_1$ or $t_2$, there exists a constant $C$, independent of $j_n$, $k_n$, $t_n$, and $u_n$ for $1\le n\le 2m$, such that
\[
\sum_{t^*_{s,v}=P_T+1}^{T}\bigl|\mathbb{E}(Y_{s,1}Y_{s,2})\bigr| \le C
\]
for $v\in\{1,2\}$ and uniformly in $T$. Further, because of the indecomposability of the partition $\{Y_1,\dots,Y_{n_r}\}$, there exists an ordering $Y_{s_1},\dots,Y_{s_{n_r}}$ such that $t^*_{s_\nu,2}=t^*_{s_{\nu+1},1}$ for $1\le\nu<n_r$. Therefore
\begin{align*}
\sum_{i.p.(\pi_r)}\sum_{t_{r,1},\dots,t_{r,n_r}}\prod_{\nu=1}^{n_r}\bigl|\mathbb{E}(Y_{s_\nu,1}Y_{s_\nu,2})\bigr|
&\le \sum_{i.p.(\pi_r)}\sum_{t^*_{s_1,2},\dots,t^*_{s_{n_r},2}} C\prod_{\nu=2}^{n_r}\bigl|\mathbb{E}(Y_{s_\nu,1}Y_{s_\nu,2})\bigr|\\
&\le \sum_{i.p.(\pi_r)}\sum_{t^*_{s_2,2},\dots,t^*_{s_{n_r},2}} C^2\prod_{\nu=3}^{n_r}\bigl|\mathbb{E}(Y_{s_\nu,1}Y_{s_\nu,2})\bigr|
\le \sum_{i.p.(\pi_r)} C^{n_r}\,T_0. \tag{5.4.12}
\end{align*}

Thus the remainder term in (5.4.11) is of order $O(T_0^{m-1})$, and accordingly the main term of (5.4.10) is equal to
\begin{align*}
&T_0^m\sum_{P_1,\dots,P_m}\sum^{*}_{i_n,k_n,u_n}\prod_{n=1}^{m} Q_{i_{n,1}}'\Sigma Q_{i_{n,2}}\,r_{k_{n,1}k_{n,2}}(u_{n,1}-u_{n,2})\,H^{-1}_{i_{n,1}k_{n,1},\,i_{n,2}k_{n,2}}(u_{n,1}-u_{n,2})\\
&\quad= T_0^m\sum_{P_1,\dots,P_m}\sum^{*}_{i_n,k_n,u_n}\prod_{n=1}^{m} D_{i_{n,1}k_{n,1},\,i_{n,2}k_{n,2}}(u_{n,1}-u_{n,2})\,H^{-1}_{i_{n,1}k_{n,1},\,i_{n,2}k_{n,2}}(u_{n,1}-u_{n,2}),
\end{align*}
where $D = R'\otimes_G Q\Sigma Q$ and $\sum^{*}_{i_n,k_n,u_n}$ abbreviates the sum over the indices $i_n$, $k_n$, and $u_n$ such that $(i_n,k_n)\in E_d$ for $1\le n\le 2m$. For $m=2$ this, together with the evaluations for the remainder term in (5.4.11) and $\|H^{-1}_{p,G}\|_1\le Cp^{3/2}$, leads to
\[
\mathbb{E}\Bigl\|\operatorname{vec}_G\Bigl(\sum_{t=P_T+1}^{T}Q\varepsilon(t)X^{(p)}(t)'\Bigr)\Bigr\|^{4}_{H^{-1}_{p,G}} = T_0^2\Bigl[\bigl(\operatorname{tr}(B_{p,G})\bigr)^2 + 2\operatorname{tr}(B_{p,G}^2)\Bigr] + O(T_0\,p^3). \tag{5.4.13}
\]
If $m=3$, any partition of length $2$ consists of sets $\pi_1$ and $\pi_2$ such that either $n_1=2$ and $n_2=4$ or $n_1=n_2=3$. In the former case the corresponding terms are equal to
\[
O(T_0^2)\sum^{*}_{i_n,k_n,u_n} D_{i_{1,1}k_{1,1},\,i_{1,2}k_{1,2}}(u_{1,1}-u_{1,2})\prod_{n=1}^{3} H^{-1}_{i_{n,1}k_{n,1},\,i_{n,2}k_{n,2}}(u_{n,1}-u_{n,2}) = O(T_0^2\,p^4)
\]
by (5.4.8), (5.4.12), and
\[
\operatorname{tr}\bigl(D_{p,G}H^{-1}_{p,G}\bigr) \le Cp \quad\text{and}\quad \|H^{-1}_{p,G}D_{p,G}H^{-1}_{p,G}\|_1 \le Cp^{3/2}. \tag{5.4.14}
\]


Defining
\[
\Psi = \sum_{\substack{\pi_1,\pi_2\\ n_r=3}}\sum^{*}_{i_n,k_n,u_n}\Bigl(\prod_{r=1}^{2}\sum_{i.p.(\pi_r)}\sum_{t_{r,1},\dots,t_{r,3}}\prod_{s=1}^{3}\mathbb{E}\bigl(Y_{s,1}Y_{s,2}\bigr)\Bigr)\prod_{n=1}^{3} H^{-1}_{i_{n,1}k_{n,1},\,i_{n,2}k_{n,2}}(u_{n,1}-u_{n,2}),
\]
we then have
\[
\mathbb{E}\Bigl\|\operatorname{vec}_G\Bigl(\sum_{t=P_T+1}^{T}Q\varepsilon(t)X^{(p)}(t)'\Bigr)\Bigr\|^{6}_{H^{-1}_{p,G}} = T_0^3\Bigl[\bigl(\operatorname{tr}(B_{p,G})\bigr)^3 + 6\operatorname{tr}(B_{p,G}^2)\operatorname{tr}(B_{p,G}) + 8\operatorname{tr}(B_{p,G}^3)\Bigr] + \Psi + O(T_0^2\,p^4). \tag{5.4.15}
\]

Finally, in the case $m=4$ we first consider the partitions of length $3$ such that $n_1=2$ and $n_2=n_3=3$:
\[
\sum^{*}_{i_n,k_n,u_n}\Bigl[\sum_{\pi_1:\,n_1=2} D_{i_{1,1}k_{1,1},\,i_{1,2}k_{1,2}}(u_{1,1}-u_{1,2})\Bigr]\Bigl[\sum_{\substack{\pi_2,\pi_3\\ n_2,n_3=3}}\prod_{r=2}^{3}\sum_{i.p.(\pi_r)}\sum_{t_{r,1},\dots,t_{r,3}}\prod_{s=1}^{3}\mathbb{E}\bigl(Y_{s,1}Y_{s,2}\bigr)\Bigr]\cdot\prod_{n=1}^{4} H^{-1}_{i_{n,1}k_{n,1},\,i_{n,2}k_{n,2}}(u_{n,1}-u_{n,2}).
\]
If $\pi_1$ is of the form $\{2\nu-1,2\nu\}$ for some $\nu\in\{1,\dots,4\}$, these terms together equal $4\,T_0\operatorname{tr}(B_{p,G})\,\Psi$, while otherwise, by the second inequality in (5.4.14), the term is only of order $O(T_0^3\,p^{9/2})$. Similarly we find that the remaining terms are at most of order $O(T_0^3\,p^5)$. Thus we have
\begin{align*}
\mathbb{E}\Bigl\|\operatorname{vec}_G\Bigl(\sum_{t=P_T+1}^{T}Q\varepsilon(t)X^{(p)}(t)'\Bigr)\Bigr\|^{8}_{H^{-1}_{p,G}}
&= T_0^4\Bigl[\bigl(\operatorname{tr}(B_{p,G})\bigr)^4 + 12\operatorname{tr}(B_{p,G}^2)\bigl(\operatorname{tr}(B_{p,G})\bigr)^2 + 12\bigl(\operatorname{tr}(B_{p,G}^2)\bigr)^2\\
&\qquad + 32\operatorname{tr}(B_{p,G}^3)\operatorname{tr}(B_{p,G}) + 48\operatorname{tr}(B_{p,G}^4)\Bigr] + 4\,T_0\operatorname{tr}(B_{p,G})\,\Psi + O(T_0^3\,p^5). \tag{5.4.16}
\end{align*}
Substituting (5.4.9), (5.4.13), (5.4.15), and (5.4.16) into (5.2.3), the terms containing $\Psi$ and all terms of order $O(p^3)$ or greater cancel, and we thus obtain the desired result.

We next prove a multivariate version of Lemma 4.2 in Shibata (1980).

Lemma 5.4.2 Suppose that Assumptions 5.1.1 and 5.1.5 hold. Then there exist constants $C_1$, $C_2$, $C_3$ such that for any vectors $\alpha,\beta\in\mathbb{R}^{dP_T+d}$
\[
\mathbb{E}\Bigl[\sum_{i,j=1}^{d}\sum_{u,v=0}^{P_T}\alpha_i(u)\beta_j(v)\bigl(\hat r_{ij}(u,v)-r_{ij}(u-v)\bigr)\Bigr]^2 \le \frac{C_1}{T_0}\,\|\alpha\|^2\|\beta\|_1^2 \tag{5.4.17}
\]
and
\[
\mathbb{E}\Bigl[\sum_{i,j=1}^{d}\sum_{u,v=0}^{P_T}\alpha_i(u)\beta_j(v)\bigl(\hat r_{ij}(u,v)-r_{ij}(u-v)\bigr)\Bigr]^4 \le \frac{C_2}{T_0^2}\,\|\alpha\|^4\|\beta\|_1^4 + \frac{C_3}{T_0^3}\,\|\alpha\|^2\|\alpha\|_1^2\|\beta\|_1^4. \tag{5.4.18}
\]


Proof. Noting that $\hat r_{ij}(u,v)$ is an unbiased estimate of $r_{ij}(u-v)$, we get by the product theorem for cumulants
\begin{align*}
\operatorname{cum}\Bigl\{\prod_{k=1}^{2}\bigl[\hat r_{i_kj_k}(u_k,v_k)-r_{i_kj_k}(u_k-v_k)\bigr]\Bigr\}
&= \operatorname{cum}\bigl\{\hat r_{i_1j_1}(u_1,v_1),\hat r_{i_2j_2}(u_2,v_2)\bigr\}\\
&= \frac{1}{T_0^2}\sum_{t_1,t_2=P_T+1}^{T}\bigl[r_{i_1i_2}(t_2-t_1+u_2-u_1)\,r_{j_1j_2}(t_2-t_1+v_1-v_2)\\
&\qquad\qquad + r_{i_1j_2}(t_2-t_1+v_2-u_1)\,r_{j_1i_2}(t_2-t_1+u_2-v_1)\bigr].
\end{align*}
Putting $\alpha(u)=(\alpha_1(u),\dots,\alpha_d(u))'$ and analogously for $\beta(v)$, the first part of the lemma then follows from the evaluations
\[
\Bigl|\sum_{u_1,u_2=0}^{P_T}\alpha(u_1)'R(t+u_1-u_2)\alpha(u_2)\Bigr| \le \|\alpha\|^2\|R\|, \tag{5.4.19}
\]
\[
\sum_{t=P_T+1}^{T}\sum_{v_1,v_2=0}^{P_T}\bigl|\beta(v_1)'R(t+v_1-v_2)\beta(v_2)\bigr| \le \|\beta\|_1^2\sum_{t\in\mathbb{Z}}\|R(t)\|_1, \tag{5.4.20}
\]
and
\begin{align*}
&\sum_{t=P_T+1}^{T}\sum_{u_1,u_2=0}^{P_T}\sum_{i_1,i_2=1}^{d}\bigl|\alpha_{i_1}(u_1)r_{i_1k}(t+u_1+r)\,\alpha_{i_2}(u_2)r_{i_2l}(t+u_2+s)\bigr|\\
&\quad\le \Bigl[\sum_{t=P_T+1}^{T}\Bigl(\sum_{u_1=0}^{P_T}\sum_{i_1=1}^{d}\alpha_{i_1}(u_1)r_{i_1k}(t+u_1+r)\Bigr)^2\Bigr]^{\frac12}\times\Bigl[\sum_{t=P_T+1}^{T}\Bigl(\sum_{u_2=0}^{P_T}\sum_{i_2=1}^{d}\alpha_{i_2}(u_2)r_{i_2l}(t+u_2+s)\Bigr)^2\Bigr]^{\frac12}\\
&\quad\le \|\alpha\|^2\|R\|^2. \tag{5.4.21}
\end{align*}

For the second part we get
\begin{align*}
\operatorname{cum}\Bigl\{\prod_{k=1}^{4}\bigl[\hat r_{i_kj_k}(u_k,v_k)-r_{i_kj_k}(u_k-v_k)\bigr]\Bigr\}
&= \operatorname{cum}\bigl\{\hat r_{i_1j_1}(u_1,v_1),\dots,\hat r_{i_4j_4}(u_4,v_4)\bigr\}\\
&\quad + \sum_{\pi_1,\pi_2}\prod_{r=1}^{2}\operatorname{cum}\bigl\{\hat r_{i_{r,1}j_{r,1}}(u_{r,1},v_{r,1}),\hat r_{i_{r,2}j_{r,2}}(u_{r,2},v_{r,2})\bigr\}, \tag{5.4.22}
\end{align*}
where $\sum_{\pi_1,\pi_2}$ denotes the sum over all partitions $\{\pi_1,\pi_2\}$ of $\{1,\dots,4\}$ with $\pi_r=\{\pi_{r,1},\pi_{r,2}\}$ and $i_{r,\nu}=i_{\pi_{r,\nu}}$, with similar definitions for the other indices. According to the product theorem for cumulants, the first term is equal to
\[
\sum_{i.p.}\prod_{r=1}^{4} r_{i_{r,1}i_{r,2}}(t_{r,2}-t_{r,1}+u_{r,1}-u_{r,2}),
\]


where $\sum_{i.p.}$ denotes the sum over all indecomposable partitions $\{\pi_1,\dots,\pi_4\}$ with $\pi_r=\{\pi_{r,1},\pi_{r,2}\}$ of the table
\[
\begin{matrix}
(i_1,u_1,t_1) & (j_1,v_1,t_1)\\
\vdots & \vdots\\
(i_4,u_4,t_4) & (j_4,v_4,t_4).
\end{matrix}
\]
Since the partitions are indecomposable, the product contains at least one factor
\[
r_{i_{\tau_1}i_{\tau_2}}(t_{\tau_2}-t_{\tau_1}+u_{\tau_1}-u_{\tau_2}) \quad\text{or}\quad r_{i_{\tau_1}j_{\tau_2}}(t_{\tau_2}-t_{\tau_1}+u_{\tau_1}-v_{\tau_2})\,r_{j_{\tau_1}i_{\tau_3}}(t_{\tau_3}-t_{\tau_1}+v_{\tau_1}-u_{\tau_3})
\]
with $\tau_1\ne\tau_2\ne\tau_3$. Therefore we can apply either (5.4.19) or (5.4.21). Repeated use of (5.4.20) then gives the upper bound $C\,T_0^{-3}\|\alpha\|^2\|\alpha\|_1^2\|\beta\|_1^4$. According to (5.4.17), the remaining terms in (5.4.22) are bounded by $C\,T_0^{-2}\|\alpha\|^4\|\beta\|_1^4$, which completes the proof of the lemma.

Proof of Lemma 5.3.2. (i) In order to prove the first part we rewrite $\hat\Sigma(p,G)-\hat S(p,G)$ as
\begin{align*}
\hat\Sigma(p,G)-\hat S(p,G)
&= \bigl(\hat A(p,G)-A(p,G)\bigr)\hat R_p\bigl(\hat A(p,G)-A(p,G)\bigr)' - \hat r_p\bigl(\hat A(p,G)-A(p,G)\bigr)'\\
&\quad - \bigl(\hat A(p,G)-A(p,G)\bigr)\hat r_p' + \bigl(\hat A(p,G)-A(p,G)\bigr)\hat R_p A(p,G)' + A(p,G)\hat R_p\bigl(\hat A(p,G)-A(p,G)\bigr)' \tag{5.4.23}\\
&= \bigl(\hat A(p,G)-A(p,G)\bigr)\hat R_p\bigl(\hat A(p,G)-A(p,G)\bigr)'\\
&\quad + \bigl(\hat A(p,G)-A(p,G)\bigr)\bigl[\bigl(\hat R_p-R_p\bigr)A(p,G)' - (\hat r_p-r_p)'\bigr]\\
&\quad + \bigl[A(p,G)\bigl(\hat R_p-R_p\bigr) - (\hat r_p-r_p)\bigr]\bigl(\hat A(p,G)-A(p,G)\bigr)'\\
&\quad + \bigl(\hat A(p,G)-A(p,G)\bigr)R\bigl(A(p,G)-A\bigr)' + \bigl(A(p,G)-A\bigr)R\bigl(\hat A(p,G)-A(p,G)\bigr)', \tag{5.4.24}
\end{align*}
where we have used that
\[
r_pA(p,G)' = rA(p,G)' = ARA(p,G)' \quad\text{and}\quad A(p,G)R_pA(p,G)' = A(p,G)RA(p,G)',
\]
and similarly with $A(p,G)$ replaced by $A$. Thus we have the upper bound
\begin{align*}
\|\hat\Sigma(p,G)-\hat S(p,G)\|_2
&\le \|\hat a(p,G)-a(p,G)\|\bigl[C_1\|\hat a(p,G)-a(p,G)\|\,\|\hat R_p\| + C_2\|\hat R_p-R_p\|\,\|a(p,G)\| + C_3\|\hat r_p-r_p\|_2\bigr]\\
&\quad + C_4\|\hat a(p,G)-a(p,G)\|\,\|a(p,G)-a\|\,\|R\|.
\end{align*}


Since $L_T(p,G)$ is bounded, we get by Lemma 5.3.1
\[
\max_{1\le p\le P_T}\frac{\|\hat a(p,G)-a(p,G)\|}{\sqrt{L_T(p,G)}} = O_P(1) \quad\text{and}\quad \max_{1\le p\le P_T}\|\hat a(p,G)-a(p,G)\|^2 = o_P(1). \tag{5.4.25}
\]
Application of Lemma 5.4.1 then yields the desired convergence for the first three terms. For the last term we obtain from the definition of $L_T(p,G)$ and (5.4.25)
\[
\max_{1\le p\le P_T}\frac{\|\hat a(p,G)-a(p,G)\|\,\|a(p,G)-a\|}{\sqrt{L_T(p,G)}} \le C\max_{1\le p\le P_T}\|\hat a(p,G)-a(p,G)\| = o_P(1),
\]
which proves the first part of (i). The second part follows directly from
\[
\|\hat\Sigma(p,G)-\hat S(p,G)\|_2 \le \|\hat a(p,G)-a(p,G)\|\bigl[C\|a(p,G)-a\| + o_P(1)\bigr]
\]
and the first equation in (5.4.25).

(ii) By the definitions of $S(p,G)$ and $\Sigma(p,G)$ we get

$$S(p,G)-\Sigma(p,G)
=A(p,G)\bigl(\hat R_p-R_p\bigr)A(p,G)'
-\bigl(\hat r_p-r_p\bigr)A(p,G)'
-A(p,G)\bigl(\hat r_p-r_p\bigr)'.$$

Application of Lemma 5.4.2 to the first term yields for $G_0\subseteq G$

$$\begin{aligned}
E\Bigl[\max_{1\le p\le P_T}
\frac{\bigl\|A(p,G)\bigl(\hat R_p-R_p\bigr)A(p,G)'\bigr\|_1}{\sqrt{L_T(p,G)}}\Bigr]^4
&\le\sum_{p=1}^{P_T}
\frac{E\bigl\|A(p,G)\bigl(\hat R_p-R_p\bigr)A(p,G)'\bigr\|_1^4}{L_T(p,G)^2}\\
&\le\sum_{p=1}^{P_T}\Bigl[
\frac{C\|a(p,G)\|_2^4\|a(p,G)\|_1^4}{T_0^2L_T(p,G)^2}
+\frac{C\|a(p,G)\|_2^2\|a(p,G)\|_1^6}{T_0^3L_T(p,G)^2}\Bigr]\\
&\le\sum_{p=1}^{p_T^*}\frac{C}{(k_T^*)^2}
+\sum_{p=p_T^*+1}^{P_T}\frac{C}{k^2},
\end{aligned}$$

which tends to zero since $p_T^*$ diverges to infinity as $T\to\infty$. If $G_0\not\subseteq G$ we only have $\|a(p,G)\|_1^2\le k(p,G)\|a(p,G)\|^2\le Cp$. However, $L_T(p,G)$ is bounded away from zero and we get by the first part of Lemma 5.4.2

$$E\Bigl[\max_{1\le p\le P_T}
\bigl\|A(p,G)\bigl(\hat R_p-R_p\bigr)A(p,G)'\bigr\|_1\Bigr]^2
\le\sum_{p=1}^{P_T}\frac{C\|a(p,G)\|_2^2\|a(p,G)\|_1^2}{T_0}
\le\frac{CP_T^2}{T_0}.$$

This also proves the second part of (ii) for general $G\in\mathcal G$.

(iii) From the identity

$$\Sigma(p,G)-\Sigma=\bigl(A(p,G)-A\bigr)R\bigl(A(p,G)-A\bigr)'$$


it follows directly from the definition of $L_T(p,G)$ that

$$\frac{\bigl\|\Sigma(p,G)-\Sigma\bigr\|_2}{L_T(p,G)}
\le\frac{\bigl\|a(p,G)-a\bigr\|^2\|R\|}{L_T(p,G)}
\le\frac{\|R\|}{\|H\|_{\inf}}$$

for all $p\in\mathbb N$, which completes the proof of the lemma.

We note that if the graph $G$ is complete, we have an ordinary multivariate AR($p$) model without any restrictions on the parameters of the model, and the parameter estimate is of the form $\hat r_p=\hat A(p,G)\hat R_p$. In this case all but the first term in (5.4.23) cancel. For general graphs, however, these terms cancel only if we consider $\operatorname{tr}\bigl[Q\bigl(\hat\Sigma(p,G)-S(p,G)\bigr)\bigr]$ instead of $\hat\Sigma(p,G)-S(p,G)$. For this term we have the stronger result

$$\max_{1\le p\le P_T}
\frac{\operatorname{tr}\bigl[Q\bigl(\hat\Sigma(p,G)-S(p,G)\bigr)\bigr]}{L_T(p,G)}=O_P(1).$$


Appendix

A Properties of L-functions

Let $L^{(T)}:\mathbb R\to\mathbb R$ be the $2\pi$-periodic extension of

$$L^{(T)}(\lambda)=\begin{cases}
T, & |\lambda|\le 1/T\\
1/|\lambda|, & 1/T<|\lambda|\le\pi.
\end{cases}$$

These functions have been introduced by Dahlhaus (1983) as a useful device for cumulant calculations in spectral analysis. The properties of these $L^{(T)}$-functions are summarized in the following lemma.

Lemma A.1 Let $\alpha\in\mathbb R$ and $r\in\mathbb N$. Then we obtain for $L^{(T)}$ with a constant $C$ independent of $T$

(i) $L^{(T)}(\alpha)$ is monotone increasing in $T\in\mathbb R_+$ and decreasing in $\alpha\in[0,\pi]$,

(ii) $L^{(T)}(c\alpha)\le c^{-1}L^{(T)}(\alpha)$ for all $c\in(0,1]$,

(iii) $\displaystyle\int_\Pi L^{(T)}(\alpha)\,d\alpha\le C\log(T)$,

(iv) $\displaystyle\int_\Pi L^{(T)}(\alpha)^r\,d\alpha\le CT^{r-1}$.

Let $r,s\in\mathbb N$ and $T,S>0$. Then $L^{(T)}$ and $L^{(S)}$ satisfy for $|\alpha|,|\beta|\le\pi$

(v) $\displaystyle L^{(T)}(\alpha)^rL^{(S)}(\beta)^s\le L^{(T)}\Bigl(\frac{\alpha-\beta}{2}\Bigr)^rL^{(S)}(\beta)^s+L^{(T)}(\alpha)^rL^{(S)}\Bigl(\frac{\alpha-\beta}{2}\Bigr)^s$.

Further, $L^{(T)}$ and $L^{(S)}$ have the following convolution properties

(vi) $\displaystyle\int_\Pi L^{(T)}(\beta+\alpha)L^{(S)}(\gamma-\alpha)\,d\alpha\le C\max\{\log(T),\log(S)\}\,L^{(\min\{T,S\})}(\beta+\gamma)$,

(vii) $\displaystyle\int_\Pi L^{(T)}(\beta+\alpha)^rL^{(S)}(\gamma-\alpha)^r\,d\alpha\le C\max\{T^{r-1},S^{r-1}\}\,L^{(\min\{T,S\})}(\beta+\gamma)^r$.
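Since $L^{(T)}(\lambda)=\min\{T,1/|\lambda|\}$ on $[-\pi,\pi]$, the growth rates in (iii) and (iv) are easy to check numerically; the exact values are $\int_\Pi L^{(T)}(\alpha)\,d\alpha=2+2\log(\pi T)$ and $\int_\Pi L^{(T)}(\alpha)^2\,d\alpha=4T-2/\pi$. The following Python sketch (our own illustration, not part of the thesis) confirms both:

```python
import numpy as np

def L(lam, T):
    """Periodic L^(T)-function on [-pi, pi]: equals min(T, 1/|lambda|)."""
    return 1.0 / np.maximum(np.abs(lam), 1.0 / T)

def integral(r, T, n=1_000_001):
    """Trapezoidal approximation of int_{-pi}^{pi} L^(T)(a)^r da."""
    lam = np.linspace(-np.pi, np.pi, n)
    y = L(lam, T) ** r
    dl = lam[1] - lam[0]
    return dl * (y.sum() - 0.5 * (y[0] + y[-1]))

for T in (10.0, 100.0, 1000.0):
    i1, i2 = integral(1, T), integral(2, T)
    # (iii): int L grows like C log T; exact value 2 + 2 log(pi T)
    # (iv):  int L^2 grows like C T;   exact value 4 T - 2/pi
    print(f"T={T:6.0f}  int L={i1:9.3f} (exact {2 + 2*np.log(np.pi*T):9.3f})"
          f"  int L^2/T={i2/T:6.3f}")
```

The log-$T$ growth of the first integral and the linear-$T$ growth of the second are exactly the orders asserted in (iii) and (iv).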


Proof. The proofs are straightforward. Most can be found in Dahlhaus (1983) and Dahlhaus (1990).

In the case of time-continuous processes, the methods based on $L^{(T)}$-functions can still be used by replacing the above definition by the following nonperiodic version

$$L^{(T)}(\lambda)=\begin{cases}
T, & |\lambda|\le 1/T\\
1/|\lambda|, & |\lambda|>1/T,
\end{cases}$$

which has been introduced by Eichler (1995). These $L^{(T)}$-functions are not integrable and therefore properties (iii) and (vi) in the previous lemma need to be replaced by the following result. All other properties still hold for these $L$-functions with straightforward modifications.

Lemma A.2 Let $\zeta:\mathbb R\to\mathbb R$ be an integrable and bounded function and $\alpha,\beta,\gamma\in\mathbb R$. We obtain with a constant $C$ independent of $T$ and $S$

(i) $\displaystyle\int_{\mathbb R}L^{(T)}(\alpha)\zeta(\alpha)\,d\alpha\le C\log(T)$,

(ii) $\displaystyle\int_{\mathbb R}L^{(T)}(\beta+\alpha)L^{(S)}(\gamma-\alpha)\zeta(\alpha)\,d\alpha\le C\max\{\log(T),\log(S)\}\,L^{(\min\{T,S\})}(\beta+\gamma)$.

Proof. The result has been proved in similar form in Eichler (1995).

B Matrices and Norms

Let $x$ be an $n$-dimensional vector and $A$ an $n\times n$ matrix. The standard norms used in this thesis are the Euclidean norm

$$\|x\|=\|x\|_2=\bigl(x_1^2+\ldots+x_n^2\bigr)^{\frac12}$$

for vectors and the operator norm

$$\|A\|=\sup_{\|x\|\le 1}\|Ax\|$$

for matrices. Further let $A^*$ denote the conjugate transpose of $A$ and

$$\|A\|_2=\bigl(\operatorname{tr}(AA^*)\bigr)^{\frac12},\qquad
\|A\|_1=\sum_{i,j=1}^n|A_{ij}|
\qquad\text{and}\qquad
\|A\|_{\inf}=\inf_{\|x\|=1}|x'Ax|.$$

Similarly we define $\|x\|_1=|x_1|+\ldots+|x_n|$. The next two lemmas summarize some well-known inequalities for matrix norms and the trace of products of matrices.

Lemma B.1 Let $A$ and $B$ be $n\times n$ matrices and $x,y$ $n$-dimensional vectors.

(i) $\|AB\|\le\|A\|\,\|B\|$.

(ii) $\|A\|\le\|A\|_2\le\sqrt n\,\|A\|$.

(iii) $\|A\|_1\le n\|A\|_2\le n^{3/2}\|A\|$.

(iv) $|\operatorname{tr}(AB)|\le\|A\|_2\|B\|_2$.

(v) $|x'Ay|\le\|x\|\,\|y\|\,\|A\|$.

(vi) $|x'Ax|\ge\|x\|^2\|A\|_{\inf}$.

Proof. See, e.g., Horn and Johnson (1985).
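These inequalities are easy to verify numerically; the following sketch (our own illustration, not part of the thesis; the helper names `op`, `frob`, `one` are ours) checks them on random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
x = rng.standard_normal(n)
y = rng.standard_normal(n)

op   = lambda M: np.linalg.norm(M, 2)       # operator norm ||M||
frob = lambda M: np.linalg.norm(M, 'fro')   # ||M||_2 = (tr M M*)^(1/2)
one  = lambda M: np.abs(M).sum()            # ||M||_1 = sum_ij |M_ij|
eps  = 1e-9                                 # tolerance for rounding error

assert op(A @ B) <= op(A) * op(B) + eps                          # (i)
assert op(A) <= frob(A) + eps <= np.sqrt(n) * op(A) + 2 * eps    # (ii)
assert one(A) <= n * frob(A) + eps <= n**1.5 * op(A) + 2 * eps   # (iii)
assert abs(np.trace(A @ B)) <= frob(A) * frob(B) + eps           # (iv)
assert abs(x @ A @ y) <= np.linalg.norm(x) * np.linalg.norm(y) * op(A) + eps  # (v)

S = A @ A.T + np.eye(n)                     # symmetric positive definite
lmin = np.linalg.eigvalsh(S).min()          # equals ||S||_inf for s.p.d. S
assert x @ S @ x >= lmin * np.linalg.norm(x)**2 - eps            # (vi)
print("all inequalities hold")
```

For (vi) we use a positive definite matrix, since for such matrices $\|S\|_{\inf}$ coincides with the smallest eigenvalue.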

Lemma B.2 Suppose now that $B$ is positive definite and bounded and $A$ is Hermitian with eigenvalues $\lambda_1,\ldots,\lambda_n$.

(i) $\operatorname{tr}(AB)\le\|A\|\operatorname{tr}(B)$.

(ii) $\operatorname{tr}(ABA)\le\|A\|\operatorname{tr}(AB)$.

(iii) $\operatorname{tr}(AB)\ge\min_{1\le i\le n}\lambda_i\operatorname{tr}(B)$.

If $B$ is also symmetric then it further holds that

(iv) $\operatorname{tr}(ABA)\le\|B\|\operatorname{tr}(AA^*)$.

Proof. The proofs are straightforward noting that $A$ has the decomposition $A=U\Lambda U^*$, where $U$ is unitary and $\Lambda$ is the diagonal matrix with entries $\lambda_i$.

The next lemma summarizes some results on differentiation of matrices.

Lemma B.3 Let $A(\theta)$ be an $n\times n$ matrix parametrized in $\theta\in\mathbb R^k$. Then

$$\frac{\partial A(\theta)^{-1}}{\partial\theta_i}
=-A(\theta)^{-1}\frac{\partial A(\theta)}{\partial\theta_i}A(\theta)^{-1},$$

$$\frac{\partial\log\det\bigl(A(\theta)\bigr)}{\partial\theta_i}
=\operatorname{tr}\Bigl(A(\theta)^{-1}\frac{\partial A(\theta)}{\partial\theta_i}\Bigr),$$

$$\frac{\partial\det\bigl(A(\theta)\bigr)}{\partial\theta_i}
=\det\bigl(A(\theta)\bigr)\operatorname{tr}\Bigl(A(\theta)^{-1}\frac{\partial A(\theta)}{\partial\theta_i}\Bigr),$$

$$\begin{aligned}
\frac{\partial^2\det\bigl(A(\theta)\bigr)}{\partial\theta_i\partial\theta_j}
=\det\bigl(A(\theta)\bigr)\Bigl[
&\operatorname{tr}\Bigl(A(\theta)^{-1}\frac{\partial A(\theta)}{\partial\theta_i}\Bigr)
\operatorname{tr}\Bigl(A(\theta)^{-1}\frac{\partial A(\theta)}{\partial\theta_j}\Bigr)\\
&-\operatorname{tr}\Bigl(A(\theta)^{-1}\frac{\partial A(\theta)}{\partial\theta_i}
A(\theta)^{-1}\frac{\partial A(\theta)}{\partial\theta_j}\Bigr)
+\operatorname{tr}\Bigl(A(\theta)^{-1}\frac{\partial^2A(\theta)}{\partial\theta_i\partial\theta_j}\Bigr)\Bigr].
\end{aligned}$$

Proof. The results can be found e.g. in Harville (1997).
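The first two identities of Lemma B.3 are easily verified by central finite differences; the sketch below (our own illustration; the linear parametrization $A(\theta)=10I+M_0+\theta M_1$ is a toy choice, not from the thesis) checks both:

```python
import numpy as np

rng = np.random.default_rng(1)
n, h, theta = 4, 1e-6, 0.3

# Toy parametrization A(theta) = 10*I + M0 + theta*M1, so dA/dtheta = M1.
M0 = rng.standard_normal((n, n))
M1 = rng.standard_normal((n, n))
A = lambda t: 10 * np.eye(n) + M0 + t * M1   # diagonally dominant, hence invertible

Ainv = np.linalg.inv(A(theta))

# d/dtheta log det A(theta) = tr(A^{-1} dA/dtheta), via central differences
num_logdet = (np.linalg.slogdet(A(theta + h))[1]
              - np.linalg.slogdet(A(theta - h))[1]) / (2 * h)
assert abs(num_logdet - np.trace(Ainv @ M1)) < 1e-5

# d/dtheta A(theta)^{-1} = -A^{-1} (dA/dtheta) A^{-1}
num_inv = (np.linalg.inv(A(theta + h)) - np.linalg.inv(A(theta - h))) / (2 * h)
assert np.max(np.abs(num_inv + Ainv @ M1 @ Ainv)) < 1e-5
print("finite-difference checks pass")
```

The same scheme extends to the determinant identities, which differ only by the factor $\det(A(\theta))$.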


References

Akaike, H. (1969). Fitting autoregressive models for prediction. Ann. Inst. Stat. Math., 21, 243-247.

Akaike, H. (1971). Autoregressive model fitting for control. Ann. Inst. Stat. Math., 23, 163-180.

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In 2nd Internat. Sympos. Inform. Theory (B.N. Petrov and F. Csáki, eds.), Akadémiai Kiadó, Budapest, 267-281. (Reprinted in: Breakthroughs in Statistics, Vol. I, S. Kotz and N.L. Johnson (eds.), Springer, New York, 1994.)

Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Autom. Control, AC-19, 716-723.

Alt, H.W. (1992). Lineare Funktionalanalysis, Springer, Berlin.

Bartlett, M.S. (1935). Contingency table interactions. J. Roy. Statist. Soc. Supplement, 2, 248-252.

Battaglia, F. (1984). Inverse covariances of a multivariate time series. Metron, 42, No. 3-4, 117-129.

Berk, K.N. (1974). Consistent autoregressive spectral estimates. Ann. Statist., 2, 489-502.

Bhansali, R.J. (1980). Autoregressive and window estimates of the inverse correlation function. Biometrika, 67, 551-561.

Bouissou, M.B., Laffont, J.J., and Vuong, Q.H. (1986). Tests of noncausality under Markov assumptions for qualitative panel data. Econometrica, 54, 395-414.

Brillinger, D.R. (1972). The spectral analysis of stationary interval functions. In Proc. 6th Berkeley Symp., Vol. 1, Berkeley, California, pp. 483-513.

Brillinger, D.R. (1981). Time Series: Data Analysis and Theory, McGraw Hill, New York.


Brillinger, D.R. (1996). Remarks concerning graphical models for time series and point processes. Revista de Econometria, 16, 1-23.

Brillinger, D.R., Bryant, H.L., and Segundo, J.P. (1976). Identification of synaptic interactions. Biol. Cybernetics, 22, 213-229.

Cheng, R., and Pourahmadi, M. (1993). Baxter's inequality and convergence of finite predictors of multivariate stochastic processes. Probab. Theory Relat. Fields, 95, 115-124.

Dahlhaus, R. (1983). Spectral analysis with tapered data. J. Time Ser. Anal., 4, 163-175.

Dahlhaus, R. (1990). Nonparametric high resolution spectral estimation. Probab. Theory Relat. Fields, 85, 147-180.

Dahlhaus, R. (1996). Graphical interaction models for multivariate time series. Preprint, Universität Heidelberg.

Dahlhaus, R., Eichler, M., and Sandkühler, J. (1997). Identification of synaptic connections in neural ensembles by graphical models. J. Neurosc. Meth., 77, 93-107.

Daley, D.J., and Vere-Jones, D. (1988). An Introduction to the Theory of Point Processes, Springer, New York.

Darroch, J.N., Lauritzen, S.L., and Speed, T.P. (1980). Markov fields and log-linear models. Ann. Statist., 8, 522-539.

Dawid, A.P. (1979). Conditional independence in statistical theory (with discussion). J. Roy. Stat. Soc. Ser. B, 41, 1-31.

Dzhaparidze, K.O., and Yaglom, A.M. (1983). Spectrum parameter estimation in time series analysis. In Developments in Statistics, Vol. 4, P.R. Krishnaiah (ed.), 1-96. Academic Press, New York.

Eichler, M. (1995). Empirical spectral processes and their applications to stationary point processes. Ann. Appl. Prob., 4, 1161-1176.

Florens, J.P., and Mouchart, M. (1982). A note on noncausality. Econometrica, 50, 583-591.

Florens, J.P., and Mouchart, M. (1985). A linear theory for noncausality. Econometrica, 53, 157-175.

Gibbs, W. (1902). Elementary Principles of Statistical Mechanics, Yale University Press, New Haven, Connecticut.

Granger, C.W.J. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37, 424-438.

Grenander, U., and Szegő, G. (1958). Toeplitz Forms and Their Applications, University of California Press, Berkeley.


Hannan, E.J., and Deistler, M. (1988). The Statistical Theory of Linear Systems. Wiley, New York.

Harville, D.A. (1997). Matrix Algebra from a Statistician's Perspective, Springer, New York.

Haugh, L.D. (1976). Checking the independence of two covariance-stationary time series: a univariate residual cross-correlation approach. J. Am. Statist. Assoc., 71, 378-385.

Hawkes, A.G. (1971a). Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58, 83-90.

Hawkes, A.G. (1971b). Point spectra of some mutually exciting point processes. J. Roy. Statist. Soc. Ser. B, 33, 438-443.

Hong, Y. (1996). Testing for independence between two covariance stationary time series. Biometrika, 83, 615-625.

Horn, R.A., and Johnson, C.R. (1985). Matrix Analysis. Cambridge University Press, Cambridge.

Hosoya, Y. (1977). On the Granger condition for non-causality. Econometrica, 45, 1735-1736.

Hsiao, C. (1982). Autoregressive modeling and causal ordering of econometric variables. J. Econ. Dyn. Control, 4, 243-259.

Kullback, S., and Leibler, R.A. (1951). On information and sufficiency. Ann. Math. Statist., 22, 79-86.

Lauritzen, S.L. (1996). Graphical Models. Oxford University Press, Oxford.

Linhart, H., and Zucchini, W. (1986). Model Selection. Wiley, New York.

Lütkepohl, H. (1985). Comparison of criteria for estimating the order of a vector autoregressive process. J. Time Ser. Anal., 6, 35-52. Correction (1987), 8, 373.

Lynggaard, H., and Walther, K.H. (1993). Dynamic Modelling with Mixed Graphical Association Models, Master's Thesis, Aalborg University.

Masani, P. (1966). Recent trends in multivariate prediction theory. In Multivariate Analysis (P.R. Krishnaiah, ed.), Academic Press, New York, pp. 351-382.

Melssen, W.J., and Epping, W.J.M. (1987). Detection and estimation of neural connectivity based on crosscorrelation analysis. Biol. Cybernetics, 57, 403-414.

Nakano, J., and Tagami, S. (1987). Order selection of a multivariate autoregressive model by a modification of the FPE criterion. Int. J. Control, 45, 589-596.

Parzen, E. (1983). Autoregressive spectral estimation. In Handbook of Statistics 3, D.R. Brillinger and P.R. Krishnaiah (eds.), North Holland, Amsterdam.


Pearl, J. (1995). Causal diagrams for empirical research (with discussion). Biometrika, 82, 669-710.

Pierce, D.A., and Haugh, L.D. (1977). Causality in temporal systems: Characterizations and a survey. J. Econ., 5, 265-293.

Pollard, D. (1984). Convergence of Stochastic Processes. Springer, New York.

Pollard, D. (1990). Empirical Processes: Theory and Applications. SIAM, Philadelphia.

Reinsel, G.C. (1991). Elements of Multivariate Time Series Analysis. Springer, New York.

Rigas, A.G. (1991). Spectra-based estimates of certain time-domain parameters of a bivariate stationary point process. Math. Biosc., 104, 185-201.

Rigas, A.G. (1992). Spectral analysis of stationary point processes using the fast Fourier transform algorithm. J. Time Ser. Anal., 13, 441-450.

Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14, 465-471.

Rosenberg, J.R., Amjad, A.M., Breeze, P., Brillinger, D.R., and Halliday, D.M. (1989). The Fourier approach to the identification of functional coupling between neuronal spike trains. Prog. Biophysics Mol. Biol., 53, 1-31.

Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist., 6, 461-464.

Shaman, P. (1975). An approximate inverse for the covariance matrix of moving average and autoregressive processes. Ann. Statist., 3, 532-538.

Shaman, P. (1976). Approximations for stationary covariance matrices and their inverses with application to ARMA models. Ann. Statist., 4, 292-301.

Shibata, R. (1980). Asymptotically efficient selection of the order of the model for estimating parameters of a linear process. Ann. Statist., 8, 147-164.

Shibata, R. (1997). Bootstrap estimate of Kullback-Leibler information for model selection. Stat. Sin., 7, 375-394.

Sims, C.A. (1972). Money, income and causality. American Economic Review, 62, 540-562.

Spirtes, P., Glymour, C., and Scheines, R. (1993). Causation, Prediction, and Search, Springer Lecture Notes, 81, Springer, New York.

Swanson, N.R., and Granger, C.W.J. (1997). Impulse response functions based on a causal approach to residual orthogonalization in vector autoregressions. J. Am. Stat. Assoc., 92, 357-367.

Taniguchi, M. (1980). On selection of the order of the spectral density model for a stationary process. Ann. Inst. Statist. Math., 32, 401-419.


Taniguchi, M., Puri, M.L., and Kondo, M. (1996). Nonparametric approach for non-Gaussian vector stationary processes. J. Multi. Anal., 56, 259-283.

Tjøstheim, D. (1981). Granger-causality in multiple time series. J. Econ., 17, 157-176.

Tunnicliffe Wilson, G. (1972). The factorization of matricial spectral densities. SIAM J. Appl. Math., 23, 420-426.

Wermuth, N. (1976). Analogies between multiplicative models in contingency tables and covariance selection. Biometrics, 32, 95-108.

Wermuth, N., and Lauritzen, S.L. (1990). On substantive research hypotheses, conditional independence graphs and graphical chain models (with discussion). J. Roy. Statist. Soc. Ser. B, 52, 21-72.

Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. John Wiley, Chichester.

Whittle, P. (1953). Estimation and information in stationary time series. Ark. Mat., 2, 423-434. (Reprinted in: Breakthroughs in Statistics, Vol. III, S. Kotz and N.L. Johnson (eds.), Springer, New York, 1997.)

Whittle, P. (1954). Some recent contributions to the theory of stationary processes. Appendix to A Study in the Analysis of Stationary Time Series, by H. Wold, 2nd ed., 196-228. Almquist and Wiksell, Uppsala.

Wright, S. (1921). Correlation and causation. J. Agricul. Res., 20, 557-585.

Wright, S. (1934). The method of path coefficients. Ann. Math. Statist., 5, 161-215.