Submitted to the Annals of Statistics

CONSISTENCY OF MAXIMUM LIKELIHOOD ESTIMATION FOR SOME DYNAMICAL SYSTEMS

By Kevin McGoff∗,¶, Sayan Mukherjee†,¶, Andrew Nobel‡,‖, and Natesh Pillai§,∗∗

Duke University¶, University of North Carolina‖, and Harvard University∗∗

We consider the asymptotic consistency of maximum likelihood parameter estimation for dynamical systems observed with noise. Under suitable conditions on the dynamical systems and the observations, we show that maximum likelihood parameter estimation is consistent. Our proof involves ideas from both information theory and dynamical systems. Furthermore, we show how some well-studied properties of dynamical systems imply the general statistical properties related to maximum likelihood estimation. Finally, we exhibit classical families of dynamical systems for which maximum likelihood estimation is consistent. Examples include shifts of finite type with Gibbs measures and Axiom A attractors with SRB measures.

1. Introduction. Maximum likelihood estimation is a common, well-studied, and powerful technique for statistical estimation. In the context of a statistical model with an unknown parameter, the maximum likelihood estimate of the unknown parameter is, by definition, any parameter value under which the observed data is most likely; such parameter values are said to maximize the likelihood function with respect to the observed data. In classical statistical models, one typically thinks of the unknown parameter as a real number or possibly a finite dimensional vector of real numbers. Here we consider maximum likelihood estimation for statistical models in which each parameter value corresponds to a stochastic system observed with noise.

Hidden Markov models (HMMs) provide a natural setting in which to study both stochastic systems with observational noise and maximum likelihood estimation. In this setting, one has a parametrized family of stochastic

∗KM gratefully acknowledges the support of NSF grant DMS 10-45153.

†SM is pleased to acknowledge the support of NIH (Systems Biology): 5P50-GM081883, AFOSR: FA9550-10-1-0436, NSF CCF-1049290, and NSF DMS-1209155.

‡AN was supported in part by NSF grants DMS 0907177 and DMS 1310002.

§NP acknowledges the support of NSF grant DMS 1107070.

AMS 2000 subject classifications: Primary 37A50, 37A25, 62B10, 62F12, 62M09; secondary 37D20, 60F10, 62M05, 62M10, 94A17.

Keywords and phrases: Dynamical systems, hidden Markov models, maximum likelihood estimation, strong consistency.

imsart-aos ver. 2013/03/06 file: MLE_aos_July22_2014.tex date: July 22, 2014

2 K. MCGOFF ET AL.

processes that are assumed to be Markov, and one attempts to perform inference about the underlying parameters from noisy observations of the process. There has been a substantial amount of work on statistical inference for HMMs, and we do not attempt a complete survey of that area here. In the 1960s, Baum and Petrie [5, 37] studied consistency of maximum likelihood estimation for finite state HMMs. Since that time, several other authors have shown that maximum likelihood estimation is consistent for HMMs under increasingly general conditions [13, 16, 18, 29, 30, 31], culminating with the work of [15], which currently provides the most general conditions on HMMs under which maximum likelihood estimation has been shown to be consistent.

We focus here on the consistency of maximum likelihood estimation for parametrized families of deterministic systems observed with noise. Inference methods for deterministic systems from noisy observations are of interest in a variety of scientific areas (for a few examples, see [19, 20, 28, 38, 39, 40, 46, 49]).

For the purpose of this article, the terms deterministic system and dynamical system refer to a map $T : X \to X$. The set $X$ is referred to as the state space, and the transformation $T$ governs the evolution of states over one (discrete) time increment. Our main interest here lies in families of dynamical systems observed with noise. More precisely, we consider a state space $X$ and a parameter space $\Theta$, and to each $\theta$ in $\Theta$, we associate a dynamical system $T_\theta : X \to X$. Note that the state space $X$ does not depend on $\theta$. For each $\theta$ in $\Theta$, we assume that the system is started at equilibrium from a $T_\theta$-invariant measure $\mu_\theta$. See Section 2 for precise definitions. We are particularly interested in situations in which the family of dynamical systems is observed via noisy measurements (or observations). We consider a general observation model specified by a family of probability densities $\{g_\theta(\cdot \mid x) : \theta \in \Theta, x \in X\}$, where $g_\theta(\cdot \mid x)$ prescribes the distribution of an observation given that the state of the dynamical system is $x$ and the state of nature is $\theta$. Under some additional conditions (see Section 3), our first main result states that maximum likelihood estimation is a consistent method of estimation of the parameter $\theta$.

We have chosen to state the conditions of our main consistency result in terms of statistical properties of the family of dynamical systems and the observations. However, these particular statistical properties have not been directly studied in the dynamical systems literature. In the interest of applying our general result to specific systems, we also establish several connections between well-studied properties of dynamical systems and the statistical properties relevant to maximum likelihood estimation. Finally,


CONSISTENCY OF MLE FOR SOME DYNAMICAL SYSTEMS 3

we apply these results to some examples, including shifts of finite type with Gibbs measures and Axiom A attractors with SRB (Sinai-Ruelle-Bowen) measures. It is widely accepted in the field of ergodic theory and dynamical systems that these classes of systems have “good” statistical properties, and our results may be viewed as a precise confirmation of this view.

1.1. Previous work. There has been a substantial amount of work on statistical inference for HMMs, and a complete survey of that area is beyond the scope of this work. The asymptotic consistency of maximum likelihood estimation for HMMs has been studied at least since the work of Baum and Petrie [5, 37] under the assumption that both the hidden state space $X$ and the observation space $Y$ are finite sets. Leroux extended this result to the setting where $Y$ is a general space and $X$ is a finite set [31]. Several other authors have shown that maximum likelihood estimation is consistent for HMMs under increasingly general conditions [13, 16, 18, 29, 30], culminating with the work of [15], which currently provides the most general conditions for HMMs under which maximum likelihood estimation has been shown to be consistent.

Let us now discuss the results of [15] in greater detail. Consider parametrized families of HMMs in which both the hidden state space $X$ and the observation space $Y$ are complete, separable metric spaces. The main result of [15] shows that under several conditions, maximum likelihood estimation is a consistent method of estimation of the unknown parameter. These conditions involve some requirements on the transition kernel of the hidden Markov chain, as well as basic integrability conditions on the observations. The proof of that result relies on information-theoretic arguments, in combination with the application of some mixing conditions that follow from the assumptions on the transition kernel. To prove our consistency result, we take a similar information-theoretic approach, but instead of placing explicit restrictions on the transition kernel, we identify and study mixing conditions suitable for dynamical systems. See Remarks 2.4 and 3.3 for further discussion of our results in the context of HMMs.

Other directions of study regarding inference for HMMs include the behavior of MLE for misspecified HMMs [14], asymptotic normality for parameter estimates [8, 23], the dynamics of Bayesian updating [44], and starting the hidden process away from equilibrium [15]. Extending these results to dynamical systems is of potential interest.

The topic of statistical inference for dynamical systems has been widely studied in a variety of fields. Early interest from the statistical point of view is reflected in the following surveys [6, 12, 21, 22]. For a recent review of this



area with many references see [33]. There has been significant methodological work in the area of statistical inference for dynamical systems (for a few recent examples, see [19, 20, 38, 46, 49]), but in this section we attempt to describe some of the more theoretical work in this area. The relevant theoretical work to date falls (very) roughly into three classes:

• state estimation (also known as denoising or filtering) for dynamical systems with observational noise;
• prediction for dynamical systems with observational noise; and
• system reconstruction from dynamical systems without noise.

Let us now mention some representative works from these lines of research.

In the setting of dynamical systems with observational noise, Lalley introduced several ideas regarding state estimation in [25]. These ideas were subsequently generalized and developed in [26, 27]. Key results from this line of study include both positive and negative results on the consistency of denoising a dynamical system under additive observational noise. In short, the magnitude of the support of the noise seems to determine whether consistent denoising is possible. In related work, Judd [24] demonstrated that MLE can fail (in a particular sense) in state estimation when noise is large. It is perhaps interesting to note that there are examples of Axiom A systems with Gaussian observational noise for which state estimation cannot be consistent (by results of [26, 27]) and yet MLE provides consistent parameter estimation (by Theorem 5.7).

Steinwart and Anghel considered the problem of consistency in prediction accuracy for dynamical systems with observational noise [45]. They were able to show that support vector machines are consistent in terms of prediction accuracy under some conditions on the decay of correlations of the dynamical system.

The work of Adams and Nobel uses ideas from regression to study reconstruction of measure-preserving dynamical systems [1, 34, 35] without noise. These results show that certain types of inference are possible under fairly mild ergodicity assumptions. A sample result from this line of work is that a measure-preserving transformation may be consistently reconstructed from a typical trajectory observed without noise, assuming that the transformation preserves a measure that is absolutely continuous (with Radon-Nikodym derivative bounded away from 0 and infinity) with respect to a known reference measure.

1.2. Organization. In Section 2, we give some necessary background on dynamical systems observed with noise. Section 3 contains a statement and discussion of our main result (Theorem 3.1), which asserts that under some



general statistical conditions, maximum likelihood parameter estimation is consistent for families of dynamical systems observed with noise. The purpose of Section 4 is to establish connections between well-studied properties of dynamical systems and the (statistical) conditions appearing in Theorem 3.1. Section 5 gives several examples of widely studied families of dynamical systems to which we apply Theorem 3.1 and therefore establish consistency of maximum likelihood estimation. The proofs of our main results appear in Section 6, and we conclude with some final remarks in Section 7.

2. Setting and notation. Recall that our primary objects of study are parametrized families of dynamical systems. In this section we introduce these objects in some detail. First let us recall some terminology regarding dynamical systems and ergodic theory. We use $X$ to denote a state space, which we assume to be a complete separable metric space endowed with its Borel $\sigma$-algebra $\mathcal{X}$. Then a measurable dynamical system on $X$ is defined by a measurable map $T : X \to X$, which governs the evolution of states over one (discrete) time increment. For a probability measure $\mu$ on the measurable space $(X, \mathcal{X})$, we say that $T$ preserves $\mu$ (or $\mu$ is $T$-invariant) if $\mu(T^{-1}E) = \mu(E)$ for each set $E$ in $\mathcal{X}$. We refer to the quadruple $(X, \mathcal{X}, T, \mu)$ as a measure-preserving system. To generate a trajectory $(X_k)$ from such a measure-preserving system, one chooses $X_0$ according to $\mu$ and sets $X_k = T^k(X_0)$ for $k \ge 0$. Note that $(X_k)$ is then a stationary $X$-valued stochastic process. Lastly, the measure-preserving system $(X, \mathcal{X}, T, \mu)$ is said to be ergodic if $T^{-1}E = E$ implies $\mu(E) \in \{0, 1\}$. See the books [36, 48] for an introduction to measure-preserving systems and ergodic theory.
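To make the definitions of invariance and of a trajectory concrete, the following sketch works with one hypothetical example that is not taken from the paper: the skew tent map on $X = [0, 1]$ with Lebesgue measure playing the role of $\mu$. The map, the function names, and the Monte Carlo check are illustrative assumptions. The check compares an estimate of $\mu(T^{-1}E)$ with an estimate of $\mu(E)$ for an interval $E = [a, b]$.

```python
import random


def skew_tent(theta, x):
    """Hypothetical example map T on [0, 1]; it preserves Lebesgue measure."""
    return x / theta if x < theta else (1.0 - x) / (1.0 - theta)


def estimate_measures(theta, a, b, num_samples=20000, seed=0):
    """Monte Carlo estimates of mu(E) and mu(T^{-1}E) for E = [a, b]."""
    rng = random.Random(seed)
    in_e = 0          # counts samples x with x in E
    in_preimage = 0   # counts samples x with T(x) in E, i.e. x in T^{-1}E
    for _ in range(num_samples):
        x = rng.random()  # x drawn from mu = Lebesgue measure on [0, 1]
        if a <= x <= b:
            in_e += 1
        if a <= skew_tent(theta, x) <= b:
            in_preimage += 1
    return in_e / num_samples, in_preimage / num_samples


def trajectory(theta, x0, n):
    """Return (X_0, ..., X_n) with X_k = T^k(X_0)."""
    xs = [x0]
    for _ in range(n):
        xs.append(skew_tent(theta, xs[-1]))
    return xs


mu_e, mu_pre = estimate_measures(theta=0.3, a=0.2, b=0.5)
print(mu_e, mu_pre)  # both estimates should be close to b - a = 0.3
```

Agreement of the two estimates for a single interval is only a sanity check and not a proof of invariance. For this particular map, invariance follows because the preimage of an interval of length $\varepsilon$ consists of two intervals of lengths $\theta\varepsilon$ and $(1-\theta)\varepsilon$.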

Let us now introduce the setting of parametrized families of dynamical systems. We denote the parameter space by $\Theta$, which is assumed to be a compact metric space endowed with its Borel $\sigma$-algebra. Fix a state space $X$ and its Borel $\sigma$-algebra $\mathcal{X}$ as above. To each parameter $\theta$ in $\Theta$, we associate a measurable transformation $T_\theta : X \to X$, which prescribes the dynamics corresponding to the parameter $\theta$. Finally, we need to specify some initial conditions. In this article, we consider the case that the system is started from equilibrium. More precisely, we associate to each $\theta$ in $\Theta$ a $T_\theta$-invariant Borel probability measure $\mu_\theta$ on $(X, \mathcal{X})$. Thus, to each $\theta$ in $\Theta$, we associate a measure-preserving system $(X, \mathcal{X}, T_\theta, \mu_\theta)$, and we refer to the collection $(X, \mathcal{X}, T_\theta, \mu_\theta)_{\theta \in \Theta}$ as a parametrized family of dynamical systems. For ease of notation, we will refer to $(T_\theta, \mu_\theta)_{\theta \in \Theta}$ as a family of dynamical systems on $(X, \mathcal{X})$, instead of referring to the family of quadruples $(X, \mathcal{X}, T_\theta, \mu_\theta)_{\theta \in \Theta}$.

We would like to study the situation that such a family of dynamical systems is observed via noisy measurements. Here we describe the specifics of



our observation model. We suppose that we have a complete, separable metric space $Y$, endowed with its $\sigma$-algebra $\mathcal{Y}$, which serves as our observation space. We also assume that we have a family of Borel probability densities $\{g_\theta(\cdot \mid x) : \theta \in \Theta, x \in X\}$ with respect to a fixed reference measure $\nu$ on $Y$. The density $g_\theta(\cdot \mid x)$ prescribes the distribution of our observation given that the state of the dynamical system is $x$ and the state of nature is $\theta$. Lastly, we assume that the noise involved in successive observations is conditionally independent given $\theta$ and the underlying trajectory of the dynamical system. Thus, our full model consists of a parametrized family of dynamical systems $(T_\theta, \mu_\theta)_{\theta \in \Theta}$ on a measurable space $(X, \mathcal{X})$ with corresponding observation densities $\{g_\theta(\cdot \mid x) : \theta \in \Theta, x \in X\}$.
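The observation model can be sketched in code for one hypothetical instance that does not appear in the paper: the skew tent family $T_\theta$ on $X = [0, 1]$, for which $\mu_\theta$ is Lebesgue measure for every $\theta \in (0, 1)$, observed with additive Gaussian noise, so that $g_\theta(\cdot \mid x)$ is the $N(x, \sigma^2)$ density with respect to Lebesgue measure $\nu$ on $Y = \mathbb{R}$. The map, the noise model, the value of $\sigma$, and the function names are all illustrative assumptions.

```python
import random


def skew_tent(theta, x):
    """Hypothetical family T_theta on X = [0, 1] (skew tent map)."""
    return x / theta if x < theta else (1.0 - x) / (1.0 - theta)


def simulate_observations(theta, n, sigma=0.1, seed=0):
    """Return hidden states X_0..X_n and noisy observations Y_0..Y_n.

    X_0 is drawn from mu_theta (Lebesgue measure here), X_k = T_theta^k(X_0),
    and the noise terms are independent given the trajectory.
    """
    rng = random.Random(seed)
    x = rng.random()  # start at equilibrium: X_0 ~ mu_theta
    states, observations = [], []
    for _ in range(n + 1):
        states.append(x)
        observations.append(x + rng.gauss(0.0, sigma))  # Y_k ~ g_theta(. | X_k)
        x = skew_tent(theta, x)  # deterministic evolution of the hidden state
    return states, observations


states, observations = simulate_observations(theta=0.3, n=20)
print(len(states), len(observations))
```

Only the initial state and the noise are random in this sketch; given $X_0$ and $\theta$, the hidden trajectory is fully determined, which is the feature that distinguishes this setting from a nondegenerate hidden Markov model.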

In general, we would like to estimate the parameter $\theta$ from our observations. Maximum likelihood estimation provides a basic method for performing such estimation. Our first main result states that maximum likelihood estimation is a consistent estimator of $\theta$ under some general conditions on the family of systems and the noise. In order to state these results precisely, we now introduce the likelihood for our model. For the sake of notation, it will be convenient to denote finite sequences $(x_i, \dots, x_j)$ with the notation $x_i^j$.

As we have assumed that our observations are conditionally independent given $\theta$ and a trajectory $(X_k)$, we have that for $\theta \in \Theta$ and $y_0^n \in Y^{n+1}$, the likelihood of observing $y_0^n$ given $\theta$ and $(X_k)$ is

\[
p_\theta(y_0^n \mid X_0^n) = \prod_{j=0}^{n} g_\theta(y_j \mid X_j).
\]

Since $X_k = T_\theta^k(X_0)$ given $\theta$ and $X_0$, the conditional likelihood of $y_0^n$ given $\theta$ and $X_0 = x$ is

\[
p_\theta(y_0^n \mid x) = \prod_{j=0}^{n} g_\theta(y_j \mid T_\theta^j(x)).
\]

Since our model also assumes that $X_0$ is distributed according to $\mu_\theta$, we have that for $\theta \in \Theta$ and $y_0^n \in Y^{n+1}$, the marginal likelihood of observing $y_0^n$ given $\theta$ is

\[
p_\theta(y_0^n) = \int p_\theta(y_0^n \mid x) \, d\mu_\theta(x). \tag{2.1}
\]
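As a minimal sketch of how the integral in (2.1) might be approximated numerically, the code below averages the conditional likelihood $p_\theta(y_0^n \mid x)$ over initial states sampled from $\mu_\theta$. It assumes the same hypothetical setup as an illustration only (skew tent maps on $[0, 1]$ with Lebesgue invariant measure and Gaussian observation densities); none of the function names or parameter values come from the paper, and plain Monte Carlo is a swapped-in approximation technique, not a method proposed by the authors.

```python
import math
import random


def skew_tent(theta, x):
    """Hypothetical family T_theta on [0, 1]; mu_theta is Lebesgue measure."""
    return x / theta if x < theta else (1.0 - x) / (1.0 - theta)


def log_gauss(y, mean, sigma):
    """log of the N(mean, sigma^2) density at y (the assumed g_theta(y | x))."""
    return -0.5 * ((y - mean) / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))


def log_conditional_likelihood(theta, ys, x, sigma):
    """log p_theta(y_0^n | x) = sum_j log g_theta(y_j | T_theta^j(x))."""
    total = 0.0
    for y in ys:
        total += log_gauss(y, x, sigma)
        x = skew_tent(theta, x)
    return total


def log_marginal_likelihood(theta, ys, sigma=0.1, num_samples=2000, seed=1):
    """Monte Carlo estimate of log p_theta(y_0^n) as defined in (2.1)."""
    rng = random.Random(seed)
    logs = [
        log_conditional_likelihood(theta, ys, rng.random(), sigma)  # x ~ mu_theta
        for _ in range(num_samples)
    ]
    peak = max(logs)  # log-sum-exp trick to avoid underflow for long sequences
    return peak + math.log(sum(math.exp(v - peak) for v in logs) / num_samples)


print(log_marginal_likelihood(theta=0.3, ys=[0.5, 0.7, 0.4]))
```

For expanding maps, $p_\theta(y_0^n \mid x)$ can depend very sensitively on $x$ when $n$ is large, so an estimator of this naive form may have very high variance; the sketch is meant to illustrate the structure of (2.1) and not to serve as a practical inference scheme.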

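We denote by $\nu^n$ the product measure on $Y^{n+1}$ with marginals equal to $\nu$. Let $\mathbb{P}_\theta$ be the probability measure on $X \times Y^{\mathbb{N}}$ such that for Borel sets $A \subset X$ and $B \subset Y^{n+1}$, it holds that

\[
\mathbb{P}_\theta(A \times B) = \int \int \mathbf{1}_A(x) \, \mathbf{1}_B(y_0^n) \, p_\theta(y_0^n \mid x) \, d\nu^n(y_0^n) \, d\mu_\theta(x),
\]

which is well-defined by Kolmogorov’s consistency theorem. Let $\mathbb{E}_\theta$ denote expectation with respect to $\mathbb{P}_\theta$, and let $\mathbb{P}^Y_\theta$ be the marginal of $\mathbb{P}_\theta$ on $Y^{\mathbb{N}}$.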

Before we define consistency, let us first consider the issue of identifiability. Our notion of identifiability is captured by the following equivalence relation.

Definition 2.1. Define an equivalence relation on $\Theta$ as follows: let $\theta \sim \theta'$ if $\mathbb{P}^Y_\theta = \mathbb{P}^Y_{\theta'}$. Denote by $[\theta]$ the equivalence class of $\theta$ with respect to this equivalence relation.

In a strong theoretical sense, if $\theta'$ is in $[\theta]$, then the systems corresponding to the parameter values $\theta'$ and $\theta$ cannot be distinguished from each other based on observations of the system.

Now we fix a distinguished element $\theta_0$ in $\Theta$. Here and in the rest of the paper, we assume that $\theta_0$ is the “true” parameter, i.e. the data are generated from the measure $\mathbb{P}^Y_{\theta_0}$. Hence, one may think of $[\theta_0]$ as the set of parameters that cannot be distinguished from the true parameter.

Definition 2.2. An approximate maximum likelihood estimator (MLE) is a sequence of measurable functions $\theta_n : Y^{n+1} \to \Theta$ such that

\[
\frac{1}{n} \log p_{\theta_n(Y_0^n)}(Y_0^n) \ge \sup_{\theta} \frac{1}{n} \log p_\theta(Y_0^n) - o_{a.s.}(1), \tag{2.2}
\]

where $o_{a.s.}(1)$ denotes a process that tends to zero $\mathbb{P}_{\theta_0}$-a.s. as $n$ tends to infinity.
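The following sketch illustrates the idea behind (2.2) by maximizing an estimated log-likelihood over a finite grid of parameter values. Everything in it is an illustrative assumption and not a construction from the paper: the skew tent family on $[0, 1]$ with Lebesgue invariant measure, Gaussian observation densities, a Monte Carlo approximation of $p_\theta(Y_0^n)$, and the function names. A grid maximizer of an estimated likelihood is not guaranteed to satisfy (2.2); doing so would require controlling both the grid resolution and the Monte Carlo error as $n$ grows, which this sketch does not attempt.

```python
import math
import random


def skew_tent(theta, x):
    """Hypothetical family T_theta on [0, 1]; mu_theta is Lebesgue measure."""
    return x / theta if x < theta else (1.0 - x) / (1.0 - theta)


def simulate(theta, n, sigma, rng):
    """Noisy observations Y_0..Y_n of a trajectory started from mu_theta."""
    x, ys = rng.random(), []
    for _ in range(n + 1):
        ys.append(x + rng.gauss(0.0, sigma))
        x = skew_tent(theta, x)
    return ys


def log_marginal(theta, ys, sigma, starts):
    """Monte Carlo estimate of log p_theta(y_0^n), averaging over `starts`."""
    const = math.log(sigma * math.sqrt(2.0 * math.pi))
    logs = []
    for x in starts:
        total = 0.0
        for y in ys:
            total += -0.5 * ((y - x) / sigma) ** 2 - const  # log g_theta(y | x)
            x = skew_tent(theta, x)
        logs.append(total)
    peak = max(logs)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(v - peak) for v in logs) / len(logs))


def grid_mle(ys, grid, sigma, num_starts=500, seed=2):
    """Return the grid point with the largest estimated log-likelihood."""
    rng = random.Random(seed)
    # The same starting points serve every theta because mu_theta is Lebesgue
    # measure for all theta in this particular family.
    starts = [rng.random() for _ in range(num_starts)]
    scores = {theta: log_marginal(theta, ys, sigma, starts) for theta in grid}
    # The factor 1/n in (2.2) does not change which theta attains the maximum.
    return max(scores, key=scores.get), scores


rng = random.Random(0)
ys = simulate(theta=0.3, n=10, sigma=0.1, rng=rng)
theta_hat, scores = grid_mle(ys, grid=[0.2, 0.3, 0.4, 0.5], sigma=0.1)
print(theta_hat)
```

The sketch makes no claim about how close `theta_hat` is to the parameter used for simulation; with short sequences and a crude likelihood estimate, the selected grid point can differ from it.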

Remark 2.1. Several notions in this article, including the definition of approximate MLE above, involve taking suprema over $\theta$ in $\Theta$. In many situations of interest to us, $X$ and $\Theta$ are compact and all relevant functions are continuous in these arguments. In such cases, we have sufficient regularity to guarantee that suprema over $\theta$ in $\Theta$ are measurable. However, in the general situation, such suprema are not guaranteed to be measurable, and one must take some care. As all our measurable spaces are Polish (complete, separable metric spaces), such functions are always universally measurable [7, Proposition 7.47]. Similarly, a Borel-measurable (approximate) maximum likelihood estimator need not exist, but the Polish assumption ensures the existence of universally measurable maximum likelihood estimators [7, Proposition 7.50]. Thus, all probabilities and expectations may be unambiguously extended to such quantities.

Remark 2.2. In this work, we do not consider specific schemes for constructing an approximate MLE. Based on the existing results regarding denoising and system reconstruction (e.g., [1, 25, 26, 27, 34, 35], which are



briefly discussed in Section 1.1), explicit construction of an approximate MLE may be possible under suitable conditions. Although the description and study of such constructive methods could be interesting, it is outside of the scope of this work.

Remark 2.3. In principle, one could consider inference based on the conditional likelihood $p_\theta(\cdot \mid x_0)$ in place of the marginal likelihood $p_\theta(\cdot)$. However, we do not pursue this direction in this work. For non-linear dynamical systems, even the conditional likelihood $p_\theta(\cdot \mid x_0)$ may depend very sensitively on $x_0$ (see [6], for example). Thus, optimizing over $x_0$ is essentially no more “tractable” than marginalizing the likelihood via an invariant measure.

Remark 2.4. The framework of this paper may be translated into the language of Markov chains as follows. For each $\theta \in \Theta$, we define a (degenerate) Markov transition kernel $Q_\theta$ as follows:

\[
Q_\theta(x, y) = \delta_{T_\theta(x)}(y).
\]

In other words, for each $\theta \in \Theta$, $x \in X$, and Borel set $A \subset X$, the probability that $X_1 \in A$ conditioned on $X_0 = x$ is

\[
Q_\theta(x, A) = \delta_{T_\theta(x)}(A),
\]

where $\delta_x$ is defined to be a point mass at $x$.

In all previous work on consistency of maximum likelihood estimation for HMMs (including [13, 15, 16, 18, 29, 30]), there have been significant assumptions placed on the Markovian structure of the hidden chain. For example, the central hypothesis appearing in [15] requires that there is a $\sigma$-finite measure $\lambda$ on $X$ such that for some $L \ge 0$, the $L$-step transition kernel $Q_\theta^L(x, \cdot)$ is absolutely continuous with respect to $\lambda$ with bounded Radon-Nikodym derivative. If $X$ is uncountable, then the degeneracy of $Q_\theta$, which arises directly from the fact that we are considering deterministic systems, makes the existence of such a dominating measure impossible. In short, it is precisely the determinism in our hidden processes that prevents previous theorems for HMMs from applying to dynamical systems.

Nonetheless, there is a special case of systems that we consider in Section 5.1 that overlaps with the systems considered in the HMM literature. If $X$ is a shift of finite type, $T_\theta$ is the shift map $\sigma : X \to X$ for all $\theta$, $\mu_\theta$ is a (1-step) Markov measure for all $\theta$, and $g_\theta(\cdot \mid x)$ depends only on $\theta$ and the zero coordinate $x_0$, then both the present work and the results in [15] apply to this setting and guarantee consistency of any approximate MLE under additional assumptions on the noise.



3. Consistency of MLE. In this section, we show that under suitable conditions, any approximate MLE is consistent for families of dynamical systems observed with noise. To make this statement precise, we make the following definition of consistency.

Definition 3.1. An approximate MLE $(\theta_n)_n$ is consistent at $\theta_0$ if $\theta_n(Y_0^n)$ converges to $[\theta_0]$, $\mathbb{P}_{\theta_0}$-a.s. as $n$ tends to infinity.

For the sake of notation, define the function $\gamma : \Theta \times Y \to \mathbb{R}_+$, where

\[
\gamma_\theta(y) = \sup_{x \in X} g_\theta(y \mid x).
\]

Also, for $x > 0$, let $\log^+ x = \max(0, \log(x))$.

Consider the following conditions on a family of dynamical systems observed with noise.

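(S1) Ergodicity. The system $(T_{\theta_0}, \mu_{\theta_0})$ on $(X, \mathcal{X})$ is ergodic.

(S2) Logarithmic integrability at $\theta_0$. It holds that

\[
\mathbb{E}_{\theta_0}\left[ \log^+ \gamma_{\theta_0}(Y_0) \right] < \infty,
\]

and

\[
\mathbb{E}_{\theta_0}\left[ \left| \log \int g_{\theta_0}(Y_0 \mid x) \, d\mu_{\theta_0}(x) \right| \right] < \infty.
\]

(S3) Logarithmic integrability away from $\theta_0$. For each $\theta' \notin [\theta_0]$, there exists a neighborhood $U$ of $\theta'$ such that

\[
\mathbb{E}_{\theta_0}\left[ \sup_{\theta \in U} \log^+ \gamma_\theta(Y_0) \right] < \infty.
\]

(S4) Upper semi-continuity of the likelihood. For each $\theta' \notin [\theta_0]$ and $n \ge 0$, the function $\theta \mapsto p_\theta(Y_0^n)$ is upper semi-continuous at $\theta'$, $\mathbb{P}_{\theta_0}$-a.s.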

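(S5) Mixing condition. There exists $\ell \ge 0$ such that for each $m \ge 0$, there exists a measurable function $C_m : \Theta \times Y^{m+1} \to \mathbb{R}_+$ such that if $t \ge 1$ and $w_0, \dots, w_t \in Y^{m+1}$, then

\[
\int \prod_{j=0}^{t} p_\theta\left(w_j \mid T_\theta^{j(m+\ell)} x\right) d\mu_\theta(x) \le \prod_{j=0}^{t} C_m(\theta, w_j) \prod_{j=0}^{t} p_\theta(w_j).
\]

Furthermore, for each $\theta' \notin [\theta_0]$, there exists a neighborhood $U$ of $\theta'$ such that

\[
\sup_m \mathbb{E}_{\theta_0}\left[ \sup_{\theta \in U} \log C_m(\theta, Y_0^m) \right] < \infty.
\]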

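(S6) Exponential identifiability. For each $\theta \notin [\theta_0]$, there exists a sequence of measurable sets $A_n \subset Y^{n+1}$ such that

\[
\liminf_n \mathbb{P}^Y_{\theta_0}(A_n) > 0 \quad \text{and} \quad \limsup_n \frac{1}{n} \log \mathbb{P}^Y_\theta(A_n) < 0.
\]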

The following theorem is our main general result.

Theorem 3.1. Suppose that $(T_\theta, \mu_\theta)_{\theta \in \Theta}$ is a parametrized family of dynamical systems on $(X, \mathcal{X})$ with corresponding observation densities $(g_\theta)_{\theta \in \Theta}$. If conditions (S1)–(S6) hold, then any approximate MLE is consistent at $\theta_0$.

The proof of Theorem 3.1 is given in Section 6. In the following remark, we discuss the conditions (S1)–(S6).

Remark 3.2. The conditions (S1)-(S3) involve basic irreducibility and integrability conditions, and similar conditions have appeared in previous work on consistency of maximum likelihood estimation for HMMs (see, for example, [15, 31]). Taken together, conditions (S1) and (S2) ensure the almost sure existence and finiteness of the entropy rate for the process $(Y_n)$:

\[
h(\theta_0) = \lim_n \frac{1}{n} \log p_{\theta_0}(Y_0^n).
\]

Condition (S3) serves as a basic integrability condition in the proof of Theorem 3.1, in which one must essentially show that for $\theta \notin [\theta_0]$,

\[
\limsup_n \frac{1}{n} \log p_\theta(Y_0^n) < h(\theta_0).
\]

Conditions (S4)–(S6) are more interesting from the point of view of dynamical systems, and we discuss them in greater detail below.



The upper semi-continuity of the likelihood (S4) is closely related to the continuity of the map $\theta \mapsto \mu_\theta$. In general, the continuous dependence of $\mu_\theta$ on $\theta$ places non-trivial restrictions on a family of dynamical systems. This property (continuity of $\theta \mapsto \mu_\theta$) is often called “statistical stability” in the dynamical systems and ergodic theory literature, and it has been studied for some families of systems (for examples, see [2, 17, 42, 47] and references therein). In Section 4.1, we show how statistical stability of the family of dynamical systems may be used to establish the upper semi-continuity of the likelihood (S4).

The mixing condition (S5) involves control of the correlations of the observation densities along trajectories of the underlying dynamical system. Although the general topic of decay of correlations has been widely studied in dynamical systems (see [3] for an overview), condition (S5) is not implied by the particular decay of correlations properties that are typically studied for dynamical systems. Nonetheless, we show in Section 4.2 how some well-studied mixing properties of dynamical systems imply the mixing condition (S5).

Finally, condition (S6) involves the exponential identifiability of the true parameter $\theta_0$. We show in Section 4.3 how large deviations for a family of dynamical systems may be used to establish exponential identifiability (S6). Large deviations estimates for dynamical systems have been studied in [41, 50], and our main goal in Section 4.3 is to connect such results to exponential identifiability (S6).

Remark 3.3. Suppose one has a family of bi-variate stochastic processes $\{(X_k^\theta, Y_k^\theta) : \theta \in \Theta\}$, where $(X_k^\theta)$ is interpreted as a hidden process and $(Y_k^\theta)$ as an observation process. If the observations have conditional densities with respect to a common measure given $(X_k^\theta)$ and $\theta$, then it makes sense to ask whether maximum likelihood estimation is a consistent method of inference for the parameter $\theta$.

It is well known that the setting of stationary stochastic processes may be translated into the deterministic setting of dynamical systems, which may be carried out as follows. Let $\{(X_k^\theta) : \theta \in \Theta\}$ be a family of stationary stochastic processes on a measurable space $(X, \mathcal{X})$. Consider the product space $\hat{X} = X^{\otimes \mathbb{Z}}$ with corresponding $\sigma$-algebra $\hat{\mathcal{X}}$. Each process $(X_k^\theta)$ corresponds to a probability measure $\mu_\theta$ on $(\hat{X}, \hat{\mathcal{X}})$ with the property that $\mu_\theta$ is invariant under the left-shift map $T : \hat{X} \to \hat{X}$ given by $x = (x_i)_i \mapsto T(x) = (x_{i+1})_i$. With this translation, Theorem 3.1 shows that maximum likelihood estimation is consistent for families of hidden stochastic processes $(X_k^\theta)$ observed with noise, whenever the corresponding family of dynamical systems $(T, \mu_\theta)$ on $(\hat{X}, \hat{\mathcal{X}})$ with observation densities satisfies conditions (S1)-(S6).

With the above translation, Theorem 3.1 applies to some families of processes allowing infinite-range dependence in both the hidden process $(X_k^\theta)$ and the observation process $(Y_k^\theta)$. From this point of view, Theorem 3.1 highlights the fact that maximum likelihood estimation is consistent for dependent processes observed with noise as long as they satisfy some general conditions: ergodicity, logarithmic integrability of observations, continuous dependence on the parameters, and some mixing of the observation process. It is interesting to note that the existing work on consistency of maximum likelihood estimation for HMMs [11, 13, 15, 16, 18, 29, 30, 31] makes assumptions of precisely this sort in the specific context of Markov chains.

4. Statistical properties of dynamical systems. In our main consistency result (Theorem 3.1), we establish the consistency of any approximate MLE under conditions (S1)-(S6). We have chosen to formulate our result in these terms because they reflect general statistical properties of dynamical systems observed with noise that are relevant to parameter inference. However, these conditions have not been explicitly studied in the dynamical systems literature, despite the fact that much effort has been devoted to understanding certain statistical aspects of dynamical systems. In this section, we make connections between the general statistical conditions appearing in Theorem 3.1 and some well-studied properties of dynamical systems. Section 4.1 shows how the notion of statistical stability may be used to verify the upper semi-continuity of the likelihood (S4). Section 4.2 connects well-known mixing properties of some measure-preserving dynamical systems to the mixing property (S5). In Section 4.3, we show how large deviations for dynamical systems may be used to deduce the exponential identifiability condition (S6). Proofs of statements in this section, as well as additional discussion, appear in Supplementary Appendix A [32].

4.1. Statistical stability and continuity of $p_\theta$. As discussed in Remark 3.2, the upper semi-continuity condition (S4) places non-trivial restrictions on the family of dynamical systems under consideration. In this section, we establish sufficient conditions for (S4) to hold. The continuous dependence of $\mu_\theta$ on $\theta$ is a property called statistical stability in the dynamical systems literature [2, 17, 42, 47]. Let us state this property precisely. Let $M(X)$ denote the space of Borel probability measures on $X$. Endow $M(X)$ with the topology of weak convergence: $\mu_n$ converges to $\mu$ if $\int f \, d\mu_n$ converges to $\int f \, d\mu$ as $n$ tends to infinity, for each continuous, bounded function $f : X \to \mathbb{R}$. The family of dynamical systems $(T_\theta, \mu_\theta)_{\theta \in \Theta}$ on $(X, \mathcal{X})$ is said to have statistical stability if the map $\theta \mapsto \mu_\theta$ is continuous with respect to the weak topology on $M(X)$.

The following proposition shows that under some continuity and compactness assumptions, statistical stability of the family of dynamical systems implies upper semi-continuity of the likelihood (S4).

Proposition 4.1. Suppose that $X$ and $\Theta$ are compact and the maps $T : \Theta \times X \to X$ and $g : \Theta \times X \times Y \to \mathbb{R}_+$ are continuous. If the family $(T_\theta, \mu_\theta)_{\theta \in \Theta}$ has statistical stability, then upper semi-continuity of the likelihood (S4) holds.

The proof of Proposition 4.1 appears in Supplementary Appendix A.1 [32].

4.2. Mixing. In this section, we focus on the mixing condition (S5). Recall that (S5) involves a non-trivial restriction on the correlations of the observation densities $g_\theta$ along trajectories of the underlying dynamical system. Although mixing conditions have been widely studied in the dynamics literature, the particular type of condition appearing in (S5) appears not to have been investigated. Nonetheless, we show that a well-studied mixing property for dynamical systems implies the statistical mixing property (S5).

In order to study mixing for dynamical systems, one typically places restrictions on the type of events or observations that one considers (by considering certain functionals of the process). For example, in some situations a substantial amount of work has been devoted to finding particular partitions of state space with respect to which the system possesses good mixing properties (an example of such partitions is the well-known Markov partitions [9]). If a system has good mixing properties with respect to a particular partition, and if that partition possesses certain (topological) regularity properties, then it is often possible to show that the system also has good mixing properties for related function classes, such as Lipschitz or Hölder continuous observables. For variations of this approach to mixing in dynamical systems, see the vast literature on decay of correlations (for an introduction, see the survey [3]).

In this section, we follow the above approach to study the mixing condition (S5) for dynamical systems observed with noise. First, we define a mixing property for families of dynamical systems with respect to a partition (M1). Second, we define a regularity property for partitions (M2). Third, we define a topological regularity property for a family of observation densities (M3). Finally, in the main result of this section (Proposition 4.2), we show how these three properties together imply the mixing condition (S5).

Here and in the rest of this section, we consider only invertible transformations. It is certainly possible to modify the definitions slightly to handle the non-invertible case, but we omit such modifications.

We will have need to consider finite partitions of $X$. The join of two partitions $C_0$ and $C_1$ is defined to be the common refinement of $C_0$ and $C_1$, and it is denoted $C_0 \vee C_1$. Note that for any measurable transformation $T : X \to X$, if $C$ is a partition, then so is $T^{-1}C = \{T^{-1}A : A \in C\}$. For a fixed partition $C$ and $i \le j$, let $C_i^j = \bigvee_{k=i}^{j} T_\theta^{-k} C$. Notice that $C_i^j$ depends on $\theta$ through $T_\theta$, although we suppress this dependence in our notation. Now consider the following alternative conditions, which may be used in place of condition (S5).

(M1) Mixing condition with respect to the partition $\mathcal{C}$. There exist L : Θ → R+ and ℓ ≥ 0 such that for all θ ∈ Θ, m, n ≥ 0, $A \in \mathcal{C}_0^m$, and $B \in \mathcal{C}_0^n$, it holds that

\[
\mu_\theta\bigl(A \cap T_\theta^{-(m+\ell)} B\bigr) \le L_\theta\, \mu_\theta(A)\, \mu_\theta(B).
\]

Furthermore, for each θ′ ∉ [θ0] there exists a neighborhood U of θ′ such that

\[
\sup_{\theta \in U} L_\theta < \infty.
\]

(M2) Regularity of the partition $\mathcal{C}$. There exists β ∈ (0, 1) such that for all θ ∈ Θ and m, n ≥ 0, if $A \in \mathcal{C}_{-m}^{n}$ and x, z ∈ A, then

\[
d(x, z) \le \beta^{\min(m,n)}.
\]

(M3) Regularity of observations. There exists a function K : Θ × Y → R+ such that for y ∈ Y and x, z ∈ X,

\[
g_\theta(y \mid x) \le g_\theta(y \mid z)\, \exp\bigl(K(\theta, y)\, d(x, z)\bigr).
\]

Furthermore, for each θ′ ∉ [θ0], there exists a neighborhood U of θ′ such that

\[
\mathbb{E}_{\theta_0}\Bigl[\sup_{\theta \in U} K(\theta, Y_0)\Bigr] < \infty.
\]
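As a hedged illustration of the regularity condition (M2), the following sketch checks it by brute force on a finite window of the full shift on two symbols, with the time-zero partition. The metric d(x, z) = β^{min{|k| : x_k ≠ z_k}}, the value β = 1/2, the window radius, and all function names are assumptions made for this example and are not taken from the paper.

```python
from itertools import product

BETA = 0.5     # assumed contraction constant for the illustrative metric
RADIUS = 3     # a "point" is a window (x_{-RADIUS}, ..., x_{RADIUS})


def dist(x, z):
    """d(x, z) = BETA ** min{|k| : x_k != z_k}, and 0 if the windows agree."""
    diffs = [abs(k) for k in range(-RADIUS, RADIUS + 1)
             if x[k + RADIUS] != z[k + RADIUS]]
    return BETA ** min(diffs) if diffs else 0.0


def same_cell(x, z, m, n):
    """True if x and z lie in the same element of the join C_{-m}^{n}.

    For the time-zero partition of a shift, an element of the join over
    k = -m, ..., n fixes exactly the coordinates with indices -m, ..., n.
    """
    return all(x[k + RADIUS] == z[k + RADIUS] for k in range(-m, n + 1))


def check_m2(m, n):
    """Check d(x, z) <= BETA ** min(m, n) for all pairs in a common cell."""
    points = list(product((0, 1), repeat=2 * RADIUS + 1))
    return all(dist(x, z) <= BETA ** min(m, n)
               for x in points for z in points if same_cell(x, z, m, n))


results = {(m, n): check_m2(m, n)
           for m in range(RADIUS) for n in range(RADIUS)}
```

In this toy setting every entry of `results` should be true, because two points in a common cell of $\mathcal{C}_{-m}^{n}$ can first disagree only at an index of absolute value larger than min(m, n).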

Let us now state the main proposition of this section, whose proof is deferred to Supplementary Appendix A.2 [32].

Proposition 4.2. Suppose (Tθ, µθ)θ∈Θ is a family of dynamical systems on (X,X ) with corresponding observation densities (gθ)θ∈Θ. If there exists a partition $\mathcal{C}$ of X such that conditions (M1) and (M2) are satisfied, and if the observation regularity condition (M3) is satisfied, then the mixing property (S5) holds.



4.3. Exponential identifiability. In this section, we study the exponential identifiability condition (S6). We show how large deviations for dynamical systems may be used in combination with some regularity of the observation densities to establish exponential identifiability (S6).

Let X1 and X2 be metric spaces with metrics d1 and d2, respectively. Recall that a function f : X1 → X2 is said to be Hölder continuous if there exist α > 0 and C > 0 such that for each x, z in X1, it holds that

\[
d_2\bigl(f(x), f(z)\bigr) \le C\, d_1(x, z)^{\alpha}.
\]

If (T, µ) is a dynamical system on (X,X ) such that T : X → X is Hölder continuous, then we refer to (T, µ) as a Hölder continuous dynamical system. For many dynamical systems, the class of Hölder continuous functions f : X → R provides a natural class of observables whose statistical properties are fairly well understood and satisfy some large deviations estimates [41, 50].

Consider the following conditions, which we later show are sufficient to guarantee exponential identifiability (S6).

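(L1) Large deviations. For each θ ∉ [θ0], for each Hölder continuous function f : X → R, and for each δ > 0, it holds that

\[
\limsup_{n} \frac{1}{n} \log \mu_\theta\Bigl( \Bigl| \frac{1}{n} \sum_{k=0}^{n-1} f\bigl(T_\theta^{k}(x)\bigr) - \int f \, d\mu_\theta \Bigr| > \delta \Bigr) < 0.
\]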

(L2) Regularity of observations. There exist α > 0 and K : Θ × Y → R+ such that for each x and z in X, it holds that

\[
g_\theta(y \mid x) \le g_\theta(y \mid z)\, \exp\bigl(K(\theta, y)\, d(x, z)^{\alpha}\bigr).
\]

Furthermore, for θ ∈ Θ and C > 0, it holds that

\[
\sup_{x} \int \exp\bigl(C K(\theta, y)\bigr)\, g_\theta(y \mid x)\, d\nu(y) < \infty.
\]
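To make the large deviations condition (L1) concrete, here is a minimal sketch for the simplest case we could think of, which is an assumption of this example rather than a setting treated above: the full shift on {0, 1} with the Bernoulli(1/2) product measure and the locally constant observable f(x) = x_0. The Birkhoff sum is then Binomial(n, 1/2), so the probability in (L1) can be computed exactly and compared with the classical Cramér rate. The function names are hypothetical.

```python
from math import comb, log


def log_tail_rate(n, delta):
    """(1/n) log P(|S_n / n - 1/2| > delta) for S_n ~ Binomial(n, 1/2)."""
    count = sum(comb(n, j) for j in range(n + 1) if abs(j / n - 0.5) > delta)
    # The probability is count / 2**n; use logs of exact integers so that
    # large n does not underflow.
    return (log(count) - n * log(2)) / n


def cramer_rate(delta):
    """Cramer rate of a fair coin at 1/2 + delta (a relative entropy)."""
    p = 0.5 + delta
    return p * log(2 * p) + (1 - p) * log(2 * (1 - p))


rates = [log_tail_rate(n, 0.1) for n in (50, 200, 800, 3200)]
limit = -cramer_rate(0.1)
```

Each entry of `rates` is strictly negative, and the values approach `limit` as n grows, which is the behaviour that (L1) requires of every Hölder continuous observable.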

The following proposition relates large deviations for dynamical systems to the exponential identifiability condition (S6).

Proposition 4.3. Suppose that (Tθ, µθ)θ∈Θ is a family of Hölder continuous dynamical systems on (X,X ) with corresponding observation densities (gθ)θ∈Θ. Further suppose that the large deviations property (L1) and the observation regularity property (L2) are satisfied. Then the exponential identifiability condition (S6) holds.

The proof of Proposition 4.3 appears in Supplementary Appendix A.3 [32].



5. Examples. In this section we present some classical families of dynamical systems for which maximum likelihood estimation is consistent. We begin in Section 5.1 by considering symbolic dynamical systems called shifts of finite type. The state space for such systems consists of (bi-)infinite sequences of symbols from a finite set, and the transformation on the state space is always given by the “left-shift” map, which just shifts each point one coordinate to the left. Such systems are considered models of “chaotic” dynamical systems that may be defined by a finite amount of combinatorial information. In this setting Gibbs measures form a natural class of invariant measures, which have been studied due to their connections to statistical physics. These measures play a central role in a topic called the thermodynamic formalism, which is well described in the books [10, 43]. Note that k-th order finite state Markov chains form a special case of Gibbs measures. The main result of this section is Theorem 5.1, which states that under sufficient regularity conditions, any approximate maximum likelihood estimator is consistent for families of Gibbs measures on a shift of finite type. The crucial assumptions for this theorem involve continuous dependence of the Gibbs measures on θ and sufficiently regular dependence of gθ(y | x) on x. Additional proofs and discussion for this section appear in Supplementary Appendix B [32].

Having established consistency of maximum likelihood estimation for families of Gibbs measures on a shift of finite type, we deduce in Section 5.2 that maximum likelihood estimation is consistent for families of Axiom A attractors observed with noise. Axiom A systems are well-studied differentiable dynamical systems on manifolds that, like shifts of finite type, exhibit “chaotic” behavior (for a thorough treatment of Axiom A systems, see the book [10]). In related statistical work, Lalley [25] considered the problem of denoising the trajectories of Axiom A systems. For these systems, there is a natural class of measures, known as SRB (Sinai-Ruelle-Bowen) measures. See the article [52] for an introduction to these measures with discussion of their interpretation and importance. With the construction of Markov partitions [9, 10], one may view an Axiom A attractor with its SRB measure as a factor of a shift of finite type with a Gibbs measure. Using this natural factor structure, we establish the consistency of any approximate maximum likelihood estimator for Axiom A systems. Proofs and discussion of these topics appear in Supplementary Appendix C [32].

5.1. Gibbs measures. In this section, we consider the setting of symbolic dynamics, shifts of finite type, and Gibbs measures. We prove that any approximate maximum likelihood estimator is consistent for these systems (Theorem 5.1) under some general assumptions on the observations. Finally, we consider two examples of observations in greater detail. In the first example, we consider “discrete” observations, corresponding to a “noisy channel.” In the second example, we consider making real-valued observations with Gaussian observational noise. For a brief introduction to shifts of finite type and Gibbs measures that contains everything needed in this work, see Supplementary Appendix B [32]. For a complete introduction to shifts of finite type and Gibbs measures, see [10].

Let us now consider some families of measure-preserving systems on shifts of finite type (SFTs). Let A be an alphabet, and let M be a binary matrix with dimensions |A| × |A|. Let X = X_M be the associated SFT, and let X be the Borel σ-algebra on X. For α > 0, let $f : \Theta \to C^{\alpha}(X)$ be a continuous map, and let µθ be the Gibbs measure associated to the potential function fθ. In this setting, we refer to (µθ)θ∈Θ as a continuously parametrized family of Gibbs measures on (X,X ).
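As a small, hedged illustration of a Gibbs measure on an SFT, the sketch below treats only the zero potential on the golden mean shift (binary sequences with no two consecutive 1s). For the zero potential the Gibbs measure is the measure of maximal entropy, which is the Markov measure given by the classical Parry formula $P_{ij} = M_{ij} r_j / (\lambda r_i)$, where λ and r are the Perron eigenvalue and right eigenvector of M. The choice of M, the helper names, and the use of power iteration are assumptions of this example, not constructions from the paper.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]


def perron(M, iters=200):
    """Power iteration for the Perron eigenvalue and right eigenvector of M.

    The vector is rescaled to have maximum entry 1 at every step, so the
    scaling factor converges to the eigenvalue when M is primitive.
    """
    v = [1.0] * len(M)
    lam = 1.0
    for _ in range(iters):
        w = matvec(M, v)
        lam = max(w)
        v = [x / lam for x in w]
    return lam, v


M = [[1, 1], [1, 0]]                       # golden mean shift: "11" forbidden
lam, r = perron(M)                         # right Perron eigenvector
_, left = perron([list(col) for col in zip(*M)])   # left Perron eigenvector

# Parry formula for the transition matrix and its stationary vector.
P = [[M[i][j] * r[j] / (lam * r[i]) for j in range(2)] for i in range(2)]
norm = sum(left[i] * r[i] for i in range(2))
pi = [left[i] * r[i] / norm for i in range(2)]
```

Here `P` is a stochastic matrix supported on the allowed transitions of M, and `pi` is its stationary distribution. A non-zero potential depending on two coordinates could be handled in the same way after replacing each entry M_{ij} by M_{ij} exp(f(i, j)), but that extension is not shown here.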

Theorem 5.1. Suppose X = X_M is a mixing shift of finite type and (µθ)θ∈Θ is a continuously parametrized family of Gibbs measures on (X,X ). If the family of observation densities (gθ)θ∈Θ satisfies the integrability conditions (S2) and (S3) and the regularity conditions (M3) and (L2), then any approximate maximum likelihood estimator is consistent.

The proof of Theorem 5.1 is based on an appeal to Theorem 3.1. However, in order to verify the hypotheses of Theorem 3.1, we combine the results of Section 4 with some well-known properties of Gibbs measures. This proof appears in Supplementary Appendix B [32].

Remark 5.2. There is an analogous theory of “one-sided” symbolic dynamics and Gibbs measures, in which $A^{\mathbb{Z}}$ is replaced by $A^{\mathbb{N}}$ and appropriate modifications are made in the definitions. The two-sided case deals with invertible dynamical systems, whereas the one-sided case handles non-invertible systems. We have stated Theorem 5.1 in the invertible setting, although it applies as well in the non-invertible setting, with the obvious modifications.

Example 5.3. In this example, we consider families of dynamical systems (Tθ, µθ) on (X,X ), where X is a mixing shift of finite type, Tθ = σ|X, and µθ is a continuous family of Gibbs measures on X (as in Theorem 5.1). Here we consider the particular observation model in which our observations of X are passed through a discrete, memoryless, noisy channel. Suppose that Y is a finite set, ν is counting measure on Y, and for each symbol a in A and parameter θ in Θ, we have a probability distribution πθ(· | a) on Y. We consider the case that our observation densities gθ satisfy gθ(· | x) = πθ(· | x_0). This situation is covered by Theorem 5.1, since the following conditions may be easily verified: observation integrability (S2) and (S3) and observation regularity (M3) and (L2).
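The noisy channel model of Example 5.3 can be sketched numerically in the special case where µθ is a stationary Markov measure (a particular Gibbs measure), so that the observations form a hidden Markov model and the likelihood can be computed with the standard forward recursion. Everything specific below is an assumption made for illustration: the golden mean shift, the transition parameter `Q`, the binary symmetric channel with flip probability `EPS`, and the function name.

```python
from itertools import product


def forward_likelihood(obs, init, trans, chan):
    """Likelihood of obs = (y_0, ..., y_n) under a stationary Markov measure
    observed through a memoryless channel.

    Forward recursion: alpha_k(a) = P(y_0, ..., y_k, x_k = a).
    """
    states = range(len(init))
    alpha = [init[a] * chan[a][obs[0]] for a in states]
    for y in obs[1:]:
        alpha = [sum(alpha[a] * trans[a][b] for a in states) * chan[b][y]
                 for b in states]
    return sum(alpha)


Q = 0.3      # hypothetical parameter: probability of moving from 0 to 1
EPS = 0.1    # hypothetical channel flip probability

trans = [[1 - Q, Q], [1.0, 0.0]]         # the transition 1 -> 1 is forbidden
init = [1 / (1 + Q), Q / (1 + Q)]        # stationary distribution of trans
chan = [[1 - EPS, EPS], [EPS, 1 - EPS]]  # chan[a][y] plays the role of pi(y | a)

# Sanity check: the likelihoods of all observation words of length 4 sum to 1.
total = sum(forward_likelihood(obs, init, trans, chan)
            for obs in product((0, 1), repeat=4))
```

An approximate maximum likelihood estimate could then be obtained by maximizing `forward_likelihood` over a grid of (`Q`, `EPS`) values; that search is omitted here.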

Example 5.4. In this example, we once again consider families of dynamical systems (Tθ, µθ) on (X,X ), such that X is a mixing shift of finite type, Tθ = σ|X, and µθ is a continuous family of Gibbs measures on X (as in Theorem 5.1). Here we consider the particular observation model in which we make real-valued, parameter-dependent measurements of the system, which are corrupted by Gaussian noise with parameter-dependent variance. More precisely, let us assume that Y = R, and there exist a Lipschitz continuous ϕ : Θ × X → R and a continuous s : Θ → (0,∞) such that

\[
g_\theta(y \mid x) = \frac{1}{s(\theta)\sqrt{2\pi}} \exp\Bigl( -\frac{1}{2 s(\theta)^2} \bigl(\varphi_\theta(x) - y\bigr)^2 \Bigr).
\]

We now proceed to verify conditions (S2), (S3), (M3), and (L2). First, by compactness and continuity, there exist C1, C2, C3 > 0 such that for θ in Θ, y in Y and x in X, it holds that

\[
C_1^{-1} \exp(-C_2 y^2) \le g_\theta(y \mid x) \le C_1 \exp(-C_3 y^2). \tag{5.1}
\]

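From (5.1), one easily obtains the observation integrability conditions (S2) and (S3). Furthermore, there exist C4, C5 > 0 such that for x, z ∈ X, it holds that

\[
\begin{aligned}
\frac{g_\theta(y \mid x)}{g_\theta(y \mid z)}
&= \exp\Bigl( -\frac{1}{2 s(\theta)^2} \bigl[ (\varphi_\theta(x) - y)^2 - (\varphi_\theta(z) - y)^2 \bigr] \Bigr) \\
&= \exp\Bigl( -\frac{1}{2 s(\theta)^2} \bigl[ (\varphi_\theta(x) - \varphi_\theta(z))(\varphi_\theta(x) + \varphi_\theta(z)) + 2y(\varphi_\theta(z) - \varphi_\theta(x)) \bigr] \Bigr) \\
&\le \exp\bigl( (C_4 + C_5 |y|)\, |\varphi_\theta(x) - \varphi_\theta(z)| \bigr).
\end{aligned}
\tag{5.2}
\]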

Let ϕ be Lipschitz continuous with constant C6, and let K(θ, y) = C6(C4 + C5|y|). With this choice of K and (5.2), one may easily verify the observation regularity conditions (M3) and (L2).
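The bound (5.2) can be checked numerically. The sketch below is an illustration under explicit assumptions that are not in the paper: |ϕθ(x)| is bounded by `PHI_MAX`, s(θ) is bounded below by `S_MIN`, and the constants are taken to be C4 = PHI_MAX / S_MIN² and C5 = 1 / S_MIN², which is one admissible choice obtained by bounding |ϕθ(x) + ϕθ(z)| by 2·PHI_MAX in the second line of (5.2).

```python
import math
import random


def gauss_density(y, mean, s):
    """Gaussian observation density with mean phi_theta(x) and scale s(theta)."""
    return math.exp(-((mean - y) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))


PHI_MAX = 1.0                  # assumed bound on |phi_theta(x)|
S_MIN = 0.5                    # assumed lower bound on s(theta)
C4 = PHI_MAX / S_MIN ** 2      # one admissible choice of the constant C4
C5 = 1.0 / S_MIN ** 2          # one admissible choice of the constant C5

rng = random.Random(0)
worst = -math.inf              # largest observed value of log-ratio minus bound
for _ in range(10000):
    a = rng.uniform(-PHI_MAX, PHI_MAX)   # stands in for phi_theta(x)
    b = rng.uniform(-PHI_MAX, PHI_MAX)   # stands in for phi_theta(z)
    s = rng.uniform(S_MIN, 2.0)
    y = rng.uniform(-5.0, 5.0)
    log_ratio = math.log(gauss_density(y, a, s)) - math.log(gauss_density(y, b, s))
    bound = (C4 + C5 * abs(y)) * abs(a - b)
    worst = max(worst, log_ratio - bound)
```

If the bound holds, `worst` is non-positive up to floating-point rounding. Multiplying by a Lipschitz constant for ϕ then gives a function K(θ, y) that grows linearly in |y|, which is the form needed for the exponential moment requirement in (L2) under Gaussian tails.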

Remark 5.5. Similar calculations to those in Example 5.4 imply that any approximate maximum likelihood estimator is also consistent if the observational noise is “double-exponential” (i.e. $g_\theta(y \mid x) \propto e^{-|y - x|}$). Indeed, these calculations should hold for most members of the exponential family, although we do not pursue them here.



5.2. Axiom A systems. In this section, we show how the previous results may be applied to some smooth (differentiable) families of dynamical systems. These results follow easily from the results in Section 5.1, using the work of Bowen and others (see [9, 10] and references therein) in constructing Markov partitions for these systems. With Markov partitions, Axiom A systems may be viewed as factors of shifts of finite type with Gibbs measures. For a brief introduction to Axiom A systems that contains the details necessary for this work, see Supplementary Appendix C [32].

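The basic fact that allows us to transfer our results from shifts of finite type to Axiom A systems is that consistency of maximum likelihood estimation is preserved under taking appropriate factors. Let us now make this statement precise. Suppose that (Tθ, µθ)θ∈Θ is a family of dynamical systems on (X,X ) with observation densities (gθ)θ∈Θ. Further, suppose that there are continuous maps $\pi : \Theta \times \tilde{X} \to X$ and $\tilde{T} : \Theta \times \tilde{X} \to \tilde{X}$ such that

i) for each θ, we have that $\pi_\theta \circ \tilde{T}_\theta = T_\theta \circ \pi_\theta$;
ii) for each θ, there is a unique probability measure $\tilde{\mu}_\theta$ on $\tilde{X}$ such that $\tilde{\mu}_\theta \circ \pi_\theta^{-1} = \mu_\theta$;
iii) for each θ, the map $\pi_\theta$ is injective $\tilde{\mu}_\theta$-a.s.

For x in $\tilde{X}$ and θ in Θ, define $\tilde{g}_\theta(\cdot \mid x) = g_\theta(\cdot \mid \pi_\theta(x))$. Then $(\tilde{T}_\theta, \tilde{\mu}_\theta)_{\theta \in \Theta}$ is a family of dynamical systems on $(\tilde{X}, \tilde{\mathcal{X}})$ with observation densities $(\tilde{g}_\theta)_{\theta \in \Theta}$. In this situation, we say that $(T_\theta, \mu_\theta, g_\theta)_{\theta \in \Theta}$ is an isomorphic factor of $(\tilde{T}_\theta, \tilde{\mu}_\theta, \tilde{g}_\theta)_{\theta \in \Theta}$, and π is the factor map. The following proposition addresses the consistency of maximum likelihood estimation for isomorphic factors. Its proof is straightforward and omitted.

Proposition 5.6. Suppose that $(T_\theta, \mu_\theta, g_\theta)_{\theta \in \Theta}$ is an isomorphic factor of $(\tilde{T}_\theta, \tilde{\mu}_\theta, \tilde{g}_\theta)_{\theta \in \Theta}$. Then maximum likelihood estimation is consistent for $(T_\theta, \mu_\theta, g_\theta)_{\theta \in \Theta}$ if and only if maximum likelihood estimation is consistent for $(\tilde{T}_\theta, \tilde{\mu}_\theta, \tilde{g}_\theta)_{\theta \in \Theta}$.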

For the sake of brevity, we defer precise definitions for Axiom A systems to Supplementary Appendix C [32].

We consider families of Axiom A systems as follows. Suppose that f : Θ × X → X is a parametrized family of diffeomorphisms such that

i) θ ↦ fθ is Hölder continuous;
ii) there exists α > 0 such that for each θ, the map fθ is $C^{1+\alpha}$;
iii) for each θ, Ω(fθ) is an Axiom A attractor and the restriction fθ|Ω(fθ) is topologically mixing;
iv) for each θ, the measure µθ is the unique SRB measure corresponding to fθ [10, Theorem 4.1].

If these conditions are satisfied, then we say that (fθ, µθ)θ∈Θ is a parametrized family of Axiom A systems on (X,X ).

Theorem 5.7. Suppose that (fθ, µθ)θ∈Θ is a parametrized family of Axiom A systems on (X,X ). Further, suppose that (gθ)θ∈Θ is a family of observation densities satisfying the following conditions: observation integrability (S2) and (S3) and observation regularity (M3) and (L2). Then maximum likelihood estimation is consistent.

The proof of Theorem 5.7 appears in Supplementary Appendix C [32].

6. Proof of the main result. Propositions 6.1–6.5 are used in the proof of Theorem 3.1, which is given at the end of the present section.

Proposition 6.1. Suppose that condition (S1) (ergodicity) holds. Then the process $(Y_k)$ is ergodic under $\mathbb{P}^Y_{\theta_0}$.

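Proof. Let m > 0 be arbitrary, and let A and B be Borel subsets of $\mathcal{Y}^{m+1}$. To obtain the ergodicity of $\{Y_k\}_k$, it suffices to show that (see [36])

\[
\lim_{n} \frac{1}{n} \sum_{k=0}^{n} \mathbb{P}^Y_{\theta_0}\bigl(Y_0^m \in A,\ Y_k^{k+m} \in B\bigr) = \mathbb{P}^Y_{\theta_0}(Y_0^m \in A)\, \mathbb{P}^Y_{\theta_0}(Y_0^m \in B). \tag{6.1}
\]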

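For x ∈ X, define

\[
\eta_A(x) = \int \mathbf{1}_A(y_0^m)\, p_{\theta_0}(y_0^m \mid x)\, d\nu^m(y_0^m),
\]

and define $\eta_B(x)$ similarly. For k > m, by the conditional independence of $Y_0^m$ and $Y_k^{k+m}$ given θ0 and X0 = x, we have that

\[
\begin{aligned}
\mathbb{P}^Y_{\theta_0}\bigl(Y_0^m \in A,\ Y_k^{k+m} \in B\bigr)
&= \int\!\!\int \mathbf{1}_A(y_0^m)\, \mathbf{1}_B(y_k^{k+m})\, p_{\theta_0}(y_0^{n+m} \mid x)\, d\nu^{n+m}(y_0^{n+m})\, d\mu_{\theta_0}(x) \\
&= \int \Bigl( \int \mathbf{1}_A(y_0^m)\, p_{\theta_0}(y_0^m \mid x)\, d\nu^m(y_0^m) \cdot \int \mathbf{1}_B(y_k^{k+m})\, p_{\theta_0}\bigl(y_k^{k+m} \mid T_{\theta_0}^k(x)\bigr)\, d\nu^m(y_k^{k+m}) \Bigr) d\mu_{\theta_0}(x) \\
&= \int \eta_A(x)\, \eta_B\bigl(T_{\theta_0}^k(x)\bigr)\, d\mu_{\theta_0}(x),
\end{aligned}
\]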



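where we have used Fubini’s theorem. Since m is fixed, we have that

\[
\begin{aligned}
\lim_{n} \frac{1}{n} \sum_{k=0}^{n} \mathbb{P}^Y_{\theta_0}\bigl(Y_0^m \in A,\ Y_k^{k+m} \in B\bigr)
&= \lim_{n} \Bigl( \frac{1}{n} \sum_{k=0}^{m} \mathbb{P}^Y_{\theta_0}\bigl(Y_0^m \in A,\ Y_k^{k+m} \in B\bigr) + \frac{1}{n} \sum_{k=m+1}^{n} \int \eta_A(x)\, \eta_B\bigl(T_{\theta_0}^k(x)\bigr)\, d\mu_{\theta_0}(x) \Bigr) \\
&= \lim_{n} \frac{1}{n} \sum_{k=m+1}^{n} \int \eta_A(x)\, \eta_B\bigl(T_{\theta_0}^k(x)\bigr)\, d\mu_{\theta_0}(x).
\end{aligned}
\]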

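Since $(T_{\theta_0}, \mu_{\theta_0})$ is ergodic, an alternative characterization of ergodicity (see [36]) gives that

\[
\begin{aligned}
\lim_{n} \frac{1}{n} \sum_{k=0}^{n} \mathbb{P}^Y_{\theta_0}\bigl(Y_0^m \in A,\ Y_k^{k+m} \in B\bigr)
&= \lim_{n} \frac{1}{n} \sum_{k=m+1}^{n} \int \eta_A(x)\, \eta_B\bigl(T_{\theta_0}^k(x)\bigr)\, d\mu_{\theta_0}(x) \\
&= \int \eta_A(x)\, d\mu_{\theta_0}(x) \int \eta_B(x)\, d\mu_{\theta_0}(x) \\
&= \mathbb{P}^Y_{\theta_0}(Y_0^m \in A)\, \mathbb{P}^Y_{\theta_0}(Y_0^m \in B).
\end{aligned}
\]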

Thus we have verified Equation (6.1), and the proof is finished.

For the following propositions, recall our notation that

\[
\gamma_\theta(y) = \sup_{x} g_\theta(y \mid x).
\]

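Proposition 6.2. Suppose that conditions (S1) and (S2) hold. Then there exists h(θ0) ∈ (−∞,∞) such that

\[
h(\theta_0) = \lim_{n} \mathbb{E}_{\theta_0}\Bigl( \frac{1}{n} \log p_{\theta_0}(Y_0^n) \Bigr).
\]

Moreover, the following equality holds $\mathbb{P}_{\theta_0}$-a.s.:

\[
h(\theta_0) = \lim_{n} \frac{1}{n} \log p_{\theta_0}(Y_0^n).
\]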

Proof. The proposition is a direct application of Barron’s generalized Shannon-McMillan-Breiman Theorem [4]. Here we simply check that the hypotheses of that theorem hold in our setting. Since condition (S1) (ergodicity) holds, Proposition 6.1 gives that $(Y_k)$ is stationary and ergodic under $\mathbb{P}_{\theta_0}$. By definition, $Y_0^n$ has density $p_{\theta_0}(Y_0^n)$ with respect to the σ-finite measure $\nu^n$. The measure $\nu^n$ is a product of the measure ν taken n+1 times. As such, the sequence $\{\nu^n\}$ clearly satisfies Barron’s condition that this sequence is “Markov with stationary transitions.” Define $D_n = \mathbb{E}_{\theta_0}(\log p_{\theta_0}(Y_0^{n+1})) - \mathbb{E}_{\theta_0}(\log p_{\theta_0}(Y_0^n))$. Let us show that for n > 0, we have that

\[
\mathbb{E}_{\theta_0}\bigl( |\log p_{\theta_0}(Y_0^n)| \bigr) < \infty, \tag{6.2}
\]

which clearly implies that −∞ < $D_n$ < ∞. Once (6.2) is established, we will have verified all of the hypotheses of Barron’s generalized Shannon-McMillan-Breiman Theorem, and the proof of the proposition will be complete.

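Observe that the first part of the integrability condition (S2) gives that

\[
\mathbb{E}_{\theta_0}\bigl[ \log^+ p_{\theta_0}(Y_0^n) \bigr] \le (n+1)\, \mathbb{E}_{\theta_0}\bigl[ \log^+ \gamma_{\theta_0}(Y_0) \bigr] < \infty. \tag{6.3}
\]

Then the second part of the integrability condition (S2) implies that

\[
\begin{aligned}
\mathbb{E}_{\theta_0}\bigl[ \log p_{\theta_0}(Y_0^n) \bigr]
&= \mathbb{E}_{\theta_0}\Bigl[ \log \frac{p_{\theta_0}(Y_0^n)}{\prod_{k=0}^{n} \int g_{\theta_0}(Y_k \mid x)\, d\mu_{\theta_0}(x)} \Bigr] + \mathbb{E}_{\theta_0}\Bigl[ \sum_{k=0}^{n} \log \int g_{\theta_0}(Y_k \mid x)\, d\mu_{\theta_0}(x) \Bigr] \\
&\ge -(n+1)\, \mathbb{E}_{\theta_0}\Bigl[ \Bigl| \log \int g_{\theta_0}(Y_0 \mid x)\, d\mu_{\theta_0}(x) \Bigr| \Bigr] > -\infty,
\end{aligned}
\tag{6.4}
\]

where we have used that relative entropy is non-negative. By (6.3) and (6.4), we conclude that (6.2) holds, which completes the proof.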

The following proposition is used in the proof of Theorem 3.1 to give an almost sure bound for the normalized log-likelihoods in terms of quantities involving only expectations.

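Proposition 6.3. Suppose that conditions (S1), (S3), and (S5) hold. Let ℓ be as in condition (S5). Then for θ′ ∉ [θ0], there exists a neighborhood U of θ′ such that for each m > 0, the following inequality holds $\mathbb{P}_{\theta_0}$-a.s.:

\[
\limsup_{n \to \infty} \sup_{\theta \in U} \frac{1}{n} \log p_\theta(Y_0^n)
\le \frac{1}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl( \sup_{\theta \in U} \log p_\theta(Y_0^m) \Bigr)
+ \frac{\ell}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl( \sup_{\theta \in U} \log^+ \gamma_\theta(Y_0) \Bigr)
+ \frac{1}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl( \sup_{\theta \in U} \log C_m(\theta, Y_0^m) \Bigr).
\]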

Informally, in the proof of Proposition 6.3, we use the mixing property from condition (S5) to parse a sequence of observations into alternating sequences of “large blocks” and “small blocks,” and then the ergodicity and integrability conditions finish the proof. More specifically, we break up the sequence of observations $Y_0^n$ into alternating blocks of length m and ℓ, where ℓ is given by condition (S5).

Proof. Let θ′ ∉ [θ0]. Fix a neighborhood U of θ′ so that the conclusions of both condition (S3) and condition (S5) hold. Let m > 0 be arbitrary, and let ℓ be as in condition (S5). We consider sequences of observations of length n, where n is a large integer. These sequences of observations will be parsed into alternating blocks of lengths m and ℓ, respectively, starting from an offset of size s and possibly ending with a remainder sequence. For the sake of notation, we use interval notation to denote intervals of integers. For n > 2(m+ℓ) and s in [0, m+ℓ), let R = R(s, m, ℓ, n) ∈ [0, m+ℓ) and k = k(s, m, ℓ, n) ≥ 0 be defined by the condition n = s + k(m+ℓ) + R. Then we partition [0, n] as follows:

\[
\begin{aligned}
B_s &= [0, s), \\
I_s(j) &= \bigl[ s + (m+\ell)(j-1),\ s + (m+\ell)(j-1) + m \bigr), && \text{for } 1 \le j \le k, \\
J_s(j) &= \bigl[ s + (m+\ell)(j-1) + m,\ s + (m+\ell)j \bigr), && \text{for } 1 \le j \le k, \\
E_s &= \bigl[ s + k(m+\ell),\ n \bigr].
\end{aligned}
\]

Given a sequence $Y_0^n$ of observations, we define subsequences of $Y_0^n$ according to the above partitions of [0, n]:

\[
\begin{aligned}
b_s &= Y|_{B_s}, \\
w_s(j) &= Y|_{I_s(j)}, && \text{for } 1 \le j \le k, \\
v_s(j) &= Y|_{J_s(j)}, && \text{for } 1 \le j \le k, \\
e_s &= Y|_{E_s}.
\end{aligned}
\]
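The parsing of [0, n] into an offset block, alternating blocks of lengths m and ℓ, and a remainder can be sketched as follows. The function name and the use of Python `range` objects are assumptions of this illustration; the sketch takes the remainder block to start at s + k(m+ℓ), so that the pieces cover {0, ..., n} exactly once.

```python
def parse_blocks(n, m, ell, s):
    """Split {0, ..., n} into B_s, I_s(j), J_s(j) for 1 <= j <= k, and E_s.

    k and R are determined by n = s + k * (m + ell) + R with
    0 <= R < m + ell. Half-open integer intervals become Python ranges.
    """
    k, R = divmod(n - s, m + ell)
    B = range(0, s)
    I = [range(s + (m + ell) * (j - 1), s + (m + ell) * (j - 1) + m)
         for j in range(1, k + 1)]
    J = [range(s + (m + ell) * (j - 1) + m, s + (m + ell) * j)
         for j in range(1, k + 1)]
    E = range(s + k * (m + ell), n + 1)   # closed at n, so it has R + 1 points
    return B, I, J, E


def flatten(B, I, J, E):
    """List all indices in the order in which the blocks occur."""
    out = list(B)
    for long_block, short_block in zip(I, J):
        out.extend(long_block)
        out.extend(short_block)
    out.extend(E)
    return out


B, I, J, E = parse_blocks(n=20, m=3, ell=2, s=1)
covered = flatten(B, I, J, E)
```

Averaging over the offsets s = 0, ..., m+ℓ−1 is what later lets every window of length m+1 in the observation sequence appear as one of the long blocks, which is how the sums over blocks are related to ergodic averages.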



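For a sequence $y_0^t$ in $\mathcal{Y}^{t+1}$, define

\[
\gamma_\theta(y_0^t) = \prod_{j=0}^{t} \gamma_\theta(y_j) = \prod_{j=0}^{t} \sup_{x} g_\theta(y_j \mid x).
\]

Then for θ in U, it follows from condition (S5) that

\[
p_\theta(Y_0^n) \le \gamma_\theta(b_s)\, \gamma_\theta(e_s) \cdot \prod_{j=1}^{k} \gamma_\theta(v_s(j)) \cdot \prod_{j=1}^{k} C_m(\theta, w_s(j)) \cdot \prod_{j=1}^{k} p_\theta(w_s(j)).
\]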

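Taking the logarithm of both sides and averaging over s in [0, m+ℓ), we obtain

\[
\begin{aligned}
\log p_\theta(Y_0^n) \le{}& \frac{1}{m+\ell} \sum_{s=0}^{m+\ell-1} \sum_{j=1}^{k} \bigl[ \log p_\theta(w_s(j)) + \log C_m(\theta, w_s(j)) \bigr] \\
&+ \frac{1}{m+\ell} \sum_{s=0}^{m+\ell-1} \sum_{j=1}^{k} \log \gamma_\theta(v_s(j)) \\
&+ \frac{1}{m+\ell} \sum_{s=0}^{m+\ell-1} \bigl[ \log \gamma_\theta(b_s) + \log \gamma_\theta(e_s) \bigr].
\end{aligned}
\tag{6.5}
\]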

Let us now take the supremum over θ in U in (6.5) and evaluate the limits of the three terms on the right-hand side as n tends to infinity.

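Let $\xi_1 : \mathcal{Y}^{m+1} \to \mathbb{R}$ and $\xi_2 : \mathcal{Y}^{m+1} \to \mathbb{R}$ be defined by

\[
\xi_1(y_0^m) = \sup_{\theta \in U} \log p_\theta(y_0^m), \qquad \xi_2(y_0^m) = \sup_{\theta \in U} \log C_m(\theta, y_0^m).
\]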

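With this notation, we have that

\[
\frac{1}{n} \sum_{s=0}^{m+\ell-1} \sum_{j=1}^{k} \Bigl[ \sup_{\theta \in U} \log p_\theta(w_s(j)) + \sup_{\theta \in U} \log C_m(\theta, w_s(j)) \Bigr]
= \frac{1}{n} \sum_{i=0}^{n} \bigl[ \xi_1(Y_i^{i+m}) + \xi_2(Y_i^{i+m}) \bigr].
\]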

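Since $(Y_k)$ is ergodic (by Proposition 6.1), it follows from Birkhoff’s ergodic theorem and conditions (S3) and (S5) that the following limit exists $\mathbb{P}_{\theta_0}$-a.s.:

\[
\begin{aligned}
\lim_{n} \frac{1}{n} \sum_{s=0}^{m+\ell-1} \sum_{j=1}^{k} \Bigl[ \sup_{\theta \in U} \log p_\theta(w_s(j)) + \sup_{\theta \in U} \log C_m(\theta, w_s(j)) \Bigr]
&= \lim_{n} \frac{1}{n} \sum_{i=0}^{n} \bigl[ \xi_1(Y_i^{i+m}) + \xi_2(Y_i^{i+m}) \bigr] \\
&= \mathbb{E}_{\theta_0}\bigl[ \xi_1(Y_0^m) \bigr] + \mathbb{E}_{\theta_0}\bigl[ \xi_2(Y_0^m) \bigr] \\
&= \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U} \log p_\theta(Y_0^m) \Bigr] + \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U} \log C_m(\theta, Y_0^m) \Bigr].
\end{aligned}
\tag{6.6}
\]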

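Similarly, using Birkhoff’s ergodic theorem and condition (S3), we have that the following holds $\mathbb{P}_{\theta_0}$-a.s.:

\[
\begin{aligned}
\limsup_{n} \frac{1}{n} \sum_{s=0}^{m+\ell-1} \sum_{j=1}^{k} \sup_{\theta \in U} \log \gamma_\theta(v_s(j))
&\le \limsup_{n} \frac{1}{n} \sum_{i=0}^{n} \sup_{\theta \in U} \log^+ \gamma_\theta(Y_i^{i+\ell-1}) \\
&\le \ell\, \limsup_{n} \frac{1}{n} \sum_{i=0}^{n} \sup_{\theta \in U} \log^+ \gamma_\theta(Y_i) \\
&= \ell\, \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U} \log^+ \gamma_\theta(Y_0) \Bigr].
\end{aligned}
\tag{6.7}
\]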

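Lastly, Birkhoff’s ergodic theorem and condition (S3) again imply that the following limit holds $\mathbb{P}_{\theta_0}$-a.s.:

\[
\lim_{n} \frac{1}{n} \sum_{s=0}^{m+\ell-1} \Bigl[ \sup_{\theta \in U} \log^+ \gamma_\theta(b_s) + \sup_{\theta \in U} \log^+ \gamma_\theta(e_s) \Bigr] = 0, \tag{6.8}
\]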

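where we have used that max(|B_s|, |E_s|) ≤ m+ℓ.

Combining the inequalities in (6.5)-(6.8), we obtain that

\[
\limsup_{n \to \infty} \sup_{\theta \in U} \frac{1}{n} \log p_\theta(Y_0^n)
\le \frac{1}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U} \log p_\theta(Y_0^m) \Bigr]
+ \frac{1}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U} \log C_m(\theta, Y_0^m) \Bigr]
+ \frac{\ell}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U} \log^+ \gamma_\theta(Y_0) \Bigr],
\]

as desired.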



The following proposition is a direct application of Lemma 10 in [15] to the present setting, and we omit the proof.

Proposition 6.4. Suppose that the following conditions hold: ergodicity (S1), logarithmic integrability at θ0 (S2), and exponential identifiability (S6). Then for θ ∉ [θ0], it holds that

\[
\limsup_{n} \frac{1}{n}\, \mathbb{E}_{\theta_0}\bigl[ \log p_\theta(Y_0^n) \bigr] < h(\theta_0).
\]

The following proposition provides an essential estimate in the proof of Theorem 3.1.

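Proposition 6.5. Suppose that conditions (S1)-(S6) hold, and let ℓ be as in (S5). Then for θ′ ∉ [θ0], there exist m > 0 and a neighborhood U of θ′ such that

\[
h(\theta_0) > \frac{1}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U} \log p_\theta(Y_0^m) \Bigr]
+ \frac{\ell}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U} \log^+ \gamma_\theta(Y_0) \Bigr]
+ \frac{1}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U} \log C_m(\theta, Y_0^m) \Bigr].
\]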

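Proof. Suppose θ′ ∉ [θ0]. By Proposition 6.4, there exists ε > 0 such that

\[
\limsup_{n} \frac{1}{n}\, \mathbb{E}_{\theta_0}\bigl[ \log p_{\theta'}(Y_0^n) \bigr] < h(\theta_0) - \varepsilon. \tag{6.9}
\]

By conditions (S3) (logarithmic integrability away from θ0) and (S5) (mixing), there exist a neighborhood U′ of θ′ and m0 > 0 such that for m ≥ m0, we have that

\[
\varepsilon/2 > \frac{\ell}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U'} \log^+ \gamma_\theta(Y_0) \Bigr]
+ \frac{1}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U'} \log C_m(\theta, Y_0^m) \Bigr]. \tag{6.10}
\]

Fix m ≥ m0 such that

\[
\frac{1}{m+\ell}\, \mathbb{E}_{\theta_0}\bigl[ \log p_{\theta'}(Y_0^m) \bigr] < \limsup_{n} \frac{1}{n}\, \mathbb{E}_{\theta_0}\bigl[ \log p_{\theta'}(Y_0^n) \bigr] + \varepsilon/4. \tag{6.11}
\]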



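For η > 0, let B(θ′, η) denote the ball of radius η about θ′ in Θ. For η such that B(θ′, η) ⊂ U′, we have that

\[
\sup_{\theta \in B(\theta', \eta)} \log p_\theta(Y_0^m) \le \sum_{k=0}^{m} \sup_{\theta \in U'} \log^+ \gamma_\theta(Y_k).
\]

The sum above is integrable with respect to $\mathbb{P}_{\theta_0}$ and does not depend on η. Then (the reverse) Fatou’s Lemma implies that

\[
\limsup_{\eta \to 0} \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in B(\theta', \eta)} \log p_\theta(Y_0^m) \Bigr]
\le \mathbb{E}_{\theta_0}\Bigl[ \limsup_{\eta \to 0} \sup_{\theta \in B(\theta', \eta)} \log p_\theta(Y_0^m) \Bigr].
\]

By condition (S4) (upper semi-continuity of $\theta \mapsto p_\theta(Y_0^m)$), we see that

\[
\mathbb{E}_{\theta_0}\Bigl[ \limsup_{\eta \to 0} \sup_{\theta \in B(\theta', \eta)} \log p_\theta(Y_0^m) \Bigr]
\le \mathbb{E}_{\theta_0}\bigl[ \log p_{\theta'}(Y_0^m) \bigr].
\]

Now by an appropriate choice of η > 0, we have shown that there exists a neighborhood U ⊂ U′ of θ′ such that

\[
\frac{1}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U} \log p_\theta(Y_0^m) \Bigr]
< \frac{1}{m+\ell}\, \mathbb{E}_{\theta_0}\bigl[ \log p_{\theta'}(Y_0^m) \bigr] + \varepsilon/4. \tag{6.12}
\]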

Combining the estimates (6.9)-(6.12), we obtain the desired inequality.
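For the reader's convenience, the combination can be written out as follows. This chain is our reading of the argument rather than text from the source, and it uses U ⊂ U′ to pass from the suprema over U′ in (6.10) to suprema over U.

\[
\begin{aligned}
\frac{1}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U} \log p_\theta(Y_0^m) \Bigr]
&< \frac{1}{m+\ell}\, \mathbb{E}_{\theta_0}\bigl[ \log p_{\theta'}(Y_0^m) \bigr] + \frac{\varepsilon}{4} && \text{by (6.12)} \\
&< \limsup_{n} \frac{1}{n}\, \mathbb{E}_{\theta_0}\bigl[ \log p_{\theta'}(Y_0^n) \bigr] + \frac{\varepsilon}{2} && \text{by (6.11)} \\
&< h(\theta_0) - \frac{\varepsilon}{2} && \text{by (6.9).}
\end{aligned}
\]

Since the two remaining terms, with suprema over U in place of U′, sum to less than ε/2 by (6.10), the total is strictly less than h(θ0).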

Proof of Theorem 3.1. Let h(θ0) be defined as in Proposition 6.2. We prove the theorem by showing the following statement: for each closed set C in Θ such that C ∩ [θ0] = ∅, it holds that

\[
\limsup_{n} \sup_{\theta \in C} \frac{1}{n} \log p_\theta(Y_0^n) < h(\theta_0). \tag{6.13}
\]

Let C be a closed subset of Θ such that C ∩ [θ0] = ∅. Since Θ is compact, C is compact. Suppose that for each θ′ ∈ C, there exists a neighborhood U of θ′ such that

\[
\limsup_{n} \sup_{\theta \in U \cap C} \frac{1}{n} \log p_\theta(Y_0^n) < h(\theta_0). \tag{6.14}
\]

Then by compactness, we would conclude that (6.13) holds and thus complete the proof of the theorem.

Let θ′ be in C. Let us now show that there exists a neighborhood U of θ′ such that (6.14) holds. Since θ′ is in C, we have that θ′ ∉ [θ0]. Let ℓ be as in (S5). By Proposition 6.5, there exist m > 0 and a neighborhood U′ of θ′ such that

\[
h(\theta_0) > \frac{1}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U'} \log p_\theta(Y_0^m) \Bigr]
+ \frac{\ell}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U'} \log^+ \gamma_\theta(Y_0) \Bigr]
+ \frac{1}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U'} \log C_m(\theta, Y_0^m) \Bigr]. \tag{6.15}
\]

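By Proposition 6.3, there exists a neighborhood U ⊂ U′ of θ′ such that

\[
\limsup_{n \to \infty} \sup_{\theta \in U} \frac{1}{n} \log p_\theta(Y_0^n)
\le \frac{1}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U} \log p_\theta(Y_0^m) \Bigr]
+ \frac{\ell}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U} \log^+ \gamma_\theta(Y_0) \Bigr]
+ \frac{1}{m+\ell}\, \mathbb{E}_{\theta_0}\Bigl[ \sup_{\theta \in U} \log C_m(\theta, Y_0^m) \Bigr]. \tag{6.16}
\]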

Combining (6.15) and (6.16), we obtain (6.14), which completes the proof of the theorem.

7. Concluding remarks. In this paper, we demonstrate how the properties of a family of dynamical systems affect the asymptotic consistency of maximum likelihood parameter estimation. We have exhibited a collection of general statistical conditions on families of dynamical systems observed with noise, and we have shown that under these general conditions, maximum likelihood estimation is a consistent method of parameter estimation. Furthermore, we have shown that these general conditions are indeed satisfied by some classes of well-studied families of dynamical systems. As mentioned in the introduction, our results can be considered as a theoretical validation of the notion from dynamical systems that these classes of systems have “good” statistical properties.

However, there remain interesting families of systems to which our results do not apply, including some classes of systems that are also believed to have “good” statistical properties. In particular, the class of systems modeled by Young towers with exponential tail [51] has exponential decay of correlations and certain large deviations estimates [41]. These families include a positive measure set of maps from the quadratic family (x ↦ ax(1 − x)) and the Hénon family, as well as certain billiards and many other systems of physical and mathematical interest [51]. In short, the setting of systems modeled by Young towers with exponential tail provides a very attractive setting in which to consider consistency of maximum likelihood estimation. Unfortunately, our proof does not apply to systems in this setting in general, mainly due to the presence of the mixing condition (S5), which is not satisfied by these systems in general.

A natural next step might be to obtain rates of convergence and derive central limit theorems for maximum likelihood estimation. To this end, it might be possible to build off of analogous results for HMMs [8, 23]. We leave these questions for future work.

SUPPLEMENTARY MATERIAL

Supplement to “Consistency of maximum likelihood estimation for some dynamical systems” (doi: COMPLETED BY THE TYPESETTER; .pdf). We provide three technical appendices. In Appendix A, we present proofs of Propositions 4.1, 4.2, and 4.3. In Appendix B, we discuss shifts of finite type and Gibbs measures and prove Theorem 5.1. Finally, Appendix C contains definitions for Axiom A systems, as well as a proof of Theorem 5.7.

References.

[1] T. M. Adams and A. B. Nobel. Finitary reconstruction of a measure preserving transformation. Israel J. Math., 126:309–326, 2001.

[2] J. F. Alves, M. Carvalho, and J. M. Freitas. Statistical stability and continuity of SRB entropy for systems with Gibbs-Markov structures. Communications in Mathematical Physics, 296:739–767, 2010.

[3] V. Baladi. Decay of correlations. In Smooth ergodic theory and its applications (Seattle, WA, 1999), volume 69 of Proc. Sympos. Pure Math., pages 297–325. Amer. Math. Soc., Providence, RI, 2001.

[4] A. R. Barron. The strong ergodic theorem for densities: generalized Shannon-McMillan-Breiman theorem. Ann. Probab., 13(4):1292–1303, 1985.

[5] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Statist., 37:1554–1563, 1966.

[6] L. M. Berliner. Statistics, probability and chaos. Statist. Sci., 7(1):69–122, 1992. With discussion and a rejoinder by the author.

[7] D. P. Bertsekas and S. E. Shreve. Stochastic optimal control, volume 139 of Mathematics in Science and Engineering. Academic Press Inc. [Harcourt Brace Jovanovich Publishers], New York, 1978. The discrete time case.

[8] P. J. Bickel, Y. Ritov, and T. Rydén. Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models. Ann. Statist., 26(4):1614–1635, 1998.

[9] R. Bowen. Markov partitions for Axiom A diffeomorphisms. Amer. J. Math., 92:725–747, 1970.

[10] R. Bowen. Equilibrium states and the ergodic theory of Anosov diffeomorphisms, volume 470 of Lecture Notes in Mathematics. Springer-Verlag, Berlin, revised edition, 2008. With a preface by David Ruelle, Edited by Jean-René Chazottes.



[11] O. Cappé, E. Moulines, and T. Rydén. Inference in hidden Markov models. Springer Series in Statistics. Springer, New York, 2005. With Randal Douc’s contributions to Chapter 9 and Christian P. Robert’s to Chapters 6, 7 and 13, With Chapter 14 by Gersende Fort, Philippe Soulier and Moulines, and Chapter 15 by Stéphane Boucheron and Elisabeth Gassiat.

[12] S. Chatterjee and M. R. Yilmaz. Chaos, fractals and statistics. Statist. Sci., 7(1):49–68, 1992.

[13] R. Douc and C. Matias. Asymptotics of the maximum likelihood estimator for general hidden Markov models. Bernoulli, 7(3):381–420, 2001.

[14] R. Douc and E. Moulines. Asymptotic properties of the maximum likelihood estimation in misspecified hidden Markov models. Ann. Statist., 40(5):2697–2731, 2012.

[15] R. Douc, E. Moulines, J. Olsson, and R. van Handel. Consistency of the maximum likelihood estimator for general hidden Markov models. Ann. Statist., 39(1):474–513, 2011.

[16] R. Douc, E. Moulines, and T. Rydén. Asymptotic properties of the maximum likelihood estimator in autoregressive models with Markov regime. Ann. Statist., 32(5):2254–2304, 2004.

[17] J. M. Freitas and M. Todd. The statistical stability of equilibrium states for interval maps. Nonlinearity, 22(2):259–281, 2009.

[18] V. Genon-Catalot and C. Laredo. Leroux’s method for general hidden Markov models. Stochastic Processes and their Applications, 116(2):222–243, 2006.

[19] E. Ionides, C. Bretó, and A. King. Inference for nonlinear dynamical systems. Proc. Nat. Acad. Sciences, 103(49):18438–18443, 2006.

[20] E. L. Ionides, A. Bhadra, Y. Atchadé, and A. King. Iterated filtering. Ann. Statist., 39(3):1776–1802, 2011.

[21] V. Isham. Statistical aspects of chaos: a review. In Networks and chaos—statistical and probabilistic aspects, volume 50 of Monogr. Statist. Appl. Probab., pages 124–200. 1993.

[22] J. L. Jensen. Chaotic dynamical systems with a view towards statistics: a review. In Networks and chaos—statistical and probabilistic aspects, volume 50 of Monogr. Statist. Appl. Probab., pages 201–250. 1993.

[23] J. L. Jensen and N. V. Petersen. Asymptotic normality of the maximum likelihood estimator in state space models. Ann. Statist., 27(2):514–535, 1999.

[24] K. Judd. Failure of maximum likelihood methods for chaotic dynamical systems. Phys. Rev. E, 75:036210, Mar 2007.

[25] S. P. Lalley. Beneath the noise, chaos. Ann. Statist., 27(2):461–479, 1999.

[26] S. P. Lalley. Removing the noise from chaos plus noise. In Nonlinear dynamics and statistics (Cambridge, 1998), pages 233–244. Birkhäuser Boston, Boston, MA, 2001.

[27] S. P. Lalley and A. B. Nobel. Denoising deterministic time series. Dyn. Partial Differ. Equ., 3(4):259–279, 2006.

[28] K. J. H. Law and A. M. Stuart. Evaluating data assimilation algorithms. Monthly Weather Review, 140(11):3757–3782, 2012.

[29] F. Le Gland and L. Mevel. Basic properties of the projective product with application to products of column-allowable nonnegative matrices. Math. Control Signals Systems, 13(1):41–62, 2000.

[30] F. Le Gland and L. Mevel. Exponential forgetting and geometric ergodicity in hidden Markov models. Math. Control Signals Systems, 13(1):63–93, 2000.

[31] B. G. Leroux. Maximum-likelihood estimation for hidden Markov models. Stochastic Process. Appl., 40(1):127–143, 1992.

[32] K. McGoff, S. Mukherjee, A. Nobel, and N. Pillai. Supplement to “Consistency of maximum likelihood estimation for some dynamical systems”, 2014.

[33] K. McGoff, S. Mukherjee, and N. Pillai. Statistical inference for dynamical systems: a review. arXiv:1204.6265, 2013.

[34] A. Nobel. Consistent estimation of a dynamical map. In Nonlinear dynamics and statistics (Cambridge, 1998), pages 267–280. Birkhäuser Boston, Boston, MA, 2001.

[35] A. B. Nobel and T. M. Adams. Estimating a function from ergodic samples with additive noise. IEEE Trans. Inform. Theory, 47(7):2895–2902, 2001.

[36] K. Petersen. Ergodic theory, volume 2 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 1989. Corrected reprint of the 1983 original.

[37] T. Petrie. Probabilistic functions of finite state Markov chains. Ann. Math. Statist., 40:97–115, 1969.

[38] V. F. Pisarenko and D. Sornette. Statistical methods of parameter estimation for deterministically chaotic time series. Phys. Rev. E, 69:036122, Mar 2004.

[39] D. Poole and A. E. Raftery. Inference for deterministic simulation models: the Bayesian melding approach. J. Amer. Statist. Assoc., 95(452):1244–1255, 2000.

[40] J. O. Ramsay, G. Hooker, D. Campbell, and J. Cao. Parameter estimation for differential equations: a generalized smoothing approach. J. R. Stat. Soc. Ser. B Stat. Methodol., 69(5):741–796, 2007. With discussions and a reply by the authors.

[41] L. Rey-Bellet and L.-S. Young. Large deviations in non-uniformly hyperbolic dynamical systems. Ergodic Theory Dynam. Systems, 28(2):587–612, 2008.

[42] D. Ruelle. Differentiation of SRB states. Comm. Math. Phys., 187(1):227–241, 1997.

[43] D. Ruelle. Thermodynamic formalism. Cambridge Mathematical Library. Cambridge University Press, Cambridge, second edition, 2004. The mathematical structures of equilibrium statistical mechanics.

[44] C. R. Shalizi. Dynamics of Bayesian updating with dependent data and misspecified models. Electron. J. Stat., 3:1039–1074, 2009.

[45] I. Steinwart and M. Anghel. Consistency of support vector machines for forecasting the evolution of an unknown ergodic dynamical system from observations with unknown noise. Ann. Statist., 37(2):841–875, 2009.

[46] T. Toni, D. Welch, N. Strelkowa, A. Ipsen, and M. Stumpf. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. J. R. Soc. Interface, 6(31):187–202, 2009.

[47] C. H. Vasquez. Statistical stability for diffeomorphisms with dominated splitting. Ergodic Theory Dynam. Systems, 27(1):253–283, 2007.

[48] P. Walters. An introduction to ergodic theory, volume 79 of Graduate Texts in Mathematics. Springer-Verlag, New York, 1982.

[49] S. N. Wood. Statistical inference for noisy nonlinear ecological dynamic systems. Nature, 466(7310):1102–1104, 2010.

[50] L.-S. Young. Large deviations in dynamical systems. Trans. Amer. Math. Soc., 318(2):525–543, 1990.

[51] L.-S. Young. Statistical properties of dynamical systems with some hyperbolicity. Ann. of Math. (2), 147(3):585–650, 1998.

[52] L.-S. Young. What are SRB measures, and which dynamical systems have them? J. Statist. Phys., 108(5-6):733–754, 2002. Dedicated to David Ruelle and Yasha Sinai on the occasion of their 65th birthdays.



Department of Mathematics, Duke University, Durham NC 27708. E-mail: [email protected]

Departments of Statistical Science, Computer Science, and Mathematics, Institute for Genome Sciences & Policy, Duke University, Durham NC 27708. E-mail: [email protected]

Department of Statistics and Operations Research, University of North Carolina, Chapel Hill NC 27599-3260. E-mail: [email protected]

Department of Statistics, Harvard University, Cambridge, MA 02138. E-mail: [email protected]


Submitted to the Annals of Statistics

SUPPLEMENT TO “CONSISTENCY OF MAXIMUM LIKELIHOOD ESTIMATION FOR SOME DYNAMICAL SYSTEMS”

By Kevin McGoff, Sayan Mukherjee, Andrew Nobel, and Natesh Pillai

APPENDIX A: ADDITIONAL PROOFS

A.1. Proof of Proposition 4.1. In order to prove Proposition 4.1, we first establish the following lemma.

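Lemma A.1. Suppose that for each $y_0^m \in \mathcal{Y}^{m+1}$, $\theta' \in \Theta$, and $\epsilon > 0$, there exists a neighborhood $U$ of $\theta'$ such that

$$\sup_{\theta \in U} \sup_{x} \left| p_{\theta'}(y_0^m \mid x) - p_{\theta}(y_0^m \mid x) \right| < \epsilon,$$

and

$$\sup_{\theta \in U} \left| \int p_{\theta'}(y_0^m \mid x)\, d\mu_{\theta'}(x) - \int p_{\theta'}(y_0^m \mid x)\, d\mu_{\theta}(x) \right| < \epsilon.$$

Then for each $y_0^m \in \mathcal{Y}^{m+1}$, the function $\theta \mapsto p_\theta(y_0^m)$ is continuous, and therefore condition (S4) holds.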

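Proof. Let $y_0^m$ be in $\mathcal{Y}^{m+1}$, and let $\theta'$ be in $\Theta$. Let $\epsilon > 0$. Choose $U$ as in the hypothesis corresponding to $\epsilon/2$. Then for $\theta$ in $U$, we have that

$$\begin{aligned}
\left| p_{\theta'}(y_0^m) - p_{\theta}(y_0^m) \right|
&= \left| \int p_{\theta'}(y_0^m \mid x)\, d\mu_{\theta'}(x) - \int p_{\theta}(y_0^m \mid x)\, d\mu_{\theta}(x) \right| \\
&\le \left| \int p_{\theta'}(y_0^m \mid x)\, d\mu_{\theta'}(x) - \int p_{\theta'}(y_0^m \mid x)\, d\mu_{\theta}(x) \right|
 + \int \left| p_{\theta'}(y_0^m \mid x) - p_{\theta}(y_0^m \mid x) \right| d\mu_{\theta}(x) \\
&< \epsilon.
\end{aligned}$$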

Here we restate and then prove Proposition 4.1.

Proposition 4.1. Suppose that $X$ and $\Theta$ are compact and the maps $T : \Theta \times X \to X$, $g : \Theta \times X \times \mathcal{Y} \to \mathbb{R}_+$, and $\mu : \Theta \to \mathcal{M}(X)$ are continuous. Then the upper semi-continuity of the likelihood (S4) holds.

imsart-aos ver. 2013/03/06 file: MLE_aos_Supplement_July22_2014.tex date: July 22, 2014

Proof. Let $y_0^m$ be in $\mathcal{Y}^{m+1}$, and let $\theta'$ be in $\Theta$. Let $\epsilon > 0$. Note that since $T$ and $g$ are continuous and $\Theta \times X$ is compact, the map $(\theta, x) \mapsto p_\theta(y_0^m \mid x)$ is uniformly continuous. Thus there exists a neighborhood $U_1$ of $\theta'$ such that

$$\sup_{\theta \in U_1} \sup_{x} \left| p_{\theta'}(y_0^m \mid x) - p_{\theta}(y_0^m \mid x) \right| < \epsilon. \tag{A.1}$$

Since $\theta \mapsto \mu_\theta$ is continuous with respect to the weak topology and $x \mapsto p_{\theta'}(y_0^m \mid x)$ is continuous on $X$, there exists a neighborhood $U_2$ of $\theta'$ such that

$$\sup_{\theta \in U_2} \left| \int p_{\theta'}(y_0^m \mid x)\, d\mu_{\theta'}(x) - \int p_{\theta'}(y_0^m \mid x)\, d\mu_{\theta}(x) \right| < \epsilon. \tag{A.2}$$

Let $U = U_1 \cap U_2$. By (A.1) and (A.2), the hypotheses of Lemma A.1 are satisfied. By Lemma A.1, the upper semi-continuity of the likelihood (S4) holds.

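A.2. Proof of Proposition 4.2. In order to build towards the proof of Proposition 4.2, we need to introduce some additional notation. Let $\mathcal{C}$ be a partition of $X$. For $\varphi : X \to \mathbb{R}_+$, define

$$V(\varphi, \mathcal{C}) = \max_{A \in \mathcal{C}} \, \inf\Big\{ \eta > 0 : \sup_{x \in A} \varphi(x) \le \eta \inf_{x \in A} \varphi(x) \Big\}.$$

Note that for $A$ in $\mathcal{C}$, we have that

$$\sup_{x \in A} \varphi(x) \le V(\varphi, \mathcal{C}) \inf_{z \in A} \varphi(z). \tag{A.3}$$

Loosely speaking, the quantity $V(\varphi, \mathcal{C})$ allows us to control the ratio of the values of $\varphi$ within the cells of the partition $\mathcal{C}$.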
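As a rough illustration of this ratio control, the following sketch computes $V(\varphi, \mathcal{C})$ on a finite toy state space and checks (A.3) cell by cell. The function name `ratio_variation`, the toy partition, and the choice of a strictly positive `phi` are assumptions made for this example only; when the infimum over a cell is positive, the inner infimum over $\eta$ reduces to the ratio of the supremum to the infimum.

```python
def ratio_variation(phi, partition):
    """Return max over cells A of (sup_A phi) / (inf_A phi).

    Assumes phi is strictly positive on every cell, in which case this
    ratio equals the inner infimum over eta in the definition of V.
    """
    worst = 1.0  # the ratio sup/inf is never below 1
    for cell in partition:
        values = [phi(x) for x in cell]
        worst = max(worst, max(values) / min(values))
    return worst


# Hypothetical toy setting: X = {0, ..., 5}, three cells of two points each.
partition = [[0, 1], [2, 3], [4, 5]]
phi = lambda x: 1.0 + x  # strictly positive on X

V = ratio_variation(phi, partition)

# Inequality (A.3): on each cell, sup phi <= V * inf phi.
for cell in partition:
    vals = [phi(x) for x in cell]
    assert max(vals) <= V * min(vals) + 1e-12
```

In this toy case the worst ratio is attained on the first cell, where the values are 1 and 2.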

In the propositions that follow, we will need to investigate properties of the map $x \mapsto p_\theta(y_0^n \mid x)$ for $\theta \in \Theta$ and $y_0^n \in \mathcal{Y}^{n+1}$. To ease notation, let us define, for $\mathbf{y} = y_0^n \in \mathcal{Y}^{n+1}$ and $\theta \in \Theta$, the function $\varphi_\theta^{\mathbf{y}} : X \to \mathbb{R}_+$ by $\varphi_\theta^{\mathbf{y}}(x) = p_\theta(\mathbf{y} \mid x)$. Let us now establish two preliminary propositions, which are used in the proof of Proposition 4.2.

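Proposition A.2. Suppose $(T_\theta, \mu_\theta)_{\theta \in \Theta}$ is a family of dynamical systems on $(X, \mathcal{X})$ with corresponding observation densities $(g_\theta)_{\theta \in \Theta}$. Further, suppose that the partition mixing condition (M1) holds with respect to a partition $\mathcal{C}$. For arbitrary $\ell, m \ge 0$, suppose that $n$ and $k \ge 0$ satisfy $n = k(m + \ell) + m$. For any $\theta$ in $\Theta$ and $y_0^n$ in $\mathcal{Y}^{n+1}$, if we let $\mathbf{y}_j = y_{j(m+\ell)}^{j(m+\ell)+m}$, then

$$\int \prod_{j=0}^{k} \varphi_\theta^{\mathbf{y}_j} \circ T_\theta^{j(m+\ell)}(x) \, d\mu_\theta(x) \le L_\theta^{k} \prod_{j=0}^{k} V(\varphi_\theta^{\mathbf{y}_j}, \mathcal{C}_0^m) \prod_{j=0}^{k} \int \varphi_\theta^{\mathbf{y}_j}(x) \, d\mu_\theta(x).$$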


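Proof. Let $\mathcal{C}' = \bigvee_{j=0}^{k} T_\theta^{-j(m+\ell)} \mathcal{C}_0^m$. By definition of $\mathcal{C}'$, for each $A$ in $\mathcal{C}'$ there exists $\{A_j\}_{j=0}^{k}$ such that $A = \bigcap_{j=0}^{k} T_\theta^{-j(m+\ell)} A_j$, with each $A_j$ in $\mathcal{C}_0^m$. For such $A$, by the mixing condition (M1), we have that

$$\mu_\theta(A) \le L_\theta^{k} \prod_{j=0}^{k} \mu_\theta(A_j). \tag{A.4}$$

Using (A.4), we obtain that

$$\begin{aligned}
\int \prod_{j=0}^{k} \varphi_\theta^{\mathbf{y}_j} \circ T_\theta^{j(m+\ell)}(x) \, d\mu_\theta(x)
&= \sum_{A \in \mathcal{C}'} \int_A \prod_{j=0}^{k} \varphi_\theta^{\mathbf{y}_j} \circ T_\theta^{j(m+\ell)}(x) \, d\mu_\theta(x) \\
&\le \sum_{A \in \mathcal{C}'} \Bigg( \prod_{j=0}^{k} \sup_{x \in A} \varphi_\theta^{\mathbf{y}_j} \circ T_\theta^{j(m+\ell)}(x) \Bigg) \mu_\theta(A) \\
&\le L_\theta^{k} \sum_{A \in \mathcal{C}'} \Bigg( \prod_{j=0}^{k} \sup_{x \in A_j} \varphi_\theta^{\mathbf{y}_j}(x) \Bigg) \prod_{j=0}^{k} \mu_\theta(A_j).
\end{aligned} \tag{A.5}$$

By (A.5) and (A.3), we have that

$$\begin{aligned}
\int \prod_{j=0}^{k} \varphi_\theta^{\mathbf{y}_j} \circ T_\theta^{j(m+\ell)}(x) \, d\mu_\theta(x)
&\le L_\theta^{k} \sum_{A \in \mathcal{C}'} \Bigg( \prod_{j=0}^{k} \sup_{x \in A_j} \varphi_\theta^{\mathbf{y}_j}(x) \Bigg) \prod_{j=0}^{k} \mu_\theta(A_j) \\
&\le L_\theta^{k} \prod_{j=0}^{k} V(\varphi_\theta^{\mathbf{y}_j}, \mathcal{C}_0^m) \cdot \sum_{A \in \mathcal{C}'} \prod_{j=0}^{k} \inf_{x \in A_j} \varphi_\theta^{\mathbf{y}_j}(x) \prod_{j=0}^{k} \mu_\theta(A_j) \\
&\le L_\theta^{k} \prod_{j=0}^{k} V(\varphi_\theta^{\mathbf{y}_j}, \mathcal{C}_0^m) \prod_{j=0}^{k} \int \varphi_\theta^{\mathbf{y}_j}(x) \, d\mu_\theta(x),
\end{aligned}$$

as desired.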

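Proposition A.3. Suppose that the partition regularity (M2) and observation regularity (M3) conditions hold. Fix $\theta' \notin [\theta_0]$, and let $U$ be the corresponding neighborhood of $\theta'$ in (M3). If we let $\mathbf{Y}_n = Y_0^n$, then

$$\sup_{n} \, \mathbb{E}_{\theta_0}\Big[ \sup_{\theta \in U} \log V(\varphi_\theta^{\mathbf{Y}_n}, \mathcal{C}_0^n) \Big] < \infty.$$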



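Proof. Let $\theta'$ and $U$ be as in the hypotheses, and let $\theta \in U$. Furthermore, let $A \in \mathcal{C}_0^n$, and let $z_1, z_2 \in A$. Note that if $0 \le k \le n$, then $d\big(T_\theta^k(z_1), T_\theta^k(z_2)\big) \le \beta^{\min(k, n-k)}$ by the partition regularity condition (M2). Then

$$\begin{aligned}
\varphi_\theta^{\mathbf{Y}_n}(z_1) &= p_\theta(Y_0^n \mid z_1) \\
&= \prod_{k=0}^{n} g_\theta(Y_k \mid T_\theta^k z_1) \\
&\le \prod_{k=0}^{n} g_\theta(Y_k \mid T_\theta^k z_2) \exp\Big( K(\theta, Y_k) \, d(T_\theta^k z_1, T_\theta^k z_2) \Big) \\
&\le \varphi_\theta^{\mathbf{Y}_n}(z_2) \exp\Bigg( \sum_{k=0}^{n} K(\theta, Y_k) \, \beta^{\min(k, n-k)} \Bigg).
\end{aligned}$$

Hence,

$$\log V(\varphi_\theta^{\mathbf{Y}_n}, \mathcal{C}_0^n) \le \sum_{k=0}^{n} K(\theta, Y_k) \, \beta^{\min(k, n-k)},$$

and therefore,

\begin{align}
\mathbb{E}_{\theta_0}\Big[ \sup_{\theta \in U} \log V(\varphi_\theta^{\mathbf{Y}_n}, \mathcal{C}_0^n) \Big]
&\le \mathbb{E}_{\theta_0}\Big[ \sup_{\theta \in U} K(\theta, Y_0) \Big] \sum_{k=0}^{n} \beta^{\min(k, n-k)} \tag{A.6} \\
&\le \mathbb{E}_{\theta_0}\Big[ \sup_{\theta \in U} K(\theta, Y_0) \Big] \cdot 2 \sum_{k=0}^{\lfloor n/2 \rfloor} \beta^{k} \tag{A.7} \\
&\le \mathbb{E}_{\theta_0}\Big[ \sup_{\theta \in U} K(\theta, Y_0) \Big] \cdot 2 \sum_{k=0}^{\infty} \beta^{k}. \tag{A.8}
\end{align}

The bound in (A.8) is finite by the observation regularity condition (M3) and does not depend on $n$. Thus, we have finished the proof of the proposition.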
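The passage from (A.6) to (A.8) rests on the fact that the two-sided geometric sum is bounded uniformly in $n$ by $2/(1-\beta)$. The following sketch checks this numerically for one value of $\beta$; the function names and the choice $\beta = 0.5$ are assumptions made for illustration.

```python
def two_sided_sum(beta, n):
    """Sum of beta**min(k, n - k) over k = 0, ..., n, as in (A.6)."""
    return sum(beta ** min(k, n - k) for k in range(n + 1))


def uniform_bound(beta):
    """Closed form of 2 * sum_{k >= 0} beta**k, the bound in (A.8)."""
    return 2.0 / (1.0 - beta)


beta = 0.5  # any value in (0, 1) would do for this check
sums = [two_sided_sum(beta, n) for n in range(60)]

# The sums increase towards, but never exceed, the n-free bound.
assert all(s <= uniform_bound(beta) for s in sums)
```

The terms near both ends of the block contribute the most, which is why the sum behaves like two copies of a one-sided geometric series.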

Here we restate Proposition 4.2 and give its proof.

Proposition 4.2. Suppose $(T_\theta, \mu_\theta)_{\theta \in \Theta}$ is a family of dynamical systems on $(X, \mathcal{X})$ with corresponding observation densities $(g_\theta)_{\theta \in \Theta}$. If there exists a partition $\mathcal{C}$ of $X$ such that conditions (M1) and (M2) are satisfied, and if the observation regularity condition (M3) is satisfied, then the mixing property (S5) holds.



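Proof. Let $\ell \ge 0$ be as in (M1). Define $C_m : \Theta \times \mathcal{Y}^{m+1} \to \mathbb{R}_+$ by

$$C_m(\theta, y_0^m) = L_\theta \cdot V(\varphi_\theta^{\mathbf{y}}, \mathcal{C}_0^m),$$

where $\mathbf{y} = y_0^m$.

Let $k, m \ge 0$ be arbitrary, and let $\mathbf{y}_j \in \mathcal{Y}^{m+1}$ for $0 \le j \le k$. By Proposition A.2, we have that

$$\begin{aligned}
\int \prod_{j=0}^{k} p_\theta\big(\mathbf{y}_j \mid T_\theta^{j(m+\ell)}(x)\big) \, d\mu_\theta(x)
&= \int \prod_{j=0}^{k} \varphi_\theta^{\mathbf{y}_j} \circ T_\theta^{j(m+\ell)}(x) \, d\mu_\theta(x) \\
&\le L_\theta^{k} \prod_{j=0}^{k} V(\varphi_\theta^{\mathbf{y}_j}, \mathcal{C}_0^m) \cdot \prod_{j=0}^{k} \int \varphi_\theta^{\mathbf{y}_j}(x) \, d\mu_\theta(x) \\
&= \prod_{j=0}^{k} C_m(\theta, \mathbf{y}_j) \cdot \prod_{j=0}^{k} p_\theta(\mathbf{y}_j),
\end{aligned}$$

which is the first requirement of condition (S5).

Now let $\theta' \notin [\theta_0]$, and let $U$ be a neighborhood of $\theta'$ satisfying both (M1) and (M3). By definition of $C_m$, setting $\mathbf{Y}_m = Y_0^m$, we have

$$\sup_{m} \, \mathbb{E}_{\theta_0}\Big[ \sup_{\theta \in U} \log C_m(\theta, Y_0^m) \Big] \le \Big( \sup_{\theta \in U} \log L_\theta \Big) + \sup_{m} \, \mathbb{E}_{\theta_0}\Big[ \sup_{\theta \in U} \log V(\varphi_\theta^{\mathbf{Y}_m}, \mathcal{C}_0^m) \Big],$$

which is finite by (M1) and Proposition A.3. Thus the mixing condition (S5) is satisfied.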

A.3. Proof of Proposition 4.3. In this section, we present the proof of Proposition 4.3. We begin by establishing several lemmas. Here and throughout this section, we fix a metric $d(\cdot, \cdot)$ on $X$ and assume without loss of generality that $d(x, z) \le 1$ for all $x, z \in X$ (one may always choose an equivalent metric with this property).

Lemma A.4. Suppose that $(T_\theta, \mu_\theta)_{\theta \in \Theta}$ is a family of dynamical systems on $(X, \mathcal{X})$ with corresponding observation densities $(g_\theta)_{\theta \in \Theta}$ satisfying the observation regularity condition (L2). Let $\theta$ be in $\Theta$, and suppose that the transformation $T_\theta : X \to X$ is Hölder continuous. For any bounded, measurable $\xi : \mathcal{Y}^{m+1} \to \mathbb{R}$, if $\psi_\theta : X \to \mathbb{R}$ is defined by

$$\psi_\theta(x) = \int \xi(y_0^m) \, p_\theta(y_0^m \mid x) \, d\nu^m(y_0^m),$$

then $\psi_\theta$ is Hölder continuous.



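Proof. Let $\theta$ be in $\Theta$, and suppose that $T_\theta : X \to X$ is Hölder continuous. Let $\xi : \mathcal{Y}^{m+1} \to \mathbb{R}$ be bounded and measurable, and let $\psi_\theta$ be as in the hypotheses.

By Condition (L2), for $x, z \in X$ and $y_0^m \in \mathcal{Y}^{m+1}$, we have that

$$\begin{aligned}
p_\theta(y_0^m \mid x) &= \prod_{k=0}^{m} g_\theta\big(y_k \mid T_\theta^k(x)\big) \\
&\le \exp\Bigg( \sum_{k=0}^{m} K(\theta, y_k) \, d\big(T_\theta^k(x), T_\theta^k(z)\big)^{\alpha} \Bigg) \prod_{k=0}^{m} g_\theta\big(y_k \mid T_\theta^k(z)\big) \\
&= \exp\Bigg( \sum_{k=0}^{m} K(\theta, y_k) \, d\big(T_\theta^k(x), T_\theta^k(z)\big)^{\alpha} \Bigg) p_\theta(y_0^m \mid z).
\end{aligned} \tag{A.9}$$

Since $T_\theta$ is Hölder and $m$ is fixed, there exist $C_1 > 0$ and $\alpha_1 > 0$ such that

$$\sum_{k=0}^{m} K(\theta, y_k) \, d\big(T_\theta^k(x), T_\theta^k(z)\big)^{\alpha} \le C_1 \, d(x, z)^{\alpha_1} \sum_{k=0}^{m} K(\theta, y_k).$$

Then by (A.9) and the convexity of the exponential, we have that

$$\begin{aligned}
\left| 1 - \frac{p_\theta(y_0^m \mid z)}{p_\theta(y_0^m \mid x)} \right|
&\le \exp\Bigg( C_1 \, d(x, z)^{\alpha_1} \sum_{k=0}^{m} K(\theta, y_k) \Bigg) - 1 \\
&\le d(x, z)^{\alpha_1} \exp\Bigg( C_1 \sum_{k=0}^{m} K(\theta, y_k) \Bigg),
\end{aligned}$$

where we have used our convention that $d(x, z) \le 1$. Thus,

$$\begin{aligned}
|\psi_\theta(x) - \psi_\theta(z)|
&= \left| \int \xi(y_0^m) \Big( p_\theta(y_0^m \mid x) - p_\theta(y_0^m \mid z) \Big) \, d\nu^m(y_0^m) \right| \\
&\le \int \|\xi\|_\infty \, p_\theta(y_0^m \mid x) \left| 1 - \frac{p_\theta(y_0^m \mid z)}{p_\theta(y_0^m \mid x)} \right| d\nu^m(y_0^m) \\
&\le \|\xi\|_\infty \, d(x, z)^{\alpha_1} \int \exp\Bigg( C_1 \sum_{k=0}^{m} K(\theta, y_k) \Bigg) p_\theta(y_0^m \mid x) \, d\nu^m(y_0^m) \\
&\le \|\xi\|_\infty \, M^{m+1} \, d(x, z)^{\alpha_1},
\end{aligned}$$

where

$$M = \sup_{x} \int \exp\big( C_1 K(\theta, y) \big) \, g_\theta(y \mid x) \, d\nu(y)$$

is finite by assumption.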
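The step in this proof attributed to the convexity of the exponential can be spelled out as follows. The shorthand $t$ and $a$ is introduced here only for this note and is not the paper's notation; the sketch uses only that $0 \le t \le 1$, which holds by the convention $d(x, z) \le 1$.

```latex
% Shorthand (introduced for this note only):
%   t = d(x,z)^{\alpha_1} \in [0,1],   a = C_1 \sum_{k=0}^{m} K(\theta, y_k).
\begin{align*}
e^{ta} &= e^{(1-t)\cdot 0 + t \cdot a}
        \le (1-t)\, e^{0} + t\, e^{a}
        && \text{(convexity of } \exp \text{)} \\
e^{ta} - 1 &\le t \,\bigl(e^{a} - 1\bigr) \le t\, e^{a}.
\end{align*}
```

Substituting back $t$ and $a$ gives the second inequality in the bound on $|1 - p_\theta(y_0^m \mid z)/p_\theta(y_0^m \mid x)|$.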



The following lemma, which we state without proof, gives a conditional concentration inequality. It may be proved in a fashion similar to other well-known concentration inequalities by conditioning on $Z$ [3]. It is used in the proof of Proposition 4.3.

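Lemma A.5. Suppose that $X_1, \dots, X_n$ are conditionally independent given $Z$ and $|X_i| \le M$ a.s. Let $Y_i = X_i - \mathbb{E}(X_i \mid Z)$. Then for $0 < c < 4M$, it holds that

$$\mathbb{P}\Bigg( \frac{1}{n} \sum_{i=1}^{n} Y_i \ge c \Bigg) \le \exp\Bigg( -n \, \frac{c^2}{16 M^2} \Bigg).$$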
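A small Monte Carlo experiment can make the statement of Lemma A.5 concrete. The sketch below is only a sanity check under an assumed toy model (a uniform conditioning variable $Z$ and two-point variables $X_i \in \{-M, M\}$ whose bias depends on $Z$); the model, the function names, and the parameter values are all hypothetical and chosen for illustration.

```python
import math
import random


def tail_frequency(n, c, M, trials, seed=0):
    """Empirical frequency of the event (1/n) * sum(Y_i) >= c."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        z = rng.uniform(-0.5, 0.5)      # conditioning variable Z
        p = 0.5 + z                     # P(X_i = M | Z = z)
        cond_mean = M * (2.0 * p - 1.0)  # E(X_i | Z = z)
        total = 0.0
        for _ in range(n):
            # Given Z, the X_i are drawn independently of one another.
            x = M if rng.random() < p else -M
            total += x - cond_mean       # Y_i = X_i - E(X_i | Z)
        if total / n >= c:
            hits += 1
    return hits / trials


def lemma_bound(n, c, M):
    """Right-hand side of Lemma A.5, valid for 0 < c < 4M."""
    return math.exp(-n * c * c / (16.0 * M * M))


freq = tail_frequency(n=50, c=0.5, M=1.0, trials=2000)
assert freq <= lemma_bound(50, 0.5, 1.0)
```

The simulated frequency is far below the bound in this model, which is expected: the constant 16 in the exponent is not tuned to be sharp.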

Let us now state and prove Proposition 4.3.

Proposition 4.3. Suppose that $(T_\theta, \mu_\theta)_{\theta \in \Theta}$ is a family of Hölder-continuous dynamical systems on $(X, \mathcal{X})$ with corresponding observation densities $(g_\theta)_{\theta \in \Theta}$. Suppose that $(T_{\theta_0}, \mu_{\theta_0})$ is ergodic. Further, suppose that the large deviations property (L1) and the observation regularity property (L2) are satisfied. Then the exponential identifiability condition (S6) holds.

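Proof. Let $\theta \notin [\theta_0]$. Then there exist $m$ and a bounded, measurable function $\xi : \mathcal{Y}^{m+1} \to \mathbb{R}$ such that $\mathbb{E}_{\theta_0}(\xi(Y_0^m)) = 1$ and $\mathbb{E}_{\theta}(\xi(Y_0^m)) = 0$. For $n > 0$, define

$$A_n = \Bigg\{ (x, (y_i)) \in X \times \mathcal{Y}^{\mathbb{N}} : \frac{1}{n} \sum_{k=0}^{n} \xi(y_k^{k+m}) \ge \frac{1}{2} \Bigg\}.$$

By the Birkhoff ergodic theorem, we have that $\mathbb{P}_{\theta_0}(A_n)$ tends to 1 as $n$ tends to infinity, since $\mathbb{P}^{Y}_{\theta_0}$ is ergodic (Proposition 6.1).

For $k > 0$ and $\mathbf{y} = (y_i)$ in $\mathcal{Y}^{\mathbb{N}}$, define

$$\xi_k(\mathbf{y}) = \xi(y_k^{k+m}).$$

Let $\mathcal{F}_X$ be the σ-algebra of sets in $X \times \mathcal{Y}^{\mathbb{N}}$ of the form $A \times \mathcal{Y}^{\mathbb{N}}$, where $A$ is a measurable subset of $X$. We denote by $\mathbb{E}_\theta(\cdot \mid x)$ the conditional expectation $\mathbb{E}_\theta(\cdot \mid \mathcal{F}_X)$. For $k > 0$ and $(x, \mathbf{y})$ in $X \times \mathcal{Y}^{\mathbb{N}}$, define

$$Z_k(x, \mathbf{y}) = \xi_k(\mathbf{y}) - \mathbb{E}_{\theta}(\xi_k \mid x).$$

Note that for each $s$ in $\{0, \dots, m\}$, the collection of random variables $\{\xi_{j(m+1)+s} : j \ge 0\}$ is mutually conditionally independent given $\mathcal{F}_X$. The same statement holds for the collection $\{Z_{j(m+1)+s} : j \ge 0\}$.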


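For $n > 0$, define

$$B_n = \Bigg\{ (x, \mathbf{y}) \in X \times \mathcal{Y}^{\mathbb{N}} : \frac{1}{n} \sum_{k=0}^{n} Z_k(x, \mathbf{y}) \ge \frac{1}{4} \Bigg\},$$

$$C_n = \Bigg\{ (x, \mathbf{y}) \in X \times \mathcal{Y}^{\mathbb{N}} : \frac{1}{n} \sum_{k=0}^{n} \mathbb{E}_\theta(\xi_k \mid x) \ge \frac{1}{4} \Bigg\}.$$

Note that $A_n \subset B_n \cup C_n$. We proceed by estimating $\mathbb{P}_\theta(B_n)$ and $\mathbb{P}_\theta(C_n)$ from above.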
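The inclusion $A_n \subset B_n \cup C_n$ can be checked directly. The sketch below assumes that the centering term in $Z_k$ and the term averaged in $C_n$ are the same conditional expectation, written $e_k(x)$ here (a label introduced only for this note), so that $\xi_k = Z_k + e_k$.

```latex
% e_k(x) denotes the conditional expectation of \xi_k given x
% (a label used only in this note), so that \xi_k = Z_k + e_k.
\begin{align*}
\frac{1}{n}\sum_{k=0}^{n} \xi_k(\mathbf{y})
  &= \frac{1}{n}\sum_{k=0}^{n} Z_k(x,\mathbf{y})
   + \frac{1}{n}\sum_{k=0}^{n} e_k(x). \\
\intertext{If $(x,\mathbf{y}) \notin B_n \cup C_n$, then each average on the right is less than $1/4$, so}
\frac{1}{n}\sum_{k=0}^{n} \xi_k(\mathbf{y}) &< \frac{1}{4} + \frac{1}{4} = \frac{1}{2},
\end{align*}
% hence (x, y) is not in A_n, which is the contrapositive of the inclusion.
```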

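For $s$ in $\{0, \dots, m\}$ and $n = q(m + 1) + r$, with $0 \le r < m + 1$, we have that

$$\sum_{k=0}^{n-1} Z_k = \sum_{s=0}^{m} \sum_{j=0}^{q} Z_{j(m+1)+s} + R_n,$$

where $\|R_n\|_\infty$ is uniformly bounded in $n$. It follows that for large $n$, the set $B_n$ is contained in $\bigcup_{s=0}^{m} D_{n,s}$, where

$$D_{n,s} = \Bigg\{ (x, \mathbf{y}) : \frac{1}{q} \sum_{j=0}^{q} Z_{j(m+1)+s}(x, \mathbf{y}) \ge \frac{1}{8} \Bigg\}.$$

By Lemma A.5, there exists $\epsilon_1 > 0$ such that for $s$ in $\{0, \dots, m\}$ and all large $n$, it holds that

$$\mathbb{P}_\theta(D_{n,s}) \le \exp(-n \epsilon_1).$$

Therefore, for all large $n$, we have that

$$\mathbb{P}_\theta(B_n) \le \sum_{s=0}^{m} \mathbb{P}_\theta(D_{n,s}) \le (m + 1) \exp(-n \epsilon_1).$$

Define

$$\psi(x) = \int \xi(y_0^m) \, p_\theta(y_0^m \mid x) \, d\nu^m(y_0^m),$$

and note that $\psi$ is measurable with respect to $\mathcal{F}_X$. In fact, we have that

$$\begin{aligned}
\mathbb{E}_\theta(\xi_k \mid x) &= \int \xi(y_k^{k+m}) \, p_\theta\big(y_k^{k+m} \mid T_\theta^k(x)\big) \, d\nu^m(y_k^{k+m}) \\
&= \int \xi(y_0^m) \, p_\theta\big(y_0^m \mid T_\theta^k(x)\big) \, d\nu^m(y_0^m) \\
&= \psi \circ T_\theta^k(x).
\end{aligned}$$

Thus,

$$C_n = \Bigg\{ (x, \mathbf{y}) \in X \times \mathcal{Y}^{\mathbb{N}} : \frac{1}{n} \sum_{k=0}^{n} \psi \circ T_\theta^k(x) \ge \frac{1}{4} \Bigg\}.$$

Since $T_\theta$ is Hölder continuous and the observation regularity condition (L2) holds, Lemma A.4 ensures that $\psi$ is Hölder continuous. Also note that

$$\int \psi(x) \, d\mu_\theta(x) = \mathbb{E}_\theta\big( \mathbb{E}_\theta(\xi \mid x) \big) = \mathbb{E}_\theta(\xi) = 0.$$

Then by the large deviations condition (L1), there exist $\epsilon_2 > 0$ and $K_2 > 0$ such that

$$\mathbb{P}_\theta(C_n) = \mu_\theta\Bigg( x \in X : \frac{1}{n} \sum_{k=0}^{n} \psi \circ T_\theta^k(x) \ge \frac{1}{4} \Bigg) \le K_2 \exp(-\epsilon_2 n).$$

Now with $\epsilon \le \min(\epsilon_1, \epsilon_2)$ and $K \ge (m + 1) + K_2$, we obtain that

$$\mathbb{P}_\theta(A_n) \le \mathbb{P}_\theta(B_n) + \mathbb{P}_\theta(C_n) \le K \exp(-\epsilon n),$$

which concludes the proof of the proposition.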

APPENDIX B: SHIFTS OF FINITE TYPE AND GIBBS MEASURES

Let $\mathcal{A}$ be a finite set, called the alphabet. The state space for a shift of finite type is a subset of $\Sigma_{\mathcal{A}} = \mathcal{A}^{\mathbb{Z}}$, which we call the full-shift on $\mathcal{A}$. We put the discrete topology on $\mathcal{A}$ (meaning that all subsets of $\mathcal{A}$ are open), and we endow $\Sigma_{\mathcal{A}}$ with the product topology induced by the discrete topology on $\mathcal{A}$. For a matrix $M$ with dimensions $|\mathcal{A}| \times |\mathcal{A}|$ and entries in $\{0, 1\}$, let

$$X_M = \{ x \in \Sigma_{\mathcal{A}} : \forall n \in \mathbb{Z}, \ M_{x_n, x_{n+1}} = 1 \}.$$

The set $X_M$ is called a shift of finite type (SFT). In effect, $M$ describes the allowed transitions between symbols in $\mathcal{A}$. Note that $X_M$ is a compact subset of $\Sigma_{\mathcal{A}}$. Also, if the matrix $M$ has the property that there exists $N$ in $\mathbb{N}$ such that each entry of $M^N$ is positive, then we say that $X_M$ is a mixing SFT.
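To make the transition-matrix description concrete, the sketch below tests the mixing condition (some power of $M$ has all entries positive) and lists the finite words allowed by $M$. The function names and the two example matrices are assumptions chosen for illustration. The stopping point $(d-1)^2 + 1$ for the powers of a $d \times d$ matrix is Wielandt's classical bound for primitive matrices, a fact imported here rather than taken from the text.

```python
from itertools import product


def bool_mul(A, B):
    """Product of 0/1 matrices, recording only which entries are positive."""
    n = len(A)
    return [[int(any(A[i][k] and B[k][j] for k in range(n)))
             for j in range(n)] for i in range(n)]


def is_mixing(M):
    """True if some power M^N, 1 <= N <= (d-1)^2 + 1, is entrywise positive."""
    d = len(M)
    power = [row[:] for row in M]
    for _ in range((d - 1) ** 2 + 1):
        if all(all(row) for row in power):
            return True
        power = bool_mul(power, M)
    return False


def admissible_words(M, length):
    """All words w_0 ... w_{length-1} with M[w_i][w_{i+1}] = 1 for each i."""
    d = len(M)
    return [w for w in product(range(d), repeat=length)
            if all(M[a][b] for a, b in zip(w, w[1:]))]


golden_mean = [[1, 1], [1, 0]]   # the word "11" is forbidden
flip = [[0, 1], [1, 0]]          # symbols must alternate: not mixing

assert is_mixing(golden_mean)
assert not is_mixing(flip)
words = admissible_words(golden_mean, 3)
```

For the golden mean matrix, the second power is already entrywise positive, while the alternating matrix has period two and never becomes positive.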

Now we describe the transformation on any SFT. Let $\sigma : \Sigma_{\mathcal{A}} \to \Sigma_{\mathcal{A}}$ be the left shift map, defined as follows. For $x = (x_n) \in \Sigma_{\mathcal{A}}$, let $\sigma(x) = (\sigma(x)_n)$ be defined by $\sigma(x)_n = x_{n+1}$. With this definition, since $\Sigma_{\mathcal{A}}$ consists of bi-infinite sequences, it is not difficult to see that $\sigma$ is a homeomorphism. Also, for any SFT $X_M$, one may easily observe that $\sigma(X_M) = X_M$, so that one may regard $\sigma|_{X_M}$ as a map from $X_M$ to itself. Thus, the transformation associated with any SFT $X_M$ is $\sigma|_{X_M}$.

In order to define Gibbs measures, we require a metric structure on $X_M$. For each $\beta \in (0, 1)$, we define a metric on $X_M$, but the choice of $\beta$ is entirely arbitrary for our purposes. For $x$ and $y$ in $\Sigma_{\mathcal{A}}$, let

$$n(x, y) = \inf\{ |m| : m \in \mathbb{Z}, \ x_m \ne y_m \}.$$

Fix $\beta \in (0, 1)$, and define a metric on $\Sigma_{\mathcal{A}}$ as follows: for each $x$ and $y$ in $\Sigma_{\mathcal{A}}$, let

$$d(x, y) = \beta^{n(x, y)}.$$

It is straightforward to check that $d(\cdot, \cdot)$ is a metric on $\Sigma_{\mathcal{A}}$ and that it induces the product topology on $\Sigma_{\mathcal{A}}$.
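The metric can be sketched in a few lines if bi-infinite sequences are represented as functions from the integers to symbols and compared only on a finite window. The window `radius` and the convention of returning 0 when no disagreement is found inside the window are assumptions of this illustration (a truncation), not part of the definition.

```python
def first_disagreement(x, y, radius):
    """Smallest |m| <= radius with x(m) != y(m), or None if none is found."""
    for r in range(radius + 1):
        if x(r) != y(r) or x(-r) != y(-r):
            return r
    return None


def distance(x, y, beta, radius=64):
    """beta ** n(x, y), with sequences compared only on |m| <= radius."""
    n = first_disagreement(x, y, radius)
    return 0.0 if n is None else beta ** n


# Hypothetical sequences over the alphabet {0, 1}.
zeros = lambda m: 0
bump_at_3 = lambda m: 1 if m == 3 else 0
bump_at_minus_1 = lambda m: 1 if m == -1 else 0

beta = 0.5
d_xy = distance(zeros, bump_at_3, beta)
d_xz = distance(zeros, bump_at_minus_1, beta)
d_yz = distance(bump_at_3, bump_at_minus_1, beta)

# Agreement on a larger central window gives a smaller distance,
# and the triangle inequality holds in its strong (ultrametric) form.
assert d_yz <= max(d_xy, d_xz)
```

Two sequences are close exactly when they agree on a long block of coordinates centered at zero.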

For $\alpha \in (0, 1)$, let $C^\alpha(X_M)$ denote the class of functions $f : X_M \to \mathbb{R}$ such that there exists a constant $C = C(f)$ such that for each $x, y$ in $X_M$, it holds that

$$|f(x) - f(y)| \le C \, d(x, y)^{\alpha}.$$

We refer to $C^\alpha(X_M)$ as the class of Hölder continuous potential functions on $X_M$, and we endow $C^\alpha(X_M)$ with the topology induced by the supremum norm: $\|f\| = \sup_x |f(x)|$.

The following classical theorem guarantees the existence and uniqueness of Gibbs measures for “potential functions” $f$ in $C^\alpha(X_M)$ under mild conditions on $M$.

Theorem B.1 ([1]). Suppose that $X_M$ is a mixing SFT and $f$ is in $C^\alpha(X_M)$. Then there exists a unique σ-invariant Borel probability measure $\mu$ on $X_M$ such that there exist constants $c > 0$ and $\rho$ such that for each $x$ in $X_M$ and $m > 0$, it holds that

$$c \le \frac{\mu\big( \{ y \in X_M : y_0^m = x_0^m \} \big)}{\exp\Big( -\rho m + \sum_{k=0}^{m} f \circ \sigma^k(x) \Big)} \le c^{-1}.$$

Furthermore, $(\sigma|_{X_M}, \mu)$ is ergodic.

Any measure satisfying the conclusion of Theorem B.1 is called a Gibbs measure. If the value of the potential function $f$ in Theorem B.1 depends only on the coordinates $x_{-k}^{k}$, then the corresponding measure $\mu$ is a Markov chain of order $2k + 1$.
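The Gibbs inequality can be seen numerically in the simplest case $f = 0$. The sketch below relies on a standard fact that is not stated in the text: for $f = 0$ the Gibbs measure of a mixing SFT is the Parry (maximal entropy) Markov measure built from the Perron eigenvalue $\lambda$ and eigenvectors of $M$, with $\rho = \log \lambda$. The golden mean matrix, the function names, and the iteration count are assumptions chosen for illustration.

```python
from itertools import product

M = [[1, 1], [1, 0]]  # golden mean shift: the word "11" is forbidden


def perron(M, iters=500):
    """Power iteration for the Perron eigenvalue and a right eigenvector."""
    n = len(M)
    v, lam = [1.0] * n, 1.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(w)
        v = [wi / lam for wi in w]
    return lam, v


def parry_measure(M):
    """Stationary vector pi and transition matrix P of the Parry measure."""
    n = len(M)
    lam, v = perron(M)                              # M v = lam v
    _, u = perron([list(col) for col in zip(*M)])   # u M = lam u
    norm = sum(u[i] * v[i] for i in range(n))
    pi = [u[i] * v[i] / norm for i in range(n)]
    P = [[M[i][j] * v[j] / (lam * v[i]) for j in range(n)] for i in range(n)]
    return lam, pi, P


def cylinder_measure(word, pi, P):
    """Measure of the cylinder {y : y_0 ... y_m = word}."""
    mass = pi[word[0]]
    for a, b in zip(word, word[1:]):
        mass *= P[a][b]
    return mass


lam, pi, P = parry_measure(M)

# With f = 0 and rho = log(lam), the Gibbs ratio is mu(cylinder) * lam**m.
ratios = []
for m in range(1, 9):
    for word in product(range(len(M)), repeat=m + 1):
        mu = cylinder_measure(word, pi, P)
        if mu > 0:
            ratios.append(mu * lam ** m)

assert min(ratios) > 0.25 and max(ratios) < 0.75
```

The ratios stay in a fixed interval as the word length grows, which is the content of the two-sided bound in Theorem B.1 for this example.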

Let us now recall Theorem 5.1 and present its proof.



Theorem 5.1. Suppose $X = X_M$ is a mixing shift of finite type and $(\mu_\theta)_{\theta \in \Theta}$ is a continuously parametrized family of Gibbs measures on $(X, \mathcal{X})$. If the family of observation densities $(g_\theta)_{\theta \in \Theta}$ satisfies the integrability conditions (S2) and (S3) and the regularity conditions (M3) and (L2), then any approximate maximum likelihood estimator is consistent.

Proof. Let $\theta_0$ be in $\Theta$. We obtain Theorem 5.1 by direct application of Theorem 3.1. As we have explicitly assumed the integrability conditions (S2) and (S3), we need only show that the following conditions are satisfied: ergodicity (S1), upper semi-continuity of the likelihood (S4), mixing (S5), and exponential identifiability (S6). Ergodicity of the system $(\sigma|_X, \mu_{\theta_0})$ follows immediately from the assumption that $\mu_{\theta_0}$ is a Gibbs measure (see Theorem B.1). For the remaining three properties, we use the results of Section 4.

In order to apply the results of Section 4, we need to define a partition on $X = X_M$. We use the natural partition of $X_M$ given by the zero-coordinate: for $x$ in $X_M$, let

$$[x]_0 = \{ y \in X_M : x_0 = y_0 \},$$

and let $\mathcal{C}$ be the partition of $X_M$ into sets of the form $[x]_0$. Note that $\mathcal{C}_i^j = \bigvee_{k=i}^{j} \sigma^{-k} \mathcal{C}$ consists of sets of the form

$$[x]_i^j = \{ y \in X_M : y_i^j = x_i^j \}.$$

By definition of the metric structure on $X_M$, we have that condition (M2) holds with this partition (i.e., if $[x]_{-m}^{n} = [y]_{-m}^{n}$, then $d(x, y) = \beta^{n(x, y)} \le \beta^{\min(m, n)}$).

By various well-known results in dynamical systems, we have that

1. the map $\theta \mapsto \mu_\theta$ is continuous in the weak topology (see [1]);
2. the mixing condition (M1) holds with respect to $\mathcal{C}$ [1, Theorem 1.25];
3. the large deviations for Hölder observables condition (L1) holds (see [4]).

By 1., the hypotheses of Proposition 4.1 are satisfied, and thus the upper semi-continuity condition (S4) is satisfied. By 2. and our assumption that the observation regularity condition (M3) is satisfied, Proposition 4.2 gives that the mixing condition (S5) holds. Finally, by 3., the hypotheses of Proposition 4.3 are satisfied, and therefore the exponential identifiability condition (S6) holds.

APPENDIX C: AXIOM A SYSTEMS

In this section, we introduce some definitions for smooth dynamical systems.

Definition C.1. Suppose that $M$ is a manifold endowed with a Riemannian metric (such manifolds are called Riemannian manifolds), and $f : M \to M$ is a diffeomorphism. Let $T_x M$ denote the tangent space to $M$ at $x$. A closed subset $\Lambda \subset M$ is called hyperbolic if $f(\Lambda) = \Lambda$ and for each $x$ in $\Lambda$, there exist subspaces $E_x^s$ and $E_x^u$ of $T_x M$ such that

i) $T_x M = E_x^s \oplus E_x^u$;
ii) $Df(E_x^s) = E_{f(x)}^s$ and $Df(E_x^u) = E_{f(x)}^u$;
iii) there exist constants $c > 0$ and $\lambda$ in $(0, 1)$ such that
   • $\|Df^n v\| \le c \lambda^n \|v\|$ for all $n \ge 0$ and $v$ in $E_x^s$, and
   • $\|Df^{-n} v\| \le c \lambda^n \|v\|$ for all $n \ge 0$ and $v$ in $E_x^u$.

Definition C.2. A point $x$ in $M$ is called non-wandering if for each neighborhood $U$ of $x$, it holds that

$$U \cap \bigcup_{n > 0} f^n(U) \ne \emptyset.$$

The set $\Omega(f)$, consisting of the non-wandering points for $f$, is closed and $f$-invariant. A point $x$ is periodic if $f^n(x) = x$ for some $n > 0$. Any periodic point is in $\Omega(f)$.

Definition C.3. The diffeomorphism $f : M \to M$ satisfies Axiom A if $\Omega(f)$ is hyperbolic and

$$\Omega(f) = \overline{\{ x \in M : x \text{ is periodic} \}}.$$

The set $\Omega(f)$ is an Axiom A attractor if $f$ satisfies Axiom A and there exists a neighborhood $U$ of $\Omega(f)$ such that $f^n(x) \to \Omega(f)$ as $n$ tends to infinity for each $x$ in $U$.

Definition C.4. If $M$ is hyperbolic under $f$, then $f$ is called Anosov.

By Anosov’s Closing Lemma, every Anosov diffeomorphism satisfies Axiom A.

Example C.1. Let $X = \mathbb{T}^d$, the $d$-dimensional torus, and let $f$ be a hyperbolic toral automorphism induced by the matrix $A$ in $GL(d, \mathbb{Z})$. By the hyperbolicity assumption, $f$ is Anosov. Therefore $f$ is “structurally stable” (see [2]), meaning that there exists a neighborhood $U$ of $f$ in the space of diffeomorphisms of $X$ (with the $C^1$ topology) such that for each $g$ in $U$, $g$ is also Anosov and $g$ is conjugate to $f$.
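For a concrete instance of Example C.1 in dimension $d = 2$, hyperbolicity of the automorphism amounts to the integer matrix having determinant $\pm 1$ and no eigenvalue of modulus one. The sketch below checks this for the well-known "cat map" matrix; the restriction to $2 \times 2$ matrices, the function names, and the tolerance are assumptions of this illustration.

```python
import cmath


def eigenvalues_2x2(A):
    """Eigenvalues of a 2x2 matrix from its trace and determinant."""
    (a, b), (c, d) = A
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2


def is_hyperbolic_toral_automorphism(A, tol=1e-9):
    """True if A has determinant +1 or -1 and no eigenvalue on the unit circle.

    A is assumed to be a 2x2 matrix with integer entries.
    """
    (a, b), (c, d) = A
    if a * d - b * c not in (1, -1):
        return False  # not invertible over the integers
    return all(abs(abs(lam) - 1.0) > tol for lam in eigenvalues_2x2(A))


cat_map = [[2, 1], [1, 1]]      # eigenvalues (3 +/- sqrt(5)) / 2
rotation = [[0, -1], [1, 0]]    # eigenvalues +/- i, on the unit circle

assert is_hyperbolic_toral_automorphism(cat_map)
assert not is_hyperbolic_toral_automorphism(rotation)
```

The expanding and contracting eigendirections of the cat map play the roles of $E^u_x$ and $E^s_x$ in Definition C.1 at every point of the torus.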



Here we recall Theorem 5.7 and present its proof. We consider families of Axiom A systems as follows. Suppose that $f : \Theta \times X \to X$ is a parametrized family of diffeomorphisms such that

i) $\theta \mapsto f_\theta$ is Hölder continuous;
ii) there exists $\alpha > 0$ such that for each $\theta$, the map $f_\theta$ is $C^{1+\alpha}$;
iii) for each $\theta$, $\Omega(f_\theta)$ is an Axiom A attractor and the restriction $f_\theta|_{\Omega(f_\theta)}$ is topologically mixing;
iv) for each $\theta$, the measure $\mu_\theta$ is the unique SRB measure corresponding to $f_\theta$ [1, Theorem 4.1].

If these conditions are satisfied, then we say that $(f_\theta, \mu_\theta)_{\theta \in \Theta}$ is a parametrized family of Axiom A systems on $(X, \mathcal{X})$.

Theorem 5.7. Suppose that $(f_\theta, \mu_\theta)_{\theta \in \Theta}$ is a parametrized family of Axiom A systems on $(X, \mathcal{X})$. Further, suppose that $(g_\theta)_{\theta \in \Theta}$ is a family of observation densities satisfying the following conditions: observation integrability (S2) and (S3) and observation regularity (M3) and (L2). Then maximum likelihood estimation is consistent.

Proof. By the continuity assumptions i) and ii) on $f$, there exist a shift of finite type $X_M$ and a Hölder-continuous map $\pi : \Theta \times X_M \to X$ (see [1, Theorem 3.18]). Furthermore, by the topological mixing assumption iii), $X_M$ may be taken to be a mixing shift of finite type, and there exists a unique, Hölder-continuous family of Gibbs measures $\tilde{\mu}_\theta$ on $X_M$ such that $\tilde{\mu}_\theta \circ \pi_\theta^{-1} = \mu_\theta$ (see the proof of [1, Theorem 4.1]). Thus we have that $(f_\theta, \mu_\theta, g_\theta)_{\theta \in \Theta}$ is an isomorphic factor of $(\sigma|_{X_M}, \tilde{\mu}_\theta, g_\theta)_{\theta \in \Theta}$. By Theorem 5.1, any approximate maximum likelihood estimator for $(\sigma|_{X_M}, \tilde{\mu}_\theta, g_\theta)_{\theta \in \Theta}$ is consistent. Then by Proposition 5.6, any approximate maximum likelihood estimator for $(f_\theta, \mu_\theta, g_\theta)_{\theta \in \Theta}$ is consistent.

REFERENCES

[1] R. Bowen. Equilibrium states and the ergodic theory of Anosov diffeomorphisms, volume 470 of Lecture Notes in Mathematics. Springer-Verlag, Berlin, revised edition, 2008. With a preface by David Ruelle, Edited by Jean-René Chazottes.

[2] A. Katok and B. Hasselblatt. Introduction to the modern theory of dynamical systems, volume 54 of Encyclopedia of Mathematics and its Applications. Cambridge University Press, Cambridge, 1995. With a supplementary chapter by Katok and Leonardo Mendoza.

[3] C. McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141:148–188, 1989.

[4] L.-S. Young. Large deviations in dynamical systems. Trans. Amer. Math. Soc., 318(2):525–543, 1990.

