discrepancy-based theory and algorithms for forecasting ...mohri/pub/tsj.pdfstationary ergodic...

36
Annals of Mathematics and Artificial Intelligence Discrepancy-Based Theory and Algorithms for Forecasting Non-Stationary Time Series Vitaly Kuznetsov Mehryar Mohri Received: January 20, 2019 Abstract We present data-dependent learning bounds for the general sce- nario of non-stationary non-mixing stochastic processes. Our learning guaran- tees are expressed in terms of a data-dependent measure of sequential com- plexity and a discrepancy measure that can be estimated from data under some mild assumptions. Our learning bounds guide the design of new algo- rithms for non-stationary time series forecasting for which we report several favorable experimental results. Keywords time series forecasting non-stationary non-mixing generaliza- tion bounds discrepancy expected sequential covering numbers sequential Rademacher complexity 1 Introduction Time series forecasting plays a crucial role in a number of domains ranging from weather forecasting and earthquake prediction to applications in eco- nomics and finance. The classical statistical approaches to time series analysis are based on generative models such as the autoregressive moving average (ARMA) models, or their integrated versions (ARIMA) and several other ex- tensions [Engle, 1982; Bollerslev, 1986; Brockwell and Davis, 1986; Box and Jenkins, 1990; Hamilton, 1994]. Most of these models rely on strong assump- tions about the noise terms, often assumed to be i.i.d. random variables sam- pled from a Gaussian distribution, and the guarantees provided in their sup- port are often only asymptotic. Vitaly Kuznetsov (Corresponding author) Google Research, 76 Ninth Avenue, New York, NY 10011. E-mail: [email protected] Mehryar Mohri Google Research and Courant Institute, 251 Mercer Street, New York, NY 10012. E-mail: [email protected]

Upload: others

Post on 20-May-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

Annals of Mathematics and Artificial Intelligence

Discrepancy-Based Theory and Algorithms for ForecastingNon-Stationary Time Series

Vitaly Kuznetsov ⋅ Mehryar Mohri

Received: January 20, 2019

Abstract We present data-dependent learning bounds for the general sce-nario of non-stationary non-mixing stochastic processes. Our learning guaran-tees are expressed in terms of a data-dependent measure of sequential com-plexity and a discrepancy measure that can be estimated from data undersome mild assumptions. Our learning bounds guide the design of new algo-rithms for non-stationary time series forecasting for which we report severalfavorable experimental results.

Keywords time series ⋅ forecasting ⋅ non-stationary ⋅ non-mixing ⋅ generaliza-tion bounds ⋅ discrepancy ⋅ expected sequential covering numbers ⋅ sequentialRademacher complexity

1 Introduction

Time series forecasting plays a crucial role in a number of domains rangingfrom weather forecasting and earthquake prediction to applications in eco-nomics and finance. The classical statistical approaches to time series analysisare based on generative models such as the autoregressive moving average(ARMA) models, or their integrated versions (ARIMA) and several other ex-tensions [Engle, 1982; Bollerslev, 1986; Brockwell and Davis, 1986; Box andJenkins, 1990; Hamilton, 1994]. Most of these models rely on strong assump-tions about the noise terms, often assumed to be i.i.d. random variables sam-pled from a Gaussian distribution, and the guarantees provided in their sup-port are often only asymptotic.

Vitaly Kuznetsov (Corresponding author)Google Research, 76 Ninth Avenue, New York, NY 10011.E-mail: [email protected]

Mehryar MohriGoogle Research and Courant Institute, 251 Mercer Street, New York, NY 10012.E-mail: [email protected]

Page 2: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

2

An alternative non-parametric approach to time series analysis consistsof extending the standard i.i.d. statistical learning theory framework to thatof stochastic processes. In much of this work, the process is assumed to bestationary and suitably mixing [Doukhan, 1994]. Early work along this ap-proach consisted of the VC-dimension bounds for binary classification givenby Yu [1994] under the assumption of stationarity and β-mixing. Under thesame assumptions, Meir [2000] presented bounds in terms of covering numbersfor regression losses and Mohri and Rostamizadeh [2009] proved general data-dependent Rademacher complexity learning bounds. Vidyasagar [1997] showedthat PAC learning algorithms in the i.i.d. setting preserve their PAC learn-ing property in the β-mixing stationary scenario. A similar result was provenby Shalizi and Kontorovich [2013] for mixtures of β-mixing processes and byBerti and Rigo [1997] and Pestov [2010] for exchangeable random variables.Alquier and Wintenberger [2010] and Alquier et al. [2014] also establishedPAC-Bayesian learning guarantees under weak dependence and stationarity.Chen and Wu [2018] provide concentration results for linear time series.

A number of algorithm-dependent bounds have also been derived for thestationary mixing setting. Lozano et al. [2006] studied the convergence of reg-ularized boosting. Mohri and Rostamizadeh [2010] gave data-dependent gen-eralization bounds for stable algorithms for ϕ-mixing and β-mixing stationaryprocesses. Steinwart and Christmann [2009] proved fast learning rates for regu-larized algorithms with α-mixing stationary sequences and Modha and Masry[1998] gave guarantees for certain classes of models under the same assump-tions.

However, stationarity and mixing are often not valid assumptions. For ex-ample, even for Markov chains, which are among the most widely used typesof stochastic processes in applications, stationarity does not hold unless theMarkov chain is started with an equilibrium distribution. Similarly, long mem-ory models such as ARFIMA, may not be mixing or mixing may be arbitrarilyslow [Baillie, 1996]. In fact, it is possible to construct first order autoregres-sive processes that are not mixing [Andrews, 1983]. Additionally, the mix-ing assumption is defined only in terms of the distribution of the underlyingstochastic process and ignores the loss function and the hypothesis set used.This suggests that mixing may not be the right property to characterize learn-ing in the setting of stochastic processes.

A number of attempts have been made to relax the assumptions of station-arity and mixing. Adams and Nobel [2010] proved asymptotic guarantees forstationary ergodic sequences. Agarwal and Duchi [2013] gave generalizationbounds for asymptotically stationary (mixing) processes in the case of stableon-line learning algorithms. Kuznetsov and Mohri [2014] established learningguarantees for fully non-stationary β- and ϕ-mixing processes.

In this paper, we consider the general case of non-stationary non-mixingprocesses. We are not aware of any prior work providing generalization boundsin this setting. In fact, our bounds appear to be novel even when the processis stationary (but not mixing). The learning guarantees that we present holdfor both bounded and unbounded memory models. Deriving generalization

Page 3: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

3

bounds for unbounded memory models even in the stationary mixing casewas an open question prior to our work [Meir, 2000]. Our guarantees coverthe majority of approaches used in practice, including various autoregressivemodels.

The key ingredients of our generalization bounds are a data-dependentmeasure of sequential complexity (expected sequential covering number or se-quential Rademacher complexity [Rakhlin et al., 2010]) and a measure of dis-crepancy between the sample and target distributions. Kuznetsov and Mohri[2014, 2017a] also give generalization bounds in terms of discrepancy. However,unlike these result, our analysis does not require any mixing assumptions whichare hard to verify in practice. More importantly, under some additional mildassumption, the discrepancy measure that we propose can be estimated fromdata, which leads to data-dependent learning guarantees for non-stationarynon-mixing case.

We use the theory that we develop to devise new algorithms for non-stationary time series forecasting that benefit from our data-dependent guar-antees. The parameters of generative models such as ARIMA are typicallyestimated via the maximum likelihood technique, which often leads to non-convex optimization problems. In contrast, our objective is convex and leadsto an optimization problem with a unique global solution that can be foundefficiently. Another issue with standard generative models is that they addressnon-stationarity in the data via a differencing transformation which does notalways lead to a stationary process. In contrast, we address the problem ofnon-stationarity in a principled way using our learning guarantees.

The rest of this paper is organized as follows. The formal definition ofthe time series forecasting learning scenario as well as that of several keyconcepts is given in Section 2. In Section 3, we introduce and prove our newgeneralization bounds. Section 4 provides an analysis in the special case ofkernel-based hypotheses with regression losses. In Section 5, we give data-dependent learning bounds based on the empirical discrepancy. These resultsare used to devise new forecasting algorithms in Section 6. In Appendix 7, wereport the results of preliminary experiments using these algorithms.

2 Preliminaries

We consider a general time series prediction scenario where the learner receivesa realization Z1, . . . , ZT of some stochastic process, where, for any t ∈ [T ],Zt = (Xt, Yt) is in Z = X × Y, with X an input space and Y an output space.We will often use the shorthand Zmn to denote a sequence of random variablesZn, Zn+1, . . . , Zm. Given a loss function L∶Y ×Y → [0,∞) and a hypothesis setH of functions mapping from X to Y, the objective of the learner is to selecta predictor h∶X → Y in H that achieves a small path-dependent generalizationerror LT+1(h,ZT1 ), that is the generalization error conditioned on the dataobserved:

LT+1(h,ZT1 ) = E[L(h(XT+1), YT+1)∣Z1, . . . , ZT ]. (1)

Page 4: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

4

The path-dependent generalization error that we consider in this work is afiner measure than the averaged generalization error

LT+1(h) = E[L(h(XT+1), YT+1)] = E[E[L(h(XT+1), YT+1)∣Z1, . . . , ZT ]], (2)

since it takes into consideration the specific realization of the stochastic processand does not average over all possible histories. The results presented in thispaper apply as well to the setting where the time parameter t can take non-integer values and where the prediction lag is an arbitrary value l ≥ 0. Thus,the error can be defined more generally by E[L(h(XT+l), YT+l)∣Z1, . . . , ZT ],but for notational simplicity we set l = 1.

Our setup covers a larger number of scenarios commonly used in practice.The case X = Yp corresponds to a large class of autoregressive models. TakingX = ⋃∞p=1Yp leads to growing memory models which, in particular, includestate space models. More generally, X may contain both the history of theprocess Yt and some additional side information. Note that the output spaceY may also be high-dimensional. This covers both the case where we seek toforecast a multi-variate or high-dimensional time series, and that of multi-stepforecasting.

We denote by F the family of loss functions associated to H: F = (x, y)→L(h(x), y)∶h ∈ H. For any z = (x, y) ∈ Z, L(h(x), y) can thus be replaced byf(z) for some f ∈ F . We will assume a bounded loss function, that is ∣f ∣ ≤Mfor all f ∈ F for some M ∈ R+.

The key quantity of interest in the analysis of generalization is the followingsupremum of the empirical process defined as follows:

Φ(ZT1 ) = supf∈F

⎛⎝E[f(ZT+1)∣ZT1 ] −

T

∑t=1qtf(Zt)

⎞⎠, (3)

where q1, . . . , qT are real numbers, which in the standard learning scenariosare chosen to be all equal to 1

T. In our general setting, different Zts may follow

different distributions, thus distinct weights could be assigned to the errorsmade on different sample points, depending on their relevance to forecastingthe future ZT+1. The generalization bounds that we present below are foran arbitrary sequence q = (q1, . . . qT ) which, in particular, covers the caseof uniform weights. Remarkably, our bounds do not even require the non-negativity of q.

The two key ingredients of our analysis are the notions of sequential com-plexity [Rakhlin et al., 2010] and that of discrepancy measure between targetand source distributions. In the next two sections, we give a detailed descrip-tion of these notions.

2.1 Sequential Complexities

Our generalization bounds are expressed in terms of data-dependent measuresof sequential complexity such as expected sequential covering number or se-

Page 5: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

5

g(z1), v1

g(z2), v2 g(z3), v3

g(z4), v4 g(z5), v5 g(z6), v6 g(z7), v7

g(z12), v12

+11

+1+11 1

1

Z1, Z01 D1

Z3, Z03 D2(·|Z 0

1)Z2, Z02 D2(·|Z1)

Z5, Z05

D3(·|Z1, Z02)

Z4, Z04

D3(·|Z1, Z2)

Fig. 1 Weighted sequential cover and sequential covering numbers.

quential Rademacher complexity [Rakhlin et al., 2010], which we review inthis section.

We adopt the following definition of a complete binary tree: a Z-valuedcomplete binary tree z is a sequence (z1, . . . , zT ) of T mappings zt∶±1t−1 →Z, t ∈ [1, T ]. A path in the tree is σ = (σ1, . . . , σT−1) ∈ ±1T−1. To simplifythe notation, we will write zt(σ) instead of zt(σ1, . . . , σt−1), even though ztdepends only on the first t−1 elements of σ. The following definition generalizesthe classical notion of covering numbers to sequential setting. A set V of R-valued trees of depth T is a sequential α-cover (with respect to q-weighted`p norm) of a function class G on a tree z of depth T if for all g ∈ G and allσ ∈ ±T , there is v ∈ V such that

⎛⎝T

∑t=1

∣vt(σ) − g(zt(σ))∣p⎞⎠

1p

≤ ∥q∥−1q α,

where ∥⋅∥q is the dual norm associated to ∥⋅∥p, that is 1p+ 1q= 1. The (sequential)

covering number Np(α,G,z) of a function class G on a given tree z is defined tobe the size of the minimal sequential cover. The maximal covering number isthen taken to be Np(α,G) = supzNp(α,G,z). In the case of uniform weights,this definition coincides with the standard definition of sequential coveringnumbers. Note that this is a purely combinatorial notion of complexity whichignores the distribution of the process in the given learning problem.

Data-dependent sequential covering numbers can be defined as follows.Given a stochastic process distributed according to the distribution p withpt(⋅∣zt−11 ) denoting the conditional distribution at time t, we sample a Z ×Z-valued tree of depth T according to the following procedure. Draw two in-dependent samples Z1, Z

′1 from p1: in the left child of the root draw Z2, Z

′2

according to p2(⋅∣Z1) and in the right child according to p2(⋅∣Z ′1). More gen-

erally, for a node that can be reached by a path (σ1, . . . , σt), we draw Zt, Z′t

according to pt(⋅∣S1(σ1), . . . , St−1(σt−1)), where St(1) = Zt and St(−1) = Z ′t.

Let z denote the tree formed using Zts and T the distribution of trees z therebydefined. Then, the expected covering number is defined as Ez∼T [Np(α,G,z)].

Page 6: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

6

1 Tt T + 1

key difference

L(h,Zt11 ) L(h,ZT

1 )

Fig. 2 Key difference in the definition of discrepancy.

Figure 1 illustrates these notions. For i.i.d. sequences, the notion of expectedsequential covering numbers exactly coincides with that of expected coveringnumbers from classical statistical learning theory.

The sequential Rademacher complexity RseqT (G) of a function class G is

defined by

RseqT (G) = sup

zE⎡⎢⎢⎢⎢⎣

supg∈G

T

∑t=1σtqtg(zt(σ))

⎤⎥⎥⎥⎥⎦, (4)

where the supremum is taken over all complete binary trees of depth T withvalues in Z and where σ is a sequence of Rademacher random variables[Rakhlin et al., 2010, 2011, 2015a,b]. Similarly, one can also define the notionof distribution-dependent sequential Rademacher complexity as well as othernotions of sequential complexity such as Littlestone dimension and sequentialmetric entropy that have been shown to characterize learning in the on-linelearning scenario. For further details, we refer the reader to [Littlestone, 1987;Rakhlin et al., 2010, 2011, 2015a,b].

2.2 Discrepancy Measure

The final ingredient needed for expressing our learning guarantees is the notionof discrepancy between target distribution and the distribution of the sample:

discT (q) = supf∈F

(E[f(ZT+1)∣ZT1 ] −T

∑t=1qtE[f(Zt)∣Zt−11 ]). (5)

In what follows we will omit subscript T and write discT to simplify the nota-tion. Figure 2 illustrates the key difference term in the definition of discrepancy.

The discrepancy disc is a natural measure of the non-stationarity of thestochastic process Z with respect to both the loss function L and the hypoth-esis set H. In particular, note that if the process Z is i.i.d., then we simplyhave disc(q) = 0 provided that qts form a probability distribution. To help thereader develop further intuition about discrepancy, we provide several explicitexamples below.

Example 1. Consider the case of a time-homogeneous Markov chain on aset 0, . . . ,N − 1 such that P(Xt ≡ (i − 1) mod N ∣Xt−1 = i) = p and P(Xt ≡(i+1) mod N ∣Xt−1 = i) = 1−p for some 0 ≤ p ≤ 1. This process is non-stationary

Page 7: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

7

if it is not started with an equilibrium distribution. Suppose that the set ofhypothesis is x ↦ a(x − 1) + b(x + 1)∶a + b = 1, a, b ≥ 0 and the loss functionL(y, y′) = `(∣y − y′∣) for some `. It follows that for any (a, b)

E[f(Zt)∣Zt−11 ] = p∣a − b − 1∣ + (1 − p)∣a − b + 1∣

and hence disc(q) = 0 provided q is a probability distribution. Note that if wechose a larger hypothesis set x↦ a(x − 1) + b(x + 1)∶a, b ≥ 0 then

E[f(Zt)∣Zt−11 ] = p∣(a + b − 1)Xt + a − b∣ + (1 − p)∣(a + b + 1)Xt + a − b∣

and in general it may be the case that disc(q) ≠ 0. This highlights an impor-tant property of discrepancy: it takes into account not only the underlyingdistribution of the stochastic process but other components of the learningproblem such as the loss function and the hypothesis set that is being used.

Example 2. Let ε0, ε1, . . . be a sequence of i.i.d. random variables such thatP(εt = −1) = p and P(ε = 1) = 1 − p for some 0 ≤ p ≤ 1. Consider the followingstochastic: Yt = Yt−1 + εtY0 for t ≥ 1 and X0 = ε0. Observe that this processis not Markov, not stationary and not mixing. Furthermore, when p ≠ 1

2this

process has a stochastic trend. Let H = x ↦ x + c∶ c ∈ [−1,1] and consider aloss function L such that L(y′, y) = `(y′ − y) for some ` and any y, y′. Observethat discrepancy in this case is given by

suph∈H

⎛⎝EεT+1[`(c − εT+1X0)] −

1

T

T

∑t=1

Eεt[`(c − εtX0)]⎞⎠= 0.

This result combined with Corollary 2 below can be used to provide a learningguarantee for this setting.

Example 3. The next example is concerned with periodic time series. Letε0, ε1, . . . be a sequence of i.i.d. standard Gaussian random variables and con-sider a stochastic process Yt = sin(φt)+εt. We let H be a set of offset functionsas before1 H = x↦ x+ c ∶ c ∈ [−1,1]. If L(y′, y′) = (y − y′)2 then discrepancyis given by

suph∈H

⎛⎝(c − sin(φ(T + 1))2 −

T

∑t=1qt(c − sin(φt))2

⎞⎠

and since sin is a periodic function it is possible to choose q such that aboveexpression is non-positive and ∥q∥2 = 1√

T. This result together with Corollary 2

below can be used to provide learning guarantees in this setting and shows thatperiodic time series can be improperly learned with offset functions.

The weights q play a crucial role in the learning problem. Consider ourExample 1, where transition probability distributions (pi,1 − pi) are differ-ent for each state i. Note that choosing q to be a uniform distribution, ingeneral, leads to a non-zero discrepancy. However, with q′t = 1Xt−1=XT

andqt = q′t/∑Tt=1 q′t discrepancy is zero. Note that in fact it is not the only choice

1 A set of periodic functions would also be sufficient for this example.

Page 8: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

8

that leads to a zero discrepancy in this example and in fact any distributionthat is supported on ts for which Xt−1 = XT will lead to the same result.However, qts based on q′t have an advantage of providing the largest effectivesample.

It is also possible to give bounds on disc(q) in terms of other naturaldistances between distribution. For instance, if q is a probability distributionthen Pinsker’s inequality yields

disc(q) ≤M∥PT+1(⋅∣ZT1 )−T

∑t=1qtPt(⋅∣Zt−11 )∥

≤ 1√2D

12⎛⎝PT+1(⋅∣ZT1 ) ∥

T

∑t=1qtPt(⋅∣Zt−11 )

⎞⎠,

where ∥ ⋅ ∥ is the total variation distance and D(⋅ ∥ ⋅) the relative entropy,Pt+1(⋅∣Zt1) the conditional distribution of Zt+1, and ∑Tt=1 qtPt(⋅∣Zt−11 ) the mix-ture of the sample marginals. Note that these upper bounds are often too loosesince they are agnostic to the loss function and the hypothesis set that is be-ing used. For our earlier Markov chain example, the support of PT+1(⋅∣ZT1 ) isXT − 1,XT + 1 while the mixture 1

T ∑Tt=1Pt(⋅∣Zt−11 ) is likely to be supported

on the whole set 0, . . . ,N − 1 which leads to a large total variation distance.Of course, it is possible to choose qts so that the mixture is also supportedonly on XT − 1,XT + 1 but that may reduce the effective sample size whichis not necessary when working with disc(q).

However, the most important property of the discrepancy disc(q) is that, asshown later in Section 5, it can be estimated from data under some additionalmild assumptions. Kuznetsov and Mohri [2014] also give generalization boundsbased on averaged generalization error for non-stationary mixing processes interms of a related notion of discrepancy. It is not known if the discrepancymeasure used in [Kuznetsov and Mohri, 2014] can be estimated from data.

3 Generalization Bounds

In this section, we prove new generalization bounds for forecasting non-stationarytime series. The first step consists of using decoupled tangent sequences to es-tablish concentration results for the supremum of the empirical process Φ(ZT1 ).Given a sequence of random variables ZT1 we say that Z′T

1 is a decoupled tan-gent sequence if Z ′

t is distributed according to P(⋅∣Zt−11 ) and is independentof Z∞

t . It is always possible to construct such a sequence of random variables[De la Pena and Gine, 1999]. The next theorem is the main result of thissection.

Theorem 1 Let ZT1 be a sequence of random variables distributed accordingto p. Fix ε > 2α > 0. Then, the following holds:

P(Φ(ZT1 ) − disc(q) ≥ ε) ≤ Ev∼T

[N1(α,F ,v)] exp(− (ε − 2α)28M2∥q∥22

).

Page 9: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

9

Proof Since the difference of two suprema is upper bounded by the supremumof the difference, by Markov’s inequality, the following holds for any ε > 0:

P(Φ(ZT1 ) − disc(q) ≥ ε)

≤ P⎛⎝

supf∈F

⎛⎝T

∑t=1qt(E[f(Zt)∣Zt−11 ] − f(Zt))

⎞⎠≥ ε

⎞⎠

≤ exp(−λε)E⎡⎢⎢⎢⎢⎣

exp⎛⎝λ supf∈F

⎛⎝T

∑t=1qt(E[f(Zt)∣Zt−11 ] − f(Zt))

⎞⎠⎞⎠

⎤⎥⎥⎥⎥⎦.

Let Z′T1 be a decoupled tangent sequence associated to ZT1 , then, the following

equalities hold: E[f(Zt)∣Zt−11 ] = E[f(Z ′t)∣Zt−11 ] = E[f(Z ′

t)∣ZT1 ]. Using theseequalities and Jensen’s inequality, we obtain the following:

E [ exp (λ supf∈F

T

∑t=1qt(E[f(Zt)∣Zt−11 ] − f(Zt)))]

= E [ exp (λ supf∈F

E [T

∑t=1qt(f(Z ′

t) − f(Zt))∣ZT1 ])]

≤ E [ exp (λ supf∈F

T

∑t=1qt(f(Z ′

t) − f(Zt)))],

where the last expectation is taken over the joint measure of ZT1 and Z′T1 .

Applying Lemma 2 (Appendix A), we can further bound this expectation by

E(z,z′)∼T

Eσ[ exp(λ sup

f∈F

T

∑t=1σtqt(f(z′t(σ)) − f(zt(σ))))]

≤ E(z,z′)∼T

Eσ[ exp(λ sup

f∈F

T

∑t=1σtqtf(z′t(σ)) + λ sup

f∈F

T

∑t=1

−σtqtf(zt(σ)))]

≤ 12

E(z,z′)

Eσ[ exp(2λ sup

f∈F

T

∑t=1σtqtf(z′t(σ)))]

+ 12

E(z,z′)

Eσ[ exp(2λ sup

f∈F

T

∑t=1σtqtf(zt(σ)))]

= Ez∼T

Eσ[ exp(2λ sup

f∈F

T

∑t=1σtqtf(zt(σ)))],

where the second inequality holds by the convexity of the exponential functionand the last inequality by symmetry. Given z, let C denote the minimal α-cover with respect to the q-weighted `1-norm of F on z. Then, the followingbound holds

supf∈F

T

∑t=1σtqtf(zt(σ)) ≤ max

c∈C

T

∑t=1σtqtct(σ) + α.

Page 10: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

10

By the monotonicity of the exponential function,

Eσ[ exp(2λ sup

f∈F

T

∑t=1σtqtf(zt(σ)))] ≤ exp(2λα)E

σ[ exp(2λmax

c∈C

T

∑t=1σtqtct(σ))]

≤ exp(2λα) ∑c∈C

Eσ[ exp(2λ

T

∑t=1σtqtct(σ))].

Since ct(σ) depends only on σ1, . . . , σT−1, by Hoeffding’s bound,

Eσ[ exp(2λ

T

∑t=1σtqtct(σ))]

= E [ exp(2λT−1∑t=1

σtqtct(σ)) EσT

[ exp(2λσT qT cT (σ))∣σT−11 ]]

≤ E [ exp(2λT−1∑t=1

σtqtct(σ)) exp(2λ2q2TM2)].

Iterating this inequality and using the union bound, we obtain the following:

P( supf∈F

T

∑t=1qt(E[f(Zt)∣Zt−11 ] − f(Zt)) ≥ ε)

≤ Ev∼T

[N1(α,G,v)] exp (− λ(ε − 2α) + 2λ2M2∥q∥22).

Optimizing over λ > 0 completes the proof. ⊓⊔

An immediate consequence of Theorem 1 is the following result.

Corollary 1 For any δ > 0, with probability at least 1 − δ, the following in-equality holds for all f ∈ F and all α > 0:

E[f(ZT+1)∣ZT1 ] ≤T

∑t=1qtf(Zt)+disc(q)+2α+M∥q∥2

√8 log

Ev∼T [N1(α,F ,v)]δ

.

We are not aware of other finite sample bounds in a non-stationary non-mixingcase. In fact, our bounds appear to be novel even in the stationary non-mixingcase.

While Rakhlin et al. [2015a] give high probability bounds for a differentquantity than the quantity of interest in time series prediction,

supf∈F

⎛⎝T

∑t=1qt(E[f(Zt)∣Zt−11 ] − f(Zt))

⎞⎠, (6)

their analysis of this quantity can also be used in our context to derive highprobability bounds for Φ(ZT1 ) − disc(q). However, this approach results inbounds that are in terms of purely combinatorial notions such as maximalsequential covering numbers N1(α,F). While at first sight, this may seem asa minor technical detail, the distinction is crucial in the setting of time series

Page 11: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

11

prediction. Consider the following example. Let Z1 be drawn from a uniformdistribution on 0,1 and Zt ∼ p(⋅∣Zt−1) with p(⋅∣y) being a distribution over0,1 such that p(x∣y) = 2/3 if x = y and 1/3 otherwise. Let G be defined by G =g(x) = 1x≥θ ∶ θ ∈ [0,1]. Then, one can check that Ev∼T [N1(α,G,v)] = 2, whileN1(α,G) ≥ 2T . The data-dependent bounds of Theorem 1 and Corollary 1highlight the fact that the task of time series prediction lies in-between thefamiliar i.i.d. scenario and the adversarial on-line learning setting.

However, the key component of our learning guarantees is the discrepancyterm disc(q). Note that in the general non-stationary case, the bounds of The-orem 1 may not converge to zero due to the discrepancy between the target andsample distributions. This is also consistent with the lower bounds of Barve andLong [1996] that we discuss in more detail in Section 5. However, convergencecan be established in some special cases. In the i.i.d. case, our bounds reduce tothe standard covering numbers learning guarantees. The discrepancy examplesof the previous section also show that convergence can be established for vari-ous stochastic processes, including non-mixing and non-stationary ones. In thedrifting scenario, with ZT1 being a sequence of independent random variables,our discrepancy measure and bounds coincide with those studied in [Mohriand Munoz Medina, 2012]. In the case of φ-mixing non-stationary stochas-tic processes, our results provide tighter learning bounds for path-dependentgeneralization error than the previous best results in [Kuznetsov and Mohri,2017a].2 However, shown in Section 5, the most important advantage of ourbounds is that the discrepancy measure we use can be estimated from data.

We now show that expected sequential covering numbers can be upperbounded in terms of the sequential Rademacher complexity. Generalizationbounds in terms of the sequential Rademacher complexity are not as tightas bounds in terms the expected sequential covering numbers since the for-mer is a purely combinatorial notion. Nevertheless, the analysis of sequentialRademacher complexity may be simpler for certain hypothesis classes such asfor instance that of kernel-based hypotheses, which we study in Section 4. Wehave the following extension of Sudakov’s Minoration Theorem to the settingof sequential complexities.

Theorem 2 The following bound holds for the sequential Rademacher com-plexity of F :

supα>0

α

2

√logN2(2α,F) ≤ 3

√π

2logT Rseq

T (F),

whenever N2(2α,F) < +∞.

Proof We consider the following Gaussian-Rademacher sequential complexity:

GseqT (F ,z) = E

γ,σ

⎡⎢⎢⎢⎢⎣supf∈F

(T

∑t=1qtσtγtf(zt(σ)))

⎤⎥⎥⎥⎥⎦, (7)

2 For further details, see the discussion following Corollary 2.

Page 12: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

12

where σ is an independent sequence of Rademacher random variables, γ isan independent sequence of standard Gaussian random variables and z is acomplete binary tree of depth T with values in Z.

Observe that if V is any α-cover with respect to the q-weighted `2-normof F on z, then, the following holds by independence of γ and σ:

Gseq(F ,z) ≥ EγEσ

⎡⎢⎢⎢⎢⎣supv∈V

(T

∑t=1qtσtγtvt(σ))

⎤⎥⎥⎥⎥⎦= E

σEγ

⎡⎢⎢⎢⎢⎣supv∈V

(T

∑t=1qtσtγtvt(σ))

⎤⎥⎥⎥⎥⎦.

Observe that V is also 2α-cover with respect to the q-weighted `2-norm of Fon z. We can obtain a smaller 2α-cover V0 from V be eliminating vs that areα close to some other v′ ∈ V . Since V is finite, let V = v1, . . . ,v∣V ∣, and foreach vi we delete vj ∈ vi+1, . . . ,v∣V ∣ such that

⎛⎝T

∑t=1

(qtvit(σ) − qtvjt (σ))2⎞⎠

1/2≤ α.

It is straightforward to verify that V0 is 2α-cover with respect to the q-weighted`2-norm of F on z. Furthermore, it follows that for a fixed σ, the followingholds:

⎡⎢⎢⎢⎢⎣(T

∑t=1qtσtγtvt(σ) −

T

∑t=1qtσtγtv

′t(σ))

2⎤⎥⎥⎥⎥⎦≥ α2.

for any v′,v ∈ V0. Let Zi, i = 1, . . . , ∣V0∣ be a sequence of independent Gaussianrandom variables with E[Zi] = 0 and E[Z2

i ] = α2/2. Observe that E[(Zi−Zj)] =α2 and hence by Sudakov-Fernique inequality it follows that

EσEγ

⎡⎢⎢⎢⎢⎣supv∈V

(T

∑t=1qtσtγtvt(σ))

⎤⎥⎥⎥⎥⎦≥ E

σEγ

⎡⎢⎢⎢⎢⎣supv∈V0

(T

∑t=1qtσtγtvt(σ))

⎤⎥⎥⎥⎥⎦≥ E [ max

i=1,...,∣V0∣Zi]

≥ α2

√log ∣V0∣,

where the last inequality is the standard result for Gaussian random variables.Therefore, we conclude that Gseq(F ,z) ≥ supα>0

α2

√logN2(2α,F ,z). On the

other hand, by the standard properties of Gaussian complexity [Ledoux andTalagrand, 1991], we can write:

GseqT (F ,z) ≤ 3

√π

2logT E

ε,σ

⎡⎢⎢⎢⎢⎣supf∈F

(T

∑t=1qtσtεtf(zt(σ)))

⎤⎥⎥⎥⎥⎦,

Page 13: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

13

where ε is an independent sequence of Rademacher variables. We re-arrangez into zε so that zt(σ) = zεt (εσ) for all σ ∈ ±1T and it follows that

Eε,σ

⎡⎢⎢⎢⎢⎣supf∈F

(T

∑t=1qtσtεtf(zt(σ)))

⎤⎥⎥⎥⎥⎦= E

ε,σ

⎡⎢⎢⎢⎢⎣supf∈F

(T

∑t=1qtσtεtf(zεt (εσ)))

⎤⎥⎥⎥⎥⎦

≤ Eσ

supz

⎡⎢⎢⎢⎢⎣supf∈F

(T

∑t=1qtσtεtf(zt(εσ)))

⎤⎥⎥⎥⎥⎦

= supz

⎡⎢⎢⎢⎢⎣supf∈F

(T

∑t=1qtσtf(zt(σ)))

⎤⎥⎥⎥⎥⎦.

Therefore, the following inequality holds

supα>0

α

2

√logN2(2α,F ,z) ≤ 3

√π

2logTRseq

T (F)

and the proof is completed by taking the supremum with respect to z of bothsides of this inequality. ⊓⊔

Observe that, by definition of the sequential complexities, the followinginequalities hold: Ev∼T [N1(α,G,v)] ≤ Ev∼T [N2(α,G,v)] ≤ N2(α,G). Thus,setting α = ∥q∥2/2, applying Corollary 1 and Theorem 2, and using the factthat

√x + y ≤ √

x +√y for x, y > 0, yields the following result.

Corollary 2 For any δ > 0, with probability at least 1 − δ, the following in-equality holds for all f ∈ F and all α > 0:

E[f(ZT+1)∣ZT1 ]

≤T

∑t=1qtf(Zt) + disc(q) + ∥q∥2 + 6M

√4π logTRseq

T (F) +M∥q∥2√

8 log1

δ.

As already mentioned in Section 2, sequential Rademacher complexity canbe further upper bounded in terms of the sequential metric entropy, sequen-tial Littlestone dimension, maximal sequential covering numbers and othercombinatorial notions of sequential complexity of F . These notions have beenextensively studied in the past [Rakhlin et al., 2015b]. Note that, in [Kuznetsovand Mohri, 2017a], an almost identical bound on E[f(ZT+s)∣ZT1 ] is proven un-der the assumption that the underlying stochastic process is φ-mixing (seeTheorem 3 in [Kuznetsov and Mohri, 2017a]). Our result requires neither ofthese assumptions.

Corollary 1 or Corollary 2 can also be used to derive oracle inequalities forthe setting that we are considering. Let f∗ = argminf∈F E[f(ZT+1)∣ZT1 ] and

Page 14: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

14

f0 = argminf∈F ∑Tt=1 qtf(Zt). Then, it follows that

E[f0(ZT+1)∣ZT1 ] −E[f∗(ZT+1)∣ZT1 ] = E[f0(ZT+1)∣ZT1 ] −T

∑t=1qtf0(Zt)

+T

∑t=1qtf0(Zt) −

T

∑t=1qtf

∗(Zt)

+T

∑t=1qtf

∗(Zt) −E[f∗(ZT+1)∣ZT1 ]

≤ 2Φ(ZT1 ).

The following result immediately follows.

Corollary 3 For any δ > 0, with probability at least 1 − δ, for all α > 0,

E[f0(ZT+1)∣ZT1 ]

≤ inff∈F

E[f(ZT+1)∣ZT1 ]+disc(q) + ∥q∥2 + 6M√

4π logTRseqT (F) +M∥q∥2

√8 log

1

δ.

We conclude this section by observing that our results also hold in thecase where qt = qt(f,XT+1, Zt), which is a common heuristic used in somealgorithms for forecasting non-stationary time series [Lorenz, 1969; Zhao andGiannakis, 2016]. We formalize this result in the following theorem.

Theorem 3 Let q∶F ×X ×Z ↦ [−B,B] and suppose XT+1 is ZT1 -measurable.Then, for any δ > 0, with probability at least 1 − δ, for all f ∈ F and all α > 0,

E[f(ZT+1)∣ZT1 ]

≤ 1

T

T

∑t=1q(f,XT+1, Zt)f(Zt) + disc(q) + 2α + 2MB

¿ÁÁÀ

8log Ev∼T [N1(α,G,v)]

δ

T,

where disc(q) is defined by

disc(q) = supf∈F

(E[f(ZT+1)∣ZT1 ] −T

∑t=1

EZt

[q(f,XT+1, Zt)f(Zt)∣Zt−11 ]), (8)

and where G = (x, z)↦ q(f, x, z)f(z)∶ f ∈ F.

We illustrate this result with some examples. Consider for instance a Gaus-sian Markov process with Pt(⋅∣ZT1 ) being a normal distribution with mean Zt−1and unit variance. Suppose f(h, z) = `(∥h(x)−y∥2) for some function `. We let

q(h,x′, (x, y)) = exp(− 12∥y−h(x)−x′+h(x′)∥22)/ exp (− 1

2∥x−y∥22) and observe

Page 15: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

15

that for any f :

EZt

[q(XT+1, Zt)f(Zt)∣Zt−11 ]

= ∫ `(∥y − h(Xt)∥2)q(h,XT+1, (Xt, y)) exp ( − 1

2∥y −Xt∥22)dy

= ∫ `(∥y − h(Xt)∥2) exp ( − 1

2∥y − h(Xt) −XT+1 + h(XT+1)∥22)dy

= ∫ `(∥x∥2) exp ( − 1

2∥x −XT+1 + h(XT+1)∥22)dx

= ∫ `(∥y − h(XT+1)∥2) exp ( − 1

2∥y −XT+1∥22)dy

= E[f(ZT+1)∣ZT1 ],

which shows that disc(q) = 0 in this case. More generally, if Z is a time-homogeneous Markov process, then one can use the Radon-Nikodym deriva-

tive dP(⋅∣x′)dP(⋅∣x) (y − h(x) + h(x

′)) for q, which will again lead to zero discrepancy.

The major obstacle for this approach is that Radon-Nikodym derivatives aretypically unknown and one needs to learn them from data via density esti-mation, which itself can be a difficult task. In Section 4, we investigate analternative approach to choosing weights q based on extending the results ofTheorem 1 to hold uniformly over weight vectors q.

4 Kernel-Based Hypotheses with Regression Losses

In this section, we present generalization bounds for kernel-based hypotheseswith regression losses. Our results in this section are based on the learningguarantee presented in Corollary 2 in terms of the sequential Rademachercomplexity of a class. Our first result is a bound on the sequential Rademachercomplexity of the kernel-based hypothesis with regression losses.

Lemma 1 Let p ≥ 1 and F = (x, y) → (w ⋅ Ψ(x) − y)p∶ ∥w∥H ≤ Λ where His a Hilbert space and Ψ ∶X → H a feature map. Assume that the condition∣w ⋅ x − y∣ ≤ M holds for all (x, y) ∈ Z and all w such that ∥w∥H ≤ Λ. Then,the following inequalities hold:

RseqT (F) ≤ pMp−1CTR

seqT (H) ≤ CT (pMp−1Λr∥q∥2), (9)

where K is a PDS kernel associated to H, H = x → w ⋅ Ψ(x) ∶ ∥w∥H ≤ Λ,

r = supxK(x,x), and CT = 8(1 + 4√

2 log3/2(eT 2)).

Proof We begin the proof by setting qtf(zt(σ)) = qt(w ⋅Ψ(xt(σ))−yt(σ))2 =1T(w ⋅x′(σ)−y′t)2, where x′t(σ) =

√TqtΨ(xt(σ)) and y′t(σ) =

√Tqtyt(σ). We

Page 16: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

16

let z′t = (x′t,y′t). Then we observe that

RseqT (F) = sup

z′=(x′,y′)Eσ

⎡⎢⎢⎢⎢⎣supw

1

T

T

∑t=1σt(w ⋅ x′t(σ) − y′t(σ))p

⎤⎥⎥⎥⎥⎦

= supz=(x,y)

⎡⎢⎢⎢⎢⎣supw

T

∑t=1qtσt(w ⋅ xt(σ) − yt(σ))p

⎤⎥⎥⎥⎥⎦.

Since x → ∣x∣p is pMp−1-Lipschitz over [−M,M], by Lemma 13 in [Rakhlinet al., 2015a], the following bound holds:

RseqT (F) ≤ pMp−1CTR

seqT (H′),

where H′ = (x, y) → w ⋅ Ψ(x) − y ∶ ∥w∥H ≤ Λ. Note that Lemma 13 re-quires that Rseq

T (H′) > 1/T which is guaranteed by Khintchine’s inequality.By definition of the sequential Rademacher complexity

RseqT (H′) = sup

(x,y)Eσ

⎡⎢⎢⎢⎢⎣supw

T

∑t=1σtqt(w ⋅ Ψ(xt(σ)) − y(σ))

⎤⎥⎥⎥⎥⎦

= supx

⎡⎢⎢⎢⎢⎣supw

T

∑t=1σtqtw ⋅ Ψ(xt(σ))

⎤⎥⎥⎥⎥⎦+ sup

yEσ

⎡⎢⎢⎢⎢⎣

T

∑t=1σtqty(σ)

⎤⎥⎥⎥⎥⎦=Rseq

T (H),

where for the last equality we used the fact that σts are mean zero randomvariables and σt is independent of y(σ) = y(σ1, σ2, . . . , σt−1). This proves thefirst result. To prove the second bound we observe that the right-hand sidecan be bounded as follows:

supx

⎡⎢⎢⎢⎢⎣supw

T

∑t=1σtqtw ⋅ Ψ(xt(σ))

⎤⎥⎥⎥⎥⎦≤ Λ sup

xEσ

XXXXXXXXXXX

T

∑t=1σtqtΨ(xt(σ))

XXXXXXXXXXXH

≤ Λ supx

¿ÁÁÁÀE

σ

XXXXXXXXXXX

T

∑t=1σtqtΨ(xt(σ))

XXXXXXXXXXX

2

H

= Λ supx

¿ÁÁÁÀE

σ

⎡⎢⎢⎢⎢⎣

T

∑t,s=1

σtσsqtqsΨ(xt(σ)) ⋅ Ψ(xs(σ))⎤⎥⎥⎥⎥⎦

≤ Λ supx

¿ÁÁÀ T

∑t=1q2t Eσ[K(xt(σ), xt(σ))]

≤ Λr∥q∥2,

where again we are using the fact that if s < t then

Eσ[σtσsqtqsK(xt(σ), xs(σ))] = E

σ[σt]E

σ[σsqtqsK(xt(σ), xs(σ))] = 0

by the independence of σt from σs, as well as xt(σ) = xt(σ1, . . . , σt−1) andxs(σ) = xs(σ1, . . . , σs). ⊓⊔

Page 17: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

17

Our next result establishes a high-probability learning guarantee for kernel-based hypothesis. Combining Corollary 2 with Lemma 1, yields the followingresult.

Theorem 4 Let p ≥ 1 and F = (x, y) → (w ⋅ Ψ(x) − y)p∶ ∥w∥H ≤ Λ whereH is a Hilbert space and Ψ ∶X → H a feature map. Assume that the condition∣w ⋅ x − y∣ ≤ M holds for all (x, y) ∈ Z and all w such that ∥w∥H ≤ Λ. IfZT1 = (XT

1 ,YT1 ) is a sequence of random variables then, for any δ > 0, with

probability at least 1−δ the following holds for all h ∈ x→w ⋅Ψ(x)∶ ∥w∥H ≤ Λ:

E[(h(XT+1) − YT+1)p∣ZT1 ] ≤T

∑t=1qt(h(Xt) − Yt)p + disc(q) +CTΛr∥q∥2

+ 2M∥q∥2√

8 log1

δ,

where CT = 48pMp√

4π logT (1 + 4√

2 log3/2(eT 2)). Thus, for p = 2,

E[(h(XT+1) − YT+1)2∣ZT1 ] ≤T

∑t=1qt(h(Xt) − Yt)2 + disc(q) +O

⎛⎝(log2 T )Λr∥q∥2

⎞⎠.

The results in Theorem 4 (as well as Theorem 1) can be extended to holduniformly over q and we provide exact statement in Theorem 6 in Appendix A.This result suggests that we should seek to minimize ∑Tt=1 qtf(Zt) + disc(q)over q and w. This insight is used to develop our algorithmic solutions forforecasting non-stationary time series in Section 6.

5 Estimating Discrepancy

In Section 3, we showed that the discrepancy measure disc(q) is crucial forforecasting non-stationary time series. In particular, if we could select a dis-tribution q over the sample ZT1 that would minimize the discrepancy disc(q)and use it to reweight training points, then we could achieve a more favor-able learning guarantee for an algorithm trained on this reweighted sample.In some special cases, the discrepancy disc(q) can be computed analytically.However, in general, we do not have access to the distribution of ZT1 and hencewe need to estimate the discrepancy from data. Furthermore, in practice, wenever observe ZT+1 and it is not possible to estimate disc(q) without somefurther assumptions. One natural assumption is that the distribution Pt ofZt does not change drastically with t on average. Under this assumption, thelast s observations ZTT−s+1 are effectively drawn from the distribution close toPT+1. More precisely, we can write

disc(q) ≤ supf∈F

(1

s

T

∑t=T−s+1

E[f(Zt)∣Zt−11 ] −T

∑t=1qtE[f(Zt)∣Zt−11 ])

+ supf∈F

(E[f(ZT+1)∣ZT1 ] − 1

s

T

∑t=T−s+1

E[f(Zt)∣Zt−11 ]).

Page 18: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

18

We will assume that the second term, denoted by discs, is sufficiently smalland will show that the first term can be estimated from data. But, we firstnote that our assumption is necessary for learning in this setting. Observe thatthe following inequalities hold:

supf∈F

(E[ZT+1∣ZT1 ] −E[f(Zr)∣Zr−11 ]) ≤T

∑t=r

supf∈F

(E[f(Zt+1)∣Zt1] −E[f(Zt)∣Zt−11 ])

≤MT

∑t=r

∥Pt+1(⋅∣Zt1) −Pt(⋅∣Zt−11 )∥TV,

for all r = T − s + 1, . . . , T . Therefore, we must have

discs ≤1

s∑

t=T−s+1supf∈F

(E[ZT+1∣ZT1 ] −E[f(Zt)∣Zt1]) ≤s + 1

2Mγ,

where γ=supt∥Pt+1(⋅∣Zt1)−Pt(⋅∣Zt−11 )∥TV. Barve and Long [1996] showed that

[VC-dim(H)γ] 13 is a lower bound on the generalization error in the setting of

binary classification where ZT1 is a sequence of independent but not identicallydistributed random variables (drifting).

More precisely, Barve and Long [1996] consider the setting in which thelearner observes a sequence (X1, Y1), (X2, Y2), . . . from P1, P2, . . . respectively.It is shown that there exists a sufficiently small ε > 0 and a constant c > 0such that if γ > cε3/d then, for any algorithm A and any t0, there exists asequence P1, P2, . . . and T > t0 such that the generalization error of hA isat least ε = Ω((γd) 1

3 ), where hA is the hypothesis produced by A using thesample (X1, Y1), . . . , (XT , YT ).

Observe that this setting is the special case of the general setup consid-ered in our work, since in our case the pairs (Xt, Yt) are not required to beindependent.

The following result shows that we can estimate the first term in the upperbound on disc(q).Theorem 5 Let ZT1 be a sequence of random variables. Then, for any δ > 0,with probability at least 1 − δ, the following holds for all α > 0:

supf∈F

⎛⎝T

∑t=1

(pt − qt)E[f(Zt)∣Zt−11 ]⎞⎠≤ supf∈F

⎛⎝T

∑t=1

(pt − qt)f(Zt)⎞⎠+B,

where B = 2α +M∥q − p∥2√

8 log Ez∼T [N1(α,F,z)]δ

and where p is the uniformdistribution over the last s points.

Proof The first step consists of upper-bounding the difference of suprema bythe supremum of the differences:

supf∈F

⎛⎝T

∑t=1

(pt − qt)E[f(Zt)∣Zt−11 ]⎞⎠− supf∈F

⎛⎝T

∑t=1

(pt − qt)f(Zt)⎞⎠

≤ supf∈F

⎛⎝T

∑t=1

(pt − qt)(E[f(Zt)∣Zt−11 ] − f(Zt))⎞⎠.

Page 19: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

19

Next, arguments similar to those in the proof of Theorem 1 can be applied tocomplete the proof. ⊓⊔

Theorem 1 and Theorem 5 combined with the union bound yield the fol-lowing result.

Corollary 4 Let ZT1 be a sequence of random variables. Then, for any δ > 0,with probability at least 1 − δ, the following holds for all f ∈ F and all α > 0:

E[f(ZT+1)∣ZT1 ] ≤T

∑t=1qtf(Zt) + disc(q) + discs +4α

+M[∥q∥2 + ∥q − p∥2]√

8 log 2Ev∼T [N1(α,G,z)]δ

,

where disc(q) = supf∈F (∑Tt=1(pt−qt)f(Zt)) and p is the uniform distribution

over the last s points.

We note that, while we used a uniform prior p over the last s points to stateTheorem 5 and Corollary 4, any other distribution over the sample points canbe used as well. Our choice of p is based on the assumption that the last spoints admit a distribution that is the most similar to the future that we areseeking to predict.

In Section 6, we combine these results with Theorem 6 that extends learn-ing guarantees to hold uniformly over qs to derive novel algorithms for non-stationary time series prediction.

6 Algorithms

In this section, we use our learning guarantees to devise algorithms for fore-casting non-stationary time series. We consider a broad family of kernel-basedhypothesis classes with regression losses which we analyzed in Section 4.

Suppose L is the squared loss and H = x → w ⋅ Ψ(x)∶ ∥w∥H ≤ Λ, whereΨ ∶X → H is a feature mapping from X to a Hilbert space H. Theorem 4, The-orem 6 and Theorem 4 suggest that we should solve the following optimizationproblem:

minq∈Q,w

⎧⎪⎪⎨⎪⎪⎩

T

∑t=1qt(w ⋅ Ψ(xt) − yt)2 + disc(q) + λ1∥w∥2H

⎫⎪⎪⎬⎪⎪⎭(10)

where λ1 is a regularization hyperparameter and Q is some convex boundeddomain. Note that for simplicity, we have omitted O(∥q∥) terms in (10). Onecan extend the formulation in (10) to account for these terms as well.

The optimization problem in (10) is quadratic (and hence convex) in wand convex in q since disc(q) is a convex function of q as a supremum oflinear functions. However, this objective is not jointly convex in (q,w). Ofcourse, one could apply a block coordinate descent to solve this objective whichalternates between optimizing over q and solving a QP in w. In general, no

Page 20: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

20

convergence guarantees can be provided for this algorithm. In addition, eachq-step involves finding disc(q) which in itself may be a costly procedure. Inthe sequel we address both of these concerns.

First, let us observe that if a(w) = ∑Tt=1 pt(w ⋅ Ψ(xt) − yt)2 with pt beinga uniform distribution on the last s points, then by definition of empiricaldiscrepancy

disc(w) = supw≤Λ

⎛⎝a(w) −

T

∑t=1qt(w ⋅ Ψ(xt) − yt)2

⎞⎠

= supw≤Λ

⎛⎝T

∑t=1

(ut − qt)a(w) −T

∑t=1qt((w ⋅ Ψ(xt) − yt)2 − a(w))

⎞⎠

≤T

∑t=1qtdt + λ2∥q − v∥p

where λ2 is some constant (a hyperparameter), v is a prior typically chosen touniform weights u, p ≥ 1 and dts are instantaneous discrepancies defined by

dt = supw≤Λ

RRRRRRRRRRRa(w) − (w ⋅ Ψ(xt) − yt)2

RRRRRRRRRRR.

Note that dt can also be defined in terms of windows averages, where (w ⋅Ψ(xt) − yt)2 is replaced with 1

2l ∑t+ls=t−l(w ⋅ Ψ(xs) − ys)2 for some l.

This leads to the following optimization problem:

minq∈Q,w

T

∑t=1qt(w ⋅ Ψ(xt) − yt)2 +

T

∑t=1dtqt + λ1∥w∥2H + λ2∥q − v∥p. (11)

This optimization problem is still not convex but now dts can be precomputedbefore solving (11) which may be considerably more efficient. For instance, theq-step in the block coordinate descent reduces to a simple LP. Below we showhow (11) can be turned into a convex optimization problem when Q = [0,1]T .

6.1 Convex Optimization over [0,1]T and Dual Problems

In this section, we consider the case when Q = [0,1]T and we show how (11)can be turned into a convex optimization problem which then can be solvedefficiently. We apply change of variables rt = 1/qt, which leads to the followingoptimization problem:

minr∈D,w

T

∑t=1

(w ⋅ Ψ(xt) − yt)2 + dtrt

+ λ1∥w∥2H + λ2⎛⎝T

∑t=1

∣r−1t − vt∣p⎞⎠

1/p, (12)

Page 21: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

21

where D = r∶ rt ≥ 1. We can remove the (⋅)1/p on the last term by first turningit into a constraint, raising it to the pth power and then moving it back to theobjective by introducing a Lagrange multiplier:

minr∈D,w

T

∑t=1

(w ⋅ Ψ(xt) − yt)2 + dtrt

+ λ1∥w∥2H + λ2T

∑t=1

∣r−1t − vt∣p. (13)

Note that the first two terms in (12) are jointly convex in (r,w): the first termis a sum of quadratic-over-linear functions which is convex and the second termis a squared norm which is again convex.

The last step is to observe that ∣r−1t − vt∣ = ∣vtr−1t ∣p∣rt − v−1t ∣p ≤ vpt ∣rt − vt∣psince r−1t ≤ 1. Therefore, we have the following optimization problem:

minr∈D,w

T

∑t=1

(w ⋅ Ψ(xt) − yt)2 + dtrt

+ λ1∥w∥2H + λ2T

∑t=1vpt ∣rt − v−1t ∣p, (14)

which is jointly convex over (r,w).For many real-world problems, Ψ is specified implicitly via a PDS kernel

K and it is computationally advantageous to consider the dual formulation of(14). Using the dual problem associated to w (14) can be rewritten as follows:

minr∈D

maxα

−λ1T

∑t=1rtα

2t −αTKα+2λ1α

TY+T

∑t=1

dtrt+λ2

T

∑t=1vpt ∣rt−v−1t ∣p, (15)

where K is the kernel matrix and Y = (y1, . . . , yT )T . We provide a full deriva-tion of this result in Appendix B.

Both (14) and (15) can be solved using standard descent methods, where,at each iteration, we solve a standard QP in α or w, which admits a closed-form solution.

6.2 Discrepancy Computation

The final ingredient that is needed to solve optimization problems (14) or (15)is an algorithm to find instantaneous discrepancies dts. Recall that in generalthese are defined as

supw′≤Λ

RRRRRRRRRRR

T

∑s=1

ps`(w′ ⋅ Ψ(xs) − ys) − `(w′ ⋅ Ψ(xt) − yt)RRRRRRRRRRR

(16)

where ` is some specified loss function. For an arbitrary ` this may be a dif-ficult optimization problem. However, observe that if ` is a convex functionthen the objective in (16) is a difference of convex functions so we may useDC-programming to solve this problem. For general loss functions, the DC-programming approach only guarantees convergence to a stationary point.However, for the squared loss, our problem can be cast as an instance of thetrust region problem, which can be solved globally using the DCA algorithmof Tao and An [1998].

Page 22: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

22

6.3 Two-Stage Algorithms

An alternative simpler algorithm based on the data-dependent bounds ofCorollary 4 consists of first finding a distribution q minimizing the discrepancyand then using that to find a hypothesis minimizing the (regularized) weightedempirical risk. This leads to the following two-stage procedure. First, we finda solution q∗ of the following convex optimization problem:

minq≥0

supw′≤Λ

(T

∑t=1

(pt − qt)(w′ ⋅ Ψ(xt) − yt)2), (17)

where Λ is parameter that can be selected via cross-validation (for exampleusing techniques in [Kuznetsov and Mohri, 2016]). Our generalization boundshold for arbitrary weights q but we restrict them to being positive sequences.Note that other regularization terms such as ∥q∥22 and ∥q−p∥22 from the boundof Corollary 4 can be incorporated in the optimization problem, but we discardthem to minimize the number of parameters. This problem can be solvedusing standard descent optimization methods, where, at each step, we useDC-programming to evaluate the supremum over w′.

The solution q∗ of (17) is then used to solve the following (weighted) kernelridge regression problem:

minw

T

∑t=1q∗t (w ⋅ Ψ(xt) − yt)2 + λ2∥w∥2H. (18)

Note that, in order to guarantee the convexity of this problem, we requireq∗ ≥ 0. We note that one can also use ε-sensitive loss instead of squared lossfor this algorithm.

6.4 Further Extensions and Discussion

We conclude this section with the observation that the learning guaranteesthat we presented in Section 3 and Section 4 can be used to derive algorithmsfor many other problems that involve time series data with loss functions andhypothesis set distinct from regression losses and linear hypotheses consideredin this section.

For instance, we can choose a hypothesis set H to be a set of neural net-works with pre-specified architecture. In that case, we can apply stochasticgradient descent (SGD) to solve the optimization problem (10). Note thatcomputing disc(q) in that case is almost identical to solving the WassersteinGAN optimization problem [Arjovsky et al., 2017]. However, no convergenceguarantees are known for this setting. We leave these extensions to futurework.

Page 23: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

23

0 500 1000 1500 2000 2500 3000

0.4

0.2

0.0

0.2

0.4

0 500 1000 1500 2000 2500 30000.6

0.4

0.2

0.0

0.2

0.4

0.6

0 500 1000 1500 2000 2500 30000.4

0.3

0.2

0.1

0.0

0.1

0.2

0.3

0.4

0 500 1000 1500 2000 2500 30000.25

0.20

0.15

0.10

0.05

0.00

0.05

0.10

0.15

0.20

Fig. 3 Synthetic datasets (top to bottom): ads1, ads2, ads3, ads4.

7 Experiments

In this section, we present the results of experiments evaluating our algorith-mic solutions on a number of synthetic and real-world datasets. In particular,we consider the one-stage algorithm presented in Section 6 which is based onsolving the optimization problem in (11). While solving this problem as op-posed to (14) may result in a sub-optimal results, this simplification allows usto use an alternating optimization method described in Section 6: for a fixedq, problem (11) is a simple QP over w and, for a fixed w, the problem reducesto an LP in q. This iterative scheme admits a straightforward implementationusing existing QP and LP solvers. In the rest of this section, we will refer tothis algorithm as discrepancy-based forecaster (DBF).

Page 24: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

24

0 500 1000 1500 2000 2500 30000.0

0.5

1.0

1.5

2.0

2.5

3.0

0 500 1000 1500 2000 2500 30000.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

0 500 1000 1500 2000 2500 30000.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

0 500 1000 1500 2000 2500 30000.0

0.2

0.4

0.6

0.8

1.0

1.2

Fig. 4 True (green) and estimated (red) instantaneous discrepancies for synthetic data (topto bottom): ads1, ads2, ads3, ads4.

We have chosen ARIMA models as a baseline comparator in our exper-iments. These models are standard and are commonly used in practice forforecasting non-stationary time series.

We present two sets of experiments: synthetic data experiments (Section 7.1)and real-world data experiments (Section 7.2).

7.1 Experiments with Synthetic Data

In this section, we present the results of our experiments on some syntheticdatasets. These experimental results allow us to gain some further understand-ing of the discrepancy-based approach to forecasting. In particular, they help

Page 25: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

25

Fig. 5 Weights q chosen by DBF when used with true (green) and estimated (red) instantaneous discrepancies for synthetic data (top to bottom): ads1, ads2, ads3, ads4.

In particular, they help us study the effects of using estimated instantaneous discrepancies instead of the true ones.

We have used four artificial datasets: ads1, ads2, ads3 and ads4. These datasets are generated using the following autoregressive processes:

ads1: Yt = αt Yt−1 + εt,  with αt = −0.9 if t ∈ [1000, 2000] and 0.9 otherwise,
ads2: Yt = αt Yt−1 + εt,  with αt = 1 − (t/1500),
ads3: Yt = αi(t) Yt−1 + εt,  with α1 = −0.5 and α2 = 0.9,
ads4: Yt = −0.5 Yt−1 + εt,

where εt are independent Gaussian random variables with zero mean and standard deviation σ = 0.05.


Fig. 6 Running MSE for synthetic data experiments (top to bottom): ads1, ads2, ads3, ads4. For each time t on the horizontal axis we plot the MSE up to time t of tDBF (blue), eDBF (red), and ARIMA (green).

Note that i(t) in the definition of ads3 is a stochastic process on {1, 2} such that P(i(t + s) = i | i(t + s − 1) = . . . = i(s) = i, i(s − 1) ≠ i) = (0.99995)^t. In other words, if the process i(t) has spent exactly τ time steps in state i, then at the next time step it stays in i with probability (0.99995)^τ and moves to the other state with probability 1 − (0.99995)^τ.
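For reproducibility, the four processes can be simulated as in the following minimal sketch; the function generate and the seeding are introduced only for this illustration, and the ads3 switching rule is implemented through the time τ spent in the current state, as described above.

```python
# Sketch: generating the four synthetic autoregressive datasets described above.
import numpy as np

def generate(name, T=3000, sigma=0.05, seed=0):
    rng = np.random.default_rng(seed)
    eps = rng.normal(0.0, sigma, T)              # i.i.d. Gaussian noise, sigma = 0.05
    y = np.zeros(T)
    state, tau = 0, 0                            # ads3: current state and time spent in it
    alphas = (-0.5, 0.9)                         # ads3: alpha_1 = -0.5, alpha_2 = 0.9
    for t in range(1, T):
        if name == "ads1":
            a = -0.9 if 1000 <= t <= 2000 else 0.9
        elif name == "ads2":
            a = 1.0 - t / 1500.0
        elif name == "ads3":
            tau += 1
            if rng.random() > 0.99995 ** tau:    # leave the state with prob. 1 - 0.99995^tau
                state, tau = 1 - state, 0
            a = alphas[state]
        else:                                    # ads4: stationary AR(1)
            a = -0.5
        y[t] = a * y[t - 1] + eps[t]
    return y

series = {name: generate(name) for name in ("ads1", "ads2", "ads3", "ads4")}
```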

The first stochastic process (ads1) is meant to model sudden, abrupt changes in the data-generating mechanism. The scenario in which the parameters of the data-generating process drift smoothly is modeled by ads2. The setting in which changes can occur at random times is captured by ads3. Finally, ads4 is generated by a stationary random process. See Figure 3.


Table 1 Mean squared error (standard deviation) for synthetic data. tDBF is DBF with the true instantaneous discrepancies dt as its input. eDBF is DBF with the estimated instantaneous discrepancies as its input. The results in bold are statistically significant using a one-sided paired t-test at the 5% level.

Dataset   tDBF                             eDBF                             ARIMA

ads1      3.135 × 10^−3 (7.504 × 10^−3)    3.743 × 10^−3 (6.171 × 10^−3)    5.723 × 10^−3 (10.143 × 10^−3)
ads2      2.800 × 10^−3 (3.930 × 10^−3)    3.530 × 10^−3 (6.620 × 10^−3)    4.348 × 10^−3 (6.770 × 10^−3)
ads3      3.282 × 10^−3 (6.417 × 10^−3)    3.887 × 10^−3 (9.277 × 10^−3)    4.066 × 10^−3 (6.122 × 10^−3)
ads4      2.573 × 10^−3 (3.516 × 10^−3)    2.889 × 10^−3 (4.262 × 10^−3)    2.593 × 10^−3 (3.578 × 10^−3)

For each dataset, we have generated time series with 3,000 sample points. In all our experiments, we used the following protocol. For each t ∈ [750, 775, . . . , 2995], (y1, . . . , yt) is used as a development set and (yt+1, . . . , yt+25) is used as a test set. On the development set, we first train each algorithm with different hyperparameter settings on (y1, . . . , yt−25) and then select the best performing hyperparameters on (yt−24, . . . , yt). This set of hyperparameters is then used for training on the full development set (y1, . . . , yt), and the mean squared error (MSE) on (yt+1, . . . , yt+25), averaged over all t ∈ [750, 775, . . . , 2995], is reported.
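A sketch of this walk-forward protocol is given below; the callables fit and predict and the iterable hyperparameter_grid are placeholders for the learner being evaluated (DBF or ARIMA) and its hyperparameter grid, which are described next.

```python
# Sketch of the walk-forward evaluation protocol described above. The callables
# `fit` and `predict` and the iterable `hyperparameter_grid` are placeholders.
import numpy as np

def rolling_mse(y, fit, predict, hyperparameter_grid, start=750, step=25):
    errors = []
    for t in range(start, len(y) - step, step):
        dev, test = y[:t], y[t:t + step]
        # Model selection: train on the development set minus its last 25 points
        # and validate on those 25 points.
        best = min(
            hyperparameter_grid,
            key=lambda hp: np.mean(
                (predict(fit(dev[:-step], hp), step) - dev[-step:]) ** 2
            ),
        )
        # Retrain on the full development set and score on the next 25 points.
        preds = predict(fit(dev, best), step)
        errors.append(np.mean((preds - test) ** 2))
    return float(np.mean(errors))
```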

Recall that the DBF algorithm requires two regularization hyperparameters λ1 and λ2. We optimized these parameters over the following two sets of values for λ1 and λ2, respectively: {10^−3, 10^−4, 10^−5, 10^−6} and {100, 10, 1, 0.1, 0.05, 0.01, 0.001, 0}.

ARIMA models have three hyperparameters p, d and q that we also set via the validation procedure described above. Recall that ARIMA(p, d, q) is a generative model defined by the following autoregression:

$$
\Bigl(1 - \sum_{i=0}^{p-1} \phi_i L^{i}\Bigr)(1 - L)^{d}\, Y_{t+1} = \Bigl(1 + \sum_{j=1}^{q} \theta_j L^{j}\Bigr)\varepsilon_t,
$$

where L is the lag operator, that is, LYt = Yt−1. Therefore, validation over (p, d, q) is equivalent to validation over different sets of features used to train the model. For instance, (p, d, q) = (3, 0, 0) means that we are using (yt−3, yt−2, yt−1) as our features, while (p, d, q) = (2, 1, 0) corresponds to (yt−3 − yt−2, yt−2 − yt−1). For DBF, we fix the feature vectors to be (yt−3, yt−2, yt−1). For ARIMA, we optimize over (p, d, q) ∈ {0, 1, 2}^3. We use the maximum likelihood approach to estimate the unknown parameters of the ARIMA models.
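For instance, a baseline of this form can be reproduced with statsmodels, as in the following sketch; the function best_arima_order and its default grid are introduced only for this illustration, and the estimation settings beyond maximum likelihood are assumptions rather than the exact configuration used in our experiments.

```python
# Sketch: grid search over ARIMA(p, d, q) orders with maximum-likelihood fitting
# in statsmodels, mirroring the validation scheme described above.
import itertools
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def best_arima_order(train, val, orders=None):
    if orders is None:
        orders = list(itertools.product(range(3), repeat=3))   # (p, d, q) in {0, 1, 2}^3
    best_order, best_mse = None, np.inf
    for order in orders:
        try:
            fitted = ARIMA(train, order=order).fit()            # maximum-likelihood estimation
            preds = np.asarray(fitted.forecast(steps=len(val)))
            mse = float(np.mean((preds - val) ** 2))
        except Exception:                                       # some orders may fail to converge
            continue
        if mse < best_mse:
            best_order, best_mse = order, mse
    return best_order
```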

Finally, observe that the discrepancy estimation procedure discussed in Section 6 also requires a hyperparameter s representing the length of the most recent history of the stochastic process. We did not attempt to optimize this parameter; in all of our experiments we set s = 20.

The results of our experiments are presented in Table 1. We have compared DBF with the true discrepancies as its input (tDBF), DBF with the estimated discrepancies as its input (eDBF), and ARIMA. In all experiments with non-stationary processes (ads1, ads2, ads3), tDBF performs better than both eDBF and ARIMA.


Table 2 Real-world dataset statistics

Dataset   URL                                            Size
bitcoin   https://www.quandl.com/data/BCHARTS/BTCNCNY    1705
coffee    https://www.quandl.com/data/COM/COFFEE_BRZL    2205
eur/jpy   https://www.quandl.com/data/ECB/EURJPY         4425
jpy/usd   http://data.is/269FpLF                         4475
mso       http://data.is/269F3EV                         1235
silver    https://www.quandl.com/data/COM/AG_EIB         2251
soy       https://www.quandl.com/data/COM/SOYB_1         2218
temp      http://data.is/1qlX2AN                         3649

Table 3 Mean squared error (standard deviation) for real-world data. The results in bold are statistically significant using a one-sided paired t-test at the 5% level.

Dataset   DBF                               ARIMA
bitcoin   4.400 × 10^−3 (26.500 × 10^−3)    4.900 × 10^−3 (29.990 × 10^−3)
coffee    3.080 × 10^−3 (6.570 × 10^−3)     3.260 × 10^−3 (6.390 × 10^−3)
eur/jpy   7.100 × 10^−5 (16.900 × 10^−5)    7.800 × 10^−5 (24.200 × 10^−5)
jpy/usd   9.770 × 10^−1 (25.893 × 10^−1)    10.004 × 10^−1 (27.531 × 10^−1)
mso       32.876 × 10^0 (55.586 × 10^0)     32.193 × 10^0 (51.109 × 10^0)
silver    7.640 × 10^−4 (46.65 × 10^−4)     34.180 × 10^−4 (158.090 × 10^−4)
soy       5.071 × 10^−2 (9.938 × 10^−2)     5.003 × 10^−2 (10.097 × 10^−2)
temp      6.418 × 10^0 (9.958 × 10^0)       6.461 × 10^0 (10.324 × 10^0)

Similarly, eDBF outperforms ARIMA on the same datasets. These results are statistically significant at the 5% level using a one-sided paired t-test. Figure 6 illustrates the dynamics of the MSE as a function of time t for all three algorithms on all of the synthetic datasets.

Our results suggest that accurate discrepancy estimation can lead to a significant improvement in performance. We present the results of discrepancy estimation for our experiments in Figure 4. Figure 5 shows the corresponding weights q chosen by DBF.

7.2 Experiments with Real-World Data

In this section, we present the results of our experiments with real-world datasets. For our experiments, we have chosen eight time series from different domains: currency exchange rates (bitcoin, eur/jpy, jpy/usd), commodity prices (coffee, soy, silver) and meteorology (mso, temp). Further details on these datasets are summarized in Table 2.


Fig. 7 Running MSE for real-world data experiments (top to bottom): bitcoin, coffee, eur/jpy, jpy/usd, mso, silver, soy, temp. For each time t on the horizontal axis we plot the MSE up to time t of DBF (red) and ARIMA (green).

In all our experiments, we used the same protocol as in the previous section. In particular, for each t ∈ [750, 775, . . .], (y1, . . . , yt) is used as a development set and (yt+1, . . . , yt+25) is used as a test set. On the development set, we first train each algorithm with different hyperparameter settings on (y1, . . . , yt−25) and then select the best performing hyperparameters on (yt−24, . . . , yt). This set of hyperparameters is then used for training on the full development set (y1, . . . , yt), and the mean squared error (MSE) over the remaining data points is reported. The range of the hyperparameters for both DBF and ARIMA is also the same as in the previous section.


Note that since the true discrepancies are unknown, we only present the results for DBF with discrepancies estimated from data.

The results of our experiments are summarized in Table 3. Observe that DBF outperforms ARIMA on 5 out of 8 datasets. It should be noted that the error variance is high compared to the mean error. However, this higher variance is likely due to the inherently low signal-to-noise ratio in these real-world datasets.

Figure 7 illustrates the dynamics of the MSE as a function of time t for both algorithms on all of the real-world datasets. The improvements of DBF over ARIMA on the five datasets mentioned above are statistically significant at the 5% level using a one-sided paired t-test. There is no statistically significant difference in the performance of ARIMA and DBF on the rest of the datasets.

Our results suggest that the discrepancy-based approach to forecasting non-stationary time series may lead to improved performance compared to traditional approaches such as ARIMA.

8 Conclusion

We presented a general theoretical analysis of learning in the broad scenario of non-stationary non-mixing processes, a realistic setting for a variety of applications. We discussed in detail several algorithms benefiting from the learning guarantees presented. Our theory can also provide a finer analysis of several existing algorithms and help devise alternative principled learning algorithms.

The key ingredient of our analysis and algorithms is the notion of discrepancy. This is a fundamental concept for learning with non-stationary stochastic processes and was used in the analysis of several other time series forecasting techniques [Kuznetsov and Mohri, 2017b; Kuznetsov and Mariet, 2019; Zimin and Lampert, 2017; Kuznetsov and Mohri, 2016] following the earlier work in [Kuznetsov and Mohri, 2015].

Acknowledgements This work was partly funded by NSF CCF-1535987, IIS-1618662, and a Google Research Award.


References

Terrence M. Adams and Andrew B. Nobel. Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling. The Annals of Probability, 38(4):1345–1367, 2010.

A. Agarwal and J. C. Duchi. The generalization ability of online algorithms for dependent data. IEEE Transactions on Information Theory, 59(1):573–587, 2013.

Pierre Alquier and Olivier Wintenberger. Model selection for weakly dependent time series forecasting. Technical Report 2010-39, Centre de Recherche en Economie et Statistique, 2010.

Pierre Alquier, Xiaoyin Li, and Olivier Wintenberger. Prediction of time series by statistical learning: general losses and fast rates. Dependence Modelling, 1:65–93, 2014.

Donald Andrews. First order autoregressive processes and strong mixing. Cowles Foundation Discussion Papers 664, Cowles Foundation for Research in Economics, Yale University, 1983.

Martin Arjovsky, Soumith Chintala, and Leon Bottou. Wasserstein GAN, 2017. URL http://arxiv.org/abs/1701.07875.

Richard Baillie. Long memory processes and fractional integration in econometrics. Journal of Econometrics, 73(1):5–59, 1996.

Rakesh D. Barve and Philip M. Long. On the complexity of learning from drifting distributions. In COLT, 1996.

Patrizia Berti and Pietro Rigo. A Glivenko-Cantelli theorem for exchangeable random variables. Statistics & Probability Letters, 32(4):385–391, 1997.

Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 1986.

George Edward Pelham Box and Gwilym Jenkins. Time Series Analysis, Forecasting and Control. Holden-Day, Incorporated, 1990.

Peter J. Brockwell and Richard A. Davis. Time Series: Theory and Methods. Springer-Verlag, New York, 1986.

Likai Chen and Wei Biao Wu. Concentration inequalities for empirical processes of linear time series. Journal of Machine Learning Research, 18(231):1–46, 2018.

Victor H. De la Pena and Evarist Gine. Decoupling: From Dependence to Independence: Randomly Stopped Processes, U-statistics and Processes, Martingales and Beyond. Probability and its Applications. Springer, New York, 1999.

Paul Doukhan. Mixing: Properties and Examples. Lecture Notes in Statistics. Springer-Verlag, New York, 1994.

Robert Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4):987–1007, 1982.

James D. Hamilton. Time Series Analysis. Princeton, 1994.

Vitaly Kuznetsov and Zelda Mariet. Foundations of sequence-to-sequence modeling for time series. In AISTATS, 2019.


Vitaly Kuznetsov and Mehryar Mohri. Generalization bounds for time series prediction with non-stationary processes. In ALT, 2014.

Vitaly Kuznetsov and Mehryar Mohri. Learning theory and algorithms for forecasting non-stationary time series. In NIPS, 2015.

Vitaly Kuznetsov and Mehryar Mohri. Time series prediction and on-line learning. In COLT, 2016.

Vitaly Kuznetsov and Mehryar Mohri. Generalization bounds for non-stationary mixing processes. Machine Learning, 106(1):93–117, 2017a.

Vitaly Kuznetsov and Mehryar Mohri. Discriminative state space models. In NIPS, 2017b.

M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Ergebnisse der Mathematik und ihrer Grenzgebiete. Springer-Verlag, 1991.

Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 1987.

E. N. Lorenz. Atmospheric predictability as revealed by naturally occurring analogues. Journal of the Atmospheric Sciences, 26:636–646, 1969.

Aurelie C. Lozano, Sanjeev R. Kulkarni, and Robert E. Schapire. Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In NIPS, 2006.

Ron Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, pages 5–34, 2000.

D. S. Modha and E. Masry. Memory-universal prediction of stationary random processes. IEEE Transactions on Information Theory, 44(1):117–133, January 1998.

Mehryar Mohri and Andres Munoz Medina. New analysis and algorithm for learning with drifting distributions. In ALT, 2012.

Mehryar Mohri and Afshin Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In NIPS, 2009.

Mehryar Mohri and Afshin Rostamizadeh. Stability bounds for stationary ϕ-mixing and β-mixing processes. Journal of Machine Learning Research, 11:789–814, 2010.

Vladimir Pestov. Predictive PAC learnability: A paradigm for learning from exchangeable input data. In GRC, 2010.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In NIPS, 2010.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: Stochastic, constrained, and smoothed adversaries. In NIPS, 2011.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, 2015a.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning via sequential complexities. JMLR, 16(1), January 2015b.

Cosma Shalizi and Aryeh Kontorovich. Predictive PAC learning and process decompositions. In NIPS, 2013.


Ingo Steinwart and Andreas Christmann. Fast learning from non-i.i.d. observations. In NIPS, 2009.

Pham Dinh Tao and Le Thi Hoai An. A D.C. optimization algorithm for solving the trust-region subproblem. SIAM Journal on Optimization, 8(2):476–505, 1998.

M. Vidyasagar. A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. Springer-Verlag New York, Inc., 1997.

Bin Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, 1994.

Zhizhen Zhao and Dimitrios Giannakis. Analog forecasting with dynamics-adapted kernels. CoRR, abs/1412.3831, 2016.

Alexander Zimin and Christoph Lampert. Learning theory for conditional risk minimization. In AISTATS, 2017.


A Proofs

In the construction described in Section 2.1, we denoted by $z$ the tree defined using the $Z_t$'s and by $\mathcal{T}$ the distribution of $z$. Here, we will also denote by $z'$ the tree formed by the $Z'_t$'s and denote by $\mathcal{T}$ the joint distribution of $(z, z')$.

Lemma 2 Let $Z_1^T$ be a sequence of random variables and let ${Z'}_1^T$ be a decoupled tangent sequence. Then, for any measurable function $G$, the following equality holds:

$$
\mathbb{E}\biggl[G\Bigl(\sup_{f\in\mathcal{F}}\sum_{t=1}^{T} q_t\bigl(f(Z'_t)-f(Z_t)\bigr)\Bigr)\biggr]
= \mathbb{E}_{\sigma}\,\mathbb{E}_{(z,z')\sim\mathcal{T}}\biggl[G\Bigl(\sup_{f\in\mathcal{F}}\sum_{t=1}^{T}\sigma_t q_t\bigl(f(z'_t(\sigma))-f(z_t(\sigma))\bigr)\Bigr)\biggr]. \tag{19}
$$

The result also holds with the absolute value around the sums in (19).

Proof The proof follows an argument invoked in the proof of Theorem 3 of Rakhlin et al. [2011]. We only need to check that every step holds for an arbitrary weight vector $q$, in lieu of the uniform distribution vector $u$, and for an arbitrary measurable function $G$, instead of the identity function. Let $p$ denote the joint distribution of the random variables $Z_t$. Observe that we can write the left-hand side of (19) as follows:

$$
\mathbb{E}\Bigl[G\Bigl(\sup_{f\in\mathcal{F}}\Sigma(\sigma)\Bigr)\Bigr]
= \mathbb{E}_{Z_1,Z'_1\sim p_1}\;\mathbb{E}_{Z_2,Z'_2\sim p_2(\cdot\,\mid Z_1)}\cdots\;\mathbb{E}_{Z_T,Z'_T\sim p_T(\cdot\,\mid Z_1^{T-1})}\Bigl[G\Bigl(\sup_{f\in\mathcal{F}}\Sigma(\sigma)\Bigr)\Bigr],
$$

where $\sigma = (1,\ldots,1) \in \{\pm 1\}^T$ and $\Sigma(\sigma) = \sum_{t=1}^{T}\sigma_t q_t\bigl(f(Z'_t) - f(Z_t)\bigr)$. Now, by definition of decoupled tangent sequences, the value of the last expression is unchanged if we swap the sign of any $\sigma_i$ from $1$ to $-1$, since that is equivalent to permuting $Z_i$ and $Z'_i$. Thus, the last expression is in fact equal to

$$
\mathbb{E}_{Z_1,Z'_1\sim p_1}\;\mathbb{E}_{Z_2,Z'_2\sim p_2(\cdot\,\mid S_1(\sigma_1))}\cdots\;\mathbb{E}_{Z_T,Z'_T\sim p_T(\cdot\,\mid S_1(\sigma_1),\ldots,S_{T-1}(\sigma_{T-1}))}\Bigl[G\Bigl(\sup_{f\in\mathcal{F}}\Sigma(\sigma)\Bigr)\Bigr]
$$

for any sequence $\sigma \in \{\pm 1\}^T$, where $S_t(1) = Z_t$ and $S_t(-1) = Z'_t$. Since this equality holds for any $\sigma$, it also holds for the mean with respect to a uniformly distributed $\sigma$. Therefore, the last expression is equal to

$$
\mathbb{E}_{\sigma}\;\mathbb{E}_{Z_1,Z'_1\sim p_1}\;\mathbb{E}_{Z_2,Z'_2\sim p_2(\cdot\,\mid S_1(\sigma_1))}\cdots\;\mathbb{E}_{Z_T,Z'_T\sim p_T(\cdot\,\mid S_1(\sigma_1),\ldots,S_{T-1}(\sigma_{T-1}))}\Bigl[G\Bigl(\sup_{f\in\mathcal{F}}\Sigma(\sigma)\Bigr)\Bigr].
$$

This last expectation coincides with the expectation with respect to drawing a random tree $z$ and its tangent tree $z'$ according to $\mathcal{T}$ and a random path $\sigma$ to follow in that tree. That is, the last expectation is equal to

$$
\mathbb{E}_{\sigma}\,\mathbb{E}_{(z,z')\sim\mathcal{T}}\Bigl[G\Bigl(\sup_{f\in\mathcal{F}}\sum_{t=1}^{T}\sigma_t q_t\bigl(f(z'_t(\sigma)) - f(z_t(\sigma))\bigr)\Bigr)\Bigr],
$$

which concludes the proof. ⊓⊔

Theorem 6 Let $p \ge 1$ and $\mathcal{F} = \{(x, y) \mapsto (w\cdot\Psi(x) - y)^p \colon \|w\|_{\mathbb{H}} \le \Lambda\}$, where $\mathbb{H}$ is a Hilbert space and $\Psi\colon \mathcal{X}\to\mathbb{H}$ a feature map. Assume that the condition $|w\cdot\Psi(x) - y| \le M$ holds for all $(x, y) \in \mathcal{Z}$ and all $w$ such that $\|w\|_{\mathbb{H}} \le \Lambda$. Fix $q^*$. Then, if $Z_1^T = (X_1^T, Y_1^T)$ is a sequence of random variables, for any $\delta > 0$, with probability at least $1 - \delta$, the following bound holds for all $h \in H = \{x \mapsto w\cdot\Psi(x)\colon \|w\|_{\mathbb{H}} \le \Lambda\}$ and all $q$ such that $0 < \|q - q^*\|_1 \le 1$:
$$
\mathbb{E}\bigl[(h(X_{T+1}) - Y_{T+1})^p \mid Z_1^T\bigr] \le \sum_{t=1}^{T} q_t\,(h(X_t) - Y_t)^p + \mathrm{disc}(q) + G(q) + 4M\|q - q^*\|_1,
$$

Page 35: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

35

where $G(q) = 4M\Bigl(\sqrt{8\log\frac{2}{\delta}} + \sqrt{2\log\log_2\frac{2}{1-\|q-q^*\|_1}} + C_T\Lambda r\Bigr)\bigl(\|q^*\|_2 + 2\|q-q^*\|_1\bigr)$ and $C_T = 48\,p\,M^p\sqrt{4\pi\log T}\,\bigl(1 + 4\sqrt{2}\,\log^{3/2}(eT^2)\bigr)$. Thus, for $p = 2$,
$$
\mathbb{E}\bigl[(h(X_{T+1}) - Y_{T+1})^2 \mid Z_1^T\bigr] \le \sum_{t=1}^{T} q_t\,(h(X_t) - Y_t)^2 + \mathrm{disc}(q)
+ O\Bigl(\Lambda r\,(\log^2 T)\,\sqrt{\log\log_2\tfrac{2}{1-\|q-q^*\|_1}}\;\bigl(\|q^*\|_2 + \|q-q^*\|_1\bigr)\Bigr).
$$

This result extends Theorem 4 to hold uniformly over $q$. Similarly, one can prove an analogous extension of Theorem 1. The result suggests that we should minimize $\sum_{t=1}^{T} q_t f(Z_t) + \mathrm{disc}(q)$ over $q$ and $w$. The bound is in a certain sense analogous to margin bounds: it is most favorable when there exists a good choice of $q^*$, and we hope to find a $q$ that is close to this weight vector. These insights are used to develop our algorithmic solutions for forecasting non-stationary time series in Section 6.

Proof Let $(\varepsilon_k)_{k=0}^{\infty}$ and $(q(k))_{k=0}^{\infty}$ be infinite sequences specified below. By Theorem 4, the following holds for each $k$:
$$
\mathbb{P}\Bigl(\mathbb{E}[f(Z_{T+1})\mid Z_1^T] > \sum_{t=1}^{T} q_t(k) f(Z_t) + \Delta(q(k)) + C(q(k)) + 4M\|q(k)\|_2\,\varepsilon_k\Bigr) \le \exp(-\varepsilon_k^2),
$$
where $\Delta(q(k))$ denotes the discrepancy computed with respect to the weights $q(k)$ and $C(q(k)) = C_T\|q(k)\|_2$. Let $\varepsilon_k = \varepsilon + \sqrt{2\log k}$. Then, by the union bound, we can write

$$
\mathbb{P}\Bigl(\exists k\colon \mathbb{E}[f(Z_{T+1})\mid Z_1^T] > \sum_{t=1}^{T} q_t(k) f(Z_t) + \Delta(q(k)) + C(q(k)) + 4M\|q(k)\|_2\,\varepsilon_k\Bigr)
\le \sum_{k=1}^{\infty} e^{-\varepsilon_k^2}
\le \sum_{k=1}^{\infty} e^{-\varepsilon^2 - \log k^2}
\le 2e^{-\varepsilon^2}.
$$

We choose the sequence $q(k)$ to satisfy $\|q(k) - q^*\|_1 = 1 - 2^{-k}$. Then, for any $q$ such that $0 < \|q - q^*\|_1 \le 1$, there exists $k \ge 1$ such that
$$
1 - \|q(k) - q^*\|_1 < 1 - \|q - q^*\|_1 \le 1 - \|q(k-1) - q^*\|_1 \le 2\bigl(1 - \|q(k) - q^*\|_1\bigr).
$$
Thus, the following inequality holds:
$$
\sqrt{2\log k} \le \sqrt{2\log\log_2\frac{2}{1 - \|q - q^*\|_1}}.
$$

Combining this with the observation that the following inequalities hold:
$$
\sum_{t=1}^{T} q_t(k-1) f(Z_t) \le \sum_{t=1}^{T} q_t f(Z_t) + 2M\|q - q^*\|_1,
$$
$$
\Delta(q(k-1)) \le \Delta(q) + 2M\|q - q^*\|_1, \qquad \|q(k-1)\|_2 \le 2\|q - q^*\|_1 + \|q^*\|_2,
$$

shows that the event

$$
\Bigl\{\mathbb{E}[f(Z_{T+1})\mid Z_1^T] > \sum_{t=1}^{T} q_t f(Z_t) + \mathrm{disc}(q) + G(q) + 4M\|q - q^*\|_1\Bigr\},
$$
where $G(q) = 4M\bigl(\varepsilon + \sqrt{2\log\log_2\frac{2}{1-\|q-q^*\|_1}} + C_T\Lambda r\bigr)\bigl(\|q^*\|_2 + 2\|q-q^*\|_1\bigr)$, implies the following one:
$$
\Bigl\{\mathbb{E}[f(Z_{T+1})\mid Z_1^T] > \sum_{t=1}^{T} q_t(k-1) f(Z_t) + \Delta(q(k-1)) + C(q(k-1)) + 4M\|q(k-1)\|_2\,\varepsilon_{k-1}\Bigr\},
$$

which completes the proof. ⊓⊔

Page 36: Discrepancy-Based Theory and Algorithms for Forecasting ...mohri/pub/tsj.pdfstationary ergodic sequences.Agarwal and Duchi[2013] gave generalization bounds for asymptotically stationary

36

B Dual Optimization Problem

In this section, we provide a detailed derivation of the optimization problem in (15), starting with the optimization problem in (11). The first step is to establish the following chain of equalities:

$$
\begin{aligned}
\min_{w}\ \sum_{t=1}^{T} q_t\,(w\cdot\Psi(x_t) - y_t)^2 + \lambda_2\|w\|_{\mathbb{H}}^2
&= \min_{w}\ \sum_{t=1}^{T} (w\cdot x'_t - y'_t)^2 + \lambda_2\|w\|_{\mathbb{H}}^2\\
&= \max_{\beta}\ -\lambda_2\sum_{t=1}^{T}\beta_t^2 - \sum_{s,t=1}^{T}\beta_s\beta_t\, x'_s\cdot x'_t + 2\lambda_2\sum_{t=1}^{T}\beta_t y'_t\\
&= \max_{\beta}\ -\lambda_2\sum_{t=1}^{T}\beta_t^2 - \sum_{s,t=1}^{T}\beta_s\beta_t\sqrt{q_s}\sqrt{q_t}\,K_{s,t} + 2\lambda_2\sum_{t=1}^{T}\beta_t\sqrt{q_t}\,y_t\\
&= \max_{\alpha}\ -\lambda_2\sum_{t=1}^{T}\frac{\alpha_t^2}{q_t} - \alpha^\top K\alpha + 2\lambda_2\,\alpha^\top Y, \qquad (20)
\end{aligned}
$$

where the first equality follows by substituting $x'_t = \sqrt{q_t}\,\Psi(x_t)$ and $y'_t = \sqrt{q_t}\,y_t$, the second equality uses the dual formulation of the kernel ridge regression problem, and the last equality follows from the change of variables $\alpha_t = \sqrt{q_t}\,\beta_t$.

By (20), the optimization problem in (11) is equivalent to the following optimization problem:

$$
\min_{0\le q\le 1}\ \max_{\alpha}\ -\lambda_1\sum_{t=1}^{T}\frac{\alpha_t^2}{q_t} - \alpha^\top K\alpha + 2\lambda_1\,\alpha^\top Y + (\mathbf{d}\cdot q) + \lambda_2\|q - u\|_p.
$$

Next, we apply the change of variables $r_t = 1/q_t$ and appeal to the same arguments as were given for the primal problem in Section 6 to arrive at (15).
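As a sanity check on the chain of equalities leading to (20), the following sketch verifies numerically, for a linear kernel and generic weights q, that the weighted primal solution and the maximizer of the dual objective yield the same predictions; here lam plays the role of the ridge parameter $\lambda_2$ in (20), and the terms in q that appear in (15) are not included.

```python
# Numerical check of the primal-dual equivalence in (20) for a linear kernel.
import numpy as np

rng = np.random.default_rng(0)
T, p, lam = 50, 3, 0.1
X = rng.normal(size=(T, p))                 # rows are feature vectors Psi(x_t)
y = rng.normal(size=T)
q = rng.uniform(0.1, 1.0, size=T)

# Primal: min_w sum_t q_t (w . x_t - y_t)^2 + lam ||w||^2.
Q = np.diag(q)
w = np.linalg.solve(X.T @ Q @ X + lam * np.eye(p), X.T @ Q @ y)
primal_pred = X @ w

# Dual: alpha maximizes -lam sum_t alpha_t^2 / q_t - alpha^T K alpha + 2 lam alpha^T y;
# the stationarity condition gives alpha = lam (K + lam diag(1/q))^{-1} y, and the
# predictions are h(x_s) = (1/lam) sum_t alpha_t K(x_t, x_s).
K = X @ X.T                                 # linear kernel Gram matrix
alpha = lam * np.linalg.solve(K + lam * np.diag(1.0 / q), y)
dual_pred = K @ alpha / lam

assert np.allclose(primal_pred, dual_pred)
```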