
Advanced Machine Learning
Time Series Prediction

Mehryar Mohri (Courant Institute & Google Research)
Vitaly Kuznetsov (Google Research)

Motivation

Time series prediction:

• stock values.

• economic variables.

• weather: e.g., local and global temperature.

• sensors: Internet-of-Things.

• earthquakes.

• energy demand.

• signal processing.

• sales forecasting.

Google Trends

[Three figure slides: Google Trends search-interest curves shown as examples of real-world time series.]

Challenges

Standard supervised learning:

• IID assumption.

• Same distribution for training and test data.

• Distributions fixed over time (stationarity).


None of these assumptions holds for time series!


Outline

Introduction to time series analysis.

Learning theory for forecasting non-stationary time series.

Algorithms for forecasting non-stationary time series.

Time series prediction and on-line learning.


Introduction to Time Series Analysis


Classical Framework

Postulate a particular form of a parametric model that is assumed to generate the data.

Use given sample to estimate unknown parameters of the model.

Use estimated model to make predictions.

Autoregressive (AR) Models

Definition: the AR(p) model is a linear generative model based on the p-th order Markov assumption:

\forall t, \quad Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t,

where
• the \epsilon_t are zero-mean uncorrelated random variables with variance \sigma^2.
• a_1, \ldots, a_p are the autoregressive coefficients.
• Y_t is the observed stochastic process.
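As a quick illustration (added here, not part of the original slides), the sketch below simulates an AR(2) process and recovers its coefficients by ordinary least squares; the coefficient values, sample size, and noise scale are arbitrary choices.

```python
import numpy as np

# Minimal sketch (assumed parameters): simulate Y_t = a1*Y_{t-1} + a2*Y_{t-2} + eps_t
# and fit (a1, a2) by ordinary least squares on the lagged design matrix.
rng = np.random.default_rng(0)
a_true = np.array([0.6, -0.3])      # hypothetical AR coefficients
T, p = 2000, len(a_true)

y = np.zeros(T)
eps = rng.normal(scale=1.0, size=T)  # zero-mean noise, variance sigma^2 = 1
for t in range(p, T):
    y[t] = a_true @ y[t - p:t][::-1] + eps[t]

# Row t of the design matrix is (Y_{t-1}, ..., Y_{t-p}).
X = np.column_stack([y[p - i:T - i] for i in range(1, p + 1)])
a_hat, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
print("estimated coefficients:", a_hat)  # close to a_true for large T
```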

Moving Averages (MA)

Definition: the MA(q) model is a linear generative model for the noise term based on the q-th order Markov assumption:

\forall t, \quad Y_t = \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j},

where
• b_1, \ldots, b_q are the moving average coefficients.

ARMA Model (Whittle, 1951; Box & Jenkins, 1971)

Definition: the ARMA(p, q) model is a generative linear model that combines the AR(p) and MA(q) models:

\forall t, \quad Y_t = \sum_{i=1}^{p} a_i Y_{t-i} + \epsilon_t + \sum_{j=1}^{q} b_j \epsilon_{t-j}.
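In practice such models are typically estimated with an off-the-shelf library. The sketch below (an added illustration; the toy series, the order (2, 0, 1), and the use of statsmodels are assumptions, not part of the slides) fits an ARMA(2, 1) model, i.e. an ARIMA model with d = 0:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Minimal sketch: fit an ARMA(2, 1) model to an observed series y.
# Setting d > 0 in order=(p, d, q) would give the ARIMA model for the
# differenced series discussed later in the slides.
rng = np.random.default_rng(1)
y = rng.normal(size=500).cumsum() * 0.01 + rng.normal(size=500)  # toy series

result = ARIMA(y, order=(2, 0, 1)).fit()
print(result.params)                  # AR and MA coefficients, noise variance
print(result.forecast(steps=5))       # 5-step-ahead point forecasts
```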


Stationarity

Definition: a sequence of random variables Z = \{Z_t\}_{-\infty}^{+\infty} is stationary if its distribution is invariant to shifting in time:

(Z_t, \ldots, Z_{t+m}) \quad\text{and}\quad (Z_{t+k}, \ldots, Z_{t+m+k}) \quad\text{have the same distribution.}

Weak Stationarity

Definition: a sequence of random variables Z = \{Z_t\}_{-\infty}^{+\infty} is weakly stationary if its first and second moments are invariant to shifting in time, that is,
• E[Z_t] is independent of t.
• E[Z_t Z_{t-j}] = f(j) for some function f.

Lag Operator

The lag operator L is defined by L Y_t = Y_{t-1}.

ARMA model in terms of the lag operator:

\Big(1 - \sum_{i=1}^{p} a_i L^i\Big) Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big) \epsilon_t.

The characteristic polynomial

P(z) = 1 - \sum_{i=1}^{p} a_i z^i

can be used to study properties of this stochastic process.

Weak Stationarity of ARMA

Theorem: an ARMA(p, q) process is weakly stationary if the roots of the characteristic polynomial P(z) are outside the unit circle.

Proof

If the roots \gamma_1, \ldots, \gamma_p of the characteristic polynomial are outside the unit circle, then

P(z) = 1 - \sum_{i=1}^{p} a_i z^i = c\,(\gamma_1 - z) \cdots (\gamma_p - z) = c'\,(1 - \gamma_1^{-1} z) \cdots (1 - \gamma_p^{-1} z),

where |\gamma_i| > 1 for all i = 1, \ldots, p and c, c' are constants.

Proof (continued)

Therefore, the ARMA(p, q) process admits an MA(\infty) representation: from

\Big(1 - \sum_{i=1}^{p} a_i L^i\Big) Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big) \epsilon_t,

it follows that

Y_t = \big(1 - \gamma_1^{-1} L\big)^{-1} \cdots \big(1 - \gamma_p^{-1} L\big)^{-1} \Big(1 + \sum_{j=1}^{q} b_j L^j\Big) \epsilon_t,

where each \big(1 - \gamma_i^{-1} L\big)^{-1} = \sum_{k=0}^{\infty} \big(\gamma_i^{-1} L\big)^k is well-defined since |\gamma_i^{-1}| < 1.

Proof (continued)

Therefore, it suffices to show that

Y_t = \sum_{j=0}^{\infty} \psi_j \epsilon_{t-j}

is weakly stationary.

The mean is constant:

E[Y_t] = \sum_{j=0}^{\infty} \psi_j E[\epsilon_{t-j}] = 0.

The covariance function only depends on the lag l:

E[Y_t Y_{t-l}] = \sum_{k=0}^{\infty} \sum_{j=0}^{\infty} \psi_k \psi_j E[\epsilon_{t-j} \epsilon_{t-l-k}] = \sigma^2 \sum_{j=0}^{\infty} \psi_j \psi_{j+l}.
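As a numerical companion to the theorem (an added illustration, not from the slides), weak stationarity of a fitted AR part can be checked by verifying that all roots of P(z) = 1 - \sum_i a_i z^i lie outside the unit circle:

```python
import numpy as np

def ar_part_is_weakly_stationary(a):
    """Check whether AR coefficients a = (a_1, ..., a_p) give a weakly
    stationary ARMA process: all roots of P(z) = 1 - sum_i a_i z^i must
    lie strictly outside the unit circle."""
    # np.roots expects coefficients from the highest degree down:
    # P(z) = -a_p z^p - ... - a_1 z + 1.
    coeffs = np.concatenate(([-c for c in a[::-1]], [1.0]))
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(ar_part_is_weakly_stationary([0.6, -0.3]))  # True: complex roots of modulus ~1.83
print(ar_part_is_weakly_stationary([1.2]))        # False: the root 1/1.2 is inside the unit circle
```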

ARIMA

Non-stationary processes can be modeled using processes whose characteristic polynomial has unit roots.

A characteristic polynomial with unit roots can be factored as

P(z) = R(z)\,(1 - z)^D,

where R(z) has no unit roots.

Definition: the ARIMA(p, D, q) model is an ARMA(p, q) model for (1 - L)^D Y_t:

\Big(1 - \sum_{i=1}^{p} a_i L^i\Big) (1 - L)^D\, Y_t = \Big(1 + \sum_{j=1}^{q} b_j L^j\Big) \epsilon_t.


Other Extensions

Further variants:

• models with seasonal components (SARIMA).

• models with side information (ARIMAX).

• models with long-memory (ARFIMA).

• multi-variate time series models (VAR).

• models with time-varying coefficients.

• other non-linear models.

Modeling Variance (Engle, 1982; Bollerslev, 1986)

Definition: the generalized autoregressive conditional heteroscedasticity GARCH(p, q) model is an ARMA(p, q) model for the variance \sigma_t^2 of the noise term \epsilon_t:

\forall t, \quad \sigma_{t+1}^2 = \omega + \sum_{i=0}^{p-1} \alpha_i \sigma_{t-i}^2 + \sum_{j=0}^{q-1} \beta_j \epsilon_{t-j}^2,

where
• the \epsilon_t are zero-mean Gaussian random variables with variance \sigma_t^2 conditioned on \{Y_{t-1}, Y_{t-2}, \ldots\}.
• \omega > 0 is the mean parameter.
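A minimal simulation of a GARCH(1, 1) noise sequence (an added illustration; the parameter values are arbitrary assumptions, and the recursion simply instantiates the conditional-variance equation above with p = q = 1):

```python
import numpy as np

# Minimal GARCH(1, 1) sketch with assumed parameters omega, alpha, beta:
#   sigma_{t+1}^2 = omega + alpha * sigma_t^2 + beta * eps_t^2,
#   eps_t ~ N(0, sigma_t^2) given the past.
rng = np.random.default_rng(0)
omega, alpha, beta = 0.1, 0.85, 0.1   # hypothetical coefficients (alpha + beta < 1)
T = 1000

sigma2 = np.empty(T)
eps = np.empty(T)
sigma2[0] = omega / (1.0 - alpha - beta)    # start at the unconditional variance
for t in range(T):
    eps[t] = rng.normal(scale=np.sqrt(sigma2[t]))
    if t + 1 < T:
        sigma2[t + 1] = omega + alpha * sigma2[t] + beta * eps[t] ** 2

print("sample variance of eps:", eps.var())  # close to omega / (1 - alpha - beta)
```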


State-Space Models

Continuous state space version of Hidden Markov Models:

X_{t+1} = B X_t + U_t, \qquad Y_t = A X_t + \epsilon_t,

where
• X_t is an n-dimensional state vector.
• Y_t is the observed stochastic process.
• A and B are model parameters.
• U_t and \epsilon_t are noise terms.
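A minimal simulation of this linear state-space model (added illustration; the 2-dimensional state, matrices A and B, and noise scales are arbitrary choices):

```python
import numpy as np

# Minimal sketch of the linear state-space model
#   X_{t+1} = B X_t + U_t,   Y_t = A X_t + eps_t,
# with hypothetical parameter matrices and Gaussian noise terms.
rng = np.random.default_rng(0)
B = np.array([[0.9, 0.1],
              [0.0, 0.8]])           # state transition (n x n)
A = np.array([[1.0, 0.5]])           # observation matrix (1 x n)
T, n = 200, 2

X = np.zeros((T, n))
Y = np.zeros(T)
for t in range(T):
    Y[t] = (A @ X[t] + rng.normal(scale=0.1)).item()            # observation noise eps_t
    if t + 1 < T:
        X[t + 1] = B @ X[t] + rng.normal(scale=0.1, size=n)      # state noise U_t

print(Y[:5])
```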


Estimation

Different methods for estimating model parameters:

• Maximum likelihood estimation:

• Requires further parametric assumptions on the noise distribution (e.g. Gaussian).

• Method of moments (Yule-Walker estimator; see the sketch after this list).

• Conditional and unconditional least square estimation.

• Restricted to certain models.
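To make the Yule-Walker item concrete, here is a minimal sketch (an illustration added to the slides, not their algorithm) that estimates AR(p) coefficients from sample autocovariances by solving the Yule-Walker equations:

```python
import numpy as np

def yule_walker(y, p):
    """Minimal Yule-Walker sketch: estimate AR(p) coefficients by solving
    R a = r, where R is the Toeplitz matrix of sample autocovariances
    gamma(0), ..., gamma(p-1) and r = (gamma(1), ..., gamma(p))."""
    y = np.asarray(y, dtype=float) - np.mean(y)
    T = len(y)
    gamma = np.array([y[:T - k] @ y[k:] / T for k in range(p + 1)])
    R = np.array([[gamma[abs(i - j)] for j in range(p)] for i in range(p)])
    return np.linalg.solve(R, gamma[1:p + 1])

# Example on a simulated AR(2) series (same process as the AR sketch earlier):
rng = np.random.default_rng(0)
a_true = [0.6, -0.3]
y = np.zeros(3000)
for t in range(2, 3000):
    y[t] = a_true[0] * y[t - 1] + a_true[1] * y[t - 2] + rng.normal()
print(yule_walker(y, 2))   # approximately [0.6, -0.3]
```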


Invertibility of ARMA

Definition: an ARMA(p, q) process is invertible if the roots of the polynomial

Q(z) = 1 + \sum_{j=1}^{q} b_j z^j

are outside the unit circle.

Learning Guarantee

Theorem: assume Y_t \sim ARMA(p, q) is weakly stationary and invertible. Let \hat{a}_T denote the least squares estimate of a = (a_1, \ldots, a_p) and assume that q is known. Then, \|\hat{a}_T - a\| converges in probability to zero.

Similar results hold for other estimators and other models.


Notes

Many other generative models exist.

Learning guarantees are asymptotic.

Model needs to be correctly specified.

Non-stationarity needs to be modeled explicitly.


Theory

Time Series Forecasting

Training data: finite sample realization of some stochastic process,

(X_1, Y_1), \ldots, (X_T, Y_T) \in Z = X \times Y.

Loss function: L : H \times Z \to [0, 1], where H is a hypothesis set of functions mapping from X to Y.

Problem: find h \in H with small path-dependent expected loss,

L(h, Z_1^T) = E_{Z_{T+1}}\big[ L(h, Z_{T+1}) \mid Z_1^T \big].

Standard Assumptions

Stationarity: (Z_t, \ldots, Z_{t+m}) and (Z_{t+k}, \ldots, Z_{t+m+k}) have the same distribution.

Mixing: the dependence between an event B (determined by the process up to time n) and an event A (determined by the process after time n + k) decays with k.

β-Mixing

Definition: a sequence of random variables Z = \{Z_t\}_{-\infty}^{+\infty} is β-mixing if

\beta(k) = \sup_n E_{B \in \sigma_{-\infty}^{n}} \Big[ \sup_{A \in \sigma_{n+k}^{+\infty}} \big| P[A \mid B] - P[A] \big| \Big] \longrightarrow 0,

that is, the dependence between events separated by k time steps decays with k.

Learning Theory

Stationary and β-mixing process: generalization bounds.

• PAC-learning preserved in that setting (Vidyasagar, 1997).

• VC-dimension bounds for binary classification (Yu, 1994).

• covering number bounds for regression (Meir, 2000).

• Rademacher complexity bounds for general loss functions (MM and Rostamizadeh, 2009).

• PAC-Bayesian bounds (Alquier et al., 2014).

Learning Theory

Stationarity and mixing: algorithm-dependent bounds.

• AdaBoost (Lozano et al., 2006).

• general stability bounds (MM and Rostamizadeh, 2010).

• regularized ERM (Steinwart and Christmann, 2009).

• stable on-line algorithms (Agarwal and Duchi, 2013).

Problem

Stationarity and mixing assumptions:

• often do not hold (think trend or periodic signals).

• not testable.

• estimating mixing parameters can be hard, even if the general functional form is known.

• hypothesis set and loss function ignored.

Questions

Is learning with general (non-stationary, non-mixing) stochastic processes possible?

Can we design algorithms with theoretical guarantees?

need a new tool for the analysis.

Key Quantity - Fixed h

Key average quantity:

\Big| \frac{1}{T} \sum_{t=1}^{T} \big[ L(h, Z_1^T) - L(h, Z_1^{t-1}) \big] \Big|.

[Timeline figure: at each t \le T, the conditional loss L(h, Z_1^{t-1}) differs from the target loss L(h, Z_1^T) at time T + 1; this is the key difference.]

Discrepancy

Definition:

\Delta = \sup_{h \in H} \Big| L(h, Z_1^T) - \frac{1}{T} \sum_{t=1}^{T} L(h, Z_1^{t-1}) \Big|.

• captures hypothesis set and loss function.
• can be estimated from data, under mild assumptions.
• \Delta = 0 in the IID case or for weakly stationary processes with linear hypotheses and squared loss (K and MM, 2014).

Weighted Discrepancy

Definition: extension to weights q = (q_1, \ldots, q_T):

\Delta(q) = \sup_{h \in H} \Big| L(h, Z_1^T) - \sum_{t=1}^{T} q_t L(h, Z_1^{t-1}) \Big|.

• strictly extends the discrepancy definitions used for drifting (MM and Muñoz Medina, 2012) or domain adaptation (Mansour, MM, Rostamizadeh, 2009; Cortes and MM, 2011, 2014), and for the binary loss (Devroye et al., 1996; Ben-David et al., 2007).
• admits upper bounds in terms of relative entropy, or in terms of β-mixing coefficients and asymptotic stationarity for an asymptotically stationary process.

Estimation

Decomposition: \Delta(q) \le \Delta_0(q) + \Delta_s, with

\Delta(q) \le \sup_{h \in H} \Big( \frac{1}{s} \sum_{t=T-s+1}^{T} L(h, Z_1^{t-1}) - \sum_{t=1}^{T} q_t L(h, Z_1^{t-1}) \Big) + \sup_{h \in H} \Big( L(h, Z_1^T) - \frac{1}{s} \sum_{t=T-s+1}^{T} L(h, Z_1^{t-1}) \Big).

[Timeline figure: the last s points T-s+1, \ldots, T serve to estimate the discrepancy at time T + 1.]

Learning Guarantee

Theorem: for any \delta > 0, with probability at least 1 - \delta, for all h \in H and \alpha > 0,

L(h, Z_1^T) \le \sum_{t=1}^{T} q_t L(h, Z_t) + \Delta(q) + 2\alpha + \|q\|_2 \sqrt{2 \log \frac{E[N_1(\alpha, G, z)]}{\delta}},

where G = \{z \mapsto L(h, z) : h \in H\}.

Bound with Empirical Discrepancy

Corollary: for any \delta > 0, with probability at least 1 - \delta, for all h \in H and \alpha > 0,

L(h, Z_1^T) \le \sum_{t=1}^{T} q_t L(h, Z_t) + \hat{\Delta}(q) + \Delta_s + 4\alpha + \big[ \|q\|_2 + \|q - u_s\|_2 \big] \sqrt{2 \log \frac{2\, E_z[N_1(\alpha, G, z)]}{\delta}},

where

\hat{\Delta}(q) = \sup_{h \in H} \Big( \frac{1}{s} \sum_{t=T-s+1}^{T} L(h, Z_t) - \sum_{t=1}^{T} q_t L(h, Z_t) \Big),

u_s is the uniform distribution over the last s points [T - s, T], and G = \{z \mapsto L(h, z) : h \in H\}.
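The empirical discrepancy \hat{\Delta}(q) is a supremum over H; for a finite candidate set of hypotheses it can be computed by direct enumeration. The sketch below is an added illustration under that simplifying assumption (finite hypothesis set, precomputed losses), not the optimization procedure of the slides:

```python
import numpy as np

def empirical_discrepancy(losses, q, s):
    """Brute-force sketch of the empirical weighted discrepancy
        hat_Delta(q) = max_h ( (1/s) * sum_{t=T-s+1}^{T} L(h, Z_t)
                               - sum_{t=1}^{T} q_t L(h, Z_t) )
    for a finite hypothesis set. `losses` is an (n_hypotheses x T) array
    with losses[h, t] = L(h, Z_{t+1}); `q` is a weight vector of length T."""
    recent = losses[:, -s:].mean(axis=1)     # (1/s) * sum over the last s points
    weighted = losses @ q                    # sum_t q_t L(h, Z_t)
    return float(np.max(recent - weighted))

# Toy usage: 3 hypotheses, T = 100, uniform weights, s = 10.
rng = np.random.default_rng(0)
losses = rng.uniform(size=(3, 100))
q = np.full(100, 1.0 / 100)
print(empirical_discrepancy(losses, q, s=10))
```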

Weighted Sequential α-Cover (Rakhlin et al., 2010; K and MM, 2015)

Definition: let z be a Z-valued full binary tree of depth T. Then, a set V of trees is an l_1-norm q-weighted α-cover of a function class G on z if

\forall g \in G, \forall \sigma \in \{\pm 1\}^T, \exists v \in V : \sum_{t=1}^{T} |v_t(\sigma) - g(z_t(\sigma))| \le \frac{\alpha}{\|q\|_\infty}.

[Figure: a full binary tree with nodes labeled (g(z_1), v_1), (g(z_2), v_2), \ldots, (g(z_7), v_7), \ldots; a path \sigma \in \{\pm 1\}^T selects nodes z_1, z_3, z_6, z_{12}, and the l_1 cover condition is illustrated along that path.]

Sequential Covering Numbers

Definitions:

• sequential covering number:

N_1(\alpha, G, z) = \min\{|V| : V \text{ is an } l_1\text{-norm } q\text{-weighted } \alpha\text{-cover of } G\}.

• expected sequential covering number:

E_{z \sim Z_T}[N_1(\alpha, G, z)],

where Z_T is the distribution over trees built from the Z_t's: (Z_1, Z'_1) \sim D_1, (Z_2, Z'_2) \sim D_2(\cdot \mid Z_1), (Z_3, Z'_3) \sim D_2(\cdot \mid Z'_1), (Z_4, Z'_4) \sim D_3(\cdot \mid Z_1, Z_2), (Z_5, Z'_5) \sim D_3(\cdot \mid Z_1, Z'_2), \ldots

Proof

Key quantities:

\Delta(q) = \sup_{h \in H} \Big| L(h, Z_1^T) - \sum_{t=1}^{T} q_t L(h, Z_1^{t-1}) \Big|, \qquad \Phi(Z_1^T) = \sup_{h \in H} \Big( L(h, Z_1^T) - \sum_{t=1}^{T} q_t L(h, Z_t) \Big).

Chernoff technique: for any t > 0,

P\big[ \Phi(Z_1^T) - \Delta(q) > \epsilon \big]
\le P\Big[ \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ L(h, Z_1^{t-1}) - L(h, Z_t) \big] > \epsilon \Big]   (sub-additivity of sup)
= P\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ L(h, Z_1^{t-1}) - L(h, Z_t) \big] \Big) > e^{t\epsilon} \Big]   (t > 0)
\le e^{-t\epsilon}\, E\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ L(h, Z_1^{t-1}) - L(h, Z_t) \big] \Big) \Big].   (Markov's inequality)

Symmetrization

Key tool: decoupled tangent sequence Z'^{T}_{1} associated to Z_1^T:
• Z_t and Z'_t are i.i.d. given Z_1^{t-1}.

P\big[ \Phi(Z_1^T) - \Delta(q) > \epsilon \big]
\le e^{-t\epsilon}\, E\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ L(h, Z_1^{t-1}) - L(h, Z_t) \big] \Big) \Big]
= e^{-t\epsilon}\, E\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ E[L(h, Z'_t) \mid Z_1^{t-1}] - L(h, Z_t) \big] \Big) \Big]   (tangent sequence)
= e^{-t\epsilon}\, E\Big[ \exp\Big( t \sup_{h \in H} E\Big[ \sum_{t=1}^{T} q_t \big[ L(h, Z'_t) - L(h, Z_t) \big] \,\Big|\, Z_1^T \Big] \Big) \Big]   (linearity of expectation)
\le e^{-t\epsilon}\, E\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ L(h, Z'_t) - L(h, Z_t) \big] \Big) \Big].   (Jensen's inequality)

Symmetrization (continued)

P\big[ \Phi(Z_1^T) - \Delta(q) > \epsilon \big]
\le e^{-t\epsilon}\, E\Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \big[ L(h, Z'_t) - L(h, Z_t) \big] \Big) \Big]
= e^{-t\epsilon}\, E_{(z, z')} E_\sigma \Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \sigma_t \big[ L(h, z'_t(\sigma)) - L(h, z_t(\sigma)) \big] \Big) \Big]   (tangent sequence property)
\le e^{-t\epsilon}\, E_{(z, z')} E_\sigma \Big[ \exp\Big( t \sup_{h \in H} \sum_{t=1}^{T} q_t \sigma_t L(h, z'_t(\sigma)) + t \sup_{h \in H} \sum_{t=1}^{T} q_t \sigma_t L(h, z_t(\sigma)) \Big) \Big]   (sub-additivity of sup)
\le e^{-t\epsilon}\, E_{(z, z')} E_\sigma \Big[ \frac{1}{2} \exp\Big( 2t \sup_{h \in H} \sum_{t=1}^{T} q_t \sigma_t L(h, z'_t(\sigma)) \Big) + \frac{1}{2} \exp\Big( 2t \sup_{h \in H} \sum_{t=1}^{T} q_t \sigma_t L(h, z_t(\sigma)) \Big) \Big]   (convexity of exp)
= e^{-t\epsilon}\, E_z E_\sigma \Big[ \exp\Big( 2t \sup_{h \in H} \sum_{t=1}^{T} q_t \sigma_t L(h, z_t(\sigma)) \Big) \Big].

Covering Number

P\big[ \Phi(Z_1^T) - \Delta(q) > \epsilon \big]
\le e^{-t\epsilon}\, E_z E_\sigma \Big[ \exp\Big( 2t \sup_{h \in H} \sum_{t=1}^{T} q_t \sigma_t L(h, z_t(\sigma)) \Big) \Big]
\le e^{-t\epsilon}\, E_z E_\sigma \Big[ \exp\Big( 2t \Big[ \max_{v \in V} \sum_{t=1}^{T} q_t \sigma_t v_t(\sigma) + \alpha \Big] \Big) \Big]   (α-covering)
\le e^{-t(\epsilon - 2\alpha)}\, E_z \Big[ \sum_{v \in V} E_\sigma \Big[ \exp\Big( 2t \sum_{t=1}^{T} q_t \sigma_t v_t(\sigma) \Big) \Big] \Big]   (monotonicity of exp)
\le e^{-t(\epsilon - 2\alpha)}\, E_z \Big[ \sum_{v \in V} \exp\Big( \frac{t^2 \|q\|_2^2}{2} \Big) \Big]   (Hoeffding's inequality)
\le E_z \big[ N_1(\alpha, G, z) \big] \exp\Big( -t(\epsilon - 2\alpha) + \frac{t^2 \|q\|_2^2}{2} \Big).


Algorithms

Review

Theorem: for any \delta > 0, with probability at least 1 - \delta, for all h \in H and \alpha > 0,

L(h, Z_1^T) \le \sum_{t=1}^{T} q_t L(h, Z_t) + \hat{\Delta}(q) + \Delta_s + 4\alpha + \big[ \|q\|_2 + \|q - u_s\|_2 \big] \sqrt{2 \log \frac{2\, E_z[N_1(\alpha, G, z)]}{\delta}}.

This bound can be extended to hold uniformly over q at the price of the additional term \tilde{O}\big( \|q - u\|_1 \sqrt{\log_2 \log_2 (1 - \|q - u\|_1)^{-1}} \big).

Data-dependent learning guarantee.

Discrepancy-Risk Minimization

Key idea: directly optimize the upper bound on generalization over q and h.

This problem can be solved efficiently for some losses L and hypothesis sets H.

Kernel-Based Regression

Squared loss function: L(y, y') = (y - y')^2.

Hypothesis set: for a PDS kernel K,

H = \big\{ x \mapsto w \cdot \Phi_K(x) : \|w\|_{H_K} \le \Lambda \big\}.

The complexity term can be bounded by O\big( (\log^{3/2} T)\, \Lambda \sup_x K(x, x)\, \|q\|_2 \big).

Instantaneous Discrepancy

The empirical discrepancy can be further upper bounded in terms of instantaneous discrepancies:

\hat{\Delta}(q) \le \sum_{t=1}^{T} q_t d_t + M \|q - u\|_1,

where M = \sup_{y, y'} L(y, y') and

d_t = \sup_{h \in H} \Big( \frac{1}{s} \sum_{r=T-s+1}^{T} L(h, Z_r) - L(h, Z_t) \Big).

Proof

By sub-additivity of the supremum,

\hat{\Delta}(q) = \sup_{h \in H} \Big( \frac{1}{s} \sum_{t=T-s+1}^{T} L(h, Z_t) - \sum_{t=1}^{T} q_t L(h, Z_t) \Big)
= \sup_{h \in H} \Big( \sum_{t=1}^{T} q_t \Big[ \frac{1}{s} \sum_{r=T-s+1}^{T} L(h, Z_r) - L(h, Z_t) \Big] + \sum_{t=1}^{T} \Big( \frac{1}{T} - q_t \Big) \frac{1}{s} \sum_{r=T-s+1}^{T} L(h, Z_r) \Big)
\le \sum_{t=1}^{T} q_t \sup_{h \in H} \Big( \frac{1}{s} \sum_{r=T-s+1}^{T} L(h, Z_r) - L(h, Z_t) \Big) + M \|u - q\|_1.

Computing Discrepancies

Instantaneous discrepancy for kernel-based hypotheses with the squared loss:

d_t = \sup_{\|w'\| \le \Lambda} \Big( \sum_{s=1}^{T} u_s \big( w' \cdot \Phi_K(x_s) - y_s \big)^2 - \big( w' \cdot \Phi_K(x_t) - y_t \big)^2 \Big).

Difference of convex (DC) functions.

Global optimum via DC-programming (Tao and An, 1998).

Discrepancy-Based Forecasting

Theorem: for any \delta > 0, with probability at least 1 - \delta, for all kernel-based hypotheses h \in H and all q with 0 < \|q - u\|_1 \le 1,

L(h, Z_1^T) \le \sum_{t=1}^{T} q_t L(h, Z_t) + \hat{\Delta}(q) + \Delta_s + \tilde{O}\Big( \log^{3/2} T\, \sup_x K(x, x)\, \Lambda + \|q - u\|_1 \Big).

Corresponding optimization problem:

\min_{q \in [0, 1]^T, w} \Big\{ \sum_{t=1}^{T} q_t \big( w \cdot \Phi_K(x_t) - y_t \big)^2 + \lambda_1 \sum_{t=1}^{T} q_t d_t + \lambda_2 \|w\|_{H_K} + \lambda_3 \|q - u\|_1 \Big\}.


Convex Problem

Change of variable: r_t = 1 / q_t.

Upper bound: |r_t^{-1} - 1/T| \le T^{-1} |r_t - T|.

\min_{r \in D, w} \Big\{ \sum_{t=1}^{T} \frac{(w \cdot \Phi_K(x_t) - y_t)^2 + \lambda_1 d_t}{r_t} + \lambda_2 \|w\|_{H_K} + \lambda_3 \sum_{t=1}^{T} |r_t - T| \Big\},

• where D = \{r : r_t \ge 1\}.
• convex optimization problem.

Two-Stage Algorithm

Minimize the empirical discrepancy \hat{\Delta}(q) over q (convex optimization).

Solve the (weighted) kernel ridge regression problem

\min_{w} \Big\{ \sum_{t=1}^{T} q_t^* \big( w \cdot \Phi_K(x_t) - y_t \big)^2 + \lambda \|w\|_{H_K} \Big\},

where q^* is the solution to the discrepancy minimization problem.
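A minimal sketch of the second stage, weighted kernel ridge regression in closed form. This is an added illustration: it assumes a Gaussian kernel, uses the squared RKHS norm \lambda \|w\|_{H_K}^2 as the regularizer (a standard simplification of the penalty above), and takes the weights q^* as given rather than computed by discrepancy minimization.

```python
import numpy as np

def gaussian_kernel(X1, X2, gamma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def weighted_krr_fit(X, y, q, lam=1e-2, gamma=1.0):
    """Closed-form weighted kernel ridge regression:
    minimize sum_t q_t (f(x_t) - y_t)^2 + lam * ||f||_K^2 with f = sum_t alpha_t K(x_t, .).
    Setting the gradient to zero gives (diag(q) K + lam I) alpha = diag(q) y."""
    K = gaussian_kernel(X, X, gamma)
    Q = np.diag(q)
    return np.linalg.solve(Q @ K + lam * np.eye(len(y)), Q @ y)

def weighted_krr_predict(alpha, X_train, X_new, gamma=1.0):
    return gaussian_kernel(X_new, X_train, gamma) @ alpha

# Toy usage: stand-in weights q* put more mass on recent points of a drifting series.
rng = np.random.default_rng(0)
T = 100
X = np.arange(T, dtype=float).reshape(-1, 1) / T
y = np.sin(6 * X[:, 0]) + 0.5 * X[:, 0] + 0.1 * rng.normal(size=T)   # drifting target
q_star = np.linspace(0.5, 1.5, T); q_star /= q_star.sum()
alpha = weighted_krr_fit(X, y, q_star, lam=1e-3, gamma=20.0)
print(weighted_krr_predict(alpha, X, X[-3:]))
```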

Preliminary Experiments

Artificial data sets. [Figure slide: the generating formulas and sample plots are omitted.]

True vs. Empirical Discrepancies

[Figure slide: plots of the discrepancies and of the corresponding weights.]

Running MSE

[Figure slide: running mean squared error curves.]

Real-World Data

Commodity prices, exchange rates, temperatures & climate.


Time Series Prediction & On-line Learning

Two Learning Scenarios

Stochastic scenario:

• distributional assumption.

• performance measure: expected loss.

• guarantees: generalization bounds.

On-line scenario:

• no distributional assumption.

• performance measure: regret.

• guarantees: regret bounds.

• active research area: (Cesa-Bianchi and Lugosi, 2006; Anava et al. 2013, 2015, 2016; Bousquet and Warmuth, 2002; Herbster and Warmuth, 1998, 2001; Koolen et al., 2015).

On-Line Learning Setup

Adversarial setting with hypothesis/action set H.

For t = 1 to T do
• player receives x_t \in X.
• player selects h_t \in H.
• adversary selects y_t \in Y.
• player incurs loss L(h_t(x_t), y_t).

Objective: minimize the (external) regret

Reg_T = \sum_{t=1}^{T} L(h_t(x_t), y_t) - \min_{h \in H^*} \sum_{t=1}^{T} L(h(x_t), y_t).

Example: Exponential Weights (EW)

Expert set H^* = \{E_1, \ldots, E_N\}, H = conv(H^*).

EW({E_1, \ldots, E_N}):
  for i ← 1 to N do
      w_{1,i} ← 1
  for t ← 1 to T do
      Receive(x_t)
      h_t ← (Σ_{i=1}^{N} w_{t,i} E_i) / (Σ_{i=1}^{N} w_{t,i})
      Receive(y_t)
      Incur-Loss(L(h_t(x_t), y_t))
      for i ← 1 to N do
          w_{t+1,i} ← w_{t,i} e^{-η L(E_i(x_t), y_t)}      (parameter η > 0)
  return h_T
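A direct Python rendering of the EW pseudocode for scalar prediction with a bounded loss (an illustrative sketch added here; the constant experts, the squared loss, and the toy data are assumptions, not part of the slides):

```python
import numpy as np

def exponential_weights(expert_preds, y, eta, loss=lambda p, t: (p - t) ** 2):
    """EW sketch: expert_preds is a (T x N) array of expert predictions, y the
    target sequence, eta > 0 the learning rate. At each round the player predicts
    the weighted average of the experts, then multiplies each expert's weight by
    exp(-eta * loss(expert prediction, y_t))."""
    T, N = expert_preds.shape
    w = np.ones(N)
    player_losses = []
    for t in range(T):
        h_t = w @ expert_preds[t] / w.sum()          # convex combination of experts
        player_losses.append(loss(h_t, y[t]))
        w *= np.exp(-eta * loss(expert_preds[t], y[t]))
    return np.array(player_losses), w

# Toy usage: three constant experts tracking a noisy level; losses stay in [0, 1].
rng = np.random.default_rng(0)
T = 500
y = 0.6 + 0.05 * rng.normal(size=T)
experts = np.tile(np.array([0.2, 0.5, 0.7]), (T, 1))
eta = np.sqrt(8 * np.log(3) / T)                     # tuning from the guarantee below
losses, w = exponential_weights(experts, y, eta)
print("average player loss:", losses.mean(), "final weights:", w / w.sum())
```

With this tuning of η, the average regret decays as O(\sqrt{\log N / T}), matching the guarantee stated on the next slide.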

EW Guarantee

Theorem: assume that L is convex in its first argument and takes values in [0, 1]. Then, for any \eta > 0 and any sequence y_1, \ldots, y_T \in Y, the regret of EW at time T satisfies

Reg_T \le \frac{\log N}{\eta} + \frac{\eta T}{8}.

For \eta = \sqrt{8 \log N / T},

Reg_T \le \sqrt{(T/2) \log N}, \qquad \frac{Reg_T}{T} = O\Big( \sqrt{\frac{\log N}{T}} \Big).

EW - Proof

Potential: \Phi_t = \log \sum_{i=1}^{N} w_{t,i}.

Upper bound:

\Phi_t - \Phi_{t-1} = \log \frac{\sum_{i=1}^{N} w_{t-1,i}\, e^{-\eta L(E_i(x_t), y_t)}}{\sum_{i=1}^{N} w_{t-1,i}}
= \log\big( E_{w_{t-1}}[e^{-\eta L(E_i(x_t), y_t)}] \big)
= \log\Big( E_{w_{t-1}}\Big[ \exp\Big( -\eta \big( L(E_i(x_t), y_t) - E_{w_{t-1}}[L(E_i(x_t), y_t)] \big) - \eta\, E_{w_{t-1}}[L(E_i(x_t), y_t)] \Big) \Big] \Big)
\le -\eta\, E_{w_{t-1}}[L(E_i(x_t), y_t)] + \frac{\eta^2}{8}   (Hoeffding's inequality)
\le -\eta\, L\big( E_{w_{t-1}}[E_i(x_t)], y_t \big) + \frac{\eta^2}{8}   (convexity of first argument of L)
= -\eta\, L(h_t(x_t), y_t) + \frac{\eta^2}{8}.

EW - Proof (continued)

Upper bound: summing up the inequalities yields

\Phi_T - \Phi_0 \le -\eta \sum_{t=1}^{T} L(h_t(x_t), y_t) + \frac{\eta^2 T}{8}.

Lower bound:

\Phi_T - \Phi_0 = \log \sum_{i=1}^{N} e^{-\eta \sum_{t=1}^{T} L(E_i(x_t), y_t)} - \log N
\ge \log \max_{i=1}^{N} e^{-\eta \sum_{t=1}^{T} L(E_i(x_t), y_t)} - \log N
= -\eta \min_{i=1}^{N} \sum_{t=1}^{T} L(E_i(x_t), y_t) - \log N.

Comparison:

\sum_{t=1}^{T} L(h_t(x_t), y_t) - \min_{i=1}^{N} \sum_{t=1}^{T} L(E_i(x_t), y_t) \le \frac{\log N}{\eta} + \frac{\eta T}{8}.

Questions

Can we exploit both batch and on-line learning to

• design flexible algorithms for time series prediction with stochastic guarantees?

• tackle notoriously difficult time series problems, e.g., model selection and learning ensembles?

Model Selection

Problem: given N time series models, how should we use the sample Z_1^T to select a single best model?

• in the i.i.d. case, cross-validation can be shown to be close to the structural risk minimization solution.
• but, how do we select a validation set for general stochastic processes?
• use the most recent data?
• use the most distant data?
• use various splits?
• models may have been pre-trained on Z_1^T.

Learning Ensembles

Problem: given a hypothesis set H and a sample Z_1^T, find an accurate convex combination h = \sum_{t=1}^{T} q_t h_t with (h_1, \ldots, h_T) \in H_A and q \in \Delta.

• in the most general case, hypotheses may have been pre-trained on Z_1^T.

On-line-to-batch conversion for general non-stationary non-mixing processes.

On-Line-to-Batch (OTB)

Input: sequence of hypotheses h = (h_1, \ldots, h_T) returned after T rounds by an on-line algorithm A minimizing the general regret

Reg_T = \sum_{t=1}^{T} L(h_t, Z_t) - \inf_{h^* \in H^*} \sum_{t=1}^{T} L(h^*, Z_t).

On-Line-to-Batch (OTB)

Problem: use h = (h_1, \ldots, h_T) to derive a hypothesis h \in H with small path-dependent expected loss,

L_{T+1}(h, Z_1^T) = E_{Z_{T+1}}\big[ L(h, Z_{T+1}) \mid Z_1^T \big].

• the i.i.d. problem is standard: (Littlestone, 1989), (Cesa-Bianchi et al., 2004).
• but, how do we design solutions for the general time-series scenario?

Questions

Is OTB with general (non-stationary, non-mixing) stochastic processes possible?

Can we design algorithms with theoretical guarantees?

need a new tool for the analysis.

Relevant Quantity

Average difference:

\frac{1}{T} \sum_{t=1}^{T} \big[ L_{T+1}(h_t, Z_1^T) - L_t(h_t, Z_1^{t-1}) \big].

[Timeline figure: at each round t, the on-line loss L_t(h_t, Z_1^{t-1}) differs from the target loss L_{T+1}(h_t, Z_1^T) at time T + 1; this is the key difference.]

On-line Discrepancy

Definition:

disc(q) = \sup_{\mathbf{h} \in H_A} \Big| \sum_{t=1}^{T} q_t \big[ L_{T+1}(h_t, Z_1^T) - L_t(h_t, Z_1^{t-1}) \big] \Big|.

• H_A: sequences that A can return.
• q = (q_1, \ldots, q_T): arbitrary weight vector.
• natural measure of non-stationarity or dependency.
• captures hypothesis set and loss function.
• can be efficiently estimated under mild assumptions.
• generalization of the definition of (Kuznetsov and MM, 2015).

Discrepancy Estimation

Batch discrepancy estimation method.

Alternative method:
• assume that the loss is \mu-Lipschitz.
• assume that there exists an accurate hypothesis h^*:

\eta = \inf_{h^*} E\big[ L(Z_{T+1}, h^*(X_{T+1})) \mid Z_1^T \big] \ll 1.

Discrepancy Estimation

Lemma: fix a sequence Z_1^T in Z. Then, for any \delta > 0, with probability at least 1 - \delta, the following holds for all \alpha > 0:

disc(q) \le \widehat{disc}_H(q) + \mu \eta + 2\alpha + M \|q\|_2 \sqrt{2 \log \frac{E[N_1(\alpha, G, z)]}{\delta}},

where

\widehat{disc}_H(q) = \sup_{h \in H, \mathbf{h} \in H_A} \Big| \sum_{t=1}^{T} q_t \big[ L\big( h_t(X_{T+1}), h(X_{T+1}) \big) - L\big( h_t, Z_t \big) \big] \Big|.

Proof Sketch

disc(q) = \sup_{\mathbf{h} \in H_A} \Big| \sum_{t=1}^{T} q_t \big[ L_{T+1}(h_t, Z_1^T) - L_t(h_t, Z_1^{t-1}) \big] \Big|
\le \sup_{\mathbf{h} \in H_A} \Big| \sum_{t=1}^{T} q_t \Big( L_{T+1}(h_t, Z_1^T) - E\big[ L(h_t(X_{T+1}), h^*(X_{T+1})) \mid Z_1^T \big] \Big) \Big| + \sup_{\mathbf{h} \in H_A} \Big| \sum_{t=1}^{T} q_t \Big[ E\big[ L(h_t(X_{T+1}), h^*(X_{T+1})) \mid Z_1^T \big] - L_t(h_t, Z_1^{t-1}) \Big] \Big|
\le \mu \sup_{\mathbf{h} \in H_A} \sum_{t=1}^{T} q_t\, E\big[ L(h^*(X_{T+1}), Y_{T+1}) \mid Z_1^T \big] + \sup_{\mathbf{h} \in H_A} \Big| \sum_{t=1}^{T} q_t \Big[ E\big[ L(h_t(X_{T+1}), h^*(X_{T+1})) \mid Z_1^T \big] - L_t(h_t, Z_1^{t-1}) \Big] \Big|
= \mu \sup_{\mathbf{h} \in H_A} E\big[ L(h^*(X_{T+1}), Y_{T+1}) \mid Z_1^T \big] + \sup_{\mathbf{h} \in H_A} \Big| \sum_{t=1}^{T} q_t \Big[ E\big[ L(h_t(X_{T+1}), h^*(X_{T+1})) \mid Z_1^T \big] - L_t(h_t, Z_1^{t-1}) \Big] \Big|.

Learning Guarantee

Lemma: let L be a convex loss bounded by M and h_1^T a hypothesis sequence adapted to Z_1^T. Fix q \in \Delta. Then, for any \delta > 0, the following holds with probability at least 1 - \delta for the hypothesis h = \sum_{t=1}^{T} q_t h_t:

L_{T+1}(h, Z_1^T) \le \sum_{t=1}^{T} q_t L(h_t, Z_t) + disc(q) + M \|q\|_2 \sqrt{2 \log \frac{1}{\delta}}.

Proof

By definition of the on-line discrepancy,

\sum_{t=1}^{T} q_t \big[ L_{T+1}(h_t, Z_1^T) - L_t(h_t, Z_1^{t-1}) \big] \le disc(q).

A_t = q_t \big[ L_t(h_t, Z_1^{t-1}) - L(h_t, Z_t) \big] is a martingale difference, thus by Azuma's inequality, with high probability,

\sum_{t=1}^{T} q_t L_t(h_t, Z_1^{t-1}) \le \sum_{t=1}^{T} q_t L(h_t, Z_t) + \|q\|_2 \sqrt{2 \log \frac{1}{\delta}}.

By convexity of the loss:

L_{T+1}(h, Z_1^T) \le \sum_{t=1}^{T} q_t L_{T+1}(h_t, Z_1^T).

Learning Guarantee

Theorem: let L be a convex loss bounded by M and H^* a set of hypothesis sequences adapted to Z_1^T. Fix q \in \Delta. Then, for any \delta > 0, the following holds with probability at least 1 - \delta for the hypothesis h = \sum_{t=1}^{T} q_t h_t:

L_{T+1}(h, Z_1^T) \le \inf_{h^* \in H^*} L_{T+1}(h^*, Z_1^T) + 2\, disc(q) + \frac{Reg_T}{T} + M \|q - u\|_1 + 2M \|q\|_2 \sqrt{2 \log \frac{2}{\delta}}.

Conclusion

Time series forecasting:

• key learning problem in many important tasks.

• very challenging: theory, algorithms, applications.

• new and general data-dependent learning guarantees for non-mixing non-stationary processes.

• algorithms with guarantees.

Time series prediction and on-line learning:

• proof for flexible solutions derived via OTB.

• application to model selection.

• application to learning ensembles.

References

T. M. Adams and A. B. Nobel. Uniform convergence of Vapnik-Chervonenkis classes under ergodic sampling. The Annals of Probability, 38(4):1345–1367, 2010.

A. Agarwal, J. Duchi. The generalization ability of online algorithms for dependent data. Information Theory, IEEE Transactions on, 59(1):573–587, 2013.

P. Alquier, X. Li, O. Wintenberger. Prediction of time series by statistical learning: general losses and fast rates. Dependence Modelling, 1:65–93, 2014.

Oren Anava, Elad Hazan, Shie Mannor, and Ohad Shamir. Online learning for time series prediction. COLT, 2013.

O. Anava, E. Hazan, and A. Zeevi. Online time series prediction with missing data. ICML, 2015.

O. Anava and S. Mannor. Online Learning for heteroscedastic sequences. ICML, 2016.

P. L. Bartlett. Learning with a slowly changing distribution. COLT, 1992.

R. D. Barve and P. M. Long. On the complexity of learning from drifting distributions. Information and Computation, 138(2):101–123, 1997.

P. Berti and P. Rigo. A Glivenko-Cantelli theorem for exchangeable random variables. Statistics & Probability Letters, 32(4):385–391, 1997.

S. Ben-David, J. Blitzer, K. Crammer, and F. Pereira. Analysis of representations for domain adaptation. In NIPS. 2007.

T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. J Econometrics, 1986.

G. E. P. Box, G. Jenkins (1990). Time Series Analysis, Forecasting and Control.

O. Bousquet and M. K. Warmuth. Tracking a small set of experts by mixing past posteriors. COLT, 2001.

N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms. IEEE Trans. on Inf. Theory , 50(9), 2004.

N. Cesa-Bianchi and C. Gentile. Tracking the best hyperplane with a simple budget perceptron. COLT, 2006.

C. Cortes and M. Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519, 2014.

V. H. De la Pena and E. Gine (1999). Decoupling: from dependence to independence: randomly stopped processes, U-statistics and processes, martingales and beyond. Probability and its Applications. Springer, NY.

P. Doukhan. (1994) Mixing: properties and examples. Lecture notes in statistics. Springer-Verlag, New York.

R. Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4):987–1007, 1982.

D. P. Helmbold and P. M. Long. Tracking drifting concepts by minimizing disagreements. Machine Learning, 14(1):27–46, 1994.

M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2), 1998.

M. Herbster and M. K. Warmuth. Tracking the best linear predictor. JMLR, 2001.

D. Hsu, A. Kontorovich, and C. Szepesvári. Mixing time estimation in reversible Markov chains from a single sample path. NIPS, 2015.

W. M. Koolen, A. Malek, P. L. Bartlett, and Y. Abbasi. Minimax time series prediction. NIPS, 2015.

V. Kuznetsov, M. Mohri. Generalization bounds for time series prediction with non-stationary processes. In ALT, 2014.

V. Kuznetsov, M. Mohri. Learning theory and algorithms for forecasting non-stationary time series. In NIPS, 2015.

V. Kuznetsov, M. Mohri. Time series prediction and on-line learning. In COLT, 2016.

A. C. Lozano, S. R. Kulkarni, and R. E. Schapire. Convergence and consistency of regularized boosting algorithms with stationary β-mixing observations. In NIPS, pages 819–826, 2006.

Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In COLT. 2009.

R. Meir. Nonparametric time series prediction through adaptive model selection. Machine Learning, pages 5–34, 2000.

D. Modha, E. Masry. Memory-universal prediction of stationary random processes. Information Theory, IEEE Transactions on, 44(1):117–133, Jan 1998.

M. Mohri, A. Munoz Medina. New analysis and algorithm for learning with drifting distributions. In ALT, 2012.

M. Mohri, A. Rostamizadeh. Rademacher complexity bounds for non-i.i.d. processes. In NIPS, 2009.

M. Mohri, A. Rostamizadeh. Stability bounds for stationary φ-mixing and β-mixing processes. Journal of Machine Learning Research, 11:789–814, 2010.

V. Pestov. Predictive PAC learnability: A paradigm for learning from exchangeable input data. GRC, 2010.

A. Rakhlin, K. Sridharan, A. Tewari. Online learning: Random averages, combinatorial parameters, and learnability. In NIPS, 2010.

L. Ralaivola, M. Szafranski, G. Stempfel. Chromatic PAC-Bayes bounds for non-iid data: Applications to ranking and stationary beta-mixing processes. JMLR 11:1927–1956, 2010.

C. Shalizi and A. Kontorovitch. Predictive PAC-learning and process decompositions. NIPS, 2013.


I. Steinwart, A. Christmann. Fast learning from non-i.i.d. observations. NIPS, 2009.

M. Vidyasagar. (1997). A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems. Springer-Verlag New York, Inc.

V. Vovk. Competing with stationary prediction strategies. COLT 2007.

B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, 1994.
