
Second-order Quantile Methods for Experts and Combinatorial Games

Wouter M. Koolen Tim van Erven

Royal Holloway, Friday 28th August, 2015

Focus on expert setting

Online sequential prediction with expert advice


Core instance of advanced online learning tasks:

- Bandits
- Combinatorial & matrix prediction
- Online convex optimization
- Boosting
- . . .

Beyond the Worst Case

Two reasons data is often easier in practice:

Data complexity (→ second-order):
- Stochastic data (gap)
- Low noise
- Low variance

Model complexity (→ quantiles):
- Simple model is good
- Multiple good models

Second-order & Quantiles:
- Any combination


All we need is the right learning rate

Existing algorithms (Hedge, Prod, . . . ) with an oracle learning rate η exploit second-order & quantile structure.

Can we exploit Second-order & Quantiles on-line?


But everyone struggles with the learning rate

[Figure: regret (0 to 500) as a function of the learning rate η (0.001 to 100), with the oracle η marked.]

The oracle η is
- not monotonic,
- not smooth
over time.

State of the art:

Second-order: Cesa-Bianchi, Mansour, and Stoltz 2007; Hazan and Kale 2010; Chiang, Yang, Lee, Mahdavi, Lu, Jin, and Zhu 2012; De Rooij, Van Erven, Grünwald, and Koolen 2014; Gaillard, Stoltz, and Van Erven 2014; Steinhardt and Liang 2014

or

Quantiles: Hutter and Poland 2005; Chaudhuri, Freund, and Hsu 2009; Chernov and Vovk 2010; Luo and Schapire 2014

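The non-monotonic, non-smooth dependence of the regret on η is easy to reproduce empirically. A minimal sketch (not from the talk; the random loss sequence and the fixed-η Hedge update are our illustrative assumptions), sweeping η and reporting the final regret as in the figure above:

```python
import numpy as np

def hedge_regret(losses, eta):
    """Run Hedge with fixed learning rate eta on a T x K loss matrix;
    return the final regret against the best expert."""
    T, K = losses.shape
    L = np.zeros(K)                        # cumulative expert losses L^k_t
    learner = 0.0
    for t in range(T):
        w = np.exp(-eta * (L - L.min()))   # shift exponent for numerical stability
        w /= w.sum()
        learner += w @ losses[t]           # learner suffers w_t . l_t
        L += losses[t]
    return learner - L.min()

rng = np.random.default_rng(0)
losses = rng.random((500, 10))             # hypothetical loss sequence
for eta in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
    print(f"eta = {eta:7.3f}  regret = {hedge_regret(losses, eta):7.2f}")
```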

Learning the learning rate

We found a formulation of the algorithm where simply putting a prior on η works.

Our algorithm Squint:

w^k_{t+1} \propto \pi(k)\, E_{\gamma(\eta)}\!\left[ e^{\eta R^k_t - \eta^2 V^k_t}\, \eta \right]

guarantees, for each subset \mathcal{K} of experts, at each time T ≥ 0:

R^{\mathcal{K}}_T \prec \sqrt{ V^{\mathcal{K}}_T \left( -\ln \pi(\mathcal{K}) + \ln\ln T \right) }

- Run-time of Hedge
- Only ln ln T extra over the oracle learning rate
- Extension to Combinatorial Games

Overview

- Fundamental online learning problem
- Review previous guarantees
- New Squint algorithm with improved guarantees

Fundamental model for learning: Hedge setting

- K experts
- In round t = 1, 2, . . . :
  - Learner plays distribution w_t = (w^1_t, . . . , w^K_t) on experts
  - Adversary reveals expert losses \ell_t = (\ell^1_t, . . . , \ell^K_t) \in [0, 1]^K
  - Learner incurs loss w_t^\intercal \ell_t
- The goal is to have small regret

R^k_T := \underbrace{\sum_{t=1}^T w_t^\intercal \ell_t}_{\text{Learner}} - \underbrace{\sum_{t=1}^T \ell^k_t}_{\text{Expert } k}

with respect to every expert k.

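The protocol translates directly into code. A minimal sketch of the interaction loop (our own illustration; the learner is abstracted as any map from past losses to a distribution):

```python
import numpy as np

def run_hedge_setting(learner, losses):
    """Play the Hedge setting on a T x K loss matrix.
    Returns the regret vector R with R[k] = R^k_T."""
    T, K = losses.shape
    R = np.zeros(K)
    for t in range(T):
        w = learner(losses[:t])          # distribution on the K experts
        assert w.shape == (K,) and abs(w.sum() - 1.0) < 1e-8
        R += w @ losses[t] - losses[t]   # learner's loss minus each expert's
    return R
```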

Classic Hedge Result

The Hedge algorithm with learning rate η,

w^k_{t+1} := \frac{e^{-\eta L^k_t}}{\sum_k e^{-\eta L^k_t}}, \qquad \text{where } L^k_t = \sum_{s=1}^t \ell^k_s,

upon proper tuning of η ensures [Freund and Schapire, 1997]

R^k_T \prec \sqrt{T \ln K} \quad \text{for each expert } k,

which is tight for adversarial (worst-case) losses but underwhelming in practice.

Two broad lines of improvement: second-order bounds and quantile bounds.

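In code, the Hedge update is one line. A sketch of Hedge as a learner for the protocol loop above (the fixed η and the stability shift are our choices):

```python
import numpy as np

def hedge(eta):
    """Hedge with fixed learning rate eta: w^k_{t+1} proportional to exp(-eta * L^k_t)."""
    def learner(past_losses):
        L = past_losses.sum(axis=0)        # cumulative losses L^k_t (zeros when t = 0)
        w = np.exp(-eta * (L - L.min()))   # subtract the min for numerical stability
        return w / w.sum()
    return learner

# Usage with the protocol loop: regret = run_hedge_setting(hedge(eta=0.1), losses)
```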

Second-order bounds

Cesa-Bianchi et al. [2007], Hazan and Kale [2010], Chiang et al. [2012], De Rooij et al. [2014], Gaillard et al. [2014], Steinhardt and Liang [2014]

R^k_T \prec \sqrt{V^k_T \ln K} \quad \text{for each expert } k,

for some second-order quantity V^k_T \le L^k_T \le T.

- Pro: stochastic case, learning sub-algorithms
- Con: specialized algorithms; hard-coded ln K


Quantile bounds

Hutter and Poland [2005], Chaudhuri et al. [2009], Chernov and Vovk [2010], Luo and Schapire [2014]

Prior π on experts:

\min_{k \in \mathcal{K}} R^k_T \prec \sqrt{T \left( -\ln \pi(\mathcal{K}) \right)} \quad \text{for each subset } \mathcal{K} \text{ of experts}

- Pro: over-discretized models, company baseline
- Con: specialized algorithms; efficiency; inescapable T


Our contribution

Squint algorithm with bound

R^{\mathcal{K}}_T \prec \sqrt{ V^{\mathcal{K}}_T \left( -\ln \pi(\mathcal{K}) + C_T \right) } \quad \text{for each subset } \mathcal{K} \text{ of experts},

where R^{\mathcal{K}}_T = E_{\pi(k|\mathcal{K})}[R^k_T] and V^{\mathcal{K}}_T = E_{\pi(k|\mathcal{K})}[V^k_T] denote the averages (under the prior π) among the reference experts k ∈ \mathcal{K} of the regret R^k_T = \sum_{t=1}^T r^k_t and the (uncentered) variance of the excess losses V^k_T = \sum_{t=1}^T (r^k_t)^2, where r^k_t = (w_t - e_k)^\intercal \ell_t.

The cool . . .

- We aggregate over all learning rates
- While staying as efficient as Hedge

Squint

Fix prior π(k) on experts and γ(η) on learning rates.

Potential function:

\Phi_T := E_{\pi(k)\gamma(\eta)}\!\left[ e^{\eta R^k_T - \eta^2 V^k_T} \right]

Weights:

w^k_{T+1} := \frac{ \pi(k)\, E_{\gamma(\eta)}\!\left[ e^{\eta R^k_T - \eta^2 V^k_T}\, \eta \right] }{ \text{normalisation} }

Board (sketched below):
- Argue the weights ensure 1 = \Phi_0 \ge \Phi_1 \ge \Phi_2 \ge \cdots
- Derive the second-order quantile bound from \Phi_T \le 1 (Dirac γ).

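A compressed reconstruction of the board argument (our sketch; it relies on the standard inequality e^{x - x^2} \le 1 + x for x \ge -1/2, which applies with x = \eta r^k_T since η \le 1/2 and r^k_t \in [-1, 1]):

\begin{align*}
\Phi_T &= E_{\pi(k)\gamma(\eta)}\!\left[ e^{\eta R^k_{T-1} - \eta^2 V^k_{T-1}} \, e^{\eta r^k_T - (\eta r^k_T)^2} \right] \\
&\le E_{\pi(k)\gamma(\eta)}\!\left[ e^{\eta R^k_{T-1} - \eta^2 V^k_{T-1}} \left( 1 + \eta r^k_T \right) \right] \\
&= \Phi_{T-1} + Z_T \sum_k w^k_T r^k_T \;=\; \Phi_{T-1},
\end{align*}

where Z_T is the normalisation of the weights, and \sum_k w^k_T r^k_T = w_T^\intercal \ell_T - w_T^\intercal \ell_T = 0 by the definition r^k_t = (w_t - e_k)^\intercal \ell_t. Hence \Phi_T \le \Phi_0 = 1, and restricting to a single η (Dirac γ) and a subset \mathcal{K} yields the second-order quantile bound.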

Three priors

Idea: have the prior γ(η) put sufficient mass around the optimal η.

1. Uniform prior (generalizes to conjugate):
   \gamma(\eta) = 2 (uniform on [0, 1/2])
   Efficient algorithm, C_T = \ln V^{\mathcal{K}}_T.

2. Chernov and Vovk [2010] prior:
   \gamma(\eta) = \frac{\ln 2}{\eta \ln^2 \eta}
   Not efficient, C_T = \ln\ln V^{\mathcal{K}}_T.

3. Improper(!) log-uniform prior:
   \gamma(\eta) = \frac{1}{\eta}
   Efficient algorithm, C_T = \ln\ln T.


Squint w. log-uniform prior

Closed-form expression for the weights:

w^k_{T+1} \propto \pi(k) \int_0^{1/2} e^{\eta R^k_T - \eta^2 V^k_T}\, \eta\, \frac{1}{\eta}\, \mathrm{d}\eta
\;\propto\; \pi(k)\, e^{\frac{(R^k_T)^2}{4 V^k_T}}\, \frac{ \operatorname{erf}\!\left( \frac{R^k_T}{2\sqrt{V^k_T}} \right) - \operatorname{erf}\!\left( \frac{R^k_T - V^k_T}{2\sqrt{V^k_T}} \right) }{ \sqrt{V^k_T} }.

Note: erf is part of e.g. the C99 standard. Constant time per expert per round.
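That closed form is a few lines of code. A minimal sketch, assuming SciPy for a vectorized erf (the guard for V = 0 before the first round is our addition; for very long games one would evaluate the exponential-times-erf-difference in log-space):

```python
import numpy as np
from scipy.special import erf

def squint_log_uniform_weights(prior, R, V):
    """Squint weights under the improper log-uniform prior gamma(eta) = 1/eta
    on (0, 1/2].  prior, R, V: length-K arrays of pi(k), R^k_T, V^k_T."""
    V = np.maximum(V, 1e-12)   # guard: V^k_0 = 0 reduces to prior weights
    s = 2.0 * np.sqrt(V)
    w = prior * np.exp(R**2 / (4.0 * V)) * (erf(R / s) - erf((R - V) / s)) / np.sqrt(V)
    return w / w.sum()

def squint_round(prior, R, V, loss_t):
    """One round: play the weights, observe losses, update (R, V) per expert."""
    w = squint_log_uniform_weights(prior, R, V)
    r = w @ loss_t - loss_t    # instantaneous regrets r^k_t = (w_t - e_k) . l_t
    return R + r, V + r**2
```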

Extensions

Combinatorial concept class C ⊆ {0, 1}^K:
- Shortest path
- Spanning trees
- Permutations

Component iProd algorithm guarantees:

R^u_T \prec \sqrt{ V^u_T \left( \operatorname{comp}(u) + K\, C_T \right) } \quad \text{for each } u \in \operatorname{conv}(C).

The reference set of experts \mathcal{K} is subsumed by an "average concept" vector u ∈ conv(C), for which our bound relates the coordinate-wise average regret R^u_T = \sum_{t,k} u_k r^k_t to the averaged variance V^u_T = \sum_{t,k} u_k (r^k_t)^2 and the prior entropy comp(u).

No range factor. Drop-in replacement for Component Hedge [Koolen, Warmuth, and Kivinen, 2010].

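Since R^u_T and V^u_T are linear in u, they are one dot product each, given the per-component statistics from the expert setting. A small illustration (our own; the concepts and mixture are hypothetical):

```python
import numpy as np

# Per-component statistics accumulated as before:
# R[k] = sum_t r^k_t  and  V[k] = sum_t (r^k_t)**2, for K = 4 components.
R = np.array([3.0, -1.0, 2.5, 0.5])
V = np.array([9.0, 4.0, 6.0, 1.0])

# A point in conv(C), e.g. mixing two paths u1, u2 in {0,1}^4:
u1, u2 = np.array([1, 1, 0, 0]), np.array([1, 0, 1, 1])
u = 0.5 * u1 + 0.5 * u2

R_u = u @ R   # coordinate-wise average regret R^u_T
V_u = u @ V   # averaged variance V^u_T
```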

Conclusion

A fresh algorithm for an old problem.

- different
- efficient
- better

Thank you!