Tutorial on Bandits Games - Sébastien Bubeck (bubeck.com/tutorial.pdf)
TRANSCRIPT
Tutorial on Bandits Games
Sébastien Bubeck
Online Learning with Full Information

[Figure: a Player facing an Adversary over d arms (1: CNN, 2: NBC, ..., d: ABC)]

The player picks an arm A ∈ {1, ..., d}.
The adversary assigns losses ℓ_1, ℓ_2, ..., ℓ_d.
Loss suffered: ℓ_A.
Feedback: ℓ_1, ..., ℓ_d (all losses revealed).
Online Learning with Bandit Feedback

[Figure: the same game, over arms 1, 2, ..., d]

The player picks an arm A ∈ {1, ..., d}.
The adversary assigns losses ℓ_1, ℓ_2, ..., ℓ_d.
Loss suffered: ℓ_A.
Feedback: ℓ_A only.
Some Applications

Computer Go    Brain-computer interfaces    Medical trials
Packet routing    Ad placement    Dynamic allocation
A little bit of advertising

Survey on multi-armed bandits to appear in:
S. Bubeck and N. Cesa-Bianchi. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. To appear in Foundations and Trends in Machine Learning, 2012. (Draft available on my webpage.)
Notation

For each round t = 1, 2, ..., n:

1. The player chooses an arm I_t ∈ {1, ..., d}, possibly with the help of an external randomization.
2. Simultaneously the adversary chooses a loss vector ℓ_t = (ℓ_{1,t}, ..., ℓ_{d,t}) ∈ [0, 1]^d.
3. The player incurs the loss ℓ_{I_t,t} and observes:
   the loss vector ℓ_t in the full information setting;
   only the incurred loss ℓ_{I_t,t} in the bandit setting.

Goal: minimize the cumulative loss incurred. We consider the regret:

R_n = E ∑_{t=1}^n ℓ_{I_t,t} − min_{i=1,...,d} E ∑_{t=1}^n ℓ_{i,t}.
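For a finished run, the regret in the display above can be computed directly. A minimal sketch (assuming NumPy; the 3-round loss matrix is a made-up illustration):

```python
import numpy as np

def regret(losses, actions):
    """Realized regret: incurred loss minus the loss of the best fixed arm.

    losses:  (n, d) array with losses[t, i] = loss of arm i at round t, in [0, 1]
    actions: length-n sequence of played arms I_t in {0, ..., d-1}
    """
    actions = np.asarray(actions)
    incurred = losses[np.arange(len(actions)), actions].sum()
    best_fixed = losses.sum(axis=0).min()  # min over arms of the total loss
    return incurred - best_fixed

# Made-up 3-round, 2-arm example: arm 0 totals 1.0, arm 1 totals 2.0.
losses = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
print(regret(losses, [0, 0, 0]))  # 0.0: playing the best fixed arm
print(regret(losses, [1, 1, 1]))  # 1.0
```

The slide's R_n takes an expectation over the player's randomization; note that the comparator is the best fixed arm in hindsight, not the best arm of each round.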
Exponential Weights (EW, EWA, MW, Hedge, etc.)

Draw I_t at random from p_t, where

p_t(i) = exp(−η ∑_{s=1}^{t−1} ℓ_{i,s}) / ∑_{j=1}^d exp(−η ∑_{s=1}^{t−1} ℓ_{j,s}).

Theorem (Cesa-Bianchi, Freund, Haussler, Helmbold, Schapire and Warmuth [1997])
Exp satisfies

R_n ≤ √(n log d / 2).

Moreover, for any strategy,

sup_{adversaries} R_n ≥ √(n log d / 2) + o(√(n log d)).
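The update is a one-liner in practice. A minimal simulation sketch (assuming NumPy; the uniform losses are synthetic, and η is set to √(8 log d / n), the tuning that produces the √(n log d / 2) bound):

```python
import numpy as np

def hedge_weights(cum_losses, eta):
    """Exponential-weights distribution over arms given cumulative losses."""
    # Shift by the min before exponentiating: same distribution, no overflow.
    w = np.exp(-eta * (cum_losses - cum_losses.min()))
    return w / w.sum()

rng = np.random.default_rng(0)
n, d = 1000, 5
eta = np.sqrt(8 * np.log(d) / n)      # tuning behind the sqrt(n log d / 2) bound
losses = rng.uniform(size=(n, d))     # synthetic [0, 1] losses
cum, total = np.zeros(d), 0.0
for t in range(n):
    p = hedge_weights(cum, eta)
    arm = rng.choice(d, p=p)
    total += losses[t, arm]
    cum += losses[t]                  # full information: all d losses observed
print(total - losses.sum(axis=0).min())   # realized regret
```

For n = 1000 and d = 5 the bound is √(n log d / 2) ≈ 28.4, so the printed realized regret should be of that order or below.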
The one-slide proof

w_t(i) = exp(−η ∑_{s=1}^{t−1} ℓ_{i,s}),   W_t = ∑_{i=1}^d w_t(i),   p_t(i) = w_t(i) / W_t.

log(W_{n+1} / W_1) = log((1/d) ∑_{i=1}^d w_{n+1}(i)) ≥ log((1/d) max_i w_{n+1}(i)) = −η min_i ∑_{t=1}^n ℓ_{i,t} − log d.

log(W_{n+1} / W_1) = ∑_{t=1}^n log(W_{t+1} / W_t) = ∑_{t=1}^n log(∑_{i=1}^d (w_t(i) / W_t) exp(−η ℓ_{i,t}))
= ∑_{t=1}^n log(E exp(−η ℓ_{I_t,t})) ≤ ∑_{t=1}^n (−η E ℓ_{I_t,t} + η²/8),

where the last step is Hoeffding's lemma. Combining the two displays and optimizing over η gives the regret bound.
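Both chains of (in)equalities can be sanity-checked numerically for random losses. A small sketch (assuming NumPy; n, d and η are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eta = 50, 4, 0.3
losses = rng.uniform(size=(n, d))
cum = losses.cumsum(axis=0)           # cum[t-1, i] = sum_{s<=t} loss of arm i

# log(W_{n+1} / W_1), with W_1 = d since w_1(i) = 1 for every arm
log_ratio = np.log(np.exp(-eta * cum[-1]).sum()) - np.log(d)

# Lower bound: log((1/d) max_i w_{n+1}(i)) = -eta min_i sum_t loss_{i,t} - log d
lower = -eta * cum[-1].min() - np.log(d)

# Upper bound: per-round Hoeffding's lemma,
# log E exp(-eta loss_{I_t,t}) <= -eta E loss_{I_t,t} + eta^2 / 8
upper = 0.0
for t in range(n):
    prev = cum[t - 1] if t > 0 else np.zeros(d)
    w = np.exp(-eta * prev)
    p = w / w.sum()
    upper += -eta * (p * losses[t]).sum() + eta ** 2 / 8
assert lower <= log_ratio <= upper
```

Combining the lower and upper bounds and rearranging yields E ∑_t ℓ_{I_t,t} − min_i ∑_t ℓ_{i,t} ≤ log d / η + ηn/8, which gives the theorem's √(n log d / 2) for η = √(8 log d / n).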
Magic trick for bandit feedback

ℓ̃_{i,t} = (ℓ_{i,t} / p_t(i)) 1_{I_t = i}

is an unbiased estimate of ℓ_{i,t}. We call Exp3 the Exp strategy run on the estimated losses.

Theorem (Auer, Cesa-Bianchi, Freund and Schapire [2003])
Exp3 satisfies:

R_n ≤ √(2 n d log d).

Moreover, for any strategy,

sup_{adversaries} R_n ≥ (1/4) √(nd) + o(√(nd)).
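The estimate only needs the observed loss, so Exp3 is the full-information loop with a one-line change. A sketch (assuming NumPy; the losses are synthetic, and η = √(2 log d / (nd)) is the tuning behind the √(2nd log d) bound):

```python
import numpy as np

def exp3(losses, eta, rng):
    """Exp3: exponential weights run on importance-weighted loss estimates."""
    n, d = losses.shape
    cum_est = np.zeros(d)                 # cumulative estimated losses
    total = 0.0
    for t in range(n):
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        arm = rng.choice(d, p=p)
        total += losses[t, arm]
        # The magic trick: loss/p on the played arm, 0 on all others.
        cum_est[arm] += losses[t, arm] / p[arm]
    return total

rng = np.random.default_rng(2)
n, d = 5000, 10
eta = np.sqrt(2 * np.log(d) / (n * d))    # tuning behind sqrt(2 n d log d)
losses = rng.uniform(size=(n, d))         # synthetic [0, 1] losses
print(exp3(losses, eta, rng) - losses.sum(axis=0).min())   # realized regret
```

Here the bound is √(2 · 5000 · 10 · log 10) ≈ 480; the printed realized regret is typically well below it.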
High probability bounds

What about bounds directly on the true regret

∑_{t=1}^n ℓ_{I_t,t} − min_{i=1,...,d} ∑_{t=1}^n ℓ_{i,t} ?

Auer et al. [2003] proposed Exp3.P:

p_t(i) = (1 − γ) exp(−η ∑_{s=1}^{t−1} ℓ̃_{i,s}) / ∑_{j=1}^d exp(−η ∑_{s=1}^{t−1} ℓ̃_{j,s}) + γ/d,

where

ℓ̃_{i,t} = (ℓ_{i,t} / p_t(i)) 1_{I_t = i} + β / p_t(i).
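The two Exp3.P ingredients can be sketched as small helpers (assuming NumPy; note that the slide's β/p_t(i) term is not gated by 1_{I_t = i}, so it is added to every arm's estimate):

```python
import numpy as np

def exp3p_probs(cum_est, eta, gamma):
    """Exp3.P distribution: exponential weights mixed with uniform exploration."""
    d = len(cum_est)
    w = np.exp(-eta * (cum_est - cum_est.min()))
    return (1 - gamma) * w / w.sum() + gamma / d

def exp3p_estimate(loss_played, arm, p, beta):
    """The slide's estimate: loss/p(i) 1{I_t = i} + beta/p(i), for every arm i."""
    est = beta / p                # the beta/p_t(i) term, for all arms
    est[arm] += loss_played / p[arm]
    return est

p = exp3p_probs(np.zeros(4), eta=0.1, gamma=0.2)
print(p)   # uniform here; in general every arm keeps mass >= gamma/d
est = exp3p_estimate(0.7, arm=2, p=p, beta=0.05)
```

The γ/d mixing keeps every arm's probability bounded below, and the β/p_t(i) term deliberately biases the estimates; together these trade a little expected regret for the concentration needed to control the true (not expected) regret.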
High probability bounds

Theorem (Auer et al. [2003], Audibert and Bubeck [2011])
Let δ ∈ (0, 1). With β = √(log(d δ⁻¹) / (nd)), η = 0.95 √(log d / (nd)) and γ = 1.05 √(d log d / n), Exp3.P satisfies with probability at least 1 − δ:

∑_{t=1}^n ℓ_{I_t,t} − min_{i=1,...,d} ∑_{t=1}^n ℓ_{i,t} ≤ 5.15 √(nd log(d δ⁻¹)).

On the other hand, with β = √(log d / (nd)), η = 0.95 √(log d / (nd)) and γ = 1.05 √(d log d / n), Exp3.P satisfies, for any δ ∈ (0, 1), with probability at least 1 − δ:

∑_{t=1}^n ℓ_{I_t,t} − min_{i=1,...,d} ∑_{t=1}^n ℓ_{i,t} ≤ √(nd / log d) log(δ⁻¹) + 5.15 √(nd log d).
Other types of normalization

INF (Implicitly Normalized Forecaster) is based on a potential function ψ : R*₋ → R*₊ that is increasing, convex, twice continuously differentiable, and such that (0, 1] ⊂ ψ(R*₋).

At each time step INF computes the new probability distribution as follows:

p_t(i) = ψ(C_t − ∑_{s=1}^{t−1} ℓ̃_{i,s}),

where C_t is the unique real number such that ∑_{i=1}^d p_t(i) = 1.

ψ(x) = exp(ηx) + γ/d corresponds exactly to the Exp3 strategy.
ψ(x) = (−ηx)^{−2} + γ/d is the quadratic INF strategy.
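Since ψ is increasing, the normalizing constant C_t can be found by bisection below the smallest cumulative estimate. A sketch (assuming NumPy; the quadratic potential's exact form, (−ηx)^{−2} + γ/d, is my reading of the slide and should be treated as an assumption):

```python
import numpy as np

def inf_probs(cum_est, psi, tol=1e-10):
    """INF: bisect for the constant C with sum_i psi(C - cum_est[i]) = 1.

    psi must be increasing on the negative reals, so the total mass is
    increasing in C; keeping C below min_i cum_est[i] keeps every argument < 0.
    """
    top = cum_est.min()
    lo, hi = top - 1e6, top - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if psi(mid - cum_est).sum() < 1.0:
            lo = mid          # too little mass: push C up
        else:
            hi = mid
    return psi((lo + hi) / 2 - cum_est)

d, eta, gamma = 5, 0.1, 0.01
quadratic = lambda x: (-eta * x) ** -2.0 + gamma / d   # assumed quadratic potential
p = inf_probs(np.zeros(d), quadratic)
print(p.sum())   # ~1, by construction
```

Plugging in ψ(x) = exp(ηx) + γ/d instead makes the same routine produce the Exp3 distribution.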
Minimax optimal regret bound

Theorem (Audibert and Bubeck [2009], Audibert and Bubeck [2010], Audibert, Bubeck and Lugosi [2011])
Quadratic INF satisfies:

R_n ≤ 2 √(2 n d).
Extension: partial monitoring

Partial monitoring: the feedback received at time t is some signal S(I_t, ℓ_t); see Cesa-Bianchi and Lugosi [2006].

A simple interpolation between full information and bandit feedback is the partial monitoring setting of Mannor and Shamir [2011]: S(I_t, ℓ_t) = {ℓ_{i,t}, i ∈ N(I_t)}, where N : {1, ..., d} → P({1, ..., d}) is some known neighborhood mapping. A natural loss estimate in that case is

ℓ̃_{i,t} = ℓ_{i,t} 1_{i ∈ N(I_t)} / ∑_{j ∈ N(i)} p_t(j).

Mannor and Shamir [2011] proved that Exp with the above estimate has a regret of order √(αn), where α is the independence number of the graph associated with N.
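The estimate divides each observed loss by the probability of observing it. A sketch (assuming NumPy; the 3-arm graph is made up, and the denominator is written as ∑_{j : i ∈ N(j)} p_t(j), which coincides with the slide's ∑_{j ∈ N(i)} p_t(j) when the observation graph is symmetric):

```python
import numpy as np

def graph_loss_estimate(loss_row, played, neighbors, p):
    """Unbiased loss estimate under graph feedback.

    neighbors[k]: set of arms whose losses are revealed when arm k is played.
    """
    d = len(loss_row)
    est = np.zeros(d)
    for i in neighbors[played]:
        # Probability that arm i's loss is observed this round
        obs = sum(p[j] for j in range(d) if i in neighbors[j])
        est[i] = loss_row[i] / obs
    return est

# Toy symmetric graph on 3 arms: edges 0-1 and 1-2, self-loops included.
neighbors = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2}}
p = np.array([1 / 3, 1 / 3, 1 / 3])
loss_row = np.array([0.3, 0.6, 0.9])

# Unbiasedness: averaging the estimate over the played arm recovers the losses.
avg = sum(p[k] * graph_loss_estimate(loss_row, k, neighbors, p) for k in range(3))
print(avg)   # [0.3 0.6 0.9]
```

With N(k) = {1, ..., d} every denominator is 1 and the estimate is the full loss vector (full information); with N(k) = {k} it reduces to Exp3's estimate, which is exactly the interpolation the slide describes.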
Extension: contextual bandits

Contextual bandits: at each time step t one receives a context s_t ∈ S, and one wants to perform as well as the best mapping from contexts to arms:

R_n^S = E ∑_{t=1}^n ℓ_{I_t,t} − min_{g : S → {1,...,d}} E ∑_{t=1}^n ℓ_{g(s_t),t}.

A related problem is bandits with expert advice: N experts are playing the game, and the player observes their actions ξ_t^k, k = 1, ..., N. One wants to compete with the best expert:

R_n^N = E ∑_{t=1}^n ℓ_{I_t,t} − min_{k ∈ {1,...,N}} E ∑_{t=1}^n ℓ_{ξ_t^k,t}.

With the bandit feedback ℓ_{I_t,t} one can build an estimate for the loss of expert k as ℓ̃_t^k = ℓ_{I_t,t} 1_{I_t = ξ_t^k} / p_t(I_t). Playing Exp on the set of experts with the above loss estimate yields R_n^N ≤ √(2 n d log N).
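The expert-level estimate credits the played arm's importance-weighted loss to every expert that recommended that arm. A one-round sketch (assuming NumPy; the weights and advice are made up):

```python
import numpy as np

def expert_loss_estimates(played, loss_played, advice, p_arm):
    """Estimated loss of each expert: the importance-weighted loss of the
    played arm, credited to every expert that recommended it."""
    return np.where(advice == played, loss_played / p_arm[played], 0.0)

# One round, N = 4 experts over d = 3 arms. The arm distribution p_arm is the
# expert-weight distribution q pushed forward through the experts' advice.
q = np.array([0.4, 0.3, 0.2, 0.1])                    # Exp weights over experts
advice = np.array([0, 2, 0, 1])                       # expert k recommends advice[k]
p_arm = np.bincount(advice, weights=q, minlength=3)   # [0.6, 0.1, 0.3]
est = expert_loss_estimates(0, 0.5, advice, p_arm)
print(est)   # experts 0 and 2 get 0.5/0.6, the others 0
```

Averaging over the played arm recovers each expert's true loss: E ℓ̃_t^k = p_t(ξ_t^k) · ℓ_{ξ_t^k,t} / p_t(ξ_t^k) = ℓ_{ξ_t^k,t}, so the estimate is unbiased.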
Stochastic Assumption

Assumption (Robbins [1952])
The sequence of losses (ℓ_t)_{1 ≤ t ≤ n} is a sequence of i.i.d. random variables.

For historical reasons, in this setting we consider gains rather than losses, and we introduce different notation:

Let ν_i be the unknown reward distribution underlying arm i, μ_i the mean of ν_i, μ* = max_{1≤i≤d} μ_i, and Δ_i = μ* − μ_i.
Let X_{i,s} ∼ ν_i be the reward obtained when pulling arm i for the s-th time, and T_i(t) = ∑_{s=1}^t 1_{I_s = i} the number of times arm i was pulled up to time t.

Thus here

R_n = n μ* − E ∑_{t=1}^n μ_{I_t} = ∑_{i=1}^d Δ_i E T_i(n).
![Page 52: Tutorial on Bandits Games - Sébastien Bubecksbubeck.com/tutorial.pdfA little bit of advertising Survey on multi-armed bandits to appear in S. Bubeck and N. Cesa-Bianchi. Regret Analysis](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fe104b5d7a51d3583265ff5/html5/thumbnails/52.jpg)
Stochastic Assumption
Assumption (Robbins [1952])
The sequence of losses (`t)1≤t≤n is a sequence of i.i.d randomvariables.
For historical reasons in this setting we consider gains rather thanlosses and we introduce different notation:
Let νi be the unknown reward distribution underlying arm i ,µi the mean of νi , µ
∗ = max1≤i≤d µi and ∆i = µ∗ − µi .Let Xi ,s ∼ νi be the reward obtained when pulling arm i forthe sth time, and Ti (t) =
∑ts=1 1Is=i the number of times
arm i was pulled up to time t.
Thus here
Rn = nµ∗ − En∑
t=1
µIt =d∑
i=1
∆iETi (n).
![Page 53: Tutorial on Bandits Games - Sébastien Bubecksbubeck.com/tutorial.pdfA little bit of advertising Survey on multi-armed bandits to appear in S. Bubeck and N. Cesa-Bianchi. Regret Analysis](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fe104b5d7a51d3583265ff5/html5/thumbnails/53.jpg)
Optimism in the face of uncertainty

General principle: given some observations from an unknown environment, build (with some probabilistic argument) a set of possible environments $\Omega$, then act as if the real environment were the most favorable one in $\Omega$.

Application to stochastic bandits: given the past rewards, build confidence intervals for the means $(\mu_i)$ (in particular, build upper confidence bounds), then play the arm with the highest upper confidence bound.
UCB (Upper Confidence Bounds)

Theorem (Hoeffding [1963])
Let $X, X_1, \dots, X_t$ be i.i.d. random variables in $[0,1]$. Then with probability at least $1 - \delta$,
\[
\mathbb{E} X \le \frac{1}{t} \sum_{s=1}^t X_s + \sqrt{\frac{\log \delta^{-1}}{2t}}.
\]
This directly suggests the famous UCB strategy of Auer, Cesa-Bianchi and Fischer [2002]:
\[
I_t \in \operatorname*{argmax}_{1 \le i \le d} \; \frac{1}{T_i(t-1)} \sum_{s=1}^{T_i(t-1)} X_{i,s} + \sqrt{\frac{2 \log t}{T_i(t-1)}}.
\]
Auer et al. proved the following regret bound:
\[
R_n \le \sum_{i : \Delta_i > 0} \frac{10 \log n}{\Delta_i}.
\]
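A minimal Python sketch of this index strategy on Bernoulli arms (the means, horizon, and initialization by one pull per arm are illustrative choices, not the authors' code):

```python
import numpy as np

def ucb(means, n, rng):
    """Run the UCB index strategy on Bernoulli arms with the given means."""
    d = len(means)
    T = np.zeros(d, dtype=int)       # pull counts T_i(t-1)
    S = np.zeros(d)                  # cumulative rewards
    for t in range(1, n + 1):
        if t <= d:                   # pull each arm once to initialize
            i = t - 1
        else:                        # empirical mean + exploration bonus
            index = S / T + np.sqrt(2 * np.log(t) / T)
            i = int(np.argmax(index))
        S[i] += rng.random() < means[i]
        T[i] += 1
    return T

rng = np.random.default_rng(0)
T = ucb([0.9, 0.5, 0.4], n=2000, rng=rng)
# the optimal arm should receive the bulk of the pulls
assert T.argmax() == 0
```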
Illustration of UCB
MOSS (Minimax Optimal Stochastic Strategy)

In a distribution-free sense one can show that UCB has a regret always bounded as $R_n \le c \sqrt{n d \log n}$. Furthermore one can prove that for any strategy there exists a set of distributions such that $R_n \ge \frac{1}{20} \sqrt{nd}$. To close this gap, Audibert and Bubeck [2009] introduced MOSS:
\[
I_t = \operatorname*{argmax}_{i \in \{1,\dots,d\}} \; \frac{1}{T_i(t-1)} \sum_{s=1}^{T_i(t-1)} X_{i,s} + \sqrt{\frac{\max\left( \log\left( \frac{n}{d \, T_i(t-1)} \right), 0 \right)}{T_i(t-1)}}.
\]
One can show that MOSS satisfies:
\[
R_n \le c \frac{d}{\Delta} \log(n), \quad \text{and} \quad R_n \le c \sqrt{nd}.
\]
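A sketch of the MOSS index for a single arm (function name is illustrative). The defining feature, visible in the assertions, is that the exploration bonus is clipped to zero once an arm has been pulled about $n/d$ times.

```python
import numpy as np

def moss_index(mean_reward, pulls, n, d):
    """MOSS upper confidence index for one arm: empirical mean plus a
    bonus sqrt(max(log(n / (d * pulls)), 0) / pulls)."""
    bonus = np.sqrt(max(np.log(n / (d * pulls)), 0.0) / pulls)
    return mean_reward + bonus

# the bonus vanishes once an arm has been pulled n/d times...
assert moss_index(0.5, pulls=100, n=100, d=1) == 0.5
# ...but is strictly positive for under-sampled arms
assert moss_index(0.5, pulls=1, n=1000, d=10) > 0.5
```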
Distribution-dependent lower bound

For any $p, q \in [0,1]$, let
\[
\mathrm{kl}(p, q) = p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}.
\]
Theorem (Lai and Robbins [1985])
Consider a consistent strategy, i.e. such that $\forall a > 0$, we have $\mathbb{E} T_i(n) = o(n^a)$ if $\Delta_i > 0$. Then for any Bernoulli reward distributions,
\[
\liminf_{n \to +\infty} \frac{R_n}{\log n} \ge \sum_{i : \Delta_i > 0} \frac{\Delta_i}{\mathrm{kl}(\mu_i, \mu^*)}.
\]
Note that
\[
\frac{1}{2 \Delta_i} \ge \frac{\Delta_i}{\mathrm{kl}(\mu_i, \mu^*)} \ge \frac{\mu^*(1 - \mu^*)}{2 \Delta_i}.
\]
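A small helper computing the Bernoulli divergence above (the clipping constant is an implementation detail). The second assertion checks Pinsker's inequality $\mathrm{kl}(p,q) \ge 2(p-q)^2$, which is what gives the left-hand bound $\frac{1}{2\Delta_i} \ge \frac{\Delta_i}{\mathrm{kl}(\mu_i,\mu^*)}$.

```python
import math

def kl(p, q):
    """Bernoulli KL divergence kl(p, q), with the usual 0*log(0) = 0 convention."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)    # avoid division by zero at the boundary
    out = 0.0
    if p > 0:
        out += p * math.log(p / q)
    if p < 1:
        out += (1 - p) * math.log((1 - p) / (1 - q))
    return out

assert abs(kl(0.3, 0.3)) < 1e-9                 # kl(p, p) = 0
assert kl(0.3, 0.7) >= 2 * (0.3 - 0.7) ** 2     # Pinsker's inequality
```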
KL-UCB

Theorem (Chernoff's inequality)
Let $X, X_1, \dots, X_t$ be i.i.d. random variables in $[0,1]$. Then
\[
\mathbb{P}\left( \frac{1}{t} \sum_{s=1}^t X_s \le \mathbb{E} X - \varepsilon \right) \le \exp\left( -t \, \mathrm{kl}(\mathbb{E} X - \varepsilon, \mathbb{E} X) \right).
\]
In particular this implies that with probability at least $1 - \delta$:
\[
\mathbb{E} X \le \max\left\{ q \in [0,1] : \mathrm{kl}\left( \frac{1}{t} \sum_{s=1}^t X_s, \, q \right) \le \frac{\log \delta^{-1}}{t} \right\}.
\]
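The maximum over $q$ above is easy to compute, since $q \mapsto \mathrm{kl}(\hat\mu, q)$ is increasing on $[\hat\mu, 1]$. A sketch by bisection (function names and the iteration count are illustrative):

```python
import math

def kl(p, q):
    """Bernoulli KL divergence kl(p, q)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    out = 0.0
    if p > 0:
        out += p * math.log(p / q)
    if p < 1:
        out += (1 - p) * math.log((1 - p) / (1 - q))
    return out

def kl_upper_bound(mean, level):
    """Largest q in [mean, 1] with kl(mean, q) <= level, by bisection,
    using that kl(mean, .) is increasing on [mean, 1]."""
    lo, hi = mean, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if kl(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

q = kl_upper_bound(0.5, level=0.2)
assert q > 0.5 and kl(0.5, q) <= 0.2 + 1e-6
```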
KL-UCB

Thus Chernoff's bound suggests the KL-UCB strategy of Garivier and Cappé [2011] (see also Honda and Takemura [2010], Maillard, Munos and Stoltz [2011]):
\[
I_t \in \operatorname*{argmax}_{1 \le i \le d} \; \max\left\{ q \in [0,1] : \mathrm{kl}\left( \frac{1}{T_i(t-1)} \sum_{s=1}^{T_i(t-1)} X_{i,s}, \, q \right) \le \frac{(1 + \varepsilon) \log t}{T_i(t-1)} \right\}.
\]
Garivier and Cappé proved the following regret bound for $n$ large enough:
\[
R_n \le \sum_{i : \Delta_i > 0} \frac{(1 + 2\varepsilon) \Delta_i}{\mathrm{kl}(\mu_i, \mu^*)} \log n.
\]
A non-UCB strategy: Thompson's sampling

In Thompson [1933] the following strategy was proposed for the case of Bernoulli distributions:

- Assume a uniform prior on the parameters $\mu_i \in [0,1]$.
- Let $\pi_{i,t}$ be the posterior distribution for $\mu_i$ at the $t$th round.
- Let $\theta_{i,t} \sim \pi_{i,t}$ (independently from the past given $\pi_{i,t}$).
- $I_t \in \operatorname*{argmax}_{i = 1,\dots,d} \theta_{i,t}$.

The first theoretical guarantee for this strategy was provided in Agrawal and Goyal [2012], and in Kaufmann, Korda, and Munos [2012] it was proved that it attains essentially the same regret as KL-UCB.
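For Bernoulli rewards the uniform prior is $\mathrm{Beta}(1,1)$ and the posterior stays Beta, so the strategy is a few lines. A sketch (means and horizon are illustrative):

```python
import numpy as np

def thompson(means, n, rng):
    """Thompson sampling for Bernoulli arms with a uniform Beta(1,1) prior."""
    d = len(means)
    succ = np.ones(d)                    # Beta posterior: successes + 1
    fail = np.ones(d)                    # Beta posterior: failures + 1
    T = np.zeros(d, dtype=int)
    for _ in range(n):
        theta = rng.beta(succ, fail)     # one posterior sample per arm
        i = int(np.argmax(theta))        # play the arm with the largest sample
        reward = rng.random() < means[i]
        succ[i] += reward
        fail[i] += 1 - reward
        T[i] += 1
    return T

rng = np.random.default_rng(1)
T = thompson([0.8, 0.4], n=1000, rng=rng)
assert T[0] > T[1]                       # the better arm dominates
```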
Heavy-tailed distributions

The standard UCB works for all $\sigma^2$-subgaussian distributions (not only bounded distributions), i.e. such that
\[
\mathbb{E} \exp\left( \lambda (X - \mathbb{E} X) \right) \le \exp\left( \frac{\sigma^2 \lambda^2}{2} \right), \quad \forall \lambda \in \mathbb{R}.
\]
It is easy to see that this is equivalent to
\[
\exists \alpha > 0 \text{ s.t. } \mathbb{E} \exp(\alpha X^2) < +\infty.
\]
What happens for distributions with heavier tails? Can we get logarithmic regret if the distributions only have a finite variance?
Median of means, Alon, Gibbons, Matias and Szegedy [2002]

Lemma
Let $X, X_1, \dots, X_n$ be i.i.d. random variables such that $\mathbb{E}(X - \mathbb{E} X)^2 \le 1$. Let $\delta \in (0,1)$, $k = 8 \log \delta^{-1}$ and $N = \frac{n}{8 \log \delta^{-1}}$. Then with probability at least $1 - \delta$,
\[
\mathbb{E} X \le \mathrm{median}\left( \frac{1}{N} \sum_{s=1}^{N} X_s, \; \dots, \; \frac{1}{N} \sum_{s=(k-1)N+1}^{kN} X_s \right) + 8 \sqrt{\frac{8 \log(\delta^{-1})}{n}}.
\]
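A sketch of the estimator itself (the block count and the heavy-tailed test distribution are illustrative): split the sample into $k$ blocks, average each block, and take the median of the block means. On data with finite variance but heavy tails, this stays close to the mean with high probability.

```python
import numpy as np

def median_of_means(x, k):
    """Split x into k equal blocks and return the median of the block means."""
    x = np.asarray(x, dtype=float)
    N = len(x) // k                      # block size
    block_means = x[: k * N].reshape(k, N).mean(axis=1)
    return float(np.median(block_means))

# Student-t with df = 2.5: finite variance, heavy tails, true mean 0
rng = np.random.default_rng(0)
x = rng.standard_t(df=2.5, size=10_000)
assert abs(median_of_means(x, k=32)) < 0.2
```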
Robust UCB

This suggests a Robust UCB strategy, Bubeck, Cesa-Bianchi and Lugosi [2012]:
\[
I_t \in \operatorname*{argmax}_{1 \le i \le d} \; \mathrm{median}\left( \frac{1}{N_{i,t}} \sum_{s=1}^{N_{i,t}} X_{i,s}, \; \dots, \; \frac{1}{N_{i,t}} \sum_{s=(k_t-1)N_{i,t}+1}^{k_t N_{i,t}} X_{i,s} \right) + 32 \sqrt{\frac{\log t}{T_i(t-1)}},
\]
with $k_t = 16 \log t$ and $N_{i,t} = \frac{T_i(t-1)}{16 \log t}$. The following regret bound can be proved for any set of distributions with variance bounded by 1:
\[
R_n \le c \sum_{i : \Delta_i > 0} \frac{\log n}{\Delta_i}.
\]
Markovian rewards

Assumption
The sequence $(X_{i,t})_{t \ge 1}$ forms an aperiodic irreducible finite-state Markov chain with unknown transition matrix $P_i$.

Again in this framework it is possible to design a UCB strategy with logarithmic regret (Tekin and Liu [2011]), using the following result:

Theorem (Lezaud [1998])
Let $X_1, \dots, X_t$ be an aperiodic irreducible finite-state Markov chain with transition matrix $P$. Let $\lambda_2$ be the second largest eigenvalue of the multiplicative symmetrization of $P$ and $\varepsilon = 1 - \lambda_2$. Let $\mu$ be the expectation of $X_1$ under the stationary distribution. There exists $C > 0$ such that for any $\gamma \in (0, 1]$,
\[
\mathbb{P}\left( \frac{1}{t} \sum_{s=1}^t X_s \ge \mu + \gamma \right) \le C \exp\left( -\frac{t \gamma^2 \varepsilon}{28} \right).
\]
Online Lipschitz and Stochastic Optimization

Stochastic multi-armed bandit where $\{1, \dots, K\}$ is replaced by $\mathcal{X}$. At time $t$, select $x_t \in \mathcal{X}$, then receive a random variable $r_t \in [0,1]$ such that $\mathbb{E}[r_t | x_t] = f(x_t)$.

Assumption
$\mathcal{X}$ is equipped with a symmetric function $\rho : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ such that $\rho(x, x) = 0$. $f$ is Lipschitz with respect to $\rho$, that is
\[
|f(x) - f(y)| \le \rho(x, y), \quad \forall x, y \in \mathcal{X}.
\]
\[
R_n = n f^* - \mathbb{E} \sum_{t=1}^n f(x_t),
\]
where $f^* = \sup_{x \in \mathcal{X}} f(x)$.
Example in 1d
Where should one sample next?

How to define a high-probability upper bound at any state $x$?
Noiseless case, rt = f (xt)
f(x )t
xt
f
f *
Lipschitz property → the evaluation of f at xt provides a firstupper-bound on f .
Noiseless case, r_t = f(x_t)

New point → refined upper bound on f, by taking the pointwise minimum of the individual Lipschitz upper bounds.
Back to the noisy case
UCB in a given domain

For a fixed domain X_i ∋ x containing n_i points x_t ∈ X_i, the sum ∑_{t=1}^{n_i} (r_t − f(x_t)) is a martingale. Thus by Azuma's inequality, with probability at least 1 − δ,

(1/n_i) ∑_{t=1}^{n_i} r_t + √( log(1/δ) / (2 n_i) ) ≥ (1/n_i) ∑_{t=1}^{n_i} f(x_t) ≥ f(x) − diam(X_i),

since f is Lipschitz (where diam(X_i) = sup_{x,y∈X_i} ρ(x, y)).
High probability upper bound

w.p. 1 − δ,

(1/n_i) ∑_{t=1}^{n_i} r_t + √( log(1/δ) / (2 n_i) ) + diam(X_i) ≥ sup_{x∈X_i} f(x).

Tradeoff between the number of points in a domain and the size of the domain. By considering several domains we can derive a tighter upper bound.
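This bound is straightforward to evaluate numerically; a minimal sketch, where the rewards and the cell diameter are made-up values:

```python
import math

def cell_ucb(rewards, diam, delta=0.01):
    """w.p. >= 1 - delta: empirical mean + Azuma width + diam upper-bounds sup f on the cell."""
    n_i = len(rewards)
    mean = sum(rewards) / n_i
    width = math.sqrt(math.log(1 / delta) / (2 * n_i))
    return mean + width + diam

# the width shrinks with more samples; the diam term shrinks with smaller cells
print(cell_ucb([0.4, 0.6, 0.5, 0.7], diam=0.25))
print(cell_ucb([0.4, 0.6, 0.5, 0.7] * 25, diam=0.25))   # same mean, tighter bound
```

The two prints exhibit the tradeoff from the slide: the same cell with 25× more samples has a much smaller confidence width, while only a finer partition can reduce the diameter term.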
A hierarchical decomposition

Use a tree of partitions at all scales:

B_i(t) := min [ µ_i(t) + √( 2 log(t) / T_i(t) ) + diam(i), max_{j∈C(i)} B_j(t) ]
Hierarchical Optimistic Optimization (HOO)

[Bubeck, Munos, Stoltz, Szepesvari, 2008, 2011]: Consider a tree of partitions of X; each node i corresponds to a subdomain X_i.

HOO Algorithm: Let T_t be the set of expanded nodes at round t.
- T_1 = {root} (the whole space X)
- At round t, select a leaf I_t of T_t by maximizing the B-values,
- T_{t+1} = T_t ∪ {I_t}
- Select x_t ∈ X_{I_t}
- Observe reward r_t and update the B-values:

B_i(t) := min [ µ_i(t) + √( 2 log(t) / T_i(t) ) + diam(i), max_{j∈C(i)} B_j(t) ]
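The loop above can be sketched in a few lines. This is a simplified HOO on X = [0, 1] with dyadic cells, diam(i) = cell width, and B-values refreshed only along the followed path (a common implementation shortcut); the test function and horizon are hypothetical:

```python
import math
import random

class Node:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.sum, self.count = 0.0, 0
        self.children = None
        self.B = float("inf")            # optimistic until sampled

def update_b(node, t):
    u = (node.sum / node.count
         + math.sqrt(2 * math.log(t) / node.count)
         + (node.hi - node.lo))          # diam(i) = width of the cell
    if node.children:
        u = min(u, max(c.B for c in node.children))
    node.B = u

def hoo(f, n, seed=0):
    rng = random.Random(seed)
    root = Node(0.0, 1.0)
    rewards = []
    for t in range(1, n + 1):
        path, node = [root], root
        while node.children:             # follow maximal B-values down to a leaf
            node = max(node.children, key=lambda c: c.B)
            path.append(node)
        mid = (node.lo + node.hi) / 2    # expand the selected leaf
        node.children = [Node(node.lo, mid), Node(mid, node.hi)]
        x = rng.uniform(node.lo, node.hi)                # pull a point in its cell
        r = 1.0 if rng.random() < f(x) else 0.0          # Bernoulli reward, mean f(x)
        rewards.append(r)
        for v in reversed(path):         # update B-values bottom-up along the path
            v.sum += r
            v.count += 1
            update_b(v, t)
    return rewards

f = lambda x: 0.9 - 0.8 * abs(x - 0.6)   # Lipschitz, maximum f* = 0.9 at x* = 0.6
rs = hoo(f, 2000)
print("mean reward:", sum(rs) / len(rs))
```

The diam term keeps large unvisited cells attractive, so exploration is automatic; as t grows, the pulls concentrate in ever smaller cells around the maximizer.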
Example in 1d

r_t ∼ B(f(x_t)), a Bernoulli distribution with parameter f(x_t).

Resulting tree at time n = 1000 and at n = 10000.
Analysis of HOO

The near-optimality dimension d of f is defined as follows: Let

X_ε := {x ∈ X : f(x) ≥ f^* − ε}

be the set of ε-optimal points. Then X_ε can be covered by O(ε^{−d}) balls of radius ε. A similar notion was introduced in [Kleinberg, Slivkins, Upfal, 2008].

Theorem (Bubeck, Munos, Stoltz, Szepesvari, 2008)

HOO satisfies: R_n = O(n^{(d+1)/(d+2)}).
Example 1:

Assume the function is locally peaky around its maximum:

f(x^*) − f(x) = Θ(‖x^* − x‖).

It takes O(ε^0) balls of radius ε to cover X_ε with ρ(x, y) = ‖x − y‖. Thus d = 0 and the regret is O(√n).
Example 2:

Assume the function is locally quadratic around its maximum:

f(x^*) − f(x) = Θ(‖x^* − x‖^2).

For ρ(x, y) = ‖x − y‖, it takes O(ε^{−D/2}) balls of radius ε to cover X_ε. Thus d = D/2 and R_n = O(n^{(D+2)/(D+4)}).

For ρ(x, y) = ‖x − y‖^2, it takes O(ε^0) ρ-balls of radius ε to cover X_ε. Thus d = 0 and R_n = O(√n).
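The two covering regimes are easy to check numerically in dimension D = 1, for the hypothetical instance f(x) = f^* − (x − x^*)^2, so that X_ε is an interval of half-width √ε:

```python
def balls_to_cover(half_width, radius):
    """Roughly how many intervals of half-width `radius` cover [-half_width, half_width]."""
    return max(1, round(half_width / radius))

for eps in [1e-2, 1e-4, 1e-6]:
    hw = eps ** 0.5                        # half-width of X_eps for a quadratic peak
    n_euclid = balls_to_cover(hw, eps)     # rho = |x - y|:   needs ~ eps^(-1/2) balls
    n_squared = balls_to_cover(hw, hw)     # rho = |x - y|^2: a rho-ball of radius eps
                                           # is an interval of half-width sqrt(eps)
    print(eps, n_euclid, n_squared)
```

The first count grows like ε^{−D/2} (here D = 1) while the second stays constant, matching d = D/2 versus d = 0.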
Example

X = [0, 1]^D, α ≥ 0, and mean-payoff function f locally "α-smooth" around (any of) its finitely many maxima x^*:

f(x^*) − f(x) = Θ(‖x − x^*‖^α) as x → x^*.

Theorem

Assume that we run HOO using ρ(x, y) = ‖x − y‖^β.

- Known smoothness: β = α. R_n = O(√n), i.e., the rate is independent of the dimension D.
- Smoothness underestimated: β < α. R_n = O(n^{(d+1)/(d+2)}) where d = D(1/β − 1/α).
- Smoothness overestimated: β > α. No guarantee. Note: UCT corresponds to β = +∞.
Path planning
Combinatorial prediction game

Adversary
Player

(Figure: a graph whose edges carry losses ℓ_1, . . . , ℓ_d; the player selects a path, here the one through edges 2, 7, . . . , d.)

loss suffered: ℓ_2 + ℓ_7 + . . . + ℓ_d

Feedback:
- Full Info: ℓ_1, ℓ_2, . . . , ℓ_d
- Semi-Bandit: ℓ_2, ℓ_7, . . . , ℓ_d
- Bandit: ℓ_2 + ℓ_7 + . . . + ℓ_d
Notation

S ⊂ {0, 1}^d, ℓ_t ∈ R_+^d, V_t ∈ S, loss suffered: ℓ_t^T V_t

R_n = E ∑_{t=1}^n ℓ_t^T V_t − min_{u∈S} E ∑_{t=1}^n ℓ_t^T u
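Concretely, a concept is a 0/1 incidence vector and the per-round loss is an inner product; a toy instance with d = 4, concepts = 2-element intervals, and made-up losses:

```python
# concepts: the 2-element intervals of {1, 2, 3, 4} as 0/1 incidence vectors
S = [(1, 1, 0, 0), (0, 1, 1, 0), (0, 0, 1, 1)]
loss = (0.3, 0.9, 0.1, 0.5)                       # one round's loss vector l_t

def play_loss(v, l):
    return sum(vi * li for vi, li in zip(v, l))   # l_t^T v

best = min(play_loss(v, loss) for v in S)         # min over u in S of l_t^T u
print([play_loss(v, loss) for v in S], best)
```

The regret of a sequence of plays is then the total of these inner products minus the total of the single best fixed concept in hindsight.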
Set of concepts S ⊂ {0, 1}^d
Paths k-sets Matchings
k-sized intervals
Spanning trees
Parallel bandits
Key idea

V_t ∼ p_t, p_t ∈ ∆(S)

Then, unbiased estimate ℓ̃_t of the loss ℓ_t:

- ℓ̃_t = ℓ_t in the full information game,
- ℓ̃_{i,t} = ℓ_{i,t} V_{i,t} / ∑_{V∈S : V_i=1} p_t(V) in the semi-bandit game,
- ℓ̃_t = P_t^+ V_t V_t^T ℓ_t, with P_t = E_{V∼p_t}(V V^T), in the bandit game.
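Unbiasedness of the semi-bandit estimator (E ℓ̃_{i,t} = ℓ_{i,t}) can be checked by exhaustive enumeration; the class S, distribution p_t, and loss vector below are toy values:

```python
S = [(1, 1, 0), (0, 1, 1), (1, 0, 1)]
p = {S[0]: 0.5, S[1]: 0.3, S[2]: 0.2}      # p_t in Delta(S)
loss = (0.4, 0.7, 0.2)                     # true (hidden) loss vector l_t

def q(i):
    """Probability that coordinate i is part of the played concept."""
    return sum(p[v] for v in S if v[i] == 1)

def estimate(v_played):
    """Semi-bandit estimate: observed coordinates, importance-weighted by q(i)."""
    return tuple(loss[i] * v_played[i] / q(i) for i in range(3))

# E[tilde-l_i] = sum_V p(V) * l_i * V_i / q(i) = l_i for every coordinate i
expected = [sum(p[v] * estimate(v)[i] for v in S) for i in range(3)]
print(expected)   # ≈ [0.4, 0.7, 0.2]
```

Coordinates that are rarely played get a small q(i) and thus a large importance weight, which is exactly what makes the estimator unbiased (at the price of higher variance).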
Loss assumptions

Definition (L∞)
We say that the adversary satisfies the L∞ assumption if ‖ℓ_t‖_∞ ≤ 1 for all t = 1, . . . , n.

Definition (L2)
We say that the adversary satisfies the L2 assumption if ℓ_t^T v ≤ 1 for all t = 1, . . . , n and v ∈ S.
Expanded Exponentially weighted average forecaster (Exp2)

p_t(v) = exp( −η ∑_{s=1}^{t−1} ℓ̃_s^T v ) / ∑_{u∈S} exp( −η ∑_{s=1}^{t−1} ℓ̃_s^T u )

- In the full information game, against L2 adversaries, we have (for some η) R_n ≤ √(2dn), which is the optimal rate, Dani, Hayes and Kakade [2008].
- Thus against L∞ adversaries we have R_n ≤ d^{3/2} √(2n). But this is suboptimal, Koolen, Warmuth and Kivinen [2010].
- Audibert, Bubeck and Lugosi [2011] showed that, for any η, there exists a subset S ⊂ {0, 1}^d and an L∞ adversary such that R_n ≥ 0.02 d^{3/2} √n.
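For a concept class small enough to enumerate, the Exp2 weights can be maintained explicitly; a full-information sketch with a made-up class and loss sequence:

```python
import math

S = [(1, 1, 0, 0), (0, 1, 1, 0), (0, 0, 1, 1)]     # toy concept class
eta = 0.5

def exp2_probs(cum):
    """p_t(v) proportional to exp(-eta * sum_s l_s^T v); cum[v] is v's cumulative loss."""
    w = {v: math.exp(-eta * cum[v]) for v in S}
    z = sum(w.values())
    return {v: wv / z for v, wv in w.items()}

cum = {v: 0.0 for v in S}
for loss in [(0.3, 0.9, 0.1, 0.5), (0.8, 0.1, 0.1, 0.6), (0.7, 0.2, 0.3, 0.4)]:
    p = exp2_probs(cum)                             # distribution used at round t
    for v in S:                                     # full info: every l_t^T v is observed
        cum[v] += sum(vi * li for vi, li in zip(v, loss))

p = exp2_probs(cum)
print(max(p, key=p.get))   # most weight on the concept with smallest cumulative loss
```

The catch highlighted on the slide: |S| can be exponential in d (e.g., all paths of a graph), so this explicit enumeration is exactly what OSMD below avoids.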
Legendre function

Definition
Let D be a convex subset of R^d with nonempty interior int(D) and boundary ∂D. We call Legendre any function F : D → R such that
- F is strictly convex and admits continuous first partial derivatives on int(D),
- For any u ∈ ∂D, for any v ∈ int(D), we have

lim_{s→0, s>0} (u − v)^T ∇F( (1 − s)u + sv ) = +∞.
Bregman divergence

Definition
The Bregman divergence D_F : D × int(D) → R associated to a Legendre function F is defined by

  D_F(u, v) = F(u) − F(v) − (u − v)^T ∇F(v).

Definition
The Legendre transform of F is defined by

  F*(u) = sup_{x ∈ D} (x^T u − F(x)).

Key property for Legendre functions: ∇F* = (∇F)^{−1}.
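For a concrete feel (our example, not from the slides): with the unnormalized negative entropy F(x) = ∑_i x_i log x_i, the Bregman divergence is the generalized KL divergence. The sketch below verifies this identity numerically.

```python
import numpy as np

# Sketch (assumption: F is the unnormalized negative entropy, eta = 1):
# D_F(u, v) = F(u) - F(v) - (u - v)^T grad F(v) should equal the generalized
# KL divergence  sum_i (u_i log(u_i / v_i) - u_i + v_i).

def F(x):
    return np.sum(x * np.log(x))

def grad_F(x):
    return 1.0 + np.log(x)

def bregman(u, v):
    return F(u) - F(v) - (u - v) @ grad_F(v)

u = np.array([0.2, 0.3, 0.5])
v = np.array([0.4, 0.4, 0.2])
kl = np.sum(u * np.log(u / v) - u + v)
print(bregman(u, v), kl)  # the two expressions agree
```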
![Page 152: Tutorial on Bandits Games - Sébastien Bubecksbubeck.com/tutorial.pdfA little bit of advertising Survey on multi-armed bandits to appear in S. Bubeck and N. Cesa-Bianchi. Regret Analysis](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fe104b5d7a51d3583265ff5/html5/thumbnails/152.jpg)
Online Stochastic Mirror Descent (OSMD)

Parameter: F Legendre on D ⊃ Conv(S).

(Figure: D contains Conv(S), which contains ∆(S); the iterates p_t, w_t, w′_{t+1}, w_{t+1}, p_{t+1} move between these sets.)

(1) w′_{t+1} ∈ D :  w′_{t+1} = ∇F*(∇F(w_t) − ℓ̃_t)

(2) w_{t+1} ∈ argmin_{w ∈ Conv(S)} D_F(w, w′_{t+1})

(3) p_{t+1} ∈ ∆(S) :  w_{t+1} = E_{V ∼ p_{t+1}} V
General regret bound for OSMD

Theorem
If F admits a Hessian ∇²F that is always invertible, then

  R_n ≲ diam_{D_F}(S) + E ∑_{t=1}^n ℓ̃_t^T (∇²F(w_t))^{−1} ℓ̃_t.
Different instances of OSMD: LinExp (entropy function)

D = [0, +∞)^d,  F(x) = (1/η) ∑_{i=1}^d x_i log x_i

- Full Info: Exp
- Semi-Bandit = Bandit: Exp3, Auer et al. [2002]
- Full Info: Component Hedge, Koolen, Warmuth and Kivinen [2010]
- Semi-Bandit: MW, Kale, Reyzin and Schapire [2010]
- Bandit: bad algorithm!
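In the bandit instance the loss estimate ℓ̃_t is built by importance weighting. A minimal sketch of Exp3 (Auer et al. [2002]) for S = {e_1, …, e_d}; the parameter values and the simulated losses are illustrative choices of ours, not from the slides.

```python
import numpy as np

# Sketch of Exp3 for S = {e_1,...,e_d}: only the played coordinate's loss is
# observed, and the loss estimate is importance-weighted,
#   l~_{t,i} = (l_{t,i} / p_{t,i}) * 1{I_t = i},
# which is unbiased.  The learning rate and Bernoulli loss means below are
# hypothetical choices for the demo.

rng = np.random.default_rng(0)
d, n, eta = 3, 5000, 0.05
true_loss = np.array([0.5, 0.2, 0.8])      # hypothetical loss means; arm 1 is best
w = np.ones(d) / d
for t in range(n):
    arm = rng.choice(d, p=w)
    loss = rng.binomial(1, true_loss[arm])  # only this coordinate is observed
    loss_est = np.zeros(d)
    loss_est[arm] = loss / w[arm]           # unbiased importance-weighted estimate
    w = w * np.exp(-eta * loss_est)         # exponential-weights (LinExp) update
    w /= w.sum()

print(w)  # the distribution should concentrate on the low-loss arm (index 1)
```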
Different instances of OSMD: LinINF (exchangeable Hessian)

D = [0, +∞)^d,  F(x) = ∑_{i=1}^d ∫_0^{x_i} ψ^{−1}(s) ds

INF, Audibert and Bubeck [2009]

- ψ(x) = exp(ηx) : LinExp
- ψ(x) = (−ηx)^{−q}, q > 1 : LinPoly
Different instances of OSMD: LinINF (ExchangeableHessian)
D = [0,+∞)d , F (x) =∑d
i=1
∫ xi0 ψ−1(s)ds
INF, Audibert and Bubeck [2009]
ψ(x) = exp(ηx) : LinExpψ(x) = (−ηx)−q, q > 1 : LinPoly
![Page 167: Tutorial on Bandits Games - Sébastien Bubecksbubeck.com/tutorial.pdfA little bit of advertising Survey on multi-armed bandits to appear in S. Bubeck and N. Cesa-Bianchi. Regret Analysis](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fe104b5d7a51d3583265ff5/html5/thumbnails/167.jpg)
Different instances of OSMD: LinINF (ExchangeableHessian)
D = [0,+∞)d , F (x) =∑d
i=1
∫ xi0 ψ−1(s)ds
INF, Audibert and Bubeck [2009]
ψ(x) = exp(ηx) : LinExpψ(x) = (−ηx)−q, q > 1 : LinPoly
![Page 168: Tutorial on Bandits Games - Sébastien Bubecksbubeck.com/tutorial.pdfA little bit of advertising Survey on multi-armed bandits to appear in S. Bubeck and N. Cesa-Bianchi. Regret Analysis](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fe104b5d7a51d3583265ff5/html5/thumbnails/168.jpg)
Different instances of OSMD: LinINF (ExchangeableHessian)
D = [0,+∞)d , F (x) =∑d
i=1
∫ xi0 ψ−1(s)ds
INF, Audibert and Bubeck [2009]
ψ(x) = exp(ηx) : LinExpψ(x) = (−ηx)−q, q > 1 : LinPoly
![Page 169: Tutorial on Bandits Games - Sébastien Bubecksbubeck.com/tutorial.pdfA little bit of advertising Survey on multi-armed bandits to appear in S. Bubeck and N. Cesa-Bianchi. Regret Analysis](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fe104b5d7a51d3583265ff5/html5/thumbnails/169.jpg)
Different instances of OSMD: Follow the Regularized Leader

D = Conv(S), then

  w_{t+1} ∈ argmin_{w ∈ D} ( ∑_{s=1}^t ℓ̃_s^T w + F(w) )

Particularly interesting choice: F a self-concordant barrier function, Abernethy, Hazan and Rakhlin [2008]
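As a sanity check on the FTRL formulation (our example, under the assumption S = {e_1, …, e_d} and entropy regularization), the minimizer has the closed form w_{t+1} ∝ exp(−η L_t) for the cumulative loss estimate L_t, i.e. FTRL with entropy is again exponential weights; the sketch verifies this against random points of the simplex.

```python
import numpy as np

# Sketch (assumptions: S = {e_1,...,e_d}, so D = Conv(S) is the simplex, and
# F(x) = (1/eta) sum_i x_i log x_i).  The FTRL minimizer of
#   sum_{s<=t} l~_s^T w + F(w)   over the simplex
# is the softmax of -eta * L_t, where L_t = sum_{s<=t} l~_s.

def ftrl_entropy(cum_loss, eta):
    w = np.exp(-eta * cum_loss)
    return w / w.sum()

eta = 1.0
L = np.array([2.0, 0.5, 1.0])   # illustrative cumulative loss estimates
w = ftrl_entropy(L, eta)

# Numeric check: w attains a lower FTRL objective than random simplex points.
def objective(p):
    return L @ p + (p * np.log(p)).sum() / eta

rng = np.random.default_rng(1)
others = rng.dirichlet(np.ones(3), size=100)
ok = all(objective(w) <= objective(p) + 1e-12 for p in others)
print(ok)  # True
```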
![Page 170: Tutorial on Bandits Games - Sébastien Bubecksbubeck.com/tutorial.pdfA little bit of advertising Survey on multi-armed bandits to appear in S. Bubeck and N. Cesa-Bianchi. Regret Analysis](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fe104b5d7a51d3583265ff5/html5/thumbnails/170.jpg)
Different instances of OSMD: Follow the regularized leader
D = Conv(S), then
wt+1 ∈ argminw∈D
(t∑
s=1
˜Ts w + F (w)
)
Particularly interesting choice: F self-concordant barrier function,Abernethy, Hazan and Rakhlin [2008]
![Page 171: Tutorial on Bandits Games - Sébastien Bubecksbubeck.com/tutorial.pdfA little bit of advertising Survey on multi-armed bandits to appear in S. Bubeck and N. Cesa-Bianchi. Regret Analysis](https://reader036.vdocuments.us/reader036/viewer/2022081615/5fe104b5d7a51d3583265ff5/html5/thumbnails/171.jpg)
Minimax regret for the full information game

Theorem (Koolen, Warmuth and Kivinen [2010])
In the full information game, the LinExp strategy (with well-chosen parameters) satisfies, for any concept class S ⊂ {0,1}^d and any L∞-adversary:

  R_n ≤ d√(2n).

Moreover, for any strategy there exists a subset S ⊂ {0,1}^d and an L∞-adversary such that:

  R_n ≥ 0.008 d√n.
Minimax regret for the semi-bandit game

Theorem (Audibert, Bubeck and Lugosi [2011])
In the semi-bandit game, the LinExp strategy (with well-chosen parameters) satisfies, for any concept class S ⊂ {0,1}^d and any L∞-adversary:

  R_n ≤ d√(2n).

Moreover, for any strategy there exists a subset S ⊂ {0,1}^d and an L∞-adversary such that:

  R_n ≥ 0.008 d√n.
Minimax regret for the bandit game

For the bandit game the situation becomes trickier.

- First, it appears necessary to add some form of forced exploration on S to control third-order error terms in the regret bound.
- Second, the control of the quadratic term ℓ̃_t^T (∇²F(w_t))^{−1} ℓ̃_t is much more involved than before.
John's distribution

Theorem (John's Theorem)
Let K ⊂ R^d be a convex set. If the ellipsoid E of minimal volume enclosing K is the unit ball in some norm derived from a scalar product ⟨·,·⟩, then there exist M ≤ d(d+1)/2 + 1 contact points u_1, …, u_M between E and K, and µ ∈ ∆_M (the simplex of dimension M − 1), such that

  x = d ∑_{i=1}^M µ_i ⟨x, u_i⟩ u_i,  ∀ x ∈ R^d.
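A quick numeric instance of John's identity (our example, not from the slides): for K = B_1, the cross-polytope in R^d, the minimal-volume enclosing ellipsoid is the Euclidean unit ball, with M = 2d contact points ±e_i and uniform weights µ_i = 1/(2d). The sketch verifies the decomposition on a random vector.

```python
import numpy as np

# Sketch: John's identity  x = d * sum_i mu_i <x, u_i> u_i  for K = B_1
# (the cross-polytope), whose enclosing ball touches K at the 2d points +-e_i.
# Here M = 2d, which indeed satisfies M <= d(d+1)/2 + 1 for d >= 3.

d = 4
contacts = np.vstack([np.eye(d), -np.eye(d)])   # u_1, ..., u_M with M = 2d
mu = np.full(2 * d, 1.0 / (2 * d))              # uniform point of the simplex Delta_M

rng = np.random.default_rng(2)
x = rng.standard_normal(d)
recon = d * sum(m * (x @ u) * u for m, u in zip(mu, contacts))
print(np.allclose(recon, x))  # True: each pair +-e_i contributes (1/d) x_i e_i
```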
Minimax regret for the bandit game

Theorem (Audibert, Bubeck and Lugosi [2011], Bubeck, Cesa-Bianchi and Kakade [2012])
In the bandit game, the Exp2 strategy with John's exploration satisfies, for any concept class S ⊂ {0,1}^d and any L∞-adversary:

  R_n ≤ 4d²√n,

and respectively R_n ≤ 4d√n for an L2-adversary.
Moreover, for any strategy there exists a subset S ⊂ {0,1}^d and an L∞-adversary such that:

  R_n ≥ 0.01 d^{3/2}√n.

For L2-adversaries the lower bound is 0.05 min(n, d√n).

Conjecture: for an L∞-adversary the correct order of magnitude is d^{3/2}√n, and it can be attained with OSMD.