Corralling a Band of Bandit Algorithms · 2020. 1. 3.
TRANSCRIPT
Corralling a Band of Bandit Algorithms
Alekh Agarwal1, Haipeng Luo1, Behnam Neyshabur2, Rob Schapire1
1 Microsoft Research, New York; 2 Toyota Technological Institute at Chicago
Contextual Bandits in Practice
Personalized news recommendation in MSN:
EXP4
Epoch-Greedy
LinUCB
ILOVETOCONBANDITS
BISTRO, BISTRO+
...
Motivation
So many existing algorithms: which one should I use?
no single algorithm is guaranteed to be the best
Naive approach: try all and pick the best
inefficient, wasteful, nonadaptive
Hope: create a master algorithm that
selects base algorithms automatically and adaptively on the fly
performs close to the best in the long run
A Closer Look
Full information setting:
an "expert" algorithm (e.g. Hedge (Freund and Schapire, 1997)) solves it
Bandit setting:
use a multi-armed bandit algorithm (e.g. EXP3 (Auer et al., 2002))?
Serious flaw in this approach:
the regret guarantee concerns only the master's actual performance
but the performance of the base algorithms is significantly degraded by the lack of feedback
Difficulties
An example: [figure comparing a base algorithm's performance when run separately vs. when run with a master]
Right objective: perform almost as well as the best base algorithm would have performed had it been run on its own
Difficulties:
worse performance ↔ less feedback
requires a better tradeoff between exploration and exploitation
Related Work and Our Results
Maillard and Munos (2011) studied similar problems
EXP3 with a higher amount of uniform exploration
if the base algorithms have √T regret, the master has T^{2/3} regret
Our results:
a novel algorithm with more active and adaptive exploration
almost the same regret as the base algorithms
two major applications:
exploiting easy environments while keeping worst-case robustness
selecting the correct model automatically
Outline
1 Introduction
2 Formal Setup
3 Main Results
4 Conclusion and Open Problems
A General Bandit Problem
for t = 1 to T do
Environment reveals some side information x_t ∈ X
Player picks an action θ_t ∈ Θ
Environment decides a loss function f_t : Θ × X → [0, 1]
Player suffers and observes (only) the loss f_t(θ_t, x_t)
Example: contextual bandits
x is a context, θ ∈ Θ is a policy, f_t(θ, x) = loss of arm θ(x)
environment: i.i.d., adversarial, or hybrid
(Pseudo) Regret: REG = sup_{θ∈Θ} E[ ∑_{t=1}^{T} ( f_t(θ_t, x_t) − f_t(θ, x_t) ) ]
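As a concrete illustration, the protocol above can be written as a small simulation loop. This is not code from the paper: the environment, loss function, and the trivial uniform player below are hypothetical stand-ins.

```python
import random

def run_bandit(T, actions, contexts, loss_fn, player):
    """Simulate the general bandit protocol: each round the player sees
    side information x_t, picks an action theta_t, and observes ONLY the
    loss f_t(theta_t, x_t).  Returns the pseudo-regret against the best
    fixed action in hindsight."""
    total_loss = 0.0
    for t in range(T):
        x_t = contexts[t]                   # environment reveals x_t
        theta_t = player.act(x_t)           # player picks theta_t
        loss = loss_fn(t, theta_t, x_t)     # f_t : Theta x X -> [0, 1]
        player.update(x_t, theta_t, loss)   # bandit feedback: this loss only
        total_loss += loss
    best_fixed = min(
        sum(loss_fn(t, theta, contexts[t]) for t in range(T))
        for theta in actions)
    return total_loss - best_fixed

class UniformPlayer:
    """Trivial baseline that ignores all feedback."""
    def __init__(self, actions):
        self.actions = actions
    def act(self, x):
        return random.choice(self.actions)
    def update(self, x, theta, loss):
        pass
```

Any of the contextual bandit algorithms named earlier would plug in as `player`; the point of the talk is building a master that wraps several of them.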
Base Algorithms
Suppose we are given M base algorithms B_1, …, B_M.
At each round t, receive suggested actions θ_t^1, …, θ_t^M.
Hope: create a master s.t.
loss of master ≈ loss of the best base algorithm had it been run on its own
How do we formally measure this?
Goal
Suppose running B_i alone gives REG_{B_i} ≤ R_i(T)
When running the master with all base algorithms, want
REG_M ≤ O(poly(M) · R_i(T))
(impossible in general!)
Example: for contextual bandits
ILOVETOCONBANDITS (Agarwal et al., 2014): REG √T, i.i.d. environment
BISTRO+ (Syrgkanis et al., 2016): REG T^{2/3}, hybrid environment
Desired: the master matches √T in the i.i.d. case and T^{2/3} in the hybrid case
A Natural Assumption
Typical strategy:
sample a base algorithm i_t ∼ p_t
feed importance-weighted feedback to all B_i: (f_t(θ_t, x_t) / p_{t,i}) · 1{i = i_t}
Assume B_i ensures REG_{B_i} ≤ E[ (max_t 1/p_{t,i})^{α_i} ] · R_i(T)
Want REG_M ≤ O(poly(M) · R_i(T))
Note: EXP3 still doesn't work
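The importance-weighted feedback can be sketched in a few lines; the estimate is unbiased because algorithm i is charged loss/p_{t,i} with probability p_{t,i} and zero otherwise. Function names here are mine, not the paper's.

```python
def importance_weighted_losses(loss, i_t, p_t):
    """Feedback sent to the base algorithms: B_i receives
    f_t(theta_t, x_t) / p_{t,i} * 1{i = i_t}."""
    return [loss / p_t[i] if i == i_t else 0.0 for i in range(len(p_t))]

def expected_estimate(loss, p_t, i):
    """E over i_t ~ p_t of B_i's estimate; equals `loss` for every i,
    which is exactly the unbiasedness property."""
    return sum(p_t[j] * (loss / p_t[i] if j == i else 0.0)
               for j in range(len(p_t)))
```

The catch, as the assumption on this slide makes explicit, is variance: when p_{t,i} is small the estimate can be as large as 1/p_{t,i}, which is why the base algorithm's regret bound is allowed to blow up by a (max_t 1/p_{t,i})^{α_i} factor.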
Outline
1 Introduction
2 Formal Setup
3 Main Results
4 Conclusion and Open Problems
A Special OMD
Intuition 1: want 1/p_{t,i} to be small
Mirror map vs. resulting 1/p_{t,i}:
Shannon entropy (EXP3): 1/p_{t,i} ≈ exp(η · loss)
Log barrier −(1/η) ∑_i ln p_i (Foster et al., 2016): 1/p_{t,i} ≈ η · loss
In some sense, the log barrier provides the least extreme weighting
An Increasing Learning Rates Schedule
Intuition 2: need to learn faster if a base algorithm has a low sampled probability
Solution:
individual learning rates: use ∑_i −ln(p_i)/η_i as the mirror map
increase the learning rate η_i when 1/p_{t,i} is too large
Our Algorithm: Corral

Initial learning rates η_i = η; initial thresholds ρ_i = 2M.

for t = 1 to T do
  Observe x_t and send x_t to all base algorithms
  Receive suggested actions θ_t^1, …, θ_t^M
  Sample i_t ∼ p_t, predict θ_t = θ_t^{i_t}, observe loss f_t(θ_t, x_t)
  Construct unbiased estimated loss f_t^i(θ, x) and send it to B_i
  Update p_{t+1} ← Log-Barrier-OMD(p_t, (f_t(θ_t, x_t)/p_{t,i_t}) e_{i_t}, η)
  for i = 1 to M do
    if 1/p_{t+1,i} > ρ_i then update ρ_i ← 2ρ_i, η_i ← βη_i
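Putting both ideas together, a self-contained Python sketch of the Corral loop follows. The base-algorithm interface (`act`/`update`) and environment interface (`context`/`loss`) are hypothetical placeholders, β is treated as a free parameter, and the OMD step is the log-barrier update with per-coordinate learning rates; this is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

def log_barrier_omd(p, loss, eta, iters=100):
    # Log-barrier OMD step with per-coordinate learning rates eta_i:
    # solve 1/p_new_i = 1/p_i + eta_i*(loss_i - lam), binary-searching the
    # normalizer lam so that p_new sums to 1.
    inv_p = 1.0 / p
    lo, hi = np.min(loss), np.min(loss + inv_p / eta) - 1e-12
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        p_new = 1.0 / (inv_p + eta * (loss - lam))
        if p_new.sum() > 1.0:
            hi = lam
        else:
            lo = lam
    return p_new / p_new.sum()

def corral(bases, env, T, eta0, beta=np.e, seed=0):
    """Sketch of the Corral master loop. Hypothetical interfaces: each base
    exposes act(x) -> action and update(loss_estimate); env exposes
    context() and loss(action, x) with losses in [0, 1]."""
    rng = np.random.default_rng(seed)
    M = len(bases)
    p = np.full(M, 1.0 / M)        # sampling distribution over bases
    eta = np.full(M, eta0)         # individual learning rates eta_i
    rho = np.full(M, 2.0 * M)      # thresholds on 1/p_i
    for _ in range(T):
        x = env.context()
        actions = [b.act(x) for b in bases]   # suggested actions
        i = rng.choice(M, p=p)                # sample a base i_t ~ p_t
        f = env.loss(actions[i], x)           # observe loss of the chosen action
        loss_est = np.zeros(M)
        loss_est[i] = f / p[i]                # unbiased importance-weighted estimate
        for j, b in enumerate(bases):
            b.update(loss_est[j])             # send estimated loss to B_j
        p = log_barrier_omd(p, loss_est, eta)
        grow = (1.0 / p) > rho                # learn faster where p_i got small
        rho[grow] *= 2.0
        eta[grow] *= beta
    return p
```

Note how the threshold-doubling step implements Intuition 2: whenever a base's probability drops below 1/ρ_i, its learning rate is increased so the master can recover it quickly if it starts performing well.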
Regret Guarantee

Theorem. If for some environment there exists a base algorithm B_i such that

REG_{B_i} ≤ E[ρ_{T,i}^{α_i}] · R_i(T),

then under the same environment Corral ensures

REG_M = O( M/η + Tη − E[ρ_{T,i}]/η + E[ρ_{T,i}^{α_i}] · R_i(T) ).
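The negative term −E[ρ_{T,i}]/η is what absorbs the blow-up from E[ρ_{T,i}^{α_i}] · R_i(T). As a worked step for the case α_i = 1/2 (my own calculation, consistent with the theorem but not shown on the slides), bounding pointwise and maximizing over ρ gives

```latex
\sqrt{\rho}\,R_i(T) - \frac{\rho}{\eta}
  \;\le\; \max_{\rho > 0}\Big(\sqrt{\rho}\,R_i(T) - \frac{\rho}{\eta}\Big)
  \;=\; \frac{\eta\,R_i(T)^2}{4}
  \qquad\text{(maximized at } \sqrt{\rho} = \tfrac{\eta R_i(T)}{2}\text{)},
```

so for α_i = 1/2 the guarantee becomes REG_M = O( M/η + Tη + η R_i(T)² ), and η can then be tuned against T and R_i(T).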
Application

Contextual Bandits:

| Algorithm | REG | Environment |
| --- | --- | --- |
| ILOVETOCONBANDITS (Agarwal et al., 2014) | √T | i.i.d. |
| BISTRO+ (Syrgkanis et al., 2016) | T^{2/3} | hybrid |
Application

Contextual Bandits:

| Algorithm | REG | Environment |
| --- | --- | --- |
| Corral | √T | i.i.d. |
| Corral | T^{3/4} | hybrid |
Outline

1. Introduction
2. Formal Setup
3. Main Results
4. Conclusion and Open Problems
Conclusion

We resolve the problem of creating a master algorithm that performs almost as well as the best base algorithm would if it were run on its own.

- least extreme weighting: Log-Barrier-OMD
- increasing learning rates to learn faster
- applications to many settings

Open problems:

- inherit exactly the regret of the base algorithms, i.e. REG_M ≤ O(poly(M) · R_i(T))?
- dependence on M: from polynomial to logarithmic?