Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality
Kwang-Sung Jun
joint work with Chicheng Zhang
Structured bandits
• Input: arm set 𝒜, hypothesis class ℱ ⊂ {𝒜 → ℝ} ("the set of possible configurations of the mean rewards")
  e.g., linear: 𝒜 = {a_1, …, a_K} ⊂ ℝ^d, ℱ = {a ↦ θ^⊤a : θ ∈ ℝ^d}
• Initialize: the environment chooses f* ∈ ℱ (unknown to the learner)
• For t = 1, …, n:
  • Learner: chooses an arm a_t ∈ 𝒜
  • Environment: generates the reward r_t = f*(a_t) + (zero-mean stochastic noise)
  • Learner: receives r_t
• Goal: minimize the cumulative regret

  $$\mathbb{E}[\mathrm{Reg}_n] = \mathbb{E}\Big[\, n \cdot \max_{a \in \mathcal{A}} f^*(a) \;-\; \sum_{t=1}^{n} f^*(a_t) \Big]$$

• Note: fixed arm set (= non-contextual), realizability f* ∈ ℱ
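To make the protocol concrete, here is a minimal simulation sketch (not from the paper; the uniform-play learner and Gaussian noise are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

K, n = 5, 1000
f_star = np.array([0.9, 0.7, 0.5, 0.3, 0.1])  # f*, unknown to the learner

def pull(a):
    """Environment: r_t = f*(a_t) + zero-mean noise (here Gaussian)."""
    return f_star[a] + rng.normal()

regret = 0.0
for t in range(1, n + 1):
    a_t = int(rng.integers(K))             # placeholder learner: uniform play
    r_t = pull(a_t)                        # learner receives r_t
    regret += f_star.max() - f_star[a_t]   # cumulative pseudo-regret

print(f"pseudo-regret of uniform play over n={n}: {regret:.1f}")
```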
Structured bandits
• Why relevant? Techniques may transfer to RL (e.g., ergodic RL [Ok18]).
• Naive strategy: UCB ⟹ (K/Δ)·log n regret bound (instance-dependent)
  • Scales with the number of arms K.
  • Instead, the complexity of the hypothesis class ℱ should appear.
• The asymptotically optimal regret is well-defined.
  • E.g., linear bandits: c* · log n for some well-defined c* ≪ K/Δ.

The goal of this paper
Achieve asymptotic optimality with improved finite-time regret for any ℱ.
(The worst-case regret is beyond the scope.)

[Ok18] Ok et al., Exploration in Structured Reinforcement Learning, NeurIPS, 2018.
Asymptotic optimality (instance-dependent)
• Optimism in the face of uncertainty (e.g., UCB, Thompson sampling)
  ⟹ optimal asymptotic / worst-case regret in K-armed bandits.
• Linear bandits: optimal worst-case rate = d√n.
  Asymptotically optimal regret? ⟹ No! (AISTATS'17)
[Figure: a 2-d linear bandit with features (sweet, sour) and mean reward = θ_sweet·(sweet) + θ_sour·(sour); two arms have similar features (1, 0) and (0.95, 0.1). "Do they like orange or apple? Maybe have them try lemon and see if they are sensitive to sourness."]
Asymptotic optimality: lower bound
• 𝔼[Reg_n] ≥ c(f*) · log n (asymptotically), where

  $$c(f^*) = \min_{\gamma_1,\dots,\gamma_K \ge 0} \sum_{a=1}^{K} \gamma_a \Delta_a \quad \text{s.t.} \quad \forall g \in \mathcal{C}(f^*):\ \sum_{a=1}^{K} \gamma_a \cdot \mathrm{KL}_\nu\big(f^*(a), g(a)\big) \ge 1, \qquad \gamma_{a^*(f^*)} = 0$$

  with Δ_a = max_{b∈𝒜} f*(b) − f*(a), 𝒞(f*) the set of "competing" hypotheses, and KL_ν the KL divergence under the noise distribution ν.
• γ* = (γ*_1, …, γ*_K) ≥ 0: the solution.
• To be optimal, we must pull arm a about γ*_a · log n times.
• E.g., γ*_lemon = 8, γ*_orange = 0 ⟹ lemon is the informative arm!
• When c(f*) = 0: bounded regret! (except for pathological cases [Lattimore14])

[Lattimore14] Lattimore & Munos, Bounded Regret for Finite-Armed Structured Bandits, 2014.
Existing asymptotically optimal algorithms
• Mostly use forced exploration. [Lattimore+17, Combes+17, Hao+20]
  ⟹ ensures every arm's pull count is an unbounded function of n, such as log(n)/log log(n)
  ⟹ 𝔼[Reg_n] ⪅ c(f*) · log n + K · log(n)/log log(n)
• Issues:
  1. K appears in the regret* ⟹ what if K is exponentially large?
  2. Cannot achieve bounded regret when c(f*) = 0.
• Parallel studies avoid forced exploration, but still depend on K. [Menard+20, Degenne+20]

*Dependence on K can be avoided in special cases (e.g., linear).
Contribution

Research Question
Assume ℱ is finite. Can we design an algorithm that
• enjoys asymptotic optimality,
• adapts to bounded regret whenever possible, and
• does not necessarily depend on K?

Proposed algorithm: CRush Optimism with Pessimism (CROP)
• No forced exploration 😀
• The regret scales not with K but with K_Ψ ≤ K (defined in the paper).
• An interesting log log n term in the regret*

*It's necessary (will be updated in the arXiv version).
Preliminaries

Assumptions
• |ℱ| < ∞
• The noise model: r_t = f*(a_t) + ξ_t, where ξ_t is 1-sub-Gaussian (generalized to σ² in the paper).
• Notation: a*(f) := arg max_{a∈𝒜} f(a),  μ*(f) := f(a*(f))
  • f supports arm a ⟺ a*(f) = a
  • f supports reward v ⟺ μ*(f) = v
• [Assumption] Every f ∈ ℱ has a unique best arm (i.e., |arg max_{a∈𝒜} f(a)| = 1).
Competing hypotheses
• 𝒞(f*) consists of f ∈ ℱ such that
  • (1) f assigns the same reward to the best arm a*(f*),
  • (2) but supports a different arm a*(f) ≠ a*(f*).
• Importance: it's why we get log(n) regret!
[Figure: mean rewards of hypotheses f_1, …, f_6 plotted over arms 1–3 (rewards in {0, .25, .5, .75, 1}); one hypothesis is marked as f*.]
Lower bound revisited
• Assume Gaussian rewards.
• 𝔼[Reg_n] ≥ c(f*) · log n, asymptotically, where

  $$c(f^*) := \min_{\gamma_1,\dots,\gamma_K \ge 0} \sum_{a=1}^{K} \gamma_a \Delta_a \quad \text{s.t.} \quad \forall g \in \mathcal{C}(f^*):\ \sum_{a=1}^{K} \gamma_a \cdot \frac{\big(f^*(a) - g(a)\big)^2}{2} \ge 1, \qquad \gamma_{a^*(f^*)} = 0$$

  with Δ_a = max_{b∈𝒜} f*(b) − f*(a) and 𝒞(f*) the "competing" hypotheses.
• Interpretation: γ_a ln n samples of each a ∈ 𝒜 can distinguish f* from g confidently.
• The program finds arm-pull allocations that (1) eliminate competing hypotheses and (2) are 'reward'-efficient.

Agrawal, Teneketzis, Anantharam. Asymptotically Efficient Adaptive Allocation Schemes for Controlled I.I.D. Processes: Finite Parameter Space, 1989.
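For unit-variance Gaussian noise this program is a linear program in γ, so it can be solved directly. A hedged sketch (the function name `c_lower_bound` and its interface are ours, not the paper's):

```python
import numpy as np
from scipy.optimize import linprog

def c_lower_bound(f_star, competing):
    """Solve c(f*) for unit-variance Gaussian noise, where it is an LP:
    min sum_a gamma_a * Delta_a  s.t. for each competing g,
    sum_a gamma_a * (f*(a)-g(a))^2 / 2 >= 1, gamma >= 0, gamma_{a*(f*)} = 0."""
    f_star = np.asarray(f_star, dtype=float)
    a_star = int(np.argmax(f_star))
    delta = f_star.max() - f_star                    # gaps Delta_a
    # linprog wants A_ub x <= b_ub, so flip the ">= 1" constraints.
    A_ub = -np.array([(f_star - np.asarray(g)) ** 2 / 2 for g in competing])
    b_ub = -np.ones(len(competing))
    bounds = [(0, 0) if a == a_star else (0, None)   # gamma_{a*(f*)} = 0
              for a in range(len(f_star))]
    res = linprog(delta, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.fun, res.x                            # c(f*), gamma*

# Toy instance: g agrees with f* on the best arm (arm 0) but supports arm 1.
c_star, gamma_star = c_lower_bound([1.0, 0.9, 0.0], [[1.0, 1.1, 0.5]])
print(c_star, gamma_star)  # ~5.0 and gamma* ~ [0, 50, 0]
```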
Example: Cheating code
• ε > 0: very small (like 0.0001)
• Λ > 0: not too small (like 0.5)
• The lower bound: Θ((log₂ K)/Λ² · ln n)
• UCB: Θ((K/ε) · ln n)
• Exponential gap in K!

        A1      A2      A3      A4      A5   A6
  f_1   [1]     1−ε     1−ε     1−ε     0    0
  f_2   1−ε     [1]     1−ε     1−ε     0    Λ
  f_3   1−ε     1−ε     [1]     1−ε     Λ    0
  f_4   1−ε     1−ε     1−ε     [1]     Λ    Λ
  f_5   [1+ε]   1       1−ε     1−ε     0    0
  f_6   1       [1+ε]   1−ε     1−ε     0    Λ
  f_7   1−ε     1−ε     [1+ε]   1       Λ    0
  …     …       …       …       …       …    …

  (A1–A4: the K′ base arms, rewards in {1−ε, 1, 1+ε}; A5–A6: the log₂ K′ "cheating" arms, rewards in {0, Λ}; brackets mark each hypothesis's best arm.)
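A sketch of how the first block of this table can be generated (our reading of the construction; the 1+ε hypotheses are omitted, and the bit ordering is an assumption that matches the table above):

```python
import numpy as np

def cheating_code_instance(K_base, eps=1e-4, Lam=0.5):
    """K_base base arms plus log2(K_base) cheating arms whose {0, Lam}
    rewards spell the index of the best base arm in binary, most
    significant bit first."""
    m = int(np.log2(K_base))                  # number of cheating arms
    F = []
    for i in range(K_base):
        base = np.full(K_base, 1.0 - eps)
        base[i] = 1.0                         # best base arm of f_{i+1}
        code = [(i >> b) & 1 for b in reversed(range(m))]
        F.append(np.concatenate([base, Lam * np.array(code, dtype=float)]))
    return np.array(F)

print(cheating_code_instance(4).round(4))     # rows f_1..f_4 of the table
```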
The function classes
• 𝒞(f*): Competing ⟹ not distinguishable using a*(f*), but supports a different arm.
• 𝒟(f*): Docile ⟹ distinguishable using a*(f*).
• ℰ(f*): Equivalent ⟹ supports a*(f*) and the reward μ*(f*).
• [Proposition 2] ℱ = 𝒞(f*) ∪ 𝒟(f*) ∪ ℰ(f*) (disjoint union).
[Figure: ℱ drawn as a partition into 𝒞*, 𝒟*, and ℰ* (with f_1 = f*), next to the mean-reward plot of f_1, …, f_6 over arms 1–3; annotated regret contributions: Θ(log n), Θ(1), and "can be Θ(log log n)".]
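The partition in Proposition 2 is straightforward to compute from the definitions; a minimal sketch (interface is ours):

```python
import numpy as np

def classify(F, f_star_idx):
    """F is a (|F|, K) array of mean-reward vectors; returns index lists
    (C, D, E) of the competing, docile, and equivalent hypotheses
    relative to f* = F[f_star_idx]."""
    f_star = F[f_star_idx]
    a_star = int(np.argmax(f_star))                 # a*(f*)
    C, D, E = [], [], []
    for i, f in enumerate(F):
        if not np.isclose(f[a_star], f_star[a_star]):
            D.append(i)   # docile: distinguishable by pulling a*(f*)
        elif int(np.argmax(f)) != a_star:
            C.append(i)   # competing: same reward at a*(f*), different best arm
        else:
            E.append(i)   # equivalent: supports a*(f*) and mu*(f*)
    return C, D, E

F = np.array([[1.0, 0.9, 0.0],    # f*: best arm 0
              [1.0, 1.1, 0.5],    # agrees at arm 0, supports arm 1 -> competing
              [0.8, 0.7, 0.0]])   # differs at arm 0 -> docile
print(classify(F, 0))             # ([1], [2], [0])
```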
CRush Optimism with Pessimism (CROP)
CROP: Overview
• The confidence set (ERM-based):

  $$L_t(f) := \sum_{s=1}^{t} \big(r_s - f(a_s)\big)^2, \qquad \mathcal{F}_t := \Big\{ f \in \mathcal{F} : L_{t-1}(f) - \min_{g \in \mathcal{F}} L_{t-1}(g) \le \beta_t := \Theta\big(\ln(t\,|\mathcal{F}|)\big) \Big\}$$

  (confidence level: 1 − poly(1/t))
• Four important branches: Exploit, Feasible, Fallback, Conflict.
• Exploit:
  • Does every f ∈ ℱ_t support the same best arm?
  • If yes, pull that arm.
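A sketch of this confidence set (the constant in β_t is unspecified in the talk; `beta_scale` below is our placeholder):

```python
import numpy as np

def confidence_set(F, arms, rewards, t, beta_scale=1.0):
    """Keep every f whose cumulative squared loss L_{t-1}(f) is within
    beta_t = Theta(ln(t|F|)) of the ERM's."""
    a = np.asarray(arms, dtype=int)
    r = np.asarray(rewards, dtype=float)
    L = ((r[None, :] - F[:, a]) ** 2).sum(axis=1)   # L_{t-1}(f) for every f
    beta_t = beta_scale * np.log(t * len(F))
    return np.flatnonzero(L - L.min() <= beta_t)    # indices of F_t

F = np.array([[1.0, 0.9, 0.0], [1.0, 1.1, 0.5], [0.8, 0.7, 0.0]])
print(confidence_set(F, arms=[0, 1, 1], rewards=[1.02, 1.05, 0.95], t=4))
```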
CROP v1
At time t,
• Maintain a confidence set ℱ_t ⊆ ℱ.
• If every f ∈ ℱ_t agrees on the best arm:
  • (Exploit) pull that arm.
• Else: (Feasible)
  • Compute the pessimism: f̲_t = arg min_{f∈ℱ_t} max_{a∈𝒜} f(a) (break ties by the cumulative loss).
  • Compute γ* := the solution of the optimization problem c(f̲_t).
  • (Tracking) Pull a_t = arg min_{a∈𝒜} pull_count_t(a)/γ*_a.

Cf. optimism: f̄_t = arg max_{f∈ℱ_t} max_{a∈𝒜} f(a).
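The pessimism and the tracking rule in code (a sketch; `losses` is the vector of cumulative squared losses over all of ℱ, and the interfaces are ours):

```python
import numpy as np

def pessimism(F, Ft, losses):
    """Pessimism: argmin over f in F_t of mu*(f) = max_a f(a),
    ties broken by cumulative squared loss (smaller wins)."""
    Ft = np.asarray(Ft)
    mu_star = F[Ft].max(axis=1)                       # mu*(f) for f in F_t
    cand = Ft[np.isclose(mu_star, mu_star.min())]     # tied minimizers
    return int(cand[np.argmin(losses[cand])])

def track(pull_count, gamma_star):
    """Tracking rule: pull the arm furthest behind its target allocation
    pull_count(a)/gamma*_a; arms with gamma*_a = 0 are never tracked."""
    ratio = np.where(gamma_star > 0,
                     pull_count / np.maximum(gamma_star, 1e-12), np.inf)
    return int(np.argmin(ratio))
```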
Why pessimism?
• Suppose ℱ_t = {f_1, f_2, f_3}:

        A1    A2    A3    A4    A5
  f_1   1     .99   .98   0     0
  f_2   .98   .99   .98   .25   0
  f_3   .97   .97   .98   .25   .25

• If I knew f*, I could track γ(f*) (= the solution of c(f*)).
• Which f should I track?
• Pessimism: either does the right thing, or eliminates itself.
• Other choices may get stuck (so does ERM).
• Key idea: the lower-bound constraints prescribe how to distinguish f* from those supporting higher rewards.
But we may still get stuck
• Due to docile hypotheses; we must do something else:

        A1    A2    A3    A4      A5
  f_1   1     .99   .98   0       0
  f_2   .98   .99   .98   .25     0
  f_3   .97   .97   1     .25     .25
  f_4   .97   .97   1     .2499   .25

• Define the fallback allocation

  $$\psi(f) := \arg\min_{\gamma \in [0,\infty)^K} \ \Delta_{\max}(f)\,\gamma_{a^*(f)} + \sum_{a \ne a^*(f)} \Delta_a(f)\,\gamma_a$$
  $$\text{s.t.} \quad \forall g \in \mathcal{C}(f) \cup \mathcal{D}(f) \text{ with } \mu^*(g) \ge \mu^*(f):\ \sum_a \gamma_a \frac{\big(f(a) - g(a)\big)^2}{2} \ge 1, \qquad \gamma \ge \max\{\gamma(f), \phi(f)\}$$

• Unlike c(f), the constraints include the docile hypotheses whose best rewards are higher than μ*(f).
When to fall back to ψ(f)
• ℬ_t := {(a*(f), μ*(f)) : f ∈ ℱ_t} ⟹ induces a partition of ℱ_t.
• Optimistic set: ℱ̄_t = the block containing the optimism.
• Pessimistic set: ℱ̲_t = the block containing the pessimism.
• Condition: use γ(f̲_t) if

  $$\forall f \in \bar{\mathcal{F}}_t: \quad \sum_a \gamma_a(\underline{f}_t) \cdot \frac{\big(f(a) - \underline{f}_t(a)\big)^2}{2} \ge 1;$$

  otherwise, fall back to ψ(f̲_t).
• Then, we never get stuck.
• Crush optimism with pessimism (or end up crushing pessimism itself).

[Figure: mean rewards over arms, showing a partition of ℱ_t into the optimistic set and the pessimistic set.]
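The feasibility test above is a one-liner given the reward table; a sketch (names ours; `F_opt` stacks the optimistic set's reward vectors, `f_pess` is f̲_t's reward vector):

```python
import numpy as np

def eliminates_optimistic_set(gamma, f_pess, F_opt):
    """Does gamma = gamma(f_pess) carry >= 1 unit of (unit-variance
    Gaussian) information against every f in the optimistic set?"""
    info = (gamma[None, :] * (F_opt - f_pess[None, :]) ** 2 / 2).sum(axis=1)
    return bool(np.all(info >= 1.0))
```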
CROP v2
At time t,
• Maintain a confidence set ℱ_t ⊆ ℱ.
• If every f ∈ ℱ_t agrees on the best arm:
  • (Exploit) pull that arm.
• Else if γ(f̲_t) is sufficient to eliminate the optimistic set ℱ̄_t:
  • (Feasible) π_t = γ(f̲_t).
• Else:
  • (Fallback) π_t = ψ(f̲_t).
• (Tracking) Pull a_t = arg min_{a∈𝒜} pull_count_t(a)/π_{t,a}.
Still, we may not be asymptotically optimal
• Issue: which informative arm to pull?

        A1    A2    A3    A4    A5
  f_1   1     .99   .98   0     0
  f_2   .98   .99   .98   .25   0
  f_3   .98   .99   .98   .25   .50

• If we follow f_2:
  • when f* = f_2, it's fine;
  • when f* = f_3, we have suboptimal_const · log n regret (and it can be made arbitrarily suboptimal).
• Intuition: to guard against Θ(n) regret, we aim to be (1 − 1/n)-confident;
  to guard against Θ(log n) regret (w/ a suboptimal constant), we aim to be (1 − 1/□)-confident.
• Solution: construct a (1 − 1/log n)-confident set.
• ℱ̇" = 𝑓 ∈ iℱ": 𝐿"8( 𝑓 − 𝐿"8( ̅𝑓" ≤ �̇�" = 𝑂 log ℱ log 𝑡• We have ℱ̇" ⊆ iℱ" ⊂ ℱ"
• Ask: Compute 𝛾 𝑓 for every 𝑓 ∈ ℱ̇". Do they all agree, up to constant scaling?• YES: CROP v2• NO: set 𝜋" = 𝜙 ̅𝑓"
We build a refined confidence set
22
s. t. 𝛾+∗ 1 = 0
𝜙 𝑓 = arg min&!,…,&" )*
3+,!
"
𝛾+ ⋅ Δ+
distinguish those that give conflicting advice!
∀𝑔 ∈ ℰ 𝑓 : 𝛾 𝑔 ∝ 𝛾 𝑓 , 3+,!
"
𝛾+ ⋅𝑓 𝑎 − 𝑔 𝑎 2
2 ≥ 1
confidence level: 1 − poly !?BG 7
CROP v3 (final)
At time t,
• Maintain a confidence set ℱ_t ⊆ ℱ.
• If every f ∈ ℱ_t agrees on the best arm:
  • (Exploit) pull that arm.
• Else if ∃f, g ∈ ℱ̇_t such that γ(f) and γ(g) are not proportional to each other:
  • (Conflict) π_t = φ(f̲_t).
• Else if γ(f̲_t) is sufficient to eliminate the optimistic set ℱ̄_t:
  • (Feasible) π_t = γ(f̲_t).
• Else:
  • (Fallback) π_t = ψ(f̲_t).
• (Tracking) Pull a_t = arg min_{a∈𝒜} pull_count_t(a)/π_{t,a}.
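Putting the four branches together, here is one step of CROP v3 as a branch skeleton. This is a sketch under our own interfaces, not the paper's implementation: `F` is the (|ℱ|, K) reward table, `Ft`/`Fdot` are index arrays for ℱ_t and ℱ̇_t, `f_pess` indexes the pessimism, and `gamma`, `phi`, `psi` are solvers for the three allocation programs (e.g., LPs as in `c_lower_bound` above):

```python
import numpy as np

def proportional(g1, g2, tol=1e-9):
    """Do two allocations agree up to constant scaling?"""
    return np.allclose(g1 / max(g1.sum(), tol), g2 / max(g2.sum(), tol),
                       atol=1e-6)

def crop_v3_step(F, Ft, Fdot, f_pess, pull_count, gamma, phi, psi):
    """One step of CROP v3: Exploit / Conflict / Feasible / Fallback,
    then Tracking.  Returns (arm to pull, allocation or None)."""
    best = F[Ft].argmax(axis=1)
    if np.all(best == best[0]):                         # Exploit
        return int(best[0]), None
    gams = [gamma(F[i]) for i in Fdot]                  # gamma(f), f in F-dot_t
    if any(not proportional(g, gams[0]) for g in gams[1:]):
        pi = phi(F[f_pess])                             # Conflict
    else:
        # Optimistic set: the partition block of argmax_f mu*(f) in F_t
        f_opt = Ft[int(np.argmax(F[Ft].max(axis=1)))]
        block = [i for i in Ft if F[i].argmax() == F[f_opt].argmax()
                 and np.isclose(F[i].max(), F[f_opt].max())]
        g = gamma(F[f_pess])
        info = [(g * (F[i] - F[f_pess]) ** 2 / 2).sum() for i in block]
        pi = g if min(info) >= 1.0 else psi(F[f_pess])  # Feasible / Fallback
    ratio = np.where(pi > 0, pull_count / np.maximum(pi, 1e-12), np.inf)
    return int(np.argmin(ratio)), pi                    # Tracking
```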
Main results
Main result
• Effective number of arms: K_Ψ = the number of arms a with ψ_a(f) ≠ 0 for some f ∈ ℱ.
• [Theorem 1] Anytime regret of CROP:

  $$\mathbb{E}[\mathrm{Reg}_n] = O\big( P_1 \ln n + P_2 \ln\ln n + P_3 \ln|\mathcal{F}| + K_\Psi \big)$$

  where

  $$P_1 = \sum_a \Delta_a \cdot \gamma_a(f^*) \ \text{(from Feasible)}, \quad P_2 = \sum_a \Delta_a \cdot \max_{f \in \mathcal{E}(f^*)} \phi_a(f) \ \text{(from Conflict)}, \quad P_3 = \sum_a \Delta_a \cdot \max_{f \in \mathcal{F}} \psi_a(f) \ \text{(from Fallback, mainly)}.$$

• [Corollary 1] If P₁ = 0, then P₂ = 0. Thus, bounded regret.
Example: Cheating code
• K_Ψ ≈ log₂ K
• CROP: (ln K)/Λ² · ln n
• Forced exploration: (ln K)/Λ² · ln n + K
• If Λ = .5, K = 2^d, and n = K: d² vs 2^d ⟹ exponential improvement!

        A1    A2    A3    A4    A5   A6
  f_1   1     1−ε   1−ε   1−ε   0    0
  f_2   1−ε   1     1−ε   1−ε   0    Λ
  f_3   1−ε   1−ε   1     1−ε   Λ    0
  f_4   1−ε   1−ε   1−ε   1     Λ    Λ
  f_5   1     1+ε   1−ε   1−ε   0    0
  f_6   1     1−ε   1+ε   1−ε   0    0
  f_7   1     1−ε   1−ε   1+ε   0    0
  …     …     …     …     …     …    …

  (A1–A4: the K′ base arms; A5–A6: the log₂ K′ cheating arms.)
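The d² vs 2^d comparison is just the following arithmetic under the stated parameters:

```latex
% With \Lambda = 1/2, K = 2^d, and n = K (so \ln n = d \ln 2):
\underbrace{\frac{\ln K}{\Lambda^2}\ln n}_{\text{CROP}}
  = 4\,(d\ln 2)^2 = \Theta(d^2),
\qquad
\underbrace{\frac{\ln K}{\Lambda^2}\ln n + K}_{\text{forced exploration}}
  = \Theta(d^2) + 2^d = \Theta(2^d).
```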
Lower bound
• We pull some uninformative arm log(log n) times. Is it necessary?
• Existing lower bounds say: it can be anywhere between Θ(1) and o(log n).
• Question: say an algorithm A is asymptotically optimal. Can it pull all uninformative arms O(1) times?
• [Theorem 2] The answer is NO. There exists an ℱ for which there is an uninformative arm a with
  𝔼[pull_count_n(a)] ≥ c · ln ln n.
  (conditions are more relaxed in the paper; will be updated in the arXiv version in a few days)
The risk of naively mimicking the oracle
• The oracle knows f*. At time t:
  • If ∀a, pull_count_t(a) ≥ γ_a(f*) · ln t:
    • (Exploit) pull a*(f*).
  • Else:
    • (Explore) track γ(f*).
• Most existing algorithms try to mimic the oracle!
  • E.g., replace f* with the ERM + forced exploration.
• CROP is not an exception.
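The oracle's pseudocode above in a few lines (a sketch; the interface is ours, and t ≥ 1 is assumed):

```python
import numpy as np

def oracle_step(t, pull_count, f_star, gamma_star):
    """Oracle that knows f*: exploit once every arm has met its
    gamma_a(f*) ln t allocation, otherwise keep tracking gamma(f*)."""
    if np.all(pull_count >= gamma_star * np.log(t)):
        return int(np.argmax(f_star))                  # Exploit: pull a*(f*)
    ratio = np.where(gamma_star > 0,
                     pull_count / np.maximum(gamma_star, 1e-12), np.inf)
    return int(np.argmin(ratio))                       # Explore: track gamma(f*)
```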
The risk of naively mimicking the oracle
• Regret of UCB: O(min{(K/ε) ln n, εn}).
• Regret of the oracle: O(min{(ln K)/Λ² · ln n, n}) ⟹ linear worst-case regret!
• Intuitively, if n is small, pulling an ε-optimal arm is great!

        A1    A2    A3    A4    A5   A6
  f_1   1     1−ε   1−ε   1−ε   0    0
  f_2   1−ε   1     1−ε   1−ε   0    Λ
  f_3   1−ε   1−ε   1     1−ε   Λ    0
  f_4   1−ε   1−ε   1−ε   1     Λ    Λ
  f_5   1     1+ε   1−ε   1−ε   0    0
  f_6   1     1−ε   1+ε   1−ε   0    0
  f_7   1     1−ε   1−ε   1+ε   0    0
  …     …     …     …     …     …    …

  (A1–A4: the K′ base arms; A5–A6: the log₂ K′ cheating arms.)

[Figure: regret vs. time for the oracle and UCB; the oracle's regret grows faster than UCB's for small n.]
It may not be the end of optimism
• Can we achieve the best of both worlds? I.e., O(min{(ln K)/Λ² · ln n, εn})?
• Yes, if we know ε: run UCB first, then run an asymptotically optimal algorithm.

[Figure: regret vs. time; run UCB early, then switch to the asymptotically optimal (AO) algorithm.]
Summary
• CROP: asymptotically optimal, adapts to bounded regret, with improved finite-time regret.
• Provides considerations for avoiding forced exploration.
• Reveals the danger of naively mimicking the oracle.
• What next?
  • The worst-case regret simultaneously?
  • Can we use pessimism for linear bandits?
  • Can we even avoid solving the optimization problem?
  • Lower bounds for finite-time instance-dependent regret?
  • No explicit specification of confidence set construction/width?