Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality
Kwang-Sung Jun
joint work with Chicheng Zhang
Structured bandits
• Input: arm set 𝒜, hypothesis class ℱ ⊂ {𝒜 → ℝ} ("the set of possible configurations of the mean rewards")
  e.g., linear: 𝒜 = {a_1, …, a_K} ⊂ ℝ^d, ℱ = {a ↦ θ^⊤a : θ ∈ ℝ^d}
• Initialize: the environment chooses f* ∈ ℱ (unknown to the learner)
• For t = 1, …, n:
  • Learner: chooses an arm a_t ∈ 𝒜
  • Environment: generates the reward r_t = f*(a_t) + (zero-mean stochastic noise)
  • Learner: receives r_t
• Goal: minimize the cumulative regret

  $$\mathbb{E}[\mathrm{Reg}_n] = \mathbb{E}\Big[\, n \cdot \max_{a \in \mathcal{A}} f^*(a) \;-\; \sum_{t=1}^{n} f^*(a_t) \Big]$$

• Note: fixed arm set (= non-contextual), realizability f* ∈ ℱ
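To make the protocol concrete, here is a minimal simulation sketch (not from the paper; the uniform-play learner and Gaussian noise are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

K, n = 5, 1000
f_star = np.array([0.9, 0.7, 0.5, 0.3, 0.1])  # f*, unknown to the learner

def pull(a):
    """Environment: r_t = f*(a_t) + zero-mean noise (here Gaussian)."""
    return f_star[a] + rng.normal()

regret = 0.0
for t in range(1, n + 1):
    a_t = int(rng.integers(K))             # placeholder learner: uniform play
    r_t = pull(a_t)                        # learner receives r_t
    regret += f_star.max() - f_star[a_t]   # cumulative pseudo-regret

print(f"pseudo-regret of uniform play over n={n}: {regret:.1f}")
```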
Structured bandits
• Why relevant? Techniques may transfer to RL (e.g., ergodic RL [Ok18]).
• Naive strategy: UCB ⟹ (K/Δ)·log n regret bound (instance-dependent)
  • Scales with the number of arms K.
  • Instead, the complexity of the hypothesis class ℱ should appear.
• The asymptotically optimal regret is well-defined.
  • E.g., linear bandits: c* · log n for some well-defined c* ≪ K/Δ.

The goal of this paper
Achieve asymptotic optimality with improved finite-time regret for any ℱ.
(The worst-case regret is beyond the scope.)

[Ok18] Ok et al., Exploration in Structured Reinforcement Learning, NeurIPS, 2018.
Asymptotic optimality (instance-dependent)
• Optimism in the face of uncertainty (e.g., UCB, Thompson sampling)
  ⟹ optimal asymptotic / worst-case regret in K-armed bandits.
• Linear bandits: optimal worst-case rate = d√n.
  Asymptotically optimal regret? ⟹ No! (AISTATS'17)
[Figure: a 2-d linear bandit with features (sweet, sour) and mean reward = θ_sweet·(sweet) + θ_sour·(sour); two arms have similar features (1, 0) and (0.95, 0.1). "Do they like orange or apple? Maybe have them try lemon and see if they are sensitive to sourness."]
Asymptotic optimality: lower bound
• 𝔼[Reg_n] ≥ c(f*) · log n (asymptotically), where

  $$c(f^*) = \min_{\gamma_1,\dots,\gamma_K \ge 0} \sum_{a=1}^{K} \gamma_a \Delta_a \quad \text{s.t.} \quad \forall g \in \mathcal{C}(f^*):\ \sum_{a=1}^{K} \gamma_a \cdot \mathrm{KL}_\nu\big(f^*(a), g(a)\big) \ge 1, \qquad \gamma_{a^*(f^*)} = 0$$

  with Δ_a = max_{b∈𝒜} f*(b) − f*(a), 𝒞(f*) the set of "competing" hypotheses, and KL_ν the KL divergence under the noise distribution ν.
• γ* = (γ*_1, …, γ*_K) ≥ 0: the solution.
• To be optimal, we must pull arm a about γ*_a · log n times.
• E.g., γ*_lemon = 8, γ*_orange = 0 ⟹ lemon is the informative arm!
• When c(f*) = 0: bounded regret! (except for pathological cases [Lattimore14])

[Lattimore14] Lattimore & Munos, Bounded Regret for Finite-Armed Structured Bandits, 2014.
Existing asymptotically optimal algorithms
• Mostly use forced exploration. [Lattimore+17, Combes+17, Hao+20]
  ⟹ ensures every arm's pull count is an unbounded function of n, such as log(n)/log log(n)
  ⟹ 𝔼[Reg_n] ⪅ c(f*) · log n + K · log(n)/log log(n)
• Issues:
  1. K appears in the regret* ⟹ what if K is exponentially large?
  2. Cannot achieve bounded regret when c(f*) = 0.
• Parallel studies avoid forced exploration, but still depend on K. [Menard+20, Degenne+20]

*Dependence on K can be avoided in special cases (e.g., linear).
Contribution

Research Question
Assume ℱ is finite. Can we design an algorithm that
• enjoys asymptotic optimality,
• adapts to bounded regret whenever possible, and
• does not necessarily depend on K?

Proposed algorithm: CRush Optimism with Pessimism (CROP)
• No forced exploration 😀
• The regret scales not with K but with K_Ψ ≤ K (defined in the paper).
• An interesting log log n term in the regret*

*It's necessary (will be updated in the arXiv version).
Preliminaries

Assumptions
• |ℱ| < ∞
• The noise model: r_t = f*(a_t) + ξ_t, where ξ_t is 1-sub-Gaussian (generalized to σ² in the paper).
• Notation: a*(f) := arg max_{a∈𝒜} f(a),  μ*(f) := f(a*(f))
  • f supports arm a ⟺ a*(f) = a
  • f supports reward v ⟺ μ*(f) = v
• [Assumption] Every f ∈ ℱ has a unique best arm (i.e., |arg max_{a∈𝒜} f(a)| = 1).
Competing hypotheses
• 𝒞(f*) consists of f ∈ ℱ such that
  • (1) f assigns the same reward to the best arm a*(f*),
  • (2) but supports a different arm a*(f) ≠ a*(f*).
• Importance: it's why we get log(n) regret!
[Figure: mean rewards of hypotheses f_1, …, f_6 plotted over arms 1–3 (rewards in {0, .25, .5, .75, 1}); one hypothesis is marked as f*.]
Lower bound revisited
• Assume Gaussian rewards.
• 𝔼[Reg_n] ≥ c(f*) · log n, asymptotically, where

  $$c(f^*) := \min_{\gamma_1,\dots,\gamma_K \ge 0} \sum_{a=1}^{K} \gamma_a \Delta_a \quad \text{s.t.} \quad \forall g \in \mathcal{C}(f^*):\ \sum_{a=1}^{K} \gamma_a \cdot \frac{\big(f^*(a) - g(a)\big)^2}{2} \ge 1, \qquad \gamma_{a^*(f^*)} = 0$$

  with Δ_a = max_{b∈𝒜} f*(b) − f*(a) and 𝒞(f*) the "competing" hypotheses.
• Interpretation: γ_a ln n samples of each a ∈ 𝒜 can distinguish f* from g confidently.
• The program finds arm-pull allocations that (1) eliminate competing hypotheses and (2) are 'reward'-efficient.

Agrawal, Teneketzis, Anantharam. Asymptotically Efficient Adaptive Allocation Schemes for Controlled I.I.D. Processes: Finite Parameter Space, 1989.
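For unit-variance Gaussian noise this program is a linear program in γ, so it can be solved directly. A hedged sketch (the function name `c_lower_bound` and its interface are ours, not the paper's):

```python
import numpy as np
from scipy.optimize import linprog

def c_lower_bound(f_star, competing):
    """Solve c(f*) for unit-variance Gaussian noise, where it is an LP:
    min sum_a gamma_a * Delta_a  s.t. for each competing g,
    sum_a gamma_a * (f*(a)-g(a))^2 / 2 >= 1, gamma >= 0, gamma_{a*(f*)} = 0."""
    f_star = np.asarray(f_star, dtype=float)
    a_star = int(np.argmax(f_star))
    delta = f_star.max() - f_star                    # gaps Delta_a
    # linprog wants A_ub x <= b_ub, so flip the ">= 1" constraints.
    A_ub = -np.array([(f_star - np.asarray(g)) ** 2 / 2 for g in competing])
    b_ub = -np.ones(len(competing))
    bounds = [(0, 0) if a == a_star else (0, None)   # gamma_{a*(f*)} = 0
              for a in range(len(f_star))]
    res = linprog(delta, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.fun, res.x                            # c(f*), gamma*

# Toy instance: g agrees with f* on the best arm (arm 0) but supports arm 1.
c_star, gamma_star = c_lower_bound([1.0, 0.9, 0.0], [[1.0, 1.1, 0.5]])
print(c_star, gamma_star)  # ~5.0 and gamma* ~ [0, 50, 0]
```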
Example: Cheating code
• ε > 0: very small (like 0.0001)
• Λ > 0: not too small (like 0.5)
• The lower bound: Θ((log₂ K)/Λ² · ln n)
• UCB: Θ((K/ε) · ln n)
• Exponential gap in K!

        A1      A2      A3      A4      A5   A6
  f_1   [1]     1−ε     1−ε     1−ε     0    0
  f_2   1−ε     [1]     1−ε     1−ε     0    Λ
  f_3   1−ε     1−ε     [1]     1−ε     Λ    0
  f_4   1−ε     1−ε     1−ε     [1]     Λ    Λ
  f_5   [1+ε]   1       1−ε     1−ε     0    0
  f_6   1       [1+ε]   1−ε     1−ε     0    Λ
  f_7   1−ε     1−ε     [1+ε]   1       Λ    0
  …     …       …       …       …       …    …

  (A1–A4: the K′ base arms, rewards in {1−ε, 1, 1+ε}; A5–A6: the log₂ K′ "cheating" arms, rewards in {0, Λ}; brackets mark each hypothesis's best arm.)
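A sketch of how the first block of this table can be generated (our reading of the construction; the 1+ε hypotheses are omitted, and the bit ordering is an assumption that matches the table above):

```python
import numpy as np

def cheating_code_instance(K_base, eps=1e-4, Lam=0.5):
    """K_base base arms plus log2(K_base) cheating arms whose {0, Lam}
    rewards spell the index of the best base arm in binary, most
    significant bit first."""
    m = int(np.log2(K_base))                  # number of cheating arms
    F = []
    for i in range(K_base):
        base = np.full(K_base, 1.0 - eps)
        base[i] = 1.0                         # best base arm of f_{i+1}
        code = [(i >> b) & 1 for b in reversed(range(m))]
        F.append(np.concatenate([base, Lam * np.array(code, dtype=float)]))
    return np.array(F)

print(cheating_code_instance(4).round(4))     # rows f_1..f_4 of the table
```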
The function classes
• 𝒞(f*): Competing ⟹ not distinguishable using a*(f*), but supports a different arm.
• 𝒟(f*): Docile ⟹ distinguishable using a*(f*).
• ℰ(f*): Equivalent ⟹ supports a*(f*) and the reward μ*(f*).
• [Proposition 2] ℱ = 𝒞(f*) ∪ 𝒟(f*) ∪ ℰ(f*) (disjoint union).
[Figure: ℱ drawn as a partition into 𝒞*, 𝒟*, and ℰ* (with f_1 = f*), next to the mean-reward plot of f_1, …, f_6 over arms 1–3; annotated regret contributions: Θ(log n), Θ(1), and "can be Θ(log log n)".]
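The partition in Proposition 2 is straightforward to compute from the definitions; a minimal sketch (interface is ours):

```python
import numpy as np

def classify(F, f_star_idx):
    """F is a (|F|, K) array of mean-reward vectors; returns index lists
    (C, D, E) of the competing, docile, and equivalent hypotheses
    relative to f* = F[f_star_idx]."""
    f_star = F[f_star_idx]
    a_star = int(np.argmax(f_star))                 # a*(f*)
    C, D, E = [], [], []
    for i, f in enumerate(F):
        if not np.isclose(f[a_star], f_star[a_star]):
            D.append(i)   # docile: distinguishable by pulling a*(f*)
        elif int(np.argmax(f)) != a_star:
            C.append(i)   # competing: same reward at a*(f*), different best arm
        else:
            E.append(i)   # equivalent: supports a*(f*) and mu*(f*)
    return C, D, E

F = np.array([[1.0, 0.9, 0.0],    # f*: best arm 0
              [1.0, 1.1, 0.5],    # agrees at arm 0, supports arm 1 -> competing
              [0.8, 0.7, 0.0]])   # differs at arm 0 -> docile
print(classify(F, 0))             # ([1], [2], [0])
```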
CRush Optimism with Pessimism (CROP)
CROP: Overview
• The confidence set (ERM-based):

  $$L_t(f) := \sum_{s=1}^{t} \big(r_s - f(a_s)\big)^2, \qquad \mathcal{F}_t := \Big\{ f \in \mathcal{F} : L_{t-1}(f) - \min_{g \in \mathcal{F}} L_{t-1}(g) \le \beta_t := \Theta\big(\ln(t\,|\mathcal{F}|)\big) \Big\}$$

  (confidence level: 1 − poly(1/t))
• Four important branches: Exploit, Feasible, Fallback, Conflict.
• Exploit:
  • Does every f ∈ ℱ_t support the same best arm?
  • If yes, pull that arm.
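A sketch of this confidence set (the constant in β_t is unspecified in the talk; `beta_scale` below is our placeholder):

```python
import numpy as np

def confidence_set(F, arms, rewards, t, beta_scale=1.0):
    """Keep every f whose cumulative squared loss L_{t-1}(f) is within
    beta_t = Theta(ln(t|F|)) of the ERM's."""
    a = np.asarray(arms, dtype=int)
    r = np.asarray(rewards, dtype=float)
    L = ((r[None, :] - F[:, a]) ** 2).sum(axis=1)   # L_{t-1}(f) for every f
    beta_t = beta_scale * np.log(t * len(F))
    return np.flatnonzero(L - L.min() <= beta_t)    # indices of F_t

F = np.array([[1.0, 0.9, 0.0], [1.0, 1.1, 0.5], [0.8, 0.7, 0.0]])
print(confidence_set(F, arms=[0, 1, 1], rewards=[1.02, 1.05, 0.95], t=4))
```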
CROP v1
At time t,
• Maintain a confidence set ℱ_t ⊆ ℱ.
• If every f ∈ ℱ_t agrees on the best arm:
  • (Exploit) pull that arm.
• Else: (Feasible)
  • Compute the pessimism: f̲_t = arg min_{f∈ℱ_t} max_{a∈𝒜} f(a) (break ties by the cumulative loss).
  • Compute γ* := the solution of the optimization problem c(f̲_t).
  • (Tracking) Pull a_t = arg min_{a∈𝒜} pull_count_t(a)/γ*_a.

Cf. optimism: f̄_t = arg max_{f∈ℱ_t} max_{a∈𝒜} f(a).
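The pessimism and the tracking rule in code (a sketch; `losses` is the vector of cumulative squared losses over all of ℱ, and the interfaces are ours):

```python
import numpy as np

def pessimism(F, Ft, losses):
    """Pessimism: argmin over f in F_t of mu*(f) = max_a f(a),
    ties broken by cumulative squared loss (smaller wins)."""
    Ft = np.asarray(Ft)
    mu_star = F[Ft].max(axis=1)                       # mu*(f) for f in F_t
    cand = Ft[np.isclose(mu_star, mu_star.min())]     # tied minimizers
    return int(cand[np.argmin(losses[cand])])

def track(pull_count, gamma_star):
    """Tracking rule: pull the arm furthest behind its target allocation
    pull_count(a)/gamma*_a; arms with gamma*_a = 0 are never tracked."""
    ratio = np.where(gamma_star > 0,
                     pull_count / np.maximum(gamma_star, 1e-12), np.inf)
    return int(np.argmin(ratio))
```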
Why pessimism?
• Suppose ℱ_t = {f_1, f_2, f_3}:

        A1    A2    A3    A4    A5
  f_1   1     .99   .98   0     0
  f_2   .98   .99   .98   .25   0
  f_3   .97   .97   .98   .25   .25

• If I knew f*, I could track γ(f*) (= the solution of c(f*)).
• Which f should I track?
• Pessimism: either does the right thing, or eliminates itself.
• Other choices may get stuck (so does ERM).
• Key idea: the lower-bound constraints prescribe how to distinguish f* from those supporting higher rewards.
But we may still get stuck
• Due to docile hypotheses; we must do something else:

        A1    A2    A3    A4      A5
  f_1   1     .99   .98   0       0
  f_2   .98   .99   .98   .25     0
  f_3   .97   .97   1     .25     .25
  f_4   .97   .97   1     .2499   .25

• Define the fallback allocation

  $$\psi(f) := \arg\min_{\gamma \in [0,\infty)^K} \ \Delta_{\max}(f)\,\gamma_{a^*(f)} + \sum_{a \ne a^*(f)} \Delta_a(f)\,\gamma_a$$
  $$\text{s.t.} \quad \forall g \in \mathcal{C}(f) \cup \mathcal{D}(f) \text{ with } \mu^*(g) \ge \mu^*(f):\ \sum_a \gamma_a \frac{\big(f(a) - g(a)\big)^2}{2} \ge 1, \qquad \gamma \ge \max\{\gamma(f), \phi(f)\}$$

• Unlike c(f), the constraints include the docile hypotheses whose best rewards are higher than μ*(f).
When to fall back to ψ(f)
• ℬ_t := {(a*(f), μ*(f)) : f ∈ ℱ_t} ⟹ induces a partition of ℱ_t.
• Optimistic set: ℱ̄_t = the block containing the optimism.
• Pessimistic set: ℱ̲_t = the block containing the pessimism.
• Condition: use γ(f̲_t) if

  $$\forall f \in \bar{\mathcal{F}}_t: \quad \sum_a \gamma_a(\underline{f}_t) \cdot \frac{\big(f(a) - \underline{f}_t(a)\big)^2}{2} \ge 1;$$

  otherwise, fall back to ψ(f̲_t).
• Then, we never get stuck.
• Crush optimism with pessimism (or end up crushing pessimism itself).

[Figure: mean rewards over arms, showing a partition of ℱ_t into the optimistic set and the pessimistic set.]
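The feasibility test above is a one-liner given the reward table; a sketch (names ours; `F_opt` stacks the optimistic set's reward vectors, `f_pess` is f̲_t's reward vector):

```python
import numpy as np

def eliminates_optimistic_set(gamma, f_pess, F_opt):
    """Does gamma = gamma(f_pess) carry >= 1 unit of (unit-variance
    Gaussian) information against every f in the optimistic set?"""
    info = (gamma[None, :] * (F_opt - f_pess[None, :]) ** 2 / 2).sum(axis=1)
    return bool(np.all(info >= 1.0))
```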
CROP v2
At time t,
• Maintain a confidence set ℱ_t ⊆ ℱ.
• If every f ∈ ℱ_t agrees on the best arm:
  • (Exploit) pull that arm.
• Else if γ(f̲_t) is sufficient to eliminate the optimistic set ℱ̄_t:
  • (Feasible) π_t = γ(f̲_t).
• Else:
  • (Fallback) π_t = ψ(f̲_t).
• (Tracking) Pull a_t = arg min_{a∈𝒜} pull_count_t(a)/π_{t,a}.
Still, we may not be asymptotically optimal
• Issue: which informative arm to pull?

        A1    A2    A3    A4    A5
  f_1   1     .99   .98   0     0
  f_2   .98   .99   .98   .25   0
  f_3   .98   .99   .98   .25   .50

• If we follow f_2:
  • when f* = f_2, it's fine;
  • when f* = f_3, we have suboptimal_const · log n regret (and it can be made arbitrarily suboptimal).
• Intuition: to guard against Θ(n) regret, we aim to be (1 − 1/n)-confident;
  to guard against Θ(log n) regret (w/ a suboptimal constant), we aim to be (1 − 1/□)-confident.
• Solution: construct a (1 − 1/log n)-confident set.
• ℱ̇" = 𝑓 ∈ iℱ": 𝐿"8( 𝑓 − 𝐿"8( ̅𝑓" ≤ �̇�" = 𝑂 log ℱ log 𝑡• We have ℱ̇" ⊆ iℱ" ⊂ ℱ"
• Ask: Compute 𝛾 𝑓 for every 𝑓 ∈ ℱ̇". Do they all agree, up to constant scaling?• YES: CROP v2• NO: set 𝜋" = 𝜙 ̅𝑓"
We build a refined confidence set
22
s. t. 𝛾+∗ 1 = 0
𝜙 𝑓 = arg min&!,…,&" )*
3+,!
"
𝛾+ ⋅ Δ+
distinguish those that give conflicting advice!
∀𝑔 ∈ ℰ 𝑓 : 𝛾 𝑔 ∝ 𝛾 𝑓 , 3+,!
"
𝛾+ ⋅𝑓 𝑎 − 𝑔 𝑎 2
2 ≥ 1
confidence level: 1 − poly !?BG 7
CROP v3 (final)
At time t,
• Maintain a confidence set ℱ_t ⊆ ℱ.
• If every f ∈ ℱ_t agrees on the best arm:
  • (Exploit) pull that arm.
• Else if ∃f, g ∈ ℱ̇_t such that γ(f) and γ(g) are not proportional to each other:
  • (Conflict) π_t = φ(f̲_t).
• Else if γ(f̲_t) is sufficient to eliminate the optimistic set ℱ̄_t:
  • (Feasible) π_t = γ(f̲_t).
• Else:
  • (Fallback) π_t = ψ(f̲_t).
• (Tracking) Pull a_t = arg min_{a∈𝒜} pull_count_t(a)/π_{t,a}.
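Putting the four branches together, here is one step of CROP v3 as a branch skeleton. This is a sketch under our own interfaces, not the paper's implementation: `F` is the (|ℱ|, K) reward table, `Ft`/`Fdot` are index arrays for ℱ_t and ℱ̇_t, `f_pess` indexes the pessimism, and `gamma`, `phi`, `psi` are solvers for the three allocation programs (e.g., LPs as in `c_lower_bound` above):

```python
import numpy as np

def proportional(g1, g2, tol=1e-9):
    """Do two allocations agree up to constant scaling?"""
    return np.allclose(g1 / max(g1.sum(), tol), g2 / max(g2.sum(), tol),
                       atol=1e-6)

def crop_v3_step(F, Ft, Fdot, f_pess, pull_count, gamma, phi, psi):
    """One step of CROP v3: Exploit / Conflict / Feasible / Fallback,
    then Tracking.  Returns (arm to pull, allocation or None)."""
    best = F[Ft].argmax(axis=1)
    if np.all(best == best[0]):                         # Exploit
        return int(best[0]), None
    gams = [gamma(F[i]) for i in Fdot]                  # gamma(f), f in F-dot_t
    if any(not proportional(g, gams[0]) for g in gams[1:]):
        pi = phi(F[f_pess])                             # Conflict
    else:
        # Optimistic set: the partition block of argmax_f mu*(f) in F_t
        f_opt = Ft[int(np.argmax(F[Ft].max(axis=1)))]
        block = [i for i in Ft if F[i].argmax() == F[f_opt].argmax()
                 and np.isclose(F[i].max(), F[f_opt].max())]
        g = gamma(F[f_pess])
        info = [(g * (F[i] - F[f_pess]) ** 2 / 2).sum() for i in block]
        pi = g if min(info) >= 1.0 else psi(F[f_pess])  # Feasible / Fallback
    ratio = np.where(pi > 0, pull_count / np.maximum(pi, 1e-12), np.inf)
    return int(np.argmin(ratio)), pi                    # Tracking
```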
Main results
Main result
• Effective number of arms: K_Ψ = the number of arms a with ψ_a(f) ≠ 0 for some f ∈ ℱ.
• [Theorem 1] Anytime regret of CROP:

  $$\mathbb{E}[\mathrm{Reg}_n] = O\big( P_1 \ln n + P_2 \ln\ln n + P_3 \ln|\mathcal{F}| + K_\Psi \big)$$

  where

  $$P_1 = \sum_a \Delta_a \cdot \gamma_a(f^*) \ \text{(from Feasible)}, \quad P_2 = \sum_a \Delta_a \cdot \max_{f \in \mathcal{E}(f^*)} \phi_a(f) \ \text{(from Conflict)}, \quad P_3 = \sum_a \Delta_a \cdot \max_{f \in \mathcal{F}} \psi_a(f) \ \text{(from Fallback, mainly)}.$$

• [Corollary 1] If P₁ = 0, then P₂ = 0. Thus, bounded regret.
Example: Cheating code
• K_Ψ ≈ log₂ K
• CROP: (ln K)/Λ² · ln n
• Forced exploration: (ln K)/Λ² · ln n + K
• If Λ = .5, K = 2^d, and n = K: d² vs 2^d ⟹ exponential improvement!

        A1    A2    A3    A4    A5   A6
  f_1   1     1−ε   1−ε   1−ε   0    0
  f_2   1−ε   1     1−ε   1−ε   0    Λ
  f_3   1−ε   1−ε   1     1−ε   Λ    0
  f_4   1−ε   1−ε   1−ε   1     Λ    Λ
  f_5   1     1+ε   1−ε   1−ε   0    0
  f_6   1     1−ε   1+ε   1−ε   0    0
  f_7   1     1−ε   1−ε   1+ε   0    0
  …     …     …     …     …     …    …

  (A1–A4: the K′ base arms; A5–A6: the log₂ K′ cheating arms.)
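The d² vs 2^d comparison is just the following arithmetic under the stated parameters:

```latex
% With \Lambda = 1/2, K = 2^d, and n = K (so \ln n = d \ln 2):
\underbrace{\frac{\ln K}{\Lambda^2}\ln n}_{\text{CROP}}
  = 4\,(d\ln 2)^2 = \Theta(d^2),
\qquad
\underbrace{\frac{\ln K}{\Lambda^2}\ln n + K}_{\text{forced exploration}}
  = \Theta(d^2) + 2^d = \Theta(2^d).
```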
Lower bound
• We pull some uninformative arm log(log n) times. Is it necessary?
• Existing lower bounds say: it can be anywhere between Θ(1) and o(log n).
• Question: say an algorithm A is asymptotically optimal. Can it pull all uninformative arms O(1) times?
• [Theorem 2] The answer is NO. There exists an ℱ for which there is an uninformative arm a with
  𝔼[pull_count_n(a)] ≥ c · ln ln n.
  (conditions are more relaxed in the paper; will be updated in the arXiv version in a few days)
The risk of naively mimicking the oracle
• The oracle knows f*. At time t:
  • If ∀a, pull_count_t(a) ≥ γ_a(f*) · ln t:
    • (Exploit) pull a*(f*).
  • Else:
    • (Explore) track γ(f*).
• Most existing algorithms try to mimic the oracle!
  • E.g., replace f* with the ERM + forced exploration.
• CROP is not an exception.
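The oracle's pseudocode above in a few lines (a sketch; the interface is ours, and t ≥ 1 is assumed):

```python
import numpy as np

def oracle_step(t, pull_count, f_star, gamma_star):
    """Oracle that knows f*: exploit once every arm has met its
    gamma_a(f*) ln t allocation, otherwise keep tracking gamma(f*)."""
    if np.all(pull_count >= gamma_star * np.log(t)):
        return int(np.argmax(f_star))                  # Exploit: pull a*(f*)
    ratio = np.where(gamma_star > 0,
                     pull_count / np.maximum(gamma_star, 1e-12), np.inf)
    return int(np.argmin(ratio))                       # Explore: track gamma(f*)
```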
The risk of naively mimicking the oracle
• Regret of UCB: O(min{(K/ε) ln n, εn}).
• Regret of the oracle: O(min{(ln K)/Λ² · ln n, n}) ⟹ linear worst-case regret!
• Intuitively, if n is small, pulling an ε-optimal arm is great!

        A1    A2    A3    A4    A5   A6
  f_1   1     1−ε   1−ε   1−ε   0    0
  f_2   1−ε   1     1−ε   1−ε   0    Λ
  f_3   1−ε   1−ε   1     1−ε   Λ    0
  f_4   1−ε   1−ε   1−ε   1     Λ    Λ
  f_5   1     1+ε   1−ε   1−ε   0    0
  f_6   1     1−ε   1+ε   1−ε   0    0
  f_7   1     1−ε   1−ε   1+ε   0    0
  …     …     …     …     …     …    …

  (A1–A4: the K′ base arms; A5–A6: the log₂ K′ cheating arms.)

[Figure: regret vs. time for the oracle and UCB; the oracle's regret grows faster than UCB's for small n.]
It may not be the end of optimism
• Can we achieve the best of both worlds? I.e., O(min{(ln K)/Λ² · ln n, εn})?
• Yes, if we know ε: run UCB first, then run an asymptotically optimal algorithm.

[Figure: regret vs. time; run UCB early, then switch to the asymptotically optimal (AO) algorithm.]
Summary
• CROP: asymptotically optimal, adapts to bounded regret, with improved finite-time regret.
• Provides considerations for avoiding forced exploration.
• Reveals the danger of naively mimicking the oracle.
• What next?
  • The worst-case regret simultaneously?
  • Can we use pessimism for linear bandits?
  • Can we even avoid solving the optimization problem?
  • Lower bounds for finite-time instance-dependent regret?
  • No explicit specification of confidence set construction/width?