
Advanced Machine Learning

MEHRYAR MOHRI, MOHRI@
Courant Institute & Google Research

Bandit Problems

Multi-Armed Bandit Problem

Problem: which arm of a K-slot machine should a gambler pull to maximize his cumulative reward over a sequence of trials?

• stochastic setting.

• adversarial setting.

Motivation

Clinical trials: potential treatments for a disease to select from, with a new patient or patient category at each round (Thompson, 1933).

Ads placement: selection of an ad to display, out of a finite set (which could vary with time), for each new web page visitor.

Adaptive routing: alternative paths for routing packets through a "series of tubes", or alternative roads for driving from a source to a destination.

Games: different moves at each round of a game such as chess or Go.

Key Problem

Exploration vs. exploitation dilemma (or trade-off):

• inspect new arms with possibly better rewards.

• use existing information to select the best arm.

Outline

Stochastic bandits

Adversarial bandits

Stochastic Model

K arms: for each arm i \in \{1, \ldots, K\},

• reward distribution P_i.

• reward mean \mu_i.

• gap to best: \Delta_i = \mu^* - \mu_i, where \mu^* = \max_{i \in [1,K]} \mu_i.

Bandit Setting

For t = 1 to T do

• player selects action I_t \in \{1, \ldots, K\} (randomized).

• player receives reward X_{I_t,t} \sim P_{I_t}.

Equivalent descriptions:

• on-line learning with partial information (\neq full information).

• one-state MDPs (Markov Decision Processes).
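To make the protocol concrete, here is a minimal Python sketch of the stochastic setting (Python, the BernoulliBandit and play names, and the policy interface are illustrative, not part of the slides); any of the strategies discussed below can be plugged in as policy:

import numpy as np

class BernoulliBandit:
    """Stochastic K-armed bandit: arm i has reward distribution
    P_i = Bernoulli(mu_i)."""
    def __init__(self, means, seed=0):
        self.means = np.asarray(means)           # mu_1, ..., mu_K
        self.rng = np.random.default_rng(seed)

    def pull(self, i):
        # X_{I_t,t} ~ P_{I_t}
        return float(self.rng.random() < self.means[i])

def play(bandit, policy, T):
    """Bandit protocol: at each round t the player selects an arm and
    observes only that arm's reward (partial information)."""
    total = 0.0
    for t in range(1, T + 1):
        i = policy.select(t)
        x = bandit.pull(i)
        policy.update(i, x)
        total += x
    return total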

Objectives

Expected regret:

E[R_T] = E\Big[\max_{i \in [1,K]} \sum_{t=1}^T X_{i,t} - \sum_{t=1}^T X_{I_t,t}\Big].

Pseudo-regret:

\bar{R}_T = \max_{i \in [1,K]} E\Big[\sum_{t=1}^T X_{i,t} - \sum_{t=1}^T X_{I_t,t}\Big] = \mu^* T - E\Big[\sum_{t=1}^T X_{I_t,t}\Big].

By Jensen's inequality (the maximum of expectations is at most the expectation of the maximum), \bar{R}_T \leq E[R_T].

Expected Regret

If the X_{i,t}s take values in [-r, +r], then

E\Big[\max_{i \in [1,K]} \sum_{t=1}^T (X_{i,t} - \mu^*)\Big] \leq E\Big[\max_{i \in [1,K]} \sum_{t=1}^T (X_{i,t} - \mu_i)\Big] \leq r \sqrt{2 T \log K}.

The O(\sqrt{T}) dependency cannot be improved; better guarantees can be achieved for the pseudo-regret.
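The r\sqrt{2T\log K} bound follows from a standard maximal inequality; a sketch, assuming the X_{i,t} are independent over t for each arm. Write S_i = \sum_{t=1}^T (X_{i,t} - \mu_i); then for any \lambda > 0,

E\Big[\max_{i \in [1,K]} S_i\Big] \leq \frac{1}{\lambda} \log \sum_{i=1}^K E\big[e^{\lambda S_i}\big] \leq \frac{1}{\lambda}\Big(\log K + \frac{\lambda^2 r^2 T}{2}\Big),

using Jensen's inequality and Hoeffding's lemma E[e^{\lambda(X - E[X])}] \leq e^{\lambda^2 (2r)^2 / 8} for each term; choosing \lambda = \sqrt{2 \log K / (r^2 T)} gives r\sqrt{2T\log K}.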

Pseudo-Regret

Expression in terms of the \Delta_is:

\bar{R}_T = \sum_{i=1}^K E[T_i(T)] \, \Delta_i,

where T_i(t) denotes the number of times arm i was pulled up to time t: T_i(t) = \sum_{s=1}^t 1_{I_s=i}.

Proof: since X_{i,t} is independent of I_t (the player's choice at time t depends only on past observations), the expectation factorizes in the fourth equality below:

\bar{R}_T = \mu^* T - E\Big[\sum_{t=1}^T X_{I_t,t}\Big] = E\Big[\sum_{t=1}^T (\mu^* - X_{I_t,t})\Big]
= E\Big[\sum_{t=1}^T \sum_{i=1}^K (\mu^* - X_{i,t}) 1_{I_t=i}\Big] = \sum_{t=1}^T \sum_{i=1}^K E[\mu^* - X_{i,t}] \, E[1_{I_t=i}]
= \sum_{i=1}^K (\mu^* - \mu_i) \, E\Big[\sum_{t=1}^T 1_{I_t=i}\Big] = \sum_{i=1}^K E[T_i(T)] \, \Delta_i.
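As a quick numerical illustration of this identity (a self-contained sketch; the arm means are arbitrary illustrative values): for the uniformly random policy, E[T_i(T)] = T/K, so the identity predicts \bar{R}_T = (T/K) \sum_i \Delta_i.

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.9, 0.6, 0.5])            # arm means; mu* = 0.9
K, T, runs = len(mu), 1000, 2000
gaps = mu.max() - mu                       # Delta_i

total = 0.0
for _ in range(runs):
    arms = rng.integers(K, size=T)         # uniformly random I_t
    rewards = rng.random(T) < mu[arms]     # X_{I_t,t} ~ Bernoulli(mu_{I_t})
    total += mu.max() * T - rewards.sum()

print(total / runs)          # Monte Carlo estimate of the pseudo-regret
print(T / K * gaps.sum())    # prediction from the identity: ~233.3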

ε-Greedy Strategy

(Auer et al. 2002a)

At time t,

• with probability 1 - \epsilon_t, select the arm with the best empirical mean.

• with probability \epsilon_t, select a random arm (see the sketch below).

For \epsilon_t = \min\big(\frac{6K}{\Delta^2 t}, 1\big), with \Delta = \min_{i : \Delta_i > 0} \Delta_i:

• for t \geq \frac{6K}{\Delta^2}, \Pr[I_t \neq i^*] \leq \frac{C}{\Delta^2 t} for some C > 0.

• thus, E[T_i(T)] \leq \frac{C}{\Delta^2} \log T and \bar{R}_T \leq \sum_{i : \Delta_i > 0} \frac{C \Delta_i}{\Delta^2} \log T.

Logarithmic regret, but:

• requires knowledge of \Delta.

• sub-optimal arms treated similarly (naive search).
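A minimal Python sketch of the strategy (the class name is illustrative; note that \Delta must be supplied, which is exactly the drawback pointed out above); it matches the policy interface of the earlier protocol sketch:

import numpy as np

class EpsilonGreedy:
    """epsilon-greedy with the annealed schedule
    eps_t = min(6K / (Delta^2 t), 1)."""
    def __init__(self, K, Delta, seed=0):
        self.K, self.Delta = K, Delta
        self.counts = np.zeros(K)     # T_i(t)
        self.sums = np.zeros(K)       # cumulative reward of each arm
        self.rng = np.random.default_rng(seed)

    def select(self, t):
        eps_t = min(6 * self.K / (self.Delta ** 2 * t), 1.0)
        if self.rng.random() < eps_t:
            return int(self.rng.integers(self.K))   # explore: random arm
        # exploit: arm with best empirical mean (unpulled arms first)
        means = np.where(self.counts > 0,
                         self.sums / np.maximum(self.counts, 1), np.inf)
        return int(np.argmax(means))

    def update(self, i, x):
        self.counts[i] += 1
        self.sums[i] += x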

UCB Strategy

(Lai and Robbins, 1985; Agrawal, 1995; Auer et al. 2002a)

Optimism in the face of uncertainty:

• at each time t \in [1, T], compute an upper confidence bound UCB_i on the expected reward \mu_i of each arm i \in [1, K].

• select the arm with the largest UCB.

Idea: a wrong arm cannot be selected for too long:

• by definition (with high probability), \mu_i \leq \mu^* \leq UCB_{i^*}.

• pulling arm i often brings UCB_i closer to \mu_i.

Note on Concentration Inequalities

Let X be a random variable such that for all t \geq 0,

\log E\big[e^{t(X - E[X])}\big] \leq \psi(t),

where \psi is a convex function. For Hoeffding's inequality and X \in [a, b], \psi(t) = \frac{t^2 (b-a)^2}{8}.

Then, by Markov's inequality applied to e^{t(X - E[X])},

\Pr[X - E[X] > \epsilon] = \Pr\big[e^{t(X - E[X])} > e^{t\epsilon}\big] \leq \inf_{t > 0} e^{-t\epsilon} E\big[e^{t(X - E[X])}\big] \leq \inf_{t > 0} e^{-t\epsilon} e^{\psi(t)} = e^{-\sup_{t > 0}(t\epsilon - \psi(t))} = e^{-\psi^*(\epsilon)},

where \psi^* denotes the Legendre transform of \psi.
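As a worked instance of the Legendre transform above, in the Hoeffding case \psi(t) = \frac{t^2 (b-a)^2}{8} the supremum is attained at t = \frac{4\epsilon}{(b-a)^2}, giving

\psi^*(\epsilon) = \sup_{t > 0}\Big(t\epsilon - \frac{t^2 (b-a)^2}{8}\Big) = \frac{2\epsilon^2}{(b-a)^2},

that is, \Pr[X - E[X] > \epsilon] \leq e^{-2\epsilon^2/(b-a)^2}; for X \in [0, 1], this recovers the \psi^*(\epsilon) = 2\epsilon^2 used below.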

UCB Strategy

Average reward estimate for arm i by time t:

\hat{\mu}_{i,t} = \frac{1}{T_i(t)} \sum_{s=1}^t X_{i,s} 1_{I_s=i}.

Concentration inequality (e.g., Hoeffding's inequality): after t draws from arm i,

\Pr\Big[\mu_i - \frac{1}{t} \sum_{s=1}^t X_{i,s} > \epsilon\Big] \leq e^{-t \psi^*(\epsilon)}.

Thus, for any \delta > 0, with probability at least 1 - \delta,

\mu_i < \frac{1}{t} \sum_{s=1}^t X_{i,s} + \psi^{*-1}\Big(\frac{1}{t} \log \frac{1}{\delta}\Big).

(α, ψ)-UCB Strategy

Parameter \alpha > 0; the (\alpha, \psi)-UCB strategy consists of selecting at time t (see the sketch below)

I_t \in \argmax_{i \in [1,K]} \Big[\hat{\mu}_{i,t-1} + \psi^{*-1}\Big(\frac{\alpha \log t}{T_i(t-1)}\Big)\Big].
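A minimal Python sketch of this rule for the Hoeffding case \psi^*(\epsilon) = 2\epsilon^2, where \psi^{*-1}(x) = \sqrt{x/2} (class name illustrative; rewards assumed in [0, 1]):

import math
import numpy as np

class AlphaUCB:
    """(alpha, psi)-UCB specialized to the Hoeffding case
    psi*(eps) = 2 eps^2, i.e. psi*^{-1}(x) = sqrt(x / 2)."""
    def __init__(self, K, alpha=2.5):
        self.K, self.alpha = K, alpha
        self.counts = np.zeros(K)   # T_i(t - 1)
        self.sums = np.zeros(K)     # cumulative reward of each arm

    def select(self, t):
        # Pull each arm once before applying the UCB rule.
        for i in range(self.K):
            if self.counts[i] == 0:
                return i
        ucb = (self.sums / self.counts
               + np.sqrt(self.alpha * math.log(t) / (2 * self.counts)))
        return int(np.argmax(ucb))

    def update(self, i, x):
        self.counts[i] += 1
        self.sums[i] += x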

(α, ψ)-UCB Guarantee

Theorem: for \alpha > 2, the pseudo-regret of (\alpha, \psi)-UCB satisfies

\bar{R}_T \leq \sum_{i : \Delta_i > 0} \Big(\frac{\alpha \Delta_i}{\psi^*(\Delta_i / 2)} \log T + \frac{\alpha}{\alpha - 2}\Big).

• for Hoeffding's lemma, \psi^*(\epsilon) = 2\epsilon^2, this is \alpha-UCB (Auer et al. 2002a):

\bar{R}_T \leq \sum_{i : \Delta_i > 0} \Big(\frac{2\alpha}{\Delta_i} \log T + \frac{\alpha}{\alpha - 2}\Big).

Proof

Lemma: for any s \geq 0,

\sum_{t=1}^T 1_{I_t=i} \leq s + \sum_{t=s+1}^T 1_{I_t=i} 1_{T_i(t-1) \geq s}.

Proof: observe that

\sum_{t=1}^T 1_{I_t=i} = \sum_{t=1}^T 1_{I_t=i} 1_{T_i(t-1) < s} + \sum_{t=1}^T 1_{I_t=i} 1_{T_i(t-1) \geq s},

where the second sum can start at t = s + 1, since T_i(t-1) \geq s requires t - 1 \geq s.

• Now, for t^* = \max\{t \leq T : 1_{T_i(t-1) < s} \neq 0\},

\sum_{t=1}^T 1_{I_t=i} 1_{T_i(t-1) < s} = \sum_{t=1}^{t^*} 1_{I_t=i} 1_{T_i(t-1) < s}.

• By definition of t^*, the number of non-zero terms in this sum is at most s.

Proof

For any i and t, define \eta_{i,t-1} = \psi^{*-1}\big(\frac{\alpha \log t}{T_i(t-1)}\big). At time t, if arm i is selected, then

(\hat{\mu}_{i,t-1} + \eta_{i,t-1}) - (\hat{\mu}_{i^*,t-1} + \eta_{i^*,t-1}) \geq 0
\iff [\hat{\mu}_{i,t-1} - \mu_i - \eta_{i,t-1}] + [2\eta_{i,t-1} - \Delta_i] + [\mu^* - \hat{\mu}_{i^*,t-1} - \eta_{i^*,t-1}] \geq 0.

Thus, at least one of these three terms is non-negative; in particular, when the middle term is negative, at least one of the other two must be non-negative.

Proof

To bound the pseudo-regret, we bound E[T_i(T)]. Observe first that

T_i(t-1) \geq s = \Big\lceil \frac{\alpha \log T}{\psi^*(\Delta_i/2)} \Big\rceil \geq \frac{\alpha \log t}{\psi^*(\Delta_i/2)} \implies \Delta_i - 2\eta_{i,t-1} \geq 0,

since then \eta_{i,t-1} = \psi^{*-1}\big(\frac{\alpha \log t}{T_i(t-1)}\big) \leq \frac{\Delta_i}{2}. Thus, by the lemma and the decomposition of the previous slide,

E[T_i(T)] = E\Big[\sum_{t=1}^T 1_{I_t=i}\Big] \leq s + E\Big[\sum_{t=s+1}^T 1_{I_t=i} 1_{T_i(t-1) \geq s}\Big]
\leq s + \sum_{t=s+1}^T \Pr[\hat{\mu}_{i,t-1} - \mu_i - \eta_{i,t-1} \geq 0] + \Pr[\mu^* - \hat{\mu}_{i^*,t-1} - \eta_{i^*,t-1} \geq 0].

Proof

Each of the two probability terms can be bounded as follows using the union bound:

\Pr[\mu^* - \hat{\mu}_{i^*,t-1} - \eta_{i^*,t-1} \geq 0]
\leq \Pr\Big[\exists s \in [1, t] : \mu^* - \hat{\mu}_{i^*,s} - \psi^{*-1}\Big(\frac{\alpha \log t}{s}\Big) \geq 0\Big]
\leq \sum_{s=1}^t \frac{1}{t^\alpha} = \frac{1}{t^{\alpha - 1}},

since, by the concentration inequality, each term of the union is at most e^{-s \cdot \frac{\alpha \log t}{s}} = t^{-\alpha}.

The final constant of the bound is obtained by further simple calculations.

Lower Bound

(Lai and Robbins, 1985)

Theorem: for any strategy such that E[T_i(T)] = o(T^a) for any sub-optimal arm i and any a > 0, for any set of Bernoulli reward distributions, the following holds for all Bernoulli reward distributions:

\liminf_{T \to +\infty} \frac{\bar{R}_T}{\log T} \geq \sum_{i : \Delta_i > 0} \frac{\Delta_i}{D(\mu_i \,\|\, \mu^*)}.

• a more general result holds for general distributions.

Notes

Observe that

\sum_{i : \Delta_i > 0} \frac{\Delta_i}{D(\mu_i \,\|\, \mu^*)} \geq \mu^*(1 - \mu^*) \sum_{i : \Delta_i > 0} \frac{1}{\Delta_i},

since, using \log x \leq x - 1,

D(\mu_i \,\|\, \mu^*) = \mu_i \log \frac{\mu_i}{\mu^*} + (1 - \mu_i) \log \frac{1 - \mu_i}{1 - \mu^*}
\leq \mu_i \frac{\mu_i - \mu^*}{\mu^*} + (1 - \mu_i) \frac{\mu^* - \mu_i}{1 - \mu^*}
= \frac{(\mu_i - \mu^*)^2}{\mu^*(1 - \mu^*)} = \frac{\Delta_i^2}{\mu^*(1 - \mu^*)}.

Outline

Stochastic bandits

Adversarial bandits

Adversarial Model

K arms: for each arm i \in \{1, \ldots, K\},

• no stochastic assumption.

• rewards in [0, 1].

Bandit Setting

For t = 1 to T do

• player selects action I_t \in \{1, \ldots, K\} (randomized).

• player receives reward x_{I_t,t}.

Notes:

• the rewards x_{i,t} of all arms are determined by the adversary simultaneously with the selection of the arm I_t by the player.

• the adversary may be oblivious or non-oblivious (adaptive).

• deterministic strategies suffer a regret of at least T/2 on some (bad) sequences; thus randomized strategies must be considered.

Scenarios

Oblivious case:

• the adversary's rewards are selected independently of the player's actions; thus, the reward vector at time t is only a function of t.

Non-oblivious case:

• the adversary's rewards at time t are a function of the player's past actions I_1, \ldots, I_{t-1}.

• the notion of regret is problematic: the cumulative reward is compared to a quantity that depends on the player's actions! (the single best action in hindsight is a function of the actions I_1, \ldots, I_T actually played; playing that single "best" action throughout could have resulted in different rewards.)

Objectives

Minimize the regret (with losses \ell_{i,t} = 1 - x_{i,t}), in expectation or with high probability:

R_T = \max_{i \in [1,K]} \sum_{t=1}^T x_{i,t} - \sum_{t=1}^T x_{I_t,t} = \sum_{t=1}^T \ell_{I_t,t} - \min_{i \in [1,K]} \sum_{t=1}^T \ell_{i,t}.

Pseudo-regret:

\bar{R}_T = E\Big[\sum_{t=1}^T \ell_{I_t,t}\Big] - \min_{i \in [1,K]} E\Big[\sum_{t=1}^T \ell_{i,t}\Big].

By Jensen's inequality, \bar{R}_T \leq E[R_T].

Importance Weighting

In the bandit setting, the cumulative loss of each arm is not observed, so how should we update the probabilities?

Estimates via a surrogate loss:

\tilde{\ell}_{i,t} = \frac{\ell_{i,t}}{p_{i,t}} 1_{I_t=i},

where p_t = (p_{1,t}, \ldots, p_{K,t}) is the probability distribution the player uses at time t to draw an arm (p_{i,t} > 0).

Unbiased estimate: for any i,

E_{I_t \sim p_t}[\tilde{\ell}_{i,t}] = \sum_{j=1}^K p_{j,t} \frac{\ell_{i,t}}{p_{i,t}} 1_{j=i} = \ell_{i,t}.
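A quick self-contained numerical check of this unbiasedness (the distribution p and the losses are arbitrary illustrative values):

import numpy as np

rng = np.random.default_rng(0)
K = 4
p = np.array([0.1, 0.2, 0.3, 0.4])    # p_t, all components positive
losses = rng.random(K)                # ell_{i,t} in [0, 1]

n = 200_000
draws = rng.choice(K, size=n, p=p)    # I_t ~ p_t
est = np.array([np.mean((draws == i) * losses[i] / p[i]) for i in range(K)])

print(np.round(est, 3))               # ~ losses: the estimate is unbiased
print(np.round(losses, 3))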

EXP3

EXP3 (Exponential weights for Exploration and Exploitation)

(Auer et al. 2002b)

EXP3(K)
    p_1 <- (1/K, ..., 1/K)
    (\tilde{L}_{1,0}, ..., \tilde{L}_{K,0}) <- (0, ..., 0)
    for t <- 1 to T do
        Sample(I_t ~ p_t)
        Receive(\ell_{I_t,t})
        for i <- 1 to K do
            \tilde{\ell}_{i,t} <- (\ell_{i,t} / p_{i,t}) 1_{I_t=i}
            \tilde{L}_{i,t} <- \tilde{L}_{i,t-1} + \tilde{\ell}_{i,t}
        for i <- 1 to K do
            p_{i,t+1} <- e^{-\eta \tilde{L}_{i,t}} / \sum_{j=1}^K e^{-\eta \tilde{L}_{j,t}}
    return p_{T+1}
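A direct Python transcription of this pseudocode (a sketch: the function name and the get_loss callable, which returns \ell_{I_t,t} \in [0, 1] for the pulled arm, are illustrative; the cumulative estimates are shifted by their minimum purely for numerical stability, which leaves p_t unchanged):

import numpy as np

def exp3(K, T, eta, get_loss, seed=0):
    """EXP3: exponential weights over importance-weighted loss estimates."""
    rng = np.random.default_rng(seed)
    L = np.zeros(K)                            # tilde L_{i,t}
    def probs():
        w = np.exp(-eta * (L - L.min()))       # stabilized weights
        return w / w.sum()
    for t in range(1, T + 1):
        p = probs()
        i = rng.choice(K, p=p)                 # I_t ~ p_t
        L[i] += get_loss(i, t) / p[i]          # tilde ell_{i,t} for i = I_t
    return probs()                             # p_{T+1}

# Illustrative run against fixed Bernoulli losses (oblivious adversary):
rng = np.random.default_rng(1)
means = [0.6, 0.5, 0.3]
K, T = 3, 20_000
eta = np.sqrt(2 * np.log(K) / (K * T))    # rate minimizing the bound below
print(exp3(K, T, eta, lambda i, t: float(rng.random() < means[i])))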

EXP3 Guarantee

Theorem: the pseudo-regret of EXP3 can be bounded as follows:

\bar{R}_T \leq \frac{\log K}{\eta} + \frac{\eta K T}{2}.

Choosing \eta = \sqrt{2 \log K / (K T)} to minimize the bound gives

\bar{R}_T \leq \sqrt{2 K T \log K}.

Proof: similar to that of EG, but we cannot use Hoeffding's inequality since \tilde{\ell}_{i,t} is unbounded.

Proof

Potential: \Phi_t = \log \sum_{i=1}^K e^{-\eta \tilde{L}_{i,t}}.

Upper bound:

\Phi_t - \Phi_{t-1} = \log \frac{\sum_{i=1}^K e^{-\eta \tilde{L}_{i,t}}}{\sum_{i=1}^K e^{-\eta \tilde{L}_{i,t-1}}} = \log \frac{\sum_{i=1}^K e^{-\eta \tilde{L}_{i,t-1}} e^{-\eta \tilde{\ell}_{i,t}}}{\sum_{i=1}^K e^{-\eta \tilde{L}_{i,t-1}}} = \log E_{i \sim p_t}\big[e^{-\eta \tilde{\ell}_{i,t}}\big]
\leq E_{i \sim p_t}\big[e^{-\eta \tilde{\ell}_{i,t}}\big] - 1 \qquad (\log x \leq x - 1)
\leq E_{i \sim p_t}\Big[-\eta \tilde{\ell}_{i,t} + \frac{\eta^2}{2} \tilde{\ell}_{i,t}^2\Big] \qquad (e^{-x} \leq 1 - x + \frac{x^2}{2} \text{ for } x \geq 0)
= -\eta E_{i \sim p_t}[\tilde{\ell}_{i,t}] + \frac{\eta^2}{2} E_{i \sim p_t}\Big[\frac{\ell_{i,t}^2 1_{I_t=i}}{p_{i,t}^2}\Big]
= -\eta \ell_{I_t,t} + \frac{\eta^2}{2} \frac{\ell_{I_t,t}^2}{p_{I_t,t}} \leq -\eta \ell_{I_t,t} + \frac{\eta^2}{2} \frac{1}{p_{I_t,t}}.

Proof

Upper bound: summing up these inequalities over t and taking expectations yields

E[\Phi_T - \Phi_0] \leq -\eta E\Big[\sum_{t=1}^T \ell_{I_t,t}\Big] + E\Big[\sum_{t=1}^T \frac{\eta^2}{2 p_{I_t,t}}\Big] = -\eta E\Big[\sum_{t=1}^T \ell_{I_t,t}\Big] + \frac{\eta^2 K T}{2},

since E_{I_t \sim p_t}[1 / p_{I_t,t}] = \sum_{i=1}^K p_{i,t} \cdot \frac{1}{p_{i,t}} = K.

Lower bound: for all j \in [1, K],

E[\Phi_T - \Phi_0] = E\Big[\log \sum_{i=1}^K e^{-\eta \tilde{L}_{i,T}}\Big] - \log K \geq -\eta E[\tilde{L}_{j,T}] - \log K = -\eta E[L_{j,T}] - \log K.

Comparison: for all j \in [1, K],

\eta E\Big[\sum_{t=1}^T \ell_{I_t,t}\Big] - \eta E[L_{j,T}] \leq \log K + \frac{\eta^2}{2} K T \implies \bar{R}_T \leq \frac{\log K}{\eta} + \frac{\eta K T}{2}.

Notes

When T is not known:

• standard doubling trick.

• or, use the time-varying rate \eta_t = \sqrt{\log K / (K t)}; then \bar{R}_T \leq 2\sqrt{K T \log K} (see the sketch below).

High-probability bounds:

• importance weighting problem: unbounded second moment (see (Cortes, Mansour, MM, 2010)): E_{i \sim p_t}[\tilde{\ell}_{i,t}^2] = \ell_{I_t,t}^2 / p_{I_t,t}.

• (Auer et al., 2002b): mixing p_{i,t} with a uniform distribution to ensure a lower bound on p_{i,t}; but not sufficient for a high-probability bound.

• solution: the biased estimate \tilde{\ell}_{i,t} = \frac{\ell_{i,t} 1_{I_t=i} + \beta}{p_{i,t}}, with a parameter \beta > 0 to tune.
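A sketch of the corresponding anytime variant (same assumptions and naming conventions as the EXP3 sketch above; only the rate changes):

import numpy as np

def exp3_anytime(K, T, get_loss, seed=0):
    """EXP3 with the time-varying rate eta_t = sqrt(log K / (K t)):
    no knowledge of the horizon T is needed."""
    rng = np.random.default_rng(seed)
    L = np.zeros(K)
    p = np.full(K, 1.0 / K)
    for t in range(1, T + 1):
        eta_t = np.sqrt(np.log(K) / (K * t))
        w = np.exp(-eta_t * (L - L.min()))     # stabilized weights
        p = w / w.sum()
        i = rng.choice(K, p=p)
        L[i] += get_loss(i, t) / p[i]
    return p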

Lower Bound

(Bubeck and Cesa-Bianchi, 2012)

It suffices to prove the lower bound in a stochastic setting and for the pseudo-regret (it then also holds for the expected regret, since \bar{R}_T \leq E[R_T]).

Theorem: for any T \geq 1 and any player strategy, there exists a distribution of losses in \{0, 1\} for which

\bar{R}_T \geq \frac{1}{20} \sqrt{K T}.

Notes

The bound for EXP3 matches the lower bound modulo a log term.

Log-free bound (Audibert and Bubeck, 2010): select p_{i,t+1} = \psi(C_t - \tilde{L}_{i,t}), where C_t is a constant ensuring \sum_{i=1}^K p_{i,t+1} = 1 and \psi is increasing, convex, and twice differentiable over \mathbb{R}^*_-.

• EXP3 coincides with \psi(x) = e^{\eta x}.

• log-free bound with \psi(x) = (-\eta x)^{-q} and q = 2.

• formulation as mirror descent.

• only in the oblivious case.

References

R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, vol. 27, pp. 1054–1078, 1995.

Jean-Yves Audibert and Sébastien Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, vol. 11, pp. 2635–2686, 2010.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, vol. 47, no. 2–3, pp. 235–256, 2002a.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002b.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In NIPS, 2010.

T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, vol. 6, pp. 4–22, 1985.

Gilles Stoltz. Incomplete information and internal regret in prediction of individual sequences. Ph.D. thesis, Université Paris-Sud, 2005.

R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.

William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, vol. 25, pp. 285–294, 1933.