
Advanced Machine Learning

Mehryar Mohri, Courant Institute & Google Research

Bandit Problems

Multi-Armed Bandit Problem

Problem: which arm of a $K$-slot machine should a gambler pull to maximize his cumulative reward over a sequence of trials?

• stochastic setting.

• adversarial setting.

Motivation

Clinical trials: potential treatments for a disease to select from, new patient or category at each round (Thompson, 1933).

Ads placement: selection of an ad to display out of a finite set (which could vary with time) for each new web page visitor.

Adaptive routing: alternative paths for routing packets through a "series of tubes," or alternative roads for driving from a source to a destination.

Games: different moves at each round of a game such as chess or Go.

Key Problem

Exploration vs. exploitation dilemma (or trade-off):

• explore new arms with possibly better rewards.

• exploit existing information to select the best arm.

Outline

Stochastic bandits

Adversarial bandits

Stochastic Model

$K$ arms: for each arm $i \in \{1, \ldots, K\}$,

• reward distribution $P_i$.

• reward mean $\mu_i$.

• gap to best: $\Delta_i = \mu^* - \mu_i$, where $\mu^* = \max_{i \in [1,K]} \mu_i$.
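To make the stochastic model concrete, here is a minimal simulation sketch assuming Bernoulli reward distributions $P_i$; the class name BernoulliBandit and its interface are illustrative choices, not part of the slides.

import numpy as np

class BernoulliBandit:
    """K-armed stochastic bandit with Bernoulli reward distributions P_i = Bernoulli(mu_i)."""

    def __init__(self, means, seed=0):
        self.means = np.asarray(means, dtype=float)   # reward means mu_i
        self.rng = np.random.default_rng(seed)
        self.mu_star = self.means.max()               # mu* = max_i mu_i
        self.gaps = self.mu_star - self.means         # Delta_i = mu* - mu_i

    @property
    def n_arms(self):
        return len(self.means)

    def pull(self, i):
        """Draw a reward X_{i,t} ~ P_i for the selected arm i."""
        return float(self.rng.random() < self.means[i])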

Bandit Setting

For $t = 1$ to $T$ do

• player selects action $I_t \in \{1, \ldots, K\}$ (randomized).

• player receives reward $X_{I_t,t} \sim P_{I_t}$.

Equivalent descriptions:

• on-line learning with partial information ($\neq$ full information).

• one-state MDPs (Markov Decision Processes).

Objectives

Expected regret:
$$\mathbb{E}[R_T] = \mathbb{E}\Big[\max_{i \in [1,K]} \sum_{t=1}^T X_{i,t} - \sum_{t=1}^T X_{I_t,t}\Big].$$

Pseudo-regret:
$$\overline{R}_T = \max_{i \in [1,K]} \mathbb{E}\Big[\sum_{t=1}^T X_{i,t} - \sum_{t=1}^T X_{I_t,t}\Big] = \mu^* T - \mathbb{E}\Big[\sum_{t=1}^T X_{I_t,t}\Big].$$

By Jensen's inequality, $\overline{R}_T \le \mathbb{E}[R_T]$.

Expected Regret

If the $(X_{i,t} - \mu_i)$s take values in $[-r, +r]$, then
$$\mathbb{E}\Big[\max_{i \in [1,K]} \sum_{t=1}^T (X_{i,t} - \mu^*)\Big] \le r \sqrt{2 T \log K}.$$

The $O(\sqrt{T})$ dependency cannot be improved; better guarantees can be achieved for the pseudo-regret.

Pseudo-Regret

Expression in terms of the $\Delta_i$s:
$$\overline{R}_T = \sum_{i=1}^K \mathbb{E}[T_i(T)]\, \Delta_i,$$
where $T_i(t)$ denotes the number of times arm $i$ was pulled up to time $t$, $T_i(t) = \sum_{s=1}^t 1_{I_s = i}$.

Proof:
$$\overline{R}_T = \mu^* T - \mathbb{E}\Big[\sum_{t=1}^T X_{I_t,t}\Big] = \mathbb{E}\Big[\sum_{t=1}^T (\mu^* - X_{I_t,t})\Big] = \mathbb{E}\Big[\sum_{t=1}^T \sum_{i=1}^K (\mu^* - X_{i,t})\, 1_{I_t = i}\Big] = \sum_{t=1}^T \sum_{i=1}^K \mathbb{E}[\mu^* - X_{i,t}]\, \mathbb{E}[1_{I_t = i}]$$
$$= \sum_{i=1}^K (\mu^* - \mu_i)\, \mathbb{E}\Big[\sum_{t=1}^T 1_{I_t = i}\Big] = \sum_{i=1}^K \mathbb{E}[T_i(T)]\, \Delta_i,$$
using the fact that $X_{i,t}$ is independent of the event $I_t = i$, since the choice $I_t$ depends only on past observations.

ε-Greedy Strategy (Auer et al., 2002a)

At time $t$,

• with probability $1 - \epsilon_t$, select the arm with the best empirical mean.

• with probability $\epsilon_t$, select a random arm.

For $\epsilon_t = \min\big(\frac{6K}{\Delta^2 t}, 1\big)$, with $\Delta = \min_{i:\, \Delta_i > 0} \Delta_i$,

• for $t \ge \frac{6K}{\Delta^2}$, $\Pr[I_t \ne i^*] \le \frac{C}{\Delta^2 t}$ for some $C > 0$.

• thus, $\mathbb{E}[T_i(T)] \le \frac{C}{\Delta^2} \log T$ and $\overline{R}_T \le \sum_{i:\, \Delta_i > 0} \frac{C \Delta_i}{\Delta^2} \log T$.

Logarithmic regret, but:

• requires knowledge of $\Delta$.

• sub-optimal arms treated similarly (naive search).
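A minimal sketch of the $\epsilon_t$-greedy rule above, with $\epsilon_t = \min\big(\frac{6K}{\Delta^2 t}, 1\big)$; it assumes $\Delta$ is known, as the analysis requires, reuses the hypothetical BernoulliBandit environment, and adds a small guard that explores while some arm is still unvisited (an implementation convenience, not part of the slides).

import numpy as np

def epsilon_greedy(bandit, T, delta, seed=0):
    """epsilon_t-greedy with epsilon_t = min(6K / (delta^2 t), 1), delta = min_{i: Delta_i > 0} Delta_i."""
    rng = np.random.default_rng(seed)
    K = bandit.n_arms
    counts = np.zeros(K)          # T_i(t): number of pulls of arm i
    sums = np.zeros(K)            # cumulative reward of arm i
    rewards = []
    for t in range(1, T + 1):
        eps_t = min(6 * K / (delta ** 2 * t), 1.0)
        if rng.random() < eps_t or counts.min() == 0:
            i = rng.integers(K)                    # explore: random arm (also covers unvisited arms)
        else:
            i = int(np.argmax(sums / counts))      # exploit: arm with best empirical mean
        x = bandit.pull(i)
        counts[i] += 1
        sums[i] += x
        rewards.append(x)
    return np.array(rewards), counts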

UCB Strategy (Lai and Robbins, 1985; Agrawal, 1995; Auer et al., 2002a)

Optimism in the face of uncertainty:

• at each time $t \in [1, T]$, compute an upper confidence bound (UCB) on the expected reward of each arm $i \in [1, K]$.

• select the arm with the largest UCB.

Idea: a wrong arm cannot be selected for too long:

• by definition, $\mu_i \le \mu^* \le \mathrm{UCB}_{i^*}$.

• pulling arm $i$ often brings $\mathrm{UCB}_i$ closer to $\mu_i$.

Note on Concentration Inequalities

Let $X$ be a random variable such that for all $t \ge 0$,
$$\log \mathbb{E}\big[e^{t(X - \mathbb{E}[X])}\big] \le \psi(t),$$
where $\psi$ is a convex function. For Hoeffding's inequality and $X \in [a, b]$, $\psi(t) = \frac{t^2 (b-a)^2}{8}$.

Then, by Markov's inequality,
$$\Pr[X - \mathbb{E}[X] > \epsilon] = \Pr\big[e^{t(X - \mathbb{E}[X])} > e^{t\epsilon}\big] \le \inf_{t > 0} e^{-t\epsilon}\, \mathbb{E}\big[e^{t(X - \mathbb{E}[X])}\big] \le \inf_{t > 0} e^{-t\epsilon} e^{\psi(t)} = e^{-\sup_{t > 0}(t\epsilon - \psi(t))} = e^{-\psi^*(\epsilon)}.$$
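As a short worked computation (not on the slide), the Legendre transform $\psi^*$ is explicit in the Hoeffding case, which is the form used by the UCB bounds below:
$$\psi(t) = \frac{t^2 (b-a)^2}{8} \;\Longrightarrow\; \psi^*(\epsilon) = \sup_{t > 0}\Big(t\epsilon - \frac{t^2 (b-a)^2}{8}\Big) = \frac{2\epsilon^2}{(b-a)^2}, \quad \text{attained at } t = \frac{4\epsilon}{(b-a)^2}.$$
So for rewards in $[0, 1]$, $\psi^*(\epsilon) = 2\epsilon^2$ and ${\psi^*}^{-1}(x) = \sqrt{x/2}$.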

UCB Strategy

Average reward estimate for arm $i$ by time $t$:
$$\widehat{\mu}_{i,t} = \frac{1}{T_i(t)} \sum_{s=1}^t X_{i,s}\, 1_{I_s = i}.$$

Concentration inequality (e.g., Hoeffding's inequality):
$$\Pr\Big[\mu_i - \frac{1}{t} \sum_{s=1}^t X_{i,s} > \epsilon\Big] \le e^{-t \psi^*(\epsilon)}.$$

Thus, for any $\delta > 0$, with probability at least $1 - \delta$,
$$\mu_i < \frac{1}{t} \sum_{s=1}^t X_{i,s} + {\psi^*}^{-1}\Big(\frac{1}{t} \log \frac{1}{\delta}\Big).$$

(α, ψ)-UCB Strategy

Parameter $\alpha > 0$; the $(\alpha, \psi)$-UCB strategy consists of selecting at time $t$
$$I_t \in \operatorname*{argmax}_{i \in [1,K]} \bigg[\widehat{\mu}_{i,t-1} + {\psi^*}^{-1}\Big(\frac{\alpha \log t}{T_i(t-1)}\Big)\bigg].$$
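A minimal sketch of this selection rule for rewards in $[0, 1]$, using the Hoeffding case ${\psi^*}^{-1}(x) = \sqrt{x/2}$ worked out above; the function name, the default $\alpha = 3$, and the convention of pulling each arm once for initialization are implementation assumptions, not prescribed by the slides.

import numpy as np

def alpha_ucb(bandit, T, alpha=3.0, seed=0):
    """(alpha, psi)-UCB with Hoeffding psi: index mu_hat_i + sqrt(alpha * log t / (2 * T_i(t-1)))."""
    K = bandit.n_arms
    counts = np.zeros(K)     # T_i(t-1)
    sums = np.zeros(K)       # cumulative reward of arm i
    rewards = []
    for t in range(1, T + 1):
        if t <= K:
            i = t - 1        # pull each arm once so that T_i(t-1) > 0
        else:
            ucb = sums / counts + np.sqrt(alpha * np.log(t) / (2 * counts))
            i = int(np.argmax(ucb))
        x = bandit.pull(i)
        counts[i] += 1
        sums[i] += x
        rewards.append(x)
    return np.array(rewards), counts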

(α, ψ)-UCB Guarantee

Theorem: for $\alpha > 2$, the pseudo-regret of $(\alpha, \psi)$-UCB satisfies
$$\overline{R}_T \le \sum_{i:\, \Delta_i > 0} \bigg(\frac{\alpha \Delta_i}{\psi^*\!\big(\frac{\Delta_i}{2}\big)} \log T + \frac{\alpha}{\alpha - 2}\bigg).$$

• for Hoeffding's lemma ($\alpha$-UCB, $\psi^*(\epsilon) = 2\epsilon^2$) (Auer et al., 2002a),
$$\overline{R}_T \le \sum_{i:\, \Delta_i > 0} \bigg(\frac{2\alpha}{\Delta_i} \log T + \frac{\alpha}{\alpha - 2}\bigg).$$

Proof

Lemma: for any $s \ge 0$,
$$\sum_{t=1}^T 1_{I_t = i} \le s + \sum_{t=s+1}^T 1_{I_t = i}\, 1_{T_i(t-1) \ge s}.$$

Proof: observe that
$$\sum_{t=1}^T 1_{I_t = i} = \sum_{t=1}^T 1_{I_t = i}\, 1_{T_i(t-1) < s} + \sum_{t=1}^T 1_{I_t = i}\, 1_{T_i(t-1) \ge s}.$$

• In the second sum, the terms with $t \le s$ vanish since then $T_i(t-1) \le t - 1 < s$.

• Now, for $t^* = \max\{t \le T : 1_{T_i(t-1) < s} \ne 0\}$,
$$\sum_{t=1}^T 1_{I_t = i}\, 1_{T_i(t-1) < s} = \sum_{t=1}^{t^*} 1_{I_t = i}\, 1_{T_i(t-1) < s}.$$

• By definition of $t^*$, the number of non-zero terms in this sum is at most $s$.

Proof

For any $i$ and $t$, define $\eta_{i,t-1} = {\psi^*}^{-1}\big(\frac{\alpha \log t}{T_i(t-1)}\big)$. At time $t$, if arm $i$ is selected, then
$$(\widehat{\mu}_{i,t-1} + \eta_{i,t-1}) - (\widehat{\mu}_{i^*,t-1} + \eta_{i^*,t-1}) \ge 0$$
$$\Leftrightarrow\; [\widehat{\mu}_{i,t-1} - \mu_i - \eta_{i,t-1}] + [2\eta_{i,t-1} - \Delta_i] + [\mu^* - \widehat{\mu}_{i^*,t-1} - \eta_{i^*,t-1}] \ge 0.$$

Thus, at least one of these three terms is non-negative. Also, if one of them is non-positive, at least one of the other two is non-negative.

Proof

To bound the pseudo-regret, we bound $\mathbb{E}[T_i(T)]$. But observe first that
$$T_i(t-1) \ge s = \bigg\lceil \frac{\alpha \log T}{\psi^*\!\big(\frac{\Delta_i}{2}\big)} \bigg\rceil \ge \frac{\alpha \log t}{\psi^*\!\big(\frac{\Delta_i}{2}\big)} \;\Rightarrow\; \Delta_i - 2\eta_{i,t-1} \ge 0.$$

Thus,
$$\mathbb{E}[T_i(T)] = \mathbb{E}\Big[\sum_{t=1}^T 1_{I_t = i}\Big] \le s + \mathbb{E}\Big[\sum_{t=s+1}^T 1_{I_t = i}\, 1_{T_i(t-1) \ge s}\Big]$$
$$\le s + \sum_{t=s+1}^T \Pr[\widehat{\mu}_{i,t-1} - \mu_i - \eta_{i,t-1} \ge 0] + \Pr[\mu^* - \widehat{\mu}_{i^*,t-1} - \eta_{i^*,t-1} \ge 0].$$

Proof

Each of the two probability terms can be bounded as follows using the union bound:
$$\Pr[\mu^* - \widehat{\mu}_{i^*,t-1} - \eta_{i^*,t-1} \ge 0] \le \Pr\Big[\exists s \in [1, t] : \mu^* - \widehat{\mu}_{i^*,s} - {\psi^*}^{-1}\Big(\frac{\alpha \log t}{s}\Big) \ge 0\Big] \le \sum_{s=1}^t \frac{1}{t^\alpha} = \frac{1}{t^{\alpha - 1}}.$$

The final constant of the bound is obtained by further simple calculations.

Lower Bound (Lai and Robbins, 1985)

Theorem: for any strategy such that $\mathbb{E}[T_i(T)] = o(T^a)$ for any arm $i$ and any $a > 0$ for any set of Bernoulli reward distributions, the following holds for all Bernoulli reward distributions:
$$\liminf_{T \to +\infty} \frac{\overline{R}_T}{\log T} \ge \sum_{i:\, \Delta_i > 0} \frac{\Delta_i}{D(\mu_i \,\|\, \mu^*)}.$$

• a more general result holds for general distributions.

Notes

Observe that
$$\sum_{i:\, \Delta_i > 0} \frac{\Delta_i}{D(\mu_i \,\|\, \mu^*)} \ge \mu^*(1 - \mu^*) \sum_{i:\, \Delta_i > 0} \frac{1}{\Delta_i},$$
since, using $\log x \le x - 1$,
$$D(\mu_i \,\|\, \mu^*) = \mu_i \log \frac{\mu_i}{\mu^*} + (1 - \mu_i) \log \frac{1 - \mu_i}{1 - \mu^*} \le \mu_i\, \frac{\mu_i - \mu^*}{\mu^*} + (1 - \mu_i)\, \frac{\mu^* - \mu_i}{1 - \mu^*} = \frac{(\mu_i - \mu^*)^2}{\mu^*(1 - \mu^*)} = \frac{\Delta_i^2}{\mu^*(1 - \mu^*)}.$$

Outline

Stochastic bandits

Adversarial bandits

Adversarial Model

$K$ arms: for each arm $i \in \{1, \ldots, K\}$,

• no stochastic assumption.

• rewards in $[0, 1]$.

Bandit Setting

For $t = 1$ to $T$ do

• player selects action $I_t \in \{1, \ldots, K\}$ (randomized).

• player receives reward $x_{I_t,t}$.

Notes:

• rewards $x_{i,t}$ for all arms determined by the adversary simultaneously with the selection of an arm by the player.

• adversary oblivious or non-oblivious (adaptive).

• deterministic strategies incur regret at least $\frac{T}{2}$ for some (bad) sequences; thus, randomized strategies must be considered.

Scenarios

Oblivious case:

• adversary rewards selected independently of the player's actions; thus, the reward vector at time $t$ is only a function of $t$.

Non-oblivious case:

• adversary rewards at time $t$ are a function of the player's past actions $I_1, \ldots, I_{t-1}$.

• notion of regret problematic: the cumulative reward is compared to a quantity that depends on the player's actions! (the single best action in hindsight is a function of the actions $I_1, \ldots, I_T$ played; playing that single "best" action could have resulted in different rewards.)

Objectives

Minimize the regret ($\ell_{i,t} = 1 - x_{i,t}$), in expectation or with high probability:
$$R_T = \max_{i \in [1,K]} \sum_{t=1}^T x_{i,t} - \sum_{t=1}^T x_{I_t,t} = \sum_{t=1}^T \ell_{I_t,t} - \min_{i \in [1,K]} \sum_{t=1}^T \ell_{i,t}.$$

Pseudo-regret:
$$\overline{R}_T = \mathbb{E}\Big[\sum_{t=1}^T \ell_{I_t,t}\Big] - \min_{i \in [1,K]} \mathbb{E}\Big[\sum_{t=1}^T \ell_{i,t}\Big].$$

By Jensen's inequality, $\overline{R}_T \le \mathbb{E}[R_T]$.

Importance Weighting

In the bandit setting, the cumulative loss of each arm is not observed, so how should we update the probabilities?

Estimates via a surrogate loss:
$$\widetilde{\ell}_{i,t} = \frac{\ell_{i,t}}{p_{i,t}}\, 1_{I_t = i},$$
where $p_t = (p_{1,t}, \ldots, p_{K,t})$ is the probability distribution the player uses at time $t$ to draw an arm ($p_{i,t} > 0$).

Unbiased estimate: for any $i$,
$$\mathbb{E}_{I_t \sim p_t}\big[\widetilde{\ell}_{i,t}\big] = \sum_{j=1}^K p_{j,t}\, \frac{\ell_{i,t}}{p_{i,t}}\, 1_{j = i} = \ell_{i,t}.$$
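A small numeric sketch of this estimator (all names are illustrative): only the pulled arm's loss is observed, yet averaging the surrogate loss over draws $I_t \sim p_t$ recovers the full loss vector.

import numpy as np

def importance_weighted_losses(losses, p, I_t):
    """Surrogate loss: tilde_ell_i = (ell_i / p_i) 1{I_t = i}; only the pulled arm's loss is observed."""
    est = np.zeros_like(losses)
    est[I_t] = losses[I_t] / p[I_t]
    return est

# Monte Carlo check of unbiasedness: E_{I_t ~ p}[tilde_ell] = ell.
rng = np.random.default_rng(0)
losses = np.array([0.2, 0.7, 0.5])
p = np.array([0.5, 0.3, 0.2])
avg = np.mean(
    [importance_weighted_losses(losses, p, rng.choice(len(p), p=p)) for _ in range(200_000)],
    axis=0,
)
print(avg)   # close to [0.2, 0.7, 0.5]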

EXP3

EXP3 (Exponential weights for Exploration and Exploitation) (Auer et al., 2002b):

EXP3(K)
    p_1 ← (1/K, ..., 1/K)
    (L~_{1,0}, ..., L~_{K,0}) ← (0, ..., 0)
    for t ← 1 to T do
        Sample I_t ∼ p_t
        Receive ℓ_{I_t,t}
        for i ← 1 to K do
            ℓ~_{i,t} ← (ℓ_{i,t} / p_{i,t}) 1_{I_t = i}
            L~_{i,t} ← L~_{i,t−1} + ℓ~_{i,t}
        for i ← 1 to K do
            p_{i,t+1} ← exp(−η L~_{i,t}) / Σ_{j=1}^K exp(−η L~_{j,t})
    return p_{T+1}
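A runnable Python sketch of the pseudocode above; the loss_fn callback returning $\ell_{i,t} \in [0,1]$ and the shift by the minimum cumulative loss (for numerical stability) are assumptions made for illustration, not part of the slides.

import numpy as np

def exp3(K, T, eta, loss_fn, seed=0):
    """EXP3: exponential weights over importance-weighted cumulative loss estimates."""
    rng = np.random.default_rng(seed)
    L_tilde = np.zeros(K)                 # estimated cumulative losses L~_{i,t}
    p = np.full(K, 1.0 / K)               # p_1 = (1/K, ..., 1/K)
    for t in range(1, T + 1):
        I_t = rng.choice(K, p=p)          # sample I_t ~ p_t
        loss = loss_fn(t, I_t)            # receive ell_{I_t,t}, the only observed loss
        L_tilde[I_t] += loss / p[I_t]     # importance-weighted update: only arm I_t changes
        w = np.exp(-eta * (L_tilde - L_tilde.min()))   # shift by the minimum for numerical stability
        p = w / w.sum()                   # p_{i,t+1} proportional to exp(-eta * L~_{i,t})
    return p

# Example: K = 5 arms, fixed losses; eta = sqrt(2 log K / (K T)) as in the guarantee that follows.
K, T = 5, 10_000
eta = np.sqrt(2 * np.log(K) / (K * T))
p_final = exp3(K, T, eta, loss_fn=lambda t, i: 0.1 if i == 2 else 0.6)
print(p_final)   # the distribution concentrates on arm 2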

EXP3 Guarantee

Theorem: the pseudo-regret of EXP3 can be bounded as follows:
$$\overline{R}_T \le \frac{\log K}{\eta} + \frac{\eta K T}{2}.$$
Choosing $\eta$ to minimize the bound gives
$$\overline{R}_T \le \sqrt{2 K T \log K}.$$

Proof: similar to that of EG, but we cannot use Hoeffding's inequality since $\widetilde{\ell}_{i,t}$ is unbounded.

Proof

Potential: $\Phi_t = \log \sum_{i=1}^K e^{-\eta \widetilde{L}_{i,t}}$.

Upper bound:
$$\Phi_t - \Phi_{t-1} = \log \frac{\sum_{i=1}^K e^{-\eta \widetilde{L}_{i,t}}}{\sum_{i=1}^K e^{-\eta \widetilde{L}_{i,t-1}}} = \log \frac{\sum_{i=1}^K e^{-\eta \widetilde{L}_{i,t-1}} e^{-\eta \widetilde{\ell}_{i,t}}}{\sum_{i=1}^K e^{-\eta \widetilde{L}_{i,t-1}}} = \log \Big[\mathbb{E}_{i \sim p_t}\big[e^{-\eta \widetilde{\ell}_{i,t}}\big]\Big]$$
$$\le \mathbb{E}_{i \sim p_t}\big[e^{-\eta \widetilde{\ell}_{i,t}}\big] - 1 \qquad (\log x \le x - 1)$$
$$\le \mathbb{E}_{i \sim p_t}\Big[-\eta\, \widetilde{\ell}_{i,t} + \frac{\eta^2}{2}\, \widetilde{\ell}_{i,t}^{\,2}\Big] \qquad \Big(e^{-x} \le 1 - x + \frac{x^2}{2} \text{ for } x \ge 0\Big)$$
$$= -\eta\, \mathbb{E}_{i \sim p_t}\big[\widetilde{\ell}_{i,t}\big] + \frac{\eta^2}{2}\, \mathbb{E}_{i \sim p_t}\bigg[\frac{\ell_{i,t}^2\, 1_{I_t = i}}{p_{i,t}^2}\bigg] = -\eta\, \ell_{I_t,t} + \frac{\eta^2}{2}\, \frac{\ell_{I_t,t}^2}{p_{I_t,t}} \le -\eta\, \ell_{I_t,t} + \frac{\eta^2}{2}\, \frac{1}{p_{I_t,t}}.$$

Proof

Upper bound: summing up the inequalities and taking expectations yields
$$\mathbb{E}[\Phi_T - \Phi_0] \le -\eta\, \mathbb{E}\Big[\sum_{t=1}^T \ell_{I_t,t}\Big] + \mathbb{E}\Big[\sum_{t=1}^T \frac{\eta^2}{2\, p_{I_t,t}}\Big] = -\eta\, \mathbb{E}\Big[\sum_{t=1}^T \ell_{I_t,t}\Big] + \frac{\eta^2 K T}{2},$$
since $\mathbb{E}_{I_t \sim p_t}\big[\frac{1}{p_{I_t,t}}\big] = K$.

Lower bound: for all $j \in [1, K]$,
$$\mathbb{E}[\Phi_T - \Phi_0] = \mathbb{E}\Big[\log \sum_{i=1}^K e^{-\eta \widetilde{L}_{i,T}}\Big] - \log K \ge -\eta\, \mathbb{E}\big[\widetilde{L}_{j,T}\big] - \log K = -\eta\, \mathbb{E}[L_{j,T}] - \log K.$$

Comparison:
$$\forall j \in [1, K],\quad \eta\, \mathbb{E}\Big[\sum_{t=1}^T \ell_{I_t,t}\Big] - \eta\, \mathbb{E}[L_{j,T}] \le \log K + \frac{\eta^2}{2} K T \;\Rightarrow\; \overline{R}_T \le \frac{\log K}{\eta} + \frac{\eta K T}{2}.$$

Notes

When $T$ is not known:

• standard doubling trick.

• or, use $\eta_t = \sqrt{\frac{\log K}{K t}}$; then $\overline{R}_T \le 2\sqrt{K T \log K}$ (see the sketch after this slide).

High probability bounds:

• importance weighting problem: unbounded second moment, $\mathbb{E}_{i \sim p_t}\big[\widetilde{\ell}_{i,t}^{\,2}\big] = \frac{\ell_{I_t,t}^2}{p_{I_t,t}}$ (see (Cortes, Mansour, and MM, 2010)).

• (Auer et al., 2002b): mixing $p_{i,t}$ with a uniform distribution to ensure a lower bound on $p_{i,t}$; but not sufficient for a high probability bound.

• solution: biased estimate $\widetilde{\ell}_{i,t} = \frac{\ell_{i,t}\, 1_{I_t = i} + \beta}{p_{i,t}}$, with a parameter $\beta > 0$ to tune.
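A sketch of the anytime variant mentioned above, where $\eta_t = \sqrt{\log K / (K t)}$ is recomputed at each round so that $T$ need not be known in advance; it mirrors the earlier EXP3 sketch and the same illustrative assumptions apply.

import numpy as np

def exp3_anytime(K, T, loss_fn, seed=0):
    """EXP3 with time-varying learning rate eta_t = sqrt(log K / (K t)); T need not be known in advance."""
    rng = np.random.default_rng(seed)
    L_tilde = np.zeros(K)                  # estimated cumulative losses
    p = np.full(K, 1.0 / K)
    for t in range(1, T + 1):
        I_t = rng.choice(K, p=p)
        L_tilde[I_t] += loss_fn(t, I_t) / p[I_t]
        eta_t = np.sqrt(np.log(K) / (K * t))             # anytime learning rate
        w = np.exp(-eta_t * (L_tilde - L_tilde.min()))
        p = w / w.sum()
    return p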

Lower Bound (Bubeck and Cesa-Bianchi, 2012)

It suffices to prove a lower bound in a stochastic setting for the pseudo-regret (which then also holds for the expected regret).

Theorem: for any $T \ge 1$ and any player strategy, there exists a distribution of losses in $\{0, 1\}$ for which
$$\overline{R}_T \ge \frac{1}{20} \sqrt{K T}.$$

Notes

Bound of EXP3 matches the lower bound modulo a $\log$ term.

Log-free bound: $p_{i,t+1} = \psi(C_t - \widetilde{L}_{i,t})$, where $C_t$ is a constant ensuring $\sum_{i=1}^K p_{i,t+1} = 1$ and $\psi$ is increasing, convex, and twice differentiable over $\mathbb{R}^*_-$ (Audibert and Bubeck, 2010).

• EXP3 coincides with $\psi(x) = e^{\eta x}$.

• log-free bound with $\psi(x) = (-\eta x)^{-q}$ and $q = 2$.

• formulation as mirror descent.

• only in the oblivious case.

References

R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, vol. 27, pp. 1054–1078, 1995.

Jean-Yves Audibert and Sébastien Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, vol. 11, pp. 2635–2686, 2010.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, vol. 47, no. 2–3, pp. 235–256, 2002a.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002b.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In NIPS, 2010.

T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, vol. 6, pp. 4–22, 1985.

R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.

Gilles Stoltz. Incomplete information and internal regret in prediction of individual sequences. Ph.D. thesis, Université Paris-Sud, 2005.

W. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, vol. 25, pp. 285–294, 1933.