
Advanced Machine Learning

MEHRYAR MOHRI, MOHRI@
Courant Institute & Google Research

Bandit Problems

Multi-Armed Bandit Problem

Problem: which arm of a K-slot machine should a gambler pull to maximize his cumulative reward over a sequence of trials?

• stochastic setting.

• adversarial setting.

Motivation

Clinical trials: potential treatments for a disease to select from, with a new patient or patient category at each round (Thompson, 1933).

Ads placement: selection of an ad to display, out of a finite set (which could vary with time), for each new web page visitor.

Adaptive routing: alternative paths for routing packets through a "series of tubes", or alternative roads for driving from a source to a destination.

Games: different moves at each round of a game such as chess or Go.

Key Problem

Exploration vs. exploitation dilemma (or trade-off):

• inspect new arms with possibly better rewards.

• use existing information to select the best arm.

Outline

Stochastic bandits

Adversarial bandits

Stochastic Model

K arms: for each arm i \in \{1, \ldots, K\},

• reward distribution P_i.

• reward mean \mu_i.

• gap to best: \Delta_i = \mu^* - \mu_i, where \mu^* = \max_{i \in [1,K]} \mu_i.

Bandit Setting

For t = 1 to T do

• player selects action I_t \in \{1, \ldots, K\} (randomized).

• player receives reward X_{I_t,t} \sim P_{I_t}.

Equivalent descriptions:

• on-line learning with partial information (\neq full information).

• one-state MDPs (Markov Decision Processes).
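To make the protocol concrete, here is a minimal Python sketch of the stochastic setting (Python, the BernoulliBandit and play names, and the policy interface are illustrative, not part of the slides); any of the strategies discussed below can be plugged in as policy:

import numpy as np

class BernoulliBandit:
    """Stochastic K-armed bandit: arm i has reward distribution
    P_i = Bernoulli(mu_i)."""
    def __init__(self, means, seed=0):
        self.means = np.asarray(means)           # mu_1, ..., mu_K
        self.rng = np.random.default_rng(seed)

    def pull(self, i):
        # X_{I_t,t} ~ P_{I_t}
        return float(self.rng.random() < self.means[i])

def play(bandit, policy, T):
    """Bandit protocol: at each round t the player selects an arm and
    observes only that arm's reward (partial information)."""
    total = 0.0
    for t in range(1, T + 1):
        i = policy.select(t)
        x = bandit.pull(i)
        policy.update(i, x)
        total += x
    return total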

Objectives

Expected regret:

E[R_T] = E\Big[\max_{i \in [1,K]} \sum_{t=1}^T X_{i,t} - \sum_{t=1}^T X_{I_t,t}\Big].

Pseudo-regret:

\bar{R}_T = \max_{i \in [1,K]} E\Big[\sum_{t=1}^T X_{i,t} - \sum_{t=1}^T X_{I_t,t}\Big] = \mu^* T - E\Big[\sum_{t=1}^T X_{I_t,t}\Big].

By Jensen's inequality (the maximum of expectations is at most the expectation of the maximum), \bar{R}_T \leq E[R_T].

Expected Regret

If the X_{i,t}s take values in [-r, +r], then

E\Big[\max_{i \in [1,K]} \sum_{t=1}^T (X_{i,t} - \mu^*)\Big] \leq E\Big[\max_{i \in [1,K]} \sum_{t=1}^T (X_{i,t} - \mu_i)\Big] \leq r \sqrt{2 T \log K}.

The O(\sqrt{T}) dependency cannot be improved; better guarantees can be achieved for the pseudo-regret.
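The r\sqrt{2T\log K} bound follows from a standard maximal inequality; a sketch, assuming the X_{i,t} are independent over t for each arm. Write S_i = \sum_{t=1}^T (X_{i,t} - \mu_i); then for any \lambda > 0,

E\Big[\max_{i \in [1,K]} S_i\Big] \leq \frac{1}{\lambda} \log \sum_{i=1}^K E\big[e^{\lambda S_i}\big] \leq \frac{1}{\lambda}\Big(\log K + \frac{\lambda^2 r^2 T}{2}\Big),

using Jensen's inequality and Hoeffding's lemma E[e^{\lambda(X - E[X])}] \leq e^{\lambda^2 (2r)^2 / 8} for each term; choosing \lambda = \sqrt{2 \log K / (r^2 T)} gives r\sqrt{2T\log K}.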

Pseudo-Regret

Expression in terms of the \Delta_is:

\bar{R}_T = \sum_{i=1}^K E[T_i(T)] \, \Delta_i,

where T_i(t) denotes the number of times arm i was pulled up to time t: T_i(t) = \sum_{s=1}^t 1_{I_s=i}.

Proof: since X_{i,t} is independent of I_t (the player's choice at time t depends only on past observations), the expectation factorizes in the fourth equality below:

\bar{R}_T = \mu^* T - E\Big[\sum_{t=1}^T X_{I_t,t}\Big] = E\Big[\sum_{t=1}^T (\mu^* - X_{I_t,t})\Big]
= E\Big[\sum_{t=1}^T \sum_{i=1}^K (\mu^* - X_{i,t}) 1_{I_t=i}\Big] = \sum_{t=1}^T \sum_{i=1}^K E[\mu^* - X_{i,t}] \, E[1_{I_t=i}]
= \sum_{i=1}^K (\mu^* - \mu_i) \, E\Big[\sum_{t=1}^T 1_{I_t=i}\Big] = \sum_{i=1}^K E[T_i(T)] \, \Delta_i.
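As a quick numerical illustration of this identity (a self-contained sketch; the arm means are arbitrary illustrative values): for the uniformly random policy, E[T_i(T)] = T/K, so the identity predicts \bar{R}_T = (T/K) \sum_i \Delta_i.

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.9, 0.6, 0.5])            # arm means; mu* = 0.9
K, T, runs = len(mu), 1000, 2000
gaps = mu.max() - mu                       # Delta_i

total = 0.0
for _ in range(runs):
    arms = rng.integers(K, size=T)         # uniformly random I_t
    rewards = rng.random(T) < mu[arms]     # X_{I_t,t} ~ Bernoulli(mu_{I_t})
    total += mu.max() * T - rewards.sum()

print(total / runs)          # Monte Carlo estimate of the pseudo-regret
print(T / K * gaps.sum())    # prediction from the identity: ~233.3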

ε-Greedy Strategy

(Auer et al. 2002a)

At time t,

• with probability 1 - \epsilon_t, select the arm with the best empirical mean.

• with probability \epsilon_t, select a random arm (see the sketch below).

For \epsilon_t = \min\big(\frac{6K}{\Delta^2 t}, 1\big), with \Delta = \min_{i : \Delta_i > 0} \Delta_i:

• for t \geq \frac{6K}{\Delta^2}, \Pr[I_t \neq i^*] \leq \frac{C}{\Delta^2 t} for some C > 0.

• thus, E[T_i(T)] \leq \frac{C}{\Delta^2} \log T and \bar{R}_T \leq \sum_{i : \Delta_i > 0} \frac{C \Delta_i}{\Delta^2} \log T.

Logarithmic regret, but:

• requires knowledge of \Delta.

• sub-optimal arms treated similarly (naive search).
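A minimal Python sketch of the strategy (the class name is illustrative; note that \Delta must be supplied, which is exactly the drawback pointed out above); it matches the policy interface of the earlier protocol sketch:

import numpy as np

class EpsilonGreedy:
    """epsilon-greedy with the annealed schedule
    eps_t = min(6K / (Delta^2 t), 1)."""
    def __init__(self, K, Delta, seed=0):
        self.K, self.Delta = K, Delta
        self.counts = np.zeros(K)     # T_i(t)
        self.sums = np.zeros(K)       # cumulative reward of each arm
        self.rng = np.random.default_rng(seed)

    def select(self, t):
        eps_t = min(6 * self.K / (self.Delta ** 2 * t), 1.0)
        if self.rng.random() < eps_t:
            return int(self.rng.integers(self.K))   # explore: random arm
        # exploit: arm with best empirical mean (unpulled arms first)
        means = np.where(self.counts > 0,
                         self.sums / np.maximum(self.counts, 1), np.inf)
        return int(np.argmax(means))

    def update(self, i, x):
        self.counts[i] += 1
        self.sums[i] += x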

UCB Strategy

(Lai and Robbins, 1985; Agrawal, 1995; Auer et al. 2002a)

Optimism in the face of uncertainty:

• at each time t \in [1, T], compute an upper confidence bound UCB_i on the expected reward \mu_i of each arm i \in [1, K].

• select the arm with the largest UCB.

Idea: a wrong arm cannot be selected for too long:

• by definition (with high probability), \mu_i \leq \mu^* \leq UCB_{i^*}.

• pulling arm i often brings UCB_i closer to \mu_i.

Note on Concentration Inequalities

Let X be a random variable such that for all t \geq 0,

\log E\big[e^{t(X - E[X])}\big] \leq \psi(t),

where \psi is a convex function. For Hoeffding's inequality and X \in [a, b], \psi(t) = \frac{t^2 (b-a)^2}{8}.

Then, by Markov's inequality applied to e^{t(X - E[X])},

\Pr[X - E[X] > \epsilon] = \Pr\big[e^{t(X - E[X])} > e^{t\epsilon}\big] \leq \inf_{t > 0} e^{-t\epsilon} E\big[e^{t(X - E[X])}\big] \leq \inf_{t > 0} e^{-t\epsilon} e^{\psi(t)} = e^{-\sup_{t > 0}(t\epsilon - \psi(t))} = e^{-\psi^*(\epsilon)},

where \psi^* denotes the Legendre transform of \psi.
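As a worked instance of the Legendre transform above, in the Hoeffding case \psi(t) = \frac{t^2 (b-a)^2}{8} the supremum is attained at t = \frac{4\epsilon}{(b-a)^2}, giving

\psi^*(\epsilon) = \sup_{t > 0}\Big(t\epsilon - \frac{t^2 (b-a)^2}{8}\Big) = \frac{2\epsilon^2}{(b-a)^2},

that is, \Pr[X - E[X] > \epsilon] \leq e^{-2\epsilon^2/(b-a)^2}; for X \in [0, 1], this recovers the \psi^*(\epsilon) = 2\epsilon^2 used below.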

UCB Strategy

Average reward estimate for arm i by time t:

\hat{\mu}_{i,t} = \frac{1}{T_i(t)} \sum_{s=1}^t X_{i,s} 1_{I_s=i}.

Concentration inequality (e.g., Hoeffding's inequality): after t draws from arm i,

\Pr\Big[\mu_i - \frac{1}{t} \sum_{s=1}^t X_{i,s} > \epsilon\Big] \leq e^{-t \psi^*(\epsilon)}.

Thus, for any \delta > 0, with probability at least 1 - \delta,

\mu_i < \frac{1}{t} \sum_{s=1}^t X_{i,s} + \psi^{*-1}\Big(\frac{1}{t} \log \frac{1}{\delta}\Big).

(α, ψ)-UCB Strategy

Parameter \alpha > 0; the (\alpha, \psi)-UCB strategy consists of selecting at time t (see the sketch below)

I_t \in \argmax_{i \in [1,K]} \Big[\hat{\mu}_{i,t-1} + \psi^{*-1}\Big(\frac{\alpha \log t}{T_i(t-1)}\Big)\Big].
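A minimal Python sketch of this rule for the Hoeffding case \psi^*(\epsilon) = 2\epsilon^2, where \psi^{*-1}(x) = \sqrt{x/2} (class name illustrative; rewards assumed in [0, 1]):

import math
import numpy as np

class AlphaUCB:
    """(alpha, psi)-UCB specialized to the Hoeffding case
    psi*(eps) = 2 eps^2, i.e. psi*^{-1}(x) = sqrt(x / 2)."""
    def __init__(self, K, alpha=2.5):
        self.K, self.alpha = K, alpha
        self.counts = np.zeros(K)   # T_i(t - 1)
        self.sums = np.zeros(K)     # cumulative reward of each arm

    def select(self, t):
        # Pull each arm once before applying the UCB rule.
        for i in range(self.K):
            if self.counts[i] == 0:
                return i
        ucb = (self.sums / self.counts
               + np.sqrt(self.alpha * math.log(t) / (2 * self.counts)))
        return int(np.argmax(ucb))

    def update(self, i, x):
        self.counts[i] += 1
        self.sums[i] += x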

(α, ψ)-UCB Guarantee

Theorem: for \alpha > 2, the pseudo-regret of (\alpha, \psi)-UCB satisfies

\bar{R}_T \leq \sum_{i : \Delta_i > 0} \Big(\frac{\alpha \Delta_i}{\psi^*(\Delta_i / 2)} \log T + \frac{\alpha}{\alpha - 2}\Big).

• for Hoeffding's lemma, \psi^*(\epsilon) = 2\epsilon^2, this is \alpha-UCB (Auer et al. 2002a):

\bar{R}_T \leq \sum_{i : \Delta_i > 0} \Big(\frac{2\alpha}{\Delta_i} \log T + \frac{\alpha}{\alpha - 2}\Big).

Proof

Lemma: for any s \geq 0,

\sum_{t=1}^T 1_{I_t=i} \leq s + \sum_{t=s+1}^T 1_{I_t=i} 1_{T_i(t-1) \geq s}.

Proof: observe that

\sum_{t=1}^T 1_{I_t=i} = \sum_{t=1}^T 1_{I_t=i} 1_{T_i(t-1) < s} + \sum_{t=1}^T 1_{I_t=i} 1_{T_i(t-1) \geq s},

where the second sum can start at t = s + 1, since T_i(t-1) \geq s requires t - 1 \geq s.

• Now, for t^* = \max\{t \leq T : 1_{T_i(t-1) < s} \neq 0\},

\sum_{t=1}^T 1_{I_t=i} 1_{T_i(t-1) < s} = \sum_{t=1}^{t^*} 1_{I_t=i} 1_{T_i(t-1) < s}.

• By definition of t^*, the number of non-zero terms in this sum is at most s.

Proof

For any i and t, define \eta_{i,t-1} = \psi^{*-1}\big(\frac{\alpha \log t}{T_i(t-1)}\big). At time t, if arm i is selected, then

(\hat{\mu}_{i,t-1} + \eta_{i,t-1}) - (\hat{\mu}_{i^*,t-1} + \eta_{i^*,t-1}) \geq 0
\iff [\hat{\mu}_{i,t-1} - \mu_i - \eta_{i,t-1}] + [2\eta_{i,t-1} - \Delta_i] + [\mu^* - \hat{\mu}_{i^*,t-1} - \eta_{i^*,t-1}] \geq 0.

Thus, at least one of these three terms is non-negative; in particular, when the middle term is negative, at least one of the other two must be non-negative.

Proof

To bound the pseudo-regret, we bound E[T_i(T)]. Observe first that

T_i(t-1) \geq s = \Big\lceil \frac{\alpha \log T}{\psi^*(\Delta_i/2)} \Big\rceil \geq \frac{\alpha \log t}{\psi^*(\Delta_i/2)} \implies \Delta_i - 2\eta_{i,t-1} \geq 0,

since then \eta_{i,t-1} = \psi^{*-1}\big(\frac{\alpha \log t}{T_i(t-1)}\big) \leq \frac{\Delta_i}{2}. Thus, by the lemma and the decomposition of the previous slide,

E[T_i(T)] = E\Big[\sum_{t=1}^T 1_{I_t=i}\Big] \leq s + E\Big[\sum_{t=s+1}^T 1_{I_t=i} 1_{T_i(t-1) \geq s}\Big]
\leq s + \sum_{t=s+1}^T \Pr[\hat{\mu}_{i,t-1} - \mu_i - \eta_{i,t-1} \geq 0] + \Pr[\mu^* - \hat{\mu}_{i^*,t-1} - \eta_{i^*,t-1} \geq 0].

Proof

Each of the two probability terms can be bounded as follows using the union bound:

\Pr[\mu^* - \hat{\mu}_{i^*,t-1} - \eta_{i^*,t-1} \geq 0]
\leq \Pr\Big[\exists s \in [1, t] : \mu^* - \hat{\mu}_{i^*,s} - \psi^{*-1}\Big(\frac{\alpha \log t}{s}\Big) \geq 0\Big]
\leq \sum_{s=1}^t \frac{1}{t^\alpha} = \frac{1}{t^{\alpha - 1}},

since, by the concentration inequality, each term of the union is at most e^{-s \cdot \frac{\alpha \log t}{s}} = t^{-\alpha}.

The final constant of the bound is obtained by further simple calculations.

Lower Bound

(Lai and Robbins, 1985)

Theorem: for any strategy such that E[T_i(T)] = o(T^a) for any sub-optimal arm i and any a > 0, for any set of Bernoulli reward distributions, the following holds for all Bernoulli reward distributions:

\liminf_{T \to +\infty} \frac{\bar{R}_T}{\log T} \geq \sum_{i : \Delta_i > 0} \frac{\Delta_i}{D(\mu_i \,\|\, \mu^*)}.

• a more general result holds for general distributions.

Notes

Observe that

\sum_{i : \Delta_i > 0} \frac{\Delta_i}{D(\mu_i \,\|\, \mu^*)} \geq \mu^*(1 - \mu^*) \sum_{i : \Delta_i > 0} \frac{1}{\Delta_i},

since, using \log x \leq x - 1,

D(\mu_i \,\|\, \mu^*) = \mu_i \log \frac{\mu_i}{\mu^*} + (1 - \mu_i) \log \frac{1 - \mu_i}{1 - \mu^*}
\leq \mu_i \frac{\mu_i - \mu^*}{\mu^*} + (1 - \mu_i) \frac{\mu^* - \mu_i}{1 - \mu^*}
= \frac{(\mu_i - \mu^*)^2}{\mu^*(1 - \mu^*)} = \frac{\Delta_i^2}{\mu^*(1 - \mu^*)}.

Outline

Stochastic bandits

Adversarial bandits

Adversarial Model

K arms: for each arm i \in \{1, \ldots, K\},

• no stochastic assumption.

• rewards in [0, 1].

Bandit Setting

For t = 1 to T do

• player selects action I_t \in \{1, \ldots, K\} (randomized).

• player receives reward x_{I_t,t}.

Notes:

• the rewards x_{i,t} of all arms are determined by the adversary simultaneously with the selection of the arm I_t by the player.

• the adversary may be oblivious or non-oblivious (adaptive).

• deterministic strategies suffer a regret of at least T/2 on some (bad) sequences; thus randomized strategies must be considered.

Scenarios

Oblivious case:

• the adversary's rewards are selected independently of the player's actions; thus, the reward vector at time t is only a function of t.

Non-oblivious case:

• the adversary's rewards at time t are a function of the player's past actions I_1, \ldots, I_{t-1}.

• the notion of regret is problematic: the cumulative reward is compared to a quantity that depends on the player's actions! (the single best action in hindsight is a function of the actions I_1, \ldots, I_T actually played; playing that single "best" action throughout could have resulted in different rewards.)

Objectives

Minimize the regret (with losses \ell_{i,t} = 1 - x_{i,t}), in expectation or with high probability:

R_T = \max_{i \in [1,K]} \sum_{t=1}^T x_{i,t} - \sum_{t=1}^T x_{I_t,t} = \sum_{t=1}^T \ell_{I_t,t} - \min_{i \in [1,K]} \sum_{t=1}^T \ell_{i,t}.

Pseudo-regret:

\bar{R}_T = E\Big[\sum_{t=1}^T \ell_{I_t,t}\Big] - \min_{i \in [1,K]} E\Big[\sum_{t=1}^T \ell_{i,t}\Big].

By Jensen's inequality, \bar{R}_T \leq E[R_T].

Importance Weighting

In the bandit setting, the cumulative loss of each arm is not observed, so how should we update the probabilities?

Estimates via a surrogate loss:

\tilde{\ell}_{i,t} = \frac{\ell_{i,t}}{p_{i,t}} 1_{I_t=i},

where p_t = (p_{1,t}, \ldots, p_{K,t}) is the probability distribution the player uses at time t to draw an arm (p_{i,t} > 0).

Unbiased estimate: for any i,

E_{I_t \sim p_t}[\tilde{\ell}_{i,t}] = \sum_{j=1}^K p_{j,t} \frac{\ell_{i,t}}{p_{i,t}} 1_{j=i} = \ell_{i,t}.
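A quick self-contained numerical check of this unbiasedness (the distribution p and the losses are arbitrary illustrative values):

import numpy as np

rng = np.random.default_rng(0)
K = 4
p = np.array([0.1, 0.2, 0.3, 0.4])    # p_t, all components positive
losses = rng.random(K)                # ell_{i,t} in [0, 1]

n = 200_000
draws = rng.choice(K, size=n, p=p)    # I_t ~ p_t
est = np.array([np.mean((draws == i) * losses[i] / p[i]) for i in range(K)])

print(np.round(est, 3))               # ~ losses: the estimate is unbiased
print(np.round(losses, 3))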

EXP3

EXP3 (Exponential weights for Exploration and Exploitation)

(Auer et al. 2002b)

EXP3(K)
    p_1 <- (1/K, ..., 1/K)
    (\tilde{L}_{1,0}, ..., \tilde{L}_{K,0}) <- (0, ..., 0)
    for t <- 1 to T do
        Sample(I_t ~ p_t)
        Receive(\ell_{I_t,t})
        for i <- 1 to K do
            \tilde{\ell}_{i,t} <- (\ell_{i,t} / p_{i,t}) 1_{I_t=i}
            \tilde{L}_{i,t} <- \tilde{L}_{i,t-1} + \tilde{\ell}_{i,t}
        for i <- 1 to K do
            p_{i,t+1} <- e^{-\eta \tilde{L}_{i,t}} / \sum_{j=1}^K e^{-\eta \tilde{L}_{j,t}}
    return p_{T+1}
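A direct Python transcription of this pseudocode (a sketch: the function name and the get_loss callable, which returns \ell_{I_t,t} \in [0, 1] for the pulled arm, are illustrative; the cumulative estimates are shifted by their minimum purely for numerical stability, which leaves p_t unchanged):

import numpy as np

def exp3(K, T, eta, get_loss, seed=0):
    """EXP3: exponential weights over importance-weighted loss estimates."""
    rng = np.random.default_rng(seed)
    L = np.zeros(K)                            # tilde L_{i,t}
    def probs():
        w = np.exp(-eta * (L - L.min()))       # stabilized weights
        return w / w.sum()
    for t in range(1, T + 1):
        p = probs()
        i = rng.choice(K, p=p)                 # I_t ~ p_t
        L[i] += get_loss(i, t) / p[i]          # tilde ell_{i,t} for i = I_t
    return probs()                             # p_{T+1}

# Illustrative run against fixed Bernoulli losses (oblivious adversary):
rng = np.random.default_rng(1)
means = [0.6, 0.5, 0.3]
K, T = 3, 20_000
eta = np.sqrt(2 * np.log(K) / (K * T))    # rate minimizing the bound below
print(exp3(K, T, eta, lambda i, t: float(rng.random() < means[i])))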

EXP3 Guarantee

Theorem: the pseudo-regret of EXP3 can be bounded as follows:

\bar{R}_T \leq \frac{\log K}{\eta} + \frac{\eta K T}{2}.

Choosing \eta = \sqrt{2 \log K / (K T)} to minimize the bound gives

\bar{R}_T \leq \sqrt{2 K T \log K}.

Proof: similar to that of EG, but we cannot use Hoeffding's inequality since \tilde{\ell}_{i,t} is unbounded.

Proof

Potential: \Phi_t = \log \sum_{i=1}^K e^{-\eta \tilde{L}_{i,t}}.

Upper bound:

\Phi_t - \Phi_{t-1} = \log \frac{\sum_{i=1}^K e^{-\eta \tilde{L}_{i,t}}}{\sum_{i=1}^K e^{-\eta \tilde{L}_{i,t-1}}} = \log \frac{\sum_{i=1}^K e^{-\eta \tilde{L}_{i,t-1}} e^{-\eta \tilde{\ell}_{i,t}}}{\sum_{i=1}^K e^{-\eta \tilde{L}_{i,t-1}}} = \log E_{i \sim p_t}\big[e^{-\eta \tilde{\ell}_{i,t}}\big]
\leq E_{i \sim p_t}\big[e^{-\eta \tilde{\ell}_{i,t}}\big] - 1 \qquad (\log x \leq x - 1)
\leq E_{i \sim p_t}\Big[-\eta \tilde{\ell}_{i,t} + \frac{\eta^2}{2} \tilde{\ell}_{i,t}^2\Big] \qquad (e^{-x} \leq 1 - x + \frac{x^2}{2} \text{ for } x \geq 0)
= -\eta E_{i \sim p_t}[\tilde{\ell}_{i,t}] + \frac{\eta^2}{2} E_{i \sim p_t}\Big[\frac{\ell_{i,t}^2 1_{I_t=i}}{p_{i,t}^2}\Big]
= -\eta \ell_{I_t,t} + \frac{\eta^2}{2} \frac{\ell_{I_t,t}^2}{p_{I_t,t}} \leq -\eta \ell_{I_t,t} + \frac{\eta^2}{2} \frac{1}{p_{I_t,t}}.

Proof

Upper bound: summing up these inequalities over t and taking expectations yields

E[\Phi_T - \Phi_0] \leq -\eta E\Big[\sum_{t=1}^T \ell_{I_t,t}\Big] + E\Big[\sum_{t=1}^T \frac{\eta^2}{2 p_{I_t,t}}\Big] = -\eta E\Big[\sum_{t=1}^T \ell_{I_t,t}\Big] + \frac{\eta^2 K T}{2},

since E_{I_t \sim p_t}[1 / p_{I_t,t}] = \sum_{i=1}^K p_{i,t} \cdot \frac{1}{p_{i,t}} = K.

Lower bound: for all j \in [1, K],

E[\Phi_T - \Phi_0] = E\Big[\log \sum_{i=1}^K e^{-\eta \tilde{L}_{i,T}}\Big] - \log K \geq -\eta E[\tilde{L}_{j,T}] - \log K = -\eta E[L_{j,T}] - \log K.

Comparison: for all j \in [1, K],

\eta E\Big[\sum_{t=1}^T \ell_{I_t,t}\Big] - \eta E[L_{j,T}] \leq \log K + \frac{\eta^2}{2} K T \implies \bar{R}_T \leq \frac{\log K}{\eta} + \frac{\eta K T}{2}.

Notes

When T is not known:

• standard doubling trick.

• or, use the time-varying rate \eta_t = \sqrt{\log K / (K t)}; then \bar{R}_T \leq 2\sqrt{K T \log K} (see the sketch below).

High-probability bounds:

• importance weighting problem: unbounded second moment (see (Cortes, Mansour, MM, 2010)): E_{i \sim p_t}[\tilde{\ell}_{i,t}^2] = \ell_{I_t,t}^2 / p_{I_t,t}.

• (Auer et al., 2002b): mixing p_{i,t} with a uniform distribution to ensure a lower bound on p_{i,t}; but not sufficient for a high-probability bound.

• solution: the biased estimate \tilde{\ell}_{i,t} = \frac{\ell_{i,t} 1_{I_t=i} + \beta}{p_{i,t}}, with a parameter \beta > 0 to tune.
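A sketch of the corresponding anytime variant (same assumptions and naming conventions as the EXP3 sketch above; only the rate changes):

import numpy as np

def exp3_anytime(K, T, get_loss, seed=0):
    """EXP3 with the time-varying rate eta_t = sqrt(log K / (K t)):
    no knowledge of the horizon T is needed."""
    rng = np.random.default_rng(seed)
    L = np.zeros(K)
    p = np.full(K, 1.0 / K)
    for t in range(1, T + 1):
        eta_t = np.sqrt(np.log(K) / (K * t))
        w = np.exp(-eta_t * (L - L.min()))     # stabilized weights
        p = w / w.sum()
        i = rng.choice(K, p=p)
        L[i] += get_loss(i, t) / p[i]
    return p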

Lower Bound

(Bubeck and Cesa-Bianchi, 2012)

It suffices to prove the lower bound in a stochastic setting and for the pseudo-regret (it then also holds for the expected regret, since \bar{R}_T \leq E[R_T]).

Theorem: for any T \geq 1 and any player strategy, there exists a distribution of losses in \{0, 1\} for which

\bar{R}_T \geq \frac{1}{20} \sqrt{K T}.

Notes

The bound for EXP3 matches the lower bound modulo a log term.

Log-free bound (Audibert and Bubeck, 2010): select p_{i,t+1} = \psi(C_t - \tilde{L}_{i,t}), where C_t is a constant ensuring \sum_{i=1}^K p_{i,t+1} = 1 and \psi is increasing, convex, and twice differentiable over \mathbb{R}^*_-.

• EXP3 coincides with \psi(x) = e^{\eta x}.

• log-free bound with \psi(x) = (-\eta x)^{-q} and q = 2.

• formulation as mirror descent.

• only in the oblivious case.

References

R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, vol. 27, pp. 1054–1078, 1995.

Jean-Yves Audibert and Sébastien Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, vol. 11, pp. 2635–2686, 2010.

Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, vol. 47, no. 2–3, pp. 235–256, 2002a.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002b.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In NIPS, 2010.

T. L. Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, vol. 6, pp. 4–22, 1985.

Gilles Stoltz. Incomplete information and internal regret in prediction of individual sequences. Ph.D. thesis, Université Paris-Sud, 2005.

R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.

William R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, vol. 25, pp. 285–294, 1933.