
Lecture Notes for MS&E 325: Topics in Stochastic Optimization (Stanford) and CIS 677: Algorithmic Decision Theory and Bayesian Optimization (UPenn)

Ashish Goel, Stanford University
[email protected]

Sudipto Guha, University of Pennsylvania
[email protected]

Winter 2008-09 (Stanford); Spring 2008-09 (UPenn)
Under Construction: Do not Distribute


Chapter 1

Introduction: Class Overview, Markov Decision Processes, and Priors

This class deals with optimization problems where the input comes from a probability distribution, or in some cases, is generated iteratively by an adversary. The first part of the class deals with Algorithmic Decision Theory, where we will study algorithms for designing strategies for making decisions which are provably (near-)optimal, computationally efficient, and use available and acquired data, as well as probabilistic models thereon. This field touches upon statistics, machine learning, combinatorial algorithms, and convex/linear optimization, and some of the results we study are several decades old. The field is seeing a resurgence because of the large scale of data being generated by Internet applications.

The second part of this class will deal with combinatorial optimization problems, such as knapsack, scheduling, routing, network design, and inventory management, given stochastic inputs.

    1.1 Input models and objectives

Consider N alternatives, and assume that you have to choose one alternative during every time step t, where t goes from 0 to ∞. This series of choices is called a strategy; the arm chosen by the strategy at time t is denoted a_t. Generally, the alternatives are called arms, and choosing an alternative is called playing the corresponding arm. Arm i gives a reward of r_i(t) at time t, where r_i(t) may depend on all the past choices made by the strategy. The quantity r_i(t) may be a random variable, with a distribution that is unknown, known, or on which you have some prior beliefs. The quantity r_i(t) may also be chosen adversarially. This gives two broad classes of input models, probabilistic and adversarial, with many important variations. We will study these models in great detail.

There are also several objectives we could have. The first is a finite-horizon objective, where we are given a finite horizon T and the goal is to:

Maximize E[ ∑_{t=0}^{T-1} r_{a_t}(t) ].


The second is the infinite-horizon discounted reward model. Here, we are given a discount factor γ ∈ [0, 1); informally, this is today's value for a Dollar that we will make tomorrow. Think of this as (1 − the interest rate). If you have a long planning horizon, γ should be chosen to be very close to 1. If you are very short-sighted, γ will be close to 0. The objective is:

Maximize E[ ∑_{t=0}^{∞} γ^t r_{a_t}(t) ].

By linearity of expectation, we can exchange the limit and the sum, so that we obtain the equivalent objectives:

Maximize ∑_{t=0}^{T-1} E[r_{a_t}(t)],   and   Maximize ∑_{t=0}^{∞} γ^t E[r_{a_t}(t)].

We will now define r_{a_t}(t) as the expected reward, and get rid of the expectation altogether. This expectation is over both the input (if the input is probabilistic) and the strategy (if the strategy is randomized).

These problems are all collectively called multi-armed bandit problems. Colloquially, a slot machine in a casino is also called a one-armed bandit, since it has one lever, and usually robs you of your money. Hence, a multi-armed bandit is an appropriate name for this class of problems, where we can think of each alternative as one arm of a multi-armed slot machine.

    Remember, our goal in this class is to design these strategies algorithmically.

    1.2 An illustrative example

Suppose you are in Las Vegas for a year, and will go to a casino every day. In the casino there are two slot machines, each of which gives a Dollar as a reward with some unknown probability, which may not be the same for both machines. For the first, there is data available from 3 trials, and one of them was a success (i.e., gave a reward) and two were failures (i.e., gave no reward). For the second machine, there is data available from 2 × 10^8 trials, 10^8 of which gave a reward. We will say that the first machine is a (1, 2) machine: the first component of the tuple refers to the number of successful trials, and the second refers to the number of unsuccessful ones. Thus, the second machine is a (10^8, 10^8) machine.

What would be your best guess of the expected reward from the first machine? Clearly 1/3. For the second machine? Clearly 1/2. Also, it seems clear that the second machine is also less risky. So which machine should you play on the very first day? Surprisingly, it is the first. If you get a reward the first day, the first machine would become a (2, 2) machine and you can play it again.

If you get a reward again, the first machine would become a (3, 2) machine, and would start to look better than the second. If, on the other hand, the first machine does not give you a reward the first day, or the second day, then you can revert to playing the second machine. We will see how to make this statement more formal.

Here, we sacrifice some expected reward (i.e., choose not to exploit the best available machine) on the first day in order to explore an arm with a high upside; this is an example of an exploration-exploitation tradeoff and is a central feature of algorithmic decision theory.
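One way to check such a claim numerically is a brute-force lookahead. The sketch below is not from the notes: the horizon H, the approximation of the second machine by a fixed success probability of 1/2 (its posterior barely moves after one more observation), and the function names are all assumptions made for illustration. It uses dynamic programming over the first machine's (successes, failures) counts to compare the expected total reward of the two possible first-day choices.

```python
from functools import lru_cache

H = 30           # planning horizon in days (an arbitrary illustrative choice)
P2 = 0.5         # treat the (10^8, 10^8) machine as a fixed-probability-1/2 machine

@lru_cache(maxsize=None)
def value(s, f, h):
    """Max expected total reward over h remaining days when the first
    machine has shown s successes and f failures so far."""
    if h == 0:
        return 0.0
    p1 = s / (s + f)
    play1 = p1 * (1.0 + value(s + 1, f, h - 1)) + (1 - p1) * value(s, f + 1, h - 1)
    play2 = P2 + value(s, f, h - 1)      # second machine: fixed 1/2, nothing to learn
    return max(play1, play2)

# Expected totals for the two possible first-day choices, starting from (1, 2).
p1 = 1 / 3
start_with_first = p1 * (1.0 + value(2, 2, H - 1)) + (1 - p1) * value(1, 3, H - 1)
start_with_second = P2 + value(1, 2, H - 1)
print(start_with_first, start_with_second)
```

The gap between the two printed values quantifies the cost or benefit of the first-day exploration for the chosen horizon; the formal treatment of this tradeoff, via the Gittins index, comes in Chapter 2.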


    1.3 Markov Decision Processes

Before starting on the main topics in this class, it is worth seeing one of the most basic and useful tools in stochastic optimization: Markov Decision Processes. When you see a stochastic optimization problem, this is probably the first tool you must try, the second being Bellman's formula for dynamic programming, which we will briefly see later. Only when these are inefficient or inapplicable should you try more advanced approaches such as the ones we are going to see in the rest of this class.

Assume you are given a finite state space S, an initial state s_0, a set of actions A, a reward function r : S × A → ℝ, and a function P : S × A × S → [0, 1] such that ∑_{v ∈ S} P(u, a, v) = 1 for all states u ∈ S and all actions a ∈ A. Informally, an MDP is like a Markov chain, but the transitions happen only when an action is taken, and the transition probabilities depend on the state as well as the action taken. Also, depending on the state u and the action a, we get a reward r(u, a). Of course the reward itself could be a random variable, but as pointed out earlier, we replace it by its expected value in that case.

Given an MDP, you might want to maximize the finite-horizon reward or the infinite-horizon discounted reward. We will focus on the latter for now. The former is tractable as well. Let γ be the discount factor. Let φ(s) be the expected discounted reward obtained by the optimum strategy starting from state s, and let φ(s, a) be the expected discounted reward obtained by the optimum strategy starting from state s assuming that action a is performed. Then the following linear constraints must be satisfied by the φ's:

∀ u ∈ S, a ∈ A :  φ(u) ≥ φ(u, a)   (1.1)
∀ u ∈ S, a ∈ A :  φ(u, a) ≥ r(u, a) + γ ∑_{v ∈ S} P(u, a, v) φ(v)   (1.2)

The optimum solution can now be found by a linear program with the constraints as given above and the linear objective:

Minimize φ(s_0)   (1.3)

Thus, MDPs can be solved very efficiently, provided the state space is small. MDPs can also be defined with countable state spaces, but then the LP formulation above is not directly useful as a solution procedure.
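As a concrete illustration of the LP above, here is a minimal sketch that solves a tiny discounted MDP with scipy's linear-programming routine. The variables φ(u, a) are eliminated by substituting (1.2) into (1.1), and the objective minimizes the sum of the φ(s) rather than just φ(s_0), which also recovers the optimal values; the two-state instance at the bottom is made up for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def solve_discounted_mdp(P, r, gamma):
    """LP for a finite discounted MDP: minimize sum_s phi(s) subject to
    phi(u) >= r(u, a) + gamma * sum_v P(u, a, v) phi(v) for all u, a.

    P[a] is the |S| x |S| transition matrix of action a; r[u, a] is the reward.
    Returns the optimal value function phi.
    """
    n_states = r.shape[0]
    A_ub, b_ub = [], []
    for a, Pa in enumerate(P):
        for u in range(n_states):
            row = gamma * Pa[u].copy()
            row[u] -= 1.0            # gamma * P(u,a,.) phi - phi(u) <= -r(u,a)
            A_ub.append(row)
            b_ub.append(-r[u, a])
    res = linprog(c=np.ones(n_states), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n_states, method="highs")
    return res.x

# Two states, two actions, made-up numbers.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),    # action 0
     np.array([[0.5, 0.5], [0.6, 0.4]])]    # action 1
r = np.array([[0.0, 1.0], [0.5, 0.2]])      # r[u, a]
print(solve_discounted_mdp(P, r, gamma=0.9))
```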

    1.4 Priors and posteriors

Let us revisit the illustrative example. Consider the (1, 2) machine. Given that this is all you know about the machine, what would be a reasonable estimate of the success probability of the machine? It seems reasonable to say 1/3; remember, this is an illustrative example, and there are scenarios in which some other estimate of the probability might make sense. The tuple (1, 2) and the probability estimate (1/3) represent a prior belief on the machine. If you play this machine, and you get a success (which we now believe will happen with probability 1/3), then the machine will become a (2, 2) machine, with a success probability estimate of 1/2; this is called a posterior belief. Of course, if the first trial results in a failure, the posterior would have been (1, 3), with a success probability estimate of 1/4.


This is an example of what are known as Beta priors. We will use (α, β) to roughly¹ denote the number of successes and failures, respectively. We will see these priors in some detail later on, and will refine and motivate the definitions. These are the most important class of priors. Another prior we will use frequently (despite it being trivial) is the fixed prior, where we believe we know the success probability p and it never changes. We will refer to an arm with this prior as a standard arm with probability p.

For the purpose of this class, we can think of a prior as a Markov chain with a countable state space S, transition probability matrix (or kernel) P, a current state u ∈ S, and a reward function r : S → ℝ. Typically, the range of the reward function will be [0, 1] and we will interpret that as the probability of success. Beta priors can then be interpreted as having a state space Z_+ × Z_+, a reward function r(α, β) = α/(α + β), and transition probabilities P((α, β), (α + 1, β)) = α/(α + β) and P((α, β), (α, β + 1)) = β/(α + β) (and 0 elsewhere).

When we perform an experiment on (i.e., play) an arm with prior (S, r, P, u), the state of the arm changes to some v according to the transition probability matrix P. We assume that we observe this change, and we now obtain the posterior (S, r, P, v), which acts as the prior for the next step. For much of this class, we will assume that the prior changes only when an arm is played.
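The Markov-chain view of a Beta prior is easy to simulate. Below is a minimal sketch (the class name and its methods are invented for illustration, not part of the notes): the state is the pair (α, β), the reward function is r(α, β) = α/(α + β), and a play moves to (α + 1, β) with probability α/(α + β) and to (α, β + 1) otherwise, exactly the transition kernel described above.

```python
import random

class BetaArm:
    """A Beta prior viewed as a Markov chain on states (alpha, beta)."""

    def __init__(self, alpha=1, beta=1):
        self.alpha, self.beta = alpha, beta

    def mean(self):
        # r(alpha, beta): the current success-probability estimate.
        return self.alpha / (self.alpha + self.beta)

    def play(self):
        """Play the arm once; transition to the posterior state and
        return the observed reward (1 = success, 0 = failure)."""
        if random.random() < self.mean():
            self.alpha += 1   # move to (alpha + 1, beta) with prob alpha/(alpha+beta)
            return 1
        self.beta += 1        # move to (alpha, beta + 1) with prob beta/(alpha+beta)
        return 0

arm = BetaArm(1, 2)           # the (1, 2) machine from Section 1.2
print(arm.mean())             # 1/3
arm.play()
print((arm.alpha, arm.beta), arm.mean())
```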

How do we know we have the correct prior? Where does a prior come from? On one level, these are philosophical questions for which there is no formal answer; since this is a class in optimization, we can assume that the priors are given to us, that they are reliable, and we will optimize assuming they are correct. On another level, the priors often come from a generative model, i.e., from some knowledge of the underlying process. For example, we might know (or believe) that a slot machine gives Bernoulli i.i.d. rewards, with an unknown probability parameter p. We might further make the Bayesian assumption that the probability that this parameter is p, given the number of successes and failures we have observed, is proportional to the probability that we would get the observed number of successes and failures if the parameter were p. This leads to the Beta priors, as we will discuss later. Of course this just brings up the philosophical question of whether the Bayesian assumption is valid or not. In this class, we will be agnostic with respect to this question. Given a prior, we will attempt to obtain optimum strategies with respect to that prior. But we will also discuss the scenario where the rewards are adversarial.

¹More precisely, α − 1 and β − 1 denote the number of successes and failures, respectively.


Chapter 2

Discounted Multi-Armed Bandits and the Gittins Index

We will now study the problem of maximizing the expected discounted infinite-horizon reward, given a discount factor γ, and given n arms, with the i-th arm having a prior (S_i, r_i, P_i, u_i). Our goal is to maximize the expected discounted reward by playing exactly one arm in every time step. The state u_i is observable, an arm makes a transition according to P_i when it is played, and this transition takes exactly one time unit, during which no other arm can be played.

This problem is useful in many domains including advertising, marketing, clinical trials, oil exploration, etc. Since S_i, r_i, P_i remain fixed as the prior evolves, the state of the system at any time is described by (u_1, u_2, . . . , u_n). If the state spaces of the individual arms are finite and of size k each, the state space of the system can be of size O(k^n), which precludes a direct dynamic programming or MDP based approach. We will still be able to solve this efficiently using a striking and beautiful theorem, due to Gittins and Jones. We will assume for now that the range of each reward function is [0, 1]. Also recall that γ ∈ [0, 1).

Theorem 2.1 [The Gittins index theorem] Given a discount factor γ, there exists a function g from the space of all priors to [0, 1/(1 − γ)] such that it is an optimum strategy to play an arm i for which g(S_i, r_i, P_i, u_i) is the largest.

This theorem is remarkable in terms of how sweeping it is. Notice that the function g, also called the Gittins index, depends only on one arm, and is completely oblivious to how many other arms there are in the system and what their priors are. For common priors such as the Beta priors, one can purchase or pre-compute the Gittins index for various values of α, β, and then the optimum strategy can be implemented using a simple table-lookup based process, much simpler than an MDP or a dynamic program over the joint space of all the arms. This theorem is existential, but as it turns out, the very existence of the Gittins index leads to an efficient algorithm for computing it. We will outline a method for finite state spaces.

    2.1 Computing the Gittins index

First, recall the definition of the standard arm with success probability p; this arm, denoted R_p, always gives a reward with probability p. Given two standard arms R_p and R_q, where p < q, which


would you rather play? Clearly, R_q. Thus, the Gittins index of R_q must be higher than that of R_p. R_1 dominates all possible arms, and R_0 is dominated by all possible arms. The arm R_p yields a total discounted profit of p/(1 − γ), by summing up the infinite geometric progression p, γp, γ²p, . . . .

Given an arm J with prior (S, r, P, u), there must be a standard arm R_p such that, given these two arms, the optimum strategy is indifferent between playing J or R_p in the first step. The Gittins index of arm J must then be the same as the Gittins index of arm R_p, and hence any strictly increasing function of p can be used as the Gittins index. In this class, we will use g = p/(1 − γ). The goal then is merely to find the arm R_p given arm J.

This is easily accomplished using a simple LP. Consider the MDP with finite state space S and action space (a_J, a_R), where the first action corresponds to playing arm J and the second corresponds to playing arm R_p, in which case we will always play R_p since nothing changes in the next step. The first action yields reward r(u) when in state u, whereas the second yields the reward x = p/(1 − γ). The first action leads to a transition according to the matrix P, whereas the second leads to termination of the process. The optimum solution to this MDP is given by:

Minimize φ(u), subject to: (a) ∀ s ∈ S, φ(s) ≥ x, and (b) ∀ s ∈ S, φ(s) ≥ r(s) + γ ∑_{v ∈ S} P(s, v) φ(v).

If this objective function is bigger than x, then it must be better to play arm J in state u. Let the optimum objective function value from this LP be denoted z(x). Our goal is to find the smallest x such that z(x) = x (which will denote the point of indifference between R_p and J). We can obtain this by performing a binary search over x: if z(x) = x then the Gittins index can not be larger than x, and if z(x) > x then the Gittins index must be larger than x. In fact, this can also be obtained from the LP:

Minimize x, subject to: (a) φ(u) ≤ x, (b) ∀ s ∈ S, φ(s) ≥ x, and (c) ∀ s ∈ S, φ(s) ≥ r(s) + γ ∑_{v ∈ S} P(s, v) φ(v).

A spreadsheet for approximately computing the Gittins index for arms with Beta priors is available at http://www.stanford.edu/~ashishg/msande325_09/gittins_index.xls.
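For intuition, here is a small sketch of the binary-search idea (this is an independent illustration, not the spreadsheet's method; the truncation of the Beta state space, the iteration counts, and all names are assumptions). For a fixed x it evaluates z(x) by value iteration on the "keep playing J, or retire for x" recursion, then searches for the smallest x with z(x) = x.

```python
import numpy as np

def gittins_index(r, P, u, gamma, tol=1e-6):
    """Binary search for g = p/(1 - gamma) at state u of an arm with
    finite state space, reward vector r, and transition matrix P."""
    n = len(r)
    lo, hi = 0.0, 1.0 / (1.0 - gamma)        # the index lies in [0, 1/(1-gamma)]
    while hi - lo > tol:
        x = (lo + hi) / 2.0
        V = np.full(n, x)
        for _ in range(500):                 # value iteration for z(x)
            V = np.maximum(x, r + gamma * (P @ V))
        if V[u] > x + tol:                   # z(x) > x: the index must exceed x
            lo = x
        else:                                # z(x) = x: the index is at most x
            hi = x
    return lo

# Beta-prior arm, truncated to states (alpha, beta) with alpha + beta <= MAX.
# (Truncation is a crude approximation; the true state space is countable.)
gamma, MAX = 0.9, 40
states = [(a, b) for a in range(1, MAX) for b in range(1, MAX) if a + b <= MAX]
index = {s: i for i, s in enumerate(states)}
r = np.array([a / (a + b) for a, b in states])
P = np.zeros((len(states), len(states)))
for (a, b), i in index.items():
    if (a + 1, b) in index: P[i, index[(a + 1, b)]] = a / (a + b)
    if (a, b + 1) in index: P[i, index[(a, b + 1)]] = b / (a + b)
    # transitions that would leave the truncated space are dropped

g = gittins_index(r, P, index[(1, 2)], gamma)
print(g, g * (1 - gamma))   # the index g = p/(1-gamma) and the indifference probability p
```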

Exercise 2.1 In this problem, we will assume that three advertisers have made bids on the same keyword in a search engine. The search engine (which acts as the auctioneer) assigns (α, β) priors to each advertiser, and uses their Gittins indices to compute the winner. The advertisers are:

1. Advertiser (a) has α = 2, β = 5, has bid $1 per click, and has no budget constraint.

2. Advertiser (b) has α = 1, β = 4, pays $0.2 per impression and additionally $1 if his ad is clicked. He has no budget constraint.

3. Advertiser (c) has α = 1, β = 2, has bid $1.5 per click, and his ad can only be shown 5 times (including this one).

There is a single slot, the discount factor is γ = 0.95, and a first price auction is used. Compute the Gittins index for each of the three advertisers. Which ad should the auctioneer allocate the slot to? Briefly speculate on what might be a reasonable second price auction.


    2.2 A proof of the Gittins index theorem

    We will follow the proof of Tsitsiklis; the paper is on the class web-page.


Chapter 3

Bayesian Updates, Beta Priors, and Martingales

As mentioned before, priors often come from a parametrized generative model, followed by Bayesian updates. We will call such priors Bayesian. We will assume that the generative model is a family of single-parameter distributions, parametrized by θ. Let f_θ denote the probability distribution on the reward if the underlying parameter of the generative model is θ. If we knew θ, the prior would be trivial. At time t, the prior is denoted as a probability distribution p_t on the parameter θ. Let x_t denote the reward obtained the t-th time an arm is played (i.e., the observation at time t). We are going to assume there exist suitable probability measures over which we can integrate the functions f_θ and p_t.

A Bayesian update essentially says that the posterior probability at time t (i.e., the prior for time t + 1) of the parameter being θ, given an observation x_t, is proportional to the probability of the observation being x_t given parameter θ. This, of course, is modulated by the probability of the parameter being θ at time t. Hence, p_{t+1}(θ | x_t) is proportional to f_θ(x_t) p_t(θ). We have to normalize this to make p_{t+1} a probability distribution, which gives us

p_{t+1}(θ | x_t) = f_θ(x_t) p_t(θ) / ∫ f_{θ'}(x_t) p_t(θ') dθ'.
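A minimal numerical sketch of this update, assuming a Bernoulli generative model and a discretized grid over θ (the grid, the function name, and the example observations are illustrative choices, not part of the notes):

```python
import numpy as np

def bayes_update(theta, p_t, x_t):
    """Return the posterior weights p_{t+1}(theta | x_t) on the grid."""
    likelihood = theta if x_t == 1 else 1.0 - theta  # f_theta(x_t) for Bernoulli rewards
    unnorm = likelihood * p_t
    return unnorm / unnorm.sum()                     # normalizing integral as a grid sum

theta = np.linspace(0.001, 0.999, 999)
p = np.ones_like(theta) / len(theta)   # uniform prior over theta (alpha = beta = 1)
for x in [1, 0, 0]:                    # one success, two failures
    p = bayes_update(theta, p, x)
print((theta * p).sum())               # posterior mean, close to the Beta(2, 3) mean 2/5
```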

    3.1 Beta priors

Recall that the prior Beta(α_t, β_t) corresponds to having observed α_t − 1 successes and β_t − 1 failures up to time t. We will show that we can also interpret the prior Beta(α_t, β_t) as one that comes from the generative model of Bernoulli distributions with Bayesian updates, where the parameter θ corresponds to the "on" probability of the Bernoulli distribution.

Suppose α_0 = 1 and β_0 = 1, i.e., we have observed 0 successes and 0 failures initially. It seems natural to have this correspond to having a uniform prior on θ over the range [0, 1]. Applying the Bayes rule repeatedly, we get that p_t(θ | α_t, β_t) is proportional to the probability of observing α_t − 1 successes and β_t − 1 failures if the underlying parameter is θ, i.e., proportional to

(α_t + β_t − 2 choose α_t − 1) θ^{α_t − 1} (1 − θ)^{β_t − 1}.


Normalizing, and using the fact that ∫_{x=0}^{1} x^{a−1} (1 − x)^{b−1} dx = Γ(a)Γ(b)/Γ(a + b), we get

p_t(θ) = [Γ(α_t + β_t) / (Γ(α_t)Γ(β_t))] θ^{α_t − 1} (1 − θ)^{β_t − 1}.

This is known as the Beta distribution. The following exercise completes the proof that the transition matrix over state spaces defined earlier is the same as the Bayesian-update-based prior derived above.

Exercise 3.1 Given the prior p_t as defined above, show that the probability of obtaining a reward at time t is α_t/(α_t + β_t).

    3.2 Bayesian priors and martingales

We will now show that Bayesian updates result in a martingale process for the rewards. Consider the case where the prior can be given both over a state space (S, r, P, y) as well as by a generative model with Bayesian updates. Let s_t denote the state at time t and let r_t denote the expected reward in state s_t. We will show that the sequence of rewards r_t is a martingale with respect to the sequence of states s_t. This fact will come in handy multiple times later in this class.

Formally, the claim that the sequence of rewards r_t is a martingale with respect to the sequence of states s_t is the same as saying that

E[r_{t+1} | s_0, s_1, . . . , s_t] = r_t.

We will use x to denote the observed reward at time t and y to denote the observed reward at time t + 1. Thinking of the prior as coming from a generative model, we get

r_t = ∫_x x ∫_θ p_t(θ) f_θ(x) dθ dx,

where p_t depends only on s_t. Similarly, we get¹

E[r_{t+1} | s_t] = ∫_y y ∫_{θ'} f_{θ'}(y) p_{t+1}(θ' | s_t) dθ' dy.

Using Bayesian updates, we get

p_{t+1}(θ' | s_t, x) = f_{θ'}(x) p_t(θ') / ∫_{θ''} f_{θ''}(x) p_t(θ'') dθ''.

In order to remove the conditioning over x we need to integrate over x, i.e.,

p_{t+1}(θ' | s_t) = ∫_x p_{t+1}(θ' | s_t, x) ∫_θ p_t(θ) f_θ(x) dθ dx.

Combining, we obtain:

E[r_{t+1} | s_t] = ∫_y y ∫_{θ'} f_{θ'}(y) ∫_x [ f_{θ'}(x) p_t(θ') / ∫_{θ''} f_{θ''}(x) p_t(θ'') dθ'' ] ∫_θ p_t(θ) f_θ(x) dθ dx dθ' dy.

The integrals over θ and θ'' cancel out, giving

E[r_{t+1} | s_t] = ∫_y y ∫_{θ'} f_{θ'}(y) ∫_x f_{θ'}(x) p_t(θ') dx dθ' dy.

Since f_{θ'} is a probability distribution, the inner-most integral evaluates to p_t(θ'), giving

E[r_{t+1} | s_t] = ∫_y y ∫_{θ'} f_{θ'}(y) p_t(θ') dθ' dy.

This is the same as r_t.
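For the Beta prior specifically, the martingale identity can also be checked directly: with the transition probabilities from Section 1.4, E[r_{t+1} | (α, β)] = (α/(α+β))·(α+1)/(α+β+1) + (β/(α+β))·α/(α+β+1) = α/(α+β). A tiny exact-arithmetic verification of this identity (purely illustrative, not part of the notes):

```python
from fractions import Fraction

def reward(a, b):
    # r(alpha, beta) = alpha / (alpha + beta)
    return Fraction(a, a + b)

# Success moves (a, b) -> (a+1, b) w.p. a/(a+b); failure -> (a, b+1) w.p. b/(a+b).
for a in range(1, 8):
    for b in range(1, 8):
        p = reward(a, b)
        expected_next = p * reward(a + 1, b) + (1 - p) * reward(a, b + 1)
        assert expected_next == reward(a, b)   # E[r_{t+1} | s_t] = r_t exactly
print("martingale identity verified on the tested states")
```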

Exercise 3.2 Give an example of a prior over a state space such that this prior can not be obtained from any generative model using Bayesian updates. Prove your claim.

¹Here θ, θ', θ'' are just different symbols; they are not derivatives or second derivatives of θ.


Exercise 3.3 Extra experiments can not hurt: The budgeted learning problem is defined as follows: You are given n arms with a separate prior (S_i, r_i, P_i, u_i) on each arm. You are allowed to make T plays. At the end of the T plays, you must pick a single arm i, and you will earn the expected reward of the chosen arm at that time. Let z denote the expected reward obtained by the optimum strategy for this problem. Show that z is non-decreasing in T if the priors are Bayesian.

    The next two exercises illustrate how surprising the Gittins index theorem is.

Exercise 3.4 Show that the budgeted learning problem does not admit an index-based solution, preferably using Bayesian priors as counter-examples. Hint: Define arms A, B such that given A, B the optimum choice is to play A, whereas given A and two copies of B, the optimum choice is to play B. Hence, there can not be a total order on all priors.

Exercise 3.5 Show that the finite-horizon multi-armed bandit problem does not admit an index-based solution, preferably using Bayesian priors as counter-examples.


Chapter 4

Minimizing Regret Against Unknown Distributions

The algorithm in this chapter is based on the paper: Finite-time Analysis of the Multiarmed Bandit Problem. P. Auer, N. Cesa-Bianchi, and P. Fischer, http://www.springerlink.com/content/l7v1647363415h1t/.

The Gittins index is efficiently computable, decouples different arms, and gives an optimum solution. There is one problem, of course: it assumes a prior, and is optimum only in the class of strategies which have no additional information. We will now make the problem one level more complex. We will assume that the reward for each arm comes from an unknown probability distribution. Consider N arms. Let X_{i,s} denote the reward obtained when the i-th arm is played for the s-th time. We will assume that the random variables X_{i,s} are independent of each other, and we will further assume that for any arm i, the variables X_{i,s} are identically distributed. No other assumptions will be necessary. We will assume that these distributions are generated by an adversary who knows the strategy we are going to employ (but not any random coin tosses we may make). Let μ_i = E[X_{i,s}] be the expected reward each time arm i is played. Let i* denote the arm with the highest expected reward, μ* = μ_{i*}. Let Δ_i = μ* − μ_i denote the difference in the expected reward of the optimal arm and arm i.

A strategy must choose an arm to play during each time step. Let I_t denote the arm chosen at time t and let k_i(t) denote the number of times arm i is played during the first t steps.

Ideally, we would like to maximize the total reward obtained over all time horizons T simultaneously. Needless to say, there is no hope of achieving this: the adversary may randomly choose a special arm j, make all X_{j,s} equal to 1 deterministically, and for all i ≠ j, make all X_{i,s} = 0 deterministically. For any strategy, there is a probability of at least half that the strategy will not play arm j in the first N/2 steps, and hence the expected profit of any strategy is at most N/4 over the first N/2 steps, whereas the optimum profit if we knew the distribution would be N/2.

Instead, we can set a more achievable goal. Define the regret of a strategy at time T to be the total difference between the optimal reward over the first T steps and the reward of the strategy over the same period. The expected regret is then given by

E[Regret(T)] = μ*T − ∑_{t=1}^{T} E[μ_{I_t}] = ∑_{i : Δ_i > 0} Δ_i E[k_i(T)].


In a classic paper, Lai and Robbins showed that this expected regret can be no less than

( ∑_{i=1}^{N} Δ_i / D(X_i || X_{i*}) ) log T.

Here D(X_i || X_{i*}) is the Kullback-Leibler divergence (also known as the information divergence) of the distribution of X_i with respect to the distribution of X_{i*}, and is given by ∫ f_i ln(f_i / f_{i*}), where the integral is over some suitable underlying measure space. Surprisingly, they showed that this lower bound can be matched for special classes of distributions asymptotically, i.e., as T → ∞. We are going to see an even more surprising result, due to Auer, Cesa-Bianchi, and Fischer, where a very similar bound is achieved simultaneously for all T and for all distributions with support in [0, 1]. We will skip the proof of correctness since that is clearly specified in the paper (with a different notation), but will describe the algorithm, state their main theorem, and see a useful balancing trick.
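As a quick illustration of the constant in the Lai-Robbins bound, the snippet below evaluates the Bernoulli KL divergence and the resulting coefficient of log T for two made-up arms (the means 0.4 and 0.6 are arbitrary assumptions, not from the notes):

```python
import math

def kl_bernoulli(p, q):
    """D(Bernoulli(p) || Bernoulli(q))."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

mu_star, mu_i = 0.6, 0.4
delta_i = mu_star - mu_i
print(kl_bernoulli(mu_i, mu_star))             # divergence of arm i from the best arm
print(delta_i / kl_bernoulli(mu_i, mu_star))   # its contribution to the log T coefficient
```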

The algorithm, which they call UCB1, assigns an upper confidence bound to each arm. Let X̄_i(t) denote the average reward obtained from all the times arm i was played up to and including time t. Let c_{t,s} = √(2 ln t / s) denote a confidence interval. Then, at the end of time t, assign the following index, called the upper-confidence index and denoted U_i(t), to arm i:

U_i(t) = X̄_i(t) + c_{t, k_i(t)}.

During the first N steps, each arm is played exactly once, in some arbitrary order. After that, the algorithm repeatedly plays the arm with the highest upper-confidence index, i.e., at time t + 1, it plays the arm with the highest index U_i(t). Ties are broken arbitrarily.
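A minimal sketch of UCB1 under these rules (the reward functions, horizon, and helper names below are illustrative assumptions; rewards are assumed to lie in [0, 1]):

```python
import math
import random

def ucb1(arms, T):
    """Play T rounds of UCB1; arms[i]() returns a reward in [0, 1]."""
    N = len(arms)
    counts = [0] * N      # k_i(t): number of plays of arm i so far
    means = [0.0] * N     # X_bar_i(t): empirical average reward of arm i
    total = 0.0
    for t in range(1, T + 1):
        if t <= N:
            i = t - 1     # play each arm once during the first N steps
        else:
            i = max(range(N),
                    key=lambda j: means[j] + math.sqrt(2 * math.log(t) / counts[j]))
        x = arms[i]()
        counts[i] += 1
        means[i] += (x - means[i]) / counts[i]
        total += x
    return total

# Two Bernoulli arms with (unknown to the algorithm) means 0.4 and 0.6.
arms = [lambda: float(random.random() < 0.4),
        lambda: float(random.random() < 0.6)]
print(ucb1(arms, 10000))
```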

This strikingly simple rule leads to the following powerful theorem (proof scribed by Pranav Dandekar):

Theorem 4.1 The expected number of times arm i is played up to time T, E[k_i(T)], is at most (8 ln T)/Δ_i² + c, where c is some fixed constant independent of the distributions, T, or the number of arms.

Proof: In the first N steps, the algorithm will play each arm once. Therefore, we have

∀i, k_i(T) = 1 + ∑_{t=N+1}^{T} 1{I_t = i},

where 1{I_t = i} is an indicator variable which takes value 1 if arm i is played at time t and 0 otherwise. For any integer l ≥ 1, we can similarly write

∀i, k_i(T) ≤ l + ∑_{t=N+1}^{T} 1{I_t = i ∧ k_i(t − 1) ≥ l}.

Let

i* = argmax_i μ_i,
μ* = μ_{i*},
U*(t) = U_{i*}(t),
X̄*(t) = X̄_{i*}(t),
k*(t) = k_{i*}(t).


If arm i was played at time t, this implies its score, U_i(t − 1), was at least the score of the arm with the highest mean, U*(t − 1) (note that this is a necessary but not sufficient condition). Therefore we have

∀i, k_i(T) ≤ l + ∑_{t=N+1}^{T} 1{ U_i(t − 1) ≥ U*(t − 1) ∧ k_i(t − 1) ≥ l }

∀i, k_i(T) ≤ l + ∑_{t=N+1}^{T} 1{ max_{l ≤ s_i < t} ( X̄_i(s_i) + c_{t−1, s_i} ) ≥ min_{0 < s < t} ( X̄*(s) + c_{t−1, s} ) },

where X̄_i(s) denotes the average reward over the first s plays of arm i.


where c is a constant. Plugging this into the expression for the regret, we get an expected regret of O( (ln T) ∑_{i : Δ_i > 0} 1/Δ_i ).


Chapter 5

Minimizing Regret in the Partial Information Model

Scribed by Michael Kapralov. Based primarily on the paper: The Nonstochastic Multiarmed Bandit Problem. P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. SIAM J. on Computing, 32(1), 48-77.

We use x_i(t) ∈ [0, 1] to denote the reward obtained from playing arm i at time t, but now we assume that the values x_i(t) are determined by an adversary and need not come from a fixed distribution. We assume that the adversary chooses a value for x_i(t) at the beginning of time t. The algorithm must then choose an arm I_t to play, possibly using some random coin tosses. The algorithm then receives profit x_{I_t}(t). It is easy to see that any deterministic solution gets revenue 0, so any solution with regret bounds needs to be randomized. It is important that the adversary does not see the outcomes of the random coin tosses by the algorithm. This is called the partial information model, since only the profit from the chosen arm is revealed. We will compare the profit obtained by an algorithm to the profit obtained by the best arm in hindsight.

Our algorithm maintains weights w_i(t) ≥ 0, where t is the time step. We set w_i(0) := 1 for all i. Denote W(t) := ∑_{i=1}^{N} w_i(t). At time t, arm i is chosen with probability

p_i(t) = (1 − γ) w_i(t)/W(t) + γ/N,

where γ > 0 is a parameter that will be assigned a value later. We denote the index of the arm played at time t by I_t.

We define the random variable x̂_i(t) as

x̂_i(t) = x_i(t)/p_i(t) if arm i was chosen, and 0 otherwise.

Note that E[x̂_i(t)] = x_i(t). We can now define the update rule for the weights w_i(t):

w_i(t + 1) = w_i(t) exp( (γ/N) x̂_i(t) ).
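A minimal sketch of this exponential-weights scheme (the reward function, the horizon, and the final choice of γ below are illustrative assumptions; the tuning of γ is the one derived at the end of this chapter):

```python
import math
import random

def exp3(reward_fn, N, T, gamma):
    """One run of the exponential-weights algorithm described above.

    reward_fn(i, t) returns x_i(t) in [0, 1] for the arm actually played;
    gamma in (0, 1] is the exploration parameter.
    Returns the total reward collected.
    """
    w = [1.0] * N                                     # w_i(0) = 1
    total = 0.0
    for t in range(T):
        W = sum(w)
        p = [(1 - gamma) * w[i] / W + gamma / N for i in range(N)]
        i = random.choices(range(N), weights=p)[0]    # play I_t = i with prob p_i(t)
        x = reward_fn(i, t)
        total += x
        xhat = x / p[i]                               # importance-weighted estimate
        w[i] *= math.exp(gamma * xhat / N)            # only the played arm's weight moves
    return total

# Toy adversarial-ish rewards (hypothetical): arm 1 is best most of the time.
rewards = lambda i, t: 1.0 if (i == 1 and t % 3 != 0) else 0.0
N, T = 4, 10000
gamma = min(1.0, math.sqrt(N * math.log(N) / (2 * T)))   # tuning from (5.11), capped at 1
print(exp3(rewards, N, T, gamma))
```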

The regret after T steps is

Regret[T] = max_i ∑_{t=1}^{T} E[x_i(t)] − ∑_{t=1}^{T} E[x_{I_t}(t)].


Note that Regret[T] can be as large as T, i.e., linear, but we will be able to tune γ for a fixed value of T to obtain sublinear regret. In order to handle unbounded T we can play using a fixed γ for some time and then adjust γ.

We will use the following facts, which follow from the definition of x̂_i(t):

x̂_i(t) = x_i(t)/p_i(t) ≤ N/γ   (5.1)

∑_{i=1}^{N} p_i(t) x̂_i(t) = x_{I_t}(t)   (5.2)

∑_{i=1}^{N} p_i(t) (x̂_i(t))² = ∑_{i=1}^{N} x_i(t) x̂_i(t) ≤ ∑_{i=1}^{N} x̂_i(t).   (5.3)

We have

W(t + 1)/W(t) = ∑_{i=1}^{N} w_i(t + 1) / W(t) = ∑_{i=1}^{N} w_i(t) exp( (γ/N) x̂_i(t) ) / W(t)
≤ ∑_{i=1}^{N} w_i(t)/W(t) + ∑_{i=1}^{N} ( w_i(t)/W(t) ) [ (γ/N) x̂_i(t) + (γ²/N²) (x̂_i(t))² ]
= 1 + ∑_{i=1}^{N} ( w_i(t)/W(t) ) [ (γ/N) x̂_i(t) + (γ²/N²) (x̂_i(t))² ].

We applied the inequality exp(x) ≤ 1 + x + x², x ∈ [0, 1], to exp( (γ/N) x̂_i(t) ) (justified by (5.1)) and used the definition of W(t) to substitute the first sum with 1.

Since p_i(t) ≥ (1 − γ) w_i(t)/W(t), we have w_i(t)/W(t) ≤ p_i(t)/(1 − γ) − γ/(N(1 − γ)). Using this estimate together with (5.2) and (5.3), we get

W(t + 1)/W(t) ≤ 1 + ∑_{i=1}^{N} ( p_i(t)/(1 − γ) ) [ (γ/N) x̂_i(t) + (γ²/N²) (x̂_i(t))² ]
≤ 1 + ( γ/(N(1 − γ)) ) x_{I_t}(t) + ( γ²/(N²(1 − γ)) ) ∑_{i=1}^{N} x̂_i(t).

We now take logarithms of both sides and sum over t from 1 to T. Note that the lhs telescopes, and after applying the inequality log(1 + x) ≤ x to the rhs we get

log( W(T)/W(0) ) ≤ ∑_{t=1}^{T} [ ( γ/(N(1 − γ)) ) x_{I_t}(t) + ( γ²/(N²(1 − γ)) ) ∑_{j=1}^{N} x̂_j(t) ]
= ( 1/(1 − γ) ) [ (γ/N) ∑_{t=1}^{T} x_{I_t}(t) + (γ²/N²) ∑_{j=1}^{N} ∑_{t=1}^{T} x̂_j(t) ].

We denote the reward obtained by the algorithm by G = ∑_{t=1}^{T} x_{I_t}(t) and the optimal reward by G* = max_i ∑_{t=1}^{T} x_i(t). Using the fact that E[G*] ≥ E[ ∑_{t=1}^{T} x̂_i(t) ] for every i and that W(0) = N, we get

E[ log( W(T)/N ) ] ≤ ( 1/(1 − γ) ) [ (γ/N) E[G] + (γ²/N) E[G*] ].   (5.4)


On the other hand, since w_j(T) = exp( ∑_{t=1}^{T} (γ/N) x̂_j(t) ), we have

log( W(T)/N ) ≥ log( w_j(T)/N ) = ∑_{t=1}^{T} (γ/N) x̂_j(t) − log N.

Using the fact that E[x̂_j(t)] = x_j(t) and setting j = argmax_{1 ≤ j ≤ N} ∑_{t=1}^{T} x_j(t), we get

E[ log( W(T)/N ) ] ≥ (γ/N) E[G*] − log N.   (5.5)

Putting (5.4) and (5.5) together, we get

(γ/N) E[G*] − log N ≤ ( γ/(N(1 − γ)) ) E[G] + ( γ²/(N(1 − γ)) ) E[G*].   (5.6)

This implies that

(1 − γ) E[G*] − E[G] ≤ γ E[G*] + (1 − γ) (N log N)/γ,   (5.7)

i.e.,

E[G*] − E[G] ≤ 2γ E[G*] + (N log N)/γ.   (5.8)

To balance the first two terms, we set γ := √( (N log N) / (2 E[G*]) ) and get

E[G*] − E[G] ≤ 2 √( 2 N log N · E[G*] ).   (5.9)

Since G* ≤ T, we also have

E[G*] − E[G] ≤ 2γT + (N log N)/γ,   (5.10)

and setting γ := √( (N log N) / (2T) ) yields

E[G*] − E[G] ≤ 2 √( 2NT log N ).   (5.11)

Exercise 5.1 This is the second straight algorithm we have seen that has a regret that depends on T. Unlike the previous algorithm, this one requires knowledge of T. Present a technique that converts an algorithm which achieves a regret of O(f(N)√T) for any given T into one that achieves a regret of O(f(N)√T) for all T.

Exercise 5.2 Imagine now that there are M distinct types of customers. During each time step, you are told which type of customer you are dealing with. You must show the customer one product (equivalent to playing an arm), which the customer will either purchase or discard. If the customer purchases the product, then you make some amount between 0 and 1. The regret is computed relative to the best product choice for each customer type, in hindsight. Present an algorithm that achieves regret O(√(MNT log N)) against an adversary. Prove your result. What lower bound can you deduce from material that has been covered in class or pointed out in the reading list?


Exercise 5.3 Designed by Bahman Bahmani. Assume a seller with an unlimited supply of a good is sequentially selling copies of the good to n buyers, each of whom is interested in at most one copy of the good and has a private valuation for the good which is a number in [0, 1]. At each instance, the seller offers a price to the current buyer, and the buyer will buy the good if the offered price is less than or equal to his private valuation.

Assume the buyers' valuations are i.i.d. samples from a fixed but unknown (to the seller) distribution with cdf F(x) = Pr(valuation ≤ x). Define D(x) = 1 − F(x) and f(x) = xD(x). In this problem, we will prove that if f(x) has a unique global maximum x* in (0, 1) and f(x*) − f(x) ≤ C²/K²

d) Prove that, using a good choice of K and some of the results on MAB discussed in class, the seller can achieve O(√(n log n)) regret.


Chapter 6

The Full Information Model, along with Linear Generalizations

This chapter is based on the paper Efficient algorithms for online decision problems by A. Kalai and S. Vempala, http://people.cs.uchicago.edu/~kalai/papers/onlineopt/onlineopt.pdf.

Exercise 6.1 Read the algorithm by Zinkevich, Online convex programming and generalized infinitesimal gradient ascent, http://www.cs.ualberta.ca/~maz/publications/ICML03.pdf. How would you apply his technique and proof in a black-box fashion to the simplest linear generalization of the multi-armed bandit problem in the full information model (e.g., as described in the paper by Kalai and Vempala)? Note that Zinkevich requires a convex decision space, whereas Kalai and Vempala assume that the decision space is arbitrary, possibly just a set of points.

Exercise 6.2 Modify the analysis of UCB1 to show that in the full information model, playing the arm with the best average return so far has statistical regret O(√(T(log T + log N))) (i.e., regret against a distribution).
