
Page 1: Multi-armed Bandit Problems with Dependent Arms


Multi-armed Bandit Problems with Dependent Arms

Sandeep Pandey ([email protected])

Deepayan Chakrabarti ([email protected])

Deepak Agarwal ([email protected])

Page 2: Multi-armed Bandit Problems with Dependent Arms


Background: Bandits

Bandit “arms”

μ1, μ2, μ3 (unknown reward probabilities)

Pull arms sequentially so as to maximize the total expected reward

• Show ads on a webpage to maximize clicks

• Product recommendation to maximize sales
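The later slides repeatedly rely on "a traditional bandit policy" for this setting. As one concrete, hedged illustration (not taken from the slides), here is a minimal UCB1 sketch (Auer et al., 2002) for independent Bernoulli arms; the class layout and the simulated reward probabilities are illustrative only.

```python
import math
import random

class UCB1:
    """Minimal UCB1 policy (Auer et al., 2002) for independent Bernoulli arms."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # number of pulls of each arm
        self.values = [0.0] * n_arms  # empirical mean reward of each arm

    def select_arm(self):
        # Pull each arm once before applying the confidence bound.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        # Index = empirical mean + exploration bonus; pick the largest.
        return max(range(len(self.counts)),
                   key=lambda a: self.values[a]
                   + math.sqrt(2 * math.log(total) / self.counts[a]))

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental update of the running mean reward.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Simulated run on three arms with unknown reward probabilities mu1, mu2, mu3
mu = [0.3, 0.28, 1e-6]
policy, rng = UCB1(len(mu)), random.Random(0)
for t in range(10000):
    arm = policy.select_arm()
    policy.update(arm, 1 if rng.random() < mu[arm] else 0)
```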

Page 3: Multi-armed Bandit Problems with Dependent Arms


Dependent Arms

Reward probabilities μi are generally assumed to be independent of each other.

What if they are dependent? E.g., ads on similar topics, using similar text/phrases, should have similar rewards:

Example ads: "Skiing, snowboarding", "Skiing, snowshoes", "Snowshoe rental" vs. "Get Vonage!" — the snow-related ads have similar reward probabilities (μ = 0.3, 0.28, 0.31), while the unrelated ad has μ ≈ 10⁻⁶.

Page 4: Multi-armed Bandit Problems with Dependent Arms


Dependent Arms

Reward probabilities μi are generally assumed to be independent of each other.

What if they are dependent? E.g., ads on similar topics, using similar text/phrases, should have similar rewards. A click on one ad suggests that other "similar" ads may generate clicks as well.

Can we increase total reward using this dependency?

Page 5: Multi-armed Bandit Problems with Dependent Arms


Cluster Model of Dependence

[Figure: arms 1–4 grouped into Cluster 1 and Cluster 2]

μi ~ f(π[i]), where f is some known distribution and π[i] is the unknown cluster-specific parameter of the cluster containing arm i

Successes si ~ Bin(ni, μi), where ni is the number of pulls of arm i
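A minimal simulation sketch of this generative model, assuming for illustration that f(π) is a Beta distribution with mean π (the slides only say f is some known distribution); the cluster parameters, cluster sizes, and pull counts below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_cluster_model(cluster_params, arms_per_cluster, n_pulls, concentration=10.0):
    """Sample (mu_i, s_i) for each arm under the cluster model of dependence.

    cluster_params:   cluster-specific parameters pi[c] (unknown in practice;
                      given here so we can simulate)
    arms_per_cluster: number of arms in each cluster
    n_pulls:          number of pulls n_i given to every arm
    """
    mus, successes = [], []
    for pi, n_arms in zip(cluster_params, arms_per_cluster):
        # mu_i ~ f(pi): a Beta with mean pi (an illustrative choice of f)
        mu = rng.beta(concentration * pi, concentration * (1 - pi), size=n_arms)
        # s_i ~ Bin(n_i, mu_i)
        s = rng.binomial(n_pulls, mu)
        mus.append(mu)
        successes.append(s)
    return mus, successes

mus, successes = sample_cluster_model([0.3, 0.05], [2, 2], n_pulls=100)
```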

Page 6: Multi-armed Bandit Problems with Dependent Arms


Cluster Model of Dependence

[Figure: Cluster 1 (arms 1, 2) with μi ~ f(π1); Cluster 2 (arms 3, 4) with μi ~ f(π2)]

Total reward:

Discounted: ∑_{t=0}^∞ αᵗ·E[R(t)], where α is the discounting factor

Undiscounted: ∑_{t=0}^T E[R(t)]

Page 7: Multi-armed Bandit Problems with Dependent Arms


Discounted Reward

[Figure: separate belief-state MDPs for cluster 1 (arms 1, 2) and cluster 2 (arms 3, 4); pulling an arm (e.g., Arm 1 or Arm 3) transitions only the state of that arm's own cluster MDP]

The optimal policy can be computed using per-cluster MDPs only.

Optimal Policy:

• Compute an (“index”, arm) pair for each cluster

• Pick the cluster with the largest index, and pull the corresponding arm

Page 8: Multi-armed Bandit Problems with Dependent Arms


Discounted Reward

[Figure: the same per-cluster belief-state MDPs as on the previous slide]

The optimal policy can be computed using per-cluster MDPs only.

Optimal Policy:

• Compute an (“index”, arm) pair for each cluster

• Pick the cluster with the largest index, and pull the corresponding arm

• Reduces the problem to smaller state spaces

• Reduces to Gittins’ Theorem [1979] for independent bandits

• Approximation bounds on the index for k-step lookahead
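The slides do not spell out how the per-cluster index is computed. Purely to illustrate the k-step lookahead idea in the last bullet, here is a hedged sketch that scores one cluster by a k-step lookahead value over a simplified belief-state MDP with independent Beta(1,1) posteriors per arm (this ignores the within-cluster coupling of the actual model) and uses that value as a stand-in for the index.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def lookahead(state, k, alpha=0.9):
    """k-step lookahead value of one cluster's belief-state MDP (sketch).

    state: tuple of (successes, failures) counts, one pair per arm in the
    cluster, i.e. a Beta(1,1) posterior per arm. Arms are treated as
    independent here, a simplification of the cluster-coupled model.
    Returns (value, best_arm_to_pull).
    """
    if k == 0:
        return 0.0, None
    best_value, best_arm = -1.0, None
    for i, (s, f) in enumerate(state):
        p = (s + 1) / (s + f + 2)  # posterior mean reward of arm i
        win = state[:i] + ((s + 1, f),) + state[i + 1:]
        lose = state[:i] + ((s, f + 1),) + state[i + 1:]
        v = (p * (1 + alpha * lookahead(win, k - 1, alpha)[0])
             + (1 - p) * alpha * lookahead(lose, k - 1, alpha)[0])
        if v > best_value:
            best_value, best_arm = v, i
    return best_value, best_arm

# Treat the k-step value of each cluster as its "index": pick the cluster
# with the largest index and pull that cluster's best arm.
cluster_states = [((2, 1), (0, 0)), ((1, 4), (3, 3))]  # made-up counts
indices = [lookahead(s, 3) for s in cluster_states]
best_cluster = max(range(len(indices)), key=lambda c: indices[c][0])
index_value, arm_to_pull = indices[best_cluster]
```

Gittins' theorem guarantees that an exact index policy is optimal for independent arms under geometric discounting; the k-step value above is only an approximation, consistent with the approximation bounds mentioned in the bullets.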

Page 9: Multi-armed Bandit Problems with Dependent Arms


Cluster Model of Dependence

[Figure: Cluster 1 (arms 1, 2) with μi ~ f(π1); Cluster 2 (arms 3, 4) with μi ~ f(π2)]

Total reward:

Discounted: ∑_{t=0}^∞ αᵗ·E[R(t)], where α is the discounting factor

Undiscounted: ∑_{t=0}^T E[R(t)]

Page 10: Multi-armed Bandit Problems with Dependent Arms


Undiscounted Reward

[Figure: arms 1 and 2 grouped into "cluster arm" 1; arms 3 and 4 grouped into "cluster arm" 2]

All arms in a cluster are similar ⇒ they can be grouped into one hypothetical "cluster arm".

Page 11: Multi-armed Bandit Problems with Dependent Arms


Undiscounted Reward

[Figure: arms 1 and 2 form "cluster arm" 1; arms 3 and 4 form "cluster arm" 2]

Two-Level Policy

In each iteration:

• Pick a "cluster arm" using a traditional bandit policy

• Pick an arm within that cluster using a traditional bandit policy

Each "cluster arm" must have some estimated reward probability (a sketch of such a policy follows below).
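A minimal, hedged sketch of such a Two-Level Policy, using a UCB1-style rule as the "traditional bandit policy" at both levels and the MEAN estimate (introduced on the following slides) as the reward probability of each "cluster arm"; the clusters, reward probabilities, and horizon below are made up for illustration.

```python
import math
import random

def ucb_pick(counts, values, t):
    """UCB1-style choice among options given pull counts and mean rewards."""
    for i, c in enumerate(counts):
        if c == 0:
            return i
    return max(range(len(counts)),
               key=lambda i: values[i] + math.sqrt(2 * math.log(t) / counts[i]))

def two_level_policy(clusters, true_mu, horizon=10000, seed=0):
    """Two-Level Policy sketch: pick a "cluster arm", then an arm inside it.

    clusters: lists of arm indices; true_mu: simulated reward probabilities
    (unknown to the policy). The "cluster arm" estimate used here is MEAN,
    i.e. total successes / total pulls within the cluster.
    """
    rng = random.Random(seed)
    s = [0] * len(true_mu)  # successes per arm
    n = [0] * len(true_mu)  # pulls per arm
    total_reward = 0
    for t in range(1, horizon + 1):
        # Level 1: pick a "cluster arm" with a traditional bandit policy.
        c_pulls = [sum(n[i] for i in cl) for cl in clusters]
        c_means = [sum(s[i] for i in cl) / p if p else 0.0
                   for cl, p in zip(clusters, c_pulls)]  # MEAN estimate
        c = ucb_pick(c_pulls, c_means, t)
        # Level 2: pick an arm within the chosen cluster, again with a bandit policy.
        arms = clusters[c]
        a_means = [s[i] / n[i] if n[i] else 0.0 for i in arms]
        a = arms[ucb_pick([n[i] for i in arms], a_means, t)]
        reward = 1 if rng.random() < true_mu[a] else 0
        s[a] += reward
        n[a] += 1
        total_reward += reward
    return total_reward

# Example: the snow-sports ads form one cluster, the unrelated ad another.
total = two_level_policy(clusters=[[0, 1, 2], [3]],
                         true_mu=[0.30, 0.28, 0.31, 1e-6])
```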

Page 12: Multi-armed Bandit Problems with Dependent Arms


Issues

What is the reward probability of a “cluster arm”?

How do cluster characteristics affect performance?

Page 13: Multi-armed Bandit Problems with Dependent Arms


Reward probability of a "cluster arm"

What is the reward probability r of a "cluster arm"?

MEAN: r = ∑si / ∑ni, i.e., the average success rate, summing over all arms in the cluster [Kocsis+/2006, Pandey+/2007]

• Initially, r = μavg = the average μ of the arms in the cluster

• Finally, r = μmax = the maximum μ among the arms in the cluster

⇒ "Drift" in the reward probability of the "cluster arm"

Page 14: Multi-armed Bandit Problems with Dependent Arms


Reward probability drift causes problems

Drift ⇒ non-optimal clusters might temporarily look better ⇒ the optimal arm is explored only O(log T) times

[Figure: two clusters of arms; the "opt cluster" contains the best (optimal) arm, with reward probability μopt]

Page 15: Multi-armed Bandit Problems with Dependent Arms


Reward probability of a "cluster arm"

What is the reward probability r of a "cluster arm"?

MEAN: r = ∑si / ∑ni

MAX: r = max( E[μi] ), over all arms i in the cluster

PMAX: r = E[ max(μi) ], over all arms i in the cluster

Both MAX and PMAX aim to estimate μmax and thus reduce drift (a sketch of all three estimates follows below).
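A hedged sketch of computing the three estimates for one cluster, assuming an independent Beta(1,1) posterior per arm so that E[μi] has a closed form and E[max(μi)] can be approximated by sampling from the posteriors; these modelling choices are illustrative and not fixed by the slides.

```python
import numpy as np

def cluster_arm_estimates(successes, pulls, n_samples=10000, seed=0):
    """MEAN, MAX, and PMAX reward estimates for one "cluster arm" (sketch).

    successes[i], pulls[i]: observed counts for arm i of the cluster.
    Assumes an independent Beta(1,1) prior per arm (an illustrative choice).
    """
    s = np.asarray(successes, dtype=float)
    n = np.asarray(pulls, dtype=float)

    # MEAN: pooled success rate over all arms in the cluster, sum(s_i) / sum(n_i)
    mean_est = s.sum() / n.sum()

    # MAX: maximum posterior mean, max_i E[mu_i]
    post_mean = (s + 1) / (n + 2)
    max_est = post_mean.max()

    # PMAX: Monte Carlo estimate of E[max_i mu_i] under the per-arm posteriors
    rng = np.random.default_rng(seed)
    draws = rng.beta(s + 1, n - s + 1, size=(n_samples, len(s)))
    pmax_est = draws.max(axis=1).mean()

    return mean_est, max_est, pmax_est

# Example: one well-explored arm and one barely explored arm in the same cluster
print(cluster_arm_estimates(successes=[30, 1], pulls=[100, 2]))
```

MEAN pools all pulls and therefore starts near μavg and only later approaches μmax (the drift discussed earlier), whereas MAX and PMAX target μmax directly.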

Page 16: Multi-armed Bandit Problems with Dependent Arms


Reward probability of a “cluster arm”

MEAN: r = ∑si / ∑ni

MAX: r = max( E[μi] )

PMAX: r = E[ max(μi) ]

Both MAX and PMAX aim to estimate μmax and thus reduce drift

• MAX: biased estimate of μmax (high bias), but low variance

• PMAX: unbiased estimate of μmax, but high variance

Page 17: Multi-armed Bandit Problems with Dependent Arms


Comparison of schemes

[Plot: comparison of the schemes on 10 clusters with 11.3 arms/cluster on average; MAX performs best]

Page 18: Multi-armed Bandit Problems with Dependent Arms


Issues

What is the reward probability of a “cluster arm”?

How do cluster characteristics affect performance?

Page 19: Multi-armed Bandit Problems with Dependent Arms


Effects of cluster characteristics

We analytically study the effects of cluster characteristics on the "crossover-time".

Crossover-time: the time when the expected reward probability of the optimal cluster becomes the highest among all "cluster arms".

Page 20: Multi-armed Bandit Problems with Dependent Arms


Effects of cluster characteristics

The crossover-time Tc for MEAN depends on:

• Cluster separation Δ = μopt – (max μ outside the opt cluster): Δ increases ⇒ Tc decreases

• Cluster size Aopt: Aopt increases ⇒ Tc increases

• Cohesiveness of the opt cluster, 1 – avg(μopt – μi): cohesiveness increases ⇒ Tc decreases

Page 21: Multi-armed Bandit Problems with Dependent Arms


Experiments (effect of separation)

Δ increases ⇒ Tc decreases ⇒ higher reward

Page 22: Multi-armed Bandit Problems with Dependent Arms


Experiments (effect of size)

Aopt increases ⇒ Tc increases ⇒ lower reward

Page 23: Multi-armed Bandit Problems with Dependent Arms


Experiments (effect of cohesiveness)

Cohesiveness increases ⇒ Tc decreases ⇒ higher reward

Page 24: Multi-armed Bandit Problems with Dependent Arms


Related Work

• Typical multi-armed bandit problems: do not consider dependencies; very few arms

• Bandits with side information: cannot handle dependencies among arms

• Active learning: emphasis on the number of examples required to achieve a given prediction accuracy

Page 25: Multi-armed Bandit Problems with Dependent Arms


Conclusions

We analyze bandits where dependencies are encapsulated within clusters.

• Discounted reward: the optimal policy is an index scheme on the clusters

• Undiscounted reward: Two-Level Policy with MEAN, MAX, and PMAX; analysis of the effect of cluster characteristics on performance, for MEAN

Page 26: Multi-armed Bandit Problems with Dependent Arms


Discounted Reward

[Figure: a single belief-state MDP over all four arms; each state holds the estimated reward probabilities (x1, x2, x3, x4); pulling an arm leads to a success or failure transition, and because of the cluster dependence, pulling Arm 1 changes the belief for both arms 1 and 2]

• Create a belief-state MDP

• Each state contains the estimated reward probabilities for all arms

• Solve for the optimal policy


Page 27: Multi-armed Bandit Problems with Dependent Arms


Background: Bandits

Bandit “arms”

p1, p2, p3 (unknown payoff probabilities)

Regret = optimal payoff – actual payoff

Page 28: Multi-armed Bandit Problems with Dependent Arms


Reward probability of a "cluster arm"

What is the reward probability of a "cluster arm"?

Eventually, every "cluster arm" must converge to the most rewarding arm (μmax) within that cluster, since a bandit policy is used within each cluster. However, "drift" causes problems.

Page 29: Multi-armed Bandit Problems with Dependent Arms


Experiments

Simulation based on one week’s worth of data from a large-scale ad-matching application

10 clusters, with 11.3 arms/cluster on average

Page 30: Multi-armed Bandit Problems with Dependent Arms


Comparison of schemes

[Plot: comparison of the schemes] 10 clusters, 11.3 arms/cluster on average; cluster separation Δ = 0.08; cluster size Aopt = 31; cohesiveness = 0.75. MAX performs best.

Page 31: Multi-armed Bandit Problems with Dependent Arms


Reward probability drift causes problems

Intuitively, to reduce regret, we must quickly converge to the optimal "cluster arm", and then to the best arm within that cluster.

[Figure: two clusters of arms; the "opt cluster" contains the best (optimal) arm, with reward probability μopt]