TRANSCRIPT
S. Rady, Yonsei University 2012 Lecture 1: Poisson Bandits – 1
Strategic Experimentation
Lecture 1: Poisson Bandits
Introduction
• Optimal Learning
• Strategic Issues
• Bandits
The Model
The Single-Agent Problem
The Cooperative Problem
The Strategic Problem
Conclusion
A Model of Rational Learning
People are uncertain about their environment
They learn from experience
Bayesian learning:
• Agents possess a prior belief about some unknown parameter(s) of the environment
• They know the distribution of possible outcomes conditional on the parameter(s)
• They use Bayes’ rule to update their belief on the basis of what they observe
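As a toy numerical illustration of this updating rule (a sketch with made-up parameter values, not part of the lecture), suppose a binary parameter θ is observed through the number of Poisson arrivals in a window of length t:

```python
from math import exp, factorial

def poisson_pmf(n, rate, t):
    """Probability of exactly n arrivals in time t at the given arrival rate."""
    mean = rate * t
    return exp(-mean) * mean ** n / factorial(n)

def update(prior, n, t, lam1, lam0):
    """Bayes' rule: posterior probability of theta = 1 after observing
    n arrivals in a window of length t."""
    good = prior * poisson_pmf(n, lam1, t)
    bad = (1 - prior) * poisson_pmf(n, lam0, t)
    return good / (good + bad)

# An arrival in a short window is evidence for the high rate;
# a long quiet spell is evidence for the low rate
print(update(0.5, 1, 0.1, lam1=2.0, lam0=0.5))   # above 1/2
print(update(0.5, 0, 1.0, lam1=2.0, lam0=0.5))   # below 1/2
```

An arrival in a short window pushes the belief up because arrivals are much more likely under the high rate; a quiet spell pushes it down.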
A Single-Agent Example
Optimal learning typically involves experimentation
– Sacrifice current rewards for better information which will generate better rewards in the future
– Exploitation versus exploration
Examples:
– Seller of a new good
– Farmer choosing between a traditional crop and a gene-modified one
– Researcher pursuing a new research agenda
A Single-Agent Example
Multi-period monopolist learning about the demand curve
No physical links between periods
• Rothschild (1974)
• Aghion, Bolton, Harris & Julien (1991)
• Keller & Rady (1999)
Optimal learning may be incomplete
Identical sellers may charge different prices in the long run because of different sequences of observations
Strategic Experimentation
Agents learn from the experiments of others, as well as from their own
Examples:
– Oligopolists in a new market
– Farmers with neighboring fields
– Researchers pursuing a common research agenda
With publicly observable actions and outcomes, there is an informational externality
Strategic Experimentation
• Aghion, Espinoza & Julien (1993), Harrington (1995), Keller & Rady (2003):
– Differentiated-goods duopoly
• Bergemann & Valimaki (1996, 1997, 2000):
– Sellers and buyers learning about a new product
• Bolton & Harris (1999, 2000), Keller, Rady & Cripps (2005):
– Pure informational externality
– Two-armed bandits in continuous time
Multi-Armed Bandit Problem
Stylized model of the exploitation-exploration trade-off
– Agent decides repeatedly which slot machine to play
– Each machine produces a random stream of payoffs
– Uncertainty about the distribution of payoff streams
Sequential design of experiments
• Robbins (1952)
• Bellman (1956), Bradt, Johnson & Karlin (1956)
• Gittins & Jones (1974)
• Karatzas (1984), Presman (1990)
• Bergemann & Valimaki (2006)
Strategic Experimentation with Bandits
Bolton & Harris (1999):
– Two-armed bandits in continuous time
– One arm is “safe”
– The other arm is either “good” or “bad” (Brownian motion with high or low drift)
– Type of risky arm identical across players
– Publicly observable actions and outcomes
– Pure informational externality
– Unique symmetric Markov perfect equilibrium
– Free-riding versus encouragement
Strategic Experimentation with Bandits
Keller, Rady & Cripps (2005), Keller & Rady (2010):
– Poisson process with high or low arrival rate
– Unique symmetric Markov perfect equilibrium
– Multiplicity of asymmetric Markov perfect equilibria
– No equilibria in cutoff strategies
– Inefficiency
– Encouragement effect
The Model
Poisson Bandits
Two-armed bandit in continuous time
One arm is safe (S): generates a known flow payoff s
Other arm is risky (R): yields i.i.d. lump-sums of known mean h which arrive according to a Poisson process
if good (θ = 1), Poisson intensity is λ1 (≡ flow payoff λ1h)
if bad (θ = 0), Poisson intensity is λ0 (≡ flow payoff λ0h)
s > 0 and λ1 > λ0 ≥ 0 known to players
True value of θ initially unknown to players
Assumption: λ1h > s > λ0h
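A quick simulation makes the payoff assumption concrete (illustrative parameter values chosen to satisfy λ1h > s > λ0h, not from the lecture): the risky arm’s long-run average payoff per unit of time is λθh, which brackets the safe flow s.

```python
import random

random.seed(1)

# Illustrative parameters (not from the lecture) satisfying lam1*h > s > lam0*h
s, h = 1.0, 1.0
lam1, lam0 = 2.0, 0.5

def average_risky_payoff(lam, horizon=10_000.0):
    """Simulate the risky arm: lump-sums of mean h arrive at Poisson rate lam.
    Returns the realised average payoff per unit of time."""
    t, total = 0.0, 0.0
    while True:
        t += random.expovariate(lam)        # exponential inter-arrival times
        if t > horizon:
            break
        total += random.expovariate(1 / h)  # lump-sum with mean h
    return total / horizon

good, bad = average_risky_payoff(lam1), average_risky_payoff(lam0)
# Law of large numbers: good is near lam1*h > s, bad is near lam0*h < s
print(good, s, bad)
```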
Exponential Bandits
Keller, Rady & Cripps (2005):
• λ0 = 0
• First lump-sum resolves all uncertainty about θ
• Arrives at an exponentially distributed random time (→ “exponential bandit”)
Much simpler than the case λ0 > 0 in Keller & Rady (2010):
• No encouragement effect
• Backward induction from single-agent cut-off
Actions and Payoffs
One unit of a perfectly divisible resource per unit of time
If fraction kt ∈ [0, 1] is allocated to R on [t, t + dt[:
S yields (1 − kt)s dt
R pays a lump-sum with probability ktλθ dt
Expected payoff increment conditional on θ:
[(1 − kt)s + ktλθh] dt
Payoffs and Beliefs
pt = posterior probability that θ equals 1
Et[λθ] = ptλ1 + (1 − pt)λ0 = λ(pt)
Expected payoff increment conditional on observations up to time t:
[(1 − kt)s + ktλ(pt)h] dt
Total payoff in per-period units:
E[ ∫₀^∞ r e^{−rt} [(1 − kt)s + ktλ(pt)h] dt ]
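The weighting by r e^{−rt} is what puts total payoffs “in per-period units”: since ∫₀^∞ r e^{−rt} dt = 1, a constant flow w has total payoff exactly w. A small numerical check (illustrative values, not from the lecture):

```python
from math import exp

r, w = 0.5, 1.3       # illustrative discount rate and constant flow payoff
dt, T = 1e-4, 100.0   # step size and truncation horizon

# Riemann sum for  integral_0^infty  r e^{-rt} w dt : the weights
# r e^{-rt} dt integrate to one, so the total payoff equals the flow w
total = sum(r * exp(-r * i * dt) * w * dt for i in range(int(T / dt)))
print(total)   # close to w = 1.3
```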
Common Posterior Beliefs
Each player has a replica two-armed bandit
– same θ
– i.i.d. arrival times
Common prior
Observable actions and outcomes
Hence common posterior
Applying Bayes’ Rule
Let player n = 1, . . . , N allocate kn,t to R on [t, t + dt[
Define
Kt = ∑_{n=1}^{N} kn,t
Conditional on θ, the probability of none of the players receiving a lump-sum in [t, t + dt[ is
∏_{n=1}^{N} e^{−kn,tλθ dt} = ∏_{n=1}^{N} (1 − kn,tλθ dt) = 1 − Ktλθ dt
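The last equality holds to first order in dt; a quick check with hypothetical allocations confirms that the gap between the exact product and 1 − Ktλθ dt vanishes at rate dt²:

```python
from math import exp

lam = 2.0                     # illustrative arrival rate conditional on theta
k = [0.3, 1.0, 0.5]           # hypothetical allocations k_{n,t} of N = 3 players
K = sum(k)

gaps = []
for dt in (1e-2, 1e-3, 1e-4):
    exact = 1.0
    for kn in k:
        exact *= exp(-kn * lam * dt)   # P(player n sees no lump-sum in dt)
    gaps.append(abs(exact - (1 - K * lam * dt)))

# the error of the first-order approximation shrinks like dt^2
print(gaps)
```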
No News is Bad News
Given pt at time t and no lump-sum in [t, t + dt[:
pt + dpt = pt(1 − Ktλ1 dt) / [(1 − pt)(1 − Ktλ0 dt) + pt(1 − Ktλ1 dt)]
or
dpt = −Kt ∆λ pt(1 − pt) dt
with ∆λ = λ1 − λ0
When there is a lump-sum on any of the risky arms, the posterior belief jumps up from pt− to
pt = j(pt−) = λ1 pt− / λ(pt−)
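The two pieces of the belief dynamics, downward drift absent news and upward jumps at lump-sums, can be simulated with a crude Euler scheme (illustrative parameters, not from the lecture):

```python
import random

random.seed(0)

# Illustrative parameters (not from the lecture): lam1 > lam0 >= 0
lam1, lam0 = 2.0, 0.5
dlam = lam1 - lam0

def lam(p):
    """Expected arrival rate at belief p."""
    return p * lam1 + (1 - p) * lam0

def j(p):
    """Upward jump of the belief when a lump-sum arrives."""
    return lam1 * p / lam(p)

def belief_path(p0, K, T, theta, dt=1e-3):
    """Euler scheme: drift dp = -K*dlam*p*(1-p)*dt when no news arrives,
    jump p -> j(p) when a lump-sum arrives (rate K*lambda_theta)."""
    p = p0
    rate = K * (lam1 if theta == 1 else lam0)
    for _ in range(int(T / dt)):
        if random.random() < rate * dt:   # a lump-sum arrives somewhere
            p = j(p)
        else:                             # no news is bad news
            p -= K * dlam * p * (1 - p) * dt
    return p

# Average terminal belief over many paths with theta = 1
paths = [belief_path(0.5, K=1, T=10.0, theta=1) for _ in range(100)]
print(sum(paths) / len(paths))
```

Conditional on θ = 1, lump-sums keep arriving and the posterior concentrates near 1, as consistency of Bayesian updating requires.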
Stationary Markov Strategies
With pt− as the natural state variable:
kt = k(pt−)
where k : [0, 1] → [0, 1] is left-continuous and piecewise Lipschitz-continuous
Call the strategy simple if k(p) ∈ {0, 1} for all p
Cut-off Strategies
For some p̄ ∈ [0, 1]:
If p > p̄, then k(p) = 1 (play R exclusively)
If p ≤ p̄, then k(p) = 0 (play S exclusively)
Example:
Myopic agent ignores information content of own actions (as if r = ∞)
Payoff = max{s, λ(p)h}
Cut-off:
pm = (s − λ0h) / (λ1h − λ0h)
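Plugging in illustrative numbers (again not from the lecture) confirms the defining property of pm: at the cut-off, the expected risky flow λ(pm)h exactly equals the safe flow s.

```python
# Illustrative parameters satisfying lam1*h > s > lam0*h
s, h = 1.0, 1.0
lam1, lam0 = 2.0, 0.5

def lam(p):
    """Expected arrival rate at belief p."""
    return p * lam1 + (1 - p) * lam0

# Myopic cut-off: the belief at which the expected risky flow equals s
pm = (s - lam0 * h) / (lam1 * h - lam0 * h)

print(pm)             # 1/3 for these parameters
print(lam(pm) * h)    # equals s: indifference at the cut-off
```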
The Single-Agent Problem
Principle of Optimality
u(p) = max_{k∈[0,1]} { r[(1 − k)s + kλ(p)h] dt + e^{−r dt} E[u(p′) | p, k] }
With probability kλ(p) dt: lump-sum, and
u(p′) = u(j(p))
With probability 1 − kλ(p) dt: no lump-sum, and
u(p′) = u(p) − k∆λ p(1 − p)u′(p) dt
Insert and simplify to get the Bellman equation!
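Spelled out, the substitution runs as follows (expand e^{−r dt} ≈ 1 − r dt, insert the two cases above, drop terms of order dt², then cancel u(p) and divide by r dt):

```latex
u(p) = \max_{k\in[0,1]}\Big\{ r\big[(1-k)s + k\lambda(p)h\big]dt
     + (1 - r\,dt)\Big[ k\lambda(p)\,dt\;u(j(p))
     + \big(1 - k\lambda(p)\,dt\big)\big(u(p) - k\,\Delta\lambda\,p(1-p)\,u'(p)\,dt\big)\Big]\Big\}
% cancel u(p) on both sides, divide by r dt, drop the remaining O(dt) terms:
u(p) = \max_{k\in[0,1]}\Big\{ (1-k)s + k\lambda(p)h
     + k\big[\lambda(p)\big(u(j(p)) - u(p)\big) - \Delta\lambda\,p(1-p)\,u'(p)\big]/r \Big\}
     = s + \max_{k\in[0,1]} k\,\{\,b(p,u) - c(p)\,\}
```

which matches the Bellman equation displayed on the next slide.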
Bellman Equation
u(p) = s + max_{k∈[0,1]} k{b(p, u) − c(p)}
with the expected short-term opportunity cost
c(p) = s − λ(p)h
and the expected long-term learning benefit
b(p, u) = [λ(p)(u(j(p)) − u(p)) − ∆λ p(1 − p)u′(p)]/r
(left-hand derivative!)
⇒ k = 1 or k = 0 is optimal
Single-Agent Optimum
When k = 1 is optimal, u satisfies a differential-difference equation (DDE) that admits an explicit solution
Optimal cut-off p∗1 < pm determined by
• value matching: u(p∗1) = s
• smooth pasting: u′(p∗1) = 0
[Figure: single-agent value function u, equal to s below the cut-off p∗1 and rising towards λ1h as p → 1, with p∗1 < pm]
The Cooperative Problem
Efficient Experimentation
Bellman equation for N players jointly maximising their average payoff:
u(p) = s + max_{K∈[0,N]} K{b(p, u) − c(p)/N}
⇒ K = N or K = 0 is optimal
⇒ Optimal cut-off p∗N < p∗1, decreasing in N
[Figure: efficient experimentation intensity K(p), equal to 0 below p∗N and to N above it, with p∗N < p∗1]
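The lecture obtains these cut-offs in closed form; as a rough numerical cross-check (a crude discrete-time value iteration under illustrative parameters, all names hypothetical), one can recover the ordering p∗N < p∗1 < pm:

```python
from math import exp

# Illustrative parameters (not from the lecture) satisfying lam1*h > s > lam0*h
s, h, r = 1.0, 1.0, 1.0
lam1, lam0 = 2.0, 0.5
dlam = lam1 - lam0

def lam(p):
    return p * lam1 + (1 - p) * lam0

def jump(p):
    return lam1 * p / lam(p)

def cutoff(N, dt=0.02, m=250, iters=1000):
    """Crude discrete-time value iteration for the N-player cooperative
    problem (per-capita payoffs); N = 1 is the single-agent problem.
    Returns the approximate belief below which all experimentation stops."""
    grid = [i / m for i in range(m + 1)]
    u = [s] * (m + 1)
    disc = exp(-r * dt)

    def interp(p, v):
        p = min(max(p, 0.0), 1.0)
        i = min(int(p * m), m - 1)
        w = p * m - i
        return (1 - w) * v[i] + w * v[i + 1]

    for _ in range(iters):
        new = []
        for p in grid:
            # all N players on R: flow lam(p)*h each, news at rate N*lam(p),
            # downward drift N*dlam*p*(1-p) when no news arrives
            pr = N * lam(p) * dt
            drift = p - N * dlam * p * (1 - p) * dt
            risky = (1 - disc) * lam(p) * h + disc * (
                pr * interp(jump(p), u) + (1 - pr) * interp(drift, u))
            new.append(max(s, risky))   # K = 0 forever yields the safe flow s
        u = new

    # lowest belief at which experimenting beats the safe payoff
    return next(p for p, v in zip(grid, u) if v > s + 1e-6)

pm = (s - lam0 * h) / (dlam * h)   # myopic cut-off for comparison
p1, p2 = cutoff(1), cutoff(2)
print(p2, p1, pm)
```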
The Strategic Problem
Characterising Best Responses
Bellman equation for player n, given K¬n = K − kn:
un(p) = s + K¬n b(p, un) + max_{kn∈[0,1]} kn{b(p, un) − c(p)}
Indifference diagonal:
DK¬n = {(p, u) ∈ [0, 1] × ℝ₊ : u = s + K¬n c(p)}
[Figure: best responses in (p, u)-space; R is the best response to K¬n above the indifference diagonal DK¬n, S below it]
General Results for Markov Perfect Equilibria
(1) In any MPE, each player obtains at least the single-agent optimum payoff
(2) All MPE are inefficient (free-riding)
(3) There is no MPE where all players use cut-off strategies
(4) In any MPE, at least one player experiments at some beliefs below p∗1 (encouragement) if and only if λ0 > 0
No MPE in Cut-Off Strategies
[Figure: candidate two-player profile in cut-off strategies with cut-offs p1 < p2; value functions u1 and u2 against the diagonal D1, with R dominant above D1 and S the best response to K¬n = 1 below it]
Player 1 must change action at p2 as well as at p1
Encouragement Effect for λ0 > 0
Suppose all players play S at all beliefs p ≤ p∗1
un(p∗1) = V∗1(p∗1) = s and u′n(p∗1) = (V∗1)′(p∗1) = 0
b(p∗1, un) ≤ c(p∗1) = b(p∗1, V∗1)
⇒ un(j(p∗1)) ≤ V∗1(j(p∗1))
⇒ un(j(p∗1)) = V∗1(j(p∗1)) (since un ≥ V∗1 by result (1))
un − V∗1 minimal at j(p∗1)
⇒ u′n(j(p∗1)) ≤ (V∗1)′(j(p∗1))
un(j²(p∗1)) ≥ V∗1(j²(p∗1))
⇒ b(j(p∗1), un) ≥ b(j(p∗1), V∗1) > c(j(p∗1))
So all players must use R at the belief j(p∗1)
Encouragement Effect for λ0 > 0
Each player’s Bellman equation now yields
un(j(p∗1)) = s + N b(j(p∗1), un) − c(j(p∗1))
 ≥ s + N b(j(p∗1), V∗1) − c(j(p∗1))
 > s + b(j(p∗1), V∗1) − c(j(p∗1))
 = V∗1(j(p∗1))
This contradicts un(j(p∗1)) = V∗1(j(p∗1))
No Encouragement Effect for λ0 = 0
Encouragement effect rests on two conditions
• Experimentation by a pioneer must increase the likelihood that others will return to the risky arm
• These future experiments must be valuable to the pioneer
The second condition is not met for λ0 = 0
So for λ0 = 0, experimentation in any MPE stops at p∗1
Symmetric Equilibrium
No symmetric MPE in pure strategies
Indifference between R and S requires
b(p, u) = c(p)
— DDE with closed-form solution
Bellman equation then implies
k(p) = (u(p) − s) / ((N − 1)c(p))
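To spell this out: on the indifference locus b(p, u) = c(p) the max term vanishes, so player n’s Bellman equation reduces to u(p) = s + K¬n c(p), and in a symmetric profile K¬n = (N − 1)k(p), hence:

```latex
u(p) = s + (N-1)\,k(p)\,c(p)
\quad\Longrightarrow\quad
k(p) = \frac{u(p) - s}{(N-1)\,c(p)}
```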
Symmetric MPE
Construction and uniqueness by elementary methods
[Figure: symmetric MPE value function against the diagonal DN−1, with cut-offs pN < p†N: R played with probability 0 below pN, strict mixing on ]pN, p†N], R played with probability 1 above p†N]
Smooth pasting at pN
Comparative statics as N↑: common payoffs↑, pN↓, p†N↑
Symmetric MPE
Interior allocation between cut-offs pN < p†N
[Figure: symmetric-MPE experimentation intensity K(p): zero below pN, rising through the interior allocation on ]pN, p†N], equal to N above p†N; shown against the efficient cut-off p∗N and the single-agent cut-off p∗1]
Inefficiency: pN > p∗N (not reached in finite time)
Encouragement effect: pN < p∗1 if λ0 > 0
Asymmetric Equilibria
No asymmetric MPE are exhibited in Bolton & Harris (1999)
Keller, Rady & Cripps (2005):
• λ0 = 0 (no encouragement effect)
• variety of asymmetric MPE in simple strategies
• complete classification for N = 2
• most inequitable MPE for arbitrary N
• higher average payoff than symmetric MPE
Keller & Rady (2010):
• λ0 > 0 (encouragement effect)
• construction of a particular asymmetric MPE
• Pareto improvement over symmetric MPE
The Most Inequitable MPE for λ0 = 0 and N = 2
[Figure: most inequitable MPE for λ0 = 0 and N = 2 in (p, u)-space; value functions u1 and u2 against the diagonal D1, with the single-agent cut-off p∗1, the myopic cut-off pm, and switching points p1 and p2]
Average payoff higher than in symmetric MPE
Average payoff increases as burden of experimentation is shared more equitably
Constructing Asymmetric MPE for λ0 > 0
Two ideas:
• Give all players identical continuation values after a success on a risky arm
• Let players alternate between the roles of experimenter and free-rider before all experimentation stops
Construction below for N = 2 generalises to arbitrary N
Asymmetric MPE for λ0 > 0 and N = 2
[Figure: asymmetric MPE for λ0 > 0 and N = 2; total intensity K(p) equals 0 on [0, p♭], 1 on ]p♭, p♯], lies strictly between 1 and 2 on ]p♯, p‡], and equals 2 above p‡; value functions shown against the diagonals D1/2 and D1]
• On ]p♯, 1] both players play symmetrically
• On ]p♭, p♯] players take turns playing R
• On [0, p♭] both players play S
Strictly higher average payoff on ]p♭, 1] than in the symmetric equilibrium
Pareto improvement for sufficiently frequent turns on ]p♭, p♯]
Why is Taking Turns More Efficient?
Under symmetry:
• A player who deviates to the safe action slows down the gradual slide of beliefs towards more pessimism
• This makes the other players experiment more than they would on the equilibrium path
In the alternation phase of an asymmetric MPE:
• A deviation from the risky to the safe action freezes the belief in its current state
• This delays the time at which another player takes over
Best, Worst and Most Inequitable MPE
Keller, Rady & Cripps (2005):
• λ0 = 0
• Symmetric MPE yields uniformly lowest average payoffs
• For N = 2, the limits as λ0 ↓ 0 of the MPE just constructed yield uniformly highest average payoffs
• For N = 2, the most inequitable MPE yield uniformly extremal individual payoffs; these are in simple strategies
Do such MPE exist for λ0 > 0?
Consider simple MPE for N = 2 and small λ0 > 0
Large Jumps
For λ0 > 0 small enough:
• j(p∗2) ≥ pm
• Everybody plays R after a success
• We can again give the players common continuation values after a success on a risky arm
Precise condition:
λ0 (λ0/λ1)^{λ0/∆λ} ≤ r/2
Sufficient condition:
λ0 ≤ r/2
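The sufficient condition follows because λ0 ≤ λ1 makes the factor (λ0/λ1)^{λ0/∆λ} at most one. A brute-force sanity check over random parameter draws (purely illustrative):

```python
import random

random.seed(0)

def precise(lam0, lam1, r):
    """Large-jumps condition: lam0 * (lam0/lam1)^(lam0/(lam1-lam0)) <= r/2."""
    return lam0 * (lam0 / lam1) ** (lam0 / (lam1 - lam0)) <= r / 2

# whenever lam0 <= r/2 holds, the precise condition should hold as well
ok = True
for _ in range(10_000):
    lam1 = random.uniform(0.1, 5.0)
    lam0 = random.uniform(0.0, lam1 * 0.99)   # ensure lam1 > lam0 >= 0
    r = random.uniform(0.01, 5.0)
    if lam0 <= r / 2 and not precise(lam0, lam1, r):
        ok = False
print(ok)   # True: the sufficient condition implies the precise one
```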
A Simple MPE for N = 2
[Figure: simple MPE for N = 2 in (p, u)-space; R dominant above the diagonal D1, S and R mutual best responses below it, with three switching points]
• on ]p̄, 1] both players play R
• on ]ps, p̄] player 1 plays S, player 2 plays R
• on ]p, ps] player 1 plays R, player 2 plays S
• on [0, p] both players play S
Lower average payoff than in the previous MPE
◦ Monotonicity? ‘Anticipation’
Rewarding the Last Experimenter
• Maintain assumption of ‘large jumps’
• Relax payoff symmetry above D1
– Give player 1 higher payoff at j(p)
– Two effects:
◦ Lower intensity just to the right of p
◦ Player 1 is willing to play R left of p
– Average payoff
◦ decreases at high beliefs
◦ increases around p
Rewarding the Last Experimenter
• Consequences:
– Can construct equilibria with progressively increasing payoff asymmetry, approaching the most inequitable MPE for λ0 = 0
– There exist no uniformly best or worst MPE when λ0 > 0
Rewarding the Last Experimenter
[Figure: construction in (p, u)-space, showing p∗2, pm, the diagonal D1, and candidate switching points p2, p1, p and ps]
• Fix (p2, u1) above D1; find p1, p and ps
• Fill in u2 above D1 and just below
• Fill in u2 above p . . . does it join up?
• Adjust (p2, u1) if necessary
Conclusion
Concluding Remarks
• Poisson bandits
– Alternative to Brownian setup
– News comes in ‘lumps’
• Asymmetric equilibria with encouragement
– Construction by elementary methods
– Taking turns is superior to ‘mixing’
– No uniformly ‘best’ or ‘worst’ MPE unless λ0 = 0
Concluding Remarks
• What else?
– Ongoing work on ‘bad news’ scenario
– Smooth pasting does not hold
– Characterization of symmetric MPE
– Viscosity solutions of the Bellman equation
• What next?
– Introduce negative correlation of type of risky arm across players
– Consider non-Markovian equilibria