Searching for Arms
Daniel Fershtman Alessandro Pavan
October 1, 2019
Motivation
Experimentation/Sequential Learning
central to many problems
In many cases,
endogenous set of alternatives/arms
search
Tradeoff: exploring existing alternatives vs searching for new ones
Motivation
Example
Consumer sequentially explores different alternatives within “consideration set”, while expanding consideration set through search
Firm interviews candidates, while searching for additional suitable candidates to interview
Researcher splits time on several ongoing projects of unknown return, while also searching for new projects
Difference
experimentation: directed
search: undirected
This Paper
Multi-armed bandit problem with endogenous set of arms
Optimal policy: index policy (with special index for Search)
Extension to problems with irreversible choice (based on partial information)
Weitzman: special case where set of boxes exogenous and uncertainty resolved after first inspection
Search Index
Definition
G^S(ω^S) = sup_{τ,π} E[∑_{s=0}^{τ−1} δ^s (r_s^π − c_s^π) | ω^S] / E[∑_{s=0}^{τ−1} δ^s | ω^S]

Recursive representation

G^S(ω^S) = E^{χ*}[∑_{s=0}^{τ*−1} δ^s (r_s − c_s) | ω^S] / E^{χ*}[∑_{s=0}^{τ*−1} δ^s | ω^S]

χ*: policy selecting physical arm with highest Gittins index (among those brought by new search) if such index higher than search index, and search otherwise
τ*: first time search index and indexes of all physical arms brought by new search fall below value of search index at time search launched
Difficulties
Opportunity cost of search depends on entire composition of current choice set
e.g., profitability of searching for additional candidates depends on observable covariates of current candidates (gender, education, etc.) and past interviews
Non-stationarity in search technology
search outcome may depend on
type and number of arms previously found
past search costs
Search competes with its own “descendants” (i.e., with arms discovered through past searches)
correlation
Treating search as “meta arm” requires decisions within meta arm invariant to info outside meta arm
bandit problems with meta arms (e.g., arms that can be activated with different intensities – “super-processes”) rarely admit index solution
Literature
Bandits
Gittins and Jones (1974), Rothschild (1974), Rustichini and Wolinsky(1995), Keller and Rady (1999)...
Surveys: Bergemann and Välimäki (2008), Hörner and Skrzypacz (2017)
Bandits with time-varying set of alternatives
Whittle (1981), Varaiya et al. (1985), Weiss (1988), Weber (1994)...
Sequential search for best alternative (Pandora’s problem)
Weitzman (1979), Olszewski and Weber (2015), Choi and Smith (2016),Doval (2018)...
Experimentation before irreversible choice
Ke, Shen and Villas-Boas (2016), Ke and Villas-Boas (2018)...
⇒ KEY DIFFERENCE: Endogeneity of set of arms
Plan
1 Model
2 Optimal policy
3 Dynamics
4 Proof of main theorem
5 Applications
6 Extensions
irreversible choice
search frictions
multiple search arms
no discounting
Model
Model: Environment
Discrete time: t = 0, ...,∞
Available “physical” arms in period t: I_t = {1, ..., n_t} (I_0 exogenous)
At each t, DM
pull arm among It
search for new arms
opt-out: arm i = 0 (fixed reward equal to outside option)
Pulling arm i ∈ It
reward r_i ∈ ℝ
transition to new “state”
Search
costly
stochastic set of new arms It+1\It
Model: “Physical” Arms
“State” of physical arm: ω^P = (ξ, θ) ∈ Ω^P
ξ ∈ Ξ: persistent “type”
θ ∈ Θ: evolving state
Example:
ξ: type of research project/idea (theory, empirical, experimental)
θ = (σ_m): history of signals about project’s impact
r: utility from working on project
H_{ω^P}: distribution over Ω^P, given ω^P
Reward: r(ω^P)
Usual assumptions:
Arm’s state “frozen” when not pulled
time-autonomous processes
evolution of arms’ states independent across arms, conditional on arms’ types
Model: Search Technology
State of search technology: ω^S = ((c_0, E_0), (c_1, E_1), ..., (c_m, E_m)) ∈ Ω^S
m: number of past searches
c_k: cost of k-th search
E_k = (n_k(ξ) : ξ ∈ Ξ): result of k-th search
n_k(ξ) ∈ ℕ: number of arms of type ξ found
H_{ω^S}: joint distribution over (c, E), given ω^S
Key assumptions
independence of calendar time
independence of arms’ idiosyncratic shocks, θ
Correlation through ξ
Model: Search Technology
Stochasticity in search technology:
learning about alternatives not yet in consideration set
evolution of DM’s ability to find new alternatives
e.g., limited set of outside alternatives
fatigue/experience
Model: states and policies
Period-t state: S_t ≡ (ω_t^S, S_t^P)
ω_t^S: state of search technology
S_t^P ≡ (S_t^P(ω^P) : ω^P ∈ Ω^P): state of physical arms
S_t^P(ω^P): number of physical arms in state ω^P ∈ Ω^P
Definition eliminates dependence on calendar time, while keeping track of relevantinformation
Policy χ describes feasible decisions at all histories
Policy χ optimal if it maximizes expected discounted sum of net payoffs
E^χ[ ∑_{t=0}^∞ δ^t ( ∑_{j=1}^∞ x_{jt} r_{jt} − c_t y_t ) | S_0 ]
Plan
1 Model
2 Optimal policy
3 Dynamics
4 Proof of main theorem
5 Applications
6 Extensions
irreversible choice
search frictions
multiple search arms
no discounting
Optimal Policy
Indexes for Physical Arms
Index for “physical” arms:
G^P(ω^P) ≡ sup_{τ>0} E[∑_{s=0}^{τ−1} δ^s r_s | ω^P] / E[∑_{s=0}^{τ−1} δ^s | ω^P]

τ: stopping time
Interpretations:
maximal expected discounted reward, per unit of expected discounted time(Gittins)
annuity that makes DM indifferent between stopping right away andcontinuing with option to retire in the future (Whittle)
fair charge (Weber)
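The Gittins interpretation above can be made concrete with a toy computation (a minimal sketch of our own, not from the slides): for a deterministic reward sequence, the sup over stopping times τ > 0 reduces to a max over finite horizons, so the ratio index can be computed directly.

```python
# Toy sketch (ours, not the paper's): for a deterministic reward sequence,
# the sup over stopping times tau > 0 is a max over finite horizons.

def gittins_ratio_index(rewards, delta):
    """max over tau>=1 of sum_{s<tau} delta^s r_s / sum_{s<tau} delta^s."""
    best = float("-inf")
    num = den = 0.0
    for s, r in enumerate(rewards):
        num += delta ** s * r
        den += delta ** s
        best = max(best, num / den)
    return best

# Declining rewards: optimal to stop after one pull, so the index is r_0.
print(gittins_ratio_index([10, 6, 2], 0.9))  # 10.0
```

For stochastic arms the sup is over genuine stopping times and the index requires dynamic programming; this deterministic case only illustrates the “reward per unit of discounted time” reading.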
Index for Search
Index for search:
G^S(ω^S) ≡ sup_{π,τ} E[∑_{s=0}^{τ−1} δ^s (r_s^π − c_s^π) | ω^S] / E[∑_{s=0}^{τ−1} δ^s | ω^S]

τ: stopping time
π: choice among arms discovered AFTER t and FUTURE searches
r_s^π, c_s^π: stochastic rewards/costs, under rule π
Interpretation: fair (flow) price for visiting “casinos” found stochastically over time, playing in them, and continuing to search for other casinos
Definition:
accommodates correlation among arms found over time
compatible with possibility that search lasts indefinitely and brings unbounded set of alternatives
Index policy
Definition
Index policy selects at each t “search” iff

G^S(ω_t^S) ≥ G^*(S_t^P)

where G^*(S_t^P) is the maximal index among available physical arms; otherwise, it selects any “physical” arm with index G^*(S_t^P)
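The decision rule above can be sketched in a few lines of Python (a minimal sketch; the function and arm names are ours, not the paper’s):

```python
# Minimal sketch of the index policy's decision rule (names are ours).

def index_policy_action(search_index, physical_indexes):
    """'search' iff G^S >= G* (max physical index); else an arm attaining G*."""
    g_star = max(physical_indexes.values())
    if search_index >= g_star:
        return "search"
    return max(physical_indexes, key=physical_indexes.get)

print(index_policy_action(0.7, {"arm1": 0.5, "arm2": 0.6}))  # search
print(index_policy_action(0.4, {"arm1": 0.5, "arm2": 0.6}))  # arm2
```

Note only the largest physical index enters the comparison, which is exactly the composition-irrelevance point made on the next slides.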
Optimality of index policy
Theorem 1
Index policy optimal in bandit problem with search for new arms
Implications of Index Policy
Each period DM must assign task to a worker
Each worker can be ξ = Male or ξ = Female
different processes over signals/rewards
Probability search brings Male: .8
Fixing value of highest index, optimality of searching for new candidates same no matter whether you have 49 M and 1 F, or 25 M and 25 F
Given highest physical index G^*(S_t^P), composition of set of physical arms irrelevant for decision to search
However, opportunity cost of search (value of continuing with current agents) depends on number of M and F (and past outcomes)
Maximal index among current arms NOT a sufficient statistic for state of current arms when it comes to continuation payoff with current arms
Plan
1 Model
2 Optimal policy
3 Dynamics
4 Proof of main theorem
5 Applications
6 Extensions
irreversible choice
search frictions
multiple search arms
no discounting
Dynamics
Dynamics under index policy
Stationary search technology: H_{ω^S} = H^S for all ω^S
if DM searches at t, all physical arms present at t never pulled again
(search=replacement)
Result extends to “Improving search technologies”:
physical arms required to pass more stringent tests over time
Deteriorating search technology:
e.g., finite set of arms
DM may return to arms present before last search
Plan
1 Model
2 Optimal policy
3 Dynamics
4 Proof of main theorem
5 Applications
6 Extensions
irreversible choice
search frictions
multiple search arms
no discounting
Proof of Main Theorem
Proof of Theorem 1: Road Map
1 Characterization of payoff under index policy
representation uses “timing process” based on optimal stopping in indexes’ definitions:
physical arms: stop when index drops below its initial value (Mandelbaum, 1986)
search: stop when search index and all indexes of newly arrived arms smaller than value of search index when search began
2 Dynamic programming
payoff function under index policy solves dynamic programming equation
Proof: Step 1
κ(v|S) ∈ ℕ ∪ {∞}: minimal time until all indexes (search / existing arms / newly found arms) weakly below v ∈ ℝ_+
Lemma 1

V(S_0) = ∫_0^∞ [1 − E δ^{κ(v|S_0)}] dv

V(S_0): payoff under index policy, starting from state S_0
E δ^{κ(v|S_0)}: expected discounted time till all indexes drop weakly below v
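Lemma 1 can be sanity-checked in a degenerate case of our own construction: a single arm paying a constant reward r forever, with no search arm. The index then stays at r, so κ(v) = 0 for v ≥ r and ∞ below, and the integral collapses to r — the (1 − δ)-normalized payoff of pulling the arm forever:

```python
# Degenerate check of Lemma 1 (our construction): a single arm paying a
# constant reward r forever, no search. Its index is always r, so
# kappa(v) = 0 for v >= r and infinity for v < r, and the integral
# representation collapses to r, matching the (1-delta)-normalized payoff.

delta, r = 0.9, 3.0

def kappa(v):
    return 0 if v >= r else float("inf")

dv, hi = 1e-4, 10.0  # grid for the integral over v (delta**inf == 0.0)
integral = sum((1 - delta ** kappa(i * dv)) * dv for i in range(int(hi / dv)))

average_payoff = (1 - delta) * sum(delta ** t * r for t in range(2000))
print(round(integral, 2), round(average_payoff, 2))  # 3.0 3.0
```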
Proof: Step 2
V(S0) solves dynamic programming equation:
V(S_0) = max{ V^S(ω^S|S_0), max_{ω^P ∈ {ω^P ∈ Ω^P : S_0^P(ω^P) > 0}} V^P(ω^P|S_0) }

V^S(ω^S|S_0): value from searching and reverting to index policy thereafter
V^P(ω^P|S_0): value from pulling physical arm and reverting to index policy thereafter
Proof uses
representation of payoff under index policy from Lemma 1
decomposition of overall problem into collection of binary problems where choice is between single arm (possibly search) and auxiliary fictitious arm with fixed reward
Plan
1 Model
2 Index policy
3 Dynamics
4 Proof of main theorem
5 Applications
6 Extensions
irreversible choice
search frictions
multiple search arms
no discounting
Applications
Dynamic Matching on a Platform
Platform dynamically matches agents
Shocks to match quality
Gradual learning about attractiveness
Platform solicits buyers/sellers in response to past bids (match outcomes)
Joint dynamics of
bidding
matching
solicitation
Distortions in solicitation dynamics (due to mkt power + private info)
Design of Search Engines
Representative buyer uses search engine to identify product to purchase
Search brings set of sponsored and organic links
Clicking on a link brings additional information
GSP auction
sellers compete by submitting bids
higher bids: higher positions
payments linked to clicks
Result makes it possible to
endogenize click-through rates (CTR)
characterize firms’ value for being on different positions/pages
Auction design
how many products per page?
payments
Plan
1 Model
2 Index policy
3 Dynamics
4 Proof of main theorem
5 Applications
6 Extensions
irreversible choice
search frictions
multiple search arms
no discounting
Extensions
Extension 1: Irreversible Choice
Irreversible choice
In each period, DM can
search for new alternatives
experiment with existing ones
irreversibly select one alternative from those found from past searches
Type-ξ arm must be pulled M_ξ ≥ 0 times before DM can irreversibly commit to it (Weitzman: M_ξ = 1 for all ξ)
Flow-payoff from irreversibly selecting arm in state ω^P: R(ω^P)
Extension 1: Irreversible Choice
Partial order on states of physical arms: ω̃^P ⪰ ω^P
e.g., ω^P = (ξ, σ, m) where “m” is number of times arm has been activated
ω̃^P = (ξ, σ, m̃) ⪰ ω^P = (ξ, σ, m) if m̃ ≥ m

Definition
Type ξ satisfies “better-later-than-sooner” property if, for any ω̃^P ⪰ ω^P, either R(ω̃^P) ≥ R(ω^P) or R(ω̃^P), R(ω^P) ≤ 0.

Weitzman: special case in which R(ω̃^P) = R(ω^P)
Theorem
Suppose all types satisfy “better-later-than-sooner” property. Then index policy optimal.
Extension 2: Search frictions
Results extend to settings where pull of an arm occupies arbitrary number of periods (before a different action may be taken)
Relative length of time in which pulling arms is interrupted for search can be made arbitrarily small (by re-scaling payoffs and adjusting discount factor)
Hence analysis extends to settings where
search and experimentation “virtually” in parallel
Conclusions
Experimentation with endogenous set of alternatives determined by past searches
Optimal policy: index policy
“physical” arms: Gittins (1979) index
“search” arm: special index with recursive structure
accounts for selection from new arms found
Constant, or improving, search technology: search=replacement
Otherwise,
existing arms put on hold and resumed later
Irreversible actions:
“better-later-than-sooner” property: index policy optimal
Applications:
mediated matching
design of search engines
R&D and patenting
Conclusions
THANKS!
Meta Arms
Arm 1:
1,000 first time
λ ∈ {1, 10} subsequent times (equal probability, perfectly persistent)
Arm 2 (Meta Arm) can be used in two modes
2(A): 100 first time, 0 thereafter
2(B): 11 each period
Selection of Arm 2’s mode irreversible
Optimal policy (δ = .9):
start w. Arm 1
If λ = 10, use arm 2 in mode 2(A) for one period, followed by arm 1 thereafter
If λ = 1, use arm 2 in mode 2(B) thereafter
No index representation, no matter the index definition.
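The stated optimal policy can be verified by brute-force enumeration of the natural candidate continuation plans (the plan names and value formulas are ours; δ = .9):

```python
# Brute-force check of the meta-arm example (delta = 0.9; plan names and
# value formulas are ours). After the first pull of Arm 1 (reward 1,000),
# lambda is learned and persistent; the mode choice on Arm 2 is
# irreversible, which rules out mixing modes 2(A) and 2(B).

delta = 0.9
geo = 1 / (1 - delta)  # present value of a constant flow of 1

def cont_value(lam, plan):
    """Continuation value from period 1 on, given learned lambda."""
    if plan == "arm1_forever":
        return lam * geo
    if plan == "2A_then_arm1":  # 100 once, then Arm 1 forever
        return 100 + delta * lam * geo
    if plan == "2B_forever":    # 11 each period
        return 11 * geo

plans = ["arm1_forever", "2A_then_arm1", "2B_forever"]
for lam in (10, 1):
    print(lam, max(plans, key=lambda p: cont_value(lam, p)))
# 10 2A_then_arm1   (190 > 110 > 100)
# 1 2B_forever      (110 > 109 > 10)
```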
Go back
Policy: formal definition
Period-t decision: d_t ≡ (x_t, y_t)
x_{it} = 1 if “physical” arm i pulled; x_{it} = 0 otherwise
y_t = 1 if search; y_t = 0 otherwise
Sequence of decisions d = (d_t)_{t=0}^∞ feasible if, for all t ≥ 0:
x_{jt} = 1 only if j ∈ I_t
∑_{j∈I_t} x_{jt} + y_t = 1
Rule χ governing feasible decisions (d_t)_{t≥0} is a policy iff sequence of decisions (d_t^χ)_{t≥0} under χ is (F_t^χ)_{t≥0}-adapted, where (F_t^χ)_{t≥0} is natural filtration induced by χ
Go back
Recursive characterization of index for search
Index of search arm can be re-written as
G^S(ω^S) = E^{χ*}[∑_{s=0}^{τ*−1} δ^s (r_s − c_s) | ω^S] / E^{χ*}[∑_{s=0}^{τ*−1} δ^s | ω^S],

where χ* is index policy and τ* is first time s ≥ 1 at which index of search and indexes of all physical arms obtained through search fall below value of search index at s = 0.
Go back
Proof of Lemma 1
v^0 = max{G^*(S_0^P), G^S(ω_0^S)}
t_0: first time all indexes (including search) strictly below v^0 (t_0 = ∞ if event never occurs)
η(v^0|S_0): discounted sum of rewards, net of search costs, till t_0
(includes rewards from newly arrived arms)
v^1 = max{G^*(S_{t_0}^P), G^S(ω_{t_0}^S)} (note: t_0 = κ(v^1|S_0))
...
η(v^i|S_0): net rewards between κ(v^i|S_0) and κ(v^{i+1}|S_0) − 1
Stochastic sequence of values (v^i)_{i≥0}, times (κ(v^i|S_0))_{i≥0}, and discounted net rewards (η(v^i|S_0))_{i≥0}
(Average) payoff under index policy:

V(S_0) = (1 − δ) E[ ∑_{i=0}^∞ δ^{κ(v^i)} η(v^i) | S_0 ].

Starting at κ(v^i), optimal stopping time in index defining v^i is κ(v^{i+1})
if v^i is index of physical arm, κ(v^{i+1}) is first time its index drops below v^i
if v^i is index of search arm, κ(v^{i+1}) is first time search index and indexes of all arms discovered after κ(v^i) drop below v^i
Hence, v^i = expected discounted sum of net rewards, per unit of expected discounted time, from κ(v^i) until κ(v^{i+1}) − 1:

v^i = E[η(v^i) | F_{κ(v^i)}] / ( E[1 − δ^{κ(v^{i+1})−κ(v^i)} | F_{κ(v^i)}] / (1 − δ) )
Same true if multiple arms and/or search have index equal to v i at κ(v i )
Proof of Lemma 1
Plugging in expression for v^i,

V(S_0) = E[ ∑_{i=0}^∞ v^i ( δ^{κ(v^i)} − δ^{κ(v^{i+1})} ) | S_0 ]

Therefore,

V(S_0) = E[ ∫_0^∞ v dδ^{κ(v)} | S_0 ] = ∫_0^∞ ( 1 − E δ^{κ(v|S_0)} ) dv
Go back
Proof of DP
Want to show that V(S0) solves dynamic programming equation:
V(S_0) = max{ V^S(ω^S|S_0), max_{ω^P ∈ {ω^P ∈ Ω^P : S_0^P(ω^P) > 0}} V^P(ω^P|S_0) }

V^S(ω^S|S_0): value from searching and reverting to index policy thereafter
V^P(ω^P|S_0): value from pulling physical arm and reverting to index policy thereafter
Auxiliary arms
e(ω_M^A): state with single auxiliary arm yielding fixed reward M

Note: κ(v | S_0 ∨ e(ω_M^A)) = κ(v|S_0) if v ≥ M, and ∞ otherwise
(S_0 ∨ e(ω_M^A): S_0 plus the auxiliary arm)
From Lemma 1, payoff from index policy when auxiliary arm added:
V(S_0 ∨ e(ω_M^A)) = ∫_0^∞ [1 − E δ^{κ(v|S_0 ∨ e(ω_M^A))}] dv
= M + ∫_M^∞ [1 − E δ^{κ(v|S_0)}] dv
= V(S_0) + ∫_0^M E δ^{κ(v|S_0)} dv
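The last identity can be checked numerically in a degenerate constant-arm case of our own construction (not from the slides):

```python
# Numeric check of the auxiliary-arm identity (our construction): one arm
# with constant reward r (index r forever) plus an auxiliary arm paying M.
# Then E delta^kappa(v|S0) = 1{v >= r}, so the right-hand side is
# r + max(0, M - r), while adding the auxiliary arm yields max(r, M).

r = 3.0

def check(M):
    with_aux = max(r, M)             # V(S0 v e(omega_M^A))
    identity = r + max(0.0, M - r)   # V(S0) + integral_0^M 1{v >= r} dv
    return abs(with_aux - identity) < 1e-12

print(all(check(M) for M in [0.5, 3.0, 7.25]))  # True
```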
Auxiliary arms
D^S(ω^S | e(ω^S) ∨ e(ω_M^A)) ≡ V(e(ω^S) ∨ e(ω_M^A)) − V^S(ω^S | e(ω^S) ∨ e(ω_M^A))

= 0 if M ≤ G^S(ω^S)
> 0 if M > G^S(ω^S)

D^S: loss from starting with search, given only search + auxiliary arm
V(e(ω^S) ∨ e(ω_M^A)): value under index policy, given only search + auxiliary arm
V^S(ω^S | e(ω^S) ∨ e(ω_M^A)): value of searching and reverting to index policy, given only search + auxiliary arm

Similarly, for physical arm in state ω^P:

D^P(ω^P | e(ω^P) ∨ e(ω_M^A)) = 0 if M ≤ G^P(ω^P), and > 0 if M > G^P(ω^P)
Proof that V solves Bellman eq
Can show (“tedious”): D^S(ω^S|S_0) = ∫_0^{v^0} D^S(ω^S | e(ω^S) ∨ e(ω_M^A)) dE δ^{κ(M|S_0^P)}

Hence: D^S(ω^S|S_0) = 0
⟺ D^S(ω^S | e(ω^S) ∨ e(ω_M^A)) = 0, ∀M ∈ [0, max{G^*(S_0^P), G^S(ω^S)}]
⟺ G^*(S_0^P) ≤ G^S(ω^S)

loss from starting with search = 0 iff search has largest index, and > 0 otherwise

Similarly, D^P(ω^P|S_0) = 0 ⟺ G^P(ω^P) = G^*(S_0^P) ≥ G^S(ω^S)

Hence, V(S_0) = max{ V^S(ω^S|S_0), max_{ω^P ∈ {ω^P ∈ Ω^P : S_0^P(ω^P) > 0}} V^P(ω^P|S_0) }
V(S0) solves dynamic programming equation (hence index policy optimal)
Validation
Assumption: For any S, and policy χ,

lim_{t→∞} δ^t E^χ[ ∑_{s=t}^∞ δ^s ( ∑_{j=1}^∞ x_{js} r_{js} − c_s y_s ) | S ] = 0
Solution to DP equation coincides with value function
Assumption satisfied if rewards/costs uniformly bounded
Also compatible with unbounded rewards/costs. E.g., arms are sampling processes, with rewards drawn from Normal distribution with unknown mean
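The bounded case can be illustrated with the standard geometric tail bound (the bound is textbook material, not from the slides):

```python
# Geometric tail bound (standard, not from the slides): with per-period
# net payoffs bounded by R, the discounted tail from time t is at most
# delta**t * R / (1 - delta), which vanishes, so the assumption holds.

delta, R = 0.9, 5.0
tails = [delta ** t * R / (1 - delta) for t in (0, 50, 100, 200)]
print(tails[-1] < 1e-6)  # True
```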
Go back
Irreversible Choice: Proof
Fictitious environment with no irreversible choice
For any physical arm in state ω^P found through search, or pulled in period t, “auxiliary” arm generated at t with fixed reward R(ω^P) also “found”
Auxiliary arms remain in same state forever and do not generate other auxiliary arms
Pulling auxiliary arm corresponding to arm j equivalent to choosing arm j (once pulled, it will be pulled forever)
Given state ω^P, NEW index of physical arm

G^P(ω^P) ≡ sup_{π,τ} E[∑_{s=0}^{τ−1} δ^s r_s | ω^P] / E[∑_{s=0}^{τ−1} δ^s | ω^P]

(similar to search index)
rule π specifies selection over primitive and auxiliary arms
r_s: period-s reward (can coincide with R(ω^P) in case period-s selection is auxiliary arm)
Index for search as before, but search adjusted to include discovery of auxiliary arms
Index policy optimal in fictitious environment
Difficulty: recasting problem this way possible only if auxiliary arms corresponding to past states of same arm never selected
guaranteed by “better-later-than-sooner” property
Go back