Searching for Arms
Daniel Fershtman Alessandro Pavan
October 1, 2019
Motivation
Experimentation/Sequential Learning
central to many problems
In many cases,
endogenous set of alternatives/arms
search
Tradeoff: exploring existing alternatives vs searching for new ones
Motivation
Example
Consumer sequentially explores different alternatives within “consideration set”, while expanding consideration set through search
Firm interviews candidates, while searching for additional suitable candidates to interview
Researcher splits time on several ongoing projects of unknown return, while also searching for new projects
Difference
experimentation: directed
search: undirected
This Paper
Multi-armed bandit problem with endogenous set of arms
Optimal policy: index policy (with special index for Search)
Extension to problems with irreversible choice (based on partial information)
Weitzman: special case where set of boxes exogenous and uncertainty resolved after first inspection
Search Index
Definition
G^S(ω^S) = sup_{τ,π} E[∑_{s=0}^{τ−1} δ^s (r_s^π − c_s^π) | ω^S] / E[∑_{s=0}^{τ−1} δ^s | ω^S]

Recursive representation

G^S(ω^S) = E^{χ*}[∑_{s=0}^{τ*−1} δ^s (r_s − c_s) | ω^S] / E^{χ*}[∑_{s=0}^{τ*−1} δ^s | ω^S]

χ*: policy selecting physical arm with highest Gittins index (among those brought by new search) if such index higher than search index, and search otherwise
τ*: first time search index and indexes of all physical arms brought by new search fall below value of search index at time search launched
Difficulties
Opportunity cost of search depends on entire composition of current choice set
e.g., profitability of searching for additional candidates depends on observable covariates of current candidates (gender, education, etc.) and past interviews
Non-stationarity in search technology
search outcome may depend on
type and number of arms previously found
past search costs
Search competes with its own “descendants” (i.e., with arms discovered through past searches)
correlation
Treating search as “meta arm” requires decisions within meta arm invariant to info outside meta arm
bandit problems with meta arms (e.g., arms that can be activated with different intensities – “super-processes”) rarely admit index solution
Literature
Bandits
Gittins and Jones (1974), Rothschild (1974), Rustichini and Wolinsky(1995), Keller and Rady (1999)...
Surveys: Bergemann and Välimäki (2008), Hörner and Skrzypacz (2017)
Bandits with time-varying set of alternatives
Whittle (1981), Varaiya et al. (1985), Weiss (1988), Weber (1994)...
Sequential search for best alternative (Pandora’s problem)
Weitzman (1979), Olszewski and Weber (2015), Choi and Smith (2016),Doval (2018)...
Experimentation before irreversible choice
Ke, Shen and Villas-Boas (2016), Ke and Villas-Boas (2018)...
⇒ KEY DIFFERENCE: Endogeneity of set of arms
Plan
1 Model
2 Optimal policy
3 Dynamics
4 Proof of main theorem
5 Applications
6 Extensions
irreversible choice
search frictions
multiple search arms
no discounting
Model
Model: Environment
Discrete time: t = 0, ...,∞
Available “physical” arms in period t: I_t = {1, ..., n_t} (I_0 exogenous)
At each t, DM
pull arm among It
search for new arms
opt-out: arm i = 0 (fixed reward equal to outside option)
Pulling arm i ∈ It
reward r_i ∈ ℝ
transition to new “state”
Search
costly
stochastic set of new arms It+1\It
Model: “Physical” Arms
“State” of physical arm: ω^P = (ξ, θ) ∈ Ω^P
ξ ∈ Ξ: persistent “type”
θ ∈ Θ: evolving state
Example:
ξ: type of research project/idea (theory, empirical, experimental)
θ = (σ_m): history of signals about project’s impact
r: utility from working on project
H_{ω^P}: distribution over Ω^P, given ω^P
Reward: r(ω^P)
Usual assumptions:
Arm’s state “frozen” when not pulled
time-autonomous processes
evolution of arms’ states independent across arms, conditional on arms’ types
Model: Search Technology
State of search technology: ω^S = ((c_0, E_0), (c_1, E_1), ..., (c_m, E_m)) ∈ Ω^S
m: number of past searches
c_k: cost of k-th search
E_k = (n_k(ξ) : ξ ∈ Ξ): result of k-th search
n_k(ξ) ∈ ℕ: number of arms of type ξ found
H_{ω^S}: joint distribution over (c, E), given ω^S
Key assumptions
independence of calendar time
independence of arms’ idiosyncratic shocks, θ
Correlation through ξ
Model: Search Technology
Stochasticity in search technology:
learning about alternatives not yet in consideration set
evolution of DM’s ability to find new alternatives
e.g., limited set of outside alternatives
fatigue/experience
Model: states and policies
Period-t state: S_t ≡ (ω_t^S, S_t^P)
ω_t^S: state of search technology
S_t^P ≡ (S_t^P(ω^P) : ω^P ∈ Ω^P): state of physical arms
S_t^P(ω^P): number of physical arms in state ω^P ∈ Ω^P
Definition eliminates dependence on calendar time, while keeping track of relevantinformation
Policy χ describes feasible decisions at all histories
Policy χ optimal if it maximizes expected discounted sum of net payoffs
E^χ[ ∑_{t=0}^∞ δ^t ( ∑_{j=1}^∞ x_{jt} r_{jt} − c_t y_t ) | S_0 ]
Plan
1 Model
2 Optimal policy
3 Dynamics
4 Proof of main theorem
5 Applications
6 Extensions
irreversible choice
search frictions
multiple search arms
no discounting
Optimal Policy
Indexes for Physical Arms
Index for “physical” arms:
G^P(ω^P) ≡ sup_{τ>0} E[∑_{s=0}^{τ−1} δ^s r_s | ω^P] / E[∑_{s=0}^{τ−1} δ^s | ω^P]

τ: stopping time
Interpretations:
maximal expected discounted reward, per unit of expected discounted time(Gittins)
annuity that makes DM indifferent between stopping right away andcontinuing with option to retire in the future (Whittle)
fair charge (Weber)
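The Gittins interpretation above can be made concrete with a toy computation (a minimal sketch of our own, not from the slides): for a deterministic reward sequence, the sup over stopping times τ > 0 reduces to a max over finite horizons, so the ratio index can be computed directly.

```python
# Toy sketch (ours, not the paper's): for a deterministic reward sequence,
# the sup over stopping times tau > 0 is a max over finite horizons.

def gittins_ratio_index(rewards, delta):
    """max over tau>=1 of sum_{s<tau} delta^s r_s / sum_{s<tau} delta^s."""
    best = float("-inf")
    num = den = 0.0
    for s, r in enumerate(rewards):
        num += delta ** s * r
        den += delta ** s
        best = max(best, num / den)
    return best

# Declining rewards: optimal to stop after one pull, so the index is r_0.
print(gittins_ratio_index([10, 6, 2], 0.9))  # 10.0
```

For stochastic arms the sup is over genuine stopping times and the index requires dynamic programming; this deterministic case only illustrates the “reward per unit of discounted time” reading.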
Index for Search
Index for search:
G^S(ω^S) ≡ sup_{π,τ} E[∑_{s=0}^{τ−1} δ^s (r_s^π − c_s^π) | ω^S] / E[∑_{s=0}^{τ−1} δ^s | ω^S]

τ: stopping time
π: choice among arms discovered AFTER t and FUTURE searches
r_s^π, c_s^π: stochastic rewards/costs, under rule π
Interpretation: fair (flow) price for visiting “casinos” found stochastically over time, playing in them, and continuing to search for other casinos
Definition:
accommodates correlation among arms found over time
compatible with possibility that search lasts indefinitely and brings unbounded set of alternatives
Index policy
Definition
Index policy selects at each t “search” iff

G^S(ω_t^S) ≥ G^*(S_t^P)

where G^*(S_t^P) is the maximal index among available physical arms; otherwise, it selects any “physical” arm with index G^*(S_t^P)
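The decision rule above can be sketched in a few lines of Python (a minimal sketch; the function and arm names are ours, not the paper’s):

```python
# Minimal sketch of the index policy's decision rule (names are ours).

def index_policy_action(search_index, physical_indexes):
    """'search' iff G^S >= G* (max physical index); else an arm attaining G*."""
    g_star = max(physical_indexes.values())
    if search_index >= g_star:
        return "search"
    return max(physical_indexes, key=physical_indexes.get)

print(index_policy_action(0.7, {"arm1": 0.5, "arm2": 0.6}))  # search
print(index_policy_action(0.4, {"arm1": 0.5, "arm2": 0.6}))  # arm2
```

Note only the largest physical index enters the comparison, which is exactly the composition-irrelevance point made on the next slides.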
Optimality of index policy
Theorem 1
Index policy optimal in bandit problem with search for new arms
Implications of Index Policy
Each period DM must assign task to a worker
Each worker can be ξ = Male or ξ = Female
different processes over signals/rewards
Probability search brings Male: .8
Fixing value of highest index, optimality of searching for new candidates same no matter whether you have 49 M and 1 F, or 25 M and 25 F
Given highest physical index G^*(S_t^P), composition of set of physical arms irrelevant for decision to search
However, opportunity cost of search (value of continuing with current agents) depends on number of M and F (and past outcomes)
Maximal index among current arms NOT a sufficient statistic for state of current arms when it comes to continuation payoff with current arms
Plan
1 Model
2 Optimal policy
3 Dynamics
4 Proof of main theorem
5 Applications
6 Extensions
irreversible choice
search frictions
multiple search arms
no discounting
Dynamics
Dynamics under index policy
Stationary search technology: H_{ω^S} = H^S for all ω^S
if DM searches at t, all physical arms present at t never pulled again
(search=replacement)
Result extends to “Improving search technologies”:
physical arms required to pass more stringent tests over time
Deteriorating search technology:
e.g., finite set of arms
DM may return to arms present before last search
Plan
1 Model
2 Optimal policy
3 Dynamics
4 Proof of main theorem
5 Applications
6 Extensions
irreversible choice
search frictions
multiple search arms
no discounting
Proof of Main Theorem
Proof of Theorem 1: Road Map
1 Characterization of payoff under index policy
representation uses “timing process” based on optimal stopping in indexes’ definitions:
physical arms: stop when index drops below its initial value (Mandelbaum, 1986)
search: stop when search index and all indexes of newly arrived arms smaller than value of search index when search began
2 Dynamic programming
payoff function under index policy solves dynamic programming equation
Proof: Step 1
κ(v|S) ∈ ℕ ∪ {∞}: minimal time until all indexes (search / existing arms / newly found arms) weakly below v ∈ ℝ_+
Lemma 1

V(S_0) = ∫_0^∞ [1 − E δ^{κ(v|S_0)}] dv

V(S_0): payoff under index policy, starting from state S_0
E δ^{κ(v|S_0)}: expected discounted time till all indexes drop weakly below v
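Lemma 1 can be sanity-checked in a degenerate case of our own construction: a single arm paying a constant reward r forever, with no search arm. The index then stays at r, so κ(v) = 0 for v ≥ r and ∞ below, and the integral collapses to r — the (1 − δ)-normalized payoff of pulling the arm forever:

```python
# Degenerate check of Lemma 1 (our construction): a single arm paying a
# constant reward r forever, no search. Its index is always r, so
# kappa(v) = 0 for v >= r and infinity for v < r, and the integral
# representation collapses to r, matching the (1-delta)-normalized payoff.

delta, r = 0.9, 3.0

def kappa(v):
    return 0 if v >= r else float("inf")

dv, hi = 1e-4, 10.0  # grid for the integral over v (delta**inf == 0.0)
integral = sum((1 - delta ** kappa(i * dv)) * dv for i in range(int(hi / dv)))

average_payoff = (1 - delta) * sum(delta ** t * r for t in range(2000))
print(round(integral, 2), round(average_payoff, 2))  # 3.0 3.0
```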
Proof: Step 2
V(S0) solves dynamic programming equation:
V(S_0) = max{ V^S(ω^S|S_0), max_{ω^P ∈ {ω^P ∈ Ω^P : S_0^P(ω^P) > 0}} V^P(ω^P|S_0) }

V^S(ω^S|S_0): value from searching and reverting to index policy thereafter
V^P(ω^P|S_0): value from pulling physical arm and reverting to index policy thereafter
Proof uses
representation of payoff under index policy from Lemma 1
decomposition of overall problem into collection of binary problems where choice is between single arm (possibly search) and auxiliary fictitious arm with fixed reward
Plan
1 Model
2 Index policy
3 Dynamics
4 Proof of main theorem
5 Applications
6 Extensions
irreversible choice
search frictions
multiple search arms
no discounting
Applications
Dynamic Matching on a Platform
Platform dynamically matches agents
Shocks to match quality
Gradual learning about attractiveness
Platform solicits buyers/sellers in response to past bids (match outcomes)
Joint dynamics of
bidding
matching
solicitation
Distortions in solicitation dynamics (due to mkt power + private info)
Design of Search Engines
Representative buyer uses search engine to identify product to purchase
Search brings set of sponsored and organic links
Clicking on a link brings additional information
GSP auction
sellers compete by submitting bids
higher bids: higher positions
payments linked to clicks
Result makes it possible to
endogenize click-through rates (CTR)
characterize firms’ value for being on different positions/pages
Auction design
how many products per page?
payments
Plan
1 Model
2 Index policy
3 Dynamics
4 Proof of main theorem
5 Applications
6 Extensions
irreversible choice
search frictions
multiple search arms
no discounting
Extensions
Extension 1: Irreversible Choice
Irreversible choice
In each period, DM can
search for new alternatives
experiment with existing ones
irreversibly select one alternative from those found from past searches
Type-ξ arm must be pulled M_ξ ≥ 0 times before DM can irreversibly commit to it (Weitzman: M_ξ = 1 for all ξ)
Flow-payoff from irreversibly selecting arm in state ω^P: R(ω^P)
Extension 1: Irreversible Choice
Partial order on states of physical arms: ω̃^P ⪰ ω^P
e.g., ω^P = (ξ, σ, m) where “m” is number of times arm has been activated
ω̃^P = (ξ, σ, m̃) ⪰ ω^P = (ξ, σ, m) if m̃ ≥ m

Definition
Type ξ satisfies “better-later-than-sooner” property if, for any ω̃^P ⪰ ω^P, either R(ω̃^P) ≥ R(ω^P) or R(ω̃^P), R(ω^P) ≤ 0.

Weitzman: special case in which R(ω̃^P) = R(ω^P)
Theorem
Suppose all types satisfy “better-later-than-sooner” property. Then index policy optimal.
Extension 2: Search frictions
Results extend to settings where pull of an arm occupies arbitrary number of periods (before a different action may be taken)
Relative length of time in which pulling arms is interrupted for search can be made arbitrarily small (by re-scaling payoffs and adjusting discount factor)
Hence analysis extends to settings where
search and experimentation “virtually” in parallel
Conclusions
Experimentation with endogenous set of alternatives determined by past searches
Optimal policy: index policy
“physical” arms: Gittins (1979) index
“search” arm: special index with recursive structure
accounts for selection from new arms found
Constant, or improving, search technology: search=replacement
Otherwise,
existing arms put on hold and resumed later
Irreversible actions:
“better-later-than-sooner” property: index policy optimal
Applications:
mediated matching
design of search engines
R&D and patenting
Conclusions
THANKS!
Meta Arms
Arm 1:
1,000 first time
λ ∈ {1, 10} subsequent times (equal probability, perfectly persistent)
Arm 2 (Meta Arm) can be used in two modes
2(A): 100 first time, 0 thereafter
2(B): 11 each period
Selection of Arm 2’s mode irreversible
Optimal policy (δ = .9):
start w. Arm 1
If λ = 10, use arm 2 in mode 2(A) for one period, followed by arm 1 thereafter
If λ = 1, use arm 2 in mode 2(B) thereafter
No index representation, no matter the index definition.
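The stated optimal policy can be verified by brute-force enumeration of the natural candidate continuation plans (the plan names and value formulas are ours; δ = .9):

```python
# Brute-force check of the meta-arm example (delta = 0.9; plan names and
# value formulas are ours). After the first pull of Arm 1 (reward 1,000),
# lambda is learned and persistent; the mode choice on Arm 2 is
# irreversible, which rules out mixing modes 2(A) and 2(B).

delta = 0.9
geo = 1 / (1 - delta)  # present value of a constant flow of 1

def cont_value(lam, plan):
    """Continuation value from period 1 on, given learned lambda."""
    if plan == "arm1_forever":
        return lam * geo
    if plan == "2A_then_arm1":  # 100 once, then Arm 1 forever
        return 100 + delta * lam * geo
    if plan == "2B_forever":    # 11 each period
        return 11 * geo

plans = ["arm1_forever", "2A_then_arm1", "2B_forever"]
for lam in (10, 1):
    print(lam, max(plans, key=lambda p: cont_value(lam, p)))
# 10 2A_then_arm1   (190 > 110 > 100)
# 1 2B_forever      (110 > 109 > 10)
```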
Go back
Policy: formal definition
Period-t decision: d_t ≡ (x_t, y_t)
x_{it} = 1 if “physical” arm i pulled; x_{it} = 0 otherwise
y_t = 1 if search; y_t = 0 otherwise
Sequence of decisions d = (d_t)_{t=0}^∞ feasible if, for all t ≥ 0:
x_{jt} = 1 only if j ∈ I_t
∑_{j∈I_t} x_{jt} + y_t = 1
Rule χ governing feasible decisions (d_t)_{t≥0} is a policy iff sequence of decisions (d_t^χ)_{t≥0} under χ is (F_t^χ)_{t≥0}-adapted, where (F_t^χ)_{t≥0} is natural filtration induced by χ
Go back
Recursive characterization of index for search
Index of search arm can be re-written as
G^S(ω^S) = E^{χ*}[∑_{s=0}^{τ*−1} δ^s (r_s − c_s) | ω^S] / E^{χ*}[∑_{s=0}^{τ*−1} δ^s | ω^S],

where χ* is index policy and τ* is first time s ≥ 1 at which index of search and indexes of all physical arms obtained through search fall below value of search index at s = 0.
Go back
Proof of Lemma 1
v^0 = max{G^*(S_0^P), G^S(ω_0^S)}
t_0: first time all indexes (including search) strictly below v^0 (t_0 = ∞ if event never occurs)
η(v^0|S_0): discounted sum of rewards, net of search costs, till t_0
(includes rewards from newly arrived arms)
v^1 = max{G^*(S_{t_0}^P), G^S(ω_{t_0}^S)} (note: t_0 = κ(v^1|S_0))
...
η(v^i|S_0): net rewards between κ(v^i|S_0) and κ(v^{i+1}|S_0) − 1
Stochastic sequence of values (v^i)_{i≥0}, times (κ(v^i|S_0))_{i≥0}, and discounted net rewards (η(v^i|S_0))_{i≥0}
(Average) payoff under index policy:

V(S_0) = (1 − δ) E[ ∑_{i=0}^∞ δ^{κ(v^i)} η(v^i) | S_0 ].

Starting at κ(v^i), optimal stopping time in index defining v^i is κ(v^{i+1})
if v^i is index of physical arm, κ(v^{i+1}) is first time its index drops below v^i
if v^i is index of search arm, κ(v^{i+1}) is first time search index and indexes of all arms discovered after κ(v^i) drop below v^i
Hence, v^i = expected discounted sum of net rewards, per unit of expected discounted time, from κ(v^i) until κ(v^{i+1}) − 1:

v^i = E[η(v^i) | F_{κ(v^i)}] / ( E[1 − δ^{κ(v^{i+1})−κ(v^i)} | F_{κ(v^i)}] / (1 − δ) )
Same true if multiple arms and/or search have index equal to v i at κ(v i )
Proof of Lemma 1
Plugging in expression for v^i,

V(S_0) = E[ ∑_{i=0}^∞ v^i ( δ^{κ(v^i)} − δ^{κ(v^{i+1})} ) | S_0 ]

Therefore,

V(S_0) = E[ ∫_0^∞ v dδ^{κ(v)} | S_0 ] = ∫_0^∞ ( 1 − E δ^{κ(v|S_0)} ) dv
Go back
Proof of DP
Want to show that V(S0) solves dynamic programming equation:
V(S_0) = max{ V^S(ω^S|S_0), max_{ω^P ∈ {ω^P ∈ Ω^P : S_0^P(ω^P) > 0}} V^P(ω^P|S_0) }

V^S(ω^S|S_0): value from searching and reverting to index policy thereafter
V^P(ω^P|S_0): value from pulling physical arm and reverting to index policy thereafter
Auxiliary arms
e(ω_M^A): state with single auxiliary arm yielding fixed reward M

Note: κ(v | S_0 ∨ e(ω_M^A)) = κ(v|S_0) if v ≥ M, and ∞ otherwise
(S_0 ∨ e(ω_M^A): S_0 plus the auxiliary arm)
From Lemma 1, payoff from index policy when auxiliary arm added:
V(S_0 ∨ e(ω_M^A)) = ∫_0^∞ [1 − E δ^{κ(v|S_0 ∨ e(ω_M^A))}] dv
= M + ∫_M^∞ [1 − E δ^{κ(v|S_0)}] dv
= V(S_0) + ∫_0^M E δ^{κ(v|S_0)} dv
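The last identity can be checked numerically in a degenerate constant-arm case of our own construction (not from the slides):

```python
# Numeric check of the auxiliary-arm identity (our construction): one arm
# with constant reward r (index r forever) plus an auxiliary arm paying M.
# Then E delta^kappa(v|S0) = 1{v >= r}, so the right-hand side is
# r + max(0, M - r), while adding the auxiliary arm yields max(r, M).

r = 3.0

def check(M):
    with_aux = max(r, M)             # V(S0 v e(omega_M^A))
    identity = r + max(0.0, M - r)   # V(S0) + integral_0^M 1{v >= r} dv
    return abs(with_aux - identity) < 1e-12

print(all(check(M) for M in [0.5, 3.0, 7.25]))  # True
```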
Auxiliary arms
D^S(ω^S | e(ω^S) ∨ e(ω_M^A)) ≡ V(e(ω^S) ∨ e(ω_M^A)) − V^S(ω^S | e(ω^S) ∨ e(ω_M^A))

= 0 if M ≤ G^S(ω^S)
> 0 if M > G^S(ω^S)

D^S: loss from starting with search, given only search + auxiliary arm
V(e(ω^S) ∨ e(ω_M^A)): value under index policy, given only search + auxiliary arm
V^S(ω^S | e(ω^S) ∨ e(ω_M^A)): value of searching and reverting to index policy, given only search + auxiliary arm

Similarly, for physical arm in state ω^P:

D^P(ω^P | e(ω^P) ∨ e(ω_M^A)) = 0 if M ≤ G^P(ω^P), and > 0 if M > G^P(ω^P)
Proof that V solves Bellman eq
Can show (“tedious”): D^S(ω^S|S_0) = ∫_0^{v^0} D^S(ω^S | e(ω^S) ∨ e(ω_M^A)) dE δ^{κ(M|S_0^P)}

Hence: D^S(ω^S|S_0) = 0
⟺ D^S(ω^S | e(ω^S) ∨ e(ω_M^A)) = 0, ∀M ∈ [0, max{G^*(S_0^P), G^S(ω^S)}]
⟺ G^*(S_0^P) ≤ G^S(ω^S)

loss from starting with search = 0 iff search has largest index, and > 0 otherwise

Similarly, D^P(ω^P|S_0) = 0 ⟺ G^P(ω^P) = G^*(S_0^P) ≥ G^S(ω^S)

Hence, V(S_0) = max{ V^S(ω^S|S_0), max_{ω^P ∈ {ω^P ∈ Ω^P : S_0^P(ω^P) > 0}} V^P(ω^P|S_0) }
V(S0) solves dynamic programming equation (hence index policy optimal)
Validation
Assumption: For any S, and policy χ,

lim_{t→∞} δ^t E^χ[ ∑_{s=t}^∞ δ^s ( ∑_{j=1}^∞ x_{js} r_{js} − c_s y_s ) | S ] = 0
Solution to DP equation coincides with value function
Assumption satisfied if rewards/costs uniformly bounded
Also compatible with unbounded rewards/costs. E.g., arms are sampling processes, with rewards drawn from Normal distribution with unknown mean
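The bounded case can be illustrated with the standard geometric tail bound (the bound is textbook material, not from the slides):

```python
# Geometric tail bound (standard, not from the slides): with per-period
# net payoffs bounded by R, the discounted tail from time t is at most
# delta**t * R / (1 - delta), which vanishes, so the assumption holds.

delta, R = 0.9, 5.0
tails = [delta ** t * R / (1 - delta) for t in (0, 50, 100, 200)]
print(tails[-1] < 1e-6)  # True
```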
Go back
Irreversible Choice: Proof
Fictitious environment with no irreversible choice
For any physical arm in state ω^P found through search, or pulled in period t, “auxiliary” arm generated at t with fixed reward R(ω^P) also “found”
Auxiliary arms remain in same state forever and do not generate other auxiliary arms
Pulling auxiliary arm corresponding to arm j equivalent to choosing arm j (once pulled, it will be pulled forever)
Given state ω^P, NEW index of physical arm

G^P(ω^P) ≡ sup_{π,τ} E[∑_{s=0}^{τ−1} δ^s r_s | ω^P] / E[∑_{s=0}^{τ−1} δ^s | ω^P]

(similar to search index)
rule π specifies selection over primitive and auxiliary arms
r_s: period-s reward (can coincide with R(ω^P) in case period-s selection is auxiliary arm)
Index for search as before, but search adjusted to include discovery of auxiliary arms
Index policy optimal in fictitious environment
Difficulty: recasting problem this way possible only if auxiliary arms corresponding to past states of same arm never selected
guaranteed by “better-later-than-sooner” property
Go back