Searching for Arms - Northwestern University, faculty.wcas.northwestern.edu/~apa522/Slides-SFA.pdf

Searching for Arms

Daniel Fershtman Alessandro Pavan

October 1, 2019

Motivation

Experimentation/Sequential Learning

central to many problems

In many cases,

endogenous set of alternatives/arms

search

Tradeoff: exploring existing alternatives vs searching for new ones

Motivation

Example

Consumer sequentially explores different alternatives within “consideration set”, while expanding consideration set through search

Firm interviews candidates, while searching for additional suitable candidates to interview

Researcher splits time on several ongoing projects of unknown return, while also searching for new projects

Difference

experimentation: directed

search: undirected

This Paper

Multi-armed bandit problem with endogenous set of arms

Optimal policy: index policy (with special index for Search)

Extension to problems with irreversible choice (based on partial information)

Weitzman: special case where set of boxes exogenous and uncertainty resolved after first inspection

Search Index

Definition

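$$G^S(\omega^S) = \sup_{\tau,\pi} \frac{E\left[\sum_{s=0}^{\tau-1} \delta^s (r^\pi_s - c^\pi_s) \mid \omega^S\right]}{E\left[\sum_{s=0}^{\tau-1} \delta^s \mid \omega^S\right]}$$

Recursive representation

$$G^S(\omega^S) = \frac{E^{\chi^*}\left[\sum_{s=0}^{\tau^*-1} \delta^s (r_s - c_s) \mid \omega^S\right]}{E^{\chi^*}\left[\sum_{s=0}^{\tau^*-1} \delta^s \mid \omega^S\right]}$$

$\chi^*$: policy selecting physical arm with highest Gittins index (among those brought by new search) if such index higher than search index, and search otherwise

$\tau^*$: first time search index and indexes of all physical arms brought by new search fall below value of search index at time search launched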

Difficulties

Opportunity cost of search depends on entire composition of current choice set

e.g., profitability of searching for additional candidates depends on observable covariates of current candidates (gender, education, etc.) and past interviews

Non-stationarity in search technology

search outcome may depend on

type and number of arms previously found

past search costs

Search competes with its own “descendants” (i.e., with arms discovered through past searches)

correlation

Treating search as “meta arm” requires decisions within meta arm invariant to info outside meta arm

bandit problems with meta arms (e.g., arms that can be activated with different intensities – “super-processes”) rarely admit index solution

Literature

Bandits

Gittins and Jones (1974), Rothschild (1974), Rustichini and Wolinsky (1995), Keller and Rady (1999)...

Surveys: Bergemann and Välimäki (2008), Hörner and Skrzypacz (2017)

Bandits with time-varying set of alternatives

Whittle (1981), Varaiya et al. (1985), Weiss (1988), Weber (1994)...

Sequential search for best alternative (Pandora’s problem)

Weitzman (1979), Olszewski and Weber (2015), Choi and Smith (2016), Doval (2018)...

Experimentation before irreversible choice

Ke, Shen and Villas-Boas (2016), Ke and Villas-Boas (2018)...

⇒ KEY DIFFERENCE: Endogeneity of set of arms

Plan

1 Model

2 Optimal policy

3 Dynamics

4 Proof of main theorem

5 Applications

6 Extensions

irreversible choice

search frictions

multiple search arms

no discounting

Model

Model: Environment

Discrete time: t = 0, ...,∞

Available “physical” arms in period t: $I_t = \{1, ..., n_t\}$ ($I_0$ exogenous)

At each t, DM can

pull arm among $I_t$

search for new arms

opt-out: arm i = 0 (fixed reward equal to outside option)

Pulling arm $i \in I_t$

reward $r_i \in \mathbb{R}$

transition to new “state”

Search

costly

stochastic set of new arms $I_{t+1} \setminus I_t$

Model: “Physical” Arms

“State” of physical arm: ωP = (ξ, θ) ∈ ΩP

ξ ∈ Ξ : persistent “type”

θ ∈ Θ: evolving state

Example:

ξ: type of research project/idea (theory, empirical, experimental)

θ = $(\sigma_m)$: history of signals about project’s impact

r: utility from working on project

$H_{\omega^P}$: distribution over $\Omega^P$, given $\omega^P$

Reward: $r(\omega^P)$

Usual assumptions:

Arm’s state “frozen” when not pulled

time-autonomous processes

evolution of arms’ states independent across arms, conditional on arms’ types

Model: Search Technology

State of search technology: ωS = ((c0,E0), (c1,E1), ..., (cm,Em)) ∈ ΩS

m: number of past searches

ck : cost of k-th search

Ek = (nk(ξ) : ξ ∈ Ξ): result of k-th search

nk(ξ) ∈ N: number of arms of type ξ found

HωS : joint distribution over (c,E), given ωS

Key assumptions

independence of calendar time

independence of arms’ idiosyncratic shocks, θ

Correlation through ξ

Model: Search Technology

Stochasticity in search technology:

learning about alternatives not yet in consideration set

evolution of DM’s ability to find new alternatives

e.g., limited set of outside alternatives

fatigue/experience

Model: states and policies

Period-t state: $S_t \equiv (\omega^S_t, S^P_t)$

$\omega^S_t$: state of search technology

$S^P_t \equiv (S^P_t(\omega^P) : \omega^P \in \Omega^P)$: state of physical arms

$S^P_t(\omega^P)$: number of physical arms in state $\omega^P \in \Omega^P$

Definition eliminates dependence on calendar time, while keeping track of relevant information

Policy χ describes feasible decisions at all histories

Policy χ optimal if it maximizes expected discounted sum of net payoffs

$$E^\chi\left[\sum_{t=0}^{\infty} \delta^t \left(\sum_{j=1}^{\infty} x_{jt} r_{jt} - c_t y_t\right) \Big|\, S_0\right]$$

Optimal Policy

Indexes for Physical Arms

Index for “physical” arms:

$$G^P(\omega^P) \equiv \sup_{\tau>0} \frac{E\left[\sum_{s=0}^{\tau-1} \delta^s r_s \mid \omega^P\right]}{E\left[\sum_{s=0}^{\tau-1} \delta^s \mid \omega^P\right]}$$

τ: stopping time

Interpretations:

maximal expected discounted reward, per unit of expected discounted time (Gittins)

annuity that makes DM indifferent between stopping right away and continuing with option to retire in the future (Whittle)

fair charge (Weber)
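
The first interpretation can be made concrete in a special case. The sketch below assumes (our assumption, not the slides') an arm whose future rewards are a known finite list, so that a stopping time is just a number of pulls and the sup becomes a max over horizons. The function name `gittins_index_deterministic` is hypothetical.

```python
def gittins_index_deterministic(rewards, delta):
    """Index of an arm whose future rewards are the known list `rewards`.

    Illustrative assumption: no randomness, so a stopping time is a number
    of pulls tau >= 1 and the sup is a max over tau = 1, ..., len(rewards).
    """
    best = float("-inf")
    disc_reward = 0.0   # running sum of delta^s * r_s
    disc_time = 0.0     # running sum of delta^s
    for s, r in enumerate(rewards):
        disc_reward += delta ** s * r
        disc_time += delta ** s
        # discounted reward per unit of discounted time when stopping after s+1 pulls
        best = max(best, disc_reward / disc_time)
    return best


# A low first reward followed by a high one: the best horizon is tau = 2,
# i.e. (1 + 0.9 * 5) / (1 + 0.9).
index = gittins_index_deterministic([1.0, 5.0, 0.0], delta=0.9)
print(index)
```

With non-increasing rewards the max is attained at tau = 1, so the index is simply the current reward.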

Index for Search

Index for search:

$$G^S(\omega^S) \equiv \sup_{\pi,\tau} \frac{E\left[\sum_{s=0}^{\tau-1} \delta^s (r^\pi_s - c^\pi_s) \mid \omega^S\right]}{E\left[\sum_{s=0}^{\tau-1} \delta^s \mid \omega^S\right]}$$

τ: stopping time

π: choice among arms discovered AFTER t and FUTURE searches

$r^\pi_s, c^\pi_s$: stochastic rewards/costs, under rule π

Interpretation: fair (flow) price for visiting “casinos” found stochastically over time, playing in them, and continuing to search for other casinos

Definition:

accommodates correlation among arms found over time

compatible with possibility that search lasts indefinitely and brings unbounded set of alternatives
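
To see the ratio at work, here is a minimal sketch under strong simplifying assumptions that are ours, not the slides': search is stationary, each search costs a fixed amount, takes one period, and finds exactly one arm paying a known constant reward forever. The code evaluates the ratio only over simple threshold rules (keep the new arm forever if its reward is at least the threshold, otherwise stop), which is where one would expect the sup to be attained in this special case. The function name and parameters are hypothetical.

```python
def stationary_search_index(values, probs, cost, delta):
    """Ratio in the search-index definition, maximized over threshold rules.

    Illustrative assumptions (not the general model):
      * every search costs `cost`, lasts one period, and finds one arm;
      * the arm pays a constant reward x forever, x drawn from the discrete
        distribution given by (values, probs);
      * after a search, the rule keeps the arm forever if x >= threshold
        and stops otherwise.
    """
    annuity = delta / (1.0 - delta)   # sum of delta^s over s >= 1
    best = -cost                      # search once, then stop: ratio is -cost / 1
    for threshold in values:
        kept = [(x, p) for x, p in zip(values, probs) if x >= threshold]
        numerator = -cost + annuity * sum(p * x for x, p in kept)
        denominator = 1.0 + annuity * sum(p for _, p in kept)
        best = max(best, numerator / denominator)
    return best


# Arms worth 0 or 10 with equal probability, search cost 1, delta = 0.9.
print(stationary_search_index([0.0, 10.0], [0.5, 0.5], cost=1.0, delta=0.9))
```

In this example keeping only the high arm gives (-1 + 9 * 5) / (1 + 9 * 0.5), which beats keeping every arm.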

Index policy

Definition

Index policy selects “search” at each t iff

$$G^S(\omega^S_t) \geq G^*(S^P_t),$$

where $G^*(S^P_t)$ is maximal index among available physical arms; otherwise, it selects any “physical” arm with index $G^*(S^P_t)$
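
The decision rule itself is a one-line comparison once the indexes are known. A minimal sketch, assuming the indexes have already been computed and are passed in as plain numbers (the function name, the arm labels, and the handling of an empty choice set are our illustrative choices):

```python
def index_policy_action(search_index, arm_indexes):
    """Return "search" or the label of a physical arm with maximal index.

    `arm_indexes` maps a (hypothetical) arm label to its current index.
    Ties between search and the best arm go to search, matching the weak
    inequality in the definition.
    """
    if not arm_indexes:
        # Illustrative convention: with no physical arm available, search.
        return "search"
    best_arm = max(arm_indexes, key=arm_indexes.get)
    if search_index >= arm_indexes[best_arm]:
        return "search"
    return best_arm


print(index_policy_action(8.0, {"a": 7.5, "b": 3.0}))  # search index is highest
print(index_policy_action(2.0, {"a": 7.5, "b": 3.0}))  # arm "a" has highest index
```

Only the maximal physical index enters the comparison; the rest of the composition of the choice set does not.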

Optimality of index policy

Theorem 1

Index policy optimal in bandit problem with search for new arms

Implications of Index Policy

Each period DM must assign task to a worker

Each worker can be ξ = Male or ξ = Female

different processes over signals/rewards

Probability search brings Male: .8

Fixing value of highest index, optimality of searching for new candidates same no matter whether you have 49 M and 1 F, or 25 M and 25 F

Given highest physical index $G^*(S^P_t)$, composition of set of physical arms irrelevant for decision to search

However, opportunity cost of search (value of continuing with current agents) depends on number of M and F (and past outcomes)

Maximal index among current arms NOT sufficient statistic for state of current arms when it comes to continuation payoff with current arms

Dynamics

Dynamics under index policy

Stationary search technology: $H_{\omega^S} = H^S$ for all $\omega^S$

if DM searches at t, all physical arms present at t never pulled again (search = replacement)

Result extends to “Improving search technologies”:

physical arms required to pass more stringent tests over time

Deteriorating search technology:

e.g., finite set of arms

DM may return to arms present before last search

Proof of Main Theorem

Proof of Theorem 1: Road Map

1 Characterization of payoff under index policy

representation uses “timing process” based on optimal stopping in indexes’ definition:

physical arms: stop when index drops below its initial value (Mandelbaum, 1986)

search: stop when search index and all indexes of newly arrived arms smaller than value of search index when search began

2 Dynamic programming

payoff function under index policy solves dynamic programming equation

Proof: Step 1

$\kappa(v|S) \in \mathbb{N} \cup \{\infty\}$: minimal time until all indexes (search/existing arms/newly found arms) weakly below $v \in \mathbb{R}_+$

Lemma 1

$$V(S_0) = \int_0^\infty \left[1 - E\,\delta^{\kappa(v|S_0)}\right] dv$$

$V(S_0)$: payoff under index policy, starting from state $S_0$

$E\,\delta^{\kappa(v|S_0)}$: expected discounted time till all indexes drop weakly below v
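
Lemma 1 can be checked by hand in a toy case. The sketch below assumes (our assumptions) that there is no search and that each arm is a finite list of non-increasing, non-negative deterministic rewards followed by 0 forever. For such arms the index equals the current reward, so the index policy plays all rewards in decreasing order, and $\kappa(v)$ is the number of rewards strictly above v. Payoffs are measured as averages, i.e. multiplied by $(1-\delta)$. The function name `lemma1_check` is hypothetical.

```python
def lemma1_check(arms, delta):
    """Return both sides of Lemma 1 for deterministic, non-increasing arms.

    Illustrative assumptions: no search; each arm is a list of
    non-increasing, non-negative rewards, followed by 0 forever.
    """
    # Index policy = play rewards in decreasing order across all arms.
    rewards = sorted((r for arm in arms for r in arm), reverse=True)

    # Left side: (average) discounted payoff under the index policy.
    payoff = (1 - delta) * sum(delta ** t * r for t, r in enumerate(rewards))

    # Right side: integral of 1 - delta^kappa(v). kappa is a step function:
    # kappa(v) = k for v between the (k+1)-th and the k-th largest reward.
    levels = rewards + [0.0]
    integral = 0.0
    for k in range(1, len(rewards) + 1):
        gap = levels[k - 1] - levels[k]   # width of the interval where kappa = k
        integral += gap * (1 - delta ** k)
    return payoff, integral


lhs, rhs = lemma1_check([[5.0, 2.0], [4.0, 1.0]], delta=0.9)
print(lhs, rhs)  # the two sides agree up to floating-point error
```

The agreement here is just summation by parts; the lemma's content is that the same representation holds with stochastic arms and search.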

Proof: Step 2

$V(S_0)$ solves dynamic programming equation:

$$V(S_0) = \max\left\{ V^S(\omega^S|S_0),\ \max_{\omega^P \in \{\omega^P \in \Omega^P :\, S^P_0(\omega^P) > 0\}} V^P(\omega^P|S_0) \right\}$$

$V^S(\omega^S|S_0)$: value from searching and reverting to index policy thereafter

$V^P(\omega^P|S_0)$: value from pulling physical arm and reverting to index policy thereafter

Proof uses

representation of payoff under index policy from Lemma 1

decomposition of overall problem into collection of binary problems where choice is between single arm (possibly search) and auxiliary fictitious arm with fixed reward

Applications

Dynamic Matching on a Platform

Platform dynamically matches agents

Shocks to match quality

Gradual learning about attractiveness

Platform solicits buyers/sellers in response to past bids (match outcomes)

Joint dynamics of

bidding

matching

solicitation

Distortions in solicitation dynamics (due to mkt power + private info)

Design of Search Engines

Representative buyer uses search engine to identify product to purchase

Search brings set of sponsored and organic links

Clicking on a link brings additional information

GSP auction

sellers compete by submitting bids

higher bids: higher positions

payments linked to clicks

Result permits one to

endogenize click-through rates (CTR)

characterize firms’ value for being on different positions/pages

Auction design

how many products per page?

payments

Extensions

Extension 1: Irreversible Choice

Irreversible choice

In each period, DM can

search for new alternatives

experiment with existing ones

irreversibly select one alternative from those found from past searches

Type-ξ arm must be pulled $M_\xi \geq 0$ times before DM can irreversibly commit to it (Weitzman: $M_\xi = 1$ for all ξ)

Flow-payoff from irreversibly selecting arm in state ωP : R(ωP)

Extension 1: Irreversible Choice

Partial order on states of physical arms: $\hat{\omega}^P \succeq \omega^P$

e.g., $\omega^P = (\xi, \sigma, m)$ where “m” is number of times arm has been activated

$\hat{\omega}^P = (\xi, \hat{\sigma}, \hat{m}) \succeq \omega^P = (\xi, \sigma, m)$ if $\hat{m} \geq m$

Definition

Type ξ satisfies “better-later-than-sooner” property if, for any $\hat{\omega}^P \succeq \omega^P$, either $R(\hat{\omega}^P) \geq R(\omega^P)$ or $R(\hat{\omega}^P), R(\omega^P) \leq 0$.

Weitzman: special case in which $R(\hat{\omega}^P) = R(\omega^P)$

Theorem

Suppose all types satisfy “better-later-than-sooner” property. Then index policy optimal.

Extension 2: Search frictions

Results extend to settings where pull of an arm occupies arbitrary number of periods (before a different action may be taken)

Relative length of time in which pulling arms is interrupted for search can be made arbitrarily small (by re-scaling payoffs and adjusting discount factor)

Hence analysis extends to settings where

search and experimentation “virtually” in parallel

Conclusions

Experimentation with endogenous set of alternatives determined by past searches

Optimal policy: index policy

“physical” arms: Gittins (1979) index

“search” arm: special index with recursive structure

accounts for selection from new arms found

Constant, or improving, search technology: search=replacement

Otherwise,

existing arms put on hold and resumed later

Irreversible actions:

“better-later-than-sooner” property: index policy optimal

Applications:

mediated matching

design of search engines

R&D and patenting

Conclusions

THANKS!

Meta Arms

Arm 1:

1,000 first time

$\lambda \in \{1, 10\}$ subsequent times (equal probability, perfectly persistent)

Arm 2 (Meta Arm) can be used in two modes

2(A): 100 first time, 0 thereafter

2(B): 11 each period

Selection of Arm 2’s mode irreversible

Optimal policy (δ = .9):

start w. Arm 1

If λ = 10, use arm 2 in mode 2(A) for one period, followed by arm 1 thereafter

If λ = 1, use arm 2 in mode 2(B) thereafter

No index representation, no matter how index is defined
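
The arithmetic behind the optimal policy can be checked directly. The sketch below compares three continuation plans once λ has been revealed by the first pull of Arm 1; the plan labels and function names are ours, and the list of plans is illustrative rather than exhaustive.

```python
DELTA = 0.9


def perpetuity(flow, delta=DELTA):
    """Discounted value of receiving `flow` in every period, starting now."""
    return flow / (1 - delta)


def continuation_values(lam, delta=DELTA):
    """Discounted payoffs of three plans once Arm 1's persistent reward is known."""
    return {
        "arm 1 forever": perpetuity(lam, delta),
        "2(A) once, then arm 1": 100 + delta * perpetuity(lam, delta),
        "2(B) forever": perpetuity(11, delta),
    }


for lam in (10, 1):
    values = continuation_values(lam)
    best = max(values, key=values.get)
    print(lam, best, {plan: round(v, 2) for plan, v in values.items()})
```

The preferred mode of Arm 2 flips with λ, which is information about Arm 1. This is why no index computed from Arm 2's own state can describe the optimal policy.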

Policy: formal definition

Period-t decision: dt ≡ (xt , yt)

$x_{it} = 1$ if “physical” arm i pulled; $x_{it} = 0$ otherwise

$y_t = 1$ if search; $y_t = 0$ otherwise

Sequence of decisions $d = (d_t)_{t=0}^{\infty}$ feasible if, for all t ≥ 0:

$x_{jt} = 1$ only if $j \in I_t$

$\sum_{j \in I_t} x_{jt} + y_t = 1$

Rule χ governing feasible decisions $(d_t)_{t \geq 0}$ is a policy iff sequence of decisions $\{d^\chi_t\}_{t \geq 0}$ under χ is $\{\mathcal{F}^\chi_t\}_{t \geq 0}$-adapted, where $\{\mathcal{F}^\chi_t\}_{t \geq 0}$ is natural filtration induced by χ
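
The two feasibility conditions amount to "exactly one action per period, and only available arms can be pulled". A minimal sketch for a single period, where the dictionary-based representation and the name `is_feasible` are our illustrative choices:

```python
def is_feasible(x_t, y_t, available):
    """Check the two feasibility conditions for one period.

    x_t: dict mapping arm label -> 0/1 (pulled or not)
    y_t: 0/1 (search or not)
    available: set of arm labels playing the role of I_t
    """
    # Condition 1: an arm can be pulled only if it is available.
    pulls_only_available = all(j in available for j, x in x_t.items() if x == 1)
    # Condition 2: exactly one action (one pull or one search) per period.
    exactly_one_action = sum(x_t.values()) + y_t == 1
    return pulls_only_available and exactly_one_action


print(is_feasible({"a": 1, "b": 0}, 0, {"a", "b"}))  # pull one available arm
print(is_feasible({"a": 1, "b": 0}, 1, {"a", "b"}))  # pull and search together
```

Adaptedness, the third requirement, is about information rather than a per-period check, so it is not captured here.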

Recursive characterization of index for search

Index of search arm can be re-written as

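$$G^S(\omega^S) = \frac{E^{\chi^*}\left[\sum_{s=0}^{\tau^*-1} \delta^s (r_s - c_s) \mid \omega^S\right]}{E^{\chi^*}\left[\sum_{s=0}^{\tau^*-1} \delta^s \mid \omega^S\right]},$$

where $\chi^*$ is index policy and $\tau^*$ is first time s ≥ 1 at which index of search and indexes of all physical arms obtained through search fall below value of search index at s = 0.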

Proof of Lemma 1

$v^0 = \max\{G^*(S^P_0), G^S(\omega^S_0)\}$

$t_0$: first time all indexes (including search) strictly below $v^0$ ($t_0 = \infty$ if event never occurs)

$\eta(v^0|S_0)$: discounted sum of rewards, net of search costs, till $t_0$ (includes rewards from newly arrived arms)

$v^1 = \max\{G^*(S^P_{t_0}), G^S(\omega^S_{t_0})\}$ (note: $t_0 = \kappa(v^1|S_0)$)

...

$\eta(v^i|S_0)$: net rewards between $\kappa(v^i|S_0)$ and $\kappa(v^{i+1}|S_0) - 1$

Stochastic sequence of values $(v^i)_{i \geq 0}$, times $(\kappa(v^i|S_0))_{i \geq 0}$, and discounted net rewards $(\eta(v^i|S_0))_{i \geq 0}$

Proof of Lemma 1

(Average) payoff under index policy:

$$V(S_0) = (1-\delta)\, E\left[\sum_{i=0}^{\infty} \delta^{\kappa(v^i)} \eta(v^i) \,\Big|\, S_0\right].$$

Starting at $\kappa(v^i)$, optimal stopping time in index defining $v^i$ is $\kappa(v^{i+1})$

if $v^i$ is index of physical arm, $\kappa(v^{i+1})$ is first time its index drops below $v^i$

if $v^i$ is index of search arm, $\kappa(v^{i+1})$ is first time search index and indexes of all arms discovered after $\kappa(v^i)$ drop below $v^i$

Hence, $v^i$ = expected discounted sum of net rewards, per unit of expected discounted time, from $\kappa(v^i)$ until $\kappa(v^{i+1}) - 1$:

$$v^i = \frac{E\left[\eta(v^i) \mid \mathcal{F}_{\kappa(v^i)}\right]}{E\left[1 - \delta^{\kappa(v^{i+1}) - \kappa(v^i)} \mid \mathcal{F}_{\kappa(v^i)}\right] / (1-\delta)}$$

Same true if multiple arms and/or search have index equal to $v^i$ at $\kappa(v^i)$

Proof of Lemma 1

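Plugging in expression for $v^i$,

$$V(S_0) = E\left[\sum_{i=0}^{\infty} v^i \left(\delta^{\kappa(v^i)} - \delta^{\kappa(v^{i+1})}\right) \Big|\, S_0\right]$$

Therefore,

$$V(S_0) = E\left[\int_0^\infty v \, d\delta^{\kappa(v)} \,\Big|\, S_0\right] = \int_0^\infty \left(1 - E\,\delta^{\kappa(v|S_0)}\right) dv$$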
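
The last equality is the usual tail-integral identity. A sketch of the step, writing $F(v) \equiv \delta^{\kappa(v)}$ and assuming (our simplification) that all indexes are bounded above by some $\bar v$, so that $\kappa(v) = 0$ and $F(v) = 1$ for $v \geq \bar v$:

```latex
% F(v) = \delta^{\kappa(v)} is non-decreasing in v, with F(\bar v) = 1.
% Integration by parts along each path:
\int_0^{\bar v} v \, dF(v)
  = \bar v \, F(\bar v) - \int_0^{\bar v} F(v) \, dv
  = \int_0^{\bar v} \bigl(1 - F(v)\bigr) \, dv
  = \int_0^{\infty} \bigl(1 - F(v)\bigr) \, dv .
% Taking expectations conditional on S_0 and exchanging expectation and
% integral (the integrand is non-negative) gives
E\left[\int_0^{\infty} v \, d\delta^{\kappa(v)} \,\Big|\, S_0\right]
  = \int_0^{\infty} \bigl(1 - E\,\delta^{\kappa(v|S_0)}\bigr) \, dv .
```

The jumps of $F$ occur at the values $v^i$, with size $\delta^{\kappa(v^i)} - \delta^{\kappa(v^{i+1})}$, which is how the sum over i becomes the Stieltjes integral.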

Proof of DP

Want to show that V(S0) solves dynamic programming equation:

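$$V(S_0) = \max\left\{ V^S(\omega^S|S_0),\ \max_{\omega^P \in \{\omega^P \in \Omega^P :\, S^P_0(\omega^P) > 0\}} V^P(\omega^P|S_0) \right\}$$

$V^S(\omega^S|S_0)$: value from searching and reverting to index policy thereafter

$V^P(\omega^P|S_0)$: value from pulling physical arm and reverting to index policy thereafter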

Auxiliary arms

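$e(\omega^A_M)$: state with single auxiliary arm yielding fixed reward M

Note:

$$\kappa(v \mid S_0 \vee e(\omega^A_M)) = \begin{cases} \kappa(v|S_0) & \text{if } v \geq M \\ \infty & \text{otherwise} \end{cases}$$

where $S_0 \vee e(\omega^A_M)$ denotes $S_0$ + auxiliary arm

From Lemma 1, payoff from index policy when auxiliary arm added:

$$\begin{aligned} V(S_0 \vee e(\omega^A_M)) &= \int_0^\infty \left[1 - E\,\delta^{\kappa(v|S_0 \vee e(\omega^A_M))}\right] dv \\ &= M + \int_M^\infty \left[1 - E\,\delta^{\kappa(v|S_0)}\right] dv \\ &= V(S_0) + \int_0^M E\,\delta^{\kappa(v|S_0)} \, dv \end{aligned}$$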
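
The two equalities after the first line can be unpacked as follows; this is a sketch that uses only Lemma 1 and the note on $\kappa$ for the augmented state.

```latex
% For v < M the auxiliary arm's index M never drops weakly below v,
% so \kappa = \infty and \delta^{\kappa} = 0 on [0, M):
\int_0^\infty \Bigl[1 - E\,\delta^{\kappa(v|S_0 \vee e(\omega^A_M))}\Bigr] dv
  = \int_0^M 1 \, dv + \int_M^\infty \Bigl[1 - E\,\delta^{\kappa(v|S_0)}\Bigr] dv
  = M + \int_M^\infty \Bigl[1 - E\,\delta^{\kappa(v|S_0)}\Bigr] dv .
% Lemma 1 applied to S_0 gives
% \int_M^\infty [\,\cdot\,] dv = V(S_0) - \int_0^M [1 - E\,\delta^{\kappa(v|S_0)}] dv, hence
M + \int_M^\infty \Bigl[1 - E\,\delta^{\kappa(v|S_0)}\Bigr] dv
  = M + V(S_0) - M + \int_0^M E\,\delta^{\kappa(v|S_0)} \, dv
  = V(S_0) + \int_0^M E\,\delta^{\kappa(v|S_0)} \, dv .
```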

Auxiliary arms

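$$D^S(\omega^S \mid e(\omega^S) \vee e(\omega^A_M)) \equiv V(e(\omega^S) \vee e(\omega^A_M)) - V^S(\omega^S \mid e(\omega^S) \vee e(\omega^A_M)) \ \begin{cases} = 0 & \text{if } M \leq G^S(\omega^S) \\ > 0 & \text{if } M > G^S(\omega^S) \end{cases}$$

$D^S(\omega^S \mid e(\omega^S) \vee e(\omega^A_M))$: loss from starting with search given only search + auxiliary arm

$V(e(\omega^S) \vee e(\omega^A_M))$: value under index policy given only search + auxiliary arm

$V^S(\omega^S \mid e(\omega^S) \vee e(\omega^A_M))$: value of searching and reverting to index policy given only search + auxiliary arm

Similarly, for physical arm in state $\omega^P$:

$$D^P(\omega^P \mid e(\omega^P) \vee e(\omega^A_M)) \ \begin{cases} = 0 & \text{if } M \leq G^P(\omega^P) \\ > 0 & \text{if } M > G^P(\omega^P) \end{cases}$$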

Proof that V solves Bellman eq

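Can show (“tedious”):

$$D^S(\omega^S|S_0) = \int_0^{v^0} D^S(\omega^S \mid e(\omega^S) \vee e(\omega^A_M)) \, dE\,\delta^{\kappa(M|S^P_0)}$$

Hence: $D^S(\omega^S|S_0) = 0$

$\iff D^S(\omega^S \mid e(\omega^S) \vee e(\omega^A_M)) = 0, \ \forall M \in [0, \max\{G^*(S^P_0), G^S(\omega^S)\}]$

$\iff G^*(S^P_0) \leq G^S(\omega^S)$

loss from starting with search = 0 iff search has largest index, and > 0 otherwise

Similarly, $D^P(\omega^P|S_0) = 0 \iff G^P(\omega^P) = G^*(S^P_0) \geq G^S(\omega^S)$

Hence,

$$V(S_0) = \max\left\{ V^S(\omega^S|S_0),\ \max_{\omega^P \in \{\omega^P \in \Omega^P :\, S^P_0(\omega^P) > 0\}} V^P(\omega^P|S_0) \right\}$$

$V(S_0)$ solves dynamic programming equation (hence index policy optimal)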

Validation

Assumption: For any S, and policy χ,

$$\lim_{t \to \infty} \delta^t E^\chi\left[\sum_{s=t}^{\infty} \delta^s \left(\sum_{j=1}^{\infty} x_{js} r_{js} - c_s y_s\right) \Big|\, S\right] = 0$$

Solution to DP equation coincides with value function

Assumption satisfied if rewards/costs uniformly bounded

Also compatible with unbounded rewards/costs. E.g., arms are sampling processes, with rewards drawn from Normal distribution with unknown mean

Irreversible: Proof

Fictitious environment with no irreversible choice

For any physical arm in state $\omega^P$ found through search, or pulled in period t, “auxiliary” arm generated at t with fixed reward $R(\omega^P)$ also “found”

Auxiliary arms remain in same state forever and do not generate other auxiliary arms

Pulling auxiliary arm corresponding to arm j equivalent to choosing arm j (once pulled, it will be pulled forever)

Given state $\omega^P$, NEW index of physical arm

$$G^P(\omega^P) \equiv \sup_{\pi,\tau} \frac{E\left[\sum_{s=0}^{\tau-1} \delta^s r_s \mid \omega^P\right]}{E\left[\sum_{s=0}^{\tau-1} \delta^s \mid \omega^P\right]}$$

(similar to search index)

rule π specifies selection over primitive and auxiliary arms

$r_s$: period-s reward (can coincide with $R(\omega^P)$ in case period-s selection is auxiliary arm)

Index for search as before, but search adjusted to include discovery of auxiliary arms

Index policy optimal in fictitious environment

Difficulty: Recasting problem this way possible only if auxiliary arms corresponding to past states of same arm never selected

guaranteed by “better-later-than-sooner” property