
A Framework for Modeling Bounded Rationality: Mis-specified Bayesian-Markov Decision Processes

Ignacio Esponda (WUSTL)    Demian Pouzo (UC Berkeley)

February 25, 2015

    PRELIMINARY AND INCOMPLETE

    Abstract

We provide a framework to study dynamic optimization problems where the agent is uncertain about her environment but has a (possibly) incorrectly specified model, in the sense that the support of her prior does not include the true model. The agent's actions affect both her payoff and also what she observes about the environment; she then uses these observations to update her prior according to Bayes' rule. We show that if optimal behavior stabilizes in this environment, then it is characterized by what we call an equilibrium. An equilibrium strategy σ is a mapping from payoff-relevant states to actions such that: (i) given the strategy σ, the agent's model that is closest (according to the Kullback-Leibler divergence) to the true model is Q_{θ(σ)}, and (ii) σ is a solution to the dynamic optimization problem where the agent is certain that the correct model is Q_{θ(σ)}. The framework is applicable to several aspects of bounded rationality, where the reason why a decision maker has incorrect beliefs can be traced to her use of an incorrectly specified model.

Esponda: Olin Business School, Campus Box 1133, Washington University, 1 Brookings Drive, Saint Louis, MO 63130, [email protected]; Pouzo: 530-1 Evans Hall #3880, Berkeley, CA 94720, [email protected].

arXiv:1502.06901v1 [q-fin.EC] 24 Feb 2015

1 Introduction

We study a single-agent recursive dynamic optimization problem where the agent is uncertain about the primitives of the environment. The non-standard aspect of this decision problem is that the agent has a mis-specified model of the world, in the sense that the support of her prior does not include the true environment. Our objective is to characterize the limiting behavior of the agent. Our main motivation for studying this problem is to provide a common framework that incorporates several aspects of bounded rationality that have previously been studied in specific contexts.

A standard assumption in economics is that agents have correct beliefs in equilibrium. This assumption is often justified by a learning story. One of the main insights from the literature on learning in decision problems is that agents may not have correct beliefs if they do not have enough incentives to experiment (e.g., Rothschild [1974], McLennan [1984], Easley and Kiefer [1988]). A similar insight emerges in a game-theoretic context via the notion of a self-confirming equilibrium (Battigalli [1987], Rubinstein and Wolinsky [1994], Fudenberg and Levine [1993a], Dekel et al. [2004]), which requires players to have beliefs that are consistent with observed past play, though not necessarily correct when feedback is coarse. Thus, it is well known that learning need not lead to correct beliefs.

In this paper, we want to emphasize another reason why learning need not lead to correct beliefs: if agents have mis-specified models of the world, then they will end up having incorrect beliefs even if they receive unlimited feedback. A simple example is a player who learns by running a linear regression when the correct specification is actually non-linear.

There is a growing literature that proposes new equilibrium concepts to capture the behavior of players who are boundedly rational and learn from past interactions: sampling equilibrium (Osborne and Rubinstein, 1998), the inability to recognize patterns (Piccione and Rubinstein [2003], Eyster and Piccione [2011]), valuation equilibrium (Jehiel and Samet, 2007), analogy-based expectation equilibrium (Jehiel, 2005), cursed equilibrium (Eyster and Rabin, 2005), and behavioral equilibrium (Esponda, 2008). Many of these solution concepts can be viewed as modeling the behavior of agents who end up having incorrect beliefs due to incorrectly specified models. Our framework attempts to integrate some of these concepts and promote further applications. For example, we illustrate in Section 7 how our framework includes the decision-theoretic analogs of the last three solution concepts as particular cases.

Some explanations for why agents may have mis-specified models include complexity (Aragones et al., 2005), the desire to avoid over-fitting the data (Al-Najjar [2009], Al-Najjar and Pai [2009]), and costly attention (Schwartzstein, 2009). We do not attempt to provide micro-foundations for why agents possibly have mis-specified models. In this paper, we take the mis-specification as given and characterize the resulting behavior.

We study a standard dynamic environment where the agent's current decision affects both her future payoffs and what she learns about her uncertain environment. To focus on the idea of incorrect beliefs due to mis-specified priors, we consider noisy environments where the agent observes full feedback, thus ruling out incorrect beliefs due to lack of experimentation. We establish two main results for a large class of dynamic environments. First, if behavior stabilizes in the dynamic environment, then it is characterized by what we call an equilibrium. An equilibrium is a strategy σ mapping the payoff-relevant states to actions such that: (i) given the strategy σ, the agent's model that is closest (according to the Kullback-Leibler divergence) to the true model is Q_{θ(σ)}, and (ii) σ is a solution to the dynamic optimization problem where the agent is


certain that the correct model is Q_{θ(σ)}. Second, we show that if σ is an equilibrium, then there is an agent who asymptotically optimizes and whose behavior converges to the equilibrium [this result is not yet written in this version]. The framework is applicable to several aspects of bounded rationality, where the reason why a decision maker has incorrect beliefs can be traced to her use of an incorrectly specified model.

There is a large literature that studies learning foundations for rational expectations equilibrium, both with rational and boundedly rational agents (Bray and Kreps [1981], Bray [1982], Blume and Easley [1982], Blume and Easley [1984]). There is also a large game-theoretic literature that studies explicit learning models in order to justify Nash equilibrium and self-confirming equilibrium (Fudenberg and Kreps [1988], Fudenberg and Kreps [1993], Fudenberg and Kreps [1995], Fudenberg and Levine [1993b], Kalai and Lehrer [1993]). Unlike our paper, this literature generally studies repeated, not dynamic, environments. Also, agents in most of these papers can be viewed as having correctly specified models, at least on the equilibrium path.

There is also a literature on decision problems with uncertain parameters (Easley and Kiefer [1988], Aghion et al. [1991]). In this literature, agents have correctly specified models and the Martingale Convergence Theorem implies that their beliefs converge. Moreover, the problem faced by the agent becomes static once beliefs converge. The main focus of this literature is on incentives to experiment and whether beliefs converge to the truth.

There is also a closely related statistics literature on the consistency of Bayesian updating under correctly specified (e.g., Freedman [1963], Diaconis and Freedman [1986]) and mis-specified models (e.g., Berk [1966], Bunke and Milhaud [1998]). The statistics literature has focused on the passive learning case. We extend some of this literature by allowing the agent to take actions that might also affect what she learns. Thus, learning will be endogenous in our setting.

The work that exists on the topic of Bayesian learning under mis-specified models seems to be limited to examples or particular applications. Nyarko [1991] presents an example where beliefs and actions fail to converge under mis-specified models; we use this example in the next section to illustrate some of our main points. Barberis et al. [1998] show that a particular type of mis-specification about the process governing earnings can explain the over- and under-reaction of investors to information. Rabin and Vayanos [2010] show that agents who incorrectly believe in the gambler's fallacy can exaggerate the magnitude of changes in an underlying state but underestimate their duration. There are also some papers that either implicitly or explicitly study the learning problem of an agent who has a mis-specified model. For example, Sobel [1984] studies the non-linear pricing problem of a monopolist when consumers naively act as if prices are linear; as in our paper, the beliefs of the consumer are endogenously determined by her consumption choice. Spiegler [2012] studies the behavior of policy-makers or politicians when the public naively attributes observed outcomes to the most recent actions.

In macroeconomics (Evans and Honkapohja [2001], Chapter 13; Sargent [2001], Chapter 6), there are several papers studying particular settings where agents make forecasts using statistical models that are mis-specified. While the motivation is similar, we focus on agents who follow Bayes' rule and attempt to provide a fairly general decision-theoretic model that can be applied to a wide range of circumstances.

We hope to provide a general framework for modeling agents with mis-specified models and to convey the idea that the consequences of mis-specification can be precisely characterized.

In the next section, we present an example and discuss further the main points of our paper. In Section 3, we present the Markov decision process (MDP) faced by the agent. In Section 4, we


present the Bayesian-Markov decision process (BMDP), which captures the fact that the agent is uncertain about the true MDP that she faces. We also provide a definition of equilibrium (i.e., steady-state behavior) for a BMDP. In Sections 5 and 6, we provide a foundation for our notion of equilibrium. We conclude in Section 7 with additional examples.

NOTE: THIS DRAFT IS PRELIMINARY AND INCOMPLETE. More references will be added in the next draft.

    2 Illustrative example: monopolist with unknown demand

We illustrate some of our main points by discussing Nyarko's (1991) example of a monopolist with unknown demand. The monopolist chooses at every period t = 0, 1, ... a price x_t ∈ X = {2, 10} and then sells s_{t+1}, determined by the demand function

s_{t+1} = a − b x_t + ε_t,

where (ε_t)_t is an i.i.d. normally distributed process with mean zero and unit variance.¹ The monopolist observes sales s_{t+1} but she does not observe the random shocks ε_t; she does know, however, the distribution of (ε_t)_t. The monopolist has no costs of production and, therefore, her profits in period t are

π(x_t, s_{t+1}) = x_t s_{t+1}.

The monopolist wishes to maximize discounted expected profits, where her discount factor is δ ∈ [0, 1).

Notice that we are not using the most natural notation for price (here denoted by x) and for sales at period t (which would more naturally be indexed by t, not t + 1). We maintain this notation, however, because it is in line with the notation of the more general setup of the paper, which allows for dynamic decision problems where the new state is also affected by the previous state.

The interesting feature of the problem is that the monopolist does not know the true demand intercept and slope. Let Θ ⊆ R² represent the set of models of the world entertained by the monopolist, where θ = (a, b) denotes a demand intercept and slope. The monopolist starts with a prior μ0 with full support over Θ and updates her prior using Bayes' rule. Thus, it is well known from dynamic programming that the monopolist's problem can be represented recursively by a value function which is defined over a state space, where a state represents the monopolist's belief over Θ.

Suppose that the true demand parameter is θ* = (28.5, 5.25) and that the set of models that the monopolist considers possible is given by the rectangle Θ = [16, 20] × [1, 4], i.e., intercepts a between 16 and 20 and slopes b between 1 and 4. In this case θ* ∉ Θ and, therefore, we say that the monopolist has a mis-specified model. This situation is depicted in Figure 1, which is basically reproduced from Nyarko (1991, page 422).

Nyarko [1991] shows formally that the monopolist's actions do not converge. To see the intuition behind this result, suppose that the monopolist were to always choose price 2. Then, on average, she would observe sales a* − 2b*, where θ* = (a*, b*). She would then believe that any (a, b) that also gives rise to such

¹As mentioned by Nyarko, sales can be negative with positive probability, but the normal distribution is nevertheless chosen for simplicity.


average sales can explain the data. The set of all such (a, b)'s is given by the line with slope 2 passing through the true parameter θ*. Moreover, (a, b) = (20, 1) is the only point on that line that also belongs to her set of models of the world Θ. But, under the parameter (20, 1),

E_{(20,1)}[π(2, s′)] = 2(20 − 2 · 1) = 36 < 100 = 10(20 − 10 · 1) = E_{(20,1)}[π(10, s′)],

and, therefore, the monopolist would actually strictly prefer to charge price 10. Thus, if the monopolist were to always charge a price of 2, she would eventually become very confident that the true parameter is (20, 1), but then she would prefer to deviate and charge 10.

A similar argument establishes that if the monopolist were to always charge a price of 10, then she would eventually become very confident that the true parameter is (16, 4), but then, since

E_{(16,4)}[π(2, s′)] = 2(16 − 2 · 4) = 16 > −240 = 10(16 − 10 · 4) = E_{(16,4)}[π(10, s′)],

she would prefer to deviate and charge 2. Thus, the monopolist's behavior forever cycles between prices 2 and 10.
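The cycling logic is easy to check by simulation. The sketch below is our illustration rather than Nyarko's construction: it discretizes Θ = [16, 20] × [1, 4] on a grid and has a myopic agent best-respond to the posterior mean each period, whereas the paper allows a forward-looking agent and a continuum of models.

```python
import numpy as np

rng = np.random.default_rng(0)

# True demand parameter (outside the model set) and a grid over Theta = [16, 20] x [1, 4]
a_true, b_true = 28.5, 5.25
a_grid = np.linspace(16.0, 20.0, 41)
b_grid = np.linspace(1.0, 4.0, 31)
A, B = np.meshgrid(a_grid, b_grid, indexing="ij")
log_post = np.zeros_like(A)  # uniform prior over the grid, stored in logs

prices = np.array([2.0, 10.0])
history = []
for t in range(3000):
    w = np.exp(log_post - log_post.max())
    w /= w.sum()
    a_hat, b_hat = (w * A).sum(), (w * B).sum()               # posterior mean of (a, b)
    x = prices[np.argmax(prices * (a_hat - b_hat * prices))]  # myopically optimal price
    s = a_true - b_true * x + rng.standard_normal()           # observed sales
    log_post += -0.5 * (s - (A - B * x)) ** 2                 # Bayes rule with N(0, 1) shocks
    history.append(x)

tail = np.array(history[-1000:])
print("freq of price 2:", (tail == 2.0).mean(), "freq of price 10:", (tail == 10.0).mean())
```

On typical runs both frequencies stay bounded away from zero: the agent keeps switching between the two prices, consistent with the non-convergence result.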

While this example might give the impression that studying the behavior of agents with mis-specified models can give rise to strange phenomena that are hard to analyze, the main objective of the paper is to convince the reader that there is indeed a lot of regularity of behavior even in mis-specified settings.

Notice that the idea that actions might not converge is not exclusive to mis-specified models. For example, even if a model is correctly specified and beliefs converge, actions might not converge if they are not continuous in beliefs. This lack of continuity commonly arises when there is a finite number of actions. It is indeed one of the reasons why we allow for mixed strategies; for example, Nash equilibrium would also not always exist without considering mixed strategies.

Suppose that we allow the monopolist to choose a mixed strategy. Figure 1 depicts the mixed strategy σ, where σ is the probability of choosing price 2. If the monopolist chooses σ, then it is not too difficult to show that she will eventually become very confident that the true model is given by the point (a(σ), b(σ)) in Figure 1. This is the point on the set Θ that is closest to θ* when distance is measured along the line with slope σ · 2 + (1 − σ) · 10 passing through θ*. If the optimal strategy of a monopolist who is convinced that the true parameter is (a(σ), b(σ)) is σ itself (i.e., if such a monopolist is indifferent between prices 2 and 10), then we say that σ is an equilibrium. Notice that despite working with a single-agent decision problem, the solution of the problem is a fixed point because the strategy of the agent affects her beliefs. In the example, different strategies correspond to different slopes of lines passing through θ* and, therefore, to different parameters on the set Θ.²

More generally, we consider the problem of an agent facing a Markov Decision Process (MDP), which is a dynamic optimization problem where the state variable follows a Markov process. The monopolist's problem in this section is a particular case where, under the assumption that the true parameter is known, the problem is static in the sense that future states (sales in this example) do not depend on previous states. Next, for any MDP we consider a Bayesian-Markov Decision Process (BMDP), which is the problem of an agent who does not know the true transition probability function. The agent starts with a prior over a set of models, where each model represents a transition probability function, and she updates her prior while making decisions that maximize her discounted expected payoffs. The BMDP is said to be mis-specified if the true model

²In this example, we can also get convergence by allowing for a continuum of prices. In many settings, however, it might be natural to restrict attention to a finite set of actions.


is not one of the models considered by the agent. The problem of the monopolist with unknown demand discussed above is a special case.

Our objective is to characterize the limiting or steady-state behavior of the agent in a BMDP. For this purpose, we define the notion of equilibrium for general BMDPs. Equilibrium of a BMDP is conveniently defined in terms of the simpler MDP. In the above case of the monopolist with unknown demand, to verify whether a strategy is an equilibrium we only need to verify whether it is optimal in the static problem where the agent knows the parameter (i.e., she does not have to learn it) but might believe it is different from the true parameter. Equilibrium of a BMDP is defined as a fixed point where the agent chooses a strategy that is optimal in the MDP under a parameter that itself depends on the strategy chosen by the agent (as well as on the true parameter and the set of models entertained by the agent, of course). We show that under fairly general assumptions an equilibrium exists. In the above monopoly example, there is in fact a unique equilibrium and it is, as expected, strictly mixed. Section 7.1 generalizes the monopoly example and provides the formal results.

We then turn to justifying our definition of equilibrium for a BMDP. We can apply standard dynamic programming techniques to show that a BMDP can be cast recursively by using a value function that solves a standard Bellman equation and that depends on the state of the MDP and the belief over the set of models Θ. We say that behavior stabilizes in the BMDP if the agent's behavior as a function of the state of the MDP converges. One of our main results is that, if behavior stabilizes in the BMDP, then it must stabilize to what we call an equilibrium of the BMDP. Thus, behavior that does not arise in equilibrium cannot be the limiting behavior of the BMDP.

To establish the previous result that equilibrium captures the steady state of the BMDP, we must first be able to justify and interpret mixed strategies in our setting. We follow the standard approach introduced in game theory by Harsanyi [1973], where the agent receives (small) payoff perturbations that, to an outside observer, make her look as if she is mixing. This approach was used by Fudenberg and Kreps [1993] to provide a learning foundation for mixed-strategy Nash equilibrium in the context of normal-form games.³ We follow a similar approach and perturb the payoffs in the BMDP. We then provide a learning foundation for equilibrium in this perturbed version of the game. The result relies on extending the statistical literature on learning under mis-specified models (e.g., Berk, 1966) to the case where the decision maker concurrently takes actions. Finally, we show that the limit of equilibria of the perturbed BMDP as the perturbation vanishes corresponds to an equilibrium of the (unperturbed) BMDP.

    3 A Markov Decision Process (MDP)

    We begin by describing the environment faced by the agent.

Definition 1. A Markov Decision Process (MDP) is a tuple ⟨S, X, Γ, q0, Q, π, δ⟩, where

S is a finite set of states;
X is a finite set of actions;
Γ : S → 2^X is a non-empty constraint correspondence;
q0 ∈ Δ(S) is a probability distribution on the initial state;
Q : S × X → Δ(S) is a transition probability function;
π : S × X × S → R is a per-period payoff function;
δ ∈ [0, 1) is a discount factor.

³The class of models they considered is known in the literature as stochastic fictitious play.

Throughout the paper, it will be useful to stress the dependence of an MDP on a particular transition probability function Q; thus, we use MDP(Q) to denote an MDP with transition probability function Q.

The timing is as follows. At the beginning of every period t = 0, 1, 2, ..., the agent observes state s_t ∈ S and chooses an action x_t ∈ Γ(s_t) ⊆ X. Then a new state s_{t+1} is drawn according to the probability distribution Q(· | s_t, x_t) and the agent receives payoff π(s_t, x_t, s_{t+1}) in period t.⁴ The initial state s0 is drawn according to the probability distribution q0.

As a benchmark, we begin by considering the standard case of an agent who faces an MDP(Q). The agent chooses a policy rule that specifies at each point in time a (possibly random) action as a function of the history of states and actions observed up to that point. As usual, the objective of the agent is to choose a feasible policy rule to maximize expected discounted utility, E[ Σ_{t=0}^∞ δ^t π(s_t, x_t, s_{t+1}) ].

ASSUMPTION A1. sup_{(s,x,s′) ∈ Gr(Γ)×S} |π(s, x, s′)| < ∞.

Under Assumption A1, standard dynamic programming arguments imply that the value function V_Q is the unique bounded solution to the Bellman equation of the MDP(Q). A strategy σ is optimal if, for all s ∈ S and all x ∈ X with σ(x | s) > 0,

x ∈ arg max_{x′ ∈ Γ(s)} ∫_S { π(s, x′, s′) + δ V_Q(s′) } Q(ds′ | s, x′).

An optimal strategy always exists because the space S × X is finite and V_Q is bounded. It is also easy to see that there is always a deterministic optimal strategy where the agent does not randomize. Nevertheless, random strategies will play an important role in the sequel, when the agent does not know the transition probability function.

⁴Depending on the application, we can think of s_{t+1} as being drawn at the end of period t and affecting period t's payoff, or at the beginning of period t + 1.


4 A Bayesian-Markov Decision Process (BMDP)

We now consider an agent who faces an MDP but who is uncertain about the transition probability function. The agent has a prior over a set of possible transition functions and updates her beliefs using Bayes' rule. We refer to the problem with uncertainty as the Bayesian-Markov decision process (BMDP).

Definition 4. A Bayesian-Markov Decision Process (BMDP) is an MDP ⟨S, X, Γ, q0, Q, π, δ⟩ and a tuple ⟨Θ, μ0, B⟩ where

Q_Θ = {Q_θ : θ ∈ Θ} is a family of transition probability functions, where each transition probability function Q_θ is indexed by a parameter θ ∈ Θ;
μ0 ∈ Δ(Θ) is a prior;
B : S² × X × Δ(Θ) → Δ(Θ) is the Bayesian operator: for all A ⊆ Θ Borel measurable and all (s, s′, x, μ) ∈ S² × X × Δ(Θ),

B(s, s′, x, μ)(A) = ∫_A Q_θ(s′ | s, x) μ(dθ) / ∫_Θ Q_θ(s′ | s, x) μ(dθ).
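For a finite grid of parameters, the Bayesian operator B reduces to reweighting the prior by the likelihood of the observed transition and renormalizing. A minimal sketch under that assumption (the array layout is our hypothetical choice; the paper allows a compact continuum Θ):

```python
import numpy as np

def bayes_operator(mu, Q, s, x, s_next):
    """One application of the Bayesian operator B for a finite parameter grid.

    mu : prior over K grid points, shape (K,)
    Q  : Q[k, s, x, s'] = transition probability under parameter theta_k
    """
    likelihood = Q[:, s, x, s_next]      # Q_theta(s' | s, x) for each theta_k
    posterior = mu * likelihood
    return posterior / posterior.sum()   # denominator positive under Assumption A3(i)
```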

The timing is the same as the timing specified under the MDP in Section 3. The difference is that the agent now has a belief over the set of possible transition probability functions and updates this belief according to Bayes' rule. We interpret the set Q_Θ as the different transition probability functions (i.e., models of the world) that the agent considers possible.

Definition 5. A BMDP is mis-specified if Q ∉ Q_Θ; otherwise, it is correctly specified.

We restrict attention to a certain class of possibly mis-specified BMDPs, as captured by the following assumptions.

ASSUMPTION A2. (i) Θ is a compact subset of a Euclidean space R^k; (ii) the prior μ0 has full support: supp(μ0) = Θ.

ASSUMPTION A3. (i) For μ0-almost every θ ∈ Θ, if (s, s′, x) ∈ S² × X and Q_θ(s′ | s, x) = 0, then Q(s′ | s, x) = 0; (ii) for all (s, s′, x) ∈ S² × X, Q_θ(s′ | s, x) is continuous as a function of θ for all θ ∈ Θ.

ASSUMPTION A4. For all s′, s0 ∈ S, there exist finite sequences (s1, ..., sn) and (x0, x1, ..., xn) such that x_i ∈ Γ(s_i) for all i = 0, 1, ..., n and

Q(s′ | s_n, x_n) Q(s_n | s_{n−1}, x_{n−1}) ··· Q(s1 | s0, x0) > 0.

Assumption A2 requires certain regularity conditions on Θ, such as the requirement that it lives in a finite-dimensional space, that are known to be important to obtain consistency of Bayesian


updating in the standard (i.e., correctly specified) setting (Freedman [1963]). Assumption A3(i) is necessary to make sure that Bayesian updating is well defined. We view this assumption as ruling out a particularly stark form of mis-specification under which the agent's model of the world cannot explain an observation. If such were the case, updating would fail and we would expect the agent to reconsider her models of the world. Assumption A3(ii) is a technical condition that plays an important role in several proofs. Finally, Assumption A4 guarantees that we can always get from one state to another by some sequence of states and actions. When we study, in Section 5, the perturbed version of the problem where the agent chooses all actions with positive probability, Assumption A4 will guarantee that all states in S × X are recurrent.

    The next assumption requires additional definitions.

Definition 6. The weighted Kullback-Leibler divergence (wKLD) is a mapping K_Q : Δ(S × X) × Θ → R₊ such that for any m ∈ Δ(S × X) and θ ∈ Θ,

K_Q(m, θ) = Σ_{(s,x) ∈ S×X} E_{Q(·|s,x)} [ ln( Q(S′ | s, x) / Q_θ(S′ | s, x) ) ] m(s, x).

The set of closest models given m ∈ Δ(S × X) is the set

Θ_Q(m) ≡ arg min_{θ ∈ Θ} K_Q(m, θ).

Lemma 1. For every m ∈ Δ(S × X), K_Q(m, ·) is continuous, greater than or equal to zero, and finite; moreover, Θ_Q(m) is non-empty, compact-valued, and upper hemicontinuous as a function of m.

    Proof. See the Appendix.

Definition 6 extends the standard definition of Kullback-Leibler divergence to the case where the sample from which the agent learns is drawn from a distribution m. The set Θ_Q(m) can be interpreted as the set of models that are closest to the true model Q when the agent has access to an infinite number of exogenous observations of (s, x), drawn independently according to m, and for each observation observes the corresponding draw of the new state s′. In particular, if the model is correctly specified, then the true model is always in the set of closest models, i.e., for all m ∈ Δ(S × X), there exists θ ∈ Θ_Q(m) such that Q_θ = Q.
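For finite S and X and a finite grid of candidate models, Definition 6 and the set of closest models can be evaluated directly. A sketch under those assumptions (the function names and array layout are ours):

```python
import numpy as np

def wKLD(m, Q_true, Q_theta):
    """Weighted Kullback-Leibler divergence K_Q(m, theta).

    m       : weights over state-action pairs, shape (S, X)
    Q_true  : Q(s' | s, x),       shape (S, X, S)
    Q_theta : Q_theta(s' | s, x), shape (S, X, S)
    """
    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.where(Q_true > 0, np.log(Q_true / Q_theta), 0.0)
    # Sum over (s, x) of m(s, x) * E_{Q(.|s,x)}[log ratio]
    return float(np.sum(m[:, :, None] * Q_true * log_ratio))

def closest_models(m, Q_true, Q_grid, tol=1e-12):
    """Theta_Q(m): indices of the grid parameters minimizing K_Q(m, .)."""
    vals = np.array([wKLD(m, Q_true, Qk) for Qk in Q_grid])
    return np.flatnonzero(vals <= vals.min() + tol)
```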

    Our final assumption on the primitives plays the role of an identification assumption.

ASSUMPTION A5. For every m ∈ Δ(S × X), if θ, θ′ ∈ Θ_Q(m), then Q_θ(· | s, x) = Q_θ′(· | s, x) for all (s, x) ∈ S × X such that m(s, x) > 0.

Assumption A5 requires that the models that are closest to the true model be indistinguishable given the data available to the agent. To motivate this assumption, notice that there are two reasons why the agent might not be able to distinguish between two models. The first is that she does not take a particular action and therefore fails to learn in some dimension. In that case the agent can entertain different models of the world, as long as these models cannot be distinguished given her data; this situation is permitted by Assumption A5. The second reason why the agent might not be able to distinguish between two models is that these models have different transition probability functions but one model better explains one feature of the data and the other model better explains another feature, in such a way that the two models are equidistant from the truth (in terms of the wKLD). This type of mis-specification is ruled out by Assumption A5. An informal argument for ruling out this type of mis-specification is that the agent might decide to break the tie between these two models by adding yet another model to her set of initial models. A more formal argument shows that we cannot expect behavior to settle down when Assumption A5 fails. The next example illustrates this point.

Example 1. Consider a BMDP where the state s ∈ S = {0, 1} represents whether a coin lands heads or tails and Q(1 | x, s) = 1/2 for all (x, s), i.e., the coin tosses are i.i.d. and do not depend on the previous action or state. Suppose that Q_θ(1 | x, s) = θ for all (x, s), so that the agent understands that the coin tosses are i.i.d., and that Θ = {1/4, 3/4}. In this case, the wKLD becomes, for all m,

K_Q(m, θ) = (1/2) ln( (1/2) / θ ) + (1/2) ln( (1/2) / (1 − θ) ).

Thus, K_Q(m, 1/4) = K_Q(m, 3/4) and, therefore, Θ_Q(m) = {1/4, 3/4} for all m. Since Q_{1/4} ≠ Q_{3/4}, Assumption A5 is not satisfied. Intuitively, when a coin is truly unbiased, θ = 1/4 and θ = 3/4 are equidistant from the truth and are equally likely to explain the data coming from an unbiased coin. In terms of the model where the agent updates her beliefs using Bayesian updating, it is not too hard to establish that, for a non-dogmatic prior μ0(1/4) ∈ (0, 1), the agent's beliefs over {1/4, 3/4} will never converge (for a proof, see Berk [1966], p. 57). Therefore, it is easy to embed an action space into this model in such a way that the actions of the agent will not converge, even if we were to make the actions continuous in beliefs, as we do later in the paper. Finally, Assumption A5 will be satisfied as long as the agent incorporates an additional element θ ∈ (1/4, 3/4) into her set of models of the world.

As the examples throughout the paper illustrate, there are many contexts where it is straightforward to check that Assumption A5 is verified. The following is a sufficient condition for Assumption A5 that is satisfied in many of our examples.

Proposition 1. Suppose that the following three conditions are satisfied: (i) Θ is convex; (ii) Q_θ is linear in θ (i.e., if α ∈ [0, 1] and θ, θ′ ∈ Θ, with θ″ = αθ + (1 − α)θ′, then Q_{θ″}(· | s, x) = αQ_θ(· | s, x) + (1 − α)Q_{θ′}(· | s, x) for all (s, x) ∈ S × X); and (iii) for all (s, s′, x) ∈ S² × X, if Q(s′ | s, x) = 0 then Q_θ(s′ | s, x) = 0. Then Assumption A5 is satisfied.

Proof. See the Appendix.

To illustrate the previous definitions, we now introduce another example and verify that it satisfies Assumptions A1-A5.


Example 2. Every period t = 0, 1, ..., an agent chooses an action x ∈ X = {0, 1}. Given the action x, a state s′ ∈ S = {0, 1} is drawn according to the transition probability function Q_θ(s′ | x), where θ = (θ0, θ1), Q_θ(1 | 0) = θ0, and Q_θ(1 | 1) = θ1. Let the true model Q be represented by Q_{θ*} for some θ* ∈ (0, 1)². The agent receives payoff

π(x, s′) = (x + 1)s′ − (1/2)x

each period, and her objective is to maximize discounted expected utility with discount factor δ ∈ [0, 1). The above primitives describe an MDP where each new state is drawn as a function of the action but not of the previous state; thus, the problem is inherently static. Moreover, Assumption A1 is satisfied because π is bounded.

Let's consider two different versions of BMDPs associated with the above MDP, where in each case Q_Θ = {Q_θ : θ ∈ Θ} is the set of models of the world entertained by the agent and μ0 is her full-support prior. In the first case, Θ = [0, 1]² and the BMDP is correctly specified. In the second case, Θ = {θ ∈ [0, 1]² : θ0 = θ1} and the BMDP is mis-specified if and only if θ0* ≠ θ1*. In the mis-specified case, the agent incorrectly believes that her action does not affect the probability of drawing the state.

We now verify Assumptions A2-A5 for each case. Assumption A2 is satisfied in each case because Θ is compact and μ0 has full support. Also, Q_θ(s′ | x) ∈ (0, 1) for all θ ∈ Θ\{(0, 0), (1, 1)}. Thus, if in each case we choose a prior μ0 that puts no mass at either (0, 0) or (1, 1), then Assumption A3(i) is satisfied. Assumption A3(ii) is also satisfied because Q_θ is continuous in θ. Assumption A4 is satisfied because the process governing the state is i.i.d. and each realization of S has positive probability (irrespective of the action taken by the agent). Finally, we can apply Proposition 1 to check Assumption A5. In each case, Θ is convex and Q_θ is linear in θ. Moreover, the assumption that θ* ∈ (0, 1)² implies that Q(s′ | s, x) > 0 for all (s, s′, x) ∈ S² × X, so condition (iii) in Proposition 1 holds vacuously. Hence, Assumption A5 is also satisfied.

Finally, we write down the wKLD and derive the correspondence Θ_Q(·) of models that are closest to the true model Q in each of the two cases. We begin with the case of the mis-specified BMDP. Fix any m ∈ Δ(S × X) and (θ, θ) ∈ Θ. Notice that only the marginal distribution m_X over X is relevant for characterizing Θ_Q(·) because the current state does not affect the future state. Then

K_Q(m, (θ, θ)) = m_X(0) [ (1 − θ0*) ln( (1 − θ0*)/(1 − θ) ) + θ0* ln( θ0*/θ ) ] + m_X(1) [ (1 − θ1*) ln( (1 − θ1*)/(1 − θ) ) + θ1* ln( θ1*/θ ) ]
= −(1 − θ_m) ln(1 − θ) − θ_m ln θ + C,

where C contains terms that do not depend on θ and

θ_m = m_X(0)θ0* + m_X(1)θ1*. (2)

It is easy to check that, for every m, K_Q(m, (θ, θ)) is strictly convex in θ and has a unique minimizer given by θ_m. Thus, Θ_Q(m) = {(θ_m, θ_m)} is a singleton for all m ∈ Δ(S × X); in particular, Assumption A5 is satisfied, a fact that we had already verified indirectly using Proposition 1. The intuition for this result is straightforward. If we fix m, then the state s′ = 1 is drawn i.i.d. with


  • probability mX(0)0 + mX(1)

    1. Then the closest model

    m is exactly the one that attaches this

    probability to s = 1 being drawn.Next, we consider the case where = [0, 1]2 and, therefore, the BMDP is correctly specified.

It is easy to check that Θ_Q(m) = {θ*} is a singleton set containing the true model as long as m_X(1) ∈ (0, 1); the idea is that an agent who chooses each action with positive probability will obtain information about each of the two dimensions of θ. But Θ_Q(m) is no longer a singleton if m_X(1) ∈ {0, 1}. The reason is that the agent never observes the consequence of choosing either x = 1 or x = 0 and, therefore, can have any belief about one dimension of θ. Formally,

Θ_Q(m) = {θ ∈ Θ : θ_i = θ_i*, θ_j ∈ [0, 1]}, (3)

where m_X(i) = 1, i, j ∈ {0, 1}, and i ≠ j. Thus, even though the BMDP is correctly specified, the agent can have incorrect beliefs about those parameters for which she observes no information.

We now turn to the analysis of BMDPs. As in the case of MDPs, it is well known that the problem of maximizing discounted expected utility in BMDPs can be cast recursively, where the difference with the MDP is that the state space now includes both the state variable s and the belief μ.

Our main objective is to characterize the agent's behavior in a BMDP when the time period is sufficiently large so that we might expect beliefs to have settled down and behavior to converge. For this purpose, we conclude this section by defining the notion of equilibrium for a BMDP. An equilibrium represents steady-state behavior in a BMDP. In Section 6 we make the argument formal and show that the notion of equilibrium that we present here corresponds to steady-state behavior.

Definition 7. The transition kernel given a strategy σ is a transition probability function M_σ : S × X → Δ(S × X) such that

M_σ(s′, x′ | s, x) = σ(x′ | s′) Q(s′ | s, x). (4)

An invariant distribution of the transition kernel M_σ is a distribution m ∈ Δ(S × X) that satisfies

m(s′, x′) = Σ_{(s,x) ∈ S×X} M_σ(s′, x′ | s, x) m(s, x)

for all (s′, x′) ∈ S × X.
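Because S × X is finite, an invariant distribution solves the linear system m = M_σ m together with the normalization that m sums to one, so it can be computed directly. A sketch (assuming the kernel is stored as a column-stochastic matrix, which is our convention, not the paper's):

```python
import numpy as np

def invariant_distribution(M):
    """Solve m = M m for a transition kernel on the finite set S x X.

    M : column-stochastic matrix with M[j, i] = M_sigma(s', x' | s, x),
        where i indexes the current pair (s, x) and j the next pair (s', x').
    """
    n = M.shape[0]
    # Stack the stationarity equations (M - I) m = 0 with sum(m) = 1
    A = np.vstack([M - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    m, *_ = np.linalg.lstsq(A, b, rcond=None)
    return m
```

When the invariant distribution is unique (as in Lemma 5 below), the least-squares solution recovers it exactly.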

It is a standard result that, for every σ, an invariant distribution of M_σ exists.⁵

We now provide an informal motivation for the definition of equilibrium that follows. Suppose that, after sufficiently long time, beliefs have converged in the BMDP and the agent's behavior can be represented by a strategy σ that maps each state in S to some probability distribution over X. Then the strategy σ induces a transition kernel M_σ over the space S × X. Let m

⁵The proof follows by noticing that M_σ is a linear (hence continuous) self-map on a convex and compact subset of a Euclidean space (the set of probability distributions over the finite set S × X); hence, Brouwer's fixed point theorem implies existence of an invariant distribution.


be an invariant distribution of M_σ and suppose that the time average of observations over S × X converges to the invariant distribution m. Then we might expect that the agent's beliefs over the models of the world have support in the set of closest models given m, Θ_Q(m). Finally, if σ is to be a candidate for steady-state behavior, it must be the case that there is a belief over Θ with support in Θ_Q(m) that makes σ an optimal strategy in the BMDP with transition probability induced by such a belief.

Definition 8. (Equilibrium of a BMDP) A strategy σ and probability distribution m ∈ Δ(S × X), jointly denoted (σ, m), constitute an equilibrium of the BMDP with true model Q if there exists μ ∈ Δ(Θ) such that

(i) σ is an optimal strategy for the MDP(Q_μ), where Q_μ = ∫_Θ Q_θ μ(dθ), and
(ii) μ ∈ Δ(Θ_Q(m)), where m is an invariant distribution of the transition kernel M_σ.

    We make several remarks regarding the definition of equilibrium of a BMDP.

Remark 1. Provided that Definition 8 captures steady-state behavior in BMDPs (a claim investigated in the next sections), one of the main benefits of the above definition of equilibrium is that a modeler who is interested in the limiting behavior of the agent in a BMDP can restrict attention to the much simpler class of MDPs. We illustrate throughout the paper how the definition makes concrete, endogenous predictions for mis-specified agents and how these predictions do not obviously follow from the exogenous primitives.

Remark 2. Definition 8 places two restrictions on equilibrium behavior: (i) optimization given beliefs and (ii) endogenous restrictions on beliefs. This dichotomy is standard in game theory and is made explicit by Fudenberg and Levine [1993a] in their definition of self-confirming equilibrium. Optimization is a standard assumption. The innovation behind our definition is to specify how beliefs are to be endogenously restricted as a function of the true model Q, the agent's models of the world Q_Θ, and the agent's strategy σ.

Remark 3. We define an equilibrium to be a strategy-distribution pair. The interpretation is that the agent's behavior is given by her strategy σ and that the corresponding distribution over outcomes in S × X is given by the corresponding invariant distribution m. The need to specify the invariant distribution as part of the definition of equilibrium is that, in dynamic games, a strategy might have several invariant distributions associated with it and, therefore, a strategy might not be sufficient to determine the equilibrium outcome of the decision process.

Remark 4. In the special case where the model is correctly specified, the optimal strategy for the true model is always an equilibrium of the BMDP. This result is analogous to the result that a Nash equilibrium is always a self-confirming equilibrium (e.g., Fudenberg and Levine, 1993a). As in that case, the converse does not necessarily hold because the agent might not learn the correct model if she does not sufficiently experiment.

Proposition 2. If the BMDP with true transition probability function Q is correctly specified, then the optimal strategy for the MDP(Q) is an equilibrium of the BMDP.

Proof. Let σ be an optimal strategy for the MDP(Q). Let m be an invariant distribution of the transition kernel M_σ; see footnote 5 for the argument guaranteeing existence of an invariant distribution. Since the BMDP is correctly specified, there exists θ* ∈ Θ such that Q_{θ*} = Q. By Lemma 1, the wKLD is greater than or equal to zero. In addition, K_Q(m, θ*) = 0. Then θ* ∈ Θ_Q(m). Since σ is optimal for the MDP(Q_{θ*}) = MDP(Q), (σ, m) is an equilibrium of the BMDP.

Remark 5. Existence of equilibrium. Proposition 2 and the fact that an optimal strategy for the MDP always exists imply that an equilibrium always exists whenever the BMDP is correctly specified. Unfortunately, existence of equilibrium for mis-specified BMDPs cannot be established using standard methods. The standard approach to proving existence is to show that the corresponding best response correspondence has a fixed point. In our setting, let BR(σ) = {σ′ : there exist m invariant for M_σ and μ ∈ Δ(Θ_Q(m)) such that σ′ is optimal for MDP(Q_μ)}. The problem is that this correspondence is not necessarily convex-valued. For example, fix σ and suppose that m and m′ are two different invariant distributions of M_σ and that Θ_Q(m) = {θ} and Θ_Q(m′) = {θ′}. Then different beliefs can justify different elements of BR(σ): it is possible that σ1 is optimal for MDP(Q_θ) and σ2 is optimal for MDP(Q_{θ′}) but that a convex combination of σ1 and σ2 is not optimal for either belief θ or θ′.

Our solution to this problem will be to study a perturbed version of the BMDP, show that equilibrium exists in the perturbed version, and then show that the limit of a sequence of equilibria as the perturbation vanishes exists and is an equilibrium of the original BMDP. We postpone the proof to the next section but state the existence result now.

    Theorem 1. An equilibrium of a BMDP always exists.

Proof. Follows from Lemma 7, Theorem 2, and Theorem 3 in Section 5; see the discussion right after Theorem 3.

Example 2, continued. We now find all the equilibria for each version of the BMDP in Example 2. Let the strategy σ represent the probability that x = 1 (again, we can ignore the state s because the agent believes, correctly, that the current state does not affect the draw of the next state).

First, we consider the case where θ1* < 1/2 < θ0* and study the mis-specified BMDP where Θ = {θ ∈ [0, 1]² : θ0 = θ1}. For each σ, there is a unique invariant distribution m_σ and its marginal over X satisfies m_X(1) = σ. Thus, for each σ with invariant distribution m_σ, Δ(Θ_Q(m_σ)) is the set of possible beliefs. By (2) above, this belief is degenerate and puts probability 1 on

θ_σ = (1 − σ)θ0* + σθ1*. (5)

Thus, by Definition 8, σ is an equilibrium strategy if and only if σ is an optimal strategy when the parameter governing the (static) problem is given by (5). We show that there is a unique equilibrium strategy and that the agent strictly mixes in equilibrium. Suppose that σ = 1. Then, by (5), the agent believes that the true parameter is θ1*. But then

E_{θ1*}[π(1, s′)] = 2θ1* − 1/2 < θ1* = E_{θ1*}[π(0, s′)],

and the agent prefers to deviate and choose σ = 0. Thus, σ = 1 is not an equilibrium strategy. A similar reasoning yields that σ = 0 is not an equilibrium strategy either. Thus, an equilibrium


strategy σ ∈ (0, 1) must be such that, under the corresponding belief in equation (5), the agent is indifferent between her actions, i.e.,

E_{θσ}[π(1, s′)] = 2θ_σ − 1/2 = θ_σ = E_{θσ}[π(0, s′)]. (6)

The unique solution (hence, the unique equilibrium strategy) of (6) is

σ* = (θ0* − 1/2) / (θ0* − θ1*) ∈ (0, 1).

For example, if θ0* = 3/4 and θ1* = 1/4, then σ* = 1/2 is the unique equilibrium strategy for this mis-specified BMDP.
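The indifference computation is mechanical enough to script; the following check (values from the example) recovers the closed form:

```python
# Equilibrium of the mis-specified BMDP in Example 2
theta0, theta1 = 0.75, 0.25  # true parameters, with theta1 < 1/2 < theta0

def theta_of(sigma):
    # Belief induced by strategy sigma, equation (5)
    return (1 - sigma) * theta0 + sigma * theta1

def payoff_gap(sigma):
    # E_theta[pi(1, s')] - E_theta[pi(0, s')] = (2 theta - 1/2) - theta
    th = theta_of(sigma)
    return (2 * th - 0.5) - th

sigma_star = (theta0 - 0.5) / (theta0 - theta1)  # closed-form equilibrium strategy
print(sigma_star, payoff_gap(sigma_star))        # 0.5 and 0.0 (exact indifference)
```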

Next, we consider the case where Θ = [0, 1]² and, therefore, the BMDP is correctly specified. For concreteness, suppose that θ0* = 3/4 and θ1* = 1/4. Then

E_{θ*}[π(1, s′)] = 2θ1* − 1/2 = 0 < 3/4 = θ0* = E_{θ*}[π(0, s′)],

and, therefore, σ = 0 is the unique optimal strategy given θ*. Thus, by Proposition 2, σ = 0 is an equilibrium strategy of the BMDP. Next, consider any strategy σ ∈ (0, 1). In this case, the corresponding invariant distribution m_σ has a marginal m_X(1) ∈ (0, 1) and, by the previous discussion, beliefs must be degenerate at the true parameter θ*. Then σ ∈ (0, 1) cannot be optimal given these beliefs and, therefore, σ ∈ (0, 1) is not an equilibrium strategy. Finally, suppose that σ = 1. The corresponding invariant distribution m_σ has a marginal m_X(1) = 1 and, by (3), the agent must believe that θ1 = θ1* but can hold any belief about θ0 ∈ [0, 1]. In particular, she can believe that θ0 = 0, in which case it is (weakly) optimal to choose σ = 1. Thus, σ = 1 is the only other equilibrium strategy of the BMDP.

A feature of Example 2 (and also of the monopolist example in Section 2) is that the environment faced by the agent who knows the transition probability function is static, meaning that the current state does not influence the future state.

Definition 9. A transition probability function is static if the distribution over the new state does not depend on the previous state, i.e., Q(· | s, x) = Q(· | s′, x) for all x ∈ X and all s, s′ ∈ S. A BMDP is static if all transition probability functions in Q_Θ are static.⁶

In static BMDPs, it is without loss of generality to restrict attention to strategies such that σ(· | s) = σ(· | s′) for all s, s′ ∈ S. Thus, in static BMDPs we denote a strategy by σ ∈ Δ(X). Moreover, for a given strategy σ there is a unique invariant distribution m_σ ∈ Δ(S × X) and its marginal over X, denoted by m_X, coincides with σ. In addition, the wKLD depends only on m_X rather than on m. Thus, in the remainder of the paper we abuse notation and denote the wKLD and the correspondence Θ_Q(·) as a function of σ rather than m if the BMDP is static. Notice also that in this case it is not necessary to specify the invariant distribution as part of the equilibrium.

⁶Of course, the problem is dynamic for the agent, who also needs to learn the transition probability function, even if these transition probability functions are static.


Remark 6. More generally, there are several environments where, for each σ, there is a unique invariant distribution m_σ and the set Θ_Q(m_σ) is a singleton. In these environments, by letting θ_σ be the unique element of Θ_Q(m_σ), Definition 8 becomes:

a strategy σ is an equilibrium of the BMDP if and only if σ is optimal for the MDP(Q_{θ_σ}).

The next two sections provide a justification for the notion of equilibrium proposed above. The reader who is interested in applications of the equilibrium concept can jump to Section 7.

    5 Perturbed Decision Processes

In this section, we perturb the payoffs of the MDP and BMDP introduced in the previous sections and establish that equilibrium exists under a class of perturbations. These perturbations were introduced by Harsanyi [1973] in the context of normal-form games and have been incorporated by Doraszelski and Escobar [2010] in the type of dynamic settings studied here. We then consider a sequence of perturbed environments where the perturbation goes to zero and establish that the limit (which exists) is an equilibrium of the unperturbed environment. This result implies existence of equilibrium in the unperturbed environment. In the next section, we show that equilibria in these perturbed games can be viewed as the steady states of the perturbed BMDP.

Definition 10. A perturbed MDP is an MDP ⟨S, X, Γ, q0, Q, π, δ⟩ and a tuple ξ = ⟨V, P_V⟩ where

V ⊆ R^{|X|} is a set of payoff perturbations, one coordinate for each action;
P_V : S → Δ(V) is a distribution over payoff perturbations conditional on each state s ∈ S.

The timing of a perturbed MDP coincides with the (unperturbed) MDP defined in Section 3 except for two modifications. First, before taking an action in period t, the agent not only observes s_t but she now also observes a vector of payoff perturbations ε_t ∈ V, where ε_t(x) denotes the perturbation corresponding to action x. This vector of payoff perturbations is drawn i.i.d. every period conditional on the state s_t, according to the probability distribution P_V(· | s_t). Second, after the agent chooses x_t and the new state s_{t+1} is realized, her payoff for period t is now

π(s_t, x_t, s_{t+1}) + ε_t(x_t).

ASSUMPTION A6. For all s ∈ S, P_V(· | s) is absolutely continuous with respect to the Lebesgue measure on R^{|X|} and ∫_V ‖ε‖ P_V(dε | s) < ∞.

The value function of the perturbed MDP, V_Q^ξ, solves the Bellman equation

V_Q^ξ(s) = ∫_V max_{x ∈ Γ(s)} ∫_S { π(s, x, s′) + ε(x) + δ V_Q^ξ(s′) } Q(ds′ | s, x) P_V(dε | s). (7)

Lemma 2. There exists a unique solution V_Q^ξ to the Bellman equation (7); moreover, V_Q^ξ is bounded and continuous as a function of Q.

    Proof. See the Appendix.

Definition 11. A strategy σ : S → Δ(X) is optimal for the perturbed MDP if for all s ∈ S and all x ∈ X,

σ(x | s) = P_V( { ε : x ∈ arg max_{x′ ∈ Γ(s)} ∫_S { π(s, x′, s′) + ε(x′) + δ V_Q^ξ(s′) } Q(ds′ | s, x′) } | s ). (8)

We can think of an optimal strategy for the perturbed MDP in two stages. First, for each (s, ε) the agent determines the optimal action. Second, we integrate over ε. In other words, if σ is an optimal strategy, then σ(x | s) is the probability that x is optimal when the state is s, taken over all possible realizations of the perturbation ε. The next result is a consequence of the absolute continuity of P_V.
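This two-stage reading suggests a direct way to approximate the right-hand side of (8) by simulation. A sketch with i.i.d. Gaussian perturbations (a hypothetical specification satisfying Assumption A6; the vector v stands for the unperturbed action values at a given state):

```python
import numpy as np

rng = np.random.default_rng(1)

def perturbed_strategy(v, scale, n_draws=100_000):
    """Approximate sigma(x | s) = P_V({eps : x maximizes v(x) + eps(x)} | s)."""
    eps = scale * rng.standard_normal((n_draws, len(v)))
    choices = np.argmax(v + eps, axis=1)
    return np.bincount(choices, minlength=len(v)) / n_draws

print(perturbed_strategy(np.array([1.0, 1.2]), scale=1.0))   # both actions chosen
print(perturbed_strategy(np.array([1.0, 1.2]), scale=0.01))  # nearly degenerate
```

Shrinking the scale previews the vanishing-perturbation limit studied below: the induced strategy approaches the deterministic argmax.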

Lemma 3. An optimal strategy σ for the perturbed MDP exists (and is, therefore, unique). Moreover, σ is continuous as a function of the transition probability function Q.

Proof. That an object like σ in (8) is well defined is standard (e.g., Harsanyi [1973]) and follows from the facts that there are a finite number of actions and that the set of ε's such that the agent is indifferent between two actions lies in a lower-dimensional hyperplane. By absolute continuity, the set of ε's where the agent is indifferent has measure zero. Thus, the RHS of (8) must add up to 1 when added over all actions and, therefore, σ in (8) is well defined. The fact that σ is continuous as a function of Q follows from the fact that V_Q^ξ is continuous in Q (Lemma 2) and by the Theorem of the Maximum.

An important role will be played by environments where the perturbation induces the agent to choose all possible actions.

Definition 12. A fully perturbed MDP is a perturbed MDP where, for all s ∈ S, the support of P_V(· | s) is R^{|X|}.⁷

Lemma 4. If σ is the optimal strategy for a fully perturbed MDP, then there exists c > 0 such that σ(x | s) ≥ c for all (s, x) ∈ S × X.

Proof. See the Appendix.

⁷The next version will extend this definition to bounded supports.


If σ is such that all actions are chosen with positive probability for each state, then Assumption A4 implies that all (s, x) ∈ S × X are visited infinitely often; thus, there is a unique invariant distribution associated with σ.

Lemma 5. If σ satisfies σ(x | s) > 0 for all (s, x) ∈ S × X, then the transition kernel M_σ has a unique invariant distribution m_σ, and m_σ satisfies m_σ(s, x) > 0 for all (s, x) ∈ S × X.

Proof. See the Appendix.

Lemma 6. Suppose that m(s, x) > 0 for all (s, x) ∈ S × X. Then ∫_Θ Q_θ μ(dθ) = ∫_Θ Q_θ μ′(dθ) for any μ, μ′ ∈ Δ(Θ_Q(m)).

Proof. By Assumption A5, for all θ, θ′ ∈ Θ_Q(m), Q_θ(· | s, x) = Q_θ′(· | s, x) for all (s, x) ∈ S × X; hence, the result follows.

    Along similar lines, we can now define a perturbed version of a BMDP.

Definition 13. A perturbed BMDP is a perturbed MDP ⟨S, X, Γ, q0, Q, π, δ⟩, ξ = ⟨V, P_V⟩, together with a tuple ⟨Θ, μ0, B⟩ as defined in Definition 4. A fully perturbed BMDP is a perturbed BMDP where the corresponding MDP is fully perturbed.

The timing of a perturbed BMDP is the same as the timing of a BMDP specified in Section 4, where now the timing of the payoff perturbations coincides with the timing described above for the perturbed MDP.

Finally, we extend the definition of equilibrium of a BMDP in Section 4 to the case of a perturbed BMDP. The definition coincides with Definition 8, except that, of course, we require optimality with respect to the perturbed version of the MDP.

Definition 14. (Equilibrium of a perturbed BMDP) A strategy σ and probability distribution m ∈ Δ(S × X), jointly denoted (σ, m), constitute an equilibrium of the perturbed BMDP with true model Q if there exists μ ∈ Δ(Θ) such that

(i) σ is an optimal strategy for the perturbed MDP(Q_μ), where Q_μ = ∫_Θ Q_θ μ(dθ), and
(ii) μ ∈ Δ(Θ_Q(m)), where m is an invariant distribution of the transition kernel M_σ.

    Theorem 2. An equilibrium of a fully perturbed BMDP always exists.

Proof. Let Σ_c = {σ : σ(x | s) ∈ [c, 1 − (|X| − 1)c] for all (s, x) ∈ S × X}, where c ∈ (0, 1) is defined in Lemma 4. For all σ ∈ Σ_c, Lemma 5 implies that there is a unique invariant distribution, which we denote m_σ, of the transition kernel M_σ. Also, for all σ ∈ Σ_c, Lemmas 5 and 6 imply that ∫_Θ Q_θ μ(dθ) = ∫_Θ Q_θ μ′(dθ) for all μ, μ′ ∈ Δ(Θ_Q(m_σ)). Thus, the following function Q̄ that maps elements in Σ_c to transition probability functions is well defined: Q̄(σ) = ∫_Θ Q_θ μ(dθ), where μ is any belief that belongs to Δ(Θ_Q(m_σ)) and m_σ is the (unique) invariant distribution of the transition kernel M_σ. Next, we define the function f : Σ_c → Σ_c where f(σ) is the optimal strategy for the perturbed MDP(Q̄(σ)), which exists and is unique by Lemma 3. Notice that if σ is a fixed point of f, then (σ, m_σ) is an equilibrium of the perturbed BMDP. We now apply Brouwer's fixed point theorem to establish that a fixed point of f always exists. The space Σ_c is a compact and convex subset of a Euclidean space. Thus, it remains to establish that f is continuous. We establish this last result in two steps.

First, we show that Q̄ defined above is continuous. Let σ ∈ Σ_c and suppose that (σ_n)_n is a sequence of strategies in Σ_c that converges to σ. Let (m_n)_n be the sequence of (unique) invariant distributions for the transition kernels (M_{σ_n})_n. By compactness of Δ(X × S), pick a subsequence (m_{n_k})_k (denoted in the same way to simplify notation) that converges to m. By Lemma 12, m is an invariant distribution of M_σ. For each element in the subsequence, the fact that Θ_Q(m_{n_k}) is non-empty (by Lemma 1) and that Θ is compact implies that we can pick a subsubsequence θ_{n_{k_l}} ∈ Θ_Q(m_{n_{k_l}}) that converges to some θ ∈ Θ. Then θ ∈ Θ_Q(m) by the upper hemicontinuity of Θ_Q(·) established in Lemma 1. Thus, the facts that θ_{n_{k_l}} ∈ Θ_Q(m_{n_{k_l}}) and θ ∈ Θ_Q(m) imply that Q̄(σ_{n_{k_l}}) = Q_{θ_{n_{k_l}}} and Q̄(σ) = Q_θ. Continuity of Q̄ then follows because Q_θ is continuous as a function of θ (by Assumption A3). Second, by Lemma 3, the function mapping a transition probability function Q to the optimal strategy of a perturbed MDP is continuous. Since f is the composition of this function with the function Q̄, then f is continuous.

    We now consider a sequence of BMDPs where the payoff perturbations go to zero.

Definition 15. A vanishing family of perturbed BMDPs is a sequence of BMDPs with the following properties:

each BMDP in the sequence shares the same primitives ⟨S, X, Γ, q0, Q, π, δ⟩ and ⟨Θ, μ0, B⟩;

each BMDP possibly differs in the perturbation structure ⟨V^η, P_V^η⟩, where η ∈ N now indexes the element of the sequence;⁸

perturbations vanish as η → ∞: for all s ∈ S, all x ∈ X, and any sequence of measurable sets (D^η)_η,

lim_{η→∞} ∫_{D^η} |ε(x)| P_V^η(dε | s) = 0. (9)

A vanishing family of fully perturbed BMDPs is a family of perturbed BMDPs where each BMDP is fully perturbed.

Lemma 7. For a fixed set of primitives ⟨S, X, Γ, q0, Q, π, δ⟩ and ⟨Θ, μ0, B⟩, a vanishing sequence of fully perturbed BMDPs always exists.

Proof. Note that the conditions on the perturbation structure in Definition 12 and Definition 15 are compatible (e.g., a normal distribution with mean zero and vanishing variance).

⁸In a slight abuse of notation, we now use η to index the sequence, whereas ξ was used in Definition 10 to denote a tuple.
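For instance, if P_V^η is N(0, η^{-2} I), then ∫ |ε(x)| P_V^η(dε | s) = √(2/π)/η → 0, which bounds the integral in (9) over any sequence of measurable sets. A quick numerical check of this moment formula:

```python
import numpy as np

rng = np.random.default_rng(2)
for eta in [1, 10, 100]:
    draws = rng.standard_normal(1_000_000) / eta   # eps(x) ~ N(0, 1/eta^2)
    print(eta, np.abs(draws).mean(), np.sqrt(2 / np.pi) / eta)  # E|eps| vanishes
```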


Theorem 3. Fix a vanishing sequence of perturbed BMDPs and a corresponding sequence (σ^η, m^η)_η of equilibria such that lim_{η→∞} (σ^η, m^η) = (σ, m). Then (σ, m) is an equilibrium of the (unperturbed) BMDP.

Proof. By assumption, there is a sequence (σ^η, m^η, μ^η)_η such that, for all η, (i) m^η is invariant for the transition kernel M_{σ^η}, (ii) μ^η ∈ Δ(Θ_Q(m^η)), (iii) σ^η is optimal for the perturbed MDP(Q_{μ^η}), where Q_{μ^η} = ∫_Θ Q_θ μ^η(dθ), and (iv) lim_{η→∞} (σ^η, m^η) = (σ, m). By compactness of Δ(Θ), we can fix a subsequence where μ^η converges to some μ. Let Q_μ = ∫_Θ Q_θ μ(dθ). By Lemma 1, Θ_Q(·) is upper hemicontinuous, compact valued, and has compact domain and range; hence, the correspondence Δ(Θ_Q(·)) inherits the same properties. Therefore, Δ(Θ_Q(·)) satisfies the closed-graph property, which implies that μ ∈ Δ(Θ_Q(m)). Moreover, by Lemma 12, m is an invariant distribution of the transition kernel M_σ. Thus, to show that (σ, m) is an equilibrium of the (unperturbed) BMDP, it remains to show that σ is an optimal strategy for the (unperturbed) MDP with transition probability function Q_μ.

Fix any s ∈ S and any x ∈ X such that σ(x | s) > 0. Since lim_{η→∞} σ^η = σ, there exist k_{s,x} > 0 and η_{s,x} such that for all η ≥ η_{s,x}, σ^η(x | s) ≥ k_{s,x}. From now on, consider only such sufficiently large η's for the fixed subsequence. For all such η's, let D^η(x) be the set of ε ∈ V^η such that

ε(x) − ε(x′) ≥ ∫_S { π(s, x′, s′) + δ V^η_{Q_{μ^η}}(s′) } Q_{μ^η}(ds′ | s, x′) − ∫_S { π(s, x, s′) + δ V^η_{Q_{μ^η}}(s′) } Q_{μ^η}(ds′ | s, x) (10)

for all x′ ≠ x. Notice that P^η_V(D^η(x)) = σ^η(x | s) (see equation (8)). Since σ^η(x | s) ≥ k_{s,x} and σ^η is an optimal strategy for the perturbed MDP(Q_{μ^η}), then P^η_V(D^η(x)) ≥ k_{s,x} > 0 for all η. Integrating expression (10) over all ε ∈ D^η(x), we obtain

(1 / P^η_V(D^η(x))) ∫_{D^η(x)} { ε(x) − ε(x′) } P^η_V(dε) ≥ ∫_S { π(s, x′, s′) + δ V^η_{Q_{μ^η}}(s′) } Q_{μ^η}(ds′ | s, x′) − ∫_S { π(s, x, s′) + δ V^η_{Q_{μ^η}}(s′) } Q_{μ^η}(ds′ | s, x).

Taking limits over η and using assumption (9) and the fact that S is finite, it follows that

∫_S { π(s, x, s′) + δ lim_{η→∞} V^η_{Q_{μ^η}}(s′) } Q_μ(ds′ | s, x) − ∫_S { π(s, x′, s′) + δ lim_{η→∞} V^η_{Q_{μ^η}}(s′) } Q_μ(ds′ | s, x′) ≥ 0. (11)

Finally, since Q_θ is continuous (Assumption A3) and bounded, then lim_{η→∞} Q_{μ^η} = Q_μ. Thus, by Lemma 13, lim_{η→∞} V^η_{Q_{μ^η}}(·) = V_{Q_μ}(·) and, therefore, expression (11) implies that x is optimal in state s in the MDP with transition probability function Q_μ. Because the above holds for all (s, x) such that σ(x | s) > 0, it follows that σ is optimal for MDP(Q_μ).

A corollary of the above results is that an equilibrium of the (unperturbed) BMDP always exists (as stated in Theorem 1 of Section 4). To prove this statement, fix a vanishing family of fully perturbed BMDPs, which we can do by Lemma 7. By Theorem 2, there exists a corresponding sequence of equilibria. Since equilibria live in a compact space, there exists a subsequence of equilibria that converges. Theorem 3 says that this limit point is an equilibrium of the (unperturbed) BMDP.

As a final remark, notice that Theorem 3 is of independent interest and holds even if the vanishing sequence of perturbed BMDPs is not fully perturbed.


6 Learning foundation for equilibrium

In this section, we provide a learning foundation for the notion of equilibrium of a fully perturbed BMDP introduced in Section 5. Throughout this section, we fix a fully perturbed BMDP with components $\langle S, X, \Gamma, q_0, Q, \pi, \beta, \mathcal{V}, P^{\mathcal{V}} \rangle$ and $\langle \mathcal{Q}_\Theta, \mu_0, B \rangle$ satisfying Assumptions A1-A5. It is easy to see that the agent's dynamic optimization problem can be cast recursively as

\[ W(s, \xi, \mu) = \max_{x \in \Gamma(s)} \int_S \Big\{ \pi(s, x, s') + \xi(x) + \beta \int_{\mathcal{V}} W(s', \xi', \mu')\, P^{\mathcal{V}}(d\xi' \mid s') \Big\}\, \bar{Q}^{\mu}(ds' \mid x, s), \tag{12} \]

where $\bar{Q}^{\mu} = \int_\Theta Q_\theta\, \mu(d\theta)$ and where $\mu' = B(s', s, x, \mu)$ is next period's belief, which is updated according to Bayes' rule. The main difference with respect to the case where the agent knows the transition probability function is that now the agent's beliefs about $\theta$ are part of the state space.9
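To make the recursion (12) concrete, the following sketch solves a tiny perturbed BMDP, with two states, two actions, and two models, by value iteration on a discretized belief space. All primitives are hypothetical, and we assume i.i.d. mean-zero extreme-value payoff perturbations so that the expectation of $W$ over $\xi$ has a closed logsumexp form; the framework itself allows more general perturbation distributions.

```python
import numpy as np

# Minimal sketch of the recursion (12): two states, two actions, two models.
# All primitives are hypothetical. The payoff perturbation xi is integrated
# out assuming i.i.d. mean-zero Gumbel shocks with scale ETA, for which
# E_xi[max_x {u(x) + xi(x)}] = ETA * logsumexp(u / ETA).
BETA, ETA = 0.9, 0.1
Q = {0: [np.array([[0.9, 0.1], [0.2, 0.8]]),   # Q[theta][x][s, s']
         np.array([[0.5, 0.5], [0.5, 0.5]])],
     1: [np.array([[0.4, 0.6], [0.7, 0.3]]),
         np.array([[0.5, 0.5], [0.5, 0.5]])]}
pi = np.array([[0.0, 1.0], [1.0, 0.0]])        # pi[x, s'] : payoff to action x

mu_grid = np.linspace(0.0, 1.0, 201)           # belief mu = Prob(theta = 0)

def bayes(mu, s, x, s2):
    """Posterior B(s2, s, x, mu) on theta = 0 after observing s -> s2."""
    num = mu * Q[0][x][s, s2]
    den = num + (1.0 - mu) * Q[1][x][s, s2]
    return num / den if den > 0 else mu

W = np.zeros((2, mu_grid.size))                # W_bar(s, mu) = E_xi W(s, xi, mu)
for _ in range(2000):
    W_new = np.empty_like(W)
    for s in range(2):
        for i, mu in enumerate(mu_grid):
            u = np.empty(2)
            for x in range(2):
                qbar = mu * Q[0][x][s] + (1 - mu) * Q[1][x][s]  # belief-averaged kernel
                cont = np.array([np.interp(bayes(mu, s, x, s2), mu_grid, W[s2])
                                 for s2 in range(2)])
                u[x] = qbar @ (pi[x] + BETA * cont)
            m = u.max()                        # smoothed max = expected max over xi
            W_new[s, i] = m + ETA * np.log(np.exp((u - m) / ETA).sum())
    if np.abs(W_new - W).max() < 1e-10:
        break
    W = W_new
print("W_bar(s=0, mu=0.5) =", round(W[0, 100], 4))
```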

Lemma 8. There exists a unique solution $W$ to the Bellman equation (12); moreover, $W$ is bounded in $S \times \Delta(\Theta)$ for all $\xi \in \mathcal{V}$ and continuous in $S \times \mathcal{V} \times \Delta(\Theta)$ (under the product topology).

Proof. See the Appendix.

Definition 16. A policy function is a function $f : S \times \mathcal{V} \times \Delta(\Theta) \to X$ mapping the state of the dynamic optimization problem to the set of actions. A policy function $f$ is optimal (for the perturbed BMDP) if

\[ f(s, \xi, \mu) \in \arg\max_{x \in \Gamma(s)} \int_S \Big\{ \pi(s, x, s') + \xi(x) + \beta \int_{\mathcal{V}} W(s', \xi', \mu')\, P^{\mathcal{V}}(d\xi' \mid s') \Big\}\, \bar{Q}^{\mu}(ds' \mid x, s) \]

for all $(s, \xi, \mu) \in S \times \mathcal{V} \times \Delta(\Theta)$.

    The restriction to deterministic policy functions is without loss of generality and simplifies someof the following definitions.

Let $h = (s_0, \xi_0, x_0, \ldots, s_t, \xi_t, x_t, \ldots)$ represent the infinite history or outcome path of the dynamic optimization problem and let $H = (S \times \mathcal{V} \times X)^\infty$ represent the space of infinite histories. The primitives of the BMDP, including the prior $\mu_0 \in \Delta(\Theta)$, and a policy function $f$ induce a probability distribution over $H$ that is defined in a standard way; let $P^{\mu_0, f}$ denote this probability distribution over $H$.

Definition 17. The sequence of intended strategies given policy function $f$ is the sequence $(\sigma_t)_t$ of random variables $\sigma_t : H \to \Delta(X)^S$ such that, for all $t = 0, 1, \ldots$,

\[ \sigma_t(h)(x \mid s) = P^{\mathcal{V}}\left( \xi : f(s, \xi, \mu_t) = x \mid s \right), \tag{13} \]

where $\mu_t \in \Delta(\Theta)$ is the posterior at time $t$ defined recursively by $\mu_t = B(s_t, s_{t-1}, x_{t-1}, \mu_{t-1})$.

9 We now also explicitly include the payoff perturbation as a state variable for convenience.


By an argument similar to that used to prove Lemma 3, the assumption of absolute continuity of $P^{\mathcal{V}}(\cdot \mid s)$ implies that (13) is well defined.

Notice that, given a policy function $f$, an intended strategy $\sigma_t$ describes the strategy that the agent would like to play if she were to arrive at time $t$ with the beliefs $\mu_t$. One reasonable criterion for claiming that the agent's behavior stabilizes is that her intended behavior stabilizes.

Definition 18. A strategy $\sigma : S \to \Delta(X)$ is stable under policy function $f$ if

\[ P^{\mu_0, f}\Big( \lim_{t \to \infty} \sigma_t(h) = \sigma \Big) > 0, \]

where $(\sigma_t)_t$ is the sequence of intended strategies given $f$.

Lemma 9. Consider a fully perturbed BMDP and a strategy $\sigma$ with invariant distribution $m$ for the transition kernel $M^\sigma$. If $\sigma$ is stable under policy function $f$, then there exists a set of histories $\bar{H} \subseteq H$ with $P^{\mu_0, f}(\bar{H}) > 0$ such that, for all $h \in \bar{H}$,

\[ \lim_{t \to \infty} \frac{1}{t} \sum_{\tau = 0}^{t-1} 1_{\{s,x\}}(s_\tau, x_\tau)(h) = m(s, x) > 0 \]

for all $(s, x) \in S \times X$, and $\lim_{t \to \infty} \sigma_t(h) = \sigma$, where $(\sigma_t)_t$ is the sequence of intended strategies given $f$.

Lemma 9 says that, if the sequence of intended strategies converges in a fully perturbed BMDP, then the frequency of outcomes also converges; moreover, this frequency converges to the invariant distribution of the transition kernel $M^\sigma$. By Lemma 5, this invariant distribution is unique. Thus, for fully perturbed BMDPs, the notion of stability in Definition 18 captures the idea that behavior and outcomes stabilize in the BMDP. The numerical sketch below illustrates this invariant distribution in a small example.
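The sketch is our own illustration, with made-up primitives: it builds $M^\sigma$ over state-action pairs from a transition matrix $Q$ and a strictly mixed strategy $\sigma$, as would arise in a fully perturbed BMDP, and power-iterates to the unique invariant distribution.

```python
import numpy as np

# Hypothetical primitives: 2 states, 2 actions.
Q = np.array([[[0.9, 0.1], [0.3, 0.7]],    # Q[s, x, s']
              [[0.6, 0.4], [0.2, 0.8]]])
sigma = np.array([[0.7, 0.3],              # sigma[s, x], strictly mixed as in a
                  [0.4, 0.6]])             # fully perturbed BMDP

# Kernel over pairs (s, x): M[(s, x) -> (s', x')] = Q[s, x, s'] * sigma[s', x'].
M = np.einsum('sxt,ty->sxty', Q, sigma).reshape(4, 4)

m = np.full(4, 0.25)
for _ in range(10_000):                    # power iteration to the fixed point
    m_next = m @ M
    if np.abs(m_next - m).max() < 1e-14:
        break
    m = m_next
print("invariant m(s, x):\n", m.reshape(2, 2))
```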

    The next result characterizes the set of strategies that are stable when the agent follows anoptimal policy function in the fully perturbed BMDP.

Theorem 4. Consider a fully perturbed BMDP and a strategy $\sigma$ with invariant distribution $m$ for the transition kernel $M^\sigma$. If $\sigma$ is stable under a policy function $f$ that is optimal, then $(\sigma, m)$ is an equilibrium of the fully perturbed BMDP.

The proof of Theorem 4 follows from Lemma 9 and the following lemmas, which are proven in the Appendix.

Lemma 10. Given a policy function $f$, suppose that there exists a $\bar{H} \subseteq H$ such that $P^{\mu_0, f}(\bar{H}) > 0$ and, for all $(s, x) \in S \times X$,

\[ \lim_{t \to \infty} \frac{1}{t} \sum_{\tau = 0}^{t-1} 1_{\{s,x\}}(s_\tau, x_\tau)(h) = m(s, x) \]

for all $h \in \bar{H}$. Then, for any open set $U \supseteq \Theta_Q(m)$,

\[ \lim_{t \to \infty} \mu_t(U) = 1 \]

a.s.-$P^{\mu_0, f}$ over $\bar{H}$, where $(\mu_t)_t$ is defined recursively as $\mu_{t+1} = B(s_{t+1}, s_t, f(s_t, \xi_t, \mu_t), \mu_t)$ for all $t$.

Lemma 11. Let $f$ be an optimal policy function for a fully perturbed BMDP and let $\hat\Theta \subseteq \Theta$ be a set with the property that $\int_\Theta Q_\theta\, \mu(d\theta) = \int_\Theta Q_\theta\, \mu'(d\theta)$ for all $\mu, \mu' \in \Delta(\hat\Theta)$. Suppose that there exists a $\bar{H} \subseteq H$ with $P^{\mu_0, f}(\bar{H}) > 0$ such that, for any open set $U \supseteq \hat\Theta$,

\[ \lim_{t \to \infty} \mu_t(U) = 1 \]

a.s.-$P^{\mu_0, f}$ over $\bar{H}$. Then $\lim_{t \to \infty} \sigma_t(h) = \sigma^*$ a.s.-$P^{\mu_0, f}$ over $\bar{H}$, where $\sigma^*$ is the optimal strategy for a fully perturbed MDP$(\bar{Q}^{\mu})$, $\mu \in \Delta(\hat\Theta)$.

Proof of Theorem 4. Let $\sigma$ be stable under a policy function $f$ that is optimal and let $m$ be the invariant distribution of $M^\sigma$. By Lemma 9, the time average of outcomes converges to $m$. By Lemma 10, beliefs concentrate on $\Theta_Q(m)$. If we can show that $\Theta_Q(m)$ has the property stated in Lemma 11, then that lemma implies the desired result. Notice that, by Lemma 9, $m(s, x) > 0$ for all $(s, x) \in S \times X$; thus, Lemma 6 implies that the set $\Theta_Q(m)$ has the desired property.

    7 Examples

    7.1 Monopolist with unknown demand

We now study a generalized version of the Nyarko [1991] example discussed in Section 2. The monopolist chooses at each period $t = 0, 1, \ldots$ a price $x_t \in X = \{2, 10\}$. After choosing price $x_t$, the monopolist observes quantity sold according to the demand function

\[ s_{t+1} = a^* - b^* x_t + \epsilon_t, \]

where $(\epsilon_t)_t$ are i.i.d. random variables. Nyarko [1991] considered the special case where $(\epsilon_t)_t$ are normally distributed with mean zero and unit variance. Our results allow us to easily generalize this assumption to the case where the error term is distributed according to an absolutely continuous probability distribution with density $f$ and support $\mathbb{R}$, and where the following regularity assumption holds: for every $z \ne 0$, the set

\[ \{\epsilon : f(z + \epsilon) = f(\epsilon)\} \tag{14} \]

has probability strictly less than 1 according to $f$. Assumption (14) rules out, for example, density functions that are periodic.10

10 As mentioned by Nyarko [1991], the assumption that the support is $\mathbb{R}$ is chosen for simplicity despite the fact that realized quantity can be negative with positive probability. It is possible to also consider the case where $\epsilon$ has positive support but the agent does not know the distribution of $\epsilon$; in this case, the agent cannot identify the demand intercept from the mean of $\epsilon$, but this lack of identification is irrelevant for optimal behavior.

The corresponding transition probability function is, for all Borel sets $\bar{S} \subseteq \mathbb{R}$,

\[ Q_\theta(\bar{S} \mid x) = \int_{\bar{S}} f(s' - (a - bx))\, ds', \]

where $\theta = (a, b)$. Let $\theta^* = (a^*, b^*)$ denote the true demand parameter; i.e., the true transition probability function is $Q = Q_{\theta^*}$.

For simplicity, the monopolist incurs no costs of production. The profits received in period $t$ are then

\[ \pi(x_t, s_{t+1}) = x_t s_{t+1}. \]

    Notice that the problem of the monopolist who knows the demand function is static and that itwould be more natural to index price and quantity with the same time period t. Nevertheless, wemaintain our notation from the paper, which is more general and allows us to capture problemsthat are intrinsically dynamic even without a learning problem.

In this example, $S = \mathbb{R}$, which does not satisfy the assumption that $S$ is a finite set. Nevertheless, it is not too difficult to show that the results of the paper go through. Thus, the previous setup is an example of an MDP.

Next, we assume that the monopolist does not know the parameter $\theta^* = (a^*, b^*)$; moreover, she never observes the shocks $\epsilon_t$, but knows their distribution. In particular, let $\Theta \subseteq \mathbb{R}^2$ be a compact set and suppose that the monopolist starts with a prior $\mu_0$ with support $\Theta$. This problem is now a BMDP, and it is straightforward to check that Assumptions A1-A4 are satisfied (extended to the case where $S \subseteq \mathbb{R}$).

Finally, to check Assumption A5 we begin by writing the wKLD function. Since this MDP is inherently static (i.e., the current state does not affect the next state), as remarked at the end of Section 4 we replace the marginal of the invariant distribution given a strategy directly by the strategy. In this example, let the strategy $\sigma \in [0, 1]$ denote the probability that the agent chooses $x = 2$. Then

\[ K_Q(\sigma, \theta) = -\left( \sigma \int_{\mathbb{R}} \ln\!\left( \frac{f(r^*_2 - r^\theta_2 + \epsilon)}{f(\epsilon)} \right) f(\epsilon)\, d\epsilon + (1 - \sigma) \int_{\mathbb{R}} \ln\!\left( \frac{f(r^*_{10} - r^\theta_{10} + \epsilon)}{f(\epsilon)} \right) f(\epsilon)\, d\epsilon \right), \]

where $r^\theta_x = a - bx$ and $r^*_x = a^* - b^* x$. We consider two cases.
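The integrals appearing in $K_Q(\sigma, \theta)$ are straightforward to approximate by Monte Carlo for any density $f$ satisfying (14). The sketch below (our own illustration) uses a Laplace density, an arbitrary choice consistent with the assumptions, and confirms that $K_Q \ge 0$, with equality when $r^\theta_x = r^*_x$ for every price played with positive probability.

```python
import numpy as np

# Monte Carlo evaluation of K_Q(sigma, theta) for the monopolist, using a
# Laplace(0,1) error density f (an arbitrary choice satisfying (14)).
rng = np.random.default_rng(1)
eps = rng.laplace(0.0, 1.0, size=1_000_000)        # draws from f

def log_f(e):
    return -np.abs(e) - np.log(2.0)                # log of the Laplace density

def K(sigma, a, b, a_star=28.5, b_star=5.25):
    total = 0.0
    for x, weight in ((2, sigma), (10, 1.0 - sigma)):
        c = (a_star - b_star * x) - (a - b * x)    # r*_x - r^theta_x
        total -= weight * np.mean(log_f(c + eps) - log_f(eps))
    return total

print(K(0.5, 28.5, 5.25))   # ~0 at theta = theta*
print(K(0.5, 20.0, 4.00))   # > 0 away from theta*
print(K(1.0, 18.0, 0.00))   # ~0: theta on the slope-2 line when sigma = 1
```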

Case 1. Correctly specified BMDP: Suppose that $\theta^* \in \Theta$, so that the BMDP is correctly specified. By Lemma 1, $K_Q$ is always greater than or equal to zero. Suppose that $\sigma \in (0, 1)$. If $\theta$ is such that $r^\theta_x - r^*_x = 0$ for all $x$, then $K_Q(\sigma, \theta) = 0$; thus, such a $\theta$ minimizes the wKLD function. Notice also that the unique solution to the two previous equations is $a = a^*$ and $b = b^*$. Thus, the above solution minimizes the wKLD function. Moreover, the fact that $\ln(\cdot)$ is strictly concave and that assumption (14) is satisfied (where we set $z = r^\theta_x - r^*_x$) implies that the above is the unique minimizer of the wKLD function.

Next, consider the case where $\sigma = 1$ or $\sigma = 0$. By applying the same arguments, we conclude that the set of minimizers of the wKLD function is the set of $\theta \in \Theta$ that solves $r^\theta_2 - r^*_2 = 0$ or $r^\theta_{10} - r^*_{10} = 0$, respectively. Thus, we have established that

\[ \Theta_Q(\sigma) = \begin{cases} \{\theta \in \Theta : a - 2b = a^* - 2b^*\} & \text{if } \sigma = 1 \\ \{\theta^*\} & \text{if } \sigma \in (0, 1) \\ \{\theta \in \Theta : a - 10b = a^* - 10b^*\} & \text{if } \sigma = 0 \end{cases} \tag{15} \]


Figure 1 provides intuition for this result. The two lines through $\theta^*$ with slopes 2 and 10 correspond to the sets $\Theta_Q(\sigma)$ when $\sigma = 1$ and $\sigma = 0$, respectively. The intuition is that if the agent only plays $x = 2$ (or $x = 10$), then she cannot distinguish between any of the parameters along the line with slope 2 (or 10). However, if she were to play both $x = 2$ and $x = 10$ with positive probability, then the agent's belief would be given by the intersection of these two lines, which is $\{\theta^*\}$. In particular, notice that Assumption A5 is satisfied because, for each line, all the points along the line lead to the same transition probability function when the agent chooses the action corresponding to that line with probability 1.

Let $\sigma^*$ denote the optimal strategy in the MDP (i.e., when $\theta^*$ is known). Suppose, for concreteness, that $\theta^*$ is such that $\sigma^* = 1$ is the unique optimal strategy for the MDP (this is true if and only if $a^* < 12b^*$). Then Proposition 2 says that $\sigma^*$ is an equilibrium strategy for the BMDP. The question that we would like to answer is whether there exist other equilibrium strategies. By (15), if the agent mixes, then it must be the case that she puts probability 1 on the true model and, therefore, chooses $\sigma^*$. So if another equilibrium exists, it must involve the agent choosing price 10 with probability 1. For simplicity, suppose that the agent has degenerate beliefs (which is without loss of generality if $\Theta$ is convex). For the agent to prefer price 10, she must believe that 10 is (weakly) better than 2, i.e., $a \ge 12b$, as depicted by the dashed line in Figure 1. Moreover, $(a, b)$ must be on the line with slope 10 in Figure 1. Thus, if $\Theta$ includes points along the line with slope 10 and above the dashed line, then $\sigma = 0$ is also an equilibrium strategy for the BMDP.

Finally, suppose that we consider a fully perturbed version of this BMDP. Then the fact that the agent must mix implies that she has correct beliefs about $\theta^*$; therefore, as the perturbation vanishes, any sequence of equilibrium strategies must converge to $\sigma = 1$. Thus, the fully perturbed BMDP can be used to refine the set of equilibria of the unperturbed BMDP.

Case 2. Mis-specified BMDP: Suppose that $\Theta$ is a rectangle in $\mathbb{R}^2$ with vertices at the points $\underline\theta = (\underline{a}, \underline{b})$, $\bar\theta = (\bar{a}, \bar{b})$, $(\underline{a}, \bar{b})$, and $(\bar{a}, \underline{b})$, where $\underline{a} < \bar{a} < a^*$ and $\underline{b} < \bar{b} < b^*$. Notice that the BMDP is mis-specified because $\theta^* \notin \Theta$. By minimizing the wKLD function we obtain

\[ \Theta_Q(\sigma) = \begin{cases} \{(\bar{a}, b_\sigma)\} & \text{if } \sigma \ge \hat\sigma \\ \{(a_\sigma, \bar{b})\} & \text{if } \sigma \le \hat\sigma, \end{cases} \tag{16} \]

where $a_\sigma = a^* - (2\sigma + 10(1 - \sigma))(b^* - \bar{b})$, $b_\sigma = b^* - (a^* - \bar{a})/(2\sigma + 10(1 - \sigma))$, and $\hat\sigma = 5/4 - (a^* - \bar{a})/(8(b^* - \bar{b}))$. In particular, the closest model to the true model is always unique and Assumption A5 is verified for the mis-specified case.

For concreteness, consider the special case studied by Nyarko [1991] and depicted in Figure 1, where $(a^*, b^*) = (28.5, 5.25)$, $\bar{a} = 20$, $\underline{a} = 16$, $\underline{b} = 1$, and $\bar{b} = 4$. In this case, the two lines with slopes 2 and 10 happen to pass through two vertices of the rectangle $\Theta$, though the same results hold if, for example, $\underline{a} < 16$ and $\underline{b} < 1$. If $\sigma = 1$, then all the points on the line with slope 2 minimize the wKLD function, but now the point that belongs to $\Theta$ that is closest to $\theta^*$ along the axis created by the line with slope 2 is $(\bar{a}, \underline{b})$; in fact, $(\bar{a}, \underline{b})$ is the unique point that is both on the line and belongs to $\Theta$, but this fact is not important and the result extends to the case where $\underline{a} < 16$ and $\underline{b} < 1$. Using equation (16), we can verify that $b_1 = 1$ and, therefore, $(\bar{a}, b_1)$ is indeed the unique minimizer. Next, consider the case where $\sigma < 1$. Figure 1 depicts the line through point $\theta^*$ with slope $2\sigma + 10(1 - \sigma)$. If $\sigma \le \hat\sigma$, where $\hat\sigma$ is defined above, it is easy to see that the point on $\Theta$ that is closest to $\theta^*$ when measured along the line with slope $2\sigma + 10(1 - \sigma)$ is a unique point lying on the top edge of the rectangle that defines $\Theta$; this point is $(a_\sigma, \bar{b})$ in equation (16).

We now go back to the general case where $\underline{a} < \bar{a} < a^*$ and $\underline{b} < \bar{b} < b^*$. Suppose that $\sigma = 1$, so that the agent believes, according to (16), that the true parameter is $(\bar{a}, b_1)$. If $2(\bar{a} - 2b_1) < 10(\bar{a} - 10b_1)$, or equivalently, if $\bar{a} > 12b_1$, then the unique optimal strategy is for the agent to play $\sigma = 0$. Thus, $\sigma = 1$ is not an equilibrium of the mis-specified BMDP. Next, consider the case where $\sigma = 0$, so that the agent believes, according to (16), that the true parameter is $(a_0, \bar{b})$. If $2(a_0 - 2\bar{b}) > 10(a_0 - 10\bar{b})$, or equivalently, if $a_0 < 12\bar{b}$, then the unique optimal strategy is for the agent to play $\sigma = 1$. Thus, $\sigma = 0$ is not an equilibrium of the mis-specified BMDP. Thus, if both $\bar{a} > 12b_1$ and $a_0 < 12\bar{b}$, as is the case in the example considered by Nyarko [1991], the equilibrium must be in mixed strategies. In such a case, the monopolist must choose $\sigma$ that leads to a belief that the parameter is $(\bar{a}, b_\sigma)$ such that she is indifferent between prices $x = 2$ and $x = 10$:

\[ 2(\bar{a} - 2b_\sigma) = 10(\bar{a} - 10b_\sigma), \]

or, equivalently, $\bar{a} = 12b_\sigma$. Using (16) and some algebra, the unique equilibrium strategy for the case where $\bar{a} > 12b_1$ and $a_0 < 12\bar{b}$ is

\[ \sigma^* = \frac{5}{4} - \frac{1}{8}\, \frac{a^* - \bar{a}}{b^* - \bar{a}/12}. \]

For the parameters specified by Nyarko, the unique equilibrium strategy is $\sigma^* \approx 0.95$.
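These equilibrium formulas can be verified numerically. With standard normal errors, the wKLD function has the closed form $K_Q(\sigma, \theta) = \sigma (r^*_2 - r^\theta_2)^2/2 + (1 - \sigma)(r^*_{10} - r^\theta_{10})^2/2$, so the belief $\Theta_Q(\sigma)$ can be found by grid minimization and the indifference point by bisection. The sketch below (our own check, using Nyarko's parameter values) recovers $\sigma^* \approx 0.95$.

```python
import numpy as np

# Numerical check of the mixed equilibrium under Nyarko's parameterization.
# With N(0,1) errors, K_Q(sigma, theta) = sigma*d2^2/2 + (1-sigma)*d10^2/2,
# where d_x = r*_x - r^theta_x.
A_STAR, B_STAR = 28.5, 5.25
a_grid = np.linspace(16.0, 20.0, 401)           # Theta = [16, 20] x [1, 4]
b_grid = np.linspace(1.0, 4.0, 301)
A, B = np.meshgrid(a_grid, b_grid, indexing="ij")

def belief(sigma):
    """Grid argmin of the wKLD: the point Theta_Q(sigma) of equation (16)."""
    d2 = (A_STAR - 2 * B_STAR) - (A - 2 * B)
    d10 = (A_STAR - 10 * B_STAR) - (A - 10 * B)
    K = sigma * d2**2 / 2 + (1 - sigma) * d10**2 / 2
    i, j = np.unravel_index(K.argmin(), K.shape)
    return A[i, j], B[i, j]

def gain_of_price_2(sigma):
    a, b = belief(sigma)
    return 2 * (a - 2 * b) - 10 * (a - 10 * b)  # profit(2) - profit(10)

lo, hi = 0.5, 1.0                               # bisect on the indifference point
for _ in range(40):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if gain_of_price_2(mid) > 0 else (lo, mid)
print("sigma* ~", round((lo + hi) / 2, 3))      # ~0.95
print("closed form:", 5 / 4 - (A_STAR - 20.0) / (8 * (B_STAR - 20.0 / 12)))
```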

    7.2 Trading with adverse selection

We use the trading example from Esponda (2008, Section I) in order to illustrate that the current framework includes as particular cases the decision-theoretic analogs of three game-theoretic concepts that have been defined to capture the behavior of boundedly rational players: (fully) cursed equilibrium (Eyster and Rabin [2005]), analogy-based expectation equilibrium (Jehiel [2005], Jehiel and Koessler [2008]), and behavioral equilibrium (Esponda [2008]). Spiegler [2011] and Esponda [2008] discuss further relationships between these concepts that are not elaborated here.

At each period $t = 0, 1, \ldots$, a buyer and a seller simultaneously submit a (bid) price $x_t$ from a finite set $X$ and an ask price $a_t$ from a finite set $A$, respectively. If $x_t \ge a_t$, then the buyer pays $x_t$ to the seller and receives the seller's object, which she values at $v_t$, drawn from a finite set $V$. If $x_t < a_t$, then no trade takes place and each player receives 0. At the time she makes an offer, the buyer does not know her realized value $v_t$. Suppose that the seller's ask price and the buyer's value are drawn each period from the same joint probability distribution $q \in \Delta(A \times V)$, where $q_A \in \Delta(A)$ and $q_V \in \Delta(V)$ denote the marginal distributions. Our objective is to analyze the optimal pricing strategy of a risk-neutral buyer. (The typical story is that there is a population of sellers, each of whom follows the weakly dominant strategy of asking for her valuation; thus, the ask price is a function of the seller's valuation and, if buyer and seller valuations are correlated, as is the case in adverse selection settings, then the ask price and buyer valuation are also correlated.)

    7.2.1 Behavioral equilibrium

Suppose that the buyer observes her realized value $v_t$ at the end of each period $t$ if and only if trade takes place in that period. Suppose also that the buyer always observes the ask price $a_t$ submitted by the seller at the end of each period.11 Esponda [2008] allows for more general types of feedback, which he captures by using an information feedback function, an approach that is common in the self-confirming equilibrium literature. Here, in order to avoid having to introduce a new element to the setup in Section 3, we model limited feedback by a reformulation of the state space. In particular, the state space is $S = A \times (V \cup \{\emptyset\})$, where $(a, v)$ represents the state drawn at the end of period $t$, and a value of $v = \emptyset$ indicates that there was no trade and, therefore, the buyer did not observe the realized value of the object.

Let $Q$ denote the true transition probability function. Then, for all $x \in X$,

\[ Q(a, v \mid x) = q(a, v)\, 1_{\{x \ge a\}}(x) \tag{17} \]

for all $(a, v) \in A \times V$, and

\[ Q(a, \emptyset \mid x) = q_A(a)\, 1_{\{x < a\}}(x). \tag{18} \]
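To fix ideas, the following sketch (our construction, with an arbitrary hypothetical joint distribution $q$) assembles the true transition probability of equations (17)-(18) over the augmented state space.

```python
import numpy as np

# Hypothetical primitives: ask prices A = {1,2,3}, values V = {0,...,3}, and a
# joint q(a,v) with positive correlation between a and v (adverse selection).
A_SET, V_SET = np.array([1, 2, 3]), np.array([0, 1, 2, 3])
q = np.array([[0.20, 0.10, 0.05, 0.00],
              [0.05, 0.15, 0.10, 0.05],
              [0.00, 0.05, 0.10, 0.15]])   # q[a_idx, v_idx], sums to 1
q_A = q.sum(axis=1)

def true_transition(x):
    """Q(a, v | x) and Q(a, no-trade | x) from equations (17)-(18)."""
    trade = np.where(x >= A_SET, 1.0, 0.0)          # 1{x >= a}
    Q_av = q * trade[:, None]                        # mass on (a, v) states
    Q_a_empty = q_A * (1.0 - trade)                  # mass on (a, "empty") states
    return Q_av, Q_a_empty

Q_av, Q_a_empty = true_transition(x=2)
print(Q_av.sum() + Q_a_empty.sum())                  # = 1: a valid distribution
```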

For any strategy $\sigma \in \Delta(X)$ and parameter $\theta$, the wKLD function can be written as

\[ K(\sigma, \theta) = \sum_{x \in X} \sigma(x)\, E_{Q(\cdot \mid x)}\!\left[ \ln \frac{Q(a, v \mid x)}{Q_\theta(a, v \mid x)} \right] = \sum_{x \in X} \sigma(x) \sum_{(a,v)\,:\,x \ge a} \cdots \]

7.2.2 Cursed equilibrium

Suppose now that the buyer always observes the value of the object, irrespective of whether she trades. The true transition probability function is

\[ Q(a, v) = q(a, v) \tag{24} \]

and the parameterized one is given by

\[ Q_\theta(a, v) = q^A_\theta(a)\, q^V_\theta(v) \]

for all $(a, v) \in A \times V$ and $\theta \in \Theta = \Delta(A) \times \Delta(V)$. Notice that $E_Q[\pi(x, (a, v))]$ and $E_{Q_\theta}[\pi(x, (a, v))]$ are still given by (19) and (20), respectively.

For any strategy $\sigma \in \Delta(X)$ and parameter $\theta$, the wKLD function is now given by

\[ K(\sigma, \theta) = \sum_{x \in X} \sigma(x)\, E_{Q(\cdot \mid x)}\!\left[ \ln \frac{Q(a, v \mid x)}{Q_\theta(a, v \mid x)} \right] = \sum_{(a,v) \in A \times V} q(a, v) \ln \frac{q(a, v)}{q^A_\theta(a)\, q^V_\theta(v)}, \]

which does not depend on $\sigma$. Some algebra then yields that any $\theta$ that minimizes the wKLD function given $\sigma$ satisfies (21) and

\[ E_{q^V_\theta}[v] = E_q[v]. \]

Thus, an equilibrium of the BMDP is a strategy $\sigma$ such that every $x$ in its support maximizes

\[ P_q(x \ge a) \left( E_q[v] - x \right), \tag{25} \]

which no longer depends on $\sigma$.13 This case corresponds to a fully cursed equilibrium (Eyster and Rabin [2005]) and, in this trading context, it was originally discussed by Kagel and Levin [1986] and Holt and Sherman [1994].
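Expression (25) is simple to evaluate. The sketch below (our own illustration, with a hypothetical joint distribution $q$ exhibiting adverse selection) compares the bid maximizing the cursed objective (25) with the bid maximizing the correct expected profit $P_q(x \ge a)(E_q[v \mid x \ge a] - x)$; the cursed buyer ignores the selection in $E_q[v \mid x \ge a]$ and therefore tends to bid more.

```python
import numpy as np

# Fully cursed vs. correct bidding, per expression (25). Primitives are
# hypothetical: A = {1,2,3}, V = {0,...,3}, positively correlated q(a,v).
A_SET, V_SET = np.array([1, 2, 3]), np.array([0, 1, 2, 3])
q = np.array([[0.20, 0.10, 0.05, 0.00],
              [0.05, 0.15, 0.10, 0.05],
              [0.00, 0.05, 0.10, 0.15]])
X = np.array([0, 1, 2, 3])                       # feasible bids

def profits(x):
    trade = (x >= A_SET)                         # which asks are accepted
    p_trade = q[trade].sum()
    if p_trade == 0:
        return 0.0, 0.0
    Ev = (q * V_SET).sum()                       # unconditional E[v]
    Ev_trade = (q[trade] * V_SET).sum() / p_trade  # E[v | trade]
    return p_trade * (Ev - x), p_trade * (Ev_trade - x)

cursed = [profits(x)[0] for x in X]
correct = [profits(x)[1] for x in X]
print("cursed-optimal bid :", X[int(np.argmax(cursed))])
print("correct-optimal bid:", X[int(np.argmax(correct))])
```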

    7.2.3 Analogy-based expectation equilibrium

While a behavioral equilibrium and a fully cursed equilibrium can be viewed as capturing the same type of mis-specification (i.e., failure to understand correlation), an analogy-based expectation equilibrium captures a richer class of mis-specifications by introducing the notion of an analogy class. In the context of the trading example, suppose that we partition the set $V$ into $k$ analogy classes $(V_j)_{j=1,\ldots,k}$, where $\cup_j V_j = V$ and $V_i \cap V_j = \emptyset$ for all $i \ne j$. Jehiel [2005] and Jehiel and Koessler [2008] implicitly make the same feedback assumption as Eyster and Rabin [2005]; the two concepts were developed independently, though. Suppose, as in the analysis of cursed equilibrium in Section 7.2.2, that the buyer always observes the realization $v_t$. The true transition probability function is the same function $Q$ from equation (24), but the mis-specified models are now represented by

\[ Q_\theta(a, v) = q_\theta(a \mid v)\, q_\theta(v), \]

13 Of course, cursed equilibrium is defined as a fixed point in a game with multiple agents; the point is that it is not a fixed point with a single agent because beliefs do not depend on the agent's strategy. A similar remark holds for the analogy-based expectation equilibrium analyzed in the next section.


for all $(a, v) \in A \times V$, where, for every analogy class $i = 1, \ldots, k$,

\[ q_\theta(a \mid v) = q_\theta(a \mid v') \]

for all $v, v' \in V_i$. Thus, the agent believes (possibly incorrectly) that the distribution of $a$ conditional on $v$ might depend on the analogy class to which $v$ belongs, but does not depend on which element of an analogy class we take. In other words, the buyer believes that $a$ and $v$ are independent conditional on $v \in V_i$, for each $i = 1, \ldots, k$.

The expected profit function when the model is $\theta$ is now also different and given by

\[ E_{Q_\theta}[\pi(x, (a, v))] = P_{Q_\theta}(x \ge a) \left( E_{Q_\theta}[v \mid x \ge a] - x \right). \tag{26} \]

For any strategy $\sigma \in \Delta(X)$ and parameter $\theta$, the wKLD function is now given by

\[ K(\sigma, \theta) = \sum_{x \in X} \sigma(x)\, E_{Q(\cdot \mid x)}\!\left[ \ln \frac{Q(a, v \mid x)}{Q_\theta(a, v \mid x)} \right] = \sum_{(a,v) \in A \times V} q(a, v) \ln \frac{q(a, v)}{q_\theta(a \mid v)\, q_\theta(v)}, \]

which does not depend on $\sigma$. Some algebra then yields that any $\theta$ that minimizes the wKLD function given $\sigma$ satisfies (21), $q_\theta(v) = q_V(v)$ for all $v \in V$, and, for all $i = 1, \ldots, k$ and all $v \in V_i$,

\[ q_\theta(a \mid v) = q_{V_i}(a) \equiv \frac{\sum_{v' \in V_i} q(a \mid v')\, q(v')}{\sum_{v' \in V_i} q(v')}. \]

Then, (26) becomes

\[ E_{Q_\theta}[\pi(x, (a, v))] = \sum_{i=1}^{k} P_q(v \in V_i)\, P_q(x \ge a \mid v \in V_i) \left( E_q[v \mid v \in V_i] - x \right). \tag{27} \]

An equilibrium of the BMDP is a strategy $\sigma$ such that every $x$ in its support maximizes (27), which does not depend on $\sigma$. This case corresponds to the analogy-based expectation equilibrium of Jehiel and Koessler [2008]. Notice that in the particular case where the partition over $V$ is trivial, i.e., $k = 1$, (27) reduces to expression (25) and the analogy-based expectation equilibrium is equivalent to the fully cursed equilibrium. See also Spiegler [2011] (Chapter 8) for a discussion of analogy-based expectation equilibrium in this trading example.
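Expression (27) is equally easy to compute once the analogy classes are fixed. The sketch below (our own illustration, reusing the hypothetical $q$ from the previous sketch) evaluates (27) under the trivial partition, which reproduces the cursed objective (25), and under the finest partition, which restores the correct conditional beliefs.

```python
import numpy as np

# Analogy-based expected profit (27) for a given partition of V.
# Hypothetical primitives as before; classes are lists of value indices.
A_SET, V_SET = np.array([1, 2, 3]), np.array([0, 1, 2, 3])
q = np.array([[0.20, 0.10, 0.05, 0.00],
              [0.05, 0.15, 0.10, 0.05],
              [0.00, 0.05, 0.10, 0.15]])

def abee_profit(x, classes):
    total = 0.0
    trade = (x >= A_SET)
    for cls in classes:                        # cls = indices of one class V_i
        block = q[:, cls]
        p_cls = block.sum()                    # P(v in V_i)
        p_trade = block[trade].sum() / p_cls   # P(x >= a | v in V_i)
        Ev = (block.sum(axis=0) * V_SET[cls]).sum() / p_cls
        total += p_cls * p_trade * (Ev - x)
    return total

X = np.array([0, 1, 2, 3])
for classes in ([[0, 1, 2, 3]],               # trivial partition: cursed, eq. (25)
                [[0], [1], [2], [3]]):        # finest partition: correct beliefs
    vals = [abee_profit(x, classes) for x in X]
    print(classes, "-> optimal bid", X[int(np.argmax(vals))])
```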

    7.2.4 Behavioral equilibrium with analogy classes

    One of the benefits of having a unifying framework is that we can easily consider new types ofmis-specifications that are a combination of previous cases. For example, consider the tradingexample where feedback coincides with the behavioral equilibrium case analyzed above (i.e., thebuyer only observes realized values when she trades) and where the buyer has the type of mis-specification analyzed in an analogy-based expectation equilibrium (i.e., she learns using analogyclasses (Vj)j=1,...,k).


As in the behavioral equilibrium case, let $S = A \times (V \cup \{\emptyset\})$ denote the state space. The true transition probability function $Q$ is once again given by (17) and (18). The mis-specified models are now given by

\[ Q_\theta(a, v \mid x) = q_\theta(a, v)\, 1_{\{x \ge a\}}(x) \]

for all $(a, v) \in A \times V$ and

\[ Q_\theta(a, \emptyset \mid x) = q^A_\theta(a)\, 1_{\{x < a\}}(x), \]

where $q_\theta(a, v) = q_\theta(a \mid v)\, q_\theta(v)$ satisfies the analogy-class restrictions of Section 7.2.3.

7.3 Search with uncertainty about future job offers

We consider the problem of an infinitely-lived agent who faces the choice of accepting or rejecting a wage offer. The agent is uncertain about her chances of receiving a job offer in the future and about her chances of being fired if she accepts employment. The probability that the agent receives an offer and the probability that she is fired from her job both depend on economic fundamentals. We study the behavior of a mis-specified agent who fails to realize that the chance of future wage offers or the chance of being fired is related to economic fundamentals. We find that if the chance of receiving an offer and the chance of being fired are negatively correlated, then a mis-specified agent will be less selective when accepting wage offers compared to an agent with the correct model.

At the beginning of each period $t = 0, 1, \ldots$ the agent decides whether to accept or reject a given wage offer $w_t$. If she accepts the offer, then she earns $w_t$ in that period; otherwise, she earns zero. After she makes her employment decision, an economic fundamental $z_{t+1}$ is drawn i.i.d. from the finite set $Z$ according to the probability distribution $G$. If the agent is employed, then she is fired with probability $\rho(z_{t+1})$, where $\rho$ is a vector in $[0, 1]^{|Z|}$. If the agent is unemployed (either because she was employed and then fired or because she did not accept employment at the beginning of the period), then with probability $\lambda(z_{t+1})$, where $\lambda \in [0, 1]^{|Z|}$, she draws a new wage $w_{t+1}$ from the interval $[0, 1]$ according to the absolutely continuous distribution $F$. With probability $1 - \lambda(z_{t+1})$, the unemployed agent receives no wage offer, which we conveniently represent by saying she receives a negative wage offer, $w_{t+1} = \underline{w} < 0$, which she will of course never accept. The agent will have to decide whether to accept or reject $w_{t+1}$ at the beginning of next period. If the agent accepted employment at wage $w_t$ at the beginning of period $t$ and was not fired, then she starts next period with wage offer $w_{t+1} = w_t$ and will again have to decide whether to quit or remain in her job at that wage. As usual, the agent maximizes her discounted expected utility, where per-period payoffs are $\pi(w_t, x_t) = x_t w_t$ and $\beta \in [0, 1)$ is her discount factor.

It is easy to see that this setting constitutes an MDP with state space $S = W \times Z$, where $W \equiv \{\underline{w}\} \cup [0, 1]$, constraint correspondence $\Gamma(s) = X$ for all $s \in S$, and, for $\lambda \in [0, 1]^{|Z|}$ (why we index $Q$ by this vector will become clear soon), a true transition probability function $Q_\lambda$ such that, for all Borel sets $A$, all $z' \in Z$, and all $(w, z, x) \in S \times X$,

\[ Q_\lambda(w' \in A, z' \mid w, z, x) = G(z')\, Q_\lambda(w' \in A \mid z', w, x), \]

where

\[ Q_\lambda(w' \in A \mid z', w, 1) = (1 - \rho(z'))\, 1_A(w) + \rho(z')\lambda(z') \int_A F(dw') + \rho(z')(1 - \lambda(z'))\, 1_A(\underline{w}) \]

and

\[ Q_\lambda(w' \in A \mid z', w, 0) = \lambda(z') \int_A F(dw') + (1 - \lambda(z'))\, 1_A(\underline{w}). \]

We consider the corresponding BMDP where the agent knows the primitives except the vectors $\rho$ and $\lambda$ that determine the probability of being fired and of receiving an offer, respectively. The agent believes, possibly incorrectly, that $\lambda$ does not depend on the economic fundamental, i.e., $\lambda(z) = \theta$ for all $z \in Z$, where $\theta \in [0, 1]$ is a parameter she wants to learn. Given the assumption that $\lambda$ is constant, the agent's behavior depends only on the average value of $\rho$ and, therefore, we will simply assume that the agent knows $\rho$ (or its average value). The set of models in the BMDP is then $\mathcal{Q}_\Theta = \{Q_{(\theta,\ldots,\theta)} : \theta \in \Theta\}$, where $\Theta = [0, 1]$ and $\mathrm{supp}(\mu_0) = \Theta$. From now on, we denote $Q_{(\theta,\ldots,\theta)}$ by $Q_\theta$.

    The assumption that the economic fundamental is i.i.d. and revealed only after the workerdecides whether to accept an offer is made for simplicity. It guarantees that, irrespective of whetherthe agent is mis-specified or knows the true model, her optimal strategy depends only on the wageoffer, and not on the economic fundamental. Thus, we can better compare strategies in the mis-specified and correct settings and isolate the effect of the mis-specification.

It is a straightforward exercise to verify that Assumptions A1-A5 are satisfied in this context. [As usual, with the caveat that we need to extend the assumptions to the case where $W \subseteq \mathbb{R}$ rather than a finite set.]

    The next result characterizes equilibrium for this BMDP.

Proposition 3. A strategy $\sigma$ is an equilibrium strategy of the BMDP if and only if it is characterized by an equilibrium threshold $\bar{w}$ such that lower offers are rejected and higher offers are accepted, irrespective of the value of $z$, i.e., for all $z \in Z$, $\sigma(0 \mid w, z) = 1$ if $w < \bar{w}$ and $\sigma(1 \mid w, z) = 1$ if $w > \bar{w}$, where $\bar{w}$ solves the system of equations:

\[ \bar{w}\,(1 - \beta + \beta E_G[\rho]) = \beta\,\theta\,(1 - E_G[\rho]) \left\{ \int_{w > \bar{w}} (w - \bar{w})\, F(dw) \right\} \tag{29} \]

and

\[ \theta = \frac{m_X(0)}{m_X(0) + m_X(1)\, E_G[\rho]}\, E_G[\lambda] + \frac{m_X(1)\, E_G[\rho]}{m_X(0) + m_X(1)\, E_G[\rho]} \left( E_G[\lambda] + \frac{\operatorname{Cov}_G(\lambda, \rho)}{E_G[\rho]} \right), \tag{30} \]

where

\[ m_X(0) = \frac{E_G[\rho] - (1 - F(\bar{w}))\, E_G[\lambda\rho]}{(1 - F(\bar{w}))\, \{E_G[\lambda] - E_G[\lambda\rho]\} + E_G[\rho]} \tag{31} \]

and $m_X(1) = 1 - m_X(0)$. Moreover, if $\operatorname{Cov}_G(\lambda, \rho) > 0$, then there is a unique equilibrium threshold $\bar{w}$.

    Proof. See the Appendix.

Equation (29) is the standard equation in search models that characterizes the threshold $\bar{w}$ as a function of the parameter $\theta$ (i.e., when the agent believes in the mis-specified model $Q_\theta$). It is easy to see that there is a unique threshold $\bar{w}$ that solves (29) for each $\theta$; denote this solution by $w(\theta)$. Figure 2 plots two examples of $w(\theta)$; notice that $w(\theta)$ is increasing because, if the agent believes in a higher probability of receiving a wage offer, then accepting a current offer becomes less attractive, and the optimal threshold increases.

As usual in equilibrium, the belief $\theta$ is determined endogenously by the strategy of the agent, which is characterized by $\bar{w}$ in this example. Equations (30) and (31) describe how the belief $\theta$ depends on $\bar{w}$. Consider first equation (30). The agent only observes the realization of the offer process, i.e., whether she receives a wage offer, in periods in which she is unemployed. There are two reasons why the agent can be unemployed. The first reason is that the agent rejected the offer. Since this decision happens before the new economic fundamental is realized, there is no correlation between this decision and the chance of getting a wage offer. Thus, the mis-specified agent will believe that the probability of getting a wage offer is given by the true average probability, $E_G[\lambda]$. This situation is reflected by the first term on the RHS of (30).

The second reason for being unemployed is that the agent accepted an offer but was then fired. In this case, the probability of being fired might be correlated with the probability of receiving an offer, but the agent fails to account for this possibility. If, say, $\operatorname{Cov}_G(\lambda, \rho) < 0$, so that the agent is less likely to get an offer in periods in which she is fired, then she will have a more pessimistic view about the probability of receiving a wage offer relative to the average probability $E_G[\lambda]$. The second term on the RHS of (30) captures this bias.

The weights on the RHS of (30) represent the probabilities of being unemployed by choice or due to being fired, each conditional on being unemployed. As described by (31), these weights depend on the invariant probability of accepting an offer, $m_X(1)$. Since this probability is determined by the agent's strategy, the weights and, therefore, the belief $\theta$ also endogenously depend on the threshold strategy $\bar{w}$. For example, if the agent rejects more offers, so that $\bar{w}$ increases, then the weight on being unemployed by choice increases and the bias in $\theta$ decreases.

Let $\theta(\bar{w})$ denote the belief that corresponds to following threshold strategy $\bar{w}$, which is obtained by replacing (31) into (30). Figure 2 plots this relationship for two cases. In the left panel, $\operatorname{Cov}_G(\lambda, \rho) < 0$ and, as explained above, the agent has a negative bias. As $\bar{w}$ increases, the agent spends more time unemployed by choice and the bias diminishes; thus, $\theta(\cdot)$ is increasing. In the right panel, $\operatorname{Cov}_G(\lambda, \rho) > 0$ and, therefore, the bias is positive. As $\bar{w}$ increases, the bias diminishes, which implies that $\theta(\cdot)$ is decreasing.

As depicted in Figure 2, any intersection of the functions $w(\theta)$ and $\theta(\bar{w})$ is an equilibrium threshold. In the case $\operatorname{Cov}_G(\lambda, \rho) < 0$, there might be multiple equilibrium thresholds. But in the case $\operatorname{Cov}_G(\lambda, \rho) > 0$, the facts that $w(\theta)$ is increasing and $\theta(\bar{w})$ is decreasing imply that there is a unique equilibrium threshold.
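An equilibrium threshold is straightforward to compute from (29)-(31). The sketch below (our own numerical illustration; all primitive values are hypothetical) takes $F$ uniform on $[0, 1]$, so that the integral in (29) equals $(1 - \bar{w})^2/2$, a two-point fundamental with $G$ uniform, and negatively correlated $\lambda$ and $\rho$, and iterates the mapping $\bar{w} \mapsto w(\theta(\bar{w}))$ to a fixed point.

```python
import numpy as np

# Equilibrium threshold of the search BMDP via (29)-(31). Hypothetical
# primitives: Z = {0, 1} with G uniform, F uniform on [0, 1], beta = 0.9,
# lam = offer probabilities, rho = firing probabilities (Cov_G(lam, rho) < 0).
BETA = 0.9
G = np.array([0.5, 0.5])
lam = np.array([0.7, 0.2])
rho = np.array([0.1, 0.4])
E_l, E_r, E_lr = (G * lam).sum(), (G * rho).sum(), (G * lam * rho).sum()
cov = E_lr - E_l * E_r

def w_of_theta(theta):
    """Unique root of (29); with F uniform the integral equals (1-w)^2 / 2."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        w = (lo + hi) / 2
        gap = w * (1 - BETA + BETA * E_r) - BETA * theta * (1 - E_r) * (1 - w) ** 2 / 2
        lo, hi = (lo, w) if gap > 0 else (w, hi)
    return (lo + hi) / 2

def theta_of_w(w):
    """Belief induced by threshold w, from equations (30)-(31)."""
    m0 = (E_r - (1 - w) * E_lr) / ((1 - w) * (E_l - E_lr) + E_r)
    w_fired = (1 - m0) * E_r / (m0 + (1 - m0) * E_r)
    return (1 - w_fired) * E_l + w_fired * (E_l + cov / E_r)

w_bar = 0.5
for _ in range(200):                  # iterate w -> w(theta(w)) to a fixed point
    w_bar = w_of_theta(theta_of_w(w_bar))
print(f"equilibrium threshold ~ {w_bar:.4f}, equilibrium belief ~ {theta_of_w(w_bar):.4f}")
print(f"true average offer probability E_G[lam] = {E_l:.4f}")
```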

Next, we characterize the optimal strategy for the agent who knows the true transition probability function $Q_\lambda$.

Proposition 4. The optimal strategy $\sigma^o : W \times Z \to \Delta(X)$ for the MDP$(Q_\lambda)$ is essentially unique and is characterized by the unique threshold $w^o$ that solves

\[ w^o\,(1 - \beta + \beta E_G[\rho]) = \beta\,(E_G[\lambda] - E_G[\lambda\rho]) \left\{ \int_{w > w^o} (w - w^o)\, F(dw) \right\}. \tag{32} \]

    Proof. See the Appendix.

By comparing equations (29) and (32), we observe that the only difference appears in the term multiplying the integral on the RHS. In the mis-specified case, the term is $\theta(1 - E_G[\rho])$; in the correct case, the term is $E_G[\lambda] - E_G[\lambda\rho] = E_G[\lambda](1 - E_G[\rho]) - \operatorname{Cov}_G(\lambda, \rho)$. Thus, the mis-specification affects the optimal threshold in two ways. First, the mis-specified agent estimates the mean of $\lambda$ incorrectly, i.e., $\theta \ne E_G[\lambda]$; second, the mis-specified agent ignores the potential correlation between $\lambda$ and $\rho$, i.e., $\operatorname{Cov}_G(\lambda, \rho)$. Using these observations, we can now compare the equilibrium strategy of the mis-specified BMDP with the optimal strategy for the MDP with the correct transition probability function.

Proposition 5. If $\operatorname{Cov}_G(\lambda, \rho) < (>)\ 0$, then $\bar{w}_m < (>)\ w^o$ for any threshold $\bar{w}_m$ that characterizes an equilibrium strategy of the BMDP.

    Proof. See the Appendix.
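Proposition 5 can be checked numerically by solving (29)-(31) for the equilibrium threshold $\bar{w}_m$ and (32) for $w^o$ under covariances of both signs. The sketch below is our own check with hypothetical primitives and $F$ uniform on $[0, 1]$.

```python
import numpy as np

# Check of Proposition 5: equilibrium threshold w_m of the BMDP vs. the
# correct-model threshold w_o from (32). Hypothetical primitives; F uniform
# on [0, 1]; Z = {0, 1} with G uniform; beta = 0.9.
BETA = 0.9
G = np.array([0.5, 0.5])

def solve_threshold(coef):
    """Root of w*(1 - BETA + BETA*E_r) = BETA*coef*(1 - w)^2 / 2 on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        w = (lo + hi) / 2
        gap = w * (1 - BETA + BETA * E_r) - BETA * coef * (1 - w) ** 2 / 2
        lo, hi = (lo, w) if gap > 0 else (w, hi)
    return (lo + hi) / 2

for lam, rho in (([0.7, 0.2], [0.1, 0.4]),   # Cov_G(lam, rho) < 0
                 ([0.2, 0.7], [0.1, 0.4])):  # Cov_G(lam, rho) > 0
    lam, rho = np.array(lam), np.array(rho)
    E_l, E_r, E_lr = (G * lam).sum(), (G * rho).sum(), (G * lam * rho).sum()
    cov = E_lr - E_l * E_r

    def theta_of_w(w):                       # equations (30)-(31)
        m0 = (E_r - (1 - w) * E_lr) / ((1 - w) * (E_l - E_lr) + E_r)
        wf = (1 - m0) * E_r / (m0 + (1 - m0) * E_r)
        return (1 - wf) * E_l + wf * (E_l + cov / E_r)

    w_m = 0.5
    for _ in range(200):                     # fixed point of (29)-(31)
        w_m = solve_threshold(theta_of_w(w_m) * (1 - E_r))
    w_o = solve_threshold(E_l - E_lr)        # equation (32)
    print(f"cov={cov:+.4f}  w_m={w_m:.4f}  w_o={w_o:.4f}")
```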

    The intuition of the proof relies on the two differences highlighted above between the optimalsolution to the search problem under the true model