econometric applications of maxmin expected utility

Econometric Applications of Maxmin Expected Utility

Author(s): Gary Chamberlain

Source: Journal of Applied Econometrics, Vol. 15, No. 6, Special Issue: Inference and

Decision Making (Nov. - Dec., 2000), pp. 625-644

Published by: Wiley

Stable URL: https://www.jstor.org/stable/2678563

Accessed: 12-07-2019 13:17 UTC

REFERENCES

Linked references are available on JSTOR for this article:

https://www.jstor.org/stable/2678563?seq=1&cid=pdf-reference#references_tab_contents

You may need to log in to JSTOR to access the linked references.

JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide

range of content in a trusted digital archive. We use information technology and tools to increase productivity and

facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected].

Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at

https://about.jstor.org/terms

Wiley is collaborating with JSTOR to digitize, preserve and extend access to Journal of

Applied Econometrics

This content downloaded from 206.253.207.235 on Fri, 12 Jul 2019 13:17:20 UTCAll use subject to https://about.jstor.org/terms

JOURNAL OF APPLIED ECONOMETRICS

J. Appl. Econ. 15: 625-644 (2000)

ECONOMETRIC APPLICATIONS OF MAXMIN EXPECTED UTILITY

GARY CHAMBERLAIN*

Department of Economics, Harivard University, Cambridge, MA 02138, USA

SUMMARY

Gilboa and Schmeidler (1989) develop a set of axioms for decision making under uncertainty. The axioms imply a utility function and a set of distributions such that the preference ordering is obtained by calculating expected utility with respect to each distribution in the set, and then taking the minimum of expected utility over the set. In a portfolio choice problem, the distributions are joint distributions for the data that will be available when the choice is made and for the future returns that will determine the value

of the portfolio. The set of distributions could be generated by combining a parametric model with a set of prior distributions. We apply this framework to obtain a preference ordering over decision rules, which map the data into a choice. We seek a decision rule that maximizes the minimum expected utility (or, equivalently, minimizes maximum risk) over the set of distributions. An algorithm is provided for the case of a finite set of distributions. It is based on solving a concave programme to find the least-favourable mixture of these distributions. The minimax rule is a Bayes rule with respect to this least-favourable distribution. The minimax value is a lower bound for minimax risk relative to a larger set of distributions. An upper bound can be found by fixing a decision rule and calculating its maximum risk. We apply the algorithm to an estimation problem in an autoregressive, random-effects model for panel data. Copyright © 2000 John Wiley & Sons, Ltd.

1. INTRODUCTION

Consider an individual making a portfolio choice at date T involving two assets. The (gross) returns at t per unit invested at t- 1 are ylt and Y2t The individual has observed these returns from t = 0 to t = T. He has also observed the values of the variables Y/3t, ... ,YKI, which are thought to be relevant in forecasting future returns. So the information available to him when he makes his portfolio choice is z {(yit,.. ,YKt)}T= He invests one unit, divided between an amount a in asset one and 1 - a in asset two, and then holds on to the portfolio until date H. Let w = {(y1t,Y2t)}I= T+ 1 and let h(w,a) denote the value of the portfolio at t = H:

H H

h(w,a)=a I1 ylt+(- a) 7 y2t (1) t=T+l t=T+I

How should a be chosen?

Consider an econometrician who observes a sample vector z drawn from a distribution P0 for some value of the parameter 0 in the parameter space E c RIP. He is interested in a real-valued function g(0) and would like an estimator that is optimal under a mean-square error criterion. How should he choose an estimator?

*Correspondence to: Gary Chamberlain, Department of Economics, Harvard University, Cambridge, MA 02138, USA. e-mail: [email protected]

Copyright © 2000 John Wiley & Sons, Ltd. Received 30 September 1999 Revised 31 August 2000


G. CHAMBERLAIN

Gilboa and Schmeidler (1989) develop a set of axioms for decision making under uncertainty. The axioms imply a utility function and a set of distributions such that the preference ordering is obtained by calculating expected utility with respect to each distribution in the set, and then taking the minimum of expected utility over the set. In Section 2 we apply this framework to obtain a preference ordering over decision rules, which map the observation z into a choice a. The decision maker's problem is to choose a decision rule that maximizes the minimum expected utility. In the portfolio choice problem, this gives

maxmin u(h (w,d(z)))dQ(z, ) dcD QeSJ

where d is a decision rule, D is the set of feasible decision rules, and u is the utility function. The value (z,wv) is regarded as the realization of a random variable (Z, W) with distribution Q, and S is the set of distributions.

With risk defined as the negative of expected utility, this framework corresponds to Wald's (1950) minimax risk criterion. In the estimation problem, we could take the set S to be {Pe: 0 E }. This gives

minmax f(g() -d(z))2dPo(z) (2)

The set S of distributions is a key element of this framework. A possible criticism of the criterion in equation (2) is that it puts too much weight on parts of the parameter space that are a priori unlikely. A response is to consider a set r of prior distributions on E. Let wv= 0, dQ1(z,0) = dPor(z)d7(0), and S {Q- , : 7 E r}. The estimation problem becomes:

ninmax / J(g(0) - d(z))2dP(z)dw(0) (3)

This corresponds to Good's (1952) argument that a minimax solution is reasonable provided that only reasonable subjective distributions are entertained. If F (and hence S) consists of a single distribution, then the solution to equation (3) is the Bayes rule for that prior distribution, and the framework reduces to Bayesian decision theory.

Note that for any prior distribution 7r,

J J(g(0)- d(z))ddPo(z)d(0) < max ((g(0) -d(z)) 2lPo(z)

So the minimax risk value in equation (2) provides an upper bound on equation (3), for any set r of prior distributions. If r consists of all prior distributions on ), including point masses that assign probability one to a single point, then equation (3) reduces to equation (2). Even if the set of all priors is thought to be too big, in that it contains distributions that are not subjectively reasonable, it may still be of interest to calculate the minimax risk value in equation (2). It may be that a Bayes rule d, for a subjectively reasonable prior -r has a maximum risk (over 0 E 0) that is close to the minimax value in equation (2). Then that decision rule is attractive in terms of average risk with respect to the prior 7r and in terms of maximum risk over ). In addition, if the set D is unrestricted, then the posterior risk (with respect to 7r) is minimized by the choice d,(z) for any value of the observation z.

Section 3 develops an algorithm for computing a minimax decision rule. We consider the case

Copyright ( 2000 John Wiley & Sons, Ltd.

626

J. Appl. Ecoin. 15: 625-644 (2000)



in which S is the convex hull of a finite set of distributions. For example, in equation (3), we could have F equal to the convex hull of {{r,... ,7rjr, so that the set of prior distributions consists of mixtures of a finite set of priors. The algorithm is based on Blackwell and Girshick's (1954) minimax theorem for S-games, in which nature has a finite set of pure strategies. The optimal mixed strategy for nature corresponds to a least-favourable distribution, and the minimax rule is a Bayes rule with respect to this least-favourable distribution. This Bayes rule is

based on a subjectively reasonable prior to the extent that 1, ... .1 j are subjectively reasonable priors. The least-favourable distribution is obtained numerically by solving a concave programming problem. We use a sequential quadratic programming algorithm.

The minimax risk value that we obtain provides a lower bound on the minimax value relative to a larger set of distributions. For example, suppose that o in equation (2) contains an open set of RP. We can obtain a lower bound for the minimax value in equation (2) by selecting a finite subset {01,...,OJ} c 9, and applying our algorithm with S equal to the convex hull of {Po,... ,PJ). We can obtain an upper bound on the minimax value in equation (2) by fixing a decision rule and calculating its maximum risk over E). For example, we can calculate the maximum risk over E of the maximum likelihood estimator.

If the set D of feasible decision rules is unrestricted, then a Bayes rule can be obtained by minimizing posterior risk. Section 4 develops this case, and also considers a problem in nonparametric estimation in which there are restrictions on the set of feasible rules.

Section 5 considers an autoregressive, random-effects model for panel data. We focus on estimation of the autoregressive parameter, with mean-square error as the risk function. We obtain a lower bound on the minimax risk in equation (2) by using a finite subset 0{1,... Oj} c E. The maximum risk over 9 of the maximum likelihood estimator provides an upper bound, which turns out to be fairly close to the lower bound.

2. PREFERENCES

Consider an individual making a decision under uncertainty. Suppose that he will observe the value z of a random variable Z before making his choice. The outcome given choice a depends upon a random variable W, whose value w may not be known when the choice is made. Z x W is the range of (Z, W), A is the set of possible choices, and X is the set of outcomes.

Let y denote the set of probability distributions over X with finite support, corresponding to lotteries with prizes in X. The probabilities in these lotteries are exogenously given, as in a roulette lottery. Consider y, and Y2 in y, with the union of their supports equal to {x}j}= l; Yi assigns probabilities {pj}= 1 to these outcomes, and Y2 assigns probabilities {qj}= i. Then for a E (0,1), the mixture ayl + (1 - )y2 e Y assigns probabilities {op + (1 -a)q1} l to these outcomes.

Let C denote the set of mappings from Z x W to y. An element / EE C can be regarded as a lottery in which the prize corresponding to state (of nature), (z,w) is a roulette lottery. I resembles a horse lottery in that the probabilities of the states are not exogenously given. Let LC denote the set of constant functions in C. We shall identify the roulette lotteries y with Cc. If

a E (0,1) and f,g E , then af+ (1 - a)g denotes the horse lottery in C whose prize in state (z,w) is the roulette lottery in y corresponding to the mixture af(z,w) + (1 - a)g(z,w).

Gilboa and Schmeidler (1989) consider a preference relation > over L that satisfies certain axioms. A key axiom is certainty-independence: for allf, g in L and r in Lc and for all a E (0,1),

Copyright © 2000 John Wiley & Sons, Ltd.

627

J. Appl. Econ. 15: 625-644 (2000)


G. CHAMBERLAIN

f > g if and only if af+ (1 - a)r > acg + (1 - a)r. So the horse lotteryfis strictly preferred to the horse lottery g if and only if the (element by element) a-mixture off with any roulette lottery r is strictly preferred to the corresponding mixture of g with r. Gilboa and Schmeidler show that their axioms are equivalent to the existence of an affine function u: Y - 1 R and a non-empty, closed, convex set S of probability measures on Z x W such that: for all f,g E £,

f>-g iff min u ofdQ > min uo gdQ. Qe$ Qe$

If the certainty-independence axiom is strengthened so that it holds not just for the constant functions but for all r in £, then we have the Anscombe and Aumann (1963) version of the Savage (1954) axioms, and the set S consists of a single distribution. A (randomized) decision rule d is a mapping from Z to A*, the set of probability distributions on A with finite support. (We shall identify A with the subset of A* consisting of degenerate distributions). Let D denote the set of feasible decision rules. The mapping: /?: W x A* -+ y determines the outcome distribution as a function of (w,a*). For example, if a* assigns probabilities {pj}= I to the choices {a/}j= I, then h(w,a*) is the roulette lottery that assigns probabilities {pj}= 1 to the outcomes {h(w,aj)}= i. A decision rule d E D corresponds to a horse lottery Id E £: lc(z,w) = h(w,d(z)). The preference relation on C induces a preference relation on the set D of decision rules. Define the risk function as the negative of expected utility:

r(Q, d) - u(ld(z, w))dQ(z, w) xW

Then the decision maker's problem is:

min max r(Q, d) deD QeS

The use of risk, and hence a minimax criterion, is traditional, dating back to Wald (1950). We shall not be explicit about measurability and integrability restrictions. Such issues can be avoided by taking the state space Z x W to be a finite set.

The connection of this framework to the portfolio choice problem is quite direct. Z corresponds to the data available when the portfolio is chosen. W is a vector of future returns on the assets, and Q is the joint distribution for (Z, W). The function h is given in equation (1) (for a c A), and ui is a von Neumann-Morgenstern utility function defined over roulette lotteries with monetary prizes. The decision rule d determines the amount a invested in asset one as a function of the data z. The set of feasible rules D could include all such functions mapping Z into A. (It could also include all randomized rules, mapping Z into A*).

In the estimation problem in equation (3), we set w equal to 0, u(h(0,a))=-(g(0)-a)2, Q,(A x B)= fBPeO(A)dr(0), and S = IQ, : T } E r}. A decision rule d maps the data z into an estimate a. An example of a restriction on the set D of feasible rules is an unbiasedness restriction: D = {d: Z - R : fd(z)dPo (z) = g(0), 0 E 6}.

An alternative approach for working with a set of priors is to calculate the corresponding set of posterior distributions. Some principle would be needed to make a choice based on this set of posterior distributions. Then we would have a decision rule, and we could ask whether it is optimal under a preference relation that satisfies certain axioms.


628

J. Appl. Econ. 15: 625-644 (2000)



3. ALGORITHM

We shall consider a finite set of distributions: {Ql,... ,Q}, and S is the convex hull:

S= 6eQj : < < 1,Z5= 1J (4) '=1 j=l

Consider a zero-sum game in which the decision maker chooses d E D, nature chooses Q E S, and the payoff to the decision maker is -r(Q,d). The minimax (or upper) value of the game is

V = inf sup r(Q, d) dEcD QES

A minimax decision rule do satisfies supQ sr(Q,do) = V. The maxmin (or lower) value of the game is

V = sup inf r(Q, d) Qe,S dED

A least-favourable distribution Qo satisfies inf E D, r(Qo, d) = V. A decision rule dQ is Bayes with respect to the distribution Q if

r(Q, dQ) = inf r(Q d) de D

A decision rule d generates a vector of risk values (r(Ql,d),... ,r(Qj,d)). The risk set consists of all such vectors as d varies over D:

Sr = {(r(Ql,d)..., r(Q, d)) E R7 : d E D}

We can regard the game as being played as follows. The decision maker chooses a point s = (s,... ,sJ) E S,-. Independently of his choice, nature chooses a coordinate j with probability 6j. Blackwell and Girshick (1954, Chapter 2.4) refer to such games, in which nature has a finite number of pure strategies, as S-games. The minimax theorem for S-games states that if the risk set is bounded, then

inf sup r(Q, d) = sup inf r(Q, d) deD QeS QeS dD

and there exists a least favourable distribution Qo. If in addition the risk set is convex and closed, then there exists a minimax decision rule do, and it is Bayes with respect to Qo. We shall assume that the risk set is convex, closed, and bounded. (See Blackwell and Girshick, 1954, Theorem 2.4.2, and Ferguson, 1967, Theorem 1, p. 82. The mixed extension of the game allows the decision maker to use mixed strategies, in which case the risk set is automatically convex since it is the convex hull of S,.; see Blackwell and Girshick, 1954, Theorem 2.4.1.) Let Ej denote the J- 1 dimensional simplex:

sJ= {6 Ec J:6>0, 6j= }

and let Q6 denote the mixture distribution: Q6 = J= SQj. As 6 varies over sE, Q6 varies over S. Note that the risk function is affine in its first argument:


629

J. Appl. Econ. 15: 625-644 (2000)


G. CHAMBERLAIN

J

r(Q6, d) 6jr(Q, d) j=l

Let dE denote the Bayes rule with respect to Q6. Consider the minimized risk:

p(6) min r(Q, d) = r(Q6, ) (5) de'D

Since r(Q6,d) is an affine function of 6 for each d, it follows that p is a concave function. So maximizing p over the convex set Ej is a concave program:

6o = arg maxp(6) (6)

The least favourable distribution is Qo = ZJ= 1oj Qj. The concave program in equation (6) can be solved using a sequential quadratic programming algorithm, as in Wilson (1963). (The routine used in the application in Section 5 is nag_nlp_sol, from the NAG Fortran 90 library; it is based on the subroutine NPSOL described in Gill et al., 1986.) Let do be a Bayes rule with respect to Qo. In order for do to be minimax and Qo to be least

favourable (so 60 solves equation (6)), it is necessary and sufficient that

r(Qk, do) = max r(Qj, do) if Sok > 0, k = 1,...,J (7) 1 <j<J

So assume that equation (7) holds. Then for any decision rule d,

max r(Qj, d) > r(Qo, d) > r(Qo, do) = max r(Qj, do) 1l<j<J 1l<j<J

So do is minimax. For any Q e S,

r(Q, dQ) < r(Q, do) < max r(Q, do) = r(Qo, do) 1 j<J

So Q0 is least favourable. This sufficiency argument does not use the minimax theorem. To see that equation (7) is necessary, assume that do is minimax and Qo is least favourable:

max r(Qy, do) = inf max r(Qy, d) = V 1<_<J deD 1<_j_<J

inf r(Qo,d) = sup inf r(Q,d) = V dcD QES dGD

Since do is a Bayes rule with respect to Qo, the conclusion of the minimax theorem implies that

J

ojr(Qj, do) = r(Qo, do) = inf r(Qo, d) = V = V = max r(Q, do) deD I:!Di7

j=1

So the maximum risk of do equals the average risk under Q0, which implies equation (7). In the numerical algorithm for the concave program, we use a subgradient of p. It is convenient to solve for 6j= 1- _J- 1 Sj and regard p as defined on a subset of TZJ-1: MJ-_ { 6E RJ- : l > 0,Y-~ 6j < 1}. Let (e s- 1 have jth component equal to r(Qj,d)- r(Qj,Id). We will show that (C is a subgradient of p at 8. Note that

p(5) = ((6, }) + r(Qj, d6)


630

J. Appl. Econ. 15: 625-644 (2000)



where (a,b) denotes 1= 1 aibi for a,b E 7k. For any 6' E MJ_ 1,

p(') = min ( j(r(Qj, d) -r(Qj, d)) + r(Q, d)) d/cD j1

j-1

< 6j(r(Qj ) r(Qj, , d6)) + r(Qj, d6) j=l

= (C, 6') + r(Qj, d6) = (, 6) + (C, S - ) + r(Qj, d6)

= p(6) + (C' - )

and so (^ is a subgradient.

3.1 Minimax Bounds

The minimax value r(Qo,do) is with respect to the set S of distributions. If we consider a larger set of distributions S' D S, then

V = inf sup r(Q, d) < inf sup r(Q, d) = V dcD QES deD QGS

So the minimax value relative to S provides a lower bound for the minimax value relative to the larger set S'.

Now fix a decision rule d, and construct an upper bound:

V' < sup r(Q, d) QeS'

This upper bound is useful in that it may be feasible to maximize r(Q,d) over Q c S' for a fixed d, even though it is not feasible to compute the minimax value for S'.

Kempthorne (1987) develops an algorithm for the case in which the least-favourable prior distribution is known to have finite support, but the location of the support points is not known. Suppose, for example, in equation (2) that E is a closed interval on the real line; the risk function is an analytic function of 0 for any decision rule; and the risk of the is eBayes procedure for the least-favourable prior distribution is not constant on E. Then the least-favourable prior distribution has finite support, and the algorithm converges to this distribution. The algorithm constructs a sequence of discrete prior distributions whose successive support sets change by adjusting the locations of the existing support points as well as adding new points to the support. For a given number of support points, the algorithm finds a local maximum of the Bayes risk of the Bayes rule, maximizing over the locations of the support points as well as the probabilities attached to the support points. This is not a concave program. The optimisation is done using an unconstrained maximization procedure with differences to approximate derivatives.

4. THE SET D OF DECISION RULES

A step in our algorithm requires finding the Bayes rule d6, which maximizes the risk r(Q6,d) over the set D of feasible decision rules. Consider first the case in which D is unrestricted.


631

J. Appl. Econl. 15: 625-644 (2000)


G. CHAMBERLAIN

4.1 D Is Unrestricted

In this case, we can obtain the Bayes rule by minimizing posterior risk- see Wald (1950), Chap. 5.1), Blackwell and Girshick (1954, Chap. 7.3), and Ferguson (1967, Chap. 1.8). The distribution Q for (Z, W) can be decomposed into Q,n, which is the marginal distribution for Z, and Qc, which is the conditional distribution for W given Z: Q(A x B)= fAQc(B z)dQ,,(z). Let f be the density of Qy,,, with respect to the measure ut: Qjil(A) = fAfj(z)dU(z) (j= 1,... ,J). Define the loss function as the negative of conditional expected utility given Z = z for the choice a:

L(Q, z,a) - u(h(w, a))dQc(w z) (8)

Then we have

7=1 -j=1 IJ J dJ = _j=1

>z inf ZL(Qj,z,a)fj(z)6j dl(z)

We shall assume that the infimum is in fact obtained for some choice a c A. We can regard 6 as providing prior probabilities on the discrete parameter space {l,...,J}, and calculate the posterior probabilities as

/ J

6 (z) =fjf(z) 6 fk (9) k/ /c=l

Then a Bayes rule d5 for the prior 6 can be obtained by minimizing posterior risk, which equals posterior expected loss:

d(z) = argmnin L(Qj,z,a)6(z) (10) aeA . j=1

Mixture models

Consider a mixture model in which the distribution Q for the vector (Z, W) has the following form:

Q(A x B)= f Po(A x B)d7r(O)

We start with a parameter space E, and {Pg : 0 E)} is a set of distributions for (Z, W). We introduce a family F of prior distributions wF on E. This gives a set {Q, : cE r1} of distributions for (Z, W), in which the prior -F plays the role of a parameter. Consider a finite set of prior distributions: {Trl,... ,rJ}, and F is the convex hull:

lr- {=- : 0<J<16 Z } J= 1 j= 1


632

J. Appl. Econ. 15: 625-644 (2000)



Let Qj = Qj. Then the set S = {Q, : - c F} is the convex hull of {QI,... ,Qj}, as in equation (4), which is the basis for our algorithm in Section 3.

The probability distribution P0 can be decomposed into a marginal distribution Po,7, and a conditional distribution Po : Po(A x B) = fAPo(B z)dPo,,,(z). We shall assume that Po,1, has density f(z 0) with respect to the measure t: Po,,,(A)= fAf(zl)db(z) for all 0c E. Since Qij = fPo,1dri(0), the density fj for Qe,,, is given by

fj(z) - f (z 0)d7(0) (11)

Let irj denote the posterior distribution of 0 conditional on Z:

(B I z) [f(z)] - (fz I 0)d7r6(0)

Then the loss function in equation (8) can be obtained by integrating the loss for P0 with respect to the posterior distribution of 0:

L(Qj, z, a) L(Po, , a)dj(0 z) (12)

These formulas for the marginal density/f and the loss L(Qj,z,a) in equations (11) and (12) are useful for calculating the posterior risk, as in equations (9) and (10). In the portfolio choice problem, Z corresponds to the data available Z nd when the portfolio is

chosen, and W is a vector of future returns on the assets. The specification of the family {P : 0 c v} might be based on a vector autoregression with multivariate normal innovations, and r would be a family of prior distributions for the parameters of the vector autoregression. Barberis (2000) uses such a specification with a ifsingle prior distribution. In the portfolio choice problem, the focus is not on the parameter vector ; the role of the parametric model is to generate a joint distribution for the observables Z and W. Now consider an estimation problem. Here the focus is on a function of the parameter, which

we shall denote by g(0). In this case we set w equal to 0, and the choice a is the estimate of g(0). If g is real valued, the loss function could be L(Po,z,a) = (g(0) - a)2, with mean-square error for the risk function. Or the loss function could have a piecewise linear form:

4.2 p is Restricted

We may be interested in a restricted class of decision rules. Consider, for example, the mixture model, {fEPo dir(O) : -rFc}, where the prior distributions in F are indexed by a parameter T eC R7: r = {T : T c R/1}. For a given value of T, there is a decision rule dT that minimizes posterior risk:

dT(z) = argmin f L(Po,z,a)dTr(O I z) aeA


633

J. Appl. Econ0. 15: 625-644 (2000)


G. CHAMBERLAIN

where i7- is the posterior distribution corresponding to the prior rF-. Suppose that we set Qj = fPo dir,T ( = 1,..., J), and as a step in our minimax algorithm, minimize the risk r(Q,,d) = "ji= I 6jr(Qj,d). The unrestricted Bayes rule d6 in equation (10) minimizes posterior risk for the prior distribution jJ= ltSj-r,. This distribution is not, in general, in F unless F is convex. So if F is not convex, the unrestricted Bayes rule will generally not belong to {d : Tr Rt"}. Or consider setting Qj = Po with {01,... ,sj} c {E. Then the Qj will not generally be in the mixture model unless F includes all point masses on E. Nevertheless, we may want to restrict the set of decision rules to D = {d, : T E R'} because the mixture model with the family r of prior distributions is tractable or familiar. Then we need to solve

min r(Q6, d) TE"RI7

in order to obtain the minimized risk p(S) in equation (5). Another application could arise in non-parametric estimation. Based on asymptotic theory or

other arguments, we may be interested in a class of estimators, such as orthogonal series estimators. Obtaining a particular estimator within the class requires a value for a vector of parameters T, which may govern how much smoothness is imposed on the estimator. Efromovich (1999, Chap. 3) develops a data-driven orthogonal series estimator for a univariate density based on a random sample of size n. The procedure for determining which terms appear in the series requires setting a value for T. For example, terms in the series beyond a certain cutoff point are dropped if a test statistic does not exceed a threshold that depends upon T. Default settings are suggested for T. To evaluate the performance of the estimator, Efromovich constructs eight 'corner' densities that represent features of interest that are expected to occur in practice. Monte Carlo simulation is used with data sets generated according to each of the corner densities. The risk measure is expected integrated squared error. Efromovich notes that the optimal choice of the parameter T involves a tradeoff between better estimation for some of the corner densities and worse estimation for others. Within our framework, a way to make an optimal tradeoff is to let {Qi,... ,Qj be the distributions for the sample corresponding to the corner densities, and choose the parameter T by solving

min max r(Qj,dT) (13)

Let gj () = r(Qj,dT) and suppose that gj : R" -7 R is continuously differentiable with gradient Vgj. Applying our algorithm in Section 3 to equation (13) gives To0 Rj" and 60 c EJ such that

E oVgj(To) = 0 j=l

and

gk(ro) = max g}(ro) if 6Ok > O, k = ,..., jc{1 ...J}'

An alternative algorithm for equation (13) is developed in Polak (1997, Chap. 2.4). For a given value of T, solve the following quadratic program:

=argminm 6([n(r) gj()] +21 I2 V( ) (14) 6EE = j =1 j j= 1 2 ~~~~j=lI


634

J. Appl. Econ. 15: 625-644 (2000)



where I(rT) maxj {i,... ,J.gj(r) and Ilall2= (a,a). Obtain the search direction J

h =- 6Vgj(T) j=1

A new value for r is chosen according to T* = r + Ah, where the scalar A is determined by a step- size algorithm. Then replace T by T* and repeat until the minimum value of the quadratic program in equation (14) is zero. Polak (1997, Chap. 2) develops additional algorithms for equation (13) and provides references to related work.

5. APPLICATION: AUTOREGRESSIVE MODELS FOR PANEL DATA

We will work with the following parametric family:

Zit = 7Zi,t-1 + oai + Uit

ai {Zio zio}1 i-d A(TI + T2Zio, Ov)

Uit I {ofi, Zio = Zio}N i.id.r(O, -a2) (i = 1,..., N; t = 1, .., T) (15)

The parameter vector is ( = (0,b), with 0 = (y,A), 0 = (Ti,T2,a), and A-= a,/a. We obtain a family of distributions {Po : 0 c 9} by specifying a single prior distribution for 10 and integrating 10 out of the model. The prior distribution for I0 is motivated by work in Chamberlain and Hirano (1999) using residuals from regressions of log earnings on education and age in the Panel Study of Income Dynamics. It specifies that l/r2,-X2(10)/0.9, so that the 0.1 and 0.9 quantiles for ar are 0.24 and 0.43. Conditional on a, the components of (T1,T2) are independent normals with variances proportional to a2. The mean of Tr is 0, the mean of 72 is 0.25, and the standard deviations of 1r and r2 in the (unconditional) t-distribution are 0.20.

We obtain the same P0 family if we start out with a = (al,... ,aN) as part of the parameter vector: ( = (O,,ao). The random effects model in equation (15) provides a prior distribution for a given (0,O). Combining this with our prior distribution for 0 gives the Po distribution.

The observation is Z (Z1, ... ,ZT, ... ZN,...,Z NT)'. The Po distribution for Z is conditional on {zio}N= i, which is observed. The values for {zio}Y= 1 in our risk calculations are obtained by drawing from a normal distribution with mean 0 and standard deviation 0.45; these values for zio are then kept fixed in evaluating risk. The density (for A > 0) is

f(z I 0) = c(z) det(H)1/2 det(H)-/2(m'Hm -m'Hm + z*'z* + b2)-(N )/2

where 1/a2- Gamma (bl/2,2,/b2), (Trl,2) 1 r^Af(ml,ur2B- 1),

H= -2 ) (mM ) H= XX+ H m = H- (Xz* + Hm) 0 A IN o -0

X= (R IN IT), IT is a Tx 1 vector of ones, z* = (zl,... ,zt,... ,zN1,.. ,ZNT) with zit zit-7yzi,t-l, R is a NT x 2 matrix with the row corresponding to z*t equal to (1 zio), and c(z) is some function of z that does not depend upon 0. It is useful to simplify f(z 0), since it is evaluated repeatedly for different values of 0. The computational simplification is similar to the one described in Chamberlain (2000).


635

J. Appl. Econ. 15: 625-644 (2000)


G. CHAMBERLAIN

We will focus on the estimation of y, using a squared-error loss function: L(Po,z,a) = (y - )2. We set W= 0, since 0 combined with a choice (an estimate) determines the utility-relevant outcome. We shall work with a finite subset {01,... ,OJ} of E, with Qj= Po and S equal to the convex hull of {Q1,...,Q(}. (This corresponds to a mixture model in which the prior distribution wrj assigns unit mass to the point Oj.) We let the set of decision rules D be unrestricted, so that the Bayes rule da is the posterior mean of y:

d6(z) = 7(j)f(z | 0j)j/ f/(z | Oj)Sj (16) j=1 ./=1

where y(0) denotes the first component of 0. This follows from equations (9) and (10), with fj.(z) =f(z I Oj) (as in equation (11) with Tri assigning unit mass to the point Oj). The risk under Q6 for an estimator d is

J

r(Q, d)= Z jr(Poe, d) (17) j=l

where r(Po,d)= fz[y(O) - d(z)]2f(z I 0)dz. We can approximate r (Po,d) by Monte Carlo simulation. Obtain independent and identically

distributed draws {Z(O,k)}k= 1 from f( 10). Then we have

r(Po, d) --f [y(0) -d(Z(, k))]2 (18) k=l

We use the same set of pseudo-random numbers to construct {Z(O,k)}= 1 for each 0. This ensures that the simulated r(O,d) varies smoothly in 0. Then we calculate p() = r(Q6,d) from equations (16) and (17). A numerical optimisation routine is used for the constrained maximization of p over the J- 1 dimensional simplex. (The routine is nag_nlp_sol, from the NAG Fortran 90 library). The maximizing value 60 gives the least-favourable prior, and p(60) is the minimax value for risk, relative to the set {01,... ,OJ}.

Consider the case N = 100 and T = 2. Preliminary work indicated that most of the mass in the least-favourable prior is concentrated in the rectangle with 0 < 7 < 1.4 and 0 < A < 1.4. Then I set up a grid with 15 values for 7: 0,0.1,... ,1.4, and 8 values for A: 10-4,0.2,0.4,... ,1.4, giving J= 120 values for 0. The solution to the concave program gives a minimax value for root mean- square error (MSE) of p(6o)/2 = 0.115. (The Monte Carlo simulation uses K= 8000 samples.) The least-favourable prior assigns 0 probability to almost 60% of the points: j= 0 for 69 points, 0 < 6Sj < 10-6 for 5 points, and S0j > 10 -6 for 46 points. The maximum root MSE of the minimax estimator over the 120 points is equal to the minimax value: maxjr(Ps,do)12 = 0.115. The root MSE is equalized at 0.115 across the 46 points that are assigned probability greater than 10-6. (The variation is in the fourth decimal place, between 0.1149 and 0.1151.) So the solution satisfies the necessary and sufficient condition in equation (7) very well.

The upper panels of Figure 1 show the root MSE at the 15 values of y for A = 10-4, 0.6, and 1.2. The lower panels show the least-favourable prior probabilities, 60j. We see that the root MSE is equalized at the minimax value of 0.115 for the points that receive positive probability. The support of the conditional distribution of 7 given A is fairly concentrated, and it shifts to the left as A increases. This negative correlation between y and A under 60 is visible in Figure 2,


636

J. Appl. Econ. 15: 625-644 (2000)



X = .0001

- o0oooo0oooo e

= .6

' \~~ 0.12

0.1

0.08

0.06 ·

0.12 <

X= 1.2

>0000 ooooo

0.1 -

0.08

0.06

0.04 *0.04 0.04 0 0.5 1 1.5 0 0.5 1 1.5 0 0.5 1 1.5

Y Y Y

= .6

0.15

X= 1.2

0.1-

0.05

1.5

Y

0 0.5 1 1.5

7

Figure 1. Root mean-square error for minimax estimator (upper panel): r(O,do)112; the dashed line indicates the minimax value of 0.115. Least-favourable prior probabilities (lower panel): 60. Evaluated at

0 = (y,AX) C {0,....,} fory = {0,.1,...,1.4} and for A as shown. N= 100 and T= 2

which shows the joint distribution. The negative correlation is also visible in Figure 3, which shows equal probability contours of the joint distribution. The support of the joint distribution is fairly concentrated along a negatively sloped diagonal. We may have a precise estimate of a positive covariance between Zi2 and Zil, but different values for the (7y,A) pair can imply the same covariance, with 7y decreasing as A increases. So knowing that (7y,A) is distributed along a negatively sloped diagonal in the first quadrant does not help in the difficult choice of a point on that diagonal. (Figs. 2 and 3 use bicubic spline interpolation to the probability values at the 120 points, with negative values in the interpolating spline set to 0.)

Consider the minimax value corresponding to the entire parameter space:

Vo = inf sup r(, d) d o0E

The minimax value relative to {01,... ,0j} provides a lower bound on Ve. We can obtain an


0.12

0.1

w LL

2 0.08 o

0.06

= .0001

0.15

Q

o- a 0.1 0

0.05

0a

4(I, .0 Cu

637

J. Appl. Econ. 15: 625-644 (2000)


G. CHAMBERLAIN

x 10-4 .-

· .

4.~~~~~~~~~~~~~~~ .,.,.

·

4..

.i . .

,,.,.,...

: :

9.

8.

7.

6.

I 4 3

C"4

,

01

1.5 1.5

1.4

0.8 0.6

x 0 o y

Figure 2. Least-favourable prior probabilities So; interpolation. N= 100 and T= 2

upper bound by calculating the maximum risk over e of the maximum likelihood (ML) estimator, which solves maxs E f (z I 0). The risk at a given 0 is calculated using equation (18), where now 0 is not restricted to the finite set of J values. The risk function for the ML estimator

is smooth and unimodal. With = 7Z x [10 4,oo), the maximum value for root MSE is 0.132, which is attained at (y,A) = (0.903, 0.007). So we can bound the minimax value for the whole parameter space between 0.115 and 0.132. The maximum likelihood estimator is attractive in terms of risk, with a maximum root MSE that is quite close to the lower bound.

Now consider using the minimax estimator do based on {01,... ,OJ} to provide an upper bound on VQ. First we shall examine the maximum risk of do for (y,A) in the rectangle [0,1.4] x [10-4,1.4]. This will indicate how well the minimax value for an infinite, bounded set can be approximated by the minimax value for a finite set. The maximum risk is obtained using a grid search combined with a numerical optimisation routine. The grid has 0.01 increments for 7y and 0.05 increments for A. The maximum value for root MSE is 0.117, which is attained at (y,A)= (0.913, 0.712). So the minimax value for the rectangle is very tightly bounded between 0.115 and 0.117.

Figure 4 compares the risk functions for the ML and minimax estimators, for 0 < y < 1.4 and


638

'

..

.

..

rc Wn· ·

k-

''

''

*.

.

. .

*,:

. .

.

. .

. .

fA;A:

>

I

.I

J. Appl. Econ. 15: 625-644 (2000)



0 0.2 0.4 0.6 0.8 1 1.2

Figure 3. Equal probability contours of the least-favourable prior 60; interpolation. The heights of the contours are at equal probability increments: A, 2A, 3A,... N= 100 and T= 2

A = 10 -4, 0.4, 0.8, and 1.2. The risk of the minimax estimator is evaluated at 0.01 increments for 7y. The minimax estimator do based on the 120 points is close to being a minimax estimator for the rectangle. However, for values of A greater than 1.4, there are points where the root MSE for do exceeds the maximum value for ML of 0.132. This is so even though adding such points to {01,... ,j} has very little effect on the minimax value, which remains close to 0.115.

5.1 Local Parameter Space: A Connection with Asymptotic Theory

A key idea in asymptotic statistics, due to Le Cam (1986), is that a sequence of statistical experiments (or models) can be approximated by a limit experiment. The observation Z has i.i.d. components (Z1,... ,Zn). The joint distribution of Z is P" for some value of the parameter 0 in the parameter space 9, which is an open subset of RP. The approximation involves a local parameter h = Jn (0 - 00), where 00 is a fixed point in the interior of 9 and is regarded as known. Under certain conditions, for large n the experiments


639

J. Appl. Econ0. 15: 625-644 (2000)


G. CHAMBERLAIN

0.14

0.12 -

) 0.1-

20.08-

0.06 -

1.5 0

~~~~~~~~

0.14

0.12 -

u 0.1-

0

- 0.08-

0.06 -

0.04 1.5 0

0.5

y

0.5 1

7

Figure 4. Root mean-square error for minimax estimator (solid line): r(O,do)l 2, evaluated for y C [0,1.4] in 0.01 increments, and for A as shown; the dashed line indicates the minimax value of 0.115. Root mean- square error for maximum likelihood estimator (dash dot line), evaluated for y E [0,1.4] in 0.05 increments,

and for A as shown. N= 100 and T= 2

{Po h C RP} and {A(h, ): h C RP}

have similar statistical properties. Here the limit experiment consists of observing a single observation from a normal distribution with mean h and variance matrix equal to the inverse of the Fisher information matrix. See van der Vaart (1998, Chap. 7) for an exposition. He observes that (p. 97): 'A motivation for studying a local approximation is that, usually, asymptotically, the "true" parameter can be known with unlimited precision. The true statistical difficulty is therefore determined by the nature of the measures Po in a small neighbourhood of the true value. In the present situation "small" turns out to be "of size O(l1/ n)".' One version of optimality in this framework is provided by the local asymptotic minimax theorem. It gives a lower bound for the maximum risk over a small neighbourhood of 00 (van der Vaart, Chap. 8.7).

This suggests examining the maximum risk of an estimator, such as maximum likelihood, not


640

= .0001

0.14

0.12

U 0.1 c°

2 0.08

\

j ' - - K , \ -r

X =.4 I I

.1-

0.06-

0.04 0

0.14

t~ -i r,

0.5

= .8 1

U. I

1

0.1 w LU o

0 0 t_

1.5

X= 1.2

0.08 -

0.06

0 .04 0. O 0.5 1.5

11/1A

1

Y

A.

1

J. Appl. Econz. 15: 625-644 (2000)


ECONOMETRIC APPLICATIONS OF MAXMIN EXPECTED UTILITY 641

over all of E but over a neighbourhood around some point, such as the maximum likelihood estimate. Likewise, we can consider the minimax value and minimax estimator corresponding to this neighbourhood.

For an example, I will use data on log earnings residuals for a sample of high school graduates from the PSID. The data are described in Chamberlain and Hirano (1999) and have N= 100, T= 9. (As above, we condition on the t = 0 observation.) The maximum likelihood estimates are 7ML= 0.426 and AML= 0.508. Profile likelihood intervals at an approximate 0.99 confidence level are [0.327, 0.532] for -y and [0.329, 0.708] for A. I will set the local parameter space equal to the rectangle [0.32, 0.54] x [0.32, 0.71].

For the minimax analysis over this rectangle, I set up a grid with 10 equally spaced values for y from 0.32 to 0.54, and 7 equally spaced values for A from 0.32 to 0.71. The solution to the concave program gives a minimax value for root MSE of 0.033. (The Monte Carlo simulation uses K= 8000 samples.) The least-favourable prior assigns 0 probability to 70% of the points: 60= 0 for 49 points, <0 60J< 10-6 for 2 points, and SoJ> 10-6 for 19 points. The maximum root MSE of the discrete minimax estimator over the 70 points is equal to the minimax value 0.033. The root MSE is equalized at 0.033 across the 19 points that are assigned probability greater than 10-6. (The variation is in the fourth decimal place, between 0.0327 and 0.0328.) So the solution satisfies condition (7) very well.

Figure 5 shows the least-favourable prior probabilities 60 and Figure 6 equal probability

°-4 ^ ^ ^ --^ ^ 0.. x 10-4

3· . . ..

0. 0.3 0. 2.5- u -.

0.8 A9~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ Y

Figure 5 Leastfavourable prio. probabilities , intepolation N 100 and

0.8 0.3

o.F u . estfvual pirpoailte o ntroain 0.55

Copyright ( 2000 John Wiley & Sons, Ltd. J. Appl. Econ. 15: 625-644 (2000)


G. CHAMBERLAIN

0.35 0.4 0.45 0.5

Y

Figure 6. Equal probability contours of the least-favourable prior 60; interpolation. The heights of the contours are at equal probability increments: A, 2A, 3A,... N= 100 and T= 9

contours for the least-favourable prior. The support of the distribution is quite concentrated along a negatively sloped diagonal. Figure 7 compares the risk functions for the ML and discrete minimax estimators, for 0.32 < 7 < 0.54 and A = 0.32, 0.45, 0.58, and 0.71. The risk of the minimax estimator is evaluated at 0.01 increments for y.

The risk of the ML estimator is fairly constant over the rectangle. The maximum root MSE is 0.039, which is attained at (,A) = (0.540, 0.320). The minimum root MSE over the rectangle is 0.035. Now consider the minimax estimator do based on {01,...,0O}. The maximum risk of do over the rectangle is obtained using a grid search combined with a numerical optimisation routine. The grid has 0.01 increments for 7 and 0.01 increments for A. The maximum value for root MSE of do is 0.033, which is attained at (7,A) = (0.442, 0.669). So the minimax value for the rectangle is 0.033, and the minimax estimator do based on the J= 70 points is close to being a minimax estimator for the rectangle.

If one were given the rectangle as the parameter space, then the minimal value for maximum root MSE over this rectangle is 0.033, and there is a minimax estimator based on 70 points that essentially attains this value. The maximum likelihood estimator has root MSE that varies over


642

J. Appl. Econ. 15: 625-644 (2000)



0.(

w (I)

0 o

2;

= .45

0.035 -

0.03

0.35 0.4 0.45 0.5 0.55 0.025

0.3 0.35 0.4 0.45 0.5 0.55

Y

X,= .58 I I

0.035 w o')

.- 0

0.03

0.35 0.4 0.45

7

0.5 0.55

Figure 7. Root mean-square error for minimax estimator (solid line): r(O,do)12, evaluated for y C [0.32,0.54] in 0.01 increments, and for A as shown; the dashed line indicates the minimax value of 0.033. Root mean- square error for maximum likelihood estimator (dash dot line), evaluated for ey [0.32,0.54] in 0.01

increments, and for A as shown. N= 100 and T= 9

the rectangle between 0.035 and 0.039. So the ML estimator is dominated over this local parameter space, but not by much. There is not much room here for improving on the risk of the ML estimator. In other applications, however, it might be of interest to evaluate the risk of the two-step estimator that applies maximum likelihood in the first step and then applies the minimax procedure over a neighbourhood around the maximum likelihood estimate.

6. CONCLUSION

We have developed an algorithm for calculating minimax decision rules with respect to the convex hull of a finite set of distributions. The minimax rule is a Bayes rule for the least- favourable distribution in the convex hull. The corresponding minimax value provides a lower bound on the minimax risk with respect to a larger set of distributions.


0.04

0.035

0.03

w iU 0O

o

0.025 0.3

0.04

0.035 - w co

0.03 -

0.( 0.3

0.025 0.3 0.35 0.4 0.45 0.5 0.55

Y

· · · 34 ..

I I I I

I1 nA

nljr. ... i//-1 ;.

643

I I

.,

- -

V.VUl

- - - - - - -

^A%--

J. Appl. Econ. 15: 625-644 (2000)


G. CHAMBERLAIN

In our application, we compared the maximum risk of the maximum likelihood estimator with our lower bound. The comparison indicates that there is not a great deal to be gained from an alternative estimator in terms of reducing maximum risk. The application uses a set of point masses on the parameter space G. In ther applications, it may be useful to work with a set of non-degenerate prior distributions. Our algorithm can be used to find the least-favourable mixture of these prior distributions. This least-favourable mixture will be subjectively reasonable if each prior in the set is subjectively reasonable. The Bayes estimator for this least-favourable prior will minimize the maximum risk over the set of priors. We can calculate the maximum risk of this Bayes estimator over all of a, and check whether it has lower maximum risk than an alternative such as the maximum likelihood estimator. We can also

compare the maximum risk over g of this Bayes estimator with the lower bound based on the set of priors. If the maximum risk is close to the lower bound, then this Bayes estimator is attractive in terms of average risk with respect to the least-favourable prior and with respect to maximum risk over e.

ACKNOWLEDGEMENTS

I am grateful to Moshe Buchinsky, John Geweke, Jinyong Hahn, Keisuke Hirano, Guido Imbens, Frank Kleibergen, Peter Klibanoff, Thomas Knox, Charles Manski, Ariel Pakes, Benjamin Polak, Jack Porter, and John Rust for discussions. Financial support was provided by the National Science Foundation.

REFERENCES

Anscombe, F. and R. Aumann (1963), 'A definition of subjective probability', Annals of Mathematical Statistics, 34, 199-205.

Barberis, N. (2000), 'Investing for the long run when returns are predictable', Jozrnal of Finance, 55, 225-264.

Blackwell, D. and M. Girshick (1954), Theory of Games and Statistical Decisions, John Wiley, New York. Chamberlain, G. (2000), 'Econometrics and decision theory', Journal of Econometrics, 95, 255-283. Chamberlain, G. and K. Hirano (1999), 'Predictive distributions based on longitudinal earnings data',

Annales d'Economie et de Statistique, 55-56, 211-242. Efromovich, S. (1999), Nonparametric Curve Estimation: Methods, Theory, and Applications, Springer-

Verlag, New York. Ferguson, T. (1967), Mathematical Statistics. A Decision Thzeoretic Approach, Academic Press, New York. Gilboa, I. and D. Schmeidler (1989), 'Maxmin expected utility with non-unique prior', Journal of

Mathematical Economics, 18, 141-153. Gill, P., W. Murray, M. Saunders and M. Wright (1986), 'User's guide for NPSOL', Version 4.0, Report

SOL 86-2, Department of Operations Research, Stanford University. Good, I. J. (1952), 'Rational decisions', Journal of the Royal Statistical Society, Series B, 14, 107-114. Kempthorne, P. (1987), 'Numerical specification of discrete least favourable prior distributions', SIAM J.

Sci. Stat. Comput., 8, 171-184. Le Cam, L. (1986), Asymptotic Methods in Statistical Decision Theory, Springer-Verlag, New York. Polak, E. (1997), Optimization: Algorithms and Consistent Approximations, Springer Verlag, New York. Savage, L. J. (1954), The Foundations of Statistics, John Wiley, New York. Van der Vaart, A. W. (1998), Asymptotic Statistics, Cambridge University Press, Cambridge. Wald, A. (1950), Statistical Decision Functions, John Wiley, New York. Wilson, R. B. (1963), A Simplicial Algorithm for Concave Programming, PhD dissertation, Harvard

University, Cambridge, MA.


644

J. Appl. Econ. 15: 625-644 (2000)


econometric applications of maxmin expected utility

Documents