ON PROFILE LIKELIHOOD

BY S.A. MURPHY AND A.W. VAN DER VAART

University of Michigan and Free University Amsterdam

We show that semiparametric profile likelihoods, where the nuisance parameter has been profiled out, behave like ordinary likelihoods in that they have a quadratic expansion. In this expansion the score function and the Fisher information are replaced by the efficient score function and efficient Fisher information, respectively. The expansion may be used, among others, to prove the asymptotic normality of the maximum likelihood estimator, to derive the asymptotic chi-squared distribution of the log likelihood ratio statistic, and to prove the consistency of the observed information as an estimator of the inverse of the asymptotic variance.

Research partially supported by NIDA grant A P50 DA 10075.
MSC 1991 subject classifications: 62G15, 62G20, 62F25.
Keywords and phrases: Least favorable submodel, maximum likelihood, standard error, nuisance parameter, semiparametric model, likelihood ratio statistic.

1. Introduction

A likelihood function for a low-dimensional parameter can be conveniently visualized by its graph. If the likelihood is smooth, then this is roughly, at least locally, a reversed parabola with its top at the maximum likelihood estimator. The steepness of the graph at the maximum likelihood estimator is known as the "observed information" and provides an estimate for the inverse of the variance of the maximum likelihood estimator: steep likelihoods yield accurate estimates.

The use of the likelihood function in this fashion is not possible for higher-dimensional parameters, and fails particularly for semiparametric models. For example, in semiparametric models the observed information, if it exists, would at best be an infinite-dimensional operator. Frequently, this problem is overcome by using a profile likelihood rather than a full likelihood. If the parameter is partitioned as $(\theta, \eta)$, with $\theta$ a low-dimensional parameter of interest and $\eta$ a higher-dimensional nuisance parameter, and $l_n(\theta, \eta)$ the full likelihood, then the profile likelihood for $\theta$ is defined as

(1.1)    $pl_n(\theta) = \sup_{\eta} l_n(\theta, \eta).$

The profile likelihood may be used to a considerable extent as a full likelihood for $\theta$. First, the maximum likelihood estimator for $\theta$, the first component of the pair $(\hat\theta_n, \hat\eta_n)$ that maximizes $l_n(\theta, \eta)$, is the maximizer of the profile likelihood function $\theta \mapsto pl_n(\theta)$.
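
To make the definition (1.1) concrete, the following is a minimal numerical sketch of profiling out a nuisance parameter; the normal-location model with unknown variance, the grid, and all variable names are illustrative assumptions, not part of the paper.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_lik(theta, eta, x):
    # Full log likelihood l_n(theta, eta) for an N(theta, eta) sample; eta > 0 is the nuisance variance.
    return -0.5 * np.sum(np.log(2 * np.pi * eta) + (x - theta) ** 2 / eta)

def profile_log_lik(theta, x):
    # pl_n(theta) = sup_eta l_n(theta, eta), computed by numerical maximization over eta.
    res = minimize_scalar(lambda eta: -log_lik(theta, eta, x),
                          bounds=(1e-6, 1e3), method="bounded")
    return -res.fun

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=200)
thetas = np.linspace(0.0, 2.0, 41)
pl = np.array([profile_log_lik(t, x) for t in thetas])
theta_hat = thetas[np.argmax(pl)]   # maximizer of the profile likelihood
```

Maximizing $pl_n$ over the grid recovers (approximately) the first component of the joint maximum likelihood estimator, which is the "supremum in two steps" argument made next.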

This is just "taking the supremum in two steps". Second, the likelihood ratio statistic for testing the (composite) hypothesis $H_0: \theta = \theta_0$ can be expressed in the profile likelihood function as

$\frac{pl_n(\hat\theta_n)}{pl_n(\theta_0)}.$

Finally, it is customary to use the steepness of the profile likelihood function as an estimate of the variability of $\hat\theta_n$. For a Euclidean parameter this is justified by Patefield (1977), who shows that in parametric models the inverse of the observed profile information is equal to the $\theta$-aspect of the full observed inverse information. A further discussion in the parametric context is given by Barndorff-Nielsen and Cox (1994, p. 90).

Thus, it appears that the profile likelihood can be used and visualized in the same fashion as an ordinary parametric likelihood. This may be considered obvious enough to recommend their use, for computational and inferential purposes alike. However, the usual theoretical justification for use of a (profile) likelihood as an inferential tool, based on a Taylor expansion, fails for semiparametric models for a variety of reasons. First, the maximum likelihood estimator $\hat\eta$ for the nuisance parameter may converge at a slower rate than the usual $\sqrt{n}$-rate. Second, semiparametric likelihoods may fail to be differentiable in the nuisance parameter in a suitable sense, in particular when such a likelihood contains "empirical" terms (in the sense of Owen (1988)). Third, handling the remainder terms in some sort of a Taylor expansion is impossible under the classical Cramér conditions, by the presence of an infinite-dimensional nuisance parameter.

The purpose of this paper is to give a general justification for using a semiparametric profile likelihood function as an inferential tool. For this we use empirical processes to reduce the infinite-dimensional problem to a problem involving the finite-dimensional "least favorable submodel."

Some of the issues can be illustrated by the well-known Cox model for survival analysis. Consider the simplest version, without censoring. In this model the hazard function of the survival time $T$ of a subject with covariate $Z$ is given by

$\lambda_{T|Z}(t) = e^{\beta Z}\lambda(t),$

for a linear function $z \mapsto \beta z$ (taken one-dimensional for simplicity) and an unspecified baseline hazard function $\lambda$. The density of the observation $(T, Z)$ is equal to

$e^{\beta z}\lambda(t)\,e^{-e^{\beta z}\Lambda(t)},$

where $\Lambda$ is the cumulative hazard (with $\Lambda(0) = 0$). The usual estimator for $(\beta, \Lambda)$ based on a sample of size $n$ from this model is the maximum likelihood estimator $(\hat\beta, \hat\Lambda)$, where the likelihood (Andersen et al., 1993) is defined as, with $\Lambda\{t\}$ the jump of $\Lambda$ at $t$,

$l_n(\beta, \Lambda) = \prod_{i=1}^n e^{\beta z_i}\Lambda\{t_i\}\,e^{-e^{\beta z_i}\Lambda(t_i)}.$

The form of this likelihood forces $\hat\Lambda$ to be a jump function with jumps at the observed deaths $t_i$ only, and the likelihood depends smoothly on the unknowns $\bigl(\Lambda\{t_1\}, \ldots, \Lambda\{t_n\}\bigr)$. However, the usual Taylor expansion scheme does not apply to this likelihood, because the dimension of this vector converges to infinity with $n$, and it is unclear which other device could be used to "differentiate" the likelihood with respect to $\Lambda$. The usual solution is to "profile out" the nuisance parameter. Elementary calculus shows that, for a fixed $\beta$, the function

$(\lambda_1, \ldots, \lambda_n) \mapsto \prod_{i=1}^n e^{\beta z_i}\lambda_i\,e^{-e^{\beta z_i}\sum_{j: t_j \le t_i}\lambda_j}$

is maximal for

$\frac{1}{\lambda_k} = \sum_{i: t_i \ge t_k} e^{\beta z_i}.$

Thus the profile likelihood is given by

$pl_n(\beta) = \sup_{\Lambda} l_n(\beta, \Lambda) = \prod_{k=1}^n \frac{e^{\beta z_k}}{\sum_{i: t_i \ge t_k} e^{\beta z_i}}\,e^{-1}.$

The latter expression is the Cox partial likelihood. Thus the Cox partial likelihood estimator is a maximum likelihood estimator, and a maximum profile likelihood estimator for $\beta$.

The Cox estimator was shown to be asymptotically normal by Tsiatis (1981) and Andersen and Gill (1982). This was not a trivial undertaking, because in form the profile likelihood does not resemble an ordinary likelihood at all. Even though it is a product, the terms in the product are heavily dependent. Tsiatis (1981) and Andersen and Gill (1982) show how this particular profile likelihood can nevertheless be elegantly handled using counting processes and martingales.

The Cox profile likelihood has the great advantage that it can be found in closed form by analytic methods. This has been the basis for finding the properties of the Cox estimator, but it turns out to be special to this model. In most other examples of interest the profile likelihood is, and remains, a supremum, which can be calculated numerically, but whose statistical properties must be deduced in a more implicit manner than those of the Cox profile likelihood. In fact, it is not even clear that a general profile likelihood is differentiable, a supremum of differentiable functions not necessarily being differentiable itself. Thus we need to attack the problem in another way. In this paper we use a sandwiching device based on least favorable submodels.

The Cox likelihood as given previously can be termed an "empirical likelihood", as it contains the point masses $\Lambda\{t_i\}$. (The terms in the product approximate $P_{\beta,\Lambda}(T = t_i\mid Z = z_i)$.) Several other slightly different choices would have been natural also, although they lead to slightly different estimators. See Andersen et al. (1993, pg. 221-226) for a discussion of this issue in the context of nonparametric estimation of the cumulative hazard in survival analysis.
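
The following is a minimal sketch of evaluating and maximizing the Cox profile (partial) likelihood derived above, assuming no ties among the observed deaths; the simulated data and all names are illustrative, not from the paper.

```python
import numpy as np

def cox_log_profile_lik(beta, times, z):
    # Log profile likelihood pl_n(beta) for the uncensored Cox model: the log Cox
    # partial likelihood, up to the additive constant -n from the e^{-1} factors.
    order = np.argsort(times)
    t, zz = times[order], z[order]
    value = 0.0
    for k in range(len(t)):
        risk = np.exp(beta * zz[k:])          # e^{beta z_i} over the risk set {i: t_i >= t_k}
        value += beta * zz[k] - np.log(risk.sum())
    return value

# Toy data with true beta = 0.5; a grid search locates the maximum profile likelihood estimator.
rng = np.random.default_rng(1)
z = rng.normal(size=100)
times = rng.exponential(scale=np.exp(-0.5 * z))
betas = np.linspace(-1.0, 2.0, 61)
beta_hat = betas[np.argmax([cox_log_profile_lik(b, times, z) for b in betas])]
```

In this example the supremum over $\Lambda$ is available in closed form; in the other examples of the paper the profile likelihood must itself be computed by numerical maximization.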

In general, there is no unique method to define a likelihood for a semiparametric model, and the semiparametric likelihoods proposed in the literature take a variety of forms: for instance, ordinary densities, empirical likelihoods and mixtures of these two. In this paper we consider arbitrary "likelihoods" based on a random sample of $n$ observations $X_1, \ldots, X_n$ from a distribution depending on two parameters $\theta$ and $\eta$. We write the "likelihood" for one observation as $l(\theta, \eta)(x_i)$, whence the full likelihood $l_n(\theta, \eta)$ is the product of this expression over $i$. Of course, the results of the paper may not be valid if the choice of likelihood does not satisfy the conditions made below.

The main result of this paper is the following asymptotic expansion of the profile likelihood. Under conditions, listed in Section 3, we prove that, for any random sequence $\tilde\theta_n \xrightarrow{P} \theta_0$,

(1.2)    $\log pl_n(\tilde\theta_n) = \log pl_n(\theta_0) + (\tilde\theta_n - \theta_0)^T \sum_{i=1}^n \tilde\ell_0(X_i) - \tfrac12\,n\,(\tilde\theta_n - \theta_0)^T \tilde I_0 (\tilde\theta_n - \theta_0) + o_{P_{\theta_0,\eta_0}}\bigl(\sqrt{n}\,\|\tilde\theta_n - \theta_0\| + 1\bigr)^2.$

Here $\tilde\ell_0$ is the efficient score function for $\theta$, the ordinary score function minus its orthogonal projection onto the closed linear span of the score functions for the nuisance parameter. Furthermore, $\tilde I_0$ is its covariance matrix, the efficient Fisher information matrix. Under similar conditions, the maximum likelihood estimator is asymptotically normal, and has the asymptotic expansion

(1.3)    $\sqrt{n}\,(\hat\theta_n - \theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \tilde I_0^{-1}\tilde\ell_0(X_i) + o_{P_{\theta_0,\eta_0}}(1).$

Taking this into account we see that the parabolic approximation to the log profile likelihood given by equation (1.2) is centered, to the first order, at $\hat\theta_n$. In other words, it is possible to expand the log profile likelihood function around $\hat\theta_n$, in the form

(1.4)    $\log pl_n(\tilde\theta_n) = \log pl_n(\hat\theta_n) - \tfrac12\,n\,(\tilde\theta_n - \hat\theta_n)^T \tilde I_0 (\tilde\theta_n - \hat\theta_n) + o_{P_{\theta_0,\eta_0}}\bigl(\sqrt{n}\,\|\tilde\theta_n - \theta_0\| + 1\bigr)^2.$

The asymptotic expansions (1.2) and (1.4) justify using a semiparametric profile likelihood as an ordinary likelihood, at least asymptotically. In particular, we present three corollaries. (These also show that (1.3)-(1.4) are a consequence of (1.2).) We assume throughout the paper that the true parameter $\theta_0$ is interior to the parameter set.

COROLLARY 1.1. If (1.2) holds, $\tilde I_0$ is invertible, and $\hat\theta_n$ is consistent, then (1.3)-(1.4) hold. In particular, the maximum likelihood estimator is asymptotically normal with mean zero and covariance matrix the inverse of $\tilde I_0$.

Proof. Relation (1.4) follows from (1.2)-(1.3) and some algebra. We shall derive (1.3) from (1.2). Set $\Delta_n = n^{-1/2}\sum_{i=1}^n \tilde\ell_0(X_i)$ and $h = \sqrt{n}\,(\hat\theta_n - \theta_0)$.

Applying (1.2) with the choices $\tilde\theta_n = \hat\theta_n$ and $\tilde\theta_n = \theta_0 + n^{-1/2}\tilde I_0^{-1}\Delta_n$, we find

$\log pl_n(\hat\theta_n) = \log pl_n(\theta_0) + h^T\Delta_n - \tfrac12\,h^T\tilde I_0 h + o_P\bigl(\|h\| + 1\bigr)^2,$
$\log pl_n(\theta_0 + n^{-1/2}\tilde I_0^{-1}\Delta_n) = \log pl_n(\theta_0) + \Delta_n^T\tilde I_0^{-1}\Delta_n - \tfrac12\,\Delta_n^T\tilde I_0^{-1}\Delta_n + o_P(1).$

By the definition of $\hat\theta_n$, the expression on the left (and hence on the right) in the first equation is larger than the expression on the left in the second equation. It follows that

$h^T\Delta_n - \tfrac12\,h^T\tilde I_0 h - \tfrac12\,\Delta_n^T\tilde I_0^{-1}\Delta_n \ge -o_P\bigl(\|h\| + 1\bigr)^2.$

The left side of this inequality is equal to

$-\tfrac12\,(h - \tilde I_0^{-1}\Delta_n)^T\tilde I_0(h - \tilde I_0^{-1}\Delta_n) \le -c\,\|h - \tilde I_0^{-1}\Delta_n\|^2,$

for a positive constant $c$, by the nonsingularity of $\tilde I_0$. Conclude that

$\|h - \tilde I_0^{-1}\Delta_n\| = o_P\bigl(\|h\| + 1\bigr).$

This implies first that $\|h\| = O_P(1)$, and next, by reinsertion, that $\|h - \tilde I_0^{-1}\Delta_n\| = o_P(1)$. This completes the proof of (1.3).

The second corollary concerns the likelihood ratio statistic, and shows that this behaves as it should. The corollary justifies using the set

$\Bigl\{\theta:\ 2\log\frac{pl_n(\hat\theta_n)}{pl_n(\theta)} \le \chi^2_{d,1-\alpha}\Bigr\}$

as a confidence set of approximate coverage probability $1 - \alpha$.

COROLLARY 1.2. If (1.2) holds, $\tilde I_0$ is invertible, and $\hat\theta_n$ is consistent, then under the null hypothesis $H_0: \theta = \theta_0$, the sequence $2\log\bigl(pl_n(\hat\theta_n)/pl_n(\theta_0)\bigr)$ is asymptotically chi-squared distributed with $d$ degrees of freedom.

Proof. This is immediate from (1.3)-(1.4) upon choosing $\tilde\theta_n = \theta_0$ in (1.4).

The third corollary concerns a discretized second derivative of the profile likelihood. This estimator is the square of the numerical derivative of the signed log-likelihood ratio statistic as discussed by Chen and Jennrich (1996). In their Theorem 3.1 they show that, in the parametric setting, the derivative of the signed log-likelihood ratio statistic is the square root of the observed information about $\theta$. Indeed, since the first derivative of the profile likelihood at $\hat\theta_n$ should be zero, the expression in the following display can be labelled the observed information about $\theta$ (evaluated in the direction $v_n$). Thus this corollary can be used to construct an estimate of the standard error of $\hat\theta_n$.

COROLLARY 1.3. If (1.2) holds, then, for all sequences $v_n \xrightarrow{P} v \in \mathbb{R}^d$ and $h_n \xrightarrow{P} 0$ such that $(\sqrt{n}\,h_n)^{-1} = O_P(1)$,

$-2\,\frac{\log pl_n(\hat\theta_n + h_n v_n) - \log pl_n(\hat\theta_n)}{n h_n^2} \xrightarrow{P} v^T\tilde I_0 v.$

Proof. This is immediate from (1.4) upon choosing $\tilde\theta_n = \hat\theta_n + h_n v_n$.

The restriction on the rate of convergence of $h_n$ to zero stems from the error term in (1.4) being in terms of $\|\tilde\theta_n - \theta_0\| + 1$ rather than in $\|\tilde\theta_n - \hat\theta_n\|$ only. If the model is parametric and $\hat\eta_{\tilde\theta_n}$ (defined below) is consistent, then Cramér's conditions (Cramér, 1946, pg. 500; Le Cam and Yang, 1990, pg. 101) are sufficient to show that the error is in terms of the latter difference. For this reason, we believe that in semiparametric problems for which the nuisance parameter can be estimated at a parametric (square root $n$) rate, the error term in (1.4) can be improved.

The organization of the paper is as follows. In Section 2 we review the notion of a least favorable submodel. This concept is rather useful in understanding why the general scheme to prove expansion (1.2), given in Section 3, should be and is successful. This scheme applies to a wide variety of examples. In Section 4 we discuss four examples, where we refer to the literature for technical details.

We use the notations $\mathbb{P}_n$ and $\mathbb{G}_n$ for the empirical distribution and the empirical process of the observations, respectively. Furthermore, we use operator notation for evaluating expectations. Thus, for every measurable function $f$ and probability measure $P$,

$\mathbb{P}_n f = \frac1n\sum_{i=1}^n f(X_i), \qquad Pf = \int f\,dP, \qquad \mathbb{G}_n f = \frac{1}{\sqrt{n}}\sum_{i=1}^n \bigl(f(X_i) - P_0 f\bigr),$

where $P_0 = P_{\theta_0,\eta_0}$ is the true underlying measure of the observations. We abbreviate $(\theta_0, \eta_0)$ to $0$ also at other places. Additionally, for a function $x \mapsto \ell(\theta, \eta)(x)$ indexed by $(\theta, \eta)$, notation such as $P\ell(\hat\theta, \hat\eta)$ is an abbreviation for $\int \ell(\theta, \eta)(x)\,dP(x)$ evaluated at $(\theta, \eta) = (\hat\theta, \hat\eta)$.
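
As a concrete companion to Corollaries 1.2 and 1.3, the sketch below shows how a computed log profile likelihood could be turned into a likelihood-ratio confidence set and a curvature-based standard error for a scalar $\theta$; the helper names, the grid, and the step choice $h = n^{-1/3}$ are assumptions for illustration, not prescriptions from the paper.

```python
import numpy as np
from scipy.stats import chi2

def profile_se(pl, theta_hat, n, h=None):
    # Corollary 1.3 (scalar theta, direction v = 1): the discretized second
    # derivative of log pl_n around its maximizer estimates I_0, so the
    # standard error is estimated by 1/sqrt(n * I_hat).  The step h must not
    # shrink faster than 1/sqrt(n); h = n**(-1/3) is one admissible choice.
    if h is None:
        h = n ** (-1.0 / 3.0)
    info_hat = -2.0 * (pl(theta_hat + h) - pl(theta_hat)) / (n * h ** 2)
    return 1.0 / np.sqrt(n * info_hat)

def profile_confidence_interval(pl, theta_hat, grid, alpha=0.05):
    # Corollary 1.2 (d = 1): {theta : 2[log pl_n(theta_hat) - log pl_n(theta)]
    # <= chi2_{1,1-alpha}} has approximate coverage 1 - alpha; computed here by
    # scanning a grid of theta values.
    cutoff = chi2.ppf(1.0 - alpha, df=1)
    kept = [t for t in grid if 2.0 * (pl(theta_hat) - pl(t)) <= cutoff]
    return min(kept), max(kept)
```

Both functions take the log profile likelihood as a callable `pl`, so they apply unchanged whether $pl_n$ is available in closed form (as in the Cox model) or only through numerical maximization over the nuisance parameter.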

2. Least Favorable Submodels

To prove (1.2) we reduce the high-dimensional model to a finite-dimensional, random submodel of the same dimension as $\theta$. This submodel varies in both $\theta$ and $\eta$ and is "centered" at a random value of $\eta$. The centering at the random value of $\eta$ is to account for the maximization relative to $\eta$ for fixed $\theta$. We then use ordinary Taylor series expansions on the submodel, where the random derivatives are controlled by empirical process techniques.

We will use approximately least favorable submodels. In this section we review the concepts of "efficient scores" and "least favorable submodels." We also review how one may define a score function for an infinite-dimensional parameter. This section may be skipped by readers familiar with (semiparametric) efficiency theory.

First consider least favorable submodels in the setting of a smooth parametric model $P_{\theta,\eta}$, where both $\theta$ and $\eta$ are Euclidean. Partition the Fisher information matrix as

$\begin{pmatrix} i_{\theta\theta} & i_{\theta\eta} \\ i_{\eta\theta} & i_{\eta\eta} \end{pmatrix}.$

When $i_{\eta\theta}$ can be expressed as a linear combination of the columns of $i_{\eta\eta}$, $i_{\eta\eta}^{-1}i_{\eta\theta}$ can be defined as the solution to the equation $i_{\eta\eta}x = i_{\eta\theta}$. The efficient score function for $\theta$ at $(\theta, \eta)$ is then given by

$\tilde\ell_{\theta,\eta} = \dot\ell_{\theta,\eta} - (i_{\eta\eta}^{-1}i_{\eta\theta})^T A_{\theta,\eta},$

where $\dot\ell_{\theta,\eta}$ and $A_{\theta,\eta}$ are the score functions for $\theta$ and $\eta$, respectively. The efficient score function is uncorrelated with any linear combination of the score function for $\eta$. A sequence of estimators $\hat\theta_n$ is best regular (asymptotically efficient) at $(\theta, \eta)$ if and only if it satisfies

(2.1)    $\sqrt{n}\,(\hat\theta_n - \theta) = \bigl(i_{\theta\theta} - i_{\theta\eta}i_{\eta\eta}^{-1}i_{\eta\theta}\bigr)^{-1}\sqrt{n}\,\mathbb{P}_n\tilde\ell_{\theta,\eta} + o_P(1).$

Under regularity conditions the maximum likelihood estimator for $\theta$ in the full parametric model will satisfy this equation.

Consider a submodel $t \mapsto P_{t,\eta_t}$ indexed by $t$, for example the submodel defined by $\eta_t = \eta - i_{\eta\eta}^{-1}i_{\eta\theta}(t - \theta)$. The information for the estimation of the "parameter" $t$ is given by the covariance of the first derivative of the log likelihood with respect to $t$, evaluated at $t = \theta$. For the model with $\eta_t = \eta - i_{\eta\eta}^{-1}i_{\eta\theta}(t - \theta)$ the derivative is $\tilde\ell_{\theta,\eta}$ and its covariance matrix is $i_{\theta\theta} - i_{\theta\eta}i_{\eta\eta}^{-1}i_{\eta\theta}$. Thus the well-behaved maximum likelihood estimator of $t$ in the submodel satisfies (2.1) with $t$ in place of $\theta$. This submodel is called least favorable at $(\theta, \eta)$ when estimating $\theta$ in the presence of the nuisance parameter $\eta$, because out of all submodels $t \mapsto P_{t,\eta_t}$, this submodel has the smallest information about $t$. In parametric models existence of the efficient score and existence of the least favorable submodel are synonymous.

Similarly, in semiparametric models, a best regular (asymptotically efficient) estimator must be asymptotically linear in the efficient score. Thus if the maximum likelihood estimator of $\theta$ is efficient in the full model, it is asymptotically linear in the efficient score.
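
The parametric formulas above reduce to simple linear algebra; the sketch below evaluates the efficient information $i_{\theta\theta} - i_{\theta\eta}i_{\eta\eta}^{-1}i_{\eta\theta}$ from a partitioned Fisher information matrix. The example matrix and names are assumptions for illustration only.

```python
import numpy as np

def efficient_information(info, d):
    # Efficient Fisher information for the first d coordinates (theta) of a
    # partitioned information matrix [[i_tt, i_te], [i_et, i_ee]]:
    # i_tt - i_te @ inv(i_ee) @ i_et, the matrix appearing in (2.1).
    i_tt, i_te = info[:d, :d], info[:d, d:]
    i_et, i_ee = info[d:, :d], info[d:, d:]
    return i_tt - i_te @ np.linalg.solve(i_ee, i_et)

# Example: scalar theta, two-dimensional eta.
info = np.array([[2.0, 0.5, 0.3],
                 [0.5, 1.0, 0.2],
                 [0.3, 0.2, 1.5]])
I_eff = efficient_information(info, d=1)   # never exceeds info[:1, :1]; equal iff i_te = 0
```

The loss of information $i_{\theta\theta} - \tilde I_{\theta,\eta}$ quantifies the price of not knowing $\eta$; it vanishes exactly when the scores for $\theta$ and $\eta$ are uncorrelated.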

Given a semiparametric model of the type $\{P_{\theta,\eta}: \theta \in \Theta, \eta \in H\}$, the score function for $\theta$ is defined, as usual, as the partial derivative with respect to $\theta$ of the log likelihood. A score function for $\eta$ is

$\frac{\partial}{\partial t}\Big|_{t=0}\log p_{\theta,\eta_t}(x) = A_{\theta,\eta}h(x),$

where $h$ is a "direction" in which $\eta_t \in H$ approaches $\eta$, running through some index set $\mathcal{H}$. In information calculations $\mathcal{H}$ is usually taken equal to a subset of a Hilbert space, and $A_{\theta,\eta}: \mathcal{H} \mapsto L_2(P_{\theta,\eta})$ is continuous. (For example, in the example of logistic regression with errors in variables, $\eta$ is a distribution function, $\mathcal{H}$ is the set of all bounded functions $h$ with mean $\int h\,d\eta$ equal to zero, and $\eta_t(\cdot) = \int^{\cdot}(1 + th)\,d\eta$.) The efficient score function for $\theta$ at $(\theta, \eta)$ is defined as

$\tilde\ell_{\theta,\eta} = \dot\ell_{\theta,\eta} - \Pi_{\theta,\eta}\dot\ell_{\theta,\eta},$

where $\Pi_{\theta,\eta}\dot\ell_{\theta,\eta}$ minimizes the squared distance $P_{\theta,\eta}\bigl(\dot\ell_{\theta,\eta} - k\bigr)^2$ over all functions $k$ in the closed linear span (in $L_2(P_{\theta,\eta})$) of the score functions for $\eta$. The inverse of the variance of $\tilde\ell_{\theta,\eta}$ is the Cramér-Rao bound for estimating $\theta$ in the presence of $\eta$. A submodel $t \mapsto P_{t,\eta_t}$ with $\eta_\theta = \eta$ is least favorable at $(\theta, \eta)$ if

$\tilde\ell_{\theta,\eta} = \frac{\partial}{\partial t}\Big|_{t=\theta}\log p_{t,\eta_t}.$

Since a projection $\Pi_{\theta,\eta}\dot\ell_{\theta,\eta}$ on the closed linear span of the nuisance scores is not necessarily a nuisance score itself, existence of the efficient score does not imply existence of a least favorable submodel. (Problems seem to arise in particular at the maximum likelihood estimator $(\hat\theta, \hat\eta)$, which may happen to be "on the boundary of the parameter set".) However, in all our examples a least favorable submodel exists or can be approximated sufficiently closely.

If the projection $\Pi_{\theta,\eta}\dot\ell_{\theta,\eta}$ is a nuisance score, then it can be written in the following attractive manner. The adjoint operator $A^*_{\theta,\eta}: L_2(P_{\theta,\eta}) \mapsto \mathcal{H}$ is characterized by the requirement

$P_{\theta,\eta}(A_{\theta,\eta}h)\,g = \langle A_{\theta,\eta}h, g\rangle_{P_{\theta,\eta}} = \langle h, A^*_{\theta,\eta}g\rangle_{\mathcal{H}}, \qquad \text{every } h \in \mathcal{H},\ g \in L_2(P_{\theta,\eta}).$

The projection of $\dot\ell_{\theta,\eta}$ onto the closure of the space spanned by the nuisance scores $A_{\theta,\eta}h$ is itself a nuisance score if and only if the range of $A^*_{\theta,\eta}A_{\theta,\eta}$ contains $A^*_{\theta,\eta}\dot\ell_{\theta,\eta}$, in which case it is given by

$\Pi_{\theta,\eta}\dot\ell_{\theta,\eta} = A_{\theta,\eta}h_{\theta,\eta}, \qquad h_{\theta,\eta} = (A^*_{\theta,\eta}A_{\theta,\eta})^{-}A^*_{\theta,\eta}\dot\ell_{\theta,\eta},$

where $(A^*_{\theta,\eta}A_{\theta,\eta})^{-}$ is a "generalized inverse" and $h_{\theta,\eta}$ is called the least favorable direction. In two of our examples we form the projection in this manner.

3. Main Result

We assume the existence of an approximately least favorable submodel. We assume that, for each parameter $(\theta, \eta)$, there exists a map, which we denote by $t \mapsto \eta_t(\theta, \eta)$, from a fixed neighborhood of $\theta$ into the parameter set for $\eta$, such that the map $t \mapsto \ell(t, \theta, \eta)(x)$ defined by

$\ell(t, \theta, \eta)(x) = \log l\bigl(t, \eta_t(\theta, \eta)\bigr)(x)$

is twice continuously differentiable, for all $x$. We denote the derivatives by $\dot\ell(t, \theta, \eta)(x)$ and $\ddot\ell(t, \theta, \eta)(x)$, respectively. The submodel with parameters $\bigl(t, \eta_t(\theta, \eta)\bigr)$ should pass through $(\theta, \eta)$ at $t = \theta$:

(3.1)    $\eta_\theta(\theta, \eta) = \eta, \qquad \text{every } (\theta, \eta).$

The second important structural requirement that should lead the construction of this submodel is that it be least favorable at $(\theta_0, \eta_0)$ for estimating $\theta$, in the sense that

(3.2)    $\dot\ell(\theta_0, \theta_0, \eta_0) = \tilde\ell_{\theta_0,\eta_0}.$

This means that the score function for the "parameter" $t$ of the model with likelihood $l\bigl(t, \eta_t(\theta_0, \eta_0)\bigr)$ at $t = \theta_0$ is the efficient score function for $\theta$. Note that we assume this only at the true parameter $(\theta_0, \eta_0)$. (In particular, not at every realization of the (profile) estimators, where (3.2) may be problematic due to the rough nature of these estimators.)

For every fixed $\theta$, let $\hat\eta_\theta$ be a parameter at which the supremum in the definition of the profile likelihood is taken. (Thus $pl_n(\theta) = l_n(\theta, \hat\eta_\theta)$ for every $\theta$.) Assume that, for any random sequence $\tilde\theta_n \xrightarrow{P} \theta_0$,

(3.3)    $\hat\eta_{\tilde\theta_n} \xrightarrow{P} \eta_0,$
(3.4)    $P_0\,\dot\ell(\theta_0, \tilde\theta_n, \hat\eta_{\tilde\theta_n}) = o_P\bigl(\|\tilde\theta_n - \theta_0\| + n^{-1/2}\bigr).$

Conditions (3.1)-(3.4) are the structural conditions for the following theorem. Conditions (3.1)-(3.2) say how the least favorable submodel should be defined, leaving considerable freedom to adapt to examples. Condition (3.4) is a "no-bias condition" and is discussed below.

In addition, we require continuity of the functions $\dot\ell(t, \theta, \eta)$ and $\ddot\ell(t, \theta, \eta)$ in $(t, \theta, \eta)$, and conditions that restrict the sizes of the collections of the functions $\dot\ell$ and $\ddot\ell$, so as to make them manageable. The latter conditions are expressed in the language of empirical processes. See e.g. Van der Vaart and Wellner (1996) for the definition, properties and examples of Glivenko-Cantelli and Donsker classes. In some examples these tools simplify the proofs greatly, while in other examples they appear to be unavoidable. Essentially they are used to show that the remainder terms of certain expansions are negligible.

THEOREM 3.1. Let (3.1)-(3.4) be satisfied, and suppose that the functions $(t, \theta, \eta) \mapsto \dot\ell(t, \theta, \eta)(x)$ and $(t, \theta, \eta) \mapsto \ddot\ell(t, \theta, \eta)(x)$ are continuous at $(\theta_0, \theta_0, \eta_0)$ for $P_0$-almost every $x$ (or in measure). Furthermore, suppose that there exists a neighborhood $V$ of $(\theta_0, \theta_0, \eta_0)$ such that the class of functions $\{\dot\ell(t, \theta, \eta): (t, \theta, \eta) \in V\}$ is $P_0$-Donsker with square-integrable envelope function, and such that the class of functions $\{\ddot\ell(t, \theta, \eta): (t, \theta, \eta) \in V\}$ is $P_0$-Glivenko-Cantelli and is bounded in $L_1(P_0)$. Then (1.2) is satisfied.

Proof. For simplicity assume that $\theta$ is one-dimensional. The general proof is similar. Since $\dot\ell(t, \theta, \eta) \to \tilde\ell_0$ as $(t, \theta, \eta) \to (\theta_0, \theta_0, \eta_0)$, and the functions $\dot\ell(t, \theta, \eta)$ are dominated by a square-integrable function, we have by the dominated convergence theorem, for every $(\tilde t, \tilde\theta, \tilde\eta) \xrightarrow{P} (\theta_0, \theta_0, \eta_0)$, that $P_0\bigl(\dot\ell(\tilde t, \tilde\theta, \tilde\eta) - \tilde\ell_0\bigr)^2 \xrightarrow{P} 0$. Similarly, we have that $P_0\,\ddot\ell(\tilde t, \tilde\theta, \tilde\eta) \xrightarrow{P} P_0\,\ddot\ell(\theta_0, \theta_0, \eta_0)$. Since $t \mapsto \exp\ell(t, \theta_0, \eta_0)$ is proportional to a smooth one-dimensional submodel, its derivatives satisfy the usual identity

$P_0\,\ddot\ell(\theta_0, \theta_0, \eta_0) = -P_0\,\dot\ell^2(\theta_0, \theta_0, \eta_0) = -\tilde I_0.$

These facts, together with the empirical process conditions, imply that, for every random sequence $(\tilde t, \tilde\theta, \tilde\eta) \xrightarrow{P} (\theta_0, \theta_0, \eta_0)$,

(3.5)    $\mathbb{G}_n\,\dot\ell(\tilde t, \tilde\theta, \tilde\eta) = \mathbb{G}_n\,\tilde\ell_0 + o_P(1),$
(3.6)    $\mathbb{P}_n\,\ddot\ell(\tilde t, \tilde\theta, \tilde\eta) \xrightarrow{P} -\tilde I_0.$

By (3.1) and the definition of $\hat\eta_\theta$,

$\frac1n\bigl(\log pl_n(\tilde\theta) - \log pl_n(\theta_0)\bigr) = \mathbb{P}_n\log l(\tilde\theta, \hat\eta_{\tilde\theta}) - \mathbb{P}_n\log l(\theta_0, \hat\eta_{\theta_0}) \begin{cases} \ \ge\ \mathbb{P}_n\log l\bigl(\tilde\theta, \eta_{\tilde\theta}(\theta_0, \hat\eta_{\theta_0})\bigr) - \mathbb{P}_n\log l\bigl(\theta_0, \eta_{\theta_0}(\theta_0, \hat\eta_{\theta_0})\bigr), \\ \ \le\ \mathbb{P}_n\log l\bigl(\tilde\theta, \eta_{\tilde\theta}(\tilde\theta, \hat\eta_{\tilde\theta})\bigr) - \mathbb{P}_n\log l\bigl(\theta_0, \eta_{\theta_0}(\tilde\theta, \hat\eta_{\tilde\theta})\bigr). \end{cases}$

To obtain the lower bound, the first log likelihood is replaced by a lower value and the second log likelihood is left alone. For the upper bound, the reverse procedure is carried out. Both the upper and the lower bound are differences $\mathbb{P}_n\ell(\tilde\theta, \tilde\psi) - \mathbb{P}_n\ell(\theta_0, \tilde\psi)$ (with $\tilde\psi$ equal to $(\theta_0, \hat\eta_{\theta_0})$ and $(\tilde\theta, \hat\eta_{\tilde\theta})$, respectively) in the log likelihood at two parameter values ($\theta_0$ and $\tilde\theta$) of least favorable submodels. We apply a two-term Taylor expansion to these differences, leaving $\tilde\psi$ fixed, and show that the upper and lower bounds agree up to a term of the order $o_P\bigl(\|\tilde\theta - \theta_0\| + n^{-1/2}\bigr)^2$.

For $\tilde t$ a convex combination of $\tilde\theta$ and $\theta_0$,

$\mathbb{P}_n\ell(\tilde\theta, \tilde\psi) - \mathbb{P}_n\ell(\theta_0, \tilde\psi) = (\tilde\theta - \theta_0)^T\,\mathbb{P}_n\dot\ell(\theta_0, \tilde\psi) + \tfrac12\,(\tilde\theta - \theta_0)^T\,\mathbb{P}_n\ddot\ell(\tilde t, \tilde\psi)\,(\tilde\theta - \theta_0).$

In the second term on the right we can replace $\mathbb{P}_n\ddot\ell(\tilde t, \tilde\psi)$ by $-\tilde I_0$ at the cost of adding an $o_P\|\tilde\theta - \theta_0\|^2$-term, in view of (3.6). The first term on the right is equal to

$\frac{1}{\sqrt{n}}(\tilde\theta - \theta_0)^T\,\mathbb{G}_n\dot\ell(\theta_0, \tilde\psi) + (\tilde\theta - \theta_0)^T\,P_0\dot\ell(\theta_0, \tilde\psi) = \frac{1}{\sqrt{n}}(\tilde\theta - \theta_0)^T\,\mathbb{G}_n\tilde\ell_0 + (\tilde\theta - \theta_0)\,o_P\bigl(\|\tilde\theta - \theta_0\| + n^{-1/2}\bigr).$

Combining the results of the last two displays, we obtain (1.2).

A heuristic reason why the upper/lower bound argument is successful is as follows. If $\hat\eta_\theta$ achieves the supremum in (1.1), then the map $\theta \mapsto (\theta, \hat\eta_\theta)$ ought to be an estimator of a least favorable submodel for the estimation of $\theta$ (see Severini and Wong (1992)). In both the lower and upper bound, we differentiate the likelihood along approximations to the least favorable submodel. By definition, differentiation of the likelihood along the least favorable submodel (if the derivative exists) yields the efficient score function for $\theta$. The efficient information matrix is the covariance matrix of the efficient score function, and, as usual, the expectation of minus the second derivative along this submodel should yield the same matrix. This explains relations (3.5)-(3.6).

An interpretation of condition (3.4) is as follows. Given the construction of the submodel $t \mapsto \eta_t(\theta, \eta)$, the derivative $\dot\ell(\theta_0, \tilde\theta_n, \hat\eta_{\tilde\theta_n})$ can be considered an estimator of the efficient score function $\tilde\ell_0 = \dot\ell(\theta_0, \theta_0, \eta_0)$. Condition (3.4) requires that the difference between the mean of this estimator and the mean $P_0\tilde\ell_0 = 0$ is negligible to the order $o_P\bigl(\|\tilde\theta_n - \theta_0\| + n^{-1/2}\bigr)$. This "no-bias" condition has appeared in the literature in several forms and connections, and in that sense appears to be natural. For instance, in his construction of efficient one-step estimators, Klaassen (1987) requires estimators $\hat\ell_{n,\theta}$ for $\tilde\ell_0$ such that $P_0\hat\ell_{n,\theta_0} = o_P(n^{-1/2})$. Klaassen also shows that the existence of such estimators is necessary for asymptotically efficient estimators of $\theta$ to exist, so that his "no bias" condition does not require too much. (Also cf. Bickel et al. (1993, p. 395, equation (19)).) In the last sentence "asymptotic efficiency" refers to estimators attaining the Cramér-Rao bound given by the efficient score function $\tilde\ell_0$. If the "no-bias" condition fails, then this Cramér-Rao bound, which is based on a local, differential analysis of the model near the true parameter, is too optimistic. Thus the "no-bias" condition can be viewed as the condition on the global model that implies that the local information calculations are not misleading.

The condition has appeared also in proofs of asymptotic normality of the maximum likelihood estimator, and of the asymptotic chi-squared distribution of the likelihood ratio statistic. For instance, Huang (1996) and van der Vaart (1996) impose the condition $P_0\,\dot\ell(\theta_0, \hat\theta, \hat\eta_{\hat\theta}) = o_P(n^{-1/2})$, and in their work on the likelihood ratio statistic Murphy and van der Vaart (1997) assume that $P_0\,\dot\ell(\theta_0, \theta_0, \hat\eta_{\theta_0}) = o_P(n^{-1/2})$. The present condition (3.4) is an extension of these conditions.

The verification of (3.4) may be easy due to special properties of the model, but may also require considerable effort. In the latter case, the verification usually boils down to a linearization of $P_0\,\dot\ell(\theta_0, \theta, \eta)$ in $(\theta, \eta)$ combined with establishing a rate of convergence of $\hat\eta_{\tilde\theta_n}$. The linear term in the expansion is approximately equal to the covariance between the efficient score and a score function for the nuisance parameter $\eta$. Since the efficient score is uncorrelated with all score functions for the nuisance parameter, this term should be zero.

The expansion relative to $\theta$ does not cause difficulty, since the error in the linearization should be of lower order than the linear term in $\theta$ and thus will be $o_P\bigl(\|\tilde\theta_n - \theta_0\|\bigr)$. Thus, under some regularity conditions, condition (3.4) is equivalent to

(3.7)    $P_0\,\dot\ell(\theta_0, \theta_0, \hat\eta_{\tilde\theta_n}) = o_P\bigl(\|\tilde\theta_n - \theta_0\| + n^{-1/2}\bigr).$

The extreme case that the expression on the left side of (3.7) is identically zero occurs, among others, in certain mixture models. (See below for an example.)

One way to expand $P_0\,\dot\ell(\theta_0, \theta_0, \eta)$ in $\eta$ is to write

(3.8)    $\bigl(P_0 - P_{\theta_0,\eta_{\theta_0}(\theta_0,\eta)}\bigr)\bigl(\dot\ell(\theta_0, \theta_0, \eta) - \dot\ell(\theta_0, \theta_0, \eta_0)\bigr) - P_0\,\dot\ell(\theta_0, \theta_0, \eta_0)\Bigl[\frac{p_{\theta_0,\eta_{\theta_0}(\theta_0,\eta)} - p_0}{p_0} - A_0\bigl(\eta_{\theta_0}(\theta_0, \eta) - \eta_0\bigr)\Bigr],$

where $A_0$ is the score operator for $\eta$ at $(\theta_0, \eta_0)$ (and hence $P_0\,\tilde\ell_0 A_0 h = 0$ for every $h$ by the orthogonality property of the efficient score function). (This presumes that $\eta$ belongs to a linear space and that $A_0$ is a derivative of $\log p_{\theta,\eta}$ relative to $\eta$; in specific examples the Taylor expansion may take a different form.) The above display suggests that

$\|\hat\eta_{\tilde\theta_n} - \eta_0\| = O_P\bigl(\|\tilde\theta_n - \theta_0\|\bigr) + O_P\bigl(n^{-1/2}\bigr),$

along with the expansion in $\theta$, is sufficient to verify (3.4). This should certainly be true in smooth parametric models ($\eta$ finite-dimensional), explaining why condition (3.4) does not show up in the analysis of classical parametric models. In cases where the nuisance parameter is not estimable at $\sqrt{n}$-rate, this is not satisfied. Then, provided the expansion can be carried into its second-order term, it is still sufficient to have

$\|\hat\eta_{\tilde\theta_n} - \eta_0\| = O_P\bigl(\|\tilde\theta_n - \theta_0\|\bigr) + o_P\bigl(n^{-1/4}\bigr).$

Special properties of the model may allow the linearization to be taken even further, and then a slower rate of convergence of the nuisance parameter may be sufficient. For example, in the partially linear regression model considered by Donald and Newey (1994), where the nuisance parameter is partitioned into an unknown regression function and an unknown error density, the diagonal elements of the second derivative relative to the nuisance parameter (which normally would yield positive terms) vanish and there is cancellation in the mean of the cross product term. As shown by Donald and Newey (1994), in this example much less than an $n^{-1/4}$-rate on the nuisance parameter suffices. It appears difficult to capture such special situations in a general result, whence we have left the "no-bias" condition as a basic condition.

Methods to deduce rates of convergence of maximum likelihood estimators of infinite-dimensional parameters were recently developed by Van de Geer (1993, 1995), Wong and Shen (1995), and Birgé and Massart (1993, 1997), and were adapted to incorporate nuisance parameters by Murphy and Van der Vaart (1997c). Generally, these authors show how the rate of convergence can be expressed in the entropy of the model. We shall not describe these methods in this paper.

4. Examples

In this section we provide a number of examples with the aims of:
- showing the variation in definitions of likelihoods;
- showing the different ways that least favorable submodels can be defined;
- indicating the different techniques that can be applied to verify the no-bias condition (3.4).

These examples have been the subject of separate papers, by various authors, one example per paper. Most of these papers focused on the asymptotic normality of the maximum likelihood estimator $\hat\theta$, or considered computational issues. To justify the quadratic expansion of the present paper, we need to extend the methods used in these papers, but since the basic techniques remain the same, we shall refer to the literature for most of the technical details.

4.1. The Proportional Hazards Model for Current Status Data

Under "current status" censoring a subject is examined once, at a random observation time, and at this time it is observed whether the subject is alive or not. The survival time $T$ and the observation time $Y$ are assumed to be independent given a vector of covariates $Z$. In the proportional hazards model the hazard function of $T$ given $Z$ is equal to $t \mapsto e^{\theta^T Z}\dot\Lambda(t)$, where $\Lambda(t) = \int_0^t\dot\Lambda(u)\,du$ is called the baseline cumulative hazard function. The observations are $n$ i.i.d. copies of $X = (Y, \Delta, Z)$, where $\Delta$ is 1 if $T \le Y$ and 0 otherwise.

The density of $X$ is given by

$p_{\theta,\Lambda}(x) = \bigl(1 - \exp(-e^{\theta^T z}\Lambda(y))\bigr)^{\delta}\bigl(\exp(-e^{\theta^T z}\Lambda(y))\bigr)^{1-\delta}f_{Y,Z}(y, z).$

We take this expression, where for likelihood inference on $(\theta, \Lambda)$ we may delete the term $f_{Y,Z}(y, z)$, as the likelihood.

Maximum likelihood estimation in this model is considered by Huang (1996). Even though there is no explicit expression, the profile likelihood function is easy to compute, for instance using the iterative least concave majorant algorithm of Groeneboom and Wellner (1992). Figure 1 of Huang shows the profile likelihood function for $\theta$. Its parabolic form is striking.

In this model the score function for $\theta$ takes the form

$\dot\ell_{\theta,\Lambda}(x) = z\Lambda(y)\,Q(x; \theta, \Lambda),$

for the function $Q(x; \theta, \Lambda)$ given by

$Q(x; \theta, \Lambda) = e^{\theta^T z}\Bigl[\delta\,\frac{e^{-e^{\theta^T z}\Lambda(y)}}{1 - e^{-e^{\theta^T z}\Lambda(y)}} - (1 - \delta)\Bigr].$
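
The following is a minimal sketch of the (log) likelihood just displayed with the $f_{Y,Z}$ term dropped, evaluated at a candidate $(\theta, \Lambda)$; maximizing over $\Lambda$ (for instance by the iterative least concave majorant algorithm mentioned above) is not implemented here, and the function name and arguments are illustrative assumptions.

```python
import numpy as np

def current_status_loglik(theta, Lambda_at_y, delta, z):
    # Log likelihood for current status data, dropping the f_{Y,Z} factor:
    # sum_i [ delta_i * log(1 - exp(-e^{theta'z_i} Lambda(y_i)))
    #         - (1 - delta_i) * e^{theta'z_i} Lambda(y_i) ].
    # Lambda_at_y holds the candidate cumulative hazard evaluated at each y_i.
    risk = np.exp(z @ theta) * Lambda_at_y
    risk = np.clip(risk, 1e-12, None)             # guard against log(0) below
    return np.sum(delta * np.log1p(-np.exp(-risk)) - (1 - delta) * risk)
```

Profiling over $\Lambda$ amounts to maximizing this expression over nondecreasing vectors `Lambda_at_y`, which is an isotonic-regression-type problem rather than a smooth finite-dimensional one.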

Inserting a submodel $t \mapsto \Lambda_t$ such that $h(y) = \partial/\partial t|_{t=0}\Lambda_t(y)$ exists for every $y$ into the log likelihood and differentiating at $t = 0$, we obtain a score function for $\Lambda$ of the form

$A_{\theta,\Lambda}h(x) = h(y)\,Q(x; \theta, \Lambda).$

For every nondecreasing, nonnegative function $h$ the submodel $\Lambda_t = \Lambda + th$ is well defined if $t$ is positive, and yields a (one-sided) derivative $h$ at $t = 0$. Thus the preceding display gives a (one-sided) score for $\Lambda$ at least for all $h$ of this type. The linear span of these functions contains $A_{\theta,\Lambda}h$ for all bounded functions $h$ of bounded variation. The efficient score function for $\theta$ is defined as $\tilde\ell_{\theta,\Lambda} = \dot\ell_{\theta,\Lambda} - A_{\theta,\Lambda}h_{\theta,\Lambda}$ for the vector of functions $h_{\theta,\Lambda}$ minimizing the distance $P_{\theta,\Lambda}\|\dot\ell_{\theta,\Lambda} - A_{\theta,\Lambda}h\|^2$. In view of the similar structure of the scores for $\theta$ and $\Lambda$, this is a weighted least squares problem with weight function $Q(x; \theta, \Lambda)$. The solution at the true parameters is given by the vector-valued function

(4.1)    $h_0(Y) = \Lambda_0(Y)\,h_{00}(Y) = \Lambda_0(Y)\,\frac{E_{\theta_0,\Lambda_0}\bigl(Z\,Q^2(X; \theta_0, \Lambda_0)\mid Y\bigr)}{E_{\theta_0,\Lambda_0}\bigl(Q^2(X; \theta_0, \Lambda_0)\mid Y\bigr)}.$

As the formula shows (and as follows from the nature of the minimization problem), the vector of functions $h_0(y)$ is unique only up to null sets for the distribution of $Y$. We shall assume that (under the true parameters) there exists a version of the conditional expectation that is differentiable with bounded derivative.

Thus we define, for $t$ a vector in $\mathbb{R}^d$ and $\phi$ a fixed function that is approximately the identity,

$\Lambda_t(\theta, \Lambda) = \Lambda + (\theta - t)^T\phi(\Lambda)\bigl(h_{00}\circ\Lambda_0^{-1}\circ\Lambda\bigr),$
$\ell(t, \theta, \Lambda) = \log l\bigl(t, \Lambda_t(\theta, \Lambda)\bigr).$

The function $\Lambda_t(\theta, \Lambda)$ is essentially $\Lambda$ plus a perturbation in the least favorable direction $h_0$, but its definition is somewhat complicated in order to ensure that $\Lambda_t(\theta, \Lambda)$ really defines a cumulative hazard function within our parameter space, at least for $t$ that are sufficiently close to $\theta$. First, the construction using $h_{00}\circ\Lambda_0^{-1}\circ\Lambda$, rather than $h_{00}$, ensures that the perturbation that is added to $\Lambda$ is absolutely continuous with respect to $\Lambda$; otherwise $\Lambda_t(\theta, \Lambda)$ would not be a nondecreasing function. Second, the function $\phi$ serves a similar purpose. As long as $\phi(\Lambda_0) = \Lambda_0$ on the support of the observation times $Y$, the model $t \mapsto \Lambda_t(\theta_0, \Lambda_0)$ will be least favorable at $(\theta_0, \Lambda_0)$. The inclusion of $\phi$ is a technical device to make the least favorable paths $t \mapsto \Lambda_t(\theta, \Lambda)$ more manageable for $(\theta, \Lambda)$ equal to estimators. (See the end of the section for the precise properties $\phi$ should have.)

In this model, the cumulative hazard function is not estimable at $\sqrt{n}$-rate. However, using entropy methods, Murphy and van der Vaart (1997), extending earlier results of Huang (1996), show that

(4.2)    $\int\bigl(\hat\Lambda_{\tilde\theta}(y) - \Lambda_0(y)\bigr)^2\,dF_Y(y) = O_P\bigl(\|\tilde\theta - \theta_0\|^2 + n^{-2/3}\bigr).$

This is sufficient for the verification of equation (3.4). For instance, we can verify (3.7) by the decomposition as in (3.8), upon noting that, for some constant $C$ and every $x$ (under some regularity conditions and provided that $\phi$ is sufficiently regular),

$\bigl|\dot\ell(\theta_0, \theta_0, \Lambda)(x) - \dot\ell(\theta_0, \theta_0, \Lambda_0)(x)\bigr| \le C\,|\Lambda - \Lambda_0|(y),$
$\bigl|p_{\theta_0,\Lambda}(x) - p_{\theta_0,\Lambda_0}(x) - A_0(\Lambda - \Lambda_0)(x)\,p_{\theta_0,\Lambda_0}(x)\bigr| \le C\,|\Lambda - \Lambda_0|^2(y).$

Since the expressions on the left for a fixed $x$ depend on $\Lambda$ only through $\Lambda(y)$, these inequalities follow from ordinary Taylor expansions and uniform bounds on the first and second derivatives. Thus, by the decomposition as in (3.8) and taking expectations, the rate of convergence of $\hat\Lambda_{\tilde\theta}$ given in the preceding display readily translates into a rate of convergence of the "bias" term, of the order $O_P\bigl(\|\tilde\theta - \theta_0\|^2\bigr) + O_P(n^{-2/3})$, better than needed.

A precise argument reveals that the conditions of the paper can be checked under the following assumptions. We restrict the parameter set for $\theta$ to a compact in $\mathbb{R}^d$ and assume that the true parameter $\theta_0$ is an interior point. Furthermore, we restrict the parameter set for $\Lambda$ to the set of all cumulative hazard functions on an interval $[0, \tau]$ with $\Lambda(\tau) \le M$ for a given (large) constant $M$. The observation times $Y$ are in an interval $[\sigma, \tau]$. The true parameter $\Lambda_0$ satisfies $\Lambda_0(\sigma-) > 0$ and $\Lambda_0(\tau) < M$, and is continuously differentiable with derivative bounded away from zero on $[\sigma, \tau]$. The covariate vector $Z$ is bounded and $E\,\mathrm{cov}(Z\mid Y)$ is positive definite. Finally, we assume that the function $h_0$ given by (4.1) has a version which is differentiable with a bounded derivative on $[\sigma, \tau]$. Then we can take the function $\phi: [0, M] \mapsto [0, M]$ to be any fixed function such that $\phi(y) = y$ on the interval $[\Lambda_0(\sigma), \Lambda_0(\tau)]$, such that the function $y \mapsto \phi(y)/y$ is Lipschitz, and such that $\phi(y) \le c\bigl(y \wedge (M - y)\bigr)$ for a sufficiently large constant $c$ depending on $(\theta_0, \Lambda_0)$ only. (By our assumption that $[\Lambda_0(\sigma), \Lambda_0(\tau)] \subset (0, M)$, such a function exists.)

4.2. Case-Control Studies with a Missing Covariate

The following model is considered by Roeder, Carroll and Lindsay (1996), who use the profile likelihood to set a confidence interval in a study of the effect of cholesterol on heart disease. The model is expressed in terms of a basic random vector $(D, W, Z)$, whose distribution is described in the following way (our parameterization is slightly different from Roeder, Carroll and Lindsay (1996)):
- $D$ is a logistic regression on $\exp Z$ with intercept and slope $\gamma$ and $\beta$, respectively;
- $W$ is a linear regression on $Z$ with intercept and slope $\alpha_0$ and $\alpha_1$, respectively, and an $N(0, \sigma^2)$-error;
- given $Z$, the variables $D$ and $W$ are independent;
- $Z$ has a completely unspecified distribution $\eta$.

The unknown parameters are $\theta = (\beta, \alpha_0, \alpha_1, \gamma, \sigma)$, ranging over $\Theta \subset \mathbb{R}^4 \times (0, \infty)$, and the distribution $\eta$ of the regression variable, with support in an interval $\mathcal{Z} \subset \mathbb{R}$.

The likelihood for the vector $(D, W, Z)$ takes the form $p_\theta(d, w\mid z)\,d\eta(z)$, with $\phi$ denoting the standard normal density,

$p_\theta(d, w\mid z) = \Bigl(\frac{1}{1 + \exp(-\gamma - \beta e^z)}\Bigr)^d\Bigl(\frac{\exp(-\gamma - \beta e^z)}{1 + \exp(-\gamma - \beta e^z)}\Bigr)^{1-d}\,\frac1\sigma\,\phi\Bigl(\frac{w - \alpha_0 - \alpha_1 z}{\sigma}\Bigr),$

and $d\eta$ denoting the density of $\eta$ with respect to a dominating measure.

Roeder, Carroll and Lindsay (1996) and Murphy and van der Vaart (1997b) consider both a prospective and a retrospective (or case-control) model. In the prospective model we observe two independent random samples of sizes $n_C$ and $n_R$ from the distributions of $(D, W, Z)$ and $(D, W)$, respectively. (The indexes $C$ and $R$ are for "complete" and "reduced", respectively.) In the terminology of Roeder, Carroll and Lindsay (1996), the covariate $Z$ in a full observation $(D, W, Z)$ is a "golden standard", but, in view of the costs of measurement, for a selection of observations only the "surrogate covariate" $W$ is available. In their example $W$ is the natural logarithm of total cholesterol, $Z$ is the natural logarithm of LDL cholesterol, and we are interested in heart disease $D = 1$.

We shall consider the situation that the numbers of complete and reduced observations are of comparable magnitude. More precisely, our theorem applies to the situation that the fraction $n_C/n_R$ is bounded away from 0 and $\infty$. For simplicity of notation, we shall henceforth assume that $n_C = n_R$. Then the observations can be paired, and the observations in the prospective model can be summarized as $n$ i.i.d. copies of $X = (Y_C, Z_C, Y_R)$ from the density

$x = (y_C, z_C, y_R) \mapsto p_\theta(y_C\mid z_C)\,d\eta(z_C)\int p_\theta(y_R\mid z)\,d\eta(z) =: p_\theta(y_C\mid z_C)\,d\eta(z_C)\,p_\theta(y_R\mid\eta).$

Here we denote the complete-sample components by $Y_C = (D_C, W_C)$ and $Z_C$, and the reduced-sample components by $Y_R = (D_R, W_R)$. In the complete-sample part of the likelihood we use an empirical likelihood, with $\eta\{z\}$ the measure of the point $\{z\}$,

$l(\theta, \eta)(x) = p_\theta(y_C\mid z_C)\,\eta\{z_C\}\int p_\theta(y_R\mid z)\,d\eta(z).$

Note that the "likelihood" is a density if $\eta$ is assumed to be discrete; however, we shall use the above even if it is known that $\eta$ must be continuous.

There is no similar notational device to fit the retrospective version of the model into the i.i.d. set-up of this paper. Therefore, we shall restrict ourselves to the prospective model as given in the preceding paragraph. However, since the profile likelihoods of the prospective and retrospective models are algebraically identical, the arguments can be extended to the retrospective model.

We shall concentrate on the regression coefficient $\beta$, considering $\theta_2 = (\alpha_0, \alpha_1, \gamma, \sigma)$ and $\eta$ as nuisance parameters. (Thus, the parameter $\theta$ in the general results should be replaced by $\beta$ throughout this section.) Note that the assumption of a known support means that in the maximum likelihood estimation, $\eta$ is constrained to have support contained in $\mathcal{Z}$.

Assuming that $\eta_0$ is nondegenerate, Murphy and van der Vaart (1997b) show that the maximum likelihood estimator $(\hat\theta, \hat\eta)$ is asymptotically normal. Here we shall verify that the conditions of Theorem 3.1 are satisfied, so that the profile likelihood function is approximated by a quadratic in $\beta$.

To define a least favorable submodel, we first calculate scores for $\theta$ and $\eta$. The score function $\dot\ell_{\theta,\eta}$ for $\theta$ is given by

$\dot\ell_{\theta,\eta}(y_C, z_C, y_R) = \dot\ell_\theta(y_C\mid z_C) + \frac{\int\dot\ell_\theta(y_R\mid z)\,p_\theta(y_R\mid z)\,d\eta(z)}{p_\theta(y_R\mid\eta)},$

where $\dot\ell_\theta(y\mid z) = \partial/\partial\theta\,\log p_\theta(y\mid z)$. Furthermore, defining $\eta_t(\cdot) = \eta(\cdot) + t\int_{\cdot}h\,d\eta$ for $h$ a bounded function satisfying $\int h\,d\eta = 0$, we can differentiate $l(\theta, \eta_t)$ at $t = 0$ to form the score for $\eta$ in the direction $h$,

$A_{\theta,\eta}h(x) = h(z_C) + \frac{\int h(z)\,p_\theta(y_R\mid z)\,d\eta(z)}{p_\theta(y_R\mid\eta)}.$

A version of the Hilbert space adjoint $A^*_{\theta,\eta}$ of this operator is given by

$A^*_{\theta,\eta}g(z_C, z) = \int g(y_R, z_C)\,p_\theta(y_R\mid z)\,d\mu(y_R)$

($\mu$ is the product of the counting measure and Lebesgue measure). According to the general theory, the least favorable direction $h_{\theta,\eta}$ for the estimation of $\theta$ in the presence of the unknown $\eta$ is given by

$h_{\theta,\eta} = (A^*_{\theta,\eta}A_{\theta,\eta})^{-1}A^*_{\theta,\eta}\dot\ell_{\theta,\eta}$

and the efficient information matrix for $\theta$ when $\eta$ is unknown is

$\tilde I_{\theta,\eta} = P_{\theta,\eta}\dot\ell_{\theta,\eta}\dot\ell_{\theta,\eta}^T - P_{\theta,\eta}\bigl(A_{\theta,\eta}(A^*_{\theta,\eta}A_{\theta,\eta})^{-1}A^*_{\theta,\eta}\dot\ell_{\theta,\eta}\bigr)\dot\ell_{\theta,\eta}^T.$

An explicit expression for the inverse operator $(A^*_{\theta,\eta}A_{\theta,\eta})^{-1}$ is unknown (and it is unlikely that there is one). However, the inverse certainly exists in the Hilbert space sense, so that the expressions are well-defined. Furthermore, in Section 8 of Murphy and van der Vaart (1997b) it is shown that $A^*_{\theta_0,\eta_0}A_{\theta_0,\eta_0}$ is continuously invertible on the space of Lipschitz continuous functions. Since $A^*_{\theta_0,\eta_0}\dot\ell_{\theta_0,\eta_0}$ is Lipschitz continuous, the function $h_0$ is also bounded and Lipschitz, so that it can be used to define the submodel $\eta_t$ as given previously.

The efficient information for $\beta$ is obtained by inverting the efficient information matrix for $\theta$ and taking the $(1,1)$-element. This corresponds to a further projection of the efficient score functions for $\theta$. Partition $\theta$ into $\theta = (\beta, \theta_2)$, where $\theta_2 = (\alpha_0, \alpha_1, \gamma, \sigma)$, and partition the efficient information matrix $\tilde I_0$ for $\theta$ into four submatrices accordingly. The efficient score function for $\beta$ is the first coordinate of $\tilde\ell_{\theta,\eta}$ minus its projection on the remaining coordinates of $\tilde\ell_{\theta,\eta}$. Thus we define

$a_0^T = \bigl(1, -\tilde I_{0,12}(\tilde I_{0,22})^{-1}\bigr),$
$d\eta_t(\theta, \eta) = \bigl(1 + (\beta - t)\,a_0^T(h_0 - \eta h_0)\bigr)\,d\eta,$
$\theta_t(\theta, \eta) = \theta + (t - \beta)\,a_0,$

where $\eta h = \int h\,d\eta$ and $\eta_0 h_0 = 0$. Since $h_0$ is bounded, $\eta_t(\theta, \eta)$ has a positive density with respect to $\eta$ for every sufficiently small $|\beta - t|$ and hence defines an element of the parameter set for $\eta$. Now we use the least favorable path

$t \mapsto \bigl(\theta_t(\theta, \eta)_2,\ \eta_t(\theta, \eta)\bigr)$

in the parameter space for the nuisance parameter $(\theta_2, \eta)$. This leads to $\ell(t, \beta, \theta_2, \eta) = \log l\bigl(\theta_t(\theta, \eta), \eta_t(\theta, \eta)\bigr)$. This submodel is least favorable at $(\theta_0, \eta_0)$ in that (3.2) is satisfied in the form

$\frac{\partial}{\partial t}\Big|_{t=\beta_0}\ell\bigl(t, \beta_0, (\theta_0)_2, \eta_0\bigr) = a_0^T\tilde\ell_0,$

where

$\tilde\ell_0(x) = \dot\ell_{\theta_0,\eta_0}(x) - A_{\theta_0,\eta_0}h_0(x).$

The function $\tilde\ell_0$ is the efficient influence function for the parameter $\theta$ in the presence of the nuisance parameter $\eta$, while the function $a_0^T\tilde\ell_0$ is the efficient score function for $\beta$ in the presence of the nuisance parameter $(\theta_2, \eta)$, both evaluated at $(\theta_0, \eta_0)$. (Also cf. Section 7 of Murphy and van der Vaart (1997b).)

In this model, the distribution function is estimable at $\sqrt{n}$-rate relative to a variety of norms. For instance, for $\mathcal{H}$ the set of functions $h: \mathcal{Z} \mapsto \mathbb{R}$ that are uniformly bounded by 1 and uniformly Lipschitz with Lipschitz constant 1, Murphy and van der Vaart (1997b) show that

$\sup_{h\in\mathcal{H}}\Bigl|\int h(z)\,d(\hat\eta_{\tilde\beta} - \eta_0)(z)\Bigr| + \|\hat\eta_{\tilde\beta} - \eta_0\| = O_P\bigl(|\tilde\beta - \beta_0| + n^{-1/2}\bigr).$

To see that this is sufficient for the verification of equation (3.4), we note that for every function of the form $g(x) = g_1(y_C, z_C) + g_2(y_R)$,

$(P_{\theta_0,\eta} - P_0)g = \int\!\!\int\bigl(g_1(y, z) + g_2(y)\bigr)p_{\theta_0}(y\mid z)\,d\mu(y)\,d(\eta - \eta_0)(z),$
$P_0\,g\Bigl[\frac{p_{\theta_0,\eta} - p_0}{p_0} - A_0\Bigl(\frac{d\eta - d\eta_0}{d\eta_0}\Bigr)\Bigr] = 0.$

These equations allow us to show that the bias $P_0\,\dot\ell\bigl(\beta_0, \beta_0, (\theta_2)_0, \eta\bigr)$ is suitably small in its dependence on $\eta = \hat\eta_{\tilde\beta}$ by the argument of (3.8). For a complete verification of (3.4) we would next study the dependence of $P_0\,\dot\ell(\beta_0, \beta, \theta_2, \eta)$ on $(\beta, \theta_2)$, but this is straightforward, because it only involves Euclidean parameters. (Cf. (3.7).) The second equation in the preceding display shows that the second term in (3.8) (adapted to the present definition of the score operator $A_0$) vanishes, a consequence of the linearity of the model in $\eta$. To show that the first term is suitably small (for $\beta = \beta_0$), we apply the first equation with $g = \dot\ell(\beta_0, \beta_0, \eta) - \dot\ell(\beta_0, \beta_0, \eta_0)$ and $\eta = \hat\eta_{\tilde\beta}$. The bounded Lipschitz norms of the functions $z \mapsto \int\bigl(g_1(y, z) + g_2(y)\bigr)p_{\theta_0}(y\mid z)\,d\mu(y)$ (the inner integrals in the first equation) can be shown to converge to zero in probability. Therefore, these functions are contained in $o_P(1)\mathcal{H}$, and we may bound the double integral by $o_P(1)\sup_{h\in\mathcal{H}}\bigl|\int h\,d(\eta - \eta_0)\bigr|$, which is of the desired order $o_P\bigl(|\tilde\beta - \beta_0| + n^{-1/2}\bigr)$ (still for $\eta = \hat\eta_{\tilde\beta}$).

4.3. Shared Gamma Frailty Model

In the frailty model, we observe event times for subjects occurring in groups, such as twins or litters. To allow for a positive intra-group correlation in the subjects' event times, subjects in the same group are assumed to share a common unobservable variable $G$, called the "frailty". For a given group the observations are $X = \{N_j(t), Y_j(t), Z_j(t);\ j = 1, \ldots, J,\ 0 \le t \le \tau\}$, where the $N_j$, $j = 1, \ldots, J$, count the events for each of the at most $J$ subjects in the group and form a multivariate counting process. The observation interval $[0, \tau]$ is assumed finite. The censoring process $Y_j$ is nonincreasing, left continuous with right-hand limits (caglad), and takes values in $\{0, 1\}$. The $j$th subject can be observed to experience an event at time $t$ only if $Y_j(t) = 1$. The $d$-dimensional covariate process $Z_j$ is also caglad and is assumed to be a.s. uniformly bounded and of uniformly bounded variation. Let $\mathcal{F}^{\mathrm{com}}_t$ be the "complete data" sigma field generated by $\{G, N_j(s), Y_j(s), Z_j(s);\ j = 1, \ldots, J,\ 0 \le s \le t\}$. Given $G$ and $Z$, the intensity of $N_j$ with respect to the sequence of complete data sigma fields, indexed by $t$, is assumed to follow a proportional hazards model,

$G\,\exp\bigl(\beta^T Z_j(\cdot)\bigr)\,Y_j(\cdot)\,\dot\Lambda(\cdot),$

where $\beta$ is a $d$-dimensional vector of regression coefficients and $\dot\Lambda$ is the baseline hazard function. The unobserved frailty $G$ follows a gamma distribution with mean one and variance $\sigma$. We assume that, given $G$ and $Z$, the censoring is independent and, furthermore, both the censoring and covariate processes are noninformative of $G$. See Nielsen et al. (1992) for more discussion concerning the censoring. The terms independent and noninformative are discussed by Andersen et al. (1993). The key consequence of the assumed noninformative censoring and covariate processes is that the likelihood for one group's complete data $(X, G)$ is a function of $G$ only through the partial likelihood

$\prod_{j=1}^J\Bigl\{\prod_{t\le\tau}\bigl(G\exp(\beta^T Z_j(t))Y_j(t)\dot\Lambda(t)\bigr)^{\Delta N_j(t)}\,e^{-G\int_0^\tau\exp(\beta^T Z_j(s))Y_j(s)\dot\Lambda(s)\,ds}\Bigr\}\,p(G; \sigma),$

where $p(\cdot\,; \sigma)$ is a gamma density corresponding to a mean of one and a variance of $\sigma$.

Since $G$ is not observed, we integrate the complete data partial likelihood over $G$ to form the observed data partial likelihood,

$\frac{\prod_{j=1}^J\prod_{t\le\tau}\bigl((1 + \sigma N_\cdot(t-))\exp(\beta^T Z_j(t))Y_j(t)\dot\Lambda(t)\bigr)^{\Delta N_j(t)}}{\bigl(1 + \sigma\int_0^\tau\tilde Y^*_\beta(s)\dot\Lambda(s)\,ds\bigr)^{1/\sigma + N_\cdot(\tau)}},$

where $\tilde Y^*_\beta = \sum_j\exp(\beta^T Z_j)Y_j$ and $N_\cdot = \sum_j N_j$. Also, given $X$, $G$ has a gamma distribution with

$E(G\mid X) = \frac{1 + \sigma N_\cdot(\tau)}{1 + \sigma\int_0^\tau\tilde Y^*_\beta\,d\Lambda}, \qquad \mathrm{var}(G\mid X) = \sigma\,\frac{1 + \sigma N_\cdot(\tau)}{\bigl(1 + \sigma\int_0^\tau\tilde Y^*_\beta\,d\Lambda\bigr)^2}.$
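
The conditional moments just displayed are simple to evaluate; the following minimal sketch computes them for one group from its event count and cumulative intensity (function and argument names are illustrative assumptions).

```python
def frailty_posterior(sigma, n_events, cum_intensity):
    # Conditional mean and variance of the gamma frailty G given a group's
    # observed data, per the display above:
    #   n_events      = N_.(tau)
    #   cum_intensity = integral_0^tau Ytilde*_beta dLambda
    denom = 1.0 + sigma * cum_intensity
    mean = (1.0 + sigma * n_events) / denom
    var = sigma * (1.0 + sigma * n_events) / denom ** 2
    return mean, var

# Example: a group with 2 events and cumulative intensity 1.5 under sigma = 0.8.
print(frailty_posterior(0.8, 2, 1.5))
```

These conditional moments reappear below in the information calculations via the missing information principle.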

This model has been considered by Nielsen et al. (1992), Murphy (1994, 1995) and Parner (1997a, 1997b). Nielsen et al. (1992) estimate standard errors by using the curvature of a parabola fitted to the log profile likelihood.

We consider estimation of the parameters $\theta = (\beta, \sigma)$ and $\Lambda(\cdot) = \int_0^{\cdot}\dot\Lambda(s)\,ds$ based on observation of $n$ i.i.d. copies of $X$. The parameter space for $\Lambda$ is the set of nondecreasing, nonnegative functions on $[0, \tau]$ with $\Lambda(0) = 0$; the parameter space for $\beta$ is $B$, a compact subset of $\mathbb{R}^d$; and lastly the parameter space for $\sigma$ is $[-\kappa/\Lambda(\tau), M]$ for a known $M$ and $\kappa = (J\max_{\beta,z}e^{\beta^T z})^{-1}$. Thus the parameter space for $\sigma$ is not the same for every value of $\Lambda$. The observed data partial likelihood has no maximizer. A convenient empirical likelihood is given by replacing $\dot\Lambda(t)$ by $\Lambda\{t\}$,

(4.3)    $l(\theta, \Lambda)(X) = \frac{\prod_{j=1}^J\prod_{t\le\tau}\bigl((1 + \sigma N_\cdot(t-))\exp(\beta^T Z_j(t))Y_j(t)\Lambda\{t\}\bigr)^{\Delta N_j(t)}}{\bigl(1 + \sigma\int_0^\tau\tilde Y^*_\beta(s)\,d\Lambda(s)\bigr)^{1/\sigma + N_\cdot(\tau)}}.$

This is not the only possible extension; see Andersen et al. (1993) and Murphy (1995).

The conditions of this paper may be checked under the assumptions that $\sigma_0 \in [0, M)$, $\beta_0$ belongs to the interior of the set $B$, $\Lambda_0(\tau) < \infty$ and $\Lambda_0$ is strictly increasing on $[0, \tau]$. Furthermore, assume that $P(Y_\cdot(\tau) > 0) > 0$, that $P(N_\cdot(\tau) \ge 2) > 0$, and that if, for a vector $c$ and a scalar $c_0$, $P\bigl(c^T Z_j(0+)Y_j(0) = c_0 Y_j(0),\ j = 1, \ldots, J\bigr) = 1$, then $c = 0$.

The maximum likelihood estimator $(\hat\theta, \hat\Lambda)$ for this model was shown to be asymptotically consistent (with respect to the uniform norm) and normal by Parner (1997a, 1997b) in the more general correlated frailty model, but under similar conditions.

The score function for $\theta$, the derivative of the log empirical likelihood with respect to $\theta$, is given by $\dot\ell_{\theta,\Lambda} = (\dot\ell_{1,\theta,\Lambda}^T, \dot\ell_{2,\theta,\Lambda})^T$, where

$\dot\ell_{1,\theta,\Lambda}(X) = \sum_j\int Z_j\,dN_j - \frac{1 + \sigma N_\cdot(\tau)}{1 + \sigma\int_0^\tau\tilde Y^*_\beta\,d\Lambda}\int_0^\tau\sum_j e^{\beta^T Z_j}Y_j Z_j\,d\Lambda$

and

$\dot\ell_{2,\theta,\Lambda}(X) = \int_0^\tau\frac{N_\cdot(s-)}{1 + \sigma N_\cdot(s-)}\,dN_\cdot(s) + \sigma^{-2}\Bigl(\log\bigl(1 + \sigma\int_0^\tau\tilde Y^*_\beta\,d\Lambda\bigr) - \frac{1 + \sigma N_\cdot(\tau)}{1 + \sigma\int_0^\tau\tilde Y^*_\beta\,d\Lambda}\,\sigma\int_0^\tau\tilde Y^*_\beta\,d\Lambda\Bigr).$

The score operator for $\Lambda$ in the direction $h$ (a bounded function) is

$A_{\theta,\Lambda}h(X) = \int h\,dN_\cdot - \frac{1 + \sigma N_\cdot(\tau)}{1 + \sigma\int_0^\tau\tilde Y^*_\beta\,d\Lambda}\int_0^\tau\tilde Y^*_\beta\,h\,d\Lambda.$

Since the adjoint of $A_{\theta,\Lambda}$ satisfies the equality

$P_{\theta,\Lambda}(A_{\theta,\Lambda}h)^2 = \langle A_{\theta,\Lambda}h, A_{\theta,\Lambda}h\rangle_{P_{\theta,\Lambda}} = \langle h, A^*_{\theta,\Lambda}A_{\theta,\Lambda}h\rangle_\Lambda$

for $h$ a bounded function, the variance of the score operator for $\Lambda$ yields a version of $A^*_{\theta,\Lambda}A_{\theta,\Lambda}$. Following Woodbury's (1971) missing information principle, we have that the variance of the observed data score is given by the variance of the complete data score minus the expected value of the conditional variance of the complete data score given the observed data. (Also cf. Bickel et al. (1993).) The complete data score is given by $\int h\,(dN_\cdot - G\tilde Y^*_\beta\,d\Lambda)$. So

$P_{\theta,\Lambda}(A_{\theta,\Lambda}h)^2 = \mathrm{var}\Bigl(\int h\,(dN_\cdot - G\tilde Y^*_\beta\,d\Lambda)\Bigr) - P_{\theta,\Lambda}\Bigl(\mathrm{var}(G\mid X)\Bigl(\int h\,\tilde Y^*_\beta\,d\Lambda\Bigr)^2\Bigr) = P_{\theta,\Lambda}\Biggl(\int h^2\,G\,\tilde Y^*_\beta\,d\Lambda - \sigma\,\frac{1 + \sigma N_\cdot(\tau)}{\bigl(1 + \sigma\int_0^\tau\tilde Y^*_\beta\,d\Lambda\bigr)^2}\Bigl(\int h\,\tilde Y^*_\beta\,d\Lambda\Bigr)^2\Biggr).$

From this we see that

$A^*_{\theta,\Lambda}A_{\theta,\Lambda}h(t) = P_{\theta,\Lambda}\Biggl(h(t)\,\frac{1 + \sigma N_\cdot(\tau)}{1 + \sigma\int_0^\tau\tilde Y^*_\beta\,d\Lambda}\,\tilde Y^*_\beta(t) - \sigma\,\frac{1 + \sigma N_\cdot(\tau)}{\bigl(1 + \sigma\int_0^\tau\tilde Y^*_\beta\,d\Lambda\bigr)^2}\int_0^\tau h\,\tilde Y^*_\beta\,d\Lambda\ \tilde Y^*_\beta(t)\Biggr).$

Similarly we may derive $A^*_{\theta,\Lambda}\dot\ell_{\theta,\Lambda}$ by considering the covariance of $A_{\theta,\Lambda}h$ with $\dot\ell_{\theta,\Lambda}$ and using Woodbury's missing information principle. This results in

$A^*_{\theta,\Lambda}\dot\ell_{1,\theta,\Lambda}(t) = P_{\theta,\Lambda}\Biggl(\tilde Y^{*\prime}_\beta(t)\,\frac{1 + \sigma N_\cdot(\tau)}{1 + \sigma\int_0^\tau\tilde Y^*_\beta\,d\Lambda} - \tilde Y^*_\beta(t)\,\frac{1 + \sigma N_\cdot(\tau)}{\bigl(1 + \sigma\int_0^\tau\tilde Y^*_\beta\,d\Lambda\bigr)^2}\,\sigma\int_0^\tau\tilde Y^{*\prime}_\beta\,d\Lambda\Biggr)$

and

$A^*_{\theta,\Lambda}\dot\ell_{2,\theta,\Lambda}(t) = P_{\theta,\Lambda}\Biggl(\tilde Y^*_\beta(t)\,\frac{N_\cdot(\tau) - \int_0^\tau\tilde Y^*_\beta\,d\Lambda}{\bigl(1 + \sigma\int_0^\tau\tilde Y^*_\beta\,d\Lambda\bigr)^2}\Biggr),$

where $\tilde Y^{*\prime}_\beta(t) = \sum_j Z_j(t)e^{\beta^T Z_j(t)}Y_j(t)$. Parner (1997a, 1997b) shows that $A^*_{\theta_0,\Lambda_0}A_{\theta_0,\Lambda_0}$ is continuously invertible as an operator on the space of bounded functions of bounded variation (this may also be proved via the methods used by Murphy, Rossini and van der Vaart (1996)). Since $A^*_{\theta_0,\Lambda_0}\dot\ell_{\theta_0,\Lambda_0}$ is a vector of bounded functions of bounded variation, the "least favorable direction" for the estimation of $\theta$ in the presence of the unknown $\Lambda$ is $h_0 = \bigl(A^*_{\theta_0,\Lambda_0}A_{\theta_0,\Lambda_0}\bigr)^{-1}A^*_{\theta_0,\Lambda_0}\dot\ell_{\theta_0,\Lambda_0}$.

The theorems are applied with, for $t$ a vector in $\mathbb{R}^{d+1}$,

$\Lambda_t(\theta, \Lambda) = \Lambda + (\theta - t)^T\int_0^{\cdot}h_0\,d\Lambda,$
$\ell(t, \theta, \Lambda) = \log l\bigl(t, \Lambda_t(\theta, \Lambda)\bigr).$

Since $h_0$ is bounded, the density $1 + (\theta - t)^T h_0$ of $\Lambda_t(\theta, \Lambda)$ with respect to $\Lambda$ is positive for sufficiently small $|\theta - t|$. In this case $\Lambda_t(\theta, \Lambda)$ defines a nondecreasing function and is a true parameter of the model.

The cumulative baseline hazard $\Lambda$ is estimable at $\sqrt{n}$-rate in the supremum norm. Suppose that $\tilde\theta_n \xrightarrow{P} \theta_0$. Uniform consistency of $\hat\Lambda_{\tilde\theta_n}$ may be proved by minor adaptations of the consistency proofs given by Murphy (1994) and Parner (1997a). To show that

$\sup_{0\le t\le\tau}\bigl|\hat\Lambda_{\tilde\theta_n}(t) - \Lambda_0(t)\bigr| = O_P\bigl(\|\tilde\theta_n - \theta_0\| + n^{-1/2}\bigr),$

we may verify the conditions of Theorem 3.1 in Murphy and van der Vaart (1997c). These conditions (continuous invertibility of the information operator and that the score functions belong to a Donsker class) were verified by Parner (1997b). This result permits the verification of (3.4).

4.4. A Mixture Model

This is another version of the frailty model, with the group size equal to two. The observations are a random sample of pairs of event times. As before, we allow for intra-group correlation by assuming that every pair of event times shares the same unobserved frailty $G$. Given $G$, the two event times $U$ and $V$ are assumed to be independent and exponentially distributed with hazard rates $G$ and $\theta G$, respectively. The parameter $\theta$ is assumed to belong to $(0, \infty)$. In contrast to the gamma frailty model, the distribution $\eta$ of $G$ is a completely unknown distribution on $(0, \infty)$. The observations are $n$ i.i.d. copies of $X = (U, V)$ from the density

$p_{\theta,\eta}(x) = \int g e^{-gu}\,\theta g e^{-\theta g v}\,d\eta(g).$

This forms the likelihood $l(\theta, \eta)$ for estimation. The conditions of this paper may be checked under the assumption that $\int(g^2 + g^{-6.5})\,d\eta_0(g) < \infty$. Asymptotic normality and efficiency of $\hat\theta$ is shown by van der Vaart (1996). Murphy and van der Vaart (1997) consider likelihood ratio tests concerning $\theta$.

The score function for $\theta$, the derivative of the log density with respect to $\theta$, is given by

$\dot\ell_{\theta,\eta}(x) = \frac{\int(\theta^{-1} - gv)\,g^2 e^{-g(u + \theta v)}\,d\eta(g)}{\int g^2 e^{-g(u + \theta v)}\,d\eta(g)}.$

The conditional expectation $E_{\theta,\eta}\bigl(\dot\ell_{\theta,\eta}(U, V)\mid U + \theta V = u + \theta v\bigr)$ minimizes the distance to $\dot\ell_{\theta,\eta}(u, v)$ over all functions of $u + \theta v$. Since the statistic $U + \theta V$ is sufficient for $\eta$, all score functions for $\eta$ are functions of $u + \theta v$. Therefore, if it is an $\eta$-score, then the conditional expectation must be the closest $\eta$-score. In that case the efficient score function $\tilde\ell_0$ is given by

$\tilde\ell_0(u, v) = \dot\ell_{\theta_0,\eta_0}(u, v) - E_{\theta_0,\eta_0}\bigl(\dot\ell_{\theta_0,\eta_0}(U, V)\mid U + \theta_0 V = u + \theta_0 v\bigr) = \frac{\int\tfrac12(u - \theta_0 v)\,g^3 e^{-g(u + \theta_0 v)}\,d\eta_0(g)}{\int\theta_0\,g^2 e^{-g(u + \theta_0 v)}\,d\eta_0(g)}.$
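
The displayed efficient score has a simple closed form once $\eta_0$ is discrete; the sketch below evaluates it and checks by Monte Carlo that it has mean zero under the true parameters, the conditional-score property used below to verify (3.4). The two-point $\eta_0$, the parameter values, and all names are illustrative assumptions.

```python
import numpy as np

def efficient_score(u, v, theta0, support, weights):
    # Efficient score for theta in the mixture model, for a discrete eta_0 with
    # the given support points and weights, following the display above:
    #   numerator   = int 0.5*(u - theta0*v) g^3 exp(-g(u + theta0*v)) d eta_0(g)
    #   denominator = int theta0 g^2 exp(-g(u + theta0*v)) d eta_0(g)
    g = np.asarray(support, dtype=float)
    w = np.asarray(weights, dtype=float)
    s = u + theta0 * v
    num = np.sum(w * 0.5 * (u - theta0 * v) * g ** 3 * np.exp(-g * s))
    den = np.sum(w * theta0 * g ** 2 * np.exp(-g * s))
    return num / den

# Monte Carlo check of the zero-mean property under the true (theta0, eta_0).
rng = np.random.default_rng(2)
support, weights, theta0 = np.array([0.5, 1.5]), np.array([0.4, 0.6]), 2.0
G = rng.choice(support, p=weights, size=20_000)
U = rng.exponential(1.0 / G)                 # hazard rate G
V = rng.exponential(1.0 / (theta0 * G))      # hazard rate theta0 * G
print(np.mean([efficient_score(u, v, theta0, support, weights) for u, v in zip(U, V)]))
```

The printed average should be close to zero; the same cancellation holds with any $\eta$ in place of $\eta_0$ in the score, which is what makes the no-bias condition (3.4) automatic in this example.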

To verify that the conditional expectation is an $\eta$-score, it suffices to verify that $\tilde\ell_0$ is a score. For $B$ a measurable set in $(0, \infty)$, define the path

$\eta_t(\theta, \eta)(B) = \eta\Bigl(B\Bigl(1 + \frac{\theta - t}{2}\Bigr)^{-1}\Bigr).$

Whenever $|\theta - t| < 2$ this is a distribution on the positive real line. By an easy calculation we see that

$\tilde\ell_{\theta,\eta}(x) = \frac{\partial}{\partial t}\Big|_{t=\theta}\log p_{t,\eta_t(\theta,\eta)}(x).$

This, in addition to the preceding equations, implies that the path $t \mapsto \eta_t(\theta, \eta)$ indexes a least favorable submodel for $\theta$.

Consistency of $\hat\eta_{\tilde\theta}$ (with respect to the weak topology) can be shown by minor adaptations of the classical proof of Kiefer and Wolfowitz (1956). The function $\dot\ell(\theta_0, \theta_0, \eta_0)(x) = \frac{\partial}{\partial t}\log l\bigl(t, \eta_t(\theta_0, \eta_0)\bigr)(x)\big|_{t=\theta_0}$ equals the efficient score function $\tilde\ell_0(x)$. From the representation of the efficient score function as a conditional score, we see that $P_{\theta_0,\eta_0}\tilde\ell_{\theta_0,\eta} = 0$ for every $\theta_0$, $\eta_0$ and $\eta$; this is more than sufficient to verify (3.4). This rather nice property allows us to avoid proving a rate of convergence for the estimator $\hat\eta_{\tilde\theta}$. (This is fortunate, because it follows from van der Vaart (1991) that the information for estimating the distribution function $\eta(0, b]$ is zero, so that the rate of convergence of the best estimators is less than the square root of $n$.)

REFERENCES

[1] Andersen, P.K., Borgan, O., Gill, R.D. and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer Verlag, New York.
[2] Andersen, P.K. and Gill, R.D. (1982). Cox's regression model for counting processes: a large sample study. Annals of Statistics 10, 1100-1120.
[3] Barndorff-Nielsen, O.E. and Cox, D.R. (1994). Inference and Asymptotics. Chapman & Hall, London.
[4] Bickel, P., Klaassen, C., Ritov, Y. and Wellner, J. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore.
[5] Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields 97, 113-150.
[6] Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. Festschrift for Lucien Le Cam, 55-87. Springer, New York.
[7] Chen, J.S. and Jennrich, R.I. (1996). The signed root deviance profile and confidence intervals in maximum likelihood analysis. Journal of the American Statistical Association 91, 993-998.

[8] Cramér, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton.
[9] Donald, S.G. and Newey, W.K. (1994). Series estimation of semilinear models. Journal of Multivariate Analysis 50, 30-40.
[10] van de Geer, S.A. (1995). The method of sieves and minimum contrast estimators. Math. Methods Statist. 4, 20-38.
[11] van de Geer, S.A. (1993). Hellinger-consistency of certain nonparametric maximum likelihood estimators. Annals of Statistics 21, 14-44.
[12] Groeneboom, P. and Wellner, J.A. (1992). Information Bounds and Nonparametric Maximum Likelihood Estimation. Birkhäuser, Basel.
[13] Huang, J. (1996). Efficient estimation for the Cox model with interval censoring. Annals of Statistics 24, 540-568.
[14] Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many nuisance parameters. Annals of Mathematical Statistics 27, 887-906.
[15] Klaassen, C.A.J. (1987). Consistent estimation of the influence function of locally efficient estimates. Annals of Statistics 15, 617-627.
[16] Le Cam, L. and Yang, G.L. (1990). Asymptotics in Statistics. Springer-Verlag, New York.
[17] Murphy, S.A. (1994). Consistency in a proportional hazards model incorporating a random effect. Annals of Statistics 22, 712-731.
[18] Murphy, S.A. (1995). Asymptotic theory for the frailty model. Annals of Statistics 23, 182-198.
[19] Murphy, S.A. and Van der Vaart, A.W. (1997d). MLE in the proportional odds model. Journal of the American Statistical Association 92, 968-976.
[20] Murphy, S.A. and Van der Vaart, A.W. (1997). Semiparametric likelihood ratio inference. Annals of Statistics 25, 1471-1509.
[21] Murphy, S.A. and Van der Vaart, A.W. (1996c). Observed information in semiparametric models. Preprint.
[22] Murphy, S.A. and van der Vaart, A.W. (1997b). Semiparametric mixtures in case-control studies. Preprint.
[23] Nielsen, G.G., Gill, R.D., Andersen, P.K. and Sorensen, T.I. (1992). A counting process approach to maximum likelihood estimation in frailty models. Scandinavian Journal of Statistics 19, 25-44.
[24] Owen, A.B. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75, 237-249.

[25] Parner, E. (1997a). Consistency in the correlated gamma-frailty model. Preprint.
[26] Parner, E. (1997b). Asymptotic normality in the correlated gamma-frailty model. Preprint.
[27] Patefield, W.M. (1977). On the maximized likelihood function. Sankhyā, Ser. B 39, 92-96.
[28] Roeder, K., Carroll, R.J. and Lindsay, B.G. (1996). A semiparametric mixture approach to case-control studies with errors in covariables. Journal of the American Statistical Association 91, 722-732.
[29] Severini, T.A. and Wong, W.H. (1992). Profile likelihood and conditionally parametric models. Annals of Statistics 20, 1768-1802.
[30] Tsiatis, A.A. (1981). A large sample study of Cox's regression model. Annals of Statistics 9, 93-108.
[31] Van der Vaart, A.W. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes. Springer Verlag, New York.
[32] Van der Vaart, A.W. (1991). On differentiable functionals. Annals of Statistics 19, 178-204.
[33] Van der Vaart, A.W. (1996). Efficient estimation in semiparametric models. Annals of Statistics 24, 862-878.
[34] Wong, W.H. and Shen, X. (1995). A probability inequality for the likelihood surface and convergence rate of the maximum likelihood estimator. Annals of Statistics 23, 603-632.
[35] Woodbury, M.A. (1971). Discussion of paper by Hartley and Hocking. Biometrics 27, 808-817.

S. A. Murphy
Department of Statistics
University of Michigan
1440 Mason Hall
Ann Arbor, MI 48109-1027, USA

A.W. van der Vaart
Department of Mathematics
Free University
De Boelelaan 1081a
1081 HV Amsterdam, The Netherlands