
Estimation of regression parameters in missing data problems

Don L. McLeish and Cyntha Struthers

Key words and phrases: Linear regression; missing covariates; missing data; two-stage studies.

MSC 2000: Primary 62N02; secondary 62J99.

Abstract: Suppose Y is a response variable, possibly multivariate, with a density function f(y|x, v; β) conditional on the covariates (x, v), where x and v are vectors and β is a vector of unknown parameters. The authors consider the problem of estimating β when data on the covariate vector v are available for all observations while data on the covariate x are missing at random. They compare several estimators with the profile estimator with respect to bias and standard deviation when the response and covariates are discrete or continuous.

1. INTRODUCTION

Suppose Y is a response variable, possibly multivariate, with a density function f(y|x, v; β) where x and v are vectors of covariates and β is a vector of unknown parameters. A problem recently considered by many authors (Chatterjee, Chen & Breslow 2003; Lawless, Kalbfleisch & Wild 1999; Reilly & Pepe 1995; Robins, Hsieh & Newey 1994; Robins, Rotnitsky & Zhao 1995; Carroll & Wand 1991; Pepe & Fleming 1991; Lipsitz, Ibrahim & Zhao 1999; Chen 2004) is one in which data on the covariate vector v are available for all observations while data on the covariate x are missing for some of the observations. The set of observations for which both x and v are available is referred to as the validation sample. The objective is to estimate the vector of parameters β for such data. We are interested in examples in which Y, X and V are continuous random variables, since previous papers have focused on the case in which Y was dichotomous and V was discrete with a small number of possible values, or the case in which, although the data are continuous, the data have been partitioned into a finite number of strata. Lawless et al. (1999) discuss a study which examined the risk factors for babies born 'small for gestational age'. In this example the continuous response variable is birth weight. Covariates such as gestational age and routinely collected hospital data are available for all babies while other study covariates are only available for babies in the validation sample.

We assume that f(y|x, v; β) is known up to a vector of unknown parameters β. Let ∆ be an indicator variable such that ∆ = 1 if an observation is in the validation set, and thus Y, X and V are observed, and ∆ = 0 if an observation is in the non-validation set, in which case Y and V are observed but X is unobserved. The observations are assumed to be 'missing at random' (Little & Rubin 2002), that is, E(∆|Y, X, V) = π(Y, V) where π is a known function depending only on the observable quantities (Y, V). If π is unknown then it could be replaced by a consistent estimator \hat{π}. We assume throughout that, with probability one,

E_β(∆|X, V) = η(X, V, β) > 0.  (1)


It will be convenient to adopt the following convention for marginal/conditional probability (density) functions. We write f_X(x), f_V(v), f_{X|V}(x|v), etc. unless there is no confusion, in which case we write f(x), f(v), f(x|v), etc. We adopt a similar convention for cumulative distribution functions, which will be denoted by capital letters. For expectations of random variables we will use an integral even if the random variable is discrete.

2. ESTIMATING FUNCTIONS

Suppose, given complete observations (∆_i, y_i, x_i, v_i), i = 1, ..., N, we estimate a parameter β using an unbiased estimating function of the form

\sum_{i=1}^{N} S(y_i | x_i, v_i, β)  (2)

where S satisfies E_β[S(Y|X, V, β)] = 0 for all β. The most common choice for S is the conditional score function for estimating β,

S(y|x, v, β) = \frac{∂}{∂β} \log f(y|x, v; β).  (3)

For simplicity we assume that S(y|x, v, β) is given by (3), although any other conditionally unbiased estimating function could be used instead. We call (∆_i, y_i, ∆_i x_i, v_i), i = 1, ..., N the 'incomplete' data and assume that these are independent and identically distributed observations. Conditioning (2) on the observed data we obtain the following estimating function for β:

ψ(β) = \sum_{i=1}^{N} \{∆_i S(y_i|x_i, v_i, β) + (1 − ∆_i) E_β[S(Y|X, V, β) | y_i, v_i]\}.  (4)

For (4) to be the correct conditional expectation of (2) we need to assume that ∆_i and x_i are conditionally independent given (y_i, v_i). This is essentially the 'missing at random' assumption.

Robins et al. (1994), Lipsitz et al. (1999) and Chen (2004) have considered a weighted estimating function given by

\sum_{i=1}^{N} \left\{ \frac{∆_i}{π(y_i, v_i)} S(y_i|x_i, v_i, β) + \left[1 − \frac{∆_i}{π(y_i, v_i)}\right] E_β[S(Y|X, V, β) | y_i, v_i] \right\}.  (5)

This estimating function is unbiased if either f(x|v), the conditional probability density function of X given v, is misspecified in the calculation of the term E[S(Y|X, V, β)|y_i, v_i], or the function π(y_i, v_i) is incorrectly specified, but not both. This property, which is called double robustness, does not obtain without a price. Comparing (5) and (4), it is apparent that the difference is in the weights attached to the two sources of information: the information from the validated observations and f(x|y, v), and the information from all the observations and f(y|v). In (5) the expected weight on the former is one and on the latter is zero, and more information is drawn from the validated observations. Thus when π(y_i, v_i) is small, we might expect (5) to lose efficiency when there is a large number of missing observations, since the contribution of the missing observations is under-represented relative to (4). Note also that the unbiasedness of (5) holds provided that f(x|v) is misspecified but nonrandom. It does not hold, for example, if we replace E[S(Y|X, V, β)|y_i, v_i] by an estimator, so the question of potential bias persists. Many authors use an inefficient estimator of f(x|v) based on the complete data only. Lipsitz et al. (1999) use the EM algorithm to estimate all parameters in this model, including those governing the missing data mechanism π, and f(x|v).


The term E_β[S(Y|X, V, β)|y_i, v_i] in (5) or (4) can be written as

E_β[S(Y|X, V, β)|y_i, v_i] = \int S(y_i|x, v_i, β) \frac{f(y_i|x, v_i; β)}{f(y_i|v_i; β)} f(x|v_i)\, dx  (6)

where f(y|v; β) = \int f(y|x, v; β) f(x|v)\, dx = E[f(y|X, v; β)|V = v]. This conditional expectation can be evaluated either analytically or numerically if f(x|v) is known, and the parametric maximum likelihood estimate of β is obtained by solving ψ(β) = 0. If f(x|v) is unknown then the conditional expectation is unknown and various approaches have been suggested to deal with this problem. One approach is to estimate the conditional expectation by using an empirical estimator of F(x|v), the conditional cumulative distribution function of X given v (see Pepe & Fleming 1991; Carroll & Wand 1991). When S is the score function, Lawless et al. (1999) call this 'estimated' pseudolikelihood. This approach is useful for discrete v primarily because it is easy to estimate the conditional distribution f(x|v), or more precisely, conditional expectations of the form E[h(x)|v]. If v is continuous (or discrete but with many possible values), consistent estimation of E[h(x)|v] is still possible, but requires smoothing across neighbouring values of v since there is at most one validated observation x corresponding to a given v. Typically this smoothing operation introduces bias in the estimating function and the estimators.

The simplest estimator for β is obtained by using only the validated observations in the first term of (4). Note that since

E[∆ S(Y|X, V, β)] = \frac{∂}{∂β} E_β[π(Y, V)],

this approach leads to an unbiased estimating function only if the expected proportion of validated observations does not depend on the parameter β. This is true, for example, if the expected proportion of validated observations does not change with Y.

Now the asymptotic variance of an estimator obtained by solving an unbiased estimating equation ψ(β) = 0 depends not only on the function ψ but also on the derivative of its expectation with respect to β, evaluated at the true value β_0 of the parameter (see, for example, Small & McLeish 1994). Typically, under some regularity conditions,

var(\hat{β}) \sim \frac{var_{β_0}[ψ(β_0)]}{\{E_{β_0}[ψ'(β_0)]\}^2},  (7)

so that the efficiency depends both on the estimating function and its derivative. If we wish to attempt to achieve asymptotic efficiency for estimation of β relative to the case when f(x|v) is known, we must estimate both E_β[S(Y|X, V, β)|Y, V] and its derivative with sufficient precision. That is, we require consistent estimators which converge sufficiently rapidly to result in the same asymptotic behaviour of the numerator and denominator of (7). To see this we note that E_β[S(Y|X, V, β)|y, v] depends on the parameter β both as an argument to the function S(Y|X, V, β) and as a parameter underlying the expectation, and thus

\frac{∂}{∂β} E_β[S(Y|X, V, β)|y, v] = E_β\left[\frac{∂}{∂β} S(Y|X, V, β) \,\Big|\, y, v\right] + var_β[S(Y|X, V, β)|y, v].  (8)

Suppose we replace the expectation E_β[S(Y|X, V, β)|y, v] in (4) by an estimate \hat{E}[·|y, v] to obtain an estimating function of the form

\sum_{i=1}^{N} \{∆_i S(y_i|x_i, v_i, β) + (1 − ∆_i) \hat{E}[S(Y|X, V, β)|y_i, v_i]\}.  (9)

Suppose also that \hat{E}[S(Y|X, V, β)|y, v] and \hat{E}[\frac{∂}{∂β} S(Y|X, V, β)|y, v] are consistent estimators of the terms E_β[S(Y|X, V, β)|y, v] and E_β[\frac{∂}{∂β} S(Y|X, V, β)|y, v] in (8). If the estimator \hat{E}[·|y, v] does not depend on the parameter β then, in view of the denominator of (7), the estimator \hat{β} obtained from (9) cannot hope to achieve full efficiency since

\frac{∂}{∂β} \hat{E}[S(Y|X, V, β)|y, v] = \hat{E}\left[\frac{∂}{∂β} S(Y|X, V, β) \,\Big|\, y, v\right]

and the variance term in (8) is omitted. In particular the estimator \hat{E}[·|y, v] based on the empirical distribution of X given (y, v) does not lead to a fully efficient estimator of β.

In our approach to the problem, we first note that the evaluation of (6) is often not feasible for each iteration of an algorithm for solving ψ(β) = 0 and for each missing observation. Instead we consider the following methods, which are easier to implement:

1. Evaluate the integral (6) by Monte Carlo, by generating X from a suitable family of distributions and averaging the results.

2. Evaluate the integral (6) using the observed (validated) values of X.

We will see that these two methods are closely related.

We consider estimators of E_β[S(Y|X, V, β)|y, v] that can be expressed in terms of a weight function W(X, U, y, v, β), where U = (V, ∆) and W is chosen so that for an arbitrary function H

E_β[H(Y, X, V, β)|Y = y, V = v] = E_β[W(X, U, y, v, β) H(y, X, v, β)].  (10)

Since the operation E(∗|Y = y, V = v) is a linear functional on a Hilbert space of functions H, the existence of such a function W is given by the Riesz representation theorem of Hilbert space theory (see Halmos 1951). Hence we call W a 'Riesz' factor representing the linear functional E(∗|Y = y, V = v). When probability densities exist and are such that W is square integrable, the canonical property that W must have so that (10) holds is

E_β[W(X, U, y, v, β)|X] = \frac{f(y|X, v; β)}{f(y|v; β)} \frac{f(X|v)}{f_X(X)}.  (11)

The importance of (10) is that it allows us to estimate a conditional expectation by using an estimator of an unconditional expectation. In other words, we can estimate E_β[H(Y, X, V, β)|y, v] using an average of the form

E_N[H(Y, X, V, β)|y, v] = \frac{1}{N} \sum_{j=1}^{N} H(y, x_j, v, β) W(x_j, u_j, y, v, β)  (12)

or a self-normalized weighted average

E_N[H(Y, X, V, β)|y, v] = \frac{\sum_{j=1}^{N} H(y, x_j, v, β) W(x_j, u_j, y, v, β)}{\sum_{j=1}^{N} W(x_j, u_j, y, v, β)}  (13)

for a sample of size N. Note that if we let H = 1 in (10) we obtain E_β[W(X, U, y, v, β)] = 1, and if we replace the denominator of (13) by its expected value we obtain (12). The advantage of (13) is that if the weight W contains a function of (y, v, β), such as f(y|v; β), then this function cancels in the ratio. Self-normalized weights were used in the simulation study in Section 6.

Lemma 1. If the weight function W(X, U, y, v, β) satisfies (10) and E_N is given by (12) or (13), then for an arbitrary integrable function H,

E_N[H(Y, X, V, β)|y, v] → E_β[H(Y, X, V, β)|Y = y, V = v] a.s.


For method 2 the weights must be positive only for validated observations x_j, that is,

W(x_j, u_j, y_i, v_i, β) > 0 implies ∆_j = 1.  (14)

We call these the empirical weights and denote them W_E since they are only applied to observed values of x_j in the validation sample. Note that if the x_j values were simulated from some distribution, such as in method 1, then there is no such constraint on the weights. Suppose we replace the term E_β[S(Y|X, V, β)|y_i, v_i] in (4) by an estimator of the form (12) to obtain the estimating function for β

ψ_E(β) = \frac{1}{N−1} \sum_{i=1}^{N} \sum_{j=1, j≠i}^{N} T_{ij}(β)  (15)

where

T_{ij}(β) = ∆_i S(y_i|x_i, v_i, β) + (1 − ∆_i) W_E(x_j, u_j, y_i, v_i, β) S(y_i|x_j, v_i, β).  (16)

If we write (4) as ψ(β) = \sum_{i=1}^{N} τ_i(β) where

τ_i(β) = ∆_i S(y_i|x_i, v_i, β) + (1 − ∆_i) E_β[S(Y|X, V, β)|y_i, v_i]  (17)

then the following theorem shows that the T_{ij}'s and the τ_i's have similar properties.

Theorem 1. Suppose that the weight function W_E(x_j, u_j, y_i, v_i, β) satisfies (10) and (14). Let I(β) = var_β[τ_i(β)] be the Fisher information in observation i. Then the following hold:

(a) The estimating functions T_{ij}(β) are unbiased, that is,

E_β[T_{ij}(β)] = E_β[τ_i(β)] = 0

and consequently ψ_E(β) is an unbiased estimating function. In addition both T_{ij} and τ_i are information unbiased, that is,

E_β\left[\frac{∂}{∂β} T_{ij}(β)\right] = E_β\left[\frac{∂}{∂β} τ_i(β)\right] = −I(β).

(b) There exists \hat{β}_N, a consistent sequence of solutions of the estimating equation ψ_E(\hat{β}_N) = 0, such that \hat{β}_N is asymptotically normal with mean β and asymptotic variance

[N I(β)]^{−1} \left[1 + \frac{q(β)}{I(β)}\right] + o(N^{−1})

where

q(β) = var_β\{E[(1 − ∆_i) S(y_i|x_j, v_i, β) W(x_j, u_j, y_i, v_i, β) | x_j, v_j]\} ≥ 0, \quad i ≠ j.  (18)

(c) The optimal linear combination of the estimating functions T_{ij}(β) is ψ_E(β).

The asymptotic relative efficiency of the estimator of β obtained by solving ψ_E(β) = 0 compared to the maximum likelihood estimator of β obtained by solving ψ(β) = 0 and assuming f(x|v) is known is given by [1 + q(β)/I(β)]^{−1}. This efficiency is close to one if q(β) is close to zero, which occurs if the conditional expectation in (18) has little dependence on (x_j, v_j). It is this dependence which causes correlation among the values of T_{ij}(β).

To see the connection between methods 1 and 2 we first consider the implementation of method 1 using a weight function. Suppose that we use Monte Carlo to impute the missing values of X by simulation from an importance distribution (see Robert & Casella 2004) with probability density function h(x|y, v), where the choice of the importance distribution may depend on (y, v). For X chosen from h(x|y, v) we define the Monte Carlo weight function as

W_{MC}(X, y, v, β) = \frac{f(y|X, v; β)}{f(y|v; β)} \frac{f(X|v)}{h(X|y, v)}.  (19)

W_{MC} satisfies (10) since

E_β[W_{MC}(X, y, v, β) H(y, X, v, β)] = \int H(y, X, v, β) \frac{f(y|X, v; β)}{f(y|v; β)} \frac{f(X|v)}{h(X|y, v)} h(X|y, v)\, dX
= E_β[H(Y, X, V, β)|Y = y, V = v].
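A minimal sketch of this Monte Carlo construction, assuming user-supplied density functions and a sampler for h; all names here are placeholders rather than the authors' notation.

```python
import numpy as np

def mc_cond_expectation(H, y, v, beta, f_y_given_xv, f_y_given_v,
                        f_x_given_v, h_pdf, h_sampler, M=1000, rng=None):
    """Estimate E_beta[H(Y,X,V,beta) | y, v] by importance sampling:
    draw X ~ h(x|y,v) and average H against the weight (19)."""
    rng = rng or np.random.default_rng()
    x = h_sampler(y, v, M, rng)                    # M draws from h(x|y,v)
    w = (f_y_given_xv(y, x, v, beta) / f_y_given_v(y, v, beta)
         * f_x_given_v(x, v) / h_pdf(x, y, v))     # weight W_MC of (19)
    return np.mean(w * H(y, x, v, beta))
```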

The use of Monte Carlo weights is essentially equivalent to replacing the conditional expected value in (4) by a Monte Carlo integral obtained using importance sampling. Lipsitz et al. (1999) use a similar approach in the context of double robustness. The following result is similar to Theorem 1 but for Monte Carlo weights.

Theorem 2. Suppose that we generate M values X_{ij}, j = 1, ..., M independently for each non-validated pair (y_i, v_i) from the importance density h(x|y_i, v_i). Suppose also that the weight function W_{MC}(X_{ij}, y_i, v_i, β) is given by (19) and

T_{ij}(β) = ∆_i S(y_i|x_i, v_i, β) + (1 − ∆_i) W_{MC}(x_{ij}, y_i, v_i, β) S(y_i|x_{ij}, v_i, β).  (20)

Then

ψ_{MC}(β) = \frac{1}{M} \sum_{i=1}^{N} \sum_{j=1}^{M} T_{ij}(β)

is an unbiased estimating function which is the optimal linear combination of the estimating functions T_{ij}(β). Also, if \hat{β}_N is a consistent sequence of solutions of the estimating equation ψ_{MC}(\hat{β}_N) = 0 and if M → ∞ as N → ∞, then the asymptotic variance of \hat{β}_N is [N I(β)]^{−1}, so the estimator is fully efficient.

Note that if we choose the importance distribution h to have the property

h(X|y, v) = \frac{f(y|X, v; β)}{f(y|v; β)} f(X|v) = f(X|y, v; β)

then W_{MC} equals one and the variance of W_{MC} is minimized. Of course this choice requires knowledge of f(x|v). Note also that W_{MC} given by (19) does not depend on V. We could in fact generate pairs (X, V) from an importance distribution h(X, V|y, v) with corresponding weight function

W_{MC}(X, V, y, v, β) = \frac{f(y|X, v; β)}{f(y|v; β)} \frac{f(X|v)}{h(X, V|y, v)}  (21)

which still satisfies (10). We consider such an importance distribution to make the connection between methods 1 and 2. Suppose that, for given (y, v), we sample "proxy" values of X from the validated observations (y_j, x_j, v_j) with selection probabilities that may depend on (x_j, v_j). We can extend method 1 by allowing ourselves to choose validated observations with v_j values within a certain neighbourhood of the given v, since we regard these as the most relevant for estimating f(X|y, v; β), or more generally accepting these observations with probability proportional to k_v(v_j), where the non-negative kernel k_v(·) can be rescaled so that

\int k_v(w)\, dw = 1

for each v. Then the selected values (x_j, v_j) have joint probability density function proportional to

f(X, V) × P(x observed) × P("accepted" as proxy) = f(X, V) η(X, V, β) k_v(V)

where η(X, V, β) is given by (1). This is closely analogous to implementing method 2 using an importance distribution of the form h(X, V|y, v) ∝ η(X, V, β) f(X, V) k_v(V). Of course the kernel k_v can be replaced by a distribution concentrated entirely at the point v, in which case

h(X|y, v) ∝ η(X, v, β) f(X|v) I(V = v).  (22)

The choice of k_v, and the decision of how large a neighbourhood to 'average' or smooth over, depends on how constant we think f(X|v) is as a function of v. The extreme case would be to average over all of the validated observations x_j without regard to the corresponding value of v_j. For this case

h(X|y, v) ∝ E[η(X, V, β)|X] f_X(X) = \int η(X, V, β) f(X, V)\, dV.  (23)

3. CONSTRUCTION OF WEIGHT FUNCTIONS

Recall that in Theorem 1, the weight function W_E(x_j, u_j, y_i, v_i, β) satisfied (10) and (14). We now show how to construct such a weight function. We drop the subscript E for convenience. Let R(X, V, v) be a Riesz factor representing the linear functional E(∗|V = v) which satisfies

E[H(X, V)|V = v] = E[H(X, v) R(X, V, v)].  (24)

A property analogous to (11) that R must have so that (24) holds is

E[R(X, V, v)|X] = \frac{f(X|v)}{f_X(X)}.  (25)

Given a non-negative Riesz factor R we can define an importance distribution using

h(X, V|y, v) ∝ \frac{f(X|v)\, η(X, V, β)}{R(X, V, v)}

for (X, V) in the validation sample. The corresponding weight function is

W(X, U, y, v, β) = \frac{∆}{η(X, V, β)} \frac{f(y|X, v; β)}{f(y|v; β)} R(X, V, v).  (26)

Lemma 2. The weight function W given by (26) satisfies (10) if (X, V) corresponds to an observed pair of values from the validation sample.

Note that the function A = ∆/η(X, V, β) in (26) satisfies E(A|X, V) = 1 and may be replaced by functions with the same property, such as ∆/π(Y, V), ∆/E[π(Y, V)|X] (provided R does not depend on V) and 1, to obtain weight functions which also satisfy (10).


A form for R(X, V, v) satisfying (25) is

R(X, V, v) = \frac{f(X|v)}{f(X|V)\, f_V(V)} g_v(V|X)  (27)

for non-negative functions g_v such that \int g_v(V|X)\, dV = 1. The following are possible Riesz factors R(X, V, v):

(R1) \frac{I(V = v)}{P(V = v)} if V is discrete and P(V = v) > 0

(R2) \frac{f(X|v)}{f(X|V)\, f_V(V)} k_v(V) for a non-negative kernel density k_v(V)

(R3) \frac{f(X|v)}{f_X(X)} = \frac{f(v|X)}{f_V(v)} = \frac{f(X, v)}{f_X(X)\, f_V(v)}

(R4) \frac{f(X|v)}{f(X|V)}

(R5) \frac{f(X|v)}{f(X|V ∈ N_v)} I(V ∈ N_v) for some neighbourhood N_v of v.
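As a check of the defining property, here is a short verification, added here for completeness, that (R2) satisfies (25); it conditions on X and uses f(X|V) f_V(V) = f(X, V):

\begin{align*}
E[R(X,V,v)\,|\,X]
&= \int \frac{f(X|v)}{f(X|V)\,f_V(V)}\, k_v(V)\, f(V|X)\, dV \\
&= \int \frac{f(X|v)\, k_v(V)}{f(X,V)} \cdot \frac{f(X,V)}{f_X(X)}\, dV
 = \frac{f(X|v)}{f_X(X)} \int k_v(V)\, dV = \frac{f(X|v)}{f_X(X)}.
\end{align*}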

The Riesz factor (R1) corresponds to the importance distribution (22), (R2) corresponds to selecting observations with neighbouring values of V having probability density function k_v(V), (R3) is closely related to (23), and (R4) is a special case of (R2). Since (R1)-(R5) all satisfy (25) it seems reasonable to choose the R with the smallest variance, which is (R3). This choice assumes of course that we know f(x|v), f_X(x), etc. To use one of (R1)-(R5) we could assume a model for f(x|v), such as normal or logistic, or simply adopt a parametric model for R. In general, a parametric assumption governing R is made only to obtain the weights on the score function. In particular, if we do not trust the assumption over the whole range of values of V we can use a Riesz factor like (R5), which restricts the influence of the model to a small neighbourhood of v. Chen (2004) assumes a parametric form for the odds ratio

\frac{f(v|x)\, f(v_0|x_0)}{f(v|x_0)\, f(v_0|x)}

for a particular point (x_0, v_0). Since he uses self-normalized weights this is equivalent to a parametric specification of f(x|v).

If we use the Riesz factor (R3) then estimation of this ratio is simpler than the estimation of f(x, v). By Sklar's theorem (see Nelsen 1999) there exists a copula C(u_1, u_2) on the unit square such that the joint cumulative distribution function F(x, v) satisfies F(x, v) = C(F_X(x), F_V(v)). Letting c(u_1, u_2) represent the joint probability density function of the copula C, we have

R(X, V, v) = \frac{f(X, v)}{f_X(X)\, f_V(v)} = c(F_X(X), F_V(v))

so that R obtains from the two marginal cumulative distribution functions and the copula representing the dependence. Typically the data carry more information about the marginals than the joint distribution, so that the problem becomes one of choosing or estimating the copula. Note that the Riesz factor (R2) can also be written using the copula as

R(X, V, v) = \frac{f(X|v)}{f(X|V)\, f_V(V)} k_v(V) = \frac{c(F_X(X), F_V(v))}{c(F_X(X), F_V(V))} \frac{k_v(V)}{f_V(V)}.  (28)


Note that if we assume a priori that X and V are independent random variables and use Riesz factor (R3), then the weight function is

W(X, U, y, v, β) = \frac{∆}{η(X, V, β)} \frac{f(y|X, v; β)}{f(y|v; β)}

where, for given v, weights are applied to all of the validated X_j, not just the validated X_j with V_j = v.

4. ESTIMATION OF β AND SPECIFIC WEIGHT FUNCTIONS

Since the weight function W depends on β, estimation of β must be done iteratively. Suppose \hat{β}_0 is an initial estimate of β based on the validation data only, and let \hat{β}_n be the estimate of β obtained on the nth iteration. On the nth iteration E_β[S(y|X, v, β)|y, v] is estimated for each pair of (y, v) values in the non-validation set using

E_N[S(y|X, v, β)|y, v] = \sum_{j=1}^{N} W(x_j, u_j, y, v, \hat{β}_{n−1}) S(y|x_j, v, β)  (29)

for self-normalized weights W(x_j, u_j, y, v, β). The new estimate \hat{β}_n of β is the solution to the equation

\sum_{i=1}^{N} \{∆_i S(y_i|x_i, v_i, β) + (1 − ∆_i) E_N[S(Y|X, V, β)|y_i, v_i]\} = 0.  (30)

We iterate these two steps until convergence is obtained. We now describe the specific weight functions used in the simulation study in Section 6.
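Before turning to the specific weights, here is a schematic sketch of the two-step iteration (29)-(30) for a scalar β; the score and weight callables, the placeholder handling of missing x, and the root-finding bracket are our assumptions, not part of the paper.

```python
import numpy as np
from scipy.optimize import brentq

def estimate_beta(beta0, score, weight, delta, y, x, v,
                  tol=1e-6, max_iter=50, bracket=(-10.0, 10.0)):
    """Alternate (29) and (30) until convergence, for scalar beta.

    score(yi, xj, vi, b) -> S(y_i | x_j, v_i, b), vectorized.
    weight(xj, dj, yi, vi, b) -> W(x_j, u_j, y_i, v_i, b), vectorized;
    it must vanish for non-validated x_j (property (14)).  Missing x
    entries should hold a finite placeholder; delta masks them.
    """
    beta = float(beta0)
    for _ in range(max_iter):
        # (29): self-normalized weights evaluated at the current estimate.
        W = weight(x[None, :], delta[None, :], y[:, None], v[:, None], beta)
        W = W / W.sum(axis=1, keepdims=True)

        def psi(b):
            S = score(y[:, None], x[None, :], v[:, None], b)  # S[i, j]
            ES = (W * S).sum(axis=1)                 # E_N[S | y_i, v_i]
            own = np.where(delta == 1, np.diag(S), 0.0)
            return np.sum(own + (1 - delta) * ES)    # equation (30)

        new_beta = brentq(psi, *bracket)  # assumes a sign change in bracket
        if abs(new_beta - beta) < tol:
            return new_beta
        beta = new_beta
    return beta
```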

4.1 CCB Weights.

We call the weight function

W(x_j, u_j, y, v, β) ∝ ∆_j I(v_j = v) \frac{f(y|x_j, v; β)}{η(x_j, v, β)}  (31)

the CCB weight since, using this weight function and the algorithm described above for estimating β, we obtain the pseudo-score estimator of β suggested by Chatterjee et al. (2003). Note that if π(y, v) = π(v) for all y, then (31) simplifies to W(x_j, u_j, y, v, β) ∝ ∆_j I(v_j = v) f(y|x_j, v; β). Using this weight function gives the same estimator for β, on convergence, as proposed by Pepe & Fleming (1991). Note also that if W(x_j, u_j, y, v, β) ∝ ∆_j I(v_j = v) is used, then the mean score estimator of Reilly & Pepe (1995) is obtained.

4.2 Copula Weights.

We chose to adopt a family of copulas C(u_1, u_2; α) indexed by a single parameter α in the Riesz factor (28). For discrete V we used k_v(v_j) = I(v_j = v). For continuous V, k_v(v_j) = exp[−a(v_j − v)^2] could be used. An estimate of α can be obtained from an estimate of Kendall's tau using the simple relationship between the copula and Kendall's tau (see Jouini & Clemen 1996). Families of copulas can be found in Nelsen (1999). The choice of copula could be determined by goodness of fit (see Genest & Rivest 1993). For the simulation study in Section 6 we used the Frank copula with joint probability density function given by

c(u, z; α) = \frac{α\, e^{α(1+u+z)}\,(e^{α} − 1)}{(e^{α+αu} + e^{α+αz} − e^{α} − e^{αu+αz})^2}, \quad −∞ < α < ∞, \; α ≠ 0.
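For reference, a direct transcription of this density into code (the function name is ours):

```python
import numpy as np

def frank_density(u, z, alpha):
    """Frank copula density c(u, z; alpha) as displayed above, alpha != 0."""
    ea = np.exp(alpha)
    num = alpha * np.exp(alpha * (1.0 + u + z)) * (ea - 1.0)
    den = (np.exp(alpha * (1.0 + u)) + np.exp(alpha * (1.0 + z))
           - ea - np.exp(alpha * (u + z))) ** 2
    return num / den
```

In the limit α → 0 this density tends to 1, recovering independence of the two coordinates.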


The weight function is

W(x_j, u_j, y, v, β) ∝ \frac{∆_j}{η(x_j, v_j, β)} \frac{f(y|x_j, v; β)}{f(y|v; β)} \frac{c(F_X(x_j), F_V(v); α)}{c(F_X(x_j), F_V(v_j); α)} \frac{I(v_j = v)}{f_V(v_j)}.  (32)

The marginal distribution for V was estimated using the empirical distribution of V. The marginal distribution of X was estimated on each iteration by putting positive mass equal to

[η(x_j, v_j, \hat{β})]^{−1} \Big/ \sum_j [η(x_j, v_j, \hat{β})]^{−1}

on each validated x_j, where \hat{β} is the current estimate of β.

4.3 Regression Weights.

For the construction of these weights we assumed that X and V have a joint bivariate normal distribution with mean vector (µ_X, µ_V)' and covariance matrix

\begin{bmatrix} σ_X^2 & ρσ_Xσ_V \\ ρσ_Xσ_V & σ_V^2 \end{bmatrix}.

The weight function is

W(x_j, u_j, y, v, β) ∝ ∆_j \frac{f(y|x_j, v; β)}{η(x_j, v_j, β)} × \frac{f(x_j|v)}{f_X(x_j)}  (33)

where f(x|v) is the N(µ_X + ρ\frac{σ_X}{σ_V}(v − µ_V), (1 − ρ^2)σ_X^2) probability density function and f_X(x) is the N(µ_X, σ_X^2) probability density function. Estimation of the parameters (µ_X, µ_V, σ_X^2, σ_V^2, ρ) is straightforward (see Section 6).
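A sketch of (33) under the assumed bivariate normal model; eta and f_y_given_xv are user-supplied functions, and the five normal parameters are plugged-in estimates.

```python
import numpy as np
from scipy.stats import norm

def regression_weight(xj, vj, deltaj, y, v, beta, eta, f_y_given_xv,
                      mu_x, mu_v, sig_x, sig_v, rho):
    """Unnormalized regression weight (33) for a validated point (x_j, v_j)."""
    cond_mean = mu_x + rho * (sig_x / sig_v) * (v - mu_v)   # E(X | v)
    cond_sd = sig_x * np.sqrt(1.0 - rho ** 2)               # sd(X | v)
    ratio = norm.pdf(xj, cond_mean, cond_sd) / norm.pdf(xj, mu_x, sig_x)
    return deltaj * f_y_given_xv(y, xj, v, beta) / eta(xj, vj, beta) * ratio
```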

4.4 Profile and Doubly Restricted Profile Weights.

The weights in this case correspond to replacing the joint distribution of (X, V) by a nonparametric maximum likelihood estimator based on the data. This is similar to the approach in Chen (2004), except that Chen implicitly assumes that the maximum likelihood estimator for f(x|v) has mass only on those observed values of x appearing with the given v. Some properties of this estimator are given in the Appendix. Define the function

\hat{η}_P(x, v, β) = 1 − \frac{\sum_{i=1}^{N} (1 − ∆_i) I(v_i = v) f(y_i|x, v_i; β)/f(y_i|v_i; β)}{\sum_{i=1}^{N} I(v_i = v)}  (34)

which is an estimator of η(x, v, β) based on the non-validated observations. Although this estimator may not fall in the interval [0, 1] for finite sample sizes, it is consistent as N → ∞.

Theorem 3. For fixed β and a given value of v, a nonparametric maximum likelihood estimator \hat{F}(x|v) of F(x|v) has, as possible support points, the values of x corresponding to validated observations at that value of v, as well as the values of x_v for which

\hat{η}_P(x_v, v, β) = 0.  (35)

On those points x corresponding to validated observations for which \hat{η}_P(x, v, β) > 0, the mass is

\hat{F}(dx|v) ∝ \frac{\sum_{i=1}^{N} ∆_i I(x_i = x, v_i = v)}{\hat{η}_P(x, v, β)}.  (36)


There is an apparent arbitrariness in the mass that the maximum likelihood estimator places on points x_v which satisfy \hat{η}_P(x_v, v, β) = 0. This makes sense, however, if we identify \hat{η}_P(x_v, v, β) with the data's best estimator of the probability that the pair (x_v, v) will be in the validation sample. Suppose that, for a given (x, v) pair, the true value of η(x, v, β) = P(∆ = 1|x, v) were zero. Then no such (x, v) pair will appear in the validation sample and yet (x, v) could be a possible support point of F(x|v), so we could add the point (x, v) to the set of support points and assign it arbitrary mass. Note also that the set of x_v values satisfying (35) depends on our current estimates of β and f(x|v). An algorithm for dealing with the additional computational burden this entails is given in the Appendix.

If we use \hat{F}(x|v) to estimate F(x|v), the weights take the form

W(x, u, y, v, β) = W(x, y, v, β) ∝ f(y|x, v; β) \hat{F}(dx|v).  (37)

We call these the profile weights. Following Zhang & Rockette (2005), we refer to the profile estimator which restricts the support of the maximum likelihood estimator of F(x|v) to the validated x observed with identical v as the doubly restricted profile.
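A sketch of the profile-weight computation at a fixed value v_0, assuming user-supplied densities f(y|x, v; β) and a current estimate of f(y|v; β); the names and the handling of non-positive \hat{η}_P are our choices.

```python
import numpy as np

def profile_masses(x_support, v0, delta, y, x, v, beta,
                   f_y_given_xv, f_y_given_v):
    """eta_hat_P of (34) and normalized profile masses (36) at v = v0."""
    at_v = (v == v0)
    nonval = at_v & (delta == 0)
    masses = []
    for x0 in x_support:
        ratio = (f_y_given_xv(y[nonval], x0, v0, beta)
                 / f_y_given_v(y[nonval], v0, beta))
        eta_hat = 1.0 - ratio.sum() / at_v.sum()         # equation (34)
        count = np.sum((delta == 1) & at_v & (x == x0))  # validated hits
        masses.append(count / eta_hat if eta_hat > 0 else 0.0)
    m = np.asarray(masses)
    return m / m.sum()                                   # F_hat(dx | v0)
```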

4.5 Comments

When π(y, v) is unknown, Chatterjee et al. (2003) replace π by a consistent estimate \hat{π} and η by \hat{η} where

\hat{η}(x, v, β) = \int \hat{π}(y, v) f(y|x, v; β)\, dy.  (38)

Chatterjee et al. (2003) and Robins et al. (1994) have found that estimators which use estimated rather than known selection probabilities are more efficient. Such an estimator would be obtained in this case by replacing η(x, v, β) in (31), (32), or (33) by \hat{η}(x, v, β) given by (38) or (34). For the simulation study in Section 6 we used the known π(y, v). Simulations were also conducted using (34), but no appreciable gain in efficiency was observed for this situation. The profile estimator implicitly estimates these probabilities.

To use the estimator (12) it would appear that a large number of weights W(x_j, u_j, y, v, β) need to be computed. Typically a few of the weights dominate and the rest are very close to zero, and so an undue amount of calculation is done. A simple method involving rejection sampling provides an approximation to the same conditional expectation E_β[H(Y, X, V, β)|y, v] but with a fraction of the computation. Suppose we choose at random one of the possible validated observations (x_j, v_j) for which v_j = v and compute the weight W(x_j, u_j, y, v, β) associated with this observation. If this weight is less than some constant c, we accept this value with probability W(x_j, u_j, y, v, β)/c and assign it a new weight equal to one. If W(x_j, u_j, y, v, β) > c, we accept this observation and assign it a new weight equal to W(x_j, u_j, y, v, β)/c. If this process is repeated independently, then the weighted average over all accepted values (with their new weights) is an unbiased estimator of E_β[H(Y, X, V, β)|y, v]. In our simulation study there were approximately 150 validated observations and it was feasible to use all of them. Therefore this technique was not implemented.
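A sketch of this accept/reweight scheme applied to a precomputed batch of weights (a vectorized variant of the one-at-a-time procedure described above; the threshold c and the names are ours):

```python
import numpy as np

def thin_weights(H_vals, W_vals, c=1.0, rng=None):
    """Accept weights below c with probability W/c (new weight 1);
    keep weights above c with new weight W/c.  Return the weighted
    average of H over the accepted values."""
    rng = rng or np.random.default_rng()
    H = np.asarray(H_vals, dtype=float)
    W = np.asarray(W_vals, dtype=float)
    accept = np.where(W < c, rng.random(W.size) < W / c, True)
    new_w = np.where(W < c, 1.0, W / c)[accept]
    return np.sum(new_w * H[accept]) / np.sum(new_w)
```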

5. EXAMPLE: SIMPLE LINEAR REGRESSION

Consider a simple linear regression of the form

Y ∼ N(β_0 + β_1 X, σ^2)  (39)

and suppose V is a variable correlated with X. In this case the estimating equation (30) is given by

\begin{bmatrix} \sum_{j=1}^{N} (y_j − β_0 − β_1 \tilde{x}_j) \\ \sum_{j=1}^{N} (y_j \tilde{x}_j − β_0 \tilde{x}_j − β_1 \widetilde{x_j^2}) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}  (40)


where

\tilde{x}_j = ∆_j x_j + (1 − ∆_j) E_N(X|y_j, v_j)  (41)
\widetilde{x_j^2} = ∆_j x_j^2 + (1 − ∆_j) E_N(X^2|y_j, v_j),

E_N is given by (13), and β equals the current estimate \hat{β}.

To obtain the estimate of β based on the weighted estimating function (5) we replace (41) by

\tilde{x}_j = \frac{∆_j}{π(y_j, v_j)} x_j + \left[1 − \frac{∆_j}{π(y_j, v_j)}\right] E_N(X|y_j, v_j)  (42)

\widetilde{x_j^2} = \frac{∆_j}{π(y_j, v_j)} x_j^2 + \left[1 − \frac{∆_j}{π(y_j, v_j)}\right] E_N(X^2|y_j, v_j).

We refer to the estimate based on (42) as the doubly robust estimate.

Solving the system of equations (40), we obtain

\begin{pmatrix} \hat{β}_0 \\ \hat{β}_1 \end{pmatrix} = \frac{1}{\hat{σ}_x^2} \begin{bmatrix} \bar{Y} (\sum_{j=1}^{N} \widetilde{x_j^2})/N − \bar{X} (\sum_{j=1}^{N} y_j \tilde{x}_j)/N \\ \widehat{cov}(y, \tilde{x}) \end{bmatrix}

where

\bar{X} = \sum_{j=1}^{N} \tilde{x}_j / N, \quad \bar{Y} = \sum_{j=1}^{N} y_j / N,  (43)

\hat{σ}_x^2 = \sum_{j=1}^{N} \widetilde{x_j^2}/N − \Big(\sum_{j=1}^{N} \tilde{x}_j/N\Big)^2,

and

\widehat{cov}(y, \tilde{x}) = \frac{1}{N} \sum_{j=1}^{N} (y_j − \bar{Y}) \tilde{x}_j.  (44)

Note that \hat{σ}_x^2 is a consistent estimator of var(X) provided that \hat{E}[g(X)|y, v] consistently estimates E[g(X)|y, v]. Therefore the updated estimates of β are given by

\hat{β}_1 = \frac{\widehat{cov}(y, \tilde{x})}{\hat{σ}_x^2} \quad and \quad \hat{β}_0 = \bar{Y} − \hat{β}_1 \bar{X}.  (45)

By analogy to the completely observed case we obtain an estimate of σ^2 from the equation

N \hat{σ}^2 = \sum_i \sum_j [∆_i I(x_i = x_j) + (1 − ∆_i) ∆_j W(x_j, u_j, y_i, v_i, β)] (y_i − β_0 − β_1 x_j)^2.

Replacing β_0 and β_1 by the estimates in (45) we obtain

\hat{σ}^2 = \frac{1}{N} \sum_{j=1}^{N} (y_j − \bar{Y})^2 − \hat{β}_1^2 \hat{σ}_x^2.  (46)

Clearly this estimator resembles the usual estimator of σ2 for simple linear regression.


Denote the design matrix for the regression by

X = \begin{bmatrix} 1 & ∆_1 x_1 \\ 1 & ∆_2 x_2 \\ \vdots & \vdots \\ 1 & ∆_n x_n \end{bmatrix}.

Let D = diag(∆_1, ∆_2, ..., ∆_n) and W = (W_{ij}) where W_{ij} = W(x_j, u_j, y_i, v_i, β). Then

\begin{pmatrix} \hat{β}_0 \\ \hat{β}_1 \end{pmatrix} = (X'(D + (I − D)W')X)^{−1} X'(D + (I − D)W')Y

indicating that standard software for weighted least squares (with weight matrix D + (I − D)W') can be used to estimate parameters in any of these models. The doubly robust estimate is similar but with

D = diag\left(\frac{∆_1}{π(y_1, v_1)}, \frac{∆_2}{π(y_2, v_2)}, ..., \frac{∆_n}{π(y_n, v_n)}\right).
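In code, the matrix form is a few lines of numpy; this is a sketch under our conventions (missing x entries stored as finite placeholders, since they are multiplied by ∆):

```python
import numpy as np

def wls_beta(delta, x, y, W):
    """Solve for (beta0, beta1) with weight matrix D + (I - D) W',
    where D = diag(delta) and W[i, j] = W(x_j, u_j, y_i, v_i, beta)."""
    n = len(y)
    X = np.column_stack([np.ones(n), delta * x])   # design matrix
    D = np.diag(delta.astype(float))
    A = D + (np.eye(n) - D) @ W.T
    XtA = X.T @ A
    return np.linalg.solve(XtA @ X, XtA @ y)
```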

6. SIMULATION STUDY

To compare the performance of the various weight functions described in Section 4 we used thefollowing setup for a simulation study. We assumed a joint multivariate normal model for therandom vector (X,V, Y )0 with mean vector (µX , µV , µY )0 where µY = β0 + β1µX and covariancematrix ∙

Σ11 Σ12Σ012 Σ22

¸where

Σ11 = σ2X , Σ12 =£ρσXσV σ2Xβ1

¤and

Σ22 =

∙σ2V ρσXσV β1

ρσXσV β1 σ2 + β21σ2X

¸.

It follows that the conditional distribution of Y given (X, V) is the simple linear model (39). We took µ_X = µ_V = 0, σ_X^2 = σ_V^2 = 1, β_0 = 0, β_1 = 1 and σ^2 = 1. Four different values of ρ, the correlation between X and V, were used: ρ = 0.9, 0.75, 0.5, and 0.25. Observations for V were first generated using the multivariate normal distribution above. The observations were then discretized into six categories based on equal probability intervals for V.

We assumed that

π(y, v) = min(1, exp[−e^{−αv}(y − y_0)]).

This function with parameter values α = 0.5 and y_0 = −3 was chosen so that, for the multivariate normal distribution assumed above, the proportion of validated observations was between 10 and 15 percent for each value of v in the discrete V case. For each value of v, π(y, v) is a decreasing function of y, qualitatively resembling the selection probabilities we might expect for the low birthweight example.

The parameter β was estimated iteratively using each of the weight functions discussed in Section 4. The doubly robust estimate of β was obtained using (42) and the doubly restricted profile weights (37). The parameter σ^2 was estimated using (46). We used a sample size of N = 1000 and 500 simulations. The regression weights required estimates of the parameters (µ_X, µ_V, σ_X^2, σ_V^2, ρ). The sample mean and variance were used to estimate the parameters µ_V and σ_V^2. The parameters µ_X, σ_X^2 and ρ were estimated on each iteration using (43) and (44).
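A sketch of one replicate of this design; applying π to the continuous V before discretization and representing categories by their index are our simplifying assumptions.

```python
import numpy as np
from scipy.stats import norm

def simulate(N=1000, rho=0.75, beta0=0.0, beta1=1.0, sigma=1.0,
             alpha=0.5, y0=-3.0, n_cat=6, rng=None):
    """One simulated data set from the Section 6 design."""
    rng = rng or np.random.default_rng()
    x = rng.standard_normal(N)                        # X ~ N(0, 1)
    v = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(N)
    y = beta0 + beta1 * x + sigma * rng.standard_normal(N)
    # Discretize V into equal-probability categories.
    edges = norm.ppf(np.linspace(0.0, 1.0, n_cat + 1))
    v_cat = np.digitize(v, edges[1:-1])               # category index
    # Selection: pi(y, v) = min(1, exp(-exp(-alpha*v) * (y - y0))).
    pi = np.minimum(1.0, np.exp(-np.exp(-alpha * v) * (y - y0)))
    delta = (rng.random(N) < pi).astype(int)
    x_obs = np.where(delta == 1, x, 0.0)              # placeholder when missing
    return delta, y, x_obs, v_cat
```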

6.1 Parametric Maximum Likelihood Estimation

We compared all of the above methods with the fully parametric maximum likelihood estimator assuming (Y, X, V) is multivariate normal. For simplicity we ignored the rounding in the variable V. Since the likelihood factors into the form

\prod_{i=1}^{N} f(y_i|v_i; β) \prod_{i=1}^{N} f(x_i|y_i, v_i; β)^{∆_i}

it is convenient to use the following reparameterization. Let the conditional distribution of Y given v be

N(a_0 + a_1 v, σ_{Y|v}^2)  (47)

and the conditional distribution of X given (y, v) be

N(b_0 + b_1 v + b_2 y, σ_{X|y,v}^2).  (48)

The parameters in (47) and (48) are easy to estimate by regression methods. In terms of these parameters,

β_1 = \frac{cov(Y, X)}{σ_X^2}  (49)

β_0 = µ_Y − β_1(b_0 + b_1 µ_V + b_2 µ_Y)  (50)

and

σ^2 = σ_Y^2 − c'Ac,  (51)

where

c' = [cov(Y, X) \;\; cov(Y, V)],

A^{−1} = \begin{bmatrix} σ_X^2 & cov(X, V) \\ cov(X, V) & σ_V^2 \end{bmatrix},

and

σ_X^2 = (b_1 + 2b_2 a_1) b_1 σ_V^2 + b_2^2 σ_Y^2 + σ_{X|y,v}^2

cov(Y, X) = (b_1 + b_2 a_1) a_1 σ_V^2 + b_2 σ_{Y|v}^2.

The maximum likelihood estimates of β_0, β_1 and σ^2 are obtained by replacing the quantities on the right-hand side of (49), (50) and (51) by their maximum likelihood estimates.
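The reparameterized computation can be sketched as two regression fits followed by the mappings (49)-(51); least-squares fits stand in for the normal-theory maximum likelihood estimates, and cov(X, V) = (b_1 + b_2 a_1)σ_V^2 is the moment implied by (47)-(48):

```python
import numpy as np

def mvn_mle(delta, y, x, v):
    """Parametric MLE of (beta0, beta1, sigma^2) via (47)-(51)."""
    n = len(y)
    # (47): regress y on v using all observations.
    A = np.column_stack([np.ones(n), v])
    (a0, a1), *_ = np.linalg.lstsq(A, y, rcond=None)
    s2_y_v = np.mean((y - A @ np.array([a0, a1])) ** 2)
    # (48): regress x on (v, y) using the validation sample only.
    val = delta == 1
    B = np.column_stack([np.ones(val.sum()), v[val], y[val]])
    coef, *_ = np.linalg.lstsq(B, x[val], rcond=None)
    b0, b1, b2 = coef
    s2_x_yv = np.mean((x[val] - B @ coef) ** 2)
    # Implied moments of (X, V, Y).
    mu_v, s2_v, mu_y = v.mean(), v.var(), y.mean()
    s2_y = a1**2 * s2_v + s2_y_v
    cov_yv = a1 * s2_v
    s2_x = (b1 + 2*b2*a1) * b1 * s2_v + b2**2 * s2_y + s2_x_yv
    cov_yx = (b1 + b2*a1) * a1 * s2_v + b2 * s2_y_v
    cov_xv = (b1 + b2*a1) * s2_v
    mu_x = b0 + b1 * mu_v + b2 * mu_y
    # Mappings (49)-(51).
    beta1 = cov_yx / s2_x
    beta0 = mu_y - beta1 * mu_x
    c = np.array([cov_yx, cov_yv])
    sigma2 = s2_y - c @ np.linalg.solve(
        np.array([[s2_x, cov_xv], [cov_xv, s2_v]]), c)
    return beta0, beta1, sigma2
```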

6.2. Discussion

Figure 1 shows the bias and standard deviation of five estimators of the intercept, slope and variance, including the CCB, the doubly restricted profile, and the parametric maximum likelihood estimator (labelled MVN MLE) discussed in Section 6.1. The standard deviations and biases are plotted versus ρ, the correlation between X and V. The simulations show that methods which, on the surface, should be nearly unbiased have an unexpectedly large level of bias of 30 percent and more. The small bias associated with the MVN MLE estimator, especially when ρ = 0.9, is presumably because we did not adjust the model for the discretization of V. When ρ = 0.25, there is substantial bias in the CCB as well as the doubly restricted profile.

Figure 1: Comparison of estimators for discrete V with 6 categories. [Panels show the bias and standard deviation of the intercept, slope and sigma estimates plotted against correlation(X, V) for the MVN MLE, doubly restricted profile, CCB, regression and copula estimators.]

15

Page 16: Estimation of regression parameters in missing data problems · the conditional cumulative distribution function of Xgiven v(see Pepe & Fleming 1991; Car-roll & Wand, 1991). When

There is a rather subtle reason for this bias, which we discuss in the next section. The bias of the CCB estimator can be reduced by pooling or smoothing the estimator across adjacent values of v. The regression estimator performs quite well for low and high values of ρ.

In Figure 2 we compare the best four estimators in terms of bias and standard deviation. The conclusion to be drawn from these figures is that, at least in this case, the profile estimator performs remarkably well in terms of bias compared with its competitors, particularly the doubly robust estimator. The bias is remarkably small, especially for the estimation of the most important parameter, the slope parameter. This occurs even though the profile estimator requires substantially fewer assumptions than the other estimators and requires the estimation of the infinite-dimensional nuisance parameter f(x|v).

Figure 2: Bias and standard deviation of four estimators, discrete V, 6 categories. [Panels show the bias and standard deviation of the intercept, slope and sigma estimates plotted against correlation(X, V) for the MVN MLE, profile, regression and doubly robust estimators.]

For many estimators the bias is so great that the variance is a small component of the mean squared error. Clearly most of this bias needs to be removed before any consideration of efficiency. Many of the estimators have two to five times the variance of the parametric maximum likelihood estimator. This is the price paid for lack of knowledge of the form of the distribution and/or the nuisance parameter in the distribution f(x|v). Even if we correctly guess the form of this distribution (in this case as bivariate normal) but have to estimate the parameters of the distribution, the price paid is about a six-fold increase in variance for the regression estimator and a three- or four-fold increase for the parametric maximum likelihood estimator, as compared to the case in which f(x|v) is completely known. Note that the difference between the regression estimator and the parametric maximum likelihood estimator, which both assume the bivariate normal distribution, is largely numerical. Weighted estimators like the regression estimator replace a conditional expectation by a weighted average over the validated observations. Although the profile estimator does not perform quite as well as the parametric maximum likelihood estimator with respect to variance, it does have smaller variance than the regression estimator and the doubly robust estimator.

When we enlarge the support of V to have twenty categories, we obtain similar but more dramatic conclusions (see Figure 3).

Figure 3: Discrete V, 20 categories. [Panels show the bias and standard deviation of the intercept, slope and sigma estimates plotted against correlation(X, V) for the profile, regression and doubly robust estimators.]

The doubly robust estimator often has more bias than the profile estimator and, at least for low values of ρ, a much higher variance, corresponding to efficiencies around 25 percent relative to the profile for estimating the slope coefficient.

6.3. Bias

The bias is caused not so much by the weighting function used as by the choice of support for the estimated distribution f(x|v). The primary difference between regression and the CCB and doubly restricted profile estimators is in the placement of weights. Both CCB and the doubly restricted profile are required to place positive weights only on those validated x-values which share the same covariate v. This seems to be a natural and seemingly harmless constraint, at least when the sample size is large. There is little bias due to this constraint when x and v are highly correlated, because in this case we do not expect a big discrepancy between the actual x value and the validated x-values on which we can put positive weights. Regression, as well as the profile estimator with a more general support set, permits weights at other x-values. This makes a substantial difference in the ability of the estimator to respond to a discrepancy between the true covariate value x and those observed with a similar value of v. In general, as N → ∞, this bias disappears because eventually every possible supporting interval for f(x|v) is sampled.

Figure 4: Imputed values of x for various types of estimators. [The panel, titled 'Imputed values of x for v = −0.57', plots y against x and shows the observed x together with the values imputed by the doubly restricted profile, CCB, regression and known f(x|v) estimators.]

The following example will clearly illustrate the source of the bias. In Figure 4 we choose a particular data set obtained from our simulations above and, for those observations with a particular v = −0.57, we plot both the validated pairs (x, y) and the imputed pairs (x, y) for various methods. Notice that the regression and the known f(x|v) estimator (constructed using the correct distribution f(x|v)) provide a sensible, nearly linear imputation of values because they are permitted to place weight on all of the observed values of x. However, neither the doubly restricted profile nor the CCB estimator is able to impute a value of x greater than 0, because all of the validated x for this v were negative. Since the regression estimator is permitted to place weight at any validated value of x, not just those with the same value of v, it imputes the value of x more successfully than either the doubly restricted profile or the CCB estimator. Similarly, if we enlarge the support of the profile, we can essentially eliminate its bias.

7. CONCLUDING REMARKS

Many of the estimators designed to accommodate missing covariates in a simple regression have substantial bias for small and moderate sample sizes, including the doubly robust estimator, which is designed specifically to reduce this bias. The profile estimator, however, appears to have properties at least as good as its competitors, with a minimum of bias and standard deviation.


APPENDIX: PROOFS OF LEMMAS AND THEOREMS

For ease of notation we suppress the subscript β on Eβ in the following proofs.

Proof of Lemma 1. The proof of this lemma is an application of the law of large numbers.

Proof of Theorem 1. Let

W_{ij} = W_E(x_j, u_j, y_i, v_i, β), \; i ≠ j, \quad W_{ii} = 1

and S_{ij} = S(y_i|x_j, v_i, β) for all i, j. Then (16) can be written as T_{ij}(β) = ∆_i S_{ii} + (1 − ∆_i) W_{ij} S_{ij}. Also let E_i(·) denote conditional expectation given z_i = (∆_i, y_i, v_i) and β. Then (17) can be written as τ_i(β) = ∆_i S_{ii} + (1 − ∆_i) E_i(S_{ii}).

For i ≠ j, fixed (y_i, v_i) and arbitrary function H it follows from (10) that

E[H(y_j, x_j, v_j, β)|y_j = y_i, v_j = v_i] = E[W(x_j, u_j, y_i, v_i, β) H(y_i, x_j, v_i, β)]

and also

E_i\{E[H(y_j, x_j, v_j, β)|y_j = y_i, v_j = v_i]\} = E_i[W(x_j, u_j, y_i, v_i, β) H(y_i, x_j, v_i, β)]

which we may write as E_i[W_{ij} H(x_j)] = E_i[H(x_i)]. In particular E_i(W_{ij} S_{ij}) = E_i(S_{ii}). Now the unbiasedness of τ_i(β) holds since

E[τ_i(β)] = E[∆_i S_{ii} + (1 − ∆_i) E_i(S_{ii})] = E[∆_i E_i(S_{ii}) + (1 − ∆_i) E_i(S_{ii})] = E(S_{ii}) = 0.

Now

E[T_{ij}(β)|z_i, x_i] = τ_i(β)  (52)

since

E[T_{ij}(β)|z_i, x_i] = E[∆_i S_{ii} + (1 − ∆_i) W_{ij} S_{ij} | z_i, x_i] = ∆_i S_{ii} + (1 − ∆_i) E_i(W_{ij} S_{ij})
= ∆_i S_{ii} + (1 − ∆_i) E_i(S_{ii}) = τ_i(β).

Thus the unbiasedness of T_{ij}(β) holds since

E[T_{ij}(β)] = E\{E[T_{ij}(β)|z_i, x_i]\} = E[τ_i(β)] = 0.  (53)

To show

E\left[\frac{∂}{∂β} T_{ij}(β)\right] = −I(β)  (54)

we note that the unbiasedness of T_{ij}(β) implies \frac{∂}{∂β} E[T_{ij}(β)] = 0 and

E\left[\frac{∂}{∂β} T_{ij}(β)\right] + E[T_{ij}(β)(S_{ii} + S_{jj})] = 0, \quad i ≠ j.

Result (54) follows since

E[T_{ij}(β) S_{ii}] = E[∆_i S_{ii}^2 + (1 − ∆_i) W_{ij} S_{ij} S_{ii}]
= E(∆_i S_{ii}^2) + E\{E[(1 − ∆_i) W_{ij} S_{ij} S_{ii} | z_i, x_i]\}
= E(∆_i S_{ii}^2) + E[(1 − ∆_i) S_{ii} E_i(S_{ii})]
= E(∆_i S_{ii}^2) + E[(1 − ∆_i) E_i(S_{ii}) E_i(S_{ii})]
= E\{[τ_i(β)]^2\} = var[τ_i(β)] = I(β)


and

E[T_{ij}(β) S_{jj}] = E\{E[T_{ij}(β) S_{jj} | z_i, x_j, v_j]\}
= E[T_{ij}(β) E(S_{jj} | z_i, x_j, v_j)]
= E[T_{ij}(β) E(S_{jj} | x_j, v_j)] = E[T_{ij}(β) × 0] = 0.

The proof of E[\frac{∂}{∂β} τ_i(β)] = −I(β) is similar to the proof of (54).

To prove the asymptotic normality of

ψ_E(β) = \frac{1}{N−1} \sum_{i=1}^{N} \sum_{j=1, j≠i}^{N} T_{ij}(β)

we use a lemma due to Hoeffding (see Serfling 1980, Chapter 5).

Lemma. Suppose Z_1, Z_2, ... are independent identically distributed random vectors and

U_N = \binom{N}{2}^{−1} \sum_{1 ≤ i < j ≤ N} h(Z_i, Z_j)

is a U-statistic with h a symmetric kernel satisfying E[h(Z_i, Z_j)] = 0 and E\{[h(Z_i, Z_j)]^2\} < ∞ for all i < j. Then

U_N → 0 almost surely

and

N^{1/2} U_N → N(0, 4ζ) in distribution,

where ζ = var\{E[h(Z_1, Z_2)|Z_1]\}.

Note that, although T_{ij}(β) is not symmetric in the arguments i, j, the function

h(x_i, z_i, x_j, z_j) = T_{ij}(β) + T_{ji}(β)

is symmetric and we can apply Hoeffding's lemma to the statistic

\sum_{i=1}^{N} \sum_{j=1, j≠i}^{N} T_{ij}(β) = \sum_{i=1}^{N} \sum_{j<i} h(x_i, z_i, x_j, z_j) = \binom{N}{2} U_N.

To obtain the asymptotic variance of ψ_E(β) we need

ζ = var\{E[T_{ij}(β) + T_{ji}(β) | z_i, x_i]\}
= var\{τ_i(β) + E[T_{ji}(β)|z_i, x_i]\} \quad using (52)
= var[τ_i(β)] + var\{E[T_{ji}(β)|z_i, x_i]\} + 2\, cov(τ_i(β), E[T_{ji}(β)|z_i, x_i])
= var[τ_i(β)] + var\{E[T_{ji}(β)|z_i, x_i]\}

since the covariance term is zero. To show this we note that W_{ij}(1 − ∆_j) = 0 by property (14).


Thus

cov(τ_i(β), E[T_{ji}(β)|z_i, x_i])
= E\{τ_i(β) E[T_{ji}(β)|z_i, x_i]\} − 0 = E\{τ_j(β) E[T_{ij}(β)|z_j, x_j]\}
= E\{E[τ_j(β) T_{ij}(β)|z_j, x_j]\} = E[τ_j(β) T_{ij}(β)]
= E\{[∆_j S_{jj} + (1 − ∆_j) E_j(S_{jj})][∆_i S_{ii} + (1 − ∆_i) W_{ij} S_{ij}]\}
= E(∆_j S_{jj}) E(∆_i S_{ii}) + E[(1 − ∆_j) E_j(S_{jj})] E(∆_i S_{ii})
\quad + E[∆_j S_{jj}(1 − ∆_i) W_{ij} S_{ij}] + E[(1 − ∆_j) E_j(S_{jj})(1 − ∆_i) W_{ij} S_{ij}]
= [E(∆_j S_{jj})]^2 + \{E[E_j(S_{jj})] − E(∆_j S_{jj})\} E(∆_i S_{ii}) + E[S_{jj}(1 − ∆_i) W_{ij} S_{ij}] + 0
= 0 + E\{E[(1 − ∆_i) W_{ij} S_{ij} S_{jj} | x_j, v_j, z_i]\} = E[(1 − ∆_i) W_{ij} S_{ij} E(S_{jj}|x_j, v_j)]
= E[(1 − ∆_i) W_{ij} S_{ij} × 0] = 0.

Thus ζ = I(β) + q(β), where

q(β) = var\{E[T_{ji}(β)|z_i, x_i]\} = var\{E[T_{ij}(β)|z_j, x_j]\}
= var\{E[∆_i S_{ii} + (1 − ∆_i) W_{ij} S_{ij} | z_j, x_j]\}
= var\{E(∆_i S_{ii}) + E[(1 − ∆_i) W_{ij} S_{ij} | z_j, x_j]\}
= var\{E[(1 − ∆_i) W_{ij} S_{ij} | x_j, v_j]\}.

It follows from Hoeffding's lemma that the distribution of ψ_E(β) is asymptotically normal and

var[ψ_E(β)] = var\left(\frac{N}{2} U_N\right) ≈ \frac{N^2}{4} \left(\frac{4ζ}{N}\right) ≈ N[I(β) + q(β)].  (55)

The existence of a consistent sequence of solutions of ψ_E(\hat{β}_N) = 0 follows from standard arguments: for N sufficiently large, \frac{∂}{∂β} ψ_E(β) is almost surely negative, and so ψ_E(β) takes on both positive and negative values in an arbitrarily small neighbourhood around the true value β_0. The asymptotic normality of \hat{β}_N follows from the asymptotic normality and linearity of ψ_E(β). Result (54) implies

E\left[\frac{∂}{∂β} ψ_E(β)\right] = −N I(β).  (56)

Using (55) and (56), the asymptotic variance of \hat{β}_N is given by

\frac{var[ψ_E(β)]}{\{E[\frac{∂}{∂β} ψ_E(β)]\}^2} = \frac{N[I(β) + q(β)] + o(N)}{N^2 \{E[\frac{∂}{∂β} T_{ij}(β)]\}^2} = [N I(β)]^{−1} \left[1 + \frac{q(β)}{I(β)}\right] + o(N^{−1}).

We now show that ψ_E(β) is optimal in the sense that it has minimum asymptotic variance among all linear combinations of the basic estimating functions. Suppose ψ_1(β), ..., ψ_N(β) are estimating functions which are possibly correlated. A necessary and sufficient condition for the vector a_0 = (1, 1, ..., 1) to satisfy

a_0 = \arg\min_a \frac{var[\sum_i a_i ψ_i(β_0)]}{\{\frac{∂}{∂β} E[\sum_i a_i ψ_i(β_0)]\}^2}

is that

c\, \frac{∂}{∂β} E[Ψ(β)] = Var[Ψ(β)]\, \mathbf{1}

where c is a scalar constant, Ψ(β) = (ψ_1(β), ..., ψ_N(β))' with covariance matrix Var[Ψ(β)], and \mathbf{1} = (1, 1, ..., 1)'. In other words, the a_i's are identically equal to one provided that the vector of row (or column) sums of the covariance matrix is proportional to the vector of derivatives of the estimating functions. In the case of the T_{ij}(β)'s, for each pair (i, j) with i ≠ j the row sum of the covariance matrix is

\sum_{\{k,l;\, k ≠ l\}} E[T_{ij}(β) T_{kl}(β)],

which is a constant independent of i, j, and E[\frac{∂}{∂β} T_{ij}(β)] is also a constant independent of i, j. Therefore the optimal linear combination of these estimating functions is (15).

Proof of Theorem 2. Let T_{ij}(β) be defined by (20) and τ_i(β) be defined by (17). Now

E[T_{ij}(β)|z_i, x_i] = E[∆_i S(y_i|x_i, v_i, β) + (1 − ∆_i) W_{MC}(x_{ij}, y_i, v_i, β) S(y_i|x_{ij}, v_i, β) | z_i, x_i]
= ∆_i S(y_i|x_i, v_i, β) + (1 − ∆_i) E[W_{MC}(x_{ij}, y_i, v_i, β) S(y_i|x_{ij}, v_i, β) | z_i]
= ∆_i S(y_i|x_i, v_i, β) + (1 − ∆_i) E[S(y_i|X, v_i, β) | z_i] \quad by (10)
= τ_i(β)  (57)

and

E[T_{ij}(β)] = E\{E[T_{ij}(β)|z_i, x_i]\} = E[τ_i(β)] = 0,

and therefore T_{ij}(β) is unbiased.

Note that by (57) we have

cov([T_{ij}(β) − τ_i(β)], τ_k(β))
= E\{τ_k(β)[T_{ij}(β) − τ_i(β)]\} \quad since E[τ_k(β)] = 0
= E\{τ_k(β) E[T_{ij}(β) − τ_i(β) | z_i, x_i]\} = E[τ_k(β) × 0] = 0

for all i, j, k. For j ≠ k,

E\{[T_{ij}(β) − τ_i(β)][T_{ik}(β) − τ_i(β)]\} = E(E\{[T_{ij}(β) − τ_i(β)][T_{ik}(β) − τ_i(β)] | z_i, x_i\})
= E\{E[T_{ij}(β) − τ_i(β) | z_i, x_i]\, E[T_{ik}(β) − τ_i(β) | z_i, x_i]\} = 0

by the independence of x_{ij}, j = 1, 2, ..., M for fixed i, and (57). Also for i ≠ l,

E\{[T_{ij}(β) − τ_i(β)][T_{lk}(β) − τ_l(β)]\} = E[T_{ij}(β) − τ_i(β)]\, E[T_{lk}(β) − τ_l(β)] = 0

by the independence of (z_i, x_i), i = 1, 2, ..., N. Therefore

var[ψ_{MC}(β)] = var\left[\frac{1}{M} \sum_{i=1}^{N} \sum_{j=1}^{M} T_{ij}(β)\right] = var\left\{\frac{1}{M} \sum_{i=1}^{N} \sum_{j=1}^{M} [T_{ij}(β) − τ_i(β)] + \sum_{i=1}^{N} τ_i(β)\right\}

= var\left\{\sum_{i=1}^{N} \frac{1}{M} \sum_{j=1}^{M} [T_{ij}(β) − τ_i(β)]\right\} + var\left[\sum_{i=1}^{N} τ_i(β)\right]

= \frac{N}{M} var[T_{ij}(β) − τ_i(β)] + N I(β) = N[I(β) + O(M^{−1})].

T_{ij}(β) is an unbiased estimating function of the independent random variables (z_i, x_i), i = 1, 2, ..., N and the independent random variables x_{ij}, j = 1, 2, ..., M for fixed i. Also h(x_{ij}|y_i, v_i), the probability density function of x_{ij}, does not depend on β. Therefore

E\left[\frac{∂}{∂β} T_{ij}(β)\right] = −E[T_{ij}(β) S_{ii}] = −E\{E[T_{ij}(β)|z_i, x_i] S_{ii}\} = −E[τ_i(β) S_{ii}]
= −E[∆_i S^2(y_i|x_i, v_i, β) + (1 − ∆_i) S_{ii} E_i(S_{ii})] = −E\{∆_i S_{ii}^2 + (1 − ∆_i)[E_i(S_{ii})]^2\}
= −I(β)


where S_{ii} = S(y_i|x_i, v_i, β), and thus

E\left[\frac{∂}{∂β} ψ_{MC}(β)\right] = −N I(β).

The asymptotic variance of \hat{β}_N is given by

\frac{var[ψ_{MC}(β)]}{\{E[\frac{∂}{∂β} ψ_{MC}(β)]\}^2} = \frac{N[I(β) + O(M^{−1})]}{N^2 \{E[\frac{∂}{∂β} T_{ij}(β)]\}^2} ≈ [N I(β)]^{−1} \quad provided M → ∞.

The proof that ψMC (β) is optimal is essentially identical to the proof of Theorem 1.

Proof of Lemma 2. Note that for R satisfying (25) and any function G(X, V) we have

E[G(X, v) R(X, V, v)] = E\{G(X, v) E[R(X, V, v)|X]\} = E[G(X, V)|V = v].  (58)

Also for for W defined by (26)

E[W (X,U, y, v, β)] = E

∙∆

η(X,V, β)

f(y|X, v;β)

f(y|v;β) R(X,V, v)

¸= E

½E

∙∆

η(X,V, β)

f(y|X, v;β)

f(y|v;β) R(X,V, v)|X,V

¸¾= E

∙f(y|X, v;β)

f(y|v;β) R(X,V, v)

¸= E

½f(y|X, v;β)

f(y|v;β) E[R(X,V, v)|X]¾

= E

∙f(y|X, v;β)

f(y|v;β)f(X|v)f(X)

¸=

1

f(y|v;β)E∙f(y|X, v;β)

f(X|v)f(X)

¸=

1

f(y|v;β)E [f(y|X,V ;β)|V = v] =f(y|v;β)f(y|v;β) = 1.

Therefore with W given by (26)

E[W (X,U, y, v, β)H(y,X, v, β)] =E[W (X,U, y, v, β)H(y,X, v, β)]

E[W (X,U, y, v, β)]

=E[ ∆

η(X,V,β)H(y,X, v, β)f(y|X, v;β)R(X,V, v)] 1f(y|v;β)

E[ ∆η(X,V,β)f(y|X, v;β)R(X,V, v)] 1

f(y|v;β)

=EnEh

∆η(X,V,β)H(y,X, v, β)f(y|X, v;β)R(X,V, v)|X,V

ioEnEh

∆η(X,V,β)f(y|X, v;β)R(X,V, v)|X,V

io=

E [H(y,X, v, β)f(y|X, v;β)R(X,V, v)]

E [f(y|X, v;β)R(X,V, v)]

=E[H(y,X, V, β)f(y|X,V ;β)|V = v]

E[f(y|X,V ;β)|V = v]using (58)

= E[H(Y,X, V, β)|Y = y, V = v]

and (10) holds.
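The identity can also be verified by simulation in a toy discrete model. All ingredients below are hypothetical illustrations rather than the paper's setup: binary $X$, $V$ and $Y$, a made-up selection probability $\pi(y,v)$, the test function $H(y,x,v,\beta)=x$, and the choice $R(x,u,v)=I(u=v)/P(V=v)$, which satisfies $E[R(X,V,v)|X]=f(X|v)/f(X)$ by Bayes' rule.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n = 400_000

px = np.array([0.4, 0.6])                 # P(X = x), X in {0, 1}
pv_given_x = np.array([[0.7, 0.3],        # P(V = v | X = x)
                       [0.3, 0.7]])

def f_y(y, x, v):                         # P(Y = y | x, v), Y in {0, 1}
    p1 = 0.2 + 0.3 * x + 0.2 * v
    return np.where(y == 1, p1, 1.0 - p1)

def pi(y, v):                             # selection probability pi(y, v)
    return 0.3 + 0.4 * y + 0.1 * v

def eta(x, v):                            # eta(x, v) = E[pi(Y, V) | x, v]
    return pi(0, v) * f_y(0, x, v) + pi(1, v) * f_y(1, x, v)

X = rng.binomial(1, px[1], n)
V = rng.binomial(1, pv_given_x[X, 1])
Y = rng.binomial(1, 0.2 + 0.3 * X + 0.2 * V)
D = rng.binomial(1, pi(Y, V))             # Delta: missing at random given (Y, V)

y0, v0 = 1, 1                             # the fixed arguments (y, v)
pv = px @ pv_given_x                      # marginal P(V = v)
fx = np.arange(2)
f_y0_v0 = (px * pv_given_x[:, v0] * f_y(y0, fx, v0)).sum() / pv[v0]

R = (V == v0) / pv[v0]                    # E[R(X, V, v0) | X] = f(X|v0)/f(X)
W = D / eta(X, V) * f_y(y0, X, v0) / f_y0_v0 * R

post = px * pv_given_x[:, v0] * f_y(y0, fx, v0)   # prop. to P(x, v0, y0)
print(W.mean())                                   # ~ 1, i.e. E[W] = 1
print((W * X).mean() / W.mean())                  # ~ E[X | Y = y0, V = v0]
print(post[1] / post.sum())                       # exact value for comparison
\end{verbatim}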


Proof of Theorem 3. The joint likelihood is
\[
L(\beta,F_{x|v},F_v)=\prod_{i=1}^N[f(y_i|x_i,v_i;\beta)F(dx_i|v_i)]^{\Delta_i}[f(y_i|v_i;\beta)]^{1-\Delta_i}\prod_{i=1}^N F_v(dv_i).
\]
For convenience, we assume that the conditional distribution of $X$ given $V$ is discrete and has countable support in a set $S(v)$ which may depend on $v$. Assume for the present that $\beta$ is known and drop it from the notation. In order to maximize this likelihood for fixed $\beta$, we need to maximize

\[
\prod_{i=1}^N[F(dx_i|v_i)]^{\Delta_i}[f(y_i|v_i)]^{1-\Delta_i}\prod_{i=1}^N F_v(dv_i)
=\prod_{i=1}^N[F(dx_i|v_i)]^{\Delta_i}\Big[\sum_{x\in S(v)}f(y_i|x,v_i)F(dx|v_i)\Big]^{1-\Delta_i}\prod_{i=1}^N F_v(dv_i)\tag{59}
\]
subject to the usual constraint that $F(x|v)$ is a cumulative distribution function for each $v$. Note that since the values of $v$ are all observed, we can rely on the usual nonparametric maximum likelihood arguments (see Efron 1967; Turnbull 1974) to show that the empirical cumulative distribution function, $\widehat F_v(v)=\frac{1}{N}\sum_{i=1}^N I(v_i\le v)$, is the nonparametric maximum likelihood estimator of $F_v$. Therefore we consider the maximization of
\[
\sum_{i=1}^N\Big\{\Delta_i\ln[f(x_i|v_i)]+(1-\Delta_i)\ln\Big[\sum_{x\in S(v)}f(y_i|x,v_i)f(x|v_i)\Big]\Big\},\tag{60}
\]

where f(x|v) = F (dx|v) is the jump size. For given S(v), we can maximize (60) in a straightforwardmanner (see, for example, Zhao 2005) using Lagrange multipliers to accommodate the constraintP

x F (dx|v) = 1. For each pair (x, v) we differentiate (60) with respect to f(x|v) and set thederivative equal to zero to obtain

NXi=1

{∆iI(xi = x, vi = v)

f(x|v) + (1−∆i)I(vi = v)f(yi|x, vi;β)P

z∈S(v) f(yi|z, v;β)f(z|v)} = c(v) (61)

where c(v) is a Lagrange multiplier which may depend on v. Multiplying (61) by f(x|v) we obtainNXi=1

{∆iI(xi = x, vi = v) + (1−∆i)I(vi = v)f(yi|x, vi;β)f(x|v)P

z∈S(v) f(yi|z, v;β)f(z|v)} = c(v)f(x|v) (62)

and summing (62) over all x ∈ S(v) gives c(v) = N bFv(dv). Assuming there is at least oneobservation with vj = v, then N bFv(dv) > 0 and (61) can be written asPN

i=1∆iI(xi = x, vi = v)

N bFv(dv) = bηP (x, v, β)f(x|v) (63)

where

bηP (x, v, β) = 1− PNi=1(1−∆i)I(vi = v)f(yi|x, vi;β)/

Pz∈S(v) f(yi|z, v;β)f(z|v)PN

i=1 I(vi = v).

To solve (63) two cases need to be considered. If bηP (x, v, β) < 0 then f(x|v) = 0 andPNi=1∆iI(xi =

x, vi = v) = 0. If bηP (x, v, β) > 0 then the size of the jump f(x|v) is

f(x|v) =PN

i=1∆iI(xi = x, vi = v)

N bFv(dv)bηP (x, v, β) . (64)


Note that if $\widehat\eta_P(x,v,\beta)=0$, then $\sum_{i=1}^N\Delta_iI(x_i=x,v_i=v)=0$ and in this case $f(x|v)$ is undetermined by (63). In summary, $F(x|v)$ must only have jumps either at validated $x$ corresponding to the given value of $v$ or at unobserved values of $x$ for which $\widehat\eta_P(x,v,\beta)=0$.

The solution to equation (63) is not, in general, unique. For example, suppose $x_1\neq x_2$, $x_1,x_2\in S(v)$, and $x_1$ and $x_2$ are not in the validation sample. Then if $f(y_1|x_1,v;\beta)=f(y_2|x_2,v;\beta)$, where $(y_1,v)$ and $(y_2,v)$ are points in the non-validated sample, the total mass associated with $f(x_1|v)$ and $f(x_2|v)$ can be distributed arbitrarily between them without changing the value of the term
\[
\sum_{x\in S(v)}f(y_i|x,v_i;\beta)f(x|v_i)
\]
in the likelihood.

An intuitive argument for (63) can also be given. By the law of large numbers, under mild integrability conditions and for $v$ in the support of the discrete distribution of $V$,
\[
\frac{\sum_{i=1}^N(1-\Delta_i)I(v_i=v)f(y_i|x,v_i;\beta)/f(y_i|v_i;\beta)}{N\widehat F_v(dv)}\;\to\;E\Big[\frac{f(Y|x,v;\beta)}{f(Y|v;\beta)}\,(1-\Delta)\,\Big|\,V=v\Big]=1-\eta(x,v,\beta),
\]

indicating that $\widehat\eta_P(x,v,\beta)$ is a consistent estimator of $\eta(x,v,\beta)$ based only on the non-validated observations. The two sides of (63) are then both approximations to $P(X=x,\Delta=1|V=v)$: the estimate on the left side is based only on the validated observations, while the approximation on the right-hand side uses $F(x|v)$ together with an estimate of $\eta(x,v,\beta)$ obtained only from the non-validated observations.

We now consider the problem of specifying the support set $S(v)$, which is usually unknown. Since (60) can be decomposed into terms involving the different values of $v$, we consider the maximization of one of these terms for a fixed value of $v$. Assume for this $v$ that $x_1,\ldots,x_j$ are the validated observations and that there are $n-j$ non-validated observations. We wish to maximize
\[
\sum_{i=1}^j\ln[f(x_i|v)]+\sum_{i=j+1}^n\ln\Big[\sum_{x\in S(v)}f(y_i|x,v)f(x|v)\Big].\tag{65}
\]

If we take $S(v)$ to be $\{x_1,\ldots,x_j\}$, then there are points $x$ in the true support set which have small values of $\eta(x,v,\beta)$ and will therefore be under-represented, since they are less likely to be validated. Our problem is to augment, if possible, the set of validated observations $\{x_1,\ldots,x_j\}$ with a support point $x\notin\{x_1,\ldots,x_j\}$ with positive probability mass which will increase the value of the log-likelihood (65).

Suppose then that we permit the addition of an $x$ from $S$, a larger set of candidate support points. For convenience we can choose $S$ to be an evenly spaced finite set of possible $x$ values. The value of $x$ we should choose is the one which, when given a small probability $\varepsilon>0$, increases the value of (65) by the greatest amount. Suppose we scale down the probabilities given to $x_1,\ldots,x_j$ by $(1-\varepsilon)$ and give probability $\varepsilon$ to $x$. Then the log-likelihood is
\[
\sum_{i=1}^j\ln[f(x_i|v)]+\sum_{i=j+1}^n\ln\Big[\sum_{l=1}^j f(y_i|x_l,v)f(x_l|v)(1-\varepsilon)+\varepsilon f(y_i|x,v)\Big]
\]
and its derivative with respect to $\varepsilon$ at $\varepsilon=0$ is
\begin{align*}
\sum_{i=j+1}^n\frac{f(y_i|x,v)-f(y_i|v)}{f(y_i|v)}
&=\sum_{i=1}^N(1-\Delta_i)I(v_i=v)\Big[\frac{f(y_i|x,v)}{f(y_i|v)}-1\Big]\\
&=N\widehat F_v(dv)[1-\widehat\eta_P(x,v,\beta)]-\sum_{i=1}^N(1-\Delta_i)I(v_i=v)\\
&=\sum_{i=1}^N[\Delta_i-\widehat\eta_P(x,v,\beta)]I(v_i=v).
\end{align*}
This derivative is positive, and the likelihood increases by the movement of some mass to the point $x$, provided that
\[
\widehat\eta_P(x,v,\beta)<\frac{\sum_{i=1}^N\Delta_iI(v_i=v)}{\sum_{i=1}^N I(v_i=v)};\tag{66}
\]
that is, the estimated probability of a point at $(x,v)$ being validated is less than the proportion of validated observations in the stratum $V=v$. The value of $x$ which makes the largest local difference to the likelihood is
\[
x_v=\operatorname*{argmax}_{x\in S}\sum_{i=1}^N[\Delta_i-\widehat\eta_P(x,v,\beta)]I(v_i=v)=\operatorname*{argmin}_{x\in S}\widehat\eta_P(x,v,\beta).\tag{67}
\]

Our procedure for choosing the set of support points $S(v)$, then, is to begin with the set of validated observations $\{x_1,\ldots,x_j\}$ and, at each iteration for $\beta$, to add up to three points, one at a time, which satisfy (67) and (66).

The Algorithm. The algorithm alternates the maximization of the likelihood over $f(x|v)$ for fixed $(\beta,\sigma)$ with the maximization over $(\beta,\sigma)$ for fixed $f(x|v)$, so that each step is guaranteed to increase the likelihood. We begin with an initial estimate of $(\beta,\sigma)$ and with $S(v)$ equal to the set of all validated $x$ for the given value of $v$.

The maximum likelihood estimation of $f(x|v)$ has two essential components: the selection of the support set $S(v)$ for each $v$, and the iterative solution of (63) for fixed $S(v)$ and $(\beta,\sigma)$ to obtain the maximum likelihood estimate of $f(x|v)$ restricted to that support. That is, we solve

\[
f^{(k+1)}(x|v)=\frac{1}{N\widehat F_v(dv)}\sum_{i=1}^N\Big\{\Delta_iI(x_i=x,v_i=v)+\frac{(1-\Delta_i)I(v_i=v)f(y_i|x,v_i;\beta)f^{(k)}(x|v)}{\sum_{z\in S(v)}f(y_i|z,v;\beta)f^{(k)}(z|v)}\Big\},\tag{68}
\]

where the superscript $(k)$ is a reminder that the right-hand side uses the $k$th iterate $f^{(k)}$. All $x\in S(v)$ are initialized with $f^{(0)}(x|v)>0$. The iteration of (68) results in an estimator for which $\widehat\eta_P(x,v,\beta)>0$ at each $x\in S(v)$, but this is not necessarily true for values of $x$ outside this support; indeed, we augment the support set using values of $x$ for which (66) holds. A code sketch of this fixed-point iteration is given below.
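The sketch below implements the iteration for a single stratum $V=v$, under illustrative naming assumptions: f is a user-supplied density $f(y|x,v;\beta)$ with $v$ fixed and suppressed, x_val holds the validated $x$'s in the stratum, y_nonval the responses of its non-validated observations, and support the current $S(v)$. Consistent with (62), each update divides by the stratum size $N\widehat F_v(dv)$, so every iterate remains a probability vector on $S(v)$.
\begin{verbatim}
import numpy as np

def profile_jumps(x_val, y_nonval, support, f, beta, n_iter=200, tol=1e-10):
    # Fixed-point iteration (68) for the jump sizes f(x|v), x in S(v).
    support = np.asarray(support, dtype=float)
    x_val = np.asarray(x_val, dtype=float)
    n_strat = len(x_val) + len(y_nonval)           # N * Fv(dv): stratum size
    counts = np.array([(x_val == x).sum() for x in support])
    # fy[i, j] = f(y_i | x_j, v; beta), non-validated i, x_j in S(v)
    fy = np.array([[f(y, x, beta) for x in support] for y in y_nonval])
    p = np.full(len(support), 1.0 / len(support))  # f^(0)(x|v) > 0
    for _ in range(n_iter):
        denom = fy @ p                  # sum_z f(y_i|z,v;beta) f^(k)(z|v)
        p_new = (counts + (fy * p).T @ (1.0 / denom)) / n_strat
        if np.abs(p_new - p).max() < tol:
            return p_new
        p = p_new
    return p
\end{verbatim}
Summing the update over $x\in S(v)$ gives $[j+(n-j)]/n=1$, mirroring the summation argument that identified $c(v)=N\widehat F_v(dv)$.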

Repeat the following steps until the parameters $(\beta,\sigma)$ converge:

1. Repeat steps (a) and (b) for all $v$:

   (a) Repeat steps (i) and (ii) three times:

      (i) Find the maximum likelihood estimate $f(x|v)$, $x\in S(v)$, using (68).

      (ii) For this $f(x|v)$, find $x_v$ satisfying (67) and, if it also satisfies (66), replace $S(v)$ by $S(v)\cup\{x_v\}$.

   (b) For the resulting maximum likelihood estimate $\widehat f(x|v)$ of $f(x|v)$, defined on the points $x\in S(v)$, define weights proportional to $f(y|x,v;\beta)\widehat f(x|v)$.

2. Using these weights, find the maximum likelihood estimate of $\beta$ using (45) and the estimate of $\sigma$ using (46).

Note that in our algorithm a point $x_v$ satisfying (67) and (66) which is added to $S(v)$ does not leave the support set once added, even if the estimate of $\beta$ changes. However, there is little harm in an inappropriate support point, since the maximization of the likelihood can put zero mass on such points. A sketch of the support-augmentation step follows.
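The augmentation step can be sketched in the same style, reusing profile_jumps from above. Again the names are assumptions for illustration: grid plays the role of the evenly spaced candidate set $S$, and eta_hat_P implements $\widehat\eta_P(x,v,\beta)$ from the non-validated observations only.
\begin{verbatim}
import numpy as np

def eta_hat_P(x, y_nonval, support, p, f, beta, n_strat):
    # 1 - (non-validated mass at x) / (stratum size), as defined after (63)
    denom = [sum(f(y, z, beta) * pz for z, pz in zip(support, p))
             for y in y_nonval]
    return 1.0 - sum(f(y, x, beta) / d
                     for y, d in zip(y_nonval, denom)) / n_strat

def augment_support(x_val, y_nonval, support, p, f, beta, grid):
    # Add argmin_x eta_hat_P over the grid, per (67), when (66) holds.
    n_strat = len(x_val) + len(y_nonval)
    etas = np.array([eta_hat_P(x, y_nonval, support, p, f, beta, n_strat)
                     for x in grid])
    x_new = grid[int(np.argmin(etas))]
    if etas.min() < len(x_val) / n_strat and x_new not in support:
        support = np.append(support, x_new)
    return support
\end{verbatim}
The outer loop of the algorithm then alternates profile_jumps and augment_support within each stratum, as in steps (a)(i) and (a)(ii), with the weighted $(\beta,\sigma)$ updates of step 2.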

ACKNOWLEDGEMENTS

D. L. McLeish's research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada. The authors gratefully acknowledge very useful discussions with J. Lawless and Yang Zhao on this work. In particular, simulations conducted by Yang Zhao indicated the need to extend the support for the profile estimator, and notes and references provided by J. Lawless have been instrumental in assisting this research. The authors would also like to thank R. Little, J. D. Kalbfleisch, the Editor, the Associate Editor, and the referees for useful comments and suggestions.

REFERENCES

R. J. Carroll & M. P. Wand (1991). Semiparametric estimation in logistic measurement error models. Journal of the Royal Statistical Society Series B, 53, 573–585.

N. Chatterjee, Y. Chen & N. E. Breslow (2003). A pseudoscore estimator for regression problems with two-phase sampling. Journal of the American Statistical Association, 98, 158–168.

H. Y. Chen (2004). Nonparametric and semiparametric models for missing covariates in parametric regression. Journal of the American Statistical Association, 99, 1176–1189.

J. Chen & N. E. Breslow (2004). Semiparametric efficient estimation for the auxiliary outcome problem with the conditional mean model. Canadian Journal of Statistics, 32, 359–372.

A. P. Dempster, N. M. Laird & D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society Series B, 39, 1–38.

B. Efron (1967). The two-sample problem with censored data. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 4: Biology and Problems of Health (L. M. Le Cam & J. Neyman, eds.), University of California Press, Berkeley, 831–853.

C. Genest & L.-P. Rivest (1993). Statistical inference procedures for bivariate Archimedean copulas. Journal of the American Statistical Association, 88, 1034–1043.

P. R. Halmos (1951). Introduction to Hilbert Space and the Theory of Spectral Multiplicity. Chelsea, New York.


M. N. Jouini & R. T. Clemen (1996). Copula models for aggregating expert opinions. Management Science, 44, 444–457.

J. F. Lawless, J. D. Kalbfleisch & C. J. Wild (1999). Semiparametric methods for response-selective and missing data problems in regression. Journal of the Royal Statistical Society Series B, 61, 413–438.

S. R. Lipsitz, J. G. Ibrahim & L. P. Zhao (1999). A weighted estimating equation for missing covariate data with properties similar to maximum likelihood. Journal of the American Statistical Association, 94, 1147–1160.

R. J. A. Little & D. B. Rubin (2002). Statistical Analysis with Missing Data, 2nd edition. Wiley, New York.

R. B. Nelsen (1999). An Introduction to Copulas. Springer, New York.

M. S. Pepe & T. R. Fleming (1991). A non-parametric method for dealing with mismeasured covariate data. Journal of the American Statistical Association, 86, 108–113.

M. Reilly & M. S. Pepe (1995). A mean score method for missing and auxiliary covariate data in regression models. Biometrika, 82, 299–314.

C. P. Robert & G. Casella (1999). Monte Carlo Statistical Methods. Springer, New York.

J. M. Robins, F. Hsieh & W. Newey (1995). Semiparametric efficient estimation of a conditional density with missing or mismeasured covariates. Journal of the Royal Statistical Society Series B, 57, 409–424.

J. M. Robins, A. Rotnitzky & L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89, 846–866.

R. J. Serfling (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.

B. W. Silverman (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall, New York.

C. G. Small & D. L. McLeish (1994). Hilbert Space Methods in Probability and Statistical Inference. Wiley, New York.

B. W. Turnbull (1974). Nonparametric estimation of a survivorship function with doubly censored data. Journal of the American Statistical Association, 69, 169–173.

Z. Zhang & H. E. Rockette (2005). On maximum likelihood estimation in parametric regression with missing covariates. Journal of Statistical Planning and Inference (to appear).

Y. Zhao (2005). Design and Efficient Estimation in Regression Analysis with Missing Data in Two-Phase Studies. Doctoral dissertation, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo.
