a semiparametric estimator of survival for doubly truncated data

13
Research Article Received 19 October 2009, Accepted 11 March 2010 Published online in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/sim.3938 A semiparametric estimator of survival for doubly truncated data Carla Moreira and Jacobo de Uña-Álvarez Doubly truncated data are often encountered in the analysis of survival times, when the sample reduces to those individuals with terminating event falling on a given observational window. In this paper we assume that some information about the bivariate distribution function (df) of the truncation times is available. More specifically, we represent this information by means of a parametric model for the joint df of the truncation times. Under this assumption, a new semiparametric estimator of the lifetime df is derived. We obtain asymptotic results for the new estimator, and we show in simulations that it may be more efficient than the Efron–Petrosian nonparametric maximum likelihood estimator. Data on the age at diagnosis of childhood cancer in North Portugal are analyzed with the new method. Copyright © 2010 John Wiley & Sons, Ltd. Keywords: double truncation; observational bias; nonparametric estimation; survival analysis 1. Introduction Randomly truncated data appear in a number of fields, including Epidemiology, survival analysis, economics, and Astronomy. Under left truncation, only individuals with lifetimes exceeding a random time (the truncation time) are observed, and ignoring this issue results in a severe bias in estimation. A typical example involving left truncation is that of cross-sectional sampling, for which the sample reduces to the individuals in progress at a given date. Right truncation is a similar phenomenon, and studies on AIDS incubation (among others) are known to suffer from this type of selection issue. This is because typically AIDS databases consist only of those individuals diagnosed prior to some specific date. Nonparametric methods for one-sided (left or right) truncated data were introduced in the seminal paper [1], see also [2] and [3]. Some authors have pointed out that the available information about the truncation time allows for the construction of more efficient estimators. See for example [4--6]. Doubly truncated data are sometimes encountered. Double truncation means that only lifetimes falling on an observable random interval will be recruited. Efron and Petrosian [7] motivated this problem with quasar data, and they also introduced the nonparametric maximum likelihood estimator (NPMLE) under double truncation. Also, Bilker and Wang [8] noticed that time from HIV infection to AIDS diagnosis can be treated as doubly truncated because, besides the right- truncation issue discussed above, HIV was unknown before 1982, thus leading to a left-truncated setup. The NPMLE for doubly truncated data was revisited in [9], which formally established its uniform consistency and convergence to a normal. Moreira and de Uña-Álvarez [10] introduced a bootstrap approximation for the NPMLE. For further motivation and in order to help intuition, let us introduce in some detail the childhood cancer data analyzed in Section 4. These data concern all the cases of childhood cancer diagnosed in North Portugal between 1 January 1999 and 31 December 2003. Assume X as the age at diagnosis, and let V be the time from birth to the end of recruitment. Then, the sampling information reduces to those ( X , V ) values satisfying U X V , where U = V 5 and where the time is measured in years. Here, the truncation interval [U , V ] varies from individual to individual, since it is Facultad de CC. Económicas y Empresariales, Universidad de Vigo, Campus Lagoas-Marcosende, 36310 Vigo, Spain Correspondence to: Carla Moreira, Facultad de CC. Económicas y Empresariales, Universidad de Vigo, Campus Lagoas-Marcosende, 36310 Vigo, Spain. E-mail: [email protected], [email protected] Contract/grant sponsor: Spanish Ministerio de Ciencia e Innovación; contract/grant number: MTM2008-03129 Contract/grant sponsor: Xunta de Galicia; contract/grant number: PGIDIT07PXIB300191PR Contract/grant sponsor: Portuguese Fundação Ciência e Tecnologia; contract/grant number: SFRH/BD/60262/2009 Contract/grant sponsor: Xunta de Galicia under the INBIOMED project, (DXPCTSUG, Ref. 2009/063) Copyright © 2010 John Wiley & Sons, Ltd. Statist. Med. 2010, 29 3147–3159 3147

Upload: carla-moreira

Post on 06-Jul-2016

212 views

Category:

Documents


0 download

TRANSCRIPT

Research Article

Received 19 October 2009, Accepted 11 March 2010 Published online in Wiley Online Library

(wileyonlinelibrary.com) DOI: 10.1002/sim.3938

A semiparametric estimator of survivalfor doubly truncated data

Carla Moreira∗† and Jacobo de Uña-Álvarez

Doubly truncated data are often encountered in the analysis of survival times, when the sample reduces to those individuals withterminating event falling on a given observational window. In this paper we assume that some information about the bivariatedistribution function (df) of the truncation times is available. More specifically, we represent this information by means of aparametric model for the joint df of the truncation times. Under this assumption, a new semiparametric estimator of the lifetimedf is derived. We obtain asymptotic results for the new estimator, and we show in simulations that it may be more efficient thanthe Efron–Petrosian nonparametric maximum likelihood estimator. Data on the age at diagnosis of childhood cancer in NorthPortugal are analyzed with the new method. Copyright © 2010 John Wiley & Sons, Ltd.

Keywords: double truncation; observational bias; nonparametric estimation; survival analysis

1. Introduction

Randomly truncated data appear in a number of fields, including Epidemiology, survival analysis, economics, andAstronomy. Under left truncation, only individuals with lifetimes exceeding a random time (the truncation time) areobserved, and ignoring this issue results in a severe bias in estimation. A typical example involving left truncation is thatof cross-sectional sampling, for which the sample reduces to the individuals in progress at a given date. Right truncationis a similar phenomenon, and studies on AIDS incubation (among others) are known to suffer from this type of selectionissue. This is because typically AIDS databases consist only of those individuals diagnosed prior to some specific date.

Nonparametric methods for one-sided (left or right) truncated data were introduced in the seminal paper [1], seealso [2] and [3]. Some authors have pointed out that the available information about the truncation time allows for theconstruction of more efficient estimators. See for example [4--6].

Doubly truncated data are sometimes encountered. Double truncation means that only lifetimes falling on an observablerandom interval will be recruited. Efron and Petrosian [7] motivated this problem with quasar data, and they alsointroduced the nonparametric maximum likelihood estimator (NPMLE) under double truncation. Also, Bilker and Wang[8] noticed that time from HIV infection to AIDS diagnosis can be treated as doubly truncated because, besides the right-truncation issue discussed above, HIV was unknown before 1982, thus leading to a left-truncated setup. The NPMLEfor doubly truncated data was revisited in [9], which formally established its uniform consistency and convergence to anormal. Moreira and de Uña-Álvarez [10] introduced a bootstrap approximation for the NPMLE.

For further motivation and in order to help intuition, let us introduce in some detail the childhood cancer data analyzedin Section 4. These data concern all the cases of childhood cancer diagnosed in North Portugal between 1 January 1999and 31 December 2003. Assume X∗ as the age at diagnosis, and let V ∗ be the time from birth to the end of recruitment.Then, the sampling information reduces to those (X∗,V ∗) values satisfying U∗�X∗�V ∗, where U∗ =V ∗−5 and wherethe time is measured in years. Here, the truncation interval [U∗,V ∗] varies from individual to individual, since it is

Facultad de CC. Económicas y Empresariales, Universidad de Vigo, Campus Lagoas-Marcosende, 36310 Vigo, Spain∗Correspondence to: Carla Moreira, Facultad de CC. Económicas y Empresariales, Universidad de Vigo, Campus Lagoas-Marcosende, 36310

Vigo, Spain.†E-mail: [email protected], [email protected]

Contract/grant sponsor: Spanish Ministerio de Ciencia e Innovación; contract/grant number: MTM2008-03129Contract/grant sponsor: Xunta de Galicia; contract/grant number: PGIDIT07PXIB300191PRContract/grant sponsor: Portuguese Fundação Ciência e Tecnologia; contract/grant number: SFRH/BD/60262/2009Contract/grant sponsor: Xunta de Galicia under the INBIOMED project, (DXPCTSUG, Ref. 2009/063)

Copyright © 2010 John Wiley & Sons, Ltd. Statist. Med. 2010, 29 3147–3159

3147

C. MOREIRA AND J. DE UÑA-ÁLVAREZ

determined by the (random) individual’s birthdate. The same truncation pattern will be found in any application in whichthe sample reduces to those subjects with terminating event in a fixed observational window. Now, if we are interested inthe estimation of the distribution function (df) of X∗, computation of the ordinary empirical df cannot be recommendedin general. This is because each value x of X∗ is observed with a probability given by

P(U∗�X∗�V ∗|X∗ = x)= P(U∗�x�V ∗),

(assuming independence between (U∗,V ∗) and X∗), which is clearly influenced by the x-value. Roughly speaking,relatively small and large values of X∗ may be less frequently observed, hence inducing a sampling bias which isimmediately transferred to a systematic bias of the ordinary empirical df. Of course, for specific choices of the distributionof (U∗,V ∗) this observational bias can be more or less severe, and this will be explored in more detail in Section 3.Similar problems arise when the interest is focused on the df of V ∗, which for the childhood cancer data is linked tothe so-called birth process of the diseased population. Note that, since U∗ =V ∗−5, the pair (X∗, X∗+5) plays the roleof truncation interval for V ∗.

In this paper we introduce a new estimator under double truncation which makes use of some available informationabout the df of the pair of truncation times. The definition of the estimator and its main properties are given in Section 2.This new estimator is semiparametric, since some parametric family of dfs is assumed for the truncation times, whereasnothing is assumed about the lifetime df. This situation can be realistic in practice; for example for the childhood cancerdata, the birth dates of the individuals developing the disease could be assumed in principle to follow a stationaryprocess, thus leading to a uniform distribution of U∗ (resp. of V ∗). Interestingly, unlike the NPMLE, the new estimatorhas explicit form, which facilitates its usage and analysis. More specifically, we introduce an empirical approximationof the standard error of the estimator and we use it to construct confidence limits. Simulations reported in Section 3show that the new estimator may behave more efficiently than the Efron–Petrosian NPMLE. Section 4 is devoted to theanalysis of the childhood cancer data, whereas the technical details are deferred to Section 5.

2. The estimator: asymptotic results

First we introduce some notation. Let X∗ be the lifetime of ultimate interest and let (U∗,V ∗) be the pair of truncationtimes, hence we observe the triplet (U∗, X∗,V ∗) if and only if U∗�X∗�V ∗. We assume that (U∗,V ∗) is independent ofX∗. Let F denote the df of X∗, and let K be the joint df of (U∗,V ∗). We denote by (Ui , Xi ,Vi ), 1�i�n, the observeddata. Throughout this paper we assume that K belongs to a parametric family of dfs, {K�}�∈� say, where � is a vectorof parameters and � stands for the parametric space. In addition, �0 will denote the ‘true’ parameter.

Under the described double truncation scenario, the relative probability of observing a lifetime x is proportional to

G(x;�)=∫

{u�x�v}K�(du,dv).

In order to see this, let F∗ be the df of the observed lifetimes. Then,

F∗(x)= P(X∗�x |U∗�X∗�V ∗)= P(�)−1∫ x

0G(t;�)dF(t),

where P(�)= P(U∗�X∗�V ∗)=∫∞0 G(t;�)dF(t). As a consequence, the quotient F∗(dx)/F(dx) (i.e. the relative prob-

ability of sampling x) reduces to P(�)−1G(x;�). Assume that G(x;�) is strictly positive on the support of X∗. Oneimmediately obtains the reversed equation

F(x)= P(�)∫ x

0

dF∗(t)

G(t;�)≡∫ x

0

dF∗(t)

G(t;�)

/∫ ∞

0

dF∗(t)

G(t;�),

and hence a natural estimator of F follows:

F(x; �)=∫ x

0

dF∗n (t)

G(t; �)

/∫ ∞

0

dF∗n (t)

G(t; �)≡ P(�)

∫ x

0

dF∗n (t)

G(t; �), (1)

where � is a suitable estimator of �, and where F∗n stands for the ordinary empirical df of the Xi s.

The estimator F(x; �) can be alternatively motivated as an MLE. As noted by Shen [9], the likelihood of the(Ui , Xi ,Vi )’s (L) can be factorized as a product of the conditional likelihood of the (Ui ,Vi ) given the Xi ’s (Lc), and

3148

Copyright © 2010 John Wiley & Sons, Ltd. Statist. Med. 2010, 29 3147--3159

C. MOREIRA AND J. DE UÑA-ÁLVAREZ

the marginal likelihood of the Xi ’s (Lm):

L=Lc ×Lm ≡n∏

i=1

g(Ui ,Vi ;�)

G(Xi ;�)×

n∏i=1

G(Xi ;�)

P(�),

where g(u,v;�)=�2/�u�vP(U∗�u,V ∗�v)= K�(du,dv) stands for the joint density of (U∗,V ∗) (assumed to exist; see

Remark 2.1 below for the definition of g when (U∗,V ∗) falls on a line). For each �, Lm is maximized by

F(x;�)= P(�)∫ x

0

dF∗n (t)

G(t;�)≡∫ x

0

dF∗n (t)

G(t;�)

/∫ ∞

0

dF∗n (t)

G(t;�),

and the maximum is a constant [4]. Hence, the maximizer of the full-likelihood L is given by (�, F(x; �)), where� stands for the maximizer of Lc.

The estimator (1) reduces to the semiparametric estimator in [4] when there is no truncation from the right(i.e. P(V ∗ =∞)=1). Note that, unlike with left-truncated data, the function G(x;�) does not need to be (and, in general,it will not be) monotone here.

Remark 2.1In some instances the random vector (U∗,V ∗) will fall on a line w. p. 1, V ∗ =U∗+� say, see Section 4 for moti-vation. In this case, we rather have (whenever v=u+�) g(u,v;�)=�/�u P(U∗�u)= L(du;�) and G(x;�)= L(x;�)−L(x −�;�), where L(.;�) is the df of the parametric model assumed for U∗. In particular, if U∗ follows a uniformdistribution on an interval which contains (aF −�,bF ), where (aF ,bF ) stands for the support of X∗, we have that G(x;�)is constant. In this case, there is no observational bias, and the ordinary empirical df of the Xi is a consistent estimatorof F .

Remark 2.2Unlike the Efron–Petrosian NPMLE (which must be computed in an iterative way), the new semiparametric estimatorF(x; �) has explicit form. This immediately leads to simpler asymptotic expressions for the limiting distribution. As itwill be seen, a simple plug-in method to estimate the standard error of (1) can be introduced.

Now we state the main results for both the estimated parameter � and the semiparametric estimator F(x; �). As weare mainly interested in testing problems about � and the construction of confidence limits for F(x), we only reporthere the results concerning the distributional convergence of these estimators. Of course, formal results of consistencycan be also obtained following arguments similar to those in [4]. Also, in order to favor the reading of the manuscript,we only refer to the needed, notationally involved assumptions as ‘under regularity’. These assumptions mainly imposesmoothness on logLc and the convergence of the solution of the maximum likelihood equation. Similarly, all the limitvariances are assumed to be finite. See Section 5 for further details.

Theorem 2.1Let � be a solution to the maximum likelihood equation, that is

���

logLc (�)=0.

Under regularity, we have√

n(�−�0)→ N (0, I (�0)−1) in law, where I (�0) is the Fisher information matrix

I (�0)=−E�0

[�2

��2log

g(Ui ,Vi ;�)

G(Xi ;�)

∣∣∣∣�=�0

].

As usual, in practice one will rather use the empirical Fisher information

I (�)=−1

n

n∑i=1

�2

��2log

g(Ui ,Vi ;�)

G(Xi ;�)

∣∣∣∣∣�=�

to compute the standard errors and covariances of the estimated parameters.

Example 2.1As an illustrative example, we consider a situation in which U∗ ∼Beta(�1,1) and V ∗ ∼Beta(1,�2) are independent.Then, we have g(u,v;�)=�1�2u�1−1(1−v)�2−1, 0<u,v<1. Assume that the support of X∗, (aF ,bF ) say, is containedin the interval (0,1), hence we have for aF<x<bF

G(x;�)= P(U∗�X∗�V ∗ | X∗ = x)= P(U∗�x)P(V ∗�x)= x�1 (1−x)�2 .

Copyright © 2010 John Wiley & Sons, Ltd. Statist. Med. 2010, 29 3147--3159

3149

C. MOREIRA AND J. DE UÑA-ÁLVAREZ

The conditional likelihood becomes Lc(�)=Lc,1(�1)Lc,2(�2), where

Lc,1(�1)=n∏

i=1

�1U�1−1i

X�1i

, Lc,2(�2)=n∏

i=1

�2(1−Vi )�2−1

(1− Xi )�2,

which are, respectively, maximized by

�1 =[−1

n

n∑i=1

logUi

Xi

]−1

, �2 =[−1

n

n∑i=1

log1−Vi

1− Xi

]−1

.

It is straightforward to obtain the second-order derivatives of the log likelihood, which are:

�2

��21

logg(u,v,�)

G(t,�)=− 1

�21

,�2

��22

logg(u,v,�)

G(t,�)=− 1

�22

,�2

��1��2log

g(u,v,�)

G(t,�)=0,

hence the Fisher information matrix becomes I (�)=diag(1/�21,1/�2

2). This example will be of further use in thesimulations section below.

For our next result we need to introduce the following matrix:

W (x,�) = P(�)2∫ ∞

0

�G(t;�)

��

dF∗(t)

G(t;�)2

∫ x

0

dF∗(t)

G(t;�)− P(�)

∫ x

0

�G(t;�)

��

dF∗(t)

G(t;�)2

=∫ ∞

0

�G(t;�)

��

1

G(t;�)[F(x)− I (t�x)]dF(t).

Theorem 2.2Under regularity, we have

√n(F(x; �)− F(x))→ N (0,�2(x)) in law, where �2(x)=�2

1(x)+�22(x), with �2

1(x)=W (x,�0)T I (�0)W (x,�0) and

�22(x)= P(�0)

[∫ x

0

dF(t)

G(t;�0)+ F2(x)

∫ ∞

0

dF(t)

G(t;�0)−2F(x)

∫ x

0

dF(t)

G(t;�0)

].

Remark 3The first term in the limit variance �2(x) comes from the variance when estimating �0, whereas the second term isdirectly related to the correction of the observational bias. Indeed, the limit variance of (1) reduces to �2

2(x) in the caseof perfect knowledge on the biasing function G(.;�0). This would be the case, for example when sampling individualswith terminating events in a fixed, given observational window of length � (V ∗ =U∗+�), if one knew the distributionalform of the ‘incidence process’ U∗. Also interestingly, in the case of no observational bias (i.e. a constant functionG(.;�0)), �2

2(x) reduces to F(x)(1− F(x)) which is just the asymptotic variance of the ordinary empirical df.

In practice, one will be interested in the construction of confidence limits for F(x). Theorem 2.2 suggests the following100(1−�) per cent confidence interval:

I1−� =(

F(x; �)±z�/2�(x)√

n

), (2)

with

�2(x)= W (x, �)T I (�)W (x, �)+ P (�)

[∫ x

0

dF(t; �)

G(t; �)+ F(x; �)2

∫ ∞

0

dF(t; �)

G(t; �)−2F(x; �)

∫ x

0

dF(t; �)

G(t; �)

],

where

W (x,�)= P(�)2∫ ∞

0

�G(t;�)

��

dF∗n (t)

G(t;�)2

∫ x

0

dF∗n (t)

G(t;�)− P(�)

∫ x

0

�G(t;�)

��

dF∗n (t)

G(t;�)2,

and where z� stands for the (1−�)-th quantile of the standard normal distribution.

3150

Copyright © 2010 John Wiley & Sons, Ltd. Statist. Med. 2010, 29 3147--3159

C. MOREIRA AND J. DE UÑA-ÁLVAREZ

3. Simulations

In this section we illustrate the finite sample behavior of the semiparametric estimator (1) through simulation studies.The results corresponding to the (conditional) MLE of �0 will also be reported. One of the main goals of our simulationswill be the comparison between the new estimator and the Efron–Petrosian NPMLE, say FEP

n (x). For this comparison,we will use the quotient of mean squared errors (MSEs) as a measure of relative efficiency. Note that if

RE(FEPn (x), F(x; �))= MSE(F(x; �))

MSE(FEPn (x))

,

attains a value of �<1, then the semiparametric estimator performs better; more specifically, it is meant that the efficiencyof the NPMLE is just the 100� per cent that of the semiparametric estimator, or that the new estimator is 1/� timesmore efficient that the NPMLE.

We have considered two different situations of double truncation. In Case 1, the pair (U∗,V ∗) has a joint densityg(u,v)=g1(u)g2(v), where g1 and g2 denote the marginal densities (hence U∗ and V ∗ are independently generated). InCase 2, we simulate U∗ and then we take V ∗ =U∗+� for some fixed constant �>0, hence the joint density of (U∗,V ∗)does not exist. Note that Case 2 follows the spirit of the childhood cancer data discussed in the Introduction.

For Case 1, we take U∗ ∼U (0,1), V ∗ ∼U (0,1), X∗ ∼U (0.25,1) for model 1.1; U∗ ∼U (0,1), V ∗ ∼U (0,1), X∗ =0.5U (0,1)2 (a Beta(1/2,1) adapted to the support (0,0.5)) for model 1.2; and U∗ ∼Beta(0.2,1), V ∗ ∼Beta(1,0.4),X∗ ∼U (0.25,1) for model 1.3. These models 1.1–1.3 fall under the umbrella of our Example 2.1 in Section 2. ForCase 2, we take �=0.25 and U∗ ∼U (0,0.75), X∗ ∼U (0,1) in model 2.1, and U∗ ∼U (0,1), X∗ ∼0.75Beta(3/4,1)+0.25in model 2.2.

In Figure 1 we illustrate the observational bias induced by each of these five models. Note that the situations rangefrom no observational bias (model 2.2), to strongly biased situations. We included in the simulations sampling biasesin favor of both small lifetimes (e.g. models 1.1 and 1.3) and large lifetimes (model 1.2); besides, situations with anobservable s-shaped curve (when the true curve is linear) were also considered (model 2.1). As parametric informationabout the pair (U∗,V ∗) we always consider a Beta(�1,1) for U∗ and a Beta(1,�2) for V ∗ in Case 1. For each model, wesimulate 1000 trials with final sample size n =50, 250, or 500. This means that, for each trial, the number of simulated

0.2 0.4 0.6 0.8 1.0

0.0

0.4

0.8

0.0

0.4

0.8

0.0

0.4

0.8

0.0

0.4

0.8

0.0

0.4

0.8

Model 1.1

0.0 0.1 0.2 0.3 0.4 0.5

Model 1.2

0.2 0.4 0.6 0.8 1.0

Model 1.3

0.0 0.2 0.4 0.6 0.8 1.0

Model 2.1

0.2 0.4 0.6 0.8 1.0

Model 2.2

Figure 1. Observational bias for the simulated models: target F (dashed line) and observable lifetime distribution approximatedby the empirical df F∗

n of a single sample with n =5000 (solid line).

Copyright © 2010 John Wiley & Sons, Ltd. Statist. Med. 2010, 29 3147--3159

3151

C. MOREIRA AND J. DE UÑA-ÁLVAREZ

data is much larger than n, actually N ≈n P(�0)−1 were needed on average, where recall that P(�0) stands for theproportion of no truncation. For the simulated models, the proportion of truncation ranged between 44 and 88 per cent.More specifically, the following right- and left-truncation proportions were obtained: 37 per cent (right) and 44 per cent(left) for model 1.1; 83 and 5 per cent for model 1.2; 9 and 35 per cent for model 1.3; 42 and 33 per cent for model 2.1;and 53 and 22 per cent for model 2.2. However, since (as usual with truncated data) we worked on the basis of reachinga given n, the type of observational bias as depicted in Figure 1 is more informative than the probability of truncationitself.

In Figures 2 and 3 we report the MSEs of the semiparametric estimator and of the NPMLE for each model andsample size, evaluated at each of the nine deciles x0.1, . . . , x0.9 of F . For comparison purposes, we also include in thesefigures the MSE pertaining to the ‘ideal’ estimator which makes use of the true biasing function G(.;�0), say F(x;�0).We also report in Table I the bias and standard deviations of the estimated parameter �0. In all the cases it is seen thatthe estimators converge to their respective targets. As is expected, the more efficient estimator was that based on thetrue biasing function; in this case, the term �2

1(x) in the variance of the semiparametric estimator, see Theorem 2.2, justvanishes. When comparing F(x; �) to FEP

n (x), the most relevant result is that in all the cases the relative efficiency of theNPMLE was below 1, with the only exception of model 2.2, see Figure 3. In this case, the biasing function is constant,and it is poorly approximated through the fitted parametric model. However, even in this exception the estimator F(x;�0)outperforms the NPMLE. In general, we can conclude that the new estimator is more efficient, with an MSE which canbe up to 13 per cent that of the NPMLE.

In all simulated scenarios we have seen that the bias in estimation is of a smaller order of magnitude than thestandard deviation. This means that the MSE of the involved estimators basically reduces to their variances. Finally, itis interesting to compare the MSEs of the ‘ideal’ estimator under double truncation, F(x;�0), to those corresponding tothe ordinary empirical df based on the same sample size, which is known to be

MSEord(x)= F(x)(1− F(x))

n.

This comparison allows to investigate the relative difficulties in estimation which are directly implied by the samplingbias. For example, in model 2.2 the MSE of F(x;�0) is close to MSEord(x) along all the x deciles (average value ofMSE F(x;�0)/MSEord(x)=0.973 in the case n =50); this is in agreement with the absence of sampling bias in this

2 4 6 8

0.015

0.000

n=50

MS

E’s

2 4 6 8

0.000

0.004

0.008

n=250

MS

E’s

2 4 6 8

0.000

0.002

0.004

n=500

MS

E’s

2 4 6 8

0.00

0.02

0.04

n=50

MS

E’s

2 4 6 8

0.000

0.015

0.030

n=250

MS

E’s

2 4 6 8

0.000

0.010

0.020

n=500

MS

E’s

2 4 6 8

0.000

0.004

0.008

n=50

MS

E’s

2 4 6 8

0.0000

0.0010

n=250

MS

E’s

2 4 6 8

n=500

MS

E’s

Figure 2. MSE of the semiparametric estimator (solid line), the Efron–Petrosian NPMLE (dashed line), and the idealestimator with perfect knowledge on the biasing function (dotted line), along 1000 trials of a sample size n. From top to

bottom: models 1.1, 1.2, and 1.3.

3152

Copyright © 2010 John Wiley & Sons, Ltd. Statist. Med. 2010, 29 3147--3159

C. MOREIRA AND J. DE UÑA-ÁLVAREZ

0.00

0.02

0.04

n=50

n=50

MS

E’s

0.000

0.005

0.010

0.015

n=250

n=250

MS

E’s

0.000

0.002

0.004

0.006

0.008

n=500

n=500

MS

E’s

MS

E’s

0.000

0.005

0.010

0.015

MS

E’s

0.000

0.001

0.002

0.003

0.004

MS

E’s

0.0000

0.0010

0.0020

Figure 3. MSE of the semiparametric estimator (solid line), the Efron–Petrosian NPMLE (dashed line), and the idealestimator with perfect knowledge on the biasing function (dotted line), along 1000 trials of a sample size n. From top

to bottom: models 2.1 and 2.2.

Table I. Bias and standard deviation (in brackets) of the estimated parameters for the simulated models.

n Models �1 �2

1.1 0.017 (0.151) 0.016 (0.144)1.2 0.022 (0.145) 0.020 (0.142)

50 1.3 0.005 (0.030) 0.011 (0.061)2.1 0.074 (0.312)2.2 0.036 (0.455)

1.1 0.004 (0.067) 0.003 (0.063)1.2 0.004 (0.064) 0.001 (0.063)

250 1.3 0.001 (0.013) 0.002 (0.027)2.1 0.025 (0.138)2.2 0.023 (0.199)

1.1 0.002 (0.045) 0.003 (0.045)1.2 0.003 (0.045) 0.000 (0.045)

500 1.3 0.000 (0.009) 0.001 (0.018)2.1 0.003 (0.094)2.2 0.009 (0.139)

model (Figure 1). On the contrary, in model 1.2 (under which there is a significative observational bias), the MSE ofF(x;�0) is up to 1 order of magnitude greater than that associated to the ordinary empirical df (the average value ofMSEF(x;�0)/MSEord(x) along all the x deciles is 4.699 in the case n =50).

A very interesting question is how robust is the semiparametric estimator to model misspecifications. To inves-tigate this issue, we simulate again model 1.1 but we take U∗ ∼Beta(1,a) with a =1, so Beta(�1,1) is a wrongparametric information about U∗, whereas a Beta(1,�2) is still correctly specified for V ∗. The efficiency of theNPMLE relative to the semiparametric estimator is depicted in Figure 4 for a =1/5,1/2,3/2, and 4, for the casen =250. This figure shows that the new estimator may behave still better than the NPMLE under a slight misspec-ification of the parametric model (a =1/2,3/2) and that it may lose much efficiency under strong departures fromreality (a =1/5,4). For a deeper insight we provide separately the bias and the standard deviation of both estima-tors in Table II. From this table it is seen that the misspecification of the parametric model results in a strong biasfor the new estimator, which reaches the same order of magnitude as the standard deviation, being responsible forthe loss of efficiency when |a−1 | is large. Similar results (not shown) were obtained when misspecifying the dfof V ∗.

Copyright © 2010 John Wiley & Sons, Ltd. Statist. Med. 2010, 29 3147--3159

3153

C. MOREIRA AND J. DE UÑA-ÁLVAREZ

2 4 6 8

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

deciles

a=1/5a=1/2a=3/2a=4

Figure 4. MSE of the semiparametric estimator (S P) relative to that of Efron–Petrosian NPMLE (E P), for model 1.1 withU∗ ∼Beta(1,a), along 1000 trials of sample size n =250. The semiparametric information about (U∗,V ∗) is as in model 1.1.

4. The childhood cancer data

In this section we report our analysis of the childhood cancer data. As mentioned in the Introduction, these data concernall the cases of childhood cancer diagnosed in North Portugal between 1 January 1999 and 31 December 2003. 406individuals reported complete information about the age at diagnosis X∗ (we take years as time scale) and the dateof birth D∗. As discussed in the Introduction, the age at diagnosis is doubly truncated by (U∗ =V ∗−5,V ∗), whereV ∗ stands for the elapsed time between birth and end of study (31 December 2003); in other words, V ∗ represents the ageof the individual at the closing date. Then, following our Remark 2.1 in Section 2, we have G(x;�)= L(x;�)−L(x −�;�)for the biasing function, where L(.;�) stands for the df of U∗ and � is the length of the observational window (fiveyears). In Figure 5, left, the semiparametric estimator of the df of X∗ is depicted, together with the 95 per centpointwise confidence band, which was calculated according to (2). As parametric information about U∗, we have takenthe Beta(�,1) distribution, adapted to the support of (−5,15). It is important to remark that, by definition of childhoodcancer, the support of X∗ is (0,15), and hence our sampling information reduces to the births which took place in 1984and afterwards. For comparison purposes, we also included in Figure 5, left, the Efron–Petrosian NPMLE.

In Figure 5, right, we depict the NPMLE of the df of V ∗ (which is doubly truncated by (X∗, X∗+5)) together withthe fitted Beta(�,1) model, for which we obtained �=1.19 (standard error: 0.1817). Note that the estimators in theleft panel are constructed by inverse-weighting the data Xi according to their counterparts in the right panel. We alsopoint out that the null hypothesis of a uniform distribution for V ∗ (i.e. �=1) is accepted at a 5 per cent level; thiscan be interpreted in terms of the stationarity of the birth dates for the individuals who will develop the disease.As a consequence, there is not much observational bias on the age at diagnosis in this case (see Remark 2.1). Whenanalyzing subgroups of individuals, however, we have found situations in which there exists a clear sampling bias. Thiswas the case for example for the 38 reported cases of neuroblastoma cancer (results not shown). Similarly, we haveconfirmed the existence of a remarkable bias on the Vi s for the whole data set, resembling the situation of simulatedmodel 2.1, see Figure 1. Hence, accounting for an eventual observational bias may be a matter of much importance inapplications.

In order to investigate the performance of the confidence limits in Figure 5, left, we have compared along 1000trials the Monte Carlo standard deviation sMC(x) of the semiparametric estimator and the values of the asymptoticapproximation sA(x)= �(x)/

√n, and the mean and standard deviation of sMC(x)/sA(x) for the nine deciles as reported

in Table III. We have taken model 2.2 and n =500 for these simulations since they correspond almost perfectly to thesituation in Figure 5 (see Figure 1). From Table III we see that the asymptotic formula provides a good approximationof the standard error of the semiparametric estimator at least up to quantile 0.80. At the right tail of the distribution,however, some overestimation of the actual standard deviation is appreciated. As a consequence, the confidence band inFigure 5 could be somehow inflated at the far right tail.

3154

Copyright © 2010 John Wiley & Sons, Ltd. Statist. Med. 2010, 29 3147--3159

C. MOREIRA AND J. DE UÑA-ÁLVAREZ

Table II. Bias and standard deviation of the semiparametric estimator (S P) and the Efron–PetrosianNPMLE (E P) for model 1.1, with U∗ ∼Beta(1,a), along 1000 trials of sample size n =250.

a Deciles Bias (SP) sd (SP) Bias (EP) sd (EP)

1 −0.013 0.021 0.000 0.0252 −0.027 0.030 −0.001 0.0363 −0.041 0.037 −0.002 0.0444 −0.053 0.043 −0.002 0.052

1/5 5 −0.064 0.052 −0.003 0.0596 −0.070 0.064 −0.003 0.0647 −0.072 0.064 −0.003 0.0698 −0.067 0.069 −0.004 0.0729 −0.048 0.072 −0.005 0.073

1 −0.006 0.020 0.002 0.0242 0.014 0.030 0.002 0.0353 −0.021 0.037 0.002 0.0434 −0.030 0.044 0.002 0.052

1/2 5 −0.033 0.049 0.002 0.0586 −0.036 0.055 0.001 0.0647 −0.037 0.059 0.000 0.0698 −0.032 0.063 −0.001 0.0719 −0.021 0.063 −0.001 0.070

1 0.007 0.021 0.000 0.0222 0.013 0.030 0.000 0.0333 0.019 0.036 0.001 0.0424 0.024 0.042 0.001 0.050

3/2 5 0.026 0.048 0.000 0.0586 0.028 0.053 0.002 0.0647 0.027 0.055 0.003 0.0708 0.022 0.057 0.001 0.0749 0.015 0.055 0.001 0.075

1 0.026 0.022 0.001 0.0202 0.048 0.032 0.002 0.0313 0.062 0.039 0.001 0.0404 0.070 0.045 0.000 0.049

4 5 0.072 0.051 0.000 0.0576 0.070 0.056 0.002 0.0667 0.061 0.057 0.002 0.0718 0.046 0.059 0.002 0.0769 0.026 0.056 0.002 0.076

The semiparametric information about (U∗,V ∗) is as in model 1.1.

5. Technical proofs

Technical proofs of Theorems 2.1 and 2.2 follow the standard arguments. For the sake of completeness, here we providethe key steps of the proofs.

Proof to Theorem 2.1Under regularity, we have:

0 = ���

logLc (�)=n∑

i=1

���

logg(Ui ,Vi ;�)

G(Xi ;�)

∣∣∣∣�=�

=n∑

i=1

���

logg(Ui ,Vi ;�)

G(Xi ;�)

∣∣∣∣�=�0

+n∑

i=1

�2

��2log

g(Ui ,Vi ;�)

G(Xi ;�)

∣∣∣∣�=�

(�−�0),

where we have used the mean value theorem, and where � is between � and �0. Introduce the matrix

An =−1

n

n∑i=1

�2

��2log g(Ui ,Vi ;�)G(Xi ;�)−1

∣∣∣�=�

Copyright © 2010 John Wiley & Sons, Ltd. Statist. Med. 2010, 29 3147--3159

3155

C. MOREIRA AND J. DE UÑA-ÁLVAREZ

0 5 10 15

0.0

0.2

0.4

0.6

0.8

1.0

Age at diagnosis

0 5 10 15 20

0.0

0.2

0.4

0.6

0.8

1.0

Birth to end of the study

Figure 5. Left: The semiparametric estimator of the cumulative df for the age at diagnosis (solid line) and 95 per cent pointwiseconfidence band (dotted lines); the Efron–Petrosian NPMLE is indicated (dashed line). Right: Fitted beta distribution for V ∗

(solid line) and corresponding Efron–Petrosian NPMLE (dashed line). Childhood cancer data.

Table III. Mean and standard deviation of sMC(x)/sA(x) along 1000 trialsof model 2.2 with n =500.

Deciles Mean SMC(x)/SA(x) sd SMC(x)/SA(x)

1 1.055 0.2422 1.133 0.2533 1.117 0.2434 1.147 0.2505 1.146 0.2536 1.117 0.2537 1.036 0.2488 0.891 0.2399 0.654 0.226

hence we have

An (�−�0)= 1

n

n∑i=1

���

logg(Ui ,Vi ;�)

G(Xi ;�)

∣∣∣∣�=�0

≡ 1

n

n∑i=1

�(Ui ,Vi , Xi ).

Assume that the joint density of (U∗,V ∗) exists. (In the case V ∗ =U∗+� the proof is analogous. This can be seenby replacing g(u,v) by the univariate density of U∗ as indicated in Remark 2.1, and each double integral belowby a simple integral w.r.t. u). As the conditional density of (Ui ,Vi ) given Xi is g(u,v;�0)G(Xi ;�0)−1 I (u�Xi�v)we have

E�0

[���

logg(Ui ,Vi ;�)

G(Xi ;�)

∣∣∣∣�=�0

| Xi

]=∫ ∫

u�Xi �v

���

g(u,v;�)

G(Xi ;�)

∣∣∣∣�=�0

du dv

= ���

∫ ∫u�Xi �v

g(u,v;�)

G(Xi ;�)du dv

∣∣∣∣�=�0

=0,

where we have assumed interchangeability of differentiation and integration.Hence, E�(Ui ,Vi , Xi )=0. �

3156

Copyright © 2010 John Wiley & Sons, Ltd. Statist. Med. 2010, 29 3147--3159

C. MOREIRA AND J. DE UÑA-ÁLVAREZ

Besides, assuming interchangeability of differentiation and integration,

−E�0

[�2

��2log

g(Ui ,Vi ;�)

G(Xi ;�)

∣∣∣∣�=�0

∣∣∣∣∣ Xi

]= −

∫ ∫u�Xi �v

�2

��2

g(u,v;�)

G(Xi ;�)

∣∣∣∣�=�0

du dv

+∫ ∫

u�Xi �v

���

g(u,v;�)

G(Xi ;�)

∣∣∣∣�=�0

[���

g(u,v;�)

G(Xi ;�)

∣∣∣∣�=�0

]TG(Xi ;�0)

g(u,v;�0)du dv,

= − �2

��2

∫ ∫u�Xi �v

g(u,v;�)

G(Xi ;�)du dv

∣∣∣∣�=�0

+∫ ∫

u�Xi �v

���

g(u,v;�)

G(Xi ;�)

∣∣∣∣�=�0

[���

logg(u,v;�)

G(Xi ;�)

∣∣∣∣�=�0

]T

du dv,

= 0+∫ ∫

u�Xi �v

���

logg(u,v;�)

G(Xi ;�)

∣∣∣∣�=�0

[���

logg(u,v;�)

G(Xi ;�)

∣∣∣∣�=�0

]Tg(u,v;�0)

G(Xi ;�0)du dv,

= E�0

⎡⎣( ���

logg(Ui ,Vi ;�)

G(Xi ;�)

)(���

logg(Ui ,Vi ;�)

G(Xi ;�)

)T∣∣∣∣∣�=�0

∣∣∣∣∣∣ Xi

⎤⎦ ,

hence the information matrix I (�0)= E�0 [�(Ui ,Vi , Xi )�(Ui ,Vi , Xi )T] becomes

I (�0)=−E�0

[�2

��2log

g(Ui ,Vi ;�)

G(Xi ;�)

∣∣∣∣�=�0

].

Under regularity, we have An → I (�0), hence

�−�0 ∼ I (�0)−1 1

n

n∑i=1

�(Ui ,Vi , Xi ), (3)

and, by the Central Limit Theorem,√

n(�−�0)→ N (0, I (�0)−1) in law. �

Proof to Theorem 2.2Note that

F(x; �)− F(x)= [F(x; �)− F(x;�0)]+[F(x;�0)− F(x)]≡ I + I I.

For I I we have:

F(x;�0)− F(x) = P(�0)∫ x

0

dF∗n (t)

G(t;�0)− P(�0)

∫ x

0

dF∗(t)

G(t;�0)

=[∫ x

0

dF∗n (t)

G(t;�0)−∫ x

0

dF∗(t)

G(t;�0)

]P(�0)+[P(�0)−1 − P(�0)−1]P(�0)P(�0)

∫ x

0

dF∗n (t)

G(t;�0).

By the Strong Law we have that P(�0)∫ x

0 G(t;�0)−1dF∗n (t)→ P(�0)

∫ x0 G(t;�0)−1dF∗(t)= F(x) w. p. 1, and hence I I

is asymptotically equivalent to

I I ∼ P(�0)1

n

n∑i=1

I (Xi�x)

G(Xi ,�0)− P(�0)−1 P(�0)F(x)

= P(�0)1

n

n∑i=1

I (Xi�x)− F(x)

G(Xi ,�0)= 1

n

n∑i=1

�(Xi , x)

Copyright © 2010 John Wiley & Sons, Ltd. Statist. Med. 2010, 29 3147--3159

3157

C. MOREIRA AND J. DE UÑA-ÁLVAREZ

where �(t, x)= P(�0)(I (t�x)− F(x))G(t,�0)−1. For I introduce the function

W (x,�) = ���

(P(�)∫ x

0

dF∗n (t)

G(t;�))

= P(�)2∫ ∞

0

�G(t;�)

��

dF∗n (t)

G(t;�)2

∫ x

0

dF∗n (t)

G(t;�)− P(�)

∫ x

0

�G(t;�)

��

dF∗n (t)

G(t;�)2,

(see Section 2). By the mean value theorem we have, for some � between � and �0,

F(x; �)− F(x;�0) = P (�)∫ x

0

dF∗n (t)

G(t; �)− P(�0)

∫ x

0

dF∗n (t)

G(t;�0)

= W (x, �)T(�−�0).

Note that (under regularity) we have W (x, �)→W (x,�0) for each sequence �→�0, where W (x;�) is the matrix inTheorem 2.2, namely

W (x,�) = P(�)2∫ ∞

0

�G(t;�)

��

dF∗(t)

G(t;�)2

∫ x

0

dF∗(t)

G(t;�)− P(�)

∫ x

0

�G(t;�)

��

dF∗(t)

G(t;�)2

=∫ ∞

0

�G(t;�)

��

1

G(t;�)[F(x)− I (t�x)]dF(t).

This shows that I ∼W (x,�0)T(�−�0) and hence, by (3),

I ∼W (x,�0)T I (�0)−1 1

n

n∑i=1

�(Ui ,Vi , Xi ).

In sum,

F(x; �)− F(x)∼W (x,�0)T I (�0)−1 1

n

n∑i=1

�(Ui ,Vi , Xi )+ 1

n

n∑i=1

�(Xi , x),

and, by the Central Limit Theorem,√

n(F(x; �)− F(x))→ N (0,�2(x)) in law,

where

�2(x)=Var�0 [W (x,�0)T I (�0)−1�(Ui ,Vi , Xi )+�(Xi , x)].

As E�0 [�(Xi , x)]=0, E�0 [�(Ui ,Vi , Xi ) | Xi ]=0 and E�0 [�(Ui ,Vi , Xi )�(Ui ,Vi , Xi )T]= I (�0), it is easily seen that

�2(x)=W (x,�0)T I (�0)W (x,�0)+ E�0 [�(Xi , x)2]=�21(x)+ E�0 [�(Xi , x)2],

while

E�0 [�(Xi , x)2] = P(�0)2 E�0

[(I (Xi�x)− F(x))2

G(Xi ;�0)2

]

= P(�0)

[∫ x

0

dF(t)

G(t;�0)+ F2(x)

∫ ∞

0

dF(t)

G(t;�0)−2F(x)

∫ x

0

dF(t)

G(t;�0)

]= �2

2(x).

Acknowledgements

We thank the two anonymous referees and the Associated Editor for their careful reading of the paper and their suggestions whichhave greatly improved the manuscript. This work is supported by the research Grant MTM2008-03129 of the Spanish Ministeriode Ciencia e Innovación, by the Grant PGIDIT07PXIB300191PR of the Xunta de Galicia, and SFRH/BD/60262/2009 Grantof Portuguese Fundação Ciência e Tecnologia. The authors are also grateful to the Xunta de Galicia for financial support underthe INBIOMED project, (DXPCTSUG, Ref. 2009/063). We also thank the IPO (the Portuguese Institute of Cancer at Porto) forkindly providing the childhood cancer data.

3158

Copyright © 2010 John Wiley & Sons, Ltd. Statist. Med. 2010, 29 3147--3159

C. MOREIRA AND J. DE UÑA-ÁLVAREZ

References1. Lynden-Bell D. A method of allowing for known observational selection in small samples applied to 3CR quasars. Monthly Notices of the

Royal Astronomical Society 1971; 155:95--118.2. Woodroofe M. Estimating a distribution function with truncated data. Annals of Statistics 1985; 13:163--177.3. Stute W. Almost sure representation of the product-limit estimator for truncated data. Annals of Statistics 1993; 21:146--156.4. Wang MC. A semiparametric model for randomly truncated data. Journal of the American Statistical Association 1989; 84:742--748.5. Asgharian M, MLan CE, Wolfson DB. Length-biased sampling with right-censoring: an unconditional approach. Journal of the American

Statistical Association 2002; 97:201--209.6. de Uña-Álvarez J. Nonparametric estimation under length-biased sampling and type I censoring: a moment based approach. Annals of the

Institute of Statistical Mathematics 2004; 56:667--681.7. Efron B, Petrosian V. Nonparametric methods for doubly truncated data. Journal of the American Statistical Association 1999; 94:824--834.8. Bilker W, Wang M-C. A semiparametric extension of the Mann–Whitney test for randomly truncated data. Biometrics 1996; 52:10--20.9. Shen PS. Nonparametric analysis of doubly truncated data. Annals of the Institute of Statistical Mathematics 2008; DOI: 10.1007/s10463-

008-0192-2.10. Moreira C, de Uña-Álvarez J. Bootstrapping the NPMLE for doubly truncated data. Journal of Nonparametric Statistics 2010; 22.

DOI: 10.1080/10485250903556102.

Copyright © 2010 John Wiley & Sons, Ltd. Statist. Med. 2010, 29 3147--3159

3159