University of Alberta
A Robust Estimator of Distribution Function
and Quantile in the Presence of Auxiliary Information
Hong Li
A thesis
submitted to the Faculty of Graduate Studies and Research
in partial fulfillment of
the requirement for the degree of Master of Science
in
Statistics.
Department of Mathematical Sciences
Edmonton, Alberta, Canada
Spring 2000.
The author has granted a non-exclusive licence allowing the National Library of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.
The author retains ownership of the copyright in this thesis. Neither the thesis nor substantial extracts from it may be printed or otherwise reproduced without the author's permission.
Abstract
Recently, estimation of the distribution function and quantiles has received considerable
attention in survey sampling. This thesis mainly considers estimation of the distribution
function and quantiles under single-stage and two-stage sampling designs. In the
first part, for the single-stage sampling process, estimation of the distribution function and
quantiles is considered through the use of auxiliary information. In the second part,
along the lines of the first part, the estimation problem under a two-stage sampling plan is
considered. The idea of estimating the population total under the model-assisted approach is
also extended to estimate the population distribution function, and the resulting estimator
is shown to perform better than the conventional estimator for moderate sample sizes.
Through a Monte Carlo simulation study, the proposed estimators are compared with
conventional estimators.
Acknowledgement
Upon the completion of my thesis, I would like to take this opportunity to
acknowledge all those people who have helped me.
First I would like to thank my supervisor, Dr. N. G. Narasimha Prasad, for his
guidance, patience and help throughout the preparation of this thesis. Special thanks
go to Dr. Rohana J. Karunamuni and Dr. Francis Yeh for serving as the committee
members.
My gratitude extends to the Department of Mathematical Sciences, University of
Alberta, for providing me the opportunity to study here, to Dr. Peter M. Hooper
for academic and financial support in the first year of my study, and to the
faculty members of the department who provided me with many interesting courses in
Statistics.
Contents
List of Tables
List of Figures
1 Introduction
1.1 Background
1.2 Objective
2 Review of literature
2.1 General estimation method of distribution function
2.1.1 Design-based approach
2.1.2 Prediction approach
2.1.3 Model-assisted approach
2.2 Robustness of model-assisted distribution function estimator
2.3 More distribution function estimators
2.4 General estimation method of quantile
3 Distribution function and quantile estimator for one-stage sampling: non-parametric approach
3.1 Formulation of the general estimator of distribution function
3.2 Simulation study
4 Distribution and quantile estimation in two-stage sampling
4.1 Two-stage distribution function estimator in conventional theory
4.2 General estimator of distribution function in two-stage sampling
4.3 Alternative estimators
References
List of Tables
3.1 Simulation results with linear simulated population data: relative biases (RB), mean square errors (MSE) and relative root mean square errors (RRMSE) of different distribution function estimators (population size = 500, sample size = 30, number of simulations = 1000).
3.2 Simulation results with nonlinear simulated population data: relative biases (RB), mean square errors (MSE) and relative root mean square errors (RRMSE) of different distribution function estimators (population size = 500, sample size = 30, number of simulations = 1000).
3.3 Common kernel functions.
3.4 Relative biases (RB) and relative root mean square errors (RRMSE) of F,, at the 0.25, 0.50 and 0.75th quantiles of Y under different kernels with linear simulated population.
3.5 Relative biases (RB) and relative root mean square errors (RRMSE) of FP, at the 0.25, 0.50 and 0.75th quantiles of Y under different kernels with nonlinear simulated population.
3.6 Simulation results with simulated linear population data: relative biases (RB) and relative root mean square errors (RRMSE) of different $\alpha$th quantile $q(\alpha)$ estimators (population size = 500, sample size = 30, number of simulations = 1000).
3.7 Simulation results with simulated nonlinear population data: relative biases (RB) and relative root mean square errors (RRMSE) of different $\alpha$th quantile $q(\alpha)$ estimators (population size = 500, sample size = 30, number of simulations = 1000).
4.1a Estimation of distribution function under two-stage sampling scheme (k = 400, $\rho_y$ = 6.46968e-5): relative biases (RB), relative root mean square errors (RRMSE) and relative efficiency (RE) with respect to $F_u$.
4.1b Estimation of quantiles under two-stage sampling scheme (k = 400, $\rho_y$ = 6.46968e-5): relative biases (RB), relative root mean square errors (RRMSE) and relative efficiency (RE) with respect to $q_u$.
4.2a Estimation of distribution function under two-stage sampling scheme (k = 400, $\rho_y$ = 0.505525): relative biases (RB), relative root mean square errors (RRMSE) and relative efficiency (RE) with respect to $F_u$.
4.2b Estimation of quantiles under two-stage sampling scheme (k = 400, $\rho_y$ = 0.505525): relative biases (RB), relative root mean square errors (RRMSE) and relative efficiency (RE) with respect to $q_u$.
4.3a Estimation of distribution function under two-stage sampling scheme (k = 400, $\rho_y$ = 0.999688): relative biases (RB), relative root mean square errors (RRMSE) and relative efficiency (RE) with respect to $F_u$.
4.3b Estimation of quantiles under two-stage sampling scheme (k = 400, $\rho_y$ = 0.999688): relative biases (RB), relative root mean square errors (RRMSE) and relative efficiency (RE) with respect to $q_u$.
4.4 Estimation of distribution function under two-stage sampling scheme (k = 245, $\rho_y$ = 6.46968e-5): relative biases (RB), relative root mean square errors (RRMSE) and relative efficiency (RE) with respect to $F_u$.
Estimation of distribution function under two-stage sampling scheme (k = 245, $\rho_y$ = 0.505525): relative biases (RB), relative root mean square errors (RRMSE) and relative efficiency (RE) with respect to $F_u$.
Estimation of distribution function under two-stage sampling scheme (k = 245, $\rho_y$ = 0.999688): relative biases (RB), relative root mean square errors (RRMSE) and relative efficiency (RE) with respect to $F_u$.
List of Figures
Scatterplot of linear simulated population from model Y = -2X + X ~ S
Scatterplot of nonlinear simulated population from model Y = 1 5 0 + 2 . 5 e - - Y + ~ f e
Plot of population distribution function $F_p$ and model-assisted estimator $\hat{F}_{ma}$ with linear simulated population
Plot of population distribution function $F_p$ and proposed estimator Fm, with linear simulated population
Plot of population distribution function $F_p$ and pseudo-empirical likelihood estimator $\hat{F}_{pe}$ with linear simulated population
Plot of population distribution function $F_p$ and model-assisted estimator $\hat{F}_{ma}$ with nonlinear simulated population
Plot of population distribution function $F_p$ and proposed estimator F,, with nonlinear simulated population
Plot of population distribution function $F_p$ and pseudo-empirical likelihood estimator $\hat{F}_{pe}$ with linear simulated
Chapter 1 Introduction
In modern society, the need for statistical information for making policy decisions has
increased considerably. In particular, data are regularly collected to satisfy the need
for information about a specified set of elements. The collection of all such elements is
called a finite population. But collecting the data for each element in a population
may be too expensive and/or time consuming; sometimes it is even impractical. So,
one of the most important methods of collecting information for policy decisions is the
sample survey, that is, a partial investigation of the finite population.
A sample survey costs less than a complete enumeration and may even be more
accurate than the complete enumeration due to the fact that personnel of higher
quality can be employed for the data collection. So, designing an efficient sampling
scheme at the lowest possible cost and developing efficient methods to estimate
population parameters from the sample data under a given design have become major
issues in sample surveys.
In this thesis, I will only concentrate on the estimation of the distribution function
and associated quantiles of a finite population.
1.1 Background
Consider a finite survey population U which consists of N distinct elements. Each
element is identified through a label j (j = 1, ..., N). Let the values of the population
elements for variable Y be $Y_1, Y_2, \ldots, Y_N$. Then for any given number t ($-\infty < t < \infty$),
the population distribution function $F_N(t)$ is defined as the proportion of elements in
the population for which $Y_j \le t$, for $j \in U$, that is
$$F_N(t) = N^{-1} \sum_{j \in U} \Delta(t - Y_j), \tag{1.1}$$
where
$$\Delta(t - Y_j) = \begin{cases} 1, & \text{when } t \ge Y_j; \\ 0, & \text{otherwise.} \end{cases} \tag{1.2}$$
Also, the population $\alpha$th quantile ($0 < \alpha < 1$) for Y is defined as
$$q_N(\alpha) = \inf\{t : F_N(t) \ge \alpha\}. \tag{1.3}$$
Without loss of generality, we omit N in the notation of $F_N(t)$ or $q_N(\alpha)$.
Consider an auxiliary variable X, associated with Y, with the known values $X_1, \ldots, X_N$.
Now, suppose a single-stage sampling scheme is undertaken to select a sample $s = \{j_1, j_2, \ldots, j_n\}$
of size n under some sampling design $\{S, p(\cdot)\}$, where S denotes the set of
all possible samples of size n and $p(s)$ is the probability of selecting the sample s, $s \in S$,
such that $\sum_{s \in S} p(s) = 1$. The values for Y in the set s are observed as $y_{j_1}, y_{j_2}, \ldots, y_{j_n}$.
Then, with only this information, a general class of design-based estimators of $F(t)$ is
given by Rao (1994) as
$$\hat{F}(t) = N^{-1} \sum_{i \in s} d_i(s) \Delta(t - y_i). \tag{1.4}$$
And naturally, the $\alpha$th quantile $q(\alpha)$ may be estimated as
$$\hat{q}(\alpha) = \inf\{t : \hat{F}(t) \ge \alpha\}. \tag{1.5}$$
So, in the following I will focus on the estimation of the distribution function. The
associated quantile estimator will be defined through (1.5) accordingly. In (1.4), $d_i(s)$ is
the basic design weight for single-stage sampling, which can depend on both s and i
($i \in s$) (see Godambe, 1955) and which satisfies the design-unbiasedness condition
$\sum_{\{s:\, i \in s\}} p(s) d_i(s) = 1$ for $i = 1, \ldots, N$. To get an unbiased estimator of F, it is
also essential to assume that the inclusion probabilities $\pi_i = \sum_{\{s:\, i \in s\}} p(s)$, for $i = 1, \ldots, N$, are all positive (Rao, 1975). Under a simple random sampling scheme, $\hat{F}(t)$
reduces to the naive empirical distribution function estimator
$$\hat{F}_{edf}(t) = n^{-1} \sum_{i \in s} \Delta(t - y_i). \tag{1.6}$$
Plugging $\hat{F}_{edf}$ for $\hat{F}$ in (1.5) gives the naive quantile estimator $\hat{q}_{edf}(\alpha)$.
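To make the plug-in idea concrete, the following is a minimal sketch of a design-weighted distribution function estimator and the corresponding quantile obtained by inversion. It assumes the estimator has the form $N^{-1} \sum_{i \in s} d_i(s) \Delta(t - y_i)$ and that the quantile is the smallest observed value at which the estimated distribution function reaches $\alpha$; the function names `weighted_cdf` and `plug_in_quantile` are illustrative and not taken from the thesis.

```python
import numpy as np

def weighted_cdf(t, y, d, N):
    """Assumed form: N^{-1} * sum over the sample of d_i * 1{y_i <= t}."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    return float(np.sum(d * (y <= t)) / N)

def plug_in_quantile(alpha, y, d, N):
    """Smallest observed y whose estimated cdf is at least alpha."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    order = np.argsort(y)
    cum = np.cumsum(d[order]) / N          # estimated cdf at the sorted y values
    idx = np.searchsorted(cum, alpha, side="left")
    idx = min(idx, len(y) - 1)             # guard: weights may sum to less than alpha * N
    return float(y[order][idx])

# Simple random sampling: every design weight equals N / n,
# so the estimator coincides with the empirical distribution function.
y_sample = [3.0, 1.0, 4.0, 2.0]
d_sample = [125.0] * 4                     # N = 500, n = 4
print(weighted_cdf(2.0, y_sample, d_sample, 500))
print(plug_in_quantile(0.5, y_sample, d_sample, 500))
```

With unequal design weights the same two functions apply unchanged; only the vector `d` differs.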
The above approach to estimating the distribution function is called the probability
sampling approach, which is free from model assumptions. However, the above
estimators do not incorporate any supplementary information directly into the estimation
process, although the use of auxiliary information on the population is known to increase
the precision of an estimator. In theory, information on one auxiliary variable may
be used at the survey design stage by defining appropriate inclusion probabilities $\pi_i$.
But this is not possible when information on several auxiliary variables is available.
So we may wish to introduce the auxiliary information at the survey estimation stage
also (Chambers and Dunstan, 1986). In recent years, several estimators of the
distribution function of a finite population have been proposed with the use of auxiliary
population information (Rao, 1994).
1.2 Objective
For a single-stage sampling scheme, the most notable distribution function
estimators are the model-assisted estimator (Rao, 1994) and the pseudo-empirical maximum
likelihood estimator (Chen and Sitter, 1996). The former uses the complete
information of the auxiliary variable but with the assumption that the relationship between
Y and X follows a linear model. Even though the latter does not assume any model,
the resulting estimator is optimal under a linear model. Also, the latter method is
computationally intensive. In this thesis, I will develop an estimator of the finite
population distribution function and associated quantiles such that it uses the complete
information of X and is exempt from model assumptions. More precisely, this estimator
uses a nonparametric approach.
For a two-stage sampling scheme, Royall (1976) proposed the linear least-squares
prediction approach in the estimation of the population total, with the cluster size as
supplementary information. In this thesis, the same approach will be used to get
an estimator of the finite population distribution function and associated quantiles.
All of the estimators proposed in this thesis will be compared with conventional
estimators through a Monte Carlo simulation study.
1.3 Thesis Overview
In chapter 2, some of the literature on the estimation of the distribution function and
quantiles for a finite population under single-stage and two-stage sampling schemes
is reviewed. In chapter 3, by exploring the advantages of different methods under a
single-stage sampling plan, a robust distribution function estimator is proposed. This
estimator can be used under a single-stage sampling plan without model assumptions.
The proposed method is examined through a limited simulation study. In chapter
4, this thesis presents some discussion of the estimation of the distribution function and
associated quantiles under a two-stage sampling scheme. Also, the basic idea of
estimating the population total is extended to estimate the population distribution function. A
Monte Carlo simulation study is also presented to assess the efficiency of the proposed
estimator.
Chapter 2
Review of Literature
Single-stage sampling is the simplest sampling scheme in sample surveys. Several
methods are considered in the literature for estimating the distribution function and
quantiles, which can be grouped under three broad headings: (i) design-based approach;
(ii) model-dependent approach, or prediction approach; and (iii) model-assisted
approach.
However, in practice, multi-stage sampling is also frequently used, especially in
surveys of human populations. For example, Kish (1965) described a stratified
three-stage sample of dwellings, i.e., sampling counties at the first stage, sampling blocks
within selected counties at the second stage and sampling dwellings within chosen
blocks at the third stage. For this kind of survey, there is a wide variety of possible
designs at any stage, and there is a much wider variety of possible estimators of the
parameters of interest.
So, this chapter mainly reviews some of the papers that are relevant to the estimation
of the finite population distribution function and quantiles under a single-stage sampling
design.
2.1 General Estimation Method of Distribution Function
Consider a single-stage sampling scheme defined by $\{S, p(\cdot)\}$ and a sample set
$s \in S$ with design weight $d_i(s)$ for $i \in s$. Also, we assume availability of auxiliary
information, say x, for all of the units in the population. Let $y_{j_1}, \ldots, y_{j_n}$ be the
observed values for Y in the set s. We consider below different approaches to estimate
the distribution function (1.1) and the $\alpha$th quantile (1.3).
2.1.1 Design-based Approach
Under the design-based approach, $F(t)$ can be estimated by the ratio estimator
$\hat{F}_r(t)$, the difference estimator $\hat{F}_d(t)$ and the regression estimator $\hat{F}_{reg}(t)$ (see Rao
and Liu (1992) and Rao (1994)).
In the case of a single auxiliary X-variable, denote $g(x_j) = \Delta(t - \hat{R} x_j)$ for $j = 1, \ldots, N$, where $\hat{R} =$
$\hat{G} = \sum_{i \in s} d_i(s) g(x_i)$,
sample s, we have
and
The above three estimators have some good properties. For example, $\hat{F}_r(t)$
reduces to $F(t)$ and its variance becomes zero when $y_j \propto x_j$ for all $j \in U$. This indicates
that the ratio estimator can gain large efficiency when $\Delta(t - y_i)$ and $\Delta(t - \hat{R} x_i)$
have a strong linear relationship. However, in the real world, the correlation between
$\Delta(t - y_i)$ and $\Delta(t - \hat{R} x_i)$ is generally weaker than that between $y_i$ and $x_i$, so here the
estimator $\hat{F}_r(t)$ actually gains very little efficiency over $\hat{F}(t)$. As for the estimators $\hat{F}_d(t)$
and $\hat{F}_{reg}(t)$, they suffer from the same drawback as the ratio estimator $\hat{F}_r(t)$. Since
$\hat{F}_{reg}(t)$ involves the computation of $\mathrm{cov}(\hat{F}, \hat{G})$ and $v(\hat{G})$, it is computationally more
cumbersome than the ratio estimator $\hat{F}_r(t)$. Moreover, none of these three estimators
is model-unbiased under a linear model assumption, although they are
asymptotically design-unbiased (Rao, Kovar and Mantel, 1990 and Rao, 1994).
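As an illustration of how the ratio and difference estimators use the auxiliary indicator $\Delta(t - \hat{R} x_j)$, here is a minimal sketch. It assumes the forms commonly attributed to Rao, Kovar and Mantel (1990): the ratio estimator rescales the design-based estimate by the ratio of the population and sample versions of the auxiliary indicator mean, and the difference estimator adds their difference. The choice $\hat{R} = \sum_s d_i y_i / \sum_s d_i x_i$ and the function name are assumptions for this sketch.

```python
import numpy as np

def ratio_and_difference(t, y_s, x_s, d_s, x_pop):
    """Return (ratio, difference) estimates of F(t) under the assumed forms."""
    y_s, x_s, d_s, x_pop = (np.asarray(a, dtype=float)
                            for a in (y_s, x_s, d_s, x_pop))
    N = len(x_pop)
    R = np.sum(d_s * y_s) / np.sum(d_s * x_s)      # assumed ratio estimate R-hat
    F_hat = np.sum(d_s * (y_s <= t)) / N           # design-based estimate of F(t)
    G_hat = np.sum(d_s * (R * x_s <= t)) / N       # same estimate for the indicator g(x)
    G_pop = np.mean(R * x_pop <= t)                # known population mean of g(x)
    F_ratio = F_hat * G_pop / G_hat if G_hat > 0 else float("nan")
    F_diff = F_hat + (G_pop - G_hat)
    return float(F_ratio), float(F_diff)

# When y is exactly proportional to x, both estimators recover F(t).
x_pop = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]             # true y = 2 x, so F(6) = 0.5
x_s, y_s, d_s = [1.0, 3.0, 6.0], [2.0, 6.0, 12.0], [2.0, 2.0, 2.0]
print(ratio_and_difference(6.0, y_s, x_s, d_s, x_pop))
```

The example reflects the property noted above: under exact proportionality the sample indicators for y and for $\hat{R}x$ coincide, so the sampling error cancels.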
2.1.2 Prediction Approach
This approach, contrary to the design-based approach, assumes that the
population Y values are random and that the relationship between Y and the auxiliary variable
X follows a certain model. The most commonly used superpopulation model is given
by
$$y_j = \beta x_j + v(x_j)\varepsilon_j, \quad j = 1, \ldots, N, \tag{2.4}$$
where $\beta$ is an unknown parameter, $v(x_j)$ is a strictly positive function of $x_j$, and the
$\varepsilon_j$'s are independent and identically distributed (i.i.d.) random variables with zero
means. Under this approach valid inference will be made with respect to the model
and irrespective of the sampling design $p(s)$.
At first, note that $F_N(t)$ can be decomposed as follows,
$$F_N(t) = N^{-1}\Big\{\sum_{j \in s} \Delta(t - y_j) + \sum_{j \in \bar{s}} \Delta(t - y_j)\Big\}, \tag{2.5}$$
where s is the same as before and $\bar{s} = \{j : j = 1, 2, \ldots, N \text{ and } j \notin s\}$. The only unknown
in (2.5) is the second term. It is also assumed that the $y_j$'s are independent and
the population model holds for the sample, so there will be no sample selection bias
(Krieger and Pfeffermann, 1992). Under this assumption, a general estimator
(referenced as the CD estimator) of the distribution function, which was suggested by Chambers
and Dunstan (1986), is given by
$$\hat{F}_{CD}(t) = N^{-1}\Big\{\sum_{j \in s} \Delta(t - y_j) + n^{-1}\sum_{j \in \bar{s}}\sum_{i \in s} \Delta\Big(\frac{t - b_n x_j}{v(x_j)} - u_{ni}\Big)\Big\}, \tag{2.6}$$
where $b_n = \{\sum_{i \in s} y_i x_i / v^2(x_i)\}\{\sum_{i \in s} x_i^2 / v^2(x_i)\}^{-1}$, which is the best linear unbiased
estimator (BLUE) of $\beta$, and $u_{ni} = \{v(x_i)\}^{-1}(y_i - b_n x_i)$. Rao and Liu (1992) noted that,
when $v^2(x_j) = \sigma^2 x_j$, (2.6) reduces to
are approximately independent with mean zero and variance $\sigma^2$.
In general, the model-based estimator of the distribution function will not be the
same as those suggested by the conventional design-based approach as in (1.4). But
there is one special situation "when the sample is a stratified random sample and the
auxiliary information X is a qualitative variable, indicating the stratum in which a
population element occurs" under which $F_s(t)$ and $F_N(t)$ are consistent (Chambers
and Dunstan, 1986).
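The prediction estimator of Chambers and Dunstan (1986) described above can be sketched as follows. The sketch assumes the standard form of that estimator: the sampled indicators are counted directly, and each non-sampled unit contributes the empirical distribution function of the standardized residuals evaluated at $(t - b_n x_j)/v(x_j)$. The default $v(x) = x^{1/2}$ and the function name are illustrative assumptions.

```python
import numpy as np

def chambers_dunstan_cdf(t, y_s, x_s, x_nonsample, v=np.sqrt):
    """Model-based estimate of F(t) under y = beta * x + v(x) * eps (assumed model)."""
    y_s, x_s, x_r = (np.asarray(a, dtype=float)
                     for a in (y_s, x_s, x_nonsample))
    N = len(y_s) + len(x_r)
    v_s, v_r = v(x_s), v(x_r)
    # Weighted least squares slope through the origin (BLUE of beta under the model).
    b = np.sum(y_s * x_s / v_s**2) / np.sum(x_s**2 / v_s**2)
    u = (y_s - b * x_s) / v_s                  # standardized sample residuals
    z = (t - b * x_r) / v_r                    # standardized thresholds for non-sampled units
    # Empirical cdf of the residuals, evaluated at each threshold.
    G_hat = np.mean(u[None, :] <= z[:, None], axis=1)
    return float((np.sum(y_s <= t) + np.sum(G_hat)) / N)

# Noise-free example: y = 2 x in the sample, non-sampled x values 3 and 5.
print(chambers_dunstan_cdf(7.0, [2.0, 4.0, 8.0], [1.0, 2.0, 4.0], [3.0, 5.0]))
```

Because the prediction for every non-sampled unit relies on the fitted line and on `v`, the sketch also shows why the estimator is sensitive to misspecification of either.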
2.1.3 Model-assisted Approach
From the earlier discussion, we see that although the probability sampling approach is
assumption free, the associated inferences have to refer to repeated sampling instead
of just the particular sample, s, that has been chosen. The prediction approach, on
the other hand, ignores the sampling design completely, and the definition of the estimator in (2.6)
depends on (2.4) being the "correct" model for the population. In particular, it
assumes that the heteroscedasticity function $v(x)$ is correctly specified. However, in
practice, this is unlikely to be the case, and in large samples prediction inferences are
also very sensitive to model misspecifications (Hansen, Madow and Tepping, 1983).
To overcome the shortcomings of the above two methods, the model-assisted
approach is proposed as an approach providing valid inferences under an assumed model
and at the same time protecting against model misspecifications in the sense of
providing valid design-based inferences irrespective of the population Y-values. That is,
we consider only design-consistent estimators that are also model-unbiased (at
least asymptotically) under an assumed model. So in the case of estimating a finite
population distribution function, we may assume that $Y_1, Y_2, \ldots, Y_N$ come from the
superpopulation with model (2.4). Cases of this problem have been treated by
Chambers and Dunstan (1986), Kuk (1988), Godambe (1989), Rao, Kovar and Mantel (1990),
Chambers (1992), Rao and Liu (1992) and Rao (1994).
Let G denote the model cumulative distribution function (c.d.f.) of $\varepsilon_j$. Then under the
superpopulation model (2.4),
is a population estimating function, each term of which has expectation zero. Using
estimating function theory, Godambe (1989) arrived at a model-assisted estimator of
$F(t)$ under the special case of $d_i(s) = \pi_i^{-1}$. Rao (1994) extended this to a general
design weight $d_i(s)$, where the case $v^2(x_j) = x_j\sigma^2$ was considered.
Under the model (2.4), a predictor of $\Delta(t - y_j)$ is given by
Then a model-assisted estimator of $F(t)$ based on (2.9) can be given by the difference
estimator:
which is model-unbiased (at least asymptotically) under the assumed model, but its
asymptotic design-bias is zero only for a subclass of sampling designs. However, this
subclass seems to cover a wide variety of sampling designs (Godambe, 1989).
For the special case of $d_i(s) = \pi_i^{-1}$, Rao, Kovar and Mantel (1990) also proposed
an alternative model-assisted estimator (referred to as the RKM estimator),
after noting that the ratio and the difference estimators, $\hat{F}_r(t)$ and $\hat{F}_d(t)$, are not
model-unbiased for $F_N(t)$. This estimator is asymptotically model-unbiased and
design-unbiased under all designs. Rao and Liu (1992) extended this estimator to
the general case of $d_i(s)$ as follows:
where,
So, $\hat{F}_{ma}(t)$ and $\hat{F}_{dm}(t)$ differ only in the last term of the formula.
2.2 Robustness of Model-assisted Distribution Function
Estimator
From the above section, we know that the model-assisted estimators $\hat{F}_{ma}(t)$ and $\hat{F}_{dm}(t)$
have some good properties (Rao, Kovar and Mantel, 1990), but whether or not they
are better than $\hat{F}(t)$ as in (1.4) depends on the degree of validity of the
model, which is reflected in the prediction of $\Delta(t - y_j)$. So, some improvements
might arise with the use of a more general model and a local fitting method, which
could be employed to increase the robustness of the estimator.
Chambers, Dorfman and Wehrly (1993) also explored a robust estimation
approach via a nonparametric regression method against model misspecification. For a
general model
where $\eta(\cdot)$ is some reasonably smooth function, $\phi(\cdot)$ is some known function and e
is the random error with zero mean. This approach works by first getting a robust
(although possibly inefficient) predictor of the population total $\Phi$ of $\phi$ with nonparametric
regression, then using a bias calibration method to get a more efficient predictor. That
is, one smooths $\phi(y)$ against x to obtain $\hat{\eta}_1(x)$ and then smooths $\phi(y) - \hat{\eta}_1(x)$ against
x to obtain $\hat{\eta}_2(x)$. So, the final smoothing of $\phi(y)$ against x is defined as $\hat{\eta}_1(x) + \hat{\eta}_2(x)$. In the case of estimating the finite population distribution function of Y, we
can take $\phi(Y) = N^{-1}\Delta(t - Y)$, then
Here, $F_s(t)$ is obtained from (1.6) and $\eta(x, t) = \Pr\{Y \le t \mid x\}$, which is assumed
to be a smooth function of x for any t. Chambers, Dorfman and Wehrly (1993)
estimated it by
(2.16)
where the weights $w_i$ can be calculated using a kernel smoothing method (Chambers,
Dorfman and Wehrly, 1993). Then the predictor of $F_{ns}(t)$ is:
And a robust calibrated estimator is given by
where the $u_i$'s are the nonparametric prediction weights defining the predictor of $F_{ns}(t)$, $\hat{\mu}$ and $\hat{\sigma}$ are
sample-based estimators of the conditional model expectation $E(Y \mid X)$ and standard
deviation $\{\mathrm{var}(Y \mid X)\}^{1/2}$ under the working model for the population, and $\hat{G}(\cdot)$ is a
sample-based estimator of the distribution function $G(\cdot)$ of the standardized error
$\{Y - E(Y \mid X)\}/\{\mathrm{var}(Y \mid X)\}^{1/2}$. If we put $u_i = (\pi_i^{-1} - 1)/(N - n)$
in (2.18) (where $\pi_i$ is the inclusion probability), then the resulting estimator is basically the same as that suggested by Rao,
Kovar and Mantel (1990). They also suggested that, by choosing the weights $u_i$ to
reflect the actual distribution of the sample X values, the robustness of the predictor
can be improved, and this heavily depends on the choice of the bandwidth $h_n$. Some
rules for choosing the bandwidth are also given in that paper. However, this method may
lose some efficiency in obtaining the bias robustness.
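The role of kernel weights in estimating $\Pr\{Y \le t \mid x\}$ can be illustrated with a short sketch. It is not the exact weighting scheme of Chambers, Dorfman and Wehrly (1993): it assumes Nadaraya-Watson weights with a Gaussian kernel, and then predicts the non-sampled part of the distribution function by summing the smoothed conditional probabilities. The function names and the bandwidth argument `h` are illustrative.

```python
import numpy as np

def nw_conditional_cdf(t, x0, x_s, y_s, h):
    """Kernel-weighted estimate of Pr(Y <= t | x) at each point in x0 (assumed NW weights)."""
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    x_s = np.asarray(x_s, dtype=float)
    y_s = np.asarray(y_s, dtype=float)
    k = np.exp(-0.5 * ((x0[:, None] - x_s[None, :]) / h) ** 2)   # Gaussian kernel
    w = k / k.sum(axis=1, keepdims=True)                         # weights sum to one per x0
    return w @ (y_s <= t).astype(float)

def nonparametric_cdf(t, x_s, y_s, x_nonsample, h):
    """Sampled indicators plus smoothed predictions for the non-sampled units."""
    y_s = np.asarray(y_s, dtype=float)
    N = len(y_s) + len(x_nonsample)
    predicted = nw_conditional_cdf(t, x_nonsample, x_s, y_s, h).sum()
    return float((np.sum(y_s <= t) + predicted) / N)

x_s = [1.0, 2.0, 3.0]
y_s = [1.0, 2.0, 3.0]
print(nonparametric_cdf(2.0, x_s, y_s, [1.5, 2.5], h=1.0))
```

A small `h` makes the prediction depend on only the nearest sampled units, which lowers bias but raises variance; this is the trade-off behind the bandwidth rules mentioned above. For points very far from all sample x values the Gaussian weights can underflow, so a practical version would need a guard for that case.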
2.3 More Distribution Function Estimators
In the above section, we saw some distribution function estimators that make use of
complete information on the auxiliary variable X. However, in some situations, we
may not have any auxiliary variables, but may have some auxiliary information about F or
its associated parameters available. This kind of situation has been discussed by Qin
and Lawless (1994), who used the profile empirical likelihood-based kernel method to
estimate the finite population distribution function under a semiparametric model
assumption. That is, they estimated F by the maximum empirical likelihood
estimator (MELE) $\hat{F}_{mele}$, obtained by maximizing the profile empirical likelihood function $L(F)$
(Zhang, 1997), which is defined as
And
Zhang (1997) established the weak convergence of $\hat{F}_{mele}$, so that this approach not only
overcomes the disadvantage of the standard kernel method, which cannot exploit extra
information systematically, but also makes the variance of the estimator smaller and
thereby gains some efficiency. In this thesis, this situation will not be
discussed further.
Another case is that we cannot get complete information about the auxiliary
variable X, but still have some summary information on X available. Usually this
kind of information can be represented as
where $\tau(X) = (\tau_1(X), \ldots, \tau_r(X))^T$. Then, how can this information be used to
improve the estimation?
Chen and Qin (1993) proposed an empirical likelihood approach to be used in
simple random sampling when this kind of auxiliary information (2.21) is available.
Their results suggested that this approach has desirable properties when the population
distribution function and quantiles are estimated in this situation. But the
formulation of their method did not extend to more complex survey designs. Chen
and Sitter (1996) generalized it to estimate a parameter that is some function of
the distribution function, $\xi = \xi(F_N)$, in the situation where the sample s is drawn
using some sampling design $\{D, p(\cdot)\}$, as follows.
For a finite population U, if the entire finite population is available, the empirical
likelihood, originally introduced by Owen (1988, 1990) for constructing confidence
regions in nonparametric settings, will be
$$L(p) = \prod_{j \in U} p_j, \tag{2.22}$$
with the corresponding log-likelihood function
$$l(p) = \sum_{j \in U} \log p_j, \tag{2.23}$$
where $p_j = P(Y = y_j)$. Suppose that the $p_j$'s ($0 \le p_j \le 1$) are unknown and $\sum_{j \in U} p_j = 1$.
Then, in the presence of auxiliary information (2.21), (2.23) could be maximized
subject to
$$\sum_{j \in U} p_j = 1, \qquad \sum_{j \in U} p_j \tau(x_j) = 0. \tag{2.24}$$
However, since we only have a sample s of size n from the entire population available,
we can view (2.23) as a population total and then use the unified theory of sampling
methodology to get a design-unbiased estimator of $l(p)$, that is
$$\hat{l}(p) = \sum_{i \in s} d_i(s) \log p_i, \tag{2.25}$$
where the $d_i(s)$'s are the design weights, which satisfy $E_s\{\sum_{i \in s} d_i(s) \log p_i\} = \sum_{i \in U} \log p_i$,
and $E_s$ stands for expectation with respect to the sampling design. Chen and
Sitter termed (2.25) the pseudo-empirical likelihood. For auxiliary information
of the form (2.21), the problem reduces to maximizing (2.25) subject to (2.24) with
N replaced by n. Using the Lagrange multiplier method, they gave the resulting
pseudo-empirical maximum likelihood estimator (PEMLE) of $F_N(t)$ as
$$\hat{F}_{pe}(t) = \sum_{i \in s} \hat{p}_i \Delta(t - y_i), \qquad \hat{p}_i = \frac{d_i(s)}{\sum_{i \in s} d_i(s)} \cdot \frac{1}{1 + \lambda^T \tau(x_i)}, \quad \text{for } i \in s,$$
and $\lambda$ satisfies
$$\sum_{i \in s} \frac{d_i(s)\,\tau(x_i)}{1 + \lambda^T \tau(x_i)} = 0.$$
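The Lagrange multiplier step can be made concrete for a single auxiliary constraint. The sketch below assumes a scalar function $\tau(x) = x - \bar{X}$ with a known population mean $\bar{X}$, normalizes the design weights to sum to one, and finds $\lambda$ by bisection on the equation $\sum_{i \in s} d_i \tau_i / (1 + \lambda \tau_i) = 0$. The bisection solver and the function names are illustrative choices, not the computational method of Chen and Sitter (1996).

```python
import numpy as np

def pel_weights(d, tau, iters=200):
    """Pseudo-empirical likelihood weights for one scalar constraint (sketch)."""
    d = np.asarray(d, dtype=float)
    tau = np.asarray(tau, dtype=float)
    w = d / d.sum()                                   # normalized design weights
    if not (tau.min() < 0.0 < tau.max()):
        raise ValueError("zero must lie strictly inside the range of tau")
    # All 1 + lam * tau_i stay positive for lam strictly inside (lo, hi).
    lo, hi = -1.0 / tau.max(), -1.0 / tau.min()

    def g(lam):
        return np.sum(w * tau / (1.0 + lam * tau))   # strictly decreasing in lam

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return w / (1.0 + lam * tau), lam

def pel_cdf(t, y_s, p):
    """Weighted distribution function estimate: sum of p_i * 1{y_i <= t}."""
    return float(np.sum(np.asarray(p) * (np.asarray(y_s, dtype=float) <= t)))

x_s = np.array([1.0, 2.0, 3.0, 6.0])
y_s = np.array([2.1, 3.9, 6.2, 11.8])
p, lam = pel_weights(d=np.ones(4), tau=x_s - 2.5)    # 2.5: assumed known mean of X
print(p.sum(), np.sum(p * (x_s - 2.5)), pel_cdf(5.0, y_s, p))
```

At the solution the weights sum to one automatically and reproduce the known auxiliary mean, which is how the summary information on X enters the estimate of the distribution function of Y.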
2.4 General Estimation Method of Quantile
Compared to the estimation of the distribution function, more attention has been
given to quantile estimation, for example, McCarthy (1965), Loynes (1966), Sedransk
and Meyer (1978), Gross (1980), Sedransk and Smith (1983), Chambers and Dunstan
(1986), Rao, Kovar and Mantel (1990), Olsson and Rootzen (1996). Many papers
discuss the estimation of the median for the survey variable of interest under simple
random sampling or stratified random sampling, e.g., Gross (1980), Sedransk and
Meyer (1978), Sedransk and Smith (1983), without making explicit use of the auxiliary
variables in the construction of estimators.
To make use of the auxiliary information, Chambers and Dunstan (1986)
considered briefly the estimation of a quantile by inverting a model-based estimator $\hat{F}$ of
the distribution function through formula (1.5). When the population size N and the
sample size n are large, the estimator suffers from the possible bias problem resulting from
model misspecification in addition to intensive computation. This approach also
assumes that the complete information about the auxiliary variable is known. But
in some applications, especially when the auxiliary information is extracted from
statistical reports or other secondary sources, only certain summary measures or
a grouped frequency distribution of the auxiliary variable X are available. In such
situations, Kuk and Mak (1989) proposed three estimators of the median: the ratio
estimator $\hat{M}_{YR}$, the position estimator $\hat{M}_{YP}$ and the stratification estimator $\hat{M}_{YS}$, under
the assumption that only the median $M_X$ of X is known. They also showed that
the last two estimators are efficient when the relationship between Y and X departs
from the linearity assumption.
Rao, Kovar and Mantel (1990) studied the ratio estimator and regression
estimator for a general $\alpha$th quantile, and showed that their estimators could lead to
considerable gains in efficiency over the customary estimator $\hat{q}(\alpha)$ when $y_j$ is
approximately proportional to $x_j$ for $j = 1, \ldots, N$. Olsson and Rootzen (1996) proposed a
quantile estimator for a nonparametric components of variance situation and proved
its consistency and asymptotic normality.
Chapter 3 Distribution Function and Quantile Estimator for One-Stage Sampling: Non-parametric Approach
The estimators given in the previous chapter assume the linear model as described in
(2.4). In this chapter, a general model is considered to obtain estimators similar to
the ones that are described in the previous chapter.
3.1 Formulation of the General Estimator of Distribution Function
Suppose that nothing is known about the relationship between the study variable Y and the auxiliary variable X except some population information about X. That is, only the following general model can be assumed:
$$E(Y \mid X = x) = \eta(x), \qquad Var(Y \mid X = x) = \sigma^2 v^2(x), \qquad (3.1)$$
where $\eta(\cdot)$ is unknown and $v(\cdot)$ is some positive function of x. In this case, nonparametric methods are frequently used, and several methods have been proposed for estimating $\eta(\cdot)$, for example, kernel, spline, and orthogonal series methods. Fan (1992) developed a design-adaptive nonparametric regression method, which is based on a weighted local linear regression. This method works as follows.
Suppose that the second derivative of $\eta$ at $x_0$ exists. Using Taylor's expansion in a small neighborhood of the point $x_0$ gives
Then estimating $\eta(x_0)$ is equivalent to estimating the intercept a of a local linear regression problem. Now if the study variable Y and the auxiliary variable X satisfy (3.1), we can consider the following weighted local linear regression problem: find a and b by minimizing
where $K(\cdot)$ is some kernel function and $h_n$ is the smoothing bandwidth. The solution for b will not be discussed in this thesis. The weighted least squares solution for a is then defined to be the local linear regression smoother, i.e.,
with
The bandwidth $h_n$ can be chosen either subjectively by the data analyst or objectively from the data. Fan (1992) showed that the local linear regression smoother has high asymptotic efficiency (it can be 100% with a suitable choice of kernel and bandwidth) among possible linear smoothers. It adapts to almost all regression settings and does not require any modification even at the boundary. I explore the use of this method to obtain an estimator for the distribution function and quantiles.
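The local linear smoother can be sketched in a few lines. This is a minimal illustration that assumes the standard kernel-weighted form in Fan (1992), where the estimate of $\eta(x_0)$ is the intercept a minimizing $\sum_{i \in s} \{y_i - a - b(x_i - x_0)\}^2 K((x_i - x_0)/h_n)$. The function names, the uniform kernel default and the fallback for degenerate windows are assumptions made for illustration and are not taken from the thesis.

```python
def uniform_kernel(u):
    """Uniform kernel: 1/2 on [-1, 1] and 0 elsewhere."""
    return 0.5 if abs(u) <= 1.0 else 0.0


def local_linear(x0, xs, ys, h, kernel=uniform_kernel):
    """Local linear estimate of eta(x0) from sample pairs (xs, ys).

    Returns the intercept a minimizing
        sum_i (y_i - a - b * (x_i - x0))**2 * K((x_i - x0) / h).
    """
    d = [x - x0 for x in xs]
    k = [kernel(di / h) for di in d]
    s0 = sum(k)
    s1 = sum(ki * di for ki, di in zip(k, d))
    s2 = sum(ki * di * di for ki, di in zip(k, d))
    denom = s0 * s2 - s1 * s1          # non-negative for a non-negative kernel
    if denom > 1e-12:
        # Closed-form weights w_i = K_i * (s2 - d_i * s1) for the intercept.
        num = sum(ki * (s2 - di * s1) * yi for ki, di, yi in zip(k, d, ys))
        return num / denom
    if s0 > 0.0:
        # Degenerate window (a single distinct x): fall back to a local mean.
        return sum(ki * yi for ki, yi in zip(k, ys)) / s0
    return float("nan")                # empty window
```

For exactly linear data the smoother reproduces the line, including at the edge of the data, which illustrates the boundary behaviour noted above.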
The model-assisted estimator $\hat{F}_{ma}$ in (2.11) has some advantages, and its efficiency depends largely on the accuracy of the predictor of $\Delta(t - y_j)$. So we can improve the distribution function estimator by improving the predictor of $\Delta(t - y_j)$. From the predictor expression (2.9), we see that it uses the assumption of linearity between Y and X. But in most cases, we can only use the general model (3.1) to describe the relationship between Y and X; this gives a new predictor of $\Delta(t - y_j)$ as
where
By using an efficient predictor $\hat{\eta}(x_i)$ of $y_i$ (see Fan, 1992), $\hat{g}(x_j)$ should be more efficient than $\hat{G}(x_j)$ in (2.9). If we plug $\hat{g}(x_j)$ into formula (2.11), we get a new estimator of the distribution function, denoted $\hat{F}_{pma}(t)$, as follows:
and expect that it has a better performance. In the next section, we present some simulation results to compare the behavior of $\hat{F}_{ma}$, $\hat{F}_{pma}$ and $\hat{F}_{pe}$.
For the estimation of the αth quantile of Y, we invert the distribution function estimator $\hat{F}(t)$ to get $\hat{q}(\alpha)$. Since $\hat{F}(t)$ is not necessarily a nondecreasing function of t (Olsson and Rootzén, 1996), to estimate the quantile $q(\alpha)$ we have to modify $\hat{F}(t)$ to $\tilde{F}(t)$ such that $\tilde{F}(t)$ is a nondecreasing function of t, i.e.,
$$\tilde{F}(t) = \sup\{\hat{F}(y) : y \le t\}. \qquad (3.9)$$
It can be shown that $\hat{F}(t)$ and $\tilde{F}(t)$ have the same limiting distribution function and that the above modification does not affect the value of the αth quantile $\hat{q}(\alpha)$ (Olsson and Rootzén, 1996).
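The monotone modification (3.9) and the inversion step can be illustrated on a grid of evaluation points. This sketch assumes that formula (1.5), which is not reproduced in this chapter, is the usual inverse $q(\alpha) = \inf\{t : F(t) \ge \alpha\}$; the function names and the grid-based evaluation are illustrative choices.

```python
def monotone_envelope(f_values):
    """Running supremum of F_hat over an ascending grid, as in (3.9)."""
    out, running = [], float("-inf")
    for f in f_values:
        running = max(running, f)
        out.append(running)
    return out


def invert_cdf(t_grid, f_values, alpha):
    """Smallest grid point t with F_tilde(t) >= alpha, or None if never reached.

    t_grid must be sorted in ascending order; f_values[k] is F_hat(t_grid[k]).
    """
    for t, f in zip(t_grid, monotone_envelope(f_values)):
        if f >= alpha:
            return t
    return None
```

A dip in the estimated distribution function (for example 0.4 followed by 0.3) is flattened to the running maximum before the quantile is read off.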
In the simulation, we compare some quantile estimators that are obtained from the direct or indirect use of auxiliary information. They are denoted $\hat{q}_{ma}(\alpha)$, $\hat{q}_{pma}(\alpha)$, $\hat{q}_{pe}(\alpha)$, $\hat{q}_r(\alpha)$ and $\hat{q}_d(\alpha)$. The first three are obtained from (1.5) when $F(t)$ is replaced by $\hat{F}_{ma}(t)$, $\hat{F}_{pma}(t)$ and $\hat{F}_{pe}(t)$ respectively; the latter two are the ratio estimator and the difference estimator suggested in Rao, Kovar and Mantel (1990).
3.2 Simulation Study
In this section, a limited simulation is conducted to assess the estimation procedures of the preceding section.
Here, two sets of simulated population data are used. Suppose that there are N distinct elements in each population under study. One population data set (Population 1) is generated from a linear model of the form $y_j = \alpha + \beta x_j + v(x_j)\epsilon_j$. Another population data set (Population 2) comes from a nonparametric model of the form $y_j = \eta(x_j) + v(x_j)\epsilon_j$ for j = 1, ..., N. In both cases, the $x_j$'s ($x_j > 0$) and $\epsilon_j$'s are independent, $x_j \sim N(\mu_x, \sigma_x^2)$, $\epsilon_j \sim N(0, 1)$ and $v(x_j) = x_j^{1/2}$ for j = 1, 2, ..., N.
No matter which set of population data is being used, the basic procedures are the same. In each simulation, we use the population values $x_1, x_2, \ldots, x_N$ of the auxiliary variable X and the observed values of Y in a sample s to estimate the distribution function and associated quantiles of Y through the model-assisted approach under the linear model assumption ($\hat{F}_{ma}(t)$, $\hat{q}_{ma}(\alpha)$) or under the general model assumption ($\hat{F}_{pma}(t)$, $\hat{q}_{pma}(\alpha)$). The exception is the pseudo empirical maximum likelihood estimators, $\hat{F}_{pe}(t)$ and $\hat{q}_{pe}(\alpha)$, for which only the population mean $\bar{X}_N$ and the observed values of X in the sample are available as auxiliary information. We will take $N_{simu}$ samples of size n from the simulated population. The procedures are given as follows.
In each simulation, we use the simple random sampling without replacement (SRSWOR) scheme to obtain a sample s of size n out of the population of N elements, because this simple sampling scheme plays an important role in practice. So we get the observed data of Y as $\{y_i, i \in s\}$, the corresponding auxiliary values $\{x_i, i \in s\}$, and the design weights $d_i(s) = N/n$ for each sample point $i \in s$.
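The SRSWOR draw itself is simple to sketch. The following assumes that the design weight of each sampled unit is the inverse inclusion probability N/n; the function name and the seed are arbitrary illustrative choices.

```python
import random


def srswor(N, n, rng):
    """Draw n of the labels 0..N-1 without replacement.

    Returns the sampled labels and their design weights, taken here to be
    N / n (the inverse of the inclusion probability n / N).
    """
    sample = rng.sample(range(N), n)
    weights = [N / n] * n
    return sample, weights


rng = random.Random(2000)          # arbitrary seed for reproducibility
s, d = srswor(500, 30, rng)        # N = 500 and n = 30 as in the simulation
```

With these weights the weighted sample count $\sum_{i \in s} d_i(s)$ equals the population size N.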
Distribution function estimation
Method 1: Model-assisted estimator $\hat{F}_{ma}$ proposed in Rao (1994), i.e., the estimator calculated from (2.9) - (2.11). This estimator is based on the linear model (2.4) with $v(x_j) = x_j^{1/2}$ and uses complete information on the auxiliary variable X.
Step 1: Predict y given x.
Under the above assumptions, β is estimated by the ratio estimate $\hat{\beta} = \sum_{i \in s} y_i / \sum_{i \in s} x_i$, which is a best linear unbiased estimator of β for any sample. Then, given $x_j$, $y_j$ is predicted as $\hat{y}_j = \hat{\beta} x_j$ for each $j \in U$.
Step 2: Calculate the residual of $y_i$ for each i in s through formula (2.10).
Step 3: Predict $\Delta(t - y_j)$ for each element j in the population U through formula (2.9).
Step 4: Get the value of the distribution function estimator $\hat{F}_{ma}$ at t by using (2.11).
Method 2: Model-assisted estimator $\hat{F}_{pma}$ proposed here. This estimator is based on the general model (3.1) under the assumption that $v(x_j) = x_j^{1/2}$ for $j \in U$, and uses complete information on the auxiliary variable X.
Step 1: Predict y given x.
Since the concrete form of the model is unknown, the auxiliary information on X and the sample data of Y are used to fit an appropriate model. Here, the local linear regression method (Fan, 1992) is used for this purpose. That is, $y_j$ is predicted through (3.3) given $x_j$ for each $j \in U$.
Step 2: Calculate the residuals of $y_i$ for each i in s by using formula (3.7).
Step 3: Predict $\Delta(t - y_j)$ for each element j in the population U by using formula (3.6).
In this process, an appropriate kernel function $K(\cdot)$ and an appropriate bandwidth $h_n$ are needed. Here, the uniform kernel function (see Table 3.3) is used first to estimate the distribution function, and the performance of $\hat{F}_{pma}$ is compared with that of the other distribution function estimators. Later on, the effect of different kernel functions on the estimator $\hat{F}_{pma}$ will also be checked. As for the choice of the smoothing bandwidth $h_n$, the cross-validation technique is used. So we choose
where $\hat{\eta}_{-i}(\cdot)$ is the regression estimator (3.3) computed without the ith observation $(x_i, y_i)$ of the sample s.
Step 4: Estimate the distribution function $F(t)$ through (3.8).
Note that, in both of the above methods, it is assumed that $v(x_i) = x_i^{1/2}$ and that complete information on X is available.
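The cross-validation choice of bandwidth in Method 2 can be sketched as follows. The criterion is assumed here to be the usual leave-one-out sum of squared prediction errors, $\sum_{i \in s} \{y_i - \hat{\eta}_{-i}(x_i)\}^2$, minimized over a finite set of candidate bandwidths; the candidate grid, the function names and the handling of empty windows are illustrative assumptions.

```python
def local_linear(x0, xs, ys, h):
    """Local linear fit at x0 with a uniform kernel of bandwidth h."""
    pts = [(x - x0, y) for x, y in zip(xs, ys) if abs(x - x0) <= h]
    s0 = len(pts)
    s1 = sum(d for d, _ in pts)
    s2 = sum(d * d for d, _ in pts)
    denom = s0 * s2 - s1 * s1
    if denom > 1e-12:
        return sum((s2 - d * s1) * y for d, y in pts) / denom
    if s0 > 0:
        return sum(y for _, y in pts) / s0    # degenerate window: local mean
    return float("nan")                        # empty window


def cv_score(h, xs, ys):
    """Leave-one-out mean squared prediction error for bandwidth h.

    xs and ys are lists; the result is NaN if any window is empty.
    """
    total = 0.0
    for i in range(len(xs)):
        x_rest = xs[:i] + xs[i + 1:]
        y_rest = ys[:i] + ys[i + 1:]
        pred = local_linear(xs[i], x_rest, y_rest, h)
        total += (ys[i] - pred) ** 2
    return total / len(xs)


def choose_bandwidth(xs, ys, candidates):
    """Candidate bandwidth with the smallest non-NaN CV score."""
    scored = [(cv_score(h, xs, ys), h) for h in candidates]
    usable = [(s, h) for s, h in scored if s == s]   # NaN != NaN
    return min(usable)[1]
```

Bandwidths so small that some observation has no neighbours in its window produce a NaN score and are discarded, which is one simple way to keep the search away from unusable values.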
Method 3: Pseudo empirical maximum likelihood estimator $\hat{F}_{pe}$. This method assumes that only the population mean and the sampled data of X are available as auxiliary information, but it does not need any model assumption for the relationship between Y and X.
Step 1: Solve the equation (2.28).
Let $\bar{X}_N$ be the population mean of X; then $T(x_i) = x_i - \bar{X}_N$. So c = 1, and (2.28) becomes
where $d_i(s) = N/n$. To solve this nonlinear equation for λ, the Newton-Raphson method (Press, Flannery, Teukolsky and Vetterling, 1990) is used here. Plugging a one-term Taylor expansion of (2.27) into the constraint conditions (2.24) gives an initial guess of λ as
Then, starting from this initial guess, an approximate solution for the root λ can be obtained quickly.
Step 2: Use formula (2.27) to calculate the optimal value $\hat{p}_i$ for $i \in s$.
Step 3: Use formula (2.26) to calculate the value of the distribution function estimator $\hat{F}_{pe}$ at a given t.
Also note that, since the SRSWOR sampling scheme is being used, $\hat{F}_{pe}$ actually coincides with $\hat{F}_{mel}$ here.
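The three steps of Method 3 can be sketched under equal design weights. Because equations (2.24) - (2.28) are not reproduced in this chapter, the sketch assumes the standard empirical likelihood form: weights $p_i = 1 / [n\{1 + \lambda (x_i - \bar{X}_N)\}]$, where λ solves $\sum_{i \in s} u_i / (1 + \lambda u_i) = 0$ with $u_i = x_i - \bar{X}_N$, and an initial guess $\lambda_0 = \sum u_i / \sum u_i^2$ from a one-term Taylor expansion. All function names and the step-halving safeguard are illustrative assumptions.

```python
def solve_lambda(u, tol=1e-12, max_iter=100):
    """Newton-Raphson root of g(lam) = sum_i u_i / (1 + lam * u_i).

    Assumes the u_i take both signs, i.e. the population mean lies strictly
    inside the range of the sampled x values; otherwise no root exists.
    """
    lam = sum(u) / sum(ui * ui for ui in u)       # one-term Taylor initial guess
    while any(1.0 + lam * ui <= 0.0 for ui in u):
        lam /= 2.0                                # shrink until weights are positive
    for _ in range(max_iter):
        g = sum(ui / (1.0 + lam * ui) for ui in u)
        dg = -sum((ui / (1.0 + lam * ui)) ** 2 for ui in u)
        step = g / dg
        # Halve the step until every 1 + lam * u_i stays positive.
        while any(1.0 + (lam - step) * ui <= 0.0 for ui in u):
            step /= 2.0
        lam -= step
        if abs(step) < tol:
            break
    return lam


def pel_weights(xs, x_bar_pop):
    """Weights p_i that sum to one and reproduce the known population mean."""
    u = [x - x_bar_pop for x in xs]
    lam = solve_lambda(u)
    n = len(u)
    return [1.0 / (n * (1.0 + lam * ui)) for ui in u]


def pel_cdf(t, ys, p):
    """Weighted distribution function estimate: sum of p_i over y_i <= t."""
    return sum(pi for yi, pi in zip(ys, p) if yi <= t)
```

When the sample mean of X already equals the population mean, the root is λ = 0 and the weights reduce to 1/n, so the estimate is the ordinary empirical distribution function.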
The αth quantile estimation
When estimating the quantiles of Y, the assumptions are the same as those in the preceding section. If the auxiliary information on X has already been used in the estimation of the distribution function $F(t)$ as above, the distribution function estimator will be directly inverted to estimate the corresponding quantiles of Y. That is, formulas (3.9) and (1.5) will be used to estimate the quantile $q(\alpha)$. The resulting estimators are denoted $\hat{q}_{ma}(\alpha)$, $\hat{q}_{pma}(\alpha)$ and $\hat{q}_{pe}(\alpha)$, corresponding to $\hat{F}_{ma}$, $\hat{F}_{pma}$ and $\hat{F}_{pe}$ respectively.
However, if the quantile needs to be estimated directly from the sampled data of Y with the complete auxiliary information in hand, then it can be estimated by the ratio estimator $\hat{q}_r$ and the difference estimator $\hat{q}_d$ proposed by Rao, Kovar and Mantel (1990) as:
where $\hat{q}_x$ and $\hat{q}_y$ are the sample αth quantiles of X and Y respectively for a given sample s, and $q_x$ is the population αth quantile of X. In this simulation, all five kinds of quantile estimator of Y will be compared.
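The two direct estimators can be sketched as follows. The forms used are assumptions based on the usual statement of the Rao, Kovar and Mantel (1990) estimators, $\hat{q}_r = (\hat{q}_y / \hat{q}_x)\, q_x$ and $\hat{q}_d = \hat{q}_y + \hat{R}\,(q_x - \hat{q}_x)$ with $\hat{R} = \sum_{i \in s} y_i / \sum_{i \in s} x_i$ under equal design weights, and the sample quantile is taken as the inverse of the empirical distribution function. The function names are illustrative.

```python
import math


def sample_quantile(values, alpha):
    """Inverse of the empirical distribution function at level alpha."""
    v = sorted(values)
    k = max(math.ceil(alpha * len(v)) - 1, 0)
    return v[k]


def ratio_quantile(xs, ys, alpha, qx_pop):
    """Ratio estimator: sample quantile of Y scaled by q_x / q_x_hat."""
    return sample_quantile(ys, alpha) / sample_quantile(xs, alpha) * qx_pop


def difference_quantile(xs, ys, alpha, qx_pop):
    """Difference estimator: q_y_hat + R_hat * (q_x - q_x_hat)."""
    r_hat = sum(ys) / sum(xs)
    return sample_quantile(ys, alpha) + r_hat * (qx_pop - sample_quantile(xs, alpha))
```

When Y is exactly proportional to X, both estimators move the sample quantile of Y by the same amount, as expected from the proportionality condition mentioned in connection with these estimators.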
Here the relative biases (RB) and relative root mean square errors (RRMSE) generated in this simulation experiment are used as measures to compare the performance of different estimators $\hat{\theta}(t)$, which may be either the distribution function estimator $\hat{F}(t)$ or the quantile estimator $\hat{q}(\alpha)$. For the distribution function estimator, whose values only fall in the range between 0 and 1, the mean square errors (MSE) are also appropriate to measure the efficiency and will be examined too. For each sample $s \in S$, let $\hat{\theta}_r(t)$ denote the estimator $\hat{\theta}(t)$ obtained from the rth simulation sample, and let $\theta(t)$ denote the population value of the parameter of interest; then
$$RB(\hat{\theta}) := \frac{1}{N_{simu}} \sum_{r=1}^{N_{simu}} \frac{\hat{\theta}_r(t) - \theta(t)}{\theta(t)}, \qquad (3.14)$$
$$MSE(\hat{\theta}) := \sum_{r=1}^{N_{simu}} \frac{(\hat{\theta}_r(t) - \theta(t))^2}{N_{simu}}, \qquad (3.15)$$
and
$$RRMSE(\hat{\theta}) := \frac{\sqrt{MSE(\hat{\theta})}}{\theta(t)}. \qquad (3.16)$$
Note: Here, we are only interested in the distribution function values at the three quartiles (the 0.25th, 0.50th and 0.75th quantiles), so RB and RRMSE are appropriate measures for estimators of the distribution function.
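The three performance measures are straightforward to compute from the simulated estimates. This sketch assumes the usual definitions (average relative error, average squared error, and its square root divided by the true value); the function names are illustrative.

```python
import math


def relative_bias(estimates, theta):
    """Average of (estimate - theta) / theta over the simulation runs."""
    return sum((e - theta) / theta for e in estimates) / len(estimates)


def mse(estimates, theta):
    """Average squared error over the simulation runs."""
    return sum((e - theta) ** 2 for e in estimates) / len(estimates)


def rrmse(estimates, theta):
    """Root mean square error relative to the true value theta."""
    return math.sqrt(mse(estimates, theta)) / theta
```

Each function takes the list of estimates from the $N_{simu}$ runs and the population value, so the same code serves for distribution function values and for quantiles.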
A single-stage sampling scheme is usually used when sampling from a small population, and in most cases only a very small sample is observed. So, in this simulation, a small simulated population of size N = 500 is generated and small samples of size n = 30 are taken out of each population. However, the methods discussed here also apply to populations and samples of any other sizes. After 1,000 simulation runs, the simulation results become stable, so this section only shows the simulation results for $N_{simu}$ = 1,000.
1
The linear simulated population data corne fkom model = -2Xj + 5 ? c j , and 1
the nonhear simulated population data corne hom model Yj = 150 + 2.~e- -~ - ' + x,' f j :
where $\epsilon_j \sim$ i.i.d. $N(0, 1)$ and $X_j \sim$ i.i.d. $N(10.2, 2.0^2)$ for j = 1, ..., 500. The population data of these two data sets are shown in Figures 3.1 and 3.2.
First, the uniform kernel function is used to estimate $F(t)$ through (3.8), and the performance of the different estimators is compared. Then the effects of different kernels in the local regression on the estimator (3.8) of the distribution function are also examined.
Comparison among different estimators of distribution function
The results of the estimation of the distribution function $F(t)$ at three different points, t = the 0.25th, 0.50th and 0.75th quantiles of Y, are shown in Tables 3.1 - 3.2, where $F_p$ stands for the population cumulative distribution function, and $\hat{F}_{ma}$, $\hat{F}_{pma}$ and $\hat{F}_{pe}$ stand for the estimators obtained from Method 1, Method 2 and Method 3, respectively. RB, MSE and RRMSE stand for the relative biases, mean square errors and relative root mean square errors of the corresponding distribution function estimator, respectively. Figures 3.3a - 3.4c give the plots of the population distribution function and the different distribution function estimators.
The results indicate that, among the three kinds of estimator, if the underlying relationship between the study variable Y and the auxiliary variable X is linear (see Figure 3.1), then for the estimation of the distribution function $F(t)$, in terms of bias, as measured by RB, and in terms of efficiency, as measured by MSE and RRMSE, $\hat{F}_{ma}$ has the smallest bias and is the most efficient estimator. The pseudo empirical maximum likelihood estimator $\hat{F}_{pe}$ gives the worst performance among these three estimators, except at the median of the population.
In the case of the nonlinear simulated population data (see Figure 3.2), in terms of bias, as measured by RB, and in terms of efficiency, as measured by MSE and RRMSE, $\hat{F}_{pma}$ and $\hat{F}_{pe}$ have comparable performances, but both of them are less biased and more efficient than $\hat{F}_{ma}$.
This indicates that when it is known that there exists a strong linear relationship between Y and X in the underlying population, $\hat{F}_{ma}$ can be used to estimate the distribution function comfortably; but when there is no indication of a strong linear relationship, $\hat{F}_{pma}$ and $\hat{F}_{pe}$ may give a better estimation. In most cases, $\hat{F}_{pma}$ is more stable since it makes use of the complete information on the auxiliary variable, but $\hat{F}_{pe}$ has the advantage that it can still be used when only summary information on the auxiliary variable is available, and it is still bias robust.
Figure 3.1: Scatterplot of Linear Simulated Population from Model
Figure 3.2: Scatterplot of Nonlinear Simulated Population from Model
Y = 150 + 2.5e-" + X ~ E
Table 3.1: Simulation Results with Linear Simulated Population Data: Relative Biases (RB), Mean Square Errors (MSE) and Relative Root Mean Square Errors (RRMSE) of Different Distribution Function Estimators (Population Size = 500, Sample Size = 30, Number of Simulations = 1000.)
Table 3.2: Simulation Results with Nonlinear Simulated Population Data: Relative Biases (RB), Mean Square Errors (MSE) and Relative Root Mean Square Errors (RRMSE) of Different Distribution Function Estimators (Population Size = 500, Sample Size = 30, Number of Simulations = 1000.)
Figure 3.3a: Plot of Population Distribution Function $F_p$ and Model-Assisted Estimator $\hat{F}_{ma}$ with Linear Simulated Population
Figure 3.3b: Plot of Population Distribution Function $F_p$ and Proposed Estimator $\hat{F}_{pma}$ with Linear Simulated Population
Figure 3.3c: Plot of Population Distribution Function $F_p$ and Pseudo Empirical Likelihood Estimator $\hat{F}_{pe}$ with Linear Simulated Population
Figure 3.4a: Plot of Population Distribution Function $F_p$ and Model-Assisted Estimator $\hat{F}_{ma}$ with Nonlinear Simulated Population
Figure 3.4b: Plot of Population Distribution Function $F_p$ and Proposed Estimator $\hat{F}_{pma}$ with Nonlinear Simulated Population
Figure 3.4c: Plot of Population Distribution Function $F_p$ and Pseudo Empirical Likelihood Estimator $\hat{F}_{pe}$ with Nonlinear Simulated Population
The effect of kernel on the model-assisted estimator
When the proposed model-assisted estimator $\hat{F}_{pma}$ is used to estimate the population distribution function and associated quantiles, an appropriate kernel function is needed. The associated quantile estimator $\hat{q}_{pma}$ is just the inverse of the distribution function estimator $\hat{F}_{pma}$. So in this section, only the effects of kernel functions on the proposed distribution function estimator are examined, by comparing the following seven common kernels:
Table 3.3: Common Kernel Function
Name
Uniform
Triangle
Epanechnikov
Quartic
Triweight
Cosinus
Gaussian
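The seven kernels named in Table 3.3 can be written down explicitly. The formula column of the table is not reproduced here, so the definitions below are the standard textbook forms and are an assumption about what the table contained; the dictionary name and the numerical check are illustrative.

```python
import math


def _on_support(u):
    """True when u lies in the compact support [-1, 1]."""
    return abs(u) <= 1.0


KERNELS = {
    "Uniform": lambda u: 0.5 if _on_support(u) else 0.0,
    "Triangle": lambda u: (1.0 - abs(u)) if _on_support(u) else 0.0,
    "Epanechnikov": lambda u: 0.75 * (1.0 - u * u) if _on_support(u) else 0.0,
    "Quartic": lambda u: 15.0 / 16.0 * (1.0 - u * u) ** 2 if _on_support(u) else 0.0,
    "Triweight": lambda u: 35.0 / 32.0 * (1.0 - u * u) ** 3 if _on_support(u) else 0.0,
    "Cosinus": lambda u: math.pi / 4.0 * math.cos(math.pi / 2.0 * u) if _on_support(u) else 0.0,
    "Gaussian": lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi),
}


def integrate(kernel, lo=-8.0, hi=8.0, steps=160000):
    """Midpoint-rule integral, used only as a sanity check on the kernels."""
    width = (hi - lo) / steps
    return sum(kernel(lo + (i + 0.5) * width) for i in range(steps)) * width
```

Each kernel is a probability density, so a quick numerical integration is a convenient check that the constants are right before plugging a kernel into the local linear smoother.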
The estimation results are given in Tables 3.4 - 3.5.
Table 3.4: Relative Biases (RB) and Relative Root Mean Square Errors (RRMSE) of $\hat{F}_{pma}$ at the 0.25th, 0.50th and 0.75th quantiles of Y under Different Kernels with Linear Simulated Population
Kernel Type
Uniform
Triangle
Epanechnikov
Quartic
Cosinus
Gaussian
Table 3.5: Relative Biases (RB) and Relative Root Mean Square Errors (RRMSE) of $\hat{F}_{pma}$ at the 0.25th, 0.50th and 0.75th quantiles of Y under Different Kernels with Nonlinear Simulated Population
Kernel Type
Uniform
Triangle
Epanechnikov
Quartic
Triweight
Cosinus
Gaussian
RRMSE($\hat{F}_{pma}$) at t = αth quantile
The above results demonstrate that, no matter which kind of model the underlying population data truly follow, in terms of bias and efficiency, as measured by RB, MSE and RRMSE, the estimator of the distribution function obtained from (3.8) performs better when the uniform kernel function is used. In particular, when the underlying relationship between Y and X is nonlinear, the uniform kernel function produces an estimator with the smallest bias and the highest efficiency compared to the other kernel functions.
However, when any one of the kernel functions 1 - 7 is used in estimating the distribution function, the estimator $\hat{F}_{pma}$ always performs better than $\hat{F}_{ma}$ when the underlying model is nonlinear. So in the following, the uniform kernel function is still used to obtain $\hat{q}_{pma}$.
Comparison among different quantile estimators
In this simulation, five kinds of αth quantile estimator, $\hat{q}_{ma}$, $\hat{q}_{pma}$, $\hat{q}_{pe}$, $\hat{q}_r$ and $\hat{q}_d$, are compared. The first three stand for the inversions of $\hat{F}_{ma}$, $\hat{F}_{pma}$ and $\hat{F}_{pe}$; the latter two are the ratio estimator and the difference estimator with the auxiliary information used, respectively. The results are given in Tables 3.6 - 3.7.
For the linear simulated population data, in terms of bias and efficiency, as measured by RB and RRMSE, the estimator $\hat{q}_{ma}$ has the best performance among these five estimators, but at the 0.75th quantile of Y, the bias of $\hat{q}_{ma}$ is larger than that of the ratio estimator $\hat{q}_r$. $\hat{q}_{pe}$ performs better than $\hat{q}_{pma}$ except at the 0.25th quantile. Generally, the first three estimators have smaller biases, as measured by RB, and higher efficiency, as measured by RRMSE, than the last two estimators, except that at the 0.75th quantile $\hat{q}_r$ is better than $\hat{q}_{pe}$ and $\hat{q}_{ma}$. So the first three estimators are appropriate to use. In particular, $\hat{q}_{ma}$ is recommended when a strong linear relationship exists in the underlying population.
For the nonlinear simulated population data, in terms of bias and efficiency, as measured by RB and RRMSE, the estimator $\hat{q}_{pma}$ is less biased and more efficient than $\hat{q}_{ma}$, and even less biased than $\hat{q}_{pe}$ at the median (0.50th quantile) of Y. However, $\hat{q}_{pe}$ has the highest efficiency at all three quantiles, and at the 0.25th and 0.75th quantiles $\hat{q}_{pe}$ also has the smallest bias. At the 0.50th and 0.75th quantiles, $\hat{q}_r$ has a higher efficiency than $\hat{q}_{ma}$. So when the population of interest is nonlinear, $\hat{q}_{pe}$ and $\hat{q}_{pma}$ are suggested in preference to $\hat{q}_{ma}$.
Table 3.6: Simulation Results with Simulated Linear Population Data: Relative Biases (RB) and Relative Root Mean Square Errors (RRMSE) of Different αth Quantile $q(\alpha)$ Estimators (Population Size = 500, Sample Size = 30, Simulation Times = 1000.)
RRMSE($\hat{q}_{ma}$) RRMSE($\hat{q}_{pma}$) RRMSE($\hat{q}_{pe}$) RRMSE($\hat{q}_r$) RRMSE($\hat{q}_d$)
Table 3.7: Simulation Results with Simulated Nonlinear Population Data: Relative Biases (RB) and Relative Root Mean Square Errors (RRMSE) of Different αth Quantile $q(\alpha)$ Estimators (Population Size = 500, Sample Size = 30, Simulation Times = 1000.)
RRMSE($\hat{q}_{ma}$) RRMSE($\hat{q}_{pma}$) RRMSE($\hat{q}_{pe}$) RRMSE($\hat{q}_r$) RRMSE($\hat{q}_d$)
0.058523 0.050050 0.004770 0.060386 0.051528
Chapter 4 Distribution Function and Quantile Estimation in Two-stage Sampling
In this chapter, we consider the estimation of the distribution function and αth quantile of a finite population under a two-stage sampling scheme. In practice, under the simplest SRSWOR/SRSWOR sampling plan, there are already several conventional estimators of the distribution function developed under design-based inference methods. However, estimation under a superpopulation model is considered in this chapter. Section 4.1 gives a brief introduction to the conventional estimators. In Section 4.2, along the same lines as Chapter 3 and extending the results for estimating a population total to the population distribution function, a distribution function estimator is proposed. Some alternative estimators are given in Section 4.3. Finally, in Section 4.4, the proposed estimator is compared with the conventional estimators under the SRSWOR/SRSWOR sampling plan through a simulation study.
4.1 Two-stage Distribution Function Estimator in Conventional Theory
Consider a finite population which contains K elements arranged in N clusters; the ith cluster is known to contain $M_i$ elements, so that $\sum_{i=1}^{N} M_i = K$. Then the population can be represented as $U = \{Y_{ij}, i = 1, \ldots, N; j = 1, \ldots, M_i\}$. First a sample s of n units is taken out of the N clusters at the first stage, and then a sample $s_i$ of size $m_i$ for $i \in s$ is chosen at the second stage. According to conventional sampling theory, there are several types of estimator for two-stage cluster sampling. For the most common and simple design, simple random sampling without replacement (SRSWOR) in both stages, the usual estimators are (see Royall, 1976):
( i ) the expansion estimator
( ii ) the "unbiased" estimator
( iii ) the "ratio-type" estimator
and
where,
Remark: The estimator Fu is unbiased only if && happens to equal &$ (Royal,
1976). Where as, in general, p. and Fr are biased estimator.
In the next section, estimation of the distribution function under a superpopulation model will be considered.
4.2 Proposed Method of Estimation: Distribution Function in Two-stage Sampling
Assume the population U is generated from a two-stage superpopulation model:
By denoting $E(\cdot)$ for the expectation operator under the above model, it is assumed that (see Scott and Smith, 1969)
are identically distributed random variables with finite population distribution function $F_Y(t)$, which is defined as
In the current case, the population has nothing to do with the time of sampling. It is easy to see that it satisfies Assumption A in Olsson and Rootzén's paper (1996):
( i ) The finite population distribution function $F_Y(t)$ of Y is (piecewise) continuous.
( ii ) The N clusters are independent, so the vectors $(Y_{i1}, \ldots, Y_{iM_i})$, i = 1, ..., N, are independent.
( iii ) For each i, the correlation coefficient between $\Delta(t - Y_{ij})$ and $\Delta(t - Y_{ij'})$ depends only on t, so we can denote it as $\rho_\Delta(t)$.
Now suppose that it is needed to estimate the distribution function $F_Y(t)$ of Y and the associated αth quantile. A two-stage sampling scheme is conducted to get the observed data. That is, at the first stage, n out of N clusters are chosen through some sampling scheme to get a primary sample s; at the second stage, $m_i$ out of the $M_i$ elements of each selected cluster $i \in s$ are chosen under the designed sampling scheme to get a subsample $s_i$. Since $K = \sum_{i=1}^{N} M_i$, the same decomposition method as that in Chapter 2 can be used to decompose the finite population distribution function $F_Y(t)$ as follows:
Notice that only the first term $T_I$ in (4.11) is completely known after the sampling, so the other two terms, $T_{II}$ and $T_{III}$, have to be predicted from the observed data in order to obtain the estimator of the finite population distribution function $F_Y(t)$. When all the population cluster sizes $M_i$ (i = 1, ..., N) are known, estimating $F_Y(t)$ is equivalent to estimating $T_\Delta(t)$, the population total of $\Delta(t - Y)$, which is defined as
Hence the prediction method (Royall, 1976) can be used to get an unbiased estimator of $T_\Delta$. Firstly, observe that
$$E[\Delta(t - Y_{ij}) - F_Y(t)] = 0,$$
and
where,
which is the common correlation coefficient between $\Delta(t - Y_{ij})$ and $\Delta(t - Y_{ij'})$ within cluster i, and
where
Hence, the resulting estimator for $F_Y(t)$, which is termed the predictive estimator, is given by:
where,
with the weight given by
Along the lines of Royall (1976), it can be shown that the above estimator $\hat{F}_g(t)$ is the best unbiased estimator for the finite population parameter $F_Y(t)$ when $\sigma_\Delta^2$ and $\rho_\Delta$ are known, and the mean square error of $\hat{F}_g(t)$ is
In practice, the values of $\sigma_\Delta^2$ and $\rho_\Delta$ are usually unknown. However, estimates of $\sigma_\Delta^2$ and $\rho_\Delta$ can be used to obtain an approximately optimal estimator of the parameter of interest.
A simple estimator of $\rho_\Delta(t)$, given in Olsson and Rootzén (1996), is as follows. Denote $m = \#\{i : m_i \ne 1 \text{ and } i \in s\}$ and set
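$$\hat{\sigma}_\Delta^2(t) = \frac{1}{m} \sum_{i \in s,\ m_i \ne 1} \frac{1}{m_i} \sum_{j \in s_i} \{\Delta(t - Y_{ij}) - \hat{F}(t)\}^2, \qquad (4.22)$$
$$\hat{\rho}_\Delta(t) = \frac{1}{\hat{\sigma}_\Delta^2(t)} \cdot \frac{1}{m} \sum_{i \in s,\ m_i \ne 1} \frac{\sum_{1 \le j \ne j' \le m_i} \{\Delta(t - Y_{ij}) - \hat{F}(t)\}\{\Delta(t - Y_{ij'}) - \hat{F}(t)\}}{m_i (m_i - 1)}. \qquad (4.23)$$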
4.3 Alternative Estimators
This section gives some discussion on alternative estimators of the finite population distribution function in two-stage sampling. The corresponding quantile estimators can be obtained from formula (1.5) with the different distribution function estimators plugged in.
At first, for easy use of the predictive estimator (4.19), the simple estimator $\hat{F}(t)$ of the distribution function is used in the process of estimating $\sigma_\Delta^2(t)$ and $\rho_\Delta(t)$. In practice, for simplicity, $\hat{F}(t)$ can be replaced by $\hat{F}_1(t)$,
when it is reasonable to ignore the dependence within each cluster. When all the sample sizes $m_i$ are equal, they will result in the same estimator (Royall, 1976).
Some additional estimators can be obtained as special cases of the predictive
this case the optimal estimator (4.19) becomes
This is actually the conventional expansion estimator $\hat{F}_e$.
optimal estimator can be represented as
This also suggests that the conventional estimators $\hat{F}_e$ as in (4.1) and $\hat{F}_r$ as in (4.3) be substituted in the third part of the predictive estimator, which gives
and
Another form with FP used in the third part of the predictive estimator can be given
as:
And, actually, kg, and F~~ are just the predictive estimator representstion of Fr and
The diffaence between (4.27) and (4.30) only exists in the second term which
predicts CiEs xj,, a(t - Kj)-
In the next section, results from a Monte Carlo simulation will be presented to compare the estimators $\hat{F}_e$ as in (4.1), $\hat{F}_r$ as in (4.3) and $\hat{F}_u$ as in (4.2) with $\hat{F}_g$ as in (4.19) and the alternative predictive estimator as in (4.28) under the SRSWOR/SRSWOR sampling scheme.
4.4 Simulation Study
In this section, the simulation study for some special cases of the two-stage sampling process is presented.
Since a two-stage sampling scheme is usually used in moderate or large populations, three large simulated populations are generated here using a two-stage procedure. In each population, it is assumed that there are N = 200 clusters and the elements have a common correlation coefficient $\rho(Y_{ij}, Y_{ij'})$ within each cluster, for $j \ne j'$ and i = 1, ..., N. The cluster sizes in each population considered in this study are:
$$M_i = \begin{cases} 80, & \text{for } i = 1, \ldots, 50; \\ 100, & \text{for } i = 51, \ldots, 100; \\ 120, & \text{for } i = 101, \ldots, 200. \end{cases}$$
Finally we generate the data using the following procedure.
Step 1: Generate the cluster means $\mu_1, \mu_2, \ldots, \mu_N$ such that $\mu_i \sim N(\mu, \sigma_\mu^2)$ independently, where $\mu = 100.0$, and
$$\sigma_\mu = \begin{cases} 0.81, & \text{Population 1}, \\ 0.91, & \text{Population 2}, \\ 50.91, & \text{Population 3}. \end{cases}$$
Step 2: Within each cluster i, $\epsilon_{i1}, \ldots, \epsilon_{iM_i}$ are generated such that $\epsilon_{ij} \sim N(0, \sigma_\epsilon^2)$, for j = 1, ..., $M_i$, where
$$\sigma_\epsilon = \begin{cases} 100.7, & \text{Population 1}, \\ 0.9, & \text{Population 2}, \\ 0.9, & \text{Population 3}. \end{cases}$$
Step 3: Generate the population data of size K = 21,000, $\{Y_{ij} : i = 1, \ldots, N; j = 1, \ldots, M_i\}$, as $Y_{ij} = \mu_i + \epsilon_{ij}$, for i = 1, ..., N; j = 1, ..., $M_i$.
Then, note that in each population, the correlation coefficient within cluster i is:
$$\rho_Y = \rho(Y_{ij}, Y_{ij'}) = \frac{\sigma_\mu^2}{\sigma_\mu^2 + \sigma_\epsilon^2}, \quad \text{for } 1 \le j \ne j' \le M_i \text{ and } i = 1, \ldots, N.$$
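The three generation steps and the resulting intra-cluster correlation can be sketched as follows. The parameter values are those given in Steps 1 and 2, read as standard deviations, which is the reading that reproduces the correlation coefficients reported for the three populations; the function and constant names, and the use of a seeded generator, are illustrative assumptions.

```python
import random

CLUSTER_SIZES = [80] * 50 + [100] * 50 + [120] * 100    # M_i for i = 1..200
SIGMA_MU = {1: 0.81, 2: 0.91, 3: 50.91}                 # sd of cluster means
SIGMA_EPS = {1: 100.7, 2: 0.9, 3: 0.9}                  # sd within clusters


def intracluster_correlation(sigma_mu, sigma_eps):
    """rho_Y = sigma_mu^2 / (sigma_mu^2 + sigma_eps^2)."""
    return sigma_mu ** 2 / (sigma_mu ** 2 + sigma_eps ** 2)


def generate_population(pop, rng, mu=100.0):
    """Return a list of clusters, each a list of Y_ij = mu_i + eps_ij."""
    clusters = []
    for size in CLUSTER_SIZES:
        mu_i = rng.gauss(mu, SIGMA_MU[pop])                       # Step 1
        clusters.append([mu_i + rng.gauss(0.0, SIGMA_EPS[pop])    # Steps 2-3
                         for _ in range(size)])
    return clusters
```

Evaluating `intracluster_correlation` at the three parameter pairs gives values that agree with the reported correlations 6.46968e-5, 0.505525 and 0.999688 to the printed precision.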
In this simulation study, we consider three kinds of population in which every cluster has a correlation coefficient satisfying: i) $\rho_\Delta = 0$; ii) $0 < \rho_\Delta < 1$; and iii) $\rho_\Delta \approx 1$, by assigning $\rho_Y$ satisfying i), ii) and iii) too. The three simulated populations generated in section 4.4.1 have the correlation coefficients $\rho_Y$ = 6.46968e-5, $\rho_Y$ = 0.505525 and $\rho_Y$ = 0.999688 respectively. But in the process of estimating the distribution function, $\sigma_\Delta^2$ and $\rho_\Delta$ are assumed to be unknown. They can be estimated through (4.22) and (4.23). The estimates $\hat{\sigma}_\Delta^2$ and $\hat{\rho}_\Delta$ are plugged into the predictive estimator to obtain the corresponding distribution function estimate $\hat{F}_g(t)$. We consider two cases: case ( i ) with moderate sample sizes within clusters:
$$m_i = \begin{cases} 15, & \text{for } i = 1, \ldots, 5; \\ 20, & \text{for } i = 6, \ldots, 15; \\ 25, & \text{for } i = 16, \ldots, 20; \end{cases} \quad \text{for } i \in s, \qquad (4.36)$$
so that k = 400; and case ( ii ) with small sample sizes within clusters:
$$m_i = \begin{cases} 10, & \text{for } i = 1, \ldots, 5; \\ 12, & \text{for } i = 6, \ldots, 15; \\ 15, & \text{for } i = 16, \ldots, 20; \end{cases} \quad \text{for } i \in s, \qquad (4.37)$$
so that k = 245.
In both cases a small sample is selected using the two-stage sampling plan by first selecting 20 clusters from the 200 clusters using SRSWOR. Within the ith selected cluster, $m_i$ units, as defined in (4.36) for case 1 and in (4.37) for case 2, are selected using SRSWOR.
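The two-stage SRSWOR/SRSWOR draw can be sketched as follows. The sketch assumes that the subsample sizes in (4.36) and (4.37) are assigned to the 20 selected clusters in their order of selection, and it takes the middle subsample size in case 1 to be 20, which is the value implied by k = 400. The names are illustrative.

```python
import random

CLUSTER_SIZES = [80] * 50 + [100] * 50 + [120] * 100    # M_i for i = 1..200


def subsample_sizes(case):
    """Second-stage sample sizes m_i for the 20 selected clusters."""
    if case == 1:
        return [15] * 5 + [20] * 10 + [25] * 5          # k = 400
    return [10] * 5 + [12] * 10 + [15] * 5              # k = 245


def two_stage_srswor(case, rng):
    """Return {cluster index: list of selected element indices}."""
    clusters = rng.sample(range(len(CLUSTER_SIZES)), 20)          # first stage
    sample = {}
    for i, m_i in zip(clusters, subsample_sizes(case)):
        sample[i] = rng.sample(range(CLUSTER_SIZES[i]), m_i)      # second stage
    return sample
```

Summing the second-stage sizes recovers the total sample sizes k = 400 and k = 245 used in the two cases.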
After 10,000 simulation runs under the above three populations, the simulation results become stable; the relative biases (RB), relative root mean square errors (RRMSE) and relative efficiency (RE) with respect to $\hat{F}_u$ (or $\hat{q}_u$) are then computed for the proposed estimators using $N_{simu}$ = 10,000. RB and RRMSE are computed using (3.14) and (3.16) respectively, while RE is computed as follows.
Let $\hat{\theta}_u$ denote the unbiased estimator of θ and let $\hat{\theta}$ denote any estimator of θ; then we define
$$RE(\hat{\theta}) = \text{relative efficiency of } \hat{\theta} \text{ with respect to } \hat{\theta}_u := \frac{RRMSE(\hat{\theta}_u)}{RRMSE(\hat{\theta})}. \qquad (4.38)$$
The above measures are computed based on 10,000 runs, and the results for distribution functions are tabulated in Tables 4.1a, 4.2a and 4.3a for populations 1, 2 and 3, respectively, and for quantiles in Tables 4.1b, 4.2b and 4.3b, respectively. Turning to case 2, only estimation of the distribution function has been considered, and the results are tabulated in Tables 4.4 - 4.6 for populations 1 to 3 respectively. The minimum values of RB and RRMSE and the maximum value of RE appear in bold font.
The simulation results in case 1 indicate that, although the estimates $\hat{\sigma}_\Delta^2$ and $\hat{\rho}_\Delta$ rather than their population values are used in the estimation of the distribution function, the predictive estimator $\hat{F}_g$ still performs better than the conventional ones, especially (when $\rho_Y < \rho_\Delta$),
i) when $\rho_Y$ is close to zero, hence $\rho_\Delta$ is also close to zero, $\hat{F}_g$ attains its optimal value at $\hat{F}_{g0}$, which is also the conventional expansion estimator $\hat{F}_e$, and the alternative predictive estimator (4.28) also performs better than the conventional estimators $\hat{F}_u$ and $\hat{F}_r$.
ii) when $0 < \rho_Y < 1$, $\hat{F}_g$ is still better than $\hat{F}_e$, $\hat{F}_u$ and $\hat{F}_r$ in terms of bias, as measured by RB; efficiency, as measured by RRMSE; and relative efficiency with respect to $\hat{F}_u$, as measured by RE.
iii) when $\rho_\Delta$ is close to unity, the results show that $\hat{F}_g$ does attain the optimum
For the estimation of quantiles, basically, $\hat{q}_g$ has almost all of the good characteristics of $\hat{F}_g$, except that when $\rho_\Delta$ is close to unity, for the estimation of lower (upper) tail quantiles, the bias, as measured by RB, of $\hat{q}_g$ is bigger than that of $\hat{q}_e$.
The results of distribution function estimation in case 2 (Tables 4.4 - 4.6) show that $\hat{F}_g$ still performs better than the conventional estimators of the distribution function under the SRSWOR/SRSWOR sampling scheme when $0 < \rho_Y < 1$, but the relative efficiency (RE) decreases. So, based on the simulation results, the predictive estimators $\hat{F}_g$ and $\hat{q}_g$ are suggested for use under the two-stage sampling plan SRSWOR/SRSWOR, especially when a moderate sample data set is available.
Table 4.1a: Estimation of Distribution Function under Two-stage Sampling
Scheme (k = 400, py = 6.46968e-5): Relative Biases (RB), Relative
Root Mean Square Errors (RRMSE) and Relative Efficiency
(RE) with respect to Fu
Table 4.1b: Estimation of Quantile under Two-stage Sampling Scheme
(k = 400, py = 6.46968e-5): Relative Biases (RB), Relative Root Mean
Square Errors (RRMSE) and Relative Efficiency (RE) with respect
RRMSE(qe) RRMSE(qu) RRMSE(qr) RRMSE(qg) RRMSE(qgg) RRMSE(qg1)
Table 4.2a: Estimation of Distribution Function under Two-stage Sampling
Scheme (k = 400, py = 0.505525): Relative Biases (RB), Relative
Root Mean Square Errors (RRMSE) and Relative Efficiency (RE)
with respect to Fu
Table 4.2b: Estimation of Quantile under Two-stage Sampling Scheme
(k = 400, py = 0.505525): Relative Biases (RB), Relative Root Mean
Square Errors (RRMSE) and Relative Efficiency (RE) with respect
RRMSE(qe) RRMSE(qu) RRMSE(qr) RRMSE(qg) RRMSE(qgg) RRMSE(qg1)
Table 4.3a: Estimation of Distribution Function under Two-stage Sampling
Scheme (k = 400, py = 0.999688): Relative Biases (RB), Relative
Root Mean Square Errors (RRMSE) and Relative Efficiency (RE)
with respect to Fu
Table 4.3b: Estimation of Quantile under Two-stage Sampling Scheme
(k = 400, py = 0.999688): Relative Biases (RB), Relative Root Mean
Square Errors (RRMSE) and Relative Efficiency (RE) with respect
RRMSE(qe) RRMSE(qu) RRMSE(qr) RRMSE(qg) RRMSE(qgg) RRMSE(qg1)
Table 4.4: Estimation of Distribution Function under Two-stage Sampling
Scheme (k = 245, py = 6.46968e-5): Relative Biases (RB), Relative
Root Mean Square Errors (RRMSE) and Relative Efficiency (RE)
with respect to Fu
Table 4.5: Estimation of Distribution Function under Two-stage Sampling
Scheme (k = 245, py = 0.505525): Relative Biases (RB), Relative Root
Mean Square Errors (RRMSE) and Relative Efficiency (RE) with
respect to Fu
Table 4.6: Estimation of Distribution Function under Two-stage Sampling
Scheme (k = 245, py = 0.999688): Relative Biases (RB), Relative Root
Mean Square Errors (RRMSE) and Relative Efficiency (RE) with
respect to Fu
Chambers, R. L. and Dunstan, R. (1986). Estimating distribution functions from
survey data. Biometrika. 73,497 - 604-
Chambers, R. L., D o r h m , A. H., and Wehrly, T. E. (1993). Bias robust estima-
tion in h i t e popdations using nonparametrie calibration. Journal of the American
Statistical Association. 88, 268 - 277.
Chen, J. and Qin, J. (1993). Empirical likelihood estimation for finite populations and
the effective usage of au,.a'liaz-y in au,~liaryinformation. Biometrika. 80, 107 - 116.
Chen, J. and Sitter, R. R. (1996). A pseudo empi~cal likelihood approach to the
effective use of a ~ , ~ l i q information in complex surveys. Simon Fraser University,
Burnaby.
Chen, S. X. (1997). Empirical Likefiood- based kernel density estimation. Aus tralian
Journal of Statistics. 39, 47 - 56.
Cheng, K .F., Lin, P. E. (1981). Nonparametric estimation of a regression function:
limiting distribution. Australian Journal of Statistics. 23, 186 - 195.
Cochran, W. G. (1977). SampLing teclmïques (Third Edition). New York: John
Wiley & Sons.
Dielrnan, D., Lowry, C., and Pfaffenberger, R. (1994). A comparison of qtritntile
estimators. Communications in S tatistics. Part B-Sirnulat ion and Computat ion.
23,355 - 371.
Fan, J. Q. (1992). Design-adaptive nonparametric regression. Journal of the Ameri-
c m Statistical Association. 87, 998 - 1004.
Fransisco, C. A. and Fuller, W. A. (1991). Quantile estimation rvith a complex survey
design. The Annals of Statistics. 19,454 - 469.
Godambe, V. P. (1955). A uniûed theory of sarnpling fiom finite populations. Journal
of the Royal Statistical Society. B17,269 - 278.
Godambe, V. P. (1989). Estimation of cumulative distribution o f a survey population.
Technical Report STAT-89-17, University of Waterloo.
Gross, S. T. (1980). &ledian estimation in sample surveys. Proc Survey Res. Meth.
Sect., Am. Statist. Assoc. 181-184.
Hansen, M. H., Madow, W. G. and Tepping, B. J. (1983). An evaluation of model-
dependent and probability-smpfing inferences in sample surveys. Journal of the
Arnerican S tat ist icd Association. 78,776- 793.
Kish, L. (1965). Survey sampling. Wiley, New York.
Krieger, A. M. and Pfeffermann, P. (1992). &laximum Likelihood estimation fi-om
complex surveys. Unpublished Technical Report, Depart ment of S tatistics, Gniver-
sity of Pennsylvania.
Kuk, A. Y. C. (1988). Estimation o f distribution funetions and medians under sam-
p h g 1~4th unequd probabilities. Biometrika. 75 , 97 - 103.
Kuk; A. Y. C. and Mak, T. K. (1989). Median estimation in the presence o f auviliary
information. Journal of the Royal Statisticd Society. B51,261 - 269.
Loynes, R. M. (1966). Some aspects of the estimation of quantiles. Journal of the
Royal S tatistical Society. B28, 497-51 2.
McCarthy, P. G. (1965). Stratified sampfingand distribution-fiee conhdence intervals
for a medim. Journal of the American Statistical Association. 60, '7'72 - 783.
Olsson, J. and Rootzén, H. (1996). Quantile estimation &om repeated measurements.
Journal of the American Statistical Association. 91, 1560 - 1565.
Owen, A. B. (1988). Empirical LikeLihood ratio confidence intervals for a single func-
tiond. Biometrika. 75,237-249.
Owen, A. B. (1990). Empiricd likelihood confidence regions. The Annals of S tatis-
tics. l8,9O-lZO.
Press, W. H., Flannery, B. P., Teukolsky Çaul A. and Vetterling W. T. (1990). NLZ-
merical Recipes, The art of Scientific Comp uting (Fortran Version). Cambridge Uni-
versity Press,
Qin J. and Lawless J. (1994). Empincd likeLihood and general estimating equations.
The Annals of S tatistics. 22:300-325.
Royall, R. M. (1976). The Iinear least-squares prediction approach to trvo stage
sstrnphg. Journal of the American Statistical Association. 71, 657 - 670.
Rao, J. N. K. (l975). Unbiased variance estimation for multistage designs. Sankhya.
C3 7, 133-1 39.
Rao, J. N. K., Kovar, J. G. and Mantel, H. J. (1990). On estimating distribution
functions and quantiles fiom survey data using awüiliary idonnation. Biometrika,
77, 365 - 375.
Rao, J. N. K. and Liu, J. (1992). On estimating distribution functions fi-oln sanzple
survey data usingsupplementary information at the estimation stage. In Saleh, A. K.
M. E. (ed.) Nonparametric statistics and related topics. Elsevier Science Publishers;
Amsterdam. pp 399 - 407.
Rao, J. N. K. (1994). Estimating totals and distribution functions using auialiary
information at the estimation stage. Journal of Official Statistics. 10: 153 - 165.
Scott, A. and Smith, S. M. F. (1969). Estimation in multi-stage surveys. Journal of
the Arnerican Statistical Association. 64? 830 - 840.
Sedransk, J. and Meyer, J. (1978). Confidence intervals for the quantiles of a 6-
nite population: simple random stratified random sampling. Journal of the Royal
Statistical Society. B40;239 - 252.
Sedransk, J. and Smith, P. (1983). Lower bounds for confidence coefficients for con-
fidence intervals for f i t e population quantiles. Comrn. Statist. Theory Meth.
12,1329 - 1344.
Thompson, M. E. (1997). Theory of sample surveys. S t Edmundsbury Press, Bury
St Edmunds, Suffolk.
Zhang, B. (1997). Estirnating a distribution function in the presenxe of auxifiary
information- Metrika- 46,245 - 251.