and quantile in the presence of auxiliary information

University of Alberta

A Robust Estimator of Distribution Function

and Quantile in the Presence of Auxiliary Information

Hong Li O

A t hesis

submitted to the Faculty of Graduate Studies and Research

in partial fulfillment of

the requirernent for the degree of a Master of Science

in

Statistics.

Depart ment of mat hematical Sciences

Edmonton, Alberta, Canada

Spring 2000.

National Library I*I of Canada Bibliothèque nationale du Canada

Acquisitions and Acquisitions et Bibliographie Services services bibliographiques 395 Wellington Street 395, nie Wellington Ottawa O N K1A O N 4 Ottawa ON K1A ON4 Canada Canada

The author has granted a non- exclusive licence allowing the National L i b r q of Canada to reproduce, loan, distribute or sell copies of this thesis in microform, paper or electronic formats.

The author retâins ownership of the copyright in this thesis. Neither the thesis nor substantial extracts fkom it may be printed or otherwise reproduced without the author's permission.

L'auteur a accordé une Iicence non exclusive permettant à la Bibliotheque nationale du Canada de reproduire, prêter, distribuer ou vendre des copies de cette thèse sous la fome de microfiche/tilm, de reproduction sur papier ou sur format électronique.

L'auteur conserve la propriété du droit d'auteur qui protège cette thèse. Ni la thèse ni des extraits substantiels de celle-ci ne doivent être imprimés ou autrement reproduits sans son autorisation.

Canada

Abstract

Recently, estimation of distribution function and quantiles has received considerable

attention in s w e y sampling. This t hesis mainly considers estimation of distribut ion

function and quantiles under single-stage and two-stage sampling designs. In the

first part, for single-stage sampling process, estimation of distribution function and

quantiles is considered through the use of a~uiliary information. In the second part,

along the lines of first part, the estimation problem under twestage sampling plan is

considered. The idea of estimat ing population total under model-assist ed approach is

also extended to estimate population distribut ion function and the resulting est imator

is s h o w to perform better than the conventional estimator for moderate sample sizes.

Through a Monte Carlo simulation study, proposed estimators are cornpared Nith

conventional esthators.

Acknowledgement

Upon the completion of my thesis, I would like to take this opportunity to ac-

knowledge dl those people who have helped me.

First 1 would like to thank my supervisor, Dr. N. G. Narasimha Prasad for his

guidance, patience and helps throughout the preparation of this thesis. Special thanks

go to Dr. Rohana J. Karunamuni and Dr. Francis Yeh for serving as the cornmittee

members.

My gratitude extends to the Depart ment of Mathematical Sciences, University of

Alberta, for providing me the opportunity to study here, to Dr. Peter M. Hooper

for academic and financial support in the kst year of my study, and to the fac-

ulty members of the department who provided me with many interesthg courses in

S tat ist ics.

Contents

List of Tables

List of Figures

1 Introduction 1

1.1 Background .................................................................................. 1

1.2 Objective ...................................................................................... 4

2 RRview of literature 6

.................. 2.1 General estimation method of distribtuion hnction 6

....................................................... 2.1.1 Design-based approach 7

........................................................... 2.1.2 f rediction âpproach 8

2.1.3 Model-assisted approach .................................................... 9

.... 2.2 Robustness of model-assisted distribution function estimator 12

........................................ 2.3 -More distribution function estimators 14

...................................... 2.4 General estimation method of quantile 17

3 Distribution function and quantile estimator for one-stage sampling: non-parametric approach

3.1 Formulation of the generdestirnator ofdistribution fùnction ....

......................................................................... 3.2 Sizxrulation study

4 Distribution and quantile estimation in twestage sampling 47

4.1 Tm-O-s tage distribut ion funct ion est imator in convent ional t heory 47

4.2 General est imator of distribution funct ion in two-stage sarnpling ...................................................................................... 49

4.3 Alternative estimators ................................................................. 54

References

List of Tables

3.1 Simulation results mit h linear sirnulated population data:

relative biases (RB), mean square errors (MSE) and relative

root mean square errors (RRiMSE) of different distribution

fünction esthators (population size = 500, sample size = 30,

number of simulation = 1000.) ..................................................... 32

3.2 Simulation results with nonlinear sirnulated population data:

relative biases (RB), rnean square errors (NIISE) and relative

root mean square errors (EUXbISE) of different distribution

function estimators (population size = 500, sample size = 30,

number of simulation = 1000.) ..................................................... 33

............................................................. 3.3 Cornmon kernel funct ion. 40

3.4 Relative biases (RB) and relative root mean square errors

(RRMSE) of F,, at 0.25, 0.50 and 0.7Sth quantile of Y

under different kernels with linear simulated population ............. 41

3.5 Relative biases (RB) and relative root mean square errors

(RRMSE) of FP, at 0.25, 0.50 and 0.7Sth quantile of Y

under different kernels Mt h nonlinear simulated population.. ..... 42

3.6 Simulation results with sirnulated iinear population data:

relative biases (RB) and relative root mean square errors

(RRMSE) of different aLh quantile q(a) estimators (population

...... size = 500, sample size = 30, number of simulation = 1000.) 45

3.7 Simdation results Rrit h simulated nonlinear population data:

relative biases (RB) and relative root mean square errors

(RRMSE) of different alh quant ile q(a) estirnators (population

size = 500, sample size = 30, number of simulation = 1000.) .... 46

4. la Estimation of distribut ion funct ion under two-st age sampling

scheme (k = 400, py = 6.46968e-5) : relative biases (RB),

relative root mean square errors (RKMSE) and relative -

............................................. efficiency (RE.) with respect to Fu

4- 1 b Estimation of quant ile under two-stage sampling scheme

(k = 400, p, = 6.46968eA5) : relative biases (RB), relative

root mean square errors (m4SE) and relative efficiency - .. .................................................. (RE) with respect to q, ....

4.2a Estimation of distribution funct ion under two-stage sampling

scheme (k = 400, py = 0.505525): relative biases (RB),

relative root mean square errors (RR-MSE) and relative -

.............................................. efficiency (RE) with respect to Fu

4.2b Estimation of quantiles under two-stage sampling scheme

(k = 400, p, = 0.505525): relative biases (RB), relative

root mean square errors (RR-MSE) and relative efficiency - ............................................................ (RE) with respect to q,...

4.3a Estimation of diçtribut ion function under two-stage sampling

scheme (k = 400, py = 0.999688) : relative biases (RB),

relative root mean square errors (FUUMSE) and relative -

.......................................... efficiency (RE) with respect to Fu--.

4.3b Estimation of quantiles under two-stage sampling scheme

(k = 400, p, = 0.999688): relative biases (RB), relative

root mean square errors (RRMSE) and relative efficiency - (RE) with respect to q,. .............................................................

4.4 Estimation of distribution function under two-stage sarnpling

scheme (k = 245, py = 6.46968ë5): relative biases (RB),

relative root mean square errors (RmISE) and relative -

............................................... efficiency (RE) wit h respect to Fu 66

Estimation of distribut ion funct ion under tsvo-st age sunpling

scheme (k = 245, p, = 0.505525): relative biases (RB),

relative root mean square errors (RRMSE) and relative -

........................................... efficiency (RE) with respect to Fu..

Estimation of distribut ion funct ion under two-st age sarnpling

scheme (k = 245, p y = 0.999688): relative biases (RB),

relative root mean squa,re errors (FtEUMSE) and relative -

efficiencv (F?E) with respect to F,, ...............................................

List of Figures

Scatterplot of linear simulated population korn model

Y = -2X + X ~ S ........................ .. ..................................... 30

Scatterplot of non1inea.r simulated population f?om model

Y = 1 5 0 + 2 . 5 e - - Y + ~ f e .......................~~................................... 31

Plot of population distribution function Fp and mode1 assisted ..

......................... estimator Fm, with linear simulated population 34

Plot of population distribution function Fp and proposed

........................ estimator Fm, with linear sirnulated population 35

Plot of population distribution hinction Fp and pseudo ernpirical

likelihood estirnator Fpe *th lin- simulated population.. ........ 36

Plot of population distribution hinction F, and model assisted A

estimator Fm, wit h nonlinear simulated population.. ................. 37

Plot of population distribution function Fp and proposed

estimator F,, with nonlinear simulated population .................. 38

Plot of population distribution function F, and pseudo

ernpirical likelihood estirnator FP, nith linear sirnulated ........... 39

Chapter 1 Introduction

In modern society, the need for statistical information has increased considerately for

making policy decisions. In part icular, data are replasly collected to sat is@ the need

for information about specified set of elements. The collection of all such elements is

c d e d a finite population. But collecting the data for each element in a population

be too eupensive and/or time consurning; sometimes it is even irnpractical. So,

one of the most important rnethod of coliecting information for policy decision is

sample survey, that is, a partial investigation of the finite population.

A sample s w e y costs less than a cornpkte enurneration and may even be more

accurate than the complete enurneration due to the fact that personnel of higher

quality can be employed for the data collection. So, designing an efficient sampling

scheme at possible lowest cost and to develop some efficient methods to estimate

population parameters kom the sample data under given design have become major

issues in sample surveys.

In this thesis, 1 WU only concentrate on the estimation of the distribution function

and associate quantiles of a finite population.

1.1 Background

Consider a finite s w e y population U which consists of N distinct elements. Each

element is identified through label j (j = 1, ... , N). Let va,lues of the population

1

elements for variable Y be YI, Y2, ..-, YN7 then for any 8ven number t (- w < t < cm),

the population distribution function FN (t) is defined as the proportion of elements in

the population for which I; 5 t , for j E U , that is

FN ( t ) := N P 1 C A(t - k;),

where

1: when t 2 Yj; A(t - Yj) =

O, othemise.

Also, population ath quantile (O < cr < 1) for Y is defined as

Without loss of generality, we omit N in notation of F'(t) or qN(a).

Consider an auxiliary variable X, associated with Y, nith the known values XI,

Nom, if a single-stage sampling scheme is undertaken to select a sample s = { jl , j2 :

... , j,} of size n under some sampling design {S, p ( * ) } , where S denotes the set of

all possible samples of size n, p(s ) is the probability of selecting the sample S. s E

S, such that Cs, p ( s ) = 1. The values for Y in set s are observed as 2/j, ; 9j2 : . .. yjn .

Then, only mith this

given by Rao (1994)

information, a general class of design-based estimator of F ( t ) is

as:

And naturally, the crth quantile q(a) rnay be estimated as

So, in the following 1 d l focus on the estimation of distribution hnction. The

associate quantile estimator nill be defined as (1.5) accordingly. In (1.4) : di(s) is

the basic design weight for single-stage sampling which can depend on both s and i

(i E s) ( s e , Godambe, 1955), which also satisfies the design-unbiasedness condition:

C~,:i, , lp(s)di(s) = 1 for i = L, ..., N. To get an unbiased estimator of FI it is

also essential to assume that the inclusion probabilities .rri = xIs: i E 3 ) p ( s ) , for i = 1 _ . . . , N i are all positive (Rao, 1975). Under simple randorn sarnpbng scherne, F(t )

reduces to the naive ernpiricd dHstribut ion fimction est imat or

Fddl ( t ) = n-l p ( t - Y &

P l u g e g Fêdl for F (1.5) gives t h naive quantile estimator Qedf (a).

The above approach to estimate distribution hnction is called probability Sam-

pling approach which is free h e m mode1 assumpt ion. However, the above estima-

tors do not incorporate any supplementary information directly into the estimation

process, dt hough the use of amïliary information on population is k n o m to increase

the precision of an estirnator. Isn theory, information on one auxiliary variable may

be used at the survey design stage by defining appropriate inclusion probabilities ri.

But this is not possible when information on several awiliary variables is available.

So we rnay wish to introduce the auxiliary information at the survey estimation stage

also (Chambers and Dunstan, 1986). In recent years, several estirnators of distri-

bution function of a finite population have been proposed tvith the use oE m~eliary

population information (Rao, 1994).

1.2 Objective

For singbstage sarnpling scheme, the most notable distribut ion fimction estima-

tors are the model assisted estimator (Rao, 1994) and the pseudo-empirical mazuimiun

Likelihood estimator (Chen and Sitter, 1996). The former uses the complete infor-

mation of auxiliary variable but with the açsumption that the relationship between

Y and follow a linear model. Eventhough, the latter does not assume any rnodel,

the resulting estimator is optimal under a linear model. Also, the latter rnethod is

computational intensive. In this thesis, I will develop an estimator of the h i t e pop-

ulation distribution function and associate quantiles such that it uses the complete

information of and is exempt of model assumption. More precisely, this estimator

uses a nonpatametric approach.

For a twmstage sampling scheme, Roy& (1976) proposed the linear least-squares

prediction approach in the estimation of population total, with the cluster size as

supplementary information. In this thesis, The same approach will be used to get

an estimator of the finite population distribution function and associate quantiles.

Al l of the estirnators proposed in this thesis mdl be cornpared with conventional

est imators t hrough a Monto Car10 simulation st udy.

1.3 Thesis Overview

In chapter 2, some of the literature on the estimation of distribution function and

quantiles for a finite population under single-stage and tvvo-stage sampling scheme

are reviewed. In chapter 3, by exploring the advantages of different methods under

single-stage sampling plan, a robust distribution function estirnator is proposed. This

estimator can be used under single-stage sampling plan Rrit hout mode1 assurnption.

The proposed rnethod is examined through a limited simulation study. In chapter

4, t his t hesis presents some discussion of the estimation of distribution hnc t ion and

associate quantiles under two-stage sampling scheme. Also, the basic idea of esti-

rnating population total is extended to estimate population distribution funct ion. A

Monto Car10 sirnulation study is d s o presented to assess the efficiency of proposed

estimator.

Chapter 2

Review of Literature

Singlestage sarnpling is the most simple sarnpling scheme in sample surveys. Several

met hods are considered in the fiterature for estimating distribut ion function and cluan-

tiles, which can be grouped under three broad headings: (i) design-based approach;

(ii) model-dependent approach, or prediction approach; and (üi) model-assisted ap-

proach.

However, in practice, mult i-st age sampling is also kequently used, especidly in

surveys of hurnan populations. For =ample, Kish (1965) described a stratified three-

stage sample of dwellings, i.e., sampling counties at the first stage, sampling blocks

within selected counties at the second stage and sampling dwellings niithin chosen

blocks at the third stage. For this kind of siurveys, there is a wide variety of possible

designs in any stage, and there is a much wider variety of possible estimators of

interested parameters.

So, this chapter mainly reviews some of papers that are relevant to the estimation

of finite population distribut ion function and quant iles under single-st age sampling

design.

2.1 General Estimation Method of Distribution Function

Consider a single-stage sampling scheme deôned by {S, p ( . ) ) and a sample set

s E S with design weight di(s) for i E S. Also, we assume availability of ai~Gfiary

6

information, Say x, for ail of the units in the population. Let yj,, ..-, yj, be the

observed values for Y in set S . We consider belon7 dserent approaches to estirnate

the distribution Eunction (1.1) and crt" quantile (1.3).

2.1.1 Design-based Approach

Under the design-based approach, F ( t ) can be estimated by the ratio estirnator

F,(t), the digerence estirnator &(t) and the regession estimator F,, ( t ) (See Rao

and Liu (1992) and Rao (1994)).

In the case of a single a~xil iary X-variable, denote 9(xj) = Aa(t - R ~ ) Ç m j =

1 A ; where, R =

G = C i E , d i ( s ) g ( x i ) ,

sarnple s, we have

and

The above three estirnators have some good properties. For example? Pr(t) re-

duces to F ( t ) and variance becomes zero m-hen yj a xj for all j E U. This indicaies

that the ratio estirnator can gain large efEiciency when A(t - yi) and A(t - hi)

have a stronger bear relationship. However, in real world, the correlation between

7

A (t - yi) and A(t - kci) is generally weaker t han that between yi and xi, so here

est k a t or Fr (t) actually gains very lit tle efficiency over ~ ( t ) . As for est irnators Pd (t)

and F,,,(t), they suffer fiom the same drawback as the ratio estimator Fr(t) . Since

FT,(t) involves the cornputation of ccru(B, G) and v(G), it is cornputationally more

cumbersome than the ratio estimator F~ (t). Moreover, all of these three estimators

are not rnodel-unbiased under a linear model assurnption, although they are asymp-

totically design-unbiased (Rao, Kovar and Mantel, 1990 and Rao, 1994).

2.1.2 Prediction Approach

This approach, contrary to the desigx-based approach, assumes that the popula-

t ion Y values are random and the relationship between Y and the amuiliary variable

X follows a certain model. The most-common used superpopulation model is given

by7

where ,B is an unknown parameter, v(xj) is a strictly positive function of xj, and

E ~ ' S are independent and identically distributed (i.i.d) random variables Mth zero

means. Under this approach valid inference wil l be made with respect to the model

and irrespective of the sarnpling design p( s ) .

At hrst, note that FN ( t ) can be decomposed as fo110m-s,

where s is the same as before, S = { j; j = 1, 2, . . . , N and j $! s). T h e only unknom

in (2.5) is the second term. It is also assurned that the yj 's are independent and

the population mode1 holds for the sample, so there WU be no sample selection bias

(Krieger and Pfeffemann, 1992). Under t his assurnption, a general est imator (refer-

enced as CD estirnator) of distribution function, which was suggested by Chambers

and Dunstan (1986), is given by

where, b, = {Ci,-, y i ~ i / ~ 2 (xi)} { Cies X;/V~(X~)}-' y mhich is the best linear unbiased

estimator (BLUE) of 0, uni = {v(x~)}- ' (t - bnxi)- Rao and Liu (1992) noted t hat,

when v 2 ( x j ) = o2xj7 (2.6) reduces to

are approximately independent with mean zero and variance c2.

In general, the model-based estirnator of the distribution function d l not be the

same as those suggested by the conventional design-based approach as in (1.4). But

there is one special situation "when the sample is a stratified random sample and the

auxiliary information X is a qualitative variable, indicating the stratum in which a

population elernent occurs" under which F s ( t ) and FN(t) are consistent (Chambers

and Dunstan, 1986).

2.1.3 Model-assisted Approach

From earlier

assump tion hee,

discussion, nie see that although probability sampling approach is

the associated inferences have to refer to repeated sampling instead

9

of just the particular sample, s, that has been chosen- Prediction approach, on

the other hand, ignores the sampling design completely and the dehit ion of FG(t)

depends on (2.4) being the "correct" rnodel for the population. In particular, it

assumes that the heteroscedasticity hinction v (2) is correctly specified. However? in

practice, this is unlikely to be the case and in large sarnples, prediction inferences are

also very sensitive to rnodel misspecitications (Hansen, Madow and Tepping, 1983).

To overcome the shortcomings in the above two rnethods? the model-assisted ap-

proach is proposed as an approach providing valid inferences under an assumed model

and at the same t h e protecting against model misspecifications in the sense of pro-

viding valid design-based inferences irrespect ive of the population Y -values. That is7

we can only consider design-consistent estimators that are also model-unbiased (at

least asymptotically) under an assurned rnodel. So in the case of estimating a finite

population distribution hnction, -ive may assume that YI : Y2: --.: YN corne from the

superpopulation with model (2.4). Cases of this problern have b e n treated by Cham-

bers and Dunstan(l986), Kuk(1988), Godambe(l989), Rao, Kovar and Mante1(1990),

Chambers (1992), Rao and Liu(1992) and Rao(1994).

Let G denote the model cumulative density function (c.d.f.) of E j 7 then under the

superpopulat ion rnodel (2.4),

is a population estirnating function, each of those terms has expectation zero. Using

estimation hnction t heory, Godarnbe (1939) arrived at a model-assisted es timat or of

F ( t ) under the special case of di(s) = ql. Rao (1994) evtended this to a general

design weight di (s) , where the case v2 (x j ) = xju2 was considered.

Under the model (2.4), a predictor of A(t - yj) is given by

Shen a model assisted estimator of F ( t ) based on (2.9) can be gîven by the difference

est irnator:

which is model-unbiased (at least asyrnptotically) under the assumed model, but its

asymptotic design-bias is zero only for a subclass of sarnpling designs. However, this

subclass seerns to cover a wide variety of sarnpling design(Godambe, 1989).

For the special case of di(s) = TF', Rao, Kovar and Mantel (1990) also proposed

an alternative model-assisted estimator (it can be reference as RKM es timator) ,

alter noting that the ratio and the diffaence estimators, ~ , ( t ) and ~ ~ ( t ) , are not

model-unbiased for FN(t). This estimator is asymptotically model-unbiased and

design-unbiased under all designs. Rao and Liu (1992) evtended this estimator to

the general case of di (s) as follows:

where,

So7 pma(t) and &,(t) are difFerent only in the last term of the formula.

2.2 Robustness of Model-assisted Distribution Function

Estimator

From above section, we know that the rnodel-assisted estimator F,,(t) and Fd,(t)

have some good properties (Rao, Kovar and Mantel, 1990), but ~ h e t h e r or not they

are better than F(t) as in (1.4) depends on the degree of validity of the

model, this is reflected in the prediction of A (t - yj). So, some improvements

might arise with the use of a more general model 'and the local fitting method which

could be employed to increase the robustness of the estimator.

Chambers, Dorfman and Wehrly (1993) also explored a robust estimation ap-

proach via nonparametric regression method against model misspecification. For a

general model

where q(.) is some reasonably smooth function, $(.) is some knom hnction and e

is the random error with zero mean. This approach works by £kt getting a robust

(although may be inefficient) predictor of population total <P of q5 RTith nonparametric

regession, then using bias calibration method to get a more efficient predictor. That

is, one srnooths #(y) against x to obtain i j , (x) and then smooths @(y) -fjl(x) against

x to obtain l), (x). So, the b a l smoothing of Q(y) against x is defined as fji (x) + q2 (2). In the case of estimating the finite population distribut ion hnc t ion of Y, w-e

can take 4(Y) = N-'A(t - Y), then

Here, Fs(t) is obtained fkom (1.6) and q(x, t) = Pr{Y 5 t/x) n.hich is assumed

to be a smooth function of x for any t. Chambers, Dorfman and TVehrly (1993)

estimated it by

where the weights wf can be calculated using kernel

y 4 , (2.16)

smoothing method (Chambers,

Dorhan and Wehrly, 1993). Then the predictor of Fns(t) is:

And a robust calibrated estimator is given by

where uis are the nonparametric prediction m-eights defining ~ ~ ~ ( t ) , f i and ô- are

sample-based estimators of the conditional model evpectation E ( Y / X ) and standard

deviation (v.~R(Y/x))* under the working model

sample-based estimator of the distribut ion function

13

for the

G(*) of

population. G(-) is a

t h e st andardized error

-1

Y - E(Y'x) . ~f we put ui = Ti - 1

in (2.18) (where is the inclusion proba- ,mm% N - n bility), then the resulting estimator is basicdy the same as that sugested by Rao,

Kovar and Mantel (1990). They also suggested that, by choosing the w-eights ui to

reflect the actual distribution of the sample X values, the robustness of the predictor

can be improved, and this heavily depends on the choice of the bandwidth h,. Some

rules of choosing bandwidth are also given in that paper. However, this method may

loss some ac iency by obtaining the bias robustness.

2.3 More Distribution Function Estimators

In above section, we saw some distribution function estimators Nith the utility of

complete information of the auxi.liaq variable X. However, in some situations, we

may not have any am~iliary variables, but ~ 4 t h some auxiliary information about F or

its associate parameters available. This kind of situation has been discussed by Qin

and Lawless (1994) who used the profile empirical likelihood-based kernel rnethod to

estimate the finite population distribution function under the semiparameter mode1

assumption. That is, they estimated F as the maximum empirical likelihood estima-

tor (MELE) pmel, by mavimizing the profile empirical likelihood function L ( F )

(Zhang, 1997) which is defined as

And

- Zhang (1997) established the weak convergence of Fmde so that this approach not only

overcomes the disadvantage of the standard kernel method which can not exploit extra

information systernatically, but also makes the variance of estimator smaller to arrive

a t some degree of increasing of efficiency. In this thesis, this situation Nill not be

discussed furt her .

Another case is that we can not get the complete information about the a~~xil iary

vâriable x, but stiU with sorne s u r n r n q available information on X. U s u d y this

kind of information c m be represented as

where, T ( X ) = ( .r l (X) , ... , T , ( X ) ) ~ . Shen, how this information can be irsed to

irnprove the estimation'?

Chen and Qin (1993) proposed an empirical likelihood approach to be used in

simple random sampling mhen this kind of auxiliary information (2 -2 1) is available.

Their results suggested that this approach has desirable properties when population

distribution function and quantiles are estimated under this situation. But, the

formulation of their method did not extend to more complex survey designs. Chen

and Sitter (1996) generalized it to estimate the parameter which is some function of

the distribution function < = <(FN) under the situation mhen the sample s is dramn

using some sarnpling design {D , p(-) ) as follows.

For a finite population U, if the entire finite population is available, the empirical

likelihood, ori,hdly introduced by Owen (1988, 1990) for constructing confidence

regions in nonparametric settings, nrill be

wit h the corresponding log-likelihood hnct ion

where pj = P(Y = y,). Suppose that ps (O 5 pj 5 1) are unknonm and zjEu pj 2 1- J

Then, in the presence of auxiliary information (2.21), (2.23) could be mavimized

subject to

However, since we only have a sample set s of size n of the entire population available,

we can view (2.23) as a population total, then using the unif?ed theory of sampling

met ho dology to get a design unbiased estimator of Z(p), that is

where di (s)'s are the desiga meights which satisb ES(CiEs di(s) log p, ) = log(pi)

and Es stands for the expectation respecting to the sampling design. Chen and

S itt er t ermed (2.25) as pseudo-ernpirical likelihood. For auxiliary in format ion

of the form (2.21), the problem reduces to maximizing (2.25) subject to (2.24) with

N replaced by n. Using the Lagrange multiplier method, they gave the resulting

pseudo empirical mavimum likelihood estimator (PEMLE) of FN(t ) as

- di (s) 1 . for i E s, Pi = c.,~~(s) 1 + X*T(X~) ,

and X satisfies

2.4 General Estimation Method of Quantile

Compared t o the estimation of distribution function, more attention lias been

gïven to quantile estimation, for oreuample, McCarthy (l965), Loynes (l966), Sedransk

and Meyer (l978), Gross (1980), Sedransk and Smith (1983), Chambers and Dunstan

(1986), h o , Kovar and Mantel (1990), Olsson and Rootzen (1996). Many- papers

discuss the estimation of the median, for the survey variable of interest under simple

random sarnpling or stratified random samphg, e-g., Gross (1980), Sedransk and

Meyer (1978), Sedransk and Smith (1983), without makingexplicit use of the ailuiliary

variables in the construction of est imators.

To make use of the auxiliary information, Chambers and Dunstan (1986) consid-

ered briefly the estimation of a quantile by inverting a model-based estimator F of

the distribution function through formula (1.5). \men the population size IV and the

sample size n are large, the estimator suffers the possible bias problem resulting frorn

the mode1 misspecification in addition to intensive comput ation. This approach alço

assumes that the complete information about the a~xiliar-y variable is knom. But

in some applications, especially when the auxiliary information is e~tracted frorn the

statistical reports or other secondary sources: only a certain surnrn~ary measures or

a grouped frequency distribution of the a~uiliary variable X are available. In such

situations, Kuk and Mak (1989) proposed three estimators of median: ratio esti-

mator position estimator 6 f y p and stratification estimator &llrs under

the assumption that only the median hiAY of X is knom. They also showed that

the last two estimators are efficient when the relationship between Y and ,Y departs

from the linearity assumpt ion.

Rao, Kovar and Mante1 (1990) studied the ratio estimator and regession es-

timator for a general ath quantile, and showed that their estimators could lead to

considerable gains in efficiency over the custornary estimator @(a) mhen 5- is approx-

irnately proportional to A\ for j = 1, ..., N. Olsson and Rootzen (1996) proposed a

quantile estimator for a nonparametric components of variance situation and proved

its consistency and asppto t ic normality.

Chapter 3 Distribution Function and Quantile Est imat or for OneSt age Sampling: Non-parametric Approach

The estimators given in previous chapter assume the linear mode1 as described in

(2.4). In this chapter, a general model is considered to obtain estimators sirnilar to

the ones that are described in the previous chapter.

Formulation of the General

Function

Estimator of Dis tribution

Suppose that nothing is known about the relationship between the study variable

Y and the aadiary variable except some population information about X. That

is, only the foIloFving general model c m be assumed.

E(Y/X = x) = 7 (2) , Var (Y IX = x) = au2 (x) ,

where q(-) is unknom, v(.) is some positive function of x. In this case, nonparametric

method is fiequently used, and several methods have been proposed for estimating

0 (. ) , for euample, kernel, spline, and orthogonal series met hods. Fan (1992) developed

a design-adaptive nonparametric regession method, which was based on a weighted

local h e a r regession. This method works as follows.

Suppose that the second derivative of q(xo) exists. Using Taylor's =pansion in a

small neighborhood of a point gives

Shen estimating q(xo) is equivalent to estimating the intercept a of a local linear

regression problem. Now if t h e study variable Y and auxiliary variable .ri satisQ

(3.1), we can consider such a n-eighted local Linear regression problem: to h d a and

b by rninimizing

where K(-) is some kernel function and h, is the smooth bandwidth. The solution

of b wiU not be discussed in this thesis. Then, the weighted least square solution of

a is d e h e d to be the local linear regression smoother, i-e.,

with

The bandwidth h, can be shosen either subjectively by data analysts or objec-

tively by data. Fan (1992) showed that the local linear regression smoother have high

asymptotic efficiency (it can be 100% w-ïth a suitable choice of kernel and bnndm-ïdth)

among possible linear smoothers. It adapts to almost aU regression settuigs and does

not require any modifications even at the boundary. 1 explore the use of this method

to obtain an estimator for distribution function and quant ile.

The rnodel-assisted estimator Fm in (2.11) has some advantages, and its efficiency

depends largely on the accuracy of the predictor of 4 ( t - yj). So we c m improve the

distribution function estimator via improving the predictor of A(t - yj) . Frorn the

predictor expression (2.9), we see that it uses the assumption of linearity between Y

and X. But in most cases, we can only use the general mode1 (3.1) to describe the

relationship between Y and X, this gives a new predictor of A(t - yj) as

where

By using an efficient predictor $(xi) of I;. ( s e Fan, l992), g ( x j ) should be more

efficient than cj(xj) in (2.9). If we plug 9 ( x j ) into the formula (2. Il), we Riill get a

new estimator, denoted as ~,,(t), of the distribution Function as follom:

and expect that it has a better performance. In next section, we ndl present some

simulation results to compare the behavior of Fm,, Fm, and Fpe.

About the estimation of a* quantile of Y, we will estirnate it by inverting the

distribution function estirnator F( t ) to get q(a). Since F ( t ) is not necessarily non-

decreasing function of t (Olsson and Rootzén, 1996), to estimate the quantile q(a) ,

we have to mod* F(t) to F ( t ) such that F ( t ) is a nondecreasing function of t, i.e.,

we have

F ( t ) = sup{Fyy) : y 5 t} . (3.9)

It c m be showed that p ( t ) and p ( t ) have the same lirniting distribution Eunction and

the above modification would not affect the value of ath quantile q(a) (Olsson and

Rootzen, 1996).

In the simulation, we will compare some quantile estirnators which are obtained

korn the direct or indirect use of auxiliary information. They are denoted as qma(ct),

Qpm.(~), Ce(a), &(a) and &(a). The former three are obtained hom (1.5) when

~ ( t ) is replaced by &,(t), F,,(t) and &,(t) respectively; the later tn-O are the ratio

estirnator and the clifference estimator suggested in Rao, Kovar and Mantel (1990).

3.2 Simulation Study

In this section, a limited simulation is conducted to assess the estimation proce-

dures in earlier section.

Here, two sets of simulated population data are used. Suppose that there are N

distinct elements in each population under study. One population data set (Popula-

tion 1) is generated kom a linear model of the form yj = a + Pxj + ' U ( X ~ ) E ~ ~ Another

population data set (Population 2) cornes hem a nonpararnetric model of the form

yj = v ( x j ) + v ( x ~ ) E ~ for j = Il . . . , N. In both cases, xj's (xj > 0 ) and ,'s are 1

independent, x j - N ( p x , O$) , E j Y. N(0, 1) u ( x ~ ) = X; for j = l7 2, ... N.

22

No matter which set of population data are being uskg, the basic procedures are

the same- In each simulation, we use the population vdue X2 : ... XN of the

aiLui1iar-y variable X and the observed value of Y in a sample set s to estirnate the

distribution funct ion and associate quant iles of Y t hrough rnodel-=assis ted approach

under linear model assumpt ion (Pma(t), 4,, (a)) , or under general model assurnpt ion

(~,(t), Qma(a)), except when we calculate the pseudo empirical maximum likeli-

hood estimators, Fec(t) and @,,(a), only the population mean ZN and the observation

value of X in the sarnple are available as ai~eliary informat ion. We will t alce ArA,,

samples of size n Erom the simulated population. The procedures are given as foUows.

In each simulation, we wi11 use the simple random sarnpling without replacement

(SRSWOR) scheme to obtain a sample s of size n out of the population of N elements,

because this simple sampling scheme actually plays an important role in practice. So

we get the observation data of Y as {yi, i E s } , the corresponding auxiliary values

are {xi, i E s) and the design weights are di(s) = for each sample point i E S .

Distribution function - estimation

Method 1: Mode1 assisted estirnator Pm, proposed in Rao (1994), i.e., the esti-

rnator calculated hom (2.9) - (2.11). This estimator is based on the linear mode1 I

(2.4) assumption and v (xj) = x: with the utility of complete information of auxiliuy

variable x.

Step 1: Predict y when given x.

Under the above assumptions, ,8 is estimated by the ratio estimate b = Cies yi

C i E s xi ' which is a best linear unbiased estimator of ,O for any sample. Shen, when given xj,

yj is predicted as Qj = Pxj for each j E U .

Step 2: Calculation of the residual of yi for each i in s through formula (2.10).

Step 3: Predict A (t - y j ) for each element j in the population U through formula

(2.9).

Step 4: Get the value of the distribution function estirnator Fm, at t by using

(2.11)-

Method 2: Mode1 assisted estirnator F ~ , proposed here. This estirnator is based I

on the general model (3.1) and under the assurnption that u ( x ~ ) = x; for j E U with

the utility of complete information of auxiliary variable X.

Step 1: Predict y when given x.

Since the concrete form of the model is rinknonm, the aa~il iary information of X

and sample data of Y are used to fit an appropriate model. Here, the local linear

regession method (Fan, 1992) is used for this purpose. That is, yj is predicted

through (3.3) when given xj for each j E U.

Step 2: Cdculation of the residuals of yi for each i in s by using formula (3.7).

Step 3: Predict A (t - y j ) for each element j in the population U by using formula

(3.6).

In this process, an appropriate kernel function K(.) and an appropriate bandwidth

h,, are needed. Here, the uniform kernel function ( s e Table 3.3) is firstly used to

estimate the distribution function F,, and the performance of Fp, is compared

with other distribution function estimators. Later on, the effect of different kernel

hinctions on the estimator F, d also be checked. As for the choice of smooth

bandwidth &, the cross-validation technique is used. So we choose

where q-i(.) is the regession estimator (3.3) Nithout using the ith observation data

(xi, yi) of the sample S.

Step 4: Estimate the distribution function F( t ) through ((3.8).

1

Note that, in both of the above methods, it is assumed that v(xi) = x: <and the

complete information of X is available.

Met hod 3: Pseudo ernpirical mâ,uimurn likelihood estirnator F ~ , . This rnethod

assumes that only the population mean and the sampled data of are available as the

auxiliary information, but it does not need any mode1 assumption for the relationship

between Y and X.

S tep 1: Solve the equation (2.28). -

Let xN be the population mean of X', then T(xi) = xi - &. So c = 1 in ( Z X ) ,

and (2.28) becomes

where &(s) = $. To solve this nonlinear equation for A, the Newton-Raphson

rnethod (Press, Flannery, Teukokky and Vetterling, 1990) is used here. Pluggjng

one-term Taylor's expansion of (2.27) into the constraint conditions (2.24) gives an

initial guess of X as

Shen starting fkom this initial guess, an approximate solution of root X can be ob-

t ained quickly .

S tep 2: Using the formula (2.27) to calculate the optimal value Fi for i E S.

Step 3: Using the formula (2.26) to calculate the value of distribution function

estimator Fpe at given t.

Also note that, since SRSWOR sampling scheme is being used, F,, is actucally

coincide with Fm,[, here.

The ath quantile - estimation

When estirnating the quantiles of Y, the assumption is the same as that in the

preceding section. If the a d i a r y information of has already b e n used in the

estimation of the distribution function F ( t ) as above, the distribution function esti-

mator d l be directly inverted to estima te the corresponding quantiles of Y. That is,

the formula (3.9) and (1.5) will be used to estimate the quantile q(cr). The resulting

estimators are denoted as &,(a), @,,(a) and &,(a) corresponding to Fm,, FFa and

Fpe respect ively.

However, if the quantile need to be estimated directly fkom the sampled data of

Y with the complete a~xi1izu-y information in hand, then it can be estimated by the

ratio estimator 4, and the difference estimator & proposed by Rao, Kovar and Mantel

(1990) as:

mhere &, QY are the sample ath quantile of X and Y respectively for a given sample

s; q, is the population cuth quantile of X . In this simulation, all these five kinds of

quantile estimator of Y will be compared.

Here the relative biases (RB) and relative root mean square errors (RRMSE)

generated in this simulation experiment are used as rneasurement to compare the

performance of dieerent estimators B(t) , which may be eit her the dis tribut ion hnction

estimator F(t) or the quantile estimator @(a). But for the distribution fimction

estimator, its values only fall in the range between O and 1, so the mean square errors

(MSE) are also appropriate to measure the efficiency in this case and nrill be examined

too. For each sarnple s E S , let ar ( t ) denote the estimator 8( t ) obtained horn the rth

simulation sample, and 0 ( t) denote the population value of the interested parameter,

then

1v8imu ( B r ( t ) - M S E ( ~ ) := C 7

r=l N i m u

and

Note: Here, we are only interested in the distribution function values at the

three quartiles (0.25, 0.50 and 0 . 7 ~ ~ ~ quantile), so RB and RRMSE are appropriate

rneasures for estimators of distribution function.

27

A singlestage sarnphg scheme is usually used in the sampling kom a s m d pop-

ulation, and in most cases, only a very small sample set are observed. So7 in this

simulation, a small simulated population of size N = 500 are generated and small

samples of size n = 30 are taken out of each population. However, the methods

discussed here also apply to population and samples of any other sizes. After 1: 000

t imes simulation runs, the simulation results become stable, so t his section only shows

the simulation results of N&, = 1,000.

1

The linear simulated population data corne fkom model = -2Xj + 5 ? c j , and 1

the nonhear simulated population data corne hom model Yj = 150 + 2.~e- -~ - ' + x,' f j :

i.i.d i.i.d where, E~ - N(0 , l ) and X j - N(10.2,2.0~) for j = 1 : -.., 500. The population

data of these two data sets are s h o m in Figure 3.1 and 3.2.

Firstly, uniform kernel hnction is used to estimate F ( t ) through ((3.8) and the

performance of different estimators are compared. Then the effects of different kernels

in the process of local regession on the estirnator (3.8) of the distribution function

are also e~amined.

Cornparison arnong different estirnators of distribution function

The results of the estimation of distribution function F( t ) at three different points,

t = 0.25, 0.50 and 0 . 7 5 ~ ~ quantile of Y, are shom in Table 3.1 - 3.2. IWhere Fp

stands for the population cumulative distribution function ; Fma, F', and FP, stand

for the estirnator obtained from method 1, method 2 and method 3, respectively. RB,

M S E , RRMSE stand for the relative biases, mean square errors and relative root of

mean square errors of the corresponding distribut ion funct ion est imat or respect ively.

Fiorne 3 . 3 ~ - 3 . 4 ~ give the plots for the population distribution function and dZFerent

distribution estimators.

The results indicate that among the three kinds of estimator, if the underlying

relationship between the study variable Y and the audiary variable X is Linear (see

Fim-e XI), for the estimation of the distribution hinction F ( t ) , in terms of biases.

as measured by RB, and in terms of efficiency, as rneasured by MSE and RRMSE.

Fm, has the srnallest biases and is the most efficient estimator. The pseudo empirical

rnâ.uimum likelihood estimator Fp, gives the worst performance among these three

estirnators except that in the median of the population.

In the case of having a nonlinear simulated population data (see Figure 3.2), in

terms of biases, as measured by RB, and in terms of efficiency, as rneasured by M S E

and RRMSE, Fpma and FP, have comparable performances, but both of them are

less biased and more efficient than Fm,.

This indicates that when it is k n o m that there exists a strong linear relationship

between Y and X in the underlying population, Ga can be used to estimate the

distribution function comfortably; but when no indication of strong linear relationship A -

appears, Fpma and F, may give a better estimation. In most cases, Fm, is more

stable since it makes use of the cornplete information of the au,diary variable, but.

FPe has the advantage that it can still be used when only the surnrnary information

of the a u x i l i q variable is available and it is still bias robust.

Figure 3.1: Scatterplot of Linear Simulated Population fiom Mode1

Figure 3.2: Scatterplot of Nonlinear Simdated Population from Mode1

Y = 150 + 2.5e-" + X ~ E

Table 3.1: Simulation Results wit h Linear Simulated Population Data:

Relative Biases (RB), Mean Square Errors (MSE) and Relative Root

Mean Square Errors (RRMSE) of Dif5erent Distribution Function Es-

tirnators (Population Size = 500, Sample Size = 30, Number of Sim-

ulation = 1000.)

Table 3.2: Simulation Results with Nonlinear Simulated Population Data:

Relative Biases (RB), Mean Square Errors (MSE) and Relative Root

Mean Square Errors (RRMSE) of Different Distribution Function Es-

tirnators (Population Size = 500, Sample Size = 30, Number of Sim-

ulation = 1000.)

Figure 3.3a: Plot of Population Distribution Function Fi and Mode1 As

sisted Estirnator Fm, wit h Linear Simulated Population

Figure 3.3b: Plot of Population Distribution Function F, and Proposed

Estirnator F,, wit h Linear Simulated Population

Figure 3.3~: Plot of Population Distribution Function F, and Pseudo

Ernpirical Likelihood Estirnator &;,, with Linear Sirnulated Population

Figure 3.4a: Plot of Population Distribution F'unction Fp and Mode1 A s

sisted Estimator F,, wit h Nonlinear Simulated Population

Figure 3.4b: Plot of Population Distribution Function Fp and

Estimator Fv, wit h Nonlinear Simulated Population

Proposed

Figure 3 . 4 ~ : Plot of Population Distribution Fhnction F' and Pseudo Em-

pirical Likelihood Estirnator Fpe with Nonlinear Simulated Population

The effect of kernel on the model-assisted estirnator

When the proposed model-assisted estirnator Fp, is used to estirnate the popu-

lation distribution function and associate quantiles, an appropriate kernel fimction is

needed. And the associate quantile esimator cj,, is just the inverse of the distribu-

tion function estimator F,,. So in this section, only effects of kernel hnctions on

the proposed distribution hnction estimator are being exarnined by comparing the

following seven common kernelç:

Table 3.3: Common Kernel Function

Name

Uniform

Triangle

Epanechnikov

Quartic

Triweight

Cosinus

Gaussian

The estimation results are given in Table 3.4 .- 3.5.

Table 3.4: Relative Biases (RB) and Relative Root Mean Square Errors

(RRMSE) of Fp, at 0.25,0 .50 and 0.75th quantile of Y under Different

Kernels with Linear Simulated Population

Kernel Type

Uniform

Triangle

Epanechnikov

Quart ic

Cosinus

Gaussian

Table 3.5: Relative Biases (RB) and Relative Root Mean Square Errors

(RRMSE) of Fm, at 0.25, 0.50 and 0 . 7 5 ~ quantile of Y under Different

Kernels wit h Nonlinear Simulated Population

Kernel Type

Uni forrn

Triangle

. .. .. .

Epanechnikov

Quart ic

Triweight

Cosinus

Gaussian

RRMSE (Fm&) at

t = cyth quantile)

The above results demonstrate that no rnatter which kind of mode1 the underlying

population data tmly comply to, in terms of biases and efficiency, as rneasured by

RB, MSE and RRMSE, the estimator 02 distribution function obtained fkom (3.8)

has a better performance when the Unifarm kernel function is used. Especially,

d e n the underlying relationship between Y and A' is nonlinear, the use of uni£orm

kernel function produces an estimator with the smallest bias and the highese efficiency

compared to ot her kernel funct ions.

However, when any one of the kernel function 1 - 7 is used in the process of

estimating distribution function, the estirnator Fm, always has better performance

than p,, when the underlying distribution hnction is nonlinear. So in the follonlng

the Uniform kernel function is still used t o obtain Qma.

Cornparison among different quant ile est imators

- - In this simulation, five kinds of ath quantile estirnator: Qma, Qpa7 Ge7 qT, qd are

cornpared. The first t hree stand for the inversion of Fm,, F ~ , and Ppe ; the latter two

are the ratio estirnator, the difference est iniator with the a~viliary information used,

respectively. The results are given in Table 3.6 3.7.

For the linear simulated population clata, in terms of biases and efficiency, as

rneasured by RB and RRiWSE, estirnator tma has the best performance arnong these

five estimators, but at the 0.75th quantile o f Y, the bias of ka is larger than that of

ratio estimator 4,. Q, has better performance than QFa evcept in the case of the

0.251~~ quantile. Generally, the first three estimators have smaller biases, as measured

by RB, and more efficiency, as rneasuredRR&f SE than the last two estirnators except

at the 0.75th quantile, & is better than QPe and Qma. So the &st three estimators

are appropriate to be used. In particular, @,, is encouraged to be used when strong

linear relat ionship cxist s in the underlying population.

For the nonlinear simulated population data, in terms of biases and efficiency, as

measured by RB and RRMSE, estimator qp, is less biased and more efficient than

Q,,, and even les biased than Q, in the median ( 0 . 5 ~ ~ quantile) of Y. However, 9,

has the highest efEciency at all of these three quantiles, and in the O Z t h and 0.75"h

quantiles, $, also has the smallest bias. In the 0.Sth and 0.7Sth quantiles, qr has a

higher efficiency thao &,. So in the case of nonlinear population as an interested

population, op, and Qm, are suggested to be used when compared to @,,.

Table 3.6: Simulation Results wit h Simulated Linear Population Data:

Relative Biases (RB) and Relative Root Mean Square Errors (RRMSE)

of DifEerent a* Quantile q ( a ) Estimators (Population Size = 500,

Sample Size = 30, Simulation Times = 1000.)

RRMSE (&) RRMSE(&,) RRM SE ( & ) RRM S E (4,) RRh.ISE(Qd)

Table 3.7: Simulation Results with S i d a t e d Nonlinear Population Data:

Relative Biases (RB) and Relative Root Mean Square Errors (RRILISE)

of DifKerent at" Quantile q(a) Estimators (Population Size = 500,

SampIe Size = 30, Simulation Times = 1000.)

RRhf S E (dm,) RRMSE(Qpa) RRI1fSE(&) RRM S E ( g r ) RRMSE ( Q d )

0.058523 0.050050 0.004770 O. 060386 0.051528

Chapter 4 Distribut ion Function and Quant ile Estimation in Twestage Sampling

In this chapter, we consider the estimation of distribution fimction and a'" quantile

of a finite population under two-stage sarnpling scheme. In pract ice, under the most

simple SRSWOR/SRSWOR sampling plan, there are already several conventional

est iniators of distribution function developed under design-based inference met hods.

However, the estimation under a superpopulation mode1 is considered in this chap

ter. Section 4.1 gives a brief introduction to the conventional estimators. In section

4.2, dong the same lines of chapter 3 and extending the results for estimating pop-

ulation total t o population distribution function, a distribution function estimator

is proposed. Some of alternative estimators are &en in section 4.3. Finaily, in

section 4.4, the proposed estimator is compared with conventional est imators under

SRSWOR/SRSWOR samphg plan t hrough simulation st udy.

4.1 Two-stage Distribution Function Estimator in

Conventional Theory

Consider a finite population which contains K elements and are cu?-anged in N

clusters, the i t h cluster is h o - to contain Mi elernents, so that CL, .VI. = K.

Then the population can be represented as U = { X j 7 i = 1, . . . , N; j = 1 : . . .: Mi}.

First a sample s of size n units is taken out of N clusters at the hrst stage and

47

then a sample si of size mi for i E s is chosen at the second stage. According to

the conventional sampling theory, there are several types of estimator for twestage

cluster sampling. For the most cornmon and simple design: simple random sampling

without replacement (SRSWOR) in both stages, the usual estirnators are (see Royal,

1976):

( i ) the expansion estimator

( ii ) the 'iinbiased" estimator

( iii ) the "ratietype" estimator

and

where,

Remark: The estimator Fu is unbiased only if && happens to equal &$ (Royal,

1976). Where as, in general, p. and Fr are biased estimator.

In the next section, estimation of distribution hnction under a superpopulat ion

rnodel be consdered.

4.2 Proposed Method of Estimation: Distribution Function

in Twu-stage Sampling

Assume the population U is generated from a tw-est age superpopulat ion rnodel:

By denoting E,(-) for expectation operators under the above rnodel, it is assiuned

that ( s e Scott and Smith, 1969)

are identically distributed random variables nrith h i t e population dis-

tribution funct ion Fy (t) , which is defined as

In current case, the population has nothing to do with the time of sampling. It is

easy to see that it satisfies the assumption A in Olsson and Rootzènys paper (1996),

( i ) The finite population distribution hinction Fy(t) of Y is (piecewise) contin-

uous.

( ii ) The N clusters are independent, so the vectors (xi, .-., K M = ) , i = 1: ...: N ,

are independent.

( iii ) For each i, the correlation coefficient between A(t - Yi j ) and A(t - xjr)

depends only on t, so we can denote it as ph@).

NON suppose that it is needed to estirnate the distribution firnction FI-(t) of Y

and the associate cuth quantile. A two-stage sampling scheme is conducted to get the

observation data. That is, at the first stage, n out of N clusters are chosen t hrough

some sarnpling scheme to get a primary sample s; a t the second stage, mi out of Adi

elements of each selected cluster i E s are chosen under designed sampling scherne to

N get a subsample si. Since K = xi=, Mi, then the s~ame decomposition method as

that in chapter 2 can be used to decompose the finite population distribution hnction

FY ( t) as follows:

Notice that only the fkst terrn TI in (4.11) is completely knom after the sarnpling,

so the other two terms, TI[ and TI[[, have to be predicted fkom the observation data

in order to obtain the estirnator of the h i t e population distribution hnction Fy ( t ) .

VVhen all the population cluster sizes, Mi (i = 1, .. . , N), are knom, estimating FI. ( t )

is equivalent to estirnating Ta(t) - the population total of A (t - Y ) , which is defined

as

Hence the prediction method (Royall, 1976) can be used to get an unbiased estimator

of TA. Firstly, observe that

E[A(t - Y,) - FY(t)] = 0,

and

where,

which is the common correlation coefficient between A(t - xj) and A(t - xjr) within

cluster i, and

where

Hence, the resulting estimator for Fu ( t ) , mhich is termed as predictive estimator.

is given by:

where,

with the weight given by

Along the lines of Royal1 (1976), it can be s h o w that the above estirnator F' ( t ) is

the best unbiased estimator for the finite population parameter Fu(t) mhen and

pA are k n m , and the mean square enor of ps ( t ) is

In practice, values of 0% and p, are usually unknonn. Hom-ever, the est imat e for 0%

and Pa cm be used to obtain an approximately optimal estimator of the interested

A simple estimator of p, ( t ) , given in Ohson and Rootien(l996), is as follows.

Denote .m = #{i : mi # 1 and i E s ) and set

2 1 a&)=- m C C j,si {a(t - 'mi Y,) - F ( t ) }2

i E s and mi+l

1 C C15j+jl<m {a(' - Y,) - '(')){A(' - Y,/) - F ( t ) ) mi(mf - 1)

i E s and mi#l

4.3 Alternative Estimators

This section gives some discussion on alternative est imators of the finite distribu-

t ion hnct ion in two-stage sarnpling. The corresponding quant ile estimators can be

obtained kom formula (1 -5 ) wit h different distnbut ion hinction estirnator plugged in.

At hrst, for the easy use of the predictive estirnator (4.19), the simple estimator

E( t ) of the distribution huiction is used in the process of estimating o i ( t ) and p A ( t ) .

I n practice, for simplicity, E(t) can be replaced by Fl(t) ,

when it is reasonable to ignore the dependence wïthin each cluster. When al1 the

sample size mis are equal, they will result in the same estimator (Royall, 19%).

Sorne additional estirnators can be obtained as special cases of the predictive

this case the optimal estimator (4.19) becomes

This is actuaIly the conventional eupmsion estirnator Fe.

optimal estimator can be represented as

This also suggests t hat the conventional estimators Fe as in (4.1) and Fr as in (4.3)

be repIaced in the third part of the predictive estimator, that gives

and

Another form with FP used in the third part of the predictive estimator can be given

as:

And, actually, kg, and F~~ are just the predictive estimator representstion of Fr and

The diffaence between (4.27) and (4.30) only exists in the second term which

predicts CiEs xj,, a(t - Kj)-

In the next section, results from a Monto Carlo simulation ndl be presented to

compare estimators Fe as in (4.1), Fr as in (4.3) and Fu as in (4.2) with F, as in

(4.19) and Fg, as in (4.28) under SRSWOR/SRSWOR sunpling scherne.

4.4 Simulation Study

In this section, the simulation study For sorne special cases of two-stage sarnpling

process is presented.

Since two-stage sampling scheme is usually used in moderate or large populations.

Here, three large simulated populations are generated using two-stage sampling pro-

cedure. In each population, it is assumed that there are 1V = 200 clusters and the

elements have a common correlation coefficient, p(xj7 Kjl) ni t hin each cluster, for

j # j' and i = 1, ..., N. The cluster sizes in each population considered in this study

are:

80, for i = 1, -.., 50;

= 100, [or i = 51, ...? 100;

120, for i = 101, ..., 200.

Finally we generate the data using the followhg procedure.

S tep 1: Generate cluster rneans pl, p2, . . . , pN of each cluster such that pi - N ( p 7 O:)

independently, where p = 100.0, and

0.81, Population 1,

0.91, Population 2,

50.91, Population 3.

S tep 2: Within each cluster i, sii , . . .: Einc are generated such that s, - N(0, O:),

for j = 1: ..., fili, where

100.7, Population 1,

a, = 0.9, Population 2,

0.9: Population 3.

Step 3: Generate the population data of size K = 21: 000 {Y,: i = 1' ...: iV;

K j = pi +se, for i = l7 --., N; j = 1, ---, ALi.

Then, note that in each population, the correlation coefficient within cluster i is:

~y = p ( K j , Kj>) = , for 1 5 j # jf 5 Mi and i = 1: ..., N. fl; + oz

In this simulation study, we consider three kinds of population in which every

cluster h a a correlation coefficient satiskng: i) p, = O; ii) O < p~ < 1; and iii)

pA rr 1 by assigning py satisfying i), ii) and iii) too. The three simulated popula-

tions generated in section 4.4.1 have the correlation coeficients: py = 6.46968e-5;

PY = 0.505525 and py = 0.999688 respectively. But in the process of estimating dis-

tribution function, and p, are aççumed to be unknown. They can be eçtimated

through (4.22) and (4.23). The estimates Sa and 8, are plugged in the predictive

estimator to obtain corresponding distribution h c t ion est imate p' ( t ) . We consider

two cases: case ( i ) ni th moderate sample size within cluster:

15, for i = 1: ..., 5;

% = { 2 for i = 67 . . , 15 ; for i E s,

25, for i = 16? ..., 20;

so that k = 400; and case ( ii ) +th small sample size nithin cluster:

IOl for z = l7 ...l 5;

mi = 12: for i = 6, ..., 15; for i f sl

15, fo r i=16 , ..., 20;

so that k = 245.

Ln both cases small sample is selected using twestage sampling plan by first

selecting 20 clusters hom 200 clusters using SRSWOR. Wit hin i t h selected cluster,

mi units; as d e h e d in (4.36) for case 1 and in (4.37) for case 2, are selected using

SRSWOR.

After 10,000 simulation runs under above mentioned three populations, simulation

results become stable, then the relative biases (RB), relative root mean square errors

(RRMSE) and relative efficiency (RE) +th respect to Fu (or qu) are computed for

the proposed estimators using N,,, = 10,000. RB and RRMSE are computed

using (3.14) and (3.16) respectively; while R E is computed as follows.

Let 8, denote the unbiased estimator of 8, 8 denote any estimator of 8, then we

RE(&) = relative efficiency of 8 with respect to 8, := RRMSE (8,)

(4.38) RRA.ISE(~) -

The above rnentioned measures are computed based on 10:000 runs and results

for distribution functions are tabulated in Table 4.la, 4.2a7 4.3a for population 1, 2

and 3, respectively; for quantiles in Table 4. lb, 4.2b and 4.3b7 respectively. Turning

to case 2, only estimation of distribution function has been considered and results are

tabulated in Table 4-4 - 4.6 for population 1 to 3 respectively. The minimum value

of RB and RRMSE, and maximum value of RE will appear in bold font.

The simulation results in case 1 indicate that. although the estimates ô: and

PA rather than the population values for them are used in the estimation of distri-

bution Eunction, the predictive estirnator Pg still has better performance t han the

conventional ones, especidy (when p, < ph),

i) when p, is close to zero, hence pA is also close to zero, Fg attains its optimal

value at Go which is also the conventional evpansion estimator Fe, and the alternative

predictive estimator pgg, also performs bet ter than the convent ional est imator Fu, Fr.

ii) when O < p, < 1, Fg is still better than FE, Fu and Fr in terms of biases, as

measured by RB; efficiency, as rneasured by RRM SE; and relative efficiency Fvith

respect to Fu, as rneasured by RB.

iii) when pA is close to the unity, the results show t hat Fs does at tain the optimum

For the estimation of quantiles, basically, c& has almost all of the good character-

istics of Fg, evcept when pA is close to the unity, for the estimation of lem (upper)

tail quantiles, the bias, as rneasured by RB, of Qg is bigger than that of qe.

The results of distribution fùnction estimation in case 2 (Table 4.4 - 4.6) show

that F, still has a better performance than the conventional estimators of distribu-

tion hnction under sampling scheme SRSWOR/SRSWOR when O < py < 1, but

the relative efficiency (RE) decreases. So, based on the simulation results, the pre-

dictive estimators Pg, qg are suggested to be used under two stage sampling plan

SRSWOR/SRSWOR, especially when a moderate sCmple data set is available.

Table 4.la: Estimation of Distribution Fundion under Twestage Sam-

pling Scheme (k = 400, py = 6.46968ed5): Relative Biases (RB), Rel-

ative Root Mean Square Errors (RRMSE) and Relative Efficiency

(RE) with respect to Fu

Table 4.lb: Estimation of Quantile under Two-stage Sampling Scheme

(k = 400, p, = 6.46968e-5): Relative Biases (RB), Relative Root Mean

Square Errors (RRMSE) and Relative Efficiency (RE) with respect

R Rk1 S E (qe) RRMS E (qu) RRMSE(q,) RRMSE($) RRMSE(&,) RRMSE(G1)

Table 4.2a: Est irnation of Distribution Funct ion under Two-stage Sam-

pling Scheme (k = 400, py = 0.505525): Relative Biases (RB), Relative

Root Mean Square Errors (RRMSE) and Relative Efficiency (RE)

with respect to Fu

Table 4.2b: Estimation of Quantile under Two-stage Sampling Scheme

(k = 400, py = 0.505525 ): Relative Biases (RB), Relative Root Mean


R RfV1 SE (q , ) RRMSE (qJ RR&ISE(Q~) RRMSE(ljg) R RM SE (Gg,) RRMSE (QgI)

Table 4.3a: Estimation of Distribution Function under Two-stage Sam-

pling Scheme (k = 400, p, = 0.999688): Relative Biases (RB), Relative


with respect to Fu

Table 4.3b: Estimation of Quantile under Two-stage Sampling Scheme

(k = 400, p, = 0.999688): Relative Biases (RB), Relative Root Mean


RRMSE (a) R R M S E (&) RRMSE (qr ) RRMSE (4,) RRMSE ((î,,) RRMSE (bI)

Table 4.4: Estimation of Distribution Fundion under Two-st age Sarnpling

Scheme (k = 245, p, = 6.46968e-5): Relative Biases (RB), Relative


with respect to pu

Table 4.5: Estimation of Distribution Function under Twestage Sampling

Scheme (k = 245, p , = 0.505525): Relative Biases (RB), Relative Root

Mean Square Errors (RRMSE) and Relative Efficiences

respect to Fu

(RE) with

Table 4.6: Estimation of Distribution Function under Two-stage Sampluig

Scheme (k = 245, p , = 0.999688): Relative Biases (RB), Relative Root

Mean Square Errors (RRMSE)

respect to Fu

and Relative Efficiency (RE) with

Chambers, R. L. and Dunstan, R. (1986). Estimating distribution functions from

survey data. Biometrika. 73,497 - 604-

Chambers, R. L., D o r h m , A. H., and Wehrly, T. E. (1993). Bias robust estima-

tion in h i t e popdations using nonparametrie calibration. Journal of the American

Statistical Association. 88, 268 - 277.

Chen, J. and Qin, J. (1993). Empirical likelihood estimation for finite populations and

the effective usage of au,.a'liaz-y in au,~liaryinformation. Biometrika. 80, 107 - 116.

Chen, J. and Sitter, R. R. (1996). A pseudo empi~cal likelihood approach to the

effective use of a ~ , ~ l i q information in complex surveys. Simon Fraser University,

Burnaby.

Chen, S. X. (1997). Empirical Likefiood- based kernel density estimation. Aus tralian

Journal of Statistics. 39, 47 - 56.

Cheng, K .F., Lin, P. E. (1981). Nonparametric estimation of a regression function:

limiting distribution. Australian Journal of Statistics. 23, 186 - 195.

Cochran, W. G. (1977). SampLing teclmïques (Third Edition). New York: John

Wiley & Sons.

Dielrnan, D., Lowry, C., and Pfaffenberger, R. (1994). A comparison of qtritntile

estimators. Communications in S tatistics. Part B-Sirnulat ion and Computat ion.

23,355 - 371.

Fan, J. Q. (1992). Design-adaptive nonparametric regression. Journal of the Ameri-

c m Statistical Association. 87, 998 - 1004.

Fransisco, C. A. and Fuller, W. A. (1991). Quantile estimation rvith a complex survey

design. The Annals of Statistics. 19,454 - 469.

Godambe, V. P. (1955). A uniûed theory of sarnpling fiom finite populations. Journal

of the Royal Statistical Society. B17,269 - 278.

Godambe, V. P. (1989). Estimation of cumulative distribution o f a survey population.

Technical Report STAT-89-17, University of Waterloo.

Gross, S. T. (1980). &ledian estimation in sample surveys. Proc Survey Res. Meth.

Sect., Am. Statist. Assoc. 181-184.

Hansen, M. H., Madow, W. G. and Tepping, B. J. (1983). An evaluation of model-

dependent and probability-smpfing inferences in sample surveys. Journal of the

Arnerican S tat ist icd Association. 78,776- 793.

Kish, L. (1965). Survey sampling. Wiley, New York.

Krieger, A. M. and Pfeffermann, P. (1992). &laximum Likelihood estimation fi-om

complex surveys. Unpublished Technical Report, Depart ment of S tatistics, Gniver-

sity of Pennsylvania.

Kuk, A. Y. C. (1988). Estimation o f distribution funetions and medians under sam-

p h g 1~4th unequd probabilities. Biometrika. 75 , 97 - 103.

Kuk; A. Y. C. and Mak, T. K. (1989). Median estimation in the presence o f auviliary

information. Journal of the Royal Statisticd Society. B51,261 - 269.

Loynes, R. M. (1966). Some aspects of the estimation of quantiles. Journal of the

Royal S tatistical Society. B28, 497-51 2.

McCarthy, P. G. (1965). Stratified sampfingand distribution-fiee conhdence intervals

for a medim. Journal of the American Statistical Association. 60, '7'72 - 783.

Olsson, J. and Rootzén, H. (1996). Quantile estimation &om repeated measurements.

Journal of the American Statistical Association. 91, 1560 - 1565.

Owen, A. B. (1988). Empirical LikeLihood ratio confidence intervals for a single func-

tiond. Biometrika. 75,237-249.

Owen, A. B. (1990). Empiricd likelihood confidence regions. The Annals of S tatis-

tics. l8,9O-lZO.

Press, W. H., Flannery, B. P., Teukolsky Çaul A. and Vetterling W. T. (1990). NLZ-

merical Recipes, The art of Scientific Comp uting (Fortran Version). Cambridge Uni-

versity Press,

Qin J. and Lawless J. (1994). Empincd likeLihood and general estimating equations.

The Annals of S tatistics. 22:300-325.

Royall, R. M. (1976). The Iinear least-squares prediction approach to trvo stage

sstrnphg. Journal of the American Statistical Association. 71, 657 - 670.

Rao, J. N. K. (l975). Unbiased variance estimation for multistage designs. Sankhya.

C3 7, 133-1 39.

Rao, J. N. K., Kovar, J. G. and Mantel, H. J. (1990). On estimating distribution

functions and quantiles fiom survey data using awüiliary idonnation. Biometrika,

77, 365 - 375.

Rao, J. N. K. and Liu, J. (1992). On estimating distribution functions fi-oln sanzple

survey data usingsupplementary information at the estimation stage. In Saleh, A. K.

M. E. (ed.) Nonparametric statistics and related topics. Elsevier Science Publishers;

Amsterdam. pp 399 - 407.

Rao, J. N. K. (1994). Estimating totals and distribution functions using auialiary

information at the estimation stage. Journal of Official Statistics. 10: 153 - 165.

Scott, A. and Smith, S. M. F. (1969). Estimation in multi-stage surveys. Journal of

the Arnerican Statistical Association. 64? 830 - 840.

Sedransk, J. and Meyer, J. (1978). Confidence intervals for the quantiles of a 6-

nite population: simple random stratified random sampling. Journal of the Royal

Statistical Society. B40;239 - 252.

Sedransk, J. and Smith, P. (1983). Lower bounds for confidence coefficients for con-

fidence intervals for f i t e population quantiles. Comrn. Statist. Theory Meth.

12,1329 - 1344.

Thompson, M. E. (1997). Theory of sample surveys. S t Edmundsbury Press, Bury

St Edmunds, Suffolk.

Zhang, B. (1997). Estirnating a distribution function in the presenxe of auxifiary

information- Metrika- 46,245 - 251.

and quantile in the presence of auxiliary information

Documents