
Finite Population Estimation Under Generalized Linear Model Assistance

Luz Marina Rondón

Universidad Nacional de Colombia

Luis Hernando Vanegas


Cristiano Ferraz

Universidad Federal de Pernambuco

Reporte Interno de Investigación No. 17

Departamento de Estadística

Facultad de Ciencias


Bogotá, COLOMBIA

Luz Marina Rondón (a,∗), Luis Hernando Vanegas (a), Cristiano Ferraz (b)

(a) Universidad Nacional de Colombia, Facultad de Ciencias, Departamento de Estadística, Carrera 30 No. 45-03, Bogotá, Colombia

(b) Departamento de Estatística, CCEN-UFPE, Cidade Universitária, Recife, PE, Brazil 50740-540

Abstract

Finite population estimation is the overall goal of sample surveys. When information regarding auxiliary variables is available, one may take advantage of general regression (GREG) estimators to improve the precision of sample estimates. GREG estimators may be derived when the relationship between interest and auxiliary variables is represented by a normal linear model. However, in some scenarios, such as estimating class frequencies or totals of counting processes, Bernoulli or Poisson response models are more suitable than normal linear models for describing the relationship between interest and auxiliary variables. This paper focuses on the general case in which the relationship between the interest variable and the available auxiliary variables may be suitably described by a generalized linear model. In this scenario, the finite population distribution of the variable of interest is viewed as if generated by an exponential family distribution, which includes the Bernoulli, Poisson, Gamma and inverse Gaussian distributions. The resulting estimator is a generalized linear model regression estimator (GEREG). Its general form and basic statistical properties are presented and studied analytically and, empirically, through Monte Carlo experiments. Three applications are presented in which the GEREG estimator shows better performance than GREG.

Key words: Auxiliary information, generalized linear models, model-assisted estimation, pseudomaximum likelihood.

1 Introduction

Finite population estimation is the general goal of sample surveys. When interest lies in estimating means, totals and percentages, these parameters are estimated using the Horvitz-Thompson estimator.

∗ Corresponding author. Email address: [email protected] (Luz Marina Rondón).

Reporte interno de investigación, Universidad Nacional de Colombia. 4 March 2011

When information regarding auxiliary variables is available, however, one may take advantage of general regression (GREG) estimators to improve the precision of sample estimates. Such GREG estimators may be derived in a model-assisted estimation approach (Särndal, Swensson and Wretman (1992)) when the relationship between interest and auxiliary variables is represented by a normal linear model. The list of authors working on the subject is broad and includes Isaki and Fuller (1982), Wright (1983), Särndal, Swensson and Wretman (1992) and Fuller (2002). Lehtonen and Veijanen (1998) may perhaps be the first to argue that in some instances (such as when estimating class frequencies) models for multinomial responses are more suitable than normal linear models for describing the relationship between interest and auxiliary variables. They presented and studied the properties of the LGREG estimator, a multinomial logistic model-assisted estimator. Breidt and Opsomer (2000) focused on an approach assisted by nonparametric regression modeling, proposing and studying the properties of local polynomial regression estimators. Estevao and Särndal (2004) studied several applications of domain estimation using calibration. Duchesne (2003) used Monte Carlo experiments to illustrate the efficiency of inference based on estimators assisted by logistic regression relative to those assisted by normal regression. Lehtonen, Särndal and Veijanen (2003) compared the performance of different estimators that use auxiliary information in the estimation stage, including those assisted by the logistic regression model. Lehtonen, Särndal and Veijanen (2005) studied the importance of model specification for estimating the total of a polytomous interest variable for a number of large or small domains. Li (2008) introduced the Box-Cox technique into the generalized regression estimator, which can be especially suitable for highly skewed, continuous and positive interest variables, where normal response models may not be appropriate.

This paper focuses on the general case in which the relationship between the interest variable and the available auxiliary variables may be suitably described by a generalized linear model. In this scenario, the finite population distribution of the interest variable is viewed as if generated by an exponential family distribution, which includes the Bernoulli, Poisson, Gamma and inverse Gaussian distributions; the resulting estimator is a generalized linear model regression estimator (GEREG). Its general form and basic statistical properties are presented and studied.

Let $U=\{1,2,\ldots,N\}$ be a finite population. A sample $S$ of size $n$ is obtained from $U$ using the sampling design $p(\cdot)$. Let $\pi_k=P(k\in S)$ denote the first-order inclusion probability of the $k$-th population element. Likewise, $\pi_{kl}=P(k,l\in S)$ denotes the second-order inclusion probability of elements $k$ and $l$. Let $y_k$ denote the value of the interest variable, which may be discrete or continuous, and let $\mathbf{x}_k=(x_{k1},\ldots,x_{kJ})^T$ be the vector of auxiliary information for the $k$-th element.

The paper is organized as follows. Section 2 introduces generalized linear models in the finite population setting. Section 3 presents the GEREG estimator, studies its main properties and reports a simulation study that illustrates its performance. Section 4 presents three applications of the proposed estimator.

2 Generalized Linear Model in Finite Population

The literature concerning generalized linear models (GLMs) is vast; see, for instance, McCullagh and Nelder (1989), Dobson (2001) and Fox (2008). In a GLM setup, $Y_1,\ldots,Y_n$ denote the values of an interest variable $Y$ measured on $n$ elements. The $Y_k$ are assumed to be independent random variables with a probability distribution from the exponential family. Let $Y_k$ be the random variable $Y$ for element $k$. Its density function may be expressed as

$$f(y;\theta_k,\phi_k)=\exp\{\phi_k[y\theta_k-b(\theta_k)]+c(y,\phi_k)\}, \tag{1}$$

where $c(\cdot)$ is a known function, $\theta_k$ is the canonical parameter, $E(Y_k)=\mu_k=b'(\theta_k)$ and $\mathrm{Var}(Y_k)=\phi_k^{-1}V(\mu_k)$, with $V_k=V(\mu_k)=\partial\mu_k/\partial\theta_k$ the variance function and $\phi_k^{-1}>0$ the dispersion parameter. Generalized linear models are defined by (1) and by the following systematic component:

$$g(\mu_k)=\eta_k=\sum_{j=1}^{J}\beta_j x_{kj}=\mathbf{x}_k^{T}\beta, \tag{2}$$

where $\mathbf{x}_k=(x_{k1},\ldots,x_{kJ})^T$ is a vector of $J$ explanatory variables for element $k$, $\beta=(\beta_1,\ldots,\beta_J)^T$ is a vector of unknown parameters, and $g(\cdot)$ is a monotone differentiable function, called the link function. When $g(\cdot)$ is defined in such a way that $\theta_k=\eta_k$ for every $k$, then $g(\cdot)$ is called the canonical link function. In this work it is assumed that, whenever $\phi_k\neq\phi_l$ for some $k$ and $l$, $\phi_k\propto\delta_k$, $k=1,\ldots,n$, with $\delta_k$ a known quantity for every $k$. Particular cases of this setup include the normal, Bernoulli, Poisson, Gamma and inverse Gaussian response models. Table 1 shows $b(\theta)$, $\theta$, $\phi$ and $V(\mu)$ for the main exponential family distributions.

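Table 1. Principal distributions belonging to the exponential family.

| Distribution | $b(\theta)$ | $\theta$ | $\phi$ | $V(\mu)$ |
|---|---|---|---|---|
| Normal | $\theta^2/2$ | $\mu$ | $1/\sigma^2$ | $1$ |
| Poisson | $e^{\theta}$ | $\log\mu$ | $1$ | $\mu$ |
| Bernoulli | $\log(1+e^{\theta})$ | $\log\{\mu/(1-\mu)\}$ | $1$ | $\mu(1-\mu)$ |
| Gamma | $-\log(-\theta)$ | $-1/\mu$ | $1/(CV)^2$ | $\mu^2$ |
| Inverse Gaussian | $-\sqrt{-2\theta}$ | $-1/(2\mu^2)$ | $\phi$ | $\mu^3$ |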

The maximum likelihood method may be used to estimate GLM parameters, and in the finite population context the estimation method can use sampling weights. For instance, $\hat\mu_k^U$ and $\hat\mu_k^S$ are both estimators of $\mu_k$; however, the first uses information about $U$, while the second uses only information about $S$. Similarly, $\beta$, $\hat\beta_U$, $\hat\beta_S$ and $\hat\beta_S^{\pi}$ differ: $\beta$ is the super-population parameter vector, while $\hat\beta_U$ is the estimator of $\beta$ based on $U$. There are two options for estimating $\beta$ with information from the sample $S$: the first, $\hat\beta_S$, assigns equal weights to all individuals; the second, the pseudomaximum likelihood estimator $\hat\beta_S^{\pi}$, takes the sampling design into account by weighting individuals according to their first-order inclusion probabilities (see Table 2).

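Table 2. Estimation of $\beta$ and $\mu_k$

| | With information about $U$ | With information about $S$ (weighted) | With information about $S$ (unweighted) |
|---|---|---|---|
| $\beta$ | $\hat\beta_U=\arg\max_{\beta}L_U(\beta)$ | $\hat\beta_S^{\pi}=\arg\max_{\beta}L_S^{\pi}(\beta)$ | $\hat\beta_S=\arg\max_{\beta}L_S(\beta)$ |
| $\mu_k$ | $\hat\mu_k^U=g^{-1}(\mathbf{x}_k^T\hat\beta_U)$ | $\hat\mu_k^S=g^{-1}(\mathbf{x}_k^T\hat\beta_S^{\pi})$ | $\hat\mu_k^S=g^{-1}(\mathbf{x}_k^T\hat\beta_S)$ |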

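The log-likelihood function for the sample $S$ ($S\subset U$) that takes the sampling weights into account, often called the $\pi$-weighted log-likelihood or pseudo-likelihood, can be written as

$$L_S^{\pi}(\beta)=\sum_{k\in S}\frac{\phi_k}{\pi_k}\left[y_k\theta(\beta;\mathbf{x}_k)-b(\theta(\beta;\mathbf{x}_k))\right], \tag{3}$$

so that $\hat\beta_S^{\pi}=\arg\max_{\beta}L_S^{\pi}(\beta)$ and $\hat\mu_k^S=g^{-1}(\mathbf{x}_k^T\hat\beta_S^{\pi})$ are the pseudomaximum likelihood estimators of $\beta$ and $\mu_k$, respectively (see, for instance, Skinner and Holt (1989); Lehtonen and Pahkinen (2004)). If $\pi_k=\pi$ for $k\in U$, then $\hat\beta_S$ and $\hat\beta_S^{\pi}$ are equivalent. The estimator $\hat\beta_S^{\pi}$ can also be found by solving the equation $U_S^{\pi}(\hat\beta_S^{\pi})=0$. The function $U_S^{\pi}(\beta)=\partial L_S^{\pi}(\beta)/\partial\beta=(U_{S1}^{\pi}(\beta),\ldots,U_{SJ}^{\pi}(\beta))^T$ is the score function for $\beta$ (see, for example, Cox and Hinkley (1974)), where

$$U_{Sj}^{\pi}(\beta)=\sum_{k\in S}\frac{\phi_k}{\pi_k}(y_k-\mu_k)\frac{\partial\theta_k}{\partial\eta_k}x_{kj},\quad j=1,\ldots,J. \tag{4}$$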
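As an illustration of how the score equation (4) might be solved in practice, the following sketch applies Newton-Raphson to the π-weighted log-likelihood of a Poisson model with canonical (log) link, for which $\partial\theta_k/\partial\eta_k=1$. The function name `fit_pseudo_ml_poisson`, the starting value and the convergence settings are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def fit_pseudo_ml_poisson(X, y, pi, phi=None, n_iter=50, tol=1e-10):
    """Newton-Raphson for the pi-weighted Poisson log-likelihood (log link).

    X  : (n, J) design matrix; column 0 is assumed to be the intercept.
    y  : (n,) observed counts (weighted mean assumed positive).
    pi : (n,) first-order inclusion probabilities.
    phi: optional (n,) precision parameters phi_k (defaults to 1).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    phi = np.ones_like(y) if phi is None else np.asarray(phi, dtype=float)
    w = phi / pi                                   # weights phi_k / pi_k
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(np.average(y, weights=w))     # hypothetical starting value
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        score = X.T @ (w * (y - mu))               # equation (4), canonical link
        info = X.T @ (X * (w * mu)[:, None])       # minus the Hessian of (3)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

At the returned value the weighted score is numerically zero; in particular, with an intercept in the model, the weighted residual sum $\sum_{k\in S}(y_k-\hat\mu_k^S)/\pi_k$ vanishes.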

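For normal response models with canonical (identity) link, the pseudomaximum likelihood estimator is equivalent to the weighted least squares estimator, which can be expressed as

$$\hat\beta_S^{\pi}=(\mathbf{X}_S^T\mathbf{W}_S\mathbf{X}_S)^{-1}\mathbf{X}_S^T\mathbf{W}_S\mathbf{Y}_S, \tag{5}$$

where $\mathbf{X}_S=(\mathbf{x}_1,\ldots,\mathbf{x}_n)^T$, $\mathbf{Y}_S=(y_1,\ldots,y_n)^T$ and $\mathbf{W}_S=\mathrm{diag}\{\phi_1/\pi_1,\ldots,\phi_n/\pi_n\}$.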

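The log-likelihood function for the population $U$ can be written as

$$L_U(\beta)=\sum_{k\in U}\phi_k\left[y_k\theta(\beta;\mathbf{x}_k)-b(\theta(\beta;\mathbf{x}_k))\right],$$

so that $\hat\beta_U=\arg\max_{\beta}L_U(\beta)$ and $\hat\mu_k^U=g^{-1}(\mathbf{x}_k^T\hat\beta_U)$ are the maximum likelihood estimators of $\beta$ and $\mu_k$, respectively. The score function for $\beta$ in the population $U$ is given by $U_U(\beta)=\partial L_U(\beta)/\partial\beta=(U_{U1}(\beta),\ldots,U_{UJ}(\beta))^T$, where

$$U_{Uj}(\beta)=\sum_{k\in U}\phi_k(y_k-\mu_k)\frac{\partial\theta_k}{\partial\eta_k}x_{kj},\quad j=1,\ldots,J. \tag{6}$$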

3 Generalized linear model regression estimator (GEREG)

Consider estimating the population total $t_y=\sum_{k\in U}y_k$. The GEREG estimator of $t_y$ may be written as

$$\hat t_G=\sum_{k\in U}\hat\mu_k^S+\sum_{k\in S}\frac{y_k-\hat\mu_k^S}{\pi_k}, \tag{7}$$

where $\xi$, the model explaining the relationship between the response and the auxiliary variables, may be expressed as

$$\begin{aligned}
E_\xi(y_k)&=\mu_k;\\
\mathrm{Var}_\xi(y_k)&=\phi_k^{-1}V(\mu_k);\\
g(\mu_k)&=\sum_{j=1}^{J}\beta_j x_{kj}=\mathbf{x}_k^T\beta,
\end{aligned} \tag{8}$$

where $\beta$ is a vector of unknown parameters, $g(\cdot)$ is the link function, $\mathbf{x}_k=(x_{k1},\ldots,x_{kJ})^T$ is the vector of auxiliary information for the $k$-th population element, and $E_\xi(\cdot)$ and $\mathrm{Var}_\xi(\cdot)$ are the expectation and variance under model $\xi$, respectively. In expression (7), $\hat\mu_k^S=g^{-1}(\mathbf{x}_k^T\hat\beta_S^{\pi})$, where $\hat\beta_S^{\pi}$, the estimator of $\beta$ based on $S$, takes the sampling weights into account. We assume that, whenever $\phi_k\neq\phi_l$ for some $k$ and $l$, $\phi_k\propto\delta_k$, with $\delta_k$ a known quantity for every $k\in U$. If $\xi$ considers homogeneity in the dispersion parameter ($\phi_k=\phi$), then we assume that $\phi_k=1$ for every $k\in U$. The estimator $\hat t_G$ requires knowledge of the auxiliary information vector $\mathbf{x}_k$ for every $k\in U$; however, if the link function in the model $\xi$ is the identity, $\hat t_G$ only requires knowledge of $\mathbf{x}_k$ for every $k\in S$ and of the vector of population totals $\mathbf{t}_x=(t_{x1},\ldots,t_{xJ})^T$, where $t_{xj}=\sum_{k\in U}x_{kj}$, $j=1,\ldots,J$.
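A minimal sketch of expression (7), assuming the fitted values have already been computed for every population element and for the sampled elements. The function name `gereg_total` and its argument names are hypothetical.

```python
import numpy as np

def gereg_total(mu_hat_U, y_S, mu_hat_S, pi_S):
    """GEREG estimator (7): population sum of fitted values plus the
    Horvitz-Thompson-weighted sum of sample residuals."""
    mu_hat_U = np.asarray(mu_hat_U, dtype=float)   # fitted values for all k in U
    y_S = np.asarray(y_S, dtype=float)             # observed y for k in S
    mu_hat_S = np.asarray(mu_hat_S, dtype=float)   # fitted values for k in S
    pi_S = np.asarray(pi_S, dtype=float)           # inclusion probabilities, k in S
    adjustment = np.sum((y_S - mu_hat_S) / pi_S)
    return float(mu_hat_U.sum() + adjustment)
```

For example, with fitted values `[1, 2, 3, 4]` in the population and a sample of two elements with `y_S = [2, 4]`, `mu_hat_S = [1, 3]` and `pi_S = [0.5, 0.5]`, the first term is 10 and the adjustment term is 4.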


3.1 Alternative expression for GEREG

The second term in (7) is an adjustment term. Theorem 1 provides conditions under which the adjustment term can be eliminated, thus simplifying the expression for the GEREG estimator. The theorem also allows $t_y$ to be written in terms of the fitted values of the model based on $U$.

Theorem 1. If the GEREG estimator $\hat t_G$ is based on a model $\xi$ for which $\sum_{j=1}^{J}a_j\mathbf{h}_j=\mathbf{1}$, where $\mathbf{h}_j=(h_{1j},\ldots,h_{Nj})^T$ with $h_{kj}=\phi_k\frac{\partial\theta_k}{\partial\eta_k}x_{kj}$ and $\mathbf{a}=(a_1,\ldots,a_J)^T$ is a constant vector, then

(1) $\sum_{k\in U}(y_k-\hat\mu_k^U)=0$, i.e., the total of $y$ can be written as $t_y=\sum_{k\in U}\hat\mu_k^U$;

(2) $\sum_{k\in S}(y_k-\hat\mu_k^S)/\pi_k=0$, i.e., the estimator $\hat t_G$ can be written as $\hat t_G=\sum_{k\in U}\hat\mu_k^S$.

According to Theorem 1, the smaller the difference between $\hat\beta_S^{\pi}$ and $\hat\beta_U$, the smaller the difference between $\hat t_G$ and $t_y$. In terms of finite population consistency, the estimator $\hat\beta_S^{\pi}$ is consistent for $\hat\beta_U$; therefore, $\hat t_G$ is consistent for $t_y$. Moreover, if $\pi_k=\pi$ for $k\in U$, Theorem 1 gives

$$\sum_{k\in S}y_k=\sum_{k\in S}\hat\mu_k^S.$$

Corollary 1. If the GEREG estimator $\hat t_G$ is based on a model $\xi$ that has

(1) homogeneity in the dispersion parameter, i.e., $\phi_k=\phi$ for $k\in U$;

(2) a systematic component with an intercept, i.e., there is a $\beta_j$ in $\beta$ such that $\partial\eta_k/\partial\beta_j=C$ for $k\in U$, with $C$ a constant;

(3) a systematic component with a canonical link, i.e., $\theta_k=\eta_k$ for $k\in U$;

then results (1) and (2) of Theorem 1 are satisfied.

From the model formulation point of view, Corollary 1 provides clearer conditions than Theorem 1 under which the adjustment term in the GEREG estimator can be eliminated.

Corollary 2. If the GEREG estimator $\hat t_G$ is based on a model $\xi$ with normal response and canonical (identity) link, and $\mathbf{a}=(a_1,\ldots,a_J)^T$ is a constant vector such that, for $k\in U$,

$$\phi_k^{-1}=\sum_{j=1}^{J}a_j x_{kj},$$

then results (1) and (2) of Theorem 1 are satisfied.

The result in Corollary 2 is analogous to that found in Särndal, Swensson and Wretman (1992). Theorem 1 is therefore an extension of that result to generalized linear models.


3.2 Expectation and variance

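The following theorem provides approximate expressions for the expectation and variance of the GEREG estimator, taking into account the sampling design and the regression model formulated.

Theorem 2. If $\hat t_G$ is given by $\sum_{k\in U}\hat\mu_k^S$ and $\xi$ uses the canonical link, then

(1) $E(\hat t_G)\approx t_y$;

(2) $\mathrm{Var}(\hat t_G)\approx\Gamma_U^T\Sigma_U\Gamma_U$;

(3) $\widehat{\mathrm{Var}}(\hat t_G)=\Gamma_S^T\Sigma_S\Gamma_S$,

where

$$\Gamma_U=(\mathbf{X}_U^T\mathbf{W}_U^{*}\mathbf{X}_U)^{-1}\mathbf{X}_U^T\mathbf{W}_U^{*}\phi^{*},\qquad \mathbf{W}_U^{*}=\mathrm{diag}\{\phi_1V_1^U,\ldots,\phi_NV_N^U\},$$

$$\Gamma_S=(\mathbf{X}_U^T\mathbf{W}_S^{*}\mathbf{X}_U)^{-1}\mathbf{X}_U^T\mathbf{W}_S^{*}\phi^{*},\qquad \mathbf{W}_S^{*}=\mathrm{diag}\{\phi_1V_1^S,\ldots,\phi_NV_N^S\},$$

$\Sigma_U=\{\sigma_{ij}^U\}$ and $\Sigma_S=\{\sigma_{ij}^S\}$, with

$$\sigma_{ij}^U=\sum_{k\in U}\sum_{l\in U}\Delta_{kl}\frac{x_{ki}\phi_kE_k}{\pi_k}\frac{x_{lj}\phi_lE_l}{\pi_l},\qquad \sigma_{ij}^S=\sum_{k\in S}\sum_{l\in S}\frac{\Delta_{kl}}{\pi_{kl}}\frac{x_{ki}\phi_ke_k}{\pi_k}\frac{x_{lj}\phi_le_l}{\pi_l},$$

$\phi^{*}=(\phi_1^{-1},\ldots,\phi_N^{-1})^T$, $V_k^U=V(\hat\mu_k^U)$, $V_k^S=V(\hat\mu_k^S)$, $E_k=y_k-\hat\mu_k^U$ and $e_k=y_k-\hat\mu_k^S$.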

Corollary 3. Under the conditions of Corollary 1, we have that

(1) $E(\hat t_G)\approx t_y$;

(2) $\mathrm{Var}(\hat t_G)\approx\sum_{k\in U}\sum_{l\in U}\Delta_{kl}\dfrac{E_k}{\pi_k}\dfrac{E_l}{\pi_l}$;

(3) $\widehat{\mathrm{Var}}(\hat t_G)=\sum_{k\in S}\sum_{l\in S}\dfrac{\Delta_{kl}}{\pi_{kl}}\dfrac{e_k}{\pi_k}\dfrac{e_l}{\pi_l}$.

Corollary 4. If the GEREG estimator $\hat t_G$ is based on a model $\xi$ with normal response and canonical (identity) link, and $\mathbf{a}=(a_1,\ldots,a_J)^T$ is a constant vector such that, for $k\in U$,

$$\phi_k^{-1}=\sum_{j=1}^{J}a_j x_{kj},$$

then results (1) and (2) of Corollary 3 are satisfied.

Corollaries 3 and 4 show that when the model $\xi$ has a canonical link, an intercept and homogeneity in the dispersion parameter, the expressions for $\mathrm{Var}(\hat t_G)$ and $\widehat{\mathrm{Var}}(\hat t_G)$ simplify. That is, the estimator $\hat t_G$ is approximately unbiased for $t_y$, and an approximation to the variance of $\hat t_G$ can be calculated by applying the formula for the variance of the Horvitz-Thompson estimator to the residuals $E_k$.
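The variance estimator in result (3) of Corollary 3 can be sketched as a quadratic form in the π-expanded sample residuals. The code below assumes the usual definition $\Delta_{kl}=\pi_{kl}-\pi_k\pi_l$ with $\pi_{kk}=\pi_k$, which the text does not state explicitly; the helper `srs_joint_probs` is a hypothetical convenience for simple random sampling without replacement.

```python
import numpy as np

def ht_residual_variance(e, pi, pi_joint):
    """Double sum over the sample of (Delta_kl / pi_kl) (e_k / pi_k) (e_l / pi_l).

    Assumes Delta_kl = pi_kl - pi_k * pi_l and pi_kk = pi_k (standard convention).
    """
    e = np.asarray(e, dtype=float)
    pi = np.asarray(pi, dtype=float)
    pi_joint = np.asarray(pi_joint, dtype=float)
    delta = pi_joint - np.outer(pi, pi)
    z = e / pi                                   # pi-expanded residuals
    return float(z @ (delta / pi_joint) @ z)

def srs_joint_probs(n, N):
    """First- and second-order inclusion probabilities under SRS without replacement."""
    pi = np.full(n, n / N)
    pi_joint = np.full((n, n), n * (n - 1) / (N * (N - 1)))
    np.fill_diagonal(pi_joint, n / N)
    return pi, pi_joint
```

Under SRS the double sum reduces to the familiar closed form $N^2\frac{1-f}{n}S_e^2$, where $S_e^2$ is the sample variance of the residuals and $f=n/N$, which gives a convenient check of the implementation.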


3.3 The model's role

The role of the model $\xi$ in estimating $t_y$ can be observed by assessing the impact on $E_k$ of small perturbations in $y_k$. Pregibon (1981) proposed a quantity measuring the influence of $y_k$ on $\hat\mu_k^U$ in generalized linear models, which extends the measure $\partial\hat\mu_k^U/\partial y_k$ applied to normal linear models. When $\xi$ has a canonical link and homogeneity in the dispersion parameter, the quantity $\gamma_k=1-h_{kk}$ can be used to assess the influence of $y_k$ on $E_k$, where $h_{kk}$ is the $k$-th element of the main diagonal of $\mathbf{H}=\mathbf{V}_U^{1/2}\mathbf{X}_U(\mathbf{X}_U^T\mathbf{V}_U\mathbf{X}_U)^{-1}\mathbf{X}_U^T\mathbf{V}_U^{1/2}$, with $\mathbf{V}_U=\mathrm{diag}\{V_1^U,\ldots,V_N^U\}$. Small values of $\gamma_k$ indicate that a change in $y_k$ leads to a similar change in $\hat\mu_k^U$, suggesting that the $k$-th population element has a large weight in the estimation of $\hat\beta_U$ and in the magnitude of $E_k$. Likewise, large values of $\gamma_k$ indicate that a change in $y_k$ does not lead to a similar change in $\hat\mu_k^U$, suggesting that the $k$-th population element has a small weight in the estimation of $\hat\beta_U$ and in the magnitude of $E_k$. For example, for a model with an intercept, one explanatory variable and homogeneity in the dispersion parameter, we have the following:

i) Canonical link

$$\gamma_k=1-\frac{V_k^U}{\sum_{l\in U}V_l^U}-\frac{(x_k-\bar x_U)^2V_k^U}{\sum_{l\in U}(x_l-\bar x_U)^2V_l^U}$$

ii) Identity link

$$\gamma_k=1-\frac{1/V_k^U}{\sum_{l\in U}(1/V_l^U)}-\frac{(x_k-\bar x_U)^2(1/V_k^U)}{\sum_{l\in U}(x_l-\bar x_U)^2(1/V_l^U)},$$

where $\bar x_U$ is the population mean of the explanatory variable $x$. In both cases, $\gamma_k\in(0,1)$ and, for a fixed value of $x_k$, the value of $\gamma_k$ depends on the variance function, in direct proportion when the link is canonical and in inverse proportion when the link is the identity. Thus, the model $\xi$ determines the structure of weighting or importance of the elements in $U$, which affects the contribution of the $E_k$ to $\mathrm{Var}(\hat t_G)$.
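The quantity $\gamma_k=1-h_{kk}$ can be computed directly from the diagonal of $\mathbf{H}$ without forming the full $N\times N$ matrix. The sketch below assumes a canonical link with homogeneous dispersion, as in the text; the function name `influence_gamma` is hypothetical.

```python
import numpy as np

def influence_gamma(X_U, V_U):
    """Return gamma_k = 1 - h_kk, with H = V^{1/2} X (X^T V X)^{-1} X^T V^{1/2}.

    X_U : (N, J) population design matrix.
    V_U : (N,) variance function evaluated at the fitted means, V(mu_k^U) > 0.
    """
    X = np.asarray(X_U, dtype=float)
    v = np.asarray(V_U, dtype=float)
    Xw = X * np.sqrt(v)[:, None]                  # V^{1/2} X
    A = np.linalg.inv(Xw.T @ Xw)                  # (X^T V X)^{-1}
    h_diag = np.einsum("ij,ij->i", Xw @ A, Xw)    # diagonal of H only
    return 1.0 - h_diag
```

Because $\mathbf{H}$ is a projection matrix of rank $J$, the $h_{kk}$ sum to $J$, so $\sum_k(1-\gamma_k)=J$ provides a quick sanity check.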

3.4 Relative efficiency

The relative efficiency of $\hat t_G$ compared to the Horvitz-Thompson estimator under the sampling design $p(\cdot)$ can be evaluated with the measure

$$\Delta_{p(\cdot)}=\frac{\mathrm{Var}(\hat t_G)_{p(\cdot)}}{\mathrm{Var}(\hat t_{HT})_{p(\cdot)}}.$$

For sampling designs such as simple random sampling (SRS) and Bernoulli sampling (BER), $\Delta_{p(\cdot)}$ does not depend on the sample size. The following result shows the impact of the model's goodness of fit on the relative efficiency of $\hat t_G$.

Theorem 3. Under the conditions of Corollary 1, if $\hat t_G$ is used under

i) simple random sampling (SRS), then

$$\Delta_{SRS}=\frac{\mathrm{Var}(\hat t_G)_{SRS}}{\mathrm{Var}(\hat t_{HT})_{SRS}}\approx 1-R_U^2;$$

ii) Bernoulli sampling (BER), then

$$\Delta_{BER}=\frac{\mathrm{Var}(\hat t_G)_{BER}}{\mathrm{Var}(\hat t_{HT})_{BER}}\approx\frac{1-R_U^2}{1+\dfrac{CV_y^{-2}}{1-1/N}},$$

where $R_U$ is Pearson's linear correlation coefficient between the $y_k$ and the $\hat\mu_k^U$, and $CV_y$ is the coefficient of variation of $y$ in the population $U$.

This result shows that as the model's goodness of fit increases, the efficiency of $\hat t_G$ relative to the Horvitz-Thompson estimator also increases.
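The two approximations in Theorem 3 are simple enough to evaluate directly. The following sketch, with hypothetical function names, shows how they might be coded for a quick comparison of designs.

```python
def delta_srs(r2_u):
    """Approximate Var(t_G)/Var(t_HT) under SRS: 1 - R_U^2."""
    return 1.0 - r2_u

def delta_ber(r2_u, cv_y, N):
    """Approximate Var(t_G)/Var(t_HT) under Bernoulli sampling:
    (1 - R_U^2) / (1 + CV_y^{-2} / (1 - 1/N))."""
    return (1.0 - r2_u) / (1.0 + cv_y ** -2 / (1.0 - 1.0 / N))
```

For any finite $CV_y$ the Bernoulli ratio is smaller than the SRS ratio, because under Bernoulli sampling the variance of the Horvitz-Thompson estimator also depends on the level of $y$ and not only on its dispersion.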

It is also possible to observe that, using comparable sample sizes, $\mathrm{Var}(\hat t_G)_{BER}/\mathrm{Var}(\hat t_G)_{SRS}\approx 1-1/N$. Moreover, Theorem 3 implies that, under SRS, the Horvitz-Thompson estimator attains the same efficiency as $\hat t_G$ when it uses a sampling fraction $f^{*}$ given by

$$f^{*}\approx\frac{f}{f+(1-f)(1-R_U^2)},$$

where $f=n/N$ is the sampling fraction used by $\hat t_G$. For instance, when the GEREG estimator uses a sampling fraction of 0.2 under SRS, the model $\xi$ satisfies the conditions of Corollary 1 and has a goodness of fit described by $R_U^2=0.95$, the Horvitz-Thompson estimator requires a sampling fraction around four times greater than that of the GEREG estimator to obtain the same efficiency.
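The equivalent sampling fraction can be checked numerically. This small sketch (the function name is an assumption) reproduces the example with $f=0.2$ and $R_U^2=0.95$.

```python
def equivalent_ht_fraction(f, r2_u):
    """Sampling fraction f* at which the Horvitz-Thompson estimator under SRS
    matches the approximate variance of the GEREG estimator using fraction f."""
    return f / (f + (1.0 - f) * (1.0 - r2_u))

f_star = equivalent_ht_fraction(0.2, 0.95)   # 0.2 / 0.24
ratio = f_star / 0.2                         # how many times larger than f
```

Here `f_star` is $0.2/0.24\approx 0.83$, so `ratio` is a little above 4, in line with the "around four times" statement.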

3.5 Particular cases

This section considers estimators assisted by models satisfying the conditions of Theorem 1 for different response distributions.

3.5.1 Normal response with constant variance

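In the statistical literature, the estimator (9) is obtained through the least squares method, with no distributional assumptions. In this paper the same estimator is obtained using a regression model with normal response and canonical link. According to Corollary 1, this estimator is given by

$$\hat t_G=\sum_{k\in U}\hat\mu_k^S=\sum_{k\in U}\mathbf{x}_k^T\hat\beta_S^{\pi}=N\hat\beta_{S1}^{\pi}+\sum_{j=2}^{J}\hat\beta_{Sj}^{\pi}t_{xj}, \tag{9}$$

with $\hat\beta_S^{\pi}=(\hat\beta_{S1}^{\pi},\ldots,\hat\beta_{SJ}^{\pi})^T$. The model $\xi$ that assists (9) can be written as

$$\begin{aligned}
E_\xi(Y_k)&=\mu_k;\\
\mathrm{Var}_\xi(Y_k)&=\sigma^2;\\
\mu_k&=\beta_1+\sum_{j=2}^{J}\beta_jx_{kj}.
\end{aligned}$$

From (5), $\hat\beta_S^{\pi}$ can be written as $\hat\beta_S^{\pi}=(\mathbf{X}_S^T\mathbf{W}_S\mathbf{X}_S)^{-1}\mathbf{X}_S^T\mathbf{W}_S\mathbf{Y}_S$ with $\mathbf{X}_S=(\mathbf{x}_1,\ldots,\mathbf{x}_n)^T$, $\mathbf{x}_k=(1,x_{k2},\ldots,x_{kJ})^T$ and $\mathbf{W}_S=\mathrm{diag}\{1/\pi_1,\ldots,1/\pi_n\}$. According to Corollary 3, under SRS the approximate variance of (9) and its corresponding estimator can be written as

$$\mathrm{Var}(\hat t_G)_{SRS}\approx N(1-R_U^2)\frac{1-f}{f}S_{yU}^2\quad\text{and}\quad\widehat{\mathrm{Var}}(\hat t_G)_{SRS}=N(1-R_S^2)\frac{1-f}{f}S_{yS}^2,$$

where $S_{yU}^2=\frac{1}{N-1}\sum_{U}(y_k-\bar y_U)^2$, $S_{yS}^2=\frac{1}{n-1}\sum_{S}(y_k-\bar y_S)^2$ and $R_S$ is Pearson's linear correlation coefficient between the $y_k$ and the $\hat\mu_k^S$. Similarly, from Corollary 3 with BER sampling, the approximate variance of (9) and its corresponding estimator can be written as

$$\mathrm{Var}(\hat t_G)_{BER}\approx\frac{N(1-R_U^2)}{(1-1/N)^{-1}}\frac{1-\pi}{\pi}S_{yU}^2\quad\text{and}\quad\widehat{\mathrm{Var}}(\hat t_G)_{BER}=\frac{n}{N\pi}\frac{N(1-R_S^2)}{(1-1/n)^{-1}}\frac{1-\pi}{\pi}S_{yS}^2,$$

where $N\pi$ is the expected sample size.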

3.5.2 Normal response with variance heterogeneity

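The following estimator is assisted by a model with normal response and variance heterogeneity that satisfies the conditions of Corollary 2:

$$\hat t_G=\sum_{k\in U}\hat\mu_k^S=\sum_{k\in U}\hat\beta_S^{\pi}x_k=\hat\beta_S^{\pi}t_x=\frac{t_x}{\hat t_x}\hat t_y, \tag{10}$$

with $\hat t_x$ and $\hat t_y$ being the Horvitz-Thompson estimators of $t_x$ and $t_y$, respectively. The estimator (10) is called the ratio estimator (see, for instance, Särndal, Swensson and Wretman (1992)). The model $\xi$ that assists (10) can be written as

$$\begin{aligned}
E_\xi(Y_k)&=\mu_k;\\
\mathrm{Var}_\xi(Y_k)&=\sigma^2x_k,\quad x_k>0;\\
\mu_k&=\beta x_k.
\end{aligned}$$

From (5), $\hat\beta_S^{\pi}$ can be written as $\hat\beta_S^{\pi}=(\mathbf{X}_S^T\mathbf{W}_S\mathbf{X}_S)^{-1}\mathbf{X}_S^T\mathbf{W}_S\mathbf{Y}_S$ with $\mathbf{W}_S=\mathrm{diag}\{1/(x_1\pi_1),\ldots,1/(x_n\pi_n)\}$. According to Theorem 2, under SRS the approximate variance of (10) and its corresponding estimator are given by

$$\mathrm{Var}(\hat t_G)_{SRS}\approx N\frac{1-f}{f}\left[S_{yU}^2+S_{xU}S_{yU}\frac{\bar y_U^2}{\bar x_U^2}\left(\frac{S_{xU}}{S_{yU}}-2R_U\right)\right]$$

and

$$\widehat{\mathrm{Var}}(\hat t_G)_{SRS}=N\frac{1-f}{f}\left[S_{ys}^2+S_{xs}S_{ys}\frac{\bar y_s^2}{\bar x_s^2}\left(\frac{S_{xs}}{S_{ys}}-2R_S\right)\right].$$

Similarly, from Theorem 2 under BER sampling, the approximate variance of (10) and its corresponding estimator can be written as

$$\mathrm{Var}(\hat t_G)_{BER}\approx\frac{N}{(1-1/N)^{-1}}\frac{1-\pi}{\pi}\left[S_{yU}^2+S_{xU}S_{yU}\frac{\bar y_U^2}{\bar x_U^2}\left(\frac{S_{xU}}{S_{yU}}-2R_U\right)\right]$$

and

$$\widehat{\mathrm{Var}}(\hat t_G)_{BER}=\frac{n}{N\pi}\frac{N}{(1-1/n)^{-1}}\frac{1-\pi}{\pi}\left[S_{ys}^2+S_{xs}S_{ys}\frac{\bar y_s^2}{\bar x_s^2}\left(\frac{S_{xs}}{S_{ys}}-2R_S\right)\right].$$

One can easily show that, under the SRS and BER sampling designs, $\hat t_G$ is more efficient than the Horvitz-Thompson estimator when $2R_U>S_{xU}/S_{yU}$.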
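A short sketch of the ratio estimator (10), assuming that the population total $t_x$ of the auxiliary variable is known. The function names are hypothetical. The second function computes the weighted least squares slope from (5) with weights $1/(x_k\pi_k)$, to illustrate that it coincides with the ratio of the two Horvitz-Thompson estimators.

```python
import numpy as np

def ratio_estimator(y_S, x_S, pi_S, t_x):
    """Ratio estimator: t_x * (HT estimate of t_y) / (HT estimate of t_x)."""
    y_S, x_S, pi_S = (np.asarray(a, dtype=float) for a in (y_S, x_S, pi_S))
    t_y_ht = np.sum(y_S / pi_S)
    t_x_ht = np.sum(x_S / pi_S)
    return float(t_x * t_y_ht / t_x_ht)

def wls_slope(y_S, x_S, pi_S):
    """Slope of the no-intercept model mu_k = beta * x_k fitted by weighted
    least squares with weights 1 / (x_k * pi_k); requires x_k > 0."""
    y_S, x_S, pi_S = (np.asarray(a, dtype=float) for a in (y_S, x_S, pi_S))
    w = 1.0 / (x_S * pi_S)
    return float(np.sum(w * x_S * y_S) / np.sum(w * x_S ** 2))
```

Multiplying `wls_slope(...)` by `t_x` gives the same value as `ratio_estimator(...)`, since $\sum_S w_kx_ky_k=\sum_S y_k/\pi_k$ and $\sum_S w_kx_k^2=\sum_S x_k/\pi_k$.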

3.5.3 Bernoulli response

Suppose $y_k$, the response for element $k$ in a particular population, is a dichotomous variable that takes the value one if the element presents a characteristic of interest and zero otherwise. The parameter of interest $t_y$ can be written as $NP$, where $N$ is the population size and $P$ is the proportion of population elements having the characteristic of interest. The estimator of $NP$ assisted by a Bernoulli regression model $\xi$ satisfying the conditions of Corollary 1 can be written as

$$\hat t_G=\sum_{k\in U}\hat\mu_k^S=\sum_{k\in U}\left[1+\exp\left(-\hat\beta_{S1}^{\pi}-\sum_{j=2}^{J}\hat\beta_{Sj}^{\pi}x_{kj}\right)\right]^{-1}, \tag{11}$$

where $\xi$, the model assisting the estimation, can be described as

$$\begin{aligned}
E_\xi(Y_k)&=\mu_k=P_\xi[y_k=1\mid\mathbf{x}_k];\\
\mathrm{Var}_\xi(Y_k)&=\mu_k(1-\mu_k);\\
\log\left[\frac{\mu_k}{1-\mu_k}\right]&=\beta_1+\sum_{j=2}^{J}\beta_jx_{kj}.
\end{aligned}$$

As discussed in Section 3.3, for a given $x_k$, the closer $\mu_k$ is to 0.5, the greater the weight or importance of element $k$ in the estimation. According to Corollary 3, under SRS the approximate variance of (11) and its corresponding estimator can be expressed as

$$\mathrm{Var}(\hat t_G)_{SRS}\approx\frac{N(1-R_U^2)}{1-1/N}\frac{1-f}{f}P(1-P)$$

and

$$\widehat{\mathrm{Var}}(\hat t_G)_{SRS}=\frac{N(1-R_S^2)}{1-1/n}\frac{1-f}{f}\hat P(1-\hat P),$$

where $\hat P=\frac{1}{n}\sum_{S}y_k$ is the sample proportion of elements having the characteristic of interest. Under BER sampling, the approximate variance of (11) and its corresponding estimator can be expressed as

$$\mathrm{Var}(\hat t_G)_{BER}\approx N(1-R_U^2)\frac{1-\pi}{\pi}P(1-P)$$

and

$$\widehat{\mathrm{Var}}(\hat t_G)_{BER}=\frac{n}{N\pi}N(1-R_S^2)\frac{1-\pi}{\pi}\hat P(1-\hat P).$$

3.5.4 Poisson response

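When the variable of interest represents a counting process, the estimation of population totals can be assisted by a Poisson regression model. Such an estimator, which satisfies the conditions of Corollary 1, can be expressed as

$$\hat t_G=\sum_{k\in U}\hat\mu_k^S=\sum_{k\in U}\exp\left(\hat\beta_{S1}^{\pi}+\sum_{j=2}^{J}\hat\beta_{Sj}^{\pi}x_{kj}\right)=\hat\kappa_1^{\pi}\sum_{k\in U}\prod_{j=2}^{J}\left(\hat\kappa_j^{\pi}\right)^{x_{kj}}, \tag{12}$$

with $\hat\kappa_j^{\pi}=\exp(\hat\beta_{Sj}^{\pi})$, where $\xi$, the model assisting the estimation, is given by

$$\begin{aligned}
E_\xi(Y_k)&=\mu_k;\\
\mathrm{Var}_\xi(Y_k)&=\mu_k;\\
\log(\mu_k)&=\beta_1+\sum_{j=2}^{J}\beta_jx_{kj}.
\end{aligned}$$

As discussed in Section 3.3, for a given $x_k$, the higher $\mu_k$, the higher the relevance (weight) of element $k$. According to Corollary 3, under SRS the approximate variance of (12) and its corresponding estimator can be expressed as

$$\mathrm{Var}(\hat t_G)_{SRS}\approx N(1-R_U^2)\frac{1-f}{f}S_{yU}^2\quad\text{and}\quad\widehat{\mathrm{Var}}(\hat t_G)_{SRS}=N(1-R_S^2)\frac{1-f}{f}S_{yS}^2.$$

Under BER sampling, the approximate variance of (12) and its corresponding estimator can be expressed as

$$\mathrm{Var}(\hat t_G)_{BER}\approx\frac{N(1-R_U^2)}{(1-1/N)^{-1}}\frac{1-\pi}{\pi}S_{yU}^2\quad\text{and}\quad\widehat{\mathrm{Var}}(\hat t_G)_{BER}=\frac{n}{N\pi}\frac{N(1-R_S^2)}{(1-1/n)^{-1}}\frac{1-\pi}{\pi}S_{yS}^2.$$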
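The two forms of the Poisson-assisted estimator (12), the sum of exponentiated linear predictors and the multiplicative form in terms of $\hat\kappa_j^{\pi}$, can be compared numerically. The sketch below uses hypothetical function names and assumes that the first column of the population design matrix is the intercept.

```python
import numpy as np

def poisson_gereg_total(X_U, beta_hat):
    """Sum over U of exp(x_k^T beta_hat)."""
    X_U = np.asarray(X_U, dtype=float)
    beta_hat = np.asarray(beta_hat, dtype=float)
    return float(np.exp(X_U @ beta_hat).sum())

def poisson_gereg_total_kappa(X_U, beta_hat):
    """Same total in multiplicative form: kappa_1 * sum_k prod_j kappa_j^{x_kj}.

    Assumes column 0 of X_U is a column of ones (intercept)."""
    X_U = np.asarray(X_U, dtype=float)
    kappa = np.exp(np.asarray(beta_hat, dtype=float))
    products = np.prod(kappa[1:] ** X_U[:, 1:], axis=1)   # prod over j = 2..J
    return float(kappa[0] * products.sum())
```

The multiplicative form makes the interpretation of each $\hat\kappa_j^{\pi}$ explicit: a unit increase in $x_{kj}$ multiplies the fitted mean of element $k$ by $\hat\kappa_j^{\pi}$.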

3.5.5 Gamma response

When the variable of interest is skewed, continuous and positive (e.g., income), the estimation can be assisted by the Gamma model. The estimator of $t_y$ assisted by a Gamma response regression model satisfying the conditions of Corollary 1 can be expressed as

$$\hat t_G=\sum_{k\in U}\hat\mu_k^S=\sum_{k\in U}\left(\hat\beta_{S1}^{\pi}+\sum_{j=2}^{J}\hat\beta_{Sj}^{\pi}x_{kj}\right)^{-1}, \tag{13}$$

where $\xi$, the model assisting the estimation, can be written as

$$\begin{aligned}
E_\xi(Y_k)&=\mu_k;\\
\mathrm{Var}_\xi(Y_k)&=\mu_k^2;\\
\mu_k^{-1}&=\beta_1+\sum_{j=2}^{J}\beta_jx_{kj}.
\end{aligned}$$

As discussed in Section 3.3, for a given $x_k$, the higher $\mu_k$, the higher the relevance (weight) of element $k$. According to Corollary 3, under SRS the approximate variance of (13) and its corresponding estimator are given by

$$\mathrm{Var}(\hat t_G)_{SRS}\approx N(1-R_U^2)\frac{1-f}{f}S_{yU}^2\quad\text{and}\quad\widehat{\mathrm{Var}}(\hat t_G)_{SRS}=N(1-R_S^2)\frac{1-f}{f}S_{yS}^2.$$

Under BER sampling, the approximate variance of (13) and its corresponding estimator can be expressed as

$$\mathrm{Var}(\hat t_G)_{BER}\approx\frac{N(1-R_U^2)}{(1-1/N)^{-1}}\frac{1-\pi}{\pi}S_{yU}^2\quad\text{and}\quad\widehat{\mathrm{Var}}(\hat t_G)_{BER}=\frac{n}{N\pi}\frac{N(1-R_S^2)}{(1-1/n)^{-1}}\frac{1-\pi}{\pi}S_{yS}^2.$$

3.5.6 Stratified sampling

i) Estimating population totals

For a stratified SRS design in which the model assisting estimation in each stratum satisfies the conditions of Corollary 1, the estimator of $t_y$ is

$$\hat t_G=\sum_{h=1}^{H}\hat t_{G_h}=\sum_{h=1}^{H}\sum_{k\in U_h}\hat\mu_k^{S_h},$$

where $\hat t_{G_h}$ is the estimator of the total of $y$ in stratum $h$. According to Corollary 3, the approximate variance is given by

$$\mathrm{Var}(\hat t_G)_{SRS}=\sum_{h=1}^{H}\mathrm{Var}(\hat t_{G_h})_{SRS}\approx\sum_{h=1}^{H}N_h(1-R_{U_h}^2)\frac{1-f_h}{f_h}S_{yU_h}^2,$$

with $N_h$ being the size of stratum $h$, $f_h$ the sampling fraction in stratum $h$, $S_{yU_h}^2$ the variance of $y$ in stratum $h$ and $R_{U_h}$ Pearson's linear correlation coefficient between the $y_k$ and the $\hat\mu_k^U$ in stratum $h$. The variance estimator is given by

$$\widehat{\mathrm{Var}}(\hat t_G)_{SRS}=\sum_{h=1}^{H}\widehat{\mathrm{Var}}(\hat t_{G_h})_{SRS}=\sum_{h=1}^{H}N_h(1-R_{S_h}^2)\frac{1-f_h}{f_h}S_{yS_h}^2,$$

where $S_{yS_h}^2$ is the variance of $y$ in the sample from stratum $h$ and $R_{S_h}$ is Pearson's linear correlation coefficient between the $y_k$ and the $\hat\mu_k^S$ in the sample from stratum $h$. With BER sampling, the expression for the estimator of the population total is the same as in the SRS case. The approximate variance can be written as

$$\mathrm{Var}(\hat t_G)_{BER}=\sum_{h=1}^{H}\mathrm{Var}(\hat t_{G_h})_{BER}\approx\sum_{h=1}^{H}(N_h-1)(1-R_{U_h}^2)\frac{1-\pi_h}{\pi_h}S_{yU_h}^2,$$

with $N_h\pi_h$ being the expected sample size in stratum $h$. The variance estimator is given by

$$\widehat{\mathrm{Var}}(\hat t_G)_{BER}=\sum_{h=1}^{H}\widehat{\mathrm{Var}}(\hat t_{G_h})_{BER}=\sum_{h=1}^{H}\frac{n_h-1}{\pi_h}(1-R_{S_h}^2)\frac{1-\pi_h}{\pi_h}S_{yS_h}^2,$$

where $n_h$ is the sample size in stratum $h$.

ii) Estimating population means

For a stratified SRS design in which the model assisting estimation in each stratum satisfies the conditions of Corollary 1, the estimator of the population mean is given by

$$\hat m_G=\sum_{h=1}^{H}a_h\hat m_{G_h}=\sum_{h=1}^{H}a_h\sum_{k\in U_h}N_h^{-1}\hat\mu_k^{S_h},$$

with $\hat m_{G_h}$ being the estimator of the mean in stratum $h$ and $a_h=N_h/(\sum N_h)$ the proportion of population individuals in stratum $h$. According to Corollary 3, the approximate variance is given by

$$\mathrm{Var}(\hat m_G)_{SRS}=\sum_{h=1}^{H}a_h^2\mathrm{Var}(\hat m_{G_h})_{SRS}\approx\sum_{h=1}^{H}a_h^2\frac{1-R_{U_h}^2}{N_h}\frac{1-f_h}{f_h}S_{yU_h}^2.$$

The variance estimator is given by

$$\widehat{\mathrm{Var}}(\hat m_G)_{SRS}=\sum_{h=1}^{H}a_h^2\widehat{\mathrm{Var}}(\hat m_{G_h})_{SRS}=\sum_{h=1}^{H}a_h^2\frac{1-R_{S_h}^2}{N_h}\frac{1-f_h}{f_h}S_{yS_h}^2.$$

With BER sampling, the expression for the estimator of the population mean is the same as in the SRS case. The approximate variance is given by

$$\mathrm{Var}(\hat m_G)_{BER}=\sum_{h=1}^{H}a_h^2\mathrm{Var}(\hat m_{G_h})_{BER}\approx\sum_{h=1}^{H}a_h^2\frac{N_h-1}{N_h^2}(1-R_{U_h}^2)\frac{1-\pi_h}{\pi_h}S_{yU_h}^2.$$

The variance estimator is given by

$$\widehat{\mathrm{Var}}(\hat m_G)_{BER}=\sum_{h=1}^{H}a_h^2\widehat{\mathrm{Var}}(\hat m_{G_h})_{BER}=\sum_{h=1}^{H}a_h^2\frac{n_h-1}{N_h^2\pi_h}(1-R_{S_h}^2)\frac{1-\pi_h}{\pi_h}S_{yS_h}^2.$$
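The stratified SRS variance estimators for the total and the mean can be sketched as follows, taking per-stratum summaries as inputs. The function and argument names are hypothetical, and the sample-based quantities $R_{S_h}^2$ and $S_{yS_h}^2$ are assumed to have been computed beforehand.

```python
import numpy as np

def stratified_srs_total_variance(N_h, n_h, r2_h, s2_h):
    """Sum over strata of N_h (1 - R_h^2) (1 - f_h) / f_h * S2_h, with f_h = n_h / N_h."""
    N_h, n_h, r2_h, s2_h = (np.asarray(a, dtype=float) for a in (N_h, n_h, r2_h, s2_h))
    f_h = n_h / N_h
    return float(np.sum(N_h * (1.0 - r2_h) * (1.0 - f_h) / f_h * s2_h))

def stratified_srs_mean_variance(N_h, n_h, r2_h, s2_h):
    """Sum over strata of a_h^2 (1 - R_h^2) / N_h (1 - f_h) / f_h * S2_h, a_h = N_h / N."""
    N_h, n_h, r2_h, s2_h = (np.asarray(a, dtype=float) for a in (N_h, n_h, r2_h, s2_h))
    f_h = n_h / N_h
    a_h = N_h / N_h.sum()
    return float(np.sum(a_h ** 2 * (1.0 - r2_h) / N_h * (1.0 - f_h) / f_h * s2_h))
```

With two strata of sizes 100 and 200, sample sizes 10 and 20, $R^2$ values 0.5 and 0.75 and variances 4 and 1, the total variance is $1800+450=2250$ and the mean variance is $2250/300^2=0.025$, since the mean is the total divided by $N$.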

3.6 Simulation study

A simulation study was carried out to compare the behavior of the estimators described above in different scenarios.

3.6.1 Bernoulli response

Twelve populations of $N=1000$ individuals each were generated with one auxiliary variable, $x$, drawn from a chi-square distribution with 1 degree of freedom. The value of $\mu_k^U$ for element $k$ of each population was calculated as $\mu_k^U=[1+\exp(-\beta_{U1}-\beta_{U2}x_k)]^{-1}$, and the interest variable, $y_k$, was randomly generated from a Bernoulli distribution with parameter $\mu_k^U$. In each population, the values of $\beta_U=(\beta_{U1},\beta_{U2})^T$ were chosen so as to obtain $R_U^2=0.5$, 0.65 and 0.75 and $P=0.15$, 0.35, 0.65 and 0.85. For each population, 10000 independent samples of $n=40$ and $n=50$ individuals were drawn according to the SRS scheme. These samples were used to calculate estimates of $P=N^{-1}t_y$ and of the estimator's variance, as well as 95% confidence intervals (based on normality) for $P$, using two estimators assisted by regression models with normal and Bernoulli responses, denoted by $\hat P_N$ and $\hat P_B$, respectively. These GEREG estimators satisfy the conditions of Corollary 1. The following performance measures were calculated for each estimator: i) relative bias of $\hat P$; ii) relative efficiency of $\hat P$ with respect to the Horvitz-Thompson estimator; iii) relative bias of $\widehat{\mathrm{Var}}(\hat P)$; and iv) coverage rate of the 95% confidence interval for $P$.

Table 3 presents the results. The relative bias of $\hat P_B$ was smaller than that of $\hat P_N$ in nearly all scenarios; however, the relative bias was very small for both estimators. Regarding relative efficiency, the values in the table were always lower than 100%, indicating that the $\hat P_N$ and $\hat P_B$ estimators were more efficient than the Horvitz-Thompson estimator. However, the estimator assisted by the model with Bernoulli response was always the more efficient of the two. The relative bias of $\widehat{\mathrm{Var}}(\hat P)$ was relatively large, mainly for the estimator assisted by the model with Bernoulli response. The coverage rates of the 95% confidence intervals for $P$ were always lower than the nominal value for both estimators; nevertheless, taking into account that the sample size was small, the coverage rates were good. In terms of efficiency, $\hat P_B$ performed better than $\hat P_N$, mainly when the population proportion $P$ was large.
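The data-generating step of this simulation can be sketched as follows. The coefficient values passed to the function are placeholders: the text does not report the $\beta_U$ values used to reach the target $R_U^2$ and $P$, so they are not reproduced here. The function names and the seed handling are also assumptions.

```python
import numpy as np

def simulate_bernoulli_population(beta, N=1000, seed=0):
    """Generate x ~ chi-square(1), mu = logistic(beta[0] + beta[1] * x), y ~ Bernoulli(mu).

    beta is a hypothetical pair (beta_U1, beta_U2); it is not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    x = rng.chisquare(df=1, size=N)
    mu = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x)))
    y = rng.binomial(1, mu)
    return x, mu, y

def srs_sample(N, n, rng):
    """Indices of a simple random sample of size n drawn without replacement."""
    return rng.choice(N, size=n, replace=False)
```

Repeating `srs_sample` many times for a fixed population and applying each estimator to the selected elements gives the Monte Carlo distribution from which the bias, relative efficiency and coverage measures are computed.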

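Table 3. Performance of $\hat P_N$ and $\hat P_B$ estimating the population proportion $P$. Measures (%): relative bias $|\hat P/P-1|$, relative efficiency $V(\hat P)/V(\hat P_{HT})$, relative bias of the variance estimator $|\hat V(\hat P)/V(\hat P)-1|$ and coverage rate (CR). The first four measure columns refer to $\hat P_N$ and the last four to $\hat P_B$.

| $R_U^2$ | $n$ | $P$ | $\hat P_N$: rel. bias | $\hat P_N$: rel. eff. | $\hat P_N$: var. bias | $\hat P_N$: CR | $\hat P_B$: rel. bias | $\hat P_B$: rel. eff. | $\hat P_B$: var. bias | $\hat P_B$: CR |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.50 | 40 | 0.15 | 1.51 | 65.18 | 8.39 | 91.81 | 0.86 | 54.11 | 13.35 | 89.06 |
| 0.50 | 40 | 0.35 | 0.09 | 58.29 | 5.99 | 93.55 | 0.12 | 52.32 | 6.35 | 92.29 |
| 0.50 | 40 | 0.65 | 0.74 | 65.93 | 11.23 | 92.52 | 0.13 | 53.75 | 12.76 | 91.18 |
| 0.50 | 40 | 0.85 | 0.30 | 76.18 | 3.51 | 92.01 | 0.30 | 54.60 | 11.57 | 90.36 |
| 0.50 | 50 | 0.15 | 1.03 | 65.57 | 8.12 | 92.16 | 0.77 | 54.29 | 12.85 | 90.19 |
| 0.50 | 50 | 0.35 | 0.10 | 60.04 | 7.95 | 93.30 | 0.08 | 53.75 | 7.55 | 92.58 |
| 0.50 | 50 | 0.65 | 0.72 | 63.14 | 6.58 | 93.62 | 0.08 | 51.97 | 8.98 | 92.28 |
| 0.50 | 50 | 0.85 | 0.30 | 74.83 | 1.52 | 92.52 | 0.19 | 52.81 | 9.43 | 90.60 |
| 0.65 | 40 | 0.15 | 1.83 | 59.21 | 10.73 | 91.49 | 0.84 | 41.79 | 20.19 | 85.12 |
| 0.65 | 40 | 0.35 | 0.13 | 48.17 | 8.20 | 93.27 | 0.03 | 37.31 | 11.81 | 90.53 |
| 0.65 | 40 | 0.65 | 0.96 | 58.66 | 10.85 | 92.49 | 0.12 | 37.76 | 12.19 | 90.19 |
| 0.65 | 40 | 0.85 | 0.35 | 72.45 | 3.65 | 93.21 | 0.30 | 37.88 | 6.72 | 90.97 |
| 0.65 | 50 | 0.15 | 1.22 | 57.08 | 6.61 | 92.64 | 0.76 | 38.88 | 14.32 | 89.01 |
| 0.65 | 50 | 0.35 | 0.25 | 48.37 | 8.05 | 93.36 | 0.11 | 37.23 | 10.87 | 91.13 |
| 0.65 | 50 | 0.65 | 0.75 | 57.93 | 9.03 | 93.13 | 0.01 | 37.69 | 11.68 | 91.14 |
| 0.65 | 50 | 0.85 | 0.24 | 69.75 | 0.39 | 93.46 | 0.20 | 36.78 | 6.91 | 90.68 |
| 0.75 | 40 | 0.15 | 2.70 | 57.40 | 12.75 | 90.43 | 0.04 | 32.76 | 23.97 | 80.00 |
| 0.75 | 40 | 0.35 | 0.45 | 42.59 | 5.66 | 93.52 | 0.23 | 27.58 | 11.01 | 90.11 |
| 0.75 | 40 | 0.65 | 0.99 | 55.48 | 10.63 | 92.79 | 0.07 | 32.62 | 12.27 | 90.25 |
| 0.75 | 40 | 0.85 | 0.38 | 72.24 | 4.16 | 92.63 | 0.29 | 32.86 | 8.02 | 90.87 |
| 0.75 | 50 | 0.15 | 1.42 | 54.30 | 6.21 | 92.82 | 0.42 | 29.38 | 14.39 | 86.96 |
| 0.75 | 50 | 0.35 | 0.22 | 43.57 | 8.09 | 93.34 | 0.02 | 28.00 | 14.17 | 89.79 |
| 0.75 | 50 | 0.65 | 0.80 | 54.38 | 8.28 | 93.10 | 0.05 | 31.68 | 9.94 | 91.38 |
| 0.75 | 50 | 0.85 | 0.22 | 68.89 | 0.78 | 93.34 | 0.17 | 30.34 | 5.44 | 90.83 |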

3.6.2 Poisson response

Nine populations of N = 1000 individuals each were generated. A single auxiliary variable, x, was drawn from a chi-square distribution with 1 degree of freedom. For element k of each population, $\mu_k^U$ was calculated as $\mu_k^U = \exp(\beta_{U1} + \beta_{U2} x_k)$, and the interest variable, $y_k$, was randomly generated from a Poisson distribution with parameter $\mu_k^U$. The $\beta_U$ values in each population were chosen so as to give $R^2_U$ = 0.5, 0.65 and 0.75 and $m = N^{-1} t_y$ = 1.5, 3.0 and 5.0. From each population, 10000 independent samples of sizes n = 40 and n = 50 were drawn under the SRS scheme. These samples were used to calculate estimates of $m$ and of the estimator's variance, as well as 95% confidence intervals (based on normality) for $m$, using two estimators assisted by regression models with normal and Poisson responses, denoted $\hat{m}_N$ and $\hat{m}_P$, respectively. These GEREG estimators satisfied the conditions of Corollary 1. The same performance measures as in the Bernoulli case were calculated for each estimator. Table 4 presents the results.

The relative bias of $\hat{m}_P$ was always smaller than that of $\hat{m}_N$, although it was very small for both estimators. The relative efficiency values in the table were always below 100%, indicating that both $\hat{m}_N$ and $\hat{m}_P$ were more efficient than the Horvitz-Thompson estimator; the estimator assisted by the model with Poisson response was always the more efficient of the two. The relative bias of $\widehat{Var}(\hat{m})$ was relatively large, mainly for the estimator assisted by the model with Poisson response. For both estimators, the 95% confidence interval coverage rates for $m$ were always below the nominal value; nevertheless, given the small sample sizes, the coverage rates were acceptable. In addition, $\hat{m}_P$ performed better than $\hat{m}_N$, mainly when the population mean $m$ was small.
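A single replicate of the Poisson experiment can be sketched as follows. The coefficients in `beta_true`, the seed and the helper `fit_poisson_irls` are hypothetical (the paper does not list the values it used), and the projection form of the estimator relies on the conditions of Corollary 1 (canonical link, intercept, constant dispersion).

```python
import numpy as np

def fit_poisson_irls(X, y, w, n_iter=50):
    """Solve the design-weighted Poisson (log-link) score equations by IRLS."""
    mu = y + 0.1                      # common GLM starting values
    eta = np.log(mu)
    for _ in range(n_iter):
        z = eta + (y - mu) / mu       # working response
        XtW = X.T * (w * mu)          # design weight times variance function
        beta = np.linalg.solve(XtW @ X, XtW @ z)
        eta = X @ beta
        mu = np.exp(eta)
    return beta

rng = np.random.default_rng(2024)
N, n = 1000, 50
x = rng.chisquare(df=1, size=N)
X_U = np.column_stack([np.ones(N), x])
beta_true = np.array([0.8, 0.2])      # hypothetical, not the paper's values
y = rng.poisson(np.exp(X_U @ beta_true))

s = rng.choice(N, size=n, replace=False)   # one SRS draw without replacement
w = np.full(n, N / n)                      # design weights 1 / pi_k
beta_hat = fit_poisson_irls(X_U[s], y[s], w)

mu_U = np.exp(X_U @ beta_hat)              # predictions for every population unit
resid_term = np.sum(w * (y[s] - mu_U[s]))  # weighted residual sum, zero at convergence
m_gereg = mu_U.mean()                      # Poisson-assisted estimate of m
m_ht = y[s].mean()                         # Horvitz-Thompson estimate under SRS
m_true = y.mean()
```

Repeating the last block over many independent draws of `s` would give the replicate values from which bias, efficiency and coverage are summarised.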

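Table 4. Performance of $\hat{m}_N$ and $\hat{m}_P$ in estimating the population mean $m$. Measures (%): RB = $|\hat{m}/m - 1|$, RE = $V(\hat{m})/V(\hat{m}_{HT})$, RBV = $|\hat{V}(\hat{m})/V(\hat{m}) - 1|$, CR = coverage rate of the 95% confidence interval.

| $R^2_U$ | n | m | RB ($\hat{m}_N$) | RE ($\hat{m}_N$) | RBV ($\hat{m}_N$) | CR ($\hat{m}_N$) | RB ($\hat{m}_P$) | RE ($\hat{m}_P$) | RBV ($\hat{m}_P$) | CR ($\hat{m}_P$) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.50 | 40 | 1.5 | 1.21 | 60.94 | 8.37 | 91.98 | 0.62 | 55.58 | 16.65 | 91.25 |
| 0.50 | 40 | 3.0 | 0.78 | 56.91 | 8.75 | 92.34 | 0.38 | 55.01 | 14.71 | 91.47 |
| 0.50 | 40 | 5.0 | 0.44 | 54.13 | 8.01 | 92.83 | 0.19 | 52.06 | 11.13 | 92.39 |
| 0.50 | 50 | 1.5 | 0.98 | 60.80 | 6.07 | 92.34 | 0.42 | 55.12 | 14.48 | 91.75 |
| 0.50 | 50 | 3.0 | 0.54 | 56.36 | 5.96 | 93.04 | 0.30 | 53.51 | 10.63 | 92.50 |
| 0.50 | 50 | 5.0 | 0.30 | 55.11 | 8.49 | 92.91 | 0.18 | 52.66 | 11.10 | 92.58 |
| 0.65 | 40 | 1.5 | 2.00 | 58.41 | 14.77 | 89.18 | 1.35 | 49.11 | 16.42 | 88.08 |
| 0.65 | 40 | 3.0 | 1.43 | 51.69 | 10.44 | 91.05 | 0.37 | 37.24 | 15.45 | 91.27 |
| 0.65 | 40 | 5.0 | 0.80 | 45.99 | 11.18 | 91.37 | 0.28 | 37.23 | 14.95 | 90.99 |
| 0.65 | 50 | 1.5 | 1.65 | 56.64 | 9.55 | 90.50 | 1.01 | 43.68 | 17.17 | 89.33 |
| 0.65 | 50 | 3.0 | 1.08 | 51.02 | 7.72 | 91.61 | 0.21 | 35.85 | 10.55 | 92.26 |
| 0.65 | 50 | 5.0 | 0.76 | 45.68 | 9.27 | 92.26 | 0.12 | 36.97 | 12.76 | 92.26 |
| 0.75 | 40 | 1.5 | 3.22 | 58.22 | 14.16 | 84.41 | 0.65 | 34.80 | 14.59 | 85.49 |
| 0.75 | 40 | 3.0 | 1.85 | 49.67 | 14.06 | 88.63 | 0.59 | 30.20 | 18.12 | 88.74 |
| 0.75 | 40 | 5.0 | 1.17 | 43.07 | 12.33 | 90.36 | 0.01 | 26.79 | 17.01 | 91.51 |
| 0.75 | 50 | 1.5 | 2.52 | 58.47 | 10.11 | 86.11 | 0.49 | 32.03 | 17.59 | 87.22 |
| 0.75 | 50 | 3.0 | 1.41 | 48.60 | 10.57 | 89.92 | 0.50 | 28.96 | 13.10 | 89.98 |
| 0.75 | 50 | 5.0 | 1.12 | 43.02 | 10.63 | 91.02 | 0.16 | 26.23 | 13.83 | 92.08 |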

3.6.3 Gamma response

Nine populations of N = 1000 individuals each were generated. A single auxiliary variable, x, was drawn from a chi-square distribution with 1 degree of freedom. For element k of each population, $\mu_k^U$ was calculated as $\mu_k^U = (\beta_{U1} + \beta_{U2} x_k)^{-1}$, and the interest variable, $y_k$, was randomly generated from a Gamma distribution with parameters $\mu_k^U$ and $\phi$. In each population, the values of $\beta_U$ and $\phi$ were chosen so as to give $R^2_U$ = 0.5, 0.65 and 0.75 and $m$ = 5, 10 and 15. From each population, 10000 independent samples of sizes n = 40 and n = 50 were drawn under the SRS scheme. These samples were used to calculate estimates of $m$ and of the estimator's variance, as well as 95% confidence intervals (based on normality) for $m$, using two estimators assisted by regression models with normal and Gamma responses, denoted $\hat{m}_N$ and $\hat{m}_G$, respectively. These GEREG estimators satisfied the conditions of Corollary 1. The same performance measures as in the Bernoulli and Poisson cases were calculated for each estimator. Table 5 presents the results.

The relative bias of $\hat{m}_G$ was always smaller than that of $\hat{m}_N$, although it was very small for both estimators. The relative efficiency values in the table were always below 100%, indicating that both $\hat{m}_N$ and $\hat{m}_G$ were more efficient than the Horvitz-Thompson estimator; the estimator assisted by the model with Gamma response was always the more efficient of the two. The relative bias of $\widehat{Var}(\hat{m})$ was relatively small. For both estimators, the 95% confidence interval coverage rates for $m$ were always below the nominal value; nevertheless, given the small sample sizes, the coverage rates were acceptable. In addition, $\hat{m}_G$ performed better than $\hat{m}_N$, mainly when the population mean $m$ was small.
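The text does not say how $\beta_U$ and $\phi$ were tuned to reach a target $R^2_U$. One possible way, sketched below as an assumption rather than the authors' procedure, uses the fact that if $y_k$ has mean $\mu_k$ and variance $\mu_k^2/\phi$, the squared correlation between $y$ and $\mu$ is roughly $Var(\mu)/\{Var(\mu) + E(\mu^2)/\phi\}$, which can be solved for $\phi$. The coefficients `beta1`, `beta2` and the seed are hypothetical, and $\phi$ is taken to be the Gamma shape (precision) parameter.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 1000
x = rng.chisquare(df=1, size=N)

beta1, beta2 = 0.10, 0.05                  # hypothetical coefficients
mu = 1.0 / (beta1 + beta2 * x)             # reciprocal link: mu = 1 / eta

target_r2 = 0.5
# Solve Var(mu) / (Var(mu) + E(mu^2) / phi) = target_r2 for phi.
phi = target_r2 * np.mean(mu ** 2) / ((1 - target_r2) * np.var(mu))

# Gamma with mean mu and variance mu**2 / phi (shape = phi, scale = mu / phi).
y = rng.gamma(shape=phi, scale=mu / phi)

r2 = np.corrcoef(y, mu)[0, 1] ** 2         # realised squared Pearson correlation
m = y.mean()                               # realised population mean
```

The realised `r2` fluctuates around the target because the population is a single random draw; the population mean can be moved by rescaling `beta1` and `beta2` together.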

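Table 5. Performance of $\hat{m}_N$ and $\hat{m}_G$ in estimating the population mean $m$. Measures (%): RB = $|\hat{m}/m - 1|$, RE = $V(\hat{m})/V(\hat{m}_{HT})$, RBV = $|\hat{V}(\hat{m})/V(\hat{m}) - 1|$, CR = coverage rate of the 95% confidence interval.

| $R^2_U$ | n | m | RB ($\hat{m}_N$) | RE ($\hat{m}_N$) | RBV ($\hat{m}_N$) | CR ($\hat{m}_N$) | RB ($\hat{m}_G$) | RE ($\hat{m}_G$) | RBV ($\hat{m}_G$) | CR ($\hat{m}_G$) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.50 | 40 | 5 | 1.29 | 69.75 | 8.62 | 92.02 | 0.19 | 53.58 | 4.82 | 93.22 |
| 0.50 | 40 | 10 | 0.55 | 66.39 | 11.10 | 92.04 | 0.03 | 55.73 | 5.66 | 93.36 |
| 0.50 | 40 | 15 | 0.54 | 61.45 | 8.81 | 92.20 | 0.09 | 52.28 | 4.23 | 93.30 |
| 0.50 | 50 | 5 | 0.94 | 67.97 | 7.62 | 92.24 | 0.18 | 52.76 | 4.52 | 93.24 |
| 0.50 | 50 | 10 | 0.51 | 63.59 | 6.45 | 93.04 | 0.04 | 53.27 | 0.84 | 94.34 |
| 0.50 | 50 | 15 | 0.35 | 58.96 | 5.37 | 93.36 | 0.06 | 51.15 | 3.56 | 93.84 |
| 0.65 | 40 | 5 | 1.76 | 72.40 | 14.26 | 90.66 | 0.12 | 41.65 | 4.96 | 93.08 |
| 0.65 | 40 | 10 | 1.15 | 58.86 | 12.61 | 91.28 | 0.02 | 37.61 | 2.93 | 93.96 |
| 0.65 | 40 | 15 | 0.80 | 53.42 | 10.63 | 92.06 | 0.09 | 37.60 | 7.10 | 93.32 |
| 0.65 | 50 | 5 | 1.38 | 71.60 | 12.58 | 91.90 | 0.11 | 41.59 | 4.67 | 93.18 |
| 0.65 | 50 | 10 | 0.87 | 59.87 | 12.23 | 92.08 | 0.03 | 37.45 | 0.81 | 94.10 |
| 0.65 | 50 | 15 | 0.54 | 51.95 | 10.02 | 92.36 | 0.08 | 37.58 | 1.75 | 93.78 |
| 0.75 | 40 | 5 | 2.19 | 76.46 | 12.65 | 90.40 | 0.24 | 34.61 | 8.66 | 92.18 |
| 0.75 | 40 | 10 | 1.47 | 63.31 | 18.95 | 90.42 | 0.10 | 32.57 | 6.74 | 92.66 |
| 0.75 | 40 | 15 | 1.23 | 58.05 | 19.59 | 89.54 | 0.12 | 29.98 | 4.12 | 93.30 |
| 0.75 | 50 | 5 | 1.62 | 76.49 | 11.76 | 91.26 | 0.16 | 36.12 | 6.36 | 92.22 |
| 0.75 | 50 | 10 | 1.08 | 61.48 | 15.09 | 90.76 | 0.10 | 32.84 | 6.39 | 92.98 |
| 0.75 | 50 | 15 | 0.84 | 55.31 | 13.96 | 91.06 | 0.06 | 29.49 | 4.08 | 93.64 |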

4 Applications

4.1 Academic Performance Index (API)

The academic performance index (API) has been used annually as a performance measure for every school in California. The API is based on a standardized test administered to students. Data from 2000 were analyzed. The data set contains 6194 observations and 37 variables and was obtained from the Survey package for the statistical software R (www.r-project.org). The goal was to estimate the average API for the population of schools using a stratified sample design. The strata were formed by the levels of the Stype variable: elementary (E), middle (M) and high school (H). The performance of the SRS and BER sampling schemes was investigated in each stratum. The $\hat{m}_G$ estimator discussed in section 3.5.6 was applied; the estimation in each stratum was assisted by models with normal and Gamma responses, as described in (9) and (13). The auxiliary variable was not.hsg, the percentage of parents in each school who did not graduate from high school. The stratum sizes were N1 = 4421, N2 = 1018 and N3 = 755. The stratum sample sizes were 100, 50 and 50 for the first, second and third stratum, respectively. Under the BER sampling scheme, these numbers correspond to expected sample sizes.
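To make the remark about expected sample sizes concrete, the sketch below assumes that BER denotes Bernoulli sampling, in which every unit of a stratum is selected independently with the same probability $\pi_h = n_h / N_h$. The seed and variable names are illustrative only.

```python
import numpy as np

N_h = np.array([4421, 1018, 755])       # stratum sizes (E, M, H) from the text
n_h = np.array([100, 50, 50])           # SRS sizes, used as expected BER sizes
pi_h = n_h / N_h                        # common inclusion probability per stratum

expected_size = N_h * pi_h                       # equals n_h by construction
sd_size = np.sqrt(N_h * pi_h * (1 - pi_h))       # sample size is Binomial(N_h, pi_h)

rng = np.random.default_rng(1)
# One realised Bernoulli sample size per stratum: it varies from draw to draw.
realized = np.array([rng.binomial(N, p) for N, p in zip(N_h, pi_h)])
```

Under this reading the realised sample size is random, with a standard deviation of roughly 10 schools in the elementary stratum, whereas SRS fixes the size in advance.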

Table 6 shows the relative efficiency ($\Delta_p(\cdot)$) of $\hat{m}_G$ with respect to the Horvitz-Thompson estimator for the two regression models and the two sampling schemes. The Gamma-assisted $\hat{m}_G$ estimator was more efficient than the normal-assisted one in every stratum. The variance of the normal-assisted estimator was approximately 1.11 times that of the Gamma-assisted estimator under both sampling schemes. Figure 1 shows the relationship between the interest and auxiliary variables. The $R^2_U$ values for the normal model's goodness of fit were 0.51, 0.54 and 0.61 for E, M and H, respectively. The $R^2_U$ values for the Gamma model were 0.55, 0.60 and 0.66 for E, M and H.

Table 6. Relative efficiency (percentage) of $\hat{m}_G$ with respect to the Horvitz-Thompson estimator for the API example.

| Sampling scheme | Assisting model | Elementary | Middle | High school | Total |
|---|---|---|---|---|---|
| SRS | Normal | 49.11 | 46.48 | 38.76 | 48.95 |
| SRS | Gamma | 44.65 | 40.32 | 33.92 | 43.93 |
| BER | Normal | 1.70 | 1.49 | 1.00 | 1.76 |
| BER | Gamma | 1.64 | 1.41 | 0.95 | 1.59 |

Fig. 1. Scatter plots between interest and auxiliary variables for the API example. [Panels: Elementary, Middle, High School and Total. Horizontal axis: percent of parents who are not high-school graduates. Vertical axis: API 2000. Fitted curves: Normal and Gamma.]

4.2 Quality of life in Colombia

The unsatisfied basic necessities indicator (UBN) was calculated for the 1118 Colombian municipalities, based on the household and population census carried out by the National Administrative Department of Statistics - DANE (www.dane.gov.co). The data refer to 2005. The UBN indicates whether the population's basic needs are being met, with regard to i) inadequate housing, ii) overcrowded households, iii) households with inadequate services, iv) households with high economic dependence and v) families where children do not attend school. The goal was to estimate the proportion of municipalities (excluding the capital, Bogotá) with a UBN higher than 25%. A stratified sample design was used, with strata based on a Distance variable defined as the distance, in kilometers, between each municipality and the closest city with more than 1 million inhabitants. Three strata were considered: less than 150 km (<150 km), 150 to 200 km (150-200 km) and over 200 km (>200 km). The performance of the SRS and BER schemes was investigated in each stratum. The $\hat{m}_G$ estimator discussed in section 3.5.6 was applied; the estimation in each stratum was assisted by models with normal and Bernoulli responses, as described in (9) and (11). The auxiliary variable was Density, the average number of people living in an area of 1 km$^2$ of the city. The stratum sizes were N1 = 308, N2 = 395 and N3 = 414. The stratum sample sizes were 46, 59 and 62 for the first, second and third stratum, respectively. Under BER sampling, these sizes are expected values.

Table 7 shows the relative efficiency ($\Delta_p(\cdot)$) of the $\hat{m}_G$ estimator with respect to the Horvitz-Thompson estimator for both regression models and sampling schemes. The Bernoulli-assisted $\hat{m}_G$ estimator was more efficient than the normal-assisted one in every stratum. The variance of the normal-assisted estimator was approximately 1.18 times that of the Bernoulli-assisted estimator under both sampling schemes. Figure 2 shows the relationship between the interest and auxiliary variables. The $R^2_U$ values for the normal model's goodness of fit were 0.06, 0.27 and 0.13 for <150, 150-200 and >200, respectively. The $R^2_U$ values for the Bernoulli model were 0.24, 0.40 and 0.20 for <150, 150-200 and >200, respectively.


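Table 7. Relative efficiency (percentage) of $\hat{m}_G$ with respect to the Horvitz-Thompson estimator for the Quality of life in Colombia application when the interest variable is dichotomous.

| Sampling scheme | Assisting model | Distance < 150 km | Distance 150-200 km | Distance > 200 km | Total |
|---|---|---|---|---|---|
| SRS | Normal | 93.09 | 72.76 | 87.43 | 85.61 |
| SRS | Bernoulli | 76.60 | 59.82 | 80.61 | 72.41 |
| BER | Normal | 21.68 | 6.99 | 6.60 | 14.93 |
| BER | Bernoulli | 21.25 | 6.86 | 6.43 | 12.57 |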

Additionally, we wanted to estimate the average number of UBN components higher than 10%. In this case the interest variable takes the values 0, 1, 2, 3, 4 and 5, which motivates an estimator assisted by a Poisson regression model. Table 8 shows the relative efficiency ($\Delta_p(\cdot)$) of the $\hat{m}_G$ estimator with respect to the Horvitz-Thompson estimator for both regression models and sampling schemes. The Poisson-assisted $\hat{m}_G$ estimator was more efficient than the normal-assisted one in all strata. The variance of the normal-assisted estimator was about 1.17 times that of the Poisson-assisted estimator under both sampling schemes. Figure 3 shows the relationship between the interest and auxiliary variables. The $R^2_U$ values for the normal model's goodness of fit were 0.06, 0.16 and 0.10 for <150, 150-200 and >200, respectively. The $R^2_U$ values for the Poisson model were 0.24, 0.27 and 0.21 for <150, 150-200 and >200, respectively.

Fig. 2. Scatter plots between interest and auxiliary variables for the Quality of life in Colombia application when the interest variable is dichotomous. [Panels: Distance less than 150 km, Distance between 150 and 200 km, Distance more than 200 km and Total. Horizontal axis: population density (persons/km2). Vertical axis: UBN more than 25%. Fitted curves: Normal and Bernoulli.]


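Table 8. Relative efficiency (percentage) of $\hat{m}_G$ with respect to the Horvitz-Thompson estimator for the Quality of life in Colombia application when the interest variable is a counting process.

| Sampling scheme | Assisting model | Distance < 150 km | Distance 150-200 km | Distance > 200 km | Total |
|---|---|---|---|---|---|
| SRS | Normal | 93.88 | 83.84 | 90.37 | 89.14 |
| SRS | Poisson | 76.42 | 73.23 | 79.09 | 76.23 |
| BER | Normal | 23.22 | 13.01 | 11.16 | 17.68 |
| BER | Poisson | 22.21 | 12.66 | 10.77 | 14.91 |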
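Because each relative efficiency is a variance divided by the same Horvitz-Thompson variance, the variance ratios quoted in Section 4 can be cross-checked by dividing the Total-column entries of Tables 6, 7 and 8. The short script below does only that arithmetic; the dictionary labels are illustrative.

```python
# Total-column relative efficiencies (%) copied from Tables 6, 7 and 8:
# (normal-assisted, GLM-assisted) for each application and sampling scheme.
totals = {
    "API / Gamma, SRS": (48.95, 43.93),
    "API / Gamma, BER": (1.76, 1.59),
    "UBN > 25% / Bernoulli, SRS": (85.61, 72.41),
    "UBN > 25% / Bernoulli, BER": (14.93, 12.57),
    "UBN components / Poisson, SRS": (89.14, 76.23),
    "UBN components / Poisson, BER": (17.68, 14.91),
}

# Ratio of variances: normal-assisted over GLM-assisted.
ratios = {name: normal / glm for name, (normal, glm) in totals.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.3f}")
```

The API and Bernoulli ratios agree with the quoted 1.11 and 1.18 to within about 0.01. For the Poisson case the SRS ratio is close to 1.17, while the BER ratio comes out nearer 1.19.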

Fig. 3. Scatter plots between interest and auxiliary variables for the Quality of life in Colombia example when the interest variable is a counting process. [Panels: Distance less than 150 km, Distance between 150 and 200 km, Distance more than 200 km and Total. Vertical axis: components of UBN more than 10%. Fitted curves: Normal and Poisson.]

Appendix

A Theorem 1

Proof.

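i) The estimate $\hat{\beta}_U$ satisfies $U_U(\hat{\beta}_U) = 0$, i.e., $U_{Uj}(\hat{\beta}_U) = h_j^T E = 0$, $j = 1, \ldots, J$, where $h_j = (h_{1j}, \ldots, h_{Nj})^T$ and $E = (E_1, \ldots, E_N)^T$, with $h_{kj} = \phi_k \frac{\partial\theta_k}{\partial\eta_k} x_{kj}$. If there exists $a = (a_1, \ldots, a_J)^T$ such that $\sum_{j=1}^{J} a_j h_j = 1_N$, then

$$\sum_{k \in U} (y_k - \mu_k^U) = 1_N^T E = \Big(\sum_{j=1}^{J} a_j h_j\Big)^T E = \sum_{j=1}^{J} a_j h_j^T E.$$

Since $U_{Uj}(\hat{\beta}_U) = 0$ for $j = 1, \ldots, J$, we have $\sum_{j=1}^{J} a_j h_j^T E = 0$; therefore, $t_y = \sum_{k \in U} \mu_k^U$.

ii) The estimate $\hat{\beta}^{\pi}_S$ satisfies $U^{\pi}_S(\hat{\beta}^{\pi}_S) = 0$, i.e., $U^{\pi}_{Sj}(\hat{\beta}^{\pi}_S) = h_{Sj}^T e^* = 0$, $j = 1, \ldots, J$, where $h_{Sj}$ and $e^*$ are n-dimensional column vectors with elements $h_{kj}$ and $e_k/\pi_k$, respectively, with $k \in S$. If there exists $a$ such that $\sum_{j=1}^{J} a_j h_j = 1_N$, then $\sum_{j=1}^{J} a_j h_{Sj} = 1_n$. Therefore,

$$\sum_{k \in S} \frac{y_k - \mu_k^S}{\pi_k} = 1_n^T e^* = \Big(\sum_{j=1}^{J} a_j h_{Sj}\Big)^T e^* = \sum_{j=1}^{J} a_j h_{Sj}^T e^*.$$

Since $U^{\pi}_{Sj}(\hat{\beta}^{\pi}_S) = 0$ for $j = 1, \ldots, J$, we have $\sum_{j=1}^{J} a_j h_{Sj}^T e^* = 0$; then, $\hat{t}_G = \sum_{k \in U} \mu_k^S$.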
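Part ii) can be checked numerically. The sketch below assumes the usual model-assisted form of the estimator, $\sum_{k \in U} \mu_k^S + \sum_{k \in S} (y_k - \mu_k^S)/\pi_k$, which is what the proof suggests but is not restated here. It fits a $\pi$-weighted logistic model (canonical link, intercept, $\phi_k = 1$) under unequal inclusion probabilities. The population, the probabilities `pi` and the seed are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(11)
N = 2000
x = rng.uniform(0, 2, N)
X_U = np.column_stack([np.ones(N), x])           # intercept plus one covariate
p_true = 1 / (1 + np.exp(-(-0.5 + 0.8 * x)))     # hypothetical population model
y = rng.binomial(1, p_true)

pi = 0.1 + 0.1 * x                # unequal inclusion probabilities in [0.1, 0.3)
in_s = rng.random(N) < pi         # independent selection of each unit
X_S, y_S, w = X_U[in_s], y[in_s], 1 / pi[in_s]

beta = np.zeros(2)
for _ in range(50):               # Newton steps for the pi-weighted score equations
    mu = 1 / (1 + np.exp(-(X_S @ beta)))
    score = X_S.T @ (w * (y_S - mu))
    info = (X_S.T * (w * mu * (1 - mu))) @ X_S
    beta = beta + np.linalg.solve(info, score)

mu_S = 1 / (1 + np.exp(-(X_S @ beta)))           # fitted values in the sample
mu_U = 1 / (1 + np.exp(-(X_U @ beta)))           # predictions for the population
weighted_resid = np.sum(w * (y_S - mu_S))        # sum over S of (y - mu) / pi
t_projection = mu_U.sum()                        # sum over U of mu_k^S
t_general = mu_U.sum() + weighted_resid          # assumed general form
```

With an intercept in the model, the first score equation is exactly the weighted residual sum, so `t_general` and `t_projection` coincide up to rounding error even though the $\pi_k$ are unequal.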

B Corollary 1

Proof. Homogeneity of the dispersion parameter, a systematic component with intercept and a canonical link lead to $h_{kj} = \phi x_{kj}$, where $h_{k1} = \phi$. Then, with $a_1 = \phi^{-1}$ and $a_j = 0$ for $j \neq 1$, we have $\sum_{j=1}^{J} a_j h_j = 1_N$. Therefore, by Theorem 1, $t_y = \sum_{k \in U} \mu_k^U$ and $\hat{t}_G = \sum_{k \in U} \mu_k^S$.

C Corollary 2

Proof. A normal response, a systematic component with canonical link and a dispersion parameter expressed as $\phi_k^{-1} = \sum_{j=1}^{J} a_j x_{kj}$ lead to

$$h_{kj} = \frac{x_{kj}}{\sum_{r=1}^{J} a_r x_{kr}}.$$

It can be verified that $\sum_{j=1}^{J} a_j h_j = 1_N$. Therefore, by Theorem 1, $t_y = \sum_{k \in U} \mu_k^U$ and $\hat{t}_G = \sum_{k \in U} \mu_k^S$.

Lemma 1. If $\hat{t}_G$ is based on a model $\xi$ with homogeneity in the dispersion parameter and a systematic component with intercept and canonical link, then $\mu^T(Y_U - \mu) \approx 0$, with $\mu = (\mu_1^U, \ldots, \mu_N^U)^T$.

Proof. Expanding $\mu_k^U$ in a first-order Taylor series around $x_k^\circ = (1, x_{k2}^\circ, \ldots, x_{kJ}^\circ)^T$, we have

$$\mu_k^U = \mu(x_k, \hat{\beta}_U) \approx \mu_k^\circ + \sum_{j=2}^{J} V(\mu_k^\circ)\,\hat{\beta}_{Uj}\,(x_{kj} - x_{kj}^\circ),$$

where $\mu_k^\circ = \mu(x_k^\circ, \hat{\beta}_U)$. For example, with $x_{kj}^\circ = N^{-1} t_{x_j} = \bar{x}_{Uj}$, $j = 2, \ldots, J$, we have $x_k^\circ = x^\circ$ and therefore $\mu_k^\circ = \mu^\circ$; then

$$\mu_k^U \approx \mu^\circ + \sum_{j=2}^{J} V(\mu^\circ)\,\hat{\beta}_{Uj}\,(x_{kj} - \bar{x}_{Uj}) = \sum_{j=1}^{J} m_j x_{kj},$$

where $m_j = V(\mu^\circ)\hat{\beta}_{Uj}$ for $j \neq 1$ and $m_1 = \mu^\circ - \sum_{j=2}^{J} V(\mu^\circ)\hat{\beta}_{Uj}\bar{x}_{Uj}$. In matrix form, $\mu$ can be written as

$$\mu \approx X_U m,$$

with $m = (m_1, \ldots, m_J)^T$. The estimate $\hat{\beta}_U$ satisfies $U_U(\hat{\beta}_U) = 0$. As $\hat{t}_G$ is based on a model $\xi$ with homogeneity in the dispersion parameter and a systematic component with intercept and canonical link, $U_U(\beta)$ can be expressed as $U_U(\beta) = X_U^T(Y_U - \mu)$; therefore, $X_U^T(Y_U - \mu) = 0$, and

$$\mu^T(Y_U - \mu) \approx (X_U m)^T (Y_U - \mu) = m^T X_U^T (Y_U - \mu) = m^T 0 = 0.$$
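As a sanity check on the lemma, consider the special case of a normal response with identity link, which is not treated separately in the text. There $V(\mu) = 1$ and the Taylor expansion is exact, so the approximation becomes an equality:

```latex
% Identity link: \mu_k^U = \sum_{j=1}^{J} \hat{\beta}_{Uj} x_{kj}, with x_{k1} = 1 and V(\mu) = 1.
\begin{align*}
m_j &= V(\mu^\circ)\,\hat{\beta}_{Uj} = \hat{\beta}_{Uj}, \qquad j \neq 1, \\
m_1 &= \mu^\circ - \sum_{j=2}^{J} \hat{\beta}_{Uj}\,\bar{x}_{Uj}
     = \Big(\hat{\beta}_{U1} + \sum_{j=2}^{J} \hat{\beta}_{Uj}\,\bar{x}_{Uj}\Big)
       - \sum_{j=2}^{J} \hat{\beta}_{Uj}\,\bar{x}_{Uj}
     = \hat{\beta}_{U1}, \\
\mu &= X_U m = X_U \hat{\beta}_U, \\
\mu^T (Y_U - \mu) &= \hat{\beta}_U^T X_U^T (Y_U - \mu) = \hat{\beta}_U^T 0 = 0 .
\end{align*}
```

For the other members of the exponential family the equality holds only to first order, which is why the lemma is stated with $\approx$.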

D Theorem 2

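Proof. According to Pregibon (1981), at convergence of the iterative process for solving $U^{\pi}_S(\hat{\beta}^{\pi}_S) = 0$, the estimate $\hat{\beta}^{\pi}_S$ can be expressed as

$$\hat{\beta}^{\pi}_S = (X_S^T V_S^* X_S)^{-1} X_S^T V_S^* Y_S^* \approx (X_S^T V_U^* X_S)^{-1} X_S^T V_U^* Y_U^*, \qquad \text{(D.1)}$$

where $V_S^* = \mathrm{diag}\{V_1^S\phi_1/\pi_1, \ldots, V_n^S\phi_n/\pi_n\}$, $V_U^* = \mathrm{diag}\{V_1^U\phi_1/\pi_1, \ldots, V_n^U\phi_n/\pi_n\}$, $Y_S^* = (y_1^S, \ldots, y_n^S)^T$ and $Y_U^* = (y_1^U, \ldots, y_n^U)^T$, with $y_k^S = \{\eta_k + V_k^{-1}(y_k - \mu_k)\}_{\beta = \hat{\beta}^{\pi}_S}$ and $y_k^U = \{\eta_k + V_k^{-1}(y_k - \mu_k)\}_{\beta = \hat{\beta}_U}$. From Result 5.10.1 of Särndal, Swensson and Wretman (1992), $\hat{\beta}^{\pi}_S$ (as in expression D.1) is an approximately unbiased estimator of $\hat{\beta}_U$ (i.e., $E(\hat{\beta}^{\pi}_S - \hat{\beta}_U) \approx 0$), with approximate variance-covariance matrix given by

$$(X_U^T W_U^* X_U)^{-1}\,\Sigma_U\,(X_U^T W_U^* X_U)^{-1}.$$

Expanding the GEREG estimator $\hat{t}_G$ in a first-order Taylor series around $\hat{\beta}_U$, we have

$$\hat{t}_G \approx \sum_{k \in U} \mu_k^U + \sum_{j=1}^{J} \sum_{k \in U} V_k^U x_{kj}\,(\hat{\beta}^{\pi}_{Sj} - \hat{\beta}_{Uj}) \approx t_y + 1_N^T V_U X_U (\hat{\beta}^{\pi}_S - \hat{\beta}_U).$$

Then, we can write

i) $E(\hat{t}_G) \approx t_y + 1_N^T V_U X_U\,E(\hat{\beta}^{\pi}_S - \hat{\beta}_U) \approx t_y$;

ii) $Var(\hat{t}_G) \approx 1_N^T V_U X_U\,Var(\hat{\beta}^{\pi}_S - \hat{\beta}_U)\,X_U^T V_U 1_N = \Gamma_U^T \Sigma_U \Gamma_U$; the estimator of $Var(\hat{t}_G)$ is obtained by replacing the unknown quantities in $Var(\hat{t}_G)$ with their estimators, i.e., $V_k^U$ by $V_k^S$ and $\sigma^U_{ij}$ by $\sigma^S_{ij}$, so that

iii) $\widehat{Var}(\hat{t}_G) = \Gamma_S^T \Sigma_S \Gamma_S$.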

E Corollary 3

Proof. As $\hat{t}_G$ is based on a model $\xi$ with homogeneity in the dispersion parameter and a systematic component with intercept and canonical link, we have

$$\Gamma_U = (X_U^T V_U^* X_U)^{-1} X_U^T V_U^* 1_N = (1, 0, \ldots, 0)^T_{(J \times 1)}.$$

Then $Var(\hat{t}_G) \approx \Gamma_U^T \Sigma_U \Gamma_U$ is the (1, 1) element of $\Sigma_U$, i.e., $Var(\hat{t}_G) \approx \sigma^U_{11}$ and $\widehat{Var}(\hat{t}_G) \approx \sigma^S_{11}$, where $x_{k1} = \phi_k = 1$ for $k \in U$.

F Theorem 3

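Proof. Using the definition of Pearson's linear correlation coefficient, we can write

$$1 - R^2_U = 1 - \frac{\Big[\sum_{k \in U} y_k \mu_k^U - N \bar{y}_U \bar{\mu}_U\Big]^2}{\Big(\sum_{k \in U} y_k^2 - N \bar{y}_U^2\Big)\Big(\sum_{k \in U} (\mu_k^U)^2 - N \bar{\mu}_U^2\Big)},$$

where $\bar{\mu}_U = N^{-1} \sum_{k \in U} \mu_k^U$. As the conditions of Corollary 1 are satisfied, we have $\sum_{k \in U} (y_k - \mu_k^U) = 0$; then $\bar{y}_U = \bar{\mu}_U$ and we can write

$$1 - R^2_U = 1 - \frac{\Big[\sum_{k \in U} y_k \mu_k^U - N \bar{y}_U^2\Big]^2}{\Big(\sum_{k \in U} y_k^2 - N \bar{y}_U^2\Big)\Big(\sum_{k \in U} (\mu_k^U)^2 - N \bar{y}_U^2\Big)}.$$

From Lemma 1 we have $\mu^T(Y_U - \mu) \approx 0$, i.e., $\sum_{k \in U} y_k \mu_k^U \approx \sum_{k \in U} (\mu_k^U)^2$; therefore

$$1 - R^2_U \approx 1 - \frac{\Big[\sum_{k \in U} (\mu_k^U)^2 - N \bar{y}_U^2\Big]^2}{\Big(\sum_{k \in U} y_k^2 - N \bar{y}_U^2\Big)\Big(\sum_{k \in U} (\mu_k^U)^2 - N \bar{y}_U^2\Big)}$$

$$= 1 - \frac{\sum_{k \in U} (\mu_k^U)^2 - N \bar{y}_U^2}{\sum_{k \in U} y_k^2 - N \bar{y}_U^2}$$

$$\approx \frac{\sum_{k \in U} y_k^2 - 2 \sum_{k \in U} y_k \mu_k^U + \sum_{k \in U} (\mu_k^U)^2}{\sum_{k \in U} y_k^2 - N \bar{y}_U^2}$$

$$= \frac{\sum_{k \in U} (y_k - \mu_k^U)^2}{\sum_{k \in U} (y_k - \bar{y}_U)^2} = \Delta_{AAS}.$$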
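The final identity can be verified numerically in the one case where every step is exact, namely a census least-squares fit with intercept (normal response, identity link). The population below is invented for illustration; for Poisson, Bernoulli or Gamma fits the two sides would agree only approximately.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1000
x = rng.chisquare(df=1, size=N)
y = 2.0 + 1.5 * x + rng.normal(0.0, 2.0, size=N)   # hypothetical finite population

X = np.column_stack([np.ones(N), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # census fit with intercept
mu = X @ beta                                      # fitted means mu_k^U

r2 = np.corrcoef(y, mu)[0, 1] ** 2                 # squared Pearson correlation
ratio = np.sum((y - mu) ** 2) / np.sum((y - y.mean()) ** 2)

residual_sum = np.sum(y - mu)                      # zero: Corollary 1 condition
cross_term = mu @ (y - mu)                         # zero: Lemma 1, exact here
```

Here `1 - r2` and `ratio` coincide up to rounding error, which is the statement $1 - R^2_U = \Delta_{AAS}$ in the linear case.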

References

Breidt, F.J. and Opsomer, J.D. (2000). Local polynomial regression estimators in survey sampling. Annals of Statistics, 28, 1026-1053.

Cox, D.R. and Hinkley, D.V. (1974). Theoretical Statistics. London: Chapman and Hall.

Dobson, A.J. (2001). An Introduction to Generalized Linear Models, 2nd Edition. London: Chapman and Hall.

Duchesne, P. (2003). Estimation of a proportion with survey data. Journal of Statistics Education, 11, 1-24.

Estevao, V.M. and Särndal, C.E. (2004). Borrowing strength is not the best technique within a wide class of design-consistent domain estimators. Journal of Official Statistics, 20, 645-669.

Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models, 2nd Edition. London: Sage.

Fuller, W.A. (2002). Regression estimation for survey samples. Survey Methodology, 28, 5-23.

Isaki, C.T. and Fuller, W.A. (1982). Survey design under the regression superpopulation model. Journal of the American Statistical Association, 77, 89-96.

Li, Yan (2008). Generalized regression estimators of a finite population total using the Box-Cox technique. Survey Methodology, 34, 79-89.

Lehtonen, R. and Veijanen, A. (1998). Logistic generalized regression estimators. Survey Methodology, 24, 51-55.

Lehtonen, R., Särndal, C.E. and Veijanen, A. (2003). The effect of model choice in estimation for domains, including small domains. Survey Methodology, 29, 33-44.

Lehtonen, R. and Pahkinen, E.J. (2004). Practical Methods for Design and Analysis of Complex Surveys, 2nd Edition. Chichester: Wiley.

Lehtonen, R., Särndal, C.E. and Veijanen, A. (2005). Does the model matter? Comparing model-assisted and model-dependent estimators of class frequencies for domains. Statistics in Transition, 7, 649-673.

McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models, 2nd Edition. London: Chapman and Hall.

Pregibon, D. (1981). Logistic regression diagnostics. Annals of Statistics, 9, 705-724.

Särndal, C.E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. New York: Springer Verlag.

Skinner, C.J. and Holt, D. (1989). Analysis of Complex Surveys. New York: John Wiley and Sons.

Wright, R.L. (1983). Finite population sampling with multivariate auxiliary information. Journal of the American Statistical Association, 78, 879-884.
