Finite Population Estimation Under Generalized Linear Model Assistance
Luz Marina Rondón
Universidad Nacional de Colombia
Luis Hernando Vanegas
Universidad Nacional de Colombia
Cristiano Ferraz
Universidad Federal de Pernambuco
Reporte Interno de Investigación No. 17
Departamento de Estadística
Facultad de Ciencias
Universidad Nacional de Colombia
Bogotá, COLOMBIA
Finite population estimation under generalized linear model assistance
Luz Marina Rondón (a,∗), Luis Hernando Vanegas (a), Cristiano Ferraz (b)
(a) Universidad Nacional de Colombia, Facultad de Ciencias, Departamento de Estadística, Carrera 30 No. 45-03, Bogotá, Colombia
(b) Departamento de Estatística, CCEN-UFPE, Cidade Universitária, Recife, PE, Brazil 50740-540
Abstract
Finite population estimation is the overall goal of sample surveys. When information on auxiliary variables is available, one may take advantage of general regression estimators (GREG) to improve the precision of sample estimates. GREG estimators may be derived when the relationship between the interest and auxiliary variables is represented by a normal linear model. However, in some scenarios, such as estimating class frequencies or totals of counting processes, Bernoulli or Poisson response models describe the relationship between the interest and auxiliary variables better than normal linear models do. This paper focuses on the general case in which the relationship between the interest variable and the available auxiliary variables may be suitably described by a generalized linear model. In this scenario, the finite population distribution of the variable of interest is viewed as if generated by an exponential family distribution, which includes the Bernoulli, Poisson, Gamma and inverse Gaussian distributions. The resulting estimator is a generalized linear model regression estimator (GEREG). Its general form and basic statistical properties are presented and studied analytically and, empirically, through Monte Carlo experiments. Three applications are presented in which the GEREG estimator performs better than GREG.
Key words: Auxiliary information, generalized linear models, model-assisted estimation, pseudomaximum likelihood.
1 Introduction
Finite population estimation is the general goal of sample surveys. When interest lies in estimating means, totals and percentages, the parameters are estimated using the Horvitz-Thompson estimator.
∗ Corresponding author. Email address: [email protected] (Luz Marina Rondón).
Reporte interno de investigación, Universidad Nacional de Colombia. 4 March 2011
When information on auxiliary variables is available, however, one may take advantage of general regression estimators (GREG) to improve the precision of sample estimates. Such GREG estimators may be derived in a model-assisted estimation approach (Särndal, Swensson and Wretman (1992)) when the relationship between the interest and auxiliary variables is represented by a normal linear model. The list of authors working on the subject is broad and includes Isaki and Fuller (1982), Wright (1983), Fuller (2002) and Särndal, Swensson and Wretman (1992).

Lehtonen and Veijanen (1998) may have been the first to argue that in some instances (such as when estimating class frequencies) models for multinomial responses describe the relationship between the interest and auxiliary variables better than normal linear models do. The properties of the LGREG estimator, a multinomial logistic model-assisted estimator, have been presented and studied. Breidt and Opsomer (2000) focused on an approach assisted by nonparametric regression modeling, proposing local polynomial regression estimators and studying their properties. Estevao and Särndal (2004) studied several applications of domain estimation using calibration. Duchesne (2003) used Monte Carlo experiments to illustrate the efficiency of inference based on an estimator assisted by logistic regression relative to one assisted by normal regression. Lehtonen, Särndal and Veijanen (2003) compared the performance of different estimators that use auxiliary information at the estimation stage, including those assisted by the logistic regression model. Lehtonen, Särndal and Veijanen (2005) studied the importance of model specification for estimating the total of a polytomous interest variable for a number of large or small domains. Li (2008) introduced the Box-Cox technique into the generalized regression estimator, which can be especially suitable for highly skewed, continuous and positive interest variables, for which normal response models may not be appropriate.

This paper focuses on the general case in which the relationship between the interest variable and the available auxiliary variables may be suitably described by a generalized linear model. In this scenario, the finite population distribution of the interest variable is viewed as if generated by an exponential family distribution, which includes the Bernoulli, Poisson, Gamma and inverse Gaussian distributions; the resulting estimator is a generalized linear model regression estimator (GEREG). Its general form and basic statistical properties are presented and studied.

Let $U = \{1, 2, \ldots, N\}$ denote a finite population. A sample $S$ of size $n$ is drawn from $U$ using the sampling design $p(\cdot)$. Let $\pi_k = P(k \in S)$ denote the first-order inclusion probability of the $k$-th population element and, likewise, let $\pi_{kl} = P(k, l \in S)$ denote the second-order inclusion probability of elements $k$ and $l$. Let $y_k$ denote the value of the interest variable, which may be discrete or continuous, and let $x_k = (x_{k1}, \ldots, x_{kJ})^T$ be the auxiliary information vector for the $k$-th element.

The paper is organized as follows. Section 2 introduces the generalized linear model in a finite population. Section 3 presents the GEREG estimator, studies its main properties and reports a simulation study that illustrates its performance. Section 4 presents three applications of the proposed estimator.
2 Generalized Linear Model in Finite Population
The literature on generalized linear models (GLMs) is vast; see, for instance, McCullagh and Nelder (1989), Dobson (2001) and Fox (2008). In a GLM setup, $Y_1, \ldots, Y_n$ are the values of an interest variable $Y$ measured on $n$ elements. The $Y_k$ are assumed to be independent random variables with a probability distribution from the exponential family. Let $Y_k$ be the random variable $Y$ for element $k$. Its density function may be expressed as

$$f(y; \theta_k, \phi_k) = \exp\{\phi_k[y\theta_k - b(\theta_k)] + c(y, \phi_k)\}, \tag{1}$$

where $c(\cdot)$ is a known function, $\theta_k$ is the canonical parameter, $E(Y_k) = \mu_k = b'(\theta_k)$ and $\mathrm{Var}(Y_k) = \phi_k^{-1} V(\mu_k)$, with $V_k = V(\mu_k) = \partial\mu_k/\partial\theta_k$ the variance function and $\phi_k^{-1} > 0$ the dispersion parameter. Generalized linear models are defined by (1) and by the systematic component

$$g(\mu_k) = \eta_k = \sum_{j=1}^{J} \beta_j x_{kj} = x_k^T \beta, \tag{2}$$

where $x_k = (x_{k1}, \ldots, x_{kJ})^T$ is a vector of $J$ explanatory variables for element $k$, $\beta = (\beta_1, \ldots, \beta_J)^T$ is a vector of unknown parameters, and $g(\cdot)$ is a monotone differentiable function called the link function. When $g(\cdot)$ is defined in such a way that $\theta_k = \eta_k$ for every $k$, $g(\cdot)$ is called the canonical link function. It is assumed in this work that, if there exist $k$ and $l$ such that $\phi_k \neq \phi_l$, then $\phi_k \propto \delta_k$, $k = 1, \ldots, n$, with $\delta_k$ a known quantity for every $k$. Particular cases of this setup include the normal, Bernoulli, Poisson, Gamma and inverse Gaussian response models. Table 1 shows $b(\theta)$, $\theta$, $\phi$ and $V(\mu)$ for the main exponential family distributions.
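Table 1. Principal distributions belonging to the exponential family.

| Distribution | $b(\theta)$ | $\theta$ | $\phi$ | $V(\mu)$ |
|---|---|---|---|---|
| Normal | $\theta^2/2$ | $\mu$ | $1/\sigma^2$ | $1$ |
| Poisson | $e^{\theta}$ | $\log \mu$ | $1$ | $\mu$ |
| Bernoulli | $\log(1 + e^{\theta})$ | $\log\{\mu/(1-\mu)\}$ | $1$ | $\mu(1-\mu)$ |
| Gamma | $-\log(-\theta)$ | $-1/\mu$ | $1/(CV)^2$ | $\mu^2$ |
| Inverse Gaussian | $-\sqrt{-2\theta}$ | $-1/(2\mu^2)$ | $\phi$ | $\mu^3$ |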
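The entries of Table 1 can be checked numerically against the relations $\mu = b'(\theta)$ and $V(\mu) = \partial\mu/\partial\theta = b''(\theta)$ stated after (1). The following is a minimal sketch in Python with NumPy; the dictionary `FAMILIES` and the helper names are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Cumulant function b(theta), canonical parameter theta(mu) and variance
# function V(mu), transcribed from Table 1 (names here are hypothetical).
FAMILIES = {
    "normal": (lambda t: t**2 / 2, lambda m: m, lambda m: 1.0),
    "poisson": (np.exp, np.log, lambda m: m),
    "bernoulli": (lambda t: np.log1p(np.exp(t)),
                  lambda m: np.log(m / (1 - m)),
                  lambda m: m * (1 - m)),
    "gamma": (lambda t: -np.log(-t), lambda m: -1 / m, lambda m: m**2),
    "inverse_gaussian": (lambda t: -np.sqrt(-2 * t),
                         lambda m: -1 / (2 * m**2),
                         lambda m: m**3),
}

def central_diff(f, t, h):
    """Central finite-difference approximation to f'(t)."""
    return (f(t + h) - f(t - h)) / (2 * h)

def check_family(name, mu):
    """Return (b'(theta), b''(theta), V(mu)) evaluated at theta = theta(mu)."""
    b, theta_of_mu, V = FAMILIES[name]
    theta = theta_of_mu(mu)
    b1 = central_diff(b, theta, 1e-5)
    b2 = central_diff(lambda t: central_diff(b, t, 1e-5), theta, 1e-4)
    return b1, b2, V(mu)

results = {name: check_family(name, 0.3) for name in FAMILIES}
```

For each family, the first returned value should be close to $\mu$ and the second close to the third, up to finite-difference error.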
The maximum likelihood method may be used to estimate GLM parameters. In the finite population context, the estimation method can use sampling weights. For instance, $\hat{\mu}_k^U$ and $\hat{\mu}_k^S$ are both estimators of $\mu_k$; however, the first uses information about $U$, while the second uses only information about $S$. Similarly, $\beta$, $\hat{\beta}_U$, $\hat{\beta}_S$ and $\hat{\beta}_S^{\pi}$ differ: $\beta$ is the super-population parameter vector, while $\hat{\beta}_U$ is the estimator of $\beta$ based on $U$. There are two options for estimating $\beta$ with information from the sample $S$. The first yields $\hat{\beta}_S$ by assigning equal weights to all individuals. The second yields the pseudomaximum likelihood estimator $\hat{\beta}_S^{\pi}$, which takes the sampling design into account by weighting individuals according to their first-order inclusion probabilities (see Table 2).
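Table 2. Estimation of $\beta$ and $\mu_k$.

| | With information about $U$ | With information about $S$ (weighted) | With information about $S$ (unweighted) |
|---|---|---|---|
| $\beta$ | $\hat{\beta}_U = \arg\max_{\beta} L_U(\beta)$ | $\hat{\beta}_S^{\pi} = \arg\max_{\beta} L_S^{\pi}(\beta)$ | $\hat{\beta}_S = \arg\max_{\beta} L_S(\beta)$ |
| $\mu_k$ | $\hat{\mu}_k^U = g^{-1}(x_k^T \hat{\beta}_U)$ | $\hat{\mu}_k^S = g^{-1}(x_k^T \hat{\beta}_S^{\pi})$ | $\hat{\mu}_k^S = g^{-1}(x_k^T \hat{\beta}_S)$ |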
The loglikelihood function for the sample $S$ ($S \subset U$) with sampling weights, often called the $\pi$-weighted loglikelihood or pseudo-likelihood, can be written as

$$L_S^{\pi}(\beta) = \sum_{k \in S} \frac{\phi_k}{\pi_k}\left[y_k\,\theta(\beta; x_k) - b(\theta(\beta; x_k))\right]. \tag{3}$$

Then $\hat{\beta}_S^{\pi} = \arg\max_{\beta} L_S^{\pi}(\beta)$ and $\hat{\mu}_k^S = g^{-1}(x_k^T \hat{\beta}_S^{\pi})$ are the pseudomaximum likelihood estimators of $\beta$ and $\mu_k$, respectively (see, for instance, Skinner and Holt (1989); Lehtonen and Pahkinen (2004)). If $\pi_k = \pi$ for $k \in U$, then $\hat{\beta}_S$ and $\hat{\beta}_S^{\pi}$ are equivalent. The estimator $\hat{\beta}_S^{\pi}$ can also be found by solving the equation $U_S^{\pi}(\hat{\beta}_S^{\pi}) = 0$. The function $U_S^{\pi}(\beta) = \partial L_S^{\pi}(\beta)/\partial\beta = (U_{S1}^{\pi}(\beta), \ldots, U_{SJ}^{\pi}(\beta))^T$ is the score function for $\beta$ (see, for example, Cox and Hinkley (1974)), where

$$U_{Sj}^{\pi}(\beta) = \sum_{k \in S} \frac{\phi_k}{\pi_k}(y_k - \mu_k)\frac{\partial\theta_k}{\partial\eta_k}\,x_{kj}, \quad j = 1, \ldots, J. \tag{4}$$

For normal response models with canonical (identity) link, the pseudomaximum likelihood estimator is equivalent to the weighted least squares estimator, which can be expressed as

$$\hat{\beta}_S^{\pi} = (X_S^T W_S X_S)^{-1} X_S^T W_S Y_S, \tag{5}$$

where $X_S = (x_1, \ldots, x_n)^T$, $Y_S = (y_1, \ldots, y_n)^T$ and $W_S = \mathrm{diag}\{\phi_1/\pi_1, \ldots, \phi_n/\pi_n\}$.
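The score equations (4) generally have no closed-form solution outside the normal case. A minimal sketch of one way to solve them for a canonical link is Fisher scoring with weights $\phi_k/\pi_k$; this is an illustration under stated assumptions, not necessarily the authors' algorithm. The function name `pseudo_mle_canonical`, the simulated data and the inclusion probabilities are all hypothetical.

```python
import numpy as np

def pseudo_mle_canonical(X, y, pi, inv_link, var_fun, phi=None, max_iter=50, tol=1e-10):
    """Fisher scoring for the pi-weighted score equations (4), canonical link only."""
    n, J = X.shape
    phi = np.ones(n) if phi is None else np.asarray(phi, dtype=float)
    w = phi / pi                                   # phi_k / pi_k
    beta = np.zeros(J)
    for _ in range(max_iter):
        mu = inv_link(X @ beta)
        score = X.T @ (w * (y - mu))               # d(theta)/d(eta) = 1 for a canonical link
        info = X.T @ (X * (w * var_fun(mu))[:, None])
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Illustrative data (all values below are made up for the sketch).
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
pi = rng.uniform(0.1, 0.5, size=n)

# Normal response, identity link: should reproduce the weighted least squares form (5).
y_norm = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
beta_norm = pseudo_mle_canonical(X, y_norm, pi, lambda eta: eta, np.ones_like)
W = np.diag(1.0 / pi)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_norm)

# Bernoulli response, logit link.
expit = lambda eta: 1.0 / (1.0 + np.exp(-eta))
y_bin = rng.binomial(1, expit(X @ np.array([-0.5, 1.0])))
beta_bin = pseudo_mle_canonical(X, y_bin, pi, expit, lambda m: m * (1.0 - m))
weighted_resid_sum = np.sum((y_bin - expit(X @ beta_bin)) / pi)
```

In the normal case a single scoring step lands on the weighted least squares solution. In the Bernoulli case the model has an intercept, so the $\pi$-weighted residuals sum to approximately zero at the solution.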
The loglikelihood function for the population $U$ can be written as

$$L_U(\beta) = \sum_{k \in U} \phi_k\left[y_k\,\theta(\beta; x_k) - b(\theta(\beta; x_k))\right].$$

Then $\hat{\beta}_U = \arg\max_{\beta} L_U(\beta)$ and $\hat{\mu}_k^U = g^{-1}(x_k^T \hat{\beta}_U)$ are the maximum likelihood estimators of $\beta$ and $\mu_k$, respectively. The score function for $\beta$ in the population $U$ is given by $U_U(\beta) = \partial L_U(\beta)/\partial\beta = (U_{U1}(\beta), \ldots, U_{UJ}(\beta))^T$, where

$$U_{Uj}(\beta) = \sum_{k \in U} \phi_k (y_k - \mu_k)\frac{\partial\theta_k}{\partial\eta_k}\,x_{kj}, \quad j = 1, \ldots, J. \tag{6}$$
3 Generalized linear model regression estimator (GEREG)
Consider estimating a population total $t_y = \sum_{k \in U} y_k$. The GEREG estimator of $t_y$ may be written as

$$\hat{t}_G = \sum_{k \in U} \hat{\mu}_k^S + \sum_{k \in S} \frac{y_k - \hat{\mu}_k^S}{\pi_k}, \tag{7}$$

where $\xi$, the model describing the relationship between the response and the auxiliary variables, may be expressed as

$$E_\xi(y_k) = \mu_k; \qquad \mathrm{Var}_\xi(y_k) = \phi_k^{-1} V(\mu_k); \qquad g(\mu_k) = \sum_{j=1}^{J} \beta_j x_{kj} = x_k^T \beta, \tag{8}$$

where $\beta$ is a vector of unknown parameters, $g(\cdot)$ is the link function, $x_k = (x_{k1}, \ldots, x_{kJ})^T$ is the auxiliary information vector for the $k$-th population element, and $E_\xi(\cdot)$ and $\mathrm{Var}_\xi(\cdot)$ are the expectation and variance under model $\xi$, respectively. In expression (7), $\hat{\mu}_k^S$ is given by $g^{-1}(x_k^T \hat{\beta}_S^{\pi})$, where $\hat{\beta}_S^{\pi}$, the estimator of $\beta$ based on $S$, takes the sampling weights into account. We assume that, if there exist $k$ and $l$ such that $\phi_k \neq \phi_l$, then $\phi_k \propto \delta_k$, with $\delta_k$ a known quantity for every $k \in U$. If $\xi$ assumes homogeneity in the dispersion parameter ($\phi_k = \phi$), then we set $\phi_k = 1$ for every $k \in U$. The estimator $\hat{t}_G$ requires knowledge of the auxiliary information vector $x_k$ for every $k \in U$; however, if the link function in model $\xi$ is the identity, $\hat{t}_G$ only requires $x_k$ for every $k \in S$ and the vector of population totals $t_x = (t_{x1}, \ldots, t_{xJ})^T$, where $t_{xj} = \sum_{k \in U} x_{kj}$, $j = 1, \ldots, J$.
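Once fitted values are available for every population element, expression (7) is a direct computation. The sketch below assumes the fitted values have already been obtained by some pseudomaximum likelihood fit; the function name `gereg_total` and its argument layout are hypothetical.

```python
import numpy as np

def gereg_total(mu_hat_U, y_S, pi_S, sample_idx):
    """Expression (7): fitted values summed over U plus pi-weighted sample residuals.

    mu_hat_U   : fitted values for all N population elements
    y_S, pi_S  : observed responses and inclusion probabilities of sampled elements
    sample_idx : positions of the sampled elements within mu_hat_U
    """
    mu_hat_U = np.asarray(mu_hat_U, dtype=float)
    resid = np.asarray(y_S, dtype=float) - mu_hat_U[np.asarray(sample_idx)]
    return float(mu_hat_U.sum() + np.sum(resid / np.asarray(pi_S, dtype=float)))

# Toy numbers, chosen only to make the arithmetic easy to follow.
t_hat = gereg_total([1.0, 2.0, 3.0, 4.0], y_S=[2.0, 4.0], pi_S=[0.5, 0.5], sample_idx=[0, 2])
```

Here the fitted values sum to 10 and the two residuals of 1 are each divided by 0.5, so the adjustment term adds 4.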
3.1 Alternative expression for GEREG
The second term in (7) is an adjustment. Theorem 1 provides conditions under which the adjustment term can be eliminated, thus simplifying the expression for the GEREG estimator. The theorem also allows $t_y$ to be written using the fitted values of the model based on $U$.

Theorem 1. If the GEREG estimator $\hat{t}_G$ is based on a model $\xi$ for which $\sum_{j=1}^{J} a_j h_j = \mathbf{1}_N$ (the $N$-vector of ones), where $h_j = (h_{1j}, \ldots, h_{Nj})^T$ with $h_{kj} = \phi_k \dfrac{\partial\theta_k}{\partial\eta_k} x_{kj}$ and $a = (a_1, \ldots, a_J)^T$ is a constant vector, then

(1) $\sum_{k \in U} (y_k - \hat{\mu}_k^U) = 0$, i.e., the total of $y$ can be written as $t_y = \sum_{k \in U} \hat{\mu}_k^U$;
(2) $\sum_{k \in S} (y_k - \hat{\mu}_k^S)/\pi_k = 0$, i.e., the estimator $\hat{t}_G$ can be written as $\hat{t}_G = \sum_{k \in U} \hat{\mu}_k^S$.

According to Theorem 1, the smaller the difference between $\hat{\beta}_S^{\pi}$ and $\hat{\beta}_U$, the smaller the difference between $\hat{t}_G$ and $t_y$. In terms of finite population consistency, the estimator $\hat{\beta}_S^{\pi}$ is consistent for $\hat{\beta}_U$; therefore, $\hat{t}_G$ is consistent for $t_y$. Moreover, if $\pi_k = \pi$ for $k \in U$, Theorem 1 gives $\sum_{k \in S} y_k = \sum_{k \in S} \hat{\mu}_k^S$.

Corollary 1. If the GEREG estimator $\hat{t}_G$ is based on a model $\xi$ that has

(1) homogeneity in the dispersion parameter, i.e., $\phi_k = \phi$ for $k \in U$;
(2) a systematic component with an intercept, i.e., there is a $\beta_j$ in $\beta$ such that $\partial\eta_k/\partial\beta_j = C$ for $k \in U$, with $C$ a constant;
(3) a systematic component with a canonical link, i.e., $\theta_k = \eta_k$ for $k \in U$;

then results (1) and (2) of Theorem 1 hold.

From the model formulation point of view, Corollary 1 provides clearer conditions than Theorem 1 under which the adjustment term in the GEREG estimator can be eliminated.
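To see why the conditions of Corollary 1 are enough, the following LaTeX sketch traces one possible argument. It is an illustrative reconstruction from the score equations (4), not the authors' proof; the index $j_0$ (the intercept position) is notation introduced here.

```latex
% Under conditions (1)-(3): phi_k = phi, d(theta_k)/d(eta_k) = 1, and x_{k j_0} = C.
\begin{align*}
h_{kj} &= \phi_k \frac{\partial\theta_k}{\partial\eta_k} x_{kj} = \phi\, x_{kj},
  \qquad h_{k j_0} = \phi\, C \quad \text{for every } k \in U. \\
\intertext{Choosing $a_{j_0} = 1/(\phi C)$ and $a_j = 0$ for $j \neq j_0$ gives}
\sum_{j=1}^{J} a_j h_{kj} &= 1 \quad \text{for every } k \in U, \\
\intertext{which is the condition of Theorem 1. Writing (4) as
$U^{\pi}_{Sj}(\beta) = \sum_{k \in S} \pi_k^{-1} (y_k - \mu_k)\, h_{kj}$ and using
$U^{\pi}_{Sj}(\hat{\beta}^{\pi}_S) = 0$ for all $j$,}
0 = \sum_{j=1}^{J} a_j\, U^{\pi}_{Sj}(\hat{\beta}^{\pi}_S)
  &= \sum_{k \in S} \frac{y_k - \hat{\mu}^S_k}{\pi_k} \sum_{j=1}^{J} a_j h_{kj}
   = \sum_{k \in S} \frac{y_k - \hat{\mu}^S_k}{\pi_k}.
\end{align*}
```

The same linear combination applied to the population score (6) gives result (1) of Theorem 1.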
Corollary 2. If the GEREG estimator $\hat{t}_G$ is based on a model $\xi$ with normal response and canonical (identity) link, and $a = (a_1, \ldots, a_J)^T$ is a constant vector such that, for $k \in U$,

$$\phi_k^{-1} = \sum_{j=1}^{J} a_j x_{kj},$$

then results (1) and (2) of Theorem 1 hold.

The result in Corollary 2 is analogous to that found in Särndal, Swensson and Wretman (1992). Theorem 1 is therefore an extension of that result to generalized linear models.
3.2 Expectation and variance
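The following theorem provides approximate expressions for the expectation and variance of the GEREG estimator, taking into account the sampling design and the regression model formulated.

Theorem 2. If $\hat{t}_G$ is given by $\sum_{k \in U} \hat{\mu}_k^S$ and $\xi$ uses the canonical link, we have

(1) $E(\hat{t}_G) \approx t_y$;
(2) $\mathrm{Var}(\hat{t}_G) \approx \Gamma_U^T \Sigma_U \Gamma_U$;
(3) $\widehat{\mathrm{Var}}(\hat{t}_G) = \Gamma_S^T \Sigma_S \Gamma_S$,

where

$$\Gamma_U = (X_U^T W_U^{*} X_U)^{-1} X_U^T W_U^{*} \phi^{*}, \qquad W_U^{*} = \mathrm{diag}\{\phi_1 V_1^U, \ldots, \phi_N V_N^U\},$$

$$\Gamma_S = (X_U^T W_S^{*} X_U)^{-1} X_U^T W_S^{*} \phi^{*}, \qquad W_S^{*} = \mathrm{diag}\{\phi_1 V_1^S, \ldots, \phi_N V_N^S\},$$

$\Sigma_U = \{\sigma_{ij}^U\}$ and $\Sigma_S = \{\sigma_{ij}^S\}$, with

$$\sigma_{ij}^U = \sum_{k \in U}\sum_{l \in U} \Delta_{kl}\,\frac{x_{ki}\phi_k E_k}{\pi_k}\,\frac{x_{lj}\phi_l E_l}{\pi_l}, \qquad \sigma_{ij}^S = \sum_{k \in S}\sum_{l \in S} \frac{\Delta_{kl}}{\pi_{kl}}\,\frac{x_{ki}\phi_k e_k}{\pi_k}\,\frac{x_{lj}\phi_l e_l}{\pi_l},$$

$\phi^{*} = (\phi_1^{-1}, \ldots, \phi_N^{-1})^T$, $V_k^U = V(\hat{\mu}_k^U)$, $V_k^S = V(\hat{\mu}_k^S)$, $E_k = y_k - \hat{\mu}_k^U$ and $e_k = y_k - \hat{\mu}_k^S$.

Corollary 3. Under the conditions of Corollary 1, we have

(1) $E(\hat{t}_G) \approx t_y$;
(2) $\mathrm{Var}(\hat{t}_G) \approx \sum_{k \in U}\sum_{l \in U} \Delta_{kl}\,\dfrac{E_k}{\pi_k}\,\dfrac{E_l}{\pi_l}$;
(3) $\widehat{\mathrm{Var}}(\hat{t}_G) = \sum_{k \in S}\sum_{l \in S} \dfrac{\Delta_{kl}}{\pi_{kl}}\,\dfrac{e_k}{\pi_k}\,\dfrac{e_l}{\pi_l}$.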
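The double sum in result (3) of Corollary 3 can be evaluated directly once the inclusion probabilities are known. The sketch below assumes the usual Särndal-style convention $\Delta_{kl} = \pi_{kl} - \pi_k\pi_l$ with $\pi_{kk} = \pi_k$, which the text does not spell out; the function names and the residual values are hypothetical. Under simple random sampling without replacement the double sum should agree with the familiar closed form $N^2 (1-f)\, s_e^2 / n$, where $s_e^2$ is the sample variance of the residuals.

```python
import numpy as np

def ht_variance_estimate(e, pi, pi_joint):
    """Sum over k, l in S of (Delta_kl / pi_kl) (e_k / pi_k) (e_l / pi_l)."""
    e = np.asarray(e, dtype=float)
    pi = np.asarray(pi, dtype=float)
    delta = pi_joint - np.outer(pi, pi)     # assumed: Delta_kl = pi_kl - pi_k * pi_l
    z = e / pi
    return float(z @ (delta / pi_joint) @ z)

def srs_inclusion(n, N):
    """First- and second-order inclusion probabilities under SRS without replacement."""
    pi = np.full(n, n / N)
    pi_joint = np.full((n, n), n * (n - 1) / (N * (N - 1)))
    np.fill_diagonal(pi_joint, n / N)       # pi_kk = pi_k
    return pi, pi_joint

# Made-up residuals e_k = y_k - mu_hat_k for a sample of n = 5 from N = 100.
N, n = 100, 5
e = np.array([1.0, -2.0, 0.5, 3.0, -1.5])
pi, pi_joint = srs_inclusion(n, N)
v_double_sum = ht_variance_estimate(e, pi, pi_joint)
v_closed_form = N**2 * (1 - n / N) / n * np.var(e, ddof=1)
```

The double sum is quadratic in the sample size, so for large samples a design-specific closed form is usually preferable when one exists.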
Corollary 4. If the GEREG estimator $\hat{t}_G$ is based on a model $\xi$ with normal response and canonical (identity) link, and $a = (a_1, \ldots, a_J)^T$ is a constant vector such that, for $k \in U$,

$$\phi_k^{-1} = \sum_{j=1}^{J} a_j x_{kj},$$

then results (1) and (2) of Corollary 3 hold.

Corollaries 3 and 4 show that when the model $\xi$ has a canonical link, an intercept and homogeneity in the dispersion parameter, the expressions for $\mathrm{Var}(\hat{t}_G)$ and $\widehat{\mathrm{Var}}(\hat{t}_G)$ simplify: the estimator $\hat{t}_G$ is approximately unbiased for $t_y$, and an approximation to the variance of $\hat{t}_G$ can be calculated by applying the variance formula of the Horvitz-Thompson estimator to the residuals $E_k$.
3.3 The model's role
The role of model $\xi$ in estimating $t_y$ can be observed by assessing the impact on $E_k$ of small perturbations in $y_k$. Pregibon (1981) proposed a quantity measuring the influence of $y_k$ on $\hat{\mu}_k^U$ in generalized linear models, which extends the measure $\partial\hat{\mu}_k^U/\partial y_k$ applied to normal linear models. When $\xi$ has a canonical link and homogeneity in the dispersion parameter, the quantity $\gamma_k = 1 - h_{kk}$ can be used to assess the influence of $y_k$ on $E_k$, where $h_{kk}$ is the $k$-th element of the main diagonal of $H = V_U^{1/2} X_U (X_U^T V_U X_U)^{-1} X_U^T V_U^{1/2}$, with $V_U = \mathrm{diag}\{V_1^U, \ldots, V_N^U\}$. Small values of $\gamma_k$ indicate that a change in $y_k$ leads to a similar change in $\hat{\mu}_k^U$, suggesting that the $k$-th population element has a large weight in estimating $\hat{\beta}_U$ and on the magnitude of $E_k$. Likewise, large values of $\gamma_k$ indicate that a change in $y_k$ does not lead to a similar change in $\hat{\mu}_k^U$, suggesting that the $k$-th population element has a small weight in estimating $\hat{\beta}_U$ and on the magnitude of $E_k$. For example, for a model with an intercept, one explanatory variable and homogeneity in the dispersion parameter, we have the following.

i) Canonical link

$$\gamma_k = 1 - \frac{V_k^U}{\sum_{l \in U} V_l^U} - \frac{(x_k - \bar{x}_U)^2\, V_k^U}{\sum_{l \in U} (x_l - \bar{x}_U)^2\, V_l^U}$$

ii) Identity link

$$\gamma_k = 1 - \frac{1/V_k^U}{\sum_{l \in U} (1/V_l^U)} - \frac{(x_k - \bar{x}_U)^2\,(1/V_k^U)}{\sum_{l \in U} (x_l - \bar{x}_U)^2\,(1/V_l^U)},$$

where $\bar{x}_U$ is the population mean of the explanatory variable $x$. In both cases, $\gamma_k \in (0, 1)$ and, for a fixed value of $x_k$, the value of $\gamma_k$ depends on the variance function, in direct proportion when the link is canonical and in inverse proportion when the link is the identity. Thus, the model $\xi$ determines the structure of weights, or importance, of the elements in $U$, which affects the contribution of the $E_k$ values to $\mathrm{Var}(\hat{t}_G)$.
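The quantities $\gamma_k = 1 - h_{kk}$ can be computed from the matrix $H$ defined above without forming the full $N \times N$ matrix. The following sketch is an illustration only; the function name `gamma_influence`, the logistic fitted values and the coefficient values are assumptions made for the example.

```python
import numpy as np

def gamma_influence(X_U, V_U):
    """gamma_k = 1 - h_kk with H = V^{1/2} X (X^T V X)^{-1} X^T V^{1/2}."""
    Xw = X_U * np.sqrt(V_U)[:, None]            # V^{1/2} X
    G = np.linalg.solve(Xw.T @ Xw, Xw.T)        # (X^T V X)^{-1} X^T V^{1/2}
    h = np.einsum("ij,ji->i", Xw, G)            # diagonal of H only
    return 1.0 - h

# Hypothetical population: intercept plus one covariate, Bernoulli variance function.
N = 50
x = np.linspace(0.0, 3.0, N)
X_U = np.column_stack([np.ones(N), x])
mu = 1.0 / (1.0 + np.exp(-(-1.0 + x)))          # made-up fitted values
gamma = gamma_influence(X_U, mu * (1.0 - mu))
```

Because $H$ is a projection matrix of rank $J$, the leverages $h_{kk}$ sum to $J$ (here 2), which gives a quick sanity check on the computation.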
3.4 Relative efficiency
The efficiency of $\hat{t}_G$ relative to the Horvitz-Thompson estimator under the sampling design $p(\cdot)$ can be evaluated with the measure $\Delta_{p(\cdot)} = \mathrm{Var}(\hat{t}_G)_{p(\cdot)} / \mathrm{Var}(\hat{t}_{HT})_{p(\cdot)}$. For sampling designs such as simple random sampling (SRS) and Bernoulli sampling (BER), for example, the measure $\Delta_{p(\cdot)}$ does not depend on the sample size. The following result shows the impact of the model's goodness of fit on the relative efficiency of $\hat{t}_G$.

Theorem 3. Under the conditions of Corollary 1, if $\hat{t}_G$ is used under

i) simple random sampling (SRS), we have

$$\Delta_{SRS} = \frac{\mathrm{Var}(\hat{t}_G)_{SRS}}{\mathrm{Var}(\hat{t}_{HT})_{SRS}} \approx 1 - R_U^2;$$

ii) Bernoulli sampling (BER), we have

$$\Delta_{BER} = \frac{\mathrm{Var}(\hat{t}_G)_{BER}}{\mathrm{Var}(\hat{t}_{HT})_{BER}} \approx \frac{1 - R_U^2}{1 + \dfrac{CV_y^{-2}}{1 - 1/N}},$$

where $R_U$ is Pearson's linear correlation coefficient between the $y_k$ and the $\hat{\mu}_k^U$, and $CV_y$ is the coefficient of variation of $y$ in the population $U$.

This result shows that as the model's goodness of fit increases, the efficiency of $\hat{t}_G$ relative to the Horvitz-Thompson estimator also increases. It also follows that, with comparable sample sizes, $\mathrm{Var}(\hat{t}_G)_{BER} / \mathrm{Var}(\hat{t}_G)_{SRS} \approx 1 - 1/N$. Moreover, Theorem 3 implies that, under SRS, the Horvitz-Thompson estimator attains the same efficiency as $\hat{t}_G$ when it uses a sampling fraction $f^{*}$ given by

$$f^{*} \approx \frac{f}{f + (1 - f)(1 - R_U^2)},$$

where $f = n/N$ is the sampling fraction used by $\hat{t}_G$. For instance, when the GEREG estimator uses a sampling fraction of 0.2 under SRS, the model $\xi$ satisfies the conditions of Corollary 1 and its goodness of fit is $R_U^2 = 0.95$, the Horvitz-Thompson estimator requires a sampling fraction around four times greater than that of the GEREG estimator to obtain the same efficiency.
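The worked example with $f = 0.2$ and $R_U^2 = 0.95$ can be reproduced with a few lines of arithmetic. The function names below are hypothetical, and the second function simply encodes the Bernoulli-sampling ratio of Theorem 3.

```python
def equivalent_ht_fraction(f, r2_U):
    """Sampling fraction f* at which HT under SRS matches GEREG with fraction f."""
    return f / (f + (1.0 - f) * (1.0 - r2_U))

def delta_ber(r2_U, cv_y, N):
    """Approximate relative efficiency under Bernoulli sampling (Theorem 3, ii)."""
    return (1.0 - r2_U) / (1.0 + cv_y ** -2 / (1.0 - 1.0 / N))

f_star = equivalent_ht_fraction(0.2, 0.95)   # 0.2 / (0.2 + 0.8 * 0.05) = 0.2 / 0.24
ratio = f_star / 0.2
```

With these inputs $f^{*} = 0.2/0.24 \approx 0.83$, a little over four times the GEREG sampling fraction.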
3.5 Particular cases
This section considers estimators assisted by models satisfying the conditions of Theorem 1 for different response distributions.
3.5.1 Normal response with constant variance
In the statistical literature, the estimator (9) is obtained through the least squares method, without distributional assumptions. In this paper, the same estimator is obtained using a regression model with normal response and canonical link. According to Corollary 1, this estimator is given by

$$\hat{t}_G = \sum_{k \in U} \hat{\mu}_k^S = \sum_{k \in U} x_k^T \hat{\beta}_S^{\pi} = N\hat{\beta}_{S1}^{\pi} + \sum_{j=2}^{J} \hat{\beta}_{Sj}^{\pi}\, t_{xj}, \tag{9}$$

with $\hat{\beta}_S^{\pi} = (\hat{\beta}_{S1}^{\pi}, \ldots, \hat{\beta}_{SJ}^{\pi})^T$. The model $\xi$ that assists (9) can be written as

$$E_\xi(Y_k) = \mu_k; \qquad \mathrm{Var}_\xi(Y_k) = \sigma^2; \qquad \mu_k = \beta_1 + \sum_{j=2}^{J} \beta_j x_{kj}.$$

From (5), $\hat{\beta}_S^{\pi}$ can be written as $\hat{\beta}_S^{\pi} = (X_S^T W_S X_S)^{-1} X_S^T W_S Y_S$ with $X_S = (x_1, \ldots, x_n)^T$, $x_k = (1, x_{k2}, \ldots, x_{kJ})^T$ and $W_S = \mathrm{diag}\{1/\pi_1, \ldots, 1/\pi_n\}$. According to Corollary 3, under SRS the approximate variance of (9) and its corresponding estimator can be written as

$$\mathrm{Var}(\hat{t}_G)_{SRS} \approx N(1 - R_U^2)\,\frac{1 - f}{f}\,S_{yU}^2 \quad\text{and}\quad \widehat{\mathrm{Var}}(\hat{t}_G)_{SRS} = N(1 - R_S^2)\,\frac{1 - f}{f}\,S_{yS}^2,$$

where $S_{yU}^2 = \frac{1}{N-1}\sum_{U} (y_k - \bar{y}_U)^2$, $S_{yS}^2 = \frac{1}{n-1}\sum_{S} (y_k - \bar{y}_S)^2$ and $R_S$ is Pearson's linear correlation coefficient between the $y_k$ and the $\hat{\mu}_k^S$. Similarly, from Corollary 3 under BER sampling, the approximate variance of (9) and its corresponding estimator can be written as

$$\mathrm{Var}(\hat{t}_G)_{BER} \approx \frac{N(1 - R_U^2)}{(1 - 1/N)^{-1}}\,\frac{1 - \pi}{\pi}\,S_{yU}^2 \quad\text{and}\quad \widehat{\mathrm{Var}}(\hat{t}_G)_{BER} = \frac{n}{N\pi}\,\frac{N(1 - R_S^2)}{(1 - 1/n)^{-1}}\,\frac{1 - \pi}{\pi}\,S_{yS}^2,$$

where $N\pi$ is the expected sample size.
3.5.2 Normal response with variance heterogeneity
The following estimator is assisted by a model with normal response and variance heterogeneity that satisfies the conditions of Corollary 2:

$$\hat{t}_G = \sum_{k \in U} \hat{\mu}_k^S = \sum_{k \in U} \hat{\beta}_S^{\pi} x_k = \hat{\beta}_S^{\pi}\, t_x = \frac{t_x}{\hat{t}_x}\,\hat{t}_y, \tag{10}$$

with $\hat{t}_x$ and $\hat{t}_y$ the Horvitz-Thompson estimators of $t_x$ and $t_y$, respectively. The estimator (10) is called the ratio estimator (see, for instance, Särndal, Swensson and Wretman (1992)). The model $\xi$ that assists (10) can be written as

$$E_\xi(Y_k) = \mu_k; \qquad \mathrm{Var}_\xi(Y_k) = \sigma^2 x_k,\; x_k > 0; \qquad \mu_k = \beta x_k.$$

From (5), $\hat{\beta}_S^{\pi}$ can be written as $\hat{\beta}_S^{\pi} = (X_S^T W_S X_S)^{-1} X_S^T W_S Y_S$ with $W_S = \mathrm{diag}\{1/(x_1\pi_1), \ldots, 1/(x_n\pi_n)\}$. According to Theorem 2, under SRS the approximate variance of (10) and its corresponding estimator are given by

$$\mathrm{Var}(\hat{t}_G)_{SRS} \approx N\,\frac{1 - f}{f}\left[S_{yU}^2 + S_{xU} S_{yU}\,\frac{\bar{y}_U^2}{\bar{x}_U^2}\left(\frac{S_{xU}}{S_{yU}} - 2R_U\right)\right]$$

and

$$\widehat{\mathrm{Var}}(\hat{t}_G)_{SRS} = N\,\frac{1 - f}{f}\left[S_{yS}^2 + S_{xS} S_{yS}\,\frac{\bar{y}_S^2}{\bar{x}_S^2}\left(\frac{S_{xS}}{S_{yS}} - 2R_S\right)\right].$$

Similarly, from Theorem 2 under BER sampling, the approximate variance of (10) and its corresponding estimator can be written as

$$\mathrm{Var}(\hat{t}_G)_{BER} \approx \frac{N}{(1 - 1/N)^{-1}}\,\frac{1 - \pi}{\pi}\left[S_{yU}^2 + S_{xU} S_{yU}\,\frac{\bar{y}_U^2}{\bar{x}_U^2}\left(\frac{S_{xU}}{S_{yU}} - 2R_U\right)\right]$$

and

$$\widehat{\mathrm{Var}}(\hat{t}_G)_{BER} = \frac{n}{N\pi}\,\frac{N}{(1 - 1/n)^{-1}}\,\frac{1 - \pi}{\pi}\left[S_{yS}^2 + S_{xS} S_{yS}\,\frac{\bar{y}_S^2}{\bar{x}_S^2}\left(\frac{S_{xS}}{S_{yS}} - 2R_S\right)\right].$$

One can easily show that, under the SRS and BER sampling designs, $\hat{t}_G$ is more efficient than the Horvitz-Thompson estimator when $2R_U > S_{xU}/S_{yU}$.
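The link between the weighted least squares form (5) and the ratio estimator (10) can be verified numerically: with a single covariate, no intercept and weights $1/(x_k\pi_k)$, the slope reduces to the ratio of the two Horvitz-Thompson totals. The sketch below uses made-up sample values and hypothetical function names.

```python
import numpy as np

def ratio_estimator_via_wls(x_S, y_S, pi_S, t_x):
    """GEREG total from the scalar version of (5) with W_S = diag{1/(x_k pi_k)}."""
    x_S, y_S, pi_S = (np.asarray(a, dtype=float) for a in (x_S, y_S, pi_S))
    w = 1.0 / (x_S * pi_S)
    beta = np.sum(w * x_S * y_S) / np.sum(w * x_S**2)
    return float(beta * t_x)

def ratio_estimator_direct(x_S, y_S, pi_S, t_x):
    """Classical ratio estimator: t_x * (HT total of y) / (HT total of x)."""
    x_S, y_S, pi_S = (np.asarray(a, dtype=float) for a in (x_S, y_S, pi_S))
    return float(t_x * np.sum(y_S / pi_S) / np.sum(x_S / pi_S))

# Made-up sample with unequal inclusion probabilities and a known total t_x.
x_S = [2.0, 5.0, 1.0, 8.0]
y_S = [3.0, 9.0, 2.5, 15.0]
pi_S = [0.1, 0.3, 0.05, 0.4]
t_wls = ratio_estimator_via_wls(x_S, y_S, pi_S, t_x=500.0)
t_direct = ratio_estimator_direct(x_S, y_S, pi_S, t_x=500.0)
```

The factor $x_k$ in the weight cancels one power of $x_k$ in both the numerator and the denominator, which is why the two functions agree.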
3.5.3 Bernoulli response
Suppose $y_k$, the response for element $k$ in a particular population, is a dichotomous variable taking the value one if the element has a characteristic of interest and zero otherwise. The parameter of interest $t_y$ can be written as $NP$, where $N$ is the population size and $P$ is the proportion of population elements with the characteristic of interest. The estimator of $NP$ assisted by a Bernoulli regression model $\xi$ satisfying the conditions of Corollary 1 can be written as

$$\hat{t}_G = \sum_{k \in U} \hat{\mu}_k^S = \sum_{k \in U} \left[1 + \exp\left(-\hat{\beta}_{S1}^{\pi} - \sum_{j=2}^{J} \hat{\beta}_{Sj}^{\pi} x_{kj}\right)\right]^{-1}, \tag{11}$$

where $\xi$, the model assisting the estimation, can be described as

$$E_\xi(Y_k) = \mu_k = P_\xi[y_k = 1 \mid x_k]; \qquad \mathrm{Var}_\xi(Y_k) = \mu_k(1 - \mu_k); \qquad \log\left[\frac{\mu_k}{1 - \mu_k}\right] = \beta_1 + \sum_{j=2}^{J} \beta_j x_{kj}.$$

As discussed in Section 3.3, for a given $x_k$, the closer $\mu_k$ is to 0.5, the greater the weight, or importance, of element $k$ in the estimation. According to Corollary 3, under SRS the approximate variance of (11) and its corresponding estimator can be expressed as

$$\mathrm{Var}(\hat{t}_G)_{SRS} \approx \frac{N(1 - R_U^2)}{1 - 1/N}\,\frac{1 - f}{f}\,P(1 - P)$$

and

$$\widehat{\mathrm{Var}}(\hat{t}_G)_{SRS} = \frac{N(1 - R_S^2)}{1 - 1/n}\,\frac{1 - f}{f}\,\hat{P}(1 - \hat{P}),$$

where $\hat{P} = \frac{1}{n}\sum_{S} y_k$ is the sample proportion of elements with the characteristic of interest. Under BER sampling, the approximate variance of (11) and its corresponding estimator can be expressed as

$$\mathrm{Var}(\hat{t}_G)_{BER} \approx N(1 - R_U^2)\,\frac{1 - \pi}{\pi}\,P(1 - P)$$

and

$$\widehat{\mathrm{Var}}(\hat{t}_G)_{BER} = \frac{n}{N\pi}\,N(1 - R_S^2)\,\frac{1 - \pi}{\pi}\,\hat{P}(1 - \hat{P}).$$
3.5.4 Poisson response
When the variable of interest represents a counting process, the estimation of population totals can be assisted by a Poisson regression model. Such an estimator, which satisfies the conditions of Corollary 1, can be expressed as

$$\hat{t}_G = \sum_{k \in U} \hat{\mu}_k^S = \sum_{k \in U} \exp\left(\hat{\beta}_{S1}^{\pi} + \sum_{j=2}^{J} \hat{\beta}_{Sj}^{\pi} x_{kj}\right) = \kappa_1^{\pi} \sum_{k \in U} \prod_{j=2}^{J} (\kappa_j^{\pi})^{x_{kj}}, \tag{12}$$

with $\kappa_j^{\pi} = \exp(\hat{\beta}_{Sj}^{\pi})$, where $\xi$, the model assisting the estimation, is given by

$$E_\xi(Y_k) = \mu_k; \qquad \mathrm{Var}_\xi(Y_k) = \mu_k; \qquad \log(\mu_k) = \beta_1 + \sum_{j=2}^{J} \beta_j x_{kj}.$$

As discussed in Section 3.3, for a given $x_k$, the higher $\mu_k$, the higher the relevance (weight) of element $k$. According to Corollary 3, under SRS the approximate variance of (12) and its corresponding estimator can be expressed as

$$\mathrm{Var}(\hat{t}_G)_{SRS} \approx N(1 - R_U^2)\,\frac{1 - f}{f}\,S_{yU}^2 \quad\text{and}\quad \widehat{\mathrm{Var}}(\hat{t}_G)_{SRS} = N(1 - R_S^2)\,\frac{1 - f}{f}\,S_{yS}^2.$$

Under BER sampling, the approximate variance of (12) and its corresponding estimator can be expressed as

$$\mathrm{Var}(\hat{t}_G)_{BER} \approx \frac{N(1 - R_U^2)}{(1 - 1/N)^{-1}}\,\frac{1 - \pi}{\pi}\,S_{yU}^2 \quad\text{and}\quad \widehat{\mathrm{Var}}(\hat{t}_G)_{BER} = \frac{n}{N\pi}\,\frac{N(1 - R_S^2)}{(1 - 1/n)^{-1}}\,\frac{1 - \pi}{\pi}\,S_{yS}^2.$$
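The two forms of the Poisson-assisted total in (12), the exponential of the linear predictor and the product of multiplicative effects $\kappa_j$, can be compared on a small example. The coefficient vector and the auxiliary matrix below are made up, and the function names are hypothetical; the first column of the auxiliary matrix is assumed to be the intercept.

```python
import numpy as np

def poisson_total_exp(beta, X_U):
    """Sum over U of exp(x_k^T beta)."""
    return float(np.sum(np.exp(X_U @ beta)))

def poisson_total_kappa(beta, X_U):
    """kappa_1 * sum over U of prod_{j>=2} kappa_j ** x_kj, with kappa_j = exp(beta_j)."""
    kappa = np.exp(beta)
    return float(kappa[0] * np.sum(np.prod(kappa[1:] ** X_U[:, 1:], axis=1)))

# Made-up coefficients and auxiliary data for N = 6 elements.
beta = np.array([0.2, -0.5, 0.3])
X_U = np.column_stack([np.ones(6),
                       np.array([0.0, 1.0, 2.0, 0.5, 1.5, 3.0]),
                       np.array([1.0, 0.0, 2.0, 1.0, 0.0, 1.0])])
t_exp = poisson_total_exp(beta, X_U)
t_kappa = poisson_total_kappa(beta, X_U)
```

The multiplicative form makes each $\kappa_j$ readable as the factor by which the fitted mean changes per unit increase in $x_{kj}$.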
3.5.5 Gamma response
When the variable of interest is skewed, continuous and positive (e.g., income), the estimation can be assisted by the Gamma model. The estimator of $t_y$ assisted by a Gamma response regression model satisfying the conditions of Corollary 1 can be expressed as

$$\hat{t}_G = \sum_{k \in U} \hat{\mu}_k^S = \sum_{k \in U} \left(\hat{\beta}_{S1}^{\pi} + \sum_{j=2}^{J} \hat{\beta}_{Sj}^{\pi} x_{kj}\right)^{-1}, \tag{13}$$

where $\xi$, the model assisting the estimation, can be written as

$$E_\xi(Y_k) = \mu_k; \qquad \mathrm{Var}_\xi(Y_k) = \mu_k^2; \qquad \mu_k^{-1} = \beta_1 + \sum_{j=2}^{J} \beta_j x_{kj}.$$

As discussed in Section 3.3, for a given $x_k$, the higher $\mu_k$, the higher the relevance (weight) of element $k$. According to Corollary 3, under SRS the approximate variance of (13) and its corresponding estimator are given by

$$\mathrm{Var}(\hat{t}_G)_{SRS} \approx N(1 - R_U^2)\,\frac{1 - f}{f}\,S_{yU}^2 \quad\text{and}\quad \widehat{\mathrm{Var}}(\hat{t}_G)_{SRS} = N(1 - R_S^2)\,\frac{1 - f}{f}\,S_{yS}^2.$$

Under BER sampling, the approximate variance of (13) and its corresponding estimator can be expressed as

$$\mathrm{Var}(\hat{t}_G)_{BER} \approx \frac{N(1 - R_U^2)}{(1 - 1/N)^{-1}}\,\frac{1 - \pi}{\pi}\,S_{yU}^2 \quad\text{and}\quad \widehat{\mathrm{Var}}(\hat{t}_G)_{BER} = \frac{n}{N\pi}\,\frac{N(1 - R_S^2)}{(1 - 1/n)^{-1}}\,\frac{1 - \pi}{\pi}\,S_{yS}^2.$$
3.5.6 Stratified sampling
i) Estimating population totals

For a stratified SRS design in which the model assisting estimation in each stratum satisfies the conditions of Corollary 1, the estimator of $t_y$ is

$$\hat{t}_G = \sum_{h=1}^{H} \hat{t}_{G_h} = \sum_{h=1}^{H} \sum_{k \in U_h} \hat{\mu}_k^{S_h},$$

where $\hat{t}_{G_h}$ is the estimator of the total of $y$ in stratum $h$. According to Corollary 3, the approximate variance is given by

$$\mathrm{Var}(\hat{t}_G)_{SRS} = \sum_{h=1}^{H} \mathrm{Var}(\hat{t}_{G_h})_{SRS} \approx \sum_{h=1}^{H} N_h (1 - R_{U_h}^2)\,\frac{1 - f_h}{f_h}\,S_{yU_h}^2,$$

with $N_h$ the size of stratum $h$, $f_h$ the sampling fraction in stratum $h$, $S_{yU_h}^2$ the variance of $y$ in stratum $h$ and $R_{U_h}$ Pearson's linear correlation coefficient between the $y_k$ and the $\hat{\mu}_k^U$ in stratum $h$. The variance estimator is given by

$$\widehat{\mathrm{Var}}(\hat{t}_G)_{SRS} = \sum_{h=1}^{H} \widehat{\mathrm{Var}}(\hat{t}_{G_h})_{SRS} = \sum_{h=1}^{H} N_h (1 - R_{S_h}^2)\,\frac{1 - f_h}{f_h}\,S_{yS_h}^2,$$

where $S_{yS_h}^2$ is the variance of $y$ in the sample from stratum $h$ and $R_{S_h}$ is Pearson's linear correlation coefficient between the $y_k$ and the $\hat{\mu}_k^S$ in the sample from stratum $h$. Under BER sampling, the expression for the estimator of the population total is the same as in the SRS case. The approximate variance can be written as

$$\mathrm{Var}(\hat{t}_G)_{BER} = \sum_{h=1}^{H} \mathrm{Var}(\hat{t}_{G_h})_{BER} \approx \sum_{h=1}^{H} (N_h - 1)(1 - R_{U_h}^2)\,\frac{1 - \pi_h}{\pi_h}\,S_{yU_h}^2,$$

with $N_h \pi_h$ the expected sample size in stratum $h$. The variance estimator is given by

$$\widehat{\mathrm{Var}}(\hat{t}_G)_{BER} = \sum_{h=1}^{H} \widehat{\mathrm{Var}}(\hat{t}_{G_h})_{BER} = \sum_{h=1}^{H} \frac{(n_h - 1)}{\pi_h}(1 - R_{S_h}^2)\,\frac{1 - \pi_h}{\pi_h}\,S_{yS_h}^2,$$

where $n_h$ is the sample size in stratum $h$.
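The stratified variance approximations for totals are sums of per-stratum terms, so they vectorize naturally over strata. The following sketch evaluates the SRS and BER approximations for two hypothetical strata; the function names and all input values are assumptions made for illustration.

```python
import numpy as np

def stratified_srs_variance(N_h, f_h, r2_h, s2_h):
    """Sum over strata of N_h (1 - R_h^2) (1 - f_h) / f_h * S^2_h (stratified SRS)."""
    N_h, f_h, r2_h, s2_h = (np.asarray(a, dtype=float) for a in (N_h, f_h, r2_h, s2_h))
    return float(np.sum(N_h * (1 - r2_h) * (1 - f_h) / f_h * s2_h))

def stratified_ber_variance(N_h, pi_h, r2_h, s2_h):
    """Sum over strata of (N_h - 1) (1 - R_h^2) (1 - pi_h) / pi_h * S^2_h (stratified BER)."""
    N_h, pi_h, r2_h, s2_h = (np.asarray(a, dtype=float) for a in (N_h, pi_h, r2_h, s2_h))
    return float(np.sum((N_h - 1) * (1 - r2_h) * (1 - pi_h) / pi_h * s2_h))

# Two made-up strata.
v_srs = stratified_srs_variance([100, 200], [0.1, 0.2], [0.5, 0.75], [4.0, 1.0])
v_ber = stratified_ber_variance([100, 200], [0.1, 0.2], [0.5, 0.75], [4.0, 1.0])
```

For these inputs the SRS terms are $100 \cdot 0.5 \cdot 9 \cdot 4 = 1800$ and $200 \cdot 0.25 \cdot 4 \cdot 1 = 200$, and the BER terms replace $N_h$ by $N_h - 1$.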
ii) Estimating population means

For a stratified SRS design in which the model assisting estimation in each stratum satisfies the conditions of Corollary 1, the estimator of the population mean is given by

$$\hat{m}_G = \sum_{h=1}^{H} a_h \hat{m}_{G_h} = \sum_{h=1}^{H} a_h \sum_{k \in U_h} N_h^{-1}\,\hat{\mu}_k^{S_h},$$

with $\hat{m}_{G_h}$ the estimator of the mean in stratum $h$ and $a_h = N_h / \sum_h N_h$ the proportion of population individuals in stratum $h$. According to Corollary 3, the approximate variance is given by

$$\mathrm{Var}(\hat{m}_G)_{SRS} = \sum_{h=1}^{H} a_h^2\,\mathrm{Var}(\hat{m}_{G_h})_{SRS} \approx \sum_{h=1}^{H} a_h^2\,\frac{1 - R_{U_h}^2}{N_h}\,\frac{1 - f_h}{f_h}\,S_{yU_h}^2.$$

The variance estimator is given by

$$\widehat{\mathrm{Var}}(\hat{m}_G)_{SRS} = \sum_{h=1}^{H} a_h^2\,\widehat{\mathrm{Var}}(\hat{m}_{G_h})_{SRS} = \sum_{h=1}^{H} a_h^2\,\frac{1 - R_{S_h}^2}{N_h}\,\frac{1 - f_h}{f_h}\,S_{yS_h}^2.$$

Under BER sampling, the expression for the estimator of the population mean is the same as in the SRS case. The approximate variance is given by

$$\mathrm{Var}(\hat{m}_G)_{BER} = \sum_{h=1}^{H} a_h^2\,\mathrm{Var}(\hat{m}_{G_h})_{BER} \approx \sum_{h=1}^{H} a_h^2\,\frac{N_h - 1}{N_h^2}(1 - R_{U_h}^2)\,\frac{1 - \pi_h}{\pi_h}\,S_{yU_h}^2.$$

The variance estimator is given by

$$\widehat{\mathrm{Var}}(\hat{m}_G)_{BER} = \sum_{h=1}^{H} a_h^2\,\widehat{\mathrm{Var}}(\hat{m}_{G_h})_{BER} = \sum_{h=1}^{H} a_h^2\,\frac{(n_h - 1)}{N_h^2 \pi_h}(1 - R_{S_h}^2)\,\frac{1 - \pi_h}{\pi_h}\,S_{yS_h}^2.$$
3.6 Simulation study
A simulation study was carried out to compare the behavior of the estimators described above in different scenarios.
3.6.1 Bernoulli response
Twelve populations of $N = 1000$ individuals each were generated. A single auxiliary variable, $x$, was drawn from a chi-square distribution with 1 degree of freedom. The value of $\mu_k^U$ for element $k$ of each population was calculated as $\mu_k^U = [1 + \exp(-\beta_{U1} - \beta_{U2} x_k)]^{-1}$, and the interest variable, $y_k$, was randomly generated from a Bernoulli distribution with parameter $\mu_k^U$. In each population, the values of $\beta_U = (\beta_{U1}, \beta_{U2})^T$ were chosen to give $R_U^2 = 0.5$, 0.65 and 0.75 and $P = 0.15$, 0.35, 0.65 and 0.85. From each population, 10000 independent samples of sizes $n = 40$ and $n = 50$ were drawn under the SRS scheme. These samples were used to calculate estimates of $P = N^{-1} t_y$ and of the estimator's variance, as well as 95% confidence intervals (based on normality) for $P$, using two estimators assisted by regression models with normal and Bernoulli responses, denoted $\hat{P}_N$ and $\hat{P}_B$, respectively. These GEREG estimators satisfy the conditions of corollary 1. The following performance measures were calculated for each estimator: i) relative bias of $\hat{P}$; ii) relative efficiency of $\hat{P}$ with respect to the Horvitz-Thompson estimator; iii) relative bias of $\widehat{\mathrm{Var}}(\hat{P})$; and iv) 95% confidence interval coverage rate for $P$.

Table 3 presents the results. The relative bias of $\hat{P}_B$ was smaller than that of $\hat{P}_N$ in nearly all scenarios; however, the relative bias was very small for both estimators. The relative efficiency values in the table were always lower than 100%, indicating that both $\hat{P}_N$ and $\hat{P}_B$ were more efficient than the Horvitz-Thompson estimator, and the estimator assisted by the model with Bernoulli response was always the more efficient of the two. The relative bias of $\widehat{\mathrm{Var}}(\hat{P})$ was relatively large, mainly for the estimator assisted by the model with Bernoulli response. The 95% confidence interval coverage rates for $P$ were always lower than the nominal value for both estimators; nevertheless, taking into account that the sample size was small, the coverage rates were good. Overall, $\hat{P}_B$ performed better than $\hat{P}_N$, mainly when the population proportion $P$ was large.
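The Bernoulli simulation design can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the coefficients `beta_U`, the seed, the reduced number of replicates (500 instead of 10000) and the helper names `fit_logistic` and `gereg_proportion` are all hypothetical, and the logistic fit uses a plain Newton-Raphson iteration.

```python
import numpy as np


def fit_logistic(X, y, iters=25):
    """Logistic regression (canonical link) fitted by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = np.clip(X @ beta, -30.0, 30.0)  # avoid overflow in exp
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)
        # Small ridge keeps the linear system solvable in degenerate samples.
        hess = X.T @ (X * w[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(hess, X.T @ (y - mu))
    return beta


def gereg_proportion(X_U, sample, y_s):
    """Model-assisted estimate of P under SRS, where pi_k = n / N."""
    N, n = X_U.shape[0], len(sample)
    beta = fit_logistic(X_U[sample], y_s)  # equal weights under SRS
    mu_U = 1.0 / (1.0 + np.exp(-np.clip(X_U @ beta, -30.0, 30.0)))
    # Weighted residual total; close to zero at convergence (intercept + canonical link).
    correction = np.sum(y_s - mu_U[sample]) * N / n
    return (mu_U.sum() + correction) / N


rng = np.random.default_rng(2024)
N, n, reps = 1000, 40, 500
x = rng.chisquare(1, size=N)
X_U = np.column_stack([np.ones(N), x])
beta_U = np.array([-1.0, 0.8])  # hypothetical generating coefficients
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X_U @ beta_U))))
P = y.mean()

est_B = np.empty(reps)
est_HT = np.empty(reps)
for r in range(reps):
    s = rng.choice(N, size=n, replace=False)  # simple random sample
    est_HT[r] = y[s].mean()                   # Horvitz-Thompson under SRS
    est_B[r] = gereg_proportion(X_U, s, y[s])

print("relative bias (%):", 100 * abs(est_B.mean() / P - 1))
print("relative efficiency (%):", 100 * est_B.var() / est_HT.var())
```

The printed values depend on the hypothetical coefficients and seed, so they should not be compared directly with Table 3.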
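Table 3. Performance of $\hat{P}_N$ and $\hat{P}_B$ estimating the population proportion $P$. Measures (%), reported for each estimator: relative bias $\left|\hat{P}/P - 1\right|$, relative efficiency $V(\hat{P})/V(\hat{P}_{HT})$, relative bias of the variance estimator $\left|\hat{V}(\hat{P})/V(\hat{P}) - 1\right|$, and coverage rate (CR).

| $R_U^2$ | $n$ | $P$ | $\hat{P}_N$: rel. bias | $\hat{P}_N$: rel. eff. | $\hat{P}_N$: var. rel. bias | $\hat{P}_N$: CR | $\hat{P}_B$: rel. bias | $\hat{P}_B$: rel. eff. | $\hat{P}_B$: var. rel. bias | $\hat{P}_B$: CR |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.50 | 40 | 0.15 | 1.51 | 65.18 | 8.39 | 91.81 | 0.86 | 54.11 | 13.35 | 89.06 |
| 0.50 | 40 | 0.35 | 0.09 | 58.29 | 5.99 | 93.55 | 0.12 | 52.32 | 6.35 | 92.29 |
| 0.50 | 40 | 0.65 | 0.74 | 65.93 | 11.23 | 92.52 | 0.13 | 53.75 | 12.76 | 91.18 |
| 0.50 | 40 | 0.85 | 0.30 | 76.18 | 3.51 | 92.01 | 0.30 | 54.60 | 11.57 | 90.36 |
| 0.50 | 50 | 0.15 | 1.03 | 65.57 | 8.12 | 92.16 | 0.77 | 54.29 | 12.85 | 90.19 |
| 0.50 | 50 | 0.35 | 0.10 | 60.04 | 7.95 | 93.30 | 0.08 | 53.75 | 7.55 | 92.58 |
| 0.50 | 50 | 0.65 | 0.72 | 63.14 | 6.58 | 93.62 | 0.08 | 51.97 | 8.98 | 92.28 |
| 0.50 | 50 | 0.85 | 0.30 | 74.83 | 1.52 | 92.52 | 0.19 | 52.81 | 9.43 | 90.60 |
| 0.65 | 40 | 0.15 | 1.83 | 59.21 | 10.73 | 91.49 | 0.84 | 41.79 | 20.19 | 85.12 |
| 0.65 | 40 | 0.35 | 0.13 | 48.17 | 8.20 | 93.27 | 0.03 | 37.31 | 11.81 | 90.53 |
| 0.65 | 40 | 0.65 | 0.96 | 58.66 | 10.85 | 92.49 | 0.12 | 37.76 | 12.19 | 90.19 |
| 0.65 | 40 | 0.85 | 0.35 | 72.45 | 3.65 | 93.21 | 0.30 | 37.88 | 6.72 | 90.97 |
| 0.65 | 50 | 0.15 | 1.22 | 57.08 | 6.61 | 92.64 | 0.76 | 38.88 | 14.32 | 89.01 |
| 0.65 | 50 | 0.35 | 0.25 | 48.37 | 8.05 | 93.36 | 0.11 | 37.23 | 10.87 | 91.13 |
| 0.65 | 50 | 0.65 | 0.75 | 57.93 | 9.03 | 93.13 | 0.01 | 37.69 | 11.68 | 91.14 |
| 0.65 | 50 | 0.85 | 0.24 | 69.75 | 0.39 | 93.46 | 0.20 | 36.78 | 6.91 | 90.68 |
| 0.75 | 40 | 0.15 | 2.70 | 57.40 | 12.75 | 90.43 | 0.04 | 32.76 | 23.97 | 80.00 |
| 0.75 | 40 | 0.35 | 0.45 | 42.59 | 5.66 | 93.52 | 0.23 | 27.58 | 11.01 | 90.11 |
| 0.75 | 40 | 0.65 | 0.99 | 55.48 | 10.63 | 92.79 | 0.07 | 32.62 | 12.27 | 90.25 |
| 0.75 | 40 | 0.85 | 0.38 | 72.24 | 4.16 | 92.63 | 0.29 | 32.86 | 8.02 | 90.87 |
| 0.75 | 50 | 0.15 | 1.42 | 54.30 | 6.21 | 92.82 | 0.42 | 29.38 | 14.39 | 86.96 |
| 0.75 | 50 | 0.35 | 0.22 | 43.57 | 8.09 | 93.34 | 0.02 | 28.00 | 14.17 | 89.79 |
| 0.75 | 50 | 0.65 | 0.80 | 54.38 | 8.28 | 93.10 | 0.05 | 31.68 | 9.94 | 91.38 |
| 0.75 | 50 | 0.85 | 0.22 | 68.89 | 0.78 | 93.34 | 0.17 | 30.34 | 5.44 | 90.83 |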
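The four performance measures reported in the simulation tables can be computed from Monte Carlo replicates along the following lines. This is a sketch under assumptions, not the authors' code: it takes the mean and variance over replicates as stand-ins for the design expectation and variance, and the function name `performance` and the toy inputs are hypothetical.

```python
import numpy as np


def performance(est, var_est, true_value, est_ht, z=1.959964):
    """Return the four measures (in %) from replicated estimates.

    est        : replicated point estimates of the assisted estimator
    var_est    : replicated variance estimates for the same estimator
    true_value : population quantity being estimated
    est_ht     : replicated Horvitz-Thompson estimates
    """
    est = np.asarray(est, dtype=float)
    var_est = np.asarray(var_est, dtype=float)
    mc_var = est.var()  # Monte Carlo approximation of V(estimator)
    half = z * np.sqrt(var_est)  # normal-based interval half-width
    covered = (est - half <= true_value) & (true_value <= est + half)
    return {
        "rel_bias": 100 * abs(est.mean() / true_value - 1),
        "rel_eff": 100 * mc_var / np.asarray(est_ht, dtype=float).var(),
        "rel_bias_var": 100 * abs(var_est.mean() / mc_var - 1),
        "coverage": 100 * covered.mean(),
    }


# Toy replicates, for illustration only.
toy = performance(est=[0.9, 1.1], var_est=[0.01, 0.01],
                  true_value=1.0, est_ht=[0.8, 1.2])
print(toy)
```

A relative efficiency below 100% means the assisted estimator has a smaller Monte Carlo variance than the Horvitz-Thompson estimator.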
3.6.2 Poisson response
Nine populations of $N = 1000$ individuals each were generated. A single auxiliary variable, $x$, was drawn from a chi-square distribution with 1 degree of freedom. The value of $\mu_k^U$ for element $k$ of each population was calculated as $\mu_k^U = \exp(\beta_{U1} + \beta_{U2} x_k)$, and the interest variable, $y_k$, was randomly generated from a Poisson distribution with parameter $\mu_k^U$. The values of $\beta_U$ in each population were chosen to give $R_U^2 = 0.5$, 0.65 and 0.75 and $m = N^{-1} t_y = 1.5$, 3.0 and 5.0. From each population, 10000 independent samples of sizes $n = 40$ and $n = 50$ were drawn under the SRS scheme. These samples were used to calculate estimates of $m$ and of the estimator's variance, as well as 95% confidence intervals (based on normality) for $m$, using two estimators assisted by regression models with normal and Poisson responses, denoted $\hat{m}_N$ and $\hat{m}_P$, respectively. These GEREG estimators satisfy the conditions of corollary 1. The same performance measures as in the Bernoulli case were calculated for each estimator.

Table 4 presents the results. The relative bias of $\hat{m}_P$ was always smaller than that of $\hat{m}_N$; however, the relative bias was very small for both estimators. The relative efficiency values in the table were always lower than 100%, indicating that both $\hat{m}_N$ and $\hat{m}_P$ were more efficient than the Horvitz-Thompson estimator, and the estimator assisted by the model with Poisson response was always the more efficient of the two. The relative bias of $\widehat{\mathrm{Var}}(\hat{m})$ was relatively large, mainly for the estimator assisted by the model with Poisson response. For both estimators, the 95% confidence interval coverage rates for $m$ were always lower than the nominal value; nevertheless, taking into account that the sample size was small, the coverage rates were good. Overall, $\hat{m}_P$ performed better than $\hat{m}_N$, mainly when the population mean $m$ was small.
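Table 4. Performance of $\hat{m}_N$ and $\hat{m}_P$ estimating the population mean $m$. Measures (%), reported for each estimator: relative bias $\left|\hat{m}/m - 1\right|$, relative efficiency $V(\hat{m})/V(\hat{m}_{HT})$, relative bias of the variance estimator $\left|\hat{V}(\hat{m})/V(\hat{m}) - 1\right|$, and coverage rate (CR).

| $R_U^2$ | $n$ | $m$ | $\hat{m}_N$: rel. bias | $\hat{m}_N$: rel. eff. | $\hat{m}_N$: var. rel. bias | $\hat{m}_N$: CR | $\hat{m}_P$: rel. bias | $\hat{m}_P$: rel. eff. | $\hat{m}_P$: var. rel. bias | $\hat{m}_P$: CR |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.50 | 40 | 1.5 | 1.21 | 60.94 | 8.37 | 91.98 | 0.62 | 55.58 | 16.65 | 91.25 |
| 0.50 | 40 | 3.0 | 0.78 | 56.91 | 8.75 | 92.34 | 0.38 | 55.01 | 14.71 | 91.47 |
| 0.50 | 40 | 5.0 | 0.44 | 54.13 | 8.01 | 92.83 | 0.19 | 52.06 | 11.13 | 92.39 |
| 0.50 | 50 | 1.5 | 0.98 | 60.80 | 6.07 | 92.34 | 0.42 | 55.12 | 14.48 | 91.75 |
| 0.50 | 50 | 3.0 | 0.54 | 56.36 | 5.96 | 93.04 | 0.30 | 53.51 | 10.63 | 92.50 |
| 0.50 | 50 | 5.0 | 0.30 | 55.11 | 8.49 | 92.91 | 0.18 | 52.66 | 11.10 | 92.58 |
| 0.65 | 40 | 1.5 | 2.00 | 58.41 | 14.77 | 89.18 | 1.35 | 49.11 | 16.42 | 88.08 |
| 0.65 | 40 | 3.0 | 1.43 | 51.69 | 10.44 | 91.05 | 0.37 | 37.24 | 15.45 | 91.27 |
| 0.65 | 40 | 5.0 | 0.80 | 45.99 | 11.18 | 91.37 | 0.28 | 37.23 | 14.95 | 90.99 |
| 0.65 | 50 | 1.5 | 1.65 | 56.64 | 9.55 | 90.50 | 1.01 | 43.68 | 17.17 | 89.33 |
| 0.65 | 50 | 3.0 | 1.08 | 51.02 | 7.72 | 91.61 | 0.21 | 35.85 | 10.55 | 92.26 |
| 0.65 | 50 | 5.0 | 0.76 | 45.68 | 9.27 | 92.26 | 0.12 | 36.97 | 12.76 | 92.26 |
| 0.75 | 40 | 1.5 | 3.22 | 58.22 | 14.16 | 84.41 | 0.65 | 34.80 | 14.59 | 85.49 |
| 0.75 | 40 | 3.0 | 1.85 | 49.67 | 14.06 | 88.63 | 0.59 | 30.20 | 18.12 | 88.74 |
| 0.75 | 40 | 5.0 | 1.17 | 43.07 | 12.33 | 90.36 | 0.01 | 26.79 | 17.01 | 91.51 |
| 0.75 | 50 | 1.5 | 2.52 | 58.47 | 10.11 | 86.11 | 0.49 | 32.03 | 17.59 | 87.22 |
| 0.75 | 50 | 3.0 | 1.41 | 48.60 | 10.57 | 89.92 | 0.50 | 28.96 | 13.10 | 89.98 |
| 0.75 | 50 | 5.0 | 1.12 | 43.02 | 10.63 | 91.02 | 0.16 | 26.23 | 13.83 | 92.08 |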
3.6.3 Gamma response
Nine populations of $N = 1000$ individuals each were generated. A single auxiliary variable, $x$, was drawn from a chi-square distribution with 1 degree of freedom. The value of $\mu_k^U$ for element $k$ of each population was calculated as $\mu_k^U = (\beta_{U1} + \beta_{U2} x_k)^{-1}$, and the interest variable, $y_k$, was randomly generated from a Gamma distribution with parameters $\mu_k^U$ and $\phi$. In each population, the values of $\beta_U$ and $\phi$ were chosen to give $R_U^2 = 0.5$, 0.65 and 0.75 and $m = 5$, 10 and 15. From each population, 10000 independent samples of sizes $n = 40$ and $n = 50$ were drawn under the SRS scheme. These samples were used to calculate estimates of $m$ and of the estimator's variance, as well as 95% confidence intervals (based on normality) for $m$, using two estimators assisted by regression models with normal and Gamma responses, denoted $\hat{m}_N$ and $\hat{m}_G$, respectively. These GEREG estimators satisfy the conditions of corollary 1. The same performance measures as in the Bernoulli and Poisson cases were calculated for each estimator.

Table 5 presents the results. The relative bias of $\hat{m}_G$ was always smaller than that of $\hat{m}_N$; however, the relative bias was very small for both estimators. The relative efficiency values in the table were always lower than 100%, indicating that both $\hat{m}_N$ and $\hat{m}_G$ were more efficient than the Horvitz-Thompson estimator, and the estimator assisted by the model with Gamma response was always the more efficient of the two. The relative bias of $\widehat{\mathrm{Var}}(\hat{m})$ was relatively small. For both estimators, the 95% confidence interval coverage rates for $m$ were always lower than the nominal value; nevertheless, taking into account that the sample size was small, the coverage rates were good. Overall, $\hat{m}_G$ performed better than $\hat{m}_N$, mainly when the population mean $m$ was small.
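Table 5. Performance of $\hat{m}_N$ and $\hat{m}_G$ estimating the population mean $m$. Measures (%), reported for each estimator: relative bias $\left|\hat{m}/m - 1\right|$, relative efficiency $V(\hat{m})/V(\hat{m}_{HT})$, relative bias of the variance estimator $\left|\hat{V}(\hat{m})/V(\hat{m}) - 1\right|$, and coverage rate (CR).

| $R_U^2$ | $n$ | $m$ | $\hat{m}_N$: rel. bias | $\hat{m}_N$: rel. eff. | $\hat{m}_N$: var. rel. bias | $\hat{m}_N$: CR | $\hat{m}_G$: rel. bias | $\hat{m}_G$: rel. eff. | $\hat{m}_G$: var. rel. bias | $\hat{m}_G$: CR |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.50 | 40 | 5 | 1.29 | 69.75 | 8.62 | 92.02 | 0.19 | 53.58 | 4.82 | 93.22 |
| 0.50 | 40 | 10 | 0.55 | 66.39 | 11.10 | 92.04 | 0.03 | 55.73 | 5.66 | 93.36 |
| 0.50 | 40 | 15 | 0.54 | 61.45 | 8.81 | 92.20 | 0.09 | 52.28 | 4.23 | 93.30 |
| 0.50 | 50 | 5 | 0.94 | 67.97 | 7.62 | 92.24 | 0.18 | 52.76 | 4.52 | 93.24 |
| 0.50 | 50 | 10 | 0.51 | 63.59 | 6.45 | 93.04 | 0.04 | 53.27 | 0.84 | 94.34 |
| 0.50 | 50 | 15 | 0.35 | 58.96 | 5.37 | 93.36 | 0.06 | 51.15 | 3.56 | 93.84 |
| 0.65 | 40 | 5 | 1.76 | 72.40 | 14.26 | 90.66 | 0.12 | 41.65 | 4.96 | 93.08 |
| 0.65 | 40 | 10 | 1.15 | 58.86 | 12.61 | 91.28 | 0.02 | 37.61 | 2.93 | 93.96 |
| 0.65 | 40 | 15 | 0.80 | 53.42 | 10.63 | 92.06 | 0.09 | 37.60 | 7.10 | 93.32 |
| 0.65 | 50 | 5 | 1.38 | 71.60 | 12.58 | 91.90 | 0.11 | 41.59 | 4.67 | 93.18 |
| 0.65 | 50 | 10 | 0.87 | 59.87 | 12.23 | 92.08 | 0.03 | 37.45 | 0.81 | 94.10 |
| 0.65 | 50 | 15 | 0.54 | 51.95 | 10.02 | 92.36 | 0.08 | 37.58 | 1.75 | 93.78 |
| 0.75 | 40 | 5 | 2.19 | 76.46 | 12.65 | 90.40 | 0.24 | 34.61 | 8.66 | 92.18 |
| 0.75 | 40 | 10 | 1.47 | 63.31 | 18.95 | 90.42 | 0.10 | 32.57 | 6.74 | 92.66 |
| 0.75 | 40 | 15 | 1.23 | 58.05 | 19.59 | 89.54 | 0.12 | 29.98 | 4.12 | 93.30 |
| 0.75 | 50 | 5 | 1.62 | 76.49 | 11.76 | 91.26 | 0.16 | 36.12 | 6.36 | 92.22 |
| 0.75 | 50 | 10 | 1.08 | 61.48 | 15.09 | 90.76 | 0.10 | 32.84 | 6.39 | 92.98 |
| 0.75 | 50 | 15 | 0.84 | 55.31 | 13.96 | 91.06 | 0.06 | 29.49 | 4.08 | 93.64 |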
4 Applications
4.1 Academic Performance Index (API)
The academic performance index (API) has been used annually as a performance measure for every school in California. The API is based on a standardized test applied to students. Data from 2000 were analyzed. The data set contains 6194 observations and 37 variables and was obtained from the survey package for the statistical software R (www.r-project.org). The goal was to estimate the average API for the population of schools using a stratified sample design. The strata were formed by the levels of the Stype variable (elementary (E), middle (M) and high school (H)). The performance of the SRS and BER sampling schemes was investigated in each stratum. The $\hat{m}_G$ estimator discussed in section 3.5.6 was applied; the estimation in each stratum was assisted by models with normal and Gamma responses, as described in (9) and (13). The auxiliary variable was not.hsg, the percentage of parents in each school who did not graduate from high school. Strata sizes were $N_1 = 4421$, $N_2 = 1018$ and $N_3 = 755$. Strata sample sizes were 100, 50 and 50 for the first, second and third stratum, respectively. These numbers correspond to the expected sample sizes under the BER sampling scheme.
Table 6 shows the relative efficiency ($\Delta_p(\cdot)$) of $\hat{m}_G$ with respect to the Horvitz-Thompson estimator for the two regression models and the two sampling schemes. The Gamma-assisted $\hat{m}_G$ estimator was more efficient than the normal-assisted one in every stratum. The variance of the normal-assisted estimator was approximately 1.11 times that of the Gamma-assisted estimator under both sampling schemes. Figure 1 shows the relationship between the interest and auxiliary variables. The $R_U^2$ values for the normal model's goodness of fit were 0.51, 0.54 and 0.61 for E, M and H, respectively. The $R_U^2$ values for the Gamma model were 0.55, 0.60 and 0.66 for E, M and H, respectively.
Table 6. Relative efficiency (percentage) of $\hat{m}_G$ with respect to the Horvitz-Thompson estimator for the API example.

| Sampling scheme | Assisting model | Elementary | Middle | High school | Total |
|---|---|---|---|---|---|
| SRS | Normal | 49.11 | 46.48 | 38.76 | 48.95 |
| SRS | Gamma | 44.65 | 40.32 | 33.92 | 43.93 |
| BER | Normal | 1.70 | 1.49 | 1.00 | 1.76 |
| BER | Gamma | 1.64 | 1.41 | 0.95 | 1.59 |
Fig. 1. Scatter plots between interest and auxiliary variables for the API example. [Panels: Elementary, Middle, High School, Total. Horizontal axis: percent of parents who are not high-school graduates. Vertical axis: API 2000. Fitted curves: Normal and Gamma.]
4.2 Quality of life in Colombia
The unsatisfied basic necessities (UBN) indicator was calculated for the 1118 Colombian municipalities, based on the household and population census carried out by the National Administrative Department of Statistics, DANE (www.dane.gov.co). The data refer to 2005. The UBN indicates whether the population's basic needs are being met, and concerns i) inadequate housing, ii) overcrowded households, iii) households with inadequate services, iv) households with high economic dependence and v) families where children do not attend school. The goal was to estimate the proportion of municipalities (excluding the capital, Bogotá) with a UBN higher than 25%. A stratified sample design was used, with strata based on a Distance variable defined as the distance, in kilometers, between each municipality and the closest city with more than 1 million inhabitants. Three strata were considered: a distance of less than 150 km (<150 km), a distance of 150 to 200 km (150-200 km) and a distance of over 200 km (>200 km). The performance of the SRS and BER schemes was investigated in each stratum. The $\hat{m}_G$ estimator discussed in section 3.5.6 was applied; the estimation in each stratum was assisted by models with normal and Bernoulli responses, as described in (9) and (11). The auxiliary variable was Density, the average number of people living in an area of 1 km² of the city. Stratum sizes were $N_1 = 308$, $N_2 = 395$ and $N_3 = 414$. Stratum sample sizes were 46, 59 and 62 for the first, second and third stratum, respectively. These sizes are the expected values under BER sampling.
Table 7 shows the relative efficiency ($\Delta_p(\cdot)$) of the $\hat{m}_G$ estimator with respect to the Horvitz-Thompson estimator, for both regression models and sampling schemes. The Bernoulli-assisted $\hat{m}_G$ estimator was more efficient than the normal-assisted one in every stratum. The variance of the normal-assisted estimator was approximately 1.18 times that of the Bernoulli-assisted estimator under both sampling schemes. Figure 2 shows the relationship between the interest and auxiliary variables. The $R_U^2$ values for the normal model's goodness of fit were 0.06, 0.27 and 0.13 for <150, 150-200 and >200, respectively. The $R_U^2$ values for the Bernoulli model were 0.24, 0.40 and 0.20 for <150, 150-200 and >200, respectively.
Table 7. Relative efficiency (percentage) of $\hat{m}_G$ with respect to the Horvitz-Thompson estimator for the Quality of life in Colombia application when the interest variable is dichotomous.

| Sampling scheme | Assisting model | Distance < 150 km | Distance 150 - 200 km | Distance > 200 km | Total |
|---|---|---|---|---|---|
| SRS | Normal | 93.09 | 72.76 | 87.43 | 85.61 |
| SRS | Bernoulli | 76.60 | 59.82 | 80.61 | 72.41 |
| BER | Normal | 21.68 | 6.99 | 6.60 | 14.93 |
| BER | Bernoulli | 21.25 | 6.86 | 6.43 | 12.57 |
Additionally, we wanted to estimate the average number of UBN components higher than 10%. In this case, the possible values of the interest variable were 0, 1, 2, 3, 4 and 5, which motivates an estimator assisted by a Poisson regression model. Table 8 shows the relative efficiency ($\Delta_p(\cdot)$) of the $\hat{m}_G$ estimator with respect to the Horvitz-Thompson estimator for both regression models and sampling schemes. The Poisson-assisted $\hat{m}_G$ estimator was more efficient than the normal-assisted one in all strata. The variance of the normal-assisted estimator was about 1.17 times that of the Poisson-assisted estimator under both sampling schemes. Figure 3 shows the relationship between the interest and auxiliary variables. The $R_U^2$ values for the normal model's goodness of fit were 0.06, 0.16 and 0.10 for <150, 150-200 and >200, respectively. The $R_U^2$ values for the Poisson model were 0.24, 0.27 and 0.21 for <150, 150-200 and >200, respectively.
Fig. 2. Scatter plots between interest and auxiliary variables for the Quality of life in Colombia application when the interest variable is dichotomous. [Panels: Distance less than 150 kms, Distance between 150 and 200 kms, Distance more than 200 kms, Total. Horizontal axis: population density (persons/km2). Vertical axis: UBN more than 25%. Fitted curves: Normal and Bernoulli.]
Table 8. Relative efficiency (percentage) of $\hat{m}_G$ with respect to the Horvitz-Thompson estimator for the Quality of life in Colombia application when the interest variable is a counting process.

| Sampling scheme | Assisting model | Distance < 150 km | Distance 150 - 200 km | Distance > 200 km | Total |
|---|---|---|---|---|---|
| SRS | Normal | 93.88 | 83.84 | 90.37 | 89.14 |
| SRS | Poisson | 76.42 | 73.23 | 79.09 | 76.23 |
| BER | Normal | 23.22 | 13.01 | 11.16 | 17.68 |
| BER | Poisson | 22.21 | 12.66 | 10.77 | 14.91 |
Fig. 3. Scatter plots between interest and auxiliary variables for the Quality of life in Colombia example when the interest variable is a counting process. [Panels: Distance less than 150 kms, Distance between 150 and 200 kms, Distance more than 200 kms, Total. Horizontal axis: population density (persons/km2). Vertical axis: components of UBN more than 10%. Fitted curves: Normal and Poisson.]

Appendix

A Theorem 1

Proof.

i) The estimate $\hat{\beta}_U$ satisfies $U_U(\hat{\beta}_U) = 0$, i.e., $U_{U_j}(\hat{\beta}_U) = h_j^T E = 0$, $j = 1, \ldots, J$, where $h_j = (h_{1j}, \ldots, h_{Nj})^T$ and $E = (E_1, \ldots, E_N)^T$, with $h_{kj} = \phi_k \frac{\partial \theta_k}{\partial \eta_k} x_{kj}$. If there exists $a = (a_1, \ldots, a_J)^T$ such that $\sum_{j=1}^{J} a_j h_j = 1_N$, then

$$
\sum_{k \in U} (y_k - \mu_k^U) = 1_N^T E = \Big( \sum_{j=1}^{J} a_j h_j \Big)^T E = \sum_{j=1}^{J} a_j h_j^T E.
$$

From $U_{U_j}(\hat{\beta}_U) = 0$ above, $j = 1, \ldots, J$, we have $\sum_{j=1}^{J} a_j h_j^T E = 0$; therefore, $t_y = \sum_{k \in U} \mu_k^U$.

ii) The estimate $\hat{\beta}_S^{\pi}$ satisfies $U_S^{\pi}(\hat{\beta}_S^{\pi}) = 0$, i.e., $U_{S_j}^{\pi}(\hat{\beta}_S^{\pi}) = h_{S_j}^T e^* = 0$, $j = 1, \ldots, J$, where $h_{S_j}$ and $e^*$ are $n$-dimensional column vectors with elements $h_{kj}$ and $e_k/\pi_k$, respectively, with $k \in S$. If there exists $a$ such that $\sum_{j=1}^{J} a_j h_j = 1_N$, then $\sum_{j=1}^{J} a_j h_{S_j} = 1_n$. Therefore,

$$
\sum_{k \in S} \frac{(y_k - \mu_k^S)}{\pi_k} = 1_n^T e^* = \Big( \sum_{j=1}^{J} a_j h_{S_j} \Big)^T e^* = \sum_{j=1}^{J} a_j h_{S_j}^T e^*.
$$

From $U_{S_j}^{\pi}(\hat{\beta}_S^{\pi}) = 0$ above, $j = 1, \ldots, J$, we have $\sum_{j=1}^{J} a_j h_{S_j}^T e^* = 0$; then $\hat{t}_G = \sum_{k \in U} \mu_k^S$.
B Corollary 1
Proof. Homogeneity of the dispersion parameter, a systematic component with intercept and the canonical link lead to $h_{kj} = \phi x_{kj}$, where $h_{k1} = \phi$. Then, with $a_1 = \phi^{-1}$ and $a_j = 0$ for $j \neq 1$, we have $\sum_{j=1}^{J} a_j h_j = 1_N$. Therefore, by theorem 1, $t_y = \sum_{k \in U} \mu_k^U$ and $\hat{t}_G = \sum_{k \in U} \mu_k^S$.
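The consequence of corollary 1 can be checked numerically: with an intercept and the canonical link, the $\pi$-weighted residual total vanishes at the solution of the weighted score equations, so the estimator reduces to the sum of fitted means over the population. The sketch below uses a Poisson model with log link. It is an illustration under assumptions, not the authors' code: the function name `fit_poisson_pi_weighted`, the generating coefficients, the inclusion probabilities and the sampling mechanism are all hypothetical.

```python
import numpy as np


def fit_poisson_pi_weighted(X, y, pi, iters=50):
    """Solve the pi-weighted Poisson score equations (log link) by Newton steps."""
    w = 1.0 / pi
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(np.sum(w * y) / np.sum(w))  # start at the weighted mean
    for _ in range(iters):
        mu = np.exp(X @ beta)
        score = X.T @ (w * (y - mu))
        info = X.T @ (X * (w * mu)[:, None])
        step = np.linalg.solve(info, score)
        beta = beta + np.clip(step, -1.0, 1.0)  # cap the step to avoid overshooting
    return beta


rng = np.random.default_rng(7)
N = 500
x_U = rng.chisquare(1, size=N)
X_U = np.column_stack([np.ones(N), x_U])      # first column is the intercept
y_U = rng.poisson(np.exp(0.3 + 0.2 * x_U))    # hypothetical population values
pi_U = rng.uniform(0.1, 0.5, size=N)          # hypothetical inclusion probabilities
in_sample = rng.uniform(size=N) < pi_U        # independent draws, for illustration
X_S, y_S, pi_S = X_U[in_sample], y_U[in_sample], pi_U[in_sample]

beta_hat = fit_poisson_pi_weighted(X_S, y_S, pi_S)
residual_total = np.sum((y_S - np.exp(X_S @ beta_hat)) / pi_S)
sum_fitted = np.exp(X_U @ beta_hat).sum()
t_G = sum_fitted + residual_total

print("weighted residual total:", residual_total)
print("t_G:", t_G, "sum of fitted means:", sum_fitted)
```

The first score equation is the one associated with the intercept column, which is exactly the weighted residual total; this is why it is numerically zero once the iteration has converged.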
C Corollary 2
Proof. A normal response, a systematic component with the canonical link and a dispersion parameter expressed as $\phi_k^{-1} = \sum_{j=1}^{J} a_j x_{kj}$ lead to

$$
h_{kj} = \frac{x_{kj}}{\sum_{r=1}^{J} a_r x_{kr}}.
$$

It can be verified that $\sum_{j=1}^{J} a_j h_j = 1_N$. Therefore, by theorem 1, $t_y = \sum_{k \in U} \mu_k^U$ and $\hat{t}_G = \sum_{k \in U} \mu_k^S$.
Lemma 1. If $\hat{t}_G$ is based on a model $\xi$ with homogeneous dispersion parameter, a systematic component with intercept and the canonical link, then $\mu^T (Y_U - \mu) \approx 0$, with $\mu = (\mu_1^U, \ldots, \mu_N^U)^T$.

Proof. Expanding $\mu_k^U$ in a first-order Taylor series around $x_k^{\circ} = (1, x_{k2}^{\circ}, \ldots, x_{kJ}^{\circ})^T$ gives

$$
\mu_k^U = \mu(x_k, \hat{\beta}_U) \approx \mu_k^{\circ} + \sum_{j=2}^{J} V(\mu_k^{\circ})\, \hat{\beta}_{U_j} (x_{kj} - x_{kj}^{\circ}),
$$

where $\mu_k^{\circ} = \mu(x_k^{\circ}, \hat{\beta}_U)$. For example, with $x_{kj}^{\circ} = N^{-1} t_{x_j} = \bar{x}_{U_j}$, $j = 2, \ldots, J$, we have $x_k^{\circ} = x^{\circ}$ and therefore $\mu_k^{\circ} = \mu^{\circ}$; then

$$
\mu_k^U \approx \mu^{\circ} + \sum_{j=2}^{J} V(\mu^{\circ})\, \hat{\beta}_{U_j} (x_{kj} - \bar{x}_{U_j}) = \sum_{j=1}^{J} m_j x_{kj},
$$

where $m_j = V(\mu^{\circ}) \hat{\beta}_{U_j}$ for $j \neq 1$ and $m_1 = \mu^{\circ} - \sum_{j=2}^{J} V(\mu^{\circ}) \hat{\beta}_{U_j} \bar{x}_{U_j}$. In matrix form, $\mu$ can be written as

$$
\mu \approx X_U m,
$$

with $m = (m_1, \ldots, m_J)^T$. The estimate $\hat{\beta}_U$ satisfies $U_U(\hat{\beta}_U) = 0$. Since $\hat{t}_G$ is based on a model $\xi$ with homogeneous dispersion parameter, a systematic component with intercept and the canonical link, $U_U(\beta)$ can be expressed as $U_U(\beta) = X_U^T (Y_U - \mu)$; therefore $X_U^T (Y_U - \mu) = 0$, and

$$
\mu^T (Y_U - \mu) \approx (X_U m)^T (Y_U - \mu) = m^T X_U^T (Y_U - \mu) = m^T 0 = 0.
$$
D Theorem 2
Proof. According to Pregibon (1981), at convergence of the iterative process for solving $U_S^{\pi}(\hat{\beta}_S^{\pi}) = 0$, the estimate $\hat{\beta}_S^{\pi}$ can be expressed as

$$
\hat{\beta}_S^{\pi} = (X_S^T V_S^* X_S)^{-1} X_S^T V_S^* Y_S^* \approx (X_S^T V_U^* X_S)^{-1} X_S^T V_U^* Y_U^*, \qquad \text{(D.1)}
$$

where $V_S^* = \mathrm{diag}\{V_1^S \phi_1/\pi_1, \ldots, V_n^S \phi_n/\pi_n\}$, $V_U^* = \mathrm{diag}\{V_1^U \phi_1/\pi_1, \ldots, V_n^U \phi_n/\pi_n\}$, $Y_S^* = (y_1^S, \ldots, y_n^S)^T$ and $Y_U^* = (y_1^U, \ldots, y_n^U)^T$, with $y_k^S = \{\eta_k + V_k^{-1}(y_k - \mu_k)\}_{\beta = \hat{\beta}_S^{\pi}}$ and $y_k^U = \{\eta_k + V_k^{-1}(y_k - \mu_k)\}_{\beta = \hat{\beta}_U}$. From result 5.10.1 of Särndal, Swensson and Wretman (1992), $\hat{\beta}_S^{\pi}$ (as in expression D.1) is an approximately unbiased estimator of $\hat{\beta}_U$ (i.e., $E(\hat{\beta}_S^{\pi} - \hat{\beta}_U) \approx 0$) with approximate variance-covariance matrix given by

$$
(X_U^T W_U^* X_U)^{-1}\, \Sigma_U\, (X_U^T W_U^* X_U)^{-1}.
$$

Expanding the GEREG estimator $\hat{t}_G$ in a first-order Taylor series around $\hat{\beta}_U$ gives

$$
\hat{t}_G \approx \sum_{k \in U} \mu_k^U + \sum_{j=1}^{J} \sum_{k \in U} V_k^U x_{kj} (\hat{\beta}_{S_j}^{\pi} - \hat{\beta}_{U_j}) \approx t_y + 1_N^T V_U X_U (\hat{\beta}_S^{\pi} - \hat{\beta}_U).
$$

Then, we can write

i) $E(\hat{t}_G) \approx t_y + 1_N^T V_U X_U E(\hat{\beta}_S^{\pi} - \hat{\beta}_U) = t_y$;

ii) $\mathrm{Var}(\hat{t}_G) \approx 1_N^T V_U X_U \mathrm{Var}(\hat{\beta}_S^{\pi} - \hat{\beta}_U) X_U^T V_U 1_N = \Gamma_U^T \Sigma_U \Gamma_U$, and an estimator of $\mathrm{Var}(\hat{t}_G)$ can be obtained by replacing the unknown quantities in $\mathrm{Var}(\hat{t}_G)$ with their estimators, i.e., $V_k^U$ by $V_k^S$ and $\sigma_{ij}^U$ by $\sigma_{ij}^S$, which gives

iii) $\widehat{\mathrm{Var}}(\hat{t}_G) = \Gamma_S^T \Sigma_S \Gamma_S$.
E Corollary 3
Proof. Since $\hat{t}_G$ is based on a model $\xi$ with homogeneous dispersion parameter, a systematic component with intercept and the canonical link, we have

$$
\Gamma_U = (X_U^T V_U^* X_U)^{-1} X_U^T V_U^* 1_N = (1, 0, \ldots, 0)^T_{(J \times 1)},
$$

so that $\Gamma_U^T \Sigma_U \Gamma_U$ is the $(1, 1)$ element of $\Sigma_U$, i.e., $\mathrm{Var}(\hat{t}_G) \approx \sigma_{11}^U$ and $\widehat{\mathrm{Var}}(\hat{t}_G) \approx \sigma_{11}^S$, where $x_{k1} = \phi_k = 1$ for $k \in U$.
F Theorem 3
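Proof. Using the definition of Pearson's linear correlation coefficient, we can write

$$
1 - R_U^2 = 1 - \frac{\Big[ \sum_{k \in U} y_k \mu_k^U - N \bar{y}_U \bar{\mu}_U \Big]^2}{\Big( \sum_{k \in U} y_k^2 - N \bar{y}_U^2 \Big) \Big( \sum_{k \in U} (\mu_k^U)^2 - N \bar{\mu}_U^2 \Big)},
$$

where $\bar{\mu}_U = N^{-1} \sum_{k \in U} \mu_k^U$. Since the conditions of corollary 1 are satisfied, we have $\sum_{k \in U} (y_k - \mu_k^U) = 0$; then $\bar{y}_U = \bar{\mu}_U$ and we can write

$$
1 - R_U^2 = 1 - \frac{\Big[ \sum_{k \in U} y_k \mu_k^U - N \bar{y}_U^2 \Big]^2}{\Big( \sum_{k \in U} y_k^2 - N \bar{y}_U^2 \Big) \Big( \sum_{k \in U} (\mu_k^U)^2 - N \bar{y}_U^2 \Big)}.
$$

From lemma 1 we have $\mu^T (Y_U - \mu) \approx 0$, i.e., $\sum_{k \in U} y_k \mu_k^U \approx \sum_{k \in U} (\mu_k^U)^2$; therefore

$$
\begin{aligned}
1 - R_U^2 &\approx 1 - \frac{\Big[ \sum_{k \in U} (\mu_k^U)^2 - N \bar{y}_U^2 \Big]^2}{\Big( \sum_{k \in U} y_k^2 - N \bar{y}_U^2 \Big) \Big( \sum_{k \in U} (\mu_k^U)^2 - N \bar{y}_U^2 \Big)} \\
&= 1 - \frac{\sum_{k \in U} (\mu_k^U)^2 - N \bar{y}_U^2}{\sum_{k \in U} y_k^2 - N \bar{y}_U^2} \\
&\approx \frac{\sum_{k \in U} y_k^2 - 2 \sum_{k \in U} y_k \mu_k^U + \sum_{k \in U} (\mu_k^U)^2}{\sum_{k \in U} y_k^2 - N \bar{y}_U^2} \\
&= \frac{\sum_{k \in U} (y_k - \mu_k^U)^2}{\sum_{k \in U} (y_k - \bar{y}_U)^2} \\
&= \Delta_{AAS}.
\end{aligned}
$$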
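For a normal response with identity link and an intercept, the approximation in lemma 1 holds exactly, so the identity between $1 - R_U^2$ and the ratio of residual to total sums of squares can be verified numerically. The sketch below is only an illustration of that special case: the population values and coefficients are hypothetical, and the fit is an ordinary least squares census fit.

```python
import numpy as np

rng = np.random.default_rng(11)
N = 1000
x = rng.chisquare(1, size=N)
y = 2.0 + 1.5 * x + rng.normal(size=N)  # hypothetical population values

# Census fit of the normal linear model with intercept.
X = np.column_stack([np.ones(N), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
mu = X @ beta

# Squared Pearson correlation between y and the fitted means.
r2 = np.corrcoef(y, mu)[0, 1] ** 2

# Ratio of residual to total sums of squares.
ratio = np.sum((y - mu) ** 2) / np.sum((y - y.mean()) ** 2)

print("1 - R^2:", 1.0 - r2)
print("residual / total sum of squares:", ratio)
```

For other members of the exponential family the two quantities agree only approximately, to the extent that the first-order expansion used in lemma 1 is accurate.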
References
Breidt, F.J. and Opsomer, J.D. (2000). Local polynomial regression estimators in survey sampling. Annals of Statistics, 28, 1026-1053.
Cox, D.R. and Hinkley, D.V. (1974). Theoretical Statistics. London: Chapman and Hall.
Dobson, A.J. (2001). An Introduction to Generalized Linear Models, 2nd Edition. London: Chapman and Hall.
Duchesne, P. (2003). Estimation of a proportion with survey data. Journal of Statistics Education, 11, 1-24.
Estevao, V.M. and Särndal, C.E. (2004). Borrowing strength is not the best technique within a wide class of design-consistent domain estimators. Journal of Official Statistics, 20, 645-669.
Fox, J. (2008). Applied Regression Analysis and Generalized Linear Models, 2nd Edition. London: Sage.
Fuller, W.A. (2002). Regression estimation for survey samples. Survey Methodology, 28, 5-23.
Isaki, C.T. and Fuller, W.A. (1982). Survey design under the regression superpopulation model. Journal of the American Statistical Association, 77, 89-96.
Li, Yan (2008). Generalized regression estimators of a finite population total using the Box-Cox technique. Survey Methodology, 34, 79-89.
Lehtonen, R. and Veijanen, A. (1998). Logistic generalized regression estimators. Survey Methodology, 24, 51-55.
Lehtonen, R., Särndal, C.E. and Veijanen, A. (2003). The effect of model choice in estimation for domains, including small domains. Survey Methodology, 29, 33-44.
Lehtonen, R. and Pahkinen, E.J. (2004). Practical Methods for Design and Analysis of Complex Surveys, 2nd Edition. Chichester: Wiley.
Lehtonen, R., Särndal, C.E. and Veijanen, A. (2005). Does the model matter? Comparing model-assisted and model-dependent estimators of class frequencies for domains. Statistics in Transition, 7, 649-673.
McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models, 2nd Edition. London: Chapman and Hall.
Pregibon, D. (1981). Logistic regression diagnostics. Annals of Statistics, 9, 705-724.
Särndal, C.E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. New York: Springer Verlag.
Skinner, C.J. and Holt, D. (1989). Analysis of Complex Surveys. New York: John Wiley and Sons.
Wright, R.L. (1983). Finite population sampling with multivariate auxiliary information. Journal of the American Statistical Association, 78, 879-884.