Overview of the state-of-the-art of survey sampling

Imbi Traat

Institute of Mathematical Statistics, University of Tartu

August 23-27, 2009, Kyiv

4 lectures

The four lectures

• Population, domains, parameters, sample, sampling design, sampling procedure;

• Unbiased estimation and variance estimation, the effect of the second order inclusion probabilities;

• More on estimation;

• Calibration: for increasing precision, for compensating non-response, for achieving consistency between estimates.

• Sample survey − a sample-based study of a finite population,

• the aim − reliable, timely (regular), cost-efficient estimates for the population and for its domains,

• today, a worldwide survey industry,

• government, academic, private, mass-media, etc. sectors,

• central statistical offices in many countries have to provide information by law,

• proper specialized education at universities is needed.

Some steps in History

• 1891: the Census in Norway used partial investigation of the population,

• Kiaer (1897) describes the representative method (SI-sampling, purposive sampling),

• Neyman (1934) gives a new sense to "representative" (confidence intervals),

• stratified sampling with different sampling fractions is not representative in the old sense but can be more efficient,

• the study of different sampling methods developed explosively.

Some steps in History

• 1991 Estonia became independent,

• 1993 First sampling course at Tartu University,

• 1995 Household Budget Survey, Labour Force Survey, Enterprise Survey,

• many difficulties,

• the Baltic-Nordic Network in Survey Sampling Theory and Methodology has been very important.

Population and other notions

Population (finite, size $N$, elements/units identified):

$U = \{1, 2, \ldots, N\}$.

Frame − a list of population units with some characteristics (nowadays in a computer), e.g. a Register.

Units (people, schools, enterprises, farms, ...):

$k$, $k \in U$.

Sample (a set, part of $U$):

$s$, sample size $n$.

Variable, its value on unit k:

yk − study variable (income, education, labour force status, ...)

xk − auxiliary variable (age, sex, address, ...).

Taken from Register or measured in a survey.

Population parameters:

$Y = \sum_U y_k$ − total

$\bar{Y} = \frac{1}{N} \sum_U y_k$ − mean

$R = \frac{Y}{Z}$ − ratio of two totals, $Z = \sum_U z_k$.

Special interpretation for binary yk, i.e. yk ∈ {0,1}.

Domain − part of the population:

Usually formed by a categorical variable, or by cross-classifying several categorical variables (county, sex-age class)

$U_d$, size $N_d$, $d = 1, 2, \ldots, D$.

Small domains − small area estimation

Domain parameters

$Y_d = \sum_{U_d} y_k$,

$\bar{Y}_d = \frac{1}{N_d} \sum_{U_d} y_k$,

$R_d = \frac{Y_d}{Z_d}$, $Z_d = \sum_{U_d} z_k$.

Probability sample − $s$ is random, each sample has a probability $p(s)$.

Sampling design − probability distribution $p(s)$ on all possible samples $s$

Task 1. Let $N = 4$, i.e. $U = \{1,2,3,4\}$, and let $n = 2$. Fill in!

s      {1,2}   {1,3}
p(s)   0.3

The probabilities $p(s)$ are defined by the sampling procedure; they are often unknown.

Sampling procedure − the activities carried out to draw a sample

For a given sampling design $p(s)$ there are many sampling procedures.

Design $p(s)$ defines random selection of a population unit $k$:

$I_k = 1$ if unit $k$ is selected, $I_k = 0$ if it is not selected − sampling indicator, a random variable,

$\pi_k = P(I_k = 1)$, inclusion probability,

$\pi_{kl} = P(I_k = 1, I_l = 1)$, second order inclusion probability.

$\pi_k$ is needed for (nearly) unbiased estimation, $\pi_{kl}$ is needed in variance formulae.

Task 2. Find $\pi_1, \pi_2, \pi_3, \pi_4$ for the design of Task 1. We use the formula

$\pi_k = \sum_{s:\, k \in s} p(s)$.

The $\pi_k$ are known or approximately known in practice, the $\pi_{kl}$ are often unknown.
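The summation formula for the inclusion probabilities can be sketched in a few lines of Python. The design below is a made-up illustration on $U = \{1,2,3,4\}$ with $n = 2$ (it is not the design of Task 1), and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical design p(s) on U = {1,2,3,4}, n = 2 (NOT the design of Task 1).
design = {
    (1, 2): 0.3, (1, 3): 0.2, (1, 4): 0.1,
    (2, 3): 0.1, (2, 4): 0.1, (3, 4): 0.2,
}
units = [1, 2, 3, 4]

def inclusion_probabilities(design, units):
    """pi_k = sum of p(s) over all samples s that contain unit k."""
    return {k: sum(p for s, p in design.items() if k in s) for k in units}

def second_order_probabilities(design, units):
    """pi_kl = sum of p(s) over all samples s that contain both k and l."""
    return {(k, l): sum(p for s, p in design.items() if k in s and l in s)
            for k, l in combinations(units, 2)}

pi = inclusion_probabilities(design, units)
pi2 = second_order_probabilities(design, units)
print(pi)
print(sum(pi.values()))  # for a fixed-size design this sum equals n
```

For a fixed-size design the first order inclusion probabilities sum to $n$, which gives a quick sanity check on any hand calculation.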

Some classical sampling procedures

• systematic sampling;

• probability-proportional-to-size-sampling;

• cluster sampling;

• multi-stage sampling;

• multi-phase sampling;

• stratified sampling, possibly with different designs in strata.

A sampling design can be without replacement (WOR) or with replacement (WR); of fixed or variable sample size; an equal or unequal probability design.

Simple random sampling without replacement − SI

DEF: All samples of size $n$ are equally probable: $p(s) = 1 / \binom{N}{n}$.

The SI-design

• is independent of all variables (distributions in the sample ≈ distributions in the population);

• aspires to proportional allocation of the sample to all groups of the population;

• is theoretically the best studied design;

• is used as a component of complex designs in practice;

• has properties that are used as approximations for other designs.

Some sampling procedures for drawing a SI-sample

• enumerate all samples and draw one at random (like a ball from an urn);

• draw-wise selection of units (a selected unit is deleted from $U$);

• list-wise: a Bernoulli experiment is performed for each unit in $U$ with probability (given the earlier outcomes $i_1, \ldots, i_k$)

$\Pr(I_{k+1} = 1) = \frac{n - \sum_{j=1}^{k} i_j}{N - k}$;

• rejective way: Bernoulli sampling with probabilities $\pi_k = n/N$ is performed in $U$, and the sample is rejected if its size is not $n$;

• order sampling (good for sharing response burden).
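As an illustration of the list-wise procedure, here is a minimal Python sketch. The function name `si_sample_listwise` and the seed are assumptions made for this example only.

```python
import random

def si_sample_listwise(N, n, rng=random):
    """List-wise SI selection: visit units 1..N once and select unit k+1
    with probability (n - number already selected) / (N - k)."""
    sample = []
    for k in range(N):  # k units have been visited so far
        prob = (n - len(sample)) / (N - k)
        if rng.random() < prob:
            sample.append(k + 1)  # units are labelled 1..N
    return sample

rng = random.Random(2009)  # arbitrary seed, for reproducibility only
s = si_sample_listwise(10, 4, rng)
print(s, len(s))
```

The selection probability becomes 1 when all remaining units are needed and 0 once $n$ units have been taken, so the sample size is always exactly $n$ after a single pass through the list.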

Characteristics of the SI-design

$\pi_k = \frac{C_{N-1}^{n-1}}{C_N^n} = \frac{n}{N}$, $\forall k$,

$\pi_{kl} = \frac{n}{N} \cdot \frac{n-1}{N-1}$, $\forall k \neq l$.

If some other design has the same $\pi_k$ and $\pi_{kl}$ as SI, it still need not be an SI-design.

An alternative approach to samples and sampling design

Example. Let $N = 4$, $n = 2$. Let us have the frame of $U$ in a computer file. We create a column where sampled elements are denoted by 1.

U   1st sample   2nd sample
1   1            1
2   1            0
3   0            1
4   0            0

A sample is a vector $i = (i_1, i_2, \ldots, i_N)$; a sampling design assigns probabilities to these vectors.

A sampling design is a multivariate discrete distribution of the sampling vector $I$,

$I \sim p(i) = \Pr(I = i)$.

Advantages

• some sampling designs have explicit probability functions p(i);

• tools from distribution theory become applicable;

• drawing a sample is simulation from distribution;

• unified consideration for WOR and WR designs.

Probability functions of some WOR sampling designs

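Conditional Poisson: $p(i) = C \cdot \prod_k p_k^{i_k} (1 - p_k)^{1 - i_k}$, $|i| = n$.

Sampford: $p(i) = C \cdot \prod_k p_k^{i_k} (1 - p_k)^{1 - i_k} \times \sum_k (1 - p_k) i_k$, $|i| = n$.

Pareto: $p(i) = \prod_k p_k^{i_k} (1 - p_k)^{1 - i_k} \times \sum_k c_k i_k$, $|i| = n$,

where $c_k = \int_0^\infty x^{n-1} \prod_j \frac{1 + \tau_j}{1 + \tau_j x} \cdot \frac{1}{1 + \tau_k x}\, dx$, with $\tau_j = p_j / (1 - p_j)$.

Parameters: $0 < p_k < 1$, $\sum_k p_k = n$.

$\pi_k = p_k$ for Sampford, otherwise $\pi_k \approx p_k$ (a lot of research!)

If $p_k \equiv n/N$ then SI in each case.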

Research problems

• relation of these designs to each other,

• good approximations to πkl,

• simulation with advanced methods (MCMC) for getting a sample rapidly.

The probability functions of some WR designs.

Multinomial design $Mult(n; p_1, p_2, \ldots, p_N)$,

$p(i) = n! \prod_k \frac{p_k^{i_k}}{i_k!}$, $|i| = n$,

$\sum_k p_k = 1$, $i_k \in \{0, 1, \ldots, n\}$.

If $p_k \equiv 1/N$ then SIR design.

Estimation. Let $\hat\theta$ be an estimator of $\theta$.

Different approaches to describing the randomness of $\hat\theta$:

• design-based approach − $\hat\theta = \hat\theta(s)$, thus $\hat\theta$ is a discrete r.v. taking its values with probabilities $p(s)$; all of its characteristics are defined by $p(s)$;

• model-based approach − $y_k = f(x_k) + \varepsilon_k$, $k \in s$; assumptions on the r.v. $\varepsilon_k$ define the properties of the estimator $\hat\theta$;

• model-assisted approach − the relations $y_k = f(x_k) + \varepsilon_k$, $k \in s$, with assumptions on $\varepsilon_k$, are used for constructing $\hat\theta$; its properties are defined by $p(s)$.

The model-based approach is useful in small area estimation; statistical agencies like to use the design-based and model-assisted approaches.

General unbiased estimator for the total $Y = \sum_U y_k$ (design-based approach):

Horvitz-Thompson (HT) estimator

$\hat{Y} = \sum_s \frac{y_k}{\pi_k}$.

Alternatively,

$\hat{Y} = \sum_s a_k y_k$,

$a_k = 1/\pi_k$, design weight.

Interpretation! Weight column. Weight modification. Self-weighting design.

Note, if $y_k \equiv 1$, then $N = \sum_U y_k$ is the population size, and $\hat{N} = \sum_s a_k$ is its unbiased estimator.
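A minimal sketch of the HT-estimator as a weighted sample sum. The sample values and inclusion probabilities are invented for illustration, and `ht_total` is a hypothetical helper name.

```python
def ht_total(y_sample, pi_sample):
    """Horvitz-Thompson estimate: sum of y_k / pi_k over the sample."""
    return sum(y / p for y, p in zip(y_sample, pi_sample))

# Hypothetical sample of three units with unequal inclusion probabilities.
y_s = [12.0, 7.0, 3.0]
pi_s = [0.5, 0.25, 0.1]

weights = [1 / p for p in pi_s]            # design weights a_k = 1 / pi_k
Y_hat = ht_total(y_s, pi_s)                # estimated total of y
N_hat = ht_total([1.0] * len(y_s), pi_s)   # y_k = 1 estimates the population size
print(weights, Y_hat, N_hat)
```

Each design weight can be read as the number of population units that the sampled unit represents; setting $y_k \equiv 1$ turns the same function into an estimator of $N$.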

Example: the average household size $H = M/N$.

$N$ − number of households (hhs), $M$ − number of people, $m_k$ − size of hh $k$.

A sample $s$ of $n$ hhs is drawn through people from the Population Register. Big hhs are over-represented:

$\pi_k = n m_k / M$ − inclusion probability of hh $k$ (approximate);

Sample mean $\hat{H} = \sum_s m_k / n$, biased;

HT-estimator $\hat{H} = \frac{1}{N} \sum_s \frac{m_k}{n m_k / M} = \frac{M}{N}$, exact;

Usually $N$ is unknown; use $\hat{N} = \sum_s 1/\pi_k = \frac{M}{n} \sum_s \frac{1}{m_k}$.

Now $\hat{H} = \frac{\hat{M}}{\hat{N}} = \frac{M}{\hat{N}} = \frac{n}{\sum_s 1/m_k}$, approximately unbiased.

HT-estimator is unbiased under any design.

Since $E(I_k) = \Pr(I_k = 1) = \pi_k$, then

$E(\hat{Y}) = E\left(\sum_s \frac{y_k}{\pi_k}\right) = E\left(\sum_U I_k \frac{y_k}{\pi_k}\right) = \sum_U E(I_k) \frac{y_k}{\pi_k} = \sum_U y_k = Y$.

Meaning of unbiasedness and variability in the design-based sense.
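The design-based meaning of unbiasedness can be checked by brute force for a small population: compute the HT-estimate from every possible sample and average the estimates with weights $p(s)$. The design and the $y$-values below are invented for illustration (they are not those of Task 1 and Task 3).

```python
# Hypothetical design on U = {1,2,3,4}, n = 2, and hypothetical y-values.
design = {
    (1, 2): 0.3, (1, 3): 0.2, (1, 4): 0.1,
    (2, 3): 0.1, (2, 4): 0.1, (3, 4): 0.2,
}
y = {1: 10.0, 2: 5.0, 3: 2.0, 4: 3.0}

# First order inclusion probabilities from the design.
pi = {k: sum(p for s, p in design.items() if k in s) for k in y}

# HT-estimate from each possible sample: the estimates vary between samples.
estimates = {s: sum(y[k] / pi[k] for k in s) for s in design}

# Design expectation: average of the estimates weighted by p(s).
expectation = sum(design[s] * estimates[s] for s in design)

print(estimates)
print(expectation, sum(y.values()))  # the two agree up to rounding error
```

No single sample reproduces $Y$ here, yet the $p(s)$-weighted average of the estimates does: unbiasedness is a statement about the whole set of possible samples, not about any one of them.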

Task 3. Let $N = 4$ and $y = (9, 6, 4, 1)$. Find $Y$ and, from each sample, $\hat{Y}$ under the design of Task 1. Do we see variability? Is the estimator unbiased?

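Variance of the HT-estimator:

$V(\hat\theta) = E(\hat\theta - E(\hat\theta))^2$.

It measures from-sample-to-sample variability around $E(\hat\theta)$.

For the HT-estimator, denoting $\check{y}_k = y_k / \pi_k$, we have

$V(\hat{Y}) = \sum_{k \in U} \sum_{l \in U} (\pi_{kl} - \pi_k \pi_l)\, \check{y}_k \check{y}_l$.

An unbiased estimator for the variance is

$\hat{V}(\hat{Y}) = \sum_{k \in s} \sum_{l \in s} \left(1 - \frac{\pi_k \pi_l}{\pi_{kl}}\right) \check{y}_k \check{y}_l$.

Sen-Yates-Grundy variance estimator for fixed size designs,

$\hat{V}(\hat{Y}) = -\frac{1}{2} \sum_{k \in s} \sum_{l \in s} \left(1 - \frac{\pi_k \pi_l}{\pi_{kl}}\right) (\check{y}_k - \check{y}_l)^2$.

Other precision measures: $\sqrt{V(\hat{Y})}$, standard error; $\sqrt{V(\hat{Y})} / Y$, relative error. How big an error can be tolerated?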
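The variance formula and the two variance estimators can be compared numerically on a toy example. The sketch below uses an invented fixed-size design with all $\pi_{kl} > 0$ and invented $y$-values; it computes the design variance from the double sum over $U$ and the design expectations of both estimators by enumerating all samples.

```python
# Hypothetical fixed-size design (n = 2) with all pi_kl > 0, and hypothetical y.
design = {
    (1, 2): 0.3, (1, 3): 0.2, (1, 4): 0.1,
    (2, 3): 0.1, (2, 4): 0.1, (3, 4): 0.2,
}
y = {1: 10.0, 2: 5.0, 3: 2.0, 4: 3.0}
units = sorted(y)

def pi2(k, l):
    """Second order inclusion probability; pi2(k, k) equals pi_k."""
    return sum(p for s, p in design.items() if k in s and l in s)

pi = {k: pi2(k, k) for k in units}
yc = {k: y[k] / pi[k] for k in units}  # expanded values y_k / pi_k

# Design variance of the HT-estimator: double sum over the population.
V = sum((pi2(k, l) - pi[k] * pi[l]) * yc[k] * yc[l]
        for k in units for l in units)

def v_ht(s):
    """HT-type variance estimator computed from sample s."""
    return sum((1 - pi[k] * pi[l] / pi2(k, l)) * yc[k] * yc[l]
               for k in s for l in s)

def v_syg(s):
    """Sen-Yates-Grundy variance estimator computed from sample s."""
    return -0.5 * sum((1 - pi[k] * pi[l] / pi2(k, l)) * (yc[k] - yc[l]) ** 2
                      for k in s for l in s)

E_v_ht = sum(p * v_ht(s) for s, p in design.items())
E_v_syg = sum(p * v_syg(s) for s, p in design.items())
print(V, E_v_ht, E_v_syg)  # all three agree up to rounding error
```

Both estimators are unbiased for this design, but their values from an individual sample generally differ, and either can be negative for some samples.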

Unbiased estimation for domains

The same formulae can be applied with a simple trick.

Define a new variable

$y_{dk} = y_k$ if $k \in U_d$, otherwise $y_{dk} = 0$.

Now the domain total is a population total of the new variable,

$Y_d = \sum_{U_d} y_k = \sum_U y_{dk}$.

We know how to estimate the population total $\sum_U y_{dk}$ and what its variance is.
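The zero-extension trick translates directly into code. In this sketch the sample records, the domain labels "A" and "B", and the helper names are all hypothetical.

```python
def ht_total(y_sample, pi_sample):
    """Horvitz-Thompson estimate: sum of y_k / pi_k over the sample."""
    return sum(y / p for y, p in zip(y_sample, pi_sample))

# Hypothetical sample records: (y_k, pi_k, domain label).
sample = [(12.0, 0.5, "A"), (7.0, 0.25, "B"), (3.0, 0.1, "A"), (5.0, 0.2, "B")]

def domain_total(sample, d):
    """Set y_dk = y_k inside domain d and 0 outside, then apply HT to y_dk."""
    y_d = [y if dom == d else 0.0 for y, _, dom in sample]
    pi = [p for _, p, _ in sample]
    return ht_total(y_d, pi)

Y_A = domain_total(sample, "A")
Y_B = domain_total(sample, "B")
Y_all = ht_total([y for y, _, _ in sample], [p for _, p, _ in sample])
print(Y_A, Y_B, Y_all)  # domain estimates add up to the overall estimate
```

Because the same weights are used for every domain, the domain estimates are additive: they sum to the estimate of the population total.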

Estimation under SI

The HT-estimator, its variance and variance estimator:

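$\hat{Y} = \sum_s \frac{y_k}{\pi_k} = \sum_s \frac{N}{n} y_k = N \bar{y}$,

$V(\hat{Y}) = N^2 (1 - f) S^2 / n$,

$\hat{V}(\hat{Y}) = N^2 (1 - f) s^2 / n$,

where

$\bar{y} = \frac{1}{n} \sum_s y_k$ − sample mean,

$f = \frac{n}{N}$ − sampling fraction,

$S^2 = \frac{1}{N - 1} \sum_U (y_k - \bar{Y})^2$ − population variance of the variable,

$s^2 = \frac{1}{n - 1} \sum_s (y_k - \bar{y})^2$ − sample variance of the variable.

An unbiased estimator of the population mean $\bar{Y} = Y/N$ is $\bar{y}$, with $\hat{V}(\bar{y}) = (1 - f) s^2 / n$; note the difference from the classical result (the factor $1 - f$)!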
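The SI formulas reduce to a few lines of arithmetic. The sketch below uses an invented sample of four values from a population of assumed size $N = 20$; `si_estimates` is a hypothetical helper name.

```python
from statistics import mean, variance

def si_estimates(y_sample, N):
    """HT-estimate of the total and its variance estimate under SI."""
    n = len(y_sample)
    f = n / N                      # sampling fraction
    y_bar = mean(y_sample)         # sample mean
    s2 = variance(y_sample)        # sample variance, divisor n - 1
    Y_hat = N * y_bar
    V_hat = N ** 2 * (1 - f) * s2 / n
    return Y_hat, V_hat

# Hypothetical SI sample of n = 4 from a population of N = 20.
Y_hat, V_hat = si_estimates([4.0, 6.0, 8.0, 10.0], N=20)
print(Y_hat, V_hat, V_hat ** 0.5)  # estimate, variance estimate, standard error
```

Dropping the factor $1 - f$ would give the familiar with-replacement (iid) variance estimate, which overstates the variance when the sampling fraction is not negligible.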

Task 5.

Let the $y$-variable be as in Task 3. Find an estimate of $Y$ from each sample, assuming an SI-design with sample size 2. Compare the variability of this estimator with that of the estimator in Task 3.

Which design is more efficient in estimating $Y$?

A recent alternative form for variance of HT-estimator

Note that the SI variance can be written as

$V_{SI}(\hat{Y}) = \left(1 - \frac{n-1}{N-1}\right) V_{SIR}(\hat{Y})$,

where SIR means simple random sampling with replacement.

SIR corresponds to iid sampling,

SI is more efficient than SIR,

how much more efficient depends here on $n$,

Knottnerus (2003) has derived a similar expression for general WOR and WR designs.

The Knottnerus variance of HT-estimator

Let a fixed size $n$ WOR design have inclusion probabilities $\pi_k$ and $\pi_{kl}$.

Let the WR design be multinomial $Mult(n; p_1, p_2, \ldots, p_N)$ such that the expected number of selections of each unit is the same under the two designs,

$\pi_k = n p_k$.

Then,

$V_{WOR}(\hat{Y}) = (1 + (n-1)\rho)\, V_{Mult}(\hat{Y})$,

where

$V_{Mult}(\hat{Y}) = \sum_U \pi_k (\check{y}_k - Y/n)^2$, with $\check{y}_k = y_k / \pi_k$.

$1 + (n-1)\rho$ is a generalized finite population correction term; it shows how many times the WOR design is more efficient than the WR design.

The effect of the 2nd order inclusion probabilities on the variance comes through $\rho$, the sampling autocorrelation.

The sampling autocorrelation

$\rho = \frac{\sum\sum_{i \neq j} \pi_{ij} (\check{y}_i - Y/n)(\check{y}_j - Y/n)}{(n-1) \sum_i \pi_i (\check{y}_i - Y/n)^2}$.

• why autocorrelation?

• note, it depends on both the $\pi_{kl}$ and the study variable,

• it is known for some simple designs, $-1/(N-1)$ for SI,

• the limits are $-(n-1)^{-1} \le \rho \le 1$,

• it is difficult to estimate; otherwise good new variance estimators could be obtained.
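The value $\rho = -1/(N-1)$ for SI can be checked numerically from the definition of the sampling autocorrelation, with $\check{y}_k = y_k/\pi_k$. The population values below are invented and the function name is hypothetical.

```python
from itertools import combinations

def sampling_autocorrelation(y, pi, pi2, n):
    """rho from its definition; pi2[(i, j)] holds pi_ij for i < j."""
    Y = sum(y.values())
    z = {k: y[k] / pi[k] - Y / n for k in y}  # centred expanded values
    # The double sum over i != j counts every unordered pair twice.
    num = 2 * sum(pi2[(i, j)] * z[i] * z[j]
                  for i, j in combinations(sorted(y), 2))
    den = (n - 1) * sum(pi[k] * z[k] ** 2 for k in y)
    return num / den

# Hypothetical population of N = 5 units, SI design with n = 3.
N, n = 5, 3
y = {1: 2.0, 2: 7.0, 3: 1.0, 4: 8.0, 5: 4.0}
pi = {k: n / N for k in y}
pi2 = {(i, j): n * (n - 1) / (N * (N - 1))
       for i, j in combinations(sorted(y), 2)}

rho = sampling_autocorrelation(y, pi, pi2, n)
print(rho, -1 / (N - 1))        # both equal -0.25 up to rounding
print(1 + (n - 1) * rho)        # the SI correction factor 1 - (n-1)/(N-1)
```

Under SI the result does not depend on the $y$-values chosen; for unequal probability designs $\rho$ changes with both the $\pi_{kl}$ and the study variable.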

Gabler’s condition

Look at

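$D = \frac{V_{WOR}(\hat{Y})}{V_{Mult}(\hat{Y})} = 1 + (n-1)\rho$.

$D \le 1$ if $\rho \le 0$; the latter depends on the $y$-variable.

When is $D \le 1$ uniformly (i.e. $\forall y$)?

Gabler (1984):

If $\sum_{i \in U} \min_j \frac{\pi_{ij}}{\pi_j} \ge n - 1$, then $D \le 1$ uniformly.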

How much less than 1? Is there a sharper bound possible?

A sharper bound for D, our study

We assume $\pi_1 \le \pi_2 \le \ldots \le \pi_N$. Consider a matrix $B = (b_{ij})$,

$b_{ij} = 1$ if $i = j$, and $b_{ij} = \frac{\pi_{ij}}{\pi_j}$ otherwise.

Let $\lambda_2$ be the 2nd largest eigenvalue of $B$. It holds that

$D \le \lambda_2$.

Under Gabler's condition $\lambda_2 \le 1$.

Can $\lambda_2$ be found exactly?

We consider designs for which

$b_{ij}$ increases with $j$, $i \neq j$.

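These are the Conditional Poisson design, Hájek's approximate design and some other designs.

For such designs

$1 - \frac{2}{\pi_1 + \pi_2}\, \pi_{12} \le \lambda_2 \le 1 - \frac{\pi_{12}}{\pi_2}$.

If $\pi_1 = \pi_2$ then

$\lambda_2 = 1 - \frac{\pi_{12}}{\pi_1}$.

For SI-sampling,

$D = \lambda_2 = 1 - \frac{n-1}{N-1}$.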

Stratified sampling − most often used in practice.

The population $U$ is divided into groups (strata) with the help of stratification variables. Sampling is performed separately in each stratum.

For example, in the Enterprise Survey strata are formed by activity and number of employees.

Why stratification?

• administrative reasons: coordination, sharing responsibility, different situations require different sampling methods;

• coverage of important domains by sampled elements;

• cost reduction;

• variance reduction;

• anticipating non-response.

Estimation under stratified sampling.

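Let $U_h$, $h = 1, 2, \ldots, H$, be the strata in $U$.

Let the total in $U_h$ be $Y_h = \sum_{U_h} y_k$.

The population total is then $Y = \sum_{h=1}^{H} Y_h$.

The $Y_h$ can be estimated by the HT-estimator:

$\hat{Y}_h = \sum_{s_h} \frac{y_k}{\pi_k}$.

The population total is estimated by $\hat{Y} = \sum_{h=1}^{H} \hat{Y}_h$.

The respective variance estimator is $\hat{V}(\hat{Y}) = \sum_{h=1}^{H} \hat{V}(\hat{Y}_h)$.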

Forming strata for variance reduction.

We see from $V(\hat{Y}) = \sum_{h=1}^{H} V(\hat{Y}_h)$ that the variance is smaller if the variances in strata are smaller.

Therefore it is necessary to form strata that are as homogeneous as possible.

We have a problem if there are many study variables: which are the most important?

Stratified sampling with SI in strata

Used very often.

Let stratum $U_h$ have size $N_h$ and sample size $n_h$.

Then under SI in each stratum

$\pi_k = \frac{n_h}{N_h}$, in stratum $U_h$,

$\hat{Y}_h = \frac{N_h}{n_h} \sum_{s_h} y_k = N_h \bar{y}_h$,

$\hat{Y} = \sum_{h=1}^{H} \hat{Y}_h$.

Design weights $a_k = N_h / n_h$ are constant inside a stratum but may differ between strata.
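A short sketch of the point estimator under stratified SI. The stratum sizes and sample values are invented, and `stratified_si_total` is a hypothetical helper name.

```python
from statistics import mean

def stratified_si_total(strata):
    """strata: list of (N_h, list of sampled y-values in stratum h).
    Returns the design weights per stratum and the estimated total."""
    weights = []
    Y_hat = 0.0
    for N_h, ys in strata:
        n_h = len(ys)
        weights.append(N_h / n_h)   # a_k = N_h / n_h, constant within stratum
        Y_hat += N_h * mean(ys)     # stratum estimate N_h * y_bar_h
    return weights, Y_hat

# Hypothetical strata: N_1 = 10 with n_1 = 2, N_2 = 30 with n_2 = 3.
strata = [(10, [2.0, 4.0]), (30, [10.0, 12.0, 14.0])]
weights, Y_hat = stratified_si_total(strata)
print(weights, Y_hat)
```

The two strata get different weights here because their sampling fractions differ; with proportional allocation all weights would coincide and the design would be self-weighting.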

Task 6. Consider stratified SI sampling. Write down the variance estimator for $\hat{Y}$. Explain the notation.

Allocation of sample in strata.

Let the total sample size $n$ be fixed. How to determine $n_h$?

Let the cost function be

$C = c_0 + \sum_{h=1}^{H} n_h c_h$,

where $c_0$ − general expenses, $c_h$ − expenses for getting data from one unit in $U_h$.

Cost can be reduced by not sampling units from strata $U_h$ with large $c_h$. But what effect does this have on the variance?

There is a compromise. The optimal set of $n_h$ is the one that minimizes the product $C \cdot D(\hat{Y})$ (here $D(\hat{Y})$ denotes the variance of $\hat{Y}$).

Optimal allocation of sample

is achieved under stratified SI with

$n_h \propto \frac{N_h \cdot S_{yh}}{\sqrt{c_h}}$,

where the proportionality constant is determined from the condition $n = \sum_{h=1}^{H} n_h$, and

$S_{yh} = \sqrt{\frac{1}{N_h - 1} \sum_{U_h} (y_k - \bar{Y}_h)^2}$

is the standard deviation of $y$ in stratum $U_h$ ($\bar{Y}_h$ is the stratum mean).

We see: more sample from larger strata and from strata with larger $y$-variability, less sample from high-cost strata.

For constant cost in strata (telephone, post, internet) we get

$n_h \propto N_h \cdot S_{yh}$, where the proportionality constant is $\frac{n}{\sum_{h=1}^{H} N_h \cdot S_{yh}}$.

This is Neyman allocation (often used in enterprise surveys).
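Neyman allocation is easy to compute once the stratum sizes and standard deviations are available. The sketch below assumes a variable known for every population unit (for example an auxiliary variable from the frame); the two strata and the function name are hypothetical. It returns real-valued $n_h$ and leaves rounding and any minimum stratum sample size to the user.

```python
from statistics import stdev

def neyman_allocation(strata, n):
    """strata: dict stratum label -> list of values for ALL units in the stratum.
    Returns real-valued n_h proportional to N_h * S_h (S_h with divisor N_h - 1)."""
    size_times_sd = {h: len(v) * stdev(v) for h, v in strata.items()}
    total = sum(size_times_sd.values())
    return {h: n * w / total for h, w in size_times_sd.items()}

# Hypothetical population with two strata.
strata = {
    "small": [1.0, 2.0, 3.0],
    "large": [10.0, 20.0, 30.0, 40.0],
}
alloc = neyman_allocation(strata, n=6)
print(alloc)
```

In this example nearly the whole sample goes to the stratum with the larger spread, which is why a minimum sample size per stratum is usually imposed afterwards.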

Task 7. The table shows a population of hhs. There are 3 strata; a known auxiliary variable is household size. Find the mean and standard deviation of hh size in each stratum. Let the total sample size be 8. Perform a Neyman allocation of the sample to the strata, using the hh size variable. NB! The minimal sample size in a stratum should be 2.

Stratum   HH   HH size
1         1    4
          2    3
          3    4
2         1    4
          2    6
          3    4
          4    7
          5    8
3         1    2
          2    3
          3    2
          4    2
          5    2
          6    3

Problems with allocation

• Optimal is optimal for 1 variable. There are many variables in a survey.

• Optimal is good for variables whose standard deviation behaves like $S_{yh}$. For the rest of the variables it may increase the variance.

• $S_{yh}$ is not known before the survey, but $n_h$ still has to be determined.

• If the variables are very different (in $S_{yh}$), it is better to use proportional allocation

$n_h = n \frac{N_h}{N}$.

Proportional allocation is optimal if $S_{yh}$ is the same in all strata. It has been shown that

$V_{opt}(\hat{Y}) \le V_{prop}(\hat{Y}) \le V_{SI}(\hat{Y})$.

Calibration

A method for constructing estimators that use auxiliary information in a certain way.

Theoretical elaborations ca. 20 years ago (Deville and Särndal 1992).

Now the method is used in statistical agencies worldwide.

The method yields new weights that can be applied to any study variable (a uniform weighting system).

Any domain total is estimated by using the same weight system but summing over the sample in that domain.

Calibration is used

• for increasing precision of estimators,

• for compensating non-response,

• for achieving consistency between estimates from different surveys,

• for consistency within the same survey, if different estimators are used and e.g. additivity is violated.

Let the totals of auxiliary variables be known,

$\mathbf{X} = \sum_U \mathbf{x}_k$ (a vector).

Calibrated weights $w_k$ are such that

$\sum_s w_k \mathbf{x}_k = \mathbf{X}$. (1)

It is also required that the $w_k$ are close to the design weights $a_k = 1/\pi_k$.

Two basic methods:

• a distance minimization method;

• an instrument vector method.

The GREG-estimator (Särndal, Swensson, Wretman 1992) follows as a special case from both approaches.

The philosophy of calibration and GREG-estimation is described in Särndal (2007).

Calibration of weights by instrument vector method

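The weights are sought in the form

$w_k = a_k (1 + \boldsymbol{\lambda}' \mathbf{z}_k)$, (2)

where the instrument vector $\mathbf{z}_k$ has the same dimension as $\mathbf{x}_k$.

$\boldsymbol{\lambda}$ is found from the calibration constraints (1):

$\mathbf{X}' = \sum_s w_k \mathbf{x}_k' = \sum_s a_k (1 + \boldsymbol{\lambda}' \mathbf{z}_k) \mathbf{x}_k'$.

After simple manipulation,

$\boldsymbol{\lambda}' = (\mathbf{X} - \hat{\mathbf{X}})' \left( \sum_s a_k \mathbf{z}_k \mathbf{x}_k' \right)^{-1}$,

where $\hat{\mathbf{X}} = \sum_s a_k \mathbf{x}_k$ is the vector of HT-estimates of the total $\mathbf{X}$.

Weights (2) satisfy constraints (1). They are used in estimating study variable totals:

$\hat{Y}_{CAL} = \sum_s w_k y_k$.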
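For a one-dimensional auxiliary variable the instrument vector method needs no matrix algebra. The sketch below uses invented design weights, auxiliary values and a known total $X = 60$, with the instrument $z_k = x_k$ (i.e. $q_k = 1$); all names and numbers are assumptions for illustration.

```python
def calibrate(a, x, z, X):
    """Scalar instrument calibration: w_k = a_k * (1 + lam * z_k),
    with lam chosen so that sum(w_k * x_k) equals the known total X."""
    X_hat = sum(ak * xk for ak, xk in zip(a, x))  # HT-estimate of X
    lam = (X - X_hat) / sum(ak * zk * xk for ak, zk, xk in zip(a, z, x))
    return [ak * (1 + lam * zk) for ak, zk in zip(a, z)]

# Hypothetical sample: equal design weights, auxiliary x, known total X = 60.
a = [5.0, 5.0, 5.0, 5.0]
x = [1.0, 2.0, 3.0, 4.0]
w = calibrate(a, x, z=x, X=60.0)

print(w)
print(sum(wk * xk for wk, xk in zip(w, x)))  # reproduces X = 60 up to rounding
```

The HT-estimate of $X$ from this sample is 50, so the weights are increased, and with $z_k = x_k$ the increase is larger for units with larger $x_k$. A different instrument changes the individual weights but not the calibrated total of $x$.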

Task 8. Let the population of hhs be as in Task 7. Let the sampling design in the population be SI with size $n = 8$. Take an arbitrary sample of 8 hhs. Find calibrated weights for the units of your sample. We look at a simple case with a 1-dimensional auxiliary variable and instrument variable. Let the auxiliary variable $x_k$ be hh size. Fix an instrument variable (e.g. $z_k = 1/x_k$) and find the weights. Comment!

Domain estimates are calculated with the same weights but summing over the domain sample:

$\hat{Y}_{dCAL} = \sum_{s_d} w_k y_k$.

The instrument vector has to be fixed. Usually

$\mathbf{z}_k = q_k \mathbf{x}_k$,

where the numbers $q_k > 0$. This choice gives the GREG-estimator.

With other choices of $\mathbf{z}_k$ one gets many special cases known earlier.

Post-stratification estimator

Post-strata are subgroups of $U$ that are not used in sampling. They are used at the estimation stage. The sample is divided into post-strata and the estimator is formed as described below.

Let $\mathbf{z}_k = q_k \mathbf{x}_k = \boldsymbol{\delta}_k$, where

$\boldsymbol{\delta}_k = (\delta_{1k}, \delta_{2k}, \ldots, \delta_{Dk})$

is the indicator vector of the post-stratum. Now we have in the calibration estimator

$\mathbf{X} = \sum_U \boldsymbol{\delta}_k = (N_1, N_2, \ldots, N_D)$, sizes of post-strata;

$\hat{\mathbf{X}} = \sum_s a_k \boldsymbol{\delta}_k = (\hat{N}_1, \hat{N}_2, \ldots, \hat{N}_D)$, estimates of post-strata sizes;

$w_k = a_k \frac{N_d}{\hat{N}_d}$, if $k \in s_d$, calibrated weight. Prove it!

Calibrated estimator for the post-stratum total,

$\hat{Y}_{dCAL} = N_d \frac{\hat{Y}_d}{\hat{N}_d}$,

and for the population total (post-stratified estimator),

$\hat{Y}_{CAL} = \sum_{d=1}^{D} N_d \frac{\hat{Y}_d}{\hat{N}_d}$, (3)

where $\hat{Y}_d = \sum_{s_d} a_k y_k$ and $\hat{N}_d = \sum_{s_d} a_k$.

Estimator (3) is the simplest way to compensate for non-response:

• post-strata should be formed so that respondents and non-respondents are similar within them ($N_d$ should be known);

• $\hat{Y}_d$ and $\hat{N}_d$ are calculated using the respondents in post-stratum $d$; $\hat{Y}_d / \hat{N}_d$ estimates the mean in the post-stratum;

• the mean of the respondents is transferred to all units.
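The post-stratified estimator (3) can be sketched as follows. The respondent records, the post-stratum labels "m" and "f", the known sizes $N_d$ and the function name are all invented for illustration.

```python
def poststratified_total(sample, N_post):
    """sample: list of (y_k, a_k, post-stratum label d) for the respondents;
    N_post: dict d -> known post-stratum size N_d.
    Returns sum over d of N_d * Y_hat_d / N_hat_d."""
    Y_d, N_d_hat = {}, {}
    for y, a, d in sample:
        Y_d[d] = Y_d.get(d, 0.0) + a * y      # weighted y-total in post-stratum d
        N_d_hat[d] = N_d_hat.get(d, 0.0) + a  # estimated post-stratum size
    return sum(N_post[d] * Y_d[d] / N_d_hat[d] for d in Y_d)

# Hypothetical respondents with design weight 4 each; one "f" unit did not respond.
sample = [(10.0, 4.0, "m"), (14.0, 4.0, "m"), (20.0, 4.0, "f")]
N_post = {"m": 6, "f": 6}

Y_cal = poststratified_total(sample, N_post)
print(Y_cal)
```

Within each post-stratum the weighted respondent mean $\hat{Y}_d / \hat{N}_d$ is scaled up to the known size $N_d$, which is how the respondents' mean is transferred to the non-respondents.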

Task 9. What form does formula (3) take under SI sampling in the entire population with sample size $n$?

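Calibration for consistency between surveys − AC calibration

We have 2 surveys, the PRS and the RFS.

$\mathbf{y}_k$ is a common variable in both surveys. Its total $\mathbf{Y}_0$ is estimated in the RFS.

We want the domain totals of the $y$-variable in the PRS to be consistent with $\mathbf{Y}_0$.

For units $k$ in the PRS sample $s$ we observe the vectors $\mathbf{x}_k$, $\mathbf{y}_k$. We know the $(p + m)$-dimensional vectors

$\begin{pmatrix} \mathbf{x}_k \\ \mathbf{y}_k \end{pmatrix}$, $k \in s$, $\quad \begin{pmatrix} \mathbf{X} \\ \mathbf{Y}_0 \end{pmatrix}$, $\quad \begin{pmatrix} \hat{\mathbf{X}} \\ \hat{\mathbf{Y}} \end{pmatrix}$,

where $\mathbf{X} = \sum_U \mathbf{x}_k$ and $\mathbf{Y}_0$ are known, and $\hat{\mathbf{X}} = \sum_s a_k \mathbf{x}_k$ and $\hat{\mathbf{Y}} = \sum_s a_k \mathbf{y}_k$ are HT estimators in the PRS.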

AC calibration

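Let $\mathbf{z}_k^*$ be an instrument vector with matching dimension. Then the AC calibrated weights are given by

$w_k^* = a_k (1 + \boldsymbol{\lambda}^{*\prime} \mathbf{z}_k^*)$, where

$\boldsymbol{\lambda}^{*\prime} = \begin{pmatrix} \mathbf{X} - \hat{\mathbf{X}} \\ \mathbf{Y}_0 - \hat{\mathbf{Y}} \end{pmatrix}' M^{-1}$, $\quad M = \sum_s a_k \mathbf{z}_k^* \begin{pmatrix} \mathbf{x}_k \\ \mathbf{y}_k \end{pmatrix}'$.

The $w_k^*$ satisfy the A constraint $\sum_s w_k^* \mathbf{x}_k = \mathbf{X}$ and the C constraint,

$\sum_s w_k^* \mathbf{y}_k = \mathbf{Y}_0$. (4)

The $w_k^*$ can be used for estimating all totals of interest in the PRS.

In particular, the vector of common variable domain totals, $\mathbf{Y}_d$, is estimated as

$\hat{\mathbf{Y}}_{dCAL}^* = \sum_{s_d} w_k^* \mathbf{y}_k$, $d \in D$.

We have additive consistency with the RFS estimator $\mathbf{Y}_0$, since (4) implies $\sum_{d \in D} \hat{\mathbf{Y}}_{dCAL}^* = \mathbf{Y}_0$.

The weights $w_k^*$ can be routinely computed, like the ordinary calibration weights.

However, we want to know how the A calibration and the C calibration are related to each other.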

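Derivation. Let $\mathbf{z}_k^{*\prime} = q_k (\mathbf{x}_k', \mathbf{y}_k')$ and $\mathbf{z}_k = q_k \mathbf{x}_k$. Then for the weights

$w_k^* = a_k (1 + \boldsymbol{\lambda}^{*\prime} \mathbf{z}_k^*)$, where

$\boldsymbol{\lambda}^{*\prime} = \begin{pmatrix} \mathbf{X} - \hat{\mathbf{X}} \\ \mathbf{Y}_0 - \hat{\mathbf{Y}} \end{pmatrix}' M^{-1}$, $\quad M = \sum_s a_k \mathbf{z}_k^* \begin{pmatrix} \mathbf{x}_k \\ \mathbf{y}_k \end{pmatrix}'$,

we need to invert the matrix

$M = \begin{pmatrix} T_{xx} & T_{xy} \\ T_{xy}' & T_{yy} \end{pmatrix}$,

$T_{xx} = \sum_s a_k q_k \mathbf{x}_k \mathbf{x}_k'$ : $p \times p$,

$T_{xy} = \sum_s a_k q_k \mathbf{x}_k \mathbf{y}_k'$ : $p \times m$,

$T_{yy} = \sum_s a_k q_k \mathbf{y}_k \mathbf{y}_k'$ : $m \times m$.

This is a block matrix, for which an inversion formula is readily available.

Simplification to a meaningful form is the problem.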

Success comes from using the residuals from regressing $\mathbf{y}_k$ on $\mathbf{x}_k$:

$\mathbf{e}_k = \mathbf{y}_k - \mathbf{B}' \mathbf{x}_k$, where $\mathbf{B} = T_{xx}^{-1} T_{xy}$.

Resulting AC calibrated weights,

$w_k^* = w_k + a_k q_k \mathbf{e}_k' Q^{-1} (\mathbf{Y}_0 - \hat{\mathbf{Y}}_{CAL})$,

where $w_k$ are the A calibrated weights, and

$Q = \sum_s a_k q_k \mathbf{e}_k \mathbf{e}_k'$ : $m \times m$.

The dimensionality of the matrix inversion is reduced.

$Q$ is positive definite and therefore invertible.

The weights $w_k^*$ can be applied to any study variable in any domain. If applied to the common variables, consistency with $\mathbf{Y}_0$ is achieved.
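The residual form of the AC calibrated weights can be checked numerically in the simplest case $p = m = 1$, where every matrix is a scalar. All data below (design weights, $q_k$, $x_k$, $y_k$, the totals $X$ and $Y_0$) are invented, and `ac_calibrate` is a hypothetical helper name.

```python
def ac_calibrate(a, q, x, y, X, Y0):
    """Scalar (p = m = 1) AC calibration via residuals.
    Step A: w_k = a_k * (1 + lam * q_k * x_k) calibrated on X.
    Step C: w*_k = w_k + a_k * q_k * e_k * (Y0 - Y_cal) / Q."""
    T_xx = sum(ak * qk * xk * xk for ak, qk, xk in zip(a, q, x))
    T_xy = sum(ak * qk * xk * yk for ak, qk, xk, yk in zip(a, q, x, y))
    X_hat = sum(ak * xk for ak, xk in zip(a, x))
    lam = (X - X_hat) / T_xx
    w = [ak * (1 + lam * qk * xk) for ak, qk, xk in zip(a, q, x)]   # A weights
    B = T_xy / T_xx
    e = [yk - B * xk for xk, yk in zip(x, y)]                       # residuals
    Q = sum(ak * qk * ek * ek for ak, qk, ek in zip(a, q, e))
    Y_cal = sum(wk * yk for wk, yk in zip(w, y))
    return [wk + ak * qk * ek * (Y0 - Y_cal) / Q
            for wk, ak, qk, ek in zip(w, a, q, e)]

# Hypothetical sample and known totals.
a = [5.0, 5.0, 5.0, 5.0]
q = [1.0, 1.0, 1.0, 1.0]
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]
w_star = ac_calibrate(a, q, x, y, X=60.0, Y0=55.0)

print(sum(wk * xk for wk, xk in zip(w_star, x)))  # A constraint: X
print(sum(wk * yk for wk, yk in zip(w_star, y)))  # C constraint: Y0
```

The correction term leaves the A constraint untouched because the residuals satisfy $\sum_s a_k q_k e_k x_k = 0$, so only the total of the common variable is shifted, from $\hat{Y}_{CAL}$ to $Y_0$.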

Thank you!
