1
Ratio estimation with stratified samples
• Consider the agriculture stratified sample. In addition to the data of 1992, we also have data of 1987. Suppose that the population data of 1987 are available. How can we combine the two techniques?
2
Method 1: combined ratio estimator
• Step 1: combine strata to estimate tx and ty
• Step 2: use ratio estimation
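$$\hat t_{x,str} = \sum_{h=1}^{H} \hat t_{xh}, \qquad \hat t_{y,str} = \sum_{h=1}^{H} \hat t_{yh}$$
$$\hat t_{yrc} = \hat B\, t_x, \quad \text{where } \hat B = \frac{\hat t_{y,str}}{\hat t_{x,str}}$$
$$\hat V(\hat t_{yrc}) = \left(\frac{t_x}{\hat t_{x,str}}\right)^2 \left[\hat V(\hat t_{y,str}) + \hat B^2\, \hat V(\hat t_{x,str}) - 2\hat B\, \widehat{Cov}(\hat t_{x,str}, \hat t_{y,str})\right]$$
$$\text{where } \widehat{Cov}(\hat t_{x,str}, \hat t_{y,str}) = \sum_{h=1}^{H} \widehat{Cov}(\hat t_{xh}, \hat t_{yh})$$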
3
Method 2: separate ratio estimators
• Step 1: use ratio estimation in each stratum
• Step 2: combine strata
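$$\hat t_{yrs} = \sum_{h=1}^{H} \hat t_{yhr} = \sum_{h=1}^{H} t_{xh}\, \frac{\hat t_{yh}}{\hat t_{xh}}$$
$$\hat V(\hat t_{yrs}) = \sum_{h=1}^{H} \hat V(\hat t_{yhr})$$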
4
Method 1 vs Method 2
• If the ratios vary from stratum to stratum, use method 2
• If sample sizes are small, use method 1
• Poststratification is a special case of method 2
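The two methods can be sketched side by side. This is a minimal illustration, not the agriculture data: the two strata, their sizes, and the known x totals below are hypothetical, and each stratum total is estimated by the usual SRS expansion estimator (stratum size times sample mean).

```python
def srs_total(N_h, values):
    """Expansion estimate of a stratum total from an SRS: N_h * sample mean."""
    return N_h * sum(values) / len(values)

# Hypothetical strata: N = stratum size, t_x = known population total of x,
# x and y = sampled values of the auxiliary and study variables.
strata = [
    {"N": 100, "t_x": 1100.0, "x": [8, 10, 12], "y": [16, 20, 24]},
    {"N": 50, "t_x": 1900.0, "x": [30, 40, 50], "y": [30, 40, 50]},
]

def combined_ratio_total(strata):
    """Method 1: combine strata first, then apply one ratio."""
    t_x = sum(s["t_x"] for s in strata)
    t_x_str = sum(srs_total(s["N"], s["x"]) for s in strata)
    t_y_str = sum(srs_total(s["N"], s["y"]) for s in strata)
    B_hat = t_y_str / t_x_str
    return B_hat * t_x

def separate_ratio_total(strata):
    """Method 2: apply a ratio within each stratum, then add up."""
    return sum(
        s["t_x"] * srs_total(s["N"], s["y"]) / srs_total(s["N"], s["x"])
        for s in strata
    )

t_yrc = combined_ratio_total(strata)
t_yrs = separate_ratio_total(strata)
print(f"combined: {t_yrc:.1f}, separate: {t_yrs:.1f}")
```

In this made-up data the stratum ratios differ (2 versus 1), so the two estimates disagree; that is the situation in which the slide recommends method 2.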
5
Cluster Sampling
6
A new sampling method
• Motivating example
• Want to study the average amount of water used per person
• How would you design a survey?
7
A new sampling method
• Consider the two strategies
– Sample person by person
– Sample household by household
• Which one do you prefer and why?
8
A new sampling method
• In the water usage example, I would sample households; in other words, I would use the household as the sampling unit.
• I do this for convenience. I am interested in average monthly usage per person, but I sample households.
9
A new sampling method
• The example of water usage is an example of cluster sampling
– Households are the primary sampling units (PSUs) or clusters
– Persons are the secondary sampling units (SSUs). They are the elements in the population
10
Definition of Cluster Sampling
• Take an SRS of clusters
• Individual elements of the population are allowed in the sample only if they belong to a cluster (primary sampling unit) that is included in the sample
• The sampling unit (psu) is not the same as the observation unit (ssu), and the two sizes of experimental units must be considered when calculating standard errors from cluster samples
11
Stratified sampling vs Cluster sampling
• The two sampling methods look similar
– A cluster is also a grouping of elements of the population
• But the sampling schemes are different
– Stratified: SRS from each stratum
– Cluster: SRS of the clusters. For each selected cluster, we select all its elements
– See the following two slides
12
Stratified sampling
13
Cluster sampling
14
Stratified sampling vs Cluster sampling
• Stratified sampling
– Variance of the estimate of $\bar y_U$ depends on the variability of values within strata
– For greater precision, individual elements within each stratum should have similar values, but stratum means should differ from each other as much as possible
– Stratified sampling usually improves precision relative to SRS
15
Stratified sampling vs Cluster sampling
• Cluster sampling
– The cluster is the sampling unit
– The more clusters we sample, the smaller the variance
– The variance of the estimate of $\bar y_U$ depends primarily on the variability between cluster means
– For greater precision, individual elements within each cluster should be heterogeneous and cluster means should be similar to one another
– Cluster sampling usually ??? precision relative to SRS
16
Why does cluster sampling tend to reduce precision?
• Elements of the same cluster tend to be more similar than elements selected at random from the whole population. E.g.,
– Elements of the same household tend to have similar political views
– Fish in the same lake tend to have similar concentrations of mercury
– Residents of the same nursing home tend to have similar opinions of the quality of care
• The similarities arise because of some underlying factors that may or may not be measurable
– Residents of the same nursing home may have similar opinions because the care is poor
– The concentration of mercury in the fish will reflect the concentration of mercury in the lake
17
Why does cluster sampling tend to reduce precision?
• Because of the similarities of elements within clusters, we do not obtain as much information
• By sampling everyone in the cluster, we partially repeat the same information instead of obtaining new information
• As a result, cluster sampling leads to less precision for estimates of population quantities
18
Motivation of using cluster sampling
• A sampling frame list of observation units may be difficult or expensive to construct, or unavailable
– Cannot list all honeybees in a region
• The population may be widely distributed geographically or may occur in natural clusters
– Nursing home residents cluster in nursing homes
• Cluster sampling leads to convenience and reduced cost
• Cluster sampling may result in more information per dollar spent
19
Versions of cluster sampling: one-stage vs two-stage cluster sampling
• We will consider one-stage and two-stage sampling
– One-stage sampling: every element within a sampled cluster is included in the sample
– Two-stage sampling: we subsample only some of the elements of selected clusters
20
One-stage cluster sampling
[Figure: panels (1), (2), (3)]
21
Two-stage cluster sampling
[Figure: panels (1), (2), (3)]
22
Notation for cluster sampling
23
Notation for cluster sampling
24
Notation for cluster sampling
25
Notation for cluster sampling
26
One-stage cluster sampling
[Figure: panels (1), (2), (3)]
27
One-stage cluster sampling
• Every element within a cluster (PSU) is included in the sample
• Either “all” or “none” of the elements that compose a cluster (PSU) are in the sample
• $M_i = m_i$, $\bar y_{iU} = \bar y_i$, $S_i^2 = s_i^2$, $t_i = \hat t_i$
28
Clusters of equal sizes
• Every PSU has the same number of elements: $M_i = m_i = M$
– Most naturally occurring clusters do not fit into this framework
– Can occur in agricultural and industrial sampling
– Estimating population means or totals is simple
• We treat the cluster means or totals as the observations and simply ignore the individual elements
• We have an SRS of n observations $\{t_i, i \in S\}$, where $t_i$ is the total for all the elements in PSU i.
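Since the cluster totals form an SRS of n observations, the usual SRS formulas apply to them directly. The sketch below assumes those standard formulas (estimated total N times the mean cluster total, standard error with the finite population correction 1 - n/N); the cluster totals, N, and M are hypothetical values chosen for illustration.

```python
import math

def one_stage_equal(N, M, cluster_totals):
    """One-stage cluster sample with equal cluster sizes.

    N: number of PSUs in the population
    M: number of SSUs in every PSU
    cluster_totals: observed totals t_i for the n sampled PSUs
    """
    n = len(cluster_totals)
    t_bar = sum(cluster_totals) / n
    # Sample variance of the cluster totals (divisor n - 1).
    s_t2 = sum((t - t_bar) ** 2 for t in cluster_totals) / (n - 1)
    t_hat = N * t_bar
    se_t = N * math.sqrt((1 - n / N) * s_t2 / n)
    # Mean per element: divide the total and its SE by the N*M elements.
    return t_hat, se_t, t_hat / (N * M), se_t / (N * M)

# Hypothetical data: 5 clusters sampled out of 100, 4 elements per cluster.
t_hat, se_t, y_bar, se_y = one_stage_equal(N=100, M=4, cluster_totals=[10, 12, 14, 12, 12])
print(f"t_hat={t_hat:.1f} (SE {se_t:.2f}), y_bar={y_bar:.3f} (SE {se_y:.4f})")
```

Note that the individual elements never enter the calculation; only the five cluster totals do.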
29
Clusters of equal sizes
30
Clusters of equal sizes
Nothing is new here
31
Clusters of equal sizes: an example
32
Clusters of equal sizes: an example
33
Clusters of equal sizes: sampling weights
34
Theory of Cluster sampling with equal sizes
35
Theory of Cluster sampling with equal sizes
• In one-stage cluster sampling, the variability of the unbiased estimator of t depends entirely on the between-cluster part of the variability
• For cluster sampling
36
Theory of Cluster sampling with equal sizes
• When MSB/MSW is large
– MSB is relatively large: elements in different clusters vary more than elements in the same cluster
– Cluster sampling is less precise than SRS
• If $MSB > S^2$, cluster sampling is less precise
37
38
Measurements of correlation
• ICC (or ρ): Intraclass (or intracluster) Correlation Coefficient
– Describes how similar elements in the same cluster are
– Provides a measure of homogeneity within the clusters
• Definition: $ICC = 1 - \frac{M}{M-1}\,\frac{SSW}{SSTO}$
• It can be shown that
39
Measurements of correlation
If SSB=0, then
40
One-stage cluster sampling with equal sizes vs SRS
If N is large
1+(M-1)ICC SSU’s, taken in a one-stage cluster sample, give the same amount of information as one SSU from an SRS
E.g., ICC=1/2, M=5, then 1+(M-1)ICC=3 → 300 SSUs in the cluster sample = 100 SSUs in an SRS
• If ICC<0, cluster sampling is more efficient than SRS
• ICC is rarely negative in naturally occurring clusters
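The ICC and the factor 1+(M-1)ICC can be computed directly from a population of equal-sized clusters. This sketch assumes the definition ICC = 1 - [M/(M-1)] SSW/SSTO used later in these slides; the function names and the tiny populations are hypothetical.

```python
def icc(clusters):
    """Intraclass correlation for a population of equal-sized clusters.

    clusters: list of lists, each inner list holding the M values of one cluster.
    """
    M = len(clusters[0])
    values = [y for c in clusters for y in c]
    grand_mean = sum(values) / len(values)
    ssto = sum((y - grand_mean) ** 2 for y in values)
    # Within-cluster sum of squares: deviations from each cluster's own mean.
    ssw = sum((y - sum(c) / M) ** 2 for c in clusters for y in c)
    return 1 - M / (M - 1) * ssw / ssto

def design_effect(M, rho):
    """Approximate variance ratio of one-stage cluster sampling to SRS (large N)."""
    return 1 + (M - 1) * rho

# Perfectly homogeneous clusters: every element equals its cluster mean.
rho_homog = icc([[1, 1], [2, 2], [3, 3]])
# Identical cluster means (SSB = 0): the ICC reaches its lower bound -1/(M-1).
rho_min = icc([[1, 3], [3, 1], [1, 3]])
# The slide's example: ICC = 1/2, M = 5.
deff = design_effect(5, 0.5)
print(rho_homog, rho_min, deff)
```

With a factor of 3, a cluster sample of 300 SSUs carries roughly the information of an SRS of 100 SSUs, as stated on the slide.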
41
The GPA example
The population ANOVA table (estimated)
42
The GPA example
The population ANOVA table (estimated)
• The sample mean square total should not be used to estimate $S^2$ when n is small
• The data were collected as a cluster sample. They do not reflect enough of the cluster-to-cluster variability.
• Multiply the unbiased estimates of MSB and MSW by the df from the population ANOVA table to estimate the population sums of squares
43
The GPA example
The population ANOVA table (estimated)
44
The GPA example
45
Clusters of unequal sizes
• The adjusted $R^2$ measures the relative amount of variability in the population explained by the cluster means, adjusted for the number of degrees of freedom
• If the clusters are homogeneous, then the cluster means are highly variable relative to the variation within clusters, and $R^2$ will be high.
46
An example
47
An example
48
The GPA example
49
The GPA example
The population ANOVA table (estimated)
50
Clusters of unequal sizes
• In social surveys, clusters are usually of unequal sizes
• In a one-stage sample, we will introduce two methods to estimate the population total/mean
– Unbiased estimation
– Ratio estimation
51
Unbiased estimation for cluster sampling with unequal sizes
52
Unbiased estimation for cluster sampling with unequal sizes
• Nothing is different from cluster sampling with equal sizes
• The problem is that the between-cluster variance is large when the cluster sizes differ widely, since we expect large totals from large clusters
• Therefore, we consider another estimator
53
Ratio estimation for cluster sampling with unequal sizes
54
Ratio estimation for cluster sampling with unequal sizes
where
55
Ratio estimation for cluster sampling with unequal sizes
Note, it is not difficult to find that
• The variance of the ratio estimator depends on the variability of the means per element in the clusters
• It can be much smaller than that of the unbiased estimator
• The ratio estimator requires the total number of elements in the population, K.
• The unbiased estimator does not require K.
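The contrast between the two estimators can be illustrated numerically. The slide formulas did not survive extraction, so this sketch assumes the standard textbook forms: the unbiased estimator (N/n) times the sum of sampled cluster totals, and the ratio estimator K times (sum of totals / sum of sizes), with a linearization standard error based on the residuals t_i - ybar_r * M_i and the average cluster size K/N. All data values are hypothetical.

```python
import math

def one_stage_unequal(N, K, sizes, totals):
    """Unbiased and ratio estimates of the population total (one-stage, unequal sizes).

    N: number of PSUs in the population; K: number of elements in the population
    sizes: M_i for the sampled PSUs; totals: t_i for the sampled PSUs
    """
    n = len(totals)
    fpc = 1 - n / N

    # Unbiased estimator: treat the cluster totals as an SRS.
    t_bar = sum(totals) / n
    s_t2 = sum((t - t_bar) ** 2 for t in totals) / (n - 1)
    t_unb = N * t_bar
    se_unb = N * math.sqrt(fpc * s_t2 / n)

    # Ratio estimator: mean per element, scaled up by K.
    y_r = sum(totals) / sum(sizes)
    s_r2 = sum((t - y_r * M_i) ** 2 for t, M_i in zip(totals, sizes)) / (n - 1)
    M_bar = K / N  # average cluster size in the population
    se_y_r = math.sqrt(fpc * s_r2 / (n * M_bar ** 2))
    return {"t_unb": t_unb, "se_unb": se_unb, "t_r": K * y_r, "se_r": K * se_y_r}

# Hypothetical sample of 3 clusters out of 10; totals roughly proportional to size.
est = one_stage_unequal(N=10, K=100, sizes=[5, 10, 20], totals=[10, 22, 38])
print(est)
```

Because the totals here track the cluster sizes closely, the residuals are small and the ratio estimator's standard error comes out far below that of the unbiased estimator.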
56
Two-stage cluster sampling
• In one-stage cluster sampling, we
– Examine all the SSU’s within the selected PSU’s
– Obtain redundant information because SSU’s in a PSU tend to be similar
– May incur high costs
• An alternative: taking a subsample within each selected PSU (two-stage cluster sampling)
57
Two-stage cluster sampling with equal probability
58
Two-stage cluster sampling with equal probability
• Compared with the one-stage cluster sampling, the two-stage uses one extra stage.
• The extra stage complicates the notation and estimators, as one needs to consider variability arising from both stages of data collection
• The point estimates are similar to those in one-stage, but variances are much more complicated
59
Two-stage cluster sampling with equal probability: an unbiased estimator
• Since we do not observe every SSU in the sampled PSU’s, we need to estimate the totals for the sampled PSU’s
• An unbiased estimator of the population total is $\hat t_{unb} = \frac{N}{n} \sum_{i \in S} \hat t_i$
60
Two-stage cluster sampling with equal probability: an unbiased estimator
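• The estimator is unbiased (here $Z_i = 1$ if PSU i is in the sample and 0 otherwise)
$$E[\hat t_{unb}] = E\left[\frac{N}{n}\sum_{i \in S} \hat t_i\right] = E\left[\frac{N}{n}\sum_{i=1}^{N} Z_i \hat t_i\right]$$
$$= \frac{N}{n}\sum_{i=1}^{N} E[Z_i]\,E[\hat t_i] = \frac{N}{n}\sum_{i=1}^{N} \frac{n}{N}\, t_i = t$$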
61
Two-stage cluster sampling with equal probability: an unbiased estimator
• Because the $\hat t_i$'s are random variables, the variance of $\hat t_{unb}$ has two components
– The variability between PSU’s
– The variability within PSU’s
Recall that Var[Y] = Var[E[Y|X]] + E[Var[Y|X]]
Here $Y = \hat t_{unb}$, $X = (Z_1, \dots, Z_N)$
62
Two-stage cluster sampling with equal probability: an unbiased estimator
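$$Var[\hat t_{unb}] = Var\left[\frac{N}{n}\sum_{i \in S} \hat t_i\right] = \left(\frac{N}{n}\right)^2 Var\left[\sum_{i=1}^{N} Z_i \hat t_i\right]$$
$$= \left(\frac{N}{n}\right)^2 \left\{ Var\left[E\left[\sum_{i=1}^{N} Z_i \hat t_i \,\Big|\, Z\right]\right] + E\left[Var\left[\sum_{i=1}^{N} Z_i \hat t_i \,\Big|\, Z\right]\right] \right\}$$
$$= \left(\frac{N}{n}\right)^2 \left\{ Var\left[\sum_{i=1}^{N} Z_i t_i\right] + E\left[\sum_{i=1}^{N} Z_i\, Var[\hat t_i]\right] \right\}$$
$$= N^2\left(1 - \frac{n}{N}\right)\frac{S_t^2}{n} + \left(\frac{N}{n}\right)^2 \sum_{i=1}^{N} E[Z_i]\, Var[\hat t_i]$$
$$= N^2\left(1 - \frac{n}{N}\right)\frac{S_t^2}{n} + \frac{N}{n}\sum_{i=1}^{N}\left(1 - \frac{m_i}{M_i}\right) M_i^2\, \frac{S_i^2}{m_i}$$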
63
Two-stage cluster sampling with equal probability
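$$Var[\hat t_{unb}] = N^2\left(1 - \frac{n}{N}\right)\frac{S_t^2}{n} + \frac{N}{n}\sum_{i=1}^{N}\left(1 - \frac{m_i}{M_i}\right) M_i^2\, \frac{S_i^2}{m_i}$$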
64
Two-stage cluster sampling with equal probability: an unbiased estimator
It can be shown that an unbiased estimator of the variance is
For the population mean
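The estimator and a variance estimate can be sketched as follows. The slide's formula for the variance estimator was lost in extraction, so this assumes the standard form obtained by plugging sample quantities into the variance above: the sample variance of the estimated PSU totals replaces $S_t^2$, the within-PSU sample variances replace $S_i^2$, and the second sum runs over sampled PSUs only. SRS without replacement at both stages is assumed, and the data are hypothetical.

```python
from statistics import mean, variance

def two_stage_unbiased(N, psus):
    """Unbiased estimate of the population total and its estimated variance.

    N: number of PSUs in the population
    psus: list of (M_i, sampled_values) pairs, one per sampled PSU
    """
    n = len(psus)
    # Estimated PSU totals: t_hat_i = M_i * (subsample mean).
    t_hats = [M_i * mean(ys) for M_i, ys in psus]
    t_unb = N / n * sum(t_hats)
    # Between-PSU component, using the sample variance of the t_hat_i.
    between = N ** 2 * (1 - n / N) * variance(t_hats) / n
    # Within-PSU component, using each subsample's variance s_i^2.
    within = N / n * sum(
        (1 - len(ys) / M_i) * M_i ** 2 * variance(ys) / len(ys) for M_i, ys in psus
    )
    return t_unb, between + within

# Hypothetical data: 3 PSUs sampled out of 20, two SSUs measured in each.
t_unb, v_hat = two_stage_unbiased(N=20, psus=[(10, [2, 4]), (20, [3, 5]), (30, [1, 3])])
print(f"t_unb = {t_unb:.2f}, estimated variance = {v_hat:.2f}")
```

For the population mean, one would divide the estimated total by the number of elements in the population (and the variance by its square) when that number is known.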
65
Two-stage cluster sampling with equal probability: a ratio estimator
As in one-stage cluster sampling with unequal sizes, the between-PSU variance can be very large since it is affected both by variations in the cluster sizes and by variation in y.
66
where
67
The egg volume example
• A study (Arnold 1991) on egg volume of American coot eggs in Minnesota. We looked at volumes of a subsample of eggs in clutches (nests of eggs) with at least two eggs.
• For each sampled clutch, two eggs were measured
68
The egg volume example
69
The egg volume example
70
The egg volume example
71
The egg volume example
N is unknown but presumably large.
72
Using weights in cluster samples
• For estimating overall means and totals in cluster samples, most survey statisticians use sampling weights.
• Weights can be used to find a point estimate of almost any quantity of interest
• For cluster sampling:
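A weight-based calculation can be sketched as follows. It assumes SRS at both stages, so that each sampled SSU in PSU i gets weight (N/n)(M_i/m_i), the reciprocal of its inclusion probability; the total is then the weighted sum of the observations and the mean is that sum divided by the sum of the weights. The variable names and data are hypothetical.

```python
# Hypothetical two-stage sample: N PSUs in the population, each sampled PSU
# given as (M_i, list of observed SSU values).
N = 20
sample = [(10, [2, 4]), (20, [3, 5]), (30, [1, 3])]
n = len(sample)

weights = []
values = []
for M_i, ys in sample:
    # Weight = 1 / P(SSU selected) = (N / n) * (M_i / m_i).
    w = (N / n) * (M_i / len(ys))
    for y in ys:
        weights.append(w)
        values.append(y)

t_hat = sum(w * y for w, y in zip(weights, values))  # estimated total
y_bar_hat = t_hat / sum(weights)                      # estimated mean per element
print(f"t_hat = {t_hat:.2f}, sum of weights = {sum(weights):.1f}, mean = {y_bar_hat:.4f}")
```

The weighted total agrees with (N/n) times the sum of the estimated PSU totals M_i * ybar_i, and the sum of the weights estimates the number of elements in the population.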
73
Using weights in cluster samples
74
SRS : one-stage cluster: two-stage cluster
• For simplicity, we only consider $M_1 = \dots = M_N = M$ and $m_1 = \dots = m_N = m$
• One estimator from each of the three sampling methods
– $\hat t_{SRS}$: the estimator from SRS
– $\hat t_{unb}$: the unbiased estimator from two-stage cluster
– $\hat t_1$: the estimator from one-stage cluster
75
SRS : one-stage cluster: two-stage cluster
Assume (nm) SSUs are sampled
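$$\hat t_{unb} = \frac{N}{n}\sum_{i \in S} \hat t_i$$
$$Var[\hat t_{unb}] = N^2\left(1 - \frac{n}{N}\right)\frac{S_t^2}{n} + \frac{N}{n}\sum_{i=1}^{N}\left(1 - \frac{m_i}{M_i}\right) M_i^2\, \frac{S_i^2}{m_i}$$
$$= N^2\left(1 - \frac{n}{N}\right)\frac{S_t^2}{n} + \frac{N}{n}\left(1 - \frac{m}{M}\right)\frac{M^2}{m}\sum_{i=1}^{N} S_i^2$$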
76
SRS : one-stage cluster: two-stage cluster
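• Recall that
$$(M-1)\sum_{i=1}^{N} S_i^2 = SSW = N(M-1)\,MSW$$
• Therefore, using $S_t^2 = M \cdot MSB$,
$$Var[\hat t_{unb}] = N^2\left(1 - \frac{n}{N}\right)\frac{S_t^2}{n} + \frac{N}{n}\left(1 - \frac{m}{M}\right)\frac{M^2}{m}\sum_{i=1}^{N} S_i^2$$
$$= N^2\left(1 - \frac{n}{N}\right)\frac{M}{n}\,MSB + \frac{N^2}{n}\left(1 - \frac{m}{M}\right)\frac{M^2}{m}\,MSW$$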
77
SRS : one-stage cluster: two-stage cluster
• We have defined ICC (ρ)
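$$\rho = 1 - \frac{M}{M-1}\,\frac{SSW}{SSTO} = 1 - \frac{M}{M-1}\cdot\frac{N(M-1)\,MSW}{(NM-1)\,S^2} = 1 - \frac{NM}{NM-1}\,\frac{MSW}{S^2}$$
$$\Rightarrow\; MSW = \frac{NM-1}{NM}\,S^2\,(1-\rho)$$
$$MSB = \frac{SSTO - SSW}{N-1} = \frac{1}{N-1}\left[(NM-1)S^2 - N(M-1)\,MSW\right]$$
$$= \frac{1}{N-1}\left[(NM-1)S^2 - \frac{(M-1)(NM-1)}{M}\,S^2(1-\rho)\right]$$
$$= \frac{NM-1}{M(N-1)}\,S^2\left[M - (M-1)(1-\rho)\right] = \frac{NM-1}{M(N-1)}\,S^2\left[1 + (M-1)\rho\right]$$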
SRS : one-stage cluster: two-stage cluster
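$$Var[\hat t_{unb}] = N^2\left(1 - \frac{n}{N}\right)\frac{M}{n}\,MSB + \frac{N^2}{n}\left(1 - \frac{m}{M}\right)\frac{M^2}{m}\,MSW$$
$$= N^2\,\frac{N-n}{N}\,\frac{M}{n}\cdot\frac{NM-1}{M(N-1)}\,S^2\left[1 + (M-1)\rho\right] + \frac{N^2}{n}\,\frac{M-m}{M}\,\frac{M^2}{m}\cdot\frac{NM-1}{NM}\,S^2(1-\rho)$$
$$= \frac{NM(NM-1)}{nm}\,S^2\left\{\frac{m}{M}\,\frac{N-n}{N-1}\left[1 + (M-1)\rho\right] + \frac{M-m}{M}\,(1-\rho)\right\}$$
$$\approx \frac{NM(NM-1)}{nm}\,S^2\left\{\frac{m}{M}\left[1 + (M-1)\rho\right] + \frac{M-m}{M}\,(1-\rho)\right\} \quad \text{(taking } \tfrac{N-n}{N-1} \approx 1\text{)}$$
$$= \frac{NM(NM-1)}{nm}\,S^2\left[1 + (m-1)\rho\right] \approx \frac{N^2M^2}{nm}\,S^2\left[1 + (m-1)\rho\right]$$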
79
SRS : one-stage cluster: two-stage cluster
• If we use nm SSU’s in a one-stage cluster sampling, #PSU’s=n’=nm/M
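$$Var[\hat t_1] = N^2\left(1 - \frac{nm/M}{N}\right)\frac{S_t^2}{nm/M} = N^2\,\frac{NM - nm}{NM}\,\frac{M\,S_t^2}{nm}$$
$$= N^2\,\frac{NM - nm}{NM}\,\frac{M^2}{nm}\,MSB \quad \text{(using } S_t^2 = M \cdot MSB\text{)}$$
$$= N^2\,\frac{NM - nm}{NM}\,\frac{M^2}{nm}\cdot\frac{NM-1}{M(N-1)}\,S^2\left[1 + (M-1)\rho\right]$$
$$\approx \frac{N^2M^2}{nm}\,S^2\left[1 + (M-1)\rho\right]$$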
80
SRS : one-stage cluster: two-stage cluster
• If we use nm SSU’s in an SRS
$$Var[\hat t_{SRS}] = (NM)^2\,\frac{NM - nm}{NM}\,\frac{S^2}{nm} \approx \frac{N^2M^2}{nm}\,S^2$$
81
SRS : one-stage cluster: two-stage cluster
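$$Var[\hat t_{SRS}] = (NM)^2\,\frac{NM - nm}{NM}\,\frac{S^2}{nm} \approx \frac{N^2M^2}{nm}\,S^2$$
$$Var[\hat t_{unb}] \approx \frac{N^2M^2}{nm}\,S^2\left[1 + (m-1)\rho\right]$$
$$Var[\hat t_1] \approx \frac{N^2M^2}{nm}\,S^2\left[1 + (M-1)\rho\right]$$
When $\rho > 0$: $Var[\hat t_{SRS}] \le Var[\hat t_{unb}] \le Var[\hat t_1]$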
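The ordering of the three approximate variances can be checked numerically. This sketch uses the large-N approximations (common factor N²M²S²/(nm), multiplied by 1, 1+(m-1)ρ, or 1+(M-1)ρ); the function names and parameter values are hypothetical.

```python
def var_srs(N, M, n, m, S2):
    """Approximate variance of the estimated total from an SRS of n*m SSUs."""
    return (N * M) ** 2 * S2 / (n * m)

def var_two_stage(N, M, n, m, S2, rho):
    """Approximate variance for two-stage sampling: n PSUs, m SSUs from each."""
    return var_srs(N, M, n, m, S2) * (1 + (m - 1) * rho)

def var_one_stage(N, M, n, m, S2, rho):
    """Approximate variance for one-stage sampling of n*m/M whole PSUs."""
    return var_srs(N, M, n, m, S2) * (1 + (M - 1) * rho)

# Hypothetical design: 1000 PSUs of size 5, budget of n*m = 60*2 = 120 SSUs.
params = dict(N=1000, M=5, n=60, m=2, S2=1.0)
v_srs = var_srs(**params)
v_unb = var_two_stage(rho=0.5, **params)
v_one = var_one_stage(rho=0.5, **params)
print(v_srs, v_unb, v_one)
```

With ρ = 0.5 the variances stand in the ratio 1 : 1.5 : 3, so for the same number of SSUs the two-stage design sits between SRS and one-stage cluster sampling.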
82
Design a cluster survey
• For an expensive, large-scale survey, it is worth spending a great deal of effort on the design
• It can take several years to design and pre-test
• For designing a cluster sample
– What overall precision is needed?
– What size should the PSU’s be?
– How many SSU’s should be sampled in each sampled PSU?
– How many PSU’s should be sampled?
83
Choosing the PSU size
• In many situations, the PSU size exists naturally. E.g., a clutch of eggs, a household
• In some situations, one needs to choose PSU sizes. E.g., area of a region, 1 km², 2 km², …
• Many ways to “try out” different PSU sizes
• Pilot study, perform an experiment
• The goal is to get the most information for the least cost and inconvenience
84
Two-stage cluster design with equal cluster size and equal variance
85
Two-stage cluster design with equal cluster size and equal variance
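$$Var[\hat{\bar y}_{unb}] = \frac{1}{(NM)^2}\,Var[\hat t_{unb}] = \left(1 - \frac{n}{N}\right)\frac{MSB}{nM} + \left(1 - \frac{m}{M}\right)\frac{MSW}{nm}$$
$$= \frac{MSB}{nM} - \frac{MSB}{NM} + \frac{MSW}{nm} - \frac{MSW}{nM} = \frac{1}{n}\left[\frac{MSB - MSW}{M} + \frac{MSW}{m}\right] - \frac{MSB}{NM}$$
With total cost $C = c_1 n + c_2 nm$, so that $n = C/(c_1 + c_2 m)$:
$$= \frac{c_1 + c_2 m}{C}\left[\frac{MSB - MSW}{M} + \frac{MSW}{m}\right] - \frac{MSB}{NM}$$
$$= \frac{1}{C}\left[c_2 m\,\frac{MSB - MSW}{M} + c_1\,\frac{MSW}{m}\right] + \text{constant}$$
The minimum is reached when
$$m = \sqrt{\frac{c_1 M\,MSW}{c_2\,(MSB - MSW)}} = \sqrt{\frac{c_1 M (N-1)(1 - R_a^2)}{c_2 (NM-1)\,R_a^2}}$$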
Two-stage cluster design with equal cluster size and equal variance
• Graphing the variance for varying m and n gives more information
• It is useful to examine
– What if the costs or the cost function are slightly different?
– What if $R_a^2$ changes slightly?
$$m = \sqrt{\frac{c_1 M (N-1)(1 - R_a^2)}{c_2 (NM-1)\,R_a^2}}$$
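The optimal subsample size can be computed and checked against nearby designs. This sketch assumes the linear cost function C = c1*n + c2*n*m and the relation $R_a^2 = 1 - MSW/S^2$; the costs, population sizes, and $R_a^2$ value are hypothetical, and the formula only applies when MSB > MSW.

```python
import math

def optimal_m(c1, c2, M, msb, msw):
    """Subsample size minimizing the variance for a fixed budget (needs msb > msw)."""
    return math.sqrt(c1 * M * msw / (c2 * (msb - msw)))

def optimal_m_from_r2(c1, c2, N, M, r2a):
    """Same optimum expressed through the adjusted R^2."""
    return math.sqrt(c1 * M * (N - 1) * (1 - r2a) / (c2 * (N * M - 1) * r2a))

def var_mean(N, M, n, m, msb, msw):
    """Variance of the estimated mean in an equal-size two-stage design."""
    return (1 - n / N) * msb / (n * M) + (1 - m / M) * msw / (n * m)

def variance_for_budget(C, c1, c2, N, M, m, msb, msw):
    """Variance when the whole budget C is spent with m SSUs per PSU (n may be fractional)."""
    n = C / (c1 + c2 * m)
    return var_mean(N, M, n, m, msb, msw)

# Hypothetical inputs.
N, M, S2, r2a = 100, 4, 1.0, 0.5
c1, c2, C = 40.0, 10.0, 1000.0
msw = S2 * (1 - r2a)
msb = msw + (N * M - 1) * S2 * r2a / (N - 1)

m_opt = optimal_m(c1, c2, M, msb, msw)
m_opt_r2 = optimal_m_from_r2(c1, c2, N, M, r2a)
v_opt = variance_for_budget(C, c1, c2, N, M, m_opt, msb, msw)
print(f"m_opt = {m_opt:.3f}, variance at optimum = {v_opt:.5f}")
```

In practice m must be an integer between 1 and M, so one would evaluate the variance at the neighbouring integers; repeating the calculation with perturbed costs or $R_a^2$ shows how sensitive the design is.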
87
The GPA example
88
The GPA example
89
Summary of two-stage cluster
• Cluster sampling is widely used in large surveys
• Variances from cluster samples are usually greater than those from SRSs with the same number of SSUs
• Cluster sampling is less expensive: the information per dollar from cluster sampling might be greater than that of SRS
90
Summary of two-stage cluster