

Nonparametric, Model-Assisted Estimation for a Two-Stage Sampling Design

Mark Delorey, F. Jay Breidt, Colorado State University


Abstract

In studies of aquatic resources, a two-stage sampling design can make the best use of time and financial resources, which are often limited. Even when such resources can be focused, sample sizes are often not large enough to support model-free inference. The presence of auxiliary information for the regions of interest suggests employing a model in our inferences. Breidt, Claeskens, and Opsomer (2003) propose incorporating this auxiliary information through a class of model-assisted estimators based on penalized spline regression in single-stage sampling. Zheng and Little (2003) also use penalized spline regression, in a model-based approach to finite population estimation from a two-stage sample. In a survey context, weights computed from a set of auxiliary information are often applied to many study variables. In that setting, model-assisted estimators should fare better than model-based estimators. We compare the two through a series of simulations.

This research is funded by U.S. EPA – Science To Achieve Results (STAR) Program Cooperative Agreements # CR-829095 and # CR-829096

Funding/Disclaimer

The work reported here was developed under the STAR Research Assistance Agreement CR-829095 and CR-829096 awarded by the U.S. Environmental Protection Agency (EPA) to Colorado State University. This poster has not been formally reviewed by EPA.  The views expressed here are solely those of the presenter and the STARMAP, the Program he represents. EPA does not endorse any products or commercial services mentioned in this poster.

Case A: Cluster Level Auxiliaries (Our focus)

• The auxiliary information is available for all clusters in the population

• Leads to regression modeling of quantities associated with the clusters, such as cluster totals

• Cluster quantities can be computed for all clusters

• Population quantities can be computed from cluster estimates

• Example: Lake represents a cluster; auxiliary information is elevation

Case B: Complete Element Level Auxiliaries

• The auxiliary information is available for all elements in the population

• Leads to regression modeling of quantities associated with the elements

• Cluster and population quantities can then be computed from element estimates and observations

• Example: EMAP hexagon is cluster; lake is element; auxiliary information is elevation

Case C: Limited Element Level Auxiliaries

• The auxiliary information is available for all elements in selected clusters only

• Leads to regression modeling of quantities associated with the elements

• Regression estimators can be used for cluster-level quantities only for the clusters selected in the first-stage sample

• Example: Aerial photography of selected sites (clusters); for each point (element) in site, we have percent forested, urban, industrial

Case D: Limited Cluster Level Auxiliaries

• The auxiliary information is available for all clusters in the first-stage sample

• Not a very interesting case

• Design-based estimator can be used for population quantities

• In some cases, good estimators for population quantities are not available

• Example: Cluster is lake; auxiliary information is measure of size which is not available until site is visited

Generating Responses

• 500 PSUs; the number of SSUs per cluster ~ Uniform(50, 400)

• $\mu_{\mathrm{PSU}} = m(\pi_I) + \varepsilon$, where $m(\cdot)$ is one of the eight functions below and $\varepsilon \sim N(0, \sigma^2 I)$

– We use first-order inclusion probabilities proportional to size (pps)

– Auxiliary data are often proportional to the size of the cluster

• Response of interest: $y_{ij} = \mu_i + \varepsilon_{ij}$, where $y_{ij}$ is the $j$th element in the $i$th cluster and $\varepsilon_{ij} \overset{\text{iid}}{\sim} N(0, \sigma^2)$
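A minimal sketch of how a population of this kind could be generated, in Python with NumPy. It assumes the auxiliary variable is the pps first-order inclusion probability of each cluster and that cluster sizes are drawn from a discrete uniform distribution. The first-stage sample size, the two standard deviations, the example mean function `linear`, and all names are hypothetical choices for illustration; the poster does not state them.

```python
import numpy as np

def generate_population(m, n_clusters=500, n_first_stage=50,
                        sigma_psu=0.5, sigma_ssu=1.0, seed=0):
    """Simulate a clustered population (illustrative parameter values)."""
    rng = np.random.default_rng(seed)
    # Number of SSUs per cluster: discrete Uniform(50, 400), endpoints included.
    sizes = rng.integers(50, 401, size=n_clusters)
    # Assumed auxiliary: first-order inclusion probabilities proportional to size.
    pi_I = n_first_stage * sizes / sizes.sum()
    # PSU-level means: mean function of the auxiliary plus normal noise.
    mu = m(pi_I) + rng.normal(0.0, sigma_psu, size=n_clusters)
    # Element responses: cluster mean plus iid normal noise.
    y = [mu_i + rng.normal(0.0, sigma_ssu, size=n_i)
         for mu_i, n_i in zip(mu, sizes)]
    return sizes, pi_I, mu, y

def linear(pi):
    """Hypothetical stand-in for one of the eight mean functions."""
    return 1.0 + 20.0 * pi

sizes, pi_I, mu, y = generate_population(linear)
true_total = sum(y_i.sum() for y_i in y)  # target of the estimators
```

Because the whole population is simulated, `true_total` is known exactly, which is what makes it possible to compute the MSE of each estimator over repeated samples.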

Two-Stage Sampling

• The population of elements $U = \{1, \ldots, k, \ldots, N\}$ is partitioned into clusters or primary sampling units (PSUs), $U_1, \ldots, U_i, \ldots, U_{N_I}$. So,

$$U = \bigcup_{i=1}^{N_I} U_i \quad \text{and} \quad N = \sum_{i=1}^{N_I} N_i,$$

where $N_i$ is the number of elements or secondary sampling units (SSUs) in $U_i$.

• First stage: A sample of clusters, $s_I$, is selected based on a design $p_I(\cdot)$ with inclusion probabilities $\pi_{Ii}$ and $\pi_{Iij}$.

– $\pi_{Ii}$ and $\pi_{Iij}$ are the first- and second-order inclusion probabilities, respectively

• Second stage: For every $i \in s_I$, a sample $s_i$ is drawn from $U_i$ based on the design $p_i(\cdot \mid s_I)$

• Typically require second stage design to be invariant and independent of the first stage
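The two stages can be sketched as follows. This is an illustration under stated assumptions, not the design used for the poster: the first stage uses systematic pps sampling (one common way to obtain first-order inclusion probabilities proportional to size), and the second stage uses simple random sampling without replacement with a fixed size in every selected cluster. The function names and the sample sizes are hypothetical.

```python
import numpy as np

def pps_systematic(pi, rng):
    """Systematic pps sampling: unit i is selected with probability pi[i].

    Assumes 0 < pi[i] < 1 and that sum(pi) is (numerically) an integer n.
    """
    cum = np.cumsum(pi)
    n = int(round(cum[-1]))
    # One random start, then equally spaced points u, u+1, ..., u+n-1.
    points = rng.uniform(0.0, 1.0) + np.arange(n)
    idx = np.searchsorted(cum, points, side="right")
    # Guard against floating-point overshoot at the upper end.
    return np.minimum(idx, len(pi) - 1)

def two_stage_sample(sizes, n_psu, n_ssu, seed=0):
    """Return first-stage inclusion probabilities and the sampled element indices."""
    rng = np.random.default_rng(seed)
    pi_I = n_psu * sizes / sizes.sum()          # first stage: pps
    s_I = pps_systematic(pi_I, rng)
    # Second stage: SRS without replacement inside each selected cluster.
    # The same rule is used whatever s_I is, so the second-stage design is
    # invariant and independent of the first stage.
    s = {int(i): rng.choice(int(sizes[i]), size=n_ssu, replace=False)
         for i in s_I}
    return pi_I, s

sizes = np.random.default_rng(1).integers(50, 401, size=500)
pi_I, s = two_stage_sample(sizes, n_psu=50, n_ssu=10, seed=2)
```

Under this second-stage design the conditional inclusion probability of every element in cluster $i$ is $n_i / N_i$. Systematic sampling fixes the first-order probabilities only; the second-order probabilities depend on the ordering of the clusters.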

Two-Stage Sampling with Aquatic Resources

• Time and expense constraints may make two-stage sampling more efficient

• Auxiliary information may be available on different scales

The Estimators (for population totals)

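• Horvitz-Thompson (HT)

$$\hat{t}_y = \sum_{i \in s_I} \frac{\hat{t}_{yi}}{\pi_{Ii}}, \quad \text{where} \quad \hat{t}_{yi} = \sum_{k \in s_i} \frac{y_k}{\pi_{k|i}}$$

• Model-assisted

$$\hat{t}_y = \sum_{i \in U_I} \hat{t}_{yip} + \sum_{i \in s_I} \frac{\hat{t}_{yi} - \hat{t}_{yip}}{\pi_{Ii}},$$

where $\hat{t}_{yip}$ is the PSU total predicted by the model

• Model-based

$$\hat{t}_y = \sum_{i \in U_I} N_i \hat{y}_i + \sum_{i \in s_I} n_i \left( \bar{y}_i - \hat{y}_i \right), \quad \text{where} \quad \bar{y}_i = \frac{1}{n_i} \sum_{j \in s_i} y_{ij}$$

and $\hat{y}_i$ is the $i$th cluster mean predicted by the model

[Figure: the eight mean functions $m(\cdot)$: linear, quadratic, bump, jump, exponential, growth, cycle 1, cycle 4]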
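The three estimators of the population total can be sketched in a few lines of Python. The sketch assumes simple random sampling within clusters, so that the estimated cluster total is $N_i \bar{y}_i$. The Horvitz-Thompson estimator weights each estimated cluster total by the inverse of its first-stage inclusion probability. The model-assisted estimator sums the model-predicted cluster totals over all clusters and adds the inverse-probability-weighted residuals from the sampled clusters. The model-based estimator is written here as a standard prediction estimator (observed values plus model predictions for everything unobserved), which is an assumption about its exact form. The model predictions `t_pred` and `mean_pred` are taken as given, and all function names are hypothetical.

```python
import numpy as np

def ht_total(sample, sizes, pi_I):
    """Horvitz-Thompson: sum over sampled clusters of t_hat_i / pi_Ii."""
    return sum(sizes[i] * np.mean(y_i) / pi_I[i] for i, y_i in sample.items())

def model_assisted_total(sample, sizes, pi_I, t_pred):
    """Predicted totals for all clusters plus design-weighted residuals."""
    resid = sum((sizes[i] * np.mean(y_i) - t_pred[i]) / pi_I[i]
                for i, y_i in sample.items())
    return np.sum(t_pred) + resid

def model_based_total(sample, sizes, mean_pred):
    """Prediction estimator: observed elements plus predicted unobserved ones."""
    total = np.sum(sizes * mean_pred)
    total += sum(len(y_i) * (np.mean(y_i) - mean_pred[i])
                 for i, y_i in sample.items())
    return total

# Toy example: two clusters, only cluster 0 is sampled.
sizes = np.array([4, 6])
pi_I = np.array([0.5, 0.5])
sample = {0: np.array([1.0, 3.0])}           # sampled y values in cluster 0
t_ht = ht_total(sample, sizes, pi_I)
t_ma = model_assisted_total(sample, sizes, pi_I, t_pred=np.array([8.0, 12.0]))
t_mb = model_based_total(sample, sizes, mean_pred=np.array([1.5, 2.0]))
```

Note that only the model-assisted and Horvitz-Thompson estimators use the inclusion probabilities; the model-based estimator relies entirely on the model for the unobserved part of the population.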

Comments on Simulation Results

• 500 samples from each of the populations were drawn

• Estimator labels:

– H-T = Horvitz-Thompson estimator

– M-A: lin = Model-assisted estimator using a linear model

– M-B: pmmra = Model-based estimator using a penalized spline and including a random effect for PSU

– M-A: pmm = Model-assisted estimator using a penalized spline with no random effect for PSU

• Each point represents the ratio MSE(estimator) : MSE(model-assisted estimator with random effect for PSU)

• Vertical black bars represent approximate 95% confidence intervals

• Model-assisted estimator with random effect for PSU is as efficient or more efficient than model-based estimator; we do not appear to lose efficiency (with respect to MSE) by using model-assisted non-parametric methods
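The MSE ratio plotted for each estimator can be computed from the replicated estimates as sketched below. The sketch assumes the true population total is known (as it is in a simulation) and that the estimates from the repeated samples are stored in arrays; the function names are hypothetical, and the confidence intervals shown in the plots are not reproduced here.

```python
import numpy as np

def mse(estimates, true_total):
    """Monte Carlo MSE of an estimator over repeated samples."""
    estimates = np.asarray(estimates, dtype=float)
    return np.mean((estimates - true_total) ** 2)

def mse_ratio(estimates, baseline_estimates, true_total):
    """MSE of an estimator relative to a baseline; values above 1 favor the baseline."""
    return mse(estimates, true_total) / mse(baseline_estimates, true_total)

# Toy example with two replicates per estimator and a true total of 10.
ratio = mse_ratio([9.0, 11.0], [9.5, 10.5], true_total=10.0)
```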

Notes on the Models and Model Parameters

• 3 different models used

– Linear

– Penalized spline with random effect for PSU

– Penalized spline with no random effect for PSU

• In survey contexts such as environmental monitoring, it is often desirable to obtain a single set of survey weights that can be used to predict any study variable. To accommodate this:

– The smoothing parameter for the spline is selected by fixing the degrees of freedom for the smooth rather than by a data-driven approach

– The variance component for the PSU effect is computed for the linear model, and the resulting covariance matrix and corresponding survey weights are applied to samples from the other data sets

– In this kind of survey context, model-assisted estimators have good efficiency properties and should be superior to model-based estimators, which rely on correct specification of variance components
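Fixing the degrees of freedom of the smooth can be sketched as follows. The example assumes a truncated linear spline basis with a ridge penalty on the knot coefficients, and finds the smoothing parameter by bisection so that the trace of the smoother matrix equals the target degrees of freedom. The basis, the knot placement, the unweighted fit, and all names are assumptions for illustration, not necessarily the choices made for the poster, and the PSU random effect is left out.

```python
import numpy as np

def spline_basis(x, knots):
    """Truncated linear basis [1, x, (x - k_1)_+, ..., (x - k_K)_+]."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    trunc = np.maximum(x[:, None] - knots[None, :], 0.0)
    return np.column_stack([np.ones_like(x), x, trunc])

def df_of_lambda(C, D, lam):
    """Effective degrees of freedom: trace of the smoother matrix."""
    G = C.T @ C
    return np.trace(np.linalg.solve(G + lam * D, G))

def lambda_for_df(C, D, target_df, lo=1e-8, hi=1e8, iters=100):
    """Bisection on the log scale; df decreases as lambda increases."""
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if df_of_lambda(C, D, mid) > target_df:
            lo = mid   # fit too flexible: increase the penalty
        else:
            hi = mid
    return np.sqrt(lo * hi)

def fit_fixed_df(x, y, knots, target_df):
    """Penalized least squares fit with lambda chosen to match target_df."""
    C = spline_basis(x, knots)
    # Penalize only the knot coefficients, not the intercept and slope.
    D = np.diag([0.0, 0.0] + [1.0] * len(knots))
    lam = lambda_for_df(C, D, target_df)
    beta = np.linalg.solve(C.T @ C + lam * D, C.T @ y)
    return beta, lam

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.size)
knots = np.quantile(x, np.linspace(0.0, 1.0, 12)[1:-1])  # 10 interior knots
beta, lam = fit_fixed_df(x, y, knots, target_df=5.0)
```

Because `lam` is determined by `x`, the knots, and the target degrees of freedom alone, the smoother matrix does not depend on `y`. The same fitted weights can therefore be reused for any study variable, which is the point of fixing the degrees of freedom.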

[Figure: MSE ratio (vertical axis) by estimator (H-T, M-A: lin, M-B: pmm, M-A: pmm), one panel for each of the eight mean functions: linear, quadratic, bump, jump, exponential, growth, cycle 1, cycle 4]
