Joint UNECE-Eurostat worksession on confidentiality, 2011, Tarragona Sampling as a way to reduce risk and create a Public Use File maintaining weighted totals Maria Cristina Casciano, Laura Corallo, Daniela Ichim


Page 1: Sampling as a way to reduce risk and create a Public Use File maintaining weighted totals

Maria Cristina Casciano, Laura Corallo, Daniela Ichim

Page 2: Outline

• Multiple releases: MFR and PUF
• Subsampling
  – allocation: reduce the risk of disclosure
  – selection: pre-defined quality standards
• Results
  – Career of Doctorate Holders Survey
• Further work

Page 3: Multiple …

[Diagram: three dimensions of harmonisation. Multiple countries (MS1, MS2, …, MS27), multiple surveys (SURVEY 1, SURVEY 2, …, SURVEY X), and multiple releases of each survey (TABLES, PUF, MFR, OTHER).]

Page 4: Comparability

• ESSnet on SDC harmonisation and common tools
  – WP1: test the comparability concept
  – Istat, Destatis, Statistics Austria
  – multiple countries

HOW:
1. Assessment of the effects of different practices on predefined statistics
2. Definition of a threshold to define when action is needed
3. Setting a process for choosing acceptable practices

Page 5: Multiple releases

SURVEY 1: TABLES 1, PUF 1, MFR 1, OTHER 1

• A particular harmonisation dimension
• Hierarchical structure
  – Utility
  – Risk of disclosure

Page 6: Multiple releases: hierarchical structure

MFR
  + Less aggregated information
  - More restrictive license

PUF
  + Less restrictive license
  - More aggregated information

UNIQUE PRODUCTION PROCESS!

Page 7: PUF-MFR

• MFR
  – definition of a disclosure scenario
  – risk assessment R1
  – risk limitation w.r.t.
    • the adopted disclosure scenario
    • some data utility requirements
• PUF
  – harmonized with the MFR (e.g. weighted totals)
  – reduced risk of disclosure
  – random sample
  – internal consistency of records
  – some (other) data utility requirements (CV and weighted totals: precision and accuracy)

Page 8: Data description

[Timeline: Year t-5, Year t-3, Year t; Doctorate Holders; CDH 2009 Survey]

Estimates by PhD scientific area, by gender and by region

Focus on the characterisation of the occupational status of the PhD holders:
• labour market entry
• usefulness of the PhD for obtaining a job
• type of contract
• type of work
• earnings
• job satisfaction

Page 9: Data description

• 18500 PhD holders (census)
• 72% respondents, 28% non-respondents
• 12964 respondents

Adjustment for non-response via calibration: weights obtained by constraining on known marginal distributions:
• Citizenship (2 categories)
• PhD Scientific Area (14 categories)
• Gender
• Region
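The calibration step on this slide can be illustrated with raking (iterative proportional fitting), one common way of making weights reproduce known marginal totals. The slides do not say which calibration method was used, so the function `rake`, the toy respondents and the target totals below are illustrative assumptions only.

```python
def rake(records, targets, iterations=200, tol=1e-12):
    """Adjust unit weights until weighted margins match known totals.

    records: list of dicts, one per respondent (hypothetical toy data).
    targets: {variable: {category: known population total}}.
    """
    weights = [1.0] * len(records)
    for _ in range(iterations):
        largest_shift = 0.0
        for var, totals in targets.items():
            # Current weighted total of each category of this variable.
            current = {}
            for rec, w in zip(records, weights):
                current[rec[var]] = current.get(rec[var], 0.0) + w
            # Rescale every weight so that this margin is matched exactly.
            for i, rec in enumerate(records):
                factor = totals[rec[var]] / current[rec[var]]
                weights[i] *= factor
                largest_shift = max(largest_shift, abs(factor - 1.0))
        if largest_shift < tol:
            break
    return weights


def weighted_margin(records, weights, var):
    """Weighted total of each category of `var`."""
    margin = {}
    for rec, w in zip(records, weights):
        margin[rec[var]] = margin.get(rec[var], 0.0) + w
    return margin


# Hypothetical respondents and known population margins (both sum to 100).
respondents = [
    {"gender": "F", "citizenship": "national"},
    {"gender": "F", "citizenship": "foreign"},
    {"gender": "M", "citizenship": "national"},
    {"gender": "M", "citizenship": "national"},
    {"gender": "M", "citizenship": "foreign"},
]
known_totals = {
    "gender": {"F": 40.0, "M": 60.0},
    "citizenship": {"national": 70.0, "foreign": 30.0},
}
calibrated = rake(respondents, known_totals)
```

After raking, `weighted_margin(respondents, calibrated, "gender")` and the corresponding call for `"citizenship"` reproduce the target totals up to numerical tolerance, which is the property the PUF is meant to inherit from the MFR.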

Page 10: PUF-subsampling

Simple random sampling
• Utility: weighted totals may always be preserved by calibration
• Risk: how many units at risk are sampled?
• Example (MFR-CDH): 12964 units, 24.7% of units at risk
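A small arithmetic sketch of the risk question for simple random sampling. It uses the MFR-CDH figures from the slide (12964 units, 24.7% at risk). The PUF size of 5000 is a hypothetical value chosen for illustration, and the sketch assumes that the at-risk flag of a unit is the one assigned in the MFR.

```python
def expected_at_risk_in_srs(population_size, share_at_risk, sample_size):
    """Expected number of at-risk units in a simple random sample
    drawn without replacement from the MFR."""
    at_risk_in_population = population_size * share_at_risk
    # Every unit has the same inclusion probability sample_size / population_size.
    return at_risk_in_population * sample_size / population_size


mfr_units = 12964        # from the slide
share_at_risk = 0.247    # from the slide
puf_size = 5000          # hypothetical PUF size, not from the slides

expected = expected_at_risk_in_srs(mfr_units, share_at_risk, puf_size)
expected_share = expected / puf_size
```

Under these assumptions about 1235 at-risk units are expected in the subsample, and their expected share stays at 24.7%. Simple random sampling lowers the count of at-risk units but not their proportion, which motivates an allocation that takes risk into account.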

Page 11: Subsampling

[Diagram of related concepts: allocation, domains, utility, disclosure, sample size, stratification, dissemination, totals, scenario, calibration, key variables, quality, users, auxiliary]

Page 12: PUF-subsampling: proposal

1. Optimal allocation of units to be sampled in each domain according to Bethel’s approach (risk minimization)
2. Selection of a fixed size balanced sample (CUBE method) (data utility maximization)

Page 13: 1. Bethel’s approach (1989)

● Cost function to minimize:

$$C' = C_0 + \sum_{h=1}^{H} C_h n_h$$

● Expected Coefficient of Variation (CV) of the estimates of the total of variable $p$ in domain $d_j$ equal to or lower than prefixed thresholds:

$$CV_{d_j p} \le CV^{*}_{d_j p}$$

$n_h$ and $C_h$ are related to the risk to be reduced

Optimal allocation: $n_h^{*}$
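As a simplified illustration of cost-minimising allocation under a CV constraint, the sketch below uses the classical closed-form solution for a single variable and a single constraint. It is not Bethel's multivariate algorithm, which handles several CV constraints simultaneously. The function names, stratum sizes `N`, standard deviations `S` and unit costs `C` are hypothetical. The closed form also ignores the bound n_h ≤ N_h and integer rounding.

```python
import math


def min_cost_allocation(N, S, C, cv_target, total):
    """Minimise sum(C_h * n_h) subject to CV(estimated total) == cv_target.

    N: stratum population sizes, S: stratum standard deviations,
    C: unit costs per stratum (for example higher where risk is higher),
    total: population total of the variable of interest.
    """
    target_variance = (cv_target * total) ** 2
    a = sum(Nh * Sh * math.sqrt(Ch) for Nh, Sh, Ch in zip(N, S, C))
    b = sum(Nh * Sh ** 2 for Nh, Sh in zip(N, S))
    k = a / (target_variance + b)
    # n_h is proportional to N_h * S_h / sqrt(C_h).
    return [k * Nh * Sh / math.sqrt(Ch) for Nh, Sh, Ch in zip(N, S, C)]


def cv_of_total(N, S, n, total):
    """CV of the stratified estimator of a total (sampling without replacement)."""
    variance = sum(
        Nh ** 2 * Sh ** 2 * (1.0 / nh - 1.0 / Nh) for Nh, Sh, nh in zip(N, S, n)
    )
    return math.sqrt(variance) / total


# Hypothetical strata.
N = [1000, 2000, 500]
S = [10.0, 20.0, 40.0]
C = [1.0, 2.0, 4.0]
total = 100000.0

allocation = min_cost_allocation(N, S, C, cv_target=0.05, total=total)
achieved_cv = cv_of_total(N, S, allocation, total)
```

Raising the cost of a stratum lowers its allocation, which is how a cost linked to disclosure risk can steer the sample away from risky strata while the CV threshold is still met.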

Page 14: 2. Balanced sampling

A sampling design s is said to be balanced on the auxiliary variables $x = (x_1 \ldots x_j \ldots x_p)'$ if and only if the balancing equations given by

$$\hat{X}_\pi = X$$

are satisfied, where $X$ is the vector of known population totals and $\hat{X}_\pi$ is the H.-T. (Horvitz-Thompson) estimator.

Exact estimates for pre-defined variables
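A minimal sketch of what the balancing equations check, assuming a toy population of four units with equal inclusion probabilities. The function names and the data are hypothetical.

```python
def horvitz_thompson_totals(x, pi, sample):
    """Horvitz-Thompson estimates of the totals of each auxiliary variable.

    x: list of per-unit auxiliary vectors, pi: inclusion probabilities,
    sample: 0/1 selection indicators, one per unit.
    """
    p = len(x[0])
    return [
        sum(s * xk[j] / pk for xk, pk, s in zip(x, pi, sample)) for j in range(p)
    ]


def is_balanced(x, pi, sample, tol=1e-9):
    """True when the H.-T. estimates equal the known population totals."""
    known = [sum(xk[j] for xk in x) for j in range(len(x[0]))]
    estimated = horvitz_thompson_totals(x, pi, sample)
    return all(abs(e - t) <= tol for e, t in zip(estimated, known))


# Toy population: a constant (balances the population size) and a size variable.
x = [(1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 4.0)]
pi = [0.5, 0.5, 0.5, 0.5]

balanced = is_balanced(x, pi, [1, 0, 0, 1])      # units 1 and 4
unbalanced = is_balanced(x, pi, [1, 1, 0, 0])    # units 1 and 2
```

Selecting units 1 and 4 reproduces both known totals (4 and 10), while selecting units 1 and 2 estimates the second total as 6 and is therefore not balanced.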

Page 15: Balanced sampling: the CUBE method

Geometrically, each vertex of the hypercube is a sample: $s \in \{0,1\}^N$.

[Figure: the cube for N = 3 with labelled vertices (000), (100), (010), (110), (101), (011), (111) and the constraint sub-space K.]

The balancing equations define a sub-space of $\mathbb{R}^N$ named K. The problem is to choose a vertex (sample) of the N-cube that remains in the sub-space of constraints K.

Cube method (Deville & Tillé, 2004):
1. Flight phase: a random walk starting from the vector of inclusion probabilities π and moving in the intersection of the cube C and K. It stops at a vertex of the intersection of C and K, if such a vertex exists.
2. Landing phase: at the end of the flight phase, if a sample is not exactly determined in C∩K, a sample is selected as close as possible to the constraints space K.
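The geometric picture can be made concrete by brute force on a tiny population: enumerate every vertex of the cube and keep those that lie in K. This is only an illustration of the geometry, feasible for very small N. It is not the cube algorithm of Deville & Tillé, whose flight phase is a random walk and does not enumerate vertices. All names and data below are hypothetical.

```python
from itertools import product


def imbalance(x, pi, sample):
    """Largest relative gap between H.-T. estimates and known totals."""
    gaps = []
    for j in range(len(x[0])):
        known = sum(xk[j] for xk in x)
        estimate = sum(s * xk[j] / pk for xk, pk, s in zip(x, pi, sample))
        gaps.append(abs(estimate - known) / abs(known))
    return max(gaps)


def vertices_in_K(x, pi, tol=1e-9):
    """All vertices of the N-cube that satisfy the balancing equations."""
    return [
        v for v in product((0, 1), repeat=len(x)) if imbalance(x, pi, v) <= tol
    ]


def closest_vertex(x, pi):
    """Vertex with the smallest imbalance, used when no vertex lies in K."""
    return min(product((0, 1), repeat=len(x)), key=lambda v: imbalance(x, pi, v))


# Case 1: exact balanced samples exist.
x1 = [(1.0, 1.0), (1.0, 2.0), (1.0, 3.0), (1.0, 4.0)]
pi1 = [0.5, 0.5, 0.5, 0.5]
exact = vertices_in_K(x1, pi1)

# Case 2: no vertex satisfies the equations, so the nearest one is taken.
x2 = [(1.0, 1.0), (1.0, 2.0), (1.0, 4.0)]
pi2 = [2.0 / 3.0] * 3
none_exact = vertices_in_K(x2, pi2)
nearest = closest_vertex(x2, pi2)
```

In the first case two vertices lie in K, namely (0, 1, 1, 0) and (1, 0, 0, 1). In the second case no vertex does, and the nearest vertex under this particular distance is (1, 0, 1), which mirrors the role of the landing phase.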

Page 16: Implementation

1. Determination of the optimal strata sizes in terms of reduction of the overall risk (cost function), keeping the CV level of the estimates below a 5% threshold for three combinations of the allocation and domain variables
   Allocation variables: Occup, JobS, Contract, Work, Income
   Domain variables: Gender, Region, Scientific Area, Year of Completion
2. Six possible settings, corresponding to different choices of the parameters:
   a. Risk R1 used as the minimization cost of the algorithm
   b. Risk R1 used as a stratification variable
   c. Include all units of the strata containing no units at risk

Page 17: Allocations (CV* = 5%)

C.S  Risk.cost  Risk.strat  Cens.no.risk  # Strata  # Cens.strata  # Cens.units  Size Bethel  Size Prop.  Size Equal  Max.Bethel-Prop  Max.Bethel-Equal
1    N          Y           N             925       153            252           4933         5391        5550        459              618
2    N          Y           Y             925       214            704           5105         5547        5550        443              446
3    Y          Y           N             925       204            558           5239         5719        5550        480              311
4    Y          Y           Y             925       235            814           5330         5781        5550        451              220
5    Y          N           N             925       240            687           5555         5953        6475        399              921
6    Y          N           Y             925       269            983           5649         6094        6475        446              827
7    N          Y           N             925       306            1614          8725         9256        9250        530              524
8    N          Y           Y             925       352            1919          8827         9324        9250        498              424
9    Y          Y           N             925       416            3229          8955         9424        9250        468              294
10   Y          Y           Y             925       451            3398          9045         9511        9250        466              205
11   Y          N           N             925       426            3243          9151         9601        9250        451              100
12   Y          N           Y             925       457            3399          9222         9669        9250        446              84
13   N          Y           N             56        0              0             4745         4773        4760        138              132
14   N          Y           Y             56        28             9761          10320        10346       10360       166              630
15   Y          Y           N             56        21             5844          8812         8841        8848        189              389
16   Y          Y           Y             56        28             9761          10323        10349       10360       166              630
17   Y          N           N             28        0              0             4760         4774        4788        176              88
18   Y          N           Y             28        0              0             4759         4774        4788        176              88

Page 18: Allocations

[Figure: Bethel sample size (axis from 0 to 12000) plotted against CV (axis from 0 to 0.25) for settings 1 to 18.]

Page 19: Balanced sample

Selection of samples of fixed size from the CDH survey. Utility constraints on:
• the population size N
• the optimal sample size n
• the marginal frequency distributions by Gender, Year of Doctorate Completion and Scientific Area

18 equations

CUBE algorithm:
I. The input vector is the optimal one determined by Bethel
II. The flight phase ends with no exact solution
III. The landing phase starts: selection of a sample which ensures a low difference to the balance, according to the distance between $\hat{X}_\pi$ and $X$

Page 20: Results

Median of absolute relative errors
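A minimal sketch of how a median of absolute relative errors could be computed, comparing estimates from a subsample with reference values from the full file. The slides do not give the exact definition used, so the function name and the numbers below are assumptions for illustration.

```python
from statistics import median


def median_abs_relative_error(estimates, references):
    """Median over all cells of |estimate - reference| / |reference|."""
    return median(
        abs(est - ref) / abs(ref) for est, ref in zip(estimates, references)
    )


# Hypothetical weighted totals for three cells.
puf_estimates = [98.0, 210.0, 305.0]
mfr_references = [100.0, 200.0, 300.0]

mare = median_abs_relative_error(puf_estimates, mfr_references)
```

With these toy values the three relative errors are 2%, 5% and about 1.7%, so the median is 2%. Using the median rather than the mean keeps a few badly estimated cells from dominating the summary.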

Page 21: Results

C.S  Risk.cost  Risk.strat  Cens.no.risk  Risk  Occup  JobS  Contract  Work  Income
1    N          Y           N             1366  0.88   0.97  0.97      0.99  0.99
2    N          Y           Y             1333  0.92   0.99  0.94      0.97  0.99
3    Y          Y           N             1335  0.92   0.98  0.95      0.99  0.99
4    Y          Y           Y             1354  0.87   0.99  0.95      0.97  0.99
5    Y          N           N             1490  0.86   0.98  0.97      0.98  0.98
6    Y          N           Y             1525  0.91   0.98  0.95      0.97  0.99
7    N          Y           N             2194  0.83   0.91  0.99      0.97  1.00
8    N          Y           Y             2177  0.56   0.81  0.99      0.94  0.99
9    Y          Y           N             2149  0.78   0.91  0.99      0.91  1.00
10   Y          Y           Y             2163  0.64   0.88  0.97      0.95  0.99
11   Y          N           N             2232  0.63   0.87  0.99      0.86  1.00
12   Y          N           Y             2233  0.55   0.78  0.96      0.94  0.99
13   N          Y           N             1272  0.96   0.99  0.92      0.96  0.98
14   N          Y           Y             559   0.52   0.79  0.41      0.83  0.98
15   Y          Y           N             564   0.77   0.94  0.93      0.97  0.99
16   Y          Y           Y             562   0.56*  0.84  0.59      0.88  0.99
17   Y          N           N             1270  0.95   0.99  0.98      0.99  0.99
18   Y          N           Y             1247  0.91   0.99  0.98      0.99  0.98

Page 22: Further work

1. The relationship between coefficients of variation and disclosure risk, together with different options for including the risk of disclosure in the sampling design.
2. The introduction of a utility-priority approach into the way the balancing equations are dealt with.
3. The use of other data utility constraints, still to be investigated.