Confidentiality protection of large frequency data cubes

UNECE Workshop on Statistical Confidentiality

Ottawa 28-30 October 2013

Johan Heldal and Svetlana Badina

Statistics Norway

Eurostat Census Hypercubes

• 60 Census 2011 frequency count hypercubes that all 32 EU+EEA countries must submit in 2014.

• Four to nine variables (breakdowns) in each cube.

• Each country is responsible for its own disclosure control method according to national legislation.

• Norway is the only country that wishes to use small count (1 and 2) rounding as the preferred disclosure control method.

• This presentation will show how.

• Hypercube 06 will be used for illustration.

The problem

A \ B     1     …     b     …     L
1
:
a         0     0     7     0     0
:
K

A and B are combinations of variables. Value a for A implies value b for B.

Idea: We want to create uncertainty about the surrounding zeroes.

Idea

• We want to create uncertainties about whether zeroes are real zeroes.

• Create more zeroes by rounding small counts (1 and 2) to 0 or 3, unbiasedly.

• The rounding must be carried out so as to minimize the perturbation of given aggregate counts.

• Counts of 1 and 2 are not necessarily considered problematic by themselves but will be removed by rounding.
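
A minimal sketch of what "unbiasedly" means for a single cell, assuming each small count is rounded on its own: a count k in {1, 2} goes to 3 with probability k/3, so its expected value is still k. The function name `round_small_count` is hypothetical, and the joint selection of cells that keeps totals stable is not shown here.

```python
import random

def round_small_count(count, rng):
    """Round a count of 1 or 2 to 0 or 3 so that the expected value equals count."""
    if count not in (1, 2):
        return count  # zeroes and counts of 3 or more are left unchanged
    # P(round up to 3) = count / 3, hence E[result] = 3 * count / 3 = count.
    return 3 if rng.random() < count / 3 else 0

rng = random.Random(2013)
draws = [round_small_count(2, rng) for _ in range(30000)]
mean_of_twos = sum(draws) / len(draws)  # close to 2 if the rounding is unbiased
```

Every draw is 0 or 3, so a published 0 may hide a true 1 or 2, which is the uncertainty the idea aims for.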

Hypercube 06

Spanning variables and groups in hypercube 06

Variable   Explanation                               No. of groups
GEO.L      Region of residence according to NUTS2    7 regions
SEX        Sex                                       2
FST.H      Family status. High detail                6
LMS        Marital status                            4
CAS.L      Activity status. Low detail               3
POB.M      Country of birth. Medium detail           9
COC.M      Citizenship. Medium detail                9
AGE.M      Age. Medium detail (5-year groups)        21

The hypercube spans 1 714 608 cells.

53 550 cells are populated.
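
The cell count is the product of the group counts in the table. A short check in Python, with the group counts copied from the table and the populated-cell figure taken as given:

```python
from math import prod

# Number of groups per spanning variable, as listed for hypercube 06.
groups = {
    "GEO.L": 7, "SEX": 2, "FST.H": 6, "LMS": 4,
    "CAS.L": 3, "POB.M": 9, "COC.M": 9, "AGE.M": 21,
}

total_cells = prod(groups.values())              # size of the full cube
populated_cells = 53_550                         # figure quoted on the slide
populated_share = populated_cells / total_cells  # fraction of non-empty cells
```

Only about 3 percent of the cells are populated, so the cube is very sparse.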

Principal Marginal Distributions

Either the entire HC or 6 PMDs must be submitted from HC 06.

Principal Marginal Distributions of hypercube 06

PMD    Breakdowns
6      GEO.L  SEX  FST.H  LMS  CAS.L  POB.M  COC.M  AGE.M
6.1    GEO.L  SEX  FST.H  LMS  AGE.M
6.2    GEO.L  SEX  FST.H  LMS  CAS.L  POB.M
6.3    GEO.L  SEX  FST.H  LMS  CAS.L  COC.M
6.4    GEO.L  SEX  FST.H  CAS.L  AGE.M
6.5    GEO.L  SEX  FST.H  POB.L  AGE.M
6.6    GEO.L  SEX  FST.H  COC.L  AGE.M

There are five or six variables in each PMD. Three variables (GEO.L, SEX and FST.H) are common to all six PMDs.
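
The two closing statements can be checked mechanically from the breakdown lists. A small sketch, with the PMD definitions copied from the table; the variable names `pmds`, `common` and `sizes` are only illustrative.

```python
# Breakdown variables of the six PMDs of hypercube 06, as listed above.
pmds = {
    "6.1": ["GEO.L", "SEX", "FST.H", "LMS", "AGE.M"],
    "6.2": ["GEO.L", "SEX", "FST.H", "LMS", "CAS.L", "POB.M"],
    "6.3": ["GEO.L", "SEX", "FST.H", "LMS", "CAS.L", "COC.M"],
    "6.4": ["GEO.L", "SEX", "FST.H", "CAS.L", "AGE.M"],
    "6.5": ["GEO.L", "SEX", "FST.H", "POB.L", "AGE.M"],
    "6.6": ["GEO.L", "SEX", "FST.H", "COC.L", "AGE.M"],
}

# Variables shared by every PMD, and the number of variables in each PMD.
common = set.intersection(*(set(variables) for variables in pmds.values()))
sizes = {name: len(variables) for name, variables in pmds.items()}
```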

Reduce the hypercube

STEP 1: Identifying small counts

a. Reduce hypercube A by selecting a subset B consisting of
   • all interior cells in A with counts 1 or 2, or
   • all interior cells in A contributing to counts of 1 or 2 in the PMDs of A.

b. Calculate C = A – B.

STEP 2: Rounding

• Let n_B be the total count in B.

• Round [n_B/3] of the interior counts in B to 3 and the rest to 0, where [x] denotes x rounded to the nearest integer. Call the result B*.

• IF the solution B* is good enough, STOP. ELSE, continue the search for a better B*.

STEP 3: Calculate A* = C + B*, the rounded cube.
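
The three steps can be sketched on a toy cube. This is only an illustration under stated assumptions: the cube is a plain dict from cell key to count, B is taken to be all interior cells with counts 1 or 2 (the first variant in step 1), the cells to round up are drawn by simple random sampling, and the "good enough" search loop of step 2 is left out. The function name `round_cube` and the toy data are hypothetical.

```python
import random

def round_cube(a, rng):
    """Return (b, c, b_star, a_star) for a cube given as {cell: count}."""
    # STEP 1: B holds the small interior counts, C = A - B holds the rest.
    b = {cell: (n if n in (1, 2) else 0) for cell, n in a.items()}
    c = {cell: a[cell] - b[cell] for cell in a}

    # STEP 2: round [n_B / 3] of the non-zero cells in B to 3, the rest to 0.
    n_b = sum(b.values())
    n_up = (n_b + 1) // 3  # n_B / 3 rounded to the nearest integer
    small_cells = [cell for cell, n in b.items() if n > 0]
    rounded_up = set(rng.sample(small_cells, n_up))
    b_star = {cell: (3 if cell in rounded_up else 0) for cell in b}

    # STEP 3: recombine into the rounded cube A* = C + B*.
    a_star = {cell: c[cell] + b_star[cell] for cell in a}
    return b, c, b_star, a_star

toy = {
    ("a", 1): 0, ("a", 2): 7, ("a", 3): 1,
    ("b", 1): 2, ("b", 2): 2, ("b", 3): 12,
    ("c", 1): 1, ("c", 2): 0, ("c", 3): 1,
}
b, c, b_star, a_star = round_cube(toy, random.Random(1))
```

In the toy cube the small counts sum to 7, so two cells are rounded to 3 and the remaining small cells become 0; cells with counts of 3 or more pass through C untouched.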

Simple properties

• A* – A = B* – B, because A = C + B and A* = C + B*

• A* is additive

• |n_A – n_A*| = |n_B – 3[n_B/3]| ≤ 1

• All Principal Marginal Distributions will be consistently rounded.

The Norwegian HC 06

Number of small cells in full HC and with PMDs only

                                                              Full hypercube   PMDs only
Principal small counts               m_1 + m_2                        25 823       3 048
Internal small counts                n_1 + n_2                        25 823       2 941
No. of internal 1s                   n_1                              18 728       2 683
No. of internal 2s                   n_2                               7 095         258
Population in small count cells      n_B = n_1 + 2 n_2                32 918       3 199
Percent of pop. in small count cells 100 n_B / N                        0.66       0.064
No. of cells to round to 3           [n_B / 3]                        10 973       1 066
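
The population in small count cells and the number of cells to round to 3 follow from the numbers of internal 1s and 2s alone. A small check, assuming the square brackets mean rounding to the nearest integer (which matches the figures in the table); the function name `small_count_summary` is hypothetical.

```python
def small_count_summary(n1, n2):
    """Return (n_B, cells rounded to 3, change in the total) from the counts of 1s and 2s."""
    n_b = n1 + 2 * n2          # population sitting in small count cells
    n_up = (n_b + 1) // 3      # n_B / 3 rounded to the nearest integer
    total_change = 3 * n_up - n_b  # effect of the rounding on the grand total
    return n_b, n_up, total_change

full_hc = small_count_summary(18_728, 7_095)   # full hypercube column
pmds_only = small_count_summary(2_683, 258)    # PMDs only column
```

In both columns the grand total moves by exactly one person, in line with the bound of 1 on the total deviation.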

Rounding method used

1. Let n_B = total count of B, e.g. n_B = 3 199.

2. From the non-zero cells in B, select without replacement (WOR) [n_B/3] (= 1 066) cells to be rounded to 3.
   • Probabilities: P(2 → 3) = 2·P(1 → 3)
   • Selection may be stratified.

3. Calculate the distance m = max_{c ∈ M} |b*_c – b_c| across a control set M of marginal cells of B.

4. The solution with the smallest value of m is selected.
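
A sketch of the random search on a toy B, under several assumptions that are not in the slides: the control set M is taken to be all one-way margins of a two-variable table, the weighted draw without replacement uses exponential keys (the Efraimidis-Spirakis scheme), which only approximates the stated 2:1 ratio of selection probabilities, and no stratification is applied. All function names are hypothetical.

```python
import random
from collections import defaultdict

def one_way_margins(cells):
    """Sum the counts of {cell_tuple: count} over every one-way margin."""
    margins = defaultdict(int)
    for cell, n in cells.items():
        for axis, value in enumerate(cell):
            margins[(axis, value)] += n
    return margins

def weighted_sample_wor(weights, k, rng):
    """Pick k keys without replacement, favouring larger weights (assumed scheme)."""
    ranked = sorted(weights, key=lambda cell: rng.random() ** (1.0 / weights[cell]),
                    reverse=True)
    return set(ranked[:k])

def random_search(b, runs, rng):
    """Repeat the weighted draw and keep the B* with the smallest distance m."""
    n_up = (sum(b.values()) + 1) // 3            # [n_B / 3], nearest integer
    small = {cell: n for cell, n in b.items() if n > 0}
    target = one_way_margins(b)
    best_m, best_b_star = None, None
    for _ in range(runs):
        chosen = weighted_sample_wor(small, n_up, rng)  # a 2 weighs twice a 1
        b_star = {cell: (3 if cell in chosen else 0) for cell in b}
        rounded = one_way_margins(b_star)
        m = max(abs(rounded[key] - target[key]) for key in target)
        if best_m is None or m < best_m:
            best_m, best_b_star = m, b_star
    return best_m, best_b_star

toy_b = {(0, 0): 1, (0, 1): 2, (1, 0): 0, (1, 1): 1, (2, 0): 2, (2, 1): 1}
best_m, best_b_star = random_search(toy_b, runs=200, rng=random.Random(0))
```

Keeping only the best of many independent draws is what makes the result depend on chance and on the number of runs.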

Test experiment

• Control set M: All one- and two-way marginal counts generated from the eight variables spanning HC 06 (1985 cells).

• 10 000 runs are done.
  – For full HC 06 and for the PMDs only
  – With stratified and unstratified sampling

Improvements in maximum deviation m by iterations in random search

          Full hypercube, stratification by                PMDs only, stratification by
Iter      None   GEO.L, SEX   GEO.L, SEX, AGE.M, FST.H     None   GEO.L, SEX   GEO.L, SEX, AGE.M, FST.H
             m            m                          m        m            m                          m
1          263          149                        198       68           62                         57
10         186          145                        154       62           56                         52
100        153          133                        123       50           50                         40
1000       140          133                        100       45           45                         37
10000      133          121                        100       43           41                         35

Percent deviations for largest absolute deviations, m, in the best solutions.

               m    True cell value    Percent deviation
Full HC
             133            208 553                0.064
             121            166 784                0.073
             100             64 481                0.155
PMDs only
              43            275 346                0.016
             -41              7 464               -0.549
             -35             85 383               -0.041
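
The percent deviation is 100·m divided by the true cell value. A quick check in Python, with the figures copied from the table and signs kept as printed:

```python
# (m, true cell value) pairs: three rows for the full HC, three for PMDs only.
rows = [
    (133, 208_553), (121, 166_784), (100, 64_481),
    (43, 275_346), (-41, 7_464), (-35, 85_383),
]

# Percent deviation of each worst-case marginal cell, to three decimals.
percent = [round(100 * m / true_value, 3) for m, true_value in rows]
```

The largest relative deviation (about half a percent) occurs in the smallest of these cells, with a true value of 7 464.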

Discussion

• The method is not yet fully approved for the Census HCs.

• Is the method sufficient to prevent any kind of disclosure?

• The reduction of the problem (A → B) is absolutely required to make the method work.

• Advantage:
  – Can produce consistent results with acceptable (?) aggregate deviations for a number of linked cubes of some size.

• Problems:
  – With random search the result is subject to chance.
  – Diminishing returns from increasing the number of iterations.
  – We need to find better and more stable search engines.
  – Generalization to rounding bases of more than 3 will increase the deviations in aggregates.

Further work

• Try better sampling procedures (Balanced sampling?)

• Try Mixed Integer Linear Programming software.

• Extend the experiment to round more hypercubes jointly.

• An idea: Merge the reduced rounded cells back into microdata:
  – A method for perturbing some variables in relation to others.
  – How many variables must be perturbed this way to make all hypercubes safe?
  – Creates a microdata set that produces the rounded tables directly.

Thank you very much for your attention