Confidentiality protection of large frequency data cubes
UNECE Workshop on Statistical Confidentiality
Ottawa 28-30 October 2013
Johan Heldal and Svetlana Badina
Statistics Norway
Eurostat Census Hypercubes
• 60 Census 2011 frequency count hypercubes that all 32 EU+EEA countries must submit in 2014.
• Four to nine variables (breakdowns) in each cube.
• Each country is responsible for its own disclosure control method according to national legislation.
• Norway is the only country that wishes to use small count (1 and 2) rounding as the preferred disclosure control method.
• This presentation will show how.
• Hypercube 06 will be used for illustration.
The problem
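| A \ B | 1 | … | b | … | L |
|---|---|---|---|---|---|
| 1 | | | | | |
| : | | | | | |
| a | 0 | 0 | 7 | 0 | 0 |
| : | | | | | |
| K | | | | | |

A and B are combinations of variables. Value a for A implies value b for B.
Idea: we want to create uncertainty about the surrounding zeroes.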
Idea
• We want to create uncertainties about whether zeroes are real zeroes.
• Create more zeroes from small counts (1 and 2) by rounding them to 0 or 3 (unbiasedly).
• The rounding must be carried out so as to minimize the perturbation of given aggregate counts.
• Counts of 1 and 2 are not necessarily considered problematic by themselves but will be removed by rounding.
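Unbiased rounding of a small count to base 3 can be illustrated with a minimal sketch. It assumes each cell is rounded independently, which is only meant to show the unbiasedness condition and is not the controlled selection used later in the talk; the function name and the seed are illustrative.

```python
import random

def round_small_count(count, rng):
    """Round a count of 1 or 2 to 0 or 3 without bias; leave other counts alone."""
    if count not in (1, 2):
        return count
    # P(round up to 3) = count / 3, so the expected rounded value equals count.
    return 3 if rng.random() < count / 3 else 0

rng = random.Random(2013)  # arbitrary seed, for reproducibility only
draws = [round_small_count(2, rng) for _ in range(30_000)]
mean_of_draws = sum(draws) / len(draws)
print(mean_of_draws)  # close to 2 on average
```

With these probabilities a count of 2 is twice as likely as a count of 1 to be rounded up to 3.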
Hypercube 06
Spanning variables and groups in hypercube 06

| Variable | Explanation | No. of groups |
|---|---|---|
| GEO.L | Region of residence according to NUTS2 | 7 regions |
| SEX | Sex | 2 |
| FST.H | Family status. High detail | 6 |
| LMS | Marital status | 4 |
| CAS.L | Activity status. Low detail | 3 |
| POB.M | Country of birth. Medium detail | 9 |
| COC.M | Citizenship. Medium detail | 9 |
| AGE.M | Age. Medium detail (5-year groups) | 21 |

The hypercube spans 1 714 608 cells.
53 550 cells are populated.
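The cell count follows from multiplying the numbers of groups. A short check, using only the figures listed above (the dictionary name is illustrative):

```python
from math import prod

# Number of groups per spanning variable of hypercube 06, as listed above.
groups = {
    "GEO.L": 7, "SEX": 2, "FST.H": 6, "LMS": 4,
    "CAS.L": 3, "POB.M": 9, "COC.M": 9, "AGE.M": 21,
}
n_cells = prod(groups.values())
populated = 53_550
share_populated = 100 * populated / n_cells

print(n_cells)                      # 1714608
print(f"{share_populated:.1f} %")   # 3.1 %
```

Only about 3 percent of the cells are populated, so the cube is very sparse.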
Principal Marginal Distributions
Either the entire HC or 6 PMDs must be submitted from HC 06.

Principal Marginal Distributions of hypercube 06

| Cube | Breakdowns |
|---|---|
| 6 | GEO.L SEX FST.H LMS CAS.L POB.M COC.M AGE.M |
| 6.1 | GEO.L SEX FST.H LMS AGE.M |
| 6.2 | GEO.L SEX FST.H LMS CAS.L POB.M |
| 6.3 | GEO.L SEX FST.H LMS CAS.L COC.M |
| 6.4 | GEO.L SEX FST.H CAS.L AGE.M |
| 6.5 | GEO.L SEX FST.H POB.L AGE.M |
| 6.6 | GEO.L SEX FST.H COC.L AGE.M |

There are 5-6 variables in each PMD. 3 variables are common to all six PMDs.
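The statement about shared variables can be checked directly from the breakdown lists. The sketch below simply encodes the six PMDs as Python lists (the variable names `pmds`, `common` and `sizes` are illustrative):

```python
# Breakdowns of the six PMDs of hypercube 06, copied from the list above.
pmds = {
    "6.1": ["GEO.L", "SEX", "FST.H", "LMS", "AGE.M"],
    "6.2": ["GEO.L", "SEX", "FST.H", "LMS", "CAS.L", "POB.M"],
    "6.3": ["GEO.L", "SEX", "FST.H", "LMS", "CAS.L", "COC.M"],
    "6.4": ["GEO.L", "SEX", "FST.H", "CAS.L", "AGE.M"],
    "6.5": ["GEO.L", "SEX", "FST.H", "POB.L", "AGE.M"],
    "6.6": ["GEO.L", "SEX", "FST.H", "COC.L", "AGE.M"],
}
common = set.intersection(*(set(v) for v in pmds.values()))
sizes = {name: len(v) for name, v in pmds.items()}

print(sorted(common))  # ['FST.H', 'GEO.L', 'SEX']
print(sizes)           # {'6.1': 5, '6.2': 6, '6.3': 6, '6.4': 5, '6.5': 5, '6.6': 5}
```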
Reduce the hypercube
STEP 1: Identifying small counts
a. Reduce hypercube A by selecting a subset B consisting of
  • all interior cells in A with counts 1 or 2, or
  • all interior cells in A contributing to 1 or 2 in the PMDs of A.
b. Calculate C = A – B.
STEP 2: Rounding
• nB = total value of B.
• Round [nB/3] interior counts in B to 3 and the rest to 0. This gives B*.
• IF the solution B* is good enough, STOP. ELSE, continue the search for a better B*.
STEP 3: Calculate A* = C + B*, the rounded cube.
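The three steps can be sketched on a toy cube. This is a minimal illustration under stated assumptions, not the authors' implementation: the cube is a plain dictionary, B is taken as all interior cells with counts 1 or 2 (the first variant of step 1a), [nB/3] is read as the nearest integer, the cells to round up are drawn uniformly in a single draw, and no search for a better B* is performed. All names and the toy counts are hypothetical.

```python
import random

def reduce_and_round(A, rng):
    """A maps cell key -> count. Returns (B, C, B_star, A_star) as dicts."""
    # STEP 1: B = interior cells with counts 1 or 2, C = A - B.
    B = {key: n for key, n in A.items() if n in (1, 2)}
    C = {key: (0 if key in B else n) for key, n in A.items()}
    # STEP 2: round [nB/3] cells of B to 3 and the rest to 0.
    n_B = sum(B.values())
    n_up = (n_B + 1) // 3                    # nearest integer to nB / 3
    up = set(rng.sample(sorted(B), n_up))    # uniform draw: a simplification
    B_star = {key: (3 if key in up else 0) for key in B}
    # STEP 3: A* = C + B*.
    A_star = {key: C[key] + B_star.get(key, 0) for key in A}
    return B, C, B_star, A_star

# Hypothetical 3 x 2 cube.
A = {("a1", "b1"): 1, ("a1", "b2"): 2, ("a2", "b1"): 14,
     ("a2", "b2"): 1, ("a3", "b1"): 2, ("a3", "b2"): 0}
B, C, B_star, A_star = reduce_and_round(A, random.Random(1))
print(A_star)
print(sum(A.values()), sum(A_star.values()))
```

Cells outside B, such as the count of 14, pass through unchanged, and the grand total moves by at most 1.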
Simple properties
• A* – B* = A – B = C, and hence A* – A = B* – B
• A* is additive
• $|n_A - n_{A^*}| = |n_B - 3[n_B/3]| \le 1$, where [x] is x rounded to the nearest integer
• All Principal Marginal Distributions will be consistently rounded.
The Norwegian HC 06
Number of small cells in full HC and with PMDs only

| Quantity | Symbol | Full hypercube | PMDs only |
|---|---|---|---|
| Principal small counts | m1 + m2 | 25 823 | 3 048 |
| Internal small counts | n1 + n2 | 25 823 | 2 941 |
| No. of internal 1s | n1 | 18 728 | 2 683 |
| No. of internal 2s | n2 | 7 095 | 258 |
| Pop. in small count cells | nB = n1 + 2n2 | 32 918 | 3 199 |
| Percent of pop. in small count cells | 100nB/N | 0.66 | 0.064 |
| No. of cells to round to 3 | [nB/3] | 10 973 | 1 066 |
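The derived rows of the table can be reproduced from n1 and n2. The sketch below assumes [nB/3] means nB/3 rounded to the nearest integer, which is the reading that matches both columns; the function name is illustrative.

```python
def small_count_summary(n1, n2):
    """Return (nB, number of cells rounded to 3, change in the total count)."""
    n_B = n1 + 2 * n2
    n_round_up = (n_B + 1) // 3      # nearest integer to nB / 3
    return n_B, n_round_up, 3 * n_round_up - n_B

full_hc = small_count_summary(18_728, 7_095)
pmds_only = small_count_summary(2_683, 258)
print(full_hc)    # (32918, 10973, 1)
print(pmds_only)  # (3199, 1066, -1)
```

In both cases the rounded cells sum to within 1 of the original small-count population.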
Rounding method used
1. Let nB = total count of B, e.g. nB = 3 199.
2. From the non-zero cells in B, select without replacement (WOR) [nB/3] (= 1 066) cells to be rounded to 3.
  • Probabilities: P(2 → 3) = 2·P(1 → 3)
  • Selection may be stratified.
3. Calculate the distance $m = \max_{c \in M} |b_c^* - b_c|$ across a control set M of marginal cells of B.
4. The solution with the smallest value of m is selected.
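A minimal sketch of this random search follows. The slides do not say how the without-replacement draw is implemented; here systematic sampling with probability proportional to the count is assumed, which gives a fixed sample size and makes a 2 twice as likely as a 1 to be rounded up. The control set M is reduced to one-way margins of a small hypothetical three-variable B, and all function names are illustrative.

```python
import random
from collections import defaultdict

def pps_systematic(weights, k, rng):
    """Pick k distinct indices, unit i with inclusion probability k * w_i / sum(w)."""
    if k == 0:
        return []
    order = list(range(len(weights)))
    rng.shuffle(order)                  # random order, then systematic selection
    step = sum(weights) / k             # at least 2 here, so no cell is hit twice
    target = rng.random() * step
    chosen, cum = [], 0
    for i in order:
        cum += weights[i]
        if len(chosen) < k and target < cum:
            chosen.append(i)
            target += step
    return chosen

def max_marginal_deviation(keys, counts, rounded):
    """m = max over the control set of |b*_c - b_c| (one-way margins only)."""
    dev = defaultdict(int)
    for key, b, b_star in zip(keys, counts, rounded):
        for dim, value in enumerate(key):
            dev[(dim, value)] += b_star - b
    return max(abs(d) for d in dev.values())

def random_search(keys, counts, n_iter, rng):
    """Repeat the random rounding n_iter times and keep the smallest m."""
    k = (sum(counts) + 1) // 3          # [nB/3], nearest integer
    best_m, best = None, None
    for _ in range(n_iter):
        up = set(pps_systematic(counts, k, rng))
        rounded = [3 if i in up else 0 for i in range(len(counts))]
        m = max_marginal_deviation(keys, counts, rounded)
        if best_m is None or m < best_m:
            best_m, best = m, rounded
    return best_m, best

# Hypothetical B: 4 x 2 x 5 cells, each holding a 1 or a 2.
data_rng = random.Random(0)
keys = [(g, s, a) for g in range(4) for s in range(2) for a in range(5)]
counts = [data_rng.choice((1, 1, 1, 2)) for _ in keys]

m_first, _ = random_search(keys, counts, 1, random.Random(1))
m_best, b_star = random_search(keys, counts, 200, random.Random(1))
print(m_first, m_best)
```

Because both runs start from the same seed, the 200-iteration result can never be worse than the single draw. Stratified selection would apply the same draw within each stratum.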
Test experiment
• Control set M: all one- and two-way marginal counts generated from the eight variables spanning HC 06 (1985 cells).
• 10 000 runs are done.
  – For full HC 06 and for the PMDs only.
  – With stratified and unstratified sampling.
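Improvements in maximum deviation m by iterations in random search. Each column gives m for one cube and one set of stratification variables.

| Iter | Full hypercube: None | Full hypercube: GEO.L, SEX | Full hypercube: GEO.L, SEX, AGE.M, FST.H | PMDs only: None | PMDs only: GEO.L, SEX | PMDs only: GEO.L, SEX, AGE.M, FST.H |
|---|---|---|---|---|---|---|
| 1 | 263 | 149 | 198 | 68 | 62 | 57 |
| 10 | 186 | 145 | 154 | 62 | 56 | 52 |
| 100 | 153 | 133 | 123 | 50 | 50 | 40 |
| 1000 | 140 | 133 | 100 | 45 | 45 | 37 |
| 10000 | 133 | 121 | 100 | 43 | 41 | 35 |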
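To see how much each tenfold increase in iterations buys, the reduction in m between consecutive rows can be tabulated. The sketch only re-uses the figures from the table above; the column labels are shorthand introduced here.

```python
# m after 1, 10, 100, 1000 and 10000 iterations, copied from the table above.
m_by_column = {
    "full, none":                  [263, 186, 153, 140, 133],
    "full, GEO.L SEX":             [149, 145, 133, 133, 121],
    "full, GEO.L SEX AGE.M FST.H": [198, 154, 123, 100, 100],
    "PMDs, none":                  [68, 62, 50, 45, 43],
    "PMDs, GEO.L SEX":             [62, 56, 50, 45, 41],
    "PMDs, GEO.L SEX AGE.M FST.H": [57, 52, 40, 37, 35],
}
# Reduction in m for each tenfold increase in the number of iterations.
gains = {name: [a - b for a, b in zip(m, m[1:])] for name, m in m_by_column.items()}
for name, g in gains.items():
    print(f"{name:30s} {g}")
```

For the unstratified full hypercube the gains are 77, 33, 13 and 7; the other columns are less regular, since each is a single random run.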
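Percent deviations for largest absolute deviations, m, in the best solutions.

| Cube | m | True cell value | Percent deviation |
|---|---|---|---|
| Full HC | 133 | 208 553 | 0.064 |
| Full HC | 121 | 166 784 | 0.073 |
| Full HC | 100 | 64 481 | 0.155 |
| PMDs only | 43 | 275 346 | 0.016 |
| PMDs only | -41 | 7 464 | -0.549 |
| PMDs only | -35 | 85 383 | -0.041 |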
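The percent deviations are the signed deviation divided by the true marginal cell value, times 100. A quick recomputation from the figures above (variable names are illustrative):

```python
# (cube, signed deviation m, true marginal cell value), copied from the table above.
rows = [
    ("Full HC", 133, 208_553),
    ("Full HC", 121, 166_784),
    ("Full HC", 100, 64_481),
    ("PMDs only", 43, 275_346),
    ("PMDs only", -41, 7_464),
    ("PMDs only", -35, 85_383),
]
percent = [f"{100 * m / true_value:.3f}" for _, m, true_value in rows]
for (cube, m, true_value), p in zip(rows, percent):
    print(f"{cube:10s} {m:5d} {true_value:8d} {p:>7s}")
```

The largest relative deviation, about half a percent, occurs in the smallest of these marginal cells (7 464).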
Discussion
• The method is not yet fully approved for the Census HCs.
• Is the method sufficient to prevent any kind of disclosure?
• The reduction of the problem (A → B) is absolutely required to make the method work.
• Advantage:
  – Can produce consistent results with acceptable (?) aggregate deviations for a number of linked cubes of some size.
• Problems:
  – With random search the result is subject to chance.
  – Diminishing return from increasing the number of iterations.
  – We need to find better and more stable search engines.
  – Generalization to rounding bases of more than 3 will increase the deviations in aggregates.
Further work
• Try better sampling procedures (Balanced sampling?)
• Try Mixed Integer Linear Programming software.
• Extend the experiment to round more hypercubes jointly.
• An idea: Merge the reduced rounded cells back into microdata:
  – A method for perturbing some variables in relation to others.
  – How many variables must be perturbed this way to make all hypercubes safe?
  – Creates a microdata set that produces the rounded tables directly.
Thank you very much for your attention