Assessing the Impact of SDC Methods on Census Frequency Tables
Natalie Shlomo
Southampton Statistical Sciences Research Institute
University of Southampton
Topics:
• Introduction
• Disclosure risk
• SDC methods for protecting Census frequency tables
• Disclosure risk and data utility measures
• Description of table
• Risk-Utility analysis
• Summary of Analysis
• Discussion and future work
Introduction
• Disclosure risk in Census tables (identification, individual attribute disclosure):
• Need to protect many tables from one dataset containing population counts which can be linked and differenced
• Need to consider output strategies for standard tables and web-based table-generating applications
• Need to interact with users and develop an SDC framework with a focus on both disclosure risk and data utility
Disclosure Risk
For Census tables:
• 1’s and 2’s in cells are disclosive since these cells lead to identification
• 0’s may be disclosive if there are only a few non-zero cells in a row or column (attribute disclosure)
Consideration of disclosure risk:
• Threshold rules (minimum average cell size, ratio of small cells to zeros, etc.)
• Proportion of high-risk cells (1 or 2)
• Entropy (minimum of 0 if the distribution has one non-zero cell and all others zero, maximum of log K if all K cells are equal)
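The proportion of high-risk cells and the entropy bounds can be illustrated with a minimal Python sketch. The function names and the example counts are hypothetical, and the natural logarithm is assumed for the entropy.

```python
import math

def proportion_small_cells(counts):
    """Share of cells whose count is 1 or 2 (the high-risk cells)."""
    return sum(1 for c in counts if c in (1, 2)) / len(counts)

def entropy(counts):
    """Shannon entropy (natural log) of the distribution implied by the counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

# Hypothetical rows: one non-zero cell (entropy at its minimum)
# and four equal cells (entropy at its maximum, log K with K = 4)
one_cell_row = [0, 0, 7, 0]
uniform_row = [5, 5, 5, 5]
print(entropy(one_cell_row), entropy(uniform_row), math.log(4))
```

A row with low entropy concentrates its population in few cells, which is the situation where zeros become informative.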
SDC Methods for Protecting Frequency Tables
1. Pre-tabular methods (special case of PRAM)
Random Record Swapping
Targeted Record Swapping
In a Census context, geographical variables are typically swapped to avoid edit failures and minimize bias
Implementation:
• Randomly select p% of the households
• For each selected household, draw a household matching on a set of key variables (e.g. household size and broad sex-age distribution) and swap all geographical variables
• Can target records for swapping that are in high-risk cells of size 1 or 2
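The random swapping steps above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in any Census: the record layout (dicts with a single `area` field standing in for all geographical variables), the `key` function and the rule of swapping each household at most once are all hypothetical choices.

```python
import random

def random_record_swap(households, rate, key, geo="area", seed=0):
    """Swap the geography of a random share of households with a matched partner.

    households: list of dicts; key: function returning the matching key.
    Returns a new list; the input records are not modified.
    """
    rng = random.Random(seed)
    out = [dict(h) for h in households]
    selected = rng.sample(range(len(out)), round(rate * len(out)))
    swapped = set()
    for i in selected:
        if i in swapped:
            continue
        # Candidate partners: same key, different geography, not yet swapped
        partners = [j for j in range(len(out))
                    if j != i and j not in swapped
                    and key(out[j]) == key(out[i])
                    and out[j][geo] != out[i][geo]]
        if not partners:
            continue  # no match available: leave the record unchanged
        j = rng.choice(partners)
        out[i][geo], out[j][geo] = out[j][geo], out[i][geo]
        swapped.update((i, j))
    return out

# Hypothetical data: ten households in two areas, matched on household size
hh = [{"id": i, "area": "A" if i < 5 else "B", "size": 1 + i % 2} for i in range(10)]
swapped_hh = random_record_swap(hh, rate=0.4, key=lambda h: h["size"])
```

Because only geography is exchanged between matched households, the number of households per area and the distribution of the matching variables are unchanged. A targeted variant would restrict the selection step to households in cells of size 1 or 2.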
SDC Methods for Protecting Frequency Tables
2. Rounding
Unbiased random rounding
Entries are rounded up or down to a multiple of the rounding base depending on pre-defined probabilities and a stochastic draw
Example: For unbiased random rounding to base 3:
1 → 0 w.p. 2/3, 1 → 3 w.p. 1/3
2 → 0 w.p. 1/3, 2 → 3 w.p. 2/3
• Expected perturbation from the rounding is 0
• Margins and internal cells rounded separately
• Small cell rounding: internal cells aggregated to obtain margins
[Figure: Confidence Interval for Totals. x-axis: Number of Perturbed Cells (0 to 1000); y-axis: Interval of Error (-100 to 100)]
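A minimal Python sketch of unbiased random rounding, generalising the base 3 example: an entry with residual r (its remainder modulo the base b) is rounded up with probability r/b. The function names are hypothetical. The second helper gives the variance of the perturbation for a single entry, r(b - r), which follows directly from these probabilities.

```python
import random

def random_round(x, base=3, rng=random):
    """Unbiased random rounding of a non-negative integer count."""
    r = x % base
    if r == 0:
        return x  # multiples of the base are left unchanged
    # Round up with probability r/base, otherwise round down
    return x - r + base if rng.random() < r / base else x - r

def perturbation_variance(x, base=3):
    """Variance of (rounded - original) for one entry: r * (base - r)."""
    r = x % base
    return r * (base - r)

rng = random.Random(1)
rounded = [random_round(c, 3, rng) for c in [1, 2, 3, 4, 7, 0]]
```

For base 3 every perturbed entry has variance 2, so a total built from n independently rounded entries has variance 2n. This is why the error interval for totals widens as the number of perturbed cells grows.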
SDC Methods for Protecting Frequency Tables
2. Rounding (cont.)
Semi-controlled unbiased random rounding
Control the selection strategy for entries to round, i.e. use a “without replacement” strategy
Implementation:
- Calculate the expected number of entries to round up
- Draw a simple random sample without replacement (SRSWOR) of that size from among the entries and round these up; round the rest down
Can be carried out per row/column to ensure consistent totals on one dimension (key statistics)
Eliminates the extra variance in totals that results from the rounding
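The two implementation steps can be sketched in Python as below. Two details are assumptions made for the illustration rather than part of the slide: entries are grouped by their residual modulo the base, and the expected number to round up in each group is rounded to the nearest integer.

```python
import random

def semi_controlled_round(counts, base=3, seed=0):
    """Semi-controlled random rounding of a list of non-negative integer counts."""
    rng = random.Random(seed)
    rounded = [c - c % base for c in counts]  # start by rounding every entry down
    for r in range(1, base):
        idx = [i for i, c in enumerate(counts) if c % base == r]
        # Under unbiased random rounding each of these entries rounds up
        # with probability r/base, so the expected number rounded up is:
        n_up = round(len(idx) * r / base)  # assumption: nearest integer
        # SRSWOR of that size: these entries are rounded up instead of down
        for i in rng.sample(idx, n_up):
            rounded[i] += base
    return rounded

row = [1, 1, 1, 2, 2, 2, 3, 0]
print(sum(row), sum(semi_controlled_round(row)))
```

In this example the number of entries in each residual group is a multiple of the base, so the rounded row reproduces the original total exactly. Applying the function row by row would give consistent totals on that dimension.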
SDC Methods for Protecting Frequency Tables
2. Rounding (cont.)
Controlled rounding
Feature in Tau-Argus (Salazar-González, Bycroft and Staggemeier, 2005)
- Uses linear programming techniques to round entries up or down; results are similar to deterministic rounding
- All rounded entries add up to rounded margins
- Method is not unbiased and entries can jump a base
SDC Methods for Protecting Frequency Tables
3. Cell Suppression
Hypercube method (Giessing, 2004)
Feature in Tau-Argus and suited for large tables
Uses heuristic based on suppressing corners of a hypercube formed by the primary suppressed cell with optimality conditions
Imputing suppressed cells for utility evaluation:
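Replace each suppressed cell by the average information loss in its row/column, i.e. the known margin minus the total of the non-suppressed cells, divided by the number of suppressed cells.
Example: Two suppressed cells in a row and known margin is 500. The total of non-suppressed cells is 400. Each suppressed cell is replaced with a value of (500 - 400)/2 = 50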
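The simple imputation used for the utility evaluation can be written as a short Python sketch. Marking suppressed cells with `None` and the function name are hypothetical conventions chosen for the illustration.

```python
def impute_suppressed(row, margin):
    """Replace suppressed cells (None) by an equal share of the suppressed total."""
    known_total = sum(c for c in row if c is not None)
    n_suppressed = sum(1 for c in row if c is None)
    if n_suppressed == 0:
        return list(row)
    fill = (margin - known_total) / n_suppressed
    return [fill if c is None else c for c in row]

# The slide's example: margin 500, non-suppressed cells total 400, two suppressed cells
print(impute_suppressed([150, None, 250, None], margin=500))
```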
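Disclosure Risk Measures
Need to determine output strategies and SDC together
• Hard-copy tables, non-flexible categories and geographies: can control SDC methods to suit the tables
• Web-based tables and flexible categories and geographies: need to add noise or round for every query
Disclosure risk measures:
• Proportion of high-risk cells (C1 and C2) not protected:
$$DR_1 = \frac{\sum_{i \in C_1 \cup C_2} I(i \text{ not perturbed or imputed})}{|C_1 \cup C_2|}$$
• Percent true zeros out of total zeros:
$$DR_2 = \frac{|C_0^{orig}|}{|C_0^{orig} \cup C_0^{pert}|}$$
where $C_1$ and $C_2$ are the sets of cells of size 1 and 2, $C_0^{orig}$ and $C_0^{pert}$ are the sets of zero cells in the original and perturbed tables, and $I$ is the indicator function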
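A minimal Python sketch of the two risk measures on this slide: the share of original cells of size 1 or 2 that are left unprotected, and the original zeros as a share of all cells that are zero in either table. Tables are flattened to lists of counts, and "not perturbed" is approximated here by the published value equalling the original value, which is an assumption (a swapped cell can end up with the same count).

```python
def dr1(orig, pert):
    """Share of original cells of size 1 or 2 whose value is unchanged."""
    small = [i for i, c in enumerate(orig) if c in (1, 2)]
    return sum(1 for i in small if pert[i] == orig[i]) / len(small)

def dr2(orig, pert):
    """True (original) zeros as a share of all cells that are zero in either table."""
    zeros_orig = {i for i, c in enumerate(orig) if c == 0}
    zeros_pert = {i for i, c in enumerate(pert) if c == 0}
    return len(zeros_orig) / len(zeros_orig | zeros_pert)

# Hypothetical flattened tables before and after protection
orig = [0, 1, 2, 5, 0, 1]
pert = [0, 0, 3, 6, 0, 1]
print(dr1(orig, pert), dr2(orig, pert))
```

Lower values of both measures indicate more protection: fewer small cells published as they are, and more uncertainty about whether a published zero is a true zero.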
Utility Measures
• Distance metric - distortion to distributions (Gomatam and Karr, 2003):
Internal cells:
Let $D^k$ be a table for row $k$, $n_r$ the number of rows, and $D^k(c)$ the cell frequency for cell $c$:
$$HD(D_{pert}, D_{orig}) = \frac{1}{n_r} \sum_{k=1}^{n_r} \sqrt{\frac{1}{2} \sum_{c \in k} \left( \sqrt{D^k_{pert}(c)} - \sqrt{D^k_{orig}(c)} \right)^2}$$
Margins:
Let $M$ be the margin, $n_M$ the number of categories, and $N^l$ the number of persons in the $l$th category:
$$HDM(N_{pert}, N_{orig}) = \sqrt{\frac{1}{2} \sum_{l=1}^{n_M} \left( \sqrt{N^l_{pert}} - \sqrt{N^l_{orig}} \right)^2}$$
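The Hellinger distance measures for internal cells and margins can be sketched in Python as below, assuming tables are given as lists of rows of counts. The function names are hypothetical, and the distance is applied to raw counts (not proportions).

```python
import math

def hellinger(pert, orig):
    """Hellinger distance between two vectors of counts."""
    return math.sqrt(0.5 * sum((math.sqrt(p) - math.sqrt(o)) ** 2
                               for p, o in zip(pert, orig)))

def hd_internal(pert_table, orig_table):
    """Average over rows of the row-wise Hellinger distance."""
    distances = [hellinger(p, o) for p, o in zip(pert_table, orig_table)]
    return sum(distances) / len(distances)

def hd_margin(pert_margin, orig_margin):
    """Hellinger distance for a single margin."""
    return hellinger(pert_margin, orig_margin)

# Hypothetical 2 x 3 table before and after rounding to base 3
orig = [[1, 4, 7], [0, 2, 9]]
pert = [[0, 3, 6], [0, 3, 9]]
print(hd_internal(pert, orig), hd_margin([9, 12], [12, 11]))
```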
Utility Measures
• Impact on Tests for Independence:
Cramer’s V measure of association:
$$CV = \sqrt{\frac{\chi^2 / n}{\min(R - 1, C - 1)}}$$
where $\chi^2$ is the Pearson chi-square statistic, $n$ is the table total, and $R$ and $C$ are the numbers of rows and columns
The utility measure is:
$$RCV(D_{pert}, D_{orig}) = 100 \times \frac{CV(D_{pert}) - CV(D_{orig})}{CV(D_{orig})}$$
• Same utility measure for entropy and the Pearson chi-square statistics
• Impact on log linear analysis for multi-dimensional tables, i.e. deviance
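Cramer's V and the relative change in it can be computed with a short Python sketch. The function names are hypothetical, and the code assumes a two-way table given as a list of rows with no empty row or column.

```python
import math

def cramers_v(table):
    """Cramer's V for a two-way table of counts."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n  # expected count under independence
            if exp > 0:
                chi2 += (obs - exp) ** 2 / exp
    return math.sqrt((chi2 / n) / min(len(table) - 1, len(table[0]) - 1))

def rcv(pert, orig):
    """Percent relative difference in Cramer's V between perturbed and original."""
    return 100 * (cramers_v(pert) - cramers_v(orig)) / cramers_v(orig)

# Hypothetical tables: perfect association, then independence after perturbation
print(cramers_v([[10, 0], [0, 10]]), rcv([[5, 5], [5, 5]], [[10, 0], [0, 10]]))
```

A negative value indicates that the protected table understates the association (attenuation); a positive value indicates that it overstates it.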
Utility Measures
• “Between” Variance:
Let $P^k_{orig}(c) = \frac{D^k_{orig}(c)}{\sum_{c \in k} D^k_{orig}(c)}$ be a target proportion for a cell $c$ in row $k$,
and let $P_{orig}(c) = \frac{\sum_{k=1}^{n_r} D^k_{orig}(c)}{\sum_{k=1}^{n_r} \sum_{c \in k} D^k_{orig}(c)}$ be the overall proportion across all rows of the table
The “between” variance is defined as:
$$BV(P_{orig}(c)) = \frac{1}{n_r - 1} \sum_{k=1}^{n_r} \left( P^k_{orig}(c) - P_{orig}(c) \right)^2$$
and the utility measure is:
$$BVR(P_{pert}(c), P_{orig}(c)) = 100 \times \frac{BV(P_{pert}(c)) - BV(P_{orig}(c))}{BV(P_{orig}(c))}$$
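A minimal Python sketch of the "between" variance for one target column c, assuming the table is a list of rows of counts with no empty row. The function names and example tables are hypothetical.

```python
def between_variance(table, c):
    """Variance across rows of the proportion in column c around the overall proportion."""
    n_r = len(table)
    row_props = [row[c] / sum(row) for row in table]
    overall = sum(row[c] for row in table) / sum(sum(row) for row in table)
    return sum((p - overall) ** 2 for p in row_props) / (n_r - 1)

def bvr(pert, orig, c):
    """Percent relative difference in the between variance."""
    bv_orig = between_variance(orig, c)
    return 100 * (between_variance(pert, c) - bv_orig) / bv_orig

# Hypothetical tables: the perturbed rows sit closer to the overall proportion
orig = [[1, 3], [3, 1]]
pert = [[2, 2], [3, 1]]
print(between_variance(orig, 0), bvr(pert, orig, 0))
```

A negative value means that row proportions have moved towards the overall proportion, as happens when records are swapped across geographies.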
Utility Measures
• Variance of Cell Counts:
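The variance of the cell count for row k:
$$V(D^k_{orig}) = \frac{1}{n_k - 1} \sum_{c \in k} \left( D^k_{orig}(c) - \bar{D}^k_{orig} \right)^2$$
where $n_k$ is the number of columns and $\bar{D}^k_{orig}$ is the mean cell count in row $k$
The average variance across all rows:
$$V(D_{orig}) = \frac{1}{n_r} \sum_{k=1}^{n_r} V(D^k_{orig})$$
The utility measure is:
$$RDV(D_{pert}, D_{orig}) = 100 \times \frac{V(D_{pert}) - V(D_{orig})}{V(D_{orig})}$$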
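The variance of cell counts and its relative change can be sketched with Python's standard library: `statistics.variance` uses the n - 1 divisor, matching the row-level definition. The function names and example tables are hypothetical.

```python
import statistics

def average_row_variance(table):
    """Mean over rows of the sample variance of the cell counts in each row."""
    return sum(statistics.variance(row) for row in table) / len(table)

def rdv(pert, orig):
    """Percent relative difference in the average variance of cell counts."""
    v_orig = average_row_variance(orig)
    return 100 * (average_row_variance(pert) - v_orig) / v_orig

# Hypothetical tables: rounding to base 3 pushes counts apart ("sharper" counts)
orig = [[1, 4, 7], [2, 2, 8]]
pert = [[0, 3, 9], [3, 0, 9]]
print(average_row_variance(orig), rdv(pert, orig))
```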
Description of Table
• 2001 UK Census Table:
Rows: Output Areas (1,487)
Columns: Economic Activity (9) × Sex (2) × Long-Term Illness (2)
Table includes 317,064 persons aged 16 to 74 in 53,532 internal cells
Average cell size: 5.92, although the table is skewed
Number of zeros: 17,915 (33.5%)
Number of small cells: 14,726 (27.5%)
[Figure: two panels, “Percent Unperturbed Small Cells” and “Percent True Zeros”, each on a scale from 0 to 1. Categories shown: Original, 10% Random, 10% Target, 20% Random, 20% Target]
[Figure: two panels, “Hellinger's Distance Margins OAs” and “Hellinger's Distance Internal Cells”, each on a scale from 0 to 10]
[Figure: Difference in Cramer's V (Original=0.121), on a scale from -10 to 30]
[Figure: two panels. Difference in Variance of Cell Counts (Original=188.3), on a scale from -3 to 3; Difference in Between Variance (Original=0.00023), on a scale from -15 to 40]
Risk-Utility Map
[Figure: x-axis: HD (0 to 3.5); y-axis: Prop. true zeros (0.5 to 1). Plotted points: RR 3, T 20, R 20, R 10, T 10, RR 5, Sup]
Summary of Analysis
• Rounding eliminates small cells, but protection is still needed against disclosure by differencing and linking when random rounding is used
• Rounding adds more ambiguity into the zero counts
• Random rounding to base 5 has the greatest impact on distortions to distributions
• Semi-controlled rounding has almost no effect on distortions to internal cells but produces less distortion on marginal cells
• Full controlled rounding has less distortion to internal cells since it is similar to deterministic rounding
• Cell suppression with simple imputation method has highest utility (no perturbation on large cells) but difficult to implement in a Census
Summary of Analysis
• Record swapping leaves a high percentage of true small cells and less ambiguity in the zero cells
• Record swapping distorts internal cells less than rounding; the distortion increases with higher swapping rates
• Targeted swapping has more distortion on internal cells than random swapping but has less impact on marginal cells
• Column margins of the table have no distortion because of controls in swapping
• Combining record swapping with rounding results in more distortion but provides added protection
Summary of Analysis
• Record swapping across geographies attenuates:
- loss of association (moving towards independence)
- counts “flattening” out
- proportions moving to the overall proportion
• Attenuation increases with higher swapping rates
• Targeted record swapping has less attenuation than random swapping
• Rounding introduces more zeros:
- levels of association are higher
- cell counts “sharper”
• Effects less severe for controlled rounding
• Combining record swapping and rounding cancels out opposing effects, depending on the direction and magnitude of each procedure separately
Discussion
• Choice of SDC method depends on tolerable risk thresholds and demands for “fit for purpose” data
• Modifying and combining SDC methods (non-perturbative and perturbative methods) can produce higher utility, e.g. ABS developed microdata keys for consistency in rounding
• Dissemination of quality measures and guidance for carrying out statistical analysis on protected tables
• Future output strategies based on flexible table generating software. More need for research into disclosure risk by differencing and linking (collaboration with CS community)
• Safe setting, remote access and license agreements for highly disclosive Census outputs (sample microdata and origin-destination tables)