The Application for
Statistical Processing at
SURS
Andreja Smukavec, SURS
Rudi Seljak, SURS
UNECE Statistical Data Confidentiality Work Session
Helsinki, 5 – 7 October 2015
Old system
• Stove-pipe oriented production
– Ad-hoc solutions were developed for a
particular survey
• Survey methodologists' drive for
improvement was crucial
– “Our data are not confidential”
• Process metadata were not organized
– Difficulties when a survey methodologist
resigns
Renovation
• An internal project started in 2012
– IT, General Methodology and subject-matter
specialists
– Build a global solution appropriate for most of
the surveys
– A solution covering most parts of
statistical production:
• Data validation
• Data editing and imputation
• Aggregation and standard error estimation
• Statistical disclosure control for tabular data
• Tabulation
Renewed system
• Generalised metadata-driven application
– Database of process metadata
• MS Access -> ORACLE
• For each survey instance
– General SAS code
– GUI for process metadata
– Different microdata environments allowed,
just some basic rules for the structure of
microdata databases
• Ad hoc SAS program for preparation of
microdata
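A minimal sketch of the metadata-driven idea, written in Python rather than SAS for brevity. The rule table, the survey name DEMO_2015 and the variable names are hypothetical; the point is only that the general code stays the same while the rules are stored as process metadata per survey instance.

```python
# Hypothetical process metadata for one survey instance. In a real system these
# rows would be read from the process-metadata database, not hard-coded.
PROCESS_METADATA = [
    {"survey": "DEMO_2015", "rule_id": "V1",
     "check": lambda rec: rec["turnover"] >= 0,
     "message": "turnover must not be negative"},
    {"survey": "DEMO_2015", "rule_id": "V2",
     "check": lambda rec: rec["employees"] > 0 or rec["turnover"] == 0,
     "message": "turnover reported without employees"},
]


def validate(records, metadata, survey):
    """Run every rule registered for `survey` against every record."""
    rules = [m for m in metadata if m["survey"] == survey]
    failures = []
    for rec in records:
        for rule in rules:
            if not rule["check"](rec):
                failures.append((rec["id"], rule["rule_id"], rule["message"]))
    return failures


# Illustrative microdata (made-up values).
records = [
    {"id": 1, "turnover": 120.0, "employees": 4},
    {"id": 2, "turnover": -5.0, "employees": 2},
    {"id": 3, "turnover": 80.0, "employees": 0},
]
failures = validate(records, PROCESS_METADATA, "DEMO_2015")
```

Changing a rule then means editing a metadata row, not the general program.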
Schematic presentation of the
renewed system
[Diagram. Components shown: different microdata databases; ad-hoc programs; general SAS programs; database of process metadata; application for management; metadata repository (data on tables and variables); different kinds of output.]
Tabular data protection
1. Calculation of primary sensitivity for
seven types of statistics: number, total,
share, ratio, average…
– Threshold, p%-rule, (n,k)-dominance rule
– “Holding rule” + sampling weights
– Zeroes unsafe
2. Secondary suppression applied in case
of sensitive statistics (number and total)
– SAS-Tool (Excel file with metadata, Tau-Argus,
SAS macros)
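The three primary sensitivity rules named above can be sketched as follows, using their textbook definitions. This is an illustration in Python, not the SAS implementation; the parameter values (minimum count 3, p = 10, n = 1, k = 75) are assumptions chosen for the example, and the holding rule, sampling weights and the treatment of zeroes are not modelled.

```python
def threshold_unsafe(contribs, min_count=3):
    """Threshold rule: too few contributors in the cell."""
    return len(contribs) < min_count


def p_percent_unsafe(contribs, p=10.0):
    """p%-rule: the cell total minus the two largest contributions is
    smaller than p% of the largest contribution."""
    xs = sorted(contribs, reverse=True)
    if not xs:
        return False
    remainder = sum(xs[2:])
    return remainder < p / 100.0 * xs[0]


def dominance_unsafe(contribs, n=1, k=75.0):
    """(n,k)-dominance rule: the n largest contributions exceed k% of the
    cell total. Contributions are assumed to be non-negative."""
    xs = sorted(contribs, reverse=True)
    total = sum(xs)
    if total <= 0:
        return False
    return sum(xs[:n]) > k / 100.0 * total


def primary_unsafe(contribs):
    """A cell is flagged if any of the three rules fires."""
    return (threshold_unsafe(contribs)
            or p_percent_unsafe(contribs)
            or dominance_unsafe(contribs))


# Made-up contributions of individual units to a cell total.
dominated_cell = [100.0, 5.0, 3.0]
balanced_cell = [40.0, 35.0, 30.0, 25.0]
small_cell = [50.0, 50.0]
```

Here dominated_cell is flagged by the p%-rule and the dominance rule, small_cell by the threshold rule, and balanced_cell by none of them.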
Tabular data protection
• Results for each survey instance saved in
the database with statistics (ORACLE)
– Statuses for lower precision
– Confidentiality flags for the type of primary
and secondary suppression
• 3 types of tabulation (codelists)
– Excel format (the most user-friendly)
– plain text format (.tab, .hrc) for Tau-Argus
– plain text format (.csv) for PX-Edit (SURS’s
publication tool)
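As an illustration of the plain text codelist output, a hierarchical codelist can be written in the indented style used for Tau-Argus .hrc files. The sketch assumes "@" as the leading string that marks one level of nesting (the leading string is set in the Tau-Argus metadata) and uses a hypothetical two-level codelist; it does not reproduce the SAS code of the application.

```python
def hrc_lines(tree, lead="@", depth=0):
    """Flatten a nested {code: {child_code: {...}}} dict into .hrc-style
    lines, prefixing each code with one `lead` string per level."""
    lines = []
    for code, children in tree.items():
        lines.append(lead * depth + code)
        lines.extend(hrc_lines(children, lead, depth + 1))
    return lines


# Hypothetical codelist: codes 1 and 2 with their sub-codes.
codelist = {
    "1": {"11": {}},
    "2": {"21": {}, "22": {}, "23": {}, "24": {}},
}
hrc_text = "\n".join(hrc_lines(codelist))
```

The resulting text has one code per line, with sub-codes prefixed by "@".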
Tabulation & Tabular Data Protection
[Diagram. Components shown: different microdata databases; ad-hoc program; general SAS program (calculation of statistics); database with statistics; general SAS program (tabulation, tabular protection); database of process metadata; output tables.]
Parameters for SDC in MetaSOP
Tabulation in MetaSOP
Processing in MetaSOP
Example of 3-dimensional table
After aggregation

                Dim_3
CC_SI  Dim_2    TOT            F           O
TOT    TOT      1209943548     1.09E+09    1.23E+08
TOT    1        37700934.42    35625442    2075493
TOT    11       47110694.48    46417660    693034.1
TOT    2        733763444.2    6.62E+08    71456295
TOT    21       517712620.1    4.8E+08     37489998
TOT    22       161044502.5    1.1E+08     50837088
TOT    23       37903335.85    37783060    120275.8
TOT    24       343495995.1    2.86E+08    57438583
11     TOT      59283130.99    56199883    3083248
11     1        64428657.15    62453677    1974980
11     11       21989840.69    21609892    379948.2
11     2        69502173.33    67377101    2125073
11     21       13959568.67    13959569    -
11     22       338148.7639    338148.8    z
11     23       7911125.122    7911125     -
11     24       27886089.54    26016025    1870064
12     TOT      215349659.2    2.04E+08    11792968
12     1        5993635.356    5993635     -
12     11       2035728.954    2035729     -
12     2        55635358.28    54430511    1204847
12     21       146242216.3    1.43E+08    2783876
12     22       4164502.417    3872003     292499.2
12     23       38774447.75    34931862    3842585
12     24       42332750.72    37447112    4885639
21     TOT      176972728      1.76E+08    1323998
21     1        2248602.352    2248602     z
21     11       166013.5624    166013.6    z
21     2        372993785.9    3.69E+08    4134769
21     21       418831917.8    4.08E+08    10337323
21     22       29411096.08    29411096    z
21     23       56581.5975     56581.6     z
21     24       88244091.34    86483431    1760660
After use of SAS-Tool

                Dim_3
CC_SI  Dim_2    TOT            F           O
TOT    TOT      1209943548     1.09E+09    1.23E+08
TOT    1        37700934.42    35625442    2075493
TOT    11       47110694.48    46417660    693034.1
TOT    2        733763444.2    6.62E+08    71456295
TOT    21       517712620.1    4.8E+08     37489998
TOT    22       161044502.5    1.1E+08     50837088
TOT    23       37903335.85    37783060    120275.8
TOT    24       343495995.1    2.86E+08    57438583
11     TOT      59283130.99    56199883    3083248
11     1        64428657.15    z           z
11     11       21989840.69    z           z
11     2        69502173.33    z           z
11     21       13959568.67    13959569    -
11     22       338148.763     z           z
11     23       7911125.122    7911125     -
11     24       27886089.54    z           z
12     TOT      215349659.2    2.04E+08    11792968
12     1        5993635.356    5993635     -
12     11       2035728.954    2035729     -
12     2        55635358.28    54430511    1204847
12     21       146242216.3    1.43E+08    2783876
12     22       4164502.417    z           z
12     23       38774447.75    z           z
12     24       42332750.72    z           z
21     TOT      176972728      1.76E+08    1323998
21     1        z              z           z
21     11       z              z           z
21     2        z              z           z
21     21       418831917.8    4.08E+08    10337323
21     22       29411096.08    z           z
21     23       z              z           z
21     24       88244091.34    z           z
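In the protected table, cells are typically suppressed in pairs within a row. The reason can be sketched in a few lines: if a row total is published and only one component is suppressed, the hidden value follows by subtraction. The function and the figures below are hypothetical and only illustrate this check; they are not the algorithm used by the SAS-Tool or Tau-Argus.

```python
def recoverable_cells(row_total, cells):
    """Return the suppressed cell an intruder could recompute by subtraction.

    `cells` maps a column code to its published value, or to None when the
    cell is suppressed. Recovery is only possible when exactly one cell of
    the row is hidden and the row total is published.
    """
    hidden = [code for code, value in cells.items() if value is None]
    if len(hidden) != 1:
        return {}
    known = sum(value for value in cells.values() if value is not None)
    return {hidden[0]: row_total - known}


# Made-up row: total 100, column F suppressed, column O published.
leaky_row = recoverable_cells(100.0, {"F": None, "O": 30.0})
# Both components suppressed: nothing can be derived from the total alone.
protected_row = recoverable_cells(100.0, {"F": None, "O": None})
```

With only F suppressed, its value is recovered exactly, which is why a second cell in the same row has to be suppressed as well.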
New organization
• Old system:
– Every survey had its own programmer and its
own general methodologist
• Renewed system:
– General methodologist and IT expert
(“support team”) help the subject-matter
specialist to
• insert and edit the process metadata (except for
SDC) into the application
• run particular parts of the statistical process
Advantages
• The subject-matter personnel's skills
improve (higher quality of data)
• The process metadata can be changed
easily and the procedure can be repeated
in short time (flexibility)
• The rules for data processing are gathered
in one place (transparency)
Drawbacks
• High risk of syntax errors when
inserting metadata expressions
• Subject-matter personnel have to learn
some new skills (SAS expressions)
• An error during execution can cause
problems if the support team is busy or not
available
Challenges for the future
• Introduce the application successfully into
production
– Subject-matter specialists adjusting to the
changes
– Building a qualified support team
• Adding new functionalities
– Indices
– Secondary suppression for other types of
statistics
– GUI instead of the Excel file for the SAS-Tool
Thank you for your attention.