Anatomy: Simple and Effective Privacy Preservation
Xiaokui Xiao, Yufei Tao
Chinese University of Hong Kong
Privacy preserving data publishing
Microdata
• Purposes:
  – Allow researchers to effectively study the correlation between various attributes
  – Protect the privacy of every patient

Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis
A naïve solution
• It does not work. See next.
Publish the microdata with the Name column removed:

Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis
Inference attack
• An adversary knows that Bob
  – has been hospitalized before
  – is 23 years old
  – lives in an area with zipcode 11000

Published table (Age, Sex and Zipcode are the quasi-identifier (QI) attributes):

Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

• Only the first row matches what the adversary knows, so the adversary learns that Bob has contracted pneumonia
Generalization
• Transform each QI value into a less specific form

A generalized table:

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

The adversary's knowledge about Bob:

Name  Age  Sex  Zipcode
Bob   23   M    11000
How much generalization do we need?
l-diversity
• A QI-group with m tuples is l-diverse, iff each sensitive value appears no more than m / l times in the QI-group.
• A table is l-diverse, iff all of its QI-groups are l-diverse.
• The table below is 2-diverse. It has 2 QI-groups; Age, Sex and Zipcode are the quasi-identifier (QI) attributes, and Disease is the sensitive attribute.

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis
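The l-diversity definition above can be checked mechanically. The following is a minimal sketch, assuming each QI-group is given simply as the list of its sensitive values; the function name `is_l_diverse` is made up for this illustration and does not come from the talk.

```python
from collections import Counter

def is_l_diverse(groups, l):
    """Return True if every QI-group satisfies the definition on the slide:
    in a group of m tuples, no sensitive value appears more than m / l times."""
    for sensitive_values in groups:
        m = len(sensitive_values)
        counts = Counter(sensitive_values)
        # Compare count * l with m to avoid floating-point division.
        if any(count * l > m for count in counts.values()):
            return False
    return True

# The two QI-groups of the generalized table in the running example.
groups = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "gastritis", "flu", "bronchitis"],
]
print(is_l_diverse(groups, 2))
print(is_l_diverse(groups, 3))
```

With l = 2 every value appears at most 4 / 2 = 2 times in its group, so the check passes; with l = 3 it fails because pneumonia appears twice in a group of four.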
What l-diversity guarantees
• From an l-diverse generalized table, an adversary (without any prior knowledge) can infer the sensitive value of each individual with confidence at most 1/l
A 2-diverse generalized table:

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

Name  Age  Sex  Zipcode
Bob   23   M    11000

A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006.
Defect of generalization
• Query A: SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30]
  AND Zipcode in [10001, 20000]

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

• Estimated answer: 2 * p, where p is the probability that each of the two pneumonia tuples (both in the first QI-group) satisfies the query conditions on Age and Zipcode
Defect of generalization (cont.)
• Query A: SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30]
  AND Zipcode in [10001, 20000]
• p = Area( R1 ∩ Q ) / Area( R1 ) = 0.05
• Estimated answer for query A: 2 * p = 0.1
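Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  pneumonia

[Figure: the rectangle R1 covered by the two generalized tuples and the query region Q, plotted over Age (x-axis, 20 to 70) and Zipcode (y-axis, 10k to 60k)]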
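The value p = 0.05 can be reproduced with a short calculation. This is a sketch under two assumptions that the slides do not state explicitly: values are uniformly distributed inside the generalized rectangle R1, and each interval is treated as an inclusive range of integers (which is what yields exactly 0.05). The helper name `overlap_fraction` is hypothetical.

```python
from fractions import Fraction

def overlap_fraction(group_range, query_range):
    """Fraction of the integer values in group_range that also lie in query_range.
    Both ranges are inclusive (low, high) pairs."""
    g_lo, g_hi = group_range
    q_lo, q_hi = query_range
    overlap = max(0, min(g_hi, q_hi) - max(g_lo, q_lo) + 1)
    return Fraction(overlap, g_hi - g_lo + 1)

# R1: Age in [21, 60], Zipcode in [10001, 60000]
# Q:  Age in [0, 30],  Zipcode in [10001, 20000]
p_age = overlap_fraction((21, 60), (0, 30))                # 10 of 40 ages
p_zip = overlap_fraction((10001, 60000), (10001, 20000))   # 10000 of 50000 zipcodes
p = p_age * p_zip
estimate = 2 * p  # two pneumonia tuples in the group

print(float(p), float(estimate))
```

Under these assumptions the script gives p = 1/20 and an estimate of 1/10, matching the figures on the slide.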
Defect of generalization (cont.)
• Query A: SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30]
  AND Zipcode in [10001, 20000]
• Estimated answer from the generalized table: 0.1

Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis

• The exact answer is 1: only Bob satisfies all three conditions
Research Works on Generalization
1. V. S. Iyengar. Transforming data to satisfy privacy constraints. KDD 2002.
2. K. Wang, P. S. Yu and S. Chakraborty. Bottom-Up Generalization: A Data Mining Solution to Privacy Protection. ICDM 2004.
3. R. J. Bayardo Jr. and R. Agrawal. Data Privacy through Optimal k-Anonymization. ICDE 2005.
4. B. C. M. Fung, K. Wang and P. S. Yu. Top-Down Specialization for Information and Privacy Preservation. ICDE 2005.
5. K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Incognito: Efficient Full-Domain K-Anonymity. SIGMOD 2005.
6. K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Mondrian Multidimensional K-Anonymity. ICDE 2006.
7. D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. SIGMOD 2006.
8. X. Xiao and Y. Tao. Personalized privacy preservation. SIGMOD 2006.
9. K. Wang and B. C. M. Fung. Anonymization for Sequential Releases. KDD 2006.
10. K. LeFevre, D. DeWitt and R. Ramakrishnan. Workload-Aware Anonymization. KDD 2006.
11. J. Xu, W. Wang, J. Pei, et al. Utility-Based Anonymization Using Local Recodings. KDD 2006.
12. …
Contributions
1. We propose an alternative to generalization, called Anatomy, which allows much more accurate data analysis while still preserving privacy.
2. We develop an algorithm for computing anatomized tables that
   • runs in linear I/Os
   • (nearly) minimizes information loss
Outline
• Basic Idea of Anatomy
• Preserving Correlation
• Algorithm for Anatomy
• Experimental Results
Basic Idea of Anatomy
• For a given microdata table, Anatomy releases a quasi-identifier table (QIT) and a sensitive table (ST)
microdata:

Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

Quasi-identifier Table (QIT):

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

Sensitive Table (ST):

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1
Basic Idea of Anatomy (cont.)
1. Select a partition of the tuples (here, a 2-diverse partition)

            Age  Sex  Zipcode  Disease
QI group 1  23   M    11000    pneumonia
            27   M    13000    dyspepsia
            35   M    59000    dyspepsia
            59   M    12000    pneumonia
QI group 2  61   F    54000    flu
            65   F    25000    gastritis
            65   F    25000    flu
            70   F    30000    bronchitis
Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition

         quasi-identifier table (QIT)   sensitive table (ST)
         Age  Sex  Zipcode              Disease
group 1  23   M    11000                pneumonia
         27   M    13000                dyspepsia
         35   M    59000                dyspepsia
         59   M    12000                pneumonia
group 2  61   F    54000                flu
         65   F    25000                gastritis
         65   F    25000                flu
         70   F    30000                bronchitis
Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition

quasi-identifier table (QIT):

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST):

Group-ID  Disease
1         pneumonia
1         dyspepsia
1         dyspepsia
1         pneumonia
2         flu
2         gastritis
2         flu
2         bronchitis
Basic Idea of Anatomy (cont.)
2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition

quasi-identifier table (QIT):

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST):

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1
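The two steps above (pick a partition, then split it into a QIT and an ST) can be sketched in a few lines. This is an illustrative sketch only, assuming the partition is already given as a list of groups of (age, sex, zipcode, disease) tuples; the function name `build_qit_and_st` is made up here and is not the authors' implementation.

```python
from collections import Counter

def build_qit_and_st(partition):
    """Split a partition into a quasi-identifier table and a sensitive table.
    partition: list of QI-groups, each a list of (age, sex, zipcode, disease)."""
    qit, st = [], []
    for group_id, group in enumerate(partition, start=1):
        # QIT: exact QI values plus the group the tuple belongs to.
        for age, sex, zipcode, _disease in group:
            qit.append((age, sex, zipcode, group_id))
        # ST: per-group counts of each sensitive value.
        counts = Counter(row[3] for row in group)
        for disease in sorted(counts):
            st.append((group_id, disease, counts[disease]))
    return qit, st

partition = [
    [(23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
     (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia")],
    [(61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
     (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis")],
]
qit, st = build_qit_and_st(partition)
for row in st:
    print(row)
```

On the running example this reproduces the five ST rows shown on the slide, and the QIT keeps every Age, Sex and Zipcode value exactly as in the microdata.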
Privacy Preservation
• From a pair of QIT and ST generated from an l-diverse partition, the adversary can infer the sensitive value of each individual with confidence at most 1/l
quasi-identifier table (QIT):

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST):

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

The adversary's knowledge about Bob:

Name  Age  Sex  Zipcode
Bob   23   M    11000
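To see where the 1/l bound comes from, one can replay the adversary's reasoning: look up Bob's group in the QIT, then read the disease distribution of that group from the ST. The sketch below assumes the adversary has no other prior knowledge and treats every tuple in the group as equally likely to be Bob's; `disease_probabilities` is a hypothetical helper name.

```python
from fractions import Fraction

QIT = [
    (23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2), (65, "F", 25000, 2), (70, "F", 30000, 2),
]
ST = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]

def disease_probabilities(qi_values, qit, st):
    """Adversary's belief about the sensitive value of an individual whose
    QI values are qi_values, derived only from the published QIT and ST."""
    group_ids = {row[3] for row in qit if row[:3] == qi_values}
    if len(group_ids) != 1:
        raise ValueError("expected the QI values to match exactly one group")
    (group_id,) = group_ids
    group_size = sum(count for gid, _, count in st if gid == group_id)
    return {disease: Fraction(count, group_size)
            for gid, disease, count in st if gid == group_id}

bob = disease_probabilities((23, "M", 11000), QIT, ST)
print(bob)
```

For Bob the result assigns probability 1/2 to both dyspepsia and pneumonia, which equals the 1/l bound for l = 2.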
Accuracy of Data Analysis
• Query A: SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30]
  AND Zipcode in [10001, 20000]

quasi-identifier table (QIT):

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST):

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1
Accuracy of Data Analysis (cont.)
• Query A: SELECT COUNT(*) FROM Unknown-Microdata
  WHERE Disease = 'pneumonia' AND Age in [0, 30]
  AND Zipcode in [10001, 20000]
• According to the ST, 2 patients in group 1 have contracted pneumonia
• According to the QIT, 2 out of the 4 patients in group 1 satisfy the query conditions on Age and Zipcode
• Estimated answer for query A: 2 * 2 / 4 = 1, which is also the actual result from the original microdata
    Age  Sex  Zipcode  Group-ID
t1  23   M    11000    1
t2  27   M    13000    1
t3  35   M    59000    1
t4  59   M    12000    1

[Figure: tuples t1 to t4 plotted as points together with the query region Q; x-axis: Age (20 to 70), y-axis: Zipcode (10k to 60k)]
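The estimate for Query A can be computed directly from the QIT and ST. The sketch below generalizes the slide's arithmetic to all groups by summing, per group, (count of the disease in the ST) times (fraction of the group's QIT rows that satisfy the QI conditions). That per-group summation and the function name `estimate_count` are assumptions of this illustration.

```python
QIT = [
    (23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2), (65, "F", 25000, 2), (70, "F", 30000, 2),
]
ST = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]

def estimate_count(qit, st, disease, age_range, zip_range):
    """Estimate COUNT(*) for a query on Disease, Age and Zipcode using only
    the anatomized tables. Ranges are inclusive (low, high) pairs."""
    estimate = 0.0
    for group_id in sorted({row[3] for row in qit}):
        rows = [row for row in qit if row[3] == group_id]
        matching = sum(
            1 for age, _sex, zipcode, _gid in rows
            if age_range[0] <= age <= age_range[1]
            and zip_range[0] <= zipcode <= zip_range[1]
        )
        disease_count = sum(c for gid, d, c in st if gid == group_id and d == disease)
        estimate += disease_count * matching / len(rows)
    return estimate

answer = estimate_count(QIT, ST, "pneumonia", (0, 30), (10001, 20000))
print(answer)
```

Group 1 contributes 2 * 2 / 4 = 1 and group 2 contributes nothing because it contains no pneumonia tuples, so the estimate equals the exact answer.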
Preserving Correlation
• Let us first examine the correlation between Age and Disease in our running example
• Each tuple in the microdata can be mapped to a point in the (Age, Disease) domain
• For example, tuple t1 below can be mapped to (23, pneumonia).

    Age  Sex  Zipcode  Disease
t1  23   M    11000    pneumonia
    …    …    …        …
Preserving Correlation (cont.)
• We model this tuple using a probability density function (pdf):
[Figure: pdf of the tuple over the (Age, Disease) domain; axes: Age (20 to 60), Disease (dyspepsia, pneumonia), probability (0 to 1)]
Preserving Correlation (cont.)
• In the generalized table, the tuple becomes:

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
…         …    …               …

• Its corresponding pdf becomes:

[Figure: pdf of the generalized tuple over the (Age, Disease) domain; axes: Age (20 to 60), Disease (dyspepsia, pneumonia), probability (0 to 1)]
Preserving Correlation (cont.)
• In the anatomized tables, the tuple becomes:

Age  Sex  Zipcode  Group-ID
23   M    11000    1
…    …    …        …

Group-ID  Disease    Count
1         dyspepsia  2
1         pneumonia  2
…         …          …

• Its corresponding pdf becomes:

[Figure: pdf of the anatomized tuple over the (Age, Disease) domain; axes: Age (20 to 60), Disease (dyspepsia, pneumonia), probability (0 to 1)]
Preserving Correlation (cont.)
[Figure: three pdf plots over the (Age, Disease) domain shown together for comparison; axes in each plot: Age (20 to 60), Disease (dyspepsia, pneumonia), probability (0 to 1)]
Quality Metric
[Figure: two pdf plots over the (Age, Disease) domain, labeled "the original pdf" and "the approximated pdf"; axes in each plot: Age (20 to 60), Disease (dyspepsia, pneumonia), probability (0 to 1)]

• For each approximated pdf, we measure its error from the original pdf by their "L2 distance"
• We aim at obtaining anatomized tables that minimize the re-construction error (RCE), the total of these errors over all tuples
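The error measure can be illustrated on the running example. The sketch below rests on assumptions that the transcript does not spell out: because Anatomy publishes exact QI values, the approximated pdf of a tuple is taken to differ from the original only along the sensitive attribute, so the L2 distance reduces to a sum of squared differences over the sensitive values in the tuple's group, and the RCE is the sum of these per-tuple errors. The function names are made up for this illustration.

```python
from collections import Counter

def tuple_error(own_value, group_values):
    """Squared L2 distance between the original pdf (all mass on own_value)
    and the anatomized pdf (mass count / m on each value in the group)."""
    m = len(group_values)
    counts = Counter(group_values)
    error = 0.0
    for value, count in counts.items():
        original = 1.0 if value == own_value else 0.0
        approximated = count / m
        error += (approximated - original) ** 2
    return error

def reconstruction_error(groups):
    """Sum of the per-tuple errors over all groups of the partition."""
    return sum(tuple_error(value, group) for group in groups for value in group)

groups = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "gastritis", "flu", "bronchitis"],
]
err_t1 = tuple_error("pneumonia", groups[0])
rce = reconstruction_error(groups)
print(err_t1, rce)
```

For tuple t1 the error is (1 - 0.5)^2 + 0.5^2 = 0.5, and the whole 2-diverse partition of the running example has a total error of 4.5 under these assumptions.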
Anatomize
• An algorithm for computing anatomized tables that
  – runs in I/O cost linear in the cardinality n of the microdata table
  – minimizes the RCE when n is a multiple of l, and otherwise achieves an RCE that exceeds the lower bound by a factor of at most 1 + 1/n
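The slides state the guarantees of Anatomize but not its steps. As a rough illustration of how an l-diverse partition could be built, the sketch below uses a greedy bucket strategy: hash tuples by sensitive value, repeatedly form a group from one tuple of each of the l largest buckets, then attach leftover tuples to groups that do not yet contain their value. This is an in-memory simplification written for this note, not the authors' I/O-efficient implementation, and the function name `build_l_diverse_partition` is hypothetical.

```python
from collections import defaultdict

def build_l_diverse_partition(tuples, l):
    """Greedy sketch. tuples: list of (qi_values, sensitive_value) pairs.
    Returns a list of groups in which every sensitive value is distinct."""
    buckets = defaultdict(list)
    for t in tuples:
        buckets[t[1]].append(t)

    groups = []
    # Group creation: while at least l buckets are non-empty, take one
    # tuple from each of the l currently largest buckets.
    while True:
        non_empty = [v for v in buckets if buckets[v]]
        if len(non_empty) < l:
            break
        largest = sorted(non_empty, key=lambda v: len(buckets[v]), reverse=True)[:l]
        groups.append([buckets[v].pop() for v in largest])

    # Residue assignment: place each leftover tuple into a group that does
    # not already contain its sensitive value.
    for value in list(buckets):
        while buckets[value]:
            t = buckets[value].pop()
            target = next((g for g in groups if all(u[1] != value for u in g)), None)
            if target is None:
                raise ValueError("no l-diverse partition found by this sketch")
            target.append(t)
    return groups

microdata = [
    ((23, "M", 11000), "pneumonia"), ((27, "M", 13000), "dyspepsia"),
    ((35, "M", 59000), "dyspepsia"), ((59, "M", 12000), "pneumonia"),
    ((61, "F", 54000), "flu"), ((65, "F", 25000), "gastritis"),
    ((65, "F", 25000), "flu"), ((70, "F", 30000), "bronchitis"),
]
groups = build_l_diverse_partition(microdata, 2)
for g in groups:
    print([disease for _qi, disease in g])
```

Every group produced this way has at least l tuples with pairwise distinct sensitive values, so it is l-diverse by the definition given earlier. The partition it finds for the running example differs from the one on the slides, which is fine: many l-diverse partitions exist.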
Outline
• Rationale of Anatomy
• Preserving Correlation
• Algorithm for Anatomy
• Experimental Results
Experimental Settings
• Goal: to compare the accuracy of data analysis on the generalized / anatomized tables.
• Real dataset with 9 attributes:
  – Age, Gender, Education, Marital-status, Race, Work-class, Country
  – Occupation, Salary-class
• OCC-d, SAL-d (d = 3, 4, 5, 6, 7)
  – OCC-3: Age, Gender, Education, Occupation
  – SAL-4: Age, Gender, Education, Marital-status, Salary-class
• Cardinality: 100k, 200k, 300k, 400k, 500k
Experimental Settings (cont.)
• competitor: multi-dimensional generalization
• l = 10
• avg. relative error for 10000 aggregate queries: |act - est| / act
• query dimensionality qd = 1, 2, …, d
• expected selectivity s = 1%, …, 5%, …, 10%
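The accuracy metric is simple enough to state in code. The sketch below averages |act - est| / act over a workload; skipping queries whose actual answer is 0 is an assumption made here to avoid division by zero, and the function name is hypothetical. The sample numbers are the Query A results from the earlier slides, not experimental data.

```python
def average_relative_error(actual, estimated):
    """Mean of |act - est| / act over all queries with a non-zero actual answer."""
    errors = [abs(act - est) / act
              for act, est in zip(actual, estimated) if act != 0]
    return sum(errors) / len(errors)

# Query A from the running example: exact answer 1,
# generalization estimate 0.1, anatomy estimate 1.
print(average_relative_error([1], [0.1]))
print(average_relative_error([1], [1.0]))
```

On Query A alone, the generalized table gives a relative error of about 0.9, while the anatomized tables give 0.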
Accuracy of Data Analysis (cont.)
C.C. Aggarwal. On k-anonymity and the curse of dimensionality. VLDB 2005
Computation Overhead
Summary
• Anatomy outperforms generalization by allowing much more accurate data analysis on the published data.
• Anatomized tables (with a nearly optimal quality guarantee) can be computed in I/O cost linear in the database cardinality.
Thank you!
Datasets and implementation are available for download at
http://www.cse.cuhk.edu.hk/~taoyf
Anatomy vs. Generalization Revisit
• Sometimes the adversary is not sure whether an individual appears in the microdata or not
A 2-diverse generalized table:

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

A Voter Registration List:

Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Mark   40   M    30000
Ric    50   M    40000
Sam    59   M    12000
…      …    …    …
Anatomy vs. Generalization Revisit
• From the adversary's perspective:
  – Bob has 4 / 6 probability to be in the microdata (the first QI-group has 4 tuples, while 6 voters match its generalized QI values)
  – If Bob indeed appears in the microdata, there is 2 / 4 probability that he has contracted pneumonia
  – So Bob has 4/6 * 2/4 = 1/3 probability to have contracted pneumonia

A 2-diverse generalized table:

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
…         …    …               …

A Voter Registration List:

Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Mark   40   M    30000
Ric    50   M    40000
Sam    59   M    12000
…      …    …    …
Anatomy vs. Generalization Revisit
• The adversary knows that
  – Bob must appear in the microdata, since his exact QI values are in the QIT
  – There is 1/2 probability that Bob has contracted pneumonia

2-diverse QIT:

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
…    …    …        …

2-diverse ST:

Group-ID  Disease    Count
1         dyspepsia  2
1         pneumonia  2
…         …          …

A Voter Registration List:

Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Mark   40   M    30000
Ric    50   M    40000
Sam    59   M    12000
…      …    …    …
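The two adversary calculations on these slides can be put side by side. The sketch below uses only the counts visible in the example (6 voters matching the first QI-group, 4 tuples in that group, 2 of them with pneumonia); the variable names are illustrative.

```python
from fractions import Fraction

voters_matching_group = 6   # Bob, Ken, Peter, Mark, Ric, Sam
tuples_in_group = 4         # size of the first QI-group
pneumonia_in_group = 2      # pneumonia tuples in that group

# Generalization: the adversary cannot tell which 4 of the 6 voters are present.
p_present_generalization = Fraction(tuples_in_group, voters_matching_group)
p_pneumonia_if_present = Fraction(pneumonia_in_group, tuples_in_group)
p_generalization = p_present_generalization * p_pneumonia_if_present

# Anatomy: Bob's exact QI values appear in the QIT, so his presence is certain.
p_present_anatomy = Fraction(1)
p_anatomy = p_present_anatomy * p_pneumonia_if_present

print(p_generalization, p_anatomy)
```

The generalized table leaves the adversary at 1/3, while the anatomized tables leave the adversary at 1/2, which is still within the 1/l bound for l = 2.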
Anatomy vs. Generalization Revisit
• For a given value of l, l-diverse generalization may lead to higher privacy protection than l-diverse anatomy does.
• But this is not always the case, since:
  – the external database may not contain any irrelevant individuals
  – the adversary may know that some individuals indeed appear in the microdata

Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Mark   40   M    30000
Ric    50   M    40000
Sam    59   M    12000
…      …    …    …