Privacy in Data Systems
Rakesh Agrawal, IBM Almaden Research Center
Joint work with Srikant, Kiernan & Xu
Theme
- Increasing need for information systems that
  – protect the privacy and ownership of information
  – do not impede the flow of information
- Resolving the apparent contradiction in the above statement is a major research challenge and opportunity.
Drivers
- Policies and Legislation
  – U.S. and international regulations
  – Legal proceedings against businesses
- Consumer Concerns
  – "Consumer privacy apprehensions continue to plague the Web … these fears will hold back roughly $15 billion in e-Commerce revenue." Forrester Research, 2001
  – Most consumers are "privacy pragmatists." Westin Surveys
- Moral Imperative
  – The right to privacy: the most cherished of human freedoms. -- Warren & Brandeis, 1890
Privacy Is Headline News
- "Privacy #1 issue in the 21st Century" - Wall Street Journal, January 24, 2000
- "Anyone today who thinks the privacy issue has peaked is greatly mistaken … we are in the early stages of a sweeping change in attitudes that will put once-routine business practices under the microscope." - Forrester Research, March 5, 2001
Outline
- Privacy Preserving Data Mining
  – How to have your cake and mine it too!
  – Assuming no trusted third party.
- Hippocratic Databases
  – What I may see or hear in the course of treatment … I will keep to myself. - Hippocratic Oath, 8 (circa 400 BC)
- Other related topics
  – Privacy Cognizant Information Integration
  – Database support for P3P
  – Watermarking & fingerprinting
Data Mining and Privacy
- The primary task in data mining: development of models about aggregated data.
- Can we still develop accurate models, while protecting the privacy of individual records?
Approaches
- Randomization
- Cryptographic
- Statistical disclosure control
Randomization Overview
- (Figure, animated over several frames.) Alice, Bob, and Chris each hold a record of the form age | salary | favorite artist | hobby | site:
  – Alice: 35 | 80,000 | J.S. Bach | painting | nasa
  – Bob: 45 | 60,000 | B. Spears | baseball | cnn
  – Chris: 42 | 85,000 | B. Marley | camping | microsoft
- Without privacy protection, the records are sent as-is to a Recommendation Service, where a Mining Algorithm builds a Data Mining Model.
- With randomization, each user perturbs the record before sending it:
  – Alice sends 50 | 65,000 | Metallica | painting | nasa
  – Bob sends 38 | 90,000 | B. Spears | soccer | fox
  – Chris sends 32 | 55,000 | B. Marley | camping | linuxware
- The service adds a Recovery step that reconstructs aggregate distributions from the randomized records before the Mining Algorithm builds the Data Mining Model.
Inducing Classifiers over Privacy-Preserved Numeric Data
- (Figure.) Each client randomizes its record before sending it: Alice's record 30 | 25K | … passes through a Randomizer and her age 30 becomes 65 (30 + 35); John's record 50 | 40K | … is randomized to 35 | 60K | ….
- The server reconstructs the Age distribution and the Salary distribution from the randomized values, then runs the Decision Tree Algorithm on the reconstructed distributions to produce the Model.
Reconstruction Problem
- Original values x1, x2, ..., xn
  – drawn from probability distribution X (unknown)
- To hide these values, we use y1, y2, ..., yn
  – drawn from probability distribution Y
- Given
  – x1+y1, x2+y2, ..., xn+yn
  – the probability distribution of Y
- Estimate the probability distribution of X.
Reconstruction Algorithm
- f_X^0 := Uniform distribution
- j := 0
- repeat
  – update f_X^{j+1}(a) by Bayes' rule:
$$f_X^{j+1}(a) = \frac{1}{n}\sum_{i=1}^{n}\frac{f_Y((x_i+y_i)-a)\,f_X^{j}(a)}{\int f_Y((x_i+y_i)-z)\,f_X^{j}(z)\,dz}$$
  – j := j+1
- until (stopping criterion met)
(R. Agrawal & R. Srikant, SIGMOD 2000)
- Converges to the maximum likelihood estimate (D. Agrawal & C.C. Aggarwal, PODS 2001).
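A discretized version of this iterative scheme can be sketched in a few lines of Python; the function and parameter names are illustrative, not from the paper:

```python
import numpy as np

def reconstruct_distribution(w, noise_pdf, bins, n_iters=30):
    """Iterative Bayes reconstruction over a discretized domain.

    w         : randomized observations w_i = x_i + y_i
    noise_pdf : density f_Y of the noise, as a vectorized callable
    bins      : grid of candidate values a for X
    Returns estimated probabilities of X over `bins`.
    """
    fX = np.full(len(bins), 1.0 / len(bins))      # f_X^0 := uniform
    for _ in range(n_iters):
        new = np.zeros_like(fX)
        for wi in w:
            lik = noise_pdf(wi - bins) * fX       # f_Y(w_i - a) * f_X^j(a)
            total = lik.sum()                     # discretized denominator
            if total > 0:
                new += lik / total                # posterior for this point
        fX = new / len(w)                         # average over all points
    return fX
```

Each pass applies Bayes' rule per observation and averages the resulting posteriors, mirroring the update rule above; the loop stops after a fixed number of iterations instead of a convergence test.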
Estimate of a Single Point
- Use Bayes' rule for density functions.
- (Figure: Age axis from 10 to 90. Given a randomized value V and the original distribution for Age, the rule yields a probabilistic estimate of the original value of V.)
Estimate of the Distribution
- Combine the estimates of where each point came from, over all the points: this yields an estimate of the original distribution. (Figure: Age axis from 10 to 90.)
Works Well
- (Figure: Number of People vs. Age, roughly 20 to 60, counts 0 to 1200. The Reconstructed distribution closely tracks the Original, while the Randomized distribution does not.)
Decision Tree Example

Age | Salary | Repeat Visitor?
23 | 50K | Repeat
17 | 30K | Repeat
43 | 40K | Repeat
68 | 50K | Single
32 | 70K | Single
20 | 20K | Repeat

Resulting tree: Age < 25? Yes → Repeat. No → Salary < 50K? Yes → Repeat; No → Single.
Algorithms
- Global: reconstruct each attribute once at the beginning.
- By Class: for each attribute, first split by class, then reconstruct separately for each class.
- Local: reconstruct at each node.
- See the SIGMOD 2000 paper for details.
Experimental Methodology
- Compare accuracy against
  – Original: unperturbed data without randomization.
  – Randomized: perturbed data, but without making any corrections for randomization.
- Test data not randomized.
- Synthetic benchmark from [AGI+92].
- Training set of 100,000 records, split equally between the two classes.
Decision Tree Experiments
- (Figure: Accuracy, 50-100%, for functions Fn 1 through Fn 5 at 100% Randomization Level. Reconstructed accuracy is close to Original; Randomized accuracy is substantially lower.)
Accuracy vs. Randomization
- (Figure: Accuracy, 40-100%, vs. Randomization Level of 10-200% for function Fn 3; curves for Original, Randomized, and Reconstructed.)
Discovering Associations Over Privacy-Preserved Categorical Data
- A transaction t is a set of items.
- Support s of an itemset A is the number of transactions in which A appears.
- Itemset A is frequent if s ≥ s_min.
- Task: find all frequent itemsets, while preserving the privacy of individual transactions.

A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. KDD 2002.
S. Rizvi, J. Haritsa. Privacy-Preserving Association Rule Mining. VLDB 2002.
Uniform Randomization
- Given a transaction,
  – keep each item with, say, 20% probability;
  – replace it with a new random item with 80% probability.
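As a sketch, uniform randomization of a single transaction might look like this (names and the 20/80 split are illustrative):

```python
import random

def uniform_randomize(transaction, all_items, p_keep=0.2):
    """Keep each item with probability p_keep; otherwise replace it
    with an item drawn uniformly at random from the catalog."""
    out = set()
    for item in transaction:
        if random.random() < p_keep:
            out.add(item)                        # keep the true item
        else:
            out.add(random.choice(all_items))    # random replacement
    return out
```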
Privacy Breach
- If one has access to the result (frequent itemsets) as well as the randomized transactions, one may make inferences about the original transactions.
- Itemset A causes a privacy breach of level ρ if, for some item z ∈ A,
$$\Pr[z \in t \mid A \subseteq t'] \geq \rho$$
- See also: A. Evfimievski, J. Gehrke, R. Srikant. PODS 2003.
Example: {x, y, z}
- 10M transactions of size 10 over 10K items:
  – 1% have {x, y, z};
  – 5% have exactly two of them ({x, y}, {x, z}, or {y, z} only);
  – 94% have one or zero items of {x, y, z}.
- After uniform randomization (keep probability 0.2), the randomized transactions containing all of {x, y, z} break down as:
  – from the 1% group: 0.2³ → 0.008%, about 800 transactions (97.8% of the matches);
  – from the 5% group: 0.2² · 8/10,000 → 0.00016%, about 16 transactions (1.9%);
  – from the 94% group: at most 0.2 · (9/10,000)² → less than 0.00002%, about 2 transactions (0.3%).
- Privacy breach: given {x, y, z} in a randomized transaction, we have 98% certainty of {x, y, z} being in the original one.
Solution
- Insert many false items into each transaction.
- Hide true itemsets among false ones.

"Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?" "He grows a forest to hide it in." - G.K. Chesterton
Cut and Paste Randomization
- Given transaction t of size m, construct t′:
  – choose a number j between 0 and K_m (cutoff);
  – include j items of t in t′;
  – include each other item in t′ with probability p_m.
- The choice of K_m and p_m is based on the desired level of privacy.
- Example: t = {a, b, c, u, v, w, x, y, z}; with j = 4, t′ = {b, v, x, z} plus random items {œ, å, ß, ξ, ψ, €, א, ъ, ђ, …}.
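A minimal sketch of the construction; the paper derives K_m and p_m from a target breach level, so the values below are placeholders only:

```python
import random

def cut_and_paste(t, item_universe, K_m=3, p_m=0.01):
    """Cut-and-paste randomization sketch.

    'Cut': keep j true items, j drawn between 0 and the cutoff K_m.
    'Paste': every other catalog item joins independently with
    probability p_m, hiding the kept items among false ones.
    """
    j = random.randint(0, min(K_m, len(t)))      # choose the cutoff j
    kept = set(random.sample(list(t), j))        # include j items of t
    noise = {item for item in item_universe
             if item not in kept and random.random() < p_m}
    return kept | noise
```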
Partial Supports
- To recover the original support of an itemset, we need the randomized supports of its subsets.
- Given an itemset A of size k and transaction size m, the vector of partial supports of A is s = (s_0, s_1, ..., s_k), where
$$s_l = \frac{1}{|T|}\,\#\{t \in T : \#(t \cap A) = l\}$$
- Here s_k is the same as the support of A.
- Randomized partial supports are denoted by s′.

The Unbiased Estimators
- Given the randomized partial supports, we can estimate the original partial supports:
$$s^{\text{est}} = Q\,s', \quad\text{where } Q = P^{-1}$$
- Covariance matrix for this estimator:
$$\operatorname{Cov}\,s^{\text{est}} = \frac{1}{|T|}\sum_{l=0}^{k} s_l\; Q\,D[l]\,Q^{T}, \quad\text{where } D[l]_{ij} = P_{il}\,\delta_{i=j} - P_{il}\,P_{jl}$$
- To estimate it, substitute s_l with (s^est)_l.
  – Special case: estimators for support and its variance.
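The matrix-inversion estimator is easy to demonstrate numerically. The transition matrix below uses a simplified model, assumed here only for illustration, in which each true item is kept independently with probability p and no random items are inserted:

```python
import numpy as np
from math import comb

def keep_only_transition_matrix(k, p):
    """P[l', l] = Pr[#(t' ∩ A) = l' | #(t ∩ A) = l] under the
    simplified keep-with-probability-p model (no insertions)."""
    P = np.zeros((k + 1, k + 1))
    for l in range(k + 1):
        for lp in range(l + 1):
            P[lp, l] = comb(l, lp) * p**lp * (1 - p)**(l - lp)
    return P

def estimate_partial_supports(s_rand, P):
    """Unbiased estimator: s_est = Q s', with Q = P^{-1}."""
    return np.linalg.inv(P) @ s_rand
```

Since E[s′] = P s, applying Q = P⁻¹ to the randomized partial supports recovers the originals in expectation; cut-and-paste randomization uses the same idea with a more elaborate P.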
Can we still find frequent itemsets?

Soccer (s_min = 0.2%):
Itemset Size | True Itemsets | True Positives | False Drops | False Positives
1 | 266 | 254 | 12 | 31
2 | 217 | 195 | 22 | 45
3 | 48 | 43 | 5 | 26

Mailorder (s_min = 0.2%):
Itemset Size | True Itemsets | True Positives | False Drops | False Positives
1 | 65 | 65 | 0 | 0
2 | 228 | 212 | 16 | 28
3 | 22 | 18 | 4 | 5

Breach level = 50%.
Cryptographic Approach
- + Accuracy
- − Performance?
- Security
  – Semi-honest (or passive) adversary: correctly follows the protocol specification, yet attempts to learn additional information by analyzing the messages.
Cryptographic Primitives
- Oblivious transfer [CACM85, MIT81]
  – Sender's input: (x0, x1); receiver's input: b ∈ {0,1}
  – Receiver learns x_b; sender learns nothing
  – Sufficient for secure computation [STOC 88]
- Oblivious polynomial evaluation [STOC99]
  – Sender's input: polynomial Q over F
  – Receiver's input: z ∈ F
  – Receiver obtains Q(z); sender learns nothing
- Yao's two-party protocol [FOCS 84]
  – Party 1 with input x, Party 2 with input y
  – Compute f(x, y) without revealing x, y
  – Any polynomial-time function can be expressed as a combinatorial circuit of polynomial size [JACM72]
Private Distributed ID3
- Problem: two parties owning confidential databases wish to build a decision-tree classifier on the union of their databases, without revealing any unnecessary information.

Y. Lindell, B. Pinkas. Privacy Preserving Data Mining. Crypto 2000.
Basic Idea
- Find the attribute with the highest information gain privately.
- We can then split on this attribute and recurse.
Information Gain
- Let
  – T = set of records (dataset),
  – T(c_i) = set of records in class i,
  – T(c_i, a_j) = set of records in class i with value(A) = a_j.
- Entropy:
$$\text{Entropy}(T) = -\sum_i \frac{|T(c_i)|}{|T|}\,\log\frac{|T(c_i)|}{|T|}$$
- Gain:
$$\text{Gain}(T, A) = \text{Entropy}(T) - \sum_j \frac{|T(a_j)|}{|T|}\,\text{Entropy}(T(a_j))$$
- Need to compute
  – Σ_j Σ_i |T(a_j, c_i)| log |T(a_j, c_i)|
  – Σ_j |T(a_j)| log |T(a_j)|
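The two quantities feeding the protocol are ordinary entropy and gain computations; computed in the clear they amount to this illustrative sketch:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(T) = -sum_i |T(c_i)|/|T| * log2(|T(c_i)|/|T|)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, labels, attr):
    """Gain(T, A) = Entropy(T) - sum_j |T(a_j)|/|T| * Entropy(T(a_j))."""
    n = len(records)
    by_value = {}
    for record, label in zip(records, labels):
        by_value.setdefault(record[attr], []).append(label)
    return entropy(labels) - sum(
        (len(subset) / n) * entropy(subset) for subset in by_value.values())
```

In the private protocol, the |T(a_j, c_i)| log |T(a_j, c_i)| terms are computed on random shares rather than in the clear.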
Selecting the Split Attribute
- Given v1 known to party 1 and v2 known to party 2, compute (v1 + v2) log (v1 + v2) and output random shares:
  – Party 1 gets (Answer − R)
  – Party 2 gets R, where R is a random number
- Given random shares for each attribute, use Yao's protocol to compute the information gain.
Purdue Toolkit
- Partitioned databases
- Secure building blocks for computing:
  – Sum
  – Set Union
  – Size of Set Intersection
  – Scalar Product

C. Clifton et al. Tools for Privacy Preserving Data Mining. SIGKDD Explorations 2003.
Sum
- (Figure: three parties holding values 5, 3, and 9 compute their sum on a ring.)
- The initiator picks a random R = 7 and sends 5 + 7 = 12 to the next party.
- Each party adds its own value and passes the total on: 12 + 3 = 15, then 15 + 9 = 24.
- The initiator receives 24 and computes Sum = 24 − R = 17.
- Semi-honest assumption.
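The ring protocol in the figure can be sketched as follows (a single-process simulation; in reality each addition happens at a different party):

```python
import random

def secure_sum(values, modulus=1_000_003):
    """Secure-sum ring sketch for semi-honest parties: the initiator
    masks its value with a random R, each party adds its own value
    modulo `modulus`, and the initiator removes R at the end."""
    R = random.randrange(modulus)
    running = (values[0] + R) % modulus        # initiator sends masked value
    for v in values[1:]:
        running = (running + v) % modulus      # each party adds its value
    return (running - R) % modulus             # initiator strips the mask
```

No intermediate running total reveals an individual value, because every total is blinded by the initiator's random R.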
Algorithms
- Association rules
  – horizontally partitioned data
  – vertically partitioned data
- EM Clustering
Work in Statistical Databases
- Provide statistical information without compromising sensitive information about individuals (AW89, Sho82).
- Techniques
  – Query Restriction
  – Data Perturbation
- Negative results: cannot give high-quality statistics and simultaneously prevent partial disclosure of individual information [AW89].
Techniques
- Query Restriction
  – Restrict the size of the query result (e.g. FEL72, DDS79)
  – Control overlap among successive queries (e.g. DJL79)
  – Suppress small data cells (e.g. CO82)
- Output Perturbation
  – Sample the result of a query (e.g. Den80)
  – Add noise to the query result (e.g. Bec80)
- Data Perturbation
  – Replace the db with a sample (e.g. LST83, LCL85, Rei84)
  – Swap values between records (e.g. Den82)
  – Add noise to values (e.g. TYW84, War65)
Summary
- Promising technical direction & results.
- Much more needs to be done, e.g.
  – trade-off between the amount of privacy breach and performance;
  – examination of other approaches (e.g. randomization based on swapping).
Hippocratic Databases
- Hippocratic Oath, 8 (circa 400 BC): What I may see or hear in the course of treatment … I will keep to myself.
- What if database systems were to embrace the Hippocratic Oath?

R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. VLDB 2002.
The Ten Principles
- Driven by current privacy legislation:
  – US (FIPA, 1974), Europe (OECD, 1980), Canada (1995), Australia (2000), Japan (2003)
- Principles:
  – Collection Group: Purpose Specification, Consent, Limited Collection
  – Use Group: Limited Use, Limited Disclosure, Limited Retention, Accuracy
  – Security & Openness Group: Safety, Openness, Compliance
Collection Group
1. Purpose Specification: for personal information stored in the database, the purposes for which the information has been collected shall be associated with that information.
2. Consent: the purposes associated with personal information shall have the consent of the donor (the person whose information is being stored).
3. Limited Collection: the information collected shall be limited to the minimum necessary for accomplishing the specified purposes.
Use Group
4. Limited Use: the database shall run only those queries that are consistent with the purposes for which the information has been collected.
5. Limited Disclosure: personal information shall not be communicated outside the database for purposes other than those for which there is consent from the donor of the information.
Use Group (2)
6. Limited Retention: personal information shall be retained only as long as necessary for the fulfillment of the purposes for which it has been collected.
7. Accuracy: personal information stored in the database shall be accurate and up-to-date.
Security & Openness Group
8. Safety: personal information shall be protected by security safeguards against theft and other misappropriations.
9. Openness: a donor shall be able to access all information about the donor stored in the database.
10. Compliance: a donor shall be able to verify compliance with the above principles. Similarly, the database shall be able to address a challenge concerning compliance.
Architecture
- (Figure.) A Privacy Metadata Creator turns the Privacy Policy into Privacy Metadata.
- Data Collection passes through a Privacy Constraint Validator, a Data Accuracy Analyzer, and a Data Collection Analyzer before reaching the Store.
- Queries pass through Attribute Access Control, Record Access Control, and a Query Intrusion Detector; both paths write Audit Info to an Audit Trail.
- Other components: Data Retention Manager, Encryption Support.
New Challenges
- General: Language, Efficiency
- Use: Limited Collection, Limited Disclosure, Limited Retention
- Security and Openness: Safety, Openness, Compliance
Language
- Need a declarative language for specifying privacy policies & user preferences.
- P3P is very limited:
  – developed primarily for web shopping;
  – no enforcement.
- Some features:
  – user negotiation models (least invasive site; coalitional game [KPR2001]);
  – balance expressibility and usability.
Efficiency
- How do we minimize the cost of privacy checking?
  – Need cell-level access control.
- How do we incorporate purpose into database design and query optimization?
- How does the secure-databases work on decomposing multilevel relations into single-level relations [JS91] apply here?
  – Difference in granularity and scale from work in secure databases.
Limited Collection
- How do we identify attributes that are collected but not used?
  – Assets are only needed for a mortgage when salary is below some threshold.
- What's the needed granularity for numeric attributes?
  – Queries only ask "Salary > threshold" for a rent application.
- How do we generate the minimal query for a given purpose?
Limited Disclosure
- Need a dynamically determined set of recipients?
- Example: Alice wants to add EasyCredit to the set of recipients in EquiRate's database.
- Digital signatures.
Limited Retention
Completely forgetting some information is non-trivial.
How do we delete a record from the logs and checkpoints, without affecting recovery?
How do we continue to support historical analysis and statistical queries without incurring privacy breaches?
Safety
- Encryption to avoid inadvertent disclosure of data:
  – How do we index encrypted data?
  – How do we run queries against encrypted data?
  – [SWP00], [HILM02]
Openness
- A donor shall be able to access all information about the donor stored in the database.
- How does the database check that Alice is really Alice and not somebody else?
  – The Princeton admissions office broke into Yale's admissions site using applicants' social security numbers and birth dates.
- How does Alice find out what databases have information about her?
  – Information discovery literature
  – Symmetrically private information retrieval [GIKM98]
Compliance
- Universal Logging
  – Can we provide each user whose data is accessed with a log of that access, along with the query reading the data?
- Tracking Privacy Breaches
  – Insert "fingerprint" records with emails, telephone numbers, and credit card numbers.
  – Some data may be more valuable for spammers or credit-card thieves. How do we identify categories so as to do stratified fingerprinting rather than randomly inserting records?
Summary
Gold mine of challenging research problems (besides being useful)!
Related Topics
- Privacy Cognizant Information Integration
- Database support for P3P
- Watermarking & fingerprinting
Decision-Making Across Private Data Repositories
- Separate databases due to statutory, competitive, or security reasons.
- Selective, minimal sharing on a need-to-know basis.
- Example: among those who took a particular drug, how many had an adverse reaction and DNA containing a specific sequence? Researchers must not learn anything beyond the counts.
- Algorithms for computing joins and join counts while revealing minimal additional information.
- Minimal necessary sharing (figure): R = {a, u, v, x}, S = {b, u, v, y}, so R ∩ S = {u, v}.
  – R must not know that S has b & y; S must not know that R has a & x.
  – For Count(R ∩ S), R & S learn nothing except that the result is 2.
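One standard way to realize such an intersection count is commutative encryption. The sketch below uses a toy prime and exponentiation keys, which is insecure and purely illustrative; a real protocol of this kind works in a large group and hashes values into it first:

```python
P = 1019  # toy prime; P - 1 = 2 * 509, keys must be coprime to P - 1

def enc(values, key):
    """'Encrypt' by exponentiation mod P; for gcd(key, P-1) = 1 this is
    a bijection on Z_P^*, and enc(enc(v, k1), k2) == enc(enc(v, k2), k1)."""
    return {pow(v, key, P) for v in values}

def intersection_size(r_vals, s_vals, key_r=3, key_s=5):
    """Each party encrypts its own set, then re-encrypts the other's;
    equal double encryptions correspond exactly to common elements,
    so each side learns only the intersection size."""
    r_double = enc(enc(r_vals, key_r), key_s)   # S re-encrypts R's set
    s_double = enc(enc(s_vals, key_s), key_r)   # R re-encrypts S's set
    return len(r_double & s_double)
```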
Database Support for P3P
- P3P: new W3C standard to encode company privacy policies and user privacy preferences in XML.
  – Can programmatically match preferences against policies.
  – Solves the problem that current policies are written by lawyers, for lawyers.
  – Current implementations do the matching in the client (browser).
- Proposal: server-centric preference matching using relational databases. Advantages:
  – Server-side matching necessary for thin clients, e.g. mobile devices.
  – Sets up the necessary infrastructure for policy enforcement.
  – Provides companies with extra information for refining privacy policies.
- Prototype enables DB2 with P3P support:
  – Stores P3P policies in relational tables.
  – Reuses database query technology for policy-preference matching.
  – Algorithm for converting APPEL preferences into SQL queries.
- (Figure: a Shredder stores the P3P privacy policy as policy metadata in the database; an APPEL-to-SQL converter turns an APPEL privacy preference into a SQL query; policy-preference matching runs the query against the stored metadata and returns the matching result.)
Watermarking Relational Databases
- Goal: deter data theft and assert ownership of pirated copies.
  – Examples: Life Sciences, Electronic Parts.
- Watermark: intentionally introduced pattern in the data.
  – Very unlikely to occur by chance.
  – Hard to find => hard to destroy (robust against malicious attacks).
- Existing watermarking techniques developed for multimedia (images, sound, text, …) are not applicable to database tables:
  – Rows in a table are unordered.
  – Rows can be inserted, updated, deleted.
  – Attributes can be added, dropped.
- New algorithm for watermarking database tables:
  – Watermark can be detected using only a subset of the rows and attributes of a table.
  – Robust against updates; incrementally updatable.
- Watermark insertion (figure):
  1. Choose a secret key.
  2. Specify the table/attributes to be marked.
  3. Pseudo-randomly select a subset of the rows for marking, as a function of the secret key and attribute values.
- Watermark detection (on a suspicious database):
  1. Specify the secret key.
  2. Specify the table/attributes which should contain marks.
  3. Identify marked rows/attributes; compare marks with expected mark values.
  4. Confirm the presence or absence of the watermark. Requires neither the original unmarked data nor the watermark.
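The keyed pseudo-random selection in insertion step 3 can be sketched with an HMAC over the primary key; the gap parameter and helper names below are illustrative, not the paper's notation:

```python
import hmac, hashlib

def _keyed_hash(primary_key, secret_key, salt=b""):
    """Deterministic keyed hash of a row's primary key."""
    digest = hmac.new(secret_key, str(primary_key).encode() + salt,
                      hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")

def is_marked_row(primary_key, secret_key, gap=10):
    """A row is selected for marking when its keyed hash falls in a
    1/gap fraction of the hash space; only the key holder can tell."""
    return _keyed_hash(primary_key, secret_key) % gap == 0

def mark_bit_position(primary_key, secret_key, num_lsbs=2):
    """Which low-order bit of the chosen numeric attribute to mark,
    also derived from the key so detection can recompute it."""
    return _keyed_hash(primary_key, secret_key, b"#bit") % num_lsbs
```

Detection recomputes the same hashes over the suspicious table and checks whether the expected bits are present, which is why neither the original unmarked data nor a stored watermark is needed.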
Closing Thoughts
The right to privacy: the most cherished of human freedoms. -- Warren & Brandeis, 1890
Code is law … it is all a matter of code: the software and hardware that now rule. -- L. Lessig
We can architect computing systems to protect values we believe are fundamental, or we can architect them to allow those values to disappear.
What do we want to do as database researchers?
References
- R. Agrawal, A. Evfimievski, R. Srikant. Information Sharing Across Private Databases. ACM Int'l Conf. on Management of Data (SIGMOD), San Diego, California, June 2003.
- R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An XPath-Based Preference Language for P3P. 12th Int'l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003.
- R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database Technology. 19th Int'l Conf. on Data Engineering (ICDE), Bangalore, India, March 2003.
- R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P. W3C Workshop on the Future of P3P, Dulles, Virginia, Nov. 2002.
- R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
- R. Agrawal, J. Kiernan. Watermarking Relational Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
- A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery and Data Mining (KDD), Edmonton, Canada, July 2002.
- R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int'l Conf. on Management of Data (SIGMOD), Dallas, Texas, May 2000.