Privacy in Data Systems
Rakesh Agrawal, IBM Almaden Research Center
Joint work with Srikant, Kiernan & Xu
Theme
- Increasing need for information systems that
  – protect the privacy and ownership of information
  – do not impede the flow of information
- Resolving the apparent contradiction in the above statement is a major research challenge and opportunity.
Drivers
- Policies and Legislation
  – U.S. and international regulations
  – Legal proceedings against businesses
- Consumer Concerns
  – "Consumer privacy apprehensions continue to plague the Web … these fears will hold back roughly $15 billion in e-Commerce revenue." Forrester Research, 2001
  – Most consumers are "privacy pragmatists." Westin Surveys
- Moral Imperative
  – The right to privacy: the most cherished of human freedoms. -- Warren & Brandeis, 1890
Privacy Is Headline News
- "Privacy #1 issue in the 21st Century" - Wall Street Journal, January 24, 2000
- "Anyone today who thinks the privacy issue has peaked is greatly mistaken … we are in the early stages of a sweeping change in attitudes that will put once-routine business practices under the microscope." - Forrester Research, March 5, 2001
Outline
- Privacy Preserving Data Mining
  – How to have your cake and mine it too!
  – Assuming no trusted third party.
- Hippocratic Databases
  – What I may see or hear in the course of treatment … I will keep to myself. - Hippocratic Oath, 8 (circa 400 BC)
- Other related topics
  – Privacy Cognizant Information Integration
  – Database support for P3P
  – Watermarking & fingerprinting
Data Mining and Privacy
- The primary task in data mining: development of models about aggregated data.
- Can we still develop accurate models, while protecting the privacy of individual records?
Approaches
- Randomization
- Cryptographic
- Statistical disclosure control
Randomization Overview
- (Figure, animated over several frames.) Alice, Bob, and Chris each hold a record of the form age | salary | favorite artist | hobby | site:
  – Alice: 35 | 80,000 | J.S. Bach | painting | nasa
  – Bob: 45 | 60,000 | B. Spears | baseball | cnn
  – Chris: 42 | 85,000 | B. Marley | camping | microsoft
- Without privacy protection, the records are sent as-is to a Recommendation Service, where a Mining Algorithm builds a Data Mining Model.
- With randomization, each user perturbs the record before sending it:
  – Alice sends 50 | 65,000 | Metallica | painting | nasa
  – Bob sends 38 | 90,000 | B. Spears | soccer | fox
  – Chris sends 32 | 55,000 | B. Marley | camping | linuxware
- The service adds a Recovery step that reconstructs aggregate distributions from the randomized records before the Mining Algorithm builds the Data Mining Model.
Inducing Classifiers over Privacy-Preserved Numeric Data
- (Figure.) Each client randomizes its record before sending it: Alice's record 30 | 25K | … passes through a Randomizer and her age 30 becomes 65 (30 + 35); John's record 50 | 40K | … is randomized to 35 | 60K | ….
- The server reconstructs the Age distribution and the Salary distribution from the randomized values, then runs the Decision Tree Algorithm on the reconstructed distributions to produce the Model.
Reconstruction Problem
- Original values x1, x2, ..., xn
  – drawn from probability distribution X (unknown)
- To hide these values, we use y1, y2, ..., yn
  – drawn from probability distribution Y
- Given
  – x1+y1, x2+y2, ..., xn+yn
  – the probability distribution of Y
- Estimate the probability distribution of X.
Reconstruction Algorithm
- f_X^0 := Uniform distribution
- j := 0
- repeat
  – update f_X^{j+1}(a) by Bayes' rule:
$$f_X^{j+1}(a) = \frac{1}{n}\sum_{i=1}^{n}\frac{f_Y((x_i+y_i)-a)\,f_X^{j}(a)}{\int f_Y((x_i+y_i)-z)\,f_X^{j}(z)\,dz}$$
  – j := j+1
- until (stopping criterion met)
(R. Agrawal & R. Srikant, SIGMOD 2000)
- Converges to the maximum likelihood estimate (D. Agrawal & C.C. Aggarwal, PODS 2001).
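A discretized version of this iterative scheme can be sketched in a few lines of Python; the function and parameter names are illustrative, not from the paper:

```python
import numpy as np

def reconstruct_distribution(w, noise_pdf, bins, n_iters=30):
    """Iterative Bayes reconstruction over a discretized domain.

    w         : randomized observations w_i = x_i + y_i
    noise_pdf : density f_Y of the noise, as a vectorized callable
    bins      : grid of candidate values a for X
    Returns estimated probabilities of X over `bins`.
    """
    fX = np.full(len(bins), 1.0 / len(bins))      # f_X^0 := uniform
    for _ in range(n_iters):
        new = np.zeros_like(fX)
        for wi in w:
            lik = noise_pdf(wi - bins) * fX       # f_Y(w_i - a) * f_X^j(a)
            total = lik.sum()                     # discretized denominator
            if total > 0:
                new += lik / total                # posterior for this point
        fX = new / len(w)                         # average over all points
    return fX
```

Each pass applies Bayes' rule per observation and averages the resulting posteriors, mirroring the update rule above; the loop stops after a fixed number of iterations instead of a convergence test.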
Estimate of a Single Point
- Use Bayes' rule for density functions.
- (Figure: Age axis from 10 to 90. Given a randomized value V and the original distribution for Age, the rule yields a probabilistic estimate of the original value of V.)
Estimate of the Distribution
- Combine the estimates of where each point came from, over all the points: this yields an estimate of the original distribution. (Figure: Age axis from 10 to 90.)
Works Well
- (Figure: Number of People vs. Age, roughly 20 to 60, counts 0 to 1200. The Reconstructed distribution closely tracks the Original, while the Randomized distribution does not.)
Decision Tree Example

Age | Salary | Repeat Visitor?
23 | 50K | Repeat
17 | 30K | Repeat
43 | 40K | Repeat
68 | 50K | Single
32 | 70K | Single
20 | 20K | Repeat

Resulting tree: Age < 25? Yes → Repeat. No → Salary < 50K? Yes → Repeat; No → Single.
Algorithms
- Global: reconstruct each attribute once at the beginning.
- By Class: for each attribute, first split by class, then reconstruct separately for each class.
- Local: reconstruct at each node.
- See the SIGMOD 2000 paper for details.
Experimental Methodology
- Compare accuracy against
  – Original: unperturbed data without randomization.
  – Randomized: perturbed data, but without making any corrections for randomization.
- Test data not randomized.
- Synthetic benchmark from [AGI+92].
- Training set of 100,000 records, split equally between the two classes.
Decision Tree Experiments
- (Figure: Accuracy, 50-100%, for functions Fn 1 through Fn 5 at 100% Randomization Level. Reconstructed accuracy is close to Original; Randomized accuracy is substantially lower.)
Accuracy vs. Randomization
- (Figure: Accuracy, 40-100%, vs. Randomization Level of 10-200% for function Fn 3; curves for Original, Randomized, and Reconstructed.)
Discovering Associations Over Privacy-Preserved Categorical Data
- A transaction t is a set of items.
- Support s of an itemset A is the number of transactions in which A appears.
- Itemset A is frequent if s ≥ s_min.
- Task: find all frequent itemsets, while preserving the privacy of individual transactions.

A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. KDD 2002.
S. Rizvi, J. Haritsa. Privacy-Preserving Association Rule Mining. VLDB 2002.
Uniform Randomization
- Given a transaction,
  – keep each item with, say, 20% probability;
  – replace it with a new random item with 80% probability.
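As a sketch, uniform randomization of a single transaction might look like this (names and the 20/80 split are illustrative):

```python
import random

def uniform_randomize(transaction, all_items, p_keep=0.2):
    """Keep each item with probability p_keep; otherwise replace it
    with an item drawn uniformly at random from the catalog."""
    out = set()
    for item in transaction:
        if random.random() < p_keep:
            out.add(item)                        # keep the true item
        else:
            out.add(random.choice(all_items))    # random replacement
    return out
```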
Privacy Breach
- If one has access to the result (frequent itemsets) as well as the randomized transactions, one may make inferences about the original transactions.
- Itemset A causes a privacy breach of level ρ if, for some item z ∈ A,
$$\Pr[z \in t \mid A \subseteq t'] \geq \rho$$
- See also: A. Evfimievski, J. Gehrke, R. Srikant. PODS 2003.
Example: {x, y, z}
- 10M transactions of size 10 over 10K items:
  – 1% have {x, y, z};
  – 5% have exactly two of them ({x, y}, {x, z}, or {y, z} only);
  – 94% have one or zero items of {x, y, z}.
- After uniform randomization (keep probability 0.2), the randomized transactions containing all of {x, y, z} break down as:
  – from the 1% group: 0.2³ → 0.008%, about 800 transactions (97.8% of the matches);
  – from the 5% group: 0.2² · 8/10,000 → 0.00016%, about 16 transactions (1.9%);
  – from the 94% group: at most 0.2 · (9/10,000)² → less than 0.00002%, about 2 transactions (0.3%).
- Privacy breach: given {x, y, z} in a randomized transaction, we have 98% certainty of {x, y, z} being in the original one.
Solution
- Insert many false items into each transaction.
- Hide true itemsets among false ones.

"Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?" "He grows a forest to hide it in." - G.K. Chesterton
Cut and Paste Randomization
- Given transaction t of size m, construct t′:
  – choose a number j between 0 and K_m (cutoff);
  – include j items of t in t′;
  – include each other item in t′ with probability p_m.
- The choice of K_m and p_m is based on the desired level of privacy.
- Example: t = {a, b, c, u, v, w, x, y, z}; with j = 4, t′ = {b, v, x, z} plus random items {œ, å, ß, ξ, ψ, €, א, ъ, ђ, …}.
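A minimal sketch of the construction; the paper derives K_m and p_m from a target breach level, so the values below are placeholders only:

```python
import random

def cut_and_paste(t, item_universe, K_m=3, p_m=0.01):
    """Cut-and-paste randomization sketch.

    'Cut': keep j true items, j drawn between 0 and the cutoff K_m.
    'Paste': every other catalog item joins independently with
    probability p_m, hiding the kept items among false ones.
    """
    j = random.randint(0, min(K_m, len(t)))      # choose the cutoff j
    kept = set(random.sample(list(t), j))        # include j items of t
    noise = {item for item in item_universe
             if item not in kept and random.random() < p_m}
    return kept | noise
```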
Partial Supports
- To recover the original support of an itemset, we need the randomized supports of its subsets.
- Given an itemset A of size k and transaction size m, the vector of partial supports of A is s = (s_0, s_1, ..., s_k), where
$$s_l = \frac{1}{|T|}\,\#\{t \in T : \#(t \cap A) = l\}$$
- Here s_k is the same as the support of A.
- Randomized partial supports are denoted by s′.

The Unbiased Estimators
- Given the randomized partial supports, we can estimate the original partial supports:
$$s^{\text{est}} = Q\,s', \quad\text{where } Q = P^{-1}$$
- Covariance matrix for this estimator:
$$\operatorname{Cov}\,s^{\text{est}} = \frac{1}{|T|}\sum_{l=0}^{k} s_l\; Q\,D[l]\,Q^{T}, \quad\text{where } D[l]_{ij} = P_{il}\,\delta_{i=j} - P_{il}\,P_{jl}$$
- To estimate it, substitute s_l with (s^est)_l.
  – Special case: estimators for support and its variance.
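The matrix-inversion estimator is easy to demonstrate numerically. The transition matrix below uses a simplified model, assumed here only for illustration, in which each true item is kept independently with probability p and no random items are inserted:

```python
import numpy as np
from math import comb

def keep_only_transition_matrix(k, p):
    """P[l', l] = Pr[#(t' ∩ A) = l' | #(t ∩ A) = l] under the
    simplified keep-with-probability-p model (no insertions)."""
    P = np.zeros((k + 1, k + 1))
    for l in range(k + 1):
        for lp in range(l + 1):
            P[lp, l] = comb(l, lp) * p**lp * (1 - p)**(l - lp)
    return P

def estimate_partial_supports(s_rand, P):
    """Unbiased estimator: s_est = Q s', with Q = P^{-1}."""
    return np.linalg.inv(P) @ s_rand
```

Since E[s′] = P s, applying Q = P⁻¹ to the randomized partial supports recovers the originals in expectation; cut-and-paste randomization uses the same idea with a more elaborate P.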
Can we still find frequent itemsets?

Soccer (s_min = 0.2%):
Itemset Size | True Itemsets | True Positives | False Drops | False Positives
1 | 266 | 254 | 12 | 31
2 | 217 | 195 | 22 | 45
3 | 48 | 43 | 5 | 26

Mailorder (s_min = 0.2%):
Itemset Size | True Itemsets | True Positives | False Drops | False Positives
1 | 65 | 65 | 0 | 0
2 | 228 | 212 | 16 | 28
3 | 22 | 18 | 4 | 5

Breach level = 50%.
Cryptographic Approach
- + Accuracy
- − Performance?
- Security
  – Semi-honest (or passive) adversary: correctly follows the protocol specification, yet attempts to learn additional information by analyzing the messages.
Cryptographic Primitives
- Oblivious transfer [CACM85, MIT81]
  – Sender's input: (x0, x1); receiver's input: b ∈ {0,1}
  – Receiver learns x_b; sender learns nothing
  – Sufficient for secure computation [STOC 88]
- Oblivious polynomial evaluation [STOC99]
  – Sender's input: polynomial Q over F
  – Receiver's input: z ∈ F
  – Receiver obtains Q(z); sender learns nothing
- Yao's two-party protocol [FOCS 84]
  – Party 1 with input x, Party 2 with input y
  – Compute f(x, y) without revealing x, y
  – Any polynomial-time function can be expressed as a combinatorial circuit of polynomial size [JACM72]
Private Distributed ID3
- Problem: two parties owning confidential databases wish to build a decision-tree classifier on the union of their databases, without revealing any unnecessary information.

Y. Lindell, B. Pinkas. Privacy Preserving Data Mining. Crypto 2000.
Basic Idea
- Find the attribute with the highest information gain privately.
- We can then split on this attribute and recurse.
Information Gain
- Let
  – T = set of records (dataset),
  – T(c_i) = set of records in class i,
  – T(c_i, a_j) = set of records in class i with value(A) = a_j.
- Entropy:
$$\text{Entropy}(T) = -\sum_i \frac{|T(c_i)|}{|T|}\,\log\frac{|T(c_i)|}{|T|}$$
- Gain:
$$\text{Gain}(T, A) = \text{Entropy}(T) - \sum_j \frac{|T(a_j)|}{|T|}\,\text{Entropy}(T(a_j))$$
- Need to compute
  – Σ_j Σ_i |T(a_j, c_i)| log |T(a_j, c_i)|
  – Σ_j |T(a_j)| log |T(a_j)|
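The two quantities feeding the protocol are ordinary entropy and gain computations; computed in the clear they amount to this illustrative sketch:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(T) = -sum_i |T(c_i)|/|T| * log2(|T(c_i)|/|T|)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, labels, attr):
    """Gain(T, A) = Entropy(T) - sum_j |T(a_j)|/|T| * Entropy(T(a_j))."""
    n = len(records)
    by_value = {}
    for record, label in zip(records, labels):
        by_value.setdefault(record[attr], []).append(label)
    return entropy(labels) - sum(
        (len(subset) / n) * entropy(subset) for subset in by_value.values())
```

In the private protocol, the |T(a_j, c_i)| log |T(a_j, c_i)| terms are computed on random shares rather than in the clear.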
Selecting the Split Attribute
- Given v1 known to party 1 and v2 known to party 2, compute (v1 + v2) log (v1 + v2) and output random shares:
  – Party 1 gets (Answer − R)
  – Party 2 gets R, where R is a random number
- Given random shares for each attribute, use Yao's protocol to compute the information gain.
Purdue Toolkit
- Partitioned databases
- Secure building blocks for computing:
  – Sum
  – Set Union
  – Size of Set Intersection
  – Scalar Product

C. Clifton et al. Tools for Privacy Preserving Data Mining. SIGKDD Explorations 2003.
Sum
- (Figure: three parties holding values 5, 3, and 9 compute their sum on a ring.)
- The initiator picks a random R = 7 and sends 5 + 7 = 12 to the next party.
- Each party adds its own value and passes the total on: 12 + 3 = 15, then 15 + 9 = 24.
- The initiator receives 24 and computes Sum = 24 − R = 17.
- Semi-honest assumption.
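The ring protocol in the figure can be sketched as follows (a single-process simulation; in reality each addition happens at a different party):

```python
import random

def secure_sum(values, modulus=1_000_003):
    """Secure-sum ring sketch for semi-honest parties: the initiator
    masks its value with a random R, each party adds its own value
    modulo `modulus`, and the initiator removes R at the end."""
    R = random.randrange(modulus)
    running = (values[0] + R) % modulus        # initiator sends masked value
    for v in values[1:]:
        running = (running + v) % modulus      # each party adds its value
    return (running - R) % modulus             # initiator strips the mask
```

No intermediate running total reveals an individual value, because every total is blinded by the initiator's random R.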
Algorithms
- Association rules
  – horizontally partitioned data
  – vertically partitioned data
- EM Clustering
Work in Statistical Databases
- Provide statistical information without compromising sensitive information about individuals (AW89, Sho82).
- Techniques
  – Query Restriction
  – Data Perturbation
- Negative results: cannot give high-quality statistics and simultaneously prevent partial disclosure of individual information [AW89].
Techniques
- Query Restriction
  – Restrict the size of the query result (e.g. FEL72, DDS79)
  – Control overlap among successive queries (e.g. DJL79)
  – Suppress small data cells (e.g. CO82)
- Output Perturbation
  – Sample the result of a query (e.g. Den80)
  – Add noise to the query result (e.g. Bec80)
- Data Perturbation
  – Replace the db with a sample (e.g. LST83, LCL85, Rei84)
  – Swap values between records (e.g. Den82)
  – Add noise to values (e.g. TYW84, War65)
Summary
- Promising technical direction & results.
- Much more needs to be done, e.g.
  – trade-off between the amount of privacy breach and performance;
  – examination of other approaches (e.g. randomization based on swapping).
Hippocratic Databases
- Hippocratic Oath, 8 (circa 400 BC): What I may see or hear in the course of treatment … I will keep to myself.
- What if database systems were to embrace the Hippocratic Oath?

R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. VLDB 2002.
The Ten Principles
- Driven by current privacy legislation:
  – US (FIPA, 1974), Europe (OECD, 1980), Canada (1995), Australia (2000), Japan (2003)
- Principles:
  – Collection Group: Purpose Specification, Consent, Limited Collection
  – Use Group: Limited Use, Limited Disclosure, Limited Retention, Accuracy
  – Security & Openness Group: Safety, Openness, Compliance
Collection Group
1. Purpose Specification: for personal information stored in the database, the purposes for which the information has been collected shall be associated with that information.
2. Consent: the purposes associated with personal information shall have the consent of the donor (the person whose information is being stored).
3. Limited Collection: the information collected shall be limited to the minimum necessary for accomplishing the specified purposes.
Use Group
4. Limited Use: the database shall run only those queries that are consistent with the purposes for which the information has been collected.
5. Limited Disclosure: personal information shall not be communicated outside the database for purposes other than those for which there is consent from the donor of the information.
Use Group (2)
6. Limited Retention: personal information shall be retained only as long as necessary for the fulfillment of the purposes for which it has been collected.
7. Accuracy: personal information stored in the database shall be accurate and up-to-date.
Security & Openness Group
8. Safety: personal information shall be protected by security safeguards against theft and other misappropriations.
9. Openness: a donor shall be able to access all information about the donor stored in the database.
10. Compliance: a donor shall be able to verify compliance with the above principles. Similarly, the database shall be able to address a challenge concerning compliance.
Architecture
- (Figure.) A Privacy Metadata Creator turns the Privacy Policy into Privacy Metadata.
- Data Collection passes through a Privacy Constraint Validator, a Data Accuracy Analyzer, and a Data Collection Analyzer before reaching the Store.
- Queries pass through Attribute Access Control, Record Access Control, and a Query Intrusion Detector; both paths write Audit Info to an Audit Trail.
- Other components: Data Retention Manager, Encryption Support.
New Challenges
- General: Language, Efficiency
- Use: Limited Collection, Limited Disclosure, Limited Retention
- Security and Openness: Safety, Openness, Compliance
Language
- Need a declarative language for specifying privacy policies & user preferences.
- P3P is very limited:
  – developed primarily for web shopping;
  – no enforcement.
- Some features:
  – user negotiation models (least invasive site; coalitional game [KPR2001]);
  – balance expressibility and usability.
Efficiency
- How do we minimize the cost of privacy checking?
  – Need cell-level access control.
- How do we incorporate purpose into database design and query optimization?
- How does the secure-databases work on decomposing multilevel relations into single-level relations [JS91] apply here?
  – Difference in granularity and scale from work in secure databases.
Limited Collection
- How do we identify attributes that are collected but not used?
  – Assets are only needed for a mortgage when salary is below some threshold.
- What's the needed granularity for numeric attributes?
  – Queries only ask "Salary > threshold" for a rent application.
- How do we generate the minimal query for a given purpose?
Limited Disclosure
- Need a dynamically determined set of recipients?
- Example: Alice wants to add EasyCredit to the set of recipients in EquiRate's database.
- Digital signatures.
Limited Retention
Completely forgetting some information is non-trivial.
How do we delete a record from the logs and checkpoints, without affecting recovery?
How do we continue to support historical analysis and statistical queries without incurring privacy breaches?
Safety
- Encryption to avoid inadvertent disclosure of data:
  – How do we index encrypted data?
  – How do we run queries against encrypted data?
  – [SWP00], [HILM02]
Openness
- A donor shall be able to access all information about the donor stored in the database.
- How does the database check that Alice is really Alice and not somebody else?
  – The Princeton admissions office broke into Yale's admissions site using applicants' social security numbers and birth dates.
- How does Alice find out what databases have information about her?
  – Information discovery literature
  – Symmetrically private information retrieval [GIKM98]
Compliance
- Universal Logging
  – Can we provide each user whose data is accessed with a log of that access, along with the query reading the data?
- Tracking Privacy Breaches
  – Insert "fingerprint" records with emails, telephone numbers, and credit card numbers.
  – Some data may be more valuable for spammers or credit-card thieves. How do we identify categories so as to do stratified fingerprinting rather than randomly inserting records?
Summary
Gold mine of challenging research problems (besides being useful)!
Related Topics
- Privacy Cognizant Information Integration
- Database support for P3P
- Watermarking & fingerprinting
Decision-Making Across Private Data Repositories
- Separate databases due to statutory, competitive, or security reasons.
- Selective, minimal sharing on a need-to-know basis.
- Example: among those who took a particular drug, how many had an adverse reaction and DNA containing a specific sequence? Researchers must not learn anything beyond the counts.
- Algorithms for computing joins and join counts while revealing minimal additional information.
- Minimal necessary sharing (figure): R = {a, u, v, x}, S = {b, u, v, y}, so R ∩ S = {u, v}.
  – R must not know that S has b & y; S must not know that R has a & x.
  – For Count(R ∩ S), R & S learn nothing except that the result is 2.
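One standard way to realize such an intersection count is commutative encryption. The sketch below uses a toy prime and exponentiation keys, which is insecure and purely illustrative; a real protocol of this kind works in a large group and hashes values into it first:

```python
P = 1019  # toy prime; P - 1 = 2 * 509, keys must be coprime to P - 1

def enc(values, key):
    """'Encrypt' by exponentiation mod P; for gcd(key, P-1) = 1 this is
    a bijection on Z_P^*, and enc(enc(v, k1), k2) == enc(enc(v, k2), k1)."""
    return {pow(v, key, P) for v in values}

def intersection_size(r_vals, s_vals, key_r=3, key_s=5):
    """Each party encrypts its own set, then re-encrypts the other's;
    equal double encryptions correspond exactly to common elements,
    so each side learns only the intersection size."""
    r_double = enc(enc(r_vals, key_r), key_s)   # S re-encrypts R's set
    s_double = enc(enc(s_vals, key_s), key_r)   # R re-encrypts S's set
    return len(r_double & s_double)
```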
Database Support for P3P
- P3P: new W3C standard to encode company privacy policies and user privacy preferences in XML.
  – Can programmatically match preferences against policies.
  – Solves the problem that current policies are written by lawyers, for lawyers.
  – Current implementations do the matching in the client (browser).
- Proposal: server-centric preference matching using relational databases. Advantages:
  – Server-side matching necessary for thin clients, e.g. mobile devices.
  – Sets up the necessary infrastructure for policy enforcement.
  – Provides companies with extra information for refining privacy policies.
- Prototype enables DB2 with P3P support:
  – Stores P3P policies in relational tables.
  – Reuses database query technology for policy-preference matching.
  – Algorithm for converting APPEL preferences into SQL queries.
- (Figure: a Shredder stores the P3P privacy policy as policy metadata in the database; an APPEL-to-SQL converter turns an APPEL privacy preference into a SQL query; policy-preference matching runs the query against the stored metadata and returns the matching result.)
Watermarking Relational Databases
- Goal: deter data theft and assert ownership of pirated copies.
  – Examples: Life Sciences, Electronic Parts.
- Watermark: intentionally introduced pattern in the data.
  – Very unlikely to occur by chance.
  – Hard to find => hard to destroy (robust against malicious attacks).
- Existing watermarking techniques developed for multimedia (images, sound, text, …) are not applicable to database tables:
  – Rows in a table are unordered.
  – Rows can be inserted, updated, deleted.
  – Attributes can be added, dropped.
- New algorithm for watermarking database tables:
  – Watermark can be detected using only a subset of the rows and attributes of a table.
  – Robust against updates; incrementally updatable.
- Watermark insertion (figure):
  1. Choose a secret key.
  2. Specify the table/attributes to be marked.
  3. Pseudo-randomly select a subset of the rows for marking, as a function of the secret key and attribute values.
- Watermark detection (on a suspicious database):
  1. Specify the secret key.
  2. Specify the table/attributes which should contain marks.
  3. Identify marked rows/attributes; compare marks with expected mark values.
  4. Confirm the presence or absence of the watermark. Requires neither the original unmarked data nor the watermark.
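The keyed pseudo-random selection in insertion step 3 can be sketched with an HMAC over the primary key; the gap parameter and helper names below are illustrative, not the paper's notation:

```python
import hmac, hashlib

def _keyed_hash(primary_key, secret_key, salt=b""):
    """Deterministic keyed hash of a row's primary key."""
    digest = hmac.new(secret_key, str(primary_key).encode() + salt,
                      hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")

def is_marked_row(primary_key, secret_key, gap=10):
    """A row is selected for marking when its keyed hash falls in a
    1/gap fraction of the hash space; only the key holder can tell."""
    return _keyed_hash(primary_key, secret_key) % gap == 0

def mark_bit_position(primary_key, secret_key, num_lsbs=2):
    """Which low-order bit of the chosen numeric attribute to mark,
    also derived from the key so detection can recompute it."""
    return _keyed_hash(primary_key, secret_key, b"#bit") % num_lsbs
```

Detection recomputes the same hashes over the suspicious table and checks whether the expected bits are present, which is why neither the original unmarked data nor a stored watermark is needed.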
Closing Thoughts
The right to privacy: the most cherished of human freedoms. -- Warren & Brandeis, 1890
Code is law … it is all a matter of code: the software and hardware that now rule. -- L. Lessig
We can architect computing systems to protect values we believe are fundamental, or we can architect them to allow those values to disappear.
What do we want to do as database researchers?
References
- R. Agrawal, A. Evfimievski, R. Srikant. Information Sharing Across Private Databases. ACM Int'l Conf. on Management of Data (SIGMOD), San Diego, California, June 2003.
- R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An XPath-Based Preference Language for P3P. 12th Int'l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003.
- R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database Technology. 19th Int'l Conf. on Data Engineering (ICDE), Bangalore, India, March 2003.
- R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P. W3C Workshop on the Future of P3P, Dulles, Virginia, Nov. 2002.
- R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
- R. Agrawal, J. Kiernan. Watermarking Relational Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
- A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery and Data Mining (KDD), Edmonton, Canada, July 2002.
- R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int'l Conf. on Management of Data (SIGMOD), Dallas, Texas, May 2000.