privacy, security, and compliance in data systems intelligent information systems research ibm...

51
Privacy, Security, and Privacy, Security, and Compliance in Data Systems Compliance in Data Systems Intelligent Information Systems Intelligent Information Systems Research Research IBM Almaden Research Center IBM Almaden Research Center San Jose, CA 95120, USA San Jose, CA 95120, USA Contact: Rakesh Agrawal, [email protected]

Upload: daniel-wells

Post on 05-Jan-2016

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Privacy, Security, and Compliance Privacy, Security, and Compliance in Data Systemsin Data Systems

Intelligent Information Systems ResearchIntelligent Information Systems Research

IBM Almaden Research CenterIBM Almaden Research Center

San Jose, CA 95120, USASan Jose, CA 95120, USA

Contact: Rakesh Agrawal, [email protected]

Page 2: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

ThesisThesis

ƒ Through technological innovations, it is possible to Through technological innovations, it is possible to build information systems thatbuild information systems thatƒ protect the privacy and ownership of informationprotect the privacy and ownership of informationƒ do not impede the flow of informationdo not impede the flow of information

Page 3: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

OutlineOutline

Privacy Preserving Data MiningPrivacy Preserving Data Mining Privacy Aware Data ManagementPrivacy Aware Data Management Information Sharing Across Private DatabasesInformation Sharing Across Private Databases ConclusionsConclusions

Page 4: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Data Mining and PrivacyData Mining and Privacy

The primary task in data mining:The primary task in data mining:– development of models about aggregated data.development of models about aggregated data.

Can we develop accurate models, while Can we develop accurate models, while protecting the privacy of individual records? protecting the privacy of individual records?

Page 5: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

SettingSetting

Application scenario: A central server interested in Application scenario: A central server interested in building a data mining model using data obtained building a data mining model using data obtained from a large number of clients, while preserving from a large number of clients, while preserving their privacytheir privacy– Web-commerce, e.g. recommendation serviceWeb-commerce, e.g. recommendation service

Desiderata:Desiderata:– Must not slow-down the speed of client interactionMust not slow-down the speed of client interaction– Must scale to very large number of clientsMust scale to very large number of clients

During the application phase During the application phase – Ship model to the clientsShip model to the clients– Use oblivious computationsUse oblivious computations

Page 6: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Recommendation Service

Alice

Bob

3595,000J.S. Bachpaintingnasa

3595,000J.S. Bachpaintingnasa

4560,000B. Spearsbaseballcnn

4560,000B. Spearsbaseballcnn 42

85,000B. Marleycampingmicrosoft

4285,000B. Marleycampingmicrosoft

4560,000B. Spearsbaseballcnn

3595,000J.S. Bachpaintingnasa

Chris4285,000B. Marley,camping,microsoft

World TodayWorld Today

Page 7: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Recommendation Service

Alice

Bob

3595,000J.S. Bachpaintingnasa

3595,000J.S. Bachpaintingnasa

4560,000B. Spearsbaseballcnn

4560,000B. Spearsbaseballcnn 42

85,000B. Marleycampingmicrosoft

4285,000B. Marleycampingmicrosoft

4560,000B. Spearsbaseballcnn

3595,000J.S. Bachpaintingnasa

Chris4285,000B. Marley,camping,microsoft

Mining Algorithm

Data Mining Model

World TodayWorld Today

Page 8: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Recommendation Service

Alice

Bob

5065,000Metallicapaintingnasa

5065,000Metallicapaintingnasa

3890,000B. Spearssoccerfox

3890,000B. Spearssoccerfox

3255,000B. Marleycampinglinuxware

3255,000B. Marleycampinglinuxware

4560,000B. Spearsbaseballcnn

3595,000J.S. Bachpaintingnasa

Chris4285,000B. Marley,camping,microsoft

35 becomes

50 (35+15)

Per-record randomizationwithout considering other records

Randomization parameterscommon across users

Randomization techniques differ for numeric and

categorical dataEach attribute randomized

independently

New Order: New Order: Randomization toRandomization toProtect PrivacyProtect Privacy

Page 9: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Recommendation Service

Alice

Bob

5065,000Metallicapaintingnasa

5065,000Metallicapaintingnasa

3890,000B. Spearssoccerfox

3890,000B. Spearssoccerfox 32

55,000B. Marleycampinglinuxware

3255,000B. Marleycampinglinuxware

4560,000B. Spearsbaseballcnn

3595,000J.S. Bachpaintingnasa

Chris4285,000B. Marley,camping,microsoft

New Order: New Order: Randomization toRandomization toProtect PrivacyProtect Privacy

True valuesNever Leave

the User!

Page 10: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Recommendation Service

Alice

Bob

5065,000Metallicapaintingnasa

5065,000Metallicapaintingnasa

3890,000B. Spearssoccerfox

3890,000B. Spearssoccerfox 32

55,000B. Marleycampinglinuxware

3255,000B. Marleycampinglinuxware

4560,000B. Spearsbaseballcnn

3595,000J.S. Bachpaintingnasa

Chris4285,000B. Marley,camping,microsoft

Data Mining Model

Mining Algorithm

Recovery

Recovery of distributions, not individual records

New Order: New Order: RandomizationRandomizationProtects PrivacyProtects Privacy

Page 11: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Reconstruction Problem Reconstruction Problem (Numeric Data)(Numeric Data)

Original values xOriginal values x11, x, x22, ..., x, ..., xnn

– from probability distribution X (unknown)from probability distribution X (unknown) To hide these values, we use yTo hide these values, we use y11, y, y22, ..., y, ..., ynn

– from probability distribution Yfrom probability distribution Y GivenGiven

– xx11+y+y11, x, x22+y+y22, ..., x, ..., xnn+y+ynn

– the probability distribution of Ythe probability distribution of Y Estimate the probability distribution of X.Estimate the probability distribution of X.

Page 12: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Reconstruction AlgorithmReconstruction Algorithm

ffXX00 := Uniform distribution := Uniform distribution

j := 0 j := 0 repeatrepeat

ffXXj+1j+1(a) := Bayes’ Rule(a) := Bayes’ Rule

j := j+1j := j+1 until (stopping criterion met)until (stopping criterion met)

(R. Agrawal & R. Srikant, SIGMOD 2000)(R. Agrawal & R. Srikant, SIGMOD 2000)

Converges to maximum likelihood estimate.Converges to maximum likelihood estimate.– D. Agrawal & C.C. Aggarwal, PODS 2001.D. Agrawal & C.C. Aggarwal, PODS 2001.

n

ijXiiY

jXiiY

afayxf

afayxf

n 1 )())((

)())((1

Page 13: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Works WellWorks Well

20

60

Age

0

200

400

600

800

1000

1200

Nu

mb

er

of

Peop

le

Original

Randomized

Reconstructed

Page 14: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Decision Tree ExampleDecision Tree Example

Age Salary Repeat Visitor?

23 50K Repeat17 30K Repeat43 40K Repeat68 50K Single32 70K Single20 20K Repeat

Age < 25

Salary < 50K

Repeat

Repeat

Single

Yes

Yes

No

No

Page 15: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

AlgorithmsAlgorithms

GlobalGlobal– Reconstruct for each attribute once at the beginningReconstruct for each attribute once at the beginning

By ClassBy Class– For each attribute, first split by class, then reconstruct For each attribute, first split by class, then reconstruct

separately for each class.separately for each class.

LocalLocal– Reconstruct at each nodeReconstruct at each node

See SIGMOD 2000 paper for details.See SIGMOD 2000 paper for details.

Page 16: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Experimental MethodologyExperimental Methodology

Compare accuracy againstCompare accuracy against– OriginalOriginal: unperturbed data without randomization.: unperturbed data without randomization.– RandomizedRandomized: perturbed data but without making any : perturbed data but without making any

corrections for randomization.corrections for randomization.

Test data not randomized.Test data not randomized. Synthetic benchmark from [AGI+92].Synthetic benchmark from [AGI+92]. Training set of 100,000 records, split equally Training set of 100,000 records, split equally

between the two classes.between the two classes.

Page 17: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Accuracy vs. RandomizationAccuracy vs. Randomization

10 20 40 60 80 100 150 200

Randomization Level

40

50

60

70

80

90

100

Acc

ura

cy

Original

Randomized

Reconstructed

Fn 3

Page 18: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

More on RandomizationMore on Randomization

Privacy-Preserving Association Rule Mining Over Privacy-Preserving Association Rule Mining Over Categorical DataCategorical Data– Rizvi & Haritsa [VLDB 02]Rizvi & Haritsa [VLDB 02]– Evfimievski, Srikant, Agrawal, & Gehrke [KDD-02]

Privacy Breach Control: Probabilistic limits on what one can infer with access to the randomized data as well as mining results– Evfimievski, Srikant, Agrawal, & Gehrke [KDD-02]– Evfimievski, Gehrke & Srikant [PODS-03]

Page 19: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Related Work:Related Work:Private Distributed ID3Private Distributed ID3

How to build a decision-tree classifier on the union of two How to build a decision-tree classifier on the union of two private databases (Lindell & Pinkas [Crypto 2000])private databases (Lindell & Pinkas [Crypto 2000])

Basic Idea:Basic Idea: Find attribute with highest information gain privatelyFind attribute with highest information gain privately Independently split on this attribute and recurseIndependently split on this attribute and recurse

Selecting the Split AttributeSelecting the Split Attribute Given v1 known to DB1 and v2 known to DB2, compute (v1 + v2) Given v1 known to DB1 and v2 known to DB2, compute (v1 + v2)

log (v1 + v2) and output random shares of the answerlog (v1 + v2) and output random shares of the answer Given random shares, use Yao's protocol Given random shares, use Yao's protocol [FOCS 84][FOCS 84] to compute to compute

information gain.information gain. Trade-offTrade-off

+ AccuracyAccuracy– Performance & scalingPerformance & scaling

Page 20: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Related Work: Purdue ToolkitRelated Work: Purdue Toolkit

Partitioned databases (horizontally + vertically)Partitioned databases (horizontally + vertically) Secure Building BlocksSecure Building Blocks Algorithms (using building blocks):Algorithms (using building blocks):

– Association rulesAssociation rules– EM ClusteringEM Clustering

C. Clifton et al. Tools for Privacy Preserving Data C. Clifton et al. Tools for Privacy Preserving Data Mining. SIGKDD Explorations 2003.Mining. SIGKDD Explorations 2003.

Page 21: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Related Work:Related Work:Statistical DatabasesStatistical Databases

Provide statistical information without compromising Provide statistical information without compromising sensitive information about individuals (AW89, sensitive information about individuals (AW89, Sho82)Sho82)

TechniquesTechniques– Query RestrictionQuery Restriction– Data PerturbationData Perturbation

Negative Results: cannot give high quality statistics Negative Results: cannot give high quality statistics and simultaneously prevent partial disclosure of and simultaneously prevent partial disclosure of individual information [AW89]individual information [AW89]

Page 22: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

SummarySummary

Promising technical direction & resultsPromising technical direction & results Much more needs to be done, e.g.Much more needs to be done, e.g.

– Trade off between the amount of privacy breach and Trade off between the amount of privacy breach and performanceperformance

– Examination of other approaches (e.g. randomization Examination of other approaches (e.g. randomization based on swapping)based on swapping)

Page 23: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

OutlineOutline

Privacy Preserving Data MiningPrivacy Preserving Data Mining Privacy Aware Data ManagementPrivacy Aware Data Management Information Sharing Across Private DatabasesInformation Sharing Across Private Databases ConclusionsConclusions

Page 24: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Hippocratic DatabasesHippocratic Databases

Hippocratic Oath, 8 (circa 400 BC)Hippocratic Oath, 8 (circa 400 BC)– What I may see or hear in the course of treatment … I will What I may see or hear in the course of treatment … I will

keep to myself.keep to myself.

What if the database systems were to embrace the What if the database systems were to embrace the Hippocratic Oath?Hippocratic Oath?

Architecture derived from privacy legislations.Architecture derived from privacy legislations.– US (FIPA, 1974), Europe (OECD , 1980), Canada (1995), US (FIPA, 1974), Europe (OECD , 1980), Canada (1995),

Australia (2000), Japan (2003)Australia (2000), Japan (2003)

Agrawal, Kiernan, Srikant & Xu: VLDB 2002..

Page 25: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Architectural PrinciplesArchitectural Principles

Purpose SpecificationPurpose SpecificationAssociate with data the Associate with data the purposes for collectionpurposes for collection

ConsentConsentObtain donor’s consent on the Obtain donor’s consent on the purposespurposes

Limited CollectionLimited CollectionCollect minimum necessary Collect minimum necessary datadata

Limited UseLimited UseRun only queries that are Run only queries that are consistent with the purposesconsistent with the purposes

Limited DisclosureLimited DisclosureDo not release data without Do not release data without donor’s consentdonor’s consent

Limited RetentionLimited RetentionDo not retain data beyond Do not retain data beyond necessarynecessary

AccuracyAccuracyKeep data accurate and up-to-Keep data accurate and up-to-datedate

SafetySafetyProtect against theft and other Protect against theft and other misappropriationsmisappropriations

OpennessOpennessAllow donor access to data Allow donor access to data about the donorabout the donor

ComplianceComplianceVerifiable compliance with the Verifiable compliance with the above principlesabove principles

Page 26: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

ArchitectureArchitecture

PrivacyPolicy

DataCollection

Queries

PrivacyMetadataCreator

Store

PrivacyConstraintValidator

DataAccuracyAnalyzer

AuditInfo

AuditInfo

AuditTrail

QueryIntrusionDetector

AttributeAccessControl

PrivacyMetadata

Other

DataRetentionManager

RecordAccessControl

EncryptionSupport

DataCollectionAnalyzer

Page 27: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Current StatusCurrent Status

PrivacyPolicy

DataCollection

Queries

PrivacyMetadataCreator

Store

PrivacyConstraintValidator

DataAccuracyAnalyzer

AuditInfo

AuditInfo

AuditTrail

QueryIntrusionDetector

AttributeAccessControl

PrivacyMetadata

Other

DataRetentionManager

RecordAccessControl

EncryptionSupport

DataCollectionAnalyzer

Page 28: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Related Work:Related Work:Statistical & Secure DatabasesStatistical & Secure Databases

Statistical DatabasesStatistical Databases– Provide statistical information (sum, count, etc.) without Provide statistical information (sum, count, etc.) without

compromising sensitive information about individuals, [AW89]compromising sensitive information about individuals, [AW89] Multilevel Secure DatabasesMultilevel Secure Databases

– Multilevel relations, e.g., records tagged “secret”, “confidential”, Multilevel relations, e.g., records tagged “secret”, “confidential”, or “unclassified”, e.g. [JS91]or “unclassified”, e.g. [JS91]

Need to protect privacy in transactional databases that Need to protect privacy in transactional databases that support daily operations.support daily operations.– Cannot restrict queries to statistical queries.Cannot restrict queries to statistical queries.– Cannot tag all the records “top secret”.Cannot tag all the records “top secret”.

Page 29: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Some Interesting ProblemsSome Interesting Problems

Privacy enforcement requires cell-level decisions (which may Privacy enforcement requires cell-level decisions (which may be different for different queries)be different for different queries)– How to minimize the cost of privacy checking?How to minimize the cost of privacy checking?

Encryption to avoid data theftEncryption to avoid data theft– How to index encrypted data for range queries?How to index encrypted data for range queries?

Intrusive queries from authorized usersIntrusive queries from authorized users– Query intrusion detection?Query intrusion detection?

Identifying unnecessary data collectionIdentifying unnecessary data collection– Assets info needed only if salary is below a thresholdAssets info needed only if salary is below a threshold– Queries only ask “Salary > threshold” for rent applicationQueries only ask “Salary > threshold” for rent application

Forgetting data after the purpose is fulfilledForgetting data after the purpose is fulfilled– Databases designed not to lose dataDatabases designed not to lose data– Interaction with compliance Interaction with compliance

Solutions must scale to database-size problems!

Page 30: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

OutlineOutline

Privacy Preserving Data MiningPrivacy Preserving Data Mining Privacy Aware Data ManagementPrivacy Aware Data Management Information Sharing Across Private DatabasesInformation Sharing Across Private Databases ConclusionsConclusions

Page 31: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Assumption: Information in each database can be Assumption: Information in each database can be freely shared.freely shared.

Today’s Information Sharing Today’s Information Sharing SystemsSystems

Mediator

Q R

Federated

Q R

Centralized

Page 32: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Sovereign Information SharingSovereign Information Sharing

Compute queries across databases so that no more Compute queries across databases so that no more information than necessary is revealed (without information than necessary is revealed (without using a trusted third party).using a trusted third party).

Need is driven by several trends:Need is driven by several trends:– End-to-end integration of information systems End-to-end integration of information systems

across companies.across companies.– Simultaneously compete and cooperate.Simultaneously compete and cooperate.– Security: need-to-know information sharingSecurity: need-to-know information sharing

Agrawal, Evfimievski & Srikant: SIGMOD 2003.Agrawal, Evfimievski & Srikant: SIGMOD 2003.

Page 33: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Selective Document SharingSelective Document Sharing

R is shopping for R is shopping for technology.technology.

S has intellectual S has intellectual property it may want to property it may want to license.license.

First find the specific First find the specific technologies where there technologies where there is a match, and then is a match, and then reveal further information reveal further information about those.about those.

R

ShoppingList

S

TechnologyList

Example 2: Govt. agencies sharing information on a

need-to-know basis.

Page 34: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Medical Research Medical Research

Validate hypothesis Validate hypothesis between adverse between adverse reaction to a drug and a reaction to a drug and a specific DNA sequence.specific DNA sequence.

Researchers should not Researchers should not learn anything beyond 4 learn anything beyond 4 counts:counts:

MayoClinic

DNA Sequences

DrugReactions

Adverse ReactionAdverse Reaction No Adv. ReactionNo Adv. Reaction

Sequence PresentSequence Present ?? ??

Sequence AbsentSequence Absent ?? ??

Page 35: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

CaveatsCaveats

Schema Discovery & HeterogeneitySchema Discovery & Heterogeneity Multiple QueriesMultiple Queries

Page 36: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

R S R must not

know that S has b & y

S must not know that R has a & x

uu

vv

RSaa

uu

vv

xx

bb

uu

vv

yy

R

S

Count (R S) R & S do not learn

anything except that the result is 2.

Minimal Necessary SharingMinimal Necessary Sharing

Page 37: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Problem Statement:Problem Statement:Minimal SharingMinimal Sharing

Given:Given:– Two parties (honest-but-curious): R (receiver) and S Two parties (honest-but-curious): R (receiver) and S

(sender)(sender)– Query Q spanning the tables R and SQuery Q spanning the tables R and S– Additional (pre-specified) categories of information IAdditional (pre-specified) categories of information I

Compute the answer to Q and return it to R without revealing Compute the answer to Q and return it to R without revealing any additional information to either party, any additional information to either party, except for the except for the information contained in Iinformation contained in I– For intersection, intersection size & equijoin, For intersection, intersection size & equijoin,

I = { |R| , |S| }I = { |R| , |S| }

– For equijoin size, I also includes the distribution of duplicates & For equijoin size, I also includes the distribution of duplicates & some subset of information in R some subset of information in R S S

Page 38: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

A Possible ApproachA Possible Approach

Secure Multi-Party ComputationSecure Multi-Party Computation– Given two parties with inputs x and y, compute f(x,y) such Given two parties with inputs x and y, compute f(x,y) such

that the parties learn only f(x,y) and nothing else.that the parties learn only f(x,y) and nothing else.– Can be solved by building a combinatorial circuit, and Can be solved by building a combinatorial circuit, and

simulating that circuit [Yao86].simulating that circuit [Yao86].

Prohibitive cost for database-size problems.Prohibitive cost for database-size problems.– Intersection of two relations of a million records each Intersection of two relations of a million records each

would require 144 days (Yao’s protocol)would require 144 days (Yao’s protocol)

Page 39: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Intersection Protocol: IntuitionIntersection Protocol: Intuition

Want to encrypt the value in R and S and compare Want to encrypt the value in R and S and compare the encrypted values.the encrypted values.

However, want an encryption function such that it However, want an encryption function such that it can only be jointly computed by R and S, not can only be jointly computed by R and S, not separately.separately.

Page 40: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Commutative EncryptionCommutative Encryption

Commutative encryption F is a computable function Commutative encryption F is a computable function f : Key F X Dom F -> Dom F, satisfying: f : Key F X Dom F -> Dom F, satisfying:

– For all e, e’ For all e, e’ Key F, f Key F, fe e oo ffe’ e’ = f= fe’ e’ oo ffee

(The result of encryption with two different keys is the same, (The result of encryption with two different keys is the same, irrespective of the order of encryption)irrespective of the order of encryption)

– Each Each ffe e is a bijection.is a bijection.

(Two different values will have different encrypted values)(Two different values will have different encrypted values)

– The distribution of <x, fThe distribution of <x, fee(x), y, f(x), y, fee(y)> is indistinguishable from the (y)> is indistinguishable from the

distribution of <x, fdistribution of <x, fee(x), y, z>; x, y, z (x), y, z>; x, y, z rr Dom F and e Dom F and e rr Key F. Key F.

(Given a value x and its encryption f(Given a value x and its encryption fee(x), for a new value y, we (x), for a new value y, we

cannot distinguish between fcannot distinguish between fee(y) and a random value z. Thus we (y) and a random value z. Thus we

cannot encrypt y nor decrypt fcannot encrypt y nor decrypt fee(y).)(y).)

Page 41: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Example Commutative Example Commutative EncryptionEncryption

ffee(x) = x(x) = xee mod p mod p

wherewhere– p: safe prime number, i.e., both p and q=(p-1)/2 p: safe prime number, i.e., both p and q=(p-1)/2

are primesare primes– encryption key e encryption key e 1, 2, …, q-1 1, 2, …, q-1– Dom F: all quadratic residues modulo pDom F: all quadratic residues modulo p

Commutativity: powers commuteCommutativity: powers commute(x(xdd mod p) mod p)ee mod p = x mod p = xdede mod p = (x mod p = (xee mod p) mod p)dd mod p mod p

Indistinguishability follows from Decisional Diffie-Indistinguishability follows from Decisional Diffie-Hellman Hypothesis (DDH)Hellman Hypothesis (DDH)

Page 42: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Intersection ProtocolIntersection Protocol

RS

R S

Secret key

r s

fs(S )We apply fs on h(S), where h is a hash function, not

directly on S.

Shorthand for { fs(x) | x S }

Page 43: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

R

Intersection ProtocolIntersection Protocol

S

R S

fs(S)fs(S )

fr(fs(S ))

r s

fs(fr(S ))

Commutative property

Page 44: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

R

Intersection ProtocolIntersection Protocol

S

R

S

fr(R )

fr(R )

fs(fr(S ))

<y, fs(y)> for y fr(R)

r s

<x, fs(fr(x))> for x R

<y, fs(y)> for y fr(R)

Since R knows<x, y=fr(x)>

Page 45: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Intersection Size ProtocolIntersection Size Protocol

R S

R S

fr(R ) fs(S )

fs(S ) fr(R )

fr(fs(S ))

r s

fs(fr(R ))

fr(fs(R))

R cannot map z fr(fs(R))

back to x R.

Not <y, fs(y)> for y fr(R)

Page 46: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Equijoin Protocol: IntuitionEquijoin Protocol: Intuition

R needs some extra information ext(v) for values v R needs some extra information ext(v) for values v R R S. S.– ext(v): information about the other attributes in ext(v): information about the other attributes in

S for those records where S.A = v S for those records where S.A = v S has second secret key s’S has second secret key s’ For each value v For each value v S, S,

– S generates an encryption key S generates an encryption key = f = fs’s’(v), and(v), and

– encrypts ext(v) using encryption function K with key encrypts ext(v) using encryption function K with key .. R learns fR learns fs’s’(v) only for v (v) only for v R. R.

– ff-1-1r r (f(fs’ s’ (f(frr(v))) = f(v))) = f-1-1

r r (f(fr r (f(fs’s’(v))) = f(v))) = fs’s’(v)(v)

Page 47: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Cost AnalysisCost Analysis

Cost is dominated by exponentiationsCost is dominated by exponentiations Let CLet Cee = cost of x = cost of xee mod p mod p

– x, e, p are all 1024-bit integersx, e, p are all 1024-bit integers– Roughly 0.02 seconds on a Pentium 3 (in 2001) [NP01], or Roughly 0.02 seconds on a Pentium 3 (in 2001) [NP01], or

2 x 102 x 1055 operations per hour operations per hour

Intersection: 2 (|VIntersection: 2 (|VRR| + |V| + |VSS|) C|) Cee

Join: (2 |VJoin: (2 |VRR| + 5 |V| + 5 |VSS|) C|) Cee Cost of intersection size and join size similar to Cost of intersection size and join size similar to

intersectionintersection Protocols are trivially parallelizableProtocols are trivially parallelizable

Page 48: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Related WorkRelated Work

[Naor & Pinkas 99]: Two protocols for list [Naor & Pinkas 99]: Two protocols for list intersection problemintersection problem– Oblivious evaluation of n polynomials of degree n each.Oblivious evaluation of n polynomials of degree n each.– Oblivious evaluation of nOblivious evaluation of n22 linear polynomials. linear polynomials.

[Huberman et al 99]: find people with common [Huberman et al 99]: find people with common preferences, without revealing the preferences.preferences, without revealing the preferences.– Intersection protocols are similar Intersection protocols are similar

[Clifton et al, 2003]: Secure set union and set [Clifton et al, 2003]: Secure set union and set intersectionintersection– Similar protocolsSimilar protocols

Page 49: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Summary and ChallengesSummary and Challenges

New applications require us to go beyond traditional New applications require us to go beyond traditional centralized and federated information integration: sovereign centralized and federated information integration: sovereign information integrationinformation integration

Need models of minimal disclosure and corresponding Need models of minimal disclosure and corresponding protocols forprotocols for– other database operationsother database operations

– combination of operationscombination of operations Need faster protocolsNeed faster protocols Need further study of tradeoff between efficiency andNeed further study of tradeoff between efficiency and

– additional information disclosedadditional information disclosed

– ApproximationApproximation Need to handle maliciousnessNeed to handle maliciousness

Page 50: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

Closing ThoughtsClosing Thoughts

Solutions to complex problems such as privacy Solutions to complex problems such as privacy require a mix of legislations, societal norms, require a mix of legislations, societal norms, market forces & technologymarket forces & technology

By advancing technology, we can change the mix By advancing technology, we can change the mix and improve the overall quality of the solution and improve the overall quality of the solution

Gold mine of challenging research problems Gold mine of challenging research problems (besides being useful)!(besides being useful)!

Page 51: Privacy, Security, and Compliance in Data Systems Intelligent Information Systems Research IBM Almaden Research Center San Jose, CA 95120, USA Contact:

ReferencesReferenceshttp://www.http://www.almadenalmaden..ibmibm.com/software/quest/.com/software/quest/

M. Bawa, R. Bayardo, R. Agrawal. M. Bawa, R. Bayardo, R. Agrawal. Privacy-preserving indexing of Documents on the Privacy-preserving indexing of Documents on the NetworkNetwork. 29th Int'l Conf. on Very Large Databases (VLDB), Berlin, Sept. 2003.. 29th Int'l Conf. on Very Large Databases (VLDB), Berlin, Sept. 2003.

R. Agrawal, A. Evfimievski, R. Srikant. R. Agrawal, A. Evfimievski, R. Srikant. Information Sharing Across Private DatabasesInformation Sharing Across Private Databases. . ACM Int’l Conf. On Management of Data (SIGMOD), San Diego, California, June 2003.ACM Int’l Conf. On Management of Data (SIGMOD), San Diego, California, June 2003.

A. Evfimievski, J. Gehrke, R. Srikant. A. Evfimievski, J. Gehrke, R. Srikant. Liming Privacy Breaches inLiming Privacy Breaches in Privacy Preserving Privacy Preserving Data MiningData Mining. PODS, San Diego, California, June 2003.. PODS, San Diego, California, June 2003.

R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An Xpath Based Preference Language for An Xpath Based Preference Language for P3PP3P.. 12th Int'l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003.12th Int'l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003.

R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database Implementing P3P Using Database TechnologyTechnology.. 19th Int'l Conf.on Data Engineering(ICDE), Bangalore, India, March 2003.19th Int'l Conf.on Data Engineering(ICDE), Bangalore, India, March 2003.

R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P.Server Centric P3P. W3C Workshop on the W3C Workshop on the Future of P3P, Dulles, Virginia, Nov. 2002.Future of P3P, Dulles, Virginia, Nov. 2002.

R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic DatabasesHippocratic Databases.. 28th Int'l Conf. on 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.Very Large Databases (VLDB), Hong Kong, August 2002.

R. Agrawal, J. Kiernan. R. Agrawal, J. Kiernan. Watermarking Relational DatabasesWatermarking Relational Databases.. 28th Int'l Conf. on Very 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002. Expanded version in VLDB Journal Large Databases (VLDB), Hong Kong, August 2002. Expanded version in VLDB Journal 2003.2003.

A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Mining Association Rules Over Privacy Preserving DataPrivacy Preserving Data.. 8th Int'l Conf. on Knowledge Discovery in Databases and Data 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining (KDD), Edmonton, Canada, July 2002Mining (KDD), Edmonton, Canada, July 2002..

R. Agrawal, R. Srikant. R. Agrawal, R. Srikant. Privacy Preserving Data MiningPrivacy Preserving Data Mining. ACM Int’l Conf. On . ACM Int’l Conf. On Management of Data (SIGMOD), Dallas, Texas, May 2000.Management of Data (SIGMOD), Dallas, Texas, May 2000.