Privacy-Preserving Data Quality Assessment for High-Fidelity Data Sharing Julien Freudiger, Shantanu Rane, Alex Brito, and Ersin Uzun PARC

Post on 28-Dec-2015

Page 1: Privacy-Preserving Data Quality Assessment for High-Fidelity Data Sharing

Julien Freudiger, Shantanu Rane, Alex Brito, and Ersin Uzun

PARC

Page 2

First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL

$

Page 3

What about data quality?

Alice does not know data quality prior to acquisition

Dirty data costs US businesses ~$600 billion annually [1]

Data cleaning accounts for up to 80% of development time

First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL

[Chart: Data Cleaning 80, Data Exploration 20]

[1] W. Eckerson. Data quality and the bottom line. TDWI Report, The Data Warehouse Institute, 2002

Page 4

First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL

Privacy concerns for Bob

Page 5

"How many rows are complete?"

"All of them"

First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL

Trust and privacy concerns for Alice

Page 6

Problem

Privacy-Preserving Data Quality Assessment

Page 7

Data Quality Metrics

Integrity constraints on attributes
  =, >, [ ]
  Example: age > 0

Dependency constraints across 2+ attributes
  if, while, for
  Example: if state == CA, then ZIP in [94000, 96199]

Many data quality metrics [1,2]
  Completeness
  Validity
  Uniqueness
  Consistency
  Timeliness

[1] Y. Lee, D. Strong, B. Kahn, and R. Wang. AIMQ: a methodology for information quality assessment. Information & Management, 40(2), 2002

[2] P. Cykana, A. Paul, and M. Stern. DoD Guidelines on Data Quality Management. In IQ, pages 154–171, 1996

Page 8

Data Quality Metrics

Completeness
Percentage of elements that are properly populated

Check for values such as NULL, “”, …

First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL

Page 9

Data Quality Metrics

Validity
Percentage of elements whose attributes possess meaningful values

First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL

Page 10

Data Quality Metrics

Consistency
Degree to which the data attributes satisfy dependency constraints

First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL
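
The three metrics can be illustrated in plaintext on the sample table. This is a minimal sketch with no privacy protection: the function names and the dictionary layout are hypothetical, the validity rule (age > 0) and the CA ZIP range come from the slides, and the WA ZIP range is an assumption added only so that the second row can be checked.

```python
# Sample table from the slides; None stands for NULL.
rows = [
    {"first": "John", "last": "Steinbeck", "age": 32, "state": "CA", "zip": "94043"},
    {"first": "Jimi", "last": "Hendrix", "age": 27, "state": "WA", "zip": "01000"},
    {"first": "Isaac", "last": "Asimov", "age": -15, "state": "NY", "zip": None},
]

# State -> ZIP range rules. The CA range is from the slides; the WA range is
# an assumption added for illustration.
ZIP_RANGES = {"CA": (94000, 96199), "WA": (98000, 99499)}

def completeness(rows):
    """Fraction of cells that are populated (not NULL or empty)."""
    cells = [value for row in rows for value in row.values()]
    populated = [value for value in cells if value not in (None, "", "NULL")]
    return len(populated) / len(cells)

def validity(rows):
    """Fraction of rows that satisfy the integrity constraint age > 0."""
    valid = sum(1 for row in rows if row["age"] is not None and row["age"] > 0)
    return valid / len(rows)

def consistency(rows):
    """Fraction of checkable rows whose ZIP lies in the range for their state."""
    checkable = [r for r in rows if r["state"] in ZIP_RANGES and r["zip"] is not None]
    consistent = 0
    for r in checkable:
        low, high = ZIP_RANGES[r["state"]]
        if low <= int(r["zip"]) <= high:
            consistent += 1
    return consistent / len(checkable)

print(completeness(rows), validity(rows), consistency(rows))
```

On this table, one of 15 cells is NULL, one of three ages is non-positive, and one of the two checkable State/ZIP pairs violates its rule.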

Page 11

Desired Privacy Properties

Query Privacy
Bob should not learn the data quality constraint parameters and the resulting values

Data Privacy
Alice should not learn anything from Bob’s data besides the quality metric

Page 12

Application: High-Fidelity Cyber Threat Mitigation

[Figure: four tables, each with columns IP, Port, Time, UID, APT]

[1] S. Katti, B. Krishnamurthy, and D. Katabi. Collaborating against common enemies. In IMC, 2005

[2] J. Zhang, P. A. Porras, and J. Ullrich. Highly predictive blacklisting. In USENIX Security, 2008

[3] P. Porras and V. Shmatikov. Large-scale collection and sanitization of network security data: risks and challenges. In NSPW, 2006

Page 13

Solutions

Rely on existing cryptographic primitives

Develop custom solution

Page 14

Private Set Intersection

Set intersection or cardinality of set intersection

[1] M. Freedman, K. Nissim, and B. Pinkas. Efficient private matching and set intersection. In EUROCRYPT, 2004

[2] E. De Cristofaro, P. Gasti, and G. Tsudik. Fast and Private Computation of Cardinality of Set Intersection and Union. In CANS, 2012

Page 15

Private Set Intersection Completeness

{NULL}          →  (1, NULL), (2, NULL), …, (n, NULL)
{d1, …, dn}     →  (1, d1), (2, d2), …, (n, dn)

PSI-CA approach is inefficient

Page 16

Encrypted-domain Computation

E(d1), E(d2)

E(d1) * E(d2) decrypts to d1 + d2 [1]

[1] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In EUROCRYPT, 1999
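
The additive property can be checked with a toy Paillier implementation. This is a minimal sketch, not the authors' code: the primes are tiny, hypothetical values chosen so the example runs instantly and they offer no security, and the function names are illustrative. It assumes Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random

def keygen(p=293, q=433):
    """Return (n, private_key). p and q are toy primes, for illustration only."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # inverse of lam mod n; valid for generator g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) as (n + 1)^m * r^n mod n^2."""
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def decrypt(n, private_key, c):
    """Recover m from ciphertext c using L(x) = (x - 1) / n."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return (x - 1) // n * mu % n

n, private_key = keygen()
d1, d2 = 32, 27
# Multiplying ciphertexts adds the underlying plaintexts.
c_sum = encrypt(n, d1) * encrypt(n, d2) % (n * n)
total = decrypt(n, private_key, c_sum)
print(total)  # 59
```

The party multiplying the ciphertexts needs only the public value `n`; it never sees `d1`, `d2`, or their sum.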

Page 17

Select & Aggregate Setup

Goal: Alice has a binary selector u, Bob has data vector v. Alice should discover the sum of selected elements from v.

Query Privacy: Bob should not find the selector vector.

Data Privacy: Alice should not discover any information other than the selected aggregate.

[Figure: Secure Select & Aggregate Protocol]

Page 18

Select & Aggregate Protocol

1. Alice sends element-wise encryptions of u to Bob.

2. Bob computes the dot product of u and v using the additive homomorphic property, and sends it to Alice.

3. Alice decrypts the dot product.

[Figure: Secure Select & Aggregate Protocol]
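
The three protocol steps can be sketched end to end with a toy Paillier cryptosystem. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the primes are tiny and insecure, both parties run in one process, and all function names and the example vectors are hypothetical. It assumes Python 3.8 or later.

```python
import math
import random

# Toy Paillier parameters. The primes are hypothetical and far too small to
# be secure; they only keep the example fast.
P, Q = 293, 433
N = P * Q                                           # public key
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)   # Alice's private value
MU = pow(LAM, -1, N)                                # Alice's private value

def encrypt(m):
    """Encrypt integer m (0 <= m < N) under the public key N."""
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return pow(N + 1, m, N2) * pow(r, N, N2) % N2

def decrypt(c):
    """Decrypt with Alice's private values LAM and MU."""
    return (pow(c, LAM, N2) - 1) // N * MU % N

def alice_encrypt_selector(u):
    # Step 1: element-wise encryptions of the binary selector u.
    return [encrypt(bit) for bit in u]

def bob_aggregate(encrypted_u, v):
    # Step 2: E(u_i)^(v_i) encrypts u_i * v_i, and multiplying ciphertexts
    # adds plaintexts, so the product encrypts the dot product of u and v.
    acc = encrypt(0)
    for c, value in zip(encrypted_u, v):
        acc = acc * pow(c, value, N2) % N2
    return acc

u = [1, 0, 1, 1]        # Alice's selector (hypothetical values)
v = [10, 20, 30, 40]    # Bob's data (hypothetical values)
# Step 3: Alice decrypts the single ciphertext returned by Bob.
result = decrypt(bob_aggregate(alice_encrypt_selector(u), v))
print(result)  # 80
```

Bob performs one exponentiation and one multiplication per element and returns a single ciphertext, so the cost grows linearly with the vector length.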

Page 19

Select & Aggregate Complexity

Cannot afford O(# tuples) complexity for large databases.

                    Alice  Bob
# Encryptions       K      0
# Decryptions       1      0
# Multiplications   0      K
# Exponentiations   0      K
# Transmissions     K      1

Page 20

Key Idea

1. Find a suitable low-dimensional representation.

2. Use Select & Aggregate to evaluate quality metric.

Page 21

Completeness Evaluation Setup

Example: Alice wants to find the number of NULL values in Bob’s data.

Query Privacy: Bob does not discover that Alice is searching for the number of NULLs.

Data Privacy: Alice discovers nothing else about Bob’s data.

Trick: Alice generates a HashMap, Bob generates a Counting HashMap.

HashMap (Alice):         0, …, H(NULL): 1, …, 0
Counting HashMap (Bob):  H(b1): 23, …, H(NULL): 5, …, H(bt): 2

Page 22

Completeness Evaluation Protocol

Alice generates a public encryption key and a private decryption key for an additively homomorphic cryptosystem.

The parties evaluate Select & Aggregate on Alice’s HashMap and Bob’s Counting HashMap.

By construction, the protocol reveals the number of NULLs to Alice.

HashMap (Alice):         0, …, H(NULL): 1, …, 0
Counting HashMap (Bob):  H(b1): 23, …, H(NULL): 5, …, H(bt): 2

[Figure: Secure Select & Aggregate Protocol; output: 5]
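
The two vectors can be sketched as follows. The slides do not specify the hash function or the table size, so this example assumes SHA-256 reduced to a fixed number of buckets `M`; those choices, the function names, and the sample column are hypothetical. The dot product is computed in plaintext here, whereas the protocol would compute it under encryption with Select & Aggregate.

```python
import hashlib

M = 4096  # number of hash buckets; an assumed parameter both parties agree on

def bucket(value):
    """Map a value to a bucket index using SHA-256 (an illustrative choice of H)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % M

def alice_hashmap(target="NULL"):
    # Alice's selector: 1 at H(target), 0 everywhere else.
    vec = [0] * M
    vec[bucket(target)] = 1
    return vec

def bob_counting_hashmap(column):
    # Bob's vector: number of occurrences per hash bucket.
    vec = [0] * M
    for value in column:
        vec[bucket(value)] += 1
    return vec

# Hypothetical ZIP column held by Bob, containing five NULL entries.
column = ["94043", "NULL", "01000", "NULL", "NULL", "94043", "NULL", "NULL"]
u = alice_hashmap()
v = bob_counting_hashmap(column)
null_count = sum(ui * vi for ui, vi in zip(u, v))
print(null_count)
```

If another value hashes to the same bucket as NULL, the count is inflated, so in this sketch `M` trades vector length against collision probability.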

Page 23

Validity Evaluation Setup

Example: Alice wants to know how many of Bob’s entries are in the range [C,E].

Query Privacy: Bob does not discover the range of Alice’s searches.

Data Privacy: Alice discovers nothing else about Bob’s data.

Trick: Bob generates a histogram vector, Alice generates a binary selector vector on the support of the histogram.

Histogram of attribute:  0 1 4 6 7 2 0 1
Binary vector:           0 0 0 1 1 1 0 0

[Figure: histogram of attribute; axis labels A, B, C, D, E, F, G, Z]

Page 24

Validity Evaluation Protocol

As before, Alice and Bob run the Select & Aggregate protocol on Alice’s selector vector and Bob’s histogram.

By construction, the protocol reveals the number of “valid” values to Alice.

The protocol works for arbitrary range queries, uniqueness, and timeliness.

Binary vector:           0 0 0 1 1 1 0 0
Histogram of attribute:  0 1 4 6 7 2 0 1

[Figure: Secure Select & Aggregate Protocol; output: 15; histogram axis labels A, B, C, D, E, F, G, Z]
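
The histogram and selector from the slide can be reproduced in plaintext. This is a small sketch with hypothetical function names; the bins are numbered 0 to 7 rather than lettered, and Bob's raw values are made up so that they produce the histogram shown on the slide. In the protocol, the final dot product would be evaluated under encryption.

```python
from collections import Counter

def build_histogram(values, support):
    """Bob: count how many values fall on each point of the agreed support."""
    counts = Counter(values)
    return [counts.get(point, 0) for point in support]

def range_selector(support, low, high):
    """Alice: binary vector with 1 for every support point inside [low, high]."""
    return [1 if low <= point <= high else 0 for point in support]

support = list(range(8))  # agreed ordering of the attribute's possible values

# Hypothetical raw data chosen to reproduce the slide's histogram 0 1 4 6 7 2 0 1.
bob_values = [1] * 1 + [2] * 4 + [3] * 6 + [4] * 7 + [5] * 2 + [7] * 1

histogram = build_histogram(bob_values, support)
selector = range_selector(support, 3, 5)
valid = sum(s * h for s, h in zip(selector, histogram))
print(histogram, selector, valid)  # valid is 6 + 7 + 2 = 15
```

Both vectors have one entry per histogram bin, so the cost of the encrypted dot product depends on the number of bins, not on the number of tuples.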

Page 25

Consistency Evaluation Setup

Example: Alice wants to know how many of Bob’s entries follow correct dependencies among attributes, e.g., State – Zipcode.

Query Privacy: Bob doesn’t discover which dependencies Alice is checking.

Data Privacy: Alice discovers nothing else about Bob’s data.

Trick: Bob generates a vector of observed associations, Alice generates a vector of desired associations.

Observed dependencies:  1 0 1 1 1 0 0 1
Expected dependencies:  1 1 0 1 1 0 1 1

Page 26

Alice and Bob agree upon an ordering of attribute values.

They also agree on a vectorization (flattening) pattern.

Need to securely compute how many of Bob’s dependencies are consistent with Alice’s rules.

Desired Dependencies

        CA  MA  MN  …
94304   1   0   0   0
55414   0   0   1   0
02139   0   1   0   0
94305   1   0   0   0
…

Observed Dependencies

        CA  MA  MN  …
94304   0   0   1   0
55414   0   0   1   0
02139   0   1   0   0
94305   1   0   0   0
…
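
The flattening step can be sketched in plaintext using the ZIP and state values shown in the two matrices. The names below are hypothetical, only the three labelled state columns are included, and a row-major pattern is assumed as the agreed vectorization. In the protocol, the dot product would be evaluated under encryption with Select & Aggregate.

```python
from collections import Counter

# Ordering of attribute values that Alice and Bob agree on.
zips = ["94304", "55414", "02139", "94305"]
states = ["CA", "MA", "MN"]

def flatten(pair_counts):
    """Row-major vectorization of the ZIP x State matrix (assumed pattern)."""
    return [pair_counts.get((z, s), 0) for z in zips for s in states]

# Alice's desired dependencies: 1 for each (ZIP, State) pair her rules allow.
desired = {
    ("94304", "CA"): 1,
    ("55414", "MN"): 1,
    ("02139", "MA"): 1,
    ("94305", "CA"): 1,
}

# Bob's records; the first pairs ZIP 94304 with MN, as in the observed matrix.
records = [("94304", "MN"), ("55414", "MN"), ("02139", "MA"), ("94305", "CA")]
observed = Counter(records)

alice_vector = flatten(desired)
bob_vector = flatten(observed)
consistent = sum(a * b for a, b in zip(alice_vector, bob_vector))
print(consistent)  # 3 of Bob's 4 records agree with Alice's rules
```

Because Bob's vector holds counts, repeated records with the same pair are all counted, and pairs that Alice's rules do not allow contribute nothing.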

Page 27

Consistency Evaluation Protocol

Alice and Bob run the Select & Aggregate protocol on Alice’s desired rule vector and Bob’s observed rule vector.

The protocol reveals the number of “valid” dependencies to Alice.

Works for dependencies among arbitrary attribute combinations.

Expected dependencies:  1 1 0 1 1 0 1 1
Observed dependencies:  1 0 1 1 1 0 0 1

[Figure: Secure Select & Aggregate Protocol; output: 4]

Page 28

Computational Complexity

Metrics                           Proposed Protocols        Using PSI-CA
Completeness                      O(# uniques)              O(# tuples)
Validity, Timeliness, Uniqueness  O(# histogram bins)       O(# tuples)
Consistency                       O((# histogram bins)^m)   O((# tuples)^m)

[Figure: AZ 2012 votes; categories D, R, L, G; # uniques = # bins = 4; # tuples = 2,306,559]

Page 29

Conclusions & Discussion

• An important subclass of privacy-preserving data mining. Precursor to collaboration among untrusting entities.

• Existing protocols, e.g., PSI-CA, have high computational overhead.

• Can efficiently evaluate many DQ metrics via homomorphic operations on reduced-dimensionality descriptions.

• Future work:
  – DQ for non-numeric attributes.
  – Efficient protocols for testing sparse dependencies.
  – Extremely difficult: Private evaluation of reliability of data.

{jfreudig,srane}@parc.com