Privacy-Preserving Data Quality Assessment for High-Fidelity Data Sharing
Julien Freudiger, Shantanu Rane, Alex Brito, and Ersin Uzun
PARC
2
First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL
$
3
What about data quality?
Alice does not know data quality prior to acquisition
Dirty data costs US businesses ~$600 billion annually[1]
Data cleaning accounts for up to 80% of development time
First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL
[1] W. Eckerson. Data quality and the bottom line. TDWI Report, The Data Warehouse Institute, 2002
[Pie chart: Data Cleaning 80%, Data Exploration 20%]
4
First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL
Privacy concerns for Bob
5
Alice: How many rows are complete?
Bob: All of them
First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL
Trust and privacy concerns for Alice
6
Problem
Privacy-Preserving Data Quality Assessment
7
Data Quality Metrics
Integrity constraints on attributes
=, >, [ ], e.g., age > 0
Dependency constraints across 2+ attributes
if, while, for, e.g., if state == CA, then ZIP in [94000, 96199]
Many data quality metrics [1,2]: Completeness, Validity, Uniqueness, Consistency, Timeliness
[1] Y. Lee, D. Strong, B. Kahn, and R. Wang. AIMQ: a methodology for information quality assessment. Information & management, 40(2), 2002
[2] P. Cykana, A. Paul, and M. Stern. DoD Guidelines on Data Quality Management. In IQ, pages 154–171, 1996
8
Data Quality Metrics
Completeness: Percentage of elements that are properly populated
Check for values such as NULL, “”, …
First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL
9
Data Quality Metrics
Validity: Percentage of elements whose attributes possess meaningful values
First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL
10
Data Quality Metrics
Consistency: Degree to which the data attributes satisfy dependency constraints
First Name  Last Name  Age  State  ZIP
John        Steinbeck  32   CA     94043
Jimi        Hendrix    27   WA     01000
Isaac       Asimov     -15  NY     NULL
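To make the three metric definitions concrete, here is a minimal sketch that evaluates them in the clear (no privacy protection) on the example table. The field names, the use of None for NULL, and the choice to treat rows outside the rule's scope as consistent are assumptions made for illustration; the two constraints (age > 0 and the CA ZIP range) are the ones quoted on the metrics slide.

```python
# Plain (non-private) computation of the three metrics on the example table.
# None stands in for NULL; field names are hypothetical.
rows = [
    {"first": "John", "last": "Steinbeck", "age": 32, "state": "CA", "zip": "94043"},
    {"first": "Jimi", "last": "Hendrix", "age": 27, "state": "WA", "zip": "01000"},
    {"first": "Isaac", "last": "Asimov", "age": -15, "state": "NY", "zip": None},
]


def completeness(rows):
    """Fraction of cells that are populated (neither None nor empty string)."""
    cells = [value for row in rows for value in row.values()]
    populated = [value for value in cells if value not in (None, "")]
    return len(populated) / len(cells)


def validity(rows, attribute, is_valid):
    """Fraction of rows whose attribute satisfies an integrity constraint."""
    return sum(1 for row in rows if is_valid(row[attribute])) / len(rows)


def consistency(rows, rule):
    """Fraction of rows that satisfy a dependency constraint."""
    return sum(1 for row in rows if rule(row)) / len(rows)


def ca_zip_rule(row):
    """If state == CA, then ZIP must lie in [94000, 96199]."""
    if row["state"] != "CA":
        return True  # assumption: the rule only constrains CA rows
    return row["zip"] is not None and 94000 <= int(row["zip"]) <= 96199


print(completeness(rows))                          # 14 of 15 cells are populated
print(validity(rows, "age", lambda age: age > 0))  # 2 of 3 ages are positive
print(consistency(rows, ca_zip_rule))              # all 3 rows satisfy the CA rule
```

The single CA rule does not flag the WA row; catching it would need an additional dependency rule for WA, which the slides do not give.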
11
Desired Privacy Properties
Query Privacy: Bob should not learn the data quality constraint parameters and the resulting values.
Data Privacy: Alice should not learn anything from Bob’s data besides the quality metric.
12
Application: High-Fidelity Cyber Threat Mitigation
[1] S. Katti, B. Krishnamurthy, and D. Katabi. Collaborating against common enemies. In IMC, 2005
[2] J. Zhang, P. A. Porras, and J. Ullrich. Highly predictive blacklisting. In USENIX Security, 2008
[3] P. Porras and V. Shmatikov. Large-scale collection and sanitization of network security data: risks and challenges. In NSPW, 2006
[Figure: four data tables, each with columns IP, Port, Time, UID, APT]
13
Solutions
Rely on existing cryptographic primitives
Develop custom solution
14
Private Set Intersection
Set intersection or cardinality of set intersection
[1] M. Freedman, K. Nissim, and B. Pinkas. Efficient private matching and set intersection. In EUROCRYPT, 2004
[2] E. De Cristofaro, P. Gasti, and G. Tsudik. Fast and Private Computation of Cardinality of Set Intersection and Union. In CANS, 2012
15
Private Set Intersection Completeness
Query {NULL} expanded to: (1, NULL), (2, NULL), …, (n, NULL)
Data {d1, …, dn} expanded to: (1, d1), (2, d2), …, (n, dn)
PSI-CA approach is inefficient
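The index-tagging idea behind the PSI-CA approach can be sketched as follows. This is only a plaintext illustration: an ordinary set intersection stands in for the cryptographic protocol, and the ZIP column is hypothetical data. It shows why the cost grows with the number of tuples: both sets must contain one element per row.

```python
def tagged(values):
    """Pair each value with its 1-based row index so repeated values stay distinct."""
    return {(i, value) for i, value in enumerate(values, start=1)}


bob_column = ["94043", "01000", None, "94110", None]  # hypothetical ZIP column
n = len(bob_column)

alice_set = tagged([None] * n)  # (1, NULL), (2, NULL), ..., (n, NULL)
bob_set = tagged(bob_column)    # (1, d1), (2, d2), ..., (n, dn)

# A real PSI-CA protocol would reveal only this cardinality, computed obliviously.
null_count = len(alice_set & bob_set)
print(null_count, "NULLs found; each party contributed", n, "set elements")
```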
16
Encrypted-domain Computation
Given E(d1) and E(d2), the product E(d1) * E(d2) decrypts to d1 + d2
[1] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In EUROCRYPT, 1999
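The additive property can be checked with a toy version of the Paillier cryptosystem [1]. This is a minimal sketch, not a secure implementation: the two Mersenne primes are far too small for real use, the function names are hypothetical, and the generator is fixed to g = n + 1 for simplicity. It assumes Python 3.8 or later for the modular inverse via pow.

```python
import math
import secrets

# Toy parameters: two Mersenne primes. Illustration only, not secure.
P = 2**31 - 1
Q = 2**61 - 1


def keygen(p=P, q=Q):
    """Return the public modulus n and the private key (lambda, mu)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # inverse of lambda mod n, valid for g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """E(m) = (1 + m*n) * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return ((1 + m * n) * pow(r, n, n2)) % n2


def decrypt(n, private_key, c):
    """D(c) = L(c^lambda mod n^2) * mu mod n, with L(x) = (x - 1) // n."""
    lam, mu = private_key
    return ((pow(c, lam, n * n) - 1) // n) * mu % n


n, private_key = keygen()
c1, c2 = encrypt(n, 5), encrypt(n, 7)
product = (c1 * c2) % (n * n)            # multiply ciphertexts...
print(decrypt(n, private_key, product))  # ...and the plaintexts are added
```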
17
Select & Aggregate Setup
Goal: Alice has a binary selector vector u, Bob has a data vector v. Alice should discover the sum of the selected elements of v.
Query Privacy: Bob should not find the selector vector.
Data Privacy: Alice should not discover any information other than the selected aggregate.
[Figure: Secure Select & Aggregate Protocol]
18
Select & Aggregate Protocol
1. Alice sends element-wise encryptions of u to Bob.
2. Bob computes the encrypted dot product of u and v using the additive homomorphic property, and sends it to Alice.
3. Alice decrypts the dot product.
[Figure: Secure Select & Aggregate Protocol]
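The three protocol steps can be sketched end to end with a toy Paillier instance. The assumptions are the same as for any illustration of this kind: the primes are too small to be secure, the function names (alice_query, bob_respond) are hypothetical, both roles run in one process, and Python 3.8 or later is assumed. Bob never sees u in the clear; he raises each ciphertext E(u_i) to the power v_i and multiplies the results, which yields an encryption of the dot product.

```python
import math
import secrets


def keygen(p=2**31 - 1, q=2**61 - 1):
    """Toy Paillier keys from two Mersenne primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))


def encrypt(n, m):
    r = secrets.randbelow(n - 1) + 1
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return ((1 + m * n) * pow(r, n, n * n)) % (n * n)


def decrypt(n, private_key, c):
    lam, mu = private_key
    return ((pow(c, lam, n * n) - 1) // n) * mu % n


def alice_query(n, u):
    """Step 1: Alice encrypts each selector bit separately."""
    return [encrypt(n, bit) for bit in u]


def bob_respond(n, encrypted_u, v):
    """Step 2: Bob computes E(sum of u_i * v_i) without decrypting anything."""
    result = 1  # multiplicative identity, a trivial encryption of 0
    for c, value in zip(encrypted_u, v):
        # E(u_i)^v_i is an encryption of u_i * v_i; multiplying adds plaintexts.
        result = (result * pow(c, value, n * n)) % (n * n)
    return result


n, private_key = keygen()
u = [0, 0, 0, 1, 1, 1, 0, 0]  # Alice's binary selector
v = [0, 1, 4, 6, 7, 2, 0, 1]  # Bob's data vector
reply = bob_respond(n, alice_query(n, u), v)
aggregate = decrypt(n, private_key, reply)  # Step 3: Alice decrypts
print(aggregate)
```

For vectors of length K, this sketch performs K encryptions and one decryption on Alice's side, and K exponentiations and K multiplications on Bob's side.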
19
Select & Aggregate Complexity
Cannot afford O(#tuples) complexity for large databases.
                     Alice  Bob
# Encryptions        K      0
# Decryptions        1      0
# Multiplications    0      K
# Exponentiations    0      K
# Transmissions      K      1
20
Key Idea
1. Find a suitable low-dimensional representation.
2. Use Select & Aggregate to evaluate quality metric.
21
Completeness Evaluation Setup
Example: Alice wants to find the number of NULL values in Bob’s data.
Query Privacy: Bob does not discover that Alice is searching for the number of NULLs.
Data Privacy: Alice discovers nothing else about Bob’s data.
Trick: Alice generates a HashMap, Bob generates a Counting HashMap.
HashMap: 0, ..., H(NULL): 1, ..., 0
Counting HashMap: H(b1): 23, ..., H(NULL): 5, ..., H(bt): 2
22
Completeness Evaluation Protocol
Alice generates a public encryption key and a private decryption key for an additively homomorphic cryptosystem.
The parties evaluate Select & Aggregate on Alice’s HashMap and Bob’s Counting HashMap.
By construction, the protocol reveals the number of NULLs to Alice.
HashMap: 0, ..., H(NULL): 1, ..., 0
Counting HashMap: H(b1): 23, ..., H(NULL): 5, ..., H(bt): 2
[Figure: Secure Select & Aggregate Protocol, output: 5]
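The two hashmaps can be sketched in the clear as below. Several details are assumptions, since the slides do not specify them: SHA-256 over the value's repr stands in for H, the maps are Python dictionaries keyed by the full digest, and the column is hypothetical data. How the two parties align bucket positions into equal-length vectors is glossed over here; a fixed-size table would work but could overcount if another value collides with NULL's bucket.

```python
import hashlib
from collections import Counter


def h(value):
    """Shared hash function H; SHA-256 is an assumption, not the authors' choice."""
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()


def alice_hashmap(target):
    """Alice's selector: 1 at H(target), implicitly 0 everywhere else."""
    return {h(target): 1}


def bob_counting_hashmap(column):
    """Bob's counts: one entry per unique value, not one per tuple."""
    return Counter(h(value) for value in column)


def dot(selector, counts):
    """Plaintext stand-in for the Select & Aggregate dot product."""
    return sum(bit * counts.get(key, 0) for key, bit in selector.items())


column = ["94043", "01000", None, "94110", None, "94043"]  # hypothetical data
selector = alice_hashmap(None)
counts = bob_counting_hashmap(column)

null_count = dot(selector, counts)
print(null_count, "NULLs;", len(counts), "unique values in", len(column), "tuples")
```

In the actual protocol the dot product would be evaluated on Alice's encrypted selector, so Bob does not learn which bucket she is asking about.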
23
Validity Evaluation Setup
Example: Alice wants to know how many of Bob’s entries are in the range [C,E].
Query Privacy: Bob does not discover the range of Alice’s searches.
Data Privacy: Alice discovers nothing else about Bob’s data.
Trick: Bob generates a histogram vector, Alice generates a binary selector vector on the support of the histogram.
Histogram of attribute: 0 1 4 6 7 2 0 1
Binary vector: 0 0 0 1 1 1 0 0
[Figure: histogram bin labels A, B, C, D, E, F, G, Z]
24
Validity Evaluation Protocol
As before, Alice and Bob run the Select & Aggregate protocol on Alice’s selector vector and Bob’s histogram.
By construction, the protocol reveals the number of “valid” values to Alice.
The protocol works for arbitrary range queries, uniqueness, and timeliness.
Binary vector: 0 0 0 1 1 1 0 0
Histogram of attribute: 0 1 4 6 7 2 0 1
[Figure: Secure Select & Aggregate Protocol, output: 15]
[Figure: histogram bin labels A, B, C, D, E, F, G, Z]
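The histogram and selector construction can be sketched in the clear as follows. The integer bin labels 0 to 7 and the raw values are hypothetical, chosen so that the resulting vectors match the ones shown on the slide; the dot product is computed in plaintext here, whereas the protocol would evaluate it under encryption.

```python
def histogram(values, bins):
    """Bob: count how many values fall on each bin of the agreed support."""
    index = {label: i for i, label in enumerate(bins)}
    counts = [0] * len(bins)
    for value in values:
        counts[index[value]] += 1
    return counts


def range_selector(bins, low, high):
    """Alice: 1 for every bin inside [low, high], 0 elsewhere."""
    return [1 if low <= label <= high else 0 for label in bins]


bins = list(range(8))  # support agreed by both parties (hypothetical labels)

# Hypothetical attribute values that reproduce the slide's histogram.
values = [1] + [2] * 4 + [3] * 6 + [4] * 7 + [5] * 2 + [7]

v = histogram(values, bins)       # [0, 1, 4, 6, 7, 2, 0, 1]
u = range_selector(bins, 3, 5)    # [0, 0, 0, 1, 1, 1, 0, 0]

valid_count = sum(ui * vi for ui, vi in zip(u, v))
print(valid_count)                # 6 + 7 + 2 = 15 values inside the range
```

Both vectors have one entry per histogram bin, so the cost depends on the number of bins rather than on the number of tuples.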
25
Consistency Evaluation Setup
Example: Alice wants to know how many of Bob’s entries follow correct dependencies among attributes, e.g., State – Zipcode.
Query Privacy: Bob doesn’t discover which dependencies Alice is checking.
Data Privacy: Alice discovers nothing else about Bob’s data.
Trick: Bob generates a vector of observed associations, Alice generates a vector of desired associations.
Observed dependencies: 1 0 1 1 1 0 0 1
Expected dependencies: 1 1 0 1 1 0 1 1
26
Alice and Bob agree upon an ordering of attribute values.
They also agree on a vectorization (flattening) pattern.
Need to securely compute how many of Bob’s dependencies are consistent with Alice’s rules.

Desired Dependencies
        CA  MA  MN  …
94304   1   0   0   0
55414   0   0   1   0
02139   0   1   0   0
94305   1   0   0   0
…

Observed Dependencies
        CA  MA  MN  …
94304   0   0   1   0
55414   0   0   1   0
02139   0   1   0   0
94305   1   0   0   0
…
27
Consistency Evaluation Protocol
Alice and Bob run the Select & Aggregate protocol on Alice’s desired rule vector and Bob’s observed rule vector.
The protocol reveals the number of “valid” dependencies to Alice.
Works for dependencies among arbitrary attribute combinations.
Expected dependencies: 1 1 0 1 1 0 1 1
Observed dependencies: 1 0 1 1 1 0 0 1
[Figure: Secure Select & Aggregate Protocol, output: 4]
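The flattening step and the resulting count can be sketched in the clear as follows. The ZIP and state orderings are taken from the dependency tables in the slides, restricted to the three named states; row-major flattening and the helper names are assumptions. The last lines repeat the computation for the 8-entry expected and observed vectors shown on the slide.

```python
states = ["CA", "MA", "MN"]                  # agreed column ordering
zips = ["94304", "55414", "02139", "94305"]  # agreed row ordering


def association_matrix(pairs):
    """1 where a (ZIP, state) association is present, 0 elsewhere."""
    present = set(pairs)
    return [[1 if (z, s) in present else 0 for s in states] for z in zips]


def flatten(matrix):
    """Row-major vectorization; both parties must use the same pattern."""
    return [cell for row in matrix for cell in row]


def matches(u, v):
    """Plaintext stand-in for the Select & Aggregate dot product."""
    return sum(a * b for a, b in zip(u, v))


desired = association_matrix(
    [("94304", "CA"), ("55414", "MN"), ("02139", "MA"), ("94305", "CA")]
)
observed = association_matrix(
    [("94304", "MN"), ("55414", "MN"), ("02139", "MA"), ("94305", "CA")]
)

consistent = matches(flatten(desired), flatten(observed))
print(consistent)  # 94304 is paired with MN instead of CA, so 3 of 4 agree

expected_bits = [1, 1, 0, 1, 1, 0, 1, 1]  # vectors shown on the slide
observed_bits = [1, 0, 1, 1, 1, 0, 0, 1]
print(matches(expected_bits, observed_bits))
```

The flattened vectors have one entry per combination of attribute values, which is why the cost grows with the number of histogram bins raised to the number of attributes involved.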
28
Computational Complexity
[Example dataset: AZ 2012 votes; attribute values D, R, L, G; # uniques = # bins = 4; # tuples = 2,306,559]
Metrics                            Proposed Protocols         Using PSI-CA
Completeness                       O(# uniques)               O(# tuples)
Validity, Timeliness, Uniqueness   O(# histogram bins)        O(# tuples)
Consistency                        O((# histogram bins)^m)    O((# tuples)^m)
29
Conclusions & Discussion
• An important subclass of privacy-preserving data mining. Precursor to collaboration among untrusting entities.
• Existing protocols, e.g., PSI-CA, have high computational overhead.
• Can efficiently evaluate many DQ metrics via homomorphic operations on reduced-dimensionality descriptions.
• Future work:
  – DQ for non-numeric attributes.
  – Efficient protocols for testing sparse dependencies.
  – Extremely difficult: Private evaluation of reliability of data.
{jfreudig,srane}@parc.com