TRANSCRIPT
CS573 Data Privacy and Security
Midterm Review
Li Xiong
Department of Mathematics and Computer Science
Emory University
Principles of Data Security – CIA Triad
• Confidentiality
– Prevent the disclosure of information to unauthorized users
• Integrity
– Prevent improper modification
• Availability
– Make data available to legitimate users
Privacy vs. Confidentiality
• Confidentiality
– Prevent disclosure of information to unauthorized users
• Privacy
– Prevent disclosure of personal information to unauthorized users
– Control of how personal information is collected and used
– Prevent identification of individuals
11/8/2016 3
Data Privacy and Security Measures
• Access control
– Restrict access to the (subset or view of) data to authorized users
• Cryptography
– Use encryption to encode information so it can be read only by authorized users (protected in transit and in storage)
• Inference control
– Prevent inference from accessible data to sensitive individual information that is not directly accessible
• Technologies
– De-identification and Anonymization (input perturbation)
– Differential Privacy (output perturbation)
Inference Control
Original Data → De-identification / Anonymization → Sanitized Records
Traditional De-identification and Anonymization
• Attribute suppression, encoding, perturbation, generalization
• Subject to re-identification and disclosure attacks
Differentially Private Data Sharing
Original Data → Differential Privacy → Statistics / Models / Synthetic Records
Statistical Data Sharing with Differential Privacy
• Macro data (as opposed to micro data)
• Output perturbation (as opposed to input perturbation)
• More rigorous guarantee
Cryptography
• Encoding data in a way that only authorized users can read it
Original Data → Encryption → Encrypted Data
Applications of Cryptography
• Secure data outsourcing
– Support computation and queries on encrypted data
Encrypted Data → Computation / Queries
Applications of Cryptography
• Multi-party secure computations (secure function evaluation)
– Securely compute a function without revealing private inputs
Parties with private inputs x1, x2, …, xn jointly compute f(x1, x2, …, xn)
Applications of Cryptography
• Private information retrieval (access privacy)
– Retrieve data without revealing the query (access pattern)
Course Topics
• Inference control – De-identification and anonymization
– Differential privacy foundations
– Differential privacy applications • Histograms
• Data mining
• Local differential privacy
• Location privacy
• Cryptography
• Access control
• Applications
k-Anonymity

Original table:
Caucas  78712  Flu
Asian   78705  Shingles
Caucas  78754  Flu
Asian   78705  Acne
AfrAm   78705  Acne
Caucas  78705  Flu

Anonymized table:
Caucas       787XX  Flu
Asian/AfrAm  78705  Shingles
Caucas       787XX  Flu
Asian/AfrAm  78705  Acne
Asian/AfrAm  78705  Acne
Caucas       787XX  Flu

Quasi-identifiers (QID) = race, zipcode
Sensitive attribute = diagnosis
k-anonymity: the size of each QID group is at least k
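The group-size condition is easy to check mechanically. A minimal sketch (the table literal mirrors the anonymized table above; the helper name is ours):

```python
from collections import Counter

def is_k_anonymous(records, qid_indices, k):
    """Every quasi-identifier group must have size >= k."""
    groups = Counter(tuple(r[i] for i in qid_indices) for r in records)
    return all(count >= k for count in groups.values())

# The anonymized table above; QID = (race, zipcode)
table = [
    ("Caucas", "787XX", "Flu"),
    ("Asian/AfrAm", "78705", "Shingles"),
    ("Caucas", "787XX", "Flu"),
    ("Asian/AfrAm", "78705", "Acne"),
    ("Asian/AfrAm", "78705", "Acne"),
    ("Caucas", "787XX", "Flu"),
]
print(is_k_anonymous(table, (0, 1), 3))   # True: both QID groups have 3 rows
```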
Problem of k-anonymity

External knowledge:
…                  …       …
Rusty Shackleford  Caucas  78705
…                  …       …

Published table:
Caucas       787XX  Flu
Asian/AfrAm  78705  Shingles
Caucas       787XX  Flu
Asian/AfrAm  78705  Acne
Asian/AfrAm  78705  Acne
Caucas       787XX  Flu

Problem: sensitive attributes are not “diverse” within each quasi-identifier group
l-Diversity

Caucas       787XX  Flu
Caucas       787XX  Shingles
Caucas       787XX  Acne
Caucas       787XX  Flu
Caucas       787XX  Acne
Caucas       787XX  Flu
Asian/AfrAm  78XXX  Flu
Asian/AfrAm  78XXX  Flu
Asian/AfrAm  78XXX  Acne
Asian/AfrAm  78XXX  Shingles
Asian/AfrAm  78XXX  Acne
Asian/AfrAm  78XXX  Flu

Entropy l-diversity: the entropy of the sensitive-attribute distribution within each quasi-identifier group must be at least log(l)
[Machanavajjhala et al. ICDE ‘06]
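Entropy l-diversity can likewise be checked directly by comparing each group's sensitive-attribute entropy against log(l). A sketch (function names are ours; the table is the one above):

```python
import math
from collections import Counter, defaultdict

def entropy_l_diverse(records, qid_indices, sa_index, l):
    """Entropy l-diversity: each QID group's sensitive-attribute
    entropy must be at least log(l)."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[i] for i in qid_indices)].append(r[sa_index])
    for values in groups.values():
        counts = Counter(values)
        n = len(values)
        entropy = -sum(c / n * math.log(c / n) for c in counts.values())
        if entropy < math.log(l):
            return False
    return True

# The table above: each group holds 3 Flu, 2 Acne, 1 Shingles
caucas = [("Caucas", "787XX", d) for d in ["Flu", "Shingles", "Acne", "Flu", "Acne", "Flu"]]
asian = [("Asian/AfrAm", "78XXX", d) for d in ["Flu", "Flu", "Acne", "Shingles", "Acne", "Flu"]]
table = caucas + asian
print(entropy_l_diverse(table, (0, 1), 2, l=2))   # True: group entropy ≈ 1.01 ≥ ln 2
```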
Problem with l-diversity

Original dataset (99% HIV-):
… HIV-  … HIV-  … HIV-  … HIV-  … HIV-  … HIV+
… HIV-  … HIV-  … HIV-  … HIV-  … HIV-  … HIV-

Anonymization A:
Q1: HIV+, HIV-, HIV+, HIV-, HIV+, HIV-
Q2: HIV-, HIV-, HIV-, HIV-, HIV-, HIV-
Each 50%-HIV- quasi-identifier group is “diverse”, yet this leaks a ton of information.

Anonymization B:
Q1: HIV-, HIV-, HIV-, HIV+, HIV-, HIV-
Q2: HIV-, HIV-, HIV-, HIV-, HIV-, Flu
Each 99%-HIV- quasi-identifier group is not “diverse”, yet the anonymized database does not leak anything.

t-Closeness

Caucas       787XX  Flu
Caucas       787XX  Shingles
Caucas       787XX  Acne
Caucas       787XX  Flu
Caucas       787XX  Acne
Caucas       787XX  Flu
Asian/AfrAm  78XXX  Flu
Asian/AfrAm  78XXX  Flu
Asian/AfrAm  78XXX  Acne
Asian/AfrAm  78XXX  Shingles
Asian/AfrAm  78XXX  Acne
Asian/AfrAm  78XXX  Flu

t-closeness: the distribution of sensitive attributes within each quasi-identifier group should be “close” to their distribution in the entire original database
[Li et al. ICDE ‘07]
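t-closeness compares each group's sensitive-value distribution with the overall one; for categorical attributes under the equal-distance ground metric, the Earth Mover's Distance reduces to total variation distance. A rough sketch using the table values above (function names are ours):

```python
from collections import Counter

def distribution(values):
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}

def tv_distance(p, q):
    """Total variation distance; equals EMD under the equal-distance
    ground metric typically used for categorical attributes."""
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in set(p) | set(q))

caucas_diag = ["Flu", "Shingles", "Acne", "Flu", "Acne", "Flu"]
asian_diag = ["Flu", "Flu", "Acne", "Shingles", "Acne", "Flu"]
overall = distribution(caucas_diag + asian_diag)
group = distribution(caucas_diag)
print(tv_distance(group, overall))   # 0.0: each group matches the overall distribution
```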
Problems with Syntactic Privacy Notions
• Syntactic
– Focuses on data transformation, not on what can be learned from the anonymized dataset
• “Quasi-identifier” fallacy
– Assumes a priori that attacker will not know certain information about his target
– Attacker may know the records in the database or external information
Course Topics
• Inference control
– De-identification and anonymization
– Differential privacy foundations
– Differential privacy applications
• Histograms
• Data mining
• Location privacy
• Cryptography
• Access control
• Applications
Statistical Data Release: Differential Privacy
• Statistical outcome is indistinguishable regardless of whether a particular user (record) is included in the data
Original records → Original histogram → Perturbed histogram with differential privacy
Differential Privacy
A privacy mechanism A gives ε-differential privacy if for all neighboring databases D, D’, and for any possible output S ∈ Range(A),
Pr[A(D) = S] ≤ exp(ε) × Pr[A(D’) = S]
• D and D’ are neighboring databases if they differ in one record
Laplace Mechanism
Add Laplace noise to the true output: A(D) = f(D) + Lap(Δf/ε)
Global sensitivity: Δf = max D,D’ |f(D) − f(D’)|
Example: Laplace Mechanism
• For a single counting query Q over a dataset D, returning Q(D)+Laplace(1/ε) gives ε-differential privacy.
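A minimal sketch of the Laplace mechanism for a counting query (sensitivity 1), using only the standard library; `laplace_noise` samples Laplace(0, scale) as a difference of two exponentials (names and data are illustrative):

```python
import random

def laplace_noise(scale):
    # Difference of two i.i.d. exponentials is Laplace(0, scale)
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_count(data, predicate, epsilon):
    """A counting query has global sensitivity 1, so Lap(1/epsilon)
    noise yields epsilon-differential privacy."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon)

ages = [42, 31, 28, 55, 63, 19]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=1.0)   # true answer is 3
```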
Exponential Mechanism
For a database D, output space R, and a utility score function u : D × R → ℝ, the algorithm A with
Pr[A(D) = r] ∝ exp(ε × u(D, r) / (2Δu))
satisfies ε-differential privacy, where Δu is the sensitivity of the utility score function:
Δu = max r, D,D’ |u(D, r) − u(D’, r)|
Example: Exponential Mechanism
• Scoring/utility function u: inputs × outputs → ℝ
• D: nationalities of a set of people
• f(D): most frequent nationality in D
• u(D, O) = #(D, O), the number of people with nationality O
[Source: Tutorial “Differential Privacy in the Wild”, Module 2]
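A sketch of the exponential mechanism for the most-frequent-nationality example; the data and function names are illustrative:

```python
import math
import random
from collections import Counter

def exponential_mechanism(data, outputs, u, epsilon, sensitivity):
    """Sample r with probability proportional to exp(eps * u(D, r) / (2 * Δu))."""
    weights = [math.exp(epsilon * u(data, r) / (2 * sensitivity)) for r in outputs]
    return random.choices(outputs, weights=weights, k=1)[0]

nationalities = ["US", "US", "US", "UK", "CN"]   # illustrative data
outputs = sorted(set(nationalities))

def u(d, r):
    return Counter(d)[r]   # u(D, O) = number of people with nationality O

# A count changes by at most 1 when one record changes, so Δu = 1
winner = exponential_mechanism(nationalities, outputs, u, epsilon=2.0, sensitivity=1)
```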
Composition theorems
• Sequential composition: running a sequence of εi-differentially private mechanisms on the same data gives (∑i εi)-differential privacy
• Parallel composition: applying εi-differentially private mechanisms to disjoint subsets of the data gives max(εi)-differential privacy
Differential Privacy
• Differential privacy ensures an attacker can’t infer the presence or absence of a single record in the input based on any output.
• Building blocks
– Laplace, exponential mechanism
• Composition rules help build complex algorithms using building blocks
Course Topics
• Inference control
– De-identification and anonymization
– Differential privacy foundations
– Differential privacy applications
• Histograms
• Data mining
• Location privacy
• Cryptography
• Access control
• Applications
Baseline: Laplace Mechanism
• For the counting query Q on each histogram bin, returning Q(D)+Laplace(1/ε) gives ε-differential privacy.
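The baseline can be sketched as per-bin Laplace noise; under add/remove-one-record neighbors a record touches a single bin, so parallel composition gives ε-DP for the whole histogram (function names and data are ours):

```python
import random
from collections import Counter

def laplace(scale):
    # Difference of two i.i.d. exponentials is Laplace(0, scale)
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_histogram(values, bins, epsilon):
    """Release each bin count plus Lap(1/epsilon) noise. Bins are
    disjoint, so parallel composition gives epsilon-DP overall
    (under add/remove-one-record neighbors)."""
    counts = Counter(values)
    return {b: counts[b] + laplace(1 / epsilon) for b in bins}

ages = ["20s", "30s", "30s", "40s", "40s", "40s"]
hist = dp_histogram(ages, ["20s", "30s", "40s", "50s"], epsilon=1.0)
```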
DPCube [SecureDM 2010, ICDE 2012 demo]

Name   Age  Income  HIV+
Frank  42   30K     Y
Bob    31   60K     Y
Mary   28   20K     Y
…      …    …       …
Original Records

Pipeline (behind a DP interface): Original Records → DP unit histogram (ε/2-DP) → multi-dimensional partitioning → DP V-optimal histogram (ε/2-DP)
• Compute unit histogram counts with differential privacy
• Use the DP unit histogram for partitioning
• Compute V-optimal histogram counts with differential privacy
Private Spatial Decompositions [CPSSY 12]
• Examples: quadtree, kd-tree
• Need to ensure both the partitioning boundaries and the counts of each partition are differentially private
Histogram methods vs parametric methods
• Non-parametric (histogram) methods: learn the empirical distribution through histograms (Original data → Perturbation → Histogram → Synthetic data), e.g. PSD, Privelet, FP, P-HP; only work well for low-dimensional data
• Parametric methods: fit the data to a distribution and make inferences about parameters, e.g. PrivacyOnTheMap; the joint distribution is difficult to model
• DPCopula: a semi-parametric method with non-parametric estimation for each dimension
DPCopula pipeline: Original data set (Age, Hours/week, Income) → DP marginal histograms for each dimension + dependence structure (parametric estimation for the dependence) → DP synthetic data set with the same schema
• Marginal distributions + Bayesian network

PrivBayes: Bayesian Network
Example network: age → workclass, age → education, workclass → title, workclass → income
Factorization: Pr[age] · Pr[work | age] · Pr[edu | age] · Pr[title | work] · Pr[income | work]

Outline of the Algorithm
• STEP 1: Choose a suitable Bayesian network 𝒩, in a differentially private way
– Add edges with the highest mutual information: exponential mechanism
• STEP 2: Compute the conditional distributions implied by 𝒩
– Straightforward to do under differential privacy: inject noise via the Laplace mechanism
• STEP 3: Generate synthetic data by sampling from 𝒩
– Post-processing: no privacy issues
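STEP 3 is plain ancestral sampling: sample each attribute after its parents. A toy sketch with hypothetical noisy conditionals for a reduced network age → workclass → income (all probability tables are made up for illustration):

```python
import random

# Hypothetical noisy conditionals from STEP 2 (numbers are made up)
p_age = {"young": 0.3, "mid": 0.5, "old": 0.2}
p_work_given_age = {
    "young": {"private": 0.8, "gov": 0.2},
    "mid": {"private": 0.6, "gov": 0.4},
    "old": {"private": 0.5, "gov": 0.5},
}
p_income_given_work = {
    "private": {"<=50K": 0.7, ">50K": 0.3},
    "gov": {"<=50K": 0.6, ">50K": 0.4},
}

def sample(dist):
    return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

def synth_record():
    age = sample(p_age)                       # ancestral order: parents first
    work = sample(p_work_given_age[age])
    income = sample(p_income_given_work[work])
    return (age, work, income)

synthetic = [synth_record() for _ in range(1000)]
```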
Evaluation for DP Histograms
• Metrics: random range-count queries with random query predicates covering all attributes
• Absolute error: |Q̃(D) − Q(D)| for noisy answer Q̃(D) and true answer Q(D)
• Relative error: |Q̃(D) − Q(D)| / Q(D)
Course Topics
• Inference control
– De-identification and anonymization
– Differential privacy foundations
– Differential privacy applications
• Histograms
• Data mining
• Location privacy
• Cryptography
• Access control
• Applications
Frequent sequence mining (FSM)

Database D:
ID   Record
100  a→c→d
200  b→c→d
300  a→b→c→e→d
400  d→b
500  a→d→c→d

Scan D → C1 (candidate 1-seqs): {a}:3  {b}:3  {c}:4  {d}:4  {e}:1
F1 (frequent 1-seqs): {a}:3  {b}:3  {c}:4  {d}:4

Scan D → C2 (candidate 2-seqs):
{a→a}:0  {a→b}:1  {a→c}:3  {a→d}:3
{b→a}:0  {b→b}:2  {b→c}:2  {b→d}:1
{c→a}:0  {c→b}:0  {c→c}:0  {c→d}:4
{d→a}:0  {d→b}:1  {d→c}:1  {d→d}:0
F2 (frequent 2-seqs): {a→c}:3  {a→d}:3  {c→d}:4

Scan D → C3 (candidate 3-seqs): {a→c→d}
F3 (frequent 3-seqs): {a→c→d}:3
Baseline: Laplace Mechanism

Database D (as above):
ID   Record
100  a→c→d
200  b→c→d
300  a→b→c→e→d
400  d→b
500  a→d→c→d

Scan D → C1 (candidate 1-seqs), add Lap(|C1| / ε1) noise:
{a}:3+0.2  {b}:3−0.4  {c}:4+0.4  {d}:4−0.5  {e}:1+0.8
F1 (frequent 1-seqs, noisy support): {a}:3.2  {c}:4.4  {d}:3.5

Scan D → C2 (candidate 2-seqs over {a, c, d}), add Lap(|C2| / ε2) noise:
{a→a}:0+0.2  {a→c}:3+0.3  {a→d}:3+0.2
{c→a}:0−0.5  {c→c}:0+0.8  {c→d}:4+0.2
{d→a}:0+0.3  {d→c}:1+2.1  {d→d}:0−0.5
F2 (frequent 2-seqs, noisy support): {a→c}:3.3  {a→d}:3.2  {c→d}:4.2  {d→c}:3.1 (true support only 1)

Scan D → C3 (candidate 3-seqs), add Lap(|C3| / ε3) noise:
{a→c→d}:3+0  {a→d→c}:1+0.3
F3 (frequent 3-seqs, noisy support): {a→c→d}:3
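The baseline's first two levels can be sketched as follows; the subsequence-support counting uses the example database above, while the noisy thresholding and function names are illustrative:

```python
import random
from itertools import product

def laplace(scale):
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def support(db, seq):
    """Count records that contain seq as a subsequence (greedy scan)."""
    def contains(rec):
        i = 0
        for item in rec:
            if i < len(seq) and item == seq[i]:
                i += 1
        return i == len(seq)
    return sum(contains(rec) for rec in db)

def dp_fsm_two_levels(db, items, threshold, eps1, eps2):
    # Level 1: one record can change all |C1| counts, so add Lap(|C1|/eps1)
    c1 = [(x,) for x in items]
    f1 = [s for s in c1 if support(db, s) + laplace(len(c1) / eps1) >= threshold]
    # Level 2: candidates joined from frequent 1-seqs, noise Lap(|C2|/eps2)
    c2 = [a + b for a, b in product(f1, f1)]
    f2 = [s for s in c2 if support(db, s) + laplace(len(c2) / eps2) >= threshold]
    return f1, f2

db = [("a","c","d"), ("b","c","d"), ("a","b","c","e","d"), ("d","b"), ("a","d","c","d")]
f1, f2 = dp_fsm_two_levels(db, "abcde", threshold=3, eps1=1.0, eps2=1.0)
```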
PFS2 Algorithm
• Sensitivity impacted by two factors:
– Candidate size
– Sequence length
• Basic idea: reduce sensitivity
• Use the kth sample database for pruning candidate k-sequences
– Reduces candidate size
• Shrink sequences by transformation while maintaining the frequent patterns
– Reduces sequence length
Partition: Original Database → 1st sample database, 2nd sample database, …, mth sample database
DP Frequent Sequence Mining Evaluation
• Metrics
• F-score = 2 × precision × recall / (precision + recall)
• Relative error: RE = median over x ∈ X of |sup’x − supx| / supx
Course Topics
• Inference control – De-identification and anonymization
– Differential privacy foundations
– Differential privacy applications • Histograms
• Data mining
• Local differential privacy
• Location privacy
• Cryptography
• Access control
• Applications
Local Differential Privacy
• No trusted server
• Each user applies local perturbation before submitting the value to the server
• The server only aggregates the values
• Google Chrome deployment
(Figure: users browsing Finance.com, Fashion.com, WeirdStuff.com, … each report a locally perturbed value)
Randomized Response [W 65]

With probability p, report the true value; with probability 1−p, report the flipped value.

D: Disease (Y/N)   O: Reported (Y/N)
Y                  Y
Y                  N
N                  N
Y                  N
N                  Y
N                  N
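A sketch of randomized response with the standard unbiasing step, since E[observed fraction] = p·f + (1−p)(1−f) for true fraction f (data and names are illustrative):

```python
import random

def randomized_response(true_value, p):
    """Report the true bit with probability p, the flipped bit otherwise."""
    return true_value if random.random() < p else 1 - true_value

def estimate_fraction(reports, p):
    """Invert E[obs] = p*f + (1-p)*(1-f) to recover the true fraction f."""
    obs = sum(reports) / len(reports)
    return (obs - (1 - p)) / (2 * p - 1)

random.seed(0)
truth = [1] * 300 + [0] * 700                   # 30% truly have the disease
reports = [randomized_response(v, p=0.75) for v in truth]
estimate = estimate_fraction(reports, p=0.75)   # close to 0.3
```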
Differential Privacy Analysis
• Consider 2 databases D, D’ (of size M) that differ in the jth value
• D[j] ≠ D’[j]; D[i] = D’[i] for all i ≠ j
• Consider some output O: the reports for i ≠ j are equally likely under D and D’, and for the jth report the ratio Pr[O[j] | D[j]] / Pr[O[j] | D’[j]] is at most p / (1−p)
• So randomized response satisfies ε-local differential privacy with ε = ln(p / (1−p)) (for p > 1/2)
Course Topics
• Inference control – De-identification and anonymization
– Differential privacy foundations
– Differential privacy applications • Histograms
• Data mining
• Local differential privacy
• Location privacy
• Cryptography
• Access control
• Applications
Individual Location Sharing: Existing Solutions and Challenges
• Private information retrieval
• Computationally expensive
• Spatial cloaking
• Syntactic privacy notion
• Temporal correlations due to road constraints and moving patterns
• Event-level differential privacy
• protect an event (the exact location of a single user at a given time)
• Challenges:
• Large input domain (locations on the map) makes output useless
• Temporal correlations
Event-level differential privacy
Definition (Differential Privacy)
At any timestamp t, a randomized mechanism A satisfies ε-differential privacy if, for any output zt and any two locations x1 and x2:
Pr[A(x1) = zt] ≤ exp(ε) × Pr[A(x2) = zt]
Intuition: the released location zt (observed by the adversary) will not help the adversary differentiate any input locations
• Challenges:
• Distance does not capture location semantics
• Temporal correlations
Geo-indistinguishability [CCS 13]
Definition (Geo-indistinguishability)
At any timestamp t, a randomized mechanism A satisfies ε-geo-indistinguishability if, for any output zt and any two locations x1 and x2 within a circle of radius r:
Pr[A(x1) = zt] ≤ exp(εr) × Pr[A(x2) = zt]
Intuition: the released location zt (observed by the adversary) will not help the adversary differentiate any two input locations that are close to each other
Differential privacy on δ-location set under temporal correlations [CCS 15]
Definition (Differential Privacy on δ-location set)
At any timestamp t, a randomized mechanism A satisfies ε-differential privacy on δ-location set if, for any output zt and any two locations x1 and x2 in the δ-location set:
Pr[A(x1) = zt] ≤ exp(ε) × Pr[A(x2) = zt]
Intuition: the released location zt (observed by the adversary) will not help the adversary differentiate any two locations in the δ-location set, the possible set of locations a user might appear in
Course Topics
• Inference control/Differential Privacy
• Cryptography
– Foundations
– Applications
• Secure outsourcing
• Secure multiparty computations
• Private information retrieval
• Access control
• Applications
Operational model of encryption
m: plaintext; k: encryption key; k’: decryption key
Ek(m): ciphertext; Dk’(Ek(m)) = m
The attacker observes the ciphertext on the channel between E and D.
• Kerckhoff’s assumption:
– attacker knows E and D
– attacker doesn’t know the (decryption) key
• Attacker’s goal:
– to systematically recover plaintext from ciphertext
– to deduce the (decryption) key
• Attack models:
– ciphertext-only
– known-plaintext
– (adaptive) chosen-plaintext
– (adaptive) chosen-ciphertext
Cryptography Primitives
• Symmetric encryption
• Public-key encryption
• Encryption schemes with different properties
– Homomorphic encryption
– Probabilistic encryption vs deterministic encryption
– Order-preserving encryption
– Commutative encryption
Symmetric Key Cryptography
Symmetric key crypto: Bob and Alice share the same (symmetric) key KA-B
Examples: AES
Encryption: c = KA-B(m) for plaintext message m
Decryption: m = KA-B(KA-B(m))
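The shared-key round trip (encrypt and decrypt with the same key) can be illustrated with a toy XOR stream cipher that uses SHA-256 in counter mode as its keystream; this is an illustration only, not AES and not secure:

```python
import hashlib
from itertools import count

def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 of key||counter (illustration only, NOT AES)."""
    out = b""
    for i in count():
        if len(out) >= length:
            return out[:length]
        out += hashlib.sha256(key + i.to_bytes(8, "big")).digest()

def xor_cipher(key: bytes, data: bytes) -> bytes:
    # XOR is its own inverse, so the same call encrypts and decrypts
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

k = b"shared key K_A-B"
c = xor_cipher(k, b"attack at dawn")   # ciphertext
m = xor_cipher(k, c)                   # plaintext again, same key
```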
Course Topics
• Inference control/Differential Privacy
• Cryptography
– Foundations
– Applications
• Secure multiparty computations
• Secure outsourcing
• Access control
• Applications
Multi-party secure computations (secure function evaluation)
– Securely compute a function without revealing private inputs
Parties with private inputs x1, x2, …, xn jointly compute f(x1, x2, …, xn)
Security Model
• A protocol is secure if it emulates an ideal setting where the parties hand their inputs to a “trusted party,” who locally computes the desired outputs and hands them back to the parties [Goldreich-Micali-Wigderson 1987]
A holds x1 and learns f1(x1, x2); B holds x2 and learns f2(x1, x2)
Properties of the Definition
• Correctness
– All honest participants should receive the correct result of evaluating function f
• Privacy
– All corrupt participants should learn no more from the protocol than what they would learn in ideal model
– That is, only their own private input (obviously) and the result of f
Adversary Models
• Semi-honest (aka passive; honest-but-curious)
– Follows protocol, but tries to learn more from received messages than he would learn in the ideal model
• Malicious
– Deviates from the protocol in arbitrary ways, lies about his inputs, may quit at any point
Security proof tools
• Real/ideal model: the real model can be simulated in the ideal model
– Key idea: show that whatever can be computed by a party participating in the protocol can be computed based on its input and output only
– There exists a polynomial-time simulator S such that {S(x, f(x,y))} ≡ {View(x,y)}
• Composition theorem: if a protocol is secure in the hybrid model where the protocol uses a trusted party that computes the (sub)functionalities, and we replace the calls to the trusted party by calls to secure protocols, then the resulting protocol is secure
– Prove that component protocols are secure, then prove that the combined protocol is secure
Oblivious Transfer (OT) [Rabin 1981]
• Fundamental SMC primitive
• 1-out-of-2 Oblivious Transfer (OT)
– S inputs two bits m0, m1; R inputs the index b = 0 or 1 of one of S’s bits
– R learns his chosen bit mb, S learns nothing
– S does not learn which bit R has chosen; R does not learn the value of the bit that he did not choose
Secret Sharing Scheme
• Splitting
– Encode the secret as an integer S.
– Give to each player i (except one) a random integer ri.
– Give to the last player the number S − ∑i=1..n−1 ri.
• (t, n) threshold scheme: Shamir’s scheme (1979)
– It takes t points to define a polynomial of degree t−1
– Create a degree-(t−1) polynomial with the secret as the first coefficient and the remaining coefficients picked at random. Find n points on the curve and give one to each of the players. At least t points are required to fit the polynomial.
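Shamir's (t, n) scheme can be sketched over a prime field: share generation evaluates a random degree-(t−1) polynomial with f(0) = secret, and reconstruction is Lagrange interpolation at x = 0 (the prime and names are our choices):

```python
import random

P = 2**31 - 1   # a Mersenne prime; all arithmetic is in GF(P)

def make_shares(secret, t, n):
    """Shamir (t, n): evaluate a random degree-(t-1) polynomial
    with f(0) = secret at x = 1..n."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    def f(x):
        return sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 from any t (or more) shares."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = num * (-xm) % P
                den = den * (xj - xm) % P
        secret = (secret + yj * num * pow(den, P - 2, P)) % P  # Fermat inverse
    return secret

shares = make_shares(12345, t=3, n=5)
assert reconstruct(shares[:3]) == 12345   # any 3 of the 5 shares suffice
```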
General protocols
• Passively-secure computation for semi-honest model
– Yao’s garbled circuit for two-party (OT and symmetric encryption)
– GMW protocol for multiple parties (random shares and OT)
• From passively-secure protocols to actively-secure protocols for malicious model
– Use zero-knowledge proofs to force parties to behave in a way consistent with the passively-secure protocol
Specialized protocols
• Using secret sharing, special encryption schemes, or randomized responses
– May reveal some information
– Tradeoff of security and efficiency
• Examples
– Secure sum by random shares
– Secure union (using commutative encryption)
• Build complex protocols from primitive protocols
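Secure sum by random shares can be sketched as a ring protocol: the initiator masks the running total with a random value, each party adds its input modulo M, and the initiator removes the mask at the end (a single-process simulation of the message flow):

```python
import random

def secure_sum(private_inputs, modulus=2**32):
    """Ring-based secure sum: the initiator masks the running total with a
    random R; each party only ever sees a uniformly random-looking partial
    sum mod M, and the initiator removes R at the end."""
    R = random.randrange(modulus)
    running = R
    for x in private_inputs:      # pass the masked total around the ring
        running = (running + x) % modulus
    return (running - R) % modulus

assert secure_sum([10, 20, 30]) == 60
```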
Course Topics
• Inference control/Differential Privacy
• Cryptography
– Foundations
– Applications
• Secure multiparty computations
• Secure outsourcing
• Access control
• Applications
Secure data outsourcing
• Support computation and queries on encrypted data
Encrypted Data → Computation / Queries
Secure data outsourcing
• Crypto Primitives
– Homomorphic encryption
• General protocol based on fully homomorphic encryption: computationally prohibitive
• Specialized protocols based on partially homomorphic encryption
– Property preserving encryption
• Deterministic encryption vs probabilistic encryption
• Order preserving encryption
Homomorphic encryption schemes
• Partially homomorphic encryption: homomorphic scheme where only one type of operation is possible (× or +)
– Multiplicative homomorphic, e.g. RSA
– Additive homomorphic, e.g. Paillier
• Somewhat homomorphic encryption: homomorphic scheme that can perform a limited number of additions and multiplications
• Fully homomorphic encryption (FHE) (Gentry, 2010)
– Can perform an unlimited number of additions and multiplications
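RSA's multiplicative homomorphism is easy to demonstrate with textbook RSA on toy parameters (insecure, for illustration only): multiplying ciphertexts multiplies the underlying plaintexts.

```python
# Textbook RSA with toy parameters (insecure, illustration only)
p, q = 61, 53
n = p * q                            # 3233
e = 17
d = pow(e, -1, (p - 1) * (q - 1))    # modular inverse (Python 3.8+)

def enc(m):
    return pow(m, e, n)

def dec(c):
    return pow(c, d, n)

a, b = 7, 9
# Multiplying ciphertexts multiplies plaintexts: E(a)*E(b) mod n = E(a*b mod n)
assert dec(enc(a) * enc(b) % n) == (a * b) % n
```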
Using Partially Homomorphic Encryption with two servers
• Two-server setting
– C1 holds encrypted data E(a), E(b)
– C2 holds the decryption key sk
• Security goal: C1 and C2 do not learn anything about the data and result
• Basic idea
– Utilize the additive homomorphic property
– Use random shares to ensure C2 only has access to decrypted data masked with random shares
• Primitive protocols: secure multiplication, secure comparison, …
• Build complex protocols from primitive protocols, e.g. secure kNN queries