
CS573 Data Privacy and Security

Midterm Review

Li Xiong

Department of Mathematics and Computer Science

Emory University

Principles of Data Security – CIA Triad

• Confidentiality

– Prevent the disclosure of information to unauthorized users

• Integrity

– Prevent improper modification

• Availability

– Make data available to legitimate users

Privacy vs. Confidentiality

• Confidentiality

– Prevent disclosure of information to unauthorized users

• Privacy

– Prevent disclosure of personal information to unauthorized users

– Control of how personal information is collected and used

– Prevent identification of individuals


Data Privacy and Security Measures

• Access control

– Restrict access to the (subset or view of) data to authorized users

• Cryptography

– Use encryption to encode information so it can only be read by authorized users (protected in transit and storage)

• Inference control

– Prevent inference from accessible data to sensitive individual information that is not directly accessible

• Technologies

– De-identification and Anonymization (input perturbation)

– Differential Privacy (output perturbation)

Inference Control

[Diagram: Original Data → De-identification / Anonymization → Sanitized Records]

Traditional De-identification and Anonymization

• Attribute suppression, encoding, perturbation, generalization

• Subject to re-identification and disclosure attacks

Differentially Private Data Sharing

[Diagram: Original Data → Differentially Private Release → Statistics / Models / Synthetic Records]

Statistical Data Sharing with Differential Privacy

• Macro data (as opposed to micro data)

• Output perturbation (as opposed to input perturbation)

• More rigorous guarantee

Cryptography

• Encoding data in a way that only authorized users can read it


[Diagram: Original Data → Encryption → Encrypted Data]

Applications of Cryptography

• Secure data outsourcing

– Support computation and queries on encrypted data

[Diagram: Computation / Queries performed directly on Encrypted Data]

Applications of Cryptography

• Multi-party secure computation (secure function evaluation)

– Securely compute a function f(x1, x2, …, xn) without revealing the private inputs x1, …, xn

Applications of Cryptography

• Private information retrieval (access privacy)

– Retrieve data without revealing the query (access pattern)

Course Topics

• Inference control

– De-identification and anonymization

– Differential privacy foundations

– Differential privacy applications

• Histograms

• Data mining

• Local differential privacy

• Location privacy

• Cryptography

• Access control

• Applications

k-Anonymity

Original table:

Race   | Zipcode | Diagnosis
Caucas | 78712   | Flu
Asian  | 78705   | Shingles
Caucas | 78754   | Flu
Asian  | 78705   | Acne
AfrAm  | 78705   | Acne
Caucas | 78705   | Flu

Anonymized table:

Race        | Zipcode | Diagnosis
Caucas      | 787XX   | Flu
Asian/AfrAm | 78705   | Shingles
Caucas      | 787XX   | Flu
Asian/AfrAm | 78705   | Acne
Asian/AfrAm | 78705   | Acne
Caucas      | 787XX   | Flu

Quasi-identifiers (QID) = race, zipcode

Sensitive attribute = diagnosis

k-anonymity: the size of each QID group is at least k
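One way to check this property programmatically: group by the quasi-identifier columns and verify every group has at least k rows. A minimal Python sketch (the encoding of the table is illustrative):

```python
import pandas as pd

# Anonymized table from the slide (illustrative encoding)
df = pd.DataFrame({
    "race":      ["Caucas", "Asian/AfrAm", "Caucas", "Asian/AfrAm", "Asian/AfrAm", "Caucas"],
    "zipcode":   ["787XX", "78705", "787XX", "78705", "78705", "787XX"],
    "diagnosis": ["Flu", "Shingles", "Flu", "Acne", "Acne", "Flu"],
})

def is_k_anonymous(df, quasi_identifiers, k):
    """True if every quasi-identifier group contains at least k records."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

print(is_k_anonymous(df, ["race", "zipcode"], k=3))  # True: both QID groups have 3 rows
```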

Problem of k-Anonymity

External data linking an individual to a QID group:

Name              | Race   | Zipcode
…                 | …      | …
Rusty Shackleford | Caucas | 78705
…                 | …      | …

Anonymized table:

Race        | Zipcode | Diagnosis
Caucas      | 787XX   | Flu
Asian/AfrAm | 78705   | Shingles
Caucas      | 787XX   | Flu
Asian/AfrAm | 78705   | Acne
Asian/AfrAm | 78705   | Acne
Caucas      | 787XX   | Flu

Problem: sensitive attributes are not "diverse" within each quasi-identifier group

l-Diversity

[Machanavajjhala et al. ICDE '06]

Race        | Zipcode | Diagnosis
Caucas      | 787XX   | Flu
Caucas      | 787XX   | Shingles
Caucas      | 787XX   | Acne
Caucas      | 787XX   | Flu
Caucas      | 787XX   | Acne
Caucas      | 787XX   | Flu
Asian/AfrAm | 78XXX   | Flu
Asian/AfrAm | 78XXX   | Flu
Asian/AfrAm | 78XXX   | Acne
Asian/AfrAm | 78XXX   | Shingles
Asian/AfrAm | 78XXX   | Acne
Asian/AfrAm | 78XXX   | Flu

Entropy of sensitive attributes within each quasi-identifier group must be at least l

Problem with l-Diversity

Original dataset (sensitive attribute: HIV status):

… | HIV-
… | HIV-
… | HIV-
… | HIV-
… | HIV-
… | HIV+
… | HIV-
… | HIV-
… | HIV-
… | HIV-
… | HIV-
… | HIV-

99% have HIV-

Anonymization A:

Q1 | HIV+
Q1 | HIV-
Q1 | HIV+
Q1 | HIV-
Q1 | HIV+
Q1 | HIV-
Q2 | HIV-
Q2 | HIV-
Q2 | HIV-
Q2 | HIV-
Q2 | HIV-
Q2 | HIV-

The 50% HIV- quasi-identifier group is "diverse", yet this leaks a ton of information

Anonymization B:

Q1 | HIV-
Q1 | HIV-
Q1 | HIV-
Q1 | HIV+
Q1 | HIV-
Q1 | HIV-
Q2 | HIV-
Q2 | HIV-
Q2 | HIV-
Q2 | HIV-
Q2 | HIV-
Q2 | Flu

The 99% HIV- quasi-identifier group is not "diverse", yet the anonymized database does not leak anything

t-Closeness

[Li et al. ICDE '07]

Race        | Zipcode | Diagnosis
Caucas      | 787XX   | Flu
Caucas      | 787XX   | Shingles
Caucas      | 787XX   | Acne
Caucas      | 787XX   | Flu
Caucas      | 787XX   | Acne
Caucas      | 787XX   | Flu
Asian/AfrAm | 78XXX   | Flu
Asian/AfrAm | 78XXX   | Flu
Asian/AfrAm | 78XXX   | Acne
Asian/AfrAm | 78XXX   | Shingles
Asian/AfrAm | 78XXX   | Acne
Asian/AfrAm | 78XXX   | Flu

t-Closeness: the distribution of sensitive attributes within each quasi-identifier group should be "close" to their distribution in the entire original database

Problems with Syntactic Privacy Notions

• Syntactic

– Focuses on the data transformation, not on what can be learned from the anonymized dataset

• "Quasi-identifier" fallacy

– Assumes a priori that the attacker will not know certain information about the target

– The attacker may know the records in the database or external information

Course Topics

• Inference control

– De-identification and anonymization

– Differential privacy foundations

– Differential privacy applications

• Histograms

• Data mining

• Location privacy

• Cryptography

• Access control

• Applications

Differential Privacy

• Statistical outcome is indistinguishable regardless of whether a particular user (record) is included in the data

Statistical Data Release: disclosure risk

[Figure: Original records → Original histogram]

Statistical Data Release: differential privacy

[Figure: Original records → Original histogram → Perturbed histogram with differential privacy]

Differential Privacy

A privacy mechanism A gives ε-differential privacy if for all neighboring databases D, D', and for any possible output S ∈ Range(A),

Pr[A(D) = S] ≤ exp(ε) × Pr[A(D') = S]

• D and D' are neighboring databases if they differ in one record

Laplace Mechanism

• Add Laplace noise to the true output: A(D) = f(D) + Laplace(Δf/ε)

• Global sensitivity: Δf = max_{D,D'} |f(D) − f(D')|

Example: Laplace Mechanism

• For a single counting query Q over a dataset D, returning Q(D) + Laplace(1/ε) gives ε-differential privacy (a counting query has global sensitivity Δf = 1).
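A minimal sketch of the mechanism in Python (the dataset and query are illustrative):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=np.random.default_rng()):
    """Return true_value + Laplace noise with scale sensitivity/epsilon."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Counting query: how many records satisfy a predicate (global sensitivity 1)
ages = np.array([42, 31, 28, 43, 55])
true_count = int((ages >= 40).sum())
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(noisy_count)  # epsilon-differentially private answer to the counting query
```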

Exponential Mechanism

• Sample an output r with probability weighted by a utility score function u(D, r)

For a database D, output space R, and a utility score function u : D × R → ℝ, the algorithm A with

Pr[A(D) = r] ∝ exp(ε × u(D, r) / 2Δu)

satisfies ε-differential privacy, where Δu is the sensitivity of the utility score function:

Δu = max_{r, D,D'} |u(D, r) − u(D', r)|

Example: Exponential Mechanism

• Scoring/utility function u: Inputs × Outputs → ℝ

• D: nationalities of a set of people

• f(D): most frequent nationality in D

• u(D, O) = #(D, O), the number of people with nationality O
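A minimal Python sketch of this example (the data are illustrative; the counting utility above has sensitivity Δu = 1):

```python
import numpy as np

def exponential_mechanism(data, outputs, utility, sensitivity, epsilon, rng=np.random.default_rng()):
    """Sample one output with probability proportional to exp(eps * u / (2 * sensitivity))."""
    scores = np.array([utility(data, r) for r in outputs], dtype=float)
    # Subtract the max score before exponentiating for numerical stability
    weights = np.exp(epsilon * (scores - scores.max()) / (2 * sensitivity))
    probs = weights / weights.sum()
    return rng.choice(outputs, p=probs)

# Utility: number of people in D with nationality O (sensitivity 1)
def count_utility(data, option):
    return sum(1 for x in data if x == option)

nationalities = ["FR", "US", "US", "CN", "US", "FR"]
print(exponential_mechanism(nationalities, ["FR", "US", "CN"], count_utility, 1, epsilon=1.0))
```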


Composition Theorems

• Sequential composition: running mechanisms with privacy budgets εi on the same data gives ∑i εi differential privacy

• Parallel composition: running mechanisms with privacy budgets εi on disjoint subsets of the data gives max(εi) differential privacy

Differential Privacy

• Differential privacy ensures an attacker can't infer the presence or absence of a single record in the input based on any output

• Building blocks

– Laplace mechanism, exponential mechanism

• Composition rules help build complex algorithms using the building blocks

Course Topics

• Inference control

– De-identification and anonymization

– Differential privacy foundations

– Differential privacy applications

• Histograms

• Data mining

• Location privacy

• Cryptography

• Access control

• Applications

Baseline: Laplace Mechanism

• For the counting query Q on each histogram bin, returning Q(D) + Laplace(1/ε) gives ε-differential privacy (each bin count has sensitivity 1, and the bins are disjoint, so parallel composition applies).
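A minimal sketch of this baseline (illustrative data; since the bins are disjoint, one Laplace(1/ε) draw per bin gives ε-differential privacy overall by parallel composition):

```python
import numpy as np

def dp_histogram(values, bin_edges, epsilon, rng=np.random.default_rng()):
    """Per-bin counts with Laplace(1/epsilon) noise; each count has sensitivity 1."""
    counts, _ = np.histogram(values, bins=bin_edges)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    return np.clip(noisy, 0, None)  # post-processing: clamp negatives, no privacy cost

ages = np.array([42, 31, 28, 43, 55, 37, 61, 24])
print(dp_histogram(ages, bin_edges=[20, 30, 40, 50, 60, 70], epsilon=1.0))
```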

Original Records

Name  | Age | Income | HIV+
Frank | 42  | 30K    | Y
Bob   | 31  | 60K    | Y
Mary  | 28  | 20K    | Y
…     | …   | …      | …

DPCube [SecureDM 2010, ICDE 2012 demo]

[Diagram: DP interface → DP unit histogram (ε/2-DP) → multi-dimensional partitioning → DP V-optimal histogram (ε/2-DP)]

• Compute unit histogram counts with differential privacy (ε/2)

• Use the DP unit histogram for partitioning

• Compute V-optimal histogram counts with differential privacy (ε/2)

Private Spatial Decompositions [CPSSY 12]

[Diagram: quadtree and kd-tree partitionings]

• Need to ensure both the partitioning boundary and the counts of each partition are differentially private

Histogram Methods vs Parametric Methods

• Non-parametric (histogram) methods: learn the empirical distribution through histograms, e.g. PSD, Privelet, FP, P-HP

– Only work well for low-dimensional data

• Parametric methods: fit the data to a distribution and make inferences about the parameters, e.g. PrivacyOnTheMap

– The joint distribution is difficult to model

[Diagram: Original data → Perturbation → Histogram → Synthetic data]

DPCopula: A Semi-parametric Method

• Non-parametric estimation for each dimension: DP marginal histograms (Age, Hours/week, Income)

• Parametric estimation for the dependence structure

[Diagram: Original data set → DP marginal histograms + dependence structure → DP synthetic data set]

Example data set (Age, Hours/week, Income):

Age | Hours/week | Income
42  | 64         | 30K
31  | 82         | 60K
28  | 40         | 20K
43  | 36         | 80K
…   | …          | …

PrivBayes: Marginal Distributions + Bayesian Network

[Bayesian network: age → workclass, age → education, workclass → title, workclass → income]

Pr[age] · Pr[workclass | age] · Pr[education | age] · Pr[title | workclass] · Pr[income | workclass]

Outline of the Algorithm

• STEP 1: Choose a suitable Bayesian network 𝒩

– Must be done in a differentially private way

– Add the edges with the highest mutual information: exponential mechanism

• STEP 2: Compute the conditional distributions implied by 𝒩

– Straightforward to do under differential privacy: inject noise via the Laplace mechanism

• STEP 3: Generate synthetic data by sampling from 𝒩

– Post-processing: no privacy issues

Evaluation for DP Histograms

• Metrics: random range-count queries with random query predicates covering all attributes

• Relative error

• Absolute error
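The standard definitions of these metrics (an assumption, stated for completeness; the slide's exact formulas are not shown) for a range-count query Q with true answer Q(D) and noisy answer Q̃(D):

Absolute error: |Q̃(D) − Q(D)|

Relative error: |Q̃(D) − Q(D)| / max(Q(D), s), where s is a small sanity bound that avoids division by zero.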

Course Topics

• Inference control

– De-identification and anonymization

– Differential privacy foundations

– Differential privacy applications

• Histograms

• Data mining

• Location privacy

• Cryptography

• Access control

• Applications

Frequent Sequence Mining (FSM)

Database D:

ID  | Record
100 | a→c→d
200 | b→c→d
300 | a→b→c→e→d
400 | d→b
500 | a→d→c→d

Scan D → C1: candidate 1-seqs with support: {a}: 3, {b}: 3, {c}: 4, {d}: 4, {e}: 1

F1: frequent 1-seqs: {a}: 3, {b}: 3, {c}: 4, {d}: 4

Scan D → C2: candidate 2-seqs with support: {a→a}: 0, {a→b}: 1, {a→c}: 3, {a→d}: 3, {b→a}: 0, {b→b}: 2, {b→c}: 2, {b→d}: 1, {c→a}: 0, {c→b}: 0, {c→c}: 0, {c→d}: 4, {d→a}: 0, {d→b}: 1, {d→c}: 1, {d→d}: 0

F2: frequent 2-seqs: {a→c}: 3, {a→d}: 3, {c→d}: 4

Scan D → C3: candidate 3-seqs: {a→c→d}

F3: frequent 3-seqs: {a→c→d}: 3

Baseline: Laplace Mechanism for FSM

Same database D. At each pass k, add Laplace noise Lap(|Ck| / εk) to the support of each candidate k-sequence, then keep the candidates whose noisy support passes the threshold.

Scan D → C1: candidate 1-seqs, support + noise: {a}: 3 + 0.2, {b}: 3 − 0.4, {c}: 4 + 0.4, {d}: 4 − 0.5, {e}: 1 + 0.8

F1: frequent 1-seqs, noisy support: {a}: 3.2, {c}: 4.4, {d}: 3.5

Scan D → C2: candidate 2-seqs (from F1), support + noise: {a→a}: 0 + 0.2, {a→c}: 3 + 0.3, {a→d}: 3 + 0.2, {c→a}: 0 − 0.5, {c→c}: 0 + 0.8, {c→d}: 4 + 0.2, {d→a}: 0 + 0.3, {d→c}: 1 + 2.1, {d→d}: 0 − 0.5

F2: frequent 2-seqs, noisy support: {a→c}: 3.3, {a→d}: 3.2, {c→d}: 4.2, {d→c}: 3.1

Scan D → C3: candidate 3-seqs, support + noise: {a→c→d}: 3 + 0, {a→d→c}: 1 + 0.3

F3: frequent 3-seqs, noisy support: {a→c→d}: 3

Noise per pass: Lap(|C1| / ε1), Lap(|C2| / ε2), Lap(|C3| / ε3)
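A minimal sketch of the noisy-support step of this baseline (illustrative; subsequence counting and thresholding as in the example above):

```python
import numpy as np

def is_subsequence(pattern, record):
    """True if pattern occurs in record as an ordered (not necessarily contiguous) subsequence."""
    it = iter(record)
    return all(item in it for item in pattern)

def noisy_frequent(db, candidates, epsilon_k, threshold, rng=np.random.default_rng()):
    """Add Lap(|C_k| / epsilon_k) noise to each candidate's support, keep those above threshold."""
    scale = len(candidates) / epsilon_k  # noise scale grows with the candidate set size
    result = {}
    for cand in candidates:
        support = sum(is_subsequence(cand, rec) for rec in db)
        noisy = support + rng.laplace(scale=scale)
        if noisy >= threshold:
            result[cand] = noisy
    return result

db = [("a","c","d"), ("b","c","d"), ("a","b","c","e","d"), ("d","b"), ("a","d","c","d")]
print(noisy_frequent(db, [("a","c"), ("a","d"), ("c","d"), ("d","c")], epsilon_k=1.0, threshold=3))
```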

PFS2 Algorithm

• Sensitivity is impacted by two factors:

– Candidate set size

– Sequence length

• Basic idea: reduce sensitivity

• Use the kth sample database for pruning candidate k-sequences

– Reduces the candidate set size

• Shrink sequences by transformation while maintaining the frequent patterns

– Reduces the sequence length

[Diagram: Original database partitioned into the 1st, 2nd, …, mth sample databases]

DP Frequent Sequence Mining Evaluation

• Metrics

– F-score = 2 × precision × recall / (precision + recall)

– Relative error: RE = median_{x ∈ X} |sup'_x − sup_x| / sup_x

Course Topics

• Inference control

– De-identification and anonymization

– Differential privacy foundations

– Differential privacy applications

• Histograms

• Data mining

• Local differential privacy

• Location privacy

• Cryptography

• Access control

• Applications

Local Differential Privacy

[Figure: users browsing Finance.com, Fashion.com, WeirdStuff.com, … each report a locally perturbed value to the server]

• No trusted server

• Each user applies local perturbation before submitting the value to the server

• The server only aggregates the values

• Google Chrome deployment

Randomized Response [W 65]

• With probability p, report the true value

• With probability 1 − p, report the flipped value

True Disease (Y/N) D | Reported Disease (Y/N) O
Y                    | Y
Y                    | N
N                    | N
Y                    | N
N                    | Y
N                    | N

Differential Privacy Analysis

• Consider 2 databases D, D' (of size M) that differ in the jth value

• D[j] ≠ D'[j]. But, D[i] = D'[i], for all i ≠ j

• Consider some output O: only the jth report depends on the differing value, so

Pr[O | D] / Pr[O | D'] ≤ p / (1 − p)

• Hence randomized response satisfies ε-differential privacy with ε = ln(p / (1 − p)) (for p > 1/2)
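A minimal sketch (illustrative): each user reports truthfully with probability p, and the server de-biases the aggregate count:

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize(bit, p):
    """Report the true bit with probability p, else the flipped bit."""
    return bit if rng.random() < p else 1 - bit

def estimate_count(reports, p):
    """Unbiased estimate of the true number of 1s from randomized reports."""
    n = len(reports)
    # E[#reported 1s] = p * true + (1 - p) * (n - true); solve for true
    return (sum(reports) - (1 - p) * n) / (2 * p - 1)

true_bits = rng.integers(0, 2, size=10000)
p = 0.75  # epsilon = ln(p / (1 - p)) = ln(3)
reports = [randomize(b, p) for b in true_bits]
print(int(true_bits.sum()), round(estimate_count(reports, p)))
```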

Course Topics

• Inference control

– De-identification and anonymization

– Differential privacy foundations

– Differential privacy applications

• Histograms

• Data mining

• Local differential privacy

• Location privacy

• Cryptography

• Access control

• Applications

Individual Location Sharing: Existing Solutions and Challenges

• Private information retrieval

– Computationally expensive

• Spatial cloaking

– Syntactic privacy notion

– Temporal correlations due to road constraints and moving patterns

• Event-level differential privacy

– Protects an event (the exact location of a single user at a given time)

– Challenges: the large input domain (locations on the map) makes the output useless; temporal correlations

Event-level Differential Privacy

Definition (Differential Privacy)

At any timestamp t, a randomized mechanism A satisfies ε-differential privacy if, for any output zt and any two locations x1 and x2, the following holds:

Pr[A(x1) = zt] ≤ exp(ε) × Pr[A(x2) = zt]

Intuition: the released location zt (observed by the adversary) will not help an adversary differentiate any input locations

• Challenges:

– Distance does not capture location semantics

– Temporal correlations

Geo-indistinguishability [CCS 13]

Definition (Geo-indistinguishability)

At any timestamp t, a randomized mechanism A satisfies ε-geo-indistinguishability if, for any output zt and any two locations x1 and x2 within a circle of radius r, the following holds:

Pr[A(x1) = zt] ≤ exp(εr) × Pr[A(x2) = zt]

Intuition: the released location zt (observed by the adversary) will not help an adversary differentiate any two input locations that are close to each other
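The mechanism proposed in the cited paper is the planar Laplace distribution; a minimal sketch of its standard polar sampling (assumed details: the radius is drawn via the Lambert W function, the angle uniformly):

```python
import numpy as np
from scipy.special import lambertw

def planar_laplace(x, y, epsilon, rng=np.random.default_rng()):
    """Perturb a 2D location with planar Laplace noise (geo-indistinguishability)."""
    theta = rng.uniform(0, 2 * np.pi)   # direction: uniform angle
    p = rng.uniform(0, 1)               # radius via inverse CDF using Lambert W (branch -1)
    r = -(1.0 / epsilon) * (lambertw((p - 1) / np.e, k=-1).real + 1)
    return x + r * np.cos(theta), y + r * np.sin(theta)

print(planar_laplace(0.0, 0.0, epsilon=0.1))
```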

Differential Privacy on δ-Location Set under Temporal Correlations [CCS 15]

Definition (Differential Privacy on δ-location set)

At any timestamp t, a randomized mechanism A satisfies ε-differential privacy on the δ-location set if, for any output zt and any two locations x1 and x2 in the δ-location set, the following holds:

Pr[A(x1) = zt] ≤ exp(ε) × Pr[A(x2) = zt]

Intuition: the released location zt (observed by the adversary) will not help an adversary differentiate any two locations in the δ-location set, the possible set of locations where a user might appear

Course Topics

• Inference control/Differential Privacy

• Cryptography

– Foundations

– Applications

• Secure outsourcing

• Secure multiparty computations

• Private information retrieval

• Access control

• Applications

Operational Model of Encryption

[Diagram: plaintext m → E with encryption key k → ciphertext Ek(m) → D with decryption key k' → Dk'(Ek(m)) = m; the attacker observes the ciphertext]

• Kerckhoffs' assumption:

– The attacker knows E and D

– The attacker doesn't know the (decryption) key

• Attacker's goal:

– To systematically recover plaintext from ciphertext

– To deduce the (decryption) key

• Attack models:

– Ciphertext-only

– Known-plaintext

– (Adaptive) chosen-plaintext

– (Adaptive) chosen-ciphertext

Cryptography Primitives

• Symmetric encryption

• Public-key encryption

• Encryption schemes with different properties

– Homomorphic encryption

– Probabilistic encryption vs deterministic encryption

– Order-preserving encryption

– Commutative encryption

Symmetric Key Cryptography

• Symmetric key crypto: Bob and Alice share the same (symmetric) key K_A-B

• Examples: AES

[Diagram: plaintext message m → encryption algorithm with K_A-B → ciphertext c = K_A-B(m) → decryption algorithm with K_A-B → plaintext m = K_A-B(c)]
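For concreteness, a minimal sketch of symmetric encryption with AES (using the AESGCM authenticated mode from the Python cryptography package; the message and key handling are illustrative):

```python
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
import os

key = AESGCM.generate_key(bit_length=256)   # shared secret K_A-B
aesgcm = AESGCM(key)

nonce = os.urandom(12)                      # never reuse a nonce with the same key
plaintext = b"attack at dawn"
ciphertext = aesgcm.encrypt(nonce, plaintext, None)

# Anyone holding the same key (and nonce) can decrypt
assert aesgcm.decrypt(nonce, ciphertext, None) == plaintext
```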

Public-Key Cryptography

• Public key for encryption and secret key for decryption

• Examples: RSA

Course Topics

• Inference control/Differential Privacy

• Cryptography

– Foundations

– Applications

• Secure multiparty computations

• Secure outsourcing

• Access control

• Applications

• Multi-party secure computation (secure function evaluation)

– Securely compute a function f(x1, x2, …, xn) without revealing the private inputs x1, …, xn

Security Model

• A protocol is secure if it emulates an ideal setting where the parties hand their inputs to a "trusted party," who locally computes the desired outputs and hands them back to the parties [Goldreich-Micali-Wigderson 1987]

[Diagram: A holds x1, B holds x2; the trusted party returns f1(x1, x2) to A and f2(x1, x2) to B]

Properties of the Definition

• Correctness

– All honest participants should receive the correct result of evaluating the function f

• Privacy

– All corrupt participants should learn no more from the protocol than what they would learn in the ideal model

– That is, nothing beyond their own private inputs (obviously) and the result of f

Adversary Models

• Semi-honest (aka passive; honest-but-curious)

– Follows protocol, but tries to learn more from received messages than he would learn in the ideal model

• Malicious

– Deviates from the protocol in arbitrary ways, lies about his inputs, may quit at any point

Security Proof Tools

• Real/ideal model: the real model can be simulated in the ideal model

– Key idea: show that whatever can be computed by a party participating in the protocol can be computed based on its input and output only

– There is a polynomial-time simulator S such that {S(x, f(x,y))} ≡ {View(x,y)}

• Composition theorem

– If a protocol is secure in the hybrid model, where the protocol uses a trusted party that computes the (sub)functionalities, and we replace the calls to the trusted party by calls to secure protocols, then the resulting protocol is secure

– Prove that the component protocols are secure, then conclude that the combined protocol is secure

General protocols

• Primitives

– Oblivious transfer (OT)

– Random shares


Oblivious Transfer (OT) [Rabin 1981]

• Fundamental SMC primitive

• 1-out-of-2 Oblivious Transfer (OT)

[Diagram: sender S holds m0, m1; receiver R holds a choice bit b = 0 or 1; R learns mb]

– S inputs two bits, R inputs the index of one of S's bits

– R learns his chosen bit, S learns nothing: S does not learn which bit R has chosen; R does not learn the value of the bit that he did not choose

Secret Sharing Scheme

• Splitting

– Encode the secret as an integer S

– Give each player i (except one) a random integer r_i

– Give the last player the number S − ∑_{i=1}^{n−1} r_i
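A minimal sketch of this splitting scheme (illustrative; working modulo a public prime so that any n − 1 shares look uniformly random):

```python
import secrets

P = 2**61 - 1  # public prime modulus

def split(secret, n):
    """Additive secret sharing: n-1 random shares plus one correcting share."""
    shares = [secrets.randbelow(P) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

shares = split(42, n=5)
assert reconstruct(shares) == 42   # all n shares together recover the secret
# any n-1 shares are uniformly random and reveal nothing about the secret
```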

(t, n) Threshold Scheme

• Shamir's scheme (1979)

– It takes t points to define a polynomial of degree t−1

– Create a degree-(t−1) polynomial with the secret as the first (constant) coefficient and the remaining coefficients picked at random. Find n points on the curve and give one to each of the players. At least t points are required to fit the polynomial.
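A minimal sketch of Shamir's scheme over a prime field (illustrative; reconstruction is Lagrange interpolation at x = 0):

```python
import secrets

P = 2**61 - 1  # public prime modulus

def share(secret, t, n):
    """Shamir (t, n): evaluate a random degree-(t-1) polynomial with f(0) = secret."""
    coeffs = [secret] + [secrets.randbelow(P) for _ in range(t - 1)]
    def f(x):
        return sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(points):
    """Lagrange interpolation at x = 0 from any t points."""
    total = 0
    for xi, yi in points:
        num, den = 1, 1
        for xj, _ in points:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P  # Fermat inverse mod P
    return total

shares = share(42, t=3, n=5)
assert reconstruct(shares[:3]) == 42   # any 3 of the 5 shares suffice
```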

General Protocols

• Passively-secure computation for the semi-honest model

– Yao's garbled circuits for two parties (OT and symmetric encryption)

– GMW protocol for multiple parties (random shares and OT)

• From passively-secure protocols to actively-secure protocols for the malicious model

– Use zero-knowledge proofs to force parties to behave in a way consistent with the passively-secure protocol

Specialized Protocols

• Use secret sharing, special encryption schemes, or randomized responses

– May reveal some information

– Tradeoff between security and efficiency

• Examples (see the sketch below)

– Secure sum by random shares

– Secure union (using commutative encryption)

• Build complex protocols from primitive protocols
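A minimal sketch of secure sum by random shares (illustrative, semi-honest setting; the first party masks the running total with a random value and removes it at the end):

```python
import secrets

P = 2**61 - 1  # public modulus; all arithmetic is mod P

def secure_sum(inputs):
    """Ring-based secure sum: each party only ever sees a uniformly random partial total."""
    mask = secrets.randbelow(P)            # known only to party 0
    running = (mask + inputs[0]) % P       # party 0 starts the ring
    for x in inputs[1:]:                   # each party adds its private input in turn
        running = (running + x) % P
    return (running - mask) % P            # party 0 removes the mask

print(secure_sum([10, 25, 7, 3]))  # 45; no party saw another's individual input
```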

Course Topics

• Inference control/Differential Privacy

• Cryptography

– Foundations

– Applications

• Secure multiparty computations

• Secure outsourcing

• Access control

• Applications

Secure Data Outsourcing

• Support computation and queries on encrypted data

[Diagram: Encrypted Data on the server; Computation / Queries run directly on the encrypted data]

Secure Data Outsourcing

• Crypto primitives

– Homomorphic encryption

• General protocols based on fully homomorphic encryption: computationally prohibitive

• Specialized protocols based on partially homomorphic encryption

– Property-preserving encryption

• Deterministic encryption vs probabilistic encryption

• Order-preserving encryption

Homomorphic Encryption

[Diagram: Eval takes Enc[x] and a function f, and outputs Enc[f(x)] without decrypting]

Homomorphic Encryption Schemes

• Partially homomorphic encryption: only one type of operation is possible (× or +)

– Multiplicatively homomorphic, e.g. RSA

– Additively homomorphic, e.g. Paillier (see the sketch below)

• Somewhat homomorphic encryption: can perform a limited number of additions and multiplications

• Fully homomorphic encryption (FHE) (Gentry, 2010)

– Can perform an unbounded number of additions and multiplications
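To illustrate the additive homomorphic property, a toy Paillier sketch (tiny hard-coded primes for readability; for exposition only, not security):

```python
import math, secrets

# Toy Paillier keypair (real keys use ~2048-bit primes)
p, q = 1789, 2003
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)   # valid decryption constant because g = n + 1

def encrypt(m):
    r = secrets.randbelow(n - 1) + 1
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    L = (pow(c, lam, n2) - 1) // n
    return (L * mu) % n

c1, c2 = encrypt(20), encrypt(22)
c_sum = (c1 * c2) % n2   # homomorphic addition: multiply the ciphertexts
assert decrypt(c_sum) == 42
```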

Using Partially Homomorphic Encryption with Two Servers

• Two-server setting

– C1 holds the encrypted data E(a), E(b)

– C2 holds the decryption key sk

• Security goal

– C1 and C2 learn nothing about the data or the result

• Basic idea

– Utilize the additive homomorphic property

– Use random shares to ensure C2 only sees decrypted data masked by random shares

• Primitive protocols: secure multiplication, secure comparison, …

• Build complex protocols from the primitive protocols, e.g. secure kNN queries

Secure Data Outsourcing

• Crypto primitives

– Homomorphic encryption

• General protocols based on fully homomorphic encryption: computationally prohibitive

• Specialized protocols based on partially homomorphic encryption

– Property-preserving encryption

• Deterministic encryption vs probabilistic encryption

• Order-preserving encryption

CryptDB

• Use layers ("onions") of encryption with different properties

• Decrypt layers as needed by the queries