Privacy Preserving Data Mining Li Xiong Department of Mathematics and Computer Science Department of Biomedical Informatics Emory University CS378 Introduction to Data Mining

Post on 05-Aug-2020

Page 1: Privacy Preserving Data Mining - Emory Universitylxiong/cs378/share/slides/09_ppdm.pdf · Facebook-Cambridge Analytica • April 2010, Facebook launches Open Graph • 2013, 300,000

Privacy Preserving Data Mining

Li Xiong

Department of Mathematics and Computer Science

Department of Biomedical Informatics

Emory University

CS378 Introduction to Data Mining


Netflix Sequel

• 2006, Netflix announced the challenge

• 2007, researchers from the University of Texas identified individuals by matching the Netflix dataset with IMDb

• July 2009, the contest closed; the $1M grand prize was awarded in September 2009

• August 2009, Netflix announced the second challenge

• December 2009, four Netflix users filed a class action lawsuit against Netflix

• March 2010, Netflix canceled the second challenge


Facebook-Cambridge Analytica

• April 2010, Facebook launches Open Graph

• 2013, 300,000 users took the psychographic personality test app “thisisyourdigitallife”

• 2016, Trump’s campaign invested heavily in Facebook ads

• March 2018, reports revealed that 50 million (later revised to 87 million) Facebook profiles were harvested for Cambridge Analytica and used for Trump’s campaign

• April 11, 2018, Zuckerberg testified before Congress


• How many people know we are here?

(a) no one

(b) 1-10, e.g. family and friends

(c) 10-100, e.g. colleagues and more (social network friends)


Who Knows What About Me? A Survey of Behind the Scenes Personal Data Sharing to Third Parties by Mobile Apps, 2015-10-30, https://techscience.org/a/2015103001/

• 73% / 33% of Android apps shared personal info (e.g. email) / GPS coordinates with third parties

• 45% / 47% of iOS apps shared email / GPS coordinates with third parties

[Figure: location data sharing by iOS apps (left) to domains (right)]


The EHR Data Map


Shopping records


Big Data Goes Personal

• Movie ratings

• Social network/media data

• Mobile GPS data

• Electronic medical records

• Shopping history

• Online browsing history


Data Mining


Data Mining … the dark side


Privacy Preserving Data Mining

[Diagram: Private Data → Privacy Preserving Data Mining → Sanitized Data/Models]

• Privacy goal: personal data is not revealed and cannot be inferred

• Utility goal: data/models as close to the private data as possible


Privacy preserving data mining

• Differential privacy

• Definition

• Building blocks (primitive mechanisms)

• Composition rules

• Data mining algorithms with differential privacy

• k-means clustering w/ differential privacy

• Frequent pattern mining w/ differential privacy


Differential Privacy


Traditional De-identification and Anonymization

[Diagram: Original Data → De-identification / anonymization → Sanitized View]

• Attribute suppression, perturbation, generalization

• Inference possible with external data


Massachusetts GIC Incident (1990s)

• Massachusetts Group Insurance Commission (GIC) encounter data (“de-identified”), mid 1990s

• External information: voter roll from the city of Cambridge

• Governor’s health records identified

• 87% of Americans can be uniquely identified using zip code, birth date, and sex (2000)

Name     SSN        Birth date  Zip    Diagnosis
Alice    123456789  44          48202  AIDS
Bob      323232323  44          48202  AIDS
Charley  232345656  44          48201  Asthma
Dave     333333333  55          48310  Asthma
Eva      666666666  55          48310  Diabetes


AOL Query Log Release (2006)

• User 4417749

• “numb fingers”

• “60 single men”

• “dog that urinates on everything”

• “landscapers in Lilburn, Ga”

• Several people’s names with the last name Arnold

• “homes sold in shadow lake subdivision gwinnett county georgia”

20 million Web search queries by AOL

AnonID  Query                 QueryTime            ItemRank  ClickURL
217     lottery               2006-03-01 11:58:51  1         http://www.calottery.com
217     lottery               2006-03-27 14:10:38  1         http://www.calottery.com
1268    gall stones           2006-05-11 02:12:51
1268    gallstones            2006-05-11 02:13:02  1         http://www.niddk.nih.gov
1268    ozark horse blankets  2006-03-01 17:39:28  8         http://www.blanketsnmore.com


The Genome Hacker (2013)


Differential Privacy

• Statistical outcome (view) is indistinguishable regardless of whether a particular user is included in the data
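
The informal statement above is usually made precise as follows. This is the standard textbook form of ε-differential privacy, not text recovered from the slide; the symbols A (the randomized mechanism), D and D' (neighboring datasets that differ in one user's record), and O (any set of possible outputs) are notation assumed here to match the later slides.

```latex
\Pr[A(D) \in O] \;\le\; e^{\varepsilon} \cdot \Pr[A(D') \in O]
\quad \text{for all neighboring } D, D' \text{ and all } O \subseteq \mathrm{Range}(A)
```

A smaller ε forces the two output distributions closer together, which means stronger privacy and typically more noise.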


Differential Privacy

[Diagram: Private Data D or Private Data D’ → privacy preserving data mining/sharing mechanism → Models/Data]

• View is indistinguishable regardless of the input


Differential privacy: an example

[Figure: original records, original histogram, and perturbed histogram with differential privacy]


Laplace Mechanism

[Diagram: Private Data → Query q → true answer q(D) → released answer q(D) + η, with noise η drawn from the Laplace distribution Lap(S/ε)]

[Figure: density of the Laplace distribution Lap(S/ε), plotted over -10 to 10]


Laplace Distribution

• PDF: $f(x \mid u, b) = \frac{1}{2b} \exp\left(-\frac{|x - u|}{b}\right)$

• Denoted as Lap(b) when u = 0

• Mean u

• Variance $2b^2$


How much noise for privacy?

Sensitivity: Consider a query q: I → R. S(q) is the smallest number such that for any neighboring tables D, D’,

| q(D) – q(D’) | ≤ S(q)

Theorem: If the sensitivity of the query q is S(q), then the algorithm A(D) = q(D) + Lap(S(q)/ε) guarantees ε-differential privacy

[Dwork et al., TCC 2006]
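
The theorem above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `laplace_noise` and `laplace_mechanism` are hypothetical, and the sampler relies on the fact that the difference of two independent exponential variables with mean b follows Lap(b). Floating-point samplers like this one are fine for teaching but are not hardened for production use.

```python
import random


def laplace_noise(scale: float) -> float:
    """Draw one sample from Lap(scale): mean 0, variance 2 * scale**2."""
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    rate = 1.0 / scale
    return random.expovariate(rate) - random.expovariate(rate)


def laplace_mechanism(true_answer: float, sensitivity: float, epsilon: float) -> float:
    """Release q(D) + Lap(S(q) / epsilon), as in the theorem above."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return true_answer + laplace_noise(sensitivity / epsilon)


if __name__ == "__main__":
    # Hypothetical query answer of 42 with sensitivity 1 and epsilon 0.5.
    print(laplace_mechanism(42.0, sensitivity=1.0, epsilon=0.5))
```

Each call draws fresh noise, so repeated releases of the same query consume additional privacy budget.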


Example: COUNT query

• Number of people having HIV+

• Sensitivity = 1 (adding or removing one person changes the count by at most 1)

• ε-differentially private count: 3 + η (for a true count of 3), where η is drawn from Lap(1/ε)

• Mean of η = 0

• Variance of η = 2/ε²
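
To give a feel for the magnitudes, here is the variance formula evaluated for two illustrative values of ε. The values 0.5 and 0.1 are chosen here as examples and do not come from the slides.

```latex
\varepsilon = 0.5:\quad \eta \sim \mathrm{Lap}(2),\quad \mathrm{Var}(\eta) = \frac{2}{0.5^2} = 8,\quad \sigma = \sqrt{8} \approx 2.83

\varepsilon = 0.1:\quad \eta \sim \mathrm{Lap}(10),\quad \mathrm{Var}(\eta) = \frac{2}{0.1^2} = 200,\quad \sigma = \sqrt{200} \approx 14.1
```

With a true count of 3, the noise at ε = 0.1 is much larger than the answer itself, which is why small counts are hard to release accurately under a tight budget.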


Example: Sum (Average) query

• Sum of Age (suppose Age is in [a,b])

• Sensitivity = b (adding or removing one record changes the sum by at most b, assuming 0 ≤ a ≤ b)
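
A sum and an average can be released along these lines. This is a sketch under assumptions, not the slide's method: values are clipped to [lower, upper] so the stated sensitivity really holds, the sum uses sensitivity max(|lower|, |upper|) (which equals b when 0 ≤ a ≤ b), and the average is formed as a noisy sum divided by a noisy count with the budget split in half. The function names are hypothetical.

```python
import random


def laplace_noise(scale):
    """Sample from Lap(scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return random.expovariate(rate) - random.expovariate(rate)


def noisy_sum(values, lower, upper, epsilon):
    """epsilon-DP sum of values clipped to [lower, upper]."""
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = max(abs(lower), abs(upper))  # one record moves the sum by at most this
    return sum(clipped) + laplace_noise(sensitivity / epsilon)


def noisy_average(values, lower, upper, epsilon):
    """epsilon-DP average: half the budget for the sum, half for the count."""
    s = noisy_sum(values, lower, upper, epsilon / 2)
    n = len(values) + laplace_noise(1.0 / (epsilon / 2))  # count has sensitivity 1
    return s / max(n, 1.0)  # guard against a tiny or negative noisy count


if __name__ == "__main__":
    ages = [23, 35, 41, 58, 62, 29, 47]  # made-up data for illustration
    print(noisy_average(ages, lower=0, upper=100, epsilon=1.0))
```

Dividing the two noisy quantities is post-processing, so the total cost stays at ε by sequential composition of the two halves.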


Composition theorems

• Sequential composition: (∑i εi)-differential privacy

• Parallel composition: max(εi)-differential privacy


Sequential Composition

• If M1, M2, ..., Mk are algorithms that access a private database D such that each Mi satisfies εi-differential privacy, then the combination of their outputs satisfies ε-differential privacy with ε = ε1 + ... + εk
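
Sequential composition is what makes a "privacy budget" meaningful: every query against the same database D adds its εi to a running total. A minimal bookkeeping sketch follows; the class name `PrivacyBudget` and its methods are hypothetical and only illustrate the accounting, not any particular library.

```python
class PrivacyBudget:
    """Tracks cumulative epsilon spent on one database under sequential composition."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon: float) -> float:
        """Reserve `epsilon` for one query; refuse if the total would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:  # small tolerance for float sums
            raise RuntimeError("privacy budget exceeded")
        self.spent += epsilon
        return epsilon

    @property
    def remaining(self) -> float:
        return max(self.total - self.spent, 0.0)


if __name__ == "__main__":
    budget = PrivacyBudget(total_epsilon=1.0)
    budget.spend(0.4)  # first query
    budget.spend(0.6)  # second query; the combined release is 1.0-DP
    print(budget.remaining)
```

Once the budget is used up, any further query on D would push the combined guarantee past the agreed ε, so the tracker refuses it.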


Parallel Composition

• If M1, M2, ..., Mk are algorithms that access disjoint databases D1, D2, …, Dk such that each Mi satisfies εi-differential privacy, then the combination of their outputs satisfies ε-differential privacy with ε = max{ε1, ..., εk}
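
A histogram is the classic use of parallel composition: each record falls into exactly one bin, so the bins partition the data into disjoint databases and every bin count can use the full ε. The sketch below assumes the bin labels are fixed in advance (independent of the data) and that records outside the listed bins are ignored; `noisy_histogram` is a hypothetical name.

```python
import random
from collections import Counter


def laplace_noise(scale):
    """Sample from Lap(scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return random.expovariate(rate) - random.expovariate(rate)


def noisy_histogram(records, bins, epsilon):
    """epsilon-DP histogram over a fixed, data-independent list of bins."""
    counts = Counter(records)
    # Each bin is a count query (sensitivity 1) on a disjoint slice of the data,
    # so by parallel composition the whole histogram costs max(epsilon) = epsilon.
    return {b: counts.get(b, 0) + laplace_noise(1.0 / epsilon) for b in bins}


if __name__ == "__main__":
    diagnoses = ["AIDS", "AIDS", "Asthma", "Asthma", "Diabetes"]
    print(noisy_histogram(diagnoses, ["AIDS", "Asthma", "Diabetes", "Flu"], epsilon=1.0))
```

Note that empty bins such as "Flu" also receive noise; releasing only the non-empty bins would itself leak information about the data.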


Postprocessing

• If M1 is an ε-differentially private algorithm that accesses a private database D, then outputting M2(M1(D)) also satisfies ε-differential privacy.
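
A common use of the post-processing property is cleaning up a noisy count so that it looks like a count again. The helper below is a hypothetical example of such an M2: it touches only the already-released noisy value, never the private data, so it costs no additional privacy budget.

```python
def postprocess_count(noisy_count: float) -> int:
    """Round a noisy count to the nearest integer and clamp it at zero."""
    return max(0, round(noisy_count))


if __name__ == "__main__":
    # Hypothetical noisy outputs of an epsilon-DP count query.
    for value in (3.4, 2.6, -1.7):
        print(value, "->", postprocess_count(value))
```

The cleaned value is still ε-differentially private, although clamping introduces a small upward bias for counts near zero.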

[Source: Tutorial: Differential Privacy in the Wild, Module 2]


Privacy Preserving Data Mining as Constrained Optimization

• Two goals

• Privacy

• Error (utility)

• Given a task and privacy budget ε, how to design a set of queries (functions) and allocate the budget such that the error is minimized?


Data mining algorithms with differential privacy

• General algorithmic framework

• Decompose a data mining algorithm into a set of functions

• Allocate privacy budget to each function

• Implement each function with εi-differential privacy

• Compute noisy output using the Laplace mechanism based on the sensitivity of the function and εi

• Compose them using the composition theorems

• Optimization techniques

• Decomposition design

• Budget allocation

• Sensitivity reduction for each function


Review: K-means Clustering


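K-means Problem

• Partition a set of points x1, x2, …, xn into k clusters S1, S2, …, Sk such that the SSE is minimized:

$\mathrm{SSE} = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$, where $\mu_i$ is the mean of the cluster $S_i$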


K-means Algorithm

• Initialize a set of k centers

• Repeat until convergence

1. Assign each point to its nearest center

2. Update the set of centers

• Output final set of k centers and the points in each cluster


Differentially Private K-means

• Initialize a set of k centers

• Repeat iterations until convergence

• In each iteration (given a set of centers):

1. Assign the points to the closest center

2. Compute the size of each cluster

3. Compute the sum (centroid) of points in each cluster

• Output the final centroid and size of each cluster

[BDMN 05]


Differentially Private K-means

• Initialize a set of k centers

• Suppose we fix the number of iterations to T

• In each iteration (given a set of centers):

1. Assign the points to the closest center

2. Compute the noisy size of each cluster: sensitivity S = 1, noise Laplace(2T/ε)

3. Compute the noisy sum (centroid) of points in each cluster: sensitivity S = Dom (the size of the data domain), noise Laplace(2T·Dom/ε)

• Output the final centroid and size of each cluster

• Each iteration uses ε/T of the privacy budget, split evenly between the size and sum queries (ε/(2T) each); the total privacy cost is ε

[BDMN 05]
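
The iteration above can be sketched as follows. To keep the example short it makes several simplifying assumptions that are not in the slides: points are one-dimensional and lie in [0, dom], the initial centers are drawn uniformly from the domain (so initialization does not touch the data), and a cluster whose noisy size falls below 1 keeps its previous center. The function name `dp_kmeans_1d` is hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample from Lap(scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def dp_kmeans_1d(points, k, epsilon, T, dom, seed=None):
    """Laplace k-means for 1-D points assumed to lie in [0, dom]."""
    rng = random.Random(seed)
    centers = [rng.uniform(0.0, dom) for _ in range(k)]  # data-independent init
    eps_query = epsilon / (2 * T)  # epsilon/T per iteration, halved for size and sum
    sizes = [0.0] * k
    for _ in range(T):
        sums, counts = [0.0] * k, [0] * k
        for x in points:
            x = min(max(x, 0.0), dom)  # enforce the assumed domain
            j = min(range(k), key=lambda i: abs(x - centers[i]))
            sums[j] += x
            counts[j] += 1
        for j in range(k):
            sizes[j] = counts[j] + laplace_noise(1.0 / eps_query, rng)  # Lap(2T/eps)
            noisy_sum = sums[j] + laplace_noise(dom / eps_query, rng)   # Lap(2T*dom/eps)
            if sizes[j] >= 1.0:
                centers[j] = min(max(noisy_sum / sizes[j], 0.0), dom)
    return centers, sizes


if __name__ == "__main__":
    data = [10.0] * 500 + [90.0] * 500  # two well-separated synthetic clusters
    print(dp_kmeans_1d(data, k=2, epsilon=1.0, T=5, dom=100.0, seed=0))
```

Within one iteration the clusters are disjoint, so the k size queries together cost ε/(2T) by parallel composition, and likewise the k sum queries; over T iterations sequential composition gives the total of ε.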


Results (T = 10 iterations, random initialization)

[Figure: clustering results of the original K-means algorithm and the Laplace K-means algorithm]

• Laplace k-means can distinguish clusters that are far apart

• Laplace k-means can’t distinguish small clusters that are close by


Frequent Sequence Mining (FSM)

Support threshold = 3 (implied by the frequent sets below)

Database D

ID   Record
100  a→c→d
200  b→c→d
300  a→b→c→e→d
400  d→b
500  a→d→c→d

Scan D → C1: cand 1-seqs

Sequence  Sup.
{a}       3
{b}       3
{c}       4
{d}       5
{e}       1

F1: freq 1-seqs

Sequence  Sup.
{a}       3
{b}       3
{c}       4
{d}       5

C2: cand 2-seqs (supports after Scan D)

Sequence  Sup.
{a→a}     0
{a→b}     1
{a→c}     3
{a→d}     3
{b→a}     0
{b→b}     0
{b→c}     2
{b→d}     2
{c→a}     0
{c→b}     0
{c→c}     0
{c→d}     4
{d→a}     0
{d→b}     1
{d→c}     1
{d→d}     1

F2: freq 2-seqs

Sequence  Sup.
{a→c}     3
{a→d}     3
{c→d}     4

C3: cand 3-seqs (supports after Scan D)

Sequence  Sup.
{a→c→d}   3

F3: freq 3-seqs

Sequence  Sup.
{a→c→d}   3
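
The level-wise procedure on this slide (scan, prune, join, repeat) can be reproduced with a short script. It is a sketch under assumptions: a record supports a sequence when the sequence occurs in it as a (not necessarily contiguous) subsequence, records are encoded as strings of single-letter items, and the threshold of 3 is inferred rather than stated on the slide. The function names are hypothetical.

```python
def is_subsequence(pattern, record):
    """True if the items of `pattern` occur in `record` in order, gaps allowed."""
    remaining = iter(record)
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Number of records that contain `pattern` as a subsequence."""
    return sum(is_subsequence(pattern, rec) for rec in database.values())


def frequent_sequences(database, threshold):
    """Apriori-style mining: returns {sequence tuple: support} for frequent sequences."""
    items = sorted({x for rec in database.values() for x in rec})
    candidates = [(x,) for x in items]  # C1
    result = {}
    while candidates:
        counted = {c: support(c, database) for c in candidates}  # scan D
        frequent = {c: s for c, s in counted.items() if s >= threshold}
        result.update(frequent)
        # Join step: extend p with the last item of q when p's suffix matches q's prefix.
        candidates = [p + (q[-1],) for p in frequent for q in frequent if p[1:] == q[:-1]]
    return result


# Database D from the slide, one character per item.
DATABASE = {100: "acd", 200: "bcd", 300: "abced", 400: "db", 500: "adcd"}

if __name__ == "__main__":
    for seq, sup in frequent_sequences(DATABASE, threshold=3).items():
        print("→".join(seq), sup)
```

The join from F1 produces all 16 ordered pairs, matching C2 above, and the only candidate that survives to length 3 is a→c→d.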


Baseline Differentially Private FSM

Database D

ID   Record
100  a→c→d
200  b→c→d
300  a→b→c→e→d
400  d→b
500  a→d→c→d

Scan D → C1: cand 1-seqs, noise drawn from Lap(|C1| / ε1)

Sequence  Sup.  Noise
{a}       3      0.2
{b}       3     -0.4
{c}       4      0.4
{d}       5     -0.5
{e}       1      0.8

F1: freq 1-seqs

Sequence  Noisy Sup.
{a}       3.2
{c}       4.4
{d}       4.5

Scan D → C2: cand 2-seqs, noise drawn from Lap(|C2| / ε2)

Sequence  Sup.  Noise
{a→a}     0      0.2
{a→c}     3      0.3
{a→d}     3      0.2
{c→a}     0     -0.5
{c→c}     0      0.8
{c→d}     4      0.2
{d→a}     0      0.3
{d→c}     1      2.1
{d→d}     1     -0.5

F2: freq 2-seqs

Sequence  Noisy Sup.
{a→c}     3.3
{a→d}     3.2
{c→d}     4.2
{d→c}     3.1

Scan D → C3: cand 3-seqs, noise drawn from Lap(|C3| / ε3)

Sequence  Sup.  Noise
{a→c→d}   3      0
{a→d→c}   1      0.3

F3: freq 3-seqs

Sequence  Noisy Sup.
{a→c→d}   3

• Noise can drop a truly frequent sequence ({b}: 3 - 0.4 = 2.6) or admit an infrequent one ({d→c}: 1 + 2.1 = 3.1)

S. Xu, S. Su, X. Cheng, Z. Li, L. Xiong. Differentially Private Frequent Sequence Mining via Sampling-based Candidate Pruning. ICDE 2015
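
One level of the baseline can be sketched as below: every candidate's support is perturbed with Lap(|Ck| / εk) and only candidates whose noisy support reaches the threshold are kept. This is an illustration of the noise-and-prune step only, not the sampling-based method of the cited paper; the function name `noisy_frequent` and the example supports passed to it are assumptions for the demo.

```python
import random


def laplace_noise(scale, rng):
    """Sample from Lap(scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_frequent(candidate_supports, threshold, epsilon_k, rng):
    """Perturb each support in C_k with Lap(|C_k| / epsilon_k) and prune."""
    # One record can contribute to up to |C_k| candidate supports, so the
    # sensitivity of the whole vector of counts is taken to be |C_k|.
    scale = len(candidate_supports) / epsilon_k
    noisy = {c: s + laplace_noise(scale, rng) for c, s in candidate_supports.items()}
    return {c: v for c, v in noisy.items() if v >= threshold}


if __name__ == "__main__":
    rng = random.Random(0)
    c1 = {"a": 3, "b": 3, "c": 4, "d": 5, "e": 1}  # exact supports of C1 in database D
    print(noisy_frequent(c1, threshold=3, epsilon_k=2.0, rng=rng))
```

Because the noise scale grows with the number of candidates, large candidate sets quickly drown out small supports, which is the motivation for pruning candidates before adding noise.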


Frequent pattern (subgraph) mining

• Represent each record as a graph

• Modeling the co-occurrence between diagnoses, procedures, medications

• Frequent subgraph mining with differential privacy

[Figure: input graphs over vertices v1, v2, v3, v4 and the resulting frequent subgraphs (e.g. v1, v4, …) with support = 3; threshold = 3]

S. Xu, S. Su, L. Xiong, X. Cheng, K. Xiao. Differentially Private Frequent Subgraph Mining. ICDE 2016


Acknowledgements

• Research support

• Center for Comprehensive Informatics

• Woodrow Wilson Foundation

• Cisco research award

• Students

• James Gardner

• Yonghui Xiao

• Collaborators

• Andrew Post, CCI

• Fusheng Wang, CCI

• Tyrone Grandison, IBM

• Chun Yuan, Tsinghua


Emory Assured Information Management and Sharing (AIMS) Lab

• Collect, use, analyze, share data without compromising privacy