Towards Achieving Anonymity

An Zhu

Upload: morton

Post on 15-Jan-2016



TRANSCRIPT

Page 1: Towards Achieving Anonymity

An Zhu

Towards Achieving Anonymity

Page 2: Towards Achieving Anonymity

Introduction

Collect and analyze personal data
  Infer trends and patterns
Making the personal data “public”
  Joining multiple sources
  Third party involvement
  Privacy concerns
Q: How to share such data?

Page 3: Towards Achieving Anonymity

Example: Medical Records

Identifiers                      Sensitive Info
SSN  Name   Age  Race   Zipcode  Disease
614  Sara   31   Cauc   94305    Flu
615  Joan   34   Cauc   94307    Cold
629  Kelly  27   Cauc   94301    Diabetes
710  Mike   41   Afr-A  94305    Flu
840  Carl   41   Afr-A  94059    Arthritis
780  Joe    65   Hisp   94042    Heart problem
616  Rob    46   Hisp   94042    Arthritis

Page 4: Towards Achieving Anonymity

De-identified Records

                     Sensitive Info
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis

Page 5: Towards Achieving Anonymity

Not Sufficient! [Sweeney 00’]

                     Sensitive Info
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis

[Figure: Public Database linked to the table; callout: Unique Identifiers!]

Page 6: Towards Achieving Anonymity

Not Sufficient! [Sweeney 00’]

Quasi-Identifiers    Sensitive Info
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis

[Figure: Public Database linked to the table; callout: Unique Identifiers!]
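The linking step behind the "Not Sufficient!" slides can be sketched as a join on the quasi-identifiers. This is only an illustration: the `public` list below is an assumed stand-in for a public database, built from the names and quasi-identifier values of the medical-records example, and `link` is a hypothetical helper name.

```python
# De-identified medical records from the slides: (age, race, zipcode, disease).
medical = [
    (31, "Cauc", "94305", "Flu"),
    (34, "Cauc", "94307", "Cold"),
    (27, "Cauc", "94301", "Diabetes"),
    (41, "Afr-A", "94305", "Flu"),
    (41, "Afr-A", "94059", "Arthritis"),
    (65, "Hisp", "94042", "Heart problem"),
    (46, "Hisp", "94042", "Arthritis"),
]

# Assumed public database: (name, age, race, zipcode), no disease column.
public = [
    ("Sara", 31, "Cauc", "94305"),
    ("Joan", 34, "Cauc", "94307"),
    ("Kelly", 27, "Cauc", "94301"),
    ("Mike", 41, "Afr-A", "94305"),
    ("Carl", 41, "Afr-A", "94059"),
    ("Joe", 65, "Hisp", "94042"),
    ("Rob", 46, "Hisp", "94042"),
]


def link(medical, public):
    """Join on (age, race, zipcode); report a name only if the match is unique."""
    by_quasi = {}
    for age, race, zipcode, disease in medical:
        by_quasi.setdefault((age, race, zipcode), []).append(disease)
    reidentified = {}
    for name, age, race, zipcode in public:
        diseases = by_quasi.get((age, race, zipcode), [])
        if len(diseases) == 1:
            reidentified[name] = diseases[0]
    return reidentified


reidentified = link(medical, public)
```

Because every (Age, Race, Zipcode) combination in the example is unique, each record is matched to exactly one name.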

Page 7: Towards Achieving Anonymity

Anonymize the Quasi-Identifiers!

Quasi-Identifiers    Sensitive Info
Age  Race   Zipcode  Disease
***  ***    ***      Flu
***  ***    ***      Cold
***  ***    ***      Diabetes
***  ***    ***      Flu
***  ***    ***      Arthritis
***  ***    ***      Heart problem
***  ***    ***      Arthritis

[Figure: Public Database linked to the table; callout: Unique Identifiers!]

Page 8: Towards Achieving Anonymity

Q: How to share such data?

Anonymize the quasi-identifiers
  Suppress information
  Privacy guarantee: anonymity
  Quality: the amount of suppressed information
Clustering
  Privacy guarantee: cluster size
  Quality: various clustering measures


Page 10: Towards Achieving Anonymity

k-anonymized Table [Samarati 01’]

Quasi-Identifiers    Sensitive Info
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis

Page 11: Towards Achieving Anonymity

k-anonymized Table [Samarati 01’]

Quasi-Identifiers    Sensitive Info
Age  Race   Zipcode  Disease
*    Cauc   *        Flu
*    Cauc   *        Cold
*    Cauc   *        Diabetes
41   Afr-A  *        Flu
41   Afr-A  *        Arthritis
*    Hisp   94042    Heart problem
*    Hisp   94042    Arthritis

Each row is identical to at least k-1 other rows

Page 12: Towards Achieving Anonymity

Definition: k-anonymity

Input: a table consisting of n rows, each with m attributes (quasi-identifiers)

Output: suppress some entries such that each row is identical to at least k-1 other rows

Objective: minimize the number of suppressed entries
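The definition can be checked mechanically. The sketch below uses the quasi-identifier columns of the anonymized table from the previous slide; `is_k_anonymous` and `suppression_cost` are hypothetical helper names, and `*` is assumed to mark a suppressed entry.

```python
from collections import Counter

STAR = "*"  # marker for a suppressed entry


def is_k_anonymous(rows, k):
    """True if every row is identical to at least k-1 other rows."""
    counts = Counter(rows)
    return all(count >= k for count in counts.values())


def suppression_cost(rows):
    """Objective value: the number of suppressed entries."""
    return sum(cell == STAR for row in rows for cell in row)


# Quasi-identifier columns (Age, Race, Zipcode) of the anonymized table.
table = [
    ("*", "Cauc", "*"),
    ("*", "Cauc", "*"),
    ("*", "Cauc", "*"),
    ("41", "Afr-A", "*"),
    ("41", "Afr-A", "*"),
    ("*", "Hisp", "94042"),
    ("*", "Hisp", "94042"),
]
```

On this table the smallest group of identical rows has two members, so the table is 2-anonymous but not 3-anonymous.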

Page 13: Towards Achieving Anonymity

Past Work and New Results

[MW 04’]
  NP-hardness for a large size alphabet
  O(k log k)-approximation
[AFKMPTZ 05’]
  NP-hardness even for ternary alphabet
  O(k)-approximation
  1.5-approximation for 2-anonymity
  2-approximation for 3-anonymity


Page 15: Towards Achieving Anonymity

Graph Representation

A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1

[Figure: graph on vertices A-F with weighted edges]

W(e) = Hamming distance between the two rows
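The edge weights of the graph representation can be computed directly from the six example rows. A minimal sketch, with `hamming` and `weights` as illustrative names:

```python
from itertools import combinations

# The six example rows from the slide.
rows = {
    "A": "001000",
    "B": "100101",
    "C": "010101",
    "D": "001000",
    "E": "110111",
    "F": "011011",
}


def hamming(u, v):
    """Number of positions in which two equal-length rows differ."""
    return sum(a != b for a, b in zip(u, v))


# Complete graph: one weighted edge per unordered pair of rows.
weights = {
    (u, v): hamming(rows[u], rows[v])
    for u, v in combinations(sorted(rows), 2)
}
```

Rows A and D are identical, so the edge between them has weight 0.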

Page 16: Towards Achieving Anonymity

Edge Selection I

A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1

Each node selects the lightest weight edge

k = 3

[Figure: graph on vertices A-F showing the selected edges and their weights]

Page 17: Towards Achieving Anonymity

Edge Selection II

A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1

For components with < k vertices, add more edges

k = 3

[Figure: graph on vertices A-F showing the selected edges and their weights]
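The two edge-selection steps can be sketched as follows. This is a simplified illustration, not the authors' exact procedure: ties are broken alphabetically (an assumption), edges that would close a cycle are skipped, and in step II the one vertex of a small component that has not yet selected an edge picks its lightest edge leaving the component. `build_forest` is a hypothetical name.

```python
def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def build_forest(rows, k):
    """Return forest edges (u, v, weight); every component ends with >= k vertices."""
    names = sorted(rows)
    parent = {u: u for u in names}  # union-find over components

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    charged = set()  # vertices that have already selected an edge
    edges = []

    def add_edge(u, v):
        edges.append((u, v, hamming(rows[u], rows[v])))
        charged.add(u)
        parent[find(u)] = find(v)

    # Step I: each node selects its lightest edge (skipped if it closes a cycle).
    for u in names:
        v = min((w for w in names if w != u),
                key=lambda w: (hamming(rows[u], rows[w]), w))
        if find(u) != find(v):
            add_edge(u, v)

    # Step II: components with fewer than k vertices add more edges.
    while True:
        sizes = {}
        for u in names:
            sizes[find(u)] = sizes.get(find(u), 0) + 1
        small = [u for u in names
                 if u not in charged and sizes[find(u)] < k]
        if not small:
            break
        u = small[0]
        outside = [w for w in names if find(w) != find(u)]
        if not outside:  # fewer than k rows in total
            break
        add_edge(u, min(outside, key=lambda w: (hamming(rows[u], rows[w]), w)))
    return edges


rows = {"A": "001000", "B": "100101", "C": "010101",
        "D": "001000", "E": "110111", "F": "011011"}
edges = build_forest(rows, 3)
```

With alphabetical tie-breaking the example yields two components of three vertices each; the slides' figure may break ties differently and obtain a different forest.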

Page 18: Towards Achieving Anonymity

Lemma

Total weight of edges selected is no more than OPT
  In the optimal solution, each vertex pays at least the weight of the (k-1)st lightest weight edge
  Forest: at most one edge per vertex
  By construction, the edge weight is no more than the (k-1)st lightest weight edge per vertex

Page 19: Towards Achieving Anonymity

Grouping

Ideally, each connected component forms a group
Anonymize vertices within a group
Total cost of a group: (total edge weights) × (number of nodes)
  Example: (2+2+3+3) × 6
Small groups: O(k)

[Figure: graph on vertices A-F with edge weights 0, 2, 2, 3, 3]

Page 20: Towards Achieving Anonymity

Dividing a Component

Root tree arbitrarily
Divide if sub-trees and rest ≥ k
Aim: all sub-trees < k

[Figure: rooted tree with sub-trees labeled ≥ k and < k]

Page 21: Towards Achieving Anonymity

Dividing a Component

Root tree arbitrarily
Divide if sub-trees and rest ≥ k
Rotate the tree if necessary

[Figure: rooted tree with sub-trees labeled k]

Page 22: Towards Achieving Anonymity

Dividing a Component

Root tree arbitrarily
Divide if sub-trees and rest ≥ k
Termination condition: component size ≤ max(2k-1, 3k-5)

[Figure: rooted tree whose sub-trees all have < k vertices]

Page 23: Towards Achieving Anonymity

An Example

A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1

[Figure: graph on vertices A-F with edge weights 0, 2, 2, 3, 3]

Page 24: Towards Achieving Anonymity

An Example

A: 0 0 1 0 0 0
B: 1 0 0 1 0 1
C: 0 1 0 1 0 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 0 1 1 0 1 1

[Figure: the component drawn as a rooted tree on A-F with edge weights 0, 2, 2, 3, 3]

Page 25: Towards Achieving Anonymity

An Example

A: 0 * 1 0 * *
B: * * 0 1 * 1
C: * * 0 1 * 1
D: 0 * 1 0 * *
E: * * 0 1 * 1
F: 0 * 1 0 * *

Estimated cost: 4 × 3 + 3 × 3

Optimal cost: 3 × 3 + 3 × 3

[Figure: the rooted tree on A-F divided into two groups]
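The anonymized rows in this example follow from suppressing every column on which the rows of a group disagree. A small sketch, assuming the two groups {A, D, F} and {B, C, E} read off the anonymized table; `anonymize_group` and `anonymize` are illustrative names.

```python
rows = {"A": "001000", "B": "100101", "C": "010101",
        "D": "001000", "E": "110111", "F": "011011"}


def anonymize_group(group_rows):
    """Keep a column's value if all rows agree on it, otherwise suppress it."""
    return "".join(col[0] if len(set(col)) == 1 else "*"
                   for col in zip(*group_rows))


def anonymize(rows, groups):
    """Replace every row by the common pattern of its group."""
    published = {}
    for group in groups:
        pattern = anonymize_group([rows[name] for name in group])
        for name in group:
            published[name] = pattern
    return published


published = anonymize(rows, [("A", "D", "F"), ("B", "C", "E")])
cost = sum(pattern.count("*") for pattern in published.values())
```

Each group suppresses three columns in each of its three rows, which matches the stated cost of 3 × 3 + 3 × 3.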

Page 26: Towards Achieving Anonymity

Past Work and New Results

[MW 04’]
  NP-hardness for a large size alphabet
  O(k log k)-approximation
[AFKMPTZ 05’]
  NP-hardness even for ternary alphabet
  O(k)-approximation
  1.5-approximation for 2-anonymity
  2-approximation for 3-anonymity

Page 27: Towards Achieving Anonymity

1.5-approximation

A: 0 0 1 0 0 0
B: 0 0 0 0 0 0
C: 1 1 1 1 1 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 1 1 0 1 1 1

[Figure: graph on vertices A-F with weighted edges]

W(e) = Hamming distance between the two rows

Page 28: Towards Achieving Anonymity

Minimum {1,2}-matching

A: 0 0 1 0 0 0
B: 0 0 0 0 0 0
C: 1 1 1 1 1 1
D: 0 0 1 0 0 0
E: 1 1 0 1 1 1
F: 1 1 0 1 1 1

Each vertex is matched to 1 or 2 other vertices

[Figure: graph on vertices A-F showing the matching edges with weights 0, 0, 1, 1]

Page 29: Towards Achieving Anonymity

Properties

Each component has ≤ 3 nodes

[Figure: components with > 3 nodes, labeled "Not possible (degree 2)" and "Not optimal"]

Page 30: Towards Achieving Anonymity

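Qualities

Cost ≤ 2 OPT
For binary alphabet: ≤ 1.5 OPT

Component with two nodes (edge weight a)
  OPT pays: 2a
  We pay: 2a
Component with three nodes (selected edges p, q; third edge r ≥ p, q)
  OPT pays: ≥ p+q+r
  We pay: ≤ 3(p+q) ≤ 2(p+q+r)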

Page 31: Towards Achieving Anonymity

Past Work and New Results

[MW 04’]
  NP-hardness for a large size alphabet
  O(k log k)-approximation
[AFKMPTZ 05’]
  NP-hardness even for ternary alphabet
  O(k)-approximation
  1.5-approximation for 2-anonymity
  2-approximation for 3-anonymity

Page 32: Towards Achieving Anonymity

Open Problems

Can we improve O(k)?
  Ω(k) for graph representation

Page 33: Towards Achieving Anonymity

Open Problems

Can we improve O(k)?
  Ω(k) for graph representation

11111111 00000000 00000000 00000000 00000000
00000000 11111111 00000000 00000000 00000000
00000000 00000000 11111111 00000000 00000000
00000000 00000000 00000000 11111111 00000000
00000000 00000000 00000000 00000000 11111111

k = 5, d = 16, c = k × d / 2


Page 35: Towards Achieving Anonymity

Open Problems

Can we improve O(k)?
  Ω(k) for graph representation

10101010 10101010 10101010 10101010
11001100 11001100 11001100 11001100
11110000 11110000 11110000 11110000
11111111 00000000 11111111 00000000
11111111 11111111 00000000 00000000

k = 5, d = 16, c = 2 × d
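The two bit tables on the Open Problems slides can be compared programmatically. The sketch below rebuilds both as five rows each (an assumption about how the long bit strings split into rows, with c read as the number of columns) and checks that every pair of rows is at Hamming distance d = 16 in both, so the two instances look identical in the graph representation even though their tables differ.

```python
from itertools import combinations


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


# First table: row i holds eight 1s in its own block of a 40-column row.
first = ["0" * (8 * i) + "1" * 8 + "0" * (8 * (4 - i)) for i in range(5)]

# Second table: row i alternates runs of 2**i ones and 2**i zeros over 32 columns.
second = [("1" * 2**i + "0" * 2**i) * (16 // 2**i) for i in range(5)]


def pairwise_distances(rows):
    """Set of all pairwise Hamming distances between rows."""
    return {hamming(u, v) for u, v in combinations(rows, 2)}


def non_constant_columns(rows):
    """Columns that must be suppressed if all rows form one group."""
    return sum(len(set(col)) > 1 for col in zip(*rows))
```

Since the pairwise distances coincide, any algorithm that sees only the graph cannot tell the two tables apart, while the number of columns that would have to be suppressed when grouping all five rows differs.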


Page 37: Towards Achieving Anonymity

Q: How to share such data?

Anonymize the quasi-identifiers
  Suppress information
  Privacy guarantee: anonymity
  Quality: the amount of suppressed information
Clustering
  Privacy guarantee: cluster size
  Quality: various clustering measures

Page 38: Towards Achieving Anonymity

Clustering Approach [AFKKPTZ 06’]

Quasi-Identifiers    Sensitive Info
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis

Page 39: Towards Achieving Anonymity

Transfers into a Metric…

Quasi-Identifiers    Sensitive Info
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis

Page 40: Towards Achieving Anonymity

Clusters and Centers

Quasi-Identifiers    Sensitive Info
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
34   Cauc   94307    Cold
27   Cauc   94301    Diabetes
41   Afr-A  94305    Flu
41   Afr-A  94059    Arthritis
65   Hisp   94042    Heart problem
46   Hisp   94042    Arthritis

Page 41: Towards Achieving Anonymity

Clusters and Centers

Quasi-Identifiers    Sensitive Info
Age  Race   Zipcode  Disease
31   Cauc   94305    Flu
                     Cold
                     Diabetes
                     Flu
41   Afr-A  94059    Arthritis
                     Heart problem
46   Hisp   94042    Arthritis

Page 42: Towards Achieving Anonymity

Measure

How good are the clusters?
  “Tight” clusters are better
Minimize max radius: Gather-k
Minimize max distortion error: Cellular-k
  radius × num_nodes

Cost:
  Gather-k: 10
  Cellular-k: 624
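The two measures can be sketched for a clustering given as (center, members) pairs. The points below are made-up one-dimensional values for illustration only and do not reproduce the slide's figure; the sketch also assumes the Cellular-k cost sums radius × num_nodes over all clusters.

```python
def radius(center, members, dist):
    """Largest distance from the center to any member of the cluster."""
    return max(dist(center, p) for p in members)


def gather_cost(clusters, dist):
    """Gather-k measure: the maximum cluster radius."""
    return max(radius(c, m, dist) for c, m in clusters)


def cellular_cost(clusters, dist):
    """Cellular-k measure (assumed): sum of radius x num_nodes over clusters."""
    return sum(radius(c, m, dist) * len(m) for c, m in clusters)


def dist(a, b):
    return abs(a - b)


# Hypothetical clustering: (center, members) on a line.
clusters = [(1, [0, 1, 2]), (10, [8, 10, 13])]
```

Here the radii are 1 and 3, so the Gather-k cost is 3 and the Cellular-k cost is 1 × 3 + 3 × 3 = 12.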

Page 43: Towards Achieving Anonymity

Measure

How good are the clusters?
  “Tight” clusters are better
Minimize max radius: Gather-k
Minimize max distortion error: Cellular-k
  radius × num_nodes
Handle outliers
Constant approximations!

Page 44: Towards Achieving Anonymity

Comparison

k = 5

5-anonymity
  Suppress all entries
  More distortion
Clustering
  Can pick R5 as the center
  Less distortion
  Distortion is directly related with pair-wise distances

R1  0 1 1 1
R2  1 0 1 1
R3  1 1 0 1
R4  1 1 1 0
R5  1 1 1 1
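The comparison can be made concrete on the five rows R1 to R5. A short sketch with illustrative helper names: 5-anonymity has to put all five rows in one group, while clustering only needs a center and a radius.

```python
rows = {"R1": "0111", "R2": "1011", "R3": "1101", "R4": "1110", "R5": "1111"}


def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))


def suppressed_entries(rows):
    """Entries suppressed when all rows are made identical."""
    differing = sum(len(set(col)) > 1 for col in zip(*rows.values()))
    return differing * len(rows)


def cluster_radius(rows, center):
    """Largest Hamming distance from the chosen center to any row."""
    return max(hamming(rows[center], row) for row in rows.values())
```

Every column differs somewhere, so 5-anonymity suppresses all 20 entries, whereas with R5 as the center every row is within Hamming distance 1.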

Page 45: Towards Achieving Anonymity

Results [AFKKPTZ 06’]

Gather-k
  Tight 2-approximation
  Extension to outliers: 4-approximation
Cellular-k
  Primal-dual const. approximation
  Extensions as well


Page 47: Towards Achieving Anonymity

2-approximation

Assume an optimal value R
Make sure each node has at least k – 1 neighbors within distance 2R.

[Figure: node A with circles of radius R and 2R]

Page 48: Towards Achieving Anonymity

2-approximation

Assume an optimal value R
Make sure each node has at least k – 1 neighbors within distance 2R.
Pick an arbitrary node as a center and remove all remaining nodes within distance 2R. Repeat until all nodes are gone.
Make sure we can reassign nodes to the selected centers.
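The neighbor check and the center-selection loop for one guess of R can be sketched as below. This is only a partial illustration: the final reassignment step (the degree-constrained matching shown later) is not implemented, "arbitrary" is taken to mean the first remaining node, and `select_centers` is a hypothetical name.

```python
def select_centers(points, k, R, dist):
    """Return indices of chosen centers, or None if the guess R fails the neighbor check."""
    n = len(points)
    # Each node needs at least k-1 other nodes within distance 2R.
    for i in range(n):
        near = sum(1 for j in range(n)
                   if j != i and dist(points[i], points[j]) <= 2 * R)
        if near < k - 1:
            return None
    remaining = list(range(n))
    centers = []
    while remaining:
        c = remaining[0]  # arbitrary choice: first remaining node
        centers.append(c)
        # Remove the center and every remaining node within distance 2R of it.
        remaining = [j for j in remaining
                     if dist(points[c], points[j]) > 2 * R]
    return centers


def dist(a, b):
    return abs(a - b)


# Hypothetical points on a line, forming two well-separated groups.
points = [0, 1, 2, 10, 11, 12]
```

With k = 3 and R = 1 the loop picks one center in each group; with R = 0.5 the neighbor check already fails.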

Page 49: Towards Achieving Anonymity

Example: k = 5

Page 50: Towards Achieving Anonymity

Optimal Solution

[Figure: optimal clusters 1 and 2 with radius R]

Page 51: Towards Achieving Anonymity

Center Selection

Page 52: Towards Achieving Anonymity

Center Selection

[Figure: node 1 selected as a center]

Page 53: Towards Achieving Anonymity

Center Selection

[Figure: center 1 with distance 2R marked]

Page 54: Towards Achieving Anonymity

Center Selection

[Figure: center 1 with distance 2R marked]

Page 55: Towards Achieving Anonymity

Center Selection

[Figure: centers 1 and 2 with distance 2R marked]

Page 56: Towards Achieving Anonymity

Center Selection

[Figure: centers 1 and 2 with distance 2R marked]

Page 57: Towards Achieving Anonymity

Reassignment

[Figure: centers 1 and 2]

Page 58: Towards Achieving Anonymity

Degree Constrained Matching

[Figure: matching in which centers 1 and 2 have degree ≥ k-1 and every other node has degree = 1]

Page 59: Towards Achieving Anonymity

Actual Clustering

[Figure: clusters around centers 1 and 2]

Page 60: Towards Achieving Anonymity

Optimal Clustering

[Figure: clusters 1 and 2]

Page 61: Towards Achieving Anonymity

Our guarantees

Return clusters of radius no more than 2R
If R is guessed correctly, then reassignment is possible
  Each cluster has at least k nodes
A binary search on the value of R suffices

Page 62: Towards Achieving Anonymity

Binary Search on R

Assume an optimal value R
Make sure each node has at least k – 1 neighbors within distance 2R.
Pick an arbitrary node as a center and remove all remaining nodes within distance 2R. Repeat until all nodes are gone.
Make sure we can reassign nodes to the selected centers.

Page 63: Towards Achieving Anonymity

Binary Search on R

Assume an optimal value R
Make sure each node has at least k – 1 neighbors within distance 2R.
  Not necessary, but is useful for quick pruning
Pick an arbitrary node as a center and remove all remaining nodes within distance 2R. Repeat until all nodes are gone.
Make sure we can reassign nodes to the selected centers.

Page 64: Towards Achieving Anonymity

Binary Search on R

Assume an optimal value R
Make sure each node has at least k – 1 neighbors within distance 2R.
  Not necessary, but is useful for quick pruning
Pick an arbitrary node as a center and remove all remaining nodes within distance 2R. Repeat until all nodes are gone.
Make sure we can reassign nodes to the selected centers.
  If successful, R could be smaller
  Otherwise, R should be larger
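The outer search over R can be sketched generically. The sketch assumes the caller supplies a sorted list of candidate values and a `feasible(R)` predicate that runs the steps above and reports whether reassignment succeeded; both are hypothetical, and the search is only correct if feasibility is monotone in R.

```python
def smallest_feasible(candidates, feasible):
    """Binary search for the smallest candidate R accepted by feasible().

    candidates must be sorted ascending; feasible is assumed monotone
    (once a value succeeds, every larger value succeeds too).
    """
    lo, hi = 0, len(candidates) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if feasible(candidates[mid]):
            best = candidates[mid]  # success: R could be smaller
            hi = mid - 1
        else:
            lo = mid + 1            # failure: R should be larger
    return best
```

For example, `smallest_feasible([1, 2, 3, 4, 5], lambda R: R >= 3)` returns 3, and the function returns `None` when no candidate is feasible.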

Page 65: Towards Achieving Anonymity

Results [AFKKPTZ 06’]

Gather-k
  Tight 2-approximation
  Extension to outliers: 4-approximation
Cellular-k
  Primal-dual const. approximation
  Extensions

Page 66: Towards Achieving Anonymity

Ignore Cluster Size Constraint

Similar to Facility Location
  radius × num_nodes vs. individual_distance_to_center
Caveat
  Assigning one distant node to an existing cluster will increase cost proportional to number of nodes in that cluster
Each cluster is a (center, radius) pair

Page 67: Towards Achieving Anonymity

Intermediate Step I

Primal-dual constant approximation for radius × num_nodes
  No cluster size constraint
  Arbitrary cluster setup cost
We want radius × num_nodes
  Cluster size constraint
  No cluster setup cost

Page 68: Towards Achieving Anonymity

Enforce Cluster Size

Introduce extra cluster setup cost
  Setup cost pays for k nodes to join a particular cluster, i.e., c_setup = k × r
  This at most doubles the actual cost of any size constrained cluster solution
  Each cluster’s total cost is at least k × r
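The doubling claim can be written out for a single cluster. In the sketch below, n (the number of nodes in the cluster, with n ≥ k) is a symbol introduced here for illustration, r is the cluster radius, and the cluster cost is taken to be radius × num_nodes as on the earlier slides.

```latex
% n: number of nodes in the cluster (assumed symbol, n >= k); r: cluster radius
\begin{align*}
\text{cost without setup} &= n\,r \;\ge\; k\,r = c_{\text{setup}},\\
\text{cost with setup}    &= n\,r + c_{\text{setup}} = n\,r + k\,r \;\le\; 2\,n\,r .
\end{align*}
```

Summing over all clusters of a size-constrained solution gives a total of at most twice its original cost.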

Page 69: Towards Achieving Anonymity

Intermediate Step II

Shared solution!
  For each cluster with fewer than k nodes, additional nodes can join the cluster
  At no additional cost, paid for by the cluster setup cost
  Now nodes could be shared among multiple clusters
Key: convert a “shared” solution to a disjoint solution

Page 70: Towards Achieving Anonymity

Separation

Starting from small radius clusters
“Open” as long as there are enough nodes
The left over points in clusters “attach” to the intersecting smaller radius (open) clusters

[Figure: clusters labeled Open and Attached]

Page 71: Towards Achieving Anonymity

Regroup (k = 5)

Open cluster has ≥ k nodes
Attached cluster has < k nodes
Group clusters to create bigger ones
Choose the “fat” cluster’s center as the new center

[Figure: clusters labeled 3, 2, 4 and 6]

Page 72: Towards Achieving Anonymity

What About Cluster Cost?

These clusters intersect with the open cluster

Page 73: Towards Achieving Anonymity

What About Cluster Cost?

These clusters intersect with the open cluster
Routing cost is only a constant blowup w.r.t. the fat radius

Page 74: Towards Achieving Anonymity

What About Cluster Cost?

These clusters intersect with the open cluster
Routing cost is only a constant blowup w.r.t. the fat radius
Need to make sure the merged cluster is of reasonable size

Page 75: Towards Achieving Anonymity

Recap

Anonymize the quasi-identifiers
  Suppress information
  Privacy guarantee: anonymity
  Quality: the amount of suppressed information
Clustering
  Privacy guarantee: cluster size
  Quality: various clustering measures

Page 76: Towards Achieving Anonymity

Thanks!