Privacy and Anonymity in Text
Chris Clifton
12 November, 2009
Plausibly Deniable Search
This is joint work with Mummoorthy Murugesan
2009 SIAM International Conference on Data Mining (SDM09), Sparks, Nevada, April 30-May 2, 2009
The AOL Awakening
• In August 2006, AOL released its customers' web searches for research studies
• 20 million unique queries from 650K unique users
• Each <user-id> was replaced with a <random-number>
• A NY Times reporter successfully identified an individual from the queries
– Queries included “60 single men” and “landscapers in Lilburn, Ga”
– Many more queries contained enough information to uniquely identify the person
• AOL fired its CTO over this issue; two researchers were forced out
Privacy in Web Search
• Server-Controlled Privacy
– Deletion of queries after a few months
– Anonymization of query logs before backup
• Some of these methods have been shown to be inadequate
• Private Information Retrieval
– Affects the advertising business model
– Not practical with the current solutions
Lessons Learned
• Content of user queries reveals a lot
– Ego surfing: searching for one's own name, SSN, credit card
• Identifiable
– Location, type of work, age, medical condition
• Sensitive
– Car they own, restaurants in a zip code
• Query transformation alone is not enough
– Submitting Q’ instead of Q to retrieve the same set of documents
– User intent is still revealed
User-Controlled Privacy
1. Hide identifying metadata
– Private Web Search (PWS) – Firefox plugin (Yale Univ.)
• Removes metadata
• Hides user IP address (via Tor)

Private Web Search
Felipe Saint-Jean, Johnson, Boneh, Feigenbaum
• Tor: hides IP addresses
– Routes request and response through multiple servers
– Each server knows only the preceding server
• HTTP filter normalizes search queries
– Browser, OS, etc.
• HTML filter removes active components
User-Controlled Privacy
1. Hide identifying metadata
– Private Web Search (PWS) – Firefox plugin (Yale Univ.)
• Removes metadata
• Hides user IP address (via Tor)
2. Protect against disclosure through query terms
– TrackMeNot – Firefox plugin (NYU)
• Periodically issues randomized queries from a list of “seeds”
• Uses search results for “logical” future query terms

Remaining problems:
– Actual user query (user intent) is still revealed
– Timing attacks, load on server
– Query semantics attacks on the “logical” generated terms
Plausibly Deniable Search
[Protocol figure: the browser produces the user query q; the PDS component expands it into a queryset {q1,...,qk}, which is submitted to the search engine; the engine returns results {R(q1),...,R(qk)}; each R(qi) is filtered using the original q.]
Plausibly Deniable Search: Key Concepts
• Browser submits more than one query {q1,…,qk}
• Deniability
– Reversible: any of the k queries would have produced the same set
– The additional “cover queries” are of diverse topics
• Plausibility
– All queries are equally plausible
– Implausible queries would weaken the deniability argument
{“java compiler”, “newton apple”}
vs.
{“java compiler”, “motorola table”}
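The client-side flow (look up a precomputed PD-queryset, issue all k queries, filter results with the original q) can be sketched as follows; the function names, the `pd_querysets` structure, and the toy index are illustrative assumptions, not the authors' implementation:

```python
# Illustrative client-side sketch; `pd_querysets` is a precomputed list of
# plausibly deniable querysets and `search` stands in for the engine call.

def lookup_pd_queryset(user_query, pd_querysets):
    """Find the precomputed queryset containing the user's query."""
    for queryset in pd_querysets:
        if user_query in queryset:
            return queryset
    return [user_query]          # fall back to issuing the query alone

def pds_search(user_query, pd_querysets, search):
    queries = lookup_pd_queryset(user_query, pd_querysets)
    results = {q: search(q) for q in queries}   # all k queries hit the engine
    return results[user_query]                  # filter with the original q

fake_index = {"java compiler": ["javac docs"], "newton apple": ["history page"]}
hits = pds_search("java compiler",
                  [["java compiler", "newton apple"]],
                  lambda q: fake_index.get(q, []))
print(hits)  # ['javac docs']
```

The search engine observes k indistinguishable queries, while the user sees only the results for the real one.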
Plausibly Deniable Search: Theory
• Assume the following:
– User queries follow a distribution Pu
– Cover queries are generated through a distribution Pc
• Given a set of two queries S={q1,q2}, there are two possible events
– E1: q1 is the user query and q2 is the cover query
– E2: q2 is the user query and q1 is the cover query
P(E1|S) = P(E1) / (P(E1) + P(E2)) = Pu(q1)Pc(q2) / (Pu(q1)Pc(q2) + Pu(q2)Pc(q1))

P(E2|S) = P(E2) / (P(E1) + P(E2)) = Pu(q2)Pc(q1) / (Pu(q1)Pc(q2) + Pu(q2)Pc(q1))
Plausibly Deniable Search: Theory
• To achieve deniability for either of these queries, we require the following condition:

Pu(q1)Pc(q2) = Pu(q2)Pc(q1)

• Two of many possible solutions
1. Queries have equal probability of being user queries, and equal probability of being cover queries:
Pu(q1) = Pu(q2) and Pc(q1) = Pc(q2)
2. Queries have the same probability of being user query or cover query:
Pu(q1) = Pc(q1) and Pu(q2) = Pc(q2)
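Both solutions can be checked numerically: under either condition, the posterior P(E1|S) works out to 1/2, so an observer cannot favor one event over the other. A small sketch with made-up probability values:

```python
def posterior_e1(pu, pc, q1, q2):
    """P(E1 | S): probability that q1 is the user query given S = {q1, q2}."""
    num = pu[q1] * pc[q2]
    return num / (num + pu[q2] * pc[q1])

# Solution 1: equal user probabilities and equal cover probabilities.
pu = {"a": 0.3, "b": 0.3}
pc = {"a": 0.1, "b": 0.1}
print(posterior_e1(pu, pc, "a", "b"))  # 0.5

# Solution 2: each query is equally likely as a user query or a cover query.
pu = {"a": 0.4, "b": 0.05}
pc = {"a": 0.4, "b": 0.05}
print(posterior_e1(pu, pc, "a", "b"))  # 0.5
```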
Creating Plausibly Deniable Cover Queries
1. Create canonical queries
– Standard queries
2. Create PD-querysets
– Plausibly deniable querysets with k queries
3. Issue query
– Find and issue the PD-queryset for the given user query

Steps 1 and 2 are done in advance (server / third party)
Step One:Creating Canonical Queries
[Pipeline figure: seed documents → FP mining → seed queries → LSI combines semantically similar seed queries → canonical queries]

• Semantically similar surrogate queries for user queries
• Supports the “deniability” argument, since all queries could be generated by the system
Step Two: Creating PD-Querysets
• Dissimilarity between two queries is based on 3 measures:
– Euclidean distance: semantically similar queries are closer in the semantic space
– Magnitude: queries that are equally strong in their respective topics have similar magnitude
– Neighborhood count: equally plausible queries have a similar number of log (already-issued) queries in their neighborhood
[Figure: canonical queries → agglomerative clustering → PD-querysets]
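One way to sketch the queryset construction: fold the three measures into a single diversity score and greedily grow size-k sets of mutually distant but equally plausible queries. The weights (0.5, 0.1), vectors, and counts below are arbitrary assumptions, not the paper's parameters or its exact clustering algorithm:

```python
import math

def diversity(a, b):
    """High when a, b are semantically far apart yet equally plausible."""
    dist = math.dist(a["vec"], b["vec"])              # Euclidean distance
    mag_gap = abs(a["mag"] - b["mag"])                # magnitude difference
    nbr_gap = abs(a["neighbors"] - b["neighbors"])    # neighborhood-count gap
    return dist - 0.5 * mag_gap - 0.1 * nbr_gap       # assumed weights

def build_queryset(seed, pool, k):
    chosen = [seed]
    candidates = [q for q in pool if q is not seed]
    while len(chosen) < k and candidates:
        # Pick the candidate farthest from everything already chosen.
        best = max(candidates, key=lambda q: min(diversity(q, c) for c in chosen))
        chosen.append(best)
        candidates.remove(best)
    return chosen

canonical = [
    {"id": "java compiler", "vec": (1.0, 0.0), "mag": 1.0, "neighbors": 40},
    {"id": "python syntax", "vec": (0.9, 0.1), "mag": 1.0, "neighbors": 38},
    {"id": "newton apple",  "vec": (0.0, 1.0), "mag": 1.0, "neighbors": 42},
]
qs = build_queryset(canonical[0], canonical, 2)
print([q["id"] for q in qs])  # ['java compiler', 'newton apple']: the off-topic query wins
```

The near-duplicate "python syntax" is rejected as a cover query because it would reveal the topic; the topically distant but equally plausible "newton apple" is chosen.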
Step 3: Issuing Query
• User query is mapped to the semantic space
– vec(q) = qᵀU′S′⁻¹
• Find the canonical queries that have the maximum cosine similarity with q in the semantic space
• The PD-queryset of the selected canonical query is issued
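Step 3 can be sketched under the usual LSI conventions: with the SVD A = USVᵀ of the term-document matrix, a query is folded in as vec(q) = qᵀU′S′⁻¹ and matched by cosine similarity. The tiny matrix and canonical vectors below are illustrative assumptions:

```python
import numpy as np

A = np.array([[1., 1., 0., 0.],   # term-document matrix: terms 0-2 appear in
              [1., 0., 0., 0.],   # docs 0-1 (one topic), terms 3-5 in
              [1., 1., 0., 0.],   # docs 2-3 (another topic)
              [0., 0., 2., 2.],
              [0., 0., 2., 0.],
              [0., 0., 2., 2.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk = U[:, :k], np.diag(s[:k])          # truncated U', S'

def fold_in(q):
    """Map a term-frequency query vector into the k-dim semantic space."""
    return q @ Uk @ np.linalg.inv(Sk)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

canonical = {"c1": np.array([1., 0., 1., 0., 0., 0.]),     # first-topic terms
             "c2": np.array([0., 0., 0., 1., 1., 0.])}     # second-topic terms
q = np.array([1., 0., 1., 0., 0., 0.])                     # user query
best = max(canonical, key=lambda n: cosine(fold_in(q), fold_in(canonical[n])))
print(best)  # 'c1': the query shares all of its terms with that canonical query
```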
How Good is PDS?
• Deniability:
– The canonical query provides one level of anonymity
– Many seed queries map to a single canonical query
– The reversible property provides deniability
• Plausibility:
– Based on the number of similar-topic queries issued by users
– Measured as the perception of human subjects; difficult to quantify
• How good are the canonical queries? Do they fetch what the users want?
Results from Experiments
• Document collection
– DMOZ categorized web documents
– 314K documents and 1.28M unique terms
– Three topics: Computers, Science, Sports
• Number of documents in each category
– Computers: 115K
– Science: 100K
– Sports: 99K
• After performing SVD on the term-document matrix, only 30 columns are kept in U
Canonical Queries
• 2.6 million seed queries generated with ∆=500
• Produces 932K canonical queries
• Average canonical query length: 3.7 terms

Total canonical queries: 931,863
– 2 terms: 1,019
– 3 terms: 324,034
– 4 terms: 524,256
– 5 terms: 67,664
– 6 terms: 14,890
Example – query: {synthesis technology}; canonical: {synthesis molecular technology}; cover: {baker priority report}
Retrieval Performance
• 5K queries from the alltheweb.com searches
• 3.4K unique queries containing at least 75% of terms from our collection
• Six of top 20 in 69% of queries (500)
Topic Diversity
• DMOZ categories are used in comparing the topics of queries
• 85% of PD-Querysets have queries with >50% topic diversity
What is Next?
• PDS can be used along with other approaches such as PWS, Tor, etc.
• Canonical queries
– Efficient ways of creating canonical queries
– Improving retrieval performance
• Sequential queries
– How to handle sequentially edited queries by a user on the same topic?
– Can an attacker figure out the user queries over a period of time?
Query Sequences
• Users issue a sequence of queries on a topic
– Cover queries should be plausibly deniable sequences
• Consider two sequences, S1={a1,b1} and S2={a2,b2}, where <a1,a2> are issued together (first) and <b1,b2> are issued second
• There are two possible events:
– E1: S1 is the user sequence, S2 is the cover sequence
– E2: S1 is the cover sequence, S2 is the user sequence
Query Sequences
• Deniability is achieved when we satisfy the following constraint:

P(E1|S) = P(E1) / (P(E1) + P(E2)) = Pu(S1)Pc(S2) / (Pu(S1)Pc(S2) + Pu(S2)Pc(S1))

P(E2|S) = P(E2) / (P(E1) + P(E2)) = Pu(S2)Pc(S1) / (Pu(S1)Pc(S2) + Pu(S2)Pc(S1))

i.e., Pu(S1)Pc(S2) = Pu(S2)Pc(S1)

• Given deniability for the first queries a1, a2, this reduces to:

Pu(b1|a1)Pc(b2|a2) = Pu(b2|a2)Pc(b1|a1)
Two (of many) Possible Solutions
1. b1 and b2 have the same conditional probability of being user-generated
– Also the same conditional probability of being method-generated
Pu(b1|a1) = Pu(b2|a2) and Pc(b1|a1) = Pc(b2|a2)
2. b1 has equal conditional probability of being user-generated or method-generated; b2 has the same property
Pu(b1|a1) = Pc(b1|a1) and Pu(b2|a2) = Pc(b2|a2)

This is applicable to the (m+1)-th query given a sequence of m queries
Generating “user-like” Sequences
• Idea: inter-query time determines the difference between queries
– Learn the distribution of changes to queries as a function of inter-query time
– Given the elapsed time, generate the next query from the previous cover query and the appropriate distribution
• P(qk | qk-1) is the same as for a real user!
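A minimal sketch of that generation loop, assuming made-up bin boundaries and per-bin term-change distributions in place of the log-derived ones described on the next slide:

```python
import bisect, random

BIN_EDGES = [1, 2, 4, 8, 16, 32, 64]          # seconds; exponential grouping on time

def time_bin(dt_seconds):
    return bisect.bisect_right(BIN_EDGES, dt_seconds)

# Toy stand-in for P(number of term changes | time bin), learned from a log:
TERM_CHANGE_DIST = {
    0: {0: 0.1, 1: 0.7, 2: 0.2},              # quick follow-ups: small edits
    7: {0: 0.0, 1: 0.2, 2: 0.8},              # long gaps: bigger edits
}

def sample_term_changes(dt_seconds, rng):
    dist = TERM_CHANGE_DIST.get(time_bin(dt_seconds), {1: 1.0})
    changes, probs = zip(*dist.items())
    return rng.choices(changes, weights=probs, k=1)[0]

def next_cover_query(prev_cover, related_terms, dt_seconds, rng):
    """Edit the previous cover query by as many terms as a real user would."""
    n = min(sample_term_changes(dt_seconds, rng), len(prev_cover), len(related_terms))
    return prev_cover[:len(prev_cover) - n] + rng.sample(related_terms, n)

rng = random.Random(42)
q = next_cover_query(["motorola", "table"], ["phone", "battery", "charger"], 3, rng)
print(q)  # ['motorola', <one related term>]: one term changed after a short gap
```

Because the number of edits is drawn from the same distribution a real user exhibits at that inter-query time, the cover sequence is hard to separate from the user sequence on these features.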
Distribution of what changes?
• Features “defining” a query are those useful in linking queries in a sequence
– If the sequence can be discovered, it must be simulated
• Features from “I know what you did last summer” (Jones et al.)
– Term re-use and topic similarity used to link queries in a sequence
• Learned distributions from a large query log for ranges of inter-query times
– Topic relation
– Topic repetition
– Number of term changes
Feature Distributions with respect to Inter-Query Time
[Figure: two panels, “Term Changes” and “Topic Changes”. One shows probability (0–0.5) vs. bin number (0–24); the other shows probability (0–0.8) vs. number of term changes (0–10) for bins BN=5, 10, 15, 20. “Bin number” is an exponential grouping on time.]
Effectiveness: Topic Change Distribution on DMOZ
How well does it really work?
ecuador world trade center deaths
banana guide
gmail center ice hockey wabi sabi
the promenade temecula pittsburgh convention center
wa
sheriff sales detroit homes
100 center street wabi sabi
summit county sheriff sales
bolton center pa. donald judd
harris county sheriff thornton town center wabi sabi
Try again…
angelina jolie my space parker animal shelter
angelina ballerina panic at the disco brenden
parker animal shelter colorado
residence inn madison wi red carpet keim action group
tagawa
residence inn hillsboro claudio ciardi paradise valley arizona
rodeway inn maingate www yahoo.com map of texas
community collage of beaver county
www g map of tennessee
town of bennington nh www hp com santa ana california
elmo love on aol.com santa ana winds date
www.redeem your rewards.com
love aol.com when do the santa ana winds come
brookfield wi sunday brunch
webmail aol.com when santa ana winds
Figure it out yet?
letter signs m signs chinese jump rope the hall sisters
letter m logo chinese silk the hall sisters gospel group
alphabet letter m ohio medical board gospel group the hall sisters
letter m logo anderson ohio rules georgia gospel group the hall sisters
texas a&m at commerce net georgia gospel church sings
i’m makin some noise http southern georgia gospel church sings
m.i.t. land grab in new orleans
mini clip.com http
Disclosure-free Discovery of Related Documents
Chris Clifton
Mummoorthy Murugesan, Wei Jiang, Luo Si, Jaideep Vaidya
18 September, 2009
Proceedings of the 24th International Conference on Data Engineering (ICDE 2008), Cancun, Mexico, April 7-12, 2008
Problem:Identifying Common Interests
Alice's document: “We have evidence that Osama Bin Laden has financed the purchase of Stinger Missiles in Afghanistan. …”
Bob's document: “There have been reports from Kabul of financial transfers from Bin Laden, purportedly for the purchase of Stinger missiles …”

Alice's term counts: 1 Bin Laden 8; 2 Stinger 3; 3 Afghanistan 1; 4 Kabul 0; …
Bob's term counts: 1 Bin Laden 7; 2 Stinger 5; 3 Afghanistan 0; 4 Kabul 2; …
Solution Overview
[Figure: each document is reduced to a term vector; the two vectors are compared via the dot product a1*b1 + a2*b2 + a3*b3 + a4*b4 + …, computed securely]
Secure Product: Random Matrix
Vaidya and Clifton, Privacy Preserving Association Rule Mining in Vertically Partitioned Data, KDD02
Secure Product: Homomorphic Encryption
Goethals, Laur, Lipmaa, and Mielikainen, On secure scalar product computation for privacy-preserving data mining, ICISC 2004
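The homomorphic construction can be illustrated with a toy Paillier instance: Alice sends encryptions E(aᵢ), and Bob computes Πᵢ E(aᵢ)^{bᵢ} = E(Σᵢ aᵢ·bᵢ) without ever seeing Alice's vector. The primes below are far too small to be secure; this demonstrates only the algebra:

```python
import math, random

p, q = 293, 433                       # tiny primes (demo only -- not secure)
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
g = n + 1                             # standard choice; simplifies decryption

def encrypt(m, rng):
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    mu = pow(lam, -1, n)              # works because g = n + 1
    return (((pow(c, lam, n2) - 1) // n) * mu) % n

rng = random.Random(7)
a = [8, 3, 1, 0]                      # Alice's term counts (Bin Laden, Stinger, ...)
b = [7, 5, 0, 2]                      # Bob's term counts

enc_a = [encrypt(ai, rng) for ai in a]            # Alice -> Bob
enc_dot = 1
for ca, bi in zip(enc_a, b):                      # Bob works on ciphertexts only
    enc_dot = (enc_dot * pow(ca, bi, n2)) % n2
print(decrypt(enc_dot))  # 71 = 8*7 + 3*5 + 1*0 + 0*2
```

Additive homomorphism (E(x)·E(y) = E(x+y), hence E(x)^k = E(kx)) is what lets Bob accumulate the scalar product entirely in the encrypted domain.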
Alice's term counts: 1 Bin Laden 8; 2 Stinger 3; 3 Afghanistan 1; 4 Kabul 0; …
Bob's term counts: 1 Bin Laden 7; 2 Stinger 5; 3 Afghanistan 0; 4 Kabul 2; …
Is Performance an Issue?
[Figure: each of Alice's documents must be compared with each of Bob's documents, i.e., one secure dot product a1*b1 + a2*b2 + a3*b3 + a4*b4 + … per document pair]
Running Time (journal articles)
[Figure: running time in minutes (0–6000) vs. number of documents (0–1400) for SSDD_H and SSDD_M]
Faster: Local Clustering
• Locally cluster similar documents
– A secure protocol identifies similar clusters
• Document comparison is done only within identified clusters
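A rough sketch of the saving, with the secure cluster-matching step abstracted into a plain centroid comparison (the real protocol computes this comparison securely, and the threshold here is an arbitrary assumption):

```python
def centroid(docs):
    """Average term vector of a cluster (documents as term->count dicts)."""
    keys = {k for d in docs for k in d}
    return {k: sum(d.get(k, 0) for d in docs) / len(docs) for k in keys}

def cos(u, v):
    num = sum(u.get(k, 0) * v.get(k, 0) for k in u)
    den = (sum(x * x for x in u.values()) ** 0.5) * (sum(x * x for x in v.values()) ** 0.5)
    return num / den if den else 0.0

def compare_with_clustering(alice_clusters, bob_clusters, threshold=0.5):
    pairs = []
    for A in alice_clusters:
        for B in bob_clusters:
            if cos(centroid(A), centroid(B)) >= threshold:   # secure step in the paper
                pairs += [(a, b) for a in A for b in B]      # detailed comparison
    return pairs

alice = [[{"laden": 8, "stinger": 3}], [{"recipe": 5}]]
bob = [[{"laden": 7, "stinger": 5, "kabul": 2}], [{"football": 4}]]
pairs = compare_with_clustering(alice, bob)
print(len(pairs))  # 1: only 1 of the 4 possible document pairs needs comparing
```

Only cluster pairs that look similar trigger the expensive per-document secure comparison, which is where the running-time savings come from.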
[Figure: Alice's and Bob's documents grouped into local clusters; only documents in matching clusters are compared]
Savings / Loss from Clustering
Effectiveness: 40% Document Overlap
[Figure: precision vs. recall at similarity thresholds sim=0.5, 0.6, 0.7 for CS2, CS4, CS6, and plain cosine]

Similarity(A, B_j) = max_i Similarity(A_i, B_j)
t-Plausibility: Semantic Preserving Text Sanitization
Wei Jiang, Mummoorthy Murugesan, Chris Clifton, and Luo Si
2009 IEEE International Conference on Privacy, Security, Risk and Trust (PASSAT-09), Vancouver, Canada, August 29-31, 2009
Motivations
• De-identification plays an important role in privacy (legislation)
– Documents that do not contain personally identifiable information can be shared, e.g., pathology reports
• De-identification tools remove “obvious” identifying information
– Name, address, dates, …
• Unfortunately, non-obvious information can identify
– Pain vs. phantom pain
• Alternative: suppress sensitive information
– “Uses marijuana for pain” → “Uses --- for ---”
• Our approach: information generalization
– phantom pain → pain
– tuberculosis → infectious disease
Related Work
• Data anonymization
– k-Anonymity: sanitizing structured info, e.g., datasets with at least k records in relational format
– Transforming a text into a dataset of k records is not well studied
• Text sanitization
– Most work focuses on identifying sensitive attributes
– Then removing the identified sensitive information
Basic Idea: Generalization
Original: A Sacramento resident purchased marijuana for the lumbar pain caused by liver cancer
Sanitized: A ---------- resident purchased --------- for the ----------- caused by ------------
Generalized: A state capital resident purchased drug for the pain caused by carcinoma
Hypernym chains (number of base terms in parentheses):
– {Denver, Indianapolis, Phoenix, Sacramento} → State_capital (4) → Capital (32) → Seat (50)
– {Liver_cancer, Lung_cancer, …} → Carcinoma (2) → Cancer (5) → Malignant_tumor (7)
– {Lumbar_pain, Migraine, …} → Pain (2) → Symptom (10) → Evidence (20)
– {Morphine, Marijuana, …} → Controlled_substance (2) → Drug (6) → Agent (10)
t-Plausible Anonymization: t-PAT
• Given a document d and an ontology o, anonymized document d’ is t-plausible if at least t base texts can be generalized to d’
• Let D(d’,d,o) give the number of possible base texts that can be generalized to d’
• t-PAT: find the generalization d’ that is t-plausible and for which D(d’,d,o) is minimal
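Under a simplifying assumption (each word has a single hypernym chain, and D(d’,d,o) is just the product of the base-term counts of the chosen generalization levels), t-PAT can be sketched as an exhaustive search for the cheapest combination reaching t. The chains below reuse the counts from the example ontology:

```python
from itertools import product
from math import prod

# (word -> hypernym chain, most specific first, with base-term counts)
CHAINS = {
    "Sacramento":   [("Sacramento", 1), ("state_capital", 4), ("capital", 32), ("seat", 50)],
    "marijuana":    [("marijuana", 1), ("controlled_substance", 2), ("drug", 6), ("agent", 10)],
    "liver cancer": [("liver cancer", 1), ("carcinoma", 2), ("cancer", 5)],
}

def t_pat(chains, t):
    """Exhaustively pick the generalization with D >= t and minimal D."""
    best = None
    for choice in product(*chains.values()):
        d = prod(count for _, count in choice)
        if d >= t and (best is None or d < best[1]):
            best = ([term for term, _ in choice], d)
    return best

words, d = t_pat(CHAINS, 32)
print(words, d)  # ['capital', 'marijuana', 'liver cancer'] 32
```

Note how this reproduces the slide's observation: generalizing "Sacramento" alone to "capital" already yields 32 base texts, so t-PAT with t = 32 can leave every other word untouched.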
Uniform t-Plausibility
• t-PAT is a start, but too coarse to be useful in protecting privacy
• Consider our example:
– Original text: A Sacramento resident purchased marijuana for the lumbar pain caused by liver cancer
– Sanitized text (t-PAT with t = 32): A capital resident purchased marijuana for the lumbar pain caused by liver cancer
• Generalizing a single word may satisfy t-PAT
Uniform t-PAT
• Uniform t-PAT generalizes each word in an unbiased manner
• We use the entropy function H(w) to quantify the generalization of each word:

H(wi) = −Σ_{j=1..ki} P(wij° | wi) log₂ P(wij° | wi)

– P(wij° | wi) gives the probability of the j-th base term given the generalized word wi
Cost function for Uniform t-PAT
• We define the following cost function C(d’,t) and attempt to minimize it:

C(d’, t) = α (H(d’) − log₂ t)² + (1 − α) Σ_{i=1..m} (H(wi) − (log₂ t)/m)²

– The first term drives the global generalization of d toward uncertainty t; the second introduces uniform uncertainty for each word
– α is the parameter that trades off global optimality against uniform generalization
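Assuming this reading of the slide's formulas (H(wi) is the entropy of the base-term distribution under the generalized word, H(d’) = Σᵢ H(wi), and C weighs a global term against a per-word uniformity term), a sketch shows why spreading entropy across words is cheaper than concentrating it:

```python
from math import log2

def entropy(base_term_probs):
    """H(w): entropy of the base-term distribution under a generalized word."""
    return -sum(p * log2(p) for p in base_term_probs if p > 0)

def cost(word_entropies, t, alpha=0.5):
    m = len(word_entropies)
    global_term = (sum(word_entropies) - log2(t)) ** 2
    uniform_term = sum((h - log2(t) / m) ** 2 for h in word_entropies)
    return alpha * global_term + (1 - alpha) * uniform_term

# t = 32 needs log2(32) = 5 bits in total; both plans below reach exactly 5 bits.
skewed  = [entropy([1 / 32] * 32), 0.0, 0.0]                   # one word carries all 5 bits
uniform = [entropy([1 / 4] * 4)] * 2 + [entropy([1 / 2] * 2)]  # 2 + 2 + 1 bits
print(cost(skewed, 32), cost(uniform, 32))  # the uniform plan is much cheaper
```

Both plans zero out the global term, but the per-word uniformity term heavily penalizes the skewed plan, mirroring the slide's t-PAT (2.33) vs. uniform t-PAT (.09) cost comparison.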
Uniform t-Plausibility
• Let us consider the previous example
– t-PAT with t = 32
• A capital resident purchased marijuana for the lumbar pain caused by liver cancer
• Cost = 2.33
– Uniform t-PAT with t = 32, α = .50
• A state_capital resident purchased drug for the pain caused by carcinoma
• Cost = .09
• We want to find a generalization d’ of d so that
– the cost C(d’,t) is minimized, and
– H(d’) ≥ log t
Experiments
• 50 leaf-node words, randomly selected from WordNet; the height of each hypernym tree is 8±1
• Running time for pruning-based search for t-PAT (varying t):
• Exhaustive search is very inefficient; for |d|=10, it took 56 seconds
[Figure: running time in seconds (0–16) vs. number of words (10–50) for t = 1024, 2048, 4096, 8192, 16384]
Uniform t-PAT
• Time taken for pruning-based search and LUBS is similar to t-PAT EP
• Cost is lower for LUBS and UEP; the cost of LUBS is the same as that of UEP
• α is fixed at 0.5
[Figures: running time in seconds (0–12) vs. number of words (10–50) for t = 1024–8192; cost (0–2) vs. number of words for EP and LUBS]
Uniform t-PAT (2)
• Variance in the entropy of generalized words is lower (better) for LUBS and UEP
• As we increase t, the similarity between d and d’ decreases; still better than “blacking out” words
[Figures: variance of entropy vs. number of words (10–50) for EP, LUBS, and UEP; similarity score (≈0.85–0.91) vs. t value (1024–16384). α is fixed at 0.5]
From Text to Document
• A large document can be broken into small pieces of text
– Sentence, paragraph, or section as a unit
– Sensitive words can be identified using existing work, and sanitized using various t values
• What if a large document is treated as a single text?
– It is difficult to choose t
– Achieving uniform plausibility with the same t for all sensitive words may not be desirable
Summary & Future Research
• The proposed work has both privacy- and semantic-preserving qualities
– Assuming that marijuana and phantom limb pain are sensitive
– “uses marijuana for phantom limb pain” sanitizes to “uses soft drug for pain”
• What needs to be investigated further?
– Systematically generate the best value for t
– Evaluate the privacy- and semantic-preserving properties via a human-subjects study
– Consider correlations among sensitive words