Privacy and Anonymity in Text
Chris Clifton
12 November, 2009
Plausibly Deniable Search
This is joint work with Mummoorthy Murugesan
2009 SIAM International Conference on Data Mining (SDM09), Sparks, Nevada, April 30-May 2, 2009
The AOL Awakening
• In August 2006, AOL released its customers' web searches for research studies
• 20 million unique queries from 650K unique users
• Each <user-id> was replaced with a <random-number>
• A NY Times reporter successfully identified an individual from the queries
– Queries included “60 single men” and “landscapers in Lilburn, Ga”
– Many more queries contained enough information to uniquely identify the person
• AOL fired its CTO over this issue; two researchers were forced out
Privacy in Web Search
• Server-Controlled Privacy
– Deletion of queries after a few months
– Anonymization of query logs before backup
• Some of these methods have been shown to be inadequate
• Private Information Retrieval
– Affects the advertising business model
– Not practical with the current solutions
Lessons Learned
• Content of user queries reveals a lot
– Ego surfing: searching for one's own name, SSN, credit card
• Identifiable
– Location, type of work, age, medical condition
• Sensitive
– Car they own, restaurants in a zip code
• Query transformation alone is not enough
– Submitting Q’ instead of Q to retrieve the same set of documents
– User intent is still revealed
User-Controlled Privacy
1. Hide identifying metadata
– Private Web Search (PWS) – Firefox plugin (Yale Univ.)
• Removes metadata
• Hides user IP address (via Tor)

Private Web Search
Felipe Saint-Jean, Johnson, Boneh, Feigenbaum
• Tor: hides IP addresses
– Routes request and response through multiple servers
– Each server knows only the preceding server
• HTTP filter normalizes search queries
– Browser, OS, etc.
• HTML filter removes active components
User-Controlled Privacy
1. Hide identifying metadata
– Private Web Search (PWS) – Firefox plugin (Yale Univ.)
• Removes metadata
• Hides user IP address (via Tor)
2. Protect against disclosure through query terms
– TrackMeNot – Firefox plugin (NYU)
• Periodically issues randomized queries from a list of “seeds”
• Uses search results for “logical” future query terms

Remaining problems:
– Actual user query (user intent) is still revealed
– Timing attacks, load on server
– Query semantics attacks on the “logical” generated terms
Plausibly Deniable Search
[Protocol figure: the browser produces the user query q; the PDS component expands it into a queryset {q1,...,qk}, which is submitted to the search engine; the engine returns results {R(q1),...,R(qk)}; each R(qi) is filtered using the original q.]
Plausibly Deniable Search: Key Concepts
• Browser submits more than one query {q1,…,qk}
• Deniability
– Reversible: any of the k queries would have produced the same set
– The additional “cover queries” are of diverse topics
• Plausibility
– All queries are equally plausible
– Implausible queries would weaken the deniability argument
{“java compiler”, “newton apple”}
vs.
{“java compiler”, “motorola table”}
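The client-side flow (look up a precomputed PD-queryset, issue all k queries, filter results with the original q) can be sketched as follows; the function names, the `pd_querysets` structure, and the toy index are illustrative assumptions, not the authors' implementation:

```python
# Illustrative client-side sketch; `pd_querysets` is a precomputed list of
# plausibly deniable querysets and `search` stands in for the engine call.

def lookup_pd_queryset(user_query, pd_querysets):
    """Find the precomputed queryset containing the user's query."""
    for queryset in pd_querysets:
        if user_query in queryset:
            return queryset
    return [user_query]          # fall back to issuing the query alone

def pds_search(user_query, pd_querysets, search):
    queries = lookup_pd_queryset(user_query, pd_querysets)
    results = {q: search(q) for q in queries}   # all k queries hit the engine
    return results[user_query]                  # filter with the original q

fake_index = {"java compiler": ["javac docs"], "newton apple": ["history page"]}
hits = pds_search("java compiler",
                  [["java compiler", "newton apple"]],
                  lambda q: fake_index.get(q, []))
print(hits)  # ['javac docs']
```

The search engine observes k indistinguishable queries, while the user sees only the results for the real one.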
Plausibly Deniable Search: Theory
• Assume the following:
– User queries follow a distribution Pu
– Cover queries are generated through a distribution Pc
• Given a set of two queries S={q1,q2}, there are two possible events
– E1: q1 is the user query and q2 is the cover query
– E2: q2 is the user query and q1 is the cover query
P(E1|S) = P(E1) / (P(E1) + P(E2)) = Pu(q1)Pc(q2) / (Pu(q1)Pc(q2) + Pu(q2)Pc(q1))

P(E2|S) = P(E2) / (P(E1) + P(E2)) = Pu(q2)Pc(q1) / (Pu(q1)Pc(q2) + Pu(q2)Pc(q1))
Plausibly Deniable Search: Theory
• To achieve deniability for either of these queries, we require the following condition:

Pu(q1)Pc(q2) = Pu(q2)Pc(q1)

• Two of many possible solutions
1. Queries have equal probability of being user queries, and equal probability of being cover queries:
Pu(q1) = Pu(q2) and Pc(q1) = Pc(q2)
2. Queries have the same probability of being user query or cover query:
Pu(q1) = Pc(q1) and Pu(q2) = Pc(q2)
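Both solutions can be checked numerically: under either condition, the posterior P(E1|S) works out to 1/2, so an observer cannot favor one event over the other. A small sketch with made-up probability values:

```python
def posterior_e1(pu, pc, q1, q2):
    """P(E1 | S): probability that q1 is the user query given S = {q1, q2}."""
    num = pu[q1] * pc[q2]
    return num / (num + pu[q2] * pc[q1])

# Solution 1: equal user probabilities and equal cover probabilities.
pu = {"a": 0.3, "b": 0.3}
pc = {"a": 0.1, "b": 0.1}
print(posterior_e1(pu, pc, "a", "b"))  # 0.5

# Solution 2: each query is equally likely as a user query or a cover query.
pu = {"a": 0.4, "b": 0.05}
pc = {"a": 0.4, "b": 0.05}
print(posterior_e1(pu, pc, "a", "b"))  # 0.5
```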
Creating Plausibly Deniable Cover Queries
1. Create canonical queries
– Standard queries
2. Create PD-querysets
– Plausibly deniable querysets with k queries
3. Issue query
– Find and issue the PD-queryset for the given user query

Steps 1 and 2 are done in advance (server / third party)
Step One:Creating Canonical Queries
[Pipeline figure: seed documents → FP mining → seed queries → LSI combines semantically similar seed queries → canonical queries]

• Semantically similar surrogate queries for user queries
• Supports the “deniability” argument, since all queries could be generated by the system
Step Two: Creating PD-Querysets
• Dissimilarity between two queries is based on 3 measures:
– Euclidean distance: semantically similar queries are closer in the semantic space
– Magnitude: queries that are equally strong in their respective topics have similar magnitude
– Neighborhood count: equally plausible queries have a similar number of log (already-issued) queries in their neighborhood
[Figure: canonical queries → agglomerative clustering → PD-querysets]
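One way to sketch the queryset construction: fold the three measures into a single diversity score and greedily grow size-k sets of mutually distant but equally plausible queries. The weights (0.5, 0.1), vectors, and counts below are arbitrary assumptions, not the paper's parameters or its exact clustering algorithm:

```python
import math

def diversity(a, b):
    """High when a, b are semantically far apart yet equally plausible."""
    dist = math.dist(a["vec"], b["vec"])              # Euclidean distance
    mag_gap = abs(a["mag"] - b["mag"])                # magnitude difference
    nbr_gap = abs(a["neighbors"] - b["neighbors"])    # neighborhood-count gap
    return dist - 0.5 * mag_gap - 0.1 * nbr_gap       # assumed weights

def build_queryset(seed, pool, k):
    chosen = [seed]
    candidates = [q for q in pool if q is not seed]
    while len(chosen) < k and candidates:
        # Pick the candidate farthest from everything already chosen.
        best = max(candidates, key=lambda q: min(diversity(q, c) for c in chosen))
        chosen.append(best)
        candidates.remove(best)
    return chosen

canonical = [
    {"id": "java compiler", "vec": (1.0, 0.0), "mag": 1.0, "neighbors": 40},
    {"id": "python syntax", "vec": (0.9, 0.1), "mag": 1.0, "neighbors": 38},
    {"id": "newton apple",  "vec": (0.0, 1.0), "mag": 1.0, "neighbors": 42},
]
qs = build_queryset(canonical[0], canonical, 2)
print([q["id"] for q in qs])  # ['java compiler', 'newton apple']: the off-topic query wins
```

The near-duplicate "python syntax" is rejected as a cover query because it would reveal the topic; the topically distant but equally plausible "newton apple" is chosen.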
Step 3: Issuing Query
• User query is mapped to the semantic space
– vec(q) = qᵀU′S′⁻¹
• Find the canonical queries that have the maximum cosine similarity with q in the semantic space
• The PD-queryset of the selected canonical query is issued
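Step 3 can be sketched under the usual LSI conventions: with the SVD A = USVᵀ of the term-document matrix, a query is folded in as vec(q) = qᵀU′S′⁻¹ and matched by cosine similarity. The tiny matrix and canonical vectors below are illustrative assumptions:

```python
import numpy as np

A = np.array([[1., 1., 0., 0.],   # term-document matrix: terms 0-2 appear in
              [1., 0., 0., 0.],   # docs 0-1 (one topic), terms 3-5 in
              [1., 1., 0., 0.],   # docs 2-3 (another topic)
              [0., 0., 2., 2.],
              [0., 0., 2., 0.],
              [0., 0., 2., 2.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk = U[:, :k], np.diag(s[:k])          # truncated U', S'

def fold_in(q):
    """Map a term-frequency query vector into the k-dim semantic space."""
    return q @ Uk @ np.linalg.inv(Sk)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

canonical = {"c1": np.array([1., 0., 1., 0., 0., 0.]),     # first-topic terms
             "c2": np.array([0., 0., 0., 1., 1., 0.])}     # second-topic terms
q = np.array([1., 0., 1., 0., 0., 0.])                     # user query
best = max(canonical, key=lambda n: cosine(fold_in(q), fold_in(canonical[n])))
print(best)  # 'c1': the query shares all of its terms with that canonical query
```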
How Good is PDS?
• Deniability:
– The canonical query provides one level of anonymity
– Many seed queries map to a single canonical query
– The reversible property provides deniability
• Plausibility:
– Based on the number of similar-topic queries issued by users
– Measured as the perception of human subjects; difficult to quantify
• How good are the canonical queries? Do they fetch what the users want?
Results from Experiments
• Document collection
– DMOZ categorized web documents
– 314K documents and 1.28M unique terms
– Three topics: Computers, Science, Sports
• Number of documents in each category
– Computers: 115K
– Science: 100K
– Sports: 99K
• After performing SVD on the term-document matrix, only 30 columns are kept in U
Canonical Queries
• 2.6 million seed queries generated with ∆=500
• Produces 932K canonical queries
• Average canonical query length: 3.7 terms

Total canonical queries: 931,863
– 2 terms: 1,019
– 3 terms: 324,034
– 4 terms: 524,256
– 5 terms: 67,664
– 6 terms: 14,890
Example – query: {synthesis technology}; canonical: {synthesis molecular technology}; cover: {baker priority report}
Retrieval Performance
• 5K queries from the alltheweb.com searches
• 3.4K unique queries containing at least 75% of terms from our collection
• Six of top 20 in 69% of queries (500)
Topic Diversity
• DMOZ categories are used in comparing the topics of queries
• 85% of PD-Querysets have queries with >50% topic diversity
What is Next?
• PDS can be used along with other approaches such as PWS, Tor, etc.
• Canonical queries
– Efficient ways of creating canonical queries
– Improving retrieval performance
• Sequential queries
– How to handle sequentially edited queries by a user on the same topic?
– Can an attacker figure out the user queries over a period of time?
Query Sequences
• Users issue a sequence of queries on a topic
– Cover queries should be plausibly deniable sequences
• Consider two sequences, S1={a1,b1} and S2={a2,b2}, where <a1,a2> are issued together (first) and <b1,b2> are issued second
• There are two possible events:
– E1: S1 is the user sequence, S2 is the cover sequence
– E2: S1 is the cover sequence, S2 is the user sequence
Query Sequences
• Deniability is achieved when we satisfy the following constraint:

P(E1|S) = P(E1) / (P(E1) + P(E2)) = Pu(S1)Pc(S2) / (Pu(S1)Pc(S2) + Pu(S2)Pc(S1))

P(E2|S) = P(E2) / (P(E1) + P(E2)) = Pu(S2)Pc(S1) / (Pu(S1)Pc(S2) + Pu(S2)Pc(S1))

i.e., Pu(S1)Pc(S2) = Pu(S2)Pc(S1)

• Given deniability for the first queries a1, a2, this reduces to:

Pu(b1|a1)Pc(b2|a2) = Pu(b2|a2)Pc(b1|a1)
Two (of many) Possible Solutions
1. b1 and b2 have the same conditional probability of being user-generated
– Also the same conditional probability of being method-generated
Pu(b1|a1) = Pu(b2|a2) and Pc(b1|a1) = Pc(b2|a2)
2. b1 has equal conditional probability of being user-generated or method-generated; b2 has the same property
Pu(b1|a1) = Pc(b1|a1) and Pu(b2|a2) = Pc(b2|a2)

This is applicable to the (m+1)-th query given a sequence of m queries
Generating “user-like” Sequences
• Idea: inter-query time determines the difference between queries
– Learn the distribution of changes to queries as a function of inter-query time
– Given the elapsed time, generate the next query from the previous cover query and the appropriate distribution
• P(qk | qk-1) is the same as for a real user!
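A minimal sketch of that generation loop, assuming made-up bin boundaries and per-bin term-change distributions in place of the log-derived ones described on the next slide:

```python
import bisect, random

BIN_EDGES = [1, 2, 4, 8, 16, 32, 64]          # seconds; exponential grouping on time

def time_bin(dt_seconds):
    return bisect.bisect_right(BIN_EDGES, dt_seconds)

# Toy stand-in for P(number of term changes | time bin), learned from a log:
TERM_CHANGE_DIST = {
    0: {0: 0.1, 1: 0.7, 2: 0.2},              # quick follow-ups: small edits
    7: {0: 0.0, 1: 0.2, 2: 0.8},              # long gaps: bigger edits
}

def sample_term_changes(dt_seconds, rng):
    dist = TERM_CHANGE_DIST.get(time_bin(dt_seconds), {1: 1.0})
    changes, probs = zip(*dist.items())
    return rng.choices(changes, weights=probs, k=1)[0]

def next_cover_query(prev_cover, related_terms, dt_seconds, rng):
    """Edit the previous cover query by as many terms as a real user would."""
    n = min(sample_term_changes(dt_seconds, rng), len(prev_cover), len(related_terms))
    return prev_cover[:len(prev_cover) - n] + rng.sample(related_terms, n)

rng = random.Random(42)
q = next_cover_query(["motorola", "table"], ["phone", "battery", "charger"], 3, rng)
print(q)  # ['motorola', <one related term>]: one term changed after a short gap
```

Because the number of edits is drawn from the same distribution a real user exhibits at that inter-query time, the cover sequence is hard to separate from the user sequence on these features.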
Distribution of what changes?
• Features “defining” a query are those useful in linking queries in a sequence
– If the sequence can be discovered, it must be simulated
• Features from “I know what you did last summer” (Jones et al.)
– Term re-use and topic similarity used to link queries in a sequence
• Learned distributions from a large query log for ranges of inter-query times
– Topic relation
– Topic repetition
– Number of term changes
Feature Distributions with respect to Inter-Query Time
[Figure: two panels, “Term Changes” and “Topic Changes”. One shows probability (0–0.5) vs. bin number (0–24); the other shows probability (0–0.8) vs. number of term changes (0–10) for bins BN=5, 10, 15, 20. “Bin number” is an exponential grouping on time.]
Effectiveness: Topic Change Distribution on DMOZ
How well does it really work?
ecuador world trade center deaths
banana guide
gmail center ice hockey wabi sabi
the promenade temecula pittsburgh convention center
wa
sheriff sales detroit homes
100 center street wabi sabi
summit county sheriff sales
bolton center pa. donald judd
harris county sheriff thornton town center wabi sabi
Try again…
angelina jolie my space parker animal shelter
angelina ballerina panic at the disco brenden
parker animal shelter colorado
residence inn madison wi red carpet keim action group
tagawa
residence inn hillsboro claudio ciardi paradise valley arizona
rodeway inn maingate www yahoo.com map of texas
community collage of beaver county
www g map of tennessee
town of bennington nh www hp com santa ana california
elmo love on aol.com santa ana winds date
www.redeem your rewards.com
love aol.com when do the santa ana winds come
brookfield wi sunday brunch
webmail aol.com when santa ana winds
Figure it out yet?
letter signs m signs chinese jump rope the hall sisters
letter m logo chinese silk the hall sisters gospel group
alphabet letter m ohio medical board gospel group the hall sisters
letter m logo anderson ohio rules georgia gospel group the hall sisters
texas a&m at commerce net georgia gospel church sings
i’m makin some noise http southern georgia gospel church sings
m.i.t. land grab in new orleans
mini clip.com http
Disclosure-free Discovery of Related Documents
Chris Clifton
Mummoorthy Murugesan, Wei Jiang, Luo Si, Jaideep Vaidya
18 September, 2009
Proceedings of the 24th International Conference on Data Engineering (ICDE 2008), Cancun, Mexico, April 7-12, 2008
Problem:Identifying Common Interests
Alice's document: “We have evidence that Osama Bin Laden has financed the purchase of Stinger Missiles in Afghanistan. …”
Bob's document: “There have been reports from Kabul of financial transfers from Bin Laden, purportedly for the purchase of Stinger missiles …”

Alice's term counts: 1 Bin Laden 8; 2 Stinger 3; 3 Afghanistan 1; 4 Kabul 0; …
Bob's term counts: 1 Bin Laden 7; 2 Stinger 5; 3 Afghanistan 0; 4 Kabul 2; …
Solution Overview
[Figure: each document is reduced to a term vector; the two vectors are compared via the dot product a1*b1 + a2*b2 + a3*b3 + a4*b4 + …, computed securely]
Secure Product: Random Matrix
Vaidya and Clifton, Privacy Preserving Association Rule Mining in Vertically Partitioned Data, KDD02
Secure Product: Homomorphic Encryption
Goethals, Laur, Lipmaa, and Mielikainen, On secure scalar product computation for privacy-preserving data mining, ICISC 2004
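The homomorphic construction can be illustrated with a toy Paillier instance: Alice sends encryptions E(aᵢ), and Bob computes Πᵢ E(aᵢ)^{bᵢ} = E(Σᵢ aᵢ·bᵢ) without ever seeing Alice's vector. The primes below are far too small to be secure; this demonstrates only the algebra:

```python
import math, random

p, q = 293, 433                       # tiny primes (demo only -- not secure)
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
g = n + 1                             # standard choice; simplifies decryption

def encrypt(m, rng):
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    mu = pow(lam, -1, n)              # works because g = n + 1
    return (((pow(c, lam, n2) - 1) // n) * mu) % n

rng = random.Random(7)
a = [8, 3, 1, 0]                      # Alice's term counts (Bin Laden, Stinger, ...)
b = [7, 5, 0, 2]                      # Bob's term counts

enc_a = [encrypt(ai, rng) for ai in a]            # Alice -> Bob
enc_dot = 1
for ca, bi in zip(enc_a, b):                      # Bob works on ciphertexts only
    enc_dot = (enc_dot * pow(ca, bi, n2)) % n2
print(decrypt(enc_dot))  # 71 = 8*7 + 3*5 + 1*0 + 0*2
```

Additive homomorphism (E(x)·E(y) = E(x+y), hence E(x)^k = E(kx)) is what lets Bob accumulate the scalar product entirely in the encrypted domain.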
Alice's term counts: 1 Bin Laden 8; 2 Stinger 3; 3 Afghanistan 1; 4 Kabul 0; …
Bob's term counts: 1 Bin Laden 7; 2 Stinger 5; 3 Afghanistan 0; 4 Kabul 2; …
Is Performance an Issue?
[Figure: each of Alice's documents must be compared with each of Bob's documents, i.e., one secure dot product a1*b1 + a2*b2 + a3*b3 + a4*b4 + … per document pair]
Running Time (journal articles)
[Figure: running time in minutes (0–6000) vs. number of documents (0–1400) for SSDD_H and SSDD_M]
Faster: Local Clustering
• Locally cluster similar documents
– A secure protocol identifies similar clusters
• Document comparison is done only within identified clusters
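A rough sketch of the saving, with the secure cluster-matching step abstracted into a plain centroid comparison (the real protocol computes this comparison securely, and the threshold here is an arbitrary assumption):

```python
def centroid(docs):
    """Average term vector of a cluster (documents as term->count dicts)."""
    keys = {k for d in docs for k in d}
    return {k: sum(d.get(k, 0) for d in docs) / len(docs) for k in keys}

def cos(u, v):
    num = sum(u.get(k, 0) * v.get(k, 0) for k in u)
    den = (sum(x * x for x in u.values()) ** 0.5) * (sum(x * x for x in v.values()) ** 0.5)
    return num / den if den else 0.0

def compare_with_clustering(alice_clusters, bob_clusters, threshold=0.5):
    pairs = []
    for A in alice_clusters:
        for B in bob_clusters:
            if cos(centroid(A), centroid(B)) >= threshold:   # secure step in the paper
                pairs += [(a, b) for a in A for b in B]      # detailed comparison
    return pairs

alice = [[{"laden": 8, "stinger": 3}], [{"recipe": 5}]]
bob = [[{"laden": 7, "stinger": 5, "kabul": 2}], [{"football": 4}]]
pairs = compare_with_clustering(alice, bob)
print(len(pairs))  # 1: only 1 of the 4 possible document pairs needs comparing
```

Only cluster pairs that look similar trigger the expensive per-document secure comparison, which is where the running-time savings come from.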
[Figure: Alice's and Bob's documents grouped into local clusters; only documents in matching clusters are compared]
Savings / Loss from Clustering
Effectiveness: 40% Document Overlap
[Figure: precision vs. recall at similarity thresholds sim=0.5, 0.6, 0.7 for CS2, CS4, CS6, and plain cosine]

Similarity(A, B_j) = max_i Similarity(A_i, B_j)
t-Plausibility: Semantic Preserving Text Sanitization
Wei Jiang, Mummoorthy Murugesan, Chris Clifton, and Luo Si
2009 IEEE International Conference on Privacy, Security, Risk and Trust (PASSAT-09), Vancouver, Canada, August 29-31, 2009
Motivations
• De-identification plays an important role in privacy (legislation)
– Documents that do not contain personally identifiable information can be shared, e.g., pathology reports
• De-identification tools remove “obvious” identifying information
– Name, address, dates, …
• Unfortunately, non-obvious information can identify
– Pain vs. phantom pain
• Alternative: suppress sensitive information
– “Uses marijuana for pain” → “Uses --- for ---”
• Our approach: information generalization
– phantom pain → pain
– tuberculosis → infectious disease
Related Work
• Data anonymization
– k-Anonymity: sanitizing structured info, e.g., datasets with at least k records in relational format
– Transforming a text into a dataset of k records is not well studied
• Text sanitization
– Most work focuses on identifying sensitive attributes
– Then removing the identified sensitive information
Basic Idea: Generalization
Original: A Sacramento resident purchased marijuana for the lumbar pain caused by liver cancer
Sanitized: A ---------- resident purchased --------- for the ----------- caused by ------------
Generalized: A state capital resident purchased drug for the pain caused by carcinoma
Hypernym chains (number of base terms in parentheses):
– {Denver, Indianapolis, Phoenix, Sacramento} → State_capital (4) → Capital (32) → Seat (50)
– {Liver_cancer, Lung_cancer, …} → Carcinoma (2) → Cancer (5) → Malignant_tumor (7)
– {Lumbar_pain, Migraine, …} → Pain (2) → Symptom (10) → Evidence (20)
– {Morphine, Marijuana, …} → Controlled_substance (2) → Drug (6) → Agent (10)
t-Plausible Anonymization: t-PAT
• Given a document d and an ontology o, anonymized document d’ is t-plausible if at least t base texts can be generalized to d’
• Let D(d’,d,o) give the number of possible base texts that can be generalized to d’
• t-PAT: find the generalization d’ that is t-plausible and for which D(d’,d,o) is minimal
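Under a simplifying assumption (each word has a single hypernym chain, and D(d’,d,o) is just the product of the base-term counts of the chosen generalization levels), t-PAT can be sketched as an exhaustive search for the cheapest combination reaching t. The chains below reuse the counts from the example ontology:

```python
from itertools import product
from math import prod

# (word -> hypernym chain, most specific first, with base-term counts)
CHAINS = {
    "Sacramento":   [("Sacramento", 1), ("state_capital", 4), ("capital", 32), ("seat", 50)],
    "marijuana":    [("marijuana", 1), ("controlled_substance", 2), ("drug", 6), ("agent", 10)],
    "liver cancer": [("liver cancer", 1), ("carcinoma", 2), ("cancer", 5)],
}

def t_pat(chains, t):
    """Exhaustively pick the generalization with D >= t and minimal D."""
    best = None
    for choice in product(*chains.values()):
        d = prod(count for _, count in choice)
        if d >= t and (best is None or d < best[1]):
            best = ([term for term, _ in choice], d)
    return best

words, d = t_pat(CHAINS, 32)
print(words, d)  # ['capital', 'marijuana', 'liver cancer'] 32
```

Note how this reproduces the slide's observation: generalizing "Sacramento" alone to "capital" already yields 32 base texts, so t-PAT with t = 32 can leave every other word untouched.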
Uniform t-Plausibility
• t-PAT is a start, but too coarse to be useful in protecting privacy
• Consider our example:
– Original text: A Sacramento resident purchased marijuana for the lumbar pain caused by liver cancer
– Sanitized text (t-PAT with t = 32): A capital resident purchased marijuana for the lumbar pain caused by liver cancer
• Generalizing a single word may satisfy t-PAT
Uniform t-PAT
• Uniform t-PAT generalizes each word in an unbiased manner
• We use the entropy function H(w) to quantify the generalization of each word:

H(wi) = −Σ_{j=1..ki} P(wij° | wi) log₂ P(wij° | wi)

– P(wij° | wi) gives the probability of the j-th base term given the generalized word wi
Cost function for Uniform t-PAT
• We define the following cost function C(d’,t) and attempt to minimize it:

C(d’, t) = α (H(d’) − log₂ t)² + (1 − α) Σ_{i=1..m} (H(wi) − (log₂ t)/m)²

– The first term drives the global generalization of d toward uncertainty t; the second introduces uniform uncertainty for each word
– α is the parameter that trades off global optimality against uniform generalization
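Assuming this reading of the slide's formulas (H(wi) is the entropy of the base-term distribution under the generalized word, H(d’) = Σᵢ H(wi), and C weighs a global term against a per-word uniformity term), a sketch shows why spreading entropy across words is cheaper than concentrating it:

```python
from math import log2

def entropy(base_term_probs):
    """H(w): entropy of the base-term distribution under a generalized word."""
    return -sum(p * log2(p) for p in base_term_probs if p > 0)

def cost(word_entropies, t, alpha=0.5):
    m = len(word_entropies)
    global_term = (sum(word_entropies) - log2(t)) ** 2
    uniform_term = sum((h - log2(t) / m) ** 2 for h in word_entropies)
    return alpha * global_term + (1 - alpha) * uniform_term

# t = 32 needs log2(32) = 5 bits in total; both plans below reach exactly 5 bits.
skewed  = [entropy([1 / 32] * 32), 0.0, 0.0]                   # one word carries all 5 bits
uniform = [entropy([1 / 4] * 4)] * 2 + [entropy([1 / 2] * 2)]  # 2 + 2 + 1 bits
print(cost(skewed, 32), cost(uniform, 32))  # the uniform plan is much cheaper
```

Both plans zero out the global term, but the per-word uniformity term heavily penalizes the skewed plan, mirroring the slide's t-PAT (2.33) vs. uniform t-PAT (.09) cost comparison.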
Uniform t-Plausibility
• Let us consider the previous example
– t-PAT with t = 32
• A capital resident purchased marijuana for the lumbar pain caused by liver cancer
• Cost = 2.33
– Uniform t-PAT with t = 32, α = .50
• A state_capital resident purchased drug for the pain caused by carcinoma
• Cost = .09
• We want to find a generalization d’ of d so that
– the cost C(d’,t) is minimized, and
– H(d’) ≥ log t
Experiments
• 50 leaf-node words, randomly selected from WordNet; the height of each hypernym tree is 8±1
• Running time for pruning-based search for t-PAT (varying t):
• Exhaustive search is very inefficient; for |d|=10, it took 56 seconds
[Figure: running time in seconds (0–16) vs. number of words (10–50) for t = 1024, 2048, 4096, 8192, 16384]
Uniform t-PAT
• Time taken for pruning-based search and LUBS is similar to t-PAT EP
• Cost is lower for LUBS and UEP; the cost of LUBS is the same as that of UEP
• α is fixed at 0.5
[Figures: running time in seconds (0–12) vs. number of words (10–50) for t = 1024–8192; cost (0–2) vs. number of words for EP and LUBS]
Uniform t-PAT (2)
• Variance in the entropy of generalized words is lower (better) for LUBS and UEP
• As we increase t, the similarity between d and d’ decreases; still better than “blacking out” words
[Figures: variance of entropy vs. number of words (10–50) for EP, LUBS, and UEP; similarity score (≈0.85–0.91) vs. t value (1024–16384). α is fixed at 0.5]
From Text to Document
• A large document can be broken into small pieces of text
– Sentence, paragraph, or section as a unit
– Sensitive words can be identified using existing work, and sanitized using various t values
• What if a large document is treated as a single text?
– It is difficult to choose t
– Achieving uniform plausibility with the same t for all sensitive words may not be desirable
Summary & Future Research
• The proposed work has both privacy- and semantic-preserving qualities
– Assuming that marijuana and phantom limb pain are sensitive
– “uses marijuana for phantom limb pain” sanitizes to “uses soft drug for pain”
• What needs to be investigated further?
– Systematically generate the best value for t
– Evaluate the privacy- and semantic-preserving properties via a human-subjects study
– Consider correlations among sensitive words