Preventing Information Leaks in Email
Vitor. Text Learning Group Meeting, Jan 18, 2007 – SCS/CMU.
![Page 1: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/1.jpg)
Preventing Information Leaks in Email
Vitor
Text Learning Group Meeting, Jan 18, 2007 – SCS/CMU
![Page 2: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/2.jpg)
Outline
1. Motivation
2. Idea and method
   • Leak criteria, text-based baselines
   • Crossvalidation, network features
3. Results
4. Finding Real Leaks in the Enron Data
5. Predicting Real Leaks in the Enron Data
   • Smoothing the leak criteria
6. Related Work
7. Conclusions
![Page 3: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/3.jpg)
Information Leaks
What’s being leaked?
• Credit card, new products information
• Social Security Numbers
• Software pre-release versions
• Business strategy, health records, etc.
Multi-million dollar industry (ILDP)
• Anonymity and privacy of data
• Information Leakage Detection and Prevention (from Wikipedia)
![Page 4: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/4.jpg)
Information Leak using Email
Hard to estimate, but according to PortAuthority Technologies:
[Figure: how data is being leaked]
![Page 5: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/5.jpg)
Email Leaks make good headlines. Just google it…
California Power-Buying Data Disclosed in Misdirected E-Mail
Leaked email exposes MS charity as PR exercise
Bush Glad FEMA Took Blame for Katrina, According to Leaked Email
![Page 6: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/6.jpg)
More email leaks in the headlines
Dell leaked email shows channel plans -Direct threat haunts dealers-A leaked email reveals Dell wants to get closer to UK resellers.
Business group say Liberals handled leaked email badly.
Is Leaked eMail a SCO-Microsoft Connection?
“Leaked email may be behind Morgan Stanley's Asia economist's sudden resignation”
![Page 7: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/7.jpg)
Detecting Email Leaks
Idea
• Goal: to detect emails accidentally sent to the wrong person.
• Generate artificial leaks: email leaks may be simulated by various criteria, such as a typo, similar last names, identical first names, aggressive auto-completion of addresses, etc.
• Method: look for outliers.
Email leak: an email accidentally sent to the wrong person.
![Page 8: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/8.jpg)
Avoiding Expensive Email Errors
Method
• Create simulated/artificial email recipients.
• Build a model for (msg.recipients): train a classifier on real data to detect synthetically created outliers (added to the true recipient list).
  Features: textual (subject, body) and network features (frequencies, co-occurrences, etc.).
• Rank potential outliers: detect the outlier and warn the user based on confidence.
[Figure: recipients (Rec_6, Rec_2, …, Rec_K, Rec_5) ranked by P(rec_t), from most likely outlier to least likely outlier]
P(rec_t) = probability that recipient t is an outlier given the message text and the other recipients in the message.
![Page 9: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/9.jpg)
Leak Criteria: how to generate (artificial) outliers
Several options:
• Frequent typos, same/similar last names, identical/similar first names, aggressive auto-completion of addresses, etc.
In this paper, we adopted the 3g-address criterion:
• On each trial, one of the msg recipients is randomly chosen and an outlier is generated according to:
[Figure: three numbered steps (1, 2, 3) illustrated with the address Marina.wang@enron.com; the only branch recoverable from the transcript is "Else: randomly select an address book entry"]
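The leak-generation rule itself was on a figure that did not survive in this transcript, so the sketch below is only a guess at its shape. It assumes that the "3g-address list" of a recipient is the set of address book entries sharing at least one character 3-gram with that recipient's address, and it falls back to the "Else: randomly select an address book entry" branch that the slide does state. The names `char_trigrams` and `simulate_leak` are hypothetical.

```python
import random


def char_trigrams(address):
    """Character 3-grams of the local part (before '@') of an address."""
    local = address.split("@")[0].lower()
    return {local[i:i + 3] for i in range(len(local) - 2)}


def simulate_leak(recipients, address_book, rng):
    """Pick one true recipient at random and return a simulated leak address.

    ASSUMPTION: the candidate list is every address book entry (not already
    on the message) that shares a character 3-gram with the chosen recipient.
    The slide's exact rule is not recoverable from the transcript.
    """
    target = rng.choice(sorted(recipients))
    target_grams = char_trigrams(target)
    candidates = sorted(
        a for a in address_book
        if a not in recipients and char_trigrams(a) & target_grams
    )
    if candidates:
        return rng.choice(candidates)
    # Else branch stated on the slide: randomly select an address book entry.
    others = sorted(a for a in address_book if a not in recipients)
    return rng.choice(others) if others else None
```

A seeded `random.Random` can be passed as `rng` so that each trial's set of simulated leaks is reproducible.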
![Page 10: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/10.jpg)
![Page 11: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/11.jpg)
Dataset: Enron Email Collection
Why?
• Large, thousands of messages
• Natural email, not email lists
• Real work environment
• Free
• No privacy concerns
• More than 100 users (with sent+received msgs)
![Page 12: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/12.jpg)
Enron Data Preprocessing 1
Set up a realistic temporal split
• For each user, the 10% most recent sent messages will be used as test
All users had their Address Books extracted
• Address Book = list of all recipients in the sent messages.
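The two preprocessing steps above can be sketched as follows. This is a minimal illustration, assuming each sent message is a dict with hypothetical `date` and `recipients` keys; the slide does not say whether the address book is built from the training messages only or from all sent messages, so that choice is left to the caller.

```python
def temporal_split(sent_messages, test_fraction=0.10):
    """Split one user's sent messages so the most recent ones form the test set."""
    ordered = sorted(sent_messages, key=lambda m: m["date"])
    if not ordered:
        return [], []
    # Keep at least one test message for users with very few sent messages
    # (an assumption; the slide only states the 10% figure).
    n_test = max(1, round(len(ordered) * test_fraction))
    split = len(ordered) - n_test
    return ordered[:split], ordered[split:]


def extract_address_book(sent_messages):
    """Address book = every recipient that appears in the given sent messages."""
    return {r for m in sent_messages for r in m["recipients"]}
```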
![Page 13: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/13.jpg)
Enron Data Preprocessing 2
ISI version of Enron
• Remove repeated messages and inconsistencies
Disambiguate main Enron addresses
• List provided by Corrada-Emmanuel from UMass
Bag-of-words
• Messages were represented as the union of BOW of body and BOW of subject
• Some stop words removed
• Self-addressed messages were removed
![Page 14: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/14.jpg)
Experiments: using Textual Features only
Three baseline methods
• Random
  Rank recipient addresses randomly.
• Cosine or TfIdf Centroid
  Create a “TfIdf centroid” for each user in the Address Book. A user1-centroid is the sum of all training messages (in TfIdf vector format) that were addressed to user1. For testing, rank according to cosine similarity between the test message and each centroid.
• Knn-30
  Given a test msg, get the 30 most similar msgs in the training set. Rank according to the “sum of similarities” of a given user on the 30-msg set.
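The two text-based baselines can be sketched in a few lines of plain Python. This is an illustration only: the TF-IDF weighting (`log(N/df) + 1`), the sparse-dict vectors, and all function names are assumptions, since the slides do not give the exact weighting scheme. Recipients with the lowest score are treated as the most likely outliers.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """docs: list of token lists. Returns an IDF dict (weighting is an assumption)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return {t: math.log(n / df[t]) + 1.0 for t in df}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector as a dict; unseen terms get weight 0 and are dropped."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid_scores(train_vecs, train_recipients, query):
    """Sum the training vectors addressed to each user, then score by cosine."""
    centroids = defaultdict(lambda: defaultdict(float))
    for vec, recips in zip(train_vecs, train_recipients):
        for r in recips:
            for t, w in vec.items():
                centroids[r][t] += w
    return {r: cosine(query, c) for r, c in centroids.items()}


def knn_scores(train_vecs, train_recipients, query, k=30):
    """Sum of similarities of each user over the k most similar training msgs."""
    sims = sorted(
        ((cosine(query, v), rs) for v, rs in zip(train_vecs, train_recipients)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for s, rs in sims:
        for r in rs:
            scores[r] += s
    return dict(scores)


def rank_outliers(scores, recipients):
    """Most likely outlier first: lowest textual score (unseen users score 0)."""
    return sorted(recipients, key=lambda r: scores.get(r, 0.0))
```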
![Page 15: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/15.jpg)
Experiments: using Textual Features only
Email Leak Prediction Results: Prec@1 in 10 trials.
On each trial, a different set of outliers is generated
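For reference, Prec@1 can be computed as below. This is a sketch under the assumption that Prec@1 here means the fraction of test messages whose top-ranked outlier is the simulated leak, averaged over trials; the function names are hypothetical.

```python
def prec_at_1(rankings, true_leaks):
    """rankings[i] lists message i's recipients, most likely outlier first."""
    hits = sum(
        1 for ranking, leak in zip(rankings, true_leaks)
        if ranking and ranking[0] == leak
    )
    return hits / len(true_leaks)


def mean_over_trials(per_trial_values):
    """Average a metric over trials (each trial uses a different outlier set)."""
    return sum(per_trial_values) / len(per_trial_values)
```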
![Page 16: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/16.jpg)
Network Features
How frequently a recipient was addressed
How these recipients co-occurred in the training set
![Page 17: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/17.jpg)
Using Network Features
1. Frequency features
   • Number of received messages (from this user)
   • Number of sent messages (to this user)
   • Number of sent+received messages
2. Co-occurrence features
   • Number of times a user co-occurred with all other recipients. “Co-occur” means that two recipients were addressed in the same message in the training set.
3. Max3g features
   • For each recipient R, find Rm (= the address with max score from the 3g-address list of R), then use score(R)-score(Rm) as a feature. Scores come from the CV10 procedure. Leak-recipient scores are likely to be smaller than their 3g-address highest score.
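The three feature groups above can be sketched as follows. The message layout (dicts with hypothetical `recipients` and `sender` keys) and the function names are assumptions, and the 3g-address list is taken as an input rather than computed here.

```python
from collections import Counter
from itertools import combinations


def build_counts(sent_msgs, received_msgs):
    """Count per-recipient frequencies and pairwise co-occurrences in training."""
    sent_to = Counter(r for m in sent_msgs for r in set(m["recipients"]))
    received_from = Counter(m["sender"] for m in received_msgs)
    cooc = Counter()
    for m in sent_msgs:
        # Each unordered pair of recipients in the same message co-occurs once.
        for a, b in combinations(sorted(set(m["recipients"])), 2):
            cooc[(a, b)] += 1
    return sent_to, received_from, cooc


def network_features(candidate, recipients, sent_to, received_from, cooc):
    """Frequency and co-occurrence features for one recipient of a message."""
    others = [r for r in recipients if r != candidate]
    co = sum(cooc[tuple(sorted((candidate, o)))] for o in others)
    return {
        "sent": sent_to[candidate],
        "received": received_from[candidate],
        "sent_plus_received": sent_to[candidate] + received_from[candidate],
        "cooccurrence": co,
    }


def max3g_feature(candidate, scores, three_gram_list):
    """score(R) - score(Rm), where Rm is the best-scoring 3g-address of R."""
    if not three_gram_list:
        return 0.0  # assumption: no 3g neighbours means a neutral feature value
    best = max(scores.get(a, 0.0) for a in three_gram_list)
    return scores.get(candidate, 0.0) - best
```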
![Page 18: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/18.jpg)
To combine textual features with network features: Crossvalidation
Training
• Use Knn-30 in a 10-fold crossvalidation setting to get the “textual score” of each user for all training messages
• Turn each train example into |R| binary examples, where |R| is the number of recipients of the message: |R|-1 positive (the real recipients) and 1 negative (the leak-recipient)
• Augment the “textual score” with network features
• Quantize features
• Train a classifier: VP5, a classification-based ranking scheme (VP5 = Voted Perceptron with 5 passes over the training set)
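The expansion into binary examples and the VP5 classifier can be sketched as below. This follows the standard voted perceptron of Freund and Schapire rather than the author's actual code, omits the quantization step, and uses hypothetical names throughout (`expand_message`, `featurize`, and so on).

```python
def expand_message(true_recipients, leak, featurize):
    """One message becomes |R| binary examples: +1 per real recipient, -1 for the leak."""
    examples = [(featurize(r), +1) for r in true_recipients]
    examples.append((featurize(leak), -1))
    return examples


def _dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_voted_perceptron(examples, passes=5):
    """Return a list of (weights, bias, survival_count) triples."""
    dim = len(examples[0][0])
    w, b, c = [0.0] * dim, 0.0, 0
    voted = []
    for _ in range(passes):
        for x, y in examples:
            if y * (_dot(w, x) + b) <= 0:
                # Mistake: store the current perceptron with its vote weight,
                # then start a new one from the updated weights.
                voted.append((w[:], b, c))
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                c = 1
            else:
                c += 1
    voted.append((w[:], b, c))
    return voted


def voted_score(voted, x):
    """Weighted vote of all stored perceptrons; lower means more leak-like."""
    return sum(c * (1 if _dot(w, x) + b > 0 else -1) for w, b, c in voted)
```

Ranking a message's recipients by ascending `voted_score` then puts the most likely leak first.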
![Page 19: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/19.jpg)
Results: Textual+Network Features
![Page 20: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/20.jpg)
Finding Real Leaks in Enron
How can we find them?
• Grep for “mistake”, “sorry” or “accident”. We were looking for sentences like “Sorry. Sent this to you by mistake. Please disregard.”, “I accidentally send you this reminder”, etc.
How many can we find?
• Dozens of cases.
• Unfortunately, most of these cases originated from non-Enron email addresses, or from an Enron email address that is not one of the 151 Enron users whose messages were collected.
• Our method requires a collection of sent (+received) messages from a user. Only 150 Enron users.
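The grep step is easy to reproduce. A minimal sketch, assuming messages are dicts with hypothetical `id` and `body` keys; the three keywords are the ones named on the slide, matched case-insensitively as substrings so that "accidentally" is also caught.

```python
import re

# Keywords from the slide; substring match, so "accident" also hits "accidentally".
LEAK_PATTERN = re.compile(r"mistake|sorry|accident", re.IGNORECASE)


def find_candidate_leaks(messages):
    """Return ids of messages whose body mentions one of the apology keywords."""
    return [m["id"] for m in messages if LEAK_PATTERN.search(m["body"])]
```

The matches are only candidates: each one still has to be read by hand to confirm that it refers to a misdirected email.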
![Page 21: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/21.jpg)
Finding Real Leaks in Enron
Found 2 good cases:
1. Message germanyc/sent/930: the message has 20 recipients; the leak is alex.perkins@
2. Message kitchen-l/sent items/497: it has 44 recipients; the leak is rita.wynne@
Prepared training data accordingly (90/10 split), with no simulated leak added
![Page 22: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/22.jpg)
Results: Finding Real Leaks in Enron
Very Disappointing!!
Reason: alex.perkins@ and rita.wynne@ were never observed in the training set!
[Prec@1, Average Rank], 100 trials
![Page 23: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/23.jpg)
“Smoothing” the leak generation
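[Figure: the leak-generation steps (1, 2, 3, with the example Marina.wang@enron.com and the branch "Else: randomly select an address book entry"), extended with a new option: generate a random email address NOT in the Address Book]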
•Sampling from random unseen recipients with probability
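The smoothing idea can be sketched as a mixture: with some probability the simulated leak is a made-up address outside the address book, otherwise the usual leak criterion is used. The parameter name `alpha`, the placeholder domain, and the way the random address is built are all assumptions for illustration; the slide does not say how the unseen address is generated.

```python
import random
import string


def random_unseen_address(address_book, rng, domain="example.com"):
    """Make up an address that is guaranteed not to be in the address book."""
    while True:
        local = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
        address = f"{local}@{domain}"
        if address not in address_book:
            return address


def smoothed_leak(base_leak_fn, address_book, alpha, rng):
    """With probability alpha return an unseen address, else the usual leak.

    base_leak_fn is a zero-argument callable implementing the unsmoothed
    leak criterion (hypothetical interface).
    """
    if rng.random() < alpha:
        return random_unseen_address(address_book, rng)
    return base_leak_fn()
```

Setting `alpha` to 0 recovers the original leak generation, so the unsmoothed criterion is a special case of this sketch.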
![Page 24: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/24.jpg)
Some Results:
•Kitchen-l has 4 unseen addresses out of the 44 recipients,
•Germany-c has only one.
![Page 25: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/25.jpg)
[Figure: Germany leak case. AvgRank and Prec@1 (y-axis, 0.0 to 4.0) versus the mixture parameter a (x-axis, 0 to 0.5)]
![Page 26: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/26.jpg)
[Figure: Kitchen-l leak case, two panels versus the mixture parameter a (x-axis, 0 to 0.2): Prec@1 (y-axis, 0.00 to 0.40) and AvgRank (y-axis, 0.0 to 12.0)]
![Page 27: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/27.jpg)
Back to the simulated leaks:
![Page 28: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/28.jpg)
What’s next
Modeling
• Better, more elegant model
Email server-side application
• Predict based on all users on the mail server
• In companies, use info from all email users
• Privacy issues
Integration with cc-prediction
![Page 29: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/29.jpg)
Related Work
Email Privacy Enforcement System
• Boufaden et al. (CEAS-2005): used information extraction techniques and domain knowledge to detect privacy breaches via email in a university environment. Breaches: student names, student grades and student IDs.
CC Prediction
• Pal & McCallum (CEAS-06): the counterpart problem, prediction of the most likely intended recipients of an email msg. One single user, limited evaluation, not public data.
Expert finding in Email
• Dom et al. (SIGMOD-03), Campbell et al. (CIKM-03)
• Balog & de Rijke (WWW-06), Balog et al. (SIGIR-06)
• Soboroff, Craswell, de Vries (TREC-Enterprise 2005-06-07…): expert finding task on the W3C corpus
![Page 30: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/30.jpg)
Thanks!
Questions? Comments?
Ideas?
![Page 31: Preventing Information Leaks in Email](https://reader036.vdocuments.us/reader036/viewer/2022062518/5681408d550346895dac1954/html5/thumbnails/31.jpg)