Email Classification Results for Folder Classification on Enron Dataset

Upload: simon-mclaughlin

Post on 19-Jan-2018



Page 1

Email Classification

Results for Folder Classification on Enron Dataset

Page 2

Overall Goals

To help users manage large volumes of email.

…by helping them to sort their email into folders.

Page 3

Immediate Goals

To establish a credible test corpus

To create baseline results for email classification

To analyze possible future techniques

Page 4

The “Enron” Corpus

Previous email classification experiments have used “toy” collections.

Enron emails are collected from actual business users.

Made public through legal proceedings.

Page 5

The Enron Corpus

158 users

200,399 emails

Average of 757 emails per user

[Figure: Messages per User. x-axis: Users (0 to 140); y-axis: # of Messages (log scale, 1 to 100,000)]

Page 6

Enron Data Analysis

Most users do use folders to classify their email.

Some users with many emails still have few folders.

Users with more emails tend to have more email in each folder.

[Figure: Correlation of Folders and Messages. x-axis: # of messages (log scale, 1 to 100,000); y-axis: # of folders (0 to 200)]

Page 7

Representation

From

To, CC

Subject

Body

Date/Time?

Thread?

Attachments?

etc…?
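As a rough illustration of this representation, the sketch below splits one raw message into the sections listed on the slide using Python's standard `email` package. The sample message, the `extract_fields` helper, and the dictionary keys are assumptions made for the example; the slides do not say how the fields were actually extracted.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical message, not taken from the Enron corpus.
RAW = """\
From: alice@example.com
To: bob@example.com, carol@example.com
Cc: dave@example.com
Subject: Re: Q3 forecast
Date: Mon, 14 May 2001 09:30:00 -0700

Here are the updated numbers.
"""

def extract_fields(raw):
    """Split one raw message into From, To/CC, Subject, Body and Date."""
    msg = message_from_string(raw)
    # To and CC are merged into one recipient list, as on the slide.
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    return {
        "from": msg.get("From", ""),
        "to_cc": [addr for _name, addr in recipients],
        "subject": msg.get("Subject", ""),
        "body": msg.get_payload(),   # plain string for a non-multipart message
        "date": msg.get("Date", ""),
    }

fields = extract_fields(RAW)
```

Thread membership and attachments are left out here because the slide itself marks them as open questions.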

Page 8

Approaches

Using a bag-of-words

email data → “bag of words” → SVM → classification decision
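A minimal sketch of this pipeline, assuming a simple lowercase tokenizer and a toy linear SVM trained by batch subgradient descent on the hinge loss. The slides do not name the tokenizer or the SVM implementation, so `bag_of_words`, `train_linear_svm`, the hyperparameters, and the four sample emails are all stand-ins chosen for illustration.

```python
import re
from collections import Counter

def bag_of_words(fields):
    """Merge all text fields of one email into a single token-count vector."""
    text = " ".join(fields.values())
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def score(w, x):
    """Linear decision value: dot product of weights and token counts."""
    return sum(w[tok] * cnt for tok, cnt in x.items())

def train_linear_svm(examples, epochs=50, lr=0.1, lam=0.01):
    """Batch subgradient descent on L2-regularised hinge loss; labels are +1/-1."""
    w = Counter()
    for _ in range(epochs):
        grad = Counter()
        for x, y in examples:
            if y * score(w, x) < 1:                  # margin violated
                for tok, cnt in x.items():
                    grad[tok] -= y * cnt / len(examples)
        for tok in set(w) | set(grad):
            w[tok] -= lr * (lam * w[tok] + grad[tok])
    return w

# Hypothetical one-vs-rest data: +1 = belongs in the "meetings" folder.
train = [
    ({"subject": "Meeting agenda", "body": "staff meeting schedule"}, +1),
    ({"subject": "Agenda update", "body": "meeting room changed"}, +1),
    ({"subject": "Invoice due", "body": "payment overdue"}, -1),
    ({"subject": "Payment reminder", "body": "invoice payment friday"}, -1),
]
w = train_linear_svm([(bag_of_words(f), y) for f, y in train])
in_folder = score(w, bag_of_words({"subject": "Agenda", "body": "meeting moved"})) > 0
```

Because every field is poured into one vector, the classifier cannot tell a word in the subject from the same word in the body.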

Page 9

Approaches

Using separate SVMs for each section

email data → SVMs (one per section) → LLSF → classification decision
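The combination step can be sketched as a linear least-squares fit (LLSF) that learns one weight per section from the per-section SVM scores. The version below solves the normal equations directly; the score matrix, the +1/-1 targets, and the helper names are hypothetical, and the slides do not say whether the original fit was regularised or included an intercept.

```python
def solve(a, b):
    """Solve the square system a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                        # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def least_squares_weights(scores, targets):
    """Weights w minimising sum((scores[i] . w - targets[i])^2), via normal equations."""
    k = len(scores[0])
    xtx = [[sum(r[i] * r[j] for r in scores) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(scores, targets)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical SVM scores for one folder; columns: From, Subject, Body, To/CC.
scores = [
    [0.9, 0.4, 1.2, -0.1],
    [0.7, -0.2, 0.8, 0.3],
    [-0.5, -0.6, -1.1, 0.2],
    [-0.8, 0.1, -0.9, -0.4],
    [0.2, 0.9, 0.5, -0.3],
    [-0.3, -0.7, -0.4, 0.1],
]
targets = [1, 1, -1, -1, 1, -1]       # +1 = email is in the folder
weights = least_squares_weights(scores, targets)
combined = [sum(w * s for w, s in zip(weights, row)) for row in scores]
```

The `combined` values would then be thresholded to produce the final classification decision for the folder.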

Page 10

Approach

Data was split in half, chronologically.

A “flat” (non-hierarchical) approach was used.

An SVM was trained for each folder for each user for each field.

The SVM for each folder was trained using all of the emails for that user.

Combination weights were found with a regression for each folder.

Thresholding was performed for optimal F1 score, using the “scut” method.
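Two of the steps above lend themselves to a short sketch: the chronological split and the per-folder score cut ("scut") that picks the threshold with the best F1. The email dictionaries, their `date` and `id` keys, and the sample scores are hypothetical, and the slides do not say which portion of the data the thresholds were tuned on.

```python
def chronological_split(emails):
    """Sort by date; the older half is training data, the newer half is test data."""
    ordered = sorted(emails, key=lambda e: e["date"])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]

def f1(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def scut_threshold(scores, labels):
    """Try every observed score as a cut-off and keep the one with the highest F1."""
    best_t, best_f1 = float("inf"), 0.0
    for t in sorted(set(scores), reverse=True):
        tp = sum(s >= t and y for s, y in zip(scores, labels))
        fp = sum(s >= t and not y for s, y in zip(scores, labels))
        fn = sum(s < t and y for s, y in zip(scores, labels))
        current = f1(tp, fp, fn)
        if current > best_f1:
            best_t, best_f1 = t, current
    return best_t, best_f1

# Hypothetical emails with ISO dates, which sort correctly as strings.
emails = [
    {"id": 3, "date": "2001-07-02"},
    {"id": 1, "date": "2000-11-15"},
    {"id": 4, "date": "2001-09-30"},
    {"id": 2, "date": "2001-03-08"},
]
train, test = chronological_split(emails)

# Hypothetical combined scores and true labels for one folder.
threshold, best = scut_threshold([0.9, 0.8, 0.4, 0.3, 0.1],
                                 [True, True, False, True, False])
```

A chronological split is stricter than a random one: the classifier only sees mail older than anything it is tested on, which is how it would be used in practice.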

Page 11

“Enron” Results Analysis

Some data fields are clearly more useful than others.

Unsurprisingly, the “To, CC” data is the least useful.

Body is the most useful field, followed closely by sender.

Using all fields works better than using any particular field alone.

Linearly combining fields works better than the bag-of-words approach.

Because the combined scores come from SVMs, the linear weights are not directly interpretable.

[Figure: F1 Scores for Enron Dataset. Groups: Avg. Micro Avg. F1, Avg. Macro Avg. F1; series: From, Subject, Body, To, CC, All, Linear Comb.; y-axis: 0.35 to 0.7]
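The chart reports both micro- and macro-averaged F1. As a reminder of how the two differ, here is a small sketch for a single user; the folder names and counts are invented for illustration, and the chart's "Avg." values would then be the mean of these per-user numbers across users.

```python
def f1(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(per_folder):
    """per_folder maps folder name -> (tp, fp, fn) counts for one user."""
    # Macro: compute F1 per folder, then average, so every folder counts equally.
    macro = sum(f1(*c) for c in per_folder.values()) / len(per_folder)
    # Micro: pool the counts first, so large folders dominate.
    tp, fp, fn = (sum(c[i] for c in per_folder.values()) for i in range(3))
    return f1(tp, fp, fn), macro

# Hypothetical counts: one large, easy folder and two small, hard ones.
counts = {"inbox": (90, 10, 10), "projects": (5, 5, 15), "personal": (1, 0, 9)}
micro, macro = micro_macro_f1(counts)
```

With counts like these the micro average sits well above the macro average, because the large folder is classified well and the small ones are not.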

Page 12

Enron Results Analysis

F1 classification score is unrelated to the number of emails a user has.

[Figure: Message Count vs. Micro Average F1. x-axis: messages (log scale, 1 to 100,000); y-axis: micro average F1 (0 to 1.2)]

[Figure: Message Count vs. Macro Average F1. x-axis: messages (log scale, 1 to 100,000); y-axis: macro average F1 (0 to 1.2)]

Page 13

Enron Results Analysis

F1 score is somewhat correlated with the number of folders a user has.

Emails are much harder to classify for users with many folders.

[Figure: Folder Count vs. Macro Average F1. x-axis: folders (log scale, 1 to 1,000); y-axis: macro average F1 (0 to 1.2)]

[Figure: Folder Count vs. Micro Average F1. x-axis: folders (log scale, 1 to 1,000); y-axis: micro average F1 (0 to 1.2)]

Page 14

Enron Thread Analysis

200,399 messages

101,786 threads

30,091 non-trivial threads

61.63% of messages are in non-trivial threads

Average of 4.1 messages/thread

Median of 2 messages/thread

thread size    # of threads
2              16736
3              4782
4              3049
5              1282
6              879
7              903
8              378
9              214
10             178
(10-20]        1260
(20-30]        209
(30-40]        79
(40-50]        54
51+            88
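The thread figures above can be cross-checked with a little arithmetic. The sketch below assumes that a "non-trivial" thread is one with at least two messages and that the average and median are taken over non-trivial threads only; the slide does not state either point, but the numbers are consistent with that reading.

```python
# Histogram from the slide: number of threads at each thread size (2, 3, ..., 51+).
counts = [16736, 4782, 3049, 1282, 879, 903, 378, 214, 178, 1260, 209, 79, 54, 88]

total_messages = 200_399
share_in_threads = 0.6163          # 61.63% of messages, as stated on the slide

# The histogram should account for every non-trivial thread.
non_trivial_threads = sum(counts)

# Messages in non-trivial threads divided by the number of such threads.
avg_messages_per_thread = share_in_threads * total_messages / non_trivial_threads

# More than half of the non-trivial threads have exactly two messages,
# so the median thread size must be 2.
median_is_two = counts[0] > non_trivial_threads / 2
```

Under these assumptions the histogram sums to the 30,091 non-trivial threads reported, and the average rounds to the stated 4.1 messages per thread.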

Page 15

Enron Thread Analysis

The largest threads are potentially the most useful.

But the largest threads are also the least common.

Threads are also redundant with other kinds of evidence. Since threads are detected by subject and sender, much of the thread information duplicates those fields. Also, emails in the same thread tend to have similar bodies.

The largest thread in the Enron corpus is 1124 copies of the same message… all in the “Deleted Items” folder for a particular user!

thread size    # of threads
2              16736
3              4782
4              3049
5              1282
6              879
7              903
8              378
9              214
10             178
(10-20]        1260
(20-30]        209
(30-40]        79
(40-50]        54
51+            88
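To make the redundancy point concrete, here is one way threads could be detected from subject and sender alone. This is a guess at the general idea, not the procedure used for the slides: the prefix pattern, the rule that a message joins a thread when it shares a normalised subject and at least one participant, and the sample emails are all assumptions.

```python
import re
from collections import defaultdict

# Leading reply/forward markers such as "Re:", "FW:", "Fwd: Re:".
PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)

def normalize_subject(subject):
    """Strip reply/forward prefixes, lowercase, and collapse whitespace."""
    return " ".join(PREFIX.sub("", subject).lower().split())

def detect_threads(emails):
    """Group emails sharing a normalised subject and at least one participant."""
    by_subject = defaultdict(list)      # subject -> [(participants, messages)]
    for e in emails:
        key = normalize_subject(e["subject"])
        people = {e["from"], *e["to"]}
        for participants, messages in by_subject[key]:
            if participants & people:   # someone already in this thread
                participants |= people
                messages.append(e)
                break
        else:                           # no matching thread: start a new one
            by_subject[key].append((people, [e]))
    return [msgs for groups in by_subject.values() for _p, msgs in groups]

# Hypothetical emails.
emails = [
    {"from": "a@x.com", "to": ["b@x.com"], "subject": "Budget"},
    {"from": "b@x.com", "to": ["a@x.com"], "subject": "RE: Budget"},
    {"from": "c@x.com", "to": ["d@x.com"], "subject": "budget"},
    {"from": "a@x.com", "to": ["b@x.com"], "subject": "Fw: Re: Budget "},
]
threads = detect_threads(emails)
```

Everything the thread label encodes here is already available to a classifier that sees the subject and sender fields, which is the redundancy the slide describes.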