+How Many Folders Do You Really Need? Classifying Email into a Handful of Categories
2014/1/23 (Fri.)
Chang Wei-Yuan @ MakeLab Group Meeting
Mihajlo Grbovic, Guy Halawi, Zohar Karnin, Yoelle Maarek
Yahoo Labs, CIKM '14
+Outline
- Introduction
- Method
  - Discovering Latent Categories
  - Modeling Data
  - Training Data
  - Classification Mechanism
- Experiment
- Conclusion
- Thought
+Introduction
- Traditional email classification is still a mostly manual task.
- Recently, automatic classification has started to appear in some Web mail clients, e.g. Inbox.
- Current email traffic is dominated by non-spam machine-generated email from:
  - Social networks
  - Commerce sites
  - Official institutions
- Goals
  - Automatically distinguish between personal and machine-generated email.
  - Classify messages into latent categories, without requiring users to have defined any folders.
+Overview
[Figure: system overview diagram. Labels: mail raw data, extracting features, aggregation level, LDA, latent categories, training data, classifier, mail testing data.]
+Discovering Latent Categories
- All messages have the potential to be classified.
  - Start by retrieving the most popular folders from users.
- This paper applied LDA to these "document folders" to find latent categories.
  - Latent topics map to "latent categories".
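One way to picture the "document folders" step: the messages filed into the same folder are pooled into one bag of words, and those folder documents are what a topic model such as LDA would consume. The sketch below shows only the pooling. The folder names, sample texts, tokenizer, and the `build_folder_documents` helper are illustrative assumptions, not the paper's code.

```python
from collections import Counter, defaultdict
import re

def tokenize(text):
    """Lowercase and split on non-letters (a deliberately crude tokenizer)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def build_folder_documents(messages):
    """Pool the tokens of all messages in a folder into one bag of words.

    `messages` is an iterable of (folder_name, text) pairs.
    Returns {folder_name: Counter of token counts}.
    """
    docs = defaultdict(Counter)
    for folder, text in messages:
        docs[folder].update(tokenize(text))
    return dict(docs)

# Hypothetical sample data.
sample = [
    ("trips", "Your flight itinerary"),
    ("trips", "Hotel booking confirmed"),
    ("bills", "Your bank statement is ready"),
]
folder_docs = build_folder_documents(sample)
```

Each resulting `Counter` plays the role of one document; a real pipeline would pass these counts to an LDA implementation.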
[Figure: a collection of messages (msg), shown first on its own and then with LDA applied.]
- Our objective was to choose a value of K
  - such that each individual topic and the overall set of topics achieve significant coverage.
- We further examined K = 6
  - which gives a good balance between total and individual coverage.
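The slides do not define coverage precisely. As a rough illustration only, assume a folder counts as covered by a topic when that topic's share in the folder reaches a threshold. Individual coverage is then the fraction of folders each topic covers, and total coverage is the fraction covered by at least one topic. The threshold value, the toy proportions, and the `coverage` function are all assumptions made for this sketch.

```python
def coverage(folder_topics, threshold=0.5):
    """Return (per-topic coverage list, total coverage).

    `folder_topics` is a list of per-folder topic proportion lists,
    e.g. as produced by a topic model with K topics.
    """
    n = len(folder_topics)
    k = len(folder_topics[0])
    per_topic = [
        sum(1 for props in folder_topics if props[t] >= threshold) / n
        for t in range(k)
    ]
    total = sum(1 for props in folder_topics if max(props) >= threshold) / n
    return per_topic, total

# Toy proportions for 4 folders and K = 3 topics (made up for illustration).
toy = [
    [0.9, 0.05, 0.05],
    [0.2, 0.7, 0.1],
    [0.4, 0.3, 0.3],   # no topic reaches the threshold
    [0.1, 0.1, 0.8],
]
per_topic, total = coverage(toy)
```

Repeating this for several values of K would let one compare how total and individual coverage trade off.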
[Figure: a message (msg) is given topic proportions (travel %, social %, …) and is labeled travel.]
+Modeling Data
- Original method: each individual message is a single data point.
  - Various features are extracted from the message header and body.
- Extracting features
  - Content features
    - the message subject and body
  - Address features
    - sender email address, including the subdomain
  - Behavioral features
    - sender's and recipient's actions over a given message
[Figure: a message (msg) record with fields: subject, body, action, time, sender, address, domain.]
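The three feature families can be sketched as one flat feature dictionary per message. This is a minimal illustration: the `Message` fields, the feature naming scheme, and the sample address are assumptions, and the real system's features are certainly richer.

```python
from dataclasses import dataclass

@dataclass
class Message:
    subject: str
    body: str
    sender: str   # full sender address, e.g. "deals@mail.shop.example"
    action: str   # recipient action on the message, e.g. "read"

def extract_features(msg):
    """Return a flat dict of content, address, and behavioral features."""
    _local, _, domain = msg.sender.partition("@")
    words = (msg.subject + " " + msg.body).lower().split()
    features = {f"word={w}": 1 for w in words}      # content features
    features[f"sender={msg.sender.lower()}"] = 1    # address features
    features[f"domain={domain.lower()}"] = 1        # keeps the subdomain
    features[f"action={msg.action}"] = 1            # behavioral feature
    return features

# Hypothetical example message.
example = Message("Your order shipped", "Track your package",
                  "deals@mail.shop.example", "read")
feats = extract_features(example)
```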
- Extended method: aggregate messages at higher levels.
  - address / mail domain level
- This paper considers three levels of aggregation (message, sender, and domain).
[Figure: msg fields (subject, body, action, time, address, sender, domain) aggregated at the sender level and at the domain level.]
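Aggregation can be pictured as grouping messages under a coarser key before features are computed. The sketch below assumes the three levels are message, sender address, and mail domain, as the figure labels suggest. The `aggregate` helper and the sample addresses are hypothetical.

```python
from collections import defaultdict

def aggregate(messages, level):
    """Group message texts by message, sender address, or mail domain.

    `messages` is a list of (sender_address, text) pairs;
    `level` is "message", "sender", or "domain".
    """
    groups = defaultdict(list)
    for i, (sender, text) in enumerate(messages):
        if level == "message":
            key = i                                   # one data point per message
        elif level == "sender":
            key = sender.lower()
        elif level == "domain":
            key = sender.lower().rpartition("@")[2]   # part after the last "@"
        else:
            raise ValueError(f"unknown level: {level}")
        groups[key].append(text)
    return dict(groups)

# Hypothetical sample data.
mails = [
    ("news@shop.example", "Sale today"),
    ("orders@shop.example", "Order shipped"),
    ("news@shop.example", "New arrivals"),
]
by_sender = aggregate(mails, "sender")
by_domain = aggregate(mails, "domain")
```

Coarser levels yield fewer, denser data points: here three messages become two senders and one domain.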
- Aggregation levels
[Figure: two example messages, one labeled shopping and one labeled traveling.]
+Training Data
- Labeling techniques
  - The 6 latent categories are used as labels.
  - We create a two-stage classifier, at the message level and at the sender level.
[Figure: a message-level record (subject, action, …, sender, domain, category) and a sender-level record (sender, domain, category); a category is either known from LDA or unknown.]
[Figure: a sender is mapped to one of the categories: human, travel, social, career.]
- Heuristic-based
  - Domain: gmail.com, yahoo.com
  - Sender: <first name>.<last name>
- Automatic voting
  - A sender's messages (msg) appear in folder1, folder2 and folder3, whose topic shares are travel 96%; travel 88%; shopping 70% and travel 20%.
  - The folders are then labeled travel, travel, shopping.
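The automatic voting step can be sketched as follows: each folder that holds one of the sender's messages votes for its top category, and the sender takes the most common vote. The slides show the per-folder labels but not the final tally, so the majority rule and the `vote_sender_label` helper are assumptions.

```python
from collections import Counter

def vote_sender_label(folder_topic_shares):
    """Pick a sender label by majority vote over its messages' folders.

    `folder_topic_shares` has one dict per message, mapping category to
    that folder's topic share. Each folder votes for its top category.
    Returns (winning label, fraction of folders that voted for it).
    """
    votes = Counter(max(shares, key=shares.get) for shares in folder_topic_shares)
    label, count = votes.most_common(1)[0]
    return label, count / len(folder_topic_shares)

# Shares taken from the example on the slide.
shares = [
    {"travel": 0.96},
    {"travel": 0.88},
    {"shopping": 0.70, "travel": 0.20},
]
label, agreement = vote_sender_label(shares)
```

With the slide's numbers, two of the three folders vote for travel.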
+Classification Mechanism
- Offline creation of the classified senders table and the message-level classifier
  - We use the training set to train a logistic regression model.
  - For each category we train a separate model in a one-vs-all manner.
  - The classification process is run periodically to account for new senders.
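To make the one-vs-all setup concrete, here is a small pure-Python sketch: one logistic regression model per category, each trained to separate that category from all others, with prediction taking the most confident model. The training loop (plain stochastic gradient steps), the hyperparameters, and the three-feature toy data are assumptions for illustration; a production system would use an optimized library.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_binary(X, y, epochs=300, lr=0.5):
    """Fit one logistic regression model by stochastic gradient steps.

    Returns the weight list; the last entry is the bias term.
    """
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            xb = list(xi) + [1.0]
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xb)))
            for j, xj in enumerate(xb):
                w[j] += lr * (yi - p) * xj
    return w

def train_one_vs_all(X, labels):
    """Train one binary model per category (that category vs. the rest)."""
    return {
        c: train_binary(X, [1 if l == c else 0 for l in labels])
        for c in sorted(set(labels))
    }

def predict(models, x):
    """Return (category, confidence) of the most confident binary model."""
    xb = list(x) + [1.0]
    scores = {
        c: sigmoid(sum(wj * xj for wj, xj in zip(w, xb)))
        for c, w in models.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy features: [mentions flights, mentions sales, mentions friends].
X = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
labels = ["travel", "travel", "shopping", "shopping", "social", "social"]
models = train_one_vs_all(X, labels)
category, confidence = predict(models, [1, 0, 0])
```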
[Figure: offline stage. Labels: sender training data (35%), sender testing data (65%), two classifiers, senders table, msg training data.]
- Online light-weight classification
- The initial classification
  - hard-coded rules designed to classify quickly
- This phase requires very few resources and covers 32% of the email traffic.
- Online sender-based classification
- The second phase in our cascade classification
  - looks for the sender among senders with known categories
  - uses the senders table
- The amount of traffic not covered by this phase is roughly 8%.
- Online heavy-weight classification
- Only 8% of the traffic ends up in this last phase.
- We can therefore afford slightly heavier computation for the classifier.
  - It uses all relevant features, pertaining to the message body, subject line and sender name.
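The three online phases form a cascade in which each message stops at the first phase that can label it. The sketch below shows that control flow only. The reply rule, the senders table contents, and the stand-in heavy classifier are hypothetical placeholders, not the rules or models used in the paper.

```python
def classify(msg, rules, senders_table, heavy_classifier):
    """Three-phase cascade: cheap rules, sender lookup, then full classifier.

    Returns (category, phase_name).
    """
    for rule in rules:                              # phase 1: light-weight rules
        category = rule(msg)
        if category is not None:
            return category, "light-weight"
    category = senders_table.get(msg["sender"])     # phase 2: senders table
    if category is not None:
        return category, "sender-based"
    return heavy_classifier(msg), "heavy-weight"    # phase 3: all features

# Hypothetical rule: treat replies as human mail.
def reply_rule(msg):
    return "human" if msg["subject"].lower().startswith("re:") else None

# Hypothetical senders table and stand-in for the message-level classifier.
table = {"news@shop.example": "shopping"}

def fallback(msg):
    return "travel" if "flight" in msg["body"].lower() else "social"

r1 = classify({"sender": "a@x.example", "subject": "Re: lunch", "body": ""},
              [reply_rule], table, fallback)
r2 = classify({"sender": "news@shop.example", "subject": "Sale", "body": ""},
              [reply_rule], table, fallback)
r3 = classify({"sender": "b@y.example", "subject": "Itinerary",
               "body": "Your flight is booked"},
              [reply_rule], table, fallback)
```

Ordering the phases from cheapest to most expensive means the heavy classifier only runs on the residual traffic.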
+One-vs-all
[Figure: a message (msg) is passed to one classifier per category (social, human, career, shopping, travel, finance); each answers "yes" with a confidence, or "no".]
+Semi-supervised
[Figure: the system overview diagram, repeated on four consecutive slides.]
+Experiment
- This paper estimated the actual volume of machine-generated messages on a very large Yahoo mail dataset.
- The dataset was built for the purpose of this work.
  - 6 months of email traffic
  - more than 500 billion messages
- 5 sender-based classifiers for the machine latent categories
  - Shopping, Financial, Travel, Career and Social
- 1 sender-based classifier for the human category
+Conclusion
- We presented a Web-scale categorization approach.
  - offline learning
  - online classification
- Discovered latent categories.
- Discriminated between human and machine-generated email.
- Built a scalable online system that can be applied in Web mail.
+Future Work
- Discuss how categories should be exposed to users.
+Thought
- Extend the multiclass classification to multi-label classification.
+Overview
[Figure: the system overview diagram repeated, annotated with two open questions: "k ?" and "threshold ?".]
+Thanks for listening.
2014/01/23 (Tue.) @ MakeLab Group Meeting