CS 277 Data Mining: Project Presentation
Instructor: Prof. Dave Newman
Team: Hitesh Sajnani, Vaibhav Saini, Kusum Kumar
Donald Bren School of Information and Computer Science
University of California, Irvine
Problem Statement
• Classify a given Yelp review text into one or more relevant categories
Example review: "They have the best happy hours around, the food is good and their service is even better. When its winter, we become regulars. :)"
Review categories: Food, Service, Ambience, Discount, Worthiness
Marked for this review: 1 1 1 (three of the five categories)
Dataset
Reviews
• Reviews from Food and Restaurant category
• # Useful votes > 1
• Total 10,000 reviews
Classification categories
• Identified categories using a sample set of 400 random reviews
• Refined categories using 200 more reviews
• Final categories: 5
• Food, Ambience, Service, Deals/Discounts, Worthiness
Data Annotation
• 10,000 reviews divided into 5 bins (w/ repetition)
• 6 researchers manually annotated reviews
• 225 man-hours of work!
• Discrepancy in 981 ambiguous reviews: removed from analysis
• Total 9,019 reviews: split into 80% train and 20% test
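The 80/20 split above can be sketched as follows. This is a minimal illustration, not the team's actual procedure: the slides do not say whether the split was random or stratified, so the shuffling, the `seed` parameter, and the stand-in review IDs are all assumptions.

```python
import random


def train_test_split(reviews, train_fraction=0.8, seed=0):
    """Shuffle a copy of `reviews` and split it into (train, test).

    A random split is an assumption; the slides only give the 80/20 ratio.
    """
    shuffled = list(reviews)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in IDs for the 9,019 reviews kept after annotation.
review_ids = list(range(9019))
train, test = train_test_split(review_ids)
print(len(train), len(test))  # 7215 1804
```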
Features – unigrams/bigrams/trigrams
• Total 703 textual features
• 375 unigrams, 208 bigrams, 120 trigrams
[Chart: frequency of unigrams/bigrams/trigrams]
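Counting n-gram frequencies over review text might look like the sketch below. The tokenizer (lowercase, letters and apostrophes only) and the two sample reviews are assumptions made for illustration; the slides do not describe how the 703 features were chosen from the counts.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes (an assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(reviews, n):
    """Count how often each n-gram occurs across all reviews."""
    counts = Counter()
    for text in reviews:
        counts.update(ngrams(tokenize(text), n))
    return counts


# Hypothetical sample reviews, not taken from the dataset.
sample_reviews = [
    "The food is good and the service is even better.",
    "Good food, slow service.",
]
unigrams = ngram_counts(sample_reviews, 1)
bigrams = ngram_counts(sample_reviews, 2)
trigrams = ngram_counts(sample_reviews, 3)
```

A frequency cutoff or a top-k selection over these counters would then yield a fixed feature list.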
Features – User ratings
• 3 nominal features: Good, Moderate, Bad
• Review stars are mapped to a feature value: Bad, Moderate, or Good
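A star-to-feature mapping of this kind could be written as below. The star thresholds did not survive in this transcript, so the cut-offs used here (1 or 2 stars as Bad, 3 as Moderate, 4 or 5 as Good) are purely an assumption and are exposed as parameters.

```python
def rating_feature(stars, bad_max=2, good_min=4):
    """Map a star rating to a nominal feature.

    The thresholds `bad_max` and `good_min` are assumed, not taken from the slides.
    """
    if stars <= bad_max:
        return "Bad"
    if stars >= good_min:
        return "Good"
    return "Moderate"


print([rating_feature(s) for s in range(1, 6)])
# ['Bad', 'Bad', 'Moderate', 'Good', 'Good']
```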
Approach
Review categories: Food, Service, Ambience, Discount, Worthiness
Example review: "They have the best happy hours around, the food is good and their service is even better. When its winter, we become regulars. :)" (marked 1 1 1, i.e. three categories)
• Reviews can be classified into more than one category
• Not a binary classification problem. It is a multi-label classification!
Binary classifiers for each category
• Learns one binary classifier for each category
• Output is the union of predictions of all binary classifiers

Original dataset
Reviews    Categories
Review 1   {Food, Deals}
Review 2   {Ambience, Deals}
Review 3   {Food}
Review 4   {Service, Ambience, Deals}

Transformed datasets (one binary dataset per category, shown here as columns)
Reviews    Food  Service  Ambience  Deals
Review 1   1     0        0         1
Review 2   0     0        1         1
Review 3   1     0        0         0
Review 4   0     1        1         1
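The transformation from the original dataset to one binary dataset per category can be sketched as follows, using the four example reviews from the slide. The function names are illustrative only, and no particular base classifier is implied.

```python
CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

# The four example reviews from the slide.
dataset = {
    "Review 1": {"Food", "Deals"},
    "Review 2": {"Ambience", "Deals"},
    "Review 3": {"Food"},
    "Review 4": {"Service", "Ambience", "Deals"},
}


def binary_relevance(dataset, categories):
    """Build one binary-labelled dataset per category."""
    return {
        category: {rid: int(category in labels) for rid, labels in dataset.items()}
        for category in categories
    }


def union_of_predictions(per_category_prediction):
    """Combine the 0/1 outputs of the per-category classifiers into a label set."""
    return {c for c, bit in per_category_prediction.items() if bit == 1}


transformed = binary_relevance(dataset, CATEGORIES)
print(transformed["Deals"])
# {'Review 1': 1, 'Review 2': 1, 'Review 3': 0, 'Review 4': 1}
```

At prediction time, each of the four classifiers answers independently and the positive answers are collected with `union_of_predictions`.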
Classifier for each subset of categories
• Categories = {Food, Service, Ambience, Deals}
• We consider each different "subset of categories" as a single category and learn a multi-class classifier

Original dataset
Reviews    Categories
Review 1   {Food, Deals}
Review 2   {Ambience, Deals}
Review 3   {Food}
Review 4   {Service, Ambience, Deals}

Encoding (Food Service Ambience Deals)
Review 1:  1 0 0 1
Review 2:  0 0 1 1

Transformed dataset
Reviews    Categories
Review 1   "1001"
Review 2   "0011"
Review 3   "1000"
Review 4   "0111"
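The encoding of each label subset as a single class string can be sketched like this. The helper names `encode` and `decode` are made up for illustration; any multi-class classifier could then be trained on the encoded strings.

```python
CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

dataset = {
    "Review 1": {"Food", "Deals"},
    "Review 2": {"Ambience", "Deals"},
    "Review 3": {"Food"},
    "Review 4": {"Service", "Ambience", "Deals"},
}


def encode(labels, categories=CATEGORIES):
    """Turn a label set into one class string, one bit per category."""
    return "".join("1" if c in labels else "0" for c in categories)


def decode(code, categories=CATEGORIES):
    """Turn a predicted class string back into a label set."""
    return {c for c, bit in zip(categories, code) if bit == "1"}


transformed = {rid: encode(labels) for rid, labels in dataset.items()}
print(transformed)
# {'Review 1': '1001', 'Review 2': '0011', 'Review 3': '1000', 'Review 4': '0111'}
```

With four categories there are up to 16 distinct classes, and a subset that never occurs in training can never be predicted, which is one motivation for working with smaller subsets.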
Ensemble of subset classifiers
• Train one classifier per subset of categories; each predicts only the categories in its subset

Reviews    Food  Service  Ambience  Deals
Review 1   0     1        1         0
Review 2   0     1        0         1
Review 3   0     0        0         1
Review 4   0     1        1         1
Review 5   1     0        1         0
Review 6   1     1        1         0
Review 7   0     1        1         1

Classifier 1 for (Food, Service)
Classifier 2 for (Food, Ambience)
Classifier 3 for (Food, Deals)
Classifier 4 for (Service, Ambience)
Classifier 5 for (Service, Deals)
Classifier 6 for (Ambience, Deals)
Total 6 classifiers for subsets of size 2 (4C2 = 6)
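Enumerating the size-2 subsets and restricting each review's labels to one subset can be sketched as follows. `project` is a hypothetical helper name, and the labels for Review 1 are read off the table on this slide.

```python
from itertools import combinations

CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

# All subsets of size 2: C(4, 2) = 6 pairs, one classifier per pair.
PAIRS = list(combinations(CATEGORIES, 2))


def project(labels, pair):
    """Restrict a review's label set to the two categories of one classifier."""
    return tuple(int(c in labels) for c in pair)


# Review 1 from the table: Service and Ambience are set.
review_1 = {"Service", "Ambience"}
for pair in PAIRS:
    print(pair, project(review_1, pair))
```

Each pair classifier is trained only on the projected two-bit targets, so it faces at most four classes instead of sixteen.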
Ensemble of classifiers: Prediction
• Ask each classifier to vote!
• Final prediction: Majority vote (>= 2 classifiers)

Prediction from                  Food  Service  Ambience  Deals
(Food, Service) classifier       0     1        -         -
(Food, Ambience) classifier      1     -        1         -
(Food, Deals) classifier         1     -        -         0
(Service, Ambience) classifier   -     0        1         -
(Service, Deals) classifier      -     0        -         1
(Ambience, Deals) classifier     -     -        0         1
Majority vote                    1     0        1         1
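The voting step can be sketched as follows, using the six pair predictions from the slide. Each category is covered by three of the six classifiers, so a threshold of 2 positive votes is a majority; the function name and data layout are illustrative only.

```python
from collections import defaultdict

CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

# Predictions of the six pair classifiers, as shown on the slide.
votes = {
    ("Food", "Service"): (0, 1),
    ("Food", "Ambience"): (1, 1),
    ("Food", "Deals"): (1, 0),
    ("Service", "Ambience"): (0, 1),
    ("Service", "Deals"): (0, 1),
    ("Ambience", "Deals"): (0, 1),
}


def majority_vote(pair_predictions, categories, threshold=2):
    """Sum the positive votes per category and keep those reaching the threshold."""
    tally = defaultdict(int)
    for pair, bits in pair_predictions.items():
        for category, bit in zip(pair, bits):
            tally[category] += bit
    return [int(tally[c] >= threshold) for c in categories]


print(majority_vote(votes, CATEGORIES))  # [1, 0, 1, 1]
```

Service receives only one positive vote out of three and is dropped, which matches the final row on the slide.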
Evaluation measures
Notations:
• Let (x, Y) be a multi-label example, Y ⊆ L (L is the set of all categories)
• Let h be a multi-label classifier
• Let Z = h(x) be the set of labels predicted by h for (x, Y)
Precision:
Recall:
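The formulas themselves are not present in this transcript. A common example-based definition that fits this notation is Precision = |Y ∩ Z| / |Z| and Recall = |Y ∩ Z| / |Y|, each averaged over all examples. The sketch below assumes that definition, which may differ from what the slide showed, and it scores an empty Y or Z as 0 (also an assumption).

```python
def precision_recall(examples):
    """Example-based precision and recall for multi-label predictions.

    `examples` is a list of (Y, Z) pairs: true label set, predicted label set.
    Assumed definition: per-example |Y & Z| / |Z| and |Y & Z| / |Y|, averaged.
    """
    precision_sum = 0.0
    recall_sum = 0.0
    for true_labels, predicted in examples:
        overlap = len(true_labels & predicted)
        precision_sum += overlap / len(predicted) if predicted else 0.0
        recall_sum += overlap / len(true_labels) if true_labels else 0.0
    n = len(examples)
    return precision_sum / n, recall_sum / n


# Hypothetical predictions, for illustration only.
examples = [
    ({"Food", "Deals"}, {"Food"}),      # precision 1/1, recall 1/2
    ({"Food"}, {"Food", "Service"}),    # precision 1/2, recall 1/1
]
precision, recall = precision_recall(examples)
print(precision, recall)  # 0.75 0.75
```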
Precision & Recall (Train)
[Bar chart: Training performance, 3-fold cross validation. Data labels, first series: 68, 75, 66, 72; second series: 67, 53, 66, 71]
Precision & Recall (Test)
Train vs. Test performance of Ensemble of (smaller) subsets using decision trees

            Train  Test  Test (baseline)
Precision   72     72    69
Recall      71     70    68
Observation 1: Ensemble gave the best results
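Observation 2: Data Skew
[First pie chart. Legend: Food, Service, Ambience, Deals, Worthiness. Slice labels: 55%, 21%, 9%, 8%, 7%]
[Second pie chart. Legend: Food, Service, Ambience, Deals, Worthiness. Slice labels: 35%, 21%, 16%, 15%, 13%]
• Normalized skew in training data by adding selective data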
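Measuring the skew before deciding which data to add might look like the sketch below. The slides do not say how the extra reviews were selected, so this only computes each category's share of all label assignments and flags categories below a chosen floor; the `floor` value and the toy label sets are assumptions.

```python
from collections import Counter


def category_shares(label_sets):
    """Share of each category among all label assignments (shares sum to 1)."""
    counts = Counter(c for labels in label_sets for c in labels)
    total = sum(counts.values())
    return {c: counts[c] / total for c in counts}


def underrepresented(shares, floor):
    """Categories whose share falls below `floor` (an assumed cut-off)."""
    return sorted(c for c, share in shares.items() if share < floor)


# Toy training labels, for illustration only.
train_labels = [{"Food"}, {"Food", "Service"}, {"Food"}, {"Deals"}]
shares = category_shares(train_labels)
print(underrepresented(shares, floor=0.25))  # ['Deals', 'Service']
```

Reviews carrying the flagged categories would then be candidates for the selective additions.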
Precision & Recall (w & w/o category normalization)
Cross-validation performance of Ensemble of (smaller) subsets using decision trees

            with data skew  without data skew
Precision   72              84
Recall      71              78
Thanks!
Check out our Yelp submission: http://www.ics.uci.edu/~vpsaini/
Feedback welcome!