problem statement

CS 277 DataMining Project Presentation Instructor: Prof. Dave Newman Team: Hitesh Sajnani, Vaibhav Saini, Kusum Kumar Donald Bren School of Information and Computer Science University of California, Irvine


Post on 25-Feb-2016


Page 1: Problem Statement

CS 277 Data Mining Project Presentation

Instructor: Prof. Dave Newman
Team: Hitesh Sajnani, Vaibhav Saini, Kusum Kumar

Donald Bren School of Information and Computer Science
University of California, Irvine

Page 2: Problem Statement

Problem Statement

• Classify a given Yelp review text into one or more relevant categories

Review categories: Food, Service, Ambience, Discount, Worthiness

Example review: "They have the best happy hours around, the food is good and their service is even better. When its winter, we become regulars. :)"

Example labels: 1 1 1 (three of the five categories apply to this review)

Page 3: Problem Statement

Dataset

Reviews
• Reviews from the Food and Restaurant category
• # Useful votes > 1
• Total 10,000 reviews

Classification categories
• Identified categories using a sample set of 400 random reviews
• Refined categories using 200 more reviews
• Final categories: 5
• Food, Ambience, Service, Deals/Discounts, Worthiness

Page 4: Problem Statement

Data Annotation

• 10,000 reviews divided into 5 bins (w/ repetition)
• 6 researchers manually annotated reviews
• 225 man-hours of work!
• Discrepancy in 981 ambiguous reviews -- removed from analysis
• Total 9,019 reviews: split into 80% train and 20% test

Page 5: Problem Statement

Features – unigrams/bigrams/trigrams

• Total 703 textual features
• 375 unigrams, 208 bigrams, 120 trigrams

[Figure: frequency of unigrams/bigrams/trigrams]
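A minimal sketch of how unigram, bigram, and trigram counts could be extracted from review text. The tokenization (lowercasing and whitespace splitting) and the helper names `ngrams` and `ngram_counts` are assumptions for illustration; the slides do not say how the 703 features were selected.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, max_n=3):
    """Count unigrams, bigrams, and trigrams in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


# Hypothetical review text, used only to exercise the functions.
counts = ngram_counts("the food is good and the service is good")
```

A frequency cutoff over such counts is one plausible way to arrive at a fixed feature list, but that step is not described in the deck.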

Page 6: Problem Statement

Features – User ratings

• 3 nominal features – Good, Moderate, Bad

[Table: review stars mapped to a feature value of Bad, Moderate, or Good]

Page 7: Problem Statement

Approach

Review categories: Food, Service, Ambience, Discount, Worthiness

Example review: "They have the best happy hours around, the food is good and their service is even better. When its winter, we become regulars. :)"

Example labels: 1 1 1

• Reviews can be classified into more than one category
• Not a binary classification problem. It is a multi-label classification problem!

Page 8: Problem Statement

Binary classifiers for each category

• Learns one binary classifier for each category
• Output is the union of predictions of all binary classifiers

Original dataset
Reviews   Categories
Review 1  {Food, Deals}
Review 2  {Ambience, Deals}
Review 3  {Food}
Review 4  {Service, Ambience, Deals}

Transformed datasets (one binary dataset per category)
Reviews   Food  Service  Ambience  Deals
Review 1  1     0        0         1
Review 2  0     0        1         1
Review 3  1     0        0         0
Review 4  0     1        1         1
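The transformation on this slide can be sketched in a few lines. This is an illustrative sketch only: the function names `binary_datasets` and `union_of_predictions` are hypothetical, and the per-category classifiers themselves are left out.

```python
CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

# The four example reviews from the slide.
dataset = {
    "Review 1": {"Food", "Deals"},
    "Review 2": {"Ambience", "Deals"},
    "Review 3": {"Food"},
    "Review 4": {"Service", "Ambience", "Deals"},
}


def binary_datasets(dataset, categories):
    """Build one 0/1-labelled dataset per category."""
    return {
        c: {review: int(c in labels) for review, labels in dataset.items()}
        for c in categories
    }


def union_of_predictions(binary_predictions):
    """Combine per-category 0/1 predictions into one predicted label set."""
    return {c for c, flag in binary_predictions.items() if flag == 1}


transformed = binary_datasets(dataset, CATEGORIES)
predicted = union_of_predictions({"Food": 1, "Service": 0, "Ambience": 0, "Deals": 1})
```

Each entry of `transformed` is the training target for one independent binary classifier, and `union_of_predictions` rebuilds a label set from the four individual outputs.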

Page 9: Problem Statement

Classifier for each subset of categories

• Categories = {Food, Service, Ambience, Deals}
• We consider each different “subset of categories” as a single category and learn a multi-class classifier

Original dataset
Reviews   Categories
Review 1  {Food, Deals}
Review 2  {Ambience, Deals}
Review 3  {Food}
Review 4  {Service, Ambience, Deals}

Encoding
Reviews   Food  Service  Ambience  Deals
Review 1  1     0        0         1
Review 2  0     0        1         1

Transformed dataset
Reviews   Categories
Review 1  “1001”
Review 2  “0011”
Review 3  “1000”
Review 4  “0111”
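As a rough illustration of this encoding, each label set can be turned into a bit string in the fixed order Food, Service, Ambience, Deals, and that string is then used as a single class for a multi-class classifier. The helper names `encode` and `decode` are hypothetical.

```python
CATEGORIES = ["Food", "Service", "Ambience", "Deals"]


def encode(labels, categories=CATEGORIES):
    """Encode a label set as a bit string, one bit per category in fixed order."""
    return "".join("1" if c in labels else "0" for c in categories)


def decode(code, categories=CATEGORIES):
    """Recover the label set from a bit string produced by encode()."""
    return {c for c, bit in zip(categories, code) if bit == "1"}


codes = {
    "Review 1": encode({"Food", "Deals"}),
    "Review 2": encode({"Ambience", "Deals"}),
    "Review 3": encode({"Food"}),
    "Review 4": encode({"Service", "Ambience", "Deals"}),
}
```

With four categories there are up to 2^4 = 16 distinct classes, and a subset that never occurs in training can never be predicted.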

Page 10: Problem Statement

Ensemble of subset classifiers

• Train one classifier per subset of categories; each classifier predicts only the categories in its subset

Reviews   Food  Service  Ambience  Deals
Review 1  0     1        1         0
Review 2  0     1        0         1
Review 3  0     0        0         1
Review 4  0     1        1         1
Review 5  1     0        1         0
Review 6  1     1        1         0
Review 7  0     1        1         1

Classifier 1 for (Food, Service)
Classifier 2 for (Food, Ambience)
Classifier 3 for (Food, Deals)
Classifier 4 for (Service, Ambience)
Classifier 5 for (Service, Deals)
Classifier 6 for (Ambience, Deals)

Total 6 classifiers for subsets of size 2 categories – 4C2
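The count of six classifiers is the number of 2-element subsets of the four categories. The sketch below enumerates those subsets and shows, for Review 1 from the table, the target each pairwise classifier would be trained on. The `project` helper and the bit-string target format are assumptions for illustration.

```python
from itertools import combinations
from math import comb

CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

# All subsets of size 2, in the same order as Classifier 1..6 on the slide.
subsets = list(combinations(CATEGORIES, 2))


def project(row, subset):
    """Restrict a full label row (category -> 0/1) to one subset, as a class string."""
    return "".join(str(row[c]) for c in subset)


# Review 1 from the table: Food=0, Service=1, Ambience=1, Deals=0.
review_1 = dict(zip(CATEGORIES, [0, 1, 1, 0]))
targets = {subset: project(review_1, subset) for subset in subsets}
```

Each pairwise classifier sees at most four classes ("00", "01", "10", "11"), far fewer than the 16 a single classifier over all subsets would face.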

Page 16: Problem Statement

Ensemble of classifiers: Prediction

• Ask each classifier to vote!

Prediction from                 Food  Service  Ambience  Deals
(Food, Service) classifier      0     1        -         -
(Food, Ambience) classifier     1     -        1         -
(Food, Deals) classifier        1     -        -         0
(Service, Ambience) classifier  -     0        1         -
(Service, Deals) classifier     -     0        -         1
(Ambience, Deals) classifier    -     -        0         1

Page 17: Problem Statement

Ensemble of classifiers: Prediction

• Final prediction: Majority vote (>= 2 classifiers)

Prediction from                 Food  Service  Ambience  Deals
(Food, Service) classifier      0     1        -         -
(Food, Ambience) classifier     1     -        1         -
(Food, Deals) classifier        1     -        -         0
(Service, Ambience) classifier  -     0        1         -
(Service, Deals) classifier     -     0        -         1
(Ambience, Deals) classifier    -     -        0         1
Majority vote                   1     0        1         1
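The voting step can be sketched as follows, reading the six votes on this slide as one pair of 0/1 values per pairwise classifier. Each category receives three votes and is predicted when at least two are positive. The function name `majority_vote` and its `threshold` parameter are hypothetical.

```python
from collections import defaultdict

CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

# One vote per pairwise classifier, as shown on the slide.
votes = [
    {"Food": 0, "Service": 1},
    {"Food": 1, "Ambience": 1},
    {"Food": 1, "Deals": 0},
    {"Service": 0, "Ambience": 1},
    {"Service": 0, "Deals": 1},
    {"Ambience": 0, "Deals": 1},
]


def majority_vote(votes, categories, threshold=2):
    """Predict a category when at least `threshold` classifiers voted 1 for it."""
    tally = defaultdict(int)
    for vote in votes:
        for category, flag in vote.items():
            tally[category] += flag
    return [int(tally[c] >= threshold) for c in categories]


final = majority_vote(votes, CATEGORIES)
```

Here `final` reproduces the slide's majority-vote row: Food, Ambience, and Deals each collect two positive votes, while Service collects only one.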

Page 18: Problem Statement

Evaluation measures

Notations:
• Let (x, Y) be a multi-label example, Y ⊆ L
• Let h be a multi-label classifier
• Let Z = h(x) be the set of labels predicted by h for (x, Y)

Precision:

Recall:
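A minimal sketch of example-based precision and recall in this notation, assuming the standard set-based definitions: precision for one example is |Y ∩ Z| / |Z|, recall is |Y ∩ Z| / |Y|, and both are averaged over all examples. Whether the project used exactly these definitions is an assumption, as is the convention of scoring 0 when Z or Y is empty.

```python
def precision_recall(examples):
    """Average example-based precision and recall.

    examples: list of (Y, Z) pairs, where Y is the true label set and
    Z is the predicted label set for one example.
    """
    precisions, recalls = [], []
    for Y, Z in examples:
        overlap = len(Y & Z)
        # Assumed convention: an empty Z or Y contributes a score of 0.
        precisions.append(overlap / len(Z) if Z else 0.0)
        recalls.append(overlap / len(Y) if Y else 0.0)
    n = len(examples)
    return sum(precisions) / n, sum(recalls) / n


# Hypothetical predictions, used only to exercise the function.
examples = [
    ({"Food", "Deals"}, {"Food", "Ambience", "Deals"}),  # precision 2/3, recall 1
    ({"Food"}, {"Food"}),                                # precision 1, recall 1
]
precision, recall = precision_recall(examples)
```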

Page 19: Problem Statement

Precision & Recall (Train)

Training performance, 3-fold cross validation

[Figure: bar chart of precision and recall with four groups of bars; plotted values 68, 75, 66, 72 and 67, 53, 66, 71]

Page 20: Problem Statement

Precision & Recall (Test)

Train vs. Test performance of Ensemble of (smaller) subsets using decision trees

           Train  Test  Test (baseline)
Precision  72     72    69
Recall     71     70    68

Page 21: Problem Statement

Observation 1: Ensemble gave the best results

Page 22: Problem Statement

Observation 2: Data Skew

[Pie chart 1: category shares 55%, 21%, 9%, 8%, 7%; legend: Food, Service, Ambience, Deals, Worthiness]

[Pie chart 2: category shares 35%, 21%, 16%, 15%, 13%; legend: Food, Service, Ambience, Deals, Worthiness]

Normalized skew in training data by adding selective data

Page 23: Problem Statement

Precision & Recall (w & w/o category normalization)

Cross-validation performance of Ensemble of (smaller) subsets using decision trees

           With data skew  Without data skew
Precision  72              84
Recall     71              78

Page 25: Problem Statement

Thanks!

Check out our Yelp submission: http://www.ics.uci.edu/~vpsaini/

Feedback welcome!