problem statement

Post on 25-Feb-2016

CS 277 Data Mining Project Presentation

Instructor: Prof. Dave Newman
Team: Hitesh Sajnani, Vaibhav Saini, Kusum Kumar

Donald Bren School of Information and Computer Science
University of California, Irvine

Problem Statement

Review categories: Food, Service, Ambience, Discount, Worthiness

Example review: "They have the best happy hours around, the food is good and their service is even better. When its winter, we become regulars. :)"

Category flags: 1 1 1

• Classify a given Yelp review text into one or more relevant categories

Dataset

Reviews
• Reviews from the Food and Restaurant category
• More than 1 useful vote each
• Total 10,000 reviews

Classification categories
• Identified categories using a sample set of 400 random reviews
• Refined categories using 200 more reviews
• Final categories: 5
• Food, Ambience, Service, Deals/Discounts, Worthiness

Data Annotation

• 10,000 reviews divided into 5 bins (with repetition)
• 6 researchers manually annotated the reviews
• 225 man-hours of work!
• Discrepancies in 981 ambiguous reviews; these were removed from the analysis
• Remaining 9,019 reviews split into 80% train and 20% test
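The 80/20 split can be sketched as follows. This is a minimal illustration, not the team's actual procedure: the slide does not say whether the split was random or stratified, so the shuffle, the `seed` parameter and the integer stand-in IDs are all assumptions.

```python
import random

def split_train_test(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of items and cut it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability (assumption)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in IDs for the 9,019 reviews kept after removing the 981 ambiguous ones.
review_ids = list(range(10_000 - 981))
train, test = split_train_test(review_ids)
print(len(train), len(test))
```

With 9,019 reviews, an 80% cut rounded down gives 7,215 training reviews and 1,804 test reviews.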

Features: unigrams/bigrams/trigrams

• Total 703 textual features
• 375 unigrams, 208 bigrams, 120 trigrams

[Figure: frequency of unigrams/bigrams/trigrams]
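A rough sketch of how n-gram counts like these could be collected is shown below. The lower-casing and whitespace tokenization are assumptions, and the slide does not describe how the final 703 features were selected, so no selection step is shown.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(text, max_n=3):
    """Count unigrams, bigrams and trigrams in one review text."""
    tokens = text.lower().split()  # naive tokenizer (assumption)
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

counts = ngram_counts("the food is good and the service is good")
print(counts["is good"], counts["the"], counts["the food is"])
```

Summing such counters over the training reviews gives corpus-level frequencies from which a fixed feature vocabulary could be chosen.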

Features: user ratings

• 3 nominal features: Good, Moderate, Bad

Review stars   Feature
… or …         Bad
…              Moderate
… or …         Good
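The mapping from stars to a nominal feature could look like the sketch below. The star cut-offs are not legible in the slide, so the `bad_max` and `good_min` defaults here are purely hypothetical values chosen for illustration on a 1 to 5 star scale.

```python
def rating_feature(stars, bad_max=2, good_min=4):
    """Map a star rating to 'Bad', 'Moderate' or 'Good'.

    bad_max and good_min are hypothetical thresholds; the slide's actual
    cut-offs are not recoverable.
    """
    if stars <= bad_max:
        return "Bad"
    if stars >= good_min:
        return "Good"
    return "Moderate"

print([rating_feature(s) for s in range(1, 6)])
```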

Approach

Review categories: Food, Service, Ambience, Discount, Worthiness

Example review: "They have the best happy hours around, the food is good and their service is even better. When its winter, we become regulars. :)"

Category flags: 1 1 1

• A review can be classified into more than one category
• This is not a binary classification problem; it is a multi-label classification problem!

Binary classifiers for each category

• Learns one binary classifier for each category
• Output is the union of predictions of all binary classifiers

Original dataset

Reviews    Categories
Review 1   {Food, Deals}
Review 2   {Ambience, Deals}
Review 3   {Food}
Review 4   {Service, Ambience, Deals}

Transformed datasets

Reviews    Food
Review 1   1
Review 2   0
Review 3   1
Review 4   0

Reviews    Service
Review 1   0
Review 2   0
Review 3   0
Review 4   1

Reviews    Ambience
Review 1   0
Review 2   1
Review 3   0
Review 4   1

Reviews    Deals
Review 1   1
Review 2   1
Review 3   0
Review 4   1
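The transformation from the original dataset to one binary dataset per category can be sketched in a few lines. The function names `binary_relevance` and `union_of_predictions` are made up for this illustration; the classifiers themselves are left out.

```python
CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

# The four example reviews from the slide.
dataset = {
    "Review 1": {"Food", "Deals"},
    "Review 2": {"Ambience", "Deals"},
    "Review 3": {"Food"},
    "Review 4": {"Service", "Ambience", "Deals"},
}

def binary_relevance(dataset, categories):
    """Build one 0/1 target table per category."""
    return {
        category: {rid: int(category in labels) for rid, labels in dataset.items()}
        for category in categories
    }

def union_of_predictions(flags):
    """Turn per-category 0/1 predictions into the predicted label set."""
    return {category for category, flag in flags.items() if flag == 1}

transformed = binary_relevance(dataset, CATEGORIES)
print(transformed["Food"])
print(union_of_predictions({"Food": 1, "Service": 0, "Ambience": 0, "Deals": 1}))
```

Each entry of `transformed` is the target column one binary classifier would be trained on; at prediction time the positive categories are simply collected into a set.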

Classifier for each subset of categories

• Categories = {Food, Service, Ambience, Deals}
• We consider each different "subset of categories" as a single category and learn a multi-class classifier

Original dataset

Reviews    Categories
Review 1   {Food, Deals}
Review 2   {Ambience, Deals}
Review 3   {Food}
Review 4   {Service, Ambience, Deals}

Encoding

           Food  Service  Ambience  Deals
Review 1   1     0        0         1
Review 2   0     0        1         1

Transformed dataset

Reviews    Categories
Review 1   "1001"
Review 2   "0011"
Review 3   "1000"
Review 4   "0111"
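The encoding of each label subset as a single class string can be sketched as below. The helper names `encode` and `decode` are illustrative only; the category order Food, Service, Ambience, Deals is taken from the slide.

```python
CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

def encode(labels, categories=CATEGORIES):
    """Encode a label set as a bit string, one character per category."""
    return "".join("1" if c in labels else "0" for c in categories)

def decode(code, categories=CATEGORIES):
    """Recover the label set from a bit string produced by encode()."""
    return {c for c, bit in zip(categories, code) if bit == "1"}

dataset = {
    "Review 1": {"Food", "Deals"},
    "Review 2": {"Ambience", "Deals"},
    "Review 3": {"Food"},
    "Review 4": {"Service", "Ambience", "Deals"},
}
classes = {rid: encode(labels) for rid, labels in dataset.items()}
print(classes)
```

A multi-class classifier is then trained on these strings as class names, and its predicted string is decoded back into a set of categories.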

Ensemble of subset classifiers

• Train one classifier per subset of categories; each classifier predicts only the categories in its subset

Reviews    Food  Service  Ambience  Deals
Review 1   0     1        1         0
Review 2   0     1        0         1
Review 3   0     0        0         1
Review 4   0     1        1         1
Review 5   1     0        1         0
Review 6   1     1        1         0
Review 7   0     1        1         1

Classifier 1 for (Food, Service)
Classifier 2 for (Food, Ambience)
Classifier 3 for (Food, Deals)
Classifier 4 for (Service, Ambience)
Classifier 5 for (Service, Deals)
Classifier 6 for (Ambience, Deals)

Total: 6 classifiers for subsets of size 2 (4C2 = 6)
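One way to derive the training targets for each pair classifier is to project every review's label row onto the two categories of the pair. The sketch below assumes each pair classifier is trained as a small multi-class problem over the pair's bit patterns; the slide does not spell this out, and the `project` helper is hypothetical.

```python
from itertools import combinations

CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

# Label rows from the slide, in CATEGORIES order.
rows = {
    "Review 1": (0, 1, 1, 0),
    "Review 2": (0, 1, 0, 1),
    "Review 3": (0, 0, 0, 1),
    "Review 4": (0, 1, 1, 1),
    "Review 5": (1, 0, 1, 0),
    "Review 6": (1, 1, 1, 0),
    "Review 7": (0, 1, 1, 1),
}

def project(row, subset):
    """Keep only the bits of row that belong to the categories in subset."""
    return "".join(str(row[CATEGORIES.index(c)]) for c in subset)

pairs = list(combinations(CATEGORIES, 2))  # the 4C2 = 6 subsets of size 2
targets = {pair: {rid: project(row, pair) for rid, row in rows.items()}
           for pair in pairs}
print(len(pairs))
print(targets[("Food", "Service")])
```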

Ensemble of classifiers: Prediction

• Ask each classifier to vote!
• Final prediction: Majority vote (>= 2 classifiers)

Prediction from classifier    Food  Service  Ambience  Deals
(Food, Service)               0     1        -         -
(Food, Ambience)              1     -        1         -
(Food, Deals)                 1     -        -         0
(Service, Ambience)           -     0        1         -
(Service, Deals)              -     0        -         1
(Ambience, Deals)             -     -        0         1
Majority vote                 1     0        1         1
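The voting step can be sketched as follows, using the six pair predictions from the slide. Each category appears in three pairs, so the threshold of 2 positive votes is a majority; the function name `majority_vote` is illustrative.

```python
CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

# Votes from the slide: one 0/1 prediction per category in each pair.
votes = {
    ("Food", "Service"): (0, 1),
    ("Food", "Ambience"): (1, 1),
    ("Food", "Deals"): (1, 0),
    ("Service", "Ambience"): (0, 1),
    ("Service", "Deals"): (0, 1),
    ("Ambience", "Deals"): (0, 1),
}

def majority_vote(votes, categories, threshold=2):
    """A category is predicted when at least `threshold` classifiers vote 1."""
    tally = {c: 0 for c in categories}
    for subset, bits in votes.items():
        for category, bit in zip(subset, bits):
            tally[category] += bit
    return {c: int(tally[c] >= threshold) for c in categories}

print(majority_vote(votes, CATEGORIES))
```

For these votes the tallies are Food 2, Service 1, Ambience 2 and Deals 2, which reproduces the slide's final row 1 0 1 1.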

Evaluation measures

Notation:
• Let (x, Y) be a multi-label example, Y ⊆ L
• Let h be a multi-label classifier
• Let Z = h(x) be the set of labels predicted by h for (x, Y)

Measures: precision and recall
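The formulas themselves are not present in this transcript. A common example-based definition that fits the notation is precision = |Y ∩ Z| / |Z| and recall = |Y ∩ Z| / |Y|, averaged over all examples; the sketch below assumes that definition, which may differ from the one on the original slide. Treating an empty Z or Y as a score of 0 is also an assumption.

```python
def example_precision_recall(examples):
    """examples: list of (Y, Z) pairs of true and predicted label sets.

    Returns (precision, recall), each averaged over the examples.
    """
    precisions, recalls = [], []
    for true_labels, predicted in examples:
        overlap = len(true_labels & predicted)
        precisions.append(overlap / len(predicted) if predicted else 0.0)
        recalls.append(overlap / len(true_labels) if true_labels else 0.0)
    n = len(examples)
    return sum(precisions) / n, sum(recalls) / n

examples = [
    ({"Food", "Deals"}, {"Food"}),          # precision 1/1, recall 1/2
    ({"Ambience"}, {"Ambience", "Deals"}),  # precision 1/2, recall 1/1
]
precision, recall = example_precision_recall(examples)
print(precision, recall)
```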

Precision & Recall (Train)

[Figure: training performance, 3-fold cross validation. Bar labels in order: 68, 75, 66, 72 (first series) and 67, 53, 66, 71 (second series).]

Precision & Recall (Test)

Train vs. Test performance of Ensemble of (smaller) subsets using decision trees

            Train  Test  Test (baseline)
Precision   72     72    69
Recall      71     70    68

Observation 1: Ensemble gave the best results

Observation 2: Data Skew

[Figure: two pie charts of category shares. Legend: Food, Service, Ambience, Deals, Worthiness. Shares in the first chart: 55%, 21%, 9%, 8%, 7%. Shares in the second chart: 35%, 21%, 16%, 15%, 13%.]

Normalized skew in training data by adding selective data
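Category shares like those in the pie charts can be measured with a short helper. The slide does not say how the shares were computed, so counting each label assignment once and dividing by the total number of assignments is an assumption, as is the `category_shares` name; the four reviews are the toy dataset from the earlier slides, not the real training data.

```python
from collections import Counter

def category_shares(label_sets):
    """Return each category's share of all label assignments."""
    counts = Counter(label for labels in label_sets for label in labels)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

label_sets = [
    {"Food", "Deals"},
    {"Ambience", "Deals"},
    {"Food"},
    {"Service", "Ambience", "Deals"},
]
shares = category_shares(label_sets)
print(sorted(shares.items()))
```

Re-running such a check after adding reviews for under-represented categories shows whether the distribution has moved closer to balance.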

Precision & Recall (with and without category normalization)

Cross-validation performance of Ensemble of (smaller) subsets using decision trees

            With data skew  Without data skew
Precision   72              84
Recall      71              78

Thanks!

Check out our Yelp submission: http://www.ics.uci.edu/~vpsaini/

Feedback welcome!
