Empowering Businesses using Yelp Reviews Mining
Vipul Munot, Pralhad Sapre, Nishant Salvi,
Neelam Tikone, Rutuja Kulkarni
Fall 2016
Advisor Prof. Xiaozhong Liu
Z534 ILS Search
Yelp Dataset Mining
Agenda
• Task 1 - Predicting categories of a business (multi-class and multi-label)
• Task 2 - Predicting pros and cons of a business (topic modelling)
Technologies
• MongoDB
• Python
• Gensim
• NLTK
• Scikit-learn
• TextBlob
• R
Exploratory Data Analysis
• Total number of reviews: 2685066
• Total businesses: 85901
Ratings for Businesses
Top Categories
• Restaurants: 26729
• Shopping: 12444
• Food: 10143
• Beauty: 7490
• Health & Medical: 6106
• Home Services: 5866
• Nightlife: 5507
Data Preprocessing
• Merge the reviews and businesses on business ID.
• Merge all the reviews for a business ID into a single passage.
• Remove stop words from the reviews.
• Use TF-IDF to create the word vector.
• Class labels: all categories for that business ID.
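The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the pipeline used in the project: the record fields `business_id` and `text`, the tiny stop list, and the unsmoothed `tf * log(N / df)` weighting are all assumptions made for the example (library implementations such as scikit-learn's typically use a smoothed variant).

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop list (assumption); a real run would use a full list.
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "it", "to", "of"}

def merge_reviews(reviews):
    """Concatenate all review texts that share a business id into one passage."""
    passages = defaultdict(list)
    for review in reviews:
        passages[review["business_id"]].append(review["text"])
    return {bid: " ".join(texts) for bid, texts in passages.items()}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]

def tfidf(passages):
    """Return {business_id: {term: tf * log(N / df)}} (unsmoothed textbook form)."""
    tokens = {bid: tokenize(text) for bid, text in passages.items()}
    n_docs = len(tokens)
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))  # each passage counts once per term
    vectors = {}
    for bid, toks in tokens.items():
        counts = Counter(toks)
        total = len(toks) or 1
        vectors[bid] = {t: (c / total) * math.log(n_docs / df[t])
                        for t, c in counts.items()}
    return vectors

# Hypothetical toy records standing in for the review collection.
reviews = [
    {"business_id": "b1", "text": "The coffee was great"},
    {"business_id": "b1", "text": "Great cakes and coffee"},
    {"business_id": "b2", "text": "Oil change was quick and the service great"},
]
passages = merge_reviews(reviews)
vectors = tfidf(passages)
```

In this toy data "great" occurs in both passages, so its weight is zero, while "coffee" is specific to one business and keeps a positive weight.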
Data
Multi-Class: Hotel Chocolate - [Coffee & Tea, Food, Cafes, Chocolatiers & Shops, Specialty Food, Event Planning & Services, Hotels & Travel, Hotels, Restaurants]
Multi-Label: A prediction counts as correct when it recovers at least 45% of the distinct labels that “Hotel Chocolate” has.
Task 1
Prediction of Business categories using reviews
• Naive Bayes
• Logistic Regression
• Random Forest
We built the Naive Bayes classifier from the ground up. For the rest we used scikit-learn.
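For readers unfamiliar with what a from-scratch Naive Bayes classifier involves, here is a small sketch. It is not the project's implementation: it is a single multinomial model with add-one smoothing in which a multi-label document is counted once under each of its labels, rather than a set of one-vs-all classifiers. The class name, method names, and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Multinomial Naive Bayes with add-one smoothing over tokenized documents."""

    def fit(self, docs, label_sets):
        self.label_counts = Counter()
        self.term_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, labels in zip(docs, label_sets):
            for label in labels:  # a multi-label document counts once per label
                self.label_counts[label] += 1
                self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total = sum(self.label_counts.values())
        return self

    def log_scores(self, tokens):
        """Return {label: log prior + sum of smoothed log likelihoods}."""
        scores = {}
        vocab_size = len(self.vocab)
        for label, n in self.label_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n / self.total)
            for t in tokens:
                if t in self.vocab:  # terms never seen in training are ignored
                    score += math.log((counts[t] + 1) / denom)
            scores[label] = score
        return scores

    def top_k(self, tokens, k):
        """Rank labels by score and keep the k best."""
        scores = self.log_scores(tokens)
        return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy training set: token lists and their category labels.
docs = [["coffee", "cake", "latte"], ["oil", "tires", "repair"], ["coffee", "espresso"]]
label_sets = [["Food", "Cafes"], ["Automotive"], ["Cafes"]]
model = MultinomialNB().fit(docs, label_sets)
cafe_prediction = model.top_k(["coffee", "latte"], 2)
auto_prediction = model.top_k(["tires", "oil"], 1)
```

Ranking labels by score and keeping the top few is what makes a probabilistic classifier usable for a multi-label target.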
Challenges Faced
• Multi-class classification
• Adapting existing classifiers (one vs all)
• Preprocessing the data (engineering problem)
• Defining our own accuracy function based on the nature of the problem (partial accuracy - 45% in our case)
• Labels assigned are not mutually exclusive
• There is an inherent class hierarchy - could be learned by association rule mining
Adapting classifiers to multi-class, multi-label problems
• Make a probabilistic prediction
• Take the top 7 categories
• Accurate += 1 if |prediction ∩ truth| > len(truth) * 0.45
• This is the idea of a partial match
e.g.
"predicted_labels" : [ "Automotive", "Oil Change Stations", "Auto Repair", "Tires", "Shopping", "Auto Parts & Supplies", "Gas & Service Stations" ]
"labels" : [ "Automotive", "Auto Parts & Supplies" ]
Evaluation Metrics
• Hamming Loss: Fraction of wrong labels out of the total number of labels
• Hamming Score: Number of correct labels divided by the size of the union of predicted and true labels
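Both Hamming metrics can be computed per business with set operations. The sketch below is one reading of the definitions above, not the project's code: "wrong labels" is taken to mean labels that appear in exactly one of the two sets, and "total number of labels" is taken to be the size of the label universe, which is passed in explicitly. The 10-label universe is made up for the example.

```python
def hamming_score(predicted, truth):
    """Correct labels divided by the size of the union of predicted and true labels."""
    predicted, truth = set(predicted), set(truth)
    union = predicted | truth
    if not union:
        return 1.0  # nothing predicted and nothing expected
    return len(predicted & truth) / len(union)

def hamming_loss(predicted, truth, all_labels):
    """Wrongly assigned or missed labels divided by the size of the label universe."""
    predicted, truth = set(predicted), set(truth)
    return len(predicted ^ truth) / len(set(all_labels))

predicted = ["Automotive", "Oil Change Stations", "Auto Repair", "Tires",
             "Shopping", "Auto Parts & Supplies", "Gas & Service Stations"]
truth = ["Automotive", "Auto Parts & Supplies"]
# Hypothetical label universe of 10 categories, for illustration only.
universe = predicted + ["Restaurants", "Food", "Nightlife"]

score = hamming_score(predicted, truth)          # 2 correct out of a union of 7
loss = hamming_loss(predicted, truth, universe)  # 5 wrong out of 10 labels
```

Always predicting 7 labels keeps the Hamming score low for businesses with few true categories, even when every true category is recovered.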
Evaluation metrics (contd.)
• Precision: The fraction of retrieved instances that are relevant.
• Recall: The fraction of relevant instances that are retrieved.
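For a multi-label prediction, the "retrieved instances" are the predicted labels and the "relevant instances" are the true labels. The sketch below shows per-business precision and recall and their averages. The function names and toy data are hypothetical, and averaging per business is an assumption about how the averages were taken.

```python
def precision_recall(predicted, truth):
    """Return (precision, recall) for one business's predicted and true label sets."""
    predicted, truth = set(predicted), set(truth)
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall

def average_precision_recall(pairs):
    """Average precision and recall over (predicted, truth) pairs."""
    results = [precision_recall(p, t) for p, t in pairs]
    n = len(results)
    return (sum(p for p, _ in results) / n, sum(r for _, r in results) / n)

predicted = ["Automotive", "Oil Change Stations", "Auto Repair", "Tires",
             "Shopping", "Auto Parts & Supplies", "Gas & Service Stations"]
truth = ["Automotive", "Auto Parts & Supplies"]

# 2 of the 7 predictions are correct, and both true labels are found.
precision, recall = precision_recall(predicted, truth)
```

Predicting a fixed list of 7 labels tends to favour recall over precision, since most businesses have fewer than 7 true categories to match.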
Naive Bayes
• Avg precision - 0.33
• Avg recall - 0.80
• Avg Hamming score - 0.31
Performance of classifiers (Partial Label Match - 45%)
Businesses                   | Classifier (One vs All) | Accuracy % (75%-25% split)
0 - 80000 (full set)         | Naive Bayes             | 90.94
0 - 20000 (537K reviews)     | Random Forest           | 76.18
20000 - 40000 (897K reviews) | Random Forest           | 75.97
20000 - 40000 (897K reviews) | Logistic Regression     | 90.64
40000 - 60000 (574K reviews) | Random Forest           | 68.62
40000 - 60000 (574K reviews) | Logistic Regression     | 89.61
Task 2: Objectives
• The goal was to find the words, phrases, ratings, and patterns that predict the pros and cons of a business.
• We also extract the good and bad features of every restaurant, which can help in providing suggestions to Yelp users.
Task 2: Tools and Techniques
• Gensim: used for applying the LDA algorithm
• TextBlob: used for assigning POS tags
• NLTK: used for removing stop words, extracting nouns and creating the bag of words
• LDA (Latent Dirichlet Allocation): used for grouping similar terms from negative and positive reviews together and associating a name with that grouping.
Task 2: Build Model
Task 2: Utilize Model
Task 2: Analysis
Top 10 Good Topics
1. Customer Service
2. Food
3. Bar & Liquor
4. Overall Quality
5. Mexican Food
6. Breakfast
7. Ambiance & Hospitality
8. Expensive
9. Location
10. Entertainment
Task 2: Analysis
Top 10 Bad Topics
1. Staff and Service
2. Coffee and Cake
3. Ambiance and Hospitality
4. Bad Service
5. Pet Friendliness
6. Delivery Services
7. Entertainment
8. Parking and Utilities
9. Food
10. Mexican Food
Task 2: Results
• Business ID - 1vQLTKwmcmZXtNzfKEvMmA
• Good points - Food, Mexican Food, Overall Quality
• Bad points - Delivery Services, Staff and Service
Future Scope
• Association rules to define the hierarchy of labels
• Devise a formula to convert good and bad topics into a rating
• Human feedback to evaluate Task 2
Questions?
Thank You