Yelp Dataset Challenge 2015
TRANSCRIPT
SEARCH Final Project (ILS-Z534)
Yelp Data Challenge
UNDER THE SUPERVISION OF PROFESSOR XIAOZHONG LIU
PRESENTED BY:
MILIND GOKHALE
NAMRATA JAGASIA
DEEPAK BHARANIKANA
SAMEEDHA BAIRAGI
SIDDHARTH JAYASANKAR
TASK – 1
PREDICTING CATEGORIES FOR EACH BUSINESS USING INFORMATION RETRIEVAL
APPROACH
INPUT: BUSINESS ID
OUTPUT: LIST OF CATEGORIES FOR EACH BUSINESS ID
Dataset Division
1.6M reviews and 500K tips for 61K businesses.
Data divided into training set and test set.
Training Set (66%): ~38K businesses. Used for category feature extraction.
Test Set (33%): ~20K businesses. Used for prediction evaluation.
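A minimal sketch of such a split, assuming the businesses are shuffled at random before the cut (the slides do not say how the split was drawn). The function name, the seed, and the sample IDs are hypothetical.

```python
import random

def split_businesses(business_ids, train_fraction=0.66, seed=42):
    """Shuffle business IDs reproducibly and cut them into train and test lists."""
    ids = sorted(business_ids)            # sort so the result does not depend on input order
    random.Random(seed).shuffle(ids)      # seeded shuffle: the same split on every run
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]

# Hypothetical IDs; the real dataset uses Yelp's own business identifiers.
train_ids, test_ids = split_businesses([f"b{i}" for i in range(100)])
print(len(train_ids), len(test_ids))
```

Splitting by business rather than by review keeps all reviews and tips of a test business out of the training data.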
Toolset
JSON Handling
Indexing + Search
String Utilities
Database
POS Tagging
Java (Eclipse)
Task 1 – ALGORITHM
Start
Indexing [Business ID, Reviews, Tips]
Create Category Feature Map
Perform Business search on categories
Rank categories found
Comparison with Ground Truth
Evaluate precision and recall
End
Task 1 – Method
Index Creation using Lucene
Business ID | Category | Reviews and Tips Text
10001 | Restaurant, Indian, Spicy | The chicken curry is great. Loved the food. …
10002 | Restaurant, American, Donut | The donuts are delicious. The ambiance is good. …
Category | Search Query
Indian | Curry, mutter, spicy, …
Italian | Pizza, Pasta, Alfredo, …
Category Feature Extraction from training set
Features are the words with the highest TF-IDF scores among all the words in the reviews and tips text for the category.
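The feature extraction can be sketched as below, assuming each category's training text is treated as one document and TF-IDF is computed as relative term frequency times log(N / document frequency). The exact weighting used in the project is not given on the slides; the stop word list, function names, and sample texts are illustrative only.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a", "and", "are", "was", "of", "to", "in"}  # tiny illustrative list

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def category_features(category_texts, top_n=3):
    """Return the top_n highest TF-IDF words for each category."""
    term_counts = {cat: Counter(tokenize(text)) for cat, text in category_texts.items()}
    n_categories = len(term_counts)
    doc_freq = Counter()                      # number of categories each word occurs in
    for counts in term_counts.values():
        doc_freq.update(counts.keys())
    features = {}
    for cat, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            word: (count / total) * math.log(n_categories / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for a stable result.
        features[cat] = sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
    return features

category_texts = {
    "Indian": "The chicken curry is great. Spicy curry and loved the food.",
    "Donut": "The donuts are delicious. Fresh donuts and the food is good.",
}
features = category_features(category_texts)
print(features)
```

Words that occur in every category (here "food") get an IDF of zero, so they fall to the bottom of the ranking.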
Task 1 – Method
Category Scores for Businesses
Business ID | Result (rank, category, score)
10001 | 1 Indian - 0.734; 2 Restaurant - 0.678; 3 Asian - 0.567; …; 783 Mexican - 0.0
10002 | 1 Donut - 0.67; 2 Cheese - 0.56; 3 Restaurant - 0.43; …; 783 Bar - 0.0
Predicted Results
Business ID | Predicted categories
10001 | Indian, Restaurant, Asian, Authentic, traditional
10002 | Donut, Cheese, Restaurant, American, Icecream
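The scoring and ranking step can be sketched with a plain vector space model: the business text and each category query become term-count vectors, and categories are ranked by cosine similarity. This stands in for the Lucene search used in the project and is not its actual scoring. The function names and the value of k are assumptions.

```python
import math
import re
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def predict_categories(business_text, category_queries, k=3):
    """Score every category query against the business text and keep the top k."""
    doc = Counter(re.findall(r"[a-z]+", business_text.lower()))
    scored = []
    for category, query in category_queries.items():
        q = Counter(re.findall(r"[a-z]+", query.lower()))
        scored.append((category, cosine(doc, q)))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))   # best score first
    return scored[:k]

category_queries = {
    "Indian": "curry mutter spicy",
    "Italian": "pizza pasta alfredo",
}
ranked = predict_categories(
    "The chicken curry is great and spicy. Loved the food.", category_queries, k=2
)
print(ranked)
```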
Task 1 – Evaluation
Comparison of Ground Truth Value (provided by Yelp) with calculated predictions.
[Bar chart: Variation Across Number of Prediction Results. Recall and precision (%, axis 0 to 80) for 3, 5, 10 and 20 predicted categories. Surviving data labels: 46.38 and 74.20.]
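A sketch of how precision and recall could be computed for different numbers of predicted categories, assuming both are averaged over businesses (the slides do not state the averaging). The example reuses the two sample businesses from the method slides, with the indexed categories as ground truth.

```python
def precision_recall(predicted, actual):
    """Set-based precision and recall for one business."""
    predicted, actual = set(predicted), set(actual)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall

def evaluate_at_k(ranked_predictions, ground_truth, k):
    """Average precision and recall over businesses, using only the top-k predictions."""
    pairs = [precision_recall(ranked_predictions[b][:k], ground_truth[b]) for b in ground_truth]
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n

ranked_predictions = {
    "10001": ["Indian", "Restaurant", "Asian", "Authentic", "traditional"],
    "10002": ["Donut", "Cheese", "Restaurant", "American", "Icecream"],
}
ground_truth = {
    "10001": ["Restaurant", "Indian", "Spicy"],
    "10002": ["Restaurant", "American", "Donut"],
}
p3, r3 = evaluate_at_k(ranked_predictions, ground_truth, k=3)
p5, r5 = evaluate_at_k(ranked_predictions, ground_truth, k=5)
print(p3, r3, p5, r5)
```

In this small example, going from 3 to 5 predictions lowers precision and raises recall.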
Task 1 – Evaluation
[Bar chart: Precision Across Algorithms (VSM, BM25, LMD, LMJ). Precision axis 38 to 47. Surviving data label: 46.38.]
[Bar chart: Recall Across Algorithms (VSM, BM25, LMD, LMJ). Recall axis 46 to 55. Surviving data label: 54.03.]
Task 1 – Evaluation
[Bar chart: Impact of POS Tagging. Recall and precision (%, axis 0 to 60), with and without POS tagging. Surviving data labels: 40.02, 35.50, 53.25, 44.03.]
Task 2
PREDICT MOST DISCUSSED ATTRIBUTES IN EACH CITY
INPUT: CITY NAME
OUTPUT: LIST OF ATTRIBUTES THAT ARE MOST TALKED ABOUT IN THE CITY
Task 2 - Algorithm
Start
Split the data into test and train sets, and index the reviews and tips for each city separately.
Using WordNet, create an attribute map with the attribute name as key and the search text (related words) as value.
For the given input city, search for each attribute and retrieve its score and rank using the BM25 ranking function. Perform this step on both test and train data.
Assign the top 10 ranked attributes to each city for both test and train data.
Compare the test results with the train results.
Calculate precision and recall for this model.
End
Task 2 - Method
Splitting and Indexing of data (City-wise)
Business File, Review File, Tip File
MongoDB Collections
Final Collection (used to index):
{BusinessID : "1001", City : "Las Vegas", Rev&Tips : ["Rev1", "Rev2", …, "Tip1", "Tip2", …]}
[Diagram: each city (Las Vegas, Phoenix, Tempe) has its own Reviews & Tips index under both TRAIN INDEXES and TEST INDEXES.]
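A sketch of how the final collection could be assembled and grouped by city, using plain Python dictionaries in place of MongoDB. The input field names (`business_id`, `city`, `text`) are assumptions about the source files; the output keys follow the document shown on the slide.

```python
from collections import defaultdict

def build_final_collection(businesses, reviews, tips):
    """Merge all review and tip texts of a business into one document."""
    texts = defaultdict(list)
    for review in reviews:                       # reviews first, then tips
        texts[review["business_id"]].append(review["text"])
    for tip in tips:
        texts[tip["business_id"]].append(tip["text"])
    return [
        {"BusinessID": b["business_id"], "City": b["city"], "Rev&Tips": texts[b["business_id"]]}
        for b in businesses
    ]

def group_by_city(final_collection):
    """Group merged documents by city so that each city can get its own index."""
    by_city = defaultdict(list)
    for doc in final_collection:
        by_city[doc["City"]].append(doc)
    return dict(by_city)

businesses = [
    {"business_id": "1001", "city": "Las Vegas"},
    {"business_id": "1002", "city": "Phoenix"},
]
reviews = [
    {"business_id": "1001", "text": "Rev1"},
    {"business_id": "1001", "text": "Rev2"},
    {"business_id": "1002", "text": "Rev3"},
]
tips = [{"business_id": "1001", "text": "Tip1"}]

collection = build_final_collection(businesses, reviews, tips)
by_city = group_by_city(collection)
print(by_city["Las Vegas"])
```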
Task 2 - Method
We used WordNet to create an attribute map.
For the given city, we ran a search for each attribute on both test and train data and retrieved the top 10 attributes for each.
Attribute: Good for Kids, Music, Liquor, Smoking
Attribute Map (built with WordNet)
Good for Kids | Healthy, colorful, son, daughter, …
Music | Jazz, Rock, Pop, melody, …
Liquor | Alcohol, spirits, vodka, rum, …
Smoking | Cigar, cigarette, lighter, …
Train Data (60%) and Test Data (40%) are each searched with the IR Model, giving Results from Train Data and Results from Test Data.
Top 10 Attributes: Liquor - 4.35, Good for Kids - 3.5, Music - 2.1, Smoking - 1.6
Top 10 Attributes: Liquor - 5.35, Music - 4.0, Good for Kids - 2.5, Smoking - 0.5
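A sketch of ranking attributes for one city with BM25. It assumes an attribute's score is the sum of the BM25 scores of all city documents against the attribute's related words; the slides do not say how per-document scores were combined. The parameters k1 = 1.2 and b = 0.75 are common defaults, not values taken from the project, and the city texts are invented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Okapi BM25 score of each tokenized document against the query terms."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    doc_freq = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query_terms):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def rank_attributes(city_texts, attribute_map, top_n=10):
    """Sum BM25 scores per attribute over the city's documents and rank the attributes."""
    docs = [tokenize(t) for t in city_texts]
    totals = {
        attr: sum(bm25_scores([w.lower() for w in words], docs))
        for attr, words in attribute_map.items()
    }
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

attribute_map = {
    "Good for Kids": ["healthy", "colorful", "son", "daughter"],
    "Music": ["jazz", "rock", "pop", "melody"],
    "Liquor": ["alcohol", "spirits", "vodka", "rum"],
    "Smoking": ["cigar", "cigarette", "lighter"],
}
city_texts = [  # invented sample texts for a single city
    "Great vodka and rum cocktails, live jazz every night.",
    "Strong alcohol selection, vodka flights.",
    "My son and daughter loved it.",
]
ranking = rank_attributes(city_texts, attribute_map)
print(ranking)
```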
Task 2 - Evaluation
We compared the predicted results for the test data with the predicted results for the train data (treated as ground truth) and calculated precision and recall.
[Charts: results for Charlotte, Phoenix and Las Vegas.]
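The comparison can be sketched as a set overlap between the two top-k lists. The scores are the example values from the method slide; which list belongs to the train data is not stated there, so the assignment below is an assumption, and k = 2 is used only because the example has four attributes.

```python
def top_k(scores, k):
    """Names of the k highest-scoring attributes (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]

def compare_rankings(test_scores, train_scores, k=10):
    """Precision and recall of the test top-k, with the train top-k as ground truth."""
    predicted = set(top_k(test_scores, k))
    reference = set(top_k(train_scores, k))
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall

# Assumed assignment of the two example lists to train and test.
train_scores = {"Liquor": 4.35, "Good for Kids": 3.5, "Music": 2.1, "Smoking": 1.6}
test_scores = {"Liquor": 5.35, "Music": 4.0, "Good for Kids": 2.5, "Smoking": 0.5}
precision, recall = compare_rankings(test_scores, train_scores, k=2)
print(precision, recall)
```

When both lists hold the same number of attributes, precision and recall computed this way are equal.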
Challenges
Task 1
Data cleaning and pre-processing time.
Even after stop word removal, many unwanted features with high TF-IDF scores.
Java heap space out-of-memory exception during feature extraction from categories.
Task 2
Data cleaning and pre-processing.
Manual removal of some features from WordNet to improve output.
Evaluation metric.
Questions ?
Thank You…!!