Yelp Dataset Challenge 2015

SEARCH Final Project (ILS-Z534)

Yelp Data Challenge

UNDER THE SUPERVISION OF PROFESSOR XIAOZHONG LIU

PRESENTED BY

MILIND GOKHALE
NAMRATA JAGASIA
DEEPAK BHARANIKANA
SAMEEDHA BAIRAGI
SIDDHARTH JAYASANKAR

TASK – 1

PREDICTING CATEGORIES FOR EACH BUSINESS USING INFORMATION RETRIEVAL

APPROACH

INPUT: BUSINESS ID
OUTPUT: LIST OF CATEGORIES FOR EACH BUSINESS ID

Dataset Division

- 1.6M reviews and 500K tips for 61K businesses.
- Data divided into a training set and a test set.
- 66% Training Set: ~38K businesses. Used for category feature extraction.
- 33% Test Set: ~20K businesses. Used for prediction evaluation.

Toolset

- JSON Handling
- Indexing + Search
- String Utilities
- Database
- POS Tagging
- Java (Eclipse)

Task 1 – ALGORITHM

Start
1. Indexing [Business ID, Reviews, Tips]
2. Create Category Feature Map
3. Perform business search on categories
4. Rank categories found
5. Comparison with ground truth
6. Evaluate precision and recall
End

Task 1 – Method

Index Creation using Lucene

Business ID | Category                    | Reviews and Tips Text
10001       | Restaurant, Indian, Spicy   | The chicken curry is great. Loved the food. …………
10002       | Restaurant, American, Donut | The donuts are delicious. The ambiance is good. …………

Category | Search Query
Indian   | Curry, mutter, spicy…..
Italian  | Pizza, Pasta, Alfredo…..

Category Feature Extraction from training set

Features are the words with the highest TF-IDF scores among all the words in the reviews and tips text for the category.
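A minimal sketch of this feature extraction step, written in Python for brevity rather than in the project's Java/Lucene code. It assumes each category is treated as one document made of the concatenated review and tip text of its training businesses; the function name `top_tfidf_features`, the whitespace tokenizer, and the toy texts are illustrative assumptions, not the original implementation.

```python
import math
from collections import Counter


def top_tfidf_features(category_texts, top_n=3):
    """Return the top_n highest TF-IDF words for each category.

    category_texts maps a category name to the concatenated review and
    tip text of its training businesses (toy data in this sketch).
    """
    # Term counts per category, treating each category as one document.
    counts = {cat: Counter(text.lower().split())
              for cat, text in category_texts.items()}
    n_docs = len(counts)
    # Document frequency: number of categories whose text contains the word.
    df = Counter(word for c in counts.values() for word in c)

    features = {}
    for cat, c in counts.items():
        total = sum(c.values())
        scores = {w: (tf / total) * math.log(n_docs / df[w])
                  for w, tf in c.items()}
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        features[cat] = ranked[:top_n]
    return features


# Hypothetical toy corpus, one text blob per category.
category_texts = {
    "Indian": "the curry is spicy and the curry is great",
    "Italian": "the pizza is cheesy and the pizza is great",
    "Donut": "the glazed donut is sweet and the donut is great",
}
features = top_tfidf_features(category_texts, top_n=2)
print(features)
```

Words that occur in every category (such as "the" or "great") get an IDF of zero here, so they never outrank category-specific words. The resulting word lists can then serve as the search query for each category.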

Task 1 – Method

Category Scores for Businesses

Business ID | Result (rank, category, score)
10001       | 1 Indian – 0.734; 2 Restaurant – 0.678; 3 Asian – 0.567; ..; 783 Mexican – 0.0
10002       | 1 Donut – 0.67; 2 Cheese – 0.56; 3 Restaurant – 0.43; ..; 783 Bar – 0.0

Predicted Results

Business ID | Predicted categories
10001       | Indian, Restaurant, Asian, Authentic, Traditional
10002       | Donut, Cheese, Restaurant, American, Icecream

Task 1 – Evaluation

Comparison of Ground Truth Value (provided by Yelp) with calculated predictions.
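The comparison can be sketched as precision and recall over the top k predicted categories of a business. This is an illustrative Python sketch, not the project's code: the slides do not state the exact formulas used, so the definitions below (hits divided by the number of predictions, and hits divided by the number of ground-truth categories) are an assumption, as is the function name `precision_recall_at_k`. The example data reuses business 10001 from the earlier slides.

```python
def precision_recall_at_k(ranked_predictions, ground_truth, k):
    """Precision and recall of the top-k predicted categories."""
    top_k = ranked_predictions[:k]
    # A hit is a predicted category that also appears in the ground truth.
    hits = len(set(top_k) & set(ground_truth))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Business 10001: predicted ranking vs. the categories provided by Yelp.
predicted = ["Indian", "Restaurant", "Asian", "Authentic", "Traditional"]
truth = ["Restaurant", "Indian", "Spicy"]

for k in (3, 5):
    p, r = precision_recall_at_k(predicted, truth, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

Under these definitions, returning more categories can only keep recall equal or raise it, while precision tends to fall once the correct categories have been found.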

[Chart: Variation Across Number of Prediction Results. Recall and precision (%) for 3, 5, 10, and 20 predicted categories; axis 0 to 80. Surviving data labels: 46.38 and 74.20.]


[Chart: Precision Across Algorithms (VSM, BM25, LMD, LMJ). Precision axis 38 to 47. Surviving data label: 46.38.]

[Chart: Recall Across Algorithms (VSM, BM25, LMD, LMJ). Recall axis 46 to 55. Surviving data label: 54.03.]


[Chart: Impact of POS Tagging. Recall and precision (%) with and without POS tagging; axis 0 to 60. Surviving data labels: 40.02, 35.50, 53.25, 44.03.]

TASK – 2

PREDICT MOST DISCUSSED ATTRIBUTES IN EACH CITY

INPUT: CITY NAME
OUTPUT: LIST OF ATTRIBUTES THAT ARE MOST TALKED ABOUT IN THE CITY

Task 2 - Algorithm

Start
1. Split the data into test and train sets, and index the reviews and tips for each city separately.
2. Using WordNet, create an attribute map with the attribute name as key and the search text (related words) as values.
3. For the given input city, perform a search for each attribute, then retrieve scores and rank the attributes using the BM25 ranking function. Perform this step on both test and train data.
4. Assign the top 10 ranked attributes to each city for both test and train data.
5. Compare the test results with the train results.
6. Calculate precision and recall for this model.
End

Task 2 - Method: Splitting and Indexing of Data (City-wise)

Business File, Review File, Tip File -> MongoDB Collections -> Final Collection (used to index)

{BusinessID : "1001", City : "Las Vegas", Rev&Tips : ["Rev1","Rev2",…..,"Tip1","Tip2",…]}

Each city gets its own index of reviews and tips, built separately for train and test data (for example, one index holds Review 1, Review 2, ..., Tip 1 and the other holds Review 101, Review 102, ..., Tip 101).

TRAIN INDEXES: Las Vegas, Phoenix, Tempe
TEST INDEXES: Las Vegas, Phoenix, Tempe
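The merge and the city-wise split can be sketched as follows. This Python sketch uses in-memory lists in place of the MongoDB collections and plain dictionaries in place of the per-city indexes. The slides do not say whether the split was made per review or per business, so splitting each city's texts 60/40 at random is an assumption, as are the function names `build_final_collection` and `split_by_city` and the toy records.

```python
import random
from collections import defaultdict


def build_final_collection(businesses, reviews, tips):
    """Join reviews and tips onto their business, one record per business."""
    texts = defaultdict(list)
    for review in reviews:
        texts[review["business_id"]].append(review["text"])
    for tip in tips:
        texts[tip["business_id"]].append(tip["text"])
    return [{"BusinessID": b["business_id"],
             "City": b["city"],
             "Rev&Tips": texts[b["business_id"]]}
            for b in businesses]


def split_by_city(final_collection, train_fraction=0.6, seed=0):
    """Group texts by city, then split each city's texts into train/test."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_city = defaultdict(list)
    for doc in final_collection:
        by_city[doc["City"]].extend(doc["Rev&Tips"])
    train, test = {}, {}
    for city, texts in by_city.items():
        shuffled = texts[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_fraction)
        train[city], test[city] = shuffled[:cut], shuffled[cut:]
    return train, test


# Hypothetical toy records using the Yelp field names business_id, city, text.
businesses = [{"business_id": "1001", "city": "Las Vegas"},
              {"business_id": "1002", "city": "Phoenix"}]
reviews = [{"business_id": "1001", "text": "Rev1"},
           {"business_id": "1001", "text": "Rev2"},
           {"business_id": "1001", "text": "Rev3"},
           {"business_id": "1002", "text": "Rev4"}]
tips = [{"business_id": "1001", "text": "Tip1"},
        {"business_id": "1001", "text": "Tip2"}]

final_collection = build_final_collection(businesses, reviews, tips)
train, test = split_by_city(final_collection)
print(final_collection[0])
print(len(train["Las Vegas"]), len(test["Las Vegas"]))
```

In the project itself, each city's train and test texts would then be written to their own search index.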

Task 2 - Method

We used WordNet to create an attribute map.

For the given city, we ran a search for each attribute on both test and train data and retrieved the top 10 attributes for each.

Attributes: Good for Kids, Music, Liquor, Smoking

Attribute Map (built with WordNet)

Attribute     | Related words
Good for Kids | Healthy, colorful, son, daughter,…….
Music         | Jazz, Rock, Pop, melody…..
Liquor        | Alcohol, spirits, vodka, Rum….
Smoking       | Cigar, Cigarette, lighter, …..

IR Model: Train Data (60%) -> Results from Train Data; Test Data (40%) -> Results from Test Data

Top 10 Attributes: Liquor – 4.35, Good for Kids – 3.5, Music – 2.1, Smoking – 1.6
Top 10 Attributes: Liquor – 5.35, Music – 4.0, Good for Kids – 2.5, Smoking – 0.5
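A minimal sketch of ranking attributes for one city with BM25, in Python rather than the project's Java search code. The attribute map is hard-coded here instead of being built from WordNet, and summing the BM25 scores of all of a city's documents into one attribute score is an assumption, since the slides do not say how per-document scores were combined. The names `bm25_score` and `rank_attributes`, the parameters `k1=1.2` and `b=0.75`, and the toy reviews are illustrative.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_tokens, doc_freq, n_docs, avg_len,
               k1=1.2, b=0.75):
    """BM25 score of one document for a bag of query terms."""
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue
        df = doc_freq[term]
        # Non-negative IDF variant, so common terms never score below zero.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalisation: long documents need more matches to score well.
        norm = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


def rank_attributes(attribute_map, city_docs, top_k=10):
    """Rank attributes by their summed BM25 score over a city's documents."""
    docs = [d.lower().split() for d in city_docs]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(t for d in docs for t in set(d))
    totals = {attr: sum(bm25_score(words, d, doc_freq, n_docs, avg_len)
                        for d in docs)
              for attr, words in attribute_map.items()}
    # Highest score first; ties broken by attribute name.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hand-written stand-in for the WordNet-derived attribute map.
attribute_map = {
    "Liquor": ["alcohol", "spirits", "vodka", "rum"],
    "Music": ["jazz", "rock", "pop", "melody"],
    "Smoking": ["cigar", "cigarette", "lighter"],
}
# Hypothetical reviews and tips for one city.
city_docs = [
    "great vodka and rum cocktails",
    "live jazz and cheap vodka",
    "the rum selection is huge",
]
ranking = rank_attributes(attribute_map, city_docs)
print(ranking)
```

Running the same ranking on a city's train and test texts yields the two top-10 lists that the evaluation step compares.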

Task 2 - Evaluation

We compared the predicted results on the test data with the predicted results on the train data (treated as ground truth) and calculated precision and recall.

[Charts: results for Charlotte, Phoenix, and Las Vegas.]

Challenges

Task 1
- Data cleaning and pre-processing time.
- Even after stop word removal, many unwanted features with high TF-IDF scores.
- Java heap space out-of-memory exception during feature extraction from categories.

Task 2
- Data cleaning and pre-processing.
- Manual removal of some features from WordNet for improving output.
- Evaluation metric.

Questions ?

Thank You…!!
