Yelp Dataset Challenge
Campus Arc II, 16 April 2015
Mehdy Davary, Computer Science Department (IIUN)
ABOUT THE CHALLENGE DATASET
• 1.2M reviews
• 400K tips by 250K users for 42K businesses
• 400K business attributes, e.g., hours, parking availability, ambience
• Social network of 250K users for a total of 1.9M social edges.
• Aggregated check-ins over time for each of the 42K businesses
CITIES
• U.K.: Edinburgh
• U.S.: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison
PLATFORM
• The Hortonworks Sandbox is a single node implementation of the Hortonworks Data Platform (HDP). It is a personal, portable Hadoop environment.
• H2O on Hortonworks Data Platform is a fully Open Source Predictive Analytics Platform.
• Neo4j is a graph database that stores data as nodes connected by relationships. Neo4j uses Cypher queries to work with graph data.
[Diagram: The Hortonworks Sandbox, H2O, Sentiment Analysis, Neo4j]
THE HORTONWORKS SANDBOX
By now we have loaded all five Yelp JSON data files into Hadoop as tables that are sortable and searchable. We mainly use HCatalog, Pig, Python and Hive to load and process the data.
H2O
H2O is a statistical analysis engine that uses the Hadoop Distributed File System (HDFS) as its storage platform and provides a user-friendly interface for easy querying.
NEO4J
The real power of Neo4j is in connected data. To associate any two nodes, we add a relationship that describes how the records are related.
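The node-and-relationship model described above can be sketched in plain Python. This only illustrates the data model; it is not Neo4j's API and not Cypher. The class name `TinyGraph`, the relationship types `FRIEND` and `REVIEWED`, and the sample ids and names are all hypothetical.

```python
from collections import defaultdict


class TinyGraph:
    """Toy property graph: nodes with properties, typed relationships between them."""

    def __init__(self):
        self.nodes = {}                    # node id -> property dict
        self.outgoing = defaultdict(list)  # node id -> [(relationship type, target id)]

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def relate(self, start, rel_type, end):
        # A relationship describes how two existing records are related.
        if start not in self.nodes or end not in self.nodes:
            raise KeyError("both nodes must exist before they can be related")
        self.outgoing[start].append((rel_type, end))

    def neighbours(self, node_id, rel_type):
        # Follow only the relationships of the requested type.
        return [end for kind, end in self.outgoing[node_id] if kind == rel_type]


graph = TinyGraph()
graph.add_node("u1", label="User", name="Alice")
graph.add_node("u2", label="User", name="Bob")
graph.add_node("b1", label="Business", name="Example Diner")
graph.relate("u1", "FRIEND", "u2")
graph.relate("u1", "REVIEWED", "b1")
print(graph.neighbours("u1", "FRIEND"))
```

In a graph database the same idea is expressed declaratively in the query language, and the traversal is handled by the engine rather than by hand-written loops.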
TO ANALYZE HORTONWORKS SANDBOX DATA WITH EXCEL 2013
• Hortonworks ODBC driver (64-bit) installed and configured.
• Microsoft Excel 2013 Professional Plus 64-bit.
• Use the Microsoft Query feature to access Hortonworks sandbox data.
• Use the Excel Power View feature to analyze the data.
ABOUT REVIEWS ON “RESTAURANTS”
5 IMPORTANT DIMENSIONS
• Food
• Service
• Ambience
• Deals/Discounts
• Quality-Price Ratio
RAW DATA
• yelp_academic_dataset_review.json
• yelp_academic_dataset_business.json
A review can be associated with multiple dimensions (categories) at the same time.
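Because one review can belong to several dimensions at once, the assignment is a multi-label problem. A minimal sketch, assuming a simple keyword match: the slides name the five dimensions but not how reviews are mapped to them, so the keyword lists and the function name `dimensions_of` below are hypothetical.

```python
import re

# Hypothetical keyword lists; only the dimension names come from the slides.
DIMENSION_KEYWORDS = {
    "Food": {"food", "dish", "taste", "menu", "delicious"},
    "Service": {"service", "staff", "waiter", "waitress"},
    "Ambience": {"ambience", "atmosphere", "decor", "music"},
    "Deals/Discounts": {"deal", "discount", "coupon"},
    "Quality-Price Ratio": {"price", "value", "expensive", "cheap", "worth"},
}


def dimensions_of(review_text):
    """Return every dimension whose keywords occur in the review (possibly several)."""
    words = set(re.findall(r"[a-z']+", review_text.lower()))
    return [name for name, keywords in DIMENSION_KEYWORDS.items() if words & keywords]


print(dimensions_of("Delicious food, but the staff was slow and it felt expensive."))
```

The sample sentence is tagged with three dimensions (Food, Service, Quality-Price Ratio), which is the situation the slide describes.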
DATA PREPARATION FOR DATA MINING
• All reviews
• Total reviews on “Restaurants”
• Reduced number of reviews on “Restaurants”, using (review.useful > 3 AND review.cool > 2 AND review.stars > 3 AND business.review_count > 5) as filter conditions
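The filter above can be sketched in Python over line-delimited JSON. This is an illustration, not the Pig/Hive job used in the project. It assumes the vote counts are available as flat `useful` and `cool` fields, as in the review schema on the slides; the helper names and the sample records are hypothetical.

```python
import json


def load_json_lines(lines):
    """Parse one JSON object per line, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def select_restaurant_reviews(reviews, businesses):
    # Business-level conditions: category "Restaurants" and review_count > 5.
    # The membership test works whether categories is a list or a plain string.
    restaurants = {
        b["business_id"]
        for b in businesses
        if "Restaurants" in b["categories"] and b["review_count"] > 5
    }
    # Review-level conditions: useful > 3 AND cool > 2 AND stars > 3.
    return [
        r for r in reviews
        if r["business_id"] in restaurants
        and r["useful"] > 3
        and r["cool"] > 2
        and r["stars"] > 3
    ]


# Made-up sample records, not taken from the dataset.
business_lines = [
    '{"business_id": "b1", "categories": ["Restaurants", "Pizza"], "review_count": 40}',
    '{"business_id": "b2", "categories": ["Doctors"], "review_count": 12}',
]
review_lines = [
    '{"review_id": "r1", "business_id": "b1", "stars": 5, "useful": 6, "cool": 4}',
    '{"review_id": "r2", "business_id": "b1", "stars": 4, "useful": 1, "cool": 0}',
    '{"review_id": "r3", "business_id": "b2", "stars": 5, "useful": 9, "cool": 7}',
]
selected = select_restaurant_reviews(
    load_json_lines(review_lines), load_json_lines(business_lines)
)
print([r["review_id"] for r in selected])
```

Only `r1` passes: `r2` has too few votes and `r3` belongs to a business that is not a restaurant.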
Record type   All businesses   All restaurants   Filtered restaurants
Review        1’127’525        706’290           22’584
Business      42’153
User          252’898
Tip           403’210
Checkin       31’617
Filtered restaurants: r.useful > 3, r.cool > 2, r.stars > 3, b.review_count > 5
review
---------------------------------
funny: int
useful: int
cool: int
user_id: string
review_id: string
stars: int
text: string
date: string
type: string
business_id: string
business
---------------------------------
attributes: string
business_id: string
full_address: string
open: boolean
hours: string
categories: string
city: string
review_count: int
name: string
neighborhoods: string
longitude: float
state: string
stars: float
latitude: float
type: string
user
---------------------------------
yelping_since: string
votes: string, e.g. {funny: 1, useful: 5, cool: 0}
name: string
review_count: int
user_id: string
friends: string
fans: int
average_stars: float
type: string
compliments: string
elite: string
JAVA IMPLEMENTATION OF THE NLTK IN HADOOP
THE STANFORD NLP GROUP
Retrieving the parts of speech (verbs, nouns, adjectives, etc.) from each sentence using the Stanford NLP parser.
Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff. It seems that his staff simply never answers the phone. It usually takes 2 hours of repeated calling to get an answer. Who has time for that or wants to deal with it? I have run into this problem with many other doctors and I just don't get it. You have office workers, you have patients with medical needs, why isn't anyone answering the phone? It's incomprehensible and not work the aggravation. It's with regret that I feel that I have to give Dr. Goldberg 2 stars.
Unfortunately, frustration Dr. Goldberg's patient repeat experience I've doctors NYC -- good doctor, terrible staff. It staff simply answers phone. It takes 2 hours repeated calling answer. Who time deal it? run problem doctors it. You office workers, patients medical needs, answering phone? It's incomprehensible work aggravation. It's regret feel give Dr. Goldberg 2 stars.
((Unfortunately,RB),(frustration,NN),(being,VB),(Goldberg,NNP),(patient,NN),(repeat,NN),(experience,NN),('ve,VB),(had,VB),(so,RB),(many,JJ),(other,JJ),(doctors,NN),(NYC,NNP),(good,JJ),(doctor,NN),(terrible,JJ),(staff,NN),(seems,VB),(staff,NN),(simply,RB),(never,RB),(answers,VB),(phone,NN),(usually,RB),(takes,VB),(hours,NN),(repeated,VB),(calling,VB),(get,VB),(answer,NN),(time,NN),(wants,VB),(deal,VB),(have,VB),(run,VB),(problem,NN),(many,JJ),(other,JJ),(doctors,NN),(just,RB),(do,VB),(n't,RB),(get,VB),(have,VB),(office,NN),(workers,NN),(have,VB),(patients,NN),(medical,JJ),(needs,NN),(n't,RB),(anyone,NN),(answering,VB),(phone,NN),('s,VB),(incomprehensible,NN),(not,RB),(work,VB),(aggravation,NN),('s,VB),(regret,NN),(feel,VB),(have,VB),(give,VB),(Goldberg,NNP),(stars,NN))
{(Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.),(It seems that his staff simply never answers the phone.),(It usually takes 2 hours of repeated calling to get an answer.),(Who has time for that or wants to deal with it?),(I have run into this problem with many other doctors and I just don't get it.),(You have office workers, you have patients with medical needs, why isn't anyone answering the phone?),(It's incomprehensible and not work the aggravation.),(It's with regret that I feel that I have to give Dr. Goldberg 2 stars.)}
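The four blocks above appear to show, in order, a raw review, the same review with stop words removed, the tagged tokens, and the review split into sentences. The stop-word step can be sketched as follows; this is a minimal illustration in Python, not the Stanford NLP pipeline, and the stop-word list and function name are hypothetical because the slides do not give the list that was used.

```python
# Hypothetical, deliberately tiny stop-word list; the real list is not given in the slides.
STOP_WORDS = {"the", "of", "that", "his", "to", "an", "a", "is", "usually", "get"}


def remove_stop_words(sentence):
    """Drop tokens whose punctuation-stripped, lower-cased form is a stop word."""
    kept = []
    for token in sentence.split():
        core = token.strip(".,!?;:").lower()
        if core not in STOP_WORDS:
            kept.append(token)  # keep the original token, punctuation included
    return " ".join(kept)


print(remove_stop_words("It usually takes 2 hours of repeated calling to get an answer."))
```

With this toy list the third sentence of the review reduces to the same text as in the stop-word-free version above. Sentence splitting and tagging need a real tokenizer, since a naive split on full stops would break on abbreviations such as "Dr.".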
SENTIWORDNET
A lexical resource for opinion mining
• Retrieve the parts of speech (verbs, nouns, adjectives, etc.) from each sentence using the Stanford NLP parser.
• Use SentiWordNet to look up the positive and negative values for each tagged word.
• Sum the positive and negative values to obtain a net positive and a net negative value for the sentence.
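The summing step can be sketched as below. To stay self-contained, the sketch uses a small in-memory lexicon instead of the SentiWordNet files: the scores in `TOY_LEXICON` are made up for illustration and are not real SentiWordNet values, and the function name `net_scores` is hypothetical. The tagged tokens are taken from the tagged example review.

```python
# Made-up scores for illustration only; these are NOT real SentiWordNet values.
# Key: (word, part-of-speech tag) -> (positive score, negative score)
TOY_LEXICON = {
    ("good", "JJ"): (0.75, 0.0),
    ("terrible", "JJ"): (0.0, 0.625),
    ("doctor", "NN"): (0.0, 0.0),
    ("staff", "NN"): (0.0, 0.0),
}


def net_scores(tagged_tokens, lexicon=TOY_LEXICON):
    """Sum positive and negative scores over the tagged tokens of one sentence."""
    net_positive = 0.0
    net_negative = 0.0
    for word, tag in tagged_tokens:
        # Words missing from the lexicon contribute nothing to either total.
        positive, negative = lexicon.get((word.lower(), tag), (0.0, 0.0))
        net_positive += positive
        net_negative += negative
    return net_positive, net_negative


tagged = [("good", "JJ"), ("doctor", "NN"), ("terrible", "JJ"), ("staff", "NN")]
print(net_scores(tagged))
```

Keeping the two totals separate, rather than collapsing them into one number, preserves mixed sentences such as "good doctor, terrible staff", where both totals are non-zero.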