Page 1: Spark Tutorial for Text Analysis - Cleveland State Universitycis.csuohio.edu/~sschung/cis612/SparkTutorial_IIforText... · 2019-04-22 · Spark Tutorial for Text Analysis Sunnie Ching

Spark Tutorial for Text Analysis

Sunnie Ching

CIS612 Big Data and Parallel Data Processing

Aspect Based Opinion Mining of User-Product Reviews:

The following data set was used for this experiment, taken from the website:

https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

We downloaded the additional customer review datasets used by Ding, Liu and Yu, WSDM-2008.

This contained a dataset of roughly 1000 iPod reviews. The file was annotated with brackets and hashtag signs for NLP processing. These were manually removed.
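The manual cleanup described above could also be scripted. The following is a minimal sketch, assuming the annotations are bracketed tags such as [t] or [+2] and runs of # characters; the exact annotation layout of the file and the helper name strip_annotations are assumptions, not taken from the tutorial.

```python
import re

# Assumption: annotations are bracketed tags (e.g. "[t]", "[+2]") and '#' signs.
BRACKETED = re.compile(r"\[[^\]]*\]")


def strip_annotations(line):
    """Remove bracketed tags and hash signs, then collapse whitespace."""
    text = BRACKETED.sub(" ", line)
    text = text.replace("#", " ")
    return " ".join(text.split())


# Hypothetical annotated lines, written for illustration only.
annotated = [
    "battery[+2]##the battery lasts all day",
    "[t] great little player",
]
cleaned = [strip_annotations(line) for line in annotated]
```

Note that this literal version keeps any label text that precedes the hash signs; whether such labels should be kept or dropped depends on how the file is actually laid out.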

Using a semi-supervised approach, a set of important product features is fed into the model before the LDA clustering technique is applied to obtain the actual term weights. To collect these features, a script, scrap.py, was written in Python to scrape the Classic iPod review site at Amazon.com:

This takes the URL:

https://www.amazon.com/Apple-MC297LL-Generation-Discontinued-Manufacturer/dp/B001F7AHOG/ref=sr_1_1?s=mp3&ie=UTF8&qid=1492750330&sr=1-1

and creates three variables to hold the product features located in different areas of the amazon.com iPod review page, scraping them by their XPath locations.

Finally, a CSV file containing the rows of scraped product features is created in the directory of the Python script.
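The scraping step can be sketched roughly as follows. This is not the scrap.py script itself: to stay self-contained it parses a small hard-coded, well-formed page with the standard library instead of fetching the Amazon URL, and the element ids, the three XPath expressions, and the function names are all hypothetical. A real page would likely need an HTML-tolerant parser and full XPath support.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for part of a product page (not real Amazon markup).
SAMPLE_PAGE = """
<html><body>
  <span id="productTitle">Example MP3 Player 160 GB</span>
  <div id="feature-bullets"><ul>
    <li>Holds up to 40,000 songs</li>
    <li>Up to 36 hours of battery life</li>
  </ul></div>
  <div id="productDescription"><p>Slim case with a bright display.</p></div>
</body></html>
"""

# Three hypothetical XPath locations, one per area of the page.
XPATHS = {
    "title": ".//span[@id='productTitle']",
    "bullets": ".//div[@id='feature-bullets']/ul/li",
    "description": ".//div[@id='productDescription']/p",
}


def scrape_features(page_source):
    """Return (area, text) rows for every node matched by the XPaths."""
    root = ET.fromstring(page_source.strip())
    rows = []
    for area, xpath in XPATHS.items():
        for node in root.findall(xpath):
            text = "".join(node.itertext()).strip()
            if text:
                rows.append((area, text))
    return rows


def write_csv(rows, handle):
    """Write a header line followed by one CSV row per scraped feature."""
    writer = csv.writer(handle)
    writer.writerow(["area", "text"])
    writer.writerows(rows)


rows = scrape_features(SAMPLE_PAGE)
buffer = io.StringIO()  # an open file next to the script would work the same way
write_csv(rows, buffer)
```

Passing a handle from open("features.csv", "w", newline="") instead of the in-memory buffer would write the file to disk; the file name here is only an example.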

We manually extracted 7 individual features, based on words that also appear in the IpodReviews.txt file downloaded from Professor Bing Liu’s website.

Keeping with the semi-supervised method, the data set of reviews was initially filtered using the feature words obtained by this scraping method. Below is an example of the methodology in Apache Spark. The RDD reviews is filtered by comparing each review to each product aspect feature in the list: if a review contains the aspect term, it is kept for clustering; if not, the review is removed from the data set.
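The filtering rule can be illustrated with a small predicate. This is a sketch under assumptions: the aspect list below is illustrative (the tutorial does not list its 7 features), the sample reviews are invented, and the code runs on a plain Python list so that it needs no Spark installation.

```python
import re

# Illustrative aspect terms; the actual 7 features are not listed in the text.
ASPECT_TERMS = ["battery", "case", "sound", "features"]


def tokenize(review):
    """Lowercase the review and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", review.lower())


def mentions_aspect(review, aspect_terms=ASPECT_TERMS):
    """True if the review contains at least one aspect term as a whole word."""
    tokens = set(tokenize(review))
    return any(term in tokens for term in aspect_terms)


# Invented sample reviews.
sample_reviews = [
    "The battery lasts for days",
    "Arrived late and the box was damaged",
    "Great sound, sturdy case",
]
kept = [review for review in sample_reviews if mentions_aspect(review)]

# On a Spark RDD of review strings, the same predicate could be passed to
# RDD.filter, e.g. reviews.filter(mentions_aspect).
```

Matching on whole tokens rather than substrings avoids keeping a review only because a word such as "showcase" happens to contain "case".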

The reviews are once again preprocessed to remove non-alphanumeric characters, preparing them for sentiment analysis on important term features.
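One way to express this preprocessing step is a regular-expression substitution. The sketch below assumes that whitespace is kept as a word separator and that each removed character is replaced by a space; the function name and the sample sentence are illustrative.

```python
import re

# Anything that is not an ASCII letter, a digit, or whitespace.
NON_ALNUM = re.compile(r"[^A-Za-z0-9\s]")


def remove_non_alphanumeric(review):
    """Replace non-alphanumeric characters with spaces and collapse the gaps."""
    return " ".join(NON_ALNUM.sub(" ", review).split())


cleaned = remove_non_alphanumeric("Great sound!!! (but the case... meh)")
# On an RDD this could be applied with reviews.map(remove_non_alphanumeric).
```

Replacing with a space rather than deleting outright keeps words such as "sound,case" from being fused together.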

Output for k=1, top features: battery, case, sound, features

Page 7: Spark Tutorial for Text Analysis - Cleveland State Universitycis.csuohio.edu/~sschung/cis612/SparkTutorial_IIforText... · 2019-04-22 · Spark Tutorial for Text Analysis Sunnie Ching

Output for k=2, top features: battery, sound
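Top features like these are typically read off the term weights that LDA assigns within each topic. The sketch below shows only that ranking step; the vocabulary and weights are invented for illustration and are not the tutorial's results, and top_terms is a hypothetical helper rather than part of Spark.

```python
def top_terms(weights, vocabulary, n):
    """Return the n vocabulary terms with the highest weights."""
    ranked = sorted(zip(vocabulary, weights), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:n]]


# Invented weights for a single topic (NOT output from the tutorial's model).
vocabulary = ["case", "battery", "sound", "features"]
topic_weights = [0.10, 0.40, 0.30, 0.20]

best_two = top_terms(topic_weights, vocabulary, 2)
```

In Spark, LDA models provide a describeTopics method that returns the term indices and weights for each topic, which could feed a ranking like this one.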