web blog miner licence thesis


BLOG MINER

WEB BLOG MINING FOR CLASSIFICATION

OF MOVIE REVIEWS

THESIS

Submitted By:

Onur ENEZ 120045072

Kadir ARDIÇ 120042663

Advisor:

Yrd. Doç. Dr. Arzu Baloğlu

MARMARA UNIVERSITY

FACULTY OF ENGINEERING


Abstract

Blogs are among the newest and most popular ways to express ideas, interests and emotions to the world. As people rely on Internet sources for more of their daily needs and choose to spend more of their lives around computers, online networks, forums and blogs have become the new meeting points. Blogs are therefore an important source of information, but that information is hard to reach automatically. The difficulty comes from the personalized design and the size of the blogosphere: every blog has a different structure, which prevents us from finding the information or related data simply by following links from one blog to another. The approach of this project is to create an analysis framework that uses web mining principles. The aim of the project is to build an opinion mining application that captures people’s opinions and emotions about recent movies from the contents of weblogs.


Abstract

1-Introduction

2-Literature Review

3-Approach

3.1 Overview

3.2 Problem Definition and Goals

3.2.1 Problem Definition

3.2.2 Goals

3.3 Solution

4-Project Development

4.1 Planning Phase

4.1.1 Project Identification

4.1.2. Feasibility Analysis

4.2 Analysis Phase

4.2.1 Requirements Analysis

4.2.2 Modeling process and data

4.3 Designing Phase-System Architecture

4.3.1 Blog Crawler

4.3.2 Sentiment Analyzer

4.3.3 Web User Interface

4.4 Implementation Phase

5. Experiments and Results

5.1 Data

5.2 Experimental Results

5.3 Discussion

5.4 Difficulties Encountered

6. Conclusion

7.References


Figure 1 State of the Blogosphere
Figure 2 Blog Miner Overall Process Model
Figure 3 Blog Crawler Data Flow
Figure 4 Sentiment Analyzer Data Flow
Figure 5 User Interface Data Flow
Figure 6 Crawler Architecture
Figure 7 Sentiment Analyzer Process Model
Figure 8 Blog Miner ER Diagram
Figure 9 Blog Miner Class Diagram
Figure 10 Words Matching Class Diagram
Figure 11 Main Page
Figure 12 Graphs Page
Figure 13 User Interface Process Model


Table 1 SentiWord Data Table
Table 2 Sample Graphs
Table 3 Word Tags
Table 4 Experiment Results

1-Introduction

“Every idea is valuable.” This was the motivation for developing a sentiment analysis engine. The Internet, the world’s biggest library, is fed by every user around the world. People contribute their personal signatures, ideas, moments, knowledge and so on through the Internet. We live in the century of technology, and every simple step of life has moved onto different virtual communication lines.

Sociologists have used many different ways to understand people’s nature, their interests, community aims and preferences, and we are quite sure that the most realistic way to generalize is to look for shared common points. The system has been designed in that manner: it uses the ideas of individual people to define the aims of web communities, which are themselves made up of people, by capturing the ideas they share on the web. The most efficient sources for this are people’s own diaries or books, known as web blogs. Blogs are the most popular way of sharing your world in your own sentences or quotations, yet few studies have tried to use the valuable information contained in them.

With increasing usage of the Internet, blogging and blog pages have grown rapidly, and blog pages are among the most popular ways to express opinions and emotions. According to the blog search engine Technorati [1], by the end of 2008 there were 133 million blogs on the global Internet indexed by Technorati. Figure 1 shows the state of the blogosphere in 2008.

Figure 1 State of the Blogosphere

Figure 1 shows how rapidly the number of blogs is increasing. People are writing their opinions and emotions about almost every topic on blogs. Mining opinions from reviews on web pages, however, is a complex process that requires more than just text mining techniques. The complexity is related to a couple of issues. First, review data has to be crawled from websites, a task in which web spiders or search engines can play an important role. Moreover, it is necessary to separate review data from non-review data. The sentiment classification process can then be conducted [11].

This thesis proposes a system that extracts movie reviews from blogs and classifies these reviews into two groups, positive and negative, either within defined categories or overall. It has also been designed so that it can be extended to other topics in the future. Every component needed for effective sentiment web mining is introduced in detail, together with the reasons for choosing it. The application then summarizes the results for the user in an effective visual form.

2-Literature Review

When we first started to research web blog mining, we did not have a clear idea of what it really involved. It was a new research topic, and the existing work was limited and done mostly by other academic students. We chose some of these papers to indicate the direction in which research and development should focus. Short descriptions of the methodologies and techniques used by previous researchers are given below.

The authors of [4] built a sentiment classification application that uses phrase patterns to classify opinions. In their method, they construct phrase patterns and calculate their sentiment orientation with an unsupervised learning algorithm. In the document classification phase, they add special tags to some words in the text and then match the tags within a sentence against the phrase patterns to obtain the sentiment orientation of the sentence. Finally, they add up the sentiment orientations of all sentences and classify the text according to this sum. This method achieves an accuracy of 86% when used to evaluate sports reviews from some websites.

The authors of [5] built a reputation management application on the WebFountain platform (a platform for very large-scale text analytics applications that allows uniform access to a wide variety of sources). The application enables various analyses of corporate and product reputation and the tracking of market trends. A key component of their reputation management system is the sentiment miner, which extracts the sentiment (or opinions) that people express about a subject such as a company, brand or product name. They designed the sentiment miner with the following challenge in mind: not only the overall opinion about a topic but also the sentiment about individual aspects of the topic is essential information, and document-level sentiment classification fails to detect sentiment about individual aspects. The sentiment miner analyzes grammatical sentence structures and phrases using natural language processing (NLP) techniques. For each occurrence of a known topic spot, it detects the sentiment specifically about that topic. With these characteristics, their NLP-based sentiment mining system achieved high-quality results (∼90% accuracy) on various datasets, including online review articles, general web pages and news articles. Their feature extraction algorithm successfully identified topic-related feature terms from online review articles, enabling sentiment analysis at a finer granularity.

The authors of [6] built a sentiment classification application based on review extraction. Their whole process can be described logically in three phases:

1) Extract the review expressions on specific subjects and attach a sentiment tag and a weight to each expression;

2) Calculate the sentiment indicator of each tag by accumulating the weights of all the expressions with the corresponding tag;

3) Given the indicators for the different tags, use a classifier to predict the sentiment label of the text.

They used some online documents to test the performance of their application. The experimental documents cover two domains, politics and religion. The experiments within those domains achieve accuracy between 85% and 95%.
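The three phases described for [6] can be sketched roughly as follows. This is only an illustration in Python, not the implementation from [6]: the tag names, the weights and the simple comparison rule that stands in for the classifier are all assumptions made for the example.

```python
from collections import defaultdict


def tag_indicators(expressions):
    """Phase 2: sum the weights of all expressions that share a sentiment tag."""
    totals = defaultdict(float)
    for tag, weight in expressions:
        totals[tag] += weight
    return dict(totals)


def predict_label(indicators):
    """Phase 3 stand-in: compare the indicators instead of using a trained classifier."""
    positive = indicators.get("positive", 0.0)
    negative = indicators.get("negative", 0.0)
    return "positive" if positive >= negative else "negative"


# Phase 1 output is assumed here: (tag, weight) pairs extracted from one text.
extracted = [("positive", 0.8), ("negative", 0.3), ("positive", 0.5)]
indicators = tag_indicators(extracted)
label = predict_label(indicators)
```

A real system would replace `predict_label` with a classifier trained on labeled texts, using the per-tag indicators as its input features.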

In [7], opinion mining is applied to help e-learning systems learn the users’ opinions about the courseware, the teachers, the charges and other aspects of the e-learning system, and to help the developers improve their services. The authors developed an opinion mining system for e-learning reviews. The goal of this system is to extract and summarize opinions and reviews, and to determine whether they are positive or negative and how strong they are. They divided the whole task into four subtasks:

1) Expression identification

2) Opinion determination

3) Content-value pair identification

4) Sentiment analysis

The precision achieved for these subtasks is 94%, 84.2%, 80.9% and 92.6%, respectively.

The authors of [8] developed a unified collocation framework for opinion mining and propose a novel unified collocation-driven opinion mining method. They compared this method with the attribute-driven method, the sentiment-driven method and the general collocation-driven method, and found that the unified collocation-driven method exhibits reasonable generalization ability. As shown by the experimental results, recall in opinion extraction improves by 0.245 on average without obvious loss in opinion extraction precision or sentiment analysis accuracy. The unified collocation-driven method incorporates attribute-sentiment collocations as well as their syntactic features to achieve this generalization ability.


The authors of [9] built a sentiment mining and retrieval system called AMAZING. They introduce a ranking mechanism that differs from that of a general web search engine, since it uses the quality of each review rather than link structures to generate review authorities. The most important part of this system is that it incorporates temporal dimension information into the ranking mechanism and uses temporal opinion quality and relevance to rank review sentences. The system monitors how customer reviews change over time and visualizes the trends of positive and negative opinions separately. It also generates a visual comparison between the positive and negative evaluations of a particular feature in which potential customers are interested. The authors conducted experiments with the system using customer reviews of four kinds of electronic products: 20 digital cameras, 20 cell phones, 20 laptops and 20 MP3 players. They achieved a precision of approximately 85%.

In [12], a multi-knowledge based approach is proposed that integrates WordNet [13], statistical analysis and movie knowledge. WordNet is a large lexical database of English, developed under the direction of George A. Miller, in which nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. The authors decompose the problem of review mining and summarization into the following subtasks:

1) Identifying feature words and opinion words in a sentence;

2) Determining the class of each feature word and the polarity of each opinion word;

3) For each feature word, first identifying the relevant opinion word(s), and then obtaining some valid feature-opinion pairs;

4) Producing a summary using the discovered information.

WordNet, movie casts and labeled training data were used to generate a keyword list for finding features and opinions. Grammatical rules between feature words and opinion words were then applied to identify the valid feature-opinion pairs. Finally, the sentences were reorganized according to the extracted feature-opinion pairs to generate the summary. The objective of their work is to automatically generate a feature class-based summary for arbitrary online movie reviews. Experimental results show that their method works with an average precision of approximately 65%. In addition, with their approach it is easy to generate a summary with movie-related people’s names as sub-headlines, which probably interests many movie fans.

The work done in this project is most similar to the work in [12]. One important difference is that this project aims to calculate the sentiment orientation of movie reviews from blogs. All of the works reviewed operate on a fixed dataset, whereas this project’s review dataset is crawled from blogs and then processed to calculate movie scores. The method is discussed in detail in the next section.


3-Approach

3.1 Overview

This section briefly defines the techniques and goals of the project, what it aims to achieve, and the methods applied during its development. The project is separated into three phases. The first is the crawling phase, in which data is gathered from web blogs or portals. The second phase parses, analyzes and processes that data into information. The last phase presents and visualizes the analysis results, compares them with existing results and tests the accuracy of the work. More details of the technical and architectural work are explained in the system architecture part.

3.2 Problem Definition and Goals

3.2.1 Problem Definition

Web blogs and portals are full of unindexed and unprocessed text that is a rich source for analysis, because it gives direct access to people’s ideas. There is a need to collect and process that data and let people use it in their decision-making processes. Many people certainly act on the common opinion about a subject, for example by buying the camera that most people claim is the best among the options. In the same manner, we focused on creating a blog mining system that takes movie comments from blogs or portals and tells the user what most people think about the movie and its related sub-units, from the director to the screenwriter.

3.2.2 Goals

• Gather the right data from the right sources to process.

• Process the data into information using well-defined word libraries and well-defined procedures that analyze it and turn it into meaningful results.

• Present the results in a clear way that the user can use for:

◦ Saving time,

◦ Learning about community agreement on a topic.

• Produce a model that can serve as a base example for later research on this topic.

3.3 Solution

The first problem to address is parsing the data that exists in texts on blogs or web portals where people discuss their ideas, comment or criticize. Part of this problem is collecting that data with an automated system in the way a human reader would: defining stop points and specializing on factors that set the boundaries, so that the crawl does not drift off the subject. The second problem, once the data has been collected, is analyzing it. Again it is necessary to define subject-specific algorithms that pick out the subjective pieces to use in scoring or sketching information. The last part of the work is to build a presentation environment in which the end user can review the work and see the project’s accuracy; we handle this by building a web site and graphical interfaces. Figure 2 shows the overall process model of our application.

Figure 2 Blog Miner Overall Process Model

The basic principles and working mechanism of our system are introduced below.

Crawling the blogs for movie reviews: OpenWebSpider and Arachnode have been used for crawling the blogs and collecting data for sentiment analysis. A Web crawler (also known as a Web spider or Web robot) is a program or automated script that browses the World Wide Web in a methodical, automated manner. Other less frequently used names for Web crawlers are ants, automatic indexers, bots, and worms. This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. In addition, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam) or, as in our case, gathering text content.

A Web crawler is one type of robot, or software agent. In general, it starts with a list of

URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks

in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the

frontier are recursively visited according to a set of policies.
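The seed-and-frontier loop described above can be sketched as follows. This is a minimal illustration in Python rather than the crawler used in the project: the breadth-first visiting policy, the `max_pages` limit, the `get_links` callback and the small in-memory link graph are all assumptions chosen so that the example runs without network access.

```python
from collections import deque


def crawl(seeds, get_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs."""
    frontier = deque(seeds)   # URLs waiting to be visited (the crawl frontier)
    seen = set(seeds)         # URLs already queued, so no page is queued twice
    visited = []              # order in which pages were visited
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):   # hyperlinks identified on the page
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical link graph standing in for real blog pages.
LINKS = {
    "seed": ["blog-a", "blog-b"],
    "blog-a": ["blog-b", "blog-c"],
    "blog-b": ["seed"],
    "blog-c": [],
}

order = crawl(["seed"], lambda url: LINKS.get(url, []))
```

In a real crawler, `get_links` would download the page and extract its hyperlinks, and the visiting policy would also decide which links stay on topic and how deep to follow them.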

Sentiment analysis of blogs: Sentiment analysis has three main tasks: determining subjectivity, determining sentiment orientation and determining the strength of the sentiment orientation. Sentiment analysis can be done in two different ways:

• Using an unsupervised approach.

• Using a supervised machine learning approach.

This application uses the unsupervised approach. OPEN-NLP is used to find the types of words. A keyword database contains words specific to the movie domain. The text to be analyzed is searched for these keywords; if a keyword is found, its score is calculated by identifying whether it has adjectives or adverbs associated with it. Below, this algorithm is referred to as the keyword algorithm. A second algorithm has also been defined that looks at all the words in the related sentences and calculates a general score for a movie. It is referred to as the all words algorithm.
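The idea behind the keyword algorithm can be illustrated with a short sketch. It is written in Python for brevity and is not the project's implementation: the keyword set and the word scores are invented for the example, and a small hand-made dictionary of scored adjectives and adverbs stands in for both the part-of-speech tagging and the word database.

```python
import re

# Hypothetical data: a few movie-domain keywords and scored adjectives/adverbs.
KEYWORDS = {"director", "acting", "plot"}
WORD_SCORES = {"brilliant": 1.0, "great": 0.75, "boring": -0.75, "badly": -0.5}


def split_sentences(text):
    """Split text into sentences on '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def keyword_scores(text):
    """Score each keyword by the scored words that share its sentence."""
    scores = {}
    for sentence in split_sentences(text):
        words = re.findall(r"[a-z']+", sentence.lower())
        sentence_score = sum(WORD_SCORES.get(w, 0.0) for w in words)
        for keyword in KEYWORDS.intersection(words):
            scores[keyword] = scores.get(keyword, 0.0) + sentence_score
    return scores


review = "The director did a brilliant job. The plot was boring. I ate popcorn."
result = keyword_scores(review)
```

Sentences without a keyword, such as the last one in the sample review, contribute nothing here. Summing `sentence_score` over every related sentence, whether or not it contains a keyword, would give a rough analogue of the all words algorithm.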

Generating visual results: ZedGraph has been used to visualize the findings. ZedGraph is a class library, Windows Forms user control and ASP web-accessible control for creating 2D line, bar and pie graphs of arbitrary datasets. ZedGraph is maintained as an open-source development project. The results are presented on the project web site through a shared database.

4-Project Development

4.1 Planning Phase

In the planning phase, a high-level view of the intended project was established and its goals were determined. The Waterfall methodology was selected for developing this project, and the project was divided into four phases according to this methodology. The project started with the planning phase, followed in order by the analysis phase, the design phase and the implementation phase.

4.1.1 Project Identification

User Need

The user needs analyzed information, produced by the system’s sentiment analysis, to use for comparison or as help in decision making.

Business Requirements

Today, business is increasingly carried out on web platforms, and every person is also a virtual customer or at least a participant. The ideas they share on the web should be used to help firms and business owners understand their customers better. Knowing what people are looking for, they can change the direction of their production and portfolio.

Business Value

Produce a model that can serve as a base example for later research on this topic.

4.1.2. Feasibility Analysis

Technical Feasibility

Web content mining and data mining are new terms. They are not as common as subjects such as e-commerce sites, for which plenty of material and examples can be found, but the need for them and their popularity are increasing. The development was therefore carried out with limited help from documentation and from earlier research by people working on the same topic.

The chosen development environment is C#.NET and ASP.NET, with the work done in Visual Studio 2008. MS-SQL Server 2005 was chosen for database management. Ajax web controls are also used in our web interface. The graph library used is ZedGraph, an open-source library that is also used by Wikipedia for most of its charts. It works in both Windows Forms applications and web applications.

OPEN-NLP is used to help with natural language processing when the texts are parsed. The SentiWord.Net database is used as the word database. The Porter Stemmer is used to find the stem of a word, and the NetSpell [18] library is used to correct misspelled words. For the rest, the project defines its own classes and algorithms to operate on text.

Economic Feasibility

An economic feasibility analysis is not required for this project, because no investment is necessary for the developers or for the project development tools.

4.2 Analysis Phase

4.2.1 Requirements Analysis

Functional Requirements

The basic function of the system is to process human opinions using web mining techniques, so most of the functionality runs in background code, and the user’s only action is to browse already processed data. The functions necessary to accomplish the mission of the project can be examined from the administrator’s side of the system.

FR0: Users can view all analyzed movies one by one, with all topics included.

FR1: Users can choose movies to compare with each other on a specific topic or overall.

FR2: Users shall be able to use the feedback screens to request an analysis of movies of their choice.

FR3: Developers can use our sentiment algorithms as packages for text processing in their own analyses.

FR4: Users can test our system against results taken from IMDb to see the accuracy of the system.

FR5: We can crawl any site to any depth we wish and specify our topics or web site limits using the crawler interfaces.

FR6: We shall be able to modify the content of the sentiment analysis to analyze different topics.

FR7: Developers can use the existing system as a template and modify the code base for their own analysis interests.

Non-functional Requirements

NF0: Graphical information must be clear and easy for web site users to understand.

NF1: The response times of user searches must be short.

NF2: The accuracy of the returned results must be high.

NF3: Comparison options must be logical.

NF4: The hardware that hosts the crawler should be high performance, because the crawler performs fast transactions and stores large amounts of data.

NF5: Worker threads should be used for crawling to gain performance and parallelism. Otherwise, crawling large and multiple sites for analysis is harder.

NF6: The platform on which the application is set up must have the .NET 3.5 framework and MS-SQL Server 2005 installed.

NF7: Users need an environment that can browse ASP.NET pages. Any basic web browser provides this ability.

NF8: An implementation environment should be set up for developers. Everything in the list below is necessary during the implementation phase.

• Visual Studio 2008

• MS-SQL 2005

• Windows XP

• Computer

• Internet connection

NF9: The system should be highly maintainable, because it should be customizable and easy to change for another topic.

4.2.2 Modeling process and data

In this phase, data flows between components of the system were determined. Then they

were modeled with Data Flow Diagrams. Figure 3 shows the data flow of the blog crawler.

Figure 4 shows the data flow of the sentiment analyzer and figure 5 shows the data flow in

the web user interface.

10

4.2.2 Modeling process and data


modeled with Data Flow Diagrams. Figure 3 shows the data flow of the blog crawler.


Figure 3 Blog Crawler Data Flow


modeled with Data Flow Diagrams. Figure 3 shows the data flow of the blog crawler.


Figure 4 Sentiment An

Figure 5 User Interface Data Flow

4.3 Designing Phase-System Architecture

4.3.1 Blog Crawler

One of the most important parts of the application is the Blog Crawler. The crawler has a very heavy workload, because as much data as possible must be analyzed to reach accurate results. If the analysis is not done with enough data, the results will show the opinions of only a restricted group of people, whereas the goal is to calculate the general opinion about a movie. The crawler therefore has to crawl as many blogs as it can, but there are hardware restrictions in this matter. The blogosphere contains a huge amount of data, while the storage capacity is limited, and crawling the whole blogosphere would require a very fast computer with a large amount of memory, so only a part of the blogosphere is crawled. It is our hypothesis that when the hardware specifications are improved and the crawled part of the blogosphere is increased, the application will produce better results.

Arachnode.Net is used for crawling the blogs. Arachnode.Net is an open source Web crawler for downloading, indexing and storing Internet content including e-mail addresses, files, hyperlinks, images, and Web pages. It is written in C# using SQL Server 2005 and uses the Lucene.Net library for indexing and searching. Arachnode.Net was selected because it is very customizable and well written; it is also written in C#, which makes customization and integration easier. Arachnode.Net was customized for crawling blogs, and the crawler was started with seeds such as www.blogpulse.com and www.technorati.com, because these web sites contain a lot of links to blogs, which improves the crawling performance. Figure 3 shows the main working process of the crawler.
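The seeded, budget-limited crawl described above can be illustrated with a minimal sketch. It is written in Python for brevity and does not use the Arachnode.Net API; the function name `crawl`, the `get_links` callback, the page budget and the toy link graph with its `blog-*.example` hosts are all assumptions made for illustration. Only the two seed sites come from the text.

```python
from collections import deque

def crawl(seeds, get_links, max_pages):
    """Breadth-first crawl that starts from seed pages and stops at a page budget."""
    frontier = deque(seeds)   # pages waiting to be visited
    seen = set(seeds)         # pages already queued, to avoid revisiting
    visited = []              # pages crawled, in crawl order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Hypothetical in-memory link graph standing in for real HTTP requests.
TOY_WEB = {
    "www.blogpulse.com": ["blog-a.example", "blog-b.example"],
    "www.technorati.com": ["blog-b.example", "blog-c.example"],
    "blog-a.example": ["blog-d.example"],
}

SEEDS = ["www.blogpulse.com", "www.technorati.com"]
pages = crawl(SEEDS, lambda url: TOY_WEB.get(url, []), max_pages=4)
```

Seeding with link-rich directory sites means the frontier fills with blog addresses quickly, while the page budget plays the role of the storage and hardware limits discussed above.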

Figure 6 Crawler Architecture

4.3.2 Sentiment Analyzer

The sentiment analyzer is the core of the application. In this part, scores for a movie are calculated from the comments about that movie. To calculate the scores, the blogs that contain comments about a specific movie are selected first, and then the text of each web page is parsed into sentences for sentence-level calculation. In the first algorithm, every sentence is searched for keywords from a list created for the movie domain; if a keyword is found in the sentence, the adjectives modifying that keyword are looked up. SentiWordNet [14] is used for the sentiment scores of the words. SentiWordNet is a lexical resource in which each WordNet [13] synset s is associated with three numerical scores Obj(s), Pos(s) and Neg(s), describing how objective, positive, and negative the terms contained in the synset are. Table 1 shows some adjectives and their scores according to SentiWordNet. After the modifying adjectives are found, the adverbs modifying these adjectives are looked up. These adverbs are separated into two categories: degree adverbs and reversing adverbs. If a degree adverb such as “less” or “more” is found for the adjective, the adjective’s score is multiplied by the degree adverb’s score and the result is used as the keyword’s score. If a reversing adverb such as “not” is found for the adjective, the score of the adjective is simply reversed and used as the keyword’s score. The scores of all keywords are calculated for every related blog page and then averaged. The keywords are grouped into categories such as “Screen Play”, “Director” and “Producer”; there are 9 such categories for the movie domain. After the score calculation for all keywords is completed, the scores of these 9 categories are calculated according to the categories of the keywords. In the second algorithm, the score of every word in every related sentence is looked up in the SentiWord database, and the average of these word scores gives the general movie score.

Table 1 SentiWord Data Table

People’s opinions and comments in their blogs may contain spelling errors, and these errors decrease the accuracy of the application. To overcome this problem, NetSpell [18] is used as a spell checker library in the score calculator method. Only the stems of words are stored in the SentiWord table, so to find the sentiment score of a word it must be searched by its stem; the Porter Stemmer [16] is used to get the stem of a word. There is also a string similarity project called Words Matching, created for keywords that still cannot be found after spell checking and stemming. This project calculates the similarity of two strings and returns a value between 0 and 1; if the similarity score is greater than 0.8, the strings are assumed to be equal. These text and word modifications improve the application’s accuracy. Figure 7 shows the process model of the sentiment analyzer.

Figure 7 Sentiment Analyzer Process Model
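The way the keyword algorithm combines an adjective with a degree or reversing adverb, and then averages keyword scores per category, can be sketched as follows. This is an illustrative Python sketch, not the project's C# code. The adjective scores are the ones that appear in the sample review in Section 5.2; the degree adverb multipliers and the function names are assumptions, since the thesis does not list them.

```python
# Adjective scores as they appear in the sample review of Section 5.2.
ADJECTIVE_SCORES = {"good": 0.844, "great": 0.344}
# Assumed multipliers for degree adverbs (hypothetical values).
DEGREE_ADVERBS = {"more": 1.5, "less": 0.5}
REVERSING_ADVERBS = {"not"}

def keyword_score(adjective, adverb=None):
    """Score of a keyword from its modifying adjective and an optional adverb."""
    score = ADJECTIVE_SCORES[adjective]
    if adverb in DEGREE_ADVERBS:
        return score * DEGREE_ADVERBS[adverb]   # degree adverb scales the score
    if adverb in REVERSING_ADVERBS:
        return -score                           # reversing adverb flips the sign
    return score

def category_scores(scored_keywords):
    """Average the keyword scores within each category.

    scored_keywords is a list of (category, score) pairs."""
    by_category = {}
    for category, score in scored_keywords:
        by_category.setdefault(category, []).append(score)
    return {c: sum(v) / len(v) for c, v in by_category.items()}

example = category_scores([
    ("Director", keyword_score("good", "not")),
    ("Director", keyword_score("good")),
    ("Producer", keyword_score("great", "less")),
])
```

In this sketch a keyword without any modifying adverb simply takes the adjective's score, which matches the description above as far as it goes.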

Figure 8 Blog Miner ER Diagram


Figure 8 shows the ER diagram of the application, but this diagram does not include the Arachnode.Net database, which the crawler uses to store the blog pages. The database diagram of Arachnode.Net could not be added because it is very large, but it can be found at [15]. The Movies table stores the score results of each movie investigated. The People table stores information about the people related to the movies, for example the names of actors and actresses, the director, etc. This information is stored to improve the accuracy of the score calculator method by catching all comments about a movie. The SentiWord table stores the sentiment dictionary, which was obtained from SentiWordNet [14]. The Movie Elements table contains the 9 categories for the movie domain, and the Element Alias table contains the keywords for these categories.

Figure 9 Blog Miner Class Diagram

Figure 9 shows the class diagram of the main project. Most of the work in the project is done by the MovieScoreCalculator class, which uses the Porter Stemmer and Spell Checker classes and the Words Matching project to improve efficiency. The Score Calculator class is a test class that calculates the scores of 10 movies from IMDb comments. The crawler also calls the MovieScoreCalculator class when a page is related to a movie, and MovieScoreCalculator calculates the score and updates or creates the movie score.

Figure 10 Words Matching Class Diagram
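The thresholded string matching performed by the Words Matching project can be illustrated with a short sketch. The thesis does not state which similarity measure the project uses, so Python's standard `difflib.SequenceMatcher` is used here as a stand-in; only the 0 to 1 range and the 0.8 threshold come from the text, and the function names are hypothetical.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.8  # threshold stated in Section 4.3.2

def similarity(a, b):
    """Similarity of two strings as a value between 0 and 1 (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_same_word(a, b):
    """Treat two strings as equal when their similarity exceeds the threshold."""
    return similarity(a, b) > SIMILARITY_THRESHOLD

# A misspelled keyword that survived spell checking and stemming.
matched = is_same_word("director", "directer")
unrelated = is_same_word("director", "producer")
```

A different measure, such as an edit-distance ratio, would give different values for the same pair of strings, so the 0.8 threshold would need to be re-checked if the measure were changed.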

4.3.3 Web User Interface

Most of the web blog mining process worked on in this thesis lies behind the visual interface; the work mostly takes place in database processes and in functions. After a long process of gathering data, storing it, cropping it to extract more meaningful data from the raw data, and processing it with the defined parsing and sentiment analysis functions, the results of our work come out as simple numbers, namely the scores of movies in a few data tables. These results are presented to the end user in the simplest and most useful way: as graphical charts in which users can select what they want to display on simple graphs.

A project web site has been developed both to present the project evaluation and to give information about what has been gained throughout this process and, most importantly, to publish the results of the web blog mining sentiment analysis with a basic mechanism.

The web site has five main pages. Three of them present the project and reference materials. One is a comment page, and the last and most important one is the graphs page, where the results of the project are presented.

Web Site Pages

Start Page

Figure 11 Main Page

The main page of the web site is shown in Figure 11. Users can browse between the main pages from the menu at the top. On the right side there are some reference pages and quick launch options. Users are able to reach all documentation from the paper and materials pages. The most important page of the interface is the graphs page, which is explained in detail below.

Graphs Page


Figure 12 Graphs Page

The graphs page is formed of two sections. The first section is the selection part, which has three selection options. The first option is to select the name of an analyzed movie and then click the show all button. This creates a bar graph in which the 9 different analysis results are drawn. The second option is to choose a category from the combo box; here users only specify a selection. The third part is a grid that lists the movies that have been analyzed. Here users can select the movies they want to draw in the graph, or they can choose them all. The graph is then drawn with the specified selection.

On the right side, near the graph, users are able to see the imdb.com score of the movie as a basis of comparison for the accuracy of the system. When a user chooses the show score option, the overall score is displayed together with the IMDb score, taken as the overall rating, for comparison. Note that IMDb takes its score from votes while we analyze the comments, so if a person made a comment but did not cast a vote, this may lead to a deviation from the real result.

The second section of the graphs page consists of Zed Graphs, which are dynamic charts created each time users specify a selection. The next section contains a short summary of Zed Graph and how it is used in this work.

Zed Graphs

Zed Graph [17] is a set of classes, written in C#, for creating 2D line and bar graphs of arbitrary datasets. The classes provide a high degree of flexibility: almost every aspect of the graph can be user-modified. At the same time, usage of the classes is kept simple by providing default values for all of the graph attributes. The classes include code for choosing appropriate scale ranges and step sizes based on the range of data values being plotted.

Zed Graph has two different libraries, which can be used for Windows Forms applications and for web pages, and two different modes that can be used in both. The second option, “image render mode”, has been chosen. It creates a graph, saves it to a folder as a temporary image, and loads it from there if the user re-clicks the same graph, which shortens the graph loading time. The chart creation process for our graphs page is shown in Figure 13.

Figure 13 User Interface Process Model

The process diagram shows how the data is fed to the graph’s data set. The point to take care of here is that the data sent must suit the chosen graph type so that the resulting chart is meaningful. Many graph types can be created with very simple code. Some sample graphs that can be made very easily are shown in Table 2.

Sample Bar Graph, Pie Charts, Line & Symbol Charts

Table 2 Sample Graphs
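The benefit of the image render mode, where a chart is rendered once, stored as a temporary image and reused when the same selection is requested again, can be sketched as a small file cache. This Python sketch does not use the Zed Graph API; `cached_chart`, the `render` callback and the way the file name is derived from the selection are assumptions made for illustration.

```python
import hashlib
import os
import tempfile

def cached_chart(selection, render, cache_dir):
    """Return the path of the chart image for a selection, rendering it only once."""
    # Derive a stable file name from the selection, independent of key order.
    key = hashlib.sha1(repr(sorted(selection.items())).encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key + ".png")
    if not os.path.exists(path):
        # First request for this selection: render and store the image.
        with open(path, "wb") as image_file:
            image_file.write(render(selection))
    return path

# Hypothetical renderer that records how often it is called.
render_calls = []

def fake_render(selection):
    render_calls.append(selection)
    return b"image bytes"

cache_dir = tempfile.mkdtemp()
first = cached_chart({"movie": "Wall-E", "category": "Director"}, fake_render, cache_dir)
second = cached_chart({"category": "Director", "movie": "Wall-E"}, fake_render, cache_dir)
```

Both requests resolve to the same file, so the renderer runs only for the first one; a real deployment would also need a rule for discarding cached images when the underlying scores change.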

4.4 Implementation Phase

In the design phase, the structure of the project was well defined and every step of the implementation was determined. The development language of the project is C# and the development environment is Microsoft Visual Studio 2008. Microsoft SQL Server 2005 was selected for database development. All the classes and projects were implemented as defined in the design phase. A sentiment analyzer project was created as a WPF project and integrated with the Blog Crawler. An ASP.NET web site project was created for publishing the results of the project. To improve implementation performance, Visual Studio’s dataset and table adapter structures are used for database interactions. Object-oriented design rules and structures were used while implementing these projects. Smart Draw 2009 and Visual Paradigm for UML 7.0 Enterprise Edition were used for creating the diagrams.

5. Experiments and Results

5.1 Data

The user reviews of a few movies from IMDb have been used as the data set. These movies were selected from recent releases. The selected movies should be familiar to most movie fans, because this work aims to analyze as many user comments as possible and well-known movies have enough comments for this purpose. According to these criteria, 10 movies from IMDb were selected: The Fast and Furious, Monsters vs. Aliens, State of Play, Knowing, The Dark Knight, Wall-E, Slumdog Millionaire, No Country for Old Men, There Will Be Blood and The Curious Case of Benjamin Button. For each movie, approximately 10 review pages were crawled by the Blog Crawler. This makes approximately 1000 reviews in total. These reviews are used in the experiments to calculate the accuracy of the application.

5.2 Experimental Results

The experiment is presented by following how the blog miner processes the raw data and calculates its results. A sample review has been chosen from IMDb for the blog miner to work on. The keyword algorithm has been used for this example review. The second algorithm works in a similar way, but looks at the score of every word rather than only the scores of the keywords.

Sample Review:

“I thought it wouldn't be as good as it was, because thousands of people and reviews said it

would suck! It was great, but what it missed was that it needed to be at-least an hour longer,

because it missed a-little bit, but it still rocked! I loved it! I thought it was funny, and as did

the person next to me, when John says: "I'll be back!””.


Step 1: Split the text into sentences

In this step the text is split into sentences so that the sentiment analysis can be made at sentence level. The text below is the state of the sample review after step 1.

~1~ I thought it wouldn’t be as good as it was, because thousands of people and reviews said

it would suck! ~1~

~2~ It was great, but what it missed was that it needed to be at-least an hour longer,

because it missed a-little bit, but it still rocked! ~2~

~3~ I loved it! ~3~

~4~ I thought it was funny, and as did the person next to me, when John says: "I'll be

back!””. ~4~
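A minimal sketch of such a sentence splitter is given below. It is written in Python and uses a simple rule, splitting after '.', '!' or '?' when followed by whitespace, which is an assumption; the thesis does not describe the splitter it actually uses. The review text is abridged from the sample above.

```python
import re

# Split after sentence-ending punctuation that is followed by whitespace.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text):
    """Split a review into sentences for sentence-level analysis."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]

review = ("I thought it wouldn't be as good as it was, because thousands of "
          "people and reviews said it would suck! It was great, but it still "
          "rocked! I loved it! I thought it was funny.")
sentences = split_sentences(review)
```

A rule this simple misreads abbreviations such as "Mr." as sentence ends, so a production splitter would need additional handling for those cases.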

Step 2: Tag the words in each sentence by their type

In step 2, appropriate tags are added to the words in order to understand the meanings of the words more accurately. Table 3 shows the tags that have been used and their meanings. The text below is the sample review after step 2.

I/PRP thought/VBD it/PRP would/MD not/RB be/VB as/RB good/JJ as/IN it/PRP was/VBD

,/, because/IN thousands/NNS of/IN people/NNS and/CC reviews/NNS said/VBD it/PRP

would/MD suck/VB !/.

It/PRP was/VBD great/JJ ,/, but/CC what/WP it/PRP missed/VBD was/VBD that/IN it/PRP

needed/VBD to/TO be/VB at-least/JJ an/DT hour/NN longer/RB ,/, because/IN it/PRP

missed/VBD a-little/JJ bit/NN ,/, but/CC it/PRP still/RB rocked/VBD !/.

I/PRP loved/VBD it/PRP !/.

I/PRP thought/VBD it/PRP was/VBD funny/JJ, /, and/CC as/RB did/VBD the/DT person/NN

next/JJ to/TO me/PRP, /, when/WRB John/NNP says/VBZ :/: "/`` I/PRP will/MD be/VB

back/RB !/. ”/NN. /.
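The word/TAG notation used above can be read back into (word, tag) pairs with a few lines of code. The Python sketch below only parses text that has already been tagged; it does not perform the tagging itself, and the function name `parse_tagged` is hypothetical.

```python
def parse_tagged(tagged_sentence):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in tagged_sentence.split():
        # Split on the last slash so that a word containing '/' stays intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

pairs = parse_tagged("I/PRP loved/VBD it/PRP !/.")
adjectives = [w for w, t in parse_tagged("It/PRP was/VBD great/JJ ,/,") if t == "JJ"]
```

Having the tags as data makes the next step straightforward, for example selecting the adjectives (JJ) and adverbs (RB) that the keyword algorithm looks for.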

Table 3 Word Tags

Step 3: Score the text using the full text algorithm or the keyword based algorithm.

I/PRP thought/VBD it/PRP would/MD not/RB<-1> be/VB as/RB good/JJ<0.844> as/IN it/PRP was/VBD ,/, because/IN thousands/NNS of/IN people/NNS and/CC reviews/NNS said/VBD it/PRP would/MD suck/VB !/.

(sentence score = -0.844)

It/PRP was/VBD great/JJ<0.344> ,/, but/CC what/WP it/PRP missed/VBD was/VBD that/IN it/PRP needed/VBD<-0.140625> to/TO be/VB at-least/JJ an/DT hour/NN longer/RB ,/, because/IN it/PRP missed/VBD a-little/JJ bit/NN ,/, but/CC it/PRP still/RB<-0.171> rocked/VBD !/.

(sentence score = 0.0104)

I/PRP loved/VBD<0.375> it/PRP !/.

(sentence score = 0.375)

I/PRP thought/VBD it/PRP was/VBD funny/JJ<-0.515> ,/, and/CC as/RB did/VBD the/DT person/NN next/JJ to/TO me/PRP ,/, when/WRB John/NNP says/VBZ :/: "/`` I/PRP will/MD be/VB back/RB !/. ””/NN ./.

(sentence score = -0.515)
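The sentence scores above can be reproduced approximately from the annotated tokens. The rule used in the Python sketch below is inferred from the numbers shown and is an assumption rather than the project's documented method: a reversing adverb such as "not" flips the sign of the next scored word, and the sentence score is the mean of the remaining word scores.

```python
import re

# Matches an annotated token such as 'good/JJ<0.844>' and captures word and score.
SCORED_TOKEN = re.compile(r"^(.+)/[^/<]+<(-?\d+(?:\.\d+)?)>$")
REVERSING_ADVERBS = {"not"}

def sentence_score(annotated_sentence):
    """Mean of the word scores in a sentence, with reversing adverbs applied."""
    scores = []
    flip_next = False
    for token in annotated_sentence.split():
        match = SCORED_TOKEN.match(token)
        if match is None:
            continue                      # token carries no score
        word, value = match.group(1).lower(), float(match.group(2))
        if word in REVERSING_ADVERBS:
            flip_next = True              # reverse the next scored word
            continue
        scores.append(-value if flip_next else value)
        flip_next = False
    return sum(scores) / len(scores) if scores else None

s1 = sentence_score("it/PRP would/MD not/RB<-1> be/VB as/RB good/JJ<0.844> as/IN")
s2 = sentence_score("great/JJ<0.344> needed/VBD<-0.140625> still/RB<-0.171> rocked/VBD")
s3 = sentence_score("I/PRP loved/VBD<0.375> it/PRP !/.")
```

With this rule the first and third sentences give -0.844 and 0.375, as in the text. The second gives roughly 0.0108 rather than the 0.0104 shown, possibly because the word scores displayed in the text are rounded.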

The application has been tested with the data mentioned above, and the scores of the movies have been calculated with the two different techniques. The results of the experiments are shown in Table 4. The keyword algorithm gives more average scores for each movie’s general score; the all words algorithm gives results closer to the IMDb scores. In the producer and screen writer columns some rows have a score of 5.25; these are default values, because no keywords were found for these movies.

Table 4 Experiment Results

5.3 Discussion

The results of the experiment have been compared with each movie’s IMDb score. The IMDb page of a movie contains only the movie’s general score, so only two comparisons can be made: the first is the IMDb score against the keyword algorithm’s score, and the second is the IMDb score against the all words algorithm’s score. The scores of the other categories cannot be compared, because there is no data available at imdb.com to compare them with.

5.4 Difficulties Encountered

The first difficulty to overcome was learning what web mining is, because it was a new term for us, as it is for many others in computer science. A lot of time was therefore spent on Internet research to learn what has been done and what methodology lies behind the web mining idea, especially regarding how to reach the data on the web. Many useful open source software projects were explored that made the work easier. The open source software created another difficulty, namely integrating those projects into our work. This difficulty was overcome by reading the documentation of the software and analyzing its code.

6. Conclusion

In conclusion, opinion mining in Web 2.0 is very important, and this area is developing day by day, because with Web 2.0 the user-created content of the web has increased enormously and collecting meaningful information from this data has become an important task. In this work, an opinion mining application was created for calculating movie scores from blog posts. The experiment results show that this task is not an easy one. Some of the results are close to the real scores, but some are far from expectations. From this work we learned that the unsupervised approach to sentiment analysis does not give enough accuracy. We have investigated many works on this subject, and we believe that using a supervised approach might produce more accurate results for sentiment analysis.

7. References

[1] Technorati, Inc. http://technorati.com; accessed 20.05.2009.

[2] A Content based Algorithm for Blog Ranking. Jie Shen, Yan Zhu, Hui Zhang, Chen Chen, Rongshuang Sun, and Fayan Xu. Yangzhou University, Jiangsu Province, China. 2008 International Conference on Internet Computing in Science and Engineering, p. 1.

[3] Blog Mining through Opinionated Words. Giuseppe Attardi and Maria Simi. Dipartimento di Informatica, Università di Pisa, p. 1, 2006.

[4] Sentiment Classification Using Phrase Patterns. Zhongchao Fei, Jian Liu, and Gengfeng Wu. Proceedings of the Fourth International Conference on Computer and Information Technology (CIT’04).

[5] Sentiment Mining in WebFountain. Jeonghee Yi and Wayne Niblack. Proceedings of the 21st International Conference on Data Engineering (ICDE 2005).

[6] Super Parsing: Sentiment Classification with Review Extraction. Jian Liu, JianXin Yao, and GengFeng Wu. Proceedings of the Fifth International Conference on Computer and Information Technology (CIT’05).

[7] Opinion Mining in e-Learning System. Dan Song, Hongfei Lin, and Zhihao Yang. International Conference on Network and Parallel Computing (IFIP 2007).

[8] The Unified collocation Framework for Opinion Mining. Yun-Qing Xia, Rui-Feng Xu, Kam-Fai Wong, and Fang Zheng. Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007.

[9] AMAZING: A sentiment mining and retrieval system. Qingliang Miao, Qiudan Li, and Ruwei Dai. Expert Systems with Applications (2008), doi:10.1016/j.eswa.2008.09.035.

[10] Opinion Mining. Bing Liu. Department of Computer Science, University of Illinois at Chicago, 851 S. Morgan Street, Chicago, IL 60607-0753.

[11] Sentiment classification of online reviews to travel destinations by supervised machine learning approaches. Qiang Ye, Ziqiong Zhang, and Rob Law. Expert Systems with Applications (2008), doi:10.1016/j.eswa.2008.07.035.

[12] Movie review mining and summarization. Li Zhuang, Feng Jing, and Xiao-yan Zhu.

[13] WordNet. http://wordnet.princeton.edu; accessed 20.05.2009.

[14] SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining. Andrea Esuli and Fabrizio Sebastiani.

[15] Arachnode.Net Database Diagrams. http://arachnode.net/media/g/database_diagrams/default.aspx; accessed 20.05.2009.

[16] Porter Stemmer. http://tartarus.org/~martin/PorterStemmer/; accessed 20.05.2009.

[17] Zed Graphs. http://zedgraph.org/wiki/index.php?title=Main_Page; accessed 20.05.2009.

[18] NetSpell. http://sourceforge.net/projects/netspell/; accessed 20.05.2009.
