
Master’s Thesis

Ranking Blogs based on

Topic Consistency

by

Philipp Berger

Potsdam, September 2012

Supervisor

Prof. Dr. Christoph Meinel

Internet-Technologies and Systems Group

Disclaimer

I certify that the material contained in this master’s thesis is my own work

and does not contain significant portions of unreferenced or unacknowledged

material. I also warrant that the above statement applies to the implementation

of the project and all associated documentation.

I hereby certify that this thesis was written independently and that no sources or aids other than those indicated were used. This statement also applies to all implementations and documentation within the scope of this project.

Potsdam, September 27, 2012

(Philipp Berger)


Summary

Common blog rankings, such as PageRank, Technorati Authority, and BI-Impact, favor blogs that cover a wide variety of topics, since these attract a larger audience and thus more visitors, links, and comments. One example is the blog spreeblick.com, which deals with topics around politics, society, and IT.

Niche blogs that concentrate on a single topic, on the other hand, gain only little influence. Niche blogs are blogs like telemedicus.info, which only publishes articles on data protection and copyright. As a result, they receive only a low rating from today’s blog search engines.

This thesis argues that the consistency of a blog, i.e. how focused an author’s treatment of a topic is, is a sign of expert knowledge. Finding such blogs is particularly important for other experts, who want to identify them in order to follow them and enter into an active discourse.

To make these blogs easier to find, i.e. to separate them from the mass of blogs with wide-ranging interests, a metric for blogs based on topical consistency is presented. The consistency ranking is based on (1) intra-post, (2) inter-post, (3) intra-blog, and (4) inter-blog consistency.

The presented metric is evaluated on a data set of 12,000 collected blogs, which demonstrates the plausibility of this approach.


Abstract

Current ranking algorithms, such as PageRank, Technorati Authority, and BI-Impact, favor blogs that report on a diversity of topics, since those attract a large audience and thus more visitors, links, and comments. One example is the spreeblick.com blog, which offers articles on politics, society, and IT.

On the other hand, niche blogs with a very specific topic attract only a small audience and thus have only a small reach. Niche blogs are blogs like telemedicus.info, which only publishes articles on privacy and copyright. This results in a low ranking from today’s blog retrieval systems.

This thesis argues that the consistency of a blog, i.e. how focused an author’s reporting on a single topic is, is a sign of expert knowledge. Finding these blogs is particularly important for other domain experts who want to identify blogs to follow and stay in active contact with. To ease the retrieval of expert blogs, i.e. to separate them from the mass of blogs that report on random topics, a metric for blogs based on topic consistency is introduced. The consistency ranking is based on four different aspects: (1) intra-post, (2) inter-post, (3) intra-blog, and (4) inter-blog consistency.

By evaluating the metric with a test data set of 12,000 crawled blogs, the plausibility of this approach is demonstrated.


Contents

1 Introduction
2 Background
    2.1 Weblogs
    2.2 BlogIntelligence Framework
        2.2.1 Extraction
        2.2.2 Analysis
        2.2.3 Visualization
    2.3 Apache Nutch - the Crawling Framework
    2.4 SAP HANA - the Persistence Layer
    2.5 Clustering and Apache Mahout
3 Related Work
    3.1 General Rankings
    3.2 Blog-Specific Rankings
    3.3 Consistency-Related Rankings
4 Definition of the Topic Consistency Metric
    4.1 Consistency between Posts (Inter-Post)
    4.2 Internal Consistency of Posts (Intra-Post)
    4.3 Consistency between Posts and Classification (Intra-Blog)
    4.4 Consistency of Linking and Linked Blogs (Inter-Blog)
    4.5 Combined Topic Consistency Rank
5 Implementation of Topic Detection
    5.1 Prerequisites
    5.2 Clustering
6 Implementation of the Topic-Consistency Rank
    6.1 Intra-Post Consistency
    6.2 Inter-Post Consistency
    6.3 Intra-Blog Consistency
    6.4 Inter-Blog Consistency
    6.5 BI-Impact Score
7 Evaluation
    7.1 Experimental Setup
    7.2 Clustering
    7.3 Results of the Topic Consistency Sub Ranks
    7.4 Comparison of BI-Impact and Combined Topic Consistency Rank
8 Recommendations for Future Research
    8.1 Enhanced Topic Detection
    8.2 Visualization
    8.3 Full integration with SAP HANA
9 Conclusion
List of Abbreviations
List of Figures
List of Tables
Bibliography


“Blogging is ... to writing what extreme sports are to athletics;

more free-form, more accident-prone, less formal, more alive. It is in

many ways, writing out loud.”

-from Andrew Sullivan, The Atlantic, Why I Blog


1 Introduction

Weblogs, called blogs, are one of the most popular “social media tools” of the

World Wide Web (WWW) [1]. They are specialized, but easy-to-use, content

management systems. Blogs focus on frequently updated content, social interactions, and interoperability with other Web authoring systems.

Blogs are part of the rise of social media, i.e. the shift of the internet toward more user participation and freedom of speech [2]. This is driven by their various application areas, which range from personal diaries and holiday photo collections, through knowledge management, educational, scientific research, and corporate platforms, to forums for traditional journalists and the emerging concept of citizen journalists, who rose to fame during the Arab Spring [3, 4, 5].

The actual power of blogs arises from their common superstructure: each blog integrates itself into a huge think tank of millions of interconnected weblogs, called the blogosphere, which creates an enormous and ever-changing archive of open source intelligence [6].

Because of the various application areas and the immense number of blogs, the diversity of discussed topics continuously increases. As shown in Fig. 1, the topics range from travel and news to politics and gaming.

Blog readers are not able to access all the information of the blogosphere because they are overwhelmed by the enormous number of blogs and the blogs’ diversity. To handle this information overload, the research and application area of blog retrieval evolved [8]. As in traditional information retrieval (IR) and data mining approaches, the goal is to ease the understanding of the causal relations in the blogosphere and the retrieval of the blogs most relevant to the user’s information need [9].

Facing this unique challenge, the BlogIntelligence (BI) project [10] was initiated with the objective to map, and ultimately reveal, content-oriented, network-related structures of the blogosphere by using an intelligent crawler and tailor-made analyses for the blogosphere [10]. Besides normal search engine functionalities, BlogIntelligence has to consider the specific characteristics of the blogosphere and social interaction from other social networks, and has to leverage content mining [11].

Figure 1: Topics blogged about in 2008 [7]. (Bar chart titled “Topics written about”, x-axis from 0% to 70%. Topics shown: Other, Celebrities, Business news, Business, Science, Sport, Gaming, Technology, Travel, Computers, Film/TV, Opinions on products and brands, News, Music, Family of friend blogs, Personal blogs. Values range from 6.7% to 63.5%.)

This thesis originates from the BlogIntelligence project and presents a ranking approach based on the topical consistency of blogs. This ranking aims to ease the retrieval of expert blogs, which is particularly important for users who want to identify blogs to follow and interact with.

The ranking of documents is a common technique in IR [12]. It aims to assess

the relevance of documents for a specific user’s information need. Current IR

systems mainly calculate the ranking or authority of a blog based on its position

in the web graph or social graph [13, 14]. Advanced ranking approaches also

consider the up-to-dateness of the content and level of readers’ engagement [11].

In contrast to current approaches, the goal of this thesis is to establish topic consistency as the primary factor for the ranking.


Topical consistency is defined as the degree to which a blog author focuses on a specific set of topics [15]. If blog authors cover several topics, as in random interest blogs or diaries, they have a low topic consistency and thus cannot create topical thrust. In contrast, a blog has the highest topic consistency if it continuously concentrates on one topic. It is argued that such a blog develops a sufficiently high expertise in this topic [16]. Thus, the content of this blog author is expected to be more relevant to an information need than the content of a topically versatile and influential author. Analogous to frequently cited experts in the real world, it is expected that blog readers are more likely to trust and interact with a blog author with a high topic consistency.

The ranking is implemented by integrating it into the BlogIntelligence framework. BlogIntelligence essentially consists of three components. The data extraction component is the basis of the BI framework: it harvests the web, analyzes each web page, extracts blog-specific information, and stores the harvested data in the storage layer. The analysis component provides prototypical implementations for the detection of trending topics and the ranking of blogs. The third component is the visualization, which communicates the analysis results to the user.

To implement the topic consistency rank, it is necessary to integrate a topic detection mechanism into the analysis layer and to calculate the actual ranking based on the detected topics and the crawled data. Further, this thesis introduces an extension for the visualization that communicates the topic consistency of a blog to the user.

In order to evaluate the plausibility of a topic consistency ranking, it is formally defined and prototypically implemented in the course of this thesis. Further, it is tested whether a correlation between the topical consistency of a blog and its influence is observable. This evaluation draws on the BlogIntelligence data set, which currently consists of 12,000 blogs with over 600,000 posts.

The remainder of this thesis is structured as follows. Section 2 outlines the foundations of this thesis. It introduces the reader to the concept of weblogs, to the layers of BlogIntelligence, and to the technique of data clustering. In Section 3, related research concerning ranking approaches and topical content analyses is described. Section 4 presents the formal definition of the topic consistency rank and its sub ranks. Section 5 outlines the implementation of the underlying topic detection mechanism. Section 6 describes the implementation of the topic consistency rank and its integration into the BlogIntelligence framework. Section 7 discusses the results and the plausibility of the topic consistency rank. Future work is introduced in Section 8. Finally, Section 9 presents the conclusion of this thesis.


2 Background

This Section presents the basic concept of weblogs. The analysis of weblogs

is the main goal of the BlogIntelligence framework, which is the foundation of

this work. Therefore, the layers of BlogIntelligence are introduced, too. Further, this Section gives an overview of the technologies that are utilized for the topic consistency rank calculation: Apache Nutch, Apache Mahout, and SAP HANA.

2.1 Weblogs

As discussed in Sec. 1, blogs are specialized content management systems (CMS)

that enable the authors to share content and open discussions.

Blog platforms, like Blogger¹, WordPress², and TypePad³, provide a unified structure for the published content. This structure reflects the requirements of a frequently updated and socially active medium.

Weblog is a made-up word composed of the terms web and log [17]. The entries

of this log are called posts.

Posts are usually displayed in reverse chronological order with the most recent

entry first. These posts can contain texts, images, and videos to express the

author’s opinion. They are the counterpart to traditional newspaper articles.

Each post can be referenced via a URI (Uniform Resource Identifier) in the World

Wide Web (WWW). A special kind of URI is the permalink. A permalink is the

durable address of a blog post which is guaranteed to be reachable and unique

during the life-time of a blog.

In addition, a blog author can categorize his posts based on two classification

mechanisms: categories and tags.

Categories offer a hierarchical structure for classifying a blog’s contents, similar to traditional libraries. They are frequently used to emphasize distinct discussion streams within a blog.

¹ http://www.blogger.com/
² http://wordpress.com/
³ http://www.typepad.com/

In contrast, tags are unordered keywords attached to a post and do not offer a

hierarchy. They summarize the content of a post. Readers use tags to navigate a

blog and to find posts related to a very specialized concept [18]. One prominent

application is the tag cloud (see Fig. 2, generated by Wordle⁴).

Figure 2: A generated tag cloud.

A tag cloud visualizes all tags (keywords) of all posts of a blog and gives an

impression of the most discussed topics. It has become a popular method to support navigation and retrieval of posts [19].

The social component of a blog is the reader’s ability to write comments [20].

Comments enable blog readers to open an active discussion attached to a post

and communicate their opinions or to offer help. This enables the users of a blog to communicate in a highly interactive way [21]. Nevertheless, blog comments

are manually moderated by the blog author because the author is responsible

for the content published on his blog. The blog author also wants to control the

discussion and to avoid inappropriate comments.

Further, blogs have special technical features that simplify harvesting and analyzing their posts and comments.

⁴ http://www.wordle.net/

The most prominent technical feature of blogs is the publishing format feed [22]. Feeds present the content of a blog in standardized, XML-based formats (namely RSS and Atom). A feed is an integrated part of the blog system and is always up-to-date with the blog’s content. These feeds ease the machine readability of blogs. They contain all relevant information like the publishing date, the author, categories, tags, the title, and a short description of a post.
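To illustrate the machine readability of feeds, the following minimal sketch reads the post fields named above from an RSS 2.0 document using Python’s standard library. The feed content, the blog URL, and the function name `read_posts` are made up for illustration and are not part of the BlogIntelligence implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed with a single post (all values are made up).
FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>Hello Blogosphere</title>
      <link>http://blog.example.org/2012/09/hello</link>
      <pubDate>Thu, 27 Sep 2012 10:00:00 GMT</pubDate>
      <category>privacy</category>
      <category>copyright</category>
      <description>A short summary of the post.</description>
    </item>
  </channel>
</rss>"""

def read_posts(feed_xml):
    """Return one dict per <item> with the fields a feed typically carries."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "published": item.findtext("pubDate"),
            "categories": [c.text for c in item.findall("category")],
            "description": item.findtext("description"),
        })
    return posts

posts = read_posts(FEED)
```

An aggregator or crawler could apply the same kind of extraction to every subscribed feed instead of parsing the blog’s HTML pages.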

This has given rise to a new kind of application, the aggregator. An aggregator requests a user-selected set of feeds and displays the content to the user in a unified, enriched, and compact view. This way, users no longer request a blog directly. Instead, they are provided with the content of their favorite blogs and do not have to actively retrieve it from the WWW.

Concerning the social interaction of blogs, important technical features are

blogrolls and linkbacks [10].

A blogroll is prominently placed on a blog’s starting page and contains links to other blogs. These blogs are considered followed or friend blogs. Thus, blogrolls form close communities based on mutual linking.

Linkbacks are methods a blog author can use to get notified when other authors

link to his posts. This enables authors of different blogs to bidirectionally link

their discussions.

There are three kinds of linkbacks: refback, trackback, and pingback. Refback is not part of the blogging system. Instead, it relies on the HTTP protocol and today’s browsers. A refback occurs when a blog reader follows a link and the receiving blog recognizes the HTTP referrer value sent by the reader’s browser. In contrast, trackback and pingback are automated mechanisms of blog systems based on HTTP POST and XML-RPC, respectively. The moment a blog author A references another blog author B, the blog system of A sends a notification to the server of B. The server stores this message, which contains all relevant meta information like the referencing post URI and post title. Thus, B can display this back reference under his post to lead the blog reader to further discussions.
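As a minimal sketch of such a notification, the following code builds the XML-RPC request body of a pingback. The method name `pingback.ping` and its two arguments (source URI, target URI) follow the Pingback specification; the post URIs and the helper name `build_pingback_request` are hypothetical. The request is only serialized here, not sent over the network.

```python
import xmlrpc.client

def build_pingback_request(source_uri, target_uri):
    """Serialize a pingback.ping call as an XML-RPC request body."""
    return xmlrpc.client.dumps((source_uri, target_uri), methodname="pingback.ping")

# Hypothetical post URIs: a post of author A links to a post of author B.
payload = build_pingback_request(
    "http://blog-a.example.org/2012/09/reply",
    "http://blog-b.example.org/2012/09/original",
)

# Decoding the payload recovers the method name and both URIs.
params, method = xmlrpc.client.loads(payload)
```

In a real exchange, the blog system of A would POST this payload to the pingback endpoint advertised by the blog of B.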


2.2 BlogIntelligence Framework

To exploit the unique features of blogs, the BlogIntelligence project was initiated [10]. This project shows from which perspectives the entirety of weblogs can be analyzed and visualized in order to extract valuable aggregated information. The visualizations and insights are composed into a web portal⁵.

To generate the data for this web portal, BI provides a framework consisting of

three layers: extraction, analysis, and visualization. An illustration of the complete

architecture is shown in Fig. 3.

2.2.1 Extraction

The extraction layer consists of a web harvesting application called a crawler.

Web crawlers are computer programs that browse the web in an automatic, methodical manner [23]. They are mainly used to store a copy of each visited page. Search engines and other services analyze and index these pages to provide fast search interfaces. The crawler starts with a fixed set of URIs. After visiting and copying the first pages, the crawler extracts all hyperlinks and continues by visiting the linked pages.
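The traversal described above can be sketched as a breadth-first search over the link graph. This is a simplified illustration, not the BI crawler: the in-memory `WEB` dictionary stands in for real HTTP requests, and all URIs and names are hypothetical.

```python
from collections import deque

# Hypothetical in-memory "web": each URI maps to the URIs its page links to.
WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
}

def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URIs."""
    queue = deque(seeds)
    seen = set(seeds)
    visited = []                      # stands in for the stored page copies
    while queue and len(visited) < max_pages:
        uri = queue.popleft()
        visited.append(uri)
        for link in fetch_links(uri):
            if link not in seen:      # never queue the same URI twice
                seen.add(link)
                queue.append(link)
    return visited

order = crawl(["http://a.example/"], lambda uri: WEB.get(uri, []))
```

The `seen` set keeps the crawler from looping on cyclic links, such as the link from `c` back to `a` in the example graph.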

The BI crawler is a tailor-made adaptation of Apache Nutch for the special requirements of the blogosphere (see Sec. 2.3). Similar to common crawlers, the BI crawler traverses the link graph of the web to harvest web pages.

Two parts of the crawler are adapted to the special needs of harvesting blogs. The first part is the URI selection. It is responsible for selecting the next set of URIs to crawl from the queued URIs in the joblist (see Fig. 3). It distinguishes between special types of links present in blogs. These types reflect the position of a link in a blog. Thereby, the crawler prefers links from blogrolls, posts, comments, and links explicitly marked as feeds.
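One way to picture such a link-type-aware selection is a priority ordering over the joblist. The following sketch is an assumption for illustration only: the text states which link types are preferred, but not how the BI crawler weights them, so all preferred types share one priority here, and the names `LINK_PRIORITY` and `select_next` are hypothetical.

```python
import heapq

# Assumed priorities (lower = selected earlier). The text only states that
# blogroll, post, comment, and feed links are preferred over other links.
LINK_PRIORITY = {"feed": 0, "blogroll": 0, "post": 0, "comment": 0, "other": 1}

def select_next(joblist, batch_size):
    """Pick the next batch of URIs, preferred link types first.

    Ties are broken by the position in the joblist (first queued, first out).
    """
    scored = [
        (LINK_PRIORITY.get(link_type, LINK_PRIORITY["other"]), position, uri)
        for position, (uri, link_type) in enumerate(joblist)
    ]
    return [uri for _, _, uri in heapq.nsmallest(batch_size, scored)]

# Hypothetical joblist of (URI, link type) pairs.
joblist = [
    ("http://x.example/about", "other"),
    ("http://y.example/", "blogroll"),
    ("http://x.example/feed", "feed"),
    ("http://z.example/imprint", "other"),
]
batch = select_next(joblist, 3)
```

With this joblist, the blogroll and feed links are selected before the first ordinary link.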

The second adapted part is the post-processor. This part is responsible for extracting meta data from the downloaded page and attaching it to the persistent data object. The default extraction includes language detection, text content extraction, meta tag extraction, and link extraction. In addition, the BI crawler creates post and comment objects. Content, description, author, publishing date, language, tags, and categories of a post are extracted.

⁵ http://www.blog-intelligence.com/

Figure 3: The BlogIntelligence architecture [10]. (Architecture diagram with the layers extraction, analysis, and visualization. Labeled elements include crawlers #1 to #3, blog parsing, RSS feeds, WWW, blogosphere, news portals, the extracted fields title, author, content, timestamp, category, links, blogrolls, and Twitter accounts, content and network data, data analyzers, and a web interface with trends, What’s up?, communities, ranking, information spreading, and personalized search.)

The post-processor recognizes the specific blog system, like Blogger⁶, based on hints in the HTML structure. This way, platform-specific information like trackbacks is extracted as well.

Further, it analyzes the position of links in the content structure of a blog. After

the completion of the post-processor, the crawler stores the enriched web page

into the persistence layer.

2.2.2 Analysis

The second layer of the framework, the analysis layer, runs while the crawler continuously collects new information. The analysis consists of multiple loosely coupled modules. Each module performs a specific algorithm that delivers data necessary for the third layer of the framework, the visualization layer. The current BI prototype includes a ranking, a clustering, and a dimension reduction algorithm.

The ranking algorithm is described by Bross et al. [11]. The authors define a

complex metric called BI-Impact score. The BI-Impact score combines multiple

quality metrics of blogs to one score (see Sec. 3.2).

The current implementation of the prototype runs as an overnight batch job to calculate a new ranking. It only considers the specific link types from blog, blogroll, post, or comment. The analysis elements (see Fig. 3), like trend detection, ranking, and community recognition, are prototypically implemented. The details of the ranking are discussed in Sec. 3.2.

The integration of the topic consistency rank calculation in this analysis layer is

described in Sec. 6.

⁶ http://www.blogger.com

2.2.3 Visualization

The visualization layer is based on the results of the data analyses. It allows users to browse the preprocessed information of the data analyzers in an unlimited, personalized, and intuitive way. The visualization layer consists of three visualizations that give different insights at different abstraction levels (see Fig. 3).

The first visualization is directly integrated into the web portal. It basically

shows the frequently discussed topics (What’s up), and trending terms (Trends)

of the blogosphere.

Figure 4: Screenshot of the BlogConnect visualization [24].

The second visualization, called BlogConnect, is an interactive tool to explore and browse through the network of blogs. Essentially, it displays all blogs as bubbles on a 2D canvas (see Fig. 4) [24]. The position of a blog reflects its topical identity. The size of a blog indicates its position in the ranking. Thus, users can orient themselves in the network and find the most relevant blog for a topic area, called community.

Figure 5: Screenshot of the PostConnect visualization [25].

The third visualization, called PostConnect [25], serves as a visualization for blog archives. As shown in Fig. 5, it arranges all posts of a blog in a circle. When a post is activated, each topically linked post of the blog archive gets highlighted. Here, a post is topically related if it uses the same categories or tags as the activated post. PostConnect helps users to explore the topical nature of a blog and identify highly related subsets of posts.
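The relatedness rule can be sketched with plain set operations. The sketch assumes that sharing at least one category or tag is enough for two posts to count as related, which is one possible reading of the rule above; the archive contents and the function name `related_posts` are hypothetical.

```python
# Hypothetical blog archive: post title -> set of categories and tags.
ARCHIVE = {
    "Data retention ruling": {"privacy", "law"},
    "New copyright bill": {"copyright", "law"},
    "Cookie banners": {"privacy"},
    "Conference notes": {"events"},
}

def related_posts(archive, active):
    """Return the posts that share at least one label with the active post."""
    labels = archive[active]
    return sorted(
        title for title, other in archive.items()
        if title != active and labels & other
    )

highlighted = related_posts(ARCHIVE, "Data retention ruling")
```

A stricter reading, in which two posts must carry identical label sets, would replace the intersection test `labels & other` with `labels == other`.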

2.3 Apache Nutch - the Crawling Framework

As described by Berger et al. [26], the underlying framework of the BlogIntelligence extraction layer is the open-source web search engine Apache Nutch⁷ [27]. Apache Nutch provides a transparent alternative to private global scale search services. It comes with an easily extensible and scalable crawler component.

⁷ http://nutch.apache.org/

Following the MapReduce paradigm [28], Apache Nutch defines four different phases for crawling that are executed iteratively: generator, fetcher, parser, and updater [29]. The generator job selects the next URIs to fetch. The fetcher job asynchronously downloads the selected pages. Afterwards, the parser job extracts metadata, links, and the actual text content. In addition, the framework offers an extension point to insert new parsing algorithms. This functionality is used by the topic detection implementation to integrate the term extraction module (see Sec. 5). Finally, the updater job inserts new links and calculates scores for the parsed web pages. These scores are used to select the next URIs to crawl. Each job works on a large amount of pages in parallel.

Apache Nutch is a MapReduce application designed for scale-out scenarios, i.e. it runs on a large number of small machines. For example, researchers at Google [28] report using a massive cluster of small machines to crawl the web. Nevertheless, even in scale-up scenarios, i.e. execution on one big machine, MapReduce applications running as scale-out-in-a-box perform more effectively than pure scale-up approaches [30]. This enables us to run the crawler on a large cluster of small machines as well as on a large shared-memory server.

In this context, the Hasso Plattner Institute offers a testing platform, the Future SOC Lab⁸, which provides researchers access to the latest multi/many-core hardware. Thus, the crawler implementation currently runs in a scale-up scenario.

2.4 SAP HANA - the Persistence Layer

The persistence layer of the BlogIntelligence framework has a high impact on the performance of the extraction and analysis layer [26]. In addition, the overall target of BI is to provide real-time analytics for the whole blogosphere. Therefore, three different database technologies are compared: a row-oriented, disk-based database, a distributed file system, and a column-oriented, in-memory database.

The evaluation considers a traditional row-oriented, disk-based database, namely PostgreSQL⁹. This database makes the data discoverable and easy to query by offering a SQL query API that eases the implementation of the analysis layer. However, the query performance of PostgreSQL massively decreases with growing data amounts during the extraction phase [26].

⁸ http://www.hpi.uni-potsdam.de/forschung/future_soc_lab.html
⁹ http://www.postgresql.org/

An alternative is the distributed file system HDFS¹⁰. HDFS is the original persistence API of Apache Nutch. It is able to handle and process huge amounts of data using commodity hardware [31]. It does not provide a query API like SQL. Further, HDFS is not able to take full advantage of today’s high-end hardware with massive amounts of memory, because it requests only minimal hardware resources [32].

Since costs for main memory are decreasing and access to data in main memory is extremely fast, it makes sense to keep all data primarily in main memory. Thus, an in-memory database, namely SAP HANA¹¹, is tested. Although SAP HANA targets enterprise applications, the majority of its analysis algorithms also apply to social media. Because of the effective usage of main memory, the versatile analysis capabilities, and the SQL API, the extraction component was adapted to store all collected data in SAP HANA [33].

To integrate the extraction component with the in-memory database, the persistence layer of Apache Nutch is replaced. The Apache Gora¹² framework already offers an object-relational mapper (ORM) for traditional SQL databases like PostgreSQL. This ORM is adapted to also support the SAP HANA database, because HANA uses a special SQL dialect. Hence, the complete extraction component is currently integrated with SAP HANA.

Because of the tight coupling of the persistence layer and the analysis layer, this change implies the adaptation of the whole analysis layer. Thereby, most of the algorithms have to be modified regarding the direct integration into SAP HANA. HANA offers various kinds of programming interfaces to run analytics directly in-memory without transferring the data to the application layer.

¹⁰ http://hadoop.apache.org/
¹¹ http://www.sap.com/HANA/
¹² http://gora.apache.org/

Besides saving transfer time, the main advantage of the database is the dictionary-encoded, column-oriented in-memory computing that outperforms file-based database solutions [33]. The dictionary encoding saves space and access time for highly redundant tables like the link or dictionary tables used for analysis (see Sec. 6). Further, the column orientation performs best on tables with a large number of columns of which only a few are queried. This applies to the main table of BlogIntelligence, called the web page table, which essentially stores all information of a web page, like content, date, author, and many more, in one table.
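The idea behind dictionary encoding can be illustrated with a few lines of code. This is a generic sketch of the technique, not SAP HANA’s actual storage format; the column contents and function names are hypothetical.

```python
def dictionary_encode(column):
    """Replace each value by a small integer ID into a sorted dictionary."""
    dictionary = sorted(set(column))
    ids = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [ids[value] for value in column]

def decode(dictionary, encoded):
    """Restore the original column from the dictionary and the ID vector."""
    return [dictionary[i] for i in encoded]

# Hypothetical "link type" column of a link table: few distinct values,
# many repetitions.
link_types = ["post", "blogroll", "post", "comment", "post", "blogroll"]
dictionary, encoded = dictionary_encode(link_types)
```

Each distinct string is stored once in the dictionary, while the column itself holds only small integers. The more redundant the column, the larger the saving.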

However, HANA is still under development and the transfer of analysis algorithms is out of scope for this work. As a consequence, the clustering needed by the topic consistency rank is outsourced as described in the following Section.

2.5 Clustering and Apache Mahout

One major foundation of the algorithms in the analysis layer is clustering. This also applies to the topic detection mechanism needed for the topic consistency rank (see Sec. 5).

Clustering is an unsupervised technique that partitions data items into groups, called clusters. These clusters contain data items that are similar in meaning. Besides density-based clusterings, frequently used clusterings in data mining are distance-based [34, 35].

Essentially, a distance-based clustering works as follows. Each data item has a number of numerical features. The feature vector of each data item is the combination of these features. One can think of the feature vector as a position in an n-dimensional space. The clustering defines a distance metric for the feature vectors. The task of the clustering is to group together feature vectors that have a low distance between each other. Thus, all data items that have similar numerical features are grouped together.
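As a minimal sketch of a distance-based clustering, the following code implements plain k-means with Euclidean distance. k-means is chosen here only as a well-known representative; the text above does not name a specific algorithm at this point, and the data points and initial centers are made up.

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain k-means: assign to the nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # New center = mean of the assigned points; keep the old center
        # if a cluster received no points.
        centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return clusters, centers

# Hypothetical 2-dimensional feature vectors forming two obvious groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
clusters, centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

Feature vectors with a low mutual distance end up in the same cluster, and each center moves to the mean of its assigned vectors.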

The current clustering of BlogIntelligence groups blogs together based on the word occurrences in the blogs. Blogs are in one cluster if they contain similar words with a similar frequency and thus are regarded as topically similar. These clusters are visualized in the BlogConnect visualization (see Fig. 4).
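How word occurrences can serve as feature vectors is sketched below. The use of raw term counts and cosine similarity is an assumption for illustration; the text does not state which weighting or distance measure the BlogIntelligence clustering applies, and the example texts are made up.

```python
import math
from collections import Counter

def term_vector(text):
    """Count lowercase word occurrences (whitespace tokenization only)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical blog texts.
law_blog = term_vector("copyright privacy law copyright court")
tech_law_blog = term_vector("privacy law software copyright")
travel_blog = term_vector("beach hotel flight beach")

sim_related = cosine_similarity(law_blog, tech_law_blog)
sim_unrelated = cosine_similarity(law_blog, travel_blog)
```

Blogs that share many words with similar frequencies obtain a similarity close to 1, while blogs without common words obtain 0.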

The prototypical analysis layer of the BI framework consists of Java implementations for the clustering and other analysis techniques. The logical next step is to integrate it with a well-established framework for information analysis.


Such an established framework is Apache Mahout¹³ [36]. Similar to Apache Nutch, it is based on a MapReduce framework.

It provides various algorithms for clustering, classification, and collaborative

filtering. In order to provide maximal distribution during the execution, these

algorithms are customized for the MapReduce framework.

Mahout is primarily built for batch analyses that are able to handle big data.

This data has to be present on the distributed file system of Hadoop. Hence, the

complete data has to be loaded from the persistence layer.

However, the long-term target is to integrate all needed clustering and classification algorithms directly into the persistence layer to avoid high transfer costs. Although the first clustering algorithms for HANA are under development, their integration is out of scope of this thesis.

13 http://mahout.apache.org/

3 Related Work

The related work can be divided into three categories of ranking approaches.

The first category consists of general rankings that assess web pages and other

documents. The second category includes blog-specific rankings that are spe-

cialized on blogs and other social media channels. The last category comprises

consistency-related rankings that incorporate the topic consistency of a document

or blog into the ranking.

3.1 General Rankings

PageRank is one of the most frequently used algorithms, e.g. by Google [37],

for ranking traditional web pages based on the web link graph. It has been

introduced by Page et al. [13] and is based on the random surfer model. A web

page’s PageRank is defined as the probability of a random surfer visiting this

web page. The random surfer traverses the web by choosing repeatedly between

two options: clicking on a random link on the current page or randomly jumping

to another web page.

The second option is necessary to make sure the random surfer also visits pages

that have no incoming links and to make sure that it is possible to escape from

pages that have no outgoing links. The calculation of the PageRank algorithm is

shown in the following equation.

$$PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}$$

The probability of clicking on a random link is determined by the damping factor d, and N is the total number of web pages. A page $p_j$ is in $M(p_i)$ if $p_j$ has a link to $p_i$. $L(p_j)$ gives the number of outgoing links of $p_j$, and $PR(p_j)$ is the PageRank of $p_j$ from the previous iteration. The PageRank algorithm is iterative; the number of iterations needed until convergence depends on the implementation used.
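
The iteration can be illustrated with a minimal Python sketch. It is not the implementation used by any search engine; the function name `pagerank`, the damping factor of 0.85, the fixed number of iterations, and the three-page example graph are assumptions chosen for illustration. The sketch also assumes that every page has at least one outgoing link, because the equation above does not say how pages without outgoing links are treated.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Sum over M(p): all pages q that link to p, each sharing its
            # previous rank equally among its L(q) outgoing links.
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if p in links.get(q, []))
            new_rank[p] = (1 - d) / n + d * incoming
        rank = new_rank
    return rank


# Hypothetical example graph: a -> b, a -> c, b -> c, c -> a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In this example, page c collects rank from both a and b and therefore ends up with a higher score than b, which is only linked by a.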

A very similar algorithm to PageRank is TrustRank [38]. In contrast to PageRank, TrustRank is initialized with a fixed set of trustworthy or untrustworthy web pages. The trust propagates through the web graph in the same way as in the PageRank algorithm.

Another approach is the Hyperlink-Induced Topic Search (HITS) algorithm by

Kleinberg [39]. It is based on the concept of hubs and authorities. In the tradi-

tional view of the web, hubs are link directories and archives that only refer to

information authorities, which actually offer valuable information. The HITS

algorithm operates on a subgraph of the web that is related to a specific input

query. Each page gets an authority score and a hub score. The authority score is

increased based on the hub score of linking web pages and vice versa.
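
The mutual reinforcement of hub and authority scores can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the fixed number of iterations, the normalization by the Euclidean length of the score vector, and the tiny example graph are not taken from the text, and the query-specific selection of the subgraph is left out.

```python
import math


def normalize(scores):
    """Scale the scores so that their squares sum to one."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {p: v / norm for p, v in scores.items()} if norm else scores


def hits(links, iterations=20):
    """links maps each page of the query subgraph to the pages it links to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of all pages linking to p.
        auth = normalize({p: sum(hub[q] for q in pages if p in links.get(q, []))
                          for p in pages})
        # Hub score: sum of the authority scores of all pages p links to.
        hub = normalize({p: sum(auth[q] for q in links.get(p, []))
                         for p in pages})
    return hub, auth


# Hypothetical subgraph: two link collections pointing to two content pages.
hub, auth = hits({"directory": ["a", "b"], "archive": ["a"]})
```

Here page a is linked by both collections and receives the highest authority score, while the directory, which links to both content pages, receives the highest hub score.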

These traditional ranking algorithms are all based on the web link graph. How-

ever, traditional web pages show a different linking behavior than blogs. Blogs

offer different types of links, e.g. trackbacks or blogroll links, with different se-

mantics (see Sec. 2.1). Furthermore, the blog link graph tends to be rather sparse

in comparison to the overall web [40].

3.2 Blog-Specific Rankings

To address the special characteristics of blogs, blog ranking engines and current

research introduce tailor-made ranking algorithms for the blogosphere [11].

The most popular platforms ranking the blogosphere are Technorati14 and

Spinn3r15 [11]. Other services like BlogPulse, PostRank, or BlogScoop went offline

during the last year and got integrated into commercial products. Thus, the free

services of Technorati and Spinn3r are described.

Technorati established the authority score as their unique ranking. It is calculated

based on a blog’s linking behavior, categorization and other associated data over

a small period of time [41]. Furthermore, Technorati calculates its authority score

also for topical segments of the blogosphere to identify topic-specific opinion

leaders.

Although Spinn3r is well known for its crawling service, it also provides a simple PageRank and a Social Media Rank. The Social Media Rank is an adaptation of the TrustRank algorithm. It incorporates social networks as incoming link providers and uses a fixed number of initially trusted users to prevent spam.

14 http://technorati.com/
15 http://spinn3r.com/

Besides these platform-specific rankings, current research also discusses blog-specific ranking approaches.

A ranking score, called BlogRank, is introduced by Kritikopoulos et al. [42]. It is

a modified version of the PageRank algorithm. The BlogRank score is based on

the link graph and different similarity characteristics of weblogs. The authors

create an enriched graph of inter-connected weblogs with additional edges and

weights representing the specific features of blogs. Mainly, these features are

shared authorship and topics. For example, the authors create a pseudo link

between two posts that share the same topic that is identified by category anno-

tations.

Bross et al. [11] propose the BlogIntelligence-Impact-Score (BI-Impact) ranking, a

more complete approach to successfully rank blogs. Their definition is the basis

for the currently implemented scoring algorithm in the BlogIntelligence frame-

work.

Figure 6: Ranking variables of the BI-Impact score [11].

Similar to the above-mentioned rankings, they give special weightings to the special link types of the blogosphere. In contrast to BlogRank, their algorithm does not create new links between blogs. Rather, it weights the different interaction types of blog authors, like links to comments, to posts, and to the start page of a blog. Like Spinn3r, they also consider links from outside the blogosphere, such as from Twitter16 and news portals.

All used ranking variables are shown in Fig. 6. They distinguish between a post

and a blog ranking.

The post ranking incorporates the different kinds of links between posts like

linkbacks, tweets, and normal links. Further, the content of a post gets rated.

In contrast to consistency-related rankings, the authors do not incorporate the

topics of a post. Instead, the authors focus on the detection of spam keywords

and trend keywords. Trend keywords are terms extracted by a hot topic analyzer,

which is also part of the BI framework.

The blog ranking combines the ranking of all posts with blog-specific character-

istics. Among others, these are the publishing frequency and the blogroll links

of a blog.

All these variables are combined into one score for a blog and propagate through

a PageRank-like algorithm to all linked blogs.

The work presented in this thesis introduces a new score that complements the

BI-Impact score to foster the retrieval of topically consistent blogs for hot top-

ics. Thereby, users of the BI framework are able to find niche blogs that discuss

trending and interesting topics.

3.3 Consistency-Related Rankings

Consistency-related rankings are blog rankings that incorporate the topical con-

sistency of a blog. This topical consistency adds to other factors to form one rank

for each blog.

A trend detection system, called Social Media Miner, is presented by Schirru

et al. [43]. This system extracts topics and the corresponding, most relevant

posts. The topics are detected using a clustering on word importance vectors

(see Sec. 2.5).

16 http://twitter.com/

Their approach is rather simple and does not directly reflect a consistency. They

cluster topics for a given period, find relevant terms (or labels), and visualize the

term mentions over time as a trend graph. Nevertheless, posts that consistently

handle a specific topic have a constant term frequency of topic terms. Thus,

topically consistent blogs get a good trend graph, at least for trending topics.

Sriphaew et al. [44] discuss how to find blogs that have great content and are worth exploring. They show how to identify these blogs, called cool blogs, based on three assumptions: cool blogs tend to have definite topics, enough posts, and a certain level of consistency among their posts. The level of consistency, called topical consistency, tries to measure whether a blog author focuses on a solid interest. Thus, it favors blogs with stable topics, like reviews of mobile devices. The authors measure the consistency based on the similarity of the topic probabilities of preceding posts.

Weerkamp et al. [15] introduce eleven indicators of credibility to improve the effectiveness of topical blog retrieval. Besides some syntactic indicators, they also present the timeliness of posts and the consistency of blogs. The timeliness of a post is defined as the temporal distance of a blog post to a news portal post on the same topic. Their topical consistency represents the blog’s topical fluctuation. The authors define the consistency as a tf*idf-like score over all terms of a blog. Although this measure favors blogs that frequently use rare terms, it does not reflect when a blog author changes the topic from one post to another. In contrast to other related research, the authors do not use the natural ordering of posts. Nevertheless, the authors show that their indicators improve the topical blog retrieval significantly.

The detection of spam blogs (splogs) is a frequently discussed topic in ongoing research [45, 46, 47]. Liuwei et al. [48] describe a spam blog filtering technique that also incorporates the writing consistency of a blog author. Similar to Weerkamp et al., the consistency on topic level is defined as the average topical similarity of posts. Each post gets compared with its preceding post. The topical similarity is defined as the distance of the posts’ tf*idf word vectors. Thereby, blogs with an extremely high topical consistency are expected to be auto-generated. They integrate their topic consistency into a blog filtering system.

Another approach for ranking blogs is introduced by Jiyin He et al. [49]. They

define a coherence score to measure the topical consistency of a blog. The au-

thors define a consistent blog as a blog that contains lots of coherent posts. A

post is coherent to another post if both posts are in the same cluster of the whole

collection. The authors integrate the coherence score into a blog ranking for

boosting the topically relevant and topically consistent blogs.

Chen et al. [50] present a blog-specific filtering system that measures topic con-

centration and variation. They assess the quality of blogs via two main aspects:

content depth and breadth. In essence, the authors present a score that contains

five criteria. Each criterion is based on an external topic model derived from

Wikipedia17 articles. For example, the completeness of a blog is defined as the

ratio of words used in a blog in comparison to all words assigned to a topic. Fur-

ther, the authors define the topical consistency of a blog as the mean distance of

used topics in a post. A blog is consistent if it only handles closely related top-

ics. The ordering of posts, which can indicate a topic shift of the author, is not

considered.

In contrast to related work, the topic consistency rank presented in this thesis calculates the consistency of a blog based on multiple aspects. Thereby, it measures the topical consistency at four different granularities and thus offers a differentiated view on the blog’s consistency.

Further, during the calculation of the score, topics are not considered as probability distributions over words. Instead, a topic is defined as a fixed set of words derived from a prior word clustering, which is also used by Sriphaew et al. [44].

17 http://www.wikipedia.org/

4 Definition of the Topic Consistency Metric

To evaluate the topical consistency of a blog author, four different facets of con-

sistency are defined.

First, the consistency between posts defines the inter-post consistency. It investigates whether the contents of the latest posts discuss closely related topics. Next, the internal consistency of a post, called intra-post consistency, is a measure that considers to what extent all paragraphs of a post discuss a similar topic. In contrast to the inter-post consistency, the intra-blog consistency compares the topic space created by each post with the topic space created by the tags and categories of this post. Therefore, it is a measure for the quality of the blog’s classification system. The inter-blog consistency measures whether a blog is part of a domain expert community. Here, the rank of a blog is increased if blogs handling a similar topic link to it. In addition, a blog is boosted if it links to topically related blogs.

Finally, all four facets get combined into the topic consistency rank.

4.1 Consistency between Posts (Inter-Post)

As a first step, the inter-post consistency is formally defined. The inter-post consistency compares the topical distance of successive posts. Each post is represented as a topic vector. Each component of this topic vector gives the probability of a post talking about one topic. The sum of all vector components is one, as usual for a probability distribution.

Fig. 7 shows the assignment of ten example posts to ten topics. Each column

symbolizes a topic vector of a post. The size of a bubble indicates the probability

of a post p to be in topic t.

The transient nature of the blogosphere motivates considering only the latest posts, which lie outside the outdated post area. There are two approaches to define outdated posts: excluding all posts older than a specific time span, or including only a specific number of latest posts. The latter solution punishes blogs that frequently publish new content by shrinking the observed time window to a day’s work. The time span variant is beneficial for small blogs because only a small part of the content is considered. Nevertheless, the time span variant is applied because it is assumed that it fits the user’s perception.

Figure 7: Visualization of post-topic-probabilities. (Bubble chart: the x-axis shows the post number over time, the y-axis the topic ID; the bubble size gives the topic probability, and each column is the topic vector of one post. Annotations mark the outdated post area and examples of a low and a high distance between successive posts.)

Sriphaew et al. [44] calculate the average difference of the topic vectors of posts from the blog’s topic centroid. This favors blogs with a central interest, but does not consider the change of a blog’s topic over time. As shown in Fig. 7, blogs can have low distances and high distances between posts. Thus, the average difference of the topic vectors of two successive posts serves as an indicator for topic consistency.

In the following, the formal definition of the inter-post consistency is shown.

Before defining the metric, the sets and functions used for the calculation have to be defined. The set Blog contains all blogs of the used data set. Post is a set that contains all posts. The set $Post_b$ with b ∈ Blog contains all posts of blog b. The function publishedDate(p) with p ∈ Post returns the publishing time and date of a post. The set $LatestPosts_{b,d}$, with b ∈ Blog and d ∈ Date being a point in time, is defined in Eq. 1.

$$LatestPosts_{b,d} = \{\, p \in Post_b \mid publishedDate(p) \geq d \,\} \qquad (1)$$

Term is the set of all terms. The set Topic contains all topics discussed in the considered subset of the blogosphere. Similarly to Eguchi et al. [51], the set $TT_{tp} \subset Term$ is defined as all terms of a topic tp ∈ Topic. All $TT_{tp}$ are pairwise disjoint.

$$\forall tp \in Topic \;\; \forall j \in Topic : \; tp \neq j \Rightarrow TT_{tp} \cap TT_j = \emptyset \qquad (2)$$

$PT_p \subset Term$ is the set of all terms used in a post p ∈ Post. The function Prob(p, tp) with p ∈ Post and tp ∈ Topic gives the probability of the post p being about the topic tp.

$$Prob(p, tp) = \frac{\sum_{t \in TT_{tp} \cap PT_p} \mathit{tf{*}idf}(t, p)}{\sum_{t \in PT_p} \mathit{tf{*}idf}(t, p)} \qquad (3)$$

Salton et al. [52] give an overview of the components of the tf*idf function and its variants. Essentially, it is the product of a term frequency component tf and a collection frequency component idf.

$$\mathit{tf{*}idf}(t, p) = tf(t, p) \times idf(t, Post) \qquad (4)$$

tf is the raw term frequency (the number of times a term occurs in a post). idf is the inverse document frequency. $Post_t$ with t ∈ Term is the set of all posts in which the term t is contained.

$$idf(t, Post) = \log \frac{|Post|}{|Post_t|} \qquad (5)$$
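
Eqs. 3 to 5 can be sketched in Python as follows. This is a minimal illustration rather than the thesis implementation: the function names `tf_idf` and `prob`, the three example posts, and the two topic term sets are made up, and the natural logarithm is assumed because the base of the logarithm in Eq. 5 is not stated.

```python
import math
from collections import Counter


def tf_idf(posts):
    """posts maps a post id to its list of terms; returns {post: {term: weight}}."""
    n_posts = len(posts)
    # |Post_t|: number of posts that contain term t at least once.
    doc_freq = Counter(t for terms in posts.values() for t in set(terms))
    weights = {}
    for post_id, terms in posts.items():
        tf = Counter(terms)  # raw term frequency
        weights[post_id] = {t: tf[t] * math.log(n_posts / doc_freq[t]) for t in tf}
    return weights


def prob(post_weights, topic_terms):
    """Eq. 3: share of a post's tf*idf mass that falls on the terms of one topic."""
    total = sum(post_weights.values())
    if total == 0:
        return 0.0  # assumption: a post without weighted terms belongs to no topic
    return sum(w for t, w in post_weights.items() if t in topic_terms) / total


# Made-up example data: three posts and two disjoint topic term sets.
posts = {
    "p1": ["privacy", "law", "privacy", "court"],
    "p2": ["privacy", "soccer", "goal"],
    "p3": ["soccer", "goal", "goal", "league"],
}
topics = {"legal": {"privacy", "law", "court"}, "sport": {"soccer", "goal", "league"}}
weights = tf_idf(posts)
topic_vector_p1 = [prob(weights["p1"], topics[name]) for name in ("legal", "sport")]
topic_vector_p2 = [prob(weights["p2"], topics[name]) for name in ("legal", "sport")]
```

Because the topic term sets are pairwise disjoint and cover all terms of the example, the components of each topic vector sum to one. Post p1 uses only terms of the first topic, whereas p2 is split between both topics.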

The function $topicalDistance(p_i, p_j)$ with $p_i, p_j \in Post$ is defined as the Euclidean distance between the topic vectors of both posts (see Eq. 6). The Euclidean distance is a frequently used distance metric and has proven to apply best for text vector comparison [44].

$$topicalDistance(p_i, p_j) = \sqrt{\sum_{tp \in Topic} \bigl(Prob(p_i, tp) - Prob(p_j, tp)\bigr)^2} \qquad (6)$$

The function predecessor(p) ∈ Post returns the direct predecessor of p ∈ Post. Given these definitions, the inter-post distance is formalized as shown in Eq. 7 with b ∈ Blog and d ∈ Date.

$$interPostDistance(b, d) = \frac{\sum_{p \in LatestPosts_{b,d}} topicalDistance(p, predecessor(p))}{|LatestPosts_{b,d}|} \qquad (7)$$

interPostDistance(b, d) is the average topical distance of two successive posts among the latest posts of a blog. It returns high values for very inconsistent blogs and low values for very consistent blogs. To give consistent blogs a high inter-post consistency score, the score is defined as the inverse of interPostDistance(b, d), as shown in Eq. 8.

$$interPostConsistency(b, d) = \frac{1}{interPostDistance(b, d)} \qquad (8)$$
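
A minimal Python sketch of Eqs. 6 to 8 is given here for illustration. The function names, the use of integers as publishing dates, and the example topic vectors are assumptions. Two cases that the equations leave open are handled by assumption as well: the very first post of a blog has no predecessor and contributes no distance, and a distance of zero yields an infinite consistency.

```python
import math


def topical_distance(u, v):
    """Eq. 6: Euclidean distance between two topic vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def inter_post_distance(posts, d):
    """Eq. 7. posts is a list of (published_date, topic_vector), oldest first."""
    latest = [i for i, (date, _) in enumerate(posts) if date >= d]
    if not latest:
        raise ValueError("the blog has no posts published on or after d")
    total = 0.0
    for i in latest:
        if i > 0:  # assumption: a blog's very first post contributes no distance
            total += topical_distance(posts[i][1], posts[i - 1][1])
    return total / len(latest)


def inter_post_consistency(posts, d):
    """Eq. 8: inverse of the inter-post distance."""
    distance = inter_post_distance(posts, d)
    # Assumption: identical successive topic vectors give an infinite score.
    return math.inf if distance == 0 else 1.0 / distance


# Made-up blog with four posts over two topics; the post of day 1 is outdated.
posts = [(1, [1.0, 0.0]), (2, [1.0, 0.0]), (3, [0.0, 1.0]), (4, [0.0, 1.0])]
distance = inter_post_distance(posts, 2)
consistency = inter_post_consistency(posts, 2)
```

In the example, only the topic change between day 2 and day 3 contributes a distance, which is then averaged over the three latest posts.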

4.2 Internal Consistency of Posts (Intra-Post)

The intra-post consistency focuses on the inner consistency of one post. It is high

if a blog author focuses on one single topic and does not change the subject while

writing one single post. Thus, it favors self-contained and complete posts that

do not cover several topics. A consistent post should handle just a few topics,

but discuss them in more detail.

The intra-post consistency is very similar to the inter-post consistency except

that it operates on the sections of posts. Each post is subdivided into sections

by splitting the post’s content by each occurrence of more than one line break or

HTML separator.
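
The subdivision into sections can be sketched with a regular expression. The sketch rests on two assumptions: "more than one line break" is read as two or more consecutive newlines (possibly with whitespace in between), and the `<hr>` tag stands in for the HTML separators, which the text does not enumerate. The function name `split_sections` is hypothetical.

```python
import re

# Assumption: a section ends at a blank line or at an <hr> tag.
SECTION_BREAK = re.compile(r"\r?\n\s*\n|<hr\s*/?>", re.IGNORECASE)


def split_sections(content):
    """Return the non-empty sections of a post's content."""
    return [part.strip() for part in SECTION_BREAK.split(content) if part.strip()]


sections = split_sections("Intro text.\n\nSecond part.<hr/>Third part.\nstill third")
```

A single line break does not start a new section, so the last two lines of the example stay together in one section.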

Each section gets assigned one topic vector. The components of this topic vector

represent the probability to which a section is about a specific topic.

Two additional concepts need to be defined before formalizing the intra-post consistency. Firstly, Section is the set of all sections in the data set and $Section_p \subset Section$ is the set of all sections of one specific post p ∈ Post. Secondly, predecessor(s) with s ∈ Section is the function that returns the preceding section of one section s.

Further, the function $topicalDistance(s_i, s_j)$ with $s_i, s_j \in Section$ is defined in the same manner as Eq. 6.

$$intraPostDistance(p) = \frac{\sum_{s \in Section_p} topicalDistance(s, predecessor(s))}{|Section_p|} \qquad (9)$$

The intra-post distance is also defined for a whole blog. It is the mean of all

distance values of the latest posts.

$$intraPostDistance(b, d) = \frac{\sum_{p \in LatestPosts_{b,d}} intraPostDistance(p)}{|LatestPosts_{b,d}|} \qquad (10)$$

Thereby, the intraPostConsistency(b, d) is defined as the inverse intra-post dis-

tance to provide consistent blogs with a high score (see Eq. 11).

$$intraPostConsistency(b, d) = \frac{1}{intraPostDistance(b, d)} \qquad (11)$$

4.3 Consistency between Posts and Classification (Intra-Blog)

The intra-blog consistency serves as a measure for the quality of a blog’s clas-

sification. It evaluates to which extent the content of posts is consistent with

tags and categories that form the classification system of a blog. As discussed in

Sec. 2.1, tags and categories are very important for the orientation of a user and

the navigation through the blog. It is crucial that blog authors choose tags and

categories wisely and appropriate to their content. In addition, spam blogs tend

to overuse tags and categories to earn a higher rank in blog search engines for a

high number of keywords. These low quality blogs and spam blogs get a very

low intra-blog consistency score.

For a high consistency, tags and categories should span the same topic distribution as the overall content of a blog.

The intra-blog consistency is based on the distance between the topic vector of each post and the topic vector of the post’s classification system.

Before defining the intra-blog consistency, additional concepts need to be formally defined. Tag is the set of all tags and Category is the set of all categories in the data set. Further, $Tag_p$ and $Category_p$ with p ∈ Post are the sets of tags and categories of one post. The set $Classification_p$ is defined as the union of the categories and tags of one post p.

$$Classification_p = Tag_p \cup Category_p \qquad (12)$$

Given the classification of each post, $Classification_p$, and the set of all posts in a blog, $Post_b$, the intra-blog distance is defined as the average topical distance between each post and its classification (see Eq. 13).

$$intraBlogDistance(b) = \frac{\sum_{p \in Post_b} topicalDistance(Classification_p, p)}{|Post_b|} \qquad (13)$$

Finally, the intraBlogConsistency(b) is defined as shown in Eq. 14.

$$intraBlogConsistency(b) = \frac{1}{intraBlogDistance(b)} \qquad (14)$$

A low value of intraBlogConsistency(b) indicates a mismatch between the clas-

sification and the actual content. Thus, the quality of the blog is questionable

and it is supposed to be of a lower rank.

4.4 Consistency of Linking and Linked Blogs (Inter-Blog)

Finally, the inter-blog consistency serves as a context-based consistency metric.

It measures the consistency between the blog’s content and the content of link-

ing and linked blogs. Thus, it measures whether a blog is part of an expert

community. An expert community is a set of blogs that focus on one topic and

discuss this topic interactively. For example, during the Arab spring one single

blog starts the discussion and other blogs build an active discussion around this

initial blog [5].

Among other motivations, the followers of blogs have two targets: First, they

like to spread the word of the referenced blog author to widen the reach of the


message. Second, referencing blog authors want to discuss the message and get

into an active discourse with the referenced blog author. Those discourses are

the essence of the blogosphere. Similar to Wikipedia, blog authors increase the

information quality by evaluating and iterating posts of each other.

As already discussed for the BI-Impact score, blogs have a set of special link

types, but only a few of them are actual interaction links and not only friendly

links or advertisements (see Sec. 2.1).

Blogroll links and other links that are not located in posts or comments have no evaluating or commenting nature. In contrast, if a blog author links from a post directly to a post of another blog author, he indicates a reply or a similar reaction like a reference. Further, comment authors can also link to other posts; this is formally regarded as a linkback. Linkbacks are also indicators for an active discourse between two blogs. These two kinds of links, linkbacks and links from posts, are interaction links.

The inter-blog consistency defines the consistency of a blog and blogs that link

or are linked via an interaction link.

The post linking post relation (PLP) contains the tuple $(p_i, p_j)$ with $p_i, p_j \in Post$ if $p_i$ has an interaction link to $p_j$. The set $IP_{p_i}$, incoming posts, with $p_i \in Post$ is defined as follows:

$$IP_{p_i} = \{\, p_j \mid p_j \in Post \wedge (p_j, p_i) \in PLP \,\} \qquad (15)$$

In parallel, the set $OP_{p_i}$, outgoing posts, with $p_i \in Post$ is defined.

$$OP_{p_i} = \{\, p_j \mid p_j \in Post \wedge (p_i, p_j) \in PLP \,\} \qquad (16)$$

Incoming links cannot be controlled by the blog author. Hence, two constants α and β introduce a weighting for incoming and outgoing posts.

The postContextDistance(p) with p ∈ Post is defined as the weighted sum of the average distance to all incoming posts and the average distance to all outgoing posts (see Eq. 17).

$$postContextDistance(p) = \alpha \cdot \frac{\sum_{j \in IP_p} topicalDistance(p, j)}{|IP_p|} + \beta \cdot \frac{\sum_{j \in OP_p} topicalDistance(p, j)}{|OP_p|} \qquad (17)$$

A typical weighting is α = 0.6; β = 0.4 to slightly emphasize incoming links for

their unbiased nature.
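
The weighted sum of Eq. 17 can be sketched in Python as follows. The function names and the example topic vectors are made up for illustration. The sketch also assumes that a post without incoming (or without outgoing) interaction links contributes zero for that term, a case that Eq. 17 leaves undefined.

```python
import math


def topical_distance(u, v):
    """Euclidean distance between two topic vectors (Eq. 6)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def mean_distance(vector, others):
    """Average distance of one topic vector to a list of other topic vectors."""
    if not others:
        return 0.0  # assumption: no linked posts means no contribution
    return sum(topical_distance(vector, other) for other in others) / len(others)


def post_context_distance(vector, incoming, outgoing, alpha=0.6, beta=0.4):
    """Eq. 17 with the typical weighting alpha = 0.6 and beta = 0.4."""
    return alpha * mean_distance(vector, incoming) + beta * mean_distance(vector, outgoing)


# Made-up example: two posts link to the post, and the post links to one other post.
post = [1.0, 0.0]
incoming_posts = [[1.0, 0.0], [0.0, 1.0]]
outgoing_posts = [[1.0, 0.0]]
context_distance = post_context_distance(post, incoming_posts, outgoing_posts)
```

In the example, only one of the two incoming posts covers a different topic, so the result is 0.6 times half the distance between the two unit vectors.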

The interBlogDistance(b, d) with b ∈ Blog and d ∈ Date is defined in Eq. 18. The

inter-blog distance calculation considers only the latest posts due to the transient

nature of the blogosphere.

$$interBlogDistance(b, d) = \frac{\sum_{p \in LatestPosts_{b,d}} postContextDistance(p)}{|LatestPosts_{b,d}|} \qquad (18)$$

Analogously to the other three aspects, the interBlogConsistency(b, d) is defined

as the inverse interBlogDistance(b, d) (see Eq. 19).

$$interBlogConsistency(b, d) = \frac{1}{interBlogDistance(b, d)} \qquad (19)$$

4.5 Combined Topic Consistency Rank

Finally, the topic consistency rank is defined as the combination of all four facets.

All facets are combined by calculating a weighted sum for each blog.

The topicConsistency(b, d) with b ∈ Blog and d ∈ Date is defined in Eq. 20. The four constants χ, δ, ε, and γ give a weighting for each component of the topic consistency rank.

$$\begin{aligned}
topicConsistency(b, d) = {} & \chi \cdot interPostConsistency(b, d) \\
 & + \delta \cdot intraPostConsistency(b, d) \\
 & + \varepsilon \cdot intraBlogConsistency(b) \\
 & + \gamma \cdot interBlogConsistency(b, d)
\end{aligned} \qquad (20)$$

The weighting can be varied according to the characteristics of the analyzed data set. Because of the low usage of categories and tags in the BlogIntelligence data set and the high usage of content summaries in the posts’ content, the weights used in this thesis are: χ = 0.3; δ = 0.2; ε = 0.2; γ = 0.3.

The final topic consistency rank is calculated by normalizing the results of the topicConsistency function over all considered blogs. Through this normalization, the values lie in the interval [0, 1], which is a common approach for rank normalizations [53].
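
The combination and normalization can be sketched as follows. The weights are the ones given above; everything else is an assumption for illustration: the facet values of the three blogs are made up, and min-max scaling is used as the normalization because the text does not name the concrete method.

```python
# Weights from the text: chi, delta, epsilon, gamma.
WEIGHTS = {"inter_post": 0.3, "intra_post": 0.2, "intra_blog": 0.2, "inter_blog": 0.3}


def topic_consistency(facets, weights=WEIGHTS):
    """Eq. 20: weighted sum of the four consistency facets of one blog."""
    return sum(weights[name] * facets[name] for name in weights)


def normalize(scores):
    """Map raw scores to [0, 1]; min-max scaling is an assumption, not the thesis method."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {blog: 0.0 for blog in scores}
    return {blog: (score - low) / (high - low) for blog, score in scores.items()}


# Made-up facet values for three blogs.
facets = {
    "blog_a": {"inter_post": 2.0, "intra_post": 1.5, "intra_blog": 1.0, "inter_blog": 2.5},
    "blog_b": {"inter_post": 1.0, "intra_post": 1.0, "intra_blog": 1.0, "inter_blog": 1.0},
    "blog_c": {"inter_post": 4.0, "intra_post": 2.0, "intra_blog": 3.0, "inter_blog": 1.0},
}
raw_scores = {blog: topic_consistency(values) for blog, values in facets.items()}
rank = normalize(raw_scores)
```

With min-max scaling, the least consistent blog of the considered set gets the rank 0 and the most consistent one the rank 1.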


5 Implementation of Topic Detection

As mentioned in Sec. 4.1, all topic consistency metrics depend on topic term sets.

To find topics and assign terms to topic term sets, the topic detection procedure,

shown in Fig. 8, is implemented.

Figure 8: Flow diagram of the topic detection. (Steps: 1. Download Post, 2. Parse Content, 3. Extract Terms in the BlogIntelligence Crawler; 4. Calculate Tf*Idf, 5. Build Word Vectors in the SAP HANA Database; 6. Run k-Means, 7. Write Word Clusters in the Apache Mahout Analyzer.)

5.1 Prerequisites

There are several steps necessary before running the actual clustering algorithm,

which creates the topic term sets. The preprocessing covers steps 1-5 of the topic

detection flow (see Fig. 8).

Step 1. First of all, the BI crawler harvests the blogosphere. It stores all data of

blogs into the SAP HANA database. The crawler traverses the blog link graph

and downloads every blog post. Immediately after downloading, the crawler

parses the downloaded HTML files (see Fig. 8).



Step 2. The parsing includes the removal of non-textual content like images

and videos. Further, it removes markups like HTML tags. After parsing a web

page, the crawler stores the pure text content as a character large object (CLOB)

into the database.

Step 3. The Nutch crawling cycle is extended by a new component that allows a word extraction on the text of posts. During this extraction, the crawler first segments the text into words. This is done by splitting on non-word characters. Afterwards, the extraction component removes all stop words from the word set. Stop words are the most common words of a language, such as the, is, at, and on. The component uses the stop word lists from the Weka18 project. Weka is a collection of machine learning algorithms for data mining tasks.

The word set is still redundant. It contains inflected or derived words. Thus,

a stemming of words is applied to reduce the words to their stem form. The

extraction component incorporates the stemmers of the Weka framework which

provides stemmer classes for various languages like German.
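
The three parts of the word extraction (splitting, stop word removal, stemming) can be sketched in Python. This is only an illustration under assumptions: the tiny stop word list and the naive suffix stripping in `naive_stem` are hypothetical stand-ins for the Weka stop word lists and stemmer classes, which the sketch does not use.

```python
import re

# Tiny illustrative stop word list; it is not the list shipped with Weka.
STOP_WORDS = {"the", "is", "at", "on", "a", "and", "of"}


def naive_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(text):
    """Split on non-word characters, drop stop words, and stem the remaining words."""
    words = [w.lower() for w in re.split(r"\W+", text) if w]
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


terms = extract_terms("The crawler is parsing the downloaded posts")
```

The result keeps repeated stems, so the raw term frequency of a post can be counted directly from the returned list.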

The preprocessing of the crawler assigns to each post the set of word stems. This set of words is stored in a separate table in the database, called the dictionary table.

The word extraction process is actually a common feature among text databases

like Apache Lucene19. Although SAP HANA already contains a word count

matrix, which is the dictionary table for the topic detection, this matrix is not

accessible via an application interface (API).

In contrast, the next two steps are directly performed in the database.

Step 4. An SQL procedure calculates the tf*idf values for each word. SQL

procedures have the advantage that they can directly access the data in memory

without transferring them for processing. The implementation follows Eq. 4.

18 http://www.cs.waikato.ac.nz/ml/weka/
19 http://lucene.apache.org

Step 5. Further, the database is used to create the word vectors for each post

and the post vectors for each word. The latter are used for the clustering of

words that finally produces the desired topics. The vectors are computed by an

SQL view that directly refers to the basic web page table and the result table of

the tf*idf calculation. An example result of the view is shown in Tab. 1.

post id   word id   tf*idf
p4        w5        tfidf_{4,5}
p7        w8        tfidf_{7,8}
p5        w5        tfidf_{5,5}
p8        w8        tfidf_{8,8}
...       ...       ...

Table 1: Example tf*idf vectors resulting from the SQL view.

With step 5 the preprocessing is completed, and all vectors can be loaded into the Hadoop Distributed File System (HDFS) used by Mahout. This is implemented by a tailor-made class for the BlogIntelligence analytics. It uses the adapted object-relational mapper (ORM), Apache Gora, to access the tf*idf vector view of HANA and to transfer all vectors to HDFS. These vectors are the word vectors with posts as dimensions.

Two example vectors are shown in Tab. 2. Mahout uses a sparse vector im-

plementation. Sparse vectors are specially designed for document-word vectors

that are only sparsely filled. Sparsely filled means that most of the vector compo-

nents are zero because words only appear in a small set of documents compared

to the overall collection.
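
The idea of a sparse word vector can be illustrated with plain Python dictionaries. The sketch does not use Mahout's vector classes; the function name `build_word_vectors` and the numeric weights are made up. It turns rows of the form (post id, word id, tf*idf), as delivered by the view, into one sparse vector per word that stores only the non-zero components.

```python
from collections import defaultdict


def build_word_vectors(rows):
    """Group (post_id, word_id, weight) rows into {word: {post: weight}}."""
    vectors = defaultdict(dict)
    for post_id, word_id, weight in rows:
        if weight != 0:  # zero components are simply not stored
            vectors[word_id][post_id] = weight
    return dict(vectors)


# Made-up rows; the weights are arbitrary example values.
rows = [("p4", "w5", 0.7), ("p7", "w8", 1.2), ("p5", "w5", 0.3), ("p6", "w5", 0.0)]
word_vectors = build_word_vectors(rows)
weight_p6 = word_vectors["w5"].get("p6", 0.0)  # missing components read as zero
```

Each word vector needs memory only for the posts in which the word actually occurs, which is what makes the representation practical for a large post collection.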

5.2 Clustering

The two last steps are executed by the adapted Mahout framework (see Sec. 2.5).

Mahout offers various clustering algorithms like mean shift clustering, spectral

clustering, latent Dirichlet allocation, and k-means clustering [36].

        w5            w8
p4      tfidf_{4,5}   0
p5      tfidf_{5,5}   0
p6      0             0
p7      0             tfidf_{7,8}
...     ...           ...

Table 2: Sparse word vectors from HDFS.

The current implementation of SAP HANA does not support a clustering that is applicable to the high number of dimensions created by the word vectors.

In L20, the total size of the post-word matrix is limited to the maximum integer value. This value is too small for the word-post matrix of approximately 1,000,000 by 500,000 entries. Thus, the L API is not applicable for the clustering task.

Another alternative is the R21 integration of HANA. R is a programming lan-

guage and software environment with special focus on statistical calculation.

Besides clustering algorithms, R supports various algorithms like time-series

analysis and statistical tests. The road block for R is also the massive amount

of data. The database needs to transfer all vectors to an external R component.

This process also fails due to the high transportation cost.

To sum up, until the integration of advanced text analysis algorithms in HANA

is completed, the external analysis framework Apache Mahout is used.

As discussed in Sec. 4.1, the topic consistency rank relies on a 1:n relation be-

tween words and topics. This approach simplifies the prototypical implementa-

tion, because it does not require a complex clustering technique based on prob-

ability distributions. Advanced, more complex clustering techniques are subject

to further research (see Sec. 8).

20 http://wiki.tcl.tk/17068
21 http://www.r-project.org/

Step 6. k-means is a well-known algorithm for clustering objects that creates pairwise disjoint clusters. All objects need to be represented as numerical feature vectors. In this case, the objects are the words that are grouped into topic term sets. The components of a word's feature vector are the tf*idf values of this word in each crawled post. The k in k-means denotes the user-defined number of clusters, which is also an input of the algorithm. Each feature vector is a point in an n-dimensional space, with n being the number of posts.

The algorithm operates as illustrated in Fig. 9.

k-means randomly chooses k points in the n-dimensional space that serve as the initial cluster centers, called centroids (see Fig. 9 A). In the next phase, each word is assigned to the closest centroid, i.e. the centroid with the minimal distance to the feature vector of the word (see Fig. 9 B). Various distance measures can be applied depending on the data set to be clustered. As discussed in Sec. 4.1, the established Euclidean distance serves as the distance measure.

After the words are assigned to centroids, each cluster gets a new centroid, which is calculated by averaging the feature vectors of all words assigned to this cluster (see Fig. 9 C). Assigning words and computing new centroids is repeated until the algorithm converges. Convergence is reached when the centroid movement falls below a predefined threshold.
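The three phases described above can be sketched in a few lines of plain Python. This is a single-machine illustration under stated assumptions and not Mahout's MapReduce-based KMeansDriver: the function names, the default threshold, and the choice of drawing the initial centroids from the input points (instead of arbitrary points in the space) are simplifications made here.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two dense vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(points, k, threshold=1e-6, max_iterations=100, seed=0):
    """Cluster points into k groups; returns (assignment, centroids)."""
    rng = random.Random(seed)
    # A) Initial centroids: k distinct input points (assumed simplification).
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(max_iterations):
        # B) Assign every point to its closest centroid.
        assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # C) Recompute each centroid as the average of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            # An empty cluster keeps its previous centroid.
            new_centroids.append(mean(members) if members else centroids[c])
        movement = max(euclidean(o, n) for o, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if movement < threshold:  # converged: centroids barely moved
            break
    return assignment, centroids
```

In the setting of this thesis, each point would be the tf*idf vector of one word over all posts, so that `assignment` corresponds to the word-to-topic mapping.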

Figure 9: An example iteration of k-means (∆ - centroids; x - points). A) Random centroids. B) Assign clusters. C) Compute new centroids.



Mahout’s version of k-means is implemented by the KMeansDriver class. Esteves

et al. [54] describe the performance of this implementation. They highlight that

the Mahout implementation scales with increasing data set size and increasing

number of computing nodes.

After each iteration, the KMeansDriver stores the new centroids in HDFS. After the completion of all iterations, Mahout runs an extra job that writes the clustered points, i.e. the word-to-topic assignment, to the file system.

Step 7. This assignment can be read with the cluster writer module of Mahout. An additional class, called HANAClusterWriter, is implemented that transfers the clustered points to the HANA database. It is not a MapReduce job because it only transfers the data sequentially from HDFS to the database.

word id    cluster id
4          1
8          1
2          3
...        ...

Table 3: Resulting cluster table.

An example of the resulting table is shown in Tab. 3. The choice of the feature

vector is crucial for the meaning of the clustering results. By selecting the tf*idf

values in each post for each word, words are grouped together that frequently

appear in the same post. Thus, words with a similar meaning are assigned to

the same cluster [10].

These word groups are the topic term sets used for the calculation of the topical

distance. The granularity of the topics is dependent on the user-defined number

of clusters k. As proposed by Abe et al. [55], the aim is to find clusters with

around 100 words per cluster.

In the evaluation (see Sec. 7), different settings for k and the number of iterations are tested to achieve an average cluster size of 100.


6 Implementation of the Topic-Consistency Rank

This section presents the implementation details of the topic-consistency rank. The rank is completely integrated into the database and relies only on basic SQL constructs.

The theoretical foundations for each of the underlying partial scores are already

discussed in Sec. 4. Each score implementation consists of a combination of SQL

views, permanent and temporary tables. The combined score for each blog is

the weighted sum of the single scores (see Sec. 4.5).

6.1 Intra-Post Consistency

To calculate the intra-post consistency, an additional tf*idf calculation view based on paragraphs is implemented. Like the regular tf*idf view (see Sec. 5.1), this view is based on the dictionary tables, which are the result of the word extraction phase of the topic detection. An example dictionary table is shown in Tab. 4. For each word of a post, a row is created that contains the word, the post id, and the position of the word in the post.

word     post       position
hello    postid1    0
world    postid1    1
...      ...        ...

Table 4: The dictionary table maps words to the containing posts and positions.

To create tf*idf values based on paragraphs, all words within a window of fixed size are regarded as one paragraph. The window size is set to 100 words, based on the average length of a paragraph, which is 100-150 words [56].

The calculation is a direct implementation of the formal definition (see Sec. 4.2). It creates a join between all succeeding sections. The result of this join is the set of tf*idf values for each section and each occurring word. Afterwards, these tf*idf values are joined with the cluster table. The score for each cluster is calculated by summing up the tf*idf values per cluster.

Next, the topical differences of the sections are calculated by joining the sections of each post on the topic cluster. The topical distance of two sections is the Euclidean distance of their cluster scores, i.e. the square root of the sum of the squared differences for each cluster. The intra-post distance on post level is the average of the section distances. Based on the post-level distance, the blog-level distance is calculated by averaging the intra-post distance values of all posts of a blog. Finally, the intra-post score is computed by inverting the intra-post distance.
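The chain of aggregations described above can be followed more easily in a small in-memory sketch. The following Python code only mirrors the logic of the SQL views and is not the implementation used in the database; all function names and data shapes are assumptions. The final inversion of the distance into a score is left out because its exact form is given by the formal definition in Sec. 4.2.

```python
import math
from collections import defaultdict

WINDOW = 100  # words per section, as chosen in the text


def sections(words, window=WINDOW):
    """Split the word sequence of a post into fixed-size sections."""
    return [words[i:i + window] for i in range(0, len(words), window)]


def topic_vector(tfidf, cluster_of):
    """Sum the tf*idf values of a section per topic cluster.

    tfidf maps word -> tf*idf value, cluster_of maps word -> cluster id.
    Words without a cluster are ignored.
    """
    vector = defaultdict(float)
    for word, value in tfidf.items():
        if word in cluster_of:
            vector[cluster_of[word]] += value
    return dict(vector)


def topical_distance(a, b):
    """Euclidean distance of two sparse topic vectors."""
    clusters = set(a) | set(b)
    return math.sqrt(sum((a.get(c, 0.0) - b.get(c, 0.0)) ** 2 for c in clusters))


def intra_post_distance(section_vectors):
    """Average distance between succeeding sections of one post."""
    pairs = list(zip(section_vectors, section_vectors[1:]))
    if not pairs:  # a post with a single section has no internal distance
        return 0.0
    return sum(topical_distance(a, b) for a, b in pairs) / len(pairs)


def blog_intra_post_distance(post_distances):
    """Average of the post-level distances of one blog."""
    return sum(post_distances) / len(post_distances)
```

A post whose sections all fall into the same clusters with similar weights yields a distance close to zero, which corresponds to a high intra-post consistency.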

To sum up, the intra-post score calculation is a combination of nine joins and four aggregations in the database. The mapping from ids to words and URIs and vice versa introduces the most complexity to this operation. Furthermore, the intra-post rank is the most fine-grained rank with respect to the size of the tf*idf view results.

6.2 Inter-Post Consistency

The inter-post consistency builds upon the tf*idf view based on posts, called

post-tf*idf, which is also used by the topic clustering (see Sec. 5.1). Posts are

objects in the database and thus do not require an additional segmentation.

To get succeeding posts, each post is joined with the post that has the earliest later publishing date. After this join, the topic vector differences of each post and its successor can be computed. By grouping by post, the Euclidean distances between all succeeding posts are calculated. Afterwards, the average of all distances yields the inter-post distance and thus the inter-post consistency score of a blog.

This operation is very similar to the intra-post consistency calculation, except that it is based on the latest posts. The selection of the latest posts is implemented as a simple WHERE condition on the post publishing date.
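The pairing of succeeding posts can be illustrated with a short Python sketch. It is an in-memory analogue of the SQL join under stated assumptions, not the database implementation: posts are assumed to be given as pairs of publishing date and sparse topic vector, and the parameter `since` stands in for the WHERE condition on the publishing date.

```python
import math
from datetime import date


def euclidean(a, b):
    """Euclidean distance of two sparse topic vectors (cluster id -> score)."""
    clusters = set(a) | set(b)
    return math.sqrt(sum((a.get(c, 0.0) - b.get(c, 0.0)) ** 2 for c in clusters))


def inter_post_distance(posts, since=None):
    """Average distance between each post and its successor.

    posts is a list of (publish_date, topic_vector) pairs of one blog.
    Only posts published on or after `since` are considered.
    """
    recent = sorted(
        (p for p in posts if since is None or p[0] >= since),
        key=lambda p: p[0],
    )
    pairs = list(zip(recent, recent[1:]))  # each post with its successor
    if not pairs:
        return 0.0
    return sum(euclidean(a[1], b[1]) for a, b in pairs) / len(pairs)


# Illustrative data: two posts on the same topic followed by a topic change.
example_posts = [
    (date(2012, 1, 3), {2: 4.0}),
    (date(2012, 1, 1), {1: 3.0}),
    (date(2012, 1, 2), {1: 3.0}),
]
```

Sorting by date and zipping the list with itself shifted by one position plays the role of the self-join on the next publishing date.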


6.3 Intra-Blog Consistency

The intra-blog consistency calculates the distance between the classification of

each post and its content. Therefore, it uses the post-tf*idf view to get the term

importance values for the content. Further, it uses a tf*idf view based on the

classification system, called class-tf*idf. This view returns the importance values

for each term used in tags or categories.

The intra-blog consistency on post-level is calculated by the topical distance of

the post’s classification and the post’s content vector. Finally, all topical dis-

tances are combined by performing an average operation for each blog.

To accelerate the calculation, the tf*idf vectors are persisted as temporary column tables. Thereby, a join between vectors can be performed as a column search operation in the SAP HANA database, which is the fastest way of joining [33].

Further, blogs do not get an intra-blog consistency if they are not using tags

or categories. These blogs are regarded as inconsistent with their non-existing

classification system. Thus, they are assigned the minimal score, i.e. zero.

6.4 Inter-Blog Consistency

The context-based consistency, called inter-blog consistency, of a blog is based

on its linking and linked blogs.

To calculate this score a join with the biggest table of the data set, the link ta-

ble (see Tab. 5), is necessary. This table consists of the linking and linked blog

URIs and the corresponding link type, which represents whether a blog links to

another blog via a post or a comment.

To calculate the topical distance between all outgoing and incoming links, the blog-topic-probability table is joined with the link table. This is the most costly operation for the data set because the link table is growing rapidly and currently contains around 160 million rows.

After the join computation, the post context distances can be calculated. By grouping by blog, the inter-blog consistency score is computed as defined in Eq. 19.

linking post           linked post           link type
spreeblick.de?p=22     netzwertig.de?p=31    via post
carta.info?p=12        spreeblick.de?p=26    via comment
promicabana.de?p=76    gesichtet.net?p=3     via post
...                    ...                   ...

Table 5: Example rows of the link table.

6.5 BI-Impact Score

As discussed in Sec. 3.2, BlogIntelligence implements a blog ranking metric called BI-Impact score as a proof-of-concept prototype. To evaluate the topic consistency metrics against a blog-specific ranking, the BI-Impact score is transferred to SAP HANA.

The score contains two components: the blog interaction and the post interac-

tion. These components are also calculated as SQL views. The calculation re-

quires numerous joins over the link table to calculate the partial rank for each

distinct link type.

The BI-Impact score is calculated by a recursive algorithm. It needs multiple

iterations until the rank converges. After each iteration, a temporary table stores

the ranks for each blog and serves as input for the next iteration.

The whole calculation spans a complex query tree that contains about 52 join operations. Although the majority of tables have a low number of rows, the usage of the link table introduces a high complexity.

Listing 1 shows the simplified code for one of the basic views of the rank calculation. This view creates a score for each post based on the scores of all blogs with incoming links. It differentiates between the various link locations, or link types, of the incoming links. The final rank is calculated as the weighted sum over the different link types [11].

Listing 1: SQL view creates post score per link type

    -- Average score of all linking blogs per post and link type.
    CREATE VIEW postScoreByLinkType AS
    SELECT post, linktype, AVG(scoreOfIncomingBlogs) AS score
    FROM postByIncomingPostAndLinkType AS inBlog
    JOIN normalizedBiImpactScore AS score
      ON score.host = inBlog.host
    GROUP BY post, linktype;


7 Evaluation

This section discusses the results and the plausibility of the topic consistency rank. To this end, the evaluation shows the results of the partial ranks and the overall rank, and compares them to the results of the BI-Impact score.

7.1 Experimental Setup

For the evaluation of this master’s thesis, we activated the BlogIntelligence crawler for one month. The crawler runs on an 8-core machine with 24 gigabytes of RAM running Ubuntu Linux. The harvested data is stored on a separate database machine with 32 cores and 1 terabyte of RAM running SUSE Linux. This machine also runs the analytical SQL queries.

The cluster setup for the topic detection consists of 12 machines with 2 cores and 4 gigabytes of RAM each. These machines are grouped into one Hadoop cluster that is configured to run 50 parallel tasks.

The key data indicators of the data set are shown in Tab. 6.

Indicator                                Value (approx.)
data set size                            500 GB
crawled web pages                        2.5 million
identified blogs                         12,000
identified posts                         600,000
average words per post                   57.5
average number of categories per post    2.6
average number of tags per post          4.2
number of news portals                   1,300

Table 6: State of the BlogIntelligence data set.


7.2 Clustering

The quality of the underlying clustering is crucial for the quality of the topic consistency rank. In particular, the size of the clusters determines whether blogs with diverse interests wrongly receive a good consistency rank.

The k-means clustering of the Mahout implementation runs on the cluster setup. The runtime depends on the number of iterations and the number of desired clusters. It varies between 8 and 20 minutes per iteration. However, the topic detection only has to be repeated if the number of words changes significantly.

After the term extraction procedure, the data set contains 450 000 words. The resulting matrix for words and posts consists of 2.7 billion tf*idf values. Most of the values are zero. Therefore, Mahout uses a sparse vector representation that results in a matrix size of only 144 megabytes.

For the clustering, four different variants are evaluated. The indicators for the

quality of the clusterings are shown in Tab. 7.

                                 Variant 1   Variant 2   Variant 3   Variant 4
Parameters:
  k                                    100      10 000      10 000      20 000
  iteration                             10          10          40          40
Results:
  maximum cluster size             448 546     419 453     187 093      21 234
  minimum cluster size                   1           1           1           1
  number of filtered clusters           52       5 398       4 419      18 546
  minimum filtered cluster size          2           2           2           2
  maximum filtered cluster size         37          83          52         383
  average filtered cluster size       8.73        4.55        3.86        10.1

Table 7: Quality of the tested clustering configurations.

The number of filtered clusters is always below the number of clusters actually calculated by k-means, i.e. k. This is caused by the filtering of clusters that are too small or too large. The filtering is conservative. It removes clusters with a size of one, which avoids expensive and overly specific word distance calculations. Further, clusters with more than 1,000 words are ignored because the word diversity of such clusters harms the validity of the topic consistency rank.
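The filter rule can be written down as a small Python sketch. The thresholds are taken from the text, whereas the function name and the dictionary-based representation of the cluster table (word to cluster id) are assumptions for illustration.

```python
from collections import Counter

MIN_SIZE = 2     # clusters with a single word are removed
MAX_SIZE = 1000  # clusters with more than 1,000 words are ignored


def filter_clusters(cluster_of):
    """Keep only the words whose cluster has an acceptable size.

    cluster_of maps word -> cluster id, as in the resulting cluster table.
    """
    sizes = Counter(cluster_of.values())
    kept = {c for c, size in sizes.items() if MIN_SIZE <= size <= MAX_SIZE}
    return {word: c for word, c in cluster_of.items() if c in kept}
```

The number of distinct cluster ids in the result corresponds to the number of filtered clusters reported in Tab. 7.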

Variant 1 creates 100 clusters with a maximum cluster size of 448 546 words.

These words cannot be considered, because the cluster size is larger than 1,000.

Thereby, only 1 500 words are grouped into meaningful clusters. With an aver-

age cluster size of 8.73, there are enough words per cluster to describe a topic.

Variant 1 creates too few clusters. Therefore, the number of clusters is increased to 10 000 in variant 2. Although it creates more than 5 000 filtered clusters, the average cluster size halves and the number of unused words in the biggest cluster decreases only negligibly. Hence, variant 3 increases the number of iterations to get a better word distribution among the clusters.

Unexpectedly, the number of filtered clusters decreases for variant 3. The size of the maximum cluster decreases and the average size of the filtered clusters also decreases. Consequently, variant 3 creates more clusters with a size over 1,000 than variant 2.

To further increase the number and average size of filtered clusters, variant 4

increases the number of created clusters. Variant 4 gives the best results in the

evaluation. It contains over 18,000 filtered clusters and the maximum cluster size

decreases to about 20,000. In addition, variant 4 has on average 10 words per

cluster, which is a far more promising distribution than all three other variants.

As a consequence of the clustering evaluation, the topic consistency rank calcu-

lation uses the filtered clusters of variant 4.

7.3 Results of the Topic Consistency Sub Ranks

The ten best blogs for each of the topic consistency sub ranks are calculated. The BI crawler focuses on the German blogosphere. Therefore, the majority of all blogs are German, and the top consistency blogs are German, too. For each of the sub ranks, two highly ranked representatives are introduced in detail.

The top ten blogs for the two post-related sub ranks are shown in Tab. 8.


Rank    Intra-Post             Inter-Post
1       promicabana.de         blog.de.playstation.com
2       dsds2011.info          upload-magazin.de
3       blog.beetlebum.de      blog.studivz.net
4       schockwellenreiter.    der-postillon.com
5       hornoxe.com            allfacebook.de
6       netbooknews.de         achgut.com
7       iphoneblog.de          gutjahr.biz
8       carta.info             elmastudio.de
9       blog.studivz.net       netzwertig.com
10      seo.at                 lawblog.de

Table 8: The top ten ranked blogs for intra-post and inter-post consistency.

One example of a high intra-post consistency is the dsds2011.info blog. The intra-post consistency gives the average internal consistency of the posts in a blog. dsds2011.info is a fan blog that follows a German TV show that aims to cast a new superstar. Therefore, each post mostly focuses on one person, e.g. the current candidate. Further, some posts discuss the performance of each candidate of a show. As a result, each paragraph of such a post focuses on another person but uses the same attributes to describe the performance.

Another blog with a high intra-post consistency is iphoneblog.de. Obviously, all posts cover news related to Apple’s iPhone. Each post of this blog contains on average five paragraphs, is carefully researched, and concentrates on one feature, game, or accessory of the iPhone. These special interests are examined thoroughly over several paragraphs of a post. As a consequence, the internal consistency of the posts is high.

A representative of a high inter-post consistency is the blog.de.playstation.com blog. This blog has a high topical consistency between its latest published posts. The main focus of this blog is on PlayStation games. It frequently publishes posts about the latest games, which are discussed regarding their game play, graphics, and story line. Each post presents a game in a similar structure and phrasing. Thus, the topical distance between these posts is very low and the topical consistency is very high.

Another highly ranked blog regarding the consistency between posts is allfacebook.de. It publishes posts about new features of the social network, discussions about privacy, and the latest news about Facebook. Although this blog handles these three topics, it usually publishes multiple posts per topic in a row. This decreases the distance between succeeding posts and boosts its inter-post consistency.

Rank    Intra-Blog             Inter-Blog
1       readers-edition.de     innenaussen.com
2       iphoneblog.de          shopblog author.de
3       eisy.eu                nachdenkseiten.de
4       karrierebibel.de       helmschrott.de
5       meinungs-blog.de       blog.studivz.net
6       dsds2011.info          fanartisch.de
7       macerkopf.de           achgut.com
8       kwerfeldein.de         internet-law.de
9       events.ccc.de          scienceblogs.de
10      mobiflip.de            events.ccc.de

Table 9: The top ten ranked blogs for intra-blog and inter-blog consistency.

The top ten blogs for the two blog-related sub ranks are shown in Tab. 9.

One example of a high intra-blog consistency rank is again the iphoneblog.de blog. This blog uses the post classification in an appropriate way. As mentioned above, the posts of this blog are carefully edited. An inspection of the blog’s content shows that each post contains, besides the common categories, at least six content-specific tags. This shows that a blog gains a high ranking for the intra-post and intra-blog consistency by carefully authoring its posts.

Another example is the macerkopf.de blog. In contrast to iphoneblog.de, the posts of this blog handle a higher variety of topics and comment more critically. For example, they frequently compare the iPhone against other mobile phones, so that a post covers at least two topics. Nevertheless, categories and tags address each topic of the post, which results in a high quality of the classification and in a high intra-blog consistency rank.

The inter-blog consistency measures the consistency of a blog with its linking and linked blogs. The best ranked blog for the inter-blog consistency is the innenaussen.com blog. This blog writes reviews about diverse beauty products. The blog link graph indicates that this blog mainly links to other product reviews, e.g. to reference another opinion on a product. Further, it is observable that it is also linked by review blogs on beauty products such as the lipglossladys.com blog.

The scienceblogs.de blog also has a high inter-blog consistency rank. This is caused by its link directory nature. It mainly collects and summarizes posts from other science-related blogs and provides an entry point into a science community. This blog mainly references the original content. Thereby, its summaries are very consistent with the linked content.

In addition, a comparison of all four sub ranks in Tab. 8 and Tab. 9 shows that blog.studivz.net has high consistency ranks for each sub rank except the intra-blog consistency. This blog writes about topics around a German social network called studiVZ. It is a typical corporate blog that describes news and new features of a company and the company’s products. The blog has highly consistent posts that discuss a topic over multiple paragraphs. It constantly posts about activities of the company and is linked by blogs that spread the news of the company. However, the posts of this blog are not tagged and are only categorized as allgemein (German for miscellaneous), which is a common default configuration of blog systems.

By investigating the top ten ranked blogs of each sub rank, two examples per sub rank are analyzed. The evaluation shows that the sub ranks create plausible results.


7.4 Comparison of BI-Impact and Combined Topic Consistency Rank

The weighted combination of all sub ranks is the combined topic consistency rank. It identifies the topically consistent blogs in the data set. Thereby, it creates a ranking of experts based on the consistency of their writing. In contrast, the BI-Impact aims to identify the most influential blog authors with the highest reach and popularity.

During the evaluation, both ranks are compared against each other to find pos-

sible correlations.

Blog                       Combined topic        BI-Impact
                           consistency rank
helmschrott.de             1                     85
gedankendeponie.net        2                     94
yuccatree.de               3                     104
upload-magazin.de          4                     96
nachdenkseiten.de          5                     117
events.ccc.de              6                     54
telemedicus.info           7                     118
bei-abriss-aufstand.de     8                     90
stereopoly.de              9                     87
annalist.noblogs.org       10                    88

Table 10: Top ten ranked blogs for the combined topic consistency rank with their BI-Impact rank

First, the top ten blogs concerning the combined topic consistency rank are in-

vestigated. As shown in Tab. 10, each top ten blog is listed with its ranking

position regarding both rankings.

The two sample blogs, yuccatree.de and telemedicus.info, have high combined topic consistency ranks. yuccatree.de has a low inter-post consistency value caused by the diversity of discussed topics. However, it has a high combined consistency score because the remaining three consistency sub ranks are very high. In contrast, the telemedicus.info blog focuses only on privacy and patent right discussions. Thus, it has a very high inter-post consistency, which in combination with the proper usage of tags results in a high combined topic consistency rank.

In contrast, both have a very low BI-Impact score. Thus, neither is identified as a highly influential blog, because their position in the blog link graph does not carry enough influence. The same can be seen for all other blogs of the top ten.

Blog                        Combined topic        BI-Impact
                            consistency rank
fuenf-filmfreunde.de        54                    1
sistrix.de                  97                    2
elektrischer-reporter.de    142                   3
t3n.de                      49                    4
scienceblogs.de             75                    5
fontblog.de                 37                    6
de.engadget.com             52                    7
achgut.com                  34                    8
schockwellenreiter.de       77                    9
saschalobo.com              35                    10

Table 11: Top ten ranked blogs for the BI-Impact rank with their combined topic consistency rank

Secondly, the top ten blogs regarding the BI-Impact rank are investigated. As

shown in Tab. 11, the blogs are ordered by the BI-Impact rank and listed with

their combined topic consistency rank.

By investigating three sample blogs, namely t3n.de, de.engadget.com, and saschalobo.com, it is observed that the most influential blogs deal with a high number of topics. These blogs summarize current events in technology or give their opinions on diverse political discussions.

Although these blogs contain high quality content, the number of discussed topics is very high. Further, the inter-blog consistency decreases through the number of different view points and the wide range of linking blog authors. The intra-post consistency also decreases by the usage of summary posts which summarize the news of a day.

Figure 10: BI-Impact and topic consistency rank for top 100 blogs ordered by topic consistency rank. (Line chart; x-axis: rank position 1-100; y-axis: normalized score 0-1; series: Consistency, BI-Impact.)

The exemplary analysis of the top ten implies an inverse relation between the topic consistency of a blog and its reach. Thus, the expectation is to find a negative correlation between the BI-Impact rank and the topic consistency rank.

To evaluate this, an analysis of the top 100 ranked blogs is done. The behavior

of both ranks is shown in Fig. 10 and Fig. 11.

In Fig. 10, the blogs are ordered by their position in the topic consistency ranking. The best blog gets rank position one. The topic consistency rank decreases monotonically with the ranking position. Contrary to the expectation, no correlation is observable between both ranks.
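The comparison in this thesis is done by visual inspection of the plots. As a possible complement, which is not part of the presented evaluation, a rank correlation coefficient such as Spearman's rho could quantify the relation between both scores. The following Python sketch uses the textbook formula for data without ties; the function names and the toy values in the usage note are assumptions.

```python
def ranks(values):
    """Rank positions of the values (1 = largest value); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(xs, ys):
    """Spearman's rank correlation of two equally long score lists without ties."""
    n = len(xs)
    squared_differences = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * squared_differences / (n * (n * n - 1))
```

For example, `spearman([0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.4])` evaluates to -1.0 for these made-up scores, i.e. a perfectly inverse ordering, whereas values close to zero would indicate that no monotonic relation exists.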

However, an accumulation of higher BI-Impact scores can be identified in the area of low consistency ranks. It appears that blogs that handle a higher diversity of topics gain more influence in the blogosphere. In contrast, the BI-Impact score of the most topically consistent blogs is low. Consequently, these blogs have a low impact and a low reach. The assumption is that they form closed expert communities which are less integrated into the blogosphere.

Figure 11: BI-Impact score and topic consistency rank for top 100 blogs ordered by BI-Impact rank. (Line chart; x-axis: rank position 1-100; y-axis: normalized score 0-1; series: Consistency, BI-Impact.)

The same is observable in the behavior of the topic consistency rank when the blogs are ordered by their BI-Impact score. There is an accumulation of high topic consistency ranks in the long tail of the BI-Impact score. In addition, a small accumulation of medium topic consistency ranks at rank positions 3-16 is observable. However, a correlation between both scores cannot be observed.


8 Recommendations for Future Research

The focus of this thesis is to motivate and define a topic consistency rank for blogs. The formal definition and the implementation focus in particular on a resource-efficient and fast calculation. Therefore, complex algorithms and dependencies on external resources are avoided. Nevertheless, these should be a focus of future research.

8.1 Enhanced Topic Detection

The central part of our topic consistency rank is the topic detection. As already

discussed, k-means clustering detects the topics in the introduced implementa-

tion. Nevertheless, the central shortcoming of this approach is that it is highly

dependent on the underlying collection. Thus, the rank depends on the crawl

coverage of BlogIntelligence. There are several approaches that can circumvent

this problem.

Wikipedia. Although the content creation in the blogosphere is highly interactive, it does not aim to provide reliable knowledge. In contrast, Wikipedia offers a large source of reviewed content. The whole set of articles is available for download and covers nearly every imaginable topic. Thus, it has to be tested whether a word clustering based on this data provides more reliable clusters.

Thesauri. Another solution is the usage of thesauri. A thesaurus is a dictionary-like database that additionally contains acronyms, synonyms, and hypernyms. Currently, the most important words are identified by calculating the tf*idf score for each word. By using thesauri, the common hypernyms of the most important words of a post can be collected. These hypernyms can serve as new clusters with all their subordinate words.

Thesauri are human-made collections that have been revised several times by linguistic researchers. Thereby, the clustering will have a high quality and an intuitive grouping. One frequently referenced thesaurus is WordNet [57]. WordNet allows the complete download of its database. This enables the analysis to load the complete knowledge into memory and perform a fast matching of words and hypernyms. Although this process is expected to perform slower than the k-means clustering, the results can be more promising.

Ontologies. A promising solution is the usage of ontologies.

"An ontology is an explicit, formal specification of a shared con-

ceptualization. The term is borrowed from philosophy, where an

Ontology is a systematic account of existence. For AI systems, what

"exists" is everything that can be represented." [58]

An ontology holds numerous relations between concepts. Among others, an

ontology defines classes of resources and super classes of classes.

To use ontologies, the post’s content has to be assigned to the concepts present

in the ontologies. This is a hard problem and frequently discussed in ongoing

research [59, 60, 61]. Hereby, the probability of a word or word group repre-

senting a specific concept is needed. The probability is influenced by the direct

context of the word and by the overall collection.

Although this results in a hard calculation problem, the data is semantically

enriched. These semantics can be used to easily derive clusters with different

granularities. Further, it enables us to make the results machine readable and to

offer more semantic filtering to users.

Sentiments. Besides the quality of blog posts, incorporating the opinion of blog authors into the ranking is a future challenge. For example, a user may want to identify a blog author who constantly writes positively or negatively about a topic like Apple. Thereby, BlogIntelligence should provide special insights to identify fans and haters of products or persons.

Therefore, sentiment analysis should be applied to the posts’ content. Sentiment analysis determines the attitude of a writer [62]. The attitude is the emotional state of the author.


Probability distributions. As discussed in Sec. 5.2, a k-means clustering assigns words to topics. Although this gives promising results, another approach is to view topics as probability distributions over words. Thus, each word is assigned to a topic with a specific probability. These probability distributions create overlapping topic clusters that represent reality in more detail than a distinct assignment of words to topics. For example, the word ray gets assigned to physics (light ray), but also to fishing (the bony rays in the fin of a fish) with a smaller probability.

Multilingual clustering. The word clustering in this thesis is limited to a German data set, which circumvents the problem of multilingual clustering. With the future extension of BlogIntelligence to the whole blogosphere, the clustering also has to detect topics across language boundaries. This problem is discussed by Chen et al. [63], who propose to first cluster each language and afterwards merge the resulting topic clusters. Future work has to integrate this or a similar approach into the topic detection to solve the multilingual clustering problem.

8.2 Visualization

The key component of the BlogIntelligence framework is the visualization. It enables users to understand and use the results of the BI analyses.

The topic consistency rank presented in this thesis is a complex calculation. It results in a numerical value for each blog. When only this number is displayed, the user is not able to relate it to other blogs or to interpret its meaning. Therefore, future work will address the creation of an appropriate visualization.

Figure 12: BlogConnect 2.0 with topic consistency represented as color value. [Figure labels: Politics, Society, Health, Tech, Movies, War; controls: Search, Minimum BI-Impact Score, Minimum Topic Consistency, Topic Granularity]

This visualization helps the user to explore and categorize blogs based on their visual perception. As discussed in Sec. 2.2.3, the BlogConnect visualization of BlogIntelligence already offers an exploratory overview of the blogosphere. To integrate the topic consistency rank into this view, another visual dimension is introduced. This dimension has to symbolize the consistency of a blog, and the user has to be able to perceive the order of blogs regarding their consistency. Thus, the color value of blog bubbles serves as the indicator for their topic consistency. The color value is a direct mapping of the normalized rank multiplied by a constant parameter.
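The mapping from rank to color value can be sketched as follows. The min-max normalization and the constant `scale=255` are assumptions made for this illustration; the text only states that a normalized rank is multiplied by a constant parameter.

```python
def color_value(rank, min_rank, max_rank, scale=255):
    """Map a topic consistency rank to a color value.

    Assumption: the rank is min-max normalized to [0, 1] and then
    multiplied by the constant `scale`.
    """
    if max_rank == min_rank:
        normalized = 0.0  # all blogs share one rank; no contrast possible
    else:
        normalized = (rank - min_rank) / (max_rank - min_rank)
    return round(normalized * scale)
```

With this choice, the most consistent blog in the current view gets the brightest value and the least consistent one the darkest.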

The prototypical BlogConnect 2.0 visualization is shown in Fig. 12. As shown, the user still controls the set of blogs via a search term at the lower right corner of the visualization. Blogs are only shown if they are related to the search term. Essentially, there are three extensions to the current BlogConnect visualization. First, blog bubbles are now arranged around their assigned topics. The topic names have to be calculated via a cluster labeling algorithm, which is also subject to future research. Further, the arrangement around the topics is based on a gravitation simulation in which the force is determined by the distance of a blog to the cluster’s centroid.
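A single step of such a simulation might look like the sketch below. The text does not specify how the force depends on the distance, so the fall-off `1 / (1 + centroid_distance)`, the `rate` parameter, and the function name are assumptions for illustration only.

```python
def step_towards_topic(position, anchor, centroid_distance, rate=0.1):
    """Move a blog bubble one simulation step toward its topic's anchor.

    Assumption: the pull strength falls off as 1 / (1 + centroid_distance),
    so blogs close to the cluster centroid are drawn to the topic label
    more strongly than blogs at the edge of the cluster.
    """
    strength = rate / (1.0 + centroid_distance)
    x, y = position
    anchor_x, anchor_y = anchor
    return (x + strength * (anchor_x - x), y + strength * (anchor_y - y))
```

Calling this function repeatedly for every bubble lets typical blogs of a topic settle near its label, while loosely related blogs drift in more slowly.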

Second, as mentioned above, the color value of the blog bubble represents the degree of topic consistency. As shown in Fig. 12, blogs with a high consistency shine through the cloud of dark, inconsistent blogs. The small light point also helps the user to compare less consistent blogs.

Figure 13: BlogConnect with a high minimal topic consistency threshold. [Figure labels: Politics, Society, Health, Tech, Movies, War; controls: Search, Minimum BI-Impact Score, Minimum Topic Consistency, Topic Granularity]

Third is the introduction of an interactive toolbar with three controls. The first control regulates the topic granularity of the visualization. In the example, one can see five topics. By raising the granularity, the BlogIntelligence framework calculates a higher number of clusters, which enables the user to explore the blogs in more detail. In addition, the user is able to configure the minimum BI-Impact score. All blogs with a lower score are excluded from the view, leaving the most important blogs for the user. Similarly, the minimum topic consistency can be controlled by the user. Thus, the user can exclude inconsistent blogs from the overview. As shown in Fig. 13, the higher the topic consistency threshold, the fewer blogs are shown. One can see that even big blogs disappear because of their versatile interests.
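The two threshold controls amount to a simple filter over the displayed blogs. The following sketch assumes a hypothetical `Blog` record with both scores already computed; the blog names and score values are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Blog:
    name: str
    bi_impact: float
    topic_consistency: float


def visible_blogs(blogs, min_bi_impact=0.0, min_topic_consistency=0.0):
    """Keep only blogs that reach both toolbar thresholds."""
    return [
        blog for blog in blogs
        if blog.bi_impact >= min_bi_impact
        and blog.topic_consistency >= min_topic_consistency
    ]


# Hypothetical data: a large general-interest blog and a small niche blog.
blogs = [
    Blog("broad-blog", bi_impact=0.9, topic_consistency=0.2),
    Blog("niche-blog", bi_impact=0.3, topic_consistency=0.8),
]
consistent_only = visible_blogs(blogs, min_topic_consistency=0.5)
```

Raising the consistency threshold removes the large but versatile blog from the view, which mirrors the effect described for Fig. 13.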

8.3 Full integration with SAP HANA

The full integration into SAP HANA is one of the main goals for the future of BlogIntelligence. The focus lies on transferring the text analysis foundations into the core of HANA and creating an API for future text analysis algorithms.

As discussed in Sec. 5, the tf*idf calculation runs inside SAP HANA. Although the SQL procedures run entirely on the database, they use an externally extracted dictionary table instead of the database-owned word index.

Data transfer costs can be decreased by implementing the k-means clustering directly in the database. Although the k-means algorithm already runs in a distributed environment, a full in-memory computation can achieve an additional performance boost. However, this expectation has to be tested by integrating the clustering into SAP HANA.

Furthermore, the actual consistency ranking calculation can be adapted to incrementally update the rank of each blog on the insertion of new posts. With the integration of the text analysis algorithms into SAP HANA, the overall aim of BlogIntelligence, which is to provide real-time analytics of the blogosphere, comes within reach.
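To illustrate what an incremental update could look like, the sketch below maintains a blog-level score as a running mean of per-post scores. Treating the blog score as a plain mean of post scores is an assumption made here for simplicity; the actual rank combines four aspects and would need a corresponding update rule for each of them.

```python
class RunningBlogScore:
    """Blog-level score kept as a running mean of per-post scores.

    Assumption: the blog score is the arithmetic mean of its post scores,
    so a new post can be folded in without re-reading older posts.
    """

    def __init__(self):
        self.post_count = 0
        self.mean_score = 0.0

    def add_post(self, post_score):
        """Fold one new post score into the mean and return the new mean."""
        self.post_count += 1
        self.mean_score += (post_score - self.mean_score) / self.post_count
        return self.mean_score
```

Each insertion costs constant time and only needs the previous mean and the post count, which is the property that makes real-time updates feasible.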

9 Conclusion

This master’s thesis proposed a metric for the topical consistency of a blog with the goal of identifying domain experts in the blogosphere.

It was discussed that current blog ranking approaches focus on finding the most influential blogs, which attract a large audience and thus more visitors, links, and comments. Further, it was argued that niche blogs with a very specific topic can only attract a limited audience and thus have only a small reach. For a blog to develop expert knowledge, it should show recurring interest in its topics and therefore concentrate on a small set of topics. Identifying these expert blogs is particularly important for domain experts, who want to find blogs that they can follow and interact with.

To ease the retrieval of these blogs, four different aspects of topic consistency were defined: (1) intra-post, (2) inter-post, (3) intra-blog, and (4) inter-blog consistency. These aspects define the consistency of a blog on different granularities: from the internal consistency of a post’s paragraphs to the global consistency between a blog and its linking and linked blogs. The four aspects are combined into a joint rank, called topic consistency rank.

The implementation of the topic consistency rank was introduced. Further, this thesis showed how the topic consistency rank is integrated into the blog analytics framework BlogIntelligence. The topic consistency rank builds on the topic detection, which automatically assigns words to groups of highly related words. These groups are defined as topics. Using this topic detection, the implementation of the four aspects and the final rank was described with a focus on the specifics of the persistence layer SAP HANA.

The plausibility of the topic consistency rank was evaluated based on a real-world data set. This data set consisted of 12,000 crawled blogs that were collected by the BlogIntelligence crawler. The top ten results of each aspect were analyzed and two representatives were discussed in detail.

In addition, the correlation between the topic consistency of a blog and its influence was evaluated. This was done by implementing the BI-Impact score, a measure for the reach and the impact of a blog that incorporates blog-specific characteristics.

The analysis of the top ten blogs appeared to imply an inverse relation between the topic consistency of a blog and its reach, i.e., the more consistent a blog is, the less influence it can gain in the blogosphere. In contrast, analyzing the distribution of ranks among the top hundred blogs revealed no correlation between the influence and the consistency of blogs. Thus, both metrics are considered to be independent.
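One way to quantify such a relation between two rankings is a rank correlation coefficient. The sketch below computes Spearman's coefficient in plain Python; the choice of Spearman's measure is an assumption for illustration, since the text does not name the measure behind the comparison.

```python
def ranks(values):
    """Rank values starting at 1 for the smallest; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    rank_x, rank_y = ranks(xs), ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rank_x) / n, sum(rank_y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rank_x, rank_y))
    var_x = sum((a - mean_x) ** 2 for a in rank_x)
    var_y = sum((b - mean_y) ** 2 for b in rank_y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for constant input; 0.0 by convention
    return cov / (var_x * var_y) ** 0.5
```

A value near -1 would support the inverse relation suggested by the top ten blogs, while a value near 0 would match the independence observed among the top hundred.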

As a consequence, the topic consistency rank is established as an additional indicator, besides the influence of a blog, to ease blog retrieval for domain experts.

Future work includes the enhancement of the topic detection to provide more specific and accurate topics that allow words to be part of multiple topics. The influence of this enhancement on the results of the topic consistency rank should be analyzed. In addition, the proposed visualization, BlogConnect 2.0, should be integrated into the BlogIntelligence web portal to offer the results of the topic consistency rank to the user.

List of Abbreviations

API Application Programming Interface

ATOM Atom Syndication Format

BI BlogIntelligence

BI-Impact BlogIntelligence-Impact-Score

Blog Weblog

HDFS Hadoop Distributed File System

HITS Hyperlink-Induced Topic Search

HTTP Hypertext Transfer Protocol

IR Information Retrieval

RAM Random-Access Memory

RPC Remote Procedure Call

RSS Rich Site Summary

Splog Spam Blog

SQL Structured Query Language

tf*idf Term Frequency-Inverse Document Frequency

URI Uniform Resource Identifier

WWW World Wide Web

XML Extensible Markup Language

List of Figures

1 Overview of blog topics.

2 An example tag cloud.

3 BlogIntelligence architecture overview.

4 BlogConnect.

5 PostConnect.

6 Ranking variables of the BI-Impact.

7 Visualization of post-topic-probabilities.

8 Topic detection flow diagram.

9 An example iteration of k-means.

10 BI-Impact and topic consistency ordered by topic consistency rank.

11 BI-Impact and topic consistency ordered by BI-Impact.

12 BlogConnect 2.0.

13 BlogConnect 2.0 with a minimal topic consistency.

List of Tables

1 Example tf*idf vector table.

2 Sparse word vector representation.

3 Resulting cluster table.

4 Example of the dictionary table.

5 Example of the link table.

6 State of the BlogIntelligence data set.

7 Clustering quality results.

8 Top 10 blogs for intra-post and inter-post consistency.

9 Top 10 blogs for intra-blog and inter-blog consistency.

10 Top 10 blogs for combined topic consistency rank with BI-Impact.

11 Top 10 blogs for BI-Impact with combined topic consistency rank.

References

[1] T. Cook and L. Hopkins: Social media or, “how i learned to stop worrying and love communication”, September 2007. http://trevorcook.typepad.com/weblog/files/CookHopkins-SocialMediaWhitePaper-2007.pdf.

[2] R. Ramakrishnan and A. Tomkins: Toward a peopleweb. Computer, 40(8):63–72, 2007.

[3] H. Kircher: Web 2.0-plattform für innovation. IT-Information Technology, 49(1):63–65, 2007.

[4] N.J. Thurman: Forums for citizen journalists? adoption of user generated content initiatives by online news media. New Media & Society, 10(1):139–157, 2008.

[5] S.D. Reese, L. Rutigliano, K. Hyun, and J. Jeong: Mapping the blogosphere: professional and citizen-based media in the global news arena. Journalism, 8(3):235–261, 2007.

[6] J. Schmidt: Weblogs: eine kommunikationssoziologische studie. 2006.

[7] Tom Smith: Power to the People: Social Media Tracker Wave 3. Technical report 2008, 2008. http://www.slideshare.net/Tomuniversal/wave-3-social-media-tracker-presentation.

[8] J. Arguello, J. Elsas, J. Callan, and J. Carbonell: Document representation and query expansion models for blog recommendation. In Proc. of the 2nd Intl. Conf. on Weblogs and Social Media (ICWSM), 2008.

[9] I.H. Witten, E. Frank, and M.A. Hall: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2011.

[10] J. Bross: Understanding and Leveraging the Social Physics of the Blogosphere. PhD thesis, Hasso-Plattner-Institute, 2011.

[11] J. Bross, K. Richly, M. Kohnen, and C. Meinel: Identifying the top-dogs of the blogosphere. Social Network Analysis and Mining, pages 1–15, 2011.

[12] D.L. Lee, H. Chuang, and K. Seamons: Document ranking and the vector-space model. Software, IEEE, 14(2):67–75, 1997.

[13] L. Page, S. Brin, R. Motwani, and T. Winograd: The pagerank citation ranking: Bringing order to the web. 1999.

[14] M. Clements, A.P. de Vries, and M.J.T. Reinders: Optimizing single term queries using a personalized markov random walk over the social graph. In Workshop on Exploiting Semantic Annotations in Information Retrieval (ESAIR), 2008.

[15] W. Weerkamp and M. De Rijke: Credibility improves topical blog post retrieval. Association for Computational Linguistics (ACL), 2008.

[16] K. Balog, L. Azzopardi, and M. De Rijke: Formal models for expert finding in enterprise corpora. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 43–50. ACM, 2006.

[17] R. Blood: Weblogs: a history and perspective. Rebecca’s Pocket, 7(9), 2000.

[18] C. Körner, R. Kern, H.P. Grahsl, and M. Strohmaier: Of categorizers and describers: An evaluation of quantitative measures for tagging motivation. In Proceedings of the 21st ACM conference on Hypertext and hypermedia, pages 157–166. ACM, 2010.

[19] O. Kaser and D. Lemire: Tag-cloud drawing: Algorithms for cloud visualization. arXiv preprint cs/0703109, 2007.

[20] C. Marlow: Audience, structure and authority in the weblog community. In International Communication Association Conference, volume 27, 2004.

[21] M. Gumbrecht: Blogs as “protected space”. In WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, volume 2004, 2004.

[22] S. Thies: Content-Interaktionsbeziehungen im Internet: Ausgestaltung und Erfolg. Springer DE, 2004.

[23] M. Kobayashi and K. Takeda: Information retrieval on the web. ACM Computing Surveys (CSUR), 32(2):144–173, 2000.

[24] J. Broß, P. Schilf, M. Jenders, and C. Meinel: Visualizing the blogosphere with blogconnect. In Privacy, security, risk and trust (passat), 2011 ieee third international conference on and 2011 ieee third international conference on social computing (socialcom), pages 651–656. IEEE, 2011.

[25] J. Bross, P. Schilf, and C. Meinel: Visualizing blog archives to explore content- and context-related interdependencies. In Conf. Web Intelligence and Intelligent Agent Technology, 2010.

[26] P. Berger, P. Hennig, J. Bross, and C. Meinel: Mapping the blogosphere–towards a universal and scalable blog-crawler. In 2011 IEEE Third International Conference on Social Computing (SocialCom), pages 672–677. IEEE, 2011.

[27] M. Cafarella and D. Cutting: Building nutch: Open source search: A case study in writing an open source search engine. ACM Queue, 2(2), 2004.

[28] J. Dean and S. Ghemawat: Mapreduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[29] R. Khare, D. Cutting, K. Sitaker, and A. Rifkin: Nutch: A flexible and scalable open-source web search engine. Oregon State University, 2004.

[30] M. Michael, J.E. Moreira, D. Shiloach, and R.W. Wisniewski: Scale-up x scale-out: A case study using nutch/lucene. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1–8. IEEE, 2007.

[31] D. Borthakur: The hadoop distributed file system: Architecture and design. Hadoop Project Website, 11:21, 2007.

[32] J. Shafer, S. Rixner, and A.L. Cox: The hadoop distributed filesystem: Balancing portability and performance. In Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on, pages 122–133. IEEE, 2010.

[33] H. Plattner and A. Zeier: In-memory data management: an inflection point for enterprise applications. Springer, 2011.

[34] A.K. Jain, M.N. Murty, and P.J. Flynn: Data clustering: a review. ACM computing surveys (CSUR), 31(3):264–323, 1999.

[35] J. Han and M. Kamber: Data mining: concepts and techniques. Morgan Kaufmann, 2006.

[36] S. Owen, R. Anil, T. Dunning, and E. Friedman: Mahout in action. Online, pages 1–90, 2011.

[37] A.N. Langville, C.D. Meyer, and P. Fernández: Google’s pagerank and beyond: The science of search engine rankings. The Mathematical Intelligencer, 30(1):68–69, 2008.

[38] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen: Combating web spam with trustrank. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30, pages 576–587. VLDB Endowment, 2004.

[39] Jon Kleinberg: Bursty and hierarchical structure in streams. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’02, page 91, New York, New York, USA, July 2002. ACM Press, ISBN 158113567X.

[40] K. Fujimura, H. Toda, T. Inoue, N. Hiroshima, R. Kataoka, and M. Sugizaki: Blogranger - a multi-faceted blog search engine. In Proceedings of the WWW 2006 3rd annual workshop on the weblogging ecosystem: Aggregation, analysis and dynamics, 2006.

[41] Technorati: What is technorati authority?, September 2012. http://technorati.com/what-is-technorati-authority.

[42] A. Kritikopoulos, M. Sideri, and I. Varlamis: Blogrank: ranking weblogs based on connectivity and similarity features. In Proceedings of the 2nd international workshop on Advanced architectures and algorithms for internet delivery and applications, page 8. ACM, 2006.

[43] R. Schirru, D. Obradovic, S. Baumann, and P. Wortmann: Domain-specific identification of topics and trends in the blogosphere. Advances in Data Mining. Applications and Theoretical Aspects, pages 490–504, 2010.

[44] K. Sriphaew, H. Takamura, and M. Okumura: Cool blog identification using topic-based models. In Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT’08. IEEE/WIC/ACM International Conference on, volume 1, pages 402–406. IEEE, 2008.

[45] L. Zhu, A. Sun, and B. Choi: Online spam-blog detection through blog search. In Proceedings of the 17th ACM conference on Information and knowledge management, pages 1347–1348. ACM, 2008.

[46] T. Katayama, T. Utsuro, Y. Sato, T. Yoshinaka, Y. Kawada, and T. Fukuhara: An empirical study on selective sampling in active learning for splog detection. In 5th International Workshop on Adversarial Information Retrieval on the Web, pages 29–36. ACM, 2009.

[47] P. Kolari, A. Java, and T. Finin: Characterizing the splogosphere. In Proceedings of the 3rd Annual Workshop on Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 15th World Wide Web Conference. University of Maryland, Baltimore County, 2006.

[48] W. Liu, S. Tan, H. Xu, and L. Wang: Splog filtering based on writing consistency. In Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT’08. IEEE/WIC/ACM International Conference on, volume 1, pages 227–233. IEEE, 2008.

[49] J. He, W. Weerkamp, M. Larson, and M. de Rijke: An effective coherence measure to determine topical consistency in user-generated content. International journal on document analysis and recognition, 12(3):185–203, 2009.

[50] M. Chen and T. Ohta: Using blog content depth and breadth to access and classify blogs. International Journal of Business and Information, 5(1):26–45, 2010.

[51] K. Eguchi, K. Kuriyama, and N. Kando: Sensitivity of ir systems evaluation to topic difficulty. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), volume 2, pages 585–589. Citeseer, 2002.

[52] G. Salton and C. Buckley: Term-weighting approaches in automatic text retrieval. Information processing & management, 24(5):513–523, 1988.

[53] M. Fernández, D. Vallet, and P. Castells: Probabilistic score normalization for rank aggregation. Advances in Information Retrieval, pages 553–556, 2006.

[54] R.M. Esteves, R. Pais, and C. Rong: K-means clustering in the cloud–a mahout test. In Advanced Information Networking and Applications (WAINA), 2011 IEEE Workshops of International Conference on, pages 514–519. IEEE, 2011.

[55] Hidenao Abe and Shusaku Tsumoto: Evaluating a temporal pattern detection method for finding research keys in bibliographical data. pages 1–17, January 2011.

[56] J.C. Tressler, M.H. Larock, and C.E. Lewis: Mastering Effective English. The Copp Clark, 1980.

[57] C. Fellbaum: Wordnet. Theory and Applications of Ontology: Computer Applications, pages 231–243, 2010.

[58] T.R. Gruber et al.: A translation approach to portable ontology specifications. Knowledge acquisition, 5(2):199–220, 1993.

[59] A. Hotho, A. Maedche, and S. Staab: Ontology-based text document clustering. KI, 16(4):48–54, 2002.

[60] L. Jing, L. Zhou, M.K. Ng, and J.Z. Huang: Ontology-based distance measure for text clustering. In Proc. of SIAM SDM workshop on text mining, 2006.

[61] Y. Ding and X. Fu: A text document clustering method based on ontology. Advances in Neural Networks–ISNN 2011, pages 199–206, 2011.

[62] B. Pang and L. Lee: Opinion mining and sentiment analysis. Now Pub, 2008.

[63] H.H. Chen and C.J. Lin: A multilingual news summarizer. In Proceedings of the 18th conference on Computational linguistics-Volume 1, pages 159–165. Association for Computational Linguistics, 2000.
