Detecting Blogs Independently from the Language and Content (MSM09)

Post on 17-Dec-2014

DESCRIPTION

Presentation of the research work "Detecting Blogs Independently from the Language and Content", presented at the Mining Social Media workshop (MSM2009) of CAEPIA. The problem is approached with a machine learning model built on the visual characteristics that make a blog a type of page a human observer can recognize at a glance.

TRANSCRIPT

Detecting Blogs Independently From The Language And Content

Francisco Manuel Rangel Pardo

PhD. Anselmo Peñas Padilla

Powered by Corex Soluciones Informáticas 2009

Introduction

What is social media?
  Sharing, discussion, collaboration -> Web 2.0

What is a Blog?
  Opinions, experiences, information
  Free communication

What does “Detecting Blogs Independently From The Content And Language” mean?

Structure

Problem definition and research objectives
Research approach
Experimental results
Conclusions and Future Work. Applications.

Problem definition

Many people generating content, many people consuming content
Huge quantities of users and data
Free, global and spontaneous information, experiences and opinions
Blogs are a source of knowledge, but as raw data
First of all, we have to identify them

Research Objective

Obtain a self-contained representation of Web pages that can be used in an inductive learning process to identify Blogs accurately, independently of content, style, author and language

Research approach: Visual Characteristics of the Blogs

What are Blogs like?
  Heterogeneous in content, themes and styles
  Many different languages
  Technology vs. Web 2.0 concept

Research approach: Features of the representation

Machine learning / inductive learning
14 features from content and structure
  Frequency of occurrence of some entities
  Ratios between frequencies of occurrence of some entities

Research approach: Features of the representation

“blog” in URL
“blog” in document
“post” in document
“rss” or “atom”
Comments vs. Dates
Comments in link vs. Dates
Comments in link vs. Comments
Comments vs. Headlines
Comments in link vs. Headlines
Dates vs. Headlines
Blogroll?
Links same domain vs. Links different domain
Links different domain vs. Total links
Links blogroll vs. Links page
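A rough sketch of how a few of these features could be computed from a page's URL and HTML. This is only an illustration: the slides do not say how comments, dates or headlines are detected, so the regular expressions, the function names `ratio` and `extract_features`, and the choice to map a zero denominator to 0.0 are all assumptions. Only 9 of the 14 features are covered.

```python
import re
from urllib.parse import urlparse


def ratio(a, b):
    """Ratio a/b, defined here as 0.0 when the denominator is zero (assumption)."""
    return a / b if b else 0.0


def extract_features(url, html):
    """Compute some of the listed features for one page (illustrative only)."""
    text = html.lower()
    host = urlparse(url).netloc.lower()

    # Absolute links only; relative links are ignored in this sketch.
    links = re.findall(r'href=["\'](https?://[^"\']+)["\']', text)
    same = sum(1 for link in links if urlparse(link).netloc == host)
    different = len(links) - same

    # Hypothetical entity detectors: the source does not describe how
    # comments, dates or headlines are actually recognised.
    comments = len(re.findall(r'\bcomments?\b', text))
    dates = len(re.findall(r'\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b', text))
    headlines = len(re.findall(r'<h[1-3][\s>]', text))

    return {
        # Crude substring checks for the keyword features.
        "blog_in_url": int("blog" in url.lower()),
        "blog_in_document": int("blog" in text),
        "post_in_document": int("post" in text),
        "rss_or_atom": int("rss" in text or "atom" in text),
        # Ratios between frequencies of occurrence of entities.
        "comments_vs_dates": ratio(comments, dates),
        "comments_vs_headlines": ratio(comments, headlines),
        "dates_vs_headlines": ratio(dates, headlines),
        "links_same_vs_different_domain": ratio(same, different),
        "links_different_domain_vs_total": ratio(different, len(links)),
    }
```

None of these features depends on the vocabulary of a particular language beyond a handful of keywords, which is what lets the representation stay small.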

Experimental Results: Test Collection

DMOZ ODP
4 different languages
Blog / No-Blog
No-Blog: many different categories (arts, business, computers…)

Experimental Results: Evaluation framework

4 different classifiers
  Naïve Bayes
  BayesNet
  Support Vector Machines
  Decision Trees
Training based on accuracy
Cross validation
Statistical-F
Student's t-test
  H0: all representations have the same performance
Interval for real error
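A minimal sketch of two pieces of this framework: a paired t statistic over per-fold scores of two representations, and a normal-approximation interval for the real error. The slides do not state which variant of the test or which interval formula was used, so both formulas, the function names and the default z = 1.96 (a 95% interval) are assumptions.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired per-fold scores of two representations.

    Assumes at least two folds and that the differences are not all
    identical (otherwise the standard deviation is zero).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # Under H0 the mean difference is zero; stdev divides by n - 1.
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


def error_interval(sample_error, n, z=1.96):
    """Approximate interval for the real error from the error on n test pages."""
    half_width = z * math.sqrt(sample_error * (1.0 - sample_error) / n)
    return sample_error - half_width, sample_error + half_width
```

The t statistic is compared against the critical value for n - 1 degrees of freedom; H0 is rejected when it exceeds that value.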

Experimental Results: Baseline and other representations

4 different representations + majority baseline
  BoW: Bag of Words
  Google Blog Search
  NITLE Project
  CRX: Our representation
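For reference, a majority baseline can be sketched in a few lines: it always predicts the most frequent class seen in training. The function name `majority_baseline` and the label strings used with it are illustrative, not taken from the original work.

```python
from collections import Counter


def majority_baseline(train_labels, test_labels):
    """Predict the most frequent training label for every test page.

    Returns the chosen label and the accuracy it reaches on the test labels.
    """
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / len(test_labels)
```

Any useful representation should beat this number by a clear margin.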

Experimental Results: Results

We reject H0 -> CRX significantly improves the classification

Experimental Results: Discussion

BoW methods
  High dimensionality: multilingualism & themes
  This decreases their performance
Google Blog Search
  Does not distinguish between Blogs and pages with subscription: newsgroups, wikis or forums
NITLE
  Logic rules vs. inductive learning
  Personal Web page created with WordPress
  Blog created programmatically and hosted on its own domain
  Based on the current technology
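The dimensionality point about BoW can be illustrated with a toy example: each added language brings its own vocabulary, so the feature space grows even when the pages talk about the same thing. The sentences and the whitespace tokenizer below are made up for illustration.

```python
def vocabulary(documents):
    """Set of distinct lowercase tokens, i.e. the BoW feature space."""
    return {token for doc in documents for token in doc.lower().split()}


# Toy pages on the same topic in two languages (invented for this sketch).
english = ["my first blog post", "a post about my blog"]
spanish = ["mi primera entrada del blog", "una entrada sobre mi blog"]

print(len(vocabulary(english)))            # English-only feature count
print(len(vocabulary(english + spanish)))  # feature count after adding Spanish
```

Only the token "blog" is shared here, so the second count is roughly the sum of the two vocabularies.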

Conclusions

We have created a test collection:
  4 different languages
  2 different classes: Blog / No-Blog
We have experimented with:
  4 different representations
  4 different methods of inductive learning
We have obtained better results:
  F values of 0.920
  Error interval lower than 2%

Conclusions

Our representation
  The concept of Blog vs. underlying technology
  Prioritizes Blogs with reviews and comments
Conclusion
  Identification independently from content, style, author and language

Future work

Deal with new languages
Strengthen entity extraction
Include temporal analysis
Include extra rules

Applications

Searching for information and opinions about:
  Products and services
  Customers and providers
  Competitors

Contact

Thank you

You can contact us:

francisco.rangel@corex.es

Resources:

http://www.wikrplusd.com
