Detecting Blogs Independently from the Language and Content (MSM09)

Francisco Manuel Rangel Pardo
PhD. Anselmo Peñas Padilla

Upload: francisco-manuel-rangel-pardo

Post on 17-Dec-2014


DESCRIPTION

Presentation of the research work "Detecting Blogs Independently from the Language and Content", presented at the Mining Social Media workshop (MSM2009) of CAEPIA. The problem is approached with a machine learning model built from the visual characteristics that make a blog a type of page a human observer can recognize at a glance.

TRANSCRIPT

Page 1: Detecting Blogs Independently from the Language and Content MSM09

Detecting Blogs Independently From The Language And Content

Francisco Manuel Rangel Pardo

PhD. Anselmo Peñas Padilla

Page 2: Detecting Blogs Independently from the Language and Content MSM09

Powered by Corex Soluciones Informáticas 2009

Introduction

What is social media? Sharing, discussion, collaboration -> Web 2.0

What is a blog? Opinions, experiences, information; free communication

What does “Detecting Blogs Independently From The Content And Language” mean?

Page 3: Detecting Blogs Independently from the Language and Content MSM09

Structure

Problem definition and research objectives
Research approach
Experimental results
Conclusions and future work. Applications.

Page 4: Detecting Blogs Independently from the Language and Content MSM09

Problem definition

Many people generating content, many people consuming content
Huge quantities of users and data
Free, global and spontaneous information, experiences and opinions
Blogs as a source of knowledge, but as raw data
First of all, we have to identify them

Page 5: Detecting Blogs Independently from the Language and Content MSM09

Research Objective

Obtain a self-contained representation of Web pages that can be used in an inductive learning process to identify blogs accurately, independently of content, style, author and language

Page 6: Detecting Blogs Independently from the Language and Content MSM09

Structure

Problem definition and research objectives
Research approach
Experimental results
Conclusions and future work. Applications.

Page 7: Detecting Blogs Independently from the Language and Content MSM09

Research approach: Visual Characteristics of the Blogs

What are blogs like?
Heterogeneous in content, themes and styles
Many different languages
Technology vs. Web 2.0 concept

Page 8: Detecting Blogs Independently from the Language and Content MSM09

Research approach: Visual Characteristics of the Blogs

Page 9: Detecting Blogs Independently from the Language and Content MSM09

Research approach: Visual Characteristics of the Blogs

Page 10: Detecting Blogs Independently from the Language and Content MSM09

Research approach: Visual Characteristics of the Blogs

Page 11: Detecting Blogs Independently from the Language and Content MSM09

Research approach: Visual Characteristics of the Blogs

Page 12: Detecting Blogs Independently from the Language and Content MSM09

Research approach: Visual Characteristics of the Blogs

Page 13: Detecting Blogs Independently from the Language and Content MSM09

Research approach: Features of the representation

Machine learning / inductive learning
14 features from content and structure
Frequency of occurrence of some entities
Ratios between the frequencies of occurrence of some entities
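The frequency and ratio idea on this slide can be illustrated with a minimal sketch. The slides do not say how entities are recognized, so the two regular expressions below (numeric dates and `h1` to `h3` headline tags) and the names `count_entities` and `ratio` are assumptions made only for this example, not the authors' extractor.

```python
import re

# Hypothetical entity patterns; the slides do not specify how entities are detected.
DATE_PATTERN = re.compile(r"\b\d{1,4}[/.-]\d{1,2}[/.-]\d{1,4}\b")
HEADLINE_PATTERN = re.compile(r"<h[1-3][\s>]", re.IGNORECASE)


def count_entities(html):
    """Counts how often each kind of entity occurs in the raw HTML."""
    return {
        "dates": len(DATE_PATTERN.findall(html)),
        "headlines": len(HEADLINE_PATTERN.findall(html)),
    }


def ratio(numerator, denominator):
    """Ratio of two frequencies; returns 0.0 for an empty denominator (an assumption)."""
    return numerator / denominator if denominator else 0.0


# Toy page, invented for illustration.
page = "<h2>Trip</h2><p>2009-11-09</p><h2>Food</h2><p>10/11/2009</p><h1>Site</h1>"
counts = count_entities(page)
dates_vs_headlines = ratio(counts["dates"], counts["headlines"])
```

Counting numeric patterns and markup rather than words is one way such features can stay independent of the page's language.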

Page 14: Detecting Blogs Independently from the Language and Content MSM09

Research approach: Features of the representation

“blog” in URL
“blog” in document
“post” in document
“rss” or “atom”
Comments vs. Dates
Comments in link vs. Dates
Comments in link vs. Comments
Comments vs. Headlines
Comments in link vs. Headlines
Dates vs. Headlines
Blogroll?
Links same domain vs. Links different domain
Links different domain vs. Total links
Links blogroll vs. Links page
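As a rough sketch of how a few of the listed features might be computed, the code below uses only the Python standard library. It covers the keyword flags and two of the link ratios; the class `LinkCollector`, the function `extract_features`, the feature names and the zero fallback for empty denominators are all assumptions for illustration, since the slides do not describe the actual implementation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects <a> hrefs, <link> type attributes and text content."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.link_types = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])
        if tag == "link" and attrs.get("type"):
            self.link_types.append(attrs["type"].lower())

    def handle_data(self, data):
        self.text_parts.append(data)


def safe_ratio(numerator, denominator):
    """Falls back to 0.0 when the denominator is zero (an assumption)."""
    return numerator / denominator if denominator else 0.0


def extract_features(url, html):
    parser = LinkCollector()
    parser.feed(html)
    text = " ".join(parser.text_parts).lower()
    page_host = urlparse(url).netloc.lower()

    same, different = 0, 0
    for href in parser.hrefs:
        target = urlparse(urljoin(url, href))
        if target.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar links
        if target.netloc.lower() == page_host:
            same += 1
        else:
            different += 1

    return {
        "blog_in_url": "blog" in url.lower(),
        "blog_in_document": "blog" in text,
        "post_in_document": "post" in text,
        "rss_or_atom": any("rss" in t or "atom" in t for t in parser.link_types),
        "same_vs_different_domain": safe_ratio(same, different),
        "different_domain_vs_total": safe_ratio(different, same + different),
    }


# Toy page, invented for illustration.
html = (
    '<html><head><link rel="alternate" type="application/rss+xml" href="/feed">'
    "</head><body><h1>My blog</h1>"
    '<a href="/post/1">First post</a>'
    '<a href="http://other.example/x">friend</a>'
    '<a href="http://other.example/y">another friend</a>'
    "</body></html>"
)
features = extract_features("http://myblog.example/", html)
```

The resulting dictionary is the kind of fixed-length vector that could be handed to any of the classifiers mentioned later in the deck.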

Page 15: Detecting Blogs Independently from the Language and Content MSM09

Structure

Problem definition and research objectives
Research approach
Experimental results
Conclusions and future work. Applications.

Page 16: Detecting Blogs Independently from the Language and Content MSM09

Experimental Results: Test Collection

DMOZ ODP
4 different languages
Blog / No-Blog
No-Blog: many different categories (arts, business, computers…)

Page 17: Detecting Blogs Independently from the Language and Content MSM09

Experimental Results: Evaluation framework

4 different classifiers
Naïve Bayes
BayesNet
Support Vector Machines
Decision Trees

Training based on accuracy

Cross validation
Statistical-F
Student's t-test
H0: all representations have the same performance
Interval for real error
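The statistical part of this framework can be sketched as follows. The slides name a Student's t-test over cross-validation results and an interval for the real error, but not the exact variants used, so the plain paired t statistic and the normal-approximation interval below are assumptions, and the per-fold scores are invented numbers, not results from the study.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-fold score differences between two representations.

    Assumes both lists come from the same folds and that the differences
    are not all identical (otherwise the standard deviation is zero).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (std_diff / math.sqrt(len(diffs)))


def error_interval(error_rate, n_examples, z=1.96):
    """Approximate 95% interval for the true error (normal approximation)."""
    half_width = z * math.sqrt(error_rate * (1 - error_rate) / n_examples)
    return error_rate - half_width, error_rate + half_width


# Invented per-fold scores for two hypothetical representations (5 folds).
scores_a = [0.90, 0.92, 0.91, 0.93, 0.89]
scores_b = [0.80, 0.85, 0.82, 0.84, 0.79]
t_value = paired_t_statistic(scores_a, scores_b)

# With 5 folds there are 4 degrees of freedom; the two-sided 5% critical value is 2.776.
reject_h0 = abs(t_value) > 2.776

low, high = error_interval(error_rate=0.08, n_examples=1000)
```

H0 (all representations perform the same) is rejected when the t statistic exceeds the critical value for the chosen significance level.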

Page 18: Detecting Blogs Independently from the Language and Content MSM09

Experimental Results: Baseline and other representations

4 different representations + majority baseline
BoW: Bag of Words
Google Blog Search
NITLE Project
CRX: Our representation
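To make two of these reference points concrete, here is a small sketch of a majority baseline and a bag-of-words representation. The function names, the whitespace tokenization and the toy pages are assumptions for illustration only; they are not the setup used in the experiments.

```python
from collections import Counter


def bag_of_words(text):
    """Token counts for one page; whitespace tokenization is a simplification."""
    return Counter(text.lower().split())


def majority_baseline(train_labels):
    """Returns a classifier that always predicts the most frequent training label."""
    majority_label, _count = Counter(train_labels).most_common(1)[0]
    return lambda _features: majority_label


def accuracy(classifier, examples):
    """Fraction of (features, label) pairs the classifier labels correctly."""
    hits = sum(1 for features, label in examples if classifier(features) == label)
    return hits / len(examples)


# Toy data, invented for illustration only.
train_labels = ["no-blog", "no-blog", "blog"]
test_examples = [
    (bag_of_words("my first post"), "blog"),
    (bag_of_words("company products and services"), "no-blog"),
    (bag_of_words("nuestros productos y servicios"), "no-blog"),
    (bag_of_words("contact us"), "no-blog"),
]

baseline = majority_baseline(train_labels)
baseline_accuracy = accuracy(baseline, test_examples)

# Every new language or theme adds new tokens, so the vocabulary keeps growing.
vocabulary = set().union(*(features for features, _label in test_examples))
```

The baseline ignores the page entirely, so any useful representation should beat it; the vocabulary set shows how a bag-of-words space grows when the same content appears in several languages.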

Page 19: Detecting Blogs Independently from the Language and Content MSM09

Experimental Results: Results

Page 20: Detecting Blogs Independently from the Language and Content MSM09

Experimental Results: Results

Page 21: Detecting Blogs Independently from the Language and Content MSM09

Experimental Results: Results

Page 22: Detecting Blogs Independently from the Language and Content MSM09

Experimental Results: Results

We reject H0 -> CRX significantly improves the classification

Page 23: Detecting Blogs Independently from the Language and Content MSM09

Experimental Results: Results

Page 24: Detecting Blogs Independently from the Language and Content MSM09

Experimental Results: Discussion

BoW methods
High dimensionality: multilingualism and themes
Their performance decreases

Google Blog Search
Does not distinguish between blogs and other pages with subscription: newsgroups, wikis or forums

NITLE
Logic rules vs. inductive learning
Personal Web page created with WordPress
Blog created programmatically and hosted on its own domain
Based on the current technology

Page 25: Detecting Blogs Independently from the Language and Content MSM09

Structure

Problem definition and research goals
Experimental results
Conclusions and future work. Applications.

Page 26: Detecting Blogs Independently from the Language and Content MSM09

Conclusions

We have created a test collection:
4 different languages
2 different classes: Blog / No-Blog

We have experimented with:
4 different representations
4 different methods of inductive learning

We have obtained better results:
F values of 0.920
Error interval lower than 2%
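For readers unfamiliar with F values, the sketch below computes one from predicted and true labels. It assumes the balanced F1 variant (the harmonic mean of precision and recall) with "blog" as the positive class; the slides do not state which F variant was reported, and the labels here are invented.

```python
def f_measure(true_labels, predicted_labels, positive="blog"):
    """Balanced F measure (F1) for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration.
true_labels = ["blog", "blog", "blog", "no-blog", "no-blog"]
predicted_labels = ["blog", "blog", "no-blog", "blog", "no-blog"]
score = f_measure(true_labels, predicted_labels)
```

A value close to 1 means that most pages labeled as blogs really are blogs and that few real blogs are missed.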

Page 27: Detecting Blogs Independently from the Language and Content MSM09

Conclusions

Our representation
The concept of a blog vs. the underlying technology
Prioritizes blogs with reviews and comments

Conclusion
Identification independently of content, style, author and language

Page 28: Detecting Blogs Independently from the Language and Content MSM09

Future work

Deal with new languages
Strengthen entity extraction
Include temporal analysis
Include extra rules

Page 29: Detecting Blogs Independently from the Language and Content MSM09

Applications

Searching for information and opinions about:
Products and services
Customers and providers
Competitors

Page 30: Detecting Blogs Independently from the Language and Content MSM09

Contact

Thank you

You can contact us:

[email protected]

Resources:

http://www.wikrplusd.com