detecting blogs independently from the language and content msm09
DESCRIPTION
Presentation of the research paper "Detecting Blogs Independently from the Language and Content", presented at the Mining Social Media workshop (MSM2009) of CAEPIA. The problem is approached with a machine learning model built from the visual characteristics that make a blog a type of page that a human observer can recognize at a glance.
TRANSCRIPT
Detecting Blogs Independently From The Language And Content
Francisco Manuel Rangel Pardo
PhD. Anselmo Peñas Padilla
Powered by Corex Soluciones Informáticas 2009
Introduction
What is Social Media? Sharing, discussion, collaboration -> Web 2.0
What is a Blog? Opinions, experiences, information. Free communication
What does “Detecting Blogs Independently From The Content And Language” mean?
Structure
Problem definition and research objectives
Research approach
Experimental results
Conclusions and Future Work. Applications.
Problem definition
Many people generating content, many people consuming content
Huge quantities of users and data
Free, global and spontaneous information, experiences and opinions
Blogs as a source of knowledge, but raw data
First of all, we have to identify them
Research Objective
Obtain a self-contained representation of Web pages that can be used in an inductive learning process to obtain good results when identifying Blogs, independently of content, style, author and language
Structure
Problem definition and research objectives
Research approach
Experimental results
Conclusions and Future Work. Applications.
Research approach: Visual Characteristics of the Blogs
What are Blogs like?
Heterogeneous in content, themes and styles
Many different languages
Technology vs. Web 2.0 concept
Research approach: Features of the representation
Machine learning / inductive learning
14 features from content and structure
Frequency of occurrence of some entities
Ratios between the frequencies of occurrence of some entities
Research approach: Features of the representation
“blog” in URL
“blog” in document
“post” in document
“rss” or “atom”
Comments vs. Dates
Comments in link vs. Dates
Comments in link vs. Comments
Comments vs. Headlines
Comments in link vs. Headlines
Dates vs. Headlines
Blogroll?
Links same domain vs. Links different domain
Links different domain vs. Total links
Links blogroll vs. Links page
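The slides list the features but not how they are extracted. The following is a minimal sketch of how a few of them could be computed, not the authors' implementation. The names `LinkCollector`, `ratio` and `extract_features` are hypothetical, and the regular expressions for counting comments and dates are crude stand-ins for the entity extraction, which the slides do not describe.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def ratio(a, b):
    """Ratio a/b that falls back to 0.0 when the denominator is zero."""
    return a / b if b else 0.0


def extract_features(url, html):
    """Compute a subset of the listed features for one page (illustrative only)."""
    parser = LinkCollector()
    parser.feed(html)
    text = html.lower()
    domain = urlparse(url).netloc

    # Count links that stay on the page's own domain vs. links that leave it.
    same = different = 0
    for href in parser.hrefs:
        target = urlparse(urljoin(url, href)).netloc
        if target == domain:
            same += 1
        else:
            different += 1

    # Assumption: simple patterns stand in for the real entity counts.
    comments = len(re.findall(r"\bcomments?\b", text))
    dates = len(re.findall(r"\b\d{1,2}/\d{1,2}/\d{4}\b", text))

    return {
        "blog_in_url": int("blog" in url.lower()),
        "blog_in_document": int("blog" in text),
        "post_in_document": int("post" in text),
        "rss_or_atom": int("rss" in text or "atom" in text),
        "comments_vs_dates": ratio(comments, dates),
        "same_vs_different_domain": ratio(same, different),
        "different_domain_vs_total": ratio(different, same + different),
    }
```

None of these features depends on the vocabulary of a particular language beyond a handful of technical tokens, which is consistent with the stated goal of language independence.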
Structure
Problem definition and research objectives
Research approach
Experimental results
Conclusions and Future Work. Applications.
Experimental Results: Test Collection
DMOZ ODP
4 different languages
Blog / No-Blog
No-Blog: many different categories (arts, business, computers…)
Experimental Results: Evaluation framework
4 different classifiers: Naïve Bayes, BayesNet, Support Vector Machines, Decision Trees
Training based on accuracy
Cross-validation
F-measure
Student's t-test
H0: all representations have the same performance
Confidence interval for the true error
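The slides name the F-measure and an interval for the true error but do not give formulas. The sketch below shows the balanced F-measure and, as an assumption, the common normal approximation to the binomial for the error interval; the slides do not say which estimator was actually used. The function names and the default `z=1.96` (a 95% interval) are illustrative choices.

```python
import math


def f_measure(tp, fp, fn):
    """Balanced F-measure (harmonic mean of precision and recall)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def error_interval(sample_error, n, z=1.96):
    """Approximate confidence interval for the true error rate.

    Assumes the normal approximation to the binomial, which is only
    reasonable when n is large and the error is not too close to 0 or 1.
    """
    margin = z * math.sqrt(sample_error * (1 - sample_error) / n)
    return max(0.0, sample_error - margin), min(1.0, sample_error + margin)
```

With this approximation the interval narrows as the number of test pages `n` grows, so a small test collection gives a wide interval even when the measured error is low.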
Experimental Results: Baseline and other representations
4 different representations + majority baseline
BoW: Bag of Words
Google Blog Search
NITLE Project
CRX: Our representation
Experimental Results: Results
We reject H0 -> CRX significantly improves the classification
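Rejecting H0 with a Student's t-test can be illustrated with a paired comparison of two representations over the same cross-validation folds. This is a sketch under that assumption; the slides do not state whether the test was paired or how the folds were arranged, and `paired_t_statistic` is a hypothetical helper name.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired per-fold scores of two representations."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two score lists of equal length, at least 2 folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))
```

The resulting value is compared against the critical value of the t distribution with `len(scores_a) - 1` degrees of freedom; H0 is rejected when the absolute value of the statistic exceeds it.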
Experimental Results: Discussion
BoW methods
  High dimensionality: multilingualism and themes
  Their performance decreases
Google Blog Search
  Does not distinguish between Blogs and other pages with subscription: newsgroups, wikis or forums
NITLE
  Logic rules vs. inductive learning
  Personal Web page created with WordPress
  Blog created programmatically and hosted on its own domain
  Based on the current technology
Structure
Problem definition and research goals
Experimental results
Conclusions and Future Work. Applications.
Conclusions
We have created a test collection:
  4 different languages
  2 different classes: Blog / No-Blog
We have experimented with:
  4 different representations
  4 different methods of inductive learning
We have obtained better results:
  F values of 0.920
  Error interval lower than 2%
Conclusions
Our representation
  The concept of Blog vs. underlying technology
  Prioritizes Blogs with reviews and comments
Conclusion
  Identification independently of content, style, author and language
Future work
Deal with new languages
Strengthen entity extraction
Include temporal analysis
Include extra rules
Applications
Searching for information and opinions about:
  Products and services
  Customers and providers
  Competitors
Contact
Thank you
You can contact us:
Resources:
http://www.wikrplusd.com