smart rss aggregator a text classification problem alban scholer & markus kirsten 2005

Smart RSS Aggregator

A text classification problem

Alban Scholer & Markus Kirsten, 2005

Introduction

● Smart RSS aggregator

● Predicts how interesting a user finds an unread article

● Presents news articles depending on the prediction

Issues

● Extremely high dimensional data

● Lots of unlabeled data

● Few training examples

● Only clickthrough information

● Multiuser environment

Support Vector Machine

● Support Vector Machine

● Max-margin for generalization

● Linear but easily extended to non-linear classification
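In the linear case, the classifier reduces to a signed score w·x + b, and that score can also be used to order unread articles. The sketch below is illustrative only: the weights, bias, article titles and the helper names `decision_value` and `rank_articles` are made-up assumptions, not values or code from the project.

```python
def decision_value(weights, bias, features):
    """Signed score w.x + b for a sparse feature dict (word -> value)."""
    return sum(weights.get(word, 0.0) * value
               for word, value in features.items()) + bias


def rank_articles(weights, bias, articles):
    """Return article titles sorted from highest to lowest score."""
    return sorted(articles,
                  key=lambda title: decision_value(weights, bias, articles[title]),
                  reverse=True)


# Hypothetical learned weights and sparse article vectors (assumptions).
weights = {"svm": 1.5, "rss": 0.5, "football": -2.0}
bias = -0.25
articles = {
    "Kernel tricks for SVM": {"svm": 2, "kernel": 1},
    "Football results": {"football": 3},
    "New RSS reader": {"rss": 1},
}

ranking = rank_articles(weights, bias, articles)
print(ranking)
```

Words absent from the weight vector contribute nothing to the score, which is what makes a sparse dictionary a convenient representation here.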

Max-margin separator

● The problem of finding the optimal w can be reduced to the following QP
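For reference, the usual hard-margin form of this QP is sketched here. It assumes n training pairs (x_i, y_i) with labels y_i in {−1, +1} and a bias term b; these symbols are assumptions, since only w is named on the slide.

```latex
\min_{\mathbf{w},\, b} \quad \tfrac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
\qquad \text{subject to} \qquad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1, \quad i = 1,\dots,n
```

Minimizing the norm of w maximizes the margin 2/‖w‖ between the two classes.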

Transductive SVM (TSVM)

● Semi-supervised learning vs. supervised learning.

● TSVM is well suited for problems where:

– Few labeled examples are available

– Lots of unlabeled data are available

● Information lying in the unlabeled data is captured and modifies the decision surface.

TSVM vs. SVM

TSVM optimization problem

● New set of optimization variables: the labels yi* of the unlabeled examples

● New set of slack variables for the unlabeled examples

● New user-specified parameter: C*

● Very difficult optimization problem:

– Intractable to solve exactly when there are more than 10 unlabeled examples

– Approximate solution proposed by Joachims.
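As a sketch, the soft-margin TSVM problem in the form given by Joachims (1999) can be written as follows. It assumes n labeled examples (x_i, y_i) and k unlabeled examples x_j*; the slack symbols ξ_i and ξ_j* and the bias b are notation assumed here rather than taken from the slide.

```latex
\begin{aligned}
\min_{y_1^*,\dots,y_k^*,\ \mathbf{w},\ b,\ \xi,\ \xi^*} \quad
  & \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
    + C \sum_{i=1}^{n} \xi_i
    + C^{*} \sum_{j=1}^{k} \xi_j^{*} \\
\text{subject to} \quad
  & y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,n \\
  & y_j^{*}\,(\mathbf{w}\cdot\mathbf{x}_j^{*} + b) \ge 1 - \xi_j^{*}, \quad \xi_j^{*} \ge 0, \quad j = 1,\dots,k \\
  & y_j^{*} \in \{-1, +1\}
\end{aligned}
```

Because the labels y_j* are discrete, an exact solution has to consider up to 2^k label assignments, which is why the problem becomes intractable so quickly.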

Text Classification

● Joachims T., “Transductive Inference for Text Classification using SVM”

● Characteristics of the Text Classification problem

● Why are SVM and TSVM well suited for this kind of problem?

● Feature selection for text classification using SVM

Characteristics of the Text Classification problem

● High dimensional input space

– One dimension for each word in the vocabulary (10 000 words)

● Sparse input vector

– In one text, a tiny proportion of the full vocabulary is used

Why (T)SVM?

● SVM has been shown to perform well in these conditions and can outperform other classifiers.

● Transductive SVM, exploiting information in test data, can outperform SVM when few training samples but lots of test data are available.

Feature selection for Text Classification using SVM

● Feature selection is the main problem in many machine learning applications.

● Poor feature selection leads to poor accuracy.

Feature selection (cont)

● For the text classification problem:

– The number of dimensions of the document vector is the number of words in the vocabulary. (Huge number of dimensions!)

– Each component of the document vector is the number of times the corresponding word occurs in the document.

● Refinement of the feature selection:

– Joachims scales each component of this document vector by the Inverse Document Frequency (IDF) of the corresponding word.

– The IDF can be computed from the Document Frequency DF(w), the number of documents in which the word w occurs:

IDF(w) = log(n/DF(w)), where n is the total number of documents

● Other refinements:

– Stopword elimination

– Word stemming
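The term counts and the IDF formula above can be sketched in a few lines of Python. This is a minimal illustration, assuming documents are already tokenized into lists of words; the function names and the three-document corpus are made up for the example.

```python
import math
from collections import Counter


def document_frequency(docs):
    """DF(w): number of documents in which word w occurs at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def tfidf_vectors(docs):
    """Return one sparse {word: count * IDF(word)} dict per document, plus the IDF table."""
    n = len(docs)
    df = document_frequency(docs)
    idf = {word: math.log(n / df[word]) for word in df}
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        vectors.append({word: counts[word] * idf[word] for word in counts})
    return vectors, idf


# Tiny made-up corpus (an assumption for illustration).
docs = [
    ["text", "classification", "task"],
    ["text", "classification", "problem"],
    ["rss", "aggregator", "text"],
]
vectors, idf = tfidf_vectors(docs)
print(idf["text"], idf["rss"])
```

A word that occurs in every document, such as "text" here, gets IDF 0 and so carries no weight, while a word confined to one document gets the largest weight.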

● Example: “The text classification task is characterized by a special set of characteristics. The text classification problem....”

● Transformation of the above text into a feature vector

Feature selection (example)

classification 2

task 1

charact 2

● The document vector is very sparse

● The words characteristics and characterized have the same stem, charact
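The transformation of the example sentence can be reproduced with a short sketch. The stopword list and the suffix table in `toy_stem` are hypothetical and hand-picked so that this one sentence yields the stem charact; a real system would use a proper stemmer, whose output may differ.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (assumption).
STOPWORDS = {"the", "is", "by", "a", "of"}

# Toy suffix table chosen for this example only; not a real stemming algorithm.
TOY_SUFFIXES = ("eristics", "erized")


def toy_stem(word):
    """Strip the first matching toy suffix, otherwise return the word unchanged."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def feature_vector(text):
    """Lowercase, tokenize, drop stopwords, stem, and count the remaining terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(toy_stem(t) for t in tokens if t not in STOPWORDS)


text = ("the text classification task is characterized by a special set of "
        "characteristics. The text classification problem....")
counts = feature_vector(text)
print(counts["classification"], counts["task"], counts["charact"])
```

Only the handful of words that actually occur get a non-zero entry; every other dimension of the vocabulary stays at zero, which is the sparsity noted above.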

Smart stuff

● WordNet

● Combinations of words

● Putting users into clusters

● Using additional features (links, dates, author, source etc.)

● Active learning

Conclusion

● TSVM is well suited for text classification problems

● Feature selection is crucial

● To boost accuracy to a reasonable level, we have to combine techniques.

References

● Simon Haykin, Neural Networks, Second Edition, Pearson Education, 1999, chapter 6

● Thorsten Joachims, Transductive Inference for Text Classification using SVM, Proceedings of ICML-99, 16th International Conference on Machine Learning, 1999

References (cont)

● Tom M. Mitchell, Machine Learning, McGraw-Hill International Editions, 1997, chapter 6

● K. Nigam, A. K. McCallum, S. Thrun, T. Mitchell, Text Classification from Labeled and Unlabeled Documents using EM, Machine Learning, Kluwer Academic Publishers, Boston, 1999
