active learning for text classification

ACTIVE LEARNING FOR TEXT CLASSIFICATION

Ankit Bhutani Y9094

AUTOMATIC TEXT CLASSIFICATION

A FEW HOURS ONLY

MANUAL TEXT CLASSIFICATION

TAKES YEARS

ORGANIZING LARGE VOLUMES OF TEXT

Massive volume of online text available.

Organisation into categories to enable efficient search.

Finds use in many applications, such as Data Mining, Automatic Query Answering, Learning User Interest, Making Suggestions, etc.

Learning Approaches : unsupervised, supervised and semi-supervised.

Terms Used

Multinomial Naïve Bayes :

◦ Documents in bag of words format
◦ Independence assumptions
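A minimal sketch of the two points above, assuming a toy four-word vocabulary and Laplace smoothing (`alpha`), neither of which comes from the slides. Documents are word-count vectors (bag of words), and the independence assumption shows up as a plain sum of per-word log-probabilities. The names `fit_mnb` and `predict_log_proba` are hypothetical.

```python
import numpy as np

def fit_mnb(X, y, n_classes, alpha=1.0):
    """Estimate MNB parameters from a count matrix X (docs x vocab) and labels y."""
    n_docs, n_words = X.shape
    log_prior = np.zeros(n_classes)
    log_word = np.zeros((n_classes, n_words))
    for c in range(n_classes):
        docs_c = X[y == c]
        # Smoothed class prior P(c)
        log_prior[c] = np.log((docs_c.shape[0] + alpha) / (n_docs + alpha * n_classes))
        # Smoothed word distribution P(w | c)
        counts = docs_c.sum(axis=0) + alpha
        log_word[c] = np.log(counts / counts.sum())
    return log_prior, log_word

def predict_log_proba(X, log_prior, log_word):
    """Return log P(c | doc); words contribute independently given the class."""
    joint = X @ log_word.T + log_prior
    m = joint.max(axis=1, keepdims=True)
    log_norm = m + np.log(np.exp(joint - m).sum(axis=1, keepdims=True))
    return joint - log_norm

# Toy data (assumed): rows are documents, columns are word counts.
X_train = np.array([[3, 0, 1, 0],
                    [2, 1, 0, 0],
                    [0, 0, 2, 3],
                    [0, 1, 1, 2]])
y_train = np.array([0, 0, 1, 1])
log_prior, log_word = fit_mnb(X_train, y_train, n_classes=2)

X_test = np.array([[2, 0, 0, 0],
                   [0, 0, 0, 2]])
log_post = predict_log_proba(X_test, log_prior, log_word)
pred = log_post.argmax(axis=1)
```

Word order is discarded entirely: only the counts in each row matter to the classifier.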

Terms Used

Semi-Supervised Learning :

◦ Makes use of Labeled as well as Unlabeled Data to learn the parameters of the model.

Expectation Maximization :

◦ Class of Iterative Algorithms for Maximum Likelihood Estimation in problems with incomplete data
◦ Alternates between two quantities: Parameters of the model and Document labels
◦ E-step: Provide Soft Labels to Documents based on estimated model parameters
◦ M-step: Re-estimate the model parameters based on the soft labels
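The loop between soft labels and model parameters can be sketched as follows. This is an illustrative version of basic EM with Multinomial Naïve Bayes, assuming Laplace smoothing, a fixed number of iterations and toy data; the function names `m_step`, `e_step` and `em_mnb` are hypothetical and not taken from the cited work.

```python
import numpy as np

def m_step(X, resp, alpha=1.0):
    """Re-estimate MNB parameters from soft labels resp (docs x classes)."""
    class_mass = resp.sum(axis=0) + alpha
    log_prior = np.log(class_mass / class_mass.sum())
    word_mass = resp.T @ X + alpha  # expected word counts per class
    log_word = np.log(word_mass / word_mass.sum(axis=1, keepdims=True))
    return log_prior, log_word

def e_step(X, log_prior, log_word):
    """Soft labels P(class | doc) under the current parameters."""
    joint = X @ log_word.T + log_prior
    joint = joint - joint.max(axis=1, keepdims=True)
    post = np.exp(joint)
    return post / post.sum(axis=1, keepdims=True)

def em_mnb(X_lab, y_lab, X_unlab, n_classes, n_iter=10):
    resp_lab = np.eye(n_classes)[y_lab]  # labeled documents keep hard labels
    log_prior, log_word = m_step(X_lab, resp_lab)  # initialise from labeled data only
    X_all = np.vstack([X_lab, X_unlab])
    for _ in range(n_iter):
        resp_unlab = e_step(X_unlab, log_prior, log_word)
        resp_all = np.vstack([resp_lab, resp_unlab])
        log_prior, log_word = m_step(X_all, resp_all)
    return log_prior, log_word

# Toy data (assumed): 2 labeled and 3 unlabeled documents over a 4-word vocabulary.
X_lab = np.array([[3, 0, 1, 0],
                  [0, 0, 1, 3]])
y_lab = np.array([0, 1])
X_unlab = np.array([[2, 1, 0, 0],
                    [0, 1, 0, 2],
                    [4, 0, 0, 1]])
log_prior, log_word = em_mnb(X_lab, y_lab, X_unlab, n_classes=2)
soft = e_step(X_unlab, log_prior, log_word)
```

A real run would normally stop on convergence of the likelihood rather than after a fixed `n_iter`.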

Terms Used

Active Learning :

◦ Form of supervised machine learning
◦ Learning Algorithm is able to interactively query the user
◦ Query has associated cost.
◦ Algorithm requests label for document such that gain in information about model parameters is maximized

But how to choose which DOCUMENT to request a Label for?
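To make the question concrete, here is a hedged sketch of a pool-based query loop with a pluggable scoring rule. Prediction entropy is used purely as a stand-in for "expected information gain"; it is an assumption for illustration, not the selection rule of this work. The nearest-centroid model, the `oracle` function and all names are hypothetical.

```python
import numpy as np

def entropy_scores(proba):
    """Higher score = model is less certain about the document."""
    p = np.clip(proba, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def active_learning_loop(fit, predict_proba, oracle, X_pool, labeled,
                         n_requests, score=entropy_scores):
    """labeled maps pool index -> label for the initially labeled documents."""
    labeled = dict(labeled)
    for _ in range(n_requests):
        idx = sorted(labeled)
        model = fit(X_pool[idx], np.array([labeled[i] for i in idx]))
        candidates = [i for i in range(len(X_pool)) if i not in labeled]
        if not candidates:
            break
        scores = score(predict_proba(model, X_pool[candidates]))
        query = candidates[int(np.argmax(scores))]
        labeled[query] = oracle(query)  # every label request has a cost
    return labeled

# Toy stand-in model (assumed): nearest centroid with softmax over negative distance.
def fit(X, y):
    return np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def predict_proba(centroids, X):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    e = np.exp(-d)
    return e / e.sum(axis=1, keepdims=True)

X_pool = np.array([[0.0], [1.0], [4.9], [9.0], [10.0]])
true_labels = [0, 0, 0, 1, 1]

def oracle(i):
    return true_labels[i]  # plays the role of the human annotator

result = active_learning_loop(fit, predict_proba, oracle, X_pool,
                              labeled={0: 0, 4: 1}, n_requests=1)
```

With one request, the loop asks about the document closest to the decision boundary (index 2) rather than the ones the model is already sure of.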

Terms Used

Query by Committee :

◦ Divide the training set into 4 – 5 sets.
◦ Each set, as a committee member, gives probability estimates.
◦ Maximum disagreement measured by maximum average KL divergence between all pairs
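The disagreement measure can be sketched as below. The committee's probability estimates are given directly as toy numbers (training the members on splits of the labeled set is not shown), and averaging over ordered pairs is an assumption, since KL divergence is asymmetric and the slides do not say how pairs are formed. The function names are hypothetical.

```python
import numpy as np
from itertools import permutations

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete class distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float((p * np.log(p / q)).sum())

def avg_pairwise_kl(member_probs):
    """Average KL over all ordered pairs of committee members for one document."""
    pairs = list(permutations(range(len(member_probs)), 2))
    return sum(kl(member_probs[i], member_probs[j]) for i, j in pairs) / len(pairs)

def select_query(committee_probs):
    """committee_probs has shape (members, docs, classes); pick the most disputed doc."""
    n_docs = committee_probs.shape[1]
    scores = [avg_pairwise_kl(committee_probs[:, d]) for d in range(n_docs)]
    return int(np.argmax(scores)), scores

# Toy estimates (assumed): 3 committee members, 3 unlabeled documents, 2 classes.
committee_probs = np.array([
    [[0.9, 0.1], [0.9, 0.1], [0.60, 0.40]],  # member 1
    [[0.9, 0.1], [0.1, 0.9], [0.50, 0.50]],  # member 2
    [[0.9, 0.1], [0.5, 0.5], [0.55, 0.45]],  # member 3
])
query, scores = select_query(committee_probs)
```

Document 0 scores zero because all members agree on it; document 1, where members contradict each other, is the one whose label would be requested.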

Terms Used

Semi-Supervised Frequency Estimate (SFE) :

◦ Slight variation in basic EM : Different parameters re-estimation formula.

NOTABLE WORK: Semi-Supervised Learning

Nigam et al, 1998-99 :

◦ MNB + EM
◦ 100 Labeled + 2500 Unlabeled documents
◦ 80 – 85 % accuracy

Nigam & McCallum, 2000 :

◦ MNB + EM + Active Learning
◦ Total 1000 Documents
◦ Label requests : 50, Accuracy :

NOTABLE WORK: Semi-Supervised Learning

LYRL, 2004 :

◦ Compared various Semi-supervised Learning Techniques
◦ Introduced Reuters Corpus as a new benchmark

Su, Shirabad and Matwin, 2011 :

◦ MNB + SFE

My work

MNB + SFE + Active Learning

◦ Data-set: Reuters Corpus from LYRL 2004: contains around 8 lakh (800,000) documents
◦ Experiments on 10,000 documents starting with :
   50 Labeled Documents + 100 requests
   100 Labeled Documents + 50 requests

Results so far
