
Page 1

Information Retrieval

Lecture 4
Introduction to Information Retrieval (Manning et al. 2007)

Chapter 13

For the MSc Computer Science Programme

Dell Zhang
Birkbeck, University of London

Page 2

Is this spam?

Page 3

Text Classification/Categorization

Given:
  A document d ∈ D.
  A set of classes C = {c1, c2, …, cn}.

Determine:
  The class of d: c(d) ∈ C, where c(d) is a classification function (“classifier”).
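To make the definition concrete, here is a minimal sketch in Python; the two classes and the toy keyword rule are hypothetical, echoing the spam question above:

```python
from typing import Callable

# The set of classes C = {c1, ..., cn}; here just two (hypothetical).
CLASSES = ["spam", "ham"]

# A classifier is simply a function c: D -> C from documents to classes.
Classifier = Callable[[str], str]

def toy_classifier(d: str) -> str:
    """A toy classification function: flag documents mentioning 'winner'."""
    return "spam" if "winner" in d.lower() else "ham"

assert toy_classifier("You are a WINNER! Claim your prize.") == "spam"
```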

Page 4

Classification Methods (1)

Manual Classification
For example: Yahoo! Directory, DMOZ, Medline, etc.
Very accurate when the job is done by experts.
Difficult to scale up.

Page 5

Classification Methods (2)

Hand-Coded Rules
For example: CIA, Reuters, SpamAssassin, etc.
Accuracy is often quite high, if the rules have been carefully refined over time by experts.
Expensive to build and maintain the rules.

Page 6

Classification Methods (3)

Machine Learning (ML)
For example:

Automatic Email Classification: PopFile http://popfile.sourceforge.net/

Automatic Webpage Classification: MindSet http://mindset.research.yahoo.com/

There is no free lunch: hand-classified training data are required.

But the training data can be built up (and refined) easily by amateurs.

Page 7

Text Classification via ML

[Diagram: labelled training documents (L) feed a Learning step that produces a Classifier; the Classifier is then used in a Predicting step on unlabelled test documents (U).]
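The workflow can be sketched in Python as follows; the documents and the word-overlap learner are toy placeholders (the lecture's actual learner, Naïve Bayes, comes later):

```python
# A learning algorithm turns labelled training documents into a classifier,
# which is then applied to unseen (unlabelled) test documents.

training_docs = [
    ("planning temporal reasoning plan language", "Planning"),
    ("programming semantics language proof", "Semantics"),
    ("learning intelligence algorithm reinforcement network", "ML"),
]
test_docs = ["planning language proof intelligence"]

def learn(labelled_docs):
    """Toy learner: predict the class of the training doc sharing most words."""
    def classify(d):
        words = set(d.split())
        _, best_class = max(labelled_docs,
                            key=lambda pair: len(words & set(pair[0].split())))
        return best_class
    return classify

classifier = learn(training_docs)          # Learning
print([classifier(d) for d in test_docs]) # Predicting; ties broken by order
```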

Page 8

Text Classification via ML - Example

Classes: Multimedia, GUI, Garb.Coll., Semantics, ML, Planning, … (grouped under broader areas such as HCI, Programming, and AI)

Training data (an example document per class):
  Planning: “planning temporal reasoning plan language …”
  Semantics: “programming semantics language proof …”
  ML: “learning intelligence algorithm reinforcement network …”
  Garb.Coll.: “garbage collection memory optimization region …”

Test data: “planning language proof intelligence”

Page 9

Evaluating Classification

Classification Accuracy: the proportion of correct predictions.

Precision, Recall, F1 (for each class)
  macro-averaging: compute the performance measure for each class, then take a simple average over the classes.
  micro-averaging: pool the per-document predictions across classes, then compute the performance measure on the pooled contingency table.
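A sketch of the two averaging schemes, assuming per-class contingency counts (tp, fp, fn) have already been collected; the counts below are made up for illustration:

```python
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0   # precision
    r = tp / (tp + fn) if tp + fn else 0.0   # recall
    return 2 * p * r / (p + r) if p + r else 0.0

per_class = {"spam": (90, 10, 10), "ham": (40, 20, 10)}  # (tp, fp, fn)

# Macro: average the per-class F1 scores (every class weighted equally).
macro_f1 = sum(f1(*c) for c in per_class.values()) / len(per_class)

# Micro: pool the counts first, then compute F1 once on the pooled table
# (so frequent classes dominate the result).
tp, fp, fn = (sum(c[i] for c in per_class.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)
```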

Page 10

Sample Learning Curve

[Figure: learning curve on the Yahoo! Science data.]

Page 11

Bayesian Methods for Classification

Before seeing the content of document d:
  Classify d to the class with maximum prior probability Pr[c].
  For each class cj ∈ C, Pr[cj] can be estimated from the training data:

$\Pr[c_j] = \frac{N_j}{\sum_{j'} N_{j'}}$

$N_j$: the number of documents in class $c_j$
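A minimal sketch of this estimate in Python (the training labels are made up):

```python
from collections import Counter

train_labels = ["spam", "ham", "spam", "spam", "ham"]

counts = Counter(train_labels)                      # N_j for each class c_j
total = sum(counts.values())                        # sum over all classes
priors = {c: n / total for c, n in counts.items()}
print(priors)  # {'spam': 0.6, 'ham': 0.4}
```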

Page 12

Bayesian Methods for Classification

After seeing the content of document d:
  Classify d to the class with maximum a posteriori probability Pr[c|d].
  For each class cj ∈ C, Pr[cj|d] can be computed via Bayes’ theorem.

Page 13

Bayes’ Theorem

$\Pr[c \mid d] = \frac{\Pr[d \mid c] \, \Pr[c]}{\Pr[d]}$

$\Pr[c]$: prior probability
$\Pr[d \mid c]$: class-conditional probability
$\Pr[c \mid d]$: posterior probability
$\Pr[d]$: a constant
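A small worked instance of the theorem, with made-up prior and class-conditional values for the spam example; the constant Pr[d] is obtained by summing over the classes:

```python
prior = {"spam": 0.6, "ham": 0.4}            # Pr[c]
likelihood = {"spam": 0.002, "ham": 0.0001}  # Pr[d|c]

evidence = sum(likelihood[c] * prior[c] for c in prior)             # Pr[d]
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)  # posterior of 'spam' is ~0.968 on these numbers
```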

Page 14

Naïve Bayes: Classification

$c(d) = \arg\max_{c_j \in C} \Pr[c_j \mid d]$

$\phantom{c(d)} = \arg\max_{c_j \in C} \frac{\Pr[d \mid c_j] \, \Pr[c_j]}{\Pr[d]}$

$\phantom{c(d)} = \arg\max_{c_j \in C} \Pr[d \mid c_j] \, \Pr[c_j]$, as $\Pr[d]$ is a constant.

How can we compute $\Pr[d \mid c_j]$?
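The decision rule itself is easy to sketch in Python, assuming the priors Pr[cj] and the conditionals Pr[w|cj] have somehow been estimated (how to estimate them is the subject of the next slides); logarithms are used because products of many small probabilities underflow:

```python
import math

def classify(d_terms, priors, cond_prob):
    """Return argmax over c_j of Pr[d|c_j] * Pr[c_j], computed in log space."""
    def score(c):
        return math.log(priors[c]) + sum(math.log(cond_prob[c][w])
                                         for w in d_terms)
    return max(priors, key=score)

# Example with made-up probabilities:
priors = {"spam": 0.6, "ham": 0.4}
cond_prob = {"spam": {"winner": 0.01, "meeting": 0.001},
             "ham": {"winner": 0.0005, "meeting": 0.02}}
print(classify(["winner", "winner"], priors, cond_prob))  # 'spam'
```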

Page 15

Naïve Bayes Assumptions

To facilitate the computation of Pr[d|cj], two simplifying assumptions are made.

Conditional Independence Assumption
  Given the document’s topic, the word in one position tells us nothing about the words in other positions.

Positional Independence Assumption
  Each document is treated as a bag of words: the occurrence of a word does not depend on its position.

Then Pr[d|cj] is given by the class-specific unigram language model, which is essentially a multinomial distribution.

Page 16

Unigram Language Model

Model for $c_j$:

  the   0.2
  a     0.1
  man   0.01
  woman 0.01
  said  0.03
  likes 0.02

$\Pr[d \mid c_j] = \prod_{w_i \in d} \Pr[w_i \mid c_j]$

the man likes the woman
0.2  0.01  0.02  0.2  0.01  (multiply)

$\Pr[d \mid c_j] = 0.2 \times 0.01 \times 0.02 \times 0.2 \times 0.01 = 0.00000008$
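The same computation in Python (math.prod needs Python 3.8+); real systems sum logarithms instead, since products of many small probabilities underflow:

```python
import math

model = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01,
         "said": 0.03, "likes": 0.02}  # Pr[w|c_j] from the slide
doc = "the man likes the woman".split()

p = math.prod(model[w] for w in doc)
print(p)  # ~8e-08, i.e. 0.2 * 0.01 * 0.02 * 0.2 * 0.01
```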

Page 17

Naïve Bayes: Learning

Given the training data, for each class $c_j \in C$:
  estimate $\Pr[c_j]$ (as before)
  for each term $w_i$ in the vocabulary $V$, estimate $\Pr[w_i \mid c_j]$:

$\Pr[w_i \mid c_j] = \frac{T_{ji} + 1}{\sum_{i'} (T_{ji'} + 1)}$

$T_{ji}$: the number of occurrences of term $w_i$ in documents of class $c_j$
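A sketch of this estimation step on a made-up toy corpus with a fixed vocabulary:

```python
from collections import Counter

vocab = ["planning", "language", "proof", "intelligence"]
class_docs = {"Planning": ["planning language planning", "planning proof"]}

cond_prob = {}
for c, docs in class_docs.items():
    T = Counter(w for d in docs for w in d.split())   # T_ji for each term
    denom = sum(T[w] + 1 for w in vocab)              # sum_i' (T_ji' + 1)
    cond_prob[c] = {w: (T[w] + 1) / denom for w in vocab}

print(cond_prob["Planning"]["planning"])  # (3 + 1) / (5 + 4) = 4/9
```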

Page 18

Smoothing

Why not just use MLE?

$\Pr[w_i \mid c_j] = \frac{T_{ji}}{\sum_{i'} T_{ji'}}$ (the maximum-likelihood estimate)

If a term w (in a test doc d) did not occur in the training data, Pr[w|cj] would be 0, and then Pr[d|cj] would be 0 no matter how strongly the other terms in d are associated with class cj.

Add-One (Laplace) Smoothing

$\Pr[w_i \mid c_j] = \frac{T_{ji} + 1}{\sum_{i'} (T_{ji'} + 1)}$
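A small contrast of the two estimates with made-up counts; the term 'intelligence' never occurs in the training data for this class, so MLE zeroes it out while add-one smoothing keeps every probability strictly positive:

```python
T = {"planning": 3, "language": 1, "proof": 1, "intelligence": 0}  # T_ji
total = sum(T.values())   # 5 term occurrences in this class
V = len(T)                # vocabulary size, 4

mle = {w: t / total for w, t in T.items()}
smoothed = {w: (t + 1) / (total + V) for w, t in T.items()}

print(mle["intelligence"])       # 0.0: any doc containing it gets Pr[d|c] = 0
print(smoothed["intelligence"])  # 1/9 > 0: the product survives
```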

Page 19

Naïve Bayes is Not So Naïve

Fairly Effective
  It is the Bayes-optimal classifier if the independence assumptions do hold.
  It often performs well even if the independence assumptions are badly violated.
  It usually yields highly accurate classifications (though the estimated probabilities are not so accurate).
  It took 1st and 2nd place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms.
  It is a good, dependable baseline for text classification (though not the best).

Page 20

Naïve Bayes is Not So Naïve

Very Efficient
  Linear time complexity for learning/classification.
  Low storage requirements.

Page 21

Take Home Messages

Text Classification via Machine Learning
Bayes’ Theorem
Naïve Bayes

Bayes’ Theorem:

$\Pr[c \mid d] = \frac{\Pr[d \mid c] \, \Pr[c]}{\Pr[d]}$

Learning:

$\Pr[c_j] = \frac{N_j}{\sum_{j'} N_{j'}}$

$\Pr[w_i \mid c_j] = \frac{T_{ji} + 1}{\sum_{i'} (T_{ji'} + 1)}$

Classification:

$c(d) = \arg\max_{c_j \in C} \Pr[c_j] \prod_{w_i \in d} \Pr[w_i \mid c_j]$
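Finally, a compact, self-contained sketch that puts the three take-home formulas together; the data layout and function names are my own choices, not from the lecture:

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """labelled_docs: list of (text, class) pairs."""
    n_docs = Counter()                  # N_j
    term_counts = defaultdict(Counter)  # T_ji
    vocab = set()
    for text, c in labelled_docs:
        n_docs[c] += 1
        for w in text.split():
            term_counts[c][w] += 1
            vocab.add(w)
    total = sum(n_docs.values())
    priors = {c: n / total for c, n in n_docs.items()}          # Pr[c_j]
    cond = {c: {w: (term_counts[c][w] + 1) /
                   (sum(term_counts[c].values()) + len(vocab))
                for w in vocab}
            for c in n_docs}                                    # Pr[w_i|c_j]
    return priors, cond, vocab

def classify(text, priors, cond, vocab):
    """argmax over c_j of log Pr[c_j] + sum_i log Pr[w_i|c_j]."""
    words = [w for w in text.split() if w in vocab]  # drop out-of-vocab terms
    return max(priors, key=lambda c: math.log(priors[c]) +
               sum(math.log(cond[c][w]) for w in words))

docs = [("planning temporal reasoning plan language", "Planning"),
        ("programming semantics language proof", "Semantics")]
priors, cond, vocab = train(docs)
print(classify("planning language proof intelligence",
               priors, cond, vocab))  # -> 'Semantics' on this tiny corpus
```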