
Page 1

Information Retrieval

Lecture 4
Introduction to Information Retrieval (Manning et al. 2007)

Chapter 13

For the MSc Computer Science Programme

Dell Zhang
Birkbeck, University of London

Page 2

Is this spam?

Page 3

Text Classification/Categorization

Given:
  A document d ∈ D.
  A set of classes C = {c1, c2, …, cn}.

Determine:
  The class of d: c(d) ∈ C, where c(d) is a classification function (“classifier”).
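To make the definition concrete, here is a minimal sketch in Python; the two classes and the toy keyword rule are hypothetical, echoing the spam question above:

```python
from typing import Callable

# The set of classes C = {c1, ..., cn}; here just two (hypothetical).
CLASSES = ["spam", "ham"]

# A classifier is simply a function c: D -> C from documents to classes.
Classifier = Callable[[str], str]

def toy_classifier(d: str) -> str:
    """A toy classification function: flag documents mentioning 'winner'."""
    return "spam" if "winner" in d.lower() else "ham"

assert toy_classifier("You are a WINNER! Claim your prize.") == "spam"
```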

Page 4

Classification Methods (1)

Manual Classification
For example: Yahoo! Directory, DMOZ, Medline, etc.
Very accurate when the job is done by experts.
Difficult to scale up.

Page 5

Classification Methods (2)

Hand-Coded Rules
For example: CIA, Reuters, SpamAssassin, etc.
Accuracy is often quite high, if the rules have been carefully refined over time by experts.
Expensive to build and maintain the rules.

Page 6

Classification Methods (3)

Machine Learning (ML)
For example:

Automatic Email Classification: PopFile http://popfile.sourceforge.net/

Automatic Webpage Classification: MindSet http://mindset.research.yahoo.com/

There is no free lunch: hand-classified training data are required.

But the training data can be built up (and refined) easily by amateurs.

Page 7

Text Classification via ML

[Diagram: labelled training documents (L) feed a Learning step that produces a Classifier; the Classifier is then used in a Predicting step on unlabelled test documents (U).]
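The workflow can be sketched in Python as follows; the documents and the word-overlap learner are toy placeholders (the lecture's actual learner, Naïve Bayes, comes later):

```python
# A learning algorithm turns labelled training documents into a classifier,
# which is then applied to unseen (unlabelled) test documents.

training_docs = [
    ("planning temporal reasoning plan language", "Planning"),
    ("programming semantics language proof", "Semantics"),
    ("learning intelligence algorithm reinforcement network", "ML"),
]
test_docs = ["planning language proof intelligence"]

def learn(labelled_docs):
    """Toy learner: predict the class of the training doc sharing most words."""
    def classify(d):
        words = set(d.split())
        _, best_class = max(labelled_docs,
                            key=lambda pair: len(words & set(pair[0].split())))
        return best_class
    return classify

classifier = learn(training_docs)          # Learning
print([classifier(d) for d in test_docs]) # Predicting; ties broken by order
```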

Page 8

Text Classification via ML - Example

Classes: Multimedia, GUI, Garb.Coll., Semantics, ML, Planning, … (grouped under broader areas such as HCI, Programming, and AI)

Training data (an example document per class):
  Planning: “planning temporal reasoning plan language …”
  Semantics: “programming semantics language proof …”
  ML: “learning intelligence algorithm reinforcement network …”
  Garb.Coll.: “garbage collection memory optimization region …”

Test data: “planning language proof intelligence”

Page 9

Evaluating Classification

Classification Accuracy: the proportion of correct predictions.

Precision, Recall, F1 (for each class)
  macro-averaging: compute the performance measure for each class, then take a simple average over the classes.
  micro-averaging: pool the per-document predictions across classes, then compute the performance measure on the pooled contingency table.
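A sketch of the two averaging schemes, assuming per-class contingency counts (tp, fp, fn) have already been collected; the counts below are made up for illustration:

```python
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0   # precision
    r = tp / (tp + fn) if tp + fn else 0.0   # recall
    return 2 * p * r / (p + r) if p + r else 0.0

per_class = {"spam": (90, 10, 10), "ham": (40, 20, 10)}  # (tp, fp, fn)

# Macro: average the per-class F1 scores (every class weighted equally).
macro_f1 = sum(f1(*c) for c in per_class.values()) / len(per_class)

# Micro: pool the counts first, then compute F1 once on the pooled table
# (so frequent classes dominate the result).
tp, fp, fn = (sum(c[i] for c in per_class.values()) for i in range(3))
micro_f1 = f1(tp, fp, fn)
```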

Page 10

Sample Learning Curve

[Figure: learning curve on the Yahoo! Science data.]

Page 11

Bayesian Methods for Classification

Before seeing the content of document d:
  Classify d to the class with maximum prior probability Pr[c].
  For each class cj ∈ C, Pr[cj] can be estimated from the training data:

$\Pr[c_j] = \frac{N_j}{\sum_{j'} N_{j'}}$

$N_j$: the number of documents in class $c_j$
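A minimal sketch of this estimate in Python (the training labels are made up):

```python
from collections import Counter

train_labels = ["spam", "ham", "spam", "spam", "ham"]

counts = Counter(train_labels)                      # N_j for each class c_j
total = sum(counts.values())                        # sum over all classes
priors = {c: n / total for c, n in counts.items()}
print(priors)  # {'spam': 0.6, 'ham': 0.4}
```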

Page 12

Bayesian Methods for Classification

After seeing the content of document d:
  Classify d to the class with maximum a posteriori probability Pr[c|d].
  For each class cj ∈ C, Pr[cj|d] can be computed via Bayes’ theorem.

Page 13

Bayes’ Theorem

$\Pr[c \mid d] = \frac{\Pr[d \mid c] \, \Pr[c]}{\Pr[d]}$

$\Pr[c]$: prior probability
$\Pr[d \mid c]$: class-conditional probability
$\Pr[c \mid d]$: posterior probability
$\Pr[d]$: a constant
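A small worked instance of the theorem, with made-up prior and class-conditional values for the spam example; the constant Pr[d] is obtained by summing over the classes:

```python
prior = {"spam": 0.6, "ham": 0.4}            # Pr[c]
likelihood = {"spam": 0.002, "ham": 0.0001}  # Pr[d|c]

evidence = sum(likelihood[c] * prior[c] for c in prior)             # Pr[d]
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)  # posterior of 'spam' is ~0.968 on these numbers
```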

Page 14

Naïve Bayes: Classification

$c(d) = \arg\max_{c_j \in C} \Pr[c_j \mid d]$

$\phantom{c(d)} = \arg\max_{c_j \in C} \frac{\Pr[d \mid c_j] \, \Pr[c_j]}{\Pr[d]}$

$\phantom{c(d)} = \arg\max_{c_j \in C} \Pr[d \mid c_j] \, \Pr[c_j]$, as $\Pr[d]$ is a constant.

How can we compute $\Pr[d \mid c_j]$?
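The decision rule itself is easy to sketch in Python, assuming the priors Pr[cj] and the conditionals Pr[w|cj] have somehow been estimated (how to estimate them is the subject of the next slides); logarithms are used because products of many small probabilities underflow:

```python
import math

def classify(d_terms, priors, cond_prob):
    """Return argmax over c_j of Pr[d|c_j] * Pr[c_j], computed in log space."""
    def score(c):
        return math.log(priors[c]) + sum(math.log(cond_prob[c][w])
                                         for w in d_terms)
    return max(priors, key=score)

# Example with made-up probabilities:
priors = {"spam": 0.6, "ham": 0.4}
cond_prob = {"spam": {"winner": 0.01, "meeting": 0.001},
             "ham": {"winner": 0.0005, "meeting": 0.02}}
print(classify(["winner", "winner"], priors, cond_prob))  # 'spam'
```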

Page 15

Naïve Bayes Assumptions

To facilitate the computation of Pr[d|cj], two simplifying assumptions are made.

Conditional Independence Assumption
  Given the document’s topic, the word in one position tells us nothing about the words in other positions.

Positional Independence Assumption
  Each document is treated as a bag of words: the occurrence of a word does not depend on its position.

Then Pr[d|cj] is given by the class-specific unigram language model, which is essentially a multinomial distribution.

Page 16

Unigram Language Model

Model for $c_j$:

  the   0.2
  a     0.1
  man   0.01
  woman 0.01
  said  0.03
  likes 0.02

$\Pr[d \mid c_j] = \prod_{w_i \in d} \Pr[w_i \mid c_j]$

the man likes the woman
0.2  0.01  0.02  0.2  0.01  (multiply)

$\Pr[d \mid c_j] = 0.2 \times 0.01 \times 0.02 \times 0.2 \times 0.01 = 0.00000008$
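The same computation in Python (math.prod needs Python 3.8+); real systems sum logarithms instead, since products of many small probabilities underflow:

```python
import math

model = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01,
         "said": 0.03, "likes": 0.02}  # Pr[w|c_j] from the slide
doc = "the man likes the woman".split()

p = math.prod(model[w] for w in doc)
print(p)  # ~8e-08, i.e. 0.2 * 0.01 * 0.02 * 0.2 * 0.01
```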

Page 17

Naïve Bayes: Learning

Given the training data, for each class $c_j \in C$:
  estimate $\Pr[c_j]$ (as before)
  for each term $w_i$ in the vocabulary $V$, estimate $\Pr[w_i \mid c_j]$:

$\Pr[w_i \mid c_j] = \frac{T_{ji} + 1}{\sum_{i'} (T_{ji'} + 1)}$

$T_{ji}$: the number of occurrences of term $w_i$ in documents of class $c_j$
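A sketch of this estimation step on a made-up toy corpus with a fixed vocabulary:

```python
from collections import Counter

vocab = ["planning", "language", "proof", "intelligence"]
class_docs = {"Planning": ["planning language planning", "planning proof"]}

cond_prob = {}
for c, docs in class_docs.items():
    T = Counter(w for d in docs for w in d.split())   # T_ji for each term
    denom = sum(T[w] + 1 for w in vocab)              # sum_i' (T_ji' + 1)
    cond_prob[c] = {w: (T[w] + 1) / denom for w in vocab}

print(cond_prob["Planning"]["planning"])  # (3 + 1) / (5 + 4) = 4/9
```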

Page 18

Smoothing

Why not just use MLE?

$\Pr[w_i \mid c_j] = \frac{T_{ji}}{\sum_{i'} T_{ji'}}$ (the maximum-likelihood estimate)

If a term w (in a test doc d) did not occur in the training data, Pr[w|cj] would be 0, and then Pr[d|cj] would be 0 no matter how strongly the other terms in d are associated with class cj.

Add-One (Laplace) Smoothing

$\Pr[w_i \mid c_j] = \frac{T_{ji} + 1}{\sum_{i'} (T_{ji'} + 1)}$
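A small contrast of the two estimates with made-up counts; the term 'intelligence' never occurs in the training data for this class, so MLE zeroes it out while add-one smoothing keeps every probability strictly positive:

```python
T = {"planning": 3, "language": 1, "proof": 1, "intelligence": 0}  # T_ji
total = sum(T.values())   # 5 term occurrences in this class
V = len(T)                # vocabulary size, 4

mle = {w: t / total for w, t in T.items()}
smoothed = {w: (t + 1) / (total + V) for w, t in T.items()}

print(mle["intelligence"])       # 0.0: any doc containing it gets Pr[d|c] = 0
print(smoothed["intelligence"])  # 1/9 > 0: the product survives
```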

Page 19

Naïve Bayes is Not So Naïve

Fairly Effective
  It is the Bayes-optimal classifier if the independence assumptions do hold.
  It often performs well even if the independence assumptions are badly violated.
  It usually yields highly accurate classifications (though the estimated probabilities are not so accurate).
  It took 1st and 2nd place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms.
  It is a good, dependable baseline for text classification (though not the best).

Page 20

Naïve Bayes is Not So Naïve

Very Efficient
  Linear time complexity for learning/classification.
  Low storage requirements.

Page 21

Take Home Messages

Text Classification via Machine Learning
Bayes’ Theorem
Naïve Bayes

Bayes’ Theorem:

$\Pr[c \mid d] = \frac{\Pr[d \mid c] \, \Pr[c]}{\Pr[d]}$

Learning:

$\Pr[c_j] = \frac{N_j}{\sum_{j'} N_{j'}}$

$\Pr[w_i \mid c_j] = \frac{T_{ji} + 1}{\sum_{i'} (T_{ji'} + 1)}$

Classification:

$c(d) = \arg\max_{c_j \in C} \Pr[c_j] \prod_{w_i \in d} \Pr[w_i \mid c_j]$
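Finally, a compact, self-contained sketch that puts the three take-home formulas together; the data layout and function names are my own choices, not from the lecture:

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """labelled_docs: list of (text, class) pairs."""
    n_docs = Counter()                  # N_j
    term_counts = defaultdict(Counter)  # T_ji
    vocab = set()
    for text, c in labelled_docs:
        n_docs[c] += 1
        for w in text.split():
            term_counts[c][w] += 1
            vocab.add(w)
    total = sum(n_docs.values())
    priors = {c: n / total for c, n in n_docs.items()}          # Pr[c_j]
    cond = {c: {w: (term_counts[c][w] + 1) /
                   (sum(term_counts[c].values()) + len(vocab))
                for w in vocab}
            for c in n_docs}                                    # Pr[w_i|c_j]
    return priors, cond, vocab

def classify(text, priors, cond, vocab):
    """argmax over c_j of log Pr[c_j] + sum_i log Pr[w_i|c_j]."""
    words = [w for w in text.split() if w in vocab]  # drop out-of-vocab terms
    return max(priors, key=lambda c: math.log(priors[c]) +
               sum(math.log(cond[c][w]) for w in words))

docs = [("planning temporal reasoning plan language", "Planning"),
        ("programming semantics language proof", "Semantics")]
priors, cond, vocab = train(docs)
print(classify("planning language proof intelligence",
               priors, cond, vocab))  # -> 'Semantics' on this tiny corpus
```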