
Page 1: Distributional Clustering of Words for Text Classification

Distributional Clustering of Words for Text Classification

Presentation by: Thomas Walsh (Rutgers University)

L. Douglas Baker (Carnegie Mellon University)

Andrew Kachites McCallum (Justsystem Pittsburgh Research Center)

Page 2: Distributional Clustering of Words for Text Classification

Clustering

• Define what it means for words to be “similar”.

• “Collapse” the word space by grouping similar words in “clusters”.

• Key Idea for Distributional Clustering:
  – Class probabilities given the words in a labeled document collection, P(C | w), provide rules for correlating words to classifications.
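A minimal sketch (the toy labeled corpus and whitespace tokenization are hypothetical, for illustration only) of estimating the per-word class distributions P(C | w) by counting:

```python
from collections import Counter, defaultdict

# Toy labeled corpus (hypothetical): each document is (text, class_label).
corpus = [
    ("the team won the game", "sports"),
    ("stocks fell on the news", "finance"),
    ("the coach praised the team", "sports"),
    ("markets and stocks rallied", "finance"),
]

# Count word occurrences per class.
class_word_counts = defaultdict(Counter)
for text, label in corpus:
    class_word_counts[label].update(text.split())

classes = sorted(class_word_counts)

def p_class_given_word(word):
    """Estimate P(C | w) as the fraction of the word's occurrences in each class."""
    counts = [class_word_counts[c][word] for c in classes]
    total = sum(counts)
    if total == 0:
        return None  # word never seen in training data
    return {c: n / total for c, n in zip(classes, counts)}

print(p_class_given_word("team"))    # concentrated on "sports"
print(p_class_given_word("stocks"))  # concentrated on "finance"
print(p_class_given_word("the"))     # less informative, spread across classes
```

Words whose P(C | w) distributions look alike are the ones a distributional clustering would group together.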

Page 3: Distributional Clustering of Words for Text Classification

Voting

• Can be understood by a voting model:

• Each word in a document casts a weighted vote for classification.

• Words that normally vote similarly can be clustered together and vote with the average of their weighted votes without negatively impacting performance.

Page 4: Distributional Clustering of Words for Text Classification

Benefits of Word Clustering

• Useful semantic word clustering
  – Automatically generates a “thesaurus”

• Higher classification accuracy
  – Sort of; we’ll discuss this in the results section

• Smaller classification models
  – Size reductions as dramatic as 50,000 → 50 features

Page 5: Distributional Clustering of Words for Text Classification

Benefits of Smaller Models

• Easier to compute – with the constantly increasing amount of available text, reducing the memory footprint is crucial.

• Memory-constrained devices like PDAs could now use text classification algorithms to organize documents.

• More complex algorithms can be unleashed that would be infeasible in 50000 dimensions.

Page 6: Distributional Clustering of Words for Text Classification

The Framework

• Start with training data consisting of:
  – A set of classes C = {c1, c2, …, cm}
  – A set of documents D = {d1, …, dn}
  – Each document has a class label

Page 7: Distributional Clustering of Words for Text Classification

Mixture Models

• f(xi | Θ) = Σ_k pk h(xi | θk)

• The sum of the pk’s is 1.

• h is a distribution function for x (such as a Gaussian) with θk as its parameters ((μk, σk) in the Gaussian case).

• Thus Θ = (p1 … pK, θ1 … θK)
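A minimal sketch (all parameter values are made up for illustration) of evaluating a one-dimensional Gaussian mixture density f(x | Θ) = Σ_k pk h(x | θk):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density h(x | theta_k) of a single Gaussian component."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(x, weights, params):
    """f(x | Theta) = sum_k p_k * h(x | theta_k); the weights must sum to 1."""
    return sum(p * gaussian_pdf(x, mu, sigma) for p, (mu, sigma) in zip(weights, params))

# Hypothetical two-component mixture: Theta = (p1, p2, (mu1, sigma1), (mu2, sigma2)).
weights = [0.3, 0.7]
params = [(0.0, 1.0), (4.0, 2.0)]
print(mixture_density(1.5, weights, params))
```

In the text-classification setting below, the components are not Gaussians but per-class word distributions.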

Page 8: Distributional Clustering of Words for Text Classification

What is Θ in this case?

• Assumption: one-to-one correspondence between the mixture model components and the classes.

• The class priors P(cj | Θ) form one part of the parameter vector Θ.

• They are estimated as the number of documents in each class divided by the total number of documents.

Page 9: Distributional Clustering of Words for Text Classification

What is Θ in this case?

• The rest of the entries in Θ correspond to disjoint sets: the jth set contains the probability of each word wt in the vocabulary V given the class cj, P(wt | cj; Θ).

• N(wt, di) is the number of times word wt appears in document di.

• P(cj | di) ∈ {0, 1}, since each training document carries a single known class label.
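A minimal sketch (with a hypothetical toy corpus, and Laplace smoothing as in the usual naive Bayes formulation) of estimating these parameters by counting: class priors as document fractions, and word probabilities P(wt | cj; Θ) proportional to 1 + Σ_i N(wt, di) P(cj | di):

```python
from collections import Counter

# Hypothetical labeled training data: (document text, class label).
docs = [
    ("the team won the game", "sports"),
    ("stocks fell on the news", "finance"),
    ("the coach praised the team", "sports"),
]

classes = sorted({label for _, label in docs})
vocab = sorted({w for text, _ in docs for w in text.split()})

# Class priors: fraction of documents in each class.
prior = {c: sum(1 for _, l in docs if l == c) / len(docs) for c in classes}

# Word counts per class: sum_i N(w_t, d_i) * P(c_j | d_i), where P(c_j | d_i) is 0 or 1.
word_counts = {c: Counter() for c in classes}
for text, label in docs:
    word_counts[label].update(text.split())

def p_word_given_class(word, c):
    """Laplace-smoothed estimate of P(w_t | c_j; Theta)."""
    total = sum(word_counts[c].values())
    return (1 + word_counts[c][word]) / (len(vocab) + total)

print(prior)
print(p_word_given_class("team", "sports"))
```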

Page 10: Distributional Clustering of Words for Text Classification

Prob. of a given Document in the Model

• The mixture model can be used to generate documents with probability:

  P(di | Θ) = Σ_j P(cj | Θ) P(di | cj; Θ)    (1)

• This is just the sum, over the classes, of the probability of generating this document from each class, weighted by the class prior.

Page 11: Distributional Clustering of Words for Text Classification

Documents as Collections of Words

• Treat each document as an ordered collection of word events.

• dik = the word in document di at position k.

• In general, each word is dependent on the preceding words.

Page 12: Distributional Clustering of Words for Text Classification

Apply Naïve Bayes Assumption

• Assume each word is independent of both context and position.

• Where dik = wt (the word wt occurring at position k of document di).

• Update formulas (2) and (1):
  – (2) P(di | cj; Θ) = Π_k P(wdik | cj; Θ)
  – (1) P(di | Θ) = Σ_j P(cj | Θ) Π_k P(wdik | cj; Θ)

Page 13: Distributional Clustering of Words for Text Classification

Incorporate Expanded Formulae for Θ

• We can calculate the model parameters Θ from the training data.

• Now we wish to calculate P(cj | di; Θ), the probability of document di belonging to class cj.

Page 14: Distributional Clustering of Words for Text Classification

Final Equation

• By Bayes’ rule:

  P(cj | di; Θ) = P(cj | Θ) Π_k P(wdik | cj; Θ) / Σ_r P(cr | Θ) Π_k P(wdik | cr; Θ)

• Numerator: class prior × (2), the product of the probabilities of each word in the document assuming class cj.

• Denominator: (1), the sum over classes of each class prior × the product of the word probabilities assuming class cr.

• The value of cj that maximizes this quantity is chosen as the class for the document.
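A minimal sketch (reusing the hypothetical toy corpus and Laplace smoothing from the parameter-estimation sketch above; log probabilities are used to avoid underflow) of classifying a document with this equation:

```python
import math
from collections import Counter

# Hypothetical labeled training data.
docs = [
    ("the team won the game", "sports"),
    ("stocks fell on the news", "finance"),
    ("the coach praised the team", "sports"),
]
classes = sorted({label for _, label in docs})
vocab = {w for text, _ in docs for w in text.split()}

prior = {c: sum(1 for _, l in docs if l == c) / len(docs) for c in classes}
word_counts = {c: Counter() for c in classes}
for text, label in docs:
    word_counts[label].update(text.split())

def log_p_word(word, c):
    # Laplace-smoothed log P(w_t | c_j; Theta).
    total = sum(word_counts[c].values())
    return math.log((1 + word_counts[c][word]) / (len(vocab) + total))

def classify(text):
    # argmax_j  log P(c_j | Theta) + sum_k log P(w_dik | c_j; Theta);
    # the denominator is shared by all classes, so it does not affect the argmax.
    words = [w for w in text.split() if w in vocab]
    scores = {c: math.log(prior[c]) + sum(log_p_word(w, c) for w in words)
              for c in classes}
    return max(scores, key=scores.get)

print(classify("the team played the game"))   # expected: "sports"
```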

Page 15: Distributional Clustering of Words for Text Classification

Shortcomings of the Framework

• In real-world data (documents) there isn’t actually an underlying mixture model, and the independence assumption doesn’t actually hold.

• But empirical evidence and some theoretical work (Domingos and Pazzani, 1997) indicate that the damage from this is negligible.

Page 16: Distributional Clustering of Words for Text Classification

What about clustering?

• So assuming the Framework holds… how does clustering fit into all this?

Page 17: Distributional Clustering of Words for Text Classification

How Does Clustering Affect Probabilities?

• When wt and ws are merged, the cluster’s class distribution is the occurrence-weighted average of the two word distributions:

  P(C | wt ∨ ws) = [P(wt) / (P(wt) + P(ws))] · P(C | wt) + [P(ws) / (P(wt) + P(ws))] · P(C | ws)

• That is, the fraction of the cluster coming from wt plus the fraction of the cluster coming from ws.
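A minimal sketch (with made-up counts and distributions) of merging two words’ class distributions into one cluster distribution by a count-weighted average:

```python
# Hypothetical per-word statistics: total occurrence counts and P(C | w).
count_wt, count_ws = 30, 10
p_c_given_wt = {"sports": 0.9, "finance": 0.1}
p_c_given_ws = {"sports": 0.8, "finance": 0.2}

def merge_distributions(n_t, p_t, n_s, p_s):
    """Weighted average of two class distributions, weighted by word occurrence."""
    w_t = n_t / (n_t + n_s)
    w_s = n_s / (n_t + n_s)
    return {c: w_t * p_t[c] + w_s * p_s[c] for c in p_t}

print(merge_distributions(count_wt, p_c_given_wt, count_ws, p_c_given_ws))
# {'sports': 0.875, 'finance': 0.125}
```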

Page 18: Distributional Clustering of Words for Text Classification

Vs. other forms of learning

• Measures similarity based on the property it is trying to estimate (the classes).
  – Makes the supervision in the training data really important.

• Clustering is based on the similarity of the class variable distributions.

• Key Idea: Clustering preserves the “shape” of the class distributions.

Page 19: Distributional Clustering of Words for Text Classification
Page 20: Distributional Clustering of Words for Text Classification

Kullback–Leibler Divergence

• Measures the difference between class distributions.

• D(P(C | wt) || P(C | ws)) = Σ_j P(cj | wt) log( P(cj | wt) / P(cj | ws) )

• If P(cj | wt) = P(cj | ws) for all j, then each term contributes log(1) = 0, so the divergence is 0.
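A minimal sketch (the distributions are made up) of the divergence as defined above:

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum_j p_j * log(p_j / q_j); assumes q_j > 0 wherever p_j > 0."""
    return sum(pj * math.log(pj / q[c]) for c, pj in p.items() if pj > 0)

p_c_given_wt = {"sports": 0.9, "finance": 0.1}
p_c_given_ws = {"sports": 0.8, "finance": 0.2}

print(kl_divergence(p_c_given_wt, p_c_given_ws))  # small: the words behave similarly
print(kl_divergence(p_c_given_wt, p_c_given_wt))  # 0.0: identical distributions
```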

Page 21: Distributional Clustering of Words for Text Classification

Problems with K-L Divergence

• Not symmetric

• The denominator P(cj | ws) can be 0 if ws does not appear in any documents of class cj, making the divergence undefined (infinite).

Page 22: Distributional Clustering of Words for Text Classification

K-L Divergence from the Mean

• Weight each word’s K-L divergence to the cluster’s mean distribution by that word’s share of the cluster’s occurrences:

  [P(wt) / (P(wt) + P(ws))] · D(P(C | wt) || P(C | wt ∨ ws)) + [P(ws) / (P(wt) + P(ws))] · D(P(C | ws) || P(C | wt ∨ ws))

• New and improved: uses a weighted average instead of just the mean.

• Justification: this fits clustering naturally, because the formerly independent word distributions are combined into the cluster’s statistics.
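A minimal sketch (building on the hypothetical merge and KL functions above) of this weighted “K-L divergence from the mean” measure used to judge how cheaply two words can be clustered:

```python
import math

def kl_divergence(p, q):
    return sum(pj * math.log(pj / q[c]) for c, pj in p.items() if pj > 0)

def kl_from_mean(n_t, p_t, n_s, p_s):
    """Occurrence-weighted K-L divergence of each word from the merged (mean) distribution."""
    w_t, w_s = n_t / (n_t + n_s), n_s / (n_t + n_s)
    mean = {c: w_t * p_t[c] + w_s * p_s[c] for c in p_t}
    return w_t * kl_divergence(p_t, mean) + w_s * kl_divergence(p_s, mean)

p_wt = {"sports": 0.9, "finance": 0.1}
p_ws = {"sports": 0.8, "finance": 0.2}
p_wu = {"sports": 0.1, "finance": 0.9}

print(kl_from_mean(30, p_wt, 10, p_ws))  # small: good candidates for one cluster
print(kl_from_mean(30, p_wt, 10, p_wu))  # large: poor candidates
```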

Page 23: Distributional Clustering of Words for Text Classification

Minimizing Error in Naïve Bayes Scores

• Assuming uniform class priors allows us to drop P(cj | Θ) and the whole denominator from (6).

• Then performing a little algebra turns the classification score into a cross entropy between the document’s word distribution and the class’s word distribution.

• So the error introduced by clustering can be measured as the change in this cross entropy. Minimizing it results in equation (9), so clustering by this method minimizes the error it introduces.
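A short sketch of that algebra (my notation: |di| is the length of document di), showing why the uniform-prior score reduces to a cross entropy:

```latex
% With uniform class priors the classification rule is
%   argmax_j  \prod_k P(w_{d_{ik}} \mid c_j; \Theta).
% Taking logs and grouping repeated words:
\arg\max_j \sum_t N(w_t, d_i)\,\log P(w_t \mid c_j; \Theta)
  \;=\; \arg\max_j \sum_t \frac{N(w_t, d_i)}{|d_i|}\,\log P(w_t \mid c_j; \Theta),
% since dividing by the constant |d_i| does not change the argmax.
% The right-hand side is the negative cross entropy between the document's
% empirical word distribution and the class word distribution P(W \mid c_j).
```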

Page 24: Distributional Clustering of Words for Text Classification

The Clustering Algorithm

• Comparing the similarity of all possible word clusters would be O(V²).

• Instead, a number M is set in advance as the total number of desired clusters.
  – More supervision.

• The M clusters are initialized with the M words that have the highest mutual information with the class variable.

• Properties: Greedy, scales efficiently

Page 25: Distributional Clustering of Words for Text Classification

Algorithm

[Algorithm figure: clusters are represented by their class distributions P(C | wt); a code sketch follows.]
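A minimal sketch (data structures and helper names are hypothetical; the word statistics would come from the training corpus, ordered by mutual information with the class) of the greedy loop described above: keep M clusters, add the next word as its own cluster, then merge the two most similar clusters under the K-L-from-the-mean measure:

```python
import math
from itertools import combinations

def kl(p, q):
    return sum(pj * math.log(pj / q[c]) for c, pj in p.items() if pj > 0)

def merge(a, b):
    """Merge two clusters: add counts, average distributions weighted by count."""
    n = a["count"] + b["count"]
    dist = {c: (a["count"] * a["dist"][c] + b["count"] * b["dist"][c]) / n
            for c in a["dist"]}
    return {"words": a["words"] | b["words"], "count": n, "dist": dist}

def similarity(a, b):
    """K-L divergence from the mean; smaller means more similar."""
    m = merge(a, b)
    wa, wb = a["count"] / m["count"], b["count"] / m["count"]
    return wa * kl(a["dist"], m["dist"]) + wb * kl(b["dist"], m["dist"])

def cluster_words(word_stats, M):
    """word_stats: list of (word, count, P(C|w)), assumed sorted by mutual information."""
    clusters = []
    for word, count, dist in word_stats:
        clusters.append({"words": {word}, "count": count, "dist": dist})
        if len(clusters) > M:
            # Merge the two most similar clusters to get back down to M.
            a, b = min(combinations(clusters, 2), key=lambda pair: similarity(*pair))
            clusters = [c for c in clusters if c is not a and c is not b] + [merge(a, b)]
    return clusters

# Hypothetical word statistics: (word, occurrence count, class distribution).
stats = [
    ("team",   30, {"sports": 0.9,  "finance": 0.1}),
    ("stocks", 25, {"sports": 0.1,  "finance": 0.9}),
    ("coach",  20, {"sports": 0.85, "finance": 0.15}),
    ("market", 15, {"sports": 0.2,  "finance": 0.8}),
]
for c in cluster_words(stats, M=2):
    print(sorted(c["words"]), c["dist"])
```

Because only the newly added word and the existing M clusters need comparing at each step, the procedure stays greedy and scales far better than an all-pairs O(V²) comparison.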

Page 26: Distributional Clustering of Words for Text Classification

Related Work

• Chi Merge / Chi 2
  – Use D. Clustering to discretize numbers.

• Class-based clustering
  – Uses the amount by which mutual information is reduced to determine when to cluster.
  – Not effective in text classification.

• Feature Selection by Mutual Information
  – Cannot capture dependencies between words.

• Markov-blanket-based Feature Selection
  – Also attempts to preserve the P(C | wt) shapes.

• Latent Semantic Indexing
  – Unsupervised, using PCA.

Page 27: Distributional Clustering of Words for Text Classification

The Experiment: Competitors to Distributional Clustering

• Clustering with LSI

• Information-Gain-Based Feature Selection

• Mutual-Information Feature Selection

• Feature selection involves cutting out redundant features.

• Clustering combines these redundancies instead.

Page 28: Distributional Clustering of Words for Text Classification

The Experiment: Testbeds

• 20 Newsgroups
  – 20,000 articles from 20 Usenet groups (approx. 62,000 words)

• ModApte “Reuters-21578”
  – 9,603 training docs, 3,299 testing docs, 135 topics (approx. 16,000 words)

• Yahoo! Science (July 1997)
  – 6,294 pages in 41 classes (approx. 44,000 words)
  – Very noisy data

Page 29: Distributional Clustering of Words for Text Classification

20 Newsgroups Results

• Results averaged over 5-20 trials.

• Computational constraints forced the Markov-blanket method onto a smaller data set (second graph).

• LSI uses only a 1/3 training ratio.

Page 30: Distributional Clustering of Words for Text Classification

20 Newsgroups Analysis

• Distributional Clustering achieves 82.1% accuracy at 50 features, almost as good as having the full vocabulary.

• More accurate than all non-clustering approaches.

• LSI did not add any improvement to clustering (claim: because it is unsupervised).

• On the smaller data set, D.C. achieves 80% accuracy far more quickly than the others, in some cases doubling their performance for small numbers of features.

• Claim: Clustering outperforms feature selection because it conserves information rather than discarding it.

Page 31: Distributional Clustering of Words for Text Classification

Speed in 20-Newsgroups Test

• Distributional Clustering: 7.5 minutes

• LSI: 23 minutes

• Markov Blanket: 10 hours

• Mutual information feature selection (???): 30 seconds

Page 32: Distributional Clustering of Words for Text Classification

Reuters-21578 Results

• D.C. outperforms others for small numbers of features

• Information-Gain based feature selection does better for larger feature sets.

• In this data set, documents can have multiple labels.

Page 33: Distributional Clustering of Words for Text Classification

Yahoo! Results

• Feature selection performs about as well as, or better than, clustering in these cases.

• Claim: The data is so noisy that it is actually beneficial to “lose data” via feature selection.

Page 34: Distributional Clustering of Words for Text Classification

Performance Summary

• Only a slight loss in accuracy despite the reduction in feature space.

• Preserves “redundant” information better than feature selection.

• The improvement is not as drastic with noisy data.

Page 35: Distributional Clustering of Words for Text Classification

Improvements on Earlier D.C. Work

• Does not show much improvement on sparse data, because the performance measure is related to the data distribution.
  – D.C. preserves class distributions, even if these are poor estimates to begin with.

• Thus this whole method relies on accurate values for P(C | wi).

Page 36: Distributional Clustering of Words for Text Classification

Future Work

• Improve D.C.’s handling of sparse data (ensure good estimates of P(C | wi)).

• Find ways to combine feature selection and D.C. to utilize the strengths of both (perhaps increasing performance on noisy data sets?).

Page 37: Distributional Clustering of Words for Text Classification

Some Thoughts

• Extremely supervised.

• Needs to be retrained when new documents come in.

• In a paper with a lot of topics, does Naïve Bayes (word independent of context) make sense?

• Didn’t work well on noisy data.

• How can we ensure proper Θ values?