on feature distributional clustering for text categorization
![Page 1: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/1.jpg)
On feature distributional clustering for text categorization
Bekkerman, El-Yaniv, Tishby and Winter
The Technion. June 27, 2001
![Page 2: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/2.jpg)
Plan of talk
A presentation of a new text categorization technique based on distributional clustering and Support Vector Machines (SVM).
A comparative evaluation of the new technique against previous work (Dumais et al.) that used Mutual Information (MI) feature selection.
![Page 3: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/3.jpg)
Main results
The evaluation is performed on two benchmark corpora: Reuters and 20 Newsgroups (20NG).
The new technique works better than the known one on 20NG, but it is not better on Reuters.
Possible reasons for this behavior will be discussed.
![Page 4: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/4.jpg)
Text categorization
The fundamental problem of splitting a large text corpus into a number of predefined semantic categories.
We deal with its supervised version.
The problem has many real-world applications, such as search engines and helpdesks.
![Page 5: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/5.jpg)
Text representation
A standard approach: Bag-Of-Words. A document is represented by the words it contains.
A much more sophisticated method: distributional clusters. A word is represented as a distribution over the categories. The words are then clustered into k clusters. Details follow later.
![Page 6: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/6.jpg)
Support Vector Machines
A modern inductive classification method.
Proposed by Vapnik.
Usually shows an advantage over other learning schemes such as K Nearest Neighbors and Naïve Bayes.
![Page 7: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/7.jpg)
Corpora
A corpus is a large collection of documents.
We checked our algorithms on two well-known corpora:
Reuters (ModApte split): 7063 articles in the training set, 2742 articles in the test set, 118 categories.
20 Newsgroups: 20000 articles, 20 categories.
![Page 8: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/8.jpg)
Multi-labeling vs. uni-labeling
Multi-labeled corpus: many articles belong to several categories. Example: Reuters (15.5% of its documents are multi-labeled).
Uni-labeled corpus: each article belongs to only one category. 20 Newsgroups was thought to be such a corpus, but in fact 4.5% of its documents are multi-labeled.
![Page 9: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/9.jpg)
Related results
Dumais et al. (1998): SVM with simple feature selection on Reuters. Best known result: 92.0% break-even over the 10 largest categories.
Baker and McCallum (1998): Distributional clustering + Naïve Bayes on 20NG. 85.7% accuracy (uni-labeled scheme).
![Page 10: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/10.jpg)
Related results (contd.)
Joachims (1996): Rocchio algorithm on Naïve Bayes. Best known result on 20NG (uni-labeled approach): 90.3% accuracy.
Slonim and Tishby (2000): Information Bottleneck method. Used in our work.
![Page 11: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/11.jpg)
Related results (contd.)
Zhang and Oles (2001): a comparative study of linear classification techniques for text categorization over different corpora. SVM is always better.
![Page 12: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/12.jpg)
The case of our study
Pipeline: corpus, then MI feature selection or Distributional Clustering, then Support Vector Machine, then result.
![Page 13: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/13.jpg)
Feature selection via Mutual Information
On the training set, choose the N words that contribute most to separating the categories.
The contribution is measured by Mutual Information, for each word w and each category c:
$$I(w, c) = \sum_{e_w \in \{0,1\}} \sum_{e_c \in \{0,1\}} p(e_w, e_c) \log \frac{p(e_w, e_c)}{p(e_w)\, p(e_c)}$$
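A minimal sketch of the mutual information score, assuming that e_w indicates whether the word occurs in a document, that e_c indicates whether the document belongs to the category, and that the probabilities are estimated from document counts on the training set. The function names and the contingency-table layout are illustrative, not taken from the talk.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(w, c) in nats from a 2x2 table of document counts.

    n11: documents that contain w and belong to c
    n10: documents that contain w and do not belong to c
    n01: documents that lack w and belong to c
    n00: documents that lack w and do not belong to c
    """
    total = n11 + n10 + n01 + n00
    # Joint counts indexed by (e_w, e_c).
    joint = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    # Marginal counts for e_w and e_c.
    count_w = {1: n11 + n10, 0: n01 + n00}
    count_c = {1: n11 + n01, 0: n10 + n00}
    mi = 0.0
    for (e_w, e_c), n in joint.items():
        if n == 0:
            continue  # the term 0 * log 0 is taken as 0
        p_joint = n / total
        p_w = count_w[e_w] / total
        p_c = count_c[e_c] / total
        mi += p_joint * math.log(p_joint / (p_w * p_c))
    return mi

def top_n_words(tables, n):
    """Return the n words with the highest MI for one category.

    tables maps each word to its (n11, n10, n01, n00) counts.
    """
    ranked = sorted(tables, key=lambda w: mutual_information(*tables[w]), reverse=True)
    return ranked[:n]
```

A word that occurs independently of the category scores 0, while a word that occurs exactly in the documents of the category gets the maximal score.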
![Page 14: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/14.jpg)
Feature selection via MI (contd.)
For each category we build a list of the N most contributing words.
For example (on 20 Newsgroups):
sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, circuits, …
rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, …
![Page 15: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/15.jpg)
Distributional Clustering
Proposed by Pereira, Tishby and Lee (1993).
Its generalization is called Information Bottleneck (Tishby, Pereira, Bialek 1999).
In our case, each word (in the training set) is represented as a distribution over the categories it appears in.
Each word w is then clustered into a pseudo-word $\tilde{w}$.
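A small sketch of how a word can be turned into a distribution over categories. It assumes p(c|w) is estimated from raw occurrence counts of the word in the training documents of each category; the talk does not state the exact estimator, and the function name is illustrative.

```python
from collections import Counter, defaultdict

def word_category_distributions(docs):
    """Map each word to its estimated distribution p(c | w).

    docs is an iterable of (tokens, category) pairs from the training set.
    """
    counts = defaultdict(Counter)
    for tokens, category in docs:
        for token in tokens:
            counts[token][category] += 1  # occurrences of the word in this category
    dists = {}
    for word, per_category in counts.items():
        total = sum(per_category.values())
        dists[word] = {c: n / total for c, n in per_category.items()}
    return dists
```

Words with similar distributions (for example "car" and "engine") are the ones that clustering is expected to merge into one pseudo-word.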
![Page 16: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/16.jpg)
Distributional Clustering (contd.)
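The idea is to maximize the Mutual Information $I(\tilde{w}, c)$ with respect to the partition $p(\tilde{w} \mid w)$ under a constraint on $I(\tilde{w}, w)$.
The solution is given by the following equation:
$$p(\tilde{w} \mid w) = \frac{p(\tilde{w})}{Z(\beta, w)} \exp\left( -\beta \sum_{c} p(c \mid w) \ln \frac{p(c \mid w)}{p(c \mid \tilde{w})} \right)$$
where Z is the normalization factor and β is an annealing parameter.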
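The equation above can be evaluated directly for one word once the cluster priors and the cluster distributions over categories are known. The sketch below assumes the standard Information Bottleneck form, in which the exponent is minus β times the KL divergence between p(c|w) and p(c|pseudo-word); the function and argument names are illustrative.

```python
import math

def kl(p, q):
    """KL divergence sum_c p[c] * ln(p[c] / q[c]), in nats.

    Assumes q[c] > 0 wherever p[c] > 0.
    """
    return sum(pc * math.log(pc / qc) for pc, qc in zip(p, q) if pc > 0)

def soft_assignments(p_c_given_w, p_cluster, p_c_given_cluster, beta):
    """Return p(cluster | w) for one word, one entry per cluster."""
    weights = [
        prior * math.exp(-beta * kl(p_c_given_w, centroid))
        for prior, centroid in zip(p_cluster, p_c_given_cluster)
    ]
    z = sum(weights)  # the normalization factor Z(beta, w)
    return [x / z for x in weights]
```

With β = 0 the assignment simply equals the cluster priors; as β grows, the word is assigned almost entirely to the cluster whose category distribution is closest to its own.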
![Page 17: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/17.jpg)
Deterministic Annealing
A powerful clustering method, proposed by Rose et al. (1998).
The approach is “top-down”:
Start with one cluster at low β (“high temperature”).
Split it while lowering the “temperature” (raising β) until reaching a stable stage.
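A simplified sketch of the annealing idea, not the procedure used in the talk. It fixes the number of clusters at two and starts them as near-identical copies of the overall mean, so that they behave as one cluster at low β and separate only once β is high enough. The uniform word prior, the perturbation size `eps`, the number of iterations per β and the function name are all assumptions made for illustration.

```python
import math

def kl(p, q):
    """KL divergence sum_c p[c] * ln(p[c] / q[c]), in nats."""
    return sum(pc * math.log(pc / qc) for pc, qc in zip(p, q) if pc > 0)

def anneal_two_clusters(word_dists, betas, iters_per_beta=20, eps=0.01):
    """Soft-cluster words into two pseudo-words while raising beta.

    word_dists holds p(c | w) for each word, as lists over the same categories.
    Returns p(cluster | w) for each word after the last beta.
    """
    n_words = len(word_dists)
    n_cats = len(word_dists[0])
    p_w = 1.0 / n_words  # assumed uniform prior over words
    mean = [sum(d[c] for d in word_dists) / n_words for c in range(n_cats)]
    # Two almost identical centroids: the mean nudged in opposite directions.
    centroids = [list(mean), list(mean)]
    centroids[0][0] += eps
    centroids[0][1] -= eps
    centroids[1][0] -= eps
    centroids[1][1] += eps
    priors = [0.5, 0.5]
    assign = [[0.5, 0.5] for _ in word_dists]
    for beta in betas:  # increasing beta means decreasing temperature
        for _ in range(iters_per_beta):
            # Re-assign every word softly, given the current clusters.
            assign = []
            for d in word_dists:
                weights = [pr * math.exp(-beta * kl(d, ct))
                           for pr, ct in zip(priors, centroids)]
                z = sum(weights)
                assign.append([x / z for x in weights])
            # Re-estimate the cluster priors and the cluster distributions.
            priors = [p_w * sum(a[t] for a in assign) for t in range(2)]
            centroids = [
                [p_w * sum(a[t] * d[c] for a, d in zip(assign, word_dists)) / priors[t]
                 for c in range(n_cats)]
                for t in range(2)
            ]
    return assign
```

A full top-down procedure would keep splitting clusters until the desired number of pseudo-words is reached; this sketch only shows a single split.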
![Page 18: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/18.jpg)
Deterministic Annealing (contd.)
![Page 19: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/19.jpg)
Vector space in our experiment
With the MI feature selection technique, documents are projected onto the N most contributing words.
With the Information Bottleneck technique, words are first grouped into clusters, and then documents are projected onto the pseudo-words.
So documents are vectors whose elements are the numbers of occurrences of the “best” words (MI feature selection) or of the pseudo-words (Information Bottleneck).
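The two projections can be sketched with one helper. The function name, the token-list input and the word-to-cluster dictionary are illustrative assumptions; words that are not among the features (or have no cluster) are simply dropped.

```python
from collections import Counter

def project(tokens, features, word_to_cluster=None):
    """Return the count vector of one document over the given features.

    features: the selected "best" words, or the pseudo-word ids.
    word_to_cluster: if given, each word is first replaced by its pseudo-word.
    """
    if word_to_cluster is not None:
        tokens = [word_to_cluster[t] for t in tokens if t in word_to_cluster]
    counts = Counter(tokens)
    return [counts[f] for f in features]

# MI feature selection: count only the selected words.
mi_vector = project(["car", "engine", "the", "car"], ["car", "engine", "circuit"])

# Clustering: count the pseudo-words that the words map to.
clusters = {"car": 0, "engine": 0, "voltage": 1}
ib_vector = project(["car", "engine", "voltage", "the"], [0, 1], clusters)
```

In the second call the counts of "car" and "engine" are pooled into one element, which is how clustering keeps information from many words in a short vector.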
![Page 20: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/20.jpg)
Support Vector Machines
A modern classification technique.
The classification is based on the border examples only: the support vectors.
We used a linear SVM (the SVMlight package by Joachims).
![Page 21: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/21.jpg)
Multi-labeled setting
1. MI feature selection (or distributional clustering) is applied to the training and test sets.
2. For each category we train a binary classifier on the training set.
3. On each document in the test set we run all the classifiers.
4. The document is assigned to all the categories whose classifiers accepted it.
![Page 22: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/22.jpg)
Uni-labeled setting
1. The same as in the multi-labeled setting.
2. The same.
3. The same.
4. The document is assigned to the (one) category whose classifier accepted it with the maximal score.
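The final step of the two settings differs only in how the per-category classifier scores are turned into labels. A minimal sketch, assuming each binary classifier returns a real-valued score and accepts a document when the score is above a threshold of 0 (an assumption, as are the function names); the uni-labeled rule is simplified to taking the highest score.

```python
def multi_label(scores, threshold=0.0):
    """Return all categories whose classifier accepts the document."""
    return {category for category, score in scores.items() if score > threshold}

def uni_label(scores):
    """Return the single category with the maximal classifier score."""
    return max(scores, key=scores.get)

# Hypothetical scores of three binary classifiers for one test document.
scores = {"rec.autos": 1.3, "sci.electronics": 0.2, "talk.politics.misc": -0.7}
accepted = multi_label(scores)
best = uni_label(scores)
```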
![Page 23: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/23.jpg)
Evaluating the results
Multi-labeled: each document’s labels should be identical to the classification results. Precision/Recall scheme.
Uni-labeled: the classification result should be included in the set of document’s labels. Accuracy measure (number of hits).
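Both measures can be sketched in a few lines. The break-even computation below is one common way to obtain the point where precision equals recall: rank the test documents by classifier score and cut the ranking after as many documents as there are true positives. The talk does not say how its break-even values were computed, so this choice and the function names are assumptions.

```python
def break_even(scores, positives):
    """Precision (= recall) when as many documents are accepted as are truly positive.

    scores maps a document id to the classifier score for one category;
    positives is the set of document ids that truly belong to the category.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    accepted = ranked[:len(positives)]
    hits = sum(1 for doc in accepted if doc in positives)
    # Accepted and positive sets have the same size, so precision equals recall.
    return hits / len(positives)

def uni_labeled_accuracy(predictions, label_sets):
    """Fraction of documents whose predicted category is among their labels."""
    hits = sum(1 for pred, labels in zip(predictions, label_sets) if pred in labels)
    return hits / len(predictions)
```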
![Page 24: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/24.jpg)
The setup of our experiment
To reproduce the results achieved by Dumais et al., we chose k = 300 (the number of “best” words and the number of clusters).
Since we wanted to compare 20NG with Reuters (ModApte split: ¾ is the training set and ¼ is the test set), we used 4-fold cross-validation on 20NG.
![Page 25: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/25.jpg)
Parameter tuning
We have 2 major parameters:
The number of clusters or “best” words (k).
The SVM parameters (C and J in SVMlight).
For each experiment, k is fixed.
To perform a “fair” experiment, we tune C and J on the training set, splitting it into train-train and train-validation sets.
Then we run the experiment with the best parameters found at the previous stage.
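The fair tuning procedure can be sketched as a grid search that never touches the test set. The helper `train_and_score` is hypothetical: it stands for training a classifier on the first set with the given C and J and returning its score on the second set. The 25% validation fraction and the candidate grids are also assumptions.

```python
from itertools import product

def tune_c_and_j(train_docs, c_values, j_values, train_and_score,
                 validation_fraction=0.25):
    """Pick the (C, J) pair that scores best on a held-out part of the training set."""
    split = int(len(train_docs) * (1 - validation_fraction))
    train_train, train_validation = train_docs[:split], train_docs[split:]
    best_params, best_score = None, float("-inf")
    for c, j in product(c_values, j_values):
        # Hypothetical callable: train on train_train, evaluate on train_validation.
        score = train_and_score(train_train, train_validation, c, j)
        if score > best_score:
            best_params, best_score = (c, j), score
    return best_params
```

The chosen pair is then used to train on the whole training set and to evaluate once on the test set.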
![Page 26: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/26.jpg)
Unfair parameter tuning
Suppose we want to compare the results of two experiments A and B, and we see that the result of A is better than that of B.
Then we run B with unfair parameter tuning: the parameters are tuned directly on the test set.
If B still does not reach the result of A, this assures us that it is impossible to achieve the result of A with the setting of B.
![Page 27: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/27.jpg)
Our result on 20 Newsgroups
Multi-labeled setting (break-even point):
Clustering: 88.6±0.3% (k = 300)
MI feature selection: 77.7±0.5% (k = 300)
MI feature selection: 86.3±0.4% (k = 15000)
Uni-labeled setting (accuracy measure):
Clustering: 91.0±0.3% (k = 300)
MI feature selection: 85.1±0.5% (k = 300)
MI feature selection: 91.2±0.4% (k = 15000)
Parameter tuning of the MI-based experiments is unfair.
![Page 28: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/28.jpg)
Our result on Reuters
It makes no sense to speak about the uni-labeled setting on Reuters, because it is a multi-labeled corpus.
Multi-labeled setting (break-even point):
Clustering: 91.6% (k = 300), unfair tuning
MI feature selection: 92.0% (k = 300)
The results are achieved on the 10 largest categories of Reuters.
![Page 29: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/29.jpg)
Discussion of the results
We see that our technique (clustering) works better than MI on 20NG and almost the same (a little worse) on Reuters.
What can be the explanation?
Reuters is manually labeled while 20NG is “naturally” labeled.
Hypothesis: Reuters was labeled only according to a few keywords that appeared in the documents.
![Page 30: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/30.jpg)
Confirmation of our suggestion
We tried decreasing the number of features selected by the MI technique, on both Reuters and 20NG.
We saw that on 20NG the results decreased sharply, while on Reuters the results remained the same.
So just a few words are enough to categorize the documents of Reuters, while on 20NG we need many more words.
![Page 31: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/31.jpg)
Dependence of break-even on number of features
![Page 32: On feature distributional clustering for text categorization](https://reader035.vdocuments.us/reader035/viewer/2022062517/56813870550346895da0214a/html5/thumbnails/32.jpg)
Conclusion
There are corpora for which simple methods work well, such as Reuters: selecting just a few features solves the problem of text categorization.
For other corpora (such as 20NG) the more sophisticated method of distributional clustering helps a lot.
Future work: to evaluate our technique on other corpora.