soc presentation title 2004 a new term weighting method for text categorization lan man school of...
TRANSCRIPT
![Page 1: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/1.jpg)
SoC Presentation Title 2004
A New Term Weighting Method for Text Categorization
LAN Man
School of ComputingNational University of Singapore
16 Apr, 2007
![Page 2: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/2.jpg)
SoC Presentation Title 2004
Outline
• Introduction• Motivation• Methodology of Research• Analysis and Proposal of a New Term
Weighting Method• Experimental Research Work• Contributions and Future Work
![Page 3: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/3.jpg)
SoC Presentation Title 2004
Outline
• Introduction• Motivation• Methodology of Research• Analysis and Proposal of a New Term
Weighting Method• Experimental Research Work• Contributions and Future Work
![Page 4: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/4.jpg)
SoC Presentation Title 2004
Introduction Background
• Explosive increase of textual information• Organizing and accessing these information
in flexible ways• Text Categorization (TC) is the task of
classifying natural language documents into
a predefined set of semantic categories
![Page 5: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/5.jpg)
SoC Presentation Title 2004
• Categorize web pages by topic (the directories like Yahoo!);• Customize online newspapers to different labels according to a particular user’s reading preferences;• Filter spam emails and forward incoming emails to the target expert by content;• Word sense disambiguation is also taken as a text categorization task once we view word occurrence contexts as documents and word sense as categories.
Introduction Applications of TC
![Page 6: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/6.jpg)
SoC Presentation Title 2004
• TC is a discipline at the crossroads of machine learning (ML) and information retrieval (IR)
• Two key issues in TC:– Text Representation– Classifier Construction
• This thesis focuses on the first issue.
Introduction Two sub-issues of TC
![Page 7: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/7.jpg)
SoC Presentation Title 2004
• Approaches to build a classifier:– No more than 20 algorithms– Borrowed from Information Retrieval: Rocchio– Machine Learning: SVM, kNN, decision Tree,
Naïve Bayesian, Neural Network, Linear
Regression, Decision Rule, Boosting, etc.• SVM performs better.
Introduction Construction of Text Classifier
![Page 8: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/8.jpg)
SoC Presentation Title 2004
Little room to improve the performance from the algorithm aspect:
– Excellent algorithms are few. – The rationale is inherent to each algorithm and
the method is usually fixed for one given algorithm– Tuning the parameter has limited improvement
Introduction Construction of Text Classifier
![Page 9: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/9.jpg)
SoC Presentation Title 2004
• Various text format, such as DOC, PDF, PostScript, HTML, etc.• Can Computer read them like us? • Convert them into a compact format in order to be recognized and categorized for classifiers or a classifier-building algorithm in a computer.• This indexing procedure is also called text representation.
Introduction Text Representation
![Page 10: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/10.jpg)
SoC Presentation Title 2004
Texts are vectors in the term space.Assumption: Documents that are “close together” in space are similar in meaning.
Introduction Vector Space Model
![Page 11: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/11.jpg)
SoC Presentation Title 2004
Two main issues in Text Representation:
1. What should a term be – Term Type
2. How to weight a term – Term Weighting
Introduction Text Representation
![Page 12: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/12.jpg)
SoC Presentation Title 2004
• Sub-word level – syllables• Word level – single token• Multi-word level – phrases, sentences, etc• Syntactic and semantic – sense (meaning)
Introduction 1. What Should a Term be?
![Page 13: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/13.jpg)
SoC Presentation Title 2004
1. Word senses (meanings) [Kehagias 2001] same word assumes different meanings in a different contexts
2. Term clustering [Lewis 1992]group words with high degree of pairwise semantic relatedness
3. Semantic and syntactic representation [Scott & matwin 1999]
Relationship between words, i.e. phrases, synonyms and hypernyms
Introduction 1. What Should a Term be?
![Page 14: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/14.jpg)
SoC Presentation Title 2004
4. Latent Semantic Indexing [Deerwester 1990]
A feature reconstruction technique
5. Combination Approach [Peng 2003]
Combine two types of indexing terms, i.e. words
and 3-grams
6. Theme Topic Mixture Model – [Keller 2004]
Graphical Model
7. Using keywords from summarization [Li 2003]
Introduction 1. What Should a Term be?
![Page 15: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/15.jpg)
SoC Presentation Title 2004
• In general, high level representation did not show good performance in most cases
• Word level is better (bag-of-words)
Introduction 1. What Should a Term be?
![Page 16: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/16.jpg)
SoC Presentation Title 2004
• [Salton 1988] elaborated three considerations:–1. term occurrences closely represent the content of document–2. other factors with the discriminating power pick up the relevant documents from other irrelevant documents–3. consider the effect of length of documents
Introduction 2. How to Weight a Term?
![Page 17: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/17.jpg)
SoC Presentation Title 2004
• Simplest method – binary • Most popular method – tf.idf• Combination with information-theory metrics
or statistics method – tf.chi2, tf.ig, tf.gr• Combination with the linear classifier
Introduction 2. How to Weight a Term?
![Page 18: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/18.jpg)
SoC Presentation Title 2004
Outline
• Introduction• Motivation• Methodology of Research• Analysis and Proposal of a New Term
Weighting Method• Experimental Research Work• Contributions and Future Work
![Page 19: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/19.jpg)
SoC Presentation Title 2004
• SVM is better.• Leopold(2002) stated that text representation
dominates the performance of text categorization
rather than the kernel functions of SVMs.
Which is the best method for SVM-based text classifier among these widely-used ones?
Motivation (1)
![Page 20: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/20.jpg)
SoC Presentation Title 2004
• Text categorization is a form of supervised learning • The prior information on the membership of
training documents in predefined categories is
useful in– feature selection– supervised learning for classifier
Motivation (2)
![Page 21: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/21.jpg)
SoC Presentation Title 2004
• The supervised term weighting methods adopt
this known information and consider the
document distribution.• They are naturally expected to be superior to
the unsupervised (traditional) term weighting
methods.
Motivation (2)
![Page 22: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/22.jpg)
SoC Presentation Title 2004
Q1. How can we propose a new effective term weighting method by using the important prior information given by the training data set?
Q2. Which is the best term weighting method for SVM-based text classifier?
MotivationThree questions to be addressed
![Page 23: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/23.jpg)
SoC Presentation Title 2004
Q3. Are supervised term weighting methods able to lead to better performance than unsupervised ones for TC?
What kinds of relationship can we find between term weighting methods and the widely-used learning algorithms, i.e. kNN and SVM, given different benchmark data collections?
MotivationThree questions to be addressed
![Page 24: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/24.jpg)
SoC Presentation Title 2004
First, we will analyze term’s discriminating power for TC and propose a new term weighting method using the prior information of the training data set to improve the performance of TC.
MotivationSub-tasks of This Thesis
![Page 25: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/25.jpg)
SoC Presentation Title 2004
Second, we will explore term weighting methods for SVM-based text categorization and investigate the best method for SVM-based text classifier.
MotivationSub-tasks of This Thesis
![Page 26: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/26.jpg)
SoC Presentation Title 2004
Third, we will extend research work on more general benchmark data sets and learning algorithms to examine the superiority of supervised term weighting methods and investigate the relationship between term weighting methods and different learning algorithms. Moreover, this study will be extended to a new application domain, i.e. biomedical literature classification.
MotivationSub-tasks of This Thesis
![Page 27: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/27.jpg)
SoC Presentation Title 2004
Outline
• Introduction• Motivation• Methodology of Research• Analysis and Proposal of a New Term
Weighting Method• Experimental Research Work• Contributions and Future Work
![Page 28: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/28.jpg)
SoC Presentation Title 2004
• Machine Learning Algorithms– SVM, kNN
• Benchmark Data Corpora– Reuters News Corpus– 20Newsgroups Corpus– Ohsumed Corpus– 18Journal Corpus
Methodology of Research
![Page 29: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/29.jpg)
SoC Presentation Title 2004
• Evaluation Measures– F1– Breakeven point
• Significance Tests– McNemar’s significance Test
Methodology of Research
![Page 30: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/30.jpg)
SoC Presentation Title 2004
Outline
• Introduction• Motivation• Methodology of Research• Analysis and Proposal of a New Term
Weighting Method• Experimental Research Work• Contributions and Future Work
![Page 31: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/31.jpg)
SoC Presentation Title 2004
Three considerations:1. Term occurrence: binary, tf, ITF, log(tf)
2. Term’s discriminative power: idfNote: chi^2, ig (information gain), gr
(gain ratio), mi (mutual information), or (Odds Ratio), etc.
3. Document length: cosine normalization, linear normalization
Analysis and Proposal of a New Term Weighting Method
![Page 32: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/32.jpg)
SoC Presentation Title 2004
Analysis and Proposal of a New Term Weighting MethodAnalysis of Term’s Discriminating Power
![Page 33: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/33.jpg)
SoC Presentation Title 2004
• Assume they have same tf value. t1, t2, t3 share the same idf1 value; t4, t5, t6 share the same idf2 value.
• Clearly, the six terms contribute differently to the semantic of documents.
Analysis and Proposal of a New Term Weighting MethodAnalysis of Term’s Discriminating Power
![Page 34: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/34.jpg)
SoC Presentation Title 2004
• Case 1. t1 contributes more than t2 and t3; t4 contributes more than t5 and t6.
• Case 2. t4 contributes more than t1 although idf(t4) < idf(t1).
Analysis and Proposal of a New Term Weighting MethodTerm’s Discriminating Power -- idf
![Page 35: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/35.jpg)
SoC Presentation Title 2004
•Case 3. for t1, t2, t3,
in idf value, t1= t2= t3
in chi^2 value, t1=t3 >t2
in or value, t1 > t2 > t3
• Case 4. for t1 and t4,
in idf value, t1 > t4
in chi^2 and or value, t1 < t4
Analysis and Proposal of a New Term Weighting MethodTerm’s Discriminating Power -- idf, chi^2, or
![Page 36: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/36.jpg)
SoC Presentation Title 2004
• Intuitive Consideration
the more concentrated a high frequency term is in the positive category than in the negative category, the more contributions it makes in selecting the positive samples from among the negative samples.
Analysis and ProposalA New Term Weighting Method -- rf
![Page 37: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/37.jpg)
SoC Presentation Title 2004
• rf – relevance frequency
• The rf is only related to the ratio of b and c, not involve d
• The base of log is 2.
• in case of c=0, c=1
• The final weight of term is: tf*rf
)),1max(_
2log(c
brf +=
Analysis and ProposalA New Term Weighting Method -- rf
![Page 38: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/38.jpg)
SoC Presentation Title 2004
Analysis and ProposalEmpirical observation
feature
Category: 00_acq
idf rf chi2 or ig gr
Acquir 3.553 4.368 850.66 30.668 0.125 0.161
Stake 4.201 2.975 303.94 24.427 0.074 0.096
Payout 4.999 1 10.87 0.014 0.011 0.014
dividend 3.567 1.033 46.63 0.142 0.017 0.022
Comparison of idf, rf, chi2, or, ig and gr value of four features in category 00_acq of Reuters Corpus
![Page 39: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/39.jpg)
SoC Presentation Title 2004
Analysis and ProposalEmpirical observation
feature
Category: 03_earn
idf rf chi2 or ig gr
Acquir 3.553 1.074 81.50 0.139 0.031 0.032
Stake 4.201 1.082 31.26 0.164 0.018 0.018
Payout 4.999 7.820 44.68 364.327 0.041 0.042
dividend 3.567 4.408 295.46 35.841 0.092 0.095
Comparison of idf, rf, chi2, or, ig and gr value of four features in category 03_earn of Reuters Corpus
![Page 40: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/40.jpg)
SoC Presentation Title 2004
Outline
• Introduction• Motivation• Methodology of Research• Analysis and Proposal of a New Term
Weighting Method• Experimental Research Work• Contributions and Future Work
![Page 41: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/41.jpg)
SoC Presentation Title 2004
Purposes of Experiment Set 1:
1. Explore the best term weighting method for SVM-based text categorization – (Q2)
2. Compare tf.rf with various traditional term weighting methods on SVM – (Q1)
Experiment Set 1Exploring the best term weighting method for SVM-based text categorization
![Page 42: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/42.jpg)
SoC Presentation Title 2004
Methods Denotation Description
Traditional
Weighting
Methods
Term frequency
binary 0: absence, 1: presence
tf Term frequency alone
logtf log(1+tf)
ITF 1-1/(1+tf)
tf.idf
And
variants
idf idf alone
tf.idf Classical tf * idf
logtf.idf log(1+tf) * idf
tf. idf_prob tf * Probabilistic idf
Supervised
Weighting
Methods
tf.chi^2 tf * chi square
Our new tf.rf tf.rf Our new method
Experiment Set 1Experimental Methodology: 10 Methods
![Page 43: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/43.jpg)
SoC Presentation Title 2004
Experiment Set 1Results on Reuters Corpus
![Page 44: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/44.jpg)
SoC Presentation Title 2004
Experiment Set 1Results on subset of 20Newsgroups Corpus
![Page 45: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/45.jpg)
SoC Presentation Title 2004
• tf.rf performs better consistently
• No significant difference among the three term frequency variants, i.e. tf, logtf and ITF
• No significant difference between tf.idf, logtf.idf and tf.idf-prob
Experiment Set 1Conclusions (1)
![Page 46: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/46.jpg)
SoC Presentation Title 2004
• idf and chi^2 factor, even considering the
distribution of documents, do not improve or
even decrease the performance
• binary and tf.chi^2 significantly underperform
the other methods
Experiment Set 1Conclusions (2)
![Page 47: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/47.jpg)
SoC Presentation Title 2004
Purposes of Experiment Set 2:
1. Investigate supervised term weighting methods and their relationship with learning algorithms – (Q3)
2. Compare the effectiveness of tf.rf under more general experimental circumstances – (Q1)
Experiment Set 2Investigating supervised term weighting method and their relationship with learning algorithms
![Page 48: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/48.jpg)
SoC Presentation Title 2004
• Supervised term weighting methods– Use the prior information on the membership
of training documents in predefined categories• Unsupervised term weighting methods
– Does not use. – binary, tf, log(1+tf), ITF– Most popular is tf.idf and its variants: logtf.idf,
tf.idf-prob
Experiment Set 2Review
![Page 49: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/49.jpg)
SoC Presentation Title 2004
1. Combined with information-theory functions or statistic metrics
– such as chi2, information gain, gain ratio, Odds ratio, etc.– Used in the feature selection step– Select the most relevant and discriminating features for the classification task, that is, the terms with higher feature selection scores
The results are inconsistent and/or incomplete.
Experiment Set 2Review: Supervised Term Weighting Methods
![Page 50: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/50.jpg)
SoC Presentation Title 2004
2. Interaction with supervised text classifier–Linear SVM, Perceptron, kNN–Text classifier selects the positive test documents from negative test documents by assigning different scores to the test samples, these scores are believed to be effective in assigning more appropriate weights to terms
Experiment Set 2Review: Supervised Term Weighting Methods
![Page 51: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/51.jpg)
SoC Presentation Title 2004
3. Based on Statistical Confidence Intervals -- ConfWeight
– Linear SVM– Compared with tf.idf and tf.gr (gain ratio)– The results failed to show that supervised
methods are generally higher than
unsupervised methods.
Experiment Set 2Review: Supervised Term Weighting Methods
![Page 52: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/52.jpg)
SoC Presentation Title 2004
Hypothesis:
Since supervised schemes consider the document distribution, they should perform better than unsupervised ones
Motivation:
1. Are supervised term weighting method is
superior to the unsupervised (traditional) methods?
2. What kinds of relationship between term
weighting methods and the learning algorithms,
i.e. kNN and SVM, given different data collections?
Experiment Set 2
![Page 53: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/53.jpg)
SoC Presentation Title 2004
Unsupervised
Term
Weighting
binary 0 or 1
tf Term frequency
tf.idf Classic scheme
Supervised
Term
Weighting
tf.rf Own scheme
tf.chi^2 Chi^2
tf.ig Information gain
tf.or Odds ratio
Experiment Set 2Methodology
![Page 54: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/54.jpg)
SoC Presentation Title 2004
Experiment Set 2Results on Reuters Corpus using SVM
![Page 55: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/55.jpg)
SoC Presentation Title 2004
Experiment Set 2Results on Reuters Corpus using kNN
![Page 56: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/56.jpg)
SoC Presentation Title 2004
Experiment Set 2Results on 20Newsgroups Corpus using SVM
![Page 57: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/57.jpg)
SoC Presentation Title 2004
Experiment Set 2Results on 20Newsgroups Corpus using kNN
![Page 58: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/58.jpg)
SoC Presentation Title 2004
Algorithm Corpus #_fea Significance Test Results
SVM R 15937 (tf.rf, tf, rf) > tf.idf >
(tf.ig, tf.chi2, binary) >> tf.orSVM 20 13456 (rf, tf.rf, tf.idf) > tf >> binary >> tf.or
>> (tf.ig, tf.chi2)kNN R 405 (binary, tf.rf) > tf >>
(tf.idf, rf, tf.ig) > tf.chi2 >> tf.orkNN 20 494 (tf.rf, binary, tf.idf,tf) >> rf >> (tf.or,
tf.ig, tf.chi2)
Experiment Set 2McNemar’s Significance Tests
![Page 59: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/59.jpg)
SoC Presentation Title 2004
Experiment Set 2
Effects of feature set size on algorithms
• For SVM, almost all methods achieved the
best performance when in putting the full
vocabulary (13000-16000 features)• For kNN, the best performance achieved at a
smaller feature set size (400-500 features)
• Possible reason: different noise resistance
![Page 60: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/60.jpg)
SoC Presentation Title 2004
Q1: Are supervised term weighting methods superior to the unsupervised term weighting methods?
-- No always.
Experiment Set 2 Conclusions (1)
![Page 61: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/61.jpg)
SoC Presentation Title 2004
Experiment Set 2 Conclusions (1)
• Specifically, three supervised methods based on information theory, i.e. tf.chi2, tf.ig and tf.or, perform rather poorly in all experiments.
• On the other hand, newly proposed supervised method, tf.rf achieved the best performance consistently and outperforms other methods substantially and significantly.
![Page 62: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/62.jpg)
SoC Presentation Title 2004
Q2: What kinds of relationship between term weighting methods and learning algorithms, given different benchmark data collections?
A: The performance of the term weighting methods, especially, the three unsupervised methods, has close relationships with the learning algorithms and data corpora.
Experiment Set 2 Conclusions (2)
![Page 63: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/63.jpg)
SoC Presentation Title 2004
Summary of supervised methods:– tf.rf performs consistently better in all
experiments;– tf.or, tf.chi^2 and tf.ig perform consistently
the worst in all experiments;– rf alone, shows a comparable performance
to tf.rf except for on Reuters using kNN
Experiment Set 2 Conclusions (4)
![Page 64: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/64.jpg)
SoC Presentation Title 2004
Summary of unsupervised methods:– tf.idf performs comparable to tf.rf on the uniform
category corpus either using SVM or kNN;– binary performs comparable to tf.rf on both
corpora using kNN while rather bad using SVM;– although tf does not perform as well as tf.rf, it
performs consistently well in all experiments
Experiment Set 2 Conclusions (5)
![Page 65: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/65.jpg)
SoC Presentation Title 2004
Purpose:
Application on biomedical data collections
Motivation:
Explosive growth of biomedical research
Experiment Set 3Applications on Biomedical Domain
![Page 66: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/66.jpg)
SoC Presentation Title 2004
• The Ohsumed Corpus
Used by Joachims [1998]
• 18 Journals Corpus– Biochemistry and Molecular Biology– Top impact factor– 7903 documents from two-year span (2004 - 2005)
Experiment Set 3Data Corpora
![Page 67: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/67.jpg)
SoC Presentation Title 2004
Four term weighting methods
bianry, tf, tf.idf, tf.rf
Evaluation Measures– Micro-averaged breakeven point– F1
Experiment Set 3Methodology
![Page 68: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/68.jpg)
SoC Presentation Title 2004
Experiment Set 3 Results on the Ohsumed Corpus
![Page 69: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/69.jpg)
SoC Presentation Title 2004
scheme Micro-R Micro-P Micro-F1 Macro-F1
binary 0.6091 0.6097 0.6094 0.5757
tf 0.6578 0.6566 0.6572 0.6335
tf.idf 0.6567 0.6588 0.6578 0.6407
tf.rf 0.6810 0.6800 0.6805 0.6604
Experiment Set 3 The best performance of SVM with four term weighting schemes on the Ohsumed Corpus
![Page 70: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/70.jpg)
SoC Presentation Title 2004
• Our linear SVM is more accurate– 68.05% (linear) vs 60.7%(linear) and 66.1%(rbf)
• The performances of tf.idf are identical– 65.78% vs 66%
• Our proposed tf.rf performs significantly better than
other methods.
Experiment Set 3Comparison between our study and Joachims’ study
![Page 71: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/71.jpg)
SoC Presentation Title 2004
Experiment Set 3Results on 18 Journals Corpus
![Page 72: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/72.jpg)
SoC Presentation Title 2004
Experiment Set 3Results on 3 subsets of 18 Journals Corpus
![Page 73: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/73.jpg)
SoC Presentation Title 2004
• The comparison between our results and Joachims’ results shows that tf.rf can improve on the classification performance than tf.idf.
• tf.rf outperforms the other term weighting methods on the both data corpora.
• tf.rf can improve the classification performance of biomedical text classification.
Experiment Set 3Conclusions
![Page 74: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/74.jpg)
SoC Presentation Title 2004
Outline
• Introduction• Motivation• Methodology of Research• Analysis and Proposal of a New Term
Weighting Method• Experimental Research Work• Contributions and Future Work
![Page 75: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/75.jpg)
SoC Presentation Title 2004
Contributions of Thesis
1. To propose an effective supervised term
weighting method tf.rf to improve the performance of text categorization.
![Page 76: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/76.jpg)
SoC Presentation Title 2004
Contributions of Thesis
2. To make an extensive comparative study
of different term weighting methods under
controlled conditions.
![Page 77: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/77.jpg)
SoC Presentation Title 2004
Contributions of Thesis
3. To give a deep analysis of the terms’
discriminating power for text categorization
from the qualitative and quantitative aspects.
![Page 78: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/78.jpg)
SoC Presentation Title 2004
Contributions of Thesis
4. To investigate the relationships between
term weighting methods and learning
algorithms given different corpora.
![Page 79: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/79.jpg)
SoC Presentation Title 2004
Future Work (1)
• Extending term weighting methods
on feature types other than word
![Page 80: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/80.jpg)
SoC Presentation Title 2004
Future Work (2)
2. Applying term weighting methods to
other text-related applications
![Page 81: SoC Presentation Title 2004 A New Term Weighting Method for Text Categorization LAN Man School of Computing National University of Singapore 16 Apr, 2007](https://reader035.vdocuments.us/reader035/viewer/2022070415/5697c02f1a28abf838cda6d2/html5/thumbnails/81.jpg)
SoC Presentation Title 2004
Thanks for your time and suggestion.