Comments from Pre-submission Presentation
Posted on 13-Jan-2016
SoC Presentation Title 2004
Comments from Pre-submission Presentation
Q: Check why kNN is so much lower (about 10%) than SVM on the Reuters and 20 Newsgroups corpora.
A: Refer to the following four references: [Joachims 98], [Debole 03 STM], [Dumais 98 Inductive], [Yang 99 Re-examination].
[Joachims 98] [Debole 03] [Dumais 98] Results on the Reuters Corpus

[Joachims 98] Micro-BEP (%):
  Bayes 69.84 | Rocchio 79.14 | C4.5 77.78 | kNN 82.5 | SVM (linear) 84.2 | SVM (poly) 86 | SVM (rbf) 86

[Debole 03] Micro-F1 (%):
  kNN 85.4 | SVM (linear) 92.0

[Dumais 98] Micro-BEP (%):
  NBayes 81.5 | DT 88.4 | SVM (linear) 92.0
[Yang 99 Re-examination] Significance Tests
Micro-level analysis (s-test):
SVM > kNN >> {LLSF, NNet} >> NB
Macro-level analysis:
{SVM, kNN, LLSF} >> {NB, NNet}
Error-rate-based comparison:
{SVM, kNN} > LLSF > NNet >> NB
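The s-test above compares two classifiers decision by decision. As a minimal sketch (not Yang's exact procedure), a two-sided sign test over the decisions on which the two systems disagree can be written as follows; the counts in the example are hypothetical:

```python
from math import comb

def sign_test(n_a_wins, n_b_wins):
    """Two-sided sign test p-value over disagreements only.

    Under H0 the two systems are equally good, so each disagreement is a
    fair coin flip and the win count follows Binomial(n, 0.5).
    """
    n = n_a_wins + n_b_wins
    k = max(n_a_wins, n_b_wins)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# hypothetical: A correct where B errs on 40 decisions, the reverse on 22
print(sign_test(40, 22))
```

A small p-value supports a ">" or ">>" relation; ties (decisions where both systems agree) are discarded before the test.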
Comments from Pre-submission Presentation
2. Explain why BEP & F1 are used in Chapter 7.
- Add a reference.
Breakeven point (1)
BEP was first proposed by Lewis [1992]. Later, he himself pointed out that BEP is not a good effectiveness measure, because:
1. there may be no parameter setting that yields breakeven; in this case the final BEP value, obtained by interpolation, is artificial;
2. having P = R is not necessarily desirable, and it is not clear that a system achieving a high BEP can be tuned to score well on other effectiveness measures.
Breakeven point (2)
Yang [1999 Re-examination] also noted that when no parameter value brings P and R close enough together, the interpolated breakeven may not be a reliable indicator of effectiveness.
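To make the interpolation concern concrete, here is a minimal sketch of one common way an interpolated BEP is computed, assuming the usual convention of averaging P and R at the point on the curve where they are closest (the precision/recall values below are invented for illustration):

```python
def interpolated_bep(points):
    """Interpolated breakeven point from (precision, recall) pairs.

    If no threshold yields P == R exactly, the BEP is often approximated
    as (P + R) / 2 at the point where |P - R| is smallest -- which is
    exactly the "artificial" value Lewis and Yang warn about.
    """
    p, r = min(points, key=lambda pr: abs(pr[0] - pr[1]))
    return (p + r) / 2

# No point on this curve has P == R, so the value is interpolated:
curve = [(0.90, 0.40), (0.80, 0.60), (0.70, 0.75), (0.55, 0.85)]
print(interpolated_bep(curve))  # closest pair is (0.70, 0.75) -> 0.725
```

If the closest pair is far apart (e.g. P = 0.9, R = 0.4), the averaged value corresponds to no achievable operating point at all, which is Yang's reliability objection.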
Comments from Pre-submission Presentation
3. Adding more qualitative analysis would be better.
Analysis and Proposal: Empirical observation
Feature    | 00_acq (idf, rf, chi2)   | 03_earn (idf, rf, chi2)
acquir     | 3.553  4.368  850.66     | 3.553  1.074   81.50
stake      | 4.201  2.975  303.94     | 4.201  1.082   31.26
payout     | 4.999  1.000   10.87     | 4.999  7.820   44.68
dividend   | 3.567  1.033   46.63     | 3.567  4.408  295.46
Comparison of idf, rf and chi2 values of four features in two categories of the Reuters Corpus
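The rf and chi2 columns in the table can be derived from a term/category contingency table. The sketch below assumes the tf.rf definition of relevance frequency, rf = log2(2 + a / max(1, c)), and the standard chi-square statistic for a 2x2 table; the counts in the example are hypothetical, not the actual Reuters counts behind the table:

```python
from math import log2

def rf(a, c):
    """Relevance frequency (assumed tf.rf definition):
    a = # positive-category docs containing the term,
    c = # negative-category docs containing the term."""
    return log2(2 + a / max(1, c))

def chi2(a, b, c, d):
    """Chi-square statistic for a 2x2 term/category table:
    a = term & positive, b = no-term & positive,
    c = term & negative, d = no-term & negative."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# hypothetical counts: a term concentrated in the positive category
print(rf(100, 9))
print(chi2(100, 900, 9, 9000))
```

Note the asymmetry this illustrates: rf depends only on how the term's occurrences split across the two categories, while chi2 also rewards rare terms, which is why the two columns rank the four features differently.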
Comments from Pre-submission Presentation
4. In Chapter 7, removing Joachims' results and referring to them via quotation is fine.
Comments from Pre-submission Presentation
5. Tone down “best” claims: use “to our knowledge (experience, understanding)”.
Pay attention to this usage when giving presentations.
Introduction: Other Text Representations
• Word senses (meanings) [Kehagias 2001]
The same word assumes different meanings in different contexts.
• Term clustering [Lewis 1992]
Group words with a high degree of pairwise semantic relatedness.
• Semantic and syntactic representation [Scott & Matwin 1999]
Relationships between words, e.g. phrases, synonyms and hypernyms.
Introduction: Other Text Representations
• Latent Semantic Indexing [Deerwester 1990]
A feature reconstruction technique.
• Combination approach [Peng 2003]
Combines two types of indexing terms, i.e. words and 3-grams.
In general, these higher-level representations did not show good performance in most cases.
Literature Review: Knowledge-based Representation
• Theme Topic Mixture Model – a graphical model [Keller 2004]
• Using keywords from summarization [Li 2003]
Literature Review: 2. How to weight a term (feature)
[Salton 1988] elaborated three considerations:
1. term occurrences closely represent the content of a document;
2. other factors with discriminating power pick out relevant documents from irrelevant ones;
3. the effect of document length should be taken into account.
Literature Review: 2. How to weight a term (feature)
1. Term Frequency Factor
Binary representation (1 if the term is present, 0 if absent)
Term frequency (tf): the number of times a term occurs in a document
log(tf): a log operation to dampen the effect of disproportionately high term frequencies
Inverse term frequency (ITF)
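The first three term-frequency factors can be sketched directly; the log variant below uses one common convention, 1 + log(tf) for tf > 0 (other conventions exist), and ITF is omitted since its definition is not given here:

```python
from math import log

def tf_weights(tf):
    """Common term-frequency factors for a raw count tf."""
    return {
        "binary": 1 if tf > 0 else 0,          # presence/absence only
        "raw": tf,                              # plain term frequency
        "log": 1 + log(tf) if tf > 0 else 0,    # dampens very frequent terms
    }

print(tf_weights(0))
print(tf_weights(50))
```

For tf = 50 the log variant yields about 4.9 rather than 50, which is the dampening effect the slide refers to.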
Literature Review: 2. How to weight a term (feature)
2. Collection Frequency Factor
idf: the most commonly used factor
Probabilistic idf: a.k.a. the term relevance weight
Feature selection metrics: chi^2, information gain, gain ratio, odds ratio, etc.
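As a minimal sketch of the first two collection-frequency factors, assuming the standard definitions idf = log(N / df) and probabilistic idf = log((N - df) / df):

```python
from math import log

def idf(n_docs, df):
    """Standard inverse document frequency: log(N / df)."""
    return log(n_docs / df)

def prob_idf(n_docs, df):
    """Probabilistic idf (term relevance weight): log((N - df) / df).

    Negative for terms appearing in more than half the collection.
    """
    return log((n_docs - df) / df)

# A term in 10 of 10,000 docs is weighted far higher than one in 5,000:
print(idf(10_000, 10), idf(10_000, 5_000))
print(prob_idf(10_000, 10), prob_idf(10_000, 5_000))
```

Both factors reward rarity across the collection; the feature selection metrics listed above (chi^2, information gain, etc.) additionally use category labels.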
Literature Review: 2. How to weight a term (feature)
3. Normalization Factor
Combine the above two factors by multiplication.
To eliminate the effect of document length, cosine normalization limits term weights to the range [0, 1].
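Putting the three factors together, here is a minimal sketch of a tf*idf document vector with cosine normalization (the counts and document frequencies in the example are invented):

```python
from math import log, sqrt

def tfidf_cosine(tf_vec, df_vec, n_docs):
    """tf * idf weights for one document, cosine-normalized to unit length.

    tf_vec: raw term counts in the document.
    df_vec: document frequency of each term in the collection.
    """
    w = [tf * log(n_docs / df) for tf, df in zip(tf_vec, df_vec)]
    norm = sqrt(sum(x * x for x in w))
    return [x / norm for x in w] if norm else w

vec = tfidf_cosine([3, 0, 5], [10, 500, 100], 1000)
print(vec, sum(x * x for x in vec))  # squared norm is ~1
```

After normalization every weight lies in [0, 1] (for nonnegative tf*idf), and two documents that differ only in length receive identical vectors, which is the point of the normalization factor.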