Hierarchical Classification of Real Life Documents
Ke Wang, Senqiang Zhou, Simon Fraser University
Yu He, National University of Singapore
Hierarchical & Multi-classed Documents
• Topics are organized into a hierarchy of increasing specificity
• A document is classified into all relevant classes.
• For example, a document on Dance could be reached from both Arts:Performing_Arts and Recreation topics in Yahoo
New Issues
• Misclassification is non-symmetric
– Travel Outdoor vs. Travel Software
• Documents are multi-classed
– Traditional way: only one class attached
• Class space is sparse
– 2^k - 1 subsets of classes for k classes
– Exploring the similarities between classes
A New Classification Model
• The model of documents:
– {t1, t2, …, tn | C1, …, Ck}, where t1, t2, …, tn are keywords and C1, …, Ck are classes from a given class hierarchy
– {C1, …, Ck} is called a classset (CS)
• Construct a classifier
– consisting of rules of the form {ti1, …, tip} → {Ci1, …, Cip}, that assigns a “good” classset to a given new document
Class Similarity
• Two classsets are similar if they “cover” similar documents.
• Anc(CS): the set of classes in a classset CS plus all ancestor classes.
• CS1 is more general than CS2 if Anc(CS1) ⊆ Anc(CS2)
– {Dance} is more general than {Fast-Dance, Music} because Anc({Dance}) ⊆ Anc({Fast-Dance, Music})
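The Anc and "more general" definitions can be sketched in a few lines of Python. The toy hierarchy below (Arts above Dance and Music, Dance above Fast-Dance) is an assumption made for illustration; the slides only imply that Dance is an ancestor of Fast-Dance. The names `PARENT`, `anc` and `more_general` are hypothetical.

```python
# Hypothetical toy hierarchy: child -> parent (None marks a top-level class).
PARENT = {
    "Arts": None,
    "Dance": "Arts",
    "Fast-Dance": "Dance",
    "Music": "Arts",
}

def anc(classset):
    """Return the classes in `classset` plus all of their ancestors."""
    result = set()
    for c in classset:
        while c is not None:
            result.add(c)
            c = PARENT[c]
    return result

def more_general(cs1, cs2):
    """CS1 is more general than CS2 if Anc(CS1) is a subset of Anc(CS2)."""
    return anc(cs1) <= anc(cs2)

print(more_general({"Dance"}, {"Fast-Dance", "Music"}))  # True
print(more_general({"Fast-Dance", "Music"}, {"Dance"}))  # False
```

The relation is not symmetric: {Dance} is more general than {Fast-Dance, Music}, but not the other way around.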
Class Similarity (Cont.)
• A document d is covered by a classset CS if CS is more general than the classset of d
• Cover(CS) denotes the set of documents covered by CS
• Cover(CS1) ∩ Cover(CS2) = Cover(CS1 ∪ CS2)
Class Similarity (Cont.)
• The dissimilarity of CS1 and CS2, E(CS1, CS2), is defined as the normalized difference of their coverage:
E(CS1, CS2) = (|Cover(CS2) - Cover(CS1)| + |Cover(CS1) - Cover(CS2)|) / |Cover(CS1) ∪ Cover(CS2)|
• The similarity is defined as 1 - E(CS1,CS2)
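A minimal sketch of Cover and E on a toy collection may help make the definitions concrete. The hierarchy and the four documents are invented for illustration. Normalizing by the union of the two covers is an assumption, as is treating two classsets that cover nothing as identical.

```python
# Hypothetical toy hierarchy: child -> parent (None marks a top-level class).
PARENT = {"Arts": None, "Dance": "Arts", "Fast-Dance": "Dance", "Music": "Arts"}

# Hypothetical training documents: document id -> classset of the document.
DOCS = {
    "d1": {"Fast-Dance"},
    "d2": {"Fast-Dance", "Music"},
    "d3": {"Music"},
    "d4": {"Dance"},
}

def anc(classset):
    """Return the classes in `classset` plus all of their ancestors."""
    result = set()
    for c in classset:
        while c is not None:
            result.add(c)
            c = PARENT[c]
    return result

def cover(cs):
    """Documents d such that `cs` is more general than the classset of d."""
    target = anc(cs)
    return {d for d, cs_d in DOCS.items() if target <= anc(cs_d)}

def dissimilarity(cs1, cs2):
    """E(CS1, CS2): size of the symmetric difference of the covers over their union."""
    c1, c2 = cover(cs1), cover(cs2)
    union = c1 | c2
    if not union:
        return 0.0  # assumption: classsets covering no documents count as identical
    return len(c1 ^ c2) / len(union)

print(sorted(cover({"Dance"})))              # ['d1', 'd2', 'd4']
print(dissimilarity({"Dance"}, {"Music"}))   # 0.75
# Covered by both CS1 and CS2 is the same as covered by their union.
print(cover({"Dance"}) & cover({"Music"}) == cover({"Dance", "Music"}))  # True
```

On this data {Dance} and {Fast-Dance} share two of the three documents they cover, so their similarity 1 - E is about 0.67, while {Dance} and {Music} share only one of four.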
The Confidence
• Match(T → CS): the set of documents that contain all the terms in T.
• The confidence of T → CS is defined as:
Confg(T → CS) = (|Match(T → CS)| - Σ_{d ∈ Match(T → CS)} E(CSd, CS)) / |Match(T → CS)|
where CSd is the classset of document d.
What’s behind Confg?
• Intuitively, Confg(T → CS) measures the average similarity between CS and the classsets of the documents that match T → CS.
• If E(CSd, CS) is binary, i.e., 1 or 0, Confg(T → CS) degenerates to the standard confidence.
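The following sketch computes Confg for one rule under two dissimilarity functions. The documents, the terms, and the value 0.5 in `graded_E` are invented for illustration. In the model E would come from the coverage-based definition; here it is passed in as a plain function so the example stays self-contained. Returning 0 for a rule that matches nothing is also an assumption.

```python
def conf_g(terms, cs, docs, E):
    """Confg(T -> CS) = (|Match| - sum of E(CS_d, CS) over matching d) / |Match|."""
    match = [cs_d for terms_d, cs_d in docs if terms <= terms_d]
    if not match:
        return 0.0  # assumption: a rule that matches no document gets confidence 0
    return (len(match) - sum(E(cs_d, cs) for cs_d in match)) / len(match)

def binary_E(cs_d, cs):
    """Binary dissimilarity: 0 for an identical classset, 1 otherwise."""
    return 0.0 if cs_d == cs else 1.0

DANCE, MUSIC, REC = frozenset({"Dance"}), frozenset({"Music"}), frozenset({"Recreation"})

def graded_E(cs_d, cs):
    """Hypothetical graded dissimilarity: Dance and Music are treated as half similar."""
    if cs_d == cs:
        return 0.0
    if {cs_d, cs} == {DANCE, MUSIC}:
        return 0.5
    return 1.0

# Hypothetical training documents: (terms, classset).
docs = [
    (frozenset({"ballet", "stage"}), DANCE),
    (frozenset({"ballet", "ticket"}), DANCE),
    (frozenset({"ballet", "violin"}), MUSIC),
    (frozenset({"hiking"}), REC),
]

rule_terms = frozenset({"ballet"})
print(conf_g(rule_terms, DANCE, docs, binary_E))  # 2 of 3 matches are exact
print(conf_g(rule_terms, DANCE, docs, graded_E))  # the near miss costs only 0.5
```

With `binary_E` the value is 2/3, which is the standard confidence of {ballet} → {Dance}. With `graded_E` the Music document is only a partial miss, so the value rises to 2.5/3.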
Construction of Classifier
• Step 1: Find association rules
– Generate all association rules of the form T → CS that satisfy some user-specified minimum support and confidence.
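A brute-force sketch of Step 1 follows. It is not the authors' mining algorithm. Several choices are assumptions: candidate classsets are limited to those observed in the training documents, support is the fraction of documents that contain T and carry exactly CS, and confidence is the standard one (the binary-E case of Confg). The function name `mine_rules` and the parameter `max_terms` are hypothetical.

```python
from itertools import combinations

def mine_rules(docs, min_support, min_conf, max_terms=2):
    """Enumerate rules T -> CS; returns (T, CS, support, confidence) tuples."""
    vocab = sorted({t for terms, _ in docs for t in terms})
    # Assumption: only classsets seen in the training data are candidates.
    classsets = sorted({cs for _, cs in docs}, key=sorted)
    rules = []
    for size in range(1, max_terms + 1):
        for T in map(frozenset, combinations(vocab, size)):
            match = [cs for terms, cs in docs if T <= terms]
            if not match:
                continue
            for CS in classsets:
                hits = sum(1 for cs in match if cs == CS)
                support = hits / len(docs)
                confidence = hits / len(match)  # standard (binary-E) confidence
                if support >= min_support and confidence >= min_conf:
                    rules.append((T, CS, support, confidence))
    return rules

# Hypothetical training documents: (terms, classset).
docs = [
    (frozenset({"ballet", "stage"}), frozenset({"Dance"})),
    (frozenset({"ballet", "ticket"}), frozenset({"Dance"})),
    (frozenset({"ballet", "violin"}), frozenset({"Music"})),
    (frozenset({"hiking"}), frozenset({"Recreation"})),
]

for T, CS, support, confidence in mine_rules(docs, min_support=0.5, min_conf=0.6):
    print(sorted(T), "->", sorted(CS), support, round(confidence, 2))
```

On this data only {ballet} → {Dance} survives both thresholds. Enumerating every term subset is exponential in `max_terms`, so a real miner would prune candidates by support first.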
Construction of Classifier (Cont.)
• Step 2: Rank the rules
– A document is classified by the matching rule that has the highest confidence.
– This selection is called most confidence first (MCF)
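MCF selection can be sketched as a single lookup. The rules, their confidence values, and the fallback to a default classset when nothing matches are illustrative assumptions. The slides do not say how ties are broken; this sketch keeps the first rule among equals.

```python
def classify_mcf(doc_terms, rules, default_cs):
    """Return the classset of the highest-confidence rule whose terms all occur in the document."""
    matching = [rule for rule in rules if rule[0] <= doc_terms]
    if not matching:
        return default_cs  # assumption: unmatched documents get the default classset
    # max() returns the first maximal element, so ties go to the earlier rule.
    return max(matching, key=lambda rule: rule[2])[1]

# Hypothetical rules: (terms T, classset CS, confidence).
rules = [
    (frozenset({"ballet"}), frozenset({"Dance"}), 0.9),
    (frozenset({"ticket"}), frozenset({"Music"}), 0.8),
    (frozenset({"violin"}), frozenset({"Music"}), 0.7),
]
default_cs = frozenset({"Recreation"})

print(sorted(classify_mcf({"ballet", "ticket"}, rules, default_cs)))  # ['Dance']
print(sorted(classify_mcf({"ticket", "train"}, rules, default_cs)))   # ['Music']
print(sorted(classify_mcf({"hiking"}, rules, default_cs)))            # ['Recreation']
```

The first document matches two rules and takes the classset of the more confident one.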
Construction of Classifier (Cont.)
• Step 3: Remove rules of low accuracy
– Let D be the set of training documents classified by rule T → CS. The accuracy of T → CS is defined as
Accu(T → CS) = (|D| - Σ_{d ∈ D} E(CSd, CS)) / |D|
Construction of Classifier (Cont.)
– Confg(T → CS) is defined with respect to all the documents that match the rule, whereas Accu(T → CS) is defined with respect to the documents classified by the rule.
– Remove the rules with accuracy below a certain threshold because they contribute negatively to overall accuracy.
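The difference between matching and classifying shows up clearly in a small sketch of Step 3. Each training document is assigned to its highest-confidence matching rule, and accuracy is computed per rule over the documents it actually classifies. The data, the binary E, the single pruning pass, and dropping rules that classify no document are all assumptions made for illustration.

```python
def binary_E(cs_d, cs):
    """Binary dissimilarity: 0 for an identical classset, 1 otherwise."""
    return 0.0 if cs_d == cs else 1.0

def rule_accuracies(rules, docs, E):
    """Map rule index -> Accu, computed over the documents that rule classifies under MCF."""
    errors = {}  # rule index -> E values of the documents it classifies
    for terms_d, cs_d in docs:
        matching = [i for i, (t, _, _) in enumerate(rules) if t <= terms_d]
        if not matching:
            continue
        best = max(matching, key=lambda i: rules[i][2])
        errors.setdefault(best, []).append(E(cs_d, rules[best][1]))
    return {i: (len(e) - sum(e)) / len(e) for i, e in errors.items()}

def prune(rules, docs, E, min_accuracy):
    """Keep rules whose accuracy reaches the threshold (one pass, no re-classification)."""
    acc = rule_accuracies(rules, docs, E)
    # Assumption: a rule that classifies no training document is dropped.
    return [r for i, r in enumerate(rules) if i in acc and acc[i] >= min_accuracy]

# Hypothetical rules (T, CS, confidence) and training documents (terms, classset).
rules = [
    (frozenset({"ballet"}), frozenset({"Dance"}), 0.9),
    (frozenset({"ticket"}), frozenset({"Music"}), 0.8),
    (frozenset({"violin"}), frozenset({"Music"}), 0.7),
]
docs = [
    (frozenset({"ballet", "stage"}), frozenset({"Dance"})),
    (frozenset({"ballet", "ticket"}), frozenset({"Dance"})),
    (frozenset({"ticket", "train"}), frozenset({"Recreation"})),
    (frozenset({"violin"}), frozenset({"Music"})),
]

print(rule_accuracies(rules, docs, binary_E))   # {0: 1.0, 1: 0.0, 2: 1.0}
print(len(prune(rules, docs, binary_E, 0.5)))   # 2
```

The {ticket} rule matches two documents but classifies only one of them, because the other is claimed by the more confident {ballet} rule. Its accuracy is therefore computed on that single document.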
Construction of Classifier (Cont.)
• Step 4: Cut off the ranked list
– If we cut off the list of rules r1, …, rm after the first i rules, r1, …, ri:
– Cutoff error = PrefixError(ri) + DefaultError(ri)
– PrefixError(ri) is the sum of the rule error Error(rj) for all rules rj, 1 ≤ j ≤ i
– DefaultError(ri) is the error caused by assigning the default classset to all the documents not classified by any rule rj, 1 ≤ j ≤ i
Experiments
Experimental Results
• The result on the IBM data set
– The error: Coverage beats the others.
– The size: Confidence gets smaller.
– The time: Coverage takes longer.
Classification Error
Size & Execution Time