
Hierarchical Classification of Real Life Documents Ke Wang, Senqiang Zhou Simon Fraser University Yu He National University of Singapore


Post on 31-Mar-2015



Hierarchical Classification of Real Life Documents

Ke Wang, Senqiang Zhou, Simon Fraser University

Yu He, National University of Singapore


Hierarchical & Multi-classed Documents

• Topics are organized into a hierarchy of increasing specificity

• A document is classified into all relevant classes.

• For example, a document on Dance could be reached from both Arts:Performing_Arts and Recreation topics in Yahoo


New Issues

• Misclassification is non-symmetric
– Travel Outdoor vs. Travel Software

• Documents are multi-classed
– Traditional way: only one class attached

• Class space is sparse
– 2^k - 1 subsets of classes for k classes
– Exploring the similarities between classes


A New Classification Model

• The model of documents:
– {t1, t2, …, tn | C1, …, Ck}, where t1, t2, …, tn are keywords and C1, …, Ck are classes from a given class hierarchy
– {C1, …, Ck} is called a classset (CS)

• Construct a classifier
– It consists of rules of the form {ti1, …, tip} → {Cj1, …, Cjq} and assigns a “good” classset to a given new document


Class Similarity

• Two classsets are similar if they “cover” similar documents.

• Anc(CS): the set of classes in a classset CS plus all ancestor classes.

• CS1 is more general than CS2 if Anc(CS1) ⊆ Anc(CS2)
– {Dance} is more general than {Fast-Dance, Music} because Anc({Dance}) ⊆ Anc({Fast-Dance, Music})
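As a rough illustration of Anc(CS) and the "more general" relation, here is a minimal Python sketch. The `PARENT` map is a hypothetical toy hierarchy invented for this example (it is not from the slides), and the function names `anc` and `more_general` are likewise assumptions.

```python
# Hypothetical toy hierarchy (child -> parent); not taken from the slides.
PARENT = {
    "Performing_Arts": "Arts",
    "Dance": "Performing_Arts",
    "Fast-Dance": "Dance",
    "Music": "Arts",
}

def anc(classset):
    """Return the classes in `classset` plus all of their ancestors."""
    result = set()
    for c in classset:
        while c is not None:
            result.add(c)
            c = PARENT.get(c)  # None once the root is passed
    return result

def more_general(cs1, cs2):
    """CS1 is more general than CS2 if Anc(CS1) is a subset of Anc(CS2)."""
    return anc(cs1) <= anc(cs2)

print(more_general({"Dance"}, {"Fast-Dance", "Music"}))  # True
print(more_general({"Music"}, {"Dance"}))                # False
```

In this toy hierarchy, {Dance} is more general than {Fast-Dance, Music} because Dance and all of its ancestors are already ancestors of Fast-Dance.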


Class Similarity (Cont.)

• A document d is covered by a classset CS if CS is more general than the classset of d

• Cover(CS) denotes the set of documents covered by CS

• Cover(CS1) ∩ Cover(CS2) = Cover(CS1 ∪ CS2)
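The coverage definition can be tried out on a small example. The sketch below assumes the identity on this slide reads Cover(CS1) ∩ Cover(CS2) = Cover(CS1 ∪ CS2); the hierarchy `PARENT` and the labelled documents `DOCS` are hypothetical data made up for illustration.

```python
# Hypothetical toy hierarchy (child -> parent) and labelled documents.
PARENT = {
    "Performing_Arts": "Arts",
    "Dance": "Performing_Arts",
    "Fast-Dance": "Dance",
    "Music": "Arts",
}
DOCS = {
    "d1": {"Fast-Dance"},
    "d2": {"Fast-Dance", "Music"},
    "d3": {"Music"},
    "d4": {"Dance"},
}

def anc(classset):
    """Classes in `classset` plus all of their ancestors."""
    result = set()
    for c in classset:
        while c is not None:
            result.add(c)
            c = PARENT.get(c)
    return result

def cover(classset):
    """Documents whose own classset is at least as specific as `classset`."""
    return {d for d, cs_d in DOCS.items() if anc(classset) <= anc(cs_d)}

cs1, cs2 = {"Dance"}, {"Music"}
print(sorted(cover(cs1)))        # ['d1', 'd2', 'd4']
print(sorted(cover(cs2)))        # ['d2', 'd3']
print(sorted(cover(cs1 | cs2)))  # ['d2']
```

Only d2 is covered by both classsets, and it is also the only document covered by their union, which matches the identity.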


Class Similarity (Cont.)

• The dissimilarity of CS1 and CS2, written E(CS1, CS2), is defined as the normalized difference of their coverage:

$$E(CS_1, CS_2) = \frac{|Cover(CS_2) - Cover(CS_1)| + |Cover(CS_1) - Cover(CS_2)|}{|Cover(CS_1) \cup Cover(CS_2)|}$$

• The similarity is defined as 1 - E(CS1,CS2)
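A small sketch of the dissimilarity and similarity measures, working directly on coverage sets. It assumes the normalizing denominator is the union of the two covers; the handling of two empty covers and the example document ids are also assumptions made for illustration.

```python
def dissimilarity(cover1, cover2):
    """E(CS1, CS2): size of the symmetric difference over size of the union."""
    union = cover1 | cover2
    if not union:
        # Assumption: two empty covers are treated as identical (E = 0).
        return 0.0
    return (len(cover2 - cover1) + len(cover1 - cover2)) / len(union)

def similarity(cover1, cover2):
    """Similarity is 1 - E(CS1, CS2)."""
    return 1.0 - dissimilarity(cover1, cover2)

# Hypothetical covers given as sets of document ids.
cover_dance = {"d1", "d2", "d4"}
cover_music = {"d2", "d3"}
print(dissimilarity(cover_dance, cover_music))  # 0.75
print(similarity(cover_dance, cover_music))     # 0.25
```

With this normalization E always lies between 0 (identical covers) and 1 (disjoint covers).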


The Confidence

• Match(T → CS): the set of documents that contain all the terms in T.

• The confidence of T → CS is defined as:

$$Conf_g(T \rightarrow CS) = \frac{|Match(T \rightarrow CS)| - \sum_{d \in Match(T \rightarrow CS)} E(CS_d, CS)}{|Match(T \rightarrow CS)|}$$


What’s behind Confg?

• Intuitively, Confg(T → CS) measures the average similarity between CS and the classsets of the documents that match T → CS.

• If E(CSd, CS) is binary, i.e., 1 or 0, Confg(T → CS) degenerates to the standard confidence.
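The generalized confidence can be sketched as follows, reading it as (|Match| minus the summed dissimilarity E(CSd, CS) over matching documents) divided by |Match|. The function names, the `dissim` callback, and the example data are assumptions for illustration. Plugging in a binary E shows the reduction to the standard confidence.

```python
def confg(cs, matched_classsets, dissim):
    """Generalized confidence of a rule T -> CS.

    matched_classsets: classsets CS_d of the documents containing all terms in T.
    dissim: callable E(CS_d, CS) returning a value in [0, 1].
    """
    n = len(matched_classsets)
    if n == 0:
        raise ValueError("rule matches no documents")
    total_error = sum(dissim(cs_d, cs) for cs_d in matched_classsets)
    return (n - total_error) / n

def binary_e(cs_d, cs):
    """Binary dissimilarity: 0 for an exact classset match, 1 otherwise."""
    return 0 if cs_d == cs else 1

# Hypothetical classsets of four documents that match the rule's terms.
matched = [{"Dance"}, {"Dance"}, {"Music"}, {"Dance"}]
print(confg({"Dance"}, matched, binary_e))  # 0.75
```

With the binary E, three of the four matching documents carry exactly the classset {Dance}, so the value equals the usual fraction of matching documents with the predicted label.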


Construction of Classifier

• Step 1: Find association rules
– Generate all association rules of the form T → CS that satisfy some user-specified minimum support and confidence.
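To make Step 1 concrete, here is a deliberately naive sketch of rule generation. It is not the authors' mining algorithm: it enumerates term subsets by brute force up to a hypothetical `max_terms` limit, takes support as an absolute document count, and uses the standard exact-match confidence as a stand-in for Confg. All names and data are assumptions.

```python
from collections import Counter
from itertools import combinations

def mine_rules(docs, min_support, min_conf, max_terms=2):
    """Return rules (terms, classset, confidence) meeting both thresholds.

    docs: list of (set of terms, frozenset classset) pairs.
    """
    match = Counter()  # termset -> number of documents containing it
    hit = Counter()    # (termset, classset) -> documents with that exact classset
    for terms, cs in docs:
        for k in range(1, max_terms + 1):
            for t in combinations(sorted(terms), k):
                match[t] += 1
                hit[(t, cs)] += 1
    rules = []
    for (t, cs), n in hit.items():
        conf = n / match[t]
        if n >= min_support and conf >= min_conf:
            rules.append((t, cs, conf))
    return rules

# Hypothetical training documents.
DOCS = [
    ({"dance", "music"}, frozenset({"Dance"})),
    ({"dance"}, frozenset({"Dance"})),
    ({"music"}, frozenset({"Music"})),
    ({"dance", "tango"}, frozenset({"Dance"})),
]
print(mine_rules(DOCS, min_support=2, min_conf=0.8))
```

On this data only the rule {dance} → {Dance} survives: it is supported by three documents and all documents containing "dance" carry that classset.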


Construction of Classifier (Cont.)

• Step 2: rank the rules
– A document is classified by the matching rule that has the highest confidence.
– This selection is called most confidence first (MCF)
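A minimal sketch of most-confidence-first selection. The rule representation (terms, classset, confidence), the fallback to a default classset when nothing matches, and the tie-breaking by original order are assumptions for illustration.

```python
def classify_mcf(doc_terms, rules, default_cs):
    """Return the classset of the highest-confidence rule whose terms all occur in the document."""
    # sorted() is stable, so rules with equal confidence keep their input order.
    ranked = sorted(rules, key=lambda r: r[2], reverse=True)
    for terms, cs, _conf in ranked:
        if terms <= doc_terms:  # every rule term appears in the document
            return cs
    return default_cs

# Hypothetical rules: (terms, classset, confidence).
RULES = [
    (frozenset({"dance"}), frozenset({"Dance"}), 0.7),
    (frozenset({"dance", "tango"}), frozenset({"Fast-Dance"}), 0.9),
    (frozenset({"music"}), frozenset({"Music"}), 0.8),
]
DEFAULT = frozenset({"Arts"})

print(classify_mcf({"dance", "tango", "music"}, RULES, DEFAULT))  # frozenset({'Fast-Dance'})
print(classify_mcf({"dance", "music"}, RULES, DEFAULT))           # frozenset({'Music'})
print(classify_mcf({"soccer"}, RULES, DEFAULT))                   # frozenset({'Arts'})
```

The second document matches two rules, and the one with confidence 0.8 wins over the one with 0.7.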


Construction of Classifier (Cont.)

• Step 3: remove rules of low accuracy
– Let D be the set of training documents classified by rule T → CS. The accuracy of T → CS is defined as

$$Accu(T \rightarrow CS) = \frac{|D| - \sum_{d \in D} E(CS_d, CS)}{|D|}$$


Construction of Classifier (Cont.)

– Confg(T → CS) is defined with respect to all the documents that match the rule, whereas Accu(T → CS) is defined with respect to the documents classified by the rule.

– Remove the rules with accuracy below a certain threshold because they contribute negatively to overall accuracy.
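The difference between matching and classifying can be seen in a short sketch. It assumes the rules are already ranked, that each training document is classified by the first rule whose terms it contains, and that accuracy is (|D| minus the summed E(CSd, CS) over D) divided by |D|. The binary E, the threshold of 0.6, and the data are hypothetical; rules that classify no document are simply dropped here, which is also an assumption.

```python
def rule_accuracies(ranked_rules, train_docs, dissim):
    """Map rule index -> accuracy over the documents that rule actually classifies."""
    classified = {}  # rule index -> classsets CS_d of the documents it classifies
    for terms_d, cs_d in train_docs:
        for i, (terms, _cs) in enumerate(ranked_rules):
            if terms <= terms_d:  # first matching rule in rank order wins
                classified.setdefault(i, []).append(cs_d)
                break
    accu = {}
    for i, cs_list in classified.items():
        cs = ranked_rules[i][1]
        error = sum(dissim(cs_d, cs) for cs_d in cs_list)
        accu[i] = (len(cs_list) - error) / len(cs_list)
    return accu

def binary_e(cs_d, cs):
    return 0 if cs_d == cs else 1

# Hypothetical ranked rules (terms, classset) and training documents.
RULES = [({"dance", "tango"}, {"Fast-Dance"}), ({"dance"}, {"Dance"})]
TRAIN = [
    ({"dance", "tango"}, {"Fast-Dance"}),
    ({"dance", "tango"}, {"Dance"}),
    ({"dance"}, {"Dance"}),
    ({"dance", "music"}, {"Music"}),
    ({"dance"}, {"Dance"}),
]
accu = rule_accuracies(RULES, TRAIN, binary_e)
kept = [r for i, r in enumerate(RULES) if accu.get(i, 0.0) >= 0.6]
print(accu)
print(kept)
```

The second rule matches all five documents but classifies only three of them, because the first rule claims the two "tango" documents. This sketch does not re-classify documents after a rule is removed.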


Construction of Classifier (Cont.)

• Step 4: cut off the ranked list
– If we cut off the list of rules r1, …, rm after the first i rules, r1, …, ri:
– Cutoff error = PrefixError(ri) + DefaultError(ri)
– PrefixError(ri) is the sum of the rule error Error(rj) for all rules rj, 1 ≤ j ≤ i
– DefaultError(ri) is the error caused by assigning the default classset to all the documents not classified by any rule rj, 1 ≤ j ≤ i
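One way to use the cutoff error is to evaluate it for every prefix and keep the prefix with the smallest value. The slide does not state the selection criterion, so taking the minimum is an assumption, as is allowing i = 0 (no rules, default classset only). The per-rule errors and default errors below are made-up numbers supplied as inputs.

```python
from itertools import accumulate

def best_cutoff(rule_errors, default_errors):
    """Pick the prefix length i that minimizes PrefixError + DefaultError.

    rule_errors[j]: Error of rule r_(j+1), for j = 0..m-1.
    default_errors[i]: DefaultError when only the first i rules are kept, i = 0..m.
    """
    if len(default_errors) != len(rule_errors) + 1:
        raise ValueError("need one default error per prefix length 0..m")
    prefix_errors = [0.0] + list(accumulate(rule_errors))
    cutoff_errors = [p + d for p, d in zip(prefix_errors, default_errors)]
    best_i = min(range(len(cutoff_errors)), key=cutoff_errors.__getitem__)
    return best_i, cutoff_errors

# Hypothetical errors for three ranked rules.
best_i, errors = best_cutoff([1.0, 0.5, 2.0], [6.0, 3.0, 2.0, 1.5])
print(errors)  # [6.0, 4.0, 3.5, 5.0]
print(best_i)  # 2
```

Here the third rule adds more error than it saves on the default classset, so the list is cut after the second rule.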


Experiments


Experimental Results

• The result on IBM data set
– The error: Coverage beats the others.
– The size: Confidence gets smaller.
– The time: Coverage takes longer.


Classification Error


Size & Execution Time