
Hierarchical Classification of Real Life Documents

Ke Wang, Senqiang Zhou, Simon Fraser University

Yu He, National University of Singapore

Hierarchical & Multi-classed Documents

• Topics are organized into a hierarchy of increasing specificity

• A document is classified into all relevant classes.

• For example, a document on Dance could be reached from both Arts:Performing_Arts and Recreation topics in Yahoo

New Issues

• Misclassification is non-symmetric

– Travel Outdoor vs. Travel Software

• Documents are multi-classed

– Traditional way: only one class attached

• Class space is sparse

– 2^k - 1 subsets of classes for k classes

– Exploring the similarities between classes

A New Classification Model

• The model of documents:

– {t1,t2,…,tn | C1,…,Ck}, where t1,t2,…,tn are keywords and C1,…,Ck are classes from a given class hierarchy

– { C1,…,Ck } is called a classset (CS)

• Construct a classifier

– consisting of rules of the form {ti1,…,tip} → {Ci1,…,Cip}, that assigns a “good” classset to a given new document

Class Similarity

• Two classsets are similar if they “cover” similar documents.

• Anc(CS): the set of classes in a classset CS plus all ancestor classes.

• CS1 is more general than CS2 if Anc(CS1) ⊆ Anc(CS2)

– {Dance} is more general than {Fast-Dance, Music} because Anc({Dance}) ⊆ Anc({Fast-Dance, Music})
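The Anc and "more general" definitions can be sketched in a few lines of Python. This is a minimal illustration, assuming the relation between the two Anc sets is subset inclusion; the `PARENT` table is a hypothetical toy hierarchy (the placement of Music under Arts in particular is invented for the example), and the function names are not from the slides.

```python
from typing import Dict, FrozenSet, Iterable, Optional

# Hypothetical toy hierarchy: child -> parent (None marks a top-level class).
PARENT: Dict[str, Optional[str]] = {
    "Arts": None,
    "Performing_Arts": "Arts",
    "Dance": "Performing_Arts",
    "Fast-Dance": "Dance",
    "Music": "Arts",
}


def anc(classset: Iterable[str]) -> FrozenSet[str]:
    """Anc(CS): the classes in CS plus all of their ancestors."""
    result = set()
    for cls in classset:
        node: Optional[str] = cls
        # Walk up to the root; stop early if this branch was already collected.
        while node is not None and node not in result:
            result.add(node)
            node = PARENT[node]
    return frozenset(result)


def is_more_general(cs1: Iterable[str], cs2: Iterable[str]) -> bool:
    """CS1 is more general than CS2 if Anc(CS1) is a subset of Anc(CS2)."""
    return anc(cs1) <= anc(cs2)
```

With this toy hierarchy, `is_more_general({"Dance"}, {"Fast-Dance", "Music"})` holds because Fast-Dance brings Dance and its ancestors into the second Anc set.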

Class Similarity (Cont.)

• A document d is covered by a classset CS if CS is more general than the classset of d

• Cover(CS) denotes the set of documents covered by CS

• Cover(CS1) ∩ Cover(CS2) = Cover(CS1 ∪ CS2)
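A short derivation shows why the Cover identity holds. It assumes the operators lost in extraction are intersection on the left and union on the right, that "more general" means Anc(CS1) ⊆ Anc(CS2), and it writes CS_d for the classset of document d. The third step uses Anc(CS1 ∪ CS2) = Anc(CS1) ∪ Anc(CS2), which follows directly from the definition of Anc.

```latex
\begin{align*}
d \in \mathrm{Cover}(CS_1) \cap \mathrm{Cover}(CS_2)
  &\iff \mathrm{Anc}(CS_1) \subseteq \mathrm{Anc}(CS_d)
        \ \text{and}\ \mathrm{Anc}(CS_2) \subseteq \mathrm{Anc}(CS_d) \\
  &\iff \mathrm{Anc}(CS_1) \cup \mathrm{Anc}(CS_2) \subseteq \mathrm{Anc}(CS_d) \\
  &\iff \mathrm{Anc}(CS_1 \cup CS_2) \subseteq \mathrm{Anc}(CS_d) \\
  &\iff d \in \mathrm{Cover}(CS_1 \cup CS_2)
\end{align*}
```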

The Confidence

• Match(T→CS): the set of documents that contain all the terms in T.

• The confidence of T→CS is defined as:

$$\mathrm{Conf}_g(T \to CS) = \frac{|\mathrm{Match}(T \to CS)| - \sum_{d \in \mathrm{Match}(T \to CS)} E(CS_d, CS)}{|\mathrm{Match}(T \to CS)|}$$

What’s behind Confg?

• Intuitively, Confg(T→CS) measures the average similarity between CS and the classsets of the documents that match T→CS.

• If E(CSd,CS) is binary, i.e., 1 or 0, Confg(T→CS) degenerates to the standard confidence.
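The confidence measure can be sketched as follows. This assumes Confg is one minus the average of E(CSd, CS) over the documents containing every term of T. The slides do not define E, so it is passed in as a function; `binary_error` is a hypothetical stand-in that returns 0 for an exact classset match and 1 otherwise, which is one reading of the binary case.

```python
from typing import Callable, FrozenSet, List, Tuple

ClassSet = FrozenSet[str]
Doc = Tuple[FrozenSet[str], ClassSet]  # (terms of d, classset CS_d)
ErrorFn = Callable[[ClassSet, ClassSet], float]


def match(docs: List[Doc], terms: FrozenSet[str]) -> List[Doc]:
    """Match(T -> CS): documents that contain all the terms in T."""
    return [d for d in docs if terms <= d[0]]


def conf_g(docs: List[Doc], terms: FrozenSet[str], cs: ClassSet,
           error: ErrorFn) -> float:
    """(|Match| - sum of E(CS_d, CS) over Match) / |Match|."""
    matched = match(docs, terms)
    if not matched:
        return 0.0  # assumption: a rule that matches nothing gets confidence 0
    total_error = sum(error(cs_d, cs) for _, cs_d in matched)
    return (len(matched) - total_error) / len(matched)


def binary_error(cs_d: ClassSet, cs: ClassSet) -> float:
    """Hypothetical binary E: 0 on an exact classset match, 1 otherwise."""
    return 0.0 if cs_d == cs else 1.0


# Invented toy data, for illustration only.
docs: List[Doc] = [
    (frozenset({"ballet", "stage"}), frozenset({"Dance"})),
    (frozenset({"ballet", "music"}), frozenset({"Dance"})),
    (frozenset({"ballet", "ticket"}), frozenset({"Travel"})),
    (frozenset({"hiking"}), frozenset({"Outdoor"})),
]
standard = conf_g(docs, frozenset({"ballet"}), frozenset({"Dance"}), binary_error)
```

With the binary E, the value is simply the fraction of matching documents whose classset equals CS (two of the three "ballet" documents here). A graded E would give partial credit to classsets that are close to CS in the hierarchy.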

Construction of Classifier

• Step 1: Find association rules

– Generate all association rules of the form T→CS that satisfy some user-specified minimum support and confidence.

Construction of Classifier (Cont.)

• Step 2: rank the rules

– A document is classified by the matching rule that has the highest confidence.

– This selection is called most confidence first (MCF)
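Most-confidence-first selection can be sketched as below. The `Rule` tuple, the `default` argument, and the tie-breaking behavior (the first rule in list order wins) are assumptions made for the illustration; the slides do not specify them.

```python
from typing import FrozenSet, List, NamedTuple, Optional


class Rule(NamedTuple):
    terms: FrozenSet[str]     # T
    classset: FrozenSet[str]  # CS
    confidence: float         # Conf_g(T -> CS), computed beforehand


def classify_mcf(rules: List[Rule], doc_terms: FrozenSet[str],
                 default: Optional[FrozenSet[str]] = None
                 ) -> Optional[FrozenSet[str]]:
    """Return the classset of the matching rule with the highest confidence."""
    matching = [r for r in rules if r.terms <= doc_terms]
    if not matching:
        return default  # no rule applies to this document
    return max(matching, key=lambda r: r.confidence).classset


# Invented toy rules, for illustration only.
rules = [
    Rule(frozenset({"ballet"}), frozenset({"Dance"}), 0.9),
    Rule(frozenset({"ballet", "ticket"}), frozenset({"Travel"}), 0.6),
]
label = classify_mcf(rules, frozenset({"ballet", "ticket", "city"}))
```

Both rules match the sample document, and the one with confidence 0.9 decides the classset.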

Construction of Classifier (Cont.)

• Step 3: remove rules of low accuracy

– Let D be the set of training documents classified by rule T→CS; the accuracy of T→CS is defined as

$$\mathrm{Accu}(T \to CS) = \frac{|D| - \sum_{d \in D} E(CS_d, CS)}{|D|}$$


– Confg(T→CS) is defined with respect to all the documents that match the rule, whereas Accu(T→CS) is defined with respect to the documents classified by the rule.

– Remove the rules with accuracy below a certain threshold because they contribute negatively to overall accuracy.
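The accuracy-based pruning step can be sketched as follows. It assumes the rules are already ranked by confidence, that each training document is classified by the first rule it matches, and that Accu is one minus the average E over those documents. Two choices here are assumptions not stated in the slides: pruning is done in a single pass, and rules that classify no training document are dropped. `binary_error` is a hypothetical stand-in for E.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple

ClassSet = FrozenSet[str]
Rule = Tuple[FrozenSet[str], ClassSet]  # (T, CS), ranked by confidence
Doc = Tuple[FrozenSet[str], ClassSet]   # (terms of d, classset CS_d)
ErrorFn = Callable[[ClassSet, ClassSet], float]


def binary_error(cs_d: ClassSet, cs: ClassSet) -> float:
    """Hypothetical binary E: 0 on an exact classset match, 1 otherwise."""
    return 0.0 if cs_d == cs else 1.0


def rule_accuracies(ranked_rules: List[Rule], docs: List[Doc],
                    error: ErrorFn) -> List[Optional[float]]:
    """Accu per rule: (|D| - sum of E over D) / |D|, D = docs the rule classifies."""
    counts = [0] * len(ranked_rules)
    errors = [0.0] * len(ranked_rules)
    for terms, cs_d in docs:
        for i, (rule_terms, rule_cs) in enumerate(ranked_rules):
            if rule_terms <= terms:  # first match in ranked order classifies d
                counts[i] += 1
                errors[i] += error(cs_d, rule_cs)
                break
    return [(c - e) / c if c else None for c, e in zip(counts, errors)]


def prune(ranked_rules: List[Rule], docs: List[Doc], error: ErrorFn,
          min_accuracy: float) -> List[Rule]:
    """Keep rules whose accuracy reaches the threshold (single pass)."""
    accs = rule_accuracies(ranked_rules, docs, error)
    return [r for r, a in zip(ranked_rules, accs)
            if a is not None and a >= min_accuracy]
```

Note that a rule can have high confidence yet low accuracy, because higher-ranked rules may take away the documents it would have classified correctly.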


• Step 4: cut off the ranked list

– If we cut off the list of rules r1,…,rm after the first i rules, r1,…,ri:

– Cutoff error = PrefixError(ri) + DefaultError(ri)

– PrefixError(ri) is the sum of the rule error Error(rj) for all rules rj, 1 ≤ j ≤ i

– DefaultError(ri) is the error caused by assigning the default classset to all the documents not classified by any rule rj
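The cutoff computation can be sketched as below, presumably with the goal of picking the prefix length i that minimizes the cutoff error. Several details are assumptions: Error(rj) is taken as the sum of E over the training documents that rj classifies, the default classset is supplied by the caller, ties go to the shorter prefix, and `binary_error` is a hypothetical stand-in for E.

```python
from typing import Callable, FrozenSet, List, Optional, Tuple

ClassSet = FrozenSet[str]
Rule = Tuple[FrozenSet[str], ClassSet]  # (T, CS), ranked by confidence
Doc = Tuple[FrozenSet[str], ClassSet]   # (terms of d, classset CS_d)
ErrorFn = Callable[[ClassSet, ClassSet], float]


def binary_error(cs_d: ClassSet, cs: ClassSet) -> float:
    """Hypothetical binary E: 0 on an exact classset match, 1 otherwise."""
    return 0.0 if cs_d == cs else 1.0


def best_cutoff(ranked_rules: List[Rule], docs: List[Doc], error: ErrorFn,
                default_cs: ClassSet) -> Tuple[Optional[int], float]:
    """Return (i, cutoff error) minimizing PrefixError(r_i) + DefaultError(r_i)."""
    m = len(ranked_rules)
    rule_error = [0.0] * m           # Error(r_j) for each rule
    default_error = [0.0] * (m + 1)  # error under the default; slot m = unmatched docs
    for terms, cs_d in docs:
        # Index of the first matching rule, or m if no rule matches.
        j = next((k for k, (t, _) in enumerate(ranked_rules) if t <= terms), m)
        if j < m:
            rule_error[j] += error(cs_d, ranked_rules[j][1])
        default_error[j] += error(cs_d, default_cs)

    best_i: Optional[int] = None
    best_err = float("inf")
    for i in range(1, m + 1):                 # keep rules r_1..r_i
        prefix = sum(rule_error[:i])          # PrefixError(r_i)
        default = sum(default_error[i:])      # docs left to the default classset
        if prefix + default < best_err:
            best_i, best_err = i, prefix + default
    return best_i, best_err
```

Because each document is classified by its first matching rule, cutting the list after ri does not change Error(rj) for j ≤ i, so the per-rule errors can be computed once and reused for every candidate i.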

Experiments

Experimental Results

• Results on the IBM data set

– The error: Coverage beats the others.

– The size: Confidence gets smaller.

– The time: Coverage takes longer.

Classification Error

Size & Execution Time
