Post on 31-Mar-2015


Hierarchical Classification of Real Life Documents

Ke Wang, Senqiang Zhou
Simon Fraser University

Yu He
National University of Singapore

Hierarchical & Multi-classed Documents

• Topics are organized into a hierarchy of increasing specificity

• A document is classified into all relevant classes.

• For example, a document on Dance could be reached from both Arts:Performing_Arts and Recreation topics in Yahoo

New Issues

• Misclassification is non-symmetric
– Travel Outdoor vs. Travel Software

• Documents are multi-classed
– Traditional way: only one class attached

• Class space is sparse
– 2^k - 1 subsets of classes for k classes
– Exploring the similarities between classes

A New Classification Model

• The model of documents:
– {t1, t2, …, tn | C1, …, Ck}, where t1, t2, …, tn are keywords and C1, …, Ck are classes from a given class hierarchy
– {C1, …, Ck} is called a classset (CS)

• Construct a classifier
– consisting of rules of the form {ti1, …, tip} → {Ci1, …, Cip}, that assigns a “good” classset to a given new document

Class Similarity

• Two classsets are similar if they “cover” similar documents.

• Anc(CS): the set of classes in a classset CS plus all ancestor classes.

• CS1 is more general than CS2 if Anc(CS1) ⊆ Anc(CS2)
– {Dance} is more general than {Fast-Dance, Music} because Anc({Dance}) ⊆ Anc({Fast-Dance, Music})
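
The Anc and "more general" definitions can be sketched in a few lines of Python. This is only an illustration: the toy hierarchy below (including where Fast-Dance and Music sit) and the names `PARENT`, `anc` and `more_general` are assumptions, not taken from the slides.

```python
# Hypothetical toy hierarchy: child -> parent (None marks a top-level class).
PARENT = {
    "Arts": None,
    "Dance": "Arts",
    "Fast-Dance": "Dance",
    "Music": "Arts",
}


def anc(classset):
    """Return the classes in `classset` plus all of their ancestors."""
    result = set()
    for c in classset:
        # Walk up the hierarchy until the root or an already-visited class.
        while c is not None and c not in result:
            result.add(c)
            c = PARENT[c]
    return result


def more_general(cs1, cs2):
    """CS1 is more general than CS2 if Anc(CS1) is a subset of Anc(CS2)."""
    return anc(cs1) <= anc(cs2)


print(more_general({"Dance"}, {"Fast-Dance", "Music"}))
print(more_general({"Fast-Dance", "Music"}, {"Dance"}))
```

Under this toy hierarchy the first check holds and the reverse does not, which mirrors the {Dance} example on the slide.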

Class Similarity (Cont.)

• A document d is covered by a classset CS if CS is more general than the classset of d

• Cover(CS) denotes the set of documents covered by CS

• Cover(CS1) ∩ Cover(CS2) = Cover(CS1 ∪ CS2)

Class Similarity (Cont.)

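• The dissimilarity of CS1 and CS2 is defined as the normalized difference of their coverage, E(CS1,CS2):

$$E(CS_1, CS_2) = \frac{|\mathrm{Cover}(CS_2) \setminus \mathrm{Cover}(CS_1)| + |\mathrm{Cover}(CS_1) \setminus \mathrm{Cover}(CS_2)|}{|\mathrm{Cover}(CS_1) \cup \mathrm{Cover}(CS_2)|}$$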

• The similarity is defined as 1 - E(CS1,CS2)
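
A minimal sketch of coverage and dissimilarity, assuming a toy hierarchy and four toy documents (all hypothetical, as are the function names). It also assumes E = 0 when neither classset covers any document, a case the slides do not address.

```python
# Hypothetical toy hierarchy: child -> parent (None marks a top-level class).
PARENT = {"Arts": None, "Dance": "Arts", "Fast-Dance": "Dance", "Music": "Arts"}

# Hypothetical training documents: document id -> classset.
DOCS = {
    "d1": {"Fast-Dance"},
    "d2": {"Dance"},
    "d3": {"Music"},
    "d4": {"Fast-Dance", "Music"},
}


def anc(classset):
    """Classes in `classset` plus all of their ancestors."""
    result = set()
    for c in classset:
        while c is not None and c not in result:
            result.add(c)
            c = PARENT[c]
    return result


def cover(classset):
    """Documents whose classset is at least as specific as `classset`."""
    return {d for d, cs_d in DOCS.items() if anc(classset) <= anc(cs_d)}


def dissimilarity(cs1, cs2):
    """E(CS1, CS2): size of the coverage difference over the coverage union."""
    c1, c2 = cover(cs1), cover(cs2)
    union = c1 | c2
    if not union:
        return 0.0  # assumption: nothing covered means no measurable difference
    return (len(c2 - c1) + len(c1 - c2)) / len(union)


def similarity(cs1, cs2):
    return 1.0 - dissimilarity(cs1, cs2)


print(dissimilarity({"Dance"}, {"Fast-Dance"}))
print(dissimilarity({"Dance"}, {"Music"}))
```

With these documents, {Dance} and {Fast-Dance} differ only in d2, so they come out much more similar than {Dance} and {Music}.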

The Confidence

• Match(T→CS): the set of documents that contain all the terms in T.

• The confidence of T→CS is defined as:

$$\mathrm{Confg}(T \to CS) = \frac{|\mathrm{Match}(T \to CS)| - \sum_{d \in \mathrm{Match}(T \to CS)} E(CS_d, CS)}{|\mathrm{Match}(T \to CS)|}$$

What’s behind Confg?

• Intuitively, Confg(T→CS) measures the average similarity between CS and the classsets of the documents that match T→CS.

• If E(CSd,CS) is binary, i.e., 1 or 0, Confg(T→CS) degenerates to the standard confidence.
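
The two observations above can be checked with a small sketch. The documents, the function names and the two dissimilarity functions are hypothetical; the dissimilarity is passed in as a parameter so that a binary E and a graded E can be compared. Returning 0 for a rule that matches no document is an assumption.

```python
def confg(terms, classset, docs, dissim):
    """Generalized confidence of the rule terms -> classset.

    docs   : list of (term set, classset) pairs
    dissim : function E(CS_d, CS) returning a value in [0, 1]
    """
    matched = [cs_d for terms_d, cs_d in docs if terms <= terms_d]
    if not matched:
        return 0.0  # assumption: a rule that matches nothing gets confidence 0
    total_error = sum(dissim(cs_d, classset) for cs_d in matched)
    return (len(matched) - total_error) / len(matched)


def binary_dissim(cs_d, cs):
    """Binary E: 0 for an exact classset match, 1 otherwise."""
    return 0.0 if cs_d == cs else 1.0


def graded_dissim(cs_d, cs):
    """Hypothetical graded E: a mismatch counts as only half an error."""
    return 0.0 if cs_d == cs else 0.5


# Hypothetical documents: (terms, classset).
DOCS = [
    ({"ballet", "stage"}, {"Dance"}),
    ({"ballet", "ticket"}, {"Dance"}),
    ({"ballet", "guitar"}, {"Music"}),
    ({"guitar"}, {"Music"}),
]

# With binary E this is the standard confidence: 2 of the 3 matching docs agree.
print(confg({"ballet"}, {"Dance"}, DOCS, binary_dissim))
# With a graded E, the near-miss document is penalized less.
print(confg({"ballet"}, {"Dance"}, DOCS, graded_dissim))
```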

Construction of Classifier

• Step 1: Find association rules
– Generate all association rules of the form T→CS that satisfy some user-specified minimum support and confidence.
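
Step 1 could be sketched by brute force as below. This is not the authors' mining algorithm: it enumerates small term sets exhaustively, only considers classsets that occur verbatim in the training data, and uses the standard (binary) confidence. The support definition (fraction of all documents that match T and carry exactly CS), the `max_terms` limit and all names are assumptions.

```python
from itertools import combinations


def mine_rules(docs, min_support, min_confidence, max_terms=2):
    """Enumerate rules T -> CS meeting the support and confidence thresholds.

    docs : list of (frozenset of terms, frozenset of classes)
    """
    vocabulary = sorted(set().union(*(terms for terms, _ in docs)))
    candidate_classsets = {cs for _, cs in docs}
    rules = []
    for size in range(1, max_terms + 1):
        for combo in combinations(vocabulary, size):
            terms = frozenset(combo)
            matched = [cs_d for terms_d, cs_d in docs if terms <= terms_d]
            if not matched:
                continue
            for cs in candidate_classsets:
                hits = sum(1 for cs_d in matched if cs_d == cs)
                support = hits / len(docs)
                confidence = hits / len(matched)
                if support >= min_support and confidence >= min_confidence:
                    rules.append((terms, cs, support, confidence))
    return rules


# Hypothetical training documents.
DOCS = [
    (frozenset({"ballet", "stage"}), frozenset({"Dance"})),
    (frozenset({"ballet", "guitar"}), frozenset({"Music"})),
    (frozenset({"guitar"}), frozenset({"Music"})),
    (frozenset({"stage"}), frozenset({"Dance"})),
]

for terms, cs, support, confidence in mine_rules(DOCS, 0.5, 0.8):
    print(sorted(terms), "->", sorted(cs), support, confidence)
```

A fuller version would plug the generalized confidence in place of the binary one and use a frequent-itemset miner rather than exhaustive enumeration.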

Construction of Classifier (Cont.)

• Step 2: Rank the rules
– A document is classified by the matching rule that has the highest confidence.
– This selection is called most confidence first (MCF)
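
MCF selection amounts to a maximum over the matching rules. The sketch below assumes a hypothetical `Rule` tuple and a caller-supplied default classset for documents that no rule matches; ties are broken by list order, which the slides do not specify.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    terms: frozenset       # T: every term must occur in the document
    classset: frozenset    # CS: the classset the rule assigns
    confidence: float      # confidence of T -> CS


def classify(doc_terms, rules, default_classset):
    """Most confidence first: apply the matching rule with highest confidence."""
    matching = [r for r in rules if r.terms <= doc_terms]
    if not matching:
        return default_classset
    # max() keeps the first of several equally confident rules (assumption).
    return max(matching, key=lambda r: r.confidence).classset


# Hypothetical rules and default classset.
RULES = [
    Rule(frozenset({"guitar"}), frozenset({"Music"}), 0.8),
    Rule(frozenset({"ballet"}), frozenset({"Dance"}), 0.9),
]
DEFAULT = frozenset({"Arts"})

print(sorted(classify({"ballet", "guitar", "stage"}, RULES, DEFAULT)))
print(sorted(classify({"soccer"}, RULES, DEFAULT)))
```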

Construction of Classifier (Cont.)

• Step 3: Remove rules of low accuracy
– Let D be the set of training documents classified by rule T→CS. The accuracy of T→CS is defined as

$$\mathrm{Accu}(T \to CS) = \frac{|D| - \sum_{d \in D} E(CS_d, CS)}{|D|}$$

Construction of Classifier (Cont.)

– Confg(T→CS) is defined with respect to all the documents that match the rule, whereas Accu(T→CS) is defined with respect to the documents classified by the rule.

– Remove the rules with accuracy below a certain threshold because they contribute negatively to overall accuracy.
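
A single-pass sketch of the accuracy computation and pruning. Each training document is credited to the most confident rule that matches it, so D differs from the full match set. The data, the names, the binary E used in the demo, and the decision to drop rules that classify no document are all assumptions.

```python
def rule_accuracies(rules, docs, dissim):
    """Return (rule, accuracy) pairs; accuracy is None if the rule classifies nothing.

    rules : list of (terms, classset, confidence)
    docs  : list of (terms, classset) training documents
    """
    ranked = sorted(rules, key=lambda r: r[2], reverse=True)
    errors = [[] for _ in ranked]
    for doc_terms, doc_cs in docs:
        # The first matching rule in confidence order classifies the document.
        for i, (terms, cs, _conf) in enumerate(ranked):
            if terms <= doc_terms:
                errors[i].append(dissim(doc_cs, cs))
                break
    result = []
    for rule, errs in zip(ranked, errors):
        accu = (len(errs) - sum(errs)) / len(errs) if errs else None
        result.append((rule, accu))
    return result


def prune(rules, docs, dissim, threshold):
    """Keep only rules whose accuracy reaches the threshold."""
    return [rule for rule, accu in rule_accuracies(rules, docs, dissim)
            if accu is not None and accu >= threshold]


def binary_dissim(cs_d, cs):
    return 0.0 if cs_d == cs else 1.0


# Hypothetical rules (already in confidence order) and training documents.
RULES = [
    (frozenset({"ballet"}), frozenset({"Dance"}), 0.9),
    (frozenset({"guitar"}), frozenset({"Music"}), 0.8),
    (frozenset({"stage"}), frozenset({"Music"}), 0.6),
]
DOCS = [
    (frozenset({"ballet", "stage"}), frozenset({"Dance"})),
    (frozenset({"ballet", "guitar"}), frozenset({"Music"})),
    (frozenset({"guitar"}), frozenset({"Music"})),
    (frozenset({"stage"}), frozenset({"Dance"})),
]

for rule, accu in rule_accuracies(RULES, DOCS, binary_dissim):
    print(sorted(rule[0]), "->", sorted(rule[1]), accu)
```

Note that the document containing both "ballet" and "guitar" is classified by the ballet rule, so it lowers that rule's accuracy even though the guitar rule also matches it.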

Construction of Classifier (Cont.)

• Step 4: Cut off the ranked list
– Suppose we cut off the list of rules r1, …, rm after the first i rules, r1, …, ri
– Cutoff error = PrefixError(ri) + DefaultError(ri)
– PrefixError(ri) is the sum of the rule errors Error(rj) for all rules rj, 1 ≤ j ≤ i
– DefaultError(ri) is the error caused by assigning the default classset to all the documents not classified by any rule rj
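
The cutoff error can be computed for every prefix in one pass. The sketch assumes the per-rule errors and the default errors have already been measured, and that the cutoff point with the smallest cutoff error is the one to keep (the slide defines the error but does not state the selection rule). The numbers are made up for illustration.

```python
from itertools import accumulate


def best_cutoff(rule_errors, default_errors):
    """Return (i, cutoff error) for the prefix r1..ri with minimal cutoff error.

    rule_errors[j]    : Error(r_{j+1}) for the ranked rules
    default_errors[j] : DefaultError(r_{j+1}), the error of assigning the
                        default classset to documents left unclassified
                        by the first j+1 rules
    """
    prefix_errors = list(accumulate(rule_errors))  # PrefixError(r_i)
    cutoff_errors = [p + d for p, d in zip(prefix_errors, default_errors)]
    best = min(range(len(cutoff_errors)), key=cutoff_errors.__getitem__)
    return best + 1, cutoff_errors[best]


# Hypothetical measurements for four ranked rules.
RULE_ERRORS = [1.0, 0.5, 2.0, 0.5]
DEFAULT_ERRORS = [6.0, 4.0, 3.5, 3.4]

print(best_cutoff(RULE_ERRORS, DEFAULT_ERRORS))
```

Here the third and fourth rules add more error than they remove from the default assignment, so the list is cut after the second rule.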

Experiments

Experimental Results

• The results on the IBM data set
– The error: Coverage beats the others.
– The size: Confidence gets smaller.
– The time: Coverage takes longer.

Classification Error

Size & Execution Time
