Building Hierarchical Classifiers Using Class Proximity
Ke Wang
Senqiang Zhou
Shiang Chen Liew
National University of Singapore
VLDB99, Sept 6-10, Edinburgh
Hierarchical classification
• Given
  – a class hierarchy
  – a collection of pre-classified documents
  – a document is a set of terms
• Build
  – a classifier that assigns a relevant class to a new document
• Key
  – extract features of classes
Yahoo classes
• Yahoo
  – recreation
    – automotive
    – sports
      – skating
      – cycling
  – science
ACM classes
• Level 1: Hardware
• Level 2 (under Hardware): General, Memory_structure
• Level 3 (under Memory_structure): General, Design_style
• Level 4 (under Design_style): Cache_memories
Existing local approaches
• build one classifier at each split of the class hierarchy
• determine features locally at each node
• classify a document by going through a path of classifiers starting from the root
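The top-down procedure described here can be sketched as follows. This is only an illustration: the hierarchy is the Yahoo example from the earlier slide, while the keyword sets in `LOCAL_FEATURES` and the overlap-counting "classifier" are hypothetical stand-ins for whatever local classifier is trained at each split.

```python
# Class hierarchy: parent -> children (from the Yahoo example).
HIERARCHY = {
    "Yahoo": ["recreation", "science"],
    "recreation": ["automotive", "sports"],
    "sports": ["skating", "cycling"],
}

# Hypothetical local features: one keyword set per class, consulted only
# by the classifier at that class's parent split.
LOCAL_FEATURES = {
    "recreation": {"game", "fun", "hobby"},
    "science": {"theory", "experiment"},
    "automotive": {"car", "engine"},
    "sports": {"team", "race"},
    "skating": {"ice", "skate"},
    "cycling": {"bike", "bicycle"},
}


def classify_top_down(terms, root="Yahoo"):
    """Follow one path of local classifiers from the root down to a leaf."""
    node, path = root, [root]
    while node in HIERARCHY:
        # Stand-in local classifier: pick the child whose keyword set
        # overlaps most with the document (ties go to the first child).
        node = max(HIERARCHY[node], key=lambda c: len(LOCAL_FEATURES[c] & terms))
        path.append(node)
    return path


print(classify_top_down({"hobby", "race", "bike"}))
print(classify_top_down({"car", "engine", "experiment"}))
```

In the second call the document is sent to science at the root, so the automotive classifier is never consulted: a decision made at a high level cannot be revised further down the path.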
Diminishing of high level structure
• local approaches rely on classification at high levels
• but high-level structures are usually weak, i.e., the topics diverge
• e.g., “car” is a feature at Recreation: Automotive, but not at Recreation
Bias of misclassification
• sibling classes vs. nephew classes
• misclassification at high levels vs. at low levels
• specialisation vs. generalisation
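One way to make these distinctions concrete is to measure how far apart two classes sit in the hierarchy. The sketch below assumes the number of edges between two classes as the measure; the slides do not define class proximity, so this is an illustrative choice, not the authors' definition. The hierarchy is the Yahoo example from the earlier slide.

```python
# Child -> parent links for the Yahoo example hierarchy.
PARENT = {
    "recreation": "Yahoo", "science": "Yahoo",
    "automotive": "recreation", "sports": "recreation",
    "skating": "sports", "cycling": "sports",
}


def ancestors(cls):
    """Return the path from cls up to the root, starting with cls itself."""
    path = [cls]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def tree_distance(a, b):
    """Number of edges on the path between classes a and b (assumed measure)."""
    steps_from_a = {c: i for i, c in enumerate(ancestors(a))}
    for j, c in enumerate(ancestors(b)):
        if c in steps_from_a:
            # c is the lowest common ancestor of a and b.
            return steps_from_a[c] + j
    raise ValueError("classes are not in the same hierarchy")


print(tree_distance("skating", "cycling"))     # sibling classes
print(tree_distance("automotive", "cycling"))  # uncle and nephew
print(tree_distance("cycling", "sports"))      # generalisation to the parent
print(tree_distance("science", "cycling"))     # error made at the top level
```

Under this measure, confusing two siblings costs less than confusing a class with its nephew, and an error made at the top level costs most.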
Features should be
• determined wrt the target class
• determined at all concept levels
• correlated
The solution: generalised association rules (SA95, HF95)
  {sql, IO} → DB
  {language, performance} → CS
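A generalised association rule may use terms from any level of a term hierarchy, so a document can satisfy a rule body through a more specific term. A minimal sketch, assuming a hypothetical term hierarchy (`TERM_PARENT`) in which "join" generalises to "sql" and "disk" to "IO"; only the two rules come from the slide.

```python
# Hypothetical term hierarchy: specific term -> more general term.
TERM_PARENT = {"join": "sql", "disk": "IO"}

# The two rules shown on the slide: (body, class).
RULES = [
    (frozenset({"sql", "IO"}), "DB"),
    (frozenset({"language", "performance"}), "CS"),
]


def generalise(terms):
    """Extend a document's term set with every ancestor in the term hierarchy."""
    extended = set(terms)
    for term in terms:
        while term in TERM_PARENT:
            term = TERM_PARENT[term]
            extended.add(term)
    return extended


def matching_rules(terms):
    """Return the rules whose whole body is covered by the generalised document."""
    extended = generalise(terms)
    return [(body, cls) for body, cls in RULES if body <= extended]


print(matching_rules({"join", "disk"}))
print(matching_rules({"language"}))
```

The first document contains neither "sql" nor "IO" literally, yet it matches {sql, IO} → DB through generalisation. The second matches nothing, because a rule fires only when all of its terms occur together.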
Our approach
• class proximity
• global classifier
• term hierarchy
• use the “best” generalised association rule T → C to determine the class
Rank association rules
• Biased confidence
• Biased J-measure
An example
[Figure: a term hierarchy (author, story; writer, editor, fiction, poem; ...) and a class hierarchy (Arts; A_Music, A_Literature; Music, Literature)]
Term hierarchy (T) = Yes, Class proximity (B) = Yes
• R0: author, story → Literature (ConfB=1, Clist=d6,d7)
• R1: author → Literature (ConfB=1)
• R2: story → Literature (ConfB=0.67, Wlist=d5(1))
• R4: hall → Music (ConfB=0.4, Clist=d1,d2, Wlist=d3(1))
• R3: States → A_Literature (ConfB=0.33, Clist=d4,d5)
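The rules listed above can be used to classify a new document by taking the highest-ranked rule whose body the document covers. The sketch below copies the rule bodies, classes and ConfB values from the slide. Everything else is an assumption: ties are broken by keeping the slide's order, a document matching no rule gets `None`, and the term hierarchy is not applied.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    name: str
    body: frozenset
    cls: str
    conf_b: float  # biased confidence, as reported on the slide


RULES = [
    Rule("R0", frozenset({"author", "story"}), "Literature", 1.0),
    Rule("R1", frozenset({"author"}), "Literature", 1.0),
    Rule("R2", frozenset({"story"}), "Literature", 0.67),
    Rule("R4", frozenset({"hall"}), "Music", 0.4),
    Rule("R3", frozenset({"States"}), "A_Literature", 0.33),
]


def rank(rules):
    """Order rules by decreasing ConfB; the stable sort keeps ties in input order."""
    return sorted(rules, key=lambda r: r.conf_b, reverse=True)


def classify(terms, rules=RULES) -> Optional[str]:
    """Return the class of the best-ranked rule whose body is a subset of terms."""
    for rule in rank(rules):
        if rule.body <= terms:
            return rule.cls
    return None  # assumed fallback: no rule applies


print([r.name for r in rank(RULES)])
print(classify({"story", "hall"}))
print(classify({"hall", "States"}))
```

A document containing both "story" and "hall" goes to Literature, because R2 outranks R4.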
Experiment I
• http://www.acm.org/dl/toc.html/
• 26,515 papers, 78 classes, 14,754 terms
• class hierarchy=Level-1 and level-2 categories
• term hierarchy=Level-3 and level-4 categories
• document=Title and level-4 categories
Best rules found by (B,T)
• CSO:
  – vector, stream, processor, parallel → Processor_Architectures
  – multiple_instruction_stream → Processor_Architectures
  – data_flow, architectur → Processor_Architectures
  – internet, architectur → Computer_Communication_Networks
  – mode, atm → Computer_Communication_Networks
  – network, circuit_switching → Computer_Communication_Networks
  – tecniqu, model, attribut → Performance_of_Systems
• Software:
  – program, function, application → Programming_Techniques
  – object_oriented_programming → Programming_Techniques
  – reusable_software → Software_Engineering
  – software, methodologie → Software_Engineering
  – organization, distributed_system → Operating_Systems
[Figure: results plot; legend: (), (T), (B), (B,T), (CDAR97,T), (CDAR97)]
Experiment II
• http://dir.yahoo.com/recreation/sports
• 7,550 documents
• 367 classes, 7 levels
• 10,747 terms
• 90% of the terms occur in no more than 10 documents and many documents contain only such terms
Best rules found by (B,T)
• Sports:Cycling:
  – page, mountain → Mountain_Biking
  – product, bike → Mountain_Biking
  – mtb, mountain → Mountain_Biking
  – held, bicycl → Races
  – classic, bicycl → Races
  – trip, tour → Travelogues
  – trip, canada → Travelogues
  – bicycl, alaska → Travelogues
• Sports:Auto_Racing:
  – team, result, driver → Formula_one
  – model, featur → Tracks_and_Speedways
  – oval → Tracks_and_Speedways
  – raceway → Tracks_and_Speedways