Decision Trees for Hierarchical Multilabel Classification: A Case Study in Functional Genomics

Hendrik Blockeel (1), Leander Schietgat (1), Jan Struyf (1,2), Saso Dzeroski (3), Amanda Clare (4)

(1) Katholieke Universiteit Leuven  (2) University of Wisconsin, Madison
(3) Jozef Stefan Institute, Ljubljana  (4) University of Wales, Aberystwyth
Overview

- The task: hierarchical multilabel classification (HMC)
  - Applied to functional genomics
- Decision trees for HMC
  - Multiple prediction with decision trees
  - HMC decision trees
- Experiments
  - How does HMC tree learning compare to learning multiple standard trees?
- Conclusions
Classification settings

- Normally, in classification, we assign one class label ci from a set C = {c1, ..., ck} to each example
- In multilabel classification, we assign a subset S ⊆ C to each example, i.e., one example can belong to multiple classes
  - Some applications:
    - Text classification: assign subjects (newsgroups) to texts
    - Functional genomics: assign functions to genes
- In hierarchical multilabel classification (HMC), the classes C form a hierarchy (C, ≤); the partial order ≤ expresses "is a superclass of"
Hierarchical multilabel classification

- Hierarchy constraint: ci ≤ cj ⇒ coverage(cj) ⊆ coverage(ci)
  - Elements of a class must be elements of its superclasses
  - Should hold for the given data as well as for predictions
- Straightforward way to learn an HMC model: learn k binary classifiers, one for each class
  - Disadvantages:
    1. difficult to guarantee the hierarchy constraint
    2. skewed class distributions (few positives, many negatives)
    3. relatively slow
    4. no single interpretable model
- Alternative: learn one classifier that predicts a vector of classes
  - Quite natural for, e.g., neural networks
  - We will do this with (interpretable) decision trees
Goal of this work

- There has been work on extending decision tree learning to the HMC case
  - Multiple prediction trees: Blockeel et al., ICML 1998; Clare and King, ECML 2001; ...
  - HMC trees: Blockeel et al., 2002; Clare, 2003; Struyf et al., 2005
- HMC trees were evaluated in functional genomics, with good results (proof of concept)
- But: no comparison with learning multiple single classification trees has been made
  - Size of trees, predictive accuracy, runtimes...
  - Previous work focused on the knowledge discovery aspect
- We compare both approaches for functional genomics
Functional genomics

- Task: given a data set with descriptions of genes and the functions they have, learn a model that can predict for a new gene which functions it performs
- A gene can have multiple functions (out of 250 possible functions, in our case)
- Could be done with decision trees, with all the advantages that brings (fast, interpretable)... But:
  - Decision trees predict only one class, not a set of classes
  - Should we learn a separate tree for each function?
    - 250 functions = 250 trees: not so fast and interpretable anymore!
[Table: genes G1, G2, G3, ... described by attributes A1 ... An, each marked (x) with the subset of functions 1 ... 250 it performs]
Multiple prediction trees

- A multiple prediction tree (MPT) makes multiple predictions at once
- Basic idea (Blockeel, De Raedt, Ramon, 1998):
  - A decision tree learner prefers tests that yield much information on the "class" attribute (measured using information gain (C4.5) or variance reduction (CART))
  - An MPT learner prefers tests that reduce variance for all target variables together
  - Variance = mean squared distance of the label vectors to the mean vector, in k-dimensional space
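This multi-target variance is straightforward to compute. A minimal sketch (function name hypothetical):

```python
def multi_target_variance(label_vectors):
    """Mean squared Euclidean distance of k-D label vectors to their mean vector."""
    n = len(label_vectors)
    k = len(label_vectors[0])
    mean = [sum(v[j] for v in label_vectors) / n for j in range(k)]
    return sum(sum((v[j] - mean[j]) ** 2 for j in range(k))
               for v in label_vectors) / n

# identical label vectors -> zero variance; an MPT learner prefers splits
# whose subsets have low values of this quantity
print(multi_target_variance([[1, 0, 1], [1, 0, 1]]))  # 0.0
```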
[Figure: the gene table from before, with an MPT built on it; each leaf predicts a set of functions, e.g. {4,12,105,250}, {1,5,24,35}, {140}, {1,5}, {2}]
The algorithm

Procedure MPTree(T) returns tree
  (t*, h*, P*) = (none, ∞, ∅)
  for each possible test t:
    P = partition induced by t on T
    h = Σ_{Tk ∈ P} |Tk|/|T| · Var(Tk)
    if (h < h*) and acceptable(t, P):
      (t*, h*, P*) = (t, h, P)
  if t* ≠ none:
    for each Tk ∈ P*: treek = MPTree(Tk)
    return node(t*, ∪k {treek})
  else:
    return leaf(v)
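The procedure above can be sketched as a short, self-contained Python program. This is illustrative only, not the Clus implementation: binary attribute tests, a minimum-subset-size acceptable() criterion, and leaves that predict class-proportion vectors are all simplifying assumptions.

```python
def var(T):
    """Variance of a set of (attributes, labels) examples:
    mean squared distance of label vectors to their mean vector."""
    labels = [y for _, y in T]
    n, k = len(labels), len(labels[0])
    mean = [sum(y[j] for y in labels) / n for j in range(k)]
    return sum(sum((y[j] - mean[j]) ** 2 for j in range(k)) for y in labels) / n

def mp_tree(T, min_size=2):
    best = (None, float("inf"), None)            # (t*, h*, P*)
    for j in range(len(T[0][0])):                # each possible test "attribute j == 1"
        P = ([e for e in T if e[0][j] == 1],     # partition induced by the test on T
             [e for e in T if e[0][j] == 0])
        if min(len(Tk) for Tk in P) < min_size:  # acceptable(t, P)
            continue
        h = sum(len(Tk) / len(T) * var(Tk) for Tk in P)  # weighted variance
        if h < best[1]:
            best = (j, h, P)
    t_star, h_star, P_star = best
    if t_star is not None and h_star < var(T):   # split only if variance drops
        return {"test": t_star,
                "yes": mp_tree(P_star[0], min_size),
                "no": mp_tree(P_star[1], min_size)}
    # leaf: predict the vector of class proportions in T
    labels = [y for _, y in T]
    return [sum(y[j] for y in labels) / len(labels) for j in range(len(labels[0]))]
```

For example, on four (attribute, label-vector) pairs where attribute 0 perfectly separates the two label vectors, `mp_tree` puts the test on attribute 0 at the root and produces two pure leaves.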
HMC tree learning

- A special case of MPT learning
  - The class vector contains all classes in the hierarchy
- Main characteristics:
  - Errors higher up in the hierarchy are more important
    - Use a weighted Euclidean distance (higher weight for higher classes)
  - Need to ensure the hierarchy constraint
    - Normally, a leaf predicts ci iff the proportion of ci examples in the leaf is above some threshold ti (often 0.5)
    - We will let ti vary (see further)
    - To ensure compliance with the hierarchy constraint: ci ≤ cj ⇒ ti ≤ tj
    - Automatically fulfilled if all ti are equal
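Why the threshold condition suffices can be seen in a few lines: in any leaf a superclass's proportion is at least its subclass's proportion, so if each superclass threshold is no larger than its subclass thresholds, the predicted set is automatically closed under superclasses. A sketch with hypothetical class names and values:

```python
# A leaf predicts class c iff its proportion p_c meets threshold t_c.
# Leaf proportions always satisfy p_parent >= p_child (hierarchy constraint
# on the data), so t_parent <= t_child guarantees it on predictions too.

def leaf_prediction(proportions, thresholds):
    """proportions, thresholds: dicts mapping class name -> value in [0, 1]."""
    return {c for c, p in proportions.items() if p >= thresholds[c]}

parents = {"c4": "c1", "c5": "c1"}             # hypothetical: c1 superclass of c4, c5
p = {"c1": 0.8, "c4": 0.6, "c5": 0.2}          # leaf proportions (p_parent >= p_child)
t = {"c1": 0.4, "c4": 0.5, "c5": 0.5}          # thresholds with t_parent <= t_child
pred = leaf_prediction(p, t)
# predicting c4 implies predicting its superclass c1
assert all(parents[c] in pred for c in pred if c in parents)
```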
Example

[Hierarchy: classes c1, c2, c3 at the top level (weight 1); subclasses c4, c5, c6, c7 at the second level (weight 0.5)]

x1: {c1, c3, c5} = [1,0,1,0,1,0,0]
x2: {c1, c3, c7} = [1,0,1,0,0,0,1]
x3: {c1, c2, c5} = [1,1,0,0,1,0,0]

d²(x1, x2) = 0.25 + 0.25 = 0.5   (x1 and x2 differ only in the second-level classes c5 and c7; each contributes 0.5² = 0.25)
d²(x1, x3) = 1 + 1 = 2           (x1 and x3 differ in the top-level classes c2 and c3; each contributes 1² = 1)

x1 is more similar to x2 than to x3: differences deeper in the hierarchy count less.

The decision tree learner tries to create leaves with "similar" examples, i.e., relatively pure w.r.t. class sets.

[Figure: the hierarchy drawn once per example, with the classes of x1, x2, and x3 highlighted]
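The example's weighted squared distance can be checked directly; here each class's weight multiplies the coordinate difference before squaring, so a second-level mismatch contributes 0.5² = 0.25 (function name hypothetical):

```python
def weighted_sq_dist(x, y, weights):
    """Squared weighted Euclidean distance between two class vectors."""
    return sum((w * (a - b)) ** 2 for a, b, w in zip(x, y, weights))

# c1..c3 are top-level (weight 1), c4..c7 second-level (weight 0.5)
w = [1, 1, 1, 0.5, 0.5, 0.5, 0.5]
x1 = [1, 0, 1, 0, 1, 0, 0]   # {c1, c3, c5}
x2 = [1, 0, 1, 0, 0, 0, 1]   # {c1, c3, c7}
x3 = [1, 1, 0, 0, 1, 0, 0]   # {c1, c2, c5}
print(weighted_sq_dist(x1, x2, w))  # 0.5
print(weighted_sq_dist(x1, x3, w))  # 2
```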
Evaluating HMC trees

- Original work by Clare et al.:
  - Derive rules with high "accuracy" and "coverage" from the tree
  - Quality of individual rules was assessed
  - No simple overall criterion to assess the quality of a tree
- In this work: precision-recall curves
  - Precision = P(pos | predicted pos)
  - Recall = P(predicted pos | pos)
  - The precision and recall of a tree depend on the thresholds ti used
  - By changing the threshold ti from 1 to 0, a precision-recall curve emerges
- For 250 classes:
  - Precision = P(X | predicted X)  [with X any of the 250 classes]
  - Recall = P(predicted X | X)
  - This gives a PR curve that is a kind of "average" of the individual PR curves for each class
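One way to realize such an "average" curve is micro-averaging: pool the per-class decisions over all examples and sweep a single shared threshold from 1 to 0. A sketch, with hypothetical data and function names:

```python
def precision_recall(y_true, y_score, t):
    """Micro-averaged precision/recall over all (example, class) pairs at
    threshold t. y_true and y_score are per-example lists, one entry per class."""
    tp = fp = fn = 0
    for truth, scores in zip(y_true, y_score):
        for is_member, s in zip(truth, scores):
            predicted = s >= t
            tp += predicted and is_member
            fp += predicted and not is_member
            fn += (not predicted) and is_member
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# hypothetical predicted class proportions for 2 examples x 3 classes
y_true = [[1, 0, 1], [0, 0, 1]]
y_score = [[0.9, 0.4, 0.8], [0.3, 0.6, 0.7]]
curve = [precision_recall(y_true, y_score, t) for t in (0.95, 0.75, 0.5, 0.0)]
```

Lowering the threshold trades precision for recall, tracing out the curve.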
The Clus system

- Created by Jan Struyf
- Propositional decision tree learner, implemented in Java
- Implements ideas from
  - C4.5 (Quinlan, '93)
  - CART (Breiman et al., '84)
  - predictive clustering trees (Blockeel et al., '98)
- Includes multiple prediction trees and hierarchical multilabel classification trees
- Reads data in ARFF format (Weka)
- We used two versions for our experiments:
  - Clus-HMC: the HMC version as explained
  - Clus-SC: single classification version, roughly CART
The datasets

- 12 datasets from functional genomics
- Each with a different description of the genes
  - Sequence statistics (1)
  - Phenotype (2)
  - Predicted secondary structure (3)
  - Homology (4)
  - Micro-array data (5-12)
- Each with the same class hierarchy
  - 250 classes distributed over 4 levels
- Number of examples: 1592 to 3932
- Number of attributes: 52 to 47034
Our expectations...

- How does HMC tree learning compare to the "straightforward" approach of learning 250 trees? We expect:
  - Faster learning: learning 1 HMCT is slower than learning 1 SPT (single prediction tree), but faster than learning 250 SPTs
  - Much faster prediction: using 1 HMCT for prediction is as fast as using 1 SPT, and hence 250 times faster than using 250 SPTs
  - Larger trees: an HMCT is larger than the average tree for 1 class, but smaller than the set of 250 trees
  - Less accurate: an HMCT is less accurate than a set of 250 SPTs (but hopefully not much less accurate)
- So how much faster / simpler / less accurate are our HMC trees?
The results

- The HMCT is on average less complex than one single SPT
  - The HMCT has 24 nodes; SPTs have on average 33 nodes
  - ... but you'd need 250 of the latter to do the same job
- The HMCT is on average slightly more accurate than a single SPT
  - Measured using "average precision-recall curves" (see graphs)
  - Surprising, as each SPT is tuned for one specific prediction task
- Expectations w.r.t. efficiency are confirmed
  - Learning: min. speedup factor 4.5x, max. 65x, average 37x
  - Prediction: >250 times faster (since the tree is not larger)
- Faster to learn, much faster to apply
Precision-recall curves

- Precision: proportion of predictions that is correct, P(X | predicted X)
- Recall: proportion of class memberships correctly identified, P(predicted X | X)
An example rule

IF   Nitrogen_Depletion_8_h <= -2.74
AND  Nitrogen_Depletion_2_h > -1.94
AND  1point5_mM_diamide_5_min > -0.03
AND  1M_sorbitol___45_min_ > -0.36
AND  37C_to_25C_shock___60_min > 1.28
THEN 40, 40/3, 5, 5/1

High interpretability: IF-THEN rules extracted from the HMCT are quite simple.

For class 40/3: recall = 0.15, precision = 0.97
(the rule covers 15% of all class 40/3 cases, and 97% of the cases fulfilling these conditions are indeed 40/3)
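For illustration, such a rule transcribes directly into a predicate over a gene's measurements (attribute names as on the slide; the example gene's values are made up):

```python
def rule_fires(g):
    """True iff gene g (a dict of expression measurements) satisfies the rule;
    the rule then predicts functions 40, 40/3, 5 and 5/1."""
    return (g["Nitrogen_Depletion_8_h"] <= -2.74
            and g["Nitrogen_Depletion_2_h"] > -1.94
            and g["1point5_mM_diamide_5_min"] > -0.03
            and g["1M_sorbitol___45_min_"] > -0.36
            and g["37C_to_25C_shock___60_min"] > 1.28)

# hypothetical gene satisfying all five conditions
gene = {"Nitrogen_Depletion_8_h": -3.0, "Nitrogen_Depletion_2_h": -1.0,
        "1point5_mM_diamide_5_min": 0.1, "1M_sorbitol___45_min_": 0.0,
        "37C_to_25C_shock___60_min": 1.5}
print(rule_fires(gene))  # True
```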
The effect of merging...

- 250 separate trees, each optimized for one class (c1, c2, ..., c250), versus one tree optimized for c1, c2, ..., c250 together
- The merged tree is:
  - smaller than the average individual tree
  - more accurate than the average individual tree
Any explanation for these results?

- Seems too good to be true... how is it possible?
- Answer: the classes are not independent
  - Different trees for different classes actually share structure
    - Explains some of the complexity reduction achieved by the HMC tree, but not all!
  - One class carries information on other classes
    - This increases the signal-to-noise ratio
    - Provides better guidance when learning the tree (explaining the good accuracy)
    - Avoids overfitting (explaining the further reduction of tree size)
- This was confirmed empirically
Overfitting

- To check our "overfitting" hypothesis: we compared the area under the PR curve on the training set (Atr) and on the test set (Ate)
  - For SPC: Atr - Ate = 0.219
  - For HMCT: Atr - Ate = 0.024
  - (to verify, we tried Weka's M5' too: 0.387)
- So the HMCT clearly overfits much less
Conclusions

- Surprising discovery: a single tree can be found that predicts 250 different functions and
  - has, on average, equal or better accuracy than special-purpose trees for each function
  - is not more complex than a single special-purpose tree (hence, 250 times simpler than the whole set)
  - is (much) more efficient to learn and to apply
- The reason for this is to be found in the dependencies between the gene functions
  - They provide better guidance when learning the tree
  - They help to avoid overfitting
- Multiple prediction / HMC trees have a lot of potential and should be used more often!
Ongoing work

- More extensive experimentation
- Predicting classes in a lattice instead of a tree-shaped hierarchy