Decision Trees for Hierarchical Multilabel Classification: A Case Study in Functional Genomics

Hendrik Blockeel (1), Leander Schietgat (1), Jan Struyf (1,2), Saso Dzeroski (3), Amanda Clare (4)

(1) Katholieke Universiteit Leuven  (2) University of Wisconsin, Madison
(3) Jozef Stefan Institute, Ljubljana  (4) University of Wales, Aberystwyth
Overview

- The task: hierarchical multilabel classification (HMC)
  - Applied to functional genomics
- Decision trees for HMC
  - Multiple prediction with decision trees
  - HMC decision trees
- Experiments
  - How does HMC tree learning compare to learning multiple standard trees?
- Conclusions
Classification settings

- Normally, in classification, we assign one class label ci from a set C = {c1, ..., ck} to each example
- In multilabel classification, we assign a subset S ⊆ C to each example, i.e., one example can belong to multiple classes
  - Some applications:
    - Text classification: assign subjects (newsgroups) to texts
    - Functional genomics: assign functions to genes
- In hierarchical multilabel classification (HMC), the classes C form a hierarchy (C, ≤); the partial order ≤ expresses "is a superclass of"
Hierarchical multilabel classification

- Hierarchy constraint: ci ≤ cj ⇒ coverage(cj) ⊆ coverage(ci)
  - Elements of a class must be elements of its superclasses
  - Should hold for the given data as well as for predictions
- Straightforward way to learn an HMC model: learn k binary classifiers, one for each class
  - Disadvantages:
    1. difficult to guarantee the hierarchy constraint
    2. skewed class distributions (few positives, many negatives)
    3. relatively slow
    4. no single interpretable model
- Alternative: learn one classifier that predicts a vector of classes
  - Quite natural for, e.g., neural networks
  - We will do this with (interpretable) decision trees
Goal of this work

- There has been work on extending decision tree learning to the HMC case
  - Multiple prediction trees: Blockeel et al., ICML 1998; Clare and King, ECML 2001; ...
  - HMC trees: Blockeel et al., 2002; Clare, 2003; Struyf et al., 2005
- HMC trees were evaluated in functional genomics, with good results (proof of concept)
- But: no comparison with learning multiple single classification trees has been made
  - Size of trees, predictive accuracy, runtimes...
  - Previous work focused on the knowledge discovery aspect
- We compare both approaches for functional genomics
Functional genomics

- Task: given a data set with descriptions of genes and the functions they have, learn a model that can predict for a new gene which functions it performs
- A gene can have multiple functions (out of 250 possible functions, in our case)
- Could be done with decision trees, with all the advantages that brings (fast, interpretable)... But:
  - Decision trees predict only one class, not a set of classes
  - Should we learn a separate tree for each function?
    - 250 functions = 250 trees: not so fast and interpretable anymore!
[Table: genes G1, G2, G3, ... described by attributes A1 ... An, each marked (x) with the subset of functions 1 ... 250 it performs]
Multiple prediction trees

- A multiple prediction tree (MPT) makes multiple predictions at once
- Basic idea (Blockeel, De Raedt, Ramon, 1998):
  - A decision tree learner prefers tests that yield much information on the "class" attribute (measured using information gain (C4.5) or variance reduction (CART))
  - An MPT learner prefers tests that reduce variance for all target variables together
  - Variance = mean squared distance of the label vectors to the mean vector, in k-dimensional space
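This multi-target variance is straightforward to compute. A minimal sketch (function name hypothetical):

```python
def multi_target_variance(label_vectors):
    """Mean squared Euclidean distance of k-D label vectors to their mean vector."""
    n = len(label_vectors)
    k = len(label_vectors[0])
    mean = [sum(v[j] for v in label_vectors) / n for j in range(k)]
    return sum(sum((v[j] - mean[j]) ** 2 for j in range(k))
               for v in label_vectors) / n

# identical label vectors -> zero variance; an MPT learner prefers splits
# whose subsets have low values of this quantity
print(multi_target_variance([[1, 0, 1], [1, 0, 1]]))  # 0.0
```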
[Figure: the gene table from before, with an MPT built on it; each leaf predicts a set of functions, e.g. {4,12,105,250}, {1,5,24,35}, {140}, {1,5}, {2}]
The algorithm

Procedure MPTree(T) returns tree
  (t*, h*, P*) = (none, ∞, ∅)
  for each possible test t:
    P = partition induced by t on T
    h = Σ_{Tk ∈ P} |Tk|/|T| · Var(Tk)
    if (h < h*) and acceptable(t, P):
      (t*, h*, P*) = (t, h, P)
  if t* ≠ none:
    for each Tk ∈ P*: treek = MPTree(Tk)
    return node(t*, ∪k {treek})
  else:
    return leaf(v)
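The procedure above can be sketched as a short, self-contained Python program. This is illustrative only, not the Clus implementation: binary attribute tests, a minimum-subset-size acceptable() criterion, and leaves that predict class-proportion vectors are all simplifying assumptions.

```python
def var(T):
    """Variance of a set of (attributes, labels) examples:
    mean squared distance of label vectors to their mean vector."""
    labels = [y for _, y in T]
    n, k = len(labels), len(labels[0])
    mean = [sum(y[j] for y in labels) / n for j in range(k)]
    return sum(sum((y[j] - mean[j]) ** 2 for j in range(k)) for y in labels) / n

def mp_tree(T, min_size=2):
    best = (None, float("inf"), None)            # (t*, h*, P*)
    for j in range(len(T[0][0])):                # each possible test "attribute j == 1"
        P = ([e for e in T if e[0][j] == 1],     # partition induced by the test on T
             [e for e in T if e[0][j] == 0])
        if min(len(Tk) for Tk in P) < min_size:  # acceptable(t, P)
            continue
        h = sum(len(Tk) / len(T) * var(Tk) for Tk in P)  # weighted variance
        if h < best[1]:
            best = (j, h, P)
    t_star, h_star, P_star = best
    if t_star is not None and h_star < var(T):   # split only if variance drops
        return {"test": t_star,
                "yes": mp_tree(P_star[0], min_size),
                "no": mp_tree(P_star[1], min_size)}
    # leaf: predict the vector of class proportions in T
    labels = [y for _, y in T]
    return [sum(y[j] for y in labels) / len(labels) for j in range(len(labels[0]))]
```

For example, on four (attribute, label-vector) pairs where attribute 0 perfectly separates the two label vectors, `mp_tree` puts the test on attribute 0 at the root and produces two pure leaves.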
HMC tree learning

- A special case of MPT learning
  - The class vector contains all classes in the hierarchy
- Main characteristics:
  - Errors higher up in the hierarchy are more important
    - Use a weighted Euclidean distance (higher weight for higher classes)
  - Need to ensure the hierarchy constraint
    - Normally, a leaf predicts ci iff the proportion of ci examples in the leaf is above some threshold ti (often 0.5)
    - We will let ti vary (see further)
    - To ensure compliance with the hierarchy constraint: ci ≤ cj ⇒ ti ≤ tj
    - Automatically fulfilled if all ti are equal
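Why the threshold condition suffices can be seen in a few lines: in any leaf a superclass's proportion is at least its subclass's proportion, so if each superclass threshold is no larger than its subclass thresholds, the predicted set is automatically closed under superclasses. A sketch with hypothetical class names and values:

```python
# A leaf predicts class c iff its proportion p_c meets threshold t_c.
# Leaf proportions always satisfy p_parent >= p_child (hierarchy constraint
# on the data), so t_parent <= t_child guarantees it on predictions too.

def leaf_prediction(proportions, thresholds):
    """proportions, thresholds: dicts mapping class name -> value in [0, 1]."""
    return {c for c, p in proportions.items() if p >= thresholds[c]}

parents = {"c4": "c1", "c5": "c1"}             # hypothetical: c1 superclass of c4, c5
p = {"c1": 0.8, "c4": 0.6, "c5": 0.2}          # leaf proportions (p_parent >= p_child)
t = {"c1": 0.4, "c4": 0.5, "c5": 0.5}          # thresholds with t_parent <= t_child
pred = leaf_prediction(p, t)
# predicting c4 implies predicting its superclass c1
assert all(parents[c] in pred for c in pred if c in parents)
```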
Example

[Hierarchy: classes c1, c2, c3 at the top level (weight 1); subclasses c4, c5, c6, c7 at the second level (weight 0.5)]

x1: {c1, c3, c5} = [1,0,1,0,1,0,0]
x2: {c1, c3, c7} = [1,0,1,0,0,0,1]
x3: {c1, c2, c5} = [1,1,0,0,1,0,0]

d²(x1, x2) = 0.25 + 0.25 = 0.5   (x1 and x2 differ only in the second-level classes c5 and c7; each contributes 0.5² = 0.25)
d²(x1, x3) = 1 + 1 = 2           (x1 and x3 differ in the top-level classes c2 and c3; each contributes 1² = 1)

x1 is more similar to x2 than to x3: differences deeper in the hierarchy count less.

The decision tree learner tries to create leaves with "similar" examples, i.e., relatively pure w.r.t. class sets.

[Figure: the hierarchy drawn once per example, with the classes of x1, x2, and x3 highlighted]
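The example's weighted squared distance can be checked directly; here each class's weight multiplies the coordinate difference before squaring, so a second-level mismatch contributes 0.5² = 0.25 (function name hypothetical):

```python
def weighted_sq_dist(x, y, weights):
    """Squared weighted Euclidean distance between two class vectors."""
    return sum((w * (a - b)) ** 2 for a, b, w in zip(x, y, weights))

# c1..c3 are top-level (weight 1), c4..c7 second-level (weight 0.5)
w = [1, 1, 1, 0.5, 0.5, 0.5, 0.5]
x1 = [1, 0, 1, 0, 1, 0, 0]   # {c1, c3, c5}
x2 = [1, 0, 1, 0, 0, 0, 1]   # {c1, c3, c7}
x3 = [1, 1, 0, 0, 1, 0, 0]   # {c1, c2, c5}
print(weighted_sq_dist(x1, x2, w))  # 0.5
print(weighted_sq_dist(x1, x3, w))  # 2
```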
Evaluating HMC trees

- Original work by Clare et al.:
  - Derive rules with high "accuracy" and "coverage" from the tree
  - Quality of individual rules was assessed
  - No simple overall criterion to assess the quality of a tree
- In this work: precision-recall curves
  - Precision = P(pos | predicted pos)
  - Recall = P(predicted pos | pos)
  - The precision and recall of a tree depend on the thresholds ti used
  - By changing the threshold ti from 1 to 0, a precision-recall curve emerges
- For 250 classes:
  - Precision = P(X | predicted X)  [with X any of the 250 classes]
  - Recall = P(predicted X | X)
  - This gives a PR curve that is a kind of "average" of the individual PR curves for each class
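One way to realize such an "average" curve is micro-averaging: pool the per-class decisions over all examples and sweep a single shared threshold from 1 to 0. A sketch, with hypothetical data and function names:

```python
def precision_recall(y_true, y_score, t):
    """Micro-averaged precision/recall over all (example, class) pairs at
    threshold t. y_true and y_score are per-example lists, one entry per class."""
    tp = fp = fn = 0
    for truth, scores in zip(y_true, y_score):
        for is_member, s in zip(truth, scores):
            predicted = s >= t
            tp += predicted and is_member
            fp += predicted and not is_member
            fn += (not predicted) and is_member
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# hypothetical predicted class proportions for 2 examples x 3 classes
y_true = [[1, 0, 1], [0, 0, 1]]
y_score = [[0.9, 0.4, 0.8], [0.3, 0.6, 0.7]]
curve = [precision_recall(y_true, y_score, t) for t in (0.95, 0.75, 0.5, 0.0)]
```

Lowering the threshold trades precision for recall, tracing out the curve.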
The Clus system

- Created by Jan Struyf
- Propositional decision tree learner, implemented in Java
- Implements ideas from
  - C4.5 (Quinlan, '93)
  - CART (Breiman et al., '84)
  - predictive clustering trees (Blockeel et al., '98)
- Includes multiple prediction trees and hierarchical multilabel classification trees
- Reads data in ARFF format (Weka)
- We used two versions for our experiments:
  - Clus-HMC: the HMC version as explained
  - Clus-SC: single classification version, roughly CART
The datasets

- 12 datasets from functional genomics
- Each with a different description of the genes
  - Sequence statistics (1)
  - Phenotype (2)
  - Predicted secondary structure (3)
  - Homology (4)
  - Micro-array data (5-12)
- Each with the same class hierarchy
  - 250 classes distributed over 4 levels
- Number of examples: 1592 to 3932
- Number of attributes: 52 to 47034
Our expectations...

- How does HMC tree learning compare to the "straightforward" approach of learning 250 trees? We expect:
  - Faster learning: learning 1 HMCT is slower than learning 1 SPT (single prediction tree), but faster than learning 250 SPTs
  - Much faster prediction: using 1 HMCT for prediction is as fast as using 1 SPT, and hence 250 times faster than using 250 SPTs
  - Larger trees: an HMCT is larger than the average tree for 1 class, but smaller than the set of 250 trees
  - Less accurate: an HMCT is less accurate than a set of 250 SPTs (but hopefully not much less accurate)
- So how much faster / simpler / less accurate are our HMC trees?
The results

- The HMCT is on average less complex than one single SPT
  - The HMCT has 24 nodes; SPTs have on average 33 nodes
  - ... but you'd need 250 of the latter to do the same job
- The HMCT is on average slightly more accurate than a single SPT
  - Measured using "average precision-recall curves" (see graphs)
  - Surprising, as each SPT is tuned for one specific prediction task
- Expectations w.r.t. efficiency are confirmed
  - Learning: min. speedup factor 4.5x, max. 65x, average 37x
  - Prediction: >250 times faster (since the tree is not larger)
- Faster to learn, much faster to apply
Precision-recall curves

- Precision: proportion of predictions that is correct, P(X | predicted X)
- Recall: proportion of class memberships correctly identified, P(predicted X | X)
An example rule

IF   Nitrogen_Depletion_8_h <= -2.74
AND  Nitrogen_Depletion_2_h > -1.94
AND  1point5_mM_diamide_5_min > -0.03
AND  1M_sorbitol___45_min_ > -0.36
AND  37C_to_25C_shock___60_min > 1.28
THEN 40, 40/3, 5, 5/1

High interpretability: IF-THEN rules extracted from the HMCT are quite simple.

For class 40/3: recall = 0.15, precision = 0.97
(the rule covers 15% of all class 40/3 cases, and 97% of the cases fulfilling these conditions are indeed 40/3)
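For illustration, such a rule transcribes directly into a predicate over a gene's measurements (attribute names as on the slide; the example gene's values are made up):

```python
def rule_fires(g):
    """True iff gene g (a dict of expression measurements) satisfies the rule;
    the rule then predicts functions 40, 40/3, 5 and 5/1."""
    return (g["Nitrogen_Depletion_8_h"] <= -2.74
            and g["Nitrogen_Depletion_2_h"] > -1.94
            and g["1point5_mM_diamide_5_min"] > -0.03
            and g["1M_sorbitol___45_min_"] > -0.36
            and g["37C_to_25C_shock___60_min"] > 1.28)

# hypothetical gene satisfying all five conditions
gene = {"Nitrogen_Depletion_8_h": -3.0, "Nitrogen_Depletion_2_h": -1.0,
        "1point5_mM_diamide_5_min": 0.1, "1M_sorbitol___45_min_": 0.0,
        "37C_to_25C_shock___60_min": 1.5}
print(rule_fires(gene))  # True
```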
The effect of merging...

- 250 separate trees, each optimized for one class (c1, c2, ..., c250), versus one tree optimized for c1, c2, ..., c250 together
- The merged tree is:
  - smaller than the average individual tree
  - more accurate than the average individual tree
Any explanation for these results?

- Seems too good to be true... how is it possible?
- Answer: the classes are not independent
  - Different trees for different classes actually share structure
    - Explains some of the complexity reduction achieved by the HMC tree, but not all!
  - One class carries information on other classes
    - This increases the signal-to-noise ratio
    - Provides better guidance when learning the tree (explaining the good accuracy)
    - Avoids overfitting (explaining the further reduction of tree size)
- This was confirmed empirically
Overfitting

- To check our "overfitting" hypothesis: we compared the area under the PR curve on the training set (Atr) and on the test set (Ate)
  - For SPC: Atr - Ate = 0.219
  - For HMCT: Atr - Ate = 0.024
  - (to verify, we tried Weka's M5' too: 0.387)
- So the HMCT clearly overfits much less
Conclusions

- Surprising discovery: a single tree can be found that predicts 250 different functions and
  - has, on average, equal or better accuracy than special-purpose trees for each function
  - is not more complex than a single special-purpose tree (hence, 250 times simpler than the whole set)
  - is (much) more efficient to learn and to apply
- The reason for this is to be found in the dependencies between the gene functions
  - They provide better guidance when learning the tree
  - They help to avoid overfitting
- Multiple prediction / HMC trees have a lot of potential and should be used more often!
Ongoing work

- More extensive experimentation
- Predicting classes in a lattice instead of a tree-shaped hierarchy