Decision trees for hierarchical multilabel classification: A case study in functional genomics

Work by

Hendrik Blockeel, Leander Schietgat, Jan Struyf
Katholieke Universiteit Leuven (Belgium)

Amanda Clare
University of Aberystwyth (Wales)

Sašo Džeroski
Jozef Stefan Institute, Ljubljana (Slovenia)

Overview

Hierarchical Multilabel Classification: task description

Predictive Clustering Trees for HMC: the algorithm Clus-HMC

Evaluation on yeast datasets

Hierarchical multilabel classification (HMC)

Classification: predict the class of unseen instances based on (classified) training examples

HMC: an instance can belong to multiple classes, and the classes are organised in a hierarchy

Advantages: efficiency, skewed class distributions, hierarchical relationships

Example: toy hierarchy

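Toy hierarchy (vector component in parentheses): top-level classes 1 (1), 2 (2) and 3 (5); class 2 has subclasses 2/1 (3) and 2/2 (4)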

Predictive clustering trees

Similar to decision trees [Blockeel et al. 1998]: each node (including leaves) is a cluster, and the tests in the nodes are descriptions of the clusters

Heuristic: minimise intra-cluster variance, maximise inter-cluster variance

Can be extended to perform HMC by instantiating a distance measure d (quantifies similarity) and a prediction function p (maps a cluster in a leaf onto a prediction)

Instantiating d

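Class labels are represented in a vector with one component per class:

v_i = [1,1,0,1,0], with components (1) (2) (3) (4) (5)

Components follow the toy hierarchy: 1 (1), 2 (2), 2/1 (3), 2/2 (4), 3 (5)

Distance between vectors is defined as the weighted Euclidean distance, where the weight of a class depends on its depth in the hierarchy:

$$d(x_1, x_2) = \sqrt{\sum_k w_k \, (v_{1,k} - v_{2,k})^2}, \qquad w_k = w^{\mathrm{depth}(c_k)}$$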

Example

Si = {1,2,2/2}, Sj = {2}

$$d_{Eucl}([1,1,0,1,0], [0,1,0,0,0]) = \sqrt{w + w^2}$$

The vectors differ in class 1 (depth 1, weight w) and in class 2/2 (depth 2, weight w²)
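The example can be checked numerically. The following is a minimal sketch, not the Clus-HMC implementation: the helper names (`depth`, `to_vector`, `weighted_distance`) and the value w = 0.75 are assumptions chosen for illustration, and top-level classes are taken to have depth 1, which is what the result sqrt(w + w²) implies.

```python
import math

# Toy hierarchy from the slides; list order fixes the vector component of each class.
CLASSES = ["1", "2", "2/1", "2/2", "3"]

def depth(label):
    """Depth of a class in the hierarchy (top-level classes have depth 1)."""
    return label.count("/") + 1

def to_vector(label_set):
    """Encode a set of class labels as a 0/1 vector over CLASSES."""
    return [1 if c in label_set else 0 for c in CLASSES]

def weighted_distance(v1, v2, w=0.75):
    """Weighted Euclidean distance with w_k = w ** depth(c_k).

    The default w = 0.75 is an illustrative assumption, not a value from the slides.
    """
    total = 0.0
    for c, a, b in zip(CLASSES, v1, v2):
        total += (w ** depth(c)) * (a - b) ** 2
    return math.sqrt(total)

s_i = to_vector({"1", "2", "2/2"})  # [1, 1, 0, 1, 0]
s_j = to_vector({"2"})              # [0, 1, 0, 0, 0]
d = weighted_distance(s_i, s_j, w=0.75)  # equals sqrt(w + w**2) for w = 0.75
```

With w < 1, disagreements deeper in the hierarchy contribute less to the distance than disagreements near the root.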

Instantiating p

Each leaf contains multiple classes (organised in a hierarchy)

Which classes to predict?
  binary classification: predict positive if the instance ends up in a leaf with at least 50% positives
  multilabel classification: class distributions are skewed, so a fixed 50% cut-off is not appropriate

Threshold: an instance ending up in some leaf is predicted to belong to class c_i if v_i ≥ t_i, with v_i the proportion of instances in the leaf belonging to c_i, and t_i some threshold

By varying the threshold, we obtain different points on the precision-recall curve
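The thresholding rule can be illustrated with a small sketch. The leaf contents below are made-up toy data, and the function names are hypothetical; the only part taken from the slides is the rule "predict c_i if v_i ≥ t_i".

```python
def leaf_proportions(leaf_vectors):
    """Proportion v_i of training instances in the leaf that belong to each class."""
    n = len(leaf_vectors)
    return [sum(column) / n for column in zip(*leaf_vectors)]

def predict(proportions, thresholds):
    """Predict class c_i whenever v_i >= t_i."""
    return [1 if v >= t else 0 for v, t in zip(proportions, thresholds)]

# Hypothetical leaf holding four training instances over the toy classes
# 1, 2, 2/1, 2/2, 3 (in that order).
leaf = [
    [1, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
]

props = leaf_proportions(leaf)       # [0.25, 0.75, 0.25, 0.25, 0.25]
strict = predict(props, [0.5] * 5)   # only class 2 reaches the 50% cut-off
lenient = predict(props, [0.2] * 5)  # a lower threshold predicts every class
```

Lowering the threshold predicts more classes, which trades precision for recall; sweeping it traces out the precision-recall curve.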

Clus-HMC algorithm

Pseudo code (figure), annotated label: stopping criterion
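The pseudo code itself is not reproduced in this text. As a rough illustration of the variance-based heuristic described for predictive clustering trees, the sketch below scores candidate splits on one numeric attribute by variance reduction. It is an assumption-laden sketch rather than the authors' algorithm: variance is taken here as the mean squared weighted distance to the cluster's mean vector, w = 0.75 is arbitrary, all function names are hypothetical, and the stopping criterion is left out.

```python
def class_weights(depths, w=0.75):
    """w_k = w ** depth(c_k); w = 0.75 is an illustrative assumption."""
    return [w ** d for d in depths]

def variance(vectors, wk):
    """Mean squared weighted distance of the label vectors to their mean vector."""
    if not vectors:
        return 0.0
    n = len(vectors)
    mean = [sum(column) / n for column in zip(*vectors)]
    return sum(
        sum(wt * (x - m) ** 2 for wt, x, m in zip(wk, v, mean)) for v in vectors
    ) / n

def variance_reduction(parent, left, right, wk):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(parent)
    return (variance(parent, wk)
            - len(left) / n * variance(left, wk)
            - len(right) / n * variance(right, wk))

def best_threshold(values, vectors, wk):
    """Try each test 'attribute <= t' and keep the one with the largest reduction."""
    best = (None, 0.0)
    for t in sorted(set(values))[:-1]:
        left = [v for x, v in zip(values, vectors) if x <= t]
        right = [v for x, v in zip(values, vectors) if x > t]
        gain = variance_reduction(vectors, left, right, wk)
        if gain > best[1]:
            best = (t, gain)
    return best

# Toy data: one numeric attribute and label vectors over classes 1, 2, 2/1, 2/2, 3.
wk = class_weights([1, 1, 2, 2, 1])
values = [0.1, 0.2, 0.8, 0.9]
vectors = [[1, 0, 0, 0, 0], [1, 0, 0, 0, 0], [0, 1, 0, 1, 0], [0, 1, 0, 1, 0]]
threshold, gain = best_threshold(values, vectors, wk)
```

On this toy data the test `attribute <= 0.2` separates the two label patterns perfectly, so both children have zero variance and the reduction equals the parent variance.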

Experiments in yeast functional genomics

Saccharomyces cerevisiae, or baker’s/brewer’s yeast

MIPS FunCat hierarchy: function of yeast genes

  1 METABOLISM
    1/1 amino acid metabolism
    1/2 nitrogen and sulfur metabolism
    …
  2 ENERGY
    2/1 glycolysis and gluconeogenesis
    …

12 data sets [Clare 2003]
  Sequence structure (seq)
  Phenotype growth (pheno)
  Secondary structure (struc)
  Homology search (hom)
  Microarray data: cellcycle, church, derisi, eisen, gasch1, gasch2, spo, expr (all)

Experimental evaluation

Objectives
  Comparison with C4.5H [Clare 2003]
  Evaluation of the improvement obtainable with HMC trees over single classification trees

Evaluation with precision-recall curves
  precision = TP / Yes = TP / (TP+FP), where Yes is the number of instances predicted positive
  recall = TP / + = TP / (TP+FN), where + is the number of actual positives
  advantages
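As a small worked example of these two measures, the sketch below computes precision and recall for one class at several thresholds. The labels and scores are invented for illustration, and returning precision 1.0 when nothing is predicted positive is a convention assumed here, not something stated on the slides.

```python
def precision_recall(y_true, y_pred):
    """precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def pr_points(y_true, scores, thresholds):
    """One (precision, recall) pair per threshold: predict positive if score >= t."""
    points = []
    for t in thresholds:
        y_pred = [1 if s >= t else 0 for s in scores]
        points.append(precision_recall(y_true, y_pred))
    return points

# Hypothetical labels and leaf proportions for a single class.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.1]
points = pr_points(y_true, scores, [0.85, 0.5, 0.2])
```

Here the highest threshold gives perfect precision at low recall, and the lowest gives full recall at reduced precision.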

Comparison with C4.5H

C4.5H = hierarchical multilabel extension of C4.5 [Clare 2003]
  Designed by Amanda Clare
  Heuristic: information gain, with an adaptation of entropy (sum over all classes)
  Prediction: most frequent set of classes + significance test

Clus-HMC method
  Tuning: run different F-tests on validation data and choose the F-test with the highest AUPRC (area under the precision-recall curve)
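The tuning step can be sketched as follows, assuming the area under the precision-recall curve is approximated with the trapezoidal rule over (recall, precision) points. The candidate names and curve values are placeholders invented for illustration; they are not results from the experiments, and the slides do not say how the area is computed.

```python
def auprc(points):
    """Trapezoidal area under a PR curve given as (recall, precision) pairs."""
    pts = sorted(points)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area

# Hypothetical validation-set PR curves, one per candidate F-test setting.
curves = {
    "candidate_a": [(0.0, 1.0), (0.5, 0.8), (1.0, 0.4)],
    "candidate_b": [(0.0, 1.0), (0.5, 0.6), (1.0, 0.3)],
}

areas = {name: auprc(pts) for name, pts in curves.items()}
best = max(areas, key=areas.get)  # setting with the highest validation AUPRC
```

The selected setting would then be used to build the final tree, which is evaluated on separate test data.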

Comparison between Clus-HMC and C4.5H: average case

Comparison between Clus-HMC and C4.5H: specific classes

25 wins (II), 6 losses (IV); I, II, III and IV label the four regions of the figure

Comparing rules

e.g. predictions for class 40/3 in the “gasch1” data set

C4.5H: two rules; Clus-HMC: most precise rule

IF 29C_Plus1M_sorbitol_to_33C_Plus_1M_sorbitol___15_minutes <= 0.03 AND
   constant_0point32_mM_H202_20_min_redo <= 0.72 AND
   1point5_mM_diamide_60_min <= -0.17 AND
   steady_state_1M_sorbitol > -0.37 AND
   DBYmsn2_4__37degree_heat___20_min <= -0.67
THEN 40/3

IF Heat_Shock_10_minutes_hs_1 <= 1.82 AND
   Heat_Shock_030inutes__hs_2 <= -0.48 AND
   29C_Plus1M_sorbitol_to_33C_Plus_1M_sorbitol___5_minutes > -0.1
THEN 40/3

IF Nitrogen_Depletion_8_h <= -2.74 AND
   Nitrogen_Depletion_2_h > -1.94 AND
   1point5_mM_diamide_5_min > -0.03 AND
   1M_sorbitol___45_min_ > -0.36 AND
   37C_to_25C_shock___60_min > 1.28
THEN 40/3

Precision: 0.52, Recall: 0.26
Precision: 0.56, Recall: 0.18
Precision: 0.97, Recall: 0.15

HMC vs. single classification: method, average case

HMC vs. single classification: specific classes

Numbers are AUPRC(Clus-HMC) – AUPRC(Clus-SC)

HMC performs better!

Conclusions

Use of precision-recall curves to optimize the learned models and to evaluate the results

Improvement over C4.5H

HMC compared to SC
  Comparable predictive performance
  Faster
  Easier to interpret

References

Hendrik Blockeel, Luc De Raedt, Jan Ramon, Top-down induction of clustering trees (1998)

Amanda Clare, Machine learning and data mining for yeast functional genomics, Doctoral dissertation (2003)

Jan Struyf, Sašo Džeroski, Hendrik Blockeel, Amanda Clare, Hierarchical multi-classification with predictive clustering trees in functional genomics (2005)

Questions?
