Decision trees for hierarchical multilabel classification: A case study in functional genomics
Work by
Hendrik Blockeel, Leander Schietgat, Jan Struyf
Katholieke Universiteit Leuven (Belgium)
Amanda Clare
University of Aberystwyth (Wales)
Sašo Džeroski
Jozef Stefan Institute, Ljubljana (Slovenia)
Overview
Hierarchical multilabel classification: task description
Predictive clustering trees for HMC: the Clus-HMC algorithm
Evaluation on yeast data sets
Hierarchical multilabel classification (HMC)
Classification: predict the class of unseen instances based on (classified) training examples
HMC:
  an instance can belong to multiple classes
  classes are organised in a hierarchy
Advantages: efficiency, skewed class distributions, hierarchical relationships
Example toy hierarchy (vector position in parentheses):
  1 (1)
  2 (2)
    2/1 (3)
    2/2 (4)
  3 (5)
Predictive clustering trees
Similar to decision trees [Blockeel et al. 1998]:
  each node (including leaves) is a cluster
  tests in nodes are descriptions of clusters
Heuristic:
  minimise intra-cluster variance
  maximise inter-cluster variance
Can be extended to perform HMC by instantiating:
  a distance measure d (quantifies similarity)
  a prediction function p (maps the cluster in a leaf onto a prediction)
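The variance heuristic can be illustrated with a small sketch. It assumes that the variance of a cluster is the mean squared Euclidean distance of its label vectors to the cluster mean, and that a candidate split is scored by the drop in size-weighted variance. The function names and the toy data are hypothetical and are not taken from the Clus software.

```python
def cluster_variance(vectors):
    """Mean squared Euclidean distance of the vectors to their mean (assumed definition)."""
    if not vectors:
        return 0.0
    n = len(vectors)
    dims = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(dims)]
    return sum(sum((v[k] - mean[k]) ** 2 for k in range(dims)) for v in vectors) / n


def variance_reduction(parent, left, right):
    """Parent variance minus the size-weighted variance of the two children."""
    weighted = (len(left) * cluster_variance(left)
                + len(right) * cluster_variance(right)) / len(parent)
    return cluster_variance(parent) - weighted


# Toy label vectors: six instances (rows), two classes (columns).
parent = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]

# A split that separates the two groups, and one that mixes them.
good_split = variance_reduction(parent, parent[:3], parent[3:])
bad_split = variance_reduction(parent, parent[::2], parent[1::2])
print(good_split, bad_split)
```

A split that produces homogeneous children removes all of the variance, while a split that mixes the groups removes very little, so the first test would be preferred.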
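Instantiating d
Class labels are represented as a binary vector with one component per class, in the order (1) to (5) of the toy hierarchy:
  $v_i = [1,1,0,1,0]$
The distance between two instances is defined as the weighted Euclidean distance between their class vectors:
  $d(x_1, x_2) = \sqrt{\sum_k w_k \, (v_{1,k} - v_{2,k})^2}$
with $w_k = w^{\mathrm{depth}(c_k)}$, so the weight of a class depends on its depth in the hierarchy.
Example
  $S_i = \{1, 2, 2/2\}$, $S_j = \{2\}$
  $d_{Eucl}([1,1,0,1,0], [0,1,0,0,0]) = \sqrt{w + w^2}$
The two vectors differ in class 1 (depth 1, weight $w$) and in class 2/2 (depth 2, weight $w^2$).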
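The worked example can be checked numerically. The sketch below encodes the toy hierarchy, builds the class vectors and evaluates the depth-weighted distance. The value w = 0.75 is an arbitrary choice made only for illustration, and the helper names are hypothetical.

```python
import math

# Toy hierarchy from the slides, in vector order (1) to (5).
CLASSES = ["1", "2", "2/1", "2/2", "3"]


def depth(label):
    """Top-level classes have depth 1, their children depth 2, and so on."""
    return label.count("/") + 1


def to_vector(label_set):
    """Binary class vector with one component per class in CLASSES."""
    return [1 if c in label_set else 0 for c in CLASSES]


def weighted_distance(set_a, set_b, w):
    """Euclidean distance with component weights w ** depth(class)."""
    va, vb = to_vector(set_a), to_vector(set_b)
    total = sum((w ** depth(c)) * (a - b) ** 2
                for c, a, b in zip(CLASSES, va, vb))
    return math.sqrt(total)


w = 0.75  # assumed value, chosen only for this example
s_i = {"1", "2", "2/2"}
s_j = {"2"}

d = weighted_distance(s_i, s_j, w)
print(to_vector(s_i), to_vector(s_j), d)
```

With any w between 0 and 1, a disagreement deeper in the hierarchy contributes less to the distance than a disagreement at the top level.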
Instantiating p
Each leaf contains multiple classes (organised in a hierarchy).
Which classes to predict?
  binary classification: predict positive if the instance ends up in a leaf with at least 50% positives
  multilabel classification: class distributions are skewed, so a fixed 50% cut-off is not suitable
Threshold
  an instance ending up in some leaf is predicted to belong to class $c_i$ if $v_i \geq t_i$, with $v_i$ the proportion of instances in the leaf belonging to $c_i$, and $t_i$ some threshold
  by varying the threshold, we obtain different points on the precision-recall curve
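The threshold rule can be sketched as follows. The leaf contents are made-up label vectors over the five classes of the toy hierarchy, and the same threshold is used for every class only to keep the example short; the function names are hypothetical.

```python
def leaf_proportions(leaf_vectors):
    """Proportion of training instances in the leaf that belong to each class."""
    n = len(leaf_vectors)
    dims = len(leaf_vectors[0])
    return [sum(v[k] for v in leaf_vectors) / n for k in range(dims)]


def predict(proportions, thresholds):
    """Predict class i whenever its leaf proportion reaches its threshold."""
    return [1 if v >= t else 0 for v, t in zip(proportions, thresholds)]


# Hypothetical leaf with four training instances (rows) and five classes (columns).
leaf = [
    [1, 1, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0],
]

props = leaf_proportions(leaf)
strict = predict(props, [0.5] * 5)   # high threshold: fewer classes predicted
lenient = predict(props, [0.2] * 5)  # low threshold: more classes predicted
print(props, strict, lenient)
```

Lowering the threshold predicts more classes per instance, which typically raises recall and lowers precision. Sweeping the threshold therefore traces out the precision-recall curve.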
Clus-HMC algorithm
[Figure: pseudo code of Clus-HMC, with a callout marking the stopping criterion]
Experiments in yeast functional genomics
Saccharomyces cerevisiae, or baker’s/brewer’s yeast
MIPS FunCat hierarchy: functions of yeast genes
  1 METABOLISM
    1/1 amino acid metabolism
    1/2 nitrogen and sulfur metabolism
    …
  2 ENERGY
    2/1 glycolysis and gluconeogenesis
    …
12 data sets [Clare 2003]:
  Sequence structure (seq)
  Phenotype growth (pheno)
  Secondary structure (struc)
  Homology search (hom)
  Microarray data: cellcycle, church, derisi, eisen, gasch1, gasch2, spo, expr (all)
Experimental evaluation
Objectives
  Comparison with C4.5H [Clare 2003]
  Evaluation of the improvement obtainable with HMC trees over single classification trees
Evaluation with precision-recall curves
  precision: $TP / \text{predicted positives} = TP / (TP + FP)$
  recall: $TP / \text{actual positives} = TP / (TP + FN)$
  advantages
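The two definitions can be written out as a short sketch. The prediction and label lists are invented for illustration. Returning a precision of 1.0 when nothing is predicted positive is a convention assumed here, not something stated on the slides.

```python
def precision_recall(predicted, actual):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN) for one class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    # Assumed conventions for the degenerate cases with an empty denominator.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical predictions and true labels for eight instances.
predicted = [1, 1, 1, 0, 0, 0, 1, 0]
actual = [1, 0, 1, 1, 0, 0, 1, 1]

precision, recall = precision_recall(predicted, actual)
print(precision, recall)
```

Neither quantity counts true negatives, so a class with very few positive instances cannot score well merely because most instances are correctly predicted negative.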
Comparison with C4.5H
C4.5H: hierarchical multilabel extension of C4.5, designed by Amanda Clare [Clare 2003]
  Heuristic: information gain, using an adaptation of entropy (sum over all classes)
  Prediction: most frequent set of classes + significance test
Clus-HMC method
  Tuning: different F-tests on validation data, choose the F-test with the highest AUPRC (area under the precision-recall curve)
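The tuning step can be sketched as a model selection loop over AUPRC. The sketch below is a simplification. It scores a single class, builds the curve by using every distinct score as a threshold, and integrates with the trapezoidal rule, which is only an approximation in precision-recall space. The candidate names and validation scores are hypothetical and do not come from the experiments.

```python
def pr_points(scores, actual):
    """(recall, precision) points, one per distinct score used as threshold."""
    positives = sum(actual)
    points = []
    for t in sorted(set(scores), reverse=True):
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, a in zip(predicted, actual) if p and a)
        points.append((tp / positives, tp / sum(predicted)))
    return points


def auprc(points):
    """Trapezoidal area under the curve, anchored at recall 0 (assumed convention)."""
    pts = [(0.0, points[0][1])] + points
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        area += (r1 - r0) * (p0 + p1) / 2
    return area


# Hypothetical validation labels and per-candidate scores for one class.
actual = [1, 1, 0, 1, 0, 0]
candidates = {
    "setting_A": [0.9, 0.8, 0.7, 0.6, 0.2, 0.1],
    "setting_B": [0.4, 0.9, 0.8, 0.3, 0.7, 0.1],
}

areas = {name: auprc(pr_points(s, actual)) for name, s in candidates.items()}
best = max(areas, key=areas.get)
print(areas, best)
```

The candidate that ranks the positive instances higher gets the larger area and is the one kept.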
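Comparison between Clus-HMC and C4.5H: average case
[Figure: Clus-HMC vs. C4.5H, average case]
Comparison between Clus-HMC and C4.5H: specific classes
25 wins (quadrant II), 6 losses (quadrant IV)
[Figure: Clus-HMC vs. C4.5H per class, divided into quadrants I, II, III and IV]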
Comparing rules
e.g. predictions for class 40/3 in the “gasch1” data set
C4.5H: two rules
Clus-HMC: most precise rule
IF 29C_Plus1M_sorbitol_to_33C_Plus_1M_sorbitol___15_minutes <= 0.03 AND
constant_0point32_mM_H202_20_min_redo <= 0.72 AND
1point5_mM_diamide_60_min <= -0.17 AND
steady_state_1M_sorbitol > -0.37 AND
DBYmsn2_4__37degree_heat___20_min <= -0.67
THEN 40/3
IF Heat_Shock_10_minutes_hs_1 <= 1.82 AND
Heat_Shock_030inutes__hs_2 <= -0.48 AND
29C_Plus1M_sorbitol_to_33C_Plus_1M_sorbitol___5_minutes > -0.1
THEN 40/3
IF Nitrogen_Depletion_8_h <= -2.74 AND
Nitrogen_Depletion_2_h > -1.94 AND
1point5_mM_diamide_5_min > -0.03 AND
1M_sorbitol___45_min_ > -0.36 AND
37C_to_25C_shock___60_min > 1.28
THEN 40/3
Precision: 0.52, Recall: 0.26
Precision: 0.56, Recall: 0.18
Precision: 0.97, Recall: 0.15
HMC vs. single classification
Method
Average case
HMC vs. single classification: specific classes
Numbers are AUPRC(Clus-HMC) - AUPRC(Clus-SC)
HMC performs better!
Conclusions
Use of precision-recall curves to optimise the learned models and to evaluate the results
Improvement over C4.5H
HMC compared to SC:
  comparable predictive performance
  faster
  easier to interpret
References
Hendrik Blockeel, Luc De Raedt, Jan Ramon, Top-down induction of clustering trees (1998)
Amanda Clare, Machine learning and data mining for yeast functional genomics, Doctoral dissertation (2003)
Jan Struyf, Sašo Džeroski, Hendrik Blockeel, Amanda Clare, Hierarchical multi-classification with predictive clustering trees in functional genomics (2005)
Questions?