machine learning methods for gene/protein function...

28
G.Valentini, DI - Univ. Milano 1 Giorgio Valentini [email protected] Machine learning methods Machine learning methods for gene/protein function prediction for gene/protein function prediction DI - Dipartimento di Informatica Università degli Studi di Milano

Upload: others

Post on 09-Jun-2020

11 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 1

Giorgio [email protected]

Machine learning methods Machine learning methods for gene/protein function predictionfor gene/protein function prediction

DI - Dipartimento di Informatica

Università degli Studi di Milano

Page 2: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 2

Outline

• Gene Function Prediction (GFP)

• The Gene Ontology and the FunCat

• Characteristics of the GFP problem

• Computational approaches to GFP

• Machine learning methods for GFP

Page 3: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 3

Gene function prediction

Data about genes

Predictor Gene functions

Gene function prediction can be formalized as a supervised machine learning problem

Page 4: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 4

Motivation

• Novel high-throughput biotechnologies accumulated a wealth of data about genes and gene products

• Manual annotation of gene function is time consuming and expensive and becomes infeasible for growing amount of data.

• For most species the functions of several genes are unknown or only partially known: “in silico” methodsrepresent a fundamental tool for gene function prediction at genome-wide and ontology-wide level (Friedberg, 2006).

• Computational analysis provide predictions that can be considered hypotheses to drive the biological validation of gene function (Pena-Castillo et al. 2008).

Page 5: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 5

Computational prediction supports biological gene function prediction

Biological genome-wide gene function prediction

through direct experimental assays is costly and time-

consuming

Computational prediction methods

Computational prediction methods assist the biologist to:

• Suggest a restricted set of candidate functions that can be experimentally verified

• Directly generate new hypotheses

• Guide the exploration of promising hypotheses

Page 6: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 6

Characteristics of the Gene Function Prediction (GFP) problem

• Large number of functional classes: hundreds (FunCat) or thousands (Gene Ontology (GO)) : large multi-class classification

• Multiple annotations for each gene: multilabel classification• Different level of evidence for functional annotations: labels at different

level of reliability• Hierarchical relationships between functional classes (tree forest for FunCat,

direct acyclic graph for GO): hierarchical relationships between classes (structured output)

• Class frequencies are unbalanced, with positive examples usually largely lower than negatives: unbalanced classification

• The notion of “negative example” is not univocally determined: different strategies to choose negative examples

• Multiple sources of data available: each type captures specific functional characteristics of genes/gene products: multi-source classification

• Data are usually complex (e.g. high-dimensional) and noisy: classification with complex and noisy data

Page 7: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 7

Taxonomies of gene function

1. Gene Ontology (GO)

http://www.geneontology.org/

Fine grained: classes structured according to a directed

acyclic graph

2. Functional Catalogue (FunCat)

http://www.helmholtz-muenchen.de/en/mips/projects/funcat/

Coarse grained: classes structured according to a tree

Page 8: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 8

The Gene Ontology

The Gene Ontology (GO) project began as a collaboration between three model organism databases, FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD), in 1998. Now it includes several of the world's major repositories for plant, animal and microbial genomes.

The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner

Page 9: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 9

The Gene Ontology (GO) is actually three Ontologies

2) Biological ProcessGO term: tricarboxylic acid cycleSynonym: Krebs cycleSynonym: citric acid cycleGO id: GO:0006099

3) Cellular ComponentGO term: mitochondrionGO id: GO:0005739Definition: A semiautonomous, self replicating organelle that occurs in varying numbers, shapes, and sizes in the cytoplasm of virtually all eukaryotic cells. It is notably the site of tissue respiration.

1) Molecular FunctionGO term: Malate dehydrogenase activityGO id: GO:0030060(S)-malate + NAD(+) = oxaloacetate + NADH.

H

O

H

O

O

H

O

H

O

H

H

O

O

H

O

H

O

H

H

O

NAD+NADH + H+

(Slide downloaded from www.geneontology.org)

Page 10: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 10

Relationships between GO terms

are structured according to a

DAG

Page 11: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 11

GO D GO DAG of the BP ontology (S. cerevisiae)

1074 GO classes (nodes) connected by 1804 edges

Graph realized through HCGene (Valentini, Cesa-Bianchi, Bioinformatics 24(5), 2008)

Page 12: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 12

The Functional Catalogue (FunCat)http://www.helmholtz-muenchen.de/en/mips/projects/funcat

Page 13: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 13

The Functional Catalogue (FunCat)http://www.helmholtz-muenchen.de/en/mips/projects/funcat

• The Functional Catalogue is an annotation scheme for the functional description of proteins of prokaryotic and eukaryotic origin

• Hierarchical tree like structure.• Up to six levels of increasing specificity. FunCat version 2.1

includes 1362 functional categories. • FunCat descriptive, but compact: classifies protein functions not

down to the most specific level. • Comparable to parts of the ‘Molecular Function’ and ‘Biological

Process’ terms of the GO system. • More compact and stable than GO, focuses on the functional

process not describing the molecular function on the atomic level

Page 14: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 14

Computational approaches to GFP

A very schematic taxonomy of computational GFP

methods:

• Inference and annotation transfer through

sequence similarity (BLAST)

• Network-based methods

• Kernel methods for structured output spaces

• Hierarchical ensemble methods

Page 15: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 15

Biological networks

S. Cerevisiae4389 proteins14319 interactions

Page 16: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 16

A network-based approach

From: Sharan et al. Mol. Sys. Biol. 2007

Page 17: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 17

Network based methods: predicting a specific functional term

Page 18: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 18

Several available methods:• Guilt by association (Marcotte et al. 1999, Oliver et al. 2000)• Label propagation (Zhu and Ghahramani, 2003, Zhou et al. 2004)• Markov random walks (Szummer and Jaakkola, 2002, Azran et al 2007)• Markov random fields (Deng et al. 2004)• Graph regularization techniques (Belkin et al. 2004, Dellaleu et al 2005)• Gaussian random fields (Tsuda et al. 2005, Mostafavi et al. 2010)• Hopfield networks (Karaoz et al. 2004, Bertoni et al. 2011)

These different approaches minimize a similar quadratic criterion to improve:a) Consistency of the initial labelingb) Topological consistency of the data

They exploit different types of relational data: physical and genetic interactions, similarities between protein domains or motifs, structural and sequence homologies, correlations between expression profiels, …

- need for network integration algorithms

Network-based methods

Page 19: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 19

Kernel methodsKernel methods are largely applied to classification problems:

f ( x )=wT ϕ( x )

1. Obtaining a non-linear classifier, through a non-linear mapping into the feature space, using an algorithm designed for linear discrimination :

w=∑i

α iϕ( x i)⇒ f ( x )=∑i

αiϕ( x i )Tϕ( x )

3. The discriminant function can be expressed through a suitable kernel function:

f ( x )=∑i

αi K ( x i , x )

2. Whenever w can be expressed as a weighted sum over the images of the input examples:

Page 20: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 20

Kernel metods for binary classification problems

Non linear

kernel mapping

Original input space Transformed feature spaceϕ

Page 21: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 21

Kernel methods for structured output spaces

A binary classier can predict whether a protein performs a certain function:

f : X →Y i Y i= {0,1 }

How to predict the full hierarchical annotation ?y= { y1 , y2 , .. . , yk }

1≤ i≤k

The main idea: using a kernel for structured output, that is a function:

f : X ×Y →ℜThis classification rule chooses the label y that is most compatible with an input x.

Whereas in two-class classification problems the kernel depends only on the input (proteins), in the structured-output setting it is a joint function of inputs and outputs (set of the labels)

Page 22: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 22

Structured output kernel methods for gene function prediction

• Sokolov and Ben-Hur (2010): a structured Perceptron,

and a variant of the structured support vector machine

(Tsochantaridis et al. 2005), applied to the the prediction

of GO terms in mouse and other model organisms

• Astikainen et al. (2008) and Rousu et al. (2006): Structured

output maximum-margin algorithms applied to the tree-

structured prediction of enzyme functions

Page 23: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 23

Hierarchical ensemble methodsThey are in general characterized by a two-step strategy:1. Flat learning of the protein function on a per-term basis (a

set of independent classification problems)2. Combination of the predictions by exploiting the

relationships between terms that govern the hierarchy of the functional classes.

The term ensemble raises from the fact that a set of learning machines in someway combine their output.

In principle any supervised learning algorithm can be used for step 1.

Step 2 requires a proper combination of the predictions made at step 1.

Page 24: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 24

Hierarchical ensemble methods

• Bayesian network-based ensembles (Barutcuoglu et al.

2006, Guan et al. 2008)

• Hierarchical renconciliation methods (Obozinski et al. 2008)

• Hierarchical Bayesian cost-sensitive ensembles (Cesa-Bianchi and Valentini, 2010)

• True Path Rule Ensembles (Valentini, 2011)

• Hierarchical decision trees (Vens et al. 2008, Schietgat et al 2010)

Page 25: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 25

References (1)Astikainen, K., Holm, L., Pitkanen, E., Szedmak, S., and Rousu, J. (2008). Towards structured output prediction of

enzyme function. BMC Proceedings, 2(Suppl 4:S2).

Barutcuoglu, Z., Schapire, R., and Troyanskaya, O. (2006). Hierarchical multi-label prediction of gene function. Bioinformatics, 22(7), 830–836

Belkin, M, Matveeva, I, Niyogi, P. (2004) Regularization and semi-supervised learning on large graphs. In COLT

Bengio, Y., Delalleau, O., and Le Roux, N. (2006). Label Propagation and Quadratic Criterion. In O. Chapelle, B. Scholkopf, and A. Zien, editors, Semi-Supervised Learning, pages 193–216. MIT Press.

Bertoni, A., Frasca, M., Valentini G. (2011) COSNet: a Cost Sensitive Neural Network for Semi-supervised Learning in Graphs., European Conference on Machine Learning 2011, Athens, Lecture Notes in Computer Science, Springer

Cesa-Bianchi, N. and Valentini, G. (2010). Hierarchical cost-sensitive algorithms for genome-wide gene function prediction. Journal of Machine Learning Research, W&C Proceedings, Machine Learning in Systems Biology, 8, 14–29.

Cesa-Bianchi, N., Re, M., and Valentini, G. (2010). Functional inference in FunCat through the combination of hierarchical ensembles with data fusion methods. In ICML-MLD 2nd International Workshop on learning from Multi-Label Data, pages 13–20, Haifa, Israel.

Delalleau, O., Bengio, Y, Le oux, N (2005) Efficient non-parametric function induction in semi-supervised learning. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005.

Deng, M., Chen, T., and Sun, F. (2004). An integrated probabilistic model for functional prediction of proteins. J. Comput. Biol., 11, 463–475.

Friedberg, I. (2006). Automated protein function prediction-the genomic challenge. Brief. Bioinformatics, 7, 225–242

Guan, Y., Myers, C., Hess, D., Barutcuoglu, Z., Caudy, A., and Troyanskaya, O. (2008). Predicting gene function in a hierarchical context with an ensemble of classifiers. Genome Biology, 9(S2).

Karaoz, U. et al. (2004). Whole-genome annotation by using evidence integration in functional-linkage networks. Proc. Natl Acad. Sci. USA, 101, 2888–2893.

Page 26: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 26

References (1)Astikainen, K., Holm, L., Pitkanen, E., Szedmak, S., and Rousu, J. (2008). Towards structured output prediction of

enzyme function. BMC Proceedings, 2(Suppl 4:S2).

Barutcuoglu, Z., Schapire, R., and Troyanskaya, O. (2006). Hierarchical multi-label prediction of gene function. Bioinformatics, 22(7), 830–836

Belkin, M, Matveeva, I, Niyogi, P. (2004) Regularization and semi-supervised learning on large graphs. In COLT

Bengio, Y., Delalleau, O., and Le Roux, N. (2006). Label Propagation and Quadratic Criterion. In O. Chapelle, B. Scholkopf, and A. Zien, editors, Semi-Supervised Learning, pages 193–216. MIT Press.

Bertoni, A., Frasca, M., Valentini G. (2011) COSNet: a Cost Sensitive Neural Network for Semi-supervised Learning in Graphs., European Conference on Machine Learning 2011, Athens, Lecture Notes in Computer Science, Springer

Cesa-Bianchi, N. and Valentini, G. (2010). Hierarchical cost-sensitive algorithms for genome-wide gene function prediction. Journal of Machine Learning Research, W&C Proceedings, Machine Learning in Systems Biology, 8, 14–29.

Cesa-Bianchi, N., Re, M., and Valentini, G. (2010). Functional inference in FunCat through the combination of hierarchical ensembles with data fusion methods. In ICML-MLD 2nd International Workshop on learning from Multi-Label Data, pages 13–20, Haifa, Israel.

Cesa-Bianchi, N., Re, M., and Valentini, G. (2012) Synergy of multi-label hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference, Machine Learning, vol.88(1), pp. 209-241, 2012

Delalleau, O., Bengio, Y, Le oux, N (2005) Efficient non-parametric function induction in semi-supervised learning. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, 2005.

Deng, M., Chen, T., and Sun, F. (2004). An integrated probabilistic model for functional prediction of proteins. J. Comput. Biol., 11, 463–475.

Friedberg, I. (2006). Automated protein function prediction-the genomic challenge. Brief. Bioinformatics, 7, 225–242

Guan, Y., Myers, C., Hess, D., Barutcuoglu, Z., Caudy, A., and Troyanskaya, O. (2008). Predicting gene function in a hierarchical context with an ensemble of classifiers. Genome Biology, 9(S2).

Karaoz, U. et al. (2004). Whole-genome annotation by using evidence integration in functional-linkage networks. Proc. Natl Acad. Sci. USA, 101, 2888–2893.

Page 27: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 27

References (2)Marcotte, E., Pellegrini, M., Thompson, M., Yeates, T., and Eisenberg, D. (1999). A combined algorithm for

genome-wide prediction of protein function. Nature, 402, 83–86.

Mostafavi, S. and Morris, Q. (2010). Fast integration of heterogeneous data sources for predicting gene function with limited annotation. Bioinformatics, 26(14), 1759–1765.

Mostafavi, S., Ray, D.,Warde-Farley, D., Grouios, C., and Morris, Q. (2008). GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biology, 9(S4).

Obozinski, G., Lanckriet, G., Grant, C., M., J., and Noble, W. (2008). Consistent probabilistic output for protein function prediction. Genome Biology, 9(S6).

Oliver, S. (2000). Guilt-by-association goes global. Nature, 403, 601–603.

Pavlidis, P.,Weston, J., Cai, J., and Noble,W. (2002). Learning gene functional classification from multiple data. J. Comput. Biol., 9, 401–411.

Pena-Ca stillo, L., et al. (2008): A critical assessment of Mus musculus gene function prediction using integrated genomic evidence. Genome Biology 9 S1

Re, M. and Valentini, G. (2010). Simple ensemble methods are competitive with state-of-the-art data integration methods for gene function prediction. Journal of Machine Learning Research, W&C Proceedings, Machine Learning in Systems Biology, 8, 98–111.

Rousu, J., Saunders, C., Szedmak, S., and Shawe-Taylor, J. (2006). Kernel-based learning of hierarchical multilabel classification models. Journal of Machine Learning Research, 7, 1601–1626.

Schietgat, L., Vens, C., Struyf, J., Blockeel, H., and Dzeroski, S. (2010). Predicting gene function using hierarchical multilabel decision tree ensembles. BMC Bioinformatics, 11(2).

Sharan, R. Ulitsky, I.Shamir, R. (2007) Network-based prediction of protein function, Molecular Systems Biology 3:88

Sokolov, A. and Ben-Hur, A. (2010). Hierarchical classification of Gene Ontology terms using the GOstruct method. Journal of Bioinformatics and Computational Biology, 8(2), 357–376.

Page 28: Machine learning methods for gene/protein function predictionhomes.di.unimi.it/valentini/MB201213/slide/MLMethodsGFP.pdf · Learning in Graphs., European Conference on Machine Learning

G.Valentini, DI - Univ. Milano 28

References (3)

Szummer, M Jaakkola, T. (2001) Partially labeled classication with markov random walks. In NIPS, volume 14.

Tsochantaridis, I., Joachims, T., Hoffman, T., and Altun, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453–1484.

Tsuda, K., Shin, H., and Scholkopf, B. (2005). Fast protein classification with multiple networks. Bioinformatics, 21(Suppl 2), ii59–ii65.

Valentini, G. and Cesa-Bianchi, N. (2008). Hcgene: a software tool to support the hierarchical classification of genes. Bioinformatics, 24(5), 729–731.

Valentini, G. (2011), True Path Rule hierarchical ensembles for genome-wide gene function prediction, IEEE ACM Transactions on Computational Biology and Bioinformatics, 8(3), 832-847.

Vazquez, A., Flammini, A., Maritan, A., and Vespignani, A. (2003). Global protein function prediction from protein-protein interaction networks. Nature Biotechnology, 21, 697–700.

Vens, C., Struyf, J., Schietgat, L., Dzeroski, S., and Blockeel, H. (2008). Decision trees for hierarchical multi-label classification. Machine Learning, 73(2), 185–214.

Zhou, D., et al. (2004) Learning with local and global consistency. In NIPS, volume 16

Zhu, X. Ghahramani, Z. , Laerty J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In ICML.