Some contributions of machine learning to bioinformatics

Jean-Philippe Vert
Mines ParisTech / Institut Curie / Inserm

Université de Provence, Marseille, France, March 10, 2009.



Where I come from

A joint lab about “Cancer computational genomics, bioinformatics, biostatistics and epidemiology”
Located in the Institut Curie, a major hospital and cancer research institute in Europe


“Statistical machine learning for cancer informatics” team

Main topics

Towards better diagnosis, prognosis, and personalized medicine
Supervised classification of genomic, transcriptomic, proteomic data; heterogeneous data integration

Towards new drug targets
Systems biology, reconstruction of gene networks, pathway enrichment analysis, multidimensional phenotyping of cell populations.

Towards new drugs
Ligand-based virtual screening, in silico chemogenomics.


Towards personalized medicine: Diagnosis/prognosis from genome/transcriptome

From Golub et al., Science, 1999.

Towards new drug targets: Inference of biological networks

From Mordelet and Vert, Bioinformatics, 2008.

Towards new drugs: Ligand-Based Virtual Screening and QSAR

[Figure: molecules labeled active or inactive.]

NCI AIDS screen results (from http://cactus.nci.nih.gov).

Pattern recognition, aka supervised classification


Challenges
High dimension
Few samples
Structured data
Prior knowledge
Fast and scalable implementations


Outline

1 Supervised classification of genomic data

2 Inference of biological networks

3 Virtual screening and chemogenomics

4 Conclusion


1 Supervised classification of genomic data

Motivation

Goal
Design a classifier to automatically assign a class to future samples from their expression profile
Interpret biologically the differences between the classes

Difficulty
Large dimension
Few samples


Linear classifiers

The model
Each sample is represented by a vector $x = (x_1, \ldots, x_p)$
Goal: estimate a linear function:

$$f_\beta(x) = \sum_{i=1}^{p} \beta_i x_i + \beta_0 .$$

Interpretability: the weight $\beta_i$ quantifies the influence of feature $i$ (but...)


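Training the model

$$f_\beta(x) = \sum_{i=1}^{p} \beta_i x_i + \beta_0 .$$

Minimize an empirical risk on the training samples:

$$\min_{\beta \in \mathbb{R}^{p+1}} R_{\mathrm{emp}}(\beta) = \frac{1}{n} \sum_{i=1}^{n} l(f_\beta(x_i), y_i) ,$$

... subject to some constraint on β, e.g.:

$$\Omega(\beta) \le C .$$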
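A minimal sketch of this training recipe, under assumptions that are not in the slides: the loss l is taken to be the logistic loss, the penalty is Ω(β) = ‖β‖², and the constraint Ω(β) ≤ C is handled in its penalized (Lagrangian) form with a weight `lam`. The function name, step size and toy data are illustrative only.

```python
import numpy as np

def fit_linear_classifier(X, y, lam=0.1, step=0.1, n_iter=2000):
    """Gradient descent on (1/n) sum_i log(1 + exp(-y_i f(x_i))) + lam * ||beta||^2.

    X: (n, p) array, y: (n,) array with labels in {-1, +1}.
    Returns (beta, beta0); the intercept beta0 is not penalized (an assumption).
    """
    n, p = X.shape
    beta, beta0 = np.zeros(p), 0.0
    for _ in range(n_iter):
        margins = y * (X @ beta + beta0)
        # derivative of the logistic loss with respect to f(x_i)
        g = -y / (1.0 + np.exp(margins))
        beta -= step * (X.T @ g / n + 2.0 * lam * beta)
        beta0 -= step * g.mean()
    return beta, beta0

# Toy data: only the first of 5 features carries the class signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=40))
beta, beta0 = fit_linear_classifier(X, y)
train_accuracy = np.mean(np.sign(X @ beta + beta0) == y)
```

Larger values of `lam` correspond to smaller values of C in the constrained formulation: the weights shrink and the fit to the training samples loosens.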


Example: Norm Constraints

The approach
A common method in statistics to learn with few samples in high dimension is to constrain the Euclidean norm of β

$$\Omega_{\mathrm{ridge}}(\beta) = \|\beta\|_2^2 = \sum_{i=1}^{p} \beta_i^2 ,$$

(ridge regression, support vector machines...)

Pros
Good performance in classification

Cons
Limited interpretation (small weights)
No prior biological knowledge


Example: Feature Selection

The approach
Constrain most weights to be 0, i.e., select a few genes (< 100) whose expression is sufficient for classification.

Greedy feature selection (t-tests, ...)
Constrain the norm of β: LASSO penalty ($\|\beta\|_1 = \sum_{i=1}^{p} |\beta_i|$), elastic net penalty ($\|\beta\|_1 + \|\beta\|_2$), ...

Pros
Good performance in classification
Biomarker selection
Interpretability

Cons
The gene selection process is usually not robust
No use of prior biological knowledge
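To make the LASSO option concrete, here is a small sketch of a proximal gradient (ISTA) solver. It assumes a squared loss and the penalized form of the problem, which is a simplification of the classification setting discussed in the talk; the function names, the value of `lam` and the simulated data are all illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward 0 by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/(2n)) ||y - X beta||^2 + lam * ||beta||_1 by proximal gradient."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Few samples, many features, only three of which matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 200))
true_beta = np.zeros(200)
true_beta[:3] = [2.0, -3.0, 1.5]
y = X @ true_beta + 0.1 * rng.normal(size=30)
beta_hat = lasso_ista(X, y, lam=0.5)
n_selected = int(np.count_nonzero(beta_hat))
```

Because the soft-thresholding step sets small coordinates exactly to zero, `beta_hat` is sparse; which features survive can change noticeably when the samples are resampled, which is the robustness issue listed under Cons.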


Why LASSO leads to sparse solutions
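The figure for this slide did not survive in the transcript. As a stand-in, here is a one-dimensional worked comparison. It is an illustration added here, not the slide's own argument, written in penalized rather than constrained form, with $z$ the unpenalized solution and $\lambda > 0$ an assumed penalty weight:

```latex
\hat\beta_{\mathrm{ridge}}
  = \arg\min_{\beta} \tfrac{1}{2}(\beta - z)^2 + \tfrac{\lambda}{2}\beta^2
  = \frac{z}{1+\lambda},
\qquad
\hat\beta_{\mathrm{lasso}}
  = \arg\min_{\beta} \tfrac{1}{2}(\beta - z)^2 + \lambda|\beta|
  = \operatorname{sign}(z)\,\max(|z| - \lambda,\, 0).
```

The ridge solution only rescales $z$ and vanishes only when $z = 0$, whereas the LASSO solution is exactly zero whenever $|z| \le \lambda$. Geometrically, this is the effect of the corners of the $\ell_1$ ball.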


Incorporating prior knowledge

The idea
If we have specific prior knowledge about the “correct” weights, it can be included in Ω in the constraint:

Minimize $R_{\mathrm{emp}}(\beta)$ subject to $\Omega(\beta) \le C$.

If we design a convex function Ω, then the algorithm boils down to a convex optimization problem (usually easy to solve).
Similar to priors in Bayesian statistics


Example: CGH array classification

Motivation
Comparative genomic hybridization (CGH) data measure the DNA copy number along the genome
Very useful, in particular in cancer research
Can we classify CGH arrays for diagnosis or prognosis purposes?


Prior knowledge
Let x be a CGH profile
We focus on linear classifiers, i.e., the sign of:

$$f(x) = x^\top \beta .$$

We expect β to be
sparse: only a few positions should be discriminative
piecewise constant: within a region, all probes should contribute equally


A solution (Rapaport et al., 2008)

$$\Omega_{\mathrm{fusedlasso}}(\beta) = \sum_{i} |\beta_i| + \sum_{i \sim j} |\beta_i - \beta_j| .$$

Good performance on diagnosis for bladder cancer, and prognosis for melanoma.
More interpretable classifiers

[Figure: two panels plotting classifier weight (vertical axis, −1 to 1) against BAC index (horizontal axis, 0 to 4000).]
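The effect of the fused lasso penalty can be seen on a toy example. The sketch below assumes that i ∼ j means "probes i and j are adjacent along the genome" and ignores chromosome boundaries; the function name and the two weight vectors are made up for illustration.

```python
import numpy as np

def fused_lasso_penalty(beta):
    """sum_i |beta_i| + sum over consecutive positions |beta_{i+1} - beta_i|.

    Assumes i ~ j means that probes i and j are adjacent along the genome.
    """
    beta = np.asarray(beta, dtype=float)
    return np.abs(beta).sum() + np.abs(np.diff(beta)).sum()

# Two weight vectors with the same entries, arranged differently.
piecewise = np.array([0, 0, 1, 1, 1, 0, 0, 0], dtype=float)
scattered = np.array([0, 1, 0, 0, 1, 0, 1, 0], dtype=float)
penalty_piecewise = fused_lasso_penalty(piecewise)   # 3 + 2 = 5
penalty_scattered = fused_lasso_penalty(scattered)   # 3 + 6 = 9
```

Both vectors have the same ℓ1 norm, but the piecewise-constant one pays for only two jumps, so the penalty favors weights that are both sparse and constant over contiguous regions.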


Example: finding discriminant modules in gene networks

The problem
Classification of gene expression: too many genes
A gene network is given (PPI, metabolic, regulatory, signaling, co-expression...)
We expect that “clusters of genes” (modules) in the network contribute similarly to the classification


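Prior hypothesis
Genes near each other on the graph should have similar weights.

Two solutions (Rapaport et al., 2007, 2008)

$$\Omega_{\mathrm{spectral}}(\beta) = \sum_{i \sim j} (\beta_i - \beta_j)^2 ,$$

$$\Omega_{\mathrm{graphfusion}}(\beta) = \sum_{i \sim j} |\beta_i - \beta_j| + \sum_{i} |\beta_i| .$$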
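As a small illustration, both penalties can be evaluated directly from an edge list, and the spectral one equals the quadratic form βᵀLβ with L = D − A the graph Laplacian. The helper names and the toy path graph below are assumptions made for this sketch.

```python
import numpy as np

def graph_penalties(beta, edges):
    """Return (spectral, graph-fusion) penalties of beta for an undirected graph.

    edges: list of (i, j) index pairs, each edge listed once.
    """
    beta = np.asarray(beta, dtype=float)
    diffs = np.array([beta[i] - beta[j] for i, j in edges])
    spectral = np.sum(diffs ** 2)
    graph_fusion = np.sum(np.abs(diffs)) + np.sum(np.abs(beta))
    return spectral, graph_fusion

def laplacian(n_nodes, edges):
    """Graph Laplacian L = D - A; beta^T L beta equals the spectral penalty."""
    L = np.zeros((n_nodes, n_nodes))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0
    return L

edges = [(0, 1), (1, 2), (2, 3)]          # a toy path graph on 4 genes
beta = np.array([1.0, 1.0, -1.0, 0.0])
spectral, graph_fusion = graph_penalties(beta, edges)   # 5.0 and 6.0
quadratic_form = beta @ laplacian(4, edges) @ beta      # also 5.0
```

The quadratic form makes the spectral penalty a ridge-type penalty in the eigenbasis of L, while the graph-fusion penalty is the network analogue of the fused lasso.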


[Figure from Rapaport et al.: network map with annotated regions: N-Glycan biosynthesis; Protein kinases; DNA and RNA polymerase subunits; Glycolysis / Gluconeogenesis; Sulfur metabolism; Porphyrin and chlorophyll metabolism; Riboflavin metabolism; Folate biosynthesis; Biosynthesis of steroids, ergosterol metabolism; Lysine biosynthesis; Phenylalanine, tyrosine and tryptophan biosynthesis; Purine metabolism; Oxidative phosphorylation, TCA cycle; Nitrogen, asparagine metabolism.]

Fig. 4. Global connection map of KEGG with mapped coefficients of the decision function obtained by applying a customary linear SVM (left) and using high-frequency eigenvalue attenuation (80% of high-frequency eigenvalues have been removed) (right). Spectral filtering divided the whole network into modules having coordinated responses, with the activation of low-frequency eigen modes being determined by microarray data. Positive coefficients are marked in red, negative coefficients are in green, and the intensity of the colour reflects the absolute values of the coefficients. Rhombuses highlight proteins participating in the Glycolysis/Gluconeogenesis KEGG pathway. Some other parts of the network are annotated including big highly connected clusters corresponding to protein kinases and DNA and RNA polymerase sub-units.

5 DISCUSSION

Our algorithm groups predictor variables according to highly connected "modules" of the global gene network. We assume that the genes within a tightly connected network module are likely to contribute similarly to the prediction function because of the interactions between the genes. This motivates the filtering of gene expression profile to remove the noisy high-frequency modes of the network.

Such grouping of variables is a very useful feature of the resulting classification function because the function becomes meaningful for interpreting and suggesting biological factors that cause the class separation. This allows classifications based on functions, pathways and network modules rather than on individual genes. This can lead to a more robust behaviour of the classifier in independent tests and to equal if not better classification results. Our results on the dataset we analysed shows only a slight improvement, although this may be due to its limited size. Therefore we are currently extending our work to larger data sets.

An important remark to bear in mind when analyzing pictures such as fig. 4 and 5 is that the colors represent the weights of the classifier, and not gene expression levels. There is of course a relationship between the classifier weights and the typical expression levels of genes in irradiated and non-irradiated samples: irradiated samples tend to have expression profiles positively correlated with the classifier, while non-irradiated samples tend to be negatively correlated. Roughly speaking, the classifier tries to find a smooth function that has this property. If more samples were available, better non-smooth classifier might be learned by the algorithm, but constraining the smoothness of the classifier is a way to reduce the complexity of the learning problem when a limited number of samples are available. This means in particular that the pictures provide virtually no information regarding the over-


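Prior hypothesis
Genes near each other on the graph should have non-zero weights (i.e., the support of β should be made of a few connected components).

Two solutions?

$$\Omega_{\mathrm{intersection}}(\beta) = \sum_{i \sim j} \sqrt{\beta_i^2 + \beta_j^2} ,$$

$$\Omega_{\mathrm{union}}(\beta) = \sup_{\alpha \in \mathbb{R}^p :\, \forall i \sim j,\ \|\alpha_i^2 + \alpha_j^2\| \le 1} \alpha^\top \beta .$$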
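For concreteness, the first of these two penalties is easy to evaluate from an edge list; the sketch below does only that. The second penalty is defined through a supremum and would need an optimization routine, which is not attempted here. The function name and the toy graph with two edges are illustrative assumptions.

```python
import numpy as np

def omega_intersection(beta, edges):
    """Sum over edges (i, j) of sqrt(beta_i^2 + beta_j^2)."""
    beta = np.asarray(beta, dtype=float)
    return sum(np.hypot(beta[i], beta[j]) for i, j in edges)

edges = [(0, 1), (1, 2)]   # a path on three genes, 0-indexed
value_middle_only = omega_intersection([0.0, 1.0, 0.0], edges)   # 1 + 1 = 2
value_edge_pair = omega_intersection([3.0, 4.0, 0.0], edges)     # 5 + 4 = 9
```

Note that a gene shared by several edges, such as the middle one here, is counted once per edge it belongs to.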


[Figure] Groups (1, 2) and (2, 3). Left: $\Omega_{\mathrm{intersection}}(\beta)$. Right: $\Omega_{\mathrm{union}}(\beta)$. Vertical axis is $\beta_2$.

2 Inference of biological networks

Biological networks


Our goal

Data
Gene expression, gene sequence, protein localization, ...

Graph
Protein-protein interactions, metabolic pathways, signaling pathways, ...


More precisely

“De novo” inference
Given data about individual genes and proteins
Infer the edges between genes and proteins

“Supervised” inference
Given data about individual genes and proteins
and given some known interactions
infer unknown interactions


Supervised inference by pattern recognition

Formulation and basic issue
A pair can be connected (1) or not connected (-1)
From the known subgraph we can extract examples of connected and non-connected pairs
However the genomic data characterize individual proteins; we need to work with pairs of proteins instead!

[Figure: the known graph on nodes 1 to 4, and the genomic data for proteins 1 to 4 together with the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).]


Tensor product SVM (Ben-Hur and Noble, 2006)

Intuition: a pair (A, B) is similar to a pair (C, D) if:
A is similar to C and B is similar to D, or...
A is similar to D and B is similar to C

Formally, define a similarity between pairs from a similarity between individuals by

$$K_{\mathrm{TPPK}}((a, b), (c, d)) = K(a, c) K(b, d) + K(a, d) K(b, c) .$$

If K is a positive definite kernel for individuals then $K_{\mathrm{TPPK}}$ is a p.d. kernel for pairs which can be used by SVM
This amounts to representing a pair (a, b) by the symmetrized tensor product:

$$(a, b) \to (a \otimes b) \oplus (b \otimes a) .$$
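A short sketch of how the pairwise kernel matrix can be built from a precomputed kernel matrix between individuals. The function name, the index-pair encoding, and the random descriptors with a linear base kernel are assumptions made for the example.

```python
import numpy as np

def tppk(K, pairs_1, pairs_2):
    """Tensor product pairwise kernel between two lists of index pairs.

    K: (m, m) kernel matrix between individual proteins.
    pairs_1, pairs_2: integer arrays of shape (n1, 2) and (n2, 2).
    """
    a, b = np.asarray(pairs_1).T
    c, d = np.asarray(pairs_2).T
    return K[np.ix_(a, c)] * K[np.ix_(b, d)] + K[np.ix_(a, d)] * K[np.ix_(b, c)]

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3))       # hypothetical descriptors for 5 proteins
K = features @ features.T                # linear kernel between individuals
pairs = np.array([(0, 1), (1, 0), (2, 3), (0, 4)])
K_pairs = tppk(K, pairs, pairs)
```

The rows for the pairs (0, 1) and (1, 0) coincide, reflecting that the kernel does not depend on the order within a pair. The resulting matrix could then be passed to any SVM implementation that accepts a precomputed kernel.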


Metric learning pairwise SVM (V. et al, 2007)

Intuition: a pair (A, B) is similar to a pair (C, D) if:
A − B is similar to C − D, or...
A − B is similar to D − C.

Formally, define a similarity between pairs from a similarity between individuals by

$$K_{\mathrm{MLPK}}((a, b), (c, d)) = \left( K(a, c) + K(b, d) - K(a, d) - K(b, c) \right)^2 .$$

If K is a positive definite kernel for individuals then $K_{\mathrm{MLPK}}$ is a p.d. kernel for pairs which can be used by SVM
This amounts to representing a pair (a, b) by the symmetrized difference:

$$(a, b) \to (a - b)^{\otimes 2} .$$
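A sketch of the metric learning pairwise kernel computed from a kernel matrix between individuals. It uses the form (K(a,c) − K(a,d) − K(b,c) + K(b,d))², i.e. the squared inner product between a − b and c − d, which is the expression consistent with the feature map (a − b)⊗². The function name and the toy data are illustrative assumptions.

```python
import numpy as np

def mlpk(K, pairs_1, pairs_2):
    """Metric learning pairwise kernel between two lists of index pairs.

    K: (m, m) kernel matrix between individuals.
    pairs_1, pairs_2: integer arrays of shape (n1, 2) and (n2, 2).
    """
    a, b = np.asarray(pairs_1).T
    c, d = np.asarray(pairs_2).T
    # inner product between (a - b) and (c - d) in the feature space of K
    inner = K[np.ix_(a, c)] - K[np.ix_(a, d)] - K[np.ix_(b, c)] + K[np.ix_(b, d)]
    return inner ** 2

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3))       # hypothetical descriptors for 5 proteins
K = features @ features.T                # linear kernel between individuals
pairs = np.array([(0, 1), (1, 0), (2, 3)])
K_pairs = mlpk(K, pairs, pairs)

# With a linear base kernel the same value follows from explicit differences.
explicit = ((features[0] - features[1]) @ (features[2] - features[3])) ** 2
```

Squaring removes the sign of the difference, so (a, b) and (b, a) are treated as the same pair.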


Supervised inference with local models

The idea (Bleakley et al., 2007)
Motivation: define specific models for each target node to discriminate between its neighbors and the others
Treat each node independently from the other. Then combine predictions for ranking candidate edges.

[Figure: graph nodes labeled +1, −1, or ?.]
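The local-model idea can be sketched as follows. This is a simplified illustration, not the method of Bleakley et al.: each local classifier is a ridge regression on ±1 labels without intercept (a stand-in for an SVM), and the two predictions available for a pair are combined by averaging, which is one possible choice among others. All names and the toy data are hypothetical.

```python
import numpy as np

def local_model_scores(X, adjacency, known, lam=1.0):
    """Score candidate edges with one ridge-regression model per node.

    X: (m, p) features of the m nodes.
    adjacency: (m, m) symmetric 0/1 array of interactions.
    known: (m, m) boolean array, True where the pair's status is known.
    Returns an (m, m) array of symmetrized scores (higher = more likely edge).
    """
    m, p = X.shape
    scores = np.zeros((m, m))
    for target in range(m):
        train = known[target].copy()
        train[target] = False                           # no self-interaction example
        if not train.any():
            continue
        labels = 2.0 * adjacency[target, train] - 1.0   # +1 neighbor, -1 otherwise
        Xt = X[train]
        # ridge regression on the +1/-1 labels (a stand-in for the SVM)
        w = np.linalg.solve(Xt.T @ Xt + lam * np.eye(p), Xt.T @ labels)
        scores[target] = X @ w
    return (scores + scores.T) / 2.0                    # combine the two local predictions

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
adjacency = (X @ X.T > 0).astype(int)
np.fill_diagonal(adjacency, 0)
known = np.ones((6, 6), dtype=bool)
known[0, 5] = known[5, 0] = False       # pretend this pair is unobserved
scores = local_model_scores(X, adjacency, known)
```

The score `scores[0, 5]` then ranks the unobserved pair using the model local to node 0 and the model local to node 5.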


Results: protein-protein interaction (yeast)

[Figure: ROC curves (ratio of true positives against ratio of false positives) for the methods Direct, kML, kCCA, em, local, Pkernel.]

(from Bleakley et al., 2007)


Results: metabolic gene network (yeast)

[Figure: ROC curves (ratio of true positives against ratio of false positives) for the methods Direct, kML, kCCA, em, local, Pkernel.]

(from Bleakley et al., 2007)


Results: regulatory network (E. coli)

[Figure: two panels comparing CLR, SIRENE and SIRENE−Bias. First panel: ratio of true positives against ratio of false positives. Second panel: precision against recall.]

Method               Recall at 60%   Recall at 80%
SIRENE               44.5%           17.6%
CLR                  7.5%            5.5%
Relevance networks   4.7%            3.3%
ARACNe               1%              0%
Bayesian network     1%              0%

SIRENE = Supervised Inference of REgulatory NEtworks (Mordelet and V., 2008)


Results: predicted regulatory network (E. coli)

Prediction at 60% precision, restricted to transcription factors (from Mordelet and V., 2008).

Jean-Philippe Vert (ParisTech-Curie) Machine learning in bioinformatics 37 / 57

Page 54: Some contributions of machine learning to bioinformaticsmembers.cbio.mines-paristech.fr/~jvert/talks/090310... · 2014-10-10 · Some contributions of machine learning to bioinformatics

Outline

1 Supervised classification of genomic data

2 Inference of biological networks

3 Virtual screening and chemogenomics

4 Conclusion


Virtual screening

Objective
Build models to predict biochemical properties of small molecules from their structures.

Structures
C15H14ClN3O3

Properties
binding to a therapeutic target,
pharmacokinetics (ADME),
toxicity...


Ligand-Based Virtual Screening and QSAR

[Figure: six example molecules, three labeled active and three labeled inactive.]

NCI AIDS screen results (from http://cactus.nci.nih.gov).

Formalization

The problem
Given a set of training instances $(x_1, y_1), \ldots, (x_n, y_n)$, where the $x_i$ are graphs and the $y_i$ are continuous or discrete variables of interest,
estimate a function

$$y = f(x)$$

where $x$ is any graph to be labeled.
This is a classical regression or pattern recognition problem over the set of graphs.


Classical approaches

Two steps
1 Map each molecule to a vector of fixed dimension using molecular descriptors
    Global properties of the molecules (mass, logP...)
    2D and 3D descriptors (substructures, fragments, ...)
2 Apply an algorithm for regression or pattern recognition.
    PLS, ANN, ...

Example: 2D structural keys

[Figure: example molecule with its 2D structural keys.]
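The first step, mapping a molecule to a fixed-length vector, can be sketched as follows. This is only an illustration: the molecule format (atom symbols plus bonds as index pairs), the tiny key list `KEYS`, and the function `structural_keys` are hypothetical, and real structural keys rely on proper substructure matching rather than on atoms and bonded pairs alone.

```python
from typing import List, Tuple

# Toy molecule format (an assumption): atom symbols and bonds as index pairs.
Molecule = Tuple[List[str], List[Tuple[int, int]]]

# Hypothetical structural keys: single atoms and bonded atom pairs.
KEYS = ["N", "O", "Cl", "C-N", "C-O", "N-O"]


def structural_keys(mol: Molecule) -> List[int]:
    """Return a fixed-length 0/1 vector: 1 if the key occurs in the molecule."""
    atoms, bonds = mol
    present = set(atoms)
    for i, j in bonds:
        # Sort the two symbols so that C-O and O-C map to the same key.
        a, b = sorted((atoms[i], atoms[j]))
        present.add(f"{a}-{b}")
    return [1 if key in present else 0 for key in KEYS]


# Heavy-atom skeletons only (hydrogens and bond orders are ignored here).
formamide = (["O", "C", "N"], [(0, 1), (1, 2)])
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])

print(structural_keys(formamide))
print(structural_keys(ethanol))
```

Every molecule ends up with a vector of the same length, whatever its size, which is what the second step (PLS, ANN, ...) requires.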


Which descriptors?

Difficulties
Many descriptors are needed to characterize various features (in particular for 2D and 3D descriptors)
But too many descriptors are harmful for memory storage, computation speed, statistical estimation


Kernels

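Definition
Let $\Phi(x) = (\Phi_1(x), \ldots, \Phi_p(x))$ be a vector representation of the molecule $x$.
The kernel between two molecules is defined by:

$$K(x, x') = \Phi(x)^\top \Phi(x') = \sum_{i=1}^{p} \Phi_i(x) \Phi_i(x').$$

[Figure: the map $\phi$ from the space of molecules $X$ to the feature space $H$.]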
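The definition translates directly into code once each molecule has a descriptor vector. The sketch below uses made-up descriptor vectors and hypothetical helper names (`linear_kernel`, `gram_matrix`); it simply evaluates the sum of products above for every pair of molecules.

```python
from typing import List, Sequence


def linear_kernel(phi_x: Sequence[float], phi_y: Sequence[float]) -> float:
    """K(x, x') = sum_i Phi_i(x) * Phi_i(x')."""
    return sum(a * b for a, b in zip(phi_x, phi_y))


def gram_matrix(vectors: List[Sequence[float]]) -> List[List[float]]:
    """Matrix of kernel values between all pairs of molecules."""
    return [[linear_kernel(u, v) for v in vectors] for u in vectors]


# Made-up descriptor vectors Phi(x) for three molecules.
descriptors = [
    [1, 1, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
]

K = gram_matrix(descriptors)
for row in K:
    print(row)
```

The resulting matrix is symmetric, and its diagonal holds the squared norms of the descriptor vectors.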


The kernel trick

[Figure: the map $\phi$ from $X$ to $H$.]

$$K(x, x') = \Phi(x)^\top \Phi(x')$$

The trick
Many linear algorithms for regression or pattern recognition can be expressed only in terms of inner products between vectors
Computing the kernel is often more efficient than computing $\Phi(x)$, especially in high or infinite dimensions!
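A small numerical check can make the second point concrete. The example below is not taken from the slides: it uses the standard degree-2 polynomial kernel on plain vectors, for which the explicit feature map has p² coordinates while the kernel needs only one inner product in dimension p. The names `phi_quadratic` and `k_quadratic` are chosen for this sketch.

```python
from itertools import product
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def phi_quadratic(x: Sequence[float]) -> List[float]:
    """Explicit feature map: all ordered products x_i * x_j (p**2 coordinates)."""
    return [a * b for a, b in product(x, repeat=2)]


def k_quadratic(x: Sequence[float], y: Sequence[float]) -> float:
    """Same inner product, computed without building the feature vectors."""
    return dot(x, y) ** 2


x = [1.0, 2.0, 3.0]
y = [0.5, -1.0, 2.0]

explicit = dot(phi_quadratic(x), phi_quadratic(y))  # works in dimension p**2
implicit = k_quadratic(x, y)                        # works in dimension p

print(explicit, implicit)
```

Both routes give the same value, but only the second one avoids materializing the larger feature vector.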


Example: 2D fragment kernel

[Figure: a molecular graph and examples of the atom-label sequences (fragments) extracted from it.]

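$\phi_d(x)$ is the vector of counts of all fragments of length $d$:

$$\phi_1(x) = (\#(\text{C}), \#(\text{O}), \#(\text{N}), \ldots)^\top$$

$$\phi_2(x) = (\#(\text{C-C}), \#(\text{C=O}), \#(\text{C-N}), \ldots)^\top \quad \text{etc.}$$

The 2D fragment kernel is defined, for $\lambda < 1$, by

$$K_{\text{fragment}}(x, x') = \sum_{d=1}^{\infty} r(\lambda)\, \phi_d(x)^\top \phi_d(x').$$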
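To make the fragment counts concrete, here is a minimal sketch that enumerates fragments explicitly on toy graphs. Several simplifications are assumptions of this sketch and not statements about the original method: fragments are label sequences along walks (so each bond is traversed in both directions), bond types are ignored, the weight of length-d fragments is taken to be λ^d, and the infinite sum is truncated at `max_len`. The names `fragment_counts` and `fragment_kernel` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Toy labeled graph (an assumption): atom labels and adjacency lists.
Graph = Tuple[List[str], Dict[int, List[int]]]


def fragment_counts(graph: Graph, d: int) -> Counter:
    """Count the label sequences of all walks visiting d atoms (phi_d)."""
    labels, adj = graph
    walks = [(v,) for v in range(len(labels))]
    for _ in range(d - 1):
        # Extend every walk by one neighbor of its last atom.
        walks = [w + (n,) for w in walks for n in adj[w[-1]]]
    return Counter("-".join(labels[v] for v in w) for w in walks)


def fragment_kernel(g1: Graph, g2: Graph, lam: float = 0.5,
                    max_len: int = 4) -> float:
    """Truncated sum over d of lam**d * phi_d(g1) . phi_d(g2)."""
    total = 0.0
    for d in range(1, max_len + 1):
        c1, c2 = fragment_counts(g1, d), fragment_counts(g2, d)
        total += lam ** d * sum(c1[f] * c2[f] for f in c1)
    return total


# Heavy-atom skeletons: C-C-O (ethanol) and C-O (methanol).
ethanol = (["C", "C", "O"], {0: [1], 1: [0, 2], 2: [1]})
methanol = (["C", "O"], {0: [1], 1: [0]})

print(fragment_counts(ethanol, 2))
print(fragment_kernel(ethanol, methanol, lam=0.5, max_len=2))
```

Explicit enumeration grows quickly with d, which is why it is only practical for short fragments on small graphs.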


Example: 2D fragment kernel


In practice
$K_{\text{fragment}}$ can be computed efficiently (geometric kernel, random walk kernel...) although the feature space has infinite dimension.
Increasing the specificity of atom labels improves performance
Selecting only “non-tottering” fragments can be done efficiently and improves performance.
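One way to see why no explicit fragment list is needed: the inner product of fragment counts equals the number of pairs of walks, one in each molecule, that carry the same label sequence, and such pairs can be counted by dynamic programming over pairs of matching atoms (the product graph). The sketch below is a truncated version under assumptions of its own: weights λ^d, bond types ignored, a cutoff `max_len`, and no filtering of tottering walks. The name `walk_kernel_dp` is hypothetical.

```python
from typing import Dict, List, Tuple

# Toy labeled graph (an assumption): atom labels and adjacency lists.
Graph = Tuple[List[str], Dict[int, List[int]]]


def walk_kernel_dp(g1: Graph, g2: Graph, lam: float = 0.5,
                   max_len: int = 4) -> float:
    """Sum over d <= max_len of lam**d * (number of label-matching walk pairs)."""
    labels1, adj1 = g1
    labels2, adj2 = g2
    # Product-graph vertices: pairs of atoms carrying the same label.
    pairs = [(u, v)
             for u in range(len(labels1))
             for v in range(len(labels2))
             if labels1[u] == labels2[v]]
    # count[(u, v)]: number of matching walk pairs with d atoms ending at (u, v).
    count = {p: 1 for p in pairs}
    total = lam * sum(count.values())
    for d in range(2, max_len + 1):
        count = {
            (u, v): sum(count.get((a, b), 0)
                        for a in adj1[u] for b in adj2[v])
            for (u, v) in pairs
        }
        total += lam ** d * sum(count.values())
    return total


ethanol = (["C", "C", "O"], {0: [1], 1: [0, 2], 2: [1]})
methanol = (["C", "O"], {0: [1], 1: [0]})

print(walk_kernel_dp(ethanol, methanol, lam=0.5, max_len=2))
```

Each step costs a number of operations bounded by the number of atom pairs times the product of their degrees, independently of how many distinct fragments exist.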


Example: 2D subtree kernel

[Figure: subtree patterns extracted from a molecular graph.]


2D Subtree vs fragment kernels (Mahé and V, 2007)

[Figure: AUC (vertical axis, 70 to 80) of the Walks and Subtrees kernels for each of the 60 cell lines.]

Screening of inhibitors for 60 cancer cell lines (from Mahé and V., 2008)

Example: 3D pharmacophore kernel (Mahé et al., 2005)

[Figure: two molecules, each with a three-point pharmacophore described by the inter-atomic distances d1, d2, d3.]

$$K(x, y) = \sum_{p_x \in P(x)} \sum_{p_y \in P(y)} \exp\left(-\gamma\, d(p_x, p_y)\right).$$

Results (accuracy)

Kernel               BZR    COX    DHFR   ER
2D (Tanimoto)        71.2   63.0   76.9   77.1
3D fingerprint       75.4   67.0   76.9   78.6
3D not discretized   76.4   69.8   81.9   79.8
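The double sum over pharmacophores can be sketched as follows. The slide does not define the distance d between two pharmacophores, so this example makes its own assumptions: a pharmacophore is any triplet of atoms, it is summarized by its three sorted inter-atomic distances, and d is the Euclidean distance between those summaries. Atom labels are ignored. The names `pharmacophores` and `pharmacophore_kernel` are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float, float]


def pharmacophores(atoms: List[Point]) -> List[Tuple[float, ...]]:
    """All atom triplets, each described by its sorted pairwise distances."""
    result = []
    for triple in combinations(atoms, 3):
        dists = sorted(math.dist(a, b) for a, b in combinations(triple, 2))
        result.append(tuple(dists))
    return result


def pharmacophore_kernel(x: List[Point], y: List[Point],
                         gamma: float = 1.0) -> float:
    """K(x, y) = sum over px in P(x), py in P(y) of exp(-gamma * d(px, py))."""
    return sum(math.exp(-gamma * math.dist(px, py))
               for px in pharmacophores(x)
               for py in pharmacophores(y))


# Toy 3D coordinates (made up for illustration).
mol_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
mol_b = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.5, 0.0)]

print(len(pharmacophores(mol_a)))
print(pharmacophore_kernel(mol_a, mol_b))
```

Because the distances are compared without binning, this corresponds in spirit to the "not discretized" row of the table rather than to a fingerprint.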


Chemogenomics

The problem
Similar targets bind similar ligands
Instead of focusing on each target individually, can we screen the biological space (target families) vs the chemical space (ligands)?
Mathematically, learn $f(\text{target}, \text{ligand}) \in \{\text{bind}, \text{not bind}\}$


Chemogenomics with SVM

Tensor product SVM
Take the kernel:

$$K\big((t, l), (t', l')\big) = K_t(t, t')\, K_l(l, l').$$

Equivalently, represent a pair $(t, l)$ by the vector $\phi_t(t) \otimes \phi_l(l)$
This allows any kernel for proteins $K_t$ to be combined with any kernel for small molecules $K_l$
When $K_t$ is the Dirac kernel, we recover the classical paradigm: each target is treated independently from the others.
Otherwise, information is shared across targets. The more similar the targets, the more they share information.
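The equivalence between the product of kernels and the inner product of tensor-product vectors can be checked numerically. The sketch below assumes linear kernels on made-up target and ligand vectors; `kron`, `pair_kernel`, and `dirac_pair_kernel` are hypothetical helper names.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def kron(u: Sequence[float], v: Sequence[float]) -> List[float]:
    """Tensor (Kronecker) product of two vectors, flattened."""
    return [a * b for a in u for b in v]


def pair_kernel(t, l, t2, l2) -> float:
    """K((t, l), (t', l')) = K_t(t, t') * K_l(l, l') with linear kernels."""
    return dot(t, t2) * dot(l, l2)


def dirac_pair_kernel(target: str, l, target2: str, l2) -> float:
    """Dirac kernel on targets: pairs with different targets never interact."""
    same_target = 1.0 if target == target2 else 0.0
    return same_target * dot(l, l2)


# Made-up feature vectors for two targets and two ligands.
t, t2 = [1.0, 0.5], [0.5, 2.0]
l, l2 = [2.0, 1.0, 0.0], [1.0, 1.0, 3.0]

product_of_kernels = pair_kernel(t, l, t2, l2)
tensor_inner_product = dot(kron(t, l), kron(t2, l2))

print(product_of_kernels, tensor_inner_product)
print(dirac_pair_kernel("A", l, "B", l2), dirac_pair_kernel("A", l, "A", l2))
```

With the Dirac kernel, the Gram matrix over (target, ligand) pairs is block diagonal, one block per target, which is why each target is then learned independently.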


Example: MHC-I epitope prediction across different alleles

The approach (Jacob and V., 2007)
Take a kernel to compare different MHC-I alleles (e.g., based on the amino acids in the peptide recognition pocket)
Take a kernel to compare different epitopes (9-mer peptides)
Combine them to learn the f(allele, epitope) function
State-of-the-art performance
Available at http://cbio.ensmp.fr/kiss


Generalization: collaborative filtering with attributes

General problem: learn $f(x, y)$ with a kernel $K_x$ for $x$ and a kernel $K_y$ for $y$.
SVM with a tensor product kernel $K_x \otimes K_y$ is a particular case of something more general: estimating an operator with a spectral regularization.
Other spectral regularizations are possible (e.g., trace norm) and lead to efficient algorithms
More details in Abernethy et al. (2008).


Outline

1 Supervised classification of genomic data

2 Inference of biological networks

3 Virtual screening and chemogenomics

4 Conclusion


Conclusion

Modern machine learning methods for regression / classification lend themselves well to the integration of prior knowledge in the penalization / regularization function.
Inference of biological networks can be formulated in the framework of pattern recognition.
Kernel methods (e.g., SVM) make it possible to manipulate complex objects (e.g., molecules, biological sequences) as soon as kernels can be defined and computed.


People I need to thank

Including prior knowledge in penalization
Franck Rapaport, Emmanuel Barillot, Andrei Zynoviev, Christian Lajaunie, Yves Vandenbrouck, Nicolas Foveau...

Virtual screening, kernels etc.
Pierre Mahé, Laurent Jacob, Liva Ralaivola, Véronique Stoven, Brice Hoffman, Martial Hue, Francis Bach, Jacob Abernethy, Theos Evgeniou...

Network inference
Kevin Bleakley, Fantine Mordelet, Yoshihiro Yamanishi, Gérard Biau, Minoru Kanehisa, William Noble, Jian Qiu...
