
Some contributions of machine learning to bioinformatics

Jean-Philippe Vert (Jean-Philippe.Vert@ensmp.fr)

Mines ParisTech / Institut Curie / Inserm

Université de Provence, Marseille, France, March 10, 2009.


Where I come from

A joint lab about “Cancer computational genomics, bioinformatics, biostatistics and epidemiology”, located in the Institut Curie, a major hospital and cancer research institute in Europe.

“Statistical machine learning for cancer informatics” team

Main topics

Towards better diagnosis, prognosis, and personalized medicine: supervised classification of genomic, transcriptomic, and proteomic data; heterogeneous data integration.

Towards new drug targets: systems biology, reconstruction of gene networks, pathway enrichment analysis, multidimensional phenotyping of cell populations.

Towards new drugs: ligand-based virtual screening, in silico chemogenomics.

Towards personalized medicine: diagnosis/prognosis from genome/transcriptome

From Golub et al., Science, 1999.

Towards new drug targets: inference of biological networks

From Mordelet and Vert, Bioinformatics, 2008.

Towards new drugs: Ligand-Based Virtual Screening and QSAR

[Figure: six example molecules labelled active or inactive. NCI AIDS screen results, from http://cactus.nci.nih.gov.]

Pattern recognition, aka supervised classification

Challenges
High dimension
Few samples
Structured data
Prior knowledge
Fast and scalable implementations

Outline

1 Supervised classification of genomic data

2 Inference of biological networks

3 Virtual screening and chemogenomics

4 Conclusion

Motivation

Goal
Design a classifier to automatically assign a class to future samples from their expression profile
Interpret biologically the differences between the classes

Difficulty
Large dimension
Few samples

Linear classifiers

The model
Each sample is represented by a vector x = (x_1, ..., x_p)
Goal: estimate a linear function:

f_β(x) = ∑_{i=1}^p β_i x_i + β_0 .

Interpretability: the weight β_i quantifies the influence of feature i (but...)

Linear classifiers

Training the model

f_β(x) = ∑_{i=1}^p β_i x_i + β_0 .

Minimize an empirical risk on the training samples:

min_{β ∈ R^{p+1}} R_emp(β) = (1/n) ∑_{i=1}^n l(f_β(x_i), y_i) ,

...subject to some constraint on β, e.g.:

Ω(β) ≤ C .

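A minimal sketch of this setup in Python, assuming scikit-learn (the slides name no software); the constraint Ω(β) ≤ C appears in its equivalent penalized form, with the logistic loss standing in for l:

```python
# Penalized empirical risk minimization on synthetic high-dimensional
# data (p >> n). The constraint Omega(beta) <= C is handled through its
# equivalent penalized (Lagrangian) formulation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 50, 2000                          # few samples, many features
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:10] = 1.0                     # only 10 informative features
y = (X @ beta_true + 0.5 * rng.standard_normal(n) > 0).astype(int)

# Ridge-type constraint: Omega(beta) = ||beta||_2^2
ridge = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

# LASSO-type constraint: Omega(beta) = ||beta||_1
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

print("nonzero weights (ridge):", int(np.sum(ridge.coef_ != 0)))  # ~all p
print("nonzero weights (lasso):", int(np.sum(lasso.coef_ != 0)))  # few
```

The same pattern covers the norm constraints and feature selection discussed on the next two slides: the choice of Ω (here, the `penalty` argument) is what encodes the prior.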

Example: Norm Constraints

The approach
A common method in statistics to learn with few samples in high dimension is to constrain the Euclidean norm of β:

Ω_ridge(β) = ‖β‖_2^2 = ∑_{i=1}^p β_i^2 ,

(ridge regression, support vector machines...)

Pros
Good performance in classification

Cons
Limited interpretation (small weights)
No prior biological knowledge

Example: Feature Selection

The approach
Constrain most weights to be 0, i.e., select a few genes (< 100) whose expression is sufficient for classification.

Greedy feature selection (T-tests, ...)
Constrain the norm of β: LASSO penalty (‖β‖_1 = ∑_{i=1}^p |β_i|), elastic net penalty (‖β‖_1 + ‖β‖_2), ...

Pros
Good performance in classification
Biomarker selection
Interpretability

Cons
The gene selection process is usually not robust
No use of prior biological knowledge

Why LASSO leads to sparse solutions


Incorporating prior knowledge

The idea
If we have specific prior knowledge about the “correct” weights, it can be included in Ω in the constraint:

Minimize R_emp(β) subject to Ω(β) ≤ C .

If we design a convex function Ω, then the algorithm boils down to a convex optimization problem (usually easy to solve).
This is similar to priors in Bayesian statistics.

Example: CGH array classification

Motivation
Comparative genomic hybridization (CGH) data measure the DNA copy number along the genome
Very useful, in particular in cancer research
Can we classify CGH arrays for diagnosis or prognosis purposes?

Example: CGH array classification

Prior knowledge
Let x be a CGH profile
We focus on linear classifiers, i.e., the sign of:

f(x) = x^⊤ β .

We expect β to be:
sparse: only a few positions should be discriminative
piecewise constant: within a region, all probes should contribute equally

Example: CGH array classification

A solution (Rapaport et al., 2008)

Ω_fusedlasso(β) = ∑_i |β_i| + ∑_{i∼j} |β_i − β_j| .

Good performance on diagnosis for bladder cancer, and prognosis for melanoma.
More interpretable classifiers.

[Figure: two plots of classifier weight (scale −1 to 1) versus BAC index along the genome.]
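For intuition, a small numpy sketch (illustrative only) that evaluates Ω_fusedlasso on a chain graph, where neighbours i ∼ j are consecutive probes along the genome; the penalty is small precisely for the sparse, piecewise-constant profiles the prior favors:

```python
import numpy as np

def fused_lasso_penalty(beta):
    """Omega(beta) = sum_i |beta_i| + sum_{i~j} |beta_i - beta_j|,
    where i ~ j means consecutive probe positions along the genome."""
    return np.abs(beta).sum() + np.abs(np.diff(beta)).sum()

# A sparse, piecewise-constant profile (matching the prior)...
smooth = np.concatenate([np.zeros(40), 0.5 * np.ones(20), np.zeros(40)])
# ...versus an irregular profile with the same l1 norm.
rng = np.random.default_rng(0)
rough = rng.permutation(smooth) * np.sign(rng.standard_normal(100))

print(fused_lasso_penalty(smooth))   # 11.0: only two jumps
print(fused_lasso_penalty(rough))    # much larger: many jumps
```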

Example: finding discriminant modules in gene networks

The problem
Classification of gene expression: too many genes
A gene network is given (PPI, metabolic, regulatory, signaling, co-expression...)
We expect that “clusters of genes” (modules) in the network contribute similarly to the classification

Example: finding discriminant modules in gene networks

Prior hypothesis
Genes near each other on the graph should have similar weights.

Two solutions (Rapaport et al., 2007, 2008)

Ω_spectral(β) = ∑_{i∼j} (β_i − β_j)^2 ,

Ω_graphfusion(β) = ∑_{i∼j} |β_i − β_j| + ∑_i |β_i| .
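A short numpy sketch (not from the slides) of why Ω_spectral is convenient: it is the quadratic form β^⊤Lβ of the graph Laplacian L = D − A, hence convex and easy to plug into the constrained problem above:

```python
import numpy as np

edges = [(0, 1), (1, 2), (3, 4)]      # toy gene network on 5 genes
p = 5
A = np.zeros((p, p))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A        # graph Laplacian L = D - A

beta = np.array([1.0, 1.0, 0.9, -1.0, -1.0])   # smooth on each module

omega_direct = sum((beta[i] - beta[j]) ** 2 for i, j in edges)
omega_laplacian = float(beta @ L @ beta)
print(omega_direct, omega_laplacian)  # identical values
```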

Example: finding discriminant modules in gene networks (Rapaport et al.)

[Figure: global connection map of KEGG, with annotated modules including glycan biosynthesis, protein kinases, DNA and RNA polymerase subunits, glycolysis/gluconeogenesis, sulfur metabolism, porphyrin and chlorophyll metabolism, riboflavin metabolism, folate biosynthesis, biosynthesis of steroids and ergosterol metabolism, lysine biosynthesis, phenylalanine/tyrosine/tryptophan biosynthesis, purine metabolism, oxidative phosphorylation and the TCA cycle, and nitrogen/asparagine metabolism.]

Fig. 4. Global connection map of KEGG with mapped coefficients of the decision function obtained by applying a customary linear SVM (left) and using high-frequency eigenvalue attenuation (80% of high-frequency eigenvalues have been removed) (right). Spectral filtering divided the whole network into modules having coordinated responses, with the activation of low-frequency eigen modes being determined by microarray data. Positive coefficients are marked in red, negative coefficients are in green, and the intensity of the colour reflects the absolute values of the coefficients. Rhombuses highlight proteins participating in the Glycolysis/Gluconeogenesis KEGG pathway. Some other parts of the network are annotated, including big highly connected clusters corresponding to protein kinases and DNA and RNA polymerase sub-units.

5 DISCUSSION

Our algorithm groups predictor variables according to highly connected "modules" of the global gene network. We assume that the genes within a tightly connected network module are likely to contribute similarly to the prediction function because of the interactions between the genes. This motivates the filtering of gene expression profiles to remove the noisy high-frequency modes of the network.

Such grouping of variables is a very useful feature of the resulting classification function because the function becomes meaningful for interpreting and suggesting biological factors that cause the class separation. This allows classifications based on functions, pathways and network modules rather than on individual genes. This can lead to a more robust behaviour of the classifier in independent tests and to equal if not better classification results. Our results on the dataset we analysed show only a slight improvement, although this may be due to its limited size. Therefore we are currently extending our work to larger data sets.

An important remark to bear in mind when analyzing pictures such as figs. 4 and 5 is that the colors represent the weights of the classifier, and not gene expression levels. There is of course a relationship between the classifier weights and the typical expression levels of genes in irradiated and non-irradiated samples: irradiated samples tend to have expression profiles positively correlated with the classifier, while non-irradiated samples tend to be negatively correlated. Roughly speaking, the classifier tries to find a smooth function that has this property. If more samples were available, a better non-smooth classifier might be learned by the algorithm, but constraining the smoothness of the classifier is a way to reduce the complexity of the learning problem when a limited number of samples are available. This means in particular that the pictures provide virtually no information regarding the over- [...]

Example: finding discriminant modules in gene networks

Prior hypothesis
Genes near each other on the graph should have non-zero weights (i.e., the support of β should be made of a few connected components).

Two solutions?

Ω_intersection(β) = ∑_{i∼j} √(β_i^2 + β_j^2) ,

Ω_union(β) = sup { α^⊤β : α ∈ R^p, α_i^2 + α_j^2 ≤ 1 for all i ∼ j } .

Example: finding discriminant modules in gene networks

[Figure: groups (1, 2) and (2, 3); left: Ω_intersection(β), right: Ω_union(β); vertical axis is β_2.]


Biological networks


Our goal

Data
Gene expression,
Gene sequence,
Protein localization, ...

Graph
Protein-protein interactions,
Metabolic pathways,
Signaling pathways, ...

More precisely

“De novo” inference
Given data about individual genes and proteins
Infer the edges between genes and proteins

“Supervised” inference
Given data about individual genes and proteins
and given some known interactions
infer unknown interactions

Supervised inference by pattern recognition

Formulation and basic issue
A pair can be connected (+1) or not connected (−1)
From the known subgraph we can extract examples of connected and non-connected pairs
However, the genomic data characterize individual proteins; we need to work with pairs of proteins instead!

[Figure: a known graph over nodes 1–4 next to the genomic data, with labelled pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4).]

Tensor product SVM (Ben-Hur and Noble, 2006)

Intuition: a pair (A, B) is similar to a pair (C, D) if:
A is similar to C and B is similar to D, or...
A is similar to D and B is similar to C

Formally, define a similarity between pairs from a similarity between individuals by

K_TPPK((a, b), (c, d)) = K(a, c) K(b, d) + K(a, d) K(b, c) .

If K is a positive definite kernel for individuals, then K_TPPK is a p.d. kernel for pairs which can be used by an SVM
This amounts to representing a pair (a, b) by the symmetrized tensor product:

(a, b) → (a ⊗ b) ⊕ (b ⊗ a) .
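A direct transcription of K_TPPK into Python (a minimal sketch; the RBF base kernel and random feature vectors are placeholders for real protein kernels and data):

```python
import numpy as np

def rbf(a, b, gamma=0.1):
    """Base kernel K on individual proteins (placeholder feature vectors)."""
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

def k_tppk(pair1, pair2, K=rbf):
    (a, b), (c, d) = pair1, pair2
    return K(a, c) * K(b, d) + K(a, d) * K(b, c)

rng = np.random.default_rng(0)
prot = {name: rng.standard_normal(8) for name in "ABCD"}

k1 = k_tppk((prot["A"], prot["B"]), (prot["C"], prot["D"]))
k2 = k_tppk((prot["B"], prot["A"]), (prot["C"], prot["D"]))
print(k1, k2)   # equal: the kernel ignores the order within a pair
```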

Metric learning pairwise SVM (V. et al., 2007)

Intuition: a pair (A, B) is similar to a pair (C, D) if:
A − B is similar to C − D, or...
A − B is similar to D − C.

Formally, define a similarity between pairs from a similarity between individuals by

K_MLPK((a, b), (c, d)) = (K(a, c) + K(b, d) − K(a, d) − K(b, c))^2 .

If K is a positive definite kernel for individuals, then K_MLPK is a p.d. kernel for pairs which can be used by an SVM
This amounts to representing a pair (a, b) by the symmetrized difference:

(a, b) → (a − b)^{⊗2} .
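The analogous sketch for K_MLPK; with a linear base kernel the formula reduces exactly to the squared inner product of the differences a − b and c − d, which the last lines verify:

```python
import numpy as np

def linear_kernel(a, b):
    return float(a @ b)

def k_mlpk(pair1, pair2, K=linear_kernel):
    (a, b), (c, d) = pair1, pair2
    return (K(a, c) + K(b, d) - K(a, d) - K(b, c)) ** 2

rng = np.random.default_rng(0)
a, b, c, d = rng.standard_normal((4, 5))

# With a linear base kernel, K_MLPK is exactly ((a-b) . (c-d))^2, the
# squared inner product of the symmetrized differences:
print(k_mlpk((a, b), (c, d)))
print(float((a - b) @ (c - d)) ** 2)   # same value
```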

Supervised inference with local models

The idea (Bleakley et al., 2007)
Motivation: define specific models for each target node to discriminate between its neighbors and the others
Treat each node independently from the others, then combine predictions for ranking candidate edges.

[Figure: a target node with known neighbors labelled +1, known non-neighbors labelled −1, and candidate nodes labelled "?".]
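A minimal sketch of the local-model idea with scikit-learn (synthetic data; the edge-scoring scheme, averaging the two local predictions, is one reasonable choice rather than the paper's exact procedure):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_genes = 30
X = rng.standard_normal((n_genes, 10))           # genomic data per gene
known_edges = {(0, 1), (0, 2), (3, 4), (5, 6)}   # known subgraph (i < j)

def neighbours(t):
    return {j for i, j in known_edges if i == t} | \
           {i for i, j in known_edges if j == t}

scores = {}
for t in range(n_genes):
    y = np.array([1 if j in neighbours(t) else -1 for j in range(n_genes)])
    if len(set(y)) < 2:
        continue                                 # no known neighbour: skip
    model = SVC(kernel="rbf").fit(X, y)          # local model for node t
    for j, s in enumerate(model.decision_function(X)):
        edge = (min(t, j), max(t, j))
        if j != t and edge not in known_edges:
            # each candidate edge gets half of each endpoint's prediction
            scores[edge] = scores.get(edge, 0.0) + s / 2

top = sorted(scores, key=scores.get, reverse=True)[:5]
print("top-ranked candidate edges:", top)
```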

Results: protein-protein interaction (yeast)

[Figure: ROC curves (ratio of true positives vs. ratio of false positives) for the Direct, kML, kCCA, em, local, and Pkernel methods.]

(from Bleakley et al., 2007)

Results: metabolic gene network (yeast)

[Figure: ROC curves (ratio of true positives vs. ratio of false positives) for the same methods.]

(from Bleakley et al., 2007)

Results: regulatory network (E. coli)

[Figure: ROC curves (ratio of true positives vs. ratio of false positives) and precision-recall curves for CLR, SIRENE, and SIRENE-Bias.]

Method | Recall at 60% precision | Recall at 80% precision
SIRENE | 44.5% | 17.6%
CLR | 7.5% | 5.5%
Relevance networks | 4.7% | 3.3%
ARACNe | 1% | 0%
Bayesian network | 1% | 0%

SIRENE = Supervised Inference of REgulatory NEtworks (Mordelet and V., 2008)

Results: predicted regulatory network (E. coli)

Prediction at 60% precision, restricted to transcription factors (from Mordelet and V., 2008).



Virtual screening

Objective
Build models to predict biochemical properties of small molecules from their structures.

Structures (e.g., C15H14ClN3O3)

Properties:
binding to a therapeutic target,
pharmacokinetics (ADME),
toxicity...

Ligand-Based Virtual Screening and QSAR

[Figure: six example molecules labelled active or inactive. NCI AIDS screen results, from http://cactus.nci.nih.gov.]

Formalization

The problem
Given a set of training instances (x_1, y_1), ..., (x_n, y_n), where the x_i are graphs and the y_i are continuous or discrete variables of interest,
estimate a function

y = f(x)

where x is any graph to be labeled.
This is a classical regression or pattern recognition problem over the set of graphs.

Classical approaches

Two steps
1 Map each molecule to a vector of fixed dimension using molecular descriptors:
global properties of the molecule (mass, logP...),
2D and 3D descriptors (substructures, fragments, ...).
2 Apply an algorithm for regression or pattern recognition (PLS, ANN, ...).

Example: 2D structural keys

[Figure: example molecules and substructures illustrating 2D structural keys.]
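As a concrete illustration of step 1, a sketch assuming the RDKit toolkit (an assumption; the slides name no software), whose MACCS keys are one classical set of 2D structural keys:

```python
# Assumes the RDKit toolkit is installed (pip install rdkit).
from rdkit import Chem
from rdkit.Chem import MACCSkeys

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")   # aspirin
keys = MACCSkeys.GenMACCSKeys(mol)                  # MACCS structural keys
x = [int(b) for b in keys.ToBitString()]            # fixed-size 0/1 vector
print(sum(x), "keys set out of", len(x))
```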

Which descriptors?

[Figure: example molecules and substructures, as on the previous slide.]

Difficulties
Many descriptors are needed to characterize various features (in particular for 2D and 3D descriptors)
But too many descriptors are harmful for memory storage, computation speed, and statistical estimation

Kernels

Definition
Let Φ(x) = (Φ_1(x), ..., Φ_p(x)) be a vector representation of the molecule x
The kernel between two molecules is defined by:

K(x, x′) = Φ(x)^⊤ Φ(x′) = ∑_{i=1}^p Φ_i(x) Φ_i(x′) .

[Diagram: the feature map Φ from the space of molecules X to the feature space H.]

The kernel trick

K(x, x′) = Φ(x)^⊤ Φ(x′)

The trick
Many linear algorithms for regression or pattern recognition can be expressed only in terms of inner products between vectors
Computing the kernel is often more efficient than computing Φ(x), especially in high or infinite dimensions!
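A tiny numpy illustration of the trick for the degree-2 polynomial kernel K(x, x′) = (x^⊤x′)^2, whose explicit feature map Φ(x) = (x_i x_j)_{i,j} lives in dimension p^2:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 1000
x, xp = rng.standard_normal(p), rng.standard_normal(p)

# Kernel trick: O(p) work.
k_trick = float(x @ xp) ** 2

# Explicit feature map Phi(x) = (x_i * x_j) for all i, j: O(p^2) work.
def Phi(v):
    return np.outer(v, v).ravel()

k_explicit = float(Phi(x) @ Phi(xp))

print(np.isclose(k_trick, k_explicit))   # True: same inner product
```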

Example: 2D fragment kernel

[Figure: a molecule decomposed into linear fragments such as C-C, C=O, C-N, ...]

φ_d(x) is the vector of counts of all fragments of length d:

φ_1(x) = (#(C), #(O), #(N), ...)^⊤
φ_2(x) = (#(C-C), #(C=O), #(C-N), ...)^⊤ etc...

The 2D fragment kernel is defined, for λ < 1, by

K_fragment(x, x′) = ∑_{d=1}^∞ λ^d φ_d(x)^⊤ φ_d(x′) .
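An illustrative sketch (toy molecules, hand-rolled graph encoding, and the infinite sum truncated at a maximum fragment length, which the geometric weight λ^d justifies for λ < 1); real implementations use the closed-form computations mentioned on the next slide:

```python
from collections import Counter

def fragments(atoms, bonds, max_len):
    """Count label sequences of simple paths with 1..max_len atoms.
    Paths are enumerated from every start atom, so fragments are counted
    in both directions; this only changes the feature weighting."""
    adj = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    counts = Counter()

    def walk(path):
        counts[tuple(atoms[i] for i in path)] += 1
        if len(path) < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:              # simple paths only
                    walk(path + [nxt])

    for start in adj:
        walk([start])
    return counts

def k_fragment(mol1, mol2, lam=0.5, max_len=4):
    """Truncated sum over d of lam**d * <phi_d(x), phi_d(x')>."""
    c1 = fragments(*mol1, max_len)
    c2 = fragments(*mol2, max_len)
    return sum(lam ** len(f) * c1[f] * c2[f] for f in c1 if f in c2)

# Molecules as (atom labels, bond list); toy encoding without bond types.
acetic = (["C", "C", "O", "O"], [(0, 1), (1, 2), (1, 3)])
methanol = (["C", "O"], [(0, 1)])
print(k_fragment(acetic, methanol))
```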

Example: 2D fragment kernel

In practice
K_fragment can be computed efficiently (geometric kernel, random walk kernel...) although the feature space has infinite dimension.
Increasing the specificity of atom labels improves performance
Selecting only “non-tottering” fragments can be done efficiently and improves performance.

Example: 2D subtree kernel

[Figure: a molecule decomposed into subtrees of atoms rather than linear fragments.]

2D Subtree vs fragment kernels (Mahé and V., 2007)

[Figure: bar chart of AUC (roughly 70–80) for walk-based vs. subtree-based kernels across 60 cancer cell line screening datasets. Screening of inhibitors for 60 cancer cell lines (from Mahé and V., 2008).]

Example: 3D pharmacophore kernel (Mahé et al., 2005)

[Figure: two molecules, each with a three-point pharmacophore described by its pairwise distances d_1, d_2, d_3.]

K(x, y) = ∑_{p_x ∈ P(x)} ∑_{p_y ∈ P(y)} exp(−γ d(p_x, p_y)) .

Results (accuracy)

Kernel | BZR | COX | DHFR | ER
2D (Tanimoto) | 71.2 | 63.0 | 76.9 | 77.1
3D fingerprint | 75.4 | 67.0 | 76.9 | 78.6
3D not discretized | 76.4 | 69.8 | 81.9 | 79.8
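A simplified sketch of the kernel (illustrative assumptions: pharmacophoric points are unlabelled 3D points, P(x) enumerates all point triplets, and d compares the triangles of pairwise distances; the published kernel also matches point labels):

```python
import numpy as np
from itertools import permutations

def pharmacophores(coords):
    """Ordered triplets of 3D points, each summarized by its three
    pairwise distances (a crude stand-in for the set P(x))."""
    pts = [np.asarray(p) for p in coords]
    for i, j, k in permutations(range(len(pts)), 3):
        a, b, c = pts[i], pts[j], pts[k]
        yield np.array([np.linalg.norm(a - b),
                        np.linalg.norm(b - c),
                        np.linalg.norm(c - a)])

def k_pharmacophore(x, y, gamma=1.0):
    """Sum over pharmacophore pairs of exp(-gamma * d(px, py)), with d
    taken here as the distance between the two distance triplets."""
    return sum(np.exp(-gamma * np.linalg.norm(px - py))
               for px in pharmacophores(x)
               for py in pharmacophores(y))

rng = np.random.default_rng(0)
mol_x = rng.standard_normal((4, 3))      # toy 3D conformation, 4 points
mol_y = rng.standard_normal((5, 3))      # toy 3D conformation, 5 points
print(k_pharmacophore(mol_x, mol_y))
```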

Chemogenomics

The problem
Similar targets bind similar ligands
Instead of focusing on each target individually, can we screen the biological space (target families) vs. the chemical space (ligands)?
Mathematically, learn f(target, ligand) ∈ {bind, not bind}

Chemogenomics with SVM

Tensor product SVM
Take the kernel:

K((t, l), (t′, l′)) = K_t(t, t′) K_l(l, l′) .

Equivalently, represent a pair (t, l) by the vector φ_t(t) ⊗ φ_l(l)
This allows any kernel for proteins K_t to be combined with any kernel for small molecules K_l
When K_t is the Dirac kernel, we recover the classical paradigm: each target is treated independently from the others.
Otherwise, information is shared across targets. The more similar the targets, the more they share information.
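A minimal numpy sketch: on the full product set of targets and ligands, the Gram matrix of the tensor product kernel is the Kronecker product of the two Gram matrices, and a Dirac target kernel makes it block-diagonal (independent per-target models):

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((3, 6))      # 3 targets with feature vectors
Lig = rng.standard_normal((4, 5))    # 4 ligands with feature vectors

Kt = T @ T.T                         # any p.d. target kernel
Kl = Lig @ Lig.T                     # any p.d. ligand kernel

# Gram matrix over all 12 (target, ligand) pairs; row i*4+j is (t_i, l_j).
K_pairs = np.kron(Kt, Kl)

# Dirac kernel on targets: block-diagonal Gram matrix, i.e. independent
# per-target models with no information sharing.
K_indep = np.kron(np.eye(3), Kl)
print(K_pairs.shape, K_indep.shape)   # (12, 12) (12, 12)
```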

Example: MHC-I epitope prediction across different alleles

The approach (Jacob and V., 2007)
Take a kernel to compare different MHC-I alleles (e.g., based on the amino acids in the peptide recognition pocket)
Take a kernel to compare different epitopes (9-mer peptides)
Combine them to learn the f(allele, epitope) function
State-of-the-art performance
Available at http://cbio.ensmp.fr/kiss

Generalization: collaborative filtering with attributes

General problem: learn f(x, y) with a kernel K_x for x and a kernel K_y for y.
An SVM with a tensor product kernel K_x ⊗ K_y is a particular case of something more general: estimating an operator with spectral regularization.
Other spectral regularizations are possible (e.g., trace norm) and lead to efficient algorithms
More details in Abernethy et al. (2008).


Conclusion

Modern machine learning methods for regression/classification lend themselves well to the integration of prior knowledge in the penalization/regularization function.
Inference of biological networks can be formulated in the framework of pattern recognition.
Kernel methods (e.g., SVM) make it possible to manipulate complex objects (e.g., molecules, biological sequences) as soon as kernels can be defined and computed.

People I need to thank

Including prior knowledge in penalization
Franck Rapaport, Emmanuel Barillot, Andrei Zinovyev, Christian Lajaunie, Yves Vandenbrouck, Nicolas Foveau...

Virtual screening, kernels etc.
Pierre Mahé, Laurent Jacob, Liva Ralaivola, Véronique Stoven, Brice Hoffman, Martial Hue, Francis Bach, Jacob Abernethy, Theos Evgeniou...

Network inference
Kevin Bleakley, Fantine Mordelet, Yoshihiro Yamanishi, Gérard Biau, Minoru Kanehisa, William Noble, Jian Qiu...