data mining in bioinformatics day 8: graph mining for

38
Data Mining in Bioinformatics Day 8: Graph Mining for Chemoinformatics and Drug Discovery Chloé-Agathe Azencott Machine Learning & Computational Biology Research Group MPIs Tübingen C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 1

Upload: others

Post on 11-Feb-2022

4 views

Category:

Documents


0 download

TRANSCRIPT

Data Mining in BioinformaticsDay 8: Graph Mining for Chemoinformatics

and Drug Discovery

Chloé-Agathe Azencott

Machine Learning & Computational Biology Research GroupMPIs Tübingen

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 1

Drug Discovery

Modern Therapeutic Research

From serendipity to rationalized drug design

Ancient Greeks treatinfections with mould

CH 3

N

S

O

NH

O

HO

NH 2

O

HO

CH 3

Biapenem in PBP-1A

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 2

Drug Discovery

Drug Discovery Process

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 3

Drug Discovery

Drug Discovery Process

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 4

Drug Discovery

Drug Discovery Process

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 5

Drug Discovery

Chemoinformatics

How can computer science help?

→ Chemoinformatics!“...the mixing of information resources to transform data into information, and information into knowledge, forthe intended purpose of making better decisions faster in the arena of drug lead identification and optimisation.”– F. K. Brown

“... the application of informatics methods to solve chemical problems.” – J. Gasteiger and T. Engel

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 6

Drug Discovery

Chemoinformatics

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 7

Drug Discovery

The Chemical Space

É 1060 possible small moleculesÉ 1022 stars in the observable

universe

(Slide courtesy of Matthew A. Kayala)

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 8

Structure-Based Approaches

Drug Discovery Process

QSAR: Qualitative Structure-Activity Relationship (classification)QSPR: Quantititive Structure-Property Relationship (regression)

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 9

Structure-Based Approaches

Representing Chemicals in silico

É Expert knowledge molecular descriptors→ hard, potentially incomplete

É How to get a “complete” enough representation?

CH 3

N

S

O

NH

O

HO

NH 2

O

HO

CH 3

Similar Property Principle: molecules having similar structuresshould exhibit similar activities.

É → Structure-based representations

É Compare molecules by comparing substructures

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 10

Structure-Based Approaches

Molecular Graph

Undirected labeled graph

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 11

Structure-Based Approaches

Feature Vectors based on Patterns

É Define feature vectors that record the presence/absence (ornumber of occurrences) of particular patterns in a givenmolecular graph

ϕ(A) = (ϕs(A))s substructure

where

ϕs(A) =

1 if s occurs in A0 otherwise

É Extension of traditional chemical fingerprints

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 12

Structure-Based Approaches

Feature Vectors based on Patterns

Classical machine learning and data mining techniques can beapplied to these vectorial feature representations.

É Any distance / kernel can be used

É Dot product → “How many substructures do two compoundsshare?”

É Classification

É Feature selection

É Clustering

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 13

Structure-Based Approaches

Tanimoto and MinMax

É Tanimoto (binary setting): k(A,B) = A∩BA∪B

É MinMax (counts setting):∑N

i=1 min(Ai,Bi)∑N

i=1 max(Ai,Bi)

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 14

Structure-Based Approaches

Tanimoto and MinMax

Both Tanimoto and Minmax are kernels.É Proof for Tanimoto: J.C. Gower A general coefficient of similarity and

some of its properties. Biometrics 1971.

É Proof for MinMax:

MinMax(x,y) =⟨ϕ(x), ϕ(y)⟩

⟨ϕ(x), ϕ(x)⟩+ ⟨ϕ(y), ϕ(y)⟩ − ⟨ϕ(x), ϕ(y)⟩

with ϕ(x) of length: # patterns × max count

ϕ(x)i = 1 iff. the pattern indexed by bi/qc appears more thani mod q times in x

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 15

Structure-Based Approaches

Fingerprint Compression

É Systematic enumeration → long, sparse vectorse.g. 50,000 random compounds from ChemDB→ 300,000 paths of length up to 8→ 300 non-zeros on average

É “Naive” CompressionÉ List the positions of the 1sÉ 219 = 524,288É average encoding: 300× 19 = 5,700 bits

É ModuloCompression(lossy)

É Elias-Gamma Monotone Encoding(lossless) [Baldi 2007]

É index j → blog(j)c 0 bits + binaryencoding of j

É ji < ji+1: blog(ji+1)c → blog(ji)− log(ji+1)cÉ average compressed size = 1,800 bits

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 16

Frequent Patterns MOLFEA

Frequent Pattern Mining: Mutagenicity InducingSubstructures

[Helma et al., 2004]

É P = positive (mutagenic) compoundsN = negative compounds

É features: fragments (= patterns) f such thatboth freq(f ,P) ≥ t and freq(f ,N ) ≤ t

É Limited to frequent linear patterns

É ML algorithm: SVM with linear or quadratic kernel

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 17

Frequent Patterns MOLFEA

Frequent Pattern Mining: Mutagenicity InducingSubstructures[Helma et al., 2004]

É CPDB – Carcinogenic Potency DataBaseÉ 684 compounds classified in 341 mutagens and 343

non-mutagens according to Ames test on Salmonella

1% 3% 5% 10%Frequency threshold

50

60

70

80

90

100

Cross-validated sensitivity

Mutagenicity prediction [Hema04]

Linear kernelQuadratic kernel

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 18

Frequent Patterns gBoost

gBoost[Saigo et al., 2008]

É Train data: {(Gn,yn)}n=1...l

É Stump: h(x : t,w) = w(2xt − 1) xt = 1 if x ⊆ G and 0 otherwiseÉ Decision function:

f (x) =∑

t∈T ,w∈{−1,+1}

αtw h(x : t,w)

∑lt,w

αtw = 1 and αtw ≥ 0 ∀t,w

É LPBoosting: minα,ξ,ρ

−ρ+Dl∑

n=1

ξn

s.t. ξn ≥ 0 ∀n,∑l

t,wαtw = 1 and αtw ≥ 0 ∀t,w

É Equivalent to solving minλ,γ

γ

s.t∑l

n=1 λnynh(xn : t,w) ≤ γ ∀t,w∑l

n=1 λn = 1 and 0 ≤ λn ≤ D ∀n

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 19

Frequent Patterns gBoost

gBoost

[Saigo et al., 2008]

É Solve by “column generation”

É start with H =∅ and λn = 1/n ∀nÉ Iteratively:

É find (t∗,w∗) that maximizesg(t,w) =

∑ln=1 λnynh(xn : t,w)

É add (t∗,w∗) to HÉ update λn, γ

É Stop when 6 ∃(t∗,w∗) such that g(t∗,w∗) > γ+ ε

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 20

Frequent Patterns gBoost

gBoost

[Saigo et al., 2008]

É Finding (t∗,w∗): DFS code tree

É Pruning condition (g∗ optimal gain so far): if

max

2∑

n:yn=1,t⊆Gn

λn −l∑

n=1

ynλn,2∑

n:yn=−1,t⊆Gn

λn +l∑

n=1

ynλn

< g∗

then ∀t′ : t ⊆ t′,∀w′,g∗ > g(t′,w′)

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 21

Frequent Patterns gBoost

gBoost

[Saigo et al., 2008]

Application to CPDB:

É Accuracy similar toHelma et al. (79%)

É Most discriminativepatterns

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 22

All Patterns 2D Kernels

Paths-Based Fingerprints

É Labeled sub-paths (walks)

Figure: Some sub-paths of length 3

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 23

All Patterns 2D Kernels

Circular Fingerprints

É Labeled sub-trees - Extended-Connectivity (or Circular)features

Figure: Example of a circular substructure of depth 2

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 24

All Patterns 2D Kernels

Two-Dimensional Kernels[Azencott et al., 2007]

É Systematically extract paths / circular fingerprints,for various maximal depths

É SVM with Tanimoto / Minmax

É DataÉ Mutagenicity (Mutag): 188 compoundsÉ Benzodiazepine receptor affinity (BZR): 181+ 125

compoundsÉ Cyclooxygenase-2 ihibitors (COX2): 178+ 125

compoundsÉ Estrogen receptor affinity: 166+ 180 compounds

Data SVM Previous bestMutag 90.4% 85.2% (gBoost)BZR 79.8% 76.4%

COX2 70.1% 73.6%ER 82.1% 79.8%

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 25

All Patterns 3D Kernels

Introducing Spatial Information

É 3D Histograms [Azencott et al, 2007]

É Groups of k atoms

É Associated size:É Pairwise distances (k = 2)É diameter of the smallest sphere that

contains all k atoms

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 26

All Patterns 3D Kernels

Introducing Spatial Information

One histogram per class of k-tuple (e.g. C-C-C, C-C-O)

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 27

All Patterns 3D Kernels

Hist3D Performance

Data 2D kernel Hist3D kernelMutag 90.4% 88.8%

BZR (loo) 82.0% 79.4%ER (loo) 87.0% 86.1%COX2 76.9% 78.6%

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 28

All Patterns 3D Kernels

Introducing Spatial InformationÉ 3D Decomposition Kernels Ceroni et al, 2007

É Try to match patterns in 3D space as well

É e−γ(li−l′i )2

li (resp l′i): length of edge i in molecule x (resp. x′)

É edges are lexicographically orderedÉ kernel between two patterns:

Ks(σ, σ′) =r∏

i=1

δ(ei,e′i)e−γ(li−l′i )

2

É kernel between two molecules:

K(x,x′) =∑

σ∈x

σ∈x′Ks(σ, σ′)

É can be reduced to a class of patterns (e.g. circularsubstructures)

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 29

All Patterns 3D Kernels

3DDK Performance

Data 2D kernel Hist3D kernel 3DDK Circ3DDKMutag 90.4% 88.8% 86.7% 83.5%

BZR (loo) 82.0% 79.4% 78.4% 81.4%ER (loo) 87.0% 86.1% 82.3% 82.1%COX2 76.9% 78.6% 75.6% 75.2%

É How relevant is 3D information?É How good is 3D information?

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 30

Virtual HTS

Drug Discovery Process

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 31

Virtual HTS

High-Throughput Screening

É Assay a large libraryof potential drugsagainst their target

É Very costly

É → docking

É → virtualhigh-throughputscreening (vHTS)

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 32

Virtual HTS Measuring Performance

Imbalanced Data

É Typically, most compounds are inactive ⇒ many more negativethan positive examples

É E.g. DHFR data setÉ 99,995 chemicals screened for activity against

dihydrofolate reductaseÉ < 0.2% active compounds

É Accuracy is not appropriate:predicting all compounds negative ⇒ accuracy = 99.8%

É sensitivity = # True Positives# Positives

É specificity =# True Negatives

# NegativesÉ For many methods, the output is continuous⇒ accuracy, sensitivity and specificity depend on a threshold θ

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 33

Virtual HTS Measuring Performance

Receiver-Operator Characteristic Curves

É For all possible values of θ, report sensitivity and 1− specificityÉ AUROC (Area under the ROC Curve) is a numerical measure of

peformanceÉ AUROC(random) = 0.5 and AUROC(optimal) = 1

0 1/6 1/3 1/2 2/3 5/6 1

01

/42

/43

/41

False Positive Rate

Tru

e P

ositiv

e R

ate

x

x x

x

x x x x

x x x

Inf

0.95 0.94

0.9

0.81 0.73 0.52 0.2

0.17 0.12 0.09

random

perfect

real

label prediction+ 0.95- 0.94+ 0.90+ 0.81- 0.73- 0.52- 0.20+ 0.17- 0.12- 0.09

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 34

Virtual HTS Measuring Performance

Inhibition of DHFR: ROC Curves

method AUCIRV 0.71SVM 0.59kNN 0.59

MAX-SIM 0.54

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

RANDOM

IRV

SVM

MAXSIM

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 35

Virtual HTS Measuring Performance

Precision-Recall Curves

É Precision =# True Positives

# Predicted PositivesÉ Recall = sensitivity

0 1/4 2/4 3/4 1

01

/52

/53

/54

/51

Recall

Pre

cis

ion

x

x

x

x

x

x

x

xxx

0.95

0.94

0.9

0.81

0.73

0.52

0.2

0.170.120.09

perfect

real

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 36

Other Graph Mining Applications

Other Applications

É Database indexing and searchÉ Prediction of 3D structures of small compounds and

proteinsÉ Reaction Prediction

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 37

Other Graph Mining Applications

ReferencesC.-A. Azencott, A. Ksikes, S. J. Swamidass, J. H. Chen, L. Ralaivola, and P. Baldi.One- to Four-Dimensional Kernels for Virtual Screening and the Prediction of Physical, Chemical, andBiological Properties. J. Chem. Inf. Model, 2007http://www.igb.uci.edu/~pfbaldi/publications/journals/2006/ci600397p.pdf

P. Baldi, R. Benz, J. S. Swamidass, and D. S. Hirschberg.Lossless Compression of Chemical Fingerprints Using Integer Entropy Codes Improves Storage andRetrieval J. Chem. Inf. Model, 2007 http://www.ics.uci.edu/~dan/pubs/ci700200n.pdf

A. Ceroni, F. Costa, and P. Frasconi.Classification of Small Molecules by Two- and Three-Dimensional Decomposition Kernels. Bioinformatics,2007 http://bioinformatics.oxfordjournals.org/content/23/16/2038

T. Fawcett.ROC Graphs: Notes and Practical Considerations for Researchers HP Labs Tech Report, 2004http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.9777

C. Helma, T. Cramer, S. Kramer, and L. De Raedt.Data Mining and Machine Learning Techniques for the Identification of Mutagenicity Inducing Substructuresand Structure Activity Relationships of Noncongeneric Compounds. J. Chem. Inf. Comput. Sci., 2004http://cbio.ensmp.fr/~jvert/svn/bibli/local/Helma2004Data.pdf

H. Saigo, S. Nowozin, T. Kadowaki, T. Kudo, and K. Tusda.gBoost: a mathematical programming approach to graph classification and regression Mach. Learn., 2009http://www.nowozin.net/sebastian/papers/saigo2008gboost.pdf

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 38