data mining in bioinformatics day 8: graph mining for

Data Mining in BioinformaticsDay 8: Graph Mining for Chemoinformatics

and Drug Discovery

Chloé-Agathe Azencott

Machine Learning & Computational Biology Research GroupMPIs Tübingen

C.-A. Azencott Graph Mining for Chemoinformatics February 16, 2012 1

Drug Discovery

Modern Therapeutic Research

From serendipity to rationalized drug design

Ancient Greeks treatinfections with mould

CH 3

N

S

O

NH

O

HO

NH 2

O

HO

CH 3

Biapenem in PBP-1A


Drug Discovery

Drug Discovery Process


Drug Discovery



Drug Discovery

Chemoinformatics

How can computer science help?

→ Chemoinformatics!“...the mixing of information resources to transform data into information, and information into knowledge, forthe intended purpose of making better decisions faster in the arena of drug lead identification and optimisation.”– F. K. Brown

“... the application of informatics methods to solve chemical problems.” – J. Gasteiger and T. Engel


Drug Discovery

Chemoinformatics


Drug Discovery

The Chemical Space

É 1060 possible small moleculesÉ 1022 stars in the observable

universe

(Slide courtesy of Matthew A. Kayala)


Structure-Based Approaches


QSAR: Qualitative Structure-Activity Relationship (classification)QSPR: Quantititive Structure-Property Relationship (regression)



Representing Chemicals in silico

É Expert knowledge molecular descriptors→ hard, potentially incomplete

É How to get a “complete” enough representation?

CH 3

N

S

O

NH

O

HO

NH 2

O

HO

CH 3

Similar Property Principle: molecules having similar structuresshould exhibit similar activities.

É → Structure-based representations

É Compare molecules by comparing substructures



Molecular Graph

Undirected labeled graph



Feature Vectors based on Patterns

É Define feature vectors that record the presence/absence (ornumber of occurrences) of particular patterns in a givenmolecular graph

ϕ(A) = (ϕs(A))s substructure

where

ϕs(A) =

�

1 if s occurs in A0 otherwise

É Extension of traditional chemical fingerprints



Feature Vectors based on Patterns

Classical machine learning and data mining techniques can beapplied to these vectorial feature representations.

É Any distance / kernel can be used

É Dot product → “How many substructures do two compoundsshare?”

É Classification

É Feature selection

É Clustering



Tanimoto and MinMax

É Tanimoto (binary setting): k(A,B) = A∩BA∪B

É MinMax (counts setting):∑N

i=1 min(Ai,Bi)∑N

i=1 max(Ai,Bi)



Tanimoto and MinMax

Both Tanimoto and Minmax are kernels.É Proof for Tanimoto: J.C. Gower A general coefficient of similarity and

some of its properties. Biometrics 1971.

É Proof for MinMax:

MinMax(x,y) =⟨ϕ(x), ϕ(y)⟩

⟨ϕ(x), ϕ(x)⟩+ ⟨ϕ(y), ϕ(y)⟩ − ⟨ϕ(x), ϕ(y)⟩

with ϕ(x) of length: # patterns × max count

ϕ(x)i = 1 iff. the pattern indexed by bi/qc appears more thani mod q times in x



Fingerprint Compression

É Systematic enumeration → long, sparse vectorse.g. 50,000 random compounds from ChemDB→ 300,000 paths of length up to 8→ 300 non-zeros on average

É “Naive” CompressionÉ List the positions of the 1sÉ 219 = 524,288É average encoding: 300× 19 = 5,700 bits

É ModuloCompression(lossy)

É Elias-Gamma Monotone Encoding(lossless) [Baldi 2007]

É index j → blog(j)c 0 bits + binaryencoding of j

É ji < ji+1: blog(ji+1)c → blog(ji)− log(ji+1)cÉ average compressed size = 1,800 bits


Frequent Patterns MOLFEA

Frequent Pattern Mining: Mutagenicity InducingSubstructures

[Helma et al., 2004]

É P = positive (mutagenic) compoundsN = negative compounds

É features: fragments (= patterns) f such thatboth freq(f ,P) ≥ t and freq(f ,N ) ≤ t

É Limited to frequent linear patterns

É ML algorithm: SVM with linear or quadratic kernel


Frequent Patterns MOLFEA

Frequent Pattern Mining: Mutagenicity InducingSubstructures[Helma et al., 2004]

É CPDB – Carcinogenic Potency DataBaseÉ 684 compounds classified in 341 mutagens and 343

non-mutagens according to Ames test on Salmonella

1% 3% 5% 10%Frequency threshold

50

60

70

80

90

100

Cross-validated sensitivity

Mutagenicity prediction [Hema04]

Linear kernelQuadratic kernel


Frequent Patterns gBoost

gBoost[Saigo et al., 2008]

É Train data: {(Gn,yn)}n=1...l

É Stump: h(x : t,w) = w(2xt − 1) xt = 1 if x ⊆ G and 0 otherwiseÉ Decision function:

f (x) =∑

t∈T ,w∈{−1,+1}

αtw h(x : t,w)

∑lt,w

αtw = 1 and αtw ≥ 0 ∀t,w

É LPBoosting: minα,ξ,ρ

−ρ+Dl∑

n=1

ξn

s.t. ξn ≥ 0 ∀n,∑l

t,wαtw = 1 and αtw ≥ 0 ∀t,w

É Equivalent to solving minλ,γ

γ

s.t∑l

n=1 λnynh(xn : t,w) ≤ γ ∀t,w∑l

n=1 λn = 1 and 0 ≤ λn ≤ D ∀n



gBoost

[Saigo et al., 2008]

É Solve by “column generation”

É start with H =∅ and λn = 1/n ∀nÉ Iteratively:

É find (t∗,w∗) that maximizesg(t,w) =

∑ln=1 λnynh(xn : t,w)

É add (t∗,w∗) to HÉ update λn, γ

É Stop when 6 ∃(t∗,w∗) such that g(t∗,w∗) > γ+ ε



gBoost


É Finding (t∗,w∗): DFS code tree

É Pruning condition (g∗ optimal gain so far): if

max

2∑

n:yn=1,t⊆Gn

λn −l∑

n=1

ynλn,2∑

n:yn=−1,t⊆Gn

λn +l∑

n=1

ynλn

< g∗

then ∀t′ : t ⊆ t′,∀w′,g∗ > g(t′,w′)



gBoost


Application to CPDB:

É Accuracy similar toHelma et al. (79%)

É Most discriminativepatterns


All Patterns 2D Kernels

Paths-Based Fingerprints

É Labeled sub-paths (walks)

Figure: Some sub-paths of length 3



Circular Fingerprints

É Labeled sub-trees - Extended-Connectivity (or Circular)features

Figure: Example of a circular substructure of depth 2



Two-Dimensional Kernels[Azencott et al., 2007]

É Systematically extract paths / circular fingerprints,for various maximal depths

É SVM with Tanimoto / Minmax

É DataÉ Mutagenicity (Mutag): 188 compoundsÉ Benzodiazepine receptor affinity (BZR): 181+ 125

compoundsÉ Cyclooxygenase-2 ihibitors (COX2): 178+ 125

compoundsÉ Estrogen receptor affinity: 166+ 180 compounds

Data SVM Previous bestMutag 90.4% 85.2% (gBoost)BZR 79.8% 76.4%

COX2 70.1% 73.6%ER 82.1% 79.8%



Introducing Spatial Information

É 3D Histograms [Azencott et al, 2007]

É Groups of k atoms

É Associated size:É Pairwise distances (k = 2)É diameter of the smallest sphere that

contains all k atoms



Introducing Spatial Information

One histogram per class of k-tuple (e.g. C-C-C, C-C-O)



Hist3D Performance

Data 2D kernel Hist3D kernelMutag 90.4% 88.8%

BZR (loo) 82.0% 79.4%ER (loo) 87.0% 86.1%COX2 76.9% 78.6%



Introducing Spatial InformationÉ 3D Decomposition Kernels Ceroni et al, 2007

É Try to match patterns in 3D space as well

É e−γ(li−l′i )2

li (resp l′i): length of edge i in molecule x (resp. x′)

É edges are lexicographically orderedÉ kernel between two patterns:

Ks(σ, σ′) =r∏

i=1

δ(ei,e′i)e−γ(li−l′i )

2

É kernel between two molecules:

K(x,x′) =∑

σ∈x

∑

σ∈x′Ks(σ, σ′)

É can be reduced to a class of patterns (e.g. circularsubstructures)



3DDK Performance

Data 2D kernel Hist3D kernel 3DDK Circ3DDKMutag 90.4% 88.8% 86.7% 83.5%

BZR (loo) 82.0% 79.4% 78.4% 81.4%ER (loo) 87.0% 86.1% 82.3% 82.1%COX2 76.9% 78.6% 75.6% 75.2%

É How relevant is 3D information?É How good is 3D information?


Virtual HTS



Virtual HTS

High-Throughput Screening

É Assay a large libraryof potential drugsagainst their target

É Very costly

É → docking

É → virtualhigh-throughputscreening (vHTS)


Virtual HTS Measuring Performance

Imbalanced Data

É Typically, most compounds are inactive ⇒ many more negativethan positive examples

É E.g. DHFR data setÉ 99,995 chemicals screened for activity against

dihydrofolate reductaseÉ < 0.2% active compounds

É Accuracy is not appropriate:predicting all compounds negative ⇒ accuracy = 99.8%

É sensitivity = # True Positives# Positives

É specificity =# True Negatives

# NegativesÉ For many methods, the output is continuous⇒ accuracy, sensitivity and specificity depend on a threshold θ



Receiver-Operator Characteristic Curves

É For all possible values of θ, report sensitivity and 1− specificityÉ AUROC (Area under the ROC Curve) is a numerical measure of

peformanceÉ AUROC(random) = 0.5 and AUROC(optimal) = 1

0 1/6 1/3 1/2 2/3 5/6 1

01

/42

/43

/41

False Positive Rate

Tru

e P

ositiv

e R

ate

x

x x

x

x x x x

x x x

Inf

0.95 0.94

0.9

0.81 0.73 0.52 0.2

0.17 0.12 0.09

random

perfect

real

label prediction+ 0.95- 0.94+ 0.90+ 0.81- 0.73- 0.52- 0.20+ 0.17- 0.12- 0.09



Inhibition of DHFR: ROC Curves

method AUCIRV 0.71SVM 0.59kNN 0.59

MAX-SIM 0.54

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

RANDOM

IRV

SVM

MAXSIM



Precision-Recall Curves

É Precision =# True Positives

# Predicted PositivesÉ Recall = sensitivity

0 1/4 2/4 3/4 1

01

/52

/53

/54

/51

Recall

Pre

cis

ion

x

x

x

x

x

x

x

xxx

0.95

0.94

0.9

0.81

0.73

0.52

0.2

0.170.120.09

perfect

real


Other Graph Mining Applications

Other Applications

É Database indexing and searchÉ Prediction of 3D structures of small compounds and

proteinsÉ Reaction Prediction


Other Graph Mining Applications

ReferencesC.-A. Azencott, A. Ksikes, S. J. Swamidass, J. H. Chen, L. Ralaivola, and P. Baldi.One- to Four-Dimensional Kernels for Virtual Screening and the Prediction of Physical, Chemical, andBiological Properties. J. Chem. Inf. Model, 2007http://www.igb.uci.edu/~pfbaldi/publications/journals/2006/ci600397p.pdf

P. Baldi, R. Benz, J. S. Swamidass, and D. S. Hirschberg.Lossless Compression of Chemical Fingerprints Using Integer Entropy Codes Improves Storage andRetrieval J. Chem. Inf. Model, 2007 http://www.ics.uci.edu/~dan/pubs/ci700200n.pdf

A. Ceroni, F. Costa, and P. Frasconi.Classification of Small Molecules by Two- and Three-Dimensional Decomposition Kernels. Bioinformatics,2007 http://bioinformatics.oxfordjournals.org/content/23/16/2038

T. Fawcett.ROC Graphs: Notes and Practical Considerations for Researchers HP Labs Tech Report, 2004http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.9777

C. Helma, T. Cramer, S. Kramer, and L. De Raedt.Data Mining and Machine Learning Techniques for the Identification of Mutagenicity Inducing Substructuresand Structure Activity Relationships of Noncongeneric Compounds. J. Chem. Inf. Comput. Sci., 2004http://cbio.ensmp.fr/~jvert/svn/bibli/local/Helma2004Data.pdf

H. Saigo, S. Nowozin, T. Kadowaki, T. Kudo, and K. Tusda.gBoost: a mathematical programming approach to graph classification and regression Mach. Learn., 2009http://www.nowozin.net/sebastian/papers/saigo2008gboost.pdf


http://www.igb.uci.edu/~pfbaldi/publications/journals/2006/ci600397p.pdf

http://www.ics.uci.edu/~dan/pubs/ci700200n.pdf

http://bioinformatics.oxfordjournals.org/content/23/16/2038

http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.9777

http://cbio.ensmp.fr/~jvert/svn/bibli/local/Helma2004Data.pdf

http://www.nowozin.net/sebastian/papers/saigo2008gboost.pdf

data mining in bioinformatics day 8: graph mining for

Documents