Chemical Spaces: Modeling, Exploration & Understanding


Page 1: Chemical Spaces: Modeling, Exploration & Understanding

Overview

Modeling & Algorithms

Tool Development

Chemical Space: Modeling, Exploration & Understanding

Rajarshi Guha

School of Informatics, Indiana University

16th August, 2006

Page 2: Chemical Spaces: Modeling, Exploration & Understanding

Outline

1 Overview

2 Modeling & Algorithms
  Aspects of QSAR Modeling
  Exploring Chemical Space
  Adding Meaning to Chemical Information

3 Tool Development
  CDK
  R

Page 4: Chemical Spaces: Modeling, Exploration & Understanding

Goals of Cheminformatics Research at IU

Extend the state of the art in cheminformatics

Extract chemistry, not just R², q² et al.

Provide intuitive and efficient ways to handle chemical information

Supply expertise to bench chemists and non-computational chemists

Page 5: Chemical Spaces: Modeling, Exploration & Understanding

Statistical Modeling of Chemical Information

QSAR model development

  Lazy regression
  Ensemble descriptor selection
  Wavelet-based spectral descriptors

Interpretation of QSAR models

Measuring model applicability

Page 6: Chemical Spaces: Modeling, Exploration & Understanding

Cheminformatics Algorithms

R-NN curves

  Outlier detection
  Cluster cardinality

LSH based applications

Ensemble descriptor selection

Interpretation techniques

Page 7: Chemical Spaces: Modeling, Exploration & Understanding

Tools and Pipelines for Cheminformatics

Contributions to the CDK

Packages for R

  rcdk
  spe
  fingerprint
  spectral clustering
  rpubchem (to come)

Automated QSAR pipeline (collaboration with P&G)

Development of workflows

Page 9: Chemical Spaces: Modeling, Exploration & Understanding

QSAR Model Interpretation

Why interpret?

Predictive ability is useful for screening

Interpretation provides extra value

Interpretability might be more useful than predictive ability

Interpretability depends on modeling technique and descriptors involved

Page 10: Chemical Spaces: Modeling, Exploration & Understanding

QSAR Model Interpretation

The trade-offs in interpretation

Interpretability is usually a trade-off with accuracy

OLS models are easily interpretable, but not always accurate

CNN (computational neural network) models are usually black boxes, but more accurate

Some methods, such as random forests (RF), lie in between

Page 11: Chemical Spaces: Modeling, Exploration & Understanding

Detailed Interpretation Methods

OLS Models

Build a good model

Run it through PLS

The X-loadings indicate which descriptors are important

  magnitude lets us rank them
  sign lets us indicate the nature of their effects

Allows us to identify effects of descriptors on a molecule-wise basis

CNN Models

Analogous to PLS based interpretations

Linearizes the CNN

Information is lost

Resultant interpretations match the corresponding interpretations for an OLS model quite well

Guha, R. et al., J. Chem. Inf. Model., 2005, 45, 321–333

Guha, R. et al., J. Chem. Inf. Comput. Sci., 2004, 44, 1440–1449
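The idea of reading descriptor importance off the PLS X-loadings can be illustrated with a small sketch. It computes only the first PLS component in plain Python, on made-up data, and is not the procedure from the cited papers; the function names and the toy descriptor matrix are assumptions introduced here for illustration.

```python
import math

def center(values):
    """Subtract the mean from a list of numbers."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def first_pls_component(X, y):
    """First PLS component for descriptor rows X and activities y.

    Hypothetical helper: returns (weights, scores, x_loadings).
    """
    n, p = len(X), len(X[0])
    cols = [center([row[j] for row in X]) for j in range(p)]
    yc = center(y)
    # Weight vector is proportional to X^T y, scaled to unit length
    w = [sum(col[i] * yc[i] for i in range(n)) for col in cols]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores t = X w
    t = [sum(cols[j][i] * w[j] for j in range(p)) for i in range(n)]
    tt = sum(v * v for v in t)
    # X-loadings p = X^T t / (t^T t)
    loadings = [sum(col[i] * t[i] for i in range(n)) / tt for col in cols]
    return w, t, loadings

# Toy data (invented): activity rises with descriptor 0, falls with descriptor 1,
# and is nearly unrelated to descriptor 2
X = [[1, 5, 0.1], [2, 4, 0.3], [3, 3, 0.2], [4, 2, 0.4], [5, 1, 0.2]]
y = [1.1, 2.0, 2.9, 4.2, 5.0]
weights, scores, loadings = first_pls_component(X, y)

# Rank descriptors by loading magnitude; the sign gives the direction of the effect
ranking = sorted(range(len(loadings)), key=lambda j: -abs(loadings[j]))
```

In this toy case the first two descriptors get loadings of opposite sign and larger magnitude than the third, which mirrors the "magnitude ranks, sign indicates the nature of the effect" reading above.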

Page 12: Chemical Spaces: Modeling, Exploration & Understanding

Extensions of Interpretation Methods

The CNN interpretation uses significant approximations and does not make full use of biases

Develop interpretation techniques for other methods, RF in particular

Ensemble descriptor selection to choose a subset that is simultaneously good for multiple model types

Use these methods on real datasets

Page 13: Chemical Spaces: Modeling, Exploration & Understanding

QSAR Model Applicability

Model Validation

Goal is to test the reliability of the model

Ensures that the model is not due to chance factors

Based on dataset used to develop the model

Model Applicability

Goal is to test the applicability of the model to new compounds

Tells us: The model will predict the activity well (or not)

Similar to confidence measures

Page 14: Chemical Spaces: Modeling, Exploration & Understanding

What is Model Applicability

Question?

How will a model perform when faced with molecules that it has not been trained on or validated with?

Aspects

Similarity to the training set (TSET)?

Can we consider a global chemistry space?

Structural or statistical similarity?

Quantitative or qualitative?

Page 15: Chemical Spaces: Modeling, Exploration & Understanding

How to Assess Model Applicability

Define Model Performance

Performance is measured by prediction residuals. The model performs well on a new molecule if it predicts its activity with low residual error.

Correlate ’X’ With Performance

’X’ could be similarity between a query molecule and the original training set

’X’ could be derived from a cluster membership approach

Alternatively, predict performance itself

Guha, R.; Jurs, P.C; J. Chem. Inf. Model., 2005, 45, 65–73
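One way to picture "correlate 'X' with performance" is sketched below. It assumes 'X' is the mean Euclidean distance from a query molecule to its k nearest training-set molecules in descriptor space, and correlates that with the absolute prediction residual. The helper names, the choice of k and all numbers are assumptions for illustration, not the measure used in the cited paper.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_to_training_set(query, training, k=3):
    """Mean distance from a query vector to its k nearest training vectors (hypothetical 'X')."""
    dists = sorted(euclidean(query, t) for t in training)
    k = min(k, len(dists))
    return sum(dists[:k]) / k

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented descriptor vectors and residuals, for illustration only
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
queries = [(0.5, 0.5), (2.0, 2.0), (4.0, 4.0)]
abs_residuals = [0.1, 0.6, 1.5]

x_values = [distance_to_training_set(q, training) for q in queries]
r = pearson(x_values, abs_residuals)
```

A strong positive correlation on real data would suggest that distance to the training set is a usable indicator of where the model should not be trusted.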

Page 16: Chemical Spaces: Modeling, Exploration & Understanding

Nearest Neighbor Methods

Traditional kNN methods are simple, fast, intuitive

Applications in

  regression & classification
  diversity analysis

Can be misleading if the nearest neighbor is far away

R-NN methods may be more suitable

Page 17: Chemical Spaces: Modeling, Exploration & Understanding

R-NN Curves for Diversity Analysis

The question . . .

Does the variation of nearest neighbor count with radius allow us to characterize the location of a query point in a dataset?

The answer . . . R-NN curves

Dmax ← max pairwise distance
for molecule in dataset do
    R ← 0.01 × Dmax
    while R ≤ Dmax do
        Find NN’s within radius R
        Increment R
    end while
end for
Plot NN count vs. R

[Figure: two example R-NN curves; x-axis: R (0 to 100), y-axis: Number of Neighbors (0 to 250)]

Guha, R., et al., J. Chem. Inf. Model., 2006, 46, 1713–1772
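The R-NN curve procedure above can be sketched in a few lines of plain Python. This version assumes Euclidean distance on descriptor vectors and steps the radius from 1% to 100% of Dmax in 1% increments; the function name `rnn_curves` and the toy points are introduced here and are not from the cited paper.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rnn_curves(points, steps=100):
    """Return (radii, curves); curves[i][s] is the number of neighbors of
    point i within radii[s]. Radii run from Dmax/steps up to Dmax."""
    n = len(points)
    dist = [[euclidean(p, q) for q in points] for p in points]
    d_max = max(max(row) for row in dist)
    radii = [d_max * (s / steps) for s in range(1, steps + 1)]
    curves = []
    for i in range(n):
        neighbor_dists = [dist[i][j] for j in range(n) if j != i]
        curves.append([sum(1 for d in neighbor_dists if d <= r) for r in radii])
    return radii, curves

# Toy dataset (invented): a tight group of four points and one distant point
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
radii, curves = rnn_curves(points)
```

For a point in the dense group the curve rises almost immediately, while for the distant point it stays at zero until R is close to Dmax, which is the behavior the plots on this slide contrast. The all-pairs distance matrix makes this quadratic in the number of molecules, so it is only a sketch for small datasets.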

Page 19: Chemical Spaces: Modeling, Exploration & Understanding

Characterizing R-NN Curves

Page 20: Chemical Spaces: Modeling, Exploration & Understanding

R-NN Curves and Outliers

[Figure: Rmax(S) plotted against serial number (0 to 4000) for every molecule in the dataset; molecules 1861 and 3904 are labeled]

A single plot identifies the location characteristics of all the molecules

[Figure: 2D structure diagrams of molecules 3904 and 1861]

Kazius, J.; McGuire, R.; Bursi, R.; J. Med. Chem. 2005, 48, 312–320

Page 21: Chemical Spaces: Modeling, Exploration & Understanding

R-NN Curves and Clusters

[Figure: 2D scatter plot of the dataset with points 84, 88, 137 and 251 labeled, and four panels showing the R-NN curves for points 251, 84, 88 and 137 (x-axis 0 to 100, y-axis 0 to 300)]

Smoothed R-NN Curves

R-NN curves are indicative of the number of clusters

Page 22: Chemical Spaces: Modeling, Exploration & Understanding

R-NN Curves and Clusters

Counting the steps

Essentially a curve matching problem

Not all points will be indicative of the number of clusters

Not applicable for concentric clusters

Approaches

Hausdorff / Fréchet distance (requires canonical curves)

RMSE from distance matrix

Slope analysis

Page 23: Chemical Spaces: Modeling, Exploration & Understanding

R-NN Curves and Their Slopes

[Figure: top row, R-NN curves for points 251, 84, 88 and 137 (x-axis 0 to 100, y-axis 0 to 300); bottom row, slope of the R-NN curve for the same four points]

Smoothed first derivative of the R-NN Curves

Identifying peaks identifies the number of clusters

Automated picking can identify spurious peaks

Page 25: Chemical Spaces: Modeling, Exploration & Understanding

Slope Analysis of R-NN Curves

Procedure

for i in molecules do
    Evaluate R-NN curve
    F ← smoothed R-NN curve
    Evaluate F′′
    Smooth F′′
    Nroot,i ← no. of roots of F′′
end for
Ncluster = [max(Nroot) + 1]/2

Possible improvements

Sample from the collection of R-NN curves

Improve handling of concentric clusters
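The slope-analysis procedure can be sketched as follows. A curve with k steps has k peaks in its first derivative, so its second derivative changes sign 2k - 1 times, which is where Ncluster = (Nroot + 1)/2 comes from. The sketch assumes a moving-average smoother, a finite-difference second derivative, and a small tolerance so that numerically flat regions are not counted as roots; all of these choices, the function names and the synthetic logistic curves are assumptions made here, not details from the slides.

```python
import math

def moving_average(values, window=5):
    """Centered moving average; the window is truncated at the ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def second_difference(values):
    """Finite-difference estimate of the second derivative (unit spacing)."""
    return [values[i - 1] - 2 * values[i] + values[i + 1]
            for i in range(1, len(values) - 1)]

def count_roots(values, tol=1e-6):
    """Count sign changes, ignoring entries smaller than tol times the largest magnitude."""
    scale = max(abs(v) for v in values) or 1.0
    signs = [1 if v > 0 else -1 for v in values if abs(v) > tol * scale]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

def estimate_cluster_count(curves, window=5):
    n_roots = []
    for curve in curves:
        f = moving_average(curve, window)                      # F: smoothed curve
        f2 = moving_average(second_difference(f), window)      # smoothed F''
        n_roots.append(count_roots(f2))
    return (max(n_roots) + 1) // 2

def logistic_steps(centers, height=100.0, width=2.0, n=101):
    """Synthetic stand-in for an R-NN curve: one smooth step per center."""
    return [sum(height / (1.0 + math.exp(-(r - c) / width)) for c in centers)
            for r in range(n)]

curves = [logistic_steps([50]), logistic_steps([30, 70])]
n_clusters = estimate_cluster_count(curves)
```

Real R-NN curves are integer-valued step functions, so in practice the smoothing window and tolerance would need tuning; spurious roots are the same problem as the spurious peaks noted for automated peak picking.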

[Figure: three panels plotted against R (0 to 100): the R-NN curve, its first derivative D', and its second derivative D'']

Page 26: Chemical Spaces: Modeling, Exploration & Understanding

Preliminary Results

Simulated 2D data

Predicted k, followed by kmeans clustering using k

Investigated similar values of k

k   ASW
4   0.61
3   0.74
5   0.70

k   ASW
3   0.44
2   0.65
4   0.47

k   ASW
4   0.64
3   0.48
5   0.56

ASW - average silhouette width, higher is better; k - number of clusters
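For reference, the average silhouette width used in these tables can be computed as in the sketch below. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the silhouette is (b - a) / max(a, b). The sketch assumes Euclidean distance, at least two clusters, and the usual convention that a singleton cluster contributes 0; the function name and toy points are introduced here.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_silhouette_width(points, labels):
    """Average silhouette width for a clustering given as one label per point."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # singleton clusters contribute a silhouette of 0
        a = sum(euclidean(p, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(euclidean(p, points[j]) for j in members) / len(members)
            for lab, members in clusters.items() if lab != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)

# Toy example (invented): two well separated pairs of points
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = average_silhouette_width(points, [0, 0, 1, 1])
bad = average_silhouette_width(points, [0, 1, 0, 1])
```

The sensible labeling scores close to 1 and the scrambled labeling scores below 0, which is the sense in which "higher is better".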

Page 27: Chemical Spaces: Modeling, Exploration & Understanding

Ontologies

What is an ontology?

Defines a controlled vocabulary (keywords)

Defines a set of relationships between them

Allows for

  meaning to be added to data and algorithms
  automated inference of relationships

An example is the Gene Ontology

Page 28: Chemical Spaces: Modeling, Exploration & Understanding

Ontologies in Chemistry

What’s available?

No truly comprehensive ontology is available

Some work is under way to add chemistry semantics to biology-related ontologies

What can we do?

Start small: focus on one area of chemistry, descriptors

The CDK currently provides an ontology for implemented descriptors

Vocabulary includes

  author
  literature reference
  class (molecular, atomic, bond)
  type (geometric, electronic, . . . )

Page 29: Chemical Spaces: Modeling, Exploration & Understanding

How Can We Use Ontologies?

Allow interoperability of software

Different programs use different naming schemes

Programs may be open- or closed-source

Annotation via dictionaries allows us to draw conclusions regarding descriptors, suggest similar descriptors, etc.

Utilize expertise from chemists

Some descriptors are useful for certain properties

Chemists who use models can tag descriptors as useful for a property

Over time the tagging can be indicative of the utility of certain descriptors
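A rough sketch of how dictionary annotations and chemist tagging might be represented is given below. It is purely illustrative: the `DescriptorEntry` class, its field names, the `similar_descriptors` helper and the example entries are hypothetical and do not reflect the actual CDK descriptor ontology format.

```python
from dataclasses import dataclass, field

@dataclass
class DescriptorEntry:
    """Hypothetical dictionary entry using the vocabulary listed for the ontology."""
    name: str
    descriptor_class: str          # "molecular", "atomic" or "bond"
    descriptor_type: str           # e.g. "geometric", "electronic"
    author: str = ""
    reference: str = ""
    useful_for: dict = field(default_factory=dict)  # property name -> tag count

    def tag(self, prop):
        """Record that a chemist found this descriptor useful for a property."""
        self.useful_for[prop] = self.useful_for.get(prop, 0) + 1

def similar_descriptors(entry, dictionary):
    """Suggest other descriptors sharing the same class and type."""
    return [e.name for e in dictionary
            if e is not entry
            and e.descriptor_class == entry.descriptor_class
            and e.descriptor_type == entry.descriptor_type]

# Example entries; the classifications are illustrative assumptions
grav = DescriptorEntry("gravitational index", "molecular", "geometric")
moi = DescriptorEntry("moment of inertia", "molecular", "geometric")
chi = DescriptorEntry("chi index", "molecular", "topological")
dictionary = [grav, moi, chi]

grav.tag("boiling point")
grav.tag("boiling point")
suggestions = similar_descriptors(grav, dictionary)
```

Accumulated tag counts are one simple way the "over time" utility signal could be read off; a real ontology would express these relationships in a shared, machine-readable vocabulary rather than in program-specific objects.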

Page 31: Chemical Spaces: Modeling, Exploration & Understanding

The Chemistry Development Kit

What does it do?

Reads multiple file formats

Calculates molecular descriptors (in progress)

  Topological (Chi, Kappa, . . . )
  Geometric (Gravitational indices, MI, . . . )
  Hybrid (CPSA)
  Global (BCUT, WHIM, RDF, . . . )

Provides access to statistical engines (R & Weka)

Fingerprint calculation, 2D structure diagrams

Kabsch alignment

3D coordinates and force fields (in progress)

Steinbeck, C. et al., Curr. Pharm. Des., 2006, 12, 2111–2120

Steinbeck, C. et al., J. Chem. Inf. Comput. Sci., 2003, 43, 493–500

Page 32: Chemical Spaces: Modeling, Exploration & Understanding

Contributions

Main areas of contributions

Descriptor framework

QSAR modeling framework (interface to R)

Web service functionality

Other areas include

  Numerical surface areas
  Rigid alignment
  Descriptor ontology
  Build system
  QA, debugging, support

http://almost.cubic.uni-koeln.de/cdk/cdk top

Page 33: Chemical Spaces: Modeling, Exploration & Understanding

Cheminformatics and R

What is R?

An open-source statistical environment for

  model development
  algorithm prototyping

Open source version of Splus

Splus code runs (mostly) unchanged on R

Provides a wide array of mathematical and statistical functionality

  Linear models (OLS, robust regression, GLM, PLS)
  Neural networks, random forests, SOM, SVM
  Clustering methods (kmeans, agnes, pam, . . . )
  Optimization routines
  Database interfaces

When working with chemical data it would be nice to have access to cheminformatics functionality inside R

Page 35: Chemical Spaces: Modeling, Exploration & Understanding

Some Cheminformatics Related Packages

Connecting R and the CDK

R can access Java code via the SJava package

This allows us to use CDK functionality within R

The rcdk package provides user friendly wrappers to CDK classes and methods

The result is that we can stay inside R and handle molecules directly

Fingerprints

The fingerprint package handles binary fingerprint data from CDK and MOE

Calculates similarity matrices using the Tanimoto metric

Converts binary fingerprints to Euclidean vectors

Guha, R., CDK News, 2005, 2, 2–6
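To make the fingerprint operations concrete, here is a small sketch of the Tanimoto similarity matrix and the conversion of a binary fingerprint to a numeric vector. It is written in Python rather than R and does not reproduce the API of the fingerprint package; fingerprints are represented as sets of "on" bit positions, and the function names and the convention for two empty fingerprints are assumptions made here.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of 'on' bit positions."""
    union = len(a | b)
    if union == 0:
        return 1.0  # assumed convention: two empty fingerprints are identical
    return len(a & b) / union

def similarity_matrix(fingerprints):
    """Pairwise Tanimoto similarities as a list of rows."""
    return [[tanimoto(x, y) for y in fingerprints] for x in fingerprints]

def to_vector(fingerprint, n_bits):
    """Expand a fingerprint into a 0/1 vector usable with Euclidean methods."""
    return [1.0 if i in fingerprint else 0.0 for i in range(n_bits)]

# Toy fingerprints (invented bit positions)
fps = [{1, 2, 3}, {2, 3, 4}, {7}]
sim = similarity_matrix(fps)
vec = to_vector(fps[0], 8)
```

The first two fingerprints share two of four distinct bits, giving a similarity of 0.5, while the third shares none with either and scores 0.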

Page 36: Chemical Spaces: Modeling, Exploration & Understanding

R Miscellanea

R as a Web Service

A number of packages are available to access R via the web

An effort is also underway to provide an explicit SOAP interface

Very simple to access a remote R process via Rserve

Currently under investigation as our statistical backend for workflows

Page 37: Chemical Spaces: Modeling, Exploration & Understanding

Summary

Broadly focused on 2 areas . . .

Modeling

  Predictive model development
  Model interpretation and applicability

Algorithm development

  Exploring nearest neighbor methods
  Descriptor selection and interpretation
  Dictionaries and ontologies

Underlying motivation is the extraction of chemistry from the numbers and making it available

Collaborations . . .

More brains are useful

Always on the lookout for data
