r & cdk: a sturdy platform in the oceans of chemical data}

R & CDK: A Sturdy Platform in the Oceans ofChemical Data

Rajarshi Guha

NIH Center for Translational Therapeutics

5th May, 2011EBI, Hinxton

Background

I Cheminformatics methods since 2003I QSAR, diversity analysis, virtual

screening, fragments, polypharmacology,networks

I More recentlyI RNAi screening, high content screening

I Extensive use of machine learningI All tied together with software

developmentI User-facing GUI toolsI Low level programmatic librariesI Core developer for the CDK

I Believer and practitioner of Open Source

Why Cheminformatics in R?

I In contrast to bioinformatics (cf. Bioconductor), not a wholelot of cheminformatics support for R

I For cheminformatics and chemistry relevant packages includeI rcdk, rpubchem, fingerprintI bio3d, ChemmineR

I A lot of cheminformatics employs various forms of statisticsand machine learning - R is exactly the environment for that

I We just need to add some chemistry capabilities to it

See here for a much more detailed tutorial on R &cheminformatics presented at the EBI in 2010

http://mccammon.ucsd.edu/~bgrant/bio3d/

http://bioweb.ucr.edu/ChemMineV2/chemminer/

http://www.slideshare.net/rguha/cheminformatics-in-r

Motivations

I Much of cheminformatics is datamodeling and mining

I But the numeric data is derived fromchemical structure

I Thus we want to work withI molecules & and their partsI files containing moleculesI databases of molecules

What is R?

I R is an environment for modelingI Contains many prepackaged statistical and mathematical

functionsI No need to implement anything

I R is a matrix programming language that is good forstatistical computing

I Full fledged, interpreted languageI Well integrated with statistical functionalityI More details later

What is R?

I It is possible to use R just for modelingI Avoids programming, preferably use a GUI

I Load data → build model → plot data

I But you can also get much more creativeI Scripts to process multiple data filesI Ensemble modeling using different types of modelsI Implement brand new algorithms

I R is good for prototyping algorithmsI Interpreted, so immediate resultsI Good support for vectorization

I Faster than explicit loopsI Analogous to map in Python and Lisp

I Most times, interpreted R is fine, but you can easily integrateC code

What is R?

I R integrates with other languagesI C code can be linked to R and C can also call R functionsI Java code can be called from R and vice versa. See various

packages at rosuda.orgI Python can be used in R and vice versa using Rpy

I R has excellent support for publication quality graphics

I See R Graph Gallery for an idea of the graphing capabilities

I But graphing in R does have a learning curveI A variety of graphs can be generated

I 2D plots - scatter, bar, pie, box, violin, parallel coordinateI 3D plots - OpenGL support is available

http://rosuda.org/software/

http://rpy.sourceforge.net/

http://addictedtor.free.fr/graphiques/

Parallel R

I R itself is not multi-threadedI Well suited for embarassingly parallel problems

I Even then, a number of “large data” problems are nottractable

I Recent developments on integrating R and Hadoop address thisI See the RHIPE package

I snow which allows distribution of processing on the samemachine (multiple CPU’s) or multiple machines

I But see snowfall for a nice set of wrappers around snow

I Also see multicore for a package that focuses on parallelprocessing on multicore CPU’s

http://www.stat.purdue.edu/~sguha/rhipe/

http://cran.r-project.org/web/packages/snow/index.html

http://cran.r-project.org/web/packages/snowfall/index.html

http://cran.r-project.org/web/packages/multicore/index.html

R and databases

I Bindings to a variety of databases are availableI Mainly RDBMS’s but some NoSQL databases are being

interfaced

I The R DBI spec lets you write code that is portable overdatabases

I Note that loading multiple database packages can lead toproblems

I This can happen even when you don’t explicitly load adatabase package

I Some Bioconductor packages load the RSQLite package as adependency, which can interfere with, say, ROracle

http://cran.r-project.org/web/packages/DBI/index.html

http://cran.r-project.org/web/packages/RSQLite/index.html

http://cran.r-project.org/web/packages/ROracle/index.html

Why use a database?

I Dont have to load bulk CSV or .Rda files each time we startwork

I Can index data in RDBMS’s so queries can be very fast

I Good way to exchange data between applications (as opposedto .Rda files which are only useful between R users)

Using the CDK in R

I Based on the rJava package

I Two R packages to install (not counting the dependencies)

I Provides access to a variety of CDK classes and methods

I Idiomatic R

R Programming Environment

rJava

CDK Jmol

rcdk

XML

rpubchem

fingerprint

Acessibility & usability

I Plain R is not necessarily the most “usable” platform

I So rcdk doesn’t really satisfy usability for complete R newbies

I But, if you know R, installation is trivial

> install.packages(’rcdk’, dependencies=TRUE)

I R specifies a documentation format

I Most packages have quite good documentation, rcdk is noexception

I A tutorial is also available from within R, in addition to thefunction docs

> vignette(’rcdk’) # read tutorial

> ls(’package:rcdk’) # list functions

> ?load.molecules # get help on a function

Reading in data

I The CDK supports a variety of file formats

I rcdk loads all recognized formats, automatically

I Data can be local or remote

mols <- load.molecules( c("data/io/set1.sdf",

"data/io/set2.smi",

"http://rguha.net/rcdk/remote.sdf"))

I Gives you a list of Java references representingIAtomContainer objects

I For large SDF’s use an iterating reader

I Can’t do much with these objects, except via rcdk functions

Writing molecules

I Currently only SDF is supported as an output file format

I By default a multi-molecule SDF will be written

I Properties are not written out as SD tags by default

smis <- c("c1ccccc1", "CC(C=O)NCC", "CCCC")

mols <- sapply(smis, parse.smiles)

## all molecules in a single file

write.molecules(mols, filename="mols.sdf")

## ensure molecule data is written out

write.molecules(mols, filename="mols.sdf", write.props=TRUE)

## molecules in individual files

write.molecules(mols, filename="mols.sdf", together=FALSE)

Working with molecules

I Currently you can access atoms, bonds, get certain atomproperties, 2D/3D coordinates

I Since rcdk doesn’t cover the entire CDK API, you might needto drop down to the rJava level and make calls to the Javacode by hand

Working with atoms

I Simple elemental analysis

I Identifying flat molecules

mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")

atoms <- get.atoms(mol)

## elemental analysis

syms <- unlist(lapply(atoms, get.symbol))

round( table(syms)/sum(table(syms)) * 100, 2)

## is the molecule flat?

coords <- do.call("rbind", lapply(atoms, get.point3d))

any(apply(coords, 2, function(x) length(unique(x)) == 1))

SMARTS matching

I rcdk supports substructure searches with SMARTS orSMILES

I May not be practical for large collections of molecules due tomemory

mols <- sapply(c("CC(C)(C)C",

"c1ccc(Cl)cc1C(=O)O",

"CCC(N)(N)CC"), parse.smiles)

query <- "[#6D2]"

hits <- matches(query, mols)

> print(hits)

CC(C)(C)C c1ccc(Cl)cc1C(=O)O CCC(N)(N)CC

FALSE TRUE TRUE

Visualization

I rcdk supports visualization of 2D structure images in twoways

I First, you can bring up a Swing window

I Second, you can obtain the depiction as a raster image

I Doesn’t work on OS X

mols <- load.molecules("data/dhfr_3d.sd")

## view a single molecule in a Swing window

view.molecule.2d(mols[[1]])

## view a table of molecules

view.molecule.2d(mols[1:10])

Visualization

Visualization

I The Swing window is a little heavy weight

I It’d be handy to be able to annotate plots with structures

I Or even just make a panel of images that could be saved to aPNG file

I We can make use of rasterImage and rcdk

I As with the Swing window, this won’t work on OS X

m <- parse.smiles("c1ccccc1C(=O)NC")

img <- view.image.2d(m, 200,200)

## start a plot

plot(1:10, 1:10, pch=19)

## overlay the structure

rasterImage(img, 1,8, 3,10)

http://stat.ethz.ch/R-manual/R-patched/library/graphics/html/raster.html

Molecular descriptors

I Numerical representations of chemical structure featuresI Can be based on

I connectivityI 3D coordnatesI electronic propertiesI combination of the above

I Many descriptors are described and implemented in variousforms

I The CDK implements 50 descriptor classes, resulting in ≈ 300individual descriptor values for a given molecule

Descriptor caveats

I Not all descriptors are optimized for speed

I Some of the topological descriptors employ graphisomorphism which makes them slow on large molecules

I In general, to ensure that we end up with a rectangulardescriptor matrix we do not catch exceptions

I Instead descriptor calculation failures return NA

CDK Descriptor Classes

I The CDK provides 3 packages for descriptor calculationsI org.openscience.cdk.qsar.descriptors.molecularI org.openscience.cdk.qsar.descriptors.atomicI org.openscience.cdk.qsar.descriptors.bond

I rcdk only supports molecular descriptorsI Each descriptor is also described by an ontology

I For rcdk this is used to classify descriptors into groups

Descriptor calculations

I Can evaluate a single descriptor or all available descriptors

I If a descriptor cannot be calculated, NA is returned (so noexceptions thrown)

dnames <- get.desc.names(’topological’)

descs <- eval.desc(mols, dnames)

I Just evaluate topological descriptors

I descs will be a data.frame which can then be used in any ofthe modeling functions

I Column names are the descriptor names provided by the CDK

The QSAR workflow

The QSAR workflow

I Before model development you’ll need to clean the molecules,evaluate descriptors, generate subsets

I With the numeric data in hand, we can proceed to modelingI Before building predictive models, we’d probably explore the

datasetI Normality of the dependent variableI Correlations between descriptors and dependent variableI Similarity of subsets

I Then we can go wild and build all the models that R supports

Accessing fingerprints

I CDK provides several fingerprintsI Path-based, MACCS, E-State, PubChem

I Access them via get.fingerprint(...)

I Works on one molecule at a time, use lapply to process a listof molecules

I This method works with the fingerprint packageI Separate package to represent and manipulate fingerprint data

from various sources (CDK, BCI, MOE)I Uses C to perform similarity calculationsI Lots of similarity and dissimilarity metrics available

http://stat.ethz.ch/R-manual/R-patched/library/base/html/lapply.html

Accessing fingerprints

mols <- load.molecules("data/dhfr_3d.sd")

## get a single fingerprint

fp <- get.fingerprint(mols[[1]], type="maccs")

## process a list of molecules

fplist <- lapply(mols, get.fingerprint, type="maccs")

## or read from file

fplist <- fp.read("data/fp/fp.data", size=1052,

lf=bci.lf, header=TRUE)

I Easy to support new line-oriented fingerprint formats byproviding your own line parsing function (e.g., bci.lf)

I See the fingerprint package man pages for more details

http://rss.acs.unt.edu/Rdoc/library/fingerprint/html/linefunc.html

Similarity metrics

I The fingerprint package implements 28 similarity anddissimilarity metrics

I All accessed via the distance function

I Implemented in C, but still, large similarity matrix calculationsare not a good idea!

## similarity between 2 individual fingerprints

distance(fplist[[1]], fplist[[2]], method="tanimoto")

distance(fplist[[1]], fplist[[2]], method="mt")

## similarity matrix - compare similarity distributions

m1 <- fp.sim.matrix(fplist, "tanimoto")

m2 <- fp.sim.matrix(fplist, "carbo")

par(mfrow=c(1,2))

hist(m1, xlim=c(0,1))

hist(m2, xlim=c(0,1))

Comparing datasets with fingerprints

I We can compare datasets based on a fingerprints

I Rather than perform pairwise comparisons, we evaluate thenormalized occurence of each bit, across the dataset

I Gives us a n-D vector - the “bit spectrum”

bitspec <- bit.spectrum(fplist)

plot(bitspec, type="l")

0Guha, R., J. Comp. Aid. Molec. Des., 2008, 22, 367–384

Bit spectrum

0 50 100 150

0.0

0.2

0.4

0.6

0.8

1.0

Bit Position

Fre

quen

cy

I Only makes sense with structural key type fingerprints

I Longer fingerprints give better resolution

I Comparing bit spectra, via any distance metric, allows us tocompare datasets in O(n) time, rather than O(n2) for apairwise approach

0Guha, R., J. Comp. Aid. Molec. Des., 2008, 22, 367–384

GSK & ONS Ugi datasets

I 117K member virtual library of Ugi products was the basis ofan ONS project looking for anti-malarials (Jean-ClaudeBradley, Drexel University)

I GSK recently published their anti-malarial screening dataset(13K compounds)

I How do the two data sets compare?

http://usefulchem.wikispaces.com/UClib007

http://www.nature.com/nature/journal/v465/n7296/full/nature09107.html

GSK & ONS Ugi datasets

I A little easier to identify differences if we take the “differencespectrum”

Comparing fingerprint performance

I Various studies comparing virtual screening methods

I Generally, metric of success is how many actives are retrievedin the top n% of the database

I Can be measured using ROC, enrichment factor, etc.I Exercise - evaluate performance of CDK fingerprints using

enrichment factorsI Load active and decoy moleculesI Evaluate fingerprintsI For each active, evaluate similarity to all other molecules

(active and inactive)I For each active, determine enrichment at a given percentage of

the database screened

For a given query molecule, order dataset by decreasing similarity,look at the top 10% and determine fraction of actives in that top

10%


I A good dataset to test this out is the Maximum UnbiasedValidation datasets by Rohr & Baumann

I Derived from 17 PubChem bioassay datasets and designed toavoid analog bias and artifical enrichment

I As a result, 2D fingerprints generally show poor performanceon these datasets (by design)

I See here for a comparison of various fingerprints using two ofthese datasets

0Rohrer, S.G et al, J. Chem. Inf. Model, 2009, 49, 169–1840Good, A. & Oprea, T., J. Chem. Inf. Model, 2008, 22, 169–1780Verdonk, M.L. et al, J. Chem. Inf. Model, 2004, 44, 793–806

http://www.pharmchem.tu-bs.de/lehre/baumann/MUV.html

http://www.pharmchem.tu-bs.de/lehre/baumann/MUV.html

http://dx.doi.org/10.1021/ci8002649

http://dx.doi.org/10.1007/s10822-007-9167-2

http://dx.doi.org/10.1021/ci034289q

http://rguha.wordpress.com/2008/10/11/do-the-cdk-fingerprints-work/


0.0

0.2

0.4

0.6

0.8

1.0

Query Molecule

Sim

ilarit

y

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

Fragment based analysis

I Fragment based analysis can be a useful alternative toclustering, especially for large datasets

I Useful for identifying interesting seriesI Many fragmentation schemes are available

I ExhaustiveI Rings and ring assembliesI Murcko

I The CDK supports fragmentation (still needs work) intoMurcko frameworks and ring systems

Getting fragments

I Access to exhaustive and Murcko fragmentation schemes

I Exhaustive fragmentation can take a long time in some cases

I Both have several parameters allow us to filter fragments

mol <- parse.smiles(

"c1cc(c(cc1c2c(nc(nc2CC)N)N)[N+](=O)[O-])NCc3ccc(cc3)C(=O)N4CCCCC4")

mfrags <- get.murcko.fragments(mol)

xfrags <- get.exhaustive.fragments(mol)

Doing stuff with fragments

I Look at frequency of occurence of fragments

I Pseudo-cluster a dataset based on fragments

I Compound selection based on fragment membership

I Develop predictive models on fragment members, looking forlocal SAR

Fragments & kinase activities

I Consider the Abbot kinasedataset (Metz et al)

I ≈ 1500 structures, 172targets

I Slice and dice activitiesbased on Murcko frameworkmembership

frags <- lapply(mols,

get.murcko.fragments)

fworks <-lapply(frags,

function(x) x[[1]]$frameworks)

frag.freq <- data.frame(

table(unlist(fworks))

)

Fragment ID

Fre

quen

cy

0

50

100

150

0 500 1000 1500

http://www.nature.com/nchembio/journal/v7/n4/abs/nchembio.530.html


I Explore activity data on a fragment-wise basis

I Compare activity distributions by targets

## build a look up table (frag SMILES -> molecule ID)

ftable <- do.call(’rbind’, mapply(function(x,y) {

if (length(y) == 0) y <- NA

data.frame(mid=x, frag=y)

}, names(fworks), fworks, SIMPLIFY=FALSE))

rownames(ftable) <- NULL

ftable <- subset(ftable, !is.na(frag))

ftable$frag <- as.character(ftable$frag)

## identify molecules containing a fragment

query <- ’c1nccc(n1)c3c[nH]c2ncccc23’

values <- subset(ftable, frag == query)

depvs <- subset(abbot, PUBCHEM_SID %in% values$mid)[, 15:186]

Matched molecular pairs

I Inspired by Gregs’ 1-line SQL query

I But performed over 172 kinase targets

I Slower, especially the similarity matrix calculation

## load molecules and get similarity matrix

mols <- load.molecules(’abbot.smi’)

fps <- fp.sim.matrix(lapply(mols, get.fingerprint, ’extended’))

## identify similar pairs

idxs <- which(fpsims > 0.95, arr.ind=TRUE)

idxs <- idxs[ idxs[,1] > idxs[,2], ] # ignore diagonal elements

## evaluate activity differences

mps <- t(apply(idxs, 1, function(x) {

apply(depvs, 2, function(z) {

d <- abs(z[x[1]] - z[x[2]])

ifelse(d >= 1, d, NA)

})

}))


pIC50 = 5.3 pIC50 = 8.6

pIC50 = 5.4 pIC50 = 8.5

Future developments

I One current drawback of the package is that mostcheminformatics operations cannot be parallelized

I Many objects are Java refs so can’t be sharedI Many CDK methods are not threadsafe

I Data table and depictions

I Streamline I/O and molecule configuration

I Add more atom and bond level operationsI Convert from jobjRef to S4 objects and vice versa

I Would allow for serialization of CDK data classesI Is it worth the effort?

Acknowledgements

I rcdkI Miguel RojasI Ranke Johannes

I CDKI Egon WillighagenI Christoph SteinbeckI . . .

r & cdk: a sturdy platform in the oceans of chemical data}

Documents

parallel r r

interpreted r

r users

r functionsjava code

true r species

complete r newbies

rjava packagetwo r packages

acessibility usability