chemical data mining: open source & reproducible

R & CDK

Chemical Data Mining: Open Source & Reproducible

Rajarshi Guha

NIH Center for Advancing Translational Science

August 21, 2012, Philadelphia PA

Background

- Been using R since 2003; developed a number of R packages, mostly public
- Make extensive use of R at NCGC for small molecule & RNAi screening and high content analysis
- In parallel, need to manipulate and process chemical structure data
- How R is enhanced by other Open Source software
- How R enables and supports reproducible science

What is R?

- R is an environment for modeling
  - Contains many prepackaged statistical and mathematical functions
  - No need to implement anything (if you don’t want to)
- R is a matrix programming language that is good for statistical computing
  - Full-fledged, interpreted language
  - Well integrated with statistical functionality
  - Easy to integrate with C, C++, Fortran
  - Good for prototyping

Why cheminformatics in R?

- Much of cheminformatics is data modeling and mining
- But the numeric data is derived from chemical structure
- Thus we want to work with
  - molecules and their parts
  - files containing molecules
  - databases of molecules

Why cheminformatics in R?

- In contrast to bioinformatics (cf. Bioconductor), not a whole lot of cheminformatics support for R
- For cheminformatics and chemistry, relevant packages include
  - rcdk, rpubchem, chemblr, fingerprint
  - bio3d, ChemmineR, caret
- A lot of cheminformatics employs various forms of statistics and machine learning; R is exactly the environment for that
- We just need to add some chemistry capabilities to it

http://cran.r-project.org/web/packages/rcdk/index.html

http://cran.r-project.org/web/packages/fingerprint/index.html

http://mccammon.ucsd.edu/~bgrant/bio3d/

http://bioweb.ucr.edu/ChemMineV2/chemminer/

http://cran.r-project.org/web/packages/caret/index.html

What does the CDK provide?

- Fundamental chemical objects
  - atoms
  - bonds
  - molecules
- More complex objects are also available
  - Sequences
  - Reactions
  - Collections of molecules
- Input/Output for a wide variety of molecular file formats
- Fingerprints and fragment generation
- Rigid alignments, pharmacophore searching
- Substructure searching, SMARTS support
- Molecular descriptors

Using the CDK in R

- Based on the rJava package
- Two R packages to install (not counting the dependencies)
- Provides access to a variety of CDK classes and methods
- Idiomatic R

[Figure: package diagram. Labels: R Programming Environment, rJava, CDK, Jmol, rcdk, XML, rpubchem, fingerprint]

Reading in data

- The CDK supports a variety of file formats
- rcdk loads all recognized formats automatically
- Data can be local or remote

library(rcdk)
## load molecules from local files and a remote URL in one call
mols <- load.molecules(c("data/io/set1.sdf",
                         "data/io/set2.smi",
                         "http://rguha.net/rcdk/remote.sdf"))

- For large SD files, use an iterating reader (iload.molecules)
- Can’t do much with these objects, except via rcdk functions

Working with molecules

- Currently you can access atoms and bonds, and get certain atom properties and 2D/3D coordinates
- Since rcdk doesn’t cover the entire CDK API, you might need to drop down to the rJava level and make calls to the Java code by hand
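As a rough illustration of the two levels mentioned on this slide, the sketch below first uses rcdk accessors and then makes one direct rJava call on the same object. It assumes rcdk and a working Java installation are available; the acetic acid SMILES is only an illustrative input, and `getAtomCount` is a method of the CDK atom container that rcdk wraps.

```r
library(rcdk)

# Parse an illustrative SMILES string (acetic acid); parse.smiles returns a list
mol <- parse.smiles("CC(=O)O")[[1]]

# rcdk level: atoms and bonds come back as lists of Java object references
atoms <- get.atoms(mol)
bonds <- get.bonds(mol)
symbols <- sapply(atoms, get.symbol)
print(symbols)

# rJava level: call a CDK method directly on the underlying Java object
# ("I" is the JNI signature for an int return value)
n_atoms <- rJava::.jcall(mol, "I", "getAtomCount")
print(n_atoms)
```

Only heavy atoms are counted here, since hydrogens parsed from SMILES are implicit.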

Accessing fingerprints

- CDK provides several fingerprints
  - Path-based, MACCS, E-State, PubChem
- Access them via get.fingerprint(...)
- Works on one molecule at a time; use lapply to process a list of molecules
- This method works with the fingerprint package
  - Separate package to represent and manipulate fingerprint data from various sources (CDK, BCI, MOE)
  - Uses C to perform similarity calculations

http://stat.ethz.ch/R-manual/R-patched/library/base/html/lapply.html
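The lapply pattern described above might look like the following minimal sketch. It assumes rcdk and fingerprint are installed; the three SMILES strings are illustrative inputs, and MACCS is picked arbitrarily from the fingerprint types listed on the slide.

```r
library(rcdk)
library(fingerprint)

# Three illustrative molecules: ethanol, ethylamine, benzene
mols <- parse.smiles(c("CCO", "CCN", "c1ccccc1"))

# get.fingerprint handles one molecule at a time, so map it over the list
fps <- lapply(mols, get.fingerprint, type = "maccs")

# Pairwise Tanimoto similarities, computed by the fingerprint package
sim <- fp.sim.matrix(fps, method = "tanimoto")
print(round(sim, 2))
```

The result is a square matrix with one row and column per molecule, which can be passed to clustering or plotting functions.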

Working with fingerprints

- The fingerprint package implements 28 similarity and dissimilarity metrics
- Easy to run enrichment studies
- We can compare datasets in O(n) time, using the “bit spectrum”

[Figure: bit spectrum plot. x-axis: Bit Position (0 to 150); y-axis: Frequency (0.0 to 1.0)]

Guha, R., J. Comp. Aid. Molec. Des., 2008, 22, 367–384
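To make the bit spectrum idea concrete, here is a base R sketch that treats the spectrum as the fraction of molecules in which each bit is set. The toy fingerprints, the helper name `bit_spectrum`, and the normalized Manhattan distance used to compare two spectra are all assumptions made for illustration, not taken from the talk or the cited paper; the fingerprint package has its own `bit.spectrum` function for real fingerprint objects.

```r
# Each toy fingerprint is a vector of "on" bit positions (hypothetical data)
set_a <- list(c(1, 3, 5), c(1, 2, 5), c(1, 5, 8))
set_b <- list(c(2, 4, 6), c(2, 4, 7), c(1, 4, 6))
nbit <- 8

# Fraction of fingerprints in which each bit position is set.
# One pass over the dataset, so the cost grows linearly with its size.
bit_spectrum <- function(fps, nbit) {
  counts <- tabulate(unlist(fps), nbins = nbit)
  counts / length(fps)
}

spec_a <- bit_spectrum(set_a, nbit)
spec_b <- bit_spectrum(set_b, nbit)

# Assumed comparison: mean absolute difference between the two spectra
dist_ab <- sum(abs(spec_a - spec_b)) / nbit
print(dist_ab)
```

Comparing two datasets this way needs one spectrum per dataset rather than all pairwise similarities, which is where the linear scaling comes from.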

Visualization

- rcdk supports visualization of 2D structure images in two ways
- First, you can bring up a Swing window
- Second, you can obtain the depiction as a raster image

mols <- load.molecules("data/dhfr_3d.sd")
## view a single molecule in a Swing window
view.molecule.2d(mols[[1]])
## view a table of molecules
view.molecule.2d(mols[1:10])

The QSAR workflow

- Before model development you’ll need to clean the molecules, evaluate descriptors, generate subsets
- With the numeric data in hand, we can proceed to modeling
- Before building predictive models, we’d probably explore the dataset
  - Normality of the dependent variable
  - Correlations between descriptors and dependent variable
  - Similarity of subsets
- Go wild and build all the models that R supports
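The exploration steps above could be sketched in base R as follows. The descriptor table and activity values are simulated purely for illustration (in practice the descriptor columns would come from something like rcdk's `eval.desc`), and the linear model stands in for whichever model one actually wants to build.

```r
set.seed(1)
n <- 50

# Simulated descriptors (d3 is constant on purpose) and a simulated activity
desc <- data.frame(d1 = rnorm(n), d2 = rnorm(n), d3 = rep(1, n))
activity <- 2 * desc$d1 + rnorm(n, sd = 0.5)

# Clean: drop descriptors that carry no information (zero variance)
keep <- sapply(desc, function(x) sd(x) > 0)
desc <- desc[, keep, drop = FALSE]

# Explore: normality of the dependent variable
norm_test <- shapiro.test(activity)
print(norm_test$p.value)

# Explore: correlation of each descriptor with the dependent variable
cors <- sapply(desc, cor, y = activity)
print(round(cors, 2))

# Model: an ordinary linear regression as a simple first model
fit <- lm(activity ~ ., data = desc)
print(coef(fit))
```

Swapping `lm` for another modeling function leaves the cleaning and exploration steps unchanged.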

Interacting with chemical databases

- A variety of databases containing structures, physical properties, biological activities
- Direct access within R lets us streamline our workflow
  - Enabled by public APIs
    - PubChem PUG and REST
    - ChEMBL REST API (chemblr)

Reproducible chemical data mining

- The many toolkits and versions make reproducibility tough
- DB and HTTP access ensures that an analysis can always be up to date if required
- If the analysis is not based on a fixed snapshot of data, reproducibility cannot be guaranteed
- Might actually make all those published QSAR models reusable!

[Figure: Reproducible Bundle. Components: .Rda, Sweave / Knitr, .R]
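One way to pin an analysis to a fixed snapshot, in the spirit of the .Rda component of the bundle, is sketched below in base R. The data frame contents, the object name `snapshot`, and the file name are hypothetical; the point is only that the retrieved data, the retrieval time, and the session details are stored together.

```r
# Hypothetical data that would normally come from a database or HTTP query
retrieved <- data.frame(id = 1:3, value = c(0.1, 0.5, 0.9))

# Bundle the data with when it was fetched and the software versions in use
snapshot <- list(
  data = retrieved,
  retrieved_at = Sys.time(),
  session = sessionInfo()
)

# Write the snapshot to an .Rda file (a temporary location for this sketch)
path <- file.path(tempdir(), "snapshot.Rda")
save(snapshot, file = path)

# A later run loads the frozen copy instead of querying the live source
rm(snapshot)
load(path)
print(snapshot$data)
```

An accompanying Sweave or knitr document can then load this file, so the report is rebuilt from the same data each time.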

Acknowledgements

- rcdk
  - Steffen Neumann
  - Miguel Rojas
  - Ranke Johannes
- CDK
  - Egon Willighagen
  - Christoph Steinbeck
  - . . .

http://sourceforge.net/projects/cdk/

http://github.com/rajarshi/cdkr

@rguha (https://twitter.com/rguha)
