R & CDK
Chemical Data Mining: Open Source & Reproducible
Rajarshi Guha
NIH Center for Advancing Translational Science
August 21, 2012, Philadelphia, PA
Background
- Have been using R since 2003; developed a number of R packages, mostly public
- Make extensive use of R at NCGC for small molecule & RNAi screening and high content analysis
- In parallel, need to manipulate and process chemical structure data
- How R is enhanced by other Open Source software
- How R enables and supports reproducible science
What is R?
- R is an environment for modeling
  - Contains many prepackaged statistical and mathematical functions
  - No need to implement anything (if you don’t want to)
- R is a matrix programming language that is good for statistical computing
  - Full-fledged, interpreted language
  - Well integrated with statistical functionality
  - Easy to integrate with C, C++, Fortran
  - Good for prototyping
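As a rough illustration of the points above, the following base R sketch does a little matrix arithmetic and fits a linear model without any fitting code being written by hand. The data are made up for the example, and the variable names are illustrative only.

```r
# Toy data: 20 observations of two predictors (values are invented)
set.seed(1)
x <- matrix(rnorm(40), ncol = 2, dimnames = list(NULL, c("x1", "x2")))
y <- 2 * x[, "x1"] - x[, "x2"] + rnorm(20, sd = 0.1)

# Matrix operations are part of the language
col.means <- colMeans(x)   # mean of each column
xtx <- t(x) %*% x          # cross-product, equivalent to crossprod(x)

# A linear model is a single call to a prepackaged function
fit <- lm(y ~ x1 + x2, data = data.frame(x, y = y))
print(coef(fit))
```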
Why cheminformatics in R?
- Much of cheminformatics is data modeling and mining
- But the numeric data is derived from chemical structure
- Thus we want to work with
  - molecules and their parts
  - files containing molecules
  - databases of molecules
Why cheminformatics in R?
- In contrast to bioinformatics (cf. Bioconductor), not a whole lot of cheminformatics support for R
- For cheminformatics and chemistry, relevant packages include
  - rcdk, rpubchem, chemblr, fingerprint
  - bio3d, ChemmineR, caret
- A lot of cheminformatics employs various forms of statistics and machine learning; R is exactly the environment for that
- We just need to add some chemistry capabilities to it
What does the CDK provide?
- Fundamental chemical objects
  - atoms
  - bonds
  - molecules
- More complex objects are also available
  - Sequences
  - Reactions
  - Collections of molecules
- Input/Output for a wide variety of molecular file formats
- Fingerprints and fragment generation
- Rigid alignments, pharmacophore searching
- Substructure searching, SMARTS support
- Molecular descriptors
Using the CDK in R
- Based on the rJava package
- Two R packages to install (not counting the dependencies)
- Provides access to a variety of CDK classes and methods
- Idiomatic R
[Figure: package diagram showing the R Programming Environment, rJava, CDK, Jmol, rcdk, XML, rpubchem, and fingerprint]
Reading in data
- The CDK supports a variety of file formats
- rcdk loads all recognized formats automatically
- Data can be local or remote
mols <- load.molecules( c("data/io/set1.sdf",
"data/io/set2.smi",
"http://rguha.net/rcdk/remote.sdf"))
- For large SDFs, use an iterating reader
- Can’t do much with these objects, except via rcdk functions
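A minimal sketch of the iterating reader, assuming rcdk (with a working Java setup) plus the iterators and itertools packages are installed. To keep it self-contained, it first writes a tiny SDF to a temporary file; the SMILES strings are arbitrary. Argument defaults for `iload.molecules` may differ between rcdk versions, so treat this as an outline rather than a reference.

```r
library(rcdk)
library(iterators)   # provides the nextElem() generic
library(itertools)   # provides the hasNext() generic

# Build a small SDF so the example has something to read (toy molecules)
mols <- parse.smiles(c("CCO", "c1ccccc1"))
sdf.file <- tempfile(fileext = ".sdf")
write.molecules(mols, filename = sdf.file)

# Iterate over the file one molecule at a time instead of loading it all
iter <- iload.molecules(sdf.file, type = "sdf")
n.read <- 0
while (hasNext(iter)) {
  mol <- nextElem(iter)
  n.read <- n.read + 1
  cat("molecule", n.read, "has", length(get.atoms(mol)), "heavy atoms\n")
}
```

For a file that fits in memory, `load.molecules` as shown above is simpler; the iterator matters when the SDF is too large to hold as a list of Java objects.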
Working with molecules
- Currently you can access atoms, bonds, get certain atom properties, 2D/3D coordinates
- Since rcdk doesn’t cover the entire CDK API, you might need to drop down to the rJava level and make calls to the Java code by hand
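The two levels of access described above might look like the following sketch, assuming rcdk and rJava are installed. The molecule (acetic acid) is an arbitrary choice. `getAtomCount` is a method on the CDK atom container; calling it through `.jcall` is only meant to show the mechanism, since rcdk already wraps this particular piece of information.

```r
library(rcdk)
library(rJava)

# Parse a SMILES string into a CDK molecule object
mol <- parse.smiles("CC(=O)O")[[1]]

# rcdk level: atoms and bonds come back as lists of Java object references
atoms <- get.atoms(mol)
bonds <- get.bonds(mol)
symbols <- sapply(atoms, get.symbol)
print(symbols)

# rJava level: call a Java method directly on the molecule.
# "I" is the JNI signature for an int return value.
n.atoms <- .jcall(mol, "I", "getAtomCount")
print(n.atoms)
```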
Accessing fingerprints
- CDK provides several fingerprints
  - Path-based, MACCS, E-State, PubChem
- Access them via get.fingerprint(...)
- Works on one molecule at a time; use lapply to process a list of molecules
- This method works with the fingerprint package
  - Separate package to represent and manipulate fingerprint data from various sources (CDK, BCI, MOE)
  - Uses C to perform similarity calculations
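A short sketch of the `lapply` pattern, assuming rcdk and fingerprint are installed. The three SMILES strings are arbitrary, and MACCS is chosen only as an example of the `type` argument.

```r
library(rcdk)
library(fingerprint)

# A small list of molecules (toy input)
mols <- parse.smiles(c("CCO", "CCN", "c1ccccc1"))

# get.fingerprint handles one molecule, so map it over the list
fps <- lapply(mols, get.fingerprint, type = "maccs")

# The fingerprint package computes the pairwise similarity matrix
sim <- fp.sim.matrix(fps, method = "tanimoto")
print(round(sim, 2))
```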
Working with fingerprints
- The fingerprint package implements 28 similarity and dissimilarity metrics
- Easy to run enrichment studies
- We can compare datasets in O(n) time, using the “bit spectrum”
[Figure: bit spectrum plot; x-axis: Bit Position (0 to 150), y-axis: Frequency (0.0 to 1.0)]
Guha, R., J. Comp. Aid. Molec. Des., 2008, 22, 367–384
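To make the bit spectrum idea concrete, here is a base R sketch on made-up fingerprints stored as logical matrices. It is not the fingerprint package's implementation (which provides `bit.spectrum` and C-backed similarity functions), and the Euclidean distance between spectra is just one possible comparison, chosen here for illustration.

```r
# Toy fingerprints: rows are molecules, columns are bit positions (random data)
set.seed(42)
n.bits <- 166
set.a <- matrix(runif(50 * n.bits) < 0.2, nrow = 50)
set.b <- matrix(runif(80 * n.bits) < 0.3, nrow = 80)

# Tanimoto similarity between two bit vectors
tanimoto <- function(x, y) {
  both <- sum(x & y)     # bits set in both
  either <- sum(x | y)   # bits set in at least one
  if (either == 0) return(NA_real_)
  both / either
}
print(tanimoto(set.a[1, ], set.a[2, ]))

# Bit spectrum: fraction of molecules with each bit set.
# One pass over each dataset, so the cost is linear in the number of molecules.
spectrum.a <- colMeans(set.a)
spectrum.b <- colMeans(set.b)

# Compare the two datasets through their spectra (illustrative choice of metric)
spectrum.dist <- sqrt(sum((spectrum.a - spectrum.b)^2))
print(spectrum.dist)
```

Comparing every molecule in one set against every molecule in the other would take a number of similarity evaluations proportional to the product of the set sizes; the spectrum summarizes each set once and compares two fixed-length vectors.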
Visualization
- rcdk supports visualization of 2D structure images in two ways
- First, you can bring up a Swing window
- Second, you can obtain the depiction as a raster image
mols <- load.molecules("data/dhfr_3d.sd")
## view a single molecule in a Swing window
view.molecule.2d(mols[[1]])
## view a table of molecules
view.molecule.2d(mols[1:10])
The QSAR workflow
- Before model development you’ll need to clean the molecules, evaluate descriptors, generate subsets
- With the numeric data in hand, we can proceed to modeling
- Before building predictive models, we’d probably explore the dataset
  - Normality of the dependent variable
  - Correlations between descriptors and dependent variable
  - Similarity of subsets
- Go wild and build all the models that R supports
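The exploration and modeling steps above can be sketched in base R. The descriptor table here is simulated; in a real workflow it would come from descriptor calculations on the cleaned molecules (for example via rcdk). The column names `d1` to `d3`, the split sizes, and the choice of a linear model are all placeholders.

```r
# Simulated descriptor table and activity (stand-ins for real data)
set.seed(7)
n <- 60
desc <- data.frame(d1 = rnorm(n), d2 = rnorm(n), d3 = rnorm(n))
activity <- 1.5 * desc$d1 - 0.8 * desc$d2 + rnorm(n, sd = 0.3)

# Explore: normality of the dependent variable
normality <- shapiro.test(activity)
print(normality$p.value)

# Explore: correlation of each descriptor with the dependent variable
desc.cor <- cor(desc, activity)
print(desc.cor)

# Generate subsets: a random training/test split
dat <- data.frame(desc, activity = activity)
train.idx <- sample(n, size = 40)
train <- dat[train.idx, ]
test <- dat[-train.idx, ]

# Build one of the many models R supports and check it on the held-out set
fit <- lm(activity ~ ., data = train)
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((pred - test$activity)^2))
print(rmse)
```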
Interacting with chemical databases
- A variety of databases containing structures, physical properties, biological activities
- Direct access within R lets us streamline our workflow
  - Enabled by public APIs
    - PubChem PUG and REST
    - ChEMBL REST API (chemblr)
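As a hedged sketch of what direct access can look like without any wrapper package, the helper below builds a PubChem PUG REST URL for a property table. The function name `pug.property.url` is hypothetical, the compound IDs and properties are arbitrary examples, and the actual download is guarded so the code runs without network access. Packages such as rpubchem and chemblr wrap this kind of call; their own interfaces are not shown here.

```r
# Hypothetical helper: build a PUG REST URL returning properties as CSV
pug.property.url <- function(cids, props) {
  paste0("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/",
         paste(cids, collapse = ","),
         "/property/",
         paste(props, collapse = ","),
         "/CSV")
}

url <- pug.property.url(c(2244, 3672), c("MolecularFormula", "MolecularWeight"))
print(url)

# Fetch only in an interactive session; requires network access
if (interactive()) {
  props <- read.csv(url, stringsAsFactors = FALSE)
  print(props)
}
```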
Reproducible chemical data mining
- The many toolkits and versions make reproducibility tough
- DB and HTTP access ensures that an analysis can always be up to date if required
- If the analysis is not based on a fixed snapshot of data, reproducibility cannot be guaranteed
- Might actually make all those published QSAR models reusable!
[Figure: a Reproducible Bundle made up of .Rda data, .R code, and a Sweave / knitr document]
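One way to assemble the data part of such a bundle is sketched below in base R. The directory layout, file names, and the toy dataset are assumptions made for the example; the point is that the analysis reads a dated, fixed snapshot rather than a live database, and that the software versions are recorded alongside it.

```r
# Directory that will hold the bundle (a temporary location for this example)
bundle.dir <- file.path(tempdir(), "bundle")
dir.create(bundle.dir, showWarnings = FALSE)

# Toy dataset standing in for data pulled from a database or web service
dataset <- data.frame(id = 1:3, activity = c(5.2, 6.1, 4.8))
snapshot.date <- Sys.Date()

# Freeze the data together with the date it was retrieved
rda.file <- file.path(bundle.dir, "data.Rda")
save(dataset, snapshot.date, file = rda.file)

# Record R and package versions so the toolkit state is documented
writeLines(capture.output(sessionInfo()),
           file.path(bundle.dir, "sessionInfo.txt"))

# Later, the analysis script or Sweave/knitr document loads the snapshot
env <- new.env()
load(rda.file, envir = env)
print(env$snapshot.date)
```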
Acknowledgements
- rcdk
  - Steffen Neumann
  - Miguel Rojas
  - Ranke Johannes
- CDK
  - Egon Willighagen
  - Christoph Steinbeck
  - ...
http://sourceforge.net/projects/cdk/
http://github.com/rajarshi/cdk
@rguha