Download - Cheminformatics in R
![Page 1: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/1.jpg)
Cheminformaticsin R
1/189
Cheminformatics in R
or How to Start R and Never Have to Exit
Rajarshi Guha
NIH Chemical Genomics Center
17th May, 2010EBI, Hinxton
![Page 2: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/2.jpg)
Cheminformaticsin R
2/189
Background
![Page 3: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/3.jpg)
Cheminformaticsin R
3/189
Background
I Involved in various aspects ofcheminformatics since 2002
I QSAR modeling, virtual screening,polypharmacology, networks
I Algorithm developementI Cheminformatics software
development
I Core developer for the CDK
![Page 4: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/4.jpg)
Cheminformaticsin R
4/189
Background
I Pretty much live inside of R
I Been using it since 2003, developed a number of Rpackages, mostly public
I Make extensive use of R at NCGC for small molecule &RNAi screening
![Page 5: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/5.jpg)
Cheminformaticsin R
5/189
Acknowledgements
I rcdkI Miguel RojasI Ranke Johannes
I CDKI Egon WillighagenI Christoph SteinbeckI . . .
![Page 6: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/6.jpg)
Cheminformaticsin R
6/189
Resources
I Download binariesI If you’re working on Unix it’s a good idea to compile
from source
I There’s lots of documentation of varying qualityI The R manual is a good document to start withI R Data Import and Export is extremely handy regarding
data loadingI A variety of short introductory tutorials as well as
documents on specific topics in statistical modeling canbe found here
I More general help can be obtained fromI The R-help mailing list. Has a lot of renoowned experts
on the list and if you can learn quite a bit of statistics inaddition to getting R help just be reading the archives ofthe list. Note: they do expect that you have done yourhomework! See the posting guide
I A useful reference for Python/Matlab users learning R
![Page 7: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/7.jpg)
Cheminformaticsin R
7/189
R Books
I A number of books are available ranging fromintroductions to R along with examples up to advancedstatistical modeling using R.
I Data Analysis and Graphics Using R: An Example-basedApproach by John Maindonald, John Braun
I Introductory Statistics with R by Peter Dalgaard
I Using R for Introductory Statistics by John Verzani
I If you end up programming a lot in R, you’ll want tohave Programming with Data: A Guide to the SLanguage by John M. Chambers
![Page 8: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/8.jpg)
Cheminformaticsin R
8/189
IDE’s and editors
I Emacs + ESS - works on Linux, Windows, OS X.Invaluable if you are an Emacs user
I TINN-R is a small and useful editor for WindowsI You can also do R in Eclipse
I StatETI Rsubmit
I A number of different GUI’s are available for, which maybe useful if you’re an infrequent user
I RcommanderI JGRI RattleI PMGI SciViews-RI Red-R
![Page 9: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/9.jpg)
Cheminformaticsin R
9/189
What is R?
I R is an environment for modelingI Contains many prepackaged statistical and mathematical
functionsI No need to implement anything
I R is a matrix programming language that is good forstatistical computing
I Full fledged, interpreted languageI Well integrated with statistical functionalityI More details later
![Page 10: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/10.jpg)
Cheminformaticsin R
10/189
What is R?
I It is possible to use R just for modelingI Avoids programming, preferably use a GUI
I Load data → build model → plot data
I But you can also get much more creativeI Scripts to process multiple data filesI Ensemble modeling using different types of modelsI Implement brand new algorithms
I R is good for prototyping algorithmsI Interpreted, so immediate resultsI Good support for vectorization
I Faster than explicit loopsI Analogous to map in Python and Lisp
I Most times, interpreted R is fine, but you can easilyintegrate C code
![Page 11: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/11.jpg)
Cheminformaticsin R
11/189
What is R?
I R integrates with other languagesI C code can be linked to R and C can also call R functionsI Java code can be called from R and vice versa. See
various packages at rosuda.orgI Python can be used in R and vice versa using Rpy
I R has excellent support for publication quality graphics
I See R Graph Gallery for an idea of the graphingcapabilities
I But graphing in R does have a learning curveI A variety of graphs can be generated
I 2D plots - scatter, bar, pie, box, violin, parallel coordinateI 3D plots - OpenGL support is available
![Page 12: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/12.jpg)
Cheminformaticsin R
12/189
Why Cheminformatics in R?
I In contrast to bioinformatics (cf. Bioconductor), not awhole lot of cheminformatics support for R
I For cheminformatics and chemistry relevant packagesinclude
I rcdk, rpubchem, fingerprintI bio3d, ChemmineR
I A lot of cheminformatics employs various forms ofstatistics and machine learning - R is exactly theenvironment for that
I We just need to add some chemistry capabilities to it
![Page 13: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/13.jpg)
Cheminformaticsin R
13/189
Outline
I An overview of the R language
I An overview of the CDK library
I Exploring the rcdk package
I Exploring the rcdklibs package
I Extending the code
![Page 14: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/14.jpg)
Cheminformaticsin R
14/189
Accessing packages
I From within R (you’ll need 2.11.0 or better)
install.packages(c("rcdk","rpubchem"), dependencies = TRUE)
I Sources for rcdk and rcdklibs can be obtained fromthe Github repository
I This is required if you want to modify Java codeassociated with rcdk
I Installable development packages (OS X, Linux,Windows) available from http://rguha.net/rcdk
I These will make their way to CRAN
![Page 15: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/15.jpg)
Cheminformaticsin R
15/189
Workshop prerequisites
I Everything should be installed
I Running the code below should not give any errors
library(rcdk)
library(rpubchem)
source(helper.R)
I Most of the code snippets in these slides are contained incode.R
I You should have a data/ directory containing variousdatasets used in this workshop
I You should have a exercises/ directory containing thesource code solutions to the exercises
![Page 16: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/16.jpg)
Cheminformaticsin R
16/189
The language
Parallel R
Database access
Part I
Overview of R
![Page 17: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/17.jpg)
Cheminformaticsin R
17/189
The language
Parallel R
Database access
Outline
The language
Parallel R
Database access
![Page 18: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/18.jpg)
Cheminformaticsin R
18/189
The language
Parallel R
Database access
Working in R
I To get help for a function name do: ?functionname
I If you don’t know the name of the function, use:help.search("blah")
I R Site Search and Rseek are very helpful online resources
I To exit from the prompt do: quit()
I Two ways to run R codeI Type or paste it at the promptI Execute code from a source file: source("mycode.R")
I Saving your workI save.image(file="mywork.Rda") saves the whole
workspace to mywork.Rda, which is a binary fileI When you restart R, do: load("mywork.Rda") will restore
your workspaceI You can save individual (or multiple) objects using save
![Page 19: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/19.jpg)
Cheminformaticsin R
19/189
The language
Parallel R
Database access
Basic types in R
Primitive types
I numeric - indicates an integer or floating point number
I character - an alphanumeric symbol (string)
I logical - boolean value. Possible values are TRUE andFALSE
Complex types
I vector - a 1D collection of objects of the same type
I list - a 1D container that can contain arbitrary objects.Each element can have a name associated with it
I matrix - a 2D data structure containing objects of thesame type
I data.frame - a generalization of the matrix type. Eachcolumn can be of a different type
![Page 20: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/20.jpg)
Cheminformaticsin R
20/189
The language
Parallel R
Database access
Primitive types
> x <- 1.2
> x <- 2
> y <- "abcdegf"
> y
[1] "abcdegf"
> z <- c(1,2,3,4,5)
> z
[1] 1 2 3 4 5
> z <- 1:5
> z
[1] 1 2 3 4 5
![Page 21: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/21.jpg)
Cheminformaticsin R
21/189
The language
Parallel R
Database access
Matrices
I Make a matrix from a vector
I The matrix is constructed column-wise
> z <- c(1,2,3,4,5,6,7,8,9,10)
> m <- matrix(z, nrow=2, ncol=5)
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
![Page 22: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/22.jpg)
Cheminformaticsin R
22/189
The language
Parallel R
Database access
data.frame
I We want to store compound names and number of atomsand whether they are toxic or not
I Need to store: characters, numbers and boolean values
x <- c("aspirin", "potassium cyanide",
"penicillin", "sodium hydroxide")
y <- c(21, 3, 41,3)
z <- c(FALSE, TRUE, FALSE, TRUE)
d <- data.frame(x,y,z)
> d
x y z
1 aspirin 21 FALSE
2 potassium cyanide 3 TRUE
3 penicillin 41 FALSE
4 sodium hydroxide 3 TRUE
![Page 23: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/23.jpg)
Cheminformaticsin R
23/189
The language
Parallel R
Database access
data.frame
I But do the column means? 6 months later, we probablywon’t know (easily)
I So we should add names
> names(d) <- c("Name", "NumAtom", "Toxic")
> d
Name NumAtom Toxic
1 aspirin 21 FALSE
2 potassium cyanide 3 TRUE
3 penicillin 41 FALSE
4 sodium hydroxide 3 TRUE
I When adding column names, don’t use the underscorecharacter
I You can access a column of a data.frame by it’s name:d$Toxic
![Page 24: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/24.jpg)
Cheminformaticsin R
24/189
The language
Parallel R
Database access
Lists
I A 1D collection of arbitrary objects (ArrayList in Java)
I We can access the lists by indexing: mylist[1] ormylist[[1]] for the first element or mylist[c(1,2,3)] forthe first 3 elements
x <- 1.0
y <- "hello there"
s <- c(1,2,3)
mylist <- list(x,y,s)
> mylist
[[1]]
[1] 1
[[2]]
[1] "hello there"
[[3]]
[1] 1 2 3
![Page 25: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/25.jpg)
Cheminformaticsin R
25/189
The language
Parallel R
Database access
Lists
I It’s useful to name individual elements of a list
x <- "oxygen"
z <- 8
m <- 32
mylist <- list(name="oxygen",
atomicNumber=z,
molWeight=m)
> mylist
$name
[1] "oxygen"
$atomicNumber
[1] 8
$molWeight
[1] 32
![Page 26: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/26.jpg)
Cheminformaticsin R
26/189
The language
Parallel R
Database access
Lists
I We can then get the molecular weight by writing
> mylist$molWeight
[1] 32
![Page 27: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/27.jpg)
Cheminformaticsin R
27/189
The language
Parallel R
Database access
Identifying types
I It’s useful initially, to identify the type of the object
I Usually just printing it out explains what it is
I You can be more concise by doing
> x <- 1
> class(x)
[1] "numeric"
> x <- "hello world"
> class(x)
[1] "character"
> x <- matrix(c(1,2,3,4), nrow=2)
> class(x)
[1] "matrix"
![Page 28: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/28.jpg)
Cheminformaticsin R
28/189
The language
Parallel R
Database access
Indexing
I Indexing fundamental to using R and is applicable tovectors, matrices, data.frame’s and lists
I all indices start from 1 (not 0!)
I For a 1D vector, we get the i ’th element by x[i]
I For a 2D matrix of data.frame we get the i , j element byx[i,j]
I But we can also index using vectorsI Called an index vectorI Allows us to select or exclude multiple elements
simultaneously
![Page 29: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/29.jpg)
Cheminformaticsin R
29/189
The language
Parallel R
Database access
Indexing
I Say we have a vector, x and a matrix y
> x
[1] 1 2 3 4 5 6 7 8 9 10
> y
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
I Some ways to get elements from x
I 5th element → x[5]
I The 3rd, 4th and 5th elements → x[ c(3,4,5) ]
I Everything but the 3rd, 4th and 5th elements →x[ -c(3,4,5) ]
![Page 30: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/30.jpg)
Cheminformaticsin R
30/189
The language
Parallel R
Database access
Indexing
I Say we have a matrix y
> y
[,1] [,2] [,3] [,4] [,5]
[1,] 1 4 7 10 13
[2,] 2 5 8 11 14
[3,] 3 6 9 12 15
I Some ways to get elements from y
I Element in the first row, second column → y[1,2]
I All elements in the 1st column → y[,1]
I All elements in the 3rd row → y[3,]
I Elements in the first 2 columns → y[ , c(1,2) ]
I Elements not in the first 2 columns → y[ , -c(1,2) ]
![Page 31: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/31.jpg)
Cheminformaticsin R
31/189
The language
Parallel R
Database access
Indexing
I You can also use a boolean vector to index another vector
I In this case, the index vector must of the same length asyour target vector
I If the i ’th element in the index vector is TRUE then thei ’th element of the target is selected
> x <- c(1,2,3,4)
> idx <- c(TRUE, FALSE, FALSE, TRUE)
> x[ idx ]
[1] 1 4
I It’s a pain to have to create a whole index vector by hand
![Page 32: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/32.jpg)
Cheminformaticsin R
32/189
The language
Parallel R
Database access
Indexing
I A more intuitive way to do this, is to make use of R’svector operations
I x == 2 will compare each element of x with the value 2and return TRUE or FALSE
I The result is a boolean vector
> x <- c(1,2,3,4)
> x == 2
[1] FALSE TRUE FALSE FALSE
I So we can select the elements of x that are equal to 2 bydoing
> x <- c(1,2,3,4)
> x[ x == 2 ]
[1] 2
![Page 33: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/33.jpg)
Cheminformaticsin R
33/189
The language
Parallel R
Database access
Indexing
I Or select elements of x that are greater than 2
> x <- c(1,2,3,4)
> x[ x > 2 ]
[1] 3 4
I Or select elements of x that are 2 or 4
> x <- c(1,2,3,4)
> x[ x == 2 | x == 4 ]
[1] 2 4
I The single bar (|) is intentional
I Logical operations using vectors should use |, & ratherthan the usual ||, &&
![Page 34: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/34.jpg)
Cheminformaticsin R
34/189
The language
Parallel R
Database access
List indexing
I With list objects we can index using [ or [[I [ keeps element names, [[ drops names
I Can’t use index vectors or boolean vectors with [[
> y <- list(a=’b’, b=123, x=c(1,2,3))
> y[3]
$x
[1] 1 2 3
> y[[3]]
[1] 1 2 3
> y[[1:2]]
Error in y[[1:2]] : subscript out of bounds
![Page 35: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/35.jpg)
Cheminformaticsin R
35/189
The language
Parallel R
Database access
Subsetting
I Get a subset of the rows of a matrix or data.frame basedon values in a certain column
I We can generate a boolean index vector
I For data.frame’s, we can use a more intuitive approach
d <- data.frame(a=1:10, b=runif(10), c=rnorm(10))
d[d$a <= 6,]
subset(d, a <= 6)
subset(d, a >= 7 & b < 0.4)
![Page 36: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/36.jpg)
Cheminformaticsin R
36/189
The language
Parallel R
Database access
Merging data.frames
I Lets you join data.frames based on a common column
I Pretty much identical to an SQL join (and supportsnatural and outer joins)
I Very handy when merging data from multiple sources
d1 <- data.frame(mol=I(c("mol1", "mol2", "mol3")),
klass=c("active", "active", "inactive"))
d2 <- data.frame(molecule=I(c("mol1", "mol2", "mol3")),
TPSA=c(1.2,3.4,5.6), Kier1=c(5.7, 8.9, 12))
> merge(d1, d2, by.x="mol", by.y="molecule")
mol klass TPSA Kier1
1 mol1 active 1.2 5.7
2 mol2 active 3.4 8.9
3 mol3 inactive 5.6 12.0
![Page 37: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/37.jpg)
Cheminformaticsin R
37/189
The language
Parallel R
Database access
Functions in R
I Functions are much like in any other languageI R employs copy-on-write semantics
I Send in a copy of an input variable if it will be modifiedin the function
I Otherwise send in a reference
I Typing is dynamic
my.add <- function(x,y) {
return(x+y)
}
I You can then call the function as my.add(1,2)
I In general, check for the type of object using class()
I Functions can be anonymous
![Page 38: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/38.jpg)
Cheminformaticsin R
38/189
The language
Parallel R
Database access
Loop constructs
I R does not have C-style looping
for (i = 0; i < 10; i++) {
print(i)
}
I Rather, it works like Python’s for-in idiom
for (i in c(0,2,3,4,5,6,7,8,9)) {
print(i)
}
![Page 39: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/39.jpg)
Cheminformaticsin R
39/189
The language
Parallel R
Database access
Loop constructs
I In general, to loop n times, you create a vector with nelements, say by writing 1 : n
I But like Python loops, you can just loop over elements ofa vector or list and use them
> objects <- list(x=1, y="hello world", z=c(1,2,3))
> for (i in objects) print(i)
[1] 1
[1] "hello world"
[1] 1 2 3
I Good R style discourages explicit loops
![Page 40: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/40.jpg)
Cheminformaticsin R
40/189
The language
Parallel R
Database access
Apply and friends
I Explicit loops (for (...)) are not idiomatic and can beslow
I apply, lapply, sapply resemble functional programmingstyles (but see Reduce, Filter and Map for alternatives)
I Good idea to master these forms
x <- c(1,2,3,4,5)
sapply(x, sqrt)
sapply(x, function(z) z+3)
sapply(x, function(z) rep(z,3), simplify=FALSE)
sapply(x, function(z) rep(z,3), simplify=TRUE)
I Handy for looping over vector elements
![Page 41: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/41.jpg)
Cheminformaticsin R
41/189
The language
Parallel R
Database access
Looping over lists
I For list objects, use lapply
I The return value is a list - if it can be converted to asimple vector use unlist to do so
x <- list(x=1, b=2, c=3)
lapply(x, sqrt)
unlist(lapply(x, sqrt))
lapply(x, function(z) {
return(z^2)
})
I The function argument of lapply should expect an objectas the first argument
I Columns of a data.frame are lists, so lapply can beused to loop over columns
![Page 42: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/42.jpg)
Cheminformaticsin R
42/189
The language
Parallel R
Database access
Looping over multiple objects
I What if you want to loop over two (or more) lists (orvectors) simulataneously
I Like Pythons zip would allow
I Use mapply and provide a function that takes (at least) asmany arguments as input vectors/lists
x <- 1:3
y <- list(a="Hello", b=12.34, c=c(1,2,3))
mapply(function(a,b) {
print(a)
print(b)
print("---")
}, x, y)
![Page 43: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/43.jpg)
Cheminformaticsin R
43/189
The language
Parallel R
Database access
Looping over matrices
I Use apply to operate on entire rows or columns of amatrix or data.frame
I Be careful when working on rows of a data.frame -behavior can be unexpected if all columns are not of thesame type
m <- matrix(1:12, nrow=4)
apply(m, 1, sum) ## sum the rows
apply(m, 2, sum) ## sum the columns
I The function argument of apply should expect a vector asthe first argument
![Page 44: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/44.jpg)
Cheminformaticsin R
44/189
The language
Parallel R
Database access
Looping caveats
I The main thing to remember when using apply, lapply
etc is nature of inputs to the user supplied function
I apply will send a vector (corresponding to a row orcolumn)
I lapply and sapply will send a single element of thelist/vector
I mapply will send a single element of each of the inputlists/vectors
I The function argument of these methods can takeanonymous functions or pre-defined functions
![Page 45: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/45.jpg)
Cheminformaticsin R
45/189
The language
Parallel R
Database access
Loading data
I Writing vectors and matrices by hand is good for 15minutes!
I Easier to read data from files
I R supports a variety of text and binary formatsI Certain packages can be loaded to handle special formats
I Read from SPSS, Stata, SASI Read microarray data, GIS data, chemical structure data
I Excel files can also be read (easiest on Windows)
I Data can be accessed from SQL databases (Postgres,MySQL, Oracle)
![Page 46: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/46.jpg)
Cheminformaticsin R
46/189
The language
Parallel R
Database access
Reading CSV data
I Consider the extract of a CSV file below
ID,Class,Desc1,Desc2,Desc3,Desc4,Desc5
w150,nontoxic,0.37,0.44,0.86,0.44,0.77
c49,toxic,0.37,0.58,0.05,0.85,0.57
s97,toxic,0.66,0.63,0.93,0.69,0.08
I Key features includeI A header line, naming the columnsI An ID column and a Class column, both characters
I Reading this data file is as simple as
> dat <- read.csv("data1.csv", header=TRUE)
I If you print dat it will scroll down your screen
![Page 47: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/47.jpg)
Cheminformaticsin R
47/189
The language
Parallel R
Database access
Reading CSV data
I To get a quick summary of the data.frame use the str()
function
> str(dat)
’data.frame’: 100 obs. of 7 variables:
$ ID : Factor w/ 100 levels "a115","a180",..: 83 7 66 85 64 82 33 10 92 95 ...
$ Class: Factor w/ 2 levels "nontoxic","toxic": 1 2 2 1 2 1 1 1 1 2 ...
$ Desc1: num 0.37 0.37 0.66 0.18 0.88 0.37 0.72 0.72 0.2 0.68 ...
$ Desc2: num 0.44 0.58 0.63 0.09 0.1 0.14 0.03 0.15 0.65 0.66 ...
$ Desc3: num 0.86 0.05 0.93 0.57 0.85 0.67 0.99 0.68 0.77 0.78 ...
$ Desc4: num 0.44 0.85 0.69 0.85 0.58 0.17 0.95 0.19 0.69 0.61 ...
$ Desc5: num 0.77 0.57 0.08 0.17 0.74 0.56 0.22 0.19 0.91 0.13 ...
![Page 48: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/48.jpg)
Cheminformaticsin R
48/189
The language
Parallel R
Database access
What is a factor?
I The factor type is used to represent categorical variablesI Toxic, non-toxicI Active, inactiveI Yes, no, maybe
I They correspond to enum’s in C/Java
I A factor variable looks like an ordinary vector ofcharacters
I Internally they are represented by integersI A factor variable has a property called the levels
I Indicates what values are valid for the factor
> levels(dat$Class)
[1] "nontoxic" "toxic"
![Page 49: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/49.jpg)
Cheminformaticsin R
49/189
The language
Parallel R
Database access
What is a factor?
I If you try to assign an invalid value to a factor you get awarning
> dat$Class[1] <- "Yes"
Warning message:
In ‘[<-.factor‘(‘*tmp*‘, 1, value = "Yes") :
invalid factor level, NAs generated
> dat$Class[1:10]
[1] <NA> toxic toxic nontoxic toxic nontoxic nontoxic nontoxic
[9] nontoxic toxic
Levels: nontoxic toxic
![Page 50: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/50.jpg)
Cheminformaticsin R
50/189
The language
Parallel R
Database access
Operating on groups
I Factors are very useful when you’d like to operate onsubsets of the data, as defined by factor levels
I Assay data might label molecules as active, inactive,inconclusive
I by is similar to lapply but operates on data.frame’s in afactor-wise way
I Takes a function argument which will receive a subset ofthe input data.frame, corresponding to a single level ofthe factor
![Page 51: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/51.jpg)
Cheminformaticsin R
51/189
The language
Parallel R
Database access
assay <- read.csv("data/aid2170.csv", header=TRUE)
str(assay)
levels(assay$PUBCHEM.ACTIVITY.OUTCOME)
## how many actives and inactives are there?
by(assay, assay$PUBCHEM.ACTIVITY.OUTCOME, function(x) {
nrow(x)
})
## summary stats of the inhibiion values, by group
by(assay, assay$PUBCHEM.ACTIVITY.OUTCOME, function(x) {
summary(x$Inhibition)
})
## or distributions of the inhibition values, by group
par(mfrow=c(1,2))
by(assay, assay$PUBCHEM.ACTIVITY.OUTCOME, function(x) {
hist(x$Inhibition)
})
![Page 52: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/52.jpg)
Cheminformaticsin R
52/189
The language
Parallel R
Database access
Outline
The language
Parallel R
Database access
![Page 53: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/53.jpg)
Cheminformaticsin R
53/189
The language
Parallel R
Database access
Parallel R
I R itself is not multi-threadedI Well suited for embarassingly parallel problems
I Even then, a number of “large data” problems are nottractable
I Recent developments on integrating R and Hadoopaddress this
I See the RHIPE package
I We’ll look at snow which allows distribution ofprocessing on the same machine (multiple CPU’s) ormultiple machines
I But see snowfall for a nice set of wrappers around snow
I Also see multicore for a package that focuses onparallel processing on multicore CPU’s
![Page 54: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/54.jpg)
Cheminformaticsin R
54/189
The language
Parallel R
Database access
Trivial parallelization with snow
I snow lets you use multiple machines or multiple cores ona single machine
I Well suited for data-parallel problems
I If using multiple machines, data will be sent across thenetwork, so large chunks of data can slow things down
I General strategy isI Initialize clusterI Send variables required for calculation to the nodesI Call a parallel function
![Page 55: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/55.jpg)
Cheminformaticsin R
55/189
The language
Parallel R
Database access
Use case - exhaustive feature selection
I Given a set of N descriptors, find a n-descriptor subsetthat leads to a good predictive model
I Will obviously explode - 50C5 = 2, 118, 760 descriptorcombinations
I Usually we would employ a genetic algorithm or simulatedannealing or even some form of stepwise selection
I For pedagogical purposesI how can we do this exhaustively?I how can we speed it up?
![Page 56: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/56.jpg)
Cheminformaticsin R
56/189
The language
Parallel R
Database access
The data and a model
I Lets consider some melting point data
I Precalculated and reduced descriptor set
I 184 molecules in training set, 34 descriptors
I Goal - find a good 4 descriptor OLS model
0Bergstrom, C.A.S. et al, J. Chem. Inf. Model., 2003, 43, 1177–1185
![Page 57: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/57.jpg)
Cheminformaticsin R
57/189
The language
Parallel R
Database access
A serial solution
I Evaluate all 4-member combinations of the numbers{1, 2, . . . , 34}
I These are the indices of the descriptor columns
I Loop over the rows, use the values as column indices ofthe descriptor matrix, build a model
## First some data prep
library(gtools)
load("data/bergstrom/bergstrom.Rda")
depv <- train.desc[,1]
desc <- train.desc[,-1]
idx <- 1:nrow(desc)
combos <- combinations(ncol(desc), 4)
![Page 58: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/58.jpg)
Cheminformaticsin R
58/189
The language
Parallel R
Database access
A serial solution
rmse <- function(x,y) sqrt(sum((x-y)^2)/length(x))
system.time(
rmse.serial <- apply(combos, 1, function(idx, mydata, y) {
dat <- data.frame(y=y, x=mydata[,idx])
m <- lm(y~., dat)
return(rmse(m$fitted,y))
}, mydata=desc, y=depv)
)
I Takes about 200 seconds
![Page 59: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/59.jpg)
Cheminformaticsin R
59/189
The language
Parallel R
Database access
The parallel solution
I Code is pretty much identical
I First need to initialize the “cluster” - in this case we’llrun multiple processes on the same machine
library(snow)
clus <- makeCluster(rep("localhost",2), type="SOCK")
system.time(
rmse.par <- parApply(clus, combos, 1, function(idx, mydata, y) {
dat <- data.frame(y=y, x=mydata[,idx])
m <- lm(y~., dat)
return(rmse(m$fitted,y))
}, mydata=desc, y=depv)
)
I Takes about 120 seconds, but easily scales up withmultiple cores
![Page 60: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/60.jpg)
Cheminformaticsin R
60/189
The language
Parallel R
Database access
Outline
The language
Parallel R
Database access
![Page 61: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/61.jpg)
Cheminformaticsin R
61/189
The language
Parallel R
Database access
R and databases
I Bindings to a variety of databases are availableI Mainly RDBMS’s but some NoSQL databases are being
interfaced
I The R DBI spec lets you write code that is portable overdatabases
I Note that loading multiple database packages can lead toproblems
I This can happen even when you don’t explicitly load adatabase package
I Some Bioconductor packages load the RSQLite packageas a dependency, which can interfere with, say, ROracle
![Page 62: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/62.jpg)
Cheminformaticsin R
62/189
The language
Parallel R
Database access
A quick look at RSQLite
I Not suitable for very large data or multiple connections
I Definitely look at sqldf, that makes it very easy to useSQL on R data.frames
I Very brief overview of what we can do with RSQLite forlocal storage
I Should transer easily to other databases
![Page 63: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/63.jpg)
Cheminformaticsin R
63/189
The language
Parallel R
Database access
Basic usage - connecting
I We’ll use a database I prepared and placed atdata/db/assay.db
I Lets first get some information about the tables and fielddefinitions
library(RSQLite)
## specific to RSQLite
sqlite <- dbDriver("SQLite")
con <- dbConnect(sqlite, dbname = "data/db/assay.db")
> dbListTables(con)
[1] "assay"
> dbListFields(con, "assay")
[1] "row_names" "aid" "PUBCHEM_SID" "PUBCHEM_CID"
"PUBCHEM_ACTIVITY_OUTCOME" "PUBCHEM_ACTIVITY_SCORE"
"PUBCHEM_ASSAYDATA_COMMENT"
![Page 64: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/64.jpg)
Cheminformaticsin R
64/189
The language
Parallel R
Database access
Basic usage - querying
I The simplest way to get data is to slurp in the wholetable as a data.frame
I Not a good idea if the table is very large
assay <- dbReadTable(con, "assay")
> str(assay)
’data.frame’: 188264 obs. of 6 variables:
$ aid : num 396 396 396 429 429 429 429 429 429 429 ...
$ PUBCHEM_SID : int 10318951 10318962 10319004 842121 842122 842123 842124 842125 842126 842127 ...
$ PUBCHEM_CID : int 6419748 5280343 969516 6603008 6602571 6602616 644371 6603132 2850911 6603374 ...
$ PUBCHEM_ACTIVITY_OUTCOME : chr "active" "active" "active" "inactive" ...
$ PUBCHEM_ACTIVITY_SCORE : int 100 100 100 0 1 1 1 -3 0 0 ...
$ PUBCHEM_ASSAYDATA_COMMENT: int 20061024 20060421 20061024 20061127 20061127 20061127 20061127 20061127 20060622 20061127 ...
![Page 65: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/65.jpg)
Cheminformaticsin R
65/189
The language
Parallel R
Database access
Basic usage - querying
I Better to get subsets of the table via SQL queries
I Two ways to do it - get back all rows from the query orretrieve in chunks
I The latter is better when you might get a lot of rows
## query and get results in one go
dbGetQuery(con, "select distinct aid from assay")
## query and then get results in chunks
res <- dbSendQuery(con, "select * from assay where aid = 429")
fetch(res, n=10) ## get first 10 results
fetch(res, n=10) ## get next 10
fetch(res, n=-1) ## get remaining results
## no more rows to fetch
fetch(res, n=-1)
dbClearResult(res) ## a vital step!
![Page 66: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/66.jpg)
Cheminformaticsin R
66/189
The language
Parallel R
Database access
Saving data in a database
I Can directly write a data.frame to a database file
I RSQLite will automatically generate a CREATE TABLEstatement
library(rpubchem)
## get some assay data from PubChem
assay <- sapply(c(396, 429, 438, 445), function(x) {
data.frame(aid=x, get.assay(x))[,1:6]
}, simplify=FALSE)
## join all assays into a single data.frame
assay <- do.call("rbind", assay)
## Make a new table
sqlite <- dbDriver("SQLite")
con <- dbConnect(sqlite, dbname = "assay.db")
dbWriteTable(con, "assay", assay)
dbDisconnect(con)
![Page 67: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/67.jpg)
Cheminformaticsin R
67/189
The language
Parallel R
Database access
So why use a database?
I Dont have to load bulk CSV or .Rda files each time westart work
I Can index data in RDBMS’s so queries can be very fast
I Good way to exchange data between applications (asopposed to .Rda files which are only useful between Rusers)
I All the code above, except for the dbConnect statementwill be identical for PostgreSQL, MySQL etc.
![Page 68: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/68.jpg)
Cheminformaticsin R
68/189
Introduction
I/O
Fingerprints
SubstructuresPart II
The Chemistry Development Kit
![Page 69: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/69.jpg)
Cheminformaticsin R
69/189
Introduction
I/O
Fingerprints
Substructures
Outline
Introduction
I/O
Fingerprints
Substructures
![Page 70: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/70.jpg)
Cheminformaticsin R
70/189
Introduction
I/O
Fingerprints
Substructures
Resources
I The CDK wiki
I The nightly build page
I Code snippets
I cdk-user mailing list
I Keyword list and feature listI If you want to develop with the CDK, use the
comprehensive jar file that contains all dependenciesI from the nightly build page for the cutting edgeI or from Sourceforge for the last stable release
![Page 71: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/71.jpg)
Cheminformaticsin R
71/189
Introduction
I/O
Fingerprints
Substructures
Some CDK issues
I Learning the CDK is non-trivial
I No real tutorial but see here for a start
I Best place is to use the keyword list to find the relevantclass and then look at the unit tests for a class
![Page 72: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/72.jpg)
Cheminformaticsin R
72/189
Introduction
I/O
Fingerprints
Substructures
What does the CDK provide?
I Fundamental chemical objectsI atomsI bondsI molecules
I More complex objects are also availableI SequencesI ReactionsI Collections of molecules
I Input/Output for a wide variety of molecular file formats
I Fingerprints and fragment generation
I Rigid alignments, pharmacophore searching
I Substructure searching, SMARTS support
I Molecular descriptors
![Page 73: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/73.jpg)
Cheminformaticsin R
73/189
Introduction
I/O
Fingerprints
Substructures
Some design principles
I Abstraction is the key principleI Applies to chemical objects such as atoms, bondsI Also applied to input/output
I Rather than directly creating an atom object we use afactory method
IAtom atom = DefaultChemObjectBuilder.getInstance().
newAtom();
and not
IAtom atom = new Atom();
![Page 74: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/74.jpg)
Cheminformaticsin R
74/189
Introduction
I/O
Fingerprints
Substructures
Some design principles
I By using DefaultChemObjectBuilder we don’t worryhow an atom or bond is created
I By using this type of abstraction we can load any fileformat without having to specify it
I Another aspect is the use of interfaces rather thanconcrete types
I A molecule is a collection of atoms and bondsI A graph view is not imposed directlyI Molecular features such as aromaticity are properties of
the atoms and bonds
![Page 75: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/75.jpg)
Cheminformaticsin R
75/189
Introduction
I/O
Fingerprints
Substructures
Molecules
I A molecule is fundamentally represented as anIAtomContainer object
I For many purposes this is fineI In some cases specialization is required so we have
I CrystalI RingI Fragment
I Some methods require a specific subclass
I Many methods simply require an IAtomContainer, so youcan use any of the above subclasses
![Page 76: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/76.jpg)
Cheminformaticsin R
76/189
Introduction
I/O
Fingerprints
Substructures
Molecules
I Similar procedure to getting an atom or bond object
IAtomContainer container = DefaultChemObjectBuilder.
getInstance().
newAtomContainer();
I Then we populate it
container.addAtom( atom1 );
container.addAtom( atom2 );
container.addBond( atom1, atom2, 1.5);
![Page 77: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/77.jpg)
Cheminformaticsin R
77/189
Introduction
I/O
Fingerprints
Substructures
Outline
Introduction
I/O
Fingerprints
Substructures
![Page 78: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/78.jpg)
Cheminformaticsin R
78/189
Introduction
I/O
Fingerprints
Substructures
Input/Output
I SMILES are a common format
I SMILES parsing, you will have to handle theInvalidSmilesException
SmilesParser sp = new SmilesParser(
DefaultChemObjectBuilder.
getInstance());
IMolecule molecule = sp.parseSmiles("CCCC(CC)CC=CC=CC");
I Given a molecule object, we can create a SMILES
SmilesGenerator sg = new SmilesGenerator();
String smiles = sg.createSMILES(molecule);
System.out.println("SMILES = "+smiles);
![Page 79: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/79.jpg)
Cheminformaticsin R
79/189
Introduction
I/O
Fingerprints
Substructures
Input/Output
I Reading files from disk is a common task
I A little more complex due to abstraction but is quitegeneral
File file = new File("mols.sdf");
FileReader fileReader = new FileReader(file);
MDLV2000Reader reader = new MDLV2000Reader(fileReader);
IChemFile chemFile = (IChemFile) reader.
read(new ChemFile());
List containers = ChemFileManipulator.
getAllAtomContainers(chemFile);
![Page 80: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/80.jpg)
Cheminformaticsin R
80/189
Introduction
I/O
Fingerprints
Substructures
Input/Output
I This is nice since it allows you to get all the molecules inthe file at one go
I Not a good idea for very large SD files
I In such cases, use IteratingMDLReader
I Allows you to iterate over arbitrarily large SD files
I There is also an IteratingSMILESReader for big SMILESfiles
![Page 81: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/81.jpg)
Cheminformaticsin R
81/189
Introduction
I/O
Fingerprints
Substructures
Input/Output
I Writing files is also supported for a wide variety offormats
I For example, to write to SD format
FileWriter w1 = new FileWriter(new File("molecule.sdf"));
try {
MDLWriter mw = new MDLWriter(w1);
mw.write(molecule);
mw.close();
} catch (Exception e) {
System.out.println(e.toString());
}
![Page 82: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/82.jpg)
Cheminformaticsin R
82/189
Introduction
I/O
Fingerprints
Substructures
SD tags
I SD tags are a common way to associate propertyinformation with a molecular structure
I See PubChem SD files to get an idea of what we can putin them
CDK 1/28/07,15:24
2 1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
M END
> <myProperty>
1.2
> <anotherProperty>
Hello world
![Page 83: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/83.jpg)
Cheminformaticsin R
83/189
Introduction
I/O
Fingerprints
Substructures
SD tags
I When a molecule is read in from an SD file we can getthe value of a given tag by doing
Object value = molecule.getProperty("myProperty");
I When writing a molecule we can supply a set of tags in aHashMap
HashMap tags = new HashMap();
tags.put("myProperty", new Double(1.2));
![Page 84: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/84.jpg)
Cheminformaticsin R
84/189
Introduction
I/O
Fingerprints
Substructures
SD tags
I To ensure tags are written to disk we would do
FileWriter w1 = new FileWriter(new File("molecule.sdf"));
try {
MDLWriter mw = new MDLWriter(w1);
mw.setSdFields( tags );
mw.write(molecule);
mw.close();
} catch (Exception e) {
System.out.println(e.toString());
}
![Page 85: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/85.jpg)
Cheminformaticsin R
85/189
Introduction
I/O
Fingerprints
Substructures
SD tags
I A molecule might have several properties associated withit.
I We can avoid creating an extra HashMap, when writingthem all to disk
MDLWriter mw = new MDLWriter(w1);
mw.setSdFields( molecule.getProperties() );
mw.write(molecule);
mw.close();
![Page 86: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/86.jpg)
Cheminformaticsin R
86/189
Introduction
I/O
Fingerprints
Substructures
Generating canonical SMILES
I The SmilesGenerator class will create canonical SMILES
String smiles = "C(C)(C)CC=CC(C(CC(C))CC)CC";
SmilesParser sp = new SmilesParser();
IMolecule mol = sp.parseSmiles(smiles);
SmilesGenerator sg = new SmilesGenerator();
String canSmi = sg.createSMILES(mol);
![Page 87: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/87.jpg)
Cheminformaticsin R
87/189
Introduction
I/O
Fingerprints
Substructures
Fingerprints
1 0 1 1 0 0 0 1 0
I Used in many areas - database screening, diversityanalysis, modeling
![Page 88: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/88.jpg)
Cheminformaticsin R
88/189
Introduction
I/O
Fingerprints
Substructures
Outline
Introduction
I/O
Fingerprints
Substructures
![Page 89: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/89.jpg)
Cheminformaticsin R
89/189
Introduction
I/O
Fingerprints
Substructures
Fingerprints
I The simplest fingerprint looks for specific substructuresand sets a specific bit
I MACCSFingerprinterI SubstructureFingerprinter
I Hashed fingerprints don’t maintain a correspondencebetween bit position and substructural feature
![Page 90: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/90.jpg)
Cheminformaticsin R
90/189
Introduction
I/O
Fingerprints
Substructures
Getting fingerprints
I The CDK can generate binary fingerprints
I This gives a default fingerprint of 1024 bits - but you canspecify smaller or larger fingerprints
I The CDK also provides variants of the default fingerprintI MACCSFingerprintI ExtendedFingerPrinter - includes bits related to ringsI GraphOnlyFingerprinter - simplified version of
fingerprinter that ignores bond orderI SubstructureFingerprinter - generates fingerprints based
on a specified set of substructures
IFingerprinter fprinter = new Fingerprinter()
BitSet fingerprint = fprinter.getFingerprint(molecule);
![Page 91: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/91.jpg)
Cheminformaticsin R
91/189
Introduction
I/O
Fingerprints
Substructures
Evaluating similarity
I Given two molecules we are interested in determininghow similar they are
I A common approach is to evaluate their fingerprints
I Calculate a similarity metric (e.g., Tanimoto coefficient)
BitSet fp1 = Fingerprinter.getFingerprint(molecule1);
BitSet fp2 = Fingerprinter.getFingerprint(molecule2);
float tc = Tanimoto.calculate(fp1, fp2);
![Page 92: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/92.jpg)
Cheminformaticsin R
92/189
Introduction
I/O
Fingerprints
Substructures
Outline
Introduction
I/O
Fingerprints
Substructures
![Page 93: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/93.jpg)
Cheminformaticsin R
93/189
Introduction
I/O
Fingerprints
Substructures
Substructure searching
I Identify the presence/absence of a substructure in amolecule
IAtomContainer mol = ...;
SMARTSQueryTool sqt = new SMARTSQueryTool("C");
sqt.setSmarts("C(=O)C");
boolean matched = sqt.matches(mol);
I In many cases we’re interested in all the possible matches
List<List<Integer>> maps;
maps = sqt.getUniqueMatchingAtoms();
I maps.size() gives you the number of matches
I Each element of maps is a sequence of atom ID’s
![Page 94: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/94.jpg)
Cheminformaticsin R
94/189
Introduction
I/O
Fingerprints
Substructures
Pharmacophores
I We’ll look at constructing a pharmacophore query byhand
I Generally, you’d read it from a file
I Given a query we can search for that in a target molecule
I The target molecule should have 3D coordinates
I Generally you’ll examine multiple conformers for a givenmolecule
![Page 95: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/95.jpg)
Cheminformaticsin R
95/189
Introduction
I/O
Fingerprints
Substructures
Pharmacophores
I Construct pharmacophore groups
PharmacophoreQueryAtom o =
new PharmacophoreQueryAtom("D", "[!H0;#7,#8,#9]");
PharmacophoreQueryAtom n1 =
new PharmacophoreQueryAtom("A", "c1ccccc1");
PharmacophoreQueryAtom n2 =
new PharmacophoreQueryAtom("A", "c1ccccc1");
![Page 96: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/96.jpg)
Cheminformaticsin R
96/189
Introduction
I/O
Fingerprints
Substructures
Pharmacophores
I Construct pharmacophore constraints
PharmacophoreQueryBond b1 =
new PharmacophoreQueryBond(o, n1, 4.0, 4.5);
PharmacophoreQueryBond b2 =
new PharmacophoreQueryBond(o, n2, 4.0, 5.0);
PharmacophoreQueryBond b3 =
new PharmacophoreQueryBond(n1, n2, 5.4, 5.8);
![Page 97: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/97.jpg)
Cheminformaticsin R
97/189
Introduction
I/O
Fingerprints
Substructures
Pharmacophores
I Given pharmacophore groups and constraints, weconstruct a pharmacophore query
I Design to have the same semantics as a normal molecule
QueryAtomContainer query =
new QueryAtomContainer();
query.addAtom(o);
query.addAtom(n1);
query.addAtom(n2);
query.addBond(b1);
query.addBond(b2);
query.addBond(b3);
![Page 98: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/98.jpg)
Cheminformaticsin R
98/189
Introduction
I/O
Fingerprints
Substructures
Pharmacophores
I Given the query we can now perform a pharmacophoresearch
IAtomContainer aMolecule;
// load in the molecule
PharmacophoreMatcher matcher =
new PharmacophoreMatcher(query);
boolean status = matcher.matches(aMolecule);
if (status) {
// get the matching pharmacophore groups
List<List<PharmacophoreAtom>> pmatches =
matcher.getUniqueMatchingPharmacophoreAtoms();
}
![Page 99: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/99.jpg)
Cheminformaticsin R
99/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Part III
Cheminformatics & R - rcdk
![Page 100: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/100.jpg)
Cheminformaticsin R
100/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Using the CDK in R
I Based on the rJava package
I Two R packages to install (not counting thedependencies)
I Provides access to a variety of CDK classes and methods
I Idiomatic R
R Programming Environment
rJava
CDK Jmol
rcdk
XML
rpubchem
fingerprint
![Page 101: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/101.jpg)
Cheminformaticsin R
101/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Outline
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
![Page 102: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/102.jpg)
Cheminformaticsin R
102/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Reading in data
I The CDK supports a variety of file formats
I rcdk loads all recognized formats, automatically
I Data can be local or remote
mols <- load.molecules( c("data/io/set1.sdf",
"data/io/set2.smi",
"http://rguha.net/rcdk/remote.sdf"))
I Gives you a list of Java references representingIAtomContainer objects
I Can’t do much with these objects, except via rcdkfunctions
![Page 103: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/103.jpg)
Cheminformaticsin R
103/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Reading structures from text files
I SMILES strings can be contained in arbitrary text files(such as CSV)
I If you read such a file with read.table or read.csv, makesure to set comment.char=""
I Also a good idea to specify the colClasses argument orelse as.is=TRUE- otherwise SMILES will be read in asfactors
![Page 104: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/104.jpg)
Cheminformaticsin R
104/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Reading SMILES strings
I Also useful to load molecules directly from a SMILESstring
I parse.smiles parses a single molecule
mol <- parse.smiles("c1ccccc1C(=O)N")
## loop over multiple SMILES
smiles <- c("CCC", "c1ccccc1",
"C(C)(C=O)C(CCNC)C1CC1C(=O)")
mols <- sapply(smiles, parse.smiles)
I But the second call is inefficient, especially for largeSMILES collections
![Page 105: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/105.jpg)
Cheminformaticsin R
105/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Efficient SMILES parsing
I By default parse.smiles instantiates a SMILES parserobject (SmilesParser)
I So when looping over a vector of SMILES strings we’recreating a parser object each time
I In such a case, specify a parser object explicitly - 2Xspeedup
smilesParser <- get.smiles.parser()
## now specify this parser
smiles <- c("CCC","c1ccccc1","C(C)(C=O)C(CCNC)C1CC1C(=O)")
mols <- sapply(smiles, parse.smiles, parser = smilesParser)
![Page 106: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/106.jpg)
Cheminformaticsin R
106/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Efficient SMILES parsing
> parser <- get.smiles.parser()
> smiles <- rep(c("CCC",
"c1ccccc1",
"C(C)(C=O)C(CCNC)C1CC1C(=O)"),
500)
> system.time(junk <- sapply(smiles, parse.smiles))
user system elapsed
4.088 0.106 3.946
> system.time(junk <- sapply(smiles, parse.smiles, parser=parser))
user system elapsed
1.792 0.032 1.705
![Page 107: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/107.jpg)
Cheminformaticsin R
107/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
You get what is provided . . . and no more
I CDK philosophy is not to do extra work
I For file reading this implies that it only parses what isspecified in the file and will not perform operations thatare not explicitly indicated
I As a result some features (isotopic masses) of moleculesand atoms are not automatically configured
I load.molecule will do these extra steps
I parse.smiles will not
![Page 108: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/108.jpg)
Cheminformaticsin R
108/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Configuring molecules
## no explicit configuration needed
mols <- load.molecules("data/io/bp.smi")
get.exact.mass(mols[[1]])
## explicit configuration required
mol <- parse.smiles("c1ccccc1")
get.exact.mass(mol)
I The last call will give a NullPointerException sincethe isotopic masses have not been configured
![Page 109: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/109.jpg)
Cheminformaticsin R
109/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Configuring molecules
I When parsing SMILES, do the configuration by hand
smi <- "OC(=O)C1=C(C=CC=C1)OC(=O)C"
mol <- parse.smiles(smi)
## do the configuration
do.typing(mol) ## not required in recent versions
do.aromaticity(mol) ## not required in recent versions
do.isotopes(mol)
## now, this works
get.exact.mass(mol)
![Page 110: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/110.jpg)
Cheminformaticsin R
110/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Hydrogens - implicit & explicit
I SMILES indicates that when hydrogens are not specified,they are to be considered implicit
I But MDL MOL files have no such specification, so if noH’s specified, the molecule will not have explicit orimplicit hydrogens
mol <- load.molecules("data/noh.mol")[[1]]
unlist(lapply(get.atoms(mol), get.symbol))
I In such cases, convert.implicit.to.explicit willadd implicit H’s (to satisfy valencies) and then convertthem to explicit
convert.implicit.to.explicit(mol)
unlist(lapply(get.atoms(mol), get.symbol))
![Page 111: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/111.jpg)
Cheminformaticsin R
111/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
A note on function calls
I R uses copy-on-write semantics for function callsI If a function modifies its input argument, R will send a
copy of the input variable to the functionI Otherwise send a referenceI “Safety” of call-by-value, efficiency of call-by-reference
I In contrast, convert.implicit.to.explicit modifiesthe input argument
I No return value
I In general rJava allows you to work with Java objectsusing call-by-reference semantics
![Page 112: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/112.jpg)
Cheminformaticsin R
112/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Writing molecules
I Currently only SDF is supported as an output file format
I By default a multi-molecule SDF will be written
I Properties are not written out as SD tags by default
smis <- c("c1ccccc1", "CC(C=O)NCC", "CCCC")
mols <- sapply(smis, parse.smiles)
## all molecules in a single file
write.molecules(mols, filename="mols.sdf")
## ensure molecule data is written out
write.molecules(mols, filename="mols.sdf", write.props=TRUE)
## molecules in individual files
write.molecules(mols, filename="mols.sdf", together=FALSE)
![Page 113: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/113.jpg)
Cheminformaticsin R
113/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Generating SMILES
I Finally, we can also create SMILES strings frommolecules via get.smiles
I Works on a single molecule at a time
I Canonical SMILES are generated by default
smis <- c("c1ccccc1", "CC(C=O)NCC", "CCCC")
mols <- sapply(smis, parse.smiles)
lapply(mols, get.smiles)
![Page 114: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/114.jpg)
Cheminformaticsin R
114/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Outline
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
![Page 115: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/115.jpg)
Cheminformaticsin R
115/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Accessing attached data
I Some file formats allow you to attach data to a structure
I SDF is a good example, using named tags and associatedvalues
I For example, a DrugBank SDF has fields such as
> <DRUGBANK_ID>
DB00135
> <DRUGBANK_GENERIC_NAME>
L-Tyrosine
> <DRUGBANK_MOLECULAR_FORMULA>
C9H11NO3
> <DRUGBANK_MOLECULAR_WEIGHT>
181.1885
![Page 116: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/116.jpg)
Cheminformaticsin R
116/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Accessing data fields
I The CDK reads SDF tags and their values into theIAtomContainer object
I We can access them via the get.property andget.properties
> mols <- load.molecules("data/io/set1.sdf")
> get.properties(mols[[1]])
$DRUGBANK_ID
[1] "DB00135"
$DRUGBANK_IUPAC_NAME
[1] "(2S)-2-amino-3-(4-hydroxyphenyl)propanoic acid"
$‘cdk:Title‘
[1] "135"
...
![Page 117: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/117.jpg)
Cheminformaticsin R
117/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Accessing data fields
I If you know the name of the field, just pull the valuedirectly
get.property(mols[[1]], "DRUGBANK_GENERIC_NAME")
I Note that values for all tags are character, so you needto cast them to the appropriate types manually
![Page 118: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/118.jpg)
Cheminformaticsin R
118/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Extra tags
I In some cases, you’ll see tag names that are not in theSDF file
I The CDK uses molecule property fields to attacharbitrary data.
I For example, the title field from a SDF entry is storedusing the cdk:Title name
I In general, all CDK-derived properties have namesprefixed by “cdk:”
## identify CDK-derived properties
props <- get.properties(mols[[1]])
props[ grep("^cdk:", names(props)) ]
![Page 119: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/119.jpg)
Cheminformaticsin R
119/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Setting property values
I You can add your own property values
I The type of the value is preserved, as long as you arewithin R
I The type of the name (or key) must be character
mol <- parse.smiles("c1ccccc1Cc2ccccc2")
set.property(mol, "MyProperty", 23.14)
set.property(mol, "toxic", "yes")
get.properties(mol)
![Page 120: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/120.jpg)
Cheminformaticsin R
120/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Persisting molecular data
I You could extract the properties and write them to a textfile
mols <- load.molecules("data/io/set1.sdf")
props <- lapply(mols, get.properties)
## convert list to table and write it out
props <- do.call("rbind", props)
write.table(props, "drugdata.txt", row.names=FALSE)
![Page 121: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/121.jpg)
Cheminformaticsin R
121/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Persisting molecular data
I Molecular data can also be persisted by writing to SDFformat
mol <- parse.smiles("c1ccccc1")
set.property(mol, "foo", 1)
set.property(mol, "bar", "baz")
write.molecules(mol, "mol.sdf", write.props=TRUE)
![Page 122: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/122.jpg)
Cheminformaticsin R
122/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Working with molecules
I Currently you can access atoms, bonds, get certain atomproperties, 2D/3D coordinates
I Since rcdk doesn’t cover the entire CDK API, you mightneed to drop down to the rJava level and make calls tothe Java code by hand
I I’m happy to code specific tasks in idiomatic R
![Page 123: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/123.jpg)
Cheminformaticsin R
123/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Working with molecules
I A useful task is to wash or standardize molecules
I CDK doesn’t provide a method to do this directly
I But a common operation is to check for disconnectedmolecules and keep the largest fragment
> m <- parse.smiles("c1ccccc1")
> is.connected(m)
[1] TRUE
>
> m <- parse.smiles("C1CC(N=C1)C(=O)[O-].[Na+]")
> is.connected(m)
[1] FALSE
>
> largest <- get.largest.component(m)
> length(get.atoms(largest))
[1] 8
![Page 124: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/124.jpg)
Cheminformaticsin R
124/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Getting atoms & bonds
I Simple to get atoms and bonds
mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")
atoms <- get.atoms(mol)
bonds <- get.bonds(mol)
I These are lists of jobjRef objects
I rcdk currently supports getting the symbol, coordinates,atomic number, implicit H-count
I Can also check whether an atom is aromatic etc.
I ?get.symbol
![Page 125: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/125.jpg)
Cheminformaticsin R
125/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Working with atoms
I Simple elemental analysis
I Identifying flat molecules
mol <- parse.smiles("c1ccccc1C(Cl)(Br)c1ccccc1")
atoms <- get.atoms(mol)
## elemental analysis
syms <- unlist(lapply(atoms, get.symbol))
round( table(syms)/sum(table(syms)) * 100, 2)
## is the molecule flat?
coords <- do.call("rbind", lapply(atoms, get.point3d))
any(apply(coords, 2, function(x) length(unique(x)) == 1))
![Page 126: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/126.jpg)
Cheminformaticsin R
126/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
SMARTS matching
I rcdk supports substructure searches with SMARTS orSMILES
I May not be practical for large collections of moleculesdue to memory
mols <- sapply(c("CC(C)(C)C",
"c1ccc(Cl)cc1C(=O)O",
"CCC(N)(N)CC"), parse.smiles)
query <- "[#6D2]"
hits <- matches(query, mols)
> print(hits)
CC(C)(C)C c1ccc(Cl)cc1C(=O)O CCC(N)(N)CC
FALSE TRUE TRUE
![Page 127: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/127.jpg)
Cheminformaticsin R
127/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Persisting objects
I We already saw how we can persist molecule properties
I It’d be useful to also persist CDK objects (i.e.,jobjRef’s)
m <- parse.smiles("c1ccccc1")
obj <- .jserialize(m)
newObj <- .junserialize(obj)
I Serialized objects are just raw vectors, so can be savedvia save
I But also take a look at .jcache
![Page 128: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/128.jpg)
Cheminformaticsin R
128/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Outline
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
![Page 129: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/129.jpg)
Cheminformaticsin R
129/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Visualization
I rcdk supports visualization of 2D structure images intwo ways
I First, you can bring up a Swing window
I Second, you can obtain the depiction as a raster image
I Doesn’t work on OS X
mols <- load.molecules("data/dhfr_3d.sd")
## view a single molecule in a Swing window
view.molecule.2d(mols[[1]])
## view a table of molecules
view.molecule.2d(mols[1:10])
![Page 130: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/130.jpg)
Cheminformaticsin R
130/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Visualization
![Page 131: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/131.jpg)
Cheminformaticsin R
131/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Visualization
I The Swing window is a little heavy weight
I It’d be handy to be able to annotate plots with structures
I Or even just make a panel of images that could be savedto a PNG file
I We can make use of rasterImage and rcdk
I As with the Swing window, this won’t work on OS X
![Page 132: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/132.jpg)
Cheminformaticsin R
132/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Visualization
m <- parse.smiles("c1ccccc1C(=O)NC")
img <- view.image.2d(m, 200,200)
## start a plot
plot(1:10, 1:10, pch=19)
## overlay the structure
rasterImage(img, 1,8, 3,10)
![Page 133: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/133.jpg)
Cheminformaticsin R
133/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Visualization
I By playing around with plotting parameters we can fill upthe plot region with an image
I The values being plotted define the coordinates used tooverlay the raster image
mols <- load.molecules("data/dhfr_3d.sd")
## A table of 16 structures
par(mfrow=c(4,4), mar=c(0,0,0,0))
for (i in 1:16) {
img <- view.image.2d(mols[[i]], 200, 200)
plot(1:10, xaxt="n", yaxt="n", xaxs="i", yaxs="i", col="white")
rasterImage(img, 1,1, 10,10)
}
![Page 134: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/134.jpg)
Cheminformaticsin R
134/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Outline
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
![Page 135: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/135.jpg)
Cheminformaticsin R
135/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Molecular descriptors
I Numerical representations of chemical structure featuresI Can be based on
I connectivityI 3D coordnatesI electronic propertiesI combination of the above
I Many descriptors are described and implemented invarious forms
I The CDK implements 45 descriptor classes, resulting in≈ 300 individual descriptor values for a given molecule
![Page 136: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/136.jpg)
Cheminformaticsin R
136/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Descriptor caveats
I Not all descriptors are optimized for speed
I Some of the topological descriptors employ graphisomorphism which makes them slow on large molecules
I In general, to ensure that we end up with a rectangulardescriptor matrix we do not catch exceptions
I Instead descriptor calculation failures return NA
![Page 137: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/137.jpg)
Cheminformaticsin R
137/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
CDK Descriptor Classes
I The CDK provides 3 packages for descriptor calculationsI
org.openscience.cdk.qsar.descriptors.molecularI org.openscience.cdk.qsar.descriptors.atomicI org.openscience.cdk.qsar.descriptors.bond
I rcdk only supports molecular descriptorsI Each descriptor is also described by an ontology
I For rcdk this is used to classify descriptors into groups
![Page 138: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/138.jpg)
Cheminformaticsin R
138/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Descriptor calculations
I Can evaluate a single descriptor or all availabledescriptors
I If a descriptor cannot be calculated, NA is returned (so noexceptions thrown)
I First need to get available descriptor class names
dnames <- get.desc.names()
I Gives you a vector of all descriptor class names
![Page 139: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/139.jpg)
Cheminformaticsin R
139/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Descriptor calculations
I You can also specify descriptor categories, to get a subset
dnames <- get.desc.names("topological")
## what categories are available?
> get.desc.categories()
[1] "electronic" "protein" "topological" "geometrical" "constitutional" "hybrid"
I Depending on the input structures, some descriptors maynot make sense
I If you evaluate 3D descriptors from SMILES input, they’dall be set to NA
![Page 140: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/140.jpg)
Cheminformaticsin R
140/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Evaluating the descriptors
I With a set of descriptor class names in hand, you canevaluate them
descs <- eval.desc(mols, dnames)
I descs will be a data.frame which can then be used in anyof the modeling functions
I Column names are the descriptor names provided by theCDK
I Good practise to put molecule titles in a column or in therow names
![Page 141: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/141.jpg)
Cheminformaticsin R
141/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
The QSAR workflow
![Page 142: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/142.jpg)
Cheminformaticsin R
142/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
The QSAR workflow
I Before model development you’ll need to clean themolecules, evaluate descriptors, generate subsets
I With the numeric data in hand, we can proceed tomodeling
I Before building predictive models, we’d probably explorethe dataset
I Normality of the dependent variableI Correlations between descriptors and dependent variableI Similarity of subsets
I Then we can go wild and build all the models that Rsupports
![Page 143: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/143.jpg)
Cheminformaticsin R
143/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Outline
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
![Page 144: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/144.jpg)
Cheminformaticsin R
144/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Accessing fingerprints
I CDK provides several fingerprintsI Path-based, MACCS, E-State, PubChem
I Access them via get.fingerprint(...)
I Works on one molecule at a time, use lapply to process alist of molecules
I This method works with the fingerprint packageI Separate package to represent and manipulate fingerprint
data from various sources (CDK, BCI, MOE)I Uses C to perform similarity calculationsI Lots of similarity and dissimilarity metrics available
![Page 145: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/145.jpg)
Cheminformaticsin R
145/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Accessing fingerprints
mols <- load.molecules("data/dhfr_3d.sd")
## get a single fingerprint
fp <- get.fingerprint(mols[[1]], type="maccs")
## process a list of molecules
fplist <- lapply(mols, get.fingerprint, type="maccs")
I Each fingerprint is an S4 object
I See the fingerprint package man pages for moredetails
![Page 146: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/146.jpg)
Cheminformaticsin R
146/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Reading fingerprints
I We can also read in fingerprint data generated by othertools - BCI, MOE, CDK
fplist <- fp.read("data/fp/fp.data", size=1052,
lf=bci.lf, header=TRUE)
I Easy to support new line-oriented fingerprint formats byproviding your own line parsing function
I See bci.lf for an example
![Page 147: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/147.jpg)
Cheminformaticsin R
147/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Similarity metrics
I The fingerprint package implements 28 similarity anddissimilarity metrics
I All accessed via the distance function (so you call thiseven if you want similarity)
I Implemented in C, but still, large similarity matrixcalculations are not a good idea!
## similarity between 2 individual fingerprints
distance(fplist[[1]], fplist[[2]], method="tanimoto")
distance(fplist[[1]], fplist[[2]], method="mt")
## similarity matrix - compare similarity distributions
m1 <- fp.sim.matrix(fplist, "tanimoto")
m2 <- fp.sim.matrix(fplist, "carbo")
par(mfrow=c(1,2))
hist(m1, xlim=c(0,1))
hist(m2, xlim=c(0,1))
![Page 148: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/148.jpg)
Cheminformaticsin R
148/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Comparing datasets with fingerprints
I We can compare datasets based on a fingerprints
I Rather than perform pairwise comparisons, we evaluatethe normalized occurence of each bit, across the dataset
I Gives us a n-D vector - the “bit spectrum”
bitspec <- bit.spectrum(fplist)
plot(bitspec, type="l")
0Guha, R., J. Comp. Aid. Molec. Des., 2008, 22, 367–384
![Page 149: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/149.jpg)
Cheminformaticsin R
149/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Bit spectrum
0 50 100 150
0.0
0.2
0.4
0.6
0.8
1.0
Bit Position
Fre
quen
cy
I Only makes sense with structural key type fingerprints
I Longer fingerprints give better resolution
I Comparing bit spectra, via any distance metric, allows usto compare datasets in O(n) time, rather than O(n2) fora pairwise approach
0Guha, R., J. Comp. Aid. Molec. Des., 2008, 22, 367–384
![Page 150: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/150.jpg)
Cheminformaticsin R
150/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Comparing datasets
Bit Position
Nor
mal
ized
Fre
quen
cy
0.0
0.2
0.4
0.6
0.8
1.0
0 50 100 150
Entire Dataset
0.0
0.2
0.4
0.6
0.8
1.0
Subset 1
0.0
0.2
0.4
0.6
0.8
1.0
Subset 2
![Page 151: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/151.jpg)
Cheminformaticsin R
151/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Fingerprints & clustering
I Clustering large collections is not practical
I Any way to avoid a pairwise similarity matrix is desirable
## Get fingerprints
mols <- load.molecules("data/dhfr_3d.sd")
fplist <- lapply(mols, get.fingerprint, type="maccs")
## Gives us a similarity matrix
fpdist <- fp.sim.matrix(fplist)
## Convert to a distance matrix
fpdist <- as.dist(1-fpdist)
## Perform the clustering
clus <- hclust(fpdist, method="ward")
![Page 152: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/152.jpg)
Cheminformaticsin R
152/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Fingerprints & clustering
05
1015
2025
3035
Hierarchical Clustering of the Sutherland DHFR Dataset
![Page 153: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/153.jpg)
Cheminformaticsin R
153/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Comparing fingerprint performance
I Various studies comparing virtual screening methods
I Generally, metric of success is how many actives areretrieved in the top n% of the database
I Can be measured using ROC, enrichment factor, etc.I Exercise - evaluate performance of CDK fingerprints
using enrichment factorsI Load active and decoy moleculesI Evaluate fingerprintsI For each active, evaluate similarity to all other molecules
(active and inactive)I For each active, determine enrichment at a given
percentage of the database screened
For a given query molecule, order dataset by decreasingsimilarity, look at the top 10% and determine fraction of
actives in that top 10%
![Page 154: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/154.jpg)
Cheminformaticsin R
154/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Comparing fingerprint performance
ActivesInactives
Query Targets
Similarity
![Page 155: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/155.jpg)
Cheminformaticsin R
155/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Comparing fingerprint performance
I A good dataset to test this out is the MaximumUnbiased Validation datasets by Rohr & Baumann
I Derived from 17 PubChem bioassay datasets anddesigned to avoid analog bias and artifical enrichment
I As a result, 2D fingerprints generally show poorperformance on these datasets (by design)
I See here for a comparison of various fingerprints usingtwo of these datasets
0Rohrer, S.G et al, J. Chem. Inf. Model, 2009, 49, 169–1840Good, A. & Oprea, T., J. Chem. Inf. Model, 2008, 22, 169–1780Verdonk, M.L. et al, J. Chem. Inf. Model, 2004, 44, 793–806
![Page 156: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/156.jpg)
Cheminformaticsin R
156/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Comparing fingerprint performance0.
00.
20.
40.
60.
81.
0
Query Molecule
Sim
ilarit
y
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
![Page 157: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/157.jpg)
Cheminformaticsin R
157/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Outline
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
![Page 158: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/158.jpg)
Cheminformaticsin R
158/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Fragment based analysis
I Fragment based analysis can be a useful alternative toclustering, especially for large datasets
I Useful for identifying interesting seriesI Many fragmentation schemes are available
I ExhaustiveI Rings and ring assembliesI Murcko
I The CDK supports fragmentation (still needs work) intoMurcko frameworks and ring systems
I We can also use the NCGC fragmenter (seedata/fragmenter.jar) to exhaustively generatefragments
![Page 159: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/159.jpg)
Cheminformaticsin R
159/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Getting fragments
I rcdk currently wraps the GenerateFragments class
I Only exposes the Murcko fragmentation method
I Lets look at directly working a CDK class and itsmethods
I Once we have the fragmenter we can perform a Murckofragmentation and get back the fragments as SMILES
![Page 160: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/160.jpg)
Cheminformaticsin R
160/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Getting fragments
mol <- parse.smiles(
"c1cc(c(cc1c2c(nc(nc2CC)N)N)[N+](=O)[O-])NCc3ccc(cc3)C(=O)N4CCCCC4")
## get the fragmenter
fragmenter <- .jnew("org/openscience/cdk/tools/GenerateFragments")
## do fragmentation
.jcall(fragmenter, "V", "generateMurckoFragments",
.jcast(mol, "org/openscience/cdk/interfaces/IMolecule"),
TRUE, TRUE, as.integer(6))
## get fragments as SMILES
frags <- .jcall(fragmenter,
"[S",
"getMurckoFrameworksAsSmileArray")
![Page 161: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/161.jpg)
Cheminformaticsin R
161/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Some caveats
I The fragment SMILES come out as non-aromatic
I There is supposed to be a way to get the appropriateSMILES out, currently looking into this
I Implies that if we want to look at fragment properties,we’re stuck
I However, we can still use these to look at series and/orlocal trends etc.
![Page 162: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/162.jpg)
Cheminformaticsin R
162/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Aggregating fragments
I Lets wrap the fragmentation procedure into a function
get.frags <- function(mol, fragmenter) {
.jcall(fragmenter, "V", "generateMurckoFragments",
.jcast(mol, "org/openscience/cdk/interfaces/IMolecule"),
TRUE, TRUE, as.integer(6))
return(.jcall(fragmenter, "[S", "getMurckoFrameworksAsSmileArray"))
}
![Page 163: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/163.jpg)
Cheminformaticsin R
163/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Aggregating fragments
I Now, we can get fragments for a set of molecules
mols <- load.molecules("data/dfhr_3d.sd")
fragmenter <- .jnew("org/openscience/cdk/tools/GenerateFragments")
frags <- lapply(mols, function(x, f) {
get.frags(x, f)
}, f=fragmenter)
![Page 164: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/164.jpg)
Cheminformaticsin R
164/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Aggregating fragments
I This gives us a list of fragment SMILES
> frags[1:3]
[[1]]
[1] "C1=CC=C(C=C1)C=2C=NC=NC=2"
[[2]]
[1] "C1=CC=C(C=C1)C=2C=NC=NC=2"
[[3]]
[1] "C1=CC=C(C=C1)C=2C=NC=NC=2"
I What we want is a data.frame that associates fragments,fragment ID’s and the SMILES from which they arederived
![Page 165: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/165.jpg)
Cheminformaticsin R
165/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Aggregating fragments
frags <- lapply(mols, function(x, f) {
tmp <- get.frags(x, f)
tmp <- data.frame(frag=I(tmp),
mol=I(get.property(x, "cdk:Title")))
return(tmp)
}, f=fragmenter)
frags <- do.call("rbind", frags)
I Now we have something useful to work with
> head(frags)
frag mol
1 C1=CC=C(C=C1)C=2C=NC=NC=2 1-pyrimethamine
2 C1=CC=C(C=C1)C=2C=NC=NC=2 1-3062
3 C1=CC=C(C=C1)C=2C=NC=NC=2 1-7364
4 C1=CC=C(C=C1)C=2C=NC=NC=2 1-115194
5 C1=CC=C(C=C1)OCC=2C=CN=CN=2 1-118203
6 C1=CC=C(C=C1)C=2C=NC=NC=2 1-118203
![Page 166: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/166.jpg)
Cheminformaticsin R
166/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Aggregating fragments
I Finally we generate some arbitrary fragment ID’s
ufrags <- data.frame(frag=I(unique(frags$frag)),
fid=I(sprintf("F%03d")),
1:length(unique(frags$frag)))))
frags <- merge(frags, ufrags, by="frag")
I . . . giving us
> tail(frags)
frag mol fid
963 O=C2NC=NC=3NC=C(CNC1=CC=CC=C1)C2=3 43-13 F239
964 O=C2NC4=NC=NC=C4(C=3CN(CC1=CC=CC=C1)CC2=3) 49-4 F265
965 O=S(=O)(C1=CC=CC=C1)C=2C=CC3=NC=NC=C3(N=2) 22-8 F150
966 O=S(=O)(C1=CC=CC=C1)C=2C=CC3=NC=NC=C3(N=2) 22-9 F150
967 O=S(=O)(C1=CC=CC=C1)C=2C=CC3=NC=NC=C3(N=2) 22-7 F150
968 O=S(=O)(NCCOC1=CC=CC=C1)C2=CC=CC=C2 1-122059 F068
![Page 167: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/167.jpg)
Cheminformaticsin R
167/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Doing stuff with fragments
I Look at frequency of occurence of fragments
I Develop predictive models on fragment members, lookingfor local SAR
I Pseudo-cluster a dataset based on fragments
I Compound selection based on fragment membership
frag.counts <- by(frags, frags$frag, nrow)
barplot(frag.counts)
![Page 168: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/168.jpg)
Cheminformaticsin R
168/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Characterizing fragment SAR
I Consider some of the larger series and see if there is anSAR within each of them
I StrategyI Identify fragments with 20 to 30 membersI Merge the fragment data with descriptor data for the
associated moleculesI For each fragment series, build an OLS model using
exhaustive feature selection
I I’ve precalculated a set of descriptors for this purpose (din data/frags.Rda) or you can generate fragmentsyourself
![Page 169: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/169.jpg)
Cheminformaticsin R
169/189
I/O
Molecular data
Visualization
Descriptors
Fingerprints
Fragments
Characterizing fragment SAR
Observed Activity
Pre
dict
ed A
ctiv
ity
−0.5
0.0
0.5
1.0
1.5
2.0
−1 0 1 2
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●●
F018
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●●
F019
−1 0 1 2
● ●
●
●
●
●
●
●
●●
● ●
●
●
●
●
●
●
●
●
●
F052
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
F128
−1 0 1 2
−0.5
0.0
0.5
1.0
1.5
2.0
●
●
●
●●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●
F252
Predicted versus observed activity for 5 fragment series, basedon OLS models and exhaustive feature selection
![Page 170: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/170.jpg)
Cheminformaticsin R
170/189
Part IV
Cheminformatics & R - rpubchem
![Page 171: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/171.jpg)
Cheminformaticsin R
171/189
Public chemical and biological data
![Page 172: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/172.jpg)
Cheminformaticsin R
172/189
Working with PubChem data
Bioassay data (numeric)
Assay metadata(text)
Chemical structures
Process
Extract
Process
StandardizeCleanDescriptors
ProcessExtractMerge
Model
![Page 173: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/173.jpg)
Cheminformaticsin R
173/189
Working with PubChem data in R
I Access PubChem datasets using idiomatic R
I Minimize extra work (merge data, add metadata etc)
I End up with a data.frame, ready for modelingI The usual sequence of tasks would involve
I Get the assay dataI Get structure data associated with the assayI Proceed with modeling
![Page 174: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/174.jpg)
Cheminformaticsin R
173/189
Working with PubChem data in R
I Access PubChem datasets using idiomatic R
I Minimize extra work (merge data, add metadata etc)
I End up with a data.frame, ready for modelingI The usual sequence of tasks would involve
I Get the assay dataI Get structure data associated with the assayI Proceed with modeling
![Page 175: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/175.jpg)
Cheminformaticsin R
174/189
Getting assay data
I The workhorse method is get.assay which takes theAID as argument
assay <- get.assay(2018, quiet=FALSE)
I Returns a data.frame derived from the assay data CSVfile and the assay description XML file form PubChem
I The description, comments and field descriptions andtypes are available as attributes of the data.frame
I Note - no structures are included at this point
![Page 176: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/176.jpg)
Cheminformaticsin R
175/189
Other ways to get assay data
I If you’re looking for a group of assays, say related toHDAC inhibitors, perform a keyword search
## gives a vector of integer AIDs
aids <- find.assay.id("HDAC", quiet=FALSE)
## what are the assays?
sapply(aids[1:10], get.assay.desc)
![Page 177: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/177.jpg)
Cheminformaticsin R
176/189
Other ways to get assay data
I Find assays that a compound is active in
cid <- 68368
aids <- get.aid.by.cid(cid, type="active")
I Gives a vector AIDs that the compound is active in
I If you specify type="raw", a summary is returned
![Page 178: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/178.jpg)
Cheminformaticsin R
177/189
Getting structure informationI Get structure data by CID or SID
cids <- get.cid(assay$PUBCHEM.CID[1:10])
I Gives you a data.frame containing name, structure andsome properties
> str(structs)
’data.frame’: 10 obs. of 11 variables:
$ CID : chr "644436" "645300" "645983" "648874" ...
$ IUPACName : chr "2-(2-methylphenyl)-1,3-benzoxazol-5-amine"
$ CanonicalSmile : chr "CC1=CC=CC=C1C2=NC3=C(O2)C=CC(=C3)N"
$ MolecularFormula : chr "C14H12N2O" "C18H15N3O2" "C24H34N6O3"
$ MolecularWeight : num 224 305 455 407 290 ...
$ TotalFormalCharge : int 0 0 0 0 0 0 0 0 0 0
$ XLogP : num 3.1 3.2 3 4.5 2.1 2.9 1.8 2.5 0.7 1.7
$ HydrogenBondDonorCount : int 1 0 1 1 2 2 1 1 2 2
$ HydrogenBondAcceptorCount: int 3 4 7 5 4 3 7 8 5 4
$ HeavyAtomCount : int 17 23 33 29 21 26 33 31 20 21
$ TPSA : num 52 57 94.4 68 84.5 70.7 94.4 125 104 69.6
![Page 179: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/179.jpg)
Cheminformaticsin R
178/189
Ready to model
I Having gotten the data.frame of structures we canparse the SMILES into CDK objects and evaluatedescriptors
I However, get.cid (and get.sid) will not work if youtry and retrieve too many at one go
I The current design makes an Entrez URL, which canbecome too large and so is rejected
I Will be switching to the use of PUG
I In the meantime, retreive structure data in chunks
![Page 180: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/180.jpg)
Cheminformaticsin R
179/189
Getting CID data in chunks
## will likely fail
structs <- get.cid(assay$PUBCHEM.CID)
## better way, 300 CIDs at a time
library(itertools)
chunks <- as.list(ichunk(assay$PUBCHEM.CID, 300))
structs <- lapply(chunks, get.cid)
structs <- do.call("rbind", structs)
![Page 181: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/181.jpg)
Cheminformaticsin R
180/189
Part V
Extending rcdk
![Page 182: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/182.jpg)
Cheminformaticsin R
181/189
Why extend?
I rcdk doesn’t cover the whole CDK API
I I add functions as and when I need them (or if someoneasks)
I If you want to access CDK classes and methods, you canuse rJava (such as .jnew and .jcall)
I This can get a bit cumbersomeI Use the soure code for inspiration
I For more complex stuff, you can write Java code thatgets installed along with rcdk
I Ideally this would get contributed back and incorporatedinto the rcdk.jar file that is bundled with the rcdkpackage
![Page 183: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/183.jpg)
Cheminformaticsin R
182/189
Structure of the rcdk package
I If you add new functionality, good idea to add unit tests(using RUnit) under rcdk/inst/unitTests/
![Page 184: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/184.jpg)
Cheminformaticsin R
183/189
Structure of the cdkr project
I Within rcdkjar/, run ant jar to build the jar file andcopy into the rcdk package directory
I Generally, no need to touch rcdklibs/
![Page 185: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/185.jpg)
Cheminformaticsin R
184/189
Writing Java in R
I Don’t like it, but sometimes it’s easier than firing up theJava IDE. Need to be familiar with the CDK API
I Three main functionsI .jnew
I .jcall
I .jcast
![Page 186: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/186.jpg)
Cheminformaticsin R
185/189
Writing Java in R
I First, lets look at a better way to get the atom count ofa molecule
I Right now, we get a molecule and get the list of atomobjects and return the length of the list
I But IAtomContainer has a method getAtomCount()
m <- parse.smiles("CCC")
.jcall(mol, "I", "getAtomCount")
I For static methods, the first cargument can be the fullyqualified class name
I You do need to specify the return type in JNI notation
![Page 187: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/187.jpg)
Cheminformaticsin R
186/189
Writing Java in R
JNI Signature Java Type
I intZ booleanB byteC charLfully-qualified-class; fully-qualified-class[type type[]
I For the L signature, remember the semi-colon at the end
![Page 188: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/188.jpg)
Cheminformaticsin R
187/189
Writing Java in R
I Another example is getting the largest connectedcomponent
I We make use of the ConnectivityChecker class andthe partitionIntoMolecules method
mol <- parse.smiles("CCC(=O)C[O-].[Na+]")
molSet <- .jcall("org.openscience.cdk.graph.ConnectivityChecker",
"Lorg/openscience/cdk/interfaces/IMoleculeSet;",
"partitionIntoMolecules", mol)
ncomp <- .jcall(molSet, "I", "getMoleculeCount")
![Page 189: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/189.jpg)
Cheminformaticsin R
188/189
Writing Java in Java
I When writing Java in R gets too tedious, switch to Javain Java
I You could write classes to do the work and then packagethem into a jar
I Copy the jar into the inst/cont of the rcdk directory
I Edit .First.lib in R/rcdk.R to add this jar to theclasspath
I Alternatively edit the Java sources for the rcdk packageI Need to access the Github repository
![Page 190: Cheminformatics in R](https://reader036.vdocuments.us/reader036/viewer/2022081721/554e9a4eb4c90526358b5359/html5/thumbnails/190.jpg)
Cheminformaticsin R
189/189
Future developments
I Update rpubchem to use PUG throughout
I Data table and depictions
I Streamline I/O and molecule configuration
I Add more atom and bond level operationsI Convert from jobjRef to S4 objects and vice versa
I Would allow for serialization of CDK data classesI Is it worth the effort?
I Your suggestions