Robots, Small Molecules & R: Ingredients for Exploring and Predicting Biological Effects
Rajarshi Guha, September 13, 2014
http://blog.rguha.net/
Hunting for Leads
Target Identification → Lead Discovery → Lead Optimization → Clinical Development
HTS
• Assay Optimization – Sensitivity, Scaling
• Primary Screening – Fluorescence, High Content
• Cherry Picking – Select subset to follow up, Diversity
• Confirmation – Counter screen, Explore SAR
High Throughput Screening
• Test thousands to hundreds of thousands of compounds in one or more assays
• Employs a robotic platform
• Rapidly identify novel modulators of biological systems
– Infectious agents
– Cellular basis of diseases
Robots for Screening
HTS Workflow
• Rapidly screen large compound collections
• Efficiently identify real actives
– Test them in slower, accurate, expensive screens
• Use the data to learn what types of compounds tend to be active
• Use the model to suggest more compounds to screen
[Figure: funnel showing the Number of Molecules narrowing from 300K to 1000 to 300; labels: HTS, Cherry Picks]
Data Science Problems
• Predictive models for highly imbalanced datasets
• Global versus local models?
• Feature selection – data driven? Domain driven?
• Clustering & enrichment
• Similarity – definition, computation, performance
• Integration – chemical structures, numerical data, text (papers, patents), images
The Roles of R
Also see ChemPhys CRAN Task View
• Data Access: ROracle, RMySQL, RPostgreSQL, rpubchem, chemblr
• Chemistry: rcdk, ChemmineR, fingerprint
• HTS QC: displayHTS, spdep
• Imaging: EBImage, rflowcyt, ripa, raster
• Visualization: grid, ggplot, Shiny, ggvis, igraph
• Data Analysis: drc, igraph, randomForest, svm, ...
HTS Data Types – Single Point
[Figure: single-point data; Response (0 to 100) vs. Concentration]
HTS Data Types – Dose Response
[Figure: dose-response curve; Response vs. log10 Concentration]
$$y = S_0 + \frac{S_{\inf} - S_0}{1 + 10^{(\log AC_{50} - x)H}}$$
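The four-parameter curve on this slide can be evaluated directly in base R. This is a minimal sketch, assuming x is the log10 concentration; the function name `hill` and the parameter values are illustrative and not taken from the talk.

```r
# Four-parameter Hill curve: y = S0 + (Sinf - S0) / (1 + 10^((logAC50 - x) * H))
# x is log10 concentration; S0 and Sinf are the zero- and infinite-dose asymptotes
hill <- function(x, S0, Sinf, logAC50, H) {
  S0 + (Sinf - S0) / (1 + 10^((logAC50 - x) * H))
}

# Illustrative (hypothetical) parameter values
x <- seq(-2, 0, by = 0.5)
y <- hill(x, S0 = 0, Sinf = 100, logAC50 = -1, H = 1)
```

At x = logAC50 the response sits halfway between S0 and Sinf. In practice the parameters would be fitted to data, for example with the drc package listed on the "Roles of R" slide.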
HTS Data Types – Multiple Readouts
(and have this at multiple doses!)
HTS Data Types – Combinations
Activity = f(Independent Variable(s))
Features, Features, Features
• How do we “quantify” a chemical structure?
Features, Features, Features
• Charges
• Dipole moments
• Surface properties
• Topological invariants
• Bit strings, e.g. 1 0 1 1 0 0 0 1 0
Working with Molecules in R
• A number of OSS libraries are available
• ChemmineR and rcdk are the main packages that allow you to manipulate molecules in R
• They use rJava to interface with JOELib and CDK, respectively
rcdk
• Idiomatic R interface to the CDK library
– I/O support for chemical file formats
– Manipulation of atoms, bonds, molecules
– Generate molecular descriptors, fingerprints
library(rcdk)
mol <- parse.smiles('CCCC')[[1]]
mols <- load.molecules('http://www.rguha.net/mipe100.smi')
rcdk
• rcdk works with references to Java objects
– Can’t save them in a workspace (trivially)
> mol [1] "Java-Object{AtomContainer(2040919865, #A:4, Atom(2131361171, S:C, H:3, AtomType(2131361171, FC:0, Isotope(2131361171, Element(2131361171, S:C, AN:6)))), Atom(1759969037, S:C, H:2, AtomType(1759969037, FC:0, Isotope(1759969037, Element(1759969037, S:C, AN:6)))), Atom(359851081, S:C, H:2, AtomType(359851081, FC:0, Isotope(359851081, Element(359851081, S:C, AN:6)))), Atom(703168415, S:C, H:3, AtomType(703168415, FC:0, Isotope(703168415, Element(703168415, S:C, AN:6)))), #B:3, Bond(549041464, #O:SINGLE, #S:NONE, #A:2, Atom(2131361171, S:C, H:3, AtomType(2131361171, FC:0, Isotope(2131361171, Element(2131361171, S:C, AN:6)))), Atom(1759969037, S:C, H:2, AtomType(1759969037, FC:0, Isotope(1759969037, Element(1759969037, S:C, AN:6)))), ElectronContainer(549041464EC:2)), Bond(2654289, #O:SINGLE, #S:NONE, #A:2, Atom(1759969037, S:C, H:2, AtomType(1759969037, FC:0, Isotope(1759969037, Element(1759969037, S:C, AN:6)))), Atom(359851081, S:C, H:2, AtomType(359851081, FC:0, Isotope(359851081, Element(359851081, S:C, AN:6)))), ElectronContainer(2654289EC:2)), Bond(1660962283, #O:SINGLE, #S:NONE, #A:2, Atom(359851081, S:C, H:2, AtomType(359851081, FC:0, Isotope(359851081, Element(359851081, S:C, AN:6)))), Atom(703168415, S:C, H:3, AtomType(703168415, FC:0, Isotope(703168415, Element(703168415, S:C, AN:6)))), ElectronContainer(1660962283EC:2)))}" >
Calculating Molecular Features
• Evaluate a matrix of numerical features
• End up with a rectangular data.frame
mols <- load.molecules("mipe100.smi")
dnames <- get.desc.names('topological')
descs <- eval.desc(mols, dnames)
> str(descs)
'data.frame': 99 obs. of 195 variables:
 $ nRings7 : num 1 0 1 0 0 0 0 0 0 0 ...
 $ nRings8 : num 0 0 0 0 0 0 0 0 0 0 ...
 $ nRings9 : num 0 0 0 0 0 0 0 0 0 0 ...
 $ tpsaEfficiency : num 0.1856 0.2035 0.0118 0.0602 ...
Calculating Fingerprints
• Binary string representation of molecular structure
– Objectively defined, fast to calculate
– Good for searching, clustering, prediction
• The fingerprint package is used to represent them as S4 objects
library(fingerprint)
fps <- lapply(mols, get.fingerprint)
Calculating Fingerprints
• Methods to compute similarities, generate summaries & manipulate fingerprints
> fps[[1]]
Fingerprint object
 name =
 length = 1024
 folded = FALSE
 source = CDK
 bits on = 15 18 45 73 77 78 79 85 87 96 107 109 129 139 149 159 162 166 172 179 194 209 214 223 225 227 239 254 266 272 301 312 327 335 350 354 359 392 393 395 397 415 435 455 486 491 492 499 534 535 541 543 544 545 546 559 575 600 605 618 621 622 626 635 638 644 645 647 690 723 728 742 743 753 754 800 819 831 832 889 893 913 922 930 936 954 985 988 1005 1008 1016
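To make the similarity calculation concrete, here is a base-R sketch of the Tanimoto coefficient computed on "bits on" positions like those printed on this slide. The function name `tanimoto` and the toy bit positions are assumptions for illustration; the fingerprint package ships its own similarity methods, which this sketch does not replace.

```r
# Tanimoto similarity between two fingerprints, each given as the
# integer positions of its "on" bits: |A intersect B| / |A union B|
tanimoto <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Toy fingerprints (hypothetical bit positions)
fp1 <- c(15, 18, 45, 73)
fp2 <- c(18, 45, 73, 77, 78)
sim <- tanimoto(fp1, fp2)  # 3 shared bits out of 6 distinct bits
```

Working on bit positions rather than full 1024-long vectors keeps the calculation cheap when fingerprints are sparse.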
Use Case - SAR
• Cluster molecules by structure and examine whether clusters are enriched in activity
library(chemblr); library(rcdk); library(fingerprint)
# fetch the activities reported for a ChEMBL assay
d <- get.activity(chembl.id='CHEMBL857155', type='assay')
# look up each compound to obtain its SMILES
cmpds <- lapply(d$ingredient_cmpd_chemblid, get.compound, type='chemblid')
cmpds <- do.call(rbind, lapply(cmpds, function(x) data.frame(x$chemblId, x$smiles, stringsAsFactors=FALSE)))
# parse the structures and compute fingerprints
mols <- parse.smiles(cmpds$x.smiles)
fps <- lapply(mols, get.fingerprint)
# pairwise similarity -> distance -> hierarchical clustering
sm <- fp.sim.matrix(fps)
rownames(sm) <- cmpds$x.chemblId
dm <- as.dist(1-sm)
clus <- hclust(dm)
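The second half of this use case, checking clusters for enrichment, is not shown in the slide's code. One possible sketch in base R cuts the tree and applies a one-sided Fisher's exact test per cluster. The data are simulated stand-ins, and both the choice of k = 3 and the use of Fisher's test are assumptions, not the author's stated method.

```r
set.seed(42)
# Simulated stand-ins: 30 compounds in a 2-D feature space, 3 loose groups
grp <- rep(1:3, each = 10)
x <- cbind(rnorm(30, mean = grp * 5), rnorm(30))
# All of group 1 is active, plus a few random compounds elsewhere
active <- grp == 1 | runif(30) < 0.1

clus <- hclust(dist(x))
members <- cutree(clus, k = 3)

# One-sided Fisher's exact test per cluster: are actives over-represented?
cluster.ids <- sort(unique(members))
enrich.p <- sapply(cluster.ids, function(cl) {
  tab <- table(factor(members == cl, levels = c(TRUE, FALSE)),
               factor(active, levels = c(TRUE, FALSE)))
  fisher.test(tab, alternative = "greater")$p.value
})
names(enrich.p) <- cluster.ids
```

With real data, `dist(x)` would be replaced by the fingerprint-based distance `dm`, and `active` by a thresholded activity value from the assay.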
Use Case - SAR
[Figure: dendrogram of the clustered compounds, with leaves labeled by ChEMBL ID]
Use Case - Bit Spectrum
• Vector summary of the fingerprints for a dataset
• Defined as the fraction of times a bit position is set to 1, for each bit position
[Figure: bit spectrum; Normalized Frequency (0 to 1) vs. Bit Position]
Example fingerprints (one per row; further bits and rows elided):
0 0 1 ...
0 1 0 ...
1 1 1 ...
1 0 1 ...
Bit spectrum: 0.5 0.5 0.75 ...
~ 10K molecules
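The small example on this slide can be reproduced in a couple of lines of base R. This is a sketch using a plain 0/1 matrix; the variable names are illustrative only.

```r
# Rows are fingerprints, columns are bit positions (the 4 x 3 example)
fp.matrix <- rbind(c(0, 0, 1),
                   c(0, 1, 0),
                   c(1, 1, 1),
                   c(1, 0, 1))

# Bit spectrum: fraction of fingerprints with each bit set
spectrum <- colMeans(fp.matrix)
```

The result is one number per bit position, so a dataset of any size collapses to a vector as long as the fingerprint.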
Use Case - Bit Spectrum
• Comparison of two datasets is now O(n)
• Simply take the difference of the two bit spectra
[Figure: difference spectrum; Δ Normalized Frequency (-1 to 1) vs. Bit Position]
e.g.: Compare ~ 800 solubles with > 30k insolubles
library(fingerprint); library(ggplot2)
## make two subsets and generate bit spectra
sol.idx <- which(sol$label == 'high')
insol.idx <- which(sol$label != 'high')
sol.bs <- bit.spectrum(fps[sol.idx])
insol.bs <- bit.spectrum(fps[insol.idx])
## display a difference plot
bsdiff <- sol.bs - insol.bs
d <- data.frame(x=1:length(sol.bs), y=bsdiff)
ggplot(d, aes(x=x,y=y))+geom_line()+
  xlab('Bit Position')+ ylab('Normalized Frequency')+ ylim(c(-1,1))
PREDICTIVE MODELS - CAVEATS
Building Models is the Easy Part
• Given a descriptor data.frame or fingerprint list we’re ready to build models
– caret, caretEnsemble
• Question is whether the model(s) can generalize
• Applicability is a key consideration when predicting bioactivity
– Has economic & safety ramifications in regulatory environments
Domain Applicability
• How dissimilar to the training set do you have to be before the prediction is meaningless?
– Distance to training set? Inside/outside convex hull
– Comparison of bit spectra
[Figure: Training Set vs. Test Set]
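The "distance to training set" idea can be sketched in base R as a nearest-neighbour check. Everything specific here is an assumption for illustration: Euclidean distance on simulated descriptors, and a cutoff at the 95th percentile of leave-one-out nearest-neighbour distances within the training set.

```r
# Nearest-neighbour Euclidean distance from each query row to the training set
nn_dist <- function(train, query) {
  apply(query, 1, function(q) min(sqrt(rowSums(sweep(train, 2, q)^2))))
}

set.seed(1)
train <- matrix(rnorm(200), ncol = 2)            # simulated descriptors
query <- rbind(inside = train[1, ], outside = c(10, 10))

# Cutoff (assumption): 95th percentile of leave-one-out NN distances
loo <- sapply(seq_len(nrow(train)), function(i) {
  nn_dist(train[-i, , drop = FALSE], train[i, , drop = FALSE])
})
cutoff <- quantile(loo, 0.95)

d <- nn_dist(train, query)
in_domain <- d <= cutoff   # FALSE means "too far from anything we trained on"
```

A prediction for a molecule flagged as out of domain would be reported with a warning or withheld, depending on how conservative the setting is.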
Global vs Local Models
• Bioassay data is not really big data
• Can big data be too big?
• AID 1996 – 57K measurements of aqueous solubility
• Do we build one model?
• Or multiple local models?
[Figure: PCA of 166 Binary Features]
RESPONSE SURFACES
Screening Drug Combinations
• Increased efficacy
• Delay resistance
• Attenuate toxicity
• Inform signaling pathway connectivity
• Identify synthetic lethality
• Polypharmacology
Translational Interest | Basic Interest
How to Test Combinations
• Many procedures described in the literature
– Fixed dose ratio (aka ray)
– Ray contour
– Checkerboard
– Genetic algorithm
C5,D5 C5
C4,D4 C4
C3,D3 C3
C2,D2 C2
C1,D5 C1,D4 C1,D3 C1,D2 C1,D1 C1
D5 D4 D3 D2 D1 0
Vargatef DCC-2036 PD-166285 GDC-0941
PI-103 GDC-0980 Bardoxolone methyl AT-7519
SNS-032 NCGC00188382-01 Lestaurtinib CNF-2024
ISOX Belinostat PF-477736 AZD-7762
• Vargatef exhibited anomalous matrix response compared to other VEGFR inhibitors
Why Similarity?
Vargatef
Linifanib Axitinib Sorafenib Vatalanib
Motesanib Tivozanib Brivanib Telatinib
Cabozantinib Cediranib BMS-794833 Lenvatinib
OSI-632 Foretinib Regorafenib
When are Combinations Similar?
• Differences and their aggregates such as RMSD can lead to degeneracy
• Instead we’re interested in the shape of the surface
• How to characterize shape?
– Parametrized fits
– Distribution of responses
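The degeneracy of RMSD is easy to demonstrate with toy matrices. In this sketch (all values are made up), two response surfaces with very different shapes sit at exactly the same RMSD from a flat reference.

```r
rmsd <- function(a, b) sqrt(mean((a - b)^2))

ref   <- matrix(50, nrow = 4, ncol = 4)   # flat reference surface
shift <- ref + 10                         # uniform offset
# Checkerboard of +10 / -10 around the reference
check <- ref + 10 * outer(1:4, 1:4, function(i, j) (-1)^(i + j))

r.shift <- rmsd(ref, shift)   # 10
r.check <- rmsd(ref, check)   # also 10, despite a very different shape
```

Because the aggregate cannot tell these two surfaces apart, a shape-aware comparison is needed.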
[Figure: three density plots of responses (0 to 100), and a density plot of the D statistic (0 to 0.75) annotated "D, p value"]
Similarity via the Syrjala Test
• Syrjala test used to compare population distributions over a spatial grid
– Invariant to grid orientation
– Provides an empirical p-value
• Less degenerate than just considering 1D distributions
Syrjala, S.E., “A Statistical Test for a Difference between the Spatial Distributions of Two Populations”, Ecology, 1996, 77(1), 75-80
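A simplified sketch of a Syrjala-style comparison of two response matrices, written in base R. It follows the general recipe (normalize each surface, compare cumulative distributions from each of the four grid corners, permute paired cells to get an empirical p-value) but is not a verified reimplementation of the published test. The function names are hypothetical, the inputs are assumed to be non-negative matrices with at least two rows and columns, and renormalizing the permuted surfaces is a simplification made here.

```r
# Cumulative sums over a matrix, starting from its top-left corner
cum2d <- function(m) t(apply(apply(m, 2, cumsum), 1, cumsum))

# The four orientations, so that each corner serves once as the origin
flips <- function(m) {
  r <- nrow(m):1; k <- ncol(m):1
  list(m, m[r, , drop = FALSE], m[, k, drop = FALSE], m[r, k, drop = FALSE])
}

# Squared difference of cumulative distributions, averaged over the corners
syrjala_stat <- function(a, b) {
  a <- a / sum(a); b <- b / sum(b)
  mean(mapply(function(x, y) sum((cum2d(x) - cum2d(y))^2), flips(a), flips(b)))
}

# Empirical p-value: randomly swap the two surfaces' values cell by cell
syrjala_test <- function(a, b, nperm = 999) {
  a <- a / sum(a); b <- b / sum(b)
  obs <- syrjala_stat(a, b)
  perm <- replicate(nperm, {
    swap <- matrix(runif(length(a)) < 0.5, nrow(a))
    syrjala_stat(ifelse(swap, b, a), ifelse(swap, a, b))
  })
  list(statistic = obs, p.value = (1 + sum(perm >= obs)) / (nperm + 1))
}

set.seed(7)
a <- matrix(runif(36), 6)
b <- matrix(runif(36), 6)
res <- syrjala_test(a, b, nperm = 199)
```

Averaging over the four corners is what makes the statistic insensitive to how the grid is oriented.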
Clustering Response Surfaces
[Figure: dendrogram of response surfaces (height 0.0 to 0.8) with four clusters: C1 (24), C2 (47), C3 (35), C4 (24)]
Working in “Combination Space”
• Each cell line is represented as a vector of response matrices
• “Distance” between two cell lines is a function of the distance between component response matrices
$$D(L_1, L_2) = F(\{d_1, d_2, \ldots, d_n\})$$
• F can be min, max, mean, …
[Figure: cell lines L1 and L2 shown as paired response matrices, with pairwise distances d1 to d5]
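The aggregation D(L1, L2) = F({d1, ..., dn}) can be sketched in base R as follows. The per-matrix distance used here (Euclidean) is a placeholder assumption; any matrix-to-matrix distance could be supplied in its place. The function and variable names are illustrative.

```r
# Distance between two response matrices (assumption: Euclidean)
matrix_dist <- function(a, b) sqrt(sum((a - b)^2))

# Aggregate the per-combination distances d1..dn into one cell-line distance
cell_line_dist <- function(l1, l2, agg = mean, d = matrix_dist) {
  stopifnot(length(l1) == length(l2))
  agg(mapply(d, l1, l2))
}

euc <- function(v) sqrt(sum(v^2))

# Two toy cell lines, each screened against the same two combinations
L1 <- list(matrix(0, 2, 2), matrix(1, 2, 2))
L2 <- list(matrix(1, 2, 2), matrix(1, 2, 2))

res <- sapply(list(min = min, max = max, mean = mean, sum = sum, euc = euc),
              function(f) cell_line_dist(L1, L2, agg = f))
```

Here the component distances are 2 and 0, so the five aggregations already disagree on how far apart the two cell lines are.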
Many Choices to Make
[Figure: four dendrograms of the same 20 cell lines (KMS-34, INA-6, L363, OPM-1, XG-2, FR4, AMO-1, XG-6, MOLP-8, ANBL-6, KMS-20, XG-7, OCI-MY1, XG-1, 8226, EJM, U266, KMS-11LB, SKMM-1, MM-MM1), one per aggregation function: sum, max, min, euc]
NETWORKS
Networks & Integration
• Network models of molecules and targets are common
– Allows for the incorporation of lots of associated information
– Diseases, pathways, OTEs
• When linked with clinical data & outcomes, we can generate massive networks
– Adverse events (FDA AERS)
– Analysis by Cloudera considered > 10E6 drug-drug-reaction triples
Yildirim, M.A. et al
Networks & Integration
• SAR data can be viewed in a network form
– SALI, SARI based networks
– Usually requires pairwise calculations of the metric
• Current studies have focused on small datasets (< 1000 molecules)
• Hadoop + Giraph could let us apply this to HTS-scale datasets
http://sali.rguha.net/ Peltason, L et al
Networks & Integration
• When we apply a network view we can consider many interesting applications & make use of cloud scale infrastructure
– Network based similarity
– Community detection (aka clustering)
– PageRank style ranking (of targets, compounds, …)
– Generate network metrics, which can be used as input to predictive models (for interactions, effects, …)
Bauer-Mehren et al
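As a small illustration of PageRank-style ranking, here is a power-iteration sketch in base R on a toy compound-target network. The network, the node names, and the damping factor of 0.85 are assumptions; at HTS scale one would use a graph library or a distributed framework rather than a dense matrix.

```r
# PageRank by power iteration on an adjacency matrix (rows link to columns)
pagerank <- function(adj, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  n <- nrow(adj)
  out <- rowSums(adj)
  # Row-normalise; nodes with no out-links spread their weight uniformly
  P <- adj / ifelse(out == 0, 1, out)
  P[out == 0, ] <- 1 / n
  r <- rep(1 / n, n)
  for (i in seq_len(max_iter)) {
    r_new <- (1 - damping) / n + damping * as.vector(r %*% P)
    if (sum(abs(r_new - r)) < tol) break
    r <- r_new
  }
  setNames(r_new, rownames(adj))
}

# Toy network: three compounds that all hit the same target
nodes <- c("target", "cmpd1", "cmpd2", "cmpd3")
adj <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
adj["target", c("cmpd1", "cmpd2", "cmpd3")] <- 1
adj[c("cmpd1", "cmpd2", "cmpd3"), "target"] <- 1

pr <- pagerank(adj)
```

The shared target ends up with the highest score, and scores like these can be fed into a predictive model as network-derived features.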
Combinations as Networks
Combination screens lend themselves naturally to network representations
[Figure: two combination networks, each with a ∆ Bliss+ color scale (-4.3 to 0.0 and -3.4 to 0.0); annotations: immune system process, apoptotic process, transcription from RNA polymerase II promoter, protein phosphorylation, cell communication, immune response]
Combinations as Networks
• Things get more interesting when we have n × m screens
• Can be simplified using a variety of methods
– Neighborhoods
– Minimum Spanning Tree
[Figure: network for an n × m combination screen]
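One way to simplify a dense network is to keep only its minimum spanning tree. Below is a base-R sketch of Prim's algorithm on a symmetric distance matrix. How edge weights are derived from combination responses is not stated in the slides, so the toy distances here are purely illustrative.

```r
# Prim's algorithm on a symmetric distance matrix; returns the n - 1 tree
# edges as a two-column matrix of node indices
mst_edges <- function(dm) {
  n <- nrow(dm)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  edges <- matrix(NA_integer_, nrow = n - 1, ncol = 2)
  for (k in seq_len(n - 1)) {
    # Cheapest edge joining the current tree to a node outside it
    sub <- dm[in_tree, !in_tree, drop = FALSE]
    idx <- which(sub == min(sub), arr.ind = TRUE)[1, ]
    from <- which(in_tree)[idx[1]]
    to <- which(!in_tree)[idx[2]]
    edges[k, ] <- c(from, to)
    in_tree[to] <- TRUE
  }
  edges
}

# Toy example: four nodes placed on a line at 0, 1, 3 and 7
dm <- as.matrix(dist(c(0, 1, 3, 7)))
edges <- mst_edges(dm)
total <- sum(dm[edges])   # total weight of the tree
```

For larger networks, the igraph package listed on the "Roles of R" slide provides its own graph routines.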
Comparing Neighborhoods
Combinations that have DBSumNeg < 1st quartile value for that strain
[Figure: one neighborhood network per strain: 3D7, DD2, HB3]
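The per-strain filter described on this slide can be sketched in base R. The data frame below is simulated; only the column name DBSumNeg and the strain labels come from the slide, and the layout of one row per combination per strain is an assumption.

```r
set.seed(3)
# Hypothetical screen results: one DBSumNeg value per combination per strain
screen <- data.frame(
  strain   = rep(c("3D7", "DD2", "HB3"), each = 20),
  combo    = rep(paste0("combo", 1:20), times = 3),
  DBSumNeg = -rexp(60)
)

# Per-strain first quartile, aligned to the rows of `screen`
q1 <- ave(screen$DBSumNeg, screen$strain,
          FUN = function(v) quantile(v, 0.25))

# Keep the combinations below their own strain's first quartile
keep <- screen[screen$DBSumNeg < q1, ]
```

Comparing `keep$combo` across strains then shows which combinations fall in the selected neighborhood for more than one strain.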
Identifying the Most Synergistic Pairs
[Figure: combination network highlighting the most synergistic pairs]
Summary
• The HTS workflow presents multiple data science problems involving (unique) data types
• R can play a role at several stages, but model building is straightforward
• Representation is key and guides the types and nature of analyses