robots, small molecules & r

Robots, Small Molecules & R: Ingredients for Exploring and Predicting Biological Effects

Rajarshi Guha September 13, 2014

http://blog.rguha.net/

Hunting for Leads

Target Identification → Lead Discovery → Lead Optimization → Clinical Development

HTS:
•  Assay Optimization: sensitivity, scaling
•  Primary Screening: fluorescence, high content
•  Cherry Picking: select subset to follow up, diversity
•  Confirmation: counter screen, explore SAR

High Throughput Screening

•  Test thousands to hundreds of thousands of compounds in one or more assays

•  Employs a robotic platform
•  Rapidly identify novel modulators of biological systems
– Infectious agents
– Cellular basis of diseases

Robots for Screening

HTS Workflow

•  Rapidly screen large compound collections

•  Efficiently identify real actives
– Test them in slower, accurate, expensive screens

•  Use the data to learn what types of compounds tend to be ac<ve

•  Use the model to suggest more compounds to screen

[Figure: funnel of the Number of Molecules across stages (labels: HTS, Cherry Picks; values: 300K, 1000, 300)]

Data Science Problems

•  Predictive models for highly imbalanced datasets
•  Global versus local models?
•  Feature selection: data driven? Domain driven?
•  Clustering & enrichment
•  Similarity: definition, computation, performance
•  Integration: chemical structures, numerical data, text (papers, patents), images

The Roles of R

Also see ChemPhys CRAN Task View

•  Data Access: ROracle, RMySQL, RPostgreSQL, rpubchem, chemblr
•  Chemistry: rcdk, ChemmineR, fingerprint
•  HTS QC: displayHTS, spdep
•  Imaging: EBImage, rflowcyt, ripa, raster
•  Visualization: grid, ggplot, Shiny, ggvis, igraph
•  Data Analysis: drc, igraph, randomForest, svm, ...

HTS Data Types – Single Point

[Figure: single-point data, Response (0 to 100) versus Concentration]

HTS Data Types – Dose Response

[Figure: dose-response curve, Response (30 to 120) versus log10 Concentration]

$$y = S_0 + \frac{S_{\inf} - S_0}{1 + 10^{(\log AC_{50} - x)H}}$$
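As a rough illustration of the dose-response model on this slide, the sketch below evaluates the four-parameter curve in base R and recovers its parameters from simulated data with nls(). The function name hill, the simulated doses, the noise level and the starting values are all assumptions made for this example; a dedicated package such as drc (listed under Data Analysis) would be the more usual route.

```r
# Four-parameter Hill curve; x is the log10 concentration.
hill <- function(x, s0, sinf, logac50, h) {
  s0 + (sinf - s0) / (1 + 10^((logac50 - x) * h))
}

set.seed(1)
x <- seq(-9, -4, length.out = 16)        # hypothetical log10 molar doses
truth <- list(s0 = 0, sinf = 100, logac50 = -6.5, h = 1.2)

# Simulated responses: the true curve plus a little Gaussian noise.
y <- hill(x, truth$s0, truth$sinf, truth$logac50, truth$h) +
  rnorm(length(x), sd = 1)

# Nonlinear least squares fit from rough starting guesses.
fit <- nls(y ~ hill(x, s0, sinf, logac50, h),
           start = list(s0 = min(y), sinf = max(y), logac50 = -6, h = 1))
est <- coef(fit)
print(round(est, 2))
```

At x equal to logAC50 the exponent is zero, so the curve sits exactly halfway between S0 and Sinf, which is a quick sanity check on any implementation.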

HTS Data Types – Multiple Readouts

(and have this at multiple doses!)

HTS Data Types – Combinations

[Figure: two compounds tested together (+)]

Independent Variable(s)

Activity = f(independent variable(s))

Features, Features, Features

•  How do we “quantify” a chemical structure?
– Charges
– Dipole moments
– Surface properties
– Topological invariants
– Bit strings, e.g. 1 0 1 1 0 0 0 1 0

Working with Molecules in R

•  A number of OSS libraries are available

•  ChemmineR and rcdk are the main packages that allow you to manipulate molecules in R

•  They use rJava to interface with JOELib and CDK, respectively

rcdk

•  Idiomatic R interface to the CDK library
– I/O support for chemical file formats
– Manipulation of atoms, bonds, molecules
– Generate molecular descriptors, fingerprints

    library(rcdk)
    ## parse a single SMILES string; parse.smiles returns a list
    mol <- parse.smiles('CCCC')[[1]]
    ## load a set of molecules from a SMILES file
    mols <- load.molecules('http://www.rguha.net/mipe100.smi')

rcdk

•  rcdk works with references to Java objects
– Can’t save them in a workspace (trivially)

> mol [1] "Java-Object{AtomContainer(2040919865, #A:4, Atom(2131361171, S:C, H:3, AtomType(2131361171, FC:0, Isotope(2131361171, Element(2131361171, S:C, AN:6)))), Atom(1759969037, S:C, H:2, AtomType(1759969037, FC:0, Isotope(1759969037, Element(1759969037, S:C, AN:6)))), Atom(359851081, S:C, H:2, AtomType(359851081, FC:0, Isotope(359851081, Element(359851081, S:C, AN:6)))), Atom(703168415, S:C, H:3, AtomType(703168415, FC:0, Isotope(703168415, Element(703168415, S:C, AN:6)))), #B:3, Bond(549041464, #O:SINGLE, #S:NONE, #A:2, Atom(2131361171, S:C, H:3, AtomType(2131361171, FC:0, Isotope(2131361171, Element(2131361171, S:C, AN:6)))), Atom(1759969037, S:C, H:2, AtomType(1759969037, FC:0, Isotope(1759969037, Element(1759969037, S:C, AN:6)))), ElectronContainer(549041464EC:2)), Bond(2654289, #O:SINGLE, #S:NONE, #A:2, Atom(1759969037, S:C, H:2, AtomType(1759969037, FC:0, Isotope(1759969037, Element(1759969037, S:C, AN:6)))), Atom(359851081, S:C, H:2, AtomType(359851081, FC:0, Isotope(359851081, Element(359851081, S:C, AN:6)))), ElectronContainer(2654289EC:2)), Bond(1660962283, #O:SINGLE, #S:NONE, #A:2, Atom(359851081, S:C, H:2, AtomType(359851081, FC:0, Isotope(359851081, Element(359851081, S:C, AN:6)))), Atom(703168415, S:C, H:3, AtomType(703168415, FC:0, Isotope(703168415, Element(703168415, S:C, AN:6)))), ElectronContainer(1660962283EC:2)))}" >

Calculating Molecular Features

•  Evaluate a matrix of numerical features

•  End up with a rectangular data.frame

    mols <- load.molecules("mipe100.smi")
    ## names of the descriptor classes in the 'topological' category
    dnames <- get.desc.names('topological')
    ## one row per molecule, one column per descriptor
    descs <- eval.desc(mols, dnames)

    > str(descs)
    'data.frame': 99 obs. of 195 variables:
     $ nRings7        : num 1 0 1 0 0 0 0 0 0 0 ...
     $ nRings8        : num 0 0 0 0 0 0 0 0 0 0 ...
     $ nRings9        : num 0 0 0 0 0 0 0 0 0 0 ...
     $ tpsaEfficiency : num 0.1856 0.2035 0.0118 0.0602 ...

Calculating Fingerprints

•  Binary string representation of molecular structure
– Objectively defined, fast to calculate
– Good for searching, clustering, prediction

•  The fingerprint package is used to represent them as S4 objects

    library(fingerprint)
    fps <- lapply(mols, get.fingerprint)

Calculating Fingerprints

•  Methods to compute similarities, generate summaries & manipulate fingerprints

    > fps[[1]]
    Fingerprint object
     name =
     length = 1024
     folded = FALSE
     source = CDK
     bits on = 15 18 45 73 77 78 79 85 87 96 107 109 129 139 149 159 162 166 172 179 194 209 214 223 225 227 239 254 266 272 301 312 327 335 350 354 359 392 393 395 397 415 435 455 486 491 492 499 534 535 541 543 544 545 546 559 575 600 605 618 621 622 626 635 638 644 645 647 690 723 728 742 743 753 754 800 819 831 832 889 893 913 922 930 936 954 985 988 1005 1008 1016
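To make "compute similarities" concrete, here is a minimal base R sketch of one common fingerprint similarity, the Tanimoto coefficient, working directly on the positions of the "on" bits. The slides do not say which measure is used, so Tanimoto is an assumption here, and the function name tanimoto and the toy bit positions are hypothetical; the fingerprint package has its own similarity methods for its S4 objects.

```r
# Tanimoto similarity between two fingerprints, each given as the
# integer positions of its "on" bits (as in a 'bits on' listing).
tanimoto <- function(a, b) {
  common <- length(intersect(a, b))   # bits set in both
  total <- length(union(a, b))        # bits set in either
  if (total == 0) return(NA_real_)    # undefined when no bits are set
  common / total
}

fp1 <- c(15, 18, 45, 73, 77)   # toy fingerprints, not real molecules
fp2 <- c(15, 45, 77, 96)
sim <- tanimoto(fp1, fp2)
print(sim)
```

Here three bits are shared out of six set overall, so the similarity is 0.5; identical fingerprints give 1.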

Use Case - SAR

•  Cluster molecules by structure and examine whether clusters are enriched in activity

    library(chemblr); library(rcdk)
    library(fingerprint)  ## provides fp.sim.matrix
    ## fetch the activities for one ChEMBL assay
    d <- get.activity(chembl.id='CHEMBL857155', type='assay')
    ## look up each compound and keep its ID and SMILES
    cmpds <- lapply(d$ingredient_cmpd_chemblid, get.compound, type='chemblid')
    cmpds <- do.call(rbind, lapply(cmpds, function(x)
      data.frame(x$chemblId, x$smiles, stringsAsFactors=FALSE)))
    ## fingerprints, pairwise similarity, then hierarchical clustering
    mols <- parse.smiles(cmpds$x.smiles)
    fps <- lapply(mols, get.fingerprint)
    sm <- fp.sim.matrix(fps)
    rownames(sm) <- cmpds$x.chemblId
    dm <- as.dist(1-sm)
    clus <- hclust(dm)
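The SAR code stops at hclust(); one possible continuation is to cut the tree into clusters and test each for enrichment in actives. The sketch below does this on simulated data so that it runs without ChEMBL access. The toy distance matrix, the activity calls, the choice of three clusters and the use of a one-sided Fisher's exact test are all assumptions, not something the slides specify.

```r
set.seed(42)
# Toy stand-in for the fingerprint distance matrix: 30 points in 3 groups.
grp <- rep(1:3, each = 10)
pts <- cbind(rnorm(30, mean = grp * 10), rnorm(30, mean = grp * 10))
clus <- hclust(dist(pts))
membership <- cutree(clus, k = 3)

# Hypothetical activity calls: the first group is mostly active.
active <- c(rep(TRUE, 9), FALSE, rep(FALSE, 18), TRUE, TRUE)

# One-sided Fisher's exact test per cluster: is it enriched in actives?
ids <- sort(unique(membership))
enrich <- sapply(ids, function(k) {
  tab <- table(factor(membership == k, levels = c(TRUE, FALSE)),
               factor(active, levels = c(TRUE, FALSE)))
  fisher.test(tab, alternative = "greater")$p.value
})
names(enrich) <- ids
print(signif(enrich, 3))
```

With real data, membership would come from cutree() on the clustering built from 1 - similarity, and active from a threshold on the measured activity.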

Use Case - SAR

[Figure: hierarchical clustering of the assay compounds, with leaves labeled by ChEMBL ID]

Use Case - Bit Spectrum

•  Vector summary of the fingerprints for a dataset
•  Defined as the fraction of times a bit position is set to 1, for each bit position

Fingerprints (one per row):
0 0 1
0 1 0
1 1 1
1 0 1
Bit spectrum: 0.5 0.5 0.75

[Figure: example bit spectrum, Normalized Frequency (0 to 1) versus Bit Position (0 to 750); ~10K molecules]
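The small worked example on this slide can be reproduced in a couple of lines of base R. This is only a sketch of the arithmetic: the matrix name fp_mat and the second dataset other are made up for illustration, and the fingerprint package's bit.spectrum(), used later in these slides, works on fingerprint objects rather than a plain matrix.

```r
# Rows are molecules, columns are bit positions (the 4 x 3 example).
fp_mat <- rbind(c(0, 0, 1),
                c(0, 1, 0),
                c(1, 1, 1),
                c(1, 0, 1))
spectrum <- colMeans(fp_mat)   # fraction of molecules with each bit set

# Comparing two datasets: subtract their spectra, position by position.
other <- rbind(c(1, 1, 1),
               c(1, 0, 0))     # hypothetical second dataset
delta <- spectrum - colMeans(other)
print(spectrum)
print(delta)
```

Each dataset is summarized once, so the comparison itself touches every bit position only once regardless of how many molecules went into each spectrum.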

Use Case - Bit Spectrum

•  Comparison of two datasets is now O(n)
•  Simply take the difference of the two bit spectra

[Figure: difference plot, Δ Normalized Frequency (-1 to 1) versus Bit Position (0 to 150)]

e.g.: Compare ~800 solubles with > 30k insolubles

    library(fingerprint); library(ggplot2)
    ## make two subsets and generate bit spectra
    sol.idx <- which(sol$label == 'high')
    insol.idx <- which(sol$label != 'high')
    sol.bs <- bit.spectrum(fps[sol.idx])
    insol.bs <- bit.spectrum(fps[insol.idx])
    ## display a difference plot
    bsdiff <- sol.bs - insol.bs
    d <- data.frame(x=seq_along(sol.bs), y=bsdiff)
    ggplot(d, aes(x=x, y=y)) + geom_line() +
      xlab('Bit Position') + ylab('Normalized Frequency') +
      ylim(c(-1,1))

PREDICTIVE MODELS - CAVEATS

Building Models is the Easy Part

•  Given a descriptor data.frame or fingerprint list we’re ready to build models
– caret, caretEnsemble

•  The question is whether the model(s) can generalize

•  Applicability is a key consideration when predicting bioactivity
– Has economic & safety ramifications in regulatory environments

Domain Applicability

•  How dissimilar to the training set do you have to be before the prediction is meaningless?
– Distance to training set? Inside/outside convex hull?
– Comparison of bit spectra

[Figure: Training Set and Test Set]
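One simple reading of "distance to training set" is the distance from a new compound to its nearest training compound, compared against how far apart training compounds typically are from each other. The base R sketch below follows that idea on simulated two-dimensional descriptors. The Euclidean metric, the 95th percentile cutoff and all names are assumptions chosen for illustration, not a recommended standard.

```r
set.seed(7)
train <- matrix(rnorm(200), ncol = 2)   # 100 hypothetical training points
test <- rbind(c(0.1, -0.2),             # close to the training cloud
              c(8, 8))                  # far outside it

# Distance from each row of `query` to its nearest row of `ref`.
nn_dist <- function(query, ref) {
  apply(query, 1, function(q) min(sqrt(colSums((t(ref) - q)^2))))
}

# Baseline: each training point's distance to its nearest other training point.
d_train <- as.matrix(dist(train))
diag(d_train) <- Inf
baseline <- apply(d_train, 1, min)
cutoff <- quantile(baseline, 0.95)      # assumed threshold

d_test <- nn_dist(test, train)
in_domain <- d_test <= cutoff
print(data.frame(d_test = round(d_test, 2), in_domain = in_domain))
```

Predictions for compounds flagged as outside the domain would then be reported with a warning or withheld, depending on the application.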

Global vs Local Models

•  Bioassay data is not really big data
•  Can big data be too big?
•  AID 1996 – 57K measurements of aqueous solubility

•  Do we build one model?
•  Or multiple local models?

[Figure: PCA of 166 Binary Features]

RESPONSE SURFACES

Screening Drug Combinations

Translational Interest
•  Increased efficacy
•  Delay resistance
•  Attenuate toxicity

Basic Interest
•  Inform signaling pathway connectivity
•  Identify synthetic lethality
•  Polypharmacology

How to Test Combinations

•  Many procedures described in the literature
– Fixed dose ratio (aka ray)
– Ray contour
– Checkerboard
– Genetic algorithm

[Figure: dose matrix with doses of drug C (C1 to C5) on rows and doses of drug D (D1 to D5, then 0) on columns; labeled cells include the row C1,D1 to C1,D5 and the diagonal C2,D2 to C5,D5]


Why Similarity?

•  Vargatef exhibited anomalous matrix response compared to other VEGFR inhibitors

[Figure: panels labeled Vargatef, DCC-2036, PD-166285, GDC-0941, PI-103, GDC-0980, Bardoxolone methyl, AT-7519, SNS-032, NCGC00188382-01, Lestaurtinib, CNF-2024, ISOX, Belinostat, PF-477736, AZD-7762]

[Figure: panels labeled Vargatef, Linifanib, Axitinib, Sorafenib, Vatalanib, Motesanib, Tivozanib, Brivanib, Telatinib, Cabozantinib, Cediranib, BMS-794833, Lenvatinib, OSI-632, Foretinib, Regorafenib]

When are Combinations Similar?

•  Differences and their aggregates such as RMSD can lead to degeneracy

•  Instead we’re interested in the shape of the surface

•  How to characterize shape?
– Parametrized fits
– Distribution of responses
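The degeneracy of RMSD is easy to see with a toy example: two response surfaces can sit at exactly the same RMSD from a reference while having quite different shapes. The matrices below are invented for illustration and do not come from any screen.

```r
rmsd <- function(a, b) sqrt(mean((a - b)^2))

ref <- matrix(50, nrow = 4, ncol = 4)    # flat reference response surface

# Surface A: shifted up by 10 everywhere (same flat shape).
surf_a <- ref + 10

# Surface B: top half raised by 10, bottom half lowered by 10 (a step).
surf_b <- ref + rbind(matrix(10, 2, 4), matrix(-10, 2, 4))

print(c(a_vs_ref = rmsd(surf_a, ref),
        b_vs_ref = rmsd(surf_b, ref),
        a_vs_b = rmsd(surf_a, surf_b)))
```

Both surfaces are 10 units from the reference by RMSD, yet they differ from each other, which is the motivation for comparing the shape of the surface rather than a single aggregate number.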

[Figure: three density panels of response distributions (x axis 0 to 100), and a density plot of D (x axis 0.00 to 0.75); label: "D, p value"]

Similarity via the Syrjala Test

•  Syrjala test used to compare population distributions over a spatial grid
– Invariant to grid orientation
– Provides an empirical p-value

•  Less degenerate than just considering 1D distributions

Syrjala, S.E., “A Statistical Test for a Difference between the Spatial Distributions of Two Populations”, Ecology, 1996, 77(1), 75-80

Clustering Response Surfaces

[Figure: clustering of response surfaces (axis 0.0 to 0.8) with four clusters: C1 (24), C2 (47), C3 (35), C4 (24)]

Working in “Combination Space”

•  Each cell line is represented as a vector of response matrices

•  “Distance” between two cell lines is a function of the distance between component response matrices

•  F can be min, max, mean, …

$$D(L_1, L_2) = F(\{d_1, d_2, \ldots, d_n\})$$

[Figure: cell lines L1 and L2, each a stack of response matrices; matched pairs of matrices give the distances d1 to d5]
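The definition of D(L1, L2) translates almost directly into code. In the sketch below, each cell line is a list of response matrices in matching order, and the aggregation function is passed in as an argument. The per-matrix distance is a plain RMSD used only as a placeholder (these slides argue for shape-aware measures such as the Syrjala test), and the names matrix_dist, cell_line_dist and agg, along with the toy matrices, are hypothetical.

```r
# Distance between two response matrices; RMSD is an assumed placeholder.
matrix_dist <- function(a, b) sqrt(mean((a - b)^2))

# D(L1, L2) = F({d1, ..., dn}) over matched combinations; `agg` plays F.
cell_line_dist <- function(l1, l2, agg = mean) {
  stopifnot(length(l1) == length(l2))
  d <- mapply(matrix_dist, l1, l2)   # one distance per matched matrix pair
  agg(d)
}

# Two hypothetical cell lines, each a list of three 2 x 2 response matrices.
l1 <- list(matrix(0, 2, 2), matrix(10, 2, 2), matrix(20, 2, 2))
l2 <- list(matrix(3, 2, 2), matrix(10, 2, 2), matrix(14, 2, 2))

res <- sapply(list(min = min, max = max, mean = mean),
              function(f) cell_line_dist(l1, l2, agg = f))
print(res)
```

The component distances here are 3, 0 and 6, so the three choices of F give 0, 6 and 3, which already shows how much the choice of aggregation can change the picture.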

Many Choices to Make

[Figure: four clusterings of the same 20 cell lines (KMS-34, INA-6, L363, OPM-1, XG-2, FR4, AMO-1, XG-6, MOLP-8, ANBL-6, KMS-20, XG-7, OCI-MY1, XG-1, 8226, EJM, U266, KMS-11LB, SKMM-1, MM-MM1), in panels labeled sum, max, min and euc; the leaf order differs between panels]

NETWORKS

Networks & Integration

•  Network models of molecules and targets are common
– Allows for the incorporation of lots of associated information
– Diseases, pathways, OTEs
•  When linked with clinical data & outcomes, we can generate massive networks
– Adverse events (FDA AERS)
– Analysis by Cloudera considered > 10E6 drug-drug-reaction triples

Yildirim, M.A. et al

Networks & Integration

•  SAR data can be viewed in a network form
– SALI, SARI based networks
– Usually requires pairwise calculations of the metric

•  Current studies have focused on small datasets (< 1000 molecules)

•  Hadoop + Giraph could let us apply this to HTS-scale datasets

http://sali.rguha.net/
Peltason, L. et al

Networks & Integration

•  When we apply a network view we can consider many interesting applications & make use of cloud scale infrastructure
– Network based similarity
– Community detection (aka clustering)
– PageRank style ranking (of targets, compounds, …)
– Generate network metrics, which can be used as input to predictive models (for interactions, effects, …)

Bauer-Mehren et al

Combinations as Networks

Combination screens lend themselves naturally to network representations

[Figure: two combination networks, each with a ∆ Bliss legend (−4.3 to 0.0 and −3.4 to 0.0); annotations: immune system process, apoptotic process, transcription from RNA polymerase II promoter, protein phosphorylation, cell communication, immune response]

Combinations as Networks

•  Things get more interesting when we have n × m screens

•  Can be simplified using a variety of methods
– Neighborhoods
– Minimum Spanning Tree

[Figure: network from an n × m combination screen]

Comparing Neighborhoods

Combinations that have DBSumNeg < 1st quartile value for that strain

[Figure: neighborhood networks for strains 3D7, DD2 and HB3]
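The selection rule on this slide (keep combinations whose DBSumNeg falls below the first quartile for their strain) can be sketched in base R as below. The data frame is simulated, DBSumNeg is treated simply as a numeric score per combination, and the column names strain and combo are hypothetical; only the strain labels and the quartile rule come from the slide.

```r
# Hypothetical screen results: one row per (strain, combination).
set.seed(3)
screen <- data.frame(
  strain = rep(c("3D7", "DD2", "HB3"), each = 8),
  combo = rep(paste0("combo", 1:8), times = 3),
  DBSumNeg = round(-runif(24, 0, 4), 2)
)

# Per-strain first quartile, aligned to the rows of `screen`.
q1 <- ave(screen$DBSumNeg, screen$strain,
          FUN = function(v) quantile(v, 0.25))
selected <- screen[screen$DBSumNeg < q1, ]

# Combinations that pass the rule in every strain (may be empty).
shared <- Reduce(intersect, split(selected$combo, selected$strain))
print(selected)
print(shared)
```

Comparing the per-strain selections, for example through their intersection as here, is one way to ask whether the same combinations stand out across strains.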

Identifying the Most Synergistic Pairs

[Figure: network plot]

Summary

•  The HTS workflow presents multiple data science problems involving (unique) data types

•  R can play a role at several stages, but model building is straightforward

•  Representation is key and guides the types and nature of analyses
