Exploring virtual compound space with Bayesian statistics

1 WP van Hoorn, Feb 2007 Exploring virtual compound space with Bayesian statistics Willem P van Hoorn Chemistry Pfizer Global Research and Development Sandwich UK [email protected] Pipeline Pilot UGM, San Diego, Mar 2007

Upload: willem-van-hoorn

Post on 13-Dec-2014


DESCRIPTION

Search 10^12 virtual compounds in minutes with a Bayesian model made from 10^6 combinatorial chemistry derived compounds. This work has been published: http://pubs.acs.org/doi/abs/10.1021/ci900072g

TRANSCRIPT

Page 1: Exploring virtual compound space with Bayesian statistics

1 WP van Hoorn, Feb 2007

Exploring virtual compound space with Bayesian statistics

Willem P van Hoorn

Chemistry

Pfizer Global Research and Development

Sandwich

UK

[email protected]

Pipeline Pilot UGM, San Diego, Mar 2007

Page 2: Exploring virtual compound space with Bayesian statistics


Overview

• An embarrassment of riches

• Methodology

• Validation

• Interpreting the results

• How about the singleton file?

• Implementation

• Conclusions

Page 3: Exploring virtual compound space with Bayesian statistics


An embarrassment of riches

• I have a compound. I want to make many analogues by combinatorial chemistry.

• At Pfizer, libraries are made internally and by multiple external suppliers.

• Chemist X used to know the entire protocol collection but even he has lost track.

• The Pfizer virtual library is estimated at 10^12 compounds*

• This number is huge. Suppose 1000 CPUs each process 1000 compounds a second: a run time of >11 days, or a maximum of 31 searches a year.

• Can’t this be done quicker?

* As an aside: 10^12 is a trillion in US and modern British usage, and a billion in traditional British usage

http://en.wikipedia.org/wiki/Names_of_large_numbers
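The run-time estimate on this slide can be checked with a few lines of arithmetic. This is only a sketch of the back-of-envelope calculation; the variable names are illustrative and the throughput figures are the ones assumed on the slide.

```python
# Back-of-envelope check of the brute-force search estimate.
VIRTUAL_COMPOUNDS = 10**12        # estimated size of the virtual library
CPUS = 1000                       # number of CPUs assumed on the slide
COMPOUNDS_PER_CPU_PER_SEC = 1000  # throughput per CPU assumed on the slide

seconds = VIRTUAL_COMPOUNDS / (CPUS * COMPOUNDS_PER_CPU_PER_SEC)
days = seconds / 86_400           # 86,400 seconds in a day
searches_per_year = int(365 // days)

print(f"{days:.1f} days per search, at most {searches_per_year} searches a year")
```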

Page 4: Exploring virtual compound space with Bayesian statistics


Methods

Page 5: Exploring virtual compound space with Bayesian statistics


Bayesian Learning, single category

[Diagram labels: Data set (assay data); Fingerprint bits ~ substructures; “Good”: actives; “Bad”: inactives; Bayesian model. Portrait: Rev. Thomas Bayes, ca. 1702-1761]

• Fingerprints are calculated for each molecule

• Check how often fingerprint bit is observed

• And how often in “Good” compound

• Assign weighting factor taking into account both activity ratio and sampling size

• For instance: a “Good”/Total ratio of 90/100 is statistically more relevant than 9/10

• Model distinguishes “Good” from “Bad”

• Prediction: likelihood molecule is “Good”

• Standard component from Pipeline Pilot
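The slide does not give the weighting formula. As a hedged illustration, the sketch below uses a Laplacian-corrected estimator of the kind commonly described for fingerprint-based Bayesian models; treating it as the formula used here is an assumption, and `laplacian_weight` and the 10% base rate are hypothetical. It shows why 90/100 earns a larger weight than 9/10 although the ratio is the same.

```python
import math

def laplacian_weight(n_good_with_bit: int, n_total_with_bit: int, base_rate: float) -> float:
    """Log of the Laplacian-corrected enrichment of one fingerprint bit.

    base_rate is the overall fraction of "Good" compounds in the data set.
    A bit seen in few compounds is pulled towards a weight of zero.
    """
    return math.log((n_good_with_bit + 1) / (n_total_with_bit * base_rate + 1))

base_rate = 0.10  # hypothetical: 10% of all compounds are "Good"
w_large = laplacian_weight(90, 100, base_rate)  # 90 of 100 compounds with the bit are "Good"
w_small = laplacian_weight(9, 10, base_rate)    # same ratio, ten times fewer observations

print(f"weight for 90/100: {w_large:.2f}, weight for 9/10: {w_small:.2f}")
```

Both weights are positive because the bit is enriched in “Good” compounds, but the better-sampled bit gets the larger one.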

Page 6: Exploring virtual compound space with Bayesian statistics


Bayesian Learning, multiple categories

[Diagram labels: Pfizer library file; Fingerprint bits ~ substructures; Bayesian models, Library 1 … Library N]

• Data set contains multiple categories

• For each category, one model is built

• Model X describes what separates category X from all other categories

• Available in Pipeline Pilot since version 4.5.2

• Dataset: Pfizer library file

• Category: library name

• Prediction made: library names ranked by likelihood that the compound originates from that library
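The one-model-per-category idea can be sketched as below. This is a toy illustration, not the Pipeline Pilot component: the strings stand in for fingerprint bits, the library names are made up, and the Laplacian-corrected weight is an assumed choice of estimator.

```python
import math
from collections import defaultdict

def train_one_vs_rest(samples):
    """samples: list of (set_of_bits, library_name). Returns {library: {bit: weight}}."""
    total = len(samples)
    lib_size = defaultdict(int)
    bit_total = defaultdict(int)
    bit_in_lib = defaultdict(lambda: defaultdict(int))
    for bits, lib in samples:
        lib_size[lib] += 1
        for b in bits:
            bit_total[b] += 1
            bit_in_lib[lib][b] += 1
    models = {}
    for lib, size in lib_size.items():
        base = size / total  # fraction of the data set belonging to this library
        # "Good" = member of this library, "Bad" = member of any other library.
        models[lib] = {
            b: math.log((bit_in_lib[lib][b] + 1) / (n * base + 1))
            for b, n in bit_total.items()
        }
    return models

def rank_libraries(models, bits):
    """Score a compound against every library model, best score first."""
    scores = {lib: sum(w.get(b, 0.0) for b in bits) for lib, w in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical training data: two tiny libraries.
samples = [
    ({"amide", "pyridine"}, "Library A"),
    ({"amide", "pyridine", "F"}, "Library A"),
    ({"sulfonamide", "phenyl"}, "Library B"),
    ({"sulfonamide", "phenyl", "Cl"}, "Library B"),
]
models = train_one_vs_rest(samples)
ranking = rank_libraries(models, {"amide", "pyridine"})
print(ranking)
```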

Page 7: Exploring virtual compound space with Bayesian statistics


Building the multi-category Bayesian models

Pfizer compound database

Pfizer library file: all compounds made in-house and externally by combinatorial chemistry: O(6) compounds, O(3) libraries (O(n) denotes an order of magnitude of 10^n)

50% 50%

12.5K

Pfizer singleton diversity subset

Page 8: Exploring virtual compound space with Bayesian statistics


A singleton library?

• 12.5K diverse singleton set added as separate “library”

• Set consists of diverse subset out of larger set of “clean” singleton compounds

• Clean: R4.5 compliant, no structural alerts, no reactive groups, etc.

• Bayesian model describes what these diverse compounds have in common vs. all library compounds.

• Difference is not presence of nitro, aldehyde or other undesirable group

• Difference must be in fragments (chemistry) not obtainable by library chemistry

• High score for this model may indicate that one is outside library chemistry space

Page 9: Exploring virtual compound space with Bayesian statistics


Multi-category Bayesian predictions

A probe (UK-92480, Sildenafil)

By default the top 16 libraries are calculated:

Library   Singleton   Library1   Library2   ...   Library15
Score     104.57      84.10      43.97      ...   12.63

Page 10: Exploring virtual compound space with Bayesian statistics


Bayesian predictions are exemplified by Nearest Neighbour search

Exemplify libraries by identifying nearest neighbours from the library file, default top 6.

Final output: 16 x 6 = 96 compounds (one-plate screenable hypothesis)

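[Figure: the ranked library scores (Singleton 104.57, Library1 84.10, Library2 43.97, ..., Library15 12.63) with the labels “16” and “96”, and an example report row “1. Singleton (in file: 12500)” showing structures with R1 and R2 groups: 849914-95-0, 139755-82-1, 298214-47-8, no CAS, 155879-54-2, 223430-18-0 (labels UK-A to UK-E)]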
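The exemplification step can be sketched as a per-library nearest-neighbour search. The slides do not say which similarity measure is used; Tanimoto similarity on fingerprint bit sets is assumed here, and `exemplify`, the integer "bits" and the toy library file are all hypothetical.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def exemplify(probe_bits, ranked_libraries, library_file, n_neighbours=6):
    """Pick the n most probe-like compounds from each ranked library.

    library_file: {library: [(compound_id, set_of_bits), ...]}
    """
    report = {}
    for lib in ranked_libraries:
        members = library_file.get(lib, [])
        nearest = sorted(members, key=lambda m: tanimoto(probe_bits, m[1]), reverse=True)
        report[lib] = [cid for cid, _ in nearest[:n_neighbours]]
    return report

# Toy library file: 16 libraries of 10 compounds each, bits are arbitrary integers.
library_file = {
    f"Library{i}": [(f"L{i}-C{j}", {i, j, j + 1}) for j in range(10)]
    for i in range(16)
}
probe = {0, 1, 2}
report = exemplify(probe, list(library_file), library_file)
total = sum(len(ids) for ids in report.values())
print(total, "compounds in the report")
```

With the defaults of 16 libraries and 6 neighbours the report holds 96 compounds, which is the one-plate figure quoted on the slide.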

Page 11: Exploring virtual compound space with Bayesian statistics


What is searched?

[Figure: monomer matrix of the virtual library, with real library products in yellow, a virtual product in red, crosses (x) on real products, and outer areas labelled 1 and 2]

• Virtual library: 25 x 31

• Real library compounds (yellow): 19 x 25

• Bayesian model fully covers the 19 x 25 matrix despite being trained on only the yellow products

• All fingerprints in the virtual red product are contained in the crossed yellow products

• Virtual products outside the real library in area 1 share 1 monomer with the real library: still (partially) within scope of the Bayesian model

• Only products in area 2 don’t share a monomer, probably less well predicted

• Easier to be out of scope in a virtual library without template or common core (amide formation)

• Model based on O(6) compounds covers most of the Pfizer virtual library O(12)

• This coverage is at library ID level, not compound level
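The areas in this example can be counted directly from the monomer numbers. The sketch assumes that the 19 and 25 monomers of the real library are subsets of the 25 and 31 monomers of the virtual library, which is how the matrix appears to be drawn; the variable names are illustrative.

```python
# Monomer counts from the slide.
VIRTUAL_A, VIRTUAL_B = 25, 31  # monomers available in the virtual library
REAL_A, REAL_B = 19, 25        # monomers already used in real library products

virtual_total = VIRTUAL_A * VIRTUAL_B
both_known = REAL_A * REAL_B                      # inside the real 19 x 25 matrix
one_known = ((VIRTUAL_A - REAL_A) * REAL_B        # area 1: new monomer A, known monomer B
             + REAL_A * (VIRTUAL_B - REAL_B))     # area 1: known monomer A, new monomer B
none_known = (VIRTUAL_A - REAL_A) * (VIRTUAL_B - REAL_B)  # area 2: both monomers new

for label, n in [("both monomers known", both_known),
                 ("one monomer known (area 1)", one_known),
                 ("no monomer known (area 2)", none_known)]:
    print(f"{label}: {n} of {virtual_total} ({100 * n / virtual_total:.1f}%)")
```

Under these assumptions only 36 of the 775 virtual products, under 5%, share no monomer with the real library.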

Page 12: Exploring virtual compound space with Bayesian statistics


A note on coverage of chemical space

• Electron density around a hydrogen atom is described as a probability distribution

• Electron density is never zero, even at large distance

• Chemists tend to apply a cut-off (atomic radius)

• Similarly, coverage of chemical space by Bayesian models is probabilistic

• Coverage is therefore hard to define: having an amide in common is probably not coverage as a chemist would see it, but what is?

• Time will tell whether a Bayesian score cut-off for chemical space coverage can be established

Page 13: Exploring virtual compound space with Bayesian statistics


Validation

Page 14: Exploring virtual compound space with Bayesian statistics


Random test set

Test set size   Found in top 1   Found in top 5   Not in top 5
9452            9068, 96%        9411, 99.6%      41, 0.4%

[Diagram: test set of 9452 compounds, a random x% of the O(6) library file excluding the singleton library; top 5 predicted library IDs compared with real library ID]
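The recall figures on this slide are top-k hit rates. A minimal sketch of how such a rate could be computed is given below; `top_k_recall` and the toy predictions are hypothetical, while the counts in the last lines are the ones reported on the slide.

```python
def top_k_recall(ranked_predictions, true_labels, k):
    """Fraction of compounds whose real library ID is among the top k predictions."""
    hits = sum(
        1 for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)

# Toy example: three compounds that all truly come from library "A".
predictions = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
truth = ["A", "A", "A"]
top1 = top_k_recall(predictions, truth, 1)
top2 = top_k_recall(predictions, truth, 2)

# Percentages recomputed from the counts on the slide.
TEST_SET = 9452
pct_top1 = 100 * 9068 / TEST_SET
pct_top5 = 100 * 9411 / TEST_SET
pct_missed = 100 * 41 / TEST_SET
print(f"top 1: {pct_top1:.0f}%, top 5: {pct_top5:.1f}%, not in top 5: {pct_missed:.1f}%")
```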

Page 15: Exploring virtual compound space with Bayesian statistics


41 compounds with correct library not in top 5

compound_ID         correct library_id    ranked_as   Bayesian score
PF-A                Internal Library 1    349         -29.0651987519029
Another PF number   Internal Library 2    243         -13.3644982689003
Another PF number   Internal Library 3    69          -0.400118961865439
Another PF number   Internal Library 4    63          0.614583090788282
Another PF number   Internal Library 1    53          -0.13987606970494
Another PF number   Internal Library 5    50          -7.32271642948761
Another PF number   Internal Library 1    35          3.41709994966454
Another PF number   Internal Library 3    22          9.57829190295786
Another PF number   Internal Library 1    22          10.0504136444794
Another PF number   External Library 1    20          22.8956131731457
Another PF number   External Library 2    19          18.8320528385981
Another PF number   Internal Library 3    15          12.1074842056827
Another PF number   Internal Library 1    14          54.6179465790837
Another PF number   Internal Library 3    13          16.6244027916311
Another PF number   Internal Library 1    12          6.74173586963795
Another PF number   Internal Library 1    12          17.0964105412622
Another PF number   Internal Library 1    11          58.7994305333701
Another PF number   External Library 3    10          58.2193181435516
Another PF number   Internal Library 1    10          12.5102031415206
Another PF number   Internal Library 6    9           19.4093857882624
Another PF number   Internal Library 1    8           20.6383651456158
Another PF number   External Library 3    8           73.0633503114444
Another PF number   External Library 4    8           18.5429747446516
Another PF number   Internal Library 1    8           36.9730841061725
Another PF number   Internal Library 1    7           34.8859762378176
Another PF number   Internal Library 3    7           17.3617539873978
Another PF number   Internal Library 1    7           30.8582847036755
Another PF number   Internal Library 1    7           41.848859585633
Another PF number   External Library 5    7           25.8587564812026
Another PF number   External Library 6    6           33.5395919145182
Another PF number   External Library 7    6           39.9074521984672
Another PF number   External Library 8    6           32.3248563198852
Another PF number   External Library 9    6           23.9421542596281
Another PF number   External Library 10   6           95.1965176091739
Another PF number   Internal Library 1    6           53.8715604224809
Another PF number   Internal Library 1    6           28.2709230508615
Another PF number   Internal Library 1    6           48.1827060771728
Another PF number   Internal Library 1    6           17.7689907755174
Another PF number   Internal Library 7    6           57.5694207876578
Another PF number   Internal Library 7    6           53.9529913359943
Another PF number   Internal Library 7    6           56.184768481901

Page 16: Exploring virtual compound space with Bayesian statistics


Worst mispredicted: PF-A

[Reaction scheme: Monomer 2 + Monomer 1, amide formation, product PF-A; no registration error]

General remark: in-house libraries have broad scope and are therefore harder to predict

Internal library 1: 29,800 compounds registered, monomers known for 28,670

120 of these contain Monomer 1, but only 1 compound contains Monomer 2: PF-A is an atypical product

Page 17: Exploring virtual compound space with Bayesian statistics


So what was found?

Bayesian predictions:

1. External library 11: Amide formation

2. External library 12: Amide formation

…

[Figure: retrieved compounds, with parts labelled “Similar to monomer 1” and “Similar to monomer 2”]

• Similar products to probe

• Same chemistry as Internal library 1

• Not really a failure after all

Page 18: Exploring virtual compound space with Bayesian statistics


Six Bayesian categorisation models are available

ECFP_6• Based on atom type (C, N, O, etc)• Chemical descriptor• Default choice for similarity searching• Therefore default for Bayesian search

FCFP_6• Based on atom function (donor, acc, etc)• Functional descriptor• Default choice for activity modelling• Might give chemotype jump

Fingerprint

How to compensate for different sizes of libraries in training set?

Not• Fastest• Works well• Therefore default

#Enrichment• Favours smaller libraries• Slower to calculate

#EstPGood• Favours larger libraries• Slower to calculate

Page 19: Exploring virtual compound space with Bayesian statistics


Recall of known library id as function of model

Model             Found in top 1   Found in top 5   Not in top 5
ECFP              9068, 96%        9411, 99.6%      41, 0.4%
ECFP_Enrichment   5692, 60%        9247, 97.8%      205, 2.2%
ECFP_EstPGood     8372, 89%        9439, 99.9%      13, 0.1%
FCFP              8920, 94%        9367, 99.1%      85, 0.9%
FCFP_Enrichment   6093, 64%        9344, 98.9%      108, 1.1%
FCFP_EstPGood     8547, 90%        9441, 99.9%      11, 0.1%

[Diagram: test set of 9452 compounds, a random x% of the O(6) library file excluding the singleton library; top 5 predicted library IDs compared with real library ID]

Page 20: Exploring virtual compound space with Bayesian statistics


Comparison of six Bayesian models

• Evaluation of absolute models at least twice as slow

• “#Enrichment” models perform significantly worse

• “#EstPGood” models perform better when looking at top 5 (but worse in top 1)

• ECFP / FCFP fingerprint results very similar

• Recall rate used as metric, but search method intended as idea generator

• Hard to predict what chemist considers best idea, advice is to try more than one


Page 21: Exploring virtual compound space with Bayesian statistics


Interpreting the results

Page 22: Exploring virtual compound space with Bayesian statistics


Opening the Bayesian black box

• Bayesian score ~ likelihood of compound originating from library

• But how is this score derived?

• SciTegic fingerprint bits ~ substructures

• Bayesian model consists of fingerprint bits + weighting factors

• Score is the sum of the weights of the substructures present in molecule

• High scoring parts of molecule can be highlighted by these substructures

[Diagram labels: Probe; A library; Fingerprint bits]

• Filter by weight

• Color probe by FP
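The score decomposition described on this slide can be sketched as follows. The weights, the substructure names standing in for fingerprint bits, and the `min_weight` threshold are all made up for illustration; only the "sum of weights" and "filter by weight" steps come from the slide.

```python
def score_and_highlight(model_weights, probe_bits, min_weight=0.5):
    """Return the Bayesian score of a probe and the bits worth highlighting.

    The score is the sum of the weights of the model bits present in the probe.
    Bits whose weight reaches min_weight are returned, heaviest first.
    """
    present = {b: model_weights[b] for b in probe_bits if b in model_weights}
    score = sum(present.values())
    highlighted = sorted(
        (b for b, w in present.items() if w >= min_weight),
        key=present.get,
        reverse=True,
    )
    return score, highlighted

# Hypothetical library model: substructure names stand in for fingerprint bits.
weights = {"sulfonamide": 1.2, "piperazine": 0.9, "ethoxyphenyl": 0.1, "nitro": -0.8}
probe_bits = {"sulfonamide", "piperazine", "ethoxyphenyl", "pyrazolopyrimidinone"}

score, highlighted = score_and_highlight(weights, probe_bits)
print(f"score = {score:.2f}, highlight: {highlighted}")
```

Bits that the model has never seen contribute nothing to the score, which is why the unknown substructure in the probe is simply ignored.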

Page 23: Exploring virtual compound space with Bayesian statistics


Probe is highlighted by what each library recognises

1. Singleton 2. In-house 1 3. In-house 2 4. External 1 5. External 2 6. External 3

7. External 4 8. External 5 9. External 6 10. External 7 11. External 8 12. External 9

13. External 10 14. External 11 15. External 12 16. External 13

In-house 2 yields compounds similar to the left-hand side of the probe

In-house 1 yields compounds similar to the right-hand side of the probe

Page 24: Exploring virtual compound space with Bayesian statistics


Highlighted probes compared to actual compounds retrieved

2. In-house 1 3. In-house 2 4. External 1

Page 25: Exploring virtual compound space with Bayesian statistics


How about the singleton file?

Page 26: Exploring virtual compound space with Bayesian statistics


How about the Pfizer singleton file?

• 150 ≤ Mw ≤ 750; AlogP ≤ 7.5

• Pass reactive group filters

• O(6) compounds, liquid sample

• O(5) compounds, solid sample

• Calculate top 1 predicted library

• Equals cluster by predicted library

• Map singletons on combinatorial chemistry space

Pfizer compound database

All singletons: O(6) compounds

Page 27: Exploring virtual compound space with Bayesian statistics


O(6) liquid singleton compounds mapped to O(3) libraries

[Chart: singletons mapped per library, scale O(4), with the “Singleton” library labelled]

As expected, “Singleton” library dominates

Generally: good spread

None mapped (size of library):

• Library X1 (1)

• Library X2 (11)

• Library X3 (1)

• Library X4 (2)
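Mapping singletons to libraries amounts to grouping compounds by their top 1 predicted library and noting which libraries receive none. A minimal sketch, with a hypothetical `map_to_libraries` helper and made-up compound and library names:

```python
from collections import Counter

def map_to_libraries(top1_predictions, all_libraries):
    """Cluster compounds by top 1 predicted library.

    top1_predictions: {compound_id: predicted_library}
    Returns the number of compounds per library and the libraries with none mapped.
    """
    counts = Counter(top1_predictions.values())
    unmapped = sorted(lib for lib in all_libraries if counts[lib] == 0)
    return counts, unmapped

# Toy data: three singletons and three candidate libraries.
top1 = {"S1": "Singleton", "S2": "Singleton", "S3": "Library A"}
libraries = ["Singleton", "Library A", "Library X1"]

counts, unmapped = map_to_libraries(top1, libraries)
print(counts.most_common(), "unmapped:", unmapped)
```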

Page 28: Exploring virtual compound space with Bayesian statistics


Pfizer solids and vendor compounds have been mapped to libraries

• O(5) solid samples for which no liquid sample is available

• O(6) structures from ChemNavigator not in Pfizer files

[Two charts, each with the “Singleton” library labelled; chart labels: “7 unmapped libraries” and “6 unmapped libraries”; scales O(4) and O(5)]

Page 29: Exploring virtual compound space with Bayesian statistics


Mapped singleton/vendor compounds can be searched by similarity

• Extend SAR

• Spark idea for novel monomers, templates

[Figure labels: 16; 4 x 96; Library compounds; Singleton compounds, liquid; Singleton compounds, solid; Singleton compounds, vendor; 147676-92-4]

Page 30: Exploring virtual compound space with Bayesian statistics


Implementation

Page 31: Exploring virtual compound space with Bayesian statistics


Bayesian search implemented as web service

User

1. Query, one or more of:
• (file of) Pfizer ID
• SMILES (file)
• mol/sd file
• sketch

2. Options, defaults are OK:
• # of ranked libraries (16)
• # of nearest neighbours (6)
• Bayesian model
• Unified output?

Also shown: last model update, overview of coverage, etc.

~5-10 min

pdf report: ranked libraries + NN examples

[Report excerpt: probe highlighted per library for 1. Singleton, 2. In-house 1, 3. In-house 2, 4. External 1 to 16. External 13; nearest-neighbour examples for “1. Singleton (in file: 12500)” with R1 and R2 groups: 849914-95-0, 139755-82-1, 298214-47-8, no CAS, 155879-54-2, 223430-18-0 (labels UK-A to UK-E)]
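The options on this slide could be captured in a small settings object. This is a purely hypothetical sketch: the class and field names are invented, the defaults of 16 and 6 are taken from the slide, "ECFP" is the default model named earlier in the talk, and the default for the unified output flag is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchOptions:
    """Hypothetical container for the options of the Bayesian search."""
    n_ranked_libraries: int = 16    # default from the slide
    n_nearest_neighbours: int = 6   # default from the slide
    bayesian_model: str = "ECFP"    # default model named in the talk
    unified_output: bool = False    # assumed default, not stated on the slide

    @property
    def max_compounds(self) -> int:
        """Upper bound on the number of example compounds in the report."""
        return self.n_ranked_libraries * self.n_nearest_neighbours

defaults = SearchOptions()
wider = SearchOptions(n_ranked_libraries=32)
print(defaults.max_compounds, wider.max_compounds)
```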

Page 32: Exploring virtual compound space with Bayesian statistics


A happy user

Page 33: Exploring virtual compound space with Bayesian statistics


Conclusions

Advantages
• Fast and very accurate in retrieving known library IDs
• Number of output libraries/compounds is tuneable
• Based on registered compounds:
  – precedented chemistry
  – exemplified compounds ready for screening: instant hypothesis testing
• Does not know chemistry, no need to encode chemical reactions
• Coverage of singleton / vendor chemical space
• Neat pdf output
• Proven ability to jump chemotypes

Disadvantages
• Novel libraries or monomers are only “detected” once they appear in a registered product and the models have been regenerated
• Example products are real; more similar virtual products are probably missed

Page 34: Exploring virtual compound space with Bayesian statistics


Acknowledgements

Thanks for contributing ideas, challenges and/or willingness to test:

• Andy Bell

• Bruce Lefker

• Dafydd Owen

• Dan Kung

• Graham Smith

• Jens Loesel

• Kevin Dack

• Dave Rogers (Scitegic)