

Post on 23-Feb-2016


DESCRIPTION

The Silicon Chemist: Using logic-based machine learning and virtual chemistry to design new drugs automatically. Christopher Reynolds and Mike Sternberg. Department of Computing, Imperial College, and Department of Life Sciences, Imperial College.

TRANSCRIPT


Figure 1. Graphic showing the current method of finding matches for ILP-derived rules, as used by the INDDEx software, and this project’s new method. Diagram labels, current method: Fragmentation of molecules; SVILP generates QSAR rules; Molecular database; Screen; Novel verified hits; Hit rate 30%. Diagram labels, new method: Split molecule with reverse reaction; Modify using all viable reactions; Database of virtual reactions; Modified molecules; Screen; Novel verified hits on synthesisable molecules; Hit rate ?%.

Figure 3 axis label: Predicted pKa

Logic-based drug discovery

• Inductive Logic Programming (ILP) is a machine learning technique that learns human-interpretable qualitative rules from chemical knowledge of active drugs (see Figure 2). These rules relate structure to activity and can guide the next steps of drug design chemistry.

• Using Partial Least-Squares or Support Vector Machines, the rules can then be weighted.

• The weighted rules are used as a Quantitative Structure-Activity Relationship (QSAR) model to predict drug activity.

• This approach combines the human-comprehensible rules of ILP with the predictive accuracy of advanced regression techniques.

• This Support Vector Inductive Logic Programming (SVILP) method is being patented.

• INDDEx (Investigational Novel Drug Discovery by Example) is a proprietary virtual screening program, incorporating SVILP, and developed by Equinox Pharma Ltd., an Imperial College spin-off company.

• The QSAR model is then used to screen a database of all purchasable molecules to identify drug leads.

• In a blind test, INDDEx had a hit rate of 30%, predicting around 30 active molecules, each capable of being the start of a new drug series, and each sufficiently novel that it could be patented.
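As a rough illustration of the pipeline in the bullets above, the Python sketch below evaluates two distance rules of the kind shown in Figure 2 against toy 3D feature coordinates, turns the outcomes into a binary feature vector, and fits one weight per rule. Ordinary least squares (by gradient descent) is used only as a simple stand-in for the Partial Least-Squares or Support Vector Machine weighting named above. The molecule format, function names, coordinates and activity values are all invented for illustration and are not taken from INDDEx.

```python
import math

# Each rule is (feature type 1, feature type 2, distance in Å, tolerance in Å),
# mirroring distance(A, B, C, D, Tol) in the Figure 2 rules.
RULES = [
    ("positive", "Nsp2", 2.49, 0.5),
    ("phenyl", "phenyl", 4.1, 0.5),
]

def rule_fires(molecule, rule):
    """True if two distinct features of the given types lie within d ± tol."""
    type1, type2, d, tol = rule
    for i, p in enumerate(molecule.get(type1, [])):
        for j, q in enumerate(molecule.get(type2, [])):
            if type1 == type2 and i == j:
                continue  # do not pair a feature with itself
            if abs(math.dist(p, q) - d) <= tol:
                return True
    return False

def rule_features(molecule):
    """Binary vector: one entry per rule, 1.0 if the rule fires."""
    return [1.0 if rule_fires(molecule, r) else 0.0 for r in RULES]

def fit_weights(features, activities, lr=0.1, steps=5000):
    """Least-squares fit of [bias, w1, w2, ...] by plain gradient descent."""
    rows = [[1.0] + f for f in features]
    w = [0.0] * len(rows[0])
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(rows, activities):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for k, xk in enumerate(x):
                grad[k] += 2.0 * err * xk / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

def predict(w, molecule):
    """Predicted activity: bias plus the weights of the rules that fire."""
    x = [1.0] + rule_features(molecule)
    return sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical molecules: feature type -> list of 3D coordinates (Å).
m_none = {"positive": [(0.0, 0.0, 0.0)], "Nsp2": [(6.0, 0.0, 0.0)]}
m_rule1 = {"positive": [(0.0, 0.0, 0.0)], "Nsp2": [(2.5, 0.0, 0.0)]}
m_rule2 = {"phenyl": [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]}
m_both = {
    "positive": [(0.0, 0.0, 0.0)],
    "Nsp2": [(0.0, 2.3, 0.0)],
    "phenyl": [(5.0, 0.0, 0.0), (5.0, 4.3, 0.0)],
}
molecules = [m_none, m_rule1, m_rule2, m_both]
activities = [5.0, 6.5, 6.0, 7.5]  # invented activity values

weights = fit_weights([rule_features(m) for m in molecules], activities)
print(weights, predict(weights, m_both))
```

The point of the design is that the fitted weights stay attached to readable rules, so a high predicted activity can be traced back to the structural features that caused it.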

Christopher Reynolds and Mike Sternberg

The Silicon Chemist: Using logic-based machine learning and virtual chemistry to design new drugs automatically.

Introduction

New drug leads are always needed, and virtual screening is now often used to simulate the bioactivity of compounds, in order to search as much of chemical space as possible in a reasonable amount of time. This project takes an existing logic-based machine learning approach to identifying bioactive compounds and incorporates chemical synthesis rules, in order to design novel, easily synthesisable, and effective pharmaceutical drugs.

Figure 5. From left to right: wireframe visualisations of the two reactant molecules entered in SMILES format are processed by the SMIRKS esterification reaction, and the resultant product molecule is returned.
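The transformation described in Figure 5 can be imitated with a deliberately crude string rule. The sketch below is not a SMIRKS engine and is not the project's implementation; a real system would apply the SMIRKS pattern with a cheminformatics toolkit. It assumes the acid SMILES is written with its carboxylic acid group last ("...C(=O)O") and the alcohol SMILES with its hydroxyl oxygen first ("O..."), and the function names and example molecules are chosen purely for illustration.

```python
def esterify(acid_smiles, alcohol_smiles):
    """Toy esterification: R-C(=O)O + O-R' -> R-C(=O)O-R' (water is dropped).

    Returns None when a reactant does not match the assumed writing
    convention, so the caller can skip non-viable combinations.
    """
    if not acid_smiles.endswith("C(=O)O"):
        return None
    if len(alcohol_smiles) < 2 or not alcohol_smiles.startswith("O"):
        return None  # "O" alone is water, not an alcohol
    if alcohol_smiles[1] in "=#":
        return None  # leading O is a carbonyl-type oxygen, not a hydroxyl
    # The product keeps a single ester oxygen linking the two carbon skeletons.
    return acid_smiles + alcohol_smiles[1:]

def enumerate_esters(acids, alcohols):
    """Apply the toy reaction to every acid/alcohol pair that matches."""
    products = []
    for acid in acids:
        for alcohol in alcohols:
            product = esterify(acid, alcohol)
            if product is not None:
                products.append(product)
    return products

# Acetic acid and benzoic acid; ethanol, phenol, and acetaldehyde (rejected).
acids = ["CC(=O)O", "c1ccccc1C(=O)O"]
alcohols = ["OCC", "Oc1ccccc1", "O=CC"]
library = enumerate_esters(acids, alcohols)
print(library)
```

Looping one reaction over a list of fragments in this way is the simplest form of library enumeration; each product would then be screened against the rules like any other molecule.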

Virtual chemistry

To extend the machine learning method currently used in INDDEx, and move the process from hit to lead discovery, the logic-based rules produced by SVILP will be used to modify promising hits. This new process is shown in Figure 1. This project will concentrate on scanning through the dataset of purchasable molecules that partially fulfil the rules, and then altering the molecules to try to fit the remaining rules, using a database of virtual chemical synthesis reactions to combine these molecules with a database of fragment-like molecules. Through this method, the program will generate focussed libraries of synthetic derivatives around the promising hit molecules, and so explore a far greater section of easily synthetically accessible chemical space.

active(A):- positive(A, B), Nsp2(A, C), distance(A, B, C, 2.49, 0.5).

Molecule is active if there is a positive charge centre and an sp2 nitrogen atom 2.49±0.5 Å apart.

active(A):- phenyl(A, B), phenyl(A, C), distance(A, B, C, 4.1, 0.5).

Molecule is active if there are two phenyl rings 4.1±0.5 Å apart.

Figure 2. Example ILP QSAR rules.

Results

Figure 3. A scatter plot showing observed vs. predicted pKa in a cross-validation of a PubChem bioassay. This shows that there are no false positives when using an activity cut-off of pKa 7.
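The "no false positives" reading of Figure 3 amounts to counting the four outcomes at the pKa 7 cut-off. A minimal sketch follows, using invented observed and predicted values rather than the poster's data, and assuming that a value equal to the cut-off counts as active.

```python
def confusion_counts(observed, predicted, cutoff=7.0):
    """Count TP/FP/TN/FN when activity means a pKa at or above the cut-off."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for obs, pred in zip(observed, predicted):
        is_active = obs >= cutoff       # ground truth from the assay
        called_active = pred >= cutoff  # model's call
        if called_active and is_active:
            counts["TP"] += 1
        elif called_active and not is_active:
            counts["FP"] += 1
        elif not called_active and is_active:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts

# Invented example values, for illustration only.
observed = [7.8, 7.2, 6.1, 5.4, 7.5]
predicted = [7.4, 6.6, 6.0, 6.9, 7.1]
counts = confusion_counts(observed, predicted)
print(counts)
```

A plot with an empty false-positive region corresponds to `counts["FP"] == 0`: every compound the model calls active is active in the assay, even if some actives are missed.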

Figure 3 plot labels: true negatives, false negatives, false positives, true positives; axis labels: observed pKa, observed activity.

Department of Computing, Imperial College, and Department of Life Sciences, Imperial College

Figure 4. Receiver Operating Characteristic (ROC) curve, showing the true positive rate against the false positive rate as the discrimination threshold is varied. ROC values for the top 1%, top 5%, top 10% and all actives are 0.912, 0.830, 0.814, and 0.892 respectively. This shows that the sensitivity of active compound detection holds true up to the top 5% most active compounds provided to INDDEx as examples. Curves: top 1% (6 actives), top 5% (25 actives), top 10% (47 actives), active/inactive (232 actives). Axes: false positive rate (horizontal), true positive rate (vertical).
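For readers unfamiliar with the construction, an ROC curve like the one in Figure 4 is traced by sweeping a threshold down through the ranked predictions and recording the false and true positive rates at each step. The sketch below does this for invented scores and labels and integrates the curve with the trapezoid rule. It assumes all scores are distinct (ties are not handled), and it assumes, without confirmation from the poster, that the reported ROC values are areas under the curve.

```python
def roc_points(scores, labels):
    """Return [(fpr, tpr), ...] from the strictest threshold to the loosest.

    labels are 1 for active and 0 for inactive; scores are assumed distinct.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_active in ranked:
        if is_active:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Invented predictions: higher score means predicted more active.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
print(points, auc)
```

Restricting the positive class to the top 1%, 5% or 10% of actives, as in the figure legend, only changes which compounds carry the label 1; the curve construction is the same.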