molecular data and analysis · the similarity concept is widely used in medicinal chemistry : e.g....


Useful Information

• The web address for these lectures is http://www-jmg.ch.cam.ac.uk/cil/partii/ (on front of handout)

• Assessment is by two online exercises (Glen and Goodman) at this address. Each will be marked out of ten. Your (paper) answers should be submitted to Mykola.

• Glen exercises due: Feb 10th 2018

• Lectures and handout available on Moodle


2 Finding molecules

Expanding our representation of chemical structures – Markush structures

In 1924 Dr. Markush was awarded a patent on pyrazolone dyes (USP No. 1,506,316) in which he claimed generic chemical structures in addition to those actually synthesized. Structures of this type were permitted after a ruling in 1925 by the US Patent Office and became known as “Markush structures”. The “Markush Doctrine” of patent law greatly increases flexibility in the preparation of claims for the definition of an invention.

We can expand our search by introducing less exact labelling of attachments to the core structure. Markush structures are essentially structures involving R-groups, where a part of the molecule is defined by a series of alternatives – a more complex example.

Additionally, to introduce a more generic approach to structure matching, we might define e.g. hydrogen-bond donors as:

R = OH, NH, SH, PH for example – care is of course needed, e.g. a COOH may be ionised and have no H!

This approach is extensively used in the patent literature to cover claims of chemical structures with many variations.

Markush or Generic Structures. J. Chem. Inf. Comp. Sci. 1991, 31 (1)

A comparison of different approaches to Markush structure handling. J. Chem. Inf. Comp. Sci. 1991, 31 (1), 64-68
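As a rough illustration of why Markush claims cover so many compounds, the sketch below enumerates a small generic structure. The core, the SMILES template and both R-group lists are invented for this example (they are not taken from any patent); R1 simply reuses the donor atoms listed above, written as SMILES atoms.

```python
from itertools import product
from math import prod

# Hypothetical claim: a benzoic acid core with two variable positions.
# R1 holds donor groups as SMILES atoms (O = OH, N = NH2, S = SH, P = PH2).
TEMPLATE = "OC(=O)c1ccc({R1})cc1{R2}"
R_GROUPS = {
    "R1": ["O", "N", "S", "P"],
    "R2": ["F", "Cl", "Br"],
}

def count_structures(r_groups):
    """Number of specific structures covered: the product of the list lengths."""
    return prod(len(alternatives) for alternatives in r_groups.values())

def enumerate_structures(template, r_groups):
    """Yield one SMILES string per combination of R-group alternatives."""
    names = list(r_groups)
    for combo in product(*(r_groups[name] for name in names)):
        yield template.format(**dict(zip(names, combo)))

structures = list(enumerate_structures(TEMPLATE, R_GROUPS))
print(count_structures(R_GROUPS), structures[:3])
```

The count grows multiplicatively: a claim with five variable positions and twenty alternatives at each already covers 20^5 = 3,200,000 specific structures, which is one reason exhaustive enumeration is rarely practical for real claims.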

An example of a patent claim using Markush structures – how many does this cover?

Markush structure searching over the years, Edlyn S. Simmons, World Patent Information, Volume 25, Issue 3, September 2003, Pages 195-202

Searching Markush compound structures is still an unsolved problem (so-called ‘nasties’), and has great implications for patents. MarPat is a Markush-searchable database of patents. http://www.cas.org/expertise/cascontent/marpat.html


2. Finding molecules using Molecular Similarity

• You may perform a structural search of a database, and find no molecules. You still want to use a molecule like your query in some way, so, how do you find one that is ‘similar’ ?

• We may have e.g. a molecule that shows anti-cancer effects, but is toxic

• We could then look for other molecules that could have a similar anti-cancer effect, but a lower toxic effect

• ‘Similarity’ though, has a context and the right molecular description is needed for each specific case.

Bender A., Glen R.C. Molecular similarity: a key technique in molecular informatics. Org. Biomol. Chem., 2004, 2, 3204–3218.

The similarity concept is widely used in medicinal chemistry: e.g. using the concept of bio-isosteres – the fundamental concept in discovering new drugs

This idea (a bio-isostere) suggests that a chemical group can be mimicked by a replacement group that, in many documented cases, has appeared similar in its response to biological receptors (usually proteins).

e.g. changing substituents while maintaining affinity in an anti-bacterial: Bioorganic & Medicinal Chemistry Letters, Volume 17, Issue 14, 15 July 2007, Pages 4040-4043

An example at influenza neuraminidase – a critical enzyme the virus uses for infection. The inhibitors Oseltamivir and Zanamivir use a bio-isosteric replacement of the natural substrate.

Neuraminidase cleaves the glycosidic linkages of neuraminic acids.

[Figure: natural substrate and inhibitors, labelled ‘similar to’]

More examples used as bio-isosteres (pairs)

Therefore in a search, these additional ‘R’ groups can be included as Markush structures.

Sarah R. Langdon, Peter Ertl, and Nathan Brown. Bioisosteric Replacement and Scaffold Hopping in Lead Generation and Optimization. Mol. Inf. 2010, 29, 366–385

Robert P. Sheridan. The Most Common Chemical Replacements in Drug-Like Compounds. J. Chem. Inf. Comput. Sci. 2002, 42, 103-108

[Figure: pairs of groups, each labelled ‘similar to…’]

But supposing you can’t easily ‘think up a similar substructure’?

There are various methods that have been devised to compute similarity. These are generally:

• Based on the structure

– In one (strings), two (graphs) or three dimensions (coordinates)

• Based on molecular properties

– Experimental (e.g. size and shape) and computed properties (e.g. dipole moment)

Let’s look at how a similarity calculation can be defined using some of these methods.

Similarity. The Maximal Common Subgraph (the biggest common fragment)

• An important search to determine which part of a structure is constant, e.g. identifying reaction components: the atoms and bonds which are constant comprise the MCS.

• It is a complex case of identifying a fragment, as we don’t know the size of the MCS beforehand, so it involves ‘backtracking’ to compute and is therefore time-consuming. This is also one of the problems of converting a list of compounds to a Markush structure.

MCS algorithms can be applied to problems other than atom-atom mapping in reactions:

• Structural similarity between molecules: the size of the MCS (relative to the size of the molecules) can be used as a measure of similarity, e.g. search for molecules containing at least 80% of the query substructure.
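To make the ‘backtracking’ idea concrete, here is a minimal sketch that finds the size of the largest common induced subgraph of two tiny hydrogen-suppressed molecular graphs. It is a brute-force illustration under simplifying assumptions (atoms match on element only, bond orders are ignored, and the common fragment is not forced to be connected); the function name and graph encoding are made up for this example, and production MCS codes are far more sophisticated.

```python
def mcs_size(atoms_a, bonds_a, atoms_b, bonds_b):
    """Atom count of the largest common induced subgraph, found by backtracking.

    atoms_*: dict atom index -> element symbol
    bonds_*: set of frozenset({i, j}) pairs
    """
    keys_a = list(atoms_a)
    best = 0

    def bonded(bonds, i, j):
        return frozenset((i, j)) in bonds

    def extend(pos, mapping):
        nonlocal best
        # Prune: even if every remaining atom were mapped we could not beat best.
        if len(mapping) + (len(keys_a) - pos) <= best:
            return
        if pos == len(keys_a):
            best = len(mapping)
            return
        a = keys_a[pos]
        for b in atoms_b:
            if b in mapping.values() or atoms_a[a] != atoms_b[b]:
                continue
            # The new pair must agree with every earlier pair on bonded / not bonded.
            if all(bonded(bonds_a, a, a2) == bonded(bonds_b, b, b2)
                   for a2, b2 in mapping.items()):
                mapping[a] = b
                extend(pos + 1, mapping)
                del mapping[a]          # backtrack and try the next candidate
        extend(pos + 1, mapping)        # also try leaving atom a unmapped

    extend(0, {})
    return best

def bonds(*pairs):
    return {frozenset(p) for p in pairs}

ethanol = ({0: "C", 1: "C", 2: "O"}, bonds((0, 1), (1, 2)))
propanol = ({0: "C", 1: "C", 2: "C", 3: "O"}, bonds((0, 1), (1, 2), (2, 3)))

n_common = mcs_size(*ethanol, *propanol)
fraction_of_query = n_common / len(ethanol[0])
print(n_common, fraction_of_query)
```

With ethanol as the query, `fraction_of_query` is the quantity compared against a threshold such as the 80% mentioned above. The search is exponential in the worst case, which is the time cost the slide refers to.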


Similarity by Molecular Fingerprints

• Fingerprints are a common approach to describing molecular similarity

• Fingerprints can be considered as a ‘bar code’ for the molecule

• Used because

– uses only the molecular graph

– does not require structural conformation or alignment

– fast searching method

• It is very fast to annotate a database of millions of molecules with fingerprints

• Often you are using fingerprints in searching databases, and don’t realise it !

Molecular Fingerprints

• Hash codes (already mentioned for searching)

• The simplest fingerprint registers the presence or absence of fragments in a molecule, e.g.

Bit:    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... X
Value:  0 0 0 0 0 1 0 0 0 0  0  1  0  0  0  0

A set bit records a fragment that is present (here ‘contains P’ and ‘contains NH2’); an unset bit records a fragment that is absent (e.g. ‘doesn’t contain F’).

Molecular Fingerprints

• We could use this fingerprint, for example, to find only molecules containing phosphorus that have an amine in their structure

Bit:    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ... X
Value:  0 0 0 0 0 1 0 0 0 0  0  1  0  0  0  0

A set bit records a fragment that is present (here ‘contains P’ and ‘contains NH2’); an unset bit records a fragment that is absent (e.g. ‘doesn’t contain F’).

[Figure: structure of an example molecule containing a phosphate group and an NH2 group]
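A screen like this reduces to a couple of bitwise operations. The sketch below is only an illustration: the fragment-to-bit assignment in `FRAGMENT_BITS` and the three ‘molecules’ are hypothetical, since the slide does not say which bit stands for which fragment.

```python
# Hypothetical fragment dictionary: which bit records which fragment is an
# assumption made for this example.
FRAGMENT_BITS = {"F": 2, "P": 5, "NH2": 11}

def make_fingerprint(fragments_present):
    """Pack a list of fragment names into an integer used as a bit string."""
    fp = 0
    for fragment in fragments_present:
        fp |= 1 << FRAGMENT_BITS[fragment]
    return fp

def passes_screen(fp, required, forbidden=()):
    """True if all required bits are set and no forbidden bit is set."""
    need = make_fingerprint(required)
    ban = make_fingerprint(forbidden)
    return (fp & need) == need and (fp & ban) == 0

database = {
    "mol_a": make_fingerprint(["P", "NH2"]),
    "mol_b": make_fingerprint(["NH2"]),
    "mol_c": make_fingerprint(["P", "NH2", "F"]),
}

hits = [name for name, fp in database.items()
        if passes_screen(fp, required=["P", "NH2"], forbidden=["F"])]
print(hits)
```

Because each test is a single AND on a machine word, the whole database can be screened very quickly before any slower atom-by-atom matching is attempted.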

Automated fingerprint generation

• Fingerprints can be generated algorithmically; we don’t need to manually specify all the fragments

• The fingerprint method most often used is based on the CRC algorithm (cyclic redundancy check) – you could look this up on the web.

• Advantages/disadvantages

– easy to calculate

– very fast

– not specific to one area of chemistry

– difficult to understand

Fingerprint Generation – Hashing (linear fingerprint)

CRC (Cyclic Redundancy Check)

CH3CH2CH2CH2OH is broken into linear fragments: H-C-C-C, C-C-C-O, C-C-O-H, etc.

Each fragment is hashed to an integer I, where 0 < I < 10^9, and the bit to set is given by MOD(I, 151).

e.g. a 150-bit fingerprint for 4-atom fragments (we generally use 4-7 atom fragments). The slide numbers the bits 1 ... 151 and shows bit 12 set for one fragment.
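The hashing step can be sketched with Python's standard `zlib.crc32`. This is an assumption-laden toy, not the scheme of any particular toolkit: hydrogens are omitted, bond orders are ignored, the path strings and the 151-bit length are chosen to mirror the slide, and `linear_paths` and `hashed_fingerprint` are names invented here.

```python
import zlib

def linear_paths(atoms, bonds, length=4):
    """All simple paths of `length` atoms, as canonical 'A-B-C-D' strings."""
    neighbours = {i: set() for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].add(j)
        neighbours[j].add(i)
    found = set()

    def walk(path):
        if len(path) == length:
            labels = [atoms[i] for i in path]
            # A path and its reverse are the same fragment: keep one spelling.
            found.add(min("-".join(labels), "-".join(reversed(labels))))
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:
                walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return found

def hashed_fingerprint(atoms, bonds, n_bits=151, length=4):
    """Set of bit positions: CRC-32 of each fragment string, modulo n_bits."""
    return {zlib.crc32(fragment.encode("ascii")) % n_bits
            for fragment in linear_paths(atoms, bonds, length)}

# 1-butanol without hydrogens: C-C-C-C-O
butanol_atoms = ["C", "C", "C", "C", "O"]
butanol_bonds = [(0, 1), (1, 2), (2, 3), (3, 4)]

fragments = linear_paths(butanol_atoms, butanol_bonds)
fingerprint = hashed_fingerprint(butanol_atoms, butanol_bonds)
print(sorted(fragments), sorted(fingerprint))
```

Two different fragments can hash to the same bit (a collision), which is part of why hashed fingerprints are fast and general but, as noted above, difficult to interpret.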

Fingerprint Generation – circular fingerprints

[Figure: an atom-centred environment drawn as levels 0 to 3, with atoms labelled by type (C.ar, C.2, C.3, O.2, O.3, O.co2)]

The fingerprint has one block of bits per layer (layer zero, layer 1, layer 2, ...), each block covering 30 atom types, so its length is the number of ‘atom types’ times the number of levels.

These are very ‘sparse’ but work well – I’ll show some examples later

J. Chem. Inf. Model. 2007, 47(2), 583-590

Can also use a variant of the Morgan algorithm

Comparing the fingerprints of molecules – Tanimoto or Jaccard similarity

$$T = \frac{A\&B}{A + B - A\&B}$$

where A, B and A&B are the number of bits set in fingerprint A, in fingerprint B, and in A-AND-B.

In a hypothetical example, A, B and A&B are 24, 21 and 19 respectively, resulting in a Tanimoto coefficient of 19 / (24 + 21 - 19) = 0.73 (1.00 is perfect similarity).

Another way to put it: TC = BC / (B1 + B2 - BC)

Values above 0.85 are usually significant. This method is commonly used to search for pharmaceutically active molecules, reagents, reactions...
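The Tanimoto coefficient is short enough to write out directly. The sketch below gives two equivalent forms, one taking the three bit counts and one taking fingerprints stored as Python sets of ‘on’ bit positions; the function names are illustrative, and returning 1.0 for two empty fingerprints is a convention assumed here.

```python
def tanimoto_from_counts(a, b, a_and_b):
    """T = (A&B) / (A + B - A&B), from the three bit counts."""
    return a_and_b / (a + b - a_and_b)

def tanimoto(bits_a, bits_b):
    """Same coefficient for fingerprints stored as sets of 'on' bit positions."""
    common = len(bits_a & bits_b)
    union = len(bits_a) + len(bits_b) - common
    return common / union if union else 1.0   # assumed convention for empty sets

# The hypothetical example from the slide: A = 24, B = 21, A&B = 19.
print(round(tanimoto_from_counts(24, 21, 19), 2))
```

A similarity search is then just a loop over the database keeping every molecule whose coefficient against the query exceeds the chosen cutoff (e.g. 0.85).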

Tanimoto similarity example – similarity to o-chloro-p-aminobenzoic acid

Structure                       Tanimoto coefficient
Benzoic acid                    0.52
m-chlorobenzoic acid            0.64
o-chlorobenzoic acid            0.80
o-chloro-p-aminobenzoic acid    1.0
p-aminobenzoic acid             0.70
p-chlorobenzoic acid            0.66

Similarity search in SciFinder Scholar

1 - query structure

2 - similarity search

3 - pick > 85% similarity

4 - six structures retrieved (from xxx Million). This probably uses linear fingerprints

‘Tanimoto’ similarity indices are one of a class of methods for bit-string comparisons.

Some comparison indices additional to the Tanimoto coefficient (Nab/(Na+Nb-Nab)) are:

Hamming coefficient: $H = \sum_{i=1}^{n} \mathrm{XOR}(a_i, b_i)$

Cosine coefficient = Nab/Sqrt(Na x Nb)

A good introduction is in:

http://www.orgchm.bas.bg/~vmonev/SimSearch.pdf

J. Chem. Inf. Comput. Sci., 37, 18-22 (1997)

J. Chem. Inf. Comput. Sci., 43, 819-828 (2003)

J. Chem. Inf. Comput. Sci., 38, 983-996 (1998)

J. Chem. Inf. Model. Publication Date (Web): October 19, 2012, DOI: 10.1021/ci300261r. Just accepted.
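For comparison, the Hamming and cosine coefficients can be sketched on fingerprints held as Python integers, so that XOR and AND act on whole bit strings at once. The helper names and the two 6-bit example fingerprints are invented for illustration.

```python
from math import sqrt

def popcount(x):
    """Number of bits set in a non-negative integer."""
    return bin(x).count("1")

def hamming(a, b):
    """Number of positions where the two bit strings differ (sum of XOR)."""
    return popcount(a ^ b)

def cosine(a, b):
    """Nab / sqrt(Na * Nb) for two non-empty fingerprints."""
    return popcount(a & b) / sqrt(popcount(a) * popcount(b))

fp_a = 0b101100
fp_b = 0b100110
print(hamming(fp_a, fp_b), cosine(fp_a, fp_b))
```

Note the different sense of the two numbers: Hamming is a distance (0 means identical), whereas cosine, like Tanimoto, is a similarity (1 means identical).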

Not just bits – properties of molecules

There are, of course, an enormous number of “molecular properties” that can be used to compare molecules – some of the more common ones are listed below:

1. Quantum mechanical descriptors based on the wavefunction (Carbo index). Quantitative Structure-Activity Relationships, Volume 16, Issue 1 (p 25-32)

2. Topological indices (Wiener, Kier and Hall). H. Wiener, "Structural determination of paraffin boiling points", J. Am. Chem. Soc., 1947, 69(1), 17-20. L. B. Kier, L. H. Hall, Molecular Connectivity in Structure-Activity Analysis, J. Wiley & Sons, New York, 1986

3. Compute molecular properties: volume, surface area, logP, pKa, ... (a vast number) – then cluster molecules according to a similarity measure.

[Figure: molecules, index, and graph of similarity of pairs. Beck et al. Chemical Physics 356 (2009) 121–130]


http://www-metaprint2d.ch.cam.ac.uk/metaprint2d/

Metabolic Site/Product predictor (MetaPrint2D)

Metabolic Site/Product predictor (MetaPrint2D)

1. Symyx Metabolite database (~80000 transformations), substrate + products: calculate the environment for each substrate atom and identify reaction centres

2. Query compound: calculate the environment for each atom

3. For each query atom, find all similar environments in the database: how often is the environment found at a reaction centre?

4. Calculate reaction occurrence ratios: (total number of similar reaction centres) / (total number of similar atoms in rest of database)

5. Calculate relative ratios for each atom in the query compound, and display predictions

Using a naive Bayes probabilistic model

Database version        2005.1   2006.1   2007.1   2008.1
Transformations         72599    78009    82671    87446
Single step             58757    62147    65732    69402
Product not reported    811      831      834      882
Newly added                      5410     4662     4775

Interestingly, the molecule dosed (which has excellent bioavailability) is a partial agonist, while the main metabolite is a full agonist. So, as the drug concentration lowers in blood, the remaining compound becomes more potent – probably a longer-lasting effect

Paracetamol (Tylenol) toxicity

Overdose results in the species NAPQI and liver damage

[Figure: MetaPrint2D results for paracetamol; glutathione is labelled]

3 Finding molecules using three dimensional data

• ‘Real’ molecules exist in a 3-dimensional world

• Their properties depend on their shape and the spatial disposition of functional groups.

• Simple example: dipole moment

[Figure: two structures with dipole moments of 2.5 Debye and 0.5 Debye]

An example of the exquisite matching of a substrate to a protein binding site – here 3D shape and the complementary non-bonded interactions are extremely important

Cheminformatics Tools for drug design

Example of a site which has various drug design tools

• Three dimensions in drug discovery

• A ‘pharmacophore’ is a 3-D representation of the required features for binding to a biological receptor

Here is the pharmacophore model used to design the migraine drug ‘Zomig’, deduced from comparison of molecules that interact with the receptor binding site.

[Figure: pharmacophore with inter-feature distances 5.2, 4.2-4.7, 6.7, 4.8 and 5.1-7.1. Distances in Ångströms.]

Similarity Searching based on pharmacophores - What do we need?

• A database of 3-dimensional structures (Zinc database is 72 million)

– Atom coordinates

– Atom types

– Ring, fragment, property, H-bonding etc. definitions

– An excellent example is the Cambridge Structural Database of X-ray structures (next door)

• A definition of the query

– Fragments of molecules and their properties

– Constraints

• Distances between functional groups

• Angles between these

– The concept of dummy atoms is useful

– e.g. ring centres, H-bonding points, planes

Example search (“Virtual Screening”) of our current 4.5 million 3D database

[Figure: pharmacophore query with distances 5.2, 4.2-4.7, 6.7, 4.8 and 5.1-7.1]

A protonated amine (NH3+), a ring centre (defined by 6 atoms), a hydrogen-bond acceptor, a hydrogen-bond donor-acceptor

- brings up the point that ‘properties’ can be specified at atom points

-- Markush atoms
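A distance-based pharmacophore query can be sketched as a set of feature types plus allowed distance ranges between pairs of features. Everything specific below is an assumption for illustration: the slide's figure is not reproduced here, so which distance belongs to which pair of features is guessed, the 4.5-5.1 range is a tolerance placed around 4.8, and the feature coordinates of the test molecule are invented.

```python
from itertools import permutations
from math import dist

# Hypothetical three-point query (distances in Angstroms).
QUERY_FEATURES = ["amine", "ring", "acceptor"]
QUERY_DISTANCES = {
    (0, 1): (5.1, 7.1),   # amine  to ring centre
    (0, 2): (4.2, 4.7),   # amine  to acceptor
    (1, 2): (4.5, 5.1),   # ring centre to acceptor
}

def matches(query_features, query_distances, mol_features):
    """mol_features: list of (feature type, (x, y, z)) for one conformer.

    True if some assignment of molecule features to query features has the
    right types and satisfies every distance range.
    """
    for combo in permutations(range(len(mol_features)), len(query_features)):
        if any(mol_features[m][0] != q for m, q in zip(combo, query_features)):
            continue
        if all(lo <= dist(mol_features[combo[i]][1], mol_features[combo[j]][1]) <= hi
               for (i, j), (lo, hi) in query_distances.items()):
            return True
    return False

candidate = [
    ("acceptor", (2.5, 3.6, 0.0)),
    ("amine", (0.0, 0.0, 0.0)),
    ("ring", (6.0, 0.0, 0.0)),     # a dummy atom at the ring centre
]
print(matches(QUERY_FEATURES, QUERY_DISTANCES, candidate))
```

The ring centre illustrates the dummy-atom idea: it is a point with coordinates and a type but no real atom. A real search would also have to consider many conformers per molecule, and usually angle constraints as well.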

[Figure labels: hydrophobic center; positive N; H-bond donor/acceptor; H-bond acceptor]

When x-ray structures are available, molecules can be ‘docked’ into the binding site; pharmacophores can be generated and used for searching as before

• A docking program will take a randomised ligand conformation from a ligand/protein x-ray structure and place the molecule back in the correct position.

• Many thousands of molecules can be ‘docked’ per hour.

• Molecules can be selected based on their ‘fit’ to the protein, and subsequently tested for binding affinity

Docking example using Gold: the anti-cancer drug Gleevec – a specific cancer target inhibitor of Bcr-Abl tyrosine kinase, the constitutive abnormal kinase in chronic myeloid leukemia.

Docked Gleevec with Gleevec X-ray: 1T46 (x-ray structure) overlaid with the predicted position of Gleevec – almost perfect – which implies we could use the same docking approach to search for new molecules that work in the same way

Docking example using Gold: Gleevec – specific cancer target inhibitor of Bcr-Abl tyrosine kinase, the constitutive abnormal kinase in chronic myeloid leukemia. Red lines define a pharmacophore

The pharmacophore can be extracted and used to search for additional molecules from our database; these are then tested by ‘docking’ and, if they fit, can be tested for anti-cancer properties in this case.

*GOLD. Jones G, Willett P, Glen R C, Molecular Recognition of Receptor Sites using a Genetic Algorithm with a Description of Desolvation, J. Mol. Biol. 245, 43-53 (1995).

Jones G, Willett P, Glen R C, Leach A R, Taylor R. Development and Validation of a Genetic Algorithm for Flexible Docking. J. Mol. Biol. 267, 727-748 (1997).

“Virtual screening” using similarity – an important way to find starting points for designing new drugs

Suppose we have no information on a biological target. Also, like many pharmaceutical companies, we have 1 million real molecules in our compound store. But, due to cost, we can only afford to screen 10,000. How can we pick the best representative set to screen?

There are essentially two ways to do this – similarity and diversity.

[Figure: two scatter plots of Property A against Property B. Left: selection based on similarity to A and B. Right: a diverse set.]

Virtual screening using similarity

On the bottom left, we have used two molecules displaying biological activity (A and B) to find those most similar in the database for testing, to maximise our chances of finding new hits. On the bottom right, we have no molecules to use, so we select the best diverse set, maximising our chances of a hit whilst only testing a representative subset of the compound library.

[Figure: two scatter plots of Property A against Property B. Left: selection based on similarity to A and B. Right: a diverse set.]
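Both selection strategies can be sketched in a two-property space like the one in the plots. The diversity picker below uses a greedy MaxMin rule, which is one common choice and is substituted here as an assumption (the slides do not say which algorithm was used); the library points, the seed choice and the function names are all invented for illustration.

```python
from math import dist

def pick_similar(library, actives, n):
    """The n library points closest to any known active."""
    return sorted(library, key=lambda p: min(dist(p, a) for a in actives))[:n]

def pick_diverse(library, n):
    """Greedy MaxMin: repeatedly add the point farthest from those already picked."""
    n = min(n, len(library))
    picked = [library[0]]                      # arbitrary seed
    while len(picked) < n:
        rest = [p for p in library if p not in picked]
        picked.append(max(rest, key=lambda p: min(dist(p, q) for q in picked)))
    return picked

# Hypothetical library: each tuple is (property A, property B) for one compound.
library = [(0, 0), (0.1, 0), (5, 5), (10, 0), (0, 10), (5.1, 5)]

similar_set = pick_similar(library, actives=[(5, 5)], n=2)
diverse_set = pick_diverse(library, n=3)
print(similar_set, diverse_set)
```

The similarity pick crowds around the known active, while the diverse pick spreads across the property space. In practice the coordinates would be descriptors or fingerprint distances rather than two raw properties.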

Chemical reactions can also be represented in the computer

An example of a reaction in a modified Smiles, called Smirks:

‘Acetic acid and (.) ethanol > in the presence of HCl and ethanol > make ethyl acetate’

http://www.daylight.com/dayhtml_tutorials/languages/smirks/index.html

Virtual screening using a virtual library

The molecules we screen in the computer don’t have to be physically available. We can generate vast libraries of molecules we could synthesise, and search these. Promising molecules could be synthesised. A billion examples is not unreasonable. An example: potential HIV protease inhibitors.

There are so many, we have to be very selective, and computer-aided design can help

Example reaction of two acids with two alcohols to make four products (the acids have ‘R’ groups). New characters and atom mapping are used.

[*:1][C:2](=[O:3])[O:4][H].[*:2][C:5][O:6][H]>>[*:1][C:2](=[O:3])[O:6][C:5][*:2].[H][O:4][H]
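The two-acids-by-two-alcohols enumeration can be imitated with plain string templates. This is deliberately not a SMIRKS engine: the reaction is hard-coded as a SMILES pattern, and the reagent names and R-group fragments are chosen here purely as an illustration of how a virtual library grows combinatorially.

```python
from itertools import product

# R1 in R1-C(=O)OH and R2 in R2-CH2-OH, written as SMILES fragments (assumed examples).
ACID_R = {"acetic acid": "C", "propanoic acid": "CC"}
ALCOHOL_R = {"ethanol": "C", "1-propanol": "CC"}

def esterify(r1, r2):
    """SMILES of the ester R1-C(=O)-O-CH2-R2; water is the implied by-product."""
    return f"{r1}C(=O)OC{r2}"

def enumerate_esters(acids, alcohols):
    """Every acid paired with every alcohol: len(acids) * len(alcohols) products."""
    return {(acid, alcohol): esterify(acids[acid], alcohols[alcohol])
            for acid, alcohol in product(acids, alcohols)}

esters = enumerate_esters(ACID_R, ALCOHOL_R)
for reagents, smiles in esters.items():
    print(reagents, smiles)
```

Acetic acid with ethanol gives `CC(=O)OCC`, ethyl acetate, matching the reaction described earlier. A cheminformatics toolkit would apply the mapped transform to arbitrary reactant structures rather than relying on a fixed template.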

You’ve found some interesting molecules – but how can we predict their properties quantitatively?

• Particularly in drug discovery (but also in materials science for example) methods have been developed to relate the structure and properties of molecules to their function

• These are called Quantitative Structure Property (or Activity) Relationships - QSPR, QSAR

• The handout contains details of some approaches that may be of interest.

Combining molecular structure with calculation of properties to deduce predictive models is usually termed:

• Quantitative Structure Property Relationships (QSPR)

• Quantitative Structure Activity Relationships (QSAR)

• We calculate descriptors to combine with statistical and machine-learning methods to create models to predict properties.

Picture the data – often the best approach

J. Med. Chem., 44 (5), 2001, pp 681-693

[Figure: plot of hydrophobicity against size, with a shaded ‘exclusion zone’ – compounds here are not bio-available]

Here is a real example: two descriptors, CMR (the size of the molecule) and logD (the distribution coefficient between octanol and water), are calculated, plotted and annotated with their ability to be absorbed in the intestine. The white areas are molecules that are absorbed; the shaded molecules are not, so as drugs the shaded molecules would not be orally absorbed and would therefore be useless in pills.

This approach to bioavailability has had a fundamental effect on new drug discovery; see Lipinski’s Rule of Five in the notes.


Simplest approaches

• 1. Read across. If molecule A has a measured property, and molecule B has not had it measured, if molecules A and B are very similar, perhaps they have similar properties and we can predict the property of B. This is a common approach in predicting e.g. the toxicity of molecules.

• We use information for one chemical, called a “source chemical”, to make a prediction of the same property or toxicological endpoint for another chemical, called a “target chemical”, termed “read across”.


Example of using chemical and biological similarity in read-across prediction of toxicity

Low et al. Chem. Res. Toxicol., 2013, 26 (8), pp 1199–1208

Building models (QSAR/QSPR)

Molecular database → Calculate/measure molecular properties → Analysis → Prediction

This is the most common approach for molecular analysis and prediction

Supervised methods

The most common method is linear regression. Simple linear regression fits a straight line through the set of n points in such a way that the sum of the squared residuals of the model (that is, the vertical distances between the points of the data set and the fitted line) is as small as possible. The equation we obtain can be used to predict a new property based on the descriptors calculated or measured for the new molecule.

$$y_i = \alpha + \beta x_i + \varepsilon_i, \qquad Q(\alpha, \beta) = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$

• Q is the function we want to obtain and minimise

• Alpha is the intercept: the constant offset that lets the fitted line miss the origin

• Beta is the coefficient to multiply our descriptor (x) by

• Epsilon is a residual (which we wish to minimise)

The method is explained in more detail in the notes.
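For a single descriptor the least-squares line has a closed form: beta is the covariance of x and y divided by the variance of x, and alpha follows from the means. The sketch below implements that textbook result in plain Python; the descriptor and property values are made-up numbers chosen to lie exactly on a line, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares intercept (alpha) and slope (beta) for y = alpha + beta * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = y_mean - beta * x_mean
    return alpha, beta

def sum_squared_residuals(xs, ys, alpha, beta):
    """Q(alpha, beta): the quantity that the fit minimises."""
    return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))

def predict(x_new, alpha, beta):
    """Predicted property for a new molecule with descriptor value x_new."""
    return alpha + beta * x_new

descriptor = [1.0, 2.0, 3.0, 4.0]      # hypothetical descriptor values
measured = [3.0, 5.0, 7.0, 9.0]        # hypothetical measured property

alpha, beta = fit_line(descriptor, measured)
print(alpha, beta, predict(5.0, alpha, beta))
```

With several descriptors the same idea becomes multiple linear regression, normally solved with a linear-algebra routine rather than by hand.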

Machine Learning: Predicting TLC (Thin Layer Chromatography)

[Figure: TLC plate (silica on glass) showing the start point, the solvent front at distance x and the compound moved to distance y; Rf = y/x]

• Compounds move up the plate depending on the solvent, their properties etc.

• We can predict the Rf values (retention factors) using details of the molecules and the solvent.

• Separate mixtures, identify compounds etc.

Data

• 22 substituted benzoic acids (COOH at ring position 1; substituents at positions 2 to 6)

No.  Substituent(s)
1    4-F
2    3-F
3    2-F
4    CF3
5    3-CF3
6    4-CF3
7    2-F,4-CF3
8    4-F,3-CF3
9    4-F,2-CF3
10   4-CH3
11   2-CH3
12   3-CH3
13   4-NH2
14   H
15   2-OH
16   3-OH,6-OH
17   2-OH,6-OH
18   2-OH,3-OH
19   2-OH,5-OH
20   2-OH,4-OH
21   3-OH,4-OH
22   2-COOH

Data

• 2 solvent systems

• 6 mixtures:

1 Acetonitrile - Water 30 : 70
2 Acetonitrile - Water 40 : 60
3 Acetonitrile - Water 50 : 50
4 MeOH - Water 40 : 60
5 MeOH - Water 50 : 50
6 MeOH - Water 60 : 40

• 22 compounds x 6 mixtures = 132 experiments


Measurements

Page 53: Molecular data and analysis · The similarity concept is widely used in medicinal chemistry : e.g. using the concept of Bio-Isosteres –the fundamental concept in discovering new

• Molecular properties were calculated for each of the molecules and tabulated in a spreadsheet (tlcdata.xls), e.g.

No.: compound number
Cpd: name of compound
Solvent: water and acetonitrile/methanol
Rf: retention factor
Rm: log((1 - Rf)/Rf)
S_Area: surface area of the molecule in Å²
clogp: calculated octanol/water partition coefficient
volume: molecular volume in Å³
MPolar: polarizability of the molecule (cm-25)
dipole: dipole moment of the molecule (Debye)
dipsol: dipole moment of the solvent, (%solv1+%sol2)*100 Debye
PolSol: polarizability of the solvent, (%pol1+%pol2)/100 Debye
Ovality: how far the molecule is from spherical, Ovality = S / (4π (3V/4π)^(2/3)), where S is the surface area and V the volume

The water dipole is also given, 2.75 Debye.
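The Rm and ovality descriptors can be computed directly from their definitions. The sketch below is illustrative only: it assumes that "log" means log base 10 and that ovality is the surface area divided by the area of a sphere of equal volume. The function names and input values are hypothetical and are not taken from tlcdata.xls.

```python
import math


def rm_from_rf(rf):
    """Rm = log10((1 - Rf) / Rf); assumes base-10 log and 0 < Rf < 1."""
    if not 0.0 < rf < 1.0:
        raise ValueError("Rf must lie strictly between 0 and 1")
    return math.log10((1.0 - rf) / rf)


def ovality(surface_area, volume):
    """Surface area divided by the area of a sphere with the same volume."""
    sphere_area = 4.0 * math.pi * (3.0 * volume / (4.0 * math.pi)) ** (2.0 / 3.0)
    return surface_area / sphere_area


# Hypothetical inputs, chosen only to show the behaviour of the formulas.
rm_mid = rm_from_rf(0.5)      # spot halfway up the plate
rm_low = rm_from_rf(0.1)      # strongly retained spot

radius = 3.0
sphere = ovality(4.0 * math.pi * radius ** 2,
                 4.0 / 3.0 * math.pi * radius ** 3)
cube = ovality(6.0, 1.0)      # unit cube: area 6, volume 1

print(rm_mid, rm_low)
print(sphere, cube)
```

A perfect sphere gives an ovality of 1, and any other shape with the same volume gives a larger value, which is why the descriptor measures departure from sphericity.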


log K' = -0.401 QON + 0.396 CLOGP + 0.109 DIP - 0.056 DIPMOM - 3.162 ESDL1 + 0.231 CMR + 0.110 POLSOL - 5.326

r = 0.954, F(7,110) = 155.59

Variance Explained = 91.0 %

Multiple linear regression – using the ‘best’ 7 parameters

[Plot against measured values: • test set, o training set]
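As an illustration of how such a regression equation is obtained, here is a minimal sketch of multiple linear regression by the normal equations, written in plain Python. The two descriptors, their values and the "true" coefficients are invented for the example; they are not the TLC data or the fitted model from the slide. The last lines only check that r = 0.954 corresponds to roughly 91 % of the variance explained.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_mlr(rows, y):
    """Return [intercept, b1, b2, ...] minimising the squared error."""
    x = [[1.0] + list(r) for r in rows]
    p = len(x[0])
    xtx = [[sum(xi[a] * xi[b] for xi in x) for b in range(p)] for a in range(p)]
    xty = [sum(xi[a] * yi for xi, yi in zip(x, y)) for a in range(p)]
    return solve(xtx, xty)


def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def r_squared(y, y_hat):
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical training set: (clogp, dipole) pairs with noise-free responses
# generated from y = -1.0 + 0.4 * clogp - 0.05 * dipole.
train_x = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (0.5, 3.0), (2.5, 0.5)]
train_y = [-1.0 + 0.4 * a - 0.05 * b for a, b in train_x]

coef = fit_mlr(train_x, train_y)
r2 = r_squared(train_y, [predict(coef, r) for r in train_x])
test_prediction = predict(coef, (1.5, 2.5))   # a held-out "test set" point

variance_explained = 0.954 ** 2               # r from the slide, squared
print(coef, r2, test_prediction, variance_explained)
```

With real, noisy measurements the fit would not be exact, and the test-set points are what show whether the model generalises.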


Unsupervised methods (typically classification models)

In the previous examples, data were fitted to a model, usually predicting a numeric value of the desired property. However, it is also possible to cluster the data, and hence predict which class a new molecule will fall into, e.g. is it toxic or non-toxic. This is “guilt by association”.

The most common way to do this is cluster analysis, which covers a diverse set of approaches.

Hierarchical clustering and k-means clustering are common approaches. Clustering involves finding the distance between all pairs of points in the data, usually the Euclidean distance or the Manhattan distance (or, for example, the Tanimoto distance). In hierarchical clustering the clusters are then determined by either a bottom-up (agglomerative) approach or a top-down (divisive) approach.

[Figure labels: high similarity cutoff; low similarity cutoff]
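The bottom-up (agglomerative) approach can be sketched in a few lines. This is a minimal illustration, assuming single linkage (cluster distance = closest pair of members) and fingerprints stored as sets of "on" bit positions; the fingerprints themselves are made up. A high similarity cutoff corresponds to a small distance cutoff here.

```python
def tanimoto_distance(a, b):
    """1 - |A and B| / |A or B| for two fingerprints stored as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def single_linkage(fps, cutoff):
    """Merge the two closest clusters until the closest pair exceeds the cutoff."""
    clusters = [[i] for i in range(len(fps))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(tanimoto_distance(fps[p], fps[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > cutoff:
            break                      # nothing left that is similar enough
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical fingerprints for five molecules.
fingerprints = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {20, 21}]

tight = single_linkage(fingerprints, cutoff=0.45)   # high similarity required
loose = single_linkage(fingerprints, cutoff=0.60)   # lower similarity allowed
everything = single_linkage(fingerprints, cutoff=1.0)

print(tight)
print(loose)
print(everything)
```

Tightening the cutoff leaves more, smaller clusters, and loosening it merges them, which is the effect of moving the cutoff line on a dendrogram.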


Plot of two principal components (PCs) of a dataset made up of many molecules and many calculated properties. It gives a view of how diverse the molecules are within the property space and, for new molecules, where they are located.

The calculated properties include:

• physical properties (such as charge, van der Waals volume, and molecular refractivity)

• subdivided surface areas (atomic contributions to logP and molecular refractivity)

• counts of elemental atom types and of bond types

• Kier/Hall connectivity and kappa shape indices

• topological indices (Wiener index and Balaban index)

• pharmacophore feature counts (number of acidic and basic groups and hydrogen bond donors and acceptors)

• partial charge descriptors, surface area, volume, and shape descriptors (among them water accessible surface area, mass density, and principal moments of inertia).

In short, a series of molecules is described in many ways and the description is then compressed into two dimensions. This is useful for selecting a screening set of compounds for testing.

J. Chem. Inf. Model., 2005, 45 (3), pp 581–590

The concept of ‘Chemical space’ – non-hierarchical clustering of similar molecules
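Compressing many descriptors into two principal components can be sketched as follows. This is a toy illustration, not the method of the cited paper: it uses power iteration with deflation on the covariance matrix, three invented descriptors, and mean-centring only (real descriptor sets with mixed units are usually scaled as well).

```python
import math


def center(rows):
    """Subtract the column means so every descriptor has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centred rows."""
    n = len(rows)
    cols = list(zip(*rows))
    return [[sum(a * b for a, b in zip(ca, cb)) / (n - 1) for cb in cols]
            for ca in cols]


def leading_eigenpair(matrix, iterations=200):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    size = len(matrix)
    vec = [float(i + 1) for i in range(size)]   # arbitrary non-zero start
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    value = sum(vec[i] * sum(m * v for m, v in zip(matrix[i], vec))
                for i in range(size))
    return value, vec


def pca_scores(rows, n_components=2):
    """Project each molecule onto the first n_components principal axes."""
    centered = center(rows)
    cov = covariance(centered)
    axes = []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        axes.append(vec)
        # Deflation: remove the component just found before finding the next.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(len(vec))]
               for i in range(len(vec))]
    return [[sum(a * v for a, v in zip(axis, row)) for axis in axes]
            for row in centered]


# Hypothetical descriptor table: the first two columns are perfectly
# correlated, so two components carry all of the information.
descriptors = [
    [1.0, 1.0, 0.0], [-1.0, -1.0, 0.0],
    [2.0, 2.0, 0.0], [-2.0, -2.0, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.0, -1.0],
]
scores = pca_scores(descriptors)
for row in scores:
    print([round(v, 3) for v in row])
```

Each molecule ends up as one point in a two-dimensional plot, and a new molecule can be placed in the same plot by centring its descriptors with the same means and projecting onto the same axes.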


•Simulates the way that neurons are interconnected

•‘Learns’ by adjusting the connection weights between nodes, taking an input set of parameters and attempting to fit the output measurements

•New data can then be entered and predictions made using the ‘learned’ model

This network has a 2:4:4:1 topology

Like neurons, the connections are made when a threshold value is attained.

Use ‘back propagation of errors’ to adjust the connections

http://en.wikipedia.org/wiki/Backpropagation

http://en.wikipedia.org/wiki/Artificial_neural_network

A machine learning method – a Neural network
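A network with the 2:4:4:1 topology can be written out in plain Python to show what "adjusting the connection weights" by back propagation means. This is a minimal sketch under stated assumptions: tanh hidden units (a smooth stand-in for a hard threshold), a linear output, squared-error loss and one-sample-at-a-time gradient descent. The training data, learning rate and function names are hypothetical.

```python
import math
import random


def init_network(sizes, seed=0):
    """One (weights, biases) pair per layer; weights[j][i] links input i to node j."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def count_parameters(layers):
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


def forward(layers, x):
    """Return the activations of every layer, input first, output last."""
    activations = [list(x)]
    for k, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activations[-1])) + b
             for row, b in zip(weights, biases)]
        is_output = k == len(layers) - 1
        activations.append(z if is_output else [math.tanh(v) for v in z])
    return activations


def train_step(layers, x, target, lr=0.05):
    """One back-propagation update; returns the loss measured before the update."""
    acts = forward(layers, x)
    delta = [o - t for o, t in zip(acts[-1], target)]   # output-layer error
    for k in reversed(range(len(layers))):
        weights, biases = layers[k]
        inputs = acts[k]
        if k > 0:
            # Error passed back to the previous layer, using the old weights;
            # 1 - a^2 is the derivative of tanh at that node.
            back = [sum(weights[j][i] * delta[j] for j in range(len(delta)))
                    * (1.0 - inputs[i] ** 2) for i in range(len(inputs))]
        for j, d in enumerate(delta):
            for i, a in enumerate(inputs):
                weights[j][i] -= lr * d * a
            biases[j] -= lr * d
        if k > 0:
            delta = back
    return 0.5 * sum((o - t) ** 2 for o, t in zip(acts[-1], target))


# Hypothetical training data: two inputs, one output equal to their sum.
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [2.0])]

net = init_network([2, 4, 4, 1])
n_params = count_parameters(net)
first_loss = sum(train_step(net, x, t) for x, t in data)
for _ in range(500):
    last_loss = sum(train_step(net, x, t) for x, t in data)

prediction = forward(net, [1.0, 0.0])[-1][0]   # use the 'learned' model
print(n_params, first_loss, last_loss, prediction)
```

The 2:4:4:1 layout gives 8 + 16 + 4 = 28 connection weights plus 9 biases, so 37 adjustable parameters in this sketch.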


[Plot axes: measured, predicted]

TLC Neural Network and plot of measured vs predicted results


[Figure: workflow diagram. Labels: 3D tumour specimen; DESI-MSI (mass spectrum, m/z axis 100 to 1000); dimensionality reduction; deep learning neural networks; discrimination tumour / non-tumour; only tumour spectra; subtype identification; chemical components and related metabolic pathways; molecular picture of tumour interactions]

The tumour microenvironment is 3-dimensional: more chances to capture the biological interactions.

There has been a lot of recent progress around ‘Deep Learning’: AI applied to cheminformatics problems. Essentially, these are significant developments in neural networks.

Application of Deep Learning to 3D DESI mass spectrometry imaging in cancer

Inglese, Paolo, et al. "Deep learning and 3D-DESI imaging reveal the hidden metabolic heterogeneity of cancer." Chemical Science (2017).

Machine Learning


The first 3D mass spectral imaging of a tumour

Paolo Inglese, James S. McKenzie, Anna Mroz, James Kinross, Kirill Veselkov, Elaine Holmes, Zoltan Takats, Jeremy K. Nicholson and Robert C. Glen. Deep learning and 3D-DESI imaging reveal the hidden metabolic heterogeneity of cancer. Chem. Sci., 2017, 8, 3500