molecular data and analysis · the similarity concept is widely used in medicinal chemistry : e.g....
Useful Information
• The web address for these lectures is
http://www-jmg.ch.cam.ac.uk/cil/partii/ (on
front of handout)
• Assessment is by two online exercises
(Glen and Goodman) at this address. Each
will be marked out of ten. Your (paper)
answers should be submitted to Mykola.
• Glen exercises due: Feb 10th 2018
• Lectures and handout available on Moodle
2 Finding molecules
In 1924 Dr. Markush was awarded a patent on pyrazolone dyes (USP No. 1,506,316) in
which he claimed generic chemical structures in addition to those actually synthesized.
Structures of this type were permitted after a ruling in 1925 by the US Patent Office and
became known as “Markush structures”. The “Markush Doctrine” of patent law greatly
increases flexibility in the preparation of claims for the definition of an invention.
Expanding our representation of chemical
structures – Markush structures.
We can expand our search by introducing less exact labelling of attachments to the
core structure. Markush structures are essentially structures involving R-groups,
where a part of the molecule is defined by a series of alternatives – a more complex
example
Additionally, to introduce a more generic approach to structure matching, we might
define e.g. hydrogen-bond donors as:
R = OH, NH, SH, PH, for example. Care is of course needed: e.g. a
COOH may be ionised and have no H!
This approach is extensively used in the patent literature to cover claims
of chemical structures with many variations.
Markush or Generic Structures. J. Chem. Inf. Comp. Sci. 1991, 31 (1)
A comparison of different approaches to Markush structure handling
J. Chem. Inf. Comp. Sci. 31(1), 1991, 64-68
An example of a Patent claim using Markush structures – how
many does this cover ?
Markush structure searching over the years, Edlyn S. Simmons World Patent Information, Volume
25, Issue 3, September 2003, Pages 195-202
Searching Markush compound structures is still an unsolved problem (so-called
‘nasties’), and has great implications for patents. MarPat is a Markush searchable
database of patents. http://www.cas.org/expertise/cascontent/marpat.html .
2. Finding molecules using Molecular Similarity
• You may perform a structural search of a database, and find no molecules. You still want to use a molecule like your query in some way, so, how do you find one that is ‘similar’ ?
• We may have e.g. a molecule that shows anti-cancer effects, but is toxic
• We could then look for other molecules that could have a similar anti-cancer effect, but a lower toxic effect
• ‘Similarity’ though, has a context and the right molecular description is needed for each specific case.
Bender A., Glen RC., Org. Biomol. Chem., 2004, 2, 3204 – 3218.
Molecular similarity: a key technique in molecular informatics.
The similarity concept is widely used in medicinal chemistry :
e.g. using the concept of Bio-Isosteres – the fundamental
concept in discovering new drugs
This idea (a bio-isostere) suggests that a chemical group can be
mimicked by a replacement group that, in many documented cases,
has appeared similar in its response to biological receptors (usually
proteins).
e.g.: changing substituents while maintaining affinity in an anti-bacterial
(Bioorganic & Medicinal Chemistry Letters, Volume 17, Issue 14, 15 July 2007, Pages 4040-4043).
An example at Influenza neuraminidase – a critical enzyme the
virus uses for infection - inhibitors Oseltamivir and Zanamivir
use a bio-isosteric replacement of the natural substrate
<=Similar to=>
Neuraminidase cleaves
the glycosidic linkages
of neuraminic acids
Therefore in a search,
these additional ‘R’
groups can be included as
Markush structures
More examples used as bio-isosteres (pairs)
Sarah R. Langdon,Peter Ertl,and Nathan Brown. Bioisosteric
Replacement and Scaffold Hopping in Lead Generation and
Optimization . Mol. Inf. 2010, 29, 366 – 385
Robert P. Sheridan. The Most Common Chemical Replacements in
Drug-Like Compounds. J. Chem. Inf. Comput. Sci. 2002, 42, 103-108
Similar to…
Similar to…
But supposing you can’t easily ‘think up a similar substructure’?
There are various methods that have been devised to compute
similarity. These are generally:
•Based on the structure
•In one (strings), two (graphs) or three dimensions (coordinates)
•Based on molecular properties
•Experimental (e.g. size and shape) and computed properties (e.g.
Dipole Moment)
Let’s look at how a similarity calculation can be defined using some
of these methods.
Similarity. The Maximal Common Subgraph (the biggest common
fragment)
•Important search to determine which part of a structure is constant –
e.g. identifying reaction components – in this one, the atoms and bonds
which are constant comprise the MCS.
•Is a complex case of identifying a fragment: we don’t know the size of the
MCS beforehand, so computing it involves ‘backtracking’ and is therefore
time consuming. This is also one of the problems of converting
a list of compounds to a Markush structure.
MCS algorithms can be applied to problems other than atom-atom mapping in
reactions:
•structural similarity between molecules: the size of the MCS (relative to the size of the
molecules) can be used as a measure of similarity, e.g. search
for molecules containing at least 80% of the query substructure.
Similarity by Molecular Fingerprints
• Fingerprints are a common approach to describing molecular similarity
• Fingerprints can be considered as a ‘bar code’ for the molecule
• Used because
– uses only the molecular graph
– does not require structural conformation or alignment
– fast searching method
• It is very fast to annotate a database of millions of molecules with fingerprints
• Often you are using fingerprints in searching databases, and don’t realise it !
Molecular Fingerprints
• Hash codes (already mentioned for searching)
• The simplest fingerprint registers the presence or
absence of fragments in a molecule. e.g.
Bit:    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 ... X
Value:  0  0  0  0  0  1  0  0  0  0  0  1  0  0  0  0
(figure labels: ‘Contains P’ and ‘Contains NH2’ for the set bits, ‘Doesn’t contain F’ for an unset bit)
Molecular Fingerprints
• We could use this fingerprint, for example,
to find only molecules containing
phosphorus that have an amine in their
structure
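The screening idea above can be sketched in a few lines of Python. This is only an illustration: the bit positions in `FRAGMENT_BITS`, the molecule names and the fragment lists are all hypothetical, and a real system would detect the fragments from the structure rather than take them as given.

```python
# Hypothetical fragment-to-bit assignment (positions are illustrative only).
FRAGMENT_BITS = {"F": 2, "P": 5, "NH2": 11}


def make_fingerprint(fragments):
    """Set one bit for every fragment present in the molecule."""
    fp = 0
    for fragment in fragments:
        fp |= 1 << FRAGMENT_BITS[fragment]
    return fp


def contains_all(fp, query_fp):
    """True if every bit set in the query is also set in the fingerprint."""
    return (fp & query_fp) == query_fp


# Hypothetical database: molecule name -> fragments found in it.
database = {
    "mol_1": {"P", "NH2"},
    "mol_2": {"P"},
    "mol_3": {"NH2", "F"},
    "mol_4": {"P", "NH2", "F"},
}
fingerprints = {name: make_fingerprint(frags) for name, frags in database.items()}

# Query: molecules that contain phosphorus AND an amine.
query = make_fingerprint({"P", "NH2"})
hits = sorted(name for name, fp in fingerprints.items() if contains_all(fp, query))
print(hits)
```

The test is a single bitwise AND per molecule, which is why this kind of screen is fast even for millions of database entries.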
[Figure: the same 16-bit fingerprint (bits 6 and 12 set) shown with an example structure containing a phosphate group and an NH2 group]
• Fingerprints can be generated algorithmically; we don’t need to manually specify all the fragments
• The fingerprint method most often used is based on the CRC algorithm (cyclic redundancy check) – you could look this up on the web.
• Advantages/disadvantages
– easy to calculate
– very fast
– not specific to one area of chemistry
– difficult to understand
Automated fingerprint generation
Fingerprint Generation – Hashing
CRC (Cyclic Redundancy Check)
CH3CH2CH2CH2OH is broken into linear fragments: H-C-C-C, C-C-C-O, C-C-O-H, etc.
Each fragment is hashed to an integer I, where 0 < I < 10^9, and MOD( I / 151 ) gives the bit to set (bit 12 in the figure).
e.g. 150-bit fingerprint for 4-atom fragments (we generally use 4-7 atom)
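A minimal sketch of this hashing scheme, under stated assumptions: Python's standard `zlib.crc32` stands in for the CRC step, the fingerprint length of 151 bits is taken from the modulus on the slide, and the molecule is given as a hand-written atom list and bond list (1-butanol heavy atoms plus the hydroxyl H). The function names are hypothetical.

```python
import zlib


def linear_paths(atoms, bonds, length=4):
    """Enumerate all simple linear paths of `length` atoms, e.g. 'C-C-C-O'."""
    neighbours = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)
    found = set()

    def extend(path):
        if len(path) == length:
            symbols = [atoms[i] for i in path]
            forward = "-".join(symbols)
            backward = "-".join(reversed(symbols))
            # A path and its reverse are the same fragment: keep one form.
            found.add(min(forward, backward))
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return found


def hashed_fingerprint(atoms, bonds, n_bits=151, length=4):
    """Hash every linear fragment with CRC32 and set bit (hash mod n_bits)."""
    bits = [0] * n_bits
    for fragment in linear_paths(atoms, bonds, length):
        index = zlib.crc32(fragment.encode("ascii")) % n_bits
        bits[index] = 1
    return bits


# 1-butanol: C-C-C-C-O-H (carbon-bound hydrogens omitted for brevity).
atoms = ["C", "C", "C", "C", "O", "H"]
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

paths = linear_paths(atoms, bonds)
fingerprint = hashed_fingerprint(atoms, bonds)
print(sorted(paths), sum(fingerprint))
```

Two different fragments can hash to the same bit, so the number of bits set may be smaller than the number of fragments. That collision is the price of not having to list the fragments in advance.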
Linear Fingerprint
Fingerprint Generation – circular fingerprints
[Figure: atom environments around a central atom at levels 0 to 3, labelled with atom types such as C.ar, C.2, C.3, O.2, O.3 and O.co2]
The fingerprint length is the number of ‘atom types’ times the number of levels, e.g. 30 atom types for each of layer 0, layer 1 and layer 2.
These are very ‘sparse’ but work well – I’ll show some examples later
J. Chem. Inf. Model. 2007, 47(2), 583-590
Can also use a
variant of the
Morgan
algorithm
Comparing the fingerprints of molecules
- Tanimoto or Jaccard similarity
$$T = \frac{A\&B}{A + B - A\&B}$$
where A, B and A&B are the number of bits set in fingerprint A, B, and A-AND-B.
In a hypothetical example, A, B and A&B are 24, 21, and 19, respectively,
resulting in a Tanimoto coefficient of 0.73 (1.00 is perfect similarity).
Another way to put it: TC = BC / (B1 + B2 - BC)
Values above 0.85 are usually significant. This method is
commonly used to search for pharmaceutically active
molecules, reagents, reactions...
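The hypothetical example (24, 21 and 19 bits) can be reproduced with a short sketch. The two bit lists below are made up purely so that the counts match the numbers quoted; the function name `tanimoto` is our own.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two equal-length lists of 0/1 bits."""
    a = sum(fp_a)                                    # bits set in A
    b = sum(fp_b)                                    # bits set in B
    both = sum(x & y for x, y in zip(fp_a, fp_b))    # bits set in A AND B
    if a + b - both == 0:
        return 0.0  # assumption: two empty fingerprints score 0
    return both / (a + b - both)


# Made-up 32-bit fingerprints with A = 24, B = 21 and A&B = 19 bits set.
fp_a = [1] * 24 + [0] * 8
fp_b = [1] * 19 + [0] * 5 + [1] * 2 + [0] * 6

score = tanimoto(fp_a, fp_b)
print(round(score, 2))  # 19 / (24 + 21 - 19) = 19 / 26
```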
Tanimoto similarity example
- similarity to o-chloro-p-aminobenzoic acid

Structure                        Tanimoto coefficient
Benzoic acid                     0.52
m-chlorobenzoic acid             0.64
o-chlorobenzoic acid             0.80
o-chloro-p-aminobenzoic acid     1.0
p-aminobenzoic acid              0.70
p-chlorobenzoic acid             0.66
Similarity search in SciFinder Scholar
1 - query structure
2 - similarity search
3 - pick > 85% similarity
4 - six structures retrieved (from xxx Million). This
probably uses linear fingerprints
‘Tanimoto’ similarity indices are one of a class of
methods for bit-string comparisons.
Some comparison indices additional to the Tanimoto coefficient
(Nab/(Na+Nb-Nab)) are:
Hamming coefficient: $H = \sum_{i=1}^{n} \mathrm{XOR}(a_i, b_i)$
Cosine coefficient = Nab/Sqrt(Na x Nb)
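As a rough illustration of these two alternative indices, the sketch below computes them for a pair of made-up bit lists (24 and 21 bits set, 19 in common). The function names are hypothetical.

```python
import math


def hamming(fp_a, fp_b):
    """Number of positions where the two bit lists differ (sum of XORs)."""
    return sum(x ^ y for x, y in zip(fp_a, fp_b))


def cosine(fp_a, fp_b):
    """Nab / sqrt(Na * Nb) for two lists of 0/1 bits."""
    n_a = sum(fp_a)
    n_b = sum(fp_b)
    n_ab = sum(x & y for x, y in zip(fp_a, fp_b))
    if n_a == 0 or n_b == 0:
        return 0.0  # assumption: an empty fingerprint scores 0
    return n_ab / math.sqrt(n_a * n_b)


# Made-up fingerprints: Na = 24, Nb = 21, Nab = 19.
fp_a = [1] * 24 + [0] * 8
fp_b = [1] * 19 + [0] * 5 + [1] * 2 + [0] * 6

h = hamming(fp_a, fp_b)
c = cosine(fp_a, fp_b)
print(h, c)
```

Note that the Hamming value is a distance (0 means identical), whereas the Tanimoto and cosine values are similarities (1 means identical).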
A good introduction is in :
http://www.orgchm.bas.bg/~vmonev/SimSearch.pdf
J. Chem. Inf. Comput. Sci., 37, 18-22 (1977)
J. Chem. Inf. Comput. Sci., 43, 819-828 (2003)
J. Chem. Inf. Comput. Sci. 38, 983-996 (1998)
J. Chem. Inf. Model. Publication Date (Web): October 19,
2012, DOI: 10.1021/ci300261r. Just accepted.
Not just bits – properties of molecules
There are of course an enormous number of “molecular properties”
that can be used to compare molecules. Some of the more common ones are
listed below:
1. Quantum mechanical descriptors based on the wavefunction (Carbo index). Quantitative Structure-Activity Relationships, Volume 16, Issue 1 (p 25-32)
2. Topological indices (Wiener, Kier and Hall).
H. Wiener, "Structural determination of paraffin boiling points", J. Am. Chem. Soc., 1947, 69(1), 17-20.
L. B. Kier, L. H. Hall, Molecular Connectivity in Structure-Activity Analysis, J. Wiley & Sons, New York, 1986
3. Computed molecular properties: volume, surface area, logP, pKa, ... (a vast
number), then cluster molecules according to a similarity measure.
[Figure: molecules → index → graph of similarity of pairs]
Beck et al. Chemical
Physics 356 (2009) 121–
130
http://www-metaprint2d.ch.cam.ac.uk/metaprint2d/
Metabolic Site/Product predictor (MetaPrint2D)
[Flowchart, steps numbered 1 to 5 in the figure]
• Database: Symyx Metabolite database (~80000 transformations) of substrates + products. Calculate the environment for each substrate atom and identify reaction centres.
• Query compound: calculate the environment for each atom.
• For each query atom, find all similar environments in the database.
• How often is the environment found at a reaction centre? Calculate reaction occurrence ratios:
  (total number of similar reaction centres) / (total number of similar atoms in rest of database)
• Calculate relative ratios for each atom in the query compound, and display predictions.
Using a naive Bayes probabilistic model
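The occurrence-ratio step can be sketched as follows. This is not the MetaPrint2D code: the atom labels and counts are invented, and scaling each ratio by the largest ratio in the molecule is only one plausible reading of "relative ratios".

```python
def occurrence_ratio(n_reaction_centres, n_other_atoms):
    """Similar reaction centres divided by similar atoms elsewhere in the database."""
    if n_other_atoms == 0:
        return 0.0  # assumption: no evidence means a ratio of 0
    return n_reaction_centres / n_other_atoms


def relative_ratios(counts):
    """Scale the ratios so the most likely site in the molecule scores 1.0."""
    ratios = {atom: occurrence_ratio(rc, other) for atom, (rc, other) in counts.items()}
    largest = max(ratios.values())
    if largest == 0:
        return ratios
    return {atom: ratio / largest for atom, ratio in ratios.items()}


# Hypothetical counts for three query atoms:
# (similar reaction centres, similar atoms in rest of database)
counts = {"atom_1": (30, 100), "atom_2": (5, 200), "atom_3": (0, 50)}

relative = relative_ratios(counts)
most_likely_site = max(relative, key=relative.get)
print(most_likely_site, relative)
```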
Database Version       2005.1   2006.1   2007.1   2008.1
Transformations         72599    78009    82671    87446
Single step             58757    62147    65732    69402
Product not reported      811      831      834      882
Newly added                       5410     4662     4775
Interestingly, the
molecule dosed (which
has excellent
bioavailability) is a
partial agonist, while the
main metabolite is a full
agonist. So, as the drug
concentration lowers in
blood, the remaining
compound becomes
more potent – probably a
longer lasting effect
Paracetamol toxicity
(Tylenol)
Overdose results in
species NAPQI and
liver damage
Metaprint2D results
glutathione
3 Finding molecules using three dimensional data
•‘Real’ molecules exist in a 3-dimensional world
•Their properties depend on their shape and the spatial
disposition of functional groups.
•Simple example: dipole moment
2.5 Debye 0.5 Debye
An example of the exquisite matching of a substrate to a protein
binding site – here 3D shape and the complementary non-bonded
interactions are extremely important
Cheminformatics Tools
for drug design
Example of a site which has
various drug design tools
• Three dimensions in drug discovery
• A ‘pharmacophore’ is a 3-D representation of the required features
for binding to a biological receptor
[Figure: pharmacophore with inter-feature distances of 5.2, 4.2-4.7, 6.7, 4.8 and 5.1-7.1 Ångströms]
Here is the pharmacophore model used to design the migraine drug
‘Zomig’, deduced from comparison of molecules that interact with the receptor
binding site
Similarity Searching based on
pharmacophores - What do we need ?
• A database of 3-dimensional structures (Zinc
database is 72 million)
– Atom Coordinates
– Atom types
– Ring, fragment, property, H-bonding etc. definitions
– An excellent example is the Cambridge Structural Database of X-ray structures (next door)
• A definition of the query
– Fragments of molecules and their properties
– Constraints
• Distances between functional groups
• Angles between these
– The concept of Dummy atoms is useful
– e.g. ring centres, H-bonding points, planes
Example search (“Virtual Screening”) of our current
4.5 Million 3D database
[Figure: the pharmacophore query, with distances 5.2, 4.2-4.7, 6.7, 4.8 and 5.1-7.1 between features labelled hydrophobic centre, positive N, H-bond donor/acceptor and H-bond acceptor]
The query features are a protonated amine (NH3+), a ring centre (defined by 6 atoms), a
hydrogen-bond acceptor and a hydrogen-bond donor-acceptor.
- This brings up the point that ‘properties’ can be specified at atom points
-- Markush atoms
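A distance-constrained pharmacophore query can be sketched as a set of allowed distance ranges between named feature points. The slide does not say which distance belongs to which pair of features, so the pairing below, the feature names and the candidate coordinates are all hypothetical.

```python
import math

# Hypothetical query: (feature, feature) -> allowed distance range in Angstroms.
QUERY = {
    ("amine", "ring_centre"): (5.1, 7.1),
    ("ring_centre", "acceptor"): (4.2, 4.7),
}


def matches(feature_coords, query):
    """True if every constrained pair of features lies within its distance range."""
    for (first, second), (low, high) in query.items():
        distance = math.dist(feature_coords[first], feature_coords[second])
        if not low <= distance <= high:
            return False
    return True


# Two made-up candidate conformations: feature name -> (x, y, z) in Angstroms.
candidate_1 = {
    "amine": (0.0, 0.0, 0.0),
    "ring_centre": (6.0, 0.0, 0.0),
    "acceptor": (6.0, 4.5, 0.0),
}
candidate_2 = {
    "amine": (0.0, 0.0, 0.0),
    "ring_centre": (3.0, 0.0, 0.0),
    "acceptor": (3.0, 4.5, 0.0),
}

print(matches(candidate_1, QUERY), matches(candidate_2, QUERY))
```

Dummy atoms such as a ring centre fit naturally here: they are simply extra named points whose coordinates are computed from the real atoms.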
When x-ray structures are available – molecules can
be ‘docked’ into the binding site – pharmacophores
can be generated and used for searching as before
• A docking program will take a
randomised ligand conformation from a
ligand/protein x-ray structure and place
the molecule back in the correct
position.
• Many thousands of molecules can be
‘docked’ / hour.
• Molecules can be selected based on their
‘fit’ to the protein, and subsequently
tested for binding affinity
Docked Gleevec compared with the Gleevec x-ray structure (1T46): the x-ray structure is overlaid with
the predicted position of Gleevec and the match is almost perfect, which implies we
could use the same docking approach to search for new molecules that
work in the same way
Docking example using Gold: the
anti-cancer drug Gleevec – a
specific cancer target inhibitor of
Bcr-Abl tyrosine kinase, the
constitutive abnormal kinase in
chronic myeloid leukemia.
Docking example using Gold:
Gleevec – specific cancer target
inhibitor of Bcr-Abl tyrosine
kinase, the constitutive abnormal
kinase in chronic myeloid
leukemia. Red lines define a
pharmacophore
The pharmacophore can be extracted and used to search for additional
molecules from our database. These are then tested by ‘docking’ and,
if they fit, can be tested for anti-cancer properties in this case.
*GOLD. Jones G, Willett P, Glen R C, Molecular Recognition of Receptor Sites using a Genetic Algorithm with a
Description of Desolvation, J.Mol. Biol.245, 43-53 (1995).
Jones G, Willett P, Glen R C, Leach A R, Taylor R. Development and Validation of a Genetic Algorithm for Flexible
Docking. J. Mol. Biol. 267, 727-748 (1997).
“Virtual screening” using similarity – an important way to find starting points for
designing new drugs
Suppose we have no information on a biological target. Also, like
many pharmaceutical companies, we have 1 Million real molecules
in our compound store. But, due to cost, we can only afford to
screen 10,000. How can we pick the best representative set to
screen?
There are essentially two ways to do this – similarity and diversity.
[Figure: two plots of Property A against Property B. Left: selection based on similarity to molecules A and B. Right: a diverse set.]
Virtual screening using similarity
On the bottom left, we have used two molecules displaying
biological activity (A and B) to find those most similar in the
database for testing, to maximise our chances of finding new hits.
On the bottom right, we have no molecules to use, so we select the
best diverse set, maximising our chances of a hit whilst only testing
a representative subset of the compounds library.
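Both selection strategies can be sketched on points in a two-property space. The lecture does not name an algorithm for the diverse set, so the sketch below uses a simple greedy MaxMin picker as a stand-in. The coordinates and function names are invented for illustration.

```python
import math


def select_similar(points, actives, n_pick):
    """Indices of the n_pick points closest to any known active."""
    def distance_to_actives(i):
        return min(math.dist(points[i], active) for active in actives)
    return sorted(range(len(points)), key=distance_to_actives)[:n_pick]


def select_diverse(points, n_pick):
    """Greedy MaxMin: repeatedly add the point farthest from those already picked."""
    picked = [0]  # assumption: start from the first compound
    while len(picked) < n_pick:
        candidates = [i for i in range(len(points)) if i not in picked]
        best = max(
            candidates,
            key=lambda i: min(math.dist(points[i], points[j]) for j in picked),
        )
        picked.append(best)
    return picked


# Made-up compounds described by (property A, property B).
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.0, 0.1), (5.0, 8.0)]
actives = [(0.0, 0.0), (10.0, 0.0)]

similar_set = select_similar(points, actives, n_pick=2)
diverse_set = select_diverse(points, n_pick=3)
print(similar_set, diverse_set)
```

The similarity pick clusters around the known actives, while the diverse pick spreads across the property space, which mirrors the two panels of the figure.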
An example of a reaction in a modified SMILES, called SMIRKS:
‘Acetic acid and (.) ethanol > in the presence of HCl and ethanol >
make ethyl acetate’
Chemical Reactions can also be represented in the computer
http://www.daylight.com/dayhtml_tutorials/languages/smirks/index.html
Virtual screening using a virtual library
The molecules we screen in the computer don’t have to be
physically available. We can generate vast libraries of
molecules we could synthesise, and search these. Promising
molecules could be synthesised. A billion examples is not
unreasonable. An example of potential HIV Protease inhibitors
There are so many, we have to be very selective and computer-aided
design can help
Example reaction of two acids with two alcohols to make four products
(the acids have ‘R’ groups). New characters and atom mapping is used.
[*:1][C:2](=[O:3])[O:4][H].[*:2][C:5][O:6][H]>>[*:1][C:2](=[O:3])[O:6][C:5][*:2].[H][O:4][H]
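The combinatorial idea behind a virtual library (two acids times two alcohols gives four esters) can be illustrated without a cheminformatics toolkit. The sketch below does not parse or apply the SMIRKS; it simply fills R groups into a hand-written ester SMILES template, and the R groups chosen are hypothetical.

```python
from itertools import product

# Hypothetical R groups, written as SMILES fragments.
acid_r_groups = ["C", "CC"]      # R-C(=O)OH: acetic acid, propanoic acid
alcohol_r_groups = ["C", "CC"]   # R'-C-OH:   ethanol, propan-1-ol


def ester_smiles(acid_r, alcohol_r):
    """SMILES of the ester R-C(=O)-O-C-R' formed from the acid and the alcohol."""
    return f"{acid_r}C(=O)OC{alcohol_r}"


# Every acid is combined with every alcohol.
library = [ester_smiles(a, b) for a, b in product(acid_r_groups, alcohol_r_groups)]
print(library)
```

With 1,000 acids and 1,000 alcohols the same two lines of enumeration would give a million products, which is how virtual libraries reach very large sizes from modest lists of reagents.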
You’ve found some interesting molecules –
but how can we predict their properties
quantitatively ?
• Particularly in drug discovery (but also in
materials science for example) methods
have been developed to relate the structure
and properties of molecules to their function
• These are called Quantitative Structure
Property (or Activity) Relationships -
QSPR, QSAR
• The handout contains details of some
approaches that may be of interest.
Combining molecular structure with calculation of
properties to deduce predictive models is usually
termed:
• Quantitative Structure Property Relationships (QSPR)
• Quantitative Structure Activity Relationships (QSAR)
We calculate descriptors to combine with statistical and machine-learning methods to create models to predict properties.
Picture the data – often the best approach
J. Med. Chem., 44 (5), 2001, pp 681-693
[Figure: plot of hydrophobicity against size, with an ‘exclusion zone’: compounds here are not bio-available]
Here is a real example: two descriptors, CMR (the size of the molecule) and logD
(the distribution coefficient between octanol and water), are calculated, plotted and
annotated with the molecules’ ability to be absorbed in the intestine. The white areas contain
molecules that are absorbed; the shaded areas contain molecules that are not. As drugs, the shaded
molecules would not be orally absorbed and would therefore be useless in pills.
This approach to bioavailability has had a fundamental effect on new drug discovery,
see Lipinski’s Rule of Five in the notes.
Simplest approaches
• 1. Read across. If molecule A has a measured property, and molecule B has not had it measured, if molecules A and B are very similar, perhaps they have similar properties and we can predict the property of B. This is a common approach in predicting e.g. the toxicity of molecules.
• We use information for one chemical, called a “source chemical”, to make a prediction of the same property or toxicological endpoint for another chemical, called a “target chemical”, termed “read across”.
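A minimal read-across sketch, assuming each chemical is described by a set of structural features and that Tanimoto similarity on those sets is the similarity measure. The feature sets, property labels and the reuse of the 0.85 threshold mentioned earlier are all illustrative assumptions.

```python
def tanimoto(features_a, features_b):
    """Tanimoto (Jaccard) similarity of two feature sets."""
    union = len(features_a | features_b)
    return len(features_a & features_b) / union if union else 0.0


def read_across(target_features, sources, threshold=0.85):
    """Copy the property of the most similar source chemical, if similar enough."""
    best_name, best_similarity = None, 0.0
    for name, (features, _value) in sources.items():
        similarity = tanimoto(target_features, features)
        if similarity > best_similarity:
            best_name, best_similarity = name, similarity
    if best_name is None or best_similarity < threshold:
        return None, best_similarity  # no source chemical is close enough
    return sources[best_name][1], best_similarity


# Hypothetical source chemicals: name -> (feature set, measured property).
sources = {
    "source_A": (set(range(1, 9)), "toxic"),
    "source_B": ({1, 2, 20, 21}, "non-toxic"),
}

target = set(range(1, 10))  # shares 8 of its 9 features with source_A
prediction, similarity = read_across(target, sources)
print(prediction, round(similarity, 3))
```

Returning no prediction when nothing is similar enough is a deliberate choice: read-across is only defensible when the source and target chemicals really are close.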
Example of using chemical and biological similarity in read-across prediction of toxicity
Low et al. Chem. Res. Toxicol., 2013, 26 (8), pp 1199–1208
Building models (QSAR/QSPR)
Molecular database
Calculate/measure molecular properties
Analysis
Prediction
This is the most common approach for molecular analysis and prediction
Supervised methods
Supervised methods. The most common method is linear regression. Simple linear
regression fits a straight line through the set of n points in such a way that
the sum of the squared residuals of the model (that is, the vertical distances between
the points of the data set and the fitted line) is as small as possible. The equation
we obtain can be used to predict a new property based on the descriptors calculated
or measured for the new molecule.
• Q is the function we want to obtain and minimise
• Alpha is the correction factor (the intercept) to move all the points so the line goes through the origin
• Beta is the coefficient to multiply our descriptor (x) by
• Epsilon is a residual (which we wish to minimise)
The method is explained in more detail in the notes.
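In the usual notation for simple linear regression, the model is y_i = alpha + beta * x_i + epsilon_i and the function to minimise is Q(alpha, beta) = sum of epsilon_i squared. The sketch below fits alpha and beta with the standard closed-form least-squares solution. The data points are invented and the function names are our own.

```python
def fit_line(xs, ys):
    """Least-squares estimates of alpha (intercept) and beta (slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    beta = s_xy / s_xx
    alpha = mean_y - beta * mean_x
    return alpha, beta


def sum_squared_residuals(xs, ys, alpha, beta):
    """Q(alpha, beta): sum of squared vertical distances to the line."""
    return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))


# Made-up descriptor values (x) and measured property values (y).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

alpha, beta = fit_line(xs, ys)
q_best = sum_squared_residuals(xs, ys, alpha, beta)

# Predict the property of a new molecule from its descriptor value.
new_x = 5.0
predicted_y = alpha + beta * new_x
print(alpha, beta, q_best, predicted_y)
```

Any other choice of alpha and beta gives a larger Q for the same data, which is what "least squares" means.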
Machine Learning:
Predicting TLC (Thin Layer Chromatography)
[Figure: TLC plate showing the start point, the solvent front (distance X) and the position the compound moved to (distance Y); Rf = y/x]
•Compounds move up the plate
depending on the solvent, their
properties etc.
•We can predict the Rf’s
(retention factors) using details
of the molecules and the
solvent.
•Separate mixtures, identify
compounds etc.
Silica on glass
[Figure: benzoic acid with ring positions numbered 1 (COOH) to 6]

No.  Substituent(s)
1    4-F
2    3-F
3    2-F
4    CF3
5    3-CF3
6    4-CF3
7    2-F,4-CF3
8    4-F,3-CF3
9    4-F,2-CF3
10   4-CH3
11   2-CH3
12   3-CH3
13   4-NH2
14   H
15   2-OH
16   3-OH,6-OH
17   2-OH,6-OH
18   2-OH,3-OH
19   2-OH,5-OH
20   2-OH,4-OH
21   3-OH,4-OH
22   2-COOH
• 22 substituted benzoic acids
Data
• 2 solvent systems
• 6 mixtures:
  1 Acetonitrile - Water 30 : 70
  2 Acetonitrile - Water 40 : 60
  3 Acetonitrile - Water 50 : 50
  4 MeOH - Water 40 : 60
  5 MeOH - Water 50 : 50
  6 MeOH - Water 60 : 40
• 22 compounds x 6 mixtures = 132 experiments
Data
Measurements
No.      compound number
Cpd      name of compound
Solvent  water and acetonitrile/methanol
Rf       retention factor
Rm       log((1-Rf)/Rf)
S_Area   surface area of molecule in Å^2
clogp    calculated partition coefficient octanol/water
volume   molecular volume in Å^3
MPolar   polarizability of the molecule cm-25
dipole   dipole moment of the molecule (Debye)
dipsol   dipole moment of the solvent (%solv1+%sol2)*100 Debye
PolSol   polarizability of the solvent (%pol1+%pol2)/100 Debye
Ovality  how far removed from spherical:
$$\mathrm{Ovality} = \frac{S}{4\pi \left( \frac{3V}{4\pi} \right)^{2/3}}$$
The water dipole is also given: 2.75 Debye
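Two of the derived quantities in this list are simple enough to compute directly. The sketch below assumes the logarithm in Rm is base 10 and uses the standard definition of ovality (surface area divided by the surface area of a sphere with the same volume). The function names and example values are our own.

```python
import math


def rm_value(rf):
    """Rm = log10((1 - Rf) / Rf), with Rf the TLC retention factor (0 < Rf < 1)."""
    return math.log10((1.0 - rf) / rf)


def ovality(surface_area, volume):
    """Surface area relative to that of a sphere with the same volume."""
    equivalent_sphere_area = 4.0 * math.pi * (3.0 * volume / (4.0 * math.pi)) ** (2.0 / 3.0)
    return surface_area / equivalent_sphere_area


# A spot that travels half way up the plate has Rf = 0.5.
print(rm_value(0.5))

# A perfect sphere of radius 1 has S = 4*pi and V = 4*pi/3, so ovality 1.
print(ovality(4.0 * math.pi, 4.0 * math.pi / 3.0))
```

Any non-spherical shape has more surface than the sphere of equal volume, so its ovality is greater than 1.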
• Molecular properties were calculated for each of the molecules
and tabulated in a spreadsheet (tlcdata.xls) e.g.
LogK‘ = -0.401QON + 0.396CLOGP + 0.109DIP -0.056DIPMOM -3.162ESDL1
+ 0.231CMR + 0.110POLSOL - 5.326
r = 0.954, F(7,110) = 155.59
Variance Explained = 91.0 %
Multiple linear regression – using the ‘best’ 7 parameters
[Figure: plot of predicted against measured values; • test set, o training set]
Unsupervised methods (typically classification models)
In the previous examples, data was fitted to a
model, usually predicting a numeric value of the
desired property. However, it is also possible to
cluster the data, and hence make predictions
about a particular class a new molecule will fall
into e.g. is it toxic or non-toxic. This is “guilt by
association”.
The most common approach to do this is cluster
analysis, which includes a diverse set of
approaches.
Hierarchical clustering and k-means clustering are
common approaches. Clustering involves finding
the distance between all pairs of data points, usually
using the Euclidean distance or the Manhattan
distance (or e.g. the Tanimoto distance). The clusters
are then determined by either a bottom-up
approach (agglomerative) or by a divisive
approach (top-down).
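The bottom-up (agglomerative) approach can be sketched with single linkage and Euclidean distance: start with every point in its own cluster and keep merging the two closest clusters until the closest pair is farther apart than a cutoff. The points, the cutoff values and the function name are illustrative assumptions.

```python
import math


def single_linkage(points, cutoff):
    """Agglomerative clustering: merge closest clusters until distance > cutoff."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None  # (distance, index of first cluster, index of second cluster)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(
                    math.dist(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > cutoff:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]


# Made-up molecules described by two properties.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]

tight = single_linkage(points, cutoff=2.0)    # small distance = high similarity cutoff
loose = single_linkage(points, cutoff=10.0)   # large distance = low similarity cutoff
print(tight, loose)
```

A small distance cutoff corresponds to a high similarity cutoff and gives many small clusters. Relaxing it merges them into fewer, larger ones.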
[Figure: clusters obtained with a high similarity cutoff and with a low similarity cutoff]
Plot of 2 PCs (principal components) of a dataset made up of many molecules and many calculated properties. It is possible to get a view of how diverse
molecules are within the property space and, for new molecules, where they are located.
Includes: physical properties (such as charge, van der Waals volume, and molecular refractivity)
subdivided surface areas (atomic contributions to logP and molecular refractivity)
counts of elemental atom types and of bond types
Kier/ Hall connectivity and kappa shape indices
topological indices (Wiener index and Balaban index)
pharmacophore feature counts (number of acidic and basic groups and hydrogen bond donors and acceptors)
partial charge descriptors, surface area, volume, and shape descriptors (among them water accessible surface area, mass density,
and principal moments of inertia).
So this is basically describing a series of molecules in many ways, then compressing the plot into two dimensions. Good for
selecting a screening set of compounds for testing.
J. Chem. Inf. Model., 2005, 45 (3), pp 581–590
The concept of ‘Chemical space’ – non-hierarchical clustering similar molecules
•Simulates the way that neurons are interconnected
•‘learns’ by adjusting the connection weights between nodes taking an input set
of parameters and attempting to fit the output measurements
•New data can then be entered and using the ‘learned’ model -> predict
This network has a
2:4:4:1 topology
Like neurons, the connections
are made when a threshold value
is attained.
Use ‘back propagation of errors’ to
adjust the connections
http://en.wikipedia.org/wiki/Backpropagation
http://en.wikipedia.org/wiki/Artificial_neural_network
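To make the 2:4:4:1 topology concrete, here is a sketch of the forward pass only: two inputs, two hidden layers of four nodes and one output, with a sigmoid standing in for the threshold behaviour. The weights are random, so the output is meaningless until trained. Training by back-propagation is not shown, and all names are our own.

```python
import math
import random


def sigmoid(z):
    """Smooth threshold function mapping any value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def make_network(topology, seed=0):
    """One (weights, biases) pair per layer of connections, randomly initialised."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(topology, topology[1:]):
        weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.uniform(-1, 1) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers


def forward(layers, inputs):
    """Propagate the inputs through every layer and return the output values."""
    activations = list(inputs)
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


network = make_network([2, 4, 4, 1])
output = forward(network, [0.5, -1.2])   # two made-up input descriptors
n_weights = sum(len(row) for weights, _ in network for row in weights)
print(output, n_weights)
```

Back-propagation would adjust these weights (2x4 + 4x4 + 4x1 = 28 of them, plus the biases) to reduce the difference between the output and the measured value.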
A machine learning method – a Neural network
[Figure: TLC neural network and plot of measured vs predicted results]
[Figure: workflow for DESI-MSI of a 3D tumour specimen (spectra over m/z 100-1000). Panel labels: discrimination tumour / non-tumour; dimensionality reduction on only tumour spectra; deep learning neural networks; subtype identification; chemical components and related metabolic pathways; molecular picture of tumour interactions. The tumour microenvironment is 3-dimensional: more chances to capture the biological interactions.]
There has been a lot of recent progress around ‘Deep
Learning’ – AI applied to cheminformatics problems –
essentially significant developments in Neural
Networks.
Application of Deep Learning to 3D DESI mass
spectrometry imaging in cancer: the first 3D mass spectral imaging of a tumour
Paolo Inglese, James S. McKenzie, Anna Mroz, James Kinross, Kirill Veselkov, Elaine Holmes, Zoltan Takats, Jeremy K. Nicholson and Robert C. Glen. Deep learning and 3D-DESI imaging reveal the hidden metabolic heterogeneity of cancer. Chem. Sci., 2017, 8, 3500