the shape of things to come? linkedin: fractal dimensionality and … · 2020. 1. 23. · •...



Post on 23-Aug-2020


Abstract:

Fractal dimension [1,2] can be used to capture shape information about small molecules and macromolecules. We use this method to generate shape descriptions of 14,556 crystal structures obtained from the sc-PDB [3] database. We trained 11 sequence translation models to generate ligand fingerprints from protein representations, generating ligand “answers” to protein “questions”. For three-quarters of the test set, the reconstructed fingerprints were similar enough to those found in the crystallographic data to enable virtual screening based on target analysis alone.

The shape of things to come? Fractal dimensionality and its applications in deep-learning-driven ligand-receptor interaction prediction.

Ryan Byrne - Dr. H. Chen, AstraZeneca Mölndal, Prof. Dr. G. Schneider, ETH Zürich

References

1. Mandelbrot, B. (1967). Science, 156, 636-638.

2. Grassberger, P., & Procaccia, I. (1983). Physica D: Nonlinear Phenomena, 9, 189–208.

3. Kellenberger, E., Muller, P., et al. (2006). Journal of Chemical Information and Modeling, 46(2), 717–727.

4. Vaswani, A., Shazeer, N., et al. (2017). Advances in Neural Information Processing Systems, 5998.

Data preparation:

• We extracted and cleaned 14,556 crystal structures from the sc-PDB crystallographic database.
• We then generated an FD fingerprint for each ligand (in its bound conformation), and a protein fingerprint from all residues within 3.5 Å of the ligand.
• We held out 10% of these as a test set.
• The average pairwise Tanimoto similarity between protein pocket fingerprints was low (0.33 ± 0.09).
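The hold-out split and the pairwise similarity check above can be illustrated with a minimal sketch. It assumes fingerprints are represented as sets of "on" bit positions, which may not match the FD fingerprint encoding used on the poster; the function names `tanimoto`, `mean_pairwise_tanimoto` and `hold_out`, the random seed, and the toy data are all hypothetical.

```python
import random
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of 'on' bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    # Two empty fingerprints are treated as identical by convention here.
    return len(a & b) / union if union else 1.0


def mean_pairwise_tanimoto(fingerprints):
    """Average Tanimoto similarity over all unordered pairs of fingerprints."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


def hold_out(items, fraction=0.1, seed=0):
    """Shuffle a copy of items and split off a test set of the given fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy data (hypothetical, not taken from the sc-PDB).
toy_fingerprints = [{0, 1, 2, 3}, {2, 3, 4, 5}, {0, 5, 6, 7}]
mean_similarity = mean_pairwise_tanimoto(toy_fingerprints)
train, test = hold_out(range(20), fraction=0.1)
```

A low mean value from such a check suggests that the pockets in the data set are diverse, so a model cannot score well simply by memorising near-duplicate inputs.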

Follow us @Aegis_ITN

“This project has received funding from the European Union’s Framework Programme for Research and Innovation Horizon 2020 (2014-2020) under the Marie Skłodowska-Curie Grant Agreement No. 675555, Accelerated Early staGe drug dIScovery (AEGIS).”

Retrospective and prospective studies

Back-translation?

Model training:

• We considered 11 sequence-to-sequence architectures, and 76 hyper-parameter combinations.
• Models were trained to attempt reconstruction of the ligand shape fingerprint, based on the associated protein fingerprint.
• Assessment was via perplexity (a measure of the uncertainty of the models about each decision) and accuracy.
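The two assessment metrics can be sketched as follows, using the standard definition of perplexity as the exponential of the mean negative log-likelihood of the correct tokens. This is an illustrative sketch rather than the evaluation code behind the poster; the function names and the toy inputs are assumptions.

```python
import math


def perplexity(true_token_probs):
    """Perplexity from the probabilities a model assigned to the correct tokens."""
    mean_nll = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(mean_nll)


def token_accuracy(predicted, reference):
    """Fraction of positions where the predicted token matches the reference."""
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# A model that spreads its belief evenly over four options at every step
# has a perplexity of about four (hypothetical toy values).
uniform_ppl = perplexity([0.25] * 8)
certain_ppl = perplexity([1.0, 1.0])
toy_accuracy = token_accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"])
```

Perplexity can be read as the effective number of options the model is choosing between at each step, so lower is better, with 1.0 as the floor.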

Outcomes

• The novel transformer [4] architecture was the best performing, by some measure.
• Our final architecture is a four-layer transformer with a dense-layer width of 512 and eight attention heads, trained with the Adam optimiser.
• Our model produced adequate or excellent reconstructions in three-quarters of the examples tested, with the latter category representing a third of the total.

Fractal dimension: A user's guide

Fractal dimension (FD) is a measure of the roughness and complexity of a surface. More specifically, it describes how the properties of a surface vary with the scale at which they are measured.

These non-integer measures of dimensionality correspond to the complexity and contortion of a surface in predictable ways, and allow us to rapidly rank molecules based on their shape. We can also use them to describe target pockets.

We adopt this formalism to perform fast, shape-based virtual screening, and to analyse target pockets.
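To make the idea of scale-dependent measurement concrete, here is a minimal sketch of box counting, one common way to estimate a fractal dimension. The poster does not state which estimator was used (its references include the Grassberger-Procaccia correlation dimension), so box counting is swapped in here purely for illustration. The function name, the grid sizes and the toy point sets are assumptions, and points are assumed to lie in the unit cube.

```python
import math


def box_counting_dimension(points, boxes_per_side=(2, 4, 8, 16)):
    """Estimate the fractal dimension of points in the unit cube by box counting."""
    log_scale, log_count = [], []
    for n in boxes_per_side:
        # Index of the box each point falls into on an n x n x n grid.
        occupied = {tuple(min(int(c * n), n - 1) for c in p) for p in points}
        log_scale.append(math.log(n))
        log_count.append(math.log(len(occupied)))
    # Least-squares slope of log(count) against log(1 / box size).
    mean_x = sum(log_scale) / len(log_scale)
    mean_y = sum(log_count) / len(log_count)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_scale, log_count))
    den = sum((x - mean_x) ** 2 for x in log_scale)
    return num / den


# Toy point sets: a densely sampled flat sheet and a straight line.
plane = [((i + 0.5) / 64, (j + 0.5) / 64, 0.0) for i in range(64) for j in range(64)]
line = [((i + 0.5) / 64, 0.0, 0.0) for i in range(64)]
plane_fd = box_counting_dimension(plane)
line_fd = box_counting_dimension(line)
```

For the flat sheet the number of occupied boxes grows with the square of the grid resolution, giving an estimate of 2, while the line gives 1. A rough, contorted surface would fall between 2 and 3.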

[Illustration labels: FD ≈ 2; 2.4 ≤ FD ≤ 2.6; FD ≈ 3]

[Figure 3 labels: Poor recovery; VS Useful; VS Ready; Random performance]

Figure 3: Ligand fingerprints reconstructed (predicted) from protein translation compared to those extracted (calculated) from the sc-PDB. Approximately three-quarters are ‘useful’ or better, a characterisation based on performance in large-scale retro- and prospective studies. A third are ‘VS ready’, based on the same analysis.

Figure 2: Maximum accuracy and perplexity achieved on the validation set for each trained model. Best model per architecture family (LSTM, GRU, CNN, Transformer) highlighted. Axes: Validation Perplexity, Validation Accuracy.

Figure 1: Illustration of the defined pocket region for a protein-ligand complex (ligand extracted for clarity). Pocket and ligand shapes are captured in the fingerprints with the corresponding colour key. The remaining fingerprint is the one generated by the deep-learning model, which is compared against the experimental version. PDB ID: 5N2F.

Combining our developed shape descriptors with deep learning resulted in a model which could create useful fingerprints for virtual screening in three-quarters of cases, based on analysis of target structure alone.

Twitter: @Ryan_Byrne_

LinkedIn: