Bridging cheminformatics and bioinformatics using protein structures
Edith Chan
Inpharmatica
London, 10 April 2001
2
Bioinformatics Cheminformatics
SELECTING THE BEST TARGETS
Disease-association doesn’t make a protein a target - requires validation as point of intervention in pathway
Having good biological rationale doesn’t make a protein tractable to chemistry (drugable)
Genomics, HTS and Combichem have increased numerical throughput many hundredfold, yet the result is an overload of poorly integrated data and a shortfall in productivity
[Diagram: Target Validation Process (Disease, Target) and Drug Discovery Process (Leads, Clinic), joined by Target Selection]
Inpharmatica’s protein structure focus - uniquely placed to assess both parameters
High Validity and Drugability Requires a Unifying Informatics Framework
3
BIOPENDIUM AND CHEMATICA
[Diagram: Genome Data, Target Structure, Lead Hypotheses]
Biopendium: protein target validation and selection
Chematica: drug discovery
4
[Figure: % sequence identity scale (100%, 30%, 0) for the test sequence AHHLDRPGHNMCEAGFWQPILL, with regions labelled "Standard Approaches" and "Advanced Approaches"]
STRUCTURE-BASED METHODS FIND MANY HOMOLOGUES (AND PUTATIVE TARGETS) NOT DETECTABLE FROM SEQUENCE SIMILARITY
Biochemical function and drugability defined by 3D structure, not sequence - structure is better conserved
5
BIOPENDIUM
Inputs - all public (or proprietary) protein data
Proprietary methods
Genome-Threader
QBI--Blast
Reverse Search Maximisation
Massive computation
A 1 million CPU-hour set of calculations employing the most advanced algorithms (1,100-processor farm)
Applied to 600,000 sequences, 14,000 structures + bound ligands
Yields 670 million precalculated protein relationships
Query results in 15 minutes vs. two weeks with traditional bioinformatics
Protein information in an Oracle database: structures, sequences, bound ligands, families, functions
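The slide attributes fast queries to relationships that are calculated once and then held in a relational database. A minimal sketch of that idea follows, assuming an in-memory SQLite database in place of Oracle; the table name, columns, identifiers, and scores are all hypothetical and are not taken from Biopendium.

```python
import sqlite3

# In-memory SQLite stands in for the Oracle database named on the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE relationship (
           query_id TEXT, hit_id TEXT, method TEXT, score REAL)"""
)
# An index on the query column turns each lookup into a cheap index scan.
conn.execute("CREATE INDEX idx_query ON relationship (query_id)")

# Hypothetical precalculated hits; identifiers, methods, and scores are made up.
rows = [
    ("seqA", "struct1", "pairwise", 310.0),
    ("seqA", "struct2", "profile", 95.5),
    ("seqA", "struct3", "threading", 41.2),
    ("seqB", "struct1", "profile", 88.0),
]
conn.executemany("INSERT INTO relationship VALUES (?, ?, ?, ?)", rows)


def hits_for(query_id):
    """Return the precalculated hits for one sequence, best score first."""
    cur = conn.execute(
        "SELECT hit_id, method, score FROM relationship "
        "WHERE query_id = ? ORDER BY score DESC",
        (query_id,),
    )
    return cur.fetchall()


seq_a_hits = hits_for("seqA")
```

The point of the design is that the expensive searches run once, offline; answering a question at query time is then only a lookup.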
6
THE INPHARMATICA BIOPENDIUM
Data sources: Genbank, PDB, Prosite, Prints, Enzyme, Swissprot, Ligplot
Proprietary sequences (ORF prediction)
Proprietary structures
Taxonomy
Processed PDB to XMAS data
Mask sequences
Relational database: links complementary data in the 7 resources
Precalculated data for 600,000 protein sequences (scores and alignments for each hit):
• Pairwise sequence searches
• Profile-based searches
• Threading-based approaches
Inpharmatica Workbench:
• Ligplot ligand interaction editor
• Inpharmatica-enhanced RasMol 3D viewer
• Interactive sequence alignment editor
8
DRUGABLE TARGET DISCOVERY
Finding a novel brain metalloprotease
BIOPENDIUM: novel brain protein identified
CHEMATICA: drugable site identified
9
CHEMATICA IS….
Site Mapping
Site Identification
Fragment Mapping
Pharmacophore Generation
Database of putative/known binding sites
Site mapping and pharmacophore generation
Similarity searching/clustering of sites
Large-scale virtual screening resource
Gene Family Data Views
Chemical annotation of PDB ‘real’ ligand structures
Ligand 2-D structures
Gene family structures
Consensus family analysis
10
Site identification: how are sites in a protein structure delineated?
a. A sphere is placed between the van der Waals (VDW) surfaces of each atom pair.
b. Any neighbouring atoms that penetrate the sphere cause its size to be reduced.
c. Repeat for all possible atom pairs.
d. Generate a surface around the surviving spheres to define the site region.
SURFNET: A program for visualizing molecular surfaces, cavities and intermolecular interactions.
Laskowski R A (1995), J. Mol. Graph., 13, 323-330.
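Steps a to c can be sketched in a few lines of Python. This is an illustrative sketch rather than the SURFNET code: the function name, the `(x, y, z, vdw_radius)` atom representation, and the `r_min` and `r_max` cut-offs are assumptions made for the example, and step d (building a surface around the surviving spheres) is left out.

```python
import math
from itertools import combinations


def gap_spheres(atoms, r_min=1.0, r_max=4.0):
    """Return surviving (centre, radius) gap spheres.

    atoms: list of (x, y, z, vdw_radius) tuples.
    r_min, r_max: assumed radius cut-offs in angstroms.
    """
    spheres = []
    for i, j in combinations(range(len(atoms)), 2):
        a, b = atoms[i], atoms[j]
        d = math.dist(a[:3], b[:3])
        gap = d - a[3] - b[3]  # distance between the two VDW surfaces
        if gap <= 0:
            continue  # surfaces touch or overlap: no room for a sphere
        # Step a: sphere centred midway between the two surfaces.
        # If capped at r_max it no longer touches both surfaces.
        radius = min(gap / 2.0, r_max)
        t = (a[3] + gap / 2.0) / d
        centre = tuple(a[k] + t * (b[k] - a[k]) for k in range(3))
        # Step b: shrink the sphere until no neighbouring atom penetrates it.
        for k, c in enumerate(atoms):
            if k in (i, j):
                continue
            radius = min(radius, math.dist(centre, c[:3]) - c[3])
        # Spheres squeezed below the minimum radius are discarded.
        if radius >= r_min:
            spheres.append((centre, radius))
    return spheres


# Two atoms 8 A apart with 1.5 A radii leave a 5 A gap between surfaces.
demo = gap_spheres([(0.0, 0.0, 0.0, 1.5), (8.0, 0.0, 0.0, 1.5)])
```

The loop over all pairs (step c) is quadratic in the number of atoms, and the inner neighbour check makes this sketch cubic; a practical implementation would restrict both loops to nearby atoms.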
11
Physical Parameters of the clefts
Volume
Hydrophobic content
Polar content
Surface accessibility
……
In total, 20 parameters are calculated.
The 8 largest sites are stored together with their physical parameters.
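As a small illustration of storing the largest sites with their parameters, the sketch below ranks sites by volume and keeps the top eight. The `Site` record, its field names, and the units are hypothetical; only three of the 20 parameters are shown, and ranking "largest" by volume is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Site:
    site_id: int
    volume: float               # assumed unit: cubic angstroms
    hydrophobic_content: float  # assumed: fraction of hydrophobic atoms lining the site
    polar_content: float        # assumed: fraction of polar atoms lining the site


def largest_sites(sites, keep=8):
    """Rank sites by volume and keep the largest `keep` of them."""
    return sorted(sites, key=lambda s: s.volume, reverse=True)[:keep]


# Ten made-up sites with volumes 100, 200, ..., 1000.
example_sites = [Site(i, 100.0 * (i + 1), 0.5, 0.5) for i in range(10)]
stored = largest_sites(example_sites)
```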
12
Prediction of binding/active sites
Rule driven:
use of neural nets (a) on a training set of 100 ligand/protein PDBs
Validation:
success rate = 90% on an extended set of 500 PDBs
(a) a backpropagation net with a 7-5-1 architecture
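For readers unfamiliar with the notation, a 7-5-1 network has 7 inputs, 5 hidden units, and 1 output. The sketch below is a generic backpropagation network of that shape in plain Python, not the trained Chematica model: the sigmoid activations, squared-error loss, learning rate, and toy training data are all assumptions, and the slides do not say which 7 site descriptors are used as inputs.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class Net751:
    """Feed-forward net: 7 inputs, 5 hidden units, 1 output (sigmoid units)."""

    def __init__(self, n_in=7, n_hidden=5, seed=0):
        rng = random.Random(seed)
        # Each weight row carries one extra entry for the bias term.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        hb = hidden + [1.0]
        y = sigmoid(sum(w * v for w, v in zip(self.w2, hb)))
        return xb, hb, y

    def predict(self, x):
        return self.forward(x)[2]

    def train_step(self, x, target, lr=0.5):
        """One backpropagation update for a single example (squared error)."""
        xb, hb, y = self.forward(x)
        delta_out = (y - target) * y * (1.0 - y)
        # Hidden deltas are computed with the output weights before they change.
        delta_hidden = [delta_out * self.w2[j] * hb[j] * (1.0 - hb[j])
                        for j in range(len(self.w1))]
        for j in range(len(self.w2)):
            self.w2[j] -= lr * delta_out * hb[j]
        for j, row in enumerate(self.w1):
            for i in range(len(row)):
                row[i] -= lr * delta_hidden[j] * xb[i]


# Toy data: one "binding site" and one "non-site" descriptor vector (made up).
net = Net751()
x_site, x_other = [1.0] * 7, [0.0] * 7
for _ in range(500):
    net.train_step(x_site, 1.0)
    net.train_step(x_other, 0.0)
```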
13
How is the XSITE potential derived?
• 3-D distributions of 20 different atom types about the 20 amino acids are calculated.
• No energy terms are assumed.
X-SITE: use of empirically derived atomic packing preferences to identify favourable interaction regions in the binding sites of proteins.
Laskowski R A, Thornton J M, Humblet C & Singh J (1996), Journal of Molecular Biology, 259, 175-201.
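The idea of an empirical distribution can be sketched as a 3-D histogram: for each (amino acid, atom type) pair, count where contacting atoms are observed and normalise the counts. The sketch below assumes contact positions have already been transformed into a local coordinate frame of the residue; the 1 Å bin size, the function names, and the atom-type labels are hypothetical, and this is not the X-SITE implementation.

```python
import math
from collections import defaultdict

GRID_STEP = 1.0  # assumed bin size in angstroms


def bin_index(pos, step=GRID_STEP):
    """Map a 3-D position to the integer index of the grid cell containing it."""
    return tuple(int(math.floor(c / step)) for c in pos)


def build_distributions(contacts):
    """Build normalised 3-D histograms from observed contacts.

    contacts: iterable of (residue_name, atom_type, (x, y, z)), with positions
    expressed in the residue's local frame (assumed to be done upstream).
    Returns {(residue_name, atom_type): {bin: fraction_of_observations}}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for residue, atom_type, pos in contacts:
        counts[(residue, atom_type)][bin_index(pos)] += 1
    distributions = {}
    for key, bins in counts.items():
        total = sum(bins.values())
        distributions[key] = {b: n / total for b, n in bins.items()}
    return distributions


# Made-up observations of nitrogen atoms near Asp and carbon atoms near Leu.
observed = [
    ("ASP", "N.3", (2.1, 0.2, 0.3)),
    ("ASP", "N.3", (2.7, 0.9, 0.1)),
    ("ASP", "N.3", (-1.2, 0.0, 0.0)),
    ("LEU", "C.3", (3.5, 3.5, 0.5)),
]
xsite_like = build_distributions(observed)
```

Because the distributions come straight from observed counts, no energy function enters the calculation, which matches the second bullet above.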
14
Data set used
(1) 521 non-homologous protein chains* from the PDB that satisfy:
• no two chains share > 20% sequence identity
• resolution < 1.8 Å
• R factor < 0.2
AND
(2) 376 protein-ligand PDB structures for studying additional atom types other than those from peptides and proteins, such as Cl, F.
Note: The PDB has about 14K entries!
*cullpdb_pc20_res1.8_R0.2_d001130_chains521 (R. Dunbrack, Jr.)
U. Hobohm, M. Scharf, R. Schneider, C. Sander, "Selection of representative protein data sets." Protein Science, 1, 409-417 (1992).
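The three selection criteria can be illustrated with a simple greedy cull. This is a sketch under stated assumptions, not the procedure used to build the cited list: the dictionary keys, the `identity` callback, and the choice to prefer better-resolution chains first are all hypothetical.

```python
def cull(chains, identity, max_identity=20.0, max_resolution=1.8, max_r_factor=0.2):
    """Greedy selection of a non-redundant, high-quality chain set.

    chains: list of dicts with keys 'id', 'resolution', 'r_factor'.
    identity: callable (id_a, id_b) -> percent sequence identity (hypothetical).
    """
    # Quality filters from the slide: resolution < 1.8 A and R factor < 0.2.
    good = [c for c in chains
            if c["resolution"] < max_resolution and c["r_factor"] < max_r_factor]
    # Assumed tie-break: consider the best-quality chains first.
    good.sort(key=lambda c: (c["resolution"], c["r_factor"]))
    kept = []
    for c in good:
        # Keep a chain only if it is dissimilar to everything kept so far.
        if all(identity(c["id"], k["id"]) <= max_identity for k in kept):
            kept.append(c)
    return kept


# Toy example: chains A and B are 90% identical, all other pairs 10%.
toy_pairs = {frozenset(("A", "B")): 90.0}


def toy_identity(a, b):
    return toy_pairs.get(frozenset((a, b)), 10.0)


toy_chains = [
    {"id": "A", "resolution": 1.5, "r_factor": 0.18},
    {"id": "B", "resolution": 1.2, "r_factor": 0.15},
    {"id": "C", "resolution": 2.0, "r_factor": 0.19},  # fails resolution
    {"id": "D", "resolution": 1.7, "r_factor": 0.25},  # fails R factor
    {"id": "E", "resolution": 1.6, "r_factor": 0.19},
]
selected = [c["id"] for c in cull(toy_chains, toy_identity)]
```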
15
Application of XSITE distributions to side-chains making up the calculated protein binding site
Projecting XSITE distributions onto the predicted binding site
16
How is the pharmacophore generated?
a. Compare the XSITE predictions generated for the different probe atoms on a 3D grid of densities encompassing the region of the binding site.
b. The higher the value at a given grid point, the higher the likelihood of finding that type of atom at that location.
c. A “best” map is derived for each probe atom.
d. The net result is a new set of 3D grid maps, one per probe atom, holding only those regions where that atom scored higher than the others.
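Steps c and d amount to a per-grid-point "winner takes all" comparison across probes. A minimal sketch follows, assuming each probe map is a dictionary from grid point to score; the function name, the probe labels, and the tie-breaking rule (the first probe encountered wins) are assumptions for illustration.

```python
def best_maps(maps):
    """Keep each grid point only in the map of the probe that scores highest there.

    maps: {probe: {grid_point: score}}; missing points count as a score of 0.
    Returns a dict of the same shape, one "best" map per probe.
    """
    result = {probe: {} for probe in maps}
    points = set()
    for grid in maps.values():
        points.update(grid)
    for point in points:
        # Ties go to the first probe in dictionary order (an arbitrary choice).
        winner = max(maps, key=lambda probe: maps[probe].get(point, 0.0))
        score = maps[winner].get(point, 0.0)
        if score > 0.0:
            result[winner][point] = score
    return result


# Made-up scores for three probe atom types at three grid points.
probe_maps = {
    "O.2": {(0, 0, 0): 0.9, (1, 0, 0): 0.2},
    "N.3": {(0, 0, 0): 0.4, (1, 0, 0): 0.7},
    "C.ar": {(2, 0, 0): 0.5},
}
pharmacophore_maps = best_maps(probe_maps)
```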
17
What is fragment mapping?
a. An in-built database of more than 100 small-molecule fragments, covering the most common functional groups and the common drug-like building blocks used in chemistry.
b. Privileged structures from companies.
[Figure: 2-D structures of example fragments]
Example fragments:
t-butyl, ethyl, tBoc
phenyl, naphthyl, di-phenyl, bi-phenyl
carbonyl, carboxyl, acetic acid, acetamide, methylamine
furan, thiophene, oxazole, thiazole, pyrrole, imidazole, triazine
cyclohexyl, thiazolidine, piperazine, thiadiazole
sulfonyl, sulfonamide, cyano, mercapto, methol
18
How is fragment mapping done?
• Each atom in a fragment is assigned one of the 20 atom types.
• Each fragment is placed at every grid point within the binding site and subjected to 300 rotations.
• At each rotation a score is calculated using the appropriate X-SITE predictions for the atom types that the fragment contains.
[Figure: fragment atoms labelled with their atom types, e.g. C.ar]
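The placement loop described above can be sketched as an exhaustive search over grid points and rotations. The slides do not say how the 300 rotations are chosen, so this sketch assumes uniformly random rotations; `score_at` is a hypothetical callback standing in for a lookup into the per-atom-type X-SITE maps, and summing per-atom scores is also an assumption.

```python
import math
import random


def random_rotation(rng):
    """Rotation matrix from a uniformly sampled unit quaternion."""
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    a, b = math.sqrt(1.0 - u1), math.sqrt(u1)
    x, y = a * math.sin(2 * math.pi * u2), a * math.cos(2 * math.pi * u2)
    z, w = b * math.sin(2 * math.pi * u3), b * math.cos(2 * math.pi * u3)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def rotate(m, v):
    """Apply a 3x3 rotation matrix to a 3-vector."""
    return tuple(sum(m[r][c] * v[c] for c in range(3)) for r in range(3))


def map_fragment(fragment, grid_points, score_at, n_rotations=300, seed=0):
    """Find the best-scoring placement of a fragment.

    fragment: list of (atom_type, (x, y, z)) offsets from the fragment centre.
    score_at: callable (atom_type, position) -> float (hypothetical map lookup).
    Returns (best_score, grid_point, rotation_matrix).
    """
    rng = random.Random(seed)
    rotations = [random_rotation(rng) for _ in range(n_rotations)]
    best = (float("-inf"), None, None)
    for g in grid_points:
        for rot in rotations:
            total = 0.0
            for atom_type, offset in fragment:
                r = rotate(rot, offset)
                total += score_at(atom_type, (g[0] + r[0], g[1] + r[1], g[2] + r[2]))
            if total > best[0]:
                best = (total, g, rot)
    return best


# Toy scoring function: atoms score best when close to the point (1, 0, 0).
def toy_score(atom_type, pos):
    return -math.dist(pos, (1.0, 0.0, 0.0))


single_atom = [("C.ar", (0.0, 0.0, 0.0))]
placement = map_fragment(single_atom, [(0, 0, 0), (1, 0, 0), (2, 0, 0)], toy_score)
```

The cost grows as grid points × rotations × atoms per fragment, so the number of rotations trades coverage of orientations against run time.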
19
CHEMATICA
Curated, high-quality annotation and presentation of important ‘drugable’ gene families
NHRs, kinases, caspases, GPCRs,….
Contains ligand structure information
Contains crystal environment classification
Automatic alerts for newly released structures
Multiple structure comparison options
Gene Family Data Views
20
Consensus Family Analysis
MMP-1 MMP-8 MMP-13 MMP-3
Size and topology of binding sites for MMP-1 & MMP-8 are similar, but detailed interactions differ
Spheres signify a negative charge requirement in different areas of the binding pockets, which provides potential for specificity
CHEMATICA
21
Validation Study
Two sets of data were taken from the literature:
1) GOLD (Jones, Willett, Glen, Leach and Taylor) [1]
Genetic Optimization for Ligand Docking
(71% success rate in ligand binding mode in 100 PDBs)
our method - 70%
2) SUPERSTAR (Verdonk, Cole and Taylor) [2]
Empirical method for interactions in proteins
(~67% success rate for the original 4 probes in 122 PDBs)
our method - 84%
1. Jones et al. J. Mol. Biol. (1997) 267, 727-748
2. Verdonk et al. J. Mol. Biol. (1999) 289, 1093-1108
22
Acknowledgements
Inpharmatica
Alex Michie
John Overington
Simon Skidmore
UCL
Roman Laskowski
Adrian Shepherd
Janet Thornton