
CSAR and Binding MOAD: Two different databases, two different aims, one common goal: provide the best protein-ligand data

James B. Dunbar Jr. and Heather A. Carlson

5th Meeting on U.S. Government Chemical Databases and Open Chemistry

August 25th and 26th, 2011

Binding MOAD

• Who are we:
  • Heather Carlson – Principal Investigator
  • Mark L. Benson
  • Richard Smith
  • Nickolay Khanazov
  • Leigi Hu
  • Michael Lerner
  • John Beaver
  • Brandon Dimcheff
  • Jason Nerothin
  • Jayson Falkner
  • Peter Dresslar
  • James Dunbar Jr.

Binding MOAD

Binding MOAD

HTML

GATE NLP program

BUDA

Tagged HTML + annotated XML => Scores + text highlights

Web App used to aid in manual biodata extraction and curation

• For 2010 update: ~1200 manuscripts to review manually for data

• ~2800 new PDB structures

Binding MOAD

BUDA

Binding MOAD

Time consuming steps:
• Obtaining the HTML from the journals – format changes
  • Random – different every year
• Hand curation of data within BUDA
  • Correct data for compound
  • Correct sequence for the crystal

BUDA – essential for bookkeeping of the curation process
• Allows multiple people to work on the curation
• Keeps track of changes and comments on a per-user basis
• Stores everything in MySQL – records all work done over the years
• Orders manuscripts so that those most likely to contain data appear at the top
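The ordering step can be pictured with a minimal sketch. It assumes each manuscript carries a single numeric score from the NLP step and a flag recording whether it has been curated. The class, field names, and score values are hypothetical and are not taken from BUDA itself.

```python
from dataclasses import dataclass


@dataclass
class Manuscript:
    paper_id: str          # hypothetical identifier
    score: float           # hypothetical NLP score: higher = more likely to hold binding data
    curated: bool = False  # set once a curator has finished with the paper


def curation_queue(manuscripts):
    """Return the manuscripts still awaiting curation, highest score first."""
    pending = [m for m in manuscripts if not m.curated]
    return sorted(pending, key=lambda m: m.score, reverse=True)


# Made-up example records
papers = [
    Manuscript("A", 0.12),
    Manuscript("B", 0.87),
    Manuscript("C", 0.55, curated=True),
    Manuscript("D", 0.64),
]
queue = curation_queue(papers)
print([m.paper_id for m in queue])
```

Curators working from the top of such a queue reach the data-rich papers first, which matters when ~1200 manuscripts need manual review.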

CSAR Specific Aims

SA1. Build the largest, high-quality, freely accessible database of protein-ligand complexes with experimentally determined binding affinities from literature.

SA2. Generate new experimental data: We propose experimentally determining the dissociation constants (Kd values) for selected protein-ligand complexes using two complementary techniques: isothermal titration calorimetry (ITC) and surface plasmon resonance (SPR). Consistency between the two approaches would provide confidence in the data. Furthermore, important physicochemical properties for the ligands will be determined (logP/logD, pKa, and solubility), and additional crystal structures will be solved.

(Note: actually using ITC, Octet Red, and ThermoFluor – WuXi AppTec is measuring the properties)

SA3. Curate data from the community

SA4. Community outreach
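The consistency check mentioned under SA2 can be sketched as a comparison of Kd values on a log scale. The tolerance of 0.5 log units and the example Kd values are assumptions for illustration only, not CSAR criteria or measured data.

```python
import math


def pkd(kd_molar):
    """Negative log10 of a dissociation constant given in mol/L."""
    return -math.log10(kd_molar)


def techniques_agree(kd_a, kd_b, tol_log_units=0.5):
    """True if two Kd measurements differ by no more than tol_log_units in pKd.

    tol_log_units is an assumed threshold, chosen here only for illustration.
    """
    return abs(pkd(kd_a) - pkd(kd_b)) <= tol_log_units


# Hypothetical measurements of one complex by two techniques (mol/L)
kd_first = 2.0e-7
kd_second = 3.5e-7
print(round(pkd(kd_first), 2), round(pkd(kd_second), 2),
      techniques_agree(kd_first, kd_second))
```

Comparing in log units treats a two-fold disagreement the same way at nanomolar and micromolar affinity, which a plain difference in Kd would not.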

CSAR: Community Structure-Activity Resource

CSARdock.org

• Who are we:
  • Principal Investigators
    • Heather Carlson
    • Jason Gestwicki
    • Jeanne Stuckey
    • Shaomeng Wang
  • Researchers
    • William Clay Brown
    • Krishnapriya Chinnaswamy
    • James Delproposto
    • James Dunbar Jr.
    • Emilio Esposito
    • You-Na Kang
    • Ginger Kubish
    • Richard Smith
    • Kelly Damm-Ganamet

• Who are we (cont.):
  • Consultants
    • Philip Andrews
    • Charles Brooks III
    • Hollis Showalter
    • Janet Smith
  • Web Programming
    • Shelly Yang
  • System Administration
    • Allen Bailey
  • Advisory Board
    • Michael Gilson
    • Philip Hajduk
    • Paul Labute
    • Deborah Loughney
    • Anthony Nicholls
    • Tudor Oprea
    • Catherine Peishoff
    • Peter Preusch
    • Alexander Tropsha
    • Janna Wehrle

CSAR – the people

CSAR – dataset example

CSAR – compound properties

CSAR – crystallography

Industrial and In-house efforts

Abbott
• 4 datasets in progress

Genentech
• Signed CDA
• 5 datasets in progress

GSK
• Signed CDA

Roche
• 1 dataset deposited

BMS and Pfizer
• CDA in for legal review

In-house
• CDK2 (done), CDK2/cyclinA (final stages), LpxC (ongoing), urokinase and Hsp90 (initial stages)

Dataset Selection Process (1)

Analyze target – does it have crystal structures and defined series (expect ~3 or so series) with appropriate biological data?

Analyze crystal data – does each have sufficient data present to refine the density? (.cv, .mtz, scale.log, …) – if so collect into a directory

Obtain biological data on all compounds tested in the relevant assay and any applicable counter screens (Ki, Ka, IC50; no % inhibition)

Export from the corporate database:
• Structure (SMILES)
• Company identifier
• Biological data for screens – including those in crystal structures

Split data into three types: actives, inactives, crystal structures.

For crystal structures obtain any PDB ids if available.

For the Actives

Move into MOE with pK(x) or pIC50 values.

Wash and calculate physical properties:
• Hydrogen bond acceptors (Acc)
• Hydrogen bond donors (Don)
• Total number for combined Acc and Don
• Heavy atom count
• Rotatable bond count
• SlogP
• TPSA
• Weight

Tag each entry as to series

Using MOE diverse selection, select ~40 compounds for each series based on pK(x) or pIC50, Acc, and Don

Check the spread in the other characteristics to be sure they are not skewed, and verify by eye that the available chemical functionality is well covered.
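The selection step for the actives can be illustrated with a small greedy MaxMin picker. This is a generic diversity heuristic used here as a stand-in; it is not a description of how MOE's diverse selection works internally. The min-max scaling, the choice of the first entry as seed, and the example compounds are all assumptions.

```python
import math


def p_value(conc_molar):
    """Convert a Ki, Kd, or IC50 in mol/L to its negative log10 (pKi, pKd, pIC50)."""
    return -math.log10(conc_molar)


def scale_columns(rows):
    """Min-max scale each column to [0, 1] so no single descriptor dominates."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]


def maxmin_select(features, n_pick):
    """Greedy MaxMin: keep adding the entry farthest from everything picked so far."""
    pts = scale_columns(features)
    picked = [0]  # assumption: seed with the first entry
    # Distance from every entry to its nearest picked entry
    nearest = [math.dist(p, pts[0]) for p in pts]
    while len(picked) < min(n_pick, len(pts)):
        candidates = [i for i in range(len(pts)) if i not in picked]
        nxt = max(candidates, key=lambda i: nearest[i])
        picked.append(nxt)
        nearest = [min(d, math.dist(p, pts[nxt])) for d, p in zip(nearest, pts)]
    return picked


# Hypothetical series: (IC50 in mol/L, acceptor count, donor count)
series = [
    (1.0e-9, 4, 2),
    (1.2e-9, 4, 2),
    (5.0e-7, 6, 1),
    (4.0e-7, 6, 1),
    (1.0e-5, 2, 3),
]
features = [(p_value(ic50), acc, don) for ic50, acc, don in series]
print(maxmin_select(features, 3))
```

In this toy series the near-duplicate pairs (entries 0 and 1, entries 2 and 3) contribute one member each, which is the behaviour wanted when picking ~40 compounds that span both potency and hydrogen-bonding character.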


For the Actives (identify previous release of compound data)

Extract the ligands for the target from BindingDB/ChEMBL, load them into MOE, then export as SMILES.

Export the selected set from MOE (all fields) as text, with structures as SMILES.

In Pipeline Pilot, using canonicalized SMILES, check whether any selected compound is in BindingDB. If yes, select a suitable replacement; if not, the selection stands.
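The literature check reduces to a set-membership test once every structure has been canonicalized by the same toolkit. The sketch below assumes that canonicalization has already happened and uses plain Python rather than Pipeline Pilot. The compound identifiers, the SMILES, and the "first spare candidate" replacement rule are hypothetical.

```python
def find_replacements(selected, pool, published):
    """Swap out selected compounds whose structure is already in the literature.

    selected  : list of (compound_id, canonical_smiles) chosen for release
    pool      : list of (compound_id, canonical_smiles) spare candidates,
                in order of preference (assumed ordering)
    published : set of canonical SMILES exported from the public databases
    """
    used = {smi for _, smi in selected}
    spare = [c for c in pool if c[1] not in published and c[1] not in used]
    final, swaps, unresolved = [], [], []
    for cid, smi in selected:
        if smi not in published:
            final.append((cid, smi))    # not in the literature: selection stands
        elif spare:
            new = spare.pop(0)          # take the next unpublished candidate
            final.append(new)
            swaps.append((cid, new[0]))
        else:
            unresolved.append(cid)      # published, and nothing left to swap in
    return final, swaps, unresolved


# Hypothetical data; all SMILES assumed canonicalized by one toolkit
selected = [("CMP-1", "CCO"), ("CMP-2", "c1ccccc1O"), ("CMP-3", "CC(=O)N")]
pool = [("CMP-7", "CCO"), ("CMP-8", "CCN"), ("CMP-9", "CCCl")]
published = {"c1ccccc1O", "CCN"}

final, swaps, unresolved = find_replacements(selected, pool, published)
print(final, swaps, unresolved)
```

String comparison only works if both sides were canonicalized identically. Mixing SMILES written by different programs would let published compounds slip through.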


Find the Inactives

Many should be extremely similar to crystal structure

Using Pipeline Pilot, search the inactives with the SMILES from the crystal structure to find those very similar to the known crystal structure ligand.

MDL public keys with 0.85 to 0.99 as the similarity range. Select ~10.

If ~10 are found then check BindingDB (Pipeline Pilot) for any that are in literature. If yes – select suitable replacement – else selection stands

If only 1 or 2 (or none) then continue in MOE
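The similarity window can be sketched with fingerprints held as sets of on-bit indices. The sketch assumes the similarity measure is the Tanimoto coefficient and that the key-based fingerprints have been computed elsewhere. The bit sets and compound identifiers below are invented toy data, not real MDL keys.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_neighbours(query_fp, candidates, low=0.85, high=0.99, n_max=10):
    """Return up to n_max candidates whose similarity to the query lies in [low, high]."""
    hits = []
    for cid, fp in candidates.items():
        sim = tanimoto(query_fp, fp)
        if low <= sim <= high:
            hits.append((sim, cid))
    hits.sort(reverse=True)  # most similar first
    return [(cid, round(sim, 3)) for sim, cid in hits[:n_max]]


# Toy fingerprints (hypothetical bit indices)
crystal_fp = set(range(20))
inactives = {
    "INA-1": set(range(19)),         # 19 of 20 bits shared
    "INA-2": set(range(20)),         # identical fingerprint
    "INA-3": set(range(10)),         # only half the bits
    "INA-4": set(range(18)) | {25},  # 18 shared bits, one extra
}
print(near_neighbours(crystal_fp, inactives))
```

With these toy sets the upper bound of 0.99 drops the candidate whose fingerprint is identical to the crystal ligand, and the lower bound drops the distant one.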


For the Inactives not extremely similar to crystal structure

Move into MOE.

Wash and calculate physical properties:
• Hydrogen bond acceptors (Acc)
• Hydrogen bond donors (Don)
• Total number for combined Acc and Don
• Heavy atom count
• Rotatable bond count
• SlogP
• TPSA
• Weight

Tag each entry as to series

Using MOE diverse selection, select ~10 compounds for each series based on Acc and Don

Check the spread in the other characteristics to be sure they are not skewed, and verify by eye that the available chemical functionality is well covered.

Check BindingDB (Pipeline Pilot) for any that are in literature.
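The check for skew in the characteristics not used for selection can be given a rough numerical form before the by-eye inspection. The sketch compares the range and median of each property in the selected subset against the full pool. The coverage threshold of 0.6 and the property values are assumptions chosen for illustration.

```python
from statistics import median


def spread_report(pool, selected, props, min_coverage=0.6):
    """Compare the selected subset with the pool, property by property.

    coverage     : fraction of the pool's range spanned by the selection
    median_shift : selection median minus pool median
    flag         : True when coverage falls below min_coverage (assumed threshold)
    """
    report = {}
    for p in props:
        pool_vals = [c[p] for c in pool]
        sel_vals = [c[p] for c in selected]
        pool_range = max(pool_vals) - min(pool_vals)
        sel_range = max(sel_vals) - min(sel_vals)
        coverage = sel_range / pool_range if pool_range else 1.0
        report[p] = {
            "coverage": round(coverage, 2),
            "median_shift": round(median(sel_vals) - median(pool_vals), 2),
            "flag": coverage < min_coverage,
        }
    return report


# Hypothetical property table for one series
pool = [
    {"SlogP": 1.0, "TPSA": 40.0},
    {"SlogP": 5.0, "TPSA": 60.0},
    {"SlogP": 3.0, "TPSA": 50.0},
    {"SlogP": 2.0, "TPSA": 120.0},
    {"SlogP": 4.0, "TPSA": 90.0},
]
selected = pool[:3]
report = spread_report(pool, selected, ["SlogP", "TPSA"])
print(report)
```

Here the selection spans the full SlogP range but only a quarter of the TPSA range, so TPSA is flagged for a closer look.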


Datasets – lessons learned

• Biological data
  • Attention to the details
    • LpxC – just enough Zn to be active (catalytic site), but not enough to cause inhibition from the secondary inhibitory site for Zn
  • Need to be aware of inherent error limits
  • Solubility can be a big issue
    • Particularly how it is handled
    • i.e. filtered solids from ligand before injecting into ITC
  • Protocols – did they use exactly what was published
  • Store output from assays in PDF – spectra, etc.
    • Allow end users to see and judge what they want to include for themselves
• Crystallography – check the quality, provide density
  • Many different metrics – for us RSCC (real space correlation coefficient) for ligand is very important – but we use several
• Setting up lots of proteins for docking and scoring can be a bear
• Getting approval of legal departments – very time consuming
  • Initial confidentiality agreement
  • Approval of individual compounds for release
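For readers unfamiliar with RSCC, it is the linear correlation between observed and calculated electron density sampled at the grid points around a group of atoms, here the ligand. The sketch below computes only that correlation. It assumes the two density lists have already been extracted at matching grid points by crystallographic software, and the numbers are invented.

```python
import math


def rscc(rho_obs, rho_calc):
    """Real space correlation coefficient between two density samples.

    Both inputs are density values at the same grid points around the ligand.
    """
    n = len(rho_obs)
    if n != len(rho_calc) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_obs = sum(rho_obs) / n
    mean_calc = sum(rho_calc) / n
    cov = sum((o - mean_obs) * (c - mean_calc) for o, c in zip(rho_obs, rho_calc))
    var_obs = sum((o - mean_obs) ** 2 for o in rho_obs)
    var_calc = sum((c - mean_calc) ** 2 for c in rho_calc)
    denom = math.sqrt(var_obs * var_calc)
    if denom == 0:
        raise ValueError("density is constant; correlation is undefined")
    return cov / denom


# Invented density values at six grid points
observed = [0.1, 0.4, 0.9, 1.2, 0.8, 0.3]
calculated = [0.2, 0.5, 1.0, 1.1, 0.7, 0.2]
print(round(rscc(observed, calculated), 3))
```

A value near 1 means the modelled ligand accounts well for the experimental density. As the slide notes, RSCC is one of several metrics and is not used alone.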

Thank you. Any comments or questions?
