acs towards a gold standard database
Towards a Gold Standard: Improving the Quality of Public Domain Chemistry Databases
Antony J. Williams (1), Sean Ekins (2)
(1) Royal Society of Chemistry, Wake Forest, NC 27587; (2) Collaborations in Chemistry, Fuquay Varina, NC 27526
The future: crowdsourced drug discovery
Williams et al., Drug Discovery World, Winter 2009
Safety data
Toxicity data
Blogs and Wikis
Property databases
Experimental results
Scientific publications
Compound aggregators
Open Notebook Science
Metabolic pathway databases
Encyclopedic articles (Wikipedia)
Chemistry structures are proliferating on the web
Users take them at face value
They SHOULD NOT!!!
Immense quantities of scientific information are contained in thousands of databases
Progress can, however, be inhibited by errors in these databases, which have downstream effects when the data are reused.
http://bit.ly/zWGaps
What is the Structure of Vitamin K1?
What Mechanisms Do We Have to Alert the Community?
Email database owner and hope for a response
Blog it
Tony has been blogging about database quality for years and nobody was listening, other than the people at PubChem
For some databases, when he blogged they listened and would edit!
Tweet it
Dec 2010: We felt something had to be said definitively about structure quality
Publish it – wrote to Science, Nature and then PLoS Computational Biology
http://bit.ly/qtJF2f
Perhaps the phone?
April 27, 2011: Then came the NPC Browser
Science Translational Medicine 2011
But wait, hold on – did anyone peer review the database??
Database released, and within days...
A quick analysis of structure quality revealed...
Hundreds of errors found in structures
Williams and Ekins,
DDT, 16: 747-750 (2011)
NPC Browser http://tripod.nih.gov/npc/
Neomycin in NPC Browser http://tripod.nih.gov/npc/
Neomycin In ChemSpider
How many contribute to clean-up?
Less than a dozen contributors to data
The majority are project members
The crowd is small…
This is the same for all cheminformatics crowd-based efforts
What Mechanisms Do We Have to Alert the Community? Publishing is too slow
Tony blogged April 28th, 1 day after release: http://bit.ly/jn8wLC
I blogged April 29th, suggesting the need for a gold standard database: http://bit.ly/lXHInG
After more extensive analysis we sent a manuscript to Science Translational Medicine: rejected
Drug Discovery Today: accepted 8 months after we pointed out the issue, even before the NPC Browser release
Responses from Community and NCGC
Comments on initial blog
NCGC added a disclaimer which I blogged about May 23rd
http://bit.ly/m4Tx2b
Sept 8th, 2011
Email from Tudor Oprea (cc'ed to 60 others)
He has also been pointing out database errors for years
Followed by one from Chris Austin offering to meet us
Several individuals thanked us for the alert
Towards a Gold Standard: Regarding Quality in Public Domain Chemistry Databases and Approaches to Improving the Situation. Antony J. Williams, Sean Ekins and Valery Tkachenko, Drug Discovery Today, In Press 2012
More Extensive Analysis and solutions
More analysis of NPC browser errors
“analysis of the NPC browser ‘HTS amenable compounds’ subset of data for 7600 compounds identified fundamental errors in stereochemistry, valency issues and charge imbalances in a few minutes work using a rudimentary software tool”
Analysis of other chemistry databases and errors
Other types of databases and errors
Offered solutions
Substructure  | # of Hits | # of Correct Hits | No stereochemistry | Incomplete stereochemistry | Complete but incorrect stereochemistry
Gonane        | 34        | 5                 | 8                  | 21                         | 0
Gon-4-ene     | 55        | 12                | 3                  | 33                         | 7
Gon-1,4-diene | 60        | 17                | 10                 | 23                         | 10
Data Errors in the NPC Browser: Analysis of Steroids
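To make the steroid numbers easier to read, here is a small Python sketch that turns the counts from the table into percentages of correct hits. The counts are copied from the slide; the names STEROID_ROWS and percent_correct are made up for this illustration and are not from the paper.

```python
# Counts transcribed from the steroid analysis table (NPC Browser substructure hits).
# Tuple order: hits, correct, no stereo, incomplete stereo, complete but incorrect stereo.
STEROID_ROWS = {
    "Gonane": (34, 5, 8, 21, 0),
    "Gon-4-ene": (55, 12, 3, 33, 7),
    "Gon-1,4-diene": (60, 17, 10, 23, 10),
}


def percent_correct(rows):
    """Return the percentage of correct hits per substructure, rounded to 0.1."""
    result = {}
    for name, (hits, correct, no_stereo, incomplete, incorrect) in rows.items():
        # Sanity check: the four categories should add up to the number of hits.
        if correct + no_stereo + incomplete + incorrect != hits:
            raise ValueError(f"categories do not sum to hits for {name}")
        result[name] = round(100 * correct / hits, 1)
    return result


if __name__ == "__main__":
    for name, pct in percent_correct(STEROID_ROWS).items():
        print(f"{name}: {pct}% of hits correct")
```

On these counts, fewer than a third of the hits in any of the three substructure classes are fully correct.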
Why this matters to us and YOU, the CROWD
What You Might Not Know About Chemistry Databases On The Internet
Data-sharing between open databases is cyclic
This can proliferate errors in the “Linked Data”
Public Domain Databases
Our databases are a mess…
Non-curated databases are proliferating errors
We source and deposit data between databases
Original sources of errors hard to determine
Curation is time-consuming and challenging
Molecule Data Quality Impacts
in silico drug discovery
vast ligand and protein–protein interaction databases
develop computational models
global mapping of pharmacological space
drug-target networks of approved drugs
prediction of off-target effects
Different types of databases and errors
Bayer paper on target validation: 2/3 of papers did not live up to claims
MDL Drug Data Report (MDDR), errors
Errors in clinical research databases vary from 2.3% to 26.9%
Multicenter analysis by MS-based proteomics identified generic problems in databases when characterizing proteins: search engines could not distinguish different identifiers, and many algorithms calculated molecular weight incorrectly
One database had between 2.1% and 13.6% of annotated Pfam hits unjustified
Ligand–protein X-ray structures can also have errors, with far-reaching consequences
Solutions
Structure Validation and Standardization
Curation
Annotation
Structure filters
Incorrect valency, atom labels, aromatic bonds, stereochemistry, salts, duplication
Structure standardization guidelines
Provided by the FDA (Substance Registration System Unique Ingredient Identifier (UNII): http://www.fda.gov/ForIndustry/DataStandards/SubstanceRegistrationSystem-UniqueIngredientIdentifierUNII/default.htm)
Need a record of molecule provenance
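As a rough illustration of what the structure filters listed above might look like, here is a minimal Python sketch covering three of them: atom labels, valency and charge balance, plus a duplicate check. It is a toy under stated assumptions, not the tool used in the analysis: the molecule representation (lists of atoms and bonds), the small valence table, and the function names check_structure and find_duplicates are all hypothetical. A real pipeline would use a cheminformatics toolkit and would also handle aromaticity, stereochemistry and salts.

```python
from collections import Counter

# Toy table of maximum valences for neutral atoms (a deliberate simplification).
NEUTRAL_MAX_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1}


def max_valence(element, charge):
    """Crude charge adjustment: N+ allows 4 bonds, O- allows 1, C+/C- allow 3."""
    base = NEUTRAL_MAX_VALENCE[element]
    if element in ("N", "O"):
        return base + charge
    return base - abs(charge)


def check_structure(atoms, bonds):
    """atoms: list of (element, formal_charge); bonds: list of (i, j, order).

    Returns a list of human-readable problems (empty if none were found).
    """
    problems = []
    bond_order_sum = Counter()
    for i, j, order in bonds:
        bond_order_sum[i] += order
        bond_order_sum[j] += order
    for idx, (element, charge) in enumerate(atoms):
        if element not in NEUTRAL_MAX_VALENCE:
            problems.append(f"atom {idx}: unrecognised label {element!r}")
            continue
        if bond_order_sum[idx] > max_valence(element, charge):
            problems.append(f"atom {idx}: valence exceeded for {element}")
    # Assumption: each record is expected to be charge-neutral overall.
    net_charge = sum(charge for _, charge in atoms)
    if net_charge != 0:
        problems.append(f"net charge {net_charge:+d} is not balanced")
    return problems


def find_duplicates(structure_keys):
    """Return keys that occur more than once (keys are any canonical identifier)."""
    counts = Counter(structure_keys)
    return sorted(key for key, n in counts.items() if n > 1)


if __name__ == "__main__":
    # A carbon drawn with five hydrogens: should be flagged as a valency error.
    atoms = [("C", 0)] + [("H", 0)] * 5
    bonds = [(0, k, 1) for k in range(1, 6)]
    print(check_structure(atoms, bonds))
```

Even checks this simple are the kind that a rudimentary tool can run over thousands of records in minutes; the harder part is deciding what the correct structure should be.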
Can we track databases and their quality? www.scidbs.com
RSC Introduces “Validation Service”
Scidbs.com
DB logo
Type of DB
Contact
Owner
Website
License
Curation etc
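One way to picture a database entry with the fields listed above is as a simple record type. The Python sketch below is hypothetical: the class name DatabaseRecord, its field names and the missing_fields helper are illustrative assumptions, not the actual Scidbs.com schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DatabaseRecord:
    """Hypothetical description of one chemistry database, mirroring the slide's fields."""
    name: str
    logo_url: Optional[str] = None
    db_type: Optional[str] = None
    contact: Optional[str] = None
    owner: Optional[str] = None
    website: Optional[str] = None
    license: Optional[str] = None
    curation: Optional[str] = None


def missing_fields(record):
    """List the fields that are still empty, so gaps in the description are visible."""
    return [f.name for f in fields(record) if getattr(record, f.name) in (None, "")]


if __name__ == "__main__":
    # Placeholder values only; no real database is being described here.
    record = DatabaseRecord(name="Example DB", website="http://example.org")
    print(missing_fields(record))
```

Listing what is missing (license, curation policy, contact) is itself useful, since those gaps are what make it hard to judge whether a database can be trusted.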
Data should be: Free from structure errors
Free from data errors
Free from experimental errors
Are we asking too much? Is it even possible??
When we raise our hands we are ignored
Our scientific community needs to wake up
Yet when we alert others:
Today the NPC Browser has fewer errors... so do ALL databases!
More people are aware of molecule quality online. Trust is earned, not just granted!
The future database user is more informed
Peer reviewers test the databases that are in manuscripts
NIH checks databases before release!
COLLABORATION between government DBs. PLEASE!!!
We need minimal compound database standards (MCDS)
Tomorrow
Acknowledgement
We thank the paper reviewers and blog commenters for their constructive comments
Chris Lipinski
This work was unfunded
(but was the right thing to do!)
www.scidbs.com