chemspider – an online database and registration system linking the web
DESCRIPTION
This presentation was given at the EBI Meeting in Cambridge on October 11th 2011 regarding Chemical Registration and Standardization.
TRANSCRIPT
ChemSpider – An Online Database and Registration System Linking the Web
Antony Williams and Valery Tkachenko
EBI Chemical Registry Systems Workshop, October 2011
www.chemspider.com
ChemSpider…
>26 million unique molecules from >400 sources
.NET, SQL Server and GGA Indigo toolkit
Multiple Open Source components – Jmol, JSpecView, Balloon, OpenBabel, MediaWiki
Slices of data are Open but the entire data collection is not Open
Crowdsourced depositions and curations
Uses InChIs for navigating and linking the web
Vancomycin
Search Molecular SKELETON
Search Full Molecule
Full Skeleton Search: 104 Hits
Full Molecule Search: 4 Hits
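One way to picture the difference between the two searches, as a minimal sketch only: a standard InChIKey has a 14-character first block hashed from the connectivity (the skeleton), while the second block covers the remaining layers such as stereochemistry. Matching on the first block alone behaves like a skeleton search; matching the whole key behaves like a full-molecule search. The record IDs and keys below are made-up placeholders, and nothing here is claimed to be how ChemSpider implements its search.

```python
# Hypothetical records: (record_id, InChIKey). The keys are placeholders with
# the right shape (14-char block, 10-char block, 1-char block), not real hashes.
RECORDS = [
    (1, "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),
    (2, "AAAAAAAAAAAAAA-CCCCCCCCSA-N"),  # same skeleton, different stereo block
    (3, "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),  # same full key as record 1
    (4, "DDDDDDDDDDDDDD-BBBBBBBBSA-N"),  # different skeleton
]


def skeleton_block(inchikey):
    """Return the first (connectivity) block of an InChIKey."""
    return inchikey.split("-", 1)[0]


def full_molecule_search(query_key, records):
    """IDs of records whose whole InChIKey equals the query key."""
    return [rid for rid, key in records if key == query_key]


def skeleton_search(query_key, records):
    """IDs of records whose connectivity block equals that of the query key."""
    target = skeleton_block(query_key)
    return [rid for rid, key in records if skeleton_block(key) == target]


query = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
full_hits = full_molecule_search(query, RECORDS)
skeleton_hits = skeleton_search(query, RECORDS)
print(len(skeleton_hits), "skeleton hits;", len(full_hits), "full-molecule hits")
```

The skeleton search always returns at least as many hits as the full-molecule search, which is the pattern seen in the Vancomycin hit counts.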
ChemSpider…
>26 million unique molecules from >400 sources
.NET, SQL Server and GGA Indigo toolkit
Multiple Open Source components – Jmol, JSpecView, Balloon, OpenBabel, MediaWiki
Slices of data are Open but the entire data collection is not Open
Crowdsourced depositions and curations
Uses InChIs for navigating and linking the web
Uses Names for navigating and linking the web
I want to know about “Vincristine”
If all algorithms work then everything on the page is correct by default except the name-structure relationship!
Vincristine: Identifiers and Properties
Vincristine: Vendors and Sources
Linked by Structure
Vincristine: Patents
Linked by Name
Vincristine: Articles
Linked by Name
ORIGINAL ChemSpider
“Create a system for linking and navigating databases on the web”
Use the power of InChI, and the proliferation of InChIs in databases, to make connections
Developed on .NET and SQL Server for speed of implementation and existing skill sets
Seeded with PubChem database of 10.5M chemicals and expanded using other sources to 20M
How do we build it?
We deal in Molfiles or SDF files – with coordinates
Deposit anything that has an InChI – we support what InChI can handle, good and bad
Standardization based on “InChI standardization”
InChIs aggregate (certain) tautomers
We link out to external sites using their IDs
InChIs – both on ChemSpider
Downsides of InChI
InChI was a moving target (multiple versions) but overall worked as planned.
Good for small molecules – but no polymers, issues with inorganics, organometallics, imperfect stereochemistry. ChemSpider is “small molecules”
InChI used as the “deduplicator” – FIRST version of a compound into the database becomes THE structure to deduplicate against…
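The "first in wins" behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not ChemSpider's actual registration code: the `deduplicate` function, the record fields (`inchi`, `molfile`, `source`) and the molfile placeholder strings are all hypothetical. Only the two InChI strings (ethanol and methane) are real.

```python
def deduplicate(depositions):
    """Keep the first structure seen for each InChI.

    Later depositions with the same InChI do not replace the stored
    structure; they only add their source to the existing entry.
    """
    registry = {}  # InChI -> {"molfile": ..., "sources": [...]}
    for dep in depositions:
        entry = registry.get(dep["inchi"])
        if entry is None:
            registry[dep["inchi"]] = {
                "molfile": dep["molfile"],
                "sources": [dep["source"]],
            }
        else:
            entry["sources"].append(dep["source"])
    return registry


ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANE = "InChI=1S/CH4/h1H4"

# Hypothetical depositions; the molfile values stand in for real Molfile text.
depositions = [
    {"inchi": ETHANOL, "molfile": "layout from source A", "source": "A"},
    {"inchi": ETHANOL, "molfile": "layout from source B", "source": "B"},
    {"inchi": METHANE, "molfile": "layout from source B", "source": "B"},
]

registry = deduplicate(depositions)
print(registry[ETHANOL]["molfile"], registry[ETHANOL]["sources"])
```

In this sketch any drawing or layout problem in the first deposited Molfile stays with the record, whatever the quality of later depositions.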
Side Effects of InChI Usage
SMILES by comparison…
Side Effects of InChI Usage
Standardization Issues
Depiction based on molfile
Downsides of Overall Approach
Meshing data together based on InChIs worked for simple molecules
2D layout errors inherited or limited by algorithm
Complex molecules that are meant to be the same thing were NOT deduplicated. Compounds differing by one stereocenter, named the same, meant to be the same, are not the same
Yohimbine
Originally 15 compounds “called” Yohimbine
54 Skeletons for Yohimbine
ChemSpider as an Aggregator
ChemSpider has inherited many errors and continues to inherit them, but we are far more careful now, with pre-filtering
Cannot deposit chemicals without an InChI
Deprecated compounds remain deprecated
Curated name-structure relationships do NOT remove the related structure
If Taxol is removed from 20 asserted “incorrect structures” those compounds remain in the database
Chemistry Databases on the Internet
Some public databases are “trusted” as primary sources
Trust is granted without investigation or understanding of the content
What do we know about some of the online resources?
PHYSPROP Database
The freely downloadable database under the EPI Suite prediction software
Very Basic filters suggest data quality issues
The Stereochemistry challenge
12500 chemicals with “missed” stereo
Searches on ChemSpider
Most searches are text-based: people searching for information about known chemicals
Creating accurate name-structure dictionaries is critical
NIST Webbook
PubChem
NPC Browser http://tripod.nih.gov/npc/
Synonyms on PubChem
1,3-DICHLORO-PROPAN-2-ONE
(2R,3R)-Butanediol bis(methanesulfonate)
Ethyl-1-propenyl ether, mixture of cis and trans
PSS-[2-[(Chloromethyl)phenyl]ethyl]-Heptaisobutyl substituted
1-Chlorobenzylethyl-3,5,7,9,11,13,15-heptaisobutylpentacyclo [9.5.1.1(3,9).1(5,15).1(7,13)]octasiloxane
Data Proliferation
What is meant by a name?
Choose a Starting Point
“The First 10”
What is getting into Our Databases?
Large aggregators are inheriting junk data
Data HAS proliferated from ChemSpider through PubChem – in process of deprecating and redepositing
A lot of data is for chemicals that will never exist (probably)
Standardization of Patent Data???
WYSIWYG compounds
Text Mining Chemical Name Errors
“DPA”
All aggregators suffer dilution!
Structures have timelines
Name-Structure Dictionaries…
Depiction for Humans
Human Depiction versus Algorithms
Identifier Dictionaries
Reciprocal curation processes…share curation with each other.
If a database has a compound already then use InChIKeys to match “suggested” validation against the compound.
A series of “added” and “removed” synonyms against InChIKeys for matching.
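A feed of “added” and “removed” synonyms keyed by InChIKey could be applied along these lines. This is a sketch under assumptions: the feed layout `(inchikey, action, name)`, the function `apply_synonym_feed` and the sample names are hypothetical, not a published ChemSpider format. The ethanol InChIKey is real; the second key is a placeholder.

```python
def apply_synonym_feed(synonyms, feed):
    """Apply (inchikey, action, name) records to a dict of InChIKey -> set of names.

    Records for compounds not already in the database are not applied;
    they are returned so that they can be reviewed separately.
    """
    skipped = []
    for inchikey, action, name in feed:
        if inchikey not in synonyms:
            skipped.append((inchikey, action, name))
        elif action == "added":
            synonyms[inchikey].add(name)
        elif action == "removed":
            synonyms[inchikey].discard(name)
        else:
            raise ValueError(f"unknown action: {action!r}")
    return skipped


ETHANOL_KEY = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"
UNKNOWN_KEY = "ZZZZZZZZZZZZZZ-UHFFFAOYSA-N"  # placeholder, not a real compound

# Local database with one wrong synonym attached to ethanol.
synonyms = {ETHANOL_KEY: {"ethanol", "methanol"}}

feed = [
    (ETHANOL_KEY, "added", "ethyl alcohol"),
    (ETHANOL_KEY, "removed", "methanol"),
    (UNKNOWN_KEY, "added", "mystery name"),
]

skipped = apply_synonym_feed(synonyms, feed)
print(sorted(synonyms[ETHANOL_KEY]), skipped)
```

Keying the feed on InChIKey means two databases can exchange curation decisions without sharing internal record IDs.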
Proof of Concept Data Curation Sharing
Structure Validation using feed
Look for approved synonyms
Compare feed InChIKey with database InChIKey
If different, flag for inspection
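The three validation steps above can be sketched as a small function. The names (`validate_against_feed`, `DB_NAMES`, `FEED`), the case-insensitive name lookup and all keys are assumptions made for illustration; the keys are shape-correct placeholders, not real InChIKeys.

```python
def validate_against_feed(db_names, feed):
    """Flag names whose InChIKey in the database differs from the feed.

    db_names: dict of lower-cased name -> InChIKey held in the local database.
    feed: list of (approved_synonym, InChIKey) pairs from a curation partner.
    """
    flagged = []
    for synonym, feed_key in feed:
        db_key = db_names.get(synonym.lower())  # look for the approved synonym
        if db_key is None:
            continue  # name not in the database, nothing to compare
        if db_key != feed_key:  # compare feed InChIKey with database InChIKey
            flagged.append((synonym, db_key, feed_key))  # flag for inspection
    return flagged


DB_NAMES = {
    "compound x": "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "compound y": "DDDDDDDDDDDDDD-EEEEEEEESA-N",
}
FEED = [
    ("Compound X", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),  # agrees with the database
    ("Compound Y", "DDDDDDDDDDDDDD-FFFFFFFFSA-N"),  # same skeleton, differs
    ("Compound Z", "GGGGGGGGGGGGGG-HHHHHHHHSA-N"),  # not in the database
]

flagged = validate_against_feed(DB_NAMES, FEED)
for name, db_key, feed_key in flagged:
    print(f"inspect {name}: database {db_key} vs feed {feed_key}")
```

Flagged entries are only queued for a human to inspect; the sketch deliberately does not overwrite either side.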
Open PHACTS : partnership between European Community and EFPIA
Freely accessible for knowledge discovery and verification.
Data on small molecules
Pharmacological profiles
Pharmacokinetics
ADMET data
Biological targets and pathways
Proprietary and public data sources.
Adopting Modified FDA Rules
As already used by ChEMBL…
Nitro groups
Salt and Ionic Bonds
Ammonium salts
Parent and Child
Chemical entities reduced to primary component plus relationships:
salt forms
solvates
combinations
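A parent-child data model of this kind might look like the sketch below. It is only one possible shape, with assumed names throughout (`Relation`, `Entity`, `add_child`, `children_of_type`) and generic compound names rather than real records; the talk does not specify the model.

```python
from dataclasses import dataclass, field
from enum import Enum


class Relation(Enum):
    """Relationship types named on the slide."""
    SALT_FORM = "salt form"
    SOLVATE = "solvate"
    COMBINATION = "combination"


@dataclass
class Entity:
    """A chemical entity; a parent holds (relation, child) links."""
    name: str
    children: list = field(default_factory=list)

    def add_child(self, relation, child):
        self.children.append((relation, child))

    def children_of_type(self, relation):
        return [child.name for rel, child in self.children if rel is relation]


# Hypothetical example: one primary component with two related forms.
parent = Entity("compound X")
parent.add_child(Relation.SALT_FORM, Entity("compound X hydrochloride"))
parent.add_child(Relation.SOLVATE, Entity("compound X monohydrate"))

print(parent.children_of_type(Relation.SALT_FORM))
```

Storing the relationship type on the link rather than on the child leaves room for one entity to relate to several parents, for example as a component of a combination.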
ChemSpider Standardization
Entire ChemSpider database will be standardized using modified FDA rule set
Original Molfiles will be standardized and all properties (predicted properties, SMILES, InChIs, Names) will all be regenerated
Standardization procedures automatically applied to all future depositions
Project Status
Standardization pipelining process initiated
Rule implementation and checking – iterative work with Open PHACTS pharma members
Data model development to support parent-child relationships
In dialog with the FDA about latest form of recommendations
Conclusions
ChemSpider has an important role in quality data
Crowdsourced deposition, validation and curation works but low engagement to date
Standardization of our entire backfile is necessary
Designing the standardization processes with input from pharma and general chemists is necessary
Acknowledgments
The ChemSpider team
Our data providers, depositors, collaborators and curators
Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)
Thank you
Email: [email protected]
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams