ChemSpider – An Online Database and Registration System Linking the Web

Antony Williams and Valery Tkachenko

EBI Chemical Registry Systems Workshop, October 2011

Upload: orcid-0000-0002-2668-4821

Post on 10-May-2015

DESCRIPTION

This presentation on chemical registration and standardization was given at the EBI meeting in Cambridge on October 11th, 2011.

TRANSCRIPT

Page 1: ChemSpider – An Online Database and  Registration System Linking the Web

ChemSpider – An Online Database and Registration System Linking the Web

Antony Williams and Valery Tkachenko

EBI Chemical Registry Systems Workshop, October 2011

Page 2: ChemSpider – An Online Database and  Registration System Linking the Web

www.chemspider.com

Page 3: ChemSpider – An Online Database and  Registration System Linking the Web

ChemSpider…

>26 million unique molecules from >400 sources

.NET, SQL Server and GGA Indigo toolkit

Multiple Open Source components: Jmol, JSpecView, Balloon, OpenBabel, MediaWiki

Slices of data are Open but the entire data collection is not Open

Crowdsourced depositions and curations

Uses InChIs for navigating and linking the web

Page 4: ChemSpider – An Online Database and  Registration System Linking the Web

Vancomycin

Page 5: ChemSpider – An Online Database and  Registration System Linking the Web

Vancomycin

Search Molecular SKELETON

Search Full Molecule

Page 6: ChemSpider – An Online Database and  Registration System Linking the Web

Full Skeleton Search: 104 Hits

Page 7: ChemSpider – An Online Database and  Registration System Linking the Web

Full Molecule Search: 4 Hits
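
The slides do not show how the two searches are implemented. One plausible reading, assuming both are keyed on standard InChIKeys: the first 14-character block of an InChIKey hashes the connectivity (the skeleton), while the second block covers the remaining layers such as stereochemistry and isotopes. The sketch below uses a hypothetical in-memory `records` list in place of the database; the alanine keys are quoted from memory and should be treated as illustrative.

```python
# Hypothetical records: (name, standard InChIKey).
records = [
    ("L-alanine", "QNAYBMKLOCPYGJ-REOHSZPYSA-N"),
    ("D-alanine", "QNAYBMKLOCPYGJ-UWTATZPHSA-N"),
    ("DL-alanine", "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"),
    ("ethanol", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),
]


def skeleton_block(inchikey):
    """Return the first (connectivity) block of an InChIKey."""
    return inchikey.split("-")[0]


def full_molecule_search(query_key, records):
    """Match on the complete InChIKey: same skeleton AND same stereo/isotope layers."""
    return [name for name, key in records if key == query_key]


def skeleton_search(query_key, records):
    """Match on the first block only: any stereo or isotopic variant is a hit."""
    block = skeleton_block(query_key)
    return [name for name, key in records if skeleton_block(key) == block]


query = "QNAYBMKLOCPYGJ-REOHSZPYSA-N"
print(full_molecule_search(query, records))
print(skeleton_search(query, records))
```

Under this reading, a skeleton search always returns at least as many hits as a full-molecule search, which matches the 104 versus 4 hits shown for vancomycin.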

Page 8: ChemSpider – An Online Database and  Registration System Linking the Web

ChemSpider…

>26 million unique molecules from >400 sources

.NET, SQL Server and GGA Indigo toolkit

Multiple Open Source components: Jmol, JSpecView, Balloon, OpenBabel, MediaWiki

Slices of data are Open but the entire data collection is not Open

Crowdsourced depositions and curations

Uses InChIs for navigating and linking the web

Uses Names for navigating and linking the web

Page 9: ChemSpider – An Online Database and  Registration System Linking the Web

I want to know about “Vincristine”

If all algorithms work then everything on the page is correct by default except the name-structure relationship!

Page 10: ChemSpider – An Online Database and  Registration System Linking the Web

Vincristine: Identifiers and Properties

Page 11: ChemSpider – An Online Database and  Registration System Linking the Web

Vincristine: Vendors and Sources

Linked by Structure

Page 12: ChemSpider – An Online Database and  Registration System Linking the Web

Vincristine: Patents

Linked by Name

Page 13: ChemSpider – An Online Database and  Registration System Linking the Web

Vincristine: Articles

Linked by Name

Page 14: ChemSpider – An Online Database and  Registration System Linking the Web

ORIGINAL ChemSpider

“Create a system for linking and navigating databases on the web”

Use the power of InChI, and the proliferation of InChIs in databases, to make connections

Developed on .NET and SQL Server for speed of implementation and existing skill sets

Seeded with PubChem database of 10.5M chemicals and expanded using other sources to 20M

Page 15: ChemSpider – An Online Database and  Registration System Linking the Web

How do we build it?

We deal in Molfiles or SDF files – with coordinates

Deposit anything that has an InChI – we support what InChI can handle, good and bad

Standardization based on “InChI standardization”

InChIs aggregate (certain) tautomers

We link out to external sites using their IDs

Page 16: ChemSpider – An Online Database and  Registration System Linking the Web

InChIs – both on ChemSpider

Page 17: ChemSpider – An Online Database and  Registration System Linking the Web

Downsides of InChI

InChI was a moving target (multiple versions) but overall worked as planned.

Good for small molecules – but no polymers, issues with inorganics, organometallics, imperfect stereochemistry. ChemSpider is “small molecules”

InChI used as the “deduplicator” – FIRST version of a compound into the database becomes THE structure to deduplicate against…
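
The first-in-wins behaviour described above can be sketched as follows. This is a minimal illustration, not ChemSpider's actual code: the `Registry` class, its method names, and the record layout are all hypothetical, and the molfiles are placeholder strings.

```python
class Registry:
    """Toy registry that deduplicates depositions on their InChI string."""

    def __init__(self):
        # InChI -> record created by the FIRST deposition of that InChI
        self._by_inchi = {}

    def deposit(self, inchi, molfile, source):
        """Return (record, is_new). Later depositions never replace the structure."""
        if not inchi:
            raise ValueError("cannot deposit a chemical without an InChI")
        if inchi in self._by_inchi:
            existing = self._by_inchi[inchi]
            # Only the provenance grows; the original molfile (and so the
            # depiction and any layout errors in it) is kept as-is.
            existing["sources"].append(source)
            return existing, False
        record = {"inchi": inchi, "molfile": molfile, "sources": [source]}
        self._by_inchi[inchi] = record
        return record, True


registry = Registry()
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
first, first_is_new = registry.deposit(ethanol, "molfile-from-source-A", "source A")
second, second_is_new = registry.deposit(ethanol, "molfile-from-source-B", "source B")
print(second["molfile"], second["sources"])
```

The design consequence is the one the slide warns about: whatever quality the first deposited structure has, good or bad, is what every later deposition is merged into.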

Page 18: ChemSpider – An Online Database and  Registration System Linking the Web

Side Effects of InChI Usage

Page 19: ChemSpider – An Online Database and  Registration System Linking the Web

SMILES by comparison…

Page 20: ChemSpider – An Online Database and  Registration System Linking the Web

Side Effects of InChI Usage

Page 21: ChemSpider – An Online Database and  Registration System Linking the Web

Standardization Issues

Depiction based on molfile

Page 22: ChemSpider – An Online Database and  Registration System Linking the Web

Downsides of Overall Approach

Meshing data together based on InChIs worked for simple molecules

2D layout errors inherited or limited by algorithm

Complex molecules that are meant to be the same thing were NOT deduplicated. Compounds differing by one stereocenter, named the same, meant to be the same, are not the same

Page 23: ChemSpider – An Online Database and  Registration System Linking the Web

Yohimbine

Page 24: ChemSpider – An Online Database and  Registration System Linking the Web

Originally 15 compounds “called” Yohimbine

54 Skeletons for Yohimbine

Page 25: ChemSpider – An Online Database and  Registration System Linking the Web

ChemSpider as an Aggregator

ChemSpider has inherited many errors and continues to do so, but we are now far more careful and pre-filter incoming data

Cannot deposit chemicals without an InChI

Deprecated compounds remain deprecated

Curated name-structure relationships do NOT remove the related structure

If Taxol is removed from 20 asserted “incorrect structures” those compounds remain in the database

Page 26: ChemSpider – An Online Database and  Registration System Linking the Web

Chemistry Databases on the Internet

Some public databases are “trusted” as primary sources

Trust is granted without investigation or understanding of the content

What do we know about some of the online resources?

Page 27: ChemSpider – An Online Database and  Registration System Linking the Web

PHYSPROP Database

The freely downloadable database under the EPI Suite prediction software

Very Basic filters suggest data quality issues

Page 28: ChemSpider – An Online Database and  Registration System Linking the Web

The Stereochemistry Challenge

12,500 chemicals with “missed” stereo

Page 29: ChemSpider – An Online Database and  Registration System Linking the Web

Searches on ChemSpider

Most searches are text-based: people searching for information about known chemicals

Creating accurate name-structure dictionaries is critical
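
A first check on any name-structure dictionary is to find names that resolve to more than one structure. The sketch below assumes the dictionary is available as (name, structure key) pairs; the function name and the `STRUCT-n` keys are hypothetical placeholders, not real identifiers.

```python
from collections import defaultdict


def ambiguous_names(name_structure_pairs):
    """Return {normalized name: sorted structure keys} for names with >1 structure."""
    structures_by_name = defaultdict(set)
    for name, key in name_structure_pairs:
        # Normalize case and surrounding whitespace so trivial variants collapse.
        structures_by_name[name.strip().lower()].add(key)
    return {
        name: sorted(keys)
        for name, keys in structures_by_name.items()
        if len(keys) > 1
    }


pairs = [
    ("Yohimbine", "STRUCT-1"),
    ("yohimbine ", "STRUCT-2"),
    ("Ethanol", "STRUCT-3"),
]
print(ambiguous_names(pairs))
```

Names flagged this way are candidates for manual curation; the check cannot say which of the competing structures is the correct one.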

Page 30: ChemSpider – An Online Database and  Registration System Linking the Web

NIST Webbook

Page 31: ChemSpider – An Online Database and  Registration System Linking the Web

PubChem

Page 32: ChemSpider – An Online Database and  Registration System Linking the Web

NPC Browser http://tripod.nih.gov/npc/

Page 33: ChemSpider – An Online Database and  Registration System Linking the Web

NPC Browser http://tripod.nih.gov/npc/

Page 34: ChemSpider – An Online Database and  Registration System Linking the Web
Page 35: ChemSpider – An Online Database and  Registration System Linking the Web

NPC Browser http://tripod.nih.gov/npc/

Page 36: ChemSpider – An Online Database and  Registration System Linking the Web

Synonyms on PubChem

1,3-DICHLORO-PROPAN-2-ONE

(2R,3R)-Butanediol bis(methanesulfonate)

Ethyl-1-propenyl ether, mixture of cis and trans

PSS-[2-[(Chloromethyl)phenyl]ethyl]-Heptaisobutyl substituted

1-Chlorobenzylethyl-3,5,7,9,11,13,15-heptaisobutylpentacyclo [9.5.1.1(3,9).1(5,15).1(7,13)]octasiloxane

Page 37: ChemSpider – An Online Database and  Registration System Linking the Web

Synonyms on PubChem

Page 38: ChemSpider – An Online Database and  Registration System Linking the Web

Data Proliferation

Page 39: ChemSpider – An Online Database and  Registration System Linking the Web
Page 40: ChemSpider – An Online Database and  Registration System Linking the Web
Page 41: ChemSpider – An Online Database and  Registration System Linking the Web
Page 42: ChemSpider – An Online Database and  Registration System Linking the Web
Page 43: ChemSpider – An Online Database and  Registration System Linking the Web
Page 44: ChemSpider – An Online Database and  Registration System Linking the Web

What is meant by a name?

Page 45: ChemSpider – An Online Database and  Registration System Linking the Web

Choose a Starting Point

Page 46: ChemSpider – An Online Database and  Registration System Linking the Web

“The First 10”

Page 47: ChemSpider – An Online Database and  Registration System Linking the Web

What is getting into Our Databases?

Large aggregators are inheriting junk data

Data HAS proliferated from ChemSpider through PubChem – in process of deprecating and redepositing

A lot of data is for chemicals that will never exist (probably)

Page 48: ChemSpider – An Online Database and  Registration System Linking the Web

Standardization of Patent Data???

Page 49: ChemSpider – An Online Database and  Registration System Linking the Web

Standardization of Patent Data???

Page 50: ChemSpider – An Online Database and  Registration System Linking the Web

WYSIWYG compounds

Page 51: ChemSpider – An Online Database and  Registration System Linking the Web

WYSIWYG compounds

Page 52: ChemSpider – An Online Database and  Registration System Linking the Web

Text Mining Chemical Name Errors

Page 53: ChemSpider – An Online Database and  Registration System Linking the Web

“DPA”

Page 54: ChemSpider – An Online Database and  Registration System Linking the Web

All aggregators suffer dilution!

Page 55: ChemSpider – An Online Database and  Registration System Linking the Web

Structures have timelines

Page 56: ChemSpider – An Online Database and  Registration System Linking the Web

Name-Structure Dictionaries…

Page 57: ChemSpider – An Online Database and  Registration System Linking the Web

Depiction for Humans

Page 58: ChemSpider – An Online Database and  Registration System Linking the Web

Human Depiction versus Algorithms

Page 59: ChemSpider – An Online Database and  Registration System Linking the Web

Human Depiction versus Algorithms

Page 60: ChemSpider – An Online Database and  Registration System Linking the Web

Identifier Dictionaries

Reciprocal curation processes…share curation with each other.

If a database has a compound already then use InChIKeys to match “suggested” validation against the compound.

A series of “added” and “removed” synonyms against InChIKeys for matching.
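
A shared curation feed of this kind could be consumed roughly as sketched below. The feed format, a list of `(action, InChIKey, synonym)` tuples, and the function name are assumptions made for illustration; the slides do not define a format.

```python
def apply_synonym_feed(dictionary, feed):
    """Apply "added"/"removed" synonym events to a local identifier dictionary.

    dictionary: InChIKey -> set of synonyms held locally (modified in place)
    feed: iterable of (action, inchikey, synonym) tuples
    Returns the events whose InChIKey is not present locally.
    """
    skipped = []
    for action, inchikey, synonym in feed:
        if inchikey not in dictionary:
            # Curation only applies to compounds the database already has.
            skipped.append((action, inchikey, synonym))
            continue
        if action == "added":
            dictionary[inchikey].add(synonym)
        elif action == "removed":
            dictionary[inchikey].discard(synonym)
        else:
            raise ValueError(f"unknown action: {action!r}")
    return skipped


local = {"LFQSCWFLJHTTHZ-UHFFFAOYSA-N": {"ethanol", "methyl alcohol"}}
feed = [
    ("removed", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "methyl alcohol"),
    ("added", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "ethyl alcohol"),
    ("added", "UHOVQNZJYSORNB-UHFFFAOYSA-N", "benzene"),
]
skipped = apply_synonym_feed(local, feed)
print(sorted(local["LFQSCWFLJHTTHZ-UHFFFAOYSA-N"]), skipped)
```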

Page 61: ChemSpider – An Online Database and  Registration System Linking the Web

Proof of Concept Data Curation Sharing

Page 62: ChemSpider – An Online Database and  Registration System Linking the Web

Structure Validation using feed

Look for approved synonyms

Compare feed InChIKey with database InChIKey

If different, flag for inspection
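
The three steps above can be sketched as follows, assuming the local database can be queried as a synonym-to-InChIKey mapping and the feed supplies approved (synonym, InChIKey) pairs. Both data shapes and the function name are hypothetical; the alanine keys are quoted from memory and are illustrative.

```python
def validate_against_feed(database, feed):
    """Flag local records whose InChIKey disagrees with the curated feed.

    database: synonym -> InChIKey held locally
    feed: iterable of (approved synonym, InChIKey) pairs from a curation partner
    """
    flagged = []
    for synonym, feed_key in feed:
        local_key = database.get(synonym)   # step 1: look for the approved synonym
        if local_key is None:
            continue                        # nothing local to validate
        if local_key != feed_key:           # step 2: compare the two InChIKeys
            flagged.append(                 # step 3: flag for human inspection
                {"synonym": synonym, "database_key": local_key, "feed_key": feed_key}
            )
    return flagged


database = {
    "ethanol": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    "L-alanine": "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",  # stereo missing locally
}
feed = [
    ("ethanol", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),
    ("L-alanine", "QNAYBMKLOCPYGJ-REOHSZPYSA-N"),
    ("benzene", "UHOVQNZJYSORNB-UHFFFAOYSA-N"),
]
flagged = validate_against_feed(database, feed)
print(flagged)
```

A mismatch is only flagged, never auto-corrected, since either side of the comparison may hold the wrong structure.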

Page 63: ChemSpider – An Online Database and  Registration System Linking the Web

Open PHACTS: partnership between European Community and EFPIA

Freely accessible for knowledge discovery and verification.

Data on small molecules

Pharmacological profiles

Pharmacokinetics

ADMET data

Biological targets and pathways

Proprietary and public data sources.

Page 64: ChemSpider – An Online Database and  Registration System Linking the Web

Adopting Modified FDA Rules

As already used by ChEMBL…

Page 65: ChemSpider – An Online Database and  Registration System Linking the Web

Nitro groups

Page 66: ChemSpider – An Online Database and  Registration System Linking the Web

Salt and Ionic Bonds

Page 67: ChemSpider – An Online Database and  Registration System Linking the Web

Ammonium salts

Page 68: ChemSpider – An Online Database and  Registration System Linking the Web

Parent and Child

Chemical entities reduced to primary component plus relationships: salt forms, solvates, combinations
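
A parent-child data model along these lines might look like the sketch below. The class, field names, and relationship vocabulary are hypothetical (the slide lists only the three relationship kinds), and the data model was still under development at the time of the talk.

```python
from dataclasses import dataclass, field

# Relationship kinds named on the slide.
RELATIONSHIPS = {"salt form", "solvate", "combination"}


@dataclass
class ChemicalEntity:
    name: str
    # Each child is stored as (relationship, ChemicalEntity).
    children: list = field(default_factory=list)

    def add_child(self, relationship, child):
        """Attach a child entity under one of the known relationship kinds."""
        if relationship not in RELATIONSHIPS:
            raise ValueError(f"unknown relationship: {relationship!r}")
        self.children.append((relationship, child))
        return child

    def children_of_type(self, relationship):
        """Names of all children linked by the given relationship."""
        return [c.name for rel, c in self.children if rel == relationship]


parent = ChemicalEntity("vincristine")
parent.add_child("salt form", ChemicalEntity("vincristine sulfate"))
print(parent.children_of_type("salt form"))
```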

Page 69: ChemSpider – An Online Database and  Registration System Linking the Web

ChemSpider Standardization

Entire ChemSpider database will be standardized using modified FDA rule set

Original Molfiles will be standardized and all properties (predicted properties, SMILES, InChIs, Names) will be regenerated

Standardization procedures automatically applied to all future depositions

Page 70: ChemSpider – An Online Database and  Registration System Linking the Web

Project Status

Standardization pipelining process initiated

Rule implementation and checking: iterative work with Open PHACTS pharma members

Data model development to support parent-child relationships

In dialog with the FDA about latest form of recommendations

Page 71: ChemSpider – An Online Database and  Registration System Linking the Web

Conclusions

ChemSpider has an important role in quality data

Crowdsourced deposition, validation and curation works but low engagement to date

Standardization of our entire backfile is necessary

Designing the standardization processes with input from pharma and general chemists is necessary

Page 72: ChemSpider – An Online Database and  Registration System Linking the Web

Acknowledgments

The ChemSpider team

Our data providers, depositors, collaborators and curators

Software providers – OpenEye, ChemDoodle, ACD/Labs, GGA Software, Open Source (Jmol, JSpecView, OpenBabel)

Page 73: ChemSpider – An Online Database and  Registration System Linking the Web

Thank you

Email: [email protected]

Twitter: ChemConnector

Blog: www.chemspider.com/blog

Personal Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams