Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Post on 22-Apr-2015

Category:

Technology

DESCRIPTION

There is an increasing availability of free and open access resources for scientists to use on the internet. Coupled with the increasing availability of Open Source software tools, we are in the middle of a revolution in data availability and in the tools to manipulate these data. However, freedom costs, and in many cases the cost is quality. ChemSpider is a free-access website for chemists built with the intention of providing a structure-centric community for chemists. As an aggregator of chemistry-related information from many sources (at present over 21.5 million unique chemical entities from over 150 separate data sources), ChemSpider has taken on the task of both robotically and manually curating publicly available data sources. This presentation will provide an overview of the issue of quality in many chemistry-related databases, approaches to cleaning up the data, and how a curated platform can become the centralized hub for sourcing information about chemical entities. This includes experimental and predicted properties, analytical data, publications, suppliers and integrated databases. I will detail three efforts: 1) the curation of chemistry on Wikipedia; 2) an examination of structure integrity on the FDA DailyMed website, a website of medication content and labeling as found in medication package inserts; 3) recognizing chemical names in documents and providing a platform for structure-based searching of Open Access chemistry literature.

TRANSCRIPT

Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry

Antony Williams

Bio-IT World 2009

Building a Structure Centric Community for Chemists

Linked Data Cloud

Chemistry on the Internet

Much of the information online is User Beware!

The quality of information is “diverse”

Technologies can “link and connect” information, but validation and curation are key to providing quality

The Linked Data web is of less value when the data linked are “wrong”

Quality Costs

Chemical Abstracts Service (CAS), a division of the ACS, is the “Gold Standard” in chemistry-related information

101 years of content, $260 million revenue (2006), >40 million substances and 60 million sequences

But online…

What is “wrong”?

A platform for:

Data deposition, curation and annotation

Supporting Open Notebook Science efforts

Chemistry document mark-up with ChemMantis

The Open Access ChemSpider Journal of Chemistry

Search Cholesterol

Complex Data and Information

Online Data

Many websites host structure-based information

Question quality!!!

Wikipedia, C&E News, PubChem

C&E News (from ACS)

Does one stereocenter matter?

Vancomycin

Who will curate?

PubChem is not resourced to clean these errors

How would you clean such a large dataset?

Vancomycin

ChemSpider: 1 compound – 3 days

Question Everything: www.dhmo.org

DailyMed

“DailyMed provides high quality information about marketed drugs. This information includes FDA approved labels (package inserts).”

The FDA’s DailyMed

Structures on DailyMed: Poor Representations

Structures on DailyMed: Lack of Stereochemistry

Incorrect Structures: Scanning (?) Issues

Incorrect Structures

Does it Matter?

Does it matter to the consumer that the structures are wrong? No…what matters is that what is in the bottle is the right medication!

To make DailyMed structure searchable it DOES matter

To data mine DailyMed it matters

To mark up DailyMed it matters

Collaborative Knowledge Management for Chemists

Wikipedia Links to DrugBank

Taxol on PubChem

Taxol on DailyMed

The InChI Identifier

Multiple Layers

Source: Unofficial InChI FAQ page
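To make the layer idea concrete, here is a minimal Python sketch that splits an InChI string into its slash-separated layers. The example string is the standard InChI for L-alanine, chosen for illustration and not taken from the slides, and the function name `split_inchi_layers` is hypothetical. The sketch assumes the second field is the formula and that each layer prefix occurs at most once, which does not hold for every InChI (fixed-hydrogen and isotopic sections can repeat prefixes).

```python
def split_inchi_layers(inchi):
    """Split an InChI string into a dict of layers keyed by prefix letter.

    Simplifying assumptions (not part of the InChI specification):
    the second field is the formula, and no layer prefix is repeated.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    if len(parts) < 2:
        raise ValueError("InChI has no layers after the version field")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        if part:  # skip empty fields such as a trailing slash
            layers[part[0]] = part[1:]
    return layers


# Standard InChI for L-alanine (illustrative input, not from the talk).
L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"

if __name__ == "__main__":
    for name, value in split_inchi_layers(L_ALANINE).items():
        print(f"{name:8} {value}")
```

The `c` and `h` entries describe connectivity and hydrogens, while `t`, `m` and `s` carry the tetrahedral stereochemistry. Two records that agree on the formula, `c` and `h` but differ in the stereo entries describe the same skeleton with different (or missing) stereo information.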

InChIStrings Hash to InChIKeys

InChIs for Taxol

Back to Taxol

DrugBank: RCINICONZNJXQF-CLDWUXIMDD

ChEBI: RCINICONZNJXQF-GXKQXQCDDN

Wikipedia: RCINICONZNJXQF-MZXODVADBJ

Which one is correct???

InChIKeys for Taxol

DrugBank: RCINICONZNJXQF-CLDWUXIMDD

ChEBI: RCINICONZNJXQF-GXKQXQCDDN

Wikipedia: RCINICONZNJXQF-MZXODVADBJ

ChEBI and Wikipedia are the SAME structure

DrugBank is a DIFFERENT structure – ONE stereocenter
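The comparison on this slide can be sketched in a few lines of Python. The first block of an InChIKey is a hash of the connectivity part of the InChI, so keys that share it describe the same skeleton; the remaining block hashes everything else, including stereochemistry. The function names below are hypothetical, and the keys are copied from the transcript as-is. Note that a string comparison only shows that the second blocks differ, not why: deciding which records depict the same structure needs the full InChI strings, which the transcript does not include.

```python
# InChIKeys for Taxol as transcribed from the slide.
TAXOL_KEYS = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}


def split_inchikey(key):
    """Return (first_block, remainder) of an InChIKey split at the first hyphen."""
    first, _, rest = key.partition("-")
    return first, rest


def compare_keys(keys):
    """Report whether all sources share a skeleton and group them by remainder."""
    first_blocks = {split_inchikey(key)[0] for key in keys.values()}
    same_skeleton = len(first_blocks) == 1
    groups = {}
    for source, key in keys.items():
        groups.setdefault(split_inchikey(key)[1], []).append(source)
    return same_skeleton, groups


if __name__ == "__main__":
    same_skeleton, groups = compare_keys(TAXOL_KEYS)
    print("Same connectivity block:", same_skeleton)
    for remainder, sources in groups.items():
        print(remainder, "<-", ", ".join(sources))
```

A check like this is cheap enough to run across a whole aggregated database, flagging records where sources agree on connectivity but disagree elsewhere as candidates for manual curation.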

The InChI Resolver

Coming Soon…Linked Articles

How bad can it get??? And who is right????

ChemMantis

Chemical Markup And Nomenclature Transformation Integrated System – ChemMantis

A platform for entity extraction for chemistry documents, markup and integration to online information sources – Wikipedia, ChemSpider, Entrez…

Web-based submission, markup and publishing platform now hosting the ChemSpider Journal of Chemistry

ChemMantis Markup

Enable Electronic Articles…

Structures are the language of chemistry

Show structures to chemists and search/link from there…

Species Markup

Dictionaries are Easily Enhanced

Copy-Paste into appropriate Entity Dictionary

Impacts all future markups

Expanding knowledgebases of information

Linked out to rich sources of information
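The slides do not show how ChemMantis implements its dictionaries, so the following is only a generic sketch of dictionary-based entity markup in Python, under stated assumptions: the dictionary contents, the `[name|type]` output format and the function name `mark_up` are all hypothetical. It illustrates why adding one entry to a dictionary affects every later markup run.

```python
import re

# Hypothetical entity dictionary: lowercase name -> entity type.
ENTITY_DICTIONARY = {
    "cholesterol": "chemical",
    "taxol": "chemical",
}


def mark_up(text, dictionary):
    """Wrap every dictionary term found in text as [term|type].

    Matching is case-insensitive and on whole words; longer names are
    tried first so a short name cannot shadow a longer one.
    """
    if not dictionary:
        return text
    lookup = {name.lower(): kind for name, kind in dictionary.items()}
    names = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )

    def tag(match):
        found = match.group(0)
        return f"[{found}|{lookup[found.lower()]}]"

    return pattern.sub(tag, text)


if __name__ == "__main__":
    sentence = "Taxol and vancomycin were compared with cholesterol."
    print(mark_up(sentence, ENTITY_DICTIONARY))
    # Enhancing the dictionary changes all future markups.
    ENTITY_DICTIONARY["vancomycin"] = "chemical"
    print(mark_up(sentence, ENTITY_DICTIONARY))
```

In a real system each marked-up term would link out to a structure record or other information source instead of carrying a bare type label.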

Build Dictionaries, Ontologies Next

Outlinks…

Publishers and Document Mark-Up

ChemSpider Everywhere

Linked from Wikipedia

Linked from Open Notebook Science sites using EMBED

Linked from Blogs using Structure/Spectra EMBED

Integrated into structure drawing packages such as ACD/ChemSketch, Symyx Draw, Open Source applets

Integrated to software offerings from Thermo, Waters, Agilent, Bruker

ChemSpider Everywhere: Embed Functionality (like YouTube)

ChemSpider Everywhere: www.spectralgame.com

ChemSpider Everywhere: Crowdsourced Curation of Spectra

ChemSpider Everywhere: RSC Compounds

ChemSpider Everywhere: Nature Chemistry

Nature Chemistry articles are annotated to identify all of the chemical compounds mentioned throughout the text.

Those compounds are linked out to other information resources including PubChem and ChemSpider.

ChemSpider Everywhere: ChemMobi

Structure RSS Feeds with InChIs

Acknowledgments

Richard Kidd, Royal Society of Chemistry

Jason Wilde, Nature Publishing Group

Martin Walker and the Wikipedia Chemistry team

Microsoft – Rudy Potenzone

Symyx – Keith Taylor and James Jack

SureChem – Nicko Goncharoff

Spectral game - Andrew Lang and Jean-Claude Bradley

“The InChI team and Advisory Group”

Conclusions

www.chemspider.com

www.chemspider.com/journal

InChIs and Internet Chemistry

http://inchis.chemspider.com
