cleaning up chemistry for the pharma industry: delivering a flexible platform for interrogating the...
DESCRIPTION
The original abstract is below. Ultimately this work was not funded by Microsoft and we did not deliver it on SharePoint Server. Nevertheless, we DO depend heavily on Microsoft technology to do what we do, specifically .NET and SQL Server.
DailyMed is a website hosted by the FDA providing access to information about marketed drugs. This information includes FDA-approved labels (package inserts) and provides a standard, comprehensive, up-to-date look-up and download resource of medication content and labeling as found in medication package inserts. Intending to enhance the dataset by making it searchable by chemical structure and substructure, we found that the data contained numerous chemistry errors. We have therefore used a combination of text mining and automated and manual curation to improve the quality of the data set. In so doing we have also made querying of the data more flexible. Specifically, we have used Microsoft SharePoint technology to create a portal allowing both text-based and structure-based querying. We will report on the advantages such an approach delivers in terms of flexible interrogation of DailyMed.
TRANSCRIPT
Cleaning up chemistry for the pharma industry
Delivering a flexible platform for interrogating the FDA DailyMed website
Antony Williams
Building a Structure Centric Community for Chemists
Vision
Use the DailyMed FDA website data as a data source
Use Microsoft SharePoint Server as a platform to demonstrate integrated ChemSpider technology
Deliver some “Chemistry” on the BioIT Alliance website
Get funding to support ChemSpider
Reality
Chemistry on the Internet
The Internet can clearly benefit chemists searching for information
Much of the information is based on assertions: User Beware!
The quality of the available information varies. How does the user know what is and is not “correct”?
www.chemspider.com
21.5 million structures, 150 data sources and growing
Flexible searching
Deposition of structures, spectra, crowdsourced curation and annotation
Complex Data and Information
21.5 Million Structures, Varied Sources
There are “bad structures” on the database
There are bad structure-name pairs
Users have associated “incorrect information”
Data Curation
Caution! Question Everything!
Question Everything
www.dhmo.org
Vancomycin
Who will curate?
PubChem is not resourced to clean these errors
How would you clean such a large dataset?
Vancomycin
ChemSpider: 1 compound – 3 days
DailyMed
“DailyMed provides high quality information about marketed drugs. This information includes FDA approved labels (package inserts).”
The FDA’s DailyMed
The Intention
Make DailyMed structure searchable via ChemSpider
In the process curate data on ChemSpider and validate data on DailyMed
Improve the curation platform on ChemSpider
Perform markup of DailyMed articles to enhance the reading experience
Structures on DailyMed: Poor Representations
Structures on DailyMed: Lack of Stereochemistry
Incorrect Structures: Simply Wrong
Incorrect Structures: Scanning (?) Issues
Incorrect Structures: “HOO-BOY!!!!!”
Does it Matter?
Does it matter to the consumer that the structures are wrong? No…what matters is that what is in the bottle is the right medication!
To make DailyMed structure searchable it DOES matter
To data mine DailyMed it matters
To mark up DailyMed it matters
The Process
Import all XML files from DailyMed
Use “Home built” entity extraction based on our dictionary of chemical names
Articles online here: http://www.chemspider.com/DailyMed.aspx
Example Article: http://www.chemspider.com/DailyMedArticle.aspx?id=2
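The two steps on this slide (import the XML, then match names against a dictionary) can be illustrated with a minimal sketch. It is not the ChemSpider implementation, which the slides do not show. It assumes a plain, un-namespaced XML label and a small dictionary mapping lowercase chemical names to made-up structure identifiers; the function names, the `NAMES` dictionary, the `S-0001` identifier and the sample label are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def label_text(xml_string):
    """Flatten an XML label into one whitespace-normalized string."""
    root = ET.fromstring(xml_string)
    return " ".join(" ".join(root.itertext()).split())


def build_matcher(names):
    """Compile one regex for all names, longest first so multi-word names win."""
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"(?<!\w)(" + alternation + r")(?!\w)", re.IGNORECASE)


def extract_entities(text, dictionary):
    """Return (matched text, structure id) pairs in order of appearance."""
    matcher = build_matcher(dictionary)
    return [(m.group(1), dictionary[m.group(1).lower()])
            for m in matcher.finditer(text)]


# Hypothetical dictionary: lowercase chemical name -> made-up structure id.
NAMES = {"tolazamide": "S-0001", "tolinase": "S-0001"}

# Invented stand-in for a label file; real DailyMed XML is far richer.
SAMPLE = ("<label><title>TOLINASE</title>"
          "<p>Tolinase tablets contain tolazamide.</p></label>")

hits = extract_entities(label_text(SAMPLE), NAMES)
for matched, structure_id in hits:
    print(matched, "->", structure_id)
```

Matching is case-insensitive and anchored on word boundaries, so a dictionary name embedded inside a longer word is not reported. Anything the dictionary does not contain is simply missed, which is why the quality of the curated dictionary matters so much in the slides that follow.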
State of the Data
Tolinase: DailyMed on ChemSpider
OTHER Mentioned Chemicals
One Name – Multiple Structures
NO Stereo | Full Stereo | Partial Stereo | Partial Stereo
Editing a Record
Do NOT deprecate record…remove association between name and chemical structure
Partial Stereochemistry
Loop of Assertions
Reduce to ONE structure – with full explicit stereo
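The reduction described above (several structures for one name, keep the one with full explicit stereo) can be sketched as a simple triage. This is an illustration only: it assumes each candidate record already carries a count of stereocenters and a count of those with defined stereo, which in practice would come from a cheminformatics toolkit. The `Candidate` class, its field names and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    structure_id: str   # hypothetical identifier
    stereocenters: int  # total stereocenters in the drawn structure
    defined: int        # how many of them carry explicit stereo


def stereo_status(candidate):
    """Classify a candidate using the labels from the slides."""
    if candidate.defined == 0:
        return "no stereo"
    if candidate.defined == candidate.stereocenters:
        return "full stereo"
    return "partial stereo"


def pick_full_stereo(candidates):
    """Return the single fully defined candidate, or None if a curator must decide."""
    full = [c for c in candidates if stereo_status(c) == "full stereo"]
    return full[0] if len(full) == 1 else None


# Invented counts for four structures sharing one name.
candidates = [
    Candidate("A", stereocenters=6, defined=0),
    Candidate("B", stereocenters=6, defined=6),
    Candidate("C", stereocenters=6, defined=4),
    Candidate("D", stereocenters=6, defined=2),
]

for c in candidates:
    print(c.structure_id, stereo_status(c))
print("keep:", pick_full_stereo(candidates))
```

The sketch only narrows the field. Having full stereo does not make a structure correct, so the surviving candidate still needs a human check against a trusted source.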
How bad can it get??? And who is right????
Name-Structure Pairs
Cleaning up the associations of names and structures is torturous and time-consuming
Decisions get made and can be challenged
Names are not “removed” …they are still on the database
Such a curated “dictionary” is very valuable
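One way to picture "not removed, still on the database, and open to challenge" is an association record that is switched off rather than deleted, with a history of who decided what. The sketch below is an assumption about how such a record could look, not a description of the ChemSpider schema; the `Association` class, its fields and the curator names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Association:
    name: str
    structure_id: str
    active: bool = True
    history: list = field(default_factory=list)  # (action, curator, reason)

    def retire(self, curator, reason):
        """Break the name-structure link without deleting either side."""
        self.active = False
        self.history.append(("retired", curator, reason))

    def restore(self, curator, reason):
        """Reverse an earlier decision after a successful challenge."""
        self.active = True
        self.history.append(("restored", curator, reason))


def structures_for(name, associations):
    """Structure ids currently linked to a name (case-insensitive)."""
    return [a.structure_id for a in associations
            if a.active and a.name.lower() == name.lower()]


links = [Association("vancomycin", "A"), Association("vancomycin", "B")]
links[0].retire("curator1", "structure lacks stereochemistry")
print(structures_for("Vancomycin", links))
print(links[0].history)
```

Because the record and its history survive, a later curator can see why a link was retired and restore it if the decision is challenged.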
ChemMantis
Chemical Markup And Nomenclature Transformation Integrated System – ChemMantis
A platform for entity extraction for chemistry documents, markup and integration to online information sources – Wikipedia, ChemSpider, Entrez…
Web-based submission, markup and publishing platform now hosting the ChemSpider Journal of Chemistry
Back to DailyMed
Quality of Structures!!!
ChemMantis Markup
Species Markup
Dictionaries are Easily Enhanced
Copy-Paste into appropriate Entity Dictionary
Impacts all future markups
Expanding knowledgebases of information
Linked out to rich sources of information
Outlinks…
Where To From Here?
The platform is built…it’s all eyeballs for curation now
As structure-identifier pairs are curated DailyMed will improve
The project is now on hold – no resources to continue
If We Had Our Way…
Convert every DailyMed Label to a ChemMantis marked up document
Use the XML segregation of the Tablet Labels to tag where chemicals are in the label
Allow data mining based on “where” in a label the chemicals are: drug-drug interactions, etc.
Markup and mine property data out of the labels using new dictionaries related to properties such as IC50 and toxicity
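Tagging "where" in a label a chemical appears could look roughly like the sketch below, which groups dictionary hits by the section they occur in. It assumes flat, un-namespaced `<section>` elements each holding a `<title>`; real label files are likely namespaced and nested and would need the tag lookups adjusted. The function name, the name set and the sample label text are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def chemicals_by_section(xml_string, names):
    """Map each section title to the set of dictionary names found in it."""
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(n) for n in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    found = defaultdict(set)
    for section in ET.fromstring(xml_string).iter("section"):
        title = section.findtext("title", default="(untitled)").strip()
        # itertext() also covers the title, so names in titles count too.
        body = " ".join(section.itertext())
        for match in pattern.finditer(body):
            found[title].add(match.group(1).lower())
    return dict(found)


# Invented stand-in for a label; section names and text are placeholders.
SAMPLE = """<label>
  <section><title>Description</title>
    <p>Tolinase tablets contain tolazamide.</p></section>
  <section><title>Drug Interactions</title>
    <p>Placeholder sentence that mentions tolazamide.</p></section>
</label>"""

result = chemicals_by_section(SAMPLE, {"tolazamide", "tolinase"})
for title, chemicals in result.items():
    print(title, sorted(chemicals))
```

With hits keyed by section, a query such as "which labels name a given chemical in their interactions section" becomes a lookup rather than a full-text search.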
Conclusions
The internet enables chemistry – and at a reduced cost
Question Quality! All online information is suspect
Crowdsourcing for expansion, curation and integration can both improve the quality of existing information and add new content
If the FDA doesn’t have responsibility for what is on Tablet Labels…who does? The answer is simply an assertion!
Interesting Sites
ChemSpider http://www.chemspider.com
ChemSpider Journal of Chemistry http://www.chemmantis.com
The InChI resolver http://inchis.chemspider.com (goes live at ACS Spring)
The ChemSpider blog http://www.chemspider.com/blog
Contact
[email protected]@chemspider.com