Cleaning up chemistry for the pharma industry: Delivering a flexible platform for interrogating the FDA DailyMed website

Antony Williams

Upload: antony-williams-chemconnector-orcid-0000-0002-2668-4821

Post on 17-Jun-2015

DESCRIPTION

The original abstract is below. Ultimately this work was not funded by Microsoft and we did not deliver it on SharePoint Server. Nevertheless, we DO depend heavily on Microsoft technology to do what we do: .NET and SQL Server specifically.

DailyMed is a website hosted by the FDA that provides access to information about marketed drugs. This information includes FDA-approved labels (package inserts) and provides a standard, comprehensive, up-to-date look-up and download resource of medication content and labeling as found in medication package inserts. While working to enhance the dataset by making it searchable by chemical structure and substructure, we determined that the data contained numerous chemistry errors. We have therefore used a combination of text mining and automated and manual curation to improve the quality of the data set. In so doing we have also made querying of the data more flexible. Specifically, we have used Microsoft SharePoint technology to create a portal allowing both text-based and structure-based querying. We will report on the advantages such an approach delivers in terms of flexible interrogation of DailyMed.

TRANSCRIPT

Page 1: Cleaning up chemistry for the pharma industry: delivering a flexible platform for interrogating the FDA DailyMed website

Cleaning up chemistry for the pharma industry

Delivering a flexible platform for interrogating the FDA DailyMed website

Antony Williams

Page 2: Vision

Building a Structure Centric Community for Chemists

Use the DailyMed FDA website data as a data source

Use Microsoft SharePoint Server as a platform to demonstrate integrated ChemSpider technology

Deliver some “Chemistry” on the BioIT Alliance website

Get funding to support ChemSpider

Page 3: Reality

Page 4: Chemistry on the Internet

The Internet can clearly benefit chemists searching for information

Much of the information is based on assertions and User Beware!

The Quality of information available is diverse and how does the user know what is and is not “correct”?

Page 5: www.chemspider.com

21.5 million structures, 150 data sources and growing

Flexible searching

Deposition of structures, spectra, crowdsourced curation and annotation

Page 6: Complex Data and Information

Page 7: 21.5 Million Structures, Varied Sources

There are “bad structures” on the database

There are bad structure-name pairs

Users have associated “incorrect information”

Page 8: Data Curation

Page 9: Caution! Question Everything!

Page 10: Question Everything

www.dhmo.org

Page 11: Vancomycin

Who will curate?

PubChem is not resourced to clean these errors

How would you clean such a large dataset?

Page 12: Vancomycin

ChemSpider: 1 compound – 3 days

Page 13: DailyMed

“DailyMed provides high quality information about marketed drugs. This information includes FDA approved labels (package inserts).”

Page 14: The FDA’s DailyMed

Page 15: The Intention

Make DailyMed structure searchable via ChemSpider

In the process curate data on ChemSpider and validate data on DailyMed

Improve the curation platform on ChemSpider

Perform markup of DailyMed articles to enhance the reading experience

Page 16: Structures on DailyMed

Poor Representations

Page 17: Structures on DailyMed

Lack of Stereochemistry

Page 18: Incorrect Structures

Simply Wrong

Page 19: Incorrect Structures

Scanning (?) Issues

Page 20: Incorrect Structures

“HOO-BOY!!!!!”

Page 21: Does it Matter?

Does it matter to the consumer that the structures are wrong? No…what matters is that what is in the bottle is the right medication!

To make DailyMed structure searchable it DOES matter

To data mine DailyMed it matters

To mark up DailyMed it matters

Page 22: The Process

Import all XML files from DailyMed

Use “Home built” entity extraction based on our dictionary of chemical names

Articles online here: http://www.chemspider.com/DailyMed.aspx

Example Article: http://www.chemspider.com/DailyMedArticle.aspx?id=2
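The two steps on this slide, importing label XML and running dictionary-based entity extraction, can be sketched roughly as follows. This is a minimal illustration and not the ChemSpider implementation: the XML layout, the sample text, the record identifiers, and the function names are all invented for the example, and real DailyMed label files are considerably more complex.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for one imported label file.
# The product name and text are invented for illustration.
SAMPLE_LABEL_XML = """<label>
  <title>ExampleDrug</title>
  <text>ExampleDrug tablets contain aspirin and caffeine.
  Inactive ingredients: sodium chloride.</text>
</label>"""

# Toy dictionary of chemical names: lower-cased name -> hypothetical record id.
NAME_DICTIONARY = {
    "aspirin": "rec-1",
    "caffeine": "rec-2",
    "sodium chloride": "rec-3",
    "sodium": "rec-4",
}


def build_pattern(names):
    """Compile one regex for all names, longest first so that
    'sodium chloride' is preferred over the shorter 'sodium'."""
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)


def extract_entities(xml_string, dictionary):
    """Return (matched text, record id) pairs in document order."""
    root = ET.fromstring(xml_string)
    # Flatten all text nodes and normalise whitespace before matching.
    text = " ".join("".join(root.itertext()).split())
    pattern = build_pattern(dictionary)
    return [(m.group(0), dictionary[m.group(0).lower()])
            for m in pattern.finditer(text)]


hits = extract_entities(SAMPLE_LABEL_XML, NAME_DICTIONARY)
for name, record_id in hits:
    print(name, "->", record_id)
```

A plain dictionary lookup like this is only as good as the name-structure pairs behind it, which is why the quality of the dictionary matters so much in the slides that follow.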

Page 23: State of the Data

Page 24: Tolinase: DailyMed on ChemSpider

Page 25: OTHER Mentioned Chemicals

Page 26: One Name – Multiple Structures

NO Stereo Full Stereo Partial Stereo Partial Stereo

Page 27: Editing a Record

Do NOT deprecate record…remove association between name and chemical structure

Page 28:

Page 29: Partial Stereochemistry

Page 30: Loop of Assertions

Reduce to ONE structure – with full explicit stereo
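As a rough illustration of the "no stereo / full stereo / partial stereo" distinction and of reducing one name to a single fully defined structure, the sketch below classifies candidate records by how many of their stereocentres are explicitly defined. The `Candidate` fields, record identifiers, and counts are hypothetical; a real system would derive them from the structures themselves, and the sketch deliberately refuses to choose when the data are ambiguous, since a fully drawn structure can still be the wrong one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    record_id: str       # hypothetical record identifier
    stereocentres: int   # stereocentres present in the drawn structure
    defined: int         # stereocentres with an explicit configuration


def stereo_status(c: Candidate) -> str:
    """Classify a candidate as 'none', 'partial', or 'full' stereo."""
    if c.stereocentres == 0:
        return "not applicable"
    if c.defined == 0:
        return "none"
    if c.defined == c.stereocentres:
        return "full"
    return "partial"


def single_full_stereo(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the one fully defined candidate, or None when zero or
    several exist and a human curator has to decide."""
    full = [c for c in candidates if stereo_status(c) == "full"]
    return full[0] if len(full) == 1 else None


# Four structures all carrying the same name (invented data).
candidates = [
    Candidate("rec-A", stereocentres=4, defined=0),
    Candidate("rec-B", stereocentres=4, defined=4),
    Candidate("rec-C", stereocentres=4, defined=2),
    Candidate("rec-D", stereocentres=4, defined=3),
]

statuses = [stereo_status(c) for c in candidates]
preferred = single_full_stereo(candidates)
print(statuses, preferred.record_id if preferred else "needs manual curation")
```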

Page 31: How bad can it get??? And who is right????

Page 32: Name-Structure Pairs

Cleaning up the associations of names and structures is torturous and time-consuming

Decisions get made and can be challenged

Names are not “removed” …they are still on the database

Such a curated “dictionary” is very valuable
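One way to picture a dictionary in which names are never deleted and every decision can be challenged is to store each name-structure pair with an active flag and a history of who asserted or retracted it. The sketch below is an assumption about how such a store could look, not a description of the ChemSpider database; the class, method names, and example data are all invented.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Association:
    name: str
    record_id: str
    active: bool = True
    # Each entry: (action, curator, reason)
    history: List[Tuple[str, str, str]] = field(default_factory=list)


class NameStructureDictionary:
    """Name-structure pairs that are retracted rather than deleted."""

    def __init__(self) -> None:
        self._pairs: Dict[Tuple[str, str], Association] = {}

    def assert_pair(self, name: str, record_id: str, curator: str, reason: str) -> None:
        # Asserting an existing pair again reinstates it (a challenged decision).
        key = (name.lower(), record_id)
        pair = self._pairs.setdefault(key, Association(name, record_id))
        pair.active = True
        pair.history.append(("asserted", curator, reason))

    def retract_pair(self, name: str, record_id: str, curator: str, reason: str) -> None:
        # The structure record and the name both stay; only the link is switched off.
        pair = self._pairs[(name.lower(), record_id)]
        pair.active = False
        pair.history.append(("retracted", curator, reason))

    def structures_for(self, name: str) -> List[str]:
        return sorted(p.record_id for (n, _), p in self._pairs.items()
                      if n == name.lower() and p.active)

    def history_for(self, name: str, record_id: str) -> List[Tuple[str, str, str]]:
        return list(self._pairs[(name.lower(), record_id)].history)


d = NameStructureDictionary()
d.assert_pair("examplamycin", "rec-1", "source-import", "deposited")
d.assert_pair("examplamycin", "rec-2", "source-import", "deposited")
d.retract_pair("examplamycin", "rec-2", "curator-1", "stereo incomplete")
after_retraction = d.structures_for("examplamycin")

# A later curator challenges the decision and reinstates the pair.
d.assert_pair("examplamycin", "rec-2", "curator-2", "confirmed against label")
after_challenge = d.structures_for("examplamycin")
print(after_retraction, after_challenge)
```

Keeping the retracted pairs and their history is what lets a later curator see why a link was removed before overturning it.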

Page 33: ChemMantis

Chemical Markup And Nomenclature Transformation Integrated System – ChemMantis

A platform for entity extraction for chemistry documents, markup and integration to online information sources – Wikipedia, ChemSpider, Entrez…

Web-based submission, markup and publishing platform now hosting the ChemSpider Journal of Chemistry

Page 34: Back to DailyMed

Page 35: Quality of Structures!!!

Page 36: ChemMantis Markup

Page 37: Species Markup

Page 38: Dictionaries are Easily Enhanced

Copy-Paste into appropriate Entity Dictionary

Impacts all future markups

Expanding knowledgebases of information

Linked out to rich sources of information
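The idea that adding a term to an entity dictionary changes every later markup can be shown with a small sketch. It is not ChemMantis code: the dictionary layout, the `example.org` link targets, and the `mark_up` function are assumptions made for the illustration.

```python
import html
import re

# Hypothetical entity dictionaries: entity type -> {term: outlink URL}.
ENTITY_DICTIONARIES = {
    "chemical": {"aspirin": "https://example.org/chemical/aspirin"},
    "species": {"escherichia coli": "https://example.org/species/e-coli"},
}


def mark_up(text, dictionaries):
    """Wrap every dictionary term found in text in an HTML link
    whose class is the entity type."""
    lookup = {}
    for entity_type, entries in dictionaries.items():
        for term, url in entries.items():
            lookup[term.lower()] = (entity_type, url)
    if not lookup:
        return html.escape(text)

    # Longest terms first so multi-word names win over their prefixes.
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )

    parts, pos = [], 0
    for m in pattern.finditer(text):
        parts.append(html.escape(text[pos:m.start()]))
        entity_type, url = lookup[m.group(0).lower()]
        parts.append('<a class="{}" href="{}">{}</a>'.format(
            entity_type, html.escape(url), html.escape(m.group(0))))
        pos = m.end()
    parts.append(html.escape(text[pos:]))
    return "".join(parts)


sentence = "Aspirin and caffeine were tested against Escherichia coli."
before = mark_up(sentence, ENTITY_DICTIONARIES)

# "Copy-paste" a new term into the chemical dictionary; later markups pick it up.
ENTITY_DICTIONARIES["chemical"]["caffeine"] = "https://example.org/chemical/caffeine"
after = mark_up(sentence, ENTITY_DICTIONARIES)
print(before)
print(after)
```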

Page 39:

Page 40: Outlinks…

Page 41: Where To From Here?

The platform is built…it’s all eyeballs for curation now

As structure-identifier pairs are curated DailyMed will improve

The project is now on hold – no resources to continue

Page 42: If We Had Our Way…

Convert every DailyMed Label to a ChemMantis marked up document

Use the XML segregation of the Tablet Labels to tag where chemicals are in the label

Allow data mining based on “where” in a label the chemicals are…drug-drug interactions etc.

Markup and mine property data out of the labels using new dictionaries related to properties such as IC50 and toxicity
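Tagging "where" in a label a chemical appears could look something like the sketch below, which indexes dictionary hits by the section they occur in. The XML here is an invented, simplified stand-in for a label with titled sections (real DailyMed files use a richer schema), and the chemical list and function name are assumptions for the example.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, simplified label with titled sections; the text is invented.
SAMPLE_LABEL_XML = """<label>
  <section>
    <title>Description</title>
    <text>ExampleDrug contains aspirin.</text>
  </section>
  <section>
    <title>Drug Interactions</title>
    <text>Aspirin may enhance the effect of warfarin.</text>
  </section>
</label>"""

CHEMICAL_NAMES = ["aspirin", "warfarin"]


def chemicals_by_section(xml_string, names):
    """Map each section title to the sorted chemical names found in it."""
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b",
        re.IGNORECASE,
    )
    index = defaultdict(set)
    for section in ET.fromstring(xml_string).iter("section"):
        title = (section.findtext("title") or "untitled").strip()
        text_el = section.find("text")
        body = "".join(text_el.itertext()) if text_el is not None else ""
        for m in pattern.finditer(body):
            index[title].add(m.group(0).lower())
    return {title: sorted(found) for title, found in index.items()}


index = chemicals_by_section(SAMPLE_LABEL_XML, CHEMICAL_NAMES)

# Chemicals co-mentioned in an interactions section are candidates for
# drug-drug interaction mining; they still need expert review.
interaction_candidates = index.get("Drug Interactions", [])
print(index)
print(interaction_candidates)
```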

Page 43: Conclusions

The internet enables chemistry – and at a reduced cost

Question Quality! All online information is suspect

Crowdsourcing for expansion, curation and integration can both improve the quality of existing information and add new content

If the FDA doesn’t have responsibility for what is on Tablet Labels…who does? The answer is simply an assertion!

Page 44: Interesting Sites

ChemSpider http://www.chemspider.com

ChemSpider Journal of Chemistry http://www.chemmantis.com

The InChI resolver http://inchis.chemspider.com (goes live at ACS Spring)

The ChemSpider blog http://www.chemspider.com/blog

Contact [email protected]