2012-10-08 practical semantics in the pharmaceutical industry - the open phacts project
DESCRIPTION
Keynote presentation given by Lee Harland at EKAW 2012 http://rd.springer.com/chapter/10.1007/978-3-642-33876-2_1
TRANSCRIPT
http://openphacts.org
@Open_PHACTS
Source: Nature Reviews Drug Discovery 11, 191-200 (March 2012) | doi:10.1038/nrd3681 Jack W. Scannell, Alex Blanckley, Helen Boldon & Brian Warrington
Source: Nature Reviews Drug Discovery 3, 711-716 (August 2004) | doi:10.1038/nrd1470 Ismail Kola & John Landis
harmful
harmful useless
http://www.medicalprogresstoday.com/spotlight/spotlight_indarchive.php?id=1039
Derek Lowe
http://www.ebi.ac.uk/Information/Brochures/pdf/EMBL-EBI%20Annual%20Report%202011.pdf
297,650
http://www.forbes.com/sites/matthewherper/2011/04/13/a-decade-in-drug-industry-layoffs/
¤ Built to primary use-case
¤ Tailored indexes
¤ Tailored GUIs
¤ Unique language & metadata
¤ Poor interoperability/integration
Literature HR Synthesis Portfolio SAR Docs Safety In vivo Etc
Information Tombs…
The Outside World
Precompetitive Informatics
Public Domain Drug Discovery Data: Pharma are accessing, processing, storing & re-processing
Literature, PubChem
Genbank, Patents, Databases
Downloads
Data Integration Data Analysis Firewalled Databases
Repeat @ each
company x
Lowering industry firewalls: pre-competitive informatics in drug discovery Nature Reviews Drug Discovery (2009) 8, 701-708 doi:10.1038/nrd2944
The Innovative Medicines Initiative
• EC funded public-private partnership for pharmaceutical research
• Focus on key problems – Efficacy, Safety, Education & Training, Knowledge Management
The Open PHACTS Project
• Create a semantic integration hub (“Open Pharmacological Space”)…
• Runs 2011-2014
• Deliver services to support on-going drug discovery programs in pharma and public domain
• Leading academics in semantics, pharmacology and informatics, driven by solid industry business requirements
• 23 academic partners, 8 pharmaceutical companies, 3 software SMEs
• Work split into clusters:
  • Technical Build
  • Scientific Drive
  • Community & Sustainability
Pathways
Pharmacological Activities
Biological Processes
Transcripts
Pathological Processes
Diseases
Genes
Proteins
Interactions
Clinical Drug Applications
Indications
Drugs
Compounds
Chemicals
Optimised To Business Questions
Number | Sum | Nr of 1 | Question
15 | 12 | 9 | All oxido-reductase inhibitors active <100nM in both human and mouse
18 | 14 | 8 | Given compound X, what is its predicted secondary pharmacology? What are the on and off-target safety concerns for a compound? What is the evidence and how reliable is that evidence (journal impact factor, KOL) for findings associated with a compound?
24 | 13 | 8 | Given a target find me all actives against that target. Find/predict polypharmacology of actives. Determine ADMET profile of actives.
32 | 13 | 8 | For a given interaction profile, give me compounds similar to it.
37 | 13 | 8 | The current Factor Xa lead series is characterised by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X.
38 | 13 | 8 | Retrieve all experimental and clinical data for a given list of compounds defined by their chemical structure (with options to match stereochemistry or not).
41 | 13 | 8 | A project is considering Protein Kinase C Alpha (PRKCA) as a target. What are all the compounds known to modulate the target directly? What are the compounds that may modulate the target directly? i.e. return all cmpds active in assays where the resolution is at least at the level of the target family (i.e. PKC) both from structured assay databases and the literature.
44 | 13 | 8 | Give me all active compounds on a given target with the relevant assay data
46 | 13 | 8 | Give me the compound(s) which hit most specifically the multiple targets in a given pathway (disease)
59 | 14 | 8 | Identify all known protein-protein interaction inhibitors
Goals
Platform GUI
Standards
Apps
API
A Precompetitive Knowledge Framework
Integration
Pharma Needs
Inputs
Sustainability Stability Security
Management / Governance
Data Mining Services/Algorithms
Mapping & Populating Architecture Interfaces
& Services
Content Structured & Unstructured
Vocabularies & Identifiers
(URIs)
Community KD Innovation
Data Cache (Virtuoso Triple Store)
Linked Data API (RDF/XML, TTL, JSON) Domain Specific Services
Open PHACTS Explorer 1st Gen Apps
Identity Resolution
Service (ConceptWiki)
Chemistry Normalisation & Q/C ChemSpider
Identifier Management
Service (BridgeDb+)
Partner Apps
Data Import
P12374 EC2.43.4
CS4532
“Adenosine receptor 2a”
Oct. 2012
Public Content Commercial
Public Ontologies
User Annotations
P12047 X31045
GB:29384
Issues
¤ Provenance
¤ Conflicting Authorities
¤ Management
¤ Transitivity
What's “equal” anyway?
Gleevec® = Imatinib Mesylate
Imatinib Mesylate YLMAHDNUQAMNNX-UHFFFAOYSA-N
Search “Gleevec”
PubChem Drugbank ChemSpider
Imatinib
Mesylate
Consequences…..
Ignore Salts?
NCX-911 Viagra ®
The 18th International Conference on Knowledge Engineering and Knowledge Management is concerned with all aspects of eliciting, acquiring, modeling and managing knowledge, and its role in the construction of knowledge-intensive systems and services for the semantic web, knowledge management, e-business, natural language processing, intelligent information integration, etc. The focus of the 18th edition of EKAW will be on "Knowledge Engineering and Knowledge Management that matters".
Dynamic Equality
§ Tuneable (same data, different questions) § Domain specific § User driven § Traceable
Strict Relaxed
Analysing Browsing
LinkSet#1 { chemspider:gleevec hasParent imatinib ... drugbank:gleevec exactMatch imatinib ... }
linkSet1 { chemspider:aspirin exactMatch chembl:aspirin … }
linkSet2 { imatinib_mesylate hasParent imatinib … }
linkSet3 { (+)Staurosporine enantiomer (-)Staurosporine … }
linkSet4 { vanillaEssence hasPart Vanillin … }
Profile P1 “Broad”
Profile P2 “Parents”
Profile P2 “Strict”
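The idea of profiles selecting which linksets count as "equal" can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the assignment of linksets to profiles, the profile keys, and the choice to treat every link as symmetric and transitive are all assumptions made for the example, and the identifiers are the informal ones from the slide.

```python
from collections import defaultdict

# Hypothetical linksets modelled on the slide; identifiers are illustrative only.
LINKSETS = {
    "linkSet1": [("chemspider:aspirin", "exactMatch", "chembl:aspirin")],
    "linkSet2": [("imatinib_mesylate", "hasParent", "imatinib")],
    "linkSet3": [("(+)Staurosporine", "enantiomer", "(-)Staurosporine")],
    "linkSet4": [("vanillaEssence", "hasPart", "Vanillin")],
}

# Assumed profile definitions: which linksets are followed under each profile.
PROFILES = {
    "strict": {"linkSet1"},
    "parents": {"linkSet1", "linkSet2"},
    "broad": {"linkSet1", "linkSet2", "linkSet3", "linkSet4"},
}


def equivalents(identifier, profile):
    """Return every identifier reachable from `identifier` via links the profile allows."""
    neighbours = defaultdict(set)
    for name in PROFILES[profile]:
        for subject, _predicate, obj in LINKSETS[name]:
            # Assumption: links are followed in both directions.
            neighbours[subject].add(obj)
            neighbours[obj].add(subject)
    seen, stack = {identifier}, [identifier]
    while stack:
        current = stack.pop()
        for nxt in neighbours[current] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen
```

Under these assumptions, asking for `imatinib_mesylate` with the strict profile returns only the salt itself, while the parents profile also returns `imatinib`: same data, different questions.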
The Identifier Mapping Service
Identity Mapping Service
(BridgeDB)
Query Expander
Service
cw:979b545d-f9a9 cheminf:logd ?logd
cw:979b545d-f9a9
?iri cheminf:logd ?logd . FILTER (?iri = cw:979b545d-f9a9 || ?iri = cs:2157 || ?iri = chembl:1280 || ?iri = db:db00945 || …) … }
For each line of SPARQL:
[cs:2157, chembl:1280,db:db00945]
parse
recognise
expand
transform
Profiles
Mappings
Q, P1 context GRAPH <http://rdf.chemspider.com> {
Q’
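The parse / recognise / expand / transform steps for a single triple pattern might look something like the sketch below. It is only an illustration under stated assumptions: `MAPPINGS` is a hypothetical in-memory stand-in for the identity mapping service, `expand_line` is an invented name, and only the simplest subject-predicate-object pattern is handled (no profile or GRAPH context).

```python
# Hypothetical mapping table standing in for the identity mapping service.
MAPPINGS = {
    "cw:979b545d-f9a9": ["cs:2157", "chembl:1280", "db:db00945"],
}


def expand_line(line, var="?iri"):
    """Rewrite one triple pattern so its subject matches every mapped identifier."""
    # parse: split a simple "subject predicate object ." pattern into its parts
    parts = line.strip().rstrip(".").split()
    if len(parts) != 3:
        return line
    subject, predicate, obj = parts
    # recognise: only subjects known to the mapping table are rewritten
    if subject not in MAPPINGS:
        return line
    # expand: the original identifier plus all of its mapped equivalents
    candidates = [subject] + MAPPINGS[subject]
    # transform: replace the subject with a variable constrained by a FILTER
    conditions = " || ".join(f"{var} = {c}" for c in candidates)
    return f"{var} {predicate} {obj} . FILTER ({conditions})"
```

Lines whose subject is unknown to the mapping table are passed through unchanged, so the expander can be run over every line of a query.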
Based on ve2 editor http://lab.linkeddata.deri.ie/ve2/
Shouldn’t an integration system be able to tell you exactly what it’s integrating?
## your dataset description
:myDS rdf:type void:Dataset ;
  foaf:homepage <http://example.org/> ;
  dcterms:title "Example Dataset"^^xsd:string ;
  dcterms:description """A simple dataset in RDF."""^^xsd:string ;
  pav:license <http://creativecommons.org/licenses/by-sa/3.0/> ;
  void:uriSpace "http://example.org/"^^xsd:string ;
  pav:retrievedFrom <http://exampledownload.com> ;
  pav:retrievedOn "2012-09-19"^^xsd:date ;
  pav:retrievedBy <http://some_web_id> ;
  pav:version "15.5"^^xsd:string ;
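One way to act on a description like this is to check it for the provenance properties before a dataset is loaded. The sketch below is a rough illustration only: it represents the description as a plain dictionary rather than parsed RDF, and treating every listed predicate as mandatory is an assumption of the example, not a stated project rule. The name `missing_provenance` is invented here.

```python
from datetime import date

# Predicates taken from the example description; treating all of them as
# mandatory is an assumption of this sketch.
EXPECTED = [
    "dcterms:title", "dcterms:description", "pav:license", "void:uriSpace",
    "pav:retrievedFrom", "pav:retrievedOn", "pav:retrievedBy", "pav:version",
]


def missing_provenance(description):
    """List expected predicates that are absent or empty, and flag a malformed date."""
    problems = [p for p in EXPECTED if not description.get(p)]
    retrieved_on = description.get("pav:retrievedOn")
    if retrieved_on:
        try:
            date.fromisoformat(retrieved_on)
        except ValueError:
            problems.append("pav:retrievedOn (not an ISO date)")
    return problems


# The slide's example values, flattened into a dictionary.
example = {
    "dcterms:title": "Example Dataset",
    "dcterms:description": "A simple dataset in RDF.",
    "pav:license": "http://creativecommons.org/licenses/by-sa/3.0/",
    "void:uriSpace": "http://example.org/",
    "pav:retrievedFrom": "http://exampledownload.com",
    "pav:retrievedOn": "2012-09-19",
    "pav:retrievedBy": "http://some_web_id",
    "pav:version": "15.5",
}
```

Calling `missing_provenance(example)` returns an empty list; dropping or emptying any of the listed predicates puts its name in the result.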
Provenance Everywhere
<inDataset href="http://rdf.chemspider.com/void.rdf#chemSpiderDataset" />
Nanopublications
Credit For Curation
Quality Assertions
ChemSpider Validation & Standardization Platform http://bit.ly/NZF5VB
QUDT (http://www.qudt.org/)
STANDARD_TYPE    UNIT_COUNT
---------------- ----------
AC50             7
Activity         421
EC50             39
IC50             46
ID50             42
Ki               23
Log IC50         4
Log Ki           7
Potency          11
log IC50         0
STANDARD_TYPE      STANDARD_UNITS     COUNT(*)
------------------ ------------------ --------
IC50               nM                 829448
IC50               ug.mL-1            41000
IC50                                  38521
IC50               ug/ml              2038
IC50               ug ml-1            509
IC50               mg kg-1            295
IC50               molar ratio        178
IC50               ug                 117
IC50               %                  113
IC50               uM well-1          52
IC50               p.p.m.             51
IC50               ppm                36
IC50               uM-1               25
IC50               nM kg-1            25
IC50               milliequivalent    22
IC50               kJ m-2             20
~ 100 units
>5000 types
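A first step towards taming the unit variants shown for IC50 is simply to collapse different spellings of the same unit. The sketch below illustrates that idea only: the `CANONICAL` table and its target spellings are assumptions made for the example (a real service would presumably map to QUDT terms), and no numeric conversion between units is attempted.

```python
import re

# Assumed canonical spellings for a few of the variants listed for IC50.
CANONICAL = {
    "nM": "nM",
    "ug/ml": "ug/mL",
    "ug.mL-1": "ug/mL",
    "ug ml-1": "ug/mL",
    "p.p.m.": "ppm",
    "ppm": "ppm",
}


def normalise_unit(raw):
    """Map a raw unit label to a canonical spelling, or None if unrecognised."""
    key = re.sub(r"\s+", " ", raw.strip())
    return CANONICAL.get(key)


def merge_counts(rows):
    """Sum record counts per canonical unit; unrecognised labels are kept verbatim."""
    totals = {}
    for raw, count in rows:
        unit = normalise_unit(raw) or raw
        totals[unit] = totals.get(unit, 0) + count
    return totals
```

Applied to the counts above, the three spellings of micrograms per millilitre merge into one bucket and the two spellings of parts per million into another, while labels such as `kJ m-2` pass through untouched for later review.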
Licensing
Linked Closed Data
Kick-Starting Sustainability
Apps
API
• Chem-Bio Navigator • Target Dossier • Polypharmacology Browser • Utopia Documents • Disease Maps • … more
Conclusions
¤ Project designed for the new drug discovery environment
¤ Timing with RDF/SW is good
¤ Companies eager to see whether it can really make a difference
¤ Challenge: Got to be better than state of the art (in 3 years!)
¤ Funding challenges are formidable
Acknowledgements
¤ Many members of the consortium who have contributed to data, use cases, funding, support, documentation, management
¤ EBI: John Overington, Anna Gaulton, Mark Davies
¤ Lundbeck: Sune Askjær
¤ Maastricht: Chris Evelo, Andra Waagmeester, Egon Willighagen
¤ Manchester: Carole Goble, Alasdair Gray, Christian Brenninkmeijer, Steve Pettifer, Ian Dunlop, Rishi Ramgolam, James Eales
¤ NBIC: Barend Mons, Kees Burger
¤ RSC: Antony Williams, Valery Tkachenko
¤ SIB: Christine Chichester
¤ VU: Frank van Harmelen, Paul Groth, Antonis Loizou
¤ OpenLink: Orri Erling, Yrjana Rankka, Hugh Williams
¤ Chem2Bio2RDF: David Wild, Bin Chen
backup
Find me the off-target activities of known cancer drugs whose primary target is a cell cycle regulatory kinase
ChEMBL DrugBank Gene Ontology Wikipathways
Uniprot
ChemSpider
UMLS
ConceptWiki
ChEBI
Connected Using Semantic Technology
Are these Interleukin 1A?
http://bio2rdf.org/uniprot:P01583
http://identifiers.org/uniprot/P01583
Human Interleukin 1A Protein
Human Interleukin 1A Protein
Entrez Gene: 3552, Ensembl:ENSG00000115008
1ITA (3D) 2ILA (3D) 2KKI (3D) 2L5X (3D) IL1A PDB Structures
Uniprot:P01582 Mouse Interleukin 1A
Human Interleukin 1A Gene
1076_at, 210118_s_at, 208200_at, 208200_at Affymetrix probes hIL1A
….etc
“There is lots of data we all use every day, and it’s not part of the web. I can see my bank statements on the web, and my photographs, and I can see my appointments in a calendar. But can I see my photos in a calendar to see what I was doing when I took them? Can I see bank statement lines in a calendar?
No. Why not? Because we don’t have a web of data. Because data is controlled by applications and each application keeps it to itself.”
Sir Tim Berners-Lee
Are These Vanilla?
Multiple Namespaces
Uniprot database ID: P26838
http://identifiers.org/uniprot/P26838
http://bio2rdf.org/uniprot:P26838
http://uniprot.bio2rdf.org/uniprot:P26838
http://chem2bio2rdf.org/uniprot/resource/P26838
http://purl.uniprot.org/uniprot/P26838
……
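The multiple-namespace problem can be illustrated by reducing each of these URIs to the accession it carries. The following is a minimal sketch, assuming only the five URI shapes listed above; the pattern list is deliberately not exhaustive and the function name `uniprot_accession` is invented for the example.

```python
import re

# URI shapes copied from the list above; other namespaces exist and are not covered.
PATTERNS = [
    re.compile(r"^http://identifiers\.org/uniprot/(\w+)$"),
    re.compile(r"^http://bio2rdf\.org/uniprot:(\w+)$"),
    re.compile(r"^http://uniprot\.bio2rdf\.org/uniprot:(\w+)$"),
    re.compile(r"^http://chem2bio2rdf\.org/uniprot/resource/(\w+)$"),
    re.compile(r"^http://purl\.uniprot\.org/uniprot/(\w+)$"),
]


def uniprot_accession(uri):
    """Return the Uniprot accession embedded in a known URI shape, else None."""
    for pattern in PATTERNS:
        match = pattern.match(uri)
        if match:
            return match.group(1)
    return None
```

All five URIs resolve to the same accession, `P26838`, which is the kind of equivalence an identifier mapping service has to record explicitly rather than rediscover by string matching.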
What’s this?
http://www.drugbank.ca/drugs/DB00203
/Viagra
Data sets