Building a Biomedical Knowledge Garden
Benjamin Good, Su Laboratory, Group Meeting, Dec. 2, 2016
Unstructured data: PubMed, Clinical Trials, etc.
NLP tools: SemRep, DeepDive, Implicitome, etc.
Knowledge Graph: SemmedDB, Literome, etc.
Applications: Semantic MEDLINE, BioGraph, etc.
Microtasks: Mark2Cure, AMT
Structured data: Gene Ontology, etc.
http://tinyurl.com/jbmn8mz
The Knowledge Garden Idea. Circa Jan. 2015.
The devil is in the details…
Unstructured data: PubMed, Clinical Trials, etc.
NLP tools: SemRep, DeepDive, Implicitome, etc.
Knowledge Graph: SemmedDB, Literome, etc.
Application: Semantic MEDLINE, BioGraph, etc.
Microtasks: Mark2Cure, AMT
Structured data: Gene Ontology, etc.
Reality, November 2016
Knowledge Graph: SemmedDB
Application: knowledge.bio
Microtasks: Mark2Cure, AMT
knowledge.bio: Explore all biomedical knowledge as a graph with edges connected back to supporting references
v2.5 demo
knowledge.bio – Data challenges
• V1 – V2.5
  • All content from SemmedDB or Implicitome
  • Custom schema to support these
• V3 key requirement: allow import of content from many other sources (Gene Ontology, DeepDive output, user-generated…)
This part is important… Not nailing it down makes everything else harder
Knowledge Garden content managed as:
• CSV files
• JSON documents
• MySQL databases
• PostgreSQL databases
• neo4j databases
None of which had any coherent plan or structure
Requirements for a knowledge graph
• Syntax:
  • How to refer to nodes and edges
  • Identifiers
  • Schema (structure of graph)
• Semantics:
  • What things mean
  • How you decide on the ‘?’ in: node1 ‘?’ node2
  • Are they the same (to you?)
  • If not, what is the edge? Mind the Gap…
(one node in the “Amino Acid” namespace, the other in the “Biologically Active Substance” namespace)
Options at kb3 scale (millions of concepts and relations)
• The Unified Medical Language System (UMLS)
• The Semantic Web
• Wikidata ?
The UMLS (CUIs, Atoms, Types)
https://uts.nlm.nih.gov
[Diagram]
CUI C0026106, with Atoms (each “equivalent to” the CUI):
• HP:0001256: Mild mental retardation; Mild and nonprogressive mental retardation
• SNOMEDCT_US:86765009: Moron (mental age 8-12 years)
• MEDCIN:35101: Mild intellectual disabilities
• OMIM:MTHU035844: Intellectual disability, mild
CUI C0233630, with Atom:
• SNOMEDCT_US:32386009: Logical Thinking
Types shown: Mental or Behavioral Dysfunction, Disease or Syndrome, Behavior, Activity, Event
Edge labels shown: equivalent to, isa, affects, affects ?
Types organized into a “Semantic Network”: ~133 types, 54 predicates, 13 high-level ‘groups’
The UMLS in 2016
• 3,200,922 CUIs
• 211 source vocabularies (e.g. MeSH, SNOMED, RxNORM, etc.)
• 12,287,973 total terms (“ATOMS”)
• Every edge in the system is a manual product of NLM
  • every Atom->CUI
  • every CUI->Type
  • every Type->Type
The Semantic Web
• Concepts uniquely identified by resolvable URIs
• Meaning (e.g. equivalency) encoded in OWL axioms
• Concepts and mappings created and maintained by anyone who can host them
• No other structure
• No governance
UMLS versus Semantic Web
• UMLS
  • PROs: covers a large portion of the biomedical concept space, manually curated, we are already using it by default, the semantic types are handy
  • CONs: does not exist on the semantic web (no stable URI to associate with a CUI), license is obscure and apparently limiting, weak representation of the molecular biology domain, no control over its extension (e.g. no Human Disease Ontology)
• Semantic Web
  • PROs: universal, open, infrastructure is the Web itself
  • CONs: need for organization, curation, mapping
Not thrilled with my options
https://commons.wikimedia.org/wiki/File:A_frustrated_and_depressed_man_holds_his_head_in_his_hand.jpg
Meanwhile...
• Human, mouse, rat, yeast, macaque, 120+ microbes: genes and proteins
• Gene Ontology terms
• Human Disease Ontology terms
• 120,000+ chemicals
• Cancer genome variants
• Other people adding and using data!!!
Maybe ?
Wikidata (QIDs, ids, Types)
https://www.wikidata.org/wiki/Q412194
[Diagram]
QID Q183560, with external ids:
• HP:0001256: Mild mental retardation; Mild and nonprogressive mental retardation
• SNOMEDCT_US:86765009: Moron (mental age 8-12 years)
• MEDCIN:35101: Mild intellectual disabilities
• OMIM:MTHU035844: Intellectual disability, mild
QID Q412194 (buspirone), with external id:
• PubChem: 2477
Classes shown (“Poly-Ontology”): Specific Developmental Disorder, developmental disorder of mental health, mental disorder, disorder, Drug, Chemical
Edge labels shown: external id, subclass of, subclass of (DO), isa, treated by
ACTIVE! Knowledge Flow for Wikidata
Unstructured data: the Internet
NLP tools: StrepHit
Knowledge Graph: Wikidata.org
Applications: Wikipedia, Wikigenomes
Microtasks: Wikidata game, Mix'n'match
Structured data: Gene Ontology, etc.
Wikidata is a Functioning and Flourishing Knowledge Garden
Wikidata
• ~27,000,000 concepts identified by Qids like ‘Q183560’
• ~1350 source vocabularies (e.g. MeSH, RxNORM, IMDB, etc.), based on properties tagged with type ‘ExternalId’
• ? total terms integrated = labels + aliases (a lot)
• Mappings to Qids are a product of the unwashed masses
• Constantly updated
What concept scheme do we use?
• Wikidata
  • PROs: universal, open, infrastructure, active community, largely curated content
  • CONs: limited biomedical content so far
?
Challenge: Relevant Scientific Applications
NLP tools: SemRep, Literome, Implicitome, PubTator, DeepDive, Snorkel, ContentMine, TEES, …
Knowledge Graph
Applications: Wikigenomes, HetioNet, Knowledge.Bio, …
Structured data: Gene Expression, etc.
A. Advancing science is the goal and this is how we can help
B. We need experts to help refine and build the knowledge graph and apps are the bait
On the plane, Oct. 11, 2016…
“Screw it, let’s go all in”
I got really excited..
https://www.flickr.com/photos/alexnormand/5992512756 https://www.flickr.com/photos/k6lcs/15374887957
knowledge.bio 3.0
• All nodes to be concepts from Wikidata
• All predicates to be properties from Wikidata
• All edges to be linked to references that could be ‘stated in’ Wikidata
• Edges (‘claims’) can come from any source
• Now:
  • We have one consistent format for data import
  • We have a consistent pattern for gathering more data about a concept
  • We have access to 27 million concepts and growing (and we can add more)
  • We have the beginnings of a new tool for expert-sourcing curation of Wikidata content
  • Our code is getting simpler and cleaner
KB3.0 – next step: seeding content
• You are now basically up to date…
• Rest of talk is about mapping content from SemmedDB to the new structure
• 3.0 release will allow users to add new nodes and edges
• If you want data in there:
  1. Map it to Wikidata items and properties
  2. Make a tab-delimited file (Qid Pid Qid referenceUrl sentence)
  3. Load it (or ask me to)
• Users needed!
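A minimal sketch of step 2, assuming the five columns appear in the order given on the slide (Qid Pid Qid referenceUrl sentence) and that no header row is expected; the actual kb3.0 loader may differ. The function names and the example row are hypothetical: the row only illustrates the format and is not a curated claim, and P2175 is assumed to be a suitable property id.

```python
import csv
import io
import re

# Column order taken from the slide: Qid Pid Qid referenceUrl sentence.
COLUMNS = ["subject_qid", "pid", "object_qid", "reference_url", "sentence"]
QID = re.compile(r"Q[1-9]\d*")
PID = re.compile(r"P[1-9]\d*")


def validate_row(row):
    """Return a list of problems found in one row (empty list if none)."""
    if len(row) != len(COLUMNS):
        return [f"expected {len(COLUMNS)} columns, got {len(row)}"]
    subject, pid, obj, url, sentence = row
    problems = []
    if not QID.fullmatch(subject):
        problems.append(f"bad subject Qid: {subject!r}")
    if not PID.fullmatch(pid):
        problems.append(f"bad Pid: {pid!r}")
    if not QID.fullmatch(obj):
        problems.append(f"bad object Qid: {obj!r}")
    if not url.startswith(("http://", "https://")):
        problems.append(f"bad reference URL: {url!r}")
    if "\t" in sentence or "\n" in sentence:
        problems.append("sentence contains a tab or newline")
    return problems


def write_claims(rows, handle):
    """Write validated rows to a file-like object as tab-delimited text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        problems = validate_row(row)
        if problems:
            raise ValueError("; ".join(problems))
        writer.writerow(row)


# Placeholder row for illustration only (not a real, curated claim).
example_rows = [
    ("Q412194", "P2175", "Q183560",
     "https://example.org/placeholder-reference",
     "Placeholder supporting sentence."),
]
buffer = io.StringIO()
write_claims(example_rows, buffer)
tsv_text = buffer.getvalue()
```

Validating before writing keeps malformed identifiers out of the import file, which matters if the loader does not check them itself.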
How many concepts in the UMLS are now items in Wikidata?
[Venn diagram: Wikidata, 27,000,000 items; UMLS, 3,000,000 concepts; overlap: ?]
Direct identifier mapping, CUI to Qid (15 shared ontologies)
UMLS_vocab | Concepts | Wikidata_property | Prop id | Usage
NCBI | 1014837 | NCBI Taxonomy ID | P685 | 379589
MSH | 359116 | MeSH ID | P486 | 5979
ICD10PCS | 178278 | ICD-10-PCS | P1690 | 5
NCI | 119620 | NCI Thesaurus ID | P1748 | 5562
ICD10CM | 98899 | ICD-10 | P494 | 8826
OMIM | 86181 | OMIM ID | P492 | 5835
FMA | 82042 | Foundational Model of Anatomy ID | P1402 | 3378
GO | 60412 | Gene Ontology ID | P686 | 43693
MDR | 51961 | Medical Dictionary for Regulatory Activities ID | P3201 | 1
HGNC | 39261 | HGNC gene symbol | P353 | 63691
HGNC | Sometimes... | HGNC-ID | P354 | 39758
NDFRT | 38206 | NDF-RT ID | P2115 | 1509
ICD9CM | 20993 | ICD-9-CM | P1692 | 88
ICD10 | 11552 | ICD-10 | P494 | 8826
RXNORM | 205998 | RxNorm CUI | P3345 | 5671
Example: C0001629, Adrenal Medulla
1. Local MySQL query: C0001629 -> FMA: 15633
2. Build SPARQL: ?qid wdt:P1402 "15633"
3. Run at query.wikidata.org -> Q934888
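The lookup for one identifier can be sketched as follows, assuming the source identifier (here FMA 15633) has already been pulled from the local UMLS MySQL tables. The function names and the User-Agent string are hypothetical; the endpoint is the public Wikidata Query Service, where the `wdt:` prefix is predefined. `lookup_qids` needs network access and is defined but not called here.

```python
import json
import re
import urllib.parse
import urllib.request

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query(prop_id, external_id):
    """Build a SPARQL query for items whose property prop_id equals external_id."""
    if not re.fullmatch(r"P[1-9]\d*", prop_id):
        raise ValueError(f"not a Wikidata property id: {prop_id!r}")
    # Escape backslashes and double quotes so the value stays one string literal.
    literal = external_id.replace("\\", "\\\\").replace('"', '\\"')
    return f'SELECT ?qid WHERE {{ ?qid wdt:{prop_id} "{literal}" . }}'


def qids_from_response(response):
    """Extract Qids from a SPARQL JSON result (bindings hold full entity URIs)."""
    bindings = response["results"]["bindings"]
    return [b["qid"]["value"].rsplit("/", 1)[-1] for b in bindings]


def lookup_qids(prop_id, external_id):
    """Run the query against the Wikidata Query Service (requires network access)."""
    params = urllib.parse.urlencode(
        {"query": build_query(prop_id, external_id), "format": "json"}
    )
    request = urllib.request.Request(
        f"{WDQS_ENDPOINT}?{params}",
        headers={"User-Agent": "cui2qid-sketch/0.1 (hypothetical example)"},
    )
    with urllib.request.urlopen(request, timeout=30) as reply:
        return qids_from_response(json.load(reply))


# The query from the slide: FMA ID (P1402) with value 15633.
query = build_query("P1402", "15633")
```

Calling `lookup_qids("P1402", "15633")` would send that query; a CUI whose identifier returns more than one Qid needs a decision about how to handle the ambiguity.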
Strict identifier mapping, CUI to Qid
UMLS_vocab | Concepts | Wikidata_property | Prop id | Usage
NCBI | 1014837 | NCBI Taxonomy ID | P685 | 379589
MSH | 359116 | MeSH ID | P486 | 5979
ICD10PCS | 178278 | ICD-10-PCS | P1690 | 5
NCI | 119620 | NCI Thesaurus ID | P1748 | 5562
ICD10CM | 98899 | ICD-10 | P494 | 8826
OMIM | 86181 | OMIM ID | P492 | 5835
FMA | 82042 | Foundational Model of Anatomy ID | P1402 | 3378
GO | 60412 | Gene Ontology ID | P686 | 43693
MDR | 51961 | Medical Dictionary for Regulatory Activities ID | P3201 | 1
HGNC | 39261 | HGNC gene symbol | P353 | 63691
HGNC | Sometimes... | HGNC-ID | P354 | 39758
NDFRT | 38206 | NDF-RT ID | P2115 | 1509
ICD9CM | 20993 | ICD-9-CM | P1692 | 88
ICD10 | 11552 | ICD-10 | P494 | 8826 -> 8292
RXNORM | 205998 | RxNorm CUI | P3345 | 0 -> 5671
“->”: thanks to Sebastian’s recent work…
How many concepts in the UMLS are now items in Wikidata? (according to identifiers)
[Venn diagram: Wikidata, 27,000,000 items; UMLS, 3,000,000 concepts; overlap: 463,059 (15%)]
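As a quick check of the 15% figure, the overlap can be divided by the rounded totals shown on the slide. This is only a sketch of the arithmetic, assuming the percentage is taken relative to the UMLS total; the variable names are illustrative.

```python
# Counts as shown on the slide (rounded totals).
mapped_concepts = 463_059
umls_concepts = 3_000_000
wikidata_items = 27_000_000


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


share_of_umls = percent(mapped_concepts, umls_concepts)       # about 15%
share_of_wikidata = percent(mapped_concepts, wikidata_items)  # under 2%
```

Relative to Wikidata as a whole the overlap is under 2%, which fits the earlier point that Wikidata's biomedical content was still limited.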
Wikidata items by UMLS source id
Coverage of shared identifiers by item
[Bar chart: UMLS CUIs versus Wikidata items for each shared identifier source; axis cut off, NCBI Taxonomy has > 1 million]
Good targets for Wikidata bots
463,059 mapped concepts, by semantic group
[Bar chart, log scale from 1 to 1,000,000: “N 1 to 1” for each semantic group: Occupations, Genes & Molecular Sequences, Disorders, Procedures, Activities & Behaviors, Anatomy, Devices, Phenomena, Chemicals & Drugs, Organizations, Objects, Physiology, Concepts & Ideas, Living Beings, Geographic Areas. Callouts: NCBI Taxons, Gene Ontology, Genes, Diseases, Drugs]
Where are the Gaps?
[Bar chart, scale from 0 to 800,000: “N no Map” for each semantic group: Occupations, Genes & Molecular Sequences, Disorders, Procedures, Activities & Behaviors, Anatomy, Devices, Phenomena, Chemicals & Drugs, Organizations, Objects, Physiology, Concepts & Ideas, Living Beings]
600,000 missing drugs
550,000 missing disorders
Where are(n’t) the Gaps?
[Bar chart: percent_mapped, axis from 0 to 0.6]
Label matching…
Adding label matching actually doesn’t help that much…
• Checked only 460,080 concepts (including all 288,552 from SemmedDB)
• 21% (96,843) had an identifier match
• 6.9% (31,645) had a match on the UMLS Preferred Label
• 3.1% (14,319) matched one of the UMLS synonyms
• Removing anything that matched more than 1 Wikidata item, we get 129,726 concepts
• Limiting to concepts used in SemmedDB, we get 113,623 (43% coverage, with most matches coming from identifiers)
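The matching cascade described above (identifier first, then preferred label, then synonyms, dropping anything that hits more than one Wikidata item) might look roughly like this. It is a sketch under assumptions, not the code used for the talk: the data layout, function names, case-insensitive label comparison, and all `*_TOY_*` entries are hypothetical; only the Adrenal Medulla example comes from the slides.

```python
from collections import defaultdict


def index_items(items):
    """Index Wikidata items by (property, value) pair and by case-folded label."""
    by_id, by_label = defaultdict(set), defaultdict(set)
    for qid, item in items.items():
        for pair in item["ids"]:
            by_id[pair].add(qid)
        for label in item["labels"]:
            by_label[label.casefold()].add(qid)
    return by_id, by_label


def candidates(concept, by_id, by_label):
    """Return (method, Qids) from the first stage of the cascade that matches."""
    hits = set()
    for pair in concept["ids"]:
        hits |= by_id.get(pair, set())
    if hits:
        return "identifier", hits
    hits = set(by_label.get(concept["preferred"].casefold(), set()))
    if hits:
        return "preferred_label", hits
    hits = set()
    for synonym in concept["synonyms"]:
        hits |= by_label.get(synonym.casefold(), set())
    return ("synonym", hits) if hits else (None, set())


def map_concepts(concepts, items):
    """Keep only concepts that match exactly one Wikidata item."""
    by_id, by_label = index_items(items)
    mapping = {}
    for cui, concept in concepts.items():
        method, hits = candidates(concept, by_id, by_label)
        if len(hits) == 1:
            mapping[cui] = (method, next(iter(hits)))
    return mapping


# Toy data: only C0001629 / Q934888 comes from the slides; the rest is made up.
items = {
    "Q934888": {"ids": {("P1402", "15633")}, "labels": {"adrenal medulla"}},
    "Q_TOY_A": {"ids": set(), "labels": {"toy disorder"}},
    "Q_TOY_B": {"ids": set(), "labels": {"toy drug"}},
    "Q_TOY_C": {"ids": set(), "labels": {"toy drug"}},
}
concepts = {
    "C0001629": {"ids": {("P1402", "15633")}, "preferred": "Adrenal Medulla", "synonyms": []},
    "C_TOY_1": {"ids": set(), "preferred": "Toy Disorder", "synonyms": []},
    "C_TOY_2": {"ids": set(), "preferred": "Toy compound", "synonyms": ["Toy Drug"]},
    "C_TOY_3": {"ids": set(), "preferred": "Unmatched thing", "synonyms": []},
}
mapping = map_concepts(concepts, items)
```

In the toy data, `C_TOY_2` is dropped because its synonym matches two items, and `C_TOY_3` is dropped because nothing matches.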
SemmedDB as Wikidata, version 1
• 15,957,582 predications with 13 relation types
• All concepts are Wikidata items
• All relation types are Wikidata properties
• Data available at http://tinyurl.com/cui2qid-1
• Will be accessible in kb3.0 next week or the following
Next steps / project opportunities
• More Wikidata bots!
• Establish a more consistent typing strategy in Wikidata (e.g. make each item an instance of some semantic group)
• Finish the mapping of the UMLS predicates to Wikidata properties
  • Add missing properties (e.g. ‘Activates’, ‘Inhibits’)
  • Use the existing subproperty property to build a property ontology inside Wikidata
• Populate kb3.0 with knowledge pertinent to your disease area
• Extend the user interface
• Use the underlying neo4j database to extend HetioNet and related (or add HetioNet to it)
Pick an edge or node and create or improve it
Unstructured data: PubMed, Clinical Trials, etc.
NLP tools: SemRep, DeepDive, Implicitome, etc.
Knowledge Graph: SemmedDB, Literome, etc.
Applications: Semantic MEDLINE, BioGraph, etc.
Microtasks: Mark2Cure, AMT
Structured data: Gene Ontology, etc.
Thanks!
• Richard Bruskiewich! and Star Informatics team for persevering… (v1, v2.1...5, v3.0)
• Gene Wiki team! Especially bot developers: Sebastian B., Andra W., Tim P., Greg S., who planted the seeds that are making this possible.
• Su laboratory!
• I hope you can find something useful here and help grow the garden…
• Especially you HetNetters!
https://www.flickr.com/photos/alexnormand/5992512756