20141112 courtot big_datasemwebontologies
DESCRIPTION
Guest lecture (MBB342) at Simon Fraser University on Big Data, Semantic Web and ontologies
TRANSCRIPT
About me
Overview
• Big Data
  – Big Data is BIG
  – Issues in research
• Semantic Web
  – Standards: URIs, RDF, SPARQL, OWL
  – Linked data
• Ontologies
  – Definition and reasoning
  – OBO Foundry
  – Example of existing ontologies
  – Pharmacovigilance
  – Publishing ontologies on the Semantic Web
• IRIDA
  – The IRIDA platform
  – Adding standards to IRIDA
• Take home message
Big data
Big data is data that is too large and complex for conventional data tools to process.
[Images labeled 2005 and 2013]
What is a zettabyte?
1,000,000,000,000 gigabytes
1,000,000,000 terabytes
1,000,000 petabytes
1,000 exabytes
1 zettabyte
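The relationship between these units can be checked with a few lines of Python. This is a minimal sketch that assumes decimal (SI) units, where each unit is 1,000 times the previous one; the `UNITS` list and the `convert` helper are illustrative names, not part of the lecture.

```python
# Decimal (SI) storage units; each one is 1,000 times larger than the previous.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]


def convert(value, from_unit, to_unit):
    """Convert a quantity between SI storage units using powers of 1,000."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    if steps >= 0:
        return value * 1000 ** steps      # going to a smaller unit
    return value / 1000 ** (-steps)       # going to a larger unit


for unit in ["exabyte", "petabyte", "terabyte", "gigabyte"]:
    print(f"1 zettabyte = {convert(1, 'zettabyte', unit):,} {unit}s")
```

The same helper can put the daily volumes quoted in the lecture on a common scale, for example `convert(20, "petabyte", "terabyte")`.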
How big is big?
• Facebook: 25 Terabytes of logged data per day, Google (2008): 20 Petabytes per day
• Over 90% of all the data in the world was created in the past 2 years [1]
• Today: 3.2 zettabytes. 2020: 40 zettabytes. [2]
• Good news: jobs! [3]
1. http://www-01.ibm.com/software/data/bigdata/
2. http://barnraisersllc.com/2012/12/38-big-facts-big-data-companies/
3. http://www.webopedia.com/quick_ref/important-big-data-facts-for-it-professionals.html
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century
Issues with research data (1): data availability
http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416
Issues with research data (2): data reproducibility
http://www.firstwordpharma.com/node/931605#axzz3IalL2lzU
A solution: the Semantic Web
"The Semantic Web is an ... extension of the current web in which ... information is given well-defined meaning, ... better enabling computers and people to work in cooperation."
The Semantic Web, Tim Berners-Lee, James Hendler and Ora Lassila, Scientific American, May 2001
http://www.scientificamerican.com/article/the-semantic-web/
The Semantic Web in a nutshell
Adds to Web standards and practices (currently only for documents and services), encouraging:
• Unambiguous names for things, classes, and relationships
• Well organized and documented in ontologies
• With data expressed using uniform knowledge representation languages (e.g. OWL)
• To enable computationally assisted exploitation of information
• That can be easily integrated from different sources
Some Semantic Web successes
• In February 2011, the Watson system by IBM made international headlines for beating the best humans in the quiz show Jeopardy!
• A significant number of very prominent websites are powered by Semantic Web technologies, including the New York Times, Thomson Reuters, BBC, and Google's Freebase.
• The Speech Interpretation and Recognition Interface (Siri), launched by Apple in 2011 as an intelligent personal assistant for the new generation of iPhone smartphones, draws heavily from work on ontologies, knowledge representation, and reasoning.
http://130.108.5.60/faculty/pascal/pub/crc-handbook-13.pdf
Uniform Resource Identifiers (URIs)
• Two different uses:
  – Unambiguous name for something
  – Location of a document
• Examples:
  – http://example.org/wiki/Main_Page
  – ftp://example.org/resource.txt
  – mailto:[email protected]
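To see how a URI breaks down into its parts, the standard-library `urllib.parse` module can be used. This is a small sketch only; the two URIs are taken as the HTTP and FTP examples from the slide (the FTP scheme is assumed here), and the variable names are illustrative.

```python
from urllib.parse import urlparse

examples = [
    "http://example.org/wiki/Main_Page",
    "ftp://example.org/resource.txt",
]

for uri in examples:
    parts = urlparse(uri)
    # scheme = how to interpret the identifier, netloc = authority, path = resource
    print(f"scheme={parts.scheme} authority={parts.netloc} path={parts.path}")
```

Note that parsing a URI says nothing about whether it is used as a name or as a document location; that distinction is a matter of convention, not syntax.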
Resource Description Framework (RDF)
• Resources (= nodes)
  • Identified by Uniform Resource Identifier (URI)
• Properties (= edges)
  • Identified by Uniform Resource Identifier (URI)
  • Binary relations between 2 resources
http://elmonline.ca/sw/sparql/social.ttl
<http://www.linkedin.com/in/mcourtot> a foaf:Person ;
    foaf:name "Melanie Courtot" ;
    foaf:knows <http://elmonline.ca/luke> ;
    foaf:knows <http://www.linkedin.com/pub/mark-wilkinson/1/674/665> .
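The graph model behind RDF can be sketched in plain Python as a set of (subject, predicate, object) triples. This is only an illustration of the data structure, not a real RDF library: it does not distinguish IRIs from literals, and the `objects` helper is a hypothetical name introduced for this example.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"  # what "a" abbreviates

melanie = "http://www.linkedin.com/in/mcourtot"

# Each statement is one edge of the graph: (subject, predicate, object).
graph = {
    (melanie, RDF_TYPE, FOAF + "Person"),
    (melanie, FOAF + "name", "Melanie Courtot"),
    (melanie, FOAF + "knows", "http://elmonline.ca/luke"),
    (melanie, FOAF + "knows",
     "http://www.linkedin.com/pub/mark-wilkinson/1/674/665"),
}


def objects(graph, subject, predicate):
    """Return, sorted, every object linked to subject by predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


print(objects(graph, melanie, FOAF + "knows"))
```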
SPARQL
A query language for RDF
SELECT ?person
WHERE {
  <http://www.linkedin.com/in/mcourtot> <http://xmlns.com/foaf/0.1/knows> ?person .
}
----------------------------------------------------------
| person                                                 |
==========================================================
| <http://www.linkedin.com/pub/mark-wilkinson/1/674/665> |
| <http://elmonline.ca/luke>                             |
----------------------------------------------------------
• An excellent tutorial by Luke McCarthy: http://elmonline.ca/sw/sparql/
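What a SPARQL engine does for a single triple pattern can be approximated in a few lines of Python. This is a toy sketch of pattern matching over an in-memory list of triples, not a SPARQL implementation; the `match` function and the convention that terms starting with `?` are variables are assumptions made for this example.

```python
KNOWS = "http://xmlns.com/foaf/0.1/knows"
NAME = "http://xmlns.com/foaf/0.1/name"
MELANIE = "http://www.linkedin.com/in/mcourtot"

triples = [
    (MELANIE, KNOWS, "http://www.linkedin.com/pub/mark-wilkinson/1/674/665"),
    (MELANIE, KNOWS, "http://elmonline.ca/luke"),
    (MELANIE, NAME, "Melanie Courtot"),
]


def match(pattern, triples):
    """Yield one {variable: value} binding per triple that fits the pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


results = [b["?person"] for b in match((MELANIE, KNOWS, "?person"), triples)]
for person in results:
    print(person)
```

A real query engine additionally joins several patterns, handles prefixes and literals, and optimizes the lookup; the sketch only shows the core idea of binding variables against the graph.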
The Web Ontology Language (OWL)
• Knowledge representation language
• Based on Description Logics: fragments of first-order logic with decidable and defined computational properties
• Sound, complete, terminating reasoners available
Linked open data cloud
Biological resources in LOD
Examples of issues in linking data incorrectly
• <http://dbpedia.org/resource/Welsh> owl:sameAs
    <http://sw.cyc.com/2006/07/27/cyc/EthnicGroupOfWelsh>
    <http://sw.cyc.com/2006/07/27/cyc/Welsh-TheWord>
    <http://sw.cyc.com/2006/07/27/cyc/WelshLanguage>
    <http://sw.cyc.com/2006/07/27/cyc/Welshing-Cheating>
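Why such links are a problem becomes clearer once sameAs is treated as the equivalence relation it is defined to be (symmetric and transitive): every resource linked to the same DBpedia entry ends up identified with every other. The sketch below groups resources with a small union-find; the function name `same_as_clusters` is hypothetical and the code is an illustration, not how any particular triple store works.

```python
def same_as_clusters(pairs):
    """Group resources into the equivalence classes implied by sameAs pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)          # merge the two classes

    clusters = {}
    for node in parent:
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


DBPEDIA_WELSH = "http://dbpedia.org/resource/Welsh"
CYC = "http://sw.cyc.com/2006/07/27/cyc/"
links = [
    (DBPEDIA_WELSH, CYC + "EthnicGroupOfWelsh"),
    (DBPEDIA_WELSH, CYC + "Welsh-TheWord"),
    (DBPEDIA_WELSH, CYC + "WelshLanguage"),
    (DBPEDIA_WELSH, CYC + "Welshing-Cheating"),
]

clusters = same_as_clusters(links)
print(len(clusters), "cluster(s);", len(clusters[0]), "resources in the first")
```

With these four links, an ethnic group, a word, a language and an act of cheating all fall into a single class of "identical" things, which is clearly not what was intended.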
Ontologies
• Representation of important things in a specific domain
  – Describes types of entities (e.g. cells) and relations between them (e.g. prokaryotic cells and eukaryotic cells are cells) and their instances (e.g. the specific cells in my sample)
• An active computational artifact
  – A mathematical model based on a subset of first-order logic
  – Tools can automatically process ontologies
• A communication tool
  – Provides a dictionary for collaborators, a shared understanding
  – Allows data sharing
Reasoning is critical
• Prokaryotic cell and Eukaryotic cell are declared disjoint
• Fungal cell is a Eukaryotic cell
• Spore is a Fungal cell and a Prokaryotic cell
⇒ Unsatisfiability
⇒ Solution: distinguish spore (sensu Mycetozoa) from actinomycete-type spore
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006
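The kind of check a reasoner performs here can be imitated with a very small Python sketch: follow the subclass links upward and report any class whose ancestors include two classes declared disjoint. This is a deliberately simplified stand-in for an OWL reasoner (which handles far richer axioms); the dictionary layout and function names are assumptions made for illustration.

```python
# Direct "is a" (subclass-of) assertions from the example.
SUBCLASS_OF = {
    "prokaryotic cell": {"cell"},
    "eukaryotic cell": {"cell"},
    "fungal cell": {"eukaryotic cell"},
    "spore": {"fungal cell", "prokaryotic cell"},
}
# Pairs of classes that may not share any instance.
DISJOINT = [("prokaryotic cell", "eukaryotic cell")]


def ancestors(cls):
    """Return cls plus every class reachable through subclass-of links."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen


def unsatisfiable_classes():
    """List classes that inherit from both members of a disjoint pair."""
    bad = []
    for cls in SUBCLASS_OF:
        supers = ancestors(cls)
        if any(a in supers and b in supers for a, b in DISJOINT):
            bad.append(cls)
    return bad


print(unsatisfiable_classes())
```

Splitting the offending class into two separate classes, each with a single parent, makes the list empty again.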
Logics
• Simple example based on http://arxiv.org/pdf/1201.4089v1.pdf
• Ontology file available from http://www.sfu.ca/~mcourtot/course/20141112BigDataSemWebOntologies/ontology.owl
• Manipulation done using Protégé: http://protege.stanford.edu
Family ontology
Logics of a grandfather
Reasoning
Inferred class hierarchy
Explanations
A wrong assertion
Unsatisfiability
OBO Foundry
A subset of biological and biomedical ontologies whose developers have agreed in advance to accept a common set of principles reflecting best practice in ontology development, designed to ensure:
• tight connection to the biomedical basic sciences
• compatibility
• interoperability, common relations
• formal robustness
• support for logic-based reasoning
http://www.obofoundry.org
RELATION TO TIME: CONTINUANT (INDEPENDENT, DEPENDENT) or OCCURRENT
GRANULARITY | INDEPENDENT CONTINUANT | DEPENDENT CONTINUANT | OCCURRENT
ORGAN AND ORGANISM | Organism (NCBI Taxonomy?); Anatomical Entity (FMA, CARO) | Organ Function (FMP, CPRO); Phenotypic Quality (PaTO) | Organism-Level Process (GO)
CELL AND CELLULAR COMPONENT | Cell (CL); Cellular Component (FMA, GO) | Cellular Function (GO) | Cellular Process (GO)
MOLECULE | Molecule (ChEBI, SO, RnaO, PrO) | Molecular Function (GO) | Molecular Process (GO)
Slide credit: Barry Smith
Minimum Information to Reuse an External Ontology Term (MIREOT)
• OBO and Semantic Web promote reuse of resources
• Biological resources (e.g., FMA for anatomy), taken together, are too big for current tool support.
• MIREOT used across the OBO library
  – OBI: 400 mireoted terms (140 GO, 55 ChEBI, 50 PATO)
  – PR (Protein Ontology): 23,000 mireoted terms
• http://ontofox.hegroup.org
Example of OBO ontologies
• OBI, the Ontology for Biomedical Investigations
• VO, the Vaccine Ontology
• AERO, the Adverse Event Reporting Ontology
Ontology for Biomedical Investigations (OBI)
• OBI is a multi-community project driven by the practical needs of its members, with the goal of building a high-quality, interoperable reference ontology
• OBI high-level classes, solidified over several years, are in place and cover all aspects of biomedical investigations
• OBI is expanded to support member applications, based on term requests
High level class hierarchy (partial)
Slide credit: OBI Consortium
Slide credit: Alan Ruttenberg
Slide credit: OBI Consortium
Representing vaccine data – the Vaccine Ontology (VO)
Picture credit: Yongqun He
Representing pharmacovigilance data
• The Adverse Event Reporting Ontology (AERO)
• Encodes existing clinical guidelines (Brighton Collaboration)
[Figure: AERO diagram. Node labels: Patient examination, Clinician, Patient, Clinical Report (exam report of June 7), Clinical Finding (finding of rash), rash, dermatological system, Anatomical system, Medically relevant entity. Relation labels: has participant, has specified input, has specified output, is about, part of, located in, involves, found to exhibit, has component, inferred to be of type. Classification shown: 'found to exhibit' some 'generalized urticaria or generalized erythema finding' is inferred to be of type major dermatological criterion for anaphylaxis according to Brighton; 'found to exhibit' some 'measured hypotension finding' is inferred to be of type major cardiovascular criterion for anaphylaxis according to Brighton; both are components of Level 1 of certainty of anaphylaxis according to Brighton.]
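The inference pictured on this slide, from findings to criteria to a level of certainty, can be mimicked with a toy rule in Python. This is a heavily simplified sketch: it uses only the two findings and two criteria named on the slide, and it assumes that meeting both criteria is enough for Level 1. The actual Brighton case definition contains more criteria and levels, and AERO encodes it in OWL rather than in code like this.

```python
# Finding -> criterion, restricted to the two labels visible on the slide.
CRITERION_FOR_FINDING = {
    "generalized urticaria or generalized erythema finding":
        "major dermatological criterion",
    "measured hypotension finding":
        "major cardiovascular criterion",
}

# Simplifying assumption for this sketch: Level 1 needs both criteria above.
LEVEL_1_COMPONENTS = {
    "major dermatological criterion",
    "major cardiovascular criterion",
}


def criteria_met(findings):
    """Map a report's findings to the criteria they satisfy."""
    return {CRITERION_FOR_FINDING[f] for f in findings
            if f in CRITERION_FOR_FINDING}


def is_level_1(findings):
    """True when every Level 1 component is satisfied by some finding."""
    return LEVEL_1_COMPONENTS <= criteria_met(findings)


report = [
    "generalized urticaria or generalized erythema finding",
    "measured hypotension finding",
]
print(is_level_1(report))
```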
Background and problem statement
• Surveillance of Adverse Events Following Immunization is important
  – Detection of issues with vaccine
  – Importance of vaccine-risk communication
• Analysis of AE reports is a subjective, time-consuming and costly process
  – Manual review of the textual reports
Workflow
• Hypothesis: Use the AERO I developed to annotate and classify a dataset
• VAERS dataset
  – Vaccine Adverse Event Reporting System
  – 6032 reports: ~5800 negative, ~230 positive
  – Post H1N1 immunization 2009/2010
  – Manually classified for anaphylaxis
• MedDRA (Medical Dictionary for Regulatory Activities) is used to represent clinical findings
Automated Diagnosis workflow
[Figure: workflow diagram with steps labeled A to D. Components shown: the Adverse Event Reporting Ontology (AERO); the VAERS dataset (ASCII files, MySQL); the Brighton annotations (MySQL; ~800 MedDRA terms mapped to 32 Brighton terms); an OWL/RDF export; a reasoner; and the manually curated dataset.]
Results
[Figure: the automated diagnosis workflow diagram, repeated]
At best cut-off point: sensitivity 57%, specificity 97%
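For readers less familiar with these two measures, the sketch below shows how they are computed from a confusion matrix. The counts are made-up toy numbers chosen only so that the output matches the percentages on the slide; they are not the counts from the study.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real positive cases that the classifier flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of real negative cases that the classifier leaves unflagged."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts for illustration only (NOT the study's data).
tp, fn = 57, 43      # 100 manually confirmed positive reports
tn, fp = 970, 30     # 1,000 manually confirmed negative reports

print(f"sensitivity = {sensitivity(tp, fn):.0%}")
print(f"specificity = {specificity(tn, fp):.0%}")
```

High specificity with moderate sensitivity means few false alarms but a noticeable share of missed cases, which matters when the positives are rare, as in the dataset described earlier.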
AE classification can be improved through the use of ontologies
• Manual analysis: 3 months for 12 medical officers
• Ontology-based analysis: once data collected (2 months), almost instantaneous (2h on laptop)
=> Could allow for earlier detection of safety issues and better understanding of adverse events
[Figure: timeline from November 2009 to January 2010 showing the ability to detect a signal over time for manual analysis and ontology-based analysis of 6000 reports. Time gain: 2h automated vs. 3 months manual.]
http://dx.doi.org/10.1371/journal.pone.0092632
IRI dereferencing
Ontobee: publishing biomedical resources on the Semantic Web
HTML for humans …
… RDF for machines
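Serving HTML to people and RDF to programs from the same IRI is usually done through HTTP content negotiation: the client states what it wants in the `Accept` header and the server picks a representation. The sketch below shows that decision as a pure function. It is an assumption-laden illustration, not Ontobee's actual implementation; the function names and the set of RDF media types considered are choices made for this example.

```python
RDF_TYPES = {"application/rdf+xml", "text/turtle"}


def parse_accept(header):
    """Parse an HTTP Accept header into (media_type, q) pairs, best first."""
    entries = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type, q = fields[0], 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                q = float(field[2:])
        if media_type:
            entries.append((media_type, q))
    # sorted() is stable, so equal-q entries keep their header order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


def choose_representation(accept_header):
    """Return 'rdf' if an RDF type is the preferred known type, else 'html'."""
    for media_type, q in parse_accept(accept_header):
        if q == 0:
            continue                      # explicitly refused by the client
        if media_type in RDF_TYPES:
            return "rdf"
        if media_type in ("text/html", "*/*"):
            return "html"
    return "html"                         # default for browsers and unknowns


print(choose_representation("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"))
print(choose_representation("application/rdf+xml"))
```

A browser's typical header yields the HTML page, while a Semantic Web client asking for `application/rdf+xml` gets the machine-readable description of the same term.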
The Integrated Rapid Infectious Disease Analysis (IRIDA) project
• Goal: automate infectious disease outbreak detection and investigation
• Issues:
  – Integrate WGS, clinical and lab info
  – Provide relevant tools and validate pipeline
• Methods:
  – Data standards for information exchange
  – Analysis pipeline (Galaxy based)
  – User interface
  – Additional tools:
    • IslandViewer
    • GenGIS
Building the IRIDA data standards
• Interview with key personnel at BCCDC
• Review of existing resources
• Identify "holes", i.e., missing bits
• Collect existing data
• Liaise with implementation team
• Generate cohesive resource
• Validate
Relevant data standards
• TypON, the Typing Ontology
• OBI, the Ontology for Biomedical Investigations
• NGSOnto, Next Generation Sequencing Ontology
• NIAIS-GS-BRC core metadata
• TRANS, Pathogen Transmission Ontology
• ExO, Exposure Ontology
• EPO, Epidemiology Ontology
• IDO, Infectious Disease Ontology
• Food: USDA, EFSA?
Relevant international efforts
• MIxS standard
• Global Microbial Identifier
• Global Alliance for Genomics and Health
• NCBI BioSample
• European Nucleotide Archive
• …
Remaining challenges
• Trust, provenance
  – Ability to track origin of data to assess whether it is trustworthy
• Data sharing, reuse, policy
  – Social and legal issues in getting access to data
• Confidentiality
  – Privacy concerns when linking data
Take home message
Big data is a big challenge, but we can deal with it if done properly: that will be your responsibility
• DO NOT build a black box
• DO annotate and describe your data
• DO make your data openly available
Acknowledgements
• Drs. Fiona Brinkman, Will Hsiao, Ryan Brinkman
• The Brinkman^2 labs
• Alan Ruttenberg, Barry Smith, Chris Mungall & OBO
• Colleagues at Public Health Agency Canada (Ms Lafleche, Dr Law)
• The IRIDA consortium and the IRIDA ontology working group (Emma Griffiths and Damion Dooley)