20141112 courtot big_datasemwebontologies

Big data, Semantic Web and Ontologies Mélanie Courtot, PhD Nov 12th 2014 [email protected]

Upload: melanie-courtot

Post on 02-Jul-2015


Category:

Health & Medicine



DESCRIPTION

Guest lecture (MBB342) at Simon Fraser University on Big data, Semantic Web and ontologies

TRANSCRIPT

Page 1: 20141112 courtot big_datasemwebontologies

Big data, Semantic Web and Ontologies

Mélanie  Courtot,  PhD  Nov  12th  2014  

[email protected]  

1  

Page 2: 20141112 courtot big_datasemwebontologies

About  me  

2  

Page 3: 20141112 courtot big_datasemwebontologies

Overview  

3  

•  Big  Data  –  Big  Data  is  BIG  –  Issues  in  research  

•  Semantic Web – Standards: URIs, RDF, SPARQL, OWL – Linked data

•  Ontologies – Definition and reasoning – OBO Foundry – Example of existing ontologies – Pharmacovigilance – Publishing ontologies on the Semantic Web

•  IRIDA – The IRIDA platform – Adding standards to IRIDA

•  Take  home  message    

Page 4: 20141112 courtot big_datasemwebontologies

Overview  

4  

•  Big  Data  –  Big  Data  is  BIG  –  Issues  in  research  

•  Semantic Web – Standards: URIs, RDF, SPARQL, OWL – Linked data

•  Ontologies – Definition and reasoning – OBO Foundry – Example of existing ontologies – Pharmacovigilance – Publishing ontologies on the Semantic Web

•  IRIDA – The IRIDA platform – Adding standards to IRIDA

•  Take  home  message    

Page 5: 20141112 courtot big_datasemwebontologies

5  

Page 6: 20141112 courtot big_datasemwebontologies

Big  data  

Big data is data that is too large and complex for conventional data tools to process.

6  

Page 7: 20141112 courtot big_datasemwebontologies

7  

2005  

Page 8: 20141112 courtot big_datasemwebontologies

8  

2013  

Page 9: 20141112 courtot big_datasemwebontologies

What is a Zettabyte?

1,000,000,000,000 gigabytes = 1,000,000,000 terabytes = 1,000,000 petabytes = 1,000 exabytes = 1 zettabyte
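The relationship between these units can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) prefixes, where each step up is a factor of 1,000; the names `SI_UNITS` and `zettabytes_in` are made up for this illustration.

```python
# Decimal (SI) storage units, each 1,000 times larger than the previous one.
SI_UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
            "terabyte", "petabyte", "exabyte", "zettabyte"]


def zettabytes_in(unit: str) -> int:
    """Return how many of `unit` make up one zettabyte."""
    steps = SI_UNITS.index("zettabyte") - SI_UNITS.index(unit)
    return 1000 ** steps


if __name__ == "__main__":
    for unit in ["gigabyte", "terabyte", "petabyte", "exabyte", "zettabyte"]:
        print(f"1 zettabyte = {zettabytes_in(unit):,} {unit}(s)")
```

Under this assumption one zettabyte is 10^21 bytes, or a trillion gigabytes.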

9  

Page 10: 20141112 courtot big_datasemwebontologies

How  big  is  big?  

•  Facebook:  25  Terabytes  of  logged  data  per  day,  Google  (2008):  20  Petabytes  per  day  

•  Over  90%  of  all  the  data  in  the  world  was  created  in  the  past  2  years  [1]  

•  Today: 3.2 zettabytes. 2020: 40 zettabytes. [2]

•  Good news: jobs! [3]

1.  http://www-01.ibm.com/software/data/bigdata/
2.  http://barnraisersllc.com/2012/12/38-big-facts-big-data-companies/
3.  http://www.webopedia.com/quick_ref/important-big-data-facts-for-it-professionals.html

10  

Page 11: 20141112 courtot big_datasemwebontologies

11  https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

Page 12: 20141112 courtot big_datasemwebontologies

12  

Issues  with  research  data  (1):  data  availability  

http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416

Page 13: 20141112 courtot big_datasemwebontologies

Issues  with  research  data  (2):  data  reproducibility  

13  http://www.firstwordpharma.com/node/931605#axzz3IalL2lzU

Page 14: 20141112 courtot big_datasemwebontologies

Overview  

14  

•  Big  Data  –  Big  Data  is  BIG  –  Issues  in  research  

•  Semantic Web – Standards: URIs, RDF, SPARQL, OWL – Linked data

•  Ontologies – Definition and reasoning – OBO Foundry – Example of existing ontologies – Pharmacovigilance – Publishing ontologies on the Semantic Web

•  IRIDA – The IRIDA platform – Adding standards to IRIDA

•  Take  home  message    

Page 15: 20141112 courtot big_datasemwebontologies

A solution: the Semantic Web

"The Semantic Web is an ... extension of the current web in which ... information is given well-defined meaning, ... better enabling computers and people to work in cooperation."
The Semantic Web, Tim Berners-Lee, James Hendler and Ora Lassila, Scientific American, May 2001

15  http://www.scientificamerican.com/article/the-semantic-web/  

Page 16: 20141112 courtot big_datasemwebontologies

The Semantic Web in a nutshell

Adds to Web standards and practices (currently only for documents and services), encouraging:
•  Unambiguous names for things, classes, and relationships
•  Well organized and documented in ontologies
•  With data expressed using uniform knowledge representation languages (e.g. OWL)
•  To enable computationally assisted exploitation of information
•  That can be easily integrated from different sources

16  

Page 17: 20141112 courtot big_datasemwebontologies

Some Semantic Web successes

•  In February 2011, the Watson system by IBM made international headlines for beating the best humans in the quiz show Jeopardy!
•  A significant number of very prominent websites are powered by Semantic Web technologies, including the New York Times, Thomson Reuters, BBC, and Google's Freebase.
•  The Speech Interpretation and Recognition Interface Siri, launched by Apple in 2011 as an intelligent personal assistant for the new generation of iPhone smartphones, heavily draws from work on ontologies, knowledge representation, and reasoning.

17  http://130.108.5.60/faculty/pascal/pub/crc-handbook-13.pdf

Page 18: 20141112 courtot big_datasemwebontologies

18  

Page 19: 20141112 courtot big_datasemwebontologies

Uniform Resource Identifiers (URIs)

•  Two different uses:
   –  Unambiguous name for something
   –  Location of a document
•  Examples:
   –  http://example.org/wiki/Main_Page
   –  ftp://example.org/resource.txt
   –  mailto:[email protected]
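To make the structure of a URI concrete, here is a minimal sketch using Python's standard `urllib.parse.urlsplit`. The `describe` helper and the `someone@example.org` address are hypothetical and only serve the illustration.

```python
from urllib.parse import urlsplit

EXAMPLES = [
    "http://example.org/wiki/Main_Page",
    "ftp://example.org/resource.txt",
    "mailto:someone@example.org",  # hypothetical address, for illustration only
]


def describe(uri: str) -> dict:
    """Split a URI into its scheme, authority (host) and path."""
    parts = urlsplit(uri)
    return {"scheme": parts.scheme, "authority": parts.netloc, "path": parts.path}


if __name__ == "__main__":
    for uri in EXAMPLES:
        print(uri, "->", describe(uri))
```

The scheme (`http`, `ftp`, `mailto`) says how the identifier is to be interpreted; whether the URI also locates a retrievable document depends on the scheme and on the server behind it.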

19  

Page 20: 20141112 courtot big_datasemwebontologies

Resource Description Framework (RDF)

•  Resources (= nodes)
   –  Identified by Uniform Resource Identifier (URI)
•  Properties (= edges)
   –  Identified by Uniform Resource Identifier (URI)
   –  Binary relations between 2 resources

20  http://elmonline.ca/sw/sparql/social.ttl

Page 21: 20141112 courtot big_datasemwebontologies

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://www.linkedin.com/in/mcourtot> a foaf:Person ;
    foaf:name "Melanie Courtot" ;
    foaf:knows <http://elmonline.ca/luke> ;
    foaf:knows <http://www.linkedin.com/pub/mark-wilkinson/1/674/665> .
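The Turtle snippet above is shorthand for four subject–predicate–object triples. As a rough sketch, the same graph can be written out as plain Python tuples with the `foaf:` prefix and the `a` keyword expanded. The variable and function names are invented for this example, and the sketch does not distinguish IRIs from string literals the way a real RDF library would.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
# Turtle's "a" keyword is shorthand for the rdf:type property.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

ME = "http://www.linkedin.com/in/mcourtot"

# Each triple is (subject, predicate, object).
TRIPLES = [
    (ME, RDF_TYPE, FOAF + "Person"),
    (ME, FOAF + "name", "Melanie Courtot"),
    (ME, FOAF + "knows", "http://elmonline.ca/luke"),
    (ME, FOAF + "knows", "http://www.linkedin.com/pub/mark-wilkinson/1/674/665"),
]


def objects_of(subject: str, predicate: str) -> list:
    """Return every object linked to `subject` through `predicate`."""
    return [o for (s, p, o) in TRIPLES if s == subject and p == predicate]


if __name__ == "__main__":
    print(objects_of(ME, FOAF + "name"))
```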

 21  

Page 22: 20141112 courtot big_datasemwebontologies

SPARQL

A query language for RDF

SELECT ?person WHERE {
    <http://www.linkedin.com/in/mcourtot> <http://xmlns.com/foaf/0.1/knows> ?person .
}

----------------------------------------------------------
| person                                                 |
==========================================================
| <http://www.linkedin.com/pub/mark-wilkinson/1/674/665> |
| <http://elmonline.ca/luke>                             |
----------------------------------------------------------

•  An excellent tutorial by Luke McCarthy: http://elmonline.ca/sw/sparql/

22
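Conceptually, the SELECT query above matches a triple pattern against the graph and returns the bindings of `?person`. The following is a minimal sketch of that idea in plain Python, not a SPARQL engine; `match` and `select_known_people` are hypothetical helpers, and `None` plays the role of a variable.

```python
FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"
ME = "http://www.linkedin.com/in/mcourtot"

# A small graph as a set of (subject, predicate, object) triples.
GRAPH = {
    (ME, "http://xmlns.com/foaf/0.1/name", "Melanie Courtot"),
    (ME, FOAF_KNOWS, "http://elmonline.ca/luke"),
    (ME, FOAF_KNOWS, "http://www.linkedin.com/pub/mark-wilkinson/1/674/665"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None matches anything."""
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def select_known_people(graph, person):
    """Rough equivalent of: SELECT ?person WHERE { <person> foaf:knows ?person }."""
    return sorted(o for (_, _, o) in match(graph, s=person, p=FOAF_KNOWS))


if __name__ == "__main__":
    for iri in select_known_people(GRAPH, ME):
        print(iri)
```

A real SPARQL engine adds joins across several patterns, filters, and optional parts, but single-pattern matching is the basic building block.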

Page 23: 20141112 courtot big_datasemwebontologies

The  Web  Ontology  Language  (OWL)  

•  Knowledge representation language
•  Based on Description Logics: fragments of First-Order Logic with decidable and defined computational properties
•  Sound, complete, terminating reasoners available

23  

Page 24: 20141112 courtot big_datasemwebontologies

Overview  

24  

•  Big  Data  –  Big  Data  is  BIG  –  Issues  in  research  

•  Semantic Web – Standards: URIs, RDF, SPARQL, OWL – Linked data

•  Ontologies – Definition and reasoning – OBO Foundry – Example of existing ontologies – Pharmacovigilance – Publishing ontologies on the Semantic Web

•  IRIDA – The IRIDA platform – Adding standards to IRIDA

•  Take  home  message    

Page 25: 20141112 courtot big_datasemwebontologies

Linked  open  data  cloud  

25  

Page 26: 20141112 courtot big_datasemwebontologies

Biological  resources  in  LOD  

26  

Page 27: 20141112 courtot big_datasemwebontologies

Examples  of  issues  in  linking  data  incorrectly  

•  http://dbpedia.org/resource/Welsh owl:sameAs
   <http://sw.cyc.com/2006/07/27/cyc/EthnicGroupOfWelsh>
   <http://sw.cyc.com/2006/07/27/cyc/Welsh-TheWord>
   <http://sw.cyc.com/2006/07/27/cyc/WelshLanguage>
   <http://sw.cyc.com/2006/07/27/cyc/Welshing-Cheating>
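Why this is a problem: owl:sameAs is symmetric and transitive, so asserting it between one DBpedia resource and four Cyc concepts makes all five identifiers denote the same thing. The sketch below illustrates the effect with a small union-find over the sameAs pairs; it is a toy illustration written for these notes, not how any particular reasoner is implemented.

```python
CYC = "http://sw.cyc.com/2006/07/27/cyc/"
WELSH = "http://dbpedia.org/resource/Welsh"

SAME_AS = [
    (WELSH, CYC + "EthnicGroupOfWelsh"),
    (WELSH, CYC + "Welsh-TheWord"),
    (WELSH, CYC + "WelshLanguage"),
    (WELSH, CYC + "Welshing-Cheating"),
]


def same_as_clusters(pairs):
    """Group identifiers into sets that sameAs forces to be identical."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)  # merge the two groups

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


if __name__ == "__main__":
    for cluster in same_as_clusters(SAME_AS):
        print(len(cluster), "identifiers now denote one and the same thing")
```

Once merged, an ethnic group, a word, a language and the act of cheating are indistinguishable to any application that takes the links at face value.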

27  

Page 28: 20141112 courtot big_datasemwebontologies

Overview  

28  

•  Big  Data  –  Big  Data  is  BIG  –  Issues  in  research  

•  Semantic Web – Standards: URIs, RDF, SPARQL, OWL – Linked data

•  Ontologies – Definition and reasoning – OBO Foundry – Example of existing ontologies – Pharmacovigilance – Publishing ontologies on the Semantic Web

•  IRIDA – The IRIDA platform – Adding standards to IRIDA

•  Take  home  message    

Page 29: 20141112 courtot big_datasemwebontologies

Ontologies

•  Representation of important things in a specific domain
   –  Describes types of entities (e.g. cells), relations between them (e.g. prokaryotic cells and eukaryotic cells are cells), and their instances (e.g. the specific cells in my sample)
•  An active computational artifact
   –  A mathematical model based on a subset of first order logic
   –  Tools can automatically process ontologies
•  A communication tool
   –  Provides a dictionary for collaborators, a shared understanding
   –  Allows data sharing

29  

Page 30: 20141112 courtot big_datasemwebontologies

Reasoning is critical

•  Prokaryotic cell and Eukaryotic cell are declared disjoint
•  Fungal cell is a Eukaryotic cell
•  Spore is a Fungal cell and a Prokaryotic cell
⇒  Unsatisfiability
⇒  Solution: clarify as spore (sensu Mycetozoa) AND actinomycete-type spore

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0022006

30
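The contradiction on this slide can be reproduced with a very small check: collect all superclasses of a class and see whether two of them are declared disjoint. The sketch below is a toy stand-in for what an OWL reasoner does, covering only subclass and disjointness axioms; the dictionary layout and function names are assumptions made for the illustration.

```python
# Asserted "is a" links, as on the slide (child -> direct parents).
SUBCLASS_OF = {
    "fungal cell": {"eukaryotic cell"},
    "eukaryotic cell": {"cell"},
    "prokaryotic cell": {"cell"},
    "spore": {"fungal cell", "prokaryotic cell"},
}

# Pairs of classes declared to share no instances.
DISJOINT = [("prokaryotic cell", "eukaryotic cell")]


def ancestors(cls):
    """All superclasses of `cls`, including `cls` itself."""
    seen, stack = {cls}, [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def unsatisfiable_classes():
    """Classes that inherit from two disjoint classes and so can have no instances."""
    names = set(SUBCLASS_OF) | {p for ps in SUBCLASS_OF.values() for p in ps}
    return sorted(c for c in names
                  if any(a in ancestors(c) and b in ancestors(c)
                         for a, b in DISJOINT))


if __name__ == "__main__":
    print(unsatisfiable_classes())
```

In this toy model only "spore" is flagged, because it reaches both "prokaryotic cell" and "eukaryotic cell" through its parents.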

Page 31: 20141112 courtot big_datasemwebontologies

Logics  

•  Simple example based on http://arxiv.org/pdf/1201.4089v1.pdf

•  Ontology file available from http://www.sfu.ca/~mcourtot/course/20141112BigDataSemWebOntologies/ontology.owl

•  Manipulation done using Protégé: http://protege.stanford.edu

    31  

Page 32: 20141112 courtot big_datasemwebontologies

Family    ontology  

32  

Page 33: 20141112 courtot big_datasemwebontologies

Logics  of  a  grandfather  

33  

Page 34: 20141112 courtot big_datasemwebontologies

Reasoning  

34  

Page 35: 20141112 courtot big_datasemwebontologies

Inferred  class  hierarchy  

35  

Page 36: 20141112 courtot big_datasemwebontologies

Explanations

36  

Page 37: 20141112 courtot big_datasemwebontologies

A wrong assertion

37  

Page 38: 20141112 courtot big_datasemwebontologies

Unsatisfiability

38  

Page 39: 20141112 courtot big_datasemwebontologies

Overview  

39  

•  Big  Data  –  Big  Data  is  BIG  –  Issues  in  research  

•  Semantic Web – Standards: URIs, RDF, SPARQL, OWL – Linked data

•  Ontologies – Definition and reasoning – OBO Foundry – Example of existing ontologies – Pharmacovigilance – Publishing ontologies on the Semantic Web

•  IRIDA – The IRIDA platform – Adding standards to IRIDA

•  Take  home  message    

Page 40: 20141112 courtot big_datasemwebontologies

OBO  Foundry  

A subset of biological and biomedical ontologies whose developers have agreed in advance to accept a common set of principles reflecting best practice in ontology development, designed to ensure:

•  tight connection to the biomedical basic sciences
•  compatibility
•  interoperability, common relations
•  formal robustness
•  support for logic-based reasoning

       

40  

Page 41: 20141112 courtot big_datasemwebontologies

41  http://www.obofoundry.org

Page 42: 20141112 courtot big_datasemwebontologies

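[Table: OBO Foundry ontologies by granularity (rows) and relation to time (columns: independent continuant, dependent continuant, occurrent)]

GRANULARITY: ORGAN AND ORGANISM
   Independent continuant: Organism (NCBI Taxonomy?); Anatomical Entity (FMA, CARO)
   Dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
   Occurrent: Organism-Level Process (GO)

GRANULARITY: CELL AND CELLULAR COMPONENT
   Independent continuant: Cell (CL); Cellular Component (FMA, GO)
   Dependent continuant: Cellular Function (GO)
   Occurrent: Cellular Process (GO)

GRANULARITY: MOLECULE
   Independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
   Dependent continuant: Molecular Function (GO)
   Occurrent: Molecular Process (GO)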

Slide  credit:  Barry  Smith    

42  

Page 43: 20141112 courtot big_datasemwebontologies

Minimum Information to Reuse an External Ontology Term (MIREOT)

•  OBO and the Semantic Web promote reuse of resources
•  Biological resources (e.g., FMA for anatomy), taken together, are too big for current tool support.
•  MIREOT is used across the OBO library
   –  OBI: 400 mireoted terms (140 GO, 55 ChEBI, 50 PATO)
   –  PR (Protein Ontology): 23,000 mireoted terms
•  http://ontofox.hegroup.org
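As a rough illustration of what "minimum information" means here: as described in the MIREOT publication, referencing an external term needs the source ontology URI and the source term URI, and integrating it needs the URI of its direct superclass in the target ontology. The sketch below records those three items; the class name, field names and the example values are assumptions made for illustration and do not describe how OBI actually imports this term.

```python
from dataclasses import dataclass

RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"


@dataclass(frozen=True)
class MireotImport:
    """The three pieces of information kept for one imported term."""
    source_ontology_uri: str    # where the term comes from
    source_term_uri: str        # the term itself
    target_superclass_uri: str  # where it is placed in the importing ontology

    def placement_triple(self):
        """The subclass link that positions the term in the target hierarchy."""
        return (self.source_term_uri, RDFS_SUBCLASS_OF, self.target_superclass_uri)


# Illustrative values only: a GO term placed under a BFO class.
example = MireotImport(
    source_ontology_uri="http://purl.obolibrary.org/obo/go.owl",
    source_term_uri="http://purl.obolibrary.org/obo/GO_0008150",
    target_superclass_uri="http://purl.obolibrary.org/obo/BFO_0000015",
)

if __name__ == "__main__":
    print(example.placement_triple())
```

Keeping only these references, instead of importing the whole source ontology, is what keeps the importing ontology small enough for current tools.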

43  

Page 44: 20141112 courtot big_datasemwebontologies

Example  of  OBO  ontologies  

•  OBI, the Ontology for Biomedical Investigations
•  VO, the Vaccine Ontology
•  AERO, the Adverse Event Reporting Ontology

Page 45: 20141112 courtot big_datasemwebontologies

Ontology for Biomedical Investigations (OBI)

•  OBI is a multi-community project driven by the practical needs of its members, with the goal of building a high quality, interoperable reference ontology

•  OBI high level classes are in place (solidified over several years) and cover all aspects of biomedical investigations

•  OBI is expanded to enable member applications and based on term requests

45  

Page 46: 20141112 courtot big_datasemwebontologies

46  

High level class hierarchy (partial)

Slide credit: OBI Consortium

Page 47: 20141112 courtot big_datasemwebontologies

Slide credit: Alan Ruttenberg  47

Page 48: 20141112 courtot big_datasemwebontologies

48  Slide credit: OBI Consortium

Page 49: 20141112 courtot big_datasemwebontologies

49

Representing vaccine data – the Vaccine Ontology (VO)

Picture  credit:  Yongqun  He  

Page 50: 20141112 courtot big_datasemwebontologies

Overview  

50  

•  Big  Data  –  Big  Data  is  BIG  –  Issues  in  research  

•  Semantic Web – Standards: URIs, RDF, SPARQL, OWL – Linked data

•  Ontologies – Definition and reasoning – OBO Foundry – Example of existing ontologies – Pharmacovigilance – Publishing ontologies on the Semantic Web

•  IRIDA – The IRIDA platform – Adding standards to IRIDA

•  Take  home  message    

Page 51: 20141112 courtot big_datasemwebontologies

Representing pharmacovigilance data

•  The Adverse Event Reporting Ontology (AERO)

•  Encodes existing clinical guidelines (Brighton Collaboration)

[Figure: AERO encoding of the Brighton anaphylaxis guideline]
•  'found to exhibit' some 'generalized urticaria or generalized erythema finding' is inferred to be of type "major dermatological criterion for anaphylaxis according to Brighton"
•  'found to exhibit' some 'measured hypotension finding' is inferred to be of type "major cardiovascular criterion for anaphylaxis according to Brighton"
•  "Level 1 of certainty of anaphylaxis according to Brighton" has both criteria as components
•  Node labels in the example panel: Patient examination, Clinician, Patient, exam report of June 7, finding of rash, rash, dermatological system, Medically relevant entity, Anatomical system, Clinical Finding, Clinical Report
•  Edge labels in the example panel: has participant, has specified input, has specified output, is about, about mre, found to exhibit, part of, located in, involves
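The inference pattern in this diagram (findings imply criteria, criteria combine into a level of certainty) can be sketched in a few lines. This is a deliberate simplification that assumes only the two criteria visible in the diagram; the full Brighton case definition contains more criteria and levels, and AERO expresses the logic in OWL rather than in code. All names below are invented for the illustration.

```python
# Findings that trigger each criterion, using the labels from the diagram.
CRITERIA = {
    "major dermatological criterion":
        {"generalized urticaria or generalized erythema finding"},
    "major cardiovascular criterion":
        {"measured hypotension finding"},
}

# Simplifying assumption: Level 1 needs exactly the two criteria shown above.
LEVEL_1_COMPONENTS = set(CRITERIA)


def criteria_met(findings):
    """Return the criteria triggered by at least one of the patient's findings."""
    findings = set(findings)
    return {name for name, triggers in CRITERIA.items() if triggers & findings}


def is_level_1(findings):
    """True when every component criterion of Level 1 is met."""
    return LEVEL_1_COMPONENTS <= criteria_met(findings)


if __name__ == "__main__":
    report = ["measured hypotension finding",
              "generalized urticaria or generalized erythema finding"]
    print(is_level_1(report))
```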

51  

Page 52: 20141112 courtot big_datasemwebontologies

Background  and  problem  statement  

•  Surveillance of Adverse Events Following Immunization is important
   –  Detection of issues with vaccine
   –  Importance of vaccine-risk communication
•  Analysis of AE reports is a subjective process that is costly in time and money
   –  Manual review of the textual reports

52  

Page 53: 20141112 courtot big_datasemwebontologies

Workflow

•  Hypothesis: Use the AERO I developed to annotate and classify a dataset
•  VAERS dataset
   –  Vaccine Adverse Event Reporting System
   –  6032 reports: ~5800 negative, ~230 positive
   –  Post H1N1 immunization 2009/2010
   –  Manually classified for anaphylaxis
•  MedDRA (Medical Dictionary for Regulatory Activities) is used to represent clinical findings

 53  

Page 54: 20141112 courtot big_datasemwebontologies

54  

Automated  Diagnosis  workflow  

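[Figure: workflow diagram with steps labelled A to D. Components: Adverse Event Reporting Ontology (AERO); VAERS dataset (ASCII files, MySQL); Brighton annotations (MySQL), with ~800 MedDRA terms mapped to 32 Brighton terms; OWL/RDF export; reasoner; comparison ("?") with the manually curated dataset.]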

Page 55: 20141112 courtot big_datasemwebontologies

55  

Results  

[Figure: the automated diagnosis workflow diagram, repeated. Components: AERO; VAERS dataset (ASCII files, MySQL); Brighton annotations (~800 MedDRA terms mapped to 32 Brighton terms); OWL/RDF export; reasoner; manually curated dataset.]

At best cut-off point: Sensitivity 57%, Specificity 97%
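For readers new to these two measures, here is how they are computed when an automated classification is compared with a manually curated one. The counts in the example are hypothetical round numbers chosen only so that the formulas yield 57% and 97%; they are not the study's confusion matrix.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of truly positive reports that the classifier flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of truly negative reports that the classifier leaves unflagged."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Hypothetical counts: 100 positive and 100 negative reports.
    print(f"sensitivity = {sensitivity(57, 43):.0%}")
    print(f"specificity = {specificity(97, 3):.0%}")
```

High specificity with moderate sensitivity means few false alarms but a substantial share of true cases missed at that cut-off.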

Page 56: 20141112 courtot big_datasemwebontologies

56  

AE classification can be improved through the use of ontologies

•  Manual analysis: 3 months for 12 medical officers
•  Ontology-based analysis: once data collected (2 months), almost instantaneous (2h on laptop)

=> Could allow for earlier detection of safety issues and better understanding of adverse events

[Figure: timeline from November 2009 to January 2010 plotting ability to detect signal against time for 6000 reports, comparing manual analysis with ontology-based analysis. Time gain: 2h automated vs. 3 months manual.]

http://dx.doi.org/10.1371/journal.pone.0092632

Page 57: 20141112 courtot big_datasemwebontologies

Overview  

57  

•  Big  Data  –  Big  Data  is  BIG  –  Issues  in  research  

•  Semantic Web – Standards: URIs, RDF, SPARQL, OWL – Linked data

•  Ontologies – Definition and reasoning – OBO Foundry – Example of existing ontologies – Pharmacovigilance – Publishing ontologies on the Semantic Web

•  IRIDA – The IRIDA platform – Adding standards to IRIDA

•  Take  home  message    

Page 58: 20141112 courtot big_datasemwebontologies

IRI  dereferencing  

58  

Page 59: 20141112 courtot big_datasemwebontologies

59  

Ontobee: publishing biomedical resources on the Semantic Web

HTML for humans …

Page 60: 20141112 courtot big_datasemwebontologies

… RDF for machines

Ontobee: publishing biomedical resources on the Semantic Web
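One common way to serve both audiences from the same IRI is HTTP content negotiation, where the client states the format it wants in the `Accept` header. The sketch below only builds the two requests and does not contact any server; it assumes the server honours `Accept`, which these slides do not spell out. `build_request` is a hypothetical helper, and the term IRI is just an example of an OBO-style identifier.

```python
from urllib.request import Request

# Example OBO-style term IRI, used here purely for illustration.
TERM_IRI = "http://purl.obolibrary.org/obo/OBI_0000070"


def build_request(iri: str, for_machine: bool) -> Request:
    """Prepare a GET request asking for RDF (machines) or HTML (humans)."""
    accept = "application/rdf+xml" if for_machine else "text/html"
    return Request(iri, headers={"Accept": accept})


if __name__ == "__main__":
    for flag in (False, True):
        req = build_request(TERM_IRI, for_machine=flag)
        print(req.full_url, "Accept:", req.get_header("Accept"))
```

Passing either request to `urllib.request.urlopen` would perform the actual fetch; that step is left out so the example runs offline.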

Page 61: 20141112 courtot big_datasemwebontologies

Overview  

61  

•  Big  Data  –  Big  Data  is  BIG  –  Issues  in  research  

•  Semantic Web – Standards: URIs, RDF, SPARQL, OWL – Linked data

•  Ontologies – Definition and reasoning – OBO Foundry – Example of existing ontologies – Pharmacovigilance – Publishing ontologies on the Semantic Web

•  IRIDA – The IRIDA platform – Adding standards to IRIDA

•  Take  home  message    

Page 62: 20141112 courtot big_datasemwebontologies

The Integrated Rapid Infectious Disease Analysis (IRIDA) project

•  Goal: automate infectious disease outbreak detection and investigation
•  Issues:
   –  Integrate WGS, clinical and lab info
   –  Provide relevant tools and validate pipeline
•  Methods:
   –  Data standards for information exchange
   –  Analysis pipeline (Galaxy based)
   –  User interface
   –  Additional tools:
      •  IslandViewer
      •  GenGIS

62  

Page 63: 20141112 courtot big_datasemwebontologies

63  

Page 64: 20141112 courtot big_datasemwebontologies

Building  the  IRIDA  data  standards  

•  Interview with key personnel at BCCDC
•  Review of existing resources
•  Identify "holes", i.e., missing bits
•  Collect existing data
•  Liaise with implementation team
•  Generate cohesive resource
•  Validate

64  

Page 65: 20141112 courtot big_datasemwebontologies

Relevant data standards

•  TypON, the typing ontology
•  OBI, the Ontology for Biomedical Investigations
•  NGSOnto, Next Generation Sequencing Ontology
•  NIAID GSC-BRC core metadata
•  TRANS, Pathogen Transmission ontology
•  ExO, Exposure Ontology
•  EPO, Epidemiology Ontology
•  IDO, Infectious Disease Ontology
•  Food: USDA, EFSA?

65  

Page 66: 20141112 courtot big_datasemwebontologies

Relevant  internaHonal  efforts  

•  MIxS standard
•  Global Microbial Identifier
•  Global Alliance for Genomics and Health
•  NCBI BioSample
•  European Nucleotide Archive
•  …

66  

Page 67: 20141112 courtot big_datasemwebontologies

Remaining  challenges  

•  Trust, provenance
   –  Ability to track origin of data to assess whether it is trustworthy
•  Data sharing, reuse, policy
   –  Social and legal issues in getting access to data
•  Confidentiality
   –  Privacy concerns when linking data

67  

Page 68: 20141112 courtot big_datasemwebontologies

Overview  

68  

•  Big  Data  –  Big  Data  is  BIG  –  Issues  in  research  

•  Semantic Web – Standards: URIs, RDF, SPARQL, OWL – Linked data

•  Ontologies – Definition and reasoning – OBO Foundry – Example of existing ontologies – Pharmacovigilance – Publishing ontologies on the Semantic Web

•  IRIDA – The IRIDA platform – Adding standards to IRIDA

•  Take  home  message    

Page 69: 20141112 courtot big_datasemwebontologies

Take  home  message  

Big data is a big challenge, but we can deal with it if it is handled properly: that will be your responsibility.

DO NOT build a black box
DO annotate and describe your data
DO make your data openly available

69  

Page 70: 20141112 courtot big_datasemwebontologies

Acknowledgements  

•  Drs. Fiona Brinkman, Will Hsiao, Ryan Brinkman
•  The Brinkman^2 labs
•  Alan Ruttenberg, Barry Smith, Chris Mungall & OBO
•  Colleagues at Public Health Agency of Canada (Ms Lafleche, Dr Law)
•  The IRIDA consortium and the IRIDA ontology working group (Emma Griffiths and Damion Dooley)

70  

Page 71: 20141112 courtot big_datasemwebontologies

71  

Mélanie  Courtot,  PhD  [email protected]  

@mcourtot  http://purl.org/net/mcourtot