Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Bio2RDF presentation at WWW2007 HCLS Workshop. http://bio2rdf.org/www2007/

Page 1: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Towards A Mashup To Build Bioinformatics Knowledge System

François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, Jean Morissette

Département d'informatique et de génie logiciel, Université Laval

Page 2: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Banff, May 8, 2007 CHUL research center - Laval University 2

Presentation Plan
● Knowledge integration vision
● Bio2RDF architecture
● RDFization of knowledge
● Normalization of URI
● Parkinson example
● Demo
● Conclusion

Page 3: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

From the RDF inventor: "Wouldn't it be great if you were able to organize all this information based on your own terms, instead of based on the application you use to access the information?" (1999)

Ramanathan V. Guha

From Wikipedia: Mashup (web application hybrid)

A mashup is a website or application that combines content from more than one source into an integrated experience. (2007)

Page 4: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Sir Berners-Lee's vision of semantic web

Tim Berners-Lee

« The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. » Scientific American, 2001

http://www.w3.org/2006/Talks/0404-mit-tbl/

Page 5: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Bio2RDF starting vision at ISMB 2005

Thanks to Christopher Baker, Eric Neumann, Kei Cheung and Joanne Luciano for their ideas.

● Too many knowledge sources available for life science scientists
● Too many formats (text, XML, HTML)
● New source each day with specialized tool or web interface
● Integration problem recognized by global community

Page 6: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

The knowledge integration problem in bioinformatics

From Carol Goble at ISWC 2005

From the BioPAX group (2004)

Page 7: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Integration methods in bioinformatics

1) Davidson 1995: "Transform data to the federated database on demand"
2) Köhler 2003: "In different databases the same things can be given different names"
3) Stein 2003: "link integration, view integration and data warehousing"

Page 8: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Data warehouse approaches

URLs:
http://www.ncbi.nlm.nih.gov/Database/
http://www.genome.jp/dbget/dbget.links.html

Page 9: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Bio2RDF's approach to knowledge integration:

"Solve the problem of knowledge integration in biology by applying a semantic web approach."

Page 10: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Other semantic web projects

Page 11: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Bio2RDF's design rules

2. Convert document to RDF format;
3. Use of a triplestore technology (sesame, virtuoso, oracle);
4. Normalize URIs;
5. Build a mashup as needed to answer specific question (elmo);
6. Query the mashup with SeRQL or SPARQL.

Page 12: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Bio2RDF's architecture

[Figure: architecture diagram with components labelled #1 to #6]

Page 13: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Bio2RDF's knowledge sources

Page 14: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

RDF conversion statistics

Data source   LSID example                Number of RDF documents   Size of data converted
go            go:0000001                                   22 961              507 963 321
kegg          path:aae00010                                35 257            1 038 593 137
kegg          cpd:c00001                                   14 292                8 902 205
mgi           mgi:96103                                   438 724              210 458 897
ncbi          omim:100050                                  17 359              573 639 380
ncbi          geneid:1                                  2 744 786           67 225 535 082
obo           obo's 59 namespaces                         279 720              216 007 267
pdb           pdb:100d                                     34 421           16 309 651 935
uniprot       uniprot:A0A000                            4 177 176           29 453 203 064
uniprot       enzyme:1.-.-.-                                5 020                2 844 058
uniprot       pubmed:100133                               191 664              364 728 083
uniprot       taxonomy:10                                 337 564              125 630 659
uniprot       uniref:UniRef100_A0A000                   7 990 452           14 865 490 144
…             …                                                 …                        …

Page 15: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

OpenRDF's software

http://www.openrdf.org/

Page 16: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

RDF of geneid:15275

• rdf:about
• rdfs:label
• dc:identifier, title, created
• bio2rdf:lsid
• bio2rdf:url
• bio2rdf:synonym
• bio2rdf:xRef

Page 17: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

RDFizer

efetch rdfizer

To rdfize: To convert existing document into RDF format.

Page 18: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

How to rdfize

• From HTML pages (prosite:ps00101)
• From XML documents using XSLT (path:mmu00010)
• From XML documents using XPath and JSTL (geneid:15275)
• From direct SQL access (ensembl:ensmusg00000025875)
• From RDF document (uniprot:p26838)
• From Text files (cpd:c00001)
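
As a rough illustration of what an rdfizer does with a text file, the sketch below turns a small "KEY value" record into N-Triples under a bio2rdf.org URI. The record layout, the `rdfize_text` function and the sample values are hypothetical and chosen only for the example; the `http://bio2rdf.org/namespace:id` URI shape and the `bio2rdf#xRef` predicate are the ones used elsewhere in this deck.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
BIO2RDF_XREF = "http://bio2rdf.org/bio2rdf#xRef"
BASE = "http://bio2rdf.org/"


def rdfize_text(namespace, record):
    """Convert a 'KEY value' text record into a list of N-Triples lines."""
    fields = {}
    for line in record.strip().splitlines():
        key, _, value = line.partition(" ")
        fields.setdefault(key.strip().upper(), []).append(value.strip())

    # The subject URI is lowercased, following the deck's normalization rule.
    subject = f"<{BASE}{namespace}:{fields['ENTRY'][0].lower()}>"
    triples = []
    for name in fields.get("NAME", []):
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} <{RDFS_LABEL}> "{escaped}" .')
    for xref in fields.get("DBLINK", []):
        # Cross references become links to other bio2rdf.org URIs.
        triples.append(f"{subject} <{BIO2RDF_XREF}> <{BASE}{xref.lower()}> .")
    return triples


# Hypothetical record layout (assumption), loosely modelled on a flat file.
sample = """
ENTRY C00001
NAME H2O
DBLINK chebi:15377
"""
for line in rdfize_text("cpd", sample):
    print(line)
```

The same shape (parse the source, mint a subject URI, emit one triple per field) would apply to the HTML, XML and SQL cases, with only the parsing step changing.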

Page 19: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

1) prosite:ps00101 from HTML using a regex
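
The slide's screenshot is not part of this transcript, so the following is only a minimal sketch of regex-based extraction from an HTML page. The HTML fragment, the two patterns, the placeholder label and the placeholder accessions are all assumptions made for illustration; the real PROSITE page layout may differ.

```python
import re

# Hypothetical HTML fragment (assumption): label and accessions are placeholders.
html = """
<html><head><title>PS00101 EXAMPLE_PATTERN</title></head>
<body><a href="/uniprot/X00001">X00001</a> <a href="/uniprot/X00002">X00002</a></body></html>
"""

TITLE_RE = re.compile(r"<title>\s*(PS\d{5})\s+([^<]+?)\s*</title>", re.IGNORECASE)
LINK_RE = re.compile(r'href="/uniprot/([A-Z0-9]+)"')


def scrape(page):
    """Return (accession, label, cross references) found in an HTML page."""
    match = TITLE_RE.search(page)
    if match is None:
        raise ValueError("no PROSITE accession found")
    accession, label = match.group(1).lower(), match.group(2)
    # Each linked accession becomes a lowercase 'namespace:id' reference.
    xrefs = [f"uniprot:{acc.lower()}" for acc in LINK_RE.findall(page)]
    return accession, label, xrefs


print(scrape(html))
```

Regex scraping is brittle against layout changes, which is presumably why the deck lists it next to XSLT, XPath and SQL alternatives for sources that offer structured access.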

Page 20: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

2) Kegg's path:mmu00010 from XML using XSL

Page 21: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

3) ensembl:ensmusg00000025875 from SQL

Page 22: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

4) uniprot:p26838 from RDF using SeRQL

Page 23: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

One reality, many names

● Different namespace identifier
  pubmed:11992264 vs pmid:11992264
● Uppercase and lowercase
  uniprot:p26838 vs uniprot:P26838
● Version number
  genbank:ac008393 vs genbank:ac008393.7
● Total id length
  go:0032283 vs go:32283

Page 24: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

RDFizing document is not enough, we also need normalized URIs.

http://bio2rdf.org/namespace:id

http://bio2rdf.org/pubmed:11992264
http://bio2rdf.org/uniprot:p26838
http://bio2rdf.org/genbank:ac008393
http://bio2rdf.org/go:0032283

Page 25: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

URI Normalization rules

● Different namespace identifier
  We resolve namespace synonymy with a urlrewrite rule, for example pubmed and pmid.
● Uppercase and lowercase
  We write every URI in lowercase.
● Version number
  An owl:sameAs predicate is used to link the different versions of a document.
● Total id length
  A fixed length is determined for id.
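
The four rules can be sketched as a single function. This is a simplified illustration, not the project's code: the lookup tables are assumptions seeded only with the examples from the slides (pmid/pubmed, 7-digit GO ids, versioned genbank ids), and namespace synonymy is resolved here by direct renaming rather than through a urlrewrite rule.

```python
import re

# Assumed lookup tables: only the pmid/pubmed synonymy, the 7-digit GO id and
# the versioned genbank id come from the slides; real tables would be larger.
NAMESPACE_SYNONYMS = {"pmid": "pubmed"}
FIXED_ID_LENGTH = {"go": 7}
VERSIONED_NAMESPACES = {"genbank"}
VERSION_SUFFIX = re.compile(r"^(.+)\.(\d+)$")
BASE = "http://bio2rdf.org/"
OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"


def normalize(identifier):
    """Return (normalized URI, owl:sameAs triples) for a 'namespace:id' string."""
    # Rule: every URI is written in lowercase.
    namespace, _, local_id = identifier.strip().lower().partition(":")
    # Rule: namespace synonyms collapse to one preferred namespace.
    namespace = NAMESPACE_SYNONYMS.get(namespace, namespace)
    # Rule: ids are left-padded with zeros to a fixed length where one is defined.
    if namespace in FIXED_ID_LENGTH:
        local_id = local_id.zfill(FIXED_ID_LENGTH[namespace])
    uri = f"{BASE}{namespace}:{local_id}"
    # Rule: a versioned id is linked to its unversioned form with owl:sameAs.
    same_as = []
    match = VERSION_SUFFIX.match(local_id)
    if namespace in VERSIONED_NAMESPACES and match:
        same_as.append((uri, OWL_SAMEAS, f"{BASE}{namespace}:{match.group(1)}"))
    return uri, same_as


for raw in ["PMID:11992264", "uniprot:P26838", "genbank:ac008393.7", "go:32283"]:
    print(raw, "->", normalize(raw))
```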

Page 26: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Url Rewrite Filter
http://tuckey.org/urlrewrite/

<rule>
    <from>^/search:(.*?)@pubmed</from>
    <to>/rdfizer/ncbi-entrez2rdf.jsp?db=pubmed&amp;query=$1</to>
</rule>
<rule>
    <from>^/pubmed:(.*)</from>
    <to>/rdfizer/ncbi-pubmed2rdf.jsp?id=$1</to>
</rule>
<rule>
    <from>^/pmid:(.*)</from>
    <to>/rdfizer/lsid-sameas2rdf.jsp?from=pmid:$1&amp;to=pubmed:$1</to>
</rule>
<rule>
    <from>^/(.*):(.*)</from>
    <to type="redirect">http://bio2rdf.org/$1:$2</to>
</rule>
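
To make the effect of the first three rules concrete, here is a small Python re-creation of the pattern-to-target mapping. It is only an illustration: the patterns and JSP targets are transcribed as best they can be read from the slide, the `rewrite` helper is hypothetical, and it stops at the first matching rule, which is a simplification of how the real filter processes its rule list.

```python
import re

# Patterns and targets as read from the slide (the JSP names are a best-effort
# reading); the dispatcher is an illustration, not the UrlRewriteFilter engine.
RULES = [
    (r"^/search:(.*?)@pubmed", r"/rdfizer/ncbi-entrez2rdf.jsp?db=pubmed&query=\1"),
    (r"^/pubmed:(.*)", r"/rdfizer/ncbi-pubmed2rdf.jsp?id=\1"),
    (r"^/pmid:(.*)", r"/rdfizer/lsid-sameas2rdf.jsp?from=pmid:\1&to=pubmed:\1"),
]


def rewrite(path):
    """Return the target of the first matching rule, or the path unchanged."""
    for pattern, target in RULES:
        match = re.match(pattern, path)
        if match:
            # expand() substitutes the captured groups into the target template.
            return match.expand(target)
    return path


for path in ["/search:parkinson@pubmed", "/pubmed:11992264", "/pmid:11992264"]:
    print(path, "->", rewrite(path))
```

Routing every public `namespace:id` URL through such rules keeps the rdfizer endpoints an implementation detail behind stable URIs.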

Page 27: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

URL vs LSID

http://bio2rdf.org/uniprot:p26838
owl:sameAs
urn:lsid:uniprot.org:uniprot:p26838

http://bio2rdf.org/uniprot:p26838

http://bio2rdf.org/urn:lsid:uniprot.org:uniprot:p26838
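
The correspondence between the two identifier forms can be sketched as a pair of string conversions. The function names are hypothetical, and the authority (here `uniprot.org`, from the slide's example) has to be supplied by the caller because the slides do not say how it is chosen for other namespaces.

```python
BASE = "http://bio2rdf.org/"


def to_lsid(uri, authority):
    """Build the LSID counterpart of a Bio2RDF URI for a given authority."""
    if not uri.startswith(BASE):
        raise ValueError(f"not a Bio2RDF URI: {uri}")
    namespace, _, local_id = uri[len(BASE):].partition(":")
    return f"urn:lsid:{authority}:{namespace}:{local_id}"


def lsid_to_uri(lsid):
    """Map an LSID back to a Bio2RDF URI, dropping the authority part."""
    parts = lsid.split(":")
    if len(parts) < 5 or parts[:2] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {lsid}")
    # parts[3] is the namespace; anything after it is kept as the local id.
    return f"{BASE}{parts[3]}:{':'.join(parts[4:])}"


lsid = to_lsid("http://bio2rdf.org/uniprot:p26838", "uniprot.org")
print(lsid)
print(lsid_to_uri(lsid))
```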

Page 28: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Our method to answer question

To answer a very specialized question, we build a specific knowledge base (the mashup stored in an RDF triplestore) and then query it with SeRQL.

Page 29: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Parkinson examples

1. What is the semantic network of OMIM records describing Parkinson's disease?
2. Which MeSH terms are most cited in Parkinson's disease publications?
3. What genes related to Parkinson's disease are involved in pathways according to Kegg?

Page 30: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Time for demo!

Page 31: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

The big everything about parkinson

http://localhost:8080/bio2rdf/search:parkinson@omim
http://localhost:8080/bio2rdf/search:parkinson@geneid
http://localhost:8080/bio2rdf/search:parkinson@uniprot
http://localhost:8080/bio2rdf/search:parkinson@kegg
http://localhost:8080/bio2rdf/load:pubmed
http://localhost:8080/bio2rdf/sameas:hsa-geneid
http://localhost:8080/bio2rdf/learn:geneid
http://localhost:8080/bio2rdf/load:cpd
http://localhost:8080/bio2rdf/load:reactome
http://localhost:8080/bio2rdf/load:biopax-xref
http://localhost:8080/bio2rdf/load:chebi
http://localhost:8080/bio2rdf/load:obo-xref
http://localhost:8080/bio2rdf/sameas:keggcompound-cpd

1.700 Ktriples
97 Mbytes in turtle format
in 90 minutes

Page 32: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Third example: SeRQL query

What genes related to Parkinson's disease are involved in pathways according to Kegg?

SELECT
    GeneticDisorder-label, Gene-label, pathway-label
FROM
    {GeneticDisorder} rdf:type {<http://bio2rdf.org/omim#GeneticDisorder>},
    {GeneticDisorder} rdfs:label {GeneticDisorder-label},
    {GeneticDisorder} <http://www.w3.org/2002/07/owl#sameAs> {sameAs},
    {Gene} <http://bio2rdf.org/bio2rdf#xRef> {sameAs},
    {Gene} rdfs:label {Gene-label},
    {Gene2} <http://www.w3.org/2000/01/rdf-schema#seeAlso> {Gene},
    {xobject} <http://bio2rdf.org/kegg#xobject> {Gene2},
    {xentry1} <http://bio2rdf.org/kegg#xentry1> {xobject},
    {pathway} <http://bio2rdf.org/kegg#xrelation> {xentry1},
    {pathway} rdfs:label {pathway-label}
WHERE
    GeneticDisorder-label like "*PARKINSON*"
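
The query works by chaining path expressions that share variables, so that each pattern narrows the bindings left by the previous one. The toy matcher below shows that idea on a shortened version of the chain (disorder, sameAs, xRef, gene). It is a sketch only: the `match` function is not how Sesame evaluates SeRQL, the prefixed names stand in for full URIs, and every triple in the data set is made up.

```python
def match(triples, patterns, binding=None):
    """Yield variable bindings satisfying every (s, p, o) pattern.

    Terms starting with '?' are variables; everything else is a constant.
    """
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        candidate = dict(binding)
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or reject if it is already bound elsewhere.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield from match(triples, rest, candidate)


# Made-up triples (assumption): identifiers and labels are placeholders.
triples = [
    ("ex:disorder1", "rdfs:label", "PARKINSON DISEASE"),
    ("ex:disorder1", "owl:sameAs", "ex:record1"),
    ("ex:gene1", "bio2rdf:xRef", "ex:record1"),
    ("ex:gene1", "rdfs:label", "example gene 1"),
    ("ex:disorder2", "rdfs:label", "OTHER DISORDER"),
    ("ex:disorder2", "owl:sameAs", "ex:record2"),
    ("ex:gene2", "bio2rdf:xRef", "ex:record2"),
    ("ex:gene2", "rdfs:label", "example gene 2"),
]
patterns = [
    ("?disorder", "rdfs:label", "?dlabel"),
    ("?disorder", "owl:sameAs", "?same"),
    ("?gene", "bio2rdf:xRef", "?same"),
    ("?gene", "rdfs:label", "?glabel"),
]
# The list filter plays the role of the WHERE ... like "*PARKINSON*" clause.
results = [
    (b["?dlabel"], b["?glabel"])
    for b in match(triples, patterns)
    if "PARKINSON" in b["?dlabel"]
]
print(results)
```

The shared `?same` variable is what joins the OMIM side to the gene side, which is why normalized URIs matter: two sources only join if they name the same record identically.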

Page 33: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Query result

Page 34: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Conclusion

Page 35: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Before Bio2RDF integration

Page 36: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Our main results

● RDF is a framework that enables a very simple thing: scalability of the knowledge base complexity.

● The Bio2RDF project proposes to keep complexity in the bioinformatics knowledge space under control by applying this proven semantic web approach.

Page 37: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Now with Bio2RDF semantic integration

Page 38: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Bio2RDF's vision of knowledge map

Page 39: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Bio2RDF's map of distributed bioinformatics knowledge

http://bio2rdf.org/bio2rdf-2007-02.owl

Page 40: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Map of semantic resource

Page 41: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Montreal's subway map

Page 42: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Bio2RDF's actual knowledge map

Page 43: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Achievement

● Public data + open source software + rdf technology + rdfizer + normalized URIs = Bio2RDF knowledge integration;
● A bioinformatic integration ontology won't exist if it is not adopted by the community, bio2rdf.owl is just a proposed starting point;
● 46 million RDF documents are now available at http://bio2rdf.org.

Page 44: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Bio2RDF project provides open source RDFizer to the community.

So much still needs to be rdfized, if you are interested to contribute, join us!

Now let's build the big knowledge map of bioinformatics…

Page 45: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Final words

Please tell Sir Tim Berners-Lee that he was right, semantic web in bioinformatics is a 'killer app' to illustrate all the potential of the semantic web.

And also tell Mark Wilkinson that semantic web in bioinformatics won't be full of creeps if we organize it like we did…

Page 46: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Thanks

Jean Morissette
Nicole Tourigny
Philippe Rigault
Bioinformatics lab's team at CHUL Research Center
Many open source communities (OpenRDF, Simile's project, Tomcat, JSTL and many more)
W3C Bio-RDF Group
Génome Québec
Génome Canada

Page 47: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System

Visit http://bio2rdf.org

Download http://sourceforge.net/projects/bio2rdf/

Discover http://bio2rdf.org/bio2rdf-2007-02.owl

Contact us at [email protected]