Bio2RDF poster for Biocurator 2014 conference


Upload: francois-belleau

Post on 08-Jul-2015



Context

What is known about PARP family proteins involved in Reactome pathways? To answer this question, we propose building a semantic mashup with two open source tools: the OpenLink Virtuoso triplestore and Talend Open Studio for data integration.

Our goal is to help solve the data integration problem, a reality in bioinformatics. Taverna and Galaxy workflows have been very successful in addressing this problem, but they still lack support for Semantic Web technologies such as RDF and SPARQL. BioMart has also been successful by offering a global model to share and query data. The Bio2RDF project has the same goal, but it instead uses a Semantic Web strategy based on the distributed RDF graph of Linked Data and public SPARQL endpoints.

Methodology

To implement our strategy, we added Semantic Web technology and Life Science linked data sources to Talend. We created two collections of components. The first, Talend4SW, integrates the Virtuoso triplestore into Talend and offers simple utilities to transform RDF data. The second, Talend4Bio2RDF, is used to fetch RDF data from Life Science SPARQL endpoints. Connected together in a workflow, these components query the Bio2RDF release 2 endpoints, the UniProt REST service and EBI’s SPARQL endpoints. They all consume the new Bio2RDF REST services available at http://bio2rdf.org.

Using these components to build a Talend workflow, we populate a triplestore by fetching RDF data directly from the web. Each triple is stored in a local Virtuoso triplestore, which is then queried using SPARQL to discover new URIs to dereference. At the end, we have the data needed to answer our initial question, and a final SPARQL query returns the answer.
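
The fetch, store and discover loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Talend workflow itself: the `crawl`, `fetch` and `should_follow` names are hypothetical, an in-memory set stands in for the Virtuoso triplestore, and a simple predicate on each triple stands in for the SPARQL discovery query.

```python
from collections import deque
from typing import Callable, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def crawl(seeds: Iterable[str],
          fetch: Callable[[str], Iterable[Triple]],
          should_follow: Callable[[Triple], bool],
          max_uris: int = 100) -> Set[Triple]:
    """Dereference URIs breadth-first and keep every triple retrieved."""
    store: Set[Triple] = set()   # stand-in for the local triplestore
    seen: Set[str] = set()       # URIs already dereferenced
    queue = deque(seeds)
    while queue and len(seen) < max_uris:
        uri = queue.popleft()
        if uri in seen:
            continue
        seen.add(uri)
        for triple in fetch(uri):
            store.add(triple)
            # Hypothetical discovery rule: the object of a selected triple
            # becomes a new URI to dereference.
            if should_follow(triple) and triple[2] not in seen:
                queue.append(triple[2])
    return store
```

Passing `fetch` as a parameter keeps the loop independent of how the RDF is actually retrieved, so the same sketch works with a REST call, a SPARQL endpoint or a local test fixture.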

Results

This semantic workflow instantiates the database needed to answer the initial query in a few steps. The result: PARP1_HUMAN is the only protein of the PARP family present in Reactome’s pathways.

These new Talend components can be imported from Talend Exchange: http://www.talendforge.org/exchange. The Talend workflow used to answer the PARP question can be downloaded from myExperiment: http://www.myexperiment.org/workflows/4050.html

Building mashup from Linked Data using Bio2RDF’s Talend components

François Belleau, Vincent Emonet, Arnaud Droit

Centre de Biologie Computationnelle, Centre de recherche du CHUQ

The PI of this project is Dr Arnaud Droit, Director of the Centre de Biologie Computationnelle of the CRCHUQ at Université Laval.

http://bio2rdf.org

The tBio2RDFRequest component fetches an RDF graph from the Bio2RDF describe, links and search REST services. The result is available in different formats.

The tNtriplesTemplate component generates N-Triples from the incoming data flow using a text template. Here it creates the owl:sameAs triples needed to connect Bio2RDF resources to their UniProt counterparts, because the two use different URI patterns.
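
To illustrate the idea of a text template producing owl:sameAs triples, here is a minimal Python sketch. It is not the component's code. The two URI patterns (`http://bio2rdf.org/uniprot:<accession>` and `http://purl.uniprot.org/uniprot/<accession>`) are assumptions about how each project names a protein, and `same_as_triples` is a hypothetical helper.

```python
from string import Template

# Assumed URI patterns for the same protein in Bio2RDF and in UniProt.
SAME_AS_TEMPLATE = Template(
    "<http://bio2rdf.org/uniprot:$acc> "
    "<http://www.w3.org/2002/07/owl#sameAs> "
    "<http://purl.uniprot.org/uniprot/$acc> ."
)


def same_as_triples(rows):
    """Render one N-Triples line per incoming row (a dict with an 'acc' key)."""
    return [SAME_AS_TEMPLATE.substitute(acc=row["acc"]) for row in rows]


if __name__ == "__main__":
    # P09874 is the UniProt accession of PARP1_HUMAN.
    for line in same_as_triples([{"acc": "P09874"}]):
        print(line)
```

Each row of the data flow yields exactly one line, so the output can be loaded into a triplestore as ordinary N-Triples.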

The tDerefrencableURI component fetches a graph using its URI. Here it dereferences UniProt URIs for proteins and keywords.
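
Dereferencing a Linked Data URI usually means an HTTP GET with an Accept header asking for an RDF serialization. The sketch below shows that pattern with the Python standard library. It is an assumption about how such a step could work, not the component's implementation, and `build_request` and `dereference` are hypothetical names.

```python
import urllib.request

# Ask for RDF/XML first and Turtle as a fallback (content negotiation).
RDF_ACCEPT = "application/rdf+xml, text/turtle;q=0.9"


def build_request(uri: str) -> urllib.request.Request:
    """Prepare a GET request that asks the server for an RDF representation."""
    return urllib.request.Request(uri, headers={"Accept": RDF_ACCEPT})


def dereference(uri: str, timeout: float = 30.0) -> bytes:
    """Fetch the RDF document that describes the resource named by `uri`."""
    with urllib.request.urlopen(build_request(uri), timeout=timeout) as resp:
        return resp.read()
```

Calling `dereference("http://purl.uniprot.org/uniprot/P09874")` would need network access; the returned bytes still have to be parsed by an RDF library before loading.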

The tEBIRequest component sends queries to EBI’s new SPARQL endpoints; here it fetches Reactome data.
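
Any SPARQL endpoint that follows the SPARQL 1.1 Protocol accepts a query as the `query` parameter of an HTTP GET. The sketch below shows that request shape in Python. The endpoint URL is left as a parameter because the exact EBI address is not given here, and `sparql_request` and `run_select` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request


def sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_select(endpoint: str, query: str, timeout: float = 60.0) -> list:
    """Send a SELECT query and return the list of result bindings."""
    with urllib.request.urlopen(sparql_request(endpoint, query),
                                timeout=timeout) as resp:
        return json.load(resp)["results"]["bindings"]
```

Keeping request construction separate from execution makes it possible to log the exact URL being fetched before any network call is made.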

This final complex query answers the question by linking together data from the HGNC, UniProt and Reactome databases the Linked Data way.
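
The kind of join such a query performs can be shown on a toy example. The Python sketch below is not the poster's SPARQL query. It assumes triples are plain tuples, uses made-up placeholder predicates, and follows owl:sameAs links from a family member to the resource that carries pathway annotations.

```python
SAME_AS = "owl:sameAs"


def proteins_in_pathways(triples, family_pred, family, pathway_pred):
    """Map each family member to its pathways, following owl:sameAs links."""
    # Step 1: resources declared as members of the family.
    members = {s for s, p, o in triples if p == family_pred and o == family}
    # Step 2: owl:sameAs bridges between the two URI patterns.
    alias = {s: o for s, p, o in triples if p == SAME_AS}
    result = {}
    for protein in members:
        target = alias.get(protein, protein)
        # Step 3: pathway annotations attached to the linked resource.
        pathways = {o for s, p, o in triples
                    if s == target and p == pathway_pred}
        if pathways:
            result[protein] = pathways
    return result


if __name__ == "__main__":
    # Placeholder data only; none of these identifiers are real.
    toy = [
        ("ex:p1", "ex:family", "ex:famA"),
        ("ex:p2", "ex:family", "ex:famA"),
        ("ex:p1", SAME_AS, "ex:u1"),
        ("ex:p2", SAME_AS, "ex:u2"),
        ("ex:u1", "ex:pathway", "ex:pwX"),
    ]
    print(proteins_in_pathways(toy, "ex:family", "ex:famA", "ex:pathway"))
```

In a triplestore the three steps collapse into one SPARQL graph pattern; spelling them out only shows why the owl:sameAs triples are needed for the join to succeed.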

The execution process can be monitored by looking at the URIs used to fetch RDF data from the web. The table shows the number of triples loaded in the previous run.

Because Talend is a complete ETL solution, results can easily be exported to an Excel spreadsheet for analysis.

Our team can help you add your own curated database to this RDF Linked Data project based on open source software. Your project can then join the Semantic Web of Life Sciences resources.