SC1 - Hangout 2: The Open PHACTS Pilot

Post on 20-Jan-2017

BIG DATA EUROPE H2020 CSA (2015-17) SOCIETAL CHALLENGE “HEALTH”

Integrating Big Data, Software & Communities for Addressing Europe’s Societal Challenges

06.07.2016

BigDataEurope

Today:
•  Short overview of Big Data Europe: Ronald Siebes
•  What is Open PHACTS: Stian Soiland-Reyes, Bryn Williams-Jones
•  The Big Data Europe infrastructure: Erika Pauwels, Aad Versteden
•  Pilot 1: The Open PHACTS docker: Stian Soiland-Reyes
•  Q&A

Stian Soiland-Reyes, BioExcel and University of Manchester

Ronald Siebes, VU Amsterdam

Erika Pauwels, Tenforce

Aad Versteden, Tenforce

Bryn Williams-Jones, Open PHACTS Foundation

Big Data Europe

www.big-data-europe.eu

Partners:

Q&A

Open PHACTS: Architecture and Docker install

Stian Soiland-Reyes, University of Manchester
http://orcid.org/0000-0001-9842-9718

@soilandreyes

This work is licensed under a Creative Commons Attribution 4.0 International License.

Big Data Europe Webinar, 2016-07-06

This work has been done as part of the BioExcel CoE (www.bioexcel.eu), a project funded by the EC H2020 program, contract number EINFRA-5-2015 675728.

https://slides.com/soilandreyes/2016-07-06-openphacts

http://www.openphacts.org/

Bringing together pharmacological data resources in an integrated, interoperable infrastructure

Data sources integrated and linked together so that you can easily see the relationships between compounds, targets, pathways, diseases and tissues.

ChEBI, ChEMBL, ChemSpider, ConceptWiki, DisGeNET, DrugBank, FAERS, Gene Ontology, neXtProt, SureChEMBL, UniProt, WikiPathways

Data integration

https://www.openphacts.org/2/sci/data.html

https://dev.openphacts.org/docs/2.1

Re-exposed as public API
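The sample responses in this deck embed their own request URL, which suggests that a compound lookup is a plain HTTP GET with a percent-encoded `uri` parameter plus `app_id`, `app_key` and `_format`. A minimal sketch of building such a URL follows; the function name and the placeholder credentials are assumptions for illustration, and the authoritative parameter list is in the API documentation at https://dev.openphacts.org/docs/2.1.

```python
from urllib.parse import urlencode

# Base URL as it appears in the sample responses (version 1.5 on the beta host).
API_BASE = "https://beta.openphacts.org/1.5"


def compound_url(concept_uri: str, app_id: str, app_key: str, fmt: str = "json") -> str:
    """Build a compound-information request URL (hypothetical helper)."""
    query = urlencode({
        "uri": concept_uri,   # concept to look up; urlencode percent-encodes it
        "app_id": app_id,     # placeholder credentials, replace with your own
        "app_key": app_key,
        "_format": fmt,       # json, xml or ttl, the formats shown in this deck
    })
    return f"{API_BASE}/compound?{query}"


url = compound_url(
    "http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5",
    app_id="YOUR_APP_ID",
    app_key="YOUR_APP_KEY",
)
print(url)
```

Switching `fmt` to `"xml"` or `"ttl"` requests the same record in the other two serializations.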

{
  "format": "linked-data-api",
  "version": "1.5",
  "result": {
    "_about": "https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2F38932552-111f-4a4e-a46a-4ed1d7bdf9d5&app_id=161aeb7d&app_key=bbcba81896020f0b95e3dd35b55e3345&_format=json",
    "definition": "https://beta.openphacts.org/api-config",
    "extendedMetadataVersion": "https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2F38932552-111f-4a4e-a46a-4ed1d7bdf9d5&app_id=161aeb7d&app_key=bbcba81896020f0b95e3dd35b55e3345&_format=json&_metadata=all%2Cviews%2Cformats%2Cexecution%2Cbindings%2Csite",
    "linkPredicate": "http://www.w3.org/2004/02/skos/core#exactMatch",
    "activeLens": "Default",
    "primaryTopic": {
      "_about": "http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5",
      "inDataset": "http://www.conceptwiki.org",
      "exactMatch": [
        {
          "_about": "http://bio2rdf.org/drugbank:DB00398",
          "description_en": "Sorafenib (rINN), marketed as Nexavar by Bayer, is a drug approved for the treatment of advanced renal cell carcinoma (primary kidney cancer). It has also received \"Fast Track\" designation by the FDA for the treatment of advanced hepatocellular carcinoma (primary liver cancer), and has since performed well in Phase III trials.\nSorafenib is a small molecular inhibitor of Raf kinase, PDGF (platelet-derived growth factor), VEGF receptor 2 & 3 kinases and c Kit the receptor for Stem cell factor. A growing number of drugs target most of these pathways. The originality of Sorafenib lays in its simultaneous targeting of the Raf/Mek/Erk pathway.",
          "description": "Sorafenib (rINN), marketed as Nexavar by Bayer, is a drug approved for the treatment of advanced renal cell carcinoma (primary kidney cancer). It has also received \"Fast Track\" designation by the FDA for the treatment of advanced hepatocellular carcinoma (primary liver cancer), and has since performed well in Phase III trials.\nSorafenib is a small molecular inhibitor of Raf kinase, PDGF (platelet-derived growth factor), VEGF receptor 2 & 3 kinases and c Kit the receptor for Stem cell factor. A growing number of drugs target most of these pathways. The originality of Sorafenib lays in its simultaneous targeting of the Raf/Mek/Erk pathway.",
          "drugType_en": [ "investigational", "approved" ],
          "drugType": [ "investigational", "approved" ],
          "genericName_en": "Sorafenib",
          "genericName": "Sorafenib",
          "metabolism_en": "Sorafenib is metabolized primarily in the liver, undergoing oxidative metabolism, mediated by CYP3A4, as well as glucuronidation mediated by UGT1A9. Sorafenib accounts for approximately 70-85% of the circulating analytes in plasma at steady-state. Eight metabolites of sorafenib have been identified, of which five have been detected in plasma. The main circulating metabolite of sorafenib in plasma, the pyridine N-oxide, shows in vitro potency similar to that of sorafenib. This metabolite comprises approximately 9-16% of circulating analytes at steady-state.",
          "metabolism": "Sorafenib is metabolized primarily in the liver, undergoing oxidative metabolism, mediated by CYP3A4, as well as glucuronidation mediated by UGT1A9. Sorafenib accounts for approximately 70-85% of the circulating analytes in plasma at steady-state. Eight metabolites of sorafenib have been identified, of which five have been detected in plasma. The main circulating metabolite of sorafenib in plasma, the pyridine N-oxide, shows in vitro potency similar to that of sorafenib. This metabolite comprises approximately 9-16% of circulating analytes at steady-state.",
          "proteinBinding_en": "99.5% bound to plasma proteins.",
          "proteinBinding": "99.5% bound to plasma proteins.",
          "toxicity_en": "The highest dose of sorafenib studied clinically is 800 mg twice daily. The adverse reactions observed at this dose were primarily diarrhea and dermatologic events. No information is available on symptoms of acute overdose in animals because of the saturation of absorption in oral acute toxicity studies conducted in animals.",
          "toxicity": "The highest dose of sorafenib studied clinically is 800 mg twice daily. The adverse reactions observed at this dose were primarily diarrhea and dermatologic events. No information is available on symptoms of acute overdose in animals because of the saturation of absorption in oral acute toxicity studies conducted in animals.",
          "inDataset": "http://www.openphacts.org/bio2rdf/drugbank",
          ...

<?xml version="1.0" encoding="utf-8"?>
<result format="linked-data-api" version="1.5" href="https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2F38932552-111f-4a4e-a46a-4ed1d7bdf9d5&app_id=161aeb7d&app_key=bbcba81896020f0b95e3dd35b55e3345&_format=xml">
  <primaryTopic href="http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5">
    <prefLabel xml:lang="en">Sorafenib</prefLabel>
    <exactMatch>
      <item href="http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1336">
        <type href="http://rdf.ebi.ac.uk/terms/chembl#SmallMolecule"/>
        <inDataset href="http://www.ebi.ac.uk/chembl"/>
        <mw_freebase datatype="double">464.82</mw_freebase>
      </item>
      <item href="http://ops.rsc.org/OPS379634">
        <smiles>CNC(=O)C1=NC=CC(=C1)OC2=CC=C(C=C2)NC(=O)NC3=CC(=C(C=C3)Cl)C(F)(F)F</smiles>
        <rtb datatype="double">5.0</rtb>
        <ro5_violations datatype="double">1.0</ro5_violations>
        <psa datatype="double">92.35</psa>
        <molweight datatype="double">464.825</molweight>
        <molformula>C21H16ClF3N4O3</molformula>
        <logp datatype="double">5.158</logp>
        <inchikey>MLDQJTXFUGDVEO-UHFFFAOYSA-N</inchikey>
        <inchi>InChI=1S/C21H16ClF3N4O3/c1-26-19(30)18-11-15(8-9-27-18)32-14-5-2-12(3-6-14)28-20(31)29-13-4-7-17(22)16(10-13)21(23,24)25/h2-11H,1H3,(H,26,30)(H2,28,29,31)</inchi>
        <hbd datatype="double">3.0</hbd>
        <hba datatype="double">7.0</hba>
        <inDataset href="http://ops.rsc.org"/>
      </item>
      <item href="http://aers.data2semantics.org/resource/drug/NEXAVAR">
        <prefLabel>NEXAVAR</prefLabel>
        <reportedAdverseEvent>
          <item href="http://aers.data2semantics.org/resource/diagnosis/HEAD_INJURY">
            <prefLabel>HEAD INJURY</prefLabel>
            <inDataset href="http://aers.data2semantics.org/"/>
          </item>
          <item href="http://aers.data2semantics.org/resource/diagnosis/SUPRAVENTRICULAR_TACHYCARDIA">
            <prefLabel>SUPRAVENTRICULAR TACHYCARDIA</prefLabel>
            <inDataset href="http://aers.data2semantics.org/"/>
            ...
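The XML form is easy to consume with a standard parser. As a hedged illustration, the sketch below parses a small well-formed fragment copied from the `http://ops.rsc.org/OPS379634` item and recounts Lipinski rule-of-five violations from the physicochemical values. It assumes that `hbd` and `hba` are hydrogen-bond donor and acceptor counts and that `ro5_violations` uses the textbook thresholds; neither is stated on the slide.

```python
import xml.etree.ElementTree as ET

# Well-formed fragment with values copied from the OPS379634 item in the response.
FRAGMENT = """
<item href="http://ops.rsc.org/OPS379634">
  <ro5_violations datatype="double">1.0</ro5_violations>
  <molweight datatype="double">464.825</molweight>
  <logp datatype="double">5.158</logp>
  <hbd datatype="double">3.0</hbd>
  <hba datatype="double">7.0</hba>
</item>
"""


def rule_of_five_violations(item: ET.Element) -> int:
    """Count rule-of-five violations from the item's child elements."""

    def value(tag: str) -> float:
        return float(item.findtext(tag))

    checks = [
        value("molweight") > 500,  # molecular weight above 500
        value("logp") > 5,         # logP above 5
        value("hbd") > 5,          # more than 5 hydrogen-bond donors (assumed meaning)
        value("hba") > 10,         # more than 10 hydrogen-bond acceptors (assumed meaning)
    ]
    return sum(checks)


item = ET.fromstring(FRAGMENT)
computed = rule_of_five_violations(item)
reported = int(float(item.findtext("ro5_violations")))
print(computed, reported)
```

For Sorafenib only the logP value (5.158) exceeds its threshold, which agrees with the single violation reported in the record.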

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix void: <http://rdfs.org/ns/void#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ns0: <http://www.openphacts.org/api#> .
@prefix ns1: <http://bio2rdf.org/> .
@prefix ns2: <http://rdf.ebi.ac.uk/terms/chembl#> .
@prefix chembl1336: <http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1336#> .
@prefix linked-data: <http://purl.org/linked-data/api/vocab#> .
@prefix msg0: <http://www.openphacts.org/api/> .

<http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5>
  skos:exactMatch <http://aers.data2semantics.org/resource/drug/NEXAVAR> ;
  skos:exactMatch <http://aers.data2semantics.org/resource/drug/SORAFENIB> ;
  skos:exactMatch <http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5> ;
  skos:exactMatch <http://bio2rdf.org/drugbank:DB00398> ;
  skos:exactMatch <http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1336> ;
  skos:exactMatch <http://ops.rsc.org/OPS379634> ;
  skos:prefLabel "Sorafenib"@en ;
  void:inDataset <http://www.conceptwiki.org> ;
  foaf:isPrimaryTopicOf <https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2F38932552-111f-4a4e-a46a-4ed1d7bdf9d5&app_id=161aeb7d&app_key=bbcba81896020f0b95e3dd35b55e3345&_format=ttl> .

<https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2F38932552-111f-4a4e-a46a-4ed1d7bdf9d5&app_id=161aeb7d&app_key=bbcba81896020f0b95e3dd35b55e3345&_format=ttl>
  foaf:primaryTopic <http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5> ;
  linked-data:definition <https://beta.openphacts.org/api-config> ;
  msg0:activeLens "Default" ;
  void:linkPredicate skos:exactMatch ;
  linked-data:extendedMetadataVersion <https://beta.openphacts.org/1.5/compound?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2F38932552-111f-4a4e-a46a-4ed1d7bdf9d5&app_id=161aeb7d&app_key=bbcba81896020f0b95e3dd35b55e3345&_format=ttl&_metadata=all%2Cviews%2Cformats%2Cexecution%2Cbindings%2Csite> .

<http://ops.rsc.org/OPS379634>
  void:inDataset <http://ops.rsc.org> ;
  ns0:smiles "CNC(=O)C1=NC=CC(=C1)OC2=CC=C(C=C2)NC(=O)NC3=CC(=C(C=C3)Cl)C(F)(F)F" ;
  ns0:inchi "InChI=1S/C21H16ClF3N4O3/c1-26-19(30)18-11-15(8-9-27-18)32-14-5-2-12(3-6-14)28-20(31)29-13-4-7-17(22)16(10-13)21(23,24)25/h2-11H,1H3,(H,26,30)(H2,28,29,31)" ;
  ns0:inchikey "MLDQJTXFUGDVEO-UHFFFAOYSA-N" ;
  ...

explorer.openphacts.org

Architecture

API architecture

[Figure: API architecture diagram. Labelled components: Chemical Structure Search; RDF/SPARQL (Virtuoso); Identity Mapping Service; Identity Resolution Service (ConceptWiki); data loading from Chembl, Uniprot, ...]

SC1 Health Webinar: Technical overview, 6 July 2016

Platform goals

◎ Low total cost of ownership

◎ Simple to get started with Big Data

◎ Cater for widely varying use cases

◎ Embrace emerging Big Data technologies

◎ Simple integration with custom components

Key actors

Big Data is

◎ Volume
o Quantity of data

◎ Velocity
o Speed at which data is provided

◎ Variety
o Different formats/models in which data is provided

◎ Veracity
o Accuracy/truthfulness of the data

Why did we need all this?

Platform architecture

Semantic Big Data

ongoing research!

◎ Semantic Data Lake

o from data swamp to data lake

o query contents in the data lake

◎ SANSA stack

o Big Data analytics on semantic graph

Support layer

◎ Swarm UI
o Launch, install and manage pipelines

◎ Pipeline daemon & monitor
o Determine order in which steps are executed
o e.g. upload files before running computations

◎ Integrator UI
o Present dashboards in a unified interface
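Determining the order in which steps run is, at heart, a dependency-ordering problem. The following is a minimal sketch of that idea using a topological sort from the Python standard library; the step names are invented for illustration, and this is not a description of how the Big Data Europe pipeline daemon is actually implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each step maps to the set of steps it must wait for.
steps = {
    "upload-files": set(),
    "run-computation": {"upload-files"},
    "publish-results": {"run-computation"},
}

# static_order() yields every step after all of its prerequisites.
order = list(TopologicalSorter(steps).static_order())
print(order)
```

A cycle in the dependencies makes `static_order()` raise `graphlib.CycleError`, which is a convenient early check for a misconfigured pipeline.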

Platform architecture

Key actors

Platform installation

◎ Manual installation guide

◎ Using Docker Machine

o On local machine (VirtualBox)

o In the cloud (AWS, DigitalOcean, Azure)

o Bare metal

Platform development

◎ High level picture
o docker-compose.yml describes pipeline topology

◎ Common components
o extend template image with your code

◎ New components
o build a Docker image for your component
o this is your own little Virtual Machine for your component

◎ Sharing
o publish topology as git repository
o publish new components on Docker Hub

Platform development

Deployment

Swarm UI

Deployment

Swarm UI

Integrator UI

Workflow UI

More monitoring

This topic is ongoing, with many interesting options:

◎ Visualise logs with Kibana?

◎ Combine logs for large overview?

◎ Monitor node load?

◎ Provide autoscheduling?

Concluding remarks

◎ Used in practice

◎ Easy to get started

◎ Improving as we speak

You can talk to us!

◎ Aad Versteden, aad.versteden@tenforce.com

◎ Erika Pauwels, erika.pauwels@tenforce.com

Linux Container technology .. light-weight "virtual" virtual machine
A container is started from an image
Images downloaded from Docker Hub
Dockerfile: Layer-based recipe
Philosophy: One service, one image → microservices
Cloud's best friend: scalable, reproducible, customizable

https://www.docker.com/

https://hub.docker.com/r/openphacts/

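[Figure: Open PHACTS Docker containers. Service containers: ops-explorer (:3001), ops-api (:3002), ops-virtuoso (:3003), ops-ims (:3004), ops-mysql, ops-memcached. Data containers: ops-virtuosodata, ops-mysqldata. Staging containers: ops-virtuosostaging, ops-mysqlstaging, fed from https://data.openphacts.org/. Images come from https://hub.docker.com/ and are tied together by ops-docker.]

https://github.com/openphacts/ops-docker/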

Docker Compose

https://www.docker.com/products/docker-compose

Which images to download
Which data volumes to use
Which network ports are exposed
How are containers linked
How to start/stop the containers

$ docker-compose up -d

docker-compose.yml

# Open PHACTS platform
# Docker Compose configuration

explorer:
  image: openphacts/explorer2
  ports:
    - "3001:3000"
  links:
    - api
  environment:
    - API_URL=http://localhost:3002
  #restart: always

api:
  image: openphacts/ops-linkeddataapi
  ports:
    - "3002:80"
  links:
    - ims
    - memcached
    - virtuoso:sparql

# SPARQL server
virtuoso:
  build: virtuoso-ops
  ports:
    - "3003:8890"
  volumes_from:
    - virtuosodata

virtuosodata:
  image: busybox
  volumes:
    - /virtuoso
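Each `ports` entry in the excerpt has the form `"host:container"`, so the left-hand number is the one reachable from the machine running Docker. As a small illustration, the sketch below derives the local URL of each service from mappings transcribed by hand from the excerpt; the helper name is made up, and the sketch deliberately avoids parsing YAML so that it needs only the standard library.

```python
# Port mappings transcribed by hand from the docker-compose.yml excerpt.
services = {
    "explorer": ["3001:3000"],
    "api": ["3002:80"],
    "virtuoso": ["3003:8890"],
}


def local_urls(ports_by_service: dict[str, list[str]], host: str = "localhost") -> dict[str, str]:
    """Map each service to the URL of its first published host port."""
    urls = {}
    for name, ports in ports_by_service.items():
        # "3001:3000" means host port 3001 forwards to container port 3000.
        host_port, _container_port = ports[0].split(":")
        urls[name] = f"http://{host}:{host_port}/"
    return urls


urls = local_urls(services)
for name, url in urls.items():
    print(f"{name:10s} {url}")
```

When the stack runs on a remote server, passing that server's name as `host` gives the addresses to open in a browser.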

Data staging

Docker and data?

Docker Hub maximum image size: 10 GB
Open PHACTS data (compressed): ~30 GB
Open PHACTS data (installed): ~200 GB

Solution: Added staging Docker containers
Download from https://data.openphacts.org/
Verify consistency
Import into Virtuoso and MySQL
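The slide does not say how the staging containers verify consistency. One common approach, shown here purely as an assumed illustration, is to hash each downloaded file and compare the digest with a published value. The choice of SHA-256 and the helper names are assumptions, not a description of the Open PHACTS staging scripts.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte downloads never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_consistent(path: Path, expected_hex: str) -> bool:
    """Compare the computed digest with the expected one, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Self-contained demonstration on a small temporary file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"open phacts")
    expected = hashlib.sha256(b"open phacts").hexdigest()
    demo_ok = is_consistent(sample, expected)
    demo_bad = is_consistent(sample, "0" * 64)
print(demo_ok, demo_bad)
```

Reading in one-megabyte chunks matters at this scale: the compressed data is around 30 GB, far more than the 16 GB of RAM listed as the minimum.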

data.openphacts.org
RDF datasets
RDF linksets
VoID metadata/provenance
MySQL-imported linksets
Virtuoso-imported datasets
→ Maven repository: release data as software
→ Research Objects: propagate metadata

Try it!

https://github.com/openphacts/ops-docker

What do I need?

Hardware requirements:
150 GB of disk space (ideal: 250 GB)
16 GB of RAM (ideal: 128 GB)
4 CPU cores (ideal: 8 cores)

Prerequisites:
Recent x64 Linux (Ubuntu 14.04 LTS, CentOS 7)
Fast Internet connection
Docker
Docker Compose
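Before starting a download of this size it can be worth checking a machine against the minimum figures. The sketch below is one hedged way to do that; the helper names are invented, the RAM probe relies on `os.sysconf` and so assumes Linux, and the disk check should point at wherever Docker stores its data on your system.

```python
import os
import shutil

# Minimum figures from the slide: 150 GB disk, 16 GB RAM, 4 CPU cores.
MINIMUM = {"disk_gb": 150, "ram_gb": 16, "cpu_cores": 4}


def shortfalls(available: dict[str, float], required: dict[str, float] = MINIMUM) -> list[str]:
    """Return the names of resources that fall below the required minimum."""
    return [name for name, needed in required.items() if available.get(name, 0) < needed]


def probe(path: str = "/") -> dict[str, float]:
    """Measure this machine; the RAM figure uses os.sysconf and assumes Linux."""
    gib = 1024 ** 3
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return {
        "disk_gb": shutil.disk_usage(path).free / gib,  # free space at the given path
        "ram_gb": ram_bytes / gib,
        "cpu_cores": os.cpu_count() or 0,
    }


laptop = {"disk_gb": 500, "ram_gb": 8, "cpu_cores": 4}  # made-up example machine
print(shortfalls(laptop))
```

On a Linux host, `shortfalls(probe())` applies the same check to the real machine.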

Don't jump ahead..

Follow the GitHub tutorial exactly, customize later
Install latest Docker and Docker Compose
Just testing on Windows or OS X? .. modify Docker's Linux VM to have enough disk and memory
Firewall? Different settings depend on your firewall details.
Don't worry - Docker is containerized! .. you won't break your machine

Get the software

$ curl -L https://github.com/openphacts/ops-docker/archive/master.tar.gz | tar xzv
$ cd ops-docker-master
$ sudo docker-compose pull

Get the data

$ sudo docker-compose up --no-recreate -d mysqlstaging virtuosostaging
$ sudo docker-compose logs mysqlstaging virtuosostaging

ops-mysqlstaging | mySQL staging finished
ops-mysqlstaging exited with code 0

ops-virtuosostaging | 09:13:35 --> Backup file # 675 [0x3F02-0x74-0x8A]
ops-virtuosostaging | 09:13:36 --> Backup file # 676 [0x3F02-0x74-0x8A]
ops-virtuosostaging | 09:13:37 End of restoring from backup, 6751701 pages
ops-virtuosostaging | 09:13:37 Server exiting
ops-virtuosostaging | Loading completed
ops-virtuosostaging exited with code 0
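Staging is finished once both containers report `exited with code 0`. When the step is scripted rather than watched by eye, that condition can be checked on captured log text, as in this sketch. It assumes the plain line format shown on the slide; real `docker-compose logs` output may carry colour codes or other prefixes that would need stripping first.

```python
import re

# Log lines copied from the slide.
LOG = """\
ops-mysqlstaging | mySQL staging finished
ops-mysqlstaging exited with code 0
ops-virtuosostaging | Loading completed
ops-virtuosostaging exited with code 0
"""

EXIT_LINE = re.compile(r"^(?P<container>\S+) exited with code (?P<code>\d+)$")


def exit_codes(log_text: str) -> dict[str, int]:
    """Collect the exit code reported for each container in the log text."""
    codes = {}
    for line in log_text.splitlines():
        match = EXIT_LINE.match(line.strip())
        if match:
            codes[match["container"]] = int(match["code"])
    return codes


def staging_done(log_text: str, containers=("ops-mysqlstaging", "ops-virtuosostaging")) -> bool:
    """True only when every staging container has exited with code 0."""
    codes = exit_codes(log_text)
    return all(codes.get(name) == 0 for name in containers)


print(staging_done(LOG))
```

A container that has not exited yet is simply absent from the result, so `staging_done` stays false until both have finished cleanly.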

Start the services

$ sudo docker-compose up --no-recreate -d
$ sudo docker-compose logs --tail=5

Using the services

http://localhost:3001/ Explorer

http://localhost:3002/ API

http://localhost:3003/ SPARQL

http://localhost:3004/QueryExpander Identity Mapping

What's next?

Custom data staging

Different Open PHACTS 2.1 licensing options:
Non-Commercial users: Everything
Commercial users: No DrugBank, partial SureChEMBL
Open PHACTS members: Full SureChEMBL

Microservices per dataset

Most queries have separate fragments per dataset
.. which could be executed on separate microservices
Better cloud scalability
Easier to test upgrades of individual datasets
But still need "API" layer to do Identity Mapping and selecting datasets to query
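To make the idea concrete, here is a toy sketch of an API layer that maps identities, selects datasets, fans the per-dataset fragments out in parallel and merges the answers. Everything in it is hypothetical: the dataset labels, the in-process `dataset_service` stand-in and the function names are invented for illustration, and only the Sorafenib URIs are taken from the `skos:exactMatch` links earlier in this deck.

```python
from concurrent.futures import ThreadPoolExecutor

# Identity mapping for Sorafenib, using exactMatch URIs from the sample responses.
IDENTITY_MAP = {
    "http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5": {
        "chembl": "http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL1336",
        "drugbank": "http://bio2rdf.org/drugbank:DB00398",
        "ops-rsc": "http://ops.rsc.org/OPS379634",
    }
}


def dataset_service(dataset: str, uri: str) -> dict:
    """Stand-in for a per-dataset microservice; a real one would run a query fragment."""
    return {"dataset": dataset, "uri": uri}


def api_layer(concept_uri: str, datasets: list[str]) -> list[dict]:
    """Map identities, keep the requested datasets, fan out and merge the answers."""
    mapped = IDENTITY_MAP.get(concept_uri, {})
    selected = [(name, mapped[name]) for name in datasets if name in mapped]
    with ThreadPoolExecutor() as pool:
        # map() preserves input order, so results line up with the selection.
        return list(pool.map(lambda pair: dataset_service(*pair), selected))


results = api_layer(
    "http://www.conceptwiki.org/concept/38932552-111f-4a4e-a46a-4ed1d7bdf9d5",
    ["chembl", "ops-rsc"],
)
print(results)
```

In this shape, upgrading one dataset means replacing one service, while the identity mapping and dataset selection stay in the API layer.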

BioExcel Workflow blocks

BioExcel approach: Spin up virtual machine when an Open PHACTS workflow is started
Workflow bound dynamically to VM instance(s)
Scalability (exclusive access)
Reproducibility (independent/fixed OPS install)
Tool descriptions - exposed in bio.tools

Customization

Make it easier to add third-party data: datasets, linksets, queries, API calls
.. so pharma industry can mix in their in-house data
.. so academics can upgrade and expand datasets

More tooling, more documentation, or more training?

Feedback

https://github.com/openphacts/ops-docker/issues

http://support.openphacts.org/

http://ask.bioexcel.eu/

https://data.openphacts.org/