Building collaborative workflows for scientific data

bmpvieira.com/orcambridge14

Uploaded by bruno-vieira, 30 July 2015

TRANSCRIPT

Page 1: Building collaborative workflows for scientific data

bmpvieira.com/orcambridge14

Page 3

Sequencing cost drops

Page 5

Goodbye Excel/Windows

Page 6

Hello command line

Page 8

Programming

Page 9

Programming

Page 15

Reproducibility layers

Code
Data
Workflow
Environment

Page 17

The GitHub for Science...

is GitHub!

Page 19

Reproducibility layers

Code
Data
Workflow
Environment

Page 21

Dat

Open source tool for sharing and collaborating on data. Started August '13; grant funded and 100% open source.

dat-data.com
#dat (public) on freenode
gitter.im/datproject/discussions
Dat Community Call #1

Page 22
Page 23

Dat - "git for data"

npm install -g dat
dat init
collect-data | dat import
dat listen

Page 25

Dat

dat clone
dat pull --live
dat blobs put mygenome data.fasta
dat cat | transform
dat cat | docker run -i transform

http://eukaryota.dathub.org

Page 26

Dat

Planned:
dat checkout revision
dat diff
dat branch
multi-master replication
sync to databases
registry

Page 27

Data is stored locally in LevelDB, but other backends can be used, such as:
Postgres
Redis
etc.

Files are stored in blob stores:
s3
local-fs
bittorrent
ftp
etc.

Page 28

Dat features

auto schema generation
free REST API
all APIs are streaming

Page 29

Dat workshop

maxogden.github.io/get-dat

Page 30

Dat quick deploy

github.com/bmpvieira/heroku-dat-template

Page 31

Reproducibility layers

Code
Data
Workflow
Environment

Page 32

Workflow

Page 33

Bionode

Open source project for modular and universal bioinformatics. Started January '14.

bionode.io

Page 34

Some problems I faced during my research:

Difficulty getting relevant descriptions and datasets from the NCBI API using bio* libs
For web projects, needed to implement the same functionality on browser and server
Difficulty writing scalable, reproducible and complex bioinformatic pipelines

Page 35

Bionode also collaborates with BioJS

Page 36

Bionode

npm install -g bionode
bionode ncbi download gff bacteria
bionode ncbi download sra arthropoda | bionode sra fastq-dump
npm install -g bionode-ncbi
bionode-ncbi search assembly formicidae | dat import --json

Page 37

Bionode - list of modules

Name          Type         Status
ncbi          Data access  production
fasta         Parser       production
seq           Wrangling    production
ensembl       Data access  production
blast-parser  Parser       production

Page 38

Bionode - list of modules

Name                  Type           Status
template              Documentation  production
JS pipeline           Documentation  production
Gasket pipeline       Documentation  production
Dat/Bionode workshop  Documentation  production

Page 39

Bionode - list of modules

Name  Type      Status
sra   Wrappers  development
bwa   Wrappers  development
sam   Wrappers  development
bbi   Parser    development

Page 40

Bionode - list of modules (status: request)

Name      Type         People
ebi       Data access
semantic  Data access
vcf       Parser
gff       Parser
bowtie    Wrappers
sge       Wrappers     badryan
blast     Wrappers

Page 41

Bionode - list of modules

Name     Type      People
vsearch  Wrappers
khmer    Wrappers
rsem     Wrappers
gmap     Wrappers
star     Wrappers
go       Wrappers  badryan

Page 42

Bionode - Why wrappers?

Same interface between modules (Streams and NDJSON)
Easy installation with NPM
Semantic versioning
Add tests
Abstract complexity / more user friendly

Page 43

Bionode - Why Node.js?

Same code client/server side

Page 44

Need to reimplement the same code on browser and server.
Solution: JavaScript everywhere

Afra -> bionode-seq
GeneValidator -> seq, fasta
SequenceServer
BioJS: collaborating for code reuse
Biodalliance: converting to bionode

Page 45

Bionode - Why Node.js?

Page 46

Reusable, small and tested modules

Page 47

Benefit from other JS projects

Dat, BioJS, NoFlo

Page 48
Page 49
Page 50
Page 51

Difficulty getting relevant descriptions and datasets from the NCBI API using bio* libs.
Python example: URL for the Acromyrmex assembly?

Solution:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000188075.1_Si_gnG

import xml.etree.ElementTree as ET
from Bio import Entrez
Entrez.email = "[email protected]"
esearch_handle = Entrez.esearch(db="assembly", term="Achromyrmex")
esearch_record = Entrez.read(esearch_handle)
for id in esearch_record['IdList']:
    esummary_handle = Entrez.esummary(db="assembly", id=id)
    esummary_record = Entrez.read(esummary_handle)
    documentSummarySet = esummary_record['DocumentSummarySet']
    document = documentSummarySet['DocumentSummary'][0]
    metadata_XML = document['Meta'].encode('utf-8')
    metadata = ET.fromstring('<root>' + metadata_XML + '</root>')
    for entry in metadata[1]:
        print entry.text

bionode-ncbi

Page 52

Difficulty getting relevant descriptions and datasets from the NCBI API using bio* libs.
Example: URL for the Acromyrmex assembly?

JavaScript

http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000204515.1_Aech_3.9/GCA_000204515.1_Aech_3.9_genomic.fna.gz

var bio = require('bionode')
bio.ncbi.urls('assembly', 'Acromyrmex', function (urls) {
  console.log(urls[0].genomic.fna)
})

bio.ncbi.urls('assembly', 'Acromyrmex')
  .on('data', printGenomeURL)
function printGenomeURL (urls) {
  console.log(urls[0].genomic.fna)
}

Page 53

Difficulty getting relevant descriptions and datasets from the NCBI API using bio* libs.
Example: URL for the Acromyrmex assembly?

http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000204515.1_Aech_3.9/GCA_000204515.1_Aech_3.9_genomic.fna.gz

JavaScript

var ncbi = require('bionode-ncbi')
var ndjson = require('ndjson')
ncbi.urls('assembly', 'Acromyrmex')
  .pipe(ndjson.stringify())
  .pipe(process.stdout)

BASH

bionode-ncbi urls assembly Acromyrmex |
tool-stream extractProperty genomic.fna

Page 54

Difficulty writing scalable, reproducible and complex bioinformatic pipelines.
Solution: Node.js Streams everywhere

var ncbi = require('bionode-ncbi')
var tool = require('tool-stream')
var through = require('through2')
var fork1 = through.obj()
var fork2 = through.obj()

Page 55

Difficulty writing scalable, reproducible and complex bioinformatic pipelines.
Solution: Node.js Streams everywhere

ncbi.search('sra', 'Solenopsis invicta')
  .pipe(fork1)
  .pipe(dat.reads)

fork1
  .pipe(tool.extractProperty('expxml.Biosample.id'))
  .pipe(ncbi.search('biosample'))
  .pipe(dat.samples)

fork1
  .pipe(tool.extractProperty('uid'))
  .pipe(ncbi.link('sra', 'pubmed'))
  .pipe(ncbi.search('pubmed'))
  .pipe(fork2)
  .pipe(dat.papers)

Page 56
Page 57
Page 58

Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

bionode-ncbi search genome Guillardia theta |
tool-stream extractProperty assemblyid |
bionode-ncbi download assembly |
tool-stream collectMatch status completed |
tool-stream extractProperty uid |
bionode-ncbi link assembly bioproject |
tool-stream extractProperty destUID |
bionode-ncbi link bioproject sra |
tool-stream extractProperty destUID |
bionode-ncbi download sra |
bionode-sra fastq-dump |
tool-stream extractProperty destFile |
bionode-bwa mem 503988/GCA_000315625.1_Guith1_genomic.fna.gz |
tool-stream collectMatch status finished |
tool-stream extractProperty sam |
bionode-sam

Page 59

Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

bionode-example-dat-gasket
get-dat workshop
get-dat bionode gasket example

Page 60

Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

{
  "import-data": [
    "bionode-ncbi search genome eukaryota",
    "dat import --json --primary=uid"
  ],
  "search-ncbi": [
    "dat cat",
    "grep Guillardia",
    "tool-stream extractProperty assemblyid",
    "bionode-ncbi download assembly -",
    "tool-stream collectMatch status completed",
    "tool-stream extractProperty uid",
    "bionode-ncbi link assembly bioproject -",
    "tool-stream extractProperty destUID",
    "bionode-ncbi link bioproject sra -",
    "tool-stream extractProperty destUID",
    "grep 35526",
    "bionode-ncbi download sra -",
    "tool-stream collectMatch status completed",
    "tee > metadata.json"
  ],

Page 61

Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

  "index-and-align": [
    "cat metadata.json",
    "bionode-sra fastq-dump -",
    "tool-stream extractProperty destFile",
    "bionode-bwa mem **/*fna.gz"
  ],
  "convert-to-bam": [
    "bionode-sam 35526/SRR070675.sam"
  ]
}

Page 62

Difficulty writing scalable, reproducible and complex bioinformatic pipelines.

datscript

pipeline main
  run pipeline import

pipeline import
  run foobar | run dat import --json

bmpvieira example
ekg example

Page 63

Reproducibility layers

Code
Data
Workflow
Environment

Page 64

Environment

Page 65

Docker for reproducible science

docker run bmpvieira/thesis

Page 66

Bionode - modular and universal bioinformatics
Pipeable UNIX command line tools and JavaScript / Node.js APIs for bioinformatic analysis workflows on the server and browser.

bionode.io
#bionode
gitter.im/bionode/bionode

Dat - build data pipelines
Provides a streaming interface between every file format and data storage backend. "git for data"

dat-data.com
#dat
gitter.im/datproject/discussions

Page 67

Acknowledgements

@yannick__
@maxogden
@mafintosh
@erikgarrison
@QM_SBCS
@opendata
Bionode contributors

Page 68

Thanks!

"Science should work as an Open Source project"

dat-data.com
bionode.io