data parallelism in bioinformatics workflows using...

26
Data Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho 1 , Eduardo Ogasawara 1 , Daniel de Oliveira 1 , Vanessa Braganholo 1 , Alexandre A. B. Lima 1 , Alberto M. R. Dávila 2 and Marta Mattoso 1 1 Federal University of Rio de Janeiro, Brazil 2 FIOCRUZ, Brazil

Upload: others

Post on 07-Apr-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Data Parallelism in Bioinformatics Workflows Using Hydra

Fábio Coutinho1, Eduardo Ogasawara1, Daniel de Oliveira1 ,Vanessa Braganholo1 , Alexandre A. B. Lima1 ,

Alberto M. R. Dávila2 and Marta Mattoso1

1Federal University of Rio de Janeiro, Brazil2FIOCRUZ, Brazil

Page 2: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Agenda

• Scientific experiments

• Blast workflow

• Data parallelization

• Hydra middleware

• Case study

• Measurements

• Provenance support

• Conclusion

2

Page 3: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Scientific Experiment Scenario

4. ...which need

to be processed

by a cluster or a

supercomputer…

1. Data is generated

and collected

2. Data is first

analyzed by some

programs

3. Large Volume of

Data Produced ...

5. Final Results

are analyzed

3

Page 4: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Blast sequence analysis match

4. ...which need

to be processed

In a HPC

environment…

1. Genomics Data

is collected.

2. Fasta file is

generated.

5. Final Results

are analyzed

SequenceDatabase

A C B C D B

C A D B D

A C - - B C D B

- C A D B - D -

>ACBCD

Fasta file

3. Sequence alignment …Ex. Genbank,

DDBJ, EMBL, …

4

Page 5: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Current Solutions

• mpiBlast, G-Blast and CloudBlast represent solutions for executing Blast in parallel on cluster, grid and cloudenvironments, respectively.

• BlastReduce: Using map-reduce approach to obtain data parallelization

• These solutions are disconnected from the concept of scientific experiment and they are not concerned about capturing and analyzing provenance.

• Scientific Workflow Management Systems (SWfMS) is analternative to represent a scientific experiment (TavernaWorkflows, for example)

• This solution brings some difficulties in obtaining parallelization

5

Page 6: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Experiments modeled as scientific workflows

• SWfMS is an alternative to represent a scientific experiment

• Obtaining parallelization and provenance gathering in distributed environments brings some challenges

6

Page 7: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Data Parallelization Difficulties

• Data fragmentation

o Dependent on the data format

• Activity distribution in HPC from SWfMS

• Data aggregation

o Dependent on the data format

7

Page 8: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

FASTA Fragmentation

8

Page 9: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Types of FASTA data parallelization

9

Sequence Fragmentation Database Fragmentation

No data parallelization No No

Data parallelization

Yes No

No Yes*

Yes Yes*

Page 10: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

> P. 0200AGTAGTT…

> P. 0400AGTAGTT…

> P. 0300AGTAGTT…

> P. 0100AGTAGTT…

Data Parallelism

10

UniprotKB

DataFragmenter

I = {i1}Input Data

Acitivity node 1

Acitivity node k

Di1-f1 Di1-fk

FO1 FOK

Aggregator

O = {o1, … , on)Output Data

Acitivity node 2

Di1-f2

FO2

Acitivity node (k-1)

Di1-f(k-1)

FO(k-1)

Parameters Parameters

> Plasmodium 0100AGTAGTTATTTAGGGCTATATAAAAGTTTTGAAAA…

Page 11: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Hydra Middleware

• Middleware solution that bridges the SWfMS to HPCEnvironment supporting data parallelism

• Goal: reduce the complexity involved in designing and managing bioinformatics programs while collecting provenance data

SWfMS MTC Environment

Hydra Middleware

11

Page 12: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Hydra Architecture

12

Hydra MTC LayerClient Layer

Uploader

Dispatcher

Gatherer

Downloader

Hydra Setup

SWfMS

Configuration

MUX

Parameter Sweeper

Data Fragmenter

Template

Fragment

Dispatcher/ Monitor

Data Analyzer

Provenance

Scientist

SWfMS

ExternalSchedulers

File System

File SystemFile

System

File System

12

3.a 3.b

4

5

6 7

8

9

1011

12

Page 13: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Compression Techniques in Hydra

Uploader

I’

I

I’

I’

ScientistDistributed Execution

I

O

O’

Uploader

O’

O

O’

Client Layer H

ydra

MTC L

ayer

Pack

Pack

Unpack

Unpack

I : Input DataI’: Compressed Input DataO : Output DataO’: Compressed Output Data

13

Page 14: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Steps to use Hydra

• Adapt the workflow for distribution using Hydra

• Setup the data parallelization

o Configure the workspace template and invoked program

• Setup a data fragmentation cartridge in Hydra

o Develop a program that fragments a bioinformatics data file

• Setup a data aggregation cartridge in Hydra

o A program that aggregate or merge data produced by individual workspaces

14

Page 15: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Blast workflow in VisTrails using Hydra

15

Page 16: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Setup the data parallelization

Page 17: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Case study

• BLAST tool

o Identification of similar sequences between Plasmodium falciparum e o UniProtKB/SWISS-Prot database (june/2008)

o Data fragmentation of input query

17

Page 18: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Experiment

• Workflow developed in VisTrails

• Hydra was setup with:

o Torque cartridge for job submission

o Data fragmentation cartridge using FASTASplitter

o Workspace configuration for blast invocation

o Data aggregation cartridge using BlastMerger

18

Page 19: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Experiment results

• Execution mode in 64 node Altix ICE 8200 with Xeon 8 cores per node installed at Nacad High Performance Computing Center at Federal University of Rio de Janeiro

• Speed-up measured

19

Page 20: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Workflow execution time

20

Page 21: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Activity execution time distribution

21

Page 22: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Provenance Support

o Where are the merged results for some particular experiment?

o What was the alignment method that lead to the best result?

o What were the parameters that lead the best result?

22

Page 23: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Provenance summary

23

Page 24: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Conclusions

• Hydra can be a bridge the SWfMS and the HPC environment

o Data parallelism using workflows

o Evaluated in a real case bioinformatics experiment using BLAST with little overhead

o Supports distributed provenance gathering

o Good speed-up and execution time

24

Page 25: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Future Work

• Balance workload between nodes

• New cartridges for data fragmentation and aggregation

• Automatic setup for fragment size and used nodes

25

Page 26: Data Parallelism in Bioinformatics Workflows Using Hydraeogasawara/wp-content/uploads/2015/04/2010-06-ecmls.pdfData Parallelism in Bioinformatics Workflows Using Hydra Fábio Coutinho

Data Parallelism in Bioinformatics Workflows Using Hydra

Fábio Coutinho, Eduardo Ogasawara, Daniel de Oliveira,Vanessa Braganholo, Alexandre A. B. Lima,

Alberto M. R. Dávila and Marta Mattoso

Thank you!