
BioSlax Cloud – Distributing your jobs

Copyright © 2010. National University of Singapore. All rights reserved.

Distributing Jobs on the BioSlax Cloud

• Stages of distributing jobs
  – Establishing secure communications
  – Splitting data
  – Distributing executables and data
  – Processing at the nodes
  – Collation of results

• Examine a simple example – fuzzy search
  – Use agrep (a fuzzy-search grep utility) and BioPerl


Distributing Jobs on the BioSlax Cloud

The problem:

“Find matches in the nr database that include 1 to 4 amino-acid mismatches to any given input sequence”

For example, given the hypothetical protein record in a database:

>gi|284518918_M5|gb|ADB92594.1_M5
FLDGIDKAQEEHEKYHSNWRAMVSDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDCTHLEGKIILVAVHVASGYIEAEVIPAGTGQETAYFLLKLAGRWPVKTIHTDNGSNFTSATVKAACWWAGIKQEFGIPYNPQSQGVVESMHKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIVDIIATDLQTRELEKEITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNSDIKVVPHKKAKIIRD

and an input sequence of:

DIQTKELQKQITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNS

the input sequence is found in the following region of the protein record with 4 mismatches (database region on top, input sequence below):

DLQTRELEKEITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNS
DIQTKELQKQITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNS
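
This kind of fuzzy match can be checked directly from the shell with TRE's agrep (installation notes are at the end of this deck). The commands below are only a sketch: record.fasta is a hypothetical file holding the protein record above, and -E is assumed to set the maximum number of errors, as in TRE's agrep.

# Save the hypothetical protein record above (header plus sequence) as record.fasta, then:
agrep -E 4 "DIQTKELQKQITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNS" record.fasta
# The sequence line containing the matching region is reported, since at most
# 4 errors are allowed (adding -s prints the match cost, i.e. the number of
# mismatches, if the installed agrep supports it).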


Distributing Jobs on the BioSlax Cloud

Executable – agrep.pl
  – Perl script using BioPerl

Database – sequence.fasta
  – small subset database with about 100 sequences
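
The original agrep.pl is not reproduced in this transcript. The following is a minimal sketch of how such a script might combine BioPerl and TRE's agrep; the hard-coded query (taken from the example above), the temporary-file approach and the agrep invocation are assumptions, not the course's actual code.

#!/usr/bin/perl
# agrep.pl (sketch) – fuzzy-search every sequence in a FASTA file against a
# fixed query, allowing up to 4 mismatches, using BioPerl and TRE's agrep.
use strict;
use warnings;
use Bio::SeqIO;
use File::Temp qw(tempfile);

my $query = "DIQTKELQKQITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNS";   # query from the example slide
my $in = Bio::SeqIO->new(-file => $ARGV[0], -format => 'fasta');

while (my $seq = $in->next_seq) {
    # Write the database sequence to a temporary file so agrep can scan it.
    my ($fh, $tmp) = tempfile();
    print $fh $seq->seq, "\n";
    close $fh;
    # -E 4: allow at most 4 errors (assumed TRE agrep syntax).
    my $hit = `agrep -E 4 "$query" $tmp`;
    print $seq->display_id, " matches with <= 4 mismatches\n" if $hit;
    unlink $tmp;
}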


Distributing Jobs on the BioSlax Cloud

• Against the 100-sequence test database, agrep.pl finds 5 matches in about 0.55 seconds

• The database is only 100 sequences – NR is > 10,000,000 sequences

• Scaled linearly, the full NR database would take (10,000,000 / 100) × 0.55 = 55,000 seconds, or roughly 15 hours, to complete
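
Per-run timings such as the 0.55 seconds above can be reproduced with the shell's time keyword; a minimal sketch (the output file name is illustrative):

# Time one run of agrep.pl against the 100-sequence test database
time ./agrep.pl sequence.fasta > test.results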


Distributing Jobs on the BioSlax Cloud

• Use 4 BioSlax VMs on the Cloud
• 1 master node, 3 slave nodes

remote_process_send
  – shell script executed by each slave node to do the processing and then scp the results file back to the master node

01-split_sequence
  – Perl script to split the db into chunks of X sequences per chunk
  – chosen to split the 100 sequences into chunks of 40 sequences each => 3 chunks (or 3 files)

02-upload_parts
  – shell script using scp with public-key authentication to upload agrep.pl, one chunk and the ‘remote_process_send’ script to each slave node

03-call_slave_to_execute
  – shell script using ssh with public-key authentication to execute agrep.pl against each chunk on each of the slave nodes concurrently and have the slave scp the results file back to the master – done using the ‘remote_process_send’ shell script
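
Taken together, the scripts are run on the master roughly as follows (a sketch; it assumes all scripts and sequence.fasta sit in one working directory on bioslax01):

# On bioslax01, in the directory holding the scripts and sequence.fasta:
./01-split_sequence sequence.fasta 40     # split the 100 sequences into 40-sequence chunks
./02-upload_parts                         # copy agrep.pl, the helper script and one chunk to each slave
./03-call_slave_to_execute                # run agrep.pl on every slave; results are scp'd back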


Distributing Jobs on the BioSlax Cloud

remote_process_send

#!/bin/sh
# remote_process_send – runs on each slave node.
# Runs agrep.pl over every uploaded chunk, appending the output to a
# per-host results file, then copies that file back to the master (bioslax01).

HOSTN=`hostname`

for i in seq_*.fasta
do
    ./agrep.pl $i >> $HOSTN.results
    scp ./$HOSTN.results root@bioslax01:/mnt/hda1/downloads/. 1> /dev/null 2> /dev/null
done


Distributing Jobs on the BioSlax Cloud

01-split_sequence

#!/usr/bin/perl
# 01-split_sequence – split a multi-FASTA database (first argument) into
# seq_1.fasta, seq_2.fasta, ... with a given number of sequences per chunk
# (second argument), e.g. ./01-split_sequence sequence.fasta 40

open (DBFILE, "$ARGV[0]");
$fcount = 1;
$count  = 0;

while (<DBFILE>) {
    my ($line) = $_;
    chomp($line);
    if ($line =~ />/) {            # a header line starts a new sequence
        $count += 1;
    }
    if ($count > $ARGV[1]) {       # current chunk is full – start the next one
        $fcount += 1;
        $count = 1;
    }
    open (NEWFILE, ">>seq_$fcount.fasta");
    print NEWFILE "$line\n";
    close (NEWFILE);
}


Distributing Jobs on the BioSlax Cloud

02-upload_parts

#!/bin/sh
# 02-upload_parts – runs on the master (bioslax01).
# Copies agrep.pl, the remote_process_send helper and one sequence chunk to
# each slave node in turn (bioslax02, bioslax03, ...).

count=1

for i in seq_*.fasta
do
    count=`expr $count + 1`
    scp ./agrep.pl root@bioslax0${count}:. 1> /dev/null 2> /dev/null
    scp ./remote_process_send root@bioslax0${count}:. 1> /dev/null 2> /dev/null
    scp ./$i root@bioslax0${count}:. 1> /dev/null 2> /dev/null
done


Distributing Jobs on the BioSlax Cloud

03-call_slave_to_execute

#!/bin/sh
# 03-call_slave_to_execute – runs on the master (bioslax01).
# Starts remote_process_send on every slave node in the background, so all
# chunks are processed concurrently.

if [ -f results ]
then
    rm results
fi

count=1

for i in seq_*.fasta
do
    count=`expr $count + 1`
    ssh -l root bioslax0${count} "./remote_process_send" &
done
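
The deck lists collation of results as the final stage but shows no script for it. A hypothetical two-line addition to the end of 03-call_slave_to_execute could serve; the merged file name is illustrative.

# Hypothetical collation step (not part of the original package):
# 'wait' blocks until all backgrounded ssh jobs have returned, after which
# the per-slave result files scp'd back to the master can be merged.
wait
cat /mnt/hda1/downloads/bioslax0*.results > results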


Distributing Jobs on the BioSlax Cloud

Establish secure communications between the master and slave nodes using SSH public-key authentication.

Master node: bioslax01
Slave nodes: bioslax02, bioslax03, bioslax04

1. Generate public and private keys on bioslax01
   • run ‘ssh-keygen -t rsa’
   • generates id_rsa and id_rsa.pub in /root/.ssh
2. Copy id_rsa.pub to each slave node as /root/.ssh/authorized_keys
3. Repeat step 1 on all the slave nodes
4. Copy the contents of each slave's id_rsa.pub file (bioslax02 to bioslax04) into /root/.ssh/authorized_keys on bioslax01

You should now be able to ssh and scp/sftp between bioslax01 and bioslax02, bioslax03, bioslax04 without typing passwords.

* This has already been set up between bioslax01 and bioslax02, 03 and 04.
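
Steps 1–4 can also be driven from the master in one short script. The sketch below is one possible way to do it (the loop, the -N/-f options to ssh-keygen and the remote cat are assumptions, not the exact commands used in the course); each ssh/scp run before the keys are in place will still prompt for the root password.

# On bioslax01 (master):
ssh-keygen -t rsa                                   # step 1: writes /root/.ssh/id_rsa and id_rsa.pub
for h in bioslax02 bioslax03 bioslax04
do
    scp /root/.ssh/id_rsa.pub root@${h}:/root/.ssh/authorized_keys                # step 2
    ssh root@${h} "ssh-keygen -t rsa -N '' -f /root/.ssh/id_rsa"                  # step 3: key pair on each slave
    ssh root@${h} "cat /root/.ssh/id_rsa.pub" >> /root/.ssh/authorized_keys       # step 4
done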


Distributing Jobs on the BioSlax Cloud

• Takes 0.06 seconds to split 100 sequences into chunks of 40 sequences

• Scaled linearly, 10,000,000 sequences would take (10,000,000 / 100) × 0.06 = 6,000 seconds, or approximately 1.7 hours, to split.


Distributing Jobs on the BioSlax Cloud

• The agrep.pl executable is 950 bytes (0.00095 MB)
• The remote_process_send script is 175 bytes (0.000175 MB)
• Each 40-sequence chunk is 14,000 bytes (0.014 MB)
• 1 Gbit network => 125 MB/sec transfer rate
• Each executable plus chunk upload takes (0.014 + 0.00095 + 0.000175) / 125 = 0.000121 seconds
• For 10,000,000 sequences split into 40-sequence chunks there are 250,000 chunks => approximately 0.000121 × 250,000 seconds to upload all the chunks, or approximately 30 seconds
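
The arithmetic above can be double-checked with a throwaway one-liner (figures copied from the bullets; purely illustrative):

# Per-chunk upload time and total for 250,000 chunks at 125 MB/s
perl -e 'my $t = (0.014 + 0.00095 + 0.000175) / 125; printf "%.6f s per chunk, %.1f s total\n", $t, $t * 250_000;'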


Distributing Jobs on the BioSlax Cloud

• Takes 0.51 seconds to run agrep.pl against a 40-sequence chunk on each slave node => NOT SIGNIFICANTLY FASTER THAN PROCESSING ON A SINGLE NODE!

• Takes approximately 0.86 seconds on each slave to run agrep.pl against its chunk AND send the results file back to the master node (done by the ‘remote_process_send’ script) => LONGER THAN PROCESSING ON A SINGLE NODE!


Distributing Jobs on the BioSlax Cloud

All the nodes show nearly identical timings for processing their chunk and sending the results back to the master node.


Distributing Jobs on the BioSlax Cloud

Any speed-up is dependent on the size of the job

• distributed computing is not advantageous when applied to small jobs (e.g. processing databases of 100 sequences)

• distributed computing is most advantageous when applied to large jobs (e.g. processing databases of 100,000 sequences or more)

• the overheads of each node's process contribute to the total processing time

• any job that takes an hour or less to run on a single node does not need distributed computing


Distributing Jobs on the BioSlax Cloud

A common riddle:

“1 man digs 1 hole in 1 hour. How long will it take 10 men to dig 10 holes?”

If each man starts at (approximately) the same time and all other variables remain constant, all of them should finish at the same time => 10 men will take 1 hour to dig 10 holes.

Answer: 1 hour

Applied to the problem at hand – 1 slave node processes 1 chunk and submits results to the master node in 0.86 seconds => 3 nodes will process 3 chunks and submit results to the master node in 0.86 seconds.


Distributing Jobs on the BioSlax Cloud

Applying the extrapolated timings from the example to a db of 10,000,000 sequences:

• 10,000,000 sequences split into 40-sequence chunks = 250,000 chunks

• Instantiate 250,000 VMs on the Cloud => process all 10,000,000 sequences in approximately 0.86 seconds (in theory), plus some overheads!


Distributing Jobs on the BioSlax Cloud

    VMs    Time (s)
250,000           1
125,000           2
 62,500           4
 31,250           8
 15,625          16
  7,813          32
  3,907          64
  1,954         128
    977         256
    488         512
    245        1024
    122        2048
     61        4096
     30        8192

For ONLY the compute portion, with 30 nodes the 10,000,000-sequence db can be processed in 8192 seconds, or approximately 2.3 hours. (Each row of the table halves the number of VMs and doubles the per-node time, assuming roughly one second to process and return each 40-sequence chunk.)


Distributing Jobs on the BioSlax Cloud

Total time taken with 30 nodes =

Time to split the db into chunks (about 1.7 hours) +

Time to upload the executable, db chunks and helper script (about 30 seconds) +

Time for the nodes to process all chunks and send results back to the master (about 2.3 hours)

= 1.7 hours + 30 seconds + 2.3 hours ≈ 4 hours

=> ≈ 4× speed-up compared with running on a single node against a single 10,000,000-sequence db file


Distributing Jobs on the BioSlax Cloud

• In most cases (real-world situations)
  – network speeds vary
  – scalability is not linear

• Overheads need to be considered
  – pre-processing time (writing sub-programs to split files, etc.)
  – network delays
  – processing power of the individual VMs

• Despite the overheads, significant speed-up is very likely for large processing jobs.

• This is nothing more than cluster computing on the cloud, BUT the cloud offers the ability to scale the number of machines in the cluster without hardware costs and without queues.


Distributing Jobs on the BioSlax Cloud

Scripts and sample database are contained in a single tgz file and can be downloaded from:

ftp://sf01.bic.nus.edu.sg/incoming/bioslax/euasiagrid2010/euasiagrid_distcomp.tgz

Note:
  – BioPerl must be installed (http://www.bioperl.org)
  – TRE agrep must be installed (http://laurikari.net/tre/)
  – BioPerl and TRE agrep are available as SLAX LZM packages:
    • ftp://sf01.bic.nus.edu.sg/incoming/bioslax/tre.lzm
    • ftp://sf01.bic.nus.edu.sg/incoming/bioslax/zz01b_perl-update.lzm