BioSlax Cloud: Distributing your jobs


TRANSCRIPT

  • BioSlax Cloud: Distributing your jobs

    Copyright 2010. National University of Singapore. All rights reserved.

    Distributing Jobs on the BioSlax Cloud

    Stages of distributing jobs:
    - Establishing secure communications
    - Splitting data
    - Distributing executables and data
    - Processing at the nodes
    - Collation of results

    Examine a simple example: a fuzzy search, using agrep (a fuzzy-search grep utility) and Bioperl.


    Distributing Jobs on the BioSlax Cloud

    The problem:

    Find matches in the nr database that include 1 to 4 amino-acid mismatches to any given input sequence.

    For example, given the hypothetical protein record in a database:

    >gi|284518918_M5|gb|ADB92594.1_M5
    FLDGIDKAQEEHEKYHSNWRAMVSDFNLPPVVAKEIVASCDKCQLKGEAMHGQVDCSPGIWQLDCTHLEGKIILVAVHVASGYIEAEVIPAGTGQETAYFLLKLAGRWPVKTIHTDNGSNFTSATVKAACWWAGIKQEFGIPYNPQSQGVVESMHKELKKIIGQVRDQAEHLKTAVQMAVFIHNFKRKGGIGGYSAGERIVDIIATDLQTRELEKEITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNSDIKVVPHKKAKIIRD

    and an input sequence of:

    DIQTKELQKQITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNS

    the input sequence is found in the protein record with 4 mismatches, as follows:

    Database record: DLQTRELEKEITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNS
    Input sequence:  DIQTKELQKQITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNS
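
    For illustration, this is the kind of approximate match that TRE agrep performs on the command line. A hedged sketch only: it assumes TRE agrep's -E option for the maximum number of errors and that the target sequence sits on a single line; the exercise instead wraps agrep in the agrep.pl Bioperl script, presumably to handle multi-line FASTA records.

    tre-agrep -E 4 "DIQTKELQKQITKIQNFRVYYRDSRDPLWKGPAKLLWKGEGAVVIQDNS" sequence.fasta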


    Distributing Jobs on the BioSlax Cloud

    Executable: agrep.pl, a Perl script using Bioperl
    Database: sequence.fasta, a small subset db with about 100 sequences
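
    A minimal sketch of the single-node baseline run this slide sets up. It assumes agrep.pl takes the FASTA file as its only argument (as the remote_process_send script later suggests), with the input sequence embedded in the script itself:

    time ./agrep.pl sequence.fasta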


    Distributing Jobs on the BioSlax Cloud


    Distributing Jobs on the BioSlax Cloud

    The run finds 5 matches.

    The database is only 100 sequences; NR is > 10,000,000 sequences.

    Scaled linearly, the full NR database would take (10,000,000 / 100) x 0.55 seconds = 55,000 seconds, or approximately 15 hours, to complete.
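
    A quick back-of-envelope check of that estimate, assuming the measured 0.55 seconds per 100 sequences scales linearly:

    echo "scale=2; (10000000 / 100) * 0.55 / 3600" | bc    # => 15.27 hours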


    Distributing Jobs on the BioSlax Cloud

    Use 4 BioSlax VMs on the Cloud: 1 master node and 3 slave nodes.

    remote_process_send
    Shell script that is executed by each slave node to do the processing and then scp the results file back to the master node.

    01-split_sequence
    Perl script to split the db into chunks of X sequences per chunk; here the 100 sequences are split 40 per chunk => 3 chunks (or 3 files).

    02-upload_parts
    Shell script using scp with public-key authentication to upload agrep.pl, one chunk and the remote_process_send script to each slave node.

    03-call_slave_to_execute
    Shell script using ssh with public-key authentication to execute agrep.pl against each chunk on each of the slave nodes concurrently and have the slave scp the results file back to the master, done using the remote_process_send shell script. The scripts are run in the order shown in the sketch below.
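
    Taken together, a run from the master node would look something like the following sketch (script names as listed above; 40 is the chosen chunk size, and the scripts are assumed to be executable):

    # on bioslax01 (master), in the directory holding sequence.fasta and the scripts
    ./01-split_sequence sequence.fasta 40   # writes seq_1.fasta .. seq_3.fasta
    ./02-upload_parts                       # copy agrep.pl, remote_process_send and one chunk to each slave
    ./03-call_slave_to_execute              # run the chunks on all slaves concurrently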


    Distributing Jobs on the BioSlax Cloud

    remote_process_send

    #!/bin/sh

    HOSTN=`hostname`

    # run agrep.pl on every chunk present on this slave, appending matches to <hostname>.results,
    # then copy the results file back to the master node (bioslax01)
    for i in seq_*.fasta
    do
        ./agrep.pl $i >> $HOSTN.results
        scp ./$HOSTN.results root@bioslax01:/mnt/hda1/downloads/. 1> /dev/null 2> /dev/null
    done
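
    A manual test of this script on a single slave, run from bioslax01 once 02-upload_parts has copied the files over; this is exactly what 03-call_slave_to_execute automates across all slaves:

    ssh -l root bioslax02 "./remote_process_send"
    ls /mnt/hda1/downloads/bioslax02.results    # the slave's results file should now be on the master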


    Distributing Jobs on the BioSlax Cloud

    01-split_sequence

    #!/usr/bin/perl

    # $ARGV[0] = FASTA db file, $ARGV[1] = number of sequences per chunk
    open (DBFILE, "$ARGV[0]");
    $fcount = 1; $count = 0;

    while (<DBFILE>) {
        my ($line) = $_;
        chomp($line);
        # each '>' header marks the start of a new sequence
        if ( $line =~ />/ ) { $count += 1; }
        # once the current chunk already holds $ARGV[1] sequences, start the next chunk with this one
        if ( $count > $ARGV[1] ) { $fcount += 1; $count = 1; }
        open (NEWFILE, ">>seq_$fcount.fasta");
        print NEWFILE "$line\n";
        close (NEWFILE);
    }
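
    Usage on the example data, splitting the roughly 100-sequence db into 40-sequence chunks (assuming the script has been made executable):

    ./01-split_sequence sequence.fasta 40
    ls seq_*.fasta    # seq_1.fasta  seq_2.fasta  seq_3.fasta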


    Distributing Jobs on the BioSlax Cloud

    02-upload_parts

    #!/bin/sh

    count=1

    # slaves are bioslax02..bioslax04; $count starts at 1 and is bumped to 2, 3, 4,
    # so each chunk (plus a copy of the scripts) goes to a different slave
    for i in seq_*.fasta
    do
        count=`expr $count + 1`
        scp ./agrep.pl root@bioslax0${count}:. 1> /dev/null 2> /dev/null
        scp ./remote_process_send root@bioslax0${count}:. 1> /dev/null 2> /dev/null
        scp ./$i root@bioslax0${count}:. 1> /dev/null 2> /dev/null
    done
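
    A quick check, run from the master, that the upload reached a slave:

    ssh -l root bioslax02 "ls agrep.pl remote_process_send seq_*.fasta"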


    Distributing Jobs on the BioSlax Cloud

    03-call_slave_to_execute

    #!/bin/sh

    # remove any results file left over from a previous run
    if [ -f results ]
    then
        rm results
    fi

    count=1

    # launch remote_process_send on each slave in the background so the chunks are processed concurrently
    for i in seq_*.fasta
    do
        count=`expr $count + 1`
        ssh -l root bioslax0${count} "./remote_process_send" &
    done
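
    The slides list "collation of results" as the final stage but do not show it; a minimal sketch of what it might look like, appended to the end of 03-call_slave_to_execute so it runs after the backgrounded ssh jobs (file locations as used by remote_process_send):

    wait                                                  # wait for all slaves to finish
    cat /mnt/hda1/downloads/bioslax0*.results > results   # merge per-slave results on the master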


    Distributing Jobs on the BioSlax Cloud


    Distributing Jobs on the BioSlax Cloud

    Establish secure communications between the master and slave nodes using SSH public-key authentication.

    Master node: bioslax01
    Slave nodes: bioslax02, bioslax03, bioslax04

    1. Generate public and private keys on bioslax01: run ssh-keygen -t rsa, which generates id_rsa and id_rsa.pub in /root/.ssh
    2. Copy id_rsa.pub to each slave node as /root/.ssh/authorized_keys
    3. Repeat step 1 on all the slave nodes
    4. Copy the contents of each slave's id_rsa.pub (bioslax02 to bioslax04) into the file /root/.ssh/authorized_keys on bioslax01

    Should now be able to ssh and scp/sftp between bioslax01 and bioslax02, bioslax03, bioslax04 without keying in passwords

    * This has already been set up between bioslax01 and bioslax02, 03 and 04.
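
    A minimal sketch of steps 1 and 2 as shell commands run on bioslax01 (the first copy still prompts for the slave's password, and it assumes /root/.ssh already exists on the slave; repeat for bioslax03 and bioslax04):

    ssh-keygen -t rsa                                                    # creates /root/.ssh/id_rsa and id_rsa.pub
    scp /root/.ssh/id_rsa.pub root@bioslax02:/root/.ssh/authorized_keys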


    Distributing Jobs on the BioSlax Cloud

    It takes 0.06 seconds to split 100 sequences into chunks of 40 sequences.

    Scaled linearly, splitting 10,000,000 sequences will take (10,000,000 / 100) x 0.06 = 6,000 seconds, or approximately 1.7 hours.


    Distributing Jobs on the BioSlax Cloud

    - The agrep.pl executable is 950 bytes (0.00095 MB)
    - The remote_process_send script is 175 bytes (0.000175 MB)
    - Each 40-sequence chunk is 14,000 bytes (0.014 MB)
    - A 1 Gbit network => 125 MB/sec transfer rate
    - Each executable + chunk will take (0.014 + 0.00095 + 0.000175) / 125 = 0.000121 seconds to transfer
    - For 10,000,000 sequences split by 40 there will be 250,000 chunks => approximately 0.000121 x 250,000 seconds to upload all the chunks, or approximately 30 seconds


    Distributing Jobs on the BioSlax Cloud

    It takes 0.51 seconds to run agrep.pl against a 40-sequence chunk on each slave node => NOT SIGNIFICANTLY FASTER THAN PROCESSING ON A SINGLE NODE!

    It takes approximately 0.86 seconds on each slave to run agrep.pl against its chunk AND send the results file back to the master node (done by the remote_process_send script) => LONGER THAN PROCESSING ON A SINGLE NODE!


    Distributing Jobs on the BioSlax Cloud

    All the nodes show very similar timings for processing their chunk and sending the results back to the master node.


    Distributing Jobs on the BioSlax Cloud

    Any speed-up is dependent on the size of the job:

    - Distributed computing is not advantageous when applied to small jobs (e.g. processing dbs of 100 sequences)
    - Distributed computing is most advantageous when applied to large jobs (e.g. processing dbs of 100,000 sequences or more)
    - The overheads of each node's process contribute to the time taken for processing
    - Any job that takes an hour or less to run on a single node doesn't need distributed computing


    Distributing Jobs on the BioSlax Cloud

    A common riddle: 1 man digs 1 hole in 1 hour. How long will it take 10 men to dig 10 holes?

    Each man starts at (approximately) the same time; with all other variables constant, all of them should finish at the same time => 10 men will take 1 hour to dig 10 holes.

    Answer: 1 hour

    Applied to the problem at hand: 1 slave node processes 1 chunk and submits its results to the master node in 0.86 seconds => 3 nodes will process 3 chunks and submit their results to the master node in 0.86 seconds.


    Distributing Jobs on the BioSlax Cloud

    Apply the extrapolated timings from the example to a db of 10,000,000 sequences:

    10,000,000 split by 40 sequences = 250,000 chunks

    Instantiate 250,000 VMs on the Cloud => process all 10,000,000 sequences in approximately 0.86 seconds (in theory), plus some overheads!


    Distributing Jobs on the BioSlax Cloud

    For ONLY the compute portion, with 30 nodes, the 10,000,000-sequence db can be processed in 8,192 seconds, or approximately 2.3 hours.


    Distributing Jobs on the BioSlax Cloud

    Total time taken with 30 nodes
    = time to split the db into chunks (1.7 hours)
    + time to upload the executable, db chunks and script (30 seconds)
    + time for the nodes to process all chunks and send results back to the master (2.3 hours)

    = 1.7 hours + 30 seconds + 2.3 hours, or approximately 4 hours

    => roughly a 4x speed-up compared to running on a single node against a single 10,000,000-sequence db file


    Distributing Jobs on the BioSlax Cloud

    In most cases (real-world situations):
    - network speeds vary
    - scalability is not linear

    Overheads need to be considered:
    - pre-processing time (writing sub-programs to split files, etc.)
    - network delays
    - the processing power of the individual VMs

    Despite the overheads, for large processing jobs a significant speed-up is very likely.

    This is nothing more than cluster computing on the cloud, BUT the cloud offers the ability to scale the number of machines in the cluster without hardware costs and without queues.


    Distributing Jobs on the BioSlax Cloud

    The scripts and sample database are contained in a single tgz file and can be downloaded from:

    ftp://sf01.bic.nus.edu.sg/incoming/bioslax/euasiagrid2010/euasiagrid_distcomp.tgz

    Note:
    - Bioperl must be installed (http://www.bioperl.org)
    - TRE agrep must be installed (http://laurikari.net/tre/)
    - Bioperl and TRE agrep are available as SLAX LZM packages:
      ftp://sf01.bic.nus.edu.sg/incoming/bioslax/tre.lzm
      ftp://sf01.bic.nus.edu.sg/incoming/bioslax/zz01b_perl-update.lzm
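
    A minimal sketch of fetching and unpacking the bundle on the master node, assuming wget is available on the BioSlax image:

    wget ftp://sf01.bic.nus.edu.sg/incoming/bioslax/euasiagrid2010/euasiagrid_distcomp.tgz
    tar xzf euasiagrid_distcomp.tgz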
