fault tolerant clustering (ieee services 2012)

Fault Tolerant Clustering in Scien2fic Workflows

Weiwei Chen, Ewa Deelman Informa2on Sciences Ins2tute

University of Southern California

1

Outline

•  Introduc2on •  Workflow and Failure Model •  Fault Tolerant Clustering •  Experiments •  Task Specific Failures •  Loca2on Specific Failures

2

Introduc2on •  Task based Scien2fic Workflows –  Task –  Job

•  Task Clustering – Merges mul2ple small tasks into a job –  Reduce scheduling and submit overhead

•  Fault Tolerance in Task Clustering –  Exis2ng techniques underes2mate or ignore the influences of failures

3

Task Clustering •  Task Clustering – Horizontal Clustering – Ver2cal Clustering – Arbitrary Clustering

Clustering Factor (k): number of tasks in a job 4

System Overview

5

Timeline

with clustering

without clustering

Improvement

scheduling and submit delay

Task Failures and Job Failures •  We only focus on Transient Failure and Job Retry •  We don’t differen2ate the causes of failures but we concern about the average failure rate.

•  Assump2on: a failure is a random event independent of workflow characteris2cs or execu2on environment

•  Two Categories o  Task Failure: a task fails, other

tasks in the same job may not fail §  E.g. Applica2on

o  Job Failure: a job fails, all of its tasks fail §  E.g. Scheduling System

6

Influence of Failures on Clustering ttotal Es2mated Overall Run2me

n Number of tasks to run

t Run2me of a single task

r Number of available resources

d Time delay between jobs

N Expected retry 2mes for a single task

k Number of tasks in a job

β Job failure rate

α Task failure rate Target Func2on: Min (ttotal)

given n tasks to run on r resources task failure rate (α) is measurable (Task Failure Model) or job failure rate (β) is measurable (Job Failure Model)

Assump2on: n >> r, but n/k >> r 7

Job Failure Model

ttotal =

Nn(kt + d)rk

=n(kt + d)rk(1−β)

, if nk≥ r

N(kt + d) = (kt + d)1−β

, if nk< r

#

$

%%

&

%%

8

ttotal Es2mated Overall Run2me







β Job failure rate

α Task failure rate

N job =1

(1−β)

Ntotal =

N jobnrk

if nk≥ r

N job, if nk< r

"

#

$$

%

$$

t job = kt + d

ttotal = t jobNtotal

Run2me for a single job

Avg retry 2me for a single job

Retry 2me for all jobs

Overall run2me

Job Failure Model

ttotal =

Nn(kt + d)rk

=n(kt + d)rk(1−β)

, if nk≥ r

N(kt + d) = (kt + d)1−β

, if nk< r

#

$

%%

&

%%

k* = nr

ttotal* =(kt + d)1−β

It’s not necessary to adjust k. Just set it to be

9

n=1000, t=5 sec, d=5 sec, r=20

k* is independent of β

Task Failure Model

10

ttotal Es2mated Overall Run2me







β Job failure rate

α Task failure rate

N job =1

(1−α)k

Ntotal =

N jobnrk

if nk≥ r

N job, if nk< r

"

#

$$

%

$$

t job = kt + d

ttotal = t jobNtotal

Run2me for a single job

Avg retry 2me for a single job

Retry 2me for all jobs

Overall run2me

ttotal =

Nn(kt + d)rk

=n(kt + d)rk(1−α)k

, if nk≥ r

N(kt + d) = (kt + d)(1−α)k

, if nk< r

#

$

%%

&

%%

Task Failure Model

ttotal =

Nn(kt + d)rk


, if nk≥ r

N(kt + d) = (kt + d)(1−α)k

, if nk< r

#

$

%%

&

%%

k* =−d + d 2 − 4d

ln(1−α)2t

, if n >> r

ttotal* =

n(k*t + d)rk(1−α)k

*

k* is dependent of α

It’s necessary to adjust k according to α

11

Comparing TFM and JFM

k* =−d + d 2 − 4d

ln(1−α)2t

, if n >> r

ttotal* =


*

ttotal* =(kt + d)1−β

k* = nr

1. Linear increase vs exponen2al increase 2. Op2mal clustering factor

12

Fault Tolerant Clustering •  Job Failure Model: k=n/r •  Selec2ve Reclustering (SR) – select the failed tasks in a clustered job and cluster them into a new clustered job

–  It requires the iden2fica2on of failed tasks.

13

Fault Tolerant Clustering

•  Dynamic Clustering (DC) – adjust the clustering factor according to the task failure rates dynamically

ttotal,DC* =

n(k*t + d)rk*(1−α)k

*

k* =−d + d 2 − 4d

ln(1−α)2t

, if n >> r

14

Fault Tolerant Clustering

•  Dynamic Reclustering (DR) – A combina2on of SR and DC

15

Evalua2on

•  Run simula2ons based on the real traces that were run by the Pegasus group.

•  Each workflow was simulated 100 2mes so that the standard devia2on is less than 10%

•  Two workflows were used. •  20 worker nodes were used in each experiment.

16

Workflows Used •  Montage – An astronomy applica2on used to construct large image mosaics of the sky.

– Montage has complex data dependencies between tasks

– 10,422 tasks, 57GB data.

17 Image from hhp://montage.ipac.caltech.edu/

Workflows Used •  Periodogram –  Iden2fy periodic signals from light curves that arise from transi2ng planets.

– 216,600 tasks, 19GB input data. – Periodogram has only one level

18 Image from hhp://pegasus.isi.edu/presenta2ons/2011/sci709-‐voeckler-‐talk.ppt/

Simulator

•  Extension to CloudSim – Workflow Engine – Clustering Engine – Scheduler – Failure Generator – Failure Monitor

19

Performance •  NOOP: no op2miza2on, (k=n/r) •  DC (Dynamic Clustering) •  SR (Selec2ve Reclustering) •  DR ( Dynamic Reclustering) •  Overall Run2me in seconds

20

Performance

•  Periodogram

21

Performance

•  Montage

22

Task Specific Failure Detec2on (TSFD) •  Task Failures are related to the type of tasks •  Failure Monitor classifies failures based on the type •  Clustering Engine merges tasks based on different task

failure rates •  In this experiment of Montage, we set the task failure

rate of mProjectPP and mDiffFit to be 0.001 while mBackground ranges from 0.2 to 0.8.

α1 Optimization Methods

DR DR+TSFD DC DC+TSFD

0.2 10415 10412 13804 13820

0.4 11830 11839 22946 22923

0.6 14704 14688 60429 60414

0.8 23238 23229 436638 435297

1 The task failure rate of mBackground only!

23

Task Failure Model

ttotal =

Nn(kt + d)rk


, if nk≥ r

N(kt + d) = (kt + d)(1−α)k

, if nk< r

#

$

%%

&

%%

k* =−d + d 2 − 4d

ln(1−α)2t

, if n >> r

ttotal* =


*

ttotal is not sensi2ve to α

24 Simplifica2on of failures is acceptable

Loca2on Specific Failure Detec2on (LSFD) •  Task Failures are related to the loca2on of execu2on •  Failure Monitor classifies failures based on resource id

•  Scheduler orders resources based on their reliability. •  Two out of twenty nodes have a higher task failure rates (from 0.2 to 0.8) while others s2ll have a task failure rate of 0.001.

25

DC generates many small tasks if task failure rate is high

Conclusion

•  We present three basic methods to improve fault tolerance in task clustering

•  If the system supports iden2fica2on of failed tasks, dynamic reclustering performs best

•  Otherwise, use dynamic clustering •  Improvement is significant even for very basic method

26

Future Work

•  Ver2cal Clustering and Arbitrary Clustering •  Intelligent Scheduler •  More Workflow Examples •  Distribu2on of Failures

27

Ques2ons?

•  Thank you for coming! •  For further info, please visit: pegasus.isi.edu or email [email protected]

28

Refinements •  When n>>r does not hold in the end of execu2on

•  Default: •  Replica2ve: replicate jobs by •  Even:

kactual = k* njobs =

ntaskk

< r

kactual = k* njobs = r

rntask / k

kactual =ntaskr

njobs = r

29

Dynamic Performance

•  TFM and DC

30

fault tolerant clustering (ieee services 2012)

Technology

job failures

n job n

measurable task failure

n tasks

job failure model run2me

r resources task failure

single job t job

introduc2on task