Fault Tolerant Clustering (IEEE SERVICES 2012)
Fault Tolerant Clustering in Scientific Workflows
Weiwei Chen, Ewa Deelman
Information Sciences Institute
University of Southern California
Outline
• Introduction
• Workflow and Failure Model
• Fault Tolerant Clustering
• Experiments
• Task Specific Failures
• Location Specific Failures
Introduction
• Task-based Scientific Workflows
  – Task
  – Job
• Task Clustering
  – Merges multiple small tasks into a job
  – Reduces scheduling and submit overhead
• Fault Tolerance in Task Clustering
  – Existing techniques underestimate or ignore the influence of failures
Task Clustering
• Task Clustering
  – Horizontal Clustering
  – Vertical Clustering
  – Arbitrary Clustering
• Clustering Factor (k): number of tasks in a job
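To illustrate the clustering factor, here is a minimal Python sketch of horizontal clustering. It assumes the tasks of one workflow level arrive as a flat list and are merged in submission order; the function name `horizontal_cluster` and the task names are hypothetical and not part of any workflow system's API.

```python
from typing import List, Sequence


def horizontal_cluster(tasks: Sequence[str], k: int) -> List[List[str]]:
    """Merge tasks from one workflow level into jobs of at most k tasks."""
    if k < 1:
        raise ValueError("clustering factor k must be at least 1")
    # Consecutive slices of length k; the last job may hold fewer tasks.
    return [list(tasks[i:i + k]) for i in range(0, len(tasks), k)]


level = [f"task_{i}" for i in range(7)]   # hypothetical task names
jobs = horizontal_cluster(level, k=3)
print(jobs)
```

With k = 1 every task becomes its own job, which corresponds to running without clustering.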
System Overview
Timeline
[Figure: execution timeline with clustering vs. without clustering, marking the improvement and the scheduling and submit delay]
Task Failures and Job Failures
• We focus only on transient failures and job retry
• We do not differentiate the causes of failures; we are concerned only with the average failure rate
• Assumption: a failure is a random event independent of workflow characteristics or execution environment
• Two categories
  o Task Failure: a task fails; other tasks in the same job may not fail
    § E.g. Application
  o Job Failure: a job fails; all of its tasks fail
    § E.g. Scheduling System
Influence of Failures on Clustering
Symbol    Meaning
t_total   Estimated overall runtime
n         Number of tasks to run
t         Runtime of a single task
r         Number of available resources
d         Time delay between jobs
N         Expected number of retries for a single task
k         Number of tasks in a job
β         Job failure rate
α         Task failure rate
Target function: min(t_total), given n tasks to run on r resources, where either the task failure rate (α) is measurable (Task Failure Model) or the job failure rate (β) is measurable (Job Failure Model)
Assumption: n >> r, but n/k >> r
Job Failure Model
Runtime for a single job:
$$t_{job} = kt + d$$
Average number of retries for a single job:
$$N_{job} = \frac{1}{1-\beta}$$
Number of retries for all jobs:
$$N_{total} = \begin{cases} \dfrac{N_{job}\, n}{rk}, & \text{if } \frac{n}{k} \ge r \\ N_{job}, & \text{if } \frac{n}{k} < r \end{cases}$$
Overall runtime (with $N = N_{job}$):
$$t_{total} = t_{job} N_{total} = \begin{cases} \dfrac{N n (kt + d)}{rk} = \dfrac{n(kt + d)}{rk(1-\beta)}, & \text{if } \frac{n}{k} \ge r \\ N(kt + d) = \dfrac{kt + d}{1-\beta}, & \text{if } \frac{n}{k} < r \end{cases}$$
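Job Failure Model
$$t_{total} = \begin{cases} \dfrac{N n (kt + d)}{rk} = \dfrac{n(kt + d)}{rk(1-\beta)}, & \text{if } \frac{n}{k} \ge r \\ N(kt + d) = \dfrac{kt + d}{1-\beta}, & \text{if } \frac{n}{k} < r \end{cases}$$
$$k^* = \frac{n}{r}, \qquad t_{total}^* = \frac{k^* t + d}{1-\beta}$$
It is not necessary to adjust k; just set it to k* = n/r.
Parameters: n = 1000, t = 5 sec, d = 5 sec, r = 20
k* is independent of β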
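The claim that k* = n/r regardless of β can be checked numerically. The sketch below evaluates the Job Failure Model runtime for every integer k and picks the minimum, using the slide's parameters (n = 1000, t = 5, d = 5, r = 20); the function name `jfm_total_runtime` and the two β values are illustrative assumptions.

```python
def jfm_total_runtime(n: int, t: float, d: float, r: int, k: int, beta: float) -> float:
    """Estimated overall runtime under the Job Failure Model."""
    n_job = 1.0 / (1.0 - beta)          # expected number of tries per job
    t_job = k * t + d                   # runtime of one clustered job
    if n / k >= r:
        return n_job * n * t_job / (r * k)
    return n_job * t_job


n, t, d, r = 1000, 5.0, 5.0, 20         # parameters quoted on the slide
for beta in (0.1, 0.5):                 # illustrative job failure rates (assumption)
    best_k = min(range(1, n + 1), key=lambda k: jfm_total_runtime(n, t, d, r, k, beta))
    print(beta, best_k, jfm_total_runtime(n, t, d, r, best_k, beta))
```

Because β only scales the runtime by 1/(1−β), the minimizing k is the same for both values.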
Task Failure Model
Runtime for a single job:
$$t_{job} = kt + d$$
Average number of retries for a single job:
$$N_{job} = \frac{1}{(1-\alpha)^k}$$
Number of retries for all jobs:
$$N_{total} = \begin{cases} \dfrac{N_{job}\, n}{rk}, & \text{if } \frac{n}{k} \ge r \\ N_{job}, & \text{if } \frac{n}{k} < r \end{cases}$$
Overall runtime (with $N = N_{job}$):
$$t_{total} = t_{job} N_{total} = \begin{cases} \dfrac{N n (kt + d)}{rk} = \dfrac{n(kt + d)}{rk(1-\alpha)^k}, & \text{if } \frac{n}{k} \ge r \\ N(kt + d) = \dfrac{kt + d}{(1-\alpha)^k}, & \text{if } \frac{n}{k} < r \end{cases}$$
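The factor 1/(1−α)^k follows if task failures are independent and the whole job is retried whenever any of its k tasks fails: the number of tries is then geometric with success probability (1−α)^k. The Monte Carlo sketch below checks this under exactly those assumptions; the function name, the values of k and α, and the trial count are illustrative choices, not taken from the slides.

```python
import random


def simulate_job_tries(k: int, alpha: float, trials: int, seed: int = 0) -> float:
    """Average number of tries until a job of k tasks runs with no task failure."""
    rng = random.Random(seed)
    total_tries = 0
    for _ in range(trials):
        tries = 1
        # A try succeeds only if all k tasks succeed; each fails with probability alpha.
        while any(rng.random() < alpha for _ in range(k)):
            tries += 1
        total_tries += tries
    return total_tries / trials


k, alpha = 5, 0.1                        # illustrative values (assumption)
simulated = simulate_job_tries(k, alpha, trials=20000)
predicted = 1.0 / (1.0 - alpha) ** k
print(simulated, predicted)
```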
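Task Failure Model
$$t_{total} = \begin{cases} \dfrac{N n (kt + d)}{rk} = \dfrac{n(kt + d)}{rk(1-\alpha)^k}, & \text{if } \frac{n}{k} \ge r \\ N(kt + d) = \dfrac{kt + d}{(1-\alpha)^k}, & \text{if } \frac{n}{k} < r \end{cases}$$
Setting $dt_{total}/dk = 0$ in the $n/k \ge r$ branch gives $tk^2 + dk + \frac{d}{\ln(1-\alpha)} = 0$, whose positive root is
$$k^* = \frac{-d + \sqrt{d^2 - \dfrac{4td}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
$$t_{total}^* = \frac{n(k^* t + d)}{r k^* (1-\alpha)^{k^*}}$$
k* depends on α
It is necessary to adjust k according to α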
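The optimal clustering factor can be cross-checked against a brute-force search. Minimizing (kt + d) / (k(1−α)^k) over k leads to the quadratic t·k² + d·k + d/ln(1−α) = 0, and the sketch below compares its positive root with the best integer k. The function names and α = 0.05 are illustrative assumptions; n, t, d and r reuse the parameters quoted for the Job Failure Model.

```python
import math


def tfm_total_runtime(n: int, t: float, d: float, r: int, k: int, alpha: float) -> float:
    """Estimated overall runtime under the Task Failure Model (n/k >= r branch)."""
    return n * (k * t + d) / (r * k * (1.0 - alpha) ** k)


def tfm_optimal_k(t: float, d: float, alpha: float) -> float:
    """Positive root of t*k^2 + d*k + d/ln(1 - alpha) = 0."""
    log_term = math.log(1.0 - alpha)     # negative for 0 < alpha < 1
    return (-d + math.sqrt(d * d - 4.0 * t * d / log_term)) / (2.0 * t)


n, t, d, r = 1000, 5.0, 5.0, 20
alpha = 0.05                             # illustrative task failure rate (assumption)
k_star = tfm_optimal_k(t, d, alpha)
# Search only k <= n/r = 50 so that the n/k >= r branch applies.
k_best = min(range(1, 51), key=lambda k: tfm_total_runtime(n, t, d, r, k, alpha))
print(k_star, k_best)
```

The closed form is real-valued, so in practice it has to be rounded to a whole number of tasks; the brute-force result shows which neighbor is better for these parameters.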
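Comparing TFM and JFM
TFM:
$$k^* = \frac{-d + \sqrt{d^2 - \dfrac{4td}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
$$t_{total}^* = \frac{n(k^* t + d)}{r k^* (1-\alpha)^{k^*}}$$
JFM:
$$k^* = \frac{n}{r}, \qquad t_{total}^* = \frac{k^* t + d}{1-\beta}$$
1. Linear increase vs exponential increase
2. Optimal clustering factor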
Fault Tolerant Clustering
• Job Failure Model: k = n/r
• Selective Reclustering (SR)
  – Selects the failed tasks in a clustered job and clusters them into a new clustered job
  – Requires the identification of failed tasks
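A minimal sketch of the selective reclustering step, assuming the system reports a per-task success flag for each clustered job. The function name `selective_recluster`, the status map, and the task ids are hypothetical.

```python
from typing import Dict, List


def selective_recluster(job: List[str], succeeded: Dict[str, bool]) -> List[str]:
    """Return a new clustered job holding only the tasks of `job` that failed."""
    # Tasks missing from the status map are treated as failed (assumption).
    return [task for task in job if not succeeded.get(task, False)]


job = ["t1", "t2", "t3", "t4"]                      # hypothetical task ids
status = {"t1": True, "t2": False, "t3": True, "t4": False}
retry_job = selective_recluster(job, status)
print(retry_job)
```

Only the failed tasks are resubmitted, so work already completed by the successful tasks is not repeated.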
Fault Tolerant Clustering
• Dynamic Clustering (DC)
  – Adjusts the clustering factor dynamically according to the task failure rate
$$k^* = \frac{-d + \sqrt{d^2 - \dfrac{4td}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
$$t_{total,DC}^* = \frac{n(k^* t + d)}{r k^* (1-\alpha)^{k^*}}$$
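One way to picture dynamic clustering is as a loop that re-estimates α from monitoring data and recomputes k before each batch. The sketch below assumes α is estimated as failed attempts over total attempts and that k is rounded and capped at a maximum such as n/r; both choices, and all names and numbers, are assumptions of this example rather than details given on the slides.

```python
import math


def estimate_alpha(failed: int, attempted: int) -> float:
    """Observed task failure rate so far (0.0 when nothing has run yet)."""
    return failed / attempted if attempted else 0.0


def dynamic_k(t: float, d: float, alpha: float, k_max: int) -> int:
    """Clustering factor for the next batch, capped at k_max (e.g. n/r)."""
    if alpha <= 0.0:
        return k_max                     # no observed failures: cluster as much as allowed
    if alpha >= 1.0:
        return 1                         # every task fails: clustering cannot help
    # Positive root of t*k^2 + d*k + d/ln(1 - alpha) = 0.
    root = (-d + math.sqrt(d * d - 4.0 * t * d / math.log(1.0 - alpha))) / (2.0 * t)
    return max(1, min(k_max, round(root)))


# Illustrative monitoring snapshots: (failed tasks, attempted tasks).
for failed, attempted in [(0, 100), (5, 100), (40, 100)]:
    alpha = estimate_alpha(failed, attempted)
    print(alpha, dynamic_k(t=5.0, d=5.0, alpha=alpha, k_max=50))
```

As the observed failure rate rises, the clustering factor shrinks toward 1, so a single failure wastes less completed work.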
Fault Tolerant Clustering
• Dynamic Reclustering (DR)
  – A combination of SR and DC
Evaluation
• Simulations are based on real traces of workflows run by the Pegasus group
• Each workflow was simulated 100 times so that the standard deviation is less than 10%
• Two workflows were used
• 20 worker nodes were used in each experiment
Workflows Used
• Montage
  – An astronomy application used to construct large image mosaics of the sky
  – Montage has complex data dependencies between tasks
  – 10,422 tasks, 57 GB of data
Image from http://montage.ipac.caltech.edu/
Workflows Used
• Periodogram
  – Identifies periodic signals from light curves that arise from transiting planets
  – 216,600 tasks, 19 GB of input data
  – Periodogram has only one level
Image from http://pegasus.isi.edu/presentations/2011/sci709-voeckler-talk.ppt/
Simulator
• Extension to CloudSim
  – Workflow Engine
  – Clustering Engine
  – Scheduler
  – Failure Generator
  – Failure Monitor
Performance
• NOOP: no optimization (k = n/r)
• DC (Dynamic Clustering)
• SR (Selective Reclustering)
• DR (Dynamic Reclustering)
• Overall runtime in seconds
Performance
• Periodogram
Performance
• Montage
Task Specific Failure Detection (TSFD)
• Task failures are related to the type of task
• The Failure Monitor classifies failures based on task type
• The Clustering Engine merges tasks based on the different task failure rates
• In this Montage experiment, the task failure rate of mProjectPP and mDiffFit is set to 0.001, while that of mBackground ranges from 0.2 to 0.8
Optimization methods:
α¹     DR       DR+TSFD   DC        DC+TSFD
0.2    10415    10412     13804     13820
0.4    11830    11839     22946     22923
0.6    14704    14688     60429     60414
0.8    23238    23229     436638    435297
¹ The task failure rate of mBackground only.
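The classification step can be sketched as a per-type counter. The example below assumes the failure monitor receives one (task type, failed?) record per attempt; the function name and the event log are hypothetical, and only the Montage task type names come from the slide.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def failure_rate_by_type(events: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each task type to failed attempts / total attempts."""
    attempts = defaultdict(int)
    failures = defaultdict(int)
    for task_type, failed in events:
        attempts[task_type] += 1
        if failed:
            failures[task_type] += 1
    return {task_type: failures[task_type] / attempts[task_type] for task_type in attempts}


# Hypothetical event log: (task type, did the attempt fail?).
events = [("mProjectPP", False), ("mBackground", True), ("mBackground", False),
          ("mDiffFit", False), ("mBackground", True), ("mProjectPP", False)]
rates = failure_rate_by_type(events)
print(rates)
```

A clustering engine could then use a separate failure rate, and hence a separate clustering factor, for each task type.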
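Task Failure Model
$$t_{total} = \begin{cases} \dfrac{N n (kt + d)}{rk} = \dfrac{n(kt + d)}{rk(1-\alpha)^k}, & \text{if } \frac{n}{k} \ge r \\ N(kt + d) = \dfrac{kt + d}{(1-\alpha)^k}, & \text{if } \frac{n}{k} < r \end{cases}$$
$$k^* = \frac{-d + \sqrt{d^2 - \dfrac{4td}{\ln(1-\alpha)}}}{2t}, \quad \text{if } n \gg r$$
$$t_{total}^* = \frac{n(k^* t + d)}{r k^* (1-\alpha)^{k^*}}$$
t_total is not sensitive to α
Simplification of failures is acceptable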
Location Specific Failure Detection (LSFD)
• Task failures are related to the location of execution
• The Failure Monitor classifies failures based on resource id
• The Scheduler orders resources based on their reliability
• Two out of twenty nodes have a higher task failure rate (from 0.2 to 0.8), while the others still have a task failure rate of 0.001
DC generates many small tasks if the task failure rate is high
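Ordering resources by reliability can be sketched as a sort on the observed per-resource failure rate. The example assumes the failure monitor keeps (failed attempts, total attempts) per resource id; the function name, node names, and counts are hypothetical.

```python
from typing import Dict, List, Tuple


def order_by_reliability(stats: Dict[str, Tuple[int, int]]) -> List[str]:
    """Sort resource ids from most to least reliable.

    stats maps resource id -> (failed attempts, total attempts).
    """
    def failure_rate(resource_id: str) -> float:
        failed, total = stats[resource_id]
        # Resources with no attempts yet count as reliable (assumption).
        return failed / total if total else 0.0

    # sorted() is stable, so resources with equal rates keep their input order.
    return sorted(stats, key=failure_rate)


stats = {"node-1": (40, 100), "node-2": (0, 100), "node-3": (1, 100)}  # hypothetical counts
print(order_by_reliability(stats))
```

A scheduler that assigns jobs in this order would use the failure-prone nodes last.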
Conclusion
• We present three basic methods to improve fault tolerance in task clustering
• If the system supports identification of failed tasks, dynamic reclustering performs best
• Otherwise, use dynamic clustering
• The improvement is significant even for a very basic method
Future Work
• Vertical Clustering and Arbitrary Clustering
• Intelligent Scheduler
• More Workflow Examples
• Distribution of Failures
Questions?
• Thank you for coming!
• For further info, please visit pegasus.isi.edu or email [email protected]
Refinements
• When n >> r does not hold at the end of execution:
  – Default: $k_{actual} = k^*$, $n_{jobs} = \frac{n_{task}}{k} < r$
  – Replicative: replicate jobs by $\frac{r}{n_{task}/k}$; $k_{actual} = k^*$, $n_{jobs} = r$
  – Even: $k_{actual} = \frac{n_{task}}{r}$, $n_{jobs} = r$
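The three end-of-execution policies can be compared side by side. The sketch below returns (k_actual, n_jobs) for each policy; rounding up to whole tasks and jobs is an assumption of this example, as are the function name `refine` and the sample numbers.

```python
import math
from typing import Tuple


def refine(policy: str, n_task: int, k_star: int, r: int) -> Tuple[int, int]:
    """Return (k_actual, n_jobs) for the remaining n_task tasks."""
    if n_task < 1 or k_star < 1 or r < 1:
        raise ValueError("n_task, k_star and r must be positive")
    if policy == "default":
        # Keep k*; fewer than r jobs are submitted, so some resources stay idle.
        return k_star, math.ceil(n_task / k_star)
    if policy == "replicative":
        # Keep k* and replicate jobs until all r resources are busy.
        return k_star, r
    if policy == "even":
        # Spread the remaining tasks evenly over the r resources.
        k_actual = math.ceil(n_task / r)
        return k_actual, math.ceil(n_task / k_actual)
    raise ValueError(f"unknown policy: {policy}")


for policy in ("default", "replicative", "even"):
    print(policy, refine(policy, n_task=40, k_star=10, r=20))
```

When n_task is not a multiple of r, the rounded "even" policy may produce slightly fewer than r jobs.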
Dynamic Performance
• TFM and DC