workflow fairness control on online and non-clairvoyant distributed computing platforms

1 Rafael Ferreira da Silva – [email protected] Workflow Fairness Control on Online and Non-Clairvoyant Distributed Computing Platforms Rafael FERREIRA DA SILVA , Tristan GLATARD University of Lyon, CNRS, INSERM, CREATIS Villeurbanne, France Frédéric DESPREZ INRIA, University of Lyon, LIP, ENS Lyon Lyon, France Euro-Par 2013 August 26-30, 2013

Upload: rafael-ferreira-da-silva

Post on 26-Jun-2015


DESCRIPTION

Presentation held at Euro-Par 2013, Aachen, Germany.

Abstract. Fairly allocating distributed computing resources among workflow executions is critical to multi-user platforms. However, this problem remains mostly studied in clairvoyant and offline conditions, where task durations on resources are known, or the workload and available resources do not vary over time. We consider a non-clairvoyant, online fairness problem where the platform workload, task costs and resource characteristics are unknown and non-stationary. We propose a fairness control loop which assigns task priorities based on the fraction of pending work in the workflows. Workflow characteristics and performance on the target resources are estimated progressively, as information becomes available during the execution. Our method is implemented and evaluated on 4 different applications executed in production conditions on the European Grid Infrastructure. Results show that our technique reduces slowdown variability by a factor of 3 to 7 compared to first-come-first-served.

More information: www.rafaelsilva.com

TRANSCRIPT

Page 1: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Workflow Fairness Control on Online and Non-Clairvoyant Distributed Computing Platforms

Rafael FERREIRA DA SILVA, Tristan GLATARD
University of Lyon, CNRS, INSERM, CREATIS, Villeurbanne, France

Frédéric DESPREZ
INRIA, University of Lyon, LIP, ENS Lyon, Lyon, France

Euro-Par 2013, August 26-30, 2013

Page 2: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Outline

  Context
    The Virtual Imaging Platform
    Problem definition
      Fairness among workflow executions
      Self-healing of workflow executions on grids
  Fairness control process
  Experiments and results
  Conclusion


Page 4: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Context

  Virtual Imaging Platform (VIP)
    Medical imaging science-gateway
    Grid of ~180 sites (EGI, http://www.egi.eu)
    Significant usage
      452 registered users from 50 countries
      Consumed 472 CPU years from August 2012 to July 2013 (http://dirac.france-grilles.fr)

[Figure: VIP consumption since August 2012]

Page 5: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Workflow Execution

1. Input data upload
2. User launches a simulation
3. MOTEUR generates invocations
4. GASW generates grid jobs
5. Jobs are submitted to DIRAC
6. Pilot jobs are submitted to EGI
7. Pilot jobs fetch grid jobs
8. Inputs download
9. Execution
10. Results upload
11. Download results

Page 6: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Fairness among workflow executions

  Under resource contention, workflows are unequally slowed down by concurrent executions

[Figure: Gantt charts of 3 identical workflows submitted sequentially (t_{i,j} = 10s), with tasks t_{i,1} to t_{i,5} placed on resources R1, R2 and R3 along a time axis in steps of 10 s, shown with and without concurrent executions]

Slowdown (s):

$s = \frac{M_{multi}}{M_{own}}$

where $M_{multi}$ is the makespan with concurrent executions and $M_{own}$ is the makespan without concurrent executions.

$s_1 = \frac{20}{20} = 1.0$

$s_2 = \frac{40}{20} = 2.0$

$s_3 = \frac{50}{20} = 2.5$

Identical workflow executions do not experience the same slowdown
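The slowdown values on this slide can be reproduced with a small simulation. The sketch below is only an illustration: it assumes independent tasks and a greedy first-come-first-served placement on three identical resources, and the function names are hypothetical (they are not part of VIP, MOTEUR or DIRAC).

```python
import heapq


def fcfs_makespans(workflows, n_resources):
    """Greedy FCFS: each task goes to the earliest-free resource.

    `workflows` is a list of lists of task durations, in submission order.
    Returns the completion time (makespan) of each workflow.
    """
    free_at = [0.0] * n_resources  # time at which each resource becomes free
    heapq.heapify(free_at)
    makespans = []
    for tasks in workflows:
        finish = 0.0
        for duration in tasks:
            start = heapq.heappop(free_at)
            end = start + duration
            heapq.heappush(free_at, end)
            finish = max(finish, end)
        makespans.append(finish)
    return makespans


def slowdowns(workflows, n_resources):
    """s = M_multi / M_own for each workflow."""
    m_multi = fcfs_makespans(workflows, n_resources)
    # M_own: the same workflow executed alone on the same resources
    m_own = [fcfs_makespans([w], n_resources)[0] for w in workflows]
    return [multi / own for multi, own in zip(m_multi, m_own)]


# 3 identical workflows of 5 tasks of 10 s each, on 3 resources
print(slowdowns([[10.0] * 5] * 3, 3))  # [1.0, 2.0, 2.5]
```

Under these assumptions the three workflows finish at 20 s, 40 s and 50 s, which matches the slowdowns 1.0, 2.0 and 2.5 given on the slide.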

Page 7: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Fairness among workflow executions

  Under resource contention, workflows are unequally slowed down by concurrent executions

[Figure: Gantt charts of 2 identical workflows submitted sequentially (t_{i,j} = 10s) followed by a very short workflow (t = 2s), with tasks t_{i,1} to t_{i,5} placed on resources R1, R2 and R3 along a time axis in steps of 10 s]

Slowdown (s):

$s = \frac{M_{multi}}{M_{own}}$

$s_1 = \frac{20}{20} = 1.0$

$s_2 = \frac{40}{20} = 2.0$

$s_3 = \frac{36}{6} = 6.0$

Very short workflow executions are extremely slowed down

Page 8: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Workflow Self-Healing

  Problem: costly manual operations
    Rescheduling tasks, restarting services or replicating data files
    In this work: fairly allocating computing resources among workflow executions
  Objective: automated platform administration
    Autonomous detection of unfairness among workflow executions
    Perform appropriate set of actions
  Assumptions: online and non-clairvoyant
    Only partial information available
    Decisions must be fast
    Production conditions, no prediction of user activity or workloads

Page 9: Workflow fairness control on online and non-clairvoyant distributed computing platforms

General MAPE-K loop

[Figure: MAPE-K loop with Monitoring, Analysis, Planning, Execution and Knowledge. Monitoring is triggered by an event (job completion and failures) or a timeout and produces monitoring data]

  Analysis: each incident has a degree η, mapped to level 1, level 2 or level 3
    Incident 1: degree η = 0.8
    Incident 2: degree η = 0.4
    Incident 3: degree η = 0.1

  Roulette wheel selection: incident i is selected with probability

$\frac{\eta_i}{\sum_{j=1}^{n} \eta_j}$

  (incident 1 is selected in the example)

  Association rules for incident 1

  Rule     Confidence (ρ)   ρ × η
  2 → 1    0.8              0.32
  3 → 1    0.2              0.02
  1 → 1    1.0              0.80

  Planning: roulette wheel selection based on association rules (incident 2 is selected in the example), leading to a set of actions

R. Ferreira da Silva, T. Glatard, F. Desprez, Self-healing of workflow activity incidents on distributed computing infrastructures, Future Generation Computer Systems (FGCS), in press, 2013.
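The two roulette wheel selections on this slide can be sketched as weighted random draws. This is a minimal illustration using the numbers from the slide; the function names are hypothetical and the use of `random.choices` is an assumption about how such a draw could be coded, not a description of the VIP implementation.

```python
import random


def selection_probabilities(weights):
    """Normalize non-negative weights into roulette-wheel probabilities."""
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}


def roulette_select(weights, rng=random):
    """Draw one key with probability proportional to its weight."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]


# Incident degrees from the slide
degrees = {1: 0.8, 2: 0.4, 3: 0.1}
probs = selection_probabilities(degrees)  # eta_i / sum_j eta_j

# Association rules "u -> 1" for incident 1, with confidence rho;
# each rule is weighted by rho * eta_u, as in the table
confidence = {2: 0.8, 3: 0.2, 1: 1.0}
rule_weights = {u: rho * degrees[u] for u, rho in confidence.items()}

picked_incident = roulette_select(degrees, random.Random(0))
picked_rule = roulette_select(rule_weights, random.Random(0))
```

Incidents with a higher degree are more likely to be selected, but low-degree incidents keep a non-zero chance of being handled.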

Page 10: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Incident Levels and Actions

  Incident degrees are quantized into discrete incident levels
  Thresholds are determined from visual mode clustering or K-means

[Figure labels: "No actions are triggered" for the lowest level, "Triggers a set of actions" for the higher levels]

Thresholds cluster platform configurations into groups
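Mapping a degree to a discrete level amounts to locating it between sorted thresholds. The sketch below assumes two made-up threshold values purely for illustration; the slides state that real thresholds come from the platform data (visual mode clustering or K-means), and the function name is hypothetical.

```python
from bisect import bisect_right


def incident_level(degree, thresholds):
    """Map a degree to a 1-based level given sorted thresholds."""
    return bisect_right(thresholds, degree) + 1


# Hypothetical thresholds, for illustration only
THRESHOLDS = [0.2, 0.6]

for eta in (0.1, 0.4, 0.8):
    level = incident_level(eta, THRESHOLDS)
    action = "no actions" if level == 1 else "trigger a set of actions"
    print(f"eta={eta}: level {level} ({action})")
```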


Page 12: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Fairness control: degree

  Unfairness degree: max difference between the fractions of pending work

$\eta_u = W_{max} - W_{min}$

where:

$W_i = \max_{j \in [1, n_i]} \left\{ \frac{Q_{i,j}}{Q_{i,j} + R_{i,j} \cdot P_{i,j}} \cdot T_{i,j} \right\}$

  i = workflow, j = activity, n_i = number of active activities in workflow i
  Q_{i,j} = number of waiting tasks
  R_{i,j} = number of running tasks

  Relative observed duration:

$T_{i,j} = \frac{\tilde{t}_{i,j}}{\max_{v \in [1,m],\, w \in [1,n_i^*]} (\tilde{t}_{v,w})}$

  Performance:

$P_{i,j} = 2 \cdot \left( 1 - \max_{u \in [1,k_j]} \left\{ \frac{t_u}{\tilde{t}_{i,j} + t_u} \right\} \right)$

  $\tilde{t}_{i,j}$ is obtained from the median task phase durations
  A low P_{i,j} indicates that resources allocated to the activity have bad performance for the activity
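The formulas on this slide can be evaluated directly once Q, R, P and T are known for each active activity. The following sketch is an illustration only: the dictionary layout, the function names and the sample task counts are assumptions, and setting P to 1.0 when no task is running is a choice made here, not something stated on the slide.

```python
def performance(t_median, running_estimates):
    """P = 2 * (1 - max_u t_u / (t_median + t_u)).

    Assumption: with no running task, P is set to 1.0 (neutral).
    """
    if not running_estimates:
        return 1.0
    worst = max(t_u / (t_median + t_u) for t_u in running_estimates)
    return 2.0 * (1.0 - worst)


def pending_fraction(activities):
    """W_i = max_j Q / (Q + R * P) * T over the active activities.

    Each activity is a dict with keys Q, R, P, T; at least one
    active activity is expected.
    """
    values = []
    for a in activities:
        denom = a["Q"] + a["R"] * a["P"]
        values.append(a["Q"] / denom * a["T"] if denom > 0 else 0.0)
    return max(values)


def unfairness_degree(workflows):
    """eta_u = W_max - W_min over all workflow executions."""
    w = [pending_fraction(acts) for acts in workflows]
    return max(w) - min(w)


# Two single-activity workflows with made-up task counts
p = performance(715.0, [715.0])  # running tasks as slow as the median
wf_a = [{"Q": 8, "R": 2, "P": p, "T": 1.0}]
wf_b = [{"Q": 2, "R": 8, "P": p, "T": 1.0}]
eta_u = unfairness_degree([wf_a, wf_b])  # W_a = 0.8, W_b = 0.2
```

In this example the workflow with most of its tasks still waiting has the larger fraction of pending work, so it is the one the control loop would favor.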

Page 13: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Fairness control: task estimation

  Estimation of task durations
    Job phases: setup, inputs download, execution, outputs upload
    Assumption: bag of tasks (all jobs have equal durations)
    Median-based estimation:

  Phase                            setup   inputs download   execution   outputs upload   total
  Median duration of job phases    50s     250s              400s        15s              $\tilde{t}$ = 715s
  Real job duration                42s     300s              20s         ?
  Estimated job duration           42s     300s              400s*       15s              $t_{i,j}$ = 757s

  Setup and inputs download are completed; execution is the current phase
  *: max(400s, 20s) = 400s
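The median-based estimation on this slide can be written as a short function: completed phases keep their real duration, the current phase takes the larger of its median and the time already elapsed, and phases not yet started take their median. This is a sketch with a hypothetical function name, and it assumes the job is still running (at least one phase is not completed).

```python
def estimate_duration(medians, completed, current_elapsed):
    """Estimate a job's total duration phase by phase.

    medians: median duration of each phase, in order
    completed: real durations of the phases already finished
    current_elapsed: time already spent in the running phase
    """
    k = len(completed)  # index of the phase currently running
    current = max(medians[k], current_elapsed)  # never below what was observed
    remaining = medians[k + 1:]  # phases not started yet
    return sum(completed) + current + sum(remaining)


# Phases: setup, inputs download, execution, outputs upload
medians = [50, 250, 400, 15]
t_median = sum(medians)  # 715
t_est = estimate_duration(medians, [42, 300], 20)  # 42 + 300 + 400 + 15 = 757
```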

Page 14: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Fairness control: levels and actions

  Levels: identified from the platform logs

[Figure: threshold $\tau_u$ separates Level 1 (no actions) from Level 2 (action: task prioritization)]

  Actions
    Task prioritization
      Task priority is an integer initialized to 1
      Increase priority of Δ_{i,j} tasks:

$\Delta_{i,j} = Q_{i,j} - \left\lfloor \frac{(\tau_u + W_{min})(Q_{i,j} + R_{i,j} P_{i,j})}{T_{i,j}} \right\rfloor$
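The number of tasks to prioritize follows directly from the formula for Δ_{i,j}. The sketch below uses made-up input values, and the clamping of the result to the range [0, Q] is an assumption added here for safety; it does not appear on the slide.

```python
import math


def tasks_to_prioritize(q, r, p, t, tau_u, w_min):
    """Delta = Q - floor((tau_u + W_min) * (Q + R * P) / T)."""
    allowed = math.floor((tau_u + w_min) * (q + r * p) / t)
    # Clamping to [0, Q] is an assumption of this sketch, not from the slide
    return max(0, min(q, q - allowed))


# Made-up values: 8 waiting tasks, 2 running tasks, P = T = 1
delta = tasks_to_prioritize(q=8, r=2, p=1.0, t=1.0, tau_u=0.25, w_min=0.25)
```

With these values, (0.25 + 0.25) × (8 + 2 × 1.0) / 1.0 = 5, so 3 of the 8 waiting tasks would have their priority increased.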

Page 15: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Workload for Case Studies

  Based on the workload of VIP
    January 2011 to April 2012
  Case Studies on:
    Pilot Jobs
    User accounting
    Task analysis
    Bag of tasks
    Workflows

  112 users
  2,941 workflow executions
  339,545 pilot jobs
  680,988 tasks
    338,989 completed
    138,480 error
    105,488 aborted
    15,576 aborted replicas
    48,293 stalled
    34,162 queued

R. Ferreira da Silva, T. Glatard, A science-gateway workload archive to study pilot jobs, user activity, bag of tasks, task sub-steps, and workflow executions, CoreGRID/ERCIM Workshop on Grids, Clouds and P2P Computing (CGWS), Rhodes Island, Greece, 2012.


Page 17: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Experiment Conditions

  Experiment 1
    Tests whether unfairness among identical workflows is properly addressed
  Experiment 2
    Tests whether the performance of very short workflow executions is improved by the fairness mechanism
  Experiment 3
    Tests whether unfairness among different workflows is detected and properly handled
  Workflow characteristics

The experiments are performed on the Virtual Imaging Platform

Page 18: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Experiments: metrics

  Unfairness
    Is the area under the curve η_u during the execution:

$\mu = \sum_{i=2}^{M} \eta_u(t_i) \cdot (t_i - t_{i-1})$

    This metric measures if the fairness process can indeed minimize its own criterion η_u

  Slowdown

$s = \frac{M_{multi}}{M_{own}}$

where:

$M_{own} = \max_{p \in \Omega} \sum_{u \in p} t_u$
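Both metrics are simple to compute from logged values. The sketch below uses invented sample data; it assumes that Ω is the set of paths of the workflow and that each path is given as a list of task durations t_u, which the slide does not spell out. Function names are hypothetical.

```python
def unfairness_metric(times, eta_values):
    """mu = sum_{i=2..M} eta_u(t_i) * (t_i - t_{i-1})."""
    return sum(
        eta_values[i] * (times[i] - times[i - 1])
        for i in range(1, len(times))
    )


def own_makespan(paths):
    """M_own = max over paths p of the sum of t_u for u in p.

    Assumption: each path is given as a list of task durations.
    """
    return max(sum(path) for path in paths)


def slowdown(m_multi, m_own):
    """s = M_multi / M_own."""
    return m_multi / m_own


# Invented samples of eta_u at times 0, 10, 20 and 30 s
mu = unfairness_metric([0, 10, 20, 30], [0.0, 0.5, 0.5, 0.2])  # 5 + 5 + 2
s = slowdown(40.0, own_makespan([[10, 10], [10, 5]]))  # 40 / 20
```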

Page 19: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Results: identical workflows

Makespans and unfairness degree values are significantly reduced: σm by up to a factor of 15, σs by up to a factor of 7, and µ by a factor of about 2

Page 20: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Results: very short workflows

Makespans of very short workflow executions are significantly reduced: σs by up to a factor of 5.9 and µ by up to a factor of 1.9

Page 21: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Results: very short workflows (2)

Speeds up executions by up to a factor of 2.9, reduces average task waiting time by up to a factor of 4.4 and slowdown by up to a factor of 5.9

Page 22: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Results: different workflows

Reduced σs by up to a factor of 3.8 and µ by up to a factor of 1.9


Page 24: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Concluding remarks

  Context
    Autonomous handling of unfairness among workflow executions
    No strong assumptions on resource characteristics and workload
  Summary of the proposed method
    Implements a generic MAPE-K loop
    Quantifies unfairness based on the fraction of pending work:
      Ratio of queuing tasks, relative durations, and performance
  Controlling fairness among workflow executions
    Properly detects and handles unfairness among workflow executions
    Significantly reduced the standard deviation of the slowdown and unfairness metric for:
      Identical workflows
      Very short workflow executions
      Different workflows

Page 25: Workflow fairness control on online and non-clairvoyant distributed computing platforms

Thank you for your attention. Questions?

Acknowledgments:
  VIP users and project members
  French National Agency for Research (ANR-09-COSI-03, ANR-11-LABX-0063)
  EC FP7 Programme (312579 ER-flow)
  European Grid Initiative (EGI)
  France-Grilles