Online Workflow Management and Performance Analysis with Stampede

Dan Gunter¹, Taghrid Samak¹, Monte Goode¹, Ewa Deelman², Gaurang Mehta², Fabio Silva², Karan Vahi², Christopher Brooks³, Priscilla Moraes⁴, Martin Swany⁴

¹ Lawrence Berkeley National Laboratory
² University of Southern California, Information Sciences Institute
³ University of San Francisco
⁴ University of Delaware


Description: Predicting performance of large scientific workflows.

Transcript

Page 1: Online Workflow Management and Performance Analysis with Stampede

Dan Gunter¹, Taghrid Samak¹, Monte Goode¹, Ewa Deelman², Gaurang Mehta², Fabio Silva², Karan Vahi², Christopher Brooks³, Priscilla Moraes⁴, Martin Swany⁴

¹ Lawrence Berkeley National Laboratory
² University of Southern California, Information Sciences Institute
³ University of San Francisco
⁴ University of Delaware

Page 2: Online Workflow Management and Performance Analysis with Stampede

Background

CNSM 2011, October 24-28, Paris, France 2

Page 3: Online Workflow Management and Performance Analysis with Stampede

Goal: Predict the behavior of running scientific workflows, primarily failures

- Is a given workflow going to "fail"?
- Are specific resources causing problems?
- Which application sub-components are failing?
- Is the data staging a problem?
- In large workflows, some failures are normal. This work is about learning, from known problems, which patterns of failures are unusual and require adaptation.
- Do all of this as generally as possible: can we provide a solution that applies to all workflow engines?


Page 4: Online Workflow Management and Performance Analysis with Stampede

Approach

- Model the monitoring data from running workflows
- Collect all the data in real time
- Run analysis, also in real time, on the collected data: map low-level failures to application-level characteristics
- Feed the analysis back to the user and the workflow engine


Page 5: Online Workflow Management and Performance Analysis with Stampede

Scientific Applications

Montage (astronomy), CyberShake (geophysics), Epigenome (bioinformatics), LIGO (astrophysics)

Page 6: Online Workflow Management and Performance Analysis with Stampede

Domain: Large Scientific Workflows

SCEC-2009: Millions of tasks completed per day

[Figure annotation: Radius = 11 million]

Page 7: Online Workflow Management and Performance Analysis with Stampede

Workflow structure


Page 8: Online Workflow Management and Performance Analysis with Stampede

Basic terms and concepts

[Diagram labels: Workflow Management System, Workflow Resources, Execution, Success, Fail]

Page 9: Online Workflow Management and Performance Analysis with Stampede

Base technologies

- Workflow management system: Pegasus (www.pegasus.isi.edu)
- Monitoring and data analysis: NetLogger (www.netlogger.lbl.gov)

Page 10: Online Workflow Management and Performance Analysis with Stampede

Data Model


Page 11: Online Workflow Management and Performance Analysis with Stampede

Data Model Goals

- Be widely applicable: many workflow engines could benefit
- Provide everything needed for Pegasus workflows

Page 12: Online Workflow Management and Performance Analysis with Stampede

Abstract and Executable Workflows

- A workflow starts as a resource-independent statement of computations, input and output data, and dependencies: the Abstract Workflow (AW)
- For each workflow run, Pegasus-WMS plans the workflow, adding helper tasks and clustering small computations together: the Executable Workflow (EW)
- Note: most of the logs are from the EW, but the user really only knows the AW

Page 13: Online Workflow Management and Performance Analysis with Stampede

Additional Terminology

- Workflow: container for an entire computation
- Sub-workflow: workflow contained in another workflow
- Task: representation of a computation in the AW
- Job: node in the EW; may represent part of a task (e.g., a stage-in/out), one task, or many tasks
- Job instance: job scheduled or running in the underlying system; due to retries, there may be multiple job instances per job
- Invocation: one or more executables for a job instance; invocations are the instantiation of tasks, whereas jobs are an intermediate abstraction used by the planning and scheduling sub-systems
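The containment hierarchy above can be sketched as plain data structures. This is a minimal illustration only; the field names are ours, not Stampede's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Invocation:
    executable: str          # instantiation of a task

@dataclass
class JobInstance:
    attempt: int             # retries create multiple instances per job
    invocations: list = field(default_factory=list)

@dataclass
class Job:
    name: str                # node in the Executable Workflow (EW)
    instances: list = field(default_factory=list)

@dataclass
class Workflow:
    wf_uuid: str             # container for an entire computation
    jobs: list = field(default_factory=list)
    sub_workflows: list = field(default_factory=list)

# One job that was retried once, so it has two job instances
wf = Workflow(wf_uuid="wf-001")
job = Job(name="stage_in")
job.instances.append(JobInstance(attempt=1))
job.instances.append(JobInstance(attempt=2))
wf.jobs.append(job)
```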

Page 14: Online Workflow Management and Performance Analysis with Stampede

Denormalized Data Model

- Stream of timestamped "events":
  - unique, hierarchical name
  - unique identifiers (workflow, job, etc.)
  - values and metadata
- Used the NETCONF YANG data-modeling language, keyed on event name [RFCs 6020, 6021 (6087)]
- The YANG schema (see bit.ly/nQfPd1) documents and validates each log event

Snippet of the schema:

container stampede.xwf.start {
    description "Start of executable workflow";
    uses base-event;
    leaf restart_count {
        type uint32;
        description "Number of times workflow was restarted (due to failures)";
    }
}
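For concreteness, here is a minimal sketch of formatting such an event as a NetLogger Best Practices (BP) style log line of key=value pairs. This is not Stampede's actual emitter, and the `wf_uuid` identifier key is a hypothetical example:

```python
from datetime import datetime, timezone

def bp_event(name, **attrs):
    """Format one BP-style log line: an ISO-8601 timestamp, the
    hierarchical event name, then space-separated key=value pairs."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ")
    pairs = " ".join(f"{k}={v}" for k, v in sorted(attrs.items()))
    return f"ts={ts} event={name} {pairs}"

# Hypothetical identifiers, for illustration only
line = bp_event("stampede.xwf.start", wf_uuid="wf-001", restart_count=0)
```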

Page 15: Online Workflow Management and Performance Analysis with Stampede

Relational data model

Tables (entity represented):

- workflow: Workflow (AW and EW)
- workflow_state: Workflow status
- task: Task (AW)
- task_edge: Task parent and child (AW)
- job: Job (EW)
- job_edge: Job parent and child (EW)
- job_instance: Job Instance
- jobstate: Job status
- invocation: Invocation
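A sketch of how a fragment of this schema might be declared. The column names (beyond the table names on the slide) are assumptions for illustration, not Stampede's actual DDL:

```python
import sqlite3

# Illustrative fragment of the workflow / job / job_instance tables
schema = """
CREATE TABLE workflow (
    wf_id   INTEGER PRIMARY KEY,
    wf_uuid TEXT UNIQUE NOT NULL
);
CREATE TABLE job (
    job_id    INTEGER PRIMARY KEY,
    wf_id     INTEGER NOT NULL REFERENCES workflow(wf_id),
    exec_name TEXT
);
CREATE TABLE job_instance (
    ji_id   INTEGER PRIMARY KEY,
    job_id  INTEGER NOT NULL REFERENCES job(job_id),
    attempt INTEGER NOT NULL  -- retries create multiple instances per job
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(schema)
```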

Page 16: Online Workflow Management and Performance Analysis with Stampede

Infrastructure


Page 17: Online Workflow Management and Performance Analysis with Stampede

Infrastructure overview

[Diagram: raw logs -> normalized logs -> query and subscribe interfaces]

Page 18: Online Workflow Management and Performance Analysis with Stampede

Detailed data flow

[Data flow: Pegasus -> NetLogger log collection and normalization -> real-time analysis (failure detection) and relational archive]

Page 19: Online Workflow Management and Performance Analysis with Stampede

Message bus usage


[Diagram: BP log events are published to an AMQP exchange with routing key = event name; queues feed analysis clients, which subscribe and receive data]
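Because event names are hierarchical and used as routing keys, an analysis client can subscribe to a whole subtree of events with an AMQP-style topic pattern. A minimal, broker-free sketch of that matching logic (illustrative, not the actual Stampede code):

```python
def topic_match(pattern, routing_key):
    """AMQP-style topic match: '*' matches exactly one dot-separated
    word, '#' matches zero or more words."""
    def match(p, k):
        if not p:
            return not k
        if p[0] == "#":
            return any(match(p[1:], k[i:]) for i in range(len(k) + 1))
        if k and (p[0] == "*" or p[0] == k[0]):
            return match(p[1:], k[1:])
        return False
    return match(pattern.split("."), routing_key.split("."))

# A client interested in all job-related events might bind with:
topic_match("stampede.job.#", "stampede.job.inst.main.start")  # True
```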

Page 20: Online Workflow Management and Performance Analysis with Stampede

Analysis


Page 21: Online Workflow Management and Performance Analysis with Stampede

Experimental Dataset Summary

Application    Workflows      Jobs      Tasks      Edges
CyberShake           881   288,668    577,330  1,245,845
Periodograms          45    80,158  1,894,921     80,113
Epigenome             46    10,059     29,837     23,425
Montage               76    56,018    613,107    287,146
Broadband             66    44,182    104,275    141,922
LIGO                  26     2,116      2,141      6,203
Total              1,140   481,201  3,221,611  1,784,654

Page 22: Online Workflow Management and Performance Analysis with Stampede

Workflow clustering

- Features collected for each workflow run:
  - successful jobs
  - failed jobs
  - success duration
  - fail duration
- Offline clustering on historical data; algorithm: k-means
- Online analysis classifies workflows according to the nearest cluster
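A minimal, pure-Python sketch of the two-phase scheme: offline k-means over historical per-workflow feature vectors, then online classification by nearest centroid. The toy data uses only two of the four features, and the paper's feature scaling and choice of k are not reproduced:

```python
import math, random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: returns k centroids for the given feature vectors."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        # Move each centroid to the mean of its group (keep it if empty)
        centroids = [
            [sum(d) / len(g) for d in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids

def nearest(centroids, p):
    """Online step: index of the closest cluster centroid."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(centroids[i], p))

# Toy feature vectors: (successful jobs, failed jobs) per workflow run
history = [[95, 5], [90, 10], [98, 2], [20, 80], [10, 90], [15, 85]]
centroids = kmeans(history, k=2)
new_run_cluster = nearest(centroids, [12, 88])  # falls in the high-failure cluster
```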

Page 23: Online Workflow Management and Performance Analysis with Stampede

“High Failure” Workflows (HFW)

- The workflow engine keeps retrying workflows until they complete or time out
- But in the experimental logs, workflows are never marked as "failed" (aside: this is fixed in the newest version)
- Therefore, we use a simple heuristic for identifying problematic workflows: HFW means > 50% of jobs failed
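The heuristic itself is trivial to state in code (the function name is ours):

```python
def is_high_failure(failed_jobs, total_jobs):
    """HFW heuristic from the slide: more than 50% of jobs failed."""
    return total_jobs > 0 and failed_jobs / total_jobs > 0.5

is_high_failure(512, 905)  # True: this workflow is flagged as HFW
```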

Page 24: Online Workflow Management and Performance Analysis with Stampede

HFW failure patterns

[Plot, Montage application:

- X-axis: normalized workflow execution time
- Y-axis: percent of total job failures for this workflow, so far
- Legend: for each workflow, jobs failed / jobs total]

Page 25: Online Workflow Management and Performance Analysis with Stampede

More HFW Failure Patterns

[Four panels: Epigenome, Broadband, Montage, CyberShake]

Page 26: Online Workflow Management and Performance Analysis with Stampede

Offline clustering

[Plot: Epigenome workflows projected onto the first 2 principal components. One cluster is the high-failure workflow cluster; the other 3 clusters contain the remaining workflows.]

Page 27: Online Workflow Management and Performance Analysis with Stampede

Online classification

[Plot: workflow classification (class 1-4) over lifetime %. The legend lists jobs failed/jobs total for each workflow (e.g., workflow 21: 512/905). Annotations mark the high-failure workflow class and one workflow whose classification doesn't converge.]

Page 28: Online Workflow Management and Performance Analysis with Stampede

Anomaly detection

[Plot, Montage application: X-axis is the total number of failures; Y-axis is the proportion of time windows experiencing that number of failures or less. The legend lists jobs failed/jobs total for each workflow (e.g., workflow 46: 281/496). Guide lines mark 0.9 on the Y-axis and 15 on the X-axis; one workflow is anomalous (see slide 24).]
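The cumulative-failure view above can be sketched as an empirical CDF over per-window failure counts, with a quantile cutoff for flagging anomalies. This is illustrative only; the 0.9 cutoff echoes the plot annotation, and the paper's windowing details are simplified:

```python
def failure_ecdf(failures_per_window):
    """Empirical CDF: for each failure count x, the proportion of time
    windows that saw x or fewer failures."""
    counts = sorted(failures_per_window)
    n = len(counts)
    return lambda x: sum(c <= x for c in counts) / n

def is_anomalous(count, cdf, threshold=0.9):
    """Flag a window whose failure count lies beyond the threshold
    quantile of the historical distribution."""
    return cdf(count) > threshold

# Historical failure counts per time window for one workflow (toy data)
history = [0, 0, 1, 1, 2, 2, 3, 3, 4, 30]
cdf = failure_ecdf(history)
cdf(4)   # 0.9: 9 of 10 windows had 4 or fewer failures
```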

Page 29: Online Workflow Management and Performance Analysis with Stampede

System performance

[Bar chart: median queries per minute (log10 scale) by query type; one panel per application (broadband, epigenome, montage, cybershake, ligo, periodograms). Dashed black lines are the median event arrival rate for each application.

Query types: 01-JobsTot, 02-JobsState, 03-JobsType, 04-JobsHost, 05-TimeTot, 06-TimeState, 07-TimeType, 08-TimeHost, 09-JobDelay, 10-WfSumm, 11-HostSumm]

Page 30: Online Workflow Management and Performance Analysis with Stampede

Summary

- Real-time failure prediction for scientific workflows is a challenging but important task
- Unsupervised learning can be used to model high-level workflow failures from historical data
- High-failure classes of workflows can be predicted in real time with high accuracy
- Future directions:
  - analysis: root-cause investigation
  - system: notifications and updates
  - working with data from other workflow systems


Page 31: Online Workflow Management and Performance Analysis with Stampede

Thank you! For more information, visit the Stampede wiki at:

https://confluence.pegasus.isi.edu/display/stampede/

Page 32: Online Workflow Management and Performance Analysis with Stampede

Extra slides..


Page 33: Online Workflow Management and Performance Analysis with Stampede

Scalability


Page 34: Online Workflow Management and Performance Analysis with Stampede

Pegasus

- Maps from abstract to concrete workflow using algorithmic and AI-based techniques
- Automatically locates physical locations for both workflow components and data
- Finds appropriate resources to execute on
- Reuses existing data products where applicable
- Publishes newly derived data products
- Provides provenance information

Page 35: Online Workflow Management and Performance Analysis with Stampede

NetLogger

- Logging methodology
  - Timestamped, named messages at the start and end of significant events, with additional identifiers and metadata, in a standard line-oriented ASCII format (Best Practices, or BP)
  - APIs are provided, incl. in-memory log aggregation for high-frequency events, but message generation is often best done within an existing framework
- Logging and analysis tools
  - Parse many existing formats to BP
  - Load BP into a message bus, MySQL, MongoDB, etc.
  - Generate profiles, graphs, and CSV from BP data