young suk moon chair: dr. hans-peter bischof reader: dr. gregor von laszewski observer: dr. minseok...

MS Thesis Defense

Dynamic Fault Tolerant Grid Workflow in the Water Threat Management Project

Young Suk Moon

Chair: Dr. Hans-Peter Bischof

Reader: Dr. Gregor von Laszewski

Observer: Dr. Minseok Kwon

Outline

Introduction to the Water Threat Management Project

Motivation

Research Objectives

Fault-Tolerant Queue

Evaluation

Conclusion

Water Threat Management

Motivation

Urban Water Distribution Systems (WDSs) can be an easy target for terror attacks, e.g. contamination of the water.

Methods

Detect contamination using the sensors located across the WDSs.

Run algorithms (developed by NCSU) to determine sensor locations that minimize the search time for finding the contaminant source locations.

Existing Water Threat Management System Architecture

Optimization Engine: Runs Evolutionary Algorithm (EA)

Simulation Engine: Runs EPANET

Water Threat Management System Requirements

Time sensitive

Massive calculation

Dynamic adaptation to a Grid environment

Fault tolerance

Our goal

The current system is not fault-tolerant: develop a fault-tolerant framework in the dynamic environment.

Motivation

Resource (Site) Outage

5% down during

Queue Wait Time

TeraGrid User & System News (http://news.teragrid.org/)

Research Objectives

Develop a fault-tolerant framework dealing with resource outages

Strategy: generation distribution on multiple sites

Reduce queue wait time

Strategy: dynamic job dependency

Water Threat Management Application

Sequential & parallel processing

Generation Distribution

Divide the generations into multiple parts and run each part as a separate job.

Distribute the jobs on multiple sites.
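The split described here can be sketched as follows. This is a minimal illustration that assumes contiguous, near-equal ranges; the function name `split_generations` and the dictionary layout are hypothetical and are not taken from the WTM code.

```python
def split_generations(total_generations, sites):
    """Divide generations 1..total_generations into contiguous ranges, one job per site."""
    base, extra = divmod(total_generations, len(sites))
    jobs = []
    start = 1
    for index, site in enumerate(sites):
        # The first `extra` jobs take one additional generation each.
        size = base + (1 if index < extra else 0)
        jobs.append({"site": site, "first_gen": start, "last_gen": start + size - 1})
        start += size
    return jobs


jobs = split_generations(20, ["Abe", "Big Red"])
for job in jobs:
    print(job)
```

With 20 generations and two sites this yields the ranges 1-10 and 11-20, one job per site.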

Dynamic Job Dependency

Problems of generation distribution on multiple sites

Additional queue wait times

Each job is dependent on another.

A job cannot be submitted before the prior job finishes.

Solution: determine job dependency at run time.

Submit jobs at the same time.

Whichever job starts first computes the first set of generations.
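One way to picture run-time dependency is a shared coordinator that hands out generation ranges in order to whichever job leaves its queue first. The sketch below is an assumption for illustration only: the class `GenerationCoordinator` and its method are hypothetical, and it models only the assignment step, not the transfer of results between jobs.

```python
import threading


class GenerationCoordinator:
    """Hands out generation ranges in order to whichever job starts first."""

    def __init__(self, ranges):
        self._ranges = list(ranges)
        self._next = 0
        self._lock = threading.Lock()

    def claim_next(self, site):
        """Return (site, range) for the next unclaimed range, or None if none remain."""
        with self._lock:
            if self._next >= len(self._ranges):
                return None
            claimed = self._ranges[self._next]
            self._next += 1
            return site, claimed


coordinator = GenerationCoordinator([(1, 10), (11, 20)])
# Both jobs were submitted at the same time; suppose Big Red leaves its queue first.
first = coordinator.claim_next("Big Red")
second = coordinator.claim_next("Abe")
print(first, second)
```

In this example Big Red computes generations 1-10 even though it was not designated in advance; a job that claims a later range would still have to wait for the results of the preceding range before it can compute.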

Dynamic WTM Workflow Management

Example scenario

Fault-tolerant Queue

Most common fault-tolerant strategies in a Grid

Replication

Checkpointing

Limitations of checkpointing under time-criticality

Checkpointing degrades performance.

A checkpoint may not be compatible with a different site (heterogeneity).

A job cannot be rescheduled on the same site in case of a site outage.

We choose the replication strategy within the fault-tolerant queue.

Fault-tolerant Queue Design

Components

Command Line Interface

Task Pool

Resource Pool

Scheduler

Resource Checker (integration with the TeraGrid Information Services)
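How these components might fit together can be sketched roughly as below. The slides do not show the implementation, so every name here (`Resource`, `Scheduler`, the `replicas` parameter) is a hypothetical stand-in: the task pool is a simple queue, the resource pool is a list whose availability flags a resource checker would update, and the scheduler replicates each task onto the available resources.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    available: bool = True  # would be updated by the resource checker


@dataclass
class Scheduler:
    task_pool: deque = field(default_factory=deque)
    resource_pool: list = field(default_factory=list)
    replicas: int = 2  # hypothetical replication factor

    def submit(self, task):
        """Add a task to the task pool (the command line interface would call this)."""
        self.task_pool.append(task)

    def schedule(self):
        """Assign each pooled task to up to `replicas` available resources."""
        assignments = []
        while self.task_pool:
            task = self.task_pool.popleft()
            targets = [r.name for r in self.resource_pool if r.available][: self.replicas]
            assignments.append((task, targets))
        return assignments


scheduler = Scheduler(
    resource_pool=[Resource("Abe"), Resource("Big Red", available=False), Resource("Queen Bee")]
)
scheduler.submit("generations 1-10")
assignments = scheduler.schedule()
print(assignments)
```

Because Big Red is marked unavailable in this example, the task is replicated on Abe and Queen Bee only.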

Fault Detection in Fault-tolerant Queue

Fault detection

Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit

Communicate with GRAM to detect job failure.

TeraGrid Information Services

The GRAM service may fail when the resource is down.

TeraGrid Information Services publishes XML documents containing the outage information.
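Checking published outage information could look roughly like the sketch below. The XML layout is invented for illustration; the actual TeraGrid Information Services schema is not shown in the slides and may differ, and the function `resources_down` is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical outage document; the real TeraGrid Information Services schema may differ.
SAMPLE = """
<outages>
  <outage resource="Big Red" start="2009-01-01T00:00:00" end="2009-01-01T06:00:00"/>
  <outage resource="Abe" start="2009-02-01T00:00:00" end="2009-02-01T01:00:00"/>
</outages>
"""


def resources_down(xml_text, now):
    """Return the names of resources whose outage window contains `now`."""
    down = set()
    for outage in ET.fromstring(xml_text.strip()).iter("outage"):
        start = datetime.fromisoformat(outage.get("start"))
        end = datetime.fromisoformat(outage.get("end"))
        if start <= now < end:
            down.add(outage.get("resource"))
    return down


print(resources_down(SAMPLE, datetime(2009, 1, 1, 3, 0)))
```

A resource reported as down this way can be excluded from scheduling even when GRAM itself is unreachable and cannot report the failure.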

Evaluation – WTM performance

WTM application performance (original)

[Figure: Big Red, CPU per Node]

Evaluation – Queue Wait Time

Queue wait time statistics

            Abe    Big Red
Avg. (min)
Var.        38513
sd.         196    73
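As a quick arithmetic check (not part of the slides), the standard deviation is the square root of the variance, so the reported variance of 38513 corresponds to the reported sd of 196 rather than 73:

```python
import math

variance = 38513  # reported variance of the queue wait time
sd = math.sqrt(variance)
print(round(sd))  # 196
```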

Evaluation – Performance Overhead

Performance overhead

Integrating a fault-tolerant framework usually causes performance degradation.

No performance loss in our framework

Comparison of run times for different workflow types

Original deployment vs. fault-tolerant deployment

Dynamic job dependency vs. static job dependency

Test each type of deployment in the real Grid system, including queue wait time.

Workflow        Dependency  Site Name     # Jobs  Gen. range
Original        -           Abe           1       1-20
Original        -           Big Red       1       1-20
Fault-tolerant  static      Abe, Big Red  2       1-10 (Abe), 11-20 (Big Red)
Fault-tolerant  dynamic     Abe, Big Red  2       1-10, 11-20

Evaluation – Workflow Performance

Workflow comparison results

[Figures: Experiment 1, Experiment 2, Experiment 3]

Simulation – Worst Case Run Time Comparison

A threat management system must deliver results under any circumstances.

Thus, the worst-case run time is a critical factor in the Water Threat Management system.

Simulation – Worst Case Run Time Comparison

Simulation setup

The generations are equally distributed among the machines.

Use the 2009 TeraGrid outage data.

Submit jobs every 5 minutes starting from 1/1/2009 12:00 am EST.

                         Abe   Big Red  Queen Bee
Run Time per Gen. (min)  0.52  2.07     1.02
#CPUs                    16    16       8
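To make the setup concrete, the compute portion of a run can be estimated from the per-generation run times. The sketch below is illustrative only: the total of 30 generations and the function name `compute_minutes` are assumptions, and queue wait times and outages are deliberately left out. Since each part depends on the previous one, the per-site compute times add up.

```python
# Run time per generation in minutes, taken from the simulation setup.
RUN_TIME_PER_GEN = {"Abe": 0.52, "Big Red": 2.07, "Queen Bee": 1.02}


def compute_minutes(total_generations, run_time_per_gen):
    """Return per-site compute minutes when generations are split equally."""
    share, remainder = divmod(total_generations, len(run_time_per_gen))
    if remainder:
        raise ValueError("total_generations must divide evenly among the machines")
    return {site: share * minutes for site, minutes in run_time_per_gen.items()}


per_site = compute_minutes(30, RUN_TIME_PER_GEN)  # 30 generations is a hypothetical total
total = sum(per_site.values())
print(per_site, round(total, 2))
```

With 10 generations per machine, Big Red contributes the largest share of the compute time because its run time per generation is the highest of the three.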

Simulation – Worst Case Run Time Comparison

Simulation queue wait time setup (unit: minutes)

TeraGrid User & System News (http://news.teragrid.org/)

Simulation – Median Run Time, Worst Case (Max.) Run Time

Conclusion

Achievement:

Worst case run time is significantly reduced.

Limitation:

In “general” cases, the dynamic workflow has performance degradation, due to the low failure rate and the difference in compute performance between machines.

Possible improvement:

Migrate the generation process to a faster machine whenever possible.
