Nomadic Grid Applications: The Cactus WORM

G. Lanfermann

Max Planck Institute for Gravitational Physics

Albert-Einstein-Institute, Golm

Dave Angulo

University of Chicago

Chicago, Il.

Grid Application Development Software Project

Outline

The Worm - Migration on the Grid:
– Motivation
– Design

The Worm: Adaptive Migration
– Data Transfer using GridFTP
– Resource Selection using MDS-2 and ClassAds
– Contract Monitoring using MDS-2
– Intelligent Migration using Gram

This talk: http://people.cs.uchicago.edu/~dangulo/grads/CactusGrADS-Aug7-GlobusRetreat.ppt

Other documents on GrADS in Cactus architecture: http://people.cs.uchicago.edu/~dangulo/grads/arch/

http://www.cactuscode.org

Paper available in back of 2001 Globus Retreat book

Migration on the Grid

[Figure. Labels: Resource Broker; Requested & Available Resource; Payload; Migration; Your Grid]

Large Scale HPC Simulation: Daily Routine

The “daily” routine of doing large scale numerical simulations:

Take an educated guess at memory requirements, number of processors, disk space needed.

Start with the first parameter in a range of values to explore the behavior of your simulation.
– Select a machine and submit to the queuing system. Wait.
– Archive and analyze the data; make changes to the parameter file, resubmit to the queuing system. Wait.

For the large production run, increase resolution of your experiment, take educated guess at memory, ….

Select a big machine, submit to the queue. Wait 3-7 days. Archive data & checkpoint file, resubmit to the queue. Wait 3-7 days. Archive data & checkpoint file, resubmit. Wait 3-7 days. Archive data & checkpoint file, resubmit. Wait 3-7 days. ….

Automating the Routine

Let the computer find out about the code’s resource requirements.

Automatically contact appropriate machines, stage executables and submit to the queuing system.

Let the computer monitor the quality of the requested resources as the simulation progresses.

Perform multiple simulations over a range of parameters automatically and in parallel.

Archive the data and give the user uniform access.

There is plenty of room to automate the way simulations are carried out today.

Cactus + Grid

[Figure: layered stack. Labels follow.]

Cactus based Application Thorns: The Physics: Initial Data, Evolution, Analysis, etc.

Grid Aware Application Thorns: Drivers for Contract Management, Dynamic Resource Detection, Simulation Relocation

Grid Enabled Communication Library: MPICH-G2 implementation of MPI, can run MPI programs across heterogeneous computing resources

Standard MPI

SingleProc

The Grid Layer Concept

Application Thorns provide: Initial Data, Analysis, Evolution

Grid Thorns provide: Migration & Resource Management

Grid Enabled Simulation

Grid Enabled Computational Framework

Cactus Computational Framework

Migrating Applications on the Grid

[Figure. Labels: Payload; Application Information Server (AIS); Migration Unit; Resource Management; Resource Selector Client; Worm Layer; Hibernation Storage; Off Site Data Server; Resource Broker]

The WORM at HPDC10

[Figure. Labels: Information Server; Migration Server]

Current Architecture

[Figure. Labels: Under Development; Resource Selection Client Thorn; External Resource Selection Service; “Worm” Migration Module; Cactus Worm Server; Thorns; Cactus Application Unit; Cactus Flesh; Performance Degradation Detection; User Supplied Application Payload; External Processes; Migration Logic Manager; GridFTP Client Thorn; External GridFTP Server (Source); External GridFTP Server (Destination); Data transfer; Gram]

Migration of Checkpoint Files

Uses alpha version of GridFTP

Allows Third Party Transfer
– Without this, need to
  > do a GET to transfer files from source to Migrator
  > do a PUT to transfer files from Migrator to destination

Uses GSI security
– Allows grid-proxy with only a single sign-on while retaining tight security

Allows fast, efficient, reliable transfer
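
The difference between a third-party transfer and the GET/PUT detour can be sketched as a single client command. This is a minimal sketch, assuming the `globus-url-copy` command-line client is installed and a GSI proxy has already been created; the host names and checkpoint path are hypothetical, and the slides do not say that the GridFTP Client Thorn is built this way.

```python
def third_party_copy_command(src_host, dst_host, path):
    """Build a globus-url-copy command for a server-to-server copy.

    Both URLs use the gsiftp:// scheme, so the data flows directly
    between the two GridFTP servers instead of through the Migrator.
    """
    src_url = f"gsiftp://{src_host}{path}"
    dst_url = f"gsiftp://{dst_host}{path}"
    return ["globus-url-copy", src_url, dst_url]


# Hypothetical hosts and checkpoint file, for illustration only.
cmd = third_party_copy_command(
    "source.example.org", "dest.example.org", "/scratch/run/checkpoint.chkpt"
)
print(" ".join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` on a machine where the client and a valid proxy are available.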

Resource Selector Architecture (ClassAds)

[Figure. Labels: Resource Selection Client Thorn; ClassAds library; Resource Selection Engine; Request in ClassAds format; Response in XML; GIIS; GRISs; NWS; Resources; UTk Project; MDS-2]

MDS-2 Future Plans

Resource selector goes to the GRISs directly after resources are discovered

To investigate: strategies for managing update traffic

Would like persistent queries to support notification of changes in resource status

Resource Selection: Example Input: ClassAds format

[
  Type = "request";
  Owner = "dangulo";
  RequiredDomains = {"cs.uiuc.edu", "ucsd.edu"};
  requirements = "other.opSys == 'LINUX' &&
                  other.minMemSize > (100G / other.CPUCount) &&
                  Include(other.domains, RequiredDomains)";
  Rank = other.minCPUSpeed * other.CPUCount / (other.maxCPULoad + 1);
]
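
To make the request above concrete, the following minimal sketch evaluates the same requirements and Rank expression over a few candidate machines in plain Python. It does not use the ClassAds library. The candidate records are invented, "100G" is read as 100 GB expressed in MB, and `Include` is read as "at least one of the machine's domains is in RequiredDomains"; both readings are assumptions, not something the slides state.

```python
REQUIRED_DOMAINS = {"cs.uiuc.edu", "ucsd.edu"}
TOTAL_MEMORY_MB = 100 * 1024  # assumption: "100G" means 100 GB, in MB


def meets_requirements(m):
    """Mirror of the 'requirements' expression in the request."""
    return (
        m["opSys"] == "LINUX"
        and m["minMemSize"] > TOTAL_MEMORY_MB / m["CPUCount"]
        and any(d in REQUIRED_DOMAINS for d in m["domains"])  # assumed Include()
    )


def rank(m):
    """Mirror of the 'Rank' expression in the request."""
    return m["minCPUSpeed"] * m["CPUCount"] / (m["maxCPULoad"] + 1)


def select(machines):
    """Keep machines that satisfy the requirements, best rank first."""
    matching = [m for m in machines if meets_requirements(m)]
    return sorted(matching, key=rank, reverse=True)


# Invented candidates, for illustration only.
candidates = [
    {"name": "a", "opSys": "LINUX", "minMemSize": 2048, "CPUCount": 64,
     "domains": ["cs.uiuc.edu"], "minCPUSpeed": 500, "maxCPULoad": 1.0},
    {"name": "b", "opSys": "LINUX", "minMemSize": 4096, "CPUCount": 32,
     "domains": ["ucsd.edu"], "minCPUSpeed": 800, "maxCPULoad": 0.0},
    {"name": "c", "opSys": "IRIX", "minMemSize": 8192, "CPUCount": 64,
     "domains": ["ucsd.edu"], "minCPUSpeed": 900, "maxCPULoad": 0.0},
    {"name": "d", "opSys": "LINUX", "minMemSize": 512, "CPUCount": 16,
     "domains": ["cs.uiuc.edu"], "minCPUSpeed": 900, "maxCPULoad": 0.0},
]
selected = [m["name"] for m in select(candidates)]
print(selected)
```

Machine "c" fails the operating system test and "d" fails the memory test; the remaining two are ordered by rank.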

Resource Selection: Example output

<virtualMachine>
  <result statusCode="200" statusMessage="OK"/>
  <machineList>
    <machine dns="amajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="bmajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="cmajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="dmajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="emajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="fmajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="hmajor.cs.uiuc.edu" processor=" 1"/>
  </machineList>
</virtualMachine>
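
A client could turn such a response into a host list roughly as follows. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; it assumes each `<machine>` element is self-closing so that the response is well-formed XML, and the function name `parse_response` is hypothetical. The sample is shortened to two machines.

```python
import xml.etree.ElementTree as ET

RESPONSE = """<virtualMachine>
  <result statusCode="200" statusMessage="OK"/>
  <machineList>
    <machine dns="amajor.cs.uiuc.edu" processor=" 1"/>
    <machine dns="bmajor.cs.uiuc.edu" processor=" 1"/>
  </machineList>
</virtualMachine>"""


def parse_response(xml_text):
    """Return [(dns, processor_count), ...] or raise if the status is not 200."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    if result.get("statusCode") != "200":
        raise RuntimeError(result.get("statusMessage"))
    # int() tolerates the leading space in processor=" 1".
    return [(m.get("dns"), int(m.get("processor"))) for m in root.iter("machine")]


machines = parse_response(RESPONSE)
print(machines)
```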

Performance Model

Working on putting Performance Model into ClassAds

Every processor is assigned to compute XYZ/N grid points.

Requested Memory > 16 (constant) + 512 * (10E-6) (constant) * (XYZ / N) (MB)

Time needed to perform an iteration = (computation time + communication time) * slowdown

800 floating point operations for every grid point per iteration.

Computation time = 800 (constant) * (XYZ / N) / cpuspeed
– cpuspeed is FLOPS

Communication time = 1/G * 2 * (T1 + 2 * T2 * GXYR)
– T1 is the communication latency between two processors (latency from NWS)
– T2 is the transmit time for a word: T2 = 1 / (available bandwidth) (available bandwidth from NWS)

Slowdown = 1 + cpuload
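
The formulas above translate directly into a small function. This is a minimal sketch: the slide does not define G or GXYR, so they are passed through as plain parameters `g` and `gxyr`, and all input values in the example are invented. The constant `10E-6` is copied from the slide as written.

```python
def predict(xyz, n, cpuspeed, latency, bandwidth, cpuload, g, gxyr):
    """Evaluate the slide's performance model for one processor.

    xyz/n is the number of grid points per processor; latency (T1) and
    bandwidth would come from NWS. g and gxyr are undefined on the slide
    and are used exactly as they appear in the formula.
    """
    points = xyz / n
    memory_mb = 16 + 512 * 10E-6 * points        # lower bound on requested memory
    computation = 800 * points / cpuspeed        # 800 flops per point per iteration
    t2 = 1 / bandwidth                           # transmit time for one word
    communication = 1 / g * 2 * (latency + 2 * t2 * gxyr)
    slowdown = 1 + cpuload
    return {
        "memory_mb": memory_mb,
        "iteration_time": (computation + communication) * slowdown,
    }


# Invented inputs, for illustration only.
estimate = predict(xyz=1e6, n=10, cpuspeed=1e8, latency=0.001,
                   bandwidth=1e6, cpuload=0.5, g=1, gxyr=1000)
print(estimate)
```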

Contract Monitor

Driven by three user-controllable parameters
– Time quantum for “time per iteration”
– % degradation in time per iteration (relative to prior average) before noting violation
– Number of violations before migration

Potential causes of violation
– Competing load on CPU
– Computation requires more processing power: e.g., mesh refinement, new subcomputation
– Hardware problems

Contract Monitor Details

The end user specifies several variables. These variables can be changed during runtime by contacting the application through an HTTP interface. These variables include:
– time quantum
– % degradation
– number of violations before migration

The system then calculates the average wall clock time per iteration for each time quantum.

If the average time per iteration in any time quantum is worse (by the percentage specified) than the average over all previous quanta, then a violation is noted.
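
The violation rule can be sketched in a few lines. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, the caller is assumed to deliver the iteration times of one quantum at a time, and quanta that caused a violation are still included in the prior average (the slides do not say whether they are).

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class ContractMonitor:
    degradation_pct: float      # allowed % slowdown relative to the prior average
    max_violations: int         # migrate once the violation count exceeds this
    quantum_averages: list = field(default_factory=list)
    violations: int = 0

    def end_of_quantum(self, iteration_times):
        """Record one time quantum; return True if migration should start."""
        current = mean(iteration_times)
        if self.quantum_averages:
            baseline = mean(self.quantum_averages)
            if current > baseline * (1 + self.degradation_pct / 100.0):
                self.violations += 1
        self.quantum_averages.append(current)
        return self.violations > self.max_violations


monitor = ContractMonitor(degradation_pct=20, max_violations=1)
decisions = [
    monitor.end_of_quantum([1.0, 1.0]),   # first quantum, nothing to compare
    monitor.end_of_quantum([1.0, 1.1]),   # 5% slower, within the contract
    monitor.end_of_quantum([2.0, 2.0]),   # first violation
    monitor.end_of_quantum([2.0]),        # second violation, exceeds the limit
]
print(decisions)
```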

Actions Taken on Contract Violation

Occurs when more than the specified number of violations have been noted

New set of resources requested from the Resource Selector

Checkpoints the application

Moves checkpoint data to the new resources along with other data needed for restart

Restarts application on the new resources
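
The sequence of actions can be sketched as one driver function. This is a minimal sketch: the four callables are hypothetical placeholders for the Resource Selector request, the checkpoint step, the GridFTP transfer, and the restart; they are stubbed here so that only the order of operations is shown.

```python
def migrate(select_resources, checkpoint, transfer, restart):
    """Run the migration steps in the order listed on the slide."""
    new_resources = select_resources()          # ask for a new set of resources
    checkpoint_files = checkpoint()             # checkpoint the running application
    transfer(checkpoint_files, new_resources)   # move checkpoint and restart data
    restart(new_resources)                      # restart on the new resources
    return new_resources


# Stubs that only record the order in which they are called.
log = []
target = migrate(
    select_resources=lambda: log.append("select") or ["host1", "host2"],
    checkpoint=lambda: log.append("checkpoint") or ["checkpoint.chkpt"],
    transfer=lambda files, hosts: log.append("transfer"),
    restart=lambda hosts: log.append("restart"),
)
print(log, target)
```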

Migration Manager

Allows resource selection (RS) to occur asynchronously

Makes an intelligent choice on whether migration will actually help
– Will not migrate to seemingly lower quality resources
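
Both points can be illustrated together: the selection runs in the background while the simulation keeps iterating, and the result is accepted only if it looks better than the current resources. This is a minimal sketch using Python's `concurrent.futures`; the scoring values, the stand-in selection function, and the "strictly better" rule are assumptions, not details given in the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def should_migrate(current_score, candidate_score):
    """Migrate only if the candidate resources look strictly better."""
    return candidate_score > current_score


def slow_resource_selection():
    # Hypothetical stand-in for a query to the Resource Selector.
    return {"name": "candidate", "score": 120.0}


current = {"name": "current", "score": 100.0}

with ThreadPoolExecutor(max_workers=1) as pool:
    pending = pool.submit(slow_resource_selection)
    iterations_done = 0
    while not pending.done():
        iterations_done += 1        # the simulation keeps iterating meanwhile
    candidate = pending.result()

decision = should_migrate(current["score"], candidate["score"])
print(decision)
```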

Summary

The Worm gives researchers in physics and computational science easy adaptability to changing grid environments
– Data Transfer using GridFTP
– Resource Selection using MDS-2 and ClassAds
– Contract Monitoring using MDS-2
– Intelligent Migration using Gram