GRAM5 - A sustainable, scalable, reliable GRAM service Stuart Martin - UC/ANL

GRAM5 - A sustainable, scalable, reliable GRAM service

Stuart Martin - UC/ANL

GRAM5, SC 2009

What is GRAM?

GRAM is a Globus Toolkit component
  For Grid job management

GRAM is a unifying remote interface to resource managers
  Yet preserves local site security/control

GRAM is for stateful job control
  Reliable create operation
  Asynchronous monitoring and control
  Remote credential management
  Remote file staging and file cleanup

Grid Job Management Goals

Provide a service to securely:
  Create an environment for a job
  Stage files to/from the environment
  Cause execution of job process(es) via various local resource managers
  Monitor execution
  Signal important state changes to the client
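
The goals above describe a job that moves through stages (environment setup and stage-in, execution, stage-out) while state changes are signalled to the client. A minimal sketch of that flow follows. The state names, the transition table, and the callback are assumptions made for illustration; this is not the GRAM API.

```python
from enum import Enum
from typing import Callable, List


class JobState(Enum):
    # Hypothetical state names chosen for illustration only.
    UNSUBMITTED = "unsubmitted"
    STAGE_IN = "stage_in"
    PENDING = "pending"
    ACTIVE = "active"
    STAGE_OUT = "stage_out"
    DONE = "done"
    FAILED = "failed"


# Allowed transitions in this toy model; FAILED is reachable from any live state.
TRANSITIONS = {
    JobState.UNSUBMITTED: {JobState.STAGE_IN, JobState.FAILED},
    JobState.STAGE_IN: {JobState.PENDING, JobState.FAILED},
    JobState.PENDING: {JobState.ACTIVE, JobState.FAILED},
    JobState.ACTIVE: {JobState.STAGE_OUT, JobState.FAILED},
    JobState.STAGE_OUT: {JobState.DONE, JobState.FAILED},
    JobState.DONE: set(),
    JobState.FAILED: set(),
}


class Job:
    """A job whose state changes are reported through a client callback."""

    def __init__(self, on_change: Callable[[JobState, JobState], None]):
        self.state = JobState.UNSUBMITTED
        self._on_change = on_change

    def advance(self, new_state: JobState) -> None:
        # Reject transitions the table does not allow.
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        old, self.state = self.state, new_state
        self._on_change(old, new_state)  # signal the state change to the client


def run_happy_path() -> List[str]:
    """Drive one job from submission to completion and record each signal."""
    seen: List[str] = []
    job = Job(lambda old, new: seen.append(f"{old.name}->{new.name}"))
    for state in (JobState.STAGE_IN, JobState.PENDING, JobState.ACTIVE,
                  JobState.STAGE_OUT, JobState.DONE):
        job.advance(state)
    return seen
```

Calling `run_happy_path()` returns the five transitions in order, which is the sequence a client would observe if every stage succeeds.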

Traditional Interaction

[Diagram: local jobs are submitted to the scheduler (e.g., PBS) of Resource A, which runs them on the compute nodes.]

Satisfies many use cases

TACC's Ranger (62,976 cores!) is the Costco of HTC ;-), one-stop shopping, why do we need more?

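GRAM Benefit

[Diagram: remote GRAM jobs reach Resource A through the GRAM API and the GRAM service, which hands them to the scheduler (e.g., PBS) alongside local jobs; the scheduler runs both on the compute nodes.]

Add remote execution capability

Enable clients/devices to manage jobs without logging into the cluster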

GRAM Benefit

[Diagram: GRAM jobs arrive through the GRAM API at Resource A and Resource B. Each resource runs a GRAM service in front of its own scheduler (e.g., PBS on one, LSF on the other) and compute nodes, alongside local jobs.]

Provides scheduler abstraction
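
One way to picture scheduler abstraction is a common interface with one adapter per resource manager. The sketch below is illustrative only: `JobRequest`, `SchedulerAdapter`, and `build_submit` are hypothetical names, and it only builds command lines for the standard `qsub` (PBS) and `bsub` (LSF) submit commands rather than running them. Real adapters handle far more than a queue name and a script.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class JobRequest:
    script: str  # path to the job script on the resource
    queue: str   # name of the target queue


class SchedulerAdapter(ABC):
    """Hypothetical common interface hiding scheduler-specific commands."""

    @abstractmethod
    def submit_command(self, job: JobRequest) -> List[str]:
        """Return the command line that would submit the job."""


class PBSAdapter(SchedulerAdapter):
    def submit_command(self, job: JobRequest) -> List[str]:
        return ["qsub", "-q", job.queue, job.script]


class LSFAdapter(SchedulerAdapter):
    def submit_command(self, job: JobRequest) -> List[str]:
        return ["bsub", "-q", job.queue, job.script]


# The client names a resource; the adapter choice stays behind the interface.
ADAPTERS: Dict[str, SchedulerAdapter] = {
    "resource-a": PBSAdapter(),
    "resource-b": LSFAdapter(),
}


def build_submit(resource: str, job: JobRequest) -> List[str]:
    """Build the submit command for whichever scheduler the resource runs."""
    return ADAPTERS[resource].submit_command(job)
```

The same `JobRequest` works against either resource, so client code does not need to know which scheduler sits behind it.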

GRAM Benefit

[Diagram: GRAM jobs go through the GRAM API to many resources, each with its own GRAM service, scheduler, and compute nodes.]

Scalable job management

Interoperability

Users/Applications: Science Gateways, Portals, CLI scripts, App-Specific Web Services, etc.

GRAM

Resource Managers: PBS, Condor, LSF, SGE, LoadLeveler, Fork

Higher-level Clients and User Examples

Condor-G Architecture

[Diagram: Condor-G architecture spanning a Personal Condor and a Remote Resource. Labels: Schedd, Collector & Negotiator, Grid Manager, Shadow, GRAM, LSF, Master, Startd, Starter, User Job, Condor jobs, GlideIn jobs.]

GridWay Components

[Diagram: GridWay components.
  Interfaces: DRMAA library, CLI
  GridWay Core: Request Manager, Dispatch Manager, Scheduler, Execution Manager, Transfer Manager, Information Manager, Job Pool, Host Pool
  Execution Services (pre-WS GRAM, WS GRAM): Job Submission, Job Monitoring, Job Control, Job Migration
  File Transfer Services (GridFTP, RFT): Job Preparation, Job Termination, Job Migration
  Information Services (MDS2, MDS2 GLUE, MDS4): Resource Discovery, Resource Monitoring]

GridWay / Condor-G Benefit

[Diagram: GridWay jobs go through the GRAM API to many resources, each with its own GRAM service, scheduler, and compute nodes.]

Scalable job management

Throttling

Metascheduling
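
Throttling and metascheduling can be illustrated with a toy dispatcher that sends each waiting job to the least-loaded resource and stops once every resource has reached a per-resource cap. This is a sketch under stated assumptions: the `MetaScheduler` class and its policy are hypothetical and do not describe how GridWay or Condor-G actually schedule.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class MetaScheduler:
    """Toy metascheduler with a per-resource throttle on in-flight jobs."""

    def __init__(self, resources: List[str], max_in_flight: int):
        self.max_in_flight = max_in_flight
        self.in_flight: Dict[str, int] = {name: 0 for name in resources}
        self.waiting: Deque[str] = deque()

    def submit(self, job_id: str) -> None:
        # Jobs queue locally until a resource has room for them.
        self.waiting.append(job_id)

    def dispatch(self) -> List[Tuple[str, str]]:
        """Place as many waiting jobs as the throttle allows."""
        placed: List[Tuple[str, str]] = []
        while self.waiting:
            # Least-loaded resource first; ties are broken by name.
            target = min(self.in_flight, key=lambda r: (self.in_flight[r], r))
            if self.in_flight[target] >= self.max_in_flight:
                break  # every resource is at its throttle limit
            job_id = self.waiting.popleft()
            self.in_flight[target] += 1
            placed.append((job_id, target))
        return placed

    def job_finished(self, resource: str) -> None:
        # Free one slot on the resource so a waiting job can be placed.
        self.in_flight[resource] -= 1
```

With two resources and a cap of one job each, submitting three jobs places two and holds the third until `job_finished` frees a slot.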

Architecture of Ninf-G

[Diagram: Ninf-G architecture.
  Server side: IDL file, Numerical Library, IDL Compiler, Generate, Ninf-G Executable, Interface Information (LDIF file)
  Client side: Client, Interface Request, Interface Reply, Invoke Executable, Connect back
  Between the two: MDS4 / NAREGI IS (retrieve), Invoke Server, GRAM / NAREGI / Condor / SSH, Globus-IO / ssh / TCP]

caBIG and Globus

caGrid is built on top of Globus 4 WSRF Java Core and Security

caBIG - TeraGrid Integration

Leave caGrid service infrastructure as is with the exception of the analytical services.


Hierarchical Clustering Results

GRAM2 Architecture Diagram

[Diagram, two panels.
  Job Submission: Client, Gatekeeper, Job Manager, RM adapter (submit), Resource Manager, User Job(s)
  Job Monitoring: Job Manager, RM adapter (poll), Resource Manager, User Job(s)]

GRAM2 Architecture

[Diagram, two panels, as on the previous slide but with the Job Manager and RM adapter repeated several times and marked "Unlimited".
  Job Submission: Client, Gatekeeper, Job Managers, RM adapters (submit), Resource Manager, User Job(s)
  Job Monitoring: Job Managers, RM adapters (poll), Resource Manager, User Job(s)]

GRAM5 Architecture

[Diagram, two panels.
  Job Submission: Client, Gatekeeper, Job Manager (1 process), RM adapters (submit), Resource Manager, User Job(s)
  Job Monitoring: Job Manager (1 process), SEG (1 process), RM log, SEG log, Resource Manager, User Job(s)
  Label: throttled (default 6)]
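
Two ideas visible on this slide can be sketched in code: a cap on concurrent work ("throttled (default 6)") and monitoring driven by a log instead of one poller per job. The sketch below is an illustration under assumptions, not GRAM5 code. `submit_all`, `apply_event_log`, and the `timestamp;job_id;state` line format are hypothetical, and applying the limit of 6 to submit calls is a guess at what the label refers to.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List

SUBMIT_LIMIT = 6  # mirrors the "throttled (default 6)" label on the slide


def submit_all(job_ids: List[str]) -> int:
    """Run stand-in submit calls with at most SUBMIT_LIMIT in progress.

    Returns the peak number of calls that were in progress at the same time.
    """
    lock = threading.Lock()
    running = 0
    peak = 0

    def fake_submit(job_id: str) -> None:
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        # A real adapter would invoke the resource manager's submit command here.
        with lock:
            running -= 1

    # The pool size is what enforces the throttle.
    with ThreadPoolExecutor(max_workers=SUBMIT_LIMIT) as pool:
        list(pool.map(fake_submit, job_ids))
    return peak


def apply_event_log(lines: Iterable[str], states: Dict[str, str]) -> Dict[str, str]:
    """Update every tracked job from one pass over an event log.

    The "timestamp;job_id;state" format is an assumption made for this sketch.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _timestamp, job_id, state = line.split(";")
        if job_id in states:  # skip events for jobs that are not tracked
            states[job_id] = state
    return states
```

One reader applying events for all jobs replaces a separate polling loop per job, which is the scaling point the diagram makes.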

Changes Made to Improve Scalability

Removed extra listening port per job for MPIg jobs
  Functionality can be re-implemented around GRAM

Removed active monitoring of stdout/err files for streaming during job execution
  Instead, transfer stdout/err at the end of job execution
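
Transferring output once at the end of the job, rather than watching the files while the job runs, might look like the sketch below. The file names `stdout` and `stderr`, the directory layout, and the `collect_output` helper are assumptions for illustration; a local copy stands in for whatever transfer mechanism the service actually uses.

```python
import shutil
from pathlib import Path
from typing import List


def collect_output(job_dir: Path, dest_dir: Path) -> List[str]:
    """Copy a finished job's output files to the destination in one step.

    Meant to be called once, after the job has ended; nothing watches the
    files during execution.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied: List[str] = []
    for name in ("stdout", "stderr"):  # hypothetical output file names
        src = job_dir / name
        if src.exists():
            shutil.copy2(src, dest_dir / name)
            copied.append(name)
    return copied
```

The trade-off is that the client sees no output until the job finishes, in exchange for less per-job work on the service side during execution.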

Improvements

New Job Manager logging implementation
Added job exit code support
Added GRAM service version detection
Added usage statistics support
Added support for auditing of TG gateway user attribute

Updated admin, user, developer guides
Many bugs fixed

Releases and Testing

3 Alpha releases and 1 Beta
2 deployments on TeraGrid

Significant scalability testing of Condor-G
  Jaime Frey
  Igor Sfiligoi
  Gaurang Mehta

Included in GT 5.0.0 RCs
Internal functional and performance testing

http://cvs.globus.org/toolkit/docs/5.0/5.0.0/execution/gram5/qp/#id2557011

Next Improvement

Add support for Sun Grid Engine (SGE) adapter

Improve support for native packaging

Thanks to the GRAM developers!

Joe Bester - ANL
Mike Link - ANL