the grid,globus tool kit, condor – g and what?. outline 1. the “grid problem” 2. the “grid...

The Grid,Globus Tool Kit, Condor – Gand What?

Outline

1. The “Grid Problem”

2. The “Grid Architecture”

3. The “Globus Toolkit”

4. The “Condor-G”

Culled From

http://www.globus.org

The Anatomy of the Grid: Enabling Scalable Virtual Organizations. I. Foster, C. Kesselman, S. Tuecke. International J. Supercomputer Applications, 15(3), 2001

Globus: A Metacomputing Infrastructure Toolkit. I. Foster, C. Kesselman. Intl J. Supercomputer Applications, 11(2):115-128, 1997

Condor-G: A Computation Management Agent for Multi-Institutional Grids. J.Frey, T.Tannenbaum, I.Foster, M.Livny,S.Tuecke. Proc. of HPDC10, 2001

http://charm.cs.uiuc.edu/ppl_research/faucets/

http://www.globus.org/

"The Grid" Coined in 1990's to denote a proposed distributed computing

architecture. "Flexible, secure, coordinated resource sharing among dynamic

collections of individuals, institutions and resources" -From "The Anatomy of

the Grid" Resource Sharing

Computers,Storage,Sensors, Networks, Scientific Instruments sharing is highly controlled -- Providers & Consumer define

What is shared Who is allowed to share Conditions for sharing

Coordinated problem solving Beyond client-server: distributed data analysis,

visualization,computation, collaboration

Similar to the Power Grid, Faucets (Water supply), Nationwide Phone System.

Virtual Organization

A set of individual/institutions defined by some set of sharing rules form a Virtual Organization (VO).

VO may contain

Application Service Providers (ASP)

Storage Service Providers (SSP)

Cycle Providers

They lack

Central Control

Central Location

Existing Trust Relationships

Why Grids? Civil Engineers collaborate to design, execute & analyze

shake table experiments

Climate Scientists visualize, annotate & analyze terabytes of simulation datasets

An Emergency response team couples real-time data, weather model, population data

An application service provider purchases cycles from compute cycle provider

A biomedical engineer exploits 10K computers to screen 100K compounds in a hour.

NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers with each other. (Argonne,NCSA,Michigan,UIUC,USC)

DOE X-ray grand challenge: ANL, USC/ISI, NIST, U.Chicago

tomographic reconstruction

real-timecollection

wide-areadissemination

desktop & VR clients with shared controls

Advanced Photon Source

Online Access to Scientific Instruments

archival storage

Image courtesy Harvey Newman, Caltech

Data Grids forHigh Energy Physics

Tier2 Centre ~1 TIPS

Online System

Offline Processor Farm

~20 TIPS

CERN Computer Centre

FermiLab ~4 TIPSFrance Regional Centre

Italy Regional Centre

Germany Regional Centre

InstituteInstituteInstituteInstitute ~0.25TIPS

Physicist workstations

~100 MBytes/sec

~100 MBytes/sec

~622 Mbits/sec

~1 MBytes/sec

There is a “bunch crossing” every 25 nsecs.

There are 100 “triggers” per second

Each triggered event is ~1 MByte in size

Physicists work on analysis “channels”.

Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server

Physics data cache

~PBytes/sec

~622 Mbits/sec or Air Freight (deprecated)




Caltech ~1 TIPS

~622 Mbits/sec

Tier 0Tier 0

Tier 1Tier 1

Tier 2Tier 2

Tier 4Tier 4

1 TIPS is approximately 25,000

SpecInt95 equivalents

Is it Really NEW?

“Grid Computing” has much in common with the existing industrial thrusts

Application & Storage service providers (ASPs , SSPs)

Internet & Peer-to-Peer computing

Enterprise Computing Systems

Business-to-Business exchanges

SSPs & ASPs allow organizations to outsource storage & computing requirements to other parties (typically via a VPN)

Enterprise distributed computing technologies (CORBA, Enterprise Java) enable resource sharing within a single organization

Business-to-Business & virtual enterprise technologies exchanges focus on information sharing via central servers.

Is it NEW?

Sharing is not adequately addressed by these technologies.

Complicated Requirements: “run program X at site Y subject to community policy P, using data at Z according to the policy Q”

High Performance requirements of most of the applications.

Users may not care where their program may run, as long as it satisfies their QoS requirements (Faucets)

“controlled, dynamic sharing within VOs”

Current doesn’t accommodate the range of resource types or doesn’t provide the flexibility and control on sharing relationships to establish VOs

But Why Now??? The internet & the increasing use of wireless devices provides the

universal connectivity.

Many current research projects need teamwork, collaboration.

Network Vs. Computer Performance

Computer speed doubles every 18 months

Network speed doubles every 9 months

Moore’s Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan-2001) by Cleo Vilett, source Vined Khoslan, Kleiner, Caufield and Perkins.

Major Grid Projects

Name URL & Sponsors

Focus

BlueGrid IBM Grid testbed linking IBM laboratories

DISCOM www.cs.sandia.gov/discomDOE Defense Programs

Create operational Grid providing access to resources at three U.S. DOE weapons laboratories

DOE Science Grid

sciencegrid.org

DOE Office of Science

Create operational Grid providing access to resources & applications at U.S. DOE science laboratories & partner universities

Earth System Grid (ESG)

earthsystemgrid.orgDOE Office of Science

Delivery and analysis of large climate model datasets for the climate research community

European Union (EU) DataGrid

eu-datagrid.org

European Union

Create & apply an operational grid for applications in high energy physics, environmental science, bioinformatics

Major Grid Projects

Name URL/Sponsor Focus

EuroGrid, Grid Interoperability (GRIP)

eurogrid.org

European Union

Create tech for remote access to supercomp resources & simulation codes; in GRIP, integrate with Globus Toolkit™

Fusion Collaboratory fusiongrid.org

DOE Off. Science

Create a national computational collaboratory for fusion research

Globus Project globus.org

DARPA, DOE, NSF, NASA, Msoft

Research on Grid technologies; development and support of Globus Toolkit™; application and deployment

GridPP gridpp.ac.uk

U.K. eScience

Create & apply an operational grid within the U.K. for particle physics research

Information Power Grid

Ipg.nasa.gov Create & apply a production Grid for aero sciences and other NASA missions

Grid Research Integration Dev. & Support Center

grids-center.org

NSF

Integration, deployment, support of the NSF Middleware Infrastructure for research & education

www.teragrid.org

www.ivdgl.org

iVDGL:International Virtual Data Grid Laboratory

Tier0/1 facility

Tier2 facility

10 Gbps link

2.5 Gbps link

622 Mbps link

Other link

Tier3 facility

Grid ArchitectureGrid Architecture

Why Do We Need It? To structure the development of new technology

Common Vocabulary, Guidance, Perspective

To

Identify the fundamental system components

Specify the purpose and function of these components

Indicate how these components interact

Emphasizes

1. identification and definition of protocols and services

2. APIs and SDKs

A Protocol architecture – mechanism for VO users and resources to negotiate, manage sharing relationships

Facilitates

Extensibility

Interoperability

Why is interoperability a fundamental concern?

Standard protocols standard services to abstract away resource specific details.

To accelerate application development in complex and dynamic execution environments we need APIs,SDKs

The Technology + Services middleware

Analogy to Internet Architecture

High level description – places few constraints on design & implementation

Application

Fabric“Controlling things locally”: Access to, & control of, resources

Connectivity“Talking to things”: communication (Internet protocols) & security

Resource“Sharing single resources”: negotiating access, controlling use

Collective“Coordinating multiple resources”: ubiquitous infrastructure services, app-specific distributed services

InternetTransport

Application

Link

Inte

rnet P

roto

col

Arch

itectu

re

Layered Grid Architecture

Features

Open and Extensible

Built on Internet protocols & services

Communication, routing, name resolution, etc.

“Layering” is conceptual No constraints on who can call what

Protocols/Services/APIs/SDKs will (ideally) be largely self-contained

Things like communication, security are very fundamental

Advantageous for higher layer functions to use common lower-level functions

The Hourglass Model

Diverse global services

Coreservices

Local OS

A p p l i c a t i o n s • Resource and connectivity form the“neck in the hour glass”• Designed to be implemented on topof diverse range of resource types(fabric layer)• Can be used to construct wide rangeof global and application specificServices (collective layer)

What's the Status?

No “official” standards yet.

Globus Toolkit is the “unofficial” standard for several connectivity, resource, collective protocols

Global Grid Forum (GGF) has Grid Protocol Architecture group

In security some RFCs are available

In scheduling & resource management some working documents are available

Globus Toolkit

A software toolkit (GTK) to address key issues to pave the road

Offers a set of technologies (NCSA’s “Grid-in-a-box”)

Try to standardize the Grid protocols and APIs

Open Architecture & Open Source (Reference implementation)

Define Grid Protocols & APIs

Integrate and extend existing protocols

provides software tools that make it easier to build computational grids and grid-based applications

Learn from the experiences gained through deployment and applications

Key Components

Security

Communication

Information Infrastructure

Fault Detection

Resource Management

Portability

Data Management

Fabric Layer What do you expect? -- diverse resource that may be shared

Can be a logical entity such as a distributed file system, computer cluster – But the Grid architecture don’t care

Components implement the local, resource-specific operations that occur on specific resources (physical or logical)

Trade-off (Rich fabric functionality vs. Easy of deployment)

Richer functionality more sophisticated sharing operations (e.g., reservation)

Few demands simplified Grid infrastructure deployment.

Should implement enquiry mechanisms and resource management

Fabric in Globus Toolkit Is designed to use existing fabric components

If a vendor doesn’t provide the necessary behavior, GTK includes the missing piece

Resource management, is generally the domain of local resource managers.

GARA (General - purpose Architecture for Reservation and Allocation) can provide QoS for different types resources

Connectivity Communication

Enable exchange of data between fabric layer resources

Include transport, routing, naming (TCP,IP,DNS)

Authentication

Build on communication protocols

Uniform authentication, authorization and message protection in multi institutional scenario

Based on existing standards whenever possible

Various Requirements

Single sign on (access to multiple Grid resources without user intervention)

Delegation (user’s program is able to access resources on which user is authorized)

Integration with various local security solutions (identity mapping)

User-based trust relationships (must not require for various resources to cooperate in configuring security environment)

stake holder should have the final control the authorization decisions

Security in GTK Grid Security Infrastructure (GSI) is based on

Public key

X.509 certificates

SSL/TLS communication

GSS-API (Generic Authorization and Access)

Extensions are added for single sign-on and delegation

Stakeholder control is supported via GAA (Generic Authorization and Access).

GSI adheres to the IETF’s standard GSS-API

Site A(Kerberos)

Site B (Unix)

Site C(Kerberos)

Computer

User

Single sign-on via “grid-id”& generation of proxy cred.

Or: retrieval of proxy cred.from online repository

User ProxyProxy

credential

Computer

Storagesystem

Communication*

GSI-enabledFTP server

AuthorizeMap to local idAccess file

Remote fileaccess request*

GSI-enabledGRAM server

GSI-enabledGRAM server

Remote processcreation requests*

* With mutual authentication

Process

Kerberosticket

Restrictedproxy

Process

Restrictedproxy

Local id Local id

AuthorizeMap to local idCreate processGenerate credentials

Ditto

GSI Example Scenario“Create Processes at A and B

that Communicate & Access Files at C”

Resource Layer

Concerned entirely with individual resources

Addresses

Resource discovery

Reservation & Allocation

Monitoring & Control

Secure Negotiation

Two primary classes of protocols

1. Information protocols – to obtain information about the structure & state of a resource

2. Management Protocols – “policy application point”

“in the neck of hourglass” the number of such protocols should be limited to small and focused set.

Resource Layer Components in GTK GRIP (Grid Resource Information Protocol)

Based on LDAP

Defines standard resource information

GRRP (Grid Resource Registration Protocol)

To register resources with Grid Index Information Servers

GRAM (Grid Resource Access and Management)

HTTP based RPC

Used for allocation of computational resources

for monitoring and controlling of computation on these resources

GridFTP

allows grid applications to have secure, ubiquitous, high-performance access to data

uses the GSI for authentication

new extensions to the FTP protocol for

• parallel data transfer

• partial file transfer

• third-party (server-to-server) data transfer

Collective Layer Coordinate multiple resources

Protocols & services (APIs,SDKs) are not associated with any resource but rather are global in nature

Capture interactions across collections of resources

Examples:

Index servers aka Metadirectory services – custom views on dynamic resource collections

Resource Brokers (e.g., Condor-G Matchmaker, AppLes, Nimrod-G, DRM broker)

– Resource discovery & allocation

Replica catalogs & services

Co-reservation & Co-allocation services

Software discovery services – select best s/w implementation and platform based on the problem parameters (e.g., NetSolve, Ninf)

Community accounting and payment services

Collaboratory services (e.g., CAVERNsoft)

Information Infrastructure in GTK An infrastructure that provides coherent system information

spanning virtual organizations is necessary

MDS (Metacomputing Directory Service)

Uses LDAP (Lightweight Directory Access Protocol)

Provides uniform means of querying system information from a rich variety of system components

uniform namespace for resource information across a system that may involve many organizations.

GRIS (Grid Resource Information Service) a uniform means of querying resources for their current configuration, capabilities, and status

GIIS (Grid Index Information Service) a means of knitting together arbitrary GRIS services to provide a coherent system image.

GIISes provide a mechanism for identifying "interesting" resources

Resource Management Services in GTK Three main components

1. RSL (Resource Specification Language) to communicate resource requirements

2. GRAM (Grid Resource Allocation Management) standardized interface to all of the various local resource management tools (e.g., Condor,LSF,PBS)

3. DUROC (Dynamically-Updated Request Online Co-allocator) coordinates a single request that may span multiple GRAMs

Resource Broker handle the mapping of high level application requests into requests to individual resource managers

GRAM GRAM GRAM

LSF Condor PBS

Application

RSL

Simple ground RSL

Information Service

Localresourcemanagers

RSLspecialization

Broker

RSL

Co-allocator

Queries& Info

Resource Management Architecture

GRAM Components

Grid SecurityInfrastructure

Job Manager

GRAM client API calls to request resource allocation

and process creation.

MDS client API callsto locate resources

Query current statusof resource

Create

RSL Library

Parse

RequestAllocate &

create processes

Process

Process

Process

Monitor &control

Site boundary

Client MDS: Grid Index Info Server

Gatekeeper

MDS: Grid Resource Info Server

Local Resource Manager

MDS client API callsto get resource info

GRAM client API statechange callbacks

GRAM GRAM-1 HTTP-based RPC

GRAM-1.5 (Reliability improvements)

Once-and-only-once submission

Reliable termination detection

GRAM-2 towards integration with web-services (SOAP)

OGSA (Open Grid Services Architecture)

Gate Keeper

Single point of entry “secure inetd”

Job Manager

Layers on top of local resource management system (e.g., PBS,LSF, etc)

Handles remote interaction with the job

Data Management in GTK "Access to distributed data is typically as important as access to

distributed computational resources“ - Globus

Tools for managing data in Grid systems and applications

Also called “Data Grid”.

GridFTP

Data Replication

Two tools for managing data replicas: multiple copies of data stored in different systems to improve access across geographically-distributed Grids

Replica Catalog – based on LDAP directory

Replica Manager – combines the Replica Catalog with GridFTP to manage data replication

GASS (Global Access to Secondary Storage)

Allows applications to access data stored in any remote file system by specifying a URL

Can be in HTTP, FTP

Condor-GA Computation Management Agent for

Multi-Institutional Grid

What is Condor-G?

Condor enhanced with Globus Toolkit components to “harness multi-domain resources as if they all belong to one personal domain”

Example of applying the general purpose GTK components to solve a particular problem (i.e., high-throughput computing on the Grid)

Separation of Concern

1. Remote Resource Access

Secure remote resource discovery,allocation, management

Uses Globus Toolkit components

2. Computation Management

Via user computation management agent responsible for job submission, job management, error recovery

Taken from Condor system

3. Remote Execution

Via mobile sandboxing – to create a user tailored execution environment on a remote node

Taken from Condor system

Why Condor for Grid Jobs?? Adv. Of using condor-G to manage Grid jobs

Can Query a job’s status or cancel a job

Credential management

Get informed of job termination or problems via callbacks or asynchronous mechanisms such as email

Access to detailed logs with complete history of the jobs’ execution

Fault tolerance and exactly once semantics

Job Execution

Stages a job’s standard I/O and executable using GASS

Submits jobs to remote machines using revised GRAM(1.5)

Job manager checkpoint & restart

Two-phase commit during job submission

Monitors job status & recovers from remote failures using revised GRAM callbacks and status calls

Condor-G handles resubmission of failed jobs, communications with the user concerning unusual & erroneous conditions, recording of computation status to support restart

Execution Mechanism

Fault Tolerance

Tolerates four types of failure

Local Crash

1. Crash of the host on which GridManager is running (or crash of the GridManager alone)

Queue state stored on disk Reconnect to the JobManagers that were running at the time of crash

Remote Crash

2. Crash of the GlobusJobManager Start a new JobManager

3. Crash of the machine that hosts the remote resource (GateKeeper,JobManager)

Wait until connectivity returns Start a new JobManager

4. Network Failures

Credential Management

Authentication in is done with limited-lifetime X509 proxies

Credential may expire before the job finishes execution

Condor-G agent periodically analyzes the credentials for all users with currently queued jobs

Can put jobs on hold and e-mail user to refresh proxy

Can forward new credentials to execution sites

Using the MyProxy system, which lets a user store a long-lived proxy credential on a secure server.

Remote services acting on behalf of user can obtain short-lived proxies

Condor-G can use these to refresh the user credential automatically

Resource Discovery & Scheduling

Simple – user supplies list of GRAM servers

Resource broker in Condor-G agent – Condor Matchmaking

“flood” candidate resources – “Glidelns”

GlideIn Mechanism

Use same codor mechanism to start on a remote node not a user job, but a daemon

The deamon traps system calls made by user’s job and redirects back to the originating system

Periodically checkpoints the job and migrates job to another location if it is requested

The Condor-G GlideIn mechanism uses Grid protocols to dynamically create a personal condor pool out of Grid resources by “gliding-in” Condor daemons to remote resource

Allows to delay the binding of an application to a resource

Prevent queuing delays

Can guarantee optimal queuing times to users

Remote Execution via GlideIn

Grid Architecture in Practice

ComputeResource

SDK

API

AccessProtocol

CheckpointRepository

SDK

API

C-pointProtocol

High Throughput Computing System

Dynamic checkpoint, job management, failover, staging

Brokering, certificate authorities

Access to data, access to computers, access to network performance data

Communication, service discovery (DNS), authentication, authorization, delegation

Storage systems, schedulers

Collective(App)

App

Collective(Generic)

Resource

Connect

Fabric

Future & Conclusions Evolution

Past-Present: O(102) computers; Mb/s networks; local (centralized) control

Present: O(104-106) data systems, computers; Gb/s networks; restricted decentralized control

Future: O(106-109) data,sensors,computer,instruments; highly flexible policy,control

“A computer (includes software) is a dynamically, often collaboratively constructed collection of processors,data sources,networks,sensors,instruments”

“Open the faucet get the water – Connect to the Grid get the compute power”

We need a powerful computational economy model (Bidding systems – new optimization algorithms)

Summary

The Grid Problem: Resource sharing & coordinated problem solving in dynamic multi-institutional virtual organizations

Grid Architecture: Emphasizes protocol and service definition to enable interoperability and resource sharing

Globus Toolkit: a source of protocol and APIs, reference implementation

Condor-G: applies general purpose Globus Toolkit to solve high-throughput computing on the Grid

Thanks for Ur Patience

the grid,globus tool kit, condor – g and what?. outline 1. the “grid problem” 2. the “grid...

Documents