the grid,globus tool kit, condor – g and what?. outline 1. the “grid problem” 2. the “grid...
TRANSCRIPT
The Grid,Globus Tool Kit, Condor – Gand What?
Outline
1. The “Grid Problem”
2. The “Grid Architecture”
3. The “Globus Toolkit”
4. The “Condor-G”
Culled From
http://www.globus.org
The Anatomy of the Grid: Enabling Scalable Virtual Organizations. I. Foster, C. Kesselman, S. Tuecke. International J. Supercomputer Applications, 15(3), 2001
Globus: A Metacomputing Infrastructure Toolkit. I. Foster, C. Kesselman. Intl J. Supercomputer Applications, 11(2):115-128, 1997
Condor-G: A Computation Management Agent for Multi-Institutional Grids. J.Frey, T.Tannenbaum, I.Foster, M.Livny,S.Tuecke. Proc. of HPDC10, 2001
http://charm.cs.uiuc.edu/ppl_research/faucets/
"The Grid" Coined in 1990's to denote a proposed distributed computing
architecture. "Flexible, secure, coordinated resource sharing among dynamic
collections of individuals, institutions and resources" -From "The Anatomy of
the Grid" Resource Sharing
Computers,Storage,Sensors, Networks, Scientific Instruments sharing is highly controlled -- Providers & Consumer define
What is shared Who is allowed to share Conditions for sharing
Coordinated problem solving Beyond client-server: distributed data analysis,
visualization,computation, collaboration
Similar to the Power Grid, Faucets (Water supply), Nationwide Phone System.
Virtual Organization
A set of individual/institutions defined by some set of sharing rules form a Virtual Organization (VO).
VO may contain
Application Service Providers (ASP)
Storage Service Providers (SSP)
Cycle Providers
They lack
Central Control
Central Location
Existing Trust Relationships
Why Grids? Civil Engineers collaborate to design, execute & analyze
shake table experiments
Climate Scientists visualize, annotate & analyze terabytes of simulation datasets
An Emergency response team couples real-time data, weather model, population data
An application service provider purchases cycles from compute cycle provider
A biomedical engineer exploits 10K computers to screen 100K compounds in a hour.
NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers with each other. (Argonne,NCSA,Michigan,UIUC,USC)
DOE X-ray grand challenge: ANL, USC/ISI, NIST, U.Chicago
tomographic reconstruction
real-timecollection
wide-areadissemination
desktop & VR clients with shared controls
Advanced Photon Source
Online Access to Scientific Instruments
archival storage
Image courtesy Harvey Newman, Caltech
Data Grids forHigh Energy Physics
Tier2 Centre ~1 TIPS
Online System
Offline Processor Farm
~20 TIPS
CERN Computer Centre
FermiLab ~4 TIPSFrance Regional Centre
Italy Regional Centre
Germany Regional Centre
InstituteInstituteInstituteInstitute ~0.25TIPS
Physicist workstations
~100 MBytes/sec
~100 MBytes/sec
~622 Mbits/sec
~1 MBytes/sec
There is a “bunch crossing” every 25 nsecs.
There are 100 “triggers” per second
Each triggered event is ~1 MByte in size
Physicists work on analysis “channels”.
Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server
Physics data cache
~PBytes/sec
~622 Mbits/sec or Air Freight (deprecated)
Tier2 Centre ~1 TIPS
Tier2 Centre ~1 TIPS
Tier2 Centre ~1 TIPS
Caltech ~1 TIPS
~622 Mbits/sec
Tier 0Tier 0
Tier 1Tier 1
Tier 2Tier 2
Tier 4Tier 4
1 TIPS is approximately 25,000
SpecInt95 equivalents
Is it Really NEW?
“Grid Computing” has much in common with the existing industrial thrusts
Application & Storage service providers (ASPs , SSPs)
Internet & Peer-to-Peer computing
Enterprise Computing Systems
Business-to-Business exchanges
SSPs & ASPs allow organizations to outsource storage & computing requirements to other parties (typically via a VPN)
Enterprise distributed computing technologies (CORBA, Enterprise Java) enable resource sharing within a single organization
Business-to-Business & virtual enterprise technologies exchanges focus on information sharing via central servers.
Is it NEW?
Sharing is not adequately addressed by these technologies.
Complicated Requirements: “run program X at site Y subject to community policy P, using data at Z according to the policy Q”
High Performance requirements of most of the applications.
Users may not care where their program may run, as long as it satisfies their QoS requirements (Faucets)
“controlled, dynamic sharing within VOs”
Current doesn’t accommodate the range of resource types or doesn’t provide the flexibility and control on sharing relationships to establish VOs
But Why Now??? The internet & the increasing use of wireless devices provides the
universal connectivity.
Many current research projects need teamwork, collaboration.
Network Vs. Computer Performance
Computer speed doubles every 18 months
Network speed doubles every 9 months
Moore’s Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan-2001) by Cleo Vilett, source Vined Khoslan, Kleiner, Caufield and Perkins.
Major Grid Projects
Name URL & Sponsors
Focus
BlueGrid IBM Grid testbed linking IBM laboratories
DISCOM www.cs.sandia.gov/discomDOE Defense Programs
Create operational Grid providing access to resources at three U.S. DOE weapons laboratories
DOE Science Grid
sciencegrid.org
DOE Office of Science
Create operational Grid providing access to resources & applications at U.S. DOE science laboratories & partner universities
Earth System Grid (ESG)
earthsystemgrid.orgDOE Office of Science
Delivery and analysis of large climate model datasets for the climate research community
European Union (EU) DataGrid
eu-datagrid.org
European Union
Create & apply an operational grid for applications in high energy physics, environmental science, bioinformatics
Major Grid Projects
Name URL/Sponsor Focus
EuroGrid, Grid Interoperability (GRIP)
eurogrid.org
European Union
Create tech for remote access to supercomp resources & simulation codes; in GRIP, integrate with Globus Toolkit™
Fusion Collaboratory fusiongrid.org
DOE Off. Science
Create a national computational collaboratory for fusion research
Globus Project globus.org
DARPA, DOE, NSF, NASA, Msoft
Research on Grid technologies; development and support of Globus Toolkit™; application and deployment
GridPP gridpp.ac.uk
U.K. eScience
Create & apply an operational grid within the U.K. for particle physics research
Information Power Grid
Ipg.nasa.gov Create & apply a production Grid for aero sciences and other NASA missions
Grid Research Integration Dev. & Support Center
grids-center.org
NSF
Integration, deployment, support of the NSF Middleware Infrastructure for research & education
www.teragrid.org
www.ivdgl.org
iVDGL:International Virtual Data Grid Laboratory
Tier0/1 facility
Tier2 facility
10 Gbps link
2.5 Gbps link
622 Mbps link
Other link
Tier3 facility
Grid ArchitectureGrid Architecture
Why Do We Need It? To structure the development of new technology
Common Vocabulary, Guidance, Perspective
To
Identify the fundamental system components
Specify the purpose and function of these components
Indicate how these components interact
Emphasizes
1. identification and definition of protocols and services
2. APIs and SDKs
A Protocol architecture – mechanism for VO users and resources to negotiate, manage sharing relationships
Facilitates
Extensibility
Interoperability
Why is interoperability a fundamental concern?
Standard protocols standard services to abstract away resource specific details.
To accelerate application development in complex and dynamic execution environments we need APIs,SDKs
The Technology + Services middleware
Analogy to Internet Architecture
High level description – places few constraints on design & implementation
Application
Fabric“Controlling things locally”: Access to, & control of, resources
Connectivity“Talking to things”: communication (Internet protocols) & security
Resource“Sharing single resources”: negotiating access, controlling use
Collective“Coordinating multiple resources”: ubiquitous infrastructure services, app-specific distributed services
InternetTransport
Application
Link
Inte
rnet P
roto
col
Arch
itectu
re
Layered Grid Architecture
Features
Open and Extensible
Built on Internet protocols & services
Communication, routing, name resolution, etc.
“Layering” is conceptual No constraints on who can call what
Protocols/Services/APIs/SDKs will (ideally) be largely self-contained
Things like communication, security are very fundamental
Advantageous for higher layer functions to use common lower-level functions
The Hourglass Model
Diverse global services
Coreservices
Local OS
A p p l i c a t i o n s • Resource and connectivity form the“neck in the hour glass”• Designed to be implemented on topof diverse range of resource types(fabric layer)• Can be used to construct wide rangeof global and application specificServices (collective layer)
What's the Status?
No “official” standards yet.
Globus Toolkit is the “unofficial” standard for several connectivity, resource, collective protocols
Global Grid Forum (GGF) has Grid Protocol Architecture group
In security some RFCs are available
In scheduling & resource management some working documents are available
Globus Toolkit
A software toolkit (GTK) to address key issues to pave the road
Offers a set of technologies (NCSA’s “Grid-in-a-box”)
Try to standardize the Grid protocols and APIs
Open Architecture & Open Source (Reference implementation)
Define Grid Protocols & APIs
Integrate and extend existing protocols
provides software tools that make it easier to build computational grids and grid-based applications
Learn from the experiences gained through deployment and applications
Key Components
Security
Communication
Information Infrastructure
Fault Detection
Resource Management
Portability
Data Management
Fabric Layer What do you expect? -- diverse resource that may be shared
Can be a logical entity such as a distributed file system, computer cluster – But the Grid architecture don’t care
Components implement the local, resource-specific operations that occur on specific resources (physical or logical)
Trade-off (Rich fabric functionality vs. Easy of deployment)
Richer functionality more sophisticated sharing operations (e.g., reservation)
Few demands simplified Grid infrastructure deployment.
Should implement enquiry mechanisms and resource management
Fabric in Globus Toolkit Is designed to use existing fabric components
If a vendor doesn’t provide the necessary behavior, GTK includes the missing piece
Resource management, is generally the domain of local resource managers.
GARA (General - purpose Architecture for Reservation and Allocation) can provide QoS for different types resources
Connectivity Communication
Enable exchange of data between fabric layer resources
Include transport, routing, naming (TCP,IP,DNS)
Authentication
Build on communication protocols
Uniform authentication, authorization and message protection in multi institutional scenario
Based on existing standards whenever possible
Various Requirements
Single sign on (access to multiple Grid resources without user intervention)
Delegation (user’s program is able to access resources on which user is authorized)
Integration with various local security solutions (identity mapping)
User-based trust relationships (must not require for various resources to cooperate in configuring security environment)
stake holder should have the final control the authorization decisions
Security in GTK Grid Security Infrastructure (GSI) is based on
Public key
X.509 certificates
SSL/TLS communication
GSS-API (Generic Authorization and Access)
Extensions are added for single sign-on and delegation
Stakeholder control is supported via GAA (Generic Authorization and Access).
GSI adheres to the IETF’s standard GSS-API
Site A(Kerberos)
Site B (Unix)
Site C(Kerberos)
Computer
User
Single sign-on via “grid-id”& generation of proxy cred.
Or: retrieval of proxy cred.from online repository
User ProxyProxy
credential
Computer
Storagesystem
Communication*
GSI-enabledFTP server
AuthorizeMap to local idAccess file
Remote fileaccess request*
GSI-enabledGRAM server
GSI-enabledGRAM server
Remote processcreation requests*
* With mutual authentication
Process
Kerberosticket
Restrictedproxy
Process
Restrictedproxy
Local id Local id
AuthorizeMap to local idCreate processGenerate credentials
Ditto
GSI Example Scenario“Create Processes at A and B
that Communicate & Access Files at C”
Resource Layer
Concerned entirely with individual resources
Addresses
Resource discovery
Reservation & Allocation
Monitoring & Control
Secure Negotiation
Two primary classes of protocols
1. Information protocols – to obtain information about the structure & state of a resource
2. Management Protocols – “policy application point”
“in the neck of hourglass” the number of such protocols should be limited to small and focused set.
Resource Layer Components in GTK GRIP (Grid Resource Information Protocol)
Based on LDAP
Defines standard resource information
GRRP (Grid Resource Registration Protocol)
To register resources with Grid Index Information Servers
GRAM (Grid Resource Access and Management)
HTTP based RPC
Used for allocation of computational resources
for monitoring and controlling of computation on these resources
GridFTP
allows grid applications to have secure, ubiquitous, high-performance access to data
uses the GSI for authentication
new extensions to the FTP protocol for
• parallel data transfer
• partial file transfer
• third-party (server-to-server) data transfer
Collective Layer Coordinate multiple resources
Protocols & services (APIs,SDKs) are not associated with any resource but rather are global in nature
Capture interactions across collections of resources
Examples:
Index servers aka Metadirectory services – custom views on dynamic resource collections
Resource Brokers (e.g., Condor-G Matchmaker, AppLes, Nimrod-G, DRM broker)
– Resource discovery & allocation
Replica catalogs & services
Co-reservation & Co-allocation services
Software discovery services – select best s/w implementation and platform based on the problem parameters (e.g., NetSolve, Ninf)
Community accounting and payment services
Collaboratory services (e.g., CAVERNsoft)
Information Infrastructure in GTK An infrastructure that provides coherent system information
spanning virtual organizations is necessary
MDS (Metacomputing Directory Service)
Uses LDAP (Lightweight Directory Access Protocol)
Provides uniform means of querying system information from a rich variety of system components
uniform namespace for resource information across a system that may involve many organizations.
GRIS (Grid Resource Information Service) a uniform means of querying resources for their current configuration, capabilities, and status
GIIS (Grid Index Information Service) a means of knitting together arbitrary GRIS services to provide a coherent system image.
GIISes provide a mechanism for identifying "interesting" resources
Resource Management Services in GTK Three main components
1. RSL (Resource Specification Language) to communicate resource requirements
2. GRAM (Grid Resource Allocation Management) standardized interface to all of the various local resource management tools (e.g., Condor,LSF,PBS)
3. DUROC (Dynamically-Updated Request Online Co-allocator) coordinates a single request that may span multiple GRAMs
Resource Broker handle the mapping of high level application requests into requests to individual resource managers
GRAM GRAM GRAM
LSF Condor PBS
Application
RSL
Simple ground RSL
Information Service
Localresourcemanagers
RSLspecialization
Broker
RSL
Co-allocator
Queries& Info
Resource Management Architecture
GRAM Components
Grid SecurityInfrastructure
Job Manager
GRAM client API calls to request resource allocation
and process creation.
MDS client API callsto locate resources
Query current statusof resource
Create
RSL Library
Parse
RequestAllocate &
create processes
Process
Process
Process
Monitor &control
Site boundary
Client MDS: Grid Index Info Server
Gatekeeper
MDS: Grid Resource Info Server
Local Resource Manager
MDS client API callsto get resource info
GRAM client API statechange callbacks
GRAM GRAM-1 HTTP-based RPC
GRAM-1.5 (Reliability improvements)
Once-and-only-once submission
Reliable termination detection
GRAM-2 towards integration with web-services (SOAP)
OGSA (Open Grid Services Architecture)
Gate Keeper
Single point of entry “secure inetd”
Job Manager
Layers on top of local resource management system (e.g., PBS,LSF, etc)
Handles remote interaction with the job
Data Management in GTK "Access to distributed data is typically as important as access to
distributed computational resources“ - Globus
Tools for managing data in Grid systems and applications
Also called “Data Grid”.
GridFTP
Data Replication
Two tools for managing data replicas: multiple copies of data stored in different systems to improve access across geographically-distributed Grids
Replica Catalog – based on LDAP directory
Replica Manager – combines the Replica Catalog with GridFTP to manage data replication
GASS (Global Access to Secondary Storage)
Allows applications to access data stored in any remote file system by specifying a URL
Can be in HTTP, FTP
Condor-GA Computation Management Agent for
Multi-Institutional Grid
What is Condor-G?
Condor enhanced with Globus Toolkit components to “harness multi-domain resources as if they all belong to one personal domain”
Example of applying the general purpose GTK components to solve a particular problem (i.e., high-throughput computing on the Grid)
Separation of Concern
1. Remote Resource Access
Secure remote resource discovery,allocation, management
Uses Globus Toolkit components
2. Computation Management
Via user computation management agent responsible for job submission, job management, error recovery
Taken from Condor system
3. Remote Execution
Via mobile sandboxing – to create a user tailored execution environment on a remote node
Taken from Condor system
Why Condor for Grid Jobs?? Adv. Of using condor-G to manage Grid jobs
Can Query a job’s status or cancel a job
Credential management
Get informed of job termination or problems via callbacks or asynchronous mechanisms such as email
Access to detailed logs with complete history of the jobs’ execution
Fault tolerance and exactly once semantics
Job Execution
Stages a job’s standard I/O and executable using GASS
Submits jobs to remote machines using revised GRAM(1.5)
Job manager checkpoint & restart
Two-phase commit during job submission
Monitors job status & recovers from remote failures using revised GRAM callbacks and status calls
Condor-G handles resubmission of failed jobs, communications with the user concerning unusual & erroneous conditions, recording of computation status to support restart
Execution Mechanism
Fault Tolerance
Tolerates four types of failure
Local Crash
1. Crash of the host on which GridManager is running (or crash of the GridManager alone)
Queue state stored on disk Reconnect to the JobManagers that were running at the time of crash
Remote Crash
2. Crash of the GlobusJobManager Start a new JobManager
3. Crash of the machine that hosts the remote resource (GateKeeper,JobManager)
Wait until connectivity returns Start a new JobManager
4. Network Failures
Credential Management
Authentication in is done with limited-lifetime X509 proxies
Credential may expire before the job finishes execution
Condor-G agent periodically analyzes the credentials for all users with currently queued jobs
Can put jobs on hold and e-mail user to refresh proxy
Can forward new credentials to execution sites
Using the MyProxy system, which lets a user store a long-lived proxy credential on a secure server.
Remote services acting on behalf of user can obtain short-lived proxies
Condor-G can use these to refresh the user credential automatically
Resource Discovery & Scheduling
Simple – user supplies list of GRAM servers
Resource broker in Condor-G agent – Condor Matchmaking
“flood” candidate resources – “Glidelns”
GlideIn Mechanism
Use same codor mechanism to start on a remote node not a user job, but a daemon
The deamon traps system calls made by user’s job and redirects back to the originating system
Periodically checkpoints the job and migrates job to another location if it is requested
The Condor-G GlideIn mechanism uses Grid protocols to dynamically create a personal condor pool out of Grid resources by “gliding-in” Condor daemons to remote resource
Allows to delay the binding of an application to a resource
Prevent queuing delays
Can guarantee optimal queuing times to users
Remote Execution via GlideIn
Grid Architecture in Practice
ComputeResource
SDK
API
AccessProtocol
CheckpointRepository
SDK
API
C-pointProtocol
High Throughput Computing System
Dynamic checkpoint, job management, failover, staging
Brokering, certificate authorities
Access to data, access to computers, access to network performance data
Communication, service discovery (DNS), authentication, authorization, delegation
Storage systems, schedulers
Collective(App)
App
Collective(Generic)
Resource
Connect
Fabric
Future & Conclusions Evolution
Past-Present: O(102) computers; Mb/s networks; local (centralized) control
Present: O(104-106) data systems, computers; Gb/s networks; restricted decentralized control
Future: O(106-109) data,sensors,computer,instruments; highly flexible policy,control
“A computer (includes software) is a dynamically, often collaboratively constructed collection of processors,data sources,networks,sensors,instruments”
“Open the faucet get the water – Connect to the Grid get the compute power”
We need a powerful computational economy model (Bidding systems – new optimization algorithms)
Summary
The Grid Problem: Resource sharing & coordinated problem solving in dynamic multi-institutional virtual organizations
Grid Architecture: Emphasizes protocol and service definition to enable interoperability and resource sharing
Globus Toolkit: a source of protocol and APIs, reference implementation
Condor-G: applies general purpose Globus Toolkit to solve high-throughput computing on the Grid
Thanks for Ur Patience