
Page 1: GridPrimer

e-Science NorthWest

Jon MacLaren

Monday 18th to Friday 22nd October 2004
GridPrimer Training Course
University of Manchester

GridPrimer

An Introduction to the world of Grid Computing

Page 2: GridPrimer

e-Science NorthWest2

Generation Game

[Figure: timeline of grid generations, with increased functionality and standardization over time: custom solutions and app-specific services → de facto standards (Globus Toolkit, Condor, Unicore; GridFTP, GSI, X.509, LDAP, FTP, …) → Open Grid Services Architecture built on Web services.]

First generation:
– Computationally intensive
– File access/transfer
– Bag of various heterogeneous protocols & toolkits
– Monolithic design
– Recognised internet, ignored Web
– Academic teams

New generation:
– Data and knowledge intensive
– Open services-based architecture
– Builds on Web services: GGF + OASIS + W3C
– Multiple implementations
– Global Grid Forum
– Industry participation

(adapted from Ian Foster, GGF7 Plenary)

Page 3: GridPrimer

e-Science NorthWest

Grid Security Introduction

What is security on the Grid about?
What does it do?
What are the problems?

Page 4: GridPrimer

e-Science NorthWest4

What is security about?

What do we need to ensure?
– The privacy of messages being exchanged
– The integrity of messages being exchanged
– We know someone is who they say they are
– We know what the person is "entitled" to do
– We can remember what the person did

It's about communicating and collaborating securely in an insecure environment!

Solutions?
– Encryption
– Signing
– Authentication
– Authorization
– Accounting

Page 5: GridPrimer

e-Science NorthWest5

Public-key cryptography

"The problems of authentication and large network privacy protection were addressed theoretically in 1976 by Whitfield Diffie and Martin Hellman when they published their concepts for a method of exchanging secret messages without exchanging secret keys. The idea came to fruition in 1977 with the invention of the RSA Public Key Cryptosystem by Ronald Rivest, Adi Shamir, and Len Adleman, then professors at the Massachusetts Institute of Technology."

So the RSA algorithm (and others, e.g. DSA) allows you to create two keys with the following properties:
– data encrypted with the first key can only be decrypted with the second key
– data encrypted with the second key can only be decrypted with the first key
– given one of the keys, it is extremely difficult to work out what the other key is

Page 6: GridPrimer

e-Science NorthWest6

How difficult is extremely difficult?

Can still attack key pairs using brute force methods

On August 22, 1999,  a group of researchers completed the factorization of the 155 digit (512 bit) RSA Challenge Number.  The work was accomplished with the General Number Field Sieve.  The sieving software used two different sieve techniques: line sieving and lattice sieving.  The sieving was accomplished as follows:

Sieving: 35.7 CPU-years in total on...
– 160 SGI and Sun workstations (175-400 MHz)
– 8 SGI Origin 2000 processors (250 MHz)
– 120 Pentium II PCs (300-450 MHz)
– 4 Digital/Compaq boxes (500 MHz)

Page 7: GridPrimer

e-Science NorthWest7

Should we be worried?

For RSA, 576-bit is the longest to be broken (Dec 2003)
1024-bit still far off (maybe?). If you can do it, they'll give you $100,000! See the RSA Challenge Numbers:
– http://www.rsasecurity.com/rsalabs/node.asp?id=2093

Many Certificate Authorities (more later) recommend 2048-bit key-pairs, which can be easily generated

New algorithms are being developed based on the mathematics of elliptic curves (Elliptic Curve Cryptography).

The longest ECC key-pair broken is a 109-bit key in 2003, using the "birthday attack", with over 10,000 Pentium-class PCs running continuously for over 540 days.

The minimum recommended key size for ECC is 163 bits, which is currently estimated to require 10^8 times the computing resources required for the 109-bit problem.

Page 8: GridPrimer

e-Science NorthWest8

Watch those random numbers!

To generate "good" key-pairs, you need good random numbers!

In 1995, Goldberg and Wagner from Berkeley discovered that random numbers in Netscape were being generated using only three pieces of information:
– time of day
– process ID
– parent process ID

This resulted in some easy attacks, particularly on shared machines. See:– http://www.cs.berkeley.edu/~daw/netscape-randomness.html

Can use /dev/random, a PRNG (Pseudo-Random Number Generator), the Entropy Gathering Daemon, etc.

Page 9: GridPrimer

Public Key Cryptography: Encryption (Scary Maths!)

● Public Key: [e and N]
  – where N = pq, the product of two large primes
  – (p-1)(q-1) is almost prime, and e (almost prime too)
● Private Key: [d and N]
  – where e×d ≡ 1 mod (p-1)(q-1)
● To encrypt/decrypt with the Public Key: c = m^e mod N
● To decrypt/encrypt with the Private Key: m = c^d mod N
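To see these formulas in action, here is a toy-sized worked example in Python (the primes are far too small for real use; this is purely illustrative):

p, q = 61, 53                # two "large" primes (toy-sized!)
N = p * q                    # N = 3233
phi = (p - 1) * (q - 1)      # (p-1)(q-1) = 3120
e = 17                       # public exponent, coprime to phi
d = pow(e, -1, phi)          # private exponent: e*d = 1 mod phi (Python 3.8+; d = 2753)

m = 65                       # the message, as a number < N
c = pow(m, e, N)             # encrypt with the public key: c = m^e mod N
assert pow(c, d, N) == m     # decrypt with the private key: m = c^d mod N

s = pow(m, d, N)             # "encrypt" with the private key instead...
assert pow(s, e, N) == m     # ...and anyone can recover m with the public key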

Page 10: GridPrimer

Signing

● Goal of signing is to leave something readable by everyone, but allow the signature to be verified
● So, we don't encrypt the message
● Instead we create a hash and encrypt that
● A hash is a one-way digest of the message by a specific algorithm (e.g. SHA1 or MD5)
● The encrypted hash is then sent with the message
● Verify the signature by making the hash and comparing this with the signature as decrypted using the sender's public key (see the sketch below)
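A minimal sketch of this sign-then-verify flow, reusing the toy RSA key from the previous example (real systems pad the digest with a scheme such as PKCS#1 and use full-size keys; reducing the digest mod N here is only to make the toy numbers fit):

import hashlib

N, e, d = 3233, 17, 2753                     # toy RSA key pair from the previous sketch
message = b"Meet at the GridPrimer lab at 3pm"

# Sender: hash the message, then "encrypt" the hash with the private key
digest = int.from_bytes(hashlib.sha1(message).digest(), "big") % N
signature = pow(digest, d, N)                # the signature travels with the message

# Receiver: recompute the hash, decrypt the signature with the
# sender's public key, and compare the two
expected = int.from_bytes(hashlib.sha1(message).digest(), "big") % N
assert pow(signature, e, N) == expected      # signature verifies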

Page 11: GridPrimer

PKI – Public Key Infrastructure

● Typically based on X509 certificates
  – Supports different key-pair algorithms
● We rely on ourselves to get true public keys
● Chain of trust rules
  – A public key may be digitally signed by many people, some of whom you may trust
● CA method (Certificate Authority)
  – The CA has a "root certificate" and a document called the CP/CPS (Policy and Practice): http://www.grid-support.ac.uk/ca/cps
  – You choose to trust on the basis of the CP/CPS
  – The CA signs your certificate (your public key)
  – Large-scale CAs are difficult and costly (~£220 per cert)

Page 12: GridPrimer

Getting Certificates

● Create a private and public key pair
● Send the public key to the CA
● Identify yourself to the CA (as specified in the CPS)
● The CA signs your public key
● The CA sends you a digital certificate which contains your public key and the CA's digital signature
● Can be done two ways:
  – in your browser: Netscape/IE certificate request
  – on the command line: e.g. grid-cert-request (see the example below)
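A typical command-line request looks something like this (the file names are the Globus defaults; prompts and flags vary with toolkit version and CA):

$ grid-cert-request
# prompts for a pass phrase to encrypt the new private key, then writes:
#   ~/.globus/userkey.pem           (encrypted private key, readable only by you)
#   ~/.globus/usercert_request.pem  (the request you send to the CA)
# the signed certificate the CA returns is saved as ~/.globus/usercert.pem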

Page 13: GridPrimer

The UK eScience Certificate Authority

● Read the CPS
● Get the CA cert
● Get the CRL
● Request a certificate
● CertDB
● Export certs

Gets you an x509 cert

Page 14: GridPrimer

x509 Certificates

● Version
● Serial Number
● Issuer
● Times of Validity
● Subject
● Public Key
● Extensions
  – Constraints
  – Type and Use
  – Thumbprint
  – CRL

Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 127 (0x7f)
        Signature Algorithm: md5WithRSAEncryption
        Issuer: C=UK, O=eScience, OU=Authority, CN=CA/[email protected]
        Validity
            Not Before: Oct 31 15:50:59 2002 GMT
            Not After : Oct 31 15:50:59 2003 GMT
        Subject: C=UK, O=eScience, OU=Manchester, L=MC, CN=michael jones
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
            RSA Public Key: (1024 bit)
                Modulus (1024 bit):
                    00:c6:96:fd:7a:e0:fa:f1:e6:43:9d:c1:cb:72:38:
                    e1:4e:44:86:da:a7:8a:ed:8a:fc:f3:64:d8:9e:bd:
                    af:ce:7c:55:39:cd:61:74:a8:1d:6d:60:6e:65:91:
                    dc:2c:c2:64:80:f6:f9:1a:3c:fe:d4:d2:1c:52:fa:
                    c6:47:ea:a6:4e:92:b5:c9:1d:93:dd:48:61:54:40:
                    b5:17:84:3f:5c:47:48:29:2b:83:82:c7:d6:ad:d3:
                    60:5d:6d:5d:f7:08:25:17:d2:14:e2:8e:af:37:3b:
                    e4:3b:63:f7:31:24:b4:66:78:8e:06:93:c6:8d:b6:
                    fe:50:79:3a:4a:f8:59:58:3d
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Basic Constraints: CA:FALSE
            Netscape Cert Type: SSL Client, S/MIME
            X509v3 Key Usage: Digital Signature, Non Repudiation, Key Encipherment, Key Agreement
            Netscape Comment: UK e-Science User Certificate
            X509v3 Subject Key Identifier: BF:00:02:4B:3A:45:A6:B8:EB:66:E4:F2:EE:CA:60:9D:B8:D1:B2:0D
            X509v3 Authority Key Identifier:
                keyid:02:38:AB:11:A3:96:80:8B:0D:D3:15:2B:08:A5:8E:30:DA:B2:DA:A8
                DirName:/C=UK/O=eScience/OU=Authority/CN=CA/[email protected]
                serial:00
            X509v3 Issuer Alternative Name: email:[email protected]
            Netscape CA Revocation Url: http://ca.grid-support.ac.uk/cgi-bin/importCRL
            Netscape Revocation Url: http://ca.grid-support.ac.uk/cgi-bin/importCRL
            Netscape Renewal Url: http://ca.grid-support.ac.uk/cgi-bin/renewURL
            X509v3 CRL Distribution Points: URI:http://ca.grid-support.ac.uk/cgi-bin/importCRL
    Signature Algorithm: md5WithRSAEncryption
        3a:1f:81:a8:1a:83:ff:2c:0f:7b:b6:1e:2a:87:31:13:d9:ca:
        9e:c1:9e:e4:42:b5:22:56:7b:01:98:11:13:29:a3:d8:d2:37:
        80:58:ac:7f:44:f7:1e:ba:00:f4:8b:c8:34:00:ff:44:27:c2:
        2a:54:8b:95:e9:a0:00:f8:3d:60:92:c4:99:2b:72:d4:b7:dd:
        78:bd:c9:4a:01:d7:14:1d:3c:d9:6f:60:7b:23:90:8e:d6:3a:
        2d:45:39:5e:bc:fd:6d:77:7b:1e:cf:43:8c:e4:05:4c:1b:91:
        e5:bb:da:3d:cd:9d:05:6b:be:21:b0:e8:43:b2:4b:4e:c4:4f:
        6b:4e:23:9e:03:d2:03:86:1b:44:68:60:41:5d:64:ae:2d:52:
        e2:7d:9b:99:60:71:7f:4a:00:1e:5d:9d:14:59:4f:4b:d7:9a:
        ee:e0:01:3d:87:36:16:bf:24:b3:84:fd:62:d1:d6:21:ae:3b:
        f7:e1:e5:52:ec:ef:68:f4:73:4f:1b:62:a6:f4:47:0b:6c:1e:
        28:23:6b:25:d3:a1:f7:37:f6:55:d6:82:7c:49:a9:1d:71:57:
        e6:bc:74:71:94:0d:df:fc:21:63:16:54:c9:0f:51:1c:7a:bf:
        5c:ef:7d:28:23:73:64:84:eb:f2:b6:52:89:ca:48:78:31:e8:
        dd:b9:91:3f
-----BEGIN CERTIFICATE-----
MIIFBDCCA+ygAwIBAgIBfzANBgkqhkiG9w0BAQQFADBwMQswCQYDVQQGEwJVSzERMA8GA1UEChMIZVNjaWVuY2UxEjAQBgNVBAsTCUF1dGhvcml0eTELMAkGA1UEAxMCQ0ExLTArBgkqhkiG9w0BCQEWHmNhLW9wZXJhdG9yQGdyaWQtc3VwcG9ydC5hYy51azAeFw0wMjEwMzExNTUwNTlaFw0wMzEwMzExNTUwNTlaMFoxCzAJBgNVBAYTAlVLMREwDwYDVQQKEwhlU2NpZW5jZTETMBEGA1UECxMKTWFuY2hlc3RlcjELMAkGA1UEBxMCTUMxFjAUBgNVBAMTDW1pY2hhZWwgam9uZXMwgZ8wDQYJKoZIhvcNAQEBBQADgY0AMIGJAoGBAMaW/Xrg+vHmQ53By3I44U5Ehtqniu2K/PNk2J69r858VTnNYXSoHW1gbmWR3CzCZID2+Ro8/tTSHFL6xkfqpk6Stckdk91IYVRAtReEP1xHSCkrg4LH1q3TYF1tXfcIJRfSFOKOrzc75Dtj9zEktGZ4jgaTxo22/lB5Okr4WVg9AgMBAAGjggJBMIICPTAJBgNVHRMEAjAAMBEGCWCGSAGG+EIBAQQEAwIFoDALBgNVHQ8EBAMCA+gwLAYJYIZIAYb4QgENBB8WHVVLIGUtU2NpZW5jZSBVc2VyIENlcnRpZmljYXRlMB0GA1UdDgQWBBS/AAJLOkWmuOtm5PLuymCduNGyDTCBmgYDVR0jBIGSMIGPgBQCOKsRo5aAiw3TFSsIpY4w2rLaqKF0pHIwcDELMAkGA1UEBhMCVUsxETAPBgNVBAoTCGVTY2llbmNlMRIwEAYDVQQLEwlBdXRob3JpdHkxCzAJBgNVBAMTAkNBMS0wKwYJKoZIhvcNAQkBFh5jYS1vcGVyYXRvckBncmlkLXN1cHBvcnQuYWMudWuCAQAwKQYDVR0SBCIwIIEeY2Etb3BlcmF0b3JAZ3JpZC1zdXBwb3J0LmFjLnVrMD0GCWCGSAGG+EIBBAQwFi5odHRwOi8vY2EuZ3JpZC1zdXBwb3J0LmFjLnVrL2NnaS1iaW4vaW1wb3J0Q1JMMD0GCWCGSAGG+EIBAwQwFi5odHRwOi8vY2EuZ3JpZC1zdXBwb3J0LmFjLnVrL2NnaS1iaW4vaW1wb3J0Q1JMMDwGCWCGSAGG+EIBBwQvFi1odHRwOi8vY2EuZ3JpZC1zdXBwb3J0LmFjLnVrL2NnaS1iaW4vcmVuZXdVUkwwPwYDVR0fBDgwNjA0oDKgMIYuaHR0cDovL2NhLmdyaWQtc3VwcG9ydC5hYy51ay9jZ2ktYmluL2ltcG9ydENSTDANBgkqhkiG9w0BAQQFAAOCAQEAOh+BqBqD/ywPe7YeKocxE9nKnsGe5EK1IlZ7AZgREymj2NI3gFisf0T3HroA9IvINAD/RCfCKlSLlemgAPg9YJLEmSty1LfdeL3JSgHXFB082W9geyOQjtY6LUU5Xrz9bXd7Hs9DjOQFTBuR5bvaPc2dBWu+IbDoQ7JLTsRPa04jngPSA4YbRGhgQV1kri1S4n2bmWBxf0oAHl2dFFlPS9ea7uABPYc2Fr8ks4T9YtHWIa479+HlUuzvaPRzTxtipvRHC2weKCNrJdOh9zf2VdaCfEmpHXFX5rx0cZQN3/whYxZUyQ9RHHq/XO99KCNzZITr8rZSicpIeDHo3bmRPw==
-----END CERTIFICATE-----

Page 15: GridPrimer

e-Science NorthWest15

Authentication

Your identity is seen as being trusted if your public key has been signed by a Certificate Authority which you have decided to trust.

The result is that you are verified as being the entity identified in the Distinguished Name (DN) of the certificate, e.g.:
– C=UK, O=eScience, OU=Manchester, L=MC, CN=jon maclaren

This DN will then be used in later Authorization stages.

Page 16: GridPrimer

e-Science NorthWest16

Authorization

Still lots of ad-hoc techniques being used
The most common one is the "grid-mapfile", which maps each DN to a single user name (see the example below)
Bad for many reasons:
– You have to add each new user in (no scalability)
– There are no standard mechanisms for distributing a new version of the file within a virtual organisation
– Things tend to get out of step, so a user will get authorization failures at particular sites/services
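The mapfile itself is just one line per user: a quoted DN followed by a local account name, conventionally kept in /etc/grid-security/grid-mapfile (the local user names below are invented for illustration):

"/C=UK/O=eScience/OU=Manchester/L=MC/CN=jon maclaren" jmaclaren
"/C=UK/O=eScience/OU=Manchester/L=MC/CN=michael jones" mjones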

Page 17: GridPrimer

e-Science NorthWest17

Beyond the grid-mapfile

The best-known alternative is the Community Authorization Service (CAS).
– Authorization is deferred from a resource to an authorization server.
– The server contains, in a single place, information on who is allowed to do what, and where, within the VO.

The introduction of a third party is common to most sophisticated authorization schemes.

Other schemes include:
– Akenti
– PERMIS (David Chadwick from Salford)
– Shibboleth (targeting HE sector, to replace ATHENS?)

All about expression of policy

Page 18: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 18

Community Authorization (prototype shown August 2001)

[Figure: message flow between User, CAS server, and Resource:
1. User → CAS: CAS request, with resource names and operations. The CAS checks "Does the collective policy authorize this request for this user?" against user/group membership, resource/collective membership, and collective policy information.
2. CAS → User: CAS reply, with capability and resource CA info.
3. User → Resource: resource request, authenticated with the capability. The resource checks "Is this request authorized for the CAS?" and "Is this request authorized by the capability?" against local policy information.
4. Resource → User: resource reply.]

Page 19: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 19

Community Authorization Service

CAS provides user community with information needed to authenticate resources– Sent with capability credential, used on

connection with resource

– Resource identity (DN), CA This allows new resources/users (and their

CAs) to be made available to a community through the CAS without action on the other user’s/resource’s part

Page 20: GridPrimer

e-Science NorthWest20

Authorization Scheme Taxonomy

– Authorization Push Sequence
– Authorization Pull Sequence
– Authorization Agent Sequence

Page 21: GridPrimer

e-Science NorthWest21

For those who want to know (lots) more

"Conceptual Grid Authorization Framework and Classification"
GGF Informational (Draft) Document, by the GGF Working Group on Authorization Frameworks and Mechanisms (AuthZ-WG)

Document draft: http://tinyurl.com/6nuh6

The public comment period is over; it should be appearing in its final form at:
– http://www.ggf.org/documents/final.htm

Page 22: GridPrimer

e-Science NorthWest22

Accounting?

Much, much worse: almost all ad hoc. Most middleware, including GT2/GT3, only provides some logfiles.

There is a draft standard for recording resource usage, at least for computational jobs.
– Developed by the GGF Usage Record Working Group (UR-WG)
– Current drafts at: http://www.psc.edu/~lfm/Grid/UR-WG/
– The PBS scheduler can generate records in this format

Also, there is a group working on a service for holding this data.
– The GGF OGSA Resource Usage Service Working Group (RUS-WG)
– Was following OGSI technology, and is currently inactive
– May publish a "pure web services" specification in 2005

Page 23: GridPrimer

e-Science NorthWest23

Anything else?

Most advanced work in this area is the requirements that have been gathered by the GGF Site Authentication Authorization and Accounting Research Group (SA3-RG or AAA-RG)

“Site Requirements for Grid Authentication, Authorization and Accounting”

GGF Informational (Final) Document
– http://www.ggf.org/documents/GWD-I-E/GFD-I.032.txt

The main problem with accounting is that it’s seen as very, very dull...

Page 24: GridPrimer

e-Science NorthWest

Pre-XML Grids

What is “Grid” anyway?

Page 25: GridPrimer

e-Science NorthWest25

[This slide repeats the "Generation Game" figure from Page 2: custom solutions and app-specific services → de facto standards (Globus Toolkit, Condor, Unicore; GridFTP, GSI, X.509, LDAP, FTP, …) → Open Grid Services Architecture built on Web services, with increased functionality and standardization over time. (adapted from Ian Foster, GGF7 Plenary)]

Page 26: GridPrimer

e-Science NorthWest26

What do I have to choose from?

Globus Toolkit
– version 2 is widely deployed; nearest thing to a de facto standard
– horizontally integrated bag of tools
– suits grid application developers better than end users

UNICORE
– less widely deployed; few UK deployments
– vertically integrated
– suits end users better than application developers

Condor
– high-throughput computing
– great for cycle harvesting

Web Services?
– wait, or roll your own using Web Services tools

Others
– yes, there are others

Page 27: GridPrimer

e-Science NorthWest27

Globus Toolkit version 2: An Overview

"Single sign-on" through Grid Security Infrastructure (GSI)
Remote execution of jobs
– GRAM, job-managers, Resource Specification Language (RSL)
GridFTP
– Efficient, reliable file transfer; third-party file transfers
MDS (Metacomputing Directory Service)
– Resource discovery (GRIS and GIIS)
Co-allocation (DUROC)
– Limited by support from scheduling infrastructure
Other GSI-enabled utilities
– gsi-ssh, grid-cvs, etc.
Low-level APIs and command-line interfaces
Commodity Grid Kits (CoG-kits): Java, Perl, Python
Widespread deployment, lots of projects

[Figure: layered architecture, with Applications on top of diverse global services, core services, and the local OS.]

Page 28: GridPrimer

The Globus Toolkit™: Security Services

The Globus Project™
Argonne National Laboratory
USC Information Sciences Institute
http://www.globus.org

Page 29: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 29

Public Key Based Authentication

User sends certificate over the wire
Other end sends user a challenge string
User encodes the challenge string with the private key
– Possession of the private key means you can authenticate as the subject in the certificate
The public key is used to decode the challenge
– If you can decode it, you know the subject (see the sketch below)
Treat your private key carefully!!
– The private key is stored only in well-guarded places, and only in encrypted form
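A minimal sketch of this challenge-response exchange, again using the toy RSA key from earlier (a real implementation signs a hash with proper padding; this just shows the shape of the protocol):

import secrets

N, e, d = 3233, 17, 2753                 # user's toy key pair; (e, N) arrives in the certificate
challenge = secrets.randbelow(N)         # the other end picks a fresh random challenge
response = pow(challenge, d, N)          # user encodes the challenge with the private key

# The other end decodes with the public key from the certificate and compares:
assert pow(response, e, N) == challenge  # only the private-key holder could have done this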

Page 30: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 30

X.509 Proxy Certificate

Defines how a short-term, restricted credential can be created from a normal, long-term X.509 credential
– A "proxy certificate" is a special type of X.509 certificate that is signed by the normal end entity cert, or by another proxy
– Supports single sign-on & delegation through "impersonation"
– Currently an IETF draft

Page 31: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 31

User Proxies

Minimize exposure of the user's private key
A temporary, X.509 proxy credential for use by our computations
– We call this a user proxy certificate
– Allows a process to act on behalf of the user
– User-signed user proxy cert stored in a local file
– Created via the "grid-proxy-init" command (see the example below)
The proxy's private key is not encrypted
– Rely on file system security; the proxy certificate file must be readable only by the owner
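A typical session (output abridged; the -hours flag and the companion commands are standard GT2 tools, though details vary by version):

$ grid-proxy-init -hours 12
Your identity: /C=UK/O=eScience/OU=Manchester/L=MC/CN=jon maclaren
Enter GRID pass phrase for this identity:
Creating proxy .................................. Done

$ grid-proxy-info      # inspect the proxy (subject, time left, file location)
$ grid-proxy-destroy   # remove the proxy file when you are finished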

Page 32: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 32

Delegation

Remote creation of a user proxy
Results in a new private key and X.509 proxy certificate, signed by the original key
Allows a remote process to act on behalf of the user
Avoids sending passwords or private keys across the network

Page 33: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 33

Globus Security APIs

Generic Security Service (GSS) API
– IETF standard
– Provides functions for authentication, delegation, message protection
– Decoupled from any particular communication method

But the GSS-API is somewhat complicated, so we also provide the easier-to-use globus_gss_assist API.

GSI-enabled SASL is also provided

Page 34: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 34

GSI Applications

The Globus Toolkit™ uses GSI for authentication
Many Grid tools, directly or indirectly, e.g.
– Condor-G, SRB, MPICH-G2, Cactus, GDMP, …
Commercial and open source tools, e.g.
– ssh, ftp, cvs, OpenLDAP, OpenAFS
– SecureCRT (Win32 ssh client)
And since we use standard X.509 certificates, they can also be used for
– Web access, LDAP server access, etc.

Page 35: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 35

Ongoing and Future GSI Work

Protection against compromised resources
– Restricted delegation, smartcards
Standardization
Scalability in numbers of users & resources
– Credential management
– Online credential repositories ("MyProxy")
– Account management
Authorization
– Policy languages
– Community authorization

Page 36: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 36

Restricted Proxies

Q: How to restrict the rights of a delegated proxy to a subset of those associated with the issuer?
A: Embed a restriction policy in the proxy cert
– Policy is evaluated by the resource upon proxy use
– Reduces the rights available to the proxy to a subset of those held by the user

But how to avoid policy language wars?
– The proxy cert just contains a container for a policy specification, without defining the language
  > Container = OID + blob
– Can evolve policy languages over time

Page 37: GridPrimer

e-Science NorthWest37

Problems with GSI

There are problems with GSI, and uptake is not as widespread as the Globus Alliance would like you to believe:
– Creates a proxy certificate on the remote machine
– The proxy contains a private key, unencrypted, stored in /tmp
– Only protected with UNIX filesystem rules
– These are time-limited, but people can set this limit to be very long
– So you have to trust the root user of any machine you use
– Mostly, restricted proxies are created, but some applications, e.g. GSI-SSH, create full proxies
– It is easy to design attacks whereby the full private key can be stolen

In part, these problems are fundamental to generic delegation schemes, which are powerful, but also dangerous

Proxy certificates mean that people from some application domains simply will not use Globus

Page 38: GridPrimer

The Globus Toolkit™: Resource Management Services

The Globus Project™
Argonne National Laboratory
USC Information Sciences Institute
http://www.globus.org

Page 39: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 39

The Challenge

Enabling secure, controlled remote access to heterogeneous computational resources and management of remote computation
– Authentication and authorization
– Resource discovery & characterization
– Reservation and allocation
– Computation monitoring and control

Addressed by new protocols & services
– GRAM protocol as a basic building block
– Resource brokering & co-allocation services
– GSI for security, MDS for discovery

Page 40: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 40

Resource Management

The Grid Resource Allocation Management (GRAM) protocol and client API allow programs to be started on remote resources, despite local heterogeneity

The Resource Specification Language (RSL) is used to communicate requirements

A layered architecture allows application-specific resource brokers and co-allocators to be defined in terms of GRAM services
– Integrated with Condor, PBS, MPICH-G2, …

Page 41: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 41

Resource Management Architecture

[Figure: an Application passes RSL to a Broker, which uses queries & info from the Information Service to specialize it into ground RSL; a Co-allocator breaks this into simple ground RSL requests handed to GRAM instances, each sitting on a local resource manager (LSF, Condor, NQE).]

Page 42: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 42

Resource Specification Language

Common notation for exchange of information between components
– Syntax similar to MDS/LDAP filters

RSL provides two types of information:
– Resource requirements: machine type, number of nodes, memory, etc.
– Job configuration: directory, executable, args, environment

The Globus Toolkit provides an API/SDK for manipulating RSL

Page 43: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 43

RSL Syntax

Elementary form: parenthesis clauses
– (attribute op value [ value … ] )
Operators supported:
– <, <=, =, >=, >, !=
Some supported attributes:
– executable, arguments, environment, stdin, stdout, stderr, resourceManagerContact, resourceManagerName
Unknown attributes are passed through
– May be handled by subsequent tools

Page 44: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 44

Constraints: “&”

For example:

& (count>=5) (count<=10)
  (max_time=240) (memory>=64)
  (executable=myprog)

"Create 5-10 instances of myprog, each on a machine with at least 64 MB of memory, that is available to me for 4 hours"

Page 45: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 45

Disjunction: “|”

For example:

& (executable=myprog)
  ( | (& (count=5) (memory>=64))
      (& (count=10) (memory>=32)) )

Create 5 instances of myprog on a machine that has at least 64MB of memory, or 10 instances on a machine with at least 32MB of memory

Page 46: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 46

GRAM Protocol

GRAM-1: Simple HTTP-based RPC
– Job request
  > Returns a "job contact": an opaque string that can be passed between clients, for access to the job
– Job cancel, status, signal
– Event notification (callbacks) for state changes
  > Pending, active, done, failed, suspended

GRAM-1.5 (U Wisconsin contribution)
– Adds reliability improvements
  > Once-and-only-once submission
  > Recoverable job manager service
  > Reliable termination detection

GRAM-2: Moving to Web Services (SOAP)…

Page 47: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 47

Globus Toolkit Implementation

Gatekeeper
– Single point of entry
– Authenticates the user, maps to the local security environment, runs the service
– In essence, a "secure inetd"

Job manager
– A gatekeeper service
– Layers on top of the local resource management system (e.g., PBS, LSF, etc.)
– Handles remote interaction with the job

Page 48: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 48

GRAM Components

Grid SecurityInfrastructure

Job Manager

GRAM client API calls to request resource allocation

and process creation.

MDS client API callsto locate resources

Query current statusof resource

Create

RSL Library

Parse

RequestAllocate &

create processes

Process

Process

Process

Monitor &control

Site boundary

Client MDS: Grid Index Info Server

Gatekeeper

MDS: Grid Resource Info Server

Local Resource Manager

MDS client API callsto get resource info

GRAM client API statechange callbacks

Page 49: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 49

Co-allocation

Simultaneous allocation of a resource set
– Handled via optimistic co-allocation based on free nodes or queue prediction
– In the future, advance reservations will also be supported (already in prototype)

Globus APIs/SDKs support the co-allocation of specific multi-requests
– Uses a Globus component called the Dynamically-Updated Request Online Co-allocator (DUROC)

Page 50: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 50

Multirequest: “+”

A multirequest allows us to specify multiple resource needs, for example

+ (& (count=5) (memory>=64)
     (executable=p1))
  (& (network=atm) (executable=p2))

– Execute 5 instances of p1 on a machine with at least 64M of memory
– Execute p2 on a machine with an ATM connection

Multirequests are central to co-allocation

Page 51: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 51

A Co-allocation Multirequest

+ ( & (resourceManagerContact=
        "flash.isi.edu:754:/C=US/…/CN=flash.isi.edu-fork")
      (count=1)
      (label="subjob A")
      (executable=my_app1) )
  ( & (resourceManagerContact=
        "sp139.sdsc.edu:8711:/C=US/…/CN=sp097.sdsc.edu-lsf")
      (count=2)
      (label="subjob B")
      (executable=my_app2) )

Note the different executables, different counts, and different resource managers for the two subjobs.

Page 52: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 52

Job Submission Interfaces

The Globus Toolkit includes several command line programs for job submission
– globus-job-run: interactive jobs
– globus-job-submit: batch/offline jobs
– globusrun: flexible scripting infrastructure

Others are building better interfaces
– General purpose
  > Condor-G, PBS, GRD, Hotpage, etc.
– Application specific
  > ECCE', Cactus, Web portals

Page 53: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 53

globus-job-run

For running interactive jobs
Additional functionality beyond rsh
– Ex: run a 2-process job with executable staging:
  globus-job-run -: host -np 2 -s myprog arg1 arg2
– Ex: run 5 processes across 2 hosts:
  globus-job-run \
    -: host1 -np 2 -s myprog.linux arg1 \
    -: host2 -np 3 -s myprog.aix arg2
– For a list of arguments run:
  globus-job-run -help

Page 54: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 54

globus-job-submit

For running batch/offline jobs (see the example session below)
– globus-job-submit: submit job
  > Same interface as globus-job-run
  > Returns immediately
– globus-job-status: check job status
– globus-job-cancel: cancel job
– globus-job-get-output: get job stdout/err
– globus-job-clean: cleanup after job
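An illustrative session (the host name is invented, and the job contact URL just shows the shape such contacts take):

$ globus-job-submit bragg.man.ac.uk /bin/hostname
https://bragg.man.ac.uk:39123/1234/1016229400/
$ globus-job-status https://bragg.man.ac.uk:39123/1234/1016229400/
DONE
$ globus-job-get-output https://bragg.man.ac.uk:39123/1234/1016229400/
bragg.man.ac.uk
$ globus-job-clean https://bragg.man.ac.uk:39123/1234/1016229400/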

Page 55: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 55

globusrun

Flexible job submission for scripting (see the example below)
– Uses an RSL string to specify the job request
– Contains an embedded globus-gass-server
  > Defines a GASS URL prefix in an RSL substitution variable: (stdout=$(GLOBUSRUN_GASS_URL)/stdout)
– Supports both interactive and offline jobs

Complex to use
– Must write RSL by hand
– Must understand its esoteric features
– Generally you should use the globus-job-* commands instead
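For example, a hand-written RSL job with output streamed back through the embedded GASS server might look like this (host name invented; exact flags vary by toolkit version):

$ globusrun -o -r bragg.man.ac.uk '&(executable=/bin/date)'
Mon Oct 18 10:32:04 BST 2004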

Page 56: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 56

Resource Management APIs

The globus_gram_client API provides access to all of the core job submission and management capabilities, including callback capabilities for monitoring job status.

The globus_rsl API provides convenience functions for manipulating and constructing RSL strings.

The globus_gram_myjob API allows multi-process jobs to self-organize and to communicate with each other.

The globus_duroc_control and globus_duroc_runtime APIs provide access to multirequest (co-allocation) capabilities.

Page 57: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 57

Advance Reservation and Other Generalizations

General-purpose Architecture for Reservation and Allocation (GARA)
– 2nd generation resource management services

Broadens GRAM on two axes
– Generalize to support various resource types
  > CPU, storage, network, devices, etc.
– Advance reservation of resources, in addition to allocation

Currently a research prototype

Page 58: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 58

GARA: The Big Picture

[Figure: a Co-Reservation Agent consults the MDS Info Service and makes reservations through several Gatekeepers, each fronting a different resource manager: a Scheduler RM, a Diffserv RM, a DSRT RM, and a GRIO RM.]

Page 59: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 59

Resource Management Futures: GRAM-2 (planned for 2002)

Advance reservations
– As prototyped in GARA in the previous 2 years
Multiple resource types
– Manage anything: storage, networks, etc., etc.
Recoverable requests, timeout, etc.
Better lifetime management
Policy evaluation points for restricted proxies
Use of Web Services (WSDL, SOAP)

Karl Czajkowski, Steve Tuecke, others

Page 60: GridPrimer

The Globus Toolkit™: Information Services

The Globus Project™
Argonne National Laboratory
USC Information Sciences Institute
http://www.globus.org

Page 61: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 61

Grid Information Services

System information is critical to the operation of the grid and the construction of applications
– What resources are available?
  > Resource discovery
– What is the "state" of the grid?
  > Resource selection
– How to optimize resource use?
  > Application configuration and adaptation

We need a general information infrastructure to answer these questions

Page 62: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 62

Examples of Useful Information

Characteristics of a compute resource
– IP address, software available, system administrator, networks connected to, OS version, load

Characteristics of a network
– Bandwidth and latency, protocols, logical topology

Characteristics of the Globus infrastructure
– Hosts, resource managers

Page 63: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 63

Grid Information: Facts of Life

Information is always old
– Time of flight, changing system state
– Need to provide quality metrics
Distributed state is hard to obtain
– Complexity of a global snapshot
Components will fail
Scalability and overhead
Many different usage scenarios
– Heterogeneous policy, different information organizations, etc.

Page 64: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 64

Grid Information Service

Provides access to static and dynamic information regarding system components
A basis for configuration and adaptation in heterogeneous, dynamic environments
Requirements and characteristics
– Uniform, flexible access to information
– Scalable, efficient access to dynamic data
– Access to multiple information sources
– Decentralized maintenance

Page 65: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 65

The GIS Problem: Many Information Sources, Many Views

[Figure: many resources (R), each publishing information, are grouped into different overlapping views by virtual organisations VO A, VO B and VO C, each issuing its own queries (?).]

Page 66: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 66

What is a Virtual Organization?

• Facilitates the workflow of a group of users across multiple domains who share (some of) their resources to solve particular classes of problems

• Collates and presents information about these resources in a uniform view

Page 67: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 67

Two Classes Of Information Servers

Resource Description Services
– Supply information about a specific resource (e.g. Globus 1.1.3 GRIS)

Aggregate Directory Services
– Supply a collection of information which was gathered from multiple GRIS servers (e.g. Globus 1.1.3 GIIS)
– Customized naming and indexing

Page 68: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 68

Information Protocols

Grid Resource Registration Protocol
– Supports information/resource discovery
– Designed to support machine/network failure

Grid Resource Inquiry Protocol
– Query a resource description server for information
– Query an aggregate server for information
– LDAP v3.0 in Globus 1.1.3

Page 69: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 69

GIS Architecture

[Figure: standard Resource Description Services (R) register with customized Aggregate Directories (A) via the Registration Protocol; Users query both via the Enquiry Protocol.]

Page 70: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 70

Metacomputing Directory Service

Uses LDAP as the inquiry protocol
Accesses information in a distributed directory
– Directory represented by a collection of LDAP servers
– Each server optimized for a particular function
The directory can be updated by:
– Information providers and tools
– Applications (i.e., users)
– Backend tools which generate info on demand
Information is dynamically available to tools and applications

Page 71: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 71

Two Classes Of MDS Servers

Grid Resource Information Service (GRIS)
– Supplies information about a specific resource
– Configurable to support multiple information providers
– LDAP as inquiry protocol

Grid Index Information Service (GIIS)
– Supplies a collection of information which was gathered from multiple GRIS servers
– Supports efficient queries against information which is spread across multiple GRIS servers
– LDAP as inquiry protocol

Page 72: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 72

LDAP Details

Lightweight Directory Access Protocol
– IETF standard
– Stripped-down version of the X.500 DAP protocol
– Supports distributed storage/access (referrals)
– Supports authentication and access control

Defines:
– Network protocol for accessing directory contents
– Information model defining the form of information
– Namespace defining how information is referenced and organized

Page 73: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 73

MDS Components

LDAP 3.0 Protocol Engine
– Based on OpenLDAP with a custom backend
– Integrated caching
Information providers
– Deliver resource information to the backend
APIs for accessing & updating MDS contents
– C, Java, PERL (LDAP API, JNDI)
Various tools for manipulating MDS contents
– Command line tools, shell scripts & GUIs

Page 74: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 74

Grid Resource Information Service

A server which runs on each resource
– Given the resource DNS name, you can find the GRIS server (well-known port = 2135)
Provides resource-specific information
– Much of this information may be dynamic
  > Load, process information, storage information, etc.
  > GRIS gathers this information on demand
"White pages" lookup of resource information
– Ex: How much memory does the machine have?
"Yellow pages" lookup of resource options
– Ex: Which queues on the machine allow large jobs?

Page 75: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 75

Grid Index Information Service

GIIS describes a class of servers
– Gathers information from multiple GRIS servers
– Each GIIS is optimized for particular queries
  > Ex1: Which Alliance machines are >16 processor SGIs?
  > Ex2: Which Alliance storage servers have >100Mbps bandwidth to host X?
– Akin to web search engines

Organization GIIS
– The Globus Toolkit ships with one GIIS
– Caches GRIS info with a long update frequency
  > Useful for queries across an organization that rely on relatively static information (Ex1 above)

Can be merged into a GRIS

Page 76: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 76

Finding a GRIS and Server Registration

A GRIS or GIIS server can be configured to (de-)register itself during startup/shutdown
– Targets specified in a configuration file
Soft-state registration protocol
– Good behavior in case of failure
Allows for federations of information servers
– E.g. an Argonne GRIS can register with both Alliance and DOE GIIS servers

Page 77: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 77

Logical MDS Deployment

[Figure: site GRISes (e.g. at ISI) register with GIISes serving different communities (Grads, Gusto).]

Page 78: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 78

MDS Commands

LDAP defines a set of standard commands
– ldapsearch, etc. (see the example below)
We also define MDS-specific commands
– grid-info-search, grid-info-host-search
APIs are defined for C, Java, etc.
– C: OpenLDAP client API
  > ldap_search_s(), …
– Java: JNDI
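For instance, the generic OpenLDAP client can query a GRIS directly with the classic flags (host and filter as used elsewhere in these slides):

ldapsearch -h pitcairn.mcs.anl.gov -p 2135 -b "o=Grid" \
    "(objectclass=GlobusComputeResource)" cpuload1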

Page 79: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 79

Information Services API

RFC 1823 defines an IETF draft standard client API for accessing LDAP databases
– Connect to the server
– Pose a query, which returns data structures containing sets of object classes and attributes
– Functions to walk these data structures

Globus does not provide an LDAP API. We recommend the use of OpenLDAP, an open source implementation of RFC 1823 (see the sketch below).
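The same connect/query/walk pattern is easy to see from a scripting language. A minimal sketch using the python-ldap module (which wraps the OpenLDAP client library; the server URL and filter here are illustrative):

import ldap  # python-ldap, a wrapper around the OpenLDAP C API

con = ldap.initialize("ldap://mds.globus.org:389")        # connect to the server
results = con.search_s("o=Grid", ldap.SCOPE_SUBTREE,      # pose the query
                       "(objectclass=GlobusComputeResource)",
                       ["cpuload1", "cpuload5", "cpuload15"])
for dn, attrs in results:                                 # walk the returned structures
    print(dn, attrs)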

Page 80: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 80

Searching an LDAP Directory

grid-info-search [options] filter [attributes]

Default grid-info-search options:
  -h mds.globus.org   MDS server
  -p 389              MDS port
  -b "o=Grid"         search start point
  -T 30               LDAP query timeout
  -s sub              scope = subtree
                      (alternatives: base = lookup this entry; one = lookup immediate children)

Page 81: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 81

Searching a GRIS Server

grid-info-host-search [options] filter [attributes]

Exactly like grid-info-search, except for the defaults:
  -h localhost   GRIS server
  -p 2135        GRIS port

Example:

grid-info-host-search -h pitcairn "dn=*" dn

Page 82: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 82

Filtering

Filters allow selection of objects based on relational operators (=, ~=, <=, >=)
– grid-info-search "cputype=*"
Compound filters can be constructed with Boolean operations (&, |, !)
– grid-info-search "(&(cputype=*)(cpuload1<=1.0))"
– grid-info-search "(&(hn~=sdsc.edu)(latency<=10))"
Hints:
– white space is significant
– the quotes around the filter are required
– use -L for LDIF format

Page 83: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 83

Example: Filtering

% grid-info-host-search -L "(objectclass=GlobusSoftware)"

dn: sw=Globus, hn=pitcairn.mcs.anl.gov, dc=mcs, dc=anl, dc=gov, o=Grid
objectclass: GlobusSoftware
releasedate: 2000/04/11 19:48:29
releasemajor: 1
releaseminor: 1
releasepatch: 3
releasebeta: 11
lastupdate: Sun Apr 30 19:28:19 GMT 2000
objectname: sw=Globus, hn=pitcairn.mcs.anl.gov, dc=mcs, dc=anl, dc=gov, o=Grid

Page 84: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 84

Example: Attribute Selection

% grid-info-host-search -L "(objectclass=*)" dn hn
– Returns the distinguished name (dn) and hostname (hn) of all objects
– Objects without hn fields are still listed
– DNs are always listed

dn: sw=Globus, hn=pitcairn.mcs.anl.gov, dc=mcs, dc=anl, dc=gov, o=Grid

dn: hn=pitcairn.mcs.anl.gov, dc=mcs, dc=anl, dc=gov, o=Grid
hn: pitcairn.mcs.anl.gov

dn: service=jobmanager, hn=pitcairn.mcs.anl.gov, dc=mcs, dc=anl, dc=gov, o=Grid
hn: pitcairn.mcs.anl.gov

dn: queue=default, service=jobmanager, hn=pitcairn.mcs.anl.gov, dc=mcs, dc=anl, dc=gov, o=Grid

Page 85: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 85

Example: Discovering CPU Load

Retrieve the CPU load fields of compute resources:

% grid-info-search -L "(objectclass=GlobusComputeResource)" \
    dn cpuload1 cpuload5 cpuload15

dn: hn=lemon.mcs.anl.gov, ou=MCS, o=Argonne National Laboratory, o=Globus, c=US
cpuload1: 0.48
cpuload5: 0.20
cpuload15: 0.03

dn: hn=tuva.mcs.anl.gov, ou=MCS, o=Argonne National Laboratory, o=Globus, c=US
cpuload1: 3.11
cpuload5: 2.64
cpuload15: 2.57

Page 86: GridPrimer

e-Science NorthWest86

Problems with MDS

MDS is notoriously unstable. A single GRIS falling over can cause the GIIS(s) above it in the hierarchy to fall over!

There are problems with the default schema too.
– An alternative, the GLUE schema, was developed in the European DataTAG and US iVDGL projects
– This schema is better, but still only goes as far as describing batch queues

Some projects, e.g. the European DataGrid (EDG), simply replaced MDS with their own solution.

Page 87: GridPrimer

The Globus Toolkit™: Data Management Services

The Globus Project™
Argonne National Laboratory
USC Information Sciences Institute
http://www.globus.org

Page 88: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 88

Data Grid Problem

"Enable a geographically distributed community [of thousands] to pool their resources in order to perform sophisticated, computationally intensive analyses on Petabytes of data"

Note that this problem:
– Is common to many areas of science
– Overlaps strongly with other Grid problems

Page 89: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 89

Major Data Grid Projects

Earth System Grid (DOE Office of Science)
– DG technologies, climate applications
European Data Grid (EU)
– DG technologies & deployment in the EU
GriPhyN (NSF ITR)
– Investigation of the "Virtual Data" concept
Particle Physics Data Grid (DOE Science)
– DG applications for HENP experiments

Page 90: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 90

Data Grids for High Energy Physics

[Figure: the LHC tiered computing model. The Online System streams ~100 MBytes/sec from the detector (a "bunch crossing" every 25 nsecs; 100 "triggers" per second; each triggered event ~1 MByte in size; ~PBytes/sec raw) to the Tier 0 CERN Computer Centre and an Offline Processor Farm (~20 TIPS). Tier 1 Regional Centres (FermiLab ~4 TIPS; France, Italy, Germany) connect at ~622 Mbits/sec (or Air Freight, deprecated). Tier 2 Centres (~1 TIPS each, e.g. Caltech) feed Institutes (~0.25 TIPS) with physics data caches, down to Tier 4 physicist workstations at ~1 MBytes/sec. Physicists work on analysis "channels"; each institute will have ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server. 1 TIPS is approximately 25,000 SpecInt95 equivalents. Image courtesy Harvey Newman, Caltech.]

Page 91: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 91

Data Intensive Issues Include …

Harness [potentially large numbers of] data, storage, and network resources located in distinct administrative domains
Respect local and global policies governing what can be used for what
Schedule resources efficiently, again subject to local and global constraints
Achieve high performance, with respect to both speed and reliability
Catalog software and virtual data

Page 92: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 92

Data Intensive Computing and Grids

The term "Data Grid" is often used
– Unfortunate, as it implies a distinct infrastructure, which it isn't; but easy to say
Data-intensive computing shares numerous requirements with collaboration, instrumentation, computation, …
– Security, resource mgt, info services, etc.
Important to exploit commonalities, as it is very unlikely that multiple infrastructures can be maintained
Fortunately this seems easy to do!

Page 93: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 93

Examples of Desired Data Grid Functionality

High-speed, reliable access to remote data
Automated discovery of the "best" copy of data
Manage replication to improve performance
Co-schedule compute, storage, network
"Transparency" wrt delivered performance
Enforce access control on data
Allow representation of "global" resource allocation policies

Page 94: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 94

A Model Architecture for Data Grids

[Figure: an Application presents an Attribute Specification to the Metadata Catalog, which resolves it to a Logical Collection and Logical File Name. The Replica Catalog returns multiple locations (Replica Location 1, 2, 3, backed by e.g. a Tape Library with Disk Cache, a Disk Array, and a Disk Cache). Replica Selection picks the Selected Replica using performance information & predictions from NWS and MDS, and the transfer then runs over the GridFTP control and data channels.]

Page 95: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 95

Globus Toolkit Components

Two major Data Grid components:

1. Data Transport and Access
– Common protocol
– Secure, efficient, flexible, extensible data movement
– Family of tools supporting this protocol

2. Replica Management Architecture
– Simple scheme for managing multiple copies of files, and collections of files

Page 96: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 96

Motivation for a Common Data Access Protocol

Existing distributed data storage systems
– DPSS, HPSS: focus on high-performance access; utilize parallel data transfer, striping
– DFS: focus on high-volume usage, dataset replication, local caching
– SRB: connects heterogeneous data collections, uniform client interface, metadata queries

Problems
– Incompatible (and proprietary) protocols
  > Each requires a custom client
  > Partitions available data sets and storage devices
– Each protocol has a subset of the desired functionality

Page 97: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 97

A Common, Secure, Efficient Data Access Protocol

Common, extensible transfer protocol
– A common protocol means all can interoperate
Decouples low-level data transfer mechanisms from the storage service
Advantages:
– New, specialized storage systems are automatically compatible with existing systems
– Existing systems gain richer data transfer functionality
Interfaces to many storage systems
– HPSS, DPSS, file systems
– Plan for SRB integration

Page 98: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 98

Access/Transport Protocol Requirements

A suite of communication libraries and related tools that support
– GSI, Kerberos security
– Third-party transfers
– Parameter set/negotiate
– Partial file access
– Reliability/restart
– Large file support
– Data channel reuse
– Integrated instrumentation
– Logging/audit trail
– Parallel transfers
– Striping (cf DPSS)
– Policy-based access control
– Server-side computation
– Proxies (firewall, load balancing)

All based on a standard, widely deployed protocol

Page 99: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 99

And The Protocol Is … GridFTP

Why FTP?
– Ubiquity enables interoperation with many commodity tools
– Already supports many desired features; easily extended to support others
– Well understood and supported

We use the term GridFTP to refer to
– The transfer protocol which meets the requirements
– The family of tools which implement the protocol

Note GridFTP > FTP
Note that despite the name, GridFTP is not restricted to file transfer!

Page 100: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 100

GridFTP: Basic Approach

The FTP protocol is defined by several IETF RFCs
Start with the most commonly used subset
– Standard FTP: get/put etc., 3rd-party transfer
Implement standard but often unused features
– GSS binding, extended directory listing, simple restart
Extend in various ways, while preserving interoperability with existing servers
– Striped/parallel data channels, partial file, automatic & manual TCP buffer setting, progress monitoring, extended restart

Page 101: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 101

GridFTP Protocol Specifications

Existing standards
– RFC 959: File Transfer Protocol
– RFC 2228: FTP Security Extensions
– RFC 2389: Feature Negotiation for the File Transfer Protocol
– Draft: FTP Extensions

New drafts
– GridFTP: Protocol Extensions to FTP for the Grid
  > Grid Forum Data Working Group

Page 102: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 103

Family of Tools: Patches to Existing Code

Patches to standard FTP clients and servers
– gsi-ncftp: widely used client
– gsi-wuftpd: widely used server
– GSI-modified HPSS pftpd
– GSI-modified Unitree ftpd
Provides high-quality, production-ready FTP clients and servers
Integration with common mass storage systems
Some do not support the full GridFTP protocol

Page 103: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 104

Family of Tools: Custom Developed Libraries

Custom developed libraries
– globus_ftp_control: low-level FTP driver
  > Client & server protocol and connection management
– globus_ftp_client: simple, reliable FTP client
  > Plug-ins for restart, logging, etc.
– globus_gass_copy: simple URL-to-URL copy library, supporting (gsi-)ftp, http(s), file URLs
Implement the full GridFTP protocol
Various levels of libraries, allowing implementation of custom clients and servers
Tuned for high performance on WANs

Page 104: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 105

Family of Tools: Custom Developed Programs

Simple production client
– globus-url-copy: simple URL-to-URL copy (see the example below)
Experimental FTP servers
– Striped FTP server (a la DPSS): MPI-IO backend
– Multi-threaded FTP server with parallel channels
– Firewall FTP proxy: securely and efficiently allow transfers through firewalls
– Load-balancing FTP proxy: large data centers
Experimental FTP clients
– POSIX file interface
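For example, pulling a file from a GridFTP server to local disk (the host and paths are illustrative):

globus-url-copy gsiftp://bragg.man.ac.uk/home/jon/results.dat \
    file:///tmp/results.dat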

Page 105: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 106

globus_ftp_client Plug-ins

globus_ftp_client is a simple API/SDK:
– get, put, 3rd-party transfer, cd, mkdir, etc.
– All data is to/from memory buffers
  > Optimized to avoid any data copies
– Plug-in interface
  > Interface to one or more plug-ins: callouts for all interesting protocol events, call-ins to restart a transfer
  > Can support: monitoring performance, monitoring for failure, automatic retry (customized for various approaches)

Page 106: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 107

GridFTP at SC'2000: Long-Running Dallas-Chicago Transfer

[Figure: throughput-over-time plot of the transfer, annotated with events: a SciNet power failure, other demos starting up (congestion), parallelism increases (demos), backbone problems on the SC floor, DNS problems, and dips at transitions between files (not zero due to averaging).]

Page 107: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 108

(Prototype) Striped GridFTP Server

[Figure: a GridFTP client holds the GridFTP control channel (over a control socket) to a GridFTP server master, launched via mpirun, which drives a parallel backend of server nodes over MPI (Comm_World plus a sub-communicator). Each backend node runs a plug-in and control module doing MPI-IO against a parallel file system (e.g. PVFS, PFS), and the GridFTP data channels run from the backend nodes to the client or to another striped GridFTP server.]

Page 108: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 109

Striped GridFTP Plug-in Interface

Given a RETR or STOR request:
– Control calls the plug-in to determine which nodes should participate in the request
– Control creates an MPI sub-communicator for those nodes
– Control calls the plug-in to perform the transfer
  > Includes request info, the communicator, and a globus_ftp_control_handle_t
– The plug-in does I/O to the backend
  > MPI-IO, PVFS, Unix I/O, Raw I/O, etc.
– The plug-in uses the globus_ftp_control_data_*() functions to send/receive data on the GridFTP data channels

Page 109: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 110

Striped GridFTP Performance

At SC'00, used the first prototype:
– Transfer between Dallas and LBNL
– 8-node Linux clusters on each end
– OC-48, 2.5Gb/s link (NTON)
– Peaks over 1.5Gb/s
  > Limited by disk bandwidth on the end-points
– 5-second peaks over 1Gb/s
– Sustained 530Mb/s for 1 hr (238GB transfer)
  > Had not yet implemented large files or data channel reuse
  > A 2GB file took <20 seconds; new data channel sockets were connected for each transfer
  > This explains the difference between sustained and peak

Page 110: GridPrimer

e-Science NorthWest111

https – a competitor!

Andy McNab has experimented with using https for file transfer

The recent http standard has incorporated support for multiple streams (see the sketch below)

Result:
– https matches GridFTP on large files
– https outperforms GridFTP on small files

This took a very short time to develop!

So, what is all the fuss about?
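To make "multiple streams" concrete, here is a minimal sketch (not McNab's code) of fetching one file over HTTP(S) with several parallel byte-range requests, using Python's requests library; the URL is illustrative, and the server must support Range requests and report Content-Length:

import concurrent.futures
import requests

URL = "https://example.org/big-file.dat"   # illustrative
STREAMS = 4

size = int(requests.head(URL).headers["Content-Length"])
chunk = (size + STREAMS - 1) // STREAMS

def fetch(i):
    # Each stream asks for its own byte range of the same file
    start, end = i * chunk, min((i + 1) * chunk, size) - 1
    r = requests.get(URL, headers={"Range": f"bytes={start}-{end}"})
    return i, r.content

with concurrent.futures.ThreadPoolExecutor(STREAMS) as pool:
    parts = dict(pool.map(fetch, range(STREAMS)))

with open("big-file.dat", "wb") as f:
    for i in range(STREAMS):
        f.write(parts[i])      # reassemble the ranges in order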

Page 111: GridPrimer

e-Science NorthWest112

UNICORE

Packaged software with GUI
Open source
– http://unicore.sourceforge.net/
Designed for firewalls
Strict security model
– explicit delegation
Abstract Job Object (AJO)
– built-in workflow management
Resource Broker
– can submit to Globus grids
Has the notion of a software resource
Few APIs
– extend through plug-ins
– starting to expose service interfaces
Serves the user

http://www.unicore.org/

Page 112: GridPrimer

e-Science NorthWest113

What UNICORE is/does

Grid middleware (what?!)
– Allows a user to access many different resources (e.g. supercomputers) in the same way
– Provides single sign-on, using X509 certificates
– Hides details about batch systems (NQE, LSF, etc.)
– Provides seamlessness across a heterogeneous environment

A tool to make life simpler!

Page 113: GridPrimer

e-Science NorthWest114

Architecture

[Figure: Clients (Jon MacLaren, Kaukab Jaffri, Piotr Bala) connect to two Usites: Usite Manchester, containing Vsites turing, fermat and green, and Usite ICM, containing Vsites hydra, tajfun and tsunami.]

Page 114: GridPrimer

e-Science NorthWest115

Architecture Terminology

A Client is a user who wants to run some jobs

A Vsite (Virtual Site) is a resource upon which a job can run, e.g. a supercomputer or a workstation cluster

A Usite (UNICORE Site) is an organisational grouping of Vsites, e.g. Manchester Computing

Page 115: GridPrimer

e-Science NorthWest116

Usite in more detail

[Figure: within Usite Manchester, a single Gateway fronts the Vsites; each Vsite (turing, green) runs an NJS with its UUDB, which drives a TSI on the target system.]

Page 116: GridPrimer

e-Science NorthWest117

Usite Component Terminology

A Gateway is a lightweight Java program which controls all incoming connections to the Vsites within that Usite.

An NJS (Network Job Supervisor) is a Java program which prepares the User’s job for execution on the target resource. The NJS also identifies users using the Unicore User Database or UUDB.

The prepared commands are then executed on the target system by one of the TSI (Target System Interface) processes. The TSI is a lightweight daemon process, written in PERL.

Page 117: GridPrimer

e-Science NorthWest118

UNICORE vs. Globus

UNICORE:
1. Java and Perl
2. Packaged software with GUI
3. Not completely open-source
4. Built to handle firewalls
5. Easy to customise
6. No delegation
7. Abstract Job Object (AJO) Framework

Globus:
1. ANSI C
2. Toolkit approach
3. Open source
4. Firewalls are a problem
5. Easy to build on
6. Supports delegation
7. ???

Page 118: GridPrimer

e-Science NorthWest119

Advantages of UNICORE Architecture

The Gateway manages all incoming connections to the Usite. This makes it easy to configure UNICORE to work at a firewalled site.

Gateway and NJS written in Java, so they are portable, and distributed as jars; no compilation is required.

TSI written in PERL – even more portable, being available even on the Cray T3E and Fujitsu VPP.

Only the lightweight TSI daemon processes run on the target resource. This is especially important for expensive resources, such as supercomputers.

The UUDB is a software component that plugs into the NJS, not just a simple file, so complicated authorisation procedures, such as Community Authorisation, are easier to configure.

Page 119: GridPrimer

e-Science NorthWest120

Seamlessness

UNICORE is targeted at a heterogeneous supercomputing environment

Want to be able to run UNICORE jobs on different machines

More precisely, any machine with the required resources

We want to abstract from the typical view of a supercomputer, and hide the differences, or seams, from the user

How do we do this???

Page 120: GridPrimer

e-Science NorthWest121

The Abstract Job Object or AJO

Abstract representation of a job as a Java Object

Essentially this contains the Job Group hierarchy from the previous section

Check out the package org.unicore.ajo in the JavaDoc for the AJO, downloadable from http://www.unicore.org/

Look at the classes AbstractJob and AbstractTask – these are essentially the groups and tasks of the job

Each ExecuteTask is run as a job in the batch queue system by default
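As a rough sketch of what building such a hierarchy looks like in code (the class names come from org.unicore.ajo, but the constructors and methods shown here are illustrative only – consult the JavaDoc for the real signatures):

import org.unicore.ajo.AbstractJob;
import org.unicore.ajo.ExecuteTask;

// Illustrative sketch only: real org.unicore.ajo signatures differ.
AbstractJob job = new AbstractJob("gaussian-run");      // top-level group
ExecuteTask prepare = new ExecuteTask("prepare-input"); // each ExecuteTask becomes
ExecuteTask compute = new ExecuteTask("run-gaussian");  // a batch-queue job
job.add(prepare);                                       // add tasks to the group
job.add(compute);
job.addDependency(prepare, compute); // run compute only after prepare succeeds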

Page 121: GridPrimer

e-Science NorthWest122

What do UNICORE Jobs Look Like?

A UNICORE Job is made up of one or more tasks, e.g. Gaussian job, file transfer, etc., arranged in a hierarchical structure of groups

These groups are represented by AbstractJob objects.

All jobs have a top-level AbstractJob.

Page 122: GridPrimer

e-Science NorthWest123

Why is the hierarchy important?

Within any AbstractJob object, the ordering of execution of the members of that group, i.e. tasks and sub-AbstractJobs, can be controlled.

– In UNICORE 3.x this is done by a set of dependencies which form a Directed Acyclic Graph or DAG

– In UNICORE 4.x, loops and conditional branches were added to this scheme

Each AbstractJob can be assigned a Vsite to run on. These Vsites can be at different Usites.

(Diagram: AbstractJobs assigned to Vsite green at Usite Manchester, Vsite hydra at Usite ICM, and Vsite ZAMpano at Usite FZJ.)

Page 123: GridPrimer

e-Science NorthWest124

Consign/Endorse Model

An object is constructed containing the hierarchy. Each AbstractJob object is signed using the user's certificate, endorsing the group.

The entire object is then signed once more, just prior to consignment, and then sent to the NJS at the Vsite on which the job's top-level AbstractJob object is to run. This Vsite is known as the primary Vsite.

After consigning a job, the user contacts the primary Vsite's NJS in order to monitor, control/delete, or fetch the output of the job.

Sub-AbstractJob objects are sent from the primary Vsite's NJS to the other NJSs. Again, just prior to consignment, the NJS signs the entire object being sent. (This process is recursive.)

These secondary NJSs can still verify the user's (endorser's) signature

Intermediate NJSs can't tamper with the job

Only weak NJS-to-NJS trust relationships are required.

Page 124: GridPrimer

e-Science NorthWest125

Incarnation

An AbstractJob is incarnated at a Vsite

Incarnation is the process by which the AbstractJob is made real, i.e. concretised. The NJS does this using the information in the IDB (Incarnation DataBase)

When tasks are run at a Vsite, they are run under the user id specified by the UUDB. Such a user id is called an xlogin. We also refer to this as the incarnated user.

Each incarnated task has a corresponding outcome (see org.unicore.outcomes)

Page 125: GridPrimer

e-Science NorthWest126

Resources – org.unicore.resources

Resources in UNICORE are used in two key places:
– In the AJO objects to describe the resources required by each Task in the job; and
– By each Vsite to describe the resources available at the Vsite.

There are resources for describing:
– Hardware (incl. nodes/processors, memory, etc.)
– Operating System (incl. version)
– Software (incl. version, e.g. Gaussian98, bash, cc)
– Batch Queues (incl. priority, limits)

Page 126: GridPrimer

e-Science NorthWest127

Resource Satisfaction Model

Resources are divided into three top-level categories:
– CapacityResource: disk space, memory
– CapabilityResource: Gaussian98, bash
– InformationResource: info about the Vsite

Jobs can only execute at a Vsite if the resource requirements can be satisfied, i.e.
– All Capability resources are present
– All Capacity resources are met or exceeded

If they can't be, the consigning of the job will fail (good clients would check in advance to avoid unnecessary failures)

Jobs can still fail with resource problems, as the published resources represent what may be available, but individual users may be treated differently due to local policy.

Page 127: GridPrimer

e-Science NorthWest128

Introducing the UNICORE Client

A GUI client application, written in Java

Allows a user to construct, consign and monitor jobs simply

Client is easily extensible to add new, simple interfaces for particular applications (more on this later…)

Page 128: GridPrimer

e-Science NorthWest129

What does it look like?

(Screenshots: Job Preparation, showing the hierarchy of tasks and groups; Job Monitoring, showing Usites and Vsites; and the Vsite, Usite and dependencies for a selected group.)

Page 129: GridPrimer

e-Science NorthWest130

Extending UNICORE

UNICORE can be extended via a system of plug-ins

The plug-in has a GUI front end (JPanel) which takes information from the user in a domain-specific context

The plug-in then constructs an AJO from the user's requirements, which the user submits

Such plug-ins may only be compatible with sites with specific applications, e.g. CPMD, Gaussian

In UNICORE, application software is a resource:
– Users specify applications/versions as requirements
– Sites advertise applications as resources they can supply

Then the resource satisfaction model is followed...

Page 130: GridPrimer

e-Science NorthWest131

Hiding Resources from the User

The UNICORE client hides some resources from the user, e.g. the Gaussian plugin hides the use of the Gaussian98 software resource

However, the number of processors and the time for running each ExecuteTask are exposed – but these could be hidden in the future by including expert technology (see demo in the afternoon!)

Page 131: GridPrimer

e-Science NorthWest132

Condor: High-throughput computing

Condor converts collections of workstations and clusters into a distributed high-throughput computing facility

Emphasis on policy management and reliability

High-throughput scheduler

Supports job checkpoint and migration

– single processor jobs only

Remote system calls

Condor-G lets Condor users add Globus-enabled resources to their private view of a Condor pool ("flock")

"glide-in"

http://www.cs.wisc.edu/condor/

Page 132: GridPrimer

e-Science NorthWest133

Basic motivation

Basic idea is that you steal cycles from workstations
– Many workstations are idle for a large percentage of time, being in use for at most 40 hours a week
– Want to harness this power
– Condor targeted at Linux and UNIX
– There have been other attempts at this, e.g. Entropia (almost dead now)

Not aimed at capability jobs: High-Throughput, NOT High-Performance, Computing

This suits a lot of science, basically anything that needs a lot of cycles, but where cycles on a workstation are just fine

Particularly good for parameter-sweep ensemble jobs

– Parameters are varied over a space you want to explore, and the simulation is re-run for all possible permutations

Page 133: GridPrimer

134www.cs.wisc.edu/condor

The Condor Project (Established ‘85)

Distributed Computing research performed by a team of ~40 faculty, full-time staff and students who face software/middleware engineering challenges, are involved in national and international collaborations, interact with users in academia and industry, maintain and support a distributed production environment (more than 2300 CPUs at UW), and educate and train students.

Funding (~$4.5M annual budget) – DoE, NASA, NIH, NSF, EU, Intel, Micron, Microsoft and the UW Graduate School

Page 134: GridPrimer

135www.cs.wisc.edu/condor

“ … Since the early days of mankind the primary motivation for the establishment of communities has been the idea that by being part of an organized group the capabilities of an individual are improved. The great progress in the area of inter-computer communication led to the development of means by which stand-alone processing sub-systems can be integrated into multi-computer ‘communities’. … “

Miron Livny, "Study of Load Balancing Algorithms for Decentralized Distributed Processing Systems", Ph.D. thesis, July 1983.

Page 135: GridPrimer

136www.cs.wisc.edu/condor

Claims for “benefits” provided by Distributed Processing Systems

High Availability and Reliability
High System Performance
Ease of Modular and Incremental Growth
Automatic Load and Resource Sharing
Good Response to Temporary Overloads
Easy Expansion in Capacity and/or Function

"What is a Distributed Data Processing System?", P.H. Enslow, Computer, January 1978

Page 136: GridPrimer

138www.cs.wisc.edu/condor

The Layers of Condor

(Diagram: on the submit (client) side, the Application talks to an Application Agent and to the Customer Agent (schedd); the Matchmaker matches these against the execute (service) side, where the Owner Agent (startd) and a Remote Execution Agent sit above the Local Resource Manager and the Resource itself.)

Page 137: GridPrimer

139www.cs.wisc.edu/condor

What is ClassAd Matchmaking?

› Condor uses ClassAd Matchmaking to make sure that work gets done within the constraints of both users and owners.

› Users (jobs) have constraints: “I need an Alpha with 256 MB RAM”

› Owners (machines) have constraints: "Only run jobs when I am away from my desk and never run jobs owned by Bob."
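In ClassAd syntax, the job side of such a match might look like this (a sketch: Arch, Memory and KFlops are standard machine-ad attributes, but the values here are made up; the owner side is sketched under Policy Review below):

# In the job's submit description file
Requirements = (Arch == "ALPHA") && (Memory >= 256)
Rank         = KFlops   # prefer faster machines among those that match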

Page 138: GridPrimer

e-Science NorthWest140

Basic Job Submission

Command-line submission interface is just like a batch job system:
– condor_submit – submit a new job to Condor
– condor_q – examine the state of the queue
– condor_rm – delete a job from the queue
– condor_hold – prevents a job from running (or kills it – it will be restarted later)
– condor_release – allows a job to run (or be restarted if killed)
– condor_prio – manipulates job priorities

This functionality is always available. But if you are running your own code, you can do more.
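As a concrete example of the basic interface, a minimal submit description file looks like this (program and file names are illustrative):

# myjob.sub
universe   = vanilla
executable = myjob
arguments  = input.dat
output     = myjob.out
error      = myjob.err
log        = myjob.log
queue

% condor_submit myjob.sub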

Page 139: GridPrimer

141http://www.cs.wisc.edu/condor

(Diagram: your workstation runs a personal Condor with 600 Condor jobs, alongside your local Condor pool and a friendly Condor pool.)

› You can configure your Condor pool to "flock" to these pools

› You still need permission to access their resources

Page 140: GridPrimer

142http://www.cs.wisc.edu/condor

How Flocking Works

› Add a line to your condor_config:

FLOCK_HOSTS = Pool-Foo, Pool-Bar

(Diagram: the submit machine's schedd talks to the collector and negotiator of its own Central Manager (CONDOR_HOST) and, when flocking, to the collector/negotiator pairs of the Pool-Foo and Pool-Bar central managers.)

Page 141: GridPrimer

143http://www.cs.wisc.edu/condor

Condor Flocking

› Remote pools are contacted in the order specified until jobs are satisfied

› The list of remote pools is a property of the Schedd, not the Central Manager, so different users can flock to different pools, and remote pools can allow specific users

› The user-priority system is "flocking-aware": a pool's local users can have priority over remote users "flocking" in.

Page 142: GridPrimer

144http://www.cs.wisc.edu/condor

Policy Review

› Users submitting jobs can specify Requirements and Rank expressions

› Administrators can specify Startd Policy expressions individually for each machine (Start, Suspend, etc.); a sketch follows below

› Expressions can use any job or machine ClassAd attribute

› Custom attributes easily added

› Bottom line: enforce almost any policy!
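For example, an administrator might protect an owner's desktop with startd expressions like these (a sketch using standard machine attributes; the thresholds are illustrative):

START    = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)
SUSPEND  = (KeyboardIdle < 60)
CONTINUE = (KeyboardIdle > 15 * 60)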

Page 143: GridPrimer

e-Science NorthWest145

Condor Universes

VANILLA Universe
– Basic functionality only
– Will run existing executables

STANDARD Universe
– Jobs must be re-linked
– Can checkpoint and restart jobs (on other resources)

SCHEDULER Universe
– Plug-in meta-scheduler – DAGMan

GLOBUS Universe
– Condor-G (more later)

Page 144: GridPrimer

146http://www.cs.wisc.edu/condor

Relinking Your Job for submission to the Standard Universe

To do this, just place "condor_compile" in front of the command you normally use to link your job:

condor_compile gcc -o myjob myjob.c

OR

condor_compile f77 -o myjob filea.f fileb.f

OR

condor_compile make –f MyMakefile

Page 145: GridPrimer

147http://www.cs.wisc.edu/condor

Process Checkpointing

› Condor's Process Checkpointing mechanism saves all the state of a process into a checkpoint file: memory, CPU, I/O, etc.

› The process can then be restarted from right where it left off

› Typically no changes to your job’s source code needed – however, your job must be relinked with Condor’s Standard Universe support library

Page 146: GridPrimer

148http://www.cs.wisc.edu/condor

Limitations in the Standard Universe

› Condor's checkpointing is not at the kernel level. Thus in the Standard Universe the job may not:
– fork()
– use kernel threads
– use some forms of IPC, such as pipes and shared memory

› Many typical scientific jobs are OK

Page 147: GridPrimer

149http://www.cs.wisc.edu/condor

When will Condor checkpoint your job?

› Periodically, if desired (for fault tolerance)

› To free the machine to do a higher-priority task (a higher-priority job, or a job from a user with higher priority) – preemptive-resume scheduling

› When you explicitly run the condor_checkpoint, condor_vacate, condor_off or condor_restart command
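For example (the hostname is illustrative):

% condor_checkpoint                  # checkpoint jobs on the local machine
% condor_vacate node01.example.org   # checkpoint and evict jobs from a machine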

Page 148: GridPrimer

150http://www.cs.wisc.edu/condor

Want other Scheduling possibilities?

Extend with the Scheduler Universe

› In addition to VANILLA, another job universe is the Scheduler Universe.

› Scheduler Universe jobs run on the submitting machine and serve as a meta-scheduler.

› DAGMan meta-scheduler included

Page 149: GridPrimer

151http://www.cs.wisc.edu/condor

DAGMan

› Directed Acyclic Graph Manager

› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you.

› (e.g., "Don't run job B until job A has completed successfully.")

Page 150: GridPrimer

152http://www.cs.wisc.edu/condor

What is a DAG?

› A DAG is the data structure used by DAGMan to represent these dependencies.

› Each job is a “node” in the DAG.

› Each node can have any number of "parent" or "child" nodes – as long as there are no loops!

(Diagram: diamond DAG – Job A is the parent of Jobs B and C, which are both parents of Job D.)

Page 151: GridPrimer

153http://www.cs.wisc.edu/condor

Defining a DAG

› A DAG is defined by a .dag file, listing each of its nodes and their dependencies:

# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D

› each node will run the Condor job specified by its accompanying Condor submit file


Page 152: GridPrimer

154http://www.cs.wisc.edu/condor

Submitting a DAG

› To start your DAG, just run condor_submit_dag with your .dag file, and Condor will start a personal DAGMan daemon to begin running your jobs:

% condor_submit_dag diamond.dag

› condor_submit_dag submits a Scheduler Universe Job with DAGMan as the executable.

› Thus the DAGMan daemon itself runs as a Condor job, so you don’t have to baby-sit it.

Page 153: GridPrimer

155http://www.cs.wisc.edu/condor

DAGMan

Running a DAG

› DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies.


Page 154: GridPrimer

156http://www.cs.wisc.edu/condor

DAGMan

Running a DAG (cont’d)

› DAGMan holds & submits jobs to the Condor queue at the appropriate times.


Page 155: GridPrimer

157http://www.cs.wisc.edu/condor

DAGMan

Running a DAG (cont’d)

› In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG.


Page 156: GridPrimer

158http://www.cs.wisc.edu/condor

DAGMan

Recovering a DAG

› Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG.


Page 157: GridPrimer

159http://www.cs.wisc.edu/condor

DAGMan

Recovering a DAG (cont’d)

› Once that job completes, DAGMan will continue the DAG as if the failure never happened.


Page 158: GridPrimer

160http://www.cs.wisc.edu/condor

DAGMan

Finishing a DAG

› Once the DAG is complete, the DAGMan job itself is finished, and exits.


Page 159: GridPrimer

Ewa Deelman Information Sciences Institute

Pegasus Overview

Pegasus maps complex workflows onto the Grid

Uses Grid information services to find resources, data and executables

Reduces the workflow based on existing intermediate products

Used in many applications

Part of GriPhyN's Virtual Data Toolkit

Page 160: GridPrimer

Ewa Deelman Information Sciences Institute

Grid Applications

Increasing in level of complexity:
– Use of individual application components
– Reuse of individual intermediate data products (files)
– Description of data products using metadata attributes

Execution environment is complex and very dynamic:
– Resources come and go
– Data is replicated
– Components can be found at various locations or staged in on demand

Separation between the application description and the actual execution description

Page 161: GridPrimer

Ewa Deelman Information Sciences Institute

(Diagram: the application development and execution process. An application-domain request – an FFT of filea – becomes an abstract workflow (FFT on file1), then a concrete workflow (transfer filea from host1://home/filea to host2://home/file1, then run /usr/local/bin/fft on /home/file1), which finally executes in the environment on host1 and host2. The refinement steps are application component selection, transformation instance selection, resource selection, data replica selection and data transfer; failure recovery methods – retry, pick different resources, specify a different workflow – feed back into abstract and concrete workflow generation.)

Page 162: GridPrimer

Ewa Deelman Information Sciences Institute

Why Automate Workflow Generation?

Usability: limit the user's necessary Grid knowledge
– Monitoring and Directory Service
– Replica Location Service

Complexity: the user needs to make choices
– Alternative application components, files and locations
– The user may reach a dead end
– Many different interdependencies may occur among components

Solution cost: evaluate the alternative solution costs
– Performance
– Reliability
– Resource usage

Global cost: minimizing cost within a community or a virtual organization requires reasoning about individual users' choices in light of other users' choices

Page 163: GridPrimer

Ewa Deelman Information Sciences Institute

GriPhyN's Executable Workflow Construction

Build an abstract workflow based on VDL descriptions (Chimera)

Build an executable workflow based on the abstract workflows (Pegasus)

Execute the workflow (Condor’s DAGMan)

(Diagram: VDL → Chimera → abstract workflow → Pegasus, which consults RLS, TC and MDS → concrete workflow → DAGMan → jobs.)

Page 164: GridPrimer

Ewa Deelman Information Sciences Institute

VDL and Abstract Workflow

(Diagram: VDL descriptions of two derivations, d1 and d2. A user request for data file "c" yields the abstract workflow a → d1 → b → d2 → c.)

Page 165: GridPrimer

Ewa Deelman Information Sciences Institute

Condor's DAGMan

Developed at UW-Madison (Livny)

Executes a concrete workflow

Makes sure the dependencies are followed

Executes the jobs specified in the workflow:
– Execution
– Data movement
– Catalog updates

Provides a “rescue DAG” in case of failure

Page 166: GridPrimer

Ewa Deelman Information Sciences Institute

Pegasus: Planning for Execution in Grids

Maps from abstract to concrete workflow
– Algorithmic and AI-based techniques

Automatically locates physical locations for both components (transformations) and data

Finds appropriate resources to execute

Reuses existing data products where applicable

Publishes newly derived data products
– Chimera virtual data catalog
– Provides provenance information

Page 167: GridPrimer

Ewa Deelman Information Sciences Institute

Information Components Used by Pegasus

Globus Monitoring and Discovery Service (MDS)
– Locates available resources
– Finds resource properties (dynamic: load, queue length; static: location of gridftp server, RLS, etc.)

Globus Replica Location Service (RLS)
– Locates data that may be replicated
– Registers new data products

Transformation Catalog (TC)
– Locates installed executables

Page 168: GridPrimer

Ewa Deelman Information Sciences Institute

Example Workflow Reduction

Original abstract workflow: a → d1 → b → d2 → c

If "b" already exists (as determined by a query to the RLS), the workflow can be reduced to b → d2 → c

Page 169: GridPrimer

Ewa Deelman Information Sciences Institute

Mapping from abstract to concrete

Query RLS, MDS, and TC, schedule computation and data movement

(Diagram: for the reduced workflow b → d2 → c – execute d2 at B; move b from A to B; move c from B to U; register c in the RLS.)

Page 170: GridPrimer

Ewa Deelman Information Sciences Institute

Montage

Montage (NASA and NVO)
– Deliver science-grade custom mosaics on demand

Produce mosaics from a wide range of data sources (possibly in different spectra)

User-specified parameters of projection, coordinates, size, rotation and spatial sampling.

(Image: mosaic created by Pegasus-based Montage from a run on the M101 galaxy images on the TeraGrid.)

Page 171: GridPrimer

Ewa Deelman Information Sciences Institute

Small Montage Workflow

~1200 nodes

Page 172: GridPrimer

Ewa Deelman Information Sciences Institute

Applications Using Chimera, Pegasus and DAGMan

GriPhyN applications:
– High-energy physics: Atlas, CMS (many)
– Astronomy: SDSS (Fermi Lab, ANL)
– Gravitational-wave physics: LIGO (Caltech, AEI)

Astronomy: Galaxy Morphology (NCSA, JHU, Fermi, many others, NVO-funded)

Biology: BLAST (ANL, PDQ-funded)

Neuroscience: Tomography for Telescience (SDSC, NIH-funded)

Page 173: GridPrimer

Ewa Deelman Information Sciences Institute

Current System

(Diagram: the current system – Pegasus maps the original abstract workflow to a concrete workflow, which DAGMan then executes.)

Page 174: GridPrimer

Ewa Deelman Information Sciences Institute

Workflow Refinement and Execution

(Diagram: workflow refinement and execution over time, descending through levels of abstraction – from the user's request and application-level knowledge, through the relevant components and the full abstract workflow of logical tasks, down to tasks bound to resources and sent for execution. A task matchmaker uses policy info to bind tasks, and workflow repair handles the partially executed, not-yet-executed and executed parts of the workflow.)

Page 175: GridPrimer

Ewa Deelman Information Sciences Institute

Incremental Refinement

Partition the abstract workflow into partial workflows

(Diagram: a particular partitioning of a new abstract workflow into partial workflows PW A, PW B and PW C.)

Page 176: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 178

What Is Condor-G?

Enhanced version of Condor that uses Globus Toolkit™ to manage Grid jobs

Two Parts:
– Globus Universe

– GlideIn

Excellent example of applying the general-purpose Globus Toolkit to solve a particular problem (i.e. high-throughput computing) on the Grid

Page 177: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 179

Condor

High-throughput scheduler

Non-dedicated resources

Job checkpoint and migration

Remote system calls

Page 178: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 180

Globus Toolkit

Grid infrastructure software

Tools that simplify working across multiple institutions:

– Authentication (GSI)

– Scheduling (GRAM, DUROC)

– File transfer (GASS, GridFTP)

– Resource description (GRIS/GIIS)

Page 179: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 181

Why Use Condor-G

Condor
– Designed to run jobs within a single administrative domain

Globus Toolkit
– Designed to run jobs across many administrative domains

Condor-G
– Combines the strengths of both

Page 180: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 182

Globus Universe

Advantages of using Condor-G to manage your Grid jobs:

– Full-featured queuing service

– Credential Management

– Fault-tolerance
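A Globus universe submit file of this era looked roughly like this (the gatekeeper contact string and file names are illustrative):

universe        = globus
globusscheduler = gatekeeper.example.org/jobmanager-pbs
executable      = myjob
output          = myjob.out
log             = myjob.log
queue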

Page 181: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 183

Full-Featured Queue

Persistent queue

Many queue-manipulation tools

Set up job dependencies (DAGMan)

E-mail notification of events

Log files

Page 182: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 184

Credential Management

Authentication in Globus Toolkit is done with limited-lifetime X509 proxies

Proxy may expire before jobs finish executing

Condor-G can put jobs on hold and e-mail the user to refresh the proxy

Condor-G can forward new proxy to execution sites
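Refreshing the proxy is then a single command on the submit machine (a typical GT2-era invocation; the lifetime flag is from memory):

% grid-proxy-init -hours 12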

Page 183: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 185

Fault Tolerance

Local Crash
– Queue state stored on disk
– Reconnect to execute machines

Network Failure

– Wait until connectivity returns

– Reconnect to execute machines

Page 184: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 186

Fault Tolerance

Remote Crash – job still in queue
– Job state stored on disk
– Start a new jobmanager to monitor the job

Remote Crash – job lost

– Resubmit job

Page 185: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 187

Globus Universe

Disadvantages
– No matchmaking or dynamic scheduling of jobs

– No job checkpoint or migration

– No remote system calls

Page 186: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 188

Solution: GlideIn

Use the Globus Universe to run the Condor daemons on Grid resources

When the resources run these GlideIn jobs, they will join your personal Condor pool

Submit your jobs as Condor jobs and they will be matched and run on the Grid resources
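Starting glide-ins was done with a helper command, roughly like this (invocation sketched from memory – check the Condor manual for exact options; the contact string is illustrative):

% condor_glidein gatekeeper.example.org/jobmanager-pbs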

Page 187: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 189

Who Uses Condor-G?

MetaNEOS (NUG-30)

NCSA (GridGaussian)

INFN (DataGrid)

CMS

Page 188: GridPrimer

Intro to Grid Computing and Globus Toolkit™ 190

Mathematicians Solve NUG30

Looking for the solution to the NUG30 quadratic assignment problem

An informal collaboration of mathematicians and computer scientists

Condor-G delivered 3.46E8 CPU seconds in 7 days (peak 1009 processors) in U.S. and Italy (8 sites)

Solution: 14,5,28,24,1,3,16,15,10,9,21,2,4,29,25,22,13,26,17,30,6,20,19,8,18,7,27,12,11,23

MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin

Page 189: GridPrimer

SUN HPC Consortium, Heidelberg 2004 191

GridLab: Grid Application Toolkit and Testbed

Gabrielle Allen, Ed Seidel and the GridLab Team

Center for Computation & Technology, LSU

Albert Einstein Institute, Germany

Page 190: GridPrimer

192

GridLab Project

EU funded by the 5th Framework (January 2002): PSNC, AEI, ZIB, MASARYK, VU, SZTAKI, ISUFI, Cardiff, NTUA, Chicago, ISI, Wisconsin, Sun, Compaq

12 Work Packages covering:
– Grid Portals
– Mobile Users
– Different Grid Services
– Applications: Numerical Relativity (Cactus), Gravitational Waves (Triana)

Europe-wide Test Bed

Grid Application Toolkit (GAT)

Page 191: GridPrimer

193


Grid Application Toolkit (GAT)

Developed through the EU GridLab project

Application-oriented access to Grid capabilities through a standard API:

GATFile_Move(from, to, [details])
GATResource_FindResource([details])
GAT_LogicalFile(file, name, [details])

Independent of Grid infrastructure and available services

C, C++, Java, [Python], [Perl], [Fortran]

API driver for GGF SAGA-RG

Page 192: GridPrimer

194

GAT Motivation

Grids and Grid middleware are everywhere

Grid applications are lagging behind: there is a big jump from prototypes and demonstrations to real production use of Grids

Problems:
– Missing or immature grid services
– Changing environment
– Different and evolving interfaces to the "grid"
– Interfaces are not simple for scientific application developers

Application developers accept the Grid computing paradigm only slowly

Page 193: GridPrimer

195

Copy a File: GASS

int RemoteFile::GetFile (char const* source,
                         char const* target) {
  globus_url_t                       source_url;
  globus_io_handle_t                 dest_io_handle;
  globus_ftp_client_operationattr_t  source_ftp_attr;
  globus_result_t                    result;
  globus_gass_transfer_requestattr_t source_gass_attr;
  globus_gass_copy_attr_t            source_gass_copy_attr;
  globus_gass_copy_handle_t          gass_copy_handle;
  globus_gass_copy_handleattr_t      gass_copy_handleattr;
  globus_ftp_client_handleattr_t     ftp_handleattr;
  globus_io_attr_t                   io_attr;
  int                                output_file = -1;

  if ( globus_url_parse (source_URL, &source_url) != GLOBUS_SUCCESS ) {
    printf ("can not parse source_URL \"%s\"\n", source_URL);
    return (-1);
  }

  if ( source_url.scheme_type != GLOBUS_URL_SCHEME_GSIFTP &&
       source_url.scheme_type != GLOBUS_URL_SCHEME_FTP    &&
       source_url.scheme_type != GLOBUS_URL_SCHEME_HTTP   &&
       source_url.scheme_type != GLOBUS_URL_SCHEME_HTTPS ) {
    printf ("can not copy from %s - wrong prot\n", source_URL);
    return (-1);
  }

  globus_gass_copy_handleattr_init (&gass_copy_handleattr);
  globus_gass_copy_attr_init (&source_gass_copy_attr);
  globus_ftp_client_handleattr_init (&ftp_handleattr);
  globus_io_fileattr_init (&io_attr);

  globus_gass_copy_attr_set_io (&source_gass_copy_attr, &io_attr);
  globus_gass_copy_handleattr_set_ftp_attr (&gass_copy_handleattr,
                                            &ftp_handleattr);
  globus_gass_copy_handle_init (&gass_copy_handle, &gass_copy_handleattr);

  if (source_url.scheme_type == GLOBUS_URL_SCHEME_GSIFTP ||
      source_url.scheme_type == GLOBUS_URL_SCHEME_FTP ) {
    globus_ftp_client_operationattr_init (&source_ftp_attr);
    globus_gass_copy_attr_set_ftp (&source_gass_copy_attr,
                                   &source_ftp_attr);
  }
  else {
    globus_gass_transfer_requestattr_init (&source_gass_attr,
                                           source_url.scheme);
    globus_gass_copy_attr_set_gass (&source_gass_copy_attr,
                                    &source_gass_attr);
  }

  output_file = globus_libc_open ((char*) target,
                                  O_WRONLY | O_TRUNC | O_CREAT,
                                  S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP);
  if ( output_file == -1 ) {
    printf ("could not open the file \"%s\"\n", target);
    return (-1);
  }

  /* convert stdout to be a globus_io_handle */
  if ( globus_io_file_posix_convert (output_file, 0, &dest_io_handle)
       != GLOBUS_SUCCESS) {
    printf ("Error converting the file handle\n");
    return (-1);
  }

  result = globus_gass_copy_register_url_to_handle (
      &gass_copy_handle, (char*)source_URL,
      &source_gass_copy_attr, &dest_io_handle,
      my_callback, NULL);

  if ( result != GLOBUS_SUCCESS ) {
    printf ("error: %s\n", globus_object_printable_to_string
            (globus_error_get (result)));
    return (-1);
  }

  globus_url_destroy (&source_url);
  return (0);
}

Page 194: GridPrimer

196

Copy a File: CoG/RFT

package org.globus.ogsa.gui;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.net.URL;
import java.util.Date;
import java.util.Vector;
import javax.xml.rpc.Stub;
import org.apache.axis.message.MessageElement;
import org.apache.axis.utils.XMLUtils;
import org.globus.*;
import org.gridforum.ogsi.*;
import org.gridforum.ogsi.holders.TerminationTimeTypeHolder;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class RFTClient {

  public static void copy (String source_url, String target_url) {
    try {
      File requestFile = new File (source_url);
      BufferedReader reader = null;
      try {
        reader = new BufferedReader (new FileReader (requestFile));
      } catch (java.io.FileNotFoundException fnfe) { }

      Vector requestData = new Vector ();
      requestData.add (target_url);

      TransferType[] transfers1 = new TransferType[transferCount];
      RFTOptionsType multirftOptions = new RFTOptionsType ();
      multirftOptions.setBinary (Boolean.valueOf (
          (String)requestData.elementAt (0)).booleanValue ());
      multirftOptions.setBlockSize (Integer.valueOf (
          (String)requestData.elementAt (1)).intValue ());
      multirftOptions.setTcpBufferSize (Integer.valueOf (
          (String)requestData.elementAt (2)).intValue ());
      multirftOptions.setNotpt (Boolean.valueOf (
          (String)requestData.elementAt (3)).booleanValue ());
      multirftOptions.setParallelStreams (Integer.valueOf (
          (String)requestData.elementAt (4)).intValue ());
      multirftOptions.setDcau (Boolean.valueOf (
          (String)requestData.elementAt (5)).booleanValue ());

      int i = 7;
      for (int j = 0; j < transfers1.length; j++) {
        transfers1[j] = new TransferType ();
        transfers1[j].setTransferId (j);
        transfers1[j].setSourceUrl ((String)requestData.elementAt (i++));
        transfers1[j].setDestinationUrl ((String)requestData.elementAt (i++));
        transfers1[j].setRftOptions (multirftOptions);
      }

      TransferRequestType transferRequest = new TransferRequestType ();
      transferRequest.setTransferArray (transfers1);

      int concurrency = Integer.valueOf (
          (String)requestData.elementAt (6)).intValue ();
      if (concurrency > transfers1.length) {
        System.out.println ("Concurrency should be less than the number "
            + "of transfers in the request");
        System.exit (0);
      }
      transferRequest.setConcurrency (concurrency);

      TransferRequestElement requestElement = new TransferRequestElement ();
      requestElement.setTransferRequest (transferRequest);
      ExtensibilityType extension = new ExtensibilityType ();
      extension = AnyHelper.getExtensibility (requestElement);

      OGSIServiceGridLocator factoryService = new OGSIServiceGridLocator ();
      Factory factory = factoryService.getFactoryPort (new URL (source_url));
      GridServiceFactory gridFactory = new GridServiceFactory (factory);
      LocatorType locator = gridFactory.createService (extension);
      System.out.println ("Created an instance of Multi-RFT");

      MultiFileRFTDefinitionServiceGridLocator loc
          = new MultiFileRFTDefinitionServiceGridLocator ();
      RFTPortType rftPort = loc.getMultiFileRFTDefinitionPort (locator);
      ((Stub)rftPort)._setProperty (Constants.AUTHORIZATION,
          NoAuthorization.getInstance ());
      ((Stub)rftPort)._setProperty (GSIConstants.GSI_MODE,
          GSIConstants.GSI_MODE_FULL_DELEG);
      ((Stub)rftPort)._setProperty (Constants.GSI_SEC_CONV,
          Constants.SIGNATURE);
      ((Stub)rftPort)._setProperty (Constants.GRIM_POLICY_HANDLER,
          new IgnoreProxyPolicyHandler ());

      int requestid = rftPort.start ();
      System.out.println ("Request id: " + requestid);
    }
    catch (Exception e) {
      System.err.println (MessageUtils.toString (e));
    }
  }
}

Page 195: GridPrimer

197

Copy a File: GAT/C

#include <GAT.h>

GATResult RemoteFile_GetFile (GATContext context,
                              char const* source_url,
                              char const* target_url)
{
  GATStatus   status = 0;
  GATLocation source = GATLocation_Create (source_url);
  GATLocation target = GATLocation_Create (target_url);
  GATFile     file   = GATFile_Create (context, source, 0);

  if (source == 0 || target == 0 || file == 0) {
    return GAT_MEMORYFAILURE;
  }

  if ( GATFile_Copy (file, target, GATFileMode_Overwrite) != GAT_SUCCESS ) {
    GATContext_GetCurrentStatus (context, &status);
    return GATStatus_GetStatusCode (status);
  }

  GATFile_Destroy (&file);
  GATLocation_Destroy (&target);
  GATLocation_Destroy (&source);

  return GATStatus_GetStatusCode (status);
}

Page 196: GridPrimer

198

Copy a File: GAT/C++

#include <GAT++.hpp>

GAT::Result RemoteFile::GetFile (GAT::Context context,
                                 std::string source_url,
                                 std::string target_url)
{
  try {
    GAT::File file (context, source_url);
    file.Copy (target_url);
  }
  catch (GAT::Exception const &e) {
    std::cerr << "Some error: " << e.what() << std::endl;
    return e.Result();
  }
  return GAT_SUCCESS;
}

Page 197: GridPrimer

199

GAT Solution

GAT API layer between applications and the grid infrastructure:

Higher level than existing grid APIs: hides complexity, abstracts grid functionality through application-oriented APIs

Insulates against:
– Rapid evolution of grid infrastructure
– State of Grid deployment

Choose between different grid infrastructures

Application developers use and develop for the grid independently of the state of deployment of the grid infrastructure

Service developers can make their software available to many different applications.

Page 198: GridPrimer

200

Grid Application Toolkit

Standard API and Toolkit for developing portable Grid applications independently of the underlying Grid infrastructure and available services

Implements the GAT-API
– Used by applications (different languages)

GAT Adaptors
– Connect to capabilities/services
– Implement a well-defined CPI (mirrors the GAT-API)
– Interchangeable adaptors can be loaded/switched at runtime

GAT Engine
– Provides the function bindings for the GAT-API

Page 199: GridPrimer

201

GAT API Scope

Files

Resources

Events

Information exchange

Utility classes (error handling, security, preferences)

Not more! Keep it simple!

Provide simple functionality which allows us to focus on applications and science.

Page 200: GridPrimer

202

Implementation

C version fully implemented

C++ version 80% complete

Java version started

Python, Perl, Fortran to follow

Focus: portability, lightness, flexibility, adaptivity

Page 201: GridPrimer

e-Science NorthWest203

So, now you know what pre-XML Grids are about

Thanks to others for the slides:
– Mike Jones
– The Globus Alliance
– The Condor team
– Ewa Deelman, ISI

So, any questions?