Condor Project
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor
Grids and Condor
Barcelona, 2006
Agenda
Extended user’s tutorial
Advanced Uses of Condor
  Java programs
  DAGMan
  Stork
  MW
  Grid Computing
Case studies, and a discussion of your application’s needs
Resources
There are many resources (machines) in the world, and many are or can be made available!
Groups of machines may be labeled as grids
Welcome to the power of the grid!
Condor and Grids
Condor has always been a tool to harness grid computing
Condor’s mechanisms have evolved as technologies have evolved. Roughly categorized:
  Flocking
  Glidein
  The grid universe
Flocking
• A way for jobs to run within a different, separate Condor pool
• Condor runs here, and Condor runs there
[Figure: diagram with the labels "here" and "there"]
Connect Condor Pools with Flocking
Flocking is a Condor-specific technology
Flocking is enabled with configuration
Jobs flock from here to there when they cannot be run here due to lack of available machines
Configuration
Configuration files contain lots of the administrative information used by Condor
Format is like that in submit description files:
AttributeName = Value
Configuration here
For jobs to be able to flock from here to there
In the configuration file on the pool where jobs flock from:
FLOCK_TO = <central manager machine name>
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
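The $(NAME) references in these settings are macro substitutions: each one is replaced by the value of the named setting. The following is a minimal Python sketch of that idea, not Condor's actual parser. The host names cm.here.example.org and cm.there.example.org are made up for illustration, and treating an undefined macro as an empty string is an assumption of the sketch.

```python
import re

# Matches a macro reference such as $(FLOCK_TO)
MACRO = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")

def parse_config(text):
    """Parse 'AttributeName = Value' lines into a dict; skip blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        settings[name.strip().upper()] = value.strip()  # names compared case-insensitively
    return settings

def expand(name, settings, _seen=()):
    """Return the value of `name` with every $(OTHER) reference substituted."""
    key = name.upper()
    if key in _seen:
        raise ValueError(f"circular macro reference: {name}")
    value = settings.get(key, "")  # assumption: undefined macros expand to nothing
    return MACRO.sub(lambda m: expand(m.group(1), settings, _seen + (key,)), value)

# Hypothetical host names, used only for illustration
example = """
COLLECTOR_HOST = cm.here.example.org
FLOCK_TO = cm.there.example.org
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
"""
settings = parse_config(example)
print(expand("HOSTALLOW_NEGOTIATOR_SCHEDD", settings))
# prints: cm.here.example.org, cm.there.example.org
```

The expansion shows why the last setting grants negotiator access to both the local central manager and the one the jobs flock to.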
Configuration there
In the configuration file on the pool where jobs flock to:
FLOCK_FROM = <submit machine name>, . . . , <submit machine name>
To make security work:
HOSTALLOW_WRITE_COLLECTOR = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_WRITE_STARTD = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_READ_COLLECTOR = $(HOSTALLOW_READ), $(FLOCK_FROM)
HOSTALLOW_READ_STARTD = $(HOSTALLOW_READ), $(FLOCK_FROM)
Submit Description File
Enable file transfer:
universe = vanilla
executable = myjob.exe
input = myjob.input
output = myjob.output
log = myjob.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue
The Glidein Concept
Assume: We need more machines, and we have permission to use a set of machines
Glidein temporarily adds a set of machines to the local pool
Glidein
In addition, Glidein solves the problem: “My job needs to run on that particular resource, and my job needs Condor.”
For example: a job that must run under the standard universe
Glidein
Condor sends and runs its own executables on the resource
The needed resource appears to temporarily join the local Condor pool!
Glidein
Run condor_glidein to add the remote resource to the local pool
[Figure: the local pool and the remote resource]
The master and startd daemons become grid universe jobs using gt2
Making Glidein Work
Change the configuration to give access permission (HOSTALLOW_WRITE) to the remote resource
No changes to jobs’ submit description files!
But do enable file transfer in the submit description file:
universe = vanilla
executable = myjob.exe
input = myjob.input
output = myjob.output
log = myjob.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue
Force Job to Glidein Resource
In the submit description file:
universe = standard
executable = ajob.exe
input = ajob.input
output = ajob.output
log = ajob.log
requirements = \
  ( machine == "example.mcs.anl.gov" ) \
  && Arch != "" && OpSys != ""
queue
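The requirements expression above is a boolean test applied to each candidate machine. The following Python sketch mirrors it for machine descriptions held in plain dictionaries; it is an illustration of the logic, not how Condor evaluates expressions. The second machine name and the Arch and OpSys values are made up, and the case-insensitive string comparison is an assumption of the sketch.

```python
def glidein_requirements(machine_ad):
    """Mirror of the requirements expression for one machine description."""
    machine = machine_ad.get("Machine", "")
    return (
        machine.lower() == "example.mcs.anl.gov"  # assumed: == ignores case
        and machine_ad.get("Arch", "") != ""      # any advertised architecture
        and machine_ad.get("OpSys", "") != ""     # any advertised operating system
    )

# Hypothetical machine descriptions, used only for illustration
machines = [
    {"Machine": "example.mcs.anl.gov", "Arch": "INTEL", "OpSys": "LINUX"},
    {"Machine": "node1.local.example.org", "Arch": "INTEL", "OpSys": "LINUX"},
]
eligible = [m["Machine"] for m in machines if glidein_requirements(m)]
print(eligible)  # prints: ['example.mcs.anl.gov']
```

Only the glidein resource passes the test, so the job cannot start on any other machine in the pool.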
The Grid Universe
Most useful when
1. We want to send a job off to a far away machine
2. We want to hand a job to another batch processing system on the local machine
3. We want to send a job off to a far away machine, in order to hand that job to another batch processing system on that machine
The Grid Universe
All handled in the submit description file
Supports several back end types:
  Globus: GT2, GT3, GT4
  NorduGrid
  UNICORE
  Condor
  PBS
  LSF
Condor-G
Condor-G describes jobs to be handed off to a machine that is running Globus middleware
  gt2: Globus Toolkit 1 or 2 or the pre-web services GRAM
  gt3: Globus Toolkit 3
  gt4: Globus Toolkit 4 or WS GRAM
Submit Description File
For gt2:
universe = grid
input = job1.input
output = job1.result
log = job1.log
grid_resource = gt2 example.wisc.edu/jobmanager
queue
The jobmanager in grid_resource is one of:
  jobmanager
  jobmanager-condor
  jobmanager-pbs
  jobmanager-lsf
  jobmanager-sge
Submit Description File
For gt3:
universe = grid
input = job2.input
output = job2.result
log = job2.log
grid_resource = gt3 http://198.51.254.40:8080/ogsa/services/base/gram/XXXManagedJobFactoryService
queue
The address in grid_resource is given as IP address:Port number
XXX is one of:
  Fork
  Condor
  PBS
  LSF
  SGE
Submit Description File
For gt4:
universe = grid
input = job3.input
output = job3.result
log = job3.log
grid_resource = gt4 https://198.51.254.40:8080/wsrf/service/ManagedJobFactoryService XXX
queue
The address in grid_resource is given as IP address:Port number OR Host name:Port number
XXX is one of:
  Fork
  Condor
  PBS
  LSF
  SGE
Nordugrid and the Submit Description File
universe = grid
input = job4.input
output = job4.result
log = job4.log
grid_resource = nordugrid ngexample.com
queue
Unicore and the Submit Description File
universe = grid
input = job5.input
output = job5.result
log = job5.log
grid_resource = unicore usite.example.com vsite
keystore_file = /frieda/certificates/keystore
keystore_alias = "frieda"
keystore_passphrase_file = /frieda/private/passphrase
queue
vsite is the name of the Unicore virtual resource
PBS and the Submit Description File
Details of the PBS installation in $(GLITE_LOCATION)/etc/batch_gahp.config
universe = grid
input = job6.input
output = job6.result
log = job6.log
grid_resource = pbs
queue
LSF and the Submit Description File
Details of the LSF installation in $(GLITE_LOCATION)/etc/batch_gahp.config
universe = grid
input = job7.input
output = job7.result
log = job7.log
grid_resource = lsf
queue
Condor-C
Condor is running here, and Condor is running over there
For the case where we want to send a job off to a far away machine, in order to hand that job to another batch processing system on that machine
Condor-C and the Submit Description File
universe = grid
input = job8.input
output = job8.result
log = job8.log
grid_resource = condor [email protected] remotecentralmanager.example.com
+remote_jobuniverse = 5
+remote_requirements = True
+remote_ShouldTransferFiles = "YES"
+remote_WhenToTransferOutput = "ON_EXIT"
queue
In grid_resource, the first argument is the schedd name and the second is the collector machine name
remote_jobuniverse = 5 selects the vanilla universe
Credentials
Not just anybody can use any resource at any time. . .
Key concepts:
  Authentication: verification of an identity
  Authorization: permission to do something
Authentication
If Frieda says “I am Frieda,” how do we distinguish this from Frieda saying “I am George Bush”?
Authentication
Bush can do whatever he pleases
If Frieda claims to be Bush (and this is accepted), then Frieda can do whatever she pleases
Authentication attempts to verify the identity of the entity that is communicating
Authorization
Who is allowed (permitted) to do what
  Frieda may run gt4 jobs on the Open Science Grid machines
  Fred may write to files in /usr/bin
  The Unix user root may do anything!
Can be implemented with a list of those authorized
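A list of those authorized can be sketched in a few lines of Python. This is an illustration of the allow-list idea only, not Condor's authorization code. The user names and action strings are taken from the examples on this slide, and the set-based layout is an assumption of the sketch.

```python
# Hypothetical allow-list of (user, action) pairs from the examples above
AUTHORIZED = {
    ("frieda", "run gt4 jobs on Open Science Grid machines"),
    ("fred", "write to files in /usr/bin"),
}
SUPERUSERS = {"root"}  # root may do anything

def is_authorized(user, action):
    """Return True if the already-authenticated user may perform the action."""
    return user in SUPERUSERS or (user, action) in AUTHORIZED

print(is_authorized("fred", "write to files in /usr/bin"))    # prints: True
print(is_authorized("frieda", "write to files in /usr/bin"))  # prints: False
print(is_authorized("root", "write to files in /usr/bin"))    # prints: True
```

The check only makes sense after authentication: the user name passed in must already have been verified.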
Condor and Authentication
Authentication within Condor comes in many forms. Here are three.
1. File system: Have the entity write a file. The OS attaches a name to the file owner. Condor checks that the entity’s claim is the same as the file owner.
2. GSI (Grid Security Infrastructure)
3. Kerberos
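The file system method in item 1 can be sketched in Python for a Unix-like system. This shows the idea only and is not Condor's implementation: in the sketch the "entity" is a callback in the same process, whereas in practice it would be a separate process, and comparing numeric user ids rather than names is an assumption made to keep the example self-contained.

```python
import os
import tempfile

def fs_authenticate(claimed_uid, write_file):
    """Ask the entity to create a file, then compare the file owner with the claim."""
    with tempfile.TemporaryDirectory() as challenge_dir:
        path = os.path.join(challenge_dir, "challenge")
        write_file(path)  # performed by the entity being authenticated
        # The OS records the owner of the new file; the entity cannot choose it freely.
        return os.path.exists(path) and os.stat(path).st_uid == claimed_uid

def create_empty(path):
    """Stand-in for the entity: create the requested file."""
    open(path, "w").close()

print(fs_authenticate(os.getuid(), create_empty))      # prints: True
print(fs_authenticate(os.getuid() + 1, create_empty))  # prints: False
```

The second call fails because the file is owned by the user who actually wrote it, not by the user id that was claimed.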
Authentication Idea
• A centralized certificate authority (CA) does verification of an entity’s identity.
• When satisfied, the CA issues a signed certificate (also called a credential)
[Figure: Frieda, saying “I am Frieda”, and the CA]
Authentication
• To authenticate, the entity presents the certificate
• All is well, if we trust the CA and the remote machine
[Figure: Frieda, saying “I am Frieda”, and the CA]
GSI Authentication
GSI uses X.509 certificates
Grid universe jobs submitted to back end types that use Globus middleware (gt2, gt3, gt4), as well as nordugrid and unicore, use X.509 certificates
Condor can also use GSI
Revocation, Trust, and Proxies
The CA may revoke a credential
Frieda gives the signed credential to the remote machine. If the remote machine is malicious, it could impersonate Frieda. Therefore, a password protects the credential.
A proxy is a credential that includes the password, but is only valid for a specific (short) time period.
MyProxy software enables GSI proxy management
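The limited lifetime of a proxy can be sketched as a validity window that is checked before the credential is accepted. The Python below models only that time check; it does not create or verify X.509 certificates. The Proxy class, the make_proxy helper, and the 12-hour lifetime are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class Proxy:
    """Hypothetical stand-in for a proxy credential: a subject plus a validity window."""
    subject: str
    not_before: datetime
    not_after: datetime

    def is_valid(self, now):
        """A proxy is usable only inside its validity window."""
        return self.not_before <= now <= self.not_after

def make_proxy(subject, issued_at, lifetime=timedelta(hours=12)):
    """Create a proxy that expires `lifetime` after it is issued (12 hours is illustrative)."""
    return Proxy(subject, issued_at, issued_at + lifetime)

issued = datetime(2006, 1, 1, 9, 0, tzinfo=timezone.utc)
proxy = make_proxy("frieda", issued)
print(proxy.is_valid(issued + timedelta(hours=1)))   # prints: True
print(proxy.is_valid(issued + timedelta(hours=13)))  # prints: False
```

A stolen proxy is therefore useful to an attacker only until it expires, which limits the damage a malicious remote machine can do.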