GridShell + Condor: How to Execute 1 Million Jobs on the TeraGrid
TRANSCRIPT
Jeffrey P. Gardner, Edward Walker, Miron Livny, Todd Tannenbaum, the Condor Development Team, Ryan Scranton, Andrew Connolly
Pittsburgh Supercomputing Center
Texas Advanced Computing Center
University of Wisconsin
University of Pittsburgh
Scientific Motivation: Example from Astronomy
Astronomy is increasingly done using large surveys with hundreds of millions of objects.
Analyzing large astronomical datasets frequently means performing the same analysis task on >100,000 objects.
Each object may take several hours of computing.
The amount of computing time required may vary, sometimes dramatically, from object to object.
Requirements
Schedulers at each TeraGrid site should be able to gracefully handle ~1000 single-processor jobs at a time.
A metascheduler distributes jobs across the TeraGrid.
The metascheduler must be able to handle ~100,000 jobs.
Solution: PBS/LSF?
In theory, existing TeraGrid schedulers like PBS or LSF should provide the answer.
In practice, this does not work:
TeraGrid nodes are multiprocessor, but only 1 PBS job may run per node.
TeraGrid machines frequently restrict the number of jobs a single user may run.
However, asking for many processors at once communicates your actual resource requirements.
Solution: Clever shell scripts?
We could submit a single PBS job that uses many processors. Now we have a reasonable number of PBS jobs, and our scheduling priority would reflect our actual resource usage.
This still has problems: each job takes a different amount of time to run, so we are using resources inefficiently.
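The inefficiency above can be made concrete with a toy calculation (an illustration only, not part of GridShell or Condor): when work-unit runtimes vary, pre-assigning fixed batches to processors leaves some of them idle, while a dynamic "pull" scheduler keeps every processor busy.

```python
import heapq

def static_makespan(runtimes, n_procs):
    """Pre-assign work units round-robin; each processor runs its fixed batch."""
    loads = [0.0] * n_procs
    for i, t in enumerate(runtimes):
        loads[i % n_procs] += t
    return max(loads)

def dynamic_makespan(runtimes, n_procs):
    """Greedy "pull" scheduling: whichever processor frees up first takes the next unit."""
    finish = [0.0] * n_procs      # time at which each processor becomes free
    heapq.heapify(finish)
    for t in runtimes:
        heapq.heappush(finish, heapq.heappop(finish) + t)
    return max(finish)

# Hypothetical work-unit runtimes (hours) that vary dramatically,
# as with the astronomy objects described earlier.
runtimes = [8, 1, 8, 1, 1, 1, 1, 1, 1, 1]
print(static_makespan(runtimes, 2))    # -> 19.0
print(dynamic_makespan(runtimes, 2))   # -> 12.0
```

With 24 hours of total work on 2 processors, the ideal makespan is 12 hours; the static split takes 19 because both long units land on one processor. This is the gap a private scheduler closes.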
Requirements
A metascheduler distributes jobs across the TeraGrid and must be able to handle ~100,000 jobs.
Problem: the TeraGrid has no metascheduler.
Metascheduler Solution: Condor-G?
Condor-G will schedule an arbitrarily large number of jobs across multiple grid resources using Globus.
However, 1 serial Condor-G job = 1 PBS job, so we are left with the same PBS limitations as before:
TeraGrid nodes are multiprocessor, but only 1 PBS job may run per node.
TeraGrid machines frequently restrict the number of jobs a single user may run.
The Real Solution: Condor + GridShell
The real solution is to submit one large PBS job on each TeraGrid machine, then use a private scheduler to manage serial work units within the PBS job.
Vocabulary:
JOB: (n) a thing that is submitted via Globus or PBS
WORK UNIT: (n) an independent unit of work (usually serial), such as the analysis of a single astronomical object
[Diagram: one PBS job spanning several PEs; a private scheduler assigns serial work units to the PEs]
[Same diagram, annotated: Condor provides the private scheduler, and GridShell launches the PBS jobs]
Condor Overview
Condor was first designed as a CPU-cycle harvester for workstations sitting on people's desks.
Condor is designed to schedule large numbers of jobs across a distributed, heterogeneous, and dynamic set of computational resources.
Advantages of Condor:
The Condor user experience is simple.
Condor is flexible: resources can be any mix of architectures, and they need neither a common filesystem nor common user accounting.
Condor is dynamic: resources can disappear and reappear.
Condor is fault-tolerant: jobs are automatically migrated to new resources if existing ones become unavailable.
Condor Daemon Layout (very simplified)
[Diagram: Central Manager (collector), Submission Machine (schedd), Execution Machine (startd). To simplify this example, the functions of the Negotiator are combined with the Collector.]
1. The Startd sends system specifications (ClassAds) and system status to the Central Manager.
2. The user submits a Condor job, and the Schedd sends the job info to the Central Manager.
3. The Central Manager uses this information to match Schedd jobs to available Startds.
4. The Schedd sends the job to the Startd on the assigned execution node.
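The matchmaking step can be sketched in miniature (a toy illustration only: real ClassAds are a full expression language, and the real negotiator also handles priorities and preemption; all names below are made up for the example):

```python
# Machine "ads": each startd advertises its attributes to the collector.
machine_ads = [
    {"Name": "startd@alpha", "Arch": "ALPHA", "OpSys": "OSF1",  "Memory": 4096, "State": "Unclaimed"},
    {"Name": "startd@ia64",  "Arch": "IA64",  "OpSys": "LINUX", "Memory": 2016, "State": "Unclaimed"},
    {"Name": "startd@ia32",  "Arch": "INTEL", "OpSys": "LINUX", "Memory": 2016, "State": "Claimed"},
]

def match(job_requirements, ads):
    """Return the first unclaimed machine whose ad satisfies the job's predicate."""
    for ad in ads:
        if ad["State"] == "Unclaimed" and job_requirements(ad):
            return ad["Name"]
    return None   # no match: job stays idle in the queue

# A job that requires a Linux machine with at least 1 GB of memory.
job = lambda ad: ad["OpSys"] == "LINUX" and ad["Memory"] >= 1024
print(match(job, machine_ads))   # -> startd@ia64
```

The claimed IA32 machine is skipped and the Alpha fails the OpSys requirement, so the job is matched to the unclaimed IA64 startd.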
"Personal" Condor on a TeraGrid Platform
Condor daemons can be run as a normal user.
Condor's "GlideIn" mechanism supports launching condor_startd's on nodes within an LSF or PBS job.
[Diagram: Condor runs with normal user permissions. The login node hosts the Central Manager (collector) and Submission Machine (schedd); a GlideIn PBS job on the compute nodes runs a startd on each execution PE.]
GridShell Overview
GridShell allows users to interact with distributed grid computing resources from a simple shell-like interface.
It extends TCSH version 6.12 to incorporate grid-enabled features:
parallel inter-script message passing and synchronization
output redirection to remote files
parametric sweep
GridShell Examples
Redirecting the standard output of a command to a remote file location using GlobusFTP:
    a.out > gsiftp://tg-login.ncsa.teragrid.org/data
Message passing between 2 parallel tasks:
    if ( $_GRID_TASKID == 0 ) then
        echo "hello" > task_1
    else
        set msg=`cat < task_0`
    endif
Executing 256 instances of a job:
    a.out on 256 procs
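The task-ID pattern above has a familiar shape: N instances of the same code, each distinguished only by its task ID. A plain-Python analogue (a toy, not GridShell; it uses thread return values rather than GridShell's file-based task_N channels) looks like this:

```python
from concurrent.futures import ThreadPoolExecutor

def task(task_id):
    # Every instance runs the same body; behavior branches on the task ID,
    # just as a GridShell script branches on $_GRID_TASKID.
    if task_id == 0:
        return "hello"                 # analogue of: echo "hello" > task_1
    return f"task {task_id} ran"

# Analogue of launching 4 instances of the same job (cf. "a.out on 256 procs").
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(task, range(4)))

print(results[0])   # -> hello
print(results[3])   # -> task 3 ran
```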
Merging GridShell with Condor
Use GridShell to launch Condor GlideIn jobs at multiple grid sites.
All Condor GlideIn jobs report back to a central collector.
This converts the entire TeraGrid into your own personal Condor pool!
Merging GridShell with Condor
Login Node
Gridshell event monitor
User starts GridShell Session at PSCUser starts GridShell Session at PSC
PSC (Alpha)
NCSA (IA64)TACC (IA32)
GridShell process
Merging GridShell with Condor
Login Node
Gridshell event monitor
Login Node
Gridshell event monitor
Login Node
Gridshell event monitor
GridShell session starts event monitor on remote login nodes via GlobusGridShell session starts event monitor on remote login nodes via Globus
PSC (Alpha)
NCSA (IA64)TACC (IA32)
GridShell process
Condor process
Merging GridShell with Condor
Login Node
collectorschedd
Gridshell event monitor
Login Node
Gridshell event monitor
Login Node
Gridshell event monitor
Local event monitor starts condor daemons on login nodeLocal event monitor starts condor daemons on login node
PSC (Alpha)
NCSA (IA64)TACC (IA32)
GridShell process
Condor process
Login Node
collectorschedd
Gridshell event monitor
PBS-RMS Job
Login Node
Gridshell event monitor
PBS Job
Login Node
Gridshell event monitor
LSF Job
gtcsh-ex
TACC (IA32)PSC (Alpha)
NCSA (IA64)
All event monitors submit PBS/LSF jobs.These jobs start GridShell gtcsh-exec on all processors
All event monitors submit PBS/LSF jobs.These jobs start GridShell gtcsh-exec on all processors
gtcsh-ex gtcsh-ex
Master gtcsh-exec
GridShell process
Condor process
gtcsh-ex gtcsh-ex gtcsh-ex
Master gtcsh-exec
gtcsh-ex gtcsh-ex gtcsh-ex
Master gtcsh-exec
Login Node
collectorschedd
Gridshell event monitor
PBS-RMS Job
Login Node
Gridshell event monitor
PBS Job
Login Node
Gridshell event monitor
LSF Job
gtcsh-ex
TACC (IA32)PSC (Alpha)
NCSA (IA64)
gtcsh-exec on each processor starts a Condor startd.Heartbeat is maintained between all gtcsh-exec processes
gtcsh-exec on each processor starts a Condor startd.Heartbeat is maintained between all gtcsh-exec processes
gtcsh-ex gtcsh-ex
Master gtcsh-exec
GridShell process
Condor process
gtcsh-ex gtcsh-ex gtcsh-ex
Master gtcsh-exec
gtcsh-ex gtcsh-ex gtcsh-ex
Master gtcsh-exec
startdstartd startd
startdstartd startd
startdstartd startd
“Heartbeat”
Login Node
collectorschedd
Gridshell event monitor
PBS-RMS Job
Login Node
Gridshell event monitor
PBS Job
Login Node
Gridshell event monitor
LSF Job
gtcsh-ex
TACC (IA32)PSC (Alpha)
NCSA (IA64)
gtcsh-exec on each processor starts a Condor startdgtcsh-exec on each processor starts a Condor startd
gtcsh-ex gtcsh-ex
Master gtcsh-exec
GridShell process
Condor process
gtcsh-ex gtcsh-ex gtcsh-ex
Master gtcsh-exec
gtcsh-ex gtcsh-ex gtcsh-ex
Master gtcsh-exec
startdstartd startd
startdstartd startd
startdstartd startd
Login Node
collectorschedd
Gridshell event monitor
PBS-RMS Job
Login Node
Gridshell event monitor
PBS Job
Login Node
Gridshell event monitor
LSF Job
gtcsh-ex
TACC (IA32)PSC (Alpha)
NCSA (IA64)
Condor schedd distributes Condor jobs to compute nodesCondor schedd distributes Condor jobs to compute nodes
gtcsh-ex gtcsh-ex
Master gtcsh-exec
GridShell process
Condor process
gtcsh-ex gtcsh-ex gtcsh-ex
Master gtcsh-exec
gtcsh-ex gtcsh-ex gtcsh-ex
Master gtcsh-exec
startdstartd startd
startdstartd startd
startdstartd startd
Demo: GridShell on the TeraGrid
We will launch and run within a GridShell session on 3 TeraGrid sites: PSC (Alpha), NCSA (IA64), and TACC (IA32).
We will use Condor to schedule work units of a scientific application: "synfast", which calculates Monte Carlo realizations of the Cosmic Microwave Background.
We submit 200 independent "synfast" work units, each calculating 1 Monte Carlo realization.
Start GridShell Session
1. Write a simple GridShell configuration script:
    # vo.conf:
    # A GridShell config script
    tg-login.ncsa.teragrid.org
    tg-login.tacc.teragrid.org
2. Start the GridShell session. This submits PBS jobs at each site and starts the local Condor daemons:
    % vo-login -n 1:16 -H ~/vo.conf -G -T -W 60
    Spawning on tg-login.ncsa.teragrid.org
    Spawning on tg-login.tacc.teragrid.org
    waiting for VO participants to callback...
    ###########
    Done.
Flags:
    -n 1:16     Start 1 PBS job per TeraGrid machine, each with 16 processors
    -W 60       Each PBS job has a wallclock limit of 60 minutes
    -H vo.conf  Configuration file
Start GridShell Session
3. Check the status of the PBS jobs on all sites:
    (grid)% agent_jobs
    GATEWAY: iam763.psc.edu              332190.iam763 (PENDING)
    GATEWAY: lonestar.tacc.utexas.edu    353150 (PENDING)
    GATEWAY: tg-login1.ncsa.teragrid.org 240428.tg-master.ncsa.teragrid.org (PENDING)
Later, once the jobs start:
    (grid)% agent_jobs
    GATEWAY: iam763.psc.edu              332190.iam763 (RUNNING)
    GATEWAY: lonestar.tacc.utexas.edu    353150 (RUNNING)
    GATEWAY: tg-login1.ncsa.teragrid.org 240428.tg-master.ncsa.teragrid.org (RUNNING)
Submit Condor Job
4. We can now interact with the Condor daemons:
    (grid)% condor_status
    Name           OpSys Arch  State     Activity LoadAv Mem  ActvtyTime
    24731@compute  LINUX INTEL Unclaimed Idle     0.030  2016 0+00:04:04
    24732@compute  LINUX INTEL Unclaimed Idle     0.030  2016 0+00:04:04
    ...
    6921@tg-login  LINUX IA64  Unclaimed Idle     0.030  2016 0+00:04:04
    6922@tg-login  LINUX IA64  Unclaimed Idle     0.030  2016 0+00:04:04
    ...
    882153@iam320  OSF1  ALPHA Unclaimed Idle     3.590  4096 0+00:00:03
    882390@iam320  OSF1  ALPHA Unclaimed Idle     3.310  4096 0+00:00:08
    ...
                 Machines Owner Claimed Unclaimed Matched Preempting
     INTEL/LINUX       48     0       0        48       0          0
           Total       48     0       0        48       0          0
Submit Condor Job
5. Write a simple Condor job description file:
    # SC2004demo.cmd: Condor Job Description
    Universe   = Vanilla
    Executable = SC2004demo.sh

    # Arguments for SC2004demo.sh
    Arguments  = $(Process)

    # stderr and stdout
    Error  = SC2004.$(Process).err
    Output = SC2004.$(Process).out

    # Log file for all Condor jobs
    Log = SC2004.log

    # Queue up 200 Condor jobs
    Queue 200
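What "Queue 200" does with the $(Process) macro can be sketched as follows (a toy expansion for illustration, not Condor's actual implementation): each queued job gets a process index substituted into its arguments and file names.

```python
# Hypothetical mirror of the submit file above: only the macro-bearing entries.
SUBMIT = {
    "Arguments": "$(Process)",
    "Error":  "SC2004.$(Process).err",
    "Output": "SC2004.$(Process).out",
}

def expand(template, count):
    """Yield one concrete job description per queued process, 0..count-1."""
    for proc in range(count):
        yield {k: v.replace("$(Process)", str(proc)) for k, v in template.items()}

jobs = list(expand(SUBMIT, 200))
print(len(jobs))              # -> 200
print(jobs[0]["Output"])      # -> SC2004.0.out
print(jobs[199]["Error"])     # -> SC2004.199.err
```

This is why each of the 200 work units below gets its own stderr/stdout files while sharing one log.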
Submit Condor Job
6. Write the SC2004demo.sh script:
    #!/bin/sh
    # Do simulation
    cd $SC2004_SCRATCH
    $SC2004_EXEC_DIR/synfast <<EOF
    synfast arguments
    EOF

    # Copy output to central repository
    cat > /tmp/transfer.csh <<EOF
    #!$GRIDSHELL_LOCATION/gtcsh
    grid -io on
    cat datafile > gsiftp://repository.org/scratch/experiment/data
    EOF
    chmod 755 /tmp/transfer.csh
    /tmp/transfer.csh
Notes: Environment variables can be defined in .cshrc. GridShell is used to transfer the output files.
Submit Condor Job
7. Submit the Condor job:
    (grid)% condor_submit SC2004demo.cmd
    submitting jobs.....................
    logging submit events...............
    200 jobs submitted to cluster 1
8. Examine the Condor queue:
    (grid)% condor_q
    -- Submitter: iam763 : <128.182.99.154:50869> : iam763
     ID    OWNER    SUBMITTED    RUN_TIME   ST PRI SIZE CMD
     1.0   gardnerj 11/1  21:48  0+00:00:00 R  0   0.0  SC2004demo.cmd
     1.1   gardnerj 11/1  21:48  0+00:00:00 R  0   0.0  SC2004demo.cmd
    ...
     1.199 gardnerj 11/1  21:48  0+00:00:00 I  0   0.0  SC2004demo.cmd
GridShell in a Nutshell
We have used GridShell to turn the TeraGrid into our own personal Condor pool.
We can submit Condor jobs, and Condor will schedule these jobs across multiple TeraGrid sites.
TeraGrid sites do not need to share an architecture or queuing system.
GridShell also allows us to use TeraGrid protocols to transfer our input and output data.
GridShell in a Nutshell
LESSON: Using GridShell coupled with Condor, one can easily harness the power of the TeraGrid to process large numbers of independent work units.
All of this fits into the existing TeraGrid software.