Jaime Frey
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor
What’s New in Condor-G
www.cs.wisc.edu/condor
Outline
› What is Condor-G
› Released New Features
› In Development

What Is Condor-G
› Use Condor to run jobs on the Grid
› Uses Globus Toolkit
    GRAM (submit a remote job)
    GASS (transfer job’s files)
› Two components
    Globus Universe
    GlideIn

Globus Universe
› Run a job on a Grid resource
› Features
    Job management
    Fault tolerance
    Credential management
› Roughly equivalent to the vanilla universe

How It Works
[Diagram, built up over five slides. Labels: Condor-G (Schedd, 600 Globus jobs, GridManager); Grid Resource (JobManager, LSF, User Job).]

GlideIn
› Run the Condor daemons on Grid resources as user jobs
› Create your own personal Condor pool from temporarily-acquired Grid resources
› Brings the full power of Condor to the Grid

[Diagram, built up over seven slides. Labels: Condor-G, 600 Condor jobs, Globus Grid, PBS, LSF, Condor, Condor glide-ins.]

Released New Features
› Stuff we’ve added in the past year
› Released and ready for use in Condor 6.6

Globus ASCII Helper Protocol (GAHP)
› Encapsulates Globus libraries in a separate process
› Simple ASCII protocol
› Easy for legacy applications to use Globus when they can’t link directly with the libraries

How It Works - GAHP
[Diagram. Labels: Condor-G (Schedd, GridManager, GAHP Client, GAHP Server); Grid Resources (three JobManagers).]

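The idea of putting a library behind a simple ASCII protocol can be sketched in a few lines of Python. This is a toy illustration only: the command names (PING, SUBMIT, STATUS), the "S"/"E" reply format, and both class names are hypothetical and are not taken from the real GAHP specification. The server object stands in for the separate helper process, and the client only ever exchanges text lines with it.

```python
# Toy line-oriented helper in the spirit of GAHP.
# Command names and reply format are hypothetical, not the real protocol.

class ToyHelperServer:
    """Stands in for the separate process that would link the Globus libraries."""

    def __init__(self):
        self._jobs = {}      # job id -> state string
        self._next_id = 1

    def handle_line(self, line):
        """Take one ASCII request line and return one ASCII reply line."""
        parts = line.split()
        if not parts:
            return "E empty-line"
        command, args = parts[0].upper(), parts[1:]
        if command == "PING":
            return "S"
        if command == "SUBMIT" and len(args) == 1:
            job_id = f"job{self._next_id}"
            self._next_id += 1
            self._jobs[job_id] = "PENDING"
            return f"S {job_id}"
        if command == "STATUS" and len(args) == 1:
            state = self._jobs.get(args[0])
            return f"S {state}" if state else "E unknown-job"
        return "E bad-command"


class ToyHelperClient:
    """Speaks only text lines, so it needs no Globus libraries itself."""

    def __init__(self, send_line):
        # send_line: callable taking a request line and returning a reply line.
        self._send_line = send_line

    def _call(self, request):
        status, _, payload = self._send_line(request).partition(" ")
        if status != "S":
            raise RuntimeError(payload or "request failed")
        return payload

    def submit(self, resource):
        return self._call(f"SUBMIT {resource}")

    def status(self, job_id):
        return self._call(f"STATUS {job_id}")


server = ToyHelperServer()
client = ToyHelperClient(server.handle_line)
job_id = client.submit("gatekeeper.example.org/jobmanager-lsf")
print(job_id, client.status(job_id))
```

Here the client calls the server in-process to keep the sketch self-contained; with a real helper, `send_line` would instead write to and read from the pipes of a child process.
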
File Staging
› Arbitrary input and output files can be staged to and from execution site
› Same syntax as other universes
› Limitation
    Output files must be explicitly named

File Staging (cont)
› Input, Output, and Error can be URLs
    Files will be transferred directly to and from execution site
› Output and Error can be staged or streamed

Credential Refresh
› Renewed credentials are used by Condor-G and forwarded to the execution site automatically
› No processes need to be restarted

Better Credential Management
› One GridManager process can handle multiple credential files with same subject
› More efficient when you want to have different credential lifetimes for different jobs

Grid Match-Making
› Globus jobs matched with Globus resources by the Condor match-maker using ClassAds
› Current limitation
    User/admin must create resource ads

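A rough sketch of what matching a job against hand-written resource ads involves, written in Python rather than ClassAd syntax. The host names and Mflops values are made up for illustration, and the sketch only evaluates the job's side of the match (its requirements and rank); it is not the match-maker's actual algorithm.

```python
# Simplified, one-sided matchmaking: filter resource ads by the job's
# requirements, then pick the one the job ranks highest.
# All ads below are hypothetical examples.

def match(job, resource_ads):
    """Return the acceptable resource ad with the highest rank, or None."""
    candidates = [ad for ad in resource_ads if job["requirements"](ad)]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])


resource_ads = [
    {"gatekeeper_url": "grid-a.example.org/jobmanager-lsf",
     "Arch": "LINUX", "OpSys": "LINUX", "Mflops": 400},
    {"gatekeeper_url": "grid-b.example.org/jobmanager-pbs",
     "Arch": "LINUX", "OpSys": "LINUX", "Mflops": 900},
    {"gatekeeper_url": "grid-c.example.org/jobmanager-lsf",
     "Arch": "SUN4u", "OpSys": "SOLARIS", "Mflops": 1500},
]

job = {
    "requirements": lambda ad: ad["Arch"] == "LINUX" and ad["OpSys"] == "LINUX",
    "rank": lambda ad: ad["Mflops"],
}

best = match(job, resource_ads)
print(best["gatekeeper_url"])  # grid-b.example.org/jobmanager-pbs
```

The fastest machine (grid-c) is skipped because it fails the requirements, which is why the resource ads a user or admin writes need accurate attributes.
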
Fault Tolerance
› Condor-G does its best to automatically recover from failures
› User can guide decisions with job policy expressions
    PeriodicRelease
    GlobusResubmit
    Rematch

PeriodicRelease Expression
› Condor-G puts problematic jobs on hold
› This expression tells Condor-G when to release and retry such jobs

GlobusResubmit Expression
› Tells Condor-G when a problematic job submission should be abandoned
› When this expression becomes true
    Best effort is made to clean up current job submission
    New job submission is attempted

Rematch Expression
› Tells Condor-G when a problematic resource should be abandoned
› Evaluated when GlobusResubmit evaluates to true
› When this expression becomes true
    Best effort is made to clean up current job submission
    Job is rematched

Job Ad Example
GlobusContactString = TARGET.gatekeeper_url
Requirements = TARGET.Arch == "LINUX" && TARGET.OpSys == "LINUX"
Rank = TARGET.Mflops
PeriodicRelease = ((NumMatches < 10) && ((CurrentTime-EnteredCurrentStatus) > 600))
GlobusResubmit = NumSystemHolds >= NumMatches
Rematch = True

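To see how the three policy expressions in this job ad interact, here is a small Python transcription of them. The expressions themselves come from the example; the function names, the action labels, and the order in which the two checks are applied are assumptions made for illustration, not a description of Condor-G's internals.

```python
# Python transcription of the policy expressions from the job ad example.
# Action labels and evaluation order are illustrative assumptions.

def periodic_release(ad):
    held_for = ad["CurrentTime"] - ad["EnteredCurrentStatus"]
    return ad["NumMatches"] < 10 and held_for > 600

def globus_resubmit(ad):
    return ad["NumSystemHolds"] >= ad["NumMatches"]

def rematch(ad):
    return True

def held_job_action(ad):
    """What to do with a job that is currently on hold."""
    return "release" if periodic_release(ad) else "stay-held"

def retry_action(ad):
    """What to do with the current submission once the job is retried."""
    if not globus_resubmit(ad):
        return "keep-submission"
    # Rematch is only consulted when GlobusResubmit is true.
    return "rematch" if rematch(ad) else "resubmit"


# Hypothetical attribute values: matched once, held once, on hold for 1000 s.
job = {"NumMatches": 1, "NumSystemHolds": 1,
       "CurrentTime": 2000, "EnteredCurrentStatus": 1000}

print(held_job_action(job))  # release
print(retry_action(job))     # rematch
```

Read this way, the example ad retries a held job after ten minutes, gives up on a resource once the job has been held there as many times as it has been matched, and stops retrying altogether after ten matches.
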
Hardening
› Regular testing on the CMS testbed with real applications
› Many bugs and integration issues found and fixed
    Hostile Environment

Hostile Environment
› Full disks
› Machine crashes
› File server lock-ups
› Network outages
› Power outages

One CMS Dataset Run
› 300 jobs
› Last fall
    ~50 (16%) of the jobs stalled and required human recovery
    Multiple service restarts (20 daemon crashes over 6 hours)
› Now
    0 jobs stalled
    0 service restarts

Integration Work
› Dozens of Condor-G improvements and bug fixes
› Over 40 Globus “bugzilla” incidents, many with patches
    Globus 2.2.4 has 21 “Advisories” as of 4/11/04
› Use latest version of both

Scalability
› Submitting several hundred jobs produced high load on server
    Machine became unresponsive
    We saw a load average of 1000 at one point
› Caused by Globus JobManager processes

Grid Manager Monitor Agent
› New tool Condor-G can use to reduce this load
› Efficient job status polling program
› Allows Condor-G to shut down JobManager processes when they’re not needed

Load Reduced
› 400 jobs (/bin/sleep 900)
› Without Grid Monitor
    42 hours to complete
    Peak load average of 610
› With Grid Monitor
    40 minutes
    Peak load average of 104

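Taking the slide's figures at face value, the improvement can be put in relative terms with a little arithmetic. The variable names are only for this sketch.

```python
# Ratios computed from the figures on the slide (400 jobs of /bin/sleep 900).
without_monitor = {"minutes": 42 * 60, "peak_load": 610}
with_monitor = {"minutes": 40, "peak_load": 104}

speedup = without_monitor["minutes"] / with_monitor["minutes"]
load_ratio = without_monitor["peak_load"] / with_monitor["peak_load"]

print(f"Completion time: {speedup:.0f}x faster")   # Completion time: 63x faster
print(f"Peak load: {load_ratio:.1f}x lower")       # Peak load: 5.9x lower
```
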
Miscellaneous Stuff
› Email notification on job completion
› Port range restrictions
› Problem jobs put on hold

In Development
› Stuff we’re currently working on
› Will be released sometime in the next year

Job Policy Expressions
› PeriodicHold
› PeriodicRemove
› OnExitHold
› OnExitRemove

Improved GlideIn
› MDS use optional
    User specifies necessary information
› Automatic setup
    GlideIn job transfers and installs binaries if needed
    Binaries can come from submit machine

New Job Types
› Submit jobs directly to other schedulers (not through Globus)
› Why?
    Richer interface semantics
    Not supported by Globus

NorduGrid
› Grid batch system designed by Nordic countries
› Globus GRAM didn’t offer necessary semantics
    Client control of file staging
    Automatic cleanup of abandoned jobs

Oracle
› Oracle DBMS supports a job queue
    Run this query in 5 hours
    Run this query every Monday
› Condor can add more management features

Generic Job Interface
› Re-arrange GridManager to allow easy addition of new job types
› Define appropriate interface
› Plug-ins for new job types?

Globus Toolkit 3.0
› OGSA (Open Grid Services Architecture)
› Submit jobs to GT3 sites
› Grid Service client interface to Condor-G

Miscellaneous
› Condor-G for Windows
› MyProxy credential management
› URLs for executable, staged files

Thank You!
› Questions?
› Also…
    Condor-G & Globus Q/A session
        • Wednesday, 9am-12pm, room TBA
    E-mail [email protected]