Condor-G: A Case in Distributed Job Delegation
Jaime Frey
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor

Job Delegation
› Transfer of responsibility to schedule and execute a job
› Multiple delegations can form a chain

Job Delegation in Condor-G Today
[Diagram: Condor-G, Globus GRAM, Batch System Front-end, Execute Machine]

Expanding the Model
› What can we do with new forms of job delegation?
› Some ideas
    Mirroring
    Load-balancing
    Glide-in schedd
    Multi-hop grid scheduling

Mirroring
› What it does
    Jobs mirrored on two Condor-Gs
    If primary Condor-G crashes, secondary one starts running jobs
    On recovery, primary Condor-G gets job status from secondary one
› Removes Condor-G submit point as single point of failure
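The mirroring behavior above can be sketched as a toy model. This is only an illustration under stated assumptions: the class and function names are hypothetical, the crash is simulated with a flag, and none of it is Condor-G's actual interface.

```python
class MirroredSubmitPoint:
    """Toy model of one Condor-G in a mirrored pair (hypothetical, not a Condor API)."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.jobs = {}  # job id -> status string

    def submit(self, job_id):
        self.jobs[job_id] = "idle"


def mirror_submit(primary, secondary, job_id):
    # Every job is recorded on both submit points.
    primary.submit(job_id)
    secondary.submit(job_id)


def active_submit_point(primary, secondary):
    # The secondary only runs jobs while the primary is down.
    return primary if primary.alive else secondary


def recover(primary, secondary):
    # On recovery the primary copies job status from the secondary.
    primary.jobs = dict(secondary.jobs)
    primary.alive = True


primary = MirroredSubmitPoint("condor-g-1")
secondary = MirroredSubmitPoint("condor-g-2")
mirror_submit(primary, secondary, "job-1")

primary.alive = False                       # simulated crash
runner = active_submit_point(primary, secondary)
runner.jobs["job-1"] = "running"            # secondary starts running the job

recover(primary, secondary)                 # primary catches up on job status
```

After `recover`, the primary again holds the current status of `job-1` and becomes the active submit point.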

Mirroring Example
[Diagram: Condor-G 1, Condor-G 2, Matchmaker, Execute Machine]

Load-Balancing
› What it does
    Front-end Condor-G distributes all jobs among several back-end Condor-Gs
    Front-end Condor-G keeps updated job status
› Improves scalability
› Maintains single submit point for users
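A minimal sketch of the front-end/back-end split described above. The round-robin policy is an assumption (the slides do not name a distribution policy), and `FrontEnd`, `submit`, and `report` are hypothetical names rather than Condor-G calls.

```python
from itertools import cycle


class FrontEnd:
    """Toy front-end that hands jobs to back-ends in round-robin order.

    Round-robin is an assumption; the slides do not name a distribution policy.
    """

    def __init__(self, backend_names):
        self._next_backend = cycle(backend_names)
        self.assignment = {}  # job id -> back-end name
        self.status = {}      # job id -> last status reported by the back-end

    def submit(self, job_id):
        backend = next(self._next_backend)
        self.assignment[job_id] = backend
        self.status[job_id] = "idle"
        return backend

    def report(self, job_id, new_status):
        # Back-ends push status updates so users can query the front-end alone.
        self.status[job_id] = new_status


front = FrontEnd(["backend-1", "backend-2", "backend-3"])
placed = [front.submit(f"job-{i}") for i in range(4)]
front.report("job-0", "running")
```

Users only ever talk to `front`; the `assignment` table is what lets it route status queries and updates for each job.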

Load-Balancing Example
[Diagram: Condor-G Front-end, Condor-G Back-end 1, Condor-G Back-end 2, Condor-G Back-end 3]

Glide-In Schedd
› What it does
    Drop a Condor-G onto the front-end machine of a cluster
    Delegate jobs to the cluster through the glide-in schedd
› Apply cluster-specific policies to jobs

Glide-In Schedd Example
[Diagram: Condor-G, Glide-In Schedd, Batch System]

Multi-Hop Grid Scheduling
› Match a job to a Virtual Organization (VO), then to a resource within that VO
› Easier to schedule jobs across multiple VOs and grids
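The two-stage match above (VO first, then a resource inside it) might look like the following sketch. The catalogue, the free-CPU criterion, and the function names are all illustrative assumptions; real matchmaking would use ClassAd requirements and ranks.

```python
# Hypothetical catalogue: each VO advertises resources with a free CPU count.
VOS = {
    "vo-a": {"cluster-1": 0, "cluster-2": 8},
    "vo-b": {"cluster-3": 16},
}


def match_vo(job, vos):
    # First hop: pick a VO the job may use that has some resource big enough.
    for vo_name in job["allowed_vos"]:
        if any(free >= job["cpus"] for free in vos.get(vo_name, {}).values()):
            return vo_name
    return None


def match_resource(job, resources):
    # Second hop: inside the VO, pick the first resource with enough free CPUs.
    for name, free in resources.items():
        if free >= job["cpus"]:
            return name
    return None


job = {"cpus": 4, "allowed_vos": ["vo-a", "vo-b"]}
vo = match_vo(job, VOS)
resource = match_resource(job, VOS[vo]) if vo else None
```

Splitting the decision this way means the first scheduler only needs coarse information about each VO, while detailed resource state stays inside the VO.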

Multi-Hop Grid Scheduling Example
[Diagram: Experiment Condor-G, Experiment Resource Broker, VO Condor-G, VO Resource Broker, Globus GRAM, Batch Scheduler]

Endless Possibilities
› These new models can be combined with each other or with other new models
› Resulting system can be arbitrarily sophisticated

Job Delegation Challenges
› New complexity introduces new issues and exacerbates existing ones
› A few…
    Transparency
    Representation
    Scheduling Control
    Active Job Control
    Revocation
    Error Handling and Debugging

Transparency
› Full information about job should be available to user
    Information from full delegation path
    No manual tracing across multiple machines
› Users need to know what’s happening with their jobs

Representation
› Job state is a vector
› How best to show this to user
    Summary
        • Current delegation endpoint
        • Job state at endpoint
    Full information available if desired
        • Series of nested ClassAds?
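One way to picture the "vector" of job state and its summary view is sketched below. Plain Python dicts stand in for nested ClassAds, and the attribute names (`Host`, `State`, `Delegate`) are hypothetical, not real ClassAd attributes.

```python
# Each hop's record nests the record of the hop it delegated to.
# Plain dicts stand in for ClassAds here; attribute names are hypothetical.
job = {
    "Host": "condor-g",
    "State": "delegated",
    "Delegate": {
        "Host": "globus-gram",
        "State": "delegated",
        "Delegate": {
            "Host": "batch-front-end",
            "State": "running",
            "Delegate": None,
        },
    },
}


def state_vector(ad):
    # Walk the nesting and collect one (host, state) pair per hop.
    vector = []
    while ad is not None:
        vector.append((ad["Host"], ad["State"]))
        ad = ad["Delegate"]
    return vector


def summary(ad):
    # The short view: current delegation endpoint and the job state there.
    host, state = state_vector(ad)[-1]
    return {"Endpoint": host, "State": state}
```

`summary(job)` gives the two-field view for everyday use, while `state_vector(job)` exposes every hop when the user wants the full picture.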

Scheduling Control
› Avoid loops in delegation path
› Give user control of scheduling
    Allow limiting of delegation path length?
    Allow user to specify part or all of delegation path
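The checks above could be expressed as simple predicates evaluated before each delegation. This is a sketch under the assumption that every job carries the list of hops it has already visited; the function and parameter names are hypothetical.

```python
def can_delegate(path, next_hop, max_hops=None):
    """Decide whether a job that has visited `path` may be delegated to `next_hop`.

    `path` is the list of hop names so far; `max_hops` is an optional user limit.
    """
    if next_hop in path:
        return False  # delegating here would create a loop
    if max_hops is not None and len(path) + 1 > max_hops:
        return False  # user-imposed limit on path length
    return True


def follows_user_path(path, next_hop, required_prefix):
    # If the user fixed the first part of the path, each new hop must follow it.
    position = len(path)
    if position < len(required_prefix):
        return required_prefix[position] == next_hop
    return True  # beyond the user-specified part, any hop is acceptable
```

For example, `can_delegate(["a", "b"], "a")` rejects the hop because `a` is already on the path, and `follows_user_path(["a"], "c", ["a", "b"])` rejects it because the user asked for `b` as the second hop.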

Active Job Control
› User may request certain actions
    hold, suspend, vacate, checkpoint
› Actions cannot be completed synchronously for user
    Must forward along delegation path
    User checks completion later

Active Job Control (cont)
› Endpoint systems may not support actions
    If possible, execute them at furthest point that does support them
› Allow user to apply action in middle of delegation path
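A rough sketch of "execute at the furthest point that supports the action", combined with the asynchronous request model from the previous slide. The hop names, their capability sets, and the `pending` table are invented for illustration.

```python
# Delegation path from the user's submit point to the endpoint, with the
# actions each hop supports. Hop names and capabilities are made up.
PATH = [
    ("condor-g", {"hold", "suspend", "vacate", "checkpoint"}),
    ("glide-in-schedd", {"hold", "suspend", "vacate"}),
    ("batch-system", {"hold"}),
]


def furthest_supporting_hop(path, action):
    # Scan from the endpoint back toward the user; the first hit is the furthest hop.
    for name, supported in reversed(path):
        if action in supported:
            return name
    return None


pending = {}  # request id -> request record; the user polls this later


def request_action(request_id, action, path=PATH):
    target = furthest_supporting_hop(path, action)
    # The request is only recorded here; forwarding and completion are asynchronous.
    pending[request_id] = {
        "action": action,
        "target": target,
        "status": "forwarded" if target else "unsupported",
    }
    return pending[request_id]
```

With this table, a `hold` would be applied at `batch-system`, while a `checkpoint` falls back to `condor-g` because no later hop supports it.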

Revocation
› Leases
    Lease must be renewed periodically for delegation to remain valid
    Allows revocation during long-term failures
› What are good values for lease lifetime and update interval?
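The lease idea can be sketched with a small class. The `Lease` name, its methods, and the numbers (600 s lifetime, 200 s renewal interval) are assumptions chosen for the example; the slides leave good values as an open question.

```python
class Lease:
    """Time-limited delegation; times are plain numbers supplied by the caller."""

    def __init__(self, lifetime, now):
        self.lifetime = lifetime
        self.expires_at = now + lifetime

    def renew(self, now):
        # A renewal only succeeds while the lease is still valid.
        if not self.is_valid(now):
            return False
        self.expires_at = now + self.lifetime
        return True

    def is_valid(self, now):
        return now < self.expires_at


# Example numbers only: 600 s lifetime, renewed every 200 s.
lease = Lease(lifetime=600, now=0)
lease.renew(now=200)          # expires_at becomes 800
lease.renew(now=400)          # expires_at becomes 1000
# Renewals stop after t=400 (for example, the network is down).
still_valid = lease.is_valid(now=900)    # lease has not run out yet
expired = not lease.is_valid(now=1000)   # delegation is implicitly revoked
```

The trade-off is visible in the numbers: a longer lifetime tolerates longer outages before the job is revoked, while a shorter one reclaims control sooner at the cost of more renewal traffic.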

Error Handling and Debugging
› Many more places for things to go horribly wrong
› Need clear, simple error semantics
› Logs, logs, logs
    Have them everywhere

Current Status
› Done
    Mirroring
› In Progress
    Condor-G -> Condor-G delegation
        • User must specify hops
    Glide-in schedd
        • Set up by hand

Thank You!
› Questions?