Batch System Operation & Interaction with the Grid
LCG/EGEE Operations Workshop
May 25th 2005
Why a Batch Workshop at HEPiX?
Proposed after the last Operations Workshop.
Remember the complaints then?
– “ETT doesn’t work”
– “ETT is meaningless when fairsharing is in place”
– “The solution of a queue per VO while easy to implement now but is not a good or long term solution.”
– “The [ETT] algorithm was questioned and other proposals were given.”
The idea was to bring together site managers and the developers of Grid and local schedulers.
Workshop Aims
Understand how different batch scheduling systems are used at HEP sites
– Are there any commonalities?
How do sites see the Grid interface?
How would sites like to see the Grid interface?
What is the impact of the current interface?
How do developers of local and Grid level schedulers see the future?
How/can HEP site managers influence future developments?
Well attended (70-80)
– Definite interest in this area from site managers
See http://www.fzk.de/hepix
Agenda
Local Scheduler usage
– SLAC, RAL, LeSC, JLab, IN2P3, FNAL, DESY, CERN, BNL
– LSF, PBS, Torque/Maui, SGE (N1GE6), BQS, Condor
Impact of Grid on sites
– Jeff Templon overview (c.f. previous talk), BQS@IN2P3
Local scheduler view
– LSF, PBS, LoadLeveler, Condor, BQS
Grid Developments
– EGEE/BLAHP, GLUE
Common batch environment
– See earlier.
Site Presentations --- I
Site reports covered
– Brief overview of the available computing resources, showing (in)homogeneity of resources
– Queue configuration---what and why
– How do users select queues---cpu time alone or specifying other resources (e.g. memory, local disk space availability)
– Need for, and use of, "special" queues---for "production managers", sudden high priority work, other reasons.
» Question from LHCC referee: “If there is some urgent analysis, how can [gLite] send this to a special queue?”
– Level of resource utilisation
Site Presentations --- II
Overall, configurations and concerns were broadly equivalent across sites.
Concerns were around
– Scheduling
– Security
– Interface Scalability
These issues are covered in the next few slides.
Scheduling Issues
Local Load Scheduling: summary
Batch schedulers at local sites enable fine-grained control over heterogeneous systems and are used to enforce local policies on resource allocation and provide “SLA” for users (turnround time).
– Large sites have subdivision of user groups
Scheduling is by CPU time; some users also need to request
– minimum CPU capacity for server
– memory requirement
– available disk work space (/pool, /scratch, /tmp)
Sites want the Grid interface to use existing queue(s)
– NOT to create a queue per VO.
– EMPHATICALLY NOT to replicate queue structure per VO
Grid/Local interface problems
Jeff’s presentation!
In short
– Not enough information passed from the site to the Grid
– No information passed from the Grid to the site
Result:
– Queues at some sites whilst others sit empty
– Confused/frustrated site managers
– Inefficient behaviour as people work the system
» “Tragedy of the commons”
Should sites (be able to) enforce policies?
Sites are funded for particular tasks and need to show funding agencies and users that they are fulfilling their mission.
This is a Grid. Why does it matter if you are running jobs for X not Y? Y may be happily running jobs at another site.
My view:
– Sites need to understand and feel comfortable with the way they accept jobs from the Grid.
– If they are comfortable, account may be taken of global activity when setting local priorities.
– Let’s walk before we try to run…
Can/Should we fix this?
… or should we wait to see some general standard emerge?
Strong support from commercial people (especially Platform and Sun) for HEP to work out solutions to this problem.
– They are interested in what we do.
Standards bodies (GGF,…) won’t come up with any common solution soon.
– But this doesn’t mean HEP shouldn’t participate
» Raise profile of problems of interest to us
» Give practical input based on real-world experience.
How to fix?
Improve information available to Grid scheduler
– VO information added in GLUE schema (v1.2)
» Need volunteer per batch system to maintain dynamic plug-ins and the job manager. CERN will do this for LSF. Need other volunteers!
– But there is still an assumption of homogeneous resources at a site.
– There is a plan to start work on GLUE v2 in November
» No requirement for backwards compatibility.
» Discussion should start NOW!
But need to assess impact of v1.2 changes before rushing into anything.
Grid scheduler should pass job resource requirements to the local resource manager.
– Not yet. When? How?
– Needs normalisation… Does this need to be per VO?
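One way the normalisation point could work in practice is sketched below. This is only an illustration under stated assumptions: the workshop did not define a scheme, so the reference rating, the per-node rating, the queue names and the queue limits are all hypothetical. The idea is that a CPU-time request quoted for a reference machine is rescaled to the local node speed and then mapped onto an existing local queue, rather than onto a queue per VO.

```python
# Hypothetical existing local queues: name -> CPU time limit in seconds.
QUEUES = {"short": 3600, "medium": 28800, "long": 172800}


def normalised_to_local_seconds(requested_ref_seconds, node_rating,
                                reference_rating=1000.0):
    """Rescale CPU time quoted for a reference machine to a local node.

    Ratings are assumed to be "bigger is faster" benchmark scores; the
    value 1000.0 for the reference machine is an arbitrary assumption.
    """
    if node_rating <= 0 or reference_rating <= 0:
        raise ValueError("ratings must be positive")
    return requested_ref_seconds * reference_rating / node_rating


def pick_queue(queues, local_seconds):
    """Return the existing queue with the smallest limit that still fits.

    Returns None when no queue is long enough for the request.
    """
    fitting = [(limit, name) for name, limit in queues.items()
               if limit >= local_seconds]
    return min(fitting)[1] if fitting else None


if __name__ == "__main__":
    local = normalised_to_local_seconds(3600, node_rating=2000.0)
    print(local, pick_queue(QUEUES, local))
```

Whether the reference rating should be global or agreed per VO is exactly the open question raised on the slide; the sketch simply takes it as a parameter.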
Security
Security Issues
Sites are still VERY concerned about traceability of users.
Mechanisms seem to be in place to allow this, but sites have little practical experience.
– c.f. delays for CERN to block user systematically crashing worker nodes.
– Security group have doubts that sites are fulfilling obligations in terms of log retention.
– “Security Challenges” mooted; these may help increase confidence…
Whatever, it does NOT seem to be a good idea to have a portal handling user job requests and passing these on with a common certificate…
Interface Scalability
Interface Scalability
IN2P3 example: “GridJobManager asks job status once per minute (even for 15-hour jobs).
– 5000 queued jobs + 1000 running jobs = 100 queries/s”
Being solved by EGEE BLAHP
– Caches query response
But…
– A further example of the need for discussion between sites & developers (IN2P3 fixing this issue independently)
– Are there other similar issues out there?
» c.f. LSF targets:
Scalability: 5K hosts, 500K active jobs, 100 concurrent users, 1M completed jobs per day
Performance: >90% slot utilisation, 5s max command response time, 4kB memory/job, master failover in <5 mins
» What are targets for the CE? RB?
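The IN2P3 figure follows directly from the polling interval: 6000 jobs each polled once every 60 seconds gives 100 queries per second. The sketch below reproduces that arithmetic and shows, in a very simplified form, how caching the query response can cut the load on the batch system. It is not the BLAHP implementation; the class, the bulk query callable and the 60-second cache lifetime are all assumptions made for illustration.

```python
import time


def status_queries_per_second(queued_jobs, running_jobs, poll_interval_s=60.0):
    """Load on the batch system when every job is polled once per interval."""
    return (queued_jobs + running_jobs) / poll_interval_s


class CachedStatus:
    """Answer per-job status lookups from one bulk query per cache lifetime.

    bulk_query is a hypothetical callable returning {job_id: status} for
    all jobs in a single call to the batch system.
    """

    def __init__(self, bulk_query, ttl_s=60.0, clock=time.monotonic):
        self._bulk_query = bulk_query
        self._ttl_s = ttl_s
        self._clock = clock
        self._snapshot = {}
        self._fetched_at = None
        self.backend_calls = 0  # how often the batch system was really asked

    def status(self, job_id):
        now = self._clock()
        expired = (self._fetched_at is None
                   or now - self._fetched_at >= self._ttl_s)
        if expired:
            # Refresh the whole snapshot with a single bulk query.
            self._snapshot = dict(self._bulk_query())
            self._fetched_at = now
            self.backend_calls += 1
        return self._snapshot.get(job_id, "UNKNOWN")


if __name__ == "__main__":
    print(status_queries_per_second(5000, 1000))
```

With a cache like this, 6000 status lookups inside one cache lifetime cost the batch system a single bulk query instead of 6000 individual ones, at the price of status information that can be up to one lifetime old.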
Some other Topics
End-to-End Guarantees
The Condor talk raised many interesting points.
One in particular was the (in)ability of the overall system to offer end-to-end execution guarantees to the users.
Condor “glide-in”: pilot job submitted via the Grid which takes a job from a Condor queue.
Fair enough [modulo security…] for system managers PROVIDED pilot job expresses same resource requests as it advertises in a class-ad when it starts.
– Shouldn’t claim to be maximum possible length then run short job.
– Class ads and GLUE schema not so different: Both are ways of saying what a node/site can do in a way that can be used to express (and then match) requirements.
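The "express and then match" idea common to class ads and the GLUE schema can be illustrated with a toy matcher. This is a deliberately simplified sketch, not Condor ClassAd syntax and not the GLUE schema: the attribute names, the site names and the "every requirement is a minimum" rule are all assumptions made for the example.

```python
# Hypothetical adverts: what each site says it can offer.
ADVERTS = {
    "site_a": {"max_cpu_seconds": 86400, "memory_mb": 2048, "scratch_mb": 10000},
    "site_b": {"max_cpu_seconds": 3600, "memory_mb": 1024, "scratch_mb": 2000},
}


def matches(advert, requirements):
    """True if the advert meets every minimum listed in requirements.

    An attribute missing from the advert is treated as 0, so it never
    satisfies a positive requirement.
    """
    return all(advert.get(key, 0) >= minimum
               for key, minimum in requirements.items())


def matching_resources(adverts, requirements):
    """Names of all advertised resources that satisfy the requirements."""
    return [name for name, advert in adverts.items()
            if matches(advert, requirements)]


if __name__ == "__main__":
    job = {"max_cpu_seconds": 7200, "memory_mb": 1024}
    print(matching_resources(ADVERTS, job))
```

In this picture, the condition on pilot jobs is that the requirements a pilot submits through the Grid should be the same as what it later advertises for the real job it pulls in.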
Pre-emption & Virtualisation
Strong message from batch system developers that pre-emption is A GOOD THING.
With pre-emption schedulers can maximise throughput/resource usage by
– suspending many jobs to allow parallel job to run
– suspending long running jobs to provide quick turnround for priority jobs.
Interest in virtualisation as method to ease this
– Also discussed at last operations workshop as a way to ease handling of multiple (conflicting) requirements for OS versions.
– Something to watch.
How would (pre-empted) users like this?
– No guarantee of time to completion once job starts…
Push vs Pull
A false dichotomy
– Sites can manipulate pull model to create a local queue
Real issue is early vs. late allocation of task to resource
– Early: site resource utilisation maximised: a free cpu resource can be filled immediately with a job from the local queue
– Late: user doesn’t see job sent to site A just before a cpu becomes free at site B.
Questions:
– Long term, will most cpu resources be full?
– What do people want to maximise? Throughput or ?
» Efficient scheduling important anyway… transparency of grid/local interface will be key.
– Pre-emption, anyone?
Conclusion
The Service is the Challenge
Workshop Summary
Useful workshop. [IMHO…]
Good that there has been progress since the November workshop at CERN (GLUE schema update), but much is still to be done.
[Still] Need to increase dialogue between site managers and Grid [scheduler] developers
– Site managers know a lot about running services.
– Unfortunate that a meeting change created a clash and reduced scope for EGEE developers to participate in Karlsruhe discussions.
– A smaller session is pencilled in for HEPiX at SLAC, October 10th – 14th. More dialogue then?
Not too early to start thinking about GLUE v2!