
Page 1:

Scheduling under LCG at RAL

UK HEP Sysman, Manchester, 11th November 2004

Steve Traylen
s.traylen@rl.ac.uk

Page 2:

RAL, LCG, torque and Maui

• Observations of RAL within LCG vs. traditional batch.

• Various issues that arose and what was done to tackle them.

• Upcoming changes in the next release.

• Some items still to be resolved.

Page 3:

LCG Grid vs. Traditional Batch

• Observations from the LHCb vs. Atlas period earlier this year.

• Matters in common for the LCG Grid and batch:
  – RAL must provide 40% to Atlas, 30% to LHCb, … as dictated by GridPP.

• Differences between the LCG Grid and batch:
  – Batch: 4000 jobs queued for 400 job slots.
  – LCG: often < 5 jobs queued for 400 job slots.

Page 4:

LCG Grid vs. Traditional Batch

• Providing allocations is difficult with LHCb submitting at a faster rate.

• RAL only received LHCb jobs.

• The only solution with OpenPBS was to hard-limit LHCb:
  – Idle CPUs are a waste of money.
  – It is always better to give a VO its allocation as soon as possible.
  – LHCb jobs pile up due to the apparently free resource.
  – RAL becomes unattractive (via the ETT) to Atlas.

Page 5:

Queues per VO

• Many sites (CNAF, LIP, NIKHEF, …) moved to queues per VO.

• Advantages:
  – The estimated traversal time is calculated independently for each VO.
    • While LHCb jobs pile up, Atlas jobs are still attracted; hopefully there is always one queued job available.
  – Queue lengths can be customised.

• Disadvantages:
  – The farm must change just to fit into LCG.
  – Adding VOs becomes harder (see the qmgr sketch below).
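
As an illustration, queues per VO on OpenPBS/Torque are typically created with qmgr and tied to a VO's Unix group; a minimal sketch, assuming one pool group per VO (queue names and limits are examples, not RAL's configuration):

  # Create a dedicated execution queue for the atlas VO
  qmgr -c "create queue atlas"
  qmgr -c "set queue atlas queue_type = Execution"
  qmgr -c "set queue atlas acl_group_enable = True"
  qmgr -c "set queue atlas acl_groups = atlas"    # Unix group for the VO
  qmgr -c "set queue atlas resources_max.walltime = 48:00:00"
  qmgr -c "set queue atlas enabled = True"
  qmgr -c "set queue atlas started = True"

Adding a VO then means repeating all of this (and updating the information system), which is part of why adding VOs becomes harder.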

Page 6:

Queues per VO (2)

• The ETT calculation is rudimentary; it just increases as jobs are queued, on a per-queue basis.
  – RAL only gives 1 CPU to Zeus and 399 to Atlas.
    • The ETT calculation does not really reflect this.

• In fact RAL’s queues now have a zero FIFO component.

• But it still works: once Zeus jobs pile up they stop coming.

Page 7:

Take-up within LCG

Page 8:

CPU Scaling

• CPU variation:
  – Can now be removed within the batch farm by configuring pbs_mom to normalise CPU time (see the sketch after this list).
  – The normalised speed is published into the information system.

• Walltime scaling is more confusing:
  – RAL does it; we fairshare the whole farm on walltime.
  – However, what we advertise is a lie.
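
For illustration, pbs_mom in Torque can scale the CPU time and walltime it reports through multipliers in its mom_priv/config file; a minimal sketch, with the factor chosen per node relative to the reference CPU (the value here is only an example):

  # mom_priv/config on a worker node: normalise reported times
  # to the reference CPU (factor is an example for one node type)
  $cputmult 4.7
  $wallmult 4.7

With every node scaled to the same reference, published CPU limits mean the same thing regardless of which node runs the job.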

Page 9:

CPU Scaling and ETT

• Only at RAL is the scaling this extreme:
  – Normalised to a Pentium 450.
  – Nodes are scaled by factors from 4.7 to 5.0 at present.
  – So CPU limits and walltimes are very long (9 days).

• Once jobs were queued, RAL became very unattractive.

• We modified the info provider to make the “ETT” comparable to other sites.

• We will renormalise at some point soon.

Page 10:

OpenPBS to Maui/Torque

• OpenPBS (as in LCG today) hangs when one node crashes.
  – Torque is okay (most of the time).

• Torque is just a new version of OpenPBS maintained by www.supercluster.org.
  – No integration required.

• Active user community and mailing list.

• Well maintained; bug fixes and patches are accepted and added regularly.

• Maui is a more sophisticated scheduler, capable of fairshare for instance.

Page 11:

Fairshare with Maui

• The default is FIFO, and so the same as the default PBS scheduler.

• Maui supports fairshare on walltime (a maui.cfg sketch follows this list):
  – E.g., consider the last 7 days of operation.
  – Give Atlas 50%, CMS 20%.
  – Give lhcbsgm a huge priority but limit “them” to one job.
  – Reserve one CPU for a 10-minute queue.
    • Monitoring jobs.

• Maui will strive to reach these targets.
  – Tools exist to diagnose and understand why Maui is not doing what you hope for, allowing tuning.
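
These targets can be expressed as a maui.cfg fragment; a minimal sketch, assuming VOs map to Unix groups and a short queue named "short" (weights, windows and values are illustrative, not RAL's actual configuration):

  # Fairshare on consumed walltime (dedicated processor-seconds)
  FSPOLICY         DEDICATEDPS
  # One-day fairshare windows, considering the last 7 of them
  FSINTERVAL       24:00:00
  FSDEPTH          7
  # Let the fairshare component dominate job priority
  FSWEIGHT         100

  GROUPCFG[atlas]  FSTARGET=50
  GROUPCFG[cms]    FSTARGET=20

  # Huge priority for the experiment software manager, one job at a time
  USERCFG[lhcbsgm] PRIORITY=100000 MAXJOB=1

  # Standing reservation: one CPU kept free for the 10-minute queue
  SRCFG[monitor]   TASKCOUNT=1 CLASSLIST=short PERIOD=INFINITY

Diagnostic tools such as diagnose -f (fairshare usage) and checkjob are what let you see why Maui is not doing what you hope.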

Page 12:

Heterogeneous Clusters

• Many farms currently have mixed memory, local disk space, …

• Glue contains SubClusters, but they are currently identified only by the hostname:
    GlueSubClusterUniqueID=lcgce02.gridpp.rl.ac.uk
  So only one hardware type per CE is possible.

• The RB joins the SubCluster against a GlueCE object.

Page 13:

Heterogeneous Clusters (2)

• It would seem easy to describe a second SubCluster and use a unique key (a hypothetical LDIF sketch follows this list):
    GlueSubClusterUniqueID=lcgce02.gridpp.rl.ac.uk-bigmem

• Different GlueCEs can then join on this?
  – Does this work?
  – Information providers may need tweaking.
  – Will the RB do this? What else will break?
  – Can the JobManager support different attributes per queue to target nodes?
  – Advertising fake queues is possible.
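
Publishing such a second SubCluster in the GLUE 1.x LDAP schema might look roughly like this; a hypothetical sketch only, with an assumed BDII suffix and example attribute values, not a tested LCG configuration:

  # Hypothetical big-memory SubCluster entry (values are examples)
  dn: GlueSubClusterUniqueID=lcgce02.gridpp.rl.ac.uk-bigmem,GlueClusterUniqueID=lcgce02.gridpp.rl.ac.uk,mds-vo-name=local,o=grid
  objectClass: GlueSubCluster
  GlueSubClusterUniqueID: lcgce02.gridpp.rl.ac.uk-bigmem
  GlueHostMainMemoryRAMSize: 4096
  GlueChunkKey: GlueClusterUniqueID=lcgce02.gridpp.rl.ac.uk

A GlueCE for the big-memory queue would then need its keys pointing at this SubCluster rather than the default one, which is exactly where the "will the RB do this?" question bites.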

Page 14:

Future Possibilities

• One queue per VO per memory size per local disk space per time period … = a lot of queues.

• Some sites have only one queue and insist on users setting requirements per job (examples below).
  – It is a good idea within the batch farm.
  – The Resource Broker does not pass this on in the GRAM transfer; how much do we want this?
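
For a sense of what per-job requirements look like at each level, a sketch with illustrative resource names and values:

  # Directly to the batch system: per-job memory and time limits
  qsub -l mem=1gb,walltime=24:00:00 myjob.sh

  // The same constraint in an LCG JDL Requirements expression
  Requirements = other.GlueHostMainMemoryRAMSize >= 1024;

The open question on this slide is precisely that the first form is never generated from the second during the GRAM submission.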

Page 15:

Maui-based Info Provider

• The current info provider only interrogates PBS.

• PBS has no idea what is going to happen next.

• A Maui-based provider could calculate the ETT better (see the sketch below).

• But it may be difficult to port to LSF, BQS, …
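
For instance, such a provider could ask Maui directly when resources would become available, instead of counting queued jobs in PBS; a minimal sketch using standard Maui client commands (the class name and job id are examples):

  # Resources available now, and for how long, for jobs in class "atlas"
  showbf -c atlas

  # Maui's estimated start time for an already-queued job
  showstart 12345

Because these commands reflect Maui's own schedule, the estimate accounts for reservations and fairshare, which PBS alone cannot see.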

Page 16:

Conclusions

• Moving to torque and Maui is transparent.

• Maui and queues per VO will introduce more control of resources within a site and increase their occupancy.

• Major adjustments are needed to existing queue infrastructures.

• Heterogeneous cluster support within LCG…

• As LCG resources merge with other EGEE resources, new challenges arise, such as running parallel jobs in production.
  – Crossgrid?

Page 17:

References

• Maui and Torque homepages, including documentation:
  – http://www.supercluster.org/

• Maui/Torque RPMs appearing in LCG:
  – http://www.gridpp.rl.ac.uk/tb-support/faq/torque.html

• More Maui/Torque RPMs and a qstat cache mechanism:
  – http://www.dutchgrid.nl/Admin/nikhef/