Condor Usage at Brookhaven National Lab
Alexander Withers (talk given by Tony Chan)
RHIC Computing Facility
Condor Week - March 15, 2005
About Brookhaven National Lab
● One of a handful of Laboratories supported and managed by the U.S. gov’t through DOE.
● Multi-disciplinary Lab with 2,700+ employees, Physics being the largest department.
● Physics Dept. has its own computing division (30+ FTEs) to support physics (HEP) projects.
● RHIC (nuclear) and ATLAS (HEP) are largest projects currently being supported.
Computing Facility Resources
● Full service facility: central/distributed storage capacity, large Linux Farm, robotic system for data storage, data backup, etc.
● 6+ PB permanent tape storage capacity.
● 500+ TB central/distributed disk storage capacity.
● 1.4 million SpecInt2000 aggregate computing power in Linux Farm.
History of Condor at Brookhaven
● First looked at Condor in 2003 as a replacement for LSF and in-house batch software.
● Installed 6.4.7 in August 2003.
● Upgraded to 6.6.0 in February 2004.
● Upgraded to 6.6.6 (with 6.7.0 startd binary) in August 2004.
● User base grew from 12 (April 2004) to 50+ (March 2005).
The Rise in Condor Usage
[Figure: kCPU-hours (axis 0 to 1400) by month, Aug. through Mar. (est.); legend: ACF/RCF]
The Rise in Condor Usage
[Figure: avg. # of running jobs (axis 0 to 1800) by month, Aug. through Mar. (est.); legend: ACF/RCF]
Condor Cluster Usage
[Figure: avg. cluster usage in % (axis 0 to 35) by month, Aug. through Mar. (est.); legend: ACF/RCF]
BNL’s modified Condorview
Overview of Computing Resources
● Total of 2750 CPUs (growing to 3400+ in 2005).
● Two central managers with one acting as a backup.
● Three specialized submit machines which handle ~600 simultaneous jobs each on average.
● 131 of the execute nodes can also act as submission nodes.
● One monitoring/Condorview server.
Overview of Computing Resources, cont.
● Six GLOBUS gateway machines for remote job submission.
● Most machines run SL-3.0.2 on the x86 platform, some still using RH 7.3.
● Running 6.6.6 with 6.7.0 startd binary to take advantage of multiple VM feature.
Overview of Configuration
● Computing resources divided into 6 pools.
● Two configuration models:
– Split pool resources into two parts and restrict which jobs can run in each part.
– More complex version of the Bologna Batch System.
– A pool uses one or both of these models.
● Some pools employ user priority preemption.
● Use “drop queue” method to fill fast machines first.
● Have tools to easily reconfigure nodes.
● All jobs use vanilla universe (no checkpointing).
Two Part Model
● Nodes are assigned one of two tasks irrespective of Condor: analysis or reconstruction.
● Within Condor, a node advertises itself as either an analysis node or a reconstruction node.
● A job must advertise itself in the same manner to match with an appropriate node.
● Only certain users may run reconstruction jobs but anyone can run an analysis job.
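The matching rule on this slide can be sketched in a few lines of Python. This is only an illustration of the policy, not BNL's configuration and not Condor's ClassAd language: the class names, the `role` field, and the `RECONSTRUCTION_USERS` allow-list (with its example user) are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical allow-list; the slides do not say how authorized users are recorded.
RECONSTRUCTION_USERS = {"reco_operator"}


@dataclass
class Node:
    name: str
    role: str  # what the node advertises: "analysis" or "reconstruction"


@dataclass
class Job:
    owner: str
    role: str  # what the job advertises: "analysis" or "reconstruction"


def matches(job: Job, node: Node) -> bool:
    """Return True if the job may run on the node under the two-part model."""
    if job.role != node.role:
        return False  # job and node must advertise the same role
    if job.role == "reconstruction" and job.owner not in RECONSTRUCTION_USERS:
        return False  # only certain users may run reconstruction jobs
    return True  # anyone may run an analysis job on an analysis node


nodes = [Node("node01", "analysis"), Node("node02", "reconstruction")]
job = Job(owner="alice", role="analysis")
eligible = [n.name for n in nodes if matches(job, n)]
print(eligible)
```

In a real pool the same rule would presumably be expressed through machine and job ClassAd attributes plus requirements expressions; the sketch only shows the decision logic.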
Analysis/Reconstruction
[Diagram labels: Group 1 through Group 5; a Fast/Slow axis; vm1 and vm2; example job "Reconstruction Job: wants group <= 2"]
● No suspension
● No preemption
● Will start a job if CPU is free
A More Complex Version of the Bologna Model
● Dual-CPU nodes, each with 8 VMs.
● 2 VMs per CPU.
● Only two jobs running at a time.
● Four job categories, each with its own priority.
● A high priority VM will suspend a random VM of lower priority.
● The random aspect is to prevent the same VM from always getting suspended.
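The suspension rule above can be modeled with a short Python sketch. It is a toy simulation under stated assumptions, not the Condor configuration used at BNL: the category names are taken from the slide diagram, the ordering with MC as the lowest priority is an assumption, and `start_job` is a hypothetical helper.

```python
import random

# Assumed ordering from lowest to highest priority (category names from the diagram).
PRIORITY = {"MC": 0, "Low": 1, "Med": 2, "High": 3}
NUM_CPUS = 2  # only two jobs run at a time on a node


def start_job(running, suspended, category, rng=random):
    """Try to start a job of `category`, mutating both lists in place.

    `running` and `suspended` hold category names. Returns True if the job starts.
    """
    if len(running) < NUM_CPUS:
        running.append(category)  # a CPU is free, so just start
        return True
    # Indices of running jobs with strictly lower priority than the new job.
    victims = [i for i, c in enumerate(running) if PRIORITY[c] < PRIORITY[category]]
    if not victims:
        return False  # nothing of lower priority to suspend
    victim = rng.choice(victims)  # random pick, so the same VM is not always hit
    suspended.append(running.pop(victim))
    running.append(category)
    return True


running, suspended = ["MC", "Low"], []
started = start_job(running, suspended, "High")
print(started, running, suspended)
```

Choosing the victim at random among all lower-priority jobs, rather than always taking the lowest, is what spreads suspensions across VMs.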
Analysis/Reconstruction
[Diagram labels: Group 1 through Group 5; a Fast/Slow axis; a High Prio/Low Prio axis; VM categories MC (vm1/vm2), Low (vm3/vm4), Med (vm5/vm6), High (vm7/vm8); example job "Reconstruction Job: wants group == 3, Med. Priority (vm5/vm6)"]
● Low priority VMs suspended
● No preemption
● Will start a job if a CPU is free or the job is of higher priority
Issues We've Had to Deal With
● Tune parameters to alleviate scalability problems.
– MATCH_TIMEOUT
– MAX_CLAIM_ALIVES_MISSED
● Panasas (proprietary file system) creates kernel threads with whitespace in the process name. This broke an fscanf in procapi.C; Panasas fixed the bug.
● High-volume users can dominate pool, partially solved with PREEMPTION_REQUIREMENTS.
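To illustrate the kind of test a PREEMPTION_REQUIREMENTS expression performs, here is a minimal Python sketch. The slides do not give BNL's actual expression; the comparison below follows a commonly documented Condor pattern (`RemoteUserPrio > SubmittorPrio * 1.2`), and the 1.2 margin, function name, and sample values are assumptions. In Condor, a numerically larger user priority value means a worse priority.

```python
def may_preempt(remote_user_prio: float, submittor_prio: float,
                margin: float = 1.2) -> bool:
    """Return True if the waiting user may preempt the running user.

    remote_user_prio: priority value of the user whose job currently runs.
    submittor_prio:   priority value of the user whose job is waiting.
    margin:           assumed safety factor to avoid preempting back and forth.
    Larger values mean worse priority, so a heavy user accumulates a large value.
    """
    return remote_user_prio > submittor_prio * margin


# A heavy user (value 10.0) can be preempted by a light user (value 0.5).
print(may_preempt(remote_user_prio=10.0, submittor_prio=0.5))
# Users with equal priority values do not preempt each other.
print(may_preempt(remote_user_prio=1.0, submittor_prio=1.0))
```

This is consistent with the slide's "partially solved": preemption limits how long a dominant user holds machines, but it does not stop that user from filling idle slots first.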
Issues We’ve Had to Deal With, cont.
● DAGMan problems (latency, termination): switched from DAGMan to plain Condor.
● Created our own ClassAds and JobAds to create batch queues and handy management tools (i.e., our version of condor_off).
● Modified Condorview to meet our accounting & monitoring requirements.
Issues Not Yet Resolved
● Need a job ClassAd which gives the user's primary group, for better control over cluster usage.
● Transfer output files for debugging when job is evicted.
● Need option to force the schedd to release its claim after each job.
● Allow the schedd to set a mandatory periodic_remove policy to avoid manual cleanup.
Issues Not Yet Resolved, cont.
● Shadow seems to make a large number of NIS calls. Possible problem with caching address shadows in vanilla universe?
● Need Kerberos support to comply with security mandates.
● Interested in Condor on Demand (COD), but lack of functionality prevents more usage.
● Need more (and more effective) cluster management tools (does condor_off work?).
Near-Term Plans & Summary
● Waiting for 6.8.x series (late 2005?) to upgrade.
● Scalability concerns as usage rises.
● High availability more critical as usage rises.
● Integration of BNL Condor pools with external pools, but concerned about security.
● Need some functionalities listed above for a meaningful upgrade and to improve cluster management capability.