
Condor Usage at Brookhaven National Lab

Alexander Withers (talk given by Tony Chan)

RHIC Computing Facility

Condor Week - March 15, 2005

About Brookhaven National Lab

● One of a handful of Laboratories supported and managed by the U.S. gov’t through DOE.

● Multi-disciplinary Lab with 2,700+ employees, Physics being the largest department.

● Physics Dept. has its own computing division (30+ FTEs) to support physics (HEP) projects.

● RHIC (nuclear) and ATLAS (HEP) are largest projects currently being supported.

Computing Facility Resources

● Full service facility: central/distributed storage capacity, large Linux Farm, robotic system for data storage, data backup, etc.

● 6+ PB permanent tape storage capacity.

● 500+ TB central/distributed disk storage capacity.

● 1.4 million SpecInt2000 aggregate computing power in Linux Farm.

History of Condor at Brookhaven

● First looked at Condor in 2003 as a replacement for LSF and in-house batch software.

● Installed 6.4.7 in August 2003.

● Upgraded to 6.6.0 in February 2004.

● Upgraded to 6.6.6 (with 6.7.0 startd binary) in August 2004.

● User base grew from 12 (April 2004) to 50+ (March 2005).

The Rise in Condor Usage

[Figure: bar chart of kCPU-hours per month (y-axis 0 to 1400), Aug. through Mar. (est.), ACF/RCF.]

The Rise in Condor Usage

[Figure: bar chart of avg. # of running jobs per month (y-axis 0 to 1800), Aug. through Mar. (est.), ACF/RCF.]

Condor Cluster Usage

[Figure: bar chart of avg. cluster usage (%) per month (y-axis 0 to 35), Aug. through Mar. (est.), ACF/RCF.]

BNL’s modified Condorview

Overview of Computing Resources

● Total of 2750 CPUs (growing to 3400+ in 2005).

● Two central managers with one acting as a backup.

● Three specialized submit machines which handle ~600 simultaneous jobs each on average.

● 131 of the execute nodes can also act as submission nodes.

● One monitoring/Condorview server.

Overview of Computing Resources, cont.

● Six GLOBUS gateway machines for remote job submission.

● Most machines run SL-3.0.2 on the x86 platform, some still using RH 7.3.

● Running 6.6.6 with 6.7.0 startd binary to take advantage of multiple VM feature.

Overview of Configuration

● Computing resources divided into 6 pools.

● Two configuration models:

– Split pool resources into two parts and restrict which jobs can run in each part.

– More complex version of the Bologna Batch System.

– A pool uses one or both of these models.

● Some pools employ user priority preemption.

● Use “drop queue” method to fill fast machines first.

● Have tools to easily reconfigure nodes.

● All jobs use vanilla universe (no checkpointing).

Two Part Model

● Nodes are assigned one of two tasks irrespective of Condor: analysis or reconstruction.

● Within Condor, a node advertises itself as either an analysis node or a reconstruction node.

● A job must advertise itself in the same manner to match with an appropriate node.

● Only certain users may run reconstruction jobs but anyone can run an analysis job.
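The matching rule described above can be sketched in a few lines of Python. This is only an illustrative model, not the ClassAd expressions used at BNL: the names `Node`, `Job`, `RECONSTRUCTION_USERS`, and the user `reco_operator` are hypothetical, since the slides do not give the actual attribute names or the list of privileged users.

```python
from dataclasses import dataclass

# Hypothetical allow-list; the slides do not say which users may run reconstruction.
RECONSTRUCTION_USERS = {"reco_operator"}


@dataclass
class Node:
    name: str
    node_type: str  # advertised as "analysis" or "reconstruction"


@dataclass
class Job:
    owner: str
    job_type: str  # advertised as "analysis" or "reconstruction"


def job_allowed(job):
    """Anyone may run analysis; only listed users may run reconstruction."""
    if job.job_type == "analysis":
        return True
    return job.job_type == "reconstruction" and job.owner in RECONSTRUCTION_USERS


def matches(job, node):
    """A job matches only a node that advertises the same type."""
    return job_allowed(job) and job.job_type == node.node_type


def eligible_nodes(job, nodes):
    """Return the names of all nodes the job could be matched to."""
    return [n.name for n in nodes if matches(job, n)]
```

In a real pool both sides of this check would presumably live in the machine and job ClassAds, so the negotiator enforces it rather than a separate script.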

[Diagram: Analysis/Reconstruction pool layout. Labels: Groups 1 to 5, Fast/Slow, vm1/vm2. Example shown: a reconstruction job that wants group <= 2.]

● No suspension

● No preemption

● Will start a job if CPU is free

A More Complex Version of the Bologna Model

● Dual-CPU nodes, each with 8 VMs.

● 2 VMs per CPU.

● Only two jobs running at a time.

● Four job categories, each with its own priority.

● A high priority VM will suspend a random VM of lower priority.

● The random aspect is to prevent the same VM from always getting suspended.
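The suspension rule can be illustrated with a small Python model. It is a sketch under stated assumptions rather than the actual Condor startd policy: the class `BolognaNode` is hypothetical, the VM-to-category mapping follows the labels in the accompanying diagram, and the ordering MC < Low < Med < High is assumed, since the slides do not state where MC ranks.

```python
import random

# VM-to-category mapping as labelled in the diagram.
VM_CATEGORY = {
    "vm1": "mc", "vm2": "mc",
    "vm3": "low", "vm4": "low",
    "vm5": "med", "vm6": "med",
    "vm7": "high", "vm8": "high",
}
# Assumed ordering: a larger number means a higher priority.
PRIORITY = {"mc": 0, "low": 1, "med": 2, "high": 3}


class BolognaNode:
    """Toy model of one node: 8 VMs, but only `cpus` jobs run at once."""

    def __init__(self, cpus=2, rng=None):
        self.cpus = cpus
        self.running = []    # VMs whose job is currently running
        self.suspended = []  # VMs whose job has been suspended
        self.rng = rng if rng is not None else random.Random()

    def start(self, vm):
        """Try to start a job on `vm` and report what happened."""
        if vm in self.running or vm in self.suspended:
            return "busy"
        if len(self.running) < self.cpus:
            self.running.append(vm)
            return "started"
        rank = PRIORITY[VM_CATEGORY[vm]]
        lower = [v for v in self.running if PRIORITY[VM_CATEGORY[v]] < rank]
        if not lower:
            return "rejected"
        # Pick the victim at random so the same VM is not always suspended.
        victim = self.rng.choice(lower)
        self.running.remove(victim)
        self.suspended.append(victim)
        self.running.append(vm)
        return "suspended " + victim
```

For example, with jobs running on vm1 and vm3, a job arriving for vm7 suspends one of the two at random, while a job arriving for vm2 is not started because nothing running has a lower priority.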

[Diagram: Analysis/Reconstruction pool layout. Labels: Groups 1 to 5, Fast/Slow. VM categories, shown on a Low Prio/High Prio scale: MC (vm1/vm2), Low (vm3/vm4), Med (vm5/vm6), High (vm7/vm8). Example shown: a reconstruction job that wants group == 3, medium priority (vm5/vm6).]

● Low priority VMs suspended

● No preemption

● Will start a job if a CPU is free or the job is of higher priority

Issues We've Had to Deal With

● Tune parameters to alleviate scalability problems.

– MATCH_TIMEOUT

– MAX_CLAIM_ALIVES_MISSED

● Panasas (proprietary file system) creates kernel threads with whitespace in the process name, which breaks an fscanf in procapi.C. Panasas fixed the bug.

● High-volume users can dominate pool, partially solved with PREEMPTION_REQUIREMENTS.
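As a rough illustration of how a preemption condition can limit heavy users, the following Python function models a check of the kind a PREEMPTION_REQUIREMENTS expression might express. The function name, the one-hour minimum claim age, and the 1.2 factor are assumptions chosen for the example; the slides do not give the expression BNL actually used.

```python
def may_preempt(running_user_prio, candidate_user_prio,
                claim_age_s, min_claim_age_s=3600, factor=1.2):
    """Decide whether a waiting user's job may preempt a running one.

    Condor user priorities are numeric, and a larger value means a worse
    priority (heavy recent usage pushes the value up). Preemption is
    allowed here only if the running job has held its claim for a while
    and the running user's priority is worse than the candidate's by
    more than `factor`. Both thresholds are illustrative assumptions.
    """
    held_long_enough = claim_age_s > min_claim_age_s
    clearly_worse = running_user_prio > candidate_user_prio * factor
    return held_long_enough and clearly_worse
```

The minimum claim age and the margin both damp thrashing: without them, two users with nearly equal priorities could keep evicting each other's jobs.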

Issues We’ve Had to Deal With, cont.

● DAGMan problems (latency, termination): switched from DAGMan to plain Condor.

● Created our own ClassAds and JobAds to build batch queues and handy management tools (i.e., our version of condor_off).

● Modified Condorview to meet our accounting & monitoring requirements.

Issues Not Yet Resolved

● Need job ClassAd which gives user's primary group --> better control over cluster usage.

● Transfer output files for debugging when job is evicted.

● Need option to force the schedd to release its claim after each job.

● Allow schedd to set a mandatory periodic_remove policy to avoid manual cleanup.

Issues Not Yet Resolved, cont.

● Shadow seems to make a large number of NIS calls. Possible problem with caching address shadows in vanilla universe?

● Need Kerberos support to comply with security mandates.

● Interested in Condor on Demand (COD), but lack of functionality prevents more usage.

● Need more (and more effective) cluster management tools. Does condor_off work?

Near-Term Plans & Summary

● Waiting for 6.8.x series (late 2005?) to upgrade.

● Scalability concerns as usage rises.

● High availability more critical as usage rises.

● Integration of BNL Condor pools with external pools, but concerned about security.

● Need some functionalities listed above for a meaningful upgrade and to improve cluster management capability.
