Page 1: Dan Bradley University of Wisconsin-Madison Condor and DISUN Teams dan@hep.wisc.edu  Condor Administrator’s How-to

Dan Bradley
University of Wisconsin-Madison

Condor and DISUN Teams
dan@hep.wisc.edu

http://www.cs.wisc.edu/condor

Condor Administrator’s How-to

Dan, Condor Week 2008
www.cs.wisc.edu/condor

Where to Find the Online How-to Collection

1. Go to http://www.cs.wisc.edu/condor/

2. Click on “Condor Admin How-to Recipes”

Currently, that takes you here:

http://nmi.cs.wisc.edu/node/1465

Brief Overview of Selected Bits

Question

› How does Condor decide which job gets to run on an execute machine?

The Life of a Condor Job

[Diagram labels: condor_submit; schedd (job queue); startd (job executor); central manager (collector + negotiator); central manager 2; central manager 3 (collector + negotiator); flocking; machine ClassAd; job ClassAd; job runs]

First Stop: Authorization

› User must be authorized to submit to the schedd

ALLOW_WRITE = allow1, allow2, …
DENY_WRITE = deny1, deny2, …

Entry format: user@uid_domain/network

› By default, all authenticated users may submit jobs from within the trusted network

ALLOW_WRITE = */network
HOSTALLOW_WRITE = network (old style)

Next Stop: The Job Queue

› MAX_JOBS_RUNNING = 200

› Job priority = integer
  • orders a user’s jobs
  • higher priority will run sooner

Authorization of the Schedd to Join Pool

› ALLOW_ADVERTISE_SCHEDD
  DENY_ADVERTISE_SCHEDD
  • Default: ALLOW/DENY_DAEMON
    • Default: ALLOW/DENY_WRITE

› COLLECTOR_REQUIREMENTS
  • Default: true

Next Stop: Negotiator

Fair Share

• User priority
  • Inversely proportional to fair share

• Example: two users, 60 batch slots
  • priority 50 - gets 40 slots
  • priority 100 - gets 20 slots
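
The arithmetic behind the example can be checked with a short sketch. It assumes only what the slide states, that each user's share is inversely proportional to the priority value; the function name and user labels are made up for illustration and are not part of Condor.

```python
from fractions import Fraction

def fair_share(priorities, total_slots):
    """Split slots in inverse proportion to each user's priority value."""
    # A lower priority number means a larger weight, hence a larger share.
    weights = {user: Fraction(1, prio) for user, prio in priorities.items()}
    total_weight = sum(weights.values())
    return {user: total_slots * w / total_weight for user, w in weights.items()}

# Hypothetical user names; the numbers are the ones from the slide.
shares = fair_share({"user_a": 50, "user_b": 100}, 60)
# user_a (priority 50) gets 40 slots, user_b (priority 100) gets 20 slots
```

The weights are 1/50 and 1/100, a 2:1 ratio, so the 60 slots split 40 to 20.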

Fair Share Dynamics

› User priority changes over time
  • wants to be equal to number of slots in use

› Example: User steadily running 100 jobs: priority 100
  Stops running jobs:
  • 1 day later: priority 50
  • 2 days later: priority 25

› Configure speed of adjustment:
PRIORITY_HALFLIFE = 86400
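
A rough model of this behaviour, assuming the priority approaches the number of slots in use exponentially and ignoring any minimum value the accountant may enforce. The function name is hypothetical; only PRIORITY_HALFLIFE comes from the slide.

```python
PRIORITY_HALFLIFE = 86400  # seconds (one day), as on the slide

def decayed_priority(start, target, elapsed_seconds, halflife=PRIORITY_HALFLIFE):
    """Move `start` toward `target`, halving the remaining gap every half-life."""
    remaining = 0.5 ** (elapsed_seconds / halflife)
    return target + (start - target) * remaining

# User had priority 100, then stops running jobs (target: 0 slots in use).
one_day = decayed_priority(100, 0, 86400)        # 50.0
two_days = decayed_priority(100, 0, 2 * 86400)   # 25.0
```

A smaller PRIORITY_HALFLIFE makes past usage fade faster; a larger one gives the pool a longer memory.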

Modified Fair Share

› User Priority Factor
  • multiplies the “real user priority”
  • result is called “effective user priority”

› Example:
condor_userprio -setfactor [email protected] 4.0
condor_userprio -setfactor [email protected] 1.0
  • atlas steadily uses 10 slots - effective priority 40
  • cms steadily uses 20 slots - effective priority 20
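
As a worked check of the example, the sketch below assumes the real user priority has already settled at the number of slots in use (as described under Fair Share Dynamics). The function name is illustrative only.

```python
def effective_priority(real_priority, factor):
    """Effective user priority = real user priority * priority factor."""
    return real_priority * factor

# Steady state: real priority equals slots in use (assumption from the slides).
atlas = effective_priority(10, 4.0)  # 40.0
cms = effective_priority(20, 1.0)    # 20.0
```

Because a lower effective priority earns a larger share, the factor of 4.0 leaves atlas with a smaller share than cms even though atlas uses fewer slots.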

Reporting Condor Pool Usage

% condor_userprio -usage -allusers
Last Priority Update:  7/30 09:59
                                Accumulated       Usage              Last
User Name                       Usage (hrs)     Start Time        Usage Time
------------------------------  -----------  ----------------  ----------------
…
[email protected]                 599739.09   4/18/2006 14:37   7/30/2007 07:
[email protected]                 799300.91   4/03/2006 12:56   7/30/2007 09:
[email protected]                1029384.68   4/03/2006 12:56   7/30/2007 09:
[email protected]                2013058.70   4/03/2006 16:54   7/30/2007 09:59
------------------------------  -----------  ----------------  ----------------
Number of users: 271            8517482.95   4/03/2006 12:56   7/29/2007 10:00

› When upgrading Condor, preserve the central manager’s AccountantLog
  • Happens automatically if you follow the general rule: preserve Condor’s LOCAL_DIR

Matchmaking

› Job requirements and machine requirements must both be met

› Machine requirements are configured via the START expression

START = Owner == "appinstaller"

Adding to Job Requirements

APPEND_REQUIREMENTS = MY.Owner != "appinstaller" || TARGET.IsAppInstallerMachine =?= True

Adding Attribute to Machine ClassAd

IsAppInstallerMachine = True

STARTD_ATTRS = $(STARTD_ATTRS) IsAppInstallerMachine
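
Taken together, the START expression, the appended job requirement, and the advertised machine attribute form a two-sided match. The sketch below mimics that logic in Python with plain dicts standing in for ClassAds; it is not how Condor evaluates expressions. It assumes the START expression is set only on the app-installer machine and that every other machine accepts any job.

```python
def machine_start(machine, job):
    """Machine side: START = Owner == "appinstaller" on the dedicated machine."""
    if machine.get("IsAppInstallerMachine") is True:
        return job.get("Owner") == "appinstaller"
    return True  # assumption: other machines start any job

def job_requirements(job, machine):
    """Job side: MY.Owner != "appinstaller" || TARGET.IsAppInstallerMachine =?= True."""
    # `is True` mirrors =?=, which is False when the attribute is undefined.
    return (job.get("Owner") != "appinstaller"
            or machine.get("IsAppInstallerMachine") is True)

def matches(job, machine):
    """Both the job requirements and the machine requirements must be met."""
    return machine_start(machine, job) and job_requirements(job, machine)

installer_machine = {"IsAppInstallerMachine": True}
ordinary_machine = {}  # attribute not advertised
installer_job = {"Owner": "appinstaller"}
ordinary_job = {"Owner": "alice"}  # hypothetical user
```

With these definitions the appinstaller job matches only the dedicated machine, and ordinary jobs match only ordinary machines.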

Choosing Between Matching Machines

1. NEGOTIATOR_PRE_JOB_RANK
2. job rank expression
3. NEGOTIATOR_POST_JOB_RANK
4. PREEMPTION_RANK

Example

NEGOTIATOR_PRE_JOB_RANK = (IsDesktop =!= True && isUndefined(RemoteOwner)) + isUndefined(RemoteOwner)

› Most desirable to least:
  • 2 - unclaimed and not a desktop
  • 1 - unclaimed and desktop
  • 0 - claimed
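
To see how the expression yields 2, 1, or 0, here is a Python rendering of the same logic. Dicts stand in for machine ClassAds, a missing key stands in for an undefined attribute, and the sample machines are invented for illustration.

```python
def pre_job_rank(machine):
    """Mirror of the NEGOTIATOR_PRE_JOB_RANK expression on the slide."""
    unclaimed = "RemoteOwner" not in machine            # isUndefined(RemoteOwner)
    not_desktop = machine.get("IsDesktop") is not True  # IsDesktop =!= True
    # Booleans add up as 0/1, just as in the ClassAd expression.
    return int(not_desktop and unclaimed) + int(unclaimed)

# Hypothetical machine ads.
unclaimed_server = {}
unclaimed_desktop = {"IsDesktop": True}
claimed_server = {"RemoteOwner": "someone@example.org"}

ranks = [pre_job_rank(m) for m in (unclaimed_server, unclaimed_desktop, claimed_server)]
# ranks == [2, 1, 0]
```

A claimed machine scores 0 whether or not it is a desktop, since both terms require RemoteOwner to be undefined.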

Authorizing Schedd to Claim Startd

› ALLOW/DENY_WRITE

› It is the schedd which is authorized by the startd, not the user.

Preemption

Machine Rank

› Numerical expression: higher number preempts lower number
  • user priority is secondary to rank, because a higher rank job preempts the claim to the machine

› Example: CMS gets 1st prio, CDF gets 2nd, others 3rd
RANK = 2*(User == "[email protected]") + 1*(User == "[email protected]")

Another Rank Example

Rank = (Group =?= "LMCG") * (1000 + RushJob)

Note on Scope of Condor Policies

› pool-wide scope: example
  • negotiator user priorities, factors, etc.
  • preemption policy related to user priority
  • steering jobs via negotiator job rank

› execute machine/slot scope:
  • startd machine rank, requirements
  • preemption/suspension policy
  • customized machine ClassAd values

› submit machine scope
  • queue policy, automatic additions to job requirements, and insertion of arbitrary ClassAd attributes into job

› personal scope
  • environmental configurations: _CONDOR_<config val>=value

Preemption Policy

› Should Condor jobs yield to non-Condor activity on the machine?

› Should some types of jobs never be interrupted? After 4 days?

› Should some jobs immediately preempt others? After 30 minutes?

› Is suspension more desirable than killing?

› Can need for preemption be decreased by steering jobs towards the right machines?

Example Preemption Policy

When a claim is preempted, do not allow killing of jobs younger than 4 days old.

MaxJobRetirementTime = 3600 * 24 * 4

› Applies to all forms of preemption: user priority, machine rank, machine activity, graceful shutdown

Another Preemption Policy

› Expression can refer to attributes of batch slot and job, so can be highly customized.

MaxJobRetirementTime = 3600 * 24 * 4 * (OSG_VO =?= "uscms")
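
The multiplication by a boolean is what makes the policy selective. A Python rendering of the same arithmetic is shown below; the dict stands in for the job ClassAd, and the non-uscms job is a made-up example.

```python
FOUR_DAYS = 3600 * 24 * 4  # 345600 seconds

def max_job_retirement_time(job):
    """Four days for uscms jobs, zero for everything else."""
    # (OSG_VO =?= "uscms") is 1 only when the attribute is defined and equal;
    # an undefined OSG_VO gives 0 rather than an error.
    return FOUR_DAYS * int(job.get("OSG_VO") == "uscms")

uscms_job = max_job_retirement_time({"OSG_VO": "uscms"})  # 345600
other_job = max_job_retirement_time({"OSG_VO": "atlas"})  # 0
no_vo_job = max_job_retirement_time({})                   # 0
```

Jobs that evaluate to 0 get no retirement time, so they can be killed as soon as their claim is preempted.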

More Preemption Controls

› PREEMPTION_REQUIREMENTS controls user-priority based preemption at the level of the negotiator

› PREEMPT/SUSPEND controls preemption by machine activity (e.g. keyboard or cpu activity)

› RANK allows preemption by more desirable jobs

Preemption Policy Pitfall

› If you disable all forms of preemption, you probably want to limit lifespan of claims:

PREEMPTION_REQUIREMENTS = False
PREEMPT = False
RANK = 0
CLAIM_WORKLIFE = 3600

• Otherwise, reallocation of resources will not happen until a user runs out of matching jobs.

What Happens to Preempted Jobs?

› Back to idle in job queue
  • NumJobStarts >= 1

› job policy:
  periodic_hold, periodic_remove

› admin policy:
  SYSTEM_PERIODIC_HOLD
  SYSTEM_PERIODIC_REMOVE

Back to the Negotiator: Group Accounting

Fair Sharing Between Groups

• Useful when:
  • multiple user ids belong to same group
  • group’s share of pool is not tied to specific machines

# Example group settings
GROUP_NAMES = group_physics, group_chemistry

GROUP_QUOTA_group_physics = 200
GROUP_QUOTA_group_chemistry = 100
GROUP_AUTOREGROUP = True

GROUP_PRIO_FACTOR_group_physics = 10
GROUP_PRIO_FACTOR_group_chemistry = 10
DEFAULT_PRIO_FACTOR = 100

Setting Group Identity

• The job advertises its own group identity:

+AccountingGroup = "group_physics.dan"

(group name: group_physics; group user: dan)

• Anyone can declare any identity.
• This is not the unix/windows identity the job runs as.
• It is solely for accounting and prioritization purposes.

Monitoring Usage

% condor_userprio -usage -allusers
Last Priority Update:  7/30 09:59
                                Accumulated       Usage              Last
User Name                       Usage (hrs)     Start Time        Usage Time
------------------------------  -----------  ----------------  ----------------
…
[email protected]                 599739.09   4/18/2006 14:37   7/30/2007 07:
[email protected]                 799300.91   4/03/2006 12:56   7/30/2007 09:
[email protected]                1029384.68   4/03/2006 12:56   7/30/2007 09:
[email protected]                2013058.70   4/03/2006 16:54   7/30/2007 09:59
------------------------------  -----------  ----------------  ----------------
Number of users: 271            8517482.95   4/03/2006 12:56   7/29/2007 10:00

% condor_userprio -all -allusers

How do groups compete?

› Group using least share of its quota gets top priority in matchmaking.
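
One way to picture this ordering, using the quotas from the example group settings and invented usage numbers. This is only a sketch of the stated rule, not the negotiator's actual algorithm, and the function name is hypothetical.

```python
quotas = {"group_physics": 200, "group_chemistry": 100}  # from the example settings
in_use = {"group_physics": 150, "group_chemistry": 40}   # hypothetical current usage

def negotiation_order(quotas, in_use):
    """Order groups by the fraction of quota in use, smallest fraction first."""
    return sorted(quotas, key=lambda group: in_use.get(group, 0) / quotas[group])

order = negotiation_order(quotas, in_use)
# group_chemistry (40/100 = 0.40) comes before group_physics (150/200 = 0.75)
```

Note that group_physics runs more jobs in absolute terms but still goes second, because the comparison is relative to each group's quota.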

How do users within a group compete?

› Each group user has its own user priority

› Fair share between group members determined by the usual user priority mechanism

May Group Exceed its Quota?

› Yes, but only if

GROUP_AUTOREGROUP = True

OR, if undefined

GROUP_AUTOREGROUP_<groupname> = True

When Exceeding Quota, How do Users Compete?

› All non-group users plus group users trying to exceed their quota compete for remaining machines.

› The user priority of the group user (e.g. “group_physics.dan”) is used to determine fair share.
  • Can set default priority factor for all members of group:
    GROUP_PRIO_FACTOR_<groupname> = 10

The End of the Story

The Life of a Condor Job

[Diagram labels: condor_submit; schedd (job queue); startd (job executor); central manager (collector + negotiator); central manager 2; central manager 3 (collector + negotiator); flocking; machine ClassAd; job ClassAd; job runs]

Extending the Reach

› FLOCK_TO = <remote collector>
  • requires bi-directional connectivity
  • in Linux, can use GCB to connect private networks

› Grid Universe: Globus, Condor-C
  condor_glidein
  JobRouter

Trivia

› What’s the difference?

IsHighPrioUser = Owner == "dan"

1. RANK = IsHighPrioUser
2. RANK = $(IsHighPrioUser)

› case 1 needs:
STARTD_ATTRS = IsHighPrioUser
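
The difference comes down to when the name is resolved. The sketch below imitates config-time macro substitution with a small regular expression; it is a toy model, not Condor's config parser, and expand_macros is a made-up helper.

```python
import re

# Configuration as plain text values, as in the trivia slide.
config = {"IsHighPrioUser": 'Owner == "dan"'}

def expand_macros(value, config):
    """Replace each $(NAME) with its configured text, the way config macros expand."""
    return re.sub(r"\$\((\w+)\)", lambda m: config[m.group(1)], value)

case_1 = expand_macros("IsHighPrioUser", config)     # stays a ClassAd attribute reference
case_2 = expand_macros("$(IsHighPrioUser)", config)  # becomes the expression text itself
```

In case 1, RANK still refers to an attribute named IsHighPrioUser when it is evaluated, so that attribute has to be published in the machine ClassAd via STARTD_ATTRS. In case 2 the macro is replaced when the configuration is read, so RANK already contains the Owner comparison and nothing extra needs to be advertised.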

Where to Find the Online How-to Collection

1. Go to http://www.cs.wisc.edu/condor/

2. Click on “Condor Admin How-to Recipes”

Currently, that takes you here:

http://nmi.cs.wisc.edu/node/1465