Dan Bradley

University of Wisconsin-Madison

Condor and DISUN Teams

dan@hep.wisc.edu

http://www.cs.wisc.edu/condor

Condor Administrator’s How-to

Dan, Condor Week 2008, www.cs.wisc.edu/condor

Where to Find the Online How-to Collection

1. Go to http://www.cs.wisc.edu/condor/

2. Click on “Condor Admin How-to Recipes”

Currently, that takes you here:

http://nmi.cs.wisc.edu/node/1465

Brief Overview of Selected Bits

Question

› How does Condor decide which job gets to run on an execute machine?

The Life of a Condor Job

[Diagram. Labels: condor_submit; schedd (job queue); startd (job executor); central manager (collector + negotiator); central manager 2 and central manager 3 (collector + negotiator), reached by flocking; job ClassAd; machine ClassAd; job runs.]

First Stop: Authorization

› User must be authorized to submit to schedd

ALLOW_WRITE = allow1, allow2, …
DENY_WRITE = deny1, deny2, …

Entry format: user@uid_domain/network

› By default, all authenticated users may submit jobs within the trusted network

ALLOW_WRITE = */network
HOSTALLOW_WRITE = network (old style)

Next Stop: The Job Queue

› MAX_JOBS_RUNNING = 200

› Job priority = integer

• orders a user’s jobs: higher priority will run sooner

Authorization of the Schedd to Join Pool

› ALLOW_ADVERTISE_SCHEDD, DENY_ADVERTISE_SCHEDD

Default: ALLOW/DENY_DAEMON

• Default: ALLOW/DENY_WRITE

› COLLECTOR_REQUIREMENTS

Default: true

Next Stop: Negotiator

Fair Share

• User priority: inversely proportional to fair share

• Example: two users, 60 batch slots
• priority 50 - gets 40 slots
• priority 100 - gets 20 slots
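
The arithmetic behind the 40/20 split can be checked with a small sketch. It assumes only what the slide states, namely that each user's share is proportional to 1 / priority; the function name and the user names are made up for illustration and are not part of Condor.

```python
def fair_share_slots(priorities, total_slots):
    """Split total_slots among users in proportion to 1 / priority.

    Simplified model of the slide's rule; the real negotiator also
    considers which jobs actually match which machines.
    """
    weights = {user: 1.0 / prio for user, prio in priorities.items()}
    total_weight = sum(weights.values())
    return {user: total_slots * w / total_weight for user, w in weights.items()}


# Hypothetical users with the priorities from the slide.
shares = fair_share_slots({"user_a": 50, "user_b": 100}, total_slots=60)
for user, slots in shares.items():
    print(user, round(slots))
```

A lower priority value means a larger share, which is why the user at priority 50 ends up with twice as many slots as the user at priority 100.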

Fair Share Dynamics

› User priority changes over time: it tends toward the number of slots in use

› Example: user steadily running 100 jobs: priority 100. Stops running jobs:

• 1 day later: priority 50
• 2 days later: priority 25

› Configure speed of adjustment:

PRIORITY_HALFLIFE = 86400
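
A rough way to reproduce the 100, 50, 25 sequence is to treat the priority as decaying exponentially toward the number of slots in use, with PRIORITY_HALFLIFE as the half-life. This is a simplified model and an assumption on my part: the real accountant updates in discrete steps and may apply a minimum priority. The function name is hypothetical.

```python
PRIORITY_HALFLIFE = 86400  # seconds (one day), as on the slide


def priority_after(start_priority, slots_in_use, elapsed_seconds,
                   halflife=PRIORITY_HALFLIFE):
    """Move start_priority toward slots_in_use, halving the gap every halflife."""
    remaining_gap = 0.5 ** (elapsed_seconds / halflife)
    return slots_in_use + (start_priority - slots_in_use) * remaining_gap


# User at priority 100 stops running jobs (0 slots in use).
for days in (0, 1, 2):
    print(days, priority_after(100, 0, days * 86400))
```

A shorter half-life makes the pool forget past usage faster; a longer one makes heavy past usage count against a user for longer.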

Modified Fair Share

› User Priority Factor

• multiplies the “real user priority”
• result is called “effective user priority”

› Example:

condor_userprio -setfactor atlas@hep.wisc.edu 4.0
condor_userprio -setfactor cms@hep.wisc.edu 1.0

• atlas steadily uses 10 slots - effective priority 40
• cms steadily uses 20 slots - effective priority 20
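
The two effective priorities follow directly from "factor times real priority". The sketch below assumes that steady usage has brought each real priority to the number of slots in use, as described on the Fair Share Dynamics slide; the helper name is made up.

```python
def effective_priority(real_priority, factor):
    """Effective user priority = priority factor * real user priority."""
    return factor * real_priority


# (slots steadily in use, priority factor) taken from the slide's example.
users = {
    "atlas@hep.wisc.edu": (10, 4.0),
    "cms@hep.wisc.edu": (20, 1.0),
}
effective = {name: effective_priority(slots, factor)
             for name, (slots, factor) in users.items()}
print(effective)
```

Because a larger effective priority means a smaller share, the factor of 4.0 lets atlas hold only half as many slots as cms before the two are treated as equally deserving.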

Reporting Condor Pool Usage

% condor_userprio -usage -allusers
Last Priority Update:  7/30 09:59
                               Accumulated      Usage            Last
User Name                      Usage (hrs)    Start Time       Usage Time
------------------------------ ----------- ---------------- ----------------
…
osg_usatlas1@hep.wisc.edu        599739.09  4/18/2006 14:37  7/30/2007 07:24
jherschleb@lmcg.wisc.edu         799300.91  4/03/2006 12:56  7/30/2007 09:59
szhou@lmcg.wisc.edu             1029384.68  4/03/2006 12:56  7/30/2007 09:59
osg_cmsprod@hep.wisc.edu        2013058.70  4/03/2006 16:54  7/30/2007 09:59
------------------------------ ----------- ---------------- ----------------
Number of users: 271            8517482.95  4/03/2006 12:56  7/29/2007 10:00

› When upgrading Condor, preserve the central manager’s AccountantLog. This happens automatically if you follow the general rule: preserve Condor’s LOCAL_DIR

Matchmaking

› Job requirements and machine requirements must both be met

› Machine requirements are configured via the START expression

START = Owner == "appinstaller"

Adding to Job Requirements

APPEND_REQUIREMENTS = MY.Owner != "appinstaller" || TARGET.IsAppInstallerMachine =?= True

Adding Attribute to Machine ClassAd

IsAppInstallerMachine = True

STARTD_ATTRS = $(STARTD_ATTRS) IsAppInstallerMachine
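
The three settings above (START, APPEND_REQUIREMENTS, and the advertised IsAppInstallerMachine attribute) work together as a two-sided match. The following sketch mimics that logic with plain dictionaries standing in for ClassAds. It is an illustration only: the function names are made up, and it assumes that machines without the attribute use START = True.

```python
def machine_start(machine, job):
    """START on the installer machine: Owner == "appinstaller".

    Assumption: every other machine accepts any job (START = True).
    """
    if machine.get("IsAppInstallerMachine") is True:
        return job.get("Owner") == "appinstaller"
    return True


def job_requirements(job, machine):
    """MY.Owner != "appinstaller" || TARGET.IsAppInstallerMachine =?= True.

    The =?= comparison is modeled with `is True`, so a machine that does
    not advertise the attribute simply fails the test.
    """
    return (job.get("Owner") != "appinstaller"
            or machine.get("IsAppInstallerMachine") is True)


def matches(job, machine):
    """Both the job's and the machine's requirements must be met."""
    return job_requirements(job, machine) and machine_start(machine, job)


installer_machine = {"Name": "install-node", "IsAppInstallerMachine": True}
ordinary_machine = {"Name": "worker-node"}
installer_job = {"Owner": "appinstaller"}
ordinary_job = {"Owner": "dan"}

print(matches(installer_job, installer_machine))  # installer job on installer machine
print(matches(installer_job, ordinary_machine))   # installer job elsewhere
```

The net effect is that the installer's jobs run only on the flagged machine, and the flagged machine runs only the installer's jobs.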

Choosing Between Matching Machines

1. NEGOTIATOR_PRE_JOB_RANK
2. job rank expression
3. NEGOTIATOR_POST_JOB_RANK
4. PREEMPTION_RANK

Example

NEGOTIATOR_PRE_JOB_RANK = (IsDesktop =!= True && isUndefined(RemoteOwner)) + isUndefined(RemoteOwner)

› Most desirable to least:

• 2: unclaimed and not a desktop
• 1: unclaimed and desktop
• 0: claimed
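
To see how the expression produces the values 2, 1 and 0, here is a small sketch that evaluates the same logic on dictionaries standing in for machine ClassAds. It also assumes that the rank expressions are compared in the order listed on the previous slide, with later ones breaking ties; the helper names and machine names are hypothetical.

```python
def pre_job_rank(machine):
    """Mirror the slide's NEGOTIATOR_PRE_JOB_RANK for a machine ad in a dict."""
    unclaimed = "RemoteOwner" not in machine             # isUndefined(RemoteOwner)
    not_desktop = machine.get("IsDesktop") is not True   # IsDesktop =!= True
    return int(not_desktop and unclaimed) + int(unclaimed)


def order_candidates(machines, rank_functions):
    """Sort best-first, comparing the rank functions in the order given."""
    return sorted(machines,
                  key=lambda m: tuple(rank(m) for rank in rank_functions),
                  reverse=True)


machines = [
    {"Name": "claimed", "RemoteOwner": "dan@hep.wisc.edu"},
    {"Name": "desktop", "IsDesktop": True},
    {"Name": "server"},
]
best_first = order_candidates(machines, [pre_job_rank])
print([m["Name"] for m in best_first])
```

Passing additional functions in the list (for the job rank, post-job rank and preemption rank) would only change the order among machines that tie on the earlier ones.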

Authorizing Schedd to Claim Startd

› ALLOW/DENY_WRITE

› It is the schedd which is authorized by the startd, not the user.

Preemption

Machine Rank

› Numerical expression: higher number preempts lower number

• user priority is secondary to rank, because a higher-rank job preempts the claim to the machine

› Example: CMS gets 1st priority, CDF gets 2nd, others 3rd

RANK = 2*(User == "cms@hep.wisc.edu") + 1*(User == "cdf@hep.wisc.edu")
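
The RANK expression relies on boolean comparisons counting as 1 or 0. A minimal sketch of the same arithmetic, with hypothetical function names and a made-up third user, shows which submitter would displace which:

```python
def machine_rank(user):
    """Same arithmetic as the slide's RANK: True counts as 1, False as 0."""
    return 2 * (user == "cms@hep.wisc.edu") + 1 * (user == "cdf@hep.wisc.edu")


def would_preempt(candidate_user, running_user):
    """A candidate job displaces the running one only if its rank is strictly higher."""
    return machine_rank(candidate_user) > machine_rank(running_user)


print(machine_rank("cms@hep.wisc.edu"),
      machine_rank("cdf@hep.wisc.edu"),
      machine_rank("someone@example.org"))  # hypothetical third user
```

Equal ranks never preempt each other on this basis, so two CMS jobs are left to the user-priority mechanism instead.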

Another Rank Example

Rank = (Group =?= "LMCG") * (1000 + RushJob)

Note on Scope of Condor Policies

› pool-wide scope (for example, the negotiator)

• user priorities, factors, etc.
• preemption policy related to user priority
• steering jobs via negotiator job rank

› execute machine/slot scope

• startd machine rank, requirements
• preemption/suspension policy
• customized machine ClassAd values

› submit machine scope

• queue policy, automatic additions to job requirements, and insertion of arbitrary ClassAd attributes into job

› personal scope

• environmental configurations: _CONDOR_<config val>=value

Preemption Policy

› Should Condor jobs yield to non-Condor activity on the machine?

› Should some types of jobs never be interrupted? After 4 days?

› Should some jobs immediately preempt others? After 30 minutes?

› Is suspension more desirable than killing?

› Can need for preemption be decreased by steering jobs towards the right machines?

Example Preemption Policy

When a claim is preempted, do not allow killing of jobs younger than 4 days old.

MaxJobRetirementTime = 3600 * 24 * 4

› Applies to all forms of preemption: user priority, machine rank, machine activity, graceful shutdown

Another Preemption Policy

› Expression can refer to attributes of batch slot and job, so can be highly customized.

MaxJobRetirementTime = 3600 * 24 * 4 * (OSG_VO =?= "uscms")
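
The two MaxJobRetirementTime examples differ only in the boolean multiplier. The sketch below works through the numbers, assuming that OSG_VO is an attribute of the job ad and that the retirement time is measured from the start of the job's run; both function names are hypothetical.

```python
FOUR_DAYS = 3600 * 24 * 4  # seconds


def max_job_retirement_time(job):
    """3600 * 24 * 4 * (OSG_VO =?= "uscms"): four days for uscms jobs, else 0."""
    return FOUR_DAYS * int(job.get("OSG_VO") == "uscms")


def may_be_killed(job, seconds_running):
    """A preempted job may be killed once it has used up its retirement time."""
    return seconds_running >= max_job_retirement_time(job)


print(max_job_retirement_time({"OSG_VO": "uscms"}))
print(max_job_retirement_time({"OSG_VO": "other"}))
```

A job whose ad lacks OSG_VO gets a retirement time of 0 in this model, which matches the intent of using =?= rather than ==.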

More Preemption Controls

› PREEMPTION_REQUIREMENTS controls user-priority based preemption at the level of the negotiator

› PREEMPT/SUSPEND controls preemption by machine activity (e.g. keyboard or cpu activity)

› RANK allows preemption by more desirable jobs

Preemption Policy Pitfall

› If you disable all forms of preemption, you probably want to limit lifespan of claims:

PREEMPTION_REQUIREMENTS = False
PREEMPT = False
RANK = 0
CLAIM_WORKLIFE = 3600

• Otherwise, reallocation of resources will not happen until a user runs out of matching jobs.

What Happens to Preempted Jobs?

› Back to idle in job queue

• NumJobStarts >= 1

› job policy: periodic_hold, periodic_remove

› admin policy: SYSTEM_PERIODIC_HOLD, SYSTEM_PERIODIC_REMOVE

Back to the Negotiator: Group Accounting

Fair Sharing Between Groups

• Useful when:
• multiple user ids belong to same group
• group’s share of pool is not tied to specific machines

# Example group settings
GROUP_NAMES = group_physics, group_chemistry

GROUP_QUOTA_group_physics = 200
GROUP_QUOTA_group_chemistry = 100
GROUP_AUTOREGROUP = True

GROUP_PRIO_FACTOR_group_physics = 10
GROUP_PRIO_FACTOR_group_chemistry = 10
DEFAULT_PRIO_FACTOR = 100

Setting Group Identity

• The job advertises its own group identity:

+AccountingGroup = "group_physics.dan"

(group name: group_physics; group user: dan)

• Anyone can declare any identity.
• This is not the unix/windows identity the job runs as.
• It is solely for accounting and prioritization purposes.
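
The structure of the AccountingGroup value can be illustrated by splitting it into its two parts. This is a sketch under the assumption that the group name contains no dot, as in the slide's example; the helper name is made up and this is not how Condor itself parses the attribute.

```python
def split_accounting_group(value):
    """Split "group_physics.dan" into ("group_physics", "dan").

    Assumption: the text before the first dot is the group name and the
    rest is the group user.
    """
    group, dot, user = value.partition(".")
    if not dot or not group or not user:
        raise ValueError(f"expected '<group>.<user>', got {value!r}")
    return group, user


print(split_accounting_group("group_physics.dan"))
```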

Monitoring Usage

% condor_userprio -usage -allusers
Last Priority Update:  7/30 09:59
                                 Accumulated      Usage            Last
User Name                        Usage (hrs)    Start Time       Usage Time
------------------------------ ----------- ---------------- ----------------
…
group_physics.atlas@hep.wisc.edu   599739.09  4/18/2006 14:37  7/30/2007 07:24
group_physics.cms@hep.wisc.edu     799300.91  4/03/2006 12:56  7/30/2007 09:59
group_chemistry.han@che.wisc.edu  1029384.68  4/03/2006 12:56  7/30/2007 09:59
group_chemistry.ben@che.wisc.edu  2013058.70  4/03/2006 16:54  7/30/2007 09:59
------------------------------ ----------- ---------------- ----------------
Number of users: 271              8517482.95  4/03/2006 12:56  7/29/2007 10:00

% condor_userprio -all -allusers

How do groups compete?

› Group using least share of its quota gets top priority in matchmaking.
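
One way to picture this rule is to order the groups by the fraction of their quota currently in use. The sketch below does that with the quotas from the earlier example settings; the usage numbers are invented for illustration and the function name is hypothetical.

```python
def negotiation_order(quotas, slots_in_use):
    """Order groups so the one using the smallest fraction of its quota comes first."""
    return sorted(quotas, key=lambda group: slots_in_use.get(group, 0) / quotas[group])


quotas = {"group_physics": 200, "group_chemistry": 100}
slots_in_use = {"group_physics": 150, "group_chemistry": 50}  # made-up usage

print(negotiation_order(quotas, slots_in_use))
```

Here physics is at 75% of its quota and chemistry at 50%, so chemistry is considered first even though it holds fewer slots.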

How do users within a group compete?

› Each group user has its own user priority

› Fair share between group members determined by the usual user priority mechanism

May Group Exceed its Quota?

› Yes, but only if

GROUP_AUTOREGROUP = True

OR, if undefined

GROUP_AUTOREGROUP_<groupname> = True

When Exceeding Quota, How do Users Compete?

› All non-group users plus group users trying to exceed their quota compete for remaining machines.

› The user priority of the group user (e.g. “group_physics.dan”) is used to determine fair share.

• Can set default priority factor for all members of group:

GROUP_PRIO_FACTOR_<groupname> = 10

The End of the Story

The Life of a Condor Job

[Diagram. Labels: condor_submit; schedd (job queue); startd (job executor); central manager (collector + negotiator); central manager 2 and central manager 3 (collector + negotiator), reached by flocking; job ClassAd; machine ClassAd; job runs.]

Extending the Reach

› FLOCK_TO = <remote collector>

• requires bi-directional connectivity
• in Linux, can use GCB to connect private networks

› Grid Universe: Globus, Condor-C

• condor_glidein
• JobRouter

Trivia

› What’s the difference?

IsHighPrioUser = Owner == "dan"

1. RANK = IsHighPrioUser
2. RANK = $(IsHighPrioUser)

› case 1 needs:

STARTD_ATTRS = IsHighPrioUser
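
The difference comes down to when the name is resolved. In case 2 the $(...) reference is replaced with the macro's text while the configuration is read; in case 1 the bare name is left in the expression and has to be found as an attribute when the expression is evaluated. The sketch below imitates only the text substitution step with a simple single-pass regex; it is an illustration with a made-up function name, not Condor's actual configuration parser.

```python
import re

MACRO_REFERENCE = re.compile(r"\$\((\w+)\)")


def expand_macros(value, config):
    """Replace each $(NAME) with the text of config[NAME] (single pass)."""
    return MACRO_REFERENCE.sub(lambda match: config[match.group(1)], value)


config = {"IsHighPrioUser": 'Owner == "dan"'}

case_1 = expand_macros("IsHighPrioUser", config)     # bare name: left untouched
case_2 = expand_macros("$(IsHighPrioUser)", config)  # macro: text substituted
print(case_1)
print(case_2)
```

After substitution, case 2 no longer mentions IsHighPrioUser at all, which is why only case 1 needs the attribute to be published through STARTD_ATTRS.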

Where to Find the Online How-to Collection

1. Go to http://www.cs.wisc.edu/condor/

2. Click on “Condor Admin How-to Recipes”

Currently, that takes you here:

http://nmi.cs.wisc.edu/node/1465
