an overview of the portable batch system


Post on 01-Feb-2016


Page 1: An Overview of the Portable Batch System

An Overview of the Portable Batch System

Gabriel Mateescu

National Research Council Canada

I M S B

[email protected]

www.sao.nrc.ca/~gabriel/presentations/sgi_pbs

Page 2: An Overview of the Portable Batch System

Outline

• PBS highlights

• PBS components

• Resources managed by PBS

• Choosing a PBS scheduler

• Installation and configuration of PBS

• PBS scripts and commands

• Adding preemptive job scheduling to PBS

Page 3: An Overview of the Portable Batch System

PBS Highlights

• Developed by Veridian / MRJ

• Robust, portable, effective, extensible batch job queuing and resource management system

• Supports different schedulers

• Supports heterogeneous clusters

• Open PBS - open source version

• PBS Pro - commercial version

Page 4: An Overview of the Portable Batch System

Recent Versions of PBS

• PBS 2.2, November 1999:

– both the FIFO and SGI schedulers have bugs in enforcing resource limits

– poor support for stopping & resuming jobs

• OpenPBS 2.3, September 2000:

– better FIFO scheduler: resource limits enforced, backfilling added

• PBS Pro 5.0, September 2000:

– claims support for job stopping/resuming, better scheduling, IRIX cpusets

Page 5: An Overview of the Portable Batch System

Resources managed by PBS

• PBS manages jobs, CPUs, memory, hosts and queues

• PBS accepts batch jobs, enqueues them, runs the jobs, and delivers output back to the submitter

• Resources - describe attributes of jobs, queues, and hosts

• Scheduler - chooses the jobs that fit within queue and cluster resources

Page 6: An Overview of the Portable Batch System

Main Components of PBS

• Three daemons:

– pbs_server: server

– pbs_sched: scheduler

– pbs_mom: job executor & resource monitor

• The server accepts commands and communicates with the daemons:

– qsub - submit a job

– qstat - view queue and job status

– qalter - change a job’s attributes

– qdel - delete a job

Page 7: An Overview of the Portable Batch System

Batch Queuing

Node (CPUs + memory)

Queue A Queue B

SGI Origin System

Job exclusive scheduling

Page 8: An Overview of the Portable Batch System

Resource Examples

• ncpus number of CPUs per job

• mem resident memory per job

• pmem per-process memory

• vmem virtual memory per job

• cput CPU time per job

• walltime real time per job

• file file size per job

Page 9: An Overview of the Portable Batch System

Resource limits

• resources_max - per job limit for a resource; determines whether a job fits in a queue

• resources_default - default amount of a resource assigned to a job

• resources_available - advice to the scheduler on how much of a resource can be used by all running jobs

Page 10: An Overview of the Portable Batch System

Choosing a Scheduler (1)

• FIFO scheduler:

– First-fit placement: enqueues a job in the first queue where it may fit even if it does not currently fit there and there is another queue where it will fit

– Supports per job and (in version 2.3) per queue resource limits: ncpus, mem

– Supports per server limits on the number of CPUs and on memory (based on the server attribute resources_available)

Page 11: An Overview of the Portable Batch System

Choosing a Scheduler (2)

• Algorithms in FIFO scheduler

– FIFO - sort jobs by queuing time, running the earliest job first

– Backfill - relax the FIFO rule for parallel jobs, as long as out-of-order jobs do not delay jobs that precede them in FIFO order

– Fair share: sort & schedule jobs based on past usage of the machine by the job owners

– Round-robin - pick a job from each queue

– By key - sort jobs by a set of keys: shortest_job_first, smallest_memory_first
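The ordering policies listed above can be illustrated with a minimal Python sketch. The Job fields and function names are hypothetical and are not part of PBS. The sketch assumes that shortest_job_first orders jobs by requested CPU time and that smallest_memory_first orders them by requested memory.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    queued_at: int  # submission time in seconds (illustrative)
    cput: int       # requested CPU time in seconds
    mem_mb: int     # requested memory in MB

def fifo_order(jobs):
    """Earliest-queued job first."""
    return sorted(jobs, key=lambda j: j.queued_at)

def by_key_order(jobs):
    """Shortest job first; ties broken by smallest memory first."""
    return sorted(jobs, key=lambda j: (j.cput, j.mem_mb))

jobs = [
    Job("a", queued_at=0, cput=3600, mem_mb=512),
    Job("b", queued_at=10, cput=600, mem_mb=256),
    Job("c", queued_at=20, cput=600, mem_mb=128),
]
fifo_names = [j.name for j in fifo_order(jobs)]
key_names = [j.name for j in by_key_order(jobs)]
print(fifo_names, key_names)
```

Python's sort is stable, so jobs with equal keys keep their relative FIFO order.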

Page 12: An Overview of the Portable Batch System

Choosing a Scheduler (3)

• FIFO scheduler supports round-robin load balancing as of version 2.3

• FIFO scheduler:

– allows decoupling a job’s requirement on the number of CPUs from its requirement on the amount of memory

– simple first-fit placement may force the user to specify an execution queue for a job that could fit in more than one queue

Page 13: An Overview of the Portable Batch System

Choosing a Scheduler (4)

• SGI scheduler:

– supports FIFO, fair share, backfilling, and attempts to avoid job starvation

– supports both per job limits and per queue limits on number of CPUs, memory

– per server limit is the number of node cards

– makes a best effort in choosing a queue where to run a job. A job not having enough resources to run is kept in the submit queue

– ties the number of CPUs allocated to the memory allocated per job

Page 14: An Overview of the Portable Batch System

Resource allocation

• SGI scheduler allocates nodes:

node = [ PE_PER_NODE cpus, MB_PER_NODE Mbyte ]

• Number of nodes N for a job is such that

[ ncpus, mem ] <= [ N * PE_PER_NODE, N * MB_PER_NODE ]

where ncpus and mem are the job’s CPU and memory limits, specified, e.g., with #PBS -l mem

• Job attributes Resource_List.{ncpus, mem} are set to

Resource_List.ncpus = N * PE_PER_NODE

Resource_List.mem = N * MB_PER_NODE
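The rounding described on this slide can be worked through in a short Python sketch. It assumes the scheduler picks the smallest N that satisfies both inequalities and never allocates fewer than one node. The constants match the values shown for toolkit.h elsewhere in this talk, and the function names are hypothetical.

```python
import math

PE_PER_NODE = 2    # CPUs per node
MB_PER_NODE = 512  # memory per node, in MB

def nodes_needed(ncpus, mem_mb):
    """Smallest N with ncpus <= N*PE_PER_NODE and mem_mb <= N*MB_PER_NODE."""
    by_cpu = math.ceil(ncpus / PE_PER_NODE)
    by_mem = math.ceil(mem_mb / MB_PER_NODE)
    return max(by_cpu, by_mem, 1)

def rounded_request(ncpus, mem_mb):
    """Resource_List values after rounding up to whole nodes."""
    n = nodes_needed(ncpus, mem_mb)
    return {"nodes": n, "ncpus": n * PE_PER_NODE, "mem_mb": n * MB_PER_NODE}

alloc = rounded_request(ncpus=1, mem_mb=1200)
print(alloc)
```

A job asking for 1 CPU and 1200 MB is charged for 3 nodes, that is 6 CPUs and 1536 MB, which is the coupling between CPUs and memory noted for the SGI scheduler.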

Page 15: An Overview of the Portable Batch System

Queue and Server Limits

• FIFO scheduler:

– per job limits (ncpus, mem) are defined by resources_max queue attributes

– as of version 2.3, resources_max also defines per queue limits

– per server resource limits enforced with resources_available attributes

Page 16: An Overview of the Portable Batch System

Queue and Server Limits

• SGI scheduler:

– per job limits (ncpus, mem) are defined by resources_max queue attributes

– resources_max also defines per queue limits

– per server limit is given by the number of Origin node cards. Unlike the FIFO scheduler, resources_available limits are not enforced

Page 17: An Overview of the Portable Batch System

Job enqueuing (1)

• The scheduler places each job in some queue

• This involves several tests for resources

• Which queue a job is enqueued into depends on

– what limits are tested

– first-fit versus best-fit placement

• A job can fit in a queue if the resources requested by the job do not exceed the maximum value of the resources defined for the queue. For example, for the resource ncpus:

Resource_List.ncpus <= resources_max.ncpus

Page 18: An Overview of the Portable Batch System

Job enqueuing (2)

• A job fits in a queue if the amount of resources assigned to the queue plus the requested resources do not exceed the maximum amount of resources for the queue. For example, for ncpus:

resources_assigned.ncpus + Resource_List.ncpus <= resources_max.ncpus

• A job fits in the system if the sum of all assigned resources plus the requested resources does not exceed the available resources. For example, for the ncpus resource:

Σ resources_assigned.ncpus + Resource_List.ncpus <= resources_available.ncpus
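The three tests (can fit in a queue, fits in a queue now, fits in the system) can be written as a small Python sketch for a single resource such as ncpus. The function names are hypothetical. The arguments stand for Resource_List, resources_max, resources_assigned and resources_available.

```python
def can_fit_in_queue(requested, queue_max):
    """Resource_List.x <= resources_max.x"""
    return requested <= queue_max

def fits_in_queue_now(requested, queue_assigned, queue_max):
    """resources_assigned.x + Resource_List.x <= resources_max.x"""
    return queue_assigned + requested <= queue_max

def fits_in_system(requested, assigned_per_queue, available):
    """Sum of resources_assigned.x over queues + Resource_List.x
    <= resources_available.x"""
    return sum(assigned_per_queue) + requested <= available

# A 2-CPU job against a 4-CPU queue that already has 3 CPUs assigned.
checks = (
    can_fit_in_queue(2, 4),
    fits_in_queue_now(2, 3, 4),
    fits_in_system(2, [3, 1], 8),
)
print(checks)
```

In this example the job can fit in the queue in principle and the system has room, but the queue is currently too busy, so the middle test fails.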

Page 19: An Overview of the Portable Batch System

First fit versus best fit

• The FIFO scheduler finds the first queue where a job can fit and dispatches the job to that queue

– if the job does not actually fit, it will wait for the requested resources in the execution queue

• The SGI scheduler keeps the job in the submit queue until it finds an execution queue where the job fits, then dispatches the job to that queue

• If queues are defined to have monotonically increasing resource limits (e.g., CPU time), then first fit is not a penalty.

• However, if a job can fit in several queues, then the SGI scheduler will find a better schedule
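The difference between the two placement policies can be sketched in Python. This is a simplified model, not the schedulers' actual code: the queue dictionaries and function names are hypothetical, and the SGI-style policy is approximated as "first queue where the job fits right now".

```python
def first_fit(queues, requested):
    """FIFO-style: first queue whose per-job limit admits the request,
    even if the queue is currently too busy to run the job."""
    for q in queues:
        if requested <= q["max"]:
            return q["name"]
    return None

def first_fit_now(queues, requested):
    """SGI-style approximation: first queue where the job fits right now;
    None means the job stays in the submit queue."""
    for q in queues:
        if q["assigned"] + requested <= q["max"]:
            return q["name"]
    return None

queues = [
    {"name": "hpc", "max": 4, "assigned": 3},
    {"name": "back", "max": 4, "assigned": 0},
]
fifo_choice = first_fit(queues, 2)
sgi_choice = first_fit_now(queues, 2)
print(fifo_choice, sgi_choice)
```

Here the first-fit policy sends the 2-CPU job to hpc, where it must wait, while the second policy sends it to back, where it can start immediately.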

Page 20: An Overview of the Portable Batch System

Limits on the number of running jobs

• Per queue and per server limits on the number of running jobs:

– max_running

– max_user_run, max_group_run: max number of running jobs per user or group

• Unlike the FIFO scheduler, the SGI scheduler enforces these limits only on a per queue basis

– It enforces MAX_JOBS from the scheduler config file - substitute for max_running

Page 21: An Overview of the Portable Batch System

SGI Origin Install (1)

• Source files under OpenPBS_v2_3/src

• Consider the SGI scheduler

• Make sure the machine-dependent values defined in scheduler.cc/samples/sgi_origin/toolkit.h match the actual machine hardware

#define MB_PER_NODE ((size_t) 512*1024*1024)
#define PE_PER_NODE 2

• May set PE_PER_NODE = 1 to allocate half-nodes if MB_PER_NODE is set accordingly

Page 22: An Overview of the Portable Batch System

SGI Origin Install (2)

• Bug fixes in scheduler.cc/samples/sgi_origin/pack_queues.c

• Operator precedence bug (line 198):

for (qptr = qlist; qptr != NULL; qptr = qptr->next) {
    if ((qptr->queue->flags & QFLAGS_FULL) == 0) {
        // bad operator precedence bypasses this function
        if (!schd_evaluate_system(...)) {
            // DONT_START_JOB (0) so don't change allfull
            continue;
        }
        // ...
    }
}
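The precedence pitfall behind this fix can be illustrated in Python, assuming the original condition lacked the inner parentheses around the mask test. In C, == binds tighter than &, whereas in Python it does not, so the sketch emulates the C parse explicitly. The flag value is hypothetical.

```python
QFLAGS_FULL = 0x4  # hypothetical flag value, for illustration only

def buggy_c_test(flags):
    # How a C compiler parses `flags & QFLAGS_FULL == 0`:
    # the comparison is evaluated first and yields 0 or 1,
    # and that result is then ANDed with the flags.
    return bool(flags & int(QFLAGS_FULL == 0))

def intended_test(flags):
    # `(flags & QFLAGS_FULL) == 0`: true when the queue is not full.
    return (flags & QFLAGS_FULL) == 0

results = [(buggy_c_test(f), intended_test(f)) for f in (0x0, 0x4)]
print(results)
```

With a non-zero flag constant the emulated C expression is always false, so the body of the if, including the call to schd_evaluate_system(), would never run.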

Page 23: An Overview of the Portable Batch System

SGI Origin Install (3)

• Fix of a logical bug in pack_queues.c: if a system limit is exceeded, the scheduler should not try to schedule the job

for (qptr = qlist; qptr != NULL; qptr = qptr->next) {
    if ((qptr->queue->flags & QFLAGS_FULL) == 0) {
        if (!schd_evaluate_system(...)) {
            // DONT_START_JOB (0) so don't change allfull
            continue;
        }
        // ...
    }
}

for (qptr = (allfull) ? NULL : qlist; qptr != NULL; qptr = qptr->next) {
    // if allfull set, do not attempt to schedule
}

Page 24: An Overview of the Portable Batch System

SGI Origin Install (4)

• Fix of a logical bug in user_limits.c, function user_running()

• This function counts the number of running jobs, so it must test for equality between job status and 'R'

user_running(...)
{
    for (job = queue->jobs; job != NULL; job = job->next) {
        if ((job_state == 'R') && (!strcmp(job->owner, user)))
            jobs_running++;
        // ...
    }
}
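The intended counting logic can be restated as a short Python sketch. The (state, owner) tuples stand in for the scheduler's job list and are hypothetical.

```python
def user_running(jobs, user):
    """Count jobs in state 'R' (running) that belong to the given user."""
    return sum(1 for state, owner in jobs if state == "R" and owner == user)

jobs = [("R", "bob"), ("Q", "bob"), ("R", "alice"), ("R", "bob")]
count = user_running(jobs, "bob")
print(count)
```

Only the two running jobs owned by bob are counted; his queued job and the job owned by alice are ignored.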

Page 25: An Overview of the Portable Batch System

SGI Origin Install (5)

• The limit ncpus is not enforced in the function mom_over_limit(), located in the file mom_mach.c under the directory src/resmom/irix6array

#define SGI_ZOMBIE_WRONG 1

int mom_over_limit(...) {
    // ...
#if !defined(SGI_ZOMBIE_WRONG)
    return (TRUE);
#endif
    // ...
}

Page 26: An Overview of the Portable Batch System

SGI Origin Install (6)

Script to run the configure command

#!/bin/csh -f

set PBS_HOME=/usr/local/pbs
set PBS_SERVER_HOME=/usr/spool/pbs

# Select SGI or FIFO scheduler
set SCHED="--set-sched-code=sgi_origin --enable-nodemask"
#set SCHED="--set-sched-code=fifo --enable-nodemask"

$HOME/PBS/OpenPBS_v2_3/configure \
    --prefix=$PBS_HOME \
    --set-server-home=$PBS_SERVER_HOME \
    --set-cc=cc --set-cflags="-Dsgi -D_SGI_SOURCE -64 -g" \
    --set-sched=cc $SCHED --enable-array --enable-debug

Page 27: An Overview of the Portable Batch System

SGI Origin Install (7)

# cd /usr/local/pbs
# makePBS        (the script from the previous slide)
# make
# make install
# cd /usr/spool/pbs

Files under sched_priv: config, decay_usage

Page 28: An Overview of the Portable Batch System

Configuring for SGI scheduler

• Queue types

– one submit queue

– one or several execution queues

• Per server limit on the number of running jobs

• Load control

• Fair share scheduling

– Past usage of the machine is used in ranking the jobs

– Decayed past usage per user is kept in sched_priv/decay_usage

• Scheduler restart action

• PBS manager tool: qmgr

Page 29: An Overview of the Portable Batch System

Queue definition

• File sched_priv/config

SUBMIT_QUEUE submit
BATCH_QUEUES hpc,back
MAX_JOBS 256
ENFORCE_PRIME_TIME False
ENFORCE_DEDICATED_TIME False
SORT_BY_PAST_USAGE True
DECAY_FACTOR 0.75
SCHED_ACCT_DIR /usr/spool/pbs/server_priv/accounting
SCHED_RESTART_ACTION RESUBMIT
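How a decay factor might feed fair-share ranking can be sketched in Python. The slides do not give the exact formula, so the update rule below (multiply past usage by DECAY_FACTOR each period, then add the new usage) and the ranking rule (lowest decayed usage first) are assumptions made for illustration. The function name is hypothetical.

```python
DECAY_FACTOR = 0.75  # value from the sample sched_priv/config

def decay(past_usage, new_usage):
    """Assumed update rule: decayed = past * DECAY_FACTOR + new."""
    users = set(past_usage) | set(new_usage)
    return {
        u: past_usage.get(u, 0.0) * DECAY_FACTOR + new_usage.get(u, 0.0)
        for u in users
    }

usage = decay({"bob": 100.0, "alice": 20.0}, {"alice": 10.0})
# Assumed ranking: users with the least decayed usage are served first.
ranked = sorted(usage, key=usage.get)
print(usage, ranked)
```

Under these assumptions bob's usage decays from 100 to 75 while alice's becomes 25, so alice's jobs would be ranked ahead of bob's.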

Page 30: An Overview of the Portable Batch System

Load Control

• Load control for SGI scheduler, in sched_priv/config:

TARGET_LOAD_PCT 90%
TARGET_LOAD_VARIANCE -15%,+10%

• Load control for FIFO scheduler, in mom_priv/config:

$max_load 2.0
$ideal_load 1.0

Page 31: An Overview of the Portable Batch System

PBS for SGI scheduler

• Qmgr tool

s server [email protected]
create queue submit
s q submit queue_type = Execution
s q submit resources_max.ncpus = 4
s q submit resources_max.mem = 1gb
s q submit resources_default.mem = 256mb
s q submit resources_default.ncpus = 1
s q submit resources_default.nice = 15
s q submit enabled = True
s q submit started = True

Page 32: An Overview of the Portable Batch System

PBS for SGI scheduler

create queue hpc
s q hpc queue_type = Execution
s q hpc resources_max.ncpus = 2
s q hpc resources_max.mem = 512mb
s q hpc resources_default.mem = 256mb
s q hpc resources_default.ncpus = 1
s q hpc acl_groups = marley
s q hpc acl_group_enable = True
s q hpc enabled = True
s q hpc started = True

Page 33: An Overview of the Portable Batch System

PBS for SGI scheduler

• Server attributes

set server default_queue = submit
s server acl_hosts = *.bar.com
s server acl_host_enable = True
s server scheduling = True
s server query_other_jobs = True

Page 34: An Overview of the Portable Batch System

PBS for FIFO scheduler

• The scheduler configuration file is sched_config instead of config, and queues are not defined there

• Submit queue is a Route queue

s q submit queue_type = Route
s q submit route_destinations = hpc
s q submit route_destinations += back

• Server attributes

s server resources_available.mem = 1gb
s server resources_available.ncpus = 4

Page 35: An Overview of the Portable Batch System

PBS Job Scripts

• Job scripts contain PBS directives and shell commands

#PBS -l ncpus=2
#PBS -l walltime=12:20:00
#PBS -m ae
#PBS -c c=30

cd ${PBS_O_WORKDIR}
mpirun -np 2 foo.x
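To make the split between directives and shell commands concrete, here is a minimal Python sketch that extracts the #PBS lines from a job script. It assumes that directive scanning stops at the first executable command, so later #PBS lines are ignored. The function name is hypothetical and this is not how qsub itself is implemented.

```python
def pbs_directives(script_text):
    """Return the option lists of the leading #PBS lines of a job script."""
    directives = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#PBS"):
            directives.append(stripped[len("#PBS"):].split())
        elif stripped and not stripped.startswith("#"):
            break  # first shell command: stop scanning (assumed behavior)
    return directives

script = """#!/bin/sh
#PBS -l ncpus=2
#PBS -l walltime=12:20:00
#PBS -m ae
#PBS -c c=30

mpirun -np 2 foo.x
#PBS -l mem=1gb
"""
found = pbs_directives(script)
print(found)
```

The four directives at the top of the script are returned, and the one after the mpirun line is skipped under the stated assumption.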

Page 36: An Overview of the Portable Batch System

Basic PBS commands

• Jobs are submitted with qsub

% qsub [-q hpc] foo.pbs
13.node0.bar.com

• Job status is queried with qstat [-f|-a] to get job owner, name, queue, status, session ID, # CPUs, walltime

% qstat -a 13

• Alter job attributes

% qalter -l walltime=20:00:00 13

Page 37: An Overview of the Portable Batch System

Job Submission and Tracking

• Find jobs in status R (running) or submitted by user bob

% qselect -s R
% qselect -u bob

• Query queue status to find if the queue is enabled/started, and the number of jobs in the queue

% qstat [-f | -a] -Q

• Delete a job

% qdel 13

Page 38: An Overview of the Portable Batch System

Job Environment and I/O

• The job’s current directory is the submitter’s $HOME, which is also the default location for the files created by the job. Change it with cd in the script

• The standard out and err of the job are spooled to JobName.{o|e}JobID in the submitter’s current directory. Override this with

#PBS -o | -e pathname
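The default naming rule can be sketched in Python. The sketch assumes that the JobID part of the file name is the numeric sequence number at the front of the full job identifier (13 for 13.node0.bar.com). The function name and the example directory are hypothetical.

```python
import posixpath

def default_output_paths(job_name, job_id, submit_dir):
    """Default stdout/stderr paths: JobName.oSEQ and JobName.eSEQ."""
    seq = job_id.split(".", 1)[0]  # assumed: leading sequence number
    stdout = posixpath.join(submit_dir, f"{job_name}.o{seq}")
    stderr = posixpath.join(submit_dir, f"{job_name}.e{seq}")
    return stdout, stderr

paths = default_output_paths("foo.pbs", "13.node0.bar.com", "/home/bob/run")
print(paths)
```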

Page 39: An Overview of the Portable Batch System

Tips

• Trace the history of a job

% tracejob - gives a time-stamped sequence of events affecting a job

• Cron jobs for cleaning up daemon work files under mom_logs, sched_logs, server_logs

# crontab -e

9 2 * * 0 find /usr/spool/pbs/mom_logs -type f -mtime +7 -exec rm {} \;
9 2 * * 0 find /usr/spool/pbs/sched_logs -type f -mtime +7 -exec rm {} \;
9 2 * * 0 find /usr/spool/pbs/server_logs -type f -mtime +7 -exec rm {} \;
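The same weekly cleanup can be sketched in Python for sites that prefer a script over find. This is an approximation: it does not recurse into subdirectories, and it treats any regular file last modified more than max_age_days ago as stale. The function name is hypothetical, and the demonstration runs in a temporary directory rather than under /usr/spool/pbs.

```python
import os
import tempfile
import time

def remove_old_files(directory, max_age_days=7, now=None):
    """Delete regular files in `directory` older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for entry in os.scandir(directory):
        is_file = entry.is_file(follow_symlinks=False)
        if is_file and entry.stat(follow_symlinks=False).st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)

# Demonstration on a scratch directory with one old and one fresh file.
with tempfile.TemporaryDirectory() as d:
    old_path = os.path.join(d, "old.log")
    new_path = os.path.join(d, "new.log")
    for p in (old_path, new_path):
        open(p, "w").close()
    ten_days_ago = time.time() - 10 * 86400
    os.utime(old_path, (ten_days_ago, ten_days_ago))
    removed = remove_old_files(d)
    remaining = sorted(os.listdir(d))
print(removed, remaining)
```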

Page 40: An Overview of the Portable Batch System

Sample PBS Front-End

[Diagram: an execution server and a submission server (node0, node1); daemons pbs_server, pbs_sched, pbs_mom; client commands qsub, qdel, ...]

Page 41: An Overview of the Portable Batch System

PBS for clusters

• File staging - copy files (other than stdout/stderr) from a submission-only host to the server

#PBS -W stagein=/tmp/bar@n1:/home/bar/job1

#PBS -W stageout=/tmp/bar/job1/*@n1:/home/bar/job1

PBS uses the directory /tmp/bar/job1 as a scratch directory

• File staging may precede job starting - helps in hiding latencies
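The stagein and stageout values follow the pattern local_file@hostname:remote_file. A minimal Python sketch of splitting such a value into its parts is shown below; the function name is hypothetical, and a value listing several comma-separated files is not handled.

```python
def parse_stage(spec):
    """Split 'local@host:remote' into its three components."""
    local, _, rest = spec.partition("@")
    host, _, remote = rest.partition(":")
    if not (local and host and remote):
        raise ValueError(f"malformed staging spec: {spec!r}")
    return {"local": local, "host": host, "remote": remote}

stage_in = parse_stage("/tmp/bar@n1:/home/bar/job1")
stage_out = parse_stage("/tmp/bar/job1/*@n1:/home/bar/job1")
print(stage_in, stage_out)
```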

Page 42: An Overview of the Portable Batch System

Setting up a PBS Cluster

• Assume n1 runs the pbs_mom daemon

• $PBS_SERVER_HOME/server_priv/nodes

n0 np=2 gaussian
n1 np=2 irix

• n0:$PBS_SERVER_HOME/mom_priv/config

$clienthost n1
$ideal_load 1.5
$max_load 2.0

• n1:$PBS_SERVER_HOME/mom_priv/config

$ideal_load 1.5
$max_load 2.0

Page 43: An Overview of the Portable Batch System

Setting up a PBS Cluster

• Qmgr tool

s server [email protected]
create queue hpc
s q hpc queue_type = Execution
s q hpc Priority = 100
s q hpc resources_max.ncpus = 2
s q hpc resources_max.nodect = 1
s q hpc acl_groups = marley
s q hpc acl_group_enable = True

Page 44: An Overview of the Portable Batch System

Setting up a PBS Cluster

• Server attributes

set server default_node = n0
set server default_queue = hpc
s server acl_hosts = *.bar.com
s server acl_host_enable = True
s s resources_default.nodect = 1
s s resources_default.nodes = 1
s s resources_default.neednodes = 1
set server max_user_run = 2

Page 45: An Overview of the Portable Batch System

PBS features

• The job submitter can request a number of nodes with some properties

• For example

– request a node with the property gaussian:

#PBS -l nodes=gaussian

– request two nodes with the property irix

#PBS -l nodes=2:irix
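Matching such requests against the node properties can be sketched in Python. The nodes-file text mirrors the two-node cluster example in this talk. The parsing and selection functions are hypothetical simplifications: they handle a single "count:property" or "property" term only and are not the server's actual algorithm.

```python
def parse_nodes_file(text):
    """Map node name -> set of properties; np=... settings are skipped."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        nodes[fields[0]] = {f for f in fields[1:] if "=" not in f}
    return nodes

def parse_request(spec):
    """Parse 'gaussian', '2' or '2:irix' into (count, property)."""
    head, sep, tail = spec.partition(":")
    if head.isdigit():
        return int(head), (tail if sep else None)
    return 1, head

def select_nodes(nodes, spec):
    """Return matching node names, or None if too few nodes qualify."""
    count, prop = parse_request(spec)
    matches = sorted(n for n, p in nodes.items() if prop is None or prop in p)
    return matches[:count] if len(matches) >= count else None

nodes = parse_nodes_file("n0 np=2 gaussian\nn1 np=2 irix\n")
one_gaussian = select_nodes(nodes, "gaussian")
two_irix = select_nodes(nodes, "2:irix")
print(one_gaussian, two_irix)
```

With only one node carrying the irix property, a request for two such nodes cannot be satisfied in this model, while the gaussian request resolves to n0.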

Page 46: An Overview of the Portable Batch System

PBS Security Features

• All files used by PBS are owned by root and can be written only by root

• Configuration files: sched_priv/config, mom_priv/config are readable only by root

• $PBS_HOME/pbs_environment defines $PATH; it is writable only by root

• pbs_mom daemon accepts connections from a privileged port on localhost or from a host listed in mom_priv/config

• The server accepts commands from selected hosts and users

Page 47: An Overview of the Portable Batch System

Why preemptive scheduling?

• Resource reservation (CPU, memory) is needed to achieve high job throughput

• Static resource reservation may lead to low machine utilization, high job waiting times, and hence slow job turn-around

• An approach is needed to achieve both high job throughput and rapid job turn-around

Page 48: An Overview of the Portable Batch System

Static Reservation Pitfall (1)

Node (CPU + memory)

Physics Group Biotech Group

Parallel Computer or Cluster

Job Requests

Partition boundary

Page 49: An Overview of the Portable Batch System

Static Reservation Pitfall (2)

• Physics Group’s Job 1 is assigned 3 nodes and dispatched

• Biotech Group’s Job 2 is also dispatched, while Job 3 cannot execute before Job 2 finishes: there is only 1 node available for the group

• However, there are enough resources for Job 3

Page 50: An Overview of the Portable Batch System

Proposed Approach (1)

• Leverage the features of the Portable Batch System (PBS)

• Extend PBS with preemptive job scheduling

• All queues but one have reserved resources (CPUs, memory) and hold jobs that cannot be preempted. These are the dedicated queues

• Define a queue for jobs that may be preempted: the background queue

Page 51: An Overview of the Portable Batch System

Proposed Approach (2)

• Each user belongs to a group and each group is authorized to submit jobs to some dedicated queues as well as to the background queue

• The sum of the resources defined for the dedicated queues does not exceed the machine resources

• The resources assigned to jobs in a dedicated queue do not exceed the queue resource limits

Page 52: An Overview of the Portable Batch System

Proposed Approach (3)

• Jobs fitting in a dedicated queue are dispatched, observing job owner’s access rights

• Jobs not fitting in a dedicated queue are dispatched to the background queue, if there are enough available resources in the system

• Jobs in the background queue borrow resources from the dedicated queues

Page 53: An Overview of the Portable Batch System

Proposed Approach (4)

• If a job entering the system would fit in a dedicated queue provided resources lent to the background queue are reclaimed, job preemption is triggered

• Jobs from the background queue will be held to release the resources needed by a dedicated queue

• Held jobs are re-queued and will be dispatched along with the other pending jobs
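The reclaiming step can be sketched in Python: given the background jobs and the CPUs currently free, choose which background jobs to hold so that an incoming dedicated-queue job can start. The order in which jobs are held (list order here) is an assumption, since the slides do not specify a victim-selection policy. The function name is hypothetical.

```python
def jobs_to_hold(background_jobs, free_cpus, needed_cpus):
    """Pick background jobs to hold until needed_cpus are free.

    background_jobs: list of (job_id, ncpus) pairs.
    Returns the ids to hold, or None if holding every background
    job would still not free enough CPUs.
    """
    held = []
    for job_id, ncpus in background_jobs:
        if free_cpus >= needed_cpus:
            break
        held.append(job_id)
        free_cpus += ncpus
    return held if free_cpus >= needed_cpus else None

# One 4-CPU background job is running, 1 CPU is free, 2 CPUs are needed.
hold = jobs_to_hold([("3", 4)], free_cpus=1, needed_cpus=2)
print(hold)
```

If enough CPUs are already free the function returns an empty list, so no preemption is triggered.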

Page 54: An Overview of the Portable Batch System

Example (1)

Two queues, each with 4 CPU capacity

Job  Queue    #CPU  Submit time  CPU time
1    Physics  1     0            4 h
2    Biotech  2     0            4 h
3    Physics  4     0            3 h
4    Biotech  2     2 h          1 h
5    Physics  2     2 h          1 h

Page 55: An Overview of the Portable Batch System

Example (2)

Turn-around times with and without preemptive scheduling

Job  With  Without
1    4 h   4 h
2    4 h   4 h
3    4 h   7 h
4    3 h   3 h
5    3 h   3 h

75% reduction in waiting time for job 3 (from 4 h to 1 h)
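One reading of the 75% figure is that it refers to job 3's waiting time rather than its turn-around time. The arithmetic below assumes job 3 is submitted at time 0, needs 3 h of run time, and resumes where it left off after being held. These assumptions are inferred from the example tables, not stated in the slides.

```python
run_time_h = 3  # job 3's run time from the example table
turnaround_h = {"with": 4, "without": 7}

# Waiting time = turn-around time minus run time.
wait_h = {k: v - run_time_h for k, v in turnaround_h.items()}

wait_reduction = 1 - wait_h["with"] / wait_h["without"]
turnaround_reduction = 1 - turnaround_h["with"] / turnaround_h["without"]
print(wait_h, wait_reduction, round(turnaround_reduction, 2))
```

Under this reading the waiting time drops from 4 h to 1 h, a 75% reduction, while the turn-around time drops from 7 h to 4 h, or about 43%.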

Page 56: An Overview of the Portable Batch System

Key Points

• Provide guaranteed resources per user group and per job

• Allow resources not used by the dedicated queues to be borrowed by the background queue

• Provide a mechanism for reclaiming resources lent to the background queue

• Achieve low job waiting time and high job throughput

Page 57: An Overview of the Portable Batch System

Benefits of the Approach

• Reduce job waiting time by harnessing resources not used by the dedicated queues

• Reduce job wall-time by reserving resources for all the jobs

• Pending jobs fitting in dedicated queues can reclaim resources from jobs that borrowed those resources and run in the background queue

Page 58: An Overview of the Portable Batch System

For more information

• Veridian web site:

www.openpbs.org

www.pbspro.com

• NRC - IMSB documentation and links

www.sao.nrc.ca/~gabriel/pbs/pbs_user.html