
Page 1: Introduction to TAMNUN server and basics of PBS usage

Yulia Halupovich, CIS, Core Systems Group

Page 2: TAMNUN LINKS

• Registration: http://reg-tamnun.technion.ac.il
• Documentation and Manuals: http://tamnun.technion.ac.il/doc/
• Help Pages & Important Documents: http://tamnun.technion.ac.il/doc/Local-help/
• Accessible from external network: http://www.technion.ac.il/doc/tamnun/

Page 3: Tamnun Cluster inventory – system

• Login node (2 x Intel Xeon E5645 @ 2.40 GHz, 96 GB):
– user login
– PBS
– compilations
– YP master
• Admin node (2 x Intel Xeon E5-2640 @ 2.50 GHz, 64 GB):
– SMC
• NAS node (NFS, CIFS) (2 x Intel Xeon E5620 @ 2.40 GHz, 48 GB):
– 1st enclosure: 60 slots, 60 x 1 TB drives
– 2nd enclosure: 60 slots, 10 x 3 TB drives
• Network solution:
– 14 QDR InfiniBand switches with 2:1 blocking topology
– 4 GigE switches for the management network

Page 4: Tamnun Cluster inventory – compute nodes (1)

Tamnun consists of a public cluster, available to general Technion users, and private sub-clusters purchased by Technion researchers.

Public cluster specifications:
• 80 compute nodes, each with two 2.40 GHz six-core Intel Xeon processors: 960 cores in total, with 8 GB DDR3 memory per core
• 4 Graphical Processing Units (GPUs): 4 servers with NVIDIA Tesla M2090 GPU Computing Modules, 512 CUDA cores each
• Storage: 36 nodes with 500 GB and 52 nodes with 1 TB SATA drives, 4 nodes with fast 1200 GB SAS drives; raw NAS storage capacity is 50 TB

Page 5: Tamnun Cluster inventory – compute nodes (2)

• Nodes n001 – n028: RBNI (public)
• Nodes n029 – n080: Minerva (public)
• Nodes n097 – n100: “Gaussian” nodes with large and fast drive (public)
• Nodes gn001 – gn004: GPU (public)
• Nodes gn005 – gn007: GPU (private nodes of Hagai Perets)
• Nodes n081 – n096, sn001: private cluster (Dan Mordehai)
• Nodes n101 – n108: private cluster (Oded Amir)
• Nodes n109 – n172, n217 – n232: private cluster (Steven Frankel)
• Nodes n173 – n180: private cluster (Omri Barak)
• Nodes n181 – n184: private cluster (Rimon Arieli)
• Nodes n185 – n192: private cluster (Shmuel Osovski)
• Nodes n193 – n216: private cluster (Maytal Caspary)
• Nodes n233 – n240: private cluster (Joan Adler)
• Nodes n241 – n244: private cluster (Ronen Talmon)
• Node sn002: private node (Fabian Glaser)

Page 6: TAMNUN connection - general guidelines

1. Connection via server TX and GoGlobal (also from abroad):
http://tx.technion.ac.il/doc/tamnun/TAMNUN_Connection_from_Windows.txt
2. Interactive usage: compiling, debugging and tuning only!
3. Interactive CPU time limit = 1 hour
4. Tamnun login node: use it to submit jobs via PBS batch queues to the compute nodes (see the next pages on PBS)
5. Default quota = 50 GB; check it with: quota -vs username
http://tx.technion.ac.il/doc/tamnun/Quota_enlargement_policy.txt
6. Secure file transfer: from outside the Technion, scp to TX and WinSCP to/from a PC; inside the Technion, use WinSCP (see the sketch below)
7. Dropbox usage is not allowed on Tamnun!
8. No automatic data backup is available on Tamnun, see:
http://tx.technion.ac.il/doc/tamnun/Tamnun-Backup/
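For item 6, a minimal sketch of moving a file in from outside the Technion through TX; the account name username and the file name results.tar.gz are placeholders, and the host names follow the links above:

> scp results.tar.gz username@tx.technion.ac.il:~/        copy from your machine to TX
> ssh username@tx.technion.ac.il                          log in to TX
> scp results.tar.gz username@tamnun.technion.ac.il:~/    forward the file from TX to Tamnun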

Page 7: Portable Batch System – Definition and 3 Primary Roles

• Definition: PBS is a distributed workload management system. It handles the management and monitoring of the computational workload on a set of computers

• Queuing: Users submit tasks or “jobs” to the resource management system where they are queued up until the system is ready to run them.

• Scheduling: The process of selecting which jobs to run, when, and where, according to a predetermined policy, aimed at balancing competing needs and goals on the system(s) to maximize efficient use of resources

• Monitoring: Tracking and reserving system resources, enforcing usage policy. This includes both software enforcement of usage limits and user or administrator monitoring of scheduling policies

Page 8: Important PBS Links on Tamnun

• PBS User Guide: http://tamnun.technion.ac.il/doc/PBS-User-Guide.pdf
• Basic PBS Usage Instructions: http://tamnun.technion.ac.il/doc/Local-help/TAMNUN_PBS_Usage.pdf
• Detailed Description of the Tamnun PBS Queues: http://tamnun.technion.ac.il/doc/Local-help/TAMNUN_PBS_Queues_Description.pdf
• PBS script examples: http://tamnun.technion.ac.il/doc/Local-help/PBS-scripts/

Page 9: Current Public Queues on TAMNUN

Access           Definition                                    Queue Name   Priority  Description
nanotraining     Max wall time 168 h; ROUTING queue            nano_h_p     High      RBNI
minervatraining  Max wall time 168 h; ROUTING queue            minerva_h_p  High      Minerva
All users        Max wall time 24 h; ROUTING queue             all_l_p      Low       General
All users        Thu 17:00 – Sun 08:00; max wall time 63 h;    np_weekend   Low       Non-primetime
                 max user CPU 192
All users        Max wall time 72 h                            gpu_l_p      High      GPU
gaussian         Max wall time 168 h; max user CPU 48;         gaussian_ld  High      Gaussian LD
                 max job CPU 12
All users        Max wall time 24 h; max user CPU 48           general_ld   Low       General LD

Page 10: Submitting jobs to PBS: qsub command

• The qsub command is used to submit a batch job to PBS. Submitting a PBS job specifies a task, requests resources and sets job attributes, all of which can be defined in an executable script file. The qsub syntax recommended on TAMNUN:
> qsub [options] scriptfile
• PBS script files (PBS shell scripts, see the next page) should be created in the user’s directory
• To obtain detailed information about qsub options, use the command:
> man qsub
• Job identifier (JOB_ID): upon successful submission of a batch job, PBS returns a job identifier in the format sequence_number.server_name, for example 12345.tamnun
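A minimal sketch of a submission, assuming a script file named myscript.sh in the current directory (the queue name is taken from the table on page 9, and the job number is illustrative):

> qsub -q all_l_p myscript.sh
12345.tamnun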

Page 11: The PBS shell script sections

• Shell specification: #!/bin/sh
• PBS directives: used to request resources or set attributes. A directive begins with the default string “#PBS”.
• Tasks (programs or commands):
– environment definitions
– I/O specifications
– executable specifications
NB! Other lines starting with # are comments.

Page 12: PBS script example for multicore user code

#!/bin/sh
# job name, destination queue and e-mail address for notifications
#PBS -N job_name
#PBS -q queue_name
#PBS -M [email protected]
# request one chunk with N cores and P gb of memory, and a 24-hour wall time limit
#PBS -l select=1:ncpus=N:mem=Pgb
#PBS -l walltime=24:00:00
# run the program from the user's work directory
PBS_O_WORKDIR=$HOME/mydir
cd $PBS_O_WORKDIR
./program.exe < input.file > output.file
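A sketch of submitting this script and collecting the results afterwards; the script name is a placeholder and the job number is illustrative:

> qsub myscript.sh
12345.tamnun
> ls
job_name.e12345  job_name.o12345  output.file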

See other script examples at http://tamnun.technion.ac.il/doc/Local-help/PBS-scripts/

Page 13: Checking the job/queue status: qstat command

• The qstat command is used to request the status of batch jobs, queues, or servers
• Detailed information: > man qstat
• qstat output structure (see on Tamnun)
• Useful commands:
> qstat -a                  all users in all queues (default)
> qstat -1n                 all jobs in the system, with node names
> qstat -1nu username       all of a user's jobs, with node names
> qstat -f JOB_ID           extended output for the job
> qstat -Q                  list of all queues in the system
> qstat -Qf queue_name      extended queue details
> qstat -1Gn queue_name     all jobs in the queue, with node names
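A sketch of checking a user's jobs (the job data are illustrative and the exact column layout depends on the PBS version; S = R means the job is running):

> qstat -1nu username
                                                Req'd    Elap
Job ID        Username Queue   Jobname  ... TSK Time   S Time
------------- -------- ------- -------- --- --- ----- - -----
12345.tamnun  username all_l_p job_name ...  12 24:00 R 02:13  n001/0*12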

Page 14: Removing a job from a queue: qdel command

• qdel is used to delete queued or running jobs. The job's running processes are killed. A PBS job may be deleted by its owner or by the administrator
• Detailed information: > man qdel
• Useful commands:
> qdel JOB_ID               delete a job from a queue
> qdel -W force JOB_ID      force-delete a job
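For example, to delete the job submitted earlier, using the identifier that qsub returned (the job number is illustrative):

> qdel 12345.tamnun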

Page 15: Checking job results and troubleshooting

• Save the JOB_ID for further inspection
• Check the error and output files: job_name.eJOB_ID and job_name.oJOB_ID
• Inspect a job's details (also after N days): > tracejob [-n N] JOB_ID
• A job in the E state occupies resources and will be deleted
• Running an interactive batch job: > qsub -I pbs_script
The job is sent to an execution node, the PBS directives are executed, and the job awaits the user's commands
• Checking the job on an execution node:
> ssh node_name
> hostname
> top               press "u" and enter a user name to show that user's processes; press "1" for per-CPU usage
> kill -9 PID       remove a job process from the node
> ls -rtl /gtmp     check error, output and other files under the user's ownership

• Output can be copied from the node to the home directory
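A sketch of copying a result file back from a node's local /gtmp area to the home directory, run from the login node (the node name and file name are placeholders):

> scp node_name:/gtmp/output.file $HOME/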

Page 16: Monitoring the system

• pbsnodes is used to query the status of hosts
• Syntax: > pbsnodes node_name/node_list
Shows extended information on a node: resources available, resources used, list of queues, busy/free status, list of jobs (see the sketch at the end of this page)

• xpbsmon & provides a way to graphically display the nodes that run jobs. With this utility you can see which job is running on which node, who owns the job, how many nodes are assigned to a job, the status of each node (color-coded; the colors are user-modifiable), and how many nodes are available, free, down, reserved, offline, of unknown status, in use running multiple jobs, or executing only one job.

• Detailed information and tutorials: > man xpbsmon
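A sketch of a pbsnodes query (the node name and field values are illustrative; the exact fields shown depend on the PBS version):

> pbsnodes n001
n001
     state = free
     pcpus = 12
     jobs = 12345.tamnun/0, 12345.tamnun/1
     resources_available.ncpus = 12
     resources_available.mem = 98304mb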