UPPMAX
● Uppsala Multidisciplinary Center for Advanced Computational Science
● http://www.uppmax.uu.se
● 2 clusters for ”public” use:
  – Rackham: 486 nodes à 20 cores (128 GB RAM; 32 nodes with 256 GB, 4 with 1 TB), 2 TB local disk per node, 6 PB of fast network storage (Crex)
  – Bianca: 200 nodes à 16 cores for SNIC-SENS, 9.5 PB of storage
● SNIC-Cloud system: Dis
Organisational context
● UPPMAX is:
  – A centre at the Dept of IT at Uppsala University
  – A centre of the Swedish National Infrastructure for Computing (SNIC)
● UPPMAX hosts the SciLifeLab Compute and Storage facility, supporting life science researchers’ needs
● All projects and user accounts are handled via SUPR, the SNIC project management portal (supr.snic.se)
Projects and resources
● All UPPMAX resources are allocated to projects. All members are responsible for sharing project resources constructively.
● All projects are created and managed in SUPR.
● A project can have:
  – Thousands of core-hours per month (kch/m)
    ● A constant 2-core job will use about 1.5 kch/month
  – Storage (GB)
    ● Project storage in /proj/xyz
    ● Can have backup or nobackup separately or together
Projects and resources
● The cost? Free to you, but…
● Rackham’s extension cost (very roughly):
  – 375 kr/TB per year
  – 0.1 kr/core-hour
● A typical PhD student’s project that uses 1 TB for four years and averages 1000 core-hours/month represents 6,300 kr.
● An ongoing potato genomics project that uses 20 TB and averages 1000 core-hours/month represents 8,700 kr each year.
● A large shotgun genomics project that uses 5 TB and averages 30,000 core-hours/month for half a year represents 19,000 kr.
● The creation of a new reference genome, requiring 100 TB and 50,000 core-hours/month for a year, represents almost 100,000 kr.
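All of these estimates come from the same two rates. A minimal sketch of the arithmetic, using the PhD-student example above:

```shell
# Rough project cost from the two rates above:
#   storage: 375 kr per TB per year; compute: 0.1 kr per core-hour
# Example numbers: the PhD-student project (1 TB, 4 years, 1000 core-hours/month)
tb=1; years=4; ch_per_month=1000

storage_kr=$(( tb * 375 * years ))                 # 1 * 375 * 4  = 1500 kr
compute_kr=$(( ch_per_month * years * 12 / 10 ))   # 48 kch at 0.1 kr = 4800 kr
echo "Total: $(( storage_kr + compute_kr )) kr"    # Total: 6300 kr
```

The same formula with 20 TB, 1 year, and 1000 core-hours/month reproduces the 8,700 kr potato-genomics figure.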
Project types
● SNIC (Rackham)
  – Small — anyone can get 2 kch/m and 128 GB
  – Medium — researchers can get up to 100 kch/m
  – Large — groups can get lots of time
  – UPPMAX Storage — additional GB for SNIC projects (still has plenty of space)
  – SciLifeLab Storage — additional GB for life science research on Rackham (fully booked)
● SNIC-SENS (Bianca)
  – For work with sensitive personal data
  – Small (up to 20 TB) and Medium
● Guide for project applications: http://uppmax.uu.se/support/getting-started/applying-for-projects/
Who are we, what do we do?
● System Administrators (about 10 people)
– Have root access
– Fix problems requiring privileged access (e.g. account issues)
– Maintain operating systems and build software infrastructure
– Etc etc etc: They keep all the systems running
● Application Experts (about 7 people)
– Install software
– Help users with application- or science-related issues
– Give user workshops & seminars
– Represent user community to UPPMAX & SNIC
● Others
– UPPMAX Director: Elisabeth Larsson
– SNAC WG: Marcus Holm (manages project allocations)
– Financial administration
UPPMAX
● The basic structure of a supercomputer
UPPMAX
● The basic structure of a compute node
[Diagram: a compute node with cores, RAM, and local disk, connected via the network to other nodes, network storage, and the Internet]
UPPMAX
● Storage systems:
  – Crex — Rackham /proj directories
  – Castor — storage for Bianca
  – Cygnus — new storage for Bianca
  – ”scratch” — node local disks
UPPMAX
● Storage system basics:
● All nodes can access:
  – your home directory on Domus
  – a project directory on Crex or Castor
  – its own local disk (2-3 TB)
● If you’re reading/writing a file once, use a directory on Crex or Castor
● If you’re reading/writing a file many times, copy the file to ”scratch”, the node local disk: ”cp myFile $SNIC_TMP”
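As a sketch, a job script using scratch could look like the following. The project path and input data are hypothetical; for illustration the script creates its own input and falls back to a temp dir when $SNIC_TMP is unset, as it is outside a job:

```shell
#!/bin/bash -l
#SBATCH -A g2018014 -p core -n 1 -t 1:00:00
# Hypothetical example: put heavy repeated I/O on the node-local disk.
# Inside a job, Slurm sets $SNIC_TMP; outside one, fall back to a temp dir.
SCRATCH=${SNIC_TMP:-$(mktemp -d)}
WORKDIR=$(mktemp -d)                     # stands in for a /proj/g2018014/... directory
printf 'chr1\nchr2\nchr1\n' > "$WORKDIR/mydata.txt"

cp "$WORKDIR/mydata.txt" "$SCRATCH/"     # copy the input to the local disk once
cd "$SCRATCH"
grep -c 'chr1' mydata.txt > counts.txt   # repeated reads/writes happen here, fast
cp counts.txt "$WORKDIR/"                # copy results back; scratch is wiped after the job
cat "$WORKDIR/counts.txt"
```

The key pattern is the last copy step: anything left on scratch is gone when the job ends.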
Graphical Access
● ThinLinc
  – Modern, efficient, experimental
  – http://www.uppmax.uu.se/support/user-guides/thinlinc-graphical-connection-guide/
● X11 forwarding
  – Ancient, slow, still useful
  – Connect with ”ssh -X [email protected]”
  – Check that it works by running e.g. xclock
  – Are you using a Mac (OS X/macOS)? Then you must install XQuartz (https://www.xquartz.org)
Queue System
● More users than nodes ⇒ need for a queue
Queue System
● Scheduling jobs
  – Long or short, narrow or wide?
[Diagram: jobs packed along a time axis]
Queue System
● Scheduling jobs
  – Short and narrow jobs are easier to schedule
[Diagram: jobs packed along a time axis]
A job?
● Job = what happens during booked time
● Described in a Bash script file:
  – Slurm parameters
  – Load software modules
  – Move around the file system
  – Start programs
  – ...and more
Queue System
● 1 mandatory setting for jobs:
  – Who ”pays” for it? (-A)
● 3 settings you really should set:
  – Where should it run? (-p)
  – How wide is it? (-n)
  – How long at most? (-t)
● If in doubt: -p core -n 1 -t 10-00:00:00
Queue System
● Who ”pays” for it? (-A)
  – Only projects can be charged
  – You have to be a member
  – This course's project ID: g2018014
● -A = account (the account you charge)
  – No default value, mandatory
Queue System
● Where should it run? (-p)
  – Use a whole node or just part of it?
  – 1 node = 20 cores (16 on Bianca)
  – 1 hour walltime = 20 core-hours = expensive
  – A waste of resources unless you have a parallel program
● -p = partition (node or core)
  – Default value: core
Queue System
● How wide is it? (-n)
  – How much of the node should be booked?
  – 1 node = 20 cores
  – Any number of cores: 1, 2, 5, 13, 15, etc.
● -n = number of cores
  – Default value: 1
  – Usually used together with -p core
Queue System
● How long is it? (-t)
  – Always overestimate by ~50%
  – Jobs are killed when the time limit is reached
  – You are only charged for the time used
● -t = time (hh:mm:ss or d-hh:mm:ss)
  – e.g. 78:00:00 or 3-6:00:00
  – Default value: 7-00:00:00
Queue System
● How to submit a job
  – Write a script (bash): queue options first, then the rest of the script

#! /bin/bash -l
#SBATCH -A g2018014
#SBATCH -p core
#SBATCH -n 1
#SBATCH -t 00:10:00
#SBATCH -J Template_script

# go to some directory
cd /proj/g2018014/marcusl

# load software modules
module load bioinfo-tools

# do something
echo Hello world!
Queue System
● How to submit a job
  – Script written, now what?

[marcusl@rackham1 ~]$ sbatch myjobscript.sh
Submitted batch job 4367759
[marcusl@rackham1 ~]$ jobinfo -u marcusl
CLUSTER: rackham
Running jobs:
   JOBID PARTITION    NAME    USER ACCOUNT ST START_TIME TIME_LEFT NODES CPUS NODELIST(REASON)
Nodes in use: 479
Nodes in devel, free to use: 1
Nodes in other partitions, free to use: 0
Nodes available, in total: 480
Nodes in test and repair: 6
Nodes, all in total: 486
Waiting jobs:
   JOBID POS PARTITION            NAME    USER ACCOUNT ST START_TIME TIME_LEFT PRIORITY CPUS NODELIST(REASON) FEATURES DEPENDENCY
 4367759  12      core Template_script marcusl   staff PD        N/A   1:00:00   190000    1 (Resources)     (null)
Waiting bonus jobs:
SLURM Output
● Prints to a file instead of terminal slurm-<job id>.out
[marcusl@tintin2 glob]$ ls -l
total 4
-rw-rw-r-- 1 marcusl marcusl 62 Jun 20 13:40 my_script.sb
[marcusl@tintin2 glob]$ sbatch my_script.sb
Submitted batch job 10281906
[marcusl@tintin2 glob]$ ls -l
total 4
-rw-rw-r-- 1 marcusl marcusl 92 Jun 20 13:40 my_script.sb
-rw-rw-r-- 1 marcusl marcusl 87 Jun 20 13:40 slurm-10281906.out
[marcusl@tintin2 glob]$
SLURM Output
● Prints to a file instead of terminal slurm-<job id>.out
[marcusl@rackham2 test]$ ls
my_script.sh
[marcusl@rackham2 test]$ sbatch my_script.sh
Submitted batch job 10281906
[marcusl@rackham2 test]$ ls
my_script.sh  slurm-10281906.out
[marcusl@rackham2 test]$ cat slurm-10281906.out
Example of error with line number and message
slurm_script: 40: An error has occurred.
[marcusl@rackham2 test]$
SLURM Tools
● squeue — quick info about jobs in the queue
● jobinfo — detailed info about jobs
● finishedjobinfo — summary of finished jobs
● jobstats — efficiency of booked resources
Squeue
● Shows quick information about the job queue
  – All jobs: squeue
  – Your jobs: squeue -u <user>

[marcusl@rackham2 test]$ squeue
   JOBID PARTITION     NAME     USER ST     TIME NODES NODELIST(REASON)
 4362762      core    kbs-3   peterj CG 18:00:47     1 r446
 4362767      core    kbs-3   peterj CG 18:00:47     1 r446
 4430481      node supernov     remi PD     0:00     1 (Priority)
 4433857      node pretest3 maka4186 PD     0:00    16 (Priority)
 4433861      node     Freq    emile PD     0:00     4 (Priority)
 4433740      node    REDOH    jolla PD     0:00     4 (Priority)
 4433872      node q_timing batchtst PD     0:00     1 (Priority)
 4433878      core final_vc madeline PD     0:00     1 (Priority)
 4433890      node gm_grnof    gulla PD     0:00     1 (Priority)
Jobstats
● Shows efficiency information of finished jobs
[marcusl@localmac ~]$ ssh -X [email protected]
..
[marcusl@rackham2 test]$ jobstats -p -A g2018014
Running '/sw/uppmax/bin/finishedjobinfo -M rackham g2018014' through a pipe to get more information, please be patient…
..
*** 10 total jobs, 0 jobs not run, 0 jobs had no jobstats files (includes jobs not run)
[marcusl@rackham2 test]$ eog *.png &
Interactive
● Books a node and connects you to it
  – No X11 forwarding through this connection; you have to ”ssh -X” in with another window
interactive -A <proj id> -p <core or node> -t <time>
[marcusl@rackham2 test]$ interactive -A g2018014 -p core -n 4 -t 03:00:00
You receive the high interactive priority.
There are free cores, so your job is expected to start at once.
Please, use no more than 25.6 GB of RAM.
Waiting for job 4434279 to start…
[marcusl@r483 ~/test]$
UPPMAX Software
● 100+ programs installed
● Managed by a 'module system'
  – Installed, but hidden
  – Manually loaded before use
● module avail — lists all available modules
● module load <module name> — loads the module
● module unload <module name> — unloads the module
● module list — lists loaded modules
● module spider <name> — searches for modules
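A typical session could look like the following. The samtools module and its version number are just an illustration, not a guaranteed module name; use module avail or module spider to see what is actually installed. On UPPMAX, bioinformatics software only becomes visible after loading bioinfo-tools, as in the job script example.

```
[marcusl@rackham1 ~]$ module load bioinfo-tools
[marcusl@rackham1 ~]$ module load samtools/1.9
[marcusl@rackham1 ~]$ module list
[marcusl@rackham1 ~]$ module unload samtools/1.9
```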
UPPMAX Commands
● uquota – show disk space
● Laboratory time! (again)
  – Same instructions PDF as this morning
  – Do chapter 3
  – If you finish: go back and finish chapter 2, then do chapter 4