
UPPMAX Introduction

Marcus Holm, marcus.holm@uppmax.uu.se

Slides courtesy of: Martin Dahlö, martin.dahlo@scilifelab.uu.se

UPPMAX

● Uppsala Multidisciplinary Center for Advanced Computational Science

● http://www.uppmax.uu.se

● 2 clusters for ”public” use
  – Rackham: 486 nodes à 20 cores (128 GB RAM), 32 nodes with 256 GB, 4 with 1 TB, 2 TB local disk per node, 6 PB of fast network storage (Crex)
  – Bianca: 200 nodes à 16 cores, for SNIC-SENS, 9.5 PB of storage

● SNIC-Cloud system: Dis

Organisational context

● UPPMAX is:
  – A centre at the Dept of IT at Uppsala University
  – A centre of the Swedish National Infrastructure for Computing (SNIC)

● UPPMAX hosts the SciLifeLab Compute and Storage facility
  – Supports life science researchers’ needs

● All projects and user accounts are handled via SUPR, the SNIC project management portal (supr.snic.se)

Projects and resources

● All UPPMAX resources are allocated to projects. All members are responsible for sharing project resources constructively.

● All projects are created and managed in SUPR.

● A project can have:
  – Thousands of core-hours per month (kch/m)
    ● A constant 2-core job will use about 1.5 kch/month (2 cores × 24 h × ~30 days ≈ 1,440 core-hours)
  – Storage (GB)
    ● Project storage in /proj/xyz
    ● Can have backup or nobackup separately or together

Projects and resources

● The cost? Free to you, but…

● Rackham’s extension cost (very roughly):
  – 375 kr/TB per year
  – 0.1 kr/core-hour

● A typical PhD student’s project that uses 1 TB for four years and averages 1000 core-hours/month represents 6,300 kr (worked out in the sketch below).

● An ongoing potato genomics project that uses 20 TB and averages 1000 core-hours/month represents 8,700 kr each year.

● A large shotgun genomics project that uses 5 TB and averages 30,000 core-hours/month for half a year represents 19,000 kr.

● The creation of a new reference genome, requiring 100 TB and 50,000 core-hours/month for a year, represents almost 100,000 kronor.
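As a rough check of the PhD-student figure, a small bash sketch using the rates above:

#!/bin/bash
# PhD-student example: 1 TB stored for 4 years, 1000 core-hours/month
storage_kr=$(( 1 * 375 * 4 ))           # 1 TB x 375 kr/TB/year x 4 years = 1500 kr
compute_kr=$(( 1000 * 12 * 4 / 10 ))    # 48 000 core-hours x 0.1 kr/core-hour = 4800 kr
echo "Total: $(( storage_kr + compute_kr )) kr"    # 6300 kr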

Project types

● SNIC (Rackham)
  – Small — anyone can get 2 kch/m and 128 GB
  – Medium — researchers can get up to 100 kch/m
  – Large — groups can get lots of time
  – UPPMAX Storage — additional GB for SNIC projects (still has plenty of space)
  – SciLifeLab Storage — additional GB for life science research on Rackham (fully booked)

● SNIC-SENS (Bianca)
  – For work with sensitive personal data
  – Small (up to 20 TB) and Medium

Guide for project applications: http://uppmax.uu.se/support/getting-started/applying-for-projects/

Who are we, what do we do?

● System Administrators (about 10 people)

– Have root access

– Fix problems requiring privileged access (e.g. account issues)

– Maintain operating systems and build software infrastructure

– Etc etc etc: They keep all the systems running

● Application Experts (about 7 people)

– Install software

– Help users with application- or science-related issues

– Give user workshops & seminars

– Represent user community to UPPMAX & SNIC

● Others

– UPPMAX Director: Elisabeth Larsson

– SNAC WG: Marcus Holm (manages project allocations)

– Financial administration

UPPMAX

● The basic structure of a supercomputer

[diagram, built up over three slides]

UPPMAX

● The basic structure of a compute node

[diagram: cores, RAM and local disk inside the node, with connections to other nodes, network storage and the Internet]

UPPMAX

Storage systems:
  – Crex — Rackham /proj directories
  – Castor — storage for Bianca
  – Cygnus — new storage for Bianca
  – ”scratch” — node-local disks

UPPMAX

Storage system basics:

● All nodes can access:
  – your home directory on Domus
  – a project directory on Crex or Castor
  – its own local disk (2-3 TB)

● If you’re reading/writing a file once, use a directory on Crex or Castor

● If you’re reading/writing a file many times, copy the file to ”scratch”, the node-local disk: ”cp myFile $SNIC_TMP” (see the job-script sketch below)
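A minimal job-script sketch of that pattern; the file names and the project path are only illustrative:

#!/bin/bash -l
#SBATCH -A g2018014
#SBATCH -p core
#SBATCH -n 1
#SBATCH -t 01:00:00

# copy the heavily used input file to the node-local scratch disk
cp /proj/g2018014/marcusl/myFile $SNIC_TMP/
cd $SNIC_TMP

# ... run the program that reads/writes myFile many times here ...

# copy the results back to network storage before the job ends
cp results.txt /proj/g2018014/marcusl/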

Graphical Access

● ThinLinc
  – Modern, efficient, experimental
  – http://www.uppmax.uu.se/support/user-guides/thinlinc-graphical-connection-guide/

● X11 forwarding
  – Ancient, slow, still useful
  – Connect with ”ssh -X username@rackham.uppmax.uu.se”
  – Check that it works by running e.g. xclock
  – Using a Mac (OS X/macOS)? Then you must install XQuartz (https://www.xquartz.org)
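For example, the X11 connection can be tested like this (username is a placeholder; on a Mac, start XQuartz first):

# on your own computer
ssh -X username@rackham.uppmax.uu.se

# on the Rackham login node: a clock window should appear on your screen
xclock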

Queue System

● More users than nodes
  – Need for a queue

Queue System

● Scheduling jobs
  – Long or short, narrow or wide?

[diagram: jobs packed onto nodes over time]

Queue System

● Scheduling jobs
  – Short and narrow jobs are easier to schedule

[diagram: jobs packed onto nodes over time]

A job?

● Job = what happens during booked time
  – Described in a bash script file:
    ● Slurm parameters
    ● Load software modules
    ● Move around the file system
    ● Start programs
    ● ...and more

Queue System

● 1 mandatory setting for jobs:
  – Who ”pays” for it? (-A)

● 3 settings you really should set:
  – Where should it run? (-p)
  – How wide is it? (-n)
  – How long at most? (-t)

● If in doubt: -p core -n 1 -t 10-00:00:00 (see the example command below)
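The same flags can also be given directly on the sbatch command line, where they override any #SBATCH lines in the script; a minimal sketch, assuming a script called myjobscript.sh:

sbatch -A g2018014 -p core -n 1 -t 10-00:00:00 myjobscript.sh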

Queue System

● Who ”pays” for it? (-A)
  – Only projects can be charged
  – You have to be a member
  – This course's project ID: g2018014

● -A = account (the account you charge)
  – No default value, mandatory

Queue System

● Where should it run? (-p)
  – Use a whole node or just part of it?
  – 1 node = 20 cores (16 on Bianca)
  – 1 hour walltime = 20 core-hours = expensive
  – Waste of resources unless you have a parallel program

● -p = partition (node or core)
  – Default value: core

Queue System

● How wide is it? (-n)
  – How much of the node should be booked?
  – 1 node = 20 cores
  – Any number of cores: 1, 2, 5, 13, 15 etc.

● -n = number of cores
  – Default value: 1
  – Usually used together with -p core

Queue System

● How long is it? (-t)
  – Always overestimate by ~50%
  – Jobs are killed when the time limit is reached
  – You are only charged for the time used

● -t = time (hh:mm:ss or d-hh:mm:ss)
  – e.g. 78:00:00 or 3-6:00:00
  – Default value: 7-00:00:00

Queue System

● How to submit a job
  – Write a script (bash):
    ● Queue options
    ● Rest of the script

#!/bin/bash -l
#SBATCH -A g2018014
#SBATCH -p core
#SBATCH -n 1
#SBATCH -t 00:10:00
#SBATCH -J Template_script

# go to some directory
cd /proj/g2018014/marcusl

# load software modules
module load bioinfo-tools

# do something
echo Hello world!

Queue System

● How to submit a job
  – Script written, now what?

[marcusl@rackham1 ~]$ sbatch myjobscript.sh
Submitted batch job 4367759
[marcusl@rackham1 ~]$ jobinfo -u marcusl
CLUSTER: rackham
Running jobs:
   JOBID PARTITION NAME USER ACCOUNT ST START_TIME TIME_LEFT NODES CPUS NODELIST(REASON)

Nodes in use: 479
Nodes in devel, free to use: 1
Nodes in other partitions, free to use: 0
Nodes available, in total: 480
Nodes in test and repair: 6
Nodes, all in total: 486

Waiting jobs:
   JOBID POS PARTITION            NAME    USER ACCOUNT ST START_TIME TIME_LEFT PRIORITY CPUS NODELIST(REASON) FEATURES DEPENDENCY
 4367759  12      core Template_script marcusl   staff PD        N/A   1:00:00   190000    1      (Resources)          (null)

Waiting bonus jobs:

SLURM Output

● Prints to a file instead of the terminal: slurm-<job id>.out

[marcusl@tintin2 glob]$ ls -l
total 4
-rw-rw-r-- 1 marcusl marcusl 62 Jun 20 13:40 my_script.sb
[marcusl@tintin2 glob]$
[marcusl@tintin2 glob]$ sbatch my_script.sb
Submitted batch job 10281906
[marcusl@tintin2 glob]$
[marcusl@tintin2 glob]$ ls -l
total 4
-rw-rw-r-- 1 marcusl marcusl 92 Jun 20 13:40 my_script.sb
-rw-rw-r-- 1 marcusl marcusl 87 Jun 20 13:40 slurm-10281906.out
[marcusl@tintin2 glob]$

SLURM Output

● Prints to a file instead of the terminal: slurm-<job id>.out

[marcusl@rackham2 test]$ ls
my_script.sh
[marcusl@rackham2 test]$
[marcusl@rackham2 test]$ sbatch my_script.sh
Submitted batch job 10281906
[marcusl@rackham2 test]$
[marcusl@rackham2 test]$ ls
my_script.sh  slurm-10281906.out
[marcusl@rackham2 test]$
[marcusl@rackham2 test]$ cat slurm-10281906.out
Example of error with line number and message
slurm_script: 40: An error has occurred.
[marcusl@rackham2 test]$
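If you prefer a more descriptive output file name, SLURM's -o/--output option can be added to the job script; a small sketch (the file name is only an illustration):

#SBATCH -o my_analysis_%j.out    # %j is replaced by the job ID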

SLURM Tools

● squeue — quick info about jobs in the queue
● jobinfo — detailed info about jobs
● finishedjobinfo — summary of finished jobs
● jobstats — efficiency of booked resources

Squeue

● Shows quick information about the job queue
  – All jobs: squeue
  – Your jobs: squeue -u <user>

[marcusl@rackham2 test]$ squeue
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 4362762      core    kbs-3   peterj CG   18:00:47      1 r446
 4362767      core    kbs-3   peterj CG   18:00:47      1 r446
 4430481      node supernov     remi PD       0:00      1 (Priority)
 4433857      node pretest3 maka4186 PD       0:00     16 (Priority)
 4433861      node     Freq    emile PD       0:00      4 (Priority)
 4433740      node    REDOH    jolla PD       0:00      4 (Priority)
 4433872      node q_timing batchtst PD       0:00      1 (Priority)
 4433878      core final_vc madeline PD       0:00      1 (Priority)
 4433890      node gm_grnof    gulla PD       0:00      1 (Priority)

Jobstats

● Shows efficiency information of finished jobs

[marcusl@localmac ~]$ ssh -X marcusl@rackham.uppmax.uu.se

..

[marcusl@rackham2 test]$ jobstats -p -A g2018014

Running '/sw/uppmax/bin/finishedjobinfo -M rackham g2018014' through a pipe to get more information, please be patient…

..

*** 10 total jobs, 0 jobs not run, 0 jobs had no jobstats files (includes jobs not run)

[marcusl@rackham2 test]$ eog *.png &


Interactive

● Books a node and connects you to it
  – No X11 forwarding through this connection; you have to ”ssh -X” in with another window (see the sketch after the example below)

interactive -A <proj id> -p <core or node> -t <time>

[marcusl@rackham2 test]$ interactive -A g2018014 -p core -n 4 -t 03:00:00
You receive the high interactive priority.
There are free cores, so your job is expected to start at once.
Please, use no more than 25.6 GB of RAM.
Waiting for job 4434279 to start…
[marcusl@r483 ~/test]$
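If you need graphics inside the interactive session, one option (a sketch based on the node name in the example above) is to open a second terminal and hop to the booked node with X11 forwarding:

# in a second terminal, once the interactive job is running on node r483
ssh -X marcusl@rackham.uppmax.uu.se
ssh -X r483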

UPPMAX Software

● 100+ programs installed

● Managed by a 'module system'
  – Installed, but hidden
  – Manually loaded before use

module avail — lists all available modules

module load <module name> — loads the module

module unload <module name> — unloads the module

module list — lists loaded modules

module spider <name> — searches for modules
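Putting the commands together, a typical session might look like this sketch; samtools and the version number are only illustrative and may differ on the system:

# bioinformatics software sits behind the bioinfo-tools umbrella module
module load bioinfo-tools

# search for a tool, then load a specific version (version here is hypothetical)
module spider samtools
module load samtools/1.9

# confirm what is loaded and that the program is on the PATH
module list
which samtools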

UPPMAX Commands

● uquota – show disk space

● Laboratory time! (again)
  – Same instructions PDF as this morning
  – Do chapter 3
  – If you finish:
    ● Go back and finish chapter 2
    ● Then do chapter 4
