TRANSCRIPT
High Performance Computing Workshop
(Statistics) HPC 101
Dr. Charles J Antonelli
LSAIT ARS
January, 2013
Credits
Contributors:
Brock Palen (CoE-IT CAC)
Jeremy Hallum (MSIS)
Tony Markel (MSIS)
Bennet Fauber (CoE-IT CAC)
LSAIT ARS
UM CoE-IT CAC
Roadmap
Flux Mechanics
High Performance Computing
Flux Architecture
Flux Batch Operations
Introduction to Scheduling
Flux Mechanics
Using Flux
Three basic requirements to use Flux:
1. A Flux account
2. A Flux allocation
3. An MToken (or a Software Token)
Using Flux
1. A Flux account
Allows login to the Flux login nodes
Develop, compile, and test code
Available to members of U-M community, free
Get an account by visiting https://www.engin.umich.edu/form/cacaccountapplication
Using Flux
2. A Flux allocation
Allows you to run jobs on the compute nodes
Current rates:
$18 per core-month for Standard Flux
$24.35 per core-month for BigMem Flux
$8 subsidy per core-month for LSA and Engineering
Details at http://www.engin.umich.edu/caen/hpc/planning/costing.html
To inquire about Flux allocations please email [email protected]
Using Flux
3. An MToken (or a Software Token)
Required for access to the login nodes
Improves cluster security by requiring a second means of proving your identity
You can use either an MToken or an application for your mobile device (called a Software Token) for this
Information on obtaining and using these tokens at http://cac.engin.umich.edu/resources/loginnodes/twofactor.html
Logging in to Flux
ssh flux-login.engin.umich.edu
MToken (or Software Token) required
You will be randomly connected to a Flux login node
Currently flux-login1 or flux-login2
Firewalls restrict access to flux-login. To connect successfully, either:
Physically connect your ssh client platform to the U-M campus wired network, or
Use VPN software on your client platform, or
Use ssh to login to an ITS login node, and ssh to flux-login from there
Modules
The module command allows you to specify what versions of software you want to use:
module list -- Show loaded modules
module load name -- Load module name for use
module avail -- Show all available modules
module avail name -- Show versions of module name
module unload name -- Unload module name
module -- List all options
Enter these commands at any time during your session
A configuration file allows default module commands to be executed at login
Put module commands in the file ~/privatemodules/default
Don't put module commands in your .bashrc / .bash_profile
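As a hedged illustration, a ~/privatemodules/default that loads R at every login might look like the following (standard modulefile syntax; the exact setup on Flux may differ):

#%Module1.0
# Load the default R module at every login
module load R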
Flux environment
The Flux login nodes have the standard GNU/Linux toolkit:
make, autoconf, awk, sed, perl, python, java, emacs, vi, nano, …
Watch out for source code or data files written on non-Linux systems
Use these tools to analyze and convert source files to Linux format:
file
dos2unix, mac2unix
Lab 1
Task: Invoke R interactively on the login node
module load R
module list
R
q()
Please run only very small computations on the Flux login nodes, e.g., for testing
Lab 2
Task: Run R in batch mode
module load R
Copy sample code to your login directory:
cd
cp ~cja/stats-sample-code.tar.gz .
tar -zxvf stats-sample-code.tar.gz
cd ./stats-sample-code
Examine lab2.pbs and lab2.R
Edit lab2.pbs with your favorite Linux editor
Change #PBS -M email address to your own
Lab 2
Task: Run R in batch mode
Submit your job to Flux:
qsub lab2.pbs
Watch the progress of your job:
qstat -u uniqname
where uniqname is your own uniqname
When complete, look at the job's output:
less lab2.out
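For orientation, a hedged sketch of what a script like lab2.pbs typically contains; the course copy may differ, and the allocation, email, and walltime values here are placeholders:

#PBS -N lab2
#PBS -V
#PBS -A stats_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l procs=1,walltime=00:10:00
#PBS -M youremailaddress
#PBS -m abe
#PBS -j oe

cd $PBS_O_WORKDIR
# Run R non-interactively on the lab script
R CMD BATCH --vanilla lab2.R lab2.out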
Lab 3
Task: Use the multicore package in R
The multicore package allows you to use multiple cores on a single node
module load R
cd ~/stats-sample-code
Examine lab3.pbs and lab3.R
Edit lab3.pbs with your favorite Linux editor
Change #PBS -M email address to your own
Lab 3
Task: Use the multicore package in R
Submit your job to Flux:
qsub lab3.pbs
Watch the progress of your job:
qstat -u uniqname
where uniqname is your own uniqname
When complete, look at the job's output:
less lab3.out
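For reference, a minimal sketch of multicore-style R code (illustrative only; the actual lab3.R may differ):

library(multicore)   # predecessor of the 'parallel' package in later R versions
# A toy task: each call can run on its own core of a single node
sim <- function(i) mean(rnorm(1000000))
# Run 8 tasks in parallel across the node's cores
results <- mclapply(1:8, sim)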
Lab 4
Task: Another multicore example in R
module load R
cd ~/stats-sample-code
Examine lab4.pbs and lab4.R
Edit lab4.pbs with your favorite Linux editor
Change #PBS -M email address to your own
Lab 4
Task: Another multicore example in R
Submit your job to Flux:
qsub lab4.pbs
Watch the progress of your job:
qstat -u uniqname
where uniqname is your own uniqname
When complete, look at the job's output:
less lab4.out
Lab 5
Task: Run snow interactively in R
The snow package allows you to use cores on multiple nodes
module load R
cd ~/stats-sample-code
Examine lab5.R
Start an interactive PBS session:
qsub -I -V -l procs=3 -l walltime=30:00 -A stats_flux -l qos=flux -q flux
Lab 5
Task: Run snow interactively in R
cd $PBS_O_WORKDIR
Run snow in the interactive PBS session:
R CMD BATCH --vanilla lab5.R lab5.out
… ignore any "Connection to lifeline lost" message
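A minimal snow sketch, assuming a socket cluster of three workers (the lab script may instead use MPI across the PBS-assigned nodes):

library(snow)
# Start three workers; snow also supports type = "MPI"
cl <- makeCluster(3, type = "SOCK")
# One task per worker
clusterApply(cl, 1:3, function(x) x^2)
stopCluster(cl)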
Lab 6
Task: Run snowfall in R
The snowfall package is similar to snow, and allows you to change the number of cores used without modifying your R code
module load R
cd ~/stats-sample-code
Examine lab6.pbs and lab6.R
Edit lab6.pbs with your favorite Linux editor
Change #PBS -M email address to your own
Lab 6
Task: Run snowfall in R
Submit your job to Flux:
qsub lab6.pbs
Watch the progress of your job:
qstat -u uniqname
where uniqname is your own uniqname
When complete, look at the job's output:
less lab6.out
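A minimal snowfall sketch, showing how the core count can change without touching the worker code (illustrative only; lab6.R may differ):

library(snowfall)
# Change cpus here without editing the parallel code below
sfInit(parallel = TRUE, cpus = 4)
result <- sfLapply(1:100, function(i) i^2)
sfStop()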
Lab 7
Task: Run parallel MATLAB
Distribute parfor iterations over multiple cores on multiple nodes
Do this once:
mkdir ~/matlab/
cd ~/matlab
wget http://cac.engin.umich.edu/resources/software/matlabdct/mpiLibConf.m
Lab 7
Task: Run parallel MATLAB
Start an interactive PBS session:
module load matlab
qsub -I -V -l nodes=2:ppn=3 -l walltime=30:00 -A stats_flux -l qos=flux -q flux
Start MATLAB:
matlab -nodisplay
Lab 7
Task: Run parallel MATLAB
Set up a matlabpool:
sched = findResource('scheduler', 'type', 'mpiexec')
set(sched, 'MpiexecFileName', '/home/software/rhel6/mpiexec/bin/mpiexec')
set(sched, 'EnvironmentSetMethod', 'setenv')
% use the 'sched' object when calling matlabpool
% the syntax for matlabpool must use the (sched, N) format
matlabpool(sched, 6)
… ignore “Found pre-existing parallel job(s)” warnings
Lab 7
Task: Run parallel MATLAB
Run a simple parfor:
tic
x = 0;
parfor i = 1:100000000
  x = x + i;
end
toc
Close the matlabpool:
matlabpool close
Compiling Code
Assuming default module settings
Use mpicc/mpiCC/mpif90 for MPI code
Use icc/icpc/ifort with -openmp for OpenMP code
Serial code, Fortran 90:
ifort -O3 -ipo -no-prec-div -xHost -o prog prog.f90
Serial code, C:
icc -O3 -ipo -no-prec-div -xHost -o prog prog.c
MPI parallel code:
mpicc -O3 -ipo -no-prec-div -xHost -o prog prog.c
mpirun -np 2 ./prog
Lab
Task: compile and execute simple programs on the Flux login node
Copy sample code to your login directory:
cd
cp ~brockp/cac-intro-code.tar.gz .
tar -xvzf cac-intro-code.tar.gz
cd ./cac-intro-code
Examine, compile & execute helloworld.f90:
ifort -O3 -ipo -no-prec-div -xHost -o f90hello helloworld.f90
./f90hello
Examine, compile & execute helloworld.c:
icc -O3 -ipo -no-prec-div -xHost -o chello helloworld.c
./chello
Examine, compile & execute MPI parallel code:
mpicc -O3 -ipo -no-prec-div -xHost -o c_ex01 c_ex01.c
… ignore the "feupdateenv is not implemented and will always fail" warning
mpirun -np 2 ./c_ex01
… ignore runtime complaints about missing NICs
Makefiles
The make command automates your code compilation process
Uses a makefile to specify dependencies between source and object files
The sample directory contains a sample makefile
To compile c_ex01: make c_ex01
To compile all programs in the directory: make
To remove all compiled programs: make clean
To make all the programs using 8 compiles in parallel: make -j8
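As a hedged illustration, a makefile in the same spirit as the sample (target names and flags assumed from the compile examples above; recipe lines must begin with a tab):

CC     = icc
CFLAGS = -O3 -ipo -no-prec-div -xHost

all: c_ex01

# c_ex01 is rebuilt only when c_ex01.c changes
c_ex01: c_ex01.c
	$(CC) $(CFLAGS) -o c_ex01 c_ex01.c

clean:
	rm -f c_ex01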
High Performance Computing
Advantages of HPC
Cheaper than the mainframe
More scalable than your laptop
Buy or rent only what you need
COTS (commercial off-the-shelf) hardware
COTS software
COTS expertise
Disadvantages of HPC
Serial applications
Tightly-coupled applications
Truly massive I/O or memory requirements
Difficulty/impossibility of porting software
No COTS expertise
Programming Models
Two basic parallel programming models:
Message-passing
The application consists of several processes running on different nodes and communicating with each other over the network
Used when the data are too large to fit on a single node, and simple synchronization is adequate
“Coarse parallelism”
Implemented using MPI (Message Passing Interface) libraries
Multi-threaded
The application consists of a single process containing several parallel threads that communicate with each other using synchronization primitives
Used when the data can fit into a single process, and the communications overhead of the message-passing model is intolerable
“Fine-grained parallelism” or “shared-memory parallelism”
Implemented using OpenMP (Open Multi-Processing) compilers and libraries
Both models can be combined in one application (hybrid MPI + OpenMP); a minimal multi-threaded sketch follows
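To make the multi-threaded model concrete, a minimal OpenMP example in C (not from the course materials; compile with icc -openmp):

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Each thread in the team executes this block */
    #pragma omp parallel
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}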
Good parallel
Embarrassingly parallel
Folding@home, RSA Challenges, password cracking, …
Regular structures
Divide & conquer, e.g. Quicksort
Pipelined: N-body problems, matrix multiplication
O(n²) -> O(n)
Less good parallel
Serial algorithms
Those that don't parallelize easily
Irregular data & communications structures
E.g., surface/subsurface water hydrology modeling
Tightly-coupled algorithms
Unbalanced algorithms
Master/worker algorithms, where the worker load is uneven
Amdahl’s Law
If you enhance a fraction f of a computation by a speedup S, the overall speedup is:
Speedup(f, S) = 1 / ((1 - f) + f/S)
For example, accelerating f = 0.9 of the work by S = 10 gives 1 / (0.1 + 0.09) ≈ 5.3 overall.
Flux Architecture
The Flux cluster
[Diagram: login nodes, data transfer node, compute nodes, and shared storage]
Behind the curtain
[Diagram: login nodes and the data transfer node front a shared pool of compute nodes and storage, split between the nyx and flux clusters]
A Flux node
12 Intel cores
48 GB RAM
Local disk
Ethernet, InfiniBand
A Newer Flux node
16 Intel cores
64 GB RAM
Local disk
Ethernet, InfiniBand
A Flux BigMem node
40 Intel cores
1 TB RAM
Local disk
Ethernet, InfiniBand
Flux hardware (January 2012)
Standard Flux: 8,016 Intel cores in 632 nodes; 48 GB RAM/node; 4 GB RAM/core (average)
BigMem Flux: 200 Intel cores in 5 nodes; 1 TB RAM/node; 25 GB RAM/core
4X InfiniBand network (interconnects all nodes)
40 Gbps, <2 us latency
Latency an order of magnitude less than Ethernet
Lustre Filesystem
Scalable, high-performance, open
Supports MPI-IO for MPI jobs
Mounted on all login and compute nodes
Flux software
Default software:
Intel Compilers with OpenMPI for Fortran and C
Optional software:
PGI Compilers
Unix/GNU tools
gcc/g++/gfortran
Licensed software:
Abaqus, ANSYS, Mathematica, Matlab, R, STATA SE, …
See http://cac.engin.umich.edu/resources/software/index.html
You can choose software using the module command
Flux network
All Flux nodes are interconnected via InfiniBand and a campus-wide private Ethernet network
The Flux login nodes are also connected to the campus backbone network
The Flux data transfer node will soon be connected over a 10 Gbps connection to the campus backbone network
This means:
The Flux login nodes can access the Internet
The Flux compute nodes cannot
If InfiniBand is not available for a compute node, code on that node will fall back to Ethernet communications
Flux data
Lustre filesystem mounted on /scratch on all login, compute, and transfer nodes
342TB of short-term storage for batch jobs
Large, fast, short-term
NFS filesystems mounted on /home and /home2 on all nodes
40 GB of storage per user for development & testing
Small, slow, long-term
Flux data
Flux does not provide large, long-term storage
Alternatives:
ITS Value Storage
Departmental server
CAEN can mount your storage on the login nodes
Issue the df -kh command on a login node to see what other groups have mounted
Globus Online
Features
High-speed data transfer, much faster than SCP or SFTP
Reliable & persistent
Minimal client software: Mac OS X, Linux, Windows
GridFTP Endpoints
Gateways through which data flow
Exist for XSEDE, OSG, …
UMich: umich#flux, umich#nyx
Add your own server endpoint: contact flux-support
Add your own client endpoint!
More information:
http://cac.engin.umich.edu/resources/loginnodes/globus.html
Flux Batch Operations
Portable Batch System
All production runs are run on the compute nodes using the Portable Batch System (PBS)
PBS manages all aspects of cluster job execution except job scheduling
Flux uses the Torque implementation of PBS
Flux uses the Moab scheduler for job scheduling
Torque and Moab work together to control access to the compute nodes
PBS puts jobs into queues
Flux has a single queue, named flux
Cluster workflow
You create a batch script and submit it to PBS
PBS schedules your job, and it enters the flux queue
When its turn arrives, your job will execute the batch script
Your script has access to any applications or data stored on the Flux cluster
When your job completes, anything it sent to standard output and standard error is saved and returned to you
You can check on the status of your job at any time, or delete it if it’s not doing what you want
A short time after your job completes, it disappears
Sample serial script
#PBS -N yourjobname
#PBS -V
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l procs=1,walltime=00:05:00
#PBS -M youremailaddress
#PBS -m abe
#PBS -j oe

# Your code goes below:
cd $PBS_O_WORKDIR
./f90hello
Sample batch script
#PBS -N yourjobname
#PBS -V
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l procs=16,walltime=00:05:00
#PBS -M youremailaddress
#PBS -m abe
#PBS -j oe

# Your code goes below:
cat $PBS_NODEFILE     # lists the node(s) your job ran on
cd $PBS_O_WORKDIR     # change to submission directory
mpirun ./c_ex01       # no need to specify -np
Basic batch commands
Once you have a script, submit it:
qsub scriptfile
$ qsub singlenode.pbs
6023521.nyx.engin.umich.edu
You can check on the job status:
qstat jobid
$ qstat 6023521
nyx.engin.umich.edu:
                                                      Req'd  Req'd   Elap
Job ID            Username Queue  Jobname  SessID NDS TSK Memory Time  S Time
----------------- -------- ------ -------- ------ --- --- ------ ----- - ----
6023521.nyx.engi  cja      flux   hpc101i      --   1   1     -- 00:05 Q   --
To delete your job:
qdel jobid
$ qdel 6023521
$
Lab
Task: Run an MPI job on 8 cores
Compile c_ex05:
cd ~/cac-intro-code
make c_ex05
Edit the file run with your favorite Linux editor:
Change #PBS -M address to your own
I don't want Brock to get your email!
Change #PBS -A allocation to stats_flux, or to your own allocation, if desired
Change #PBS -l qos to flux
Submit your job:
qsub run
PBS attributes
As always, man qsub is your friend
-N : sets the job name; can't start with a number
-V : copy shell environment to compute node
-A youralloc_flux : sets the allocation you are using
-l qos=flux : sets the quality of service parameter
-q flux : sets the queue you are submitting to
-l : requests resources, like number of cores or nodes
-M : whom to email; can be multiple addresses
-m : when to email: a=job abort, b=job begin, e=job end
-j oe : join STDOUT and STDERR to a common file
-I : allow interactive use
-X : allow X GUI use
PBS resources (1)
A resource (-l) can specify:
Request wallclock (that is, running) time:
-l walltime=HH:MM:SS
Request C MB of memory per core:
-l pmem=Cmb
Request T MB of memory for the entire job:
-l mem=Tmb
Request M cores on arbitrary node(s):
-l procs=M
Request a token to use licensed software:
-l gres=stata:1
-l gres=matlab
-l gres=matlab%Communication_toolbox
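Putting these together, a hedged example of one job's resource request (the values and script name are illustrative):

qsub -l procs=8,pmem=2000mb,walltime=02:00:00 -A youralloc_flux -l qos=flux -q flux myjob.pbs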
PBS resources (2)
A resource (-l) can specify:
For multithreaded code:
Request M nodes with at least N cores per node:
-l nodes=M:ppn=N
Request M nodes with exactly N cores per node:
-l nodes=M:tpn=N
(you'll only use this for specific algorithms)
Interactive jobs
You can submit jobs interactively:
qsub -I -V -l procs=2 -l walltime=15:00 -A youralloc_flux -l qos=flux -q flux
This queues a job as usual
Your terminal session will be blocked until the job runs
When it runs, you will be connected to one of your nodes
Invoked serial commands will run on that node
Invoked parallel commands (e.g., via mpirun) will run on all of your nodes
When you exit the terminal session your job is deleted
Interactive jobs allow you to:
Test your code on cluster node(s)
Execute GUI tools on a cluster node with output on your local platform’s X server
Utilize a parallel debugger interactively
Lab
Task: Run an interactive job
Enter this command (all on one line):
qsub -I -V -l procs=2 -l walltime=15:00 -A FluxTraining_flux -l qos=flux -q flux
When your job starts, you’ll get an interactive shell
Copy and paste the batch commands from the “run” file, one at a time, into this shell
Experiment with other commands
After fifteen minutes, your interactive shell will be killed
Introduction to Scheduling
The Scheduler (1/3)
Flux scheduling policies:
The job's queue determines the set of nodes you run on
The job’s account and qos determine the allocation to be charged
If you specify an inactive allocation, your job will never run
The job’s resource requirements help determine when the job becomes eligible to run
If you ask for unavailable resources, your job will wait until they become free
There is no pre-emption
The Scheduler (2/3)
Flux scheduling policies:
If there is competition for resources among eligible jobs in the allocation or in the cluster, two things help determine when you run:
How long you have waited for the resource
How much of the resource you have used so far
This is called “fairshare”
The scheduler will reserve nodes for a job with sufficient priority
This is intended to prevent starving jobs with large resource requirements
The Scheduler (3/3)
Flux scheduling policies:
If there is room for shorter jobs in the gaps of the schedule, the scheduler will fit smaller jobs in those gaps
This is called "backfill"
[Diagram: jobs packed into gaps in a cores-versus-time schedule]
Gaining insight
There are several commands you can run to get some insight into the scheduler's actions:
freenodes : shows the number of free nodes and cores currently available
showq : shows the state of the queue (like qstat -a), except shows running jobs in order of finishing
mdiag -p -t flux : shows the factors used in computing job priority
checkjob jobid : can show why your job might not be starting
showstart -e all : gives you a coarse estimate of job start time; use the smallest value returned
More advanced scheduling
Job Arrays
Dependent Scheduling
Job Arrays
• Submit copies of identical jobs
• Invoked via qsub -t:
qsub -t array-spec pbsbatch.txt
where array-spec can be:
m-n
a,b,c
m-n%slotlimit
e.g. qsub -t 1-50%10
Fifty jobs, numbered 1 through 50; only ten can run simultaneously
• $PBS_ARRAYID records the array identifier (see the sketch below)
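A hedged sketch of a job-array script; myprog and the input/output naming are hypothetical, and the resource values are placeholders:

#PBS -N myarray
#PBS -t 1-50%10
#PBS -l procs=1,walltime=00:30:00
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux

cd $PBS_O_WORKDIR
# Each copy processes the input file matching its array identifier
./myprog input.$PBS_ARRAYID > output.$PBS_ARRAYID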
Dependent scheduling
• Submit jobs whose execution scheduling depends on other jobs
• Invoked via qsub -W:
qsub -W depend=type:jobid[:jobid]…
where type can be:
after : schedule after jobids have started
afterok : schedule after jobids have finished, only if no errors
afternotok : schedule after jobids have finished, only if errors
afterany : schedule after jobids have finished, regardless of status
before, beforeok, beforenotok, beforeany
Dependent scheduling
where type can be (cont'd):
before : when this job has started, jobids will be scheduled
beforeok : after this job completes without errors, jobids will be scheduled
beforenotok : after this job completes with errors, jobids will be scheduled
beforeany : after this job completes, regardless of status, jobids will be scheduled
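For example, a hedged two-step pipeline (first.pbs and second.pbs are hypothetical); qsub prints the new job's id, which seeds the dependency:

FIRST=$(qsub first.pbs)
# second.pbs runs only if the first job finishes without errors
qsub -W depend=afterok:$FIRST second.pbs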
Flux On-Demand Pilot
Alternative to a static allocation
Pay only for the core time you use
Pros:
Accommodates "bursty" usage patterns
Cons:
Limit of 50 cores total
Limit of 25 cores for any user
The FoD pilot has ended; FoD pilot users continue to be supported while the FoD service is being defined
To inquire about FoD allocations please email [email protected]
Flux Resources
http://www.youtube.com/user/UMCoECAC
UMCoECAC's YouTube channel
http://orci.research.umich.edu/resources-services/flux/
U-M Office of Research Cyberinfrastructure Flux summary page
http://cac.engin.umich.edu/
Getting an account, basic overview (use menu on left to drill down)
http://cac.engin.umich.edu/started
How to get started at the CAC, plus cluster news, RSS feed and outages
http://www.engin.umich.edu/caen/hpc
XSEDE information, Flux in grant applications, startup & retention offers
http://cac.engin.umich.edu/ Resources | Systems | Flux | PBS
Detailed PBS information for Flux use
For assistance: [email protected]
Read by a team of people
Cannot help with programming questions, but can help with operational Flux and basic usage questions
Summary
The Flux cluster is just a collection of similar Linux machines connected together to run your code, much faster than your desktop can
Command-line scripts are queued by a batch system and executed when resources become available
Some important commands are:
qsub
qstat -u username
qdel jobid
checkjob
Develop and test, then submit your jobs in bulk and let the scheduler optimize their execution
Any Questions?
Charles J. Antonelli
LSAIT Research Systems Group
[email protected]
http://www.umich.edu/~cja
734 763 0607