ARC Cluster
Frank Mueller
North Carolina State University
2
PIs & Funding
NSF funding level: $550k
NCSU: $60k (ETF) + $60+k (CSC)
NVIDIA: donations ~$30k
PIs/co-PIs:
— Frank Mueller
— Vincent Freeh
— Helen Gu
— Xuxian Jiang
— Xiaosong Ma
Contributors:
— Nagiza Samatova
— George Rouskas
3
ARC Cluster: In the News
“NC State is Home to the Most Powerful Academic HPC in North Carolina” (CSC News, Feb 2011)
“Crash-Test Dummy For High-Performance Computing” (NCSU, The Abstract, Apr 2011)
“Supercomputer Stunt Double” (insideHPC, Apr 2011)
4
Purpose
Create a mid-size computational infrastructure to support research in computer science, the sciences, and engineering.
5
Researchers Already Active
Research groups from many institutions:
Within NCSU:
— CSC, ECE, Mech+AeroSpace, Physics, Chem/Bio Engineering, Materials, Operations Research
External:
— ORNL
— VT, U Arizona
— Tsinghua University, Beijing, China
— etc.
6
System Overview
[Architecture diagram: front tier (head/login nodes), mid tier (compute/spare nodes), back tier (I/O nodes with an SSD+SATA storage array), connected by Gigabit Ethernet and Infiniband switch stacks plus a dedicated PFS switch stack.]
7
Hardware
108 Compute Nodes
— 2-way SMPs with AMD Opteron / Intel Sandy Bridge/Ivy Bridge/Broadwell processors
— 8 cores/socket, 16 cores/node
— 16-64 GB DRAM per node
1728 compute cores available
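Not from the original slides: once logged into a node, a quick way to verify its CPU layout is the standard lscpu tool, e.g.:
lscpu | grep -E 'Model name|Socket|Core'   # CPU model, sockets, cores per socket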
8
Interconnects
Gigabit Ethernet
— interactive jobs, ssh, service
— Home directories
40 Gbit/s Infiniband (OFED stack)
— MPI Communication
— Open MPI, MVAPICH
— IP over IB
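To confirm that the Infiniband link and the IP-over-IB interface are up on a node (assuming the usual OFED utilities are installed and the common interface name ib0), one can use:
ibstat | grep -i state   # port state; an active link reports "State: Active"
ip addr show ib0         # the IP-over-IB interface and its address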
9
NVIDIA GPUs
C2070/M2070
C2050
GTX480
K20c
GTX780
GTX680
K40c
GTX Titan X
GTX 1080
Titan X
10
Solid State Drives
Most compute nodes are equipped with:
— OCZ RevoDrive 120GB PCIe
— Crucial CT275MX3 275GB SATA3
— Samsung PM1725 Series 1.6TB NVMe
11
File Systems
Available today:
— NFS home directories over Gigabit Ethernet
— Local per-node scratch on spinning disks (ext3)
— Local per-node 120GB SSD (ext2)
Parallel file system:
– PVFS2
– Separate dedicated nodes are available for parallel file systems
– 4 clients, one of them doubles as MDS
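A quick, generic way to see which of these file systems are mounted on a given node (mount points are site-specific):
df -hT   # list each mounted file system with its type (nfs, ext3, ext2, pvfs2) and usage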
12
Power Monitoring
Watts Up Pro meters
— Serial and USB available
Connected in groups of:
— Mostly 4 nodes (sometimes just 3)
— 2x 1 node:
– 1 w/ GPU
– 1 w/o GPU
13
Software Stack
Additional packages and libraries
— Available upon request, but…
— Not free? You need to pay.
— License required? You need to sign it.
— Installation required? You need to:
– Test it
– Provide an install script
Check the ARC website: it is constantly changing.
14
Base System
OpenHPC 1.2.1 (over CentOS 7.2)
Batch system:— Slurm
All compilers and tools are available on the compute nodes
— gcc, gfortran, java, …
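Since the OpenHPC stack manages compilers and libraries via environment modules, the standard module commands show what is available:
module avail   # list all installed modules (compilers, MPI stacks, tools)
module list    # show the modules currently loaded in your environment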
15
MPI
MVAPICH
— Infiniband support
— Already in your default PATH
– mpicc
Open MPI
— Operates over Infiniband
— Activate: module switch mvapich2 openmpi
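As a minimal sanity check of the MPI setup (not part of the original slides), the classic hello-world program below can be built with the mpicc wrapper that MVAPICH puts in your PATH:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);               /* initialize the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of ranks */
    MPI_Get_processor_name(name, &len);   /* node this rank runs on */
    printf("rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
Build with mpicc -o hello hello.c, then launch it under Slurm (see the srun/prun examples later in these slides).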
16
OpenMP
The "#pragma omp" directive in C programs works.
gcc -fopenmp -o fn fn.c
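For illustration (not from the original slides), a minimal program exercising such a directive:
#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* each thread in the parallel region prints its own ID */
    #pragma omp parallel
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
}
Compile exactly as above (gcc -fopenmp -o fn fn.c); set OMP_NUM_THREADS to control the thread count.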
17
CUDA SDK
Ensure you are using a node with a GPU
— Several types are available to fine-tune for your application’s needs:
– Well-performing single- or double-precision devices
Activate: module load cuda
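Once the module is loaded, nvcc is on your PATH; a typical build-and-run sequence (file and partition names are illustrative) looks like:
nvcc -o fn fn.cu           # compile a CUDA source file
srun -N 1 -p gtx480 ./fn   # run it on a node that has a GTX 480 GPU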
18
PGI Compiler (Experimental)
pgcc, pgCC
pgf77, pgf95, pghpf
OpenACC support
module load pgi64
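For example, OpenACC directives can be enabled at compile time (the flags shown are the standard PGI ones; exact options may differ by compiler version):
module load pgi64                   # put the PGI compilers in PATH
pgcc -acc -Minfo=accel -o fn fn.c   # compile C with OpenACC, report accelerator decisions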
19
Virtualization via LXD
Uses Linux containers
Goal: allow users to configure their own environment
— User gets full root access within the container
— Much smaller footprint than a VM
Docker support inside LXD
— Requires more space
A full VM may also be run inside:
— VirtualBox…
— Full VM space required
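As an illustration of the container workflow (image names and availability depend on the site configuration), the standard LXD client commands look like:
lxc launch images:centos/7 mybox     # create and start a container named "mybox"
lxc exec mybox -- /bin/bash          # get a root shell inside the container
lxc stop mybox && lxc delete mybox   # tear the container down when done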
20
Job Submission
You cannot SSH to a compute node directly.
You must use srun to submit jobs
— Either as batch
— or interactively
Presently there are “hard” limits for job times and sizes. In general, please be considerate of other users and do not abuse the system.
There are special queues for nodes with certain CPUs/GPUs
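A minimal batch script, as a sketch (partition, resource, and program names are illustrative; check the limits on the ARC website), submitted with sbatch:
#!/bin/bash
#SBATCH -n 16            # number of tasks (cores)
#SBATCH -N 1             # number of nodes
#SBATCH -p opteron       # partition (queue), e.g. one of the CPU/GPU-specific ones
#SBATCH -t 00:30:00      # wall-clock time limit

prun ./my_mpi_program    # prun picks up the task layout from Slurm
Submit with: sbatch myjob.sh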
21
Slurm Basics
On the login node:
— To submit a job: srun …
— To list jobs: squeue
— To list details of your job: scontrol …
— To delete/cancel/stop your job: scancel …
— To check node status: sinfo
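Typical invocations of these commands (the job ID shown is a placeholder):
squeue -u $USER           # your jobs only
scontrol show job 12345   # full details of job 12345
scancel 12345             # cancel job 12345
sinfo                     # partition and node status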
22
srun Basics
srun -n 32 --pty /bin/bash # get 32 cores (2 nodes) in interactive mode
srun -n 16 -N 1 --pty /bin/bash # get 1 interactive node with 16 cores
srun -n 32 -N 2 -w c[30,31] --pty /bin/bash #run on nodes 30+31
srun -n 64 -N 4 -w c[30-33] --pty /bin/bash #run on nodes 30-33
srun -n 64 -N 4 -p opteron --pty /bin/bash #run on any 4 opteron nodes
srun -n 64 -N 4 -p gtx480 --pty /bin/bash #any 4 nodes w/ GTX 480 GPUs
23
Listing your nodes
Once your job begins, $SLURM_NODELIST holds the list of nodes allocated to you.
MPI is already integrated with Slurm: simply running prun … will automatically use all processes requested from Slurm.
For example, a CUDA programmer that wants to use 4 GPU nodes:
[fmuelle@login ~]$ srun -N 4 -p gtx480 --pty /bin/bash
[fmuelle@c103 ~]$ echo $SLURM_NODELIST
c[103-106]
Note: SSHing between these nodes from within the Slurm session is allowed.
24
Hardware in Action
4 racks in server room
25
Temperature Monitoring
It is the user’s responsibility to keep room temperatures below 80°F while utilizing the cluster.
— ARC website has links to online browser-based temperature monitors.
— And the building staff have pagers that will alarm 24/7 when temperatures exceed the limit.
26
Connecting to ARC
ARC access is restricted to on-campus IPs only.
— If you are ever unable to log in (the connection is dropped immediately, before authentication), this is likely the cause.
Non-NCSU users may request remote access by providing a remote machine that their connections must originate from.
27
Summary
Your ARC Cluster@Home: What can I do with it?
Primary purpose: Advance computer science research (HPC and beyond)
— Want to run a job over the entire machine?
— Want to replace parts of the software stack?
Secondary purpose: Service to the sciences, engineering & beyond
— Vision: Have domain scientists work w/ computer scientists on code
http://moss.csc.ncsu.edu/~mueller/cluster/arc/
Equipment donations welcome. Ideas on how to improve ARC? Let us know.
— Questions? Send them to the mailing list (once you have an account)
— Request an account: email dfiala<at>ncsu.edu
– Research topic, abstract, and compute requirements/time
– Must include your Unity ID
– NCSU students: advisor sends the email as their means of approval
– Non-NCSU: same + preferred username + hostname (your remote login location)
28
Slides provided by David Fiala
Edited by Frank Mueller
Current as of Aug 21, 2017