ARC Cluster

Frank Mueller, North Carolina State University


TRANSCRIPT

Page 1: ARC Cluster

Frank Mueller, North Carolina State University

Page 2: ARC Cluster


PIs & Funding

NSF funding level: $550k

NCSU: $60k (ETF) + $60+k (CSC)

NVIDIA: donations ~$30k

PIs/co-PIs:

— Frank Mueller

— Vincent Freeh

— Helen Gu

— Xuxian Jiang

— Xiaosong Ma

Contributors:

— Nagiza Samatova

— George Rouskas

Page 3: ARC Cluster


ARC Cluster: In the News

“NC State is Home to the Most Powerful Academic HPC in North Carolina” (CSC News, Feb 2011)

“Crash-Test Dummy For High-Performance Computing” (NCSU, The Abstract, Apr 2011)

“Supercomputer Stunt Double” (insideHPC, Apr 2011)

Page 4: ARC Cluster


Purpose

Create a mid-size computational infrastructure to support research in areas such as computer science, engineering, and the computational sciences (see the groups on the next slide).

Page 5: ARC Cluster


Researchers Already Active

Many groups and collaborators:

Groups from within NCSU:
— CSC, ECE, Mech+AeroSpace, Physics, Chem/Bio Engineering, Materials, Operations Research

External collaborators:
— ORNL
— VT, U Arizona
— Tsinghua University, Beijing, China
— etc.

Page 6: ARC Cluster


System Overview

[System overview diagram: front, mid, and back tiers with head/login nodes, compute/spare nodes, Gigabit Ethernet and InfiniBand (IB) switch stacks, a PFS switch stack, I/O nodes, and an SSD+SATA storage array as the back-end interconnect.]

Page 7: ARC Cluster


Hardware

108 Compute Nodes
— 2-way SMPs with AMD Opteron / Intel Sandy Bridge / Ivy Bridge / Broadwell processors
— 8 cores/socket, 16 cores/node
— 16-64 GB DRAM per node

1728 compute cores available

Page 8: ARC Cluster


Interconnects

Gigabit Ethernet

— interactive jobs, ssh, service

— Home directories

40 Gbit/s InfiniBand (OFED stack)

— MPI Communication

— Open MPI, MVAPICH

— IP over IB

Page 9: ARC Cluster


NVIDIA GPUs

C/M2070

C2050

GTX480

K20c

GTX780

GTX680

K40c

GTX Titan X

GTX 1080

Titan X

Page 10: ARC Cluster


Solid State Drives

Most compute nodes are equipped with one of:
— OCZ RevoDrive 120 GB PCIe
— Crucial CT275MX300 275 GB SATA3
— Samsung PM1725 Series 1.6 TB NVMe

Page 11: ARC Cluster


File Systems

Available Today:
— NFS home directories over Gigabit Ethernet
— Local per-node scratch on spinning disks (ext3)
— Local per-node 120 GB SSD (ext2)

Parallel File System
– PVFS2
– Separate dedicated nodes are available for parallel file systems
– 4 clients, one of which doubles as the MDS (metadata server)

Page 12: ARC Cluster


Power Monitoring

Watts Up Pro power meters
— Serial and USB interfaces available

Nodes are connected to meters in groups:
— Mostly 4 nodes per meter (sometimes just 3)
— 2x 1 node:
– 1 with a GPU
– 1 without a GPU

Page 13: ARC Cluster


Software Stack

Additional packages and libraries:
— Available upon request, but…
— Not free? You need to pay.
— License required? You need to sign it.
— Installation required? You need to:
– Test it
– Provide an install script

Check the ARC website; the software stack is constantly changing.

Page 14: ARC Cluster


Base System

OpenHPC 1.2.1 (over CentOS 7.2)

Batch system: Slurm

All compilers and tools are available on the compute nodes
— gcc, gfortran, java, …
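Additional tools are provided as environment modules (standard in OpenHPC); as a minimal sketch, assuming the module names shown on later slides, typical commands are:

module avail                      # list software modules available on ARC
module list                       # show which modules are currently loaded
module load cuda                  # load a package (see the CUDA slide)
module switch mvapich2 openmpi    # swap the MPI stack (see the MPI slide)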

Page 15: ARC Cluster


MPI

MVAPICH2
— InfiniBand support
— Already in your default PATH
– mpicc

Open MPI
— Operates over InfiniBand
— Activate: module switch mvapich2 openmpi
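A minimal sketch of compiling and running an MPI program, assuming a hypothetical source file hello_mpi.c and the interactive srun/prun workflow shown on later slides:

mpicc -o hello_mpi hello_mpi.c    # compile on the login node with the MPI wrapper
srun -n 32 --pty /bin/bash        # request 32 cores (2 nodes) interactively
prun ./hello_mpi                  # launch the binary on all requested cores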

Page 16: ARC Cluster


OpenMP

OpenMP "#pragma omp" directives in C programs are supported; compile with:

gcc -fopenmp -o fn fn.c
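A minimal sketch of building and running the OpenMP example above on a single 16-core node, with the thread count set via the standard OMP_NUM_THREADS variable:

srun -n 16 -N 1 --pty /bin/bash    # get one interactive node with 16 cores
gcc -fopenmp -o fn fn.c            # build the OpenMP program
OMP_NUM_THREADS=16 ./fn            # run with one thread per core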

Page 17: ARC Cluster


CUDA SDK

Ensure you are using a node with a GPU
— Several GPU types are available to fine-tune for your application's needs:
– well-performing single- or double-precision devices

Activate: module load cuda
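A minimal sketch of building and running a CUDA program on a GPU node, assuming a hypothetical source file saxpy.cu and the gtx480 queue from the job-submission slides:

srun -n 16 -N 1 -p gtx480 --pty /bin/bash    # get an interactive node with a GTX 480 GPU
module load cuda                             # make nvcc and the CUDA libraries available
nvcc -o saxpy saxpy.cu                       # compile the CUDA program
./saxpy                                      # run it on the node's GPU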

Page 18: ARC Cluster


PGI Compiler (Experimental)

pgcc, pgCC

pgf77, pgf90, pghpf

OpenACC support

module load pgi64
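A minimal sketch of building an OpenACC-annotated C program with the PGI compiler (the source file name is hypothetical; -acc enables OpenACC code generation):

module load pgi64              # make the PGI compilers available
pgcc -acc -o fn_acc fn_acc.c   # compile with OpenACC directives enabled
./fn_acc                       # run (on a GPU node if the code targets accelerators)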

Page 19: ARC Cluster


Virtualization via LXD

Uses Linux containers

Goal: allow users to configure their own environment
— User gets full root access within the container
— Much smaller footprint than a VM

Docker support inside LXD
— Requires more space

May run a full VM inside:
— VirtualBox, …
— Full VM space required
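A minimal sketch of typical LXD usage, assuming the standard lxc client; the container name and image below are hypothetical examples, so check the ARC website for what is actually provided:

lxc launch ubuntu:16.04 mybox    # create and start a container from an image
lxc exec mybox -- /bin/bash      # get a root shell inside the container
lxc stop mybox                   # stop it when done
lxc delete mybox                 # remove it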

Page 20: ARC Cluster


Job Submission

You cannot SSH directly to a compute node.

You must use srun to submit jobs
— either as a batch job
— or interactively

Presently there are “hard” limits for job times and sizes. In general, please be considerate of other users and do not abuse the system.

There are special queues for nodes with certain CPUs/GPUs
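For batch mode, a minimal sketch of a Slurm batch script is shown below; sbatch and the #SBATCH directives are standard Slurm, but the script contents, partition, and binary name are hypothetical, so adjust them to the posted ARC limits:

#!/bin/bash
#SBATCH -n 32             # 32 tasks (2 nodes)
#SBATCH -t 01:00:00       # stay within the posted "hard" time limits
#SBATCH -p opteron        # optional: request a specific CPU/GPU queue
prun ./hello_mpi          # launch across all allocated cores

# submit from the login node with:  sbatch job.sh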

Page 21: ARC Cluster


Slurm Basics

On the login node:
— to submit a job: srun …
— to list jobs: squeue
— to list details of your job: scontrol …
— to delete/cancel/stop your job: scancel …
— to check node status: sinfo
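Concrete forms of these commands, as a hedged sketch (the job ID is a placeholder):

srun -n 16 -N 1 --pty /bin/bash    # submit an interactive job
squeue -u $USER                    # list your jobs
scontrol show job <jobid>          # details of one job
scancel <jobid>                    # cancel/stop that job
sinfo                              # node and partition status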

Page 22: ARC Cluster


srun Basics

srun -n 32 --pty /bin/bash # get 32 cores (2 nodes) in interactive mode

srun -n 16 -N 1 --pty /bin/bash # get 1 interactive node with 16 cores

srun -n 32 -N 2 -w c[30,31] --pty /bin/bash #run on nodes 30+31

srun -n 64 -N 4 -w c[30-33] --pty /bin/bash #run on nodes 30-33

srun -n 64 -N 4 -p opteron --pty /bin/bash #run on any 4 opteron nodes

srun -n 64 -N 4 -p gtx480 --pty /bin/bash #any 4 nodes w/ GTX 480 GPUs

Page 23: ARC Cluster


Listing your nodes

Once your job begins, $SLURM_NODELIST holds the list of nodes allocated to you.

MPI is already integrated with Slurm. Simply using prun … will automatically use all requested processes directly from Slurm.

For example, a CUDA programmer who wants to use 4 GPU nodes:

[fmuelle@login ~]$ srun -N 4 -p gtx480 --pty /bin/bash

[fmuelle@c103 ~]$ echo $SLURM_NODELIST

c[103-106]
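Continuing this sketch, an MPI binary (name hypothetical) could then be launched across the four allocated nodes with prun:

[fmuelle@c103 ~]$ prun ./hello_mpi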

Note: SSHing between these nodes from within the Slurm session is allowed.

Page 24: ARC Cluster


Hardware in Action

4 racks in the server room

Page 25: ARC Cluster


Temperature Monitoring

It is the user's responsibility to keep room temperatures below 80 degrees Fahrenheit while using the cluster.

— The ARC website has links to online, browser-based temperature monitors.

— Building staff carry pagers that alarm 24/7 when temperatures exceed the limit.

Page 26: ARC Cluster


Connecting to ARC

ARC access is restricted to on-campus IPs only.
— If you are ever unable to log in (the connection is dropped immediately, before authentication), this is the likely cause.

Non-NCSU users may request remote access by providing a remote machine from which their connections must originate.
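As a sketch of the connection step, with the login hostname left as a placeholder (use the host given with your account information):

ssh <unityid>@<arc-login-host>    # from an on-campus IP or an approved remote machine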

Page 27: ARC Cluster


Summary

Your ARC Cluster@Home: what can I do with it?

Primary purpose: advance computer science research (HPC and beyond)
— Want to run a job over the entire machine?
— Want to replace parts of the software stack?

Secondary purpose: service to the sciences, engineering & beyond
— Vision: have domain scientists work with computer scientists on code

http://moss.csc.ncsu.edu/~mueller/cluster/arc/

Equipment donations are welcome.

Ideas on how to improve ARC? Let us know.
— Questions? Send them to the mailing list (once you have an account)
— To request an account: email dfiala<at>ncsu.edu
– Include your research topic, abstract, and compute requirements/time
– Must include your Unity ID
– NCSU students: your advisor sends the email as a means of approval
– Non-NCSU: same, plus preferred username and hostname (your remote login location)

Page 28: ARC Cluster


Slides provided by David Fiala

Edited by Frank Mueller

Current as of Aug 21, 2017