Increasing cluster performance by combining rCUDA with Slurm Federico Silla Technical University of Valencia Spain


TRANSCRIPT

Page 1: Increasing Cluster Performance by Combining rCUDA with Slurm

Increasing cluster performance by combining rCUDA with Slurm

Federico Silla, Technical University of Valencia, Spain

Page 2: Increasing Cluster Performance by Combining rCUDA with Slurm


Outline

rCUDA … what’s that?

Page 3: Increasing Cluster Performance by Combining rCUDA with Slurm

Basics of CUDA

[Figure: a CUDA application running on a node with a local GPU]
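As a reminder of the CUDA model the talk builds on: the application runs on the CPU and uses the CUDA runtime to copy data to the local GPU, launch kernels there, and copy results back. A minimal, self-contained sketch (the kernel and buffer size are illustrative only, not taken from the slides):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Kernel executed on the GPU: scales every element of the vector.
    __global__ void scale(float *x, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        const int n = 1 << 20;
        float *h = (float *)malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) h[i] = 1.0f;

        float *d = NULL;
        cudaMalloc(&d, n * sizeof(float));                            // allocate GPU memory
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // H2D copy
        scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);                  // launch kernel on the GPU
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // D2H copy
        cudaFree(d);

        printf("h[0] = %f\n", h[0]);  // expected: 2.000000
        free(h);
        return 0;
    }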

Page 4: Increasing Cluster Performance by Combining rCUDA with Slurm

rCUDA … remote CUDA

[Figure: a CUDA application running on a node with no local GPU]

Page 5: Increasing Cluster Performance by Combining rCUDA with Slurm

rCUDA … remote CUDA

A software technology that enables a more flexible use of GPUs in computing facilities.

[Figure: a node with no local GPU executing a CUDA application whose GPU is provided remotely]
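The key point is that the application itself does not change: rCUDA's client library implements the CUDA API and forwards the calls over the network to a server node that owns the physical GPU. A small sketch of what an unmodified program sees when it enumerates devices (how device slots are mapped to remote servers is configured outside the program through rCUDA's environment variables, whose exact names should be taken from the rCUDA documentation):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        // Under rCUDA the devices listed here may be located in other nodes,
        // but the application is written exactly as with plain CUDA.
        int ndev = 0;
        if (cudaGetDeviceCount(&ndev) != cudaSuccess || ndev == 0) {
            fprintf(stderr, "No CUDA devices visible\n");
            return 1;
        }
        for (int d = 0; d < ndev; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            printf("Device %d: %s, %zu MB of memory\n",
                   d, prop.name, prop.totalGlobalMem / (1024 * 1024));
        }
        return 0;
    }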

Page 6: Increasing Cluster Performance by Combining rCUDA with Slurm


Basics of rCUDA

Page 7: Increasing Cluster Performance by Combining rCUDA with Slurm


Basics of rCUDA

Page 8: Increasing Cluster Performance by Combining rCUDA with Slurm


Basics of rCUDA

Page 9: Increasing Cluster Performance by Combining rCUDA with Slurm

rCUDA allows a new vision of a GPU deployment, moving from the usual cluster configuration (the physical configuration) to a logical configuration in which GPUs are decoupled from the nodes.

[Figure: physical configuration, where each node has CPUs, main memory, a network interface, and PCIe-attached GPUs with their own memory, all joined by an interconnection network, compared with the cluster envisioned with rCUDA, a logical configuration in which the GPUs are reached through logical connections over the network]

Page 10: Increasing Cluster Performance by Combining rCUDA with Slurm


Outline

Two questions:

• Why should we need rCUDA?

• rCUDA … slower CUDA?

Page 11: Increasing Cluster Performance by Combining rCUDA with Slurm


Outline

Two questions:

• Why should we need rCUDA?

• rCUDA … slower CUDA?

Page 12: Increasing Cluster Performance by Combining rCUDA with Slurm

Concern with rCUDA

The main concern with rCUDA is the reduced bandwidth to the remote GPU.

[Figure: a node with no local GPU accessing a remote GPU over the network]

Page 13: Increasing Cluster Performance by Combining rCUDA with Slurm


Using InfiniBand networks

Page 14: Increasing Cluster Performance by Combining rCUDA with Slurm

Initial transfers within rCUDA

[Figure: bandwidth of H2D and D2H transfers, with pageable and with pinned host memory, for rCUDA over EDR and FDR InfiniBand, original versus optimized implementations]
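The distinction between pageable and pinned host memory matters here because pinned (page-locked) buffers can be transferred directly, whereas pageable buffers require an extra staging copy that lowers the achievable bandwidth; this is why the optimized pinned-memory transfers shown two slides later reach almost 100% of the available bandwidth. A small sketch of the two allocation styles (the buffer size is an arbitrary example):

    #include <cstdlib>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 64 << 20;  // 64 MB example buffer
        float *d = NULL;
        cudaMalloc(&d, bytes);

        // Pageable host memory: a plain malloc'ed buffer. Transfers go through
        // an intermediate staging copy, so they do not reach peak bandwidth.
        float *pageable = (float *)malloc(bytes);
        cudaMemcpy(d, pageable, bytes, cudaMemcpyHostToDevice);

        // Pinned (page-locked) host memory: allocated with cudaMallocHost.
        // It can be transferred directly, which is what allows the optimized
        // rCUDA transfers to approach 100% of the available bandwidth.
        float *pinned = NULL;
        cudaMallocHost(&pinned, bytes);
        cudaMemcpy(d, pinned, bytes, cudaMemcpyHostToDevice);

        cudaFreeHost(pinned);
        free(pageable);
        cudaFree(d);
        return 0;
    }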

Page 15: Increasing Cluster Performance by Combining rCUDA with Slurm

Performance depending on network

CUDASW++: bioinformatics software for Smith-Waterman protein database searches.

[Figure: execution time (s) and rCUDA overhead (%) as a function of sequence length, comparing CUDA with rCUDA over FDR InfiniBand, QDR InfiniBand, and GbE]

Page 16: Increasing Cluster Performance by Combining rCUDA with Slurm

Optimized transfers within rCUDA

[Figure: bandwidth of H2D and D2H transfers, with pageable and with pinned host memory, for rCUDA over EDR and FDR InfiniBand, original versus optimized implementations; with the optimized transfers, pinned-memory copies reach almost 100% of the available bandwidth]

Page 17: Increasing Cluster Performance by Combining rCUDA with Slurm

rCUDA optimizations on applications

• Several applications executed with CUDA and rCUDA
• K20 GPU and FDR InfiniBand
• K40 GPU and EDR InfiniBand

[Figure: execution time of the applications with CUDA and with rCUDA; lower is better]

Page 18: Increasing Cluster Performance by Combining rCUDA with Slurm


Outline

Two questions:

• Why should we need rCUDA?

• rCUDA … slower CUDA?

Page 19: Increasing Cluster Performance by Combining rCUDA with Slurm

Outline

rCUDA improves cluster performance

Page 20: Increasing Cluster Performance by Combining rCUDA with Slurm

Test bench for studying rCUDA+Slurm

• Dual socket E5-2620 v2 Intel Xeon + 32 GB RAM + K20 GPU per node
• FDR InfiniBand based cluster
• Three cluster sizes: 4+1, 8+1, and 16+1 GPU nodes, in each case with an additional node running the Slurm scheduler

Page 21: Increasing Cluster Performance by Combining rCUDA with Slurm

Applications for studying rCUDA+Slurm

Applications used for tests:

Set 1 (short execution time):
• GPU-Blast (21 seconds; 1 GPU; 1599 MB)
• LAMMPS (15 seconds; 4 GPUs; 876 MB)
• MCUDA-MEME (165 seconds; 4 GPUs; 151 MB)
• GROMACS (2 nodes; 167 seconds; non-GPU)
• NAMD (4 nodes; 11 minutes; non-GPU)

Set 2 (long execution time):
• BarraCUDA (10 minutes; 1 GPU; 3319 MB)
• GPU-LIBSVM (5 minutes; 1 GPU; 145 MB)
• MUMmerGPU (5 minutes; 1 GPU; 2804 MB)

Three workloads: Set 1, Set 2, and Set 1 + Set 2

Page 22: Increasing Cluster Performance by Combining rCUDA with Slurm


Workloads for studying rCUDA+Slurm (I)

Page 23: Increasing Cluster Performance by Combining rCUDA with Slurm


Performance of rCUDA+Slurm (I)

Page 24: Increasing Cluster Performance by Combining rCUDA with Slurm


Workloads for studying rCUDA+Slurm (II)

Page 25: Increasing Cluster Performance by Combining rCUDA with Slurm


Performance of rCUDA+Slurm (II)

Page 26: Increasing Cluster Performance by Combining rCUDA with Slurm

Outline

Why does rCUDA improve cluster performance?

Page 27: Increasing Cluster Performance by Combining rCUDA with Slurm

1st reason for improved performance

• Non-accelerated applications keep GPUs idle in the nodes where they use all the cores.
• Hybrid MPI shared-memory non-accelerated applications usually span all the cores of a node (across n nodes).
• A CPU-only application spreading over these nodes will make their GPUs unavailable for accelerated applications.

[Figure: cluster of n nodes, each with CPUs, RAM, a network interface, and a PCIe-attached GPU, joined by an interconnection network]

Page 28: Increasing Cluster Performance by Combining rCUDA with Slurm

2nd reason for improved performance (I)

• Accelerated applications keep CPUs idle in the nodes where they execute.
• An accelerated application using just one CPU core may prevent other jobs from being dispatched to that node.
• Hybrid MPI shared-memory non-accelerated applications usually span all the cores of a node (across n nodes).

[Figure: cluster of n nodes, each with CPUs, RAM, a network interface, and a PCIe-attached GPU, joined by an interconnection network]

Page 29: Increasing Cluster Performance by Combining rCUDA with Slurm

2nd reason for improved performance (II)

• Accelerated applications keep CPUs idle in the nodes where they execute.
• An accelerated MPI application using just one CPU core per node may keep part of the cluster busy.
• Hybrid MPI shared-memory non-accelerated applications usually span all the cores of a node (across n nodes).

[Figure: cluster of n nodes, each with CPUs, RAM, a network interface, and a PCIe-attached GPU, joined by an interconnection network]

Page 30: Increasing Cluster Performance by Combining rCUDA with Slurm

3rd reason for improved performance

• Do applications completely squeeze the GPUs available in the cluster?
• When a GPU is assigned to an application, the computational resources inside the GPU may not be fully used:
  • application presenting a low level of parallelism
  • CPU code being executed (GPU assigned ≠ GPU working)
  • GPU cores stalled due to lack of data
  • etc.

[Figure: cluster of n nodes, each with CPUs, RAM, a network interface, and a PCIe-attached GPU, joined by an interconnection network]

Page 31: Increasing Cluster Performance by Combining rCUDA with Slurm

GPU usage of GPU-Blast

[Figure: GPU utilization trace of GPU-Blast; during parts of the execution the GPU is assigned but not used]

Page 32: Increasing Cluster Performance by Combining rCUDA with Slurm

GPU usage of CUDA-MEME

[Figure: GPU utilization trace of CUDA-MEME; GPU utilization stays far from its maximum]

Page 33: Increasing Cluster Performance by Combining rCUDA with Slurm

GPU usage of LAMMPS

[Figure: GPU utilization trace of LAMMPS; during part of the execution the GPU is assigned but not used]

Page 34: Increasing Cluster Performance by Combining rCUDA with Slurm

GPU allocation vs GPU utilization

[Figure: GPU allocation compared with actual GPU utilization over time; GPUs are assigned but not used for part of the time]

Page 35: Increasing Cluster Performance by Combining rCUDA with Slurm

Sharing a GPU among jobs: GPU-Blast

[Figure: two concurrent instances of GPU-Blast sharing the same GPU; one instance required about 51 seconds]

Page 36: Increasing Cluster Performance by Combining rCUDA with Slurm

Sharing a GPU among jobs: GPU-Blast

[Figure: two concurrent instances of GPU-Blast; utilization trace of the first instance]

Page 37: Increasing Cluster Performance by Combining rCUDA with Slurm

Sharing a GPU among jobs: GPU-Blast

[Figure: two concurrent instances of GPU-Blast; utilization traces of the first and the second instance]

Page 38: Increasing Cluster Performance by Combining rCUDA with Slurm

Sharing a GPU among jobs

GPU memory footprint of each application (on a K20 GPU):

• LAMMPS: 876 MB
• mCUDA-MEME: 151 MB
• BarraCUDA: 3319 MB
• MUMmerGPU: 2104 MB
• GPU-LIBSVM: 145 MB
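Jobs can share a GPU as long as their combined memory footprints fit into the device memory (about 5 GB on a K20). A minimal sketch of the kind of check a launcher could perform before co-locating another job on the same GPU, using the standard CUDA runtime call cudaMemGetInfo (the candidate job and its footprint are illustrative assumptions):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        size_t free_bytes = 0, total_bytes = 0;

        // Query how much memory is currently free on the selected GPU.
        if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
            fprintf(stderr, "cudaMemGetInfo failed\n");
            return 1;
        }

        // Hypothetical footprint of the next job we would like to co-locate
        // on this GPU (e.g., BarraCUDA at roughly 3319 MB).
        const size_t next_job_mb = 3319;
        const size_t free_mb = free_bytes / (1024 * 1024);

        printf("GPU memory: %zu MB free of %zu MB total\n",
               free_mb, total_bytes / (1024 * 1024));

        if (free_mb >= next_job_mb)
            printf("The next job (%zu MB) fits, so the GPU can be shared.\n", next_job_mb);
        else
            printf("The next job (%zu MB) does not fit on this GPU.\n", next_job_mb);
        return 0;
    }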

Page 39: Increasing Cluster Performance by Combining rCUDA with Slurm

Outline

Other reasons for using rCUDA?

Page 40: Increasing Cluster Performance by Combining rCUDA with Slurm

Cheaper cluster upgrade

• Let's suppose that a cluster without GPUs needs to be upgraded to use GPUs
• GPUs require large power supplies
  • Are the power supplies already installed in the nodes large enough?
• GPUs require large amounts of space
  • Does the current form factor of the nodes allow GPUs to be installed?

The answer to both questions is usually "NO".

Page 41: Increasing Cluster Performance by Combining rCUDA with Slurm

Cheaper cluster upgrade

Approach 1: augment the cluster with some CUDA GPU-enabled nodes. Only those GPU-enabled nodes can execute accelerated applications.

[Figure: cluster of nodes without GPUs extended with a few GPU-enabled nodes]

Page 42: Increasing Cluster Performance by Combining rCUDA with Slurm

Cheaper cluster upgrade

Approach 2: augment the cluster with some rCUDA servers. All nodes can then execute accelerated applications.

[Figure: cluster of nodes without GPUs extended with a few GPU-enabled nodes acting as rCUDA servers]

Page 43: Increasing Cluster Performance by Combining rCUDA with Slurm

Cheaper cluster upgrade

• Dual socket E5-2620 v2 Intel Xeon + 32 GB RAM + K20 GPU
• FDR InfiniBand based cluster
• 16 nodes without GPU + 1 node with 4 GPUs

Page 44: Increasing Cluster Performance by Combining rCUDA with Slurm


More workloads for studying rCUDA+Slurm

Page 45: Increasing Cluster Performance by Combining rCUDA with Slurm

Performance

[Figure: performance results for the upgraded cluster; the chart annotations read -68%, -60%, -63%, -56%, +131%, and +119%]

Page 46: Increasing Cluster Performance by Combining rCUDA with Slurm

Outline

Additional reasons for using rCUDA?

Page 47: Increasing Cluster Performance by Combining rCUDA with Slurm

#1: More GPUs for a single application

[Figure: with rCUDA, a single application can access 64 GPUs]
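Because rCUDA exposes remote GPUs as ordinary CUDA devices, an unmodified multi-GPU program can enumerate and use all of them, whether there are 4 or 64. A minimal sketch of such a loop (the kernel and problem size are illustrative assumptions, not the MonteCarlo sample itself):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Trivial kernel used only to give each device some work.
    __global__ void scale(float *x, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);   // with rCUDA this may report many remote GPUs
        printf("Visible GPUs: %d\n", ndev);

        const int n = 1 << 20;
        for (int d = 0; d < ndev; ++d) {
            cudaSetDevice(d);        // select the (possibly remote) device
            float *buf = NULL;
            cudaMalloc(&buf, n * sizeof(float));
            scale<<<(n + 255) / 256, 256>>>(buf, n, 2.0f);
            cudaDeviceSynchronize();
            cudaFree(buf);
        }
        return 0;
    }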

Page 48: Increasing Cluster Performance by Combining rCUDA with Slurm

#1: More GPUs for a single application

MonteCarlo Multi-GPU (from the NVIDIA samples), FDR InfiniBand + NVIDIA Tesla K20.

[Figure: execution time (lower is better) and throughput (higher is better) as the number of GPUs used by the application grows]

Page 49: Increasing Cluster Performance by Combining rCUDA with Slurm

#2: Virtual machines can share GPUs

• With PCI passthrough, the GPU is assigned exclusively to a single virtual machine
• Concurrent usage of the GPU is not possible

Page 50: Increasing Cluster Performance by Combining rCUDA with Slurm

#2: Virtual machines can share GPUs

[Figure: rCUDA lets virtual machines share remote GPUs, shown for the case of a high-performance network available and for the case of a low-performance network available]

Page 51: Increasing Cluster Performance by Combining rCUDA with Slurm

#3: GPU task migration

• Box A has 4 GPUs but only one is busy
• Box B has 8 GPUs but only two are busy

1. Move jobs from Box B to Box A and switch off Box B
2. Migration should be transparent to applications (decided by the global scheduler)

Migration is performed at GPU granularity.

[Figure: Box A and Box B before and after the migration]

Page 52: Increasing Cluster Performance by Combining rCUDA with Slurm

#3: GPU task migration

Job granularity instead of GPU granularity.

[Figure: GPU task migration performed at job granularity]

Page 53: Increasing Cluster Performance by Combining rCUDA with Slurm


Outline

… in summary …

Page 54: Increasing Cluster Performance by Combining rCUDA with Slurm

Pros and cons of rCUDA

• Pros:
  1. Many GPUs for a single application
  2. Concurrent GPU access to virtual machines
  3. Increased cluster throughput
  4. Similar performance with smaller investment
  5. Easier (cheaper) cluster upgrade
  6. Migration of GPU jobs
  7. Reduced energy consumption
  8. Increased GPU utilization

• Cons:
  1. Reduced bandwidth to the remote GPU (really a concern??)

Page 55: Increasing Cluster Performance by Combining rCUDA with Slurm

HPC Advisory Council Switzerland Conference 2016 55/56

Get a free copy of rCUDA at

http://www.rcuda.net

@rcuda_

More than 650 requests world wide

rCUDA is a development by Technical University of Valencia

Page 56: Increasing Cluster Performance by Combining rCUDA with Slurm


Thanks!

Questions?

rCUDA is a development by Technical University of Valencia