Multi-GPU System Design with Memory Networks
Gwangsun Kim, Minseok Lee, Jiyun Jeong, John Kim
Department of Computer Science
Korea Advanced Institute of Science and Technology
Multi-GPU Programming Pattern
[Figure: input data in host memory must be placed into the device memory of each GPU.]
How to place the data? A. Split, or B. Duplicate.
Problems: 1. Programming can be challenging. 2. Inter-GPU communication cost is high.
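For concreteness, a minimal sketch of the conventional "split" placement in CUDA follows; the kernel, the chunking, and the function names are assumptions for illustration, not part of this work.

    // Minimal sketch of the conventional "split" placement across GPUs.
    // Assumptions for illustration: a trivial kernel and evenly divisible data.
    #include <cuda_runtime.h>

    __global__ void scale(float *data, size_t n) {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    void run_split(const float *host_data, size_t n, int num_gpus) {
        size_t chunk = n / num_gpus;                   // programmer partitions the data manually
        for (int g = 0; g < num_gpus; g++) {
            cudaSetDevice(g);
            float *d_chunk;
            cudaMalloc(&d_chunk, chunk * sizeof(float));
            cudaMemcpy(d_chunk, host_data + g * chunk, // separate host-to-device copy per GPU
                       chunk * sizeof(float), cudaMemcpyHostToDevice);
            scale<<<(unsigned)((chunk + 255) / 256), 256>>>(d_chunk, chunk);
        }
        // Any element one GPU needs from another GPU's chunk requires an extra
        // inter-GPU (or staged-through-host) transfer -- the communication cost above.
    }

Duplicating the data instead avoids the chunk bookkeeping, but multiplies the memory footprint and still pays the per-GPU copy cost.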
Hybrid Memory Cube (HMC)
[Figure: an HMC stacks DRAM layers on a logic layer; vault controllers and I/O ports are connected by an intra-HMC network, and packets enter and leave over high-speed links.]
Memory Network
[Figure: CPU and GPUs attached to a network of HMCs; each HMC routes packets between its high-speed links and vault controllers through the intra-HMC network.]
Memory network for multi-CPU [Kim et al., PACT'13]
Related Work
NVLink for Nvidia Pascal architecture – Drawback: some processor bandwidth is dedicated to NVLink.
SLI (Nvidia) and CrossFire (AMD) – Graphics only.
Unified virtual addressing from Nvidia – Easy access to another GPU's memory – Restriction in memory allocation (see the sketch below).
[Figure: baseline multi-GPU system in which GPUs reach the CPU through PCIe switches and an I/O hub, with memory attached to each processor; NVLink adds direct links between GPUs.]
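As a hedged illustration of the unified virtual addressing bullet above (standard CUDA calls, not this paper's mechanism), peer access must be explicitly enabled before one GPU can dereference another GPU's allocation:

    // Sketch: CUDA unified virtual addressing with peer access (illustrative only).
    #include <cuda_runtime.h>

    void enable_peer_access(int dev, int peer) {
        int can = 0;
        cudaDeviceCanAccessPeer(&can, dev, peer);   // is there a hardware path between the two GPUs?
        if (can) {
            cudaSetDevice(dev);
            cudaDeviceEnablePeerAccess(peer, 0);    // dev may now dereference pointers allocated on peer
        }
    }

Each allocation still physically resides on the device where cudaMalloc was called, which is the allocation restriction noted above.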
Contents
Motivation
Related work
Inter-GPU communication – Scalable kernel execution (SKE) – GPU memory network (GMN) design
CPU-GPU communication – Unified memory network (UMN) – Overlay network architecture
Evaluation
Conclusion
GPU Memory Network Advantage
[Figure: with PCIe, each GPU has a separate physical address space, with 288 GB/s to its own device memory but only 15.75 GB/s over PCIe to other GPUs; a GPU memory network provides a unified physical address space across GPUs, with PCIe kept only as an optional path.]
Scalable Kernel Execution (SKE)
Executes an unmodified kernel on multiple GPUs. GPUs need to support partial execution of a kernel.
[Figure: the original single-GPU kernel runs unchanged on a virtual GPU that spans multiple physical GPUs, in contrast to prior source-transformation approaches that partition the kernel [Kim et al., PPoPP'11] [Lee et al., PACT'13] [Cabezas et al., PACT'14].]
Scalable Kernel Execution Implementation
[Figure: the thread blocks of a 1D kernel are divided into block ranges, one per GPU. The unmodified single-GPU application enqueues the original kernel and its metadata into a virtual GPU command queue; the SKE runtime forwards the kernel metadata plus a block range to the command queue of each physical GPU (GPU0, GPU1, ...).]
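SKE performs this partitioning transparently in the runtime and hardware for an unmodified kernel. Purely as a software analogy (an assumption, not the paper's implementation), the block-range idea can be sketched in CUDA by offsetting the block index on each GPU:

    // Rough software analogy of block-range partitioning (illustrative only;
    // SKE does this transparently for unmodified kernels).
    #include <cuda_runtime.h>

    __global__ void scale(float *data, size_t n, int block_offset) {
        int block = blockIdx.x + block_offset;             // block index within the original 1D grid
        size_t i = (size_t)block * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    void launch_block_ranges(float *data, size_t n, int num_gpus, int total_blocks, int threads) {
        int per_gpu = (total_blocks + num_gpus - 1) / num_gpus;
        for (int g = 0; g < num_gpus; g++) {
            int first = g * per_gpu;
            int count = total_blocks - first;
            if (count <= 0) break;
            if (count > per_gpu) count = per_gpu;
            cudaSetDevice(g);
            // Each GPU executes only its block range of the same kernel;
            // `data` is assumed reachable from every GPU (e.g., via a unified address space).
            scale<<<count, threads>>>(data, n, first);
        }
    }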
Memory Address Space Organization
[Figure: instead of mapping whole pages to particular GPUs (Page A to GPU X, Page B to GPU Y, Page C to GPU Z), the GPU virtual address space is interleaved across GPU memory at cache-line granularity (cache lines 0, 1, 2, ... spread round-robin), which load-balances traffic over minimal and non-minimal paths in the memory network.]
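A minimal sketch of what cache-line-granularity interleaving means for the address mapping, assuming 128-byte lines (the parameters and helper names are illustrative, not the paper's exact scheme):

    // Sketch of fine-grained (cache-line) address interleaving across memory stacks.
    // Assumptions: 128-byte cache lines; num_modules memory stacks. Illustrative only.
    #include <stdint.h>

    #define CACHE_LINE_BYTES 128ULL

    static inline unsigned module_of(uint64_t addr, unsigned num_modules) {
        uint64_t line = addr / CACHE_LINE_BYTES;    // cache-line index
        return (unsigned)(line % num_modules);      // consecutive lines go to different stacks
    }

    static inline uint64_t offset_in_module(uint64_t addr, unsigned num_modules) {
        uint64_t line = addr / CACHE_LINE_BYTES;
        return (line / num_modules) * CACHE_LINE_BYTES + (addr % CACHE_LINE_BYTES);
    }

Consecutive cache lines land on different memory stacks, which spreads traffic evenly over the memory network.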
Multi-GPU Memory Network Topology
Load-balanced GPU channels. Remove path diversity among local HMCs.
[Figure: topologies built from GPUs and HMCs: the 2D flattened butterfly without concentration (FBFLY) [ISCA'07], the distributor-based flattened butterfly (dFBFLY) [PACT'13], and the proposed sliced flattened butterfly (sFBFLY).]
[Plot: number of channels in the memory network vs. number of GPUs (2 to 16) for dFBFLY and sFBFLY; sFBFLY reduces the channel count by 33-50%.]
Contents
Motivation
Related work
Inter-GPU communication – Scalable kernel execution (SKE) – GPU memory network (GMN) design
CPU-GPU communication – Unified memory network (UMN) – Overlay network architecture
Evaluation
Conclusion
Data Transfer Overhead
[Figure: data resident in host memory must be copied over a low-bandwidth PCIe link into GPU device memory before the GPU can use it.]
Problems: 1. CPU-GPU communication bandwidth is low. 2. Data transfer (memory copy) overhead.
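To make the overhead concrete, here is a hedged sketch that times an explicit host-to-device copy with CUDA events; the buffer and function names are assumptions for illustration:

    // Sketch: measuring host-to-device copy time over PCIe with CUDA events.
    // Illustrative only; the unified memory network removes this explicit copy.
    #include <cuda_runtime.h>
    #include <stdio.h>

    void time_h2d_copy(const float *host_data, size_t n) {
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        cudaMemcpy(d_data, host_data, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);   // bounded by PCIe bandwidth (~15.75 GB/s peak)
        printf("H2D copy of %zu MB took %.3f ms\n", (n * sizeof(float)) >> 20, ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
    }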
Unified Memory Network
Remove the PCIe bottleneck between CPU and GPUs. Eliminate memory copies between CPU and GPUs!
[Figure: the CPU joins the GPU memory network directly instead of reaching the GPUs through an I/O hub and PCIe switches.]
Overlay Network Architecture
CPUs are latency-sensitive; GPUs are bandwidth-sensitive.
[Figure: CPU traffic is overlaid on the GPU memory network using off-chip links and on-chip pass-through paths [PACT'13, FB-DIMM spec.].]
Methodology
GPGPU-sim version 3.2. Assume SKE for evaluation.
Configuration:
– 4 HMCs per CPU/GPU
– 8 bidirectional channels per CPU/GPU/HMC
– PCIe BW: 15.75 GB/s, latency: 600 ns
– HMC: 4 GB, 8 layers, 16 vaults, 16 banks/vault, FR-FCFS
– Assume 1 CPU and 4 GPUs unless otherwise mentioned.

Abbreviation | Configuration
PCIe         | PCIe-based system with memcpy
GMN          | GPU memory network-based system with memcpy
UMN          | Unified memory network-based system (no copy)
SKE Performance with Different Designs
Results for selected workloads (compute-intensive and data-intensive).
[Plot: normalized runtime (kernel time, memcpy time, total runtime; lower is better) for the PCIe, GMN, and UMN configurations across CG, CP, RAY, BFS, STO, SRAD, 3DFD, and GMEAN; an 82% reduction in runtime is highlighted.]
Impact of Removing Path Diversity between Local HMCs
[Plot: normalized kernel runtime (lower is better) of dFBFLY with adaptive routing (UGAL) vs. sFBFLY with minimal routing across STO, FT, CG, FWT, SRAD, RAY, BH, CP, 3DFD, BFS, SCAN, BP, KMN, SP, and GMEAN; individual workloads range from 14% higher to 9% lower, with less than 1% difference overall.]
Scalability
[Plot: kernel runtime speedup (higher is better) vs. number of GPUs (1, 2, 4, 8, 16) for CP, SCAN, RAY, 3DFD, BP, FWT, SRAD, and GMEAN; speedup reaches 13.5x for a compute-intensive workload, while some workloads stop scaling when the input size is not large enough.]
Conclusion
We addressed two critical problems in multi-GPU systems with memory networks.
Inter-GPU communication – Improved bandwidth with the GPU memory network – Scalable kernel execution improves programmability.
CPU-GPU communication – Unified memory network eliminates data transfer – Overlay network architecture.
Our proposed designs improve both the performance and programmability of multi-GPU systems.