unified memory on pascal and volta · heterogeneous memory manager: a set of linux kernel patches...

70
1 Nikolay Sakharnykh - May 10, 2017 UNIFIED MEMORY ON PASCAL AND VOLTA

Upload: others

Post on 22-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

1

Nikolay Sakharnykh - May 10, 2017

UNIFIED MEMORY ON PASCAL AND VOLTA

Page 2: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

2

HETEROGENEOUS ARCHITECTURES

GPU 0

MEM

CPU

SYS MEM

GPU 0

GPU 1

MEM

GPU 1

GPU 2

MEM

GPU 2

Page 3: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

3

UNIFIED MEMORY FUNDAMENTALSSingle Pointer

CPU code GPU code

void *data;data = malloc(N);

cpu_func1(data, N);

cpu_func2(data, N);

cpu_func3(data, N);

free(data);

void *data;data = malloc(N);

cpu_func1(data, N);

gpu_func2<<<...>>>(data, N);cudaDeviceSynchronize();

cpu_func3(data, N);

free(data);

Page 4: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

4

UNIFIED MEMORY FUNDAMENTALSSingle Pointer

Explicit Memory Management

Unified Memory

void *h_data, *d_data;h_data = malloc(N);cudaMalloc(&d_data, N);cpu_func1(h_data, N);cudaMemcpy(d_data, h_data, N, ...)gpu_func2<<<...>>>(data, N);

cudaMemcpy(h_data, d_data, N, ...)cpu_func3(h_data, N);

free(h_data);cudaFree(d_data);

void *data;data = malloc(N);

cpu_func1(data, N);

gpu_func2<<<...>>>(data, N);cudaDeviceSynchronize();

cpu_func3(data, N);

free(data);

Page 5: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

5

UNIFIED MEMORY FUNDAMENTALSDeep Copy Nightmare

Explicit Memory Management

Unified Memory

char **data;data = (char**)malloc(N*sizeof(char*));for (int i = 0; i < N; i++)

data[i] = (char*)malloc(N);

char **d_data;char **h_data = (char**)malloc(N*sizeof(char*));for (int i = 0; i < N; i++) {

cudaMalloc(&h_data2[i], N);cudaMemcpy(h_data2[i], h_data[i], N, ...);

}cudaMalloc(&d_data, N*sizeof(char*));cudaMemcpy(d_data, h_data2, N*sizeof(char*), ...);

gpu_func<<<...>>>(data, N);

char **data;data = (char**)malloc(N*sizeof(char*));for (int i = 0; i < N; i++)

data[i] = (char*)malloc(N);

gpu_func<<<...>>>(data, N);

Page 6: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

6

UNIFIED MEMORY FUNDAMENTALSOn-Demand Migration

page1

page2

page3

page1

page2

page3

proc A proc B

memory A memory B

Page 7: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

7

UNIFIED MEMORY FUNDAMENTALSOn-Demand Migration

page1

page2

page3

page1

page2

page3

*addr1 = 1

local access

*addr3 = 1

page fault

proc A proc B

memory A memory B

Page 8: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

8

UNIFIED MEMORY FUNDAMENTALSOn-Demand Migration

page1

page2

page3

page1

page2

page3*addr3 = 1

page is populated

proc A proc B

memory A memory B

Page 9: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

9

UNIFIED MEMORY FUNDAMENTALSOn-Demand Migration

page1

page2

page3

page1

page2

page3

*addr2 = 1

*addr3 = 1

page fault

page fault

proc A proc B

memory A memory B

Page 10: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

10

UNIFIED MEMORY FUNDAMENTALSOn-Demand Migration

page1

page2

page3

page1

page2

page3

*addr2 = 1

*addr3 = 1

page migration

page migrationpage fault

page fault

proc A proc B

memory A memory B

Page 11: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

11

UNIFIED MEMORY FUNDAMENTALSOn-Demand Migration

page1

page2

page3

page1

page2

page3

proc A proc B*addr2 = 1

*addr3 = 1

local access

local access

memory A memory B

Page 12: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

12

UNIFIED MEMORY FUNDAMENTALS

When it doesn’t matter how data moves to a processor

1) Quick and dirty algorithm prototyping

2) Iterative process with lots of data reuse, migration cost can be amortized

3) Simplify application debugging

When it’s difficult to isolate the working set

1) Irregular or dynamic data structures, unpredictable access

2) Data partitioning between multiple processors

When Is This Helpful?

Page 13: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

13

UNIFIED MEMORY FUNDAMENTALSMemory Oversubscription

proc A proc B

*addr3 = 1

page fault

physical

memory

capacity

is full

memory A memory B

Page 14: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

14

UNIFIED MEMORY FUNDAMENTALSMemory Oversubscription

proc A proc B

*addr3 = 1

page fault

page eviction

physical

memory

capacity

is full

memory A memory B

Page 15: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

15

UNIFIED MEMORY FUNDAMENTALSMemory Oversubscription

proc A proc B

*addr3 = 1

page faultpage migration

memory A memory B

Page 16: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

16

UNIFIED MEMORY FUNDAMENTALSMemory Oversubscription

proc A proc B

physical

memory

capacity

is full

memory A memory B

Page 17: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

17

UNIFIED MEMORY FUNDAMENTALS

When you have large dataset and not enough physical memory

Moving pieces by hand is error-prone and requires tuning for memory size

Better to run slowly than get fail with out-of-memory error

You can actually get high performance with Unified Memory!

Memory Oversubscription Benefits

Page 18: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

18

UNIFIED MEMORY FUNDAMENTALSSystem-Wide Atomics with Exclusive Access

page1

page2

page3

page1

page2

page3

memory A memory B

proc A proc B

atomicAdd_system

(addr2, 1)

page fault local access

atomicAdd_system

(addr2, 1)

Page 19: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

19

UNIFIED MEMORY FUNDAMENTALSSystem-Wide Atomics with Exclusive Access

page1

page2

page3

page1

page2

page3

memory A memory B

atomicAdd_system

(addr2, 1)

page fault page migration

proc A proc B

Page 20: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

20

UNIFIED MEMORY FUNDAMENTALSSystem-Wide Atomics with Exclusive Access

page1

page2

page3

page1

page2

page3

memory A memory B

local access

proc A proc B

atomicAdd_system

(addr2, 1)

Page 21: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

21

UNIFIED MEMORY FUNDAMENTALSSystem-Wide Atomics over NVLINK*

page1

page2

page3

page1

page2

page3

memory A memory B

remote access local access

proc A proc B

atomicAdd_system

(addr2, 1)

atomicAdd_system

(addr2, 1)

*both processors need to support atomic operations

Page 22: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

22

UNIFIED MEMORY FUNDAMENTALS

GPUs are very good at handling atomics from thousands of threads

Makes sense to utilize atomics between GPUs or between CPU and GPU

We will see this in action on a realistic example later on

System-Wide Atomics

Page 23: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

23

AGENDA

Unified Memory Fundamentals

Under the Hood Details

Performance Analysis and Optimizations

Applications Deep Dive

Page 24: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

24

UNIFIED MEMORY ALLOCATOR

CUDA C: cudaMallocManaged is your most reliable way to opt in today

CUDA Fortran: managed attribute (per allocation)

OpenACC: -ta=managed compiler option (all dynamic allocations)

malloc support is coming on Pascal+ architectures (Linux only)

Note: you can write your own malloc hook to use cudaMallocManaged

Available Options

Page 25: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

25

HETEROGEENOUS MEMORY MANAGER

Heterogeneous Memory Manager: a set of Linux kernel patches

Allows GPUs to access all system memory (malloc, stack, file system)

Page migration will be triggered the same way as for cudaMallocManaged

Ongoing testing and reviews, planning next phase of optimizations

More details on HMM today at 4:00 in Room 211B by John Hubbard

Work In Progress

Page 26: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

26

UNIFIED MEMORYEvolution of GPU Architectures

2012 2014 2016 2017

KeplerFirst release of

the new “single-pointer”

programming model

MaxwellNo new features

related to Unified Memory Pascal

On-demand migration,

oversubscription, system-wide

atomics

VoltaAccess counters,

copy engine faults, cache

coherence, ATS support

NVLINK1

NVLINK2

Page 27: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

27

UNIFIED MEMORY ON KEPLER

Kepler GPU: no page fault support, limited virtual space

Available since CUDA 6

page1

page2

page3

page1

page2

page3

memory A memory B

GPU CPU

Page 28: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

28

UNIFIED MEMORY ON KEPLER

Bulk migration of all pages attached to current stream on kernel launch

Available since CUDA 6

page1

page2

page3

page1

page2

page3

memory A memory B

kernel

launch

page migration

page migration

GPU CPU

Page 29: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

29

UNIFIED MEMORY ON KEPLER

No on-demand migration for the GPU, no oversubscription, no system-wide atomics

Available since CUDA 6

page1

page2

page3

page1

page2

page3

memory A memory B

local accessGPU CPU

local access

Page 30: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

30

UNIFIED MEMORY ON PASCAL

Pascal GPU: page fault support, extended virtual address space (48-bit)

Available since CUDA 8

page1

page2

page3

page1

page2

page3

memory A memory B

proc A proc B

Page 31: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

31

UNIFIED MEMORY ON PASCAL

On-demand migration to accessing processor on first touch

Available since CUDA 8

page1

page2

page3

page1

page2

page3

memory A memory B

local accessproc A proc B

page fault page migration

Page 32: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

32

UNIFIED MEMORY ON PASCAL

All features: on-demand migration, oversubscription, system-wide atomics

Available since CUDA 8

page1

page2

page3

page1

page2

page3

memory A memory B

proc A proc B

local access

Page 33: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

33

UNIFIED MEMORY ON VOLTA

Volta GPU: uses fault on first touch for migration, same as Pascal

Default model

GPU CPU

page1

page2

page3

page1

page2

page3

local access

page fault page migration

GPU memory CPU memory

Page 34: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

34

UNIFIED MEMORY ON VOLTA

If memory is mapped to the GPU, migration can be triggered by access counters

New Feature: Access Counters

page1

page2

page3

page1

page2

page3

GPU CPU

remote access

remote access

local access

GPU memory CPU memory

Page 35: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

35

UNIFIED MEMORY ON VOLTA

With access counters migration only hot pages will be moved to the GPU

New Feature: Access Counters

page1

page2

page3

page1

page2

page3

GPU CPU

page migrationlocal access

GPU memory CPU memory

Page 36: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

36

UNIFIED MEMORY ON VOLTA+P9

CPU can directly access and cache GPU memory; native CPU-GPU atomics

NVLINK2: Cache Coherence

page1

page2

page3

page1

page2

page3

GPU memory CPU memory

GPU CPU

local access

remote access

remote access

Page 37: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

37

DRIVER HEURISTICS

The Unified Memory driver is doing intelligent things under the hood:

Prefetching: migrate pages proactively to reduce number of faults

Thrashing mitigation: heuristics to avoid frequent migration of shared pages

Eviction: what pages to evict when we need to make the room for new ones

You can’t control them but you can override most of these with hints

Things You Didn’t Know Exist

Page 38: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

38

DRIVER PREFETCHING

GPU architecture supports different page sizes

Contiguous pages up to a larger page size are promoted to the larger size

Driver prefetches whole regions if pages are accessed densely

Do Not Confuse with API-prefetching

GPU

CPU

Page 39: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

39

Processors share the same page and frequently read or write to it

Pascal: when memory is pinned we lose any insight into access pattern

Volta: can use access counters information to find a better location

ANTI-THRASHING POLICYFrequent Access to Shared Data

GPU

CPU

CPU throttle

pin to

CPU

Page 40: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

40

Driver keeps a single list of physical chunks of GPU memory

Chunks from the front of the list are evicted first (LRU)

A chunk is considered “in use” when it is fully-populated or migrated

EVICTION ALGORITHMWhat Pages Are Moving Out of the GPU

eviction

allocation

migration to the GPU

Page 41: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

41

AGENDA

Unified Memory Fundamentals

Under the Hood Details

Performance Analysis and Optimizations

Applications Deep Dive

Page 42: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

42

PROFILER: INSPECT

Page 43: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

43

PROFILER: FILTER

Page 44: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

44

PROFILER: CORRELATE

More details tomorrow at 10:00 in Marriott Salon 3

Page 45: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

45

USER HINTS

If you know your application well you can optimize with hints

These are also useful to override some of the driver heuristics

cudaMemPrefetchAsync(ptr, size, processor, stream)

Similar to move_pages() in Linux

cudaMemAdvise(ptr, size, advice, processor)

Similar to madvise() in Linux

Why, When, and How to Use Them

Page 46: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

46

USER HINTSPrefetching

char *data;cudaMallocManaged(&data, N);

init_data(data, N);

cudaMemPrefetchAsync(data, N, myGpuId, s);mykernel<<<..., s>>>(data, N);cudaMemPrefetchAsync(data, N, cudaCpuDeviceId, s);cudaStreamSynchronize(s);

use_data(data, N);

cudaFree(data);

Page faults can be expensive

and they stall SM execution

Avoid faults by prefetching data

to the accessing processor

GPU

CPU

CPU

Page 47: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

47

USER HINTSRead Mostly

char *data;cudaMallocManaged(&data, N);

init_data(data, N);

cudaMemAdvise(data, N, ..SetReadMostly, myGpuId);cudaMemPrefetchAsync(data, N, myGpuId, s);mykernel<<<..., s>>>(data, N);

use_data(data, N);

cudaFree(data);

In this case prefetch creates a

copy instead of moving data

Both processors can read data

simultaneously without faults

Writes are allowed but they are

expensive

GPU

CPU

CPU

Page 48: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

48

USER HINTSPreferred Location

char *data;cudaMallocManaged(&data, N);

init_data(data, N);

cudaMemAdvise(data, N, ..PreferredLocation, cudaCpuDeviceId);

mykernel<<<..., s>>>(data, N);

use_data(data, N);

cudaFree(data);

Here the kernel will page fault

and generate direct mapping to

data on the CPU

The driver will “resist”

migrating data away from the

preferred location

GPU

CPU

CPU

Page 49: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

49

USER HINTSAccessed By

char *data;cudaMallocManaged(&data, N);

init_data(data, N);

cudaMemAdvise(data, N, ..SetAccessedBy, myGpuId);

mykernel<<<..., s>>>(data, N);

use_data(data, N);

cudaFree(data);

GPU will establish direct mapping of

data in CPU memory, no page faults

will be generated

Memory can move freely to other

processors and mapping will carry

over

GPU

CPU

CPU

Page 50: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

50

USER HINTSAccessed By on Volta

char *data;cudaMallocManaged(&data, N);

init_data(data, N);

cudaMemAdvise(data, N, ..SetAccessedBy, myGpuId);

mykernel<<<..., s>>>(data, N);

use_data(data, N);

cudaFree(data);

GPU will establish direct mapping of

data in CPU memory, no page faults

will be generated

Access counters may eventually

trigger migration of this memory to

the GPU

GPU

CPU

CPU

GPU

Page 51: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

51

PERFORMANCE

How long does a page fault take to serve? - We can measure!

Page Fault Cost

Linked list traversal with some large stride to avoid prefetching effects

Page fault cost (us) DtoH HtoD

x86 + PCIe + GP100 20 30

P8 + NVLINK + GP100 20 20

Page 52: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

52

GB/s

GB/s

0

5

10

15

20

25

128KB 1MB 8MB 64MB 512MB 4GB

CPU memory

on-demand single on-demand multi

explicit single explicit multi

1

10

100

1000

128KB 1MB 8MB 64MB 512MB 4GB

GPU memory

on-demand prefetch explicit

PERFORMANCEPage Allocation Throughput

cudaMallocManaged/cudaMalloc + cudaMemset cudaMallocManaged/mmap + fill on the CPU

cudaMalloc is using

preallocated memory

for large sizes

Page 53: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

53

PERFORMANCEPage Migration Throughput (PCIe)

0

2

4

6

8

10

12

14

128KB 1MB 8MB 64MB 512MB 4GB

CPU to GPUon-demand stream on-demand warp-64k

prefetch memcpy

GB/s

0

2

4

6

8

10

12

14

128KB 1MB 8MB 64MB 512MB 4GB

GPU to CPU

on-demand single on-demand multi

prefetch memcpy

GB/s

Page 54: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

54

PERFORMANCEPage Migration Throughput (2x NVLINK)

0

5

10

15

20

25

30

128KB 1MB 8MB 64MB 512MB

CPU to GPUon-demand stream on-demand warp-64k

prefetch memcpy

GB/s

0

5

10

15

20

25

128KB 1MB 8MB 64MB 512MB

GPU to CPU

on-demand single on-demand multi

prefetch memcpy

GB/s

Page 55: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

55

PERFORMANCE

cudaMallocManaged alignment: 512B on Pascal/Volta, 4KB on Kepler/Maxwell

Too many small allocations will use up many pages

cudaMallocManaged memory is moved at system page granularity

For small allocations more data could be moved than necessary

Solution: use cached allocator or memory pools

Page Granularity Overhead

Page 56: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

56

AGENDA

Unified Memory Fundamentals

Under the Hood Details

Performance Analysis and Optimizations

Applications Deep Dive

Page 57: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

57

HPC: HPGMG

High-Performance Geometric Multigrid

Proxy AMR and Low Mach Combustion codes

Used in Top500 benchmarking

High memory usage requirements

http://crd.lbl.gov/departments/computer-science/PAR/research/hpgmg/

Combustion Simulation

Page 58: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

58

HPC: HPGMG

Hybrid implementation requires very careful memory management

Frequent data sharing when crossing the CPU-GPU threshold

Taking Advantage of the CPU and the GPU

V-CYCLE

GP

U

CP

U

THRESHOLD

F-CYCLE

Page 59: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

59

HPGMG: AMR PROXYData Locality and Reuse of AMR Levels

Optimization: prefetch the

next AMR level while

running computations on

the current level

We can use a separate

non-blocking CUDA stream

to overlap with the default

stream

https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/

Page 60: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

60

AMR PROXY OVERSUBSCRIPTION

0

20

40

60

80

100

120

140

160

180

200

1.4 4.7 8.6 28.9 58.6

x86 K40 P100 (x86 PCI-e) P100 + hints (x86 PCI-e) P100 (P8 NVLINK) P100 + hints (P8 NVLINK)

Applicati

on t

hro

ughput

(MD

OF/s

)

Application working set (GB)

P100 memory size (16GB)

x86 CPU: Intel E5-2630 v3, 2 sockets of 10 cores each with HT on (40 threads)

All 5 levels fit in GPU memory

Only 2 levels fit

Only 1 level fits

Page 61: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

61

vDNN: Virtualized DNN for Scalable, Memory-Efficient Neural Network Design

Original version implemented custom heuristics to prefetch and offload data

Unified Memory can automatically migrate memory as needed!

DEEP LEARNING

Page 62: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

62

DEEP LEARNING OVERSUBSCRIPTION

GPU: NVIDIA Quadro GP100; cuDNN 5.1, CUDA 9

0

5

10

15

20

25

30

batch 12812GB

batch 25623GB

batch 51245GB

Very Large Batches (VGG-16)All in Memory Offload Conv Offload All Unified Memory

tim

e (

ms)

0

5

10

15

20

25

30

batch 1610GB

batch 3219GB

batch 6436GB

Very Deep Networks (VGG-216)All in Memory Offload Conv Offload All Unified Memory

tim

e (

ms)

GP100 mem size(16GB)

manual

offload

fails!

Unified Memory

does not require

any changes to the

existing DNN code

Page 63: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

63

GRAPH ANALYTICSBFS Traversal

1

0

1

GPU A GPU B

Page 64: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

64

GRAPH ANALYTICSBFS Traversal

2

2

1

2

0

2

1

2

2

GPU A GPU B

Page 65: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

65

GRAPH ANALYTICSBFS Traversal

3

2

2

1

2

0

2

1

2

2

3

3

GPU A GPU B

Page 66: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

66

GRAPH ANALYTICSShared vs Duplicated Visibility Vector

GPU A GPU B

shared visibility bitmap

current frontier

duplicated visibility bitmap

Page 67: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

67

GRAPH ANALYTICSSoftware vs Hardware Atomics

CPU: Intel Core i7-5930K @ 3.50GHz; GPU: NVIDIA Quadro GP100; edgefactor 16, harmonic mean over 64 random sources

Speed-u

p v

s CPU

0

5

10

15

20

25

30

35

40

45

18 19 20 21 22 23 24 25

“Single-GPU” top-down BFS on 2xGP100 with Unified Memory

GPU: shared PCIe

GPU: duplicated PCIe

GPU: shared NVLINK

GPU: duplicated NVLINK

Graph scale (2^N)

Page 68: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

68

AGENDA

Unified Memory Fundamentals

Under the Hood Details

Performance Analysis and Optimizations

Applications Deep Dive

Page 69: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

69

CONCLUSIONS AND OUTLOOK

Consider using Unified Memory for any new application development

Get your code running on the GPU much sooner!

Enjoy clean code and *virtually* no memory limits

Increase productivity, explore and prototype new algorithms

Use the explicit data management only where you need it

Page 70: UNIFIED MEMORY ON PASCAL AND VOLTA · Heterogeneous Memory Manager: a set of Linux kernel patches Allows GPUs to access all system memory (malloc, stack, file system) Page migration

70