GPU Memory Details

Martin Kruliš (v1.1)

29. 10. 2015

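Overview

[Figure: block diagram. The host (CPU and host memory, ~ 25 GBps) is connected through PCI Express (16/32 GBps) to the GPU device, where the GPU chip (a set of SMPs) accesses the device memory at > 100 GBps.]

Note that details about host memory interconnection are platform specific.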

Host-Device Transfers

PCIe Transfers
◦ Much slower than internal GPU data transfers
◦ Issued explicitly by host code
    cudaMemcpy(dst, src, size, direction);
    - With one exception: when the GPU memory is mapped into the host memory space
    - The transfer call has significant overhead
    - Bulk transfers are preferred

Overlapping
◦ Up to 2 asynchronous transfers can run while the GPU is computing
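The two points above (bulk transfers and overlapping) can be sketched as follows. This is a minimal illustration, not the author's code: the kernel `double_all`, the helper `run_overlapped`, and the chunking scheme are hypothetical names chosen for this example, and error checking is omitted for brevity. Whether copies and kernels really overlap depends on the device and on using pinned host memory.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel used only for illustration: doubles every element in place.
__global__ void double_all(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Processes n floats in `chunks` pieces, each piece in its own stream.
// Assumes n is divisible by chunks. Returns true if every result equals 2.
bool run_overlapped(int n, int chunks)
{
    const int len = n / chunks;
    const size_t bytes = len * sizeof(float);

    float *host = nullptr, *dev = nullptr;
    cudaMallocHost(&host, n * sizeof(float));   // pinned host memory, needed for asynchronous copies
    cudaMalloc(&dev, n * sizeof(float));
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    cudaStream_t *streams = new cudaStream_t[chunks];
    for (int c = 0; c < chunks; ++c) cudaStreamCreate(&streams[c]);

    // One large copy per chunk and direction instead of many small ones.
    for (int c = 0; c < chunks; ++c) {
        float *h = host + (size_t)c * len;
        float *d = dev + (size_t)c * len;
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, streams[c]);
        double_all<<<(len + 255) / 256, 256, 0, streams[c]>>>(d, len);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, streams[c]);
    }
    cudaDeviceSynchronize();                    // wait for all streams to finish

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (host[i] == 2.0f);

    for (int c = 0; c < chunks; ++c) cudaStreamDestroy(streams[c]);
    delete[] streams;
    cudaFree(dev);
    cudaFreeHost(host);
    return ok;
}
```

A call such as run_overlapped(1 << 20, 4) issues four chunks; while one chunk is being computed, the copies of its neighbours may be in flight.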

Global Memory

Global Memory Properties
◦ Off-chip, but on the GPU device
◦ High bandwidth and high latency
    - ~ 100 GBps, 400-600 clock cycles
◦ Operated in transactions
    - Contiguous aligned segments of 32 B - 128 B
    - The number of transactions depends on the caching model, GPU architecture, and memory access pattern

Global Memory

Global Memory Caching
◦ Data are cached in L2 cache
    - Relatively small (up to 2 MB on new Maxwell GPUs)
◦ On CC 2.x (Fermi) also cached in L1 cache
    - Configurable by compiler flag
    - -Xptxas -dlcm=ca (Cache Always, i.e., also in L1, default)
    - -Xptxas -dlcm=cg (Cache Global, i.e., L2 only)
◦ CC 3.x (Kepler) reserves L1 for local memory caching and register spilling
◦ CC 5.x (Maxwell) separates L1 cache from shared memory and unifies it with texture cache

Global Memory

Coalesced Transfers
◦ The number of transactions caused by a global memory access depends on the access pattern
◦ Certain access patterns are optimized
◦ CC 1.x
    - Threads sequentially access an aligned memory block
    - Subsequent threads access subsequent words
◦ CC 2.0 and later
    - Threads access an aligned memory block
    - Accesses within the block can be permuted
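To make the difference between access patterns concrete, here is a small sketch with one coalesced and one strided copy kernel. The kernel names and the host helper `run_copies` are hypothetical, error checking is omitted, and the code only verifies correctness; it does not measure how many transactions each pattern generates.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Coalesced: thread i reads word i, so a warp touches one contiguous block.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads read words `stride` apart,
// so the reads of one warp are spread over many memory segments.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n / stride) out[i] = in[i * stride];
}

// Runs both kernels and checks the results (assumes 0 < stride <= n).
bool run_copies(int n, int stride)
{
    const int m = n / stride;
    std::vector<float> h_in(n), h_a(n), h_b(m);
    for (int i = 0; i < n; ++i) h_in[i] = (float)(i % 1024);

    float *in, *a, *b;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, m * sizeof(float));
    cudaMemcpy(in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int blocks = (n + 255) / 256;
    copy_coalesced<<<blocks, 256>>>(in, a, n);
    copy_strided<<<blocks, 256>>>(in, b, n, stride);
    cudaMemcpy(h_a.data(), a, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b.data(), b, m * sizeof(float), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (h_a[i] == h_in[i]);
    for (int i = 0; i < m; ++i) ok = ok && (h_b[i] == h_in[i * stride]);

    cudaFree(in); cudaFree(a); cudaFree(b);
    return ok;
}
```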

Global Memory

Access Patterns
◦ Perfectly aligned sequential access
◦ Perfectly aligned with permutation
◦ Contiguous sequential, but misaligned

[Figures: one diagram per access pattern]

Global Memory

Coalesced Loads Impact

[Figure: chart of the coalesced loads impact]
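One way to observe the impact of misalignment on a particular device is to time a copy kernel whose reads are shifted by a configurable offset. The sketch below is an assumption-laden illustration: `copy_offset` and `time_copy` are hypothetical names, error checking is omitted, and the measured times depend heavily on the GPU generation and its caches, so no expected numbers are given.

```cuda
#include <cuda_runtime.h>

// Reads are shifted by `offset` elements; offset 0 is the aligned case.
__global__ void copy_offset(const float *in, float *out, int n, int offset)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i + offset];
}

// Returns the kernel time in milliseconds for the given offset.
float time_copy(int n, int offset)
{
    float *in, *out;
    cudaMalloc(&in, (size_t)(n + offset) * sizeof(float));
    cudaMalloc(&out, (size_t)n * sizeof(float));
    cudaMemset(in, 0, (size_t)(n + offset) * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int blocks = (n + 255) / 256;
    copy_offset<<<blocks, 256>>>(in, out, n, offset);   // warm-up run

    cudaEventRecord(start);
    copy_offset<<<blocks, 256>>>(in, out, n, offset);   // timed run
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return ms;
}
```

Comparing time_copy(n, 0) with time_copy(n, 1) for a large n gives a rough picture of the cost of misaligned loads on the device at hand.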

Shared Memory

Memory Shared by SM
◦ Divided into banks
    - Each bank can be accessed independently
    - Consecutive 32-bit words are in consecutive banks
    - Optionally, 64-bit words division is used (CC 3.x)
◦ Bank conflicts are serialized
    - Except for reading the same address (broadcast)

Compute capability   Mem. size   # of banks   Bank bandwidth
1.x                  16 kB       16           32 bits / 2 cycles
2.x                  48 kB       32           32 bits / 2 cycles
3.x                  48 kB       32           64 bits / 1 cycle

Shared Memory

Linear Addressing
◦ Each thread in a warp accesses a different memory bank
◦ No collisions

Shared Memory

Linear Addressing with Stride
◦ Thread i accesses the 2*i-th item
◦ 2-way conflicts (2x slowdown) on CC < 3.0
◦ No collisions on CC 3.x
    - Due to the 64 bits per cycle throughput

Shared Memory

Linear Addressing with Stride
◦ Thread i accesses the 3*i-th item
◦ No collisions, since the stride and the number of banks have no common divisor greater than 1
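The stride examples can be checked with a few lines of arithmetic. The sketch below assumes the CC 2.x layout (32 banks of 32-bit words, warps of 32 threads) and stride >= 1, so that all threads access distinct addresses; `bank_of` and `conflict_degree` are hypothetical helper names, and the model ignores the 64-bit mode of CC 3.x.

```cuda
#include <cuda_runtime.h>

// Assumed layout: 32 banks, consecutive 32-bit words in consecutive banks.
__host__ __device__ inline int bank_of(int wordIndex)
{
    return wordIndex % 32;
}

// Largest number of threads of one warp that hit the same bank
// when thread t accesses the (t * stride)-th 32-bit word.
// 1 means no conflict, k means a k-way conflict.
int conflict_degree(int stride)
{
    int hits[32] = {0};
    int worst = 0;
    for (int t = 0; t < 32; ++t) {
        int b = bank_of(t * stride);
        if (++hits[b] > worst) worst = hits[b];
    }
    return worst;
}
```

Under these assumptions, conflict_degree(1) and conflict_degree(3) return 1, conflict_degree(2) returns 2, and conflict_degree(32) returns 32, which matches the rule that the conflict degree equals the greatest common divisor of the stride and the number of banks.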

Shared Memory

Broadcast
◦ One set of threads accesses a value in bank #12 and the remaining threads access a value in bank #20
◦ Broadcasts are served independently on CC 1.x
    - I.e., this access pattern causes a 2-way conflict
◦ CC 2.x and 3.x serve all broadcasts simultaneously

Shared Memory

Shared Memory vs. L1 Cache
◦ On most devices, they are the same resource
◦ Division can be set for each kernel by
    cudaFuncSetCacheConfig(kernel, cacheConfig);
    - Cache configuration can prefer L1 or shared memory (i.e., selecting 48 kB of 64 kB for the preferred one)

Shared Memory Configuration
◦ Some devices (CC 3.x) can configure memory banks
    cudaFuncSetSharedMemConfig(kernel, config);
    - The config selects between 32-bit and 64-bit mode
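Both configuration calls might be used as in the following sketch. The kernel `reverse_block` and the helper `configure_kernel` are hypothetical and exist only to have something to configure; the chosen settings (prefer shared memory, 64-bit banks) are an assumption that suits a kernel staging doubles in shared memory. The settings are hints, and devices with a fixed configuration are expected to ignore them.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: reverses each block of 256 doubles via shared memory.
// Assumes it is launched with 256 threads per block.
__global__ void reverse_block(double *data)
{
    __shared__ double buf[256];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    buf[t] = data[base + t];
    __syncthreads();
    data[base + t] = buf[blockDim.x - 1 - t];
}

// Applies both per-kernel settings; returns true if the runtime accepted them.
bool configure_kernel()
{
    const void *fn = reinterpret_cast<const void *>(&reverse_block);

    // Prefer a larger shared memory partition over L1 for this kernel.
    cudaError_t e1 = cudaFuncSetCacheConfig(fn, cudaFuncCachePreferShared);

    // Ask for 64-bit banks, which match the double elements (relevant on CC 3.x).
    cudaError_t e2 = cudaFuncSetSharedMemConfig(fn, cudaSharedMemBankSizeEightByte);

    return e1 == cudaSuccess && e2 == cudaSuccess;
}
```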

Registers

◦ One register pool per multiprocessor
    - 8k-64k 32-bit registers (depending on CC)
    - Register allocation is defined by the compiler
◦ As fast as the cores (no extra clock cycles)
◦ Read-after-write dependency
    - 24 clock cycles
    - Can be hidden if there are enough active warps
◦ Hardware scheduler (and compiler) attempts to avoid register bank conflicts whenever possible
    - The programmer has no direct control over conflicts

Local Memory

Per-thread Global Memory
◦ Allocated automatically by the compiler
    - The compiler may report the amount of allocated local memory (use --ptxas-options=-v)
◦ Large structures and arrays are placed here
    - Instead of the registers
◦ Register Pressure
    - There are not enough registers to accommodate the data of the thread
    - The registers are spilled into the local memory
    - Can be moderated by selecting smaller thread blocks
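A kernel like the following is a typical candidate for local memory: a per-thread array indexed by a value unknown at compile time usually cannot be kept in registers. This is only a sketch; `lookup` and `run_lookup` are hypothetical names, and whether the compiler actually places `scratch` in local memory depends on the toolkit version and optimization level, which is what the --ptxas-options=-v report would show.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Per-thread array with a run-time index; a likely candidate for local memory.
__global__ void lookup(const int *idx, float *out, int n)
{
    float scratch[64];
    for (int k = 0; k < 64; ++k) scratch[k] = 0.5f * k;

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = scratch[idx[i] & 63];   // index not known at compile time
}

// Runs the kernel and verifies the results on the host.
bool run_lookup(int n)
{
    std::vector<int> h_idx(n);
    std::vector<float> h_out(n);
    for (int i = 0; i < n; ++i) h_idx[i] = i;

    int *idx;
    float *out;
    cudaMalloc(&idx, n * sizeof(int));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemcpy(idx, h_idx.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    lookup<<<(n + 127) / 128, 128>>>(idx, out, n);
    cudaMemcpy(h_out.data(), out, n * sizeof(float), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (h_out[i] == 0.5f * (i & 63));

    cudaFree(idx);
    cudaFree(out);
    return ok;
}
```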

Constant and Texture Memory

Constant Memory
◦ Special 64 KB memory for read-only data, accessed through a cache
    - 8 KB is the cache working set per multiprocessor
◦ CC 2.x introduces the LDU (LoaD Uniform) instruction
    - The compiler uses it to force loading of read-only, thread-independent variables into the cache

Texture Memory
◦ Texture cache is optimized for 2D spatial locality
◦ Additional functionality like fast data interpolation, normalized coordinate system, or handling the boundary cases
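A minimal sketch of constant memory usage follows, assuming a small table of polynomial coefficients that every thread reads at the same index. The names `coeffs`, `poly`, and `run_poly` are hypothetical and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Small read-only table placed in constant memory.
__constant__ float coeffs[4];

// Evaluates c3*x^3 + c2*x^2 + c1*x + c0 for every element.
// All threads of a warp read the same coefficient at the same time.
__global__ void poly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        y[i] = ((coeffs[3] * v + coeffs[2]) * v + coeffs[1]) * v + coeffs[0];
    }
}

// Fills the table, runs the kernel for x = 1, and checks that y = 1+2+3+4.
bool run_poly(int n)
{
    const float h_coeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));

    std::vector<float> h_x(n, 1.0f), h_y(n);
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemcpy(x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    poly<<<(n + 255) / 256, 256>>>(x, y, n);
    cudaMemcpy(h_y.data(), y, n * sizeof(float), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (h_y[i] == 10.0f);

    cudaFree(x);
    cudaFree(y);
    return ok;
}
```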

Memory Allocation

Global Memory
◦ cudaMalloc(), cudaFree()
◦ Dynamic kernel allocation
    - malloc() and free() called from kernel
    - cudaDeviceSetLimit(cudaLimitMallocHeapSize, size)

Shared Memory
◦ Statically (e.g., __shared__ int foo[16];)
◦ Dynamically (by kernel launch parameter)
    extern __shared__ float bar[];
    float *bar1 = &(bar[0]);
    float *bar2 = &(bar[size_of_bar1]);
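The dynamic shared memory fragment can be completed into a full example as sketched below. The kernel `two_arrays`, the helper `run_two_arrays`, and the array lengths are hypothetical; the essential part is that the third launch parameter gives the size in bytes of the single `extern __shared__` pool, which the kernel then splits manually.

```cuda
#include <cuda_runtime.h>

// One dynamically sized pool is split into two arrays of lenA and lenB floats.
// Assumes blockDim.x >= max(lenA, lenB).
__global__ void two_arrays(float *out, int lenA, int lenB)
{
    extern __shared__ float pool[];
    float *a = &pool[0];
    float *b = &pool[lenA];

    int t = threadIdx.x;
    if (t < lenA) a[t] = 1.0f;
    if (t < lenB) b[t] = 2.0f;
    __syncthreads();

    if (t == 0) {
        float sum = 0.0f;
        for (int k = 0; k < lenA; ++k) sum += a[k];
        for (int k = 0; k < lenB; ++k) sum += b[k];
        out[0] = sum;
    }
}

// Launches one block and checks the sum lenA * 1 + lenB * 2.
bool run_two_arrays()
{
    const int lenA = 32, lenB = 16;
    float *out, result = 0.0f;
    cudaMalloc(&out, sizeof(float));

    // Third launch parameter: bytes of dynamic shared memory per block.
    two_arrays<<<1, 64, (lenA + lenB) * sizeof(float)>>>(out, lenA, lenB);
    cudaMemcpy(&result, out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(out);
    return result == 64.0f;
}
```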

Implications and Guidelines

Global Memory
◦ Data should be accessed in a coalesced manner
◦ Hot data should be manually cached in shared memory

Shared Memory
◦ Bank conflicts need to be avoided
    - Redesigning data structures in a column-wise manner
    - Using strides that have no common divisor with the number of banks (e.g., odd strides)

Registers and Local Memory
◦ Use as few as possible, avoid register spilling
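A common way to combine these guidelines is a tiled matrix transposition: global memory is read and written in a coalesced manner, the tile is cached in shared memory, and one padding column makes the row stride 33 words, which has no common divisor with 32 banks. The sketch below is an illustration under stated assumptions (square matrix, n divisible by 32, blocks of 32x32 threads); `transpose_padded` and `run_transpose` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 32

// Transposes an n x n matrix; assumes n is a multiple of TILE
// and a launch with TILE x TILE threads per block.
__global__ void transpose_padded(const float *in, float *out, int n)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column of padding

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write, column read from the tile
}

// Runs the kernel and verifies out[r][c] == in[c][r].
bool run_transpose(int n)
{
    std::vector<float> h_in(n * n), h_out(n * n);
    for (int i = 0; i < n * n; ++i) h_in[i] = (float)i;

    float *in, *out;
    cudaMalloc(&in, n * n * sizeof(float));
    cudaMalloc(&out, n * n * sizeof(float));
    cudaMemcpy(in, h_in.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE), grid(n / TILE, n / TILE);
    transpose_padded<<<grid, block>>>(in, out, n);
    cudaMemcpy(h_out.data(), out, n * n * sizeof(float), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            ok = ok && (h_out[r * n + c] == h_in[c * n + r]);

    cudaFree(in);
    cudaFree(out);
    return ok;
}
```

Without the padding column, the threads of a warp reading one tile column would all fall into the same bank.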

Implications and Guidelines

Memory Caching
◦ The structures should be designed to utilize caches in the best way possible
    - The working set of active blocks should fit in the L2 cache
◦ Providing maximum information for the compiler
    - Using const for constant data
    - Using __restrict__ to indicate that no pointer aliasing will occur

Data Alignment
◦ Operate on 32-bit/64-bit values only
◦ Align data structures to suitable powers of 2
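The compiler hints and the alignment advice might look like this in code. The sketch is illustrative: `saxpy`, `Point3`, and `run_saxpy` are hypothetical names, and how much the compiler gains from `const` and `__restrict__` depends on the architecture and the kernel.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Three floats padded to 16 bytes, so that each element is suitably aligned.
struct __align__(16) Point3 { float x, y, z; };
static_assert(sizeof(Point3) == 16, "Point3 is expected to occupy 16 bytes");

// const marks x as read-only; __restrict__ promises that x and y do not alias.
__global__ void saxpy(int n, float a,
                      const float * __restrict__ x,
                      float * __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Runs y = 3 * x + y for x = 1, y = 2 and checks that y = 5.
bool run_saxpy(int n)
{
    std::vector<float> h_x(n, 1.0f), h_y(n, 2.0f);
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemcpy(x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(y, h_y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);
    cudaMemcpy(h_y.data(), y, n * sizeof(float), cudaMemcpyDeviceToHost);

    bool ok = true;
    for (int i = 0; i < n; ++i) ok = ok && (h_y[i] == 5.0f);

    cudaFree(x);
    cudaFree(y);
    return ok;
}
```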

Maxwell Architecture

What is new in Maxwell
◦ L1 merges with texture cache
    - Data are cached in L1 the same way as in Fermi
◦ Shared memory is independent
    - 64 kB or 96 kB, not shared with L1
◦ Shared memory uses 32-bit banks
    - Reverts to the Fermi-like style, keeping the aggregated bandwidth
◦ Faster shared memory atomic operations
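Shared memory atomics are typically used for per-block privatization, for example in a histogram. The following is a sketch under assumptions, not a tuned implementation: `histogram256` and `run_histogram` are hypothetical names, the grid size is arbitrary, and the speedup from faster shared memory atomics is not measured here.

```cuda
#include <cuda_runtime.h>

// Each block accumulates a private histogram in shared memory
// and merges it into the global bins at the end.
__global__ void histogram256(const unsigned char *data, int n, unsigned int *bins)
{
    __shared__ unsigned int local[256];
    for (int b = threadIdx.x; b < 256; b += blockDim.x) local[b] = 0;
    __syncthreads();

    int step = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += step)
        atomicAdd(&local[data[i]], 1u);          // shared memory atomic
    __syncthreads();

    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);           // one global atomic per bin and block
}

// Fills the input with the value 7 and checks that only bin 7 is populated.
bool run_histogram(int n)
{
    unsigned char *data;
    unsigned int *bins, h_bins[256];
    cudaMalloc(&data, n);
    cudaMalloc(&bins, 256 * sizeof(unsigned int));
    cudaMemset(data, 7, n);
    cudaMemset(bins, 0, 256 * sizeof(unsigned int));

    histogram256<<<64, 256>>>(data, n, bins);
    cudaMemcpy(h_bins, bins, sizeof(h_bins), cudaMemcpyDeviceToHost);

    bool ok = (h_bins[7] == (unsigned int)n);
    for (int b = 0; b < 256; ++b)
        if (b != 7) ok = ok && (h_bins[b] == 0);

    cudaFree(data);
    cudaFree(bins);
    return ok;
}
```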


Discussion
