GPU Memory Details
Post on 31-Dec-2015
Martin Kruliš (v1.1)
29. 10. 2015
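Overview
[Figure: memory hierarchy. The host (CPU, host memory) is connected to the GPU device over PCI Express (16/32 GBps). The GPU device holds the global memory and the GPU chip; the chip contains the L2 cache and several SMPs, each with its own L1 cache, registers, and cores. Other bandwidth labels in the figure: ~ 25 GBps, > 100 GBps.]
Note that details about host memory interconnection are platform specific.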
Host-Device Transfers
PCIe Transfers
◦ Much slower than internal GPU data transfers
◦ Issued explicitly by host code
  cudaMemcpy(dst, src, size, direction);
  ▪ With one exception, when the GPU memory is mapped to the host memory space
  ▪ The transfer call has significant overhead
  ▪ Bulk transfers are preferred
Overlapping
◦ Up to 2 asynchronous transfers while the GPU is computing
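The overlapping of transfers and computation can be sketched with CUDA streams. This is a minimal illustration, not the author's code: the kernel `scale` and the helper `run_overlap_demo` are hypothetical names, and the sketch assumes pinned host memory (`cudaMallocHost`) so that `cudaMemcpyAsync` can actually run asynchronously.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element by `factor`.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Splits the work into two chunks, each in its own stream, so that the copy
// of one chunk may overlap with the kernel of the other (n should be even).
// Returns the number of wrong elements (0 on success) or -1 on a CUDA error.
int run_overlap_demo(int n) {
    const int kStreams = 2;
    const int chunk = n / kStreams;
    const size_t chunk_bytes = chunk * sizeof(float);
    float *host = nullptr, *dev = nullptr;
    // Pinned (page-locked) host memory is assumed for asynchronous copies.
    if (cudaMallocHost((void **)&host, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&dev, n * sizeof(float)) != cudaSuccess) {
        cudaFreeHost(host);
        return -1;
    }
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);
    for (int s = 0; s < kStreams; ++s) {
        float *h = host + s * chunk;
        float *d = dev + s * chunk;
        // One bulk copy per chunk rather than many small ones.
        cudaMemcpyAsync(d, h, chunk_bytes, cudaMemcpyHostToDevice, streams[s]);
        scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d, chunk, 2.0f);
        cudaMemcpyAsync(h, d, chunk_bytes, cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaError_t err = cudaDeviceSynchronize();

    int wrong = 0;
    for (int i = 0; i < n; ++i) if (host[i] != 2.0f) ++wrong;
    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(dev);
    cudaFreeHost(host);
    return err == cudaSuccess ? wrong : -1;
}
```

Whether the copies really overlap with the kernels depends on the device; the code only makes the overlap possible.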
Global Memory
Global Memory Properties
◦ Off-chip, but on the GPU device
◦ High bandwidth and high latency
  ▪ ~ 100 GBps, 400-600 clock cycles
◦ Operated in transactions
  ▪ Contiguous aligned segments of 32 B - 128 B
  ▪ The number of transactions depends on the caching model, GPU architecture, and memory access pattern
Global Memory Caching
◦ Data are cached in L2 cache
  ▪ Relatively small (up to 2 MB on new Maxwell GPUs)
◦ On CC < 3.0 (Fermi), data are also cached in L1 cache
  ▪ Configurable by compiler flag
    -Xptxas -dlcm=ca (Cache Always, i.e., also in L1, default)
    -Xptxas -dlcm=cg (Cache Global, i.e., L2 only)
◦ CC 3.x (Kepler) reserves L1 for local memory caching and register spilling
◦ CC 5.x (Maxwell) separates L1 cache from shared memory and unifies it with texture cache
Coalesced Transfers
◦ The number of transactions caused by a global memory access depends on the access pattern
◦ Certain access patterns are optimized
◦ CC 1.x
  ▪ Threads sequentially access an aligned memory block
  ▪ Subsequent threads access subsequent words
◦ CC 2.0 and later
  ▪ Threads access an aligned memory block
  ▪ Accesses within the block can be permuted
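To make the difference between access patterns tangible, the following sketch times a coalesced copy against a strided one. The kernel names, the helper `time_copy`, and the stride of 32 words are illustrative assumptions; the measured times depend entirely on the device, so no particular ratio is claimed.

```cuda
#include <cuda_runtime.h>

// Coalesced: consecutive threads read consecutive 32-bit words.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read words `stride` elements apart, so the
// reads of one warp are scattered over many memory segments.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[((long long)i * stride) % n];
}

// Returns the elapsed time of one launch in milliseconds, or -1.0f on error.
float time_copy(bool strided, int n) {
    float *in = nullptr, *out = nullptr;
    if (cudaMalloc((void **)&in, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMalloc((void **)&out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(in);
        return -1.0f;
    }
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int blocks = (n + 255) / 256;
    cudaEventRecord(start);
    if (strided) copy_strided<<<blocks, 256>>>(in, out, n, 32);
    else copy_coalesced<<<blocks, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaError_t err = cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? ms : -1.0f;
}
```

Calling `time_copy(false, n)` and `time_copy(true, n)` with the same `n` gives a rough comparison; a first warm-up launch is advisable before trusting the numbers.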
Access Patterns
◦ Perfectly aligned sequential access
◦ Perfectly aligned with permutation
◦ Contiguous sequential, but misaligned
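The three access patterns can be compared with a small host-side model that counts how many aligned segments one warp touches. It is a simplification under stated assumptions: a warp of 32 threads, 4-byte words, and 128 B segments. Real transaction counts also depend on the caching model and the architecture. The function name `segments_touched` is hypothetical.

```cuda
#include <cstdint>
#include <set>
#include <vector>

// Counts the distinct aligned segments touched by one warp.
// Thread t reads the 4-byte word at base_bytes + 4 * word_index[t].
int segments_touched(std::uint64_t base_bytes,
                     const std::vector<int> &word_index,
                     int segment_bytes = 128) {
    std::set<std::uint64_t> segments;
    for (int w : word_index) {
        std::uint64_t addr = base_bytes + 4ull * (std::uint64_t)w;
        segments.insert(addr / segment_bytes);
    }
    return (int)segments.size();
}

// Index pattern: thread t reads word t.
std::vector<int> sequential_pattern() {
    std::vector<int> idx(32);
    for (int t = 0; t < 32; ++t) idx[t] = t;
    return idx;
}

// Index pattern: the same 32 words, visited in reverse order.
std::vector<int> permuted_pattern() {
    std::vector<int> idx(32);
    for (int t = 0; t < 32; ++t) idx[t] = 31 - t;
    return idx;
}
```

In this model, the aligned sequential and the aligned permuted patterns both touch one segment, while the same sequential pattern shifted by a single word (`base_bytes = 4`) touches two.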
[Figure: Coalesced Loads Impact]
Shared Memory
Memory Shared by SM
◦ Divided into banks
  ▪ Each bank can be accessed independently
  ▪ Consecutive 32-bit words are in consecutive banks
  ▪ Optionally, 64-bit word division is used (CC 3.x)
◦ Bank conflicts are serialized
  ▪ Except for reading the same address (broadcast)

Compute capability   Mem. size   # of banks   Bandwidth per bank
1.x                  16 kB       16           32 bits / 2 cycles
2.x                  48 kB       32           32 bits / 2 cycles
3.x                  48 kB       32           64 bits / 1 cycle
Linear Addressing
◦ Each thread in a warp accesses a different memory bank
◦ No collisions
Linear Addressing with Stride
◦ Each thread accesses the 2*i-th item
◦ 2-way conflicts (2x slowdown) on CC < 3.0
◦ No collisions on CC 3.x
  ▪ Due to the throughput of 64 bits per cycle
Linear Addressing with Stride
◦ Each thread accesses the 3*i-th item
◦ No collisions, since the stride and the number of banks have no common divisor
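The effect of the stride on bank conflicts can be checked with a few lines of host code. This is a simplified model, not a hardware specification: it assumes 32 banks of 32-bit words and a warp of 32 threads (the CC 2.x layout), and it does not cover the 64-bit bank mode of CC 3.x. The function name `conflict_degree` is hypothetical.

```cuda
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Thread t of the warp reads the 32-bit word with index t * stride.
// Returns the largest number of distinct words requested from a single bank,
// i.e., how many times the access is serialized in this model.
int conflict_degree(int stride, int banks = 32, int warp_size = 32) {
    std::vector<std::set<int>> words_per_bank(banks);
    for (int t = 0; t < warp_size; ++t) {
        int word = t * stride;
        // Consecutive words lie in consecutive banks; threads reading the
        // same word are counted once (broadcast).
        words_per_bank[word % banks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto &words : words_per_bank)
        worst = std::max(worst, words.size());
    return (int)worst;
}
```

In this model, strides 1 and 3 give a degree of 1 (no conflict), stride 2 gives 2, and stride 32 gives 32. Stride 6 also gives 2, which shows that the deciding factor is whether the stride shares a divisor with the number of banks.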
Broadcast
◦ One set of threads accesses a value in bank #12 and the remaining threads access a value in bank #20
◦ Broadcasts are served independently on CC 1.x
  ▪ I.e., this sample causes a 2-way conflict
◦ CC 2.x and 3.x serve all broadcasts simultaneously
Shared Memory vs. L1 Cache
◦ On most devices, they are the same resource
◦ Division can be set for each kernel by
  cudaFuncSetCacheConfig(kernel, cacheConfig);
  ▪ Cache configuration can prefer L1 or shared memory (i.e., selecting 48 kB of 64 kB for the preferred)
Shared Memory Configuration
◦ Some devices (CC 3.x) can configure memory banks
  cudaFuncSetSharedMemConfig(kernel, config);
  ▪ The config selects between 32-bit and 64-bit mode
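A per-kernel cache preference might be set as in the following sketch. The kernel `reverse_in_block` and the helper `run_prefer_shared_demo` are hypothetical; only `cudaFuncSetCacheConfig` and `cudaFuncCachePreferShared` come from the CUDA runtime. The setting is a preference, so a device whose L1 cache and shared memory are separate resources may simply ignore it.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel that relies on shared memory: reverses each block
// of 256 elements in place.
__global__ void reverse_in_block(float *data) {
    __shared__ float tile[256];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];
    __syncthreads();
    data[base + t] = tile[blockDim.x - 1 - t];
}

// Returns the number of wrong elements (0 on success) or -1 on a CUDA error.
int run_prefer_shared_demo() {
    const int n = 512;  // two blocks of 256 threads
    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) host[i] = (float)i;
    float *dev = nullptr;
    if (cudaMalloc((void **)&dev, n * sizeof(float)) != cudaSuccess) return -1;
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Ask for the larger share of the on-chip memory to go to shared memory.
    cudaError_t err =
        cudaFuncSetCacheConfig(reverse_in_block, cudaFuncCachePreferShared);
    if (err == cudaSuccess) {
        reverse_in_block<<<n / 256, 256>>>(dev);
        err = cudaMemcpy(host.data(), dev, n * sizeof(float),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(dev);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int i = 0; i < n; ++i) {
        int base = (i / 256) * 256;
        if (host[i] != (float)(base + 255 - (i - base))) ++wrong;
    }
    return wrong;
}
```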
Registers
◦ One register pool per multiprocessor
  ▪ 8-64k of 32-bit registers (depending on CC)
  ▪ Register allocation is defined by the compiler
◦ As fast as the cores (no extra clock cycles)
◦ Read-after-write dependency
  ▪ 24 clock cycles
  ▪ Can be hidden if there are enough active warps
◦ Hardware scheduler (and compiler) attempts to avoid register bank conflicts whenever possible
  ▪ The programmer has no direct control over conflicts
Local Memory
Per-thread Global Memory
◦ Allocated automatically by the compiler
  ▪ The compiler may report the amount of allocated local memory (use --ptxas-options=-v)
◦ Large structures and arrays are placed here
  ▪ Instead of the registers
◦ Register Pressure
  ▪ There are not enough registers to accommodate the data of the thread
  ▪ The registers are spilled into the local memory
  ▪ Can be moderated by selecting smaller thread blocks
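Besides the compiler report, the register and local memory usage of a kernel can also be queried at run time with `cudaFuncGetAttributes`. The sketch below is illustrative: the kernel `local_array_kernel` and the helper `report_kernel_resources` are hypothetical names, and whether the array really ends up in local memory is the compiler's decision, so the reported numbers will vary.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical kernel with a large per-thread array. The dynamic index makes
// it likely (not guaranteed) that the array is placed in local memory.
__global__ void local_array_kernel(const int *idx, int *out, int n) {
    int scratch[256];
    for (int k = 0; k < 256; ++k) scratch[k] = k * k;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = scratch[idx[i] & 255];
}

// Reads the per-thread register count and local memory size of the kernel.
cudaError_t report_kernel_resources(int *num_regs, std::size_t *local_bytes) {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, local_array_kernel);
    if (err == cudaSuccess) {
        *num_regs = attr.numRegs;              // registers per thread
        *local_bytes = attr.localSizeBytes;    // local memory per thread
    }
    return err;
}
```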
Constant and Texture Memory
Constant Memory
◦ 64 KB of read-only data served through a special cache
  ▪ 8 KB is the cache working set per multiprocessor
◦ CC 2.x introduces the LDU (LoaD Uniform) instruction
  ▪ The compiler uses it to force loading read-only variables that are thread-independent into the cache
Texture Memory
◦ Texture cache is optimized for 2D spatial locality
◦ Additional functionality like fast data interpolation, normalized coordinate system, or handling the boundary cases
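A typical use of constant memory is a small table that every thread reads at the same index. The following sketch evaluates a cubic polynomial whose coefficients live in a `__constant__` array; the names `coeffs`, `eval_poly`, and `run_poly_demo` are assumptions made for the illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Read-only coefficients; all threads read the same addresses.
__constant__ float coeffs[4];

// Evaluates c0 + c1*x + c2*x^2 + c3*x^3 by Horner's scheme.
__global__ void eval_poly(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        y[i] = ((coeffs[3] * v + coeffs[2]) * v + coeffs[1]) * v + coeffs[0];
    }
}

// Returns the number of wrong results (0 on success) or -1 on a CUDA error.
int run_poly_demo() {
    const int n = 1024;
    const float host_coeffs[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> host(n, 2.0f);  // x = 2 everywhere
    float *x = nullptr, *y = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void **)&x, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&y, bytes) != cudaSuccess) { cudaFree(x); return -1; }

    // Constant memory is filled from the host with cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(coeffs, host_coeffs, sizeof(host_coeffs));
    cudaMemcpy(x, host.data(), bytes, cudaMemcpyHostToDevice);
    eval_poly<<<(n + 255) / 256, 256>>>(x, y, n);
    cudaError_t err = cudaMemcpy(host.data(), y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(x);
    cudaFree(y);
    if (err != cudaSuccess) return -1;

    int wrong = 0;  // 1 + 2*2 + 3*4 + 4*8 = 49
    for (int i = 0; i < n; ++i) if (host[i] != 49.0f) ++wrong;
    return wrong;
}
```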
Memory Allocation
Global Memory
◦ cudaMalloc(), cudaFree()
◦ Dynamic kernel allocation
  ▪ malloc() and free() called from kernel
  ▪ cudaDeviceSetLimit(cudaLimitMallocHeapSize, size)
Shared Memory
◦ Statically (e.g., __shared__ int foo[16];)
◦ Dynamically (by kernel launch parameter)
  extern __shared__ float bar[];
  float *bar1 = &(bar[0]);
  float *bar2 = &(bar[size_of_bar1]);
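The dynamic shared memory fragment can be expanded into a complete example as follows. This is a sketch under assumptions: the kernel `two_buffers` and the helper `run_dynamic_shared_demo` are hypothetical, and both sub-buffers are sized to `blockDim.x` floats. The third launch parameter gives the total size of `bar` in bytes.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void two_buffers(const float *in, float *out) {
    extern __shared__ float bar[];     // size is given at kernel launch
    float *bar1 = &bar[0];             // first blockDim.x floats
    float *bar2 = &bar[blockDim.x];    // next blockDim.x floats
    int t = threadIdx.x;
    int g = blockIdx.x * blockDim.x + t;
    bar1[t] = in[g];
    bar2[t] = 2.0f * in[g];
    __syncthreads();
    // Combine a neighbour's value from bar1 with the thread's own from bar2.
    out[g] = bar1[(t + 1) % blockDim.x] + bar2[t];
}

// Returns the number of wrong elements (0 on success) or -1 on a CUDA error.
int run_dynamic_shared_demo() {
    const int threads = 128;
    std::vector<float> host(threads);
    for (int i = 0; i < threads; ++i) host[i] = (float)i;
    float *in = nullptr, *out = nullptr;
    size_t bytes = threads * sizeof(float);
    if (cudaMalloc((void **)&in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&out, bytes) != cudaSuccess) { cudaFree(in); return -1; }
    cudaMemcpy(in, host.data(), bytes, cudaMemcpyHostToDevice);

    // Two sub-buffers of `threads` floats each, hence 2 * bytes.
    two_buffers<<<1, threads, 2 * bytes>>>(in, out);
    cudaError_t err = cudaMemcpy(host.data(), out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(out);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int t = 0; t < threads; ++t)
        if (host[t] != (float)((t + 1) % threads + 2 * t)) ++wrong;
    return wrong;
}
```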
Implications and Guidelines
Global Memory
◦ Data should be accessed in a coalesced manner
◦ Hot data should be manually cached in shared memory
Shared Memory
◦ Bank conflicts need to be avoided
  ▪ Redesigning data structures in a column-wise manner
  ▪ Using strides that have no common divisor with the # of banks
Registers and Local Memory
◦ Use as few as possible, avoid register spilling
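A matrix transpose is a common way to see these guidelines working together. The sketch below stages a tile in shared memory so that both the global reads and the global writes are coalesced, and pads each tile row by one word so that column accesses use a stride of 33 words. The names `transpose_padded` and `run_transpose_demo` and the tile size of 32 are assumptions for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 32;

// Transposes an n x n row-major matrix: out[y][x] = in[x][y].
__global__ void transpose_padded(const float *in, float *out, int n) {
    // The extra column changes the row stride from 32 to 33 words, which
    // shares no divisor with 32 banks.
    __shared__ float tile[TILE][TILE + 1];
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();
    // Swap the block coordinates so that writes are contiguous as well.
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < n && ty < n) out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
}

// Returns the number of wrong elements (0 on success) or -1 on a CUDA error.
int run_transpose_demo(int n) {
    std::vector<float> host(n * n), result(n * n);
    for (int i = 0; i < n * n; ++i) host[i] = (float)i;
    float *in = nullptr, *out = nullptr;
    size_t bytes = host.size() * sizeof(float);
    if (cudaMalloc((void **)&in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&out, bytes) != cudaSuccess) { cudaFree(in); return -1; }
    cudaMemcpy(in, host.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    transpose_padded<<<grid, block>>>(in, out, n);
    cudaError_t err = cudaMemcpy(result.data(), out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(in);
    cudaFree(out);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int y = 0; y < n; ++y)
        for (int x = 0; x < n; ++x)
            if (result[y * n + x] != host[x * n + y]) ++wrong;
    return wrong;
}
```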
Memory Caching
◦ The structures should be designed to utilize caches in the best way possible
  ▪ The working set of active blocks should fit in the L2 cache
◦ Providing maximum information for the compiler
  ▪ Using const for constant data
  ▪ Using __restrict__ to indicate that no pointer aliasing will occur
Data Alignment
◦ Operate on 32-bit/64-bit values only
◦ Align data structures to suitable powers of 2
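The `const` and `__restrict__` hints might be applied as in this minimal sketch. The kernel `saxpy` and the helper `run_saxpy_demo` are illustrative names; the qualifiers only promise the compiler that `x` is read-only and that `x` and `y` do not overlap, and what the compiler does with that promise is up to it.

```cuda
#include <cuda_runtime.h>
#include <vector>

// y = a * x + y; x is read-only and does not alias y.
__global__ void saxpy(int n, float a,
                      const float *__restrict__ x,
                      float *__restrict__ y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Returns the number of wrong elements (0 on success) or -1 on a CUDA error.
int run_saxpy_demo(int n) {
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);
    float *x = nullptr, *y = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc((void **)&x, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&y, bytes) != cudaSuccess) { cudaFree(x); return -1; }
    cudaMemcpy(x, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(y, hy.data(), bytes, cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 3.0f, x, y);
    cudaError_t err = cudaMemcpy(hy.data(), y, bytes, cudaMemcpyDeviceToHost);
    cudaFree(x);
    cudaFree(y);
    if (err != cudaSuccess) return -1;

    int wrong = 0;  // 3 * 1 + 2 = 5
    for (int i = 0; i < n; ++i) if (hy[i] != 5.0f) ++wrong;
    return wrong;
}
```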
Maxwell Architecture
What is new in Maxwell
◦ L1 merges with texture cache
  ▪ Data are cached in L1 the same way as in Fermi
◦ Shared memory is independent
  ▪ 64 kB or 96 kB, not shared with L1
◦ Shared memory uses 32-bit banks
  ▪ Reverts to the Fermi-like style, keeping the aggregated bandwidth
◦ Faster shared memory atomic operations
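A histogram is one kind of code that could benefit from faster shared memory atomics. The sketch below accumulates per-block counts in shared memory with `atomicAdd` and merges them into global memory at the end. The names `histogram` and `run_histogram_demo` and the choice of 64 bins are assumptions; no speedup figure is implied.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BINS = 64;

__global__ void histogram(const unsigned char *data, int n, unsigned int *bins) {
    __shared__ unsigned int local[BINS];
    for (int b = threadIdx.x; b < BINS; b += blockDim.x) local[b] = 0;
    __syncthreads();
    // Grid-stride loop; each update is an atomic on shared memory.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i] % BINS], 1u);
    __syncthreads();
    // Merge the per-block counts into the global histogram.
    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);
}

// Returns the number of wrong bins (0 on success) or -1 on a CUDA error.
int run_histogram_demo() {
    const int per_bin = 1000, n = BINS * per_bin;
    std::vector<unsigned char> host(n);
    for (int i = 0; i < n; ++i) host[i] = (unsigned char)(i % BINS);
    unsigned char *data = nullptr;
    unsigned int *bins = nullptr;
    if (cudaMalloc((void **)&data, n) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&bins, BINS * sizeof(unsigned int)) != cudaSuccess) {
        cudaFree(data);
        return -1;
    }
    cudaMemcpy(data, host.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(bins, 0, BINS * sizeof(unsigned int));

    histogram<<<32, 256>>>(data, n, bins);
    std::vector<unsigned int> result(BINS);
    cudaError_t err = cudaMemcpy(result.data(), bins,
                                 BINS * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(data);
    cudaFree(bins);
    if (err != cudaSuccess) return -1;

    int wrong = 0;
    for (int b = 0; b < BINS; ++b) if (result[b] != (unsigned int)per_bin) ++wrong;
    return wrong;
}
```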
Discussion