Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs
Chao Li (North Carolina State University), Yi Yang (NEC Labs), Hongwen Dai (North Carolina State University),
Shengen Yan (Institute of Software, CAS), Frank Mueller (North Carolina State University), Huiyang Zhou (North Carolina State University)
Introduction
There are two ways to manage on-chip caches effectively:
- Explicit software management: shared memory in GPUs; near memory in MIC KNL.
- Implicit hardware management: the L1 data cache (L1 D-cache).
Accelerators like GPUs provide an ideal platform to compare the two: both storage types share the same hardware resources and are configured using the runtime API (a configuration sketch follows this list).
This study offers insight into two questions:
1. Is it worthwhile for application developers to explicitly manage shared memory, given the existence of the hardware-managed L1 D-caches in GPUs?
2. What are the main reasons for code utilizing shared memory to outperform code leveraging L1 D-caches (and vice versa)?
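On Fermi-class GPUs, the same 64 KB of per-core on-chip storage is split between shared memory and L1 D-cache through the CUDA runtime. A minimal configuration sketch (mm_kernel is a placeholder name, not code from the paper):

#include <cuda_runtime.h>

__global__ void mm_kernel(float *C, const float *A, const float *B) {
    // kernel body elided
}

int main() {
    // Shared-memory version: request the 48 KB shared memory / 16 KB L1 split.
    cudaFuncSetCacheConfig(mm_kernel, cudaFuncCachePreferShared);
    // D-cache version: request the 16 KB shared memory / 48 KB L1 split.
    cudaFuncSetCacheConfig(mm_kernel, cudaFuncCachePreferL1);
    return 0;
}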
Outline
- Four detailed case studies:
  - Matrix multiplication
  - FFT
  - Marching Cubes
  - Pathfinder
- Benchmark categorization and experimental results
- Conclusions
Matrix-Multiplication
[Figure: tiled matrix multiplication, C = A * B. One tile of A and one tile of B are staged from DRAM through the L2 cache into on-chip storage (L1 D-cache or shared memory) feeding the processing elements (PEs).]
Matrix-Multiplication Code
Shared-memory version vs. D-cache version:
1) Same tiling optimization.
2) Similar cache performance (95% hit ratio).
3) Same hardware, similar latency.
Surprisingly, the D-cache version is 43.8% slower than the shared-memory version.
Version                                  Time on GTX480 (ms)   Cycles on GPGPU-Sim
Software-managed cache version           0.16                  132,844
Hardware-managed cache version           0.23                  281,831

The hardware-managed version is 43.8% slower on the GTX480.
Configuration: 48 KB capacity for both caches; matrix size 256*256; tile size 16*16; thread block (16, 16); TB number: 5 for both versions.
Software-managed cache version (i.e., shared-memory version):

for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {  // one iteration per tile
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    AS(ty, tx) = A[a + WA * ty + tx];  // AS(i, j) is the SDK macro for As[i][j]
    BS(ty, tx) = B[b + WB * ty + tx];  // BS(i, j) is the SDK macro for Bs[i][j]
    __syncthreads();
#pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
        Csub += AS(ty, k) * BS(k, tx);
    __syncthreads();
}

Hardware-managed cache version (i.e., D-cache version):

for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {  // one iteration per tile
#pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
        Csub += A[a + WA*ty + k] * B[b + k*WB + tx];
}
[Code-comparison figure annotations: tx and ty are threadIdx.x and threadIdx.y. Both versions apply the same tiling; the shared-memory version keeps AS and BS on chip at the cost of extra data-movement instructions. In simulated cycles, the D-cache version is 112.2% slower.]
Dynamic instruction count: the shared-memory version executes 12.7% more instructions than the D-cache version.
Perfect L1 cache: the shared-memory version is still slightly (0.2%) faster than the D-cache version.
Micro-benchmark: is it possible for the D-cache version to outperform the shared-memory version?
Shared-memory version:

__global__ void Micro_sm(float *D1, float *D2, int iter, int initer, int stride) {
    int i, j;
    float temp = 0;
    __shared__ float sh[32];
    for (j = 0; j < iter; j++) {
        sh[threadIdx.x] = D1[threadIdx.x];  // stage the data in shared memory
        if (threadIdx.x == 0) {             // one warp per block: relies on warp-synchronous execution
            for (i = 0; i < initer; i++) {
                temp += sh[i];
            }
        }
        D1 += stride;
    }
    D2[0] = temp;
}
D-cache version:

__global__ void Micro_L1(float *D1, float *D2, int iter, int initer, int stride) {
    int i, j;
    float temp = 0;
    for (j = 0; j < iter; j++) {
        temp = D1[threadIdx.x];  // touch the data through the L1 D-cache
        if (threadIdx.x == 0) {
            for (i = 0; i < initer; i++) {
                temp += D1[i];
            }
        }
        D1 += stride;
    }
    D2[0] = temp;
}
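A minimal host-side driver for the two micro-kernels might look as follows (a sketch; the buffer size and the iter/initer/stride values are illustrative, not from the paper):

int iter = 1024, initer = 32, stride = 32;
float *d_D1, *d_D2;
cudaMalloc(&d_D1, (iter * stride + 32) * sizeof(float));  // covers every strided window
cudaMalloc(&d_D2, sizeof(float));
Micro_sm<<<1, 32>>>(d_D1, d_D2, iter, initer, stride);    // shared-memory variant
Micro_L1<<<1, 32>>>(d_D1, d_D2, iter, initer, stride);    // D-cache variant
cudaDeviceSynchronize();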
Micro-benchmark result: the D-cache version is 13.0% faster than the shared-memory version.
Cache configuration experiments:
1) Cache associativity, from 6-way set-associative to fully associative:
   - the miss rate drops from 3.59 MPKI to 1.79 MPKI;
   - performance improves by 12.8%, yet the D-cache version is still 88.7% slower than the shared-memory version.
2) Cache capacity, 16 KB vs. 48 KB: no impact.
What happened? The deciding factor is not the total number of misses, but how cache misses overlap with each other, i.e., memory-level parallelism (MLP). Consider the inner loop of the D-cache version:
for (int k = 0; k < BLOCK_SIZE; ++k) {
    Csub += A[a + WA*ty + k] * B[b + k*WB + tx];
}
[Figure: per-warp accesses of the D-cache version for a 16*16 thread block, i.e., 8 warps: warp 0 covers tx 0~15, ty 0~1; warp 1 covers ty 2~3; ...; warp 7 covers ty 14~15.]

At k = 0, the reference A[a + WA*ty + 0] touches 16 distinct rows of A across the 8 warps (16 cache misses), while every warp reads the same line B[b + 0*WB + 0~15] (1 cache miss). In total: 17 cache misses, all issued together.

At k = 1, the references A[a + WA*ty + 1] hit in the lines brought in at k = 0 (0 misses), and only the new line B[b + 1*WB + 0~15] misses (1 cache miss). In total: 1 cache miss, and the same holds for every subsequent k.

[Figure: execution timeline of the D-cache version. The "ld A, ld B" pair at k = 0 produces 17 overlapped misses; each "ld A, ld B" pair at k = 1, 2, 3, ..., 15 produces a single isolated miss. After the initial burst, one miss per iteration is exposed in full: low MLP.]
Now the shared-memory version:

AS(ty, tx) = A[a + WA * ty + tx]; BS(ty, tx) = B[b + WB * ty + tx]; __syncthreads();

[Figure: per-warp accesses of the shared-memory version. The tile load of A ("ld A") touches 16 distinct rows across the 8 warps (16 cache misses), and the tile load of B ("ld B") likewise touches 16 rows (16 cache misses).]

[Figure: execution timeline of the shared-memory version. The tile-load phase ("ld A, sts A, ld B, sts B") issues all 16 + 16 misses back to back, so they overlap with one another. After the barrier, the compute loop over k = 0 ... 15 runs entirely out of shared memory ("lds a, lds b, fma") with no cache misses.]
[Figure: side-by-side timelines.
D-cache version: "ld A, ld B" at k = 0 (17 overlapped misses), then one isolated miss at each of k = 1 ... 15: low MLP.
Shared-memory version: the tile load ("ld A, sts A, ld B, sts B") overlaps 16 + 16 misses (high MLP), and the "lds a, lds b, fma" loop is miss-free. Total cycles are reduced.]
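The MLP gap can be made concrete with a back-of-the-envelope model (illustrative only: the 400-cycle miss latency below is an assumed round number, not a measurement). If a burst of overlapped misses exposes roughly one miss latency $L$ while each isolated miss exposes a full $L$, then per tile:

\[
T_{\text{D-cache}} \approx \underbrace{L}_{k=0:\ 17\ \text{overlapped misses}} + \underbrace{15L}_{k=1..15:\ \text{one isolated miss each}} = 16L,
\qquad
T_{\text{shared}} \approx \underbrace{L}_{\text{tile load: } 16+16\ \text{overlapped misses}} = L.
\]

With $L \approx 400$ cycles, that is roughly 6400 vs. 400 cycles of exposed miss latency per tile, consistent with the simulated-cycle gap above.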
Short Summary
- Matrix multiplication (shared-memory version wins): higher memory-level parallelism for the shared-memory version, even though it executes more dynamic instructions.
- Fast Fourier Transform (shared-memory version wins): the D-cache version suffers from the write-evict policy and uncoalesced memory accesses.
- MarchingCubes (D-cache version wins): higher thread-level parallelism for the D-cache version (an illustrative calculation follows this list).
- PathFinder (D-cache version wins): more opportunities to store data in registers for the D-cache version.
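To see why shared-memory usage can cost thread-level parallelism, consider one illustrative occupancy bound (the 12 KB-per-block figure is assumed for the example, not taken from MarchingCubes):

\[
\text{concurrent thread blocks per core} \le \left\lfloor \frac{\text{shared memory per core}}{\text{shared memory per block}} \right\rfloor = \left\lfloor \frac{48\ \text{KB}}{12\ \text{KB}} \right\rfloor = 4 .
\]

A D-cache version that declares no shared memory is not bound by this limit, so more thread blocks, and thus more warps to hide memory latency, can run concurrently.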
Experimental Methodology
Experimental setup:
- Real hardware: GTX480 (Fermi) and GTX680 (Kepler).
- Simulator: GPGPU-Sim v3.2.1.

Simulator configuration:
Number of execution cores: 15; SIMD pipeline width: 16
Threads per core: 1536; registers per core: 32768
Shared memory per core: 16 KB / 48 KB; 32 banks; 3-cycle latency; 1 access per cycle
L1 D-cache per core: 16 KB (128 B line, 4-way assoc.) or 48 KB (128 B line, 6-way assoc.); 1-cycle hit latency
MSHR: 128 entries; constant cache per core: 8 KB
Texture cache per core: 12 KB, 128 B line, 24-way assoc.; L2 data cache: 768 KB, 128 B line, 16-way assoc.
Memory channels: 6; channel bandwidth: 8 bytes/cycle
DRAM clock: 1400 MHz; DRAM schedule queue: 16 entries, out of order (FR-FCFS)
Warp scheduling policy: greedy-then-oldest
Benchmarks: 16 GPGPU workloads covering typical on-chip memory usages.

#   Abbr.  Application                       Suite       Threads    Blocks      Input size
1   MM     matrix multiplication             NVIDIA SDK  (16, 16)   (16, 16)    256*256
2   HG     histogram                         AMD SDK     (64, 1)    (128, 1)    1024*1024
3   MT     matrix transpose                  NVIDIA SDK  (16, 16)   (64, 64)    1024*1024
4   FWT    fast Walsh transform              NVIDIA SDK  (512, 1)   (4096, 1)   8388608
5   LPS    3D Laplace solver                 GPGPU-Sim   (32, 4)    (4, 25)     100*100*100
6   NQU    N-Queens solver                   GPGPU-Sim   (96, 1)    (256, 1)    8
7   STO    StoreGPU                          GPGPU-Sim   (128, 1)   (384, 1)    196656
8   PF     Pathfinder                        Rodinia     (256, 1)   (19, 1)     4096*100*20
9   NW     Needleman-Wunsch                  Rodinia     (16, 1)    (64, 1)     1024*1024*10
10  BP     back propagation                  Rodinia     (16, 16)   (16, 16)    65536
11  SP     scalar product of vectors         NVIDIA SDK  (64, 1)    (256, 1)    8192*512
12  MV     matrix-vector multiplication      our own     (32, 1)    (64, 1)     2048*2048
13  FFT    fast Fourier transform            AMD SDK     (64, 1)    (4096, 1)   4k*1k
14  MC     MarchingCubes                     NVIDIA SDK  (32, 1)    (1024, 1)   32768
15  BLUR   image blur                        OpenCV      (256, 1)   (512, 2)    1K*1K
16  DWT    discrete wavelet transform (1D)   NVIDIA SDK  (512, 1)   (256, 1)    256k
Performance Impact: Fermi
[Chart: performance on the GTX480; annotated value: 55.7%.]
Performance Impact: Kepler
[Chart: performance on the GTX680.]
Performance Impact: GPGPU-Sim
[Chart: performance on GPGPU-Sim; annotated value: 27%.]
Energy Impact: GPUWattch
On average, the shared-memory versions consume 48.5% of the energy of the D-cache versions under the write-evict (WE) policy, 53.7% under write-back/write-allocate (WBWA), and 71.9% under WBWA with full associativity (FA).
Conclusion
- An in-depth study of the tradeoffs between software-managed caches and hardware-managed caches.
- Key reasons for shared-memory versions to outperform D-cache versions: memory-level parallelism and memory coalescing (bank-conflict-free accesses).
- D-cache versions mainly benefit from improved thread-level parallelism and from using registers to store variables.
- Shared-memory versions outperform D-cache versions and consume less energy for most of the benchmarks, justifying the software complexity of managing the shared memory.
Thank You! Questions?
Backup: prefetching in the D-cache version. Its complexity is similar to that of explicit data management, and the subtle MLP impact is not obvious, so engaging in this prefetching optimization is non-intuitive. Even with prefetching instructions, the D-cache code is still 8% slower than the shared-memory version.
for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
{
    prefetch(A[a + WA * ty + tx]);  // prefetch the tile of matrix A
    prefetch(B[b + WB * ty + tx]);  // prefetch the tile of matrix B
    __syncthreads();
#pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
    {
        Csub += A[a+WA*ty+k] * B[b+k*WB+tx];
        // tx = threadIdx.x and ty = threadIdx.y
    }
}
Figure: the D-cache version of matrix multiplication with prefetching instructions to improve MLP.
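prefetch() above is not a CUDA C built-in; one way to realize it is a small inline-PTX wrapper (a sketch, assuming a target that supports the PTX prefetch instruction; the paper does not say how its prefetch was implemented):

// Hint the hardware to bring the cache line holding p into the L1 D-cache.
__device__ __forceinline__ void prefetch_l1(const void *p) {
    asm volatile("prefetch.global.L1 [%0];" :: "l"(p));
}

// Usage matching the figure above:
//   prefetch_l1(&A[a + WA * ty + tx]);
//   prefetch_l1(&B[b + WB * ty + tx]);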