Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Chao Li (North Carolina State University), Yi Yang (NEC Labs), Hongwen Dai (North Carolina State University), Shenggen Yan (Institute of Software, CAS), Frank Mueller (North Carolina State University), Huiyang Zhou (North Carolina State University)


Page 1: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Chao Li (North Carolina State University)
Yi Yang (NEC Labs)
Hongwen Dai (North Carolina State University)
Shenggen Yan (Institute of Software, CAS)
Frank Mueller (North Carolina State University)
Huiyang Zhou (North Carolina State University)

Page 2: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Introduction

Two ways to manage on-chip caches effectively:
- Explicit software management: shared memory in GPUs; near memory in MIC KNL.
- Implicit hardware management: the L1 data cache (L1 D-cache).

Accelerators like GPUs provide an ideal platform for this comparison: both structures share the same hardware resources and are configured using the runtime API (see the sketch below).

Insight into two questions:
- Is it worthwhile for application developers to explicitly manage shared memory, given the existence of hardware-managed L1 D-caches in GPUs?
- What are the main reasons for code utilizing shared memory to outperform code leveraging L1 D-caches (and vice versa)?
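As a concrete illustration of the runtime API point above (ours, not from the slides): on Fermi/Kepler-class GPUs the same 64KB of on-chip SRAM backs both structures, and a program chooses the 16KB/48KB split at runtime. A minimal sketch, with a placeholder kernel:

#include <cuda_runtime.h>

__global__ void myKernel(float *data) {   // placeholder kernel
    data[threadIdx.x] *= 2.0f;
}

int main() {
    float *d;
    cudaMalloc(&d, 32 * sizeof(float));
    // Device-wide preference: 48KB shared memory / 16KB L1 per core.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);
    // Per-kernel override: prefer 48KB L1 / 16KB shared memory instead.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    myKernel<<<1, 32>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}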

Page 3: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Outline

Four detailed case studies:
- Matrix Multiplication
- FFT
- Marching Cubes
- Pathfinder

Benchmark categorization and experimental results

Conclusions

Page 4: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Matrix-Multiplication

[Figure: tiled matrix multiplication, Matrix C = A * B. One tile of A and one tile of B travel through the memory hierarchy: DRAM -> L2 Cache -> L1 D-Cache / Shared Memory -> PEs.]

Page 5: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Matrix-Multiplication

[Figure: the same tiled multiplication, with a tile of A and a tile of B staged from DRAM through the L2 cache into the L1 D-cache / shared memory feeding the PEs.]

Shared memory version vs. D-cache version:
1) Same tiling optimization.
2) Similar cache performance (95% hit ratio).
3) Same hardware, similar latency.

Surprisingly, the D-cache version is 43.8% slower than the shared memory version.

Page 6: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Matrix-Multiplication Code

Code                           | Time on GTX480 (ms) | Cycles on GPGPU-Sim
Software-managed cache version | 0.16                | 132844
Hardware-managed cache version | 0.23                | 281831

The hardware-managed (D-cache) version is 43.8% slower on the GTX480 and 112.2% slower in simulated cycles.

Configuration: 48KB capacity for both caches; matrix size 256*256; tile size 16*16; thread block (16,16); TB number: 5 for both versions.

Software-managed cache version (i.e., shared memory version):

for (each thread block) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
    AS(ty, tx) = A[a + WA * ty + tx];   // stage one tile of A on chip
    BS(ty, tx) = B[b + WB * ty + tx];   // stage one tile of B on chip
    __syncthreads();
    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
        Csub += AS(ty, k) * BS(k, tx);
    __syncthreads();
}

Hardware-managed cache version (i.e., D-cache version):

for (each thread block) {
    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
        Csub += A[a + WA * ty + k] * B[b + k * WB + tx];
}

Both versions use the same tiling; the shared memory version pays an explicit data-movement overhead to copy the tiles into the on-chip arrays AS and BS.
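For reference, a self-contained reconstruction of the two kernels, following the NVIDIA SDK matrixMul structure. BLOCK_SIZE = 16 and square row-major matrices match the slides' configuration; the function names and parameter spelling are our assumptions, not the slides' code.

#define BLOCK_SIZE 16

// Shared memory (software-managed cache) version: each block stages a
// 16x16 tile of A and of B on chip, then runs the multiply-accumulate loop.
__global__ void matrixMulShared(float *C, const float *A, const float *B,
                                int wA, int wB) {
    int tx = threadIdx.x, ty = threadIdx.y;
    int aBegin = wA * BLOCK_SIZE * blockIdx.y;   // first A tile for this block
    int aEnd   = aBegin + wA - 1;                // last A tile for this block
    int bBegin = BLOCK_SIZE * blockIdx.x;        // first B tile for this block
    float Csub = 0.0f;
    for (int a = aBegin, b = bBegin; a <= aEnd;
         a += BLOCK_SIZE, b += BLOCK_SIZE * wB) {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        As[ty][tx] = A[a + wA * ty + tx];        // ld A + sts A
        Bs[ty][tx] = B[b + wB * ty + tx];        // ld B + sts B
        __syncthreads();
        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];       // lds a, lds b, fma
        __syncthreads();
    }
    int c = wB * BLOCK_SIZE * blockIdx.y + BLOCK_SIZE * blockIdx.x;
    C[c + wB * ty + tx] = Csub;
}

// D-cache (hardware-managed cache) version: the same tiling loop, but the
// tiles are read from global memory through the L1 D-cache every iteration.
__global__ void matrixMulDcache(float *C, const float *A, const float *B,
                                int wA, int wB) {
    int tx = threadIdx.x, ty = threadIdx.y;
    int aBegin = wA * BLOCK_SIZE * blockIdx.y;
    int aEnd   = aBegin + wA - 1;
    float Csub = 0.0f;
    for (int a = aBegin, b = BLOCK_SIZE * blockIdx.x; a <= aEnd;
         a += BLOCK_SIZE, b += BLOCK_SIZE * wB) {
        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += A[a + wA * ty + k] * B[b + k * wB + tx];
    }
    int c = wB * BLOCK_SIZE * blockIdx.y + BLOCK_SIZE * blockIdx.x;
    C[c + wB * ty + tx] = Csub;
}

For the slides' 256*256 run: dim3 threads(16, 16); dim3 grid(16, 16); with wA = wB = 256.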

Page 7: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Dynamic instruction count: the shared-memory version executes 12.7% more instructions than the D-cache version.

Perfect L1 cache: the shared-memory version is still faster than the D-cache version (by 0.2%).

Micro-benchmark: is it possible for the D-cache version to outperform the shared-memory version?

Shared memory version:

__global__ void Micro_sm(float *D1, float *D2, int iter, int initer, int stride) {
    int i, j;
    float temp = 0;
    __shared__ float sh[32];                 // one element per thread; a single 32-thread warp is assumed
    for (j = 0; j < iter; j++) {
        sh[threadIdx.x] = D1[threadIdx.x];   // stage 32 floats into shared memory
        if (threadIdx.x == 0) {
            for (i = 0; i < initer; i++)
                temp += sh[i];               // thread 0 reduces from shared memory
        }
        D1 += stride;
    }
    D2[0] = temp;
}

D-cache version:

__global__ void Micro_L1(float *D1, float *D2, int iter, int initer, int stride) {
    int i, j;
    float temp = 0;
    for (j = 0; j < iter; j++) {
        temp = D1[threadIdx.x];              // touch the line to warm the L1 D-cache
        if (threadIdx.x == 0) {
            for (i = 0; i < initer; i++)
                temp += D1[i];               // thread 0 reduces through the D-cache
        }
        D1 += stride;
    }
    D2[0] = temp;
}
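A hypothetical host-side driver for the two micro-benchmarks. A single 32-thread block (one warp) matches sh[32] above; the iteration counts and stride are our assumptions, since the slides do not give them.

#include <cuda_runtime.h>

int main() {
    const int iter = 1024, initer = 32, stride = 32;
    float *D1, *D2;
    cudaMalloc(&D1, (iter * stride + 32) * sizeof(float));  // covers all strided windows
    cudaMalloc(&D2, sizeof(float));
    cudaMemset(D1, 0, (iter * stride + 32) * sizeof(float));
    Micro_sm<<<1, 32>>>(D1, D2, iter, initer, stride);      // shared memory version
    Micro_L1<<<1, 32>>>(D1, D2, iter, initer, stride);      // D-cache version
    cudaDeviceSynchronize();
    cudaFree(D1);
    cudaFree(D2);
    return 0;
}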

Page 8: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Dynamic instruction count: the shared-memory version executes 12.7% more instructions than the D-cache version.

Perfect L1 cache: the shared-memory version is still faster than the D-cache version (by 0.2%).

Micro-benchmark: the D-cache version is 13.0% faster than the shared-memory version.

Cache configuration:
1) Cache associativity, from 6-way set-associative to fully associative:
   - Miss rate drops from 3.59 to 1.79 MPKI (misses per kilo-instruction).
   - Performance improves by 12.8%, yet remains 88.7% slower than the shared-memory version.
2) Cache capacity, 16KB vs. 48KB: no impact.

What happened? It is not the total number of misses, but how cache misses overlap with each other, i.e., memory-level parallelism (MLP).

Page 9: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

The D-cache version's inner loop, which the next slides analyze access by access:

for (int k = 0; k < BLOCK_SIZE; ++k) {
    Csub += A[a + WA * ty + k] * B[b + k * WB + tx];
}

Page 10: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

K = 0: the accesses A[a + WA*ty + k] and B[b + k*WB + tx] across warps.

[Figure: per-warp accesses at K=0. Warp 0 (Tx: 0~15, Ty: 0~1) loads A[a+WA*0+0] and A[a+WA*1+0]; warp 1 (Ty: 2~3) loads A[a+WA*2+0] and A[a+WA*3+0]; ...; warp 7 (Ty: 14~15) loads A[a+WA*14+0] and A[a+WA*15+0]. Each of these 16 rows of A starts a new cache line, so all 16 loads miss. Every warp also loads B[b+0*WB+0~15], a single line that misses once and then hits.]

16 cache misses (A) + 1 cache miss (B) = 17 cache misses.

Page 11: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

K = 1: the A accesses A[a+WA*0+1] through A[a+WA*15+1] fall in the cache lines already fetched at K=0, so all of them hit. Each warp's B access, B[b+1*WB+0~15], touches one new line.

0 cache misses (A) + 1 cache miss (B) = 1 cache miss.

Timeline so far: ld A, ld B at K=0 cost 17 misses; K=1 costs 1 miss.

Page 12: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

[Figure: the same per-warp diagram, with the timeline extended to K=2. K=0 (ld A, ld B) costs 17 misses; K=1 costs 1 miss; and each iteration's ld A / ld B pair issues only after the previous iteration's data returns.]

Page 13: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

[Figure: timeline of the D-cache version: K=0 (ld A, ld B) -> 17 misses; K=1 -> 1 miss; K=2 -> 1 miss, each issued back to back.]

Page 14: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

[Figure: the full timeline, K=0 through K=15: 17 misses at K=0, then 1 miss per iteration for K=1..15. Because each iteration's miss waits behind the previous iteration's data, these 15 misses are serviced one at a time.]
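Tallying the D-cache timeline above: 17 + 15 * 1 = 32 misses per tile, but only the first 17 overlap; the remaining 15 are serviced one at a time, each stalling its iteration.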

Page 15: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

AS(ty, tx) = A[a + WA * ty + tx];
BS(ty, tx) = B[b + WB * ty + tx];
__syncthreads();

[Figure: in the shared-memory version, the ld A instructions of all eight warps issue back to back: warp 0 (Tx: 0~15, Ty: 0~1) loads A[a+WA*0+0] and A[a+WA*1+0], warp 1 loads A[a+WA*2+0] and A[a+WA*3+0], ..., warp 7 loads A[a+WA*14+0] and A[a+WA*15+0].]

16 cache misses, all outstanding at once.

Page 16: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

AS(ty, tx) = A[a + WA * ty + tx];
BS(ty, tx) = B[b + WB * ty + tx];
__syncthreads();

[Figure: likewise, the ld B instructions of all eight warps issue back to back: warp 0 loads B[b+WB*0+0~15] and B[b+WB*1+0~15], ..., warp 7 loads B[b+WB*14+0~15] and B[b+WB*15+0~15].]

Another 16 cache misses, all outstanding at once. Instruction stream so far: ld A (16 misses), sts A, ld B (16 misses).

Page 17: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

[Figure: the complete shared-memory timeline: ld A (16 overlapped misses), sts A, ld B (16 overlapped misses), sync; then K=0 through K=15 each execute lds a, lds b, fma entirely out of shared memory, with no further cache misses.]
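Mapping that instruction stream back to the kernel source (our annotation; the sts B step is implied even though the slide labels only sts A):

AS(ty, tx) = A[a + WA * ty + tx];   // ld A (global load) + sts A (shared store)
BS(ty, tx) = B[b + WB * ty + tx];   // ld B (global load) + sts B (shared store)
__syncthreads();                    // sync
for (int k = 0; k < BLOCK_SIZE; ++k)
    Csub += AS(ty, k) * BS(k, tx);  // per iteration: lds a, lds b, fma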

Page 18: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

[Figure: side-by-side timelines.]

D-cache version: K=0 (ld A, ld B) incurs 17 overlapped misses (high MLP); K=1 through K=15 each issue ld A, ld B and incur 1 miss, serialized one after another (low MLP).

Shared-memory version: ld A (16 overlapped misses), sts A, ld B (16 overlapped misses), all high MLP; then K=0 through K=15 execute lds a, lds b, fma with no misses. Cycles are reduced.

Both versions take 32 misses per tile (17 + 15 vs. 16 + 16); the shared-memory version wins because its misses overlap.

Page 19: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Short Summary

- Matrix Multiplication (shared memory version wins): higher memory-level parallelism for the shared memory version; more dynamic instructions for the shared memory version.
- Fast Fourier Transform (shared memory version wins): the write-evict policy and uncoalesced memory accesses hurt the D-cache version.
- MarchingCubes (D-cache version wins): higher thread-level parallelism for the D-cache version.
- PathFinder (D-cache version wins): more opportunities to keep data in registers for the D-cache version (see the sketch below).
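An illustrative sketch of the register point (ours, not from the slides): when each thread reuses only its own element, staging through shared memory adds instructions and a synchronization that a plain register version avoids.

// Shared-memory staging with no actual cross-thread sharing.
__global__ void scale_sm(const float *in, float *out, float s) {
    __shared__ float buf[256];                      // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = in[i];                       // ld + sts
    __syncthreads();                                // barrier that buys nothing here
    out[i] = buf[threadIdx.x] * s;                  // lds + fmul + st
}

// D-cache version: the compiler keeps v in a register.
__global__ void scale_reg(const float *in, float *out, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[i];                                // ld straight into a register
    out[i] = v * s;                                 // fmul + st
}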

Page 20: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Experimental Methodology

Experimental setup:
- Real hardware: GTX480 (Fermi) and GTX680 (Kepler).
- Simulator: GPGPU-Sim v3.2.1.

Simulator configuration:

# of execution cores      | 15
SIMD pipeline width       | 16
Threads per core          | 1536
Registers per core        | 32768
Shared memory per core    | 16KB/48KB; 32 banks; 3-cycle latency; 1 access per cycle
L1 D-cache per core       | 16KB: 128B line, 4-way assoc; 48KB: 128B line, 6-way assoc; 1-cycle hit latency
MSHR                      | 128 entries
Constant cache per core   | 8KB
Texture cache per core    | 12KB, 128B line, 24-way assoc
L2 cache                  | 768KB: 128B line, 16-way assoc
Memory channels           | 6
Memory channel bandwidth  | 8 bytes/cycle
DRAM clock                | 1400 MHz
DRAM scheduling queue     | 16 entries, out of order (FR-FCFS)
Warp scheduling policy    | greedy-then-oldest (GTO)

Page 21: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Benchmarks: 16 GPGPU workloads covering typical on-chip memory usages.

#  | Abbr. | Application                     | Suite      | Threads | Blocks   | Input size
1  | MM    | matrix multiplication           | NVIDIA SDK | (16,16) | (16,16)  | 256*256
2  | HG    | histogram                       | AMD SDK    | (64,1)  | (128,1)  | 1024*1024
3  | MT    | matrix transpose                | NVIDIA SDK | (16,16) | (64,64)  | 1024*1024
4  | FWT   | fast Walsh transform            | NVIDIA SDK | (512,1) | (4096,1) | 8388608
5  | LPS   | 3D Laplace solver               | GPGPU-Sim  | (32,4)  | (4,25)   | 100*100*100
6  | NQU   | N-Queens solver                 | GPGPU-Sim  | (96,1)  | (256,1)  | 8
7  | STO   | StoreGPU                        | GPGPU-Sim  | (128,1) | (384,1)  | 196656
8  | PF    | PathFinder                      | Rodinia    | (256,1) | (19,1)   | 4096*100*20
9  | NW    | Needleman-Wunsch                | Rodinia    | (16,1)  | (64,1)   | 1024*1024*10
10 | BP    | back propagation                | Rodinia    | (16,16) | (16,16)  | 65536
11 | SP    | scalar product of vectors       | NVIDIA SDK | (64,1)  | (256,1)  | 8192*512
12 | MV    | matrix-vector multiplication    | our own    | (32,1)  | (64,1)   | 2048*2048
13 | FFT   | fast Fourier transform          | AMD SDK    | (64,1)  | (4096,1) | 4K*1K
14 | MC    | MarchingCubes                   | NVIDIA SDK | (32,1)  | (1024,1) | 32768
15 | BLUR  | image blur                      | OpenCV     | (256,1) | (2,512)  | 1K*1K
16 | DWT   | discrete wavelet transform (1D) | NVIDIA SDK | (512,1) | (256,1)  | 256K

Page 22: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Performance Impact: Fermi

[Figure: performance on the GTX480 across the benchmarks; annotated value: 55.7%.]

Page 23: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Performance Impact: Kepler

[Figure: performance on the GTX680 across the benchmarks.]

Page 24: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Performance Impact: GPGPU-Sim

[Figure: performance on GPGPU-Sim across the benchmarks; annotated value: 27%.]


Page 26: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Energy Impact: GPUWattch

On average, shared memory versions consume 48.5% of the energy of the D-cache versions under the write-evict (WE) policy, 53.7% under write-back write-allocate (WBWA), and 71.9% under WBWA with a fully associative (FA) cache.

Page 27: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Conclusion

An in-depth study of the tradeoffs between software-managed caches and hardware-managed caches.

Key reasons for shared memory versions to outperform D-cache versions:
- Memory-level parallelism
- Memory coalescing (bank-conflict-free accesses)

D-cache versions mainly benefit from:
- Improved thread-level parallelism
- Using registers to store variables

Shared memory versions outperform D-cache versions and consume less energy for most of the benchmarks, justifying the software complexity of managing shared memory.

Page 28: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Thank You! Questions?

Page 29: Understanding the Tradeoffs between Software-Managed vs. Hardware-Managed Caches in GPUs

Prefetching in the D-cache version:
- Similar complexity to explicit data management.
- The MLP impact is subtle and non-obvious, making it non-intuitive to engage in a prefetching optimization.
- Even with prefetching instructions, the code is still 8% slower than the shared memory version.

for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
    prefetch(A[a + WA * ty + tx]);   // prefetch the tile of matrix A
    prefetch(B[b + WB * ty + tx]);   // prefetch the tile of matrix B
    __syncthreads();
    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k) {
        Csub += A[a + WA * ty + k] * B[b + k * WB + tx];
        // tx = threadIdx.x and ty = threadIdx.y
    }
}

Figure: the D-cache version of matrix multiplication with prefetching instructions to improve MLP.
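The prefetch() call above is shorthand; the slides do not show its implementation. One plausible realization (an assumption on our part) is an inline-PTX wrapper around the PTX prefetch instruction:

// Hypothetical helper: ask for a global-memory line to be brought into L1.
__device__ __forceinline__ void prefetch(const float &ref) {
    const float *p = &ref;
    asm volatile("prefetch.global.L1 [%0];" :: "l"(p));
}

Issued as prefetch(A[a + WA * ty + tx]) by every thread, the tile's miss requests start early and overlap, approximating the MLP that the shared-memory version obtains by construction.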