TRANSCRIPT
Automatic Data Placement Into GPU On-Chip Memory Resources
Chao Li (North Carolina State University), Yi Yang (NEC Labs America),
Zhen Lin (North Carolina State University), Huiyang Zhou (North Carolina State University)
www.nec-labs.com
Introduction

GPUs rely on thread-level parallelism to hide off-chip latency, but judicious utilization of on-chip memory resources remains critical to performance. Off-chip memory bandwidth is still a bottleneck, e.g., for big-data applications and deep learning on GPUs.

Two key challenges:
- Explicitly managing the intricate on-chip resources;
- Performance portability across different GPU generations.

Our solution: automatic data placement into GPU memory resources.
- Compiler-driven automatic data placement;
- Focus on programs that have been reasonably optimized;
- Revise data placement to achieve both performance enhancement and performance portability.
Explicit Resource Management

Three types of on-chip memory resources: registers, shared memory, and L1 D-caches.

Different capacity and allocation restrictions:
- Large register file, small cache;
- 64 registers per thread, 48KB shared memory per thread block, no limit on D-cache usage.

Different accessibility:
- Register file: within threads (within warps in Kepler);
- Shared memory: within a thread block;
- D-cache: thread blocks on the same SM.

Different performance characteristics:
- Register file: highest bandwidth;
- Shared memory: high bandwidth with fixed data access latency;
- D-cache: high bandwidth with variable access latency.
Performance Portability

GPUs evolve at a very fast pace. Computation throughput has increased faster than off-chip bandwidth: the GFLOPS/GB ratio grew from roughly 10X to 15X (GTX 8800 -> GTX 680). Register file and D-cache/shared memory sizes have also been changing across generations.

                                 G80        GT200     FERMI     KEPLER     KEPLER
                                 (GTX 8800) (GTX 280) (GTX 480) (GTX 680)  (K20c)
Arithmetic throughput (GFLOPS)   504        933       1345      3090       3950
Memory bandwidth (GB/s)          57         141       177       192        250
Shared memory size (KB)          16         16        48        48         48
Register file size (KB)          32         64        128       256        256
Our Solution

A compiler algorithm for automatic data placement:
- Analyze possible data placement patterns;
- Construct compiler algorithms to use the profitable patterns.
Data (re)Placement

Data movement from one on-chip resource to another to achieve optimal resource utilization.

[Figure: placement directions, numbered 1-6, among register variables, shared memory variables, and local/global variables in L1 D-caches.]

Data (re)placement patterns:
- Direction 6: compiler determined, e.g., A[B[0]], B[12];
- Directions 4 and 5: covered by previous works on specific optimizations, requiring significant code changes; the trend of GPU evolution is toward larger register files;
- Directions 1, 2, and 3: our focus.
Pattern 1: Shared Memory to Registers

Three reasons:
- Shared memory usage may limit the TLP;
- Shared memory has longer access latency and lower bandwidth;
- Accessing shared memory incurs instruction overhead for address computation.

Promotion strategy with multiple promotable shared memory variables: reference count-based priority.

Baseline:

__global__ void dynproc_kernel(...) {
  __shared__ float prev[256];
  __shared__ float result[256];
  int tx = threadIdx.x;
  for (int i = 0; i < iteration; i++) {
    ...
    shortest = minum(prev[tx-1], prev[tx], prev[tx+1]);
    result[tx] = shortest + gpuWall[index];
    __syncthreads();
    prev[tx] = result[tx];
    __syncthreads();
  }
  gpuResults[xidx] = result[tx];
}

Optimized code (result promoted to a register):

__global__ void dynproc_kernel(...) {
  __shared__ float prev[256];
  float result;
  int tx = threadIdx.x;
  for (int i = 0; i < iteration; i++) {
    ...
    shortest = minum(prev[tx-1], prev[tx], prev[tx+1]);
    result = shortest + gpuWall[index];
    __syncthreads();
    prev[tx] = result;
    __syncthreads();
  }
  gpuResults[xidx] = result;
}
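The reference count-based priority mentioned above can be sketched on the host side as a simple sort of promotion candidates. This is an illustrative model, not the actual Cetus pass: the struct and field names (SharedArray, access_count) are assumptions.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical record for a promotable shared memory array; only the
// statically collected reference count drives the priority.
struct SharedArray {
    std::string name;
    int access_count;  // static memory-reference count collected by the pass
};

// Order candidates so the most frequently referenced array is promoted
// first, mirroring the reference count-based priority on the slide.
std::vector<SharedArray> promotion_order(std::vector<SharedArray> arrays) {
    std::stable_sort(arrays.begin(), arrays.end(),
                     [](const SharedArray& a, const SharedArray& b) {
                         return a.access_count > b.access_count;
                     });
    return arrays;
}
```

For the dynproc_kernel above, prev (four references per iteration) would be considered before result (three references), so the compiler first tries the array whose promotion removes the most shared-memory traffic.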
Pattern 2: Shared Memory to L1 D-caches

Three reasons:
- Shared memory usage may limit the TLP, but the variables cannot be promoted to registers;
- Local/global memory implicitly utilizes the L1 D-cache to achieve high performance;
- Communication among threads can also be ensured by global memory variables.

To balance the tradeoff between TLP and memory pressure, auto-tuning is employed to determine:
- Which variables to promote;
- Whether to promote into global or local memory.

Baseline:

__global__ void generateTriangles(...) {
  __shared__ float3 vertlist[12*NTHREADS]; // 12*32
  __shared__ float3 normlist[12*NTHREADS];
  // defines to the shared memory arrays
  vertexInterp2(..., vertlist[threadIdx.x], normlist[threadIdx.x]);
  vertexInterp2(..., vertlist[threadIdx.x+NTHREADS], normlist[threadIdx.x+NTHREADS]);
  ...
  edge = tex1Dfetch(triTex, ...);
  // uses of the shared memory arrays
  pos[index] = make_float4(vertlist[(edge*NTHREADS)+threadIdx.x], 1.0f);
  ...
}

Optimized code (promoted to local memory):

__global__ void generateTriangles(...) {
  float3 vertlist[12];
  float3 normlist[12];
  // defines to the local memory arrays
  vertexInterp2(..., vertlist[0], normlist[0]);
  vertexInterp2(..., vertlist[1], normlist[1]);
  ...
  edge = tex1Dfetch(triTex, ...);
  // uses of the local memory arrays
  pos[index] = make_float4(vertlist[edge], 1.0f);
  ...
}
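The index rewrite behind Pattern 2 can be checked with a small host-side sketch: in the baseline, slot k of thread tx lives at vertlist[tx + k*NTHREADS], and since no thread ever touches another thread's slots, the per-thread slice can become a private local array indexed by k alone. The function names below are illustrative.

```cpp
// Minimal sketch of the Pattern 2 index rewrite. NTHREADS is the
// thread-block size from the slide (12*32 elements in the shared array).
constexpr int NTHREADS = 32;

// Original shared-memory index for thread tx, slot k.
int shared_index(int tx, int k) { return tx + k * NTHREADS; }

// Rewritten local-memory index: the thread id drops out entirely,
// which is exactly why the array can be made thread-private.
int local_index(int /*tx*/, int k) { return k; }
```

The rewrite is legal because distinct threads never alias: for a fixed k, shared_index(tx, k) differs for every tx, so each thread's 12 slots are disjoint from every other thread's.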
Pattern 3: Shared Memory/D-cache to Registers to Achieve Register Tiling

Two reasons:
- A common side effect of SPMD: redundant computations and memory accesses;
- Redundant shared/global memory usage can be converted into register usage.

Three ways to save bandwidth:
- Implicitly through the L1 D-cache: cache hits, but the data may be evicted by other accesses;
- Shared memory: select only one warp for the loading task, at the cost of additional control flow and __syncthreads();
- Register file: not shared among warps, so compact warps of threads first; introduce C_Factor for the best register tiling.

Baseline:

__global__ void srad_kernel(float *c_cuda, ...) {
  int index_s = cols * BLOCK_SIZE * by + BLOCK_SIZE * bx
              + cols * BLOCK_SIZE + tx; // BLOCK_SIZE = 16
  __shared__ float south_c[BLOCK_SIZE][BLOCK_SIZE];
  ...
  south_c[ty][tx] = c_cuda[index_s];
  if (by == gridDim.y - 1) south_c[ty][tx] = ...
  __syncthreads();
  ...
}

Optimized code (register tiling with thread compaction):

__global__ void srad_kernel(float *c_cuda, ...) {
  int index_s = cols * BLOCK_SIZE * by + BLOCK_SIZE * bx
              + cols * BLOCK_SIZE + tx; // BLOCK_SIZE = 16
  __shared__ float south_c[BLOCK_SIZE][BLOCK_SIZE];
  ...
  float tmp_1 = c_cuda[index_s];
  #pragma unroll
  for (int m = 0; m < C_Factor; m++)
    south_c[ty + m*blockDim.y/C_Factor][tx] = tmp_1;
  if (by == gridDim.y - 1) south_c[ty][tx] = ...
  __syncthreads();
  ...
}
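A back-of-the-envelope accounting of the bandwidth saving from this pattern: with compaction factor C_Factor, C_Factor threads are merged into one, so a load that was redundant across those threads is issued once and reused from a register. The sketch below assumes, purely for illustration, one such redundant global load per original thread.

```cpp
// Sketch of the bandwidth saving from register tiling via thread
// compaction, assuming one redundant global load per original thread
// that is uniform across the compacted dimension of the block.
struct LoadCount {
    int global_loads;    // global-memory loads issued
    int register_reuses; // reuses served from a register instead
};

LoadCount tile_loads(int block_threads, int c_factor) {
    // After compaction, block_threads / c_factor threads remain; each
    // performs the load once and reuses the value c_factor - 1 times.
    int remaining = block_threads / c_factor;
    return {remaining, remaining * (c_factor - 1)};
}
```

For a 256-thread block with C_Factor = 4, the 256 redundant loads shrink to 64, with the other 192 accesses served from registers; the flip side, as the slide notes, is higher register pressure, which is why C_Factor is auto-tuned.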
Our Solution

- Analyze possible data placement patterns;
- Compiler algorithms to utilize the profitable patterns.
Compiler Algorithms

Compiler pass 1 (patterns 1 & 2) and compiler pass 2 (pattern 3) each go through the same three stages:
- Identification stage: scan and generate a list of candidate variables by collecting the architecture features and analyzing memory access behavior;
- Processing stage: implement the placement patterns;
- Auto-tuning stage: construct the search space, decide which variables to process, and achieve the best code generation.
Compiler Pass 1

Input: kernel.

Identification stage: identify and collect the shared memory variables; analyze memory access behavior, recording for each variable (a) whether the access index is decided at runtime, (b) whether the access is shared across threads, along with memory reference counts and allocation sizes. This yields the candidate variable list.

Processing stage: for each candidate variable in the list, in priority order:
- !(a) && !(b): promote to the register file;
- (a) && !(b): promote to local memory;
- (b): promote to global memory.

Auto-tuning stage: generate a new kernel and auto-tune for the optimal kernel.
Compiler Pass 2

Input: kernel.

Identification stage: identify and analyze the access behavior of global and shared memory variables; check for redundancy along the x or y dimension and generate the redundancy type; collect the expressions whose indices feature redundancy into an expr list.

Processing stage: adjust the thread block dimension for each different C_Factor; construct an unroll-able loop for thread compaction/coarsening/merging; the expressions in the expr list are then performed once (i.e., with no redundancy).

Auto-tuning stage: generate a new kernel and auto-tune for the optimal kernel.
Auto-Tuning

Auto-tuning steps:
- Construct a search space based on the tunable parameters;
- Measure the execution time;
- Select the best performing code variant for the target architecture.

Three search spaces constructed for data placement:
- How many and which shared memory variables should be promoted into the register file;
- Which shared memory variables to promote into local/global memory;
- The compaction factor.

Search space pruning strategies:
- Memory reference count-based priority;
- Allocation size-based priority;
- Limit the compaction factor to powers of 2.
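One way to picture the effect of the pruning strategies above: with candidates tried only in priority order, n promotable arrays give n + 1 prefix choices instead of 2^n subsets, and the compaction factor dimension collapses to powers of two. A minimal sketch (function names are mine, not from the paper):

```cpp
#include <vector>

// Number of promotion choices when candidates are tried only in priority
// order: promote the top 0, 1, ..., n arrays, i.e. n + 1 options rather
// than all 2^n subsets of candidates.
int promotion_choices(int n_candidates) { return n_candidates + 1; }

// Compaction factors restricted to powers of two, as on the slide.
std::vector<int> compaction_factors(int max_factor) {
    std::vector<int> fs;
    for (int f = 2; f <= max_factor; f *= 2) fs.push_back(f);
    return fs;
}
```

With 4 candidate arrays and a maximum factor of 16, this yields 5 promotion choices and the factor set {2, 4, 8, 16}, small enough that every point can be timed directly.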
Preprocessor

Memory access index regulation: the index must be an affine function of the thread index; the scaling factors may be macro/constant variables, kernel launch parameters, or run-time parameters.

Dynamic loop bounds: let the user provide the information through profiling, or use a simple heuristic (a default loop count of 4).

Collect data structure declarations and annotate data types:
- int2, float4 vector types: processed the same as int, float;
- User-defined struct types: identified separately.
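The affine-index requirement above means index(tid) = a*tid + b with a and b fixed at launch. A black-box way to sketch the check is to sample the index expression at three thread ids and test linearity; a real compiler pass would of course decide this symbolically on the AST, so this is only an illustration.

```cpp
#include <functional>

// Sketch of the affine-index test: index(tid) is affine iff its second
// difference vanishes, i.e. index(1)-index(0) == index(2)-index(1).
bool looks_affine(const std::function<long(long)>& index) {
    long d1 = index(1) - index(0);
    long d2 = index(2) - index(1);
    return d1 == d2;
}
```

An index like 4*tid + 7 passes the check, while tid*tid does not; only accesses in the first category are regulated by the preprocessor.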
Experimental Methodology

Implementation:
- Implemented in Cetus, a source-to-source framework;
- Basic CUDA syntax support from MCUDA.

Evaluation environment: three GPU generations with all possible D-cache and shared memory capacity configurations.

Parameter                           GTX480                    GTX680                                 K20c
<Shared memory, L1 D-cache size>    <16kB,48kB>, <48kB,16kB>  <16kB,48kB>, <32kB,32kB>, <48kB,16kB>  <16kB,48kB>, <32kB,32kB>, <48kB,16kB>
Register file size                  128kB                     256kB                                  256kB
Max number of threads per SM        512                       1024                                   1536
Max number of registers per thread  64                        64                                     256
Compaction factor                   2,4,8,16                  2,4,8,16                               2,4,8,16
Benchmarks

Shared memory allocation size is defined by the programmer; initial register allocation is controlled statically by the compiler and architecture parameters.

                                        GTX480       GTX680       K20C
Benchmark              Input            reg   smem   reg   smem   reg   smem
HotSpot (HS)           height 2         35    3072   36    3072   39    3072
Back Prop1 (BP1)       65536 layer      13    1088   11    1088   12    1088
Back Prop2 (BP2)       65536 layer      22    0      20    0      21    0
SRAD1 (SR1)            2048*2048        20    0      20    0      26    0
SRAD2 (SR2)            2048*2048        19    0      20    0      20    0
Matrix Multiply (MM)   2048*2048        23    8192   26    8192   25    8192
Path Finder (PF)       409600 steps     16    2048   18    2048   17    2048
N-Queue (NQU)          N=8              15    15744  19    15744  16    15744
Marching Cubes (MC)    32768 voxels     63    9216   63    9216   76    9216
B+tree1 (BT1)          qrSize=6000      18    0      19    0      21    0
B+tree2 (BT2)          qrSize=6000      23    0      28    0      30    0
Lu-Decompose (LUD)     2048.dat         15    2048   17    2048   17    2048
Performance Gains from Automatic Data Placement

[Figure: speedup bar chart for HS, BP1, BP2, SR1, SR2, MM, PF, NQU, MC, BT1, BT2, LUD, and the geometric mean (GMEAN), on GTX480, GTX680, and K20c.]

Measurement:
- Baseline: select the best configuration for the original kernel by trying all different shared memory/L1 D-cache sizes;
- For each device, generate the kernel with the optimal data placement choices.

Results: GTX480: up to 4.14X, average of 1.76X; GTX680: up to 3.30X, average of 1.61X; K20C: up to 2.44X, average of 1.48X.
Optimal Parameters (the number of shared memory arrays to be promoted, or the C_Factor) for Different GPUs

[Figure: optimal search space parameter per benchmark for GTX480, GTX680, and K20c.]

Performance portability:
- Our compiler intelligently generates the optimized kernel for the specific architecture;
- The different architectural features of these GPUs lead to different optimal parameters.
Auto-Tuning

Effective pruning: the search space is reduced significantly, and the performance of the optimized kernel is not impacted.

       Original search space   Pruned search space
HS     48                      8
BP1    16                      3
BP2    16                      4
SR1    16                      5
SR2    16                      5
MM     32                      5
PF     1                       1
NQU    45                      12
MC     9                       6
BT1    3                       3
BT2    3                       3
LUD    16                      4

Auto-tuning time (ms):

HS      BP1     BP2     SR1     SR2     MM       PF     NQU    MC      BT1     BT2     LUD
42.873  11.361  15.755  24.133  21.941  210.876  8.884  8.124  23.986  12.183  14.343  129.531

The resulting auto-tuning time is small.
Conclusions

GPUs have been widely used for general-purpose computation:
- Achieving high performance is not easy; one of the reasons is the intricate on-chip memory resources;
- Manually tuned code for one device may not perform well on a new device.

We propose compiler-driven automatic data placement as our solution:
- Our compiler algorithm refines GPU programs by altering data placement to achieve both performance enhancement and performance portability;
- We show that on different GPU devices, kernels optimized with our compiler algorithm achieve significant performance improvements.
Effectiveness Breakdown

[Figure: performance speedup of HS, PF, NQU, and MC when promoting 1, 2, 3, or 4 shared memory arrays; and of HS, BP1, BP2, SR1, SR2, MM, BT1, BT2, and LUD with C_Factor = 2, 4, 8, and 16.]
Impact of Input Sizes (Marching Cubes)

[Figure: speedup over input sizes from 8k to 512k voxels.]

Problem input size impact:
- The optimized code generation for on-chip data placement is generally input agnostic;
- Larger inputs tend to show higher benefit.
Compiler Pass 1

Kernel shared_to_register_or_local_or_global(Kernel kernel) {
  Kernel best_kernel = kernel;
  float exe_time = eval(kernel); // collect the execution time of kernel

  /** Identification Stage **/
  List arrays;
  for (each shared memory array sma in kernel) {
    sma.is_overlap = false; sma.is_index = false;
    sma.access_count = 0; sma.size = allocation_size;
    for (each access acc of array sma) {
      sma.access_count += (acc in loop) ? loop_count : 1;
      if (acc is overlapped across threads)
        sma.is_overlap = true;
      else if (the address of acc is calculated at runtime)
        sma.is_index = true;
    }
    if (sma.access_count > 0) { arrays.add(sma); }
  } // end for

  while (arrays is not empty) {
    /** Processing Stage **/
    sma = array with the largest access_count in arrays; pop it out;
    if (!sma.is_index and !sma.is_overlap)
      replace sma with register file;
    else if (sma.is_index and !sma.is_overlap)
      replace sma with local memory;
    else
      replace sma with global memory;

    /** Auto-tuning Stage **/
    generate a new kernel nkernel;
    exe_time1 = eval(nkernel); // the execution time of nkernel
    if (exe_time1 < exe_time) { // the new kernel is better
      best_kernel = nkernel;
      exe_time = exe_time1;
    } else
      return best_kernel; // found the best kernel
  } // end while
  return best_kernel;
}
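The three-way decision at the heart of the processing stage above is small enough to restate as runnable code. The enum and function names below are mine; only the two flags and the branch structure come from the pseudocode.

```cpp
enum class Placement { Register, LocalMemory, GlobalMemory };

// Three-way decision from Compiler Pass 1: arrays with compile-time
// indices and no cross-thread sharing go to registers; runtime-indexed
// but thread-private arrays go to local memory (cached in L1); anything
// shared across threads falls back to global memory, which still
// preserves inter-thread communication.
Placement classify(bool is_index, bool is_overlap) {
    if (!is_index && !is_overlap) return Placement::Register;
    if (is_index && !is_overlap)  return Placement::LocalMemory;
    return Placement::GlobalMemory;
}
```

For example, the result array of dynproc_kernel (compile-time index, private) classifies as Register, while vertlist in generateTriangles (runtime index via edge, private) classifies as LocalMemory.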
Compiler Pass 2

Kernel shared_or_global_to_register_tiling(Kernel kernel) {
  Kernel best_kernel = kernel;
  float exe_time = eval(kernel); // collect the execution time of kernel

  /** Identification Stage **/
  List exprs;
  bool is_redundant_1d = false, is_redundant_2d = false;
  for (each shared/global memory array sma in kernel) {
    for (each access acc of array sma in expression expr) {
      if (acc is independent of one thread dimension)
        { is_redundant_1d = true; exprs.add(expr); }
      if (is_redundant_1d && acc is independent of the other
          thread dimension in expression expr)
        { is_redundant_2d = true; exprs.add(expr); }
    }
  } // end for

  for (each C_Factor in the search space) {
    /** Processing Stage **/
    adjust the thread block dimension;
    if (is_redundant_1d) {
      construct a one-level loop with loop bound C_Factor to
        perform the workload of the compacted threads;
      convert each expr in exprs from inter-thread memory
        usage into register usage;
    } else if (is_redundant_2d) {
      construct a two-level loop with loop bounds C_Factor.x
        and C_Factor.y to perform the workload of the compacted threads;
      convert each expr in exprs from inter-thread memory
        usage into register usage;
    }

    /** Auto-tuning Stage **/
    generate a new kernel nkernel;
    exe_time1 = eval(nkernel); // the execution time of nkernel
    if (exe_time1 < exe_time) { // the new kernel is better
      best_kernel = nkernel;
      exe_time = exe_time1;
    } else
      return best_kernel; // found the best kernel
  } // end for
  return best_kernel;
}
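The redundancy test in the identification stage of pass 2 asks whether an access's address is independent of one thread dimension. A real pass decides this on the index expression's AST; the sketch below merely probes sample points per dimension to illustrate the property being checked.

```cpp
#include <functional>

// Illustrative redundancy test from Compiler Pass 2: an address
// expression addr(tx, ty) is redundant along x (resp. y) if it does
// not depend on tx (resp. ty), so the threads along that dimension
// all load the same value.
bool redundant_along_x(const std::function<long(long, long)>& addr) {
    return addr(0, 1) == addr(1, 1);  // vary tx, hold ty fixed
}
bool redundant_along_y(const std::function<long(long, long)>& addr) {
    return addr(1, 0) == addr(1, 1);  // vary ty, hold tx fixed
}
```

In the SRAD example, index_s does not involve ty, so the load c_cuda[index_s] is redundant along the y dimension; that is exactly the dimension pass 2 compacts by C_Factor.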
Our compiler algorithm focuses on code that has been reasonably optimized:
- Manually or automatically by some compiler tools;
- Already employing classical loop optimizations such as tiling;
- Already allocating important data in shared memory, either for communication among threads or for data reuse.

The thread compaction used here can also be referred to as thread merge/coarsening. Compared to thread merge/coarsening/fusion in general, our approach specifically utilizes this technique for register tiling, i.e., to exploit register reuse to eliminate the redundant shared/global memory usage existing in GPU programs. We further address how many threads to compact so as to maximize the register tiling while restricting the register pressure on TLP, and thus determine the most profitable version of data placement.