GMProf: A Low-Overhead, Fine-Grained Profiling Approach for GPU Programs
Mai Zheng, Vignesh T. Ravi, Wenjing Ma, Feng Qin, and Gagan Agrawal
Dept. of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA
GPU Programming Gets Popular
• Many domains are using GPUs for high performance
[Images: GPU-accelerated Molecular Dynamics; GPU-accelerated Seismic Imaging]
• Available in both high-end and low-end systems
  • the #1 supercomputer in the world uses GPUs [TOP500, Nov 2012]
  • commodity desktops/laptops are equipped with GPUs
Writing Efficient GPU Programs is Challenging
• Need careful management of
  • a large number of threads
  • a multi-layer memory hierarchy
[Figure: Kepler GK110 memory hierarchy. Threads in thread blocks access shared memory, the L1 cache, and the read-only data cache (fast but small), backed by the L2 cache and DRAM device memory (large but slow)]
• Which data in shared memory are infrequently accessed?
• Which data in device memory are frequently accessed?
• Existing tools can't help much
  • inapplicable to GPUs
  • coarse-grained
  • prohibitive runtime overhead
  • cannot handle irregular/indirect accesses
Outline
• Motivation
• GMProf
  • Naïve Profiling Approach
  • Optimizations
  • Enhanced Algorithm
• Evaluation
• Conclusions
GMProf-basic: The Naïve Profiling Approach
• Shared Memory Profiling
  • integer counters to count accesses to shared memory
  • one counter for each shared memory element
  • atomically update the counters, to avoid race conditions among threads
• Device Memory Profiling
  • integer counters to count accesses to device memory
  • one counter for each element in the user's device memory arrays, since device memory is too large to be monitored as a whole (e.g., 6 GB)
  • atomically update the counters
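To make this concrete, here is a minimal sketch of what such naive instrumentation could look like (illustrative only; the kernel copy_profiled and the counter arrays dm_cnt/shm_cnt are hypothetical names, not GMProf's actual generated code):

    // Hypothetical GMProf-basic-style instrumentation: every shared/device
    // memory access is paired with an atomic counter update. Assumes the
    // grid exactly covers the input, with blocks of 256 threads.
    __global__ void copy_profiled(const float *in, float *out,
                                  unsigned int *dm_cnt,    // one counter per element of in[]
                                  unsigned int *shm_cnt)   // one counter per shared memory slot
    {                                                      // (aggregated over all blocks here)
        __shared__ float s[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        atomicAdd(&dm_cnt[i], 1u);              // count the device memory read of in[i]
        s[threadIdx.x] = in[i];
        atomicAdd(&shm_cnt[threadIdx.x], 1u);   // count the shared memory write
        __syncthreads();

        atomicAdd(&shm_cnt[threadIdx.x], 1u);   // count the shared memory read
        out[i] = s[threadIdx.x];
    }

Every access thus triggers an atomic counter update, which is exactly what makes the naive approach so expensive.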
GMProf-SA: Static Analysis Optimization
• Observation I: Many memory accesses can be determined statically

    __shared__ int s[];
    ...
    s[threadIdx.x] = 3;

  Don't need to count this access at runtime

• How about this …

    __shared__ float s[];
    ...
    for (r = 0; ...; ...) {
        for (c = 0; ...; ...) {
            temp = s[input[c]];
        }
    }

• Observation II: Some accesses are loop-invariant
  • e.g., s[input[c]] does not depend on the outer loop iterator r
  Don't need to profile in every r iteration

• Observation III: Some accesses are tid-invariant
  • e.g., s[input[c]] does not depend on threadIdx
  Don't need to update the counter in every thread
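Put together, these observations let the counter updates be hoisted out of loops and threads. A sketch of what the instrumented loop above might reduce to (illustrative; the counter array cnt and the scaling bookkeeping are assumptions for this sketch, not GMProf's exact scheme):

    __global__ void profile_hoisted(const int *input, unsigned int *cnt, int R, int C)
    {
        __shared__ float s[256];                 // (initialization of s elided)
        float temp;
        // the original computation runs unchanged:
        for (int r = 0; r < R; ++r)
            for (int c = 0; c < C; ++c)
                temp = s[input[c]];              // loop- and tid-invariant access
        (void)temp;

        // profiling hoisted out of the r loop (Observation II) and restricted
        // to one representative thread (Observation III): C counter updates
        // instead of R*C updates per thread
        if (threadIdx.x == 0)
            for (int c = 0; c < C; ++c)
                cnt[input[c]] += (unsigned)R * blockDim.x;  // one possible bookkeeping:
                                                            // scale by the skipped
                                                            // iterations and threads
    }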
GMProf-NA: Non-Atomic Operation Optimization
• Atomic operations cost a lot
  • they serialize all concurrent threads that update the same shared counter
• Use non-atomic operations to update counters
  • this does not impact the overall accuracy, thanks to the other optimizations
[Figure: concurrent threads executing atomicAdd(&counter, 1) become serialized]
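A minimal before/after sketch (cnt and idx are hypothetical names):

    // before (GMProf-basic): concurrent threads serialize on the same counter
    atomicAdd(&cnt[idx], 1u);

    // after (GMProf-NA): a plain increment; concurrent updates may occasionally
    // be lost, but the counts remain accurate enough to separate frequently
    // from infrequently accessed data, since the other optimizations have
    // already removed most counter updates
    cnt[idx] += 1u;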
GMProf-SM: Shared Memory Counters Optimization
• Make full use of shared memory
  • store counters in shared memory when possible
  • reduce counter size, e.g., 32-bit integer counters -> 8-bit
[Figure: counters are placed in shared memory (fast but small) rather than in device memory]

GMProf-TH: Threshold Optimization
• A precise count may not be necessary
  • e.g., A is accessed 10 times, while B is accessed > 100 times
• Stop counting once a certain threshold is reached
  • a tradeoff between accuracy and overhead
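The shared-memory and threshold optimizations combine naturally into a saturating 8-bit counter. A sketch under assumed names (stage_profiled, cnt8, and the threshold value 255 are illustrative, not GMProf's actual implementation):

    // Hypothetical combination of GMProf-SM, GMProf-TH, and GMProf-NA:
    // 8-bit counters live in shared memory and saturate at a threshold,
    // so they never wrap around and counting stops once data is "hot enough".
    // Assumes blockDim.x == 256.
    __global__ void stage_profiled(const float *in, float *out,
                                   unsigned char *cnt_out)  // per-thread dump of the counters
    {
        __shared__ float s[256];
        __shared__ unsigned char cnt8[256];   // GMProf-SM: counters in fast shared memory
        cnt8[threadIdx.x] = 0;
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        s[threadIdx.x] = in[i];
        if (cnt8[threadIdx.x] < 255)          // GMProf-TH: stop counting at the threshold
            cnt8[threadIdx.x]++;              // GMProf-NA: non-atomic update
        __syncthreads();

        out[i] = s[threadIdx.x];
        if (cnt8[threadIdx.x] < 255)
            cnt8[threadIdx.x]++;

        __syncthreads();
        cnt_out[i] = cnt8[threadIdx.x];       // flush the counters to device memory at the end
    }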
GMProf-Enhanced: Live Range Analysis
• The number of accesses to a shared memory location may be misleading
[Figure: data0, data1, and data2 are staged one at a time from input_array in device memory through shm_buf in shared memory to output_array in device memory]
• Need to count the accesses/reuse of DATA, not addresses
• Track data during its live range in shared memory
• Use a logical clock to mark the boundaries of each live range
  • separate counters for each live range, based on the logical clock

    ...
    shm_buffer = input_array[0]    // load data0 from DM to ShM   <- live range of data0 begins
    ...
    output_array[0] = shm_buffer   // store data0 from ShM to DM  <- live range of data0 ends
    ...
    shm_buffer = input_array[1]    // load data1 from DM to ShM   <- live range of data1 begins
    ...
    output_array[1] = shm_buffer   // store data1 from ShM to DM  <- live range of data1 ends
    ...
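A sketch of how a logical clock could separate the counts of successive live ranges (illustrative; the kernel, clk, and range_cnt are assumptions for this sketch, and a single-block launch is assumed):

    // Hypothetical live-range profiling: the logical clock clk advances at
    // every load into shm_buffer, so accesses are charged to the data that
    // is currently live there rather than to the buffer's address.
    __global__ void pipeline_profiled(const float *input_array, float *output_array,
                                      int num_chunks,
                                      unsigned int *range_cnt)  // one counter per live range
    {
        __shared__ float shm_buffer;
        unsigned int clk = 0;                       // logical clock

        for (int d = 0; d < num_chunks; ++d) {
            if (threadIdx.x == 0) shm_buffer = input_array[d];
            clk++;                                  // load boundary: a new live range begins
            __syncthreads();

            float v = shm_buffer;                   // a use of the currently live data
            if (threadIdx.x == 0) {
                range_cnt[clk - 1]++;               // charge the access to THIS live range
                output_array[d] = v;
            }
            __syncthreads();                        // store boundary: the live range ends
        }
    }

With per-range counters, a buffer that is written once and read once per chunk reports a small reuse count per live range instead of a large aggregate count over all chunks.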
Methodology
• Platform
  • GPU: NVIDIA Tesla C1060
    • 240 cores (30×8), 1.296 GHz
    • 16 KB shared memory per SM
    • 4 GB device memory
  • CPU: 2× AMD Opteron @ 2.6 GHz
  • 8 GB main memory
  • Linux kernel 2.6.32
  • CUDA Toolkit 3.0
• Six applications
  • Co-clustering, EM clustering, Binomial Options, Jacobi, Sparse Matrix-Vector Multiplication, and DXTC
Runtime Overhead for Profiling Shared Memory Use
[Chart: runtime overhead of GMProf-basic across the six applications: 90x, 113x, 144x, 181x, 182x, and 648x; GMProf: 2.6x]
Runtime Overhead for Profiling Device Memory Use
[Chart: runtime overhead of GMProf-basic: 48.5x, 83x, and 197x; GMProf: 1.6x]
Case Study I: Put the most frequently used data into shared memory
• bo_v1: a naïve implementation where all data arrays are stored in device memory
• A1 ~ A4: four data arrays; (N): average access count of the elements in the corresponding data array

  Profiling Result | GMProf-basic            | GMProf w/o TH           | GMProf w/ TH
  ShM              | 0                       | 0                       | 0
  DM               | A1(276) A2(276)         | A1(276) A2(276)         | A1(THR) A2(THR)
                   | A3(128) A4(1)           | A3(128) A4(1)           | A3(128) A4(1)
• bo_v2: an improved version that puts the most frequently used arrays (identified by GMProf) into shared memory

  Profiling Result | GMProf-basic            | GMProf w/o TH           | GMProf w/ TH
  ShM              | A1(174,788) A2(169,221) | A1(165,881) A2(160,315) | A1(THR) A2(THR)
  DM               | A3(128) A4(1)           | A3(128) A4(1)           | A3(128) A4(1)

• bo_v2 outperforms bo_v1 by a factor of 39.63
Case Study II: Identify the true reuse of data

• jcb_v1: the shared memory is accessed frequently, but with little reuse of the data

  Profiling Result | GMProf-basic    | GMProf w/o Enh. Alg. | GMProf w/ Enh. Alg.
  ShM              | shm_buf (5,760) | shm_buf (5,748)      | shm_buf (2)
  DM               | in(4) out(1)    | in(4) out(1)         | in(4) out(1)

• jcb_v2:

  Profiling Result | GMProf-basic    | GMProf w/o Enh. Alg. | GMProf w/ Enh. Alg.
  ShM              | shm_buf (4,757) | shm_buf (4,741)      | shm_buf (4)
  DM               | in(1) out(1)    | in(1) out(1)         | in(1) out(1)

• jcb_v2 outperforms jcb_v1 by 2.59 times
Conclusions
• GMProf
  • a statically-assisted dynamic profiling approach
  • architecture-based optimizations
  • live range analysis to capture the real usage of data
  • low-overhead & fine-grained
  • may be applied to profile other events
Thanks!