GKLEE: Concolic Verification and Test Generation for GPUs
Guodong Li1,2, Peng Li1, Geof Sawaya1, Ganesh Gopalakrishnan1, Indradeep Ghosh2, Sreeranga P. Rajan2
1University of Utah, 2Fujitsu Labs of America
Feb. 2012
GPUs are widely used!
• About 40 of the top 500 machines are GPU-based
• Personal supercomputers used for scientific research (biology, physics, …) are increasingly based on GPUs
(images courtesy of AMD, Nvidia, Intel, www.engadget.com)
In such application domains, it is important that GPU computations yield correct answers and are bug-free.
Existing GPU Testing Methods are Inadequate
• Insufficient branch coverage and interleaving coverage, leading to:
– Missed data races (e.g. Write(a)/Write(a) or Write(a)/Read(a) conflicts between threads)
Existing GPU Testing Methods are Inadequate
• Data races are a huge problem:
– Testing is NEVER conclusive
– One has to infer a data race's ill effects indirectly through corrupted values
– Even instrumented race checking gives results only for a specific platform, not for future validations (e.g. for a different warp scheduling, such as the change from the old Tesla to the new Fermi)
Existing GPU Testing Methods are Inadequate
• Insufficient branch coverage and interleaving coverage, leading to:
– Missed data races
– Missed deadlocks (e.g. some threads skipping a __syncthreads())
• Insufficient measurement of performance penalties due to:
– Warp divergence
– Non-coalesced memory accesses
– Bank conflicts
Existing GPU Testing Methods are Inadequate
• CUDA GDB debugger:
– Manually debug the code and check for races and deadlocks
• CUDA Profiler:
– Reports numbers that are difficult to read
– Low coverage (i.e. not all possible inputs)
• GKLEE:
– A better tool for verification and testing
– Can address all the previously mentioned points
– e.g. has found bugs in real SDK kernels previously thought to be bug-free
– Gives root causes of the bugs

Our Contributions
• GKLEE: a Symbolic Virtual GPU for verification, analysis, and test generation
• GKLEE reports races, deadlocks, bank conflicts, non-coalesced accesses, and warp divergences
• GKLEE generates tests to run on GPU hardware
Architecture of GKLEE

C++ GPU program (with symbolic inputs)
  → CUDA syntax handler (NVCC) → LLVM-GCC compiler → LLVMcuda bytecode
LLVMcuda bytecode + GPU configuration
  → GKLEE (executor, scheduler, checker, test generator)
  → test cases (replayed on the real GPU) and statistics / bugs
Rest of the Talk
• Simple CUDA example
• Details of the Symbolic Virtual GPU
• Analysis details:
– Races, deadlocks
– Degree of warp divergences, bank conflicts, non-coalesced accesses
– Functional correctness
• Automatic test generation
– Coverage-directed test-case reduction
CUDA
• A simple dialect of C++ with CUDA directives
• Thread blocks / teams; SIMD "warps"
• Synchronization through barriers / atomics (GKLEE being extended to handle atomics)
Example: Increment Array Elements
Increment an N-element array A by scalar b: thread t0 computes A[0]+b, t1 computes A[1]+b, and so on.

__global__ void inc_gpu(int *A, int b, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N)
    A[idx] = A[idx] + b;
}
Illustration of Race
Increment an N-element vector A by scalar b (threads t0 .. t63):

__global__ void inc_gpu(int *A, int b, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N)
    A[idx] = A[(idx - 1) % N] + b;
}

RACE! t0 reads A[63] while t63 writes A[63].
Illustration of Deadlock
Increment an N-element vector A by scalar b:

__global__ void inc_gpu(int *A, int b, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N) {
    A[idx] = A[idx] + b;
    __syncthreads();
  }
}

DEADLOCK! Threads with idx < N wait at the barrier; threads with idx ≥ N never reach it.
Example of a Race Found by GKLEE

__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
  const int threadPos = ((threadIdx.x & (~63)) >> 0) |
                        ((threadIdx.x & 15) << 2) |
                        ((threadIdx.x & 48) >> 4);
  ...
  __syncthreads();
  for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x;
       pos < dataN;
       pos += IMUL(blockDim.x, gridDim.x)) {
    unsigned data4 = d_Data[pos];
    ...
    addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU);
  }
  __syncthreads();
  ...
}

inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data)
{ s_Hist[threadPos + IMUL(data, THREAD_N)]++; }

"GKLEE: Is there a race?"
GKLEE: "Threads 5 and 13 have a WW race when d_Data[5] = 0x04040404 and d_Data[13] = 0."
Example of Test Coverage due to GKLEE

__shared__ unsigned shared[NUM];

inline void swap(unsigned &a, unsigned &b)
{ unsigned tmp = a; a = b; b = tmp; }

__global__ void Bitonic_Sort(unsigned *values) {
  unsigned int tid = threadIdx.x;
  shared[tid] = values[tid];
  __syncthreads();
  for (unsigned k = 2; k <= blockDim.x; k *= 2)
    for (unsigned j = k / 2; j > 0; j /= 2) {
      unsigned ixj = tid ^ j;
      if (ixj > tid) {
        if ((tid & k) == 0) {
          if (shared[tid] > shared[ixj]) swap(shared[tid], shared[ixj]);
        } else {
          if (shared[tid] < shared[ixj]) swap(shared[tid], shared[ixj]);
        }
      }
      __syncthreads();
    }
  values[tid] = shared[tid];
}
"How do we test this?"

Answer 1: "Random + …"

Answer 2: Ask GKLEE:
"Here are 5 tests with 100% source code coverage and 79% avg. thread + barrier-interval coverage."
27
GKLEE: Symbolic Virtual GPUHost
Kernel 1
Kernel 2
Device
Grid 1
Block(0, 0)
Block(1, 0)
Block(2, 0)
Block(0, 1)
Block(1, 1)
Block(2, 1)
Grid 2
Block (1, 1)
Thread(0, 1)
Thread(1, 1)
Thread(2, 1)
Thread(3, 1)
Thread(4, 1)
Thread(0, 2)
Thread(1, 2)
Thread(2, 2)
Thread(3, 2)
Thread(4, 2)
Thread(0, 0)
Thread(1, 0)
Thread(2, 0)
Thread(3, 0)
Thread(4, 0)
• GKLEE models a GPU using software– The virtual GPU
represents the CUDA Programming Model (hence hide many hardware details)
– Similar to the CUDA emulator in this aspect; but with many unique features
– Can simulate CPU+GPU
virtual CPU
virtual GPU
GKLEE
Concolic Execution on the Virtual GPU
• Values can be CONCrete or symbOLIC (CONCOLIC) in GKLEE
– A value may be a complicated symbolic expression
– Symbolic expressions are handled by constraint solvers:
• Determine satisfiability
• Give concrete values as evidence
– Constraint solving has become 1,000x faster over the last 10 years
Comparing Concrete and Symbolic Execution

Program (with concrete input a = 10):
b = a * 2;      // b = 20
c = a + b;      // c = 30
if (c > 100)    // 30 > 100 is false
  assert(0);    // unreachable

All values are concrete.
Comparing Concrete and Symbolic Execution

Program (with symbolic input a = x, x ∈ (-∞, +∞)):
b = a * 2;      // b = 2x
c = a + b;      // c = 3x
if (c > 100)    // branch on 3x > 100
  assert(0);    // reachable, e.g. x = 40
else
  ...           // reachable, e.g. x = 30; path condition: 3x ≤ 100

The values can be concrete or symbolic.
GKLEE Works on LLVM Bytecode
• CUDA C++ programs are compiled to LLVM bytecode by LLVM-GCC with our CUDA syntax handler
• Our online technical report contains a detailed description of the LLVMcuda syntax and semantics
• GKLEE extends KLEE to handle CUDA features
Thread Scheduling: In General, an Exponential Number of Schedules!
It is like shuffling decks of cards:
• Over 600 trillion shuffles exist for 5 decks with 5 cards each!!
• Over 600 trillion schedules exist for 5 threads with 5 instructions each!!
More precisely, 25! / (5!)^5 = 623,360,743,125,120.
GKLEE Avoids Examining Exponentially Many Schedules!!
Instead of considering all schedules and all potential races…
…consider JUST THIS SINGLE CANONICAL SCHEDULE!!
Folk Theorem (proved in our paper): "We will find A RACE if there is ANY race"!!
Closer Look: Canonical Scheduling
Race-free operations can be exchanged. For example, this valid schedule:

t2:a2: write y
t1:a1: read x
t1:a3: write x
t2:a4: write y
t2:a6: read y
t1:a5: read x

can be exchanged into another valid schedule (e.g. the canonical schedule):

t1:a1: read x
t2:a2: write y
t1:a3: write x
t2:a4: write y
t1:a5: read x
t2:a6: read y

The scheduler:
(1) applies the canonical schedule;
(2) checks for races at the barriers;
(3) if there is no race, continues; otherwise reports the race and terminates.
SIMD-Aware Canonical Scheduling in GKLEE
SIMD/barrier-aware canonical scheduling within a warp/block:
(figure: warps t1..t32 and t33..t64 each execute Instr. 1–6 in lock-step; the instructions are partitioned into Barrier Interval BI1 and Barrier Interval BI2)
Record the accesses made in the canonical schedule, then check whether the accesses conflict (e.g. have the same address).
SIMD-Aware Race Checking in GKLEE
Check races on the fly (in the canonical schedule):
(figure: as above; accesses within a warp t1..t32 are checked for intra-warp races, and accesses across warps and blocks for inter-warp and inter-block races)
SDK Kernel Example: Race Checking
In histogram64Kernel (shown earlier), two threads t1 and t2 each execute:

threadPos = …
data = (data4 >> 26) & 0x3FU
s_Hist[threadPos + data * THREAD_N]++;
SDK Kernel Example: Race Checking
RW set:
t1 writes s_Hist[(((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * THREAD_N], …
t2 writes s_Hist[(((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * THREAD_N], …

Solver query: do there exist t1, t2, d_Data with t1 ≠ t2 such that
(((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * THREAD_N ==
(((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * THREAD_N ?
SDK Kernel Example: Race Checking
GKLEE indicates that these two addresses are equal when t1 = 5, t2 = 13, d_Data[5] = 0x04040404, and d_Data[13] = 0, indicating a write-write race.
Experimental Results, Part I (Check Correctness and Performance Issues)
The results of running GKLEE on CUDA SDK 2.0 kernels. GKLEE checks: (1) well-synchronized barriers; (2) races; (3) functional correctness; (4) bank conflicts; (5) memory coalescing; (6) warp divergence; (7) required volatile keyword.

Kernels       | LoC   | Race | Func. Corrct. | #T | Bank Conflict (1.x / 2.x) | Coalesced Acc. (≤1.1 / 2.x) | Warp Diverg. | Volatile Needed
Bitonic Sort  | 30    | no   | yes           | 4  | 0% / 0%                   | 100% / 100%                 | 60%          | no
Scalar Prod.  | 30    | no   | yes           | 64 | 0% / 0%                   | 11% / 100%                  | 100%         | yes
Matrix Mult   | 61    | no   | yes           | 64 | 0% / 0%                   | 100% / 100%                 | 0%           | no
Histogram64   | 69    | WW   | unknown       | 32 | 66% / 66%                 | 100% / 100%                 | 0%           | yes
Reduction (7) | 231   | no   | yes           | 16 | 0% / 0%                   | 100% / 100%                 | 16–83%       | yes
Scan Best     | 78    | no   | yes           | 32 | 71% / 71%                 | 100% / 100%                 | 71%          | no
Scan Naïve    | 28    | no   | yes           | 32 | 0% / 0%                   | 50% / 100%                  | 85%          | yes
Scan Effi.    | 60    | no   | yes           | 32 | 83% / 16%                 | 0% / 0%                     | 83%          | no
Scan Large    | 196   | no   | yes           | 32 | 71% / 71%                 | 100% / 100%                 | 71%          | no
Radix Sort    | 750   | WW   | unknown       | 16 | 3% / 0%                   | 0% / 100%                   | 5%           | yes
Bisect Small  | 1,000 | ben. | —             | 16 | 38% / 0%                  | 97% / 100%                  | 43%          | yes
Bisect Large  | 1,400 | ben. | —             | 16 | 15% / 0%                  | 99% / 100%                  | 53%          | yes
Automatic Test Generation
• GKLEE is guaranteed to explore all paths w.r.t. the given inputs
• The path constraint at the end of each path is solved to generate concrete test cases
– GKLEE supports many heuristic reduction techniques
(figure: the branch trees of threads t1 and t2 over conditions c1..c4 compose into the combined path tree for t1 + t2; the constraint along a chosen combined path, e.g. one containing ¬c1 and ¬c3, is solved to give a concrete test)
SDK Example: Comprehensive Testing
In the Bitonic Sort kernel (shown earlier), path exploration branches on the element comparisons:

shared[0] > shared[1]   vs.   shared[0] ≤ shared[1]
shared[1] < shared[2]   vs.   shared[1] ≥ shared[2]
shared[0] > shared[2]   vs.   shared[0] ≤ shared[2]

Infeasible combinations are pruned. Unsat example:
shared[0] > shared[1] ∧ shared[1] ≥ shared[2] ∧ shared[0] ≤ shared[2]
SDK Example: Comprehensive Verification
Functional correctness: along every explored path, the output values are sorted:
values[0] ≤ values[1] ≤ … ≤ values[n]
(figure: each leaf of the path tree yields concrete output values, and the sortedness property is checked on every path)
Experimental Results, Part II (Automatic Test Generation)
Coverage information about the generated tests for some CUDA kernels.

Kernels           | Src. code cov. | Avg. Covt | Max. Covt | Avg. CovBIt | Max. CovBIt | Exec. time
Bitonic Sort      | 100%/100%      | 78%/76%   | 100%/94%  | 79%/66%     | 90%/76%     | 1s
Merge Sort        | 100%/100%      | 88%/70%   | 100%/85%  | 93%/86%     | 100%/100%   | 1.6s
Word Search       | 100%/100%      | 100%/81%  | 100%/85%  | 100%/97%    | 100%/100%   | 0.1s
Suffix Tree Match | 100%/90%       | 55%/49%   | 98%/66%   | 55%/49%     | 98%/83%     | 31s
Histogram64       | 100%/100%      | 100%/75%  | 100%/75%  | 100%/100%   | 100%/100%   | 600s

Covt and CovBIt measure bytecode coverage w.r.t. threads. No test reductions were used in generating this table. Execution times are on a typical workstation.
Experimental Results, Part II (Coverage-Directed Test Reduction)
Results after applying reduction heuristics. RedTB and RedBI prune paths according to the coverage information of Thread+Barrier and Barrier, respectively: basically, a path is pruned if it is unlikely to contribute new coverage.

Kernels           | No Reductions       | RedTB               | RedBI
                  | #path | Avg. CovBIt | #path | Avg. CovBIt | #path | Avg. CovBIt
Bitonic Sort      | 28    | 79%/66%     | 5     | 79%/66%     | 5     | 79%/65%
Merge Sort        | 34    | 93%/86%     | 4     | 92%/84%     | 4     | 92%/84%
Word Search       | 8     | 100%/97%    | 2     | 100%/97%    | 2     | 94%/85%
Suffix Tree Match | 31    | 55%/49%     | 6     | 55%/49%     | 6     | 55%/49%
Histogram64       | 13    | 100%/100%   | 5     | 100%/100%   | 5     | 100%/100%
Additional GKLEE Features
• GKLEE employs an efficient memory organization
• Employs many expression-evaluation optimizations:
– Simplifies concolic expressions on the fly
– Dynamically caches results
– Applies dependency analysis before constraint solving
– Uses manually optimized C/C++ libraries
• GKLEE also handles all of the C++ syntax
• GKLEE never generates false alarms
Experimental Results, Part III (Performance Comparison of Two Tools)
Execution times (in seconds) of GKLEE and PUG [SIGSOFT FSE 2010] for the functional-correctness check. #T is the number of threads. Time is reported in the format GPU time (entire time); T.O means > 5 minutes.

Kernels        | #T = 4            | #T = 16           | #T = 64     | #T = 256  | #T = 1,024
               | PUG | GKLEE       | PUG | GKLEE       | GKLEE       | GKLEE     | GKLEE
Simple Reduct. | 2.8 | <0.1 (<0.1) | T.O | <0.1 (<0.1) | <0.1 (<0.1) | 0.2 (0.3) | 2.3 (2.9)
Matrix Transp. | 1.9 | <0.1 (<0.1) | T.O | <0.1 (0.3)  | <0.1 (3.2)  | <0.1 (63) | 0.9 (T.O)
Bitonic Sort   | 3.7 | 0.9 (1)     | T.O | T.O         | T.O         | T.O       | T.O
Scan Large     | —   | <0.1 (<0.1) | —   | <0.1 (<0.1) | 0.1 (0.2)   | 1.6 (3)   | 22 (51)
Other Details
• Diverged-warp scheduling; intra-warp and inter-warp/-block race checking; textually-aligned barrier checking
• Checking performance issues: warp divergence, bank conflicts, global memory coalescing
• Path/test reduction techniques
• Volatile-declaration checking
• Handling symbolic aliasing and pointers
• Drivers for the kernels and replaying on the real GPU
• Other results, e.g. on CUDA SDK 4.0 programs
• CUDA's relaxed memory model and semantics
Summary
• GKLEE: symbolic virtual GPU
– Identifies correctness and performance issues
– Produces concrete tests with high code coverage
– Enables symbolic parallel debugging for CUDA programs
– Good for other CUDA applications (e.g. compiler-optimization verification, regression testing)
• The tool is open source and available at:
– www.cs.utah.edu/fv/GKLEE
– with tutorial, manual, tech. report, live DVD, etc.
• Future Work:
– Parameterized verification (e.g. equivalence checking)
– Support for floating-point numbers
– Combination with runtime execution (on the real GPU)