GKLEE: Concolic Verification and Test Generation for GPUs
Guodong Li1,2, Peng Li1, Geof Sawaya1, Ganesh Gopalakrishnan1, Indradeep Ghosh2, Sreeranga P. Rajan2
1University of Utah, 2Fujitsu Labs of America
Feb. 2012
GPUs are widely used!
• About 40 of the top 500 machines are GPU-based
• Personal supercomputers used for scientific research (biology, physics, …) are increasingly based on GPUs
(images courtesy of AMD, Nvidia, Intel, www.engadget.com)
In such application domains, it is important that GPU computations yield correct answers and are bug-free.
Existing GPU Testing Methods are Inadequate
• Insufficient branch coverage and interleaving coverage, leading to:
– Missed data races (e.g. Write(a)/Write(a) or Write(a)/Read(a) conflicts between threads)
Existing GPU Testing Methods are Inadequate
• Data races are a huge problem:
– Testing is NEVER conclusive
– One has to infer a data race's ill effects indirectly through corrupted values
– Even instrumented race checking gives results only for a specific platform, not for future validations (e.g. for a different warp scheduling, such as the change from the old Tesla to the new Fermi)
Existing GPU Testing Methods are Inadequate
• Insufficient branch coverage and interleaving coverage, leading to:
– Missed data races
– Missed deadlocks (e.g. some threads skipping a __syncthreads())
• Insufficient measurement of performance penalties due to:
– Warp divergence
– Non-coalesced memory accesses
– Bank conflicts
Existing GPU Testing Methods are Inadequate
• CUDA GDB debugger:
– Manually debug the code and check for races and deadlocks
• CUDA Profiler:
– Reports numbers that are difficult to read
– Low coverage (i.e. not all possible inputs)
• GKLEE:
– A better tool for verification and testing
– Can address all the previously mentioned points
– e.g. has found bugs in real SDK kernels previously thought to be bug-free
– Gives root causes of the bugs

Our Contributions
• GKLEE: a Symbolic Virtual GPU for verification, analysis, and test generation
• GKLEE reports races, deadlocks, bank conflicts, non-coalesced accesses, and warp divergences
• GKLEE generates tests to run on GPU hardware
Architecture of GKLEE

C++ GPU program (with symbolic inputs)
  → CUDA syntax handler (NVCC) → LLVM-GCC compiler → LLVMcuda bytecode
LLVMcuda bytecode + GPU configuration
  → GKLEE (executor, scheduler, checker, test generator)
  → test cases (replayed on the real GPU) and statistics / bugs
Rest of the Talk
• Simple CUDA example
• Details of the Symbolic Virtual GPU
• Analysis details:
– Races, deadlocks
– Degree of warp divergences, bank conflicts, non-coalesced accesses
– Functional correctness
• Automatic test generation
– Coverage-directed test-case reduction
CUDA
• A simple dialect of C++ with CUDA directives
• Thread blocks / teams; SIMD "warps"
• Synchronization through barriers / atomics (GKLEE being extended to handle atomics)
Example: Increment Array Elements
Increment an N-element array A by scalar b: thread t0 computes A[0]+b, t1 computes A[1]+b, and so on.

__global__ void inc_gpu(int *A, int b, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N)
    A[idx] = A[idx] + b;
}
Illustration of Race
Increment an N-element vector A by scalar b (threads t0 .. t63):

__global__ void inc_gpu(int *A, int b, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N)
    A[idx] = A[(idx - 1) % N] + b;
}

RACE! t0 reads A[63] while t63 writes A[63].
Illustration of Deadlock
Increment an N-element vector A by scalar b:

__global__ void inc_gpu(int *A, int b, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N) {
    A[idx] = A[idx] + b;
    __syncthreads();
  }
}

DEADLOCK! Threads with idx < N wait at the barrier; threads with idx ≥ N never reach it.
Example of a Race Found by GKLEE

__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
  const int threadPos = ((threadIdx.x & (~63)) >> 0) |
                        ((threadIdx.x & 15) << 2) |
                        ((threadIdx.x & 48) >> 4);
  ...
  __syncthreads();
  for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x;
       pos < dataN;
       pos += IMUL(blockDim.x, gridDim.x)) {
    unsigned data4 = d_Data[pos];
    ...
    addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU);
  }
  __syncthreads();
  ...
}

inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data)
{ s_Hist[threadPos + IMUL(data, THREAD_N)]++; }

"GKLEE: Is there a race?"
GKLEE: "Threads 5 and 13 have a WW race when d_Data[5] = 0x04040404 and d_Data[13] = 0."
Example of Test Coverage due to GKLEE

__shared__ unsigned shared[NUM];

inline void swap(unsigned &a, unsigned &b)
{ unsigned tmp = a; a = b; b = tmp; }

__global__ void Bitonic_Sort(unsigned *values) {
  unsigned int tid = threadIdx.x;
  shared[tid] = values[tid];
  __syncthreads();
  for (unsigned k = 2; k <= blockDim.x; k *= 2)
    for (unsigned j = k / 2; j > 0; j /= 2) {
      unsigned ixj = tid ^ j;
      if (ixj > tid) {
        if ((tid & k) == 0) {
          if (shared[tid] > shared[ixj]) swap(shared[tid], shared[ixj]);
        } else {
          if (shared[tid] < shared[ixj]) swap(shared[tid], shared[ixj]);
        }
      }
      __syncthreads();
    }
  values[tid] = shared[tid];
}
"How do we test this?"

Answer 1: "Random + …"

Answer 2: Ask GKLEE:
"Here are 5 tests with 100% source code coverage and 79% avg. thread + barrier-interval coverage."
27
GKLEE: Symbolic Virtual GPUHost
Kernel 1
Kernel 2
Device
Grid 1
Block(0, 0)
Block(1, 0)
Block(2, 0)
Block(0, 1)
Block(1, 1)
Block(2, 1)
Grid 2
Block (1, 1)
Thread(0, 1)
Thread(1, 1)
Thread(2, 1)
Thread(3, 1)
Thread(4, 1)
Thread(0, 2)
Thread(1, 2)
Thread(2, 2)
Thread(3, 2)
Thread(4, 2)
Thread(0, 0)
Thread(1, 0)
Thread(2, 0)
Thread(3, 0)
Thread(4, 0)
• GKLEE models a GPU using software– The virtual GPU
represents the CUDA Programming Model (hence hide many hardware details)
– Similar to the CUDA emulator in this aspect; but with many unique features
– Can simulate CPU+GPU
virtual CPU
virtual GPU
GKLEE
Concolic Execution on the Virtual GPU
• Values can be CONCrete or symbOLIC (CONCOLIC) in GKLEE
– A value may be a complicated symbolic expression
– Symbolic expressions are handled by constraint solvers:
• Determine satisfiability
• Give concrete values as evidence
– Constraint solving has become 1,000x faster over the last 10 years
Comparing Concrete and Symbolic Execution

Program (with concrete input a = 10):
b = a * 2;      // b = 20
c = a + b;      // c = 30
if (c > 100)    // 30 > 100 is false
  assert(0);    // unreachable

All values are concrete.
Comparing Concrete and Symbolic Execution

Program (with symbolic input a = x, x ∈ (-∞, +∞)):
b = a * 2;      // b = 2x
c = a + b;      // c = 3x
if (c > 100)    // branch on 3x > 100
  assert(0);    // reachable, e.g. x = 40
else
  ...           // reachable, e.g. x = 30; path condition: 3x ≤ 100

The values can be concrete or symbolic.
GKLEE Works on LLVM Bytecode
• CUDA C++ programs are compiled to LLVM bytecode by LLVM-GCC with our CUDA syntax handler
• Our online technical report contains a detailed description of the LLVMcuda syntax and semantics
• GKLEE extends KLEE to handle CUDA features
Thread Scheduling: In General, an Exponential Number of Schedules!
It is like shuffling decks of cards:
• Over 600 trillion shuffles exist for 5 decks with 5 cards each!!
• Over 600 trillion schedules exist for 5 threads with 5 instructions each!!
More precisely, 25! / (5!)^5 = 623,360,743,125,120.
GKLEE Avoids Examining Exponentially Many Schedules!!
Instead of considering all schedules and all potential races…
…consider JUST THIS SINGLE CANONICAL SCHEDULE!!
Folk Theorem (proved in our paper): "We will find A RACE if there is ANY race"!!
Closer Look: Canonical Scheduling
Race-free operations can be exchanged. For example, this valid schedule:

t2:a2: write y
t1:a1: read x
t1:a3: write x
t2:a4: write y
t2:a6: read y
t1:a5: read x

can be exchanged into another valid schedule (e.g. the canonical schedule):

t1:a1: read x
t2:a2: write y
t1:a3: write x
t2:a4: write y
t1:a5: read x
t2:a6: read y

The scheduler:
(1) applies the canonical schedule;
(2) checks for races at the barriers;
(3) if there is no race, continues; otherwise reports the race and terminates.
SIMD-Aware Canonical Scheduling in GKLEE
SIMD/barrier-aware canonical scheduling within a warp/block:
(figure: warps t1..t32 and t33..t64 each execute Instr. 1–6 in lock-step; the instructions are partitioned into Barrier Interval BI1 and Barrier Interval BI2)
Record the accesses made in the canonical schedule, then check whether the accesses conflict (e.g. have the same address).
SIMD-Aware Race Checking in GKLEE
Check races on the fly (in the canonical schedule):
(figure: as above; accesses within a warp t1..t32 are checked for intra-warp races, and accesses across warps and blocks for inter-warp and inter-block races)
SDK Kernel Example: Race Checking
In histogram64Kernel (shown earlier), two threads t1 and t2 each execute:

threadPos = …
data = (data4 >> 26) & 0x3FU
s_Hist[threadPos + data * THREAD_N]++;
SDK Kernel Example: Race Checking
RW set:
t1 writes s_Hist[(((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * THREAD_N], …
t2 writes s_Hist[(((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * THREAD_N], …

Solver query: do there exist t1, t2, d_Data with t1 ≠ t2 such that
(((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * THREAD_N ==
(((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * THREAD_N ?
SDK Kernel Example: Race Checking
GKLEE indicates that these two addresses are equal when t1 = 5, t2 = 13, d_Data[5] = 0x04040404, and d_Data[13] = 0, indicating a write-write race.
Experimental Results, Part I (Check Correctness and Performance Issues)
The results of running GKLEE on CUDA SDK 2.0 kernels. GKLEE checks: (1) well-synchronized barriers; (2) races; (3) functional correctness; (4) bank conflicts; (5) memory coalescing; (6) warp divergence; (7) required volatile keyword.

Kernels       | LoC   | Race | Func. Corrct. | #T | Bank Conflict (1.x / 2.x) | Coalesced Acc. (≤1.1 / 2.x) | Warp Diverg. | Volatile Needed
Bitonic Sort  | 30    | no   | yes           | 4  | 0% / 0%                   | 100% / 100%                 | 60%          | no
Scalar Prod.  | 30    | no   | yes           | 64 | 0% / 0%                   | 11% / 100%                  | 100%         | yes
Matrix Mult   | 61    | no   | yes           | 64 | 0% / 0%                   | 100% / 100%                 | 0%           | no
Histogram64   | 69    | WW   | unknown       | 32 | 66% / 66%                 | 100% / 100%                 | 0%           | yes
Reduction (7) | 231   | no   | yes           | 16 | 0% / 0%                   | 100% / 100%                 | 16–83%       | yes
Scan Best     | 78    | no   | yes           | 32 | 71% / 71%                 | 100% / 100%                 | 71%          | no
Scan Naïve    | 28    | no   | yes           | 32 | 0% / 0%                   | 50% / 100%                  | 85%          | yes
Scan Effi.    | 60    | no   | yes           | 32 | 83% / 16%                 | 0% / 0%                     | 83%          | no
Scan Large    | 196   | no   | yes           | 32 | 71% / 71%                 | 100% / 100%                 | 71%          | no
Radix Sort    | 750   | WW   | unknown       | 16 | 3% / 0%                   | 0% / 100%                   | 5%           | yes
Bisect Small  | 1,000 | ben. | —             | 16 | 38% / 0%                  | 97% / 100%                  | 43%          | yes
Bisect Large  | 1,400 | ben. | —             | 16 | 15% / 0%                  | 99% / 100%                  | 53%          | yes
Automatic Test Generation
• GKLEE is guaranteed to explore all paths w.r.t. the given inputs
• The path constraint at the end of each path is solved to generate concrete test cases
– GKLEE supports many heuristic reduction techniques
(figure: the branch trees of threads t1 and t2 over conditions c1..c4 compose into the combined path tree for t1 + t2; the constraint along a chosen combined path, e.g. one containing ¬c1 and ¬c3, is solved to give a concrete test)
SDK Example: Comprehensive Testing
In the Bitonic Sort kernel (shown earlier), path exploration branches on the element comparisons:

shared[0] > shared[1]   vs.   shared[0] ≤ shared[1]
shared[1] < shared[2]   vs.   shared[1] ≥ shared[2]
shared[0] > shared[2]   vs.   shared[0] ≤ shared[2]

Infeasible combinations are pruned. Unsat example:
shared[0] > shared[1] ∧ shared[1] ≥ shared[2] ∧ shared[0] ≤ shared[2]
SDK Example: Comprehensive Verification
Functional correctness: along every explored path, the output values are sorted:
values[0] ≤ values[1] ≤ … ≤ values[n]
(figure: each leaf of the path tree yields concrete output values, and the sortedness property is checked on every path)
Experimental Results, Part II (Automatic Test Generation)
Coverage information about the generated tests for some CUDA kernels.

Kernels           | Src. code cov. | Avg. Covt | Max. Covt | Avg. CovBIt | Max. CovBIt | Exec. time
Bitonic Sort      | 100%/100%      | 78%/76%   | 100%/94%  | 79%/66%     | 90%/76%     | 1s
Merge Sort        | 100%/100%      | 88%/70%   | 100%/85%  | 93%/86%     | 100%/100%   | 1.6s
Word Search       | 100%/100%      | 100%/81%  | 100%/85%  | 100%/97%    | 100%/100%   | 0.1s
Suffix Tree Match | 100%/90%       | 55%/49%   | 98%/66%   | 55%/49%     | 98%/83%     | 31s
Histogram64       | 100%/100%      | 100%/75%  | 100%/75%  | 100%/100%   | 100%/100%   | 600s

Covt and CovBIt measure bytecode coverage w.r.t. threads. No test reductions were used in generating this table. Execution times are on a typical workstation.
Experimental Results, Part II (Coverage-Directed Test Reduction)
Results after applying reduction heuristics. RedTB and RedBI prune paths according to the coverage information of Thread+Barrier and Barrier, respectively: basically, a path is pruned if it is unlikely to contribute new coverage.

Kernels           | No Reductions       | RedTB               | RedBI
                  | #path | Avg. CovBIt | #path | Avg. CovBIt | #path | Avg. CovBIt
Bitonic Sort      | 28    | 79%/66%     | 5     | 79%/66%     | 5     | 79%/65%
Merge Sort        | 34    | 93%/86%     | 4     | 92%/84%     | 4     | 92%/84%
Word Search       | 8     | 100%/97%    | 2     | 100%/97%    | 2     | 94%/85%
Suffix Tree Match | 31    | 55%/49%     | 6     | 55%/49%     | 6     | 55%/49%
Histogram64       | 13    | 100%/100%   | 5     | 100%/100%   | 5     | 100%/100%
Additional GKLEE Features
• GKLEE employs an efficient memory organization
• Employs many expression-evaluation optimizations:
– Simplifies concolic expressions on the fly
– Dynamically caches results
– Applies dependency analysis before constraint solving
– Uses manually optimized C/C++ libraries
• GKLEE also handles all of the C++ syntax
• GKLEE never generates false alarms
Experimental Results, Part III (Performance Comparison of Two Tools)
Execution times (in seconds) of GKLEE and PUG [SIGSOFT FSE 2010] for the functional-correctness check. #T is the number of threads. Time is reported in the format GPU time (entire time); T.O means > 5 minutes.

Kernels        | #T = 4            | #T = 16           | #T = 64     | #T = 256  | #T = 1,024
               | PUG | GKLEE       | PUG | GKLEE       | GKLEE       | GKLEE     | GKLEE
Simple Reduct. | 2.8 | <0.1 (<0.1) | T.O | <0.1 (<0.1) | <0.1 (<0.1) | 0.2 (0.3) | 2.3 (2.9)
Matrix Transp. | 1.9 | <0.1 (<0.1) | T.O | <0.1 (0.3)  | <0.1 (3.2)  | <0.1 (63) | 0.9 (T.O)
Bitonic Sort   | 3.7 | 0.9 (1)     | T.O | T.O         | T.O         | T.O       | T.O
Scan Large     | —   | <0.1 (<0.1) | —   | <0.1 (<0.1) | 0.1 (0.2)   | 1.6 (3)   | 22 (51)
Other Details
• Diverged-warp scheduling; intra-warp and inter-warp/-block race checking; textually-aligned barrier checking
• Checking performance issues: warp divergence, bank conflicts, global memory coalescing
• Path/test reduction techniques
• Volatile-declaration checking
• Handling symbolic aliasing and pointers
• Drivers for the kernels and replaying on the real GPU
• Other results, e.g. on CUDA SDK 4.0 programs
• CUDA's relaxed memory model and semantics
Summary
• GKLEE: symbolic virtual GPU
– Identifies correctness and performance issues
– Produces concrete tests with high code coverage
– Enables symbolic parallel debugging for CUDA programs
– Good for other CUDA applications (e.g. compiler-optimization verification, regression testing)
• The tool is open source and available at:
– www.cs.utah.edu/fv/GKLEE
– with tutorial, manual, tech. report, live DVD, etc.
• Future Work:
– Parameterized verification (e.g. equivalence checking)
– Support for floating-point numbers
– Combination with runtime execution (on the real GPU)