improving performance of opencl on cpus€¦ · improving performance of opencl on cpus ralf...

62
Improving Performance of OpenCL on CPUs Ralf Karrenberg [email protected] Sebastian Hack [email protected] European LLVM Conference, London April 12-13, 2012 1

Upload: others

Post on 03-Oct-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Improving Performance of OpenCL on CPUs

Ralf [email protected]

Sebastian [email protected]

European LLVM Conference, LondonApril 12-13, 2012

1

Page 2: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Data-Parallel Languages: OpenCL� �__kernel void DCT(__global float * output ,

__global float * input ,

__global float * dct8x8 ,

__local float * inter ,

const uint width ,

const uint blockWidth ,

const uint inverse)

{

uint tidX = get_global_id (0);

...

inter[lidY*blockWidth + lidX] = ...

barrier(CLK_LOCAL_MEM_FENCE );

float acc = 0.0f;

for(uint k=0; k < blockWidth; k++)

{

uint index1 = lidX*blockWidth + k;

uint index2 = (inverse) ? lidY*blockWidth + k :

k*blockWidth + lidY;

acc += inter[index1] * dct8x8[index2 ];

}

output[tidY*width + tidX] = acc;

}� �2

Page 3: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

OpenCL: Execution Model

3

Page 4: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

CPU Driver Implementation (2D, Naıve)

� �cl_int

clEnqueueNDRangeKernel(Kernel scalarKernel ,

TA argStruct ,

int* globalSizes ,

int* localSizes)

{

int groupSizeX = globalSizes [0] / localSizes [0];

int groupSizeY = globalSizes [1] / localSizes [1];

// Loop over groups.

for (int groupX =0; groupX <groupSizeX; ++ groupX) {

for (int groupY =0; groupY <groupSizeY; ++ groupY) {

// Loop over threads in group.

for (int lidY =0; lidY <localSizes [1]; ++lidY) {

for (int lidX =0; lidX <localSizes [0]; ++lidX) {

scalarKernel(argStruct , lidX , lidY ,

groupX , groupY ,

globalSizes , localSizes );

} }

} } }� �4

Page 5: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

CPU Driver Implementation (2D, Group Kernel)

� �cl_int

clEnqueueNDRangeKernel(Kernel groupKernel ,

TA argStruct ,

int* globalSizes ,

int* localSizes)

{

int groupSizeX = globalSizes [0] / localSizes [0];

int groupSizeY = globalSizes [1] / localSizes [1];

// Loop over groups.

for (int groupX =0; groupX <groupSizeX; ++ groupX) {

for (int groupY =0; groupY <groupSizeY; ++ groupY) {

// Loop over threads in group.

groupKernel(argStruct ,

groupX , groupY ,

globalSizes ,

localSizes );

} } }� �5

Page 6: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

CPU Driver Implementation (2D, Group Kernel, OpenMP)

� �cl_int

clEnqueueNDRangeKernel(Kernel groupKernel ,

TA argStruct ,

int* globalSizes ,

int* localSizes)

{

int groupSizeX = globalSizes [0] / localSizes [0];

int groupSizeY = globalSizes [1] / localSizes [1];

#pragma omp parallel for

for (int groupX =0; groupX <groupSizeX; ++ groupX) {

for (int groupY =0; groupY <groupSizeY; ++ groupY) {

// Loop over threads in group.

groupKernel(argStruct ,

groupX , groupY ,

globalSizes ,

localSizes );

} } }� �6

Page 7: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Group Kernel (2D, Scalar)

� �void groupKernel(TA argStruct , int* groupIDs ,

int* globalSizes , int* localSizes)

{

for (int lidY =0; lidY <localSizes [1]; ++lidY) {

for (int lidX =0; lidX <localSizes [0]; ++lidX) {

scalarKernel(argStruct , lidX , lidY ,

groupIDs , globalSizes ,

localSizes ); // to be inlined

} } }� �

7

Page 8: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Group Kernel (2D, Scalar, Inlined)� �void groupKernel(TA argStruct , int* groupIDs ,

int* globalSizes , int* localSizes)

{

for (int lidY =0; lidY <localSizes [1]; ++lidY) {

for (int lidX =0; lidX <localSizes [0]; ++lidX) {

uint tidX = get_global_id (0);

...

inter[lidY*blockWidth + lidX] = ...

barrier(CLK_LOCAL_MEM_FENCE );

float acc = 0.0f;

for(uint k=0; k < blockWidth; k++)

{

uint index1 = lidX*blockWidth + k;

uint index2 = (inverse) ? lidY*blockWidth + k :

k*blockWidth + lidY;

acc += inter[index1] * dct8x8[index2 ];

}

output[tidY*width + tidX] = acc;

} } }� �8

Page 9: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Group Kernel (2D, Scalar, Inlined, Optimized (1))� �void groupKernel(TA argStruct , int* groupIDs ,

int* globalSizes , int* localSizes)

{

for (int lidY =0; lidY <localSizes [1]; ++lidY) {

for (int lidX =0; lidX <localSizes [0]; ++lidX) {

uint tidX = localSizes [0] * groupIDs [0] + lidX;

...

inter[lidY*blockWidth + lidX] = ...

barrier(CLK_LOCAL_MEM_FENCE );

float acc = 0.0f;

for(uint k=0; k < blockWidth; k++)

{

uint index1 = lidX*blockWidth + k;

uint index2 = (inverse) ? lidY*blockWidth + k :

k*blockWidth + lidY;

acc += inter[index1] * dct8x8[index2 ];

}

output[tidY*width + tidX] = acc;

} } }� �9

Page 10: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Group Kernel (2D, Scalar, Inlined, Optimized (1))� �void groupKernel(TA argStruct , int* groupIDs ,

int* globalSizes , int* localSizes)

{

for (int lidY =0; lidY <localSizes [1]; ++lidY) {

for (int lidX =0; lidX <localSizes [0]; ++lidX) {

uint tidX = localSizes [0] * groupIDs [0] + lidX;

...

inter[lidY*blockWidth + lidX] = ...

barrier(CLK_LOCAL_MEM_FENCE );

float acc = 0.0f;

for(uint k=0; k < blockWidth; k++)

{

uint index1 = lidX*blockWidth + k;

uint index2 = (inverse) ? lidY*blockWidth + k :

k*blockWidth + lidY;

acc += inter[index1] * dct8x8[index2 ];

}

output[tidY*width + tidX] = acc;

} } }� �10

Page 11: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Group Kernel (2D, Scalar, Inlined, Optimized (2))� �void groupKernel(TA argStruct , int* groupIDs ,

int* globalSizes , int* localSizes)

{

for (int lidY =0; lidY <localSizes [1]; ++lidY) {

uint LIC = lidY*blockWidth;

for (int lidX =0; lidX <localSizes [0]; ++lidX) {

uint tidX = localSizes [0] * groupIDs [0] + lidX;

...

inter[LIC + lidX] = ...

barrier(CLK_LOCAL_MEM_FENCE );

float acc = 0.0f;

for(uint k=0; k < blockWidth; k++)

{

uint index1 = lidX*blockWidth + k;

uint index2 = (inverse) ? LIC + k :

k*blockWidth + lidY;

acc += inter[index1] * dct8x8[index2 ];

}

output[tidY*width + tidX] = acc;

} } }� �11

Page 12: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Barrier Synchronization� �void groupKernel(TA argStruct , int* groupIDs ,

int* globalSizes , int* localSizes)

{

for (int lidY =0; lidY <localSizes [1]; ++lidY) {

uint LIC = lidY*blockWidth;

for (int lidX =0; lidX <localSizes [0]; ++lidX) {

uint tidX = localSizes [0] * groupIDs [0] + lidX;

...

inter[LIC + lidX] = ...

barrier(CLK_LOCAL_MEM_FENCE );

float acc = 0.0f;

for(uint k=0; k < blockWidth; k++)

{

uint index1 = lidX*blockWidth + k;

uint index2 = (inverse) ? LIC + k :

k*blockWidth + lidY;

acc += inter[index1] * dct8x8[index2 ];

}

output[tidY*width + tidX] = acc;

} } }� �12

Page 13: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Barrier Synchronization: Example

a

b

c

d

e

a1

a2

b

c1

c2

d1

d2

e

a1 a2

b

c1

c2 d2

d1 b e

c1

F1

next: F2

F2

next: F3

F3

next: F4

F4

next: F3

return

13

Page 14: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Barrier Synchronization: Example

a

b

c

d

e

a1

a2

b

c1

c2

d1

d2

e

a1

a2

b

c1

c2

d1

d2

e

a1 a2

b

c1

c2 d2

d1 b e

c1

F1

next: F2

F2

next: F3

F3

next: F4

F4

next: F3

return

13

Page 15: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Barrier Synchronization: Example

a

b

c

d

e

a1

a2

b

c1

c2

d1

d2

e

a1

a2

b

c1

c2

d1

d2

e

a1

a2

b

c1

c2 d2

d1 b e

c1

F1

next: F2

F2

next: F3

F3

next: F4

F4

next: F3

return

13

Page 16: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Barrier Synchronization: Example

a

b

c

d

e

a1

a2

b

c1

c2

d1

d2

e

a1

a2

b

c1

c2

d1

d2

e

a1 a2

b

c1

c2 d2

d1 b e

c1

F1

next: F2

F2

next: F3

F3

next: F4

F4

next: F3

return

13

Page 17: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Barrier Synchronization: Example

a

b

c

d

e

a1

a2

b

c1

c2

d1

d2

e

a1

a2

b

c1

c2

d1

d2

e

a1 a2

b

c1

c2

d2

d1

b e

c1

F1

next: F2

F2

next: F3

F3

next: F4

F4

next: F3

return

13

Page 18: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Barrier Synchronization: Example

a

b

c

d

e

a1

a2

b

c1

c2

d1

d2

e

a1 a2

b

c1

c2 d2

d1 b e

c1

F1

next: F2

F2

next: F3

F3

next: F4

F4

next: F3

return

13

Page 19: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Group Kernel (1D, Scalar, Barrier Synchronization)

� �void groupKernel(TA argStruct , int groupID ,

int globalSizes , int localSize , ...)

{

void* data[localSize] = alloc(localSize*liveValSize );

int next = BARRIER_BEGIN;

while (true) {

switch (next) {

case BARRIER_BEGIN:

for (int i=0; i<localSize; ++i)

next = F1(argStruct , tid , ..., &data[i]); // B2

break;

...

case B4:

for (int i=0; i<localSize; ++i)

next = F4(tid , ..., &data[i]); // B3 or END

break;

case BARRIER_END: return;

} } }� �14

Page 20: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

OpenCL: Exploiting Parallelism on CPUs

CPU (1 core):All threads run sequentially

0

1

...

14

15

CPU (4 cores):Each core executes 1 thread

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

CPU (4 cores, SIMD width 4): Each core executes 4 threads

0 1 2 3 4 5 6 7 8 . . . 11 12 . . . 15

15

Page 21: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

OpenCL: Exploiting Parallelism on CPUs

CPU (1 core):All threads run sequentially

0

1

...

14

15

CPU (4 cores):Each core executes 1 thread

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

CPU (4 cores, SIMD width 4): Each core executes 4 threads

0 1 2 3 4 5 6 7 8 . . . 11 12 . . . 15

15

Page 22: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

OpenCL: Exploiting Parallelism on CPUs

CPU (1 core):All threads run sequentially

0

1

...

14

15

CPU (4 cores):Each core executes 1 thread

0 1 2 3

4 5 6 7

8 9 10 11

12 13 14 15

CPU (4 cores, SIMD width 4): Each core executes 4 threads

0 1 2 3 4 5 6 7 8 . . . 11 12 . . . 15

15

Page 23: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Group Kernel (2D, SIMD)

� �void groupKernel(TA argStruct , int* groupIDs ,

int* globalSizes , int* localSizes)

{

for (int lidY =0; lidY <localSizes [1]; ++lidY) {

for (int lidX =0; lidX <localSizes [0]; lidX +=4) {

__m128i lidXV = <lidX ,lidX+1,lidX+2,lidX+3>;

simdKernel (argStruct , lidXV , lidY ,

groupIDs , globalSizes ,

localSizes ); // to be inlined

} } }� �Whole-Function Vectorization (WFV) of kernel code

New kernel computes 4 “threads” at once using SIMD instruction set

Challenge: diverging control flow

16

Page 24: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Diverging Control Flow

a

b

c d

e

f

a

bc

de

f

Thread Trace

1 a b c e f

2 a b d e f

3 a b c e b c e f

4 a b c e b d e f

Different threads execute different code paths

Execute everything, mask out results of inactive threads(using predication, blending)

Control flow to data flow conversion on ASTs [Allen et al. POPL’83]

Whole-Function Vectorization on SSA CFGs [K & H CGO’11]

17

Page 25: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Diverging Control Flow

a

b

c d

e

f

a

bc

de

f

Thread Trace

1 a b c d e b c d e f

2 a b c d e b c d e f

3 a b c d e b c d e f

4 a b c d e b c d e f

Different threads execute different code paths

Execute everything, mask out results of inactive threads(using predication, blending)

Control flow to data flow conversion on ASTs [Allen et al. POPL’83]

Whole-Function Vectorization on SSA CFGs [K & H CGO’11]17

Page 26: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Diverging Control Flow

a

b

c d

e

f

a

bc

de

f

Thread Trace

1 a b c d e b c d e f

2 a b c d e b c d e f

3 a b c d e b c d e f

4 a b c d e b c d e f

Overhead for maintaining & updating of predicates

Overhead for operations with side-effects (e.g. load/store/call)

Expensive but rarely executed paths are now always executed

Linearization increases register pressure + more spilling

Works well for kernels with mostly straight-line code17

Page 27: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

DCT Kernel: Non-Divergent Control Flow� �__kernel void DCT(__global float * output ,

__global float * input ,

__global float * dct8x8 ,

__local float * inter ,

const uint width ,

const uint blockWidth ,

const uint inverse)

{

uint tidX = get_global_id (0);

...

inter[lidY*blockWidth + lidX] = ...

barrier(CLK_LOCAL_MEM_FENCE );

float acc = 0.0f;

for(uint k=0; k < blockWidth; k++)

{

uint index1 = lidX*blockWidth + k;

uint index2 = (inverse) ? lidY*blockWidth + k :

k*blockWidth + lidY;

acc += inter[index1] * dct8x8[index2 ];

}

output[tidY*width + tidX] = acc;

} // Compiled to LLVM bitcode.� �18

Page 28: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Non-Divergent Control Flow

Idea: optimize cases where threads do not diverge

a

b

c d

e

f

a

bc

de

f

Thread Trace

1 a b c e f

2 a b c e f

3 a b c e b d e f

4 a b c e b d e f

19

Page 29: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Non-Divergent Control Flow

Idea: optimize cases where threads do not diverge

a

b

c d

e

f

a

bc

de

f

Thread Trace

1 a b c e b d e f

2 a b c e b d e f

3 a b c e b d e f

4 a b c e b d e f

19

Page 30: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Non-Divergent Control Flow

Idea: optimize cases where threads do not diverge

a

b

c d

e

f

a

bc

de

f

Thread Trace

1 a b c e b d e f

2 a b c e b d e f

3 a b c e b d e f

4 a b c e b d e f

Option 1: Insert dynamic predicate-tests & branches to skip paths

I “Branch on superword condition code” (BOSCC) [Shin et al. PACT’07]

I Additional overhead for dynamic test

I Does not help against increased register pressure

19

Page 31: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Non-Divergent Control Flow

Idea: optimize cases where threads do not diverge

a

b

c d

e

f

a

b

c d

e

f

u

v

Thread Trace

1 a b c e b d e f

2 a b c e b d e f

3 a b c e b d e f

4 a b c e b d e f

Option 2: Statically prove non-divergence of certain blocks

I Non-divergent blocks can be excluded from linearization

I Less executed code, less register pressure

I More conservative than dynamic test + exploit both!

19

Page 32: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Uniform/Varying Branches

bu

Either all threads entering b goleft or right

if (blockWidth % 2 == 0)

{

...

}

for(uint k=0; k < blockWidth; k++)

{

...

}

bv

From the p+q threads enteringb, p go left, q go right

if (tid % 2 == 0)

{

...

}

for(uint k=0; k < tid; k++)

{

...

}

20

Page 33: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

When does a block diverge?Informally

a

b c

d e

f

g

v

v u

A block b is divergent if:

b might execute less (not provably 0)threads than its predecessor.That is: it is a successor of a varyingbranch

Two disjoint paths from the samevarying branch rejoin at b

(Additional criterion for loops)

21

Page 34: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

CFG Linearization w/ Non-Divergent Blocks: Example

a

b c

d e

f

g

v

v u

(a)

a

c

e

b

d

f

g

(b)

a

c

e

b

d

f

g

(c)

a

c

b e

d

f

g

(d)

a

c

b e

d

f

g

(e)

(a) Original CFG

(b) Topological order (by data dependencies)

(c) Naıve: Rewire all edges to next block + all blocks are always executed

(d) Invalid: Edges to/from non-divergent block remain + b and d can be skipped

(e) Valid: Rewire edges to divergent blocks to next in list + only e can be skipped

22

Page 35: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

CFG Linearization w/ Non-Divergent Blocks: Example

a

b c

d e

f

g

v

v u

(a)

a

c

e

b

d

f

g

(b)

a

c

e

b

d

f

g

(c)

a

c

b e

d

f

g

(d)

a

c

b e

d

f

g

(e)

(a) Original CFG

(b) Topological order (by data dependencies)

(c) Naıve: Rewire all edges to next block + all blocks are always executed

(d) Invalid: Edges to/from non-divergent block remain + b and d can be skipped

(e) Valid: Rewire edges to divergent blocks to next in list + only e can be skipped

22

Page 36: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

CFG Linearization w/ Non-Divergent Blocks: Example

a

b c

d e

f

g

v

v u

(a)

a

c

e

b

d

f

g

(b)

a

c

e

b

d

f

g

(c)

a

c

b e

d

f

g

(d)

a

c

b e

d

f

g

(e)

(a) Original CFG

(b) Topological order (by data dependencies)

(c) Naıve: Rewire all edges to next block + all blocks are always executed

(d) Invalid: Edges to/from non-divergent block remain + b and d can be skipped

(e) Valid: Rewire edges to divergent blocks to next in list + only e can be skipped

22

Page 37: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

CFG Linearization w/ Non-Divergent Blocks: Example

a

b c

d e

f

g

v

v u

(a)

a

c

e

b

d

f

g

(b)

a

c

e

b

d

f

g

(c)

a

c

b e

d

f

g

(d)

a

c

b e

d

f

g

(e)

(a) Original CFG

(b) Topological order (by data dependencies)

(c) Naıve: Rewire all edges to next block + all blocks are always executed

(d) Invalid: Edges to/from non-divergent block remain + b and d can be skipped

(e) Valid: Rewire edges to divergent blocks to next in list + only e can be skipped

22

Page 38: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

CFG Linearization w/ Non-Divergent Blocks: Example

a

b c

d e

f

g

v

v u

(a)

a

c

e

b

d

f

g

(b)

a

c

e

b

d

f

g

(c)

a

c

b e

d

f

g

(d)

a

c

b e

d

f

g

(e)

(a) Original CFG

(b) Topological order (by data dependencies)

(c) Naıve: Rewire all edges to next block + all blocks are always executed

(d) Invalid: Edges to/from non-divergent block remain + b and d can be skipped

(e) Valid: Rewire edges to divergent blocks to next in list + only e can be skipped

22

Page 39: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Evaluation I: WFV vs. Sequential Execution Comparison

Application Naıve UniVal BOSCC UniCF

BitonicSort 3.0 3.2 3.3 3.2BlackScholes 3.9 4.1 4.1 4.1DCT 0.67 0.85 0.85 1.78FastWalshTransform 0.74 0.73 0.73 0.73FloydWarshall 0.11 0.12 0.13 0.12Histogram 0.92 1.08 1.07 1.24Mandelbrot 0.51 2.4 2.4 2.4MatrixTranspose 0.97 1.44 1.44 1.44NBody 1.8 2.67 2.67 3.64

AVG 1.4 1.84 1.85 2.07

SIMD width 4, median of 100 iterations, no warm-up, confidence level 95%

Naıve WFV is often inferior to sequential execution

Dynamic analysis (BOSCC) has almost no effect for these benchmarks

Static analysis (UniCF) is beneficial for suitable kernels

Kernels dominated by random memory access are not suited for WFV

23

Page 40: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Evaluation I: WFV vs. Sequential Execution Comparison

Application Naıve UniVal BOSCC UniCF

BitonicSort 3.0 3.2 3.3 3.2BlackScholes 3.9 4.1 4.1 4.1DCT 0.67 0.85 0.85 1.78FastWalshTransform 0.74 0.73 0.73 0.73FloydWarshall 0.11 0.12 0.13 0.12Histogram 0.92 1.08 1.07 1.24Mandelbrot 0.51 2.4 2.4 2.4MatrixTranspose 0.97 1.44 1.44 1.44NBody 1.8 2.67 2.67 3.64

AVG 1.4 1.84 1.85 2.07

SIMD width 4, median of 100 iterations, no warm-up, confidence level 95%

Naıve WFV is often inferior to sequential execution

Dynamic analysis (BOSCC) has almost no effect for these benchmarks

Static analysis (UniCF) is beneficial for suitable kernels

Kernels dominated by random memory access are not suited for WFV

23

Page 41: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Evaluation I: WFV vs. Sequential Execution Comparison

Application Naıve UniVal BOSCC UniCF

BitonicSort 3.0 3.2 3.3 3.2BlackScholes 3.9 4.1 4.1 4.1DCT 0.67 0.85 0.85 1.78FastWalshTransform 0.74 0.73 0.73 0.73FloydWarshall 0.11 0.12 0.13 0.12Histogram 0.92 1.08 1.07 1.24Mandelbrot 0.51 2.4 2.4 2.4MatrixTranspose 0.97 1.44 1.44 1.44NBody 1.8 2.67 2.67 3.64

AVG 1.4 1.84 1.85 2.07

SIMD width 4, median of 100 iterations, no warm-up, confidence level 95%

Naıve WFV is often inferior to sequential execution

Dynamic analysis (BOSCC) has almost no effect for these benchmarks

Static analysis (UniCF) is beneficial for suitable kernels

Kernels dominated by random memory access are not suited for WFV

23

Page 42: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Evaluation I: WFV vs. Sequential Execution Comparison

Application Naıve UniVal BOSCC UniCF

BitonicSort 3.0 3.2 3.3 3.2BlackScholes 3.9 4.1 4.1 4.1DCT 0.67 0.85 0.85 1.78FastWalshTransform 0.74 0.73 0.73 0.73FloydWarshall 0.11 0.12 0.13 0.12Histogram 0.92 1.08 1.07 1.24Mandelbrot 0.51 2.4 2.4 2.4MatrixTranspose 0.97 1.44 1.44 1.44NBody 1.8 2.67 2.67 3.64

AVG 1.4 1.84 1.85 2.07

SIMD width 4, median of 100 iterations, no warm-up, confidence level 95%

Naıve WFV is often inferior to sequential execution

Dynamic analysis (BOSCC) has almost no effect for these benchmarks

Static analysis (UniCF) is beneficial for suitable kernels

Kernels dominated by random memory access are not suited for WFV

23

Page 43: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Evaluation I: WFV vs. Sequential Execution Comparison

Application Naıve UniVal BOSCC UniCF

BitonicSort 3.0 3.2 3.3 3.2BlackScholes 3.9 4.1 4.1 4.1DCT 0.67 0.85 0.85 1.78FastWalshTransform 0.74 0.73 0.73 0.73FloydWarshall 0.11 0.12 0.13 0.12Histogram 0.92 1.08 1.07 1.24Mandelbrot 0.51 2.4 2.4 2.4MatrixTranspose 0.97 1.44 1.44 1.44NBody 1.8 2.67 2.67 3.64

AVG 1.4 1.84 1.85 2.07

SIMD width 4, median of 100 iterations, no warm-up, confidence level 95%

Naıve WFV is often inferior to sequential execution

Dynamic analysis (BOSCC) has almost no effect for these benchmarks

Static analysis (UniCF) is beneficial for suitable kernels

Kernels dominated by random memory access are not suited for WFV23

Page 44: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Evaluation II: WFVOpenCL vs. Intel/AMD (milliseconds)

Application WFVOpenCL Intel AMD Speedup vs Intel

BitonicSort 164 1,170 47,271 7.13×BlackScholes 241 329 717 1.37×DCT 201 350 693 1.74×FastWalshTransform 4,944 6,661 8,601 1.35×FloydWarshall 934(148*) 525* 471 0.56×(3.55×*)Histogram 387 1,178 527 3.07×Mandelbrot 632 1,930 29,045 3.05×MatrixTranspose 1,072 2,933 10,748 2.74×NBody 343 676 1,253 1.97×

4 cores, SIMD width 4, median of 100 iterations, no warm-up, confidence level 95%

Intel OpenCL SDK v1.1 / AMD APP SDK v2.5

Average speedup: 2.5× (Intel), 40× (AMD)

*WFV disabled – Intel driver does not vectorize FloydWarshall

24

Page 45: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

LLVM: Benefits and Drawbacks

We heavily rely on JIT code generator + no disappointment!

LLVM IR allows convenient expression of vector computations

I Vector-select and type legalization

MOVMASK still requires an intrinsic

Would be great: a way to express predication in IR

25

Page 46: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Outlook

More optimizations for WFV

Integration of WFV into LLVM mainline?

I Should integrate nicely with Hal’s BasicBlock vectorization

I Combine with loop dependency analysis / Polly for “classic” loopvectorization

Support for architectures w/ predicated execution (e.g. LRBni)

26

Page 47: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

ConclusionOpenCL benefits from “group kernel”-based implementation:

I Optimize uniform expressions & access to tid etc.

I Enable continuation-based barrier synchronization

OpenCL benefits from both multi-threading and WFV on CPUs

Divergence analysis improves WFV:

I Reduce amount of executed code

I Reduce register pressure

I Reduce overhead for maintaining & updating of predicates

Evaluation shows importance of advanced vectorization techniques

Sources available: https://github.com/karrenberg

Thank You!Questions?

27

Page 48: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

ConclusionOpenCL benefits from “group kernel”-based implementation:

I Optimize uniform expressions & access to tid etc.

I Enable continuation-based barrier synchronization

OpenCL benefits from both multi-threading and WFV on CPUs

Divergence analysis improves WFV:

I Reduce amount of executed code

I Reduce register pressure

I Reduce overhead for maintaining & updating of predicates

Evaluation shows importance of advanced vectorization techniques

Sources available: https://github.com/karrenberg

Thank You!Questions?

27

Page 49: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

28

Page 50: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

CFG Linearization w/ Non-Divergent Blocks

Combine divergent blocks to divergent regions with DFS:

I Non-uniform branch found: create new region, set as active

I Post-dominator of region found: finish region, set last unfinished one asactive

I Add divergent blocks to active region

I Merge overlapping regions

Linearize regions recursively (inner before outer regions):

I Order blocks topologically by data dependencies (inner regions treatedas single blocks)

I Schedule blocks in this order by visiting all outgoing edges:

F Rewire all edges that target a divergent block

F New target: next divergent, unscheduled block of region

29

Page 51: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

CFG Linearization w/ Non-Divergent Blocks: Example

a

b c

d e

f

g

v

v u

(a)

a

c

e

b

d

f

g

(b)

a

c

e

b

d

f

g

(c)

a

c

b e

d

f

g

(d)

a

c

b e

d

f

g

(e)

(a) Original CFG

(b) Topological order (by data dependencies)

(c) Naıve: Rewire all edges to next block + all blocks are always executed

(d) Invalid: Edges to/from non-divergent block remain + b and d can be skipped

(e) Valid: Rewire edges to divergent blocks to next in list + only e can be skipped

30

Page 52: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

CFG Linearization w/ Non-Divergent Blocks: Example

a

b c

d e

f

g

v

v u

(a)

a

c

e

b

d

f

g

(b)

a

c

e

b

d

f

g

(c)

a

c

b e

d

f

g

(d)

a

c

b e

d

f

g

(e)

(a) Original CFG

(b) Topological order (by data dependencies)

(c) Naıve: Rewire all edges to next block + all blocks are always executed

(d) Invalid: Edges to/from non-divergent block remain + b and d can be skipped

(e) Valid: Rewire edges to divergent blocks to next in list + only e can be skipped

30

Page 53: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

CFG Linearization w/ Non-Divergent Blocks: Example

a

b c

d e

f

g

v

v u

(a)

a

c

e

b

d

f

g

(b)

a

c

e

b

d

f

g

(c)

a

c

b e

d

f

g

(d)

a

c

b e

d

f

g

(e)

(a) Original CFG

(b) Topological order (by data dependencies)

(c) Naıve: Rewire all edges to next block + all blocks are always executed

(d) Invalid: Edges to/from non-divergent block remain + b and d can be skipped

(e) Valid: Rewire edges to divergent blocks to next in list + only e can be skipped

30

Page 54: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

CFG Linearization w/ Non-Divergent Blocks: Example

a

b c

d e

f

g

v

v u

(a)

a

c

e

b

d

f

g

(b)

a

c

e

b

d

f

g

(c)

a

c

b e

d

f

g

(d)

a

c

b e

d

f

g

(e)

(a) Original CFG

(b) Topological order (by data dependencies)

(c) Naıve: Rewire all edges to next block + all blocks are always executed

(d) Invalid: Edges to/from non-divergent block remain + b and d can be skipped

(e) Valid: Rewire edges to divergent blocks to next in list + only e can be skipped

30

Page 55: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

CFG Linearization w/ Non-Divergent Blocks: Example

a

b c

d e

f

g

v

v u

(a)

a

c

e

b

d

f

g

(b)

a

c

e

b

d

f

g

(c)

a

c

b e

d

f

g

(d)

a

c

b e

d

f

g

(e)

(a) Original CFG

(b) Topological order (by data dependencies)

(c) Naıve: Rewire all edges to next block + all blocks are always executed

(d) Invalid: Edges to/from non-divergent block remain + b and d can be skipped

(e) Valid: Rewire edges to divergent blocks to next in list + only e can be skipped

30

Page 56: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Examples

a

b c

d e f

g h

i

u

v u

u

a

b c

e f

j

v

u u

u u

31

Page 57: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Retaining Control Flow: Complex Example

a

b c

d e f

g h

i

u

v u

u

a

b c

d e f

g h

i

u

v u

u

a

b c

d

e f

g h

i

32

Page 58: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Retaining Control Flow: Complex Example

a

b c

d e f

g h

i

u

v u

u

a

b c

d e f

g h

i

u

v u

u

a

b c

d

e f

g h

i

32

Page 59: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Retaining Control Flow: Complex Example

a

b c

d e f

g h

i

u

v u

u

a

b c

d e f

g h

i

u

v u

u

a

b c

d

e f

g h

i

32

Page 60: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Retaining Control Flow: Loop Example

a

b

c d

e f

h

i

u

u

v

a

b

c d

e f

h

i

u

u

v

a

b

c d

e f

h

i

u

u

v

33

Page 61: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Retaining Control Flow: Loop Example

a

b

c d

e f

h

i

u

u

v

a

b

c d

e f

h

i

u

u

v

a

b

c d

e f

h

i

u

u

v

33

Page 62: Improving Performance of OpenCL on CPUs€¦ · Improving Performance of OpenCL on CPUs Ralf Karrenberg karrenberg@cs.uni-saarland.de Sebastian Hack hack@cs.uni-saarland.de European

Retaining Control Flow: Loop Example

a

b

c d

e f

h

i

u

u

v

a

b

c d

e f

h

i

u

u

v

a

b

c d

e f

h

i

u

u

v

33