the pocl kernel compiler

The pocl Kernel Compiler

Clay Chang

CPU versus GPU

CPU:
• Sophisticated Control
• Branch Prediction
• Out-of-Order Execution
• Large Cache

GPU:
• Little Control
• No or Limited Branch Prediction
• Simple Execution
• Small or no cache
• Lots of ALUs

OpenCL as the Portable API

Why OpenCL for CPU

Multi-core CPUs are everywhere, e.g. the MediaTek Tri-Cluster 10-core SoC

The mobile GPU is already busy: ~25% is occupied by the system UI in Android

Not every program runs well on a GPU, e.g. kernels with heavy branch divergence

OpenCL makes it easy to exploit multi-core and SIMD. Imagine the alternative: writing pthreads + SIMD in assembly or intrinsics

Running OpenCL Kernels on CPU

One thread per work-item?
• Thousands of threads being created
• Context-switching problems
• How to synchronize threads?

How about running one work-group on a CPU thread?

Related Works

Twin peaks: a software platform for heterogeneous computing on general-purpose and graphics processors.

MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs

Clover (http://people.freedesktop.org/~steckdenis/clover)

Shamrock (https://git.linaro.org/gpgpu/shamrock.git)

What is pocl

• POrtable Computing Language
• An efficient implementation of the OpenCL standard which can be easily adapted for new targets
• http://github.com/pocl/pocl
• Main developer: Pekka Jääskeläinen from Tampere University of Technology
• Supported architectures: CPU, tce, cellspu, HSA
• Current version: 0.11


Components in pocl

The pocl Kernel Compiler

OpenCL Kernel Source → Clang / LLVM, at clBuildProgram(…) → Single Work-item Kernel

Single Work-item Kernel → pocl Kernel Compiler, at clEnqueueNDRangeKernel(…, local_size, …) → Transformed Kernel

pocl Compilation Chain

1. Compile the kernel (OpenCL C) with Clang
2. Link with target-specific built-in functions, such as sin, cos, geom_distance, etc.
3. Work-group function generation / parallel work-item loop creation
4. Backend optimizations (auto-vectorization, …) and code generation

Work-group Function Generation

clEnqueueNDRangeKernel(…, group_size, …)

work_group_function() {
    for (int i = 0; i < work_group_size; i++) {    /* WI-loop */
        kernel body (single work-item)
    }
}

What if there are barriers?
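The WI-loop idea can be written out in plain C as a minimal sketch. The vector-add kernel and the names vec_add_work_item and vec_add_work_group are hypothetical, chosen only for illustration; the code pocl actually generates is LLVM IR and looks different.

```c
#include <stddef.h>

/* Hypothetical stand-in for a compiled single work-item kernel.
   local_id plays the role of get_local_id(0). */
static void vec_add_work_item(size_t local_id, const float *a,
                              const float *b, float *out)
{
    out[local_id] = a[local_id] + b[local_id];
}

/* Work-group function: the WI-loop runs every work-item of one
   work-group sequentially on the calling CPU thread. */
void vec_add_work_group(size_t local_size, const float *a,
                        const float *b, float *out)
{
    for (size_t local_id = 0; local_id < local_size; local_id++)
        vec_add_work_item(local_id, a, b, out);
}
```

Because the kernel has no barrier, one loop around the whole body is enough, and the loop is a natural candidate for auto-vectorization.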

Semantics of barrier Synchronization

OpenCL 1.2 rev19 p.30:

“… the work-group barrier must be encountered by all work-items of a work-group executing the kernel or by none at all…”

if (tid % 2) { …. barrier(); …}

Kernel Without barriers

• A node in a CFG is a basic block (BB)
• BB: branchless sequence of instructions
• A BB is executed as an entity, from the first instruction to the last
• An edge in a CFG represents a branch in the control flow
• Multiple exit BBs are allowed
• The pocl kernel compiler generates a WI-loop around the CFG

Types of Barrier

Unconditional barriers: a barrier that dominates the exit node

Conditional barriers: barriers placed inside
• if – else
• for-loop (b-loop)

Kernel with unconditional barriers

pocl Kernel Compiler creates WI-loops before and after the barrier

This forms an algorithm:

Algorithm 1: Parallel region formation when the kernel does not contain conditional barriers.

Step 1: Ensure there is an implicit barrier at the entry and the exit nodes of the kernel function and that there is only one exit node in the kernel function. This is a safe starting condition as it does not affect any execution order restrictions.

Step 2: Perform a depth-first-search traversal of the kernel CFG. Ignore the possible back edges to avoid infinite loops and to include the loops of the kernel to the parallel region.

Step 3: When encountering a barrier, create a parallel region by calling CreateSubgraph for the previously encountered barrier and the newly found barrier.
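The effect of splitting a kernel into parallel regions at an unconditional barrier can be sketched in C. The example assumes a hypothetical kernel that copies data into local memory, hits a barrier, and then writes the data back reversed; reverse_work_group and MAX_LOCAL_SIZE are illustrative names, not pocl identifiers.

```c
#include <stddef.h>

#define MAX_LOCAL_SIZE 64  /* assumed upper bound on the local size */

/* Hypothetical OpenCL C kernel being modelled:
     tmp[lid] = data[lid];
     barrier(CLK_LOCAL_MEM_FENCE);
     data[lid] = tmp[n - 1 - lid];
   The barrier splits the body into two parallel regions, and each
   region gets its own WI-loop. Requires n <= MAX_LOCAL_SIZE. */
void reverse_work_group(size_t n, int *data)
{
    int tmp[MAX_LOCAL_SIZE];  /* stands in for __local memory */

    /* Parallel region 1: entry barrier to the explicit barrier. */
    for (size_t lid = 0; lid < n; lid++)
        tmp[lid] = data[lid];

    /* Barrier: the first loop has finished for all work-items. */

    /* Parallel region 2: explicit barrier to the exit barrier. */
    for (size_t lid = 0; lid < n; lid++)
        data[lid] = tmp[n - 1 - lid];
}
```

Running the first loop to completion before the second one starts is what gives the barrier its meaning in a single-threaded execution of the work-group.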


A CFG with Two Conditional barriers

Algorithm 2: Tail duplication for parallel region formation in the case of conditional barriers in the kernel.

Step 1: Perform a depth-first traversal of the CFG, starting at the entry node.

Step 2: Each time a new, unprocessed conditional barrier is found, use CreateSubgraph to produce a sub-CFG from that barrier to the next exit node (duplicate the tail).

Step 3: Replicate the created sub-CFG using ReplicateCFG. In order to reduce code duplication, merge the tails from the same unconditional barrier paths. That is, replicate the basic blocks only after the last barrier that is unconditionally reachable from the one at hand.

Step 4: Start the algorithm at each of the found barrier successors.

A CFG with Two Conditional barriers – After Tail Duplication

Easier for WI-loop creation!

“Peel” the First Loop Iteration

No more ambiguous branches in WI-loops!
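Tail duplication and the peeled branch decision can be pictured with a small C sketch. It assumes a hypothetical kernel with one conditional barrier whose condition is a kernel argument (use_neighbor), so it is trivially the same for every work-item; evaluating it once here plays the role of the peeled first iteration. All names are illustrative.

```c
#include <stddef.h>

#define MAX_LOCAL_SIZE 64  /* assumed upper bound on the local size */

/* Hypothetical OpenCL C kernel being modelled:
     int v = in[lid];
     if (use_neighbor) {
         tmp[lid] = v;
         barrier(CLK_LOCAL_MEM_FENCE);
         v = tmp[(lid + 1) % n];
     }
     out[lid] = v * 2;    <- the tail
   Requires n <= MAX_LOCAL_SIZE. */
void shift_or_copy_work_group(size_t n, int use_neighbor,
                              const int *in, int *out)
{
    int tmp[MAX_LOCAL_SIZE];  /* stands in for __local memory */

    /* The branch is decided once for the whole work-group. */
    if (use_neighbor) {
        /* Region before the conditional barrier. */
        for (size_t lid = 0; lid < n; lid++)
            tmp[lid] = in[lid];
        /* Region after the barrier, with its own copy of the tail. */
        for (size_t lid = 0; lid < n; lid++) {
            int v = tmp[(lid + 1) % n];
            out[lid] = v * 2;
        }
    } else {
        /* Path without a barrier, with a duplicated copy of the tail. */
        for (size_t lid = 0; lid < n; lid++) {
            int v = in[lid];
            out[lid] = v * 2;
        }
    }
}
```

After duplication, every WI-loop covers a region with a single entry and a single exit, so no loop has to guess which barrier the work-items will reach next.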

Barriers in Kernel Loops

Insert implicit barriers at:
1. The end of the loop pre-header block
2. Before the loop latch branch
3. After the PhiNode region of the loop header block
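A loop that contains a barrier (a b-loop) ends up with the kernel loop on the outside and a WI-loop inside each iteration. The C sketch below assumes a hypothetical tree-reduction kernel and, as a simplification, keeps the loop counter s as one shared variable instead of per-work-item state; sum_work_group is an illustrative name.

```c
#include <stddef.h>

/* Hypothetical OpenCL C kernel being modelled:
     for (s = n / 2; s > 0; s >>= 1) {
         if (lid < s) buf[lid] += buf[lid + s];
         barrier(CLK_LOCAL_MEM_FENCE);
     }
   n is assumed to be a power of two. Returns the sum in buf[0]. */
int sum_work_group(size_t n, int *buf)
{
    /* Kernel loop: executed once per work-group, not per work-item. */
    for (size_t s = n / 2; s > 0; s >>= 1) {
        /* WI-loop for the parallel region inside the loop body. */
        for (size_t lid = 0; lid < n; lid++) {
            if (lid < s)
                buf[lid] += buf[lid + s];
        }
        /* Barrier: all work-items finished this iteration. */
    }
    return buf[0];
}
```

Within one iteration the work-items read indices at or above s and write indices below s, so running them one after another gives the same result as running them concurrently.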

Horizontal Inner-Loop Parallelization

More parallelization after loop interchange

blockWidth unknown until runtime
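The loop interchange can be sketched in C as follows. The column-sum kernel and both function names are hypothetical; blockWidth is taken from the slide and is only known at run time. The point is that, after interchange, the innermost loop runs across work-items over contiguous memory, which is friendlier to a vectorizer.

```c
#include <stddef.h>

#define MAX_LOCAL_SIZE 64  /* assumed upper bound on the local size */

/* Before interchange: WI-loop outside, kernel loop inside. */
void column_sum_wi_outer(size_t n, size_t blockWidth,
                         const int *in, int *out)
{
    for (size_t lid = 0; lid < n; lid++) {
        int acc = 0;
        for (size_t k = 0; k < blockWidth; k++)
            acc += in[k * n + lid];
        out[lid] = acc;
    }
}

/* After interchange: kernel loop outside, WI-loop inside.
   The accumulator becomes one slot per work-item.
   Requires n <= MAX_LOCAL_SIZE. */
void column_sum_wi_inner(size_t n, size_t blockWidth,
                         const int *in, int *out)
{
    int acc[MAX_LOCAL_SIZE];

    for (size_t lid = 0; lid < n; lid++)
        acc[lid] = 0;
    for (size_t k = 0; k < blockWidth; k++)
        for (size_t lid = 0; lid < n; lid++)  /* horizontal, stride 1 */
            acc[lid] += in[k * n + lid];
    for (size_t lid = 0; lid < n; lid++)
        out[lid] = acc[lid];
}
```

Both versions compute the same result; only the loop nesting order differs.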

Handling of Kernel Variables

1. There will be two parallel regions
2. a's lifetime lies only within the first parallel region (it is a temporary variable)
3. b's lifetime spans both parallel regions

Context Array
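A context array can be sketched in C like this. The kernel below is a hypothetical one written to match the a / b description: a is a temporary inside the first region, while b is defined before the barrier and used after it, so it needs one slot per work-item. The names b_ctx and context_work_group are illustrative, not pocl identifiers.

```c
#include <stddef.h>

#define MAX_LOCAL_SIZE 64  /* assumed upper bound on the local size */

/* Hypothetical OpenCL C kernel being modelled:
     int a = in[lid] * 2;
     int b = a + 1;
     scratch[lid] = a;
     barrier(CLK_LOCAL_MEM_FENCE);
     out[lid] = b + scratch[(lid + 1) % n];
   Requires n <= MAX_LOCAL_SIZE. */
void context_work_group(size_t n, const int *in, int *out)
{
    int scratch[MAX_LOCAL_SIZE];  /* stands in for __local memory */
    int b_ctx[MAX_LOCAL_SIZE];    /* context array: one b per work-item */

    /* Parallel region 1: a stays an ordinary scalar. */
    for (size_t lid = 0; lid < n; lid++) {
        int a = in[lid] * 2;
        b_ctx[lid] = a + 1;       /* b is saved for the next region */
        scratch[lid] = a;
    }

    /* Parallel region 2: b is restored from the context array. */
    for (size_t lid = 0; lid < n; lid++)
        out[lid] = b_ctx[lid] + scratch[(lid + 1) % n];
}
```

Only variables that are live across a region boundary need a slot in the array; temporaries such as a can stay in registers.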

References

Pekka Jääskeläinen, Carlos Sánchez de La Lama, Erik Schnetter, Kalle Raiskila, Jarmo Takala, Heikki Berg: "pocl: A Performance-Portable OpenCL Implementation" in International Journal of Parallel Programming, Springer, August 2014.

