![Page 1: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/1.jpg)
The pocl Kernel Compiler
Clay Chang
![Page 2: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/2.jpg)
CPU versus GPU
CPU
• Sophisticated control
• Branch prediction
• Out-of-order execution
• Large cache
GPU
• Little control
• No or limited branch prediction
• Simple execution
• Small or no cache
• Lots of ALUs
![Page 3: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/3.jpg)
OpenCL as the Portable API
![Page 4: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/4.jpg)
Why OpenCL for CPU
• Multi-core CPUs are already out there
  • E.g. MediaTek tri-cluster 10-core SoC
• The mobile GPU is already busy
  • ~25% occupied by the system UI in Android
• Not every program runs well on a GPU
  • Heavy branch divergence
• OpenCL makes it easy to exploit multi-core and SIMD
  • Imagine writing pthread + SIMD code in assembly or intrinsics instead
![Page 5: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/5.jpg)
Running OpenCL Kernels on CPU
• One thread per work-item?
  • Thousands of threads would be created
  • Context-switching problems
  • How to synchronize the threads?
• How about running one work-group on one CPU thread?
![Page 6: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/6.jpg)
Related Works
• Twin Peaks: A Software Platform for Heterogeneous Computing on General-Purpose and Graphics Processors
• MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs
• Clover (http://people.freedesktop.org/~steckdenis/clover)
• Shamrock (https://git.linaro.org/gpgpu/shamrock.git)
![Page 7: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/7.jpg)
What is pocl
• POrtable Computing Language
• An efficient implementation of the OpenCL standard that can be easily adapted to new targets
• http://github.com/pocl/pocl
• Main developer: Pekka Jääskeläinen from Tampere University of Technology
• Supported architectures: CPU, tce, cellspu, HSA
• Current version: 0.11
![Page 8: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/8.jpg)
Components in pocl
![Page 9: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/9.jpg)
The pocl Kernel Compiler
[Figure: OpenCL Kernel Source → Clang / LLVM, at clBuildProgram(…) → Single Work-item Kernel → pocl Kernel Compiler, at clEnqueueNDRangeKernel(…, local_size, …) → Transformed Kernel]
![Page 10: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/10.jpg)
pocl Compilation Chain
1. Compile the kernel (OpenCL C) with Clang
2. Link with target-specific built-in functions, such as sin, cos, geom_distance, etc.
3. Work-group function generation / parallel work-item loop creation
4. Backend optimizations (auto-vectorization, …) and code generation
![Page 11: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/11.jpg)
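Work-group Function Generation
The single work-item kernel is wrapped in a work-item loop (WI-loop) whose trip count is the group_size passed to clEnqueueNDRangeKernel(…., group_size, ….):
Work-group_function() {
  for (int i = 0; i < work-group_size; i++) {
    Kernel (single work-item)
  }
}
What if there are barriers?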
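The WI-loop idea can be sketched in plain C. This is a hand-written illustration, not code generated by pocl: the names scale_kernel and scale_workgroup are hypothetical, and local_id stands in for what get_local_id(0) would return in OpenCL C.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single work-item kernel: the code the programmer writes.
 * local_id stands in for get_local_id(0). */
static void scale_kernel(size_t local_id, const int *in, int *out, int factor)
{
    out[local_id] = in[local_id] * factor;
}

/* Hypothetical work-group function: one CPU thread executes every
 * work-item of the group by running the kernel body in a WI-loop. */
static void scale_workgroup(size_t local_size, const int *in, int *out, int factor)
{
    assert(in != NULL && out != NULL);
    for (size_t local_id = 0; local_id < local_size; local_id++) {
        scale_kernel(local_id, in, out, factor);
    }
}
```

Calling scale_workgroup(4, in, out, 3) processes a whole work-group of four work-items on the calling thread. This only works as written because the kernel contains no barrier.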
![Page 12: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/12.jpg)
Semantics of Barrier Synchronization
OpenCL 1.2 rev19 p.30:
“… the work-group barrier must be encountered by all work-items of a work-group executing the kernel or by none at all…”
Example that breaks this rule, because only the work-items with an odd tid reach the barrier:
if (tid % 2) { …. barrier(); …}
![Page 13: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/13.jpg)
Kernel Without Barriers
• A node in a CFG is a basic block (BB)
  • BB: a branchless sequence of instructions
  • A BB is executed as an entity, from the first instruction to the last
• An edge in a CFG represents a branch in the control flow
• Multiple exit BBs are allowed
• The pocl Kernel Compiler generates a WI-loop around the CFG
![Page 14: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/14.jpg)
Types of Barrier
• Unconditional barriers: a barrier that dominates the exit node
• Conditional barriers: barriers placed inside
  • if-else branches
  • for-loops (b-loops)
![Page 15: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/15.jpg)
Kernel with Unconditional Barriers
The pocl Kernel Compiler creates WI-loops before and after the barrier.
This forms an algorithm:
Algorithm 1: Parallel region formation when the kernel does not contain conditional barriers.
Step 1: Ensure there is an implicit barrier at the entry and the exit nodes of the kernel function and that there is only one exit node in the kernel function. This is a safe starting condition as it does not affect any execution order restrictions.
Step 2: Perform a depth-first-search traversal of the kernel CFG. Ignore the possible back edges to avoid infinite loops and to include the loops of the kernel in the parallel region.
Step 3: When encountering a barrier, create a parallel region by calling CreateSubgraph for the previously encountered barrier and the newly found barrier.
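As a concrete picture of the result, the sketch below shows what a work-group function could look like for a kernel with one unconditional barrier. It is a hand-written C illustration under assumptions, not pocl output: reverse_workgroup and MAX_LOCAL_SIZE are hypothetical names, and the tmp array stands in for __local memory.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64  /* illustrative limit, not a pocl constant */

/* Kernel being modelled (OpenCL C, shown as a comment):
 *   tmp[lid] = in[lid];              // parallel region 1
 *   barrier(CLK_LOCAL_MEM_FENCE);
 *   out[lid] = tmp[n - 1 - lid];     // parallel region 2
 */
static void reverse_workgroup(size_t n, const int *in, int *out)
{
    int tmp[MAX_LOCAL_SIZE];  /* stands in for __local memory */
    assert(n <= MAX_LOCAL_SIZE);

    /* Parallel region 1: from the implicit entry barrier to the barrier. */
    for (size_t lid = 0; lid < n; lid++)
        tmp[lid] = in[lid];

    /* The barrier becomes the boundary between the two WI-loops:
     * every work-item has finished region 1 before region 2 starts. */

    /* Parallel region 2: from the barrier to the implicit exit barrier. */
    for (size_t lid = 0; lid < n; lid++)
        out[lid] = tmp[n - 1 - lid];
}
```

A single WI-loop around the whole kernel body would be wrong here, because work-item 0 would read tmp[n - 1] before any work-item had written it.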
![Page 16: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/16.jpg)
A CFG with Two Conditional Barriers
Algorithm 2: Tail duplication for parallel region formation in the case of conditional barriers in the kernel.
Step 1: Perform a depth-first traversal of the CFG, starting at the entry node.
Step 2: Each time a new, unprocessed conditional barrier is found, use CreateSubgraph to produce a sub-CFG from that barrier to the next exit node (duplicate the tail).
Step 3: Replicate the created sub-CFG using ReplicateCFG. In order to reduce code duplication, merge the tails from the same unconditional barrier paths. That is, replicate the basic blocks only after the last barrier that is unconditionally reachable from the one at hand.
Step 4: Start the algorithm at each of the found barrier successors.
![Page 17: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/17.jpg)
A CFG with Two Conditional barriers – After Tail Duplication
Easier for WI-loops creation!
![Page 18: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/18.jpg)
“Peel” the First Loop Iteration
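The first work-item is executed ahead of the WI-loop, so the branch it takes selects which parallel region the remaining work-items run.
No more ambiguous branches in WI-loops!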
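Tail duplication and peeling can be illustrated together on a small kernel with one conditional barrier. The C code below is a hand-written sketch of the shape of the transformed code, not what pocl emits: shift_workgroup and MAX_LOCAL_SIZE are hypothetical names, and flag is assumed to have the same value for every work-item in the group, as the barrier rule requires.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64  /* illustrative limit, not a pocl constant */

/* Kernel being modelled (OpenCL C, shown as a comment):
 *   int x = in[lid] * 2;                 // A
 *   if (flag) {                          // uniform across the work-group
 *       tmp[lid] = x;                    // B
 *       barrier(CLK_LOCAL_MEM_FENCE);
 *       x = tmp[(lid + 1) % n];          // C
 *   }
 *   out[lid] = x + 1;                    // D, the tail
 */
static void shift_workgroup(size_t n, int flag, const int *in, int *out)
{
    int tmp[MAX_LOCAL_SIZE];  /* stands in for __local memory */
    assert(n > 0 && n <= MAX_LOCAL_SIZE);

    /* Peeled first iteration: work-item 0 runs A and evaluates the branch,
     * which tells us which parallel region the other work-items are in. */
    int x0 = in[0] * 2;
    if (flag) {
        tmp[0] = x0;
        /* Region "entry to conditional barrier": A and B for the rest. */
        for (size_t lid = 1; lid < n; lid++)
            tmp[lid] = in[lid] * 2;
        /* Region "conditional barrier to exit": C plus a duplicate of D. */
        for (size_t lid = 0; lid < n; lid++) {
            int x = tmp[(lid + 1) % n];
            out[lid] = x + 1;
        }
    } else {
        out[0] = x0 + 1;
        /* Region "entry to exit": A and the original D for the rest. */
        for (size_t lid = 1; lid < n; lid++)
            out[lid] = in[lid] * 2 + 1;
    }
}
```

The tail D appears twice, once per path, which is the cost of tail duplication. In exchange, each WI-loop covers a region with a single entry barrier and a single exit barrier.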
![Page 19: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/19.jpg)
Barriers in Kernel Loops
Insert implicit barriers at:
1. The end of the loop pre-header block
2. Before the loop latch branch
3. After the PhiNode region of the loop header block
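The effect of these implicit barriers is that the kernel loop ends up outside and the WI-loops inside. The C sketch below illustrates this on a tree reduction. It is a hand-written example under assumptions, not pocl output: reduce_workgroup and MAX_LOCAL_SIZE are hypothetical names, and the work-group size is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64  /* illustrative limit, not a pocl constant */

/* Kernel being modelled (OpenCL C, shown as a comment):
 *   tmp[lid] = in[lid];
 *   for (stride = n / 2; stride > 0; stride /= 2) {
 *       barrier(CLK_LOCAL_MEM_FENCE);
 *       if (lid < stride) tmp[lid] += tmp[lid + stride];
 *   }
 *   // work-item 0 then holds the sum in tmp[0]
 */
static int reduce_workgroup(size_t n, const int *in)
{
    int tmp[MAX_LOCAL_SIZE];  /* stands in for __local memory */
    assert(n > 0 && n <= MAX_LOCAL_SIZE && (n & (n - 1)) == 0);

    /* Region before the loop; it ends at the implicit barrier placed
     * at the end of the loop pre-header. */
    for (size_t lid = 0; lid < n; lid++)
        tmp[lid] = in[lid];

    /* The kernel loop (b-loop) is executed once for the whole group. */
    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        /* Region inside the loop body, bounded by the barrier in the
         * body and the implicit barrier before the latch branch. */
        for (size_t lid = 0; lid < n; lid++) {
            if (lid < stride)
                tmp[lid] += tmp[lid + stride];
        }
    }
    return tmp[0];
}
```

For in = {1, 2, …, 8} and n = 8 the function returns 36. The loop counter stride is the same for all work-items, so it can stay a single scalar in the work-group function.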
![Page 20: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/20.jpg)
Horizontal Inner-Loop Parallelization
More parallelization after loop interchange
blockWidth unknown until runtime
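The interchange can be sketched in C as two versions of the same work-group function. This is an assumed, simplified example and not the kernel from the slide: colsum_wi_outer, colsum_wi_inner and MAX_LOCAL_SIZE are hypothetical names, and block_width plays the role of the blockWidth value that is only known at run time.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64  /* illustrative limit, not a pocl constant */

/* Before interchange: the WI-loop is outermost and the kernel loop over
 * k runs to completion inside each work-item. */
static void colsum_wi_outer(size_t n, size_t block_width, const int *m, int *out)
{
    for (size_t lid = 0; lid < n; lid++) {
        int acc = 0;
        for (size_t k = 0; k < block_width; k++)
            acc += m[k * n + lid];
        out[lid] = acc;
    }
}

/* After interchange: the kernel loop is outermost and the WI-loop is
 * innermost, so consecutive work-items touch consecutive elements and
 * the inner loop is a natural candidate for vectorization. */
static void colsum_wi_inner(size_t n, size_t block_width, const int *m, int *out)
{
    int acc[MAX_LOCAL_SIZE];  /* per-work-item copy of the accumulator */
    assert(n <= MAX_LOCAL_SIZE);

    for (size_t lid = 0; lid < n; lid++)
        acc[lid] = 0;
    for (size_t k = 0; k < block_width; k++) {
        for (size_t lid = 0; lid < n; lid++)
            acc[lid] += m[k * n + lid];
    }
    for (size_t lid = 0; lid < n; lid++)
        out[lid] = acc[lid];
}
```

Both functions compute the same result. The interchanged form pays for the extra parallelism by turning the scalar accumulator into an array with one slot per work-item.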
![Page 21: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/21.jpg)
Handling of Kernel Variables
1. There will be two parallel regions
2. a's lifetime lies entirely within the first parallel region (it is a temporary variable)
3. b's lifetime spans both parallel regions, so it is kept in a context array
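The difference between the two kinds of variable can be sketched in C. The kernel below is an assumed example, not the one shown on the slide, and the names ctx_workgroup, b_ctx and MAX_LOCAL_SIZE are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LOCAL_SIZE 64  /* illustrative limit, not a pocl constant */

/* Kernel being modelled (OpenCL C, shown as a comment):
 *   int a = in[lid] + 1;        // used only before the barrier
 *   int b = a * 2;              // also used after the barrier
 *   tmp[lid] = a;
 *   barrier(CLK_LOCAL_MEM_FENCE);
 *   out[lid] = b + tmp[(lid + 1) % n];
 */
static void ctx_workgroup(size_t n, const int *in, int *out)
{
    int tmp[MAX_LOCAL_SIZE];    /* stands in for __local memory */
    int b_ctx[MAX_LOCAL_SIZE];  /* context array: one slot of b per work-item */
    assert(n > 0 && n <= MAX_LOCAL_SIZE);

    /* Parallel region 1: a is dead at the end of each iteration, so an
     * ordinary scalar is enough; b must survive, so it is saved. */
    for (size_t lid = 0; lid < n; lid++) {
        int a = in[lid] + 1;
        b_ctx[lid] = a * 2;
        tmp[lid] = a;
    }

    /* Parallel region 2: each work-item reloads its own b. */
    for (size_t lid = 0; lid < n; lid++)
        out[lid] = b_ctx[lid] + tmp[(lid + 1) % n];
}
```

Keeping a as a scalar lets the compiler hold it in a register, while b costs one array element per work-item.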
![Page 22: The pocl Kernel Compiler](https://reader035.vdocuments.us/reader035/viewer/2022062308/55c9aa37bb61eb9a398b46f3/html5/thumbnails/22.jpg)
References
Pekka Jääskeläinen, Carlos Sánchez de La Lama, Erik Schnetter, Kalle Raiskila, Jarmo Takala, Heikki Berg: "pocl: A Performance-Portable OpenCL Implementation" in International Journal of Parallel Programming, Springer, August 2014.
http://github.com/pocl/pocl