[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA
DESCRIPTION
http://cs264.org Abstract: High-level scripting languages are in many ways polar opposites to GPUs. GPUs are highly parallel, subject to hardware subtleties, and designed for maximum throughput, and they offer a tremendous advance in the performance achievable for a significant number of computational problems. Scripting languages such as Python, on the other hand, favor ease of use over computational speed and do not generally emphasize parallelism. PyOpenCL and PyCUDA are two packages that attempt to join the two. Through concrete examples, both at the toy and the whole-application level, this talk aims to demonstrate that combining these opposites creates a programming environment that is greater than the sum of its two parts. Speaker biography: Andreas Klöckner obtained his PhD working with Jan Hesthaven at the Department of Applied Mathematics at Brown University. He worked on a variety of topics, all aiming to broaden the utility of discontinuous Galerkin (DG) methods. This included their use in the simulation of plasma physics and the demonstration of their particular suitability for computation on throughput-oriented graphics processors (GPUs). He also worked on multi-rate time stepping methods and shock capturing schemes for DG. In the fall of 2010, he joined the Courant Institute of Mathematical Sciences at New York University as a Courant Instructor. There, he is working on problems in computational electromagnetics with Leslie Greengard. His research interests include: - Discontinuous Galerkin and integral equation methods for wave propagation - Programming tools for parallel architectures - High-order unstructured particle-in-cell methods for plasma simulation
TRANSCRIPT
Easy, Effective, Efficient: GPU Programming in Python with PyOpenCL and PyCUDA
Andreas Klöckner
Courant Institute of Mathematical Sciences, New York University
March 31, 2011
Thanks
Jan Hesthaven (Brown)
Tim Warburton (Rice)
Leslie Greengard (NYU)
PyOpenCL, PyCUDA contributors
Nvidia Corp., AMD Corp.
Outline
1 Introduction: A Common Theme; Intro to OpenCL
2 Programming with PyOpenCL
3 Run-Time Code Generation
4 Perspectives
How are High-Performance Codes Constructed?
"Traditional" construction of high-performance codes: C/C++/Fortran plus libraries.
"Alternative" construction of high-performance codes: scripting for the 'brains', GPUs for the 'inner loops'.
Play to the strengths of each programming environment.
What is OpenCL?
"OpenCL (Open Computing Language) is an open, royalty-free standard for general purpose parallel programming across CPUs, GPUs and other processors." [OpenCL 1.1 spec]
- Device-neutral (Nvidia GPU, AMD GPU, Intel/AMD CPU)
- Vendor-neutral
- Comes with RTCG (run-time code generation)
Defines:
- a host-side programming interface (library)
- a device-side programming language (!)
Big deal? Big deal!
Who?
OpenCL Working Group
- Diverse industry participation: processor vendors, system OEMs, middleware vendors, application developers
- Many industry-leading experts involved in OpenCL's design, bringing a healthy diversity of industry perspectives
- Apple made the initial proposal and is very active in the working group, serving as specification editor
Credit: Khronos Group
When?
OpenCL Timeline
- Six months from proposal to released OpenCL 1.0 specification, thanks to a strong initial proposal and a shared commercial incentive
- Multiple conformant implementations shipping; Apple's Mac OS X Snow Leopard now ships with OpenCL
- 18-month cadence between OpenCL 1.0 and OpenCL 1.1; backwards compatibility protects the software investment
Timeline:
- Jun 2008: Apple proposes OpenCL working group and contributes draft specification to Khronos
- Dec 2008: Khronos publicly releases OpenCL 1.0 as a royalty-free specification
- May 2009: Khronos releases OpenCL 1.0 conformance tests to ensure high-quality implementations
- 2H 2009: Multiple conformant implementations ship across diverse OSes and platforms
- Jun 2010: OpenCL 1.1 specification released and first implementations ship
Credit: Khronos Group
Why?
Processor parallelism: CPUs (multiple cores driving performance increases) and GPUs (increasingly general-purpose data-parallel computing) are converging. Multi-processor programming (e.g. OpenMP) and graphics APIs/shading languages meet in an emerging intersection: heterogeneous computing. OpenCL is a programming framework for heterogeneous compute resources.
Credit: Khronos Group
CL vs CUDA side-by-side

CUDA source code:

__global__ void transpose(
    float *A_t, float *A,
    int a_width, int a_height)
{
    int base_idx_a =
        blockIdx.x * BLK_SIZE +
        blockIdx.y * A_BLOCK_STRIDE;
    int base_idx_a_t =
        blockIdx.y * BLK_SIZE +
        blockIdx.x * A_T_BLOCK_STRIDE;
    int glob_idx_a =
        base_idx_a + threadIdx.x
        + a_width * threadIdx.y;
    int glob_idx_a_t =
        base_idx_a_t + threadIdx.x
        + a_height * threadIdx.y;

    __shared__ float A_shared[BLK_SIZE][BLK_SIZE+1];

    A_shared[threadIdx.y][threadIdx.x] = A[glob_idx_a];
    __syncthreads();
    A_t[glob_idx_a_t] = A_shared[threadIdx.x][threadIdx.y];
}

OpenCL source code:

__kernel void transpose(
    __global float *a_t, __global float *a,
    unsigned a_width, unsigned a_height)
{
    int base_idx_a =
        get_group_id(0) * BLK_SIZE +
        get_group_id(1) * A_BLOCK_STRIDE;
    int base_idx_a_t =
        get_group_id(1) * BLK_SIZE +
        get_group_id(0) * A_T_BLOCK_STRIDE;
    int glob_idx_a =
        base_idx_a + get_local_id(0)
        + a_width * get_local_id(1);
    int glob_idx_a_t =
        base_idx_a_t + get_local_id(0)
        + a_height * get_local_id(1);

    __local float a_local[BLK_SIZE][BLK_SIZE+1];

    a_local[get_local_id(1)][get_local_id(0)] = a[glob_idx_a];
    barrier(CLK_LOCAL_MEM_FENCE);
    a_t[glob_idx_a_t] = a_local[get_local_id(0)][get_local_id(1)];
}
OpenCL ↔ CUDA: A dictionary

OpenCL                          CUDA
Grid                            Grid
Work Group                      Block
Work Item                       Thread
__kernel                        __global__
__global                        __device__
__local                         __shared__
__private                       __local__
image2d_t/image3d_t             texture<type, n, ...>
barrier(CLK_LOCAL_MEM_FENCE)    __syncthreads()
get_local_id(0/1/2)             threadIdx.x/y/z
get_group_id(0/1/2)             blockIdx.x/y/z
get_global_id(0/1/2)            – (reimplement)
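The last row deserves a note: CUDA offers no get_global_id, so one rebuilds it from block and thread indices. A minimal PyCUDA sketch (the kernel and its name are illustrative, not from the talk):

import pycuda.autoinit, pycuda.compiler
mod = pycuda.compiler.SourceModule("""
__global__ void scale(float *a)
{
    // the CUDA spelling of OpenCL's get_global_id(0):
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    a[gid] *= 2;
}
""")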
OpenCL: Execution Model
[Figure: an nD grid of work groups Group(0,0) ... Group(2,1); each work group, e.g. Work Group (1,0), contains work items Item(0,0) ... Item(3,3).]
Two-tiered parallelism:
- Grid = Nx × Ny × Nz work groups; work group = Sx × Sy × Sz work items; total: ∏_{i∈{x,y,z}} Si·Ni work items
- Communication/synchronization only within a work group
- A work group maps to a compute unit
- Grid/group ≈ outer loops in an algorithm
- Device language: get_{global,group,local}_{id,size}(axis)
OpenCL: Computing as a Service
- The host (CPU, running Python) has its own memory and orchestrates the computation.
- A platform groups compute devices, e.g. Platform 0: CPUs, Platform 1: GPUs.
- A compute device (think "chip") has a memory interface and its own memory.
- A compute device contains compute units (think "processor"; has instruction fetch).
- A compute unit contains processing elements (think "SIMD lane").
- Device language: ∼ C99.
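This hierarchy is directly visible from Python. A minimal sketch that walks it (the printed attributes are an illustrative selection):

import pyopencl as cl

for platform in cl.get_platforms():
    print platform.name, platform.vendor
    for dev in platform.get_devices():
        # every device reports its resources via info attributes
        print "  ", dev.name, dev.global_mem_size, dev.max_compute_units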
OpenCL Object Diagram
[Figure 2.1 of the OpenCL 1.1 specification: the OpenCL UML class diagram. Credit: Khronos Group]
Why do Scripting for GPUs?
GPUs are everything that scripting languages are not: highly parallel, very architecture-sensitive, built for maximum FP/memory throughput.
→ The two complement each other.
- CPU: largely restricted to control tasks (∼1000/sec); scripting is fast enough for that.
- Python + CUDA = PyCUDA
- Python + OpenCL = PyOpenCL
Outline
1 Introduction
2 Programming with PyOpenCL: First Contact; About PyOpenCL
3 Run-Time Code Generation
4 Perspectives
Dive into PyOpenCL

import pyopencl as cl, numpy

a = numpy.random.rand(256**3).astype(numpy.float32)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
cl.enqueue_write_buffer(queue, a_dev, a)

prg = cl.Program(ctx, """
    __kernel void twice(__global float *a)    // <- compute kernel
    { a[get_global_id(0)] *= 2; }
    """).build()

prg.twice(queue, a.shape, (1,), a_dev)

Dive into PyOpenCL (continued): use explicit work groups of 256 items and check the result.

a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
cl.enqueue_write_buffer(queue, a_dev, a)

prg = cl.Program(ctx, """
    __kernel void twice(__global float *a)
    { a[get_local_id(0) + get_local_size(0)*get_group_id(0)] *= 2; }
    """).build()

prg.twice(queue, a.shape, (256,), a_dev)

result = numpy.empty_like(a)
cl.enqueue_read_buffer(queue, a_dev, result).wait()
import numpy.linalg as la
assert la.norm(result - 2*a) == 0
PyOpenCL: Completeness
PyOpenCL exposes all of OpenCL. For example:
- Every GetInfo() query
- Images and samplers
- Memory maps
- Profiling and synchronization
- GL interop
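Every GetInfo() query surfaces as a plain Python attribute. A minimal sketch, assuming the ctx from the demo above (the attribute selection is illustrative):

dev = ctx.devices[0]
print dev.name
print dev.max_work_group_size, dev.local_mem_size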
PyOpenCL: Completeness
PyOpenCL supports (nearly) every OS that has an OpenCL implementation:
- Linux
- OS X
- Windows
Automatic Cleanup
- Reachable objects (memory, streams, ...) are never destroyed.
- Once unreachable, they are released at an unspecified future time.
- Scarce resources (memory) can be explicitly freed (obj.release()).
- Correctly deals with multiple contexts and dependencies (based on OpenCL's reference counting).
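A minimal sketch of the eager path, assuming ctx and the numpy array a from the demo above:

a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)
# ... use a_dev ...
a_dev.release()  # frees the device memory now, not at garbage-collection time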
PyOpenCL: Documentation
[Slide shows a screenshot of the PyOpenCL documentation.]
PyOpenCL Philosophy
- Provide complete access
- Automatically manage resources
- Provide abstractions
- Allow interactive use
- Check for and report errors automatically
- Integrate tightly with numpy
PyOpenCL, PyCUDA: Vital Information
http://mathema.tician.de/software/pyopencl (or /pycuda)
- Complete documentation
- X Consortium License (no warranty, free for all use)
- Convenient abstractions: arrays, elementwise operations, reduction, scan
- Requires: numpy, Python 2.4+ (Win/OS X/Linux)
- Community: mailing list, wiki, add-on packages (FFT, scikits.cuda, ...)
Capturing Dependencies
Consider the computation B = f(A); C = g(B); E = f(C); F = h(C); G = g(E,F); P = p(B); Q = q(B); R = r(G,P,Q).
[Figure: the dependency DAG of these computations.]
- Switch the queue to out-of-order mode!
- Specify dependencies as a list of events using the wait_for= optional keyword to enqueue_XXX.
- Can also enqueue a barrier.
- Common use case: transmit/receive from other MPI ranks.
- Possible on Nvidia Fermi: submit parallel work to increase machine use.
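A minimal sketch of the event plumbing; the kernels f and g and the buffers are hypothetical, the keywords are PyOpenCL's:

queue = cl.CommandQueue(ctx, properties=
        cl.command_queue_properties.OUT_OF_ORDER_EXEC_MODE_ENABLE)
evt_b = prg.f(queue, shape, None, b_dev, a_dev)
# g(B) may only start once f(A) has completed:
evt_c = prg.g(queue, shape, None, c_dev, b_dev, wait_for=[evt_b])
evt_c.wait()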
Outline
1 Introduction
2 Programming with PyOpenCL
3 Run-Time Code Generation: The Idea; RTCG in Action
4 Perspectives
Metaprogramming
The idea: Python code generates GPU code, the GPU compiler turns it into a GPU binary, and the GPU runs it to produce the result; both human and machine take part in writing the code.
In GPU scripting, GPU code does not need to be a compile-time constant. (Key: code is data; it wants to be reasoned about at run time.) This is good for code generation, and it is exactly what PyCUDA and PyOpenCL enable.
Machine-generated Code
Why machine-generate code?
- Automated tuning (cf. ATLAS, FFTW)
- Data types
- Specialize code for a given problem
- Constants faster than variables (→ register pressure)
- Loop unrolling
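As a minimal sketch of the last point, unrolling becomes ordinary Python string handling; the kernel name and unroll factor are illustrative, ctx is assumed from earlier:

unroll = 4  # a run-time parameter, not a compile-time constant
body = "\n    ".join(
        "a[%d*gid + %d] *= 2;" % (unroll, j) for j in range(unroll))
source = """
__kernel void twice_unrolled(__global float *a)
{
    int gid = get_global_id(0);
    %s
}
""" % body
prg = cl.Program(ctx, source).build()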
PyOpenCL: Support for Metaprogramming
Three (main) ways of generating code:
- Simple %-operator substitution; combined with the C preprocessor: simple, often sufficient
- A templating engine (Mako works very well)
- codepy: build C syntax trees from Python; generates readable, indented C
Many ways of evaluating code; the most important one: exact device timing via events.
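A minimal sketch of event-based timing, reusing the twice kernel from the first example (the 1e-9 converts the nanosecond timestamps):

queue = cl.CommandQueue(ctx,
        properties=cl.command_queue_properties.PROFILING_ENABLE)
evt = prg.twice(queue, a.shape, (256,), a_dev)
evt.wait()
print (evt.profile.end - evt.profile.start)*1e-9, "s of device time"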
PyOpenCL Arrays: General Usage
Remember your first PyOpenCL program? Abstraction is good:

import numpy
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a_gpu = cl_array.to_device(
        ctx, queue, numpy.random.randn(4,4).astype(numpy.float32))
a_doubled = (2*a_gpu).get()
print a_doubled
print a_gpu
pyopencl.array: Simple Linear Algebra
pyopencl.array.Array:
- Meant to look and feel just like numpy
- pyopencl.array.to_device(ctx, queue, numpy_array)
- numpy_array = ary.get()
- +, -, *, /, fill, sin, arange, exp, rand, ...
- Mixed types (int32 + float32 = float64)
- print cl_array for debugging
- Allows access to the raw bits
- Use as kernel arguments, memory maps
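A minimal sketch of that look-and-feel, assuming ctx and queue as before; pyopencl.clmath supplies the transcendentals:

import pyopencl.clmath as clmath
a_gpu = cl_array.to_device(ctx, queue,
        numpy.random.randn(1000).astype(numpy.float32))
b_gpu = clmath.sin(a_gpu)**2 + 3*a_gpu   # computed on the device
print b_gpu.get()[:5]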
pyopencl.elementwise: Elementwise expressions
Avoiding extra store-fetch cycles for elementwise math:

n = 10000
a_gpu = cl_array.to_device(
        ctx, queue, numpy.random.randn(n).astype(numpy.float32))
b_gpu = cl_array.to_device(
        ctx, queue, numpy.random.randn(n).astype(numpy.float32))

from pyopencl.elementwise import ElementwiseKernel
lin_comb = ElementwiseKernel(ctx,
        "float a, float *x, float b, float *y, float *z",
        "z[i] = a*x[i] + b*y[i]")

c_gpu = cl_array.empty_like(a_gpu)
lin_comb(5, a_gpu, 6, b_gpu, c_gpu)

import numpy.linalg as la
assert la.norm((c_gpu - (5*a_gpu+6*b_gpu)).get()) < 1e-5
RTCG via Substitution

source = ("""
    __kernel void %(name)s(%(arguments)s)
    {
        unsigned lid = get_local_id(0);
        unsigned gsize = get_global_size(0);
        unsigned work_item_start = get_local_size(0)*get_group_id(0);
        %(loop_prep)s;
        for (unsigned i = work_item_start + lid; i < n; i += gsize)
        { %(operation)s; }
    }
    """ % {
        "arguments": ", ".join(arg.declarator() for arg in arguments),
        "operation": operation,
        "name": name,
        "loop_prep": loop_prep,
        })

prg = cl.Program(ctx, source).build()
RTCG via Templates

from mako.template import Template

tpl = Template("""
    __kernel void add(
            __global ${type_name} *tgt,
            __global const ${type_name} *op1,
            __global const ${type_name} *op2)
    {
        int idx = get_local_id(0)
            + ${local_size} * ${thread_strides} * get_group_id(0);

        % for i in range(thread_strides):
            <% offset = i*local_size %>
            tgt[idx + ${offset}] =
                op1[idx + ${offset}]
                + op2[idx + ${offset}];
        % endfor
    }
    """)

rendered_tpl = tpl.render(type_name="float",
        local_size=local_size, thread_strides=thread_strides)

knl = cl.Program(ctx, str(rendered_tpl)).build().add
pyopencl.reduction: Reduction made easy
Example: a dot product calculation

from pyopencl.reduction import ReductionKernel
dot = ReductionKernel(ctx, dtype_out=numpy.float32, neutral="0",
        reduce_expr="a+b", map_expr="x[i]*y[i]",
        arguments="__global const float *x, __global const float *y")

import pyopencl.clrandom as cl_rand
x = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)
y = cl_rand.rand(ctx, queue, (1000*1000), dtype=numpy.float32)

x_dot_y = dot(x, y).get()
x_dot_y_cpu = numpy.dot(x.get(), y.get())
pyopencl.scan: Scan made easy
Example: a cumulative sum computation

import numpy as np
from pyopencl.scan import InclusiveScanKernel
knl = InclusiveScanKernel(ctx, np.int32, "a+b")

n = 2**20 - 2**18 + 5
host_data = np.random.randint(0, 10, n).astype(np.int32)
dev_data = cl_array.to_device(queue, host_data)

knl(dev_data)
assert (dev_data.get() == np.cumsum(host_data, axis=0)).all()
Outline
1 Introduction
2 Programming with PyOpenCL
3 Run-Time Code Generation
4 Perspectives: PyCUDA; DG-FEM on the GPU; "Automatic" GPU Programming; Conclusions
Whetting your appetite

import pycuda.driver as cuda
import pycuda.autoinit, pycuda.compiler
import numpy

a = numpy.random.randn(4,4).astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

[This is examples/demo.py in the PyCUDA distribution.]

mod = pycuda.compiler.SourceModule("""
    __global__ void twice(float *a)    // <- compute kernel
    {
        int idx = threadIdx.x + threadIdx.y*4;
        a[idx] *= 2;
    }
    """)

func = mod.get_function("twice")
func(a_gpu, block=(4,4,1))

a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled
print a
PyOpenCL ↔ PyCUDA: A (rough) dictionary

PyOpenCL                      PyCUDA
Context                       Context
CommandQueue                  Stream
Buffer                        mem_alloc / DeviceAllocation
Program                       SourceModule
Kernel                        Function
Event (e.g. enqueue_marker)   Event
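To make one row concrete, here is the same allocate-and-upload step in both packages, reusing the objects from the demos above:

a_dev = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)  # PyOpenCL: Buffer
cl.enqueue_write_buffer(queue, a_dev, a)
a_gpu = cuda.mem_alloc(a.nbytes)                                # PyCUDA: mem_alloc
cuda.memcpy_htod(a_gpu, a)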
Discontinuous Galerkin Method
Let Ω := ⋃_k D^k ⊂ R^d.
Goal: solve a conservation law on Ω:  u_t + ∇·F(u) = 0
Example: Maxwell's equations, with the EM field E(x,t), H(x,t) on Ω governed by
∂_t E − (1/ε) ∇×H = −j/ε,   ∂_t H + (1/μ) ∇×E = 0,
∇·E = ρ/ε,   ∇·H = 0.
Metaprogramming DG: Flux Terms
0 = ∫_{D^k} u_t φ + [∇·F(u)] φ dx − ∫_{∂D^k} [n·F − (n·F)*] φ dS_x, where the surface integral is the flux term.
Flux terms:
- vary by problem
- expression specified by user
- evaluated pointwise
Metaprogramming DG: Flux Terms Example
Example: fluxes for Maxwell's equations
n·(F − F*)_E := (1/2) [n × (⟦H⟧ − α n × ⟦E⟧)]
The user writes a vectorial statement in mathematical notation:

flux = 1/2*cross(normal, h.int - h.ext - alpha*cross(normal, e.int - e.ext))

We generate a scalar evaluator in C (6×):

a_flux += (
    (((val_a_field5 - val_b_field5)*fpair->normal[2]
      - (val_a_field4 - val_b_field4)*fpair->normal[0])
     + val_a_field0 - val_b_field0)*fpair->normal[0]
    - (((val_a_field4 - val_b_field4)*fpair->normal[1]
       - (val_a_field1 - val_b_field1)*fpair->normal[2])
      + val_a_field3 - val_b_field3)*fpair->normal[1]
    )*value_type(0.5);
Loop Slicing for element-local parts of GPU DG
Per block: K_L element-local matrix multiplications + matrix load preparation.
Question: how should one assign work to threads?
- w_s: in sequence (amortizes preparation)
- w_i: "inline-parallel"
- w_p: in parallel (exploits register space)
Loop Slicing for Differentiation
[Figure: execution time (ms) of local differentiation, matrix-in-shared, order 4, with microblocking, plotted over w_p (15–30) and w_s (1.0–3.0); point size denotes w_i ∈ {1, ..., 4}.]
Nvidia GTX280 vs. single core of Intel Core 2 Duo E8400
[Figure: flop rates (GFlops/s, up to ∼300) vs. polynomial order N (0–10) for GPU and CPU.]
Memory Bandwidth on a GTX 280
[Figure: global memory bandwidth (GB/s, 20–200) vs. polynomial order N (1–9) for the Gather, Lift, Diff, and Assy. stages, compared against peak.]
GPU DG Showcase
- Electromagnetism
- Poisson
- CFD
Automating GPU Programming
GPU programming can be time-consuming, unintuitive and error-prone.
Obvious idea: let the computer do it.
One way: smart compilers
- GPU programming requires complex tradeoffs
- Tradeoffs require heuristics
- Heuristics are fragile
Another way: dumb enumeration
- Enumerate loop slicings
- Enumerate prefetch options
- Choose by running the resulting code on actual hardware
Loo.py Example
Empirical GPU loop optimization:

a, b, c, i, j, k = [var(s) for s in "abcijk"]
n = 500
k = make_loop_kernel([
    LoopDimension("i", n),
    LoopDimension("j", n),
    LoopDimension("k", n),
    ], [
    (c[i+n*j], a[i+n*k]*b[k+n*j])
    ])

gen_kwargs = {
    "min_threads": 128,
    "min_blocks": 32,
    }

→ Ideal case: finds a 160 GF/s kernel without human intervention.
Loo.py Status
Limited scope:
- Requires input/output separation
- Kernels must be expressible using the "loopy" model (i.e. indices decompose into "output" and "reduction")
- Enough for DG, LA, FD, ...
Kernel compilation limits the trial rate.
Non-goal: peak performance.
Good results currently for dense linear algebra and (some) DG subkernels.
Where to from here?
PyCUDA, PyOpenCL, hedge → http://www.cims.nyu.edu/~kloeckner/
GPU RTCG:
AK, N. Pinto et al., "PyCUDA: GPU Run-Time Code Generation for High-Performance Computing", submitted.
GPU-DG article:
AK, T. Warburton, J. Bridge, J.S. Hesthaven, "Nodal Discontinuous Galerkin Methods on Graphics Processors", J. Comp. Phys., 228 (21), 7863–7882.
Also: an introduction in GPU Computing Gems Vol. 2.
Conclusions
- GPUs to me: an architecture choice that is now widely available; a fun time to be in computational science
- GPUs and scripting work surprisingly well together: they exploit a natural task decomposition in computational codes
- RTCG: a crucial tool
- GPU scripting is great for prototyping ... and just as suitable for production code
Questions?
Thank you for your attention!
http://www.cims.nyu.edu/~kloeckner/
Image Credits
- Dictionary: sxc.hu/topfer
- C870 GPU: Nvidia Corp.
- OpenCL Logo: Apple Corp./Ars Technica
- OS Platforms: flickr.com/aOliN.Tk
- Old Books: flickr.com/ppdigital
- Floppy disk: flickr.com/ethanhein
- Machine: flickr.com/13521837@N00
- Adding Machine: flickr.com/thomashawk
Multiple GPUs via MPI: 16 GPUs vs. 64 CPUs
[Figure: flop rates (GFlops/s, 0–4000) vs. polynomial order N (0–10), 16 GPUs vs. 64 CPU cores.]
Outline
5 OpenCL implementations
The Nvidia CL implementation
Targets only GPUs.
Notes:
- Nearly identical to CUDA
- No native C-level JIT in CUDA (→ PyCUDA)
- Page-locked memory: use CL_MEM_ALLOC_HOST_PTR. Careful: it has a double meaning. Page-locked memory is needed for genuinely overlapped transfers.
- No linear memory texturing
- CUDA device emulation mode is deprecated → use the AMD CPU CL implementation (faster, too!)
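A minimal sketch of requesting that flag from PyOpenCL, assuming ctx and a as earlier (the flag is exposed on pyopencl.mem_flags; as the slide warns, its meaning is implementation-dependent):

mf = cl.mem_flags
buf = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, size=a.nbytes)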
The Apple CL implementation
Targets CPUs and GPUs.
General notes:
- Different header name: OpenCL/cl.h instead of CL/cl.h; use -framework OpenCL for C access
- Beware of the imperfect compiler-cache implementation (it ignores include files)
CPU notes:
- One work item per processor
GPU notes:
- Similar to the hardware vendor's implementation (new: Intel with Sandy Bridge)
The AMD CL implementation
Targets CPUs and GPUs (from both AMD and Nvidia).
GPU notes:
- Wide SIMD groups (64)
- Native 4/5-wide vectors
- But: a very flop-heavy machine; vectors may be ignored for memory-bound workloads
- → Both implicit and explicit SIMD
CPU notes:
- Many work items per processor (emulated)
General:
- cl_amd_printf