desktop supercomputing: harnessing the power of...
Technology for a better society 1
Desktop supercomputing:
Harnessing the power of accelerators
Lessons learned from 10 years with GPUs
Talk outline
• Parallel computing on your desktop
• Motivation for parallelism
• Parallel algorithm design
• Lessons learned from working with multi- and many-core processors
• Ca 2005: OpenGL
• Ca 2007: CUDA and the Cell BE SDK
• Ca 2014: IPython and PyOpenCL
• Summary
Motivation for going parallel
History lesson: development of the microprocessor 1/2
1942: Digital Electric Computer (Atanasoff and Berry)
1947: Transistor (Shockley, Bardeen, and Brattain)
1958: Integrated Circuit (Kilby)
1971: Microprocessor (Hoff, Faggin, Mazor)
1971-: Exponential growth (Moore, 1965)
1971: 4004, 2,300 transistors, 740 kHz
1982: 80286, 134 thousand transistors, 8 MHz
1993: Pentium P5, 3.1 million transistors, 66 MHz
2000: Pentium 4, 42 million transistors, 1.5 GHz
2010: Nehalem, 2.3 billion transistors, 8 cores, 2.66 GHz
History lesson: development of the microprocessor 2/2
• Today, we have
• 6-60 processors per chip
• 8 to 32-wide SIMD instructions
• A combination of SISD, SIMD, and MIMD on a single chip
• Heterogeneous cores (e.g., CPU+GPU on single chip)
Today's multi- and many-core processor designs
Multi-core CPUs: x86, SPARC, Power 7
Accelerators: GPUs, Xeon Phi
Heterogeneous chips: Intel Haswell, AMD APU
• In addition to heterogeneous chips, we can have heterogeneous systems
• Accelerators are connected to the CPU via the PCI-express bus
• Slow: 15.75 GB/s each direction
• Accelerator memory is limited but fast
• Typically on the order of 10 GB
• Up to 340 GB/s!
• Fixed size, and cannot be expanded with new DIMMs (unlike CPU memory)
Heterogeneous Architectures
[Figure: a multi-core CPU with main CPU memory (up to 64 GB, ~60 GB/s) connected at ~30 GB/s to an accelerator with device memory (up to 6 GB, ~340 GB/s)]
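To get a feeling for why the PCI-express link matters, here is a rough back-of-the-envelope sketch using the bandwidth figures quoted on this slide (15.75 GB/s per direction over PCI-express, up to 340 GB/s for accelerator memory). The 1 GB buffer size and the function name are illustrative assumptions, and latency is ignored entirely.

```python
def transfer_time_s(size_gb, bandwidth_gb_per_s):
    """Idealised time to move size_gb at a sustained bandwidth (latency ignored)."""
    return size_gb / bandwidth_gb_per_s


PCIE_GB_PER_S = 15.75    # one direction, figure from the slide
DEVICE_GB_PER_S = 340.0  # accelerator memory, figure from the slide

size_gb = 1.0  # hypothetical buffer size, chosen only for illustration

t_pcie = transfer_time_s(size_gb, PCIE_GB_PER_S)
t_device = transfer_time_s(size_gb, DEVICE_GB_PER_S)

print(f"Copy over PCI-express:   {t_pcie * 1e3:.1f} ms")
print(f"Read from device memory: {t_device * 1e3:.1f} ms")
print(f"Ratio:                   {t_pcie / t_device:.1f}x")
```

Under these assumptions, data that is copied to the accelerator should be reused many times there before the copy pays off.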
• 1970-2004: Frequency doubles every 34 months (Moore’s law for performance)
• 1999-2014: Parallelism doubles every 30 months
End of frequency scaling
[Figure: Desktop processor performance (SP), 1970-2015, logarithmic axis from 1 to 10000. 1971-2004: frequency doubles every ~34 months. 2004-2014: frequency constant. 1999-2014: parallelism doubles every ~30 months, through SSE (4x), Hyper-Threading (2x), multi-core (2-6x), and AVX (2x).]
• Heat density approaching that of a nuclear reactor core: Power wall
• Traditional cooling solutions (heat sink + fan) insufficient
• Industry solution: multi-core and parallelism!
What happened in 2004?
Original graph by G. Taylor, “Energy Efficient Circuit Design and the Future of Power Delivery” EPEPS’09
[Figure axes: power density (W/cm²) versus critical dimension (µm)]
Why Parallelism?
Single-core: frequency 100%, performance 100%, power 100%
Dual-core: frequency 85%, performance 90% + 90%, power 100%
The power density of microprocessors is proportional to the clock frequency cubed [1].
[1] Brodtkorb et al., State-of-the-art in heterogeneous computing, 2010
• Up to 5760 floating point operations in parallel!
• 5-10 times as power efficient as CPUs!
Massive Parallelism: The Graphics Processing Unit
[Figures: Bandwidth (GB/s), 2000-2015, axis from 0 to 400; Gigaflops (SP), 2000-2015, axis from 0 to 6000]
Parallel algorithm design
• The key to performance is to consider the full interaction between algorithm and architecture.
• A good knowledge of both the algorithm and the computer architecture is required.
Why care about computer hardware?
Graph from David Keyes, Scientific Discovery through Advanced Computing, Geilo Winter School, 2008
• Total performance is the product of
algorithmic and numerical performance
• Your mileage may vary: algorithmic
performance is highly problem
dependent
• Many algorithms have low numerical
performance
• Only able to utilize a fraction of the
capabilities of processors, and often
worse in parallel
• Need to consider both the algorithm and
the architecture for maximum performance
Algorithmic and numerical performance
[Figure axes: numerical performance (vertical) versus algorithmic performance (horizontal)]
• Most algorithms contain a mixture of workloads:
• Some serial parts
• Some task and / or data parallel parts
• Amdahl’s law:
• There is a limit to speedup offered by
parallelism
• Serial parts become the bottleneck for a
massively parallel architecture!
• Example: 5% of code is serial: maximum
speedup is 20 times!
Parallel considerations 1/4
S: Speedup
P: Parallel portion of code
N: Number of processors
Graph from Wikipedia, user Daniels220, CC-BY-SA 3.0
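As a small sketch of Amdahl's law with the symbols from the legend (S: speedup, P: parallel portion of the code, N: number of processors), the speedup is S = 1 / ((1 - P) + P / N). The function name and the processor counts in the loop are illustrative choices, not taken from the talk.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Amdahl's law: S = 1 / ((1 - P) + P / N)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# The slide's example: 5% of the code is serial, so P = 0.95.
for n in (2, 16, 1024, 10**6):
    print(f"N = {n:>7}: speedup = {amdahl_speedup(0.95, n):.2f}")

# As N grows, the speedup approaches 1 / (1 - P) = 20 but never exceeds it.
```

However many processors are added, the 5% serial part caps the speedup at 20 times.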
• Gustafson's law:
• If you cannot reduce serial parts of algorithm,
make the parallel portion dominate the execution
time
• Essentially: solve a bigger problem to get a
bigger speedup!
Parallel considerations 2/4
S: Speedup
P: Number of processors
α: Serial portion of code
Graph from Wikipedia, user Peahihawaii, CC-BY-SA 3.0
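A minimal sketch of Gustafson's law using the legend's symbols (S: speedup, P: number of processors, α: serial portion of the code), in its commonly quoted form S = P - α(P - 1). The function name and the sample values are assumptions made for illustration.

```python
def gustafson_speedup(n_processors, serial_fraction):
    """Gustafson's law: S = P - alpha * (P - 1), for a problem that grows with P."""
    return n_processors - serial_fraction * (n_processors - 1)


# With a 5% serial portion, the scaled speedup keeps growing with the processor count.
for p in (1, 10, 100, 1000):
    print(f"P = {p:>4}: scaled speedup = {gustafson_speedup(p, 0.05):.2f}")
```

Unlike the fixed-size case, the speedup here is not capped, because the parallel part of the work is assumed to grow with the number of processors.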
• A single precision number is four bytes
• You must perform over 60 operations for each
float read on a GPU!
• Over 25 operations on a CPU!
• This groups algorithms into two classes:
• Memory bound
Example: Matrix multiplication
• Compute bound
Example: Computing π
• The third limiting factor is latencies
• Waiting for data
• Waiting for floating point units
• Waiting for ...
Parallel considerations 3/4
[Figure: Optimal FLOPs per byte (SP) for CPU and GPU, 2000-2015, axis from 0 to 18]
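The "operations per float" figure can be sketched as peak floating point rate divided by memory bandwidth, times four bytes per single precision number. The 340 GB/s bandwidth is the figure quoted earlier in the talk; the 5760 GFLOPS peak is an assumed, illustrative value, and the function names and the example kernel intensity are hypothetical.

```python
BYTES_PER_FLOAT = 4  # a single precision number is four bytes


def flops_per_float(peak_gflops, bandwidth_gb_per_s):
    """Operations needed per float read to keep the arithmetic units busy."""
    flops_per_byte = peak_gflops / bandwidth_gb_per_s
    return flops_per_byte * BYTES_PER_FLOAT


def classify(kernel_flops_per_float, machine_flops_per_float):
    """Label a kernel by comparing its intensity with the machine balance."""
    if kernel_flops_per_float >= machine_flops_per_float:
        return "compute bound"
    return "memory bound"


# Assumed GPU numbers, for illustration only.
gpu_balance = flops_per_float(peak_gflops=5760.0, bandwidth_gb_per_s=340.0)
print(f"GPU balance: {gpu_balance:.1f} operations per float")

# A hypothetical kernel doing 2 operations per float read.
print(classify(2.0, gpu_balance))
```

A kernel that performs only a couple of operations per value read will, under these assumptions, be limited by memory bandwidth rather than by arithmetic.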
• Moving data has become the major bottleneck in computing.
• Downloading 1 GB from Japan to Switzerland consumes roughly the energy of 1 charcoal briquette [1].
• A FLOP costs less than moving one byte [2].
• Key insight: flops are free, moving data is expensive
Parallel considerations 4/4
[1] Energy content of charcoal: 10 MJ/kg; energy per GB: 0.2 kWh (Coroama et al., 2013); weight of a charcoal briquette: ~25 grams
[2] Horst Simon, Why we need Exascale, and why we won't get there by 2020, 2014
Lessons learned from working with
multi- and many-core processors
Ca 2005: GPUs with OpenGL: GPGPU
• GPUs were first programmed using OpenGL and other graphics languages
• Mathematics were written as operations on graphical primitives
• Extremely cumbersome and error prone
Early Programming of GPUs
[Figure: Element-wise matrix multiplication, with labels Input A, Input B, Geometry, and Output]
Examples of Early GPU Research at SINTEF
• Preparation for FEM (~5x)
• Euler equations (~25x)
• Marine acoustics (~20x)
• Self-intersection (~10x)
• Registration of medical data (~20x)
• Fluid dynamics and FSI (Navier-Stokes)
• Inpainting (~400x compared to Matlab code)
• Water injection in a fluvial reservoir (20x)
• Matlab interface
• Linear algebra
• SW equations (~25x)
• Larsen & McAllister demonstrated that using GPUs through non-programmable OpenGL could be faster than using ATLAS [1]
• Their algorithm was "simple"
• Matrix-matrix multiplication takes the dot product of row i of matrix A and column j of matrix B to produce element (i, j)
• We can formulate this product for a virtual cube of processors
• Processor (m, n, o) computes the product A[m, n]*B[n, o]
• By summing along the n dimension, the matrix product is complete.
• L&M used textures and blending to implement the virtual cube of processors algorithm
Matrix-matrix multiplication with OpenGL
Matrix multiplication
[1] Fast matrix multiplies using graphics hardware, Larsen and McAllister, 2001
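The virtual cube formulation can be sketched on the CPU in plain Python: each "processor" (m, n, o) forms one product A[m][n] * B[n][o], and summing over the shared index n yields element (m, o) of the result. This only illustrates the index structure; it is not the texture-and-blending implementation from the paper, and the function name is hypothetical.

```python
def matmul_virtual_cube(A, B):
    """Matrix product via an explicit cube of partial products."""
    rows, inner, cols = len(A), len(B), len(B[0])

    # One partial product per virtual processor (m, n, o).
    cube = [[[A[m][n] * B[n][o] for o in range(cols)]
             for n in range(inner)]
            for m in range(rows)]

    # Collapse the cube along the shared index n.
    return [[sum(cube[m][n][o] for n in range(inner)) for o in range(cols)]
            for m in range(rows)]


A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]
C = matmul_virtual_cube(A, B)
print(C)
```

All products in the cube are independent of each other, which is what makes the formulation attractive for massively parallel hardware; only the final summation couples them.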
• Programmable OpenGL came in 2004, which enabled more advanced algorithms.
• Algorithms still had to be mapped to operations on graphical primitives, to trick the GPU into rendering mathematics.
• Errors such as for-loops that silently executed only the first 256 iterations were common
• Painstakingly difficult to develop
PLU factorization with OpenGL
• As part of my master's thesis, I developed a Matlab interface to GPU implementations of linear algebra operations.
• Operations would be enqueued from Matlab onto the GPU
• The GPU would execute the queue of operations in the background
• Matlab would only stall if the GPU was not finished when asking for results. This also enabled heterogeneous computing!
• Thin Matlab scripts enabled transparent use of the GPU: just use gpuMatrix(a) to use the GPU functions!
Easy access to GPU resources
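The enqueue-and-continue pattern described on this slide can be sketched in Python with a single background worker standing in for the GPU. This is not the thesis code: the class DeviceQueue and the function scale are hypothetical names, and a thread replaces the actual device.

```python
from concurrent.futures import ThreadPoolExecutor


class DeviceQueue:
    """Hypothetical stand-in for a GPU command queue: one worker runs operations in order."""

    def __init__(self):
        self._worker = ThreadPoolExecutor(max_workers=1)

    def enqueue(self, operation, *args):
        # Returns immediately with a handle to the future result.
        return self._worker.submit(operation, *args)

    def shutdown(self):
        self._worker.shutdown(wait=True)


def scale(vector, factor):
    """Example operation executed by the background worker."""
    return [factor * x for x in vector]


queue = DeviceQueue()
pending = queue.enqueue(scale, [1.0, 2.0, 3.0], 2.0)  # does not block

host_side = sum(range(10))  # the host is free to do other work meanwhile

result = pending.result()   # stalls only if the operation has not finished
queue.shutdown()
print(host_side, result)
```

The host only waits at the point where it actually asks for the result, which is what allows the CPU and the accelerator to work at the same time.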
Ca 2007: CUDA and the Cell BE SDK
My first encounter with CUDA
• May 1st 2007: I handed in my master's thesis.
• June 15th 2007: I gave the oral presentation of my thesis and received my grade.
• June 23rd 2007: CUDA was released officially.
Most of my thesis was officially obsolete.
NVIDIA CUDA
• CUDA solved the major problems with OpenGL
• Unstable driver and different results on different hardware (it will only run on NVIDIA)
• Uncomfortable implementation regime
• We could now program in a C-like language
• Rapid development of a whole range of new applications
• A huge interest in GPUs emerged
• The first versions, however, had as many compiler bugs as OpenGL
The benefit of CUDA
OpenGL "kernel launch"
• Render primitives that cover part of the screen that represents your computational domain
• The shader which colours the pixel performs the wanted calculation
CUDA kernel launch
• "Pixels" are still the underlying primitive, but now we execute a grid of blocks.
• Each block runs independently, but threads within a block can collaborate
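The grid-of-blocks idea can be illustrated with a serial CPU emulation in Python: every (block, thread) pair runs the same kernel and derives a global index from its block and thread numbers. This is only a sketch of the indexing scheme, not CUDA itself; the names launch and saxpy_kernel and the problem size are hypothetical.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Serial emulation of a 1D launch: run the kernel once per (block, thread) pair."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def saxpy_kernel(block_idx, block_dim, thread_idx, a, x, y, out):
    """Each emulated thread handles one element: out[i] = a * x[i] + y[i]."""
    i = block_idx * block_dim + thread_idx  # global index of this thread
    if i < len(x):                          # guard the last, partially filled block
        out[i] = a * x[i] + y[i]


n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover all n elements

launch(saxpy_kernel, grid_dim, block_dim, 2.0, x, y, out)
print(out)
```

On a real GPU the blocks run in parallel and in no guaranteed order, and threads within a block can additionally share data and synchronise, which this emulation does not show.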
CUDA fever
• CUDA became a superstar in the academic camp "overnight".
• By 2010, there were over 1000 demos, papers, and commercial applications using CUDA on the CUDA Showcase.
• AMD tried countering with Close to Metal (assembly for AMD GPUs) and Brook+, but neither received any noticeable attention.
Google search trends
Exploring CUDA
• CUDA sparked a whole new range of research articles on GPUs
• Hardware exploration
• How was texture memory laid out?
• How large were the different caches?
• Would it be better to use more registers and less shared memory or not?
• Are more threads always better?
• Massive focus on memory movement
• Coalesced reads and writes
• Cached versus non-cached reads
• Texture reads versus cached reads
• Low-hanging fruit was rapidly picked
• More and more articles solved real-world problems
• Proof-of-concept slowly became less interesting
The Cell BE
• In 2007, Sony, Toshiba, and IBM released the Cell BE
• Consisted of one regular low-power CPU, and eight low-power SIMD units
• Originally used in the PlayStation 3 console, and the Roadrunner supercomputer: the first petaflop supercomputer!
• There was a lot of excitement in the GPGPU community: another low-cost, easily available accelerator was here!
The Cell BE
• We spent a lot of time exploring this new architecture, implementing a range of different algorithms
[Figures: Heat equation; Heat equation inpainting; JPEG compression; Mandelbrot]
The Cell BE
• The Cell BE was a nightmare
• You manually had to move data between cores, and know a seemingly endless number of hardware details to achieve performance
• If you did anything wrong, you would get the infamous "bus error", the evil twin of "segmentation fault".
• Game developers reported the same for the PlayStation console
• IBM originally had plans for an updated version with more performance, but these
plans were scrapped early on.
Ca 2014: IPython and PyOpenCL
• OpenCL is much like CUDA, but slightly more cumbersome
to work with.
• The benefit is that the same code can run on
Intel CPUs, the Xeon Phi,
NVIDIA GPUs,
AMD GPUs, etc.
• The amount of code needed to do the exact same thing
is larger in OpenCL
• OpenCL is a C API
• CUDA has C++ bindings, and supports templates
• CUDA has a much better development community for NVIDIA
• OpenCL has a lot of very nice tools available from AMD
OpenCL
IPython and PyOpenCL
• OpenCL is a C API, which requires working in C, and possibly long compilation times
• Even the simplest OpenCL example will require a lot of boilerplate code
• PyOpenCL solves this by enabling access to OpenCL through Python
• The IPython notebook gives us an interactive shell to try out OpenCL and prototype!
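As a rough idea of how little host code is needed, here is a minimal vector addition sketch through Python. It assumes NumPy, PyOpenCL, and a working OpenCL driver are installed; the kernel name add and the function vector_add are hypothetical, and the imports are placed inside the function only so that the snippet can be loaded on a machine without a driver.

```python
KERNEL_SRC = """
__kernel void add(__global const float *a,
                  __global const float *b,
                  __global float *out)
{
    int gid = get_global_id(0);
    out[gid] = a[gid] + b[gid];
}
"""


def vector_add(a_host, b_host):
    """Add two contiguous float32 NumPy arrays of equal length on an OpenCL device."""
    # Imported here so that loading this snippet does not require an OpenCL driver.
    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(ctx)

    # Copy the inputs to the device and allocate room for the result.
    mf = cl.mem_flags
    a_dev = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a_host)
    b_dev = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b_host)
    out_dev = cl.Buffer(ctx, mf.WRITE_ONLY, a_host.nbytes)

    # Compile the kernel at run time and launch one work-item per element.
    program = cl.Program(ctx, KERNEL_SRC).build()
    program.add(queue, a_host.shape, None, a_dev, b_dev, out_dev)

    # Copy the result back to the host.
    out_host = np.empty_like(a_host)
    cl.enqueue_copy(queue, out_host, out_dev)
    return out_host
```

In a notebook cell, something like vector_add(np.arange(8, dtype=np.float32), np.ones(8, dtype=np.float32)) would then run the kernel on whichever device the context selects, with the kernel source editable and recompiled interactively.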
Demo of IPython and PyOpenCL
If time permits
The caveat
• Installing can be slightly cumbersome
• You need an OpenCL driver, which complicates things
• If you're on Ubuntu and have an NVIDIA graphics card, you're lucky
• sudo apt-get install ipython ipython-notebook python-pyopencl
and you're good to go
• If you're on any other platform or hardware, you need to do a bit of manual labour
• But it still shouldn't take longer than half an hour.
Summary
Summary
• We need to consider parallelism in "everything" we do on computers
• We cannot afford to use 1% of the true potential
• GPU computing used to be a pain, but no more
• Even though the GPGPU-crew said it was easy to get started
• Getting started today is easier than ever
• Getting good performance still requires knowledge of the architecture
• The success of GPUs has been unprecedented
• Their widespread availability, low cost, and "easy" programming model have made them a success where other parallel architectures have failed
• It will be exciting to see how the Intel Xeon Phi stands the test of time
Thank you for your attention!
André R. Brodtkorb
Email: [email protected]
Homepage: http://babrodtk.at.ifi.uio.no/
Youtube: http://youtube.com/babrodtk/
SINTEF webpages: http://www.sintef.no/math/