acceleration of parallel algorithms using a heterogeneous computing...
TRANSCRIPT
![Page 1: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/1.jpg)
Acceleration of parallel algorithms
using a heterogeneous computing
system
Ra Inta,
Centre for Gravitational Physics,
The Australian National University NIMS-SNU-Sogang Joint Workshop on Stellar Dynamics and Gravitational-
wave Astrophysics, December 18, 2012
![Page 2: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/2.jpg)
2
Overview
1. Why a new computing architecture?
2. Hardware accelerators
3. The Chimera project
4. Performance
5. Analysis via Berkeley’s ‘Thirteen Dwaves’
6. Issues and future work
7. Conclusion
![Page 3: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/3.jpg)
Why do we need new computing
architectures?
• We live in an increasingly data-driven age
• LHC experiment alone: data rate of 1 PB/s
• Many data-hungry projects coming on-line
• Computationally limited (most LIGO/Virgo
CW searches have computation bounds)
3
![Page 5: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/5.jpg)
…supply side
• Intel
discontinued
4GHz chip
• Speed walls
• Power walls
• Moore’s Law
still holds, but for
multicore
5
![Page 6: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/6.jpg)
Time for a change in high performance
computing?
• Focus on throughput, rather than raw
flop/s (e.g. Condor on LSC clusters)
• Focus on power efficiency (top500 ‘Green
list’)
6
![Page 7: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/7.jpg)
Hardware acceleration
• Response to computational demand
• Two most common types: Graphical
Processor Unit (GPU) and Field
Programmable Gate Array (FPGA)
7
![Page 8: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/8.jpg)
The Graphical Processor Unit
8
• Subsidised by gamers
• Most common accelerator
• Excellent support
(nVidia)
![Page 9: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/9.jpg)
The Field Programmable Gate Array
Programmable hardware (‘gateware’)
replacement for application-specific
integrated circuits (ASICs)
9
![Page 10: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/10.jpg)
Cultural differences
• GPUs: software developers
• FPGAs: hardware, electronic engineers
10
FPGA development
is a lot less like
traditional
microprocessor
programming!
(VHDL, Verilog)
![Page 11: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/11.jpg)
Platform Pros Cons
CPU Analysis ‘workhorse,’
multi-tasking
Power hungry, limited
processing cores
GPU Highly parallel, fairly
simple interface (e.g.
C for CUDA)
Highly rigid instruction set
(don’t handle complex
pipelines)
FPGA Unrivalled flexibility
and pipelining
Expensive outlay,
specialised programming
interface, prohibitive
development time
11
![Page 12: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/12.jpg)
Choosing the right hardware
Highly algorithm dependent, e.g.
• Dense linear algebra: GPU
• Logic-intensive operations: FPGA
• Pipelining: FPGA
12
![Page 13: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/13.jpg)
The ‘Chimera’ project
• Appropriately(?) combine
benefits of three hardware
classes
• Commercial, off the shelf
(COTS) components
• Focus on high throughput,
energy efficiency
13
(Image: E. Koehn)
![Page 14: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/14.jpg)
Initial design
14
PCI Express
COTS constraints
![Page 15: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/15.jpg)
Subsystem Vendor Model
CPU Intel i7 Hexacore
GPGPU nVidia Tesla C2075
FPGA Altera Stratix-IV
15
![Page 16: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/16.jpg)
16
![Page 17: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/17.jpg)
Performance
17
![Page 18: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/18.jpg)
(32-bit integer (x, y) pairs)
Throughput:
• FPGA alone: 24 ×109 samp/s
• GPU alone: 2.1×109 samp/s
• CPU alone: 10 ×106 samp/s
18
Surprising!
• FPGA pipeline was highly optimised
• Expect GPU to perform better with more complicated
(floating point) integration problems
• Expect CPU to have even worse performance
![Page 19: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/19.jpg)
• image processing
• synthetic aperture arrays
(SKA, VLBA)
• video compression
19
![Page 20: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/20.jpg)
Searched 1024×768 pixel image (16-bit
greyscale) for a 8×8 template
• Numerator
GPU, 158 frame/s
• Denominator
GPU: 894 frame/s
FPGA: 12,500 frame/s
20
So much for a
Graphical
Processor Unit!
Best combination:
GPU + FPGA
![Page 21: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/21.jpg)
• Data analysis
• Interface GPUs with
experimental FPGA DAQ
systems (GMT, quantum
cryptography)
21
R. Inta, D. J. Bowman, and S. M. Scott, “The ‘Chimera’: An Off-The-Shelf
CPU/GPGPU/FPGA Hybrid Computing Platform,” Int. J. Reconfigurable Comp.,
2012(241439), 10 pp. (2012)
![Page 22: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/22.jpg)
The Thirteen Dwarves of Berkeley
• Phillip Colella identified 7 classes of
parallel algorithms (2004)
• A dwarf: “an algorithmic method
encapsulating a pattern of computation
and/or communication”
• UC Berkeley CS Dept. increased to 13
22
K. Asanovic and U C Berkeley Computer Science Department, “The landscape of
parallel computing research: a view from Berkeley,” Tech. Rep. UCB/EECS-2006-
183, UC Berkeley (2005)
![Page 23: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/23.jpg)
Dwarf Examples/Applications
1 Dense Matrix Linear algebra (dense matrices)
2 Sparse Matrix Linear algebra (sparse matrices)
3 Spectral FFT-based methods
4 N-Body Particle-particle interactions
5 Structured Grid Fluid dynamics, meteorology
6 Unstructured Grid Adaptive mesh FEM
7 MapReduce Monte Carlo integration
8 Combinational Logic Logic gates (e.g. Toffoli gates)
9 Graph traversal Searching, selection
10 Dynamic Programming ‘Tower of Hanoi’ problem
11
Backtrack/Branch-and-
Bound Global optimization
12 Graphical Models Probabilistic networks
13 Finite State Machine TTL counter
23
![Page 24: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/24.jpg)
24
*Fixed point
^Floating point
![Page 25: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/25.jpg)
Issues/future work
• Kernel module development
• Scalability is problem-size dependent
• FPGA development is labour-intensive
• Pipeline development
• GW data analysis!
25
![Page 26: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/26.jpg)
Issues/future work
26
![Page 27: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/27.jpg)
Issues/future work
• Kernel module development
• Scalability is problem-size dependent
• FPGA development is labour-intensive
• Pipeline development
• GW data analysis!
27
![Page 28: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/28.jpg)
Conclusion
• We are in a transition period for HPC
architecture
• Can exploit highly heterogeneous, COTS,
hardware accelerators for certain algorithms
• Powerful and energy efficient
• Requires further development
28
More information:
www.anu.edu.au/physics/cgp/Research/chimera.html
![Page 29: Acceleration of parallel algorithms using a heterogeneous computing …raata/Conference_files/Inta_NIMS... · 2016-05-03 · Acceleration of parallel algorithms using a heterogeneous](https://reader030.vdocuments.us/reader030/viewer/2022041015/5ec6fa8643af28539a4c99be/html5/thumbnails/29.jpg)
Thanks to NIMS and Seoul National
University!
• Dr Sang Hoon Oh
• Dr John J. Oh
29