HPC Acceleration with Dataflow Programming
Oliver Pell
UCHPC Workshop 2013
Overview
• The challenges
• Heterogeneous dataflow computers
• Programming dataflow computers
• Application case studies
The Performance Challenge: 1 Exaflop
• 1 exaflop = 10^18 FLOPS
• Using processor cores with 8 FLOPS/clock at 2.5GHz → 50M CPU cores
• What about power?
  – Assume a power envelope of 100W per chip
  – Moore's Law scaling: 8 cores today → ~100 cores/chip
  – → 500k CPU chips
• 50MW (just for CPUs!) – 100MW+ likely in practice
• 'Titan': 17PF @ 8MW → ~470MW for exascale at the same efficiency
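As a sanity check of the arithmetic on this slide, here is a small back-of-the-envelope calculation (plain Java, not from the talk; the 100 W/chip and ~100 cores/chip figures are the slide's own assumptions):

```java
public class ExaflopEstimate {
    public static void main(String[] args) {
        double target = 1e18;                    // 1 exaflop = 10^18 FLOPS
        double flopsPerCore = 8 * 2.5e9;         // 8 FLOPS/clock at 2.5 GHz = 20 GFLOPS/core
        double cores = target / flopsPerCore;    // ≈ 5e7 → 50M CPU cores

        double coresPerChip = 100;               // Moore's Law scaling from ~8 cores today
        double chips = cores / coresPerChip;     // ≈ 500k CPU chips
        double cpuPowerMW = chips * 100 / 1e6;   // at 100 W per chip → ≈ 50 MW for CPUs alone

        System.out.printf("cores = %.0fM, chips = %.0fk, CPU power = %.0f MW%n",
                cores / 1e6, chips / 1e3, cpuPowerMW);
    }
}
```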
Extreme Parallelism
• Spatial decomposition on a 10,000³ regular grid
• 1.0 Terapoints
• 20k points per core
• 27³ region per core
• Computing a 13x13x13 convolution stencil: 66% halo

Limited by communication bandwidth
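The halo figure can be reproduced with a short calculation (a sketch, not from the talk; it assumes the 13x13x13 stencil means a halo of 6 points on every side of each core's 27³ block):

```java
public class HaloFraction {
    public static void main(String[] args) {
        int block = 27;                   // each core owns a 27^3 region (≈ 20k points)
        int radius = (13 - 1) / 2;        // 13x13x13 stencil → 6-point halo per side

        long owned  = (long) block * block * block;       // 19,683 points owned
        int  padded = block + 2 * radius;                 // 39 points per axis including halo
        long needed = (long) padded * padded * padded;    // 59,319 points read

        double haloShare = 1.0 - (double) owned / needed; // roughly two thirds is halo
        System.out.printf("halo share of data read per core: %.0f%%%n", haloShare * 100);
    }
}
```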
The Data Movement Challenge
• Moving data on-chip will use as much energy as computing with it
• Moving data off-chip will use 200x more energy – and is much slower as well (illustrated below)

The power challenge (energy per operation):
• Double precision FLOP: 100pJ today → 10pJ in 2018–20
• Moving data on-chip, 1mm: 6pJ
• Moving data on-chip, 20mm: 120pJ
• Moving data to off-chip memory: 5000pJ today → 2000pJ in 2018–20
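To make the 200x claim concrete, a toy comparison using the 2018–20 figures above (an illustration only; it assumes a memory-bound kernel that fetches one operand from off-chip memory per floating-point operation):

```java
public class EnergyBalance {
    public static void main(String[] args) {
        double flopPj    = 10;     // double-precision FLOP, 2018–20 projection
        double offChipPj = 2000;   // moving data to off-chip memory, 2018–20 projection

        // Assumed workload: one off-chip operand fetched per FLOP.
        double ratio = offChipPj / flopPj;
        System.out.printf("data movement uses %.0fx the energy of the arithmetic%n", ratio);
    }
}
```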
Solving Real Problems
• Peak performance doesn’t matter
• “Real” performance on LINPACK doesn’t matter
• The TOP500 doesn’t matter
• We’re interested in solving real problems
– Science
– Commerce
– Industry
• Can we build application-specific computers?
• How should we program them?
Dataflow Computers
Control flow vs. Dataflow
Where is silicon used?
Intel 6-Core X5680 "Westmere"
Where is silicon used?
• Intel 6-Core X5680 "Westmere": computation occupies only part of the die
• Dataflow Processor: most of the chip is computation (dataflow cores), with a small MaxelerOS region
A Special Purpose Computer?
• A custom chip for a specific application
• No instructions → no instruction decode logic
• No branches → no branch prediction
• Explicit parallelism → no out-of-order scheduling
• Data streamed on-chip → no multi-level caches

[Diagram: "MyApplication Chip" attached to (lots of) memory and connected to the rest of the world]
A Special Purpose Computer?
• But we have more than one application
• Generally impractical to have machines that are completely optimized for only one code

[Diagram: several "MyApplication" chips and an "OtherApplication" chip, each with its own memory, linked by a network to the rest of the world]
A Special Purpose Computer?
• Use a reconfigurable chip that can be reprogrammed at runtime to implement:
  – Different applications
  – Or different versions of the same application
• Allows us to dynamically (re)allocate parts of a machine to different application tasks

[Diagram: a reconfigurable chip with memory and network, loaded with "Config 1"; configurations optimized for Application A, B, C, D or E can each be loaded onto the same chip]
Heterogeneous Dataflow System
[Diagram: CPUs and DFEs (dataflow engines) connected by an interconnect]
Maxeler Hardware
• CPUs plus DFEs: Intel Xeon CPU cores and up to 4 DFEs with 192GB of RAM
• DFEs shared over Infiniband: up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers
• Low latency connectivity: Intel Xeon CPUs and 1-4 DFEs with up to twelve 40Gbit Ethernet connections
• MaxWorkstation: desktop development system
• MaxCloud: on-demand scalable accelerated compute resource, hosted in London
MPC-C500
• 1U form factor
• 4x dataflow engines
• 12 Intel Xeon cores
• 96GB DFE RAM
• Up to 192GB CPU RAM
• MaxRing interconnect
• 3x 3.5" hard drives
• Infiniband/10GigE
MPC-X1000
• 8 dataflow engines (384GB RAM)
• High-speed MaxRing
• Zero-copy RDMA between CPUs and DFEs over Infiniband
• Dynamic CPU/DFE balancing
Dataflow clusters
• Optimized to balance resources for particular application challenges
• Individual MPC-X nodes typically 20-50x faster than 2-socket x86 nodes
• 10-30x density and power advantage at the rack level

[Photos: 48U seismic imaging cluster; 42U in-memory analytics cluster]
Programming Dataflow Systems
Acceleration Flow
1. Start with the original application.
2. Identify code for acceleration and analyze bottlenecks.
3. Transform the application, architect the solution and model performance.
4. Write MaxCompiler code and simulate the DFE.
5. Functions correctly? If not, revise the MaxCompiler code.
6. Build the full DFE configuration and integrate with the CPU code.
7. Meets performance goals? If not, revisit the transformation and performance model.
8. Accelerated application.
Consider Data Movement first
[Chart: transformations of data access plans and code partitioning, plotted by runtime against development time, with the Pareto-optimal options highlighted]
Try to minimise runtime and development time, while maximising flexibility and precision.
Data Flow Analysis: Matrix Multiply
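The transcript does not preserve this slide's figure. As a stand-in, the naive loop nest below shows what a data-flow analysis of matrix multiply has to reason about: every element of A and B is read N times, so the choice of what to stream and what to hold on chip dominates the data movement (illustrative Java, not from the talk):

```java
public class MatMulDataflow {
    // Naive C = A * B for N x N matrices: 2*N^3 FLOPs over only 3*N^2 values,
    // so each element of A and B is reused N times. A data-flow analysis decides
    // which operand to stream and which to keep in on-chip memory.
    static void multiply(double[][] a, double[][] b, double[][] c, int n) {
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double acc = 0.0;
                for (int k = 0; k < n; k++) {
                    acc += a[i][k] * b[k][j]; // a[i][k] reused for every j, b[k][j] for every i
                }
                c[i][j] = acc;
            }
        }
    }
}
```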
Co-design approach
• Maxeler team includes mathematicians, scientists, electronic engineers and computer scientists
• Deploy domain expertise to co-design application, algorithm and computer architecture
Programming with MaxCompiler
[Diagram: the application's computationally intensive components run on the DFE and are invoked from the CPU code through the SLiC interface]
Programming with MaxCompiler

Computation: y_i = x_i * x_i + 30

CPU Code (.c):
    int *x, *y;
    for (int i = 0; i < DATA_SIZE; i++)
        y[i] = x[i] * x[i] + 30;

[Diagram: CPU with main memory running the CPU code]
Programming with MaxCompiler

CPU Code (.c):
    #include "MaxSLiCInterface.h"
    #include "Calc.max"

    int *x, *y;

    Calc(x, y, DATA_SIZE);

MyKernel (.java):
    DFEVar x = io.input("x", dfeInt(32));
    DFEVar result = x * x + 30;
    io.output("y", result, dfeInt(32));

Manager (.java):
    Manager m = new Manager("Calc");
    Kernel k = new MyKernel();
    m.setKernel(k);
    m.setIO(link("x", CPU), link("y", CPU));
    m.createSLiCInterface();
    m.build();

[Diagram: the CPU calls Calc() through SLiC and MaxelerOS; x streams over PCI Express into the dataflow kernel (x * x + 30) on the chip and y streams back to main memory]
Programming with MaxCompiler

CPU Code (.c):
    #include "MaxSLiCInterface.h"
    #include "Calc.max"

    int *x, *y;

    device = max_open_device(maxfile, "/dev/maxeler0");
    Calc(x, DATA_SIZE);

MyKernel (.java):
    DFEVar x = io.input("x", dfeInt(32));
    DFEVar result = x * x + 30;
    io.output("y", result, dfeInt(32));

Manager (.java):
    Manager m = new Manager("Calc");
    Kernel k = new MyKernel();
    m.setKernel(k);
    m.setIO(link("x", CPU), link("y", LMEM_LINEAR1D));
    m.createSLiCInterface();
    m.build();

[Diagram: as before, but the output stream y is now written to the DFE's on-card memory (LMEM) rather than streamed back to the CPU, so the CPU call becomes Calc(x, DATA_SIZE)]
The Full Kernel

    public class MyKernel extends Kernel {
        public MyKernel (KernelParameters parameters) {
            super(parameters);
            HWVar x = io.input("x", hwInt(32));
            HWVar result = x * x + 30;
            io.output("y", result, hwInt(32));
        }
    }

[Dataflow graph: x → (x * x) + 30 → y]
Kernel Streaming: Programmer's View
The kernel processes one element of the input stream at a time. For the input stream 0, 1, 2, 3, 4, 5, each element flows through the graph (x * x, then + 30) and the output stream 30, 31, 34, 39, 46, 55 is produced, one value per input element.
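A minimal software model of what the stream view above computes (plain Java, not MaxCompiler code):

```java
import java.util.Arrays;

public class StreamModel {
    // Software model of the kernel: y = x*x + 30 applied element-wise to a stream.
    static int[] run(int[] xs) {
        int[] ys = new int[xs.length];
        for (int i = 0; i < xs.length; i++) {
            ys[i] = xs[i] * xs[i] + 30;
        }
        return ys;
    }

    public static void main(String[] args) {
        // The input stream 0..5 gives the output stream shown on the slides.
        System.out.println(Arrays.toString(run(new int[]{0, 1, 2, 3, 4, 5})));
        // -> [30, 31, 34, 39, 46, 55]
    }
}
```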
Kernel Streaming: In Hardware
In hardware the same graph is a pipeline: the multiplier and adder are separate stages, so successive input elements are in flight in different stages at the same time. After a few cycles of pipeline fill the first result (30) emerges, and from then on one result is produced per cycle until the complete output stream 30, 31, 34, 39, 46, 55 has been written.
Real data flow graph as generated by MaxCompiler
4866 nodes; 10,000s of stages/cycles
Application examples
Seismic 3D Finite Difference Modeling
• Geophysical Model
  – 3D acoustic wave equation
  – Variable velocity and density
  – Isotropic medium
• Numerical Model
  – Finite differences (12th order convolution); a 1-D sketch follows below
  – 4th order in time
  – Point source, absorbing boundary conditions

T. Nemeth et al, 2008
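The talk does not include the kernel source; the sketch below only illustrates how a high-order stencil maps onto the kernel style of the earlier slides, reduced to one dimension. It assumes MaxCompiler constructs not shown in the transcript (dfeFloat, constant.var and stream.offset for reaching neighbouring stream elements), and the coefficient values are placeholders:

```java
// Illustrative sketch only - not the kernel from the talk.
// A 1-D 12th-order central difference over a streaming wavefield,
// written in the style of the MyKernel example above.
public class StencilKernel extends Kernel {
    public StencilKernel(KernelParameters parameters) {
        super(parameters);
        DFEVar p = io.input("p", dfeFloat(8, 24));       // single-precision input stream

        // Placeholder stencil coefficients (hypothetical values).
        double[] c = { -2.98, 1.71, -0.14, 0.02, -0.003, 0.0003, -0.00002 };

        DFEVar sum = constant.var(dfeFloat(8, 24), c[0]) * p;
        for (int k = 1; k <= 6; k++) {
            // stream.offset reads the stream element k positions away,
            // i.e. the stencil neighbours, without any explicit buffering code.
            sum = sum + constant.var(dfeFloat(8, 24), c[k])
                      * (stream.offset(p, -k) + stream.offset(p, k));
        }
        io.output("d2p", sum, dfeFloat(8, 24));
    }
}
```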
Modeling Results
• Up to 240x speedup for 1 MAX2 card compared to a single CPU core
• Speedup increases with cube size
• 1 billion point modeling domain using a single card

[Plot: FD modeling performance - speedup compared to a single core vs. domain size (n³)]
Sparse Matrix Solving
• Sparse matrices are used in a variety of important applications
• Matrix solving: given matrix A and vector b, find vector x in Ax = b
• Direct or iterative solver; the streaming core of iterative solvers is sketched below
• Structured vs. unstructured matrices

O. Lindtjorn et al, 2010
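For context (not from the talk), the core operation inside most iterative solvers is a sparse matrix-vector multiply. In the standard compressed sparse row (CSR) layout it looks like the loop below; the irregular reads of x are what a domain-specific encoding, as on the next slide, aims to tame:

```java
public class SpMV {
    // y = A*x with A stored in compressed sparse row (CSR) form:
    // values[k] holds a non-zero, colIdx[k] its column,
    // rowPtr[r]..rowPtr[r+1] the range of non-zeros in row r.
    static void multiply(double[] values, int[] colIdx, int[] rowPtr,
                         double[] x, double[] y) {
        for (int row = 0; row < y.length; row++) {
            double acc = 0.0;
            for (int k = rowPtr[row]; k < rowPtr[row + 1]; k++) {
                acc += values[k] * x[colIdx[k]];   // irregular, data-dependent access into x
            }
            y[row] = acc;
        }
    }
}
```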
Sparse Matrix in DFE
[Plot: speedup per 1U node vs. compression ratio for the matrices GREE0A and 1new01, using Maxeler domain-specific address and data encoding]
Speedup is 20x-40x per 1U at 200MHz
Credit Derivatives Valuation & Risk
• Compute value of complex financial derivatives (CDOs)
• Typically run overnight, but beneficial to compute in real-time
• Many independent jobs
• Speedup: 220-270x
• Power consumption drops from 250W to 235W per node
O. Mencer and S. Weston, 2010
Bitcoin mining
• Cryptographic application
• Repeated SHA-256 hashing to search for a hash value that meets certain criteria
• The "miner" who finds the hash is rewarded with bitcoins
• Minimizing energy cost per computation is key to scalability & profitability
• MPC-X2000 provides 12GHash/s @ 15MHash/W – ~10x the power efficiency of a GPU

[Figure: structure of one of the 64 stages of the SHA-256 function (image: Wikipedia)]
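As a rough illustration of "repeated SHA-256 hashing to search for a hash value that meets certain criteria" (a simplified toy, not how the DFE miner is built; the header bytes and difficulty test are placeholders):

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;

public class MiningSketch {
    public static void main(String[] args) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        byte[] header = "example block header".getBytes();   // placeholder data
        int zeroBytes = 2;                                    // toy difficulty target

        for (long nonce = 0; ; nonce++) {
            byte[] candidate = ByteBuffer.allocate(header.length + Long.BYTES)
                    .put(header).putLong(nonce).array();
            byte[] hash = sha.digest(sha.digest(candidate)); // double SHA-256 of header+nonce

            boolean meetsTarget = true;
            for (int i = 0; i < zeroBytes; i++) {
                if (hash[i] != 0) { meetsTarget = false; break; }
            }
            if (meetsTarget) {
                System.out.println("nonce found: " + nonce);
                break;
            }
        }
    }
}
```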
Meteorological Modelling
• Local area model
• Computationally intensive process
• Accurate local weather forecasts require dense grids
• Dataflow computing can provide order-of-magnitude performance boosts
• Example: 1 dataflow node equivalent to 381 cores – joint work between Maxeler and CRS4 Lab, Italy
Maxeler University Program
Summary & Conclusions
• Data movement is the fundamental challenge for current and future computer systems
• Dataflow computing achieves high performance through:
  – Explicitly putting data movement at the heart of the program
  – Employing massive parallelism at low clock frequencies
• Many scientific applications can benefit from this approach