![Page 1: Veljko Milutinović and Saša Stojanović University of Belgrade Oliver Pell and Oskar Mencer Maxeler Technologies 1](https://reader036.vdocuments.us/reader036/viewer/2022062322/56649e4d5503460f94b43a0e/html5/thumbnails/1.jpg)
Veljko Milutinović and Saša Stojanović, University of Belgrade
Oliver Pell and Oskar Mencer, Maxeler Technologies
DataFlow Computing for Exascale HPC
Essence of the Approach!
Compiling below the machine-code level brings speedups, as well as lower power, size, and cost.
The price to pay: the machine is more difficult to program.
Consequently: ideal for WORM (write once, run many) applications :)
Examples: geophysics, banking, life sciences, data mining...
ControlFlow vs. DataFlow
DataFlow Programming
Assumptions:
1. Software includes enough parallelism to keep all cores busy.
2. The only limiting factor is the number of cores.

$$t_{GPU} = N \cdot N_{OPS} \cdot C_{GPU} \cdot T_{clkGPU} / N_{coresGPU}$$

$$t_{CPU} = N \cdot N_{OPS} \cdot C_{CPU} \cdot T_{clkCPU} / N_{coresCPU}$$

$$t_{DF} = N_{OPS} \cdot C_{DF} \cdot T_{clkDF} + (N - 1) \cdot T_{clkDF} / N_{DF}$$
The essential figure
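The three timing expressions on this slide can be evaluated numerically to see where the dataflow term starts to win. The sketch below is only an illustration. The slide does not define its symbols, so the readings used here are assumptions: N as the number of data items, NOPS as operations per item, C as cycles per operation, Tclk as the clock period, Ncores as the core count, and NDF as the number of parallel dataflow pipelines. The parameter values in the usage part are made up, not measurements.

```python
def t_control_flow(n, n_ops, cycles_per_op, t_clk, n_cores):
    """tCPU / tGPU form: all N * NOPS operations are shared among the cores."""
    return n * n_ops * cycles_per_op * t_clk / n_cores


def t_dataflow(n, n_ops, cycles_per_op, t_clk, n_df):
    """tDF form: a one-time pipeline latency plus a streaming term."""
    fill = n_ops * cycles_per_op * t_clk      # latency through the pipeline
    stream = (n - 1) * t_clk / n_df           # remaining items, one per clock per pipeline
    return fill + stream


if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from the slides).
    n, n_ops = 10**9, 100
    t_cpu = t_control_flow(n, n_ops, 1, 1 / 3e9, 8)    # 8 cores at 3 GHz
    t_df = t_dataflow(n, n_ops, 1, 1 / 200e6, 1)       # 1 pipeline at 200 MHz
    print(f"control flow: {t_cpu:.3f} s, dataflow: {t_df:.3f} s")
```

Note that only the first term of the dataflow expression contains NOPS, so under these readings the advantage over the control-flow expressions grows with the number of operations per data item once N is large.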
MultiCore / DualCore
?
Where are the horses going?
Is it possible to use 2,000 chickens instead of two horses?
ManyCore
ManyCore
2 x 1000 chickens
DataFlow
How about 2 000 000 ants?
Data
Marmalade
DataFlow
Big Data Input Results
Factor: 20 to 200
Why is DataFlow so Much Faster?
MultiCore/ManyCore: Machine Level Code
Dataflow: Gate Transfer Level
Factor: 20
Why are Electricity Bills so Small?
MultiCore/ManyCore
Dataflow
Factor: 20
Why is the Cubic Foot so Small?
Data Processing
Process Control
Data Processing
Process Control
MultiCore/ManyCore
Dataflow
MultiCore: Explain what to do, to the driver. Caches, instruction buffers, and predictors needed.
ManyCore: Explain what to do, to many sub-drivers. Reduced caches and instruction buffers needed.
DataFlow: Make a field of processing gates. No caches, instruction buffers, or predictors needed.
Required Programming Effort?
MultiCore: Business as usual
ManyCore: More difficult
DataFlow: Much more difficult. Debugging both application and configuration code.
Required Debug Effort?
MultiCore/ManyCore: Several minutes
DataFlow: Several hours
Required Compilation Effort?
Now the Fun Part
MultiCore: Horse stable
ManyCore: Chicken house
DataFlow: Ant hole
Required Space?
MultiCore: Haystack
ManyCore: Corn bits
DataFlow: Crumbs
Required Energy?
Why Faster?
Small Data
Why Faster?
Medium Data
Why Faster? Big Data
• Power consumption: massive static parallelism at low clock frequencies.
• Concurrency and communication: concurrency between millions of tiny cores is difficult, and “jitter” between cores will harm performance at synchronization points. “Fat” dataflow chips minimize the number of engines needed, and statically scheduled dataflow cores minimize jitter.
• Reliability and fault tolerance: 10-100x fewer nodes, so failures occur much less often.
• Memory bandwidth and FLOP/byte ratio: optimize data movement first, and computation second.
DataFlow for Exascale Challenges
• DataFlow engines handle the bulk part of computation (as a “coprocessor”)
• Traditional ControlFlow CPUs run the OS, main application code, etc.
• Lots of different ways these can be combined
Combining ControlFlow with DataFlow
Maxeler Hardware
• CPUs plus DFEs: Intel Xeon CPU cores and up to 4 DFEs with 192GB of RAM
• DFEs shared over Infiniband: up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers
• Low latency connectivity: Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections
• MaxWorkstation: desktop development system
• MaxCloud: on-demand scalable accelerated compute resource, hosted in London
• Tightly coupled DFEs and CPUs
• Simple data center architecture with identical nodes
MPC-C
Credit Derivatives Valuation & Risk
• Compute value of complex financial derivatives (CDOs)
• Typically run overnight, but beneficial to compute in real-time
• Many independent jobs
• Speedup: 220-270x
• Power consumption drops from 250W to 235W per node
O. Mencer and S. Weston, 2010
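Taken together, the speedup and the per-node power figures imply a reduction in energy per job. The arithmetic below is a rough sketch. It assumes the quoted wattages are average draw while a job runs and that the comparison is node for node, neither of which the slide states.

```python
def energy_reduction(speedup, p_before_w, p_after_w):
    """Energy per job before, divided by energy per job after.

    Energy = power * time, and the accelerated run takes time / speedup,
    so the ratio is speedup * P_before / P_after.
    """
    return speedup * p_before_w / p_after_w


if __name__ == "__main__":
    # Speedup range and wattages as quoted on the slide.
    for speedup in (220, 270):
        ratio = energy_reduction(speedup, 250.0, 235.0)
        print(f"speedup {speedup}x -> about {ratio:.0f}x less energy per job")
```

Under these assumptions almost all of the energy saving comes from the shorter runtime; the 15W difference per node contributes only a few percent.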
• Seismic processing application
• Velocity independent / data driven method to obtain a stack of traces, based on 8 parameters
– Search for every sample of each output trace
CRS Trace Stacking (P. Marchetti et al., 2010)
• 2 parameters (emergence angle and azimuth)
• 3 Normal wavefront parameters (K_N,11; K_N,12; K_N,22)
• 3 NIP wavefront parameters (K_NIP,11; K_NIP,12; K_NIP,22)
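$$t_{hyp}^2 = \left(t_0 + \frac{2}{v_0}\,\mathbf{w}^T\mathbf{m}\right)^2 + \frac{2 t_0}{v_0}\left(\mathbf{m}^T H_{zy} K_N H_{zy}^T\,\mathbf{m} + \mathbf{h}^T H_{zy} K_{NIP} H_{zy}^T\,\mathbf{h}\right)$$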
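For every output sample, the parameter search has to evaluate a hyperbolic traveltime for each candidate parameter set. The sketch below shows one such evaluation in plain Python, assuming the commonly cited 3D CRS hyperbolic form given in the docstring. The function names, the argument layout, and the use of precomputed 2x2 matrices in place of the products of H_zy with K_N and K_NIP are illustrative assumptions, not the authors' implementation.

```python
import math


def quad_form(a, x):
    """Return x^T A x for a 2x2 matrix A (nested sequences) and a 2-vector x."""
    return (a[0][0] * x[0] * x[0]
            + (a[0][1] + a[1][0]) * x[0] * x[1]
            + a[1][1] * x[1] * x[1])


def crs_traveltime(t0, v0, w, a_n, a_nip, m, h):
    """Hyperbolic CRS traveltime (assumed form):

    t^2 = (t0 + (2/v0) w.m)^2 + (2 t0 / v0) (m^T A_N m + h^T A_NIP h)

    w      2-vector derived from emergence angle and azimuth
    a_n    2x2 matrix standing in for H_zy K_N   H_zy^T
    a_nip  2x2 matrix standing in for H_zy K_NIP H_zy^T
    m, h   midpoint displacement and half-offset (2-vectors)
    """
    linear = t0 + (2.0 / v0) * (w[0] * m[0] + w[1] * m[1])
    curvature = (2.0 * t0 / v0) * (quad_form(a_n, m) + quad_form(a_nip, h))
    t_squared = linear * linear + curvature
    # A negative value means the candidate parameters are not physical here.
    return math.sqrt(t_squared) if t_squared >= 0.0 else float("nan")
```

In a stacking search this function would be called once per trace and per candidate parameter set, which is why the evaluation is a natural candidate for a deep pipeline.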
• Performance of MAX2 DFEs vs. 1 CPU core
– Land case (8 params), speedup of 230x
– Marine case (6 params), speedup of 190x
CRS Results
CPU Coherency MAX2 Coherency
• DFEs are shared resources on the cluster, accessible via Infiniband connections
• Loose coupling optimizes efficiency
• Communication managed in hardware for performance
MPC-X
1. Coarse grained, stateful
– CPU requires DFE for minutes or hours
2. Fine grained, stateless transactional
– CPU requires DFE for ms to s
– Many short computations
3. Fine grained, transactional with shared database
– CPU utilizes DFE for ms to s
– Many short computations, accessing common database data
Major Classes of Applications
• Long runtime, but:
• Memory requirements change dramatically based on modelled frequency
• Number of DFEs allocated to a CPU process can be easily varied to increase available memory
• Streaming compression
• Boundary data exchanged over chassis MaxRing
Coarse Grained: FD Wave Modeling
[Figure: equivalent CPU cores (0 to 2,000) vs. number of MAX2 cards (1, 4, 8), for 15Hz, 30Hz, 45Hz, and 70Hz peak frequency]
[Figure: timesteps (thousand), domain points (billion), and total computed points (trillion) vs. peak frequency (0 to 80 Hz)]
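The way memory and work depend on the modelled frequency can be estimated with the usual finite-difference scaling argument. With a fixed number of grid points per wavelength, the grid spacing shrinks as 1/f, so a 3D domain grows as f^3. A stability-limited timestep then adds another factor of f to the total work. The sketch below is a rule of thumb under those assumptions (fixed physical domain, velocity model, and simulated time). It is not the model behind the plotted values.

```python
def fd_scaling(f_ref_hz, f_hz, dims=3):
    """Relative growth of an FD wave simulation when the peak frequency changes.

    Returns multipliers relative to the reference frequency, assuming grid
    spacing and timestep both scale as 1/f.
    """
    ratio = f_hz / f_ref_hz
    return {
        "domain_points": ratio ** dims,        # memory footprint
        "timesteps": ratio,                    # number of steps for the same simulated time
        "total_points": ratio ** (dims + 1),   # overall work
    }


if __name__ == "__main__":
    # Frequencies taken from the figure legend; 15 Hz is used as the reference.
    for f in (30, 45, 70):
        print(f, fd_scaling(15, f))
```

The cubic growth in domain points is the reason the ability to vary the number of DFEs per CPU process matters: higher frequencies quickly exceed the memory of a single engine.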
• Portfolio with thousands of Vanilla European Options
• Analyse > 1,000,000 scenarios
• Many CPU processes run on many DFEs
– Each transaction executes on any DFE in the assigned group atomically
• ~50x MPC-X vs. multi-core x86 node
Fine Grained, Stateless: BSOP
[Figure: the CPU sends market and instruments data to a DFE; the DFE loops over instruments (random number generator and sampling of underliers, then pricing of instruments using Black Scholes) and returns instrument values; tail analysis runs on the CPU. The diagram repeats this for several CPU/DFE pairs.]
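The flow on this slide maps onto a small program: the pricing loop over sampled scenarios and instruments is the part that would run on a DFE, and the tail analysis stays on the CPU. The sketch below runs everything on the CPU in plain Python. The closed-form Black-Scholes call price is standard. The scenario model (lognormal shocks to the spot price), the portfolio layout, the parameter values, and the 1% tail measure are illustrative assumptions.

```python
import math
import random


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bs_call(spot, strike, t, rate, vol):
    """Black-Scholes price of a European call option."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * t) / (vol * math.sqrt(t))
    d2 = d1 - vol * math.sqrt(t)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * t) * norm_cdf(d2)


def scenario_values(portfolio, n_scenarios, shock_vol, seed=0):
    """Portfolio value under randomly shocked spot prices (the DFE-side work)."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_scenarios):
        total = 0.0
        for spot, strike, t, rate, vol in portfolio:
            shocked = spot * math.exp(rng.gauss(0.0, shock_vol))  # sample the underlier
            total += bs_call(shocked, strike, t, rate, vol)       # price the instrument
        values.append(total)
    return values


def tail_value(values, quantile=0.01):
    """Portfolio value at the given lower quantile (the CPU-side tail analysis)."""
    ordered = sorted(values)
    return ordered[int(quantile * (len(ordered) - 1))]


if __name__ == "__main__":
    # Hypothetical two-option portfolio: (spot, strike, years, rate, volatility).
    portfolio = [(100.0, 100.0, 1.0, 0.05, 0.2), (50.0, 55.0, 0.5, 0.05, 0.3)]
    values = scenario_values(portfolio, n_scenarios=10_000, shock_vol=0.1)
    print("1% tail value:", tail_value(values))
```

Each scenario is independent of the others, which is what makes the transactions stateless and lets any DFE in the assigned group serve them.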
• DFE DRAM contains the database to be searched
• CPUs issue transactions find(x, db)
• Complex search function
– Text search against documents
– Shortest distance to coordinate (multi-dimensional)
– Smith Waterman sequence alignment for genomes
• Any CPU runs on any DFE that has been loaded with the database
– MaxelerOS may add or remove DFEs from the processing group to balance system demands
– New DFEs must be loaded with the search DB before use
Fine Grained, Shared Data: Searching
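The transaction model can be mimicked in software to make its constraints concrete: a find(x, db) request may be served by any engine in the group, but only by one that has already been loaded with the database. The classes and method names below are hypothetical stand-ins written for illustration (they are not MaxelerOS calls). The search function is the shortest-distance-to-coordinate case, and round-robin dispatch is an assumed policy.

```python
import math


class Engine:
    """Hypothetical stand-in for one DFE holding the database in its DRAM."""

    def __init__(self, name):
        self.name = name
        self.db = None

    def load(self, db):
        self.db = list(db)

    def find(self, x):
        if self.db is None:
            raise RuntimeError(f"{self.name} has not been loaded with the database")
        # Nearest stored coordinate to x (Euclidean distance, any dimension).
        return min(self.db, key=lambda point: math.dist(point, x))


class ProcessingGroup:
    """Hypothetical group of engines sharing one search database."""

    def __init__(self, db):
        self.db = list(db)
        self.engines = []
        self._next = 0

    def add_engine(self, name):
        engine = Engine(name)
        engine.load(self.db)          # a new engine is loaded before it is used
        self.engines.append(engine)

    def remove_engine(self, name):
        self.engines = [e for e in self.engines if e.name != name]

    def find(self, x):
        if not self.engines:
            raise RuntimeError("no engines in the processing group")
        engine = self.engines[self._next % len(self.engines)]  # round-robin dispatch
        self._next += 1
        return engine.name, engine.find(x)


if __name__ == "__main__":
    group = ProcessingGroup([(0, 0), (3, 4), (10, 10)])
    group.add_engine("dfe0")
    group.add_engine("dfe1")
    print(group.find((2.5, 3.5)))
    print(group.find((9, 9)))
```

Because every engine in the group holds the same read-only database, the result of a transaction does not depend on which engine serves it, so engines can be added or removed without the callers noticing.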
• Dataflow computing focuses on data movement and utilizes massive parallelism at low clock frequencies
• Improved performance, power efficiency, system size, and data movement can help address exascale challenges
• Mix of DataFlow with ControlFlow and interconnect can be balanced at a system level
• What’s next?
Conclusion
The TriPeak
BSC + Maxeler
The TriPeak
MontBlanc = A ManyCore (NVidia) + a MultiCore (ARM)
Maxeler = A FineGrain DataFlow (FPGA)
How about a happy marriage of MontBlanc and Maxeler?
In each happy marriage, it is known who does what :)
Core of the Symbiotic Success: an intelligent scheduler, partially implemented for compile time, and partially for run time.
At compile time: checking what part of the code fits where (MontBlanc or Maxeler).
At run time: rechecking the compile time decision, based on the current data values.
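A minimal sketch of such a two-stage decision is given below. It assumes a made-up cost model in which the dataflow side pays a fixed setup cost and then processes each item much faster. All names, cost constants, and the use of data size as the run-time criterion are hypothetical; the slides do not specify how the scheduler is implemented.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    is_streaming_loop: bool  # property assumed to be known at compile time


def compile_time_target(kernel):
    """Static choice: regular streaming loops are candidates for dataflow."""
    return "Maxeler" if kernel.is_streaming_loop else "MontBlanc"


def run_time_target(kernel, n_items, setup_cost=1000.0,
                    dataflow_cost_per_item=0.05, controlflow_cost_per_item=1.0):
    """Recheck the static choice against the size of the current data."""
    static_choice = compile_time_target(kernel)
    if static_choice == "MontBlanc":
        return static_choice          # code that does not fit dataflow stays put
    dataflow_cost = setup_cost + n_items * dataflow_cost_per_item
    controlflow_cost = n_items * controlflow_cost_per_item
    return "Maxeler" if dataflow_cost < controlflow_cost else "MontBlanc"


if __name__ == "__main__":
    stencil = Kernel("stencil", is_streaming_loop=True)
    for n in (100, 1_000_000):
        print(n, run_time_target(stencil, n))
```

The point of the run-time recheck in this sketch is that a kernel judged suitable for dataflow at compile time can still be sent to the other side when the current input is too small to amortize the setup cost.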
© H. Maurer