Reconfigurable Supercomputing:
What are the Problems?
What are the Solutions?
Reiner Hartenstein
TU Kaiserslautern
Dagstuhl, Germany, April 2 - 7, 2006
Dynamically Reconfigurable Architectures
© 2006, [email protected] http://hartenstein.de2
The Supercomputing Paradox
Rapidly growing listed Teraflops
Often limited sustained Teraflops
Almost stalled application implementation progress
Increasing number of processors running in parallel
COTS processor decreasing cost
Very high total cost of the Tera(?)flops
promising technology
poor results
Scientists waiting for affordable compute capacity
Dangerous: telling this to the supercomputing people:
You … used the wrong roadmap for the past 20 years !!!
3 Reconfigurable Computing Paradoxes
The high performance paradox
The low power paradox
Reconfigurable Computing Education Paradox
The Pervasiveness of RC
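[Figure: # of hits by Google for searches of the form “FPGA and ….”, one count per application area. Counts shown in the first panel: 162,000; 127,000; 158,000; 113,000; 171,000; 194,000. Counts shown in the second panel: 1,620,000; 915,000; 398,000; 272,000; 647,000; 1,490,000. The application-area labels are not preserved in this transcript.]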
going into every application area
Almost 10 million hits
The Reconfigurable Computing Education Paradox:
Curricula still ignore these extremely hot new challenges
In addition to the hardware / software chasm, we now also have the hardware / configware / software chasm
Runaway, accelerating pervasiveness, despite all these educational deficits
Computing Curricula 2004 (1)
Within about 500 pages the term "reconfigurable" is not found, nor are its synonyms
von Neumann's monopoly inside curricula is obsolete
von Neumann is not the common model
[Figure: the von Neumann instruction-stream-based machine. Labels: CPU (program counter, DPU), RAM (memory), von Neumann bottleneck.]
[Figure: mainframe age versus microprocessor age. Labels: CPU (instruction-stream-based, software), accelerator / co-processors (data-stream-based, hardware, morphware).]
the tail is wagging the dog
vN paradigm dominance ?
dual paradigm
modern FPGA bestsellers:
The new model is reality: FPGA fabrics, together with several µprocessors, several memory banks, and other IP cores, on the same COTS microchip
Bill Gates
Speech by Bill Gates at a summit meeting of US state governors: "American high schools are obsolete."
"The high schools of today teach kids about today's computers like on a 50-year-old mainframe."
"Without re-design for the needs of the 21st century, we will keep limiting - even ruining - the lives of millions of Americans every year."
carved out of stone
The most important cultural revolution since the invention of text characters:
it's not the mainframe
It is the Microchip !
RC education needed
http://fpl.org/RCeducation/
35 submissions from
Australia, Brazil, India, USA, and throughout Europe
Jürgen Becker, Jörg Henkel, R. Hartenstein
Reconfigurable Computing Paradoxes
The high performance paradox
The low power paradox
Reconfigurable Computing Education Paradox
The FPGA Low Power Paradox
The awful technology of FPGAs:
FPGAs run at lower clock frequencies, draw much more power and are more expensive.
"very power-hungry" [Rick Kornfeld*]
*) personal communication
And yet: reducing the electricity bill by an order of magnitude or more through supercomputer-to-FPGA migration
Telling this to the low power design people?
You … used the wrong roadmap for the past 15 years: use FPGAs !
ISLPED, Oct 4 – 6, Tegernsee
PATMOS, Sep 13 – 15, Montpellier
1991: Kaiserslautern, Germany 1992: Paris, France
1993: Montpellier, France
Reconfigurable Computing Paradoxes
The high performance paradox
The low power paradox
Reconfigurable Computing Education Paradox
The High Performance Paradox
Effective integration density much worse than the Gordon Moore curve: by a factor of more than 10,000
85% of all designers hate their tools
The awful technology of FPGAs:
FPGAs run at lower clock frequencies, and are more expensive.
fine-grained RC: 1st DeHon's Law [1996: Ph. D, MIT]
[Figure: transistors / microchip (10^0 to 10^9, log scale) versus year (1980 to 2010). Curves: Gordon Moore curve (microprocessor), FPGA physical density, FPGA routed density, FPGA logical density.]
overhead: >> 10 000 (reconfigurability overhead, wiring overhead, routing congestion)
immense area inefficiency
coarse-grained RC: Hartenstein's Law [1996: ISIS, Austin, TX]
[Figure: transistors / microchip (10^0 to 10^9, log scale) versus year (1980 to 2010). Curves: Gordon Moore curve, rDPA physical, rDPA logical, and, for comparison, FPGA routed (>> 10 000 below).]
area efficiency very close to Moore's law
e.g. KressArray family
Claassen's Law
[Figure: MOPS / milliWatt (0.001 to 1000, log scale) versus µ feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07). Curves: standard microprocessor, DSP, instruction set processors, FPGAs (fine grained reconf.), hardwired.]
Claassen's Law: Hartenstein's Amendment
[Figure: MOPS / milliWatt (0.001 to 1000, log scale) versus µ feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07). Curves: standard microprocessor, DSP, instruction set processors, FPGAs (fine grained reconf.), and, as the amendment, hardwired and coarse-grained reconf. (rDPA).]
Selection of published speed-up factors
http://xputers.informatik.uni-kl.de/faq-pages/fqa.html
[Figure: relative performance (10^0 to 10^9, log scale) versus year (1980 to 2010). Reference curve labels: Microprocessor (8080 to P4), Memory, 7%/yr, 50%/yr, X 2/yr. Application areas named: DSP and wireless; Image processing, Pattern matching, Multimedia; Bioinformatics; Astrophysics; crypto. Speed-up factors plotted:]
Los Alamos traffic simulation: 47
real-time face detection: 6000
video-rate stereo vision: 900
pattern recognition: 730
SPIHT wavelet-based image compression: 457
Smith-Waterman pattern matching: 288
BLAST: 52
protein identification: 40
molecular dynamics simulation: 88
Reed-Solomon Decoding: 2400
Viterbi Decoding: 400
FFT: 100
GRAPE: 20
MAC and crypto: values not recoverable from this transcript (the unassigned values 1000 and 100 000 appear in the chart)
MoM Xputer architecture (no FPGA: DPLA by TU-KL):
Grid-based DRC (DPLA on MoM): 2000
Grid-based DRC ("fair comparison"): 15000
2-D FIR filter: 39.4
Lee Routing: 160
2nd DeHon's Law [IEEE COMPUTER, 2000]
[Figure: Computational Density (1 to 1000, log scale) versus µ feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07). Curves: RISC, FPGA.]
The three RC Paradoxes
poor technology
brilliant results
poor tools
very poor education
Why supercomputing / HPC failed
instruction-stream-based: memory-cycle-hungry
the wrong way in which the data are moved around
because of the interconnect network architecture
The law of More:
instruction fetch overhead
address computation overhead and other overhead
sequencing overhead
Earth Simulator
5120 Processors, 5000 pins each
ES 20: TFLOPS
Crossbar weight: 220 t, 3000 km of cable
moving data around inside the
data moved around by software
i.e. by memory-cycle-hungry instruction streams which fully hit the memory wall
P&R: move locality of operation, not data !
extremely unbalanced
stolen from Bob Colwell
CPU
An Archetype Common Model needed
An archetype common model should provide:
Guidance for organizing efficient solutions
Make the project manageable
Allow sharing lessons between applications and between disciplines
Useful simple archetype not widely accepted
Progress stalled by the software/configware chasm
Support for undergraduate education from the Configware Industry
The new paradigm: how the data are traveling
transport-triggered: an old hat
pipeline, or chaining
systolic array
asynchronous (via handshake)
wavefront array
no, not by instruction execution
Def.: data streams
[Figure: a DPA (pipe network) with input data streams entering and output data streams leaving; each data stream is drawn over the axes time and port #.]
Flowware defines: ... which data item at which time at which port
source and sink ?
systolic arrays: H. T. Kung
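The flowware definition (which data item at which time at which port) can be sketched as a small schedule table. This is only an illustration: the function name `flowware_schedule` and the one-step-per-port skew are assumptions, chosen because skewed input is typical of systolic arrays; the slides do not prescribe a particular schedule.

```python
def flowware_schedule(streams):
    """Return a list of (time, port, item) triples.

    streams maps a port number to the list of data items entering there.
    Assumption for illustration: port p is delayed by p time steps (skew).
    """
    schedule = []
    for port, items in streams.items():
        for i, item in enumerate(items):
            schedule.append((i + port, port, item))
    # Sort by time, then by port, so the table reads like a timing diagram.
    return sorted(schedule)


example = flowware_schedule({0: ["a0", "a1"], 1: ["b0", "b1"]})
for time, port, item in example:
    print(f"time {time}: port {port} receives {item}")
```

Nothing in this table is an instruction stream: it only states when each item must be present at which port, which is the job the slides assign to flowware.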
Data streams source and sink: not my job
Not my Job!
[Figure: a pipe network whose input data streams and output data streams are supplied by distributed memory: ASM blocks placed at the array ports.]
ASM: On-chip Auto-Sequencing Memory (figure labels: RAM, GAG)
implemented by distributed on-chip memory
How the data are moved
DMA,
vN move processor [Jack Lipovski, EUROMICRO, Nice, 1975]
Henk Corporaal coins the term “transport-triggered”
Application-specific distributed memory [Catthoor et al.]
ASMs use a GAG (generic address generator) [TU-KL publ.: Tokyo 1989 + NH journal]
by the way: GAG st…. by TI [TI patent 1995]
MoM: GAG-based storage scheme methodology [Herz*]
*) [see Michael Herz et al.: ICECS 2002 (Dubrovnik)]
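What a generic address generator does inside an auto-sequencing memory can be sketched as follows. This is a hypothetical model, not the TU-KL design: the function name `gag_addresses` and its parameters (base address, row pitch, scan limits and step widths) are assumptions for illustration. In hardware the sequence would be produced by the generator itself, with no instruction fetch per address.

```python
def gag_addresses(base, row_pitch, x_start, x_stop, x_step,
                  y_start, y_stop, y_step):
    """Yield the memory addresses of a 2-D scan pattern.

    All parameters are illustrative assumptions: a scan window over a
    row-major array starting at `base` with `row_pitch` words per row.
    """
    for y in range(y_start, y_stop, y_step):
        for x in range(x_start, x_stop, x_step):
            yield base + y * row_pitch + x


# Scan a 3-wide, 2-high window of an array with 8 words per row.
window = list(gag_addresses(base=100, row_pitch=8,
                            x_start=0, x_stop=3, x_step=1,
                            y_start=0, y_stop=2, y_step=1))
print(window)
```

The point of the sketch is that the address sequence is fully determined by a handful of parameters set once at configuration time, so no address computation overhead is spent at run time.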
The dual paradigm approach
von Neumann paradigm (CPU): Software Engineering
Kress-Kung paradigm (ASM): Configware Engineering
Mathematical Synthesis Methods
algebraic methods
i. e., linear projections
yields only uniform arrays with linear pipes
only for applications with regular data dependencies
Coarse-grained reconfigurable arrays are a generalization of the systolic array [Rainer Kress]
discard algebraic synthesis methods
use optimization algorithms instead, for example: simulated annealing
the achievement: non-linear and non-uniform pipes, and even wilder pipe structures, become possible
now reconfigurability really makes sense
R. Kress
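A minimal sketch of placement by simulated annealing, in the spirit of the slide above. It is not the DPSS or KressArray Xplorer algorithm: the cost function (Manhattan wire length), the move set, and the cooling parameters are all assumptions chosen to keep the example short.

```python
import math
import random


def wire_length(placement, nets):
    """Total Manhattan distance over all connected operator pairs."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in nets)


def anneal(operators, nets, rows, cols, steps=2000, t0=5.0,
           cooling=0.995, seed=0):
    """Place operators on a rows x cols grid; return (best, best_cost, initial_cost)."""
    rng = random.Random(seed)
    cells = [(x, y) for x in range(cols) for y in range(rows)]
    rng.shuffle(cells)
    placement = dict(zip(operators, cells))   # random initial placement
    free = cells[len(operators):]             # unused grid cells
    cost = initial_cost = wire_length(placement, nets)
    best, best_cost = dict(placement), cost
    t = t0
    for _ in range(steps):
        op = rng.choice(operators)
        if free and rng.random() < 0.5:       # move op to a free cell
            i = rng.randrange(len(free))
            placement[op], free[i] = free[i], placement[op]
            undo = ("free", i)
        else:                                 # swap op with another operator
            other = rng.choice(operators)
            placement[op], placement[other] = placement[other], placement[op]
            undo = ("swap", other)
        delta = wire_length(placement, nets) - cost
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cost += delta                     # accept, sometimes even uphill
            if cost < best_cost:
                best, best_cost = dict(placement), cost
        elif undo[0] == "free":               # reject: restore previous state
            placement[op], free[undo[1]] = free[undo[1]], placement[op]
        else:
            placement[op], placement[undo[1]] = placement[undo[1]], placement[op]
        t *= cooling                          # cool down
    return best, best_cost, initial_cost


ops = ["in", "mul", "add", "sub", "shift", "out"]
chain = list(zip(ops, ops[1:]))               # a simple linear pipe
best, best_cost, initial_cost = anneal(ops, chain, rows=4, cols=4)
print(initial_cost, "->", best_cost)
```

Because the optimizer only needs a cost function, nothing restricts it to uniform arrays or linear pipes; that is the freedom gained by dropping the algebraic synthesis methods.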
Example: mapping onto rDPA by DPSS: based on simulated annealing
SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]
[Figure: mapping result on the array. Legend: rDPU not used; used for routing only (rout thru only); operator and routing; port location marker; backbus connect.]
array size: 10 x 16 = 160 rDPUs
rDPU, 32 bit
no CPU
Coarse grain is about computing, not logic
tool: KressArray Xplorer: diss. Ulrich Nageldinger (downloadable)
Software / Configware Co-Compilation [Juergen Becker's CoDe-X, 1996]
[Figure labels: C language source; Analyzer / Profiler; Partitioner (simulated annealing); Resource Parameters (supporting different platforms); SW compiler producing SW code ("vN" machine paradigm); CW compiler producing CW Code and FW Code (anti machine paradigm).]
For thesis see book exhibit rack at library entrance
Distributed Memory Parallelism Capability
[Figure: an rDPA (array size example: 10 x 16) surrounded by ASM blocks. Legend: rDPU not used; used for routing only; operator and routing; port location marker; backbus connect. Layers: NN ports interconnect layer; backbus connect layers …]
Applications for coarse-grained arrays
with steady I/O data streams at constant speed (on-chip distributed memory for intermediate results):
Multi-standard world HDTV receiver
Wide variety of multimedia applications
Wide variety of real-time applications
Many other applications
The wrong mind set ....
"but you can't implement decisions!"
(remark by a high-ranking industrial research head, in the discussion after a talk by Ulrich Nageldinger, RAW Orlando)
a tiny section of the pipe network
[Figure: an adder (+) with output S inside the array. Legend: rDPU not used; used for routing only; operator and routing; port location marker; backbus connect.]
The wrong mind set ....
"but you can't implement decisions!"
[Figure: section of a very large pipe network. Labels: inputs A, B, R, C; a decision element with branches =1 and =0; an adder (+).]
not knowing this solution: symptom of the hardware / software chasm and the configware / software chasm
introducing hardware description languages (in the mid-seventies):
“The decision box becomes a (de)multiplexer”
This is so simple: why did it take decades to find out ?
The wrong mind set – the wrong road map!
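The remark that the decision box becomes a (de)multiplexer can be sketched in a few lines. This is only a software model of the idea: the names `mux` and `pipe` and the sample streams are assumptions for illustration. Each loop iteration stands for one data item set flowing through the pipe; in the configured hardware there is no branch instruction, only a multiplexer whose select input is C.

```python
def mux(c, a, b):
    """2-to-1 multiplexer: the select input c picks a (c true) or b (c false)."""
    return a if c else b


def pipe(r_stream, c_stream, a_stream, b_stream):
    """Data-stream form of: S = R + (if C then A else B)."""
    for r, c, a, b in zip(r_stream, c_stream, a_stream, b_stream):
        yield r + mux(c, a, b)


s_stream = list(pipe(r_stream=[10, 20, 30],
                     c_stream=[1, 0, 1],
                     a_stream=[1, 2, 3],
                     b_stream=[100, 200, 300]))
print(s_stream)
```

Both A and B are always present at the multiplexer inputs, so the decision costs no extra memory cycles; it is part of the data path.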
hypothetical branching example to illustrate software-to-configware migration
S = R + (if C then A else B endif);

simple conservative CPU example (C = 1):
step                              memory cycles   nanoseconds
if C then read A
  read instruction                      1             100
  instruction decoding
  read operand*                         1             100
  operate & reg. transfers
if not C then read B
  read instruction                      1             100
  instruction decoding
add & store
  read instruction                      1             100
  instruction decoding
  operate & reg. transfers
  store result                          1             100
total                                   5             500
*) if no intermediate storage in register file

section of a major pipe network on rDPU (inputs A, B, R, C; output S):
clock 200 MHz (5 nanosec)
no memory cycles: speed-up factor = 100
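The arithmetic behind the speed-up factor of 100 can be checked with a short sketch. The numbers are those of the slide (5 memory cycles of 100 nanoseconds each on the CPU, one 200 MHz clock period on the rDPU); the function names are assumptions for illustration.

```python
def cpu_time_ns(memory_cycles, ns_per_memory_cycle):
    """Time of the instruction-stream version, dominated by memory cycles."""
    return memory_cycles * ns_per_memory_cycle


def pipe_time_ns(clock_mhz):
    """One clock period of the configured pipe network, in nanoseconds."""
    return 1000.0 / clock_mhz


cpu_ns = cpu_time_ns(memory_cycles=5, ns_per_memory_cycle=100)
rdpu_ns = pipe_time_ns(clock_mhz=200)
speed_up = cpu_ns / rdpu_ns
print(f"CPU: {cpu_ns} ns, rDPU: {rdpu_ns} ns, speed-up factor: {speed_up:.0f}")
```

The factor comes entirely from avoiding memory cycles, even though the clock of the reconfigurable platform is modest.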
why the RC paradigm shift is so important
Move the stool or the grand piano?
by Software
by Configware
the data-stream-based approach has no von Neumann bottle-neck
... cannot cope with this one
the instruction-stream-based approach: von Neumann bottle-necks
… understand only this parallelism solution
What does Reconfigurable Computing mean?
microprogramming?
switching the multiplexers?
concurrency of 64 or 256 CPUs on a single chip?
routing ALU result to a register?
it means using the Kress/Kung machine paradigm !
vN paradigm losing its dominance
http://bwrc.eecs.berkeley.edu/Research/RAMP/people.htm
RAMP project proposes: Run LINUX on FPGAs
Cray XD1
vN paradigm losing its dominance
Xilinx inside !
Xilinx FPGA
Recommended Pentium successor
Discard most caches
Have 64* cores
with clever interconnect for:
concurrent processes,
for multithreading, and,
Kung-Kress rDPA array
The Desk-top Supercomputer!
What does Reconfigurable Computing mean?
The key issue: which is the underlying paradigm?
Operation not based on instruction streams at run time
No instruction fetch at run time
Machine paradigm is data-stream-based: Kress-Kung
Undergraduate education needs a dual paradigm approach: symbiosis of von Neumann / Kress-Kung
Term to be used for "soft hardware"
accelware, adaptware, adjustware, altware, alterware, arrangeware, changeware, conformware, doughware, fabricsware, fabrixware, fitware, flexware, formware, FPware,
gateware, gateroutware, hpcware, LUTware, matchware, modiware, morphware®, morfware, mouldware, muxware, parware, paraware, passware, pathware, patchware,
performware, perfware, perware, pipeware, platformware, railware, rangeware, RCware, ressourceware, routware, routeware, routingware, RTware, shapeware, shuntware,
shuntingware, speedware, speedupware, suiteware, switchware, switchingware, streamware, structware, transferware, transware, variware, varyware, warpware, xferware, xware
send your proposal to:
unfortunately “Morphware” is trademarked
Compilation: Software vs. Configware
Software Engineering: source program (C, FORTRAN, MATLAB) → software compiler → software code
Configware Engineering: source "program" → configware compiler (mapper: placement & routing; scheduler) → configware code and flowware code (data)
Co-Compilation
[Figure: Software / Configware Co-Compiler. Labels: source in C, FORTRAN, MATLAB; automatic SW / CW partitioner; software compiler producing software code; configware compiler (mapper, scheduler) producing configware code and flowware code (data).]
Why use Reconfigurable Computing
instead of software?
Exploit spatial parallelism, and ..
… high bandwidth and low latency memory access
… and fine-grained parallelism when useful
instead of spec. hardware?
Ride the technology curve, avoiding specific silicon
Adapt to change: standards, trends, …..
Reduce risk
Adapt to application / deployment requirements
Computing Curricula 2004 (2)
missing volume: CE, Configware Engineering
Computing Curricula 2004 (3)
2.2.1.
Computing Curricula 2004 (4)
2.2.1.
… how it should be
CONFIGWARE
MORPHWARE
morphware and configware added