reconfigurable supercomputing means to brave the paradigm ... · in modern computer history. the...
Post on 20-May-2020
9 Views
Preview:
TRANSCRIPT
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 1
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
Twin-Paradigm Programmer Education:
proposed New Text Book (and project)
Reiner Hartenstein
talk for Jan 2010 at
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
1981: visiting professor at UC Berkeley
IEEE fellow, SDPS fellow, FPL fellow, other awards
Professor (ordinarius emeritus), TU Kaiserslautern
my CV (1)
Founder and co-founder of several international annual conference series
All academic degrees from EE department at Karlsruhe Institute of Technology (KIT)
Karl Steinbuch
2
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Viktor Prasanna called him „The father of Reconfigurable Computing“ (also pre-FPGA era) -- („Gerald Estrin: grandfather of RC“)
1983: founder of German Mead-&-Conway VLSI design scene: the multi university „E.I.S. project“ (grant: 35 million DM)
my CV (2)
Author & compiler writer of KARL[1]: in the 80ies the most successful HDL and trailblazer before VHDL
[1] R. Hartenstein: Fundamentals of Structured Hardware Design,
1977, North Holland / American Elsevier -- bestseller
3 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Complete EDA framework[2] around KARL and ABL: 85 mio ECU grant by ESPRIT programme of the EU
my CV (3)
format-checking functional floorplan graphic editor
[2] R. Hartenstein: The History of KARL and ABL; in: J. Mermet (editor): Fundamentals and Standards in Hardware Description Languages; 1993.
also see: http://xputers.informatik.uni-kl.de/karl/karl_history_fbi.html
calculus-based term rewriting floorplan generator,
automatic test generation,
testability analysis,
embedded router,
structured logic synthesis,
simulator …
4
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
my CV (4)
consulting
Hans-D. Genscher, Foreign Secretary & Vice-Chancellor
lobbying & consulting
Heinz Riesenhuber, Helmut Kohl‘s Tech & Research Secretary
Konrad Schwaiger, Commission of the
European Union
popular science book
writing for newspapers
local
global
global
http://hartenstein.de/Pressespiegel.html
5
some of my hobbies
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
6
Book Outline
• Introduction
• Instruction Stream Machines
• Data Stream Machines
• Time to Space Mapping
• Hetero Systems
• Applications
• Outlook
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 2
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
7
How Societies Chose
to Fail or Succeed
Collapse of our computing ecosystem ?
Unaffordability of von-Neumann-centric computing could jeopardize
all facets of our global economy.
Many-core: failure could jeopardize both, IT industry & sections of the economy depending on rapid
improvement of IT. [Dave Patterson]
Several recent outages of cloud computing services.
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
8
Motivations for a Game Change
Energy consumption of all computers world-wide may become unaffordable
To enable mastering the impact of manycore the programmer population is not yet existing
Fundamentally new types of skills for programmers promise to cool down the chronic “software crisis”
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Energy-efficient Computing
• Low power circuit design*, SSD, & „green computers“**:
• The Twin Paradigm approach:
It‘s important: all sections should be supported
Promising results being better by orders of magnitude
technology established, but massive commitment required
2 different approaches:
*) e. g. PATMOS annual conference series
9 **) remember the talk by Rich Goldman on Tuesday ! © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
10
Re-thinking basic assumptions
Instead of physical limits, fundamental misconceptions of algorithmic complexity theory
limit the progress and will necessitate new breakthroughs.
Not processing is costly, but moving data and messages
We’ve to re-think basic assumptions behind computing
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Put old ideas into practice
(POIIP)
11
... „The biggest payoff will come from putting old ideas into practice and teaching people how to apply them properly.“ [David Parnas]
“We need a complete re-definition of CS” [Burton Smith and other celebrities]
Wrong! I do not agree, [Reiner Hartenstein]
finding out, that ...
“We need a complete re-definition of curriculum recommendations - missing several key issues.”
[Reiner Hartenstein]
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Educational Deficits
Transdisciplinary fragmentation: each application domain uses its own trick boxes
Too many sophisticated very clever architectures
We need a fundamental model with a methodology which all application domains have in common
Educational deficits have stalled Reconfigurable Computing (RC) as well as classical supercomputing
Transdisciplinary education & basic research needed
12
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 3
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern it’s old stuff !
• Most of the enabling technologies of Reconfigurable Computing have been published in the 70ies and 80ies:
13
• This is mainly ignored by the CS community by the tunnel view of a reductionist mind set.
• We need to think out of the box: R&D and education need a twin paradigm approach
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
14
Book Outline
• Introduction
• Instruction Stream Machines • Data Stream Machines • Hetero Systems • Time to Space Mapping • Application Examples • Outlook
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Power Consumption of Computers
Energy cost may overtake
IT equipment cost in the near future
but „we may ultimately need revolutionary new solutions“ [Horst Simon, LBNL, Berkeley]
... has become an industry-wide issue: incremental improvements are on track,
[Albert Zomaya]
Current trends will lead to unaffordable future operation cost of our cyber infrastructure
Power consumption by internet: x30 til 2030 if trends continue G. Fettweis, E. Zimmermann: ICT Energy Consumption - Trends and Challenges; WPMC'08, Lapland, Finland, 8 –11 Sep 2008
15 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
50 years Software Crisis
Nathan’s Law: Software is a gas. It expands to fill its container ...
Nathan Myhrvold
… until being limited by Moore’s Law [& Kryder’s Law]
Wirth‘s Law
“software is slowing faster than hardware is accelerating“
Oct 1957 The Economist: Nov 19th 1955
In 1955, Parkinson could not have foreseen the impact of software. formula: bureaucracy growth independent of actual work to be done
[Niklaus Wirth]
[Cyril Northcote Parkinson]
Software critics is not new:
F. L. Bauer 1968, coined the term „Software Crisis“
N. N. 1995: THE STANDISH GROUP REPORT
Robert N. Charette 2005: Why Software Fails; IEEE Spectrum, Sep 2005
Anthony Berglas 2008: Why it is Important that Software Projects Fail L. Savain 2006: Why Software is bad
Peter G. Neumann 1985-2003:
216x “Inside Risks“(18 years inside back cover of Comm_ACM)
term by F. L. Bauer [1968]
16
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
“Software”
overhead piles up to code sizes of astronomic dimensions
The von Neumann
Syndrome:
stands for extremely memory-cycle-hungry instruction streams
stolen from Bob Colwell
memory wall,
caches, ...
from earlier talks: from earlier talks:
datastream parallelism
instruction stream parallelism
data meet PU
C.V. Ramamoorthy
“The Memory Wall”
coined by Sally McKee (& co-author)
Patterson’s Law:
Dave Patterson
bandwidth gap grows 50% / year
has reached >1000x
17 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Program Program Performance
„Multicore computers shift the burden of software performance from chip designers to programmers.“
we anyway need a Software Education Revolution ... Since People have to write code differently,
[J. Larus: Spending Moore's Dividend; C_ACM, May 2009] ... performance drops & other problems
in moving single-core to multicore ...
... the chance to move RC* from niche to mainstream
a scenario like before the Mead-&-Conway revolution Missing programmer population, tools and methodology:
*) RC = Reconfigurable Computing
Embedded syst. & hdw scene have the right background to reform the parallelism education of programmers
program program
Programming Programming
18
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 4
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
A Heliocentric CS Model needed
FE
Flowware
Engineering
PE
Program Engineering
The Generalization of Software Engineering —
A Twin Paradigm Dual Dichotomy Approach.
time to space mapping
issue
SE
Software
Engineering
RPU RPU
special
*) do not confuse with „dataflow“!
19 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Following the exemplar: our ideal:
educational framework to create a designer and researchers population
20
The most influential research project
in modern computer history.
The brain child of:
Carver Mead Lynn Conway
(~1980-…) The VLSI Revolution
Solving the VLSI design crisis: missing reply to Moore‘s Law
missing designer population & tools
Incubator of EDA* industry, workstations…
*) Electronic Design Automation
text book [1980] ( Bestseller ! )
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
VLSI design education spreading rapidly
1981 - 1983
A few examples:
E.I.S. Projekt (DE): 20 universities [1983...]
EUROCHIP (EU) [1984...]
21
world-wide
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
The most influential research project
in modern computer history.
The VLSI Revolution
educational framework to create designer & researcher population
22
Carver Mead Lynn Conway
Solving the VLSI design crisis: missing reply to Moore‘s Law
missing designer population & tools
Incubator EDA industry, workstations…
software software
vN syndrome vN syndrome
programmer programmer
PE* PE*
programmer programmer
PSCs** PSCs**
programming programming
following the exemplar:
The Next Text Book Bestseller !
Next most influential R&D project Next most influential R&D project
*) Program Engineering
**) personal SuperComputers
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Teaching properly
„The biggest payoff will come from Putting Old ideas into Practice
and teaching people how to apply them properly.“
Dav
id P
arna
s
(POIIP) Put Old ideas into Practice
Lean extension of SE education: avoiding to scare away students
23 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
M-&-C-like textbook needed
avoiding specialization overload
24
Breadth of specialization massively reduced
Clean-up &
intuitive models
Fixing the
Education
Dilemma
Silicon Foundry
(external technology)
Gate level
Switching level
Circuit level
RT level
Anwendung
Layout level
Technology
on site
submit reject
submit reject
submit reject
submit reject
submit reject
conventional design flow:
fra
gm
entation
ta
ll th
in
m
an
application
The new M-&-C
organization:
Carver Mead Lynn Conway
stressing “Systems”
coherence
The text book [1980] ( Bestseller ! )
Breadth of specialization
„fa
b-less“
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 5
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
25
Book Outline
• Introduction • Instruction Stream Machines
• Data Stream Machines • Hetero Systems • Time to Space Mapping • Application Examples • Outlook
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
The first Reconfigurable Computer
•prototyped 1884 by Herman Hollerith
•a century before FPGA introduction
•data-stream-based •data-stream-based
•60 years later the von Neumann (vN) model took over
• instruction-stream-based • instruction-stream-based
26
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Early LUT
60 years later: RAM available f. configuration
non-volatile configuration “memory”
field-programmable:
•manually
•or, by swapping pre-wired plug boards
But only used for von Neumann machine (at that time)
27 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
28
The History of Mainframes
• Reconfigurable was first
• Von Neumann had the strongest lobby
• Von-Neumanized hardware design
• Problems with hardware description languages
But curriculum planners insisted in going through all abstraction levels
„I understand everything, although I do not know any electronics“
The software faculty did not support this mind set
Every student had to learn how to design a CPU
M&C misunderstood: only for chip designers & toolmakers
(the tall thin man)
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Teaching for Change: an early martyr
„Turing is irrelevant“
The von Neumann model is the emulation of a tape machine
http://www.sigsoft.org/SEN/parnas.html
D. L. Parnas (keynote):
"Teaching for Change“;
10th Conf. Softw. Engineering Education
and Training (CSEET '97)
„The von Neumann syndrome“: coined ~ a decade later
Prof. C.V. Ramamoorthy, (UC Berkeley),
SDPS 2006, San Diego, CA
Brad Cox 1990: Planning the Software Industrial Revolution
Dijkstra 1968: The Goto considered harmful
R. Hartenstein, G. Koch 1975: The universal Bus considered harmful Backus 1978: Can programming be liberated from the von Neumann style?
Arvind et al., 1983: A critique of Multiprocessing the von Neumann Style L. Savain 2006:
Why Software is bad …
Peter G. Neumann 1985-2003: 216x “Inside Risks“ (18 years inside back cover
of Comm_ACM)
Critique of von Neumann is not new:
punished for blasphemy?
(mimicking tape on RAM)
Peter G. Neumann
Teaching for Change
29 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
30
The von Neumann Paradigm Trap
•Antimachine:
•data counter(s) at memory unit(s)
•[Burks, Goldstein, von Neumann; 1946]
•Program counter at Datapath Unit
CS education got stuck in this paradigm trap
which stems from technology of the 1940s
We need a twin- paradigm approach
CS education’s right eye is blind, and
its left eye suffers from tunnel view
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 6
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
31
Instruction Stream Machines
• a very simple von Neumann machine explained at bit level program sequencer explained, opcode, control structures, loops, simulator, simple example exercisesd
• imperative language, compilation, execution on von Neumann simulator, example exercises
• parallel processes, Hoare, MPI, etc -- extremely simple !!
• a supercomputing example (differential equations ?)
• The von Neumann Syndrome
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
32
... (old stuff)
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
33
Instruction Stream Machines
• a very simple von Neumann machine explained at bit level program sequencer explained, opcode, control structures, loops, simulator, eimple example exercisesd
• imperative language, compilation, execution on von Neumann simulator, example exercises
• parallel processes, Hoare, MPI, etc -- extremely simple !!
• a supercomputing example (differential equations)
• The von Neumann Syndrome
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
34
Compilation: Software
source program
software compiler
software code
Software Engineering Software
Engineering
instruction schedule
(Befehls-Fahrplan)
sequential
(von Neumann model)
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
35
imperative Software Language (control part)
language category Software Languages (instruction streams)
deterministic procedural sequencing:
traceable, checkpointable
operation sequence driven by:
read next instruction, goto (instr. addr.),
jump (to instr. addr.), instr. loop, loop nesting
no parallel loops, escapes, instruction stream branching
state register program counter
address computation
massive memory cycle overhead
Instruction fetch memory cycle overhead
parallel memory bank access interleaving only
co-located with datapath
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
36
Instruction Stream Machines
• a very simple von Neumann machine explained at bit level program sequencer explained, opcode, control structures, loops, simulator, eimple example exercisesd
• imperative language, compilation, execution on von Neumann simulator, example exercises
• parallel processes, Hoare, MPI, etc -- extremely simple !!
• a supercomputing example (differential equations)
• The von Neumann Syndrome
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 7
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
37
... (old stuff)
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
38
Instruction Stream Machines
• a very simple von Neumann machine explained at bit level program sequencer explained, opcode, control structures, loops, simulator, eimple example exercisesd
• imperative language, compilation, execution on von Neumann simulator, example exercises
• parallel processes, Hoare, MPI, etc -- extremely simple !!
• a supercomputing example (differential equations)
• The von Neumann Syndrome
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
39
... (old stuff)
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
40
Instruction Stream Machines
• a very simple von Neumann machine explained at bit level program sequencer explained, opcode, control structures, loops, simulator, eimple example exercisesd
• imperative language, compilation, execution on von Neumann simulator, example exercises
• parallel processes, Hoare, MPI, etc -- extremely simple !!
• a supercomputing example (differential equations)
• The von Neumann Syndrome
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
von Neumann overhead vs. Reconfigurable Computing
overhead von Neumann
machine hardwired
anti machine reconfigurable anti machine
instruction fetch instruction stream none*
state address computation instruction stream none*
data address computation instruction stream none*
data meet PU + other overh. instruction stream none*
i / o to / from off-chip RAM instruction stream none*
Inter PU communication instruction stream none*
message passing overhead instruction stream none*
*) configured before run time
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPA: reconfigurable datapath array rDPA: reconfigurable datapath array
(coa
rse-
grai
ned
rec.
) (c
oars
e-gr
aine
d re
c.)
41 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
overhead von Neumann
machine hardwired
anti machine reconfigurable anti machine
instruction fetch instruction stream none*
state address computation instruction stream none*
data address computation instruction stream none*
data meet P + other overh. instruction stream none*
i / o to / from off-chip RAM instruction stream none*
Inter PU communication instruction stream none*
message passing overhead instruction stream none*
**) just by reconfigurable address generator
von Neumann overhead vs. Reconfigurable Computing
*) configured before run time
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPA: reconfigurable datapath array rDPA: reconfigurable datapath array
(coa
rse-
grai
ned
rec.
) (c
oars
e-gr
aine
d re
c.)
42
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 8
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Data meeting the Processing Unit (PU)
by Software
by Configware
routing the data by memory-cycle-hungry instruction streams thru shared memory
data-stream-based: placement* of the execution locality ...
We have 2 choices
pipe network generated by configware compilation
... explaining the RC advantage
*) before run time
(data)
(PU)
43 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern The von Neumann Syndrome
The data-stream-based anti machine approach:
The instruction-stream-based von Neumann approach:
has no von Neumann bottle-necks
has no von Neumann bottle-necks
the watering pot model [Hartenstein]
has several
von Neumann overhead
phenomena
has several
von Neumann overhead
phenomena
per CPU!
44
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Reconfigurable Computing means …
• Reconfigurable Computing means moving overhead from run time to compile time**
• For HPC run time is more precious than compiletime
• Reconfigurable Computing replaces “looping” at run time* …
http://www.tnt-factory.de/videos_hamster_im_laufrad.htm
… by configuration before run time
*) e. g. complex address computation **) or, loading time
45 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
46
Book Outline
• Introduction • Instruction Stream Machines • Data Stream Machines
• Hetero Systems • Time to Space Mapping • Application Examples • Outlook
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
47
We need a new machine paradigm
a programmer does not understand function evaluation without machine mechanisms - without a pogram counter …
data-stream-based mind set
we urgently need a
datastream based
machine paradigm
we urgently need a
datastream based
machine paradigm it was pepared almost 30 years ago
x x x
x x x
x x x
|
| |
x x x
x
x x
x x
x
- -
-
x x x x
x x
x x x
- - -
- - -
- - -
- - -
x x x
x x x
x x x
|
|
|
|
|
|
|
|
|
|
|
| data streams
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Compilation: Software vs. Configware
source program
software compiler
software code
Software Engineering Software
Engineering Configware Engineering Configware Engineering
placement & routing
scheduler
flowware code
data
C, FORTRAN MATHLAB, …
instruction streams data streams configuration
configware code
mapper
configware compiler
source „program“
48
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 9
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
49
data counter
GAG RAM
ASM
data counter
GAG RAM
ASM
data counter
GAG RAM
ASM
Configware Compilation
configware code
flowware code
mapper
configware compiler
scheduler
source „program“
Configware Engineering Configware Engineering
placement & routing
data
C, FORTRAN MATHLAB
programming the data counters
configware compilation fundamentally different from software compilation
configware compilation fundamentally different from software compilation
x x x
x x x
x x x
|
| |
x x x
x
x x
x x
x
- -
-
x x x x
x x
x x x
- - -
- - -
- - -
- - -
x x x
x x x
x x x
|
|
|
|
|
|
|
|
|
|
|
| data streams
rDPA
pipe network
data counter
GAG RAM
ASM: Auto-Sequencing Memories ASM
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
50
Data Stream Machines
• Systolic array, def of datastream -simple !! -- (e. g. Matrix compute) examples intro, simulator, exercises
• DMA, auto-sequencing memory, data stream machine exercises programming data streams e.g.of prev. section
• introducing a flowware language, implement a compiler example exercises running on datastream machine
• introductory examples illustrate datastream machine use
• introducing the datastream machine
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
51
Having introduced Data streams
x x x
x x x
x x x
|
| |
x x
x
x
x
x
x x
x
- -
-
input data stream
x x
x
x
x
x
x x
x
- -
-
-
-
-
-
-
-
-
-
-
x x x
x x x
x x x
|
|
|
|
|
|
|
|
|
|
|
| output data streams
time
port #
time
time
port # time
port #
systolic array research throughout the 80ies:
mainly algebra experts
~1980
DPA (pipe network)
execution: transport-triggered
no memory wall
H. T. Kung
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
DPA
x x x
x x x
x x x
|
| |
x x
x
x
x
x
x x
x
- -
-
input data streams
x x
x
x
x
x
x x
x
- -
-
-
-
-
-
-
-
-
-
-
x x x
x x x
x x x
|
|
|
|
|
|
|
|
|
|
|
| output data streams
DPU operation is transport-triggered
no instruction streams
no message passing
nor thru common memory
where were the supercomputing people ?
massively reducing memory cycles
The right road map to HPC: there ignored for decades
52
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
x x x
x x x
x x x
|
| |
x x
x
x
x
x
x x
x
- -
-
input data stream
x x
x
x
x
x
x x
x
- -
-
-
-
-
-
-
-
-
-
-
x x x
x x x
x x x
|
|
|
|
|
|
|
|
|
|
|
| output data streams
time
port #
time
time
port # time
port #
defines: ... which data item at which time at which port
(H. T. Kung paradigm)
Algebra experts‘ hobby, early 80ies
DPA* (pipe network) *) DataPath Array
(array of DPUs)
The Systolic Array
DataPath Unit has no program counter!
it’s no CPU!
nice time/space notation -
53 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Yielding only linear unifornm pipes
pipeline properties array applications
shape resources
mapping scheduling
(data stream formation)
systolic array
regular data
dependencies only
linear only
uniform only
linear projection or algebraic synthesis
super-systolic rDPA
no restrictions simulated
annealing or P&R algorithm
(e.g. force-directed) scheduling algorithm
*
54
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 10
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
of course algebraic (linear projection)
only for applications with regular data dependencies
Algebra experts caught by their own paradigm trap
Rainer Kress discarded their algebraic synthesis methods and replaced it by simulated annealing: rDPA
1995
Synthesis Method?
55 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Super Pipe Networks
pipeline properties array applications
shape resources
mapping scheduling
(data stream formation)
systolic array
regular data
dependencies only
linear only
uniform only
linear projection or algebraic synthesis
super-systolic rDPA
no restrictions simulated
annealing or P&R algorithm
(e.g. force-directed) scheduling algorithm
*
*) KressArray [1995]
56
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
pipe network, organized at compile time
rDPA = rDPU array, i. e. coarse-grained
rDPU = reconf. datapath unit (no program counter)
What pipe network ?
rDPA rDPA
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
Generalization*
of
the systolic array
array port receiving or sending a data stream
rDPA rDPA
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU [R. Kress, 1995]
*) supporting non-linear pipes on free form hetero arrays
57 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
another free form pipe network example:
honeycomb architecture
rDPA = rDPU array
rDPU = reconfigurable
datapath unit
more generalization
Generalization*
of
the systolic array
array port receiving or sending a data stream
[R. Kress, 1995]
*) supporting non-linear pipes on free form hetero arrays – also zigzag, spiral, fork and join, and even more crazy routing patterns tolerated
rDPA rDPA
rDPU rDPU rDPU rDPU rDPU rDPU
rDPU rDPU rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU rDPU rDPU rDPU rDPU rDPU rDPU
58
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern super-systolic array
59
(recall this example !)
rDPU not used used for routing only operator and routing port location markerLegend: backbus connect
rout thru only
not used backbus connect
supporting any complex free form pipe networks
far beyond just uniform linear pipes
CoDe-X inside [Jürgen Becker]
by KressArray Xplorer [Ulrich Nageldinger]
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
60
Data Stream Machines
• Systolic array, def of datastream -simple !! -- (e. g. Matrix compute) examples intro, simulator, exercises
• DMA, auto-sequencing memory, data stream machine exercises programming data streams e.g.of prev. section
• introducing a flowware language, implement a compiler example exercises running on datastream machine
• introductory examples illustrate datastream machine use
• introducing the datastream machine
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 11
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern “It’s not our job”
x x x
x x x
x x x
| | |
x x x x
x x
x x x
- - -
x x x x
x x
x x x
- - -
- - -
- - -
- - -
x x x
x x x
x x x
| | |
| | |
| | |
| |
|
resources
sequencer
Machine:
61 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
data counter
GAG RAM
ASM: Auto-Sequencing
Memory rDPA rDPA
ASM ASM ASM ASM ASM ASM
ASM ASM
ASM ASM
ASM ASM
Migration benefit by on-chip RAM
so that the drastic code size reduction by software to configware migration can beat the memory wall
Some RC chips have hundreds of on-chip RAM blocks, orders of magnitude faster than off-chip RAM
multiple on-chip RAM blocks are the enabling technology for ultra-fast anti machine solutions
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
ASM ASM
ASM ASM
ASM ASM
ASM ASM ASM ASM ASM ASM
rDPA = rDPU array, i. e. coarse-grained
rDPU = reconf. datapath unit (no program counter)
GAGs inside ASMs generate the data streams
GAG = generic address generator
62
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
63
The counterpart of the von Neumann machine
x x x
x x x
x x x
|
| |
x x
x
x
x
x
x x
x
- -
-
x x
x
x
x
x
x x
x
- -
-
-
-
-
-
-
-
-
-
-
x x x
x x x
x x x
|
|
|
|
|
|
|
|
|
|
|
|
(r)DPA
ASM
ASM
ASM
ASM
ASM
ASM
AS
M
AS
M
AS
M
AS
M
AS
M
AS
M
data counter
GAG RAM
ASM: Auto-Sequencing
Memory
data counters instead of a program counter
data counters: located at memory (not at data path)
coarse-grained
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Counters: the same micro architecture ?
data stream machine (anti machine)
data
counter
memory bank
asM
progra
m
counter
DPU CPU
instruction stream machine: (von Neumann etc.)
yes, is possible, but for data counters ...
*) for history of AGUs see Herz et al.: Proc. ICECS 2002, Dubrovnik, Croatia
... a much better AGU methodology is available*
AGU: address generator unit
64
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
von Neumann vs. anti machine
progra
m
counter
DPU CPU
RAM memory
von Neumann bottleneck
(r)DPA
without sequencer
asMA: auto-sequencing Memory Array
........
asM
asM
asM
asM
asM
(r)DPA
........
data stream machine (anti machine)
data
counter
memory bank
asM
asM: auto-sequencing Memory
instruction stream machine (von Neumann etc.)
65 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Behavior of the Counter
data
counter
memory bank
asM
asM
asM
asM
asM
asM
........
progra
m
counter
DPU CPU
(r)DPA
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 12
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
anti machine
Configware Compilation
data
counter
memory bank
asM
asM
asM
asM
asM
asM
........
(r)DPA configware compiler
source „program“
mapper
Placement & routing
data scheduler
flowware code
67 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
68
Data Stream Machines
• Systolic array, def of datastream -simple !! -- (e. g. Matrix compute) examples intro, simulator, exercises
• DMA, auto-sequencing memory, data stream machine exercises programming data streams e.g.of prev. section
• introducing a flowware language, implement a compiler example exercises running on datastream machine
• introductory examples illustrate datastream machine use
• introducing the datastream machine
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
data counter
GAG RAM
ASM: Auto-Sequencing
Memory
ASM: Auto-Sequencing
Memory
use data counters, no program counter
rDPA rDPA
x x
x
x
x
x
x x
x
- -
-
x x
x
x
x
x
x x
x
- -
-
-
-
-
-
-
-
-
-
-
ASM Data stream generators
x x x
x x x
x x x
|
| |
x x x
x x x
x x x
|
|
|
|
|
|
|
|
|
|
|
|
ASM
ASM
ASM
ASM
ASM
ASM
AS
M
AS
M
AS
M
AS
M
AS
M
AS
M
implemented
by distributed
on-chip memory
implemented
by distributed
on-chip memory
reconfigurable reconfigurable
non-von-Neumann machine paradigm
systolic array super systolic pipe network
[1995]
69 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
(anti-von-Neumann machine paradigm)
Data Counter instead of Program Counter
Generalization of the DMA
data counter
GAG RAM
ASM: Auto-Sequencing Memory ASM
GAG & enabling technology: published 1989 [by TU-KL], Survey paper: [M. Herz et al.*: IEEE ICECS 2003, Dubrovnik] *) IMEC & TU-KL
**) -- patented by TI** 1995
Storge Scheme optimization methodology, etc.
70
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
71
GAG: 2-D Generic Data Sequence Examples
a) b)
c)
d) e) f) g)
Limit Slider
Base Slider
GAG
Address Stepper
B0 DA L0
A
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
72
GAG Slider Operation Demo Example
yx
ceiling
C
address
DLDB
L0B 0 DA
F
floor
DLDB
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 13
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
GAG Slider Model
LimitStepper
BaseStepper
AddressStepper
B0DAL0
A
LimitStepper
BaseStepper
AddressStepper
B0DAL0
A
sliders
B 0 B
[
0 L
]
0 L 0
B 0 B
[
0 A D
A D
L
]
0 L 0
GAG
Generic Address
Generator
floor ceiling
73 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
74
MoM Scan window (MoMSW) Illustration
• Multiple* vari-size reconfigurable MoMSW scan windows
• MoMSW controlled by reconfigurable GAG (generic address generators)
• 2-dimensional (data) memory address space
MoM architectural primary features:
*) typically 3
ASM: Auto-
Sequencing
Memory
ASM: Auto-
Sequencing
Memory
ASM: Auto-
Sequencing
Memory
ASM: Auto-
Sequencing
Memory
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
75
CGFFT: Parallel Scan Pattern Animation MoM-3 with 3 varisize scan windows
Datapath Datapath ASM: Auto-
Sequencing
Memory
ASM: Auto-
Sequencing
Memory
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
76
Reconfigurable Generic Address Generator GAG
Generalization of the DMA
data counter
GAG
GAG & enabling technology published 1989, survey: [M. Herz et al.: IEEE ICECS 2003, Dubrovnik]
patented by TI 1995
• storge scheme optimization methodology, etc.
Acceleration factors by:
• address computation without memory cycles
• supporting scratch optimization strategies (smart d-caching)
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
77
Significance of MoMSW Reconfigurable Scan Windows
• MoMSW Scan windows have the potential to drastically reduce traffic to/from slow off-chip memory.
• No instruction streams needed to implement scratch pad optimization strategies using fast on-chip memory
• MoMSW Scan windows may contribute to speed-up by a factor of 10 and sometimes even much more
• MoMSW Scan windows are the deterministic alternative („d-caching“) to (indeterministic and speculative) classical cache usage: performance can be well predicted
• For data-stream-based computing scan windows are highly effective, whereas classical caches are entirely useless
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
78
Acceleration Mechanisms by ASM-based MoMSW
•parallelism by multi bank memory architecture •reconfigurable address compuattion – before run time
•avoiding multiple accesses to the same data.
•avoiding memory cycles for address computation
• improve parallelism by storage scheme transformations
•minimize data movement across chip boundaries
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 14
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Xputer Lab at Kaiserslautern: MoM I and II 1986:
79 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
80
Data Stream Machines
• Systolic array, def of datastream -simple !! -- (e. g. Matrix compute) examples intro, simulator, exercises
• DMA, auto-sequencing memory, data stream machine exercises programming data streams e.g.of prev. section
• introducing a flowware language, implement a compiler example exercises running on datastream machine
• introductory examples illustrate datastream machine use
• introducing the datastream machine
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Similar Programming Language Paradigms
language category Computer Languages Xputer Languages
both deterministic procedural sequencing: traceable, checkpointable
sequencingdriven by:
read next instruction, goto (instruction addr.), jump (to instruction addr.), instruction loop, instruction loop nesting no parallel loops, instruction loop escapes, instruction stream branching
read next data object, goto (data addr.), jump (to data addr.), data loop, data loop nesting, parallel data loops, data loop escapes, data stream branching
81 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
82
Programming Language Paradigms
language category Computer Languages Xputer Languages
both deterministic procedural sequencing: traceable, checkpointable
operationsequencedriven by:
read next instruction, goto (instr. addr.),
jump (to instr. addr.), instr. loop, loop nesting
no parallel loops, escapes,instruction stream branching
read next data item, goto (data addr.),
jump (to data addr.),data loop, loop nesting,parallel loops, escapes,data stream branching
state register program counter data counter(s)
addresscomputation
massive memorycycle overhead overhead avoided
Instruction fetch memory cycle overhead overhead avoided
parallel memorybank access interleaving only no restrictions
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern Why these paradigms are twins
instruction stream languages
read next instruction
goto (instruction address)
jump to (instruction address)
instruction loop
instruction loop nesting
instruction loop escape
instruction stream branching
no internally parallel loops
data stream languages
read next data item
goto (data address)
jump to (data address)
data loop
data loop nesting
data loop escape
data stream branching
yes: internally parallel loops
83 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Procedural Languages Twins
systolic Flowware Languages
read next data item
goto (data address)
jump to (data address)
data loop
data loop nesting
data loop escape
data stream branching
yes: internally parallel loops
84
imperative Software Languages
read next instruction
goto (instruction address)
jump to (instruction address)
instruction loop
instruction loop nesting
instruction loop escape
instruction stream branching
no: no internally parallel loops
But there is the Asymmetry But there is the Asymmetry
program counter data counter(s)
for data parallelism for data parallelism
super
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 15
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
The twin paradigms: different moving of data
# moving data between data transport execution
triggered by strategy
1 von Neumann CPU cores
via common memory
instruction stream
moving data at run time
2
(r)DPU cores within (r)DPA
(coarse-grained array)
piped thru directly from
(r)DPU to (r)DPU
arrival of data
(transport-triggered)
moving at compile time the
locality of execution
Who moves operand to operator if not an instruction?
85 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
86
Von Neumann vs. anti machine
# feature von Neumann
machine
hardwired
anti machine
reconfigurable
anti machine
1 m’ code schedules: instruction stream data streams
2 # prog’ sources 1 2
3 source 1 none configware
4 source 2 software flowware
5 sequenced by: program counter data counters
6 counter co-located with: PU (data path): CPU memory block: ASM
9 inter PU communication: common memory piped through
10 data meeting PU: move data at run time move locality of execution at compile rime
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Overhead avoided by Anti Machine
# feature von Neumann
machine
hardwired
anti machine
reconfigurable
anti machine
11 state address computation
overhead at run time instruction stream
overhead:
none
12 data address computation
overhead at run time instruction stream none
13 Inter PU communication
overhead at run time instruction stream none
14 instruction fetch at run time instruction stream none
15 data meet PU at run time instruction stream none
16 other overhead instruction stream none
87 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
88
Which language ?
Computer scientists haven’t been interested in programming clusters. If putting the cluster on a chip is what excites them, fine.
It will still have to run Fortran!
*) like CoDe-X
Scientists do not change their direction
Newton’s 1st Law à la Gordon Bell:
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
New Languages for Parallelism ?
Efforts to extend standards-based, serial programming languages with features to describe parallel constructs are likely to fail.
Nick Tredennick:
Term Rewriting Systems may raise the abstraction level up to math formulae
Mauricio Ayala-Rincón:
What’s more likely to succeed are languages that raise the level of abstraction in algorithm description
89 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
90
Which Language ?
*) like CoDe-X
Support tools have been demonstrated by academia
Classical programming languages, but with a slightly different semantics (data-procedural) are good candidates for parallel programming.
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 16
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
91
JPEG zigzag scan pattern
x
y
EastScan is step by [1,0] end EastScan;
SouthScan is step by [0,1] endSouthScan;
*> Declarations
NorthEastScan is loop 8 times until [*,1] step by [1,-1] endloop end NorthEastScan;
SouthWestScan is loop 8 times until [1,*] step by [-1,1] endloop end SouthWestScan;
HalfZigZag is EastScan loop 3 times SouthWestScan SouthScan NorthEastScan EastScan endloop end HalfZigZag;
goto PixMap[1,1]
HalfZigZag; SouthWestScan uturn (reverse (HalfZigZag))
reverse (HalfZigZag)
data counter data counter
data counter data counter
2
1
3
4
HalfZigZag
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
92
What is the reason of the paradox ?
Result from decades of tunnel view in CS R&D and education
basic mind set completely wrong
the von Neumann Syndrome
“CPU: most flexible platform” ?
>1000 CPUs running in parallel: the most inflexible platform
However, FPGA & rDPA are very flexible
The Law of More: drastically declining programmer productivity
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
93
Dual paradigm: an old hat (3)
“procedure call” or function call
call Module-name (parameters);
Hardware Description Languages;
Hardware description: space domain
Software: time domain
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern Software / Configware Co-Compilation
Analyzer / Profiler
SW code
SW compiler
para d igm “vN" machine
CW Code
CW compiler
anti machine paradigm
Partitioner
C language source
FW Code
Juergen Becker 1996
But we need a dual paradigm approach: to run legacy software together w. configware
Reconfigurable Computing: Technology is Ready, Users are Not
apropos compilation:
94
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Why a new machine paradigm ???
The anti machine as the 2nd paradigm is the key to curricular innovation
rDPA rDPA µprocessor µprocessor
... a Troyan horse to introduce data-stream-based issues to the classical mind set of programmers
Programming by flowware instead of software is very easy to learn
Flowware education: no fully fledged hardware expert needed to program configware
(... same language primitives)
95 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
96
Data Stream Machines
• Systolic array, def of datastream -simple !! -- (e. g. Matrix compute) examples intro, simulator, exercises
• DMA, auto-sequencing memory, data stream machine exercises programming data streams e.g.of prev. section
• introducing a flowware language, implement a compiler example exercises running on datastream machine
• introductory examples illustrate datastream machine use
• introducing the datastream machine
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 17
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
97
... (old stuff)
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
98
Book Outline
• Introduction • Instruction Stream Machines • Data Stream Machines • Hetero Systems
• Time to Space Mapping • Application Examples • Outlook
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
99
Hetero Systems
• Hetero systems examples. co-compilation, co-design, interfacing (very short intro).
• Softw 2 configw-&-hardw migration:
• Software/configware/hardware partitioning
• The MORPHEUS project
• Mastering the Hardware / Software Chasm
• The Twin Paradigm Approach
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Our Contemporary Computer Machine Model
Machine model
resources sequencer
property programming
source property programming source state register
ASIC accelerator hardwired - hardwired -
CPU hardwired - programmable Software (instruction streams)
program counter
RPU accelerator programmable
Configware (configuration
code) programmable
Flowware (data
streams)
data counters
twin Paradigm Dichotomy
in CPU
in RAM
data counters of reconfigurable address generators in asM (auto-sequencing) data memory blocks
the same language primitives!
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
The clash of Paradigms (a simplified view)
Computation in time (procedural-only mindset)
Computing in space (& time since unavoidable)
CS&E: Co-design mindset (mainly embedded systems) (mostly hardware people with some programming skills)
you can teach programming to a hardware guy you can teach programming to a hardware guy
you can’t teach hardware to a programmer
you can’t teach hardware to a programmer
This is the fault of our CS curricula This is the fault of our CS curricula
CS: Software-only mindset: von-Neumann-based
101 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
102
von Neumann machine
resources
sequencer
Machine: resources
sequencer
memory
algorithms
software program counter
von Neumann machine:
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 18
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
103
Anti machine
resources
sequencer
memory
algorithms
flowware
data counters
hardwired anti machine:
resources
sequencer
memory
algorithms
flowware
data counters
reconfigurable anti machine:
configware
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
104
Dual paradigm: an old hat
Mapped into a Hardware mind set: action box = Flipflop, decision box = (de)multiplexer
Software mind set: instruction-stream-based: flow chart -> control instructions (FSM: state transition)
-> Register Transfer Modules (DEC: mid 1970ies); similar concept: Case Western Reserve Univ. ;
FF
token bit
evoke
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
105
Dual paradigm: an old hat (2)
“It is so simple!
why did it take 25 years to find out ?”
Hardware Description Language scene ~1970:
Because of the reductionists’ tunnel view
Because of a lack of transdisciplinary thinking
FF
token bit
evoke
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
decision box turns into demultiplexer
106
[1967] PvOIIP
0 1 B0
B1
CO
ND
ITIO
N
ENABLE
demultiplexer:
B0
B1
CONDITION
ENABLE
decision box:
RTM as DEC product available: 1973
[~1971] (introducing HDLs):
„That‘ so simple! Why did it take 30 years to find out ?“
C. G. Bell et al: IEEE Trans-C21/5, May 1972
W. A. Clark: 1967 SJCC, AFIPS Conf. Proc.
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
manycore programming* requires mapping from computing in time, to computing in space (& time)
Programming in CS education is based on fine-grained strictly sequential thinking
Decades old methods and simple models are stubbornly ignored**, even by engineers
Resulting in embarrassing situations (e. g. in talking to your CEO)
Education: a Key Issue
*) and Reconfigurable Computing **) except most hardware people
decision box turns into multiplexer
decision box turns into multiplexer “That’s so simple!
why did it take 30 years to find out?”
in 1971:
C. G. Bell et al: IEEE Trans-C21/5, May 1972
W. A. Clark: 1967 SJCC, AFIPS Conf. Proc.
task P
task Q task R
C 0 1 flowchart
flipflop P
flipflop Q flipflop R
C condition
evoke P
evoke Q evoke R
0 1
token
(token) (token)
RTM
107 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
108
rDPU not used used for routing only operator and routing port location markerLegend: backbus connect
array size: 10 x 16 = 160 rDPUs
rout thru only
not used backbus connect
SNN filter on (supersystolic) KressArray (mainly a pipe network)
reconfigurable Data Path Unit, e. g. 32 bits wide
reconfigurable Data Path Unit, e. g. 32 bits wide
no CPU
rDPU rDPU
note: software perspective without instruction streams
Symptom of the von Neumann Syndrome
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 19
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
109
rDPU not used used for routing only operator and routing port location markerLegend: backbus connect
array size: 10 x 16 rDPUs
Coarse-grained Reconfigurable Array
rout thru only
not used backbus connect
SNN filter on (supersystolic) KressArray (mainly a pipe network)
reconfigurable Data Path Unit, 32 bits wide
reconfigurable Data Path Unit, 32 bits wide
no CPU
rDPU rDPU
note: software perspective without instruction streams: pipelining
note: software perspective without instruction streams: pipelining
compiled by Nageldinger‘s KressArray Xplorer with Juergen Becker‘s CoDe-X inside
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
KressArray Example
Y
X
rout thru only rout thru and function (multiplexer)
X
swap
0
1
0
1
>
Y
if X > Y then swap;
Conditional Swap
swap turned into a wiring pattern
110
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern Really so simple ?
111
(recall this example !)
rDPU not used used for routing only operator and routing port location markerLegend: backbus connect
rout thru only
not used backbus connect
embarrassing reaction to Ulrich Nageldinger‘s talk at RAW 1996
CoDe-X inside [Jürgen Becker]
by KressArray Xplorer [Ulrich Nageldinger]
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Brick Wall in the Brain
112
immediately* a VIP jumps up: „But you can‘t implement decisions!“
Embarrassing: a top level R&D manager of a global IT corp. group
*) discussion after the talk: RAW at Orlando, FLA
completely missing
sense of Dichotomies
structural procedural
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
„but you can‘t implement decisions!“
S = R + (if C then A else B endif);
=1
+
A B R C
section of a very large pipe network: not knowing this solution
symptom of the hardware / software chasm
and the configware / software chasm
113 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
„But you can‘t implement decisions!“
114
S = R + (if C then A else B endif);
=1
+
A B R C
section of a very large pipe network: Software to
Configware
Migration:
it‘s criminal, that typical CS
graduates don‘t know this!
illustrating, that mono-rail education is fatal
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 20
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
115
Hetero Systems
• Hetero systems examples. Co-compilation, co-design, interfacing (very short intro).
• Softw 2 configw-&-hardw migration:
• Software/configware/hardware partitioning
• The MORPHEUS project
• Mastering the Hardware / Software Chasm
• The Twin Paradigm Approach
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern Domains & what we need
116
term source for programming … domain
software instruction streams time (procedural)
configware ressources (structures) space (structural)
flowware data streams time (procedural)
we need data parallelism we need data parallelism
we need paradigm twins we need paradigm twins
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
117 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern Terminology
term program counter
execution triggered
by paradigm
CPU
yes instruction fetch
instruction-stream-based
DPU** no data arrival*
data-stream-based
DPU
progra
m
counter
DPU CPU CPU
*) “transport-triggered” **) does not have a program counter
118
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
The twin paradigm approach
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU CPU CPU
CPU CPU
CPU CPU
CPU CPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
inefficient von Neumann
paradigm
high performance antimachine paradigm
heterogenous manycore system
no instruction streams at run time:
only intermediate data to be stored: at distributed fast on-chip memory
for legacy software and supervisory tasks
Software/ Configware
Co-compilation
119 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern Why a Dichotomy ?
120
Dichotomy = mutual allocation to two opposed domains such, that a third domain is excluded.
The dichotomy model as an educational orientation guide for dual rail education to overcome the software/configware chasm
Dichotomy of machine paradigms (von Neumann /Anti machine): the „Twin Paradigm“ model
Relativity Dichotomy: time domain / space domain – helps parallelization by time to space mapping
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 21
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Teaching properly
121
Dichotomy model for dual rail education to overcome the software/configware chasm & the software/hardware chasm
1) Machine Paradigm Dichotomy (von Neumann / Datastream machine): the Twin Paradigm model
2) Relativity Dichotomy: time domain / space domain – helps parallelization by time to space mapping
„The biggest payoff will come from Putting Old ideas into Practice and teaching people how to apply them properly.“
Dav
id P
arna
s
(POIIP) von Neumann reconfigurable
keep in mind ~hardware
Put Old ideas into Practice
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
1 ) Paradigm Dichotomy
122
instruction domain
(procedural dichotomy)
imperative Language (instruction-procedural) (data-procedural)
datastream domain
The Twin Paradigm Approach (TTPA)
program
counter
CPU CPU data
counter
(r)DPA (r)DPA
+ - - +
Systolic Flowware Language super-
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
1 ) Paradigm Dichotomy
123
instruction domain
(procedural dichotomy)
imperative Language (instruction-procedural) (data-procedural)
datastream domain
The Twin Paradigm Approach (TTPA)
program
counter
CPU CPU data
counter
(r)DPA (r)DPA
+ - - +
+
+ data
parallelism
data
parallelism
we need we need
Systolic Flowware Language super-
s s
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
The anti universe
•Paul Dirac predicted a complete anti universe consisting of antimatter
•“There are regions in the universe, which consist of antimatter .....
•We are not aware, that there is a new area in computing sciences , which consists of antimatter of computing
• .... But there are asymmetries”
•Reconfigurable Computing is made from this antimatter: data-stream-based computing
•when a particle hits its antiparticle, both are converted into energy: Annihilation
124
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
anti particles
• 1956: anti neutron created on Bevatron
• 1928: Paul Dirac: „there should be an anti electron having positive charge“ (Nobel price 1933)
• 1932: Carl David Anderson detected this „positron“ in cosmic radiation (Nobel price 1936)
• 1955 Owen Chamberlain et al. create anti proton on Bevatron
• 1954: new accelerators: cyclotron, like Berkeley‘s Bevatron
• 1965: creation of a deuterium anti nucleus at CERN
hydrogen anti hydrogen
• 1995: hydrogen anti atom created at CERN – by forcing positron and anti proton to merge by very low energy.
125 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Matter & Antimatter: Atom and Anti Atom
The World of Matter -
machine paradigm: the Atom
Anti Matter -
machine paradigm: Anti Atom
+ + -
- - +
126
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 22
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Matter & Antimatter of Informatics : Machine and Anti Machine
+
CPU
- 1936 1st electronic computer (Konrad Zuse)
Machine paradigm: „von Neumann“
1946 v. N. machine paradigm
1971 1st microprocessor (Ted Hoff)
1979 „data streams“ (systolic array: Kung / Leiserson)
- DPU
+
Anti Machine paradigm
1990 anti machine paradigm published
1995 rDPA / DPSS (supersystolic: Rainer Kress)
novel
compilation
techniques
127 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
- DPU
(r)DPU
+
CPU
instruction stream
Matter vs. antimatter: CPU vs. DPU
-
progra
m
counter
DPU CPU
without sequencer
+
dat
a st
ream
(r)DPA
dat
a st
ream
s
+
+
128
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
heavy anti atoms: DPA = DPU array
- DPA
- DPU
- DPU
- DPU
- DPU
- DPU
- DPU
- DPU
- DPU
- DPU -
DPA
+
+
+
+
+
+
+
+
+
coher
ent
dat
a st
ream
s sp
inni
ng a
roun
d
129 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Parallelism by Concurrency
+ -
+
- -
+
- +
+
-
- +
- +
independent instruction streams difficult ...
130
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
131
Hetero Systems
• Hetero systems examples. co-compilation, co-design, interfacing (very short intro).
• Softw 2 configw-&-hardw migration:
• Software/configware/hardware partitioning
• The MORPHEUS project
• Mastering the Hardware / Software Chasm
• The Twin Paradigm Approach
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
132
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
CPU CPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
CPU CPU
Compilation for Dual Paradigm Multicore
SW compiler
CW compiler
C language source
Partitioner
Juergen Becker’s
CoDe-X, 1996
compile to hybrid multicore compile to hybrid multicore
placement and routing placement and routing
automatic parallelization by loop transformations automatic parallelization by loop transformations
generating a pipe network generating a pipe network
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 23
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
133
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
134
Jürgen Becker’s CoDE-X -2 Co-Compiler
Analyzer / Profiler
GNU C compiler
para d igm Computer machine
DPSS
X-C compiler
Anti machine paradigm
Partitioner
X-C is C language extended by MoPL X-C
rDPU rDPU rDPU rDPU rDPU rDPU rDPU rDPU
rDPU rDPU rDPU rDPU rDPU rDPU rDPU rDPU
rDPU rDPU rDPU rDPU rDPU rDPU rDPU rDPU
rDPU rDPU rDPU rDPU rDPU rDPU rDPU rDPU CPU
Resource Parameters
supporting KressArray
family
Pipelining:
A Shorter Leap
Pipelining:
A Shorter Leap
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Configware resources: variable
The RAM-based domains
Flowware algorithm: variable
Configware domain Configware domain
algorithm: variable
resources: fixed
Software CPU
135
Software domain Software domain (von Neumann Machine)
Anti Machine:
Nic
k T
redenn
ick’s
Pe
rspect
ive
space domain space domain
(structure)
time domain time domain
(data streams)
time domain time domain
(instruction streams)
1 Program Source needed
2 Program Sources needed
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
136
Hetero Systems
• Hetero systems examples. co-compilation, co-design, interfacing (very short intro).
• Softw 2 configw-&-hardw migration:
• Software/configware/hardware partitioning
• The MORPHEUS project
• Mastering the Hardware / Software Chasm
• The Twin Paradigm Approach
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Loop Transformation Examples
loop 1-8 body body endloop
loop 1-8 body endloop
loop 9-16 body endloop
fork
join
strip mining
loop 1-4 trigger endloop
loop 1-2 trigger endloop
loop 1-8 trigger endloop
reconf.array: host: loop 1-16 body endloop
sequential processes: resource parameter driven Co-Compilation
loop unrolling
137 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
138
Anti-Matter (vs. Matter)
Paul Dirac (1928, Nobel prize 1933):
But in CS it’s really existing, it’s Reconfigurable Computing
only accelerator artefacts* (1955 – 1995)
Dichotomy Example:
Dirac: “But there are Asymmetries” but in CS: only one asymmetry !
hydrogen
(CERN 1995) anti hydrogen
*) except positron , found from cosmic radiation
*) except positron , found from cosmic radiation
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 24
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern Double Dichotomy
1) Paradigm Dichotomy
2) Relativity Dichotomy
Procedure time
(Software-Domain)
Structure space
(Configware-Domain)
instruction stream von Neumann
(Software-Domain)
data stream Anti Machine
(Flowware-Domain)
139 139 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern Double Dichotomy (2)
1) Paradigm Dichotomy
2) Relativity Dichotomy
-Procedure time:
(Software-Domain)
-Structure space:
(Configware-Domain)
instruction stream von Neumann
(Software-Domain)
data stream Anti Machine
(Flowware-Domain)
140
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern Double Dichotomy (3)
2) Relativity Dichotomy
-Procedure time:
(Software-Domain)
-Structure space:
(Configware-Domain)
1) Paradigm Dichotomy
instruction stream von Neumann Machine
(Software-Domain) data stream Datastream Machine
(Flowware-Domain)
141 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
2) Relativity Dichotomy
time domain: space domain:
procedure domain structure domain
2 phases: 1) programming
instruction streams 2) run time
3 phases: 1) reconfiguration
of structures
time space
2) programming data streams
3) run time
142 von Neumann Machine Anti Machine
(time time/space)
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern Relativity Dichotomy (2)
time domain: space domain:
procedure domain structure domain
2 phases: 1) programming
instruction streams 2) run time
3 phases: 1) reconfiguration
of structures
time space
2) programming data streams
3) run time
143
von Neumann Machine Datastream Machine
(time time/space)
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
time-iterative to space-iterative
144
a time to
space/time
mapping
loop transformation methodogy: 70ies and later
n*k time steps, 1 CPU
n time steps, k DPUs
Often the space dimension is limited (e.g. because of the chip size) n time steps,
1 CPU
1 time step, n DPUs
a time to
space
mapping
e. g. example: bubble sort migration Strip mining
[D. Loveman, J-ACM, 1977]
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 25
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
145
Hetero Systems
• Hetero systems examples. co-compilation, co-design, interfacing (very short intro).
• Softw 2 configw-&-hardw migration:
• Software/configware/hardware partitioning
• The MORPHEUS project
• Mastering the Hardware / Software Chasm
• The Twin Paradigm Approach
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
146
... (old stuff)
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
147
Hetero Systems
• Hetero systems examples. co-compilation, co-design, interfacing (very short intro).
• Softw 2 configw-&-hardw migration:
• Software/configware/hardware partitioning
• The MORPHEUS project
• Mastering the Hardware / Software Chasm
• The Twin Paradigm Approach
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Architektur statt Synchro
148
„Shuffle Sort“
conditional swap
conditional swap
conditional swap
conditional swap
Modifikation:
mit Shuffle-
Funktion
conditional swap
conditional swap
conditional swap
conditional swap
conditional swap
conditional swap
swap
conditional swap
conditional
Direkte Zeit-
Raum-Abbildung
Zugriffs-Konflikte
Bessere Architektur statt aufwendiger Synchronisation: Halbierung der Anzahl Blöcke + Auf und Ab der Daten (Shuffle Funktion) – kein von-Neumann-Syndrom !
Beispiel
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Zeit zu Raum Abbildung
Zeit-Domäne: Raum-Domäne:
Prozedur-Domäne Struktur-Domäne
149
Programmschleife n Zeitschritte, 1 CPU
Pipeline 1 Taktschritt, n DPUs
Bubble Sort n x k Zeitschritte,
1 „conditional swap“ unit
Shuffle Sort k Taktschritte, n „conditional swap“ units
Zeit-Algorithmus Raum-Algorithmus
conditional swap
x
y
conditional swap
conditional swap
conditional swap
conditional swap
Zeit-Algorithmus Raum- / Zeit-Algorithmus
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Transformationen seit den 70ern
Zeit-Domäne: Raum-Domäne:
Prozedur-Domäne Struktur-Domäne
150
Programmschleife n x k Zeitschritte,
1 CPU
Pipeline k Taktschritte, n DPUs
Zeit-Algorithmus Raum- / Zeit-Algorithmus
Strip Mining
Transformation
Schleifentransformationen: reichhaltige Methodik publiziert Hier ein Beispiel: [Übersicht: Diss.
Karin Schmidt, 1994, Shaker Verlag]
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 26
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
time-iterative to space-iterative
151
a time to
space/time
mapping
loop transformation methodogy: 70ies and later
n*k time steps, 1 CPU
n time steps, k DPUs
Often the space dimension is limited n time steps, 1 CPU
1 time step, n DPUs
a time to
space
mapping
e. g. example: bubble sort migration Strip mining
[D. Loveman, J-ACM, 1977]
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
We need to POIIP for:
152
2 simple key rules of thumb:
a) loop turns into pipeline
Software to Hardware Migration
Software to Configware Migration and
b) decision box turns
into demultiplexer
rDPU rDPU loop body
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU loop body
Put Old ideas into Practice
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
153
Hetero Systems
• Hetero systems examples. co-compilation, co-design, interfacing (very short intro).
• Softw 2 configw-&-hardw migration:
• Software/configware/hardware partitioning
• The MORPHEUS project
• Mastering the Hardware / Software Chasm
• The Twin Paradigm Approach
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
154
... (may be)
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
155
Book Outline
• Introduction • Instruction Stream Machines • Data Stream Machines • Hetero Systems • Time to Space Mapping
• Application Examples • Outlook
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Dual-rail education
156
von Neumann
machine
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
CPU CPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
rDPU rDPU
datastream
machine
Software domain Configware domain
Conventional
Computing
Reconfigurable
Computing
Time domain Space domain 1)
2)
program
counter
CPU CPU data
counter
(r)DPA (r)DPA s s
Dic
hoto
my
#
migration, interfacing, partitioning
„The importance of not separating in your mind the notions of hardware and software“
Yale Patt:
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 27
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
157
Book Outline
• Introduction • Instruction Stream Machines • Data Stream Machines • Hetero Systems • Time to Space Mapping • Application Examples
• Outlook
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
158
Application Exampless
• survey on speed-up-factor publications automotive, genome sequencing, image processing, what else ?
• elevator example or traffic light example
• cryptography • automotive
• bioinformatics • image processing
• genome sequencing • digital signal processing
• information science • … your proposals …..
• bubble sort to shuffle sort migration example
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
159
... (old stuff)
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
160
Book Outline
• Introduction • Instruction Stream Machines • Data Stream Machines • Hetero Systems • Time to Space Mapping • Application Examples • Outlook
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
161
thank you for your patience
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
162
END
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 28
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
extra pages
for discussion:
163 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
164
FPGAs in Supercomputing
• Synergisms: coarse-grained parallelism through conventional parallel processing,
reconfigurable logic box: 1 Bit
• and: fine-grained parallelism through direct configware execution on the FPGAs
DPU program counter
DPU CPU CPU
DataPath Units 32 Bit, 64 Bit
(millions of rLBs embedded in a reconfigurable interconnect fabrics)
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
165
Have to re-think basic assumptions
Instead of physical limits, fundamental misconceptions of algorithmic complexity theory limit the progress and will necessitate new breakthroughs.
Not processing is costly, but moving data and messages
We’ve to re-think basic assumptions behind computing
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
166
„It is feared that domain scientists will have to learn how to design hardware. Can we avoid the need for hardware design skills and understanding?“
Avoiding the paradigm shift?
Tarek El-Ghazawi, panelist at SuperComputing 2006
SuperComputing, Nov 11-17, 2006, Tampa, Florida, over 7000 registered attendees, and 274 exhibitors
We need a bridge strategy by developing new text books and advanced tools for training the software community
We need a bridge strategy by developing advanced tools for training the software community to think in fine grained parallelism and pipelining techniques.
166
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
167
PISA-MoM
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
168
Processing 4-by-4 Reference Patterns
Mead-&-Conway nMOS Design Rules: 256 4-by-4 reference patterns
Mead-&-Conway CMOS Design Rules: >800 4-by-4 reference patterns
MoM: all reference patterns matched in a single clock cycle
vN Software: some reference patterns can be skipped, depending on earlier patterns
DPLA: fabricated by the
E.I.S. Multi University Project:
PISA DRC accelerator [ICCAD 1984]
1984: 1 DPLA replaces 256 FPGAs
Reference patterns automatically generated from Design Rules
PISA: a forerunner
of the MoM
accelerator reconfigurable
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 29
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
169
Acceleration Mechanisms by ASM-based MoMSW
•parallelism by multi bank memory architecture •reconfigurable address compuattion – before run time
•avoiding multiple accesses to the same data.
•avoiding memory cycles for address computation
• improve parallelism by storage scheme transformations
•minimize data movement across chip boundaries
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
170
ASM
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
171
171 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
172
Morphware: old stuff
structural programming (non-v-N)
1971 PROMs for small logic
1984 first Xilinx FPGA
1975 PLA
1978 PAL with PALASM tool
meanwhile mainstream …
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern We need to POIIP for:
173
2 key rules of thumb - terrifically simple:
1) loop turns into pipeline [1979]
Software to Hardware Migration:
Software to Configware Migration: and
2) decision box turns into demultiplexer
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern (2) We need to POIIP for:
174
2 key rules of thumb - terrifically simple:
1) loop turns into pipeline [1979]
2) decision box turns into demultiplexer
[1968]: PvOIIP
Software to Hardware Migration:
Software to Configware Migration: and
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 30
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
POIIP: Loop turns into pipeline
175
[1979]
(reconfigurable)
DataPath Unit:
rDPU rDPU loop body
rDPU rDPU
rDPU rDPU
rDPU rDPU
Pipeline:
rDPU rDPU loop body
loop:
complex loop body
nested loops
complex rDPU or pipe network inside rDPU
complex pipe network
CPU CPU
Memory Memory
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Brick Wall in the Brain
176
immediately* a VIP jumps up: „But you can‘t implement decisions!“
Embarrassing: a top level R&D manager of a global IT corp. group
*) discussion after the talk: RAW at Orlando, FLA
completely missing
sense of Dichotomies
structural procedural
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Computational Density vs. Code Size SPECfp2000/Billion-Transistors PowerPC 1998: 72 2003: 5 SUN: 1985: 105 2000: 30 1990: 17 alpha: 1992: 550 1999: 10
http://www.nytimes.com/2006/03/27/technology/27soft.html
1950 1955 1960 1965 1970 1975 1980 1985 1990 1995 2000
109
108
107
106
105
104
103
102
101
100
2002
7750
4004/50 single-board computer
8080 6000
360
360 360
7090
7070
704
650
650
Windows lines of code Windows 95: 15 M Windows 98: 18 M Windows XP (’01): 35 M Windows Vista (’07): 50 M
2005 2010
MOPS/MHz/Mio Transistors Pentium: 1996: 0,43 2000: 0,06
Computational density SPECfp2000/BillionTransistors
MOPS/MHz/ Mio Transistors
Pentium 4 125,000,000
other intel µP
[BWRC, UC Berkeley, 2004]
177 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
ICT is at an inflection point
Senior Counselor to the U.S. Trade Representative (USTR) on strategy and negotiations.
Cheap Revolution:
„Broadband is significant at the inflection point, prompting major market governance changes“
massive funding needed Cowhey‘s & Aronson‘s Law:
affordable broadband software performance
„Future prosperity depends on network capacity, ..., efficient pricing, and flexible platforms“
e.g., the living room commercially more important than the comparatively small PC market.
massive
funding
required
178
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Coarse-grained vs. fine-grained
device granularity path width eff’ve density flexibility
FPGA fine-grained ~ 1 bit very low general purpose
DPA coarse-grained multi bit:
e.g. 32 bits very high
specialized
rDPA coarse-grained domain-
specific platform
FPGA
fine-grained &
embedded hdw. mixed high
179 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
180
MoM
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 31
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
MoM anti machine an Xputer architecture
Multiple Scan Windows
data
counter
memory bank
asM
asM
asM
asM
asM
asM
......
asM
A d
istr
ibut
ed m
em
ory
rDPU smart
memory interface
example: 4x4 scan windows
.....
181 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
16 point CGFFT: mapped onto 2-D memory space
182
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
outp
ut
temp
temp
temp
coeff
.
coeff
.
coeff
.
CGFFT: Nested and Parallel Scan Pattern
inpu
t
coeff
.
ini
ini+1
coeff. empty
MAC
183 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
CGFFT: Parallel Scan Pattern Animation
ini
ini+1
coeff. empty
outk
MAC
outj
184
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
CGFFT: Parallel Scan Pattern Animation
MAC
outj
outj+1
outk
outk+1
ini
ini+1
coeff. empty
Ini+2
ini+3
coeff. empty
MAC
4 MAC units
in parallel
8 MAC units
in parallel
185 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern CGFFT: Nested and Parallel Scan Pattern
scanouter loop
patternHLScan is 3 steps [2, 0]
SP1 is 7 steps [0, 2]
SP23 is 7 steps [0, 1]
inner loopcompound
scanpatterns
3 in parallel
goto
186
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 32
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
>>> distributed memory
(application-specific)
distributed memory
The new discipline of
[] Herz et al.: proc. IEEE ICECS 2002
187 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
188
GAG
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
GAG generic address generator Scheme
Base Slider
B0
Limit Slider
L0
0 B
[
Address Stepper
DA
A
D A
| | | |
L
]
limit
all 3 are copies of the same BSU
stepper circuit GAG
189 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
GAG Slider Model
LimitStepper
BaseStepper
AddressStepper
B0DAL0
A
LimitStepper
BaseStepper
AddressStepper
B0DAL0
A
sliders
B 0 B
[
0 L
]
0 L 0
B 0 B
[
0 A D
A D
L
]
0 L 0
GAG
Generic Address
Generator
floor ceiling
190
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern GAG: Address Stepper
GAG =
Address
Generator
Generic
+ / –
Escape
Clause End
Detect
Step Counter
=o
L A D A
init tag
A
Address endExec
maxStepCount
0 B Limit Base stepVector
[ ] | |
D A L B 0
[ ] | | | |
limit
GAG: Address Stepper
191 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
192
GAG Complex Sequencer Implementation
Limit Slider
Base Slider
GAU
Address Stepper
B0 DA L0
A
all `been published
in 1990
Limit Slider
Base Slider
GAU
Address Stepper
B0 DA L0
A
Limit Slider
Base Slider
GAU
Address Stepper
B0 DA L0
A
GAU GAU
GAG
Generic Address Generator
SDS
GAG
VLIW
stack
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 33
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
193
Generic Address Generator GAG
Generalization of the DMA
data counter
GAG
GAG & enabling technology published 1989, survey: [M. Herz et al.: IEEE ICECS 2003, Dubrovnik]
patented by TI 1995
• storge scheme optimization methodology, etc.
Acceleration factors by:
• address computation without memory cycles
*) Software to Xputer migration
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
194
miscellaneous
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Dave Patterson‘s Curve .....
1
10
100
1000 Performance
1980 1990 2000
µProc:
60%/yr..
DRAM
7%/yr..
Processor-Memory
Performance Gap:
(grows 50% / year)
DRAM
CPU
Other problems:
• routing congestion,
•memory cycle overhead
... is not the only problem
195 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
196
Future Trends in Microelectronics
• This especially holds for research funding and politics in Europe [R. H.][R. H.]
• Organizations usually behave poorer than anyone can predict [Gordon Bell][Gordon Bell]
•• Predictions require some history Predictions require some history [Gordon Bell][Gordon Bell]
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
197
The new paradigm: how the data are traveling
An old hat: transport-triggered
pipeline, or chaining
super systolic array
better not by instruction execution better not by instruction execution
DPU DPU DPU
vN Move Processor instruction-driven
+ instruction-driven
[Jack Lipovski, EUROMiCRO, Nice, 1975]
P&R: move locality of operation, not data !
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
198
Xputer Principles
addr. generators reconfigurable
Data Path reconfigurable
Xputer
CPU We used the VAX-11/750 of my group
DPLA
rALU
contemporary ?
1984: first FPGAs: very tiny & very expensive
ASM ASM
reiner@hartenstein.de
21 May 2013
IEEE Computer Society EAB November 2005 at Philadelphia 34
Reiner Hartenstein, TU Kaiserslautern, Germany http://hartenstein.de
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
HPC experts coming ...
Simulation of Star Clusters: x10 speed-up by supercomputer-to-morphware migration
(also molecular biology et al.)
Rainer Spurzem, University of Heidelberg
Reinhard Maenner, University of Mannheim HPC pioneer since 1976 (Physics Dept Heidelberg)
Configware by
Astrophysics by
ARI, Astrononisches Rechen-Institut, founded 1700 in Berlin, moved 1945 to Heidelberg by August Kopff
Gottfried Kirch
August Kopff
example: N-body problem going configware [FPL 1999]
199 © 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Curriculum recommendations
Real-Time Systems (Sweden)
Recommendations to design new ICT curricula
ACM/AIS/IEEE Computing Curricula recommendations
Chess - Center for Hybrid and Embedded Software Systems
(courses in embedded systems)
(EU) Graduate Curriculum on Embedded Software and Systems
Advanced Real Time Systems
Workshop on Embedded Systems Education
WESE
a variety of conference series
CSEE&T Conference on Software Engineering Education and Training
Int‘l Conf. on Microelectronics System Education
Workshop on Computer Architecture Education
200
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
• abstracting away the space domain
201
reductionist approach: reductionist approach:
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
202
Conclusions (1)
We massively need programmable accelerator co-processors
Established technologies are available and we can still use standard software and their tools
Configware skills and basic hardware knowledge
are essential qualifications for programmers.
We need a massive Migration of Software to Configware.
The implementation wall: the programmer population‘s unsustainable skills mismatches
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
203
Conclusions (2)
CS education is a monster !
Yaw-dropping sclerosis of curriculum taskforces
We need a complete re-definition of CS education
We urgently need a fundamental Research and
Education Revolution with connected thinking
Fully wrong educational mainstream approaches
We need „une' Levée en Masses“
© 2010, reiner@hartenstein.de http://hartenstein.de
TU Kaiserslautern
Applications TOPICS repository
Software Engineering: Component Technologies; Testing; Empirical Studies; Design Recovery and Documentation; Formal Methods;
Software Design; Software Process; Program Analysis; Software Architecture; Software Understanding; Consistency Management and
Quality Assurance; Case Studies; Automotive Software Engineering; Process Analysis and Improvement; Process and Tools; Testing and
Fault Correction; Extreme Programming; Undergraduate Education; Course Delivery and Evaluation; Process and Methodology
Parallel and Distributed Systems' Architectures: design, analysis, and implementation of multiple-processor systems (including multi-
processors, multicomputers, and networks); impact of VLSI on system design; interprocessor communications;
Parallel and Distributed Systems' Software: parallel languages and compilers; scheduling and task partitioning; databases, operating
systems, and programming environments for multiple-processor systems;
Parallel and Distributed Systems' Algorithms and applications: models of computation; analysis and design of parallel/distributed
algorithms; application studies resulting in better multiple-processor systems;
Other issues on Parallel and Distributed Systems'' and Software Engineering: performance measurements, evaluation, modeling and
simulation of multiple-processor systems; real-time, reliability and fault-tolerance issues; sequential-to-parallel forms conversion of software.
Algorithms: Data Base and Data Mining; Wireless Networks; Parallel and Network Computing; Routing, Sorting and Clustering;
Interconnection Networks; Routing and Scheduling; Compilers; Architecture and Systems; Routing Algorithms; Sorting and Routing
The American Recovery and Reinvestment Act of 2009 (ARRA, or “the stimulus”) totaled approximately $787 billion. Of this, approximately $21.5 billion (2.7 percent) was for the support of R&D—$18 billion for the conduct of research and $3.5 billion for facilities and equipment.
top related