
Reconfigurable Supercomputing:

What are the Problems?

What are the Solutions?

Reiner Hartenstein

TU Kaiserslautern

Dagstuhl, Germany, April 2 - 7, 2006

Dynamically Reconfigurable Architectures

© 2006, [email protected] http://hartenstein.de


The Supercomputing Paradox

Rapidly growing listed Teraflops

Often limited sustained Teraflops

Almost stalled application implementation progress

Increasing number of processors running in parallel

COTS processor decreasing cost

Very high total cost of the Tera(?)flops

promising technology

poor results

Scientists waiting for affordable compute capacity


Dangerously telling this to the supercomputing people:

You … used the wrong roadmap for the past 20 years !!!


progress stalled


3 Reconfigurable Computing Paradoxes

The high performance paradox

The low power paradox

Reconfigurable Computing Education Paradox


The Pervasiveness of RC

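[Figure: number of Google hits for searches of the form “FPGA and ….”, one count per application area (area labels lost in extraction). First group: 162,000; 127,000; 158,000; 113,000; 171,000; 194,000. Second group: 1,620,000; 915,000; 398,000; 272,000; 647,000; 1,490,000.]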


going into every application area

Almost 10 million hits


The Reconfigurable Computing Education Paradox:

Run-away accelerated pervasiveness of RC, despite all these educational deficits

Curricula still ignore these extremely hot new challenges

In addition to the hardware / software chasm, we now also have the hardware / configware / software chasm


Computing Curricula 2004 (1)

Within about 500 pages the term “reconfigurable” is not found, nor its synonyms

von Neumann‘s monopoly inside curricula is obsolete


von Neumann is not the common model

mainframe age: von Neumann instruction-stream-based machine (CPU with program counter and DPU, RAM memory, von Neumann bottleneck)

microprocessor age: instruction-stream-based CPU (software) plus data-stream-based accelerator co-processors (hardware, morphware): dual paradigm

the tail is wagging the dog

vN paradigm dominance ?

modern FPGA bestsellers:

The new model is reality: FPGA fabrics, together with several µprocessors, several memory banks, and other IP cores, on the same COTS microchip

Bill Gates

Speech by Bill Gates at a summit meeting of US state governors: "American high schools are obsolete."

"The high schools of today teach kids about today's computers like on a 50-year-old mainframe."

"Without re-design for the needs of the 21st century, we will keep limiting, even ruining, the lives of millions of Americans every year."

carved out of stone

The most important cultural revolution since the invention of text characters: it‘s not the mainframe

It is the Microchip !


RC education needed

http://fpl.org/RCeducation/

35 submissions from Australia, Brazil, India, USA, and throughout Europe

Jürgen Becker, Jörg Henkel, R. Hartenstein


Reconfigurable Computing Paradoxes

The high performance paradox

The low power paradox

Reconfigurable Computing Education Paradox


The FPGA Low Power Paradox

„very power-hungry“ [Rick Kornfeld*]

*) personal communication

The awful technology of FPGAs:

FPGAs run at lower clock frequencies, draw much more power and are more expensive.

Reducing the electricity bill by an order of magnitude or more by supercomputer-to-FPGA migration


telling this to the low power design people ?

You … used the wrong roadmap for the past 15 years: use FPGAs !

ISLPED, Oct 4 – 6, Tegernsee

PATMOS, Sep 13 – 15, Montpellier

1991: Kaiserslautern, Germany; 1992: Paris, France; 1993: Montpellier, France


Reconfigurable Computing Paradoxes

The high performance paradox

The low power paradox

Reconfigurable Computing Education Paradox

The High Performance Paradox

Effective integration density much worse than the Gordon Moore curve: by a factor of more than 10,000

85% of all designers hate their tools

The awful technology of FPGAs:

FPGAs run at lower clock frequencies, and are more expensive.


fine-grained RC: 1st DeHon‘s Law [1996: Ph. D, MIT]

[Figure: transistors / microchip (log scale, 10^0 to 10^9) versus year (1980 to 2010). Curves: Gordon Moore curve (microprocessor), FPGA physical density, FPGA routed density, FPGA logical density. Annotated overheads: wiring overhead, routing congestion, reconfigurability overhead; overall overhead >> 10 000.]

immense area inefficiency


coarse-grained RC: Hartenstein‘s Law [1996: ISIS, Austin, TX]

[Figure: transistors / microchip (log scale, 10^0 to 10^9) versus year (1980 to 2010). Curves: Gordon Moore curve, rDPA physical, rDPA logical, and for comparison FPGA routed (overhead >> 10 000).]

area efficiency very close to Moore‘s law, e.g. KressArray family


Claassen‘s Law

[Figure: MOPS / milliWatt (log scale, 0.001 to 1000) versus µ feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07). Curves: standard microprocessor, DSP, instruction set processors, FPGAs (fine grained reconf.), hardwired.]


Claassen‘s Law: Hartenstein‘s Amendment

[Figure: MOPS / milliWatt (log scale, 0.001 to 1000) versus µ feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07). Curves: standard microprocessor, DSP, instruction set processors, FPGAs (fine grained reconf.), and a top curve labeled: hardwired and coarse-grained reconf. (rDPA).]


Selection of published speed-up factors

http://xputers.informatik.uni-kl.de/faq-pages/fqa.html

[Figure: speed-up factor (log scale, 10^0 to 10^9) versus year (1980 to 2010), drawn over relative performance curves for microprocessor (8080 to P4) and memory, annotated 7%/yr, 50%/yr and X 2/yr; one further mark at 100 000.]

Application areas shown: DSP and wireless; Image processing, Pattern matching, Multimedia; Bioinformatics; Astrophysics; crypto

Los Alamos traffic simulation: 47

real-time face detection: 6000

video-rate stereo vision: 900

pattern recognition: 730

SPIHT wavelet-based image compression: 457

Smith-Waterman pattern matching: 288

BLAST: 52

protein identification: 40

molecular dynamics simulation: 88

Reed-Solomon Decoding: 2400

Viterbi Decoding: 400

FFT: 100

MAC or crypto: 1000 (pairing unclear in the transcript)

GRAPE: 20

MoM Xputer architecture:

Grid-based DRC (no FPGA: DPLA on MoM by TU-KL): 2000

Grid-based DRC („fair comparison“): 15000

2-D FIR filter (no FPGA: DPLA by TU-KL): 39.4

Lee Routing (DPLA by TU-KL): 160


2nd DeHon‘s Law [IEEE COMPUTER, 2000]

[Figure: Computational Density (log scale, 1 to 1000) versus µ feature size (2, 1, 0.5, 0.25, 0.13, 0.1, 0.07). Curves: RISC, FPGA.]


The three RC Paradoxes

poor technology

brilliant results

poor tools

very poor education


Why supercomputing / HPC failed

instruction-stream-based: memory-cycle-hungry

the wrong way, how the data are moved around

instruction fetch overhead

because of the interconnect network architecture

address computation overhead and other overhead

sequencing overhead

The law of More:


Earth Simulator

5120 Processors, 5000 pins each

ES: 20 TFLOPS

Crossbar weight: 220 t, 3000 km of cable

moving data around inside the Earth Simulator


data moved around by software

i.e. by memory-cycle-hungry instruction streams which fully hit the memory wall

P&R: move locality of operation, not data !

extremely unbalanced

[Figure: CPU; stolen from Bob Colwell]


An Archetype Common Model needed

Progress stalled by the software/configware chasm

Useful simple archetype not widely accepted

An archetype common model should provide ....

Guidance for organizing efficient solutions

Make the project manageable

Allow sharing lessons between applications and between disciplines

support undergraduate education

from the Configware Industry


The new paradigm: how the data are traveling

transport-triggered: an old hat

pipeline, or chaining

systolic array

asynchronous (via handshake)

wavefront array

no, not by instruction execution


H. T. Kung: systolic arrays

[Figure: a DPA (pipe network) with input data streams entering and output data streams leaving through its ports; each stream is drawn as data items (x) and empty slots (-, |) over time, per port #.]

Def.: data streams (flowware)

Flowware defines: ... which data item at which time at which port

source and sink ?
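The flowware definition (which data item at which time at which port) can be sketched as a schedule table. This is only an illustration and is not taken from the talk: the function names `skewed_flowware` and `items_at`, and the rule that stream k is delayed by k time steps, are assumptions chosen to resemble the staggered input streams usually drawn for systolic arrays.

```python
def skewed_flowware(streams):
    """Build a flowware schedule: {(time, port): data_item}.

    Assumption: stream k enters through port k and is delayed by k
    time steps, a skew commonly drawn for systolic arrays.
    """
    schedule = {}
    for port, items in enumerate(streams):
        for i, item in enumerate(items):
            schedule[(port + i, port)] = item
    return schedule


def items_at(schedule, time, n_ports):
    """Return what each port sees at a given time (None = empty slot)."""
    return [schedule.get((time, port)) for port in range(n_ports)]


streams = [["a0", "a1", "a2"], ["b0", "b1", "b2"], ["c0", "c1", "c2"]]
flowware = skewed_flowware(streams)
for t in range(5):
    print(t, items_at(flowware, t, len(streams)))
```

Each printed row is one time step and each column one port, so the table answers the three questions of the definition directly. Nothing in the schedule says where the streams come from or go to, which is the open "source and sink" question.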

Data streams source and sink: not my job

Not my Job!


[Figure: the DPA with its input data streams and output data streams; each data stream is sourced or sunk by an ASM at the array boundary (distributed memory).]

ASM: On-chip Auto-Sequencing Memory (RAM with GAG)

implemented by distributed on-chip memory


How the data are moved

DMA, vN move processor [Jack Lipovski, EUROMiCRO, Nice, 1975]

Henk Corporaal coins the term “transport-triggered”

Application-specific distributed memory [Catthoor et al.]

ASM uses GAG (generic address generator) [TU-KL publ.: Tokyo 1989 + NH journal]

by the way: GAG st…. by TI [TI patent 1995]

MoM: GAG-based storage scheme methodology [Herz*]

*) [see Michael Herz et al.: ICECS 2002 (Dubrovnik)]
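The idea of an address generator feeding a memory, so that data streams are produced without any instruction fetch, can be sketched in a few lines. This is an assumption-laden illustration, not the GAG of the cited TU-KL publications: the names `scan_addresses` and `read_stream` and the parameters (base address, row stride, window size) are hypothetical.

```python
def scan_addresses(base, row_stride, rows, cols, step=1):
    """Yield linear addresses for a row-by-row scan of a 2-D window.

    Hypothetical parameters: `base` is the address of the window's
    first element, `row_stride` the distance between rows in memory.
    """
    for r in range(rows):
        for c in range(0, cols, step):
            yield base + r * row_stride + c


def read_stream(memory, addresses):
    """Auto-sequencing memory sketch: turn an address sequence into a data stream."""
    for addr in addresses:
        yield memory[addr]


memory = list(range(100, 164))          # an 8 x 8 array stored row-major
window = scan_addresses(base=9, row_stride=8, rows=2, cols=3)
stream = list(read_stream(memory, window))
print(stream)
```

The scan pattern is fixed by a handful of parameters set before run time; at run time only data moves, which is the point of pairing a RAM with an address generator.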


The dual paradigm approach

von Neumann paradigm: Software Engineering (CPU)

Kress-Kung paradigm: Configware Engineering (ASM)

Mathematical Synthesis Methods

algebraic methods, i.e., linear projections

yields only uniform arrays with linear pipes

only for applications with regular data dependencies


Coarse-grained reconfigurable arrays are a Generalization of the Systolic Array [Rainer Kress]

.... discard algebraic synthesis methods

use optimization algorithms instead, for example: simulated annealing

the achievement: also non-linear and non-uniform pipes, and even wilder pipe structures possible

now reconfigurability really makes sense
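Replacing algebraic synthesis with an optimization algorithm can be illustrated by a generic simulated annealing placer. This is a minimal sketch under stated assumptions, not the mapper used for the KressArray: the function names, the Manhattan wire-length cost, the linear cooling schedule and all parameter values are choices made here for illustration.

```python
import math
import random


def wire_length(placement, nets):
    """Sum of Manhattan distances over all connected operator pairs."""
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in nets)


def anneal_placement(ops, nets, rows, cols, steps=5000, t0=5.0, seed=0):
    """Place operators on a rows x cols grid by simulated annealing."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(cells)
    placement = dict(zip(ops, cells))            # one operator per cell
    free = cells[len(ops):]                      # unused cells
    cost = wire_length(placement, nets)
    best, best_cost = dict(placement), cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9  # linear cooling
        op = rng.choice(ops)
        old = placement[op]
        if free and rng.random() < 0.5:          # move to a free cell
            i = rng.randrange(len(free))
            placement[op], free[i] = free[i], old
            undo = ("move", i)
        else:                                    # swap with another operator
            other = rng.choice(ops)
            placement[op], placement[other] = placement[other], old
            undo = ("swap", other)
        new_cost = wire_length(placement, nets)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost                      # accept the change
            if cost < best_cost:
                best, best_cost = dict(placement), cost
        elif undo[0] == "move":                  # reject: restore the free cell
            free[undo[1]], placement[op] = placement[op], old
        else:                                    # reject: swap back
            placement[undo[1]], placement[op] = placement[op], old
    return best, best_cost


ops = ["A", "B", "C", "D", "E"]
nets = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]   # a linear pipe
placement, cost = anneal_placement(ops, nets, rows=4, cols=4)
print(cost, placement)
```

Because the cost function is just a sum over arbitrary nets, nothing restricts the result to uniform arrays or linear pipes; any connection pattern can be fed in, which is what algebraic projection methods cannot offer.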


Coarse grain is about computing, not logic

Example: mapping onto rDPA by DPSS: based on simulated annealing

[Figure: SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]; array size: 10 x 16 = 160 rDPUs; rDPU, 32 bit; no CPU. Legend: rDPU not used; used for routing only (rout thru only); operator and routing; port location marker; backbus connect.]

tool: KressArray Xplorer: diss. Ulrich Nageldinger (downloadable)


Software / Configware Co-Compilation

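[Figure: co-compilation flow. C language source enters the Partitioner, supported by the Analyzer / Profiler and by Resource Parameters (supporting different platforms). The SW compiler produces SW code for the “vN" machine paradigm; the CW compiler produces CW Code and FW Code for the anti machine paradigm. Annotation: simulated annealing.]

[Juergen Becker’s CoDe-X, 1996]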

Software / Configware Co-Compilation: for thesis see book exhibit rack at library entrance [Juergen Becker’s CoDe-X, 1996]


Distributed Memory Parallelism Capability

[Figure: rDPA, array size example: 10 x 16, surrounded by many ASM blocks; NN ports interconnect layer; … backbus connect layers. Legend: rDPU not used; used for routing only; operator and routing; port location marker; backbus connect.]


Applications for coarse-grained arrays

with steady I/O data streams at constant speed (on-chip distributed memory for intermediate results):

Multi-standard world HDTV receiver

Wide variety of multimedia applications

Wide variety of real-time applications

Many other applications


The wrong mind set ....

„but you can‘t implement decisions!“

(remark of a high-ranked industrial research head – discussion after a talk by Ulrich Nageldinger – RAW Orlando)


a tiny section of the pipe network

[Figure: a few rDPUs of the pipe network, including an adder (+) with output S. Legend: rDPU not used; used for routing only; operator and routing; port location marker; backbus connect.]


The wrong mind set ....

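„but you can‘t implement decisions!“

section of a very large pipe network:

[Figure: decision inside the pipe network. Inputs A, B, R and condition C; depending on C (=1 or =0), either A or B is passed on to the adder (+) together with R.]

not knowing this solution: symptom of the hardware / software chasm and the configware / software chasm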


introducing hardware description languages (in the mid-seventies)

“The decision box becomes a (de)multiplexer”

This is so simple: why did it take decades to find out ?

The wrong mind set – the wrong road map!
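The statement that the decision box becomes a multiplexer can be illustrated in software terms: both candidate inputs are present as data, and the condition only selects between them. This is a minimal sketch, assuming integer operands and a one-bit condition; the names `mux` and `pipe_section` are introduced here for illustration, and the expression is the one used in the talk's branching example.

```python
def mux(c, a, b):
    """2-to-1 multiplexer: selects a when c == 1, b when c == 0.

    Written without a branch to mirror the hardware view, where
    both inputs are wired in and c only steers the selection.
    """
    return c * a + (1 - c) * b


def pipe_section(r, a, b, c):
    """S = R + (if C then A else B), as a multiplexer feeding an adder."""
    return r + mux(c, a, b)


# A data stream of (R, A, B, C) tuples flows through the same fixed structure.
stream = [(10, 1, 2, 1), (10, 1, 2, 0), (5, 7, 9, 1)]
print([pipe_section(*item) for item in stream])
```

The structure never changes between items; only the value on the select input differs, so no control flow has to be fetched or decoded per decision.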


hypothetical branching example to illustrate software-to-configware migration

S = R + (if C then A else B endif);

simple conservative CPU example (C = 1):

step                            memory cycles   nanoseconds
if C then read A
  read instruction              1               100
  instruction decoding
  read operand*                 1               100
  operate & reg. transfers
if not C then read B
  read instruction              1               100
  instruction decoding
add & store
  read instruction              1               100
  instruction decoding
  operate & reg. transfers
  store result                  1               100
total                           5               500

*) if no intermediate storage in register file

section of a major pipe network on rDPU (inputs A, B, R, C; output S): clock 200 MHz (5 nanosec), no memory cycles: speed-up factor = 100
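The arithmetic behind the speed-up factor can be checked as follows. The numbers are those of the example (5 memory cycles of 100 ns each on the CPU, a 200 MHz clock on the rDPU); treating the pipe network as delivering one result per clock period is an assumption of this sketch, and the constant names are introduced here.

```python
MEMORY_CYCLE_NS = 100      # duration of one memory cycle in the CPU example
CPU_MEMORY_CYCLES = 5      # 3 instruction reads + 1 operand read + 1 store
RDPU_CLOCK_MHZ = 200

cpu_time_ns = CPU_MEMORY_CYCLES * MEMORY_CYCLE_NS   # time on the CPU
rdpu_time_ns = 1000 / RDPU_CLOCK_MHZ                 # one clock period in ns
speed_up = cpu_time_ns / rdpu_time_ns

print(cpu_time_ns, rdpu_time_ns, speed_up)
```

The factor comes entirely from memory cycles that the pipe network does not need; instruction decoding and register transfers are not even counted on the CPU side.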


why the RC paradigm shift is so important

Move the stool or the grand piano?

by Software

by Configware


the data-stream-based approach has no von Neumann bottle-neck

… understand only this parallelism solution: the instruction-stream-based approach (von Neumann bottle-necks)

... cannot cope with this one


What does Reconfigurable Computing mean?

microprogramming?

switching the multiplexers?

concurrency of 64 or 256 CPUs on a single chip?

routing ALU result to a register?

it means using the Kress/Kung machine paradigm !

vN paradigm losing its dominance

http://bwrc.eecs.berkeley.edu/Research/RAMP/people.htm

RAMP project proposes: Run LINUX on FPGAs


Cray XD1

vN paradigm losing its dominance

Xilinx inside !

Xilinx FPGA

Recommended Pentium successor

Discard most caches

Have 64* cores with clever interconnect for: concurrent processes, multithreading, and Kung-Kress rDPA array

The Desk-top Supercomputer!


What does Reconfigurable Computing mean?

The key issue: which is the underlying paradigm?

Operation not based on instruction-streams at run time

No instruction fetch at run time

machine paradigm is data stream-based: Kress-Kung

Undergraduate education needs a dual paradigm approach: symbiosis of von Neumann / Kress-Kung


thank you


END


Backup for Discussion:

Term to be used for „soft hardware“

accelware, adaptware, adjustware, altware, alterware, arrangeware, changeware, conformware, doughware, fabricsware, fabrixware, fitware, flexware, formware, FPware, gateware, gateroutware, hpcware, LUTware, matchware, modiware, morphware®, morfware, mouldware, muxware, parware, paraware, passware, pathware, patchware, performware, perfware, perware, pipeware, platformware, railware, rangeware, RCware, ressourceware, routware, routeware, routingware, RTware, shapeware, shuntware, shuntingware, speedware, speedupware, suiteware, switchware, switchingware, streamware, structware, transferware, transware, variware, varyware, warpware, xferware, xware

send your proposal to:

unfortunately “Morphware” is trademarked

Compilation: Software vs. Configware

Software Engineering: source program (C, FORTRAN, MATLAB) → software compiler → software code

Configware Engineering: source „program“ → configware compiler → configware code (mapper: placement & routing) and flowware code (data scheduler)


Co-Compilation

Software / Configware Co-Compiler:

C, FORTRAN, MATLAB source → automatic SW / CW partitioner

software compiler → software code

configware compiler → configware code (mapper) and flowware code (data scheduler)


Why use Reconfigurable Computing

instead of software?

Exploit spatial parallelism, fine-grained parallelism when useful, and high bandwidth and low latency memory access

instead of spec. hardware?

Ride the technology curve, avoiding specific silicon

Adapt to change: standards, trends, …..

Reduce risk

Adapt to application / deployment requirements


Computing Curricula 2004 (2)

[Figure: the Computing Curricula 2004 volumes. Annotation: volume CE (Configware Engineering) missing.]


Computing Curricula 2004 (3)

2.2.1.


Computing Curricula 2004 (4)

2.2.1. … how it should be

morphware and configware added
