
Spatial Computation: Computing without General-Purpose Processors

Mihai Budiu

Carnegie Mellon University

July 8, 2004

2


Spatial Computation

A computation model based on:

• application-specific hardware

• no interpretation

• minimal resource sharing


3

The Engine Behind This Talk

main()
{
    signal(SIGINT, welcome);
    while (slides() && time()) {
        talk();
    }
}

4

Research Scope

Object: future architectures

Tool: compilers

Evaluation: simulators

5

Research Methodology

Constraint Space

state-of-the-art

X (e.g., power)

Y (e.g., cost)

“reasonable limits”

incremental evolution

new solutions

6

Outline

• Introduction: problems of current architectures

• Compiling Application-Specific Hardware

• Pipelining

• ASH Evaluation

• Conclusions

[Figure: performance (log scale, 1 to 1000) versus year, 1980 to 2000]

7

Resources

• We do not worry about not having hardware resources

• We worry about being able to use hardware resources

[Intel]

8

Design Complexity

[Figure: transistors (log scale, 10^4 to 10^10) versus year, 1981 to 2009; curves: chip size, designer productivity]

9

Communication vs. Computation

gate: 5 ps; wire: 20 ps

Power consumption on wires is also dominant

10

Power Consumption

Toasted CPU: about 2 sec after removing cooler.

(Tom’s Hardware Guide)

11

Energy Efficiency

ALUs

Pentium 4

12

Clock Speed

Cannot rely on global signals (clock is a global signal)

3GHz

6GHz

10GHz

13

Instruction-Set Architecture

Software

Hardware

ISA

VERY rigid to changes (e.g., x86 vs. Itanium)

14

Our Proposal

• ASH addresses these problems

• ASH is not a panacea

• ASH is “complementary” to the CPU

[Diagram: CPU (low-ILP computation + OS + VM), ASH (high-ILP computation), cache ($), memory]

15

Outline

• Problems of current architectures

• CASH: Compiling ASH
  – program representation
  – compiling C programs

• Pipelining

• ASH Evaluation

• Conclusions

16

Application-Specific Hardware

C program

Compiler

Dataflow IR

Reconfigurable/custom hw

SW

HW

ISA

HW backend

17

Application-Specific Hardware

C program

Compiler

Dataflow IR

CPU [predication]

SW backend

Soft

18

Key: Intermediate Representation

Traditionally

[Diagram: CFG with def-use and may-dep. annotations, ...]

Our IR

• SSA + predication + speculation

• Uniform for scalars and memory

• Explicitly encodes may-depend

• Executable

• Precise semantics

• Dataflow IR

• Close to asynchronous target

19

Computation = Dataflow

• Operations ⇒ functional units

• Variables ⇒ wires

• No interpretation

Programs

x = a & 7;
...
y = x >> 2;

Circuits

[Diagram: a and 7 feed an & unit; its output x and the constant 2 feed a >> unit]
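The mapping above can be sketched in C as follows. This is only an illustration, not CASH output: each function stands in for one dedicated functional unit, and each local variable stands in for a wire. The names `unit_and`, `unit_shr` and `circuit` are hypothetical.

```c
#include <stdint.h>

/* Each function stands in for one dedicated functional unit (assumed names). */
static uint32_t unit_and(uint32_t lhs, uint32_t rhs) { return lhs & rhs; }
static uint32_t unit_shr(uint32_t value, uint32_t amount) { return value >> amount; }

/* The "circuit": values flow directly from producer to consumer.
   There is no instruction fetch or decode step in between. */
uint32_t circuit(uint32_t a)
{
    uint32_t x = unit_and(a, 7u);  /* x = a & 7;  */
    uint32_t y = unit_shr(x, 2u);  /* y = x >> 2; */
    return y;
}
```

Calling `circuit(13)` computes `13 & 7 = 5` and then `5 >> 2 = 1`.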

20

Basic Computation

[Diagram: an adder (+) followed by a latch; signals: data, valid, ack]

21

Asynchronous Computation

[Diagram: adders (+) connected through a latch, animated in eight numbered steps (1 to 8); signals: data, valid, ack]

22

Distributed Control Logic

+ -

ack, rdy

global

FSM

asynchronous control

short, local wires

23

Outline

• Problems of current architectures

• CASH: Compiling ASH
  – program representation
  – compiling C programs

• Pipelining

• ASH Evaluation

• Conclusions

24

MUX: Forward Branches

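if (x > 0)
    y = -x;
else
    y = b*x;

[Diagram: x, b and 0 feed the *, - and > units; a multiplexer driven by the predicate and its negation (!) selects y; the critical path is marked]

Conditionals ⇒ Speculation

SSA = no arbitration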
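A rough C sketch of the idea on this slide: both arms of the conditional are computed speculatively, and a multiplexer picks one using the predicate. The helper `mux2` and the function `forward_branch` are hypothetical names, and the sketch assumes both arms are side-effect free and do not overflow.

```c
#include <stdbool.h>

/* Hypothetical 2-way multiplexer: selects one of two already-computed values. */
static int mux2(bool pred, int if_true, int if_false)
{
    return pred ? if_true : if_false;
}

/* Dataflow-style version of: if (x > 0) y = -x; else y = b*x; */
int forward_branch(int x, int b)
{
    bool p   = x > 0;   /* predicate */
    int neg  = -x;      /* "then" value, computed regardless of p */
    int prod = b * x;   /* "else" value, computed regardless of p */
    return mux2(p, neg, prod);
}
```

Because each value has a single producer (SSA), the multiplexer only selects; it never has to arbitrate between competing writers.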

25

Control Flow ⇒ Data Flow

[Diagram: three operators, Merge (label), Gateway, and Split (branch), with data inputs and a predicate p (and its negation, !)]

26

Loops

int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;

[Diagram: dataflow circuit with a loop for i (0, +1, < 100) and a loop for sum (0, *, +); the negated predicate (!) releases sum to ret]
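As a software analogy for the loop circuit, the sketch below keeps `i` and `sum` as circulating state and uses the comparator result to decide whether the values go around again or `sum` is released to the return. The names `loop_state`, `loop_step` and `run_loop` are hypothetical; this is not how CASH represents loops internally.

```c
/* Values circulating in the two loops of the circuit. */
struct loop_state { int i; int sum; };

/* One trip around the circuit. Returns 1 while the predicate steers the
   values back into the loop, 0 once it releases sum to "ret". */
static int loop_step(struct loop_state *s)
{
    int pred = s->i < 100;           /* comparator */
    if (!pred)
        return 0;
    s->sum = s->sum + s->i * s->i;   /* multiplier feeding sum's adder */
    s->i = s->i + 1;                 /* i's adder */
    return 1;
}

int run_loop(void)
{
    struct loop_state s = { 0, 0 };
    while (loop_step(&s))
        ;
    return s.sum;
}
```

`run_loop()` returns 328350, the sum of the squares of 0 through 99.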

27

Predication and Side-Effects

[Diagram: Load operator with inputs addr, pred and token, outputs data and token, and a connection to memory]

• no speculation

• sequencing of side-effects
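One way to picture the Load operator in C is shown below. The types `token_t` and `load_result_t` and the function `predicated_load` are hypothetical. The sketch assumes that memory is accessed only when the predicate is true, and that the output token is released either way once the input token has arrived, so that later side effects can proceed.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool ready; } token_t;   /* hypothetical ordering token */

typedef struct {
    int32_t data;       /* loaded value (0 if the load did not execute) */
    token_t token_out;  /* passed on to the next side-effecting operation */
} load_result_t;

load_result_t predicated_load(const int32_t *addr, bool pred, token_t token_in)
{
    load_result_t r = { 0, { false } };
    if (!token_in.ready)
        return r;               /* an earlier side effect has not completed */
    if (pred)
        r.data = *addr;         /* memory is touched only when pred is true */
    r.token_out.ready = true;   /* assumption: token is released either way */
    return r;
}
```

With a false predicate the address is never dereferenced, which is why a side-effecting operation does not need to be speculated.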

28

Memory Access

[Diagram: LD, ST and LD operations (local communication) reach a monolithic memory (global structures) through a pipelined arbitrated network]

Future work: fragment this!

29

CASH Optimizations

• SSA-based optimizations
  – unreachable/dead code, gcse, strength reduction, loop-invariant code motion, software pipelining, reassociation, algebraic simplifications, induction variable optimizations, loop unrolling, inlining

• Memory optimizations
  – dependence & alias analysis, register promotion, redundant load/store elimination, memory access pipelining, loop decoupling

• Boolean optimizations
  – Espresso CAD tool, bitwidth analysis

30

Outline

• Problems of current architectures

• Compiling ASH

• Pipelining

• Evaluation: CASH vs. clocked designs

• Conclusions

31

Pipelining

int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;

[Diagram, step 1: loop circuit with nodes i, +, <=, 100, 1 feeding a pipelined multiplier (8 stages) and the adder for sum]

32

Pipelining

[Diagram, step 2: loop circuit with nodes i, +, <=, 100, 1, *, +, sum]

33

Pipelining

[Diagram, step 3: loop circuit with nodes i, +, <=, 100, 1, *, +, sum]

34

Pipelining

[Diagram, step 4: loop circuit with nodes i, +, <=, 100, 1, *, +, sum]

35

Pipelining

[Diagram, step 5: loop circuit with nodes i, +, <=, 100, 1, +, sum; the values i=1 and i=0 are shown in flight]

36

Pipelining

[Diagram, step 6: loop circuit with nodes i, +, <=, 100, 1, *, +, sum; the values i=1 and i=0 are shown in flight]

37

Pipelining

[Diagram, step 7: loop circuit with nodes i, +, <=, 100, 1, *, +, sum; i's loop, sum's loop, the long-latency pipe and the predicate edge are marked]

38

Pipelining

Predicate ack edge is on the critical path.

[Diagram: loop circuit with nodes i, +, <=, 100, 1, *, +, sum; i's loop, sum's loop and the critical path are marked]

39

Pipeline balancing

[Diagram, step 7: loop circuit with nodes i, +, <=, 100, 1, *, +, sum; i's loop, sum's loop and a decoupling FIFO are marked]

40

Pipeline balancing

[Diagram: loop circuit with nodes i, +, <=, 100, 1, *, +, sum; i's loop, sum's loop, the decoupling FIFO and the new critical path are marked]
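The effect of the decoupling FIFO can be illustrated with a very simplified timing model, sketched below. It is not the CASH simulator, and every modeling choice is an assumption: i's loop can issue one iteration per step, the multiplier is fully pipelined with a fixed latency, sum's loop consumes at most one iteration per step, and the predicate edge can hold `fifo_depth` unconsumed values (1 models a plain latch). The name `loop_steps` is hypothetical.

```c
#define MAX_ITER 1024

static long max2(long a, long b) { return a > b ? a : b; }

/* Returns the step at which the last iteration leaves sum's loop,
   or -1 if the arguments are outside what this sketch supports. */
long loop_steps(int iterations, int mul_latency, int fifo_depth)
{
    static long issued[MAX_ITER];    /* step at which i's loop issues iteration k */
    static long consumed[MAX_ITER];  /* step at which sum's loop consumes it */

    if (iterations < 1 || iterations > MAX_ITER ||
        mul_latency < 1 || fifo_depth < 1)
        return -1;

    for (int k = 0; k < iterations; k++) {
        /* i's loop issues at most one iteration per step... */
        long t = (k > 0) ? issued[k - 1] + 1 : 0;
        /* ...and stalls until the predicate edge has a free slot (the ack). */
        if (k >= fifo_depth)
            t = max2(t, consumed[k - fifo_depth]);
        issued[k] = t;

        /* sum's loop needs the product and must have finished iteration k-1. */
        long product_ready = t + mul_latency;
        long sum_free = (k > 0) ? consumed[k - 1] + 1 : 0;
        consumed[k] = max2(product_ready, sum_free);
    }
    return consumed[iterations - 1];
}
```

Under these assumptions, 100 iterations with an 8-stage multiplier take 800 steps when the predicate edge holds a single value and 107 steps with an 8-entry FIFO, which is the imbalance the slide attributes to the predicate ack edge.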

41

Outline

• Problems of current architectures

• Compiling ASH

• Pipelining

• Evaluation: CASH vs. clocked designs

• Conclusions

42

Evaluating ASH

[Tool flow: C → CASH core → Verilog back-end → Synopsys, Cadence P/R → ASIC]

• Input: Mediabench kernels (1 hot function/benchmark)

• Target: 180nm std. cell library, 2V (~1999 technology)

• ModelSim (Verilog simulation), Mem → performance numbers

43

ASH Area

[Figure: bar chart of area in square mm (axis 0 to 8) per benchmark (adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e), split into mem access and datapath. Annotations: P4: 217; normalized area; minimal RISC core]

44

ASH vs 600MHz CPU [.18 µm]

[Figure: bar chart, times slower (axis 0 to 4), per benchmark (adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e) and avg. Bar labels as extracted: 1.08, 1.61, 0.45, 0.45, 2.19, 1.17, 1.73, 1.62, 1.91, 1.65, 3.76, 3.51, 1.48]

45

Bottleneck: Memory Protocol

[Diagram: LD and ST operations, LSQ, Memory]

• Token release to dependents: requires round-trip to memory.

• Limit study: round trip zero time ⇒ up to 6x speed-up.

• Exploring protocol for in-order data delivery & fast token release.

46

Power

[Figure: bar chart of power in mW (axis 0 to 30) per benchmark (adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e). Reference points: DSP 110; µP 4000; Xeon [+cache] 67000]

47

Energy Efficiency

[Figure: energy efficiency in operations/nJ (log scale, 0.01 to 1000) for general-purpose DSP, dedicated hardware, ASH media kernels, asynchronous µP, microprocessors and FPGAs; a 1000x gap is marked]

48

Outline

Problems of current architectures

+ Compiling ASH

+ Pipelining

+ ASH Evaluation

= Future/related work & conclusions

49

Related Work

• Nanotechnology

• Dataflow machines

• High-level synthesis

• Reconfigurable computing

• Computer architecture

• Embedded systems

• Asynchronous circuits

• Compilation

50

Future Work

• Optimizations for area/speed/power

• Memory partitioning

• Concurrency

• Compiler-guided layout

• Explore extensible ISAs

• Hybridization with superscalar mechanisms

• Reconfigurable hardware support for ASH

• Formal verification

51

How far can you go?

Grand Vision: Certified Circuit Generation

• Translation validation: input ≡ output

• Preserve input properties
  – e.g., C programs cannot deadlock
  – e.g., type-safe programs cannot crash

• Debug, test, verify only at source-level

HLL → IR → IRopt → Verilog → gates → layout

formally validated

52

Conclusions

Spatial computation strengths

Feature                  Advantages
No interpretation        Energy efficiency, speed
Spatial layout           Short wires, no contention
Asynchronous             Low power, scalable
Distributed              No global signals
Automatic compilation    Design productivity, no ISA

53

Backup Slides

• Reconfigurable hardware

• Critical paths

• Control logic

• ASH vs ...

• ASH weaknesses

• Exceptions

• Normalized area

• Why C?

• Splitting memory

• More performance

• Recursive calls

54

Reconfigurable Hardware

Universal gates

and/or

storage elements

Interconnectionnetwork

Programmable switches

55

Main RH Ingredient: RAM Cell

• Switch controlled by a 1-bit RAM cell (data in, control)

• Universal gate = RAM

[Diagram: a RAM with contents 0001, addressed by a0 and a1; the data output is labeled a1 & a2]
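The "universal gate = RAM" idea can be sketched as a 2-input lookup table in C. The function name `lut2` and the bit layout of `config` are assumptions made for this example, not a description of any particular FPGA.

```c
#include <stdint.h>

/* A 2-input "universal gate": a 4-bit RAM addressed by the two inputs.
   Bit (2*a1 + a0) of config holds the output for that input pair. */
unsigned lut2(uint8_t config, unsigned a1, unsigned a0)
{
    unsigned index = ((a1 & 1u) << 1) | (a0 & 1u);
    return (config >> index) & 1u;
}
```

With this layout, `config = 0x8` stores 0, 0, 0, 1 at addresses 0 through 3 and behaves as AND; `0x6` behaves as XOR and `0xE` as OR. Changing the gate means rewriting four RAM bits, not rewiring anything.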

56

Critical Paths

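if (x > 0)
    y = -x;
else
    y = b*x;

[Diagram: x, b and 0 feed the *, - and > units; a multiplexer driven by the predicate and its negation (!) selects y]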

57

Lenient Operations

if (x > 0)
    y = -x;
else
    y = b*x;

[Diagram: x, b and 0 feed the *, - and > units; a multiplexer driven by the predicate and its negation (!) selects y]

Solves the problem of unbalanced paths

Asynchronous Control

[Diagram: control logic around a register (Reg), with elements labeled = and C; signals: rdyin, ackout, rdyout, ackin, datain, dataout]

59

HLL to HW

• High-level synthesis: behavioral HDL → synchronous hardware

• Reconfigurable computing: C [subsets] → hardware configuration (spatial computation)

• Asynchronous circuits: concurrent language → asynchronous hardware

Legend: prior work, this research

60

CASH vs High-Level Synthesis

• CASH: the only existing tool to translate complete ANSI C to hardware

• CASH generates asynchronous circuits

• CASH does not treat C as an HDL
  – no annotations required
  – no reactivity model
  – does not handle non-C, e.g., concurrency

61

ASH Weaknesses

• Low efficiency for low-ILP code

• Does not adapt at runtime

• Monolithic memory

• Resource waste

• Not flexible

• No support for exceptions

62

ASH Weaknesses (2)

• Both branch and join not free

• Static dataflow (no re-issue of same instr)

• Memory is “far”

• Fully static
  – No branch prediction
  – No dynamic unrolling
  – No register renaming

• Calls/returns not lenient

63

Branch Prediction

for (i=0; i < N; i++) {
    ...
    if (exception) break;
}

[Diagram: loop circuit for i (+, 1, <) combined (&) with the negated (!) exception predicate; the ASH crit path and the CPU crit path are marked]

• Predicted taken.

• Predicted not taken. Effectively a noop for CPU!

• result available before inputs

64

Exceptions

• Strictly speaking, C has no exceptions

• In practice hard to accommodate exceptions in hardware implementations

• An advantage of software flexibility: PC is single point of execution control

[Diagram: CPU (low-ILP computation + OS + VM + exceptions), ASH (high-ILP computation), cache ($$$), memory]

65

Why C

• Huge installed base

• Embedded specifications written in C

• Small and simple language
  – Can leverage existing tools
  – Simpler compiler

• Techniques generally applicable

• Not a toy language

66

Performance

[Figure: bar chart of megaoperations per second (axis 0 to 5000) per benchmark (adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e) and avg; series: MOPSall, MOPSspec, MOPS]

67

Parallelism Profile

[Figure: bar chart (axis 0 to 25) per benchmark (adpcm_d, adpcm_e, epic_d, epic_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mesa, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, rasta); series: CPU, ASH; one extracted label: 4]

68

Normalized Area

[Figure: chart per benchmark (adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e) and avg, with two axes (0 to 120 and 0 to 2.5); series: lines/sq mm, sq mm/kbyte]

69

Memory Partitioning

• MIT RAW project: Babb FCCM ‘99, Barua HiPC ‘00, Lee ASPLOS ‘00

• Stanford SpC: Semeria DAC ‘01, TVLSI ‘02

• Berkeley CCured: Necula POPL ‘02

• Illinois FlexRAM: Fraguela PPoPP ‘03

• Hand-annotations #pragma

70

Memory Complexity

[Diagram: LSQ in front of a RAM; signals: addr, data]

71

Recursion

[Diagram: around a recursive call, live values are saved to a stack before the call and restored from it afterwards]
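The save/restore pattern around a recursive call can be illustrated in software with an explicit stack. The sketch below is only an analogy for the slide, not the circuit CASH generates: `factorial_explicit_stack` and `STACK_DEPTH` are hypothetical names, and the single live value saved at each call is `n`.

```c
#define STACK_DEPTH 64

/* Factorial written with an explicit stack in place of recursion.
   Returns 0 if the explicit stack would overflow. */
unsigned long factorial_explicit_stack(unsigned n)
{
    unsigned stack[STACK_DEPTH];
    int top = 0;
    unsigned long result = 1;

    /* "Call" phase: save the live value n before each recursive call. */
    while (n > 1) {
        if (top == STACK_DEPTH)
            return 0;
        stack[top++] = n;
        n = n - 1;
    }

    /* "Return" phase: restore each saved value and finish its multiply. */
    while (top > 0)
        result *= stack[--top];

    return result;
}
```

For example, `factorial_explicit_stack(5)` saves 5, 4, 3, 2 on the way down and multiplies them back in on the way up, giving 120.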

72

Me?
