Mihai Budiu
Microsoft Research – Silicon Valley
joint work with
Girish Venkataramani, Tiberiu Chelcea, Seth Copen Goldstein
Carnegie Mellon University
Spatial Computation
Computing without General-Purpose Processors
May 10, 2005
2
Outline
• Intro: Problems of current architectures
• Compiling Application-Specific Hardware
• ASH Evaluation
• Conclusions
[Figure: performance (log scale, 1 to 1000) versus year, 1980 to 2000]
3
Resources
• We do not worry about not having hardware resources
• We worry about being able to use hardware resources
[Intel]
4
Complexity
[Figure: chip size versus designer productivity, 1981 to 2009, log scale from 10^4 to 10^10]
ALUs
Cannot rely on global signals (clock is a global signal)
Gate: 5 ps; wire: 20 ps
5
Complexity
Automatic translation: C → HW
Simple, short, unidirectional interconnect
No interpretation
Distributed control, asynchronous
Simple hw, mostly idle
6
Our Proposal: Application-Specific Hardware
• ASH addresses these problems
• ASH is not a panacea
• ASH “complementary” to CPU
[Figure: CPU and ASH sharing memory through a cache ($); labels: high-ILP computation; low-ILP computation + OS + VM]
7
Outline
• Problems of current architectures
• CASH: Compiling Application-Specific Hardware
• ASH Evaluation
• Conclusions
8
Application-Specific Hardware
C program → Compiler → Dataflow IR → HW backend → Reconfigurable/custom hw
9
Computation Dataflow
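Program:
x = a & 7;
...
y = x >> 2;
IR: a and 7 feed an & node; its result x and the constant 2 feed a >> node
Circuits: an "& 7" stage feeding a ">> 2" stage
No interpretation

Program      IR              Circuits
Operations   Nodes           Pipeline stages
Variables    Def-use edges   Channels (wires)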
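The mapping above (operations become nodes, variables become def-use edges) can be sketched in software. This is only an illustrative model, not the CASH IR: the `Node` type, the `NodeKind` names and the functions `eval` and `run_example` are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical node kinds for a tiny dataflow IR (names are illustrative). */
typedef enum { NODE_INPUT, NODE_CONST, NODE_AND, NODE_SHR } NodeKind;

typedef struct Node {
    NodeKind kind;
    int value;                /* payload for NODE_INPUT and NODE_CONST */
    const struct Node *lhs;   /* def-use edge from the left operand */
    const struct Node *rhs;   /* def-use edge from the right operand */
} Node;

/* Evaluate a node by pulling values along its def-use edges. */
int eval(const Node *n) {
    switch (n->kind) {
    case NODE_INPUT:
    case NODE_CONST: return n->value;
    case NODE_AND:   return eval(n->lhs) & eval(n->rhs);
    case NODE_SHR:   return eval(n->lhs) >> eval(n->rhs);
    }
    return 0; /* not reached for well-formed graphs */
}

/* Build the graph for "x = a & 7; y = x >> 2;" and return y. */
int run_example(int a) {
    Node in_a = { NODE_INPUT, a, NULL, NULL };
    Node c7   = { NODE_CONST, 7, NULL, NULL };
    Node c2   = { NODE_CONST, 2, NULL, NULL };
    Node x    = { NODE_AND, 0, &in_a, &c7 };  /* x = a & 7 */
    Node y    = { NODE_SHR, 0, &x, &c2 };     /* y = x >> 2 */
    return eval(&y);
}
```

In the hardware version each node would be its own pipeline stage and each pointer a physical channel, so nothing interprets the graph at run time; the recursive `eval` here only stands in for that wiring.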
10
Basic Computation = Pipeline Stage
[Figure: an adder (+) with a latch; the channel signals are data, valid and ack]
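A rough software model of one such stage may help in reading the data/valid/ack signals. It assumes a simple protocol in which a stage fires only when both inputs are valid and its latch is free, and the consumer's ack frees the latch; the type `AddStage` and both function names are made up for illustration and are not taken from the talk.

```c
#include <stdbool.h>

/* One asynchronous pipeline stage: an adder with an output latch.
   Field and function names are illustrative assumptions. */
typedef struct {
    int  latch;   /* value held in the output latch */
    bool valid;   /* true while the latch holds an unconsumed result */
} AddStage;

/* Producer side: fire only if both inputs are valid and the latch is free.
   Returns true when the inputs were consumed (ack sent upstream). */
bool stage_fire(AddStage *s, bool a_valid, int a, bool b_valid, int b) {
    if (!a_valid || !b_valid || s->valid)
        return false;        /* wait: missing input or consumer not done */
    s->latch = a + b;
    s->valid = true;         /* raise valid toward the consumer */
    return true;             /* ack both producers */
}

/* Consumer side: take the result and acknowledge, freeing the latch. */
bool stage_consume(AddStage *s, int *out) {
    if (!s->valid)
        return false;        /* nothing to consume yet */
    *out = s->latch;
    s->valid = false;        /* ack received: latch may be overwritten */
    return true;
}
```

A second `stage_fire` before `stage_consume` returns false, which is the back-pressure that the ack wire provides in the circuit.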
11
Asynchronous Computation
[Figure: eight-step animation of the data/valid/ack handshake around an adder (+) and its latch]
12
Distributed Control Logic
[Figure: + and − stages connected by rdy and ack signals; labels: global FSM; short, local wires]
13
MUX: Forward Branches
if (x > 0) y = -x;
else y = b*x;
[Figure: dataflow graph with inputs x, b and 0, operators *, -, > and !, and a MUX producing y; the critical path is marked]
Conditionals ⇒ Speculation
SSA = no arbitration
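The idea that a forward branch becomes a MUX with both sides computed speculatively can be written out in C. This is a sketch of the transformation on the slide's own example; the function name `select_y` and the temporaries are illustrative only.

```c
/* Both sides of the conditional are computed unconditionally (speculation);
   the predicate only selects which result becomes y. */
int select_y(int x, int b) {
    int neg  = -x;       /* "then" side: y = -x  */
    int prod = b * x;    /* "else" side: y = b*x */
    int pred = (x > 0);  /* predicate */
    return pred ? neg : prod;  /* MUX chooses one of the two values */
}
```

Because each side has its own temporary, as in SSA form, the two computations never compete for a shared destination; only the final selection depends on the predicate.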
14
Control Flow ⇒ Data Flow
[Figure: three node types: Merge (label); Gateway, with data and predicate inputs; Split (branch), steered by a predicate p]
15
Loops
int sum=0, i;
for (i=0; i < 100; i++)
  sum += i*i;
return sum;
[Figure: dataflow circuit for the loop, with an i loop (+1, < 100, initial value 0), a sum loop (*, +, initial value 0) and a ret node]
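As a reading aid, the loop circuit can be mimicked by a C function in which the two loop-carried values circulate and a predicate decides whether they go around again or leave through the return. This is a sequential sketch under that assumption, not the circuit itself; the function name is hypothetical.

```c
/* Software model of the loop circuit: i and sum are the two loop-carried
   values, and the predicate (i < 100) steers them back or out. */
int sum_of_squares_dataflow(void) {
    int i = 0;     /* initial token entering i's loop   */
    int sum = 0;   /* initial token entering sum's loop */
    for (;;) {
        int pred = (i < 100);  /* predicate computed from i */
        if (!pred)
            return sum;        /* ret: sum leaves the loop */
        int sq = i * i;        /* multiplier */
        sum = sum + sq;        /* sum's loop */
        i = i + 1;             /* i's loop   */
    }
}
```

In the circuit the i loop and the sum loop are separate rings that can run at different rates, which is what the pipelining slides go on to examine.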
16
Pipelining
int sum=0, i;
for (i=0; i < 100; i++)
  sum += i*i;
return sum;
[Figure, step 1: loop circuit with i, + 1, <= 100, a pipelined multiplier (8 stages), + and sum]
17
Pipelining
[Figure, step 2: same loop circuit (i, + 1, <= 100, *, +, sum)]
18
Pipelining
[Figure, step 3: same loop circuit (i, + 1, <= 100, *, +, sum)]
19
Pipelining
[Figure, step 4: same loop circuit (i, + 1, <= 100, *, +, sum)]
20
Pipelining
[Figure, step 5: same loop circuit, with tokens labeled i=1 and i=0 in flight]
21
Pipelining
[Figure, step 6: same loop circuit, with tokens labeled i=1 and i=0 in flight]
22
Pipelining
[Figure, step 7: same loop circuit, with labels for i’s loop, sum’s loop, the long latency pipe and the predicate]
23
Pipelining
Predicate ack edge is on the critical path.
[Figure: loop circuit with i’s loop, sum’s loop and the critical path marked]
24
Pipeline balancing
[Figure, step 7: loop circuit with i’s loop, sum’s loop and a decoupling FIFO]
25
Pipeline balancing
[Figure: loop circuit with i’s loop, sum’s loop, a decoupling FIFO and the critical path marked]
26
Procedures
[Figure: caller and callee; labels: Call, Argument, Return, Continuation]
27
Memory Access
[Figure: LD, ST and LD operations reach a monolithic memory through a pipelined, arbitrated network]
Local communication, global structures
Future work: fragment this!
28
Outline
• Problems of current architectures
• Compiling ASH
• ASH Evaluation
• Conclusions
29
Evaluating ASH
Tool flow: C → CASH core → Verilog back-end → Synopsys, Cadence P/R (commercial tools) → ASIC
Input: Mediabench kernels (1 hot function/benchmark)
Target: 180 nm std. cell library, 2 V (~1999 technology)
Performance numbers: ModelSim (Verilog simulation)
[Figure also shows a Mem block]
30
Compile Time
Tool flow: C → CASH core → Verilog back-end → Synopsys, Cadence P/R → ASIC
[Figure: the same flow annotated with: 200 lines; 20 seconds; 10 seconds; 20 minutes; 1 hour; a Mem block is also shown]
31
ASH Area (mm²)
Reference points: P4: 217; minimal RISC core
[Figure: bar chart of area (sq mm, axis 0 to 4.5), split into Memory access and Circuit, for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e]
32
ASH vs 600MHz CPU [4-wide OOO, .18 μm]
[Figure: bar chart of "times faster" (axis 0.00 to 2.50) for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e and average; bar labels as extracted: 2.40, 1.37, 1.79, 1.98, 0.74, 1.65, 0.56, 1.34, 0.80, 1.06, 0.43, 0.44, 1.05]
33
Bottleneck: Memory Protocol
[Figure: LD and ST operations access Memory through an LSQ]
• Enabling dependent operations requires round-trip to memory.
• Exploring novel memory access protocols.
34
Power (mW)
Reference points: DSP: 110; μP: 4000; Xeon [+cache]: 67000
[Figure: bar chart of power (mW, axis 0 to 70) for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e and average; labels as extracted: 70, 29, 10, 10, 19, 38, 46, 26, 23, 30, 22, 22, 25]
35
Energy-delay
[Figure: bar chart of energy-delay, "times better than superscalar" (log axis, 1 to 10000), for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e and average; bar labels as extracted: 363, 285, 1524, 1788, 147, 389, 36, 437, 171, 174, 48, 50, 227]
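For readers unfamiliar with the metric, energy-delay is commonly taken to be the product of the energy used and the time taken, so a design that is both faster and lower power gains on both factors. The sketch below assumes that standard definition; the function names are hypothetical and no values from the talk are used.

```c
/* Energy-delay product E*t for a run with average power power_w (watts)
   and execution time time_s (seconds). Since E = P*t, E*t = P*t*t. */
double energy_delay(double power_w, double time_s) {
    double energy_j = power_w * time_s;  /* energy in joules */
    return energy_j * time_s;
}

/* How many times better (smaller) design A's E*t is than design B's. */
double energy_delay_gain(double pa, double ta, double pb, double tb) {
    return energy_delay(pb, tb) / energy_delay(pa, ta);
}
```

With made-up inputs, a design that draws a quarter of the power and finishes in half the time comes out 16 times better, because the time ratio enters squared.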
36
Energy Efficiency (op/nJ)
[Figure: bar chart of energy efficiency in Operations/nJ (non-speculative arithmetic; axis 0 to 160) for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e; bar labels as extracted: 57, 66, 143, 143, 52, 51, 39, 62, 55, 40, 28, 28]
37
Energy Efficiency
[Figure: energy efficiency in Operations/nJ on a log scale from 0.01 to 1000, comparing microprocessors, asynchronous μP, general-purpose DSP, FPGA, dedicated hardware and ASH media kernels; a 1000x gap is marked]
38
Outline
Problems of current architectures
+ Compiling ASH
+ Evaluation
= Related work, Conclusions
39
Bibliography
• Dataflow: A Complement to Superscalar. Mihai Budiu, Pedro Artigas, and Seth Copen Goldstein. ISPASS 2005.
• Spatial Computation. Mihai Budiu, Girish Venkataramani, Tiberiu Chelcea, and Seth Copen Goldstein. ASPLOS 2004.
• C to Asynchronous Dataflow Circuits: An End-to-End Toolflow. Girish Venkataramani, Mihai Budiu, Tiberiu Chelcea, and Seth Copen Goldstein. IWLS 2004.
• Optimizing Memory Accesses For Spatial Computation. Mihai Budiu and Seth Copen Goldstein. CGO 2003.
• Compiling Application-Specific Hardware. Mihai Budiu and Seth Copen Goldstein. FPL 2002.
40
Related Work
• Optimizing compilers
• High-level synthesis
• Reconfigurable computing
• Dataflow machines
• Asynchronous circuits
• Spatial computation
We target an extreme point in the design space: no interpretation, fully distributed computation and control
41
ASH Design Point
• Design an ASIC in a day
• Fully automatic synthesis to layout
• Fully distributed control and computation (spatial computation)
  – Replicate computation to simplify wires
• Energy/op rivals custom ASIC
• Performance rivals superscalar
• E×t 100 times better than any processor
42
Conclusions
Spatial computation strengths

Feature                 Advantages
No interpretation       Energy efficiency, speed
Spatial layout          Short wires, no contention
Asynchronous            Low power, scalable
Distributed             No global signals
Automatic compilation   Designer productivity
43
Backup Slides
• Absolute performance
• Control logic
• Exceptions
• Leniency
• Normalized area
• ASH weaknesses
• Splitting memory
• Recursive calls
• Leakage
• Why not compare to…
• Targeting FPGAs
44
Absolute Performance
[Figure: bar chart of millions of operations per second (axis 0 to 6000) for adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, with series MOPSall, MOPSspec and MOPS; one label reads 12300; the CPU range is marked]
Pipeline Stage
[Figure: pipeline stage with signals rdyin, ackout, rdyout, ackin, datain and dataout; blocks labeled =, Reg and C]
46
Exceptions
• Strictly speaking, C has no exceptions
• In practice hard to accommodate exceptions in hardware implementations
• An advantage of software flexibility: PC is single point of execution control
[Figure: CPU and ASH sharing memory ($$$); labels: high-ILP computation; low-ILP computation + OS + VM + exceptions]
47
Critical Paths
if (x > 0) y = -x;
else y = b*x;
[Figure: dataflow graph with inputs x, b and 0, operators *, -, > and !, and a MUX producing y]
48
Lenient Operations
if (x > 0) y = -x;
else y = b*x;
[Figure: dataflow graph with inputs x, b and 0, operators *, -, > and !, and a MUX producing y]
Solves the problem of unbalanced paths
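One way to see why leniency helps with unbalanced paths is to compare when a MUX output becomes available. The sketch assumes a lenient MUX may fire as soon as the predicate and the selected input have arrived, while a strict MUX waits for every input; arrival times are in arbitrary units and all names are illustrative.

```c
/* Larger of two arrival times. */
static int max2(int a, int b) { return a > b ? a : b; }

/* Strict MUX: waits for the predicate and for both data inputs. */
int strict_mux_ready(int t_pred, int t_then, int t_else) {
    return max2(t_pred, max2(t_then, t_else));
}

/* Lenient MUX: needs only the predicate and the input it selects.
   pred is nonzero when the "then" input is selected. */
int lenient_mux_ready(int pred, int t_pred, int t_then, int t_else) {
    return max2(t_pred, pred ? t_then : t_else);
}
```

In the example on the slide, when x > 0 the result -x would not have to wait for the slower multiply on the other side of the conditional.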
49
Normalized Area
[Figure: per-benchmark chart (adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, average) of source lines/sq mm (axis 0 to 250) and object code KBytes/sq mm (axis 0 to 6)]
50
ASH Weaknesses
• Both branch and join not free
• Static dataflow (no re-issue of same instr)
• Memory is “far”
• Fully static
  – No branch prediction
  – No dynamic unrolling
  – No register renaming
• Calls/returns not lenient
51
Branch Prediction
for (i=0; i < N; i++) {
  ...
  if (exception) break;
}
[Figure: loop circuit (i, + 1, <, &, !, exception) with the ASH and CPU critical paths marked; annotations: "Predicted not taken. Effectively a noop for CPU!" and "Predicted taken."]
result available before inputs
52
Memory Partitioning
• MIT RAW project: Babb FCCM ‘99, Barua HiPC ‘00, Lee ASPLOS ‘00
• Stanford SpC: Semeria DAC ‘01, TVLSI ‘02
• Illinois FlexRAM: Fraguela PPoPP ‘03
• Hand-annotations #pragma
53
Recursion
[Figure: around a recursive call, live values are saved to the stack and then restored]
54
Leakage Power
P_s = k · Area · e^(−V_T)
• Employ circuit-level techniques
• Cut power supply of idle circuit portions
  – most of the circuit is idle most of the time
  – strong locality of activity
55
Why Not Compare To…
• In-order processor
  – Worse in all metrics than superscalar, except power
  – We beat it in all metrics, including performance
• DSP
  – We expect roughly the same results as for superscalar (Wattch maintains high IPC for these kernels)
• ASIC
  – No available tool-flow supports C to the same degree
• Asynchronous ASIC
  – We compared with a Balsa synthesis system
  – We are 15 times better in E×t compared to resulting ASIC
• Async processor
  – We are 350 times better in E×t than Amulet (scaled to .18)
56
Why not target FPGA
• Do not support asynchronous circuits
• Very inefficient in area, power, delay
• Too fine-grained for datapath circuits
• We are designing an async FPGA