Spatial Computation
Computing without General-Purpose Processors
Mihai Budiu
Carnegie Mellon University
July 8, 2004
2
Spatial Computation
A computation model based on:
• application-specific hardware
• no interpretation
• minimal resource sharing
3
The Engine Behind This Talk
main()
{
    signal(SIGINT, welcome);
    while (slides() && time()) {
        talk();
    }
}
4
Research Scope
Object: future architectures
Tool: compilers
Evaluation: simulators
5
Research Methodology
Constraint Space
state-of-the-art
X (e.g., power)
Y (e.g., cost)
“reasonable limits”
incremental evolution
new solutions
6
Outline
• Introduction: problems of current architectures
• Compiling Application-Specific Hardware
• Pipelining
• ASH Evaluation
• Conclusions
[Figure: performance on a log scale (1 to 1000) by year, 1980 to 2000]
7
Resources
• We do not worry about not having hardware resources
• We worry about being able to use hardware resources
[Intel]
8
Design Complexity
[Figure: transistors per chip on a log scale (10^4 to 10^10) by year, 1981 to 2009; curves: chip size, designer productivity]
9
Communication vs. Computation
gate: 5ps
wire: 20ps
Power consumption on wires is also dominant
10
Power Consumption
Toasted CPU: about 2 sec after removing cooler.
(Tom’s Hardware Guide)
11
Energy Efficiency
ALUs
Pentium 4
12
Clock Speed
Cannot rely on global signals (clock is a global signal)
3GHz
6GHz
10GHz
13
Instruction-Set Architecture
Software
Hardware
ISA
VERY rigid to changes (e.g. x86 vs Itanium)
14
Our Proposal
• ASH addresses these problems
• ASH is not a panacea
• ASH is “complementary” to the CPU
[Figure: CPU (low-ILP computation + OS + VM) and ASH (high-ILP computation), connected to cache ($) and memory]
15
Outline
• Problems of current architectures
• CASH: Compiling ASH
– program representation
– compiling C programs
• Pipelining
• ASH Evaluation
• Conclusions
16
Application-Specific Hardware
C program
Compiler
Dataflow IR
Reconfigurable/custom hw
SW
HW
ISA
HW backend
17
Application-Specific Hardware
C program
Compiler
Dataflow IR
CPU [predication]
SW backend
Soft
18
Key: Intermediate Representation
Traditionally
• CFG
• def-use
• may-dep.
• ...
Our IR
• SSA + predication + speculation
• Uniform for scalars and memory
• Explicitly encodes may-depend
• Executable
• Precise semantics
• Dataflow IR
• Close to asynchronous target
19
Computation = Dataflow
• Operations ⇒ functional units
• Variables ⇒ wires
• No interpretation
Programs
x = a & 7;
...
y = x >> 2;
Circuits
[Figure: an & unit with inputs a and 7 drives wire x into a >> unit whose other input is 2]
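A minimal software sketch of the mapping on this slide, assuming each operation becomes its own function (standing in for a dedicated functional unit) and each variable becomes a value passed directly between them (standing in for a wire). The names and_unit, shr_unit, and circuit are hypothetical and do not come from the talk.

```c
#include <assert.h>
#include <stdint.h>

/* One function per operation: each stands in for a dedicated functional unit. */
static uint32_t and_unit(uint32_t a, uint32_t b)
{
    return a & b;
}

static uint32_t shr_unit(uint32_t a, uint32_t n)
{
    assert(n < 32); /* shifting by the full width or more is undefined in C */
    return a >> n;
}

/* The "circuit" for:  x = a & 7;  y = x >> 2;
 * x is only a wire from and_unit to shr_unit; nothing fetches or decodes
 * an instruction to decide what happens next. */
static uint32_t circuit(uint32_t a)
{
    uint32_t x = and_unit(a, 7);
    return shr_unit(x, 2);
}
```

Calling circuit(13) follows the two-operator path once: 13 & 7 is 5, and 5 >> 2 is 1.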
20
Basic Computation
+data
valid
ack
latch
21
Asynchronous Computation
[Figure: animation in eight steps (1 to 8) of a + operator exchanging data, valid, and ack signals with its latch]
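The data/valid/ack handshake on these slides can be approximated in software. The sketch below is a simplified model under stated assumptions, not the circuit from the talk: a latch is a value plus a flag that plays the role of valid, and clearing the flag plays the role of ack. The names struct latch, adder_fire, and latch_take are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Output latch of one operator; `full` models the valid signal. */
struct latch {
    int  data;
    bool full;
};

/* The adder fires only when both inputs are valid and its own output
 * latch is free. Firing consumes both inputs, which models the ack
 * sent back to the producers. Returns true if the adder fired. */
static bool adder_fire(struct latch *a, struct latch *b, struct latch *out)
{
    assert(a && b && out);
    if (!a->full || !b->full || out->full)
        return false;
    out->data = a->data + b->data;
    out->full = true;   /* raise valid towards the consumer */
    a->full = false;    /* ack the first producer */
    b->full = false;    /* ack the second producer */
    return true;
}

/* The consumer takes the value and acks, which frees the latch. */
static bool latch_take(struct latch *l, int *value)
{
    assert(l && value);
    if (!l->full)
        return false;
    *value = l->data;
    l->full = false;
    return true;
}
```

No clock appears anywhere in this model: an operator proceeds as soon as its neighbours are ready, which is the property the slides illustrate.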
22
Distributed Control Logic
+ -
ackrdy
global
FSM
asynchronous control
short, local wires
23
Outline
• Problems of current architectures
• CASH: Compiling ASH
– program representation
– compiling C programs
• Pipelining
• ASH Evaluation
• Conclusions
24
MUX: Forward Branches
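if (x > 0) y = -x;
else y = b*x;
[Figure: x feeds a negation (-), a multiply by b (*), and a comparison with 0 (>); the predicate and its inverse (!) select y in a MUX; the critical path is marked]
Conditionals ⇒ Speculation
SSA = no arbitration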
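A sketch of what "conditionals ⇒ speculation" means for the conditional on this slide, written as ordinary C. Both arms are computed unconditionally and the predicate only steers the final selection, as the MUX does in the circuit. The function name mux_select is hypothetical, and the INT_MIN check is an addition of this sketch to keep the negation well defined.

```c
#include <assert.h>
#include <limits.h>

/* if (x > 0) y = -x; else y = b*x;  with both arms speculated. */
static int mux_select(int x, int b)
{
    assert(x != INT_MIN);   /* -x would overflow; not part of the slide */

    int p      = x > 0;     /* predicate */
    int then_v = -x;        /* speculated "then" arm */
    int else_v = b * x;     /* speculated "else" arm */

    return p ? then_v : else_v;  /* the MUX */
}
```

Because neither arm has a side effect, computing both is safe, and the result is ready as soon as the predicate and the selected arm are.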
25
Control Flow ⇒ Data Flow
[Figure: three operators: Merge (label); Gateway, with data and predicate inputs; Split (branch), steering data by predicate p and its inverse (!)]
26
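Loops
int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
[Figure: dataflow graph of the loop: i starts at 0 and circulates through +1 and < 100; sum starts at 0 and accumulates i*i; the inverted predicate (!) releases sum to ret]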
27
Predication and Side-Effects
[Figure: a Load operator with inputs addr, pred, and token, outputs data and token, and a connection to memory]
• no speculation
• sequencing of side-effects
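A software sketch of the Load operator on this slide, under two assumptions that the slide does not spell out: the operator waits for its input token before doing anything, and when the predicate is false it skips the memory access but still passes the token on. The names struct load_result and pred_load are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct load_result {
    int  data;       /* meaningful only when the predicate was true */
    bool token_out;  /* releases the next side-effect in program order */
};

static struct load_result pred_load(const int *mem, size_t addr,
                                    bool pred, bool token_in)
{
    struct load_result r = { 0, false };
    assert(mem != NULL);

    if (!token_in)
        return r;            /* an earlier side-effect has not finished */
    if (pred)
        r.data = mem[addr];  /* memory is touched only if pred is true */
    r.token_out = true;      /* assumed: the token moves on either way */
    return r;
}
```

The predicate keeps the load from being speculated, and the token keeps it ordered against other memory operations.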
28
Memory Access
[Figure: LD, ST, and LD operators (local communication) reach a monolithic memory (global structures) through a pipelined arbitrated network]
Future work: fragment this!
29
CASH Optimizations
• SSA-based optimizations
– unreachable/dead code, gcse, strength reduction, loop-invariant code motion, software pipelining, reassociation, algebraic simplifications, induction variable optimizations, loop unrolling, inlining
• Memory optimizations
– dependence & alias analysis, register promotion, redundant load/store elimination, memory access pipelining, loop decoupling
• Boolean optimizations
– Espresso CAD tool, bitwidth analysis
30
Outline
• Problems of current architectures
• Compiling ASH
• Pipelining
• Evaluation: CASH vs. clocked designs
• Conclusions
31
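Pipelining
int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
[Figure, step 1: dataflow graph of the loop with operators +, <=, *, and +, constants 100 and 1, and values i and sum; the multiplier is pipelined (8 stages)]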
32
Pipelining
[Figure, step 2: same dataflow graph]
33
Pipelining
[Figure, step 3: same dataflow graph]
34
Pipelining
[Figure, step 4: same dataflow graph]
35
Pipelining
[Figure, step 5: same dataflow graph; labels i=1 and i=0 appear at the multiplier]
36
Pipelining
[Figure, step 6: same dataflow graph; labels i=1 and i=0 appear at the multiplier]
37
Pipelining
[Figure, step 7: same dataflow graph, with i’s loop, sum’s loop, the long-latency pipe, and the predicate marked]
38
Pipelining
Predicate ack edge is on the critical path.
[Figure: same dataflow graph, with i’s loop, sum’s loop, and the critical path marked]
39
Pipeline balancing
[Figure, step 7: same dataflow graph, with i’s loop, sum’s loop, and a decoupling FIFO marked]
40
Pipeline balancing
[Figure: same dataflow graph, with i’s loop, sum’s loop, the decoupling FIFO, and the critical path marked]
41
Outline
• Problems of current architectures
• Compiling ASH
• Pipelining
• Evaluation: CASH vs. clocked designs
• Conclusions
42
Evaluating ASH
[Figure: tool flow from C through the CASH core, the Verilog back-end, and Synopsys/Cadence P/R to an ASIC, with a Mem block]
• Mediabench kernels (1 hot function/benchmark)
• 180nm std. cell library, 2V (~1999 technology)
• ModelSim (Verilog simulation) for performance numbers
43
ASH Area
[Figure: bar chart of area in square mm (0 to 8) per benchmark: adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e; bars split into Mem access and Datapath; annotations: normalized area, minimal RISC core, P4: 217]
44
ASH vs 600MHz CPU [.18 μm]
[Figure: bar chart, y-axis “Times slower” (0 to 4). Bar labels: adpcm_d 1.08, adpcm_e 1.61, g721_d 0.45, g721_e 0.45, gsm_d 2.19, gsm_e 1.17, jpeg_d 1.73, jpeg_e 1.62, mpeg2_d 1.91, mpeg2_e 1.65, pegwit_d 3.76, pegwit_e 3.51, avg 1.48]
45
Bottleneck: Memory Protocol
[Figure: LD and ST operators connected through an LSQ to Memory]
• Token release to dependents: requires round-trip to memory.
• Limit study: round trip zero time ⇒ up to 6x speed-up.
• Exploring protocol for in-order data delivery & fast token release.
46
Power
[Figure: bar chart of power in mW (0 to 30) per benchmark: adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e; reference points: DSP 110, μP 4000, Xeon [+cache] 67000]
47
Energy Efficiency
[Figure: energy efficiency in operations/nJ on a log scale from 0.01 to 1000, comparing microprocessors, asynchronous μP, general-purpose DSP, FPGAs, ASH media kernels, and dedicated hardware; a 1000x gap is marked]
48
Outline
Problems of current architectures
+ Compiling ASH
+ Pipelining
+ ASH Evaluation
= Future/related work & conclusions
49
Related Work
Nanotechnology
Dataflow machines
High-level synthesis
Reconfigurable computing
Computer architecture
Embedded systems
Asynchronous circuits
Compilation
50
Future Work
• Optimizations for area/speed/power
• Memory partitioning
• Concurrency
• Compiler-guided layout
• Explore extensible ISAs
• Hybridization with superscalar mechanisms
• Reconfigurable hardware support for ASH
• Formal verification
51
Grand Vision: Certified Circuit Generation
• Translation validation: input ≡ output
• Preserve input properties
– e.g., C programs cannot deadlock
– e.g., type-safe programs cannot crash
• Debug, test, verify only at source-level
[Figure: HLL → IR → IRopt → Verilog → gates → layout, with a “formally validated” span and the question “How far can you go?”]
52
Conclusions
Spatial computation strengths
Feature                 Advantages
No interpretation       Energy efficiency, speed
Spatial layout          Short wires, no contention
Asynchronous            Low power, scalable
Distributed             No global signals
Automatic compilation   Design productivity, no ISA
53
Backup Slides
• Reconfigurable hardware
• Critical paths
• Control logic
• ASH vs ...
• ASH weaknesses
• Exceptions
• Normalized area
• Why C?
• Splitting memory
• More performance
• Recursive calls
54
Reconfigurable Hardware
Universal gates
and/or
storage elements
Interconnection network
Programmable switches
55
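Main RH Ingredient: RAM Cell
• Switch controlled by a 1-bit RAM cell
• Universal gate = RAM
[Figure: a RAM holding 0001, addressed by a0 and a1, with output labeled data = a1 & a2; a switch passing “data in” under a 1-bit control cell holding 0]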
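The "universal gate = RAM" idea can be sketched as a lookup table: a 2-input gate is a 4-bit RAM addressed by its inputs, and changing the stored bits changes the gate. This is an illustrative model only; the name lut2 and the LUT_* constants are hypothetical, and the bit ordering (input pattern (a1 << 1) | a0 selects the table bit) is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Truth tables: bit i of the table is the output for pattern i. */
enum {
    LUT_AND = 0x8,  /* only pattern 11 gives 1 (the slide's 0001 column) */
    LUT_OR  = 0xE,  /* patterns 01, 10, 11 give 1 */
    LUT_XOR = 0x6   /* patterns 01, 10 give 1 */
};

/* A 2-input universal gate: the inputs form a RAM address and the
 * stored bit at that address is the gate's output. */
static unsigned lut2(uint8_t table, unsigned a0, unsigned a1)
{
    assert(a0 <= 1 && a1 <= 1);
    return (table >> ((a1 << 1) | a0)) & 1u;
}
```

Reprogramming the hardware then amounts to writing different constants into the table, with no change to the lookup logic itself.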
56
Critical Paths
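if (x > 0) y = -x;
else y = b*x;
[Figure: dataflow graph of the conditional: x and b feed *, x feeds - and a comparison with 0 (>); the predicate and its inverse (!) select y]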
57
Lenient Operations
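if (x > 0) y = -x;
else y = b*x;
[Figure: dataflow graph of the conditional: x and b feed *, x feeds - and a comparison with 0 (>); the predicate and its inverse (!) select y]
Solves the problem of unbalanced paths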
Asynchronous Control
[Figure: control logic around an = operator, with a register (Reg) and a block labeled C; signals rdyin, ackout, rdyout, ackin, datain, dataout]
59
HLL to HW
High-level Synthesis
Behavioral HDL
Synchronous Hardware
Reconfigurable Computing
C [subsets]
Hardware configuration
(spatial computation)
Asynchronous circuits
Concurrent Language
Asynchronous Hardware
Prior work
This research
60
CASH vs High-Level Synthesis
• CASH: the only existing tool to translate complete ANSI C to hardware
• CASH generates asynchronous circuits
• CASH does not treat C as an HDL
– no annotations required
– no reactivity model
– does not handle non-C, e.g., concurrency
61
ASH Weaknesses
• Low efficiency for low-ILP code
• Does not adapt at runtime
• Monolithic memory
• Resource waste
• Not flexible
• No support for exceptions
62
ASH Weaknesses (2)
• Both branch and join not free
• Static dataflow (no re-issue of same instr)
• Memory is “far”
• Fully static
– No branch prediction
– No dynamic unrolling
– No register renaming
• Calls/returns not lenient
63
Branch Prediction
for (i=0; i < N; i++) {
    ...
    if (exception) break;
}
[Figure: dataflow graph of the loop with i, +, <, 1, &, !, and exception; annotations: “Predicted not taken. Effectively a noop for CPU!”, “Predicted taken.”, “result available before inputs”; the ASH crit path and CPU crit path are marked]
64
Exceptions
• Strictly speaking, C has no exceptions
• In practice hard to accommodate exceptions in hardware implementations
• An advantage of software flexibility: PC is single point of execution control
[Figure: CPU (low-ILP computation + OS + VM + exceptions) and ASH (high-ILP computation), connected to cache ($$$) and memory]
65
Why C
• Huge installed base
• Embedded specifications written in C
• Small and simple language
– Can leverage existing tools
– Simpler compiler
• Techniques generally applicable
• Not a toy language
66
Performance
[Figure: bar chart of mega-operations per second (0 to 5000) per benchmark: adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, avg; series: MOPSall, MOPSspec, MOPS]
67
Parallelism Profile
[Figure: bar chart (0 to 25) per benchmark: adpcm_d, adpcm_e, epic_d, epic_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mesa, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, rasta; series: CPU, ASH; a level of 4 is marked]
68
Normalized Area
[Figure: chart per benchmark: adpcm_d, adpcm_e, g721_d, g721_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d, mpeg2_e, pegwit_d, pegwit_e, avg; two series, Lines/sq mm and sq mm/kbyte, on two axes (0 to 120 and 0 to 2.5)]
69
Memory Partitioning
• MIT RAW project: Babb FCCM ‘99, Barua HiPC ‘00, Lee ASPLOS ‘00
• Stanford SpC: Semeria DAC ‘01, TVLSI ‘02
• Berkeley CCured: Necula POPL ‘02
• Illinois FlexRAM: Fraguela PPoPP ‘03
• Hand-annotations #pragma
70
Memory Complexity
[Figure: an LSQ in front of a RAM with addr and data ports]
71
Recursion
recursive call
save live values
restore live values
stack
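The save/restore pattern on this slide can be sketched in software by running a recursive function with an explicit stack instead of the C call stack. This is only an illustration under assumptions: the function name factorial_explicit_stack, the choice of factorial, and the fixed stack depth are hypothetical and do not come from the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Computes n! the way f(n) = n * f(n - 1) would run if every recursive
 * call had to save its live value (n) on a stack and restore it on
 * return. */
static uint64_t factorial_explicit_stack(unsigned n)
{
    unsigned saved[32];  /* the stack of saved live values */
    size_t sp = 0;

    assert(n <= 20);     /* 20! is the largest factorial that fits in 64 bits */

    while (n > 1) {
        saved[sp++] = n; /* save the live value before the recursive call */
        n--;             /* the "call" with argument n - 1 */
    }

    uint64_t result = 1; /* the base case returns 1 */
    while (sp > 0)
        result *= saved[--sp]; /* on return: restore n, finish n * f(n - 1) */
    return result;
}
```

The single loop body is reused for every level of recursion, and the stack is the only place where per-call state lives.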
72
Me?