Computing Without Processors
Thesis Proposal
Mihai Budiu, July 30, 2001
Thesis Committee: Seth Goldstein (chair), Todd Mowry, Peter Lee, Babak Falsafi (ECE), Nevin Heintze (Agere Systems)
This presentation uses TeXPoint by George Necula
2
Four Types of Research
• Solve nonexistent problems
• Solve past problems
• Solve current problems
• Solve future problems
3
The Law
(source: Intel)
4
The Crossover Phenomenon
[Figure: technology plotted against time.]
5
Example Crossover
[Figure: access speed (ns) over time for DRAM and CPU; labels: 200, 1980, "no caches", "caches".]
Trouble Ahead for Microarchitecture
7
Signal Propagation
[Figure: die size and distance covered in 1 clock (mm) over time; labels: 20, now.]
8
Reliability & Yield
[Figure: defects/chip over time; curves: tolerable, occurring; markers: new process, now.]
9
Energy
[Figure: power over time; curves: CPU consumption, thermal dissipation; markers: 100 W, now.]
10
Instruction-Level Parallelism (ILP)
[Figure: instructions over time; curves: fetch, commit; marker: now.]
11
Premises of this Research
• We will have lots of gates
  – Moore’s law continues
  – Nanotechnology
• Contemporary architectures do not scale
12
Outline
• Motivation
• ASH: Application-Specific Hardware
• The spatial model of computation
• CASH: Compiling for ASH
• Evolutionary path
• Conclusions
• Future work
13
ASH Application-Specific Hardware
Reconfigurable hardware
HLL program
Compiler
Circuit
14
ASH: A Scalable Architecture
-- Thesis Statement --
Application-specific hardware on a reconfigurable-hardware substrate is a solution for the smooth evolution of computer architecture.
We can provide scalable compilers for translating high-level languages into hardware.
15
Example
int f(void)
{
    int i = 0, j = 0;
    for (; i < 10; i++)
        j += i;
    return j;
}
16
Outline
• Motivation
• ASH: Application-Specific Hardware
• The spatial model of computation
• CASH: Compiling for ASH
• Evolutionary path
• Conclusions
• Future work
17
ASH and Nanotechnology
• Build reconfigurable hardware using nanotechnology
• Low power: 10^10 gates use less than 2 W
• Low cost: nanocents/gate
• High density: 10^5x over CMOS
[Figure: huge structures; a nano-RAM cell, with a CMOS RAM cell shown in yellow.]
18
A Limit Study of Performance
A graph of the whole program execution:
[Figure legend: memory word, basic block, memory write, memory read, control-flow transfer.]
19
Typical Program Graph (g721_e)
[Figure: program graph with labeled features: control flow transfer, memory reads, a 100% memory cluster, a 100% code cluster, and memcpy.]
20
Program Graph After Inlining memcpy
memcpy
21
Application Slowdown
[Figure: application slowdown (times slower than native), y-axis from -1 to 11, for two settings: 1 clock/square and 5 clocks/square.]
22
How Time Is Spent
[Figure: stacked bars (0% to 100%) showing the percent of time per benchmark, split into idle, execution, control flow, and register traffic. Benchmarks: 099.go, 129.compress, 130.li, 132.ijpeg, adpcm_d, adpcm_e, epic_e, g721_Q_d, g721_Q_e, gsm_d, gsm_e, jpeg_d, jpeg_e, mpeg2_d.]
No caches: reads expensive
No speculation
23
Lesson
The spatial model of computation has different properties.
24
Outline
• Motivation
• ASH: Application-Specific Hardware
• The spatial model of computation
• CASH: Compiling for ASH
• Evolutionary path
• Future work
25
CASH: Compiling for ASH
Memory partitioning
Interconnection net
Program to circuits
26
Compilation
1. Program
int reverse(int x)
{
    unsigned k, r = 0, u = x;
    for (k = 0; k < 32; k++) {
        r = (r << 1) | (u & 1);  /* shift first so the last bit is kept */
        u >>= 1;
    }
    return r;
}
2. Split-phase Abstract Machines
   Computations & local storage
   Unknown latency ops.
3. Configurations placed independently
4. Placement on chip
Reliability
27
Split-phase Abstract Machines
[Figure: a CFG partitioned into SAM 1, SAM 2, and SAM 3.]
Power
28
Hyperblock => SAM
• Single-entry, multiple exit
• May contain loops
29
SAM => FSM
[Figure: FSM with states Start, Loop, and two Exit states, connected to remote memory and local memory.]
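As a rough illustration of the SAM => FSM step, the loop from the earlier `f()` example can be written as an explicit state machine. This is a minimal sketch, not the compiler's actual output: the state names are borrowed from the figure labels, and the function name `f_fsm` and the one-transition-per-iteration encoding are assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical state names, taken from the labels in the figure. */
enum state { START, LOOP, EXIT };

/* f() from the earlier example, written as an explicit state machine.
   Each pass through the while loop models one FSM transition. */
int f_fsm(void)
{
    enum state s = START;
    int i = 0, j = 0;

    while (s != EXIT) {
        switch (s) {
        case START:            /* initialize the loop-carried registers */
            i = 0;
            j = 0;
            s = LOOP;
            break;
        case LOOP:             /* one loop iteration per visit */
            if (i < 10) {
                j += i;
                i++;
            } else {
                s = EXIT;
            }
            break;
        case EXIT:             /* not reached: the while condition stops first */
            break;
        }
    }
    assert(i == 10);           /* the loop ran to completion */
    return j;
}
```

Calling `f_fsm()` yields the same value as the original `f()`, the sum of 0 through 9.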
30
Implementing SAMs
- interesting details -
31
The SAM FSM
[Figure: the SAM FSM as combinational logic split into computation and predicates (control), with a register, start and exit signals, and args and results ports.]
32
Computation = Dataflow
• Variables => wires + tokens
• No token store; no token matching
• Local communication only
Programs:
x = a & 7;
...
y = x >> 2;
Circuits:
[Figure: a and 7 feed an & node producing x; x and 2 feed a >> node. Wires carry signals.]
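The "variables => wires + tokens" idea can be mimicked in software. The sketch below is only an illustration under stated assumptions: the `struct wire` type, the operator functions, and the name `circuit` are hypothetical, and a single valid bit stands in for a token. Each operator fires only when both of its inputs carry a token, so no token store or matching logic is needed.

```c
#include <assert.h>
#include <stdbool.h>

/* A wire carries a value plus a token (valid bit) saying the value is ready. */
struct wire {
    int  data;
    bool valid;
};

/* An operator fires only when both inputs carry a token. */
static struct wire op_and(struct wire a, struct wire b)
{
    struct wire out = { 0, false };
    if (a.valid && b.valid) {
        out.data = a.data & b.data;
        out.valid = true;
    }
    return out;
}

static struct wire op_shr(struct wire a, struct wire b)
{
    struct wire out = { 0, false };
    if (a.valid && b.valid) {
        assert(b.data >= 0 && b.data < 32);  /* keep the shift well defined */
        out.data = a.data >> b.data;
        out.valid = true;
    }
    return out;
}

static struct wire constant(int v)
{
    struct wire w = { v, true };   /* constants are always available */
    return w;
}

/* x = a & 7; y = x >> 2; as a two-node circuit. */
struct wire circuit(struct wire a)
{
    struct wire x = op_and(a, constant(7));
    struct wire y = op_shr(x, constant(2));
    return y;
}
```

If the input wire carries no token, neither operator fires and the output stays invalid, which is how absent data propagates without any central control.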
33
Tokens & Synchronization
• Tokens signal operation completion
• Possible implementations:
  – Local: data, valid, ack
  – Global: data, valid, reset
  – Static: data valid
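Speculation and Eager Muxes
if (x > 0) y = -x;
else y = b*x;
[Figure: dataflow circuit computing y from x, b, and 0. Both branches are computed, including the slow multiplication (*), and a mux driven by predicates selects y. Regions are labeled "Computation" and "Predicates".]
Static-Single Assignment implemented in hardware
ILP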
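One way to read "eager mux" is a mux that produces its output as soon as the predicate and the selected input are available, without waiting for the unselected input. The sketch below models that reading; it is an assumption about the intended behavior, and the names `struct token`, `eager_mux`, `compute_y`, and the `mul_done` flag are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

struct token { int data; bool valid; };

/* Eager mux: needs only the predicate and the selected input to be ready. */
struct token eager_mux(struct token pred, struct token if_true,
                       struct token if_false)
{
    struct token out = { 0, false };
    if (!pred.valid)
        return out;
    struct token chosen = pred.data ? if_true : if_false;
    if (chosen.valid)
        out = chosen;
    return out;
}

/* y for: if (x > 0) y = -x; else y = b*x;
   mul_done models whether the slow multiplier has finished. */
struct token compute_y(int x, int b, bool mul_done)
{
    assert(x > INT_MIN);                        /* keep -x well defined */
    struct token pred = { x > 0, true };
    struct token neg  = { -x, true };           /* fast side */
    struct token mul  = { mul_done ? b * x : 0, mul_done };  /* slow side */
    return eager_mux(pred, neg, mul);
}
```

With `x = 3` the result is available even while the multiplier is still busy; with `x = -2` the output stays invalid until `mul_done` is true.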
35
Predicates
• Guard side-effects
  – Memory access
  – Procedure calls
• Control looping
• Decide exit branch
• Select variable definition
[Figure: the store *q = 2; guarded by a predicate; two definitions x=... merging into one use ...=x.]
36
Computing Predicates
• Correct for irreducible graphs
• Correct even when speculatively computed
• Can be eagerly computed
[Figure: graph with nodes s, t, and b.]
37
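Loops + Dataflow = Pipelining
for (i=0; i < 10; i++)
    a[i] += i;
[Figure: dataflow circuit for the loop: i starts at 0 and is incremented by 1; the address is formed from &a[0] and i; a load, an add, and a store operate on successive elements a[0], a[1], a[2], a[3] in a pipelined fashion.]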
38
Outline
• Motivation
• ASH: Application-Specific Hardware
• The spatial model of computation
• CASH: Compiling for ASH
• Evolutionary path
• Conclusions
• Future work
39
Evolutionary Path
Microprocessors ASH
The problem with ASH: Resources
40
Virtualization
41
CPU+ASH
core computation
support computation + OS + VM
CPU ASH
Memory
42
Outline
• Motivation
• ASH: Application-Specific Hardware
• The spatial model of computation
• CASH: Compiling for ASH
• Evolutionary path
• Conclusions
• Future work
43
ASH Benefits
Problem       Solution
Reliability   Configuration around defects
Power         Only “useful” gates switching
Signals       Localized computation
ILP           Statically extracted
44
Scalable Performance
[Figure: performance over time; curves: CPU, ASH; marker: now.]
45
Summary
• Contemporary CPU architecture faces lots of problems
• Application-Specific Hardware (ASH) provides a scalable technology
• Compiling HLL into hardware dataflow machines is an effective solution
46
Timeline
[Figure: timeline with marks at 06/01, 09/01, 12/01, 04/02, 06/02, 09/02, 12/02, and "now". Tasks shown: CASH core; memory partitioning; loop parallelization; ASH simulation; cost models; hw/sw partitioning (ASH + CPU); explore architectural/compiler trade-offs; write thesis.]
47
Extras
• Related work
• Reconfigurable hardware
• Other cross-over phenomena
• A CPU + ASH study
• More about predicates
48
Related Work
• Hardware synthesis from HLL
• Reconfigurable hardware
• Predicated execution
• Dataflow machines
• Speculative execution
• Predicated SSA
49
Reconfigurable Hardware
Universal gates and/or storage elements
Interconnection network
Programmable Switches
50
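Main RH Ingredient: RAM Cell
• Switch controlled by a 1-bit RAM cell
• Universal gate = RAM
[Figure: a RAM with contents 0001 addressed by inputs a0 and a1, with output labeled "data" and "a1 & a2"; a switch with "data in" and "control" driven by a RAM cell holding 0.]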
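"Universal gate = RAM" is the lookup-table idea: a small RAM addressed by the gate inputs can implement any Boolean function of those inputs. The sketch below is a minimal software model of a 2-input lookup table. The function name `lut2`, the bit ordering, and the configuration constants are assumptions chosen for illustration; they are not taken from the slides.

```c
#include <assert.h>

/* A 2-input universal gate: a 4-bit RAM addressed by the inputs.
   Bit i of `config` is the output for address i = (a1 << 1) | a0. */
int lut2(unsigned config, int a0, int a1)
{
    assert(config < 16u);
    assert((a0 == 0 || a0 == 1) && (a1 == 0 || a1 == 1));
    unsigned addr = ((unsigned)a1 << 1) | (unsigned)a0;
    return (int)((config >> addr) & 1u);
}

/* Hypothetical configurations under the bit ordering above:
   AND is 1 only at address 3, OR at addresses 1..3, XOR at 1 and 2. */
enum { LUT_AND = 0x8, LUT_OR = 0xE, LUT_XOR = 0x6 };
```

Reprogramming the same cell from `LUT_AND` to `LUT_XOR` changes the gate's function without changing any wiring, which is what makes the RAM cell the main ingredient of reconfigurable hardware.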
51
Reconfigurable Computing
• Back to ENIAC-style computation
• Synthesize one machine to solve one problem
52
Efficiency
[Figure: hardware resources over time, split into idle and used; marker: now.]
53
Manufacturing Cost
[Figure: cost over time; labels: cost, affordable; markers: $3x10^9, now.]
54
Complexity
[Figure: transistors over time; curves: manageable, available; axis marks at 10^8, 10^9, 10^10; marker: now.]
55
CAD Tools
[Figure: manual interventions over time; curves: feasible, necessary; marker: now.]
56
ASH Benefits
Problem       Solution
Reliability   Configuration around defects
Power         Only “useful” gates switching
Signals       Localized computation
ILP           Statically extracted
Complexity    Hierarchy of abstractions
CAD           Compiler + local place & route
Efficiency    Circuit customized to application
Cost          No masks, no physics, same substrate
Performance   Scalable
57
CPU+ASH Study
• Reconfigurable functional unit on processor pipeline
• Adapted SimpleScalar 3.0
• ASH & CPU use the same memory hierarchy (incl. L1)
• ASH can access CPU registers
• CPU pipeline interlocked with ASH
• Results pending
58
Simplifying Predicates
• Shared implementations
• Control equivalence
a
b
c
59
Deep Speculation
if (p)
    if (q) x = a;
    else x = b;
else x = c;
[Figure: a mux producing x from inputs a, b, and c, selected by the predicates p&q, p&!q, and !p respectively.]
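The nested conditional can be flattened into a single mux with one predicate per input. The sketch below compares the two forms. It is an illustration only: the function names `x_nested` and `x_mux` are hypothetical, and the AND-OR masking is one plausible way to model a decoded mux in software.

```c
#include <assert.h>
#include <stdbool.h>

/* Reference: the nested conditional. */
int x_nested(bool p, bool q, int a, int b, int c)
{
    int x;
    if (p) {
        if (q) x = a; else x = b;
    } else {
        x = c;
    }
    return x;
}

/* Decoded mux: one predicate per input, exactly one of which is true. */
int x_mux(bool p, bool q, int a, int b, int c)
{
    bool sel_a = p && q;
    bool sel_b = p && !q;
    bool sel_c = !p;
    assert(sel_a + sel_b + sel_c == 1);   /* predicates are mutually exclusive */
    /* -(int)sel is all ones when sel is true and zero otherwise, so each
       term passes its input only when selected (AND-OR mux structure). */
    return (a & -(int)sel_a) | (b & -(int)sel_b) | (c & -(int)sel_c);
}
```

Because the three predicates are mutually exclusive, all three inputs can be computed speculatively and in parallel, and the mux needs no priority ordering among them.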
60
Predicates & Tokens
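• Predicated tokens
• Eliminate speculation
• Eliminate wires
[Figure: the store *q = 2 with inputs q, x, and ~x, gated by separate "ready" and "safe" signals; an alternative in which the two are combined into a single "ready & safe" signal; and a predicate P combined with P_ready into P & P_ready.]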