![Page 1: Dataflow: A Complement to Superscalar Mihai Budiu – Microsoft Research Pedro V. Artigas – Carnegie Mellon University Seth Copen Goldstein – Carnegie Mellon](https://reader036.vdocuments.us/reader036/viewer/2022062404/5515130955034673228b4a9c/html5/thumbnails/1.jpg)
Dataflow: A Complement to Superscalar
Mihai Budiu – Microsoft Research
Pedro V. Artigas – Carnegie Mellon University
Seth Copen Goldstein – Carnegie Mellon University
2005
Computer Architecture: A Simplified History
[Figure: timeline from 1967 to 2005, with 1990 marked, showing the superscalar and dataflow lines of work.]
This Work
• Re-evaluate dataflow
  – Same workloads as superscalar (C programs: Mediabench, Spec)
  – Modern performance analysis tool (whole-program critical path)
• Use of superscalar mechanisms in dataflow
Why Study Dataflow
• Naturally exploit ILP
• Potentially very high ILP
• Simple, regular microarchitecture
• Very low power [1/1000 superscalar]
• Suitable for stream processing
Outline
• Motivation
• ASH: A Static Dataflow Model
• Explaining bottlenecks
• Conclusions
Application-Specific Hardware
[Figure: flow from C program through Compiler to Dataflow IR.]
Computation Dataflow
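Program:
x = a & 7;
...
y = x >> 2;
[Figure: in the IR, a and 7 feed an & node that produces x; x and 2 feed a >> node. In the circuit, a enters an "&7" stage followed by a ">>2" stage.]
Program      IR              Circuits
Operations   Nodes           Pipeline stages
Variables    Def-use edges   Channels (wires)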
Pure dataflow: no program counter
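The mapping above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the ASH hardware or the CASH compiler's IR: the `Channel`, `Node`, `try_fire`, and `eval_graph` names are hypothetical, and each channel is assumed to hold at most one value. It shows that evaluation order comes from data availability rather than from a program counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one value slot per def-use edge (a "channel"). */
typedef struct {
    bool valid;   /* true when the channel currently holds a value */
    int  value;
} Channel;

typedef enum { OP_AND, OP_SHR } Op;

/* A node may fire when both inputs are valid and its output is empty. */
typedef struct {
    Op       op;
    Channel *in0, *in1, *out;
} Node;

static bool try_fire(Node *n) {
    if (!n->in0->valid || !n->in1->valid || n->out->valid)
        return false;
    int a = n->in0->value, b = n->in1->value;
    n->out->value = (n->op == OP_AND) ? (a & b) : (a >> b);
    n->out->valid = true;
    n->in0->valid = false;   /* consume both input values */
    n->in1->valid = false;
    return true;
}

/* Evaluates x = a & 7; y = x >> 2 by firing whichever node is ready. */
int eval_graph(int a) {
    Channel ca = {true, a}, c7 = {true, 7}, c2 = {true, 2};
    Channel cx = {false, 0}, cy = {false, 0};
    /* Listed in reverse on purpose: there is no program order to follow. */
    Node nodes[] = {
        {OP_SHR, &cx, &c2, &cy},
        {OP_AND, &ca, &c7, &cx},
    };
    bool progress = true;
    while (progress) {
        progress = false;
        for (size_t k = 0; k < sizeof nodes / sizeof nodes[0]; k++)
            progress |= try_fire(&nodes[k]);
    }
    assert(cy.valid);   /* the graph must have produced y */
    return cy.value;
}
```

Calling `eval_graph(29)` computes `(29 & 7) >> 2`; the `>>` node is listed first but cannot fire until the `&` node has produced x.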
Basic Computation = Pipeline Stage
[Figure: a + operation followed by an output latch, connected to its neighbors by data, valid, and ack signals.]
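A rough software analogy for such a stage is sketched below. It assumes a one-entry output latch and models valid and ack as function return values; the `AddStage` type and both function names are hypothetical and are not taken from the talk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical adder stage: the output latch holds at most one result. */
typedef struct {
    bool valid;  /* "valid": the latch holds a result not yet consumed */
    int  data;   /* "data": the latched sum */
} AddStage;

/* Producer side: returns true (an ack to the producers) if the operands
 * were accepted, false if the latch is still full and they must wait. */
bool add_stage_offer(AddStage *s, int a, int b) {
    assert(s != 0);
    if (s->valid)
        return false;
    s->data = a + b;
    s->valid = true;
    return true;
}

/* Consumer side: returns true and copies the result if one is available.
 * Taking the result plays the role of the ack that frees the latch. */
bool add_stage_take(AddStage *s, int *out) {
    assert(s != 0 && out != 0);
    if (!s->valid)
        return false;
    *out = s->data;
    s->valid = false;
    return true;
}
```

In this model a second `add_stage_offer` fails until the consumer has taken the first result, which is how a missing ack can stall a producer.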
Control Flow => Data Flow
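[Figure: the three control-flow operators. Merge (label): data inputs joined into one output. Gateway: a data input guarded by a predicate. Split (branch): data steered by a predicate p and its negation !p.]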
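Loops
int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;
[Figure: dataflow graph of the loop. i starts at 0, is incremented by +1, and is compared with < 100; sum starts at 0 and accumulates i * i through a + node; the negated predicate (!) releases sum to ret.]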
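One way to read the loop graph is as the following sequential sketch, in which each statement stands for one node of the figure. The function name `sum_of_squares` and the parameter `n` (100 on the slide) are introduced here for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/* Sequential reading of the loop's dataflow graph. */
int sum_of_squares(int n) {
    assert(n >= 0);
    int i = 0, sum = 0;      /* initial values entering the merge nodes */
    for (;;) {
        bool p = i < n;      /* loop predicate ("< 100" on the slide) */
        if (!p)              /* "!" side: release sum to the return node */
            return sum;
        sum = sum + i * i;   /* "*" node feeding the "+" node */
        i = i + 1;           /* "+1" node; the value flows back around */
    }
}
```

With `n = 100` this returns the same value as the loop on the slide, 328350.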
Comparison: Idealized Simulation
• Compared to 4-wide OOO SimpleScalar
• Same operation latencies
• Same memory hierarchy (LSQ, L1, L2)
• not free
Obvious!
ASH runs at full dataflow speed and has no resource limitations, so a CPU cannot do any better (if the compilers are equally good).
SpecInt95, ASH vs 4-way OOO
[Figure: bar chart of percent slower/faster, y-axis from -50 to 30, for 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, and 147.vortex.]
Outline
• Motivation
• ASH: A Static Dataflow Model
• Dissection: explaining bottlenecks
• Conclusions
The Scalpel
[Figure: tool flow with the labels C, CASH, ASH, Simulator, ASH trace, and drawings; automatic analysis produces the Dynamic Critical Path.]
The (Loop) Body
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
SpecINT95: 124.m88ksim, init_processor()
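To experiment with this loop in isolation, it can be wrapped in a function as in the sketch below. The `Entry` type and the name `find_entry` are hypothetical: the talk shows only the field r, and the real m88ksim structure is not reproduced here.

```c
#include <assert.h>

/* Hypothetical element type: only the field r is used by the loop on the
 * slide; the remaining fields stand in for whatever the real struct holds. */
typedef struct {
    int r;
    int payload[4];
} Entry;

/* Returns the index of the first entry whose r equals i, or the index of
 * the 0xF sentinel if no entry matches. X must end with an r == 0xF entry. */
int find_entry(const Entry *X, int i) {
    assert(X != 0);
    int j;
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
    return j;
}
```

Each iteration needs the loaded value of X[j].r before either exit test can be resolved, which is why the load and the two predicates matter for the loop's speed.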
Dynamic Critical Path
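for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
[Figure: dynamic critical path through the loop's dataflow graph; labels: load predicate, loop predicate, sizeof(X[j]).]
The critical path is defined under Last-Arrival Events.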
MIPS gcc Code
LOOP:
L1: beq $v0,$a1,EXIT ; X[j].r == i
L2: addiu $v1,$v1,20 ; &X[j+1].r
L3: lw $v0,0($v1) ; X[j+1].r
L4: addiu $a0,$a0,1 ; j++
L5: bne $v0,$a3,LOOP ; X[j+1].r != 0xF
EXIT:
L1=>L2=>L3=>L5=>L1
4-instruction loop-carried dependence
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
If Branch Prediction Correct
L1=>L2=>L3=>L5=>L1
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
LOOP:
L1: beq $v0,$a1,EXIT ; X[j].r == i
L2: addiu $v1,$v1,20 ; &X[j+1].r
L3: lw $v0,0($v1) ; X[j+1].r
L4: addiu $a0,$a0,1 ; j++
L5: bne $v0,$a3,LOOP ; X[j+1].r != 0xF
EXIT:
SpecInt95, perfect prediction
[Figure: bar chart of percent slower/faster, y-axis from -60 to 60, for 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, and 147.vortex; remaining labels: Speed-up, prediction, no data.]
Critical Path with Prediction
Loads are not speculative
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
Prediction + Load Speculation
~4 cycles!
Load not pipelined (self-anti-dependence)
ack edge
for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;
OOO Pipe Snapshot
[Figure: pipeline snapshot with stages IF, DA, EX, WB, CT; instruction L3 appears three times in the pipeline; label: register renaming.]
LOOP:
L1: beq $v0,$a1,EXIT ; X[j].r == i
L2: addiu $v1,$v1,20 ; &X[j+1].r
L3: lw $v0,0($v1) ; X[j+1].r
L4: addiu $a0,$a0,1 ; j++
L5: bne $v0,$a3,LOOP ; X[j+1].r != 0xF
EXIT:
Conclusions: Limitations of Static Dataflow
1. dataflow state is “more” distributed
2. “control” dependences still limit ILP
3. nontrivial to squash distributed speculation
4. good prediction may need global information
5. self-antidependences can be critical
(removed by register renaming)
6. distributed computation => more remote accesses
7. more synchronization in dataflow (“join” is not free)
Unrolling Does Not Help
for (i = 0; i < 64; i++) {
    for (j = 0; X[j].r != 0xF; j+=2) {
        if (X[j].r == i)
            break;
        if (X[j+1].r == 0xF)
            break;
        if (X[j+1].r == i)
            break;
    }
    Y[i] = X[j].q;
}
when 1 iteration
How Performance Is Evaluated
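[Figure: the same C source is compiled by CASH to unlimited-ILP static dataflow and by gcc for SimpleScalar; both sit on the same memory hierarchy of LSQ, L1 (8K), L2 (1/4M), and Mem, annotated with the numbers 2, 8, and 72.]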
Last-Arrival Events
[Figure: a + node with data, valid, and ack signals.]
• Event enabling the generation of a result
• May be an ack
• Critical path = collection of last-arrival edges
Dynamic Critical Path
1. Start from last node
2. Trace back along last-arrival edges
3. Some edges may repeat
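The traceback can be sketched as follows. The trace layout is an assumption made for illustration, not the format used by the ASH simulator: each `Event` is taken to record which static node fired and which earlier event was its last-arriving input (data or ack). `MAX_NODES`, `Event`, and `trace_critical_path` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8  /* hypothetical bound on the number of static nodes */

/* Hypothetical trace record: one dynamic firing of a static node. */
typedef struct {
    int node;          /* id of the static node that fired */
    int last_arrival;  /* index of the event whose arrival enabled this
                          firing, or -1 for an event with no predecessor */
} Event;

/* Walks back from the final event along last-arrival links and counts how
 * often each static edge (from -> to) lies on the dynamic critical path.
 * Returns the number of edges on the path. */
size_t trace_critical_path(const Event *trace, size_t n,
                           unsigned counts[MAX_NODES][MAX_NODES]) {
    assert(trace != 0 && n > 0);
    size_t len = 0;
    size_t cur = n - 1;                        /* 1. start from last node */
    while (trace[cur].last_arrival >= 0) {     /* 2. trace back */
        size_t prev = (size_t)trace[cur].last_arrival;
        assert(prev < cur);                    /* causes precede effects */
        int from = trace[prev].node, to = trace[cur].node;
        assert(from >= 0 && from < MAX_NODES && to >= 0 && to < MAX_NODES);
        counts[from][to]++;                    /* 3. some edges repeat */
        cur = prev;
        len++;
    }
    return len;
}
```

Static edges with the highest counts are the ones that appear most often on the critical path, so they are natural first candidates when looking for a bottleneck.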
History
[Figure: timeline of milestones; category labels: Out-of-order, Branch pred, Speculation.]
• Thornton, CDC, 1964
• Karp, Graph model, 1966
• Tomasulo, IBM 360, 1967
• Dennis, Dataflow lang, 1974
• Arvind, Tagged-token, 1977
• Smith, Br pred, 1981
• Fisher, VLIW
• Cocke, Superscalar, 1985
• Smith, Precise spec, 1988
• Papadopoulos, Monsoon, 1988
• Burger, TRIPS, 2001
• Oskin, WaveScalar, 2003