Post on 22-Dec-2015
Diverge-Merge Processor (DMP)
Hyesoon Kim José A. Joao
Onur Mutlu* Yale N. Patt
HPS Research Group, University of Texas at Austin
*Microsoft Research
2
Outline
Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion
3
Predicated Execution
Convert control flow dependence to data dependence
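if (cond) { b = 0; } else { b = 1; }

(normal branch code)
A: p1 = (cond)
   branch p1, TARGET
B: mov b, 1
   jmp JOIN
C: TARGET: mov b, 0

(predicated code)
A: p1 = (cond)
B: (!p1) mov b, 1
C: (p1) mov b, 0

[Figure: control-flow graph in which block A branches to C when taken (T) and to B when not taken (N); both paths join at D.]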
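The conversion on this slide can be checked with a small model. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and a Python conditional expression stands in for a predicated `mov`. It shows that the branch version and the predicated version compute the same `b` for either value of `cond`.

```python
def branch_version(cond: bool) -> int:
    """Normal branch code: only one of the two moves executes."""
    if cond:
        b = 0  # TARGET: mov b, 0
    else:
        b = 1  # mov b, 1
    return b


def predicated_version(cond: bool) -> int:
    """Predicated code: both moves are fetched; the predicate picks the one that takes effect."""
    p1 = cond
    b = None
    b = 1 if not p1 else b  # (!p1) mov b, 1
    b = 0 if p1 else b      # (p1)  mov b, 0
    return b
```

The control dependence on `cond` in the first function becomes a data dependence on `p1` in the second, which is the point of the slide.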
4
Benefit of Predicated Execution
Predicated execution can be high performance and energy-efficient.
[Figure: animation of blocks A through F moving through the pipeline stages (Fetch, Decode, Rename, Schedule, Register Read, Execute) under predicated execution and under branch prediction. With branch prediction, a misprediction causes a pipeline flush; with predicated execution, a nop appears in the pipeline instead of a flush.]
5
Limitations/Problems of Predication
ISA: Predicate registers and predicated instructions are required.
  Dynamic-hammock predication [Klauser’98] can solve this problem, but it is only applicable to simple hammocks.
Adaptivity: Static predication is not adaptive to run-time branch behavior.
  Branch behavior changes based on input set, phase, and control-flow path.
  Wish Branches [Kim’05]
Complex CFG: A large subset of control-flow graphs is not converted to predicated code.
  Function calls, loops, many instructions inside a region, and complex CFGs
  Hyperblock [Mahlke’92] cannot adapt to frequently-executed paths dynamically.
6
Outline
Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion
7
Diverge-Merge Processor (DMP)
DMP can dynamically predicate complex branches (in addition to simple hammocks).
The compiler identifies
  Diverge branches
  Control-flow merge (CFM) points
The microarchitecture decides when and what to predicate dynamically.
8
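Dynamic Predication
Klauser et al. [PACT’98]: Dynamic-hammock predication

(original code)
A: p1 = (cond)
   branch p1, TARGET        (low-confidence)
B: mov R1, 1
   jmp JOIN
C: TARGET: mov R1, 0
H: JOIN: add R5, R1, 1

(dynamically predicated)
B: (mov R1, 1)  PR10 = 1
C: (mov R1, 0)  PR11 = 0
   select-µop (φ-node in SSA): PR12 = (cond) ? PR11 : PR10

[Figure: control-flow graph in which block A branches to C when taken (T) and to B when not taken (N); both paths join at H.]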
9
Diverge-Merge Processor
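[Figure: control-flow graph with blocks A through H, distinguishing frequently executed paths from not frequently executed paths. Block A ends in the diverge branch and block H is the CFM point. The dynamically predicated stream is shown as blocks A, C, E, B, H, with select-µops inserted at the CFM point.]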
10
Diverge-Merge Processor
[Figure: the same control-flow graph (blocks A through H) shown as an animation over several executions. Legend: diverge branch, executed block, CFM point; frequently executed path, not frequently executed path.]
11
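Control-Flow Graphs
CFG types: simple hammock, nested hammock, frequently-hammock, loop, non-merging
Mechanisms compared: DMP, dynamic-hammock, SW pred., wish br., dual-path
[Table: which mechanism handles which CFG type; the cell marks are not recoverable from the transcript.]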
12
Dual-path Execution vs. DMP
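[Figure: CFG in which block A ends in a low-confidence branch to B or C, followed by D, E, F. Dual-path execution fetches path 1 (C, D, E, F) and path 2 (B, D, E, F), so the blocks after the merge point are fetched twice. DMP fetches C on path 1 and B on path 2 up to the CFM point, then fetches D, E, F once.]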
13
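Control-Flow Graphs
CFG types: simple hammock, nested hammock, frequently-hammock, loop, non-merging
Mechanisms compared: DMP, dynamic-hammock, SW pred., wish br., dual-path
[Table: which mechanism handles which CFG type. Two cells are marked "sometimes"; the remaining cell marks are not recoverable from the transcript.]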
14
Distribution of Mispredicted Branches
66% of mispredicted branches can be dynamically predicated in DMP.
[Figure: mispredictions per kilo instructions (MPKI), y-axis 0 to 12, per benchmark (gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, amean), broken down by CFG type: simple, nested, frequently, loop, non-merging.]
16
Outline
Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion
17
Fetch Mechanism
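[Figure: example CFG with blocks A through H, the diverge branch in A, the CFM point H, and the predicted path marked. When the diverge branch has low confidence, fetch proceeds round-robin over the two paths (shown as blocks A, C, E, B, H).]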
18
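Dynamic Predication
Forks RAT, RAS, and GHR

Original code:
A: branch r0, C
C: add r1 ← r3, #1
B: add r1 ← r2, #-1
H: add r4 ← r1, r3

Renamed code:
A: branch pr10, C          p1 = pr10
C: add pr21 ← pr13, #1     (p1)
B: add pr31 ← pr12, #-1    (!p1)
   select-µop pr41 = p1 ? pr21 : pr31
H: add pr24 ← pr41, pr13

[Figure: two register alias tables, RAT1 and RAT2, with columns Arch., Phy., and M. R2 maps to PR12 and R3 maps to PR13. R1 initially maps to PR11; the two paths rename it to PR21 and PR31 (M bit set to 1), and the select-µop merges them into PR41.]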
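The renaming steps on this slide can be sketched as follows. This is a minimal model under stated assumptions, not the hardware described in the paper: the function names, the tuple encoding of instructions, and the physical register numbering starting at 100 are all hypothetical. It forks the RAT, renames each path on its own copy, and emits one select-µop for every architectural register whose mapping differs between the two copies at the CFM point.

```python
from itertools import count


def rename_path(rat, instrs, fresh, pred):
    """Rename one path on its own RAT copy; return predicated, renamed instructions."""
    out = []
    for dest, srcs in instrs:
        phys_srcs = [rat.get(s, s) for s in srcs]  # immediates pass through unchanged
        phys_dest = f"pr{next(fresh)}"
        rat[dest] = phys_dest
        out.append((phys_dest, phys_srcs, pred))
    return out


def dynamic_predicate(rat, taken, not_taken):
    """Fork the RAT, rename both paths, then merge with select-uops (updates rat in place)."""
    fresh = count(100)                    # hypothetical physical register numbering
    rat_t, rat_n = dict(rat), dict(rat)   # fork the RAT
    code = rename_path(rat_t, taken, fresh, "p1")
    code += rename_path(rat_n, not_taken, fresh, "!p1")
    selects = []
    for arch in sorted(rat):
        if rat_t[arch] != rat_n[arch]:    # mapping modified on at least one path
            merged = f"pr{next(fresh)}"
            selects.append((merged, "p1", rat_t[arch], rat_n[arch]))
            rat[arch] = merged            # later instructions read the merged value
    return code, selects
```

With the slide's example (`r1` written on both paths, `r2` and `r3` only read), this yields a single select-µop for `r1`, mirroring `pr41 = p1 ? pr21 : pr31`.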
19
DMP Support
ISA Support
  Mark diverge branches/CFM points.
Compiler Support [CGO’07]
  The compiler identifies diverge branches and the corresponding CFM points.
Hardware Support
  Confidence estimator
  Fetch mechanisms
  Load/store processing
  Instruction retirement
  Dynamic predication
20
Hardware Complexity Analysis
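Hardware components: ST-LD forwarding, select-µop generation, rename support, front-end, check flush/no flush, predicate registers, confidence estimator
Mechanisms compared: SW pred., dual-path, wish br., multipath, dyn. ham., DMP
[Table: which mechanism requires which hardware component; the cell marks are not recoverable from the transcript.]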
21
Outline
Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion
22
Simulation Methodology
12 SPEC 2000 INT, 5 SPEC 95 INT
  Different input sets for profiling and evaluation
Alpha ISA execution-driven simulator
Baseline processor configuration
  64KB perceptron predictor/O-GEHL (paper)
  Minimum 30-cycle branch misprediction penalty
  8-wide, 512-entry instruction window
  2 KB, 12-bit history enhanced JRS confidence estimator
Less aggressive processor (paper)
Power model using Wattch
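The slide names an "enhanced JRS" confidence estimator but does not describe it. As a rough illustration only, here is a sketch of the basic JRS idea (a table of resetting counters indexed by branch address and history). The class name, the XOR indexing, the 4-bit counter width, and the threshold are assumptions, not details from the talk; with 4-bit counters a 2 KB table would hold 4096 entries, which happens to match a 12-bit index.

```python
class JRSConfidenceEstimator:
    """Resetting-counter confidence estimator (basic JRS scheme; parameters are assumed)."""

    def __init__(self, index_bits=12, counter_max=15, threshold=15):
        self.mask = (1 << index_bits) - 1
        self.counter_max = counter_max
        self.threshold = threshold
        self.table = [0] * (1 << index_bits)

    def _index(self, pc, history):
        # Assumed indexing: branch address XOR global history, truncated to the table size.
        return (pc ^ history) & self.mask

    def is_high_confidence(self, pc, history):
        return self.table[self._index(pc, history)] >= self.threshold

    def update(self, pc, history, prediction_correct):
        i = self._index(pc, history)
        if prediction_correct:
            self.table[i] = min(self.table[i] + 1, self.counter_max)  # saturate
        else:
            self.table[i] = 0  # reset on a misprediction
```

A branch whose counter is below the threshold is treated as low confidence, which is the condition DMP uses to enter dynamic predication mode.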
23
Different CFG types
[Figure: IPC improvement (%), y-axis 0 to 60, per benchmark (gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, hmean), for DMP predicating cumulatively more CFG types: simple; simple, nested; simple, nested, frequently; simple, nested, frequently, loop.]
24
Performance Improvement
[Figure: performance improvement (%), y-axis 0 to 25, for DMP, dynamic-hammock, dual-path, multipath, limited software predication, and wish branches.]
25
Energy Consumption
[Figure: energy reduction (%), y-axis -5 to 10, for DMP, dynamic-hammock, dual-path, multipath, limited software predication, and wish branches.]
26
Outline
Predicated Execution
Diverge-Merge Processor (DMP)
Implementation of DMP
Experimental Evaluation
Conclusion
27
Conclusion
DMP introduces the concept of frequently-hammocks and dynamically predicates complex CFGs.
DMP can overcome the three major limitations of software predication: ISA support, adaptivity, complex CFG.
DMP reduces branch mispredictions energy-efficiently.
  19% performance improvement, 9% less energy
DMP divides the work between the compiler and the microarchitecture:
  The compiler analyzes the control-flow graphs.
  The microarchitecture decides when and what to predicate dynamically.
30
Handling Mispredictions
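[Figure: example CFG with blocks A through H (diverge branch in A, CFM point H) and the predicted path marked. The dynamically predicated instruction stream is annotated with predicate values (0), (1), (1), a "Misprediction!" marker, and a "Flush" marker.]

A: branch pr10, C          p1 = pr10
C: add pr21 ← pr13, #1     (p1)
B: add pr31 ← pr12, #-1    (!p1)
D: add pr34 ← pr31, pr13
   add pr44 ← pr34, #-1    (!p1)
   select-µop pr41 = p1 ? pr21 : pr31
H: add pr24 ← pr41, pr13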
31
Loop Branches
Exit condition
  The loop branch is predicted to exit the loop.
Benefit
  Reduced pipeline flushes when the predicated loop is iterated more times than it should be.
  Instructions in the extra iterations of the loop become NOPs.
  Instructions after the loop exit can still be executed.
Negative effects
  Increased execution delay of loop-carried dependencies
  The overhead of select-µops
32
Loop Branches
Predicate each loop iteration separately
Loop br. is predicted to exit the loop

Original code (loop body A fetched three times, followed by B):
A: add r1 ← r1, #1
   r0 = (cond1)
   branch A, r0
B: add r7 ← r1, #10

Dynamically predicated code:
A: branch A, pr10                       p1 = pr10
A: add pr21 ← pr11, #1                  (p1)
   pr20 = (cond1)                       (p1)
   branch A, pr20                       (p1)   p2 = pr20
   select-µop pr22 = p1 ? pr21 : pr11
   select-µop pr23 = p1 ? pr20 : pr10
A: add pr31 ← pr22, #1                  (p2)
   pr30 = (cond1)                       (p2)
   branch A, pr30                       (p2)
   select-µop pr32 = p2 ? pr31 : pr22
   select-µop pr33 = p2 ? pr30 : pr23
B: add pr7 ← pr32, #10
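The effect of predicating each iteration separately can be sketched in software. This is a minimal model under assumptions, not the hardware mechanism: the function names and the `keep_looping` callback (standing in for `cond1`) are hypothetical. Each extra fetched iteration computes its result, and a select keeps the old value of `r1` once the loop condition has failed, so the extra iterations behave as NOPs.

```python
def run_reference_loop(r1, keep_looping):
    """Ordinary execution of: do { r1 += 1 } while (cond1); r7 = r1 + 10."""
    while True:
        r1 += 1
        if not keep_looping(r1):
            break
    return r1 + 10


def run_predicated_loop(r1, keep_looping, fetched_iterations):
    """Fetch a fixed number of iterations; predicate every iteration after the first."""
    r1 = r1 + 1                        # first iteration is not predicated
    pred = keep_looping(r1)            # p1 = outcome of the loop branch
    for _ in range(fetched_iterations - 1):
        new_r1 = r1 + 1                # (p) add
        new_cond = keep_looping(new_r1)
        r1 = new_r1 if pred else r1    # select-uop: keep the old value when pred is false
        pred = pred and new_cond       # predicate for the following iteration
    return r1 + 10                     # block B still executes after the loop
```

As long as at least as many iterations are fetched as the loop really executes, both functions return the same value; fetching too few iterations corresponds to the case where the hardware has not yet followed the loop far enough, which this sketch does not model.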
33
Enhanced Mechanisms
Multiple CFM points
  The hardware chooses one CFM point for each instance of dynamic predication.
Exit optimizations
  Counter policy: What if one path does not reach the CFM point?
    Number of fetched instructions > threshold
  Yield policy: What if another low-confidence diverge branch is encountered in dynamic predication mode?
    The later low-confidence branch is more likely to be mispredicted.
[Figure: CFG with blocks A through H.]
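The two exit optimizations can be read as a small decision rule evaluated while the processor is in dynamic predication mode. The sketch below is only an illustration: the function name, the action labels, and the priority order among the checks are assumptions, not details given on the slide.

```python
def next_action(both_paths_at_cfm, fetched_instructions, threshold, new_low_conf_diverge):
    """Decide what to do next while in dynamic predication mode (assumed priority order)."""
    if both_paths_at_cfm:
        return "merge"          # normal exit: insert select-uops at the CFM point
    if fetched_instructions > threshold:
        return "counter-exit"   # counter policy: a path is not reaching the CFM point
    if new_low_conf_diverge:
        return "yield"          # yield policy: switch to the later low-confidence branch
    return "continue"
```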
34
Detailed DMP Support
32 predicate register IDs
Fetch mechanism
  High-performance I-cache
  Fetch two cache lines
  Predict 3 branches
  Fetch stops at the first taken branch
35
Diverge and Merge?
[Figure: Merge (%), y-axis 0% to 100%, per benchmark (gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, amean).]
36
Useful Dynamic Predication Mode
[Figure: diverge branch actually mispredicted (%), y-axis 0 to 30, per benchmark (gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, amean).]
37
Perfect Branch Prediction
[Figure: three panels showing delta (%) with perfect branch prediction: Energy (y-axis -40 to 0), EDP (y-axis -70 to 0), and Performance (y-axis 0 to 100). Two configurations: 4 wide, 20 stages, 128-entry window; 8 wide, 30 stages, 512-entry window.]
38
Maximum Power
[Figure: maximum power increment (%), y-axis 0 to 8, for DMP, dynamic-hammock, dual-path, multipath, software predication, and wish branches.]
39
Branch Predictor Effects
[Figure: IPC delta (%), y-axis 0 to 35, for perceptron-dynamic-hammock, perceptron-dual-path, perceptron-multipath, perceptron-DMP, OGEHL-base, OGEHL-dynamic-hammock, OGEHL-dual-path, OGEHL-multipath, and OGEHL-DMP.]
40
Confidence Estimator Effects
[Figure: IPC delta (%), y-axis 0 to 35, for dynamic-hammock, dual-path, multipath, and DMP with confidence estimator sizes of 512B, 2KB, 4KB, 16KB, and a perfect estimator.]
41
Results in Less Aggressive Processors
[Figure: IPC delta (%), y-axis -5 to 35, per benchmark (gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, hmean), for dynamic-hammock, dual-path, multi-path, and DMP.]
[Figure: execution time normalized to the baseline, y-axis 0.5 to 1.05, per benchmark (same benchmarks, amean), for limited software predication, wish branches, and DMP.]
42
DMP vs. Perfect Conditional BP
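[Figure: IPC delta (%), y-axis 0 to 140, per benchmark (gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, hmean), for DMP and perfect conditional branch prediction (Perf BP). Two off-scale bars are labeled 227 and 229.]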
43
Enhanced DMP Mechanisms
[Figure: IPC delta (%), y-axis -10 to 60, per benchmark (gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, hmean), for single-cfm, multiple-cfm, mcfm-counter, and mcfm-counter-yield.]
44
DMP vs. Other Mechanisms
[Figure: IPC improvement (%), y-axis -5 to 40, per benchmark (gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, hmean), for dynamic-hammock, dual-path, multipath, and DMP. Two off-scale bars are labeled 47% and 58%.]
45
Comparisons with Predication/Wish Branches
[Figure: normalized execution time, y-axis 0 to 1, per benchmark (gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, amean), for limited software predication, wish branches, and DMP. Some benchmarks are marked "non-predicated".]
46
Reduction in Pipeline Flushes
Average overhead:
  Dynamic-hammock: 4 instructions/entry
  Dual-path: 150 instructions/entry
  Multipath: 200 instructions/entry
  DMP: 20 instructions/entry
[Figure: reduction in pipeline flushes (%), y-axis 0 to 80, per benchmark (gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, amean), for dynamic-hammock, dual-path, multipath, and DMP.]
47
Handling Nested Diverge Branches
Basic DMP
  Ignore other low-confidence diverge branches.
Enhanced DMP
  Exit dynamic predication mode and re-enter from the younger low-confidence branch on the predicted path (yield policy).
[Figure: example CFG with blocks A through H, the diverge branch in A, and the CFM point H.]
48
Compiler Support [CGO’07]
The compiler analyzes the control flow and the profile data.
Step 1: Identify diverge branch candidates and CFM points.
Step 2: Select diverge branches based on
  (1) the number of instructions between a branch and the CFM point
  (2) the probability of merging at the CFM point
  Heuristics or a cost-benefit model
Step 3: Mark the selected branches/CFM points.
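Step 2 can be illustrated with a simple threshold heuristic over profiled candidates. This is a sketch under assumptions, not the algorithm from the CGO’07 paper: the `Candidate` fields, the function name, and both default thresholds are hypothetical values chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    branch_pc: int
    cfm_pc: int
    instrs_to_cfm: int        # profiled number of instructions between branch and CFM point
    merge_probability: float  # profiled probability that both paths reach the CFM point


def select_diverge_branches(candidates, max_instrs=50, min_merge_prob=0.3):
    """Keep candidates that are short enough and merge often enough (assumed thresholds)."""
    return [
        c for c in candidates
        if c.instrs_to_cfm <= max_instrs and c.merge_probability >= min_merge_prob
    ]
```

A cost-benefit model, the alternative named on the slide, would replace the two fixed thresholds with an estimate of cycles saved versus overhead added per candidate.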
49
Future Research
Hardware Support
  Better confidence estimators
  Efficient hardware mechanism to detect diverge branches and CFM points
    Increases hardware complexity but eliminates the need for ISA/compiler support
Compiler Support
  Better compiler algorithms [CGO’07]
50
Power Measurement Configurations
100 nm technology
Baseline processor: 4 GHz
Less aggressive processor: 1.5 GHz
CC3 clock-gating model in Wattch: unused units dissipate only 10% of their maximum power
DMP: one more RAT/RAS/GHR, select-µop generation module, additional fields in BTB, predicate registers, CFM registers, load-store forwarding, instruction retirement
51
Fetched wrong-path instructions per entry into dynamic-predication/dual-path mode
[Figure: wrong-path instructions per entry, y-axis 0 to 350, per benchmark (gzip, vpr, gcc, mcf, crafty, parser, eon, perlbmk, gap, vortex, bzip2, twolf, comp, go, ijpeg, li, m88ksim, amean), for dynamic-hammock, dual-path, multipath, and DMP.]
52
Fetched/Executed Instructions
[Figure: delta (%), y-axis -25 to 5, for the baseline and less-aggressive processors: fetched instructions, executed instructions, max power, energy, and energy-delay product.]
53
ISA Support
Example of diverge branch and CFM markers
Instruction fields shown: OPCODE, TARGET, a two-bit branch type, CFM rel address
Branch type:
  00 : normal branch
  10 : diverge forward branch
  11 : diverge loop branch
CFM = CFM rel address + PC
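The marker decoding can be sketched as follows. This is only an illustration of the encoding listed on the slide: the function name and the way fields are passed in are hypothetical, and the slide does not say what the remaining bit pattern (01) means, so the sketch rejects it.

```python
BRANCH_TYPES = {
    0b00: "normal branch",
    0b10: "diverge forward branch",
    0b11: "diverge loop branch",
}


def decode_marker(type_bits, cfm_rel_address, pc):
    """Return (branch kind, CFM address); normal branches have no CFM point."""
    if type_bits not in BRANCH_TYPES:
        raise ValueError("branch type not defined on the slide")
    kind = BRANCH_TYPES[type_bits]
    if kind == "normal branch":
        return kind, None
    return kind, pc + cfm_rel_address  # CFM = CFM rel address + PC
```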
54
Entering Dynamic Predication Mode
Entry condition
  When a diverge branch has low confidence.
The front-end
  Stores the address of the CFM point to the CFM register.
  Forks the RAS, GHR, and RAT.
  Allocates a predicate register.
Fetch mechanisms
  Round-robin fetch from two paths
  The processor follows the branch predictor until it reaches the corresponding CFM point.
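The round-robin fetch can be sketched with basic blocks as the unit of fetch. This is a simplified model under assumptions: the function name is hypothetical, real hardware alternates per fetch cycle rather than per block, and each path here is simply the block sequence the branch predictor would follow.

```python
def round_robin_fetch(path_a, path_b, cfm):
    """Alternate between two paths, one block per turn, until both reach the CFM block."""
    order = []
    paths = [iter(path_a), iter(path_b)]
    active = [True, True]
    turn = 0
    while any(active):
        if active[turn]:
            # A path that runs out of blocks is treated as having reached the CFM point.
            block = next(paths[turn], cfm)
            if block == cfm:
                active[turn] = False   # this path waits at the CFM point
            else:
                order.append(block)
        turn ^= 1                      # switch to the other path
    order.append(cfm)                  # fetch continues from the CFM point once
    return order
```

For the example CFG, `round_robin_fetch(["C", "E", "H"], ["B", "H"], "H")` interleaves the two paths and fetches the CFM block H only once.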
55
Exiting Dynamic Predication Mode
Exit condition
  Both paths of a diverge branch have reached the corresponding CFM point.
  A diverge branch is resolved.
Select-µop mechanism
  Similar to φ-node in SSA
  Merges register values from two paths.
56
Multipath Execution
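Instructions after the control-flow merge point are fetched multiple times.
Waste of resources and energy.
[Figure: CFG with blocks A through I containing two low-confidence branches. Multipath execution forks four paths (path 1 to path 4), and each path separately fetches the blocks after the merge point (H, I).]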