Outline: Motivation · Baseline decoupled look-ahead · Look-ahead thread acceleration · Additional insights and summary
Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism
Raj Parihar
Advisor: Prof. Michael C. Huang
Department of Electrical & Computer Engineering, University of Rochester, Rochester, NY
Raj Parihar Advanced Computer Architecture Lab University of Rochester
Motivation
Despite the proliferation of multi-core, multi-threaded systems
High single-thread performance is still an important CPU design goal
Modern programs do not lack instruction level parallelism
[Figure: IPC (log scale, 1 to 50) of bzip2, crafty, eon, gap, gcc, gzip, mcf, pbmk, twolf, vortex, vpr, and Gmean for ideal vs. real machines with window sizes of 128, 512, and 2K.]
Real challenge: exploit implicit parallelism without undue cost
One effective approach: Decoupled look-ahead architecture
Motivation
Decoupled look-ahead architecture targets
Performance hurdles: branch mispredictions, cache misses, etc.
Exploration of parallelization opportunities, dependence information
Microarchitectural complexity, energy inefficiency through decoupling
The look-ahead thread can often become a new bottleneck
Lack of correctness constraint allows many optimizations
Weak dependence: Instructions that contribute marginally to the outcome can be removed w/o affecting the quality of look-ahead
Do-It-Yourself branches: Side-effect free, “easy-to-predict” branches can be skipped in the look-ahead thread
Outline
Motivation
Baseline decoupled look-ahead
Look-ahead: a new bottleneck
Look-ahead thread acceleration
Weak dependences/instructions
Do-It-Yourself branches & skeleton tuning
Experimental analysis
Additional insights and summary
Baseline Decoupled Look-ahead Architecture
Skeleton generated just for the look-ahead purposes
The skeleton runs on a separate core and:
Speculative state is completely contained within the look-ahead context
Sends branch outcomes through a FIFO queue; also helps prefetching
[Diagram: the look-ahead core (with an L0$) executes the look-ahead skeleton while the main core (with an L1$) executes the program binary; both share the L2$ and main memory. The look-ahead core feeds the main core (1) branch predictions through a branch queue and (2) prefetching hints; register state synchronization recovers the look-ahead context.]
A. Garg and M. Huang, “A Performance-Correctness Explicitly Decoupled Architecture”, MICRO’08
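The split above (a skeleton running ahead on a separate core, handing branch outcomes to the main core through a bounded FIFO) can be sketched in a few lines of Python. This is a hypothetical toy model, not the thesis's implementation; the function names and the dict-based instruction trace are invented for illustration:

```python
from collections import deque

BRANCH_QUEUE_DEPTH = 512   # entries, matching the baseline configuration

def run_lookahead(skeleton, branch_queue):
    """Look-ahead core: executes the skeleton, queuing every branch outcome."""
    for inst in skeleton:
        if inst["kind"] == "branch":
            branch_queue.append(inst["taken"])   # an outcome, not a guess

def run_main(program, branch_queue):
    """Main core: consumes queued outcomes as near-perfect 'predictions'."""
    hits = total = 0
    for inst in program:
        if inst["kind"] == "branch":
            total += 1
            if branch_queue:                     # outcome arrived in time
                hits += branch_queue.popleft() == inst["taken"]
    return hits, total

# Toy trace: 10 branches; this skeleton preserves them and their slices,
# so the queued outcomes match the main thread's branches exactly.
program = [{"kind": "branch", "taken": i % 3 == 0} for i in range(10)]
skeleton = list(program)

queue = deque(maxlen=BRANCH_QUEUE_DEPTH)
run_lookahead(skeleton, queue)
hits, total = run_main(program, queue)
```

Because the look-ahead thread executes (rather than predicts) the branches, every outcome it queues in time resolves a branch on the main core.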
Example (Alpha code):
Program binary:
addq v0, v0, v0
subq v0, t0, a2
cmovge a2, a2, v0
addq v0, v0, v0
subq v0, t0, a2
cmovge a2, a2, v0
subq a1, 0x2, a1
addq v0, v0, v0
bgt a1, 0x12001f9a0
subq v0, t0, a2
Look-ahead skeleton (removed instructions become nops):
addq v0, v0, v0
nop
...
bgt a1, 0x12001f9a0
subq v0, t0, a2
Look-ahead: A New Bottleneck
Comparing four systems to discover new bottlenecks
Single-thread, decoupled look-ahead, ideal, and look-ahead limit
Application categories:
Bottleneck removed or speed of look-ahead is not an issue (left half)
Look-ahead thread is the new bottleneck (right half)
[Figure: IPC (0 to 4) of aplu, msa, wup, mgri, six, swim, facr, gal, gcc, gap, eon, fma3, gzip, craf, vrtx, apsi, vpr, bzp2, equk, amp, luc, art, perl, mcf, two for the look-ahead limit, single-thread, decoupled look-ahead, and ideal (cache, branch) systems.]
Weak Dependences/Instructions
Not all instructions are equally important and critical
Examples of weak instructions:
Inconsequential adjustments
Load and store instructions that are (mostly) silent
Dynamic NOP instructions
Plenty of weak instructions are present in programs (hundreds of them)
Weak instructions can be experimentally defined and their impact quantified in isolation
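The "in isolation" test can be sketched as: drop one static instruction from the skeleton, re-run, and compare against the baseline. The following toy Python model is invented for illustration (the performance function is a hypothetical stand-in, not measured data):

```python
# Toy model of "weakness in isolation": shorter skeletons run faster,
# except that removing instruction 3 (a "strong" one) ruins look-ahead
# quality (worse prefetches and branch outcomes).
def toy_perf(skeleton):
    quality = 0.5 if 3 not in skeleton else 1.0
    return quality * 100.0 / len(skeleton)

def weak_in_isolation(skeleton, perf):
    """An instruction is weak if removing it alone does not hurt."""
    base = perf(skeleton)
    return [i for i in skeleton
            if perf([j for j in skeleton if j != i]) >= base]

skeleton = list(range(10))       # ten static instructions, by id
weak = weak_in_isolation(skeleton, toy_perf)
```

Every instruction except the strong one passes the isolation test; as the next slides show, passing in isolation does not mean the set is safe to remove together.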
Challenge #1: Weak instructions do not look different
After-the-fact analysis based on static attributes of instructions reveals:
Static attributes of weak and regular instructions are remarkably similar
Correlation coefficient of the two distributions is very high (0.96)
Weakness has very poor correlation with static attributes
Hard to identify the weak instructions through static heuristics
[Figure: number of inputs (0 to 2) per instruction type (addq, clr, cmovne, cmptlt, divt, fneg, ldah, ldt, mult, s4addq, sll, stq, subq, zapnot) for weak instructions vs. strong instructions; the two distributions look nearly identical.]
Challenge #2: False positives are extremely costly
After-the-fact analysis and close inspection also reveal:
Some instructions are more likely to be weak than others
Even then, a single false positive can negate all the gains
Case in point: zapnot in gap
zapnot Ra Rb Rc
84% of the zapnot insts are weak in isolation: 3.4% speedup
A single false-positive zapnot instruction: 6% slowdown
More than one false-positive instruction can slow down the program by up to 13%
Challenge #3: Neither absolute nor additive
Weakness is context dependent and non-linear, much like Jenga
All weak instructions combined together are not weak!
Example: weak instruction combining in perlbmk
About 300 weak instructions when tested in isolation
All combined together can result in up to 40% slowdown
[Figure: performance impact over baseline look-ahead (from +20% down to −40%) as the ~300 individually weak instructions in perlbmk are removed cumulatively.]
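The non-additivity can be illustrated with a tiny interaction model. The numbers below are invented for illustration only (not perlbmk data): each removal helps slightly on its own, but removals that share a dependence chain hurt each other when combined:

```python
from itertools import combinations

# Toy interaction model: each removal alone gains 1%, but removing two
# adjacent instructions of the same dependence chain costs 3% together.
def perf(removed):
    gain = 0.01 * len(removed)
    clashes = sum(1 for a, b in combinations(sorted(removed), 2)
                  if b - a == 1)            # neighbors interact
    return 1.0 + gain - 0.03 * clashes      # 1.0 = baseline look-ahead

weak_alone = [i for i in range(6) if perf({i}) > 1.0]   # all pass alone
combined = perf(set(weak_alone))                         # yet together...
```

All six instructions are weak in isolation, yet removing all of them yields a net slowdown, which is exactly why a search over combinations is needed.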
Metaheuristic-Based Trial-and-Error Approach
Recap: Challenges in identifying weak instructions
Weak instructions look very similar to regular instructions
False positives are extremely costly and can negate all the gains
Weakness is context dependent: neither absolute nor additive
Our approach: metaheuristic-based self-tuning
Experimentally identify/verify weakness
Search for profitable combinations via a metaheuristic
Metaheuristic: completely agnostic of the meaning of a solution
Derives new solutions from current solutions through modifications
Examples: genetic algorithms, simulated annealing, etc.
R. Parihar, M. Huang, “Accelerating Decoupled Look-ahead via Weak Dependence Removal”, HPCA’14
Genetic Algorithm based Framework
The problem naturally maps to genetic algorithm
Skeleton is represented by a bit vector
Natural mapping: weak instruction → gene, collection → chromosome
Objective: find the optimal combination (chromosome)
Genetic evolution: Procreation, mutation, fitness-based selection
[Diagram: chromosome creation and GA evolution. A binary parser turns the program binary into single-instruction genes and an initial chromosome population (single-gene, multi-instruction, superposition, and orthogonal chromosomes). The GA evolution loop selects parents from the parents pool via a roulette wheel, reproduces with crossover & mutation, de-duplicates, and fills the children pool under fitness tests and elitism; the best chromosome drives look-ahead construction of the look-ahead binary.]
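The pipeline above is, at its core, a standard genetic algorithm over bit vectors (1 = remove this candidate instruction from the skeleton). A minimal, self-contained Python sketch follows; the fitness function is an invented stand-in for the real fitness test (which would time a short run of the tuned skeleton), and the parameters are illustrative:

```python
import random

random.seed(7)
GENES = 16            # candidate weak instructions (one bit each)

def fitness(chrom):
    # Hypothetical fitness: reward removals, punish removing inst 5
    # (a false positive) and the interacting pair (2, 3).
    f = 1.0 + 0.01 * sum(chrom)
    if chrom[5]:
        f -= 0.10
    if chrom[2] and chrom[3]:
        f -= 0.05
    return f

def roulette(pop, fits):
    """Fitness-proportionate (roulette-wheel) parent selection."""
    pick = random.uniform(0, sum(fits))
    acc = 0.0
    for c, f in zip(pop, fits):
        acc += f
        if acc >= pick:
            return c
    return pop[-1]

def evolve(pop, generations=30, mut_rate=0.05):
    for _ in range(generations):
        fits = [fitness(c) for c in pop]
        elite = max(pop, key=fitness)            # elitism: best survives
        children = [elite]
        while len(children) < len(pop):
            a, b = roulette(pop, fits), roulette(pop, fits)
            cut = random.randrange(1, GENES)     # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ (random.random() < mut_rate) for g in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness)

# Initial population: single-gene chromosomes (one removal each).
pop = [[1 if i == j else 0 for i in range(GENES)] for j in range(GENES)]
best = evolve(pop)
```

With elitism, the best fitness never regresses across generations, so the search monotonically improves on the best single-gene chromosome.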
Speedup of Weak Dependence Removal
Applications in which the look-ahead thread is a bottleneck
Self-tuned, genetic algorithm based decoupled look-ahead
Speedup over baseline decoupled look-ahead: 1.11x (geomean)
Overall speedup over single-thread baseline: 1.48x
[Figure: speedup over single-thread (1x to 6x) for craf, eon, gap, gzip, mcf, pbmk, two, vrtx, vpr, amp, art, eqk, fma3, luc, and Gmean: baseline look-ahead vs. GA-based look-ahead.]
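The "geomean" figures above are geometric means of the per-benchmark speedups, the standard way to summarize speedup ratios. A quick sketch with hypothetical per-benchmark numbers (illustrative only, not the thesis data):

```python
from math import prod

def geomean(xs):
    """Geometric mean: the n-th root of the product of n ratios."""
    return prod(xs) ** (1.0 / len(xs))

# Hypothetical speedups of five benchmarks over the baseline look-ahead.
speedups = [1.05, 1.22, 1.08, 1.01, 1.19]
overall = geomean(speedups)
```

Unlike the arithmetic mean, the geometric mean is not skewed by a single large ratio, which is why it is used for speedup summaries.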
Progress of Genetic Evolution Process
Per-generation progress compared to the final best solution
After 2 generations, more than half of the benefits are achieved
After 5 generations, significant performance benefits are achieved
GA evolution, helped by hybridization, shows good progress
[Figure: progress relative to the best GA solution (0% to 100%) over 7 generations for eon, mcf, pbmk, twolf, vpr, art, eqk, fma, amp, lucas.]
Evolution can be Online or Offline
Offline evolution: one-time tuning (e.g., at install time)
Fitness tests need not take long (2-20s on the target machine)
Different inputs and configurations do not invalidate results
Online evolution: takes longer but has little overhead
Additional work is minimal: bookkeeping, bit-vector manipulation
Main source of slowdown: testing bad configurations
[Figure: accumulated IPC (1 to 3) over roughly 4.7 billion instructions for the single-thread baseline, baseline decoupled look-ahead, and online self-tuned look-ahead.]
A Locomotive and Cargo Analogy
Skeleton payload: look-ahead tasks and associated housekeeping
Locomotive: look-ahead thread; Cargo: skeleton payload
Dilemma: heavy cargo (slower locomotive) vs. lighter cargo (under-utilization of the locomotive's capability)
[Diagram: a locomotive (the look-ahead thread) pulling cargo cars of L1 prefetches and L2 prefetches.]
Idea of Do-It-Yourself (DIY) Branches
Extends the idea of weak instructions to easy-to-predict branches
To accelerate the look-ahead thread, DIY branches are either skipped completely or only partially executed in the skeleton
[Diagram: (A) forward conditional branch (if-then, if-then-else) transformations: (1) DIY [C], (2) DIY [BR → A → C], (3) DIY [BR → B → C], with untaken paths zapped or the branch forced left/fall-through/right; (B) backward conditional branch (loop) transformations: (4) DIY [A → C], (5) DIY [A → B → BR → C].]
Tune skeleton via selectively including/excluding prefetches
R. Parihar, M. Huang, “Load Balancing in Decoupled Look-ahead via DIY Branches and Payload Tuning”, (in draft)
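The payoff of a DIY branch is that the skeleton does less work while the main thread falls back on its own predictor for that branch. A hypothetical toy model in Python (the trace format and default predictor are invented for illustration):

```python
from collections import deque

# Toy trace of (branch_id, taken, is_diy). DIY branches are side-effect
# free and easy to predict, so the skeleton skips them entirely.
trace = [(0, True, False), (1, True, True), (1, True, True),
         (1, False, True), (2, False, False)]

def run_skeleton(trace, queue):
    """Look-ahead thread: executes only non-DIY branches."""
    work = 0
    for bid, taken, diy in trace:
        if diy:
            continue                  # skipped: lighter skeleton runs ahead
        queue.append((bid, taken))
        work += 1
    return work

def run_main(trace, queue, predict=lambda bid: True):
    """Main thread: queued outcomes for normal branches, own predictor for DIY."""
    correct = 0
    for bid, taken, diy in trace:
        guess = predict(bid) if diy else queue.popleft()[1]
        correct += guess == taken
    return correct

q = deque()
skeleton_work = run_skeleton(trace, q)
correct = run_main(trace, q)
```

Here the skeleton executes only 2 of the 5 branches; the main thread still resolves the non-DIY branches perfectly from the queue and predicts the easy DIY branch mostly right on its own.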
Hardware Support for DIY Branches
Hardware support is needed to synchronize after DIY regions:
An additional BOQ bit to indicate the beginning of a DIY region
Main thread has its own branch predictor for DIY regions
A DIY call-depth register to keep track of nesting/recursion
[Diagram: the look-ahead thread pushes direction + DIY info into the branch queue; skeleton entries carry [mask, diy, duty] fields (e.g., i1: add [1, 0]; i2: call [1, 1, 25]; i3: ldq [1, 2]; i4: stq [1, 0]). The main thread consumes the queue, switching to its own branch predictor in DIY mode, tracked by the DIY call-depth register and DIY mode bit.]
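The call-depth register exists so the main thread knows where a DIY region ends even across nested calls. A hypothetical sketch of that control logic (the event encoding is invented; a real implementation is a few bits of state in the fetch stage):

```python
def branch_source(events):
    """For each branch, report which source the main thread consults.

    Events: ("call", diy_bit) / ("ret",) / ("branch",). A call whose BOQ
    diy bit is set opens a DIY region; the matching return closes it.
    """
    diy, depth, out = False, 0, []
    for ev in events:
        kind = ev[0]
        if kind == "call":
            if diy:
                depth += 1            # nested call inside the DIY region
            elif ev[1]:
                diy, depth = True, 1  # BOQ bit set: DIY region begins
        elif kind == "ret" and diy:
            depth -= 1
            if depth == 0:
                diy = False           # back at the region's own depth
        elif kind == "branch":
            out.append("own predictor" if diy else "branch queue")
    return out

events = [("branch",), ("call", True), ("branch",), ("call", False),
          ("branch",), ("ret",), ("ret",), ("branch",)]
sources = branch_source(events)
```

Without the depth counter, the first return inside the nested call would end the DIY region prematurely and desynchronize the two threads on the branch queue.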
Raj Parihar Advanced Computer Architecture Lab University of Rochester 25
Experimental Setup
Program/binary analysis tool: ALTO
Simulator: detailed, cycle-level, in-house out-of-order simulator
SMT, look-ahead, and speculative parallelization support
True execution-driven simulation (faithful value modeling)
Genetic algorithm framework
Modeled as offline and online extensions to the simulator
Microarchitectural configurations:
Baseline core (similar to POWER5):
  Fetch/Decode/Issue/Commit: 8 / 4 / 6 / 6
  ROB: 128 entries
  Functional units: INT 2 + 1 mul + 1 div; FP 2 + 1 mul + 1 div
  Fetch Q / Issue Q / Reg. (int, fp): (32, 32) / (32, 32) / (80, 80)
  LSQ (LQ, SQ): 64 (32, 32), 2 search ports
  Branch predictor: Gshare, 8K entries, 13-bit history
  Branch misprediction penalty: at least 7 cycles
  L1 data cache (private): 32KB, 4-way, 64B line, 2 cycles, 2 ports
  L1 inst cache (private): 64KB, 2-way, 128B line, 2 cycles
  L2 cache (shared): 1MB, 8-way, 128B line, 15 cycles
  Memory access latency: 200 cycles
Look-ahead core: baseline core with only LQ, no SQ
  L0 cache: 32KB, 4-way, 64B line, 2 cycles; round-trip latency to L1: 6 cycles
Communication: Branch Output Queue: 512 entries; reg copy latency (recovery): 64 cycles
Table 1: Microarchitectural configurations.
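The genetic algorithm framework above searches over which instructions to keep in the skeleton. A toy sketch of that idea, where the chromosome is a binary keep-mask and the fitness function stands in for a simulator run (all names and parameters here are illustrative assumptions, not the thesis implementation):

```python
import random

# Sketch: GA over a skeleton bit mask (1 = keep instruction in skeleton).
def evolve(fitness, n_genes, pop_size=30, generations=60, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # elitism: keep the best half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_genes)       # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.1:                # occasional mutation
                child[rng.randrange(n_genes)] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Toy fitness: pretend dropping instructions 2 and 5 speeds up look-ahead,
# with a slight preference for keeping everything else.
best = evolve(lambda m: -(m[2] + m[5]) + 0.01 * sum(m), n_genes=8)
```

In the thesis the fitness evaluation is a (simulated) run of the look-ahead system with the candidate skeleton, used both offline and as an online extension to the simulator.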
Individual Performance Gains
Speedup of DIY branches over baseline look-ahead: 1.08x
Speedup of Skeleton Payload Tuning: 1.12x
Combined speedup (DIY + Payload Tuning): 1.15x
[Figure: speedup over single-thread for gcc, mcf, eon, pbmk, bzip2, twolf, wup, mgrid, art, eqk, face, ammp, lucas, fma3d, and Gmean. Bars: baseline decoupled look-ahead; DIY branch based; skeleton payload tuned; DIY + skeleton payload tuned (15% over baseline); weak-instruction removal (16.2%) shown for reference.]
Overall Performance Gain
Final decoupled look-ahead system
Skeleton payload tuning + Weak dependence + DIY branches
Performance speedup over:
Baseline look-ahead: 1.20x Single-thread: 1.61x
[Figure: speedup over baseline DLA for gcc, mcf, eon, pbmk, bzip2, twolf, wupwise, mgrid, art, equake, facerec, ammp, lucas, fma3d, and gmean. Bars: weak dependence removed DLA; weak dep + DIY + payload tuned DLA.]
Salient Features of Decoupled Look-ahead
Hard-to-predict, data-dependent branches
Conventional predictors do not capture data-dependent behaviors
Prefetching for cold misses
Conventional prefetchers take time to learn access/miss patterns
Other potential advantages:
Can pass dependence information for thread-level speculation
Can assist in value prediction (close to 90% correctness)
Potential hurdles and showstoppers:
If no distillation is possible: look-ahead thread can run at a higher clock
JIT: a fixed mask is an issue, but the mask can be evolved dynamically
Future Explorations
Effective look-ahead to improve L1 prefetching performance
Speeding up critical threads and serial bottlenecks via a shared look-ahead agent in multi-threaded applications
Cost effective SMT implementation of decoupled look-ahead
Role of look-ahead in promoting parallelization, value prediction, and acceleration of interpreted programs
Backward strawman: integrate non-speculative look-ahead computations directly into the main thread
Details in the Thesis and Papers
Decoupled Look-ahead Architecture:
Weak dependence removal in decoupled look-ahead [HPCA’14]
Load balancing in look-ahead via DIY branches [PACT-SRC’15]
Speculative parallelization in decoupled look-ahead [PACT’11]
DIY branches and payload tuning [in prep. for HPCA’17]
Shared Cache Management:
Hardware support for protective and collaborative caches [ISMM’16]
Protection and utilization in shared cache via rationing [PACT’14]
A coldness metric for cache optimization [MSPC’13]
Motivation: Cache Rationing
Compute systems with shared resources are prevalent today
Multi-core clusters, cloud computing, data centers, server farms
Programs often compete for shared caches and other resources
Significant performance loss due to co-run interference: >25%
[Figure: IPC normalized to solo run with 512 KB L2, for SPEC 2000 benchmarks 1-26 co-run with equake (2 cores, 1 MB L2 cache), under equal partitioning, no partitioning, rationing, and PIPP-equal (one outlier at 2.32).]
Idea of Cache Rationing
Achieve both resource protection and utilization simultaneously
Rationing policy:
Initial ration: every program is assigned an initial portion of the cache
Non-intrusive sharing: a program can exceed its allocated ration only if another program is not using its ration
Entitlement: if a program is using its ration, it cannot be taken away by peer programs
Conservative sharing: provides a safety net for less aggressive programs in the presence of non-cooperative programs
R. Parihar, J. Brock, C. Ding, M. Huang, “Hardware support for protective and collaborative cache sharing”, ISMM’16
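The three rules above can be sketched for a single cache set. This is an illustrative model under stated assumptions (tuple-based blocks, a `length`-free access-bit scheme); the ISMM'16 hardware differs in detail:

```python
# Sketch: one rationed cache set. A core under its ration may reclaim
# ways from an over-ration peer (entitlement); a core at its ration may
# only borrow ways whose access bit is clear, else it recycles its own.
class RationedSet:
    def __init__(self, ways, rations):
        self.blocks = [None] * ways          # (owner, tag, access_bit)
        self.rations = rations               # e.g. {0: 2, 1: 2}

    def _usage(self, core):
        return sum(1 for b in self.blocks if b and b[0] == core)

    def access(self, core, tag):
        for i, b in enumerate(self.blocks):
            if b and b[1] == tag:
                self.blocks[i] = (b[0], tag, True)   # hit: set access bit
                return "hit"
        self._insert(core, tag)
        return "miss"

    def _insert(self, core, tag):
        for i, b in enumerate(self.blocks):          # use a free way first
            if b is None:
                self.blocks[i] = (core, tag, True)
                return
        if self._usage(core) < self.rations[core]:
            # Under ration: some peer must be over its ration; reclaim
            # from the most over-ration owner (entitled reclaim).
            over = max(self.rations,
                       key=lambda c: self._usage(c) - self.rations[c])
            i = next(j for j, b in enumerate(self.blocks) if b[0] == over)
        else:
            # At/over ration: borrow only ways peers are not using
            # (access bit clear); otherwise replace one of our own.
            idle = [j for j, b in enumerate(self.blocks)
                    if b[0] != core and not b[2]]
            own = [j for j, b in enumerate(self.blocks) if b[0] == core]
            i = idle[0] if idle else own[0]
        self.blocks[i] = (core, tag, True)
```

Even if one core fills the whole set first, a later arrival can reclaim up to its ration but no further, which is exactly the protective half of the policy.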
Hardware Support
Ration Accounting: ration counter-register pairs
To track the current usage of a program, maintained per core per set
Usage Tracking: access-bit and block owner
To detect unused ration and ensure entitlement, one per cache block
[Figure: a cache with s sets and w ways. The tag array adds an access bit, a status bit, and a block owner per block; a ration tracker holds p counter-register pairs per set, with ration counters driving owner allocation into the data array.]
Additional storage overhead: <1% of total cache storage
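A back-of-the-envelope check of the <1% overhead claim, under assumed parameters (2 cores, 1 MB L2, 8 ways, 128 B lines, 4-bit ration counters; the exact field widths in the thesis may differ):

```python
# Sketch: storage overhead of rationing metadata vs. the data array.
cache_bytes = 1 << 20                       # 1 MB shared L2
line_bytes, ways, cores = 128, 8, 2
blocks = cache_bytes // line_bytes          # 8192 blocks
sets = blocks // ways                       # 1024 sets

owner_bits = 1                              # 2 cores -> 1-bit block owner
per_block_bits = 1 + owner_bits             # access bit + block owner
counter_bits = 4                            # enough to count up to 8 ways

overhead_bits = blocks * per_block_bits + sets * cores * counter_bits
fraction = overhead_bits / (cache_bytes * 8)
# roughly 0.3% of the data array, consistent with the <1% claim
```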
Resource Protection Co-Run
Co-run with a high-pressure peer (mcf)
Rationing: achieves good resource protection, similar to partitioning
No partitioning: almost every co-run is unhealthy, with high damage
[Figure: IPC normalized to solo run with 512 KB L2, for SPEC 2000 benchmarks 1-26 co-run with mcf (2 cores, 1 MB L2 cache), under equal partitioning, no partitioning, rationing, and PIPP-equal (one outlier at 1.52).]
INT: 1-gzip, 2-vpr, 3-gcc, 4-mcf, 5-crafty, 6-parser, 7-eon, 8-perlbmk, 9-gap, 10-vortex, 11-bzip2, 12-twolf
FP: 13-wupwise, 14-swim, 15-mgrid, 16-applu, 17-mesa, 18-galgel, 19-art, 20-equake, 21-facerec, 22-ammp, 23-lucas, 24-fma3d, 25-sixtrack, 26-apsi
Capacity Utilization Co-Run
Co-run with a low-pressure peer (eon): cache demand <128 KB
Rationing: utilizes cache well and speeds up 14 applications without slowing down any co-running program
No partitioning: also speeds up 13 applications, but at the cost of slowing down 11 co-running programs
[Figure: IPC normalized to solo run with 512 KB L2, for SPEC 2000 benchmarks 1-26 co-run with eon (2 cores, 1 MB L2 cache), under equal partitioning, no partitioning, rationing, and PIPP-equal (one outlier at 1.744).]
Summary
Decoupled look-ahead can uncover significant implicit parallelism
However, the look-ahead thread often becomes a new bottleneck
Fortunately, thanks to its lack of correctness constraints, the look-ahead thread lends itself to various optimizations
Weak instructions can be removed w/o affecting look-ahead quality
Side-effect-free, "easy-to-predict" DIY branches can be skipped
Skeleton payload can be tuned w/o incurring extra recoveries
Metaheuristic-based self-tuning approach is simple and robust
Improves single-thread performance by 1.61x
Much better than conventional turbo boost and frequency scaling
Multi-threaded workloads can benefit from speeding up serial sections and bottleneck threads in critical regions
Acknowledgments
Funding agencies: NSF, NSFC
Prof. Michael C. Huang, Alok Garg
Prof. Chen Ding and his research group at URCS
Past & current members of the Advanced Computer Architecture Lab at University of Rochester
Backup Slides
Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism
Raj Parihar
Advisor: Prof. Michael C. Huang
Department of Electrical & Computer Engineering
University of Rochester, Rochester, NY
Summary of Distillation Techniques
Convert biased branches to unconditional "taken" or "not taken"
Eliminate stores from long-distance store-load pairs
Stores would have been committed by main thread
Selective value (zero) substitution for the L2 misses
If the look-ahead distance drops below a threshold
Speculative parallelization of skeleton w/o any rollback support
Weak dependence/instruction removal from skeleton
Consecutive loop iterations accessing the same cache line
A wide variety of library calls and reduction operations
Selective payload: eliminate payload if it slows look-ahead
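Two of the rules above can be sketched as a simple filter over a profiled trace. This is illustrative only (dict keys and thresholds are assumptions, not the thesis toolchain, which works on binaries via ALTO):

```python
# Sketch: two distillation rules -- biased-branch conversion and
# long-distance store elimination (the main thread will have committed
# such a store before the look-ahead's consuming load arrives).
def distill(trace, bias_threshold=0.99, store_load_dist=5000):
    """trace: dicts with 'op', plus profiled 'taken_ratio' for branches
    and 'next_load_dist' (instructions to next consuming load) for stores."""
    skeleton = []
    for ins in trace:
        if ins["op"] == "branch":
            r = ins["taken_ratio"]
            if r >= bias_threshold:
                skeleton.append({"op": "jump"})        # always taken
            elif r <= 1 - bias_threshold:
                continue                               # always fall through
            else:
                skeleton.append(ins)                   # keep hard branch
        elif ins["op"] == "store" and ins["next_load_dist"] > store_load_dist:
            continue                                   # long-distance store
        else:
            skeleton.append(ins)
    return skeleton
```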
Practical Advantages of Decoupled Look-ahead
Micro helper thread based approach:
Targets top cache misses and branch mispredictions (low coverage)
Support for quick spawning and register communication (not trivial)
Decoupled look-ahead approach:
Easy to disable, low management overhead on main thread
Natural throttling to prevent run-away prefetching, cache pollution
[Figure: speedup over single-thread across 25 SPEC benchmarks (gzip through apsi) and Gmean; series: speculative slice limit (ideal) and decoupled look-ahead.]
Practical Advantages of Decoupled Look-ahead
Look-ahead thread is a self-reliant agent, completely independent of the main thread
No need for quick spawning and register communication support
Low management overhead on main thread
Easier for run-time control to disable
Natural throttling mechanism to prevent run-away prefetching and cache pollution
Look-ahead thread size comparable to the aggregation of short helper threads
Cache misses:
           90%            95%
           DI      SI     DI      SI
bzip2      1.86    17     3.15    27
crafty     0.73    23     1.04    38
eon        2.28    50     3.34    159
gap        1.35    15     1.44    23
gcc        8.49    153    8.84    320
gzip       0.1     6      0.1     6
mcf        13.1    13     14.7    16
parser     1.31    41     1.59    57
pbmk       1.87    35     2.11    52
twolf      2.69    23     3.28    28
vortex     1.96    42     2       67
vpr        7.47    16     11.6    22
Avg        3.60%   36     4.44%   68

Branch mispredictions:
           90%            95%
           DI      SI     DI      SI
bzip2      3.9     52     4.49    64
crafty     5.33    235    6.14    309
eon        2.02    19     2.31    23
gap        2.02    77     2.64    130
gcc        8.08    1103   8.41    1700
gzip       8.41    40     8.66    52
mcf        9.99    14     10.2    18
parser     6.81    130    7.3     183
pbmk       2.88    92     3.21    127
twolf      5.75    41     6.48    56
vortex     1.24    114    1.97    167
vpr        4.8     6      4.88    7
Avg        5.10%   160    5.56%   236
Correlation with RTL/FPGA Accurate Simulator
Reported performance improvement results are very pessimistic
Optimistic branch misprediction latency: 7 vs. 15 cycles
Fixed memory latency, no queuing delays in L1/L2 interfaces
RTL-accurate simulator shows 2x more performance potential
[Figure: speedup over single-thread for Perfect BP, Perfect L2, Perfect L2+BP, Perfect L1, Perfect L1+BP, and DLA, comparing SimpleScalar with the RTL-accurate IMG-psim; DLA projected at 1.56x.]
Simplified Look-ahead Core
Baseline skeleton: 71%; after distillation: 57%
2-wide look-ahead core (front end is still 8-wide):
2x power savings for RAT and other traditional hotspots
Reduces overall power overhead of the look-ahead system by 10%
[Figure: power of look-ahead core components (Rename, ROB, Int IQ, Fp IQ, Decode, RAT-dec, RAT-wl, RAT-bl, DCL-cmp, LSQ, Total), normalized to the 4-wide DLA, for 3-wide and 2-wide look-ahead cores.]
Baseline vs. Tuned Skeleton
Distilled skeleton enables simplification of the look-ahead core
Better power and energy efficiency without compromising speed
Energy efficiency: 17% better than single-thread
Power overhead: 1.38x over single-thread, down from 1.53x for the baseline decoupled look-ahead
[Figure: speedup over single-thread (1.0 to 1.5) of baseline decoupled look-ahead vs. DIY+skeleton-payload look-ahead for 4-, 3-, and 2-wide look-ahead cores, on the INT and FP suites.]
Hybridization: Heuristically Designed Initial Solutions
Genetic evolution could be a slow and lengthy process
Heuristic-based solutions help jump-start the evolution
Heuristically designed solutions in our system:
Superposition chromosome; Orthogonal subroutine chromosome
[Figure: initial chromosome construction: (a) single-gene chromosomes with single-instruction genes, (b) superposition chromosomes, (c) orthogonal chromosomes built from subroutines A, B, and C with multi-instruction genes.]
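To make the superposition idea concrete, here is a minimal C sketch (not the thesis code; `chromosome_t` and `N_INSTS` are illustrative assumptions): each chromosome is a bit vector marking which static instructions are removed from the skeleton, and a superposition chromosome seeds the initial population with the union of several promising single-gene solutions.

```c
#include <stdint.h>
#include <string.h>

#define N_INSTS 64  /* illustrative number of static skeleton instructions */

/* A chromosome marks which static instructions are removed (1 = removed). */
typedef struct { uint8_t removed[N_INSTS]; } chromosome_t;

/* Superposition chromosome: the union of the removals of several
 * promising parents, used to jump-start the initial GA population. */
void superpose(chromosome_t *out, const chromosome_t *parents, int n_parents) {
    memset(out->removed, 0, sizeof out->removed);
    for (int p = 0; p < n_parents; p++)
        for (int i = 0; i < N_INSTS; i++)
            out->removed[i] |= parents[p].removed[i];
}
```

The orthogonal variant would instead restrict each seed chromosome's set bits to one subroutine's instructions, so initial solutions do not overlap.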
Online Genetic Evolution: equake
Primary overhead comes from testing bad skeleton configs
Break-even point: 1.8 billion insts (1-2 sec of native execution)
By 4.6 billion insts, the overall cumulative speed is already 10% faster
[Figure: accumulated IPC (top) and distributed IPC (bottom) over roughly 4.7 billion instructions (1 epoch = 1 million instructions) for the single-thread baseline, baseline decoupled look-ahead, and online self-tuned look-ahead on equake.]
Comparison with Other Proposals
Speculative slices [Zilles and Sohi: ISCA’00, ISCA’01]
Speculative slices achieve only 57% of their ideal speedup of 13%
Dual core execution or DCE [Zhou: PACT’05]
DCE achieves about 16% speedup over single-thread
For integer codes the speedup is substantially lower (<10%)
[Figure: speedup over single-thread (1.0 to 2.4) across SPEC benchmarks for the speculative-slice limit, dual-core execution (DCE_64), and self-tuned decoupled look-ahead; one bar reaches 5.94.]
Potential DIY Modules
Loop iterations accessing same cache line, reduction operations
Library function calls: printf, OtsMove, OtsFill, etc.
A case in point: mark_modified_reg() from 176.gcc
Dynamic contribution: 3%; performance speedup: 10%
static void
mark_modified_reg (dest, x)
     rtx dest;
     rtx x;
{
  int regno, i;

  /* Look through a SUBREG to the underlying register.  */
  if (GET_CODE (dest) == SUBREG)
    dest = SUBREG_REG (dest);

  /* A memory destination marks memory as modified.  */
  if (GET_CODE (dest) == MEM)
    modified_mem = 1;

  if (GET_CODE (dest) != REG)
    return;

  regno = REGNO (dest);
  if (regno >= FIRST_PSEUDO_REGISTER)
    modified_regs[regno] = 1;
  else
    /* A hard register may span several registers.  */
    for (i = 0; i < HARD_REGNO_NREGS (regno, GET_MODE (dest)); i++)
      modified_regs[regno + i] = 1;
}
Skeleton Payload Distribution
Baseline skeleton payload: biased branches turned unconditional + L2 prefetches + L1 prefetches + software prefetches
Optimal only 30% of the time
For the remaining 70% of the time other payloads are optimal
Performance potential of customized payloads: 1.21x
[Figure: share of best epochs (%) per skeleton payload (combinations of biased branches bB/B with L1, L2, and software prefetches, plus All-inst and ST; epoch = 10k insts) across gcc, mcf, eon, pbmk, bzip2, twolf, wup, mgrd, art, eqk, face, amp, luc, and fma.]
Skeleton Payload Tuning Framework
Collects the performance of various payloads over regular epochs
Associates each static code region with its best-performing payload
[Figure: payload tuning pipeline: (A) per-epoch payload performance tables (start PC, instruction count, cycles) for payloads such as B+L1+L2, bB+L1, and bB+L2; (B) best payload per epoch; (C) per-PC <payload#: count> tuples; (D) best payload per PC; (E) final skeleton binding each PC to its payload, e.g., 0xAA: bB+L2, 0xBB: B+L1+L2.]
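The tuning steps above can be sketched in a few lines of C. This is a simplified illustration, not the thesis implementation: payload indices, table sizes, and function names are assumptions. Each epoch credits the fastest payload for the epoch's starting PC; the final skeleton binds each PC to its most frequent winner.

```c
#include <stdint.h>

#define N_PAYLOADS 4  /* illustrative: e.g. bB, bB+L2, B+L1, B+L1+L2 */
#define N_PCS      8  /* illustrative number of tracked region-start PCs */

/* (C) Per-PC tuples: how often each payload was the best in an epoch
 * starting at this PC. */
static int best_count[N_PCS][N_PAYLOADS];

/* (A)->(B): given the measured cycles of each payload for one epoch,
 * credit the fastest payload for the epoch's start PC. */
void record_epoch(int start_pc, const uint32_t cycles[N_PAYLOADS]) {
    int best = 0;
    for (int p = 1; p < N_PAYLOADS; p++)
        if (cycles[p] < cycles[best])
            best = p;
    best_count[start_pc][best]++;
}

/* (D)->(E): the final skeleton binds each PC to the payload that won
 * the most epochs starting there. */
int best_payload(int start_pc) {
    int best = 0;
    for (int p = 1; p < N_PAYLOADS; p++)
        if (best_count[start_pc][p] > best_count[start_pc][best])
            best = p;
    return best;
}
```

Majority voting over epochs filters out noise from any single epoch's timing, which matters when epochs are as short as 10k instructions.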
Performance Impact of Duty Cycle
One DIY call example from 179.art
[Figure: performance gain over DLA (%) vs. duty cycle (5 to 100%, not to scale) for WeightAdj() in 179.art.]
Weak Dependence: Insights and Findings
The evolution process is remarkably robust
Different inputs and configurations do not invalidate the results
Sampling can accelerate the fitness test without appreciable impact on the quality of the solution found
Energy reduction is due to less activity and stalling
About 10% of dynamic instructions are removed from the skeleton
11% energy saving over the baseline decoupled look-ahead
Impact of weak-instruction removal on look-ahead quality is very small
Similar prefetch and branch hint accuracy
Comparison with Speculative Parallel Look-ahead
Self-tuned skeleton is used in the speculative parallel look-ahead
In some cases, the self-tuned and speculative parallel look-ahead techniques are synergistic (ammp, art)
Unique Opportunities for Speculative Parallelization
Skeleton code offers more parallelism
Certain dependences are removed during slicing for the skeleton
Short-distance dependence chains become long-distance chains, suitable for TLP exploitation
Look-ahead is inherently error-tolerant
Can ignore dependence violations
Little to no support needed, unlike in conventional TLS
[Figure: example dependence chains in the skeleton, shown as numbered Alpha instructions (ldt, lda, ldl, ldq, stt, bis at PCs 0x12000da84 through 0x120011b04) with dependence distances annotated.]
A. Garg, R. Parihar, M. Huang, “Speculative Parallelization in Decoupled Look-ahead”, PACT’11
Software Support
Dependence analysis
Profile-guided, coarse-grained at the basic block level
Spawn and Target points
Basic blocks with a consistent dependence distance above a threshold DMIN
The spawned thread executes from the target point
Loop-level parallelism is also exploited
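The spawn-point criterion can be sketched as a simple predicate over profile data. This is an illustrative reduction, not the actual profiler: `bb_profile_t` and its fields are assumptions summarizing what the profile-guided analysis would record per basic block.

```c
#define DMIN 15  /* minimum dependence distance, in basic blocks */

/* Illustrative profile summary for one basic block. */
typedef struct {
    int min_dep_distance;  /* shortest observed producer-to-consumer distance (in BBs) */
    int consistent;        /* 1 if the distance was stable across profile runs */
} bb_profile_t;

/* A basic block qualifies as a spawn target when every incoming
 * dependence is consistently at least DMIN basic blocks away, so a
 * thread spawned DMIN blocks earlier rarely violates a dependence. */
int is_spawn_target(const bb_profile_t *bb) {
    return bb->consistent && bb->min_dep_distance >= DMIN;
}
```

Because look-ahead tolerates occasional violations, the "consistent" requirement can be a statistical threshold rather than an absolute guarantee.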
Parallelism Potential in Look-ahead Binary
Available parallelism for a 2-core/context system; DMIN = 15 basic blocks
The skeleton exhibits significantly more BB-level parallelism (17%)
Loop-based FP applications exhibit more BB-level parallelism
[Figure: approximate parallelism (1.0 to 2.0) of the original binary vs. the skeleton for crafty, eon, gzip, mcf, pbmk, twolf, vortex, vpr, ammp, art, eqk, fma3d, galgel, lucas, and their gmean.]
Hardware and Runtime Support
Thread spawning and merging are very similar to regular thread spawning, except:
The spawned thread shares the same register and memory state
The spawning thread terminates at the target PC
Value communication
Register-based: naturally through shared registers in SMT
Memory-based communication can be supported at different levels
Partial versioning in the cache at line granularity
[Figure: spawn/merge timeline for look-ahead threads 0 and 1: the rename table is duplicated and a context set up at spawn, and the duplicated state is cleaned up at merge.]
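Line-level partial versioning can be illustrated with a toy model (an assumption-laden sketch, not the proposed hardware: real designs track versions per cache way and handle eviction). A store from the spawned thread creates a private copy of the line; its loads prefer that copy; at merge the duplicated state is simply discarded, since look-ahead tolerates the resulting imprecision.

```c
#include <stdint.h>

#define N_LINES 4  /* illustrative cache size */

typedef struct {
    uint32_t shared;       /* value visible to the spawning thread */
    uint32_t versioned;    /* private copy for the spawned thread */
    int      has_version;  /* 1 once the spawned thread has written the line */
} line_t;

static line_t cache[N_LINES];

/* A store from the spawned thread creates/updates its private version. */
void spawned_store(int line, uint32_t v) {
    cache[line].versioned = v;
    cache[line].has_version = 1;
}

/* A load from the spawned thread prefers its private version, if any. */
uint32_t spawned_load(int line) {
    return cache[line].has_version ? cache[line].versioned
                                   : cache[line].shared;
}

/* At merge, the duplicated (versioned) state is discarded: look-ahead
 * is error-tolerant, so no write-back or squash is needed. */
void merge(void) {
    for (int i = 0; i < N_LINES; i++)
        cache[i].has_version = 0;
}
```

Discard-at-merge is what makes the hardware support light compared with conventional TLS, which must detect violations and roll back.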
Speedup of Speculative Parallelization
Applications in which the look-ahead thread is a bottleneck
Speculative look-ahead over decoupled look-ahead: 1.13x
[Figure: speedup over single-thread (1 to 5) of baseline look-ahead vs. speculatively parallel look-ahead for crafty, eon, gzip, mcf, pbmk, twolf, vortex, vpr, ammp, art, eqk, fma3d, galgel, lucas, and gmean.]
Speculative Look-ahead vs Conventional TLS
Skeleton provides more opportunities for parallelization
Speculative look-ahead over the decoupled look-ahead baseline: 1.13x
Speculative main thread over the single-thread baseline: 1.07x
[Figure: speedup over the respective baseline (0.8 to 1.6) for the speculatively parallel main thread and the speculatively parallel look-ahead on the same benchmarks; one bar reaches 1.65.]
Baseline Cache Partitioning
Baseline (naive) cache partitioning/sharing policies:
Hard partition: every program gets an equal cache share
No partition: programs can use any portion of the shared caches
Two extremes: Resource protection vs. Capacity utilization
Unrelated program co-run: individual slowdowns may not be justifiable if the programs come from different users
Unlike slowing down a thread occasionally to improve throughput
Cache rationing achieves good cache protection and utilization without slowing down individual programs
Microthreads vs Decoupled Look-ahead
[Figure: side-by-side comparison of lightweight microthreads and decoupled look-ahead.]
Look-ahead Skeleton Construction
Under-clocked Dual-core Speedup
Typically a dual-core can be clocked only up to 90% of the clock frequency of a single-core system
After adjusting the frequency of the single core:
Single-core IPC: 1.80 (INT), 2.28 (FP), 2.05 (Combined)
Baseline look-ahead over a 10% over-clocked single-thread:
Speedup: 1.13x (INT), 1.34x (FP), 1.24x (Combined)
Self-tuned look-ahead over single-thread (for 14 applications):
Speedup: 1.20x (INT), 1.96x (FP), 1.43x (Combined)
Self-tuned Look-ahead: SPEC 2006
Self-tuned look-ahead achieves 1.10x speedup over baseline look-ahead for SPEC CPU 2006 applications
[Figure: speedup over single-thread (1 to 8) of baseline look-ahead vs. GA-based look-ahead for perl, bzp, gcc, mcf, go, hmer, sjen, libq, h264, omn, astr, xaln, milc, deal, splx, and gmean.]
Self-tuned Look-ahead: Speedup Analysis
A larger code base (with more genes) takes slightly longer to evolve
[Figure: relative performance gain, defined as (GA - DLA) / (Ideal - DLA), vs. number of static instructions for SPEC 2000 and SPEC 2006, on a log scale; linear regression line with r = -0.46.]
Self-tuned Look-ahead: Speedup Analysis
Performance gain has strong correlation with # of generations
[Figure: saturation generation, i.e., the generation reaching at least 90% of the best GA solution, vs. number of static instructions for SPEC 2000 and SPEC 2006; linear regression line with r = 0.56.]
Partial Recovery in Speculative Parallelization
Flexibility in Look-ahead Hardware Design
Comparison of regular (partial versioning) cache support with two other alternatives:
No cache versioning support
Dependence violation detection and squash
Genetic Algorithm Evolution
Multi-instruction Gene Examples
Superposition based Chromosomes
Recovery based Early Termination of Fitness Test
Optimizations to Implementation
Fitness test optimizations
Sampling-based fitness
Multi-instruction genes
Early termination of tests
GA framework optimizations
Hybridization of solutions
Adaptive mutation rate
Unique chromosomes
Fusion crossover operator
Elitism policy
Sampling based Fitness Test
L2 Cache Sensitivity Study
Speedup for various L2 caches is quite stable
1.139x (1 MB), 1.133x (2 MB), and 1.131x (4 MB) L2 caches
Avg. speedups, shown in the figure, are relative to single-threaded execution with a 1 MB L2 cache
Approximable Program Paradigm
Weak dependence removal and speculative parallelization techniques can be applied to any approximate program
A few real-life examples of approximate computing:
Google search: does not work with a coherent, up-to-date database
Map-Reduce paradigm: ignores consistently failing records
Media applications: photo, audio, and video have some tolerance
Algorithm- and application-level approximations:
Modern benchmarks, e.g., PARSEC, are fundamentally approximate
Application space: clustering, prediction, optimization, etc.
BenchNN: a neural network based alternative to PARSEC
References (Partial)
Decoupled Access/Execute Computer Architectures. J. Smith, ACM TC'84
A Study of Slipstream Processors. Z. Purser, K. Sundaramoorthy, E. Rotenberg, MICRO'00
Master/Slave Speculative Parallelization. C. Zilles, G. Sohi, MICRO'02
A Performance-Correctness Explicitly Decoupled Architecture. A. Garg, M. Huang, MICRO'08
Speculative Parallelization in Decoupled Look-ahead. A. Garg, R. Parihar, M. Huang, PACT'11
Accelerating Decoupled Look-ahead via Weak Dependence Removal: A Metaheuristic Approach. R. Parihar, M. Huang, HPCA'14