Outline: Motivation · Baseline decoupled look-ahead · Look-ahead thread acceleration · Additional insights and summary
Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism
Raj Parihar
Advisor: Prof. Michael C. Huang
Department of Electrical & Computer Engineering, University of Rochester, Rochester, NY
Raj Parihar Advanced Computer Architecture Lab University of Rochester
Motivation
Despite the proliferation of multi-core, multi-threaded systems
High single-thread performance is still an important CPU design goal
Modern programs do not lack instruction level parallelism
[Figure: IPC (log scale, 1 to 50) of bzip2, crafty, eon, gap, gcc, gzip, mcf, pbmk, twolf, vortex, vpr, and Gmean for ideal vs. real machines with window sizes of 128, 512, and 2K.]
Real challenge: exploit implicit parallelism without undue cost
One effective approach: Decoupled look-ahead architecture
Motivation
Decoupled look-ahead architecture targets
Performance hurdles: branch mispredictions, cache misses, etc.
Exploration of parallelization opportunities, dependence information
Microarchitectural complexity, energy inefficiency through decoupling
The look-ahead thread can often become a new bottleneck
Lack of correctness constraint allows many optimizations
Weak dependence: Instructions that contribute marginally to the outcome can be removed w/o affecting the quality of look-ahead
Do-It-Yourself branches: Side-effect free, “easy-to-predict” branches can be skipped in the look-ahead thread
Outline
Motivation
Baseline decoupled look-ahead
Look-ahead: a new bottleneck
Look-ahead thread acceleration
Weak dependences/instructions
Do-It-Yourself branches & skeleton tuning
Experimental analysis
Additional insights and summary
Baseline Decoupled Look-ahead Architecture
Skeleton generated just for the look-ahead purposes
The skeleton runs on a separate core and:
Speculative state is completely contained within the look-ahead context
Sends branch outcomes through a FIFO queue; also helps prefetching
[Diagram: the look-ahead core (with an L0$) executes the look-ahead skeleton while the main core (with an L1$) executes the program binary; both share the L2$ and main memory. The look-ahead core feeds the main core (1) branch predictions through a branch queue and (2) prefetching hints; register state synchronization recovers the look-ahead context.]
A. Garg and M. Huang, “A Performance-Correctness Explicitly Decoupled Architecture”, MICRO’08
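The split above (a skeleton running ahead on a separate core, handing branch outcomes to the main core through a bounded FIFO) can be sketched in a few lines of Python. This is a hypothetical toy model, not the thesis's implementation; the function names and the dict-based instruction trace are invented for illustration:

```python
from collections import deque

BRANCH_QUEUE_DEPTH = 512   # entries, matching the baseline configuration

def run_lookahead(skeleton, branch_queue):
    """Look-ahead core: executes the skeleton, queuing every branch outcome."""
    for inst in skeleton:
        if inst["kind"] == "branch":
            branch_queue.append(inst["taken"])   # an outcome, not a guess

def run_main(program, branch_queue):
    """Main core: consumes queued outcomes as near-perfect 'predictions'."""
    hits = total = 0
    for inst in program:
        if inst["kind"] == "branch":
            total += 1
            if branch_queue:                     # outcome arrived in time
                hits += branch_queue.popleft() == inst["taken"]
    return hits, total

# Toy trace: 10 branches; this skeleton preserves them and their slices,
# so the queued outcomes match the main thread's branches exactly.
program = [{"kind": "branch", "taken": i % 3 == 0} for i in range(10)]
skeleton = list(program)

queue = deque(maxlen=BRANCH_QUEUE_DEPTH)
run_lookahead(skeleton, queue)
hits, total = run_main(program, queue)
```

Because the look-ahead thread executes (rather than predicts) the branches, every outcome it queues in time resolves a branch on the main core.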
Example (Alpha code):
Program binary:
addq v0, v0, v0
subq v0, t0, a2
cmovge a2, a2, v0
addq v0, v0, v0
subq v0, t0, a2
cmovge a2, a2, v0
subq a1, 0x2, a1
addq v0, v0, v0
bgt a1, 0x12001f9a0
subq v0, t0, a2
Look-ahead skeleton (removed instructions become nops):
addq v0, v0, v0
nop
...
bgt a1, 0x12001f9a0
subq v0, t0, a2
Look-ahead: A New Bottleneck
Comparing four systems to discover new bottlenecks
Single-thread, decoupled look-ahead, ideal, and look-ahead limit
Application categories:
Bottleneck removed or speed of look-ahead is not an issue (left half)
Look-ahead thread is the new bottleneck (right half)
[Figure: IPC (0 to 4) of aplu, msa, wup, mgri, six, swim, facr, gal, gcc, gap, eon, fma3, gzip, craf, vrtx, apsi, vpr, bzp2, equk, amp, luc, art, perl, mcf, two for the look-ahead limit, single-thread, decoupled look-ahead, and ideal (cache, branch) systems.]
Weak Dependences/Instructions
Not all instructions are equally important and critical
Examples of weak instructions:
Inconsequential adjustments
Load and store instructions that are (mostly) silent
Dynamic NOP instructions
Plenty of weak instructions are present in programs (hundreds of them)
Weak instructions can be experimentally defined and their impact quantified in isolation
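The "in isolation" test can be sketched as: drop one static instruction from the skeleton, re-run, and compare against the baseline. The following toy Python model is invented for illustration (the performance function is a hypothetical stand-in, not measured data):

```python
# Toy model of "weakness in isolation": shorter skeletons run faster,
# except that removing instruction 3 (a "strong" one) ruins look-ahead
# quality (worse prefetches and branch outcomes).
def toy_perf(skeleton):
    quality = 0.5 if 3 not in skeleton else 1.0
    return quality * 100.0 / len(skeleton)

def weak_in_isolation(skeleton, perf):
    """An instruction is weak if removing it alone does not hurt."""
    base = perf(skeleton)
    return [i for i in skeleton
            if perf([j for j in skeleton if j != i]) >= base]

skeleton = list(range(10))       # ten static instructions, by id
weak = weak_in_isolation(skeleton, toy_perf)
```

Every instruction except the strong one passes the isolation test; as the next slides show, passing in isolation does not mean the set is safe to remove together.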
Challenge #1: Weak instructions do not look different
After-the-fact analysis based on static attributes of instructions reveals:
Static attributes of weak and regular instructions are remarkably similar
Correlation coefficient of the two distributions is very high (0.96)
Weakness has very poor correlation with static attributes
Hard to identify the weak instructions through static heuristics
[Figure: number of inputs (0 to 2) per instruction type (addq, clr, cmovne, cmptlt, divt, fneg, ldah, ldt, mult, s4addq, sll, stq, subq, zapnot) for weak instructions vs. strong instructions; the two distributions look nearly identical.]
Challenge #2: False positives are extremely costly
After-the-fact analysis and close inspection also reveal:
Some instructions are more likely to be weak than others
Even then, a single false positive can negate all the gains
Case in point: zapnot in gap
zapnot Ra Rb Rc
84% of the zapnot insts are weak in isolation: 3.4% speedup
A single false-positive zapnot instruction: 6% slowdown
More than one false-positive instruction can slow down the program by up to 13%
Challenge #3: Neither absolute nor additive
Weakness is context dependent and non-linear, much like Jenga
All weak instructions combined together are not weak!
Example: weak instruction combining in perlbmk
About 300 weak instructions when tested in isolation
All combined together can result in up to 40% slowdown
[Figure: performance impact over baseline look-ahead (from +20% down to −40%) as the ~300 individually weak instructions in perlbmk are removed cumulatively.]
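The non-additivity can be illustrated with a tiny interaction model. The numbers below are invented for illustration only (not perlbmk data): each removal helps slightly on its own, but removals that share a dependence chain hurt each other when combined:

```python
from itertools import combinations

# Toy interaction model: each removal alone gains 1%, but removing two
# adjacent instructions of the same dependence chain costs 3% together.
def perf(removed):
    gain = 0.01 * len(removed)
    clashes = sum(1 for a, b in combinations(sorted(removed), 2)
                  if b - a == 1)            # neighbors interact
    return 1.0 + gain - 0.03 * clashes      # 1.0 = baseline look-ahead

weak_alone = [i for i in range(6) if perf({i}) > 1.0]   # all pass alone
combined = perf(set(weak_alone))                         # yet together...
```

All six instructions are weak in isolation, yet removing all of them yields a net slowdown, which is exactly why a search over combinations is needed.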
Metaheuristic-Based Trial-and-Error Approach
Recap: Challenges in identifying weak instructions
Weak instructions look very similar to regular instructions
False positives are extremely costly and can negate all the gains
Weakness is context dependent: neither absolute nor additive
Our approach: metaheuristic-based self-tuning
Experimentally identify/verify weakness
Search for profitable combinations via a metaheuristic
Metaheuristic: completely agnostic of the meaning of a solution
Derives new solutions from current solutions through modifications
Examples: genetic algorithms, simulated annealing, etc.
R. Parihar, M. Huang, “Accelerating Decoupled Look-ahead via Weak Dependence Removal”, HPCA’14
Genetic Algorithm based Framework
The problem naturally maps to genetic algorithm
Skeleton is represented by a bit vector
Natural mapping: weak instruction → gene, collection → chromosome
Objective: find the optimal combination (chromosome)
Genetic evolution: Procreation, mutation, fitness-based selection
[Diagram: chromosome creation and GA evolution. A binary parser turns the program binary into single-instruction genes and an initial chromosome population (single-gene, multi-instruction, superposition, and orthogonal chromosomes). The GA evolution loop selects parents from the parents pool via a roulette wheel, reproduces with crossover & mutation, de-duplicates, and fills the children pool under fitness tests and elitism; the best chromosome drives look-ahead construction of the look-ahead binary.]
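The pipeline above is, at its core, a standard genetic algorithm over bit vectors (1 = remove this candidate instruction from the skeleton). A minimal, self-contained Python sketch follows; the fitness function is an invented stand-in for the real fitness test (which would time a short run of the tuned skeleton), and the parameters are illustrative:

```python
import random

random.seed(7)
GENES = 16            # candidate weak instructions (one bit each)

def fitness(chrom):
    # Hypothetical fitness: reward removals, punish removing inst 5
    # (a false positive) and the interacting pair (2, 3).
    f = 1.0 + 0.01 * sum(chrom)
    if chrom[5]:
        f -= 0.10
    if chrom[2] and chrom[3]:
        f -= 0.05
    return f

def roulette(pop, fits):
    """Fitness-proportionate (roulette-wheel) parent selection."""
    pick = random.uniform(0, sum(fits))
    acc = 0.0
    for c, f in zip(pop, fits):
        acc += f
        if acc >= pick:
            return c
    return pop[-1]

def evolve(pop, generations=30, mut_rate=0.05):
    for _ in range(generations):
        fits = [fitness(c) for c in pop]
        elite = max(pop, key=fitness)            # elitism: best survives
        children = [elite]
        while len(children) < len(pop):
            a, b = roulette(pop, fits), roulette(pop, fits)
            cut = random.randrange(1, GENES)     # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g ^ (random.random() < mut_rate) for g in child]
            children.append(child)
        pop = children
    return max(pop, key=fitness)

# Initial population: single-gene chromosomes (one removal each).
pop = [[1 if i == j else 0 for i in range(GENES)] for j in range(GENES)]
best = evolve(pop)
```

With elitism, the best fitness never regresses across generations, so the search monotonically improves on the best single-gene chromosome.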
Speedup of Weak Dependence Removal
Applications in which the look-ahead thread is a bottleneck
Self-tuned, genetic algorithm based decoupled look-ahead
Speedup over baseline decoupled look-ahead: 1.11x (geomean)
Overall speedup over single-thread baseline: 1.48x
[Figure: speedup over single-thread (1x to 6x) for craf, eon, gap, gzip, mcf, pbmk, two, vrtx, vpr, amp, art, eqk, fma3, luc, and Gmean: baseline look-ahead vs. GA-based look-ahead.]
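The "geomean" figures above are geometric means of the per-benchmark speedups, the standard way to summarize speedup ratios. A quick sketch with hypothetical per-benchmark numbers (illustrative only, not the thesis data):

```python
from math import prod

def geomean(xs):
    """Geometric mean: the n-th root of the product of n ratios."""
    return prod(xs) ** (1.0 / len(xs))

# Hypothetical speedups of five benchmarks over the baseline look-ahead.
speedups = [1.05, 1.22, 1.08, 1.01, 1.19]
overall = geomean(speedups)
```

Unlike the arithmetic mean, the geometric mean is not skewed by a single large ratio, which is why it is used for speedup summaries.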
Progress of Genetic Evolution Process
Per-generation progress compared to the final best solution
After 2 generations, more than half of the benefits are achieved
After 5 generations, significant performance benefits are achieved
GA evolution, helped by hybridization, shows good progress
[Figure: progress relative to the best GA solution (0% to 100%) over 7 generations for eon, mcf, pbmk, twolf, vpr, art, eqk, fma, amp, lucas.]
Evolution can be Online or Offline
Offline evolution: one-time tuning (e.g., at install time)
Fitness tests need not take long (2-20s on the target machine)
Different inputs and configurations do not invalidate results
Online evolution: takes longer but has little overhead
Additional work is minimal: bookkeeping, bit-vector manipulation
Main source of slowdown: testing bad configurations
[Figure: accumulated IPC (1 to 3) over roughly 4.7 billion instructions for the single-thread baseline, baseline decoupled look-ahead, and online self-tuned look-ahead.]
A Locomotive and Cargo Analogy
Skeleton payload: look-ahead tasks and associated housekeeping
Locomotive: look-ahead thread; Cargo: skeleton payload
Dilemma: heavy cargo (slower locomotive) vs. lighter cargo (under-utilization of the locomotive's capability)
[Diagram: a locomotive (the look-ahead thread) pulling cargo cars of L1 prefetches and L2 prefetches.]
Idea of Do-It-Yourself (DIY) Branches
Extends the idea of weak instructions to easy-to-predict branches
To accelerate the look-ahead thread, DIY branches are either skipped completely or only partially executed in the skeleton
[Diagram: (A) forward conditional branch (if-then, if-then-else) transformations: (1) DIY [C], (2) DIY [BR → A → C], (3) DIY [BR → B → C], with untaken paths zapped or the branch forced left/fall-through/right; (B) backward conditional branch (loop) transformations: (4) DIY [A → C], (5) DIY [A → B → BR → C].]
Tune skeleton via selectively including/excluding prefetches
R. Parihar, M. Huang, “Load Balancing in Decoupled Look-ahead via DIY Branches and Payload Tuning”, (in draft)
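The payoff of a DIY branch is that the skeleton does less work while the main thread falls back on its own predictor for that branch. A hypothetical toy model in Python (the trace format and default predictor are invented for illustration):

```python
from collections import deque

# Toy trace of (branch_id, taken, is_diy). DIY branches are side-effect
# free and easy to predict, so the skeleton skips them entirely.
trace = [(0, True, False), (1, True, True), (1, True, True),
         (1, False, True), (2, False, False)]

def run_skeleton(trace, queue):
    """Look-ahead thread: executes only non-DIY branches."""
    work = 0
    for bid, taken, diy in trace:
        if diy:
            continue                  # skipped: lighter skeleton runs ahead
        queue.append((bid, taken))
        work += 1
    return work

def run_main(trace, queue, predict=lambda bid: True):
    """Main thread: queued outcomes for normal branches, own predictor for DIY."""
    correct = 0
    for bid, taken, diy in trace:
        guess = predict(bid) if diy else queue.popleft()[1]
        correct += guess == taken
    return correct

q = deque()
skeleton_work = run_skeleton(trace, q)
correct = run_main(trace, q)
```

Here the skeleton executes only 2 of the 5 branches; the main thread still resolves the non-DIY branches perfectly from the queue and predicts the easy DIY branch mostly right on its own.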
Hardware Support for DIY Branches
Hardware support is needed to synchronize after DIY regions:
An additional BOQ bit to indicate the beginning of a DIY region
Main thread has its own branch predictor for DIY regions
A DIY call-depth register to keep track of nesting/recursion
[Diagram: the look-ahead thread pushes direction + DIY info into the branch queue; skeleton entries carry [mask, diy, duty] fields (e.g., i1: add [1, 0]; i2: call [1, 1, 25]; i3: ldq [1, 2]; i4: stq [1, 0]). The main thread consumes the queue, switching to its own branch predictor in DIY mode, tracked by the DIY call-depth register and DIY mode bit.]
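The call-depth register exists so the main thread knows where a DIY region ends even across nested calls. A hypothetical sketch of that control logic (the event encoding is invented; a real implementation is a few bits of state in the fetch stage):

```python
def branch_source(events):
    """For each branch, report which source the main thread consults.

    Events: ("call", diy_bit) / ("ret",) / ("branch",). A call whose BOQ
    diy bit is set opens a DIY region; the matching return closes it.
    """
    diy, depth, out = False, 0, []
    for ev in events:
        kind = ev[0]
        if kind == "call":
            if diy:
                depth += 1            # nested call inside the DIY region
            elif ev[1]:
                diy, depth = True, 1  # BOQ bit set: DIY region begins
        elif kind == "ret" and diy:
            depth -= 1
            if depth == 0:
                diy = False           # back at the region's own depth
        elif kind == "branch":
            out.append("own predictor" if diy else "branch queue")
    return out

events = [("branch",), ("call", True), ("branch",), ("call", False),
          ("branch",), ("ret",), ("ret",), ("branch",)]
sources = branch_source(events)
```

Without the depth counter, the first return inside the nested call would end the DIY region prematurely and desynchronize the two threads on the branch queue.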
Raj Parihar Advanced Computer Architecture Lab University of Rochester 25
Experimental Setup
Program/binary analysis tool: ALTO
Simulator: detailed, cycle-level, in-house out-of-order simulator
SMT, look-ahead, and speculative parallelization support
True execution-driven simulation (faithful value modeling)
Genetic algorithm framework
Modeled as offline and online extensions to the simulator
Microarchitectural configurations:
Baseline core (similar to POWER5):
  Fetch/Decode/Issue/Commit: 8 / 4 / 6 / 6
  ROB: 128 entries
  Functional units: INT 2 + 1 mul + 1 div; FP 2 + 1 mul + 1 div
  Fetch Q / Issue Q / Reg. (int, fp): (32, 32) / (32, 32) / (80, 80)
  LSQ (LQ, SQ): 64 (32, 32), 2 search ports
  Branch predictor: Gshare, 8K entries, 13-bit history
  Branch misprediction penalty: at least 7 cycles
  L1 data cache (private): 32KB, 4-way, 64B line, 2 cycles, 2 ports
  L1 inst cache (private): 64KB, 2-way, 128B line, 2 cycles
  L2 cache (shared): 1MB, 8-way, 128B line, 15 cycles
  Memory access latency: 200 cycles
Look-ahead core: baseline core with only LQ, no SQ
  L0 cache: 32KB, 4-way, 64B line, 2 cycles; round-trip latency to L1: 6 cycles
Communication: Branch Output Queue: 512 entries; reg copy latency (recovery): 64 cycles
Table 1: Microarchitectural configurations.
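The genetic algorithm framework above searches over which instructions to keep in the skeleton. A toy sketch of that idea, where the chromosome is a binary keep-mask and the fitness function stands in for a simulator run (all names and parameters here are illustrative assumptions, not the thesis implementation):

```python
import random

# Sketch: GA over a skeleton bit mask (1 = keep instruction in skeleton).
def evolve(fitness, n_genes, pop_size=30, generations=60, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # elitism: keep the best half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_genes)       # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.1:                # occasional mutation
                child[rng.randrange(n_genes)] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Toy fitness: pretend dropping instructions 2 and 5 speeds up look-ahead,
# with a slight preference for keeping everything else.
best = evolve(lambda m: -(m[2] + m[5]) + 0.01 * sum(m), n_genes=8)
```

In the thesis the fitness evaluation is a (simulated) run of the look-ahead system with the candidate skeleton, used both offline and as an online extension to the simulator.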
Individual Performance Gains
Speedup of DIY branches over baseline look-ahead: 1.08x
Speedup of Skeleton Payload Tuning: 1.12x
Combined speedup (DIY + Payload Tuning): 1.15x
[Figure: speedup over single-thread for gcc, mcf, eon, pbmk, bzip2, twolf, wup, mgrid, art, eqk, face, ammp, lucas, fma3d, and Gmean. Bars: baseline decoupled look-ahead; DIY branch based; skeleton payload tuned; DIY + skeleton payload tuned (15% over baseline); weak-instruction removal (16.2%) shown for reference.]
Overall Performance Gain
Final decoupled look-ahead system
Skeleton payload tuning + Weak dependence + DIY branches
Performance speedup over:
Baseline look-ahead: 1.20x Single-thread: 1.61x
[Figure: speedup over baseline DLA for gcc, mcf, eon, pbmk, bzip2, twolf, wupwise, mgrid, art, equake, facerec, ammp, lucas, fma3d, and gmean. Bars: weak dependence removed DLA; weak dep + DIY + payload tuned DLA.]
Salient Features of Decoupled Look-ahead
Hard-to-predict, data-dependent branches
Conventional predictors do not capture data-dependent behaviors
Prefetching for cold misses
Conventional prefetchers take time to learn access/miss patterns
Other potential advantages:
Can pass dependence information for thread-level speculation
Can assist in value prediction (close to 90% correctness)
Potential hurdles and showstoppers:
If no distillation is possible: look-ahead thread can run at a higher clock
JIT: a fixed mask is an issue, but the mask can be evolved dynamically
Future Explorations
Effective look-ahead to improve L1 prefetching performance
Speeding up critical threads and serial bottlenecks via a shared look-ahead agent in multi-threaded applications
Cost effective SMT implementation of decoupled look-ahead
Role of look-ahead in promoting parallelization, value prediction, and acceleration of interpreted programs
Backward strawman: integrate non-speculative look-ahead computations directly into the main thread
Details in the Thesis and Papers
Decoupled Look-ahead Architecture:
Weak dependence removal in decoupled look-ahead [HPCA’14]
Load balancing in look-ahead via DIY branches [PACT-SRC’15]
Speculative parallelization in decoupled look-ahead [PACT’11]
DIY branches and payload tuning [in prep. for HPCA’17]
Shared Cache Management:
Hardware support for protective and collaborative caches [ISMM’16]
Protection and utilization in shared cache via rationing [PACT’14]
A coldness metric for cache optimization [MSPC’13]
Motivation: Cache Rationing
Compute systems with shared resources are prevalent today
Multi-core clusters, cloud computing, data centers, server farms
Programs often compete for shared caches and other resources
Significant performance loss due to co-run interference: >25%
[Figure: IPC normalized to solo run with 512 KB L2, for SPEC 2000 benchmarks 1-26 co-run with equake (2 cores, 1 MB L2 cache), under equal partitioning, no partitioning, rationing, and PIPP-equal (one outlier at 2.32).]
Idea of Cache Rationing
Achieve both resource protection and utilization simultaneously
Rationing policy:
Initial ration: every program is assigned an initial portion of the cache
Non-intrusive sharing: a program can exceed its allocated ration only if another program is not using its ration
Entitlement: if a program is using its ration, it cannot be taken away by peer programs
Conservative sharing: provides a safety net for less aggressive programs in the presence of non-cooperative programs
R. Parihar, J. Brock, C. Ding, M. Huang, “Hardware support for protective and collaborative cache sharing”, ISMM’16
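The three rules above can be sketched for a single cache set. This is an illustrative model under stated assumptions (tuple-based blocks, a `length`-free access-bit scheme); the ISMM'16 hardware differs in detail:

```python
# Sketch: one rationed cache set. A core under its ration may reclaim
# ways from an over-ration peer (entitlement); a core at its ration may
# only borrow ways whose access bit is clear, else it recycles its own.
class RationedSet:
    def __init__(self, ways, rations):
        self.blocks = [None] * ways          # (owner, tag, access_bit)
        self.rations = rations               # e.g. {0: 2, 1: 2}

    def _usage(self, core):
        return sum(1 for b in self.blocks if b and b[0] == core)

    def access(self, core, tag):
        for i, b in enumerate(self.blocks):
            if b and b[1] == tag:
                self.blocks[i] = (b[0], tag, True)   # hit: set access bit
                return "hit"
        self._insert(core, tag)
        return "miss"

    def _insert(self, core, tag):
        for i, b in enumerate(self.blocks):          # use a free way first
            if b is None:
                self.blocks[i] = (core, tag, True)
                return
        if self._usage(core) < self.rations[core]:
            # Under ration: some peer must be over its ration; reclaim
            # from the most over-ration owner (entitled reclaim).
            over = max(self.rations,
                       key=lambda c: self._usage(c) - self.rations[c])
            i = next(j for j, b in enumerate(self.blocks) if b[0] == over)
        else:
            # At/over ration: borrow only ways peers are not using
            # (access bit clear); otherwise replace one of our own.
            idle = [j for j, b in enumerate(self.blocks)
                    if b[0] != core and not b[2]]
            own = [j for j, b in enumerate(self.blocks) if b[0] == core]
            i = idle[0] if idle else own[0]
        self.blocks[i] = (core, tag, True)
```

Even if one core fills the whole set first, a later arrival can reclaim up to its ration but no further, which is exactly the protective half of the policy.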
Hardware Support
Ration Accounting: ration counter-register pairs
To track the current usage of a program, maintained per core per set
Usage Tracking: access-bit and block owner
To detect unused ration and ensure entitlement, one per cache block
[Figure: a cache with s sets and w ways. The tag array adds an access bit, a status bit, and a block owner per block; a ration tracker holds p counter-register pairs per set, with ration counters driving owner allocation into the data array.]
Additional storage overhead: <1% of total cache storage
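A back-of-the-envelope check of the <1% overhead claim, under assumed parameters (2 cores, 1 MB L2, 8 ways, 128 B lines, 4-bit ration counters; the exact field widths in the thesis may differ):

```python
# Sketch: storage overhead of rationing metadata vs. the data array.
cache_bytes = 1 << 20                       # 1 MB shared L2
line_bytes, ways, cores = 128, 8, 2
blocks = cache_bytes // line_bytes          # 8192 blocks
sets = blocks // ways                       # 1024 sets

owner_bits = 1                              # 2 cores -> 1-bit block owner
per_block_bits = 1 + owner_bits             # access bit + block owner
counter_bits = 4                            # enough to count up to 8 ways

overhead_bits = blocks * per_block_bits + sets * cores * counter_bits
fraction = overhead_bits / (cache_bytes * 8)
# roughly 0.3% of the data array, consistent with the <1% claim
```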
Resource Protection Co-Run
Co-run with a high-pressure peer (mcf)
Rationing: achieves good resource protection, similar to partitioning
No partitioning: almost every co-run is unhealthy, with high damage
[Figure: IPC normalized to solo run with 512 KB L2, for SPEC 2000 benchmarks 1-26 co-run with mcf (2 cores, 1 MB L2 cache), under equal partitioning, no partitioning, rationing, and PIPP-equal (one outlier at 1.52).]
INT: 1-gzip, 2-vpr, 3-gcc, 4-mcf, 5-crafty, 6-parser, 7-eon, 8-perlbmk, 9-gap, 10-vortex, 11-bzip2, 12-twolf
FP: 13-wupwise, 14-swim, 15-mgrid, 16-applu, 17-mesa, 18-galgel, 19-art, 20-equake, 21-facerec, 22-ammp, 23-lucas, 24-fma3d, 25-sixtrack, 26-apsi
Capacity Utilization Co-Run
Co-run with a low-pressure peer (eon): cache demand <128 KB
Rationing: utilizes cache well and speeds up 14 applications without slowing down any co-running program
No partitioning: also speeds up 13 applications, but at the cost of slowing down 11 co-running programs
[Figure: IPC normalized to solo run with 512 KB L2, for SPEC 2000 benchmarks 1-26 co-run with eon (2 cores, 1 MB L2 cache), under equal partitioning, no partitioning, rationing, and PIPP-equal (one outlier at 1.744).]
Summary
Decoupled look-ahead can uncover significant implicit parallelism
However, the look-ahead thread often becomes a new bottleneck
Fortunately, thanks to its lack of correctness constraints, the look-ahead thread lends itself to various optimizations
Weak instructions can be removed w/o affecting look-ahead quality
Side-effect-free, "easy-to-predict" DIY branches can be skipped
Skeleton payload can be tuned w/o incurring extra recoveries
Metaheuristic-based self-tuning approach is simple and robust
Improves single-thread performance by 1.61x
Much better than conventional turbo boost and frequency scaling
Multi-threaded workloads can benefit from speeding up serial sections and bottleneck threads in critical regions
Acknowledgments
Funding agencies: NSF, NSFC
Prof. Michael C. Huang, Alok Garg
Prof. Chen Ding and his research group at URCS
Past & current members of the Advanced Computer Architecture Lab at University of Rochester
Backup Slides
Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism
Raj Parihar
Advisor: Prof. Michael C. Huang
Department of Electrical & Computer Engineering
University of Rochester, Rochester, NY
Summary of Distillation Techniques
Convert biased branches to unconditional "taken" or "not taken"
Eliminate stores from long-distance store-load pairs
Stores would have been committed by main thread
Selective value (zero) substitution for the L2 misses
If the look-ahead distance drops below a threshold
Speculative parallelization of skeleton w/o any rollback support
Weak dependence/instruction removal from skeleton
Consecutive loop iterations accessing the same cache line
A wide variety of library calls and reduction operations
Selective payload: eliminate payload if it slows look-ahead
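Two of the rules above can be sketched as a simple filter over a profiled trace. This is illustrative only (dict keys and thresholds are assumptions, not the thesis toolchain, which works on binaries via ALTO):

```python
# Sketch: two distillation rules -- biased-branch conversion and
# long-distance store elimination (the main thread will have committed
# such a store before the look-ahead's consuming load arrives).
def distill(trace, bias_threshold=0.99, store_load_dist=5000):
    """trace: dicts with 'op', plus profiled 'taken_ratio' for branches
    and 'next_load_dist' (instructions to next consuming load) for stores."""
    skeleton = []
    for ins in trace:
        if ins["op"] == "branch":
            r = ins["taken_ratio"]
            if r >= bias_threshold:
                skeleton.append({"op": "jump"})        # always taken
            elif r <= 1 - bias_threshold:
                continue                               # always fall through
            else:
                skeleton.append(ins)                   # keep hard branch
        elif ins["op"] == "store" and ins["next_load_dist"] > store_load_dist:
            continue                                   # long-distance store
        else:
            skeleton.append(ins)
    return skeleton
```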
Practical Advantages of Decoupled Look-ahead
Micro helper thread based approach:
Targets top cache misses and branch mispredictions (low coverage)
Support for quick spawning and register communication (not trivial)
Decoupled look-ahead approach:
Easy to disable, low management overhead on main thread
Natural throttling to prevent run-away prefetching, cache pollution
[Figure: speedup over single-thread across 25 SPEC benchmarks (gzip through apsi) and Gmean; series: speculative slice limit (ideal) and decoupled look-ahead.]
Practical Advantages of Decoupled Look-ahead
Look-ahead thread is a self-reliant agent, completely independent of the main thread
No need for quick spawning and register communication support
Low management overhead on main thread
Easier for run-time control to disable
Natural throttling mechanism to prevent run-away prefetching and cache pollution
Look-ahead thread size comparable to the aggregation of short helper threads
Cache misses:
           90%            95%
           DI      SI     DI      SI
bzip2      1.86    17     3.15    27
crafty     0.73    23     1.04    38
eon        2.28    50     3.34    159
gap        1.35    15     1.44    23
gcc        8.49    153    8.84    320
gzip       0.1     6      0.1     6
mcf        13.1    13     14.7    16
parser     1.31    41     1.59    57
pbmk       1.87    35     2.11    52
twolf      2.69    23     3.28    28
vortex     1.96    42     2       67
vpr        7.47    16     11.6    22
Avg        3.60%   36     4.44%   68

Branch mispredictions:
           90%            95%
           DI      SI     DI      SI
bzip2      3.9     52     4.49    64
crafty     5.33    235    6.14    309
eon        2.02    19     2.31    23
gap        2.02    77     2.64    130
gcc        8.08    1103   8.41    1700
gzip       8.41    40     8.66    52
mcf        9.99    14     10.2    18
parser     6.81    130    7.3     183
pbmk       2.88    92     3.21    127
twolf      5.75    41     6.48    56
vortex     1.24    114    1.97    167
vpr        4.8     6      4.88    7
Avg        5.10%   160    5.56%   236
Correlation with RTL/FPGA Accurate Simulator
Reported performance improvement results are very pessimistic
Optimistic branch misprediction latency: 7 vs. 15 cycles
Fixed memory latency, no queuing delays in L1/L2 interfaces
RTL-accurate simulator shows 2x more performance potential
[Figure: speedup over single-thread for Perfect BP, Perfect L2, Perfect L2+BP, Perfect L1, Perfect L1+BP, and DLA, comparing SimpleScalar with the RTL-accurate IMG-psim; DLA projected at 1.56x.]
Simplified Look-ahead Core
Baseline skeleton: 71%; after distillation: 57%
2-wide look-ahead core (front end is still 8-wide):
2x power savings for RAT and other traditional hotspots
Reduces overall power overhead of the look-ahead system by 10%
[Figure: power of look-ahead core components (Rename, ROB, Int IQ, Fp IQ, Decode, RAT-dec, RAT-wl, RAT-bl, DCL-cmp, LSQ, Total), normalized to the 4-wide DLA, for 3-wide and 2-wide look-ahead cores.]
Baseline vs. Tuned Skeleton
Distilled skeleton enables simplification of the look-ahead core
Better power and energy efficiency without compromising speed
Energy efficiency: 17% better than single-thread
Power overhead: 1.38x over single-thread, down from 1.53x for the baseline decoupled look-ahead
[Figure: speedup over single-thread (1.0 to 1.5) of baseline decoupled look-ahead vs. DIY+skeleton-payload look-ahead for 4-, 3-, and 2-wide look-ahead cores, on the INT and FP suites.]
Hybridization: Heuristically Designed Initial Solutions
Genetic evolution could be a slow and lengthy process
Heuristic-based solutions help jump-start the evolution
Heuristically designed solutions in our system:
Superposition chromosome; Orthogonal subroutine chromosome
[Figure: initial chromosome construction: (a) single-gene chromosomes with single-instruction genes, (b) superposition chromosomes, (c) orthogonal chromosomes built from subroutines A, B, and C with multi-instruction genes.]
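To make the superposition idea concrete, here is a minimal C sketch (not the thesis code; `chromosome_t` and `N_INSTS` are illustrative assumptions): each chromosome is a bit vector marking which static instructions are removed from the skeleton, and a superposition chromosome seeds the initial population with the union of several promising single-gene solutions.

```c
#include <stdint.h>
#include <string.h>

#define N_INSTS 64  /* illustrative number of static skeleton instructions */

/* A chromosome marks which static instructions are removed (1 = removed). */
typedef struct { uint8_t removed[N_INSTS]; } chromosome_t;

/* Superposition chromosome: the union of the removals of several
 * promising parents, used to jump-start the initial GA population. */
void superpose(chromosome_t *out, const chromosome_t *parents, int n_parents) {
    memset(out->removed, 0, sizeof out->removed);
    for (int p = 0; p < n_parents; p++)
        for (int i = 0; i < N_INSTS; i++)
            out->removed[i] |= parents[p].removed[i];
}
```

The orthogonal variant would instead restrict each seed chromosome's set bits to one subroutine's instructions, so initial solutions do not overlap.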
Online Genetic Evolution: equake
Primary overhead comes from testing bad skeleton configs
Break-even point: 1.8 billion insts (1-2 sec of native execution)
By 4.6 billion insts, the overall cumulative speed is already 10% faster
[Figure: accumulated IPC (top) and distributed IPC (bottom) over roughly 4.7 billion instructions (1 epoch = 1 million instructions) for the single-thread baseline, baseline decoupled look-ahead, and online self-tuned look-ahead on equake.]
Comparison with Other Proposals
Speculative slices [Zilles and Sohi: ISCA’00, ISCA’01]
Speculative slices achieve only 57% of their ideal speedup of 13%
Dual core execution or DCE [Zhou: PACT’05]
DCE achieves about 16% speedup over single-thread
For integer codes the speedup is substantially lower (<10%)
[Figure: speedup over single-thread (1.0 to 2.4) across SPEC benchmarks for the speculative-slice limit, dual-core execution (DCE_64), and self-tuned decoupled look-ahead; one bar reaches 5.94.]
Potential DIY Modules
Loop iterations accessing same cache line, reduction operations
Library function calls: printf, OtsMove, OtsFill, etc.
A case in point: mark_modified_reg() from 176.gcc
Dynamic contribution: 3%; performance speedup: 10%
static void
mark_modified_reg (dest, x)
     rtx dest;
     rtx x;
{
  int regno, i;

  /* Look through a SUBREG to the underlying register.  */
  if (GET_CODE (dest) == SUBREG)
    dest = SUBREG_REG (dest);

  /* A memory destination marks memory as modified.  */
  if (GET_CODE (dest) == MEM)
    modified_mem = 1;

  if (GET_CODE (dest) != REG)
    return;

  regno = REGNO (dest);
  if (regno >= FIRST_PSEUDO_REGISTER)
    modified_regs[regno] = 1;
  else
    /* A hard register may span several registers.  */
    for (i = 0; i < HARD_REGNO_NREGS (regno, GET_MODE (dest)); i++)
      modified_regs[regno + i] = 1;
}
Skeleton Payload Distribution
Baseline skeleton payload: biased branches turned unconditional + L2 prefetches + L1 prefetches + software prefetches
Optimal only 30% of the time
For the remaining 70% of the time other payloads are optimal
Performance potential of customized payloads: 1.21x
[Figure: share of best epochs (%) per skeleton payload (combinations of biased branches bB/B with L1, L2, and software prefetches, plus All-inst and ST; epoch = 10k insts) across gcc, mcf, eon, pbmk, bzip2, twolf, wup, mgrd, art, eqk, face, amp, luc, and fma.]
Skeleton Payload Tuning Framework
Collects the performance of various payloads over regular epochs
Associates each static code region with its best-performing payload
[Figure: payload tuning pipeline: (A) per-epoch payload performance tables (start PC, instruction count, cycles) for payloads such as B+L1+L2, bB+L1, and bB+L2; (B) best payload per epoch; (C) per-PC <payload#: count> tuples; (D) best payload per PC; (E) final skeleton binding each PC to its payload, e.g., 0xAA: bB+L2, 0xBB: B+L1+L2.]
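The tuning steps above can be sketched in a few lines of C. This is a simplified illustration, not the thesis implementation: payload indices, table sizes, and function names are assumptions. Each epoch credits the fastest payload for the epoch's starting PC; the final skeleton binds each PC to its most frequent winner.

```c
#include <stdint.h>

#define N_PAYLOADS 4  /* illustrative: e.g. bB, bB+L2, B+L1, B+L1+L2 */
#define N_PCS      8  /* illustrative number of tracked region-start PCs */

/* (C) Per-PC tuples: how often each payload was the best in an epoch
 * starting at this PC. */
static int best_count[N_PCS][N_PAYLOADS];

/* (A)->(B): given the measured cycles of each payload for one epoch,
 * credit the fastest payload for the epoch's start PC. */
void record_epoch(int start_pc, const uint32_t cycles[N_PAYLOADS]) {
    int best = 0;
    for (int p = 1; p < N_PAYLOADS; p++)
        if (cycles[p] < cycles[best])
            best = p;
    best_count[start_pc][best]++;
}

/* (D)->(E): the final skeleton binds each PC to the payload that won
 * the most epochs starting there. */
int best_payload(int start_pc) {
    int best = 0;
    for (int p = 1; p < N_PAYLOADS; p++)
        if (best_count[start_pc][p] > best_count[start_pc][best])
            best = p;
    return best;
}
```

Majority voting over epochs filters out noise from any single epoch's timing, which matters when epochs are as short as 10k instructions.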
Performance Impact of Duty Cycle
One DIY call example from 179.art
[Figure: performance gain over DLA (%) vs. duty cycle (5 to 100%, not to scale) for WeightAdj() in 179.art.]
Weak Dependence: Insights and Findings
The evolution process is remarkably robust
Different inputs and configurations do not invalidate the results
Sampling can accelerate the fitness test without appreciable impact on the quality of the solution found
Energy reduction is due to less activity and stalling
About 10% of dynamic instructions are removed from the skeleton
11% energy saving over the baseline decoupled look-ahead
Impact of weak-instruction removal on look-ahead quality is very small
Similar prefetch and branch hint accuracy
Comparison with Speculative Parallel Look-ahead
Self-tuned skeleton is used in the speculative parallel look-ahead
In some cases, the self-tuned and speculative parallel look-ahead techniques are synergistic (ammp, art)
Unique Opportunities for Speculative Parallelization
Skeleton code offers more parallelism
Certain dependences are removed during slicing for the skeleton
Short-distance dependence chains become long-distance chains, suitable for TLP exploitation
Look-ahead is inherently error-tolerant
Can ignore dependence violations
Little to no support needed, unlike in conventional TLS
[Figure: example dependence chains in the skeleton, shown as numbered Alpha instructions (ldt, lda, ldl, ldq, stt, bis at PCs 0x12000da84 through 0x120011b04) with dependence distances annotated.]
A. Garg, R. Parihar, M. Huang, “Speculative Parallelization in Decoupled Look-ahead”, PACT’11
Software Support
Dependence analysis
Profile-guided, coarse-grained at the basic block level
Spawn and Target points
Basic blocks with a consistent dependence distance above a threshold DMIN
The spawned thread executes from the target point
Loop-level parallelism is also exploited
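The spawn-point criterion can be sketched as a simple predicate over profile data. This is an illustrative reduction, not the actual profiler: `bb_profile_t` and its fields are assumptions summarizing what the profile-guided analysis would record per basic block.

```c
#define DMIN 15  /* minimum dependence distance, in basic blocks */

/* Illustrative profile summary for one basic block. */
typedef struct {
    int min_dep_distance;  /* shortest observed producer-to-consumer distance (in BBs) */
    int consistent;        /* 1 if the distance was stable across profile runs */
} bb_profile_t;

/* A basic block qualifies as a spawn target when every incoming
 * dependence is consistently at least DMIN basic blocks away, so a
 * thread spawned DMIN blocks earlier rarely violates a dependence. */
int is_spawn_target(const bb_profile_t *bb) {
    return bb->consistent && bb->min_dep_distance >= DMIN;
}
```

Because look-ahead tolerates occasional violations, the "consistent" requirement can be a statistical threshold rather than an absolute guarantee.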
Parallelism Potential in Look-ahead Binary
Available parallelism for a 2-core/context system; DMIN = 15 basic blocks
The skeleton exhibits significantly more BB-level parallelism (17%)
Loop-based FP applications exhibit more BB-level parallelism
[Figure: approximate parallelism (1.0 to 2.0) of the original binary vs. the skeleton for crafty, eon, gzip, mcf, pbmk, twolf, vortex, vpr, ammp, art, eqk, fma3d, galgel, lucas, and their gmean.]
Hardware and Runtime Support
Thread spawning and merging are very similar to regular thread spawning, except:
The spawned thread shares the same register and memory state
The spawning thread terminates at the target PC
Value communication
Register-based: naturally through shared registers in SMT
Memory-based communication can be supported at different levels
Partial versioning in the cache at line granularity
[Figure: spawn/merge timeline for look-ahead threads 0 and 1: the rename table is duplicated and a context set up at spawn, and the duplicated state is cleaned up at merge.]
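Line-level partial versioning can be illustrated with a toy model (an assumption-laden sketch, not the proposed hardware: real designs track versions per cache way and handle eviction). A store from the spawned thread creates a private copy of the line; its loads prefer that copy; at merge the duplicated state is simply discarded, since look-ahead tolerates the resulting imprecision.

```c
#include <stdint.h>

#define N_LINES 4  /* illustrative cache size */

typedef struct {
    uint32_t shared;       /* value visible to the spawning thread */
    uint32_t versioned;    /* private copy for the spawned thread */
    int      has_version;  /* 1 once the spawned thread has written the line */
} line_t;

static line_t cache[N_LINES];

/* A store from the spawned thread creates/updates its private version. */
void spawned_store(int line, uint32_t v) {
    cache[line].versioned = v;
    cache[line].has_version = 1;
}

/* A load from the spawned thread prefers its private version, if any. */
uint32_t spawned_load(int line) {
    return cache[line].has_version ? cache[line].versioned
                                   : cache[line].shared;
}

/* At merge, the duplicated (versioned) state is discarded: look-ahead
 * is error-tolerant, so no write-back or squash is needed. */
void merge(void) {
    for (int i = 0; i < N_LINES; i++)
        cache[i].has_version = 0;
}
```

Discard-at-merge is what makes the hardware support light compared with conventional TLS, which must detect violations and roll back.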
Speedup of Speculative Parallelization
Applications in which the look-ahead thread is a bottleneck
Speculative look-ahead over decoupled look-ahead: 1.13x
[Figure: speedup over single-thread (1 to 5) of baseline look-ahead vs. speculatively parallel look-ahead for crafty, eon, gzip, mcf, pbmk, twolf, vortex, vpr, ammp, art, eqk, fma3d, galgel, lucas, and gmean.]
Speculative Look-ahead vs Conventional TLS
Skeleton provides more opportunities for parallelization
Speculative look-ahead over the decoupled look-ahead baseline: 1.13x
Speculative main thread over the single-thread baseline: 1.07x
[Figure: speedup over the respective baseline (0.8 to 1.6) for the speculatively parallel main thread and the speculatively parallel look-ahead on the same benchmarks; one bar reaches 1.65.]
Baseline Cache Partitioning
Baseline (naive) cache partitioning/sharing policies:
Hard partition: every program gets an equal cache share
No partition: programs can use any portion of the shared caches
Two extremes: Resource protection vs. Capacity utilization
Unrelated program co-run: individual slowdowns may not be justifiable if the programs come from different users
Unlike slowing down a thread occasionally to improve throughput
Cache rationing achieves good cache protection and utilization without slowing down individual programs
Microthreads vs Decoupled Look-ahead
[Figure: side-by-side comparison of lightweight microthreads and decoupled look-ahead.]
Look-ahead Skeleton Construction
Under-clocked Dual-core Speedup
Typically a dual-core can be clocked only up to 90% of the clock frequency of a single-core system
After adjusting the frequency of the single core:
Single-core IPC: 1.80 (INT), 2.28 (FP), 2.05 (Combined)
Baseline look-ahead over a 10% over-clocked single-thread:
Speedup: 1.13x (INT), 1.34x (FP), 1.24x (Combined)
Self-tuned look-ahead over single-thread (for 14 applications):
Speedup: 1.20x (INT), 1.96x (FP), 1.43x (Combined)
Self-tuned Look-ahead: SPEC 2006
Self-tuned look-ahead achieves 1.10x speedup over baseline look-ahead for SPEC CPU 2006 applications
[Figure: speedup over single-thread (1 to 8) of baseline look-ahead vs. GA-based look-ahead for perl, bzp, gcc, mcf, go, hmer, sjen, libq, h264, omn, astr, xaln, milc, deal, splx, and gmean.]
Self-tuned Look-ahead: Speedup Analysis
A larger code base (with more genes) takes slightly longer to evolve
[Figure: relative performance gain, defined as (GA - DLA) / (Ideal - DLA), vs. number of static instructions for SPEC 2000 and SPEC 2006, on a log scale; linear regression line with r = -0.46.]
Self-tuned Look-ahead: Speedup Analysis
Performance gain has strong correlation with # of generations
[Figure: saturation generation, i.e., the generation reaching at least 90% of the best GA solution, vs. number of static instructions for SPEC 2000 and SPEC 2006; linear regression line with r = 0.56.]
Partial Recovery in Speculative Parallelization
Flexibility in Look-ahead Hardware Design
Comparison of regular (partial versioning) cache support with two other alternatives:
No cache versioning support
Dependence violation detection and squash
Genetic Algorithm Evolution
Multi-instruction Gene Examples
Superposition based Chromosomes
Recovery based Early Termination of Fitness Test
Optimizations to Implementation
Fitness test optimizations
Sampling-based fitness
Multi-instruction genes
Early termination of tests
GA framework optimizations
Hybridization of solutions
Adaptive mutation rate
Unique chromosomes
Fusion crossover operator
Elitism policy
Sampling based Fitness Test
L2 Cache Sensitivity Study
Speedup for various L2 caches is quite stable
1.139x (1 MB), 1.133x (2 MB), and 1.131x (4 MB) L2 caches
Avg. speedups, shown in the figure, are relative to single-threaded execution with a 1 MB L2 cache
Approximable Program Paradigm
Weak dependence removal and speculative parallelization techniques can be applied to any approximate program
A few real-life examples of approximate computing:
Google search: does not work with a coherent, up-to-date database
Map-Reduce paradigm: ignores consistently failing records
Media applications: photo, audio, and video have some tolerance
Algorithm- and application-level approximations:
Modern benchmarks, e.g., PARSEC, are fundamentally approximate
Application space: clustering, prediction, optimization, etc.
BenchNN: a neural network based alternative to PARSEC
References (Partial)
Decoupled Access/Execute Computer Architectures. J. Smith, ACM TC'84
A Study of Slipstream Processors. Z. Purser, K. Sundaramoorthy, E. Rotenberg, MICRO'00
Master/Slave Speculative Parallelization. C. Zilles, G. Sohi, MICRO'02
A Performance-Correctness Explicitly Decoupled Architecture. A. Garg, M. Huang, MICRO'08
Speculative Parallelization in Decoupled Look-ahead. A. Garg, R. Parihar, M. Huang, PACT'11
Accelerating Decoupled Look-ahead via Weak Dependence Removal: A Metaheuristic Approach. R. Parihar, M. Huang, HPCA'14