Implementing Optimizations at Decode Time
Ilhyun Kim and Mikko H. Lipasti, PHARM Team, University of Wisconsin—Madison
http://www.ece.wisc.edu/~pharm


Page 1: Implementing Optimizations at Decode Time

Implementing Optimizations at Decode Time

Ilhyun Kim and Mikko H. Lipasti
PHARM Team, University of Wisconsin—Madison
http://www.ece.wisc.edu/~pharm

Page 2

May 28, 2002 Ilhyun Kim and Mikko Lipasti--PHARM Team, UW-Madison ISCA-29 2

What this talk is about

It’s not about new optimizations
- Memory reference combining
- Silent store squashing

It’s not about decode
- How to build an instruction decoder

It is about implementation
- A way to implement dynamic optimizations in a pipeline w/ speculative scheduling

“Implementing Optimizations at Decode Time”

Page 3

Outline

Speculative Scheduling
- Why it causes problems with dynamic optimizations

Speculative Decode
- Enables dynamic optimizations in the processor core

Case Study: Memory Reference Combining

Case Study: Silent Store Squashing

Conclusions

Page 4

Where do you want to put optimizations?

Optimization trade-offs

[Diagram: spectrum of places to put optimizations, from most global to most dynamic: compiler; binary translation / optimization process (virtual machine on a host machine); instruction cache / trace cache fill; decode; execution core.]

Can we achieve fully dynamic optimizations?

Dynamic events affect execution for the very next clock cycle

Page 5

Speculative Scheduling

[Diagram: pipeline comparison. Atomic wakeup/select: Fetch, Decode, Issue/Exe, Writeback, Commit. Non-atomic, speculative wakeup/select: Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback/Recover, Commit. Instructions are issued speculatively; when a latency is mispredicted ("Latency changed!") or an input value is invalid, the speculatively issued instructions are re-scheduled.]

Overview

Unlike the original Tomasulo's algorithm:
- instructions are scheduled based on a pre-determined latency
- resources are allocated at schedule time
- once instructions leave the scheduler, it is impractical to change resource/execution scheduling

The pipeline CANNOT react to observed events immediately

Page 6

What becomes harder?

  lw  r1, 4(r29)
  add r2, r1 + 1

Optimization: avoid the cache access if the value is available in the RF
(Load and store reuse using register file contents, ICS 2001)

[Animation: the scheduler wakes the dependent add assuming load latency 2 and issues it; meanwhile the load's value is found in the RF, the cache access is canceled, and the load latency becomes 1. A bubble opens ahead and the value could move up, but the add was already scheduled, so the reduced load latency gives NO BENEFIT.]

Fully dynamic optimization in the execution stage is hard

Page 7

Speculative scheduling breaks fully dynamic optimizations

Optimizing a parent instruction is not enough
- benefits come from dependent (data, resource) instructions that execute sooner
- instructions cannot react immediately under speculative scheduling

Some techniques become less efficient, or even unavailable, if they depend on:
- instant re-execution
- variable execution latency
- instant resource allocation/deallocation

The scheduler should know what will happen in advance
- not fully dynamic – a predictor is required

How to communicate with the scheduler?

Page 8

Our Solution

  lw  r1, 4(r29)
  add r2, r1 + 1

A predictor identifies the optimization target, and decode transforms the instruction (e.g. load -> move), assuming the optimization will apply:

  lw  r1, 4(r29)   ->   move r1, p1
  add r2, r1 + 1        add  r2, r1 + 1

The optimization is invisible to the scheduler (since it is a 'move' instruction): the dependent instruction is woken up with the move's 1-cycle latency, so the instruction appears to be optimized in the execution stage and the benefit of the reduced latency is realized.

Optimization: avoid the cache access if the value is available in the RF
(Load and store reuse using register file contents, ICS 2001)

Page 9

Speculative Decode (SD)

Decoding instructions into an optimistic sequence rather than one that works correctly in all cases (unsafe)
- reaps the benefits of fully dynamic optimization when correctly predicted
- requires verification code for correctness
- flushes the pipeline when mispredicted

ex) Load value prediction:

  lw  r1, 36(r29)                lw  r1, 36(r29)
  add r3, r1, r5    --(SD)-->    p2 = predicted value
                                 bne r1, p2, softtrap
                                 add r3, p2, r5
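As a rough illustration of the expansion above, a decoder model might rewrite a predicted load like this. The tuple-based instruction encoding, the `li` mnemonic, the `p1` scratch register, and the `softtrap` target name are assumptions of this sketch, not the paper's actual I-ISA:

```python
# Hypothetical model of speculative decode for load value prediction.
# Instruction tuples and mnemonics are illustrative only.

def speculative_decode_load(instr, predicted_value):
    """Expand 'lw rD, addr' into the unsafe-but-verified sequence:
    dependents are renamed to read the predicted value (p1) right away,
    the load still executes, and a softtrap flushes on a mismatch."""
    op, rd, addr = instr
    assert op == "lw", "only loads are transformed in this sketch"
    return [
        ("lw", rd, addr),                # original load, kept for verification
        ("li", "p1", predicted_value),   # p1 holds the predicted value
        ("bne", rd, "p1", "softtrap"),   # mispredict -> squash & refetch
    ]

seq = speculative_decode_load(("lw", "r1", "36(r29)"), 42)
```

Dependents such as the add above would then be renamed to read `p1` instead of `r1`, which is why they can issue before the load completes.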

Page 10

Benefits of SD

Pre-schedule optimizations outside the OoO core
- enables dynamic optimizations
- eliminates resource contention more effectively than even fully dynamic optimizations, leading to better performance

Implement optimizations using existing I-ISA primitives
- implement microarchitectural ideas with minimal core changes
- reuse the existing data/control paths in the core
- minimize negative effects on the scheduler (invisible to the scheduler)

[Diagram: Fetch -> Decode (guided by a predictor) -> OoO Execution Core -> Commit; original instructions enter decode, transformed instructions enter the core. When mispredicted: squashing and refetching, same as branch mispredictions.]

Page 11

Translation Layer for SD

Many decoders already have a translation layer between user-ISA and implementation-ISA
- because direct implementation of complex instructions is difficult
- P6, Pentium 4, Power4, K7, ...

Functionality required for SD
- one-to-multiple instruction expansion (x86 decoders)
- dynamically variable mapping between U-ISA and I-ISA (experimental S/390)

Reducing the decode overhead
- trace cache / decoded instruction cache (Pentium 4)
- instruction-path coprocessors (Chou and Shen, ISCA 2000)
- the performance drop with extra decode stages is not drastic (sensitivity study in the paper)

Page 12

Outline

Speculative Scheduling

Speculative Decode

Case Study: Memory Reference Combining

Case Study: Silent Store Squashing

Conclusions

Page 13

Case Study: Memory Reference Combining

Discussed extensively in the literature
- Wilson et al.: Increasing cache port efficiency for dynamic superscalar microprocessors, ISCA 1996

[Diagram: a combinable load issues from the LSQ; one access over a 64-bit datapath fills a 64-bit data buffer from the cache/memory, and byte selection completes multiple pending loads (LW 100, LW 104, LW 400, LW 404).]

One cache access satisfies multiple loads (load-all scheme)
- cache port / latency benefits

BUT the speculative scheduler must know in advance whether loads can be combined; otherwise it fails to achieve both benefits

Page 14

Reference Combining via SD

Wide data paths in support of instruction set extensions
- AMD Hammer project: x86-64
- 64-bit PowerPC implementations
- multimedia extensions (SSE, MMX, AltiVec, ...)

Many programs are still written in 32-bit mode for backward compatibility
- SD enables existing binaries to benefit from wider data paths w/o recompilation
- wider (128-bit) combining leads to more benefits (performance data in the paper)

Page 15

Reference Combining via SD

Detecting combinable pairs statically
- same base register with word-size offsets
- two adjacent word memory instructions in program order

  lw r1, 0(r10)
  lw r2, 4(r10)
  lw r3, 8(r10)
  lw r4, 12(r10)

Predict the alignment of the references:

doubleword-aligned:
  dlw r1, 0(r10)
  exthi r2, r1
  dlw r3, 8(r10)
  exthi r4, r3

word-aligned only:
  lw r1, 0(r10)
  dlw r2, 4(r10)
  exthi r3, r2
  lw r4, 12(r10)
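The static detection rule above (same base register, word-size offset difference, adjacent in program order) can be sketched as a simple scan over decoded loads. This is an illustrative model, not the actual sequence-detector hardware:

```python
# Illustrative model of static combinable-pair detection: two adjacent
# word loads off the same base register whose offsets differ by 4 bytes
# are candidates for one doubleword (dlw) access.

def find_combinable_pairs(loads):
    """loads: list of (dest, base, offset) word loads in program order.
    Returns (i, i+1) index pairs that one dlw + exthi could replace."""
    pairs = []
    i = 0
    while i + 1 < len(loads):
        (_, base1, off1), (_, base2, off2) = loads[i], loads[i + 1]
        if base1 == base2 and off2 - off1 == 4:
            pairs.append((i, i + 1))
            i += 2  # both loads are consumed by a single dlw
        else:
            i += 1
    return pairs

loads = [("r1", "r10", 0), ("r2", "r10", 4),
         ("r3", "r10", 8), ("r4", "r10", 12)]
```

Whether a detected pair is actually issued as a dlw then depends on the predicted alignment, as in the two expansions above.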

Page 16

Reference Combining via SD

Pipeline front-end

[Diagram: Fetch -> Decode -> Execution Core. A sequence detector spots adjacent loads/stores, and a predictor tracks the alignment history of loads/stores. When combining is predicted: lw+lw -> dlw + extract; sw+sw -> merge + dsw.]

When misaligned:
- the memory system detects it (same as the base case)
- after the pipeline is drained, the original instructions are fetched again and decoded without transformation

Page 17

Microarchitectural Assumptions

- SimpleScalar PISA w/ speculative scheduling
- 4-wide, 8-stage pipeline
- hybrid branch predictor (gshare + bimodal)
- 64-entry RUU; 32-entry load / 16-entry store schedulers
- 64KB I-/D-L1, 512KB unified L2
- 2 load / 1 store ports (mutually exclusive)
- 2 store buffers outside the OoO core
- HW memory reference combining (HWC): magic scheduler w/ perfect combining knowledge, plus store merging in the store buffer (for store combining)

Page 18

HWC vs. SDC – cache access reductions

[Chart: normalized DL1 accesses (0.4–1.0) for HWC w/ oracle scheduler and SDC across comp, gcc, go, ijpeg, li, m88k, perl, vortex, bzip, gzip, mcf, parser, vpr.]

HWC reduces more cache accesses than SDC

Page 19

HWC vs. SDC – LSQ contention reductions

[Chart: load/store scheduler full rate (0–1.0) for Base, HWC w/ oracle scheduler, and SDC across the same benchmarks.]

SDC reduces LSQ contention more (fewer memory instructions)

Page 20

HWC vs. SDC – Speedups

[Chart: speedups (%) of HWC w/ oracle scheduler and SDC across the benchmarks, roughly 0–12%.]

SD can reap many of the benefits of pure hardware implementations

Page 21

Outline

Speculative Scheduling

Speculative Decode

Case Study: Memory Reference Combining

Case Study: Silent Store Squashing

Conclusions

Page 22

Case Study: Silent Store Squashing (SSS)

Eliminates stores that do not change architectural state
- reduces core and memory system contention

[Diagram: a store is converted implicitly into 3 operations: (1) access memory as a load, (2) compare the values, (3) nullify the store when silent. The converted load occupies the load scheduler while the store occupies the store scheduler.]

Separate load/store schedulers imply replication and more contention -> do explicit conversion

Page 23

Silent Store Squashing via SD

Explicitly removes predicted stores
- reduces store scheduler contention

  add r1, 1, r1                 add r1, 1, r1
  sw  r1, 16(r29)   --(SD)-->   lw  p1, 16(r29)
  lw  r5, 4(r10)                bne p1, r1, trap
                                lw  r5, 4(r10)

- load + trap for store verify: the pipeline is drained when the store is not silent

No store, no aliasing
- silent stores do not change the value, so there is no RAW hazard
- later loads are allowed to bypass earlier unresolved stores even with true dependences
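A minimal sketch of that decode-time rewrite, using the same tuple-style encoding as before (the `p1` scratch register and the `trap` target are illustrative names, not the paper's actual I-ISA):

```python
# Hypothetical sketch of silent store squashing via speculative decode:
# a store predicted silent is replaced by a verifying load plus a trap
# branch, so the store scheduler never sees it.

def squash_if_predicted_silent(store, predicted_silent):
    """store: ('sw', src_reg, addr). When predicted silent, emit
    load + compare-trap; otherwise pass the store through unchanged."""
    op, src, addr = store
    assert op == "sw"
    if not predicted_silent:
        return [store]
    return [
        ("lw", "p1", addr),           # read the value currently in memory
        ("bne", "p1", src, "trap"),   # not silent after all -> drain & refetch
    ]
```

Because no store remains in the transformed sequence, later loads need not wait on it for disambiguation, which is where the no-RAW benefit above comes from.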

Page 24

HWSSS vs. SDSSS – memory disambiguation

[Chart: average clock cycles a load issue is blocked by stores, for Base and SD SSS, across comp, gcc, go, ijpeg, li, m88k, perl, vortex, bzip, gzip, mcf, parser.]

Better memory disambiguation achieved: no store means fewer store-to-load block cycles

Page 25

HWSSS vs. SDSSS – Speedups

[Chart: speedups (%) of HWSSS and SDSSS across the benchmarks, roughly -2% to 23%.]

SDSSS outperforms HWSSS
- better memory disambiguation
- HWSSS does not reduce contention in the store scheduler

Page 26

Conclusions

- Speculative scheduling makes optimizations in the execution stage impractical
- Pre-schedule optimizations by transforming instructions

Advantages of SD-based implementations
- enables execution-stage optimizations
- reuses existing data/control paths
- no negative effect on instruction scheduling
- reduces contention inside the core

Two case studies show that SD can reap many of the benefits of pure hardware implementations
- memory reference combining: less queue contention
- silent store squashing: less queue contention, better memory disambiguation

Page 27

Backup slides

Page 28

Is HWC easy to integrate?

Schedule
- must detect piggyback loads to be issued in the same clock cycle
- actual values are not involved in scheduling: how can combinable loads be detected without effective addresses?

Register file & result bus
- more loads satisfied at the same time means more result bus bandwidth and more RF write ports

[Diagram: base pipeline (Schedule -> LSQ / Addr gen -> Load unit -> RF -> Exe1/Exe2 -> Result bus -> Register file) vs. the same datapath driven by a "magic" schedule.]

Page 29

SD Combining Prediction

[Chart: percentage of all memory references that are combined, combinable but not predicted, or mispredicted, across comp, gcc, go, ijpeg, li, m88k, perl, vortex, bzip, gzip, mcf, parser, vpr (0–50%).]

Over 80% of adjacent combinable references are captured (1024 entries, ~4KB)
Miss rates: ~0.1% of all references

Page 30

Silence Prediction

[Chart: percentage of all stores (0–100%) broken down into not silent / predicted not silent, not silent / predicted silent, checked, silent / predicted not silent, and silent / predicted silent, across the benchmarks.]

45% of silent stores detected (1024 entries, ~2.5KB); low miss rate (~1%)

Page 31

Silence Predictor

[Diagram: each entry, indexed by PC, holds the last store value (lower n bits), a confidence counter, and a threshold counter. The current store value (lower 8 bits) is compared against the stored value: +1 if the same, -1 if different (confidence). The store-verify result updates the threshold: -1 if silent, +4 if not silent.]

[Table: worked example tracing a sequence of stores and loads to addresses A and B, showing for each step the original instruction, the decoded ops, the predictor state (value, confidence, threshold), and memory contents.]
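A rough behavioral model of that predictor, with the counter updates taken from the diagram; the predict-silent condition (confidence at or above threshold) is an assumed interpretation of this sketch, since the slide does not spell it out:

```python
# Illustrative model of the per-PC silence predictor. Update rules follow
# the slide (+1/-1 on value match, -1/+4 on verify outcome); the predict
# condition below is an assumption, not a verified design.

class SilencePredictor:
    def __init__(self):
        self.last_value = {}   # PC -> lower 8 bits of the last store value
        self.confidence = {}   # PC -> confidence counter
        self.threshold = {}    # PC -> threshold counter

    def observe_store(self, pc, value):
        """Train on a store's value: +1 if the low bits repeat, -1 if not."""
        low = value & 0xFF
        delta = 1 if self.last_value.get(pc) == low else -1
        self.confidence[pc] = self.confidence.get(pc, 0) + delta
        self.last_value[pc] = low

    def observe_verify(self, pc, was_silent):
        """Train on a store-verify result: -1 if silent, +4 if not."""
        self.threshold[pc] = self.threshold.get(pc, 0) + (-1 if was_silent else 4)

    def predict_silent(self, pc):
        # Assumed interpretation: predict silent once confidence has
        # caught up with the misprediction-inflated threshold.
        return self.confidence.get(pc, 0) >= self.threshold.get(pc, 0)

p = SilencePredictor()
for _ in range(5):
    p.observe_store(0x400, 100)   # the same value repeats -> confidence rises
```

The asymmetric +4 on a not-silent verify makes the predictor back off quickly after a costly misprediction, matching the slide's bias toward avoiding pipeline drains.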

Page 32

Silence Squashing via SD

Explicit store verify, depending on the predictor state:
- store -> load + compare + store
- store -> load + branch

Eliminates negative effects on the scheduling logic
- explicit load issue for the verify: SSS is virtually invisible to the scheduling logic
- explicit compare/branch operation: the existing branch unit maintains correct machine state

[Diagram: Fetch -> Decode -> Execution Core. When a silent store is predicted, the predictor's "is the last store value silent?" state selects between store -> load+comp+store and store -> load+branch.]

Page 33

SD Combining Prediction Predictor

[Diagram: each entry, indexed by PC, holds a tag, the next target register, and a 4-bit alignment history shift register; a +4/-4 base-address check and an "is aligned?" comparison shift a new bit into the history on each reference.]

Combining is predicted when the history is:
- 1111: aligned 4 times in a row
- 1010: the base is increasing by 4

Over 80% of adjacent combinable references are captured (1024 entries, ~4KB)
- miss rates: ~0.1% of all references
- captures up to 26% of all memory references
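The 4-bit alignment history and its two predict patterns can be modeled directly; table size, tags, and the next-target-register field are omitted, so this is a behavioral sketch only:

```python
# Illustrative model of the combining predictor's alignment history:
# a 4-bit shift register per PC, with combining predicted on the two
# patterns named on the slide.

class CombiningPredictor:
    def __init__(self):
        self.history = {}  # PC -> last 4 alignment outcomes (1 = dw-aligned)

    def update(self, pc, was_aligned):
        h = self.history.get(pc, 0)
        self.history[pc] = ((h << 1) | int(was_aligned)) & 0b1111

    def predict_combine(self, pc):
        # 1111: doubleword-aligned 4 times in a row
        # 1010: base address advancing by one word per reference
        return self.history.get(pc, 0) in (0b1111, 0b1010)

cp = CombiningPredictor()
for aligned in (True, True, True, True):
    cp.update(0x120, aligned)   # history becomes 1111 -> predict combining
```

The 1010 pattern covers the strided case in the slide's example, where alternate references start a new doubleword as the base walks up by one word at a time.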

Page 34

Silence Prediction

- Avoids unnecessary store verifies (load issues)
- How do we train the predictor without the silence outcome? The silence outcome is only available when we do a store verify
- Correlate the last-value information for training

[State diagram: No squash (no SD) / Squash (load+trap) / Check (load+compare+store). Seeing the same value for several instances and silent outcomes move toward squashing; different values or not-silent outcomes move back toward checking.]

45% of silent stores detected (1024 entries, ~2.5KB); low miss rate (~1%)

Page 35

Future Work

- Spectrum of power/performance design points attainable by speculative decode: single core, multiple marketing targets
- Exposing complex control paths to the I-ISA: improving controllability of the processor core, achieving more benefit from SD
- Developing an I-ISA for complexity-effective core design
- And more...