Implementing Optimizations at Decode Time
Ilhyun Kim and Mikko H. Lipasti, PHARM Team, University of Wisconsin—Madison
http://www.ece.wisc.edu/~pharm


Page 1: Implementing Optimizations at Decode Time

Implementing Optimizations at Decode Time

Ilhyun Kim and Mikko H. Lipasti
PHARM Team, University of Wisconsin—Madison
http://www.ece.wisc.edu/~pharm

Page 2

May 28, 2002 Ilhyun Kim and Mikko Lipasti--PHARM Team, UW-Madison ISCA-29 2

What this talk is about

It’s not about new optimizations
- Memory reference combining
- Silent store squashing

It’s not about decode
- How to build an instruction decoder

It is about implementation
- A way to implement dynamic optimizations in a pipeline w/ speculative scheduling

“Implementing Optimizations at Decode Time”

Page 3

Outline

Speculative Scheduling
- Why it causes problems with dynamic optimizations

Speculative Decode
- Enables dynamic optimizations in the processor core

Case Study: Memory Reference Combining

Case Study: Silent Store Squashing

Conclusions

Page 4

Where do you want to put optimizations?

Optimization trade-offs

[Diagram: spectrum of places to put optimizations, from most global to most dynamic: compiler; binary translation / optimization process (virtual machine on a host machine); instruction cache / trace cache fill; decode; execution core.]

Can we achieve fully dynamic optimizations?

Dynamic events affect execution for the very next clock cycle

Page 5

Speculative Scheduling

[Diagram: pipeline comparison. Atomic wakeup/select: Fetch, Decode, Issue/Exe, Writeback, Commit. Non-atomic, speculative wakeup/select: Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback/Recover, Commit. Instructions are issued speculatively; when a latency is mispredicted ("Latency changed!") or an input value is invalid, the speculatively issued instructions are re-scheduled.]

Overview

Unlike the original Tomasulo's algorithm:
- instructions are scheduled based on a pre-determined latency
- resources are allocated at schedule time
- once instructions leave the scheduler, it is impractical to change resource/execution scheduling

The pipeline CANNOT react to observed events immediately

Page 6

What becomes harder?

  lw  r1, 4(r29)
  add r2, r1 + 1

Optimization: avoid the cache access if the value is available in the RF
(Load and store reuse using register file contents, ICS 2001)

[Animation: the scheduler wakes the dependent add assuming load latency 2 and issues it; meanwhile the load's value is found in the RF, the cache access is canceled, and the load latency becomes 1. A bubble opens ahead and the value could move up, but the add was already scheduled, so the reduced load latency gives NO BENEFIT.]

Fully dynamic optimization in the execution stage is hard

Page 7

Speculative scheduling breaks fully dynamic optimizations

Optimizing a parent instruction is not enough
- benefits come from dependent (data, resource) instructions that execute sooner
- instructions cannot react immediately under speculative scheduling

Some techniques become less efficient, or even unavailable, if they depend on:
- instant re-execution
- variable execution latency
- instant resource allocation/deallocation

The scheduler should know what will happen in advance
- not fully dynamic – a predictor is required

How to communicate with the scheduler?

Page 8

Our Solution

  lw  r1, 4(r29)
  add r2, r1 + 1

A predictor identifies the optimization target, and decode transforms the instruction (e.g. load -> move), assuming the optimization will apply:

  lw  r1, 4(r29)   ->   move r1, p1
  add r2, r1 + 1        add  r2, r1 + 1

The optimization is invisible to the scheduler (since it is a 'move' instruction): the dependent instruction is woken up with the move's 1-cycle latency, so the instruction appears to be optimized in the execution stage and the benefit of the reduced latency is realized.

Optimization: avoid the cache access if the value is available in the RF
(Load and store reuse using register file contents, ICS 2001)

Page 9

Speculative Decode (SD)

Decoding instructions into an optimistic sequence rather than one that works correctly in all cases (unsafe)
- reaps the benefits of fully dynamic optimization when correctly predicted
- requires verification code for correctness
- flushes the pipeline when mispredicted

ex) Load value prediction:

  lw  r1, 36(r29)                lw  r1, 36(r29)
  add r3, r1, r5    --(SD)-->    p2 = predicted value
                                 bne r1, p2, softtrap
                                 add r3, p2, r5
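As a rough illustration of the expansion above, a decoder model might rewrite a predicted load like this. The tuple-based instruction encoding, the `li` mnemonic, the `p1` scratch register, and the `softtrap` target name are assumptions of this sketch, not the paper's actual I-ISA:

```python
# Hypothetical model of speculative decode for load value prediction.
# Instruction tuples and mnemonics are illustrative only.

def speculative_decode_load(instr, predicted_value):
    """Expand 'lw rD, addr' into the unsafe-but-verified sequence:
    dependents are renamed to read the predicted value (p1) right away,
    the load still executes, and a softtrap flushes on a mismatch."""
    op, rd, addr = instr
    assert op == "lw", "only loads are transformed in this sketch"
    return [
        ("lw", rd, addr),                # original load, kept for verification
        ("li", "p1", predicted_value),   # p1 holds the predicted value
        ("bne", rd, "p1", "softtrap"),   # mispredict -> squash & refetch
    ]

seq = speculative_decode_load(("lw", "r1", "36(r29)"), 42)
```

Dependents such as the add above would then be renamed to read `p1` instead of `r1`, which is why they can issue before the load completes.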

Page 10

Benefits of SD

Pre-schedule optimizations outside the OoO core
- enables dynamic optimizations
- eliminates resource contention more effectively than even fully dynamic optimizations, leading to better performance

Implement optimizations using existing I-ISA primitives
- implement microarchitectural ideas with minimal core changes
- reuse the existing data/control paths in the core
- minimize negative effects on the scheduler (invisible to the scheduler)

[Diagram: Fetch -> Decode (guided by a predictor) -> OoO Execution Core -> Commit; original instructions enter decode, transformed instructions enter the core. When mispredicted: squashing and refetching, same as branch mispredictions.]

Page 11

Translation Layer for SD

Many decoders already have a translation layer between user-ISA and implementation-ISA
- because direct implementation of complex instructions is difficult
- P6, Pentium 4, Power4, K7, ...

Functionality required for SD
- one-to-multiple instruction expansion (x86 decoders)
- dynamically variable mapping between U-ISA and I-ISA (experimental S/390)

Reducing the decode overhead
- trace cache / decoded instruction cache (Pentium 4)
- instruction-path coprocessors (Chou and Shen, ISCA 2000)
- the performance drop with extra decode stages is not drastic (sensitivity study in the paper)

Page 12

Outline

Speculative Scheduling

Speculative Decode

Case Study: Memory Reference Combining

Case Study: Silent Store Squashing

Conclusions

Page 13

Case Study: Memory Reference Combining

Discussed extensively in the literature
- Wilson et al.: Increasing cache port efficiency for dynamic superscalar microprocessors, ISCA 1996

[Diagram: a combinable load issues from the LSQ; one access over a 64-bit datapath fills a 64-bit data buffer from the cache/memory, and byte selection completes multiple pending loads (LW 100, LW 104, LW 400, LW 404).]

One cache access satisfies multiple loads (load-all scheme)
- cache port / latency benefits

BUT the speculative scheduler must know in advance whether loads can be combined; otherwise it fails to achieve both benefits

Page 14

Reference Combining via SD

Wide data paths in support of instruction set extensions
- AMD Hammer project: x86-64
- 64-bit PowerPC implementations
- multimedia extensions (SSE, MMX, AltiVec, ...)

Many programs are still written in 32-bit mode for backward compatibility
- SD enables existing binaries to benefit from wider data paths w/o recompilation
- wider (128-bit) combining leads to more benefits (performance data in the paper)

Page 15

Reference Combining via SD

Detecting combinable pairs statically
- same base register with word-size offsets
- two adjacent word memory instructions in program order

  lw r1, 0(r10)
  lw r2, 4(r10)
  lw r3, 8(r10)
  lw r4, 12(r10)

Predict the alignment of the references:

doubleword-aligned:
  dlw r1, 0(r10)
  exthi r2, r1
  dlw r3, 8(r10)
  exthi r4, r3

word-aligned only:
  lw r1, 0(r10)
  dlw r2, 4(r10)
  exthi r3, r2
  lw r4, 12(r10)
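The static detection rule above (same base register, word-size offset difference, adjacent in program order) can be sketched as a simple scan over decoded loads. This is an illustrative model, not the actual sequence-detector hardware:

```python
# Illustrative model of static combinable-pair detection: two adjacent
# word loads off the same base register whose offsets differ by 4 bytes
# are candidates for one doubleword (dlw) access.

def find_combinable_pairs(loads):
    """loads: list of (dest, base, offset) word loads in program order.
    Returns (i, i+1) index pairs that one dlw + exthi could replace."""
    pairs = []
    i = 0
    while i + 1 < len(loads):
        (_, base1, off1), (_, base2, off2) = loads[i], loads[i + 1]
        if base1 == base2 and off2 - off1 == 4:
            pairs.append((i, i + 1))
            i += 2  # both loads are consumed by a single dlw
        else:
            i += 1
    return pairs

loads = [("r1", "r10", 0), ("r2", "r10", 4),
         ("r3", "r10", 8), ("r4", "r10", 12)]
```

Whether a detected pair is actually issued as a dlw then depends on the predicted alignment, as in the two expansions above.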

Page 16

Reference Combining via SD

Pipeline front-end

[Diagram: Fetch -> Decode -> Execution Core. A sequence detector spots adjacent loads/stores, and a predictor tracks the alignment history of loads/stores. When combining is predicted: lw+lw -> dlw + extract; sw+sw -> merge + dsw.]

When misaligned:
- the memory system detects it (same as the base case)
- after the pipeline is drained, the original instructions are fetched again and decoded without transformation

Page 17

Microarchitectural Assumptions

- SimpleScalar PISA w/ speculative scheduling
- 4-wide, 8-stage pipeline
- hybrid branch predictor (gshare + bimodal)
- 64-entry RUU; 32-entry load / 16-entry store schedulers
- 64KB I-/D-L1, 512KB unified L2
- 2 load / 1 store ports (mutually exclusive)
- 2 store buffers outside the OoO core
- HW memory reference combining (HWC): magic scheduler w/ perfect combining knowledge, plus store merging in the store buffer (for store combining)

Page 18

HWC vs. SDC – cache access reductions

[Chart: normalized DL1 accesses (0.4–1.0) for HWC w/ oracle scheduler and SDC across comp, gcc, go, ijpeg, li, m88k, perl, vortex, bzip, gzip, mcf, parser, vpr.]

HWC reduces more cache accesses than SDC

Page 19

HWC vs. SDC – LSQ contention reductions

[Chart: load/store scheduler full rate (0–1.0) for Base, HWC w/ oracle scheduler, and SDC across the same benchmarks.]

SDC reduces LSQ contention more (fewer memory instructions)

Page 20

HWC vs. SDC – Speedups

[Chart: speedups (%) of HWC w/ oracle scheduler and SDC across the benchmarks, roughly 0–12%.]

SD can reap many of the benefits of pure hardware implementations

Page 21

Outline

Speculative Scheduling

Speculative Decode

Case Study: Memory Reference Combining

Case Study: Silent Store Squashing

Conclusions

Page 22

Case Study: Silent Store Squashing (SSS)

Eliminates stores that do not change architectural state
- reduces core and memory system contention

[Diagram: a store is converted implicitly into 3 operations: (1) access memory as a load, (2) compare the values, (3) nullify the store when silent. The converted load occupies the load scheduler while the store occupies the store scheduler.]

Separate load/store schedulers imply replication and more contention -> do explicit conversion

Page 23

Silent Store Squashing via SD

Explicitly removes predicted stores
- reduces store scheduler contention

  add r1, 1, r1                 add r1, 1, r1
  sw  r1, 16(r29)   --(SD)-->   lw  p1, 16(r29)
  lw  r5, 4(r10)                bne p1, r1, trap
                                lw  r5, 4(r10)

- load + trap for store verify: the pipeline is drained when the store is not silent

No store, no aliasing
- silent stores do not change the value, so there is no RAW hazard
- later loads are allowed to bypass earlier unresolved stores even with true dependences
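A minimal sketch of that decode-time rewrite, using the same tuple-style encoding as before (the `p1` scratch register and the `trap` target are illustrative names, not the paper's actual I-ISA):

```python
# Hypothetical sketch of silent store squashing via speculative decode:
# a store predicted silent is replaced by a verifying load plus a trap
# branch, so the store scheduler never sees it.

def squash_if_predicted_silent(store, predicted_silent):
    """store: ('sw', src_reg, addr). When predicted silent, emit
    load + compare-trap; otherwise pass the store through unchanged."""
    op, src, addr = store
    assert op == "sw"
    if not predicted_silent:
        return [store]
    return [
        ("lw", "p1", addr),           # read the value currently in memory
        ("bne", "p1", src, "trap"),   # not silent after all -> drain & refetch
    ]
```

Because no store remains in the transformed sequence, later loads need not wait on it for disambiguation, which is where the no-RAW benefit above comes from.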

Page 24

HWSSS vs. SDSSS – memory disambiguation

[Chart: average clock cycles a load issue is blocked by stores, for Base and SD SSS, across comp, gcc, go, ijpeg, li, m88k, perl, vortex, bzip, gzip, mcf, parser.]

Better memory disambiguation achieved: no store means fewer store-to-load block cycles

Page 25

HWSSS vs. SDSSS – Speedups

[Chart: speedups (%) of HWSSS and SDSSS across the benchmarks, roughly -2% to 23%.]

SDSSS outperforms HWSSS
- better memory disambiguation
- HWSSS does not reduce contention in the store scheduler

Page 26

Conclusions

- Speculative scheduling makes optimizations in the execution stage impractical
- Pre-schedule optimizations by transforming instructions

Advantages of SD-based implementations
- enables execution-stage optimizations
- reuses existing data/control paths
- no negative effect on instruction scheduling
- reduces contention inside the core

Two case studies show that SD can reap many of the benefits of pure hardware implementations
- memory reference combining: less queue contention
- silent store squashing: less queue contention, better memory disambiguation

Page 27

Backup slides

Page 28

Is HWC easy to integrate?

Schedule
- must detect piggyback loads to be issued in the same clock cycle
- actual values are not involved in scheduling: how can combinable loads be detected without effective addresses?

Register file & result bus
- more loads satisfied at the same time means more result bus bandwidth and more RF write ports

[Diagram: base pipeline (Schedule -> LSQ / Addr gen -> Load unit -> RF -> Exe1/Exe2 -> Result bus -> Register file) vs. the same datapath driven by a "magic" schedule.]

Page 29

SD Combining Prediction

[Chart: percentage of all memory references that are combined, combinable but not predicted, or mispredicted, across comp, gcc, go, ijpeg, li, m88k, perl, vortex, bzip, gzip, mcf, parser, vpr (0–50%).]

Over 80% of adjacent combinable references are captured (1024 entries, ~4KB)
Miss rates: ~0.1% of all references

Page 30

Silence Prediction

[Chart: percentage of all stores (0–100%) broken down into not silent / predicted not silent, not silent / predicted silent, checked, silent / predicted not silent, and silent / predicted silent, across the benchmarks.]

45% of silent stores detected (1024 entries, ~2.5KB); low miss rate (~1%)

Page 31

Silence Predictor

[Diagram: each entry, indexed by PC, holds the last store value (lower n bits), a confidence counter, and a threshold counter. The current store value (lower 8 bits) is compared against the stored value: +1 if the same, -1 if different (confidence). The store-verify result updates the threshold: -1 if silent, +4 if not silent.]

[Table: worked example tracing a sequence of stores and loads to addresses A and B, showing for each step the original instruction, the decoded ops, the predictor state (value, confidence, threshold), and memory contents.]
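A rough behavioral model of that predictor, with the counter updates taken from the diagram; the predict-silent condition (confidence at or above threshold) is an assumed interpretation of this sketch, since the slide does not spell it out:

```python
# Illustrative model of the per-PC silence predictor. Update rules follow
# the slide (+1/-1 on value match, -1/+4 on verify outcome); the predict
# condition below is an assumption, not a verified design.

class SilencePredictor:
    def __init__(self):
        self.last_value = {}   # PC -> lower 8 bits of the last store value
        self.confidence = {}   # PC -> confidence counter
        self.threshold = {}    # PC -> threshold counter

    def observe_store(self, pc, value):
        """Train on a store's value: +1 if the low bits repeat, -1 if not."""
        low = value & 0xFF
        delta = 1 if self.last_value.get(pc) == low else -1
        self.confidence[pc] = self.confidence.get(pc, 0) + delta
        self.last_value[pc] = low

    def observe_verify(self, pc, was_silent):
        """Train on a store-verify result: -1 if silent, +4 if not."""
        self.threshold[pc] = self.threshold.get(pc, 0) + (-1 if was_silent else 4)

    def predict_silent(self, pc):
        # Assumed interpretation: predict silent once confidence has
        # caught up with the misprediction-inflated threshold.
        return self.confidence.get(pc, 0) >= self.threshold.get(pc, 0)

p = SilencePredictor()
for _ in range(5):
    p.observe_store(0x400, 100)   # the same value repeats -> confidence rises
```

The asymmetric +4 on a not-silent verify makes the predictor back off quickly after a costly misprediction, matching the slide's bias toward avoiding pipeline drains.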

Page 32

Silence Squashing via SD

Explicit store verify, depending on the predictor state:
- store -> load + compare + store
- store -> load + branch

Eliminates negative effects on the scheduling logic
- explicit load issue for the verify: SSS is virtually invisible to the scheduling logic
- explicit compare/branch operation: the existing branch unit maintains correct machine state

[Diagram: Fetch -> Decode -> Execution Core. When a silent store is predicted, the predictor's "is the last store value silent?" state selects between store -> load+comp+store and store -> load+branch.]

Page 33

SD Combining Prediction Predictor

[Diagram: each entry, indexed by PC, holds a tag, the next target register, and a 4-bit alignment history shift register; a +4/-4 base-address check and an "is aligned?" comparison shift a new bit into the history on each reference.]

Combining is predicted when the history is:
- 1111: aligned 4 times in a row
- 1010: the base is increasing by 4

Over 80% of adjacent combinable references are captured (1024 entries, ~4KB)
- miss rates: ~0.1% of all references
- captures up to 26% of all memory references
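The 4-bit alignment history and its two predict patterns can be modeled directly; table size, tags, and the next-target-register field are omitted, so this is a behavioral sketch only:

```python
# Illustrative model of the combining predictor's alignment history:
# a 4-bit shift register per PC, with combining predicted on the two
# patterns named on the slide.

class CombiningPredictor:
    def __init__(self):
        self.history = {}  # PC -> last 4 alignment outcomes (1 = dw-aligned)

    def update(self, pc, was_aligned):
        h = self.history.get(pc, 0)
        self.history[pc] = ((h << 1) | int(was_aligned)) & 0b1111

    def predict_combine(self, pc):
        # 1111: doubleword-aligned 4 times in a row
        # 1010: base address advancing by one word per reference
        return self.history.get(pc, 0) in (0b1111, 0b1010)

cp = CombiningPredictor()
for aligned in (True, True, True, True):
    cp.update(0x120, aligned)   # history becomes 1111 -> predict combining
```

The 1010 pattern covers the strided case in the slide's example, where alternate references start a new doubleword as the base walks up by one word at a time.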

Page 34

Silence Prediction

- Avoids unnecessary store verifies (load issues)
- How do we train the predictor without the silence outcome? The silence outcome is only available when we do a store verify
- Correlate the last-value information for training

[State diagram: No squash (no SD) / Squash (load+trap) / Check (load+compare+store). Seeing the same value for several instances and silent outcomes move toward squashing; different values or not-silent outcomes move back toward checking.]

45% of silent stores detected (1024 entries, ~2.5KB); low miss rate (~1%)

Page 35

Future Work

- Spectrum of power/performance design points attainable by speculative decode: single core, multiple marketing targets
- Exposing complex control paths to the I-ISA: improving controllability of the processor core, achieving more benefit from SD
- Developing an I-ISA for complexity-effective core design
- And more...