march 2, 2011cs152, spring 2011 cs 152 computer architecture and engineering lecture 11 -...

March 2, 2011 CS152, Spring 2011

CS 152 Computer Architecture and

Engineering

Lecture 11 - Out-of-Order Issue,

Register Renaming,

& Branch Prediction

Krste AsanovicElectrical Engineering and Computer Sciences

University of California at Berkeley

http://www.eecs.berkeley.edu/~krstehttp://inst.eecs.berkeley.edu/~cs152

March 2, 2011 CS152, Spring 2011 2

Last time in Lecture 12• Pipelining is complicated by multiple and/or variable

latency functional units• Out-of-order and/or pipelined execution requires

tracking of dependencies– RAW– WAR– WAW

• Dynamic issue logic can support out-of-order execution to improve performance

– Last time, looked at simple scoreboard to track out-of-order completion

• Hardware register renaming can further improve performance by removing hazards.

March 2, 2011 CS152, Spring 2011 3

Register Renaming

• Decode does register renaming and adds instructions to the issue-stage instruction reorder buffer (ROB)

renaming makes WAR or WAW hazards impossible

• Any instruction in ROB whose RAW hazards have been satisfied can be issued.

Out-of-order or dataflow execution

IF ID WB

ALU Mem

Fadd

Fmul

Issue

March 2, 2011 CS152, Spring 2011 4

Renaming StructuresRenaming table &regfile

Reorder buffer

Load Unit

FU FU Store Unit

< t, result >

Ins# use exec op p1 src1 p2 src2 t1

t2

.

.tn

• Instruction template (i.e., tag t) is allocated by the Decode stage, which also associates tag with register in regfile• When an instruction completes, its tag is deallocated

Replacing the tag by its valueis an expensive operation

March 2, 2011 CS152, Spring 2011 5

Reorder Buffer Management

Instruction slot is candidate for execution when:• It holds a valid instruction (“use” bit is set)• It has not already started execution (“exec” bit is clear)• Both operands are available (p1 and p2 are set)

t1

t2

.

.

.

tn

ptr2 next to

deallocate

ptr1

nextavailable

Ins# use exec op p1 src1 p2 src2

Destination registers are renamed to the instruction’s slot tag

ROB managed circularly• “exec” bit is set when instruction begins execution • When an instruction completes its “use” bit is marked free• ptr2 is incremented only if the “use” bit is marked free

March 2, 2011 CS152, Spring 2011 6

Renaming & Out-of-order IssueAn example

• When are tags in sources replaced by data?

• When can a name be reused?

1 LD F2, 34(R2)2 LD F4, 45(R3)3 MULTD F6, F4, F24 SUBD F8, F2, F25 DIVD F4, F2, F86 ADDD F10, F6, F4

Renaming table Reorder bufferIns# use exec op p1 src1 p2 src2

t1

t2

t3t4t5..

data / ti

p dataF1F2F3F4F5F6F7F8

Whenever an FU produces data

Whenever an instruction completes

t1

1 1 0 LD

t2

2 1 0 LD

5 1 0 DIV 1 v1 0 t4 4 1 0 SUB 1 v1 1 v1

t4

3 1 0 MUL 0 t2 1 v1

t3

t5

v1v1

1 1 1 LD 0

4 1 1 SUB 1 v1 1 v1 4 0

v4

5 1 0 DIV 1 v1 1 v4

2 1 1 LD 2 0

3 1 0 MUL 1 v2 1 v1

March 2, 2011 CS152, Spring 2011 7

IBM 360/91 Floating-Point UnitR. M. Tomasulo, 1967

Mult

1

123456

loadbuffers(from memory)

1234

Adder

123

Floating-PointRegfile

store buffers(to memory)

...

instructions

Common bus ensures that data is made available immediately to all the instructions waiting for it.Match tag, if equal, copy value & set presence “p”.

Distribute instruction templatesby functionalunits

< tag, result >

p tag/datap tag/datap tag/data



p tag/datap tag/data

p tag/datap tag/data2



p tag/datap tag/datap tag/datap tag/data

March 2, 2011 CS152, Spring 2011 8

Effectiveness?

Renaming and Out-of-order execution was firstimplemented in 1969 in IBM 360/91 but did not show up in the subsequent models until mid-Nineties.

Why ?Reasons

1. Effective on a very small class of programs2. Memory latency a much bigger problem3. Exceptions not precise!

One more problem needed to be solved

Control Hazards!

March 2, 2011 CS152, Spring 2011 9

Precise Interrupts

It must appear as if an interrupt is taken between two instructions (say Ii and Ii+1)

• the effect of all instructions up to and including Ii is totally complete

• no effect of any instruction after Ii has taken place

The interrupt handler either aborts the program or restarts it at Ii+1 .

March 2, 2011 CS152, Spring 2011 10

Effect on InterruptsOut-of-order Completion

I1 DIVD f6, f6, f4I2 LD f2, 45(r3)I3 MULTD f0, f2, f4I4 DIVD f8, f6, f2I5 SUBD f10, f0, f6I6 ADDD f6, f8, f2

out-of-order comp 1 2 2 3 1 4 3 5 5 4 6 6 restore f2 restore f10

Consider interrupts

Precise interrupts are difficult to implement at high speed- want to start execution of later instructions before exception checks finished on earlier instructions

March 2, 2011 CS152, Spring 2011 11

Exception Handling(In-Order Five-Stage Pipeline)

• Hold exception flags in pipeline until commit point (M stage)• Exceptions in earlier pipe stages override later exceptions• Inject external interrupts at commit point (override others)• If exception at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage

Asynchronous Interrupts

ExcD

PCD

PCInst. Mem D Decode E M

Data Mem W+

ExcE

PCE

ExcM

PCM

Cause

EPC

Kill D Stage

Kill F Stage

Kill E Stage

Illegal Opcode Overflow

Data Addr Except

PC Address Exceptions

Kill Writeback

Select Handler

PC

Commit Point

March 2, 2011 CS152, Spring 2011 12

Fetch: Instruction bits retrieved from cache.

Phases of Instruction Execution

I-cache

Fetch Buffer

Issue Buffer

Functional Units

ArchitecturalState

Execute: Instructions and operands issued to execution units. When execution completes, all results and exception flags are available.

Decode: Instructions dispatched to appropriate issue-stage buffer

Result Buffer

Commit: Instruction irrevocably updates architectural state (aka “graduation”).

PC

Commit

Decode/Rename

March 2, 2011 CS152, Spring 2011 13

In-Order Commit for Precise Exceptions

• Instructions fetched and decoded into instruction reorder buffer in-order• Execution is out-of-order ( out-of-order completion)• Commit (write-back to architectural state, i.e., regfile & memory, is in-order

Temporary storage needed to hold results before commit (shadow registers and store buffers)

Fetch Decode

Execute

CommitReorder Buffer

In-order In-orderOut-of-order

Exception?

KillKill Kill

Inject handler PC

March 2, 2011 CS152, Spring 2011 14

Extensions for Precise Exceptions

Reorder buffer

ptr2

next tocommit

ptr1

nextavailable

• add <pd, dest, data, cause> fields in the instruction template• commit instructions to reg file and memory in program order buffers can be maintained circularly• on exception, clear reorder buffer by resetting ptr1=ptr2

(stores must wait for commit before updating memory)

Inst# use exec op p1 src1 p2 src2 pd dest data cause

March 2, 2011 CS152, Spring 2011 15

Rollback and Renaming

Register file does not contain renaming tags any more.How does the decode stage find the tag of a source register?

Search the “dest” field in the reorder buffer

Register File(now holds only committed state)

Reorderbuffer

Load Unit

FU FU FU Store Unit

< t, result >

t1

t2

.

.tn

Ins# use exec op p1 src1 p2 src2 pd dest data

Commit

March 2, 2011 CS152, Spring 2011 16

Renaming Table

Register File

Reorder buffer

Load Unit

FU FU FU Store Unit

< t, result >

t1

t2

.

.tn


Commit

Rename Table

Renaming table is a cache to speed up register name look up. It needs to be cleared after each exception taken. When else are valid bits cleared? Control transfers

r1 t vr2

tagvalid bit

March 2, 2011 CS152, Spring 2011 17

I-cache

Fetch Buffer

IssueBuffer

Func.Units

Arch.State

Execute

Decode

ResultBuffer Commit

PC

Fetch

Branchexecuted

Next fetch started

Modern processors may have > 10 pipeline stages between next PC calculation and branch resolution !

Control Flow Penalty

How much work is lost if pipeline doesn’t follow correct instruction flow?

~ Loop length x pipeline width

March 2, 2011 CS152, Spring 2011 18

Instruction Taken known? Target known?

J

JRBEQZ/BNEZ

MIPS Branches and Jumps

Each instruction fetch depends on one or two pieces of information from the preceding instruction:

1) Is the preceding instruction a taken branch?

2) If so, what is the target address?

After Inst. Decode

After Inst. Decode After Inst. Decode

After Inst. Decode After Reg. Fetch

After Reg. Fetch*

*Assuming zero detect on register read

March 2, 2011 CS152, Spring 2011 19

Branch Penalties in Modern Pipelines

A PC Generation/MuxP Instruction Fetch Stage 1F Instruction Fetch Stage 2B Branch Address Calc/Begin DecodeI Complete DecodeJ Steer Instructions to Functional unitsR Register File ReadE Integer Execute

Remainder of execute pipeline (+ another 6 stages)

UltraSPARC-III instruction fetch pipeline stages(in-order issue, 4-way superscalar, 750MHz, 2000)

Branch Target Address Known

Branch Direction &Jump Register Target Known

March 2, 2011 CS152, Spring 2011 20

Reducing Control Flow Penalty

Software solutions• Eliminate branches - loop unrolling

Increases the run length • Reduce resolution time - instruction scheduling

Compute the branch condition as early as possible (of limited value)

Hardware solutions• Find something else to do - delay slots

Replaces pipeline bubbles with useful work

(requires software cooperation)• Speculate - branch prediction

Speculative execution of instructions beyond the branch

March 2, 2011 CS152, Spring 2011 21

Branch Prediction

Motivation:Branch penalties limit performance of deeply pipelined processors

Modern branch predictors have high accuracy(>95%) and can reduce branch penalties significantly

Required hardware support:Prediction structures:

• Branch history tables, branch target buffers, etc.

Mispredict recovery mechanisms:• Keep result computation separate from commit• Kill instructions following branch in pipeline• Restore state to state following branch

March 2, 2011 CS152, Spring 2011 22

Static Branch Prediction

Overall probability a branch is taken is ~60-70% but:

ISA can attach preferred direction semantics to branches, e.g., Motorola MC88110

bne0 (preferred taken) beq0 (not taken)

ISA can allow arbitrary choice of statically predicted direction, e.g., HP PA-RISC, Intel IA-64 typically reported as ~80% accurate

JZ

JZbackward

90%forward

50%

March 2, 2011 CS152, Spring 2011 23

Dynamic Branch Predictionlearning based on past behavior

Temporal correlationThe way a branch resolves may be a good predictor of the way it will resolve at the next execution

Spatial correlation Several branches may resolve in a highly correlated manner (a preferred path of execution)

March 2, 2011 CS152, Spring 2011 24

• Assume 2 BP bits per instruction• Change the prediction after two consecutive mistakes!

¬takewrong

taken¬ taken

taken

taken

taken¬takeright

takeright

takewrong

¬ taken

¬ taken¬ taken

BP state:(predict take/¬take) x (last prediction right/wrong)

Branch Prediction Bits

March 2, 2011 CS152, Spring 2011 25

Branch History Table

4K-entry BHT, 2 bits/entry, ~80-90% correct predictions

0 0Fetch PC

Branch? Target PC

+

I-Cache

Opcode offset

Instruction

k

BHT Index

2k-entryBHT,2 bits/entry

Taken/¬Taken?

March 2, 2011 CS152, Spring 2011 26

Exploiting Spatial CorrelationYeh and Patt, 1992

History register, H, records the direction of the last N branches executed by the processor

if (x[i] < 7) theny += 1;

if (x[i] < 5) thenc -= 4;

If first condition false, second condition also false

March 2, 2011 CS152, Spring 2011 27

Two-Level Branch PredictorPentium Pro uses the result from the last two branchesto select one of the four sets of BHT bits (~95% correct)

0 0

kFetch PC

Shift in Taken/¬Taken results of each branch

2-bit global branch history shift register

Taken/¬Taken?

March 2, 2011 CS152, Spring 2011 28

Speculating Both Directions

• resource requirement is proportional to the number of concurrent speculative executions

An alternative to branch prediction is to execute both directions of a branch speculatively

• branch prediction takes less resources than speculative execution of both paths

• only half the resources engage in useful work when both directions of a branch are executed speculatively

With accurate branch prediction, it is more cost effective to dedicate all resources to the predicted direction

March 2, 2011 CS152, Spring 2011 29

Limitations of BHTs

Only predicts branch direction. Therefore, cannot redirect fetch stream until after branch target is determined.

UltraSPARC-III fetch pipeline

Correctly predicted taken branch penalty

Jump Register penalty


Remainder of execute pipeline (+ another 6 stages)

March 2, 2011 CS152, Spring 2011 30

CS152 Administrivia

March 2, 2011 CS152, Spring 2011 31

Branch Target Buffer

BP bits are stored with the predicted target address.

IF stage: If (BP=taken) then nPC=target else nPC=PC+4later: check prediction, if wrong then kill the instruction and update BTB & BPb else update BPb

IMEM

PC

Branch Target Buffer (2k entries)

k

BPbpredicted

target BP

target

March 2, 2011 CS152, Spring 2011 32

Address Collisions

What will be fetched after the instruction at 1028?BTB prediction = Correct target =

Assume a 128-entry BTB

BPbtarget

take236

1028 Add .....

132 Jump 100

InstructionMemory

2361032

kill PC=236 and fetch PC=1032

Is this a common occurrence?Can we avoid these bubbles?

March 2, 2011 CS152, Spring 2011 33

BTB is only for Control Instructions

BTB contains useful information for branch and jump instructions only

Do not update it for other instructions

For all other instructions the next PC is PC+4 !

How to achieve this effect without decoding the instruction?

March 2, 2011 CS152, Spring 2011 34

Branch Target Buffer (BTB)

• Keep both the branch PC and target PC in the BTB • PC+4 is fetched if match fails• Only taken branches and jumps held in BTB• Next PC determined before branch fetched and decoded

2k-entry direct-mapped BTB(can also be associative)

I-Cache PC

k

Valid

valid

Entry PC

=

match

predicted

target

target PC

March 2, 2011 CS152, Spring 2011 35

Combining BTB and BHT• BTB entries are considerably more expensive than BHT, but can

redirect fetches at earlier stage in pipeline and can accelerate indirect branches (JR)

• BHT can hold many more entries and is more accurate


BTB

BHTBHT in later pipeline stage corrects when BTB misses a predicted taken branch

BTB/BHT only updated after branch resolves in E stage

March 2, 2011 CS152, Spring 2011 36

Uses of Jump Register (JR)• Switch statements (jump to address of matching case)

• Dynamic function call (jump to run-time function address)

• Subroutine returns (jump to return address)

How well does BTB work for each of these cases?

BTB works well if same case used repeatedly

BTB works well if same function usually called, (e.g., in C++ programming, when objects have same type in virtual function call)

BTB works well if usually return to the same place

Often one function called from many distinct call sites!

March 2, 2011 CS152, Spring 2011 37

Subroutine Return Stack

Small structure to accelerate JR for subroutine returns, typically much more accurate than BTBs.

&fb()

&fc()

Push call address when function call executed

Pop return address when subroutine return decoded

fa() { fb(); }

fb() { fc(); }

fc() { fd(); }

&fd() k entries(typically k=8-16)

March 2, 2011 CS152, Spring 2011 38

Mispredict Recovery

In-order execution machines:– Assume no instruction issued after branch can write-back before

branch resolves– Kill all instructions in pipeline behind mispredicted branch

– Multiple instructions following branch in program order can complete before branch resolves

Out-of-order execution?

March 2, 2011 CS152, Spring 2011 39

In-Order Commit for Precise Exceptions

• Instructions fetched and decoded into instruction reorder buffer in-order• Execution is out-of-order ( out-of-order completion)• Commit (write-back to architectural state, i.e., regfile & memory, is in-order

Temporary storage needed in ROB to hold results before commit

Fetch Decode

Execute


In-order In-orderOut-of-order

KillKill Kill

Exception?Inject handler PC

March 2, 2011 CS152, Spring 2011 40

Branch Misprediction in Pipeline

Fetch Decode

Execute


Kill

Kill Kill

BranchResolution

Inject correct PC

• Can have multiple unresolved branches in ROB• Can resolve branches out-of-order by killing all the instructions in ROB that follow a mispredicted branch

BranchPrediction

PC

Complete

March 2, 2011 CS152, Spring 2011 41

t vt vt v

Recovering ROB/Renaming Table

Register File

Reorder buffer Load

UnitFU FU FU Store

Unit

< t, result >

t1

t2

.

.tn


Commit

Rename Table r1

t v

r2

Take snapshot of register rename table at each predicted branch, recover earlier snapshot if branch mispredicted

Rename Snapshots

Ptr2 next to commit

Ptr1 next available

rollback next available

March 2, 2011 CS152, Spring 2011 42

“Data-in-ROB” Design(HP PA8000, Pentium Pro, Core2Duo, Nehalem)

• On dispatch into ROB, ready sources can be in regfile or in ROB dest (copied into src1/src2 if ready before dispatch)

• On completion, write to dest field and broadcast to src fields.• On issue, read from ROB src fields

Register Fileholds only committed state

Reorderbuffer

Load Unit

FU FU FU Store Unit

< t, result >

t1

t2

.

.tn


Commit

March 2, 2011 CS152, Spring 2011 43

Acknowledgements

• These slides contain material developed and copyright by:

– Arvind (MIT)– Krste Asanovic (MIT/UCB)– Joel Emer (Intel/MIT)– James Hoe (CMU)– John Kubiatowicz (UCB)– David Patterson (UCB)

• MIT material derived from course 6.823• UCB material derived from course CS252

march 2, 2011cs152, spring 2011 cs 152 computer architecture and engineering lecture 11 -...

Documents