eecs 470 pipeline control hazards lecture 5 coverage: chapter 3 & appendix a

EECS 470 Pipeline Control Hazards

Lecture 5
Coverage: Chapter 3 & Appendix A

Pipeline function for BEQ

• Fetch: read instruction from memory

• Decode: read source operands from reg

• Execute: calculate target address and test for equality

• Memory: Send target to PC if test is equal

• Writeback: Nothing left to do

Control Hazards

beq 1 1 10
sub 3 4 5

time →
beq:  fetch  decode  execute  memory  writeback
sub:         fetch   decode   execute

Approaches to handling control hazards

• Avoidance
  – Make sure there are no hazards in the code

• Detect and Stall
  – Delay fetch until the branch is resolved

• Speculate and Squash if wrong
  – Go ahead and fetch more instructions in case the guess is correct, but stop them if they shouldn’t have been executed

Handling branch hazards: avoid all hazards

• Don’t have branch instructions!
  – Maybe a little impractical

• Predication can eliminate some branches
  – If-conversion
  – Hyperblocks

if-conversion

if (a == b) {
    x++;
    y = n / d;
}

sub t1 a, b
jnz t1, PC+2
add x x, #1
div y n, d

sub t1 a, b
add(t1) x x, #1
div(t1) y n, d

sub t1 a, b
add t2 x, #1
div t3 n, d
cmov(t1) x t2
cmov(t1) y t3
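
The branch version and the cmov version above can be checked against each other with a small model. This Python sketch is illustrative only: it assumes the predicate `t1` enables an instruction when `t1 == 0` (that is, when `a == b`), and the function names are hypothetical. It also assumes the always-executed `div` must not trap when `d` is zero, which the sketch models by guarding the division.

```python
def branch_version(a, b, x, y, n, d):
    """Original code: the branch skips the body when a != b."""
    if a == b:
        x = x + 1
        y = n // d
    return x, y


def cmov_version(a, b, x, y, n, d):
    """If-converted code: compute both results, then conditionally commit."""
    t1 = a - b                        # sub t1 a, b
    t2 = x + 1                        # add t2 x, #1 (always executed)
    # div t3 n, d is always executed. Assumption: a speculative divide by
    # zero produces a junk value instead of trapping.
    t3 = n // d if d != 0 else 0
    x = t2 if t1 == 0 else x          # cmov(t1) x t2
    y = t3 if t1 == 0 else y          # cmov(t1) y t3
    return x, y
```

Both functions return the same `(x, y)` pair for any inputs. The cmov version has no control hazard, but it always pays for the add and the divide, even when their results are discarded.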

Removing hazards by redefining a branch instruction

• Redefine branch instructions: ptbeq regA regB offset
  – “prepare to branch if equal”

If (R[regA] == R[regB]), execute the instructions at PC+1, PC+2, PC+3, then PC+1+offset

ptbnz example

t = 5
n = 7
g = c + 2
bnz g, PC + 1
m = 5
a = 3

g = c + 2
bnz g, PC + 4
t = 5
n = 7
noop
m = 5
a = 3

Problems with this solution

• Old programs (legacy code) may not run correctly on new implementations
  – Longer pipelines tend to need more noops

• Programs get larger as noops are included
  – Especially a problem for machines that try to execute more than one instruction every cycle
  – Harder to find useful instructions

• Program execution is slower
  – CPI is one, but some of the instructions are noops

Handling control hazards: detect and stall

• Detection:
  – Must wait until decode
  – Compare opcode to beq or jalr
  – Alternately, this is just another control signal

• Stall:
  – Keep current instructions in fetch
  – Pass noop to decode stage (not execute!)

[Figure: five-stage pipeline datapath (PC, instruction memory, register file, ALU, data memory, sign extend, control; pipeline registers IF/ID, ID/EX, EX/Mem, Mem/WB) with bnz r1 in the pipeline]

[Figure: five-stage pipeline datapath with an added MUX that injects a noop during a stall]

Control Hazards

beq 1 1 10
sub 3 4 5

time →
beq:     fetch  decode  execute  memory  writeback
sub:            fetch   fetch    fetch   fetch
                                   or
Target:                                  fetch

Problems with detect and stall

• CPI increases every time a branch is detected!

• Is that necessary? Not always!
  – Only about ½ of the time is the branch taken

• Let’s assume that it is NOT taken…
  – In this case, we can ignore the beq (treat it like a noop)
  – Keep fetching PC + 1

• What if we are wrong?
  – OK, as long as we do not COMPLETE any instructions we mistakenly executed (i.e. don’t perform writeback)
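
The cost of each policy can be estimated with a simple CPI model. This is a sketch under stated assumptions, not a figure from the lecture: the penalty is taken to be 3 cycles (the branch resolves in the memory stage, so three fetch slots are lost), and the 20% branch frequency in the usage example is a hypothetical value chosen only for illustration. The function names are also hypothetical.

```python
def cpi_detect_and_stall(branch_fraction, penalty=3):
    """Every branch stalls fetch for `penalty` cycles."""
    return 1.0 + branch_fraction * penalty


def cpi_predict_not_taken(branch_fraction, taken_fraction, penalty=3):
    """Only taken branches pay the penalty; not-taken branches are free."""
    return 1.0 + branch_fraction * taken_fraction * penalty


if __name__ == "__main__":
    f = 0.20  # hypothetical share of instructions that are branches
    print("detect and stall: ", cpi_detect_and_stall(f))
    print("assume not taken: ", cpi_predict_not_taken(f, taken_fraction=0.5))
```

With half of the branches taken, assuming not-taken halves the branch overhead relative to stalling on every branch.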

Handling control hazards: speculate and squash

• Speculate: assume not equal
  – Keep fetching from PC+1 until we know that the branch is really taken

• Squash: stop bad instructions if taken
  – Send a noop to:
    • Decode, Execute and Memory
  – Send target address to PC

[Figure: five-stage pipeline datapath during a squash: the instruction stream beq, sub, add, nand is in flight; when beq resolves as equal, noops replace the instructions fetched after it and the target address is sent to the PC]

Problems with fetching PC+1

• CPI increases every time a branch is taken!
  – About ½ of the time

• Is that necessary?

No! But how can you fetch from the target before you even know the previous instruction is a branch, much less whether it is taken?

[Figure: five-stage pipeline datapath with a branch target buffer in the fetch stage: the fetch PC is compared (eq?) with a stored branch PC (bpc), and on a match the stored target is selected as the next PC]

Branch Target Buffer

BTB entry: Fetch PC → Predicted target PC

Send PC to BTB
found?
  Yes: use target
  No: use PC+1
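
The lookup flow above can be sketched in a few lines of Python. This is a minimal model under assumptions that the slides do not state: the buffer is direct-mapped, indexed by the low bits of a word-addressed PC, and each entry stores the full branch PC as its tag. The class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: maps a fetch PC to a predicted target PC."""

    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        # Each entry holds (branch_pc, target_pc), or None when empty.
        self.entries = [None] * num_entries

    def _index(self, pc):
        return pc % self.num_entries

    def predict_next_pc(self, pc):
        """Return the stored target on a hit, otherwise PC+1."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:   # tag compare ("found?")
            return entry[1]
        return pc + 1

    def update(self, pc, target_pc):
        """Record a taken branch so the next fetch of pc predicts its target."""
        self.entries[self._index(pc)] = (pc, target_pc)
```

The key property is that the lookup needs only the fetch PC, so it can run in the fetch stage before the instruction has been decoded.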

Branch prediction

• Predict not taken: ~50% accurate
  – No BTB needed; always use PC+1

• Predict backward taken: ~65% accurate
  – BTB holds targets for backward branches (loops)

• Predict same as last time: ~80% accurate
  – Update BTB for any taken branch
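
The three policies can be compared on a toy trace. The sketch below is illustrative only: the trace is synthetic (a single loop branch), the function names are hypothetical, and the resulting numbers are not meant to reproduce the accuracies quoted above, which come from real workloads.

```python
def predict_not_taken(pc, target, history):
    return False


def predict_backward_taken(pc, target, history):
    # A branch whose target is at a lower address is assumed to be a loop.
    return target < pc


def predict_same_as_last(pc, target, history):
    # Assumption: a branch never seen before is predicted not taken.
    return history.get(pc, False)


def accuracy(predictor, trace):
    """Fraction of branches in `trace` whose direction was predicted correctly."""
    history = {}
    correct = 0
    for pc, target, taken in trace:
        if predictor(pc, target, history) == taken:
            correct += 1
        history[pc] = taken   # remember the most recent outcome per branch
    return correct / len(trace)


def loop_trace(iterations):
    """Synthetic trace: a backward branch at PC 10 to PC 4, taken on every
    iteration except the last."""
    return [(10, 4, i < iterations - 1) for i in range(iterations)]
```

On `loop_trace(10)`, predict-same-as-last-time misses twice (the first iteration and the loop exit), while predict-backward-taken misses only the exit. Last-time prediction pays off on branches whose direction cannot be inferred from the target address.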

What about indirect branches?

• Could use same approach
  – PC+1 unlikely indirect target
  – Indirect jumps often have multiple targets (for same instruction)
    • Switch statements
    • Virtual function calls
    • Shared library (DLL) calls

Indirect jump: Special Case

• Return address stack
  – Function returns have deterministic behavior (usually)
    • Return to different locations (BTB doesn’t work well)
    • Return location known ahead of time
      – In some register at the time of the call
  – Build a specialized structure for return addresses
    • Call instructions write return address to R31 AND RAS
    • Return instructions pop predicted target off stack
  – Issues: finite size (save or forget on overflow?)
  – Issues: long jumps (clear when wrong?)
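
A return address stack can be modeled as a small bounded stack. The sketch below is one possible design under assumptions: the capacity is arbitrary, the overflow policy is "forget the oldest entry" (one of the two options raised above), the PC is word-addressed so the return address is PC+1, and the names are hypothetical.

```python
class ReturnAddressStack:
    """Bounded stack of predicted return addresses."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.stack = []

    def on_call(self, call_pc):
        """A call pushes its return address (PC+1)."""
        if len(self.stack) == self.capacity:
            self.stack.pop(0)   # overflow policy (assumption): drop the oldest
        self.stack.append(call_pc + 1)

    def on_return(self):
        """A return pops the predicted target; None means no prediction."""
        return self.stack.pop() if self.stack else None
```

Because the same return instruction goes to a different address for each call site, a BTB entry keyed on the return's PC keeps changing, whereas the stack tracks the call nesting directly.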

Branch prediction

• Pentium: ~85% accurate

• Pentium Pro: ~92% accurate

• Best paper designs: ~96% accurate

Costs of branch prediction/speculation

• Performance costs?
  – Minimal: no difference between waiting and squashing; and it is a huge gain when prediction is correct!

• Power?
  – Large: in very long/wide pipelines many instructions can be squashed
    • Squashed = # mispredictions × pipeline length/width before target resolved

• Area?
  – Can be large: predictors can get very big as we will see next time

• Complexity?
  – Designs are more complex
  – Testing becomes more difficult

What else can be speculated?

• Dependencies
  – I think this data is coming from that store instruction

• Values
  – I think I will load a 0 value

• Accuracy?
  – Branch prediction (direction) is Boolean (T, NT)
  – Branch targets are stable or predictable (RAS)
  – Dependencies are limited
  – Values cover a huge space (0 – 4B)
