
Branch Prediction

J. Nelson Amaral

Why Branch Prediction?

• Roughly one in every 5 to 7 instructions of a program is a branch.
• Not predicting, or mispredicting, is very costly in architectures with deep pipelines or with many functional units.

Baer p. 129

Anatomy of a Predictor

Baer p. 130

Anatomy of a Branch Predictor

• Event Source: the execution of the program
  – Predictive information:
    • Can be encoded in the instruction code
      – a bit indicates the most likely outcome
      – forward/backward branch
    • Obtained from some profiling information

Baer p. 130

Prog. Exec.

Anatomy of a Branch Predictor (cont.)

• Event Selection: when to predict?
  – Simple solution: compute the prediction for every instruction (even non-branches)
    • Only use the result of the prediction for branches

Baer p. 130

Event Selec.


• Prediction Indexing:
  – Use part of the PC to index prediction tables:
    • history of outcome of previous branches at this PC
    • history of execution path leading to this PC

Baer p. 130

Pred. Index.


• Predictor Mechanism:
  – Static (example):
    • forward: always not taken
    • backward: always taken
  – Dynamic:
    • Finite State Machine predictor: saturating counters
    • Markov predictor: correlation

Baer p. 131

Pred. Mechan.


• Feedback and Recovery:
  – Use real outcome to reinforce prediction
  – Must recover from mispredictions

Baer p. 131

Feedback

Control Flow Statistics

Application   % control flow   % cond. branches (% taken)   % uncond. (% direct)   % calls   % returns
SPEC95int     20.4             14.9 (46)                    1.1 (77)               2.2       2.1
Desktop       18.7             13 (39)                      1.1 (92)               2.4       2.1

A 4-way superscalar has to predict a branch, on average, every other cycle.

Baer p. 131

Interbranch Distances

40% of the time there is 1 or 0 cycles between predictions.

Branch resolution takes about 10 cycles.

If the prediction is wrong, up to 40 wrong instructions (4 per cycle × 10 cycles) are in flight by the time the resolution occurs.

Simulation for a 4-way out-of-order architecture. Baer p. 131

Static Predictions

Always Taken OR Always Not Taken

Baer p. 132

Static Predictions

• Early studies indicated that 2/3 of branches are taken
  – but 30% of those branches were unconditional!
• For conditional branches there appears to be no preferred direction.

Always Taken

Baer p. 132

Alternative Static Predictions

Forward: Always Not Taken. Backward: Always Taken.

Accuracy improvements are barely noticeable.

Static prediction based on profiling is slightly better.

Static branch-not-taken has no implementation cost on the pipeline.

Baer p. 132

Dynamic Predictors

• Prediction of a given branch changes with the execution of the program.
  – Simple: a finite-state machine encodes the outcome of a few recent executions of the branch.
  – Elaborate: not only earlier branch outcomes, but other correlated parts of the program are considered.

Baer p. 132

When to predict?

• Static prediction: at the Instruction Decode stage
  – Know that the instruction is a branch
• Dynamic prediction: at the Instruction Fetch stage
  – Calculate prediction for every instruction, even non-branch ones.

Baer p. 133

What to Predict?

• Branch Direction: Is the branch taken or not?

• Branch Target: Address of next instruction for a taken branch

Baer p. 133

Predicting Direction

• Where do we find the prediction?

• How to encode the prediction?

Look at the recent past:

What was the direction the last time this same branch was executed?

A single bit encodes the prediction:

Prediction bit is set at prediction time.

Baer p. 133

Prediction Hysteresis

• Look at the last two resolutions
  – Two wrong predictions are necessary to change the prediction
  – Motivated by wrong predictions at the end of inner loops.

Baer p. 133

2-Bit Saturating Counter

Last two instances were taken

Last instance was taken but the previous was not

Last two instances were not taken

Last instance was not taken but the previous was taken

Baer p. 134

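2-Bit Saturating Counter (Example)

for(i=0 ; i < m ; i++)
  for(j=0; j<n ; j++)
    begin S1; S2; …; Sk end;

(Flowchart of the nested loop: i ← 0; tests labeled m ≤ 0 and n ≥ 0; j ← 0; S1; S2; …; Sk; j ← j+1; test j < n; i ← i+1; test i < m; branch edges labeled T and NT.)

1-bit predictor for the inner-loop branch:

i   j   Pred   Outc
0   0   NT     T
0   1   T      T
0   n   T      NT
1   0   NT     T
1   1   T      T

2 × m mispredictions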

Baer p. 134

2-Bit Saturating Counter (Example)

for(i=0 ; i < m ; i++)
  for(j=0; j<n ; j++)
    begin S1; S2; …; Sk end;

(Flowchart of the nested loop: i ← 0; tests labeled m ≤ 0 and n ≥ 0; j ← 0; S1; S2; …; Sk; j ← j+1; test j < n; i ← i+1; test i < m; branch edges labeled T and NT.)

1-bit and 2-bit predictors for the inner-loop branch:

        1-bit          2-bit
i   j   Pred   Outc    State   Pred   Outc
0   0   NT     T       wNT     NT     T
0   1   T      T       sT      T      T
0   n   T      NT      sT      T      NT
1   0   NT     T       wT      T      T
1   1   T      T       sT      T      T

m + 1 mispredictions

Baer p. 134
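The two misprediction counts above can be checked with a small simulation. This is only a sketch: the function names are hypothetical, the initial states (not taken for 1 bit, weak not taken for 2 bits) are taken from the first row of the table, and the 2-bit machine is assumed to be a plain saturating counter. Each outer iteration is modeled as `taken` taken outcomes of the inner-loop branch followed by one not-taken outcome.

```c
#include <stdbool.h>

/* Mispredictions of a 1-bit predictor on the inner-loop branch.
   Each of the m outer iterations yields `taken` taken outcomes
   followed by one not-taken outcome (the loop exit). */
int mispredicts_1bit(int m, int taken)
{
    bool pred = false;                 /* assumed initial prediction: not taken */
    int misses = 0;
    for (int i = 0; i < m; i++) {
        for (int k = 0; k <= taken; k++) {
            bool outcome = (k < taken);
            if (pred != outcome)
                misses++;
            pred = outcome;            /* remember only the last direction */
        }
    }
    return misses;
}

/* Same branch, predicted with a 2-bit saturating counter.
   States: 0 = sNT, 1 = wNT, 2 = wT, 3 = sT. */
int mispredicts_2bit(int m, int taken)
{
    int state = 1;                     /* assumed initial state: weak not taken */
    int misses = 0;
    for (int i = 0; i < m; i++) {
        for (int k = 0; k <= taken; k++) {
            bool outcome = (k < taken);
            bool pred = (state >= 2);
            if (pred != outcome)
                misses++;
            if (outcome) {
                if (state < 3) state++;
            } else {
                if (state > 0) state--;
            }
        }
    }
    return misses;
}
```

With m = 10 and at least two taken outcomes per inner loop, the 1-bit predictor misses twice per outer iteration while the 2-bit counter misses once per outer iteration plus once during warm-up.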

Accuracy of Branch Prediction

• Includes unconditional branches
• Predictions are associated with branches after each branch’s first execution

3-bit counters yield only minor improvements

Baer p. 135

Average of 26 traces (IBM 370, DEC PDP-11, CDC 6400)

Average of 32 traces (MIPS R2000, Sun SPARC, DEC VAX, Motorola 68000)

Fixed prediction: determined by the first execution of the branch.

Where to store the Prediction

Need one (or two) bits for each possible branch address.
  32-bit address → 2^30 entries

Storing prediction bits with instructions.
  Need to modify code every 5 instructions.

Use a cache (Branch Prediction Buffer – BPB).
  Many more bits for tags than for predictions.

Solution: ditch the tags.

Baer p. 136

Pattern History Table (PHT)

Use selected bits from the PC to index (or hash) the PHT.

Each entry of the PHT stores the state of a finite state machine associated with a branch.

Aliasing: multiple branches may index the same PHT entry.

Performance degrades slightly.

Baer p. 136
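A tagless PHT of 2-bit counters can be sketched as follows. The table size, the choice of PC bits, and the function names are assumptions made for illustration, not values from the slides.

```c
#include <stdint.h>

#define PHT_BITS 10                    /* hypothetical size: 1K counters */
#define PHT_SIZE (1u << PHT_BITS)

static uint8_t pht[PHT_SIZE];          /* 2-bit counters, all start at 0 (strong not taken) */

/* Select PC bits above the 2-bit word offset; there are no tags. */
unsigned pht_index(uint32_t pc)
{
    return (pc >> 2) & (PHT_SIZE - 1u);
}

/* Predict taken when the counter is in one of the two "taken" states. */
int pht_predict(uint32_t pc)
{
    return pht[pht_index(pc)] >= 2;
}

/* Saturating update with the resolved direction. */
void pht_update(uint32_t pc, int taken)
{
    uint8_t *c = &pht[pht_index(pc)];
    if (taken) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}
```

Because there are no tags, two branches whose selected PC bits match (for example 0x1000 and 0x2000 with this table size) share one counter, which is the aliasing mentioned above.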

Accuracy of Bimodal Predictor (based on PHT)

Based on 10 SPEC89 traces.

Baer p. 137

Where is the Predictor Stored?

Separate PHT, or embedded in the instruction cache.

Alpha 21264: 1 counter per instruction? (2K counters)

Sun UltraSPARC: 2 counters/cache line (2K counters)

AMD K5: 1 counter/cache line (1K counters)

MIPS R10000: (512 counters)

IBM PowerPC 620: (512 counters)

Intel Pentium: combines PHT with Branch Target Buffer (512 entries)

Baer p. 137

Feedback and Recovery

Baer p. 137

Feedback

Feedback: Bimodal Predictor

• Feedback: update the 2-bit counter for the executing branch
• When is the update done?
  – When the actual direction is found (EX stage): other predictions of the same branch may already have been made.
  – When the branch commits: even more predictions may already have been made.
  – Speculatively, when the prediction is made: only reinforces the prediction in a bimodal predictor.

Textbook typo (p. 137): choice for the timing of the “update”. Baer p. 137

EX/commit updating makes little difference in performance.

Local × Global Predictor

• Local:
  – Only use history of the branch to be predicted
• Global:
  – Use history of other branches that precede the branch to be predicted.

Baer p. 138

Motivation for Global Prediction

• Example from SPEC program eqntott:

if (aa == 2)      /* b1 */
    aa = 0;
if (bb == 2)      /* b2 */
    bb = 0;
if (aa != bb) {   /* b3 */
    ….
}

If b1 and b2 are taken, then b3 is not taken (both aa and bb are 0, so aa != bb is false).

Baer p. 138

Correlator Predictor

History Register

1 inserted to the right when a branch is taken (0 otherwise)

Shifted-out bits are lost

Two-level predictor.

Baer p. 139
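The history register update described above amounts to a shift and a mask. A minimal sketch, assuming a 4-bit register (the width and the name `gr_shift` are illustrative only):

```c
#include <stdint.h>

#define HIST_BITS 4                    /* hypothetical history length */

/* Shift the new outcome in at the right; bits shifted out at the left are lost. */
uint32_t gr_shift(uint32_t gr, int taken)
{
    uint32_t mask = (1u << HIST_BITS) - 1u;
    return ((gr << 1) | (taken ? 1u : 0u)) & mask;
}
```

The resulting value is what indexes the PHT in the second level of the two-level predictor.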

Update Problem in the Correlator Predictor

• PHT is updated non-speculatively at commit stage.

• What is the problem with non-speculative updates of the global register?

Baer p. 139

Updating the Global Register in the Correlator Predictor

Event               Time
Prediction of b1    t
Prediction of b2    t+1
Prediction of b3    t+2
Commit of b1        t+5

Branches b1 and b2 are not included in the prediction of branch b3!

Baer p. 139

Updating the Global Register in the Correlator Predictor

Mispredictions and cache misses affect the commit time of earlier branches.

• Two consecutive predictions of a branch b may use different ancestors of b.
• Even if the path leading to b is the same

Baer p. 139

Solution to the Update Problem in the Correlator Predictor

• Update Global Register speculatively when prediction is made.
• New problem:
  – Need a repair mechanism
  – All bits after a misprediction are from branches in the wrong path.

Baer p. 139

Repair Mechanism for Global Register in the Correlator Predictor

• Decode Stage:
  – Checkpoint current GR into a FIFO queue
• Commit Stage:
  – H: head of the queue
  – The corresponding checkpointed GR is H.
  – Correct prediction: discard H
  – Incorrect prediction: shift branch outcome into H and make it the new GR.

Baer p. 144
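The checkpoint-and-repair steps above could be sketched like this. The structure, the queue capacity, the history width, and the function names are hypothetical; dropping the younger checkpoints after a misprediction is an added assumption (those branches are on the wrong path), not something the slide states.

```c
#include <stdint.h>

#define GH_BITS  8                     /* hypothetical history length */
#define GH_QUEUE 16                    /* hypothetical number of in-flight branches */

typedef struct {
    uint32_t gr;                       /* speculative global register */
    uint32_t q[GH_QUEUE];              /* FIFO of checkpointed GR values */
    int head;                          /* index of the oldest checkpoint */
    int count;                         /* number of checkpoints in flight */
} GlobalHistory;

static uint32_t gh_shift(uint32_t gr, int taken)
{
    return ((gr << 1) | (taken ? 1u : 0u)) & ((1u << GH_BITS) - 1u);
}

/* Decode stage: checkpoint the current GR, then update it speculatively
   with the predicted direction. Returns 0 if the queue is full. */
int gh_predict(GlobalHistory *g, int predicted_taken)
{
    if (g->count == GH_QUEUE)
        return 0;
    g->q[(g->head + g->count) % GH_QUEUE] = g->gr;
    g->count++;
    g->gr = gh_shift(g->gr, predicted_taken);
    return 1;
}

/* Commit stage of the oldest in-flight branch. */
void gh_commit(GlobalHistory *g, int predicted_taken, int actual_taken)
{
    if (g->count == 0)
        return;
    uint32_t h = g->q[g->head];        /* H: checkpoint taken before this branch */
    g->head = (g->head + 1) % GH_QUEUE;
    g->count--;
    if (predicted_taken != actual_taken) {
        g->gr = gh_shift(h, actual_taken);  /* repaired GR */
        g->count = 0;                       /* assumption: squash younger checkpoints */
    }
}
```

On a correct prediction the checkpoint is simply discarded and the speculative register is left untouched.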

Optimization to GR Checkpointing

Put into the queue a GR that has the corrected bit shifted into it.

Baer p. 144

Issues with Correlator Predictor

• For small PHTs
  – Performance is worse than local predictors
• It does not use the location of the branch in the program for the prediction
  – May introduce excessive aliasing
• Solution to the aliasing problem:
  – Reintroduce the PC in the indexing of PHT

Baer p. 140

gshare Predictor

A common hash is an XOR function. Baer p. 141
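The XOR hash can be written in one line. In this sketch the index width, the dropping of the two low PC bits, and the name `gshare_index` are assumptions for illustration.

```c
#include <stdint.h>

#define GSHARE_BITS 12                 /* hypothetical PHT index width */

/* gshare-style index: XOR selected PC bits with the global history register. */
unsigned gshare_index(uint32_t pc, uint32_t gr)
{
    uint32_t mask = (1u << GSHARE_BITS) - 1u;
    return ((pc >> 2) ^ gr) & mask;
}
```

The same branch reached with two different histories maps to two different counters, and two different branches with the same history usually do as well.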

Accuracy and Use of gshare

• Almost perfect for SPEC FP95.
• 0.83 accuracy for SPEC INT95
  – 0.65 for program go

AMD K5

Sun UltraSPARC

IBM Power4

Baer p. 141

Example

• Assume n=4:
  – bimodal mispredicts 1 out of 5 times
  – global mispredicts from 0 to 5 times depending on other branches in the loop
• This branch has a fixed pattern:
  – “4 taken, 1 not taken”
• How can this pattern be learned?
  – Remember the history of individual branches
• We need predictors more attuned to locality of individual branches

(Flowchart of the nested loop: i ← 0; tests labeled m ≤ 0 and n ≥ 0; j ← 0; S1; S2; …; Sk; j ← j+1; test j < n; i ← i+1; test i < m; branch edges labeled T and NT.)

Baer p. 142

global-set predictor

• First Level: A global shift register for correlations
• Second Level: A set of multiple PHTs to prevent aliasing
  – expensive in terms of storage
    • must use few PHTs to be viable

Baer p. 142/143

set-global predictor

• Set of Branch History registers (BHT)
• A single global PHT

Baer p. 143
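To see why per-branch history helps with the earlier “4 taken, 1 not taken” pattern, here is a sketch with a single branch, one 4-bit local history register, and one PHT of 2-bit counters indexed by that history. It is a degenerate case of the organization above (a set of one history register); the sizes, the weak-not-taken initial state, and the function name are assumptions.

```c
#include <stdint.h>

#define LH_BITS 4                      /* hypothetical local history length */

/* Count mispredictions on `periods` repetitions of T T T T NT. */
int local_history_mispredicts(int periods)
{
    uint8_t pht[1u << LH_BITS];        /* 2-bit counters indexed by local history */
    uint32_t hist = 0;                 /* the branch's own history register */
    int misses = 0;

    for (unsigned k = 0; k < (1u << LH_BITS); k++)
        pht[k] = 1;                    /* assumed initial state: weak not taken */

    for (int p = 0; p < periods; p++) {
        for (int k = 0; k < 5; k++) {
            int outcome = (k < 4);     /* four taken, then one not taken */
            int pred = (pht[hist] >= 2);
            if (pred != outcome)
                misses++;
            if (outcome) {
                if (pht[hist] < 3) pht[hist]++;
            } else {
                if (pht[hist] > 0) pht[hist]--;
            }
            hist = ((hist << 1) | (uint32_t)outcome) & ((1u << LH_BITS) - 1u);
        }
    }
    return misses;
}
```

Each of the five positions in the pattern is preceded by a distinct 4-bit history, so after a short warm-up every position gets its own counter and the loop exit is predicted correctly.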

set-set predictor

• A set of branch history registers (BHT)
• A set of PHTs

Baer p. 143

Predicting the Branch Target

• When is the target of a branch computed?
  – In a superscalar architecture (e.g., the IA-32 of the Intel P6), after several pipeline stages.
• What is the point of predicting direction early if we don’t know where the branch goes?
  – Need to also predict the branch target address.

Baer p. 145

Branch Target Buffer (BTB)

• A cache-like storage that records branch addresses and associated targets
• If there is a hit in the BTB for a branch predicted taken:
  – PC ← Target in BTB for branch

Baer p. 146
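The BTB rule above can be sketched as a small direct-mapped, tagged table. The organization (direct mapped, 256 entries, full PC as tag, 4-byte instructions) and all names are assumptions chosen to keep the example short.

```c
#include <stdint.h>

#define BTB_ENTRIES 256                /* hypothetical size */

typedef struct {
    int valid;
    uint32_t tag;                      /* branch address (full PC used as tag here) */
    uint32_t target;                   /* recorded target address */
} BtbEntry;

static BtbEntry btb[BTB_ENTRIES];

static unsigned btb_index(uint32_t pc)
{
    return (pc >> 2) % BTB_ENTRIES;
}

/* Record the target of a branch once it is known. */
void btb_insert(uint32_t pc, uint32_t target)
{
    BtbEntry *e = &btb[btb_index(pc)];
    e->valid = 1;
    e->tag = pc;
    e->target = target;
}

/* Returns 1 on a hit and writes the target; returns 0 on a miss. */
int btb_lookup(uint32_t pc, uint32_t *target)
{
    const BtbEntry *e = &btb[btb_index(pc)];
    if (e->valid && e->tag == pc) {
        *target = e->target;
        return 1;
    }
    return 0;
}

/* Next fetch address: BTB target on (predicted taken AND hit), else fall through. */
uint32_t next_fetch_pc(uint32_t pc, int predicted_taken)
{
    uint32_t target;
    if (predicted_taken && btb_lookup(pc, &target))
        return target;
    return pc + 4;
}
```

Unlike the tagless PHT, the tag check prevents a different branch that maps to the same entry from using a wrong target.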

Integrated BTB-PHT

• BTB needs much more space than the PHT
  – # of entries is limited by BTB.
• BTB must be accessed in a single cycle

Baer p. 146

Decoupled BTB-PHT

• Parallel BTB and PHT access
• if PHT says ‘taken’ and hit in BTB then PC ← Address in BTB

Baer p. 146

Decoupled BTB-PHT

• For space efficiency:
  – Only taken branches are added to BTB
    • They are added at the backend when the outcome is known.

IBM PowerPC 620: 256-entry, 2-way set-associative BTB; 2K counter PHT

Baer p. 146

Integrating the BTB with the Branch History Table (BHT)

• The history of all branches needs to be recorded in BTB+BHT
• Taken and not taken branches need to be included

Most likely, it is not the same bit field from the PC that is used to index the BTB+BHT and to select the PHT

Intel P6: 4-bit local history, 512 BTB entries, # of PHTs not published

What happens on a BTB miss?

“Backward taken, forward not taken” prediction.

Baer p. 147

Two Instances of Mispredictions

• Direction of branch b is mispredicted
  – Recovery only when b is at the head of the reorder buffer
    • lots of instructions to be nullified
• BTB miss for branch b (direction is correctly predicted taken)
  – Cannot fetch instructions until target is computed
    • only affects the filling of the front end

Baer p. 147

misfetch

• Branch is correctly predicted taken and
• There is a hit in the BTB
• but target address is wrong
  – caused by indirect jumps
    • more common in object-oriented languages
  – can modify a BTB entry after two misfetches
    • need a counter with each BTB entry

Intel Pentium M: has an indirect branch predictor that associates global history registers with target addresses.

Baer p. 148


CMPUT 229 Flashback: Procedure Call Instructions

• Procedure call: jump and link

    jal ProcedureLabel

  – Address of following instruction put in $ra
  – Jumps to target address
• Procedure return: jump register

    jr $ra

  – Copies $ra to program counter
  – Can also be used for computed jumps
    • e.g., for case/switch statements

P-H p. 113


Example fact(3)

MIPS assembly:

fact:
    addi $sp, $sp, -8      # make room on the stack for 2 more items
    sw   $ra, 4($sp)       # save the return address
    sw   $a0, 0($sp)       # save the argument n
    slti $t0, $a0, 1       # if ($a0 < 1) then $t0 ← 1 else $t0 ← 0
    beq  $t0, $zero, L1    # if n ≥ 1, go to L1
    addi $v0, $zero, 1     # return 1
    addi $sp, $sp, 8       # pop two items from the stack
    jr   $ra               # return to the instruction after jal

L1: addi $a0, $a0, -1      # subtract 1 from the argument
    jal  fact              # call fact(n-1)
    lw   $a0, 0($sp)       # just returned from jal: restore n
    lw   $ra, 4($sp)       # restore the return address
    addi $sp, $sp, 8       # pop two items from the stack
    mul  $v0, $a0, $v0     # return n*fact(n-1)
    jr   $ra               # return to the caller

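Processor: $t0 = (empty), $v0 = (empty), $a0 = 3, $sp = 0x1000 2000, $ra = (empty)

Memory (High Address at top, Low Address at bottom; $sp marks the top of the stack). Caller code:

    0x1000 3FFC  addi $a0, $zero, 3
    0x1000 4000  jal fact
    0x1000 4004  ….

Pat.-Hen. pp. 136-138 and A-26/A-29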

int fact ( int n )
{
    if (n < 1)
        return(1);
    else
        return(n * fact(n-1));
}


Example fact(3)

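MIPS assembly: fact (code as on the first fact(3) slide)

Processor: $t0 = (empty), $v0 = (empty), $a0 = 3, $sp = 0x1000 2000, $ra = 0x1000 4004

Memory (High Address at top, Low Address at bottom; $sp marks the top of the stack).

Pat.-Hen. pp. 136-138 and A-26/A-29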



Example fact(3)

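MIPS assembly: fact (code as on the first fact(3) slide)

Processor: $t0 = 1, $v0 = 6, $a0 = 3, $sp = 0x1000 2000, $ra = 0x1000 4004

Stack contents (from High Address to Low Address), one saved $ra / saved $a0 pair per call:

    saved $ra      saved $a0
    0x1000 4004    3
    0x1000 6FEC    2
    0x1000 6FEC    1
    0x1000 6FEC    0

Pat.-Hen. pp. 136-138 and A-26/A-29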


Call/Return Mechanisms

foo(….){
    …
    0x10001000 jal bar
    0x10001004 …
    …
    0x10001800 jal bar
    0x10001804 …
    …
    0x10001CE4 jal bar
    0x10001CE8 …
    ...
}

bar(….){
    …
    0x1000F0E0 jal baz
    0x1000F0E4 …
    ...
    jr $ra
}

baz(….){
    ...
    jr $ra
}

How to predict the next instruction to be executed after the return?

We know that the branch is always taken.

The return address is known since the time of each call!

Baer p. 150

Return Address Stack

foo(….){
    …
    0x10001000 jal bar
    0x10001004 …
    …
    0x10001800 jal bar
    0x10001804 …
    …
    0x10001CE4 jal bar
    0x10001CE8 …
    ...
}

bar(….){
    …
    0x1000F0E0 jal baz
    0x1000F0E4 …
    ...
    jr $ra
}

baz(….){
    ...
    jr $ra
}

Push return address into stack at the function call.

Pop address from stack at return.

The stack is implemented as a circular buffer. Wrong address on overflow. What is the best strategy to handle overflow? Baer p. 150
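A return address stack kept in a small circular buffer might look like the sketch below. The size, the type name `Ras`, and the policy of silently overwriting the oldest entry on overflow are assumptions; the slide leaves the best overflow strategy as an open question.

```c
#include <stdint.h>

#define RAS_SIZE 4                     /* hypothetical number of entries */

typedef struct {
    uint32_t entry[RAS_SIZE];
    unsigned top;                      /* number of pushes minus pops; wraps modulo 2^32 */
} Ras;

/* At a call: push the address of the instruction after the jal. */
void ras_push(Ras *r, uint32_t return_addr)
{
    r->entry[r->top % RAS_SIZE] = return_addr;   /* overwrites the oldest entry on overflow */
    r->top++;
}

/* At a return: pop the predicted return address. */
uint32_t ras_pop(Ras *r)
{
    r->top--;
    return r->entry[r->top % RAS_SIZE];
}
```

With calls nested deeper than `RAS_SIZE`, the innermost returns are still predicted correctly, but the return that needs the overwritten entry receives a wrong address.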

Speculative calls and returns

foo(….){
    …
    0x10000FFC beq … target
    0x10001000 jal bar
    0x10001004 …
    …
target:
    0x10001800 jal baz
    0x10001804 …
    …
    0x10001CE4 jal bar
    0x10001CE8 …
    ...
}

bar(….){
    …
    0x1000F0E0 bne … next
    0x1000F0E4 jr $ra
    ...
next:
    ….
}

Function calls and returns executed in the predicted path of a branch change the return address stack.

Need a recovery mechanism for the return address stack.

If a single path is followed, save the pointer to the top of the stack on a branch prediction and restore it in case of misprediction. Baer p. 150
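Saving and restoring the top-of-stack pointer could be sketched as follows. The type `SpecRas` and the function names are hypothetical. Note the limitation of this simple scheme: it restores only the pointer, so an entry that the wrong path popped and then overwrote with a new push is not recovered.

```c
#include <stdint.h>

#define SRAS_SIZE 8                    /* hypothetical number of entries */

typedef struct {
    uint32_t entry[SRAS_SIZE];
    unsigned top;
} SpecRas;

void sras_push(SpecRas *r, uint32_t return_addr)
{
    r->entry[r->top % SRAS_SIZE] = return_addr;
    r->top++;
}

uint32_t sras_pop(SpecRas *r)
{
    r->top--;
    return r->entry[r->top % SRAS_SIZE];
}

/* On a branch prediction: remember where the top of the stack was. */
unsigned sras_checkpoint(const SpecRas *r)
{
    return r->top;
}

/* On a misprediction: discard the effect of wrong-path calls and returns
   on the pointer (entry contents are not repaired). */
void sras_restore(SpecRas *r, unsigned saved_top)
{
    r->top = saved_top;
}
```

In the foo/bar example, a checkpoint taken at the `beq` lets the stack pointer be restored if the `jal bar` at 0x10001000 turns out to be on the wrong path.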

Return Stacks

MIPS R10000: 1-entry return stack

DEC Alpha 21164: 12-entry return stack

Intel Pentium III: 16-entry return stack

Baer p. 151

A different way of doing things…

Don’t know which way to go?

“Some people go both ways.”

(Scarecrow, The Wizard of Oz)

Baer p. 151

IBM System 360/91

• Upon decoding a branch:
  – fetch, decode, and enqueue both the taken and the not taken paths into separate buffers
• Upon branch resolution:
  – one buffer becomes the execution path
  – the other is discarded

Baer p. 151

In a restricted version …

Branch is predicted taken

There is a BTB hit

Instruction Cache Line: branch instruction, followed by the fall-through instructions in the cache line

Resume Buffer: holds the fall-through instructions

@#$&% misprediction! Fetch from Resume Buffer!

MIPS R10000, Intel P6

Baer p. 151

Loop Detector

• A separate loop predictor detects loop patterns:
  – TTTTTTTNTTTTTTTNTTTTTTTNTTTTTTTNTT….

• Uses a separate counter for each recognized loop

Intel Pentium M

Baer p. 151
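One way to picture a loop predictor for a single loop branch is a pair of counters: the learned trip count and the number of taken outcomes seen so far in the current run. This is a sketch under that assumption only; the structure, the names, and the learn-after-one-exit policy are hypothetical and are not a description of the Pentium M design.

```c
typedef struct {
    int valid;                         /* has a trip count been learned yet? */
    int trip;                          /* taken outcomes observed before the last exit */
    int count;                         /* taken outcomes seen in the current run */
} LoopPredictor;

/* Predict taken (1) unless the learned trip count has been reached. */
int loop_predict(const LoopPredictor *lp)
{
    return !(lp->valid && lp->count == lp->trip);
}

/* Train with the resolved outcome. */
void loop_update(LoopPredictor *lp, int taken)
{
    if (taken) {
        lp->count++;
    } else {
        lp->trip = lp->count;          /* remember how long the run was */
        lp->valid = 1;
        lp->count = 0;
    }
}

/* Mispredictions on `loops` repetitions of `taken` T outcomes followed by one NT. */
int loop_pattern_mispredicts(int taken, int loops)
{
    LoopPredictor lp = {0, 0, 0};
    int misses = 0;
    for (int l = 0; l < loops; l++) {
        for (int k = 0; k <= taken; k++) {
            int outcome = (k < taken);
            if (loop_predict(&lp) != outcome)
                misses++;
            loop_update(&lp, outcome);
        }
    }
    return misses;
}
```

On the pattern shown above (seven taken, one not taken), only the first loop exit is mispredicted; a 2-bit counter would miss every exit.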

Sophisticated Predictors

• Tension:
  – Branch Correlation (global information) × Individual Branch Patterns (local information)
• neutral aliasing
  – between branches biased the same way
• destructive aliasing
  – between branches with opposite bias
• bias bit
  – added to BTB
  – PHT predicts if direction agrees with the bias bit
    • two branches with strong opposite bias that alias do not destroy each other’s prediction.

Baer p. 152

skewed predictor

• Goal: reduce aliasing
• Use three PHTs
  – different hashing function for each PHT
  – Take majority vote

Baer p. 153
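The majority vote over three differently hashed PHTs can be sketched as below. The three hash functions here are made up for illustration and are not the skewing functions of any published design; updating all three banks on every outcome is also an assumption.

```c
#include <stdint.h>

#define SK_BITS 8                      /* hypothetical index width per bank */
#define SK_SIZE (1u << SK_BITS)

static uint8_t sk_bank[3][SK_SIZE];    /* three PHTs of 2-bit counters, all start at 0 */

/* Three illustrative hashes of PC and global history (assumed, not from the slides). */
static unsigned sk_hash(int bank, uint32_t pc, uint32_t gr)
{
    uint32_t w = pc >> 2;
    uint32_t h;
    switch (bank) {
    case 0:  h = w ^ gr;                        break;
    case 1:  h = w ^ (gr << 1) ^ (w >> SK_BITS); break;
    default: h = (w >> 1) ^ gr ^ (gr >> 2);      break;
    }
    return h & (SK_SIZE - 1u);
}

/* Majority of three 0/1 votes. */
int majority3(int a, int b, int c)
{
    return (a + b + c) >= 2;
}

int sk_predict(uint32_t pc, uint32_t gr)
{
    int v0 = sk_bank[0][sk_hash(0, pc, gr)] >= 2;
    int v1 = sk_bank[1][sk_hash(1, pc, gr)] >= 2;
    int v2 = sk_bank[2][sk_hash(2, pc, gr)] >= 2;
    return majority3(v0, v1, v2);
}

/* Assumed policy: update the counter in every bank. */
void sk_update(uint32_t pc, uint32_t gr, int taken)
{
    for (int b = 0; b < 3; b++) {
        uint8_t *c = &sk_bank[b][sk_hash(b, pc, gr)];
        if (taken) {
            if (*c < 3) (*c)++;
        } else {
            if (*c > 0) (*c)--;
        }
    }
}
```

Two branches that collide in one bank are unlikely to collide in the other two, so the vote tends to mask the aliased counter.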

hybrid (or combining) predictor

Two different prediction strategies

Tournament predictor: predicts which strategy should be used

Baer p. 156

Tournament Predictor

Baer p. 155
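The selection step of a tournament predictor can be sketched with a single 2-bit chooser counter. Real designs typically keep a table of such choosers; the single counter, the type name, and the update rule (train only when the two component predictors disagree) are simplifying assumptions. Predictions and outcomes are expected to be 0 or 1.

```c
typedef struct {
    int chooser;                       /* 0..3; values >= 2 select predictor B */
} Tournament;

/* Pick between the predictions of the two component strategies. */
int tournament_select(const Tournament *t, int pred_a, int pred_b)
{
    return (t->chooser >= 2) ? pred_b : pred_a;
}

/* Move the chooser toward whichever component was right, but only
   when exactly one of them was right. */
void tournament_update(Tournament *t, int pred_a, int pred_b, int taken)
{
    int a_ok = (pred_a == taken);
    int b_ok = (pred_b == taken);
    if (a_ok == b_ok)
        return;                        /* both right or both wrong: no information */
    if (b_ok) {
        if (t->chooser < 3) t->chooser++;
    } else {
        if (t->chooser > 0) t->chooser--;
    }
}
```

For example, if predictor A keeps saying taken while the branch is not taken and predictor B is right, two updates are enough to switch the selection to B.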
