Branch Prediction

Dimitris Karteris, Rafael Pasvantidis

Post on 21-Dec-2015



Outline

- What are branches?
- Techniques for handling branches
- Branch prediction
- Why do we need branch prediction?
- Branch prediction schemes (static/dynamic)
- "Real" branch predictors

Branches

Instructions which can alter the flow of instruction execution in a program

Types of Branches

- Direct, conditional: if-then-else, for loops (BEZ, BNEZ, etc.)
- Direct, unconditional: procedure calls (JAL), goto (J)
- Indirect, unconditional: return (JR), virtual function lookup, function pointers (JALR)

Techniques for handling branches

[Figure: pipeline stages IF, ID, EX, MEM, WB]

- Stalling
- Branch delay slots
  - Relies on programmer/compiler to fill
  - Depends on being able to find suitable instructions
  - Ties resolution delay to a particular pipeline
- Predication
  - Transform control dependence into data dependence on the branch condition

Why aren't these techniques acceptable?

- Branches are frequent (15-25%)
- Today's pipelines are deeper and wider
  - Higher performance penalty for stalling
  - Misprediction Penalty = issue width * resolution delay cycles
- A lot of cycles can be wasted!!!
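
To make the penalty formula above concrete, here is a minimal Python sketch. The function name and the example machine (4-wide issue, 10-cycle resolution delay) are illustrative assumptions, not figures from the slides.

```python
def misprediction_penalty(issue_width: int, resolution_delay_cycles: int) -> int:
    """Issue slots lost per misprediction: issue width * resolution delay cycles."""
    return issue_width * resolution_delay_cycles


# Hypothetical machine: 4-wide issue, branch resolved 10 cycles after fetch.
slots_lost = misprediction_penalty(issue_width=4, resolution_delay_cycles=10)
print(slots_lost)  # 40 issue slots wasted per mispredicted branch
```

Because the two factors multiply, making a pipeline both wider and deeper raises the cost of each wrong guess quickly.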

Branch Prediction

- Predicting the outcome of a branch
- Direction: Taken / Not Taken
  - Direction predictors
- Target Address: PC+offset (Taken) / PC+4 (Not Taken)
  - Target address predictors
  - Branch Target Address Cache (BTAC) or Branch Target Buffer (BTB)

Why do we need branch prediction?

- Branch prediction
  - Increases the number of instructions available for the scheduler to issue. Increases instruction level parallelism (ILP)
  - Allows useful work to be completed while waiting for the branch to resolve

A simple example which demonstrates the benefits

  if (x > 0) { a=0; b=1; c=2; }
  d=3;

Prediction correct, x > 0 holds:

Cycle  Fetch     Decode    Execute   Save
1      if (x>0)
2      a=0       if (x>0)
3      b=1       a=0       if (x>0)
4      c=2       b=1       a=0       if (x>0)
5                c=2       b=1       a=0
6                          c=2       b=1
7                                    c=2

Prediction wrong, x > 0 does not hold (a=0 and b=1 are squashed):

Cycle  Fetch     Decode      Execute     Save
1      if (x>0)
2      a=0       if (x>0)
3      b=1       a=0         if (x>0)
4      d=3       squash b=1  squash a=0  if (x>0)
5                d=3         squash b=1  squash a=0
6                            d=3         squash b=1
7                                        d=3

Prediction correct, x > 0 does not hold:

Cycle  Fetch     Decode    Execute   Save
1      if (x>0)
2      d=3       if (x>0)
3                d=3       if (x>0)
4                          d=3       if (x>0)
5                                    d=3

Classification of branch prediction schemes (1)

- Static schemes
  - Decision before runtime (i.e. at compile time)
  - Predict Branch Taken / Not Taken
    - All branches taken scheme: 34% avg. misprediction rate
  - Backward Taken/Forward Not Taken (BTFNT)
    - Advantage in loops
    - Doesn't work well on programs with irregular branches
    - Ball and Larus approach enhancement works a little better
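
The BTFNT rule above reduces to a single address comparison. The following Python sketch illustrates it; the function name and the example addresses are hypothetical.

```python
def btfnt_predict_taken(branch_pc: int, target_pc: int) -> bool:
    """Backward Taken / Forward Not Taken: predict taken iff the target lies at a lower address."""
    return target_pc < branch_pc


# A loop-closing branch jumps backward, so it is predicted taken.
print(btfnt_predict_taken(branch_pc=0x4010, target_pc=0x4000))  # True
# A forward branch (for example, one skipping an if-body) is predicted not taken.
print(btfnt_predict_taken(branch_pc=0x4010, target_pc=0x4020))  # False
```

This is why the scheme has an advantage in loops: the backward branch at the bottom of a loop is taken on every iteration except the last.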

Classification of branch prediction schemes (2)

- Profiling
  - Branch prediction based on profiles created by earlier runs
  - Key observation: the behavior of branches is bimodally distributed
  - Preset static prediction bit in the opcode
  - Doesn't work well when the data sets that occur at run-time differ from the profiled ones
- Static schemes are useful for
  - scheduling when the branch delays are exposed by the architecture
  - assisting dynamic predictors
  - determining frequent code paths

Classification of branch prediction schemes (3)

- Dynamic schemes
  - Prediction decisions may change during the execution of the program
  - Branch Target Buffer
    - Lee and Smith
    - 2-bit saturating up-down counters to collect history information
  - Static Training Scheme
    - Uses statistics collected from a pre-run of the program and a history pattern consisting of the last N run-time executions

What happens when a branch is mispredicted?

- On mispredict:
  - No speculative state may commit
    - Squash instructions in the pipeline
    - Must not allow stores in the pipeline to occur
      - Cannot allow stores which would not have happened to commit
    - Need to handle exceptions appropriately

Simple branch predictor

- Accessed early in the pipeline using the branch instruction (PC)

2-bit branch prediction

2-bit predictor state diagram

2-bit branch prediction

- A branch must miss twice before the prediction is changed
- It's a specialization of the n-bit saturating scheme.
- Branch prediction buffer can be implemented as:
  - Special cache accessed with the instruction address during IF
  - Pair of bits attached to each block in the instruction cache
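
The 2-bit scheme above can be sketched as a saturating up-down counter. This is a minimal Python illustration, assuming the common encoding where states 0-1 predict not taken and 2-3 predict taken. The class name, the helper function and the loop pattern are hypothetical.

```python
class TwoBitCounter:
    """2-bit saturating up-down counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state: int = 0):
        self.state = state  # 0 = strongly not taken ... 3 = strongly taken

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, counter) -> int:
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# Hypothetical loop branch: taken 9 times, then falls through, repeated 3 times.
loop_branch = ([True] * 9 + [False]) * 3
print(count_mispredictions(loop_branch, TwoBitCounter(state=3)))  # 3
```

Starting from the strongly-taken state, a single loop exit does not flip the prediction, so only the exit itself is mispredicted on each pass through the loop.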

N-bit predictor scheme

SPEC89 prediction accuracy (4K entry buffer)

SPEC89 prediction accuracy, infinite buffer

Correlating (Two-Level) branch predictors (1)

Consider the sequence (1):

  if (aa==2)
    aa=0;
  if (bb==2)
    bb=0;
  if (aa!=bb) {

Consider the sequence (2):

  if (d==0)
    d=1;
  if (d==1)

MIPS assembly for (2):

      BNEZ   R1,L1      ;branch b1
      DADDIU R1,R0,#1   ;d=1
  L1: DADDIU R3,R1,#-1
      BNEZ   R3,L2      ;branch b2
      …
  L2:

1-bit correlation branch predictor

- In (2), if b1 is NOT taken then b2 is NOT taken too!
- Consider a predictor with 1 bit of correlation to capture the dependence of one branch on another
- 2 prediction bits per branch:
  - 1 assuming the last branch executed was Not Taken
  - 1 assuming the last branch executed was Taken

Pred bits  Pred if last branch not taken  Pred if last branch taken
NT/NT      NT                             NT
NT/T       NT                             T
T/NT       T                              NT
T/T        T                              T

Comparison

1-bit predictor:

d=?  b1 pred  b1 act  new b1 pred  b2 pred  b2 act  new b2 pred
2    NT       T       T            NT       T       T
0    T        NT      NT           T        NT      NT
2    NT       T       T            NT       T       T
0    T        NT      NT           T        NT      NT

1-bit predictor with 1 bit of correlation:

d=?  b1 pred  b1 act  new b1 pred  b2 pred  b2 act  new b2 pred
2    NT/NT    T       T/NT         NT/NT    T       NT/T
0    T/NT     NT      T/NT         NT/T     NT      NT/T
2    T/NT     T       T/NT         NT/T     T       NT/T
0    T/NT     NT      T/NT         NT/T     NT      NT/T
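
The two comparison tables can be replayed with a short simulation. This Python sketch is an illustration under stated assumptions: all prediction bits start as not taken, and the branch executed before the first b1 is assumed not taken. The function names are hypothetical.

```python
def branch_outcomes(d: int):
    """Return (b1_taken, b2_taken) for: if (d==0) d=1; if (d==1) ...

    b1 is a BNEZ on d (taken when d != 0); b2 is taken when d != 1."""
    b1_taken = d != 0
    if not b1_taken:
        d = 1
    b2_taken = d != 1
    return b1_taken, b2_taken


def simulate(d_values, correlated: bool) -> int:
    """Count mispredictions of b1 and b2 over the given values of d."""
    # bits[branch][last_taken] is the 1-bit prediction used when the previously
    # executed branch had outcome last_taken. Without correlation only the
    # False column is used, which gives a plain 1-bit predictor.
    bits = {"b1": {False: False, True: False}, "b2": {False: False, True: False}}
    last_taken = False  # assumption: the branch before the first b1 was not taken
    misses = 0
    for d in d_values:
        for name, actual in zip(("b1", "b2"), branch_outcomes(d)):
            column = last_taken if correlated else False
            if bits[name][column] != actual:
                misses += 1
            bits[name][column] = actual  # 1-bit predictor: remember the last outcome
            last_taken = actual
    return misses


sequence = [2, 0, 2, 0]
print(simulate(sequence, correlated=False))  # 8: every prediction is wrong
print(simulate(sequence, correlated=True))   # 2: only the first d=2 iteration misses
```

With one bit of correlation, the predictor settles into T/NT for b1 and NT/T for b2 after the first iteration, matching the last rows of the second table.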

Correlating branch predictors

- 2 bits of global history means we look at the T/NT behavior of the last two branches to determine the behavior of THIS branch.
- The buffer can be implemented as a one-dimensional array.
- An (m,n) predictor uses the behavior of the last m branches to choose from 2^m predictors, each being an n-bit predictor. It takes (2^m x n x # of entries selected by the branch address) bits.

Correlating branch predictors

Q: How can we capture the behavior of the last n branches and adjust the behavior of the current branch accordingly?

A: We use an n-bit shift register and shift the behavior of each branch into this register as it becomes known.

[Figure: shift register contents 110; label: Last branch outcome]

Correlating branch predictors

- Higher prediction rates than the simple 2-bit predictor scheme with only a trivial additional amount of HW (m-bit shift register)
- NOTE: the buffer is NOT a cache, so counters may correspond to different branches at some point in time
- Buffer can be implemented as a linear memory array that is n bits wide
  - Indexing is done by concatenating global history bits with the bits from the branch address
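
The shift register and the concatenated index described above can be sketched in Python as follows. The names, the choice to place the address bits above the history bits, and the 4-byte instruction size are assumptions made for illustration.

```python
class GlobalHistory:
    """m-bit shift register holding the outcomes of the last m branches (1 = taken)."""

    def __init__(self, m: int):
        self.m = m
        self.bits = 0

    def shift_in(self, taken: bool) -> None:
        mask = (1 << self.m) - 1
        self.bits = ((self.bits << 1) | int(taken)) & mask


def concat_index(branch_pc: int, history: GlobalHistory, addr_bits: int) -> int:
    """Index = low-order branch address bits concatenated with the history bits."""
    low_pc = (branch_pc >> 2) & ((1 << addr_bits) - 1)  # assumes 4-byte instructions
    return (low_pc << history.m) | history.bits


ghr = GlobalHistory(m=3)
for outcome in (True, True, False):  # taken, taken, not taken
    ghr.shift_in(outcome)
print(format(ghr.bits, "03b"))  # 110
print(concat_index(branch_pc=0x44, history=ghr, addr_bits=4))  # 14
```

Since the index is only a slice of the address plus history, two different branches can land on the same counter, which is the aliasing the NOTE above refers to.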

Correlating branch predictors

- How many bits are there in a (0,2) predictor that has 4K entries selected from the branch address?
  - 2^0 x 2 x 4K = 8K bits
- How many bits does the example predictor have?
  - 2^2 x 2 x 16 = 128 bits
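
The two size calculations above follow directly from the (m,n) storage formula. A minimal Python check, with a hypothetical function name:

```python
def predictor_bits(m: int, n: int, entries: int) -> int:
    """Storage of an (m,n) predictor: 2^m * n * (entries selected by the branch address)."""
    return (2 ** m) * n * entries


print(predictor_bits(m=0, n=2, entries=4 * 1024))  # 8192 bits = 8K bits
print(predictor_bits(m=2, n=2, entries=16))        # 128 bits
```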

Correlating predictor performance

Hashing branch prediction algorithms

- gselect
- gshare

Gshare correlating predictor
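
Gshare indexes a table of 2-bit counters with the exclusive OR of branch address bits and the global history. The Python sketch below is a minimal illustration of that idea, not a model of any particular processor. The class name, table size, initial counter value and the test branch address are assumptions.

```python
class Gshare:
    """Table of 2-bit counters indexed by (branch address bits XOR global history)."""

    def __init__(self, index_bits: int):
        self.mask = (1 << index_bits) - 1
        self.history = 0
        self.counters = [1] * (1 << index_bits)  # start weakly not taken (assumption)

    def _index(self, branch_pc: int) -> int:
        return ((branch_pc >> 2) ^ self.history) & self.mask  # assumes 4-byte instructions

    def predict(self, branch_pc: int) -> bool:
        return self.counters[self._index(branch_pc)] >= 2

    def update(self, branch_pc: int, taken: bool) -> None:
        i = self._index(branch_pc)
        delta = 1 if taken else -1
        self.counters[i] = max(0, min(3, self.counters[i] + delta))
        self.history = ((self.history << 1) | int(taken)) & self.mask


predictor = Gshare(index_bits=4)
pattern = [True, False] * 20  # a branch that strictly alternates
misses = 0
for taken in pattern:
    if predictor.predict(0x400) != taken:
        misses += 1
    predictor.update(0x400, taken)
print(misses)  # 3: only warm-up mispredictions
```

A single 2-bit counter would mispredict a strictly alternating branch over and over, whereas here the history steers each phase of the pattern to its own counter.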

Hybrid predictors

- The basic idea is to use a META predictor to select among multiple predictors
- Example:
  - Local predictors are better in some branches
  - Global predictors are better in utilizing correlation
  - Use a predictor to select the better predictor

Tournament predictors

- n/m means:
  - n: left predictor
  - m: right predictor
  - 0: incorrect, 1: correct
- A predictor must be incorrect twice before we switch to another one
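
The selection rule above can be sketched as a 2-bit saturating counter that moves only when exactly one of the two predictors is correct. This Python sketch is illustrative; the class name and the initial state are assumptions.

```python
class TournamentSelector:
    """2-bit saturating counter choosing between two predictors.

    States 0-1 select the left predictor, states 2-3 select the right one."""

    def __init__(self):
        self.state = 0  # start by strongly preferring the left predictor (assumption)

    def use_right(self) -> bool:
        return self.state >= 2

    def update(self, left_correct: bool, right_correct: bool) -> None:
        # Outcomes 0/0 and 1/1 carry no information, so the state is unchanged.
        if right_correct and not left_correct:      # "0/1"
            self.state = min(3, self.state + 1)
        elif left_correct and not right_correct:    # "1/0"
            self.state = max(0, self.state - 1)


selector = TournamentSelector()
selector.update(left_correct=False, right_correct=True)
print(selector.use_right())  # False: one miss by the left predictor is not enough
selector.update(left_correct=False, right_correct=True)
print(selector.use_right())  # True: the second miss switches the selection
```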

Fractions of predictions coming from the local predictor

- The tournament predictor selects between a local 2-bit predictor and a 2-bit Gshare predictor
- Each predictor has 1024 entries, each 2 bits, for a total of 64K bits.

Misprediction rates

Need Address at Same Time as Prediction

- Branch Target Buffer (BTB): the address of the branch is used as an index to get the prediction AND the branch target address (if taken)
- Note: must check for a branch match now, since we can't use the wrong branch address
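
The lookup and the branch match check described above can be sketched as a small direct-mapped table. This is a Python illustration under assumptions: the class name, the table size, the 4-byte instruction size and the example addresses are all hypothetical.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: each entry stores (branch PC tag, predicted target)."""

    def __init__(self, entries: int = 16):
        self.entries = entries
        self.table = [None] * entries

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries  # assumes 4-byte instructions

    def lookup(self, pc: int):
        """Return the predicted target, or None when there is no matching entry."""
        entry = self.table[self._index(pc)]
        if entry is not None and entry[0] == pc:  # the branch match check
            return entry[1]
        return None

    def record_taken(self, pc: int, target: int) -> None:
        self.table[self._index(pc)] = (pc, target)


btb = BranchTargetBuffer(entries=16)
btb.record_taken(pc=0x1000, target=0x2000)
print(btb.lookup(0x1000))  # 8192, the stored target 0x2000
print(btb.lookup(0x1040))  # None: same index, but the tag does not match
```

Without the tag comparison, the instruction at 0x1040 would be redirected to another branch's target, which is the failure the note warns about.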

A Branch Target Buffer

[Figure labels: Predicted PC; Branch Prediction: Taken or Not Taken]

Return Address Predictors

- Return addresses can be predicted with the BTB, but accuracy can be low
  - Procedure may be called from multiple sites
- Solution: small buffer operating as a stack
  - If the stack is large enough it will predict perfectly
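
A return address stack can be sketched in a few lines of Python. The class name, the depth and the overflow policy (dropping the oldest entry) are assumptions made for illustration.

```python
class ReturnAddressStack:
    """Small fixed-depth stack of predicted return addresses."""

    def __init__(self, depth: int = 8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address: int) -> None:
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # overflow: drop the oldest entry (assumed policy)
        self.stack.append(return_address)

    def on_return(self):
        """Predicted return address, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack(depth=2)
for return_address in (0x100, 0x200, 0x300):  # three nested calls, depth of only 2
    ras.on_call(return_address)
print([ras.on_return() for _ in range(3)])  # [768, 512, None]
```

The two innermost returns are predicted correctly, while the outermost one is lost because the call depth exceeded the stack size.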

"Real" Branch Predictors

- Alpha 21264
- Sun UltraSPARC-III
- Intel Pentium III
- AMD Athlon K7

Alpha 21264

- 8-stage pipeline, mispredict penalty 7 cycles
- Hybrid predictor (Fetch)
  - 12-bit GAg (4K-entry PHT, 2-bit counters)
  - 10-bit PAg (1K-entry BHT, 1K-entry PHT, 3-bit counters)

Alpha 21264 branch prediction mechanism

Sun UltraSPARC-III

- 14-stage pipeline, bpred accessed in instruction fetch stages 2-3
- 16K-entry 2-bit counter Gshare predictor
  - Bimodal predictor which XORs PC bits with the global history register (except the 3 lower-order bits) to reduce aliasing
- Miss queue
  - Halves the mispredict penalty by providing instructions for immediate use

Intel Pentium with MMX

Intel Pentium III

- Dynamic branch prediction
  - 512-entry BTB predicts direction and target, 4-bit history used with PC to derive direction
- Static branch predictor for BTB misses
- Return Address Stack (RAS), 4/8 entries
- Branch Penalties:
  - Not Taken: no penalty
  - Correctly predicted taken: 1 cycle
  - Mispredicted: at least 9 cycles, as many as 26, average 10-15 cycles

AMD Athlon K7

- 10-stage integer, 15-stage fp pipeline, predictor accessed in fetch
- 2K-entry bimodal, 2K-entry BTAC
- 12-entry RAS
- Branch Penalties:
  - Correct Predict Taken: 1 cycle
  - Mispredict penalty: at least 10 cycles

Q/A’s