Branch Prediction High-Performance Computer Architecture Joe Crop Oregon State University School of Electrical Engineering and Computer Science

Post on 13-Jan-2016

Branch Prediction High-Performance Computer Architecture

Joe Crop
Oregon State University

School of Electrical Engineering and Computer Science

Chapter 2: A Five Stage RISC Pipeline 2

Control Hazard

beq r1,r3,label

and r2,r3,r5

or r6,r1,r7

add r8,r1,r9

label: xor r10,r1,r11

[Figure: pipeline diagram showing each instruction above passing through the Ifetch, Reg, ALU, DMem, and Reg stages]

Branch Penalty Impact

• If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!

• Two part solution:
  – Determine branch taken or not sooner, AND
  – Compute taken branch address earlier

• MIPS branch tests if register = 0 or ≠ 0
  – beqz R4, name

• MIPS Solution:
  – Move Zero test to ID/RF stage
  – Adder to calculate new PC in ID/RF stage
  – 1 clock cycle penalty for branch versus 3
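
The CPI figure quoted on this slide follows from adding the stall cycles contributed by branches to the base CPI. A minimal sketch of that arithmetic, assuming every branch pays the full penalty (the function name is hypothetical, chosen for illustration):

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, stall_cycles):
    """Average CPI when every branch stalls the pipeline for stall_cycles."""
    return base_cpi + branch_fraction * stall_cycles

# Numbers from the slide: CPI = 1, 30% branches.
cpi_late = cpi_with_branch_stalls(1.0, 0.30, 3)   # 3-cycle stall per branch
cpi_id = cpi_with_branch_stalls(1.0, 0.30, 1)     # branch resolved in ID/RF: 1-cycle stall

print(f"3-cycle penalty: CPI = {cpi_late:.2f}")
print(f"1-cycle penalty: CPI = {cpi_id:.2f}")
```

Moving the zero test and the target adder into ID/RF cuts the branch contribution from 0.9 to 0.3 cycles per instruction under these assumptions.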

Modified MIPS Datapath

[Figure: five-stage datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers. The Zero? test and an adder for the branch target are placed in the Instr. Decode / Reg. Fetch stage.]

Branch Resolved in ID Stage

beq r1,r3,label

and r2,r3,r5

Label: xor r10,r1,r11

…

[Figure: pipeline diagram (Ifetch, Reg, ALU, DMem, Reg stages) for the instruction sequence above]

Branch Prediction

• Predict Branch Not Taken
  – Execute successor instructions in sequence.
  – “Squash” instructions in pipeline if branch actually taken.
  – 47% MIPS branches not taken on average.
  – PC+4 already calculated, so use it to get next instruction.

• Predict Branch Taken
  – 53% MIPS branches taken on average.
  – But haven’t calculated branch target address yet

• MIPS still incurs 1 cycle branch penalty

• Other machines: branch target known before outcome

• Delay Branch Technique

Delay Branches

• This technique relies on software to make the delay slots valid and useful. The n instructions after the branch are executed regardless of whether the branch is taken.

    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n
    branch target if taken

  (sequential successors 1 through n form a branch delay of length n)

• 1 delay slot allows proper decision and branch target address in 5 stage pipeline

• MIPS uses this.

Performance Effect of Branch Penalty

Let

$p_b$ = the probability that an instruction is a branch
$p_t$ = the probability that a branch is taken
$b$ = the branch penalty
CPI = the average number of cycles per instruction.

Then

$\mathrm{CPI} = (1 - p_b) + p_b\,[\,p_t(1 + b) + (1 - p_t)\,]$

$\mathrm{CPI} = 1 + b\,p_t\,p_b$
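
The two CPI expressions above are algebraically the same. A short sketch that evaluates both forms side by side; the sample values for the branch probability, taken probability, and penalty are illustrative assumptions, not figures from the slides:

```python
def cpi_expanded(p_b, p_t, b):
    """Non-branches cost 1 cycle; taken branches cost 1 + b; untaken branches cost 1."""
    return (1 - p_b) + p_b * (p_t * (1 + b) + (1 - p_t))

def cpi_simplified(p_b, p_t, b):
    """Closed form after collecting terms."""
    return 1 + b * p_t * p_b

# Illustrative values (assumed): 20% branches, 65% of them taken, 3-cycle penalty.
sample = dict(p_b=0.2, p_t=0.65, b=3)
expanded = cpi_expanded(**sample)
simplified = cpi_simplified(**sample)
print(f"expanded form:   {expanded:.3f}")
print(f"simplified form: {simplified:.3f}")
```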

Delay Branch Technique

Delay Branch Technique (1)

    A := B + C
    If B > C Then Goto Next
      Delay Slot
    ...
    Next:

becomes

    If B > C Then Goto Next
      A := B + C
    ...
    Next:

“From before”

Delay Branch Technique (2)

    Next: X := Y * Z
          ...
          B := A + C
          If B > C Then Goto Next
            Delay Slot

becomes

          X := Y * Z
    Next: ...
          ...
          B := A + C
          If B > C Then Goto Next
            X := Y * Z

“From target”

• Must be OK to execute when not taken
• May need to duplicate

Delay Branch Technique (3)

    B := A + C
    If B > C Then Goto Next
      Delay Slot
    X := Y * Z
    ...
    Next:

becomes

    B := A + C
    If B > C Then Goto Next
      X := Y * Z
    ...
    Next:

“From fall through”

• Must be OK to execute when taken

Delay Branch Technique (cont.)

The performance of Delay Branches can be modeled by the following equation:

$\mathrm{CPI} = 1 + b\,p_b\,p_{nop}$

where $p_{nop}$ is the fraction of the $b$ delay slots filled with nops. Thus, if $f_i$ is the probability that delay slot $i$ is filled with a useful instruction, then

$p_{nop} = 1 - (f_1 + f_2 + \dots + f_b)/b$

Example: Suppose we have the following characteristics:

$b = 4$, $f_1 = 0.6$, $f_2 = 0.1$, $f_3 = f_4 = 0$, $p_b = 0.2$

We have

$\mathrm{CPI} = 1 + 4 \times 0.2 \times 0.825 = 1.66$
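
The worked example can be checked numerically. A minimal sketch, with hypothetical function names, that takes the per-slot fill probabilities as a list whose length is the number of delay slots:

```python
def nop_fraction(fill_probs):
    """Fraction of delay slots left as nops, given per-slot fill probabilities f_1..f_b."""
    return 1 - sum(fill_probs) / len(fill_probs)

def cpi_delayed_branch(p_b, fill_probs):
    """CPI = 1 + b * p_b * p_nop, with b taken from the number of delay slots."""
    b = len(fill_probs)
    return 1 + b * p_b * nop_fraction(fill_probs)

fills = [0.6, 0.1, 0.0, 0.0]          # f_1..f_4 from the example
p_nop = nop_fraction(fills)
cpi = cpi_delayed_branch(0.2, fills)
print(f"p_nop = {p_nop:.3f}, CPI = {cpi:.2f}")
```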

Delay Branch Technique (cont.)

The concept of squashing or annulling can be used in conjunction with delay branches.

          X := Y * Z
    Next: ...
          B := A + C
          If B > C Then Goto Next
            X := Y * Z      => This instruction is nullified

bne,a rs,rt,label

    a bit    Branch outcome    Delay inst. executed?
    (none)   taken             yes
    (none)   not taken         yes
    a        taken             yes
    a        not taken         no (annulled)

Delay Branch Technique (cont.)

• For processors with this capability, the performance can be modeled as

$\mathrm{CPI} = 1 + b\,p_b\,[\,p_{nop}(1 - p_{null}) + p_{null}\,]$

where $p_{null} = (1 - p_t)$ for nullify-on-branch-not-taken.

• Suppose $b = 4$, $f_1 = 0.8$, $f_2 = 0.3$, $f_3 = 0.1$, $f_4 = 0$, $p_b = 0.2$, $p_{null} = 0.35$

=> CPI = 1.644
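
The annulling model extends the delayed-branch formula with the probability that the slots are nullified. A small sketch of the calculation for the numbers on this slide (the function name is a hypothetical choice):

```python
def cpi_annulling(p_b, fill_probs, p_null):
    """CPI = 1 + b * p_b * [p_nop * (1 - p_null) + p_null]."""
    b = len(fill_probs)
    p_nop = 1 - sum(fill_probs) / b
    return 1 + b * p_b * (p_nop * (1 - p_null) + p_null)

# Values from the slide: b = 4, f = (0.8, 0.3, 0.1, 0), p_b = 0.2, p_null = 0.35.
cpi_annul = cpi_annulling(0.2, [0.8, 0.3, 0.1, 0.0], 0.35)
print(f"CPI = {cpi_annul:.3f}")
```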

Delayed Branch Performance

• Compiler effectiveness for single branch delay slot:
  – Fills about 60% of branch delay slots.
  – About 80% of instructions executed in branch delay slots useful in computation.
  – About 50% (60% x 80%) of slots usefully filled.

Evaluating Branch Alternatives

Suppose conditional and unconditional branches together are 14% of instructions, and 65% of them change the PC.

    Prediction scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
    Stall pipeline       3                1.42   3.5                      1.0
    Predict taken        1                1.14   4.4                      1.26
    Predict not taken    1                1.09   4.5                      1.29
    Delayed branch       0.5              1.07   4.6                      1.31
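
The CPI column of this table can be reproduced from the branch frequency and the per-scheme penalty. The sketch below assumes the unpipelined machine needs 5 cycles per instruction (the pipeline depth) for the speedup estimate; that assumption is not stated on the slide, and the computed speedups may differ from the table in the last digit because of rounding.

```python
PIPELINE_DEPTH = 5       # assumption: unpipelined CPI equals the number of stages
BRANCH_FREQ = 0.14       # conditional + unconditional branches
CHANGE_PC = 0.65         # fraction of branches that change the PC

# Average stall cycles per instruction for each scheme.
stalls = {
    "Stall pipeline": BRANCH_FREQ * 3,
    "Predict taken": BRANCH_FREQ * 1,
    "Predict not taken": BRANCH_FREQ * CHANGE_PC * 1,   # only PC-changing branches pay
    "Delayed branch": BRANCH_FREQ * 0.5,
}
cpi = {name: 1 + s for name, s in stalls.items()}

for name, value in cpi.items():
    vs_unpipelined = PIPELINE_DEPTH / value
    vs_stall = cpi["Stall pipeline"] / value
    print(f"{name:18s} CPI={value:.2f}  v.unpipelined={vs_unpipelined:.2f}  v.stall={vs_stall:.2f}")
```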

Reducing Branch Penalty

Branch penalty in dynamically scheduled processors: wasted cycles due to pipeline flushing on mis-predicted branches

Reduce branch penalty:

1. Predict branch/jump instructions AND branch direction (taken or not taken)

2. Predict branch/jump target address (for taken branches)

3. Speculatively execute instructions along the predicted path

What to Use and What to Predict

Available info:
– Current predicted PC
– Past branch history (direction and target)

What to predict:
– Conditional branch inst: branch direction and target address
– Jump inst: target address
– Procedure call/return: target address

May need instruction pre-decoded

[Figure: the PC feeds both the instruction memory (IM) and the predictors; the predictors output pred_PC and receive prediction-info feedback together with the PC and instruction]

Mis-prediction Detections and Feedbacks

Detections:
• At the end of decoding
  – Target address is known at decoding and does not match the prediction
  – Flush fetch stage
• At commit (most cases)
  – Wrong branch direction, or target address does not match
  – Flush the whole pipeline

Feedbacks:
• Any time a mis-prediction is detected
• At a branch’s commit (feedback at EXE is called speculative update)

[Figure: pipeline blocks (FETCH, RENAME, SCHD, REB/ROB, EXE, WB, COMMIT) with feedback paths to the predictors]

Branch Direction Prediction

• Predict branch direction: taken or not taken (T/NT)

• Static prediction: compilers decide the direction
• Dynamic prediction: hardware decides the direction using dynamic information
  1. 1-bit Branch-Prediction Buffer
  2. 2-bit Branch-Prediction Buffer
  3. Correlating Branch Prediction Buffer
  4. Tournament Branch Predictor
  5. and more …

        BNE R1, R2, L1
        …                 (not taken: fall through)
    L1: …                 (taken)

Predictor for a Single Branch

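[Figure, general form: the PC is used to (1) access the predictor state, which is used to (2) predict and output T/NT; the actual outcome is returned as (3) feedback T/NT.]

[Figure, 1-bit prediction: two states, Predict Taken (1) and Predict Not Taken (0). A T outcome moves to, or stays in, Predict Taken; an NT outcome moves to, or stays in, Predict Not Taken.]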

Branch History Table of 1-bit Predictor

The BHT is also called the Branch Prediction Buffer in the textbook.

• Can use only one 1-bit predictor, but accuracy is low

• BHT: use a table of simple predictors, indexed by bits from PC

• Similar to direct mapped cache

• More entries cost more, but give fewer conflicts and higher accuracy

• BHT can contain complex predictors

[Figure: k bits of the branch address index a table of 2^k prediction entries]
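
A table indexed by PC bits behaves like a tagless direct-mapped cache, so two branches whose addresses share the same index bits share one entry. The sketch below illustrates this; the class name and the choice to drop the two low PC bits (assuming 4-byte aligned instructions) are assumptions for illustration only.

```python
class OneBitBHT:
    """Table of 2**k one-bit predictors indexed by low-order PC bits (no tags)."""

    def __init__(self, k):
        self.mask = (1 << k) - 1
        self.table = [False] * (1 << k)     # False = predict not taken

    def index(self, pc):
        # Assumption: instructions are 4-byte aligned, so skip the two low bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.table[self.index(pc)]

    def update(self, pc, taken):
        self.table[self.index(pc)] = taken

bht = OneBitBHT(k=4)                        # 16 entries
pc_a = 0x1000
pc_b = 0x1000 + (16 << 2)                   # 16 instructions later: same index as pc_a
bht.update(pc_a, True)
print("index a:", bht.index(pc_a), "index b:", bht.index(pc_b))
print("prediction for b after training a:", bht.predict(pc_b))
```

Because there are no tags, the second branch silently inherits the first branch's state; a larger k reduces how often that happens.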

1-bit BHT Weakness

• Example: in a loop, 1-bit BHT will cause 2 mispredictions

• Consider a loop of 9 iterations before exit:

    for (…){
        for (i=0; i<9; i++)
            a[i] = a[i] * 2.0;
    }

  – End of loop case, when it exits instead of looping as before
  – First time through loop on next time through code, when it predicts exit instead of looping
  – Only 80% accuracy even if loop 90% of the time
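
The 80% figure can be reproduced with a tiny simulation. The sketch assumes the loop branch follows a repeating pattern of nine taken outcomes followed by one not-taken outcome (taken 90% of the time), and that the predictor starts in the not-taken state; both are modelling assumptions.

```python
def one_bit_accuracy(outcomes, initial=False):
    """1-bit predictor: always predict whatever the branch did last time."""
    state = initial
    correct = 0
    for taken in outcomes:
        correct += (state == taken)
        state = taken                      # remember only the latest outcome
    return correct / len(outcomes)

# Nine taken iterations, then the exit; the whole loop is re-entered 100 times.
pattern = ([True] * 9 + [False]) * 100
accuracy_1bit = one_bit_accuracy(pattern)
print(f"1-bit accuracy: {accuracy_1bit:.0%}")
```

Each pass through the loop mispredicts twice: once at the exit and once on re-entry.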

2-bit Saturating Counter

• Solution: 2-bit scheme where the prediction changes only after two consecutive mispredictions (Figure 3.7, p. 249)
• Gray: stop, not taken
• Blue: go, taken
• Adds hysteresis to decision making process

[Figure: state diagram with four states. States 11 and 10 predict taken; states 01 and 00 predict not taken. T and NT outcomes move between the states.]
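
A 2-bit saturating counter can be sketched in a few lines. The example assumes a branch pattern of nine taken outcomes followed by one not-taken outcome, repeated, and a counter that starts at 3 (strongly taken); the encoding "counter >= 2 predicts taken" matches states 11 and 10 predicting taken.

```python
def two_bit_accuracy(outcomes, counter=3):
    """2-bit saturating counter: values 2 and 3 predict taken, 0 and 1 predict not taken."""
    correct = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        correct += (predicted_taken == taken)
        # Count up on taken, down on not taken, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)

pattern = ([True] * 9 + [False]) * 100
accuracy_2bit = two_bit_accuracy(pattern)
print(f"2-bit accuracy: {accuracy_2bit:.0%}")
```

The single not-taken outcome at loop exit only moves the counter from 3 to 2, so the next entry into the loop is still predicted taken and only the exit itself is mispredicted.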

Correlating Branches

Code example showing the potential

    if (d==0)
        d=1;
    if (d==1)

Assembly code

        BNEZ   R1, L1
        DADDIU R1,R0,#1
    L1: DADDIU R3,R1,#-1
        BNEZ   R3, L2
    L2:

Observation: if BNEZ1 is not taken (d was 0, so d becomes 1), then BNEZ2 is also not taken

Chapter 3 - Exploiting ILP 27

(1, 1) Predictor

• (1,1) predictor - last branch, 1-bit prediction
• We use a pair of bits, where the first bit is the prediction if the last branch in the program was not taken, and the second bit is the prediction if the last branch was taken.

    Prediction bits   Prediction if last branch not taken   Prediction if last branch taken
    NT/NT             Not Taken                             Not Taken
    NT/T              Not Taken                             Taken
    T/NT              Taken                                 Not Taken
    T/T               Taken                                 Taken

(1, 1) Predictor: Example

• Consider the following code assuming d is assigned to R1.

    if (d==0)
        d=1;
    if (d==1)

        bnez R1,L1      ; branch b1 (d!=0)
        addi R1,R0,#1   ; d==0, so d=1
    L1: subi R3,R1,#1
        bnez R3,L2      ; branch b2 (d!=1)
        ...
    L2:

• Suppose d alternates between 2 and 0, (1, 1) predictor initialized to not taken. Bold indicates the prediction used.

• The only misprediction is on the first iteration, when d=2, because b1 was not correlated with the previous prediction of b2

    d=?   b1 pred   b1 action   new b1 pred   b2 pred   b2 action   new b2 pred
    2     NT/NT     T           T/NT          NT/NT     T           NT/T
    0     T/NT      NT          T/NT          NT/T      NT          NT/T
    2     T/NT      T           T/NT          NT/T      T           NT/T
    0     T/NT      NT          T/NT          NT/T      NT          NT/T
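
The table can be regenerated with a short simulation of the (1,1) scheme. This is a sketch under two assumptions that the slide leaves implicit: the "last branch" history starts as not taken, and history is shared by b1 and b2 (so b1 sees the outcome of the previous b2). Function and variable names are hypothetical.

```python
def run_one_one_predictor(d_values):
    # Each branch keeps two 1-bit predictions: slot 0 is used when the last
    # branch executed was not taken, slot 1 when it was taken.
    pred = {"b1": [False, False], "b2": [False, False]}   # initialized to NT/NT
    last_taken = False                                    # assumed initial history
    mispredictions = []
    for iteration, d in enumerate(d_values):
        b1_taken = d != 0            # bnez R1,L1
        if not b1_taken:
            d = 1                    # fall through: d = 1
        b2_taken = d != 1            # bnez R3,L2 with R3 = d - 1
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            slot = 1 if last_taken else 0
            if pred[name][slot] != taken:
                mispredictions.append((iteration, name))
            pred[name][slot] = taken # update only the slot that was used
            last_taken = taken
    return mispredictions, pred

misses, final_state = run_one_one_predictor([2, 0, 2, 0])
print("mispredictions:", misses)
print("final b1 bits:", final_state["b1"], "final b2 bits:", final_state["b2"])
```

Under these assumptions both mispredictions fall in the first iteration, and the final state corresponds to T/NT for b1 and NT/T for b2, as in the last table row.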

(1, 1) Predictor: Example

• If we had used a 1-bit predictor, we would have had all the branches mispredicted!

    d=?   b1 pred   b1 action   new b1 pred   b2 pred   b2 action   new b2 pred
    2     NT        T           T             NT        T           T
    0     T         NT          NT            T         NT          NT
    2     NT        T           T             NT        T           T
    0     T         NT          NT            T         NT          NT
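
For comparison, the same d sequence can be run through plain per-branch 1-bit predictors. A minimal sketch, assuming both predictors start as not taken (names are hypothetical):

```python
def run_one_bit(d_values):
    pred = {"b1": False, "b2": False}      # one bit per branch, initially not taken
    mispredicted = total = 0
    for d in d_values:
        b1_taken = d != 0                  # bnez R1,L1
        if not b1_taken:
            d = 1
        b2_taken = d != 1                  # bnez R3,L2
        for name, taken in (("b1", b1_taken), ("b2", b2_taken)):
            total += 1
            mispredicted += (pred[name] != taken)
            pred[name] = taken             # remember only the last outcome
    return mispredicted, total

wrong, executed = run_one_bit([2, 0, 2, 0])
print(f"{wrong} of {executed} branches mispredicted")
```

Since each branch alternates between taken and not taken, a predictor that repeats the last outcome is wrong every time.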

(m, n) Predictor

In general, an (m,n) predictor uses the behavior of the last m branches (using a shift register) to choose from 2^m branch predictors, each of which is an n-bit predictor for a single branch.
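
One way to picture an (m,n) predictor is as a table with 2^m n-bit saturating counters per entry, where an m-bit global shift register picks the counter. The sketch below is an illustrative model only; the class name, the PC indexing (dropping two low bits and taking the remainder), and the all-zero initial state are assumptions.

```python
class MNPredictor:
    """(m, n) correlating predictor: m bits of global history select one of
    2**m n-bit saturating counters in the entry chosen by the branch address."""

    def __init__(self, m, n, entries):
        self.m, self.n, self.entries = m, n, entries
        self.history = 0                               # m-bit global shift register
        self.max_count = (1 << n) - 1
        self.counters = [[0] * (1 << m) for _ in range(entries)]

    def _row(self, pc):
        return (pc >> 2) % self.entries                # assumption: 4-byte instructions

    def predict(self, pc):
        # Upper half of the counter range predicts taken.
        return self.counters[self._row(pc)][self.history] > self.max_count // 2

    def update(self, pc, taken):
        row = self.counters[self._row(pc)]
        count = row[self.history]
        row[self.history] = min(count + 1, self.max_count) if taken else max(count - 1, 0)
        # Shift the outcome into the global history, keeping only m bits.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

    def storage_bits(self):
        return (1 << self.m) * self.n * self.entries

big = MNPredictor(m=2, n=2, entries=1024)
print("(2,2) predictor with 1024 entries:", big.storage_bits(), "bits")

# A branch that alternates T, NT, T, NT is learned by a (1,1) predictor.
small = MNPredictor(m=1, n=1, entries=16)
wrong = 0
for i in range(20):
    taken = (i % 2 == 0)
    wrong += (small.predict(0x40) != taken)
    small.update(0x40, taken)
print("mispredictions on alternating branch:", wrong)
```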

Performance of (2, 2) Predictor

• Improvement is most noticeable in integer benchmarks.

• (m,n) predictor outperforms 2-bit predictor, even with unlimited entries!

[Figure: predictor comparison chart, with the integer benchmarks marked]

Tournament Predictors

• Uses multiple predictors, usually one based on local information and one based on global information.
  – Local predictors are better for some branches
  – Global predictors are better at utilizing correlation

• A selector is used to choose among the predictors, usually a 2-bit saturating counter.

[Figure: state diagram of the 2-bit selector (states 11, 10, 01, 00). Transitions are labeled n/m, where n refers to the left predictor and m to the right predictor; 0 means incorrect and 1 means correct.]
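
The selector described above can be sketched as a 2-bit saturating counter that moves only when exactly one of the two predictors is correct. Which counter values favour which predictor is an assumption here (low values favour predictor A, high values favour predictor B); the class name is hypothetical.

```python
class TournamentSelector:
    """2-bit saturating counter choosing between two component predictions."""

    def __init__(self):
        self.counter = 0      # assumed encoding: 0..1 favour A, 2..3 favour B

    def choose(self, pred_a, pred_b):
        return pred_b if self.counter >= 2 else pred_a

    def update(self, pred_a, pred_b, taken):
        a_ok, b_ok = (pred_a == taken), (pred_b == taken)
        if a_ok and not b_ok:
            self.counter = max(self.counter - 1, 0)     # move toward A
        elif b_ok and not a_ok:
            self.counter = min(self.counter + 1, 3)     # move toward B
        # Both right or both wrong: leave the selector unchanged.

selector = TournamentSelector()
# Predictor B is right twice while A is wrong, so the selector switches to B.
for _ in range(2):
    selector.update(pred_a=False, pred_b=True, taken=True)
chosen = selector.choose(pred_a=False, pred_b=True)
print("counter:", selector.counter, "chosen prediction:", chosen)
```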

Example: Alpha 21264 Branch Predictor

The 21264 used one of the most sophisticated branch predictors of its time.

[Figure: a local predictor (last 10 outcomes of this branch, feeding a 3-bit saturating counter), a global predictor (last 12 outcomes of all the branches, feeding a 2-bit predictor), and a 2-bit saturating counter that selects between them]

Tournament Predictor in Alpha 21264

• Local predictor consists of a 2-level predictor:
  – Top level: a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted
  – Next level: the selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction

• Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180K transistors)
  – The two 4K*2 terms are the 2-bit global predictor and the 2-bit selector; 1K*10 is the local history table (1K entries of 10 bits); 1K*3 is the local counter table (1K entries of 3 bits)
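
The storage total is simple arithmetic over the four tables. A short sketch of the sum; attributing the two 4K x 2-bit terms to the global predictor and the selector follows the usual description of the 21264 and should be read as an assumption where the slide does not spell it out.

```python
K = 1024

tables = {
    "global predictor (4K x 2 bits)": 4 * K * 2,
    "selector (4K x 2 bits)": 4 * K * 2,
    "local history (1K x 10 bits)": 1 * K * 10,
    "local counters (1K x 3 bits)": 1 * K * 3,
}
total_bits = sum(tables.values())

for name, bits in tables.items():
    print(f"{name:32s} {bits // K:3d} Kbits")
print(f"{'total':32s} {total_bits // K:3d} Kbits")
```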

% of predictions from local predictor in Tournament Prediction Scheme

    Benchmark    % from local predictor
    nasa7         98%
    matrix300    100%
    tomcatv       94%
    doduc         90%
    spice         55%
    fpppp         76%
    gcc           72%
    espresso      63%
    eqntott       37%
    li            69%

Accuracy of Branch Prediction

    Benchmark   Profile-based   2-bit counter   Tournament
    gcc         88%             70%             94%
    espresso    86%             82%             96%
    li          88%             77%             98%
    fpppp       86%             82%             98%
    doduc       95%             84%             97%
    tomcatv     99%             99%             100%

• Profile: branch profile from the last execution (static in that it is encoded in the instruction, but based on a profile)

fig 3.40

Accuracy v. Size (SPEC89)

[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for three predictors: local 2-bit counters, correlating (2,2) scheme, and tournament]

Power Consumption

BlueRISC’s Compiler-driven Power-Aware Branch Prediction

Comparison with 512 entry BTAC bimodal (patent-pending)

Copyright 2007 CAM & BlueRISC

Pitfall: Sometimes dumber is better

• Alpha 21264 uses tournament predictor (29 Kbits)
• Earlier 21164 uses a simple 2-bit predictor with 2K entries (or a total of 4 Kbits)
• On SPEC95 benchmarks, the 21264 outperforms
  – 21264 avg. 11.5 mispredictions per 1000 instructions
  – 21164 avg. 16.5 mispredictions per 1000 instructions
• Reversed for transaction processing (TP)!
  – 21264 avg. 17 mispredictions per 1000 instructions
  – 21164 avg. 15 mispredictions per 1000 instructions
• TP code is much larger, and the 21164 holds 2X as many branch predictions based on local behavior (2K vs. 1K local predictor in the 21264)
• What about power?
  – Large predictors give some increase in prediction rate but for a large power cost

Branch Target Buffer

The BTB acts as a cache for branch target addresses (BTAs). This eliminates the cycles otherwise spent on each branch calculating the BTA.

BTB (cont.)

The BTA and the outcome of the branch are known by the end of the ID stage

…but not relayed until EX stage

BTB (cont.)

Return Address Prediction

BTB and BPB do a good job in predicting how future behavior will repeat. However, the subroutine call/return paradigm makes correct prediction difficult.

The BTB then contains the following after the second subroutine is called:

    Inst. Addr.   Target Addr.
    100           500
    520           104
    112           500

When we return from subr, we get a hit on a valid entry in the BTB (Inst. Addr. = 520) and predict that we will return to address 104. However, this is not correct. The next instruction should be 116!

Subroutine Return Stack

In order to detect such mispredictions, a subroutine return stack can be used to augment the BTB.
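
The call/return problem and the stack-based fix can be illustrated with a toy model. The sketch uses the addresses from the return-address example (calls at 100 and 112 to a subroutine at 500 that returns from 520) and assumes 4-byte instructions; the `call` and `ret` helpers are hypothetical, and in this simplified model the value popped from the stack is also the actual return address.

```python
btb = {}             # instruction address -> last observed target address
return_stack = []    # return addresses pushed by calls

def call(pc, target):
    btb[pc] = target
    return_stack.append(pc + 4)        # assumption: 4-byte instructions

def ret(pc):
    btb_guess = btb.get(pc)            # what the BTB alone would predict
    stack_guess = return_stack.pop()   # what the return stack predicts
    btb[pc] = stack_guess              # the BTB learns the most recent target
    return btb_guess, stack_guess

call(100, 500)
first = ret(520)      # BTB has no entry yet; the stack gives 104
call(112, 500)
second = ret(520)     # BTB still says 104; the stack gives 116

print("first return  (BTB, stack):", first)
print("second return (BTB, stack):", second)
```

On the second return the BTB entry for address 520 is stale, while the stack supplies the address that follows the most recent call.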

Performance of SRS

SPEC 95

Pentium 4’s Branch Predictor

• “Unveiling the Intel Branch Predictors”
  – Pentium 4
  – http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1597026

Neural Branch Predictors

• “Towards a High Performance Neural Branch Predictor”
  – http://webspace.ulbsibiu.ro/lucian.vintan/html/USA.pdf
  – The main advantage of the neural predictor is its ability to exploit long histories while requiring only linear resource growth
  – Used in IA-64 simulators

Core 2’s Branch Predictor?

• TAGE: Tagged Geometric

TAGE Performance

To Learn More
