Computer Science Department

University of Central Florida

CDA 5106 Advanced Computer Architecture I

Pipelining

2

Designing a processor

• Design the ISA

• Classify instructions for the ISA (e.g., MIPS):

– Memory references

– Register-Register ALU Operations

– Register-Immediate ALU Operations

– Branches

• Work out the execution for each operation class

• Design appropriate hardware

• Look for opportunities to improve…

• …while maintaining correct execution

3

How to Execute an Instruction

• Instruction fetch (“IF”)

– IR = Mem[PC]

– NPC = PC + 4

• Instruction decode/Register fetch (“ID”)

– A = Regs[IR25..21]

– B = Regs[IR20..16]

– Imm = sign-extend(IR15..0)

• Execute (“EX”)

– Memory reference:

• ALUOutput = A + Imm

– Reg/Reg ALU Operation:

• ALUOutput = A op B

– Reg/Immediate ALU Operation:

• ALUOutput = A op Imm

– Branch:

• ALUOutput = NPC + Imm; Cond = (A op 0)

4

Executing an Instruction (cont.)

• Memory Access/Branch completion (“MEM“)

– Memory Reference:

• Load_Mem_Data = Mem[ALUOutput] /* Load */

• Mem[ALUOutput] = B /* Store */

– Branch

• If (cond) PC = ALUOutput, else PC = NPC

• Write back (“WB”)

– Reg-Reg ALU Operation:

• Regs[IR15..11] = ALUOutput

– Reg-Immediate ALU Operation:

• Regs[IR20..16] = ALUOutput

– Load instruction:

• Regs[IR20..16] = Load_Mem_Data
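The field selections used in ID and WB (IR25..21, IR20..16, IR15..11, IR15..0) can be sketched in Python. This is only an illustration, assuming the standard 32-bit MIPS field layout; the helper names `sign_extend16` and `decode` are hypothetical and not part of the slides.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def decode(ir):
    """Split a 32-bit instruction word into the fields the slides refer to."""
    return {
        "opcode": (ir >> 26) & 0x3F,  # IR31..26
        "rs": (ir >> 21) & 0x1F,      # IR25..21, A = Regs[rs]
        "rt": (ir >> 16) & 0x1F,      # IR20..16, B = Regs[rt]
        "rd": (ir >> 11) & 0x1F,      # IR15..11, destination of Reg-Reg ops
        "imm": sign_extend16(ir),     # IR15..0, sign-extended
    }


# addi $8, $9, -1 in the standard MIPS encoding (opcode 8)
fields = decode(0x2128FFFF)
print(fields)
```

For a Reg-Immediate operation such as this one, the destination is the IR20..16 field (`rt`), matching the WB rule above.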

5

How to Execute an Instruction

• Instruction fetch (“IF”)

– IR = Mem[PC]

– NPC = PC + 4

[Figure: IF datapath: PC, ALU adding 4 to the PC, instruction cache, NPC, IR (inst. reg.)]

6

How to Execute an Instruction (cont.)

• Instruction decode/Register fetch (“ID”)

– A = Regs[IR25..21]

– B = Regs[IR20..16]

– Imm = sign-extend(IR15..0)

[Figure: ID datapath: IR (inst. reg.) feeds Regs and the sign extend unit; outputs are latched in A, B, and Imm]

7

How to Execute an Instruction (cont.)

• Execute (“EX”)

– Memory reference:

• ALUOutput = A + Imm

– Reg/Reg ALU Operation:

• ALUOutput = A op B

– Reg/Immediate ALU Operation:

• ALUOutput = A op Imm

– Branch:

• ALUOutput = NPC + Imm; Cond = (A op 0)

[Figure: EX datapath: two MUXes select the ALU inputs from NPC, A, B, and Imm; an “=0?” test on A produces cond]

8


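• Memory Access/Branch completion (“MEM”)

– Memory Reference:

• Load_Mem_Data = Mem[ALUOutput] /* Load */

• Mem[ALUOutput] = B /* Store */

– Branch

• If (cond) PC = ALUOutput, else PC = NPC

[Figure: MEM datapath: the ALU output addresses the data cache, B supplies store data, loads go to LMD; a MUX controlled by cond selects the new PC from NPC or the ALU output]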

9


• Write back (“WB”)

– Reg-Reg ALU Operation:

• Regs[IR15..11] = ALUOutput

– Reg-Immediate ALU Operation:

• Regs[IR20..16] = ALUOutput

– Load instruction:

• Regs[IR20..16] = Load_Mem_Data

[Figure: WB datapath: a MUX selects LMD or the ALU output as the value written to Regs]

10

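[Figure: complete datapath in five stages. Instruction Fetch (IF): PC, ALU adding 4, instruction cache, NPC, IR (inst. reg.). Instruction Decode (ID): Regs, sign extend, A, B, Imm. Execute (EX): two MUXes feeding the ALU, “=0?” test producing cond. Memory (MEM): MUX for the next PC, data cache, LMD. Writeback (WB): MUX selecting the value written back.]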

11

An Abstract View of Single-Cycle Implementation

12

Controller

13

Analysis

• Single-cycle Implementation -> Multi-cycle Implementation

• All instructions (except branch):

– IF, ID, EX, MEM, WB

• Branch: (12%)

– IF, ID, EX, MEM

• CPI = 5*0.88+4*0.12 = 4.88 cycles

• Graphically:

[Figure: three instructions executed back to back, each passing through IC, Reg, ALU, DC, Reg with no overlap]
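The CPI figure above is a weighted average over the instruction mix. A minimal Python sketch of that arithmetic follows; the function name `average_cpi` is an illustrative assumption, and the mix (88% five-cycle instructions, 12% four-cycle branches) is the one given on the slide.

```python
def average_cpi(mix):
    """mix: list of (fraction_of_instructions, cycles) pairs that sum to 1."""
    total_fraction = sum(fraction for fraction, _ in mix)
    assert abs(total_fraction - 1.0) < 1e-9, "fractions must sum to 1"
    return sum(fraction * cycles for fraction, cycles in mix)


# 88% of instructions take 5 cycles, 12% (branches) take 4 cycles
cpi = average_cpi([(0.88, 5), (0.12, 4)])
print(round(cpi, 2))
```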

14

Unpipelined Execution

• Throughput

– Depends on full latency of instruction

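[Figure: two instructions executed back to back (IF ID EX MEM WB, then IF ID EX MEM WB). In any given cycle only one resource is busy and the rest sit idle: I$ idle; decoder idle, RF read ports idle; ALU idle; D$ idle; RF write port idle]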

15

Pipelined Execution

• Isolate each of IF, ID, EX, MEM, WB with latches

• When instruction i is in WB, i+1 is in MEM, etc.

• Graphically:

time →
i      IC  Reg  ALU  DC   Reg
i+1        IC   Reg  ALU  DC   Reg
i+2             IC   Reg  ALU  DC   Reg
i+3                  IC   Reg  ALU  DC   Reg
i+4                       IC   Reg  ALU  DC   Reg

17

Pipeline speedup (no stalls)

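• For a pipeline of n stages:

$$\text{speedup} = \frac{\text{ave. exec. time unpipelined}}{\text{ave. exec. time pipelined}} = \frac{T_{unpipe}}{T_{unpipe}/n + T_{latch}} = \frac{n}{1 + n\,(T_{latch}/T_{unpipe})} = n \quad (\text{ideal case, } T_{latch} = 0)$$

[Figure: the unpipelined logic divided into the stages IF, ID, EX, MEM, WB, each separated by a latch]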
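The no-stall speedup can be explored numerically. The sketch below is illustrative only: `pipeline_speedup` is a hypothetical helper, and the delay values (10.0 time units of unpipelined logic, 0.5 units of latch overhead) are made-up numbers chosen to show the trend.

```python
def pipeline_speedup(n, t_unpipe, t_latch):
    """Speedup of an n-stage pipeline with no stalls.

    The pipelined cycle time is one stage of logic (t_unpipe / n)
    plus the latch overhead t_latch.
    """
    cycle_time = t_unpipe / n + t_latch
    return t_unpipe / cycle_time


for stages in (5, 10, 20):
    ideal = pipeline_speedup(stages, 10.0, 0.0)
    real = pipeline_speedup(stages, 10.0, 0.5)
    print(stages, ideal, round(real, 2))
```

With zero latch overhead the speedup equals n; with a fixed latch overhead, doubling the depth gives less than double the speedup.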

18

Pipeline limits

• Limitations of pipelining

– Tlatch

• Delay, setup, hold times

• Clock skew

• Latch takes up more of cycle as cycle shrinks: deeper pipelining gives diminishing returns

– Minimum logic between latches

[Figure: with deeper pipelining the cycle shrinks while Tlatch stays the same, so Tlatch takes a growing share of the cycle]

19

Pipelining Idealisms

• Uniform subcomputations

– Can pipeline into stages with equal delay

– Balance pipeline stages

• Identical computations

– Can fill pipeline with identical work

– Unify instruction types

• Independent computations

– No relationships between work units

– Minimize pipeline stalls

• Are these practical?

– No, but can get close enough to get significant speedup

20

Pipeline hazards

• A hazard reduces the performance of the pipeline

– Due to program’s characteristics

– Potential violations of program dependences

• Hazard Resolution

– Static Method: Performed statically by compiler

– Dynamic Method: Performed dynamically by hardware at run time, e.g., stall, flush, forwarding

• Three kinds:

– Structural hazards - not enough hardware resources for all combinations of instructions

– Data hazards - Dependencies between instructions prevent their overlapped execution

– Control hazards - Branches change the PC, so the next instruction to fetch is known late

21

Structural hazard

• Consider a pipeline with a unified data+instruction cache:

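i (load)   IF  ID  EX  MEM    WB
i+1            IF  ID  EX     MEM  WB
i+2                IF  ID     EX   MEM  WB
i+3                    stall  IF   ID   EX   MEM  WB
i+4                                IF   ID   EX   MEM

(The IF of i+3 would need the unified cache in the same cycle as the MEM of the load, so i+3 stalls for one cycle.)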

22

Modeling stalls

$$\text{speedup} = \frac{\text{ave. exec. time unpipelined}}{\text{ave. exec. time pipelined}} = \frac{CPI_{unpipe} \times CT_{unpipe}}{CPI_{pipe} \times CT_{pipe}}$$

$$CPI_{pipe} = CPI_{nostall} + (\text{stall cycles per instruction}) = 1 + (\text{stall cycles per instruction})$$

$$\text{speedup} = \frac{CPI_{unpipe}}{1 + (\text{stall cycles per instruction})} \times \frac{CT_{unpipe}}{CT_{pipe}} = \frac{n}{1 + (\text{stall cycles per instruction})}$$

(this assumes that the two CT’s are equal, with $CPI_{unpipe} = n$)

$$\text{speedup} = \frac{1}{1 + (\text{stall cycles per instruction})} \times \frac{CT_{unpipe}}{CT_{pipe}} = \frac{n}{1 + (\text{stall cycles per instruction})}$$

(this assumes that the two CPIs ($CPI_{unpipe}$ and $CPI_{nostall}$) are equal, with $CT_{unpipe} = n \times CT_{pipe}$; same thing)

23

Modeling stalls (example)

• Ex:

– n = 5 (e.g., MIPS pipeline)

– 20% of instructions are branches

– 60% of branches are taken

– Penalties:

• Taken branches: 3 stall cycles

• Not-taken branches: 0 stall cycles

• How many stall cycles per instr. on average?

– stall cycles/instr. = (0.8 x 0) + (0.2 x [ 0.6 x 3 + 0.4 x 0]) = 0.2 x 0.6 x 3 = 0.36

– Speedup = 5 / 1.36 = 3.68
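The example above can be checked with a few lines of Python. This is a sketch of the arithmetic only; the function names are illustrative assumptions, and it uses the speedup model n / (1 + stall cycles per instruction) from the previous slide.

```python
def stall_cycles_per_instr(branch_frac, taken_frac, taken_penalty, not_taken_penalty=0):
    """Average stall cycles per instruction caused by branches."""
    per_branch = taken_frac * taken_penalty + (1 - taken_frac) * not_taken_penalty
    return branch_frac * per_branch


def speedup_with_stalls(n, stalls):
    """Speedup of an n-stage pipeline: n / (1 + stall cycles per instruction)."""
    return n / (1 + stalls)


# 20% branches, 60% of them taken, 3 stall cycles per taken branch
stalls = stall_cycles_per_instr(0.2, 0.6, 3)
print(round(stalls, 2), round(speedup_with_stalls(5, stalls), 2))
```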

24

Data Hazards

• Read-after-write (RAW) hazard

add r1, r2, r3    IF  ID  EX     MEM    WB
add r4, r1, r5        IF  stall  stall  ID  EX  MEM  WB
(next instr.)                           IF  ID  EX   MEM

• 2 cycle stall: r1 is written in WB of the first add (reg. is written) and read in ID of the second add (reg. is read)

• Perform the register write in the 1st half of the clock cycle (CT/2) and the read in the second half (CT/2), so that WB and ID can share a cycle

25

RAW Data Hazards

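add r1, r2, r3    IF  ID  EX  MEM  WB
add r4, r1, r5        IF  ID  EX   MEM  WB

• Result (r2+r3) is available once the first add leaves EX

• r1 is written in WB of the first add

• r1 is read in ID of the second add

• The second add only needs the result, not r1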

27

Data forwarding (bypasses)

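[Figure: bypass paths. MUXes in front of the ALU choose among the ID/EX values (A, B, IMM from the RF), bypass 1 from the EX/MEM latch, and bypass 2 from the MEM/WB latch (after the D$)]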

28

Data forwarding (cont.)

add r1, r2, r3    IF  ID  EX  MEM  WB
add r4, r1, r5        IF  ID  EX   MEM  WB                 (byp 1)
add r6, r1, r5            IF  ID   EX   MEM  WB            (byp 2)
add r7, r1, r5                IF   ID   EX   MEM  WB
add r8, r1, r5                     IF   ID   EX   MEM  WB

29

Stalls due to RAW data hazards

• Our simple pipeline

– Most RAW hazards => no stall

– Loads cause 1-cycle stall

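load r1, r2, r3   IF  ID  EX  MEM    WB
add r4, r1, r5        IF  ID  stall  EX  MEM  WB
(next instr.)             IF  stall  ID  EX   MEM

• Value available: end of MEM of the load

• Value needed: start of EX of the add, supplied through byp 2 after the 1-cycle stall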
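The rule on this slide (with forwarding, only a load followed immediately by a consumer of its result stalls, and only for one cycle) can be sketched as a small check over an instruction list. The tuple format `(op, dest, sources)` and the function name are assumptions made for this illustration, not something defined in the slides.

```python
def load_use_stalls(program):
    """Count stall cycles in the simple 5-stage pipeline with full forwarding.

    program: list of (op, dest, sources) tuples in program order.
    Only a load whose destination is read by the very next instruction
    costs a stall (one cycle); all other RAW hazards are bypassed.
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_dest, _ = previous
        _, _, cur_sources = current
        if prev_op == "load" and prev_dest in cur_sources:
            stalls += 1
    return stalls


alu_pair = [("add", "r1", ("r2", "r3")), ("add", "r4", ("r1", "r5"))]
load_pair = [("load", "r1", ("r2",)), ("add", "r4", ("r1", "r5"))]
print(load_use_stalls(alu_pair), load_use_stalls(load_pair))
```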

30

Other data hazards

• WAR (write-after-read)

– A r1, r2, r3

– B r2, r4, r5

– Hazard if B writes R2 before A reads R2

– Doesn’t happen in simple DLX pipeline, but can in others

• E.g., occurs if pipeline allows late register reads

SW 0(R2), R1      IF  ID  EX  MEM1  MEM2  MEM3  WB     (reads R1 during MEM3)
ADD R1, R3, R4        IF  ID  EX    WB                 (writes R1 during WB)

31

Other data hazards (cont.)

• WAW (write-after-write)

– A r1, r2, r3

– B r1, r4, r5

– Hazard if B writes R1 before A writes R1

– Result: later instructions see wrong value in the register

– Occurs if instructions can write register file out-of-order

– This also doesn’t happen in simple DLX pipeline, but can in others:

LW R1, 0(R2)      IF  ID  EX  MEM1  MEM2  WB     (writes 1st version of R1)
ADD R1, R2, R3        IF  ID  EX    WB           (writes 2nd version of R1)

32

Other data hazards (cont.)

• Handling WAR/WAW hazards

– Stall the later instruction (stalls in WB stage)

– Detect in decode and prevent from happening by stalling earlier (easier to implement)

– Compiler: don’t reuse register specifier

– Hardware: register renaming (see next major topic – ILP)

33

Types of dependencies

• True-dependence (pure-dependence, flow-dependence)

– ADD R1,R2,R3

– SUB R4,R5,R1

– May cause RAW hazards

• Anti-Dependence

– ADD R3,R2,R1

– SUB R1,R4,R5

– May cause WAR hazards

– Due to reuse: Removed by using another register

• Output-Dependence

– ADD R1,R2,R3

– SUB R1,R4,R5

– May cause WAW hazards

– Due to reuse: Removed by using another register
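The three dependence types can be detected mechanically from the register names. The sketch below is an illustration, assuming each instruction is represented as a `(dest, sources)` pair; the function name `dependences` is hypothetical.

```python
def dependences(first, second):
    """Return the dependence types of `second` on the earlier instruction `first`.

    Each instruction is a (dest, sources) pair of register names.
    """
    first_dest, first_sources = first
    second_dest, second_sources = second
    found = set()
    if first_dest in second_sources:
        found.add("true (RAW)")    # second reads what first writes
    if second_dest in first_sources:
        found.add("anti (WAR)")    # second overwrites what first reads
    if first_dest == second_dest:
        found.add("output (WAW)")  # both write the same register
    return found


# The three instruction pairs from the slide
print(dependences(("R1", ("R2", "R3")), ("R4", ("R5", "R1"))))
print(dependences(("R3", ("R2", "R1")), ("R1", ("R4", "R5"))))
print(dependences(("R1", ("R2", "R3")), ("R1", ("R4", "R5"))))
```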

34

Control hazards

• Branches throw a wrench in the cogs

– Disrupts pipeline because we don’t know what to fetch next

– Problems

• Don’t know we have a branch until decode (ID)

• Don’t know taken target until execute (EX)

• Don’t know branch direction (taken/not taken) until execute (EX) or Memory stage (MEM)

35

Handling Control Hazards: method 1

• stall

NOT-TAKEN
BNE r1, r2          IF  ID  EX     MEM  WB
not-taken target        IF  stall  ID   EX  MEM  WB

TAKEN
BNE r1, r2          IF  ID  EX     MEM  WB
not-taken target        IF  stall
taken target                       IF   ID  EX  MEM  WB

Annotations: PC+4 known; branch known; direction known (nt); PC+offset known

36


Handling Control Hazards: method 2

• predict not-taken

NOT-TAKEN
BNE r1, r2          IF  ID  EX  MEM  WB
not-taken target        IF  ID  EX   MEM  WB

TAKEN
BNE r1, r2          IF  ID  EX  MEM  WB
not-taken target        IF  ID
(next sequential)           IF
taken target                    IF   ID  EX  MEM  WB

Annotations: direction known (nt); PC+offset known

37

Move branch target back

[Figure: datapath (IF, ID, EX, MEM) with the “=0?” test that produces cond and an added ALU for the branch target moved back from EX into ID, next to the RF]

• Add an ALU

– May increase time for ID

– Eliminates 1 cycle, but still 1 additional stall

38

Reducing branch penalty via the compiler: Delay slots

• Change the meaning of a branch so that next instruction after branch holds something useful

A: BEQZ R1, X
B: ADD R4,R2,R3     (“delay slot”: move useful instruction here from above the branch)
...
X: ...

A    IF  ID  EX  MEM  WB
B        IF  ID  EX   MEM  WB
X            IF  ID   EX   MEM  WB

39

Delay slots

• Add n slots to cover n holes

• ISA is changed to mean “n instructions after any branch are always executed”

• Problem:

– ISA feature that encodes pipeline structure

– Difficult to maintain across generations

– Typically can fill:

• 1 slot 75% of time

• 2 slots about 25% of time

• >2 slots almost never

40

Filling slot from above branch

• Advantage

– Delay slot instruction can always execute regardless of branch outcome (don’t ever need to squash it)

• Disadvantage

– Need a “safe” instruction from above the branch

– Safe means: moving the instruction to the delay slot doesn’t violate any data dependencies

GOOD SCENARIO

Before:
ADD R4, R2, R3
BEQZ R1, X
NOP

After:
BEQZ R1, X
ADD R4, R2, R3

BAD SCENARIO

Before:
ADD R1, R2, R3
BEQZ R1, X
NOP

After (unchanged, because the branch reads the R1 that the ADD writes):
ADD R1, R2, R3
BEQZ R1, X
NOP

41

Filling slot from target or fall-through

• If you can’t fill slot(s) from above the branch, use instructions from either:

– Target of branch (if frequently taken)

– Fall-through of branch (if frequently not-taken)

• Example: fill from target (branch is frequently taken)

Before:
BEQZ R1, X
NOP
...
X: SUB R4, R2, R3
Y: …

After:
BEQZ R1, Y          (change target to Y)
SUB R4, R2, R3      (copy of X)
...
X: SUB R4, R2, R3
Y: …

• Disadvantages

– Only works if delay slot instruction is safe to execute when branch goes the opposite (infrequent) direction

42

Eliminating all stalls: Issues

• When:

– Detect an un-decoded instruction is a branch

• Where:

– Predict where the branch will go (if taken)

• Whether:

– (For conditional branches) Predict if it will be taken or not, before execution

• Optimal: try to determine all three in IF stage

– Won’t work perfectly (prediction), but we can try our best

43

[Figure: the five-stage datapath again: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory (MEM), Writeback (WB)]

44

Issues with When and Where

• What does IF know?

– Only the address of the instruction (PC)

– Keep buffer (cache) of last known branch targets around

– Buffer is written to by WB stage

• Traditional name for this is a Branch Target Buffer (BTB)

– BTB entry: tag, pre-computed target

[Figure: the PC value is looked up in the buffer of last known branches]

– Hit = we know it is a branch (“WHEN”), BTB returns branch target (“WHERE”)

– Miss = assume not a branch
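A BTB lookup can be sketched as a small direct-mapped table of (tag, target) pairs. The Python below is an illustration under assumed parameters: the class name, the 16-entry size, the direct-mapped organization, and the 4-byte instruction alignment are all choices made for this sketch, not details given in the slides.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB: each slot holds (tag, pre-computed target) or None."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries

    def _split(self, pc):
        word = pc >> 2  # assume 4-byte aligned instructions
        return word % self.entries, word // self.entries  # (index, tag)

    def lookup(self, pc):
        """Return the predicted target on a hit, or None on a miss."""
        index, tag = self._split(pc)
        slot = self.table[index]
        if slot is not None and slot[0] == tag:
            return slot[1]  # hit: it is a branch, and this is where it goes
        return None         # miss: assume not a branch

    def update(self, pc, target):
        """Record a resolved branch (done late in the pipeline)."""
        index, tag = self._split(pc)
        self.table[index] = (tag, target)


btb = BranchTargetBuffer()
first_lookup = btb.lookup(0x400)   # miss before the branch has been seen
btb.update(0x400, 0x480)
second_lookup = btb.lookup(0x400)  # hit afterwards
print(first_lookup, hex(second_lookup))
```

The tag comparison is what lets two PCs that share an index be told apart; without it, a non-branch instruction could be redirected by another instruction's entry.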

45

Predicting Where with returns

• Problem: A lot of jumps are returns from procedures

– Holding the last target address is a poor predictor

• Solution: Keep a hardware “stack” of return addresses

– Push return address when a “call” is executed

– Pop buffer on returns to get prediction

• Bottom of stack is filled with old value on a pop

– Need approx 4-8 entries for integer code

• Each entry in BTB now contains: tag, pre-computed target, return?

46

Return Address Stack

• Initially: RAS empty

• X: call A executes: RAS = [X+4]

• Inside A, Y: call B executes: RAS = [X+4, Y+4] (Y+4 on top)

• ret (from B): Prediction: Y+4; RAS = [X+4]

• ret (from A): Prediction: X+4; RAS empty
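The call/return sequence above can be replayed with a small fixed-depth stack. This Python sketch is illustrative: the class name, the depth of 8, and the concrete addresses for X and Y are assumptions. On a pop, the bottom entry is duplicated, which is one reading of "bottom of stack is filled with old value on a pop".

```python
class ReturnAddressStack:
    """Fixed-depth return address stack; entries[0] is the top."""

    def __init__(self, depth=8):
        self.entries = [None] * depth

    def push(self, return_address):
        # A call pushes its return address; the oldest entry falls off the bottom.
        self.entries = [return_address] + self.entries[:-1]

    def pop(self):
        # A return pops the prediction; the bottom keeps its old value.
        top = self.entries[0]
        self.entries = self.entries[1:] + [self.entries[-1]]
        return top


X, Y = 0x1000, 0x2000   # hypothetical addresses of the two call instructions
ras = ReturnAddressStack()
ras.push(X + 4)         # X: call A
ras.push(Y + 4)         # Y: call B
first_prediction = ras.pop()    # ret from B
second_prediction = ras.pop()   # ret from A
print(hex(first_prediction), hex(second_prediction))
```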

47

Issues with Whether

• Predicting conditional branches

– And sometimes unconditional branches if needed before decode

• Two approaches:

– Hardware to supply prediction

– Software

• Heuristics

• Profiling

48

Hardware branch prediction

• 1-bit schemes (Pattern History Table):

– Add 1-bit prediction field to branch target buffer

– Set prediction field = 1 if branch was taken, 0 if branch was not taken

– At IF, check “branch prediction buffer”:

• if prediction field = 1 then predict taken

• else predict not-taken

– Problems:

• Some branches don’t do what they did last time!

• Think of a simple 10 iteration loop, start predict NT

– What is prediction accuracy?

– Isn’t this high enough?!?

• Need more sophisticated predictor
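The 10-iteration loop question can be answered by simulating the 1-bit scheme. The sketch below assumes the loop-closing branch is taken 9 times and then falls through once; the function name is illustrative.

```python
def one_bit_accuracy(outcomes, initial_prediction=False):
    """Accuracy of a 1-bit predictor that predicts the previous outcome.

    outcomes: list of booleans (True = taken).
    initial_prediction: False means the predictor starts at not-taken.
    """
    prediction = initial_prediction
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # remember only what the branch did last time
    return correct / len(outcomes)


loop_branch = [True] * 9 + [False]   # one execution of a 10-iteration loop
print(one_bit_accuracy(loop_branch))
print(one_bit_accuracy(loop_branch * 100))
```

Under these assumptions the first and the last iteration are both mispredicted every time the loop runs, so a branch that is taken 90% of the time is predicted correctly only 80% of the time.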

49

Why accuracy matters so much

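• Reduce stalls by:

– Decreasing branch penalty

• Modify the pipeline (HW question)

– Increasing accuracy

• Fancy prediction schemes

– Decreasing fraction of branches

• compile-time code ordering

$$\text{speedup} = \frac{n}{1 + \text{stall cycles}}, \qquad \text{efficiency} = \frac{\text{speedup}}{n} = \frac{1}{1 + \text{stall cycles}}$$

$$\text{stall cycles} = \text{branch penalty} \times (1 - \text{accuracy}) \times \text{fraction}_{branch}$$

$$\text{efficiency} = \frac{1}{1 + \text{branch penalty} \times (1 - \text{accuracy}) \times \text{fraction}_{branch}}$$

$$\text{branch penalty} \times (1 - \text{accuracy}) \times \text{fraction}_{branch} = \frac{1}{\text{efficiency}} - 1$$

$$1 - \text{accuracy} = \frac{\frac{1}{\text{efficiency}} - 1}{\text{branch penalty} \times \text{fraction}_{branch}}$$

$$\text{accuracy} = 1 - \frac{\frac{1}{\text{efficiency}} - 1}{\text{branch penalty} \times \text{fraction}_{branch}}$$

branch penalty    accuracy (eff = 0.9)    accuracy (eff = 0.99)
1                 44.44%                  94.95%
2                 72.22%                  97.47%
3                 81.48%                  98.32%
4                 86.11%                  98.74%
10                94.44%                  99.49%
20                97.22%                  99.75%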
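The table can be regenerated from the relation accuracy = 1 - (1/efficiency - 1) / (branch penalty × branch fraction). The sketch below is illustrative; the branch fraction of 0.2 is an assumption that is not stated next to the table, chosen because it reproduces the listed values.

```python
def required_accuracy(efficiency, branch_penalty, branch_fraction):
    """Prediction accuracy needed to reach a target pipeline efficiency."""
    stall_budget = 1 / efficiency - 1  # allowed stall cycles per instruction
    return 1 - stall_budget / (branch_penalty * branch_fraction)


BRANCH_FRACTION = 0.2  # assumed: consistent with the table values
for penalty in (1, 2, 3, 4, 10, 20):
    low = 100 * required_accuracy(0.90, penalty, BRANCH_FRACTION)
    high = 100 * required_accuracy(0.99, penalty, BRANCH_FRACTION)
    print(penalty, f"{low:.2f}%", f"{high:.2f}%")
```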

50

Smith n-bit counter predictor

• Replace prediction bit with n-bit counter:

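[Figure: state diagram of a 2-bit counter with states 11, 10, 01, 00. A taken branch (T) moves the counter toward 11 and a not-taken branch (N) moves it toward 00. States 11 and 10 predict taken; states 01 and 00 predict not-taken. The initial state is chosen using the NT heuristic.]

• Problems with n > 2 to 3: Smith called it “inertia”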

51

Example of Smith counter

outcome         T  T  T  T  T  T  N  T  T  N  N  N  N  N  N  N  N  T  T
previous state  01 10 11 11 11 11 11 10 11 11 10 01 00 00 00 00 00 00 01
new state       10 11 11 11 11 11 10 11 11 10 01 00 00 00 00 00 00 01 10

6 mispredictions out of 19 branch executions

outcome         T  N  T  N  ... (alternating)
previous state  01 10 01 10 ...
new state       10 01 10 01 ...

19 mispredictions out of 19: the infamous “toggle branch”
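Both traces can be reproduced with a short simulation of a saturating counter. This is a sketch, assuming the counter starts in state 01 and predicts taken when its top bit is set; the function name is illustrative.

```python
def smith_mispredictions(outcomes, state=1, bits=2):
    """Count mispredictions of an n-bit saturating (Smith) counter.

    outcomes: string of 'T' (taken) and 'N' (not taken).
    state: initial counter value (1 is binary 01).
    """
    top = (1 << bits) - 1          # saturation value, 11 for 2 bits
    threshold = 1 << (bits - 1)    # predict taken at or above 10
    misses = 0
    for symbol in outcomes:
        taken = symbol == "T"
        predicted_taken = state >= threshold
        if predicted_taken != taken:
            misses += 1
        state = min(state + 1, top) if taken else max(state - 1, 0)
    return misses


first_trace = "T" * 6 + "N" + "TT" + "N" * 8 + "TT"   # the 19 outcomes above
toggle_trace = "TN" * 9                               # 18 alternating outcomes
print(smith_mispredictions(first_trace), len(first_trace))
print(smith_mispredictions(toggle_trace), len(toggle_trace))
```

For the alternating pattern the counter bounces between 01 and 10, so every prediction is wrong no matter how long the trace is.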

52

Improving the smith counter

• Options:

– Capture correlations between branches (“global”)

– Associate predictions with branch histories, not branch addresses (use different indexing scheme)

• Gselect (global history with index selection):

[Figure: an n-bit global branch history register (BHR, bits n-1..0) holds the behavior of the last branches (shift in most recent outcome). The BHR, together with an index taken from the low order bits of the branch address, selects an entry in a table; each entry is a two-bit counter (or perhaps simpler).]

53

Gselect example

A: BEQZ R1, D
...
D: BEQZ R1, F
...
F: NOT R1,R1
G: JUMP A

initially R1 = 0; the BHR has two bits (1, 0) and starts at 00; the matrix has one row per branch (A, D) and one column per BHR value (00 01 10 11); every entry starts at 01

step       BHR before  outcome, prediction  New BHR  row A (00 01 10 11)  row D (00 01 10 11)
(initial)  00                                        01 01 01 01          01 01 01 01
A          00          T, pred N            01       10 01 01 01          01 01 01 01
D          01          T, pred N            11       10 01 01 01          01 10 01 01
A          11          N, pred N            10       10 01 01 00          01 10 01 01
D          10          N, pred N            00       10 01 01 00          01 10 00 01
A          00          T, pred T            01       11 01 01 00          01 10 00 01
D          01          T, pred T            11       11 01 01 00          01 11 00 01
A          11          N, pred N            10       11 01 01 00          01 11 00 01

(In the original figure, an underlined entry is the one updated by the last branch execution.)
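The gselect walk-through can be replayed in code. The Python below is a sketch under the same assumptions as the example (2-bit BHR starting at 00, one row of 2-bit counters per branch, all counters starting at 01); the function name and the trace encoding are illustrative.

```python
def run_gselect(trace, history_bits=2):
    """Replay (branch, taken) pairs through a gselect-style predictor.

    Returns a log of (branch, bhr_before, taken, predicted_taken) and the
    final counter table keyed by (branch, bhr).
    """
    mask = (1 << history_bits) - 1
    bhr = 0
    table = {}  # (branch, bhr) -> 2-bit counter, default 01
    log = []
    for branch, taken in trace:
        counter = table.get((branch, bhr), 1)
        log.append((branch, bhr, taken, counter >= 2))
        table[(branch, bhr)] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        bhr = ((bhr << 1) | int(taken)) & mask  # shift in most recent outcome
    return log, table


# Outcomes of A and D as R1 toggles between zero and nonzero
trace = [("A", True), ("D", True), ("A", False), ("D", False),
         ("A", True), ("D", True), ("A", False)]
log, table = run_gselect(trace)
for entry in log:
    print(entry)
```

After the first pass through each history value, the predictor gets both branches right even though each one alternates between taken and not taken.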

54

Gshare (global history with index sharing) Branch Predictor

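[Figure: the BHR (bits n-1..0, holding the behavior of the last branches) is combined by exclusive-or with an index taken from the low order bits of the branch address; the result selects an entry in a table in which each entry is a two-bit counter (or perhaps simpler)]

• Exclusive-or (hopefully) makes sure each index/BHR combination goes to a different entry in the table.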
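The gshare index computation is a single exclusive-or. The sketch below is illustrative: the function name, the 4-bit table index, and dropping the two low address bits (assuming 4-byte instructions) are assumptions.

```python
def gshare_index(pc, bhr, index_bits):
    """Table index for gshare: low-order address bits XOR global history."""
    mask = (1 << index_bits) - 1
    return ((pc >> 2) ^ bhr) & mask


# Same branch under two histories, and a neighboring branch under one history
print(gshare_index(0x40, 0b0000, 4))
print(gshare_index(0x40, 0b0101, 4))
print(gshare_index(0x44, 0b0101, 4))
```

Unlike gselect, which concatenates address bits and history bits, the XOR lets both use the full index width, at the cost of occasional collisions between different address/history pairs.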

55

Yeh/Patt predictors

[Figure: local predictor (pAg). The address of the branch selects a shift register in the history table (e.g., 1110111); that history pattern indexes the pattern table of 2-bit counters (e.g., 01).]

Hybrid Predictors (Yeh & Patt, 1993)

• Use global/local branch history to build (other) branch predictors

• G/g = Global, P/p = Per-address

[Figure: GAg: the branch outcome is shifted into a single GHR, which indexes a single PHT]

[Figure: GAp: the branch outcome is shifted into a single GHR; the PHT is selected using the PC]

[Figure: PAg: the PC selects a per-address history (labeled GBHT in the figure), which indexes a single PHT]

[Figure: PAp: the PC selects a per-address history (labeled GBHT in the figure); the PHT is also selected using the PC]

60

Yeh/Patt Example (pAg): toggle branch

A: TNTNTNTNTNTNTNTNTNTN

B: TTTTTTTTTTTTTTTTTTTTTT

PT = pattern table (one 2-bit counter per history pattern 00 01 10 11, all starting at 01); HT = history table (one 2-bit history per branch, starting at 00)

step       history used  outcome, prediction  PT (00 01 10 11)  HT A  HT B
(initial)                                     01 01 01 01       00    00
A          00            T, pred N            10 01 01 01       01    00
B          00            T, pred T            11 01 01 01       01    01
A          01            N, pred N            11 00 01 01       10    01
B          01            T, pred N            11 01 01 01       10    11
A          10            T, pred N            11 01 10 01       01    11
B          11            T, pred N            11 01 10 10       01    11
A          01            N, pred N            11 00 10 10       10    11
B          11            T, pred T            11 00 10 11       10    11
A          10            T, pred T            11 00 11 11       01    11

• PT entries 01, 10 are “trained” for A and 11 is “trained” for B

• In general: provides 96-98% accuracy for integer code
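The pAg example can be replayed with a short simulation. This Python sketch assumes the configuration of the example (2-bit per-branch histories starting at 00, a shared pattern table of 2-bit counters starting at 01, branches A and B executing alternately); the function name and trace encoding are illustrative.

```python
def run_pag(trace, history_bits=2):
    """Replay (branch, taken) pairs through a PAg-style two-level predictor.

    Returns (predictions, pattern_table, history), where history maps each
    branch to its private shift register and pattern_table is shared.
    """
    mask = (1 << history_bits) - 1
    pattern_table = [1] * (1 << history_bits)  # 2-bit counters, all 01
    history = {}
    predictions = []
    for branch, taken in trace:
        pattern = history.get(branch, 0)
        counter = pattern_table[pattern]
        predictions.append(counter >= 2)
        pattern_table[pattern] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        history[branch] = ((pattern << 1) | int(taken)) & mask
    return predictions, pattern_table, history


# A toggles (T N T N T), B is always taken; they execute alternately
trace = [("A", True), ("B", True), ("A", False), ("B", True), ("A", True),
         ("B", True), ("A", False), ("B", True), ("A", True)]
predictions, pattern_table, history = run_pag(trace)
print(predictions)
print(pattern_table, history)
```

Once trained, the toggle branch is predicted from its own history (01 then 10), which is exactly the pattern a single Smith counter cannot capture.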

61

Hybrid predictors

[Figure: the address of the branch feeds Predictor #1 (e.g., gshare), Predictor #2 (e.g., bimodal), and a Chooser (array of 2-bit counters); the chooser selects which prediction is used]

• Both predictors supply a prediction; pipeline uses only one

• Chooser updated based on which predictor was correct

– Increment chooser counter if #1 was correct, decrement if #2 was correct

Tournament Predictor: Alpha 21264

[Figure: the PC indexes a 1,024 × 10b table that feeds a 1,024 × 3b predictor; a 12b history of branch outcomes indexes a 4,096 × 2b predictor; a second 4,096 × 2b table chooses between the two predictors]

• High accuracy! SPECfp95: 0.1% misprediction rate; SPECint95: 1.15% misprediction rate

63


Branch prediction – Next PC logic

[Figure: next-PC logic. Labels: PC, I$, + 4, BTB, branch predictor, RAS, control logic (inputs: type, hit, taken), Next-PC MUX, next-PC. MUX inputs include the taken target (conditional branch, jump/call direct, jump/call indirect), “from ID: PC+offset”, and “from EX: recover PC”. Updates come from IF/ID, from ID/EX, and from EX.]

64

Control Hazards (revisited…)

• Suppose:

– Always predict not-taken

• E.g., no BTB and no dynamic branch predictor

– ID determines when, whether, where

• Not-taken: no penalty

• Taken: 1 cycle penalty

• Is there a compiler solution?

A: BEQZ R1, X

A    IF  ID  EX  MEM  WB
B        IF  -   -    -    -
X            IF  ID   EX   MEM  WB

65

Canceling branches

• Provide two types of delayed branches

– Normal delayed branch

• Delay slot instruction is always executed

– Canceling delayed branch

• Slot filled from target (assumes taken branch): slot is squashed if branch is not-taken (branch likely inst)

• Slot filled from fall-through (assumes not-taken branch): slot is squashed if branch is taken

• Having both types greatly enhances compiler’s ability to fill most slots

– Use normal branch when a safe instruction is available to fill slot (from above, target, or fall-through)

– Use canceling branch when only unsafe candidates are available to fill slot (from target or fall-through)

66

Canceling branches (cont.)

• Change encoding to add a likely bit or to add new opcodes

• If compiler thinks branch is frequently taken

– Compiler sets likely bit = 1

– Compiler fills delay slot from target

– Hardware knows to squash delay slot(s) if branch not-taken

• If compiler thinks branch is frequently not-taken

– Compiler sets likely bit = 0

– Compiler fills delay slot from fall-through

– Hardware knows to squash delay slot(s) if branch taken

• If likely bit set capriciously, most delay slot instructions must be squashed

[Figure: branch encoding with field widths in bits: Opcode (6), rs1 (5), rd (5), a 15-bit field, likely bit (1)]

67

Methods for setting likely bit

• Likely bit is essentially a static branch prediction

– Compiler makes a prediction that is fixed for that branch

– Likely bit = 1 means predict taken

– Likely bit = 0 means predict not-taken

• Static branch prediction methods (compiler branch prediction)

– Heuristics

– Profiling

68

Heuristics

• Heuristic #1: Branches are not taken

– The majority of conditional branches are not taken

– True about 60% of the time

• Heuristic #2: backward branches are taken, forward branches are not taken (BTFNT)

– Theme: Most backward branches are loops

– Notes:

• Since branches are PC relative, sign bit of offset = the prediction

• Jim Smith reports 70% accuracy for this scheme for scientific workloads

• Heuristic #3: Ball/Larus style predictions

– Set of rules to predict branches in special situations

69

Examples of Ball/Larus predictions

• Detect loops and use BTFNT

• Since error values returned by library functions are negative:

– …and since errors are rare

– Predict BLTZ, BLEZ, etc. not taken

– Predict BGTZ, BGEZ, etc., taken

• If a call is in the body of an if…then, predict the “then” branch as not-taken

– Since most calls in if…thens guard special case code

• Problem:

– Works great for SPECint92 (from which it was designed)

– My code might not work like that!

70

Profiling

• Three steps:

– 1. Run the program [the “profiled run”]

– 2. Record the average preferred direction for each branch (taken or not-taken)

– 3. Recompile to set the likely bits

• What if the program takes inputs (e.g., sort)?

– Collect a representative set of inputs somehow

• Problems:

– One prediction for entire run

– The profiled run is slow!

• Use hardware to collect predictions

• Runs at normal speed: users don’t realize it’s profiling
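Step 2 of the profiling recipe (record the preferred direction of each branch) amounts to a majority vote per branch. The Python below is a sketch; the profile format (a list of `(branch_pc, was_taken)` records), the function name, and breaking ties toward not-taken are assumptions made for illustration.

```python
from collections import Counter


def likely_bits(profile):
    """Derive one likely bit per branch from a profiled run.

    profile: iterable of (branch_pc, was_taken) records.
    Returns {branch_pc: 1 if the branch was taken more often than not, else 0}.
    """
    taken = Counter()
    total = Counter()
    for pc, was_taken in profile:
        total[pc] += 1
        if was_taken:
            taken[pc] += 1
    return {pc: 1 if 2 * taken[pc] > total[pc] else 0 for pc in total}


profile = [(0x10, True), (0x10, True), (0x10, False), (0x20, False)]
bits = likely_bits(profile)
print(bits)
```

Because each branch gets a single bit for the whole run, a branch whose behavior changes between program phases is still predicted one way, which is the "one prediction for entire run" problem noted above.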

71

Control hazards (slide 33)

• Branches throw a wrench in the cogs

– Disrupts pipeline because we don’t know what to fetch next

– Problems

• Don’t know we have a branch until decode (ID)

• Don’t know taken target until execute (EX or ID)

• Don’t know branch direction (taken/not taken) until execute (EX)

72

Handling Control Hazards: method 1 (slide 34)

• stall

NOT-TAKEN
BNE r1, r2          IF  ID  EX     MEM  WB
not-taken target        IF  stall  ID   EX  MEM  WB

TAKEN
BNE r1, r2          IF  ID  EX     MEM  WB
not-taken target        IF  stall
taken target                       IF   ID  EX  MEM  WB

Annotations: PC+4 known; branch known and PC+offset is known; direction known

73

Handling Control Hazards: method 2 (slide 35)

• predict not-taken

NOT-TAKEN
BNE r1, r2          IF  ID  EX  MEM  WB
not-taken target        IF  ID  EX   MEM  WB

TAKEN
BNE r1, r2          IF  ID  EX  MEM  WB
not-taken target        IF  ID
(next sequential)           IF
taken target                    IF   ID  EX  MEM  WB

Annotations: direction known (nt) and PC+offset known

74

Handling Control Hazards: method 3 (cont.)

• Branch prediction

• Case A:

– Correct branch prediction

– (BTB-hit) or (BTB-miss and not-taken)

BNE r1, r2        IF  ID  EX  MEM  WB
correct target        IF  ID  EX   MEM  WB

75


• Case B:

– Correct branch prediction

– BTB-miss and taken

BNE r1, r2          IF  ID  EX  MEM  WB
not-taken target        IF
taken target                IF  ID   EX  MEM  WB

76


• Case C:

– Incorrect branch prediction

– BTB-hit

– Think about the following case: predict a taken branch (incorrect prediction) but BTB miss.

BNE r1, r2          IF  ID  EX  MEM  WB
incorrect target        IF  ID
(next fetched)              IF
correct target                  IF   ID  EX  MEM  WB

77


• Compiler based approaches: delayed branch & canceling branch

– HW support

• Different opcodes for different types of branches

• Likely bit in canceling branches

– Compiler support

• Move the code

• Set the likely bit

Branch Performance

[Figure: pipeline stages IF D1 D2 D3 R E I, annotated with where the prediction is made (predict), where the target is known, and where the branch direction is known]

• Ex: BTB has 90% hit rate, prediction accuracy is 95%, 60% branches taken

• Also, any misses to BTB that stall pipe will stall pipe for 1 extra cycle to update BTB

• What is the misprediction penalty for this pipeline?

– (.90)[(.05)(2)] + (.1)[(.4)(0) + (.6)(6+1)] = (.9)(.1) + (.1)(4.2) = .09 + .42 = .51 cycles/branch

• Note: delayed branches were about .5 cycles/branch for simple pipe

• Improvement increases quickly with better prediction accuracy
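The average branch cost in this example can be recomputed in a few lines. The sketch below mirrors the slide's expression; the function and parameter names are illustrative, and the penalties (2 cycles for a mispredicted BTB hit, 6 + 1 cycles for a taken branch that misses the BTB) are taken from that expression.

```python
def branch_penalty_cycles(btb_hit_rate, accuracy, taken_frac,
                          hit_mispredict_penalty, miss_taken_penalty):
    """Average penalty cycles per branch.

    BTB hit: pay hit_mispredict_penalty only when the prediction is wrong.
    BTB miss: not-taken branches cost nothing, taken ones pay miss_taken_penalty.
    """
    hit_part = btb_hit_rate * (1 - accuracy) * hit_mispredict_penalty
    miss_part = (1 - btb_hit_rate) * taken_frac * miss_taken_penalty
    return hit_part + miss_part


# 90% BTB hit rate, 95% accuracy, 60% taken; penalties 2 and 6 + 1 (BTB update)
penalty = branch_penalty_cycles(0.90, 0.95, 0.6, 2, 6 + 1)
print(round(penalty, 2))

# Raising accuracy only shrinks the BTB-hit term
better = branch_penalty_cycles(0.90, 0.99, 0.6, 2, 6 + 1)
print(round(better, 3))
```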

Exceptions: Harder Still

• Exceptions: interrupt instruction execution unexpectedly

• Harder to handle in pipelines since there is more overlap

– may have five instructions in flight when exception is raised

• Common exceptions:

– I/O device interrupt

– OS call

– Arithmetic overflow, FP anomaly

– Page fault

– Misaligned memory access

– Memory protection violation

– Illegal instruction

– Power failure

Restartable Exceptions

• The difficulty: implementing restartable exceptions

– exceptions that must restart interrupted instruction

• arithmetic overflow

• page fault

• some I/O

• Another program must be invoked to:

– save state

– correct exceptional condition

– restore state

• Invisible to original program

– restartable exceptions needed for virtual memory

Precise Exceptions and Pipelining

• CPU has precise exceptions if faulting instruction and after can be stopped and restarted, and:

– all previous instructions executed and committed

– no later instruction committed

• Required for implementing virtual memory/demand paging

• Save PC and state at excepting instruction

– later on restore state, restart on that PC

• Force exception vector into pipe for next IF

• Turn off all writes for faulting instruction and later

– earlier instructions finish as normal

• OS saves excepting PC+register file and handles exception

– Does this always work?

Implementing Exceptions

• What if EPC is in a branch delay slot?

– can we restart the instruction in the delay slot?

• Must save EPC for faulting instruction – but restart from branch!

– obvious: save 2 PCs

• OS issues rfe (return from exception) instruction

– restarts user program and re-enters user mode

Precise Exceptions

• Required for implementing virtual memory/demand paging

– all microprocessors implement precise exceptions for integer pipes

– also required for IEEE 754 floating-point compliance

• FP execute out of order for performance

– hard to achieve precise exceptions there

• Some provide two modes: (a) performance mode, (b) precise mode

– precise mode restricts overlap – slow

– Alpha 21064/21164, MIPS R8000

• ~10x slower

Out-of-order Exceptions

lw     IF  ID  EX  MEM  WB
add        IF  ID  EX   MEM  WB

• Load can cause (among other things) page fault in MEM

• Add can cause (among other things) overflow in EX

• Two exceptions in the same cycle!

– handle page fault first

– in case of tie handle “earlier” instruction

• It gets worse…

Out-of-order Exceptions

lw     IF  ID  EX  MEM  WB
add        IF  ID  EX   MEM  WB

• Load can cause (among other things) page fault in MEM

• Add can cause (among other things) page fault in IF

– need precise integer exceptions – cannot handle this one yet

– need to wait until load completes (or causes an exception)

• Implement exception status vector, carried with each instruction

– turn off all writes on an exception

– prevent stores in MEM stage

– check vector in WB (instructions before are complete)

– handle earliest exception in program order
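The exception status vector idea can be sketched as follows: each instruction carries the exceptions recorded in earlier stages, and they are only acted on when the instruction reaches WB, which happens in program order. The Python below is an illustration; the data layout and function name are assumptions.

```python
def first_exception_at_wb(instructions):
    """Return (name, exception) for the earliest faulting instruction, or None.

    instructions: list in program order of (name, status_vector), where
    status_vector lists the exceptions recorded for that instruction in
    earlier pipeline stages (empty if none).
    """
    for name, status_vector in instructions:  # WB sees program order
        if status_vector:
            return name, status_vector[0]
    return None


# The add faults first in time (IF), but the lw is earlier in program order
in_flight = [("lw", ["page fault in MEM"]), ("add", ["page fault in IF"])]
handled = first_exception_at_wb(in_flight)
print(handled)
```

Deferring the decision to WB is what lets the add's earlier-in-time fault wait until the load has either completed or raised its own exception.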

When Does State Change?

• An instruction is committed when it is guaranteed to complete

– easy to restart if state changed only when committed

• MIPS: WB

• VAX: auto-increment mode, state updated in middle of inst, need HW support to back out, undo – “roll back” state changes

• Some architectures have string copy instructions

– updates memory – cannot undo 100%

– general-purpose registers hold all state

– instruction continues after exception rather than restart
