05 advanced pipelining computer architecture

INTRODUCTION TO ADVANCED PIPELINING

Lecture 5

Pipelined Processor: Datapath + Control

[Figure: pipelined datapath with control. The stages are separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers; the control signals shown are RegDst, ALUSrc, ALUOp, Branch, MemRead, MemWrite, MemtoReg, RegWrite, and PCSrc.]

Control Hazard on Branches: Three-Stage Stall

Four Branch Hazard Alternatives (drawn in subsequent slides)

#1: Stall until branch direction is clear
- 3 slots delay
- Better: move the decision to the 2nd stage by testing the registers there
- Saves 2 cycles
- See Fig

#2: Predict Branch Not Taken
- Execute successor instructions in sequence
- “Squash” instructions in the pipeline if the branch is actually taken
- 47% of branches are not taken on average
- PC+4 is already calculated, so use it to get the next instruction

#3: Predict Branch Taken
- 53% of branches are taken on average
- But the branch target address has not been calculated yet
- Move the branch adder to the 2nd stage
- Still incurs a 1-cycle branch penalty. Why?

#4: Dynamic Branch Prediction
- Keep a history of branches and predict accordingly
- 90% accuracy
- Employed in most CPUs

Reducing Stalls

Stall: wait until the decision is clear
- To stall the pipeline, clear the contents of the existing instructions in the pipeline: clear the contents of the IF/ID, ID/EX and EX/MEM registers.

Move the decision up to the 2nd stage by adding hardware to check the registers as they are read
- Adopted by many MIPS processors (see Fig. 6.51)
- Penalty: 1 cycle

Use Exclusive OR to compare the register outputs in the 2nd stage and enable the branch condition there, instead of waiting for the comparison by the ALU in the 3rd stage.

Flush the instruction in the IF stage by adding a control line called IF.Flush (Fig. 6.51) that zeros the IF/ID pipeline register, turning the fetched instruction into a no-operation.
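The exclusive-OR comparison can be illustrated in software. This is a minimal sketch of the logic only, not of the hardware in Fig. 6.51; the function name `registers_equal` and the 32-bit register width are assumptions made for the example.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit register width


def registers_equal(rs_value: int, rt_value: int) -> bool:
    """Return True when the two register values match bit for bit."""
    # XOR gives a 1 in every bit position where the two values differ.
    difference = (rs_value ^ rt_value) & MASK32
    # An all-zero result means the registers are equal (the beq condition).
    return difference == 0


print(registers_equal(0x1234, 0x1234))
print(registers_equal(0x1234, 0x1235))
```

Because the check is a bitwise XOR followed by a zero test, it needs no carry chain, which is why it can fit in the 2nd stage next to the register read.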

Control Hazard Solutions

Guess the branch outcome, then back up if wrong: “branch prediction”
- For example, predict not taken
- Impact: 1 clock per branch instruction if right, 2 if wrong (static: right ~50% of the time)
- More dynamic scheme: keep a history of the branch instruction (~90%)
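The impact figures above can be turned into an average cost per branch. The sketch below assumes only the numbers on this slide (1 clock if the prediction is right, 2 if wrong); the function name `clocks_per_branch` is hypothetical.

```python
def clocks_per_branch(accuracy: float, right_cost: int = 1, wrong_cost: int = 2) -> float:
    """Average clocks per branch for a given prediction accuracy."""
    return accuracy * right_cost + (1.0 - accuracy) * wrong_cost


static_cost = clocks_per_branch(0.50)   # static prediction, right ~50% of the time
dynamic_cost = clocks_per_branch(0.90)  # dynamic prediction, right ~90% of the time
print(f"static:  {static_cost:.2f} clocks per branch")
print(f"dynamic: {dynamic_cost:.2f} clocks per branch")
```

With these inputs the static scheme averages 1.50 clocks per branch and the dynamic scheme 1.10.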

[Figure: pipeline timing diagram. Instruction order (add, beq, Load) against time in clock cycles; each instruction passes through IM, Reg, ALU, DM, Reg.]

Compiler Solutions

Redefine branch behavior so that the branch takes effect after the next instruction: “delayed branch”

Impact: 1 clock cycle per branch instruction if the compiler can find an instruction to put in the “delay slot” (≥ 50% of the time)

[Figure: pipeline timing diagram for a delayed branch. Instruction order (add, beq, Misc, Load) against time in clock cycles; each instruction passes through IM, Reg, ALU, DM, Reg.]

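Example: Nondelayed vs. Delayed Branch

Nondelayed Branch

    add M1, M2, M3
    sub M4, M5, M6
    beq M1, M4, Exit
    or  M8, M9, M10
    xor M10, M1, M11
  Exit:

Delayed Branch

    add M1, M2, M3
    sub M4, M5, M6
    beq M1, M4, Exit
    or  M8, M9, M10
    xor M10, M1, M11
  Exit:

In the delayed version, or M8, M9, M10 occupies the delay slot, so it executes whether or not the branch is taken.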

Delayed Branch

Where to get instructions to fill the branch delay slot?
- From before the branch instruction
- From the target address: only valuable when the branch is taken
- From the fall-through path: only valuable when the branch is not taken

Compiler effectiveness for a single branch delay slot:
- Fills about 60% of branch delay slots
- About 80% of instructions executed in branch delay slots are useful in computation
- About 50% (60% x 80%) of slots are usefully filled
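The delay-slot arithmetic above can be checked directly. The first function uses only the two percentages from the slide. The second adds an assumption that is not in the slide: every slot that is not usefully filled wastes exactly one cycle. Both function names are hypothetical.

```python
def usefully_filled_fraction(fill_rate: float = 0.60, useful_rate: float = 0.80) -> float:
    """Fraction of delay slots that hold an instruction doing useful work."""
    return fill_rate * useful_rate


def average_branch_cycles(useful_fraction: float) -> float:
    """Average cycles per branch, ASSUMING each non-useful slot wastes one cycle."""
    return 1.0 + (1.0 - useful_fraction)


useful = usefully_filled_fraction()
print(f"usefully filled slots: {useful:.0%}")
print(f"average cycles per branch (assumed model): {average_branch_cycles(useful):.2f}")
```

The product is 48%, which the slide rounds to about 50%.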

Dynamic Branch Prediction

Performance = ƒ(accuracy, cost of misprediction)

Branch History Table (BHT): the lower bits of the PC address index a table of 1-bit values
- Says whether or not the branch was taken last time (T = taken, N = not taken)
- No full address check

Problem: in a loop, a 1-bit BHT will cause 2 mispredictions (average is 9 iterations before exit):
- End-of-loop case, when it exits instead of looping as before
- First time through the loop on the next time through the code, when it predicts exit instead of looping
- Only 77.8% accuracy if there are 9 iterations per loop on average
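The 77.8% figure can be reproduced with a small simulation. This is a sketch under stated assumptions: the loop branch is taken on every iteration except the last, the loop is re-entered many times, and the single history bit starts as "not taken". The function name `one_bit_accuracy` is hypothetical.

```python
def one_bit_accuracy(iterations_per_loop: int = 9, loop_executions: int = 100) -> float:
    """Accuracy of a 1-bit predictor on a loop-closing branch."""
    # One pass through the loop: taken (loop again) until the final exit.
    one_pass = [True] * (iterations_per_loop - 1) + [False]
    last_outcome = False  # assumed initial state of the history bit
    correct = 0
    total = 0
    for _ in range(loop_executions):
        for taken in one_pass:
            # A 1-bit predictor predicts whatever the branch did last time.
            if last_outcome == taken:
                correct += 1
            total += 1
            last_outcome = taken
    return correct / total


print(f"accuracy: {one_bit_accuracy():.1%}")
```

Each pass mispredicts the first iteration (the bit still says "exit") and the last one (the bit says "loop"), so 7 of every 9 predictions are correct.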

Better Solution: 2-bit scheme

2-bit Branch Prediction - Scheme 1

[Figure: 2-bit state diagram with states T*, T*N, N*T, and N*, and transitions labeled T (taken) and N (not taken). T* and T*N predict taken; N*T and N* predict not taken. Color legend: red = stop, not taken; green = go, taken.]

(Jim Smith, 1981)
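Scheme 1 can be sketched as a 2-bit saturating counter. The mapping of counter values 0..3 to the state names N*, N*T, T*N, T* is an assumption chosen to match the state names used in these slides, and `run_scheme1` is a hypothetical helper.

```python
STATE_NAMES = ["N*", "N*T", "T*N", "T*"]  # assumed mapping for counter values 0..3


def run_scheme1(outcomes: str, counter: int = 0):
    """Return (states, predictions) for a string of actual outcomes such as 'NNTT'."""
    states = []
    predictions = []
    for actual in outcomes:
        states.append(STATE_NAMES[counter])
        # Upper half of the counter range predicts taken.
        predictions.append("T" if counter >= 2 else "N")
        # Saturating update: count up on taken, down on not taken.
        if actual == "T":
            counter = min(counter + 1, 3)
        else:
            counter = max(counter - 1, 0)
    return states, predictions


states, predictions = run_scheme1("NNTTNNTT")
print("State:    ", " ".join(states))
print("Predicted:", " ".join(predictions))
```

Starting in N*, the outcome sequence N N T T N N T T produces the predictions N N N N T N N N.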

Branch History Table (BHT)

BHT is a table of “Predictors”
- 2-bit saturating counters
- Indexed by the PC address of the branch

In the fetch phase of a branch: the Predictor from the BHT is used to make the prediction

When the branch completes: update the corresponding Predictor

[Figure: BHT holding Predictor 0, Predictor 1, ..., Predictor 127, indexed by the branch PC; each Predictor is a 2-bit state machine with states T*, T*N, N*T, and N*.]
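A table of 128 predictors indexed by the branch PC might look like the sketch below. The class name, the word-aligned indexing (dropping the two low PC bits), and the initial counter value are assumptions for illustration; the slides only state that the table is indexed by the PC address of the branch.

```python
class BranchHistoryTable:
    """Table of 2-bit saturating counters indexed by low-order PC bits."""

    def __init__(self, entries: int = 128):
        self.entries = entries
        self.counters = [0] * entries  # assumed start: strongly not taken

    def index(self, pc: int) -> int:
        # Assumes 4-byte instructions, so the two low bits carry no information.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        """Fetch phase: read the predictor and predict taken if it is 2 or 3."""
        return self.counters[self.index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        """Branch completion: move the selected counter toward the outcome."""
        i = self.index(pc)
        if taken:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)


bht = BranchHistoryTable()
for _ in range(2):
    bht.update(0x1000, True)
print(bht.predict(0x1000))  # branch that was trained
print(bht.predict(0x1200))  # different branch that maps to the same entry
```

Because only low-order PC bits are used and there is no full address check, the two branches in the example share one predictor.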

Another Solution: a 2-bit scheme that changes the prediction (in either direction) only if it mispredicts twice

2-bit Branch Prediction - Scheme 2

[Figure: 2-bit state diagram with states T*, T*N, N*T, and N*, and transitions labeled T (taken) and N (not taken). T* and T*N predict taken; N*T and N* predict not taken. Color legend: red = stop, not taken; green = go, taken.]

Lee & A. Smith, IEEE Computer, Jan 1984
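Scheme 2 can be sketched as an explicit transition table. The transitions below are an assumption reconstructed from the rule stated above (the prediction flips only after two mispredictions, and a second misprediction from a weak state jumps to the opposite strong state); they are not copied from the original diagram. `run_scheme2` is a hypothetical helper.

```python
# (current state, actual outcome) -> next state; assumed transitions
NEXT_STATE = {
    ("N*", "N"): "N*",   ("N*", "T"): "N*T",
    ("N*T", "N"): "N*",  ("N*T", "T"): "T*",
    ("T*", "T"): "T*",   ("T*", "N"): "T*N",
    ("T*N", "T"): "T*",  ("T*N", "N"): "N*",
}


def run_scheme2(outcomes: str, state: str = "N*"):
    """Return (states, predictions) for a string of actual outcomes such as 'NNTT'."""
    states = []
    predictions = []
    for actual in outcomes:
        states.append(state)
        # The first letter of the state name is the current prediction.
        predictions.append(state[0])
        state = NEXT_STATE[(state, actual)]
    return states, predictions


states, predictions = run_scheme2("NNTTNNTT")
print("State:    ", " ".join(states))
print("Predicted:", " ".join(predictions))
```

Unlike a saturating counter, this table moves from N*T straight to T* after a second taken branch, so a weak state never steps into the opposite weak state.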

Comparison

Scheme 1

    Actual:     N   N   T   T    N    N    T   T
    State:      N*  N*  N*  N*T  T*N  N*T  N*  N*T
    Predicted:  N   N   N   N    T    N    N   N

Scheme 2

    Actual:     N   N   T   T    N    N    T   T
    State:      N*  N*  N*  N*T  T*   T*N  N*  N*T
    Predicted:  N   N   N   N    ?    ?    ?   ?

[Figure: state diagrams of the two schemes (states T*, T*N, N*T, N*), labeled 1 and 2.]

Further Comparison

Alternating taken / not-taken
- Your worst-case prediction scenario

Both schemes achieve 80-95% accuracy with only a small difference in behavior

[Figure: state diagrams of the two schemes (states T*, T*N, N*T, N*), labeled 1 and 2.]

n-bit Branch Predictor

n-bit prediction:
- Keep an n-bit saturating counter for each branch.
- Increment it on branch taken and decrement it on branch not taken.
- If the counter is greater than or equal to half its maximum value, predict the branch as taken.
- This can be done for any n, but it turns out that n = 2 performs almost as well as other values of n.
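The n-bit rule generalizes the 2-bit counter. In this minimal sketch the class name `SaturatingCounter` and the initial value of 0 are assumptions.

```python
class SaturatingCounter:
    """n-bit saturating counter used as a branch predictor."""

    def __init__(self, n_bits: int = 2, value: int = 0):
        self.max_value = (1 << n_bits) - 1
        # For integer counts, "at least half the maximum value" (max/2 = 2**(n-1) - 0.5)
        # is the same as "at least 2**(n-1)".
        self.threshold = 1 << (n_bits - 1)
        self.value = value

    def predict_taken(self) -> bool:
        return self.value >= self.threshold

    def update(self, taken: bool) -> None:
        # Increment on taken, decrement on not taken, clamped to [0, max_value].
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)


counter = SaturatingCounter(n_bits=3)
for outcome in [True, True, True, True]:
    counter.update(outcome)
print(counter.value, counter.predict_taken())
```

With n = 3 the counter ranges over 0..7 and predicts taken from 4 upward; with n = 2 it reduces to the four-state counter of Scheme 1.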

Correlating Branches

Idea: the taken/not-taken outcomes of recently executed branches are related to the behavior of the present branch (as well as to the history of that branch's own behavior)
- The behavior of recent branches then selects between, say, 4 predictions of the next branch, updating just that prediction

(2,2) predictor: 2-bit global, 2-bit local

[Figure: (2,2) predictor. The branch address (4 bits) selects a row of 2-bit per-branch local predictors; the 2-bit recent global branch history selects which one supplies the prediction (01 = not taken (0) then taken (1) on the branches before reaching this one).]
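A (2,2) predictor can be sketched as a small two-dimensional table: 4 bits of branch address pick a row, and the 2-bit global history picks one of four 2-bit counters in that row. The class name, the word-aligned address indexing, and the zero initial state are assumptions for illustration.

```python
class CorrelatingPredictor:
    """(2,2) predictor: 2 bits of global history select among 2-bit counters."""

    def __init__(self, addr_bits: int = 4, history_bits: int = 2):
        self.addr_mask = (1 << addr_bits) - 1
        self.hist_mask = (1 << history_bits) - 1
        self.history = 0  # outcomes of the most recent branches, newest in bit 0
        self.table = [[0] * (1 << history_bits) for _ in range(1 << addr_bits)]

    def _row(self, pc: int) -> int:
        return (pc >> 2) & self.addr_mask  # assumes 4-byte instructions

    def predict(self, pc: int) -> bool:
        return self.table[self._row(pc)][self.history] >= 2

    def update(self, pc: int, taken: bool) -> None:
        row = self.table[self._row(pc)]
        # Update only the counter selected by the current global history.
        count = row[self.history]
        row[self.history] = min(count + 1, 3) if taken else max(count - 1, 0)
        # Shift the new outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.hist_mask


def run_pattern(pattern, pc=0x40):
    """Return a list of booleans, one per branch, True where the prediction was right."""
    predictor = CorrelatingPredictor()
    results = []
    for taken in pattern:
        results.append(predictor.predict(pc) == taken)
        predictor.update(pc, taken)
    return results


alternating = [True, False] * 10
results = run_pattern(alternating)
print(f"{sum(results)} of {len(results)} predictions correct")
```

On a strictly alternating branch the history bits identify which phase comes next, so after a short warm-up the selected counters predict every outcome correctly, which a single 2-bit counter cannot do.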

Accuracy of Different Schemes

Floating Point Arithmetic Pipeline

Pipelined arithmetic units are usually found in very high-speed computers

They are used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems

Example for floating-point addition and subtraction
- Inputs are two normalized floating-point binary numbers
    X = A x 2^a
    Y = B x 2^b
- A and B are two fractions that represent the mantissas
- a and b are the exponents

The operation proceeds in four segments:
- Compare the exponents
- Align the mantissas
- Add or subtract the mantissas
- Normalize the result

Floating Point Arithmetic Pipeline

X = 0.9504 x 10^3 and Y = 0.8200 x 10^2
- The two exponents are subtracted in the first segment to obtain 3 - 2 = 1
- The larger exponent, 3, is chosen as the exponent of the result
- Segment 2 shifts the mantissa of Y to the right to obtain Y = 0.0820 x 10^3
- The mantissas are now aligned
- Segment 3 produces the sum Z = 1.0324 x 10^3
- Segment 4 normalizes the result by shifting the mantissa once to the right and incrementing the exponent by one to obtain Z = 0.10324 x 10^4
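The four segments can be traced in code on the decimal example above. This sketch uses Python's `decimal` module so the mantissas stay exact. The function name `fp_add` and the normalization range (mantissa magnitude at least 0.1 and below 1) are assumptions, and the four steps run sequentially here rather than as overlapped pipeline segments.

```python
from decimal import Decimal


def fp_add(a_mantissa: Decimal, a_exp: int, b_mantissa: Decimal, b_exp: int):
    """Add two decimal floating-point numbers given as (mantissa, exponent) pairs."""
    # Segment 1: compare the exponents and keep the larger one.
    difference = a_exp - b_exp
    exponent = max(a_exp, b_exp)

    # Segment 2: align by shifting the mantissa with the smaller exponent right.
    if difference >= 0:
        b_mantissa = b_mantissa.scaleb(-difference)
    else:
        a_mantissa = a_mantissa.scaleb(difference)

    # Segment 3: add the aligned mantissas.
    total = a_mantissa + b_mantissa

    # Segment 4: normalize so that 0.1 <= |mantissa| < 1 (assumed convention).
    while abs(total) >= 1:
        total = total.scaleb(-1)
        exponent += 1
    while total != 0 and abs(total) < Decimal("0.1"):
        total = total.scaleb(1)
        exponent -= 1
    return total, exponent


mantissa, exponent = fp_add(Decimal("0.9504"), 3, Decimal("0.8200"), 2)
print(f"Z = {mantissa} x 10^{exponent}")
```

In a real pipeline each segment would work on a different pair of operands in the same clock cycle, with latches between the segments.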

Case Study: MIPS R4000 Pipeline

8 Stage Pipeline:

IF - First half of instruction fetch: PC selection and initiation of the instruction cache access
IS - Second half of instruction fetch: access to the instruction cache
RF - Instruction decode, register fetch, hazard checking, and instruction cache hit detection (tag check)
EX - Execution: effective address calculation, ALU operation, branch target computation and condition evaluation
DF - First half of access to data cache
DS - Second half of access to data cache
TC - Tag check for data cache hit
WB - Write back for loads and register-register operations

The Pipeline Structure of the R4000

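[Figure: R4000 pipeline structure. Stages IF, IS, RF, EX, DF, DS, TC, WB, annotated with the points where the instruction is available, where the load data is available, and where the tag check takes place.]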

Case Study: MIPS R4000 Load Latency

2 Cycle Load Latency

[Figure: pipeline timing diagram with the rows below, annotated with the point where the load data is available with forwarding, the EX stage where the load data is needed, and 2 stall cycles.]

    LD R1, X          IF IS RF EX DF DS TC WB
                      IF IS RF EX DF DS . . .
    ADD R3, R1, R2    IF IS RF EX DF DS TC WB
                      IF IS RF EX DF . . .
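The 2-cycle figure can be derived from stage positions. The sketch below assumes that forwarded load data becomes usable at the end of DS and that the dependent instruction needs it at the start of its EX stage; the function name `load_use_stalls` and its parameters are hypothetical.

```python
STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]


def load_use_stalls(distance: int = 1, available_after: str = "DS", needed_in: str = "EX") -> int:
    """Stall cycles for an instruction issued `distance` cycles after a load it depends on."""
    # First cycle (relative to the load's IF) in which the forwarded data can be used.
    ready_cycle = STAGES.index(available_after) + 1
    # Cycle in which the dependent instruction would reach EX without stalling.
    need_cycle = distance + STAGES.index(needed_in)
    return max(0, ready_cycle - need_cycle)


for distance in (1, 2, 3):
    print(f"distance {distance}: {load_use_stalls(distance)} stall cycle(s)")
```

Under these assumptions an instruction that immediately follows the load stalls for 2 cycles, one that is two instructions behind stalls for 1, and anything further back does not stall.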

Extending DLX to Handle Floating Point Operations

[Figure: IF and ID feed four parallel execution units (Integer Unit (EX), FP/integer multiply, FP Adder, FP Divider), followed by MEM and WB.]
