05 advanced pipelining computer architecture
TRANSCRIPT
![Page 1: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/1.jpg)
INTRODUCTION TO ADVANCED PIPELINING
Lecture 5
![Page 2: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/2.jpg)
Pipelined Processor: Datapath + Control
[Figure: pipelined datapath with control. Components: PC, instruction memory (Imem), register file (Regs), sign extend, shift left 2, ALU and ALU control, data memory (Dmem), adders and muxes. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB, which also carry the EX, M and WB control fields. Control signals: RegDst, ALUOp, ALUSrc, Branch, MemRead, MemWrite, RegWrite, MemToReg, PCSrc.]
![Page 3: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/3.jpg)
Control Hazard on Branches: Three Stage Stall
![Page 4: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/4.jpg)
Four Branch Hazard Alternatives (drawn in subsequent slides)
#1: Stall until branch direction is clear
- 3 slots of delay
- Better: move the decision to the 2nd stage by testing the registers there, which saves 2 cycles (See Fig)
#2: Predict Branch Not Taken
- Execute successor instructions in sequence
- “Squash” instructions in the pipeline if the branch is actually taken
- 47% of branches are not taken on average
- PC+4 is already calculated, so use it to get the next instruction
#3: Predict Branch Taken
- 53% of branches are taken on average
- But the branch target address has not been calculated yet
- Move the branch adder to the 2nd stage
- Still incurs a 1-cycle branch penalty. Why?
#4: Dynamic Branch Prediction
- Keep a history of branches and predict accordingly
- About 90% accuracy; employed in most CPUs
![Page 5: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/5.jpg)
Reducing Stalls
Stall: wait until the decision is clear
- To stall the pipeline, cancel the instructions that entered it behind the branch by clearing the contents of the IF/ID, ID/EX and EX/MEM registers.
Move the decision up to the 2nd stage by adding hardware that checks the registers as they are being read
- Adopted by many MIPS processors (see Fig. 6.51)
- Penalty: 1 cycle
- Use an exclusive OR to compare the register outputs in the 2nd stage and resolve the branch condition there, instead of waiting for the comparison by the ALU in the 3rd stage.
- Flush the instruction in the IF stage by adding a control line called IF.Flush (Fig. 6.51) that zeros the IF/ID pipeline register, turning its contents into a no-operation.
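The exclusive-OR comparison mentioned above can be sketched in a few lines. This is only an illustration of the idea (two values are equal exactly when their bitwise XOR is all zeros); the function name `branch_equal` and the 32-bit default width are assumptions, not part of the slides.

```python
def branch_equal(rs_value: int, rt_value: int, width: int = 32) -> bool:
    """Model of an ID-stage equality check for beq (hypothetical helper).

    XOR yields a 1 in every bit position where the operands differ,
    so the registers are equal exactly when the XOR result is zero.
    """
    mask = (1 << width) - 1              # keep only `width` bits, like a register
    diff = (rs_value ^ rt_value) & mask  # bitwise difference of the two operands
    return diff == 0                     # hardware would NOR all the diff bits
```

For example, `branch_equal(5, 5)` is true and `branch_equal(5, 7)` is false, without involving a subtract in the ALU.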
![Page 6: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/6.jpg)
Control Hazard Solutions
Guess the branch outcome, then back up if the guess was wrong: “branch prediction”
- For example, predict not taken
- Impact: 1 clock per branch instruction if right, 2 if wrong (static: right ~50% of the time)
- More dynamic scheme: keep a history of the branch instruction (right ~90% of the time)
[Figure: pipeline timing diagram. Horizontal axis: time (clock cycles); vertical axis: instruction order. The instructions add, beq and load each pass through IM, Reg, ALU, DM, Reg.]
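The impact figures on this slide combine into an average cost per branch. The sketch below assumes the slide's costs (1 clock when the prediction is right, 2 when it is wrong); the function name and parameter names are illustrative.

```python
def clocks_per_branch(p_correct: float,
                      right_cost: float = 1.0,
                      wrong_cost: float = 2.0) -> float:
    """Expected clocks per branch for a predictor that is right with
    probability p_correct (costs taken from the slide: 1 if right, 2 if wrong)."""
    return p_correct * right_cost + (1.0 - p_correct) * wrong_cost
```

With the slide's numbers, a static guess that is right about half the time averages about 1.5 clocks per branch, while a dynamic scheme that is right about 90% of the time averages about 1.1.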
![Page 7: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/7.jpg)
Compiler Solutions
Redefine branch behavior so that the branch takes effect after the next instruction: “delayed branch”
- Impact: 1 clock cycle per branch instruction if the compiler can find an instruction to put in the “delay slot” (≥ 50% of the time)
[Figure: pipeline timing diagram for a delayed branch. Horizontal axis: time (clock cycles); vertical axis: instruction order. The instructions add, beq, misc and load each pass through IM, Reg, ALU, DM, Reg.]
![Page 8: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/8.jpg)
Example Nondelayed vs. Delayed Branch
Nondelayed Branch
    add M1, M2, M3
    sub M4, M5, M6
    beq M1, M4, Exit
    or  M8, M9, M10
    xor M10, M1, M11
Exit:
Delayed Branch
    add M1, M2, M3
    sub M4, M5, M6
    beq M1, M4, Exit
    or  M8, M9, M10
    xor M10, M1, M11
Exit:
![Page 9: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/9.jpg)
Delayed Branch
Where to get instructions to fill the branch delay slot?
- From before the branch instruction
- From the target address: only valuable when the branch is taken
- From the fall-through path: only valuable when the branch is not taken
Compiler effectiveness for a single branch delay slot:
- Fills about 60% of branch delay slots
- About 80% of instructions executed in branch delay slots are useful in computation
- About 50% (60% x 80%) of slots are usefully filled
![Page 10: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/10.jpg)
Dynamic Branch Prediction
Performance = ƒ(accuracy, cost of misprediction)
Branch History Table (BHT): the lower bits of the PC address index a table of 1-bit values
- Each entry says whether or not the branch was taken last time (T = taken, N = not taken)
- No full address check
Problem: in a loop, a 1-bit BHT causes 2 mispredictions (the average is 9 iterations before exit):
- At the end of the loop, when it exits instead of looping as before
- The first time through the loop on the next pass through the code, when it predicts exit instead of looping
- Only 77.8% accuracy if there are 9 iterations per loop on average
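The 77.8% figure can be reproduced with a small simulation. The sketch below is a minimal model, not a description of any particular CPU: the class name `OneBitBHT`, the 16-entry table size, and the branch address `0x40` are assumptions chosen for illustration.

```python
class OneBitBHT:
    """Branch history table of 1-bit entries indexed by the low PC bits."""

    def __init__(self, entries: int = 16):
        # `entries` is assumed to be a power of two so that masking gives an index.
        self.mask = entries - 1
        self.table = [False] * entries  # False = not taken last time

    def _index(self, pc: int) -> int:
        # Drop the two byte-offset bits of a word-aligned PC, keep the low bits.
        return (pc >> 2) & self.mask

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)]

    def update(self, pc: int, taken: bool) -> None:
        self.table[self._index(pc)] = taken


def loop_accuracy(iterations: int = 9, visits: int = 100) -> float:
    """Accuracy on a loop-closing branch that runs `iterations` times per visit."""
    bht = OneBitBHT()
    pc = 0x40  # hypothetical address of the loop-closing branch
    correct = total = 0
    for _ in range(visits):
        for i in range(iterations):
            taken = i < iterations - 1  # taken on every pass except the exit
            correct += bht.predict(pc) == taken
            bht.update(pc, taken)
            total += 1
    return correct / total
```

Each visit to the loop mispredicts the first pass (the entry still says "exit") and the last pass (the entry says "loop"), so 7 of every 9 predictions are right, which is the 77.8% on the slide.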
![Page 11: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/11.jpg)
Better Solution: 2-bit scheme
2-bit Branch Prediction - Scheme 1
[Figure: four-state diagram with two "Predict Taken" states (T*, T*N; green: go, taken) and two "Predict Not Taken" states (N*T, N*; red: stop, not taken). Edges are labeled T (taken) and N (not taken).]
(Jim Smith, 1981)
![Page 12: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/12.jpg)
Branch History Table (BHT)
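BHT is a table of “Predictors”
- 2-bit, saturating counters
- Indexed by the PC address of the branch
In the Fetch phase of a branch:
- The predictor from the BHT is used to make the prediction
When the branch completes:
- Update the corresponding predictor
[Figure: the branch PC indexes a table of 128 predictors (Predictor 0, Predictor 1, ..., Predictor 127). Each predictor is the four-state machine with states T*, T*N, N*T, N* and transitions on T and N.]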
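The table described above can be sketched as follows. The 128 entries come from the figure; the index function (dropping two byte-offset bits), the initial counter value of 0, and the names `TwoBitBHT` and `count_correct` are assumptions made for the example.

```python
class TwoBitBHT:
    """BHT of 2-bit saturating counters (0..3); predict taken when counter >= 2."""

    def __init__(self, entries: int = 128):
        self.mask = entries - 1          # entries assumed to be a power of two
        self.counters = [0] * entries    # 0 = strongly not taken (assumed start)

    def _index(self, pc: int) -> int:
        return (pc >> 2) & self.mask     # low PC bits, no full address check

    def predict(self, pc: int) -> bool:
        # Used in the fetch phase of the branch.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        # Called when the branch completes: count up on taken, down on not taken.
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(c + 1, 3) if taken else max(c - 1, 0)


def count_correct(bht: TwoBitBHT, pc: int, outcomes: list) -> int:
    """Predict then update for each outcome; return the number of correct predictions."""
    correct = 0
    for taken in outcomes:
        correct += bht.predict(pc) == taken
        bht.update(pc, taken)
    return correct
```

On the 9-iteration loop from the earlier slide (`[True] * 8 + [False]` per visit), this model mispredicts only the loop exit once it has warmed up, so it gets 8 of 9 right per visit instead of the 7 of 9 that a 1-bit entry achieves.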
![Page 13: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/13.jpg)
Another Solution: a 2-bit scheme that changes its prediction (in either direction) only after mispredicting twice
2-bit Branch Prediction - Scheme 2
[Figure: four-state diagram with two "Predict Taken" states (T*, T*N; green: go, taken) and two "Predict Not Taken" states (N*T, N*; red: stop, not taken). Edges are labeled T (taken) and N (not taken).]
Lee & A. Smith, IEEE Computer, Jan 1984
![Page 14: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/14.jpg)
Comparison
Scheme 2
Actual:    N   N   T   T    N    N    T   T
State:     N*  N*  N*  N*T  T*   T*N  N*  N*T
Predicted: N   N   N   N    ?    ?    ?   ?
Scheme 1
Actual:    N   N   T   T    N    N    T   T
State:     N*  N*  N*  N*T  T*N  N*T  N*  N*T
Predicted: N   N   N   N    T    N    N   N
[Figure: state diagrams for Scheme 2 and Scheme 1, each with states T*, T*N, N*T, N* and transitions on T and N.]
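The two state sequences on this slide can be replayed in code. The transition tables below are an assumption: they are inferred from the state rows on the slide (Scheme 1 as a saturating counter, Scheme 2 jumping from a weak state straight to the opposite strong state), since the diagrams themselves did not survive extraction. The names `SCHEME_1`, `SCHEME_2` and `run` are illustrative.

```python
# States are numbered 0..3: 0 = N*, 1 = N*T, 2 = T*N, 3 = T*.
STATE_NAMES = ["N*", "N*T", "T*N", "T*"]

# Each entry is (next state on N, next state on T) for states 0..3.
SCHEME_1 = [(0, 1), (0, 2), (1, 3), (2, 3)]  # saturating counter: one step at a time
SCHEME_2 = [(0, 1), (0, 3), (0, 3), (2, 3)]  # weak states jump to a strong state


def run(scheme, actual: str, state: int = 0):
    """Replay a string of 'T'/'N' outcomes.

    Returns (predictions, states), where states[i] is the state in which
    branch i was predicted. States 2 and 3 predict taken.
    """
    predicted, states = [], []
    for outcome in actual:
        states.append(STATE_NAMES[state])
        predicted.append("T" if state >= 2 else "N")
        state = scheme[state][1 if outcome == "T" else 0]
    return "".join(predicted), states
```

Under these assumed tables, `run(SCHEME_1, "NNTTNNTT")` reproduces the prediction row `N N N N T N N N`, and `run(SCHEME_2, "NNTTNNTT")` yields `N N N N T T N N`, which is one way to fill in the row left as question marks.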
![Page 15: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/15.jpg)
Further Comparison
Alternating taken / not-taken
- Your worst-case prediction scenario
Both schemes achieve 80-95% accuracy with only a small difference in behavior
[Figure: state diagrams for Scheme 1 and Scheme 2, each with states T*, T*N, N*T, N* and transitions on T and N.]
![Page 16: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/16.jpg)
n-bit Branch Predictor
n-bit prediction:
- Keep an n-bit saturating counter for each branch.
- Increment it on branch taken and decrement it on branch not taken.
- If the counter is greater than or equal to half its maximum value, predict the branch as taken.
This can be done for any n, but it turns out that n = 2 performs almost as well as other values of n.
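The n-bit rule above generalizes the 2-bit counter directly. In this sketch the class name `NBitCounter` and the starting value of 0 are assumptions; the threshold `2**(n-1)` is the smallest integer that is at least half of the maximum value `2**n - 1`.

```python
class NBitCounter:
    """n-bit saturating counter for one branch (illustrative sketch)."""

    def __init__(self, n: int = 2, value: int = 0):
        self.max_value = (1 << n) - 1   # largest value an n-bit counter can hold
        self.threshold = 1 << (n - 1)   # smallest integer >= max_value / 2
        self.value = value

    def predict(self) -> bool:
        # Predict taken when the counter is at least half its maximum value.
        return self.value >= self.threshold

    def update(self, taken: bool) -> None:
        # Increment on taken, decrement on not taken, saturating at both ends.
        if taken:
            self.value = min(self.value + 1, self.max_value)
        else:
            self.value = max(self.value - 1, 0)
```

With `n = 1` this reduces to "predict whatever happened last time", and with `n = 2` it is the saturating counter used in the BHT slides.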
![Page 17: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/17.jpg)
Correlating Branches
Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the present branch (as well as to the history of that branch's own behavior)
- The behavior of recent branches then selects between, say, 4 predictions of the next branch, updating just that prediction
(2,2) predictor: 2-bit global, 2-bit local
[Figure: the branch address (4 bits) selects a row of 2-bit per-branch local predictors; the 2-bit recent global branch history selects which prediction in that row is used (01 = not taken (0), then taken (1), for the branches before reaching this one).]
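A (2,2) predictor of the kind drawn above can be sketched as a small two-dimensional table. The 4 address bits and 2 history bits come from the figure; the class and function names, the zero-initialized counters, and the choice to drop two byte-offset bits of the PC are assumptions for illustration.

```python
class CorrelatingPredictor:
    """(2,2)-style predictor: global history picks one 2-bit counter per branch row."""

    def __init__(self, history_bits: int = 2, address_bits: int = 4):
        self.history_mask = (1 << history_bits) - 1
        self.address_mask = (1 << address_bits) - 1
        self.history = 0  # most recent outcome in the low bit (1 = taken)
        self.counters = [[0] * (1 << history_bits)
                         for _ in range(1 << address_bits)]

    def _row(self, pc: int) -> list:
        return self.counters[(pc >> 2) & self.address_mask]

    def predict(self, pc: int) -> bool:
        return self._row(pc)[self.history] >= 2

    def update(self, pc: int, taken: bool) -> None:
        # Update only the counter selected by the current global history.
        row = self._row(pc)
        c = row[self.history]
        row[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the new outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def count_correct(predictor: CorrelatingPredictor, pc: int, outcomes: list) -> int:
    """Predict then update for each outcome; return the number of correct predictions."""
    correct = 0
    for taken in outcomes:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct
```

One consequence of using history to select the counter: on a strictly alternating taken / not-taken branch, this model becomes fully accurate after a short warm-up, because each history pattern consistently precedes the same outcome.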
![Page 18: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/18.jpg)
Accuracy of Different Schemes
![Page 19: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/19.jpg)
Floating Point Arithmetic Pipeline
Pipeline arithmetic units are usually found in very high speed computers
They are used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems
![Page 20: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/20.jpg)
Floating Point Arithmetic Pipeline
Example for floating-point addition and subtraction
Inputs are two normalized floating-point binary numbers:
X = A x 2^a
Y = B x 2^b
A and B are two fractions that represent the mantissas
a and b are the exponents
![Page 21: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/21.jpg)
Floating Point Arithmetic Pipeline
1. Compare the exponents
2. Align the mantissas
3. Add or subtract the mantissas
4. Normalize the result
![Page 22: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/22.jpg)
Floating Point Arithmetic Pipeline
X = 0.9504 x 10^3 and Y = 0.8200 x 10^2
- The two exponents are subtracted in the first segment to obtain 3 - 2 = 1
- The larger exponent, 3, is chosen as the exponent of the result
- Segment 2 shifts the mantissa of Y to the right to obtain Y = 0.0820 x 10^3
- The mantissas are now aligned
- Segment 3 produces the sum Z = 1.0324 x 10^3
- Segment 4 normalizes the result by shifting the mantissa once to the right and incrementing the exponent by one to obtain Z = 0.10324 x 10^4
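The four segments can be traced in code on the slide's numbers. This is a sketch under stated assumptions: it works in base 10 with `Decimal` mantissas to mirror the worked example (real hardware uses binary), it runs the segments sequentially rather than overlapping them, and the function name `fp_add` is made up for the illustration.

```python
from decimal import Decimal


def fp_add(a: Decimal, ea: int, b: Decimal, eb: int, subtract: bool = False):
    """Add (or subtract) a x 10^ea and b x 10^eb, one pipeline segment at a time."""
    # Segment 1: compare the exponents and choose the larger one for the result.
    diff = ea - eb
    exponent = max(ea, eb)

    # Segment 2: align the mantissas by shifting the one with the
    # smaller exponent to the right by |diff| digits.
    if diff > 0:
        b = b.scaleb(-diff)
    elif diff < 0:
        a = a.scaleb(diff)

    # Segment 3: add or subtract the aligned mantissas.
    mantissa = a - b if subtract else a + b

    # Segment 4: normalize so that 0.1 <= |mantissa| < 1 (or mantissa == 0).
    while abs(mantissa) >= 1:
        mantissa = mantissa.scaleb(-1)   # shift right, increment exponent
        exponent += 1
    while mantissa != 0 and abs(mantissa) < Decimal("0.1"):
        mantissa = mantissa.scaleb(1)    # shift left, decrement exponent
        exponent -= 1
    return mantissa, exponent
```

Calling `fp_add(Decimal("0.9504"), 3, Decimal("0.8200"), 2)` passes through the intermediate values on the slide (aligned Y = 0.0820, sum 1.0324) and returns a mantissa equal to 0.10324 with exponent 4.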
![Page 23: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/23.jpg)
Case Study: MIPS R4000 Pipeline
8 Stage Pipeline:
IF - First half of instruction fetch: PC selection, initiation of instruction cache access
IS - Second half of instruction fetch: access to instruction cache
RF - Instruction decode, register fetch, hazard checking, and instruction cache hit detection (tag check)
EX - Execution: effective address calculation, ALU operation, branch target computation and condition evaluation
DF - First half of access to data cache
DS - Second half of access to data cache
TC - Tag check for data cache hit
WB - Write back for loads and register-register operations
![Page 24: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/24.jpg)
The Pipeline Structure of the R4000
[Figure: R4000 pipeline stages IF IS RF EX DF DS TC WB drawn over the REG, ALU, data memory and REG blocks, with markers for "instruction is available", "load data available" and "tag check".]
![Page 25: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/25.jpg)
Case Study: MIPS R4000 Load Latency
2 Cycle Load Latency
LD  R1, X         IF IS RF EX DF DS TC WB
                     IF IS RF EX DF DS . . .
ADD R3, R1, R2          IF IS RF EX DF DS TC WB
                           IF IS RF EX DF . . .
[Figure annotations: load data available with forwarding; load data needed (at EX); 2 stall cycles.]
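The 2-cycle load latency can be checked with simple cycle arithmetic. The sketch assumes that load data is forwarded at the end of DS and is needed at the start of the consumer's EX stage, with one instruction issued per cycle; the function name `load_use_stalls` and the `distance` parameter are illustrative.

```python
R4000_STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]


def load_use_stalls(distance: int) -> int:
    """Stall cycles for an instruction that uses a load result.

    distance = 1 means the consumer immediately follows the load.
    Assumes the load enters IF in cycle 0, so it occupies stage k in cycle k.
    """
    ex = R4000_STAGES.index("EX")
    ds = R4000_STAGES.index("DS")
    data_ready_cycle = ds + 1          # forwarded value usable after DS completes
    consumer_ex_cycle = distance + ex  # cycle in which the consumer would enter EX
    return max(0, data_ready_cycle - consumer_ex_cycle)
```

Under these assumptions an instruction that directly follows the load stalls for 2 cycles, one that is two slots behind stalls for 1, and one that is three or more slots behind does not stall.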
![Page 26: 05 advanced pipelining computer architecture](https://reader031.vdocuments.us/reader031/viewer/2022020110/55a68f3e1a28abcc7d8b48d3/html5/thumbnails/26.jpg)
Extending DLX to Handle Floating Point Operations
[Figure: DLX pipeline with multiple execution units. IF and ID feed one of four parallel units (Integer Unit (EX), FP/integer multiplier, FP adder, FP divider), followed by MEM and WB.]