CS 152: Computer Architecture and Engineering, Lecture 13: Advanced Pipelining. Randy H. Katz, Instructor; Satrajit Chatterjee and George Porter, Teaching Assistants.
CS 152Lec 13.1
CS 152: Computer Architecture
and Engineering
Lecture 13
Advanced Pipelining
Randy H. Katz, Instructor
Satrajit Chatterjee, Teaching Assistant
George Porter, Teaching Assistant
CS 152Lec 13.2
Recap: Summary of Pipelining Basics
Five Stages:
• Fetch: Fetch instruction from memory
• Decode: get register values and decode control information
• Execute: Execute arithmetic operations/calculate addresses
• Memory: Do memory ops (load or store)
• Writeback: Write results back to registers (i.e., commit)
Pipelines pass control information down the pipe just as data moves down pipe
Forwarding/Stalls handled by local control
Balancing length of instructions makes pipelining much smoother
Increasing length of pipe increases impact of hazards; pipelining helps instruction bandwidth, not latency
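The bandwidth-versus-latency point can be checked with a little arithmetic. The sketch below is an illustration only: it assumes an ideal pipeline with one-cycle stages and no hazards, and the function names are made up for this example.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles when each instruction finishes all stages before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles for an ideal pipeline: fill the pipe once, then one result per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    # Latency of a single instruction is unchanged; throughput improves with n.
    for n in (1, 3, 1000):
        print(n, unpipelined_cycles(n, 5), pipelined_cycles(n, 5))
```

A single instruction still needs all five cycles either way; only the rate at which instructions complete improves.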
CS 152Lec 13.3
Recap: Can Pipelining Get Us into Trouble? Yes: Pipeline Hazards
• Structural hazards: attempt to use the same resource two different ways at the same time
- E.g., combined washer/dryer would be a structural hazard or folder busy doing something else (watching TV)
• Data hazards: attempt to use item before it is ready
- E.g., one sock of pair in dryer and one in washer; can’t fold until get sock from washer through dryer
- Instruction depends on result of prior instruction still in the pipeline
• Control hazards: attempt to make a decision before condition is evaluated
- E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in
- Branch instructions
Can always resolve hazards by waiting
• Pipeline control must detect the hazard
• Take action (or delay action) to resolve hazards
CS 152Lec 13.4
Pipelining the Load Instruction
The five independent functional units in the pipeline datapath are:
• Instruction Memory for the Ifetch stage
• Register File’s Read ports (bus A and busB) for the Reg/Dec stage
• ALU for the Exec stage
• Data Memory for the Mem stage
• Register File’s Write port (bus W) for the Wr stage
Clock    Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw   Ifetch   Reg/Dec  Exec     Mem      Wr
2nd lw            Ifetch   Reg/Dec  Exec     Mem      Wr
3rd lw                     Ifetch   Reg/Dec  Exec     Mem      Wr
CS 152Lec 13.5
The Four Stages of R-type
Ifetch: Instruction Fetch
• Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec:
• ALU operates on the two register operands
• Update PC
Wr: Write the ALU output back to the register file
         Cycle 1  Cycle 2  Cycle 3  Cycle 4
R-type   Ifetch   Reg/Dec  Exec     Wr
CS 152Lec 13.6
Pipelining the R-type and Load Instruction
We have a pipeline conflict or structural hazard:
• Two instructions try to write to the register file at the same time!
• Only one write port
Clock    Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type   Ifetch   Reg/Dec  Exec     Wr
R-type            Ifetch   Reg/Dec  Exec     Wr
Load                       Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                              Ifetch   Reg/Dec  Exec     Wr
R-type                                       Ifetch   Reg/Dec  Exec     Wr
Oops! We have a problem! (The Load and the following R-type both reach Wr in Cycle 7.)
CS 152Lec 13.7
Important Observation
Each functional unit can only be used once per instruction
Each functional unit must be used at the same stage for all instructions:
• Load uses Register File’s Write Port during its 5th stage
• R-type uses Register File’s Write Port during its 4th stage
         1       2        3     4    5
Load     Ifetch  Reg/Dec  Exec  Mem  Wr
R-type   Ifetch  Reg/Dec  Exec  Wr
2 ways to solve this pipeline hazard.
2 ways to solve this pipeline hazard.
CS 152Lec 13.8
Solution 1: Insert “Bubble” into the Pipeline
Insert a “bubble” into the pipeline to prevent 2 writes in the same cycle
• The control logic can be complex.
• Lose instruction fetch and issue opportunity.
No instruction is started in Cycle 6!
[Timing diagram, Cycles 1 to 9: the sequence R-type, R-type, Load, R-type, R-type with a pipeline bubble inserted behind the Load.]
CS 152Lec 13.9
Solution 2: Delay R-type’s Write by One Cycle
Delay R-type’s register write by one cycle:
• Now R-type instructions also use Reg File’s write port at Stage 5
• Mem stage is a NOOP stage: nothing is being done.
         1       2        3     4    5
R-type   Ifetch  Reg/Dec  Exec  Mem  Wr
Clock    Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type   Ifetch   Reg/Dec  Exec     Mem      Wr
R-type            Ifetch   Reg/Dec  Exec     Mem      Wr
Load                       Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                              Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                                       Ifetch   Reg/Dec  Exec     Mem      Wr
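The write-port conflict and the effect of padding R-type to five stages can be reproduced with a small model. This is only a sketch: it assumes one instruction is issued per cycle starting at Cycle 1, and the names `FOUR_STAGE_RTYPE`, `UNIFORM_FIVE_STAGE`, and `write_port_conflicts` are invented for the illustration.

```python
from collections import defaultdict

# Stage sequences per instruction kind (assumed model of the slides' timing).
FOUR_STAGE_RTYPE = {
    "rtype": ["Ifetch", "Reg/Dec", "Exec", "Wr"],
    "load": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
}
UNIFORM_FIVE_STAGE = {
    "rtype": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],  # Mem is a NOOP stage
    "load": ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"],
}


def write_port_conflicts(program, stages):
    """Return {cycle: [instruction indices]} for cycles with more than one Wr."""
    writers = defaultdict(list)
    for index, kind in enumerate(program):
        start_cycle = index + 1  # one instruction issued per cycle
        wr_cycle = start_cycle + stages[kind].index("Wr")
        writers[wr_cycle].append(index)
    return {cycle: insts for cycle, insts in writers.items() if len(insts) > 1}


if __name__ == "__main__":
    program = ["rtype", "rtype", "load", "rtype", "rtype"]
    print(write_port_conflicts(program, FOUR_STAGE_RTYPE))
    print(write_port_conflicts(program, UNIFORM_FIVE_STAGE))
```

With the four-stage R-type, the Load and the R-type behind it collide on the write port; with every instruction writing in stage 5, no cycle has two writers.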
CS 152Lec 13.10
Modified Control & Datapath
IR <- Mem[PC]; PC <- PC+4;
A <- R[rs]; B <- R[rt]
R-type: S <- A + B;    M <- S;        R[rd] <- M;
Load:   S <- A + SX;   M <- Mem[S];   R[rd] <- M;
ORi:    S <- A or ZX;  M <- S;        R[rt] <- M;
Store:  S <- A + SX;   Mem[S] <- B
Beq:    if Cond PC <- PC+SX;
[Datapath figure: Next PC, PC, Inst. Mem, IR, Reg. File, A, B, Exec, Equal, S, Mem Access / Data Mem, M, D, Reg File.]
CS 152Lec 13.11
The Four Stages of Store
Ifetch: Instruction Fetch
• Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Write the data into the Data Memory
         Cycle 1  Cycle 2  Cycle 3  Cycle 4
Store    Ifetch   Reg/Dec  Exec     Mem      (Wr unused)
CS 152Lec 13.12
The Three Stages of Beq
Ifetch: Instruction Fetch
• Fetch the instruction from the Instruction Memory
Reg/Dec:
• Registers Fetch and Instruction Decode
Exec:
• Compare the two register operands
• Select correct branch target address
• Latch into PC
         Cycle 1  Cycle 2  Cycle 3  Cycle 4
Beq      Ifetch   Reg/Dec  Exec     (Mem and Wr unused)
CS 152Lec 13.13
Control Diagram
IR <- Mem[PC]; PC <- PC+4;
A <- R[rs]; B <- R[rt]
R-type: S <- A + B;    R[rd] <- S;
Load:   S <- A + SX;   M <- Mem[S];   R[rd] <- M;
ORi:    S <- A or ZX;  R[rt] <- S;
Store:  S <- A + SX;   Mem[S] <- B
Beq:    if Cond PC <- PC+SX;
[Datapath figure: Next PC, PC, Inst. Mem, IR, Reg. File, A, B, Exec, Equal, S, Mem Access / Data Mem, M, D, Reg File; the figure also shows two M <- S transfers.]
CS 152Lec 13.14
Administrivia
Get started on Lab 5: Pipelining is difficult to get right!
Be sure that we will test “gotcha” cases in our mystery programs…
Next Lecture: “state-of-the-art” pipelining
• Out-of-order execution/register renaming
• Reorder buffers
CS 152Lec 13.15
The Big Picture: Where are We Now?
The Five Classic Components of a Computer: Processor (Control, Datapath), Memory, Input, Output
Today’s Topics:
• Recap last lecture
• Review MIPS R3000 pipeline
• Administrivia
• Advanced Pipelining
• SuperScalar, VLIW/EPIC
CS 152Lec 13.16
Recall: Single Cycle Control!
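[Figure: single-cycle processor. Datapath: PC and Next Address logic, Ideal Instruction Memory (Instruction Address in, Instruction out), 32 32-bit Registers (Rw, Ra, Rb selected by 5-bit Rd, Rs, Rt), ALU with 32-bit operands A and B, Ideal Data Memory (Data Address, Data In, Data Out), all clocked by Clk. Control sends Control Signals to the Datapath and receives Conditions.]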
CS 152Lec 13.17
Data Stationary Control
The Main Control generates the control signals during Reg/Dec
• Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
• Control signals for Mem (MemWr, Branch) are used 2 cycles later
• Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later
[Figure: Main Control in Reg/Dec feeding the IF/ID, ID/Ex, Ex/Mem, and Mem/Wr pipeline registers. ExtOp, ALUOp, RegDst, and ALUSrc are consumed in Exec; Branch and MemWr are carried to Mem; MemtoReg and RegWr are carried through to Wr.]
CS 152Lec 13.18
Datapath + Data Stationary Control
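[Figure: pipelined datapath (Next PC, PC, Inst. Mem, IR, Decode, Reg. File, A, B, Exec, S, Mem Access / Data Mem, M, D, Reg File) with control fields travelling alongside the data: op, rs, rt, fun, im at decode, then ex/me/wb/rw/v, then me/wb/rw/v into Mem Ctrl, then wb/rw/v into WB Ctrl.]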
CS 152Lec 13.19
Let’s Try it Out
10 lw r1, r2(35)
14 addI r2, r2, 3
20 sub r3, r4, r5
24 beq r6, r7, 100
30 ori r8, r9, 17
34 add r10, r11, r12
100 and r13, r14, 15
these addresses are octal
CS 152Lec 13.20
Start: Fetch 10
[Datapath snapshot: IF fetches lw r1, r2(35) from address 10; all later stages are empty (control marked n n n n).]
CS 152Lec 13.21
Fetch 14, Decode 10
[Datapath snapshot: IF fetches from address 14; ID decodes lw r1, r2(35), reading register 2; later stages are empty (n n n).]
CS 152Lec 13.22
Fetch 20, Decode 14, Exec 10
[Datapath snapshot: IF fetches from address 20; ID decodes addI r2, r2, 3; EX holds lw r1 with A = r2 and immediate 35; later stages are empty (n n).]
CS 152Lec 13.23
Fetch 24, Decode 20, Exec 14, Mem 10
[Datapath snapshot: IF fetches from address 24; ID decodes sub r3, r4, r5, reading registers 4 and 5; EX holds addI r2, r2, 3 with A = r2 and immediate 3; Mem holds lw r1 with S = r2+35; WB is empty (n).]
CS 152Lec 13.24
Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
[Datapath snapshot: IF fetches from address 30; ID decodes beq r6, r7, 100, reading registers 6 and 7; EX holds sub r3 with A = r4 and B = r5; Mem holds addI r2 with S = r2+3; WB holds lw r1 with M = M[r2+35].]
Note Delayed Branch: always execute ori after beq
CS 152Lec 13.25
Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14
[Datapath snapshot: IF fetches from address 100; ID decodes ori r8, r9, 17, reading register 9; EX holds beq with A = r6, B = r7, and target 100; Mem holds sub r3 with S = r4-r5; WB holds addI r2 with r2+3; r1 = M[r2+35] has been written.]
CS 152Lec 13.26
Fetch 104, Dcd 100, Ex 30, Mem 24, WB 20
[Datapath snapshot with the stage contents and PC left blank.]
Fill it in yourself!
CS 152Lec 13.27
Fetch 110, Dcd 104, Ex 100, Mem 30, WB 24
[Datapath snapshot with the stage contents and PC left blank.]
Fill it in yourself!
CS 152Lec 13.28
Fetch 114, Dcd 110, Ex 104, Mem 100, WB 30
[Datapath snapshot with the stage contents and PC left blank.]
Fill it in yourself!
CS 152Lec 13.29
Pipelined Processor
Separate control at each stage
Stalls propagate backwards to freeze previous stages
Bubbles in pipeline introduced by placing “Noops” into local stage, stall previous stages.
[Figure: pipelined datapath with a control block per stage (Dcd Ctrl, Ex Ctrl, Mem Ctrl, WB Ctrl), each fed by its own instruction register (IR, IRex, IRmem, IRwb) and a Valid bit. Stalls travel backward toward fetch; bubbles travel forward.]
CS 152Lec 13.30
Recap: Data Hazards
Avoid some “by design”
• Eliminate WAR by always fetching operands early (DCD) in pipe
• Eliminate WAW by doing all WBs in order (last stage, static)
Detect and resolve remaining ones
• Stall or forward (if possible)
[Figure: overlapped pipeline timelines (IF DCD EX Mem WB against IF DCD OF Ex Mem and IF DCD OF Ex RS) marking the RAW, WAW, and WAR data hazards.]
CS 152Lec 13.31
Hazard Detection
Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline.
A RAW hazard exists on register r if r ∈ Rregs( i ) ∩ Wregs( j )
• Keep a record of pending writes (for insts in the pipe) and compare with operand regs of current instruction.
• When instruction issues, reserve its result register.
• When an operation completes, remove its write reservation.
A WAW hazard exists on register r if r ∈ Wregs( i ) ∩ Wregs( j )
A WAR hazard exists on register r if r ∈ Wregs( i ) ∩ Rregs( j )
Window on execution: only pending instructions can cause exceptions
[Figure: instruction movement through the window, from New Inst to Inst I to Inst J.]
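The three hazard conditions are plain set intersections between the read and write register sets of the issuing instruction i and a pending predecessor j. The sketch below illustrates that; the `Inst` record and the `hazards` function are hypothetical names, not part of the lecture.

```python
from typing import FrozenSet, NamedTuple, Set


class Inst(NamedTuple):
    """An instruction reduced to the registers it reads and writes."""
    name: str
    reads: FrozenSet[str]
    writes: FrozenSet[str]


def hazards(i: Inst, j: Inst) -> Set[str]:
    """Classify hazards when i is about to issue and j is still in the pipe."""
    found = set()
    if i.reads & j.writes:
        found.add("RAW")  # i reads what j has yet to write
    if i.writes & j.writes:
        found.add("WAW")  # both write the same register
    if i.writes & j.reads:
        found.add("WAR")  # i overwrites what j has yet to read
    return found


if __name__ == "__main__":
    add = Inst("add r1,r2,r3", frozenset({"r2", "r3"}), frozenset({"r1"}))
    sub = Inst("sub r4,r1,r3", frozenset({"r1", "r3"}), frozenset({"r4"}))
    print(hazards(sub, add))
```

Whether a detected WAW or WAR condition actually matters depends on the pipeline; the sets only say the registers overlap.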
CS 152Lec 13.32
Record of Pending Writes In Pipeline Registers
Current operand registers
Pending writes
hazard <=((rs == rwex) & regWex) OR
((rs == rwmem) & regWme) OR
((rs == rwwb) & regWwb) OR
((rt == rwex) & regWex) OR
((rt == rwmem) & regWme) OR
((rt == rwwb) & regWwb)
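[Figure: pipeline datapath (IAU, PC, npc, I mem, Regs, A, B, im, alu, S, D mem, m, Regs) in which each pipeline register carries op, rw, and n for its instruction; the decode stage holds op, rw, rs, rt.]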
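The hazard expression above compares the two operand registers against the destination register recorded in each later stage, qualified by that stage's write enable. A minimal Python rendering, assuming the pending writes are supplied as (rw, write-enable) pairs for the EX, MEM, and WB stages; the function name `raw_hazard` is illustrative:

```python
from typing import Iterable, Tuple


def raw_hazard(rs: int, rt: int, pending: Iterable[Tuple[int, bool]]) -> bool:
    """True if rs or rt matches the destination of any enabled pending write.

    pending holds one (rw, reg_write_enable) pair per stage: EX, MEM, WB.
    """
    return any(enable and rw in (rs, rt) for rw, enable in pending)


if __name__ == "__main__":
    # Instruction in EX will write r2; the decoding instruction reads r2 and r3.
    print(raw_hazard(2, 3, [(2, True), (0, False), (0, False)]))
```

Like the slide's expression, this sketch does not special-case register 0 or instructions that use only one source register.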
CS 152Lec 13.33
Resolve RAW by “forwarding” (or bypassing)
Detect the nearest valid write to an operand register and forward it into the op latches, bypassing the remainder of the pipe
• Increase muxes to add paths from pipeline registers
• Data Forwarding = Data Bypassing
[Figure: the same pipeline datapath with a Forward mux added in front of the operand latches, fed from the later pipeline registers.]
CS 152Lec 13.34
Forwarding
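[Figure: forwarding datapath. PC, Instruction memory, IF/ID, Registers, ID/EX, ALU with input muxes, EX/MEM, Data memory, MEM/WB, and Control passing EX, M, WB fields down the pipe. The Forwarding unit compares the Rs and Rt fields of the instruction in EX (taken from IF/ID.RegisterRs and IF/ID.RegisterRt) against EX/MEM.RegisterRd and MEM/WB.RegisterRd and drives the ALU input muxes.]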
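The forwarding unit's choice for one ALU operand can be sketched as a priority check: prefer the nearest pending write (EX/MEM), then the older one (MEM/WB), otherwise use the register file value. This is an assumed simplification; the function name and the tuple encoding are made up here, and skipping register 0 relies on the MIPS convention that r0 always reads as zero.

```python
from typing import Tuple


def forward_source(src_reg: int,
                   ex_mem: Tuple[int, bool],
                   mem_wb: Tuple[int, bool]) -> str:
    """Pick where an ALU operand should come from.

    ex_mem and mem_wb are (RegisterRd, RegWrite) pairs taken from the
    corresponding pipeline registers.
    """
    ex_rd, ex_write = ex_mem
    wb_rd, wb_write = mem_wb
    # Nearest valid write wins, so EX/MEM is checked first.
    if ex_write and ex_rd != 0 and ex_rd == src_reg:
        return "EX/MEM"
    if wb_write and wb_rd != 0 and wb_rd == src_reg:
        return "MEM/WB"
    return "REGFILE"


if __name__ == "__main__":
    # Both later stages will write r1; the most recent value sits in EX/MEM.
    print(forward_source(1, (1, True), (1, True)))
```

Checking EX/MEM before MEM/WB is what makes the result the nearest write when two in-flight instructions target the same register.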
CS 152Lec 13.35
What about Memory Operations?
If instructions are initiated in order and operations always occur in the same stage, there can be no hazards between memory operations!
What does delaying WB on arithmetic operations cost?
– cycles?
– hardware?
What about data dependence on loads?
    R1 <- R4 + R5
    R2 <- Mem[ R2 + I ]
    R3 <- R2 + R1
“Delayed Loads”
Can recognize this in decode stage and introduce bubble while stalling fetch stage (hint for lab 5!)
Tricky situation:
    R1 <- Mem[ R2 + I ]
    Mem[R3+34] <- R1
Handle with bypass in memory stage!
[Figure: pipeline registers carrying op Rd Ra Rb, operand latches A and B, result R, data memory (D Mem) output T, and the Rd path to the regfile.]
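The load-use check made in the decode stage can be sketched as follows. The model is an assumption for illustration: instructions are (op, dest, sources) tuples, a load's value is available one cycle too late for the instruction immediately behind it, and the store-data case handled by the memory-stage bypass is not modelled. The names `needs_bubble` and `count_load_stalls` are hypothetical.

```python
from typing import List, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]  # (op, destination, source registers)


def needs_bubble(in_decode: Instr, in_exec: Instr) -> bool:
    """True if the instruction in decode reads the result of a load in exec."""
    op, dest, _ = in_exec
    return op == "load" and dest in in_decode[2]


def count_load_stalls(program: List[Instr]) -> int:
    """Count one bubble per instruction that directly follows a load it depends on."""
    return sum(needs_bubble(cur, prev) for prev, cur in zip(program, program[1:]))


if __name__ == "__main__":
    program = [
        ("alu", "R1", ("R4", "R5")),   # R1 <- R4 + R5
        ("load", "R2", ("R2",)),       # R2 <- Mem[R2 + I]
        ("alu", "R3", ("R2", "R1")),   # R3 <- R2 + R1
    ]
    print(count_load_stalls(program))
```

A compiler that moves an independent instruction between the load and its consumer removes the bubble, which is the idea behind scheduling loads.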
CS 152Lec 13.36
Compiler Avoiding Load Stalls:
% loads stalling pipeline
         scheduled  unscheduled
tex      25%        65%
spice    14%        42%
gcc      31%        54%
CS 152Lec 13.37
What about Interrupts, Traps, Faults?
External Interrupts:
• Allow pipeline to drain, fill with NOPs
• Load PC with interrupt address
Faults (within instruction, restartable)
• Force trap instruction into IF
• Disable writes till trap hits WB
• Must save multiple PCs or PC + state
Recall: Precise Exceptions
State of the machine is preserved as if program executed up to the offending instruction
• All previous instructions completed
• Offending instruction and all following instructions act as if they have not even started
• Same system code will work on different implementations
CS 152Lec 13.38
Exception/Interrupts: Implementation Questions
5 instructions, executing in 5 different pipeline stages!
Who caused the interrupt?
Stage   Problem interrupts occurring
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic exception
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation; memory error
How do we stop the pipeline? How do we restart it?
Do we interrupt immediately or wait?
How do we sort all of this out to maintain preciseness?
CS 152Lec 13.39
Exception Handling
[Figure: pipeline datapath with an Excp status field carried in each pipeline register, shown for lw $2,20($5). Detection points by stage: bad instruction address, bad instruction, overflow, bad data address. The exception is allowed to take effect only at the end of the pipe.]
CS 152Lec 13.40
Another Look at the Exception Problem
Use pipeline to sort this out!
• Pass exception status along with instruction.
• Keep track of PCs for every instruction in pipeline.
• Don’t act on exception until it reaches WB stage
Handle interrupts through “faulting noop” in IF stage
When instruction reaches end of MEM stage:
• Save PC → EPC, Interrupt vector addr → PC
• Turn all instructions in earlier stages into noops!
[Figure: program flow against time, four overlapped instructions (IFetch Dcd Exec Mem WB) with faults marked in different stages: Data TLB, Bad Inst, Inst TLB fault, Overflow.]
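Carrying the exception status down the pipe means only the oldest instruction is ever acted on, even if a younger instruction faulted earlier in time. The sketch below illustrates that rule under stated assumptions: the `Slot` record, the `resolve_at_mem` function, and the vector address used in the demo are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Slot:
    """One instruction in flight, with its PC and any recorded exception."""
    pc: int
    exception: Optional[str] = None
    valid: bool = True


def resolve_at_mem(slots: List[Optional[Slot]],
                   vector_addr: int) -> Optional[Tuple[int, int]]:
    """Act on an exception only when its instruction finishes MEM.

    slots is ordered oldest first: [MEM, EX, ID, IF]. Returns (epc, new_pc)
    after squashing the faulting instruction and everything younger, or
    None if the instruction leaving MEM carries no exception.
    """
    oldest = slots[0]
    if oldest is None or not oldest.valid or oldest.exception is None:
        return None
    for slot in slots:
        if slot is not None:
            slot.valid = False  # cleared valid bit == noop / bubble
    return oldest.pc, vector_addr


if __name__ == "__main__":
    pipe = [Slot(0x10, "Data TLB"), Slot(0x14), Slot(0x18, "Bad Inst"),
            Slot(0x1C, "Inst TLB fault")]
    print(resolve_at_mem(pipe, vector_addr=0x80))
```

The younger faults at 0x18 and 0x1C are discarded along with their instructions; they would be raised again after the handler returns and the instructions are refetched.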
CS 152Lec 13.41
Resolution: Freeze Above & Bubble Below
Flush accomplished by setting “invalid” bit in pipeline
[Figure: the pipeline datapath with the stages above the faulting instruction marked “freeze” and the stages below it marked “bubble”.]
CS 152Lec 13.42
FYI: MIPS R3000 clocking discipline
2-phase non-overlapping clocks
Pipeline stage is two (level sensitive) latches
[Figure: phi1 and phi2 clock waveforms; a phi1 latch followed by a phi2 latch, compared with an edge-triggered register.]
CS 152Lec 13.43
MIPS R3000 Instruction Pipeline
[Pipeline diagram. Stages: Inst Fetch, Decode / Reg. Read, ALU / E.A., Memory, Write Reg. Operations across the stages: TLB, I-Cache, RF, Operation, E.A., TLB, D-Cache, WB. Resource usage over time: TLB, I-cache, RF, ALU, ALU, TLB, D-Cache, WB.]
Write in phase 1, read in phase 2 => eliminates bypass from WB
CS 152Lec 13.44
Recall: Data Hazard on r1
[Figure: pipeline timing diagram, time in clock cycles across and instruction order down; each instruction passes through IF, ID/RF, EX, MEM, WB, drawn as Im, Reg, ALU, Dm, Reg.]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
With MIPS R3000 pipeline, no need to forward from WB stage
CS 152Lec 13.45
MIPS R3000 Multicycle Operations
Ex: Multiply, Divide, Cache Miss
Use control word of local stage to step through multicycle operation
Stall all stages above multicycle operation in the pipeline
Drain (bubble) stages below it
Alternatively, launch multiply/divide to autonomous unit, only stall pipe if attempt to get result before ready
- This means stall mflo/mfhi in decode stage if multiply/divide still executing
- Extra credit in Lab 5 does this
[Figure: pipeline registers carrying op Rd Ra Rb, with a mul Rd Ra Rb instruction held in the execute stage; operand latches A and B, result R, T, and the Rd path to the regfile.]
CS 152Lec 13.46
Is CPI = 1 for Our Pipeline?
Remember that CPI is an “average # cycles/inst”
CPI here is 1, since the average throughput is 1 instruction every cycle.
What if there are stalls or multi-cycle execution?
Usually CPI > 1. How close can we get to 1?
[Figure: four overlapped instructions, each IFetch Dcd Exec Mem WB, one starting per cycle.]
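The effect of stalls on CPI is simple to work out. The sketch below assumes a steady-state pipeline that would otherwise complete one instruction per cycle and ignores the few cycles needed to fill the pipe; `effective_cpi` is an illustrative name.

```python
def effective_cpi(n_instructions: int, stall_cycles: int) -> float:
    """Average cycles per instruction when every stall cycle adds one cycle."""
    if n_instructions <= 0:
        raise ValueError("need at least one instruction")
    return (n_instructions + stall_cycles) / n_instructions


if __name__ == "__main__":
    print(effective_cpi(100, 0))   # no stalls
    print(effective_cpi(100, 25))  # 25 stall cycles spread over 100 instructions
```

Getting CPI close to 1 therefore comes down to driving the stall cycles per instruction toward zero.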
CS 152Lec 13.47
Case Study: MIPS R4000 (200 MHz)
8 Stage Pipeline:
• IF–first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access.
• IS–second half of access to instruction cache.
• RF–instruction decode and register fetch, hazard checking and also instruction cache hit detection.
• EX–execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
• DF–data fetch, first half of access to data cache.
• DS–second half of access to data cache.
• TC–tag check, determine whether the data cache access hit.
• WB–write back for loads and register-register operations.
8 Stages: What is impact on Load delay? Branch delay? Why?
CS 152Lec 13.48
Case Study: MIPS R4000
TWO Cycle Load Latency
[Pipeline diagram: successive instructions enter IF one cycle apart and advance through IF, IS, RF, EX, DF, DS, TC, WB.]
THREE Cycle Branch Latency (conditions evaluated during EX phase)
[Pipeline diagram: the same staggered layout of IF, IS, RF, EX, DF, DS, TC, WB.]
Delay slot plus two stalls
Branch likely cancels delay slot if not taken
CS 152Lec 13.49
MIPS R4000 Floating Point
FP Adder, FP Multiplier, FP Divider
Last step of FP Multiplier/Divider uses FP Adder HW
8 kinds of stages in FP units:
Stage  Functional unit  Description
A      FP adder         Mantissa ADD stage
D      FP divider       Divide pipeline stage
E      FP multiplier    Exception test stage
M      FP multiplier    First stage of multiplier
N      FP multiplier    Second stage of multiplier
R      FP adder         Rounding stage
S      FP adder         Operand shift stage
U                       Unpack FP numbers
CS 152Lec 13.50
MIPS FP Pipe Stages
FP Instr        1  2    3    4    5  6  7    8 …
Add, Subtract   U  S+A  A+R  R+S
Multiply        U  E+M  M    M    M  N  N+A  R
Divide          U  A    R    D^28 … D+A, D+R, D+R, D+A, D+R, A, R
Square root     U  E    (A+R)^108 … A  R
Negate          U  S
Absolute value  U  S
FP compare      U  A    R
Stages:
A Mantissa ADD stage
D Divide pipeline stage
E Exception test stage
M First stage of multiplier
N Second stage of multiplier
R Rounding stage
S Operand shift stage
U Unpack FP numbers
CS 152Lec 13.51
R4000 Performance
Not ideal CPI of 1:
• Load stalls (1 or 2 clock cycles)
• Branch stalls (2 cycles + unfilled slots)
• FP result stalls: RAW data hazard (latency)
• FP structural stalls: Not enough FP hardware (parallelism)
[Bar chart: CPI (axis 0 to 4.5) per benchmark (eqntott, espresso, gcc, li, doduc, nasa7, ora, spice2g6, su2cor, tomcatv), stacked as Base, Load stalls, Branch stalls, FP result stalls, FP structural stalls.]
CS 152Lec 13.52
Summary
Hazards limit performance
• Structural: need more HW resources
• Data: need forwarding, compiler scheduling
• Control: early evaluation & PC, delayed branch, prediction
Data hazards must be handled carefully:
• RAW data hazards handled by forwarding
• WAW and WAR hazards don’t exist in 5-stage pipeline
MIPS I instruction set architecture made pipeline visible (delayed branch, delayed load)
Exceptions in 5-stage pipeline recorded when they occur, but acted on only at WB (end of MEM) stage
• Must flush all instructions in earlier pipeline stages
More performance from deeper pipelines, parallelism