CPE 631 Review: Pipelining
Electrical and Computer EngineeringUniversity of Alabama in Huntsville
Aleksandar Milenkovic, [email protected]
http://www.ece.uah.edu/~milenka
2
AM
LaCASA
Outline
Pipelined Execution 5 Steps in MIPS Datapath Pipeline Hazards
Structural Data Control
3
AM
LaCASA
Laundry Example (by David Patterson)
Four loads of clothes: A, B, C, D
Task: each one to wash, dry, and fold Resources
Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes
A B C D
4
AM
LaCASA
Sequential Laundry
Sequential laundry takes 6 hours for 4 loads If they learned pipelining,
how long would laundry take?
A
B
C
D
30 40 2030 40 2030 40 2030 40 20
6 PM 7 8 9 10 11 Midnight
Task
Order
Time
5
AM
LaCASA
Pipelined Laundry
Pipelined laundry takes 3.5 hours for 4 loads
A
B
C
D
6 PM 7 8 9 10 11 Midnight
Task
Order
Time
30 40 40 40 40 20
6
AM
LaCASA
Pipelining Lessons Pipelining doesn’t help
latency of single task, it helps throughput of entire workload
Pipeline rate is limited by slowest pipeline stage
Multiple tasks operating simultaneously
Potential speedup = Number pipe stages
Unbalanced lengths of pipe stages reduces speedup
Time to “fill” pipeline and time to “drain” reduce speedup
A
B
C
D
6 PM 7 8 9
Task
Order
Time
30 40 40 40 40 20
7
AM
LaCASA
Computer Pipelines
Execute billions of instructions, so throughput is what matters
What is desirable in instruction sets for pipelining? Variable length instructions vs.
all instructions same length? Memory operands part of any operation vs.
memory operands only in loads or stores? Register operand many places in instruction
format vs. registers located in same place?
8
AM
LaCASA
A "Typical" RISC
Registers 32 64-bit general-purpose (integer) registers (R0-R31)
32 64-bit floating-point registers (F0-F31)
Data types 8-bit bytes, 16-bit half-words, 32-bit words, 64-bit
double words for integer data 32-bit single- or 64-bit double-precision numbers
Addressing Modes for MIPS Data Transfers Load-store architecture: Immediate, Displacement Memory is byte addressable with a 64-bit address Mode bit to select Big Endian or Little Endian
9
AM
LaCASA
MIPS64 Instruction Formats
Op31 26 01516202125
Rs Rt immediate
Op31 26 025
Op31 26 01516202125
Rs Rt
address
Rd funct
Register-Register561011
Register-Immediate
Jump / Call
shamt
Op31 26 01516202125
Fmt Ft Fs funct
561011
Fd
Floating-point (FR)
Op31 26 01516202125
Fmt Ft immediate
Floating-point (FI)
10
AM
LaCASA
MIPS64 Instructions
MIPS Operations(See Appendix B, Figure B.26)
Data Transfers (LB, LBU, SB, LH, LHU, SH, LW, LWU, SW, LD, SD, L.S, L.D, S.S, S.D, MFCO, MTCO, MOV.S, MOV.D, MFC1, MTC1)
Arithmetic/Logical (DADD, DADDI, DADDU, DADDIU, DSUB, DSUBU, DMUL, DMULU, DDIV, DDIVU, MADD, AND, ANDI, OR, ORI, XOR, XORI, LUI, DSLL, DSRL, DSRA, DSLLV, DSRLV, DSRAV, SLT, SLTI, SLTU, SLTIU)
Control (BEQZ, BNEZ, BEQ, BNE, BC1T, BC1F, MOVN, MOVZ, J, JR, JAL, JALR, TRAP, ERET)
Floating Point (ADD.D, ADD.S, ADD.PS, SUB.D, SUB.S, SUB.PS, MUL.D, MUL.S, MUL.PS, MADD.D, MADD.S, MADD.PS, DIV.D, DIV.S, DIV.PS, CVT._._, C._.D, C._.S
11
AM
LaCASA
5 Steps of Simple RISC Datapath
MemoryAccess
Write
Back
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
LMD
ALU
MU
X
Mem
ory
Reg File
MU
XM
UX
Data
Mem
ory
MU
X
SignExtend
4
Ad
der Zero?
Next SEQ PC
Addre
ss
Next PC
WB Data
Inst
RD
RS1
RS2
Imm
12
AM
LaCASA
5 Steps of Simple RISC Datapath (cont’d)
MemoryAccess
Write
Back
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
ALU
Mem
ory
Reg File
MU
XM
UX
Data
Mem
ory
MU
X
SignExtend
Zero?
IF/ID
ID/E
X
MEM
/WB
EX
/MEM
4
Ad
der
Next SEQ PC Next SEQ PC
RD RD RD WB
Data
• Data stationary control– local decode for each instruction phase / pipeline stage
Next PC
Addre
ss
RS1
RS2
Imm
MU
X
13
AM
LaCASA
Visualizing Pipeline
Reg AL
U
DMIM Reg
Reg AL
U
DMIM Reg
Reg AL
U
DMIM Reg
Instr.
Order
Time (clock cycles)
Reg AL
U
DMIM Reg
CC 2 CC 3 CC 4 CC 6 CC 7CC 5CC 1
14
AM
LaCASA
Instruction Flow through Pipeline
Reg
ALU
DM
IMR
eg
CC 2 CC 3CC 1
Reg
ALU
DM
IMR
eg
Reg
ALU
DM
IMR
eg
Add R1,R2,R3
Nop
Nop
Nop
Add R1,R2,R3
Nop
Nop
Lw R4,0(R2)
Add R1,R2,R3
Nop
Lw R4,0(R2)
Sub R6,R5,R7
Reg
ALU
DM
IMR
eg
Add R1,R2,R3
Lw R4,0(R2)
Sub R6,R5,R7
CC 4
Xor R9,R8,R1
Time (clock cycles)
15
AM
LaCASA
Simple RISC Pipeline Definition: IF, ID
Stage IF IF/ID.IR Mem[PC]; if EX/MEM.cond {IF/ID.NPC, PC
EX/MEM.ALUOUT} else {IF/ID.NPC, PC PC + 4};
Stage ID ID/EX.A Regs[IF/ID.IR6…10];
ID/EX.B Regs[IF/ID.IR11…15]; ID/EX.Imm (IF/ID.IR16)16 ## IF/ID.IR16…31; ID/EX.NPC IF/ID.NPC;
ID/EX.IR IF/ID.IR;
16
AM
LaCASA
Simple RISC Pipeline Definition: IE
ALU EX/MEM.IR ID/EX.IR; EX/MEM.ALUOUT ID/EX.A func ID/EX.B; or
EX/MEM.ALUOUT ID/EX.A func ID/EX.Imm; EX/MEM.cond 0;
load/store EX/MEM.IR ID/EX.IR;
EX/MEM.B ID/EX.B; EX/MEM.ALUOUT ID/EX.A ID/EX.Imm; EX/MEM.cond 0;
branch EX/MEM.Aluout ID/EX.NPC (ID/EX.Imm<< 2); EX/MEM.cond (ID/EX.A func 0);
17
AM
LaCASA
Simple RISC Pipeline Def.: MEM, WB
Stage MEM ALU
MEM/WB.IR EX/MEM.IR; MEM/WB.ALUOUT EX/MEM.ALUOUT;
load/store MEM/WB.IR EX/MEM.IR; MEM/WB.LMD Mem[EX/MEM.ALUOUT] or
Mem[EX/MEM.ALUOUT] EX/MEM.B; Stage WB
ALU Regs[MEM/WB.IR16…20] MEM/WB.ALUOUT; or
Regs[MEM/WB.IR11…15] MEM/WB.ALUOUT; load
Regs[MEM/WB.IR11…15] MEM/WB.LMD;
18
AM
LaCASA
Its Not That Easy for Computers
Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: HW cannot support this
combination of instructions Data hazards: Instruction depends on result of
prior instruction still in the pipeline Control hazards: Caused by delay between
the fetching of instructions and decisions about changes in control flow (branches and jumps)
19
AM
LaCASA
One Memory Port/Structural Hazards
Instr.
Order
Time (clock cycles)
Load
Instr 1
Instr 2
Instr 3
Instr 4
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Cycle 1Cycle 2 Cycle 3Cycle 4 Cycle 6Cycle 7Cycle 5
Reg
ALU
DMemIfetch Reg
20
AM
LaCASA
One Memory Port/Structural Hazards (cont’d)
Instr.
Order
Time (clock cycles)
Load
Instr 1
Instr 2
Stall
Instr 3
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Cycle 1Cycle 2 Cycle 3Cycle 4 Cycle 6Cycle 7Cycle 5
Reg
ALU
DMemIfetch Reg
Bubble Bubble Bubble BubbleBubble
21
AM
LaCASA
Instr.
Order
dadd r1,r2,r3
dsub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Data Hazard on R1Time (clock cycles)
IF ID/RF EX MEM WB
22
AM
LaCASA
Three Generic Data Hazards
Read After Write (RAW) InstrJ tries to read operand before InstrI writes it
Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
I: add r1,r2,r3J: sub r4,r1,r3
23
AM
LaCASA
Three Generic Data Hazards
Write After Read (WAR) InstrJ writes operand before InstrI reads it
Called an “anti-dependence” by compiler writers.This results from reuse of the name “r1”.
Can’t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5
I: sub r4,r1,r3 J: add r1,r2,r3K: mul r6,r1,r7
24
AM
LaCASA
Three Generic Data Hazards
Write After Write (WAW) InstrJ writes operand before InstrI writes it.
Called an “output dependence” by compiler writers This also results from the reuse of name “r1”. Can’t happen in MIPS 5 stage pipeline because:
All instructions take 5 stages, and Writes are always in stage 5
I: sub r1,r4,r3 J: add r1,r2,r3K: mul r6,r1,r7
25
AM
LaCASA
Time (clock cycles)
Forwarding to Avoid Data Hazard
Inst
r.
Order
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
26
AM
LaCASA
HW Change for Forwarding
MEM
/WR
ID/E
X
EX
/MEM
DataMemory
ALU
mux
mux
Registe
rs
NextPC
Immediate
mux
27
AM
LaCASA
Forwarding to DM input
lw R4,0(R1)
sw 12(R1),R4
add R1,R2,R3
Inst.
Order
Time (clock cycles)
- Forward R1 from EX/MEM.ALUOUT to ALU input (lw)- Forward R1 from MEM/WB.ALUOUT to ALU input (sw)- Forward R4 from MEM/WB.LMD to memory input (memory output to memory input)
Reg AL
U
DMIM Reg
Reg AL
U
DMIM Reg
Reg AL
U
DMIM Reg
CC 2 CC 3 CC 4 CC 6 CC 7CC 5CC 1
28
AM
LaCASA
Forwarding to DM input (cont’d)
sw 0(R4),R1
add R1,R2,R3
Inst.
Order
Time (clock cycles)
Forward R1 from MEM/WB.ALUOUT to DM input
Reg AL
U
DMIM Reg
Reg AL
U
DMIM Reg
CC 2 CC 3 CC 4 CC 6CC 5CC 1
29
AM
LaCASA
Forwarding to Zero
beqz R1,50
add R1,R2,R3
Instruction
Order
Time (clock cycles)
Forward R1 from EX/MEM.ALUOUT to Zero
sub R4,R5,R6
bneq R1,50
add R1,R2,R3
Forward R1 from MEM/WB.ALUOUT to Zero
Reg AL
U
DMIM Reg
Reg AL
U
DMIM Reg
CC 2 CC 3 CC 4 CC 6CC 5CC 1
Reg AL
U
DMIM Reg
Reg AL
U
DMIM Reg
Reg AL
U
DMIM Reg
Z
Z
30
AM
LaCASA
Time (clock cycles)
Instr.
Order
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
Data Hazard Even with Forwarding
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
31
AM
LaCASA
Data Hazard Even with Forwarding
Time (clock cycles)
or r8,r1,r9
Instr.
Order
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
Reg
ALU
DMemIfetch Reg
RegIfetch
ALU
DMem RegBubble
Ifetch
ALU
DMem RegBubble Reg
Ifetch
ALU
DMemBubble Reg
32
AM
LaCASA
Try producing fast code for
a = b + c;
d = e – f;
assuming a, b, c, d ,e, and f in memory. Slow code:
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd
Software Scheduling to Avoid Load Hazards
Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd
33
AM
LaCASA
Control Hazard on BranchesThree Stage Stall
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
Reg ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
Reg
ALU
DMemIfetch Reg
34
AM
LaCASA
Example: Branch Stall Impact
If 30% branch, Stall 3 cycles significant Two part solution:
Determine branch taken or not sooner, AND Compute taken branch address earlier
MIPS branch tests if register = 0 or 0 MIPS Solution:
Move Zero test to ID/RF stage Adder to calculate new PC in ID/RF stage 1 clock cycle penalty for branch versus 3
35
AM
LaCASA
Ad
der
IF/ID
Pipelined Simple RISC DatapathMemoryAccess
Write
Back
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
ALU
Mem
ory
Reg File
MU
X
Data
Mem
ory
MU
X
SignExtend
Zero?
MEM
/WB
EX
/MEM
4
Ad
der
Next SEQ PC
RD RD RD WB
Data
• Data stationary control– local decode for each instruction phase / pipeline stage
Next PC
Addre
ss
RS1
RS2
ImmM
UX
ID/E
X
36
AM
LaCASA
Four Branch Hazard Alternatives
#1: Stall until branch direction is clear #2: Predict Branch Not Taken
Execute successor instructions in sequence “Squash” instructions in pipeline if branch
actually taken Advantage of late pipeline state update 47% MIPS branches not taken on average PC+4 already calculated, so use it to get next
instruction
37
AM
LaCASA
Branch not Taken
Time [clocks]MemIF ID Ex WB
IF ID
branch(not taken)
5
IF ID Ex Mem WB
Ex Mem WB
Ii+1
Ii+2
MemIF ID Ex WB
Instructions
branch(taken)
5
IF idle idle idle idle
IF ID Ex Mem WB
Ii+1
branchtarget
IF ID Ex Mem WBbranchtarget+1
Branch is untaken(determined during ID),we have fetched the fall-through and just continue no wasted cycles
Branch is taken(determined during ID),restart the fetch from at the branch target one cycle wasted
38
AM
LaCASA
Four Branch Hazard Alternatives
#3: Predict Branch Taken Treat every branch as taken 53% MIPS branches taken on average But haven’t calculated branch target address
in MIPS MIPS still incurs 1 cycle branch penalty
Make sense only when branch target is known before branch outcome
39
AM
LaCASA
Four Branch Hazard Alternatives
#4: Delayed Branch Define branch to take place AFTER a following
instruction
branch instructionsequential successor1sequential successor2........sequential successornbranch target if taken
1 slot delay allows proper decision and branch target address in 5 stage pipeline
MIPS uses this
Branch delay of length n
40
AM
LaCASA
Delayed Branch
Where to get instructions to fill branch delay slot? Before branch instruction From the target address:
only valuable when branch taken From fall through:
only valuable when branch not taken
41
AM
LaCASA
Scheduling the branch delay slot: From Before
Delay slot is scheduled with an independent instruction from before the branch
Best choice, always improves performance
ADD R1,R2,R3
if(R2=0) then
<Delay Slot>
Becomes
if(R2=0) then
<ADD R1,R2,R3>
42
AM
LaCASA
Scheduling the branch delay slot: From Target
Delay slot is scheduled from the target of the branch
Must be OK to execute that instruction if branch is not taken
Usually the target instruction will need to be copied because it can be reached by another path programs are enlarged
Preferred when the branch is taken with high probability
SUB R4,R5,R6
...
ADD R1,R2,R3
if(R1=0) then
<Delay Slot>
Becomes
...
ADD R1,R2,R3
if(R2=0) then
<SUB R4,R5,R6>
43
AM
LaCASA
Scheduling the branch delay slot:From Fall Through
Delay slot is scheduled from thetaken fall through
Must be OK to execute that instruction if branch is taken
Improves performance when branch is not taken
ADD R1,R2,R3
if(R2=0) then
<Delay Slot>
SUB R4,R5,R6
Becomes
ADD R1,R2,R3
if(R2=0) then
<SUB R4,R5,R6>
44
AM
LaCASA
Delayed Branch Effectiveness
Compiler effectiveness for single branch delay slot: Fills about 60% of branch delay slots About 80% of instructions executed in branch
delay slots useful in computation About 50% (60% x 80%) of slots usefully filled
Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
45
AM
LaCASA
Example: Branch Stall Impact
Assume CPI = 1.0 ignoring branches Assume solution was stalling for 3 cycles If 30% branch, Stall 3 cycles
Op Freq Cycles CPI(i) (% Time)
Other 70% 1 .7 (37%) Branch 30% 4 1.2 (63%)
=> new CPI = 1.9, or almost 2 times slower
46
AM
LaCASA
Example 2: Speed Up Equation for Pipelining
pipelined
dunpipeline
TimeCycle
TimeCycle
CPI stall Pipeline CPI Idealdepth Pipeline CPI Ideal
Speedup
pipelined
dunpipeline
TimeCycle
TimeCycle
CPI stall Pipeline 1depth Pipeline
Speedup
Instper cycles Stall Average CPI Ideal CPIpipelined
For simple RISC pipeline, CPI = 1:
47
AM
LaCASA
Example 3: Evaluating Branch Alternatives (for 1 program)
Scheduling Branch CPI speedup v. scheme penalty stall
Stall pipeline 3 1.42 1.0 Predict taken 1 1.14 1.26 Predict not taken 1 1.09 1.29 Delayed branch 0.5 1.07 1.31
Conditional & Unconditional = 14%, 65% change PC
48
AM
LaCASA
Example 4: Dual-port vs. Single-port
Machine A: Dual ported memory (“Harvard Architecture”)
Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both Loads&Stores are 40% of instructions executed
49
AM
LaCASA
Extended Simple RISC Pipeline
IF ID
EXInt
EXFP/I Mult
EXFP Add
EXFP/I Div
Mem WB
DLX pipe with three unpipelined, FP functional units
In reality, the intermediate results are probably not cycled around the EX unit; instead the EX stages has some number of clock delays larger than 1
50
AM
LaCASA
Extended Simple RISC Pipeline (cont’d)
Initiation or repeat interval: number of clock cycles that must elapse between issuing two operations
Latency: the number of intervening clock cycles between an instruction that produces a result and an instruction that uses the result
Functional unit Latency Initiation interval
Integer ALU 0 1
Data Memory 1 1
FP Add 3 1
FP/Integer Multiply 6 1
FP/Integer Divide 24 25
51
AM
LaCASA
Extended Simple RISC Pipeline (cont’d)
IF ID
Ex
M1 M2 M3 M4 M5 M6 M7
A1 A2 A3
..
M WB
A4
52
AM
LaCASA
Extended Simple RISC Pipeline (cont’d)
Multiple outstanding FP operations FP/I Adder and Multiplier are fully pipelined FP/I Divider is not pipelined
Pipeline timing for independent operationsMUL.D IF ID M1 M2 M3 M4 M5 M6 M7 Mem WB
ADD.D IF ID A1 A2 A3 A4 Mem WB
L.D IF ID Ex Mem WB
S.D IF ID Ex Mem WB
53
AM
LaCASA
Hazards and Forwarding in Longer Pipes
Structural hazard: divide unit is not fully pipelined detect it and stall the instruction
Structural hazard: number of register writes can be larger than one due to varying running times
WAW hazards are possible Exceptions!
instructions can complete in different order than they were issued
RAW hazards will be more frequent
54
AM
LaCASA
Examples
Stalls arising from RAW hazards
Three instructions that want to perform a write back to the FP register file simultaneously
L.D F4, 0(R2) IF ID EX Mem WB
MUL.D F0, F4, F6
IF ID stall M1 M2 M3 M4 M5 M6 M7 Mem WB
ADD.D F2, F0, F8
IF stall ID stall stall stall stall stall stall A1 A2 A3 A4 Mem WB
S.D 0(R2), F2 IF stall stall stall stall stall stall ID EX stall stall stall Mem
MUL.D F0, F4, F6
IF ID M1 M2 M3 M4 M5 M6 M7 Mem WB
... IF ID EX Mem WB
... IF ID EX Mem WB
ADD.D F2, F4, F6
IF ID A1 A2 A3 A4 Mem WB
... IF ID EX Mem WB
... IF ID EX Mem WBL.D F2, 0(R2) IF ID EX Mem WB
55
AM
LaCASA
Solving Register Write Conflicts
First approach: track the use of the write port in the ID stage and stall an instruction before it issues
use a shift register that indicates when already issued instructions will use the register file
if there is a conflict with an already issued instruction, stall the instruction for one clock cycle
on each clock cycle the reservation register is shifted one bit Alternative approach: stall a conflicting instruction when it tries
to enter MEM or WB stage we can stall either instruction e.g. give priority to the unit with the longest latency Pros: does not require to detect the conflict until the entrance of
MEM or WB stage Cons: complicates pipeline control; stalls now can arise from two
different places
56
AM
LaCASA
WAW Hazards
Result of ADD.D is overwritten without any instruction ever using it WAWs occur when useless instruction is executed still, we must detect them and provide correct
executionWhy?
IF ID EX Mem WB
ADD.D F2, F4, F6 IF ID A1 A2 A3 A4 Mem WB
IF ID EX Mem WB
L.D F2, 0(R2) IF ID EX Mem WB
BNEZ R1, foo
DIV.D F0, F2, F4 ; delay slot from fall-through
...
foo: L.D F0, qrs
57
AM
LaCASA
Solving WAW Hazards
First approach: delay the issue of load instruction until ADD.D enters MEM
Second approach: stamp out the result of the ADD.D by detecting the hazard and changing the control so that ADDD does not write; LD issues right away
Detect hazard in ID when LD is issuing stall LD, or make ADDD no-op
Luckily this hazard is rare
58
AM
LaCASA
Hazard Detection in ID Stage
Possible hazards hazards among FP instructions hazards between an FP instruction and an
integer instr. FP and integer registers are distinct,
except for FP load-stores, and FP-integer moves
Assume that pipeline does all hazard detection in ID stage
59
AM
LaCASA
Hazard Detection in ID Stage (cont’d)
Check for structural hazards wait until the required functional unit is not busy and
make sure that the register write port is available Check for RAW data hazards
wait until source registers are not listed as pending destinations in a pipeline register that will not be available when this instruction needs the result
Check for WAW data hazards determine if any instruction in A1, .. A4, M1, .. M7, D
has the same register destination as this instruction; if so, stall the issue of the instruction in ID
60
AM
LaCASA
Forwarding Logic
Check if the destination register in any of EX/MEM, A4/MEM, M7/MEM, D/MEM, or MEM/WB pipeline registers is one of the source registers of a FP instruction
If so, the appropriate input multiplexer will have to be enabled so as to choose the forwarded data