Computer Architecture (CS-213)
Main Objectives
• Hazards in Pipelining • Types of Hazards • Data Hazards • Forwarding • Stalling • Control Hazards
Revision
• Do you know?
• Ideally, what is the speed-up (performance factor) from non-pipelined to pipelined execution?
• What limits the speed-up?
• How many cycles does the first instruction take to execute in a pipelined processor?
• How do we manage control signals in pipelining?
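As a rough aid for the revision questions above, the sketch below counts cycles on an ideal pipeline. It assumes a 5-stage pipeline with no hazards, and it assumes the non-pipelined machine spends one cycle per stage for every instruction; the function names are hypothetical.

```python
def pipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """Cycles to finish n instructions on an ideal pipeline (no hazards assumed)."""
    if n_instructions == 0:
        return 0
    # The first instruction needs n_stages cycles; each later one finishes 1 cycle after.
    return n_stages + (n_instructions - 1)


def unpipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """Assumed baseline: every instruction uses all stages before the next starts."""
    return n_instructions * n_stages


def speedup(n_instructions: int, n_stages: int = 5) -> float:
    """Ratio of baseline cycles to pipelined cycles."""
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(n_instructions, n_stages)


if __name__ == "__main__":
    print(pipelined_cycles(1))       # cycles for the first instruction alone
    print(round(speedup(1000), 2))   # approaches the number of stages as n grows
```

Under these assumptions the speed-up approaches the number of stages for long instruction streams, and hazards are what keep a real pipeline below that ideal.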
Pipeline Hazards
• Data Hazards – an instruction uses the result of the previous instruction. A hazard occurs exactly when an instruction tries to read a register in its ID stage that an earlier instruction intends to write in its WB stage.
• Control Hazards – the location of an instruction depends on a previous instruction
• Structural Hazards – two instructions need to access the same resource
Data Hazards & forwarding
Data Forwarding
• Take the result from the earliest point that it exists in any of the pipeline state registers and forward it to the functional units (e.g., the ALU) that need it that cycle
• For the ALU functional unit: the inputs can come from any pipeline register rather than just from ID/EX by
– adding multiplexors to the inputs of the ALU
– connecting the Rd write data in EX/MEM or MEM/WB to either (or both) of the EX stage's Rs and Rt ALU mux inputs
– adding the proper control hardware to control the new muxes
Data Forwarding Control Conditions
1. EX/MEM hazard (forwards the result from the previous instr. to either input of the ALU):
   if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10
   if (EX/MEM.RegWrite and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10
2. MEM/WB hazard (forwards the result from the second previous instr. to either input of the ALU):
   if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
   if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
Forwarding Illustration
Instruction order:
add $1,
sub $4,$1,$5
and $6,$7,$1
[Figure: each instruction passes through IM, Reg, ALU, DM, Reg. The add result reaches sub by EX/MEM hazard forwarding and reaches and by MEM/WB hazard forwarding.]
Complication
• Another potential data hazard can occur when there is a conflict between the result of the WB stage instruction and the MEM stage instruction – which should be forwarded?
Instruction order:
add $1,$1,$2
add $1,$1,$3
add $1,$1,$4
[Figure: each instruction passes through IM, Reg, ALU, DM, Reg.]
Corrected Data Forwarding Control Conditions
2. MEM/WB hazard:
   if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (EX/MEM.RegisterRd != ID/EX.RegisterRs) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
   if (MEM/WB.RegWrite and (MEM/WB.RegisterRd != 0) and (EX/MEM.RegisterRd != ID/EX.RegisterRt) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
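The control conditions above can be sketched in software to check them against small instruction sequences. This is a minimal model, not the hardware itself: the PipeReg class and the forwarding_unit function are hypothetical names, and only the RegWrite and RegisterRd fields of the pipeline registers are modelled.

```python
from dataclasses import dataclass


@dataclass
class PipeReg:
    """Hypothetical view of the fields the forwarding unit reads."""
    reg_write: bool = False
    rd: int = 0


def forwarding_unit(ex_mem: PipeReg, mem_wb: PipeReg, id_ex_rs: int, id_ex_rt: int):
    """Return (ForwardA, ForwardB) as 2-bit strings, following the corrected conditions."""

    def select(src: int) -> str:
        # EX/MEM hazard: result of the previous instruction, which has priority.
        if ex_mem.reg_write and ex_mem.rd != 0 and ex_mem.rd == src:
            return "10"
        # MEM/WB hazard: only when EX/MEM does not already target the same register.
        if mem_wb.reg_write and mem_wb.rd != 0 and ex_mem.rd != src and mem_wb.rd == src:
            return "01"
        return "00"  # no forwarding, operand comes from the register file

    return select(id_ex_rs), select(id_ex_rt)


if __name__ == "__main__":
    # add $1,$1,$2 ; add $1,$1,$3 ; add $1,$1,$4 with the third add in EX:
    # both EX/MEM and MEM/WB hold register 1, and the newer EX/MEM value wins.
    print(forwarding_unit(PipeReg(True, 1), PipeReg(True, 1), id_ex_rs=1, id_ex_rt=4))
```

For the third add in the sequence of three dependent adds, the sketch selects the EX/MEM value for the first operand, which is the most recent result.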
Avoiding/Detecting Data Hazards
• Add two NOP instructions between sub and add instructions
• Detect a hazard
– EX/MEM.RegisterRd = ID/EX.RegisterRs
– EX/MEM.RegisterRd = ID/EX.RegisterRt
– MEM/WB.RegisterRd = ID/EX.RegisterRs
– MEM/WB.RegisterRd = ID/EX.RegisterRt
Forwarding Data
Implementing the forward unit
Control values for forwarding multiplexers
Mux Control      Source    Explanation
ForwardA = 00    ID/EX     1st ALU operand comes from register file
ForwardA = 10    EX/MEM    1st ALU operand is forwarded from prior ALU result
ForwardA = 01    MEM/WB    1st ALU operand is forwarded from data memory or an earlier ALU result
ForwardB = 00    ID/EX     2nd ALU operand comes from register file
ForwardB = 10    EX/MEM    2nd ALU operand is forwarded from prior ALU result
ForwardB = 01    MEM/WB    2nd ALU operand is forwarded from data memory or an earlier ALU result
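The mux control values in the table can be illustrated with a small selector. This is only a sketch: the function name and parameter names are hypothetical, and the three candidate values stand for the ID/EX register value, the EX/MEM ALU result, and the MEM/WB write-back data.

```python
def alu_operand(forward: str, id_ex_value: int, ex_mem_alu_result: int, mem_wb_write_data: int) -> int:
    """Pick the ALU operand according to a ForwardA/ForwardB control value."""
    sources = {
        "00": id_ex_value,        # operand read from the register file
        "10": ex_mem_alu_result,  # forwarded from the prior ALU result
        "01": mem_wb_write_data,  # forwarded from data memory or an earlier ALU result
    }
    return sources[forward]


if __name__ == "__main__":
    # With ForwardA = 10 the stale register-file value 7 is ignored in favour of 42.
    print(alu_operand("10", id_ex_value=7, ex_mem_alu_result=42, mem_wb_write_data=13))
```

The same selector serves both ALU inputs, since ForwardA and ForwardB use identical encodings.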
Modified Datapath for Forwarding
Load Word Hazard and Stall
Hazard Detection Unit
Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c; d = e - f; assuming a, b, c, d, e, and f in memory.
Slow code:
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd
Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd
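To see why the reordered code is faster, the sketch below counts load-use stalls in both sequences. It assumes a pipeline with forwarding in which a load followed immediately by an instruction that reads the loaded register costs one stall cycle; the parser handles only the LW, SW, ADD and SUB forms used here, and all function names are hypothetical.

```python
def parse(line: str):
    """Return (opcode, destination register or None, list of source registers)."""
    op, args = line.split(None, 1)
    operands = [a.strip() for a in args.split(",")]
    if op == "LW":   # LW Rdest,addr : writes Rdest, no register sources modelled
        return op, operands[0], []
    if op == "SW":   # SW addr,Rsrc : reads Rsrc, writes no register
        return op, None, [operands[1]]
    return op, operands[0], operands[1:]  # ADD/SUB Rd,Rs,Rt


def load_use_stalls(program) -> int:
    """Count loads whose result is read by the very next instruction."""
    instrs = [parse(line) for line in program]
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        if prev[0] == "LW" and prev[1] in cur[2]:
            stalls += 1
    return stalls


slow = ["LW Rb,b", "LW Rc,c", "ADD Ra,Rb,Rc", "SW a,Ra",
        "LW Re,e", "LW Rf,f", "SUB Rd,Re,Rf", "SW d,Rd"]
fast = ["LW Rb,b", "LW Rc,c", "LW Re,e", "ADD Ra,Rb,Rc",
        "LW Rf,f", "SW a,Ra", "SUB Rd,Re,Rf", "SW d,Rd"]

if __name__ == "__main__":
    print(load_use_stalls(slow), load_use_stalls(fast))
```

Under this model the slow code stalls after LW Rc and after LW Rf, while the fast code places an independent instruction after each of those loads.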
Control Hazards
• Branch instructions can cause great performance loss
• Branch instructions need two things:
– The result of the branch: Taken or Not Taken
– Branch target:
  • PC + 4 if the branch is NOT taken
  • PC + 4 + immediate*4 if the branch is taken
• A branch instruction is not detected until the ID stage
– At which point a new instruction has already been fetched
• For our original pipeline:
– The effective (branch target) address is not calculated until the EX stage
– The branch condition gets set in the EX/MEM register (EX/MEM.zero)
– 3-cycle branch delay
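The two possible next-PC values listed above can be written as a short function. This is a sketch with a hypothetical function name; the example call assumes a beq with immediate 100 located at address 1000.

```python
def next_pc(pc: int, imm16: int, taken: bool) -> int:
    """Next PC after a branch at address pc with a signed word offset imm16."""
    fall_through = pc + 4
    if taken:
        # The immediate counts words, so it is multiplied by 4 to get bytes.
        return fall_through + imm16 * 4
    return fall_through


if __name__ == "__main__":
    print(next_pc(1000, 100, taken=True))   # 1404
    print(next_pc(1000, 100, taken=False))  # 1004
```

A negative immediate gives a backward branch, as used at the bottom of a loop.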
Branch Delay – CC1
• Consider the pipelined execution of: beq $1, $3, 100
• During the first cycle, beq is fetched in the IF stage
[Figure: pipelined datapath in cycle 1. PC = 1000; beq $1, $3, 100 is in IF and PC + 4 = 1004.]
Branch Delay – CC2
• During the second cycle, beq is decoded in the ID stage
• The next_1 instruction is fetched in the IF stage
[Figure: pipelined datapath in cycle 2. PC = 1004; next_1 is in IF and PC + 4 = 1008; IF/ID holds beq with Rs = $1, Rt = $3, Imm16 = 100 and the incremented PC 1004.]
Branch Delay – CC3
• During the third cycle, beq is executed in the EX stage
• The next_2 instruction is fetched in the IF stage
[Figure: pipelined datapath in cycle 3. PC = 1008; next_2 is in IF and PC + 4 = 1012; IF/ID holds next_1 (1008); ID/EX holds beq (1004, Imm16 = 100) with Beq = 1.]
Branch Delay – CC4
• During the fourth cycle, beq reaches MEM stage
• The next_3 instruction is fetched in the IF stage
[Figure: pipelined datapath in cycle 4. PC = 1012; next_3 is in IF; IF/ID holds next_2 (1012); ID/EX holds next_1 (1008); EX/MEM holds beq with branch target 1404, Zero = 1 and Beq = 1, which drive PCSrc.]
Branch Delay – CC5
• During the fifth cycle, branch_target instruction is fetched
• Next_1 thru next_3 should be converted into NOPs
[Figure: pipelined datapath in cycle 5. PC = 1404; branch_target is in IF; IF/ID holds next_3 (1016); ID/EX holds next_2 (1012); next_1 is in MEM.]
3-Cycle Branch Delay
[Figure: pipeline timing diagram over cc1 to cc7 for beq $1,$3,100, Next_1, Next_2, Next_3 and Branch_Target. Next_1 through Next_3 are fetched and then turned into bubbles.]
• Next_1 thru Next_3 will be fetched anyway
• Pipeline should flush Next_1 thru Next_3 if branch is taken
• Otherwise, they can be executed normally
Reducing the Delay of Branches
• Branch delay can be reduced from 3 cycles to just 1 cycle
• Branch decision is moved from the 4th into the 2nd pipeline stage
– Branches can be determined earlier, in the ID stage
– Branch address calculation adder is moved to the ID stage
– A comparator in the ID stage compares the two fetched registers
  • To determine the branch decision, whether the branch is taken or not
• Only one instruction that follows the branch will be fetched
• If the branch is taken then only one instruction is flushed
• We need a control signal IF.Flush to zero the IF/ID register
– This converts the fetched instruction into a NOP when we need to branch to the target address
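The relation between the stage that resolves the branch and the number of instructions that must be flushed can be sketched as follows. The stage list and function names are hypothetical; the model assumes one instruction is fetched per cycle until the branch outcome is known.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def branch_delay(resolve_stage: str) -> int:
    """Instructions fetched after the branch before its outcome is known."""
    # Each stage the branch passes through before resolving lets one more fetch happen.
    return STAGES.index(resolve_stage)


def wrongly_fetched(resolve_stage: str):
    """Names of the instructions to flush if the branch is taken."""
    return [f"next_{i}" for i in range(1, branch_delay(resolve_stage) + 1)]


if __name__ == "__main__":
    print(branch_delay("MEM"), wrongly_fetched("MEM"))  # original pipeline
    print(branch_delay("ID"), wrongly_fetched("ID"))    # decision moved to ID
```

Resolving in MEM gives the 3-cycle delay of the original pipeline, and resolving in ID leaves only the single instruction that IF.Flush turns into a NOP.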
[Figure: datapath with the branch decision moved to the ID stage. It shows the register comparator (=), Sign extend, Shift left 2 and the branch adder in ID, the IF.Flush control into IF/ID, the hazard detection unit and the forwarding unit.]
Branch Hazard Alternatives
• Always stall the pipeline until branch direction is known
– Next instruction is always flushed (turned into a NOP)
• Predict Branch Not Taken
– Fetch successor instruction: PC+4 already calculated
– Almost half of MIPS branches are not taken on average
– Flush instructions in pipeline only if branch is actually taken
• Predict Branch Taken
– Can predict backward branches in loops, which are taken most of the time
– However, the branch target address is determined in the ID stage, so we must reduce the branch delay from 1 cycle to 0, but how?
• Delayed Branch
– Define branch to take place AFTER the following instruction
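The cost of the first two alternatives can be compared with a simple CPI estimate. This is a sketch under stated assumptions: the 20% branch frequency is a hypothetical figure chosen for illustration, the taken fraction of 0.5 is a rounding of "almost half not taken", the penalty is the 1-cycle branch delay, and the function name is hypothetical.

```python
def effective_cpi(branch_fraction: float, flush_fraction: float,
                  penalty_cycles: int, base_cpi: float = 1.0) -> float:
    """Base CPI plus the average branch penalty per instruction."""
    return base_cpi + branch_fraction * flush_fraction * penalty_cycles


if __name__ == "__main__":
    BRANCH_FRACTION = 0.2  # hypothetical share of branches in the instruction mix
    # Always stall: every branch pays the penalty.
    print(effective_cpi(BRANCH_FRACTION, 1.0, 1))
    # Predict not taken: only taken branches pay the penalty.
    print(effective_cpi(BRANCH_FRACTION, 0.5, 1))
```

With these illustrative numbers, predicting not taken halves the branch overhead compared with always stalling.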
Delayed Branch
• Define branch to take place after the next instruction
• For a 1-cycle branch delay, we have one delay slot
    branch instruction
    branch delay slot – next instruction
    . . .
    branch target – if branch taken
• Compiler/assembler fills the branch delay slot by selecting a useful instruction, which must be executed regardless of the result of the branch
    branch instruction (taken)              IF ID EX MEM WB
    branch delay slot (next instruction)       IF ID EX MEM WB
    branch target                                 IF ID EX MEM WB
Scheduling the Branch Delay Slot
1. From an independent instruction before the branch
   Before: add $t2,$t3,$t4 / beq $s1, $s0 / Delay Slot
   After:  beq $s1, $s0 / add $t2,$t3,$t4
2. From a target instruction when branch is predicted taken
   Before: sub $t4,$t5,$t6 (at the branch target) / beq $s1, $s0 / Delay Slot
   After:  beq $s1, $s0 / sub $t4,$t5,$t6
3. From fall through when branch is predicted not taken
   Before: beq $s1, $s0 / Delay Slot / sub $t4,$t5,$t6
   After:  beq $s1, $s0 / sub $t4,$t5,$t6
More on Delayed Branch
• Scheduling delay slot with
– Independent instruction is the best choice
  • However, not always possible to find an independent instruction
– Target instruction is useful when branch is predicted taken
  • Such as in a loop branch (e.g., for loop)
  • Cancel branch delay instruction if branch is not taken
– Fall through is useful when branch is predicted not taken
  • Cancel branch delay instruction if branch is taken
• Disadvantages of delayed branch
– Branch delay can increase to multiple cycles in deeper pipelines
– Zero-delay branching + dynamic branch prediction are the most appropriate solution to the problem
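The "independent instruction" condition can be checked mechanically for the simplest case. The sketch below assumes the candidate is the instruction immediately before the branch and only tests that it does not write a register the branch compares; the function names and the string-based parsing are hypothetical simplifications.

```python
def operands(instr: str):
    """Split 'op a,b,c' into its opcode and operand list."""
    op, args = instr.split(None, 1)
    return op, [a.strip() for a in args.split(",")]


def can_fill_from_before(candidate: str, branch: str) -> bool:
    """True if the candidate's destination is not read by the branch."""
    _, cand_ops = operands(candidate)
    _, branch_ops = operands(branch)
    dest = cand_ops[0]  # first operand is the destination in add/sub
    branch_sources = [r for r in branch_ops if r.startswith("$")]
    return dest not in branch_sources


if __name__ == "__main__":
    print(can_fill_from_before("add $t2,$t3,$t4", "beq $s1, $s0"))  # independent
    print(can_fill_from_before("add $s1,$t3,$t4", "beq $s1, $s0"))  # branch needs $s1
```

When the check fails, the compiler has to fall back on a target or fall-through instruction, or leave a NOP in the slot.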
Zero-Delayed Branch
• How can we achieve zero-delay for a taken branch …
– If the branch target address is computed in the IF stage?
• Solution
– Check the PC to see if the instruction being fetched is a branch
– Store the branch target address in a table in the IF stage
– Such a table is called the branch target buffer
– If branch is predicted taken then
  • Next PC = branch target fetched from target buffer
– Otherwise, if branch is predicted not taken then
  • Next PC = PC + 4
– Zero-delay is achieved because Next PC is determined in IF stage
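The next-PC selection described above can be modelled with a small lookup table. This is a sketch only: the class and method names are hypothetical, a Python dict stands in for the hardware table, and a single boolean stands in for the prediction.

```python
class BranchTargetBuffer:
    """Toy branch target buffer keyed by the address of the branch."""

    def __init__(self):
        self.entries = {}  # branch PC -> (target address, predicted taken?)

    def update(self, branch_pc: int, target: int, predict_taken: bool) -> None:
        """Record or refresh the entry once a branch has been resolved."""
        self.entries[branch_pc] = (target, predict_taken)

    def next_pc(self, pc: int) -> int:
        """Choose the next fetch address during IF."""
        entry = self.entries.get(pc)
        if entry is not None and entry[1]:
            return entry[0]  # known branch, predicted taken
        return pc + 4        # not a known branch, or predicted not taken


if __name__ == "__main__":
    btb = BranchTargetBuffer()
    btb.update(1000, 1404, predict_taken=True)
    print(btb.next_pc(1000))  # 1404
    print(btb.next_pc(1004))  # 1008
```

Because the lookup uses only the PC, the target is available in the same cycle as the fetch, which is what removes the delay for a correctly predicted taken branch.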
Branch Target and Prediction Buffer
• The branch target buffer is implemented as a small cache
– That stores the branch target address of taken branches
• We also have a branch prediction buffer, which determines whether to take the branch or not
– It stores the prediction bits for branch instructions
– The prediction bits are dynamically determined by the hardware
[Figure: the PC is used to look up the Branch Target Buffer (columns: PC of Branch, Target Address) and the Prediction Buffer; a mux selects between PC + 4 and the target address.]
Dynamic Branch Prediction
• Prediction of branches at runtime using prediction bits
– One or a few prediction bits are associated with a branch instruction
• Branch prediction buffer is a small memory
– Indexed by the lower portion of the address of the branch instruction
• The simplest scheme is to have 1 prediction bit per branch
• We don’t know if the prediction bit is correct or not
• If correct prediction …
– Continue normal execution – no wasted cycles
• If incorrect prediction (misprediction) …
– Flush the instructions that were incorrectly fetched – wasted cycles
– Update prediction bit and target address for future use
• Prediction is just a hint that is assumed to be correct
• If incorrect then fetched instructions are flushed
• 1-bit prediction scheme has a performance shortcoming
– A loop branch is almost always taken, except for last iteration
– 1-bit scheme will predict incorrectly twice, rather than once
– On the first and last loop iterations
• 2-bit prediction schemes are often used
– A prediction must be wrong twice before it is changed
– A loop branch is mispredicted only once on the last iteration
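The shortcoming of the 1-bit scheme can be reproduced with a short simulation. The sketch assumes a single loop branch that is taken three times and then falls through, with the loop executed twice; the function name and the initial prediction of not taken are assumptions.

```python
def mispredictions_1bit(outcomes, initial: bool = False) -> int:
    """Count mispredictions of a 1-bit predictor that remembers the last outcome."""
    bit = initial
    misses = 0
    for taken in outcomes:
        if bit != taken:
            misses += 1
            bit = taken  # the single bit flips on every misprediction
    return misses


# One run of the loop: taken, taken, taken, then not taken on the last iteration.
one_run = [True, True, True, False]

if __name__ == "__main__":
    print(mispredictions_1bit(one_run * 2))
```

Each run of the loop costs two mispredictions here: one on the first iteration, because the bit still records the previous exit, and one on the last iteration.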
2-bit Prediction Scheme
[Figure: state diagram of the 2-bit prediction scheme, with two Predict Taken states and two Predict Not Taken states connected by Taken and Not Taken transitions.]
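A 2-bit scheme can be simulated in the same way as a 1-bit one. The sketch below uses a saturating counter, which is one common 2-bit variant and may differ in its transitions from the state diagram on the slide; the function name, the state encoding (0 to 3, with 2 and 3 predicting taken) and the initial state are assumptions.

```python
def mispredictions_2bit(outcomes, state: int = 3) -> int:
    """Count mispredictions of a 2-bit saturating counter (states 0..3)."""
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move one step toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


# One run of a loop: taken, taken, taken, then not taken on the last iteration.
loop_run = [True, True, True, False]

if __name__ == "__main__":
    print(mispredictions_2bit(loop_run * 2))
```

Starting from the strongly taken state, the loop branch is mispredicted only on the last iteration of each run, because a single not-taken outcome does not flip the prediction.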