Abstraction Question
• General purpose processors have an abstraction layer fixed at the ISA and have little control over the compilers or code run on the machine
• Embedded processors tend to have entire control over the code run, compiler (if any), the ISA, and the hardware – so breaking the abstraction layer makes sense.
• Bottom line becomes who controls what elements of the design.
Intel side note
We’ve been doing MIPS – how did Intel do this for a CISC x86 ISA? Micro-ops, from the Pentium Pro on.
Branch Hazards
or “Which way did he go?”
Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Read 5.1, 5.2
Control Dependence
• Just as an instruction will be dependent on other instructions to provide its operands (_____ dependence), it will also be dependent on other instructions to determine whether it gets executed or not (_____ dependence or _____ dependence).
• Control dependences are particularly critical with _____________ branches.
add $5, $3, $2
sub $6, $5, $2
beq $6, $7, somewhere
and $9, $6, $1
...
somewhere: or $10, $5, $2
add $12, $11, $9
...
Answers: data; control (or branch); conditional.
Dealing With Branch Hazards
• Hardware
– stall until you know which direction
– reduce hazard through earlier computation of branch direction
– guess which direction
   – assume not taken (easiest)
   – more educated guess based on history (requires that you know it is a branch before it is even decoded!)
• Hardware/Software
– no-ops, or instructions that get executed either way (delayed branch).
Given our current pipeline – let’s assume we stall until we know the branch outcome. How many cycles will you lose per branch?
Stalling the pipeline
Selection  Cycles
A 0
B 1
C 2
D 3
E 4
Stalling for Branch Hazards
beq $4, $0, there
and $12, $2, $5
or ...
add ...
sw ...
[Figure: pipeline diagram over CC1–CC8 for the instructions above; three bubbles are inserted after the beq.]
Stalling for Branch Hazards
• Seems wasteful, particularly when the branch isn’t taken.
• Makes all branches cost 4 cycles.
• What if we just assume branches aren’t taken?
Why?
Assume Branch Not Taken
beq $4, $0, there
and $12, $2, $5
or ...
add ...
sw ...
[Figure: pipeline diagram over CC1–CC8 for the instructions above.]
• works pretty well when you’re right
Assume Branch Not Taken
beq $4, $0, there
and $12, $2, $5
or ...
add ...
there: sub $12, $4, $2
[Figure: pipeline diagram over CC1–CC8. The three wrong-path instructions fetched after the beq (and, or, add) are flushed, then the sub at there: goes through the pipeline.]
• same performance as stalling when you’re wrong
Let’s improve the pipeline so we move branch resolution to Decode.
How many cycles would we lose then on a taken branch?
Stalling the pipeline
Selection  Cycles
A 0
B 1
C 2
D 3
E 4
Add drawing of resolving in decode.
The Pipeline with flushing for taken branches
• Notice the IF/ID flush line added.
Branch Hazards – Assume Not Taken
• Great if most of your branches aren’t taken.
• What about loops which are taken 95% of the time?
– we would like the option of assuming not taken for some branches, and taken for others, depending on ???
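A rough way to see why one fixed assumption is not enough: the average cycles lost per branch is the fraction of wrong guesses times the penalty. The sketch below assumes a 1-cycle penalty (branch resolved in decode) and that a predicted-taken branch has its target available at fetch. The function name and parameters are illustrative, not from the slides.

```python
def avg_penalty_per_branch(taken_fraction, predict_taken, penalty_cycles=1):
    """Average cycles lost per branch under one fixed (static) assumption."""
    # We are wrong whenever the actual outcome differs from the fixed guess.
    wrong_fraction = (1 - taken_fraction) if predict_taken else taken_fraction
    return wrong_fraction * penalty_cycles

loop_taken = 0.95  # loop branch taken 95% of the time, as on the slide

not_taken_cost = avg_penalty_per_branch(loop_taken, predict_taken=False)
taken_cost = avg_penalty_per_branch(loop_taken, predict_taken=True)

print(f"assume not taken: {not_taken_cost:.2f} cycles lost per branch")
print(f"assume taken:     {taken_cost:.2f} cycles lost per branch")
```

For this loop branch, assuming not taken is wrong almost every time, while assuming taken is wrong only on loop exit.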
Branch Hazards – Predicting Taken
beq $2, $1, here
here: lw
[Figure: pipeline diagram over CC1–CC8; the beq and the lw at here: each pass through IM, Reg, ALU, DM, Reg.]
Reading quiz
Required information to predict Taken:
1. An instruction is a branch before decode
2. The target of the branch
3. The outcome of the branch
Selection Required knowledge
A 2,3
B 1,2,3
C 1,2
D 2
E None of the above
Branch Target Buffer
• Keeps track of the PCs of recently seen branches and their targets.
• Consult during Fetch (in parallel with Instruction Memory read) to determine:
– Is this a branch?
– If so, what is the target?
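A minimal sketch of the idea, assuming a small direct-mapped table indexed by low PC bits with the full PC kept as a tag. The class name, table size, and addresses are hypothetical; real BTBs differ in size, associativity, and what they store.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB: remembers the PC and target of recently seen branches."""

    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        self.entries = [None] * num_entries  # each entry is (branch_pc, target_pc)

    def _index(self, pc):
        # Instructions are word aligned, so drop the low 2 bits before indexing.
        return (pc >> 2) % self.num_entries

    def lookup(self, pc):
        """Done during fetch. Returns the target if pc is a known branch, else None."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc:
            return entry[1]
        return None

    def update(self, pc, target):
        """Done when a branch resolves, so the next fetch of pc hits."""
        self.entries[self._index(pc)] = (pc, target)


btb = BranchTargetBuffer()
first_lookup = btb.lookup(0x400010)   # never seen: treated as "not a branch"
btb.update(0x400010, 0x400000)        # branch resolved, target recorded
second_lookup = btb.lookup(0x400010)  # now answers both questions in fetch
alias_lookup = btb.lookup(0x400050)   # same index, different PC: tag mismatch
```

A hit answers both questions at once: the PC is a branch, and here is its target. A miss just means the fetch continues at PC+4.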
Branch Hazards – Predict Taken
• Static policy:
– Forward branches (if statements) predict not taken
– Backward branches (loops) predict taken
• Dynamic prediction (coming soon)
• First – Branch Delay Slots
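The static policy above needs only the branch PC and its target. A minimal sketch, with a hypothetical function name and made-up addresses:

```python
def predict_taken_static(branch_pc, target_pc):
    """Backward branch (target at or before the branch) -> predict taken.
    Forward branch (target after the branch) -> predict not taken."""
    return target_pc <= branch_pc

loop_branch = predict_taken_static(branch_pc=0x400020, target_pc=0x400000)
if_branch = predict_taken_static(branch_pc=0x400020, target_pc=0x400040)

print("loop (backward) branch predicted taken:", loop_branch)
print("if (forward) branch predicted taken:", if_branch)
```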
Eliminating the Branch Stall
• There’s no rule that says we have to see the effect of the branch immediately. Why not wait an extra instruction before branching?
• The original SPARC and MIPS processors each used a single branch delay slot to eliminate single-cycle stalls after branches.
• The instruction after a conditional branch is always executed in those machines, regardless of whether the branch is taken or not!
Branch Delay Slot
beq $4, $0, there
and $12, $2, $5
there: or ...
add ...
sw ...
[Figure: pipeline diagram over CC1–CC8 for the five instructions above.]
Branch delay slot instruction (next instruction after a branch) is executed even if the branch is taken.
Filling the branch delay slot
1 add $5, $3, $7
2 add $9, $1, $3
3 sub $6, $1, $4
4 and $7, $8, $2
5 beq $6, $7, there
nop /* branch delay slot */
6 add $9, $1, $2
7 sub $2, $9, $5
...
there:
8 mult $2, $10, $11
…
Per-instruction notes:
1 No ($7, WAR)
2 Safe, $1 and $3 are fine
3 No ($6)
4 No ($7)
6 Not safe ($9 on not-taken path)
7 Not safe (needs $9 not yet produced)
8 Not safe ($2 is used before overwritten)
E is the correct answer
Selection Safe instructions
A 1,2
B 2,6
C 6,8
D 1,2,7,8
E None of the above
* It is not safe to assume anything about the … code
Filling the branch delay slot
• The branch delay slot is only useful if you can find something to put there.
• If you can’t find anything, you must put a no-op to ensure correctness.
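As a sketch of what "finding something to put there" involves, the check below tests whether an instruction from before the branch can be moved into the delay slot without breaking a dependence. It uses the four instructions that precede the beq in the example above, and it only considers candidates from before the branch. Candidates from the target or fall-through path need knowledge of the surrounding code that this sketch does not have. The Inst tuple and function name are hypothetical.

```python
from collections import namedtuple

# dest and srcs are register numbers.
Inst = namedtuple("Inst", "num op dest srcs")

before_branch = [
    Inst(1, "add", 5, (3, 7)),
    Inst(2, "add", 9, (1, 3)),
    Inst(3, "sub", 6, (1, 4)),
    Inst(4, "and", 7, (8, 2)),
]
branch_srcs = (6, 7)  # beq $6, $7, there


def can_move_into_slot(index, insts, branch_srcs):
    """True if insts[index] can move past everything after it, including the branch."""
    cand = insts[index]
    if cand.dest in branch_srcs:      # branch needs this result (RAW)
        return False
    for later in insts[index + 1:]:
        if cand.dest in later.srcs:   # a later instruction reads the result (RAW)
            return False
        if later.dest == cand.dest:   # a later write would be reordered (WAW)
            return False
        if later.dest in cand.srcs:   # a later write clobbers an operand (WAR)
            return False
    return True


safe = [inst.num for i, inst in enumerate(before_branch)
        if can_move_into_slot(i, before_branch, branch_srcs)]
print("safe to move into the delay slot:", safe)
```

Under these rules only instruction 2 qualifies among the instructions before the branch.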
Branch Delay Slots
• This works great for this implementation of the architecture, but becomes a permanent part of the ISA.
• What about the MIPS R10000, which has a 5-cycle branch penalty, and executes 4 instructions per cycle???
• Bottom line: Exposed a detail of the hardware implementation to the ISA.
Dynamic Branch Prediction
• Can we guess the outcome of branches?
• What should we base that guess on?
Branch Prediction
[Figure: the program counter indexes a table of 1-bit predictor entries.]
for (i=0;i<10;i++) {......}
...
...
add $i, $i, #1
beq $i, #10, loop
Accuracy?
Too easily swayed
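To put a number on "too easily swayed", here is a small simulation of a 1-bit (last-outcome) predictor. It assumes a loop-closing branch that is taken 9 times and then falls through once, with the whole loop run 100 times and the predictor starting at "taken". These numbers are illustrative.

```python
def one_bit_accuracy(outcomes, initial_taken=True):
    """1-bit predictor: predict whatever this branch did last time."""
    last = initial_taken
    correct = 0
    for taken in outcomes:
        correct += (last == taken)
        last = taken  # a single surprise flips the prediction
    return correct / len(outcomes)

one_loop = [True] * 9 + [False]        # taken 9 times, then loop exit
accuracy = one_bit_accuracy(one_loop * 100)
print(f"1-bit accuracy: {accuracy:.3f}")
```

Each loop exit causes two mispredictions: the exit itself, and the first iteration of the next run of the loop.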
Two-bit predictors give better loop prediction
for (i=0;i<10;i++) {......}
...
...
add $i, $i, #1
beq $i, #10, loop
States: Strongly Taken 11, Weakly Taken 10, Weakly Not Taken 01, Strongly Not Taken 00.
Increment when taken; decrement when not taken.
[Figure: the branch address indexes the PHT, whose entries hold these 2-bit states.]
• Better (less sway)
• Slower learning
Suppose we have the following branch patterns for 3 branches (A, B, C). What is the accuracy of a 1-bit and a 2-bit Branch History Table? Assume initial values of 1 (1-bit) and 10 (2-bit).
States: Strongly Taken 11, Weakly Taken 10, Weakly Not Taken 01, Strongly Not Taken 00.
Increment when taken; decrement when not taken.
1-bit
A. T T T T N
B. T T N T N
C. N T N T N
2-bit
A. T T T T N
B. T T N T N
C. N T N T N
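One way to check an answer to this exercise is to simulate it. The sketch below treats both predictors as n-bit saturating counters (a 1-bit counter is just "last outcome") and assumes each branch has its own table entry, so A, B, and C do not interfere. The function name is hypothetical.

```python
def count_correct(pattern, bits, initial):
    """Return the number of correct predictions for one branch's outcome pattern."""
    top = (1 << bits) - 1          # saturation limit: 1 for 1-bit, 3 for 2-bit
    threshold = 1 << (bits - 1)    # predict taken when the counter's top bit is set
    counter = initial
    correct = 0
    for outcome in pattern.split():
        taken = (outcome == "T")
        correct += ((counter >= threshold) == taken)
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return correct

patterns = {"A": "T T T T N", "B": "T T N T N", "C": "N T N T N"}

one_bit = {name: count_correct(p, bits=1, initial=1) for name, p in patterns.items()}
two_bit = {name: count_correct(p, bits=2, initial=0b10) for name, p in patterns.items()}

print("1-bit correct out of 5:", one_bit)
print("2-bit correct out of 5:", two_bit)
```

Under these assumptions the 1-bit table gets 6 of 15 predictions right and the 2-bit table gets 7 of 15. The alternating pattern C defeats both.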
Modern Branch Prediction - Pentium 4
• Performance dependent on accurate Branch Prediction
• 20 Stage Pipeline – 3-way issue
– 60 instructions in flight (12 branches)
– 17th stage is branch resolution
– ~17*3=51 instructions lost on mispredict
[Figure: pipeline stages numbered 1–20, with an X marking the branch resolution stage.]
Branch Prediction
• Latest branch predictors are significantly more sophisticated, using more advanced correlating techniques, larger structures, and even AI techniques
• Use patterns of branches (local history) and recent other branch history (global history) to make predictions
For a given program on our 5-stage MIPS pipeline processor:
20% of insts are loads, 50% of instructions following a load are arithmetic instructions depending on the load
20% of instructions are branches. Using dynamic branch prediction, we achieve 80% prediction accuracy.
What is the CPI of your program?
Selection CPI
A 0.76
B 0.9
C 1.0
D 1.1
E 1.14
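A sketch of how the pieces combine, under two assumptions the question does not state outright: a dependent instruction right after a load stalls 1 cycle (forwarding in place), and a mispredicted branch costs 1 cycle (branch resolved in decode). The function and parameter names are illustrative; change the penalties and the result changes.

```python
def cpi(base, load_frac, dependent_frac, load_use_stall,
        branch_frac, accuracy, mispredict_penalty):
    """Base CPI plus average stall cycles per instruction from each hazard."""
    load_stalls = load_frac * dependent_frac * load_use_stall
    branch_stalls = branch_frac * (1 - accuracy) * mispredict_penalty
    return base + load_stalls + branch_stalls

result = cpi(base=1.0,
             load_frac=0.20, dependent_frac=0.50, load_use_stall=1,     # assumed 1-cycle load-use stall
             branch_frac=0.20, accuracy=0.80, mispredict_penalty=1)     # assumed 1-cycle mispredict penalty
print(f"CPI = {result:.2f}")
```

With these assumed penalties the loads add 0.10 and the branches add 0.04 to the base CPI of 1.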
Putting it all together.
Control Hazards -- Key Points
• Control (or branch) hazards arise because we must fetch the next instruction before we know if we are branching or where we are branching.
• Control hazards are detected in hardware.
• We can reduce the impact of control hazards through:
– early detection of branch address and condition
– branch prediction
– branch delay slots
Given our 5-stage MIPS pipeline – what is the steady state CPI for the following code? Assume the branch is taken thousands of times. Recall – a processor is in steady state when all stages are active.
Loop: lw r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero, Loop
Selection CPI
A 1
B 1.25
C 1.5
D 1.75
E None of the above
Steady-State CPI = (#insts + #stalls + #flushed_insts) / #insts
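The formula can be applied to the four-instruction loop as follows. The stall and flush counts here are assumptions, not given by the question: with forwarding, predict-not-taken, and branches resolved in ID, one plausible count is 1 stall (the beq waiting on the sub's result) and 1 flushed instruction (the taken branch). The function name is illustrative.

```python
def steady_state_cpi(num_insts, num_stalls, num_flushed):
    """Cycles per loop iteration divided by useful instructions per iteration."""
    return (num_insts + num_stalls + num_flushed) / num_insts

# Assumed counts for one iteration of the lw/add/sub/beq loop.
loop_cpi = steady_state_cpi(num_insts=4, num_stalls=1, num_flushed=1)
print(f"steady-state CPI = {loop_cpi}")
```

A different pipeline (for example, branches resolved later) changes the stall and flush counts, and therefore the CPI.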
IF = 200ps
ID = 100ps
EX = 100ps
M = 200ps
WB = 100ps
Hardware engineers determine these to be the execution times per stage of the MIPS 5-stage pipeline processor. Consider splitting IF and M into 2 stages each. (So IF1 IF2 and M1 M2.) The most important code run by the company is (assume branch is taken most of the time):
Loop: lw r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero, Loop
Selection CPI CT
A Increase Increase
B Increase Decrease
C Decrease Increase
D Decrease Decrease
E Increase No Change
What would be the impact of the new 7-stage pipeline compared to the original 5-stage MIPS pipeline? Assume the pipeline has forwarding where available, predicts branch not taken, and resolves branches in ID.
Isomorphic
IF = 200ps
ID = 100ps
EX = 200ps
M = 200ps
WB = 100ps
Hardware engineers determine these to be the execution times per stage of the MIPS 5-stage pipeline processor. Consider splitting IF and M into 2 stages each. (So IF1 IF2 and M1 M2.) The most important code run by the company is (assume branch is taken most of the time):
Loop: lw r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero, Loop
Selection CPI CT
A Increase Increase
B Increase Decrease
C Decrease Increase
D Decrease Decrease
E Increase No Change
What would be the impact of the new 7-stage pipeline compared to the original 5-stage MIPS pipeline? Assume the pipeline has forwarding where available, predicts branch not taken, and resolves branches in ID.
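For the cycle-time half of these two questions, the clock period is set by the slowest stage. The sketch below computes it for both sets of stage latencies above, assuming each split stage divides evenly in two and ignoring pipeline register overhead. The function names are hypothetical.

```python
def cycle_time(stages_ps):
    """The clock must be long enough for the slowest stage."""
    return max(stages_ps.values())

def split_stage(stages_ps, name):
    """Replace one stage with two half-latency stages (assumes an even split)."""
    out = {}
    for stage, latency in stages_ps.items():
        if stage == name:
            out[name + "1"] = latency / 2
            out[name + "2"] = latency / 2
        else:
            out[stage] = latency
    return out

first = {"IF": 200, "ID": 100, "EX": 100, "M": 200, "WB": 100}
second = {"IF": 200, "ID": 100, "EX": 200, "M": 200, "WB": 100}

first_7 = split_stage(split_stage(first, "IF"), "M")
second_7 = split_stage(split_stage(second, "IF"), "M")

print("EX = 100ps:", cycle_time(first), "ps ->", cycle_time(first_7), "ps")
print("EX = 200ps:", cycle_time(second), "ps ->", cycle_time(second_7), "ps")
```

When EX is 200ps it remains the slowest stage after the split, so the extra stages add hazards without shortening the clock.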
Loop: lw r1, 0(r2)
      add r2, r3, r4
      sub r5, r1, r2
      beq r5, $zero, Loop
7-stage Pipeline
For diagramming the code in the last slide. Drive home – more stages = more hazards
Pipelining -- Key Points
• Pipelining focuses on improving instruction throughput, not individual instruction latency.
• Data hazards can be handled by hardware or software – but most modern processors have hardware support for stalling and forwarding.
• Control hazards can be handled by hardware or software – but most modern processors use Branch Target Buffers and advanced dynamic branch prediction to reduce the hazard.
• ET = IC*CPI*CT
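A small worked instance of this equation, comparing two designs on the same program. All numbers here are hypothetical, chosen only to show that a deeper pipeline can raise CPI and still win on execution time if the cycle time drops enough.

```python
def execution_time_ps(instruction_count, cpi, cycle_time_ps):
    """ET = IC * CPI * CT, with CT in picoseconds."""
    return instruction_count * cpi * cycle_time_ps

ic = 1_000_000  # hypothetical instruction count

shallow = execution_time_ps(ic, cpi=1.5, cycle_time_ps=200)  # hypothetical shorter pipeline
deep = execution_time_ps(ic, cpi=2.0, cycle_time_ps=100)     # hypothetical deeper pipeline

print(f"shallow: {shallow / 1e6:.0f} microseconds")
print(f"deep:    {deep / 1e6:.0f} microseconds")
```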