![Page 1: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/1.jpg)
Computer Science Department
University of Central Florida
CDA 5106 Advanced Computer Architecture I
Pipelining
![Page 2: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/2.jpg)
2
Designing a processor
• Design the ISA
• Classify instructions for the ISA (e.g., MIPS):
– Memory references
– Register-Register ALU Operations
– Register-Immediate ALU Operations
– Branches
• Work out the execution for each operation class
• Design appropriate hardware
• Look for opportunities to improve…
• …while maintaining correct execution
![Page 3: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/3.jpg)
3
How to Execute an Instruction
• Instruction fetch (“IF”)
– IR = Mem[PC]
– NPC = PC + 4
• Instruction decode/Register fetch (“ID”)
– A = Regs[IR25..21]
– B = Regs[IR20..16]
– Imm = sign-extend(IR15..0)
• Execute (“EX”)
– Memory reference:
• ALUOutput = A + Imm
– Reg/Reg ALU Operation:
• ALUOutput = A op B
– Reg/Immediate ALU Operation:
• ALUOutput = A op Imm
– Branch:
• ALUOutput = NPC + Imm; Cond = (A op 0)
![Page 4: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/4.jpg)
4
Executing an Instruction (cont.)
• Memory Access/Branch completion (“MEM”)
– Memory Reference:
• Load_Mem_Data = Mem[ALUOutput] /* Load */
• Mem[ALUOutput] = B /* Store */
– Branch
• If (cond) PC = ALUOutput, else PC = NPC
• Write back (“WB”)
– Reg-Reg ALU Operation:
• Regs[IR15..11] = ALUOutput
– Reg-Immediate ALU Operation:
• Regs[IR20..16] = ALUOutput
– Load instruction:
• Regs[IR20..16] = Load_Mem_Data
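The five steps above can be sketched as a tiny interpreter. This is only an illustration under simplifying assumptions: instructions are pre-decoded dictionaries rather than 32-bit words, the field names `rs`, `rt`, `rd` stand in for IR25..21, IR20..16, IR15..11, and the only branch modeled is a hypothetical `beqz`. None of these names come from the slides.

```python
import operator

def step(pc, imem, regs, dmem):
    """Run one instruction through IF, ID, EX, MEM, WB and return the next PC."""
    # IF: fetch the instruction and compute the sequential next PC
    ir = imem[pc]
    npc = pc + 4

    # ID: read both source registers and the immediate
    a = regs[ir.get("rs", 0)]
    b = regs[ir.get("rt", 0)]
    imm = ir.get("imm", 0)

    # EX: one ALU operation, chosen by instruction class
    kind = ir["kind"]
    alu_output, cond = None, False
    if kind in ("load", "store"):
        alu_output = a + imm
    elif kind == "alu_rr":
        alu_output = ir["op"](a, b)
    elif kind == "alu_ri":
        alu_output = ir["op"](a, imm)
    elif kind == "beqz":
        alu_output = npc + imm
        cond = (a == 0)

    # MEM: memory access or branch completion
    next_pc = npc
    load_mem_data = None
    if kind == "load":
        load_mem_data = dmem[alu_output]
    elif kind == "store":
        dmem[alu_output] = b
    elif kind == "beqz" and cond:
        next_pc = alu_output

    # WB: write the result into the register file
    if kind == "alu_rr":
        regs[ir["rd"]] = alu_output
    elif kind == "alu_ri":
        regs[ir["rt"]] = alu_output
    elif kind == "load":
        regs[ir["rt"]] = load_mem_data
    return next_pc

# Usage: a load, a register-immediate add, and a taken branch.
regs = [0] * 32
regs[2] = 100
dmem = {108: 7}
imem = {
    0: {"kind": "load", "rs": 2, "rt": 1, "imm": 8},
    4: {"kind": "alu_ri", "op": operator.add, "rs": 1, "rt": 3, "imm": 5},
    8: {"kind": "beqz", "rs": 0, "imm": 8},
}
pc = 0
for _ in range(3):
    pc = step(pc, imem, regs, dmem)
```

Each instruction class touches only some of the steps (a store has no WB, a branch has neither a memory access nor a WB), which is the observation the later slides on multi-cycle CPI build on.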
![Page 5: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/5.jpg)
5
How to Execute an Instruction
• Instruction fetch (“IF”) – IR = Mem[PC]
– NPC = PC + 4
[Figure: IF datapath with PC, instruction cache, IR (inst. reg.), and an ALU that adds 4 to PC to produce NPC]
![Page 6: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/6.jpg)
6
How to Execute an Instruction (cont.)
• Instruction decode/Register fetch (“ID”) – A = Regs[IR25..21]
– B = Regs[IR20..16]
– Imm = sign-extend(IR15..0)
[Figure: ID datapath in which IR (inst. reg.) feeds Regs and the sign-extend unit, producing A, B, and Imm]
![Page 7: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/7.jpg)
7
How to Execute an Instruction (cont.)
• Execute (“EX”)
– Memory reference:
• ALUOutput = A + Imm
– Reg/Reg ALU Operation:
• ALUOutput = A op B
– Reg/Immediate ALU Operation:
• ALUOutput = A op Imm
– Branch:
• ALUOutput = NPC + Imm; Cond = (A op 0)
[Figure: EX datapath in which two MUXes select the ALU inputs from A, B, Imm, and NPC, and a “=0?” test produces cond]
![Page 8: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/8.jpg)
8
How to Execute an Instruction (cont.)
• Memory Access/Branch completion (“MEM”)
– Memory Reference:
• Load_Mem_Data = Mem[ALUOutput] /* Load */
• Mem[ALUOutput] = B /* Store */
– Branch
• If (cond) PC = ALUOutput, else PC = NPC
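[Figure: MEM datapath in which the ALU output addresses the data cache, B supplies the store data, loaded data goes to LMD, and a MUX controlled by cond selects the next PC from NPC or the ALU output]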
![Page 9: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/9.jpg)
9
How to Execute an Instruction (cont.)
• Write back (“WB”)
– Reg-Reg ALU Operation:
• Regs[IR15..11] = ALUOutput
– Reg-Immediate ALU Operation:
• Regs[IR20..16] = ALUOutput
– Load instruction:
• Regs[IR20..16] = Load_Mem_Data
[Figure: WB datapath in which a MUX selects LMD or the ALU output to write into Regs]
![Page 10: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/10.jpg)
10
[Figure: complete datapath across the five stages: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory (MEM), Writeback (WB)]
![Page 11: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/11.jpg)
11
An Abstract View of Single-Cycle Implementation
![Page 12: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/12.jpg)
12
Controller
![Page 13: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/13.jpg)
13
Analysis
• Single-cycle Implementation -> Multi-cycle Implementation
• All instructions (except branch):
– IF, ID, EX, MEM, WB
• Branch (12%):
– IF, ID, EX, MEM
• CPI = 5*0.88+4*0.12 = 4.88 cycles
• Graphically:
[Figure: three instructions executed one after another, each passing through IC, Reg, ALU, DC, Reg before the next one starts]
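The CPI figure is a weighted average over the instruction mix. A minimal sketch of that arithmetic follows; the function name `average_cpi` is made up for illustration.

```python
def average_cpi(mix):
    """mix is a list of (fraction_of_instructions, cycles_per_instruction) pairs."""
    total_fraction = sum(fraction for fraction, _ in mix)
    assert abs(total_fraction - 1.0) < 1e-9, "fractions must sum to 1"
    return sum(fraction * cycles for fraction, cycles in mix)

# 88% of instructions take 5 cycles, 12% (branches) take 4 cycles.
cpi = average_cpi([(0.88, 5), (0.12, 4)])
print(round(cpi, 2))  # 4.88
```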
![Page 14: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/14.jpg)
14
Unpipelined Execution
• Throughput
– Depends on full latency of instruction
[Figure: two instructions run back to back, each as IF ID EX MEM WB. While one stage is busy the other units are idle: I$ idle; decoder idle, RF read ports idle; ALU idle; D$ idle; RF write port idle]
![Page 15: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/15.jpg)
15
Pipelined Execution
• Isolate each of IF, ID, EX, MEM, WB with latches
• When instruction i is in WB, i+1 is in MEM, etc.
• Graphically:
[Figure: instructions i, i+1, i+2, i+3, i+4 plotted against time, each passing through IC, Reg, ALU, DC, Reg and starting one cycle after the previous instruction]
![Page 16: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/16.jpg)
17
Pipeline speedup (no stalls)
• For a pipeline of n stages:

$$
\text{speedup} = \frac{\text{ave. exec. time unpipelined}}{\text{ave. exec. time pipelined}} = \frac{T_{unpipe}}{T_{unpipe}/n + T_{latch}} = n \quad (\text{ideal case, } T_{latch} = 0)
$$
IF ID EX MEM WB
![Page 17: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/17.jpg)
18
Pipeline limits
• Limitations of pipelining
– Tlatch
• Delay, setup, hold times
• Clock skew
• Latch takes up more of cycle as cycle shrinks: deeper pipelining gives diminishing returns
– Minimum logic between latches
[Figure: with deeper pipelining the cycle shrinks while Tlatch stays the same, so Tlatch takes up a growing share of the cycle]
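The diminishing returns can be seen numerically. The sketch below assumes the pipelined cycle time is the unpipelined time divided evenly over n stages plus a fixed latch overhead; the values 10 ns and 0.5 ns are made-up illustrations, not figures from the slides.

```python
def pipeline_speedup(n, t_unpipe, t_latch):
    """Speedup of an n-stage pipeline with no stalls and a fixed latch overhead."""
    return t_unpipe / (t_unpipe / n + t_latch)

# Hypothetical numbers: 10 ns of logic, 0.5 ns of latch overhead per stage.
for n in (5, 10, 20, 40):
    print(n, round(pipeline_speedup(n, 10.0, 0.5), 2))
```

Under this model the speedup can never exceed `t_unpipe / t_latch` (20 with the numbers above), no matter how deep the pipeline gets.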
![Page 18: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/18.jpg)
19
Pipelining Idealisms
• Uniform subcomputations
– Can pipeline into stages with equal delay
– Balance pipeline stages
• Identical computations
– Can fill pipeline with identical work
– Unify instruction types
• Independent computations
– No relationships between work units
– Minimize pipeline stalls
• Are these practical?
– No, but can get close enough to get significant speedup
![Page 19: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/19.jpg)
20
Pipeline hazards
• A hazard reduces the performance of the pipeline
– Due to program’s characteristics
– Potential violations of program dependences
• Hazard Resolution
– Static Method: Performed statically by compiler
– Dynamic Method: Performed dynamically by hardware at run time, e.g., stall, flush, forwarding
• Three kinds:
– Structural hazards - not enough hardware resources for all combinations of instructions
– Data hazards - Dependencies between instructions prevent their overlapped execution
– Control hazards - Branches change the PC, so the next instruction to fetch is known late
![Page 20: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/20.jpg)
21
Structural hazard
• Consider a pipeline with a unified data+instruction cache:
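
| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| i (load) | IF | ID | EX | MEM | WB | | | | |
| i+1 | | IF | ID | EX | MEM | WB | | | |
| i+2 | | | IF | ID | EX | MEM | WB | | |
| i+3 | | | | stall | IF | ID | EX | MEM | WB |
| i+4 | | | | | | IF | ID | EX | MEM |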
![Page 21: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/21.jpg)
22
Modeling stalls
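$$
\text{speedup} = \frac{\text{ave. exec. time unpipelined}}{\text{ave. exec. time pipelined}} = \frac{CPI_{unpipe} \times CT_{unpipe}}{CPI_{pipe} \times CT_{pipe}}
$$

$$
CPI_{pipe} = CPI_{nostall} + (\text{stall cycles per instruction}) = 1 + (\text{stall cycles per instruction})
$$

Case 1: this assumes that the two CT’s are equal, with $CPI_{unpipe} = n$:

$$
\text{speedup} = \frac{CPI_{unpipe}}{1 + (\text{stall cycles per instruction})} \times \frac{CT_{unpipe}}{CT_{pipe}} = \frac{n}{1 + (\text{stall cycles per instruction})}
$$

Case 2: this assumes that the two CPIs ($CPI_{unpipe}$ and $CPI_{nostall}$) are equal, with $CT_{unpipe} = n \times CT_{pipe}$:

$$
\text{speedup} = \frac{1}{1 + (\text{stall cycles per instruction})} \times \frac{CT_{unpipe}}{CT_{pipe}} = \frac{n}{1 + (\text{stall cycles per instruction})}
$$

(same thing)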
![Page 22: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/22.jpg)
23
Modeling stalls (example)
• Ex:
– n = 5 (e.g., MIPS pipeline)
– 20% of instructions are branches
– 60% of branches are taken
– Penalties:
• Taken branches: 3 stall cycles
• Not-taken branches: 0 stall cycles
• How many stall cycles per instr. on average?
– stall cycles/instr. = (0.8 x 0) + (0.2 x [ 0.6 x 3 + 0.4 x 0]) = 0.2 x 0.6 x 3 = 0.36
– Speedup = 5 / 1.36 = 3.68
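The example can be checked with a few lines of arithmetic. The function names are illustrative only; the formula is speedup = n / (1 + stall cycles per instruction).

```python
def stall_cycles_per_instr(branch_frac, taken_frac, taken_penalty, not_taken_penalty=0):
    """Average stall cycles per instruction caused by branches."""
    per_branch = taken_frac * taken_penalty + (1 - taken_frac) * not_taken_penalty
    return branch_frac * per_branch

def speedup(n, stalls):
    """Speedup of an n-stage pipeline over the unpipelined machine."""
    return n / (1 + stalls)

stalls = stall_cycles_per_instr(branch_frac=0.2, taken_frac=0.6, taken_penalty=3)
print(round(stalls, 2))              # 0.36
print(round(speedup(5, stalls), 2))  # 3.68
```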
![Page 23: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/23.jpg)
24
Data Hazards
• Read-after-write (RAW) hazard

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| add r1, r2, r3 | IF | ID | EX | MEM | WB (reg. is written) | | | |
| add r4, r1, r5 | | IF | stall | stall | ID (reg. is read) | EX | MEM | WB |

Perform the register write in the 1st half of the clock cycle (CT/2) and the read in the second half (CT/2). Result: 2 cycle stall.
![Page 24: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/24.jpg)
25
RAW Data Hazards
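
| Instruction | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| add r1, r2, r3 | IF | ID | EX | MEM | WB | |
| add r4, r1, r5 | | IF | ID | EX | MEM | WB |

• Result (r2+r3) is available after EX of the first add
• r1 is written in WB of the first add
• r1 is read in ID of the second add
• The second add only needs the result, not r1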
![Page 25: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/25.jpg)
27
Data forwarding (bypasses)
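[Figure: forwarding datapath with pipeline registers ID/EX, EX/MEM, MEM/WB, the RF, the A, B, and IMM values, the ALU, and the D$; bypass 1 and bypass 2 feed results from later stages back into the MUXes at the ALU inputs]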
![Page 26: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/26.jpg)
28
Data forwarding (cont.)
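
| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| add r1, r2, r3 | IF | ID | EX | MEM | WB | | | | |
| add r4, r1, r5 (byp 1) | | IF | ID | EX | MEM | WB | | | |
| add r6, r1, r5 (byp 2) | | | IF | ID | EX | MEM | WB | | |
| add r7, r1, r5 | | | | IF | ID | EX | MEM | WB | |
| add r8, r1, r5 | | | | | IF | ID | EX | MEM | WB |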
![Page 27: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/27.jpg)
29
Stalls due to RAW data hazards
• Our simple pipeline
– Most RAW hazards => no stall
– Loads cause 1-cycle stall
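
| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| load r1, r2, r3 | IF | ID | EX | MEM (value available) | WB | | |
| add r4, r1, r5 | | IF | ID | stall | EX (value needed) | MEM | WB |
| (next instruction) | | | IF | stall | ID | EX | MEM |

The loaded value reaches the add through byp 2.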
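The stall counts on the RAW slides can be reproduced with a small timing model. This is a sketch under stated assumptions: an in-order five-stage pipeline, a register file written in the first half of a cycle and read in the second half, and instructions given as hypothetical `(opcode, dest, src1, src2)` tuples. With forwarding, an ALU result is usable by the very next instruction and a load result one cycle later; without forwarding, a reader's ID must wait for the writer's WB.

```python
def count_stalls(program, forwarding=True):
    """Return the total number of stall cycles caused by RAW hazards."""
    earliest_id = {}   # register -> earliest ID cycle for an instruction that reads it
    id_cycle = 0       # ID cycle of the previous instruction
    stalls = 0
    for opcode, dest, *sources in program:
        wanted = id_cycle + 1  # ID cycle if nothing stalls
        actual = max([wanted] + [earliest_id.get(reg, 0) for reg in sources])
        stalls += actual - wanted
        id_cycle = actual
        if forwarding:
            # ALU result is bypassed to the next EX; a load needs one extra cycle
            earliest_id[dest] = id_cycle + (2 if opcode == "load" else 1)
        else:
            # reader's ID may share a cycle with the writer's WB (ID + 3)
            earliest_id[dest] = id_cycle + 3
    return stalls

add_pair = [("add", "r1", "r2", "r3"), ("add", "r4", "r1", "r5")]
load_use = [("load", "r1", "r2", "r3"), ("add", "r4", "r1", "r5")]
print(count_stalls(add_pair, forwarding=False))  # 2
print(count_stalls(add_pair, forwarding=True))   # 0
print(count_stalls(load_use, forwarding=True))   # 1
```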
![Page 28: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/28.jpg)
30
Other data hazards
• WAR (write-after-read)
– A r1, r2, r3
– B r2, r4, r5
– Hazard if B writes R2 before A reads R2
– Doesn’t happen in simple DLX pipeline, but can in others
• E.g., occurs if pipeline allows late register reads

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| SW 0(R2), R1 | IF | ID | EX | MEM1 | MEM2 | MEM3 (reads R1) | WB |
| ADD R1, R3, R4 | | IF | ID | EX | WB (writes R1) | | |
![Page 29: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/29.jpg)
31
Other data hazards (cont.)
• WAW (write-after-write)
– A r1, r2, r3
– B r1, r4, r5
– Hazard if B writes R1 before A writes R1
– Result: later instructions see wrong value in the register
– Occurs if instructions can write register file out-of-order
– This also doesn’t happen in simple DLX pipeline, but can in others:
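
| Instruction | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| LW R1, 0(R2) | IF | ID | EX | MEM1 | MEM2 | WB (writes 1st version of R1) |
| ADD R1, R2, R3 | | IF | ID | EX | WB (writes 2nd version of R1) | |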
![Page 30: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/30.jpg)
32
Other data hazards (cont.)
• Handling WAR/WAW hazards
– Stall the later instruction (stalls in WB stage)
– Detect in decode and prevent from happening by stalling earlier (easier to implement)
– Compiler: don’t reuse register specifier
– Hardware: register renaming (see next major topic – ILP)
![Page 31: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/31.jpg)
33
Types of dependencies
• True-dependence (pure-dependence, flow-dependence)
– ADD R1,R2,R3
– SUB R4,R5,R1
– May cause RAW hazards
• Anti-Dependence
– ADD R3,R2,R1
– SUB R1,R4,R5
– May cause WAR hazards
– Due to reuse: Removed by using another register
• Output-Dependence
– ADD R1,R2,R3
– SUB R1,R4,R5
– May cause WAW hazards
– Due to reuse: Removed by using another register
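The three dependence types can be detected mechanically from the register names. A minimal sketch, assuming each instruction is written as a hypothetical `(dest, src1, src2)` tuple and `first` precedes `second` in program order:

```python
def classify(first, second):
    """Return the dependences that `second` has on the earlier instruction `first`."""
    dest1, *sources1 = first
    dest2, *sources2 = second
    kinds = []
    if dest1 in sources2:
        kinds.append("true (RAW)")    # second reads what first writes
    if dest2 in sources1:
        kinds.append("anti (WAR)")    # second overwrites what first reads
    if dest1 == dest2:
        kinds.append("output (WAW)")  # both write the same register
    return kinds

print(classify(("R1", "R2", "R3"), ("R4", "R5", "R1")))  # ['true (RAW)']
print(classify(("R3", "R2", "R1"), ("R1", "R4", "R5")))  # ['anti (WAR)']
print(classify(("R1", "R2", "R3"), ("R1", "R4", "R5")))  # ['output (WAW)']
```

Only the first kind reflects a real flow of data; the other two come from register reuse, which is why renaming removes them.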
![Page 32: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/32.jpg)
34
Control hazards
• Branches throw a wrench in the cogs
– Disrupts pipeline because we don’t know what to fetch next
– Problems
• Don’t know we have a branch until decode (ID)
• Don’t know taken target until execute (EX)
• Don’t know branch direction (taken/not taken) until execute (EX) or memory (MEM) stage
![Page 33: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/33.jpg)
35
Handling Control Hazards: method 1
• stall
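[Figure: pipeline timing for BNE r1, r2 when the pipeline stalls. NOT-TAKEN: the not-taken target is fetched after the branch and waits one stall cycle before continuing. TAKEN: the not-taken target is fetched, a stall follows, and then the taken target is fetched. Labels: PC+4 known, branch known, direction known (nt), PC+offset known]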
![Page 34: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/34.jpg)
36
Handling Control Hazards: method 2
• predict not-taken
[Figure: pipeline timing for BNE r1, r2 with predict not-taken. NOT-TAKEN: the not-taken target follows the branch with no stall. TAKEN: the not-taken target is fetched first and the taken target is fetched once PC+offset is known. Labels: PC+4 known, branch known, direction known (nt), PC+offset known]
![Page 35: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/35.jpg)
37
Move branch target back
[Figure: datapath for Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory (MEM), with the “=0?” test, cond, and the branch-target MUX moved back next to the RF in ID, using an added ALU]
• Add an ALU
– May increase time for ID
– Eliminates 1 cycle, but still 1 additional stall
![Page 36: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/36.jpg)
38
Reducing branch penalty via the compiler: Delay slots
• Change the meaning of a branch so that next instruction after branch holds something useful
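A: BEQZ R1, X
B: ADD R4,R2,R3 (“delay slot”: move useful instruction here from above the branch)
...
X: ...

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| A | IF | ID | EX | MEM | WB | | |
| B | | IF | ID | EX | MEM | WB | |
| X | | | IF | ID | EX | MEM | WB |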
![Page 37: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/37.jpg)
39
Delay slots
• Add n slots to cover n holes
• ISA is changed to mean “n instructions after any branch are always executed”
• Problem:
– ISA feature that encodes pipeline structure
– Difficult to maintain across generations
– Typically can fill:
• 1 slot 75% of time
• 2 slots about 25% of time
• >2 slots almost never
![Page 38: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/38.jpg)
40
Filling slot from above branch
• Advantage
– Delay slot instruction can always execute regardless of branch outcome (don’t ever need to squash it)
• Disadvantage
– Need a “safe” instruction from above the branch
– Safe means: moving the instruction to the delay slot doesn’t violate any data dependencies
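GOOD SCENARIO

| Before | After |
|---|---|
| ADD R4, R2, R3 | BEQZ R1, X |
| BEQZ R1, X | ADD R4, R2, R3 |
| NOP | |

BAD SCENARIO (BEQZ reads R1, which the ADD writes, so the NOP stays)

| Before | After |
|---|---|
| ADD R1, R2, R3 | ADD R1, R2, R3 |
| BEQZ R1, X | BEQZ R1, X |
| NOP | NOP |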
![Page 39: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/39.jpg)
41
Filling slot from target or fall-through
• If you can’t fill slot(s) from above the branch, use instructions from either:
– Target of branch (if frequently taken)
– Fall-through of branch (if frequently not-taken)
• Example: fill from target (branch is frequently taken)
Before:
BEQZ R1, X
NOP
X: SUB R4, R2, R3
Y: …

After:
BEQZ R1, Y (change target to Y)
SUB R4, R2, R3 (copy of X)
X: SUB R4, R2, R3
Y: …
• Disadvantages
– Only works if delay slot instruction is safe to execute when branch goes the opposite (infrequent) direction
![Page 40: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/40.jpg)
42
Eliminating all stalls: Issues
• When:
– Detect an un-decoded instruction is a branch
• Where:
– Predict where the branch will go (if taken)
• Whether:
– (For conditional branches) Predict if it will be taken or not, before execution
• Optimal: try to determine all three in IF stage
– Won’t work perfectly (prediction), but we can try our best
![Page 41: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/41.jpg)
43
[Figure: complete datapath across the five stages: Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory (MEM), Writeback (WB)]
![Page 42: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/42.jpg)
44
Issues with When and Where
• What does IF know?
– Only the address of the instruction (PC)
– Keep buffer (cache) of last known branch targets around
– Buffer is written to by WB stage
[Figure: the PC value is looked up in the table of last known branches. Hit = we know it is a branch (“WHEN”), BTB returns branch target (“WHERE”). Miss = assume not a branch]
• Traditional name for this is a Branch Target Buffer (BTB)
• BTB entry: tag | pre-computed target
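A direct-mapped BTB with the tag/target entry format can be sketched as follows. The table size, the 4-byte instruction assumption, and the way the PC is split into index and tag are illustrative choices, not details taken from the slides.

```python
class BranchTargetBuffer:
    """Direct-mapped buffer of (tag, pre-computed target) entries, looked up by PC."""

    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        self.entries = [None] * num_entries

    def _split(self, pc):
        word = pc >> 2  # assumes 4-byte instructions
        return word % self.num_entries, word // self.num_entries

    def lookup(self, pc):
        """IF stage: return the predicted target on a hit, None on a miss."""
        index, tag = self._split(pc)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]  # hit: it is a branch, and this is where it went
        return None          # miss: assume not a branch

    def update(self, pc, target):
        """Record a resolved branch and its target."""
        index, tag = self._split(pc)
        self.entries[index] = (tag, target)

btb = BranchTargetBuffer()
print(btb.lookup(0x40))       # None: never seen, assume not a branch
btb.update(0x40, 0x100)
print(hex(btb.lookup(0x40)))  # 0x100
```

The tag check is what answers “WHEN”: without it, any instruction that happens to map to the same entry would be treated as a branch.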
![Page 43: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/43.jpg)
45
Predicting Where with returns
• Problem: A lot of jumps are returns from procedures
– Holding the last target address is a poor predictor
• Solution: Keep a hardware “stack” of return addresses
– Push return address when a “call” is executed
– Pop buffer on returns to get prediction
• Bottom of stack is filled with old value on a pop
– Need approx 4-8 entries for integer code
Each entry in BTB now contains: tag | pre-computed target | return?
![Page 44: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/44.jpg)
46
Return Address Stack

| Step | Event | RAS afterwards (bottom to top) | Prediction |
|---|---|---|---|
| 1 | (start) | empty | |
| 2 | X: call A | X+4 | |
| 3 | Y: call B | X+4, Y+4 | |
| 4 | ret | X+4 | Y+4 |
| 5 | ret | empty | X+4 |
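The call/return sequence above can be replayed with a small stack model. This is a simplified sketch: on overflow it drops the oldest entry, and an empty stack yields no prediction, rather than refilling the bottom with an old value on a pop as the slides describe. The addresses used for X and Y are arbitrary.

```python
class ReturnAddressStack:
    """Small hardware-style stack of predicted return addresses."""

    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def call(self, pc):
        """Push the return address (the instruction after the call)."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # full: discard the oldest entry
        self.stack.append(pc + 4)

    def predict_return(self):
        """Pop the predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None

X, Y = 0x100, 0x200
ras = ReturnAddressStack()
ras.call(X)                              # RAS: X+4
ras.call(Y)                              # RAS: X+4, Y+4
first_prediction = ras.predict_return()  # Y+4
second_prediction = ras.predict_return() # X+4
```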
![Page 45: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/45.jpg)
47
Issues with Whether
• Predicting conditional branches
– And sometimes unconditional branches if needed before decode
• Two approaches:
– Hardware to supply prediction
– Software
• Heuristics
• Profiling
![Page 46: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/46.jpg)
48
Hardware branch prediction
• 1-bit schemes (Pattern History Table):
– Add 1-bit prediction field to branch target buffer
– Set prediction field = 1 if branch was taken, 0 if branch was not taken
– At IF, check “branch prediction buffer”:
• if prediction field = 1 then predict taken
• else predict not-taken
– Problems:
• Some branches don’t do what they did last time!
• Think of a simple 10 iteration loop, start predict NT
– What is prediction accuracy?
– Isn’t this high enough?!?
• Need more sophisticated predictor
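The loop question can be answered by simulation. The sketch assumes the loop-closing branch is taken 9 times and then falls through once, and that the single prediction bit starts at not-taken.

```python
def one_bit_accuracy(outcomes, initial_prediction=False):
    """Fraction of correct predictions when predicting 'same as last time'."""
    prediction = initial_prediction
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # remember only the most recent outcome
    return correct / len(outcomes)

loop = [True] * 9 + [False]         # 10-iteration loop
print(one_bit_accuracy(loop))       # 0.8
print(one_bit_accuracy(loop * 50))  # 0.8
```

The branch is taken 90% of the time, yet the 1-bit scheme gets only 80% right: it mispredicts the loop exit and then, because of that exit, mispredicts the first iteration of the next run as well.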
![Page 47: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/47.jpg)
49
Why accuracy matters so much
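• Reduce stalls by:
– Decreasing branch penalty
• Modify the pipeline (HW question)
– Increasing accuracy
• Fancy prediction schemes
– Decreasing fraction of branches
• compile-time code ordering

$$
\text{speedup} = n \times \text{efficiency} = \frac{n}{1 + \text{stall cycles}}, \qquad \text{efficiency} = \frac{1}{1 + \text{stall cycles}}
$$

$$
\text{stall cycles} = \text{branch penalty} \times (1 - \text{accuracy}) \times \text{fraction}_{branch}
$$

$$
\text{efficiency} = \frac{1}{1 + \text{branch penalty} \times (1 - \text{accuracy}) \times \text{fraction}_{branch}}
$$

$$
\text{branch penalty} \times (1 - \text{accuracy}) \times \text{fraction}_{branch} = \frac{1}{\text{efficiency}} - 1
$$

$$
1 - \text{accuracy} = \frac{1/\text{efficiency} - 1}{\text{branch penalty} \times \text{fraction}_{branch}}
$$

$$
\text{accuracy} = 1 - \frac{1/\text{efficiency} - 1}{\text{branch penalty} \times \text{fraction}_{branch}}
$$

| branch penalty | accuracy (eff = 0.9) | accuracy (eff = 0.99) |
|---|---|---|
| 1 | 44.44% | 94.95% |
| 2 | 72.22% | 97.47% |
| 3 | 81.48% | 98.32% |
| 4 | 86.11% | 98.74% |
| 10 | 94.44% | 99.49% |
| 20 | 97.22% | 99.75% |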
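The accuracy needed for a target efficiency follows from efficiency = 1 / (1 + branch penalty × (1 − accuracy) × branch fraction), solved for accuracy. In the sketch below the branch fraction of 0.2 is an inference: it is the value that reproduces the tabulated percentages, and it matches the 20% used in the earlier stall example.

```python
def required_accuracy(efficiency, branch_penalty, branch_fraction):
    """Prediction accuracy needed to reach a given pipeline efficiency."""
    return 1 - (1 / efficiency - 1) / (branch_penalty * branch_fraction)

for penalty in (1, 2, 3, 4, 10, 20):
    low = 100 * required_accuracy(0.9, penalty, 0.2)
    high = 100 * required_accuracy(0.99, penalty, 0.2)
    print(f"{penalty:>2}  {low:.2f}%  {high:.2f}%")
```

The required accuracy climbs quickly with the branch penalty, which is why deeper pipelines need much better predictors to stay efficient.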
![Page 48: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/48.jpg)
50
Smith n-bit counter predictor
• Replace prediction bit with n-bit counter:
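[Figure: state diagram of the counter with states 11, 10, 01, 00. Each T outcome moves one state toward 11 and each N outcome moves one state toward 00, saturating at both ends. States 11 and 10 predict taken; states 01 and 00 predict not-taken. The initial state is chosen using the NT heuristic]
• Problems with n > 2 to 3: Smith called it “inertia”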
![Page 49: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/49.jpg)
51
Example of Smith counter

| outcome | T | T | T | T | T | T | N | T | T | N | N | N | N | N | N | N | N | T | T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| previous state | 01 | 10 | 11 | 11 | 11 | 11 | 11 | 10 | 11 | 11 | 10 | 01 | 00 | 00 | 00 | 00 | 00 | 00 | 01 |
| new state | 10 | 11 | 11 | 11 | 11 | 11 | 10 | 11 | 11 | 10 | 01 | 00 | 00 | 00 | 00 | 00 | 00 | 01 | 10 |

6 mispredictions out of 19 branch executions

| outcome | T | N | T | N | ... |
|---|---|---|---|---|---|
| previous state | 01 | 10 | 01 | 10 | ... |
| new state | 10 | 01 | 10 | 01 | ... |

19 mispredictions out of 19: the infamous “toggle branch”
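Both traces can be replayed with a saturating counter. The sketch assumes a 2-bit counter that starts in state 01 and predicts taken in states 10 and 11; the toggle trace is taken to be 19 alternating outcomes starting with T.

```python
def smith_mispredictions(outcomes, bits=2, initial=0b01):
    """Count mispredictions of an n-bit saturating counter on a string of T/N outcomes."""
    top = (1 << bits) - 1          # highest state, e.g. 11 for 2 bits
    threshold = 1 << (bits - 1)    # states at or above this predict taken
    state = initial
    misses = 0
    for outcome in outcomes:
        taken = (outcome == "T")
        if (state >= threshold) != taken:
            misses += 1
        # move toward 11 on taken, toward 00 on not-taken, saturating
        state = min(state + 1, top) if taken else max(state - 1, 0)
    return misses

print(smith_mispredictions("TTTTTTNTT" + "NNNNNNNN" + "TT"))  # 6
print(smith_mispredictions("TN" * 9 + "T"))                   # 19
```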
![Page 50: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/50.jpg)
52
Improving the smith counter
• Options:
– Capture correlations between branches (“global”)
– Associate predictions with branch histories, not branch addresses (use different indexing scheme)
• Gselect (global history with index selection):
[Figure: the global branch history register (BHR, bits n-1 to 0) holds the behavior of the last branches, shifting in the most recent outcome. The index (using low order bits of address) and the BHR together select an entry; each entry is a two-bit counter (or perhaps simpler)]
![Page 51: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/51.jpg)
53
Gselect example

A: BEQZ R1, D
...
D: BEQZ R1, F
...
F: NOT R1,R1
G: JUMP A
initially R1 = 0

The BHR has two bits (1 0). The matrix has one row per branch (A, D) and one column per BHR value (00, 01, 10, 11). The BHR starts at 00 and every entry starts at 01.

| Branch | BHR | Entry before | Prediction | Outcome | Entry after | New BHR |
|---|---|---|---|---|---|---|
| A | 00 | 01 | N | T | 10 | 01 |
| D | 01 | 01 | N | T | 10 | 11 |
| A | 11 | 01 | N | N | 00 | 10 |
| D | 10 | 01 | N | N | 00 | 00 |
| A | 00 | 10 | T | T | 11 | 01 |
| D | 01 | 10 | T | T | 11 | 11 |
| A | 11 | 00 | N | N | 00 | 10 |

The “Entry after” column shows the entry that was updated due to the last branch execution.
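The gselect example can be replayed in code. The sketch assumes a 2-bit BHR starting at 00, every counter starting at 01, and that the elided instructions between A, D, and F contain no branches and leave R1 unchanged, so control always flows A, D, F, G and back to A.

```python
def run_gselect(num_branches=7, history_bits=2):
    """Replay the loop and return (branch, outcome, prediction, new BHR) tuples."""
    mask = (1 << history_bits) - 1
    counters = {}   # (branch label, BHR value) -> 2-bit counter
    bhr, r1 = 0, 0  # BHR starts at 00, R1 starts at 0
    trace = []
    while True:
        for label in ("A", "D"):                           # both are BEQZ R1, ...
            counter = counters.get((label, bhr), 0b01)     # entries start at 01
            predicted = counter >= 0b10
            taken = (r1 == 0)
            counters[(label, bhr)] = min(counter + 1, 3) if taken else max(counter - 1, 0)
            bhr = ((bhr << 1) | int(taken)) & mask         # shift in the newest outcome
            trace.append((label,
                          "T" if taken else "N",
                          "T" if predicted else "N",
                          format(bhr, f"0{history_bits}b")))
            if len(trace) == num_branches:
                return trace
        r1 = ~r1                                           # F: NOT R1, R1

for row in run_gselect():
    print(row)
```

After the first pass through the loop, the history distinguishes the taken and not-taken phases of each branch, so every later prediction in this trace is correct.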
![Page 52: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/52.jpg)
54
Gshare (global history with index sharing) Branch Predictor
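[Figure: the BHR (bits n-1..0, shifting in the behavior of the last branch) is exclusive-ORed with an index taken from the low-order bits of the branch address; the result selects a table entry. Each entry is a two-bit counter (or perhaps simpler).]
• The exclusive-or (hopefully) makes sure each index/BHR combination goes to a different entry in the table.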
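A small sketch of the gshare index computation may help. It assumes 4-byte instructions (so the two lowest PC bits are dropped) and a table with 2^index_bits entries; `gshare_index` is a hypothetical helper name, not something the slides define.

```python
def gshare_index(pc, bhr, index_bits):
    """XOR the low-order branch-address bits with the global history."""
    mask = (1 << index_bits) - 1
    # Assumption: instructions are 4-byte aligned, so pc >> 2 drops the zero bits.
    return ((pc >> 2) ^ bhr) & mask

# Same branch, different histories: different table entries.
i1 = gshare_index(0x40, 0b0101, 4)
i2 = gshare_index(0x40, 0b0110, 4)

# Different branch and different history can still collide,
# which is why the slide says "(hopefully)".
i3 = gshare_index(0x44, 0b0100, 4)
print(i1, i2, i3)
```

Here `i1` and `i3` land on the same entry even though both the address and the history differ, so the XOR spreads combinations across the table but cannot rule out aliasing.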
![Page 53: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/53.jpg)
55
Yeh/Patt predictors
[Figure: local predictor (pAg). The address of the branch selects a shift register in the history table (e.g., 1110111); that history pattern indexes the pattern table of 2-bit counters (e.g., 01).]
![Page 54: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/54.jpg)
Hybrid Predictors (Yeh & Patt, 1993)
• Use global/local branch history to build (other) branch predictors
• G/g = Global, P/p = Per-address
[Figure: GAg. The branch outcome is shifted into a single GHR, which indexes a single PHT.]
![Page 55: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/55.jpg)
Yeh & Patt, 1993
• Use global/local branch history to build (other) branch predictors
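• G/g = Global, P/p = Per-address
[Figure: GAp. The branch outcome is shifted into a single GHR; the PC selects one of the per-address PHTs, and the GHR indexes it.]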
![Page 56: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/56.jpg)
Yeh & Patt, 1993
• Use global/local branch history to build (other) branch predictors
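• G/g = Global, P/p = Per-address
[Figure: PAg. The PC selects a per-address history in the branch history table (figure label: GBHT); that history indexes a single PHT.]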
![Page 57: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/57.jpg)
Yeh & Patt, 1993
• Use global/local branch history to build (other) branch predictors
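• G/g = Global, P/p = Per-address
[Figure: PAp. The PC selects a per-address history in the branch history table (figure label: GBHT) and also selects one of the per-address PHTs, which the history indexes.]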
![Page 58: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/58.jpg)
60
Yeh/Patt Example (pAg): toggle branch
• Branch outcomes:
– A: TNTNTNTNTNTNTNTNTNTN
– B: TTTTTTTTTTTTTTTTTTTTTT
• Initial state: every pattern table (PT) entry (00, 01, 10, 11) is 01; every history table (HT) entry, including those for A and B, is 00
• Trace (A and B alternate):
– A: T, pred N; PT[00] 01 → 10; HT[A] 00 → 01
– B: T, pred T; PT[00] 10 → 11; HT[B] 00 → 01
– A: N, pred N; PT[01] 01 → 00; HT[A] 01 → 10
– B: T, pred N; PT[01] 00 → 01; HT[B] 01 → 11
– A: T, pred N; PT[10] 01 → 10; HT[A] 10 → 01
– B: T, pred N; PT[11] 01 → 10; HT[B] stays 11
– A: N, pred N; PT[01] 01 → 00; HT[A] 01 → 10
– B: T, pred T; PT[11] 10 → 11; HT[B] stays 11
– A: T, pred T; PT[10] 10 → 11; HT[A] 10 → 01
• Final state: PT = 11, 00, 11, 11 (entries 00, 01, 10, 11); HT[A] = 01, HT[B] = 11
• PT entries 01, 10 are “trained” for A and 11 is “trained” for B
• In general: provides 96-98% accuracy for integer code
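The pAg example can be checked with a short simulation. The sketch assumes 2-bit per-branch histories, one shared pattern table of two-bit saturating counters starting at 01, and "predict taken when the counter is 2 or 3"; `run_pag` and the trace encoding are illustrative names only.

```python
# Minimal pAg sketch: per-address history table (HT), one shared pattern table (PT).

def run_pag(trace, hist_bits=2, init=1):
    mask = (1 << hist_bits) - 1
    pattern = [init] * (1 << hist_bits)   # PT, indexed by a branch's own history
    history = {}                          # HT, one shift register per branch
    log = []
    for branch, taken in trace:
        h = history.get(branch, 0)
        pred = pattern[h] >= 2            # predict before updating
        pattern[h] = min(pattern[h] + 1, 3) if taken else max(pattern[h] - 1, 0)
        history[branch] = ((h << 1) | int(taken)) & mask
        log.append((branch, taken, pred))
    return log, pattern, history

# A toggles (T, N, T, N, T); B is always taken; the two branches alternate.
trace = []
for a_taken in [True, False, True, False, True]:
    trace.append(("A", a_taken))
    trace.append(("B", True))
trace = trace[:9]                         # stop after A's fifth execution

log, pattern, history = run_pag(trace)
for branch, taken, pred in log:
    print(f"{branch}: {'T' if taken else 'N'}, pred {'T' if pred else 'N'}")
print([format(c, "02b") for c in pattern], history)
```

Because A's own history alternates between 01 and 10 while B's settles at 11, the two branches end up training different pattern-table entries even though the table is shared.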
![Page 59: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/59.jpg)
61
Hybrid predictors
[Figure: the address of the branch is supplied to predictor #1 (e.g., gshare), predictor #2 (e.g., bimodal), and a chooser (an array of 2-bit counters); the chooser selects which of the two predictions becomes the final prediction.]
• Both predictors supply a prediction; the pipeline uses only one
• Chooser updated based on which predictor was correct
– Increment chooser counter if #1 was correct, decrement if #2 was correct
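The chooser update rule can be sketched as follows. The table size, the initial counter value, and the convention that a counter of 2 or 3 selects predictor #1 are assumptions for illustration; the slide only states the increment/decrement rule.

```python
def choose(chooser, index, pred1, pred2):
    """Counter >= 2 selects predictor #1, otherwise predictor #2 (assumed convention)."""
    return pred1 if chooser[index] >= 2 else pred2

def update_chooser(chooser, index, pred1, pred2, taken):
    """Move the 2-bit counter toward whichever predictor was correct."""
    ok1, ok2 = pred1 == taken, pred2 == taken
    if ok1 and not ok2:
        chooser[index] = min(chooser[index] + 1, 3)
    elif ok2 and not ok1:
        chooser[index] = max(chooser[index] - 1, 0)
    # Both right or both wrong: an increment and a decrement would cancel,
    # so the counter is left unchanged.

chooser = [2] * 8                       # hypothetical size and initial value
idx = (0x400 >> 2) % len(chooser)       # index by branch address

# Predictor #2 is right twice in a row, so the chooser swings toward #2.
update_chooser(chooser, idx, pred1=False, pred2=True, taken=True)
update_chooser(chooser, idx, pred1=False, pred2=True, taken=True)
selected = choose(chooser, idx, pred1=False, pred2=True)
print(chooser[idx], selected)
```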
![Page 60: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/60.jpg)
Tournament Predictor: Alpha 21264
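[Figure: tournament predictor. The PC indexes a 1,024 × 10b local history table, whose output indexes 1,024 × 3b local prediction counters. A 12b global history, updated with each branch outcome, indexes a 4,096 × 2b global predictor and a 4,096 × 2b chooser that selects between the local and global predictions.]
• High accuracy! SPECfp95: 0.1% misprediction rate; SPECint95: 1.15% misprediction rate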
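As a quick sanity check on the hardware cost, the table sizes can be added up. The sketch reads the figure labels as 1,024 × 10b, 1,024 × 3b, and two 4,096 × 2b tables; the dictionary keys are descriptive names chosen here, not labels from the slide.

```python
# Storage budget for the tournament predictor tables (sizes as read from the figure).
tables = {
    "local history table":  (1024, 10),
    "local predictor":      (1024, 3),
    "global predictor":     (4096, 2),
    "chooser":              (4096, 2),
}

total_bits = sum(entries * width for entries, width in tables.values())
total_kbits = total_bits / 1024
print(total_bits, total_kbits)
```

Under that reading the tables add up to about 29 Kbits, not counting the 12-bit global history register itself.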
![Page 61: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/61.jpg)
63
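Handling Control Hazards: method 3
• Branch prediction – Next PC logic
[Figure: next-PC logic. The PC addresses the I$, the BTB, the branch predictor, and the RAS. A Next-PC MUX chooses among PC + 4, the predicted taken target, PC+offset from ID, and the recovered PC from EX. Control logic drives the MUX using the BTB type (conditional branch, jump/call direct, jump/call indirect) and hit signals and the predictor's taken signal. Updates arrive from IF/ID, ID/EX, and EX.]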
![Page 62: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/62.jpg)
64
Control Hazards (revisited…)
• Suppose:
– Always predict not-taken
• E.g., no BTB and no dynamic branch predictor
– ID determines when, whether, where
• Not-taken: no penalty
• Taken: 1 cycle penalty
• Is there a compiler solution?
• Pipeline (taken branch):
– A (BEQZ R1, X): IF ID EX MEM WB
– B (fall-through): IF - - - -
– X (target): IF ID EX MEM WB
![Page 63: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/63.jpg)
65
Canceling branches
• Provide two types of delayed branches
– Normal delayed branch
• Delay slot instruction is always executed
– Canceling delayed branch
• Slot filled from target (assumes taken branch): slot is squashed if branch is not-taken (branch likely inst)
• Slot filled from fall-through (assumes not-taken branch): slot is squashed if branch is taken
• Having both types greatly enhances compiler’s ability to fill most slots
– Use normal branch when a safe instruction is available to fill slot (from above, target, or fall-through)
– Use canceling branch when only unsafe candidates are available to fill slot (from target or fall-through)
![Page 64: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/64.jpg)
66
Canceling branches (cont.)
• Change encoding to add a likely bit or to add new opcodes
• If compiler thinks branch is frequently taken
– Compiler sets likely bit = 1
– Compiler fills delay slot from target
– Hardware knows to squash delay slot(s) if branch not-taken
• If compiler thinks branch is frequently not-taken
– Compiler sets likely bit = 0
– Compiler fills delay slot from fall-through
– Hardware knows to squash delay slot(s) if branch taken
• If likely bit set capriciously, most delay slot instructions must be squashed
• Encoding (field widths in bits): Opcode (6), rs1 (5), rd (5), an unlabeled 15-bit field, likely bit (1)
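The squash rule for delay slots can be written as a one-line predicate. This is a sketch of the rule as stated on the slides, with a hypothetical function name and argument encoding (likely bit as 0 or 1, branch outcome as a boolean).

```python
def slot_squashed(canceling, likely_bit, taken):
    """Return True if the delay-slot instruction must be squashed.

    canceling  -- False for a normal delayed branch, True for a canceling branch
    likely_bit -- 1: slot filled from target, 0: slot filled from fall-through
    taken      -- actual branch outcome
    """
    if not canceling:
        return False                     # normal delayed branch: slot always executes
    return taken != bool(likely_bit)     # squash when the compiler's guess was wrong

results = {
    (likely, taken): slot_squashed(True, likely, taken)
    for likely in (0, 1)
    for taken in (False, True)
}
print(results)
```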
![Page 65: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/65.jpg)
67
Methods for setting likely bit
• Likely bit is essentially a static branch prediction
– Compiler makes a prediction that is fixed for that branch
– Likely bit = 1 means predict taken
– Likely bit = 0 means predict not-taken
• Static branch prediction methods (compiler branch prediction)
– Heuristics
– Profiling
![Page 66: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/66.jpg)
68
Heuristics
• Heuristic #1: Branches are not taken
– The majority of conditional branches are not taken
– True about 60% of the time
• Heuristic #2: backward branches are taken, forward branches are not taken (BTFNT)
– Theme: Most backward branches are loops
– Notes:
• Since branches are PC relative, sign bit of offset = the prediction
• Jim Smith reports 70% accuracy for this scheme for scientific workloads
• Heuristic #3: Ball/Larus style predictions
– Set of rules to predict branches in special situations
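The BTFNT note (sign bit of the offset equals the prediction) can be made concrete. The sketch assumes a 16-bit PC-relative offset field passed as raw bits; `btfnt_predict_taken` is an illustrative name.

```python
def btfnt_predict_taken(offset_field, width=16):
    """Backward taken, forward not taken: the sign bit of the offset is the prediction."""
    return bool((offset_field >> (width - 1)) & 1)

backward = btfnt_predict_taken(0xFFFC)   # offset -4: loop-closing branch
forward = btfnt_predict_taken(0x0008)    # offset +8: skip-ahead branch
print(backward, forward)
```

No table lookup is needed, which is why this heuristic is attractive for static prediction.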
![Page 67: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/67.jpg)
69
Examples of Ball/Larus predictions
• Detect loops and use BTFNT
• Since error values returned by library functions are negative:
– …and since errors are rare
– Predict BLTZ, BLEZ, etc. not taken
– Predict BGTZ, BGEZ, etc., taken
• If a call is in the body of an if…then, predict the “then” branch as not-taken
– Since most calls in if…thens guard special case code
• Problem:
– Works great for SPECint92 (from which it was designed)
– My code might not work like that!
![Page 68: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/68.jpg)
70
Profiling
• Three steps:
– 1. Run the program [the “profiled run”]
– 2. Record the average preferred direction for each branch (taken or not-taken)
– 3. Recompile to set the likely bits
• What if the program takes inputs (e.g., sort)?
– Collect a representative set of inputs somehow
• Problems:
– One prediction for entire run
– The profiled run is slow!
• Use hardware to collect predictions
• Runs at normal speed: users don’t realize it’s profiling
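Steps 2 and 3 of profiling amount to a majority vote per branch. The following sketch assumes the profiled run yields (branch address, taken) records and that ties are resolved as not-taken; both the record format and the tie rule are assumptions made here.

```python
from collections import Counter

def likely_bits(profile):
    """Map each branch address to a likely bit from its preferred direction."""
    taken = Counter()
    total = Counter()
    for pc, was_taken in profile:
        total[pc] += 1
        taken[pc] += int(was_taken)
    # Likely bit = 1 only if the branch was taken more than half the time.
    return {pc: int(2 * taken[pc] > total[pc]) for pc in total}

profile = [(0x10, True), (0x10, True), (0x10, False),
           (0x20, False), (0x20, True)]
bits = likely_bits(profile)
print(bits)
```

The single bit per branch is exactly the "one prediction for entire run" limitation listed above: a branch that changes behavior midway still gets one fixed prediction.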
![Page 69: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/69.jpg)
71
Control hazards (slide 33)
• Branches throw a wrench in the cogs
– Disrupts pipeline because we don’t know what to fetch next
– Problems
• Don’t know we have a branch until decode (ID)
• Don’t know taken target until execute (EX or ID)
• Don’t know branch direction (taken/not taken) until execute (EX)
![Page 70: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/70.jpg)
72
Handling Control Hazards: method 1 (slide 34)
• stall
– NOT-TAKEN:
• BNE r1, r2: IF ID EX MEM WB
• not-taken target: IF stall ID EX MEM WB
– TAKEN:
• BNE r1, r2: IF ID EX MEM WB
• not-taken target: IF stall
• taken target: IF ID EX MEM WB
– Figure annotations: PC+4 known; branch known and PC+offset is known; direction known
![Page 71: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/71.jpg)
73
Handling Control Hazards: method 2 (slide 35)
• predict not-taken
– NOT-TAKEN:
• BNE r1, r2: IF ID EX MEM WB
• not-taken target: IF ID EX MEM WB
– TAKEN:
• BNE r1, r2: IF ID EX MEM WB
• not-taken target: IF ID
• instruction after the not-taken target: IF
• taken target: IF ID EX MEM WB
– Figure annotations: PC+4 known; branch known; direction known (nt) and PC+offset known
![Page 72: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/72.jpg)
74
Handling Control Hazards: method 3 (cont.)
• Branch prediction
• Case A:
– Correct branch prediction
– (BTB-hit) or (BTB-miss and not-taken)
– Pipeline:
• BNE r1, r2: IF ID EX MEM WB
• correct target: IF ID EX MEM WB
![Page 73: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/73.jpg)
75
Handling Control Hazards: method 3 (cont.)
• Case B:
– Correct branch prediction
– BTB-miss and taken
– Pipeline:
• BNE r1, r2: IF ID EX MEM WB
• not-taken target: IF
• taken target: IF ID EX MEM WB
![Page 74: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/74.jpg)
76
Handling Control Hazards: method 3 (cont.)
• Case C:
– Incorrect branch prediction
– BTB-hit
– Think about the following case: predict a taken branch (incorrect prediction) but BTB miss.
– Pipeline:
• BNE r1, r2: IF ID EX MEM WB
• incorrect target: IF ID
• instruction after the incorrect target: IF
• correct target: IF ID EX MEM WB
![Page 75: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/75.jpg)
77
Handling Control Hazards: method 4
• Compiler based approaches: delayed branch & canceling branch
– HW support
• Different opcodes for different types of branches
• Likely bit in canceling branches
– Compiler support
• Move the code
• Set the likely bit
![Page 76: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/76.jpg)
Branch Performance
[Figure: pipeline stages labeled IF D1 D2 D3 R E I, annotated with where the prediction is made (“predict”), where the target is known, and where the branch direction is known.]
• Ex: BTB has 90% hit rate, prediction accuracy is 95%, 60% of branches are taken
• Also, any misses to BTB that stall pipe will stall pipe for 1 extra cycle to update BTB
• What is the misprediction penalty for this pipeline?
– BTB hit and mispredicted: 2 cycles; BTB miss and not taken: 0 cycles; BTB miss and taken: 6 + 1 cycles
– (.90)[(.05)(2)] + (.1)[(.4)(0) + (.6)(6+1)] = (.9)(.1) + (.1)(4.2) = .09 + .42 = .51 cycles/branch
• Note: delayed branches were about .5 cycles/branch for simple pipe
• Improvement increases quickly with better prediction accuracy
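The expected-penalty arithmetic can be reproduced in a few lines. The parameter names are chosen here for readability; the numbers (90% BTB hit rate, 95% accuracy, 60% taken, penalties of 2 and 6 + 1 cycles) come from the example.

```python
def branch_penalty(btb_hit, accuracy, taken_frac,
                   mispredict_penalty, miss_taken_penalty):
    """Average penalty in cycles per branch."""
    # BTB hit: we pay only when the prediction is wrong.
    hit_cost = btb_hit * (1 - accuracy) * mispredict_penalty
    # BTB miss: not-taken branches cost nothing, taken branches pay the full penalty.
    miss_cost = (1 - btb_hit) * ((1 - taken_frac) * 0
                                 + taken_frac * miss_taken_penalty)
    return hit_cost + miss_cost

penalty = branch_penalty(0.90, 0.95, 0.60, 2, 6 + 1)
print(round(penalty, 2))
```

In this example the BTB-miss term (.42) is much larger than the misprediction term (.09), so most of the average penalty comes from taken branches that miss in the BTB.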
![Page 77: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/77.jpg)
Exceptions: Harder Still
• Exceptions: interrupt instruction execution unexpectedly
• Harder to handle in pipelines since there is more overlap
– may have five instructions in flight when exception is raised
• Common exceptions:
– I/O device interrupt
– OS call
– Arithmetic overflow, FP anomaly
– Page fault
– Misaligned memory access
– Memory protection violation
– Illegal instruction
– Power failure
![Page 78: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/78.jpg)
Restartable Exceptions
• The difficulty: implementing restartable exceptions
– exceptions that must restart interrupted instruction
• arithmetic overflow
• page fault
• some I/O
• Another program must be invoked to:
– save state
– correct exceptional condition
– restore state
• Invisible to original program
– restartable exceptions needed for virtual memory
![Page 79: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/79.jpg)
Precise Exceptions and Pipelining
• CPU has precise exceptions if the faulting instruction and all instructions after it can be stopped and restarted, and:
– all previous instructions executed and committed
– no later instruction committed
• Required for implementing virtual memory/demand paging
• Save PC and state at excepting instruction
– later on restore state, restart on that PC
• Force exception vector into pipe for next IF
• Turn off all writes for faulting instruction and later
– earlier instructions finish as normal
• OS saves excepting PC+register file and handles exception
– Does this always work?
![Page 80: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/80.jpg)
Implementing Exceptions
• What if EPC is in a branch delay slot?
– can we restart the instruction in the delay slot?
• Must save EPC for faulting instruction – but restart from branch!
– obvious: save 2 PCs
• OS issues rfe (return from exception) instruction
– restarts user program and re-enters user mode
![Page 81: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/81.jpg)
Precise Exceptions
• Required for implementing virtual memory/demand paging
– all microprocessors implement precise exceptions for integer pipes
– also required for IEEE 754 floating-point compliance
• FP execute out of order for performance
– hard to achieve precise exceptions there
• Some provide two modes: (a) performance mode, (b) precise mode
– precise mode restricts overlap – slow
– Alpha 21064/21164, MIPS R8000
• ~10× slower
![Page 82: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/82.jpg)
Out-of-order Exceptions
lw IF ID EX MEM WB
add IF ID EX MEM WB
• Load can cause (among other things) page fault in MEM
• Add can cause (among other things) overflow in EX
• Two exceptions in the same cycle!
– handle page fault first
– in case of tie handle “earlier” instruction
• It gets worse…
![Page 83: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/83.jpg)
Out-of-order Exceptions
lw IF ID EX MEM WB
add IF ID EX MEM WB
• Load can cause (among other things) page fault in MEM
• Add can cause (among other things) page fault in IF
– need precise integer exceptions – cannot handle this one yet
– need to wait until load completes (or causes an exception)
• Implement exception status vector, carried with each instruction
– turn off all writes on an exception
– prevent stores in MEM stage
– check vector in WB (instructions before are complete)
– handle earliest exception in program order
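The exception status vector idea can be illustrated with a toy model. It assumes a five-stage pipeline with one instruction entering per cycle and no stalls; the `simulate` function, its input format, and the exception names are all hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def simulate(program):
    """program: list of (name, {stage: exception}) in program order.

    Returns the exceptions in the order they are raised in time, and the
    one that is actually handled when status vectors are checked at WB.
    """
    raised = []      # (cycle, name, exception), sorted by time
    vectors = {}     # per-instruction exception status vector
    for i, (name, faults) in enumerate(program):
        vectors[name] = []
        for s, stage in enumerate(STAGES):
            if stage in faults:
                raised.append((i + s, name, faults[stage]))
                vectors[name].append(faults[stage])   # record only, do not trap yet
    raised.sort()
    # Instructions reach WB in program order, so the first non-empty vector
    # belongs to the earliest excepting instruction in program order.
    for name, _ in program:
        if vectors[name]:
            return raised, (name, vectors[name][0])
    return raised, None

program = [("lw", {"MEM": "page fault"}), ("add", {"IF": "page fault"})]
raised, handled = simulate(program)
print(raised)
print(handled)
```

The add's fault is raised first in time (cycle 1 versus cycle 3), but deferring the decision to WB means the load's fault is the one handled, which keeps exceptions precise.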
![Page 84: CDA 5106 Advanced Computer Architecture I Pipeliningdcm/Teaching/CDA5106-Fall2015/AdditionalReadings/...Computer Science Department University of Central Florida CDA 5106 Advanced](https://reader031.vdocuments.us/reader031/viewer/2022020302/5adbb3067f8b9afc0f8e317a/html5/thumbnails/84.jpg)
When Does State Change?
• An instruction is committed when it is guaranteed to complete
– easy to restart if state changed only when committed
• MIPS: WB
• VAX: auto-increment mode, state updated in middle of inst, need HW support to back out, undo – “roll back” state changes
• Some architectures have string copy instructions
– updates memory – cannot undo 100%
– general-purpose registers hold all state
– instruction continues after exception rather than restart