Post on 20-Dec-2015
L18 – Pipeline Issues 1 Comp 411 – Spring 2008 04/03/08
CPU Pipelining Issues
Finishing up Chapter 6
“This pipe stuff makes my head hurt!”
“What have you been beating your head against?”
5-Stage miniMIPS
[Figure: 5-stage miniMIPS datapath. Stages: Instruction Fetch, Register File, ALU, Memory, Write Back, separated by the pipeline registers PCREG/IRREG, PCALU/IRALU, PCMEM/IRMEM, and PCWB/IRWB.]
• Omits some details
• NO bypass or interlock logic
Data memory timing: the address is available right after the instruction enters the Memory stage, and the data is needed just before the rising clock edge at the end of the Write Back stage, so the memory has almost 2 clock cycles to respond.
Pipelining
Improve performance by increasing instruction throughput
Ideal speedup is number of stages in the pipeline. Do we achieve this?
[Figure: two timing diagrams (time axis in ns, program execution order downward) for the sequence lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg. Unpipelined: each instruction takes 8 ns and the next one starts only when it finishes. Pipelined: a new instruction starts every 2 ns.]
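A short calculation helps answer the question of whether the ideal speedup is reached. The sketch below assumes only the timings shown in the figure (8 ns per unpipelined instruction, a 2 ns pipelined clock, 5 stages); the function names are illustrative.

```python
# Timings read off the figure; all names here are illustrative.
UNPIPELINED_NS = 8   # one complete instruction, start to finish
STAGE_NS = 2         # pipelined clock period, set by the slowest stage
STAGES = 5

def unpipelined_time(n):
    """Total time for n instructions executed strictly one after another."""
    return n * UNPIPELINED_NS

def pipelined_time(n):
    """The first instruction needs STAGES cycles; each later one adds a cycle."""
    return (STAGES + n - 1) * STAGE_NS

def speedup(n):
    return unpipelined_time(n) / pipelined_time(n)

if __name__ == "__main__":
    for n in (3, 1000, 1_000_000):
        print(n, unpipelined_time(n), pipelined_time(n), round(speedup(n), 3))
```

For the three loads this gives 24 ns versus 14 ns. Even for very long runs the speedup approaches 8 / 2 = 4 rather than 5, because the stages are not equally long: every stage is stretched to the 2 ns clock, so a single instruction now takes 10 ns instead of 8 ns.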
Pipelining
What makes it easy?
• all instructions are the same length
• just a few instruction formats
• memory operands appear only in loads and stores
What makes it hard?
• structural hazards: suppose we had only one memory
• control hazards: need to worry about branch instructions
• data hazards: an instruction depends on a previous instruction
Individual instructions still take the same number of cycles, but we've improved the throughput by increasing the number of simultaneously executing instructions.
Structural Hazards
[Figure: four instructions issued in consecutive cycles, each passing through Inst Fetch, Reg Read, ALU, Data Access, Reg Write.]
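The "only one memory" case can be made concrete with a small sketch. It assumes one instruction enters the pipeline per cycle with no stalls, and that only loads and stores use memory in the Data Access stage; `memory_conflicts` is a hypothetical helper, not part of the lecture.

```python
STAGES = ["Inst Fetch", "Reg Read", "ALU", "Data Access", "Reg Write"]

def memory_conflicts(uses_data_memory):
    """Return (cycle, fetching_instr, accessing_instr) for every cycle in
    which one instruction is fetched while another accesses data memory.

    uses_data_memory[i] is True if instruction i is a load or store.
    Instruction i (0-based) enters Inst Fetch in cycle i + 1.
    """
    fetch_stage = STAGES.index("Inst Fetch")
    data_stage = STAGES.index("Data Access")
    conflicts = []
    for i, is_mem in enumerate(uses_data_memory):
        if not is_mem:
            continue
        cycle = i + 1 + data_stage          # cycle in which i is in Data Access
        fetcher = cycle - 1 - fetch_stage   # instruction being fetched then
        if fetcher < len(uses_data_memory):
            conflicts.append((cycle, fetcher, i))
    return conflicts

if __name__ == "__main__":
    # Four instructions, as in the figure; assume the first one is a load.
    print(memory_conflicts([True, False, False, False]))  # [(4, 3, 0)]
```

In cycle 4 the load is in Data Access while the fourth instruction is in Inst Fetch, so a single shared memory would be needed twice in the same cycle.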
Data Hazards
Problem with starting the next instruction before the first is finished:
• dependencies that “go backward in time” are data hazards
[Figure: pipeline diagram over clock cycles CC 1 to CC 9 (stages IM, Reg, ALU, DM, Reg) for the sequence below, in program execution order.]
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2: 10 in CC 1 to CC 4, 10/–20 in CC 5, –20 in CC 6 to CC 9.
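The backward dependencies in this sequence can be found mechanically. The sketch below is illustrative only: it assumes one instruction enters the pipeline per cycle, no forwarding, and a register file that is written in the first half of a cycle and read in the second half (consistent with the 10/–20 entry in CC 5). The tuple encoding of instructions is an assumption of this example.

```python
# Each instruction is (text, destination register or None, source registers).
PROGRAM = [
    ("sub $2, $1, $3",  2,    (1, 3)),
    ("and $12, $2, $5", 12,   (2, 5)),
    ("or $13, $6, $2",  13,   (6, 2)),
    ("add $14, $2, $2", 14,   (2, 2)),
    ("sw $15, 100($2)", None, (15, 2)),
]

REG_READ_STAGE = 2     # IM = 1, Reg = 2, ALU = 3, DM = 4, Reg (write back) = 5
WRITE_BACK_STAGE = 5

def data_hazards(program):
    """List (producer, consumer, register) triples where the consumer reads
    the register file before the producer has written it."""
    hazards = []
    for p, (p_text, dest, _) in enumerate(program):
        if dest is None:
            continue
        write_cycle = p + WRITE_BACK_STAGE
        for c in range(p + 1, len(program)):
            c_text, c_dest, sources = program[c]
            read_cycle = c + REG_READ_STAGE
            if dest in sources and read_cycle < write_cycle:
                hazards.append((p_text, c_text, dest))
            if c_dest == dest:
                break   # register overwritten; later readers depend on c
    return hazards

if __name__ == "__main__":
    for producer, consumer, reg in data_hazards(PROGRAM):
        print(f"{consumer!r} reads ${reg} before {producer!r} writes it")
```

Under these assumptions only the and and the or are affected: the add reads $2 in CC 5, the same cycle in which it is written, and the sw reads it one cycle later.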
Software Solution
Have the compiler guarantee no hazards.
Where do we insert the “nops”?
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Problem: this really slows us down!
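One way to answer the nop question is to let a small scheduling pass do it. This is a minimal sketch, assuming no forwarding and a register file that can be written and read in the same cycle, so a reader must sit at least three instruction slots after the writer. `insert_nops` and the tuple encoding are hypothetical, chosen for this example.

```python
# Each instruction is (text, destination register or None, source registers).
PROGRAM = [
    ("sub $2, $1, $3",  2,    (1, 3)),
    ("and $12, $2, $5", 12,   (2, 5)),
    ("or $13, $6, $2",  13,   (6, 2)),
    ("add $14, $2, $2", 14,   (2, 2)),
    ("sw $15, 100($2)", None, (15, 2)),
]

def insert_nops(program, min_distance=3):
    """Pad `program` with nops so that every reader of a register is at
    least `min_distance` instruction slots after the instruction that
    wrote it."""
    nop = ("nop", None, ())
    out = []
    last_write = {}          # register -> index in `out` of its latest writer
    for text, dest, sources in program:
        # Earliest slot that respects every pending dependence.
        earliest = max((last_write[r] + min_distance
                        for r in sources if r in last_write), default=0)
        while len(out) < earliest:
            out.append(nop)
        if dest is not None:
            last_write[dest] = len(out)
        out.append((text, dest, sources))
    return out

if __name__ == "__main__":
    for text, _, _ in insert_nops(PROGRAM):
        print(text)
```

For this sequence the pass places two nops between the sub and the and; the remaining instructions are then far enough from the sub. Two wasted slots out of seven illustrates why this approach slows the machine down.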
Forwarding
Use temporary results; don’t wait for them to be written.
• register file forwarding to handle read/write to same register
• ALU forwarding
[Figure: pipeline diagram over clock cycles CC 1 to CC 9 (stages IM, Reg, ALU, DM, Reg) for the sequence below, with forwarding paths from the pipeline registers to the ALU inputs.]
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2: 10 in CC 1 to CC 4, 10/–20 in CC 5, –20 in CC 6 to CC 9.
Value of EX/MEM: –20 in CC 4, X (don’t care) in all other cycles.
Value of MEM/WB: –20 in CC 5, X (don’t care) in all other cycles.
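The selection that ALU forwarding performs can be sketched as a function. This follows the usual textbook rule (prefer the most recent producer, never forward $0) and is an assumption about the logic, not a description of the actual hardware on the slide; `PipeReg` and `forward` are names invented for this example.

```python
from typing import NamedTuple, Optional

class PipeReg(NamedTuple):
    """The fields of a pipeline register that the forwarding logic inspects."""
    reg_write: bool          # will this instruction write the register file?
    dest: Optional[int]      # destination register number, if any
    value: int               # result carried in the pipeline register

def forward(src_reg, regfile_value, ex_mem, mem_wb):
    """Choose the value an ALU operand should use for register src_reg.

    The most recent producer wins, so EX/MEM is checked before MEM/WB.
    Register $0 is never forwarded because it always reads as zero.
    """
    if src_reg != 0:
        if ex_mem.reg_write and ex_mem.dest == src_reg:
            return ex_mem.value
        if mem_wb.reg_write and mem_wb.dest == src_reg:
            return mem_wb.value
    return regfile_value

if __name__ == "__main__":
    empty = PipeReg(False, None, 0)
    # CC 4: the and is in the ALU stage; the sub result (-20) sits in EX/MEM.
    print(forward(2, 10, PipeReg(True, 2, -20), empty))
    # CC 5: the or is in the ALU stage; the sub result is now in MEM/WB and
    # EX/MEM holds the and (destination $12; its value is a placeholder).
    print(forward(2, 10, PipeReg(True, 12, 0), PipeReg(True, 2, -20)))
```

Both calls return –20 even though the register file still holds the stale value 10, which matches the EX/MEM and MEM/WB rows listed above.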
Can't always forward
Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register.
Thus, we need a hazard detection unit to “stall” the instruction.
[Figure: pipeline diagram over clock cycles CC 1 to CC 9 (stages IM, Reg, ALU, DM, Reg) for the sequence below, in program execution order.]
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
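The check a hazard detection unit makes can be written as a single condition. The sketch below uses the common textbook formulation, in which the instruction being decoded is compared with a load one stage ahead of it; the signal names are assumptions, not taken from the slide.

```python
def must_stall(ahead_is_load, ahead_dest, decode_sources):
    """Return True if the instruction in decode must wait one cycle.

    ahead_is_load  -- the instruction one stage ahead is a load
    ahead_dest     -- register that instruction will write
    decode_sources -- registers read by the instruction being decoded
    """
    return ahead_is_load and ahead_dest in decode_sources

if __name__ == "__main__":
    # lw $2, 20($1) is one stage ahead of and $4, $2, $5.
    print(must_stall(True, 2, (2, 5)))    # True: loaded value not ready yet
    # One cycle later the and is ahead of or $8, $2, $6; it is not a load.
    print(must_stall(False, 4, (2, 6)))   # False: forwarding can supply $2
```

Only the instruction immediately after the load needs the stall in this model; later readers of $2 can be served by forwarding.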
Stalling
We can stall the pipeline by keeping an instruction in the same stage.
[Figure: pipeline diagram over clock cycles CC 1 to CC 10 for the sequence below. The instructions after the lw are held in their stages for one cycle, which inserts a bubble into the pipeline.]
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
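The effect of the bubble on timing can be worked out with a few lines. This sketch assumes a one-cycle stall whenever an instruction reads the register written by the load directly before it, and that everything behind the stalled instruction is held back by the same amount; the encoding and the name `alu_cycles` are illustrative.

```python
PROGRAM = [
    # (text, is_load, destination register, source registers)
    ("lw $2, 20($1)",  True,  2, (1,)),
    ("and $4, $2, $5", False, 4, (2, 5)),
    ("or $8, $2, $6",  False, 8, (2, 6)),
    ("add $9, $4, $2", False, 9, (4, 2)),
    ("slt $1, $6, $7", False, 1, (6, 7)),
]

def alu_cycles(program):
    """Clock cycle in which each instruction occupies the ALU stage.

    Without stalls, instruction i (0-based) reaches the ALU in cycle i + 3.
    """
    cycles = []
    delay = 0
    for i, (_, _, _, sources) in enumerate(program):
        if i > 0:
            _, prev_is_load, prev_dest, _ = program[i - 1]
            if prev_is_load and prev_dest in sources:
                delay += 1      # one bubble, inherited by all later instructions
        cycles.append(i + 3 + delay)
    return cycles

if __name__ == "__main__":
    cycles = alu_cycles(PROGRAM)
    print(cycles)               # [3, 5, 6, 7, 8]
    print(cycles[-1] + 2)       # write back of the last instruction: 10
```

The last instruction writes back in CC 10 instead of CC 9, which is why the time axis of the stalled diagram is one cycle longer.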
Branch Hazards
When we decide to branch, other instructions are in the pipeline!
We are predicting “branch not taken”:
• need to add hardware for flushing instructions if we are wrong
[Figure: pipeline diagram over clock cycles CC 1 to CC 9 for the instructions below (address, then instruction), in program execution order.]
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
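The addresses in this example follow from the MIPS branch rule: the offset counts words from the instruction after the branch. The sketch below also lists which instructions are fetched before the outcome is known, assuming the branch is resolved in the fourth pipeline stage; that stage number is an assumption that matches the three instructions shown between the beq and its target.

```python
def branch_target(pc, offset):
    """MIPS beq target: address of the next instruction plus `offset` words."""
    return pc + 4 + offset * 4

def fetched_before_resolution(pc, resolve_stage):
    """Addresses fetched after the branch at `pc` but before it resolves.

    resolve_stage is the 1-based pipeline stage in which the branch outcome
    becomes known; one younger instruction is fetched per cycle until then.
    These are the instructions to flush if the branch is taken.
    """
    return [pc + 4 * k for k in range(1, resolve_stage)]

if __name__ == "__main__":
    print(branch_target(40, 7))               # 72
    print(fetched_before_resolution(40, 4))   # [44, 48, 52]
```

Resolving the branch earlier shrinks the list: with resolution in the second stage only the instruction at address 44 is in flight.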
Improving Performance
Try to avoid stalls! E.g., reorder these instructions:
lw $t0, 0($t1)
lw $t2, 4($t1)
sw $t2, 0($t1)
sw $t0, 4($t1)
Add a “branch delay slot”
• the next instruction after a branch is always executed
• rely on compiler to “fill” the slot with something useful
Superscalar: start more than one instruction in the same cycle
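The benefit of reordering the four loads and stores can be checked by counting load-use stalls before and after. This is a sketch under a simple model: any instruction that reads a register loaded by the instruction directly before it costs one stall. The tuple encoding and the reordered sequence are this example's own choice of answer, not taken from the slide.

```python
def load_use_stalls(program):
    """Count instructions that read a register loaded by the instruction
    immediately before them (one stall each in this simple model)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls

# (opcode, destination register or None, source registers)
ORIGINAL = [
    ("lw", "$t0", ("$t1",)),          # lw $t0, 0($t1)
    ("lw", "$t2", ("$t1",)),          # lw $t2, 4($t1)
    ("sw", None,  ("$t2", "$t1")),    # sw $t2, 0($t1)
    ("sw", None,  ("$t0", "$t1")),    # sw $t0, 4($t1)
]
# Swapping the two stores is safe: they write different addresses and
# neither one changes a register.
REORDERED = [ORIGINAL[0], ORIGINAL[1], ORIGINAL[3], ORIGINAL[2]]

if __name__ == "__main__":
    print(load_use_stalls(ORIGINAL))    # 1
    print(load_use_stalls(REORDERED))   # 0
```

Storing $t0 first gives the second load one extra cycle to complete before $t2 is needed.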
Dynamic Scheduling
The hardware performs the “scheduling”
• hardware tries to find instructions to execute
• out of order execution is possible
• speculative execution and dynamic branch prediction
All modern processors are very complicated
• Pentium 4: 20 stage pipeline, 6 simultaneous instructions
• PowerPC and Pentium: branch history table
• Compiler technology important
5-Stage miniMIPS
[Figure: the 5-stage miniMIPS datapath (Instruction Fetch, Register File, ALU, Memory, Write Back) annotated with the additions listed below, including NOP injection and the Data Memory connection.]
We wanted a simple, clean pipeline but…
• added CLK EN to freeze IF/RF stages so we can wait for lw to reach WB stage
• broke the sequential semantics of ISA by adding a branch delay-slot and early branch resolution logic
• added A/B bypass muxes to get data before it’s written to regfile
Bypass MUX Details
[Figure: the Register File outputs RD1 and RD2 feed the A Bypass and B Bypass muxes, whose other inputs come from ALU/MEM/WB/PC. The bypassed values feed the ASEL and BSEL muxes (with shamt, “16”, and SEXT(imm) as alternatives) on the way to the ALU, and also supply JT, the =BZ branch logic, and WDALU (to Mem).]
The previous diagram was oversimplified. The bypass muxes really need to precede the A and B muxes so that they provide the correct values for the jump target (JT), the write data, and the early branch decision logic.
Final 5-Stage miniMIPS
[Figure: the final 5-stage miniMIPS datapath (Instruction Fetch, Register File, ALU, Memory, Write Back) with bypass paths, NOP injection points, and Data Memory.]
• Added branch delay slot and early branch resolution logic to fix a CONTROL hazard
• Added lots of bypass paths and detection logic to fix various DATA hazards
• Added pipeline interlocks to fix the load delay DATA hazard
Pipeline Summary (I)
• Started with unpipelined implementation
– direct execute, 1 cycle/instruction
– it had a long cycle time: mem + regs + alu + mem + wb
• We ended up with a 5-stage pipelined implementation
– increase throughput (3x???)
– delayed branch decision (1 cycle): choose to execute instruction after branch
– delayed register writeback (3 cycles): add bypass paths (6 x 2 = 12) to forward correct value
– memory data available only in WB stage: introduce NOPs at IRALU, to stall IF and RF stages until LD result was ready
Pipeline Summary (II)
Fallacy #1: Pipelining is easy
Smart people get it wrong all of the time!
Fallacy #2: Pipelining is independent of ISA
Many ISA decisions impact how easy/costly it is to implement pipelining (e.g. branch semantics, addressing modes).
Fallacy #3: Increasing pipeline stages improves performance
Diminishing returns. Increasing complexity.
RISC = Simplicity???
[Figure: diagram with the labels “Primitive Machines with direct implementations”, “Generalization of registers and operand coding”, “Addressing features, e.g. index registers”, “Complex instructions, addressing modes”, “RISCs”, “Pipelines, Bypasses, Annulment, …”, and “VLIWs, Super-Scalars ?”.]
“The P.T. Barnum World’s Tallest Dwarf Competition”
World’s Most Complex RISC?