Pipelined Datapath
Lecture notes from MKP, H. H. Lee and S. Yalamanchili
(2)
Reading
• Sections 4.5 – 4.10
• Practice Problems: 1, 3, 8, 12
(3)
Pipeline Performance
• Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages
• Compare pipelined datapath with single-cycle datapath

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps                          700ps
R-format  200ps        100ps          200ps                  100ps           600ps
beq       200ps        100ps          200ps                                  500ps
(4)
Pipeline Performance
Single-cycle (Tc = 800ps)
Pipelined (Tc = 200ps)
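The two clock periods can be derived from the stage latencies. The sketch below is only an illustration: the stage names and the table of which stages each instruction class uses are an assumed encoding of the timing table in these notes.

```python
# Stage latencies in picoseconds, as given in the notes.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction class actually uses (assumed encoding of the table).
USES = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}


def total_time(instr):
    """Time for one instruction if it only pays for the stages it uses."""
    return sum(STAGE_PS[s] for s in USES[instr])


# Single-cycle: the clock must fit the slowest instruction.
single_cycle_tc = max(total_time(i) for i in USES)
# Pipelined: the clock must fit the slowest stage.
pipelined_tc = max(STAGE_PS.values())

print(single_cycle_tc, pipelined_tc)  # 800 200
```

The single-cycle clock is set by lw, the slowest instruction, while the pipelined clock is set by the slowest stage.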
(5)
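Pipeline Speedup
• If all stages are balanced, i.e., all take the same time
  - Time between instructions (pipelined) = time between instructions (non-pipelined) / number of stages
• If not balanced, speedup is less
• Speedup due to increased throughput
  - Latency (time for each instruction) does not decrease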
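A rough sketch of the speedup for the 800ps single-cycle and 200ps pipelined clocks used in these notes. The function names are illustrative, and the model assumes a five-stage pipeline with no stalls.

```python
def single_cycle_time(n, tc=800):
    """Total time in ps for n instructions, one per 800ps cycle."""
    return n * tc


def pipelined_time(n, tc=200, stages=5):
    """Total time in ps: the first instruction needs `stages` cycles,
    and each later instruction completes one cycle after the previous one."""
    return (stages + n - 1) * tc


def speedup(n):
    return single_cycle_time(n) / pipelined_time(n)


for n in (1, 3, 1_000_000):
    print(n, round(speedup(n), 3))
```

For a long instruction stream the speedup approaches 800/200 = 4 rather than the stage count of 5, because the stages are not balanced. A single instruction takes 5 × 200ps = 1000ps in the pipeline, so its latency is not reduced.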
(6)
Basic Idea
• All instructions are 32 bits
• Few & regular instruction formats
• Alignment of memory operands
(7)
Pipelining
• What makes it easy
  - All instructions are the same length
  - Simple instruction formats
  - Memory operands appear only in loads and stores
• What makes it hard?
  - Structural hazards: suppose we had only one memory
  - Control hazards: need to worry about branch instructions
  - Data hazards: an instruction depends on a previous instruction
• What really makes it hard:
  - Exception handling
  - Trying to improve performance with out-of-order execution, etc.
(8)
Pipeline registers
• Need registers between stages
  - To hold information produced in previous cycle
Pipeline stage execution time
(9)
Graphically Representing Pipelines
• Shading indicates the unit is being used by the instruction
• Shading on the right half of the register file (ID or WB) or memory means the element is being read in that stage
• Shading on the left half means the element is being written in that stage
[Figure: lw followed by add, each passing through IF, ID, EX, MEM, WB; time axis marked 2, 4, 6, 8, 10]
(10)
Graphically Representing Pipelines
• Can help with answering questions like:
  - How many cycles does it take to execute this code?
  - What is the ALU doing during cycle 4?
• Use this representation to help understand datapaths
[Figure: program execution order (in instructions) versus time (in clock cycles, CC 1 to CC 6); lw $10, 20($1) followed by sub $11, $2, $3, each using IM, Reg, ALU, DM, Reg]
(11)
Structural Hazard
[Figure: lw, add, sub, add, each passing through IF, ID, EX, MEM, WB; time axis marked 2, 4, 6, 8, 10]
Need to separate instruction and data memory
(12)
IF for Load, Store, …
Pipeline stage execution time
(13)
ID for Load, Store, …
Pipeline stage execution time
(14)
EX for Load
Pipeline stage execution time
(15)
MEM for Load
Pipeline stage execution time
(16)
WB for Load
Wrong register number
(17)
Corrected Datapath for Load
Pipeline stage execution time
(18)
EX for Store
Pipeline stage execution time
(19)
MEM for Store
Pipeline stage execution time
(20)
WB for Store
Pipeline stage execution time
(21)
Pipelining Example
[Figure: pipelined datapath with IF/ID, ID/EX, EX/MEM and MEM/WB registers; axis label: pipeline stage execution time]
Instructions shown above the datapath, left to right:
add $14, $5, $6
lw $13, 24($1)
add $12, $3, $4
sub $11, $2, $3
lw $10, 20($1)
Note what is happening in the register file
(22)
Pipelined Control (Simplified)
(23)
Pipelined Control
• Control signals derived from instruction
  - As in single-cycle implementation
• Pass control signals along like data

Control lines by stage:
  Execution/Address Calculation stage: RegDst, ALUOp1, ALUOp0, ALUSrc
  Memory access stage: Branch, MemRead, MemWrite
  Write-back stage: RegWrite, MemtoReg

Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc  Branch  MemRead  MemWrite  RegWrite  MemtoReg
R-format     1       1       0       0       0       0        0         1         0
lw           0       0       0       1       0       1        0         1         1
sw           X       0       0       1       0       0        1         0         X
beq          X       0       1       0       1       0        0         0         X
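One way to picture "pass control signals along like data" is to store the control table as three bit groups per instruction and drop one group at each pipeline register. This is a minimal sketch, not the hardware: the `Control` type and `pass_along` helper are hypothetical names, and "X" marks a don't-care bit.

```python
from typing import NamedTuple


class Control(NamedTuple):
    ex: str   # RegDst, ALUOp1, ALUOp0, ALUSrc
    mem: str  # Branch, MemRead, MemWrite
    wb: str   # RegWrite, MemtoReg


# Control values from the table in these notes.
CONTROL = {
    "R-format": Control("1100", "000", "10"),
    "lw": Control("0001", "010", "11"),
    "sw": Control("X001", "001", "0X"),
    "beq": Control("X010", "100", "0X"),
}


def pass_along(instr):
    """Control groups held in each pipeline register for one instruction.

    Each stage consumes its own group, so later registers carry fewer bits.
    """
    c = CONTROL[instr]
    return {
        "ID/EX": (c.wb, c.mem, c.ex),
        "EX/MEM": (c.wb, c.mem),
        "MEM/WB": (c.wb,),
    }


print(pass_along("lw"))
```

For lw, ID/EX holds WB = 11, M = 010, EX = 0001, and only the WB group is still present in MEM/WB.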
(24)
Pipelined Control
(25)
Datapath with Control
[Figure: pipelined datapath with control]
IF: lw $10, 9($1)
(26)
Datapath with Control
[Figure: pipelined datapath with control]
IF: sub $11, $2, $3   ID: lw $10, 9($1)
Control generated in ID for “lw”: WB = 11, M = 010, EX = 0001
(27)
Datapath with Control
[Figure: pipelined datapath with control]
IF: and $12, $4, $5   ID: sub $11, $2, $3   EX: lw $10, 9($1)
Control generated in ID for “sub”: WB = 10, M = 000, EX = 1100
(28)
Datapath with Control
[Figure: pipelined datapath with control]
IF: or $13, $6, $7   ID: and $12, $4, $5   EX: sub $11, $2, $3   MEM: lw $10, 9($1)
Control generated in ID for “and”: WB = 10, M = 000, EX = 1100
(29)
Datapath with Control
[Figure: pipelined datapath with control]
IF: add $14, $8, $9   ID: or $13, $6, $7   EX: and $12, $4, $5   MEM: sub $11, ..   WB: lw $10, 9($1)
Control generated in ID for “or”: WB = 10, M = 000, EX = 1100
(30)
Datapath with Control
[Figure: pipelined datapath with control]
IF: xxxx   ID: add $14, $8, $9   EX: or $13, $6, $7   MEM: and $12…   WB: sub $11, ..
Control generated in ID for “add”: WB = 10, M = 000, EX = 1100
(31)
Datapath with Control
[Figure: pipelined datapath with control]
IF: xxxx   ID: xxxx   EX: add $14, $8, $9   MEM: or $13, ..   WB: and $12…
(32)
Datapath with Control
[Figure: pipelined datapath with control]
IF: xxxx   ID: xxxx   EX: xxxx   MEM: add $14, ..   WB: or $13…
(33)
Datapath with Control
[Figure: pipelined datapath with control]
IF: xxxx   ID: xxxx   EX: xxxx   MEM: xxxx   WB: add $14..
(34)
Data Hazards (4.7)
• An instruction depends on completion of data access by a previous instruction
add $s0, $t0, $t1
sub $t2, $s0, $t3
(35)
Dependencies
• Problem with starting next instruction before first is finished
  - Dependencies that “go backward in time” are data hazards
[Figure: program execution order (in instructions) versus time (in clock cycles, CC 1 to CC 9) for the sequence below; value of register $2 is 10 in CC 1 to CC 4, 10/–20 in CC 5, and –20 in CC 6 to CC 9]
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
(36)
Software Solution
• Have compiler guarantee no hazards
• Where do we insert the “nops”?
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
• Problem: this really slows us down!
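To make the nop question concrete, here is a small sketch of what a compiler pass might do. It assumes a five-stage pipeline without forwarding, with the register file written in the first half of a cycle and read in the second half, so a consumer must sit at least three instruction slots behind its producer. The parser is a hypothetical helper that only understands the instruction shapes used in these notes.

```python
import re


def parse(line):
    """Return (destination, sources) for a simple MIPS-style line."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"\$\w+", rest)
    if op in ("sw", "beq"):      # these write no register
        return None, regs
    return regs[0], regs[1:]     # first operand is the destination


def insert_nops(program, gap=3):
    """Pad with nops so every consumer is >= `gap` slots after its producer."""
    out = []
    last_write = {}              # register -> index in `out` of its last writer
    for line in program:
        dest, srcs = parse(line)
        need = 0
        for s in srcs:
            if s in last_write:
                need = max(need, last_write[s] + gap - len(out))
        out.extend(["nop"] * need)
        if dest is not None:
            last_write[dest] = len(out)
        out.append(line)
    return out


program = [
    "sub $2, $1, $3",
    "and $12, $2, $5",
    "or $13, $6, $2",
    "add $14, $2, $2",
    "sw $15, 100($2)",
]
for line in insert_nops(program):
    print(line)
```

Under these assumptions two nops go between sub and and; the later instructions are already far enough from sub.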
(37)
A Better Solution
• Consider this sequence:
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
• We can resolve hazards with forwarding
  - How do we detect when to forward?
(38)
Dependencies & Forwarding
Do not wait for results to be written to the register file: find them in the pipeline and forward them to the ALU
(39)
Forwarding
[Figure: pipelined datapath with control]
IF: add $14, $8, $9   ID: or $13, $6, $7   EX: and $6, $4, $5   MEM: sub $11, ..   WB: lw $10, 9($1)
(40)
Forwarding (simplified)
[Figure: simplified datapath with Register File, ID/EX, MUX, ALU, EX/MEM, Data Memory, MEM/WB]
(41)
Forwarding (from EX/MEM)
[Figure: simplified datapath with MUXes on the ALU inputs]
(42)
Forwarding (from MEM/WB)
[Figure: simplified datapath with MUXes on the ALU inputs]
(43)
Forwarding (operand selection)
[Figure: simplified datapath with MUXes on the ALU inputs and a Forwarding Unit]
(44)
Forwarding (operand propagation)
[Figure: simplified datapath with a Forwarding Unit; labeled signals: Rs, Rt, Rd, EX/MEM Rd, MEM/WB Rd]
Combinational Logic!
(45)
Detecting the Need to Forward
• Pass register numbers along pipeline
  - e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
• ALU operand register numbers in EX stage are given by ID/EX.RegisterRs, ID/EX.RegisterRt
• Data hazards when
  1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
  1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
      (1a, 1b: forward from EX/MEM pipeline register)
  2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
  2b. MEM/WB.RegisterRd = ID/EX.RegisterRt
      (2a, 2b: forward from MEM/WB pipeline register)
(46)
Detecting the Need to Forward
• But only if forwarding instruction will write to a register!
  - EX/MEM.RegWrite, MEM/WB.RegWrite
• And only if Rd for that instruction is not $zero
  - EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0
(47)
Forwarding Paths
(48)
Forwarding Conditions
• EX hazard
  if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
      and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
    ForwardA = 10
  if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
      and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
    ForwardB = 10
• MEM hazard
  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
      and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
    ForwardA = 01
  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
      and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
    ForwardB = 01
(49)
Double Data Hazard
• Consider the sequence:
add $1, $1, $2
add $1, $1, $3
add $1, $1, $4
• Both hazards occur
  - Want to use the most recent
• Revise MEM hazard condition
  - Only forward if EX hazard condition isn’t true
(50)
Revised Forwarding Condition
• MEM hazard
  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
      and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
               and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
      and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
    ForwardA = 01
  if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
      and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
               and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
      and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
    ForwardB = 01
Note: the “not” clause checks precedence of the EX hazard
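The forwarding conditions, including the precedence of the EX hazard, can be sketched as a small function. This is only an illustrative software model of the combinational logic: the function name and the (RegWrite, RegisterRd) tuple layout are assumptions, and it is called once per ALU operand (Rs for ForwardA, Rt for ForwardB).

```python
def forward(ex_mem, mem_wb, id_ex_reg):
    """Return the 2-bit forwarding mux select for one ALU operand.

    ex_mem and mem_wb are (RegWrite, RegisterRd) pairs taken from the
    EX/MEM and MEM/WB pipeline registers; id_ex_reg is ID/EX.RegisterRs
    (for ForwardA) or ID/EX.RegisterRt (for ForwardB).
    """
    ex_write, ex_rd = ex_mem
    mem_write, mem_rd = mem_wb
    # EX hazard is tested first, so it takes precedence over the MEM hazard.
    if ex_write and ex_rd != 0 and ex_rd == id_ex_reg:
        return 0b10   # operand comes from the EX/MEM register
    if mem_write and mem_rd != 0 and mem_rd == id_ex_reg:
        return 0b01   # operand comes from the MEM/WB register
    return 0b00       # no hazard: operand comes from the register file


# Third add of: add $1,$1,$2 / add $1,$1,$3 / add $1,$1,$4
# Both older adds write $1; the most recent result (EX/MEM) must win.
print(format(forward((True, 1), (True, 1), 1), "02b"))  # 10
```

Testing the EX hazard first plays the role of the "not (EX hazard)" clause in the revised MEM hazard condition.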
(51)
Datapath with Forwarding
(52)
Concurrent Execution
• Correct execution is about managing dependencies
  - Producer-consumer
  - Structural (using the same hardware component)
• We will come across other types of dependencies later!
(53)
Load-Use Data Hazard
Need to stall for one cycle
(54)
Forwarding
[Figure: pipelined datapath with control]
IF: add $14, $8, $9   ID: or $13, $6, $7   EX: and $6, $4, $11   MEM: lw $11, 0($2)   WB: lw $10, 9($1)
(55)
Load-Use Hazard Detection
• Check when using instruction is decoded in ID stage
• ALU operand register numbers in ID stage are given by IF/ID.RegisterRs, IF/ID.RegisterRt
• Load-use hazard when
  ID/EX.MemRead and
    ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
     (ID/EX.RegisterRt = IF/ID.RegisterRt))
• If detected, stall and insert bubble
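The load-use test translates almost directly into code. In the sketch below the output names (`PCWrite`, `IFIDWrite`, `ZeroControl`) are hypothetical labels for "hold the PC", "hold IF/ID" and "force the ID/EX control values to 0".

```python
def load_use_hazard(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in EX is a load whose destination (Rt)
    is a source register of the instruction currently in ID."""
    return bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)


def hazard_unit(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Stall actions for one cycle (signal names are illustrative)."""
    stall = load_use_hazard(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt)
    return {
        "PCWrite": not stall,     # hold the PC so the next fetch repeats
        "IFIDWrite": not stall,   # hold IF/ID so the decode repeats
        "ZeroControl": stall,     # insert a bubble into ID/EX
    }


# lw $11, 0($2) in EX, and $6, $4, $11 in ID: Rt of the load matches Rt of the and.
print(hazard_unit(True, 11, 4, 11))
```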
(56)
Code Scheduling to Avoid Stalls
• Reorder code to avoid use of load result in the next instruction
• C code for A = B + E; C = B + F;

Original order (13 cycles):
lw $t1, 0($t0)
lw $t2, 4($t0)
  (stall)
add $t3, $t1, $t2
sw $t3, 12($t0)
lw $t4, 8($t0)
  (stall)
add $t5, $t1, $t4
sw $t5, 16($t0)

Reordered (11 cycles):
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
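The 13-cycle and 11-cycle figures can be reproduced with a short model. It assumes a five-stage pipeline with full forwarding, where the only stall is one cycle when an instruction uses the result of the load immediately before it. The tiny parser is a hypothetical helper that handles only lw, sw and three-register ALU instructions, and it conservatively treats both registers of sw as sources.

```python
import re


def count_cycles(program, stages=5):
    """Cycles to run `program`, counting one stall per load-use pair."""
    stalls = 0
    prev_load_dest = None
    for line in program:
        op, rest = line.split(None, 1)
        regs = re.findall(r"\$\w+", rest)
        srcs = regs if op == "sw" else regs[1:]
        if prev_load_dest in srcs:
            stalls += 1
        prev_load_dest = regs[0] if op == "lw" else None
    # The first instruction takes `stages` cycles, each later one adds a cycle.
    return len(program) + (stages - 1) + stalls


original = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "add $t3, $t1, $t2",
    "sw $t3, 12($t0)", "lw $t4, 8($t0)", "add $t5, $t1, $t4",
    "sw $t5, 16($t0)",
]
reordered = [
    "lw $t1, 0($t0)", "lw $t2, 4($t0)", "lw $t4, 8($t0)",
    "add $t3, $t1, $t2", "sw $t3, 12($t0)", "add $t5, $t1, $t4",
    "sw $t5, 16($t0)",
]
print(count_cycles(original), count_cycles(reordered))  # 13 11
```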
(57)
How to Stall the Pipeline
• Force control values in ID/EX register to 0
  - EX, MEM and WB perform a nop (no-operation)
• Prevent update of PC and IF/ID register
  - Using instruction is decoded again
  - Following instruction is fetched again
  - 1-cycle stall allows MEM to read data for lw
    o Can subsequently forward to EX stage
(58)
Stall/Bubble in the Pipeline
Stall inserted here
(59)
Stall/Bubble in the Pipeline
Or, more accurately…
(60)
Datapath with Hazard Detection
Pipeline stage execution time
ALUSrc mux is missing!
(61)
Control Hazards (4.8)
• Branch instruction determines flow of control
  - Fetching next instruction depends on branch outcome
  - Pipeline cannot always fetch correct instruction
    o Still working on ID stage of branch
• In MIPS pipeline
  - Need to compare registers and determine the branch condition
(62)
Branch Hazards
• If branch outcome determined in MEM
[Figure: pipeline diagram with the PC marked and the annotation “Flush these instructions (set control values to 0)”]
(63)
Reducing Branch Delay
• Move hardware to determine outcome to ID stage
  - Target address adder
  - Register comparator
  - Add IF.Flush signal to squash IF/ID register
• Example: branch taken
36: sub $10, $4, $8
40: beq $1, $3, 72
44: and $12, $2, $5
48: or $13, $2, $6
52: add $14, $4, $2
56: slt $15, $6, $7
    ...
72: lw $4, 50($7)
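As a worked check on the example, the sketch below computes the branch target and the number of wrong-path instructions. It assumes the standard MIPS PC-relative encoding (target = PC + 4 + offset × 4, matching the "Shift left 2" and adder in the datapath) and a five-stage pipeline that keeps fetching sequentially until the branch is resolved. The function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def branch_target(pc, offset_words):
    """Target address of a MIPS branch at `pc` with a word offset."""
    return pc + 4 + (offset_words << 2)


def branch_offset(pc, target):
    """Word offset that must be encoded to reach `target` from `pc`."""
    return (target - (pc + 4)) >> 2


def wrong_path_instructions(resolve_stage):
    """Instructions fetched after a taken branch before it is resolved."""
    return STAGES.index(resolve_stage)


print(branch_offset(40, 72))            # 7
print(branch_target(40, 7))             # 72
print(wrong_path_instructions("MEM"))   # 3
print(wrong_path_instructions("ID"))    # 1
```

Resolving the branch in ID instead of MEM cuts the number of flushed instructions from three to one in this model.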
(64)
Example: Branch Taken
(65)
Example: Branch Taken
(66)
Data Hazards for Branches
• If a comparison register is a destination of 2nd or 3rd preceding ALU instruction

add $1, $2, $3       IF ID EX MEM WB
add $4, $5, $6          IF ID EX MEM WB
…                          IF ID EX MEM WB
beq $1, $4, target            IF ID EX MEM WB

• Can resolve using forwarding
(67)
Data Hazards for Branches
• If a comparison register is a destination of preceding ALU instruction or 2nd preceding load instruction
  - Need 1 stall cycle

lw $1, addr          IF ID EX MEM WB
add $4, $5, $6          IF ID EX MEM WB
beq stalled                IF ID
beq $1, $4, target            ID EX MEM WB
(68)
Data Hazards for Branches
• If a comparison register is a destination of immediately preceding load instruction
  - Need 2 stall cycles

lw $1, addr          IF ID EX MEM WB
beq stalled             IF ID
beq stalled                ID
beq $1, $0, target            ID EX MEM WB
(69)
Delay Slot (MIPS)
• Expose pipeline
• Load and jump/branch entail a “delay slot”
• The instruction right after the jump or branch is executed before the jump/branch
jal function_A
add $4, $5, $6 ; executed before jmp
lw $12, 8($4) ; executed after return
• Jump/branch and the delay slot instruction are considered “indivisible”
• In the delay slot, the compiler needs to schedule
  - A useful instruction (either before the jmp, or after the jmp w/o side effect)
  - Otherwise a NOP
(70)
Branch Prediction
• Longer pipelines cannot readily determine branch outcome early
  - Stall penalty becomes unacceptable
• Predict outcome of branch
  - Only stall if prediction is wrong
• In MIPS pipeline
  - Can predict branches not taken
  - Fetch instruction after branch, with no delay
(71)
MIPS with Predict Not Taken
Prediction correct
Prediction incorrect
(72)
1-Bit Predictor: Shortcoming
• Inner loop branches mispredicted twice!
outer: …
       …
inner: …
       …
       beq …, …, inner
       …
       beq …, …, outer
• Mispredict as taken on last iteration of inner loop
• Then mispredict as not taken on first iteration of inner loop next time around
(73)
2-Bit Predictor: State Machine
• Only change prediction on two successive mispredictions
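The difference between the two predictors shows up when an inner-loop branch is simulated. The sketch below assumes a 2-bit saturating counter (predict taken when the counter is 2 or 3), which is one common way to realize the state machine and is not necessarily the exact encoding in the slide's figure. Both predictors are assumed to start out predicting taken.

```python
def mispredictions_1bit(outcomes, state=True):
    """1-bit predictor: predict whatever the branch did last time."""
    misses = 0
    for taken in outcomes:
        misses += state != taken
        state = taken
    return misses


def mispredictions_2bit(outcomes, counter=3):
    """2-bit saturating counter: predict taken when counter >= 2."""
    misses = 0
    for taken in outcomes:
        misses += (counter >= 2) != taken
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


# Inner-loop branch: taken three times, then falls through; loop entered 3 times.
inner = [True, True, True, False] * 3
print(mispredictions_1bit(inner), mispredictions_2bit(inner))  # 5 3
```

After the first pass, the 1-bit predictor misses twice per execution of the inner loop (exit and re-entry), while the 2-bit counter misses only on the exit.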
(74)
More-Realistic Branch Prediction
• Static branch prediction
  - Based on typical branch behavior
  - Example: loop and if-statement branches
    o Predict backward branches taken
    o Predict forward branches not taken
• Dynamic branch prediction
  - Hardware measures actual branch behavior
    o e.g., record recent history of each branch
  - Assume future behavior will continue the trend
    o When wrong, stall while re-fetching, and update history
(75)
AMD Bobcat
[Figure: AMD Bobcat block diagram, source: http://hothardware.com; regions annotated “Instruction Level Parallelism (ILP)”, “Later in this course” and “ECE 6100”]
(76)
Intel Sandy Bridge
bdti.com
(77)
Exceptions and Interrupts (4.9)
• “Unexpected” events requiring change in flow of control
  - Different ISAs use the terms differently
• Exception
  - Arises within the CPU
    o e.g., undefined opcode, overflow, syscall, …
• Interrupt
  - From an external I/O controller
• Dealing with them without sacrificing performance is hard
(78)
Handling Exceptions
• In MIPS, exceptions managed by a System Control Coprocessor (CP0)
• Save PC of offending (or interrupted) instruction
  - In MIPS: Exception Program Counter (EPC)
• Save indication of the problem
  - In MIPS: Cause register
  - We’ll assume 1-bit
    o 0 for undefined opcode, 1 for overflow
• Jump to handler at 80000180
(79)
An Alternate Mechanism
• Vectored Interrupts
  - Handler address determined by the cause
• Example:
  - Undefined opcode: C000 0000
  - Overflow: C000 0020
  - …: C000 0040
• Instructions either
  - Deal with the interrupt, or
  - Jump to real handler
(80)
Handler Actions
• Read cause, and transfer to relevant handler
• Determine action required
• If restartable
  - Take corrective action
  - Use EPC to return to program
• Otherwise
  - Terminate program
  - Report error using EPC, cause, …
(81)
Exceptions in a Pipeline
• Another form of control hazard
• Consider overflow on add in EX stage
add $1, $2, $1
  - Prevent $1 from being clobbered
  - Complete previous instructions
  - Flush add and subsequent instructions
  - Set Cause and EPC register values
  - Transfer control to handler
• Similar to mispredicted branch
  - Use much of the same hardware
(82)
Pipeline with Exceptions
(83)
Exception Properties
• Restartable exceptions
  - Pipeline can flush the instruction
  - Handler executes, then returns to the instruction
    o Re-fetched and executed from scratch
• PC saved in EPC register
  - Identifies causing instruction
  - Actually PC + 4 is saved
    o Handler must adjust
(84)
Exception Example
• Exception on add in
40 sub $11, $2, $4
44 and $12, $2, $5
48 or $13, $2, $6
4C add $1, $2, $1
50 slt $15, $6, $7
54 lw $16, 50($7)
…
• Handler
80000180 sw $25, 1000($0)
80000184 sw $26, 1004($0)
…
(85)
Exception Example
(86)
Exception Example
(87)
Multiple Exceptions
• Pipelining overlaps multiple instructions
  - Could have multiple exceptions at once
• Simple approach: deal with exception from earliest instruction
  - Flush subsequent instructions
  - “Precise” exceptions
• In complex pipelines
  - Multiple instructions issued per cycle
  - Out-of-order completion
  - Maintaining precise exceptions is difficult!
(88)
Imprecise Exceptions
• Just stop pipeline and save state
  - Including exception cause(s)
• Let the handler work out
  - Which instruction(s) had exceptions
  - Which to complete or flush
    o May require “manual” completion
• Simplifies hardware, but more complex handler software
• Not feasible for complex multiple-issue out-of-order pipelines
(89)
Performance• How do we assess the impact of stall cycles?
• How close do we approach the ideal of one instruction per cycle execution time?
• Back to the CPI model!
(90)
Recall: Program Execution time
~= Instruction_count * CPIavg * clock_cycle_time
algorithms/compiler architecture technology
Relative frequency
Number of instruction classes
(91)
Assessing Performance
• Ideal CPI is increased by dependencies
• Performance impact on CPI can be assessed by computing the impact on a per instruction basis
  Effective CPI = Base CPI + Probability_of_event * penalty_for_event
  - For example, an event may be a branch misprediction or the occurrence of a data hazard
  - The probability is computed for the occurrence of the event on an instruction
• Examples: pipelined processors
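A small sketch of the CPI model, summing base CPI plus probability × penalty over several kinds of events. All of the numbers below are made up for illustration and do not come from these notes or from any real processor.

```python
def effective_cpi(base_cpi, events):
    """Base CPI plus, for each event, probability per instruction * penalty."""
    return base_cpi + sum(prob * penalty for prob, penalty in events)


# Hypothetical workload (illustrative numbers only):
#   20% of instructions are branches, 25% of those are mispredicted, 1-cycle penalty
#   25% of instructions are loads, 40% of those cause a load-use stall, 1-cycle penalty
events = [
    (0.20 * 0.25, 1),
    (0.25 * 0.40, 1),
]
cpi = effective_cpi(1.0, events)
print(round(cpi, 2))
```

With these assumed frequencies the stalls add about 0.15 cycles per instruction to the ideal CPI of 1.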
(92)
Instruction-Level Parallelism (ILP) (4.10)
• Pipelining: executing multiple instructions in parallel
• To increase ILP
  - Deeper pipeline
    o Less work per stage ⇒ shorter clock cycle
  - Multiple issue
    o Replicate pipeline stages ⇒ multiple pipelines
    o Start multiple instructions per clock cycle
    o CPI < 1, so use Instructions Per Cycle (IPC)
    o E.g., 4GHz 4-way multiple-issue: 16 BIPS, peak CPI = 0.25, peak IPC = 4
    o But dependencies reduce this in practice
(93)
Multiple Issue
• Static multiple issue
  - Compiler groups instructions to be issued together
  - Packages them into “issue slots”
  - Compiler detects and avoids hazards
• Dynamic multiple issue
  - CPU examines instruction stream and chooses instructions to issue each cycle
  - Compiler can help by reordering instructions
  - CPU resolves hazards using advanced techniques at runtime
(94)
MIPS with Static Dual Issue
• Two-issue packets
  - One ALU/branch instruction
  - One load/store instruction
  - 64-bit aligned
    o ALU/branch, then load/store
    o Pad an unused instruction with nop

Address  Instruction type  Pipeline Stages
n        ALU/branch        IF ID EX MEM WB
n + 4    Load/store        IF ID EX MEM WB
n + 8    ALU/branch           IF ID EX MEM WB
n + 12   Load/store           IF ID EX MEM WB
n + 16   ALU/branch              IF ID EX MEM WB
n + 20   Load/store              IF ID EX MEM WB
(95)
MIPS with Static Dual Issue
Address computation
ALU computation
ALU operation Load/store
Aligned Instruction Pair
(96)
Instruction Level Parallelism (ILP)
[Figure: pipeline with IF, ID, EX, MEM, WB; annotation: multiple instructions in EX at the same time]
• Single (program) thread of execution
• Issue multiple instructions from the same instruction stream
• Average CPI < 1
• Often called out of order (OOO) cores
(97)
Dynamically Scheduled CPU
[Figure annotations:]
• Preserves dependencies
• Hold pending operands
• Results also sent to any waiting reservation stations
• Reorder buffer for register writes
• Can supply operands for issued instructions
(98)
AMD Bulldozer
forum.beyond3d.com
(99)
AMD Bobcat
[Figure: AMD Bobcat block diagram, source: http://hothardware.com; regions annotated “Instruction Level Parallelism (ILP)”, “Later in this course” and “ECE 6100”]
(100)
The P4 Microarchitecture
From “The Microarchitecture of the Pentium 4 Processor,” G. Hinton et al., Intel Technology Journal Q1, 2001
(101)
Study Guide
• Given a code block, and initial register values (those that are accessed), be able to determine state of all pipeline registers at some future clock cycle
• Determine the size of each pipeline register
• Track pipeline state in the case of forwarding and branches
• Compute the number of cycles to execute a code block
• Modify the datapath to include forwarding and hazard detection for branches (this is trickier and time consuming but well worth it)
(102)
Study Guide (cont.)
• Schedule code (manually) to improve performance, for example to eliminate hazards and fill delay slots
• Modify the data path to add new instructions such as j
• Modify the data path to accommodate a two cycle data memory access, i.e., the data memory itself is a two cycle pipeline
  - Modify the forwarding and hazard control logic
• Given a code sequence, be able to compute the number of stall cycles
(103)
Study Guide (cont.)
• Track the state of the 2-bit branch predictor over a sequence of branches in a code segment, for example a for-loop
• Show the pipeline state before and after an exception has taken place
(104)
Glossary
• Branch prediction
• Branch hazards
• Branch delay
• Control hazard
• Data hazard
• Delay slot
• Dynamic instruction issue
• Forwarding
• Imprecise exception
• Instruction scheduling
• Instruction level parallelism (ILP)
• Load-to-use hazard
• Pipeline bubbles
• Stall cycles
• Static instruction issue
• Structural hazard