chap.6: enhancing performance with pipelining
Post on 21-Mar-2016
52 Views
Preview:
DESCRIPTION
TRANSCRIPT
Chap.6: Enhancing Performance with Pipelining
Jen-Chang Liu, Spring 2006
Parts of the slides are duplicated frominst.eecs.berkeley.edu/~cs61c
Review Datapath (1/3)Datapath is the hardware that performs operations necessary to execute programs.Control instructs datapath on what to do next.Datapath needs:
access to storage (general purpose registers and memory) computational ability (ALU) helper hardware (local registers and PC)
Review Datapath (2/3)Five stages of datapath (executing an instruction):
1. Instruction Fetch (Increment PC)2. Instruction Decode (Read Registers)3. ALU (Computation)4. Memory Access5. Write to Registers
ALL instructions must go through ALL five stages.
Review Datapath (3/3)P
C
inst
ruct
ion
mem
ory
+4
rtrsrd
regi
ster
s
ALU
Dat
am
emor
y
imm
1. InstructionFetch
2. Decode/ Register
Read
3. Execute 4. Memory5. WriteBack
Review Single cycle datapath
Multi-cycle datapath
Instructionfetch
Data/registerread
Instructionexecution
Memory/registerread/write
Registerwrite
Instructionfetch
Data/registerread
Instructionexecution
Memory/registerread/write
Registerwrite
Outline Overview A pipeline datapath Pipelined control Data hazards
Forwarding Stalls
Branch hazards Superscalar and dynamic pipelining
Time76 PM 8 9 10 11 12 1 2 AM
A
B
C
D
Time76 PM 8 9 10 11 12 1 2 AM
A
B
C
D
Taskorder
Taskorder
What’s pipelining?Time
76 PM 8 9 10 11 12 1 2 AM
A
B
C
D
Time76 PM 8 9 10 11 12 1 2 AM
A
B
C
D
Taskorder
Taskorder
洗衣 乾衣折疊收藏non-pipelined
pipelining
Use different resources at the same time
Pipelining Definition: an implementation
technique in which multiple instructions are overlapped in execution
How to achieve pipelining? An instruction is divided into steps
(stages) We have separate resources for each
stage
Instructionfetch Reg ALU Data
access Reg
8 nsInstruction
fetch Reg ALU Dataaccess Reg
8 nsInstruction
fetch
8 ns
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 4 6 8 10 12 14 16 18
2 4 6 8 10 12 14
...
Programexecutionorder(in instructions)
Instructionfetch Reg ALU Data
access Reg
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 nsInstruction
fetch Reg ALU Dataaccess Reg
2 nsInstruction
fetch Reg ALU Dataaccess Reg
2 ns 2 ns 2 ns 2 ns 2 ns
Programexecutionorder(in instructions)
Single-cycle implementation
Pipelined implementation
Instructionfetch Reg ALU Data
access Reg
8 nsInstruction
fetch Reg ALU Dataaccess Reg
8 nsInstruction
fetch
8 ns
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 4 6 8 10 12 14 16 18
2 4 6 8 10 12 14
...
Programexecutionorder(in instructions)
Instructionfetch Reg ALU Data
access Reg
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 nsInstruction
fetch Reg ALU Dataaccess Reg
2 nsInstruction
fetch Reg ALU Dataaccess Reg
2 ns 2 ns 2 ns 2 ns 2 ns
Programexecutionorder(in instructions)
What does pipelining improve?
Pipelining improves performance by Increasing the instruction throughput 單位時間完成的指令數目增加 Decreasing the execution time of an individual instruction 單一指令執行時間不變
Ideal conditions for savingTime between inst. pipelined =
Time between inst. non-pipelined
Number of pipe stages
Design instruction set for pipelining
MIPS has been designed for pipelining1. Instructions are of the same length
Easier to fetch and decode Ex. 80x86 inst.s ranges from 1 to 17 bytes
2. A few instruction formats, with the source register fields located in the same place
可以在決定是甚麼指令前讀暫存器
Design instruction set for pipelining (cont.)
3. Memory operands only appear for loads and stores
Calculate memory address in execution stage
4. Operands are aligned in memory One memory access for data transfer
Instructionfetch
Instruction Decoding,
Data/registerread
Instructionexecution
Memory/registerread/write
Registerwrite
0 1 2 3
Aligned
NotAligned
Pipeline hazards Hazards: the situations in pipelining
when the next instruction cannot execute in the following clock cycle使下一個指令無法在下一個 cycle 執行
Instructionfetch Reg ALU Data
access Reg
8 nsInstruction
fetch Reg ALU Dataaccess Reg
8 nsInstruction
fetch
8 ns
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 4 6 8 10 12 14 16 18
2 4 6 8 10 12 14
...
Programexecutionorder(in instructions)
Instructionfetch Reg ALU Data
access Reg
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 nsInstruction
fetch Reg ALU Dataaccess Reg
2 nsInstruction
fetch Reg ALU Dataaccess Reg
2 ns 2 ns 2 ns 2 ns 2 ns
Programexecutionorder(in instructions)
危險
甚麼時候下個指令無法執行?
Instructionfetch Reg ALU Data
access Reg
8 nsInstruction
fetch Reg ALU Dataaccess Reg
8 nsInstruction
fetch
8 ns
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 4 6 8 10 12 14 16 18
2 4 6 8 10 12 14
...
Programexecutionorder(in instructions)
Instructionfetch Reg ALU Data
access Reg
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 nsInstruction
fetch Reg ALU Dataaccess Reg
2 nsInstruction
fetch Reg ALU Dataaccess Reg
2 ns 2 ns 2 ns 2 ns 2 ns
Programexecutionorder(in instructions)
1. Structural hazards The hardware cannot support the
combination of instructions that we want to execute in the same clock cycle
Instructionfetch Reg ALU Data
access Reg
8 nsInstruction
fetch Reg ALU Dataaccess Reg
8 nsInstruction
fetch
8 ns
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 4 6 8 10 12 14 16 18
2 4 6 8 10 12 14
...
Programexecutionorder(in instructions)
Instructionfetch Reg ALU Data
access Reg
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 nsInstruction
fetch Reg ALU Dataaccess Reg
2 nsInstruction
fetch Reg ALU Dataaccess Reg
2 ns 2 ns 2 ns 2 ns 2 ns
Programexecutionorder(in instructions)
硬體結構問題
2 memory access at the sameclock?
Structural Hazard #1: Single Memory
Solution:Second memory
infeasible and inefficient to createTwo Level 1 Caches
Cache: a temporary smaller [of usually most recently used] copy of memory
have both an L1 Instruction Cache and an L1 Data Cache
need more complex hardware to control when both caches miss
Instructionfetch Reg ALU Data
access Reg
8 nsInstruction
fetch Reg ALU Dataaccess Reg
8 nsInstruction
fetch
8 ns
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 4 6 8 10 12 14 16 18
2 4 6 8 10 12 14
...
Programexecutionorder(in instructions)
Instructionfetch Reg ALU Data
access Reg
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 nsInstruction
fetch Reg ALU Dataaccess Reg
2 nsInstruction
fetch Reg ALU Dataaccess Reg
2 ns 2 ns 2 ns 2 ns 2 ns
Programexecutionorder(in instructions)
1. Structural hazard #2: single register
Instructionfetch Reg ALU Data
access Reg
8 nsInstruction
fetch Reg ALU Dataaccess Reg
8 nsInstruction
fetch
8 ns
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 4 6 8 10 12 14 16 18
2 4 6 8 10 12 14
...
Programexecutionorder(in instructions)
Instructionfetch Reg ALU Data
access Reg
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 nsInstruction
fetch Reg ALU Dataaccess Reg
2 nsInstruction
fetch Reg ALU Dataaccess Reg
2 ns 2 ns 2 ns 2 ns 2 ns
Programexecutionorder(in instructions)
Can’t read and write to registers simultaneously?
Structural Hazard #2: Registers
Fact: Register access is VERY fast: takes less than half the time of ALU stage
Solution: introduce conventionalways Write to Registers during first half of each clock cycle
always Read from Registers during second half of each clock cycle
Result: can perform Read and Write during same clock cycle
2. Control hazards The need to make a decision based on
the results of one instruction while others are executing Ex. Branch instruction
下一個指令需等待前面指令的執行結果
add $4, $5, $6 beq $1, $2, 40 lw $3, 300($0)40: or $7, $8, $9
? …無法判斷下一個指令是那個…
Solution to control hazards #1: stall 拖延 stall (bubble): 暫停下一個指令執行
Instructionfetch Reg ALU Data
access Reg
Time
beq $1, $2, 40
add $4, $5, $6
lw $3, 300($0)4 ns
Instructionfetch Reg ALU Data
access Reg2ns
Instructionfetch Reg ALU Data
access Reg
2ns
2 4 6 8 10 12 14 16Programexecutionorder(in instructions)
nop: no operationnop: no operationInstruction
fetch Reg ALU Dataaccess Reg
Time
beq $1, $2, 40
add $4, $5, $6
lw $3, 300($0)4 ns
Instructionfetch Reg ALU Data
access Reg2ns
Instructionfetch Reg ALU Data
access Reg
2ns
2 4 6 8 10 12 14 16Programexecutionorder(in instructions)
completion of branchFilling two nops after branch is inefficient!
Solution to control hazards #1: stall
Stall (bubble): 暫停下一個指令執行Instruction
fetch Reg ALU Dataaccess Reg
Time
beq $1, $2, 40
add $4, $5, $6
lw $3, 300($0)4 ns
Instructionfetch Reg ALU Data
access Reg2ns
Instructionfetch Reg ALU Data
access Reg
2ns
2 4 6 8 10 12 14 16Programexecutionorder(in instructions)
假設 branch 的位址計算,比較都可用另外的 hardware 在此 stage 完成
Solution to control hazards #2:
delayed branchadd $4, $5, $6 beq $1, $2, 40 lw $3, 300($0)
40: or $7, $8, $9? …
Instructionfetch Reg ALU Data
access Reg
Time
beq $1, $2, 40
add $4, $5, $6
lw $3, 300($0)
Instructionfetch Reg ALU Data
access Reg2 ns
Instructionfetch Reg ALU Data
access Reg
2 ns
2 4 6 8 10 12 14
2 ns
(Delayed branch slot)
Programexecutionorder(in instructions)
Redefine branchesOld definition: if we take the branch, none of the instructions after the branch get executed by accident
New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (called the branch-delay slot)
The term “Delayed Branch” meanswe always execute inst after branch
Solution to control hazards #2:
delayed branch (cont.)
Notes on Branch-Delay SlotWorst-Case Scenario: can always put a no-op in the branch-delay slot
Better Case: can find an instruction preceding the branch which can be placed in the branch-delay slot without affecting flow of the program
re-ordering instructions is a common method of speeding up programs
compiler must be very smart in order to find instructions to do this
usually can find such an instruction at least 50% of the time
Solution to control hazards #2:
delayed branch (cont.)
Solution to control hazards #3: predict 預測 預測下一個可能執行的指令 .
Ex. 猜 branch always not takenInstruction
fetch Reg ALU Dataaccess Reg
Time
beq $1, $2, 40
add $4, $5, $6
lw $3, 300($0)
Instructionfetch Reg ALU Data
access Reg2 ns
Instructionfetch Reg ALU Data
access Reg2 ns
Programexecutionorder(in instructions)
Instructionfetch Reg ALU Data
access Reg
Time
beq $1, $2, 40
add $4, $5 ,$6
or $7, $8, $9
Instructionfetch Reg ALU Data
access Reg
2 4 6 8 10 12 14
2 4 6 8 10 12 14
Instructionfetch Reg ALU Data
access Reg
2 ns
4 ns
bubble bubble bubble bubble bubble
Programexecutionorder(in instructions)
(not taken沒有跳走 )
(taken)
Instructionfetch Reg ALU Data
access Reg
Time
beq $1, $2, 40
add $4, $5, $6
lw $3, 300($0)
Instructionfetch Reg ALU Data
access Reg2 ns
Instructionfetch Reg ALU Data
access Reg2 ns
Programexecutionorder(in instructions)
Instructionfetch Reg ALU Data
access Reg
Time
beq $1, $2, 40
add $4, $5 ,$6
or $7, $8, $9
Instructionfetch Reg ALU Data
access Reg
2 4 6 8 10 12 14
2 4 6 8 10 12 14
Instructionfetch Reg ALU Data
access Reg
2 ns
4 ns
bubble bubble bubble bubble bubble
Programexecutionorder(in instructions)
Solution to control hazards: predict (cont.)
Example: easy to predict
Dynamic prediction hardware Record the history of branch results 90% accuracy
Loop: … … beq $1, $2, Loop
3. Data hazard An instruction depends on the results of
a previous instruction still in the pipeline
Solution 1: complier (assembler)
下一個指令需要的資料需等待前一個指令執行結果Example:
add $s0, $t0, $t1sub $t2, $s0, $t3
Standard notation for steps in MIPS
Time2 4 6 8 10
add $s0, $t0, $t1 IF ID WBEX MEM
Instructionfetch
Instruction Decoding,
Data/registerread
Instructionexecution
Memory/registerread/write
Registerwrite
讀出 寫入
Solution to data hazards: forwarding
Forwarding: Getting the missing item early from the internal resourceadd $s0, $t0, $t1sub $t2, $s0, $t3 等待指令執行完畢?
add $s0, $t0, $t1
sub $t2, $s0, $t3
Programexecutionorder(in instructions)
IF ID WBEX
IF ID MEMEX
Time2 4 6 8 10
MEM
WBMEM
$t0+$t1
Time2 4 6 8 10 12 14
lw $s0, 20($t1)
sub $t2, $s0, $t3
Programexecutionorder(in instructions)
IF ID WBMEMEX
IF ID WBMEMEX
bubble bubble bubble bubble bubble
Another example of forwarding
Time2 4 6 8 10 12 14
lw $s0, 20($t1)
sub $t2, $s0, $t3
Programexecutionorder(in instructions)
IF ID WBMEMEX
IF ID WBMEMEX
bubble bubble bubble bubble bubble
Time2 4 6 8 10 12 14
lw $s0, 20($t1)
sub $t2, $s0, $t3
Programexecutionorder(in instructions)
IF ID WBMEMEX
IF ID WBMEMEX
bubble bubble bubble bubble bubble
Time2 4 6 8 10 12 14
lw $s0, 20($t1)
sub $t2, $s0, $t3
Programexecutionorder(in instructions)
IF ID WBMEMEX
IF ID WBMEMEX
bubble bubble bubble bubble bubble
?
Mem[20($t1)]
Example: Reordering codes
lw $t0, 0($t1)lw $t2, 4($t1)sw $t2, 0($t1)sw $t0, 4($t1)Swap 0($t1) and 4($t1)
What’s wrong with pipelining?Time
2 4 6 8 10
add $s0, $t0, $t1 IF ID WBEX MEMTime
2 4 6 8 10
add $s0, $t0, $t1 IF ID WBEX MEM
lw $t2, 4($t1)sw $t2, 0($t1)
Example: Reordering codes
Time2 4 6 8 10
add $s0, $t0, $t1 IF ID WBEX MEM
Time2 4 6 8 10
add $s0, $t0, $t1 IF ID WBEX MEM
lw $t2, 4($t1)
sw $t2, 0($t1)
lw $t0, 0($t1)lw $t2, 4($t1) sw $t0, 4($t1)sw $t2, 0($t1)
Time2 4 6 8 10
add $s0, $t0, $t1 IF ID WBEX MEMsw $t0, 4($t1)
Peer Instruction
A. Thanks to pipelining, I have reduced the time it took me to wash my shirt.B. Longer pipelines are always a win (since less work per stage & a faster clock).C. We can rely on compilers to help us avoid data hazards by reordering instrs.
ABC1: FFF2: FFT3: FTF4: FTT5: TFF6: TFT7: TTF8: TTT
Outline Overview A pipeline datapath: How to build? Pipelined control Data hazards
Forwarding Stalls
Branch hazards Superscalar and dynamic pipelining
Multiple instructions execute using pipelining
IM Reg DM RegALU
IM Reg DM RegALU
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7
Time (in clock cycles)
lw $2, 200($0)
lw $3, 300($0)
Programexecutionorder(in instructions)
lw $1, 100($0) IM Reg DM RegALU
How to combine these multiple datapaths? Shared units during one clock cycle add buffer (registers) to hold data (as we do in multi-cycle)
Divide single-cycle datapath into stages
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Instruction
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
ReaddataAddress
Datamemory
1
ALUresult
Mux
ALUZero
IF: Instruction fetch ID: Instruction decode/register file read
EX: Execute/address calculation
MEM: Memory access WB: Write back
1
2
Data flow
Pipelined datapath with pipeline registers (Fig 6.11)
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM MEM/WB
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
Readdata
1
ALUresult
Mux
ALUZero
ID/EX
Datamemory
Address
64bits 128b 97b 64b
Example: (1) instruction fetch
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM MEM/WB
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
Readdata
1
ALUresult
Mux
ALUZero
ID/EX
Instruction fetchlw
Address
Datamemory
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
Readdata
1
ALUresult
Mux
ALUZero
ID/EX MEM/WB
Instruction decodelw
Address
Datamemory
Example: (2) instruction decode
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM MEM/WB
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
Readdata
1
ALUresult
Mux
ALUZero
ID/EX
Instruction fetchlw
Address
Datamemory
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
Readdata
1
ALUresult
Mux
ALUZero
ID/EX MEM/WB
Instruction decodelw
Address
Datamemory
Example: (3) execute
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
Readdata
1
ALUresult
Mux
ALUZero
ID/EX MEM/WB
Executionlw
Address
Datamemory
Example: (4) memory access
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
ReaddataData
memory1
ALUresult
Mux
ALUZero
ID/EX MEM/WB
Memorylw
Address
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writedata
ReaddataData
memory
1
ALUresult
Mux
ALUZero
ID/EX MEM/WB
Write backlw
Writeregister
Address
97108/Patterson Figure 06.15
Example: (5) write back
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
ReaddataData
memory1
ALUresult
Mux
ALUZero
ID/EX MEM/WB
Memorylw
Address
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writedata
ReaddataData
memory
1
ALUresult
Mux
ALUZero
ID/EX MEM/WB
Write backlw
Writeregister
Address
97108/Patterson Figure 06.15
?
寫回暫存器的號碼對嗎?
Preserve the Destination Register number
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM MEM/WB
Mux
0
1
Add
PC
0
Address
Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
Readdata
Datamemory
1
ALUresult
Mux
ALUZero
ID/EX
add 5-bit buffer
Trace another pipelining Trace helps you understand how pipelining works !!! Single-clock-cycle pipeline diagram example: lw $10, 20($1) sub $11, $2, $3
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM MEM/WB
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
Readdata
1
ALUresult
Mux
ALUZero
ID/EX
Instruction decodelw $10, 20($1)
Instruction fetchsub $11, $2, $3
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM MEM/WB
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
Readdata
1
ALUresult
Mux
ALUZero
ID/EX
Instruction fetchlw $10, 20($1)
Address
Datamemory
Address
Datamemory
Clock 1
Clock 2
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM MEM/WB
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
Readdata
1
ALUresult
Mux
ALUZero
ID/EX
Instruction decodelw $10, 20($1)
Instruction fetchsub $11, $2, $3
Instructionmemory
Address
4
32
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM MEM/WB
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Writeregister
Writedata
Readdata
1
ALUresult
Mux
ALUZero
ID/EX
Instruction fetchlw $10, 20($1)
Address
Datamemory
Address
Datamemory
Clock 1
Clock 2
Instructionmemory
Address
4
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM MEM/WB
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
3216Sign
extend
Writeregister
Writedata
Memorylw $10, 20($1)
Readdata
1
ALUresult
Mux
ALUZero
ID/EX
Executionsub $11, $2, $3
Instructionmemory
Address
4
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM MEM/WB
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Writeregister
Writedata
Readdata
1
ALUresult
Mux
ALUZero
ID/EX
Executionlw $10, 20($1)
Instruction decodesub $11, $2, $3
3216Sign
extend
Address
Datamemory
Datamemory
Address
Clock 3
Clock 4
Instructionmemory
Address
4
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM MEM/WB
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
3216Sign
extend
Writeregister
Writedata
Memorylw $10, 20($1)
Readdata
1
ALUresult
Mux
ALUZero
ID/EX
Executionsub $11, $2, $3
Instructionmemory
Address
4
0
Add Addresult
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEM MEM/WB
Mux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Writeregister
Writedata
Readdata
1
ALUresult
Mux
ALUZero
ID/EX
Executionlw $10, 20($1)
Instruction decodesub $11, $2, $3
3216Sign
extend
Address
Datamemory
Datamemory
Address
Clock 3
Clock 4
Instructionmemory
Address
4
32
0
Add Addresult
1
ALUresult
Zero
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEMID/EX MEM/WB
Write backMux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Mux
ALUReaddata
Writeregister
Writedata
lw $10, 20($1)
Instructionmemory
Address
4
32
0
Add Addresult
1
ALUresult
Zero
Shiftleft 2
Instr
uct
ion
IF/ID EX/MEMID/EX MEM/WB
Write backMux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Mux
ALUReaddata
Writeregister
Writedata
sub $11, $2, $3
Memory
sub $11, $2, $3
Address
Datamemory
Address
Datamemory
Clock 6
Clock 5
Instructionmemory
Address
4
32
0
Add Addresult
1
ALUresult
Zero
Shiftleft 2
Inst
ruct
ion
IF/ID EX/MEMID/EX MEM/WB
Write backMux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Mux
ALUReaddata
Writeregister
Writedata
lw $10, 20($1)
Instructionmemory
Address
4
32
0
Add Addresult
1
ALUresult
Zero
Shiftleft 2
Instr
uct
ion
IF/ID EX/MEMID/EX MEM/WB
Write backMux
0
1
Add
PC
0Writedata
Mux
1Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
16Sign
extend
Mux
ALUReaddata
Writeregister
Writedata
sub $11, $2, $3
Memory
sub $11, $2, $3
Address
Datamemory
Address
Datamemory
Clock 6
Clock 5
Multiple-clock-cycle pipeline diagram
Outline Overview A pipeline datapath Pipelined control Data hazards
Forwarding Stalls
Branch hazards Superscalar and dynamic pipelining
Label the control lines (Fig 6.22)
PC
Instructionmemory
Address
Inst
ruct
ion
Instruction[20– 16]
MemtoReg
ALUOp
Branch
RegDst
ALUSrc
4
16 32Instruction[15– 0]
0
0Registers
Writeregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Signextend
Mux
1Write
data
Read
data Mux
1
ALUcontrol
RegWrite
MemRead
Instruction[15– 11]
6
IF/ID ID/EX EX/MEM MEM/WB
MemWrite
Address
Datamemory
PCSrc
Zero
Add Addresult
Shiftleft 2
ALUresult
ALUZero
Add
0
1
Mux
0
1
Mux
written during each clock cycle
Divide the control lines into groups
Fig. 6.25
How to set control lines at each stage? Extend the pipeline registers to include control information 將控制訊號儲存在 pipeline registers
Propagation of control (Fig.6.26)
Control
EX
M
WB
M
WB
WB
IF/ID ID/EX EX/MEM MEM/WB
Instruction
PC
Instructionmemory
Inst
ruct
ion
Add
Instruction[20– 16]
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
16 32Instruction[15– 0]
0
0
Mux
0
1
Add Addresult
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Signextend
Mux
1
ALUresult
Zero
Writedata
Readdata
Mux
1
ALUcontrol
Shiftleft 2
Reg
Writ
e
MemRead
Control
ALU
Instruction[15– 11]
6
EX
M
WB
M
WB
WBIF/ID
PCSrc
ID/EX
EX/MEM
MEM/WB
Mux
0
1
Mem
Writ
e
AddressData
memory
Address
Fig. 6.27
Instructionmemory
Instruction[20– 16]
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
Instruction[15– 0]
0
Mux
0
1
Add Addresult
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Signextend
Mux
1
ALUresult
Zero
ALUcontrol
Shiftleft 2
Reg
Writ
e
MemRead
Control
ALU
Instruction[15– 11]
EX
M
WB
M
WB
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: before<1> EX: before<2> MEM: before<3> WB: before<4>
MEM/WB
IF: lw $10, 20($1)
000
00
0000
000
00
000
0
00
00
0
00
Mux
0
1
Add
PC
0
Datamemory
Address
Writedata
Readdata
Mux
1
WB
EX
M
Instructionmemory
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
0
Mux
0
1
Add Addresult
Writeregister
Writedata
Mux
1
ALUresult
Zero
ALUcontrol
Shiftleft 2
Reg
Writ
e
ALU
M
WB
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: lw $10, 20($1) EX: before<1> MEM: before<2> WB: before<3>
MEM/WB
IF: sub $11, $2, $3
010
11
0001
000
00
000
0
00
00
0
00
Mux
0
1
Add
PC
0Writedata
Readdata
Mux
1
lwControl
Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
X
10
20
X
1
Instruction[20– 16]
Instruction[15– 0] Sign
extend
Instruction[15– 11]
20
$X
$1
10
X
Me
mW
rite
MemRead
Me
mW
rite
Datamemory
Address
Address
Address
Clock 2
Clock 1
Old bookFig 6.31
Instructionmemory
Instruction[20– 16]
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
Instruction[15– 0]
0
Mux
0
1
Add Addresult
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
Signextend
Mux
1
ALUresult
Zero
ALUcontrol
Shiftleft 2
Reg
Writ
e
MemRead
Control
ALU
Instruction[15– 11]
EX
M
WB
M
WB
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: before<1> EX: before<2> MEM: before<3> WB: before<4>
MEM/WB
IF: lw $10, 20($1)
000
00
0000
000
00
000
0
00
00
0
00
Mux
0
1
Add
PC
0
Datamemory
Address
Writedata
Readdata
Mux
1
WB
EX
M
Instructionmemory
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
0
Mux
0
1
Add Addresult
Writeregister
Writedata
Mux
1
ALUresult
Zero
ALUcontrol
Shiftleft 2
Reg
Writ
e
ALU
M
WB
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: lw $10, 20($1) EX: before<1> MEM: before<2> WB: before<3>
MEM/WB
IF: sub $11, $2, $3
010
11
0001
000
00
000
0
00
00
0
00
Mux
0
1
Add
PC
0Writedata
Readdata
Mux
1
lwControl
Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
X
10
20
X
1
Instruction[20– 16]
Instruction[15– 0] Sign
extend
Instruction[15– 11]
20
$X
$1
10
X
Mem
Writ
e
MemRead
Mem
Writ
e
Datamemory
Address
Address
Address
Clock 2
Clock 1
Instructionmemory
Address
Instruction[20– 16]
Mem
toR
eg
Branch
ALUSrc
4
Instruction[15– 0]
0
1
Add Addresult
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
ALUresult
Shiftleft 2
Re
gWrit
e
MemRead
Control
ALU
Instruction[15– 11]
EX
M
WB
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: sub $11, $2, $3 EX: lw $10, . . . MEM: before<1> WB: before<2>
MEM/WB
IF: and $12, $4, $5
000
10
1100
010
11
000
1
00
00
0
00
Mux
0
1
Add
PC
0Writedata
Readdata
Mux
1
WB
EX
M
Instructionmemory
Address
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
0
0
1
Add Addresult
Writeregister
Writedata 1
ALUresult
ALUcontrol
Shiftleft 2
Re
gWrit
e
M
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: and $12, $2, $3 EX: sub $11, . . . MEM: lw $10, . . . WB: before<1>
MEM/WB
IF: or $13, $6, $7
000
10
1100
000
10
101
0
11
10
0
00
Mux
0
1
Add
PC
0Writedata
Mux
1
andControl
Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
12
X
X
5
4
Instruction[20– 16]
Instruction[15– 0]
Instruction[15– 11]
X
$5
$4
X
12
Me
mW
rite
MemRead
Me
mW
rite
sub
11
X
X
3
2
X
$3
$2
X
11
$1
20
10
Mux
0
Mux
1
ALUOp
RegDst
ALUcontrol
M
WB
$3
$2
11
Mux
Mux
ALUAddress Read
dataData
memory
10
WB
Zero
Zero
Signextend
Signextend
Datamemory
Address
Clock 3
Clock 4
20($1)
Instructionmemory
Address
Instruction[20– 16]
Mem
toR
eg
Branch
ALUSrc
4
Instruction[15– 0]
0
1
Add Addresult
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
ALUresult
Shiftleft 2
Reg
Writ
e
MemRead
Control
ALU
Instruction[15– 11]
EX
M
WB
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: sub $11, $2, $3 EX: lw $10, . . . MEM: before<1> WB: before<2>
MEM/WB
IF: and $12, $4, $5
000
10
1100
010
11
000
1
00
00
0
00
Mux
0
1
Add
PC
0Writedata
Readdata
Mux
1
WB
EX
M
Instructionmemory
Address
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
0
0
1
Add Addresult
Writeregister
Writedata 1
ALUresult
ALUcontrol
Shiftleft 2
Reg
Writ
e
M
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: and $12, $2, $3 EX: sub $11, . . . MEM: lw $10, . . . WB: before<1>
MEM/WB
IF: or $13, $6, $7
000
10
1100
000
10
101
0
11
10
0
00
Mux
0
1
Add
PC
0Writedata
Mux
1
andControl
Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
12
X
X
5
4
Instruction[20– 16]
Instruction[15– 0]
Instruction[15– 11]
X
$5
$4
X
12
Mem
Writ
e
MemRead
Mem
Writ
e
sub
11
X
X
3
2
X
$3
$2
X
11
$1
20
10
Mux
0
Mux
1
ALUOp
RegDst
ALUcontrol
M
WB
$3
$2
11
Mux
Mux
ALUAddress Read
dataData
memory
10
WB
Zero
Zero
Signextend
Signextend
Datamemory
Address
Clock 3
Clock 4
20($1)
Instructionmemory
Address
Instruction[20– 16]
Branch
ALUSrc
4
Instruction[15– 0]
0
1
Add Addresult
RegistersWriteregister
Writedata
Readdata 1
Readdata 2
Readregister 1
Readregister 2
ALUresult
Shiftleft 2
Re
gWrit
e
MemRead
Control
ALU
Instruction[15– 11]
EX
M
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: or $13, $6, $7 EX: and $12, . . . MEM: sub $11, . . . WB: lw $10, . . .
MEM/WB
IF: add $14, $8, $9
000
10
1100
000
10
101
0
10
00
0
Mux
0
1
Add
PC
0Writedata
Readdata
Mux
1
WB
EX
M
Instructionmemory
Address
Mem
toR
eg
ALUOp
Branch
RegDst
ALUSrc
4
0
0
1
Add Addresult
1
ALUresult
ALUcontrol
Shiftleft 2
Re
gWrit
e
M
WB
Inst
ruct
ion
IF/ID EX/MEMID/EX
ID: add $14, $8, $9 EX: or $13, . . . MEM: and $12, . . . WB: sub $11, . . .
MEM/WB
IF: after<1>
000
10
1100
000
10
101
0
10
00
0
10
Mux
0
1
Add
PC
0Writedata
Mux
1
addControl
Registers
Readdata 1
Readdata 2
Readregister 1
Readregister 2
14
X
X
9
8
Instruction[20– 16]
Instruction[15– 0]
Instruction[15– 11]
X
$9
$8
X
14
Me
mW
rite
MemRead
Me
mW
rite
or
13
X
X
7
6
X
$7
$6
X
13
$4
Mux
0
Mux
1
ALUOp
RegDst
ALUcontrol
M
WB
$7
$6
13
Mux
Mux
ALUReaddata
12
WB
11 10
10$5
12
WB
Mem
toR
eg
11
11
11
Writeregister
Writedata
Zero
Zero
Datamemory
Address
Datamemory
Address
Signextend
Signextend
Clock 5
Clock 6
20($1)
Outline Overview A pipeline datapath Pipelined control Data hazards
Forwarding Stalls
Branch hazards Superscalar and dynamic pipelining
IM Reg
IM Reg
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
sub $2, $1, $3
Programexecutionorder(in instructions)
and $12, $2, $5
IM Reg DM Reg
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9
10 10 10 10 10/– 20 – 20 – 20 – 20 – 20
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2:
DM Reg
Reg
Reg
Reg
DM
Example: data hazards
Both r/w is oK!
?
?
Solution 1: software Insert independent instruction or nop (no operation)
sub $2, $1, $3nopnopand $12, $2, $6or $13, $6, $2add $14, $2, $2sw $15, 100($2)
Hardware approach: forwarding
IM Reg
IM Reg
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
sub $2, $1, $3
Programexecution order(in instructions)
and $12, $2, $5
IM Reg DM Reg
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9
10 10 10 10 10/– 20 – 20 – 20 – 20 – 20
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2 :
DM Reg
Reg
Reg
Reg
X X X – 20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :
DM
1
2
EX/MEM MEM/WB 1EX hazard2MEM hazard
Registers
Mux M
ux
ALU
ID/EX MEM/WB
Datamemory
Mux
Forwardingunit
EX/MEM
b. With forwarding
ForwardB
Rd EX/MEM.RegisterRd
MEM/WB.RegisterRd
RtRtRs
ForwardA
Mux
ALU
ID/EX MEM/WB
Datamemory
EX/MEM
a. No forwarding
Registers
Mux
Control for forwarding Set ForwardA and ForwardB
EX hazard:If ( EX/MEM.RegWriteand (EX/MEM.RegisterRd <> 0)and (EX/MEM.RegisterRd = ID/EX.RegisterRs))ForwardA = 10
// 排除寫入 $0// 前一指令有寫入暫存器動作
// 前一指令有寫入暫存器號碼與現在欲使用暫存器相同
If ( EX/MEM.RegWriteand (EX/MEM.RegisterRd <> 0)and (EX/MEM.RegisterRd = ID/EX.RegisterRt))ForwardB = 10
// 排除寫入 $0// 前一指令有寫入暫存器動作
// 前一指令有寫入暫存器號碼與現在欲使用暫存器相同
Outline Overview A pipeline datapath Pipelined control Data hazards
Forwarding Stalls
Branch hazards Superscalar and dynamic pipelining
Unsolved data hazards
Reg
IM
Reg
Reg
IM
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6
Time (in clock cycles)
lw $2, 20($1)
Programexecutionorder(in instructions)
and $4, $2, $5
IM Reg DM Reg
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
DM Reg
Reg
Reg
DM
?
Stall
lw $2, 20($1)
Programexecutionorder(in instructions)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
Reg
IM
Reg
Reg
IM DM
CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (in clock cycles)
IM Reg DM RegIM
IM DM Reg
IM DM Reg
CC 7 CC 8 CC 9 CC 10
DM Reg
RegReg
Reg
bubble
Hazard detection unit When to stall?
If ( ID/EX.MemRead and (( ID/EX.RegisterRt = IF/ID.RegisterRs) or ( ID/EX.RegisterRt = IF/ID.RegisterRt) )Stall the pipeline
// 前一指令是 load
Outline Overview A pipeline datapath Pipelined control Data hazards
Forwarding Stalls
Branch hazards Advanced pipelining
Advanced technology Single-cycle implementation multi-cycle implementation Pipelining (instruction-level
parallelism)* How to build fast processors ?1. Increase depth of the pipeline: Superpipelining
2. Replicate the internal components of the computer : multiple issue
Dynamic multiple issue (hardware): superscalarStatic multiple issue (compiler): VLIW
How to schedule?
Superpipelining (longer pipelines) Ideal improvement of pipelining
Time between inst. pipelined = Time between inst. nonpipelinedNumber of pipe stages
* Divide instruction up to 8 or more stages* How to balance the length of each cycle?=> Clock cycle depends on the step of longest time
Instructionfetch Reg ALU Data
access Reg
8 nsInstruction
fetch Reg ALU Dataaccess Reg
8 nsInstruction
fetch
8 ns
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 4 6 8 10 12 14 16 18
2 4 6 8 10 12 14
...
Programexecutionorder(in instructions)
Instructionfetch Reg ALU Data
access Reg
Time
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
2 nsInstruction
fetch Reg ALU Dataaccess Reg
2 nsInstruction
fetch Reg ALU Dataaccess Reg
2 ns 2 ns 2 ns 2 ns 2 ns
Programexecutionorder(in instructions)
Multiple-issue pipeline Two problems
How to package instructions into issue slots
Static issue processor: compiler Dynamic issue processor: at runtime by
hardware Dealing with data and control hazards
Static issue processor: compiler Dynamic issue processor: at runtime by
hardware
Static multiple issueTime
2 4 6 8 10
add $s0, $t0, $t1 IF ID WBEX MEM
Time2 4 6 8 10
add $s0, $t0, $t1 IF ID WBEX MEM
…
Issue packet
• The mix of instructions can be initiated in a given clock is usually restricted Issue packet can be taken as a single instruction allowing several operations Very long instruction word (VLIW)
Ex. Two-issue MIPS 2 instructions are issued per clock
cycle One for integer ALU or branch The other for load or store
* One slot can be nop
A static two-issue datapath
PC Instructionmemory
4
RegistersMux
Mux
ALU
Mux
Datamemory
Mux
40000040
Signextend Sign
extend
ALU Address
Writedata
1
2
avoidStructural hazard
64-bit
Superscalar: Dynamic pipeline scheduling
Dynamic pipeline scheduling Hardware support for reordering the
order of instruction execution so as to avoid stalls
Ex. Unsolved pipeline stalllw $t0, 20($s2)addu $t1, $t0, $t2sub $s4, $s4, $t3slti $t5, $s4, 20Stall (memory is slow)
Dynamic pipeline scheduling
Commitunit
Instruction fetchand decode unit
…
In-order issue
In-order commit
Load/Store
FloatingpointIntegerInteger …Functional
unitsOut-of-order execute
Reservationstation
Reservationstation
Reservationstation
Reservationstation
1
2
3
Buffers tohold operandsand operation
Reorder buffer,decide when it’s safeto put the result back
In the orderof programexecution
dynamic pipelining Static pipelining Advantage of dynamic pipelining
Hide memory latency Avoid the stalls that compiler could not
schedule Speculative execute instructions
(combined with branch prediction) while waiting for hazards to be solved
Fallacies and pitfalls Pipelining is easy Pipelining ideas can be implemented
independent of technology No. of transistors on-chip increase,
additional hardware makes sense Ex. Extra hardware in dynamic pipelining
Increase the depth of pipelining always increase performance
Depth of pipelining
Data hazards: large percentage of cycles become stalls
Control hazards: slower branches Pipeline register overhead
1 2 4 8 16
Pipeline depth
0.0
0.5
1.0
1.5
2.0
2.5
3.0
Rel
ativ
e pe
rform
ance
Fp pipelining, 19863 factors:
top related