chap.6: enhancing performance with pipelining

Chap.6: Enhancing Performance with Pipelining

Jen-Chang Liu, Spring 2006

Parts of the slides are duplicated frominst.eecs.berkeley.edu/~cs61c

Review Datapath (1/3)Datapath is the hardware that performs operations necessary to execute programs.Control instructs datapath on what to do next.Datapath needs:

access to storage (general purpose registers and memory) computational ability (ALU) helper hardware (local registers and PC)

Review Datapath (2/3)Five stages of datapath (executing an instruction):

1. Instruction Fetch (Increment PC)2. Instruction Decode (Read Registers)3. ALU (Computation)4. Memory Access5. Write to Registers

ALL instructions must go through ALL five stages.

Review Datapath (3/3)P

rtrsrd

1. InstructionFetch

2. Decode/ Register

3. Execute 4. Memory5. WriteBack

Review Single cycle datapath

Multi-cycle datapath

Instructionfetch

Data/registerread

Instructionexecution

Memory/registerread/write

Registerwrite

Instructionfetch

Data/registerread

Registerwrite

Outline Overview A pipeline datapath Pipelined control Data hazards

Forwarding Stalls

Branch hazards Superscalar and dynamic pipelining

Time76 PM 8 9 10 11 12 1 2 AM

Taskorder

What’s pipelining?Time

76 PM 8 9 10 11 12 1 2 AM

Time76 PM 8 9 10 11 12 1 2 AM

Taskorder

洗衣乾衣折疊收藏non-pipelined

pipelining

Use different resources at the same time

Pipelining Definition: an implementation

technique in which multiple instructions are overlapped in execution

How to achieve pipelining? An instruction is divided into steps

(stages) We have separate resources for each

Instructionfetch Reg ALU Data

access Reg

8 nsInstruction

fetch Reg ALU Dataaccess Reg

8 nsInstruction

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

Programexecutionorder(in instructions)

access Reg

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

2 ns 2 ns 2 ns 2 ns 2 ns

Single-cycle implementation

Pipelined implementation

access Reg

8 nsInstruction

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

access Reg

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

What does pipelining improve?

Pipelining improves performance by Increasing the instruction throughput 單位時間完成的指令數目增加 Decreasing the execution time of an individual instruction 單一指令執行時間不變

Ideal conditions for savingTime between inst. pipelined =

Time between inst. non-pipelined

Number of pipe stages

Design instruction set for pipelining

MIPS has been designed for pipelining1. Instructions are of the same length

Easier to fetch and decode Ex. 80x86 inst.s ranges from 1 to 17 bytes

2. A few instruction formats, with the source register fields located in the same place

可以在決定是甚麼指令前讀暫存器

Design instruction set for pipelining (cont.)

3. Memory operands only appear for loads and stores

Calculate memory address in execution stage

4. Operands are aligned in memory One memory access for data transfer

Instructionfetch

Instruction Decoding,

Data/registerread

Registerwrite

0 1 2 3

Aligned

NotAligned

Pipeline hazards Hazards: the situations in pipelining

when the next instruction cannot execute in the following clock cycle使下一個指令無法在下一個 cycle 執行

access Reg

8 nsInstruction

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

access Reg

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

危險

甚麼時候下個指令無法執行？

access Reg

8 nsInstruction

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

access Reg

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

1. Structural hazards The hardware cannot support the

combination of instructions that we want to execute in the same clock cycle

access Reg

8 nsInstruction

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

access Reg

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

硬體結構問題

2 memory access at the sameclock?

Structural Hazard #1: Single Memory

Solution:Second memory

infeasible and inefficient to createTwo Level 1 Caches

Cache: a temporary smaller [of usually most recently used] copy of memory

have both an L1 Instruction Cache and an L1 Data Cache

need more complex hardware to control when both caches miss

access Reg

8 nsInstruction

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

access Reg

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

1. Structural hazard #2: single register

access Reg

8 nsInstruction

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

access Reg

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

Can’t read and write to registers simultaneously?

Structural Hazard #2: Registers

Fact: Register access is VERY fast: takes less than half the time of ALU stage

Solution: introduce conventionalways Write to Registers during first half of each clock cycle

always Read from Registers during second half of each clock cycle

Result: can perform Read and Write during same clock cycle

2. Control hazards The need to make a decision based on

the results of one instruction while others are executing Ex. Branch instruction

下一個指令需等待前面指令的執行結果

add $4, $5, $6 beq $1, $2, 40 lw $3, 300($0)40: or $7, $8, $9

? …無法判斷下一個指令是那個…

Solution to control hazards #1: stall 拖延 stall (bubble): 暫停下一個指令執行

access Reg

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)4 ns

access Reg2ns

access Reg

2 4 6 8 10 12 14 16Programexecutionorder(in instructions)

nop: no operationnop: no operationInstruction

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)4 ns

access Reg2ns

access Reg

completion of branchFilling two nops after branch is inefficient!

Solution to control hazards #1: stall

Stall (bubble): 暫停下一個指令執行Instruction

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)4 ns

access Reg2ns

access Reg

假設 branch 的位址計算，比較都可用另外的 hardware 在此 stage 完成

Solution to control hazards #2:

delayed branchadd $4, $5, $6 beq $1, $2, 40 lw $3, 300($0)

40: or $7, $8, $9? …

access Reg

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)

access Reg2 ns

access Reg

2 4 6 8 10 12 14

(Delayed branch slot)

Redefine branchesOld definition: if we take the branch, none of the instructions after the branch get executed by accident

New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (called the branch-delay slot)

The term “Delayed Branch” meanswe always execute inst after branch

delayed branch (cont.)

Notes on Branch-Delay SlotWorst-Case Scenario: can always put a no-op in the branch-delay slot

Better Case: can find an instruction preceding the branch which can be placed in the branch-delay slot without affecting flow of the program

re-ordering instructions is a common method of speeding up programs

compiler must be very smart in order to find instructions to do this

usually can find such an instruction at least 50% of the time

delayed branch (cont.)

Solution to control hazards #3: predict 預測預測下一個可能執行的指令 .

Ex. 猜 branch always not takenInstruction

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)

access Reg2 ns

access Reg

beq $1, $2, 40

add $4, $5 ,$6

or $7, $8, $9

access Reg

2 4 6 8 10 12 14

access Reg

bubble bubble bubble bubble bubble

(not taken沒有跳走 )

(taken)

access Reg

beq $1, $2, 40

add $4, $5, $6

lw $3, 300($0)

access Reg2 ns

access Reg

beq $1, $2, 40

add $4, $5 ,$6

or $7, $8, $9

access Reg

2 4 6 8 10 12 14

access Reg

Solution to control hazards: predict (cont.)

Example: easy to predict

Dynamic prediction hardware Record the history of branch results 90% accuracy

Loop: … … beq $1, $2, Loop

3. Data hazard An instruction depends on the results of

a previous instruction still in the pipeline

Solution 1: complier (assembler)

下一個指令需要的資料需等待前一個指令執行結果Example:

add $s0, $t0, $t1sub $t2, $s0, $t3

Standard notation for steps in MIPS

Time2 4 6 8 10

add $s0, $t0, $t1 IF ID WBEX MEM

Instructionfetch

Instruction Decoding,

Data/registerread

Registerwrite

讀出寫入

Solution to data hazards: forwarding

Forwarding: Getting the missing item early from the internal resourceadd $s0, $t0, $t1sub $t2, $s0, $t3 等待指令執行完畢？

add $s0, $t0, $t1

sub $t2, $s0, $t3

IF ID WBEX

IF ID MEMEX

Time2 4 6 8 10

$t0+$t1

Time2 4 6 8 10 12 14

lw $s0, 20($t1)

sub $t2, $s0, $t3

IF ID WBMEMEX

Another example of forwarding

Time2 4 6 8 10 12 14

lw $s0, 20($t1)

sub $t2, $s0, $t3

IF ID WBMEMEX

Time2 4 6 8 10 12 14

lw $s0, 20($t1)

sub $t2, $s0, $t3

IF ID WBMEMEX

Time2 4 6 8 10 12 14

lw $s0, 20($t1)

sub $t2, $s0, $t3

IF ID WBMEMEX

Mem[20($t1)]

Example: Reordering codes

lw $t0, 0($t1)lw $t2, 4($t1)sw $t2, 0($t1)sw $t0, 4($t1)Swap 0($t1) and 4($t1)

What’s wrong with pipelining?Time

2 4 6 8 10

add $s0, $t0, $t1 IF ID WBEX MEMTime

2 4 6 8 10

lw $t2, 4($t1)sw $t2, 0($t1)

Example: Reordering codes

Time2 4 6 8 10

lw $t2, 4($t1)

sw $t2, 0($t1)

lw $t0, 0($t1)lw $t2, 4($t1) sw $t0, 4($t1)sw $t2, 0($t1)

Time2 4 6 8 10

add $s0, $t0, $t1 IF ID WBEX MEMsw $t0, 4($t1)

Peer Instruction

A. Thanks to pipelining, I have reduced the time it took me to wash my shirt.B. Longer pipelines are always a win (since less work per stage & a faster clock).C. We can rely on compilers to help us avoid data hazards by reordering instrs.

ABC1: FFF2: FFT3: FTF4: FTT5: TFF6: TFT7: TTF8: TTT

Outline Overview A pipeline datapath: How to build? Pipelined control Data hazards

Forwarding Stalls

Multiple instructions execute using pipelining

IM Reg DM RegALU

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7

Time (in clock cycles)

lw $2, 200($0)

lw $3, 300($0)

lw $1, 100($0) IM Reg DM RegALU

How to combine these multiple datapaths? Shared units during one clock cycle add buffer (registers) to hold data (as we do in multi-cycle)

Divide single-cycle datapath into stages

Instructionmemory

Address

Add Addresult

Shiftleft 2

Instruction

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

ReaddataAddress

Datamemory

ALUresult

ALUZero

IF: Instruction fetch ID: Instruction decode/register file read

EX: Execute/address calculation

MEM: Memory access WB: Write back

Data flow

Pipelined datapath with pipeline registers (Fig 6.11)

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM MEM/WB

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

ALUresult

ALUZero

Datamemory

Address

64bits 128b 97b 64b

Example: (1) instruction fetch

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM MEM/WB

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

ALUresult

ALUZero

Instruction fetchlw

Address

Datamemory

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

ALUresult

ALUZero

ID/EX MEM/WB

Instruction decodelw

Address

Datamemory

Example: (2) instruction decode

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM MEM/WB

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

ALUresult

ALUZero

Instruction fetchlw

Address

Datamemory

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

ALUresult

ALUZero

ID/EX MEM/WB

Instruction decodelw

Address

Datamemory

Example: (3) execute

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

ALUresult

ALUZero

ID/EX MEM/WB

Executionlw

Address

Datamemory

Example: (4) memory access

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

ReaddataData

memory1

ALUresult

ALUZero

ID/EX MEM/WB

Memorylw

Address

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writedata

ReaddataData

memory

ALUresult

ALUZero

ID/EX MEM/WB

Write backlw

Writeregister

Address

97108/Patterson Figure 06.15

Example: (5) write back

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

ReaddataData

memory1

ALUresult

ALUZero

ID/EX MEM/WB

Memorylw

Address

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writedata

ReaddataData

memory

ALUresult

ALUZero

ID/EX MEM/WB

Write backlw

Writeregister

Address

97108/Patterson Figure 06.15

寫回暫存器的號碼對嗎？

Preserve the Destination Register number

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM MEM/WB

Address

Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

Datamemory

ALUresult

ALUZero

add 5-bit buffer

Trace another pipelining Trace helps you understand how pipelining works !!! Single-clock-cycle pipeline diagram example: lw $10, 20($1) sub $11, $2, $3

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM MEM/WB

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

ALUresult

ALUZero

Instruction decodelw $10, 20($1)

Instruction fetchsub $11, $2, $3

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM MEM/WB

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

ALUresult

ALUZero

Instruction fetchlw $10, 20($1)

Address

Datamemory

Address

Datamemory

Clock 1

Clock 2

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM MEM/WB

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

ALUresult

ALUZero

Instruction decodelw $10, 20($1)

Instruction fetchsub $11, $2, $3

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM MEM/WB

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

Writeregister

Writedata

Readdata

ALUresult

ALUZero

Instruction fetchlw $10, 20($1)

Address

Datamemory

Address

Datamemory

Clock 1

Clock 2

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM MEM/WB

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

3216Sign

extend

Writeregister

Writedata

Memorylw $10, 20($1)

Readdata

ALUresult

ALUZero

Executionsub $11, $2, $3

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM MEM/WB

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Writeregister

Writedata

Readdata

ALUresult

ALUZero

Executionlw $10, 20($1)

Instruction decodesub $11, $2, $3

3216Sign

extend

Address

Datamemory

Address

Clock 3

Clock 4

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM MEM/WB

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

3216Sign

extend

Writeregister

Writedata

Memorylw $10, 20($1)

Readdata

ALUresult

ALUZero

Executionsub $11, $2, $3

Instructionmemory

Address

Add Addresult

Shiftleft 2

IF/ID EX/MEM MEM/WB

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Writeregister

Writedata

Readdata

ALUresult

ALUZero

Executionlw $10, 20($1)

Instruction decodesub $11, $2, $3

3216Sign

extend

Address

Datamemory

Address

Clock 3

Clock 4

Instructionmemory

Address

Add Addresult

ALUresult

Shiftleft 2

IF/ID EX/MEMID/EX MEM/WB

Write backMux

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

ALUReaddata

Writeregister

Writedata

lw $10, 20($1)

Instructionmemory

Address

Add Addresult

ALUresult

Shiftleft 2

Write backMux

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

ALUReaddata

Writeregister

Writedata

sub $11, $2, $3

Memory

sub $11, $2, $3

Address

Datamemory

Address

Datamemory

Clock 6

Clock 5

Instructionmemory

Address

Add Addresult

ALUresult

Shiftleft 2

Write backMux

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

ALUReaddata

Writeregister

Writedata

lw $10, 20($1)

Instructionmemory

Address

Add Addresult

ALUresult

Shiftleft 2

Write backMux

0Writedata

1Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

16Sign

extend

ALUReaddata

Writeregister

Writedata

sub $11, $2, $3

Memory

sub $11, $2, $3

Address

Datamemory

Address

Datamemory

Clock 6

Clock 5

Multiple-clock-cycle pipeline diagram

Forwarding Stalls

Label the control lines (Fig 6.22)

Instructionmemory

Address

Instruction[20– 16]

MemtoReg

Branch

RegDst

ALUSrc

16 32Instruction[15– 0]

0Registers

Writeregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

1Write

data Mux

ALUcontrol

RegWrite

MemRead

IF/ID ID/EX EX/MEM MEM/WB

MemWrite

Address

Datamemory

Add Addresult

Shiftleft 2

ALUresult

ALUZero

written during each clock cycle

Divide the control lines into groups

Fig. 6.25

How to set control lines at each stage? Extend the pipeline registers to include control information 將控制訊號儲存在 pipeline registers

Propagation of control (Fig.6.26)

Control

IF/ID ID/EX EX/MEM MEM/WB

Instruction

Instructionmemory

Branch

RegDst

ALUSrc

16 32Instruction[15– 0]

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

ALUresult

Writedata

Readdata

ALUcontrol

Shiftleft 2

MemRead

Control

WBIF/ID

EX/MEM

MEM/WB

AddressData

memory

Address

Fig. 6.27

Instructionmemory

Branch

RegDst

ALUSrc

Add Addresult

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

ALUresult

ALUcontrol

Shiftleft 2

MemRead

Control

IF/ID EX/MEMID/EX

ID: before<1> EX: before<2> MEM: before<3> WB: before<4>

MEM/WB

IF: lw $10, 20($1)

Datamemory

Address

Writedata

Readdata

Instructionmemory

Branch

RegDst

ALUSrc

Add Addresult

Writeregister

Writedata

ALUresult

ALUcontrol

Shiftleft 2

IF/ID EX/MEMID/EX

ID: lw $10, 20($1) EX: before<1> MEM: before<2> WB: before<3>

MEM/WB

IF: sub $11, $2, $3

0Writedata

Readdata

lwControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Instruction[15– 0] Sign

extend

MemRead

Datamemory

Address

Clock 2

Clock 1

Old bookFig 6.31

Instructionmemory

Branch

RegDst

ALUSrc

Add Addresult

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

ALUresult

ALUcontrol

Shiftleft 2

MemRead

Control

IF/ID EX/MEMID/EX

ID: before<1> EX: before<2> MEM: before<3> WB: before<4>

MEM/WB

IF: lw $10, 20($1)

Datamemory

Address

Writedata

Readdata

Instructionmemory

Branch

RegDst

ALUSrc

Add Addresult

Writeregister

Writedata

ALUresult

ALUcontrol

Shiftleft 2

IF/ID EX/MEMID/EX

ID: lw $10, 20($1) EX: before<1> MEM: before<2> WB: before<3>

MEM/WB

IF: sub $11, $2, $3

0Writedata

Readdata

lwControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Instruction[15– 0] Sign

extend

MemRead

Datamemory

Address

Clock 2

Clock 1

Instructionmemory

Address

Branch

ALUSrc

Add Addresult

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

ALUresult

Shiftleft 2

MemRead

Control

IF/ID EX/MEMID/EX

ID: sub $11, $2, $3 EX: lw $10, . . . MEM: before<1> WB: before<2>

MEM/WB

IF: and $12, $4, $5

0Writedata

Readdata

Instructionmemory

Address

Branch

RegDst

ALUSrc

Add Addresult

Writeregister

Writedata 1

ALUresult

ALUcontrol

Shiftleft 2

IF/ID EX/MEMID/EX

ID: and $12, $2, $3 EX: sub $11, . . . MEM: lw $10, . . . WB: before<1>

MEM/WB

IF: or $13, $6, $7

0Writedata

andControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

MemRead

RegDst

ALUcontrol

ALUAddress Read

dataData

memory

Signextend

Datamemory

Address

Clock 3

Clock 4

20($1)

Instructionmemory

Address

Branch

ALUSrc

Add Addresult

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

ALUresult

Shiftleft 2

MemRead

Control

IF/ID EX/MEMID/EX

ID: sub $11, $2, $3 EX: lw $10, . . . MEM: before<1> WB: before<2>

MEM/WB

IF: and $12, $4, $5

0Writedata

Readdata

Instructionmemory

Address

Branch

RegDst

ALUSrc

Add Addresult

Writeregister

Writedata 1

ALUresult

ALUcontrol

Shiftleft 2

IF/ID EX/MEMID/EX

ID: and $12, $2, $3 EX: sub $11, . . . MEM: lw $10, . . . WB: before<1>

MEM/WB

IF: or $13, $6, $7

0Writedata

andControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

MemRead

RegDst

ALUcontrol

ALUAddress Read

dataData

memory

Signextend

Datamemory

Address

Clock 3

Clock 4

20($1)

Instructionmemory

Address

Branch

ALUSrc

Add Addresult

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

ALUresult

Shiftleft 2

MemRead

Control

IF/ID EX/MEMID/EX

ID: or $13, $6, $7 EX: and $12, . . . MEM: sub $11, . . . WB: lw $10, . . .

MEM/WB

IF: add $14, $8, $9

0Writedata

Readdata

Instructionmemory

Address

Branch

RegDst

ALUSrc

Add Addresult

ALUresult

ALUcontrol

Shiftleft 2

IF/ID EX/MEMID/EX

ID: add $14, $8, $9 EX: or $13, . . . MEM: and $12, . . . WB: sub $11, . . .

MEM/WB

IF: after<1>

0Writedata

addControl

Registers

Readdata 1

Readdata 2

Readregister 1

Readregister 2

MemRead

RegDst

ALUcontrol

ALUReaddata

Writeregister

Writedata

Datamemory

Address

Datamemory

Address

Signextend

Clock 5

Clock 6

20($1)

Forwarding Stalls

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

sub $2, $1, $3

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2:

DM Reg

Example: data hazards

Both r/w is oK!

Solution 1: software Insert independent instruction or nop (no operation)

sub $2, $1, $3nopnopand $12, $2, $6or $13, $6, $2add $14, $2, $2sw $15, 100($2)

Hardware approach: forwarding

IM Reg

sub $2, $1, $3

Programexecution order(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2 :

DM Reg

X X X – 20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :

EX/MEM MEM/WB 1EX hazard2MEM hazard

Registers

ID/EX MEM/WB

Datamemory

Forwardingunit

EX/MEM

b. With forwarding

ForwardB

Rd EX/MEM.RegisterRd

MEM/WB.RegisterRd

RtRtRs

ForwardA

ID/EX MEM/WB

Datamemory

EX/MEM

a. No forwarding

Registers

Control for forwarding Set ForwardA and ForwardB

EX hazard:If ( EX/MEM.RegWriteand (EX/MEM.RegisterRd <> 0)and (EX/MEM.RegisterRd = ID/EX.RegisterRs))ForwardA = 10

// 排除寫入 $0// 前一指令有寫入暫存器動作

// 前一指令有寫入暫存器號碼與現在欲使用暫存器相同

If ( EX/MEM.RegWriteand (EX/MEM.RegisterRd <> 0)and (EX/MEM.RegisterRd = ID/EX.RegisterRt))ForwardB = 10

// 排除寫入 $0// 前一指令有寫入暫存器動作

// 前一指令有寫入暫存器號碼與現在欲使用暫存器相同

Forwarding Stalls

Unsolved data hazards

lw $2, 20($1)

and $4, $2, $5

IM Reg DM Reg

IM DM Reg

CC 7 CC 8 CC 9

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

DM Reg

lw $2, 20($1)

and $4, $2, $5

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (in clock cycles)

IM Reg DM RegIM

IM DM Reg

CC 7 CC 8 CC 9 CC 10

DM Reg

RegReg

bubble

Hazard detection unit When to stall?

If ( ID/EX.MemRead and (( ID/EX.RegisterRt = IF/ID.RegisterRs) or ( ID/EX.RegisterRt = IF/ID.RegisterRt) )Stall the pipeline

// 前一指令是 load

Forwarding Stalls

Branch hazards Advanced pipelining

Advanced technology Single-cycle implementation multi-cycle implementation Pipelining (instruction-level

parallelism)* How to build fast processors ?1. Increase depth of the pipeline: Superpipelining

2. Replicate the internal components of the computer : multiple issue

Dynamic multiple issue (hardware): superscalarStatic multiple issue (compiler): VLIW

How to schedule?

Superpipelining (longer pipelines) Ideal improvement of pipelining

Time between inst. pipelined = Time between inst. nonpipelinedNumber of pipe stages

* Divide instruction up to 8 or more stages* How to balance the length of each cycle?=> Clock cycle depends on the step of longest time

access Reg

8 nsInstruction

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

access Reg

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

Multiple-issue pipeline Two problems

How to package instructions into issue slots

Static issue processor: compiler Dynamic issue processor: at runtime by

hardware Dealing with data and control hazards

Static issue processor: compiler Dynamic issue processor: at runtime by

hardware

Static multiple issueTime

2 4 6 8 10

Time2 4 6 8 10

Issue packet

• The mix of instructions can be initiated in a given clock is usually restricted Issue packet can be taken as a single instruction allowing several operations Very long instruction word (VLIW)

Ex. Two-issue MIPS 2 instructions are issued per clock

cycle One for integer ALU or branch The other for load or store

* One slot can be nop

A static two-issue datapath

PC Instructionmemory

RegistersMux

Datamemory

40000040

Signextend Sign

extend

ALU Address

Writedata

avoidStructural hazard

64-bit

Superscalar: Dynamic pipeline scheduling

Dynamic pipeline scheduling Hardware support for reordering the

order of instruction execution so as to avoid stalls

Ex. Unsolved pipeline stalllw $t0, 20($s2)addu $t1, $t0, $t2sub $s4, $s4, $t3slti $t5, $s4, 20Stall (memory is slow)

Dynamic pipeline scheduling

Commitunit

Instruction fetchand decode unit

In-order issue

In-order commit

Load/Store

FloatingpointIntegerInteger …Functional

unitsOut-of-order execute

Reservationstation

Buffers tohold operandsand operation

Reorder buffer,decide when it’s safeto put the result back

In the orderof programexecution

dynamic pipelining Static pipelining Advantage of dynamic pipelining

Hide memory latency Avoid the stalls that compiler could not

schedule Speculative execute instructions

(combined with branch prediction) while waiting for hazards to be solved

Fallacies and pitfalls Pipelining is easy Pipelining ideas can be implemented

independent of technology No. of transistors on-chip increase,

additional hardware makes sense Ex. Extra hardware in dynamic pipelining

Increase the depth of pipelining always increase performance

Depth of pipelining

Data hazards: large percentage of cycles become stalls

Control hazards: slower branches Pipeline register overhead

1 2 4 8 16

Pipeline depth

Fp pipelining, 19863 factors:

chap.6: enhancing performance with pipelining

savingdesign instruction

instruction throughput

instruction formats

placedesign instruction

l1 instruction cache

stages of datapath

pcreview datapath

datapath needs

Documents

structural & data hazards topic 3: pipelining ece 4750...

multiphase pipelining

recap (pipelining)

pipelining i

cs.305 computer architecture enhancing performance with...

mips pipelining

1 1998 morgan kaufmann publishers chapter six enhancing...

1 2004 morgan kaufmann publishers chapter six enhancing...

pipelining multiplier

xml pipelining

chapter6 pipelining

pipelining enhancing performance with c d...

1 cse 45432 suny new paltz chapter six enhancing performance...

pipelining & parallel processing -...

pipelining iv

software pipelining

pipelining & parallel processing -...

instruction pipelining

pipelining and retiming prepared by mark jarvin. agenda...

arm pipelining