
L18 – Pipeline Issues, Comp 411 – Spring 2008, 04/03/08

CPU Pipelining Issues

Finishing up Chapter 6

This pipe stuff makes my head hurt!

What have you been beating your head against?


5-Stage miniMIPS

[Figure: datapath of the 5-stage miniMIPS pipeline. Stages: Instruction Fetch, Register File, ALU, Memory, Write Back. Each stage has its own PC and IR pipeline registers (REG, ALU, MEM, WB).]

Data Memory timing: the address is available right after the instruction enters the Memory stage, and the data is needed just before the rising clock edge at the end of the Write Back stage, so the memory has almost 2 clock cycles.

• Omits some details
• NO bypass or interlock logic


Pipelining

Improve performance by increasing instruction throughput

Ideal speedup is number of stages in the pipeline. Do we achieve this?

[Figure: three loads, lw $1, 100($0), lw $2, 200($0) and lw $3, 300($0), plotted as program execution order (in instructions) against time. Each load passes through Instruction fetch, Reg, ALU, Data access, Reg. Unpipelined, a new load starts every 8 ns; pipelined, a new load starts every 2 ns.]
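The timing in this figure can be checked with a little arithmetic. The sketch below is a minimal model and not part of the original slides: it assumes every unpipelined instruction takes 8 ns, every pipeline stage takes 2 ns, and nothing ever stalls. The function names are made up for illustration.

```python
STAGES = 5            # instruction fetch, reg, ALU, data access, reg
UNPIPELINED_NS = 8    # time per instruction without pipelining (from the figure)
STAGE_NS = 2          # clock period once pipelined (from the figure)

def unpipelined_time(n):
    """Each instruction must finish before the next one starts."""
    return n * UNPIPELINED_NS

def pipelined_time(n):
    """The first instruction needs STAGES cycles; each later one finishes one cycle after."""
    return (STAGES + n - 1) * STAGE_NS

for n in (3, 1000):
    speedup = unpipelined_time(n) / pipelined_time(n)
    print(n, unpipelined_time(n), pipelined_time(n), round(speedup, 2))
```

For the three loads this gives 24 ns against 14 ns, a speedup of about 1.7. For long instruction streams the ratio approaches 8 / 2 = 4 rather than the ideal 5, presumably because the stages are not equally long and the clock has to fit the slowest one.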



Pipelining

What makes it easy?
• all instructions are the same length
• just a few instruction formats
• memory operands appear only in loads and stores

What makes it hard?
• structural hazards: suppose we had only one memory
• control hazards: need to worry about branch instructions
• data hazards: an instruction depends on a previous instruction

Individual instructions still take the same number of cycles, but we’ve improved the throughput by increasing the number of simultaneously executing instructions.


Structural Hazards

[Figure: four instructions overlapped in the pipeline, each one stage behind the previous one. Every instruction passes through Inst Fetch, Reg Read, ALU, Data Access, Reg Write.]
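To see why a single shared memory would be a structural hazard, the following sketch lists the cycles in which one instruction is in Inst Fetch while another is in Data Access. It is an illustration only: it assumes one new instruction enters the pipeline every cycle and that every instruction really uses memory in its Data Access stage (for example, a run of loads). The helper names are hypothetical.

```python
STAGES = ["Inst Fetch", "Reg Read", "ALU", "Data Access", "Reg Write"]

def stage_of(instr_index, cycle):
    """Stage occupied by instruction `instr_index` (0-based) in `cycle` (1-based), or None."""
    pos = cycle - 1 - instr_index   # one new instruction enters Inst Fetch each cycle
    return STAGES[pos] if 0 <= pos < len(STAGES) else None

def memory_conflicts(n_instructions):
    """Cycles in which one instruction fetches while another accesses data."""
    conflicts = []
    last_cycle = n_instructions + len(STAGES) - 1
    for cycle in range(1, last_cycle + 1):
        stages = [stage_of(i, cycle) for i in range(n_instructions)]
        if "Inst Fetch" in stages and "Data Access" in stages:
            conflicts.append(cycle)
    return conflicts

print(memory_conflicts(4))   # [4]
```

With four overlapped instructions the first is in Data Access in cycle 4, exactly when the fourth is being fetched, so one memory port would be asked to do two things at once.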


Data Hazards

Problem with starting the next instruction before the first is finished: dependencies that “go backward in time” are data hazards.

    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

Value of register $2, by clock cycle:

    CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
    10    10    10    10    10/-20  -20   -20   -20   -20

[Figure: pipeline diagram of the five instructions against time (in clock cycles), stages IM, Reg, DM, Reg.]
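The hazards in this sequence can be found mechanically. The sketch below is a simplified model, not the lecture's hardware: it assumes no forwarding, and that the register file is written in the first half of a cycle and read in the second half (which is what the "10/-20" entry in CC 5 suggests). Under those assumptions, only consumers one or two instructions behind the producer read a stale value. The tuple encoding and the HAZARD_WINDOW name are made up for illustration.

```python
# (text, destination register or None, source registers)
PROGRAM = [
    ("sub $2, $1, $3",  "$2",  ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2",  "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2"]),
    ("sw $15, 100($2)", None,  ["$15", "$2"]),
]

HAZARD_WINDOW = 2   # assumed: consumers 1 or 2 instructions behind read too early

def find_data_hazards(program):
    """Return (producer, consumer, register) triples that read a stale value."""
    hazards = []
    for i, (_, dest, _) in enumerate(program):
        if dest is None or dest == "$0":
            continue
        # Only look at the next HAZARD_WINDOW instructions; redefinitions are ignored.
        for j in range(i + 1, min(i + 1 + HAZARD_WINDOW, len(program))):
            if dest in program[j][2]:
                hazards.append((program[i][0], program[j][0], dest))
    return hazards

for producer, consumer, reg in find_data_hazards(PROGRAM):
    print(f"{consumer!r} reads {reg} before {producer!r} writes it back")
```

Under these assumptions the and and the or are the two problem instructions; the add reads $2 in the same cycle the sub writes it, and the sw reads it one cycle later.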


Software Solution

Have the compiler guarantee no hazards. Where do we insert the “nops”?

    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

Problem: this really slows us down!
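One way to answer the nop question is to let a small pass do it. This is a sketch under stated assumptions, not a real compiler pass: it assumes no forwarding and a register file that is written before it is read within a cycle, so a consumer must be at least three slots behind its producer. The tuple encoding, NOP and HAZARD_WINDOW are hypothetical names.

```python
# (text, destination register or None, source registers)
PROGRAM = [
    ("sub $2, $1, $3",  "$2",  ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2",  "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2"]),
    ("sw $15, 100($2)", None,  ["$15", "$2"]),
]

NOP = ("nop", None, [])
HAZARD_WINDOW = 2   # assumed: a producer 1 or 2 slots ahead is too close

def insert_nops(program):
    """Pad the program so no instruction reads a register written 1 or 2 slots earlier."""
    out = []
    for instr in program:
        sources = instr[2]
        needed = 0
        for distance in range(1, HAZARD_WINDOW + 1):
            if distance > len(out):
                break
            dest = out[-distance][1]
            if dest is not None and dest in sources:
                # Push the consumer out until it is HAZARD_WINDOW + 1 slots away.
                needed = max(needed, HAZARD_WINDOW - distance + 1)
        out.extend([NOP] * needed)
        out.append(instr)
    return out

for text, _, _ in insert_nops(PROGRAM):
    print(text)
```

Under these assumptions both nops land directly after the sub, and the five-instruction sequence grows to seven, which is the slowdown the slide complains about.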


Forwarding

Use temporary results, don’t wait for them to be written:
• register file forwarding to handle read/write to same register
• ALU forwarding

Program execution order (in instructions):

    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)

Values by clock cycle:

                   CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
    register $2    10    10    10    10    10/-20  -20   -20   -20   -20
    EX/MEM         X     X     X     -20   X       X     X     X     X
    MEM/WB         X     X     X     X     -20     X     X     X     X

[Figure: pipeline diagram of the five instructions, stages IM, Reg, DM, Reg.]
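The choice between the EX/MEM and MEM/WB values can be sketched as a small selection function. This is the usual textbook forwarding rule written in Python for illustration, not the lecture's actual logic; the function name and the (writes_register, dest_reg) encoding are assumptions.

```python
def forward_select(src_reg, ex_mem, mem_wb):
    """Pick where an ALU operand should come from.

    ex_mem and mem_wb are (writes_register, dest_reg) pairs describing the
    instructions one and two ahead of the one now entering the ALU.
    """
    if src_reg == 0:
        return "regfile"            # $0 is always zero, never forwarded
    writes, dest = ex_mem
    if writes and dest == src_reg:
        return "EX/MEM"             # the newest value wins
    writes, dest = mem_wb
    if writes and dest == src_reg:
        return "MEM/WB"
    return "regfile"

# CC 4: 'and $12, $2, $5' is in the ALU and 'sub $2, $1, $3' has just left it.
print(forward_select(2, ex_mem=(True, 2), mem_wb=(False, 0)))   # EX/MEM
# CC 5: 'or $13, $6, $2' is in the ALU and sub's result now sits in MEM/WB.
print(forward_select(2, ex_mem=(True, 12), mem_wb=(True, 2)))   # MEM/WB
```

The two calls line up with the table: -20 sits in EX/MEM during CC 4 and in MEM/WB during CC 5. Checking EX/MEM first matters when both registers hold a value for the same destination, because the EX/MEM one is more recent.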


Can't always forward

Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register.

Thus, we need a hazard detection unit to “stall” the instruction.

    lw  $2, 20($1)
    and $4, $2, $5
    or  $8, $2, $6
    add $9, $4, $2
    slt $1, $6, $7

[Figure: pipeline diagram of these instructions through CC 9, stages IM, Reg, DM, Reg.]
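The check a hazard detection unit has to make can be written down in a few lines. This is a minimal sketch of the common textbook condition, assuming forwarding is already in place so that only the instruction immediately after a load is affected; the dict encoding and the name must_stall are made up for illustration.

```python
def must_stall(in_alu, in_decode):
    """True if the instruction being decoded must wait one cycle.

    Each instruction is a dict with 'op', 'dest' and 'sources' keys
    (a hypothetical encoding for this sketch).
    """
    return (
        in_alu["op"] == "lw"
        and in_alu["dest"] != "$0"
        and in_alu["dest"] in in_decode["sources"]
    )

lw   = {"op": "lw",  "dest": "$2", "sources": ["$1"]}
and_ = {"op": "and", "dest": "$4", "sources": ["$2", "$5"]}
slt  = {"op": "slt", "dest": "$1", "sources": ["$6", "$7"]}

print(must_stall(lw, and_))   # True
print(must_stall(lw, slt))    # False
```

In the sequence above only the and needs to stall. The or also reads $2, but it is two instructions behind the load, so by the time it reaches the ALU the loaded value can be forwarded.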


Stalling

We can stall the pipeline by keeping an instruction in the same stage

    lw  $2, 20($1)
    and $4, $2, $5
    or  $8, $2, $6
    add $9, $4, $2
    slt $1, $6, $7

[Figure: pipeline diagram against time (in clock cycles), CC 1 to CC 10. After the lw, the and repeats its Reg stage and the or repeats its IM stage, and a bubble moves down the pipeline in their place.]
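The effect of the bubble on timing can be traced with a short model. This is a sketch, assuming a 5-stage in-order pipeline with forwarding where the only stall is one cycle for an instruction that uses the result of the load immediately before it. The tuple layout and the name alu_cycles are hypothetical.

```python
# (text, opcode, destination register, source registers)
PROGRAM = [
    ("lw $2, 20($1)",  "lw",  "$2", ["$1"]),
    ("and $4, $2, $5", "and", "$4", ["$2", "$5"]),
    ("or $8, $2, $6",  "or",  "$8", ["$2", "$6"]),
    ("add $9, $4, $2", "add", "$9", ["$4", "$2"]),
    ("slt $1, $6, $7", "slt", "$1", ["$6", "$7"]),
]

def alu_cycles(program):
    """Clock cycle in which each instruction occupies the ALU stage."""
    cycles = []
    for i, (_, _, _, sources) in enumerate(program):
        if i == 0:
            cycles.append(3)                      # IM in CC 1, Reg in CC 2, ALU in CC 3
            continue
        _, prev_op, prev_dest, _ = program[i - 1]
        bubble = 1 if prev_op == "lw" and prev_dest in sources else 0
        cycles.append(cycles[-1] + 1 + bubble)    # in order, one per cycle, plus any bubble
    return cycles

cycles = alu_cycles(PROGRAM)
for (text, *_), c in zip(PROGRAM, cycles):
    print(f"{text:<16} ALU in CC {c}, write back in CC {c + 2}")
```

The single bubble pushes the and and everything behind it back by one cycle, so the slt writes back in CC 10 instead of CC 9, which matches the extra clock cycle in the diagram.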


Branch Hazards

When we decide to branch, other instructions are in the pipeline!

We are predicting “branch not taken”:
• need to add hardware for flushing instructions if we are wrong

    40  beq $1, $3, 7
    44  and $12, $2, $5
    48  or  $13, $6, $2
    52  add $14, $2, $2
    72  lw  $4, 50($7)

[Figure: pipeline diagram of these instructions, CC 1 to CC 9, stages IM, Reg, DM, Reg.]
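The addresses in this example follow from the MIPS branch rule that the offset counts instructions from PC + 4. The sketch below works that out. The second helper lists the instructions fetched while the branch outcome is still unknown, under the assumption that three are fetched before the decision, which is what the listing (44, 48, 52 before 72) suggests. Both function names are made up for illustration.

```python
def branch_target(pc, offset_words):
    """Target of a MIPS-style branch: the offset counts instructions from PC + 4."""
    return pc + 4 + 4 * offset_words

def wrong_path(pc, fetched_before_decision):
    """Addresses fetched sequentially while the branch outcome is unknown."""
    return [pc + 4 * k for k in range(1, fetched_before_decision + 1)]

print(branch_target(40, 7))   # 72
print(wrong_path(40, 3))      # [44, 48, 52]
```

If the beq at address 40 is taken, the three instructions at 44, 48 and 52 are the ones that must be flushed before the lw at 72 proceeds.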


Improving Performance

Try to avoid stalls! E.g., reorder these instructions:

    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)
    sw $t0, 4($t1)

Add a “branch delay slot”:
• the next instruction after a branch is always executed
• rely on compiler to “fill” the slot with something useful

Superscalar: start more than one instruction in the same cycle
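For the reordering exercise, one candidate answer is to swap the two stores, so that the sw of $t2 no longer sits directly behind the lw that produces $t2. The sketch below counts load-use stalls before and after. It is a conservative model and an assumption on my part: it treats a store's data register like any other source, although some designs can forward store data later and avoid the stall anyway. The tuple encoding is hypothetical.

```python
# (op, data register, offset, base register) for "op reg, offset(base)"
ORIGINAL = [
    ("lw", "$t0", 0, "$t1"),
    ("lw", "$t2", 4, "$t1"),
    ("sw", "$t2", 0, "$t1"),
    ("sw", "$t0", 4, "$t1"),
]
# Swap the two stores: they write different addresses, so the result is unchanged.
REORDERED = [ORIGINAL[0], ORIGINAL[1], ORIGINAL[3], ORIGINAL[2]]

def sources(instr):
    """Registers an instruction reads: base for lw, data and base for sw."""
    op, reg, _, base = instr
    return [base] if op == "lw" else [reg, base]

def load_use_stalls(program):
    """Count instructions that read a register loaded by the instruction just before."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in sources(cur):
            stalls += 1
    return stalls

print(load_use_stalls(ORIGINAL), load_use_stalls(REORDERED))   # 1 0
```

The swap is legal here because the two stores go to different addresses (0($t1) and 4($t1)) and neither changes a register.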


Dynamic Scheduling

The hardware performs the “scheduling”:
• hardware tries to find instructions to execute
• out of order execution is possible
• speculative execution and dynamic branch prediction

All modern processors are very complicated:
• Pentium 4: 20 stage pipeline, 6 simultaneous instructions
• PowerPC and Pentium: branch history table

Compiler technology important
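To make "branch history table" concrete, here is a minimal sketch of one common scheme: a small table of 2-bit saturating counters indexed by the low bits of the branch address. This is a textbook design chosen for illustration; the slides do not say which scheme the processors named above use, and the class name, table size and branch address are assumptions.

```python
class BranchHistoryTable:
    """Table of 2-bit saturating counters, one per slot."""

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [0] * entries      # 0, 1 predict not taken; 2, 3 predict taken

    def _index(self, pc):
        return (pc >> 2) % self.entries    # drop the byte offset, keep the low bits

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

bht = BranchHistoryTable()
outcomes = [True] * 9 + [False]            # a loop branch: taken 9 times, then falls through
misses = 0
for taken in outcomes:
    if bht.predict(0x40) != taken:
        misses += 1
    bht.update(0x40, taken)
print(misses)   # 3
```

Starting cold, the predictor misses twice while the counter warms up and once more when the loop exits, and the counter is still at 2 afterwards, so the next run of the same loop would be predicted taken from its first iteration.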


5-Stage miniMIPS

[Figure: the 5-stage miniMIPS datapath again (Instruction Fetch, Register File, ALU, Memory, Write Back, with Data Memory), now with a NOP input added.]

We wanted a simple, clean pipeline but…
• added CLK EN to freeze IF/RF stages so we can wait for lw to reach WB stage
• broke the sequential semantics of ISA by adding a branch delay-slot and early branch resolution logic
• added A/B bypass muxes to get data before it’s written to regfile


Bypass MUX Details

[Figure: bypass mux detail. Labels: RD1, RD2 (Register File); A Bypass, B Bypass (each with inputs from ALU/MEM/WB/PC); JT; =BZ; WD ALU (to Mem); ASEL and BSEL muxes with shamt, ’16’ and SEXT(imm) inputs; A ALU and B ALU (to ALU).]

The previous diagram was oversimplified. The bypass muxes really need to precede the A and B muxes to provide the correct values for the jump target (JT), write data, and early branch decision logic.


Final 5-Stage miniMIPS

[Figure: final 5-stage miniMIPS datapath (Instruction Fetch, Register File, ALU, Memory, Write Back, with Data Memory) including the NOP inputs.]

• Added branch delay slot and early branch resolution logic to fix a CONTROL hazard
• Added lots of bypass paths and detection logic to fix various DATA hazards
• Added pipeline interlocks to fix the load delay DATA hazard


Pipeline Summary (I)

• Started with unpipelined implementation
  – direct execute, 1 cycle/instruction
  – it had a long cycle time: mem + regs + alu + mem + wb

• We ended up with a 5-stage pipelined implementation
  – increase throughput (3x???)
  – delayed branch decision (1 cycle)
      Choose to execute instruction after branch
  – delayed register writeback (3 cycles)
      Add bypass paths (6 x 2 = 12) to forward correct value
  – memory data available only in WB stage
      Introduce NOPs at IRALU, to stall IF and RF stages until LD result was ready


Pipeline Summary (II)

Fallacy #1: Pipelining is easy
  Smart people get it wrong all of the time!

Fallacy #2: Pipelining is independent of ISA
  Many ISA decisions impact how easy/costly it is to implement pipelining (e.g., branch semantics, addressing modes).

Fallacy #3: Increasing pipeline stages improves performance
  Diminishing returns. Increasing complexity.
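The "diminishing returns" in Fallacy #3 can be illustrated with a simple model. This is a sketch with hypothetical numbers, not data from the lecture: it assumes the logic of one instruction (8 ns here) can be split evenly into any number of stages, and that every stage adds a fixed pipeline register overhead (0.5 ns here). It ignores hazards, which in practice cost more as the pipeline gets deeper.

```python
LOGIC_NS = 8.0      # hypothetical total logic delay of one instruction
OVERHEAD_NS = 0.5   # hypothetical per-stage pipeline register overhead

def ideal_speedup(stages):
    """Speedup over the unpipelined design when no instruction ever stalls."""
    cycle = LOGIC_NS / stages + OVERHEAD_NS
    return LOGIC_NS / cycle

for n in (5, 10, 20, 40):
    print(n, round(ideal_speedup(n), 2))
```

In this model each extra stage buys less than the one before, and the speedup can never exceed LOGIC_NS / OVERHEAD_NS (16 with these numbers) no matter how many stages are added.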


RISC = Simplicity???

[Figure labels: Primitive Machines with direct implementations; Addressing features, e.g. index registers; Generalization of registers and operand coding; Complex instructions, addressing modes; RISCs; Pipelines, Bypasses, Annulment, …; VLIWs, Super-Scalars?]

“The P.T. Barnum World’s Tallest Dwarf Competition”

World’s Most Complex RISC?
