l18 – pipeline issues 1 comp 411 – spring 2008 04/03/08 cpu pipelining issues finishing up...

19
L18 – Pipeline Issues 1 omp 411 – Spring 2008 04/03/0 8 CPU Pipelining Issues Finishing up Chapter 6 This pipe stuff makes my head hurt! What have you been beating your head against?

Post on 20-Dec-2015

214 views

Category:

Documents


0 download

TRANSCRIPT

L18 – Pipeline Issues 1Comp 411 – Spring 2008 04/03/08

CPU Pipelining Issues

Finishing up Chapter 6

This pipe stuff makesmy head hurt!

What have you been beating

your head against?

L18 – Pipeline Issues 2Comp 411 – Spring 2008 04/03/08

ALUA B

ALUFN

Data MemoryRD

WD R/WAdrWr

WDSEL0 1 2

PC+4

ZVN C

PC

+4

InstructionMemory

AD

00

BT

PC<31:29>:J<25:0>:00

JT

PCSEL 0123456

0x800000800x800000400x80000000

PCREG 00 IRREG

WARegister

FileRA1 RA2

RD1 RD2

J:<25:0>

Imm: <15:0>

+x4

BT

JT

Rt: <20:16>Rs: <25:21>

ASEL20 BSEL01

SEXTSEXT

shamt:<10:6>

“16”

1

=BZ

5-Stage miniMIPS

PCALU 00 IRALU

A B WDALU

PCMEM 00 IRMEM Y

MEM WDMEM

WARegister

File

WA WD

WEWERF

WASEL

Rd:<15:11>Rt:<20:16>

“31”

“27”

0 1 2 3

InstructionFetch

RegisterFile

ALU

WriteBack

PCWB 00 IRWB Y

WBMemory

Address is available right after instruction enters Memory stage

Data is needed just before rising clock edge at end of Write Back stage

almost 2 clock cycles

• Omits some details• NO bypass or interlock logic

L18 – Pipeline Issues 3Comp 411 – Spring 2008 04/03/08

Pipelining

Improve performance by increasing instruction throughput

Ideal speedup is number of stages in the pipeline. Do we achieve this?

Instructionfetch

Reg ALUData

accessReg

8 nsInstruction

fetchReg ALU

Dataaccess

Reg

8 nsInstruction

fetch

8 ns

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

...

Programexecutionorder(in instructions)

Instructionfetch

Reg ALUData

accessReg

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

fetchReg ALU

Dataaccess

Reg

2 nsInstruction

fetchReg ALU

Dataaccess

Reg

2 ns 2 ns 2 ns 2 ns 2 ns

Programexecutionorder(in instructions)

L18 – Pipeline Issues 4Comp 411 – Spring 2008 04/03/08

Pipelining

What makes it easyall instructions are the same length

just a few instruction formats

memory operands appear only in loads and stores

What makes it hard?structural hazards: suppose we had only one memory

control hazards: need to worry about branch instructions

data hazards: an instruction depends on a previous instruction

Individual Instructions still take the same number of cycles

But we’ve improved the through-put by increasing the number of simultaneously executing instructions

L18 – Pipeline Issues 5Comp 411 – Spring 2008 04/03/08

Structural Hazards

InstFetch

RegRead

ALU DataAccess

Reg Write

InstFetch

RegRead

ALU DataAccess

Reg Write

InstFetch

RegRead

ALU DataAccess

Reg Write

InstFetch

RegRead

ALU DataAccess

Reg Write

L18 – Pipeline Issues 6Comp 411 – Spring 2008 04/03/08

Problem with starting next instruction before first is finisheddependencies that “go backward in time” are

data hazards

Data Hazards

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecutionorder(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2:

DM Reg

Reg

Reg

Reg

DM

L18 – Pipeline Issues 7Comp 411 – Spring 2008 04/03/08

Have compiler guarantee no hazardsWhere do we insert the “nops” ?

sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)

Problem: this really slows us down!

Software Solution

L18 – Pipeline Issues 8Comp 411 – Spring 2008 04/03/08

Use temporary results, don’t wait for them to be written register file forwarding to handle read/write to same register ALU forwarding

Forwarding

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecution order(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2 :

DM Reg

Reg

Reg

Reg

X X X – 20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :

DM

L18 – Pipeline Issues 9Comp 411 – Spring 2008 04/03/08

Load word can still cause a hazard:an instruction tries to read a register following a load instruction

that writes to the same register.

Thus, we need a hazard detection unit to “stall” the instruction

Can't always forward

Reg

IM

Reg

Reg

IM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

lw $2, 20($1)

Programexecutionorder(in instructions)

and $4, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

DM Reg

Reg

Reg

DM

L18 – Pipeline Issues 10Comp 411 – Spring 2008 04/03/08

Stalling

We can stall the pipeline by keeping an instruction in the same stage

lw $2, 20($1)

Programexecutionorder(in instructions)

and $4, $2, $5

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

Reg

IM

Reg

Reg

IM DM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (in clock cycles)

IM Reg DM RegIM

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9 CC 10

DM Reg

RegReg

Reg

bubble

L18 – Pipeline Issues 11Comp 411 – Spring 2008 04/03/08

When we decide to branch, other instructions are in the pipeline!

We are predicting “branch not taken”need to add hardware for flushing instructions if

we are wrong

Branch Hazards

Reg

Reg

CC 1

Time (in clock cycles)

40 beq $1, $3, 7

Programexecutionorder(in instructions)

IM Reg

IM DM

IM DM

IM DM

DM

DM Reg

Reg Reg

Reg

Reg

RegIM

44 and $12, $2, $5

48 or $13, $6, $2

52 add $14, $2, $2

72 lw $4, 50($7)

CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9

Reg

L18 – Pipeline Issues 12Comp 411 – Spring 2008 04/03/08

Improving Performance

Try to avoid stalls! E.g., reorder these instructions:

lw $t0, 0($t1)lw $t2, 4($t1)sw $t2, 0($t1)sw $t0, 4($t1)

Add a “branch delay slot”the next instruction after a branch is always executedrely on compiler to “fill” the slot with something

useful

Superscalar: start more than one instruction in the same cycle

L18 – Pipeline Issues 13Comp 411 – Spring 2008 04/03/08

Dynamic Scheduling

The hardware performs the “scheduling” hardware tries to find instructions to execute

out of order execution is possible

speculative execution and dynamic branch prediction

All modern processors are very complicatedPentium 4: 20 stage pipeline, 6 simultaneous

instructions

PowerPC and Pentium: branch history table

Compiler technology important

L18 – Pipeline Issues 14Comp 411 – Spring 2008 04/03/08

ALUA B

ALUFN

RD

WD R/WAdrWr

WDSEL0 1 2

PC+4

ZVN C

PC

+4

InstructionMemory

AD

00

BT

PC<31:29>:J<25:0>:00

JT

PCSEL 0123456

0x800000800x800000400x80000000

PCREG 00 IRREG

WARegister

FileRA1 RA2

RD1 RD2

J:<25:0>

Imm: <15:0>

+x4

BT

JT

Rt: <20:16>Rs: <25:21>

ASEL20 BSEL01

SEXTSEXT

shamt:<10:6>

“16”

1

=BZ

5-Stage miniMIPS

PCALU 00 IRALU

A B WDALU

PCMEM 00 IRMEM Y

MEM WDMEM

WARegister

File

WA WD

WEWERF

WASEL

Rd:<15:11>Rt:<20:16>

“31”

“27”

0 1 2 3

InstructionFetch

RegisterFile

ALU

WriteBack

PCWB 00 IRWB Y

WBMemory

We wanted a simple, clean pipeline but…

• added CLK EN to freeze IF/RF stages so we can wait for lw to reach WB stage

NOP

Data Memory

•broke the sequential semantics of ISA by adding a branch delay-slot and early branch resolution logic•added A/B bypass muxes to get data before it’s written to regfile

L18 – Pipeline Issues 15Comp 411 – Spring 2008 04/03/08

Bypass MUX Details

RD1

RD2

Register File

A Bypass B Bypass

ASEL02

AALU

BSEL01

BALU

’16’ SEXT(imm)

To ALU To ALU

WDALU

To Mem

JT

from ALU/MEM/WB/PC from ALU/MEM/WB/PC

1

shamt =BZ

The previous diagram was oversimplified. Really need for the bypass muxes to precede the A and B muxes to provide the correct values for the jump target (JT), write data, and early branch decision logic.

L18 – Pipeline Issues 16Comp 411 – Spring 2008 04/03/08

ALUA B

ALUFN

RD

WD R/WAdrWr

WDSEL0 1 2

PC+4

ZVN C

PC

+4

InstructionMemory

AD

00

BT

PC<31:29>:J<25:0>:00

JT

PCSEL 0123456

0x800000800x800000400x80000000

PCREG 00 IRREG

WARegister

FileRA1 RA2

RD1 RD2

J:<25:0>

Imm: <15:0>

+x4

BT

JT

Rt: <20:16>Rs: <25:21>

ASEL20 BSEL01

SEXTSEXT

shamt:<10:6>

“16”

1

=BZ

Final 5-Stage miniMIPS

PCALU 00 IRALU

A B WDALU

PCMEM 00 IRMEM Y

MEM WDMEM

WARegister

File

WA WD

WEWERF

WASEL

Rd:<15:11>Rt:<20:16>

“31”

“27”

0 1 2 3

InstructionFetch

RegisterFile

ALU

WriteBack

PCWB 00 IRWB Y

WBMemory Data Memory

NOP

•Added branch delay slot and early branch resolution logic to fix a CONTROL hazard

• Added lots of bypass paths and detection logic to fix various STRUCTURAL hazards

• Added pipeline interlocks to fix load delay STRUCTURAL hazard

NOP

L18 – Pipeline Issues 17Comp 411 – Spring 2008 04/03/08

Pipeline Summary (I)

• Started with unpipelined implementation– direct execute, 1 cycle/instruction– it had a long cycle time: mem + regs + alu + mem

+ wb

• We ended up with a 5-stage pipelined implementation– increase throughput (3x???)– delayed branch decision (1 cycle)

Choose to execute instruction after branch– delayed register writeback (3 cycles)

Add bypass paths (6 x 2 = 12) to forward correct value

– memory data available only in WB stageIntroduce NOPs at IRALU, to stall IF and RF

stages until LD result was ready

L18 – Pipeline Issues 18Comp 411 – Spring 2008 04/03/08

Pipeline Summary (II)

Fallacy #1: Pipelining is easySmart people get it wrong all of the time!

Fallacy #2: Pipelining is independent of ISAMany ISA decisions impact how easy/costly it is to implement pipelining (i.e. branch semantics, addressing modes).

Fallacy #3: Increasing Pipeline stages improves performanceDiminishing returns. Increasing complexity.

L18 – Pipeline Issues 19Comp 411 – Spring 2008 04/03/08

RISC = Simplicity???

Generalization of registers and

operand coding

Complex instructions, addressing modes

Addressingfeatures, eg

index registers

RISCs

Primitive Machines with

direct implementati

ons

VLIWs,Super-Scalars ?

Pipelines, Bypasses,

Annulment, …, ...

“The P.T. Barnum World’s Tallest Dwarf Competition”

World’s Most Complex RISC?