l16 – pipelining 1 comp 411 – spring 2011 4/20/2011 pipelining between 411 problems sets, i...

28
L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty laundry Read Chapter 4.5-4.8

Upload: berenice-price

Post on 05-Jan-2016

213 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 1Comp 411 – Spring 2011 4/20/2011

Pipelining

Between 411 problems sets, I

haven’t had a minute to do laundry

Now that’s what Icall dirty laundry

Read Chapter 4.5-4.8

Page 2: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 2Comp 411 – Spring 2011 4/20/2011

Forget 411… Let’s Solve a “Relevant Problem”

Device: Washer

Function: Fill, Agitate, Spin

WasherPD = 30 mins

Device: Dryer

Function: Heat, Spin

DryerPD = 60 mins

INPUT:dirty laundry

OUTPUT:4 more weeks

Page 3: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 3Comp 411 – Spring 2011 4/20/2011

One Load at a Time

Everyone knows that the real reason that UNC students put off doing laundry so long is *not* because they procrastinate, are lazy, or even have better things to do.

The fact is, doing laundry one load at a time is not smart.

(Sorry Mom, but you were wrong about this one!)

Step 1:

Step 2:

Total = WasherPD + DryerPD

= _________ mins90

Page 4: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 4Comp 411 – Spring 2011 4/20/2011

Doing N Loads of Laundry

Here’s how they do laundry at Duke, the “combinational” way.

(Actually, this is just an urban legend. No one at Duke actually does laundry. The butler’s all arrive on Wednesday morning, pick up the dirty laundry and return it all pressed and starched by dinner)

Step 1:

Step 2:

Step 3:

Step 4:

Total = N*(WasherPD + DryerPD)

= ____________ minsN*90

Page 5: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 5Comp 411 – Spring 2011 4/20/2011

Doing N Loads… the UNC way

UNC students “pipeline” the laundry process.

That’s why we wait!

Step 1:

Step 2:

Step 3:

Total = N * Max(WasherPD, DryerPD)

= ____________ mins

N*60

…Actually, it’s more like N*60 + 30 if we account for the startup transient correctly. When doing pipeline analysis, we’re mostly interested in the “steady state” where we assume we have an infinite supply of inputs.

Actually, it’s more like N*60 + 30 if we account for the startup transient correctly. When doing pipeline analysis, we’re mostly interested in the “steady state” where we assume we have an infinite supply of inputs.

Page 6: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 6Comp 411 – Spring 2011 4/20/2011

Recall Our Performance Measures

Latency:The delay from when an input is established until the output associated with that input becomes valid.

(Duke Laundry = _________ mins)( UNC Laundry = _________ mins)

Throughput:The rate at which inputs or outputs are processed.

(Duke Laundry = _________ outputs/min)( UNC Laundry = _________ outputs/min)

90

120

1/90

1/60

Assuming that the wash is started as soon as possible and waits (wet) in the washer until dryer is available.

Even though we increaselatency, it takes less time per load

Page 7: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 7Comp 411 – Spring 2011 4/20/2011

Okay, Back to Circuits…

F

G

HX P(X)

For combinational logic: latency = tPD, throughput = 1/tPD.

We can’t get the answer faster, but are we making effective use of our hardware at all times?

G(X)

F(X)

P(X)

X

F & G are “idle”, just holding their outputs stable while H performs its computation

Page 8: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 8Comp 411 – Spring 2011 4/20/2011

Pipelined Circuits

use registers to hold H’s input stable!

F

G

HX P(X)

15

20

25

Now F & G can be working on input Xi+1 while H is performing its computation on Xi. We’ve created a 2-stage pipeline : if we have a valid input X during clock cycle j, P(X) is valid during clock j+2.

Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are using ideal zero-delay registers (ts = 0, tpd = 0):

latency45

______

throughput1/45

______unpipelined

2-stage pipeline 50

worse

1/25better

Pipelining uses registers

to improve the

throughput of combinational

circuits

Page 9: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 9Comp 411 – Spring 2011 4/20/2011

Pipeline Diagrams

Input

F Reg

G Reg

H Reg

i i+1 i+2 i+3

Xi Xi+1

F(Xi)

G(Xi)

Xi+2

F(Xi+1)

G(Xi+1)

H(Xi)

Xi+3

F(Xi+2)

G(Xi+2)

H(Xi+1)

Clock cycleP

ipe

line

sta

ge

s

The results associated with a particular set of input data moves diagonally through the diagram, progressing through one pipeline stage each clock cycle.

H(Xi+2)

F

G

HX P(X)

15

20

25

This is an exampleof parallelism. Atany instant we arecomputing 2 results.

Page 10: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 10Comp 411 – Spring 2011 4/20/2011

Pipelining Summary

Advantages:– Higher throughput than combinational system– Different parts of the logic work on different parts of the

problem…

Disadvantages:– Generally, increases latency– Only as good as the *weakest* link

(often called the pipeline’s BOTTLENECK)

Page 11: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 11Comp 411 – Spring 2011 4/20/2011

Review of CPU Performance

MIPS = Millions of Instructions/Second

Freq = Clock Frequency, MHz

CPI = Clocks per Instruction

MIPS =FreqCPI

To Increase MIPS:

1. DECREASE CPI.

- RISC simplicity reduces CPI to 1.0.

- CPI below 1.0? State-of-the-art multiple instruction issue

2. INCREASE Freq.

- Freq limited by delay along longest combinational path; hence

- PIPELINING is the key to improving performance.

Page 12: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 12Comp 411 – Spring 2011 4/20/2011

Where Are the Bottlenecks?

Pipelining goal:Break LONG combinational paths memories, ALU in separate stages

WA

PC

+4

InstructionMemory

A

D

RegisterFile

RA1 RA2

RD1 RD2

ALUA B

WA

ALUFN

Control Logic

Data Memory

RD

WD R/W

Adr

Wr

WDSEL0 1 2

BSELWDSELALUFNWr

J:<25:0>

PCSEL

WERF

WERF

00

PC+4

Rt: <20:16>

Imm: <15:0>

ASEL

SEXT

+x4

BT

Z

BT

WASEL

Rd:<15:11>Rt:<20:16> 0

123

WASEL

PC<31:29>:J<25:0>:00

JT

JT

N V C

ZVN C

Rs: <25:21>

ASEL20

SEXT

BSEL01

SEXT

shamt:<10:6>

PCSEL 0123456

“16”

IRQ

0x800000800x800000400x80000000

RESET

“31”“27”

1

WD

WE

Page 13: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 13Comp 411 – Spring 2011 4/20/2011

miniMIPS Timing

Different instructions use various parts of the data path.

add $4, $5, $6beq $1, $2, 40lw $3, 30($0)jal 20000

Program execution order

Time

CLK

Instruction FetchInstruction DecodeRegister Prop DelayALU OperationBranch TargetData AccessRegister Setup

sw $2, 20($4)

The above scenario is possible only if the system could vary the clock period based on the instruction being executed. This leads to complicated timing generation, and, in the end, slower systems, since it is not very compatible with pipelining!

1 instr every 14 nS, 14 nS, 20 nS, 9 nS, 19 nS

6 nS2 nS2 nS5 nS4 nS6 nS1 nS

Page 14: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 14Comp 411 – Spring 2011 4/20/2011

Uniform miniMIPS Timing

With a fixed clock period, we have to allow for the worse case.

add $4, $5, $6beq $1, $2, 40lw $3, 30($0)jal 20000

Program execution order

TimeCLK

Instruction FetchInstruction DecodeRegister Prop DelayALU OperationBranch TargetData AccessRegister Setup

sw $2, 20($4)

By accounting for the “worse case” path (i.e. allowing time for each possible combination of operations) we can implement a fixed clock period. This simplifies timing generation, enforces a uniform processing order, and allows for pipelining!

Isn’t the net effect just a slower CPU?

1 instr EVERY 20 nS

6 nS2 nS2 nS5 nS4 nS6 nS1 nS

Page 15: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 15Comp 411 – Spring 2011 4/20/2011

Goal: 5-Stage Pipeline

GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU)

APPROACH: structure processor as 5-stage pipeline:

IFInstruction Fetch stage: Maintains PC, fetches

one instruction per cycle and passes it to

WB Write-Back stage: writes result back into register file.

ID/RFInstruction Decode/Register File stage: Decode

control lines and select source operands

ALUALU stage: Performs specified operation,

passes result to

MEMMemory stage: If it’s a lw, use ALU result as an

address, pass mem data (or ALU result if not lw) to

Page 16: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 16Comp 411 – Spring 2011 4/20/2011

ALUA B

ALUFN

Data MemoryRD

WD R/WAdrWr

WDSEL0 1 2

PC+4

ZVN C

PC

+4

InstructionMemory

AD

00

BT

PC<31:29>:J<25:0>:00

JT

PCSEL 0123456

0x800000800x800000400x80000000

PCREG 00 IRREG

WARegister

FileRA1 RA2

RD1 RD2

J:<25:0>

Imm: <15:0>

+x4

BT

JT

Rt: <20:16>Rs: <25:21>

ASEL20 BSEL01

SEXTSEXT

shamt:<10:6>

“16”

1

=BZ

5-Stage miniMIPS

PCALU 00 IRALU

A B WDALU

PCMEM 00 IRMEM YMEM WDMEM

WARegister

File

WA WD

WEWERF

WASEL

Rd:<15:11>Rt:<20:16>

“31”

“27”

0 1 2 3

InstructionFetch

RegisterFile

ALU

WriteBack

PCWB 00 IRWB YWB

Memory

Address is available right after instruction enters Memory stage

Data is needed just before rising clock edge at end of Write Back stage

almost 2 clock cycles

• Omits some details• NO bypass or interlock logic

Page 17: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 17Comp 411 – Spring 2011 4/20/2011

Pipelining

Improve performance by increasing instruction throughput

Ideal speedup is number of stages in the pipeline. Do we achieve this?

Instructionfetch

Reg ALUData

accessReg

8 nsInstruction

fetchReg ALU

Dataaccess

Reg

8 nsInstruction

fetch

8 ns

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

...

Programexecutionorder(in instructions)

Instructionfetch

Reg ALUData

accessReg

Time

lw $1, 100($0)

lw $2, 200($0)

lw $3, 300($0)

2 nsInstruction

fetchReg ALU

Dataaccess

Reg

2 nsInstruction

fetchReg ALU

Dataaccess

Reg

2 ns 2 ns 2 ns 2 ns 2 ns

Programexecutionorder(in instructions)

Page 18: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 18Comp 411 – Spring 2011 4/20/2011

Pipelining

What makes it easyall instructions are the same length

just a few instruction formats

memory operands appear only in loads and stores

What makes it hard?structural hazards: suppose we had only one memory

control hazards: need to worry about branch instructions

data hazards: an instruction depends on a previous instruction

Individual Instructions still take the same number of cycles

But we’ve improved the through-put by increasing the number of simultaneously executing instructions

Page 19: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 19Comp 411 – Spring 2011 4/20/2011

Structural Hazards

Inst

Fetch

Reg

Read

ALU Data

Access

Reg Write

Inst

Fetch

Reg

Read

ALU Data

Access

Reg Write

Inst

Fetch

Reg

Read

ALU Data

Access

Reg Write

Inst

Fetch

Reg

Read

ALU Data

Access

Reg Write

Page 20: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 20Comp 411 – Spring 2011 4/20/2011

Problem with starting next instruction before first is finisheddependencies that “go backward in time” are data hazards

Data Hazards

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecutionorder(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2:

DM Reg

Reg

Reg

Reg

DM

Page 21: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 21Comp 411 – Spring 2011 4/20/2011

Have compiler guarantee no hazards

Where do we insert the “nops” ?

sub $2, $1, $3and $12, $2, $5or $13, $6, $2add $14, $2, $2sw $15, 100($2)

Problem: this really slows us down!

Software Solution

Page 22: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 22Comp 411 – Spring 2011 4/20/2011

Use temporary results, don’t wait for them to be written register file forwarding to handle read/write to same register ALU forwarding

Forwarding

IM Reg

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

sub $2, $1, $3

Programexecution order(in instructions)

and $12, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

10 10 10 10 10/– 20 – 20 – 20 – 20 – 20

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2 :

DM Reg

Reg

Reg

Reg

X X X – 20 X X X X XValue of EX/MEM :X X X X – 20 X X X XValue of MEM/WB :

DM

Page 23: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 23Comp 411 – Spring 2011 4/20/2011

Load word can still cause a hazard:an instruction tries to read a register following a load instruction that

writes to the same register.

Thus, we need a hazard detection unit to “stall” the instruction

Can't always forward

Reg

IM

Reg

Reg

IM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6

Time (in clock cycles)

lw $2, 20($1)

Programexecutionorder(in instructions)

and $4, $2, $5

IM Reg DM Reg

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

DM Reg

Reg

Reg

DM

Page 24: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 24Comp 411 – Spring 2011 4/20/2011

Stalling

We can stall the pipeline by keeping an instruction in the same stage

lw $2, 20($1)

Programexecutionorder(in instructions)

and $4, $2, $5

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

Reg

IM

Reg

Reg

IM DM

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (in clock cycles)

IM Reg DM RegIM

IM DM Reg

IM DM Reg

CC 7 CC 8 CC 9 CC 10

DM Reg

RegReg

Reg

bubble

Page 25: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 25Comp 411 – Spring 2011 4/20/2011

When we decide to branch, other instructions are in the pipeline!

We are predicting “branch not taken”need to add hardware for flushing instructions if we are

wrong

Branch Hazards

Reg

Reg

CC 1

Time (in clock cycles)

40 beq $1, $3, 7

Programexecutionorder(in instructions)

IM Reg

IM DM

IM DM

IM DM

DM

DM Reg

Reg Reg

Reg

Reg

RegIM

44 and $12, $2, $5

48 or $13, $6, $2

52 add $14, $2, $2

72 lw $4, 50($7)

CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9

Reg

Page 26: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 26Comp 411 – Spring 2011 4/20/2011

ALUA B

ALUFN

RD

WD R/WAdrWr

WDSEL0 1 2

PC+4

ZVN C

PC

+4

InstructionMemory

AD

00

BT

PC<31:29>:J<25:0>:00

JT

PCSEL 0123456

0x800000800x800000400x80000000

PCREG 00 IRREG

WARegister

FileRA1 RA2

RD1 RD2

J:<25:0>

Imm: <15:0>

+x4

BT

JT

Rt: <20:16>Rs: <25:21>

ASEL20 BSEL01

SEXTSEXT

shamt:<10:6>

“16”

1

=BZ

5-Stage miniMIPS

PCALU 00 IRALU

A B WDALU

PCMEM 00 IRMEM YMEM WDMEM

WARegister

File

WA WD

WEWERF

WASEL

Rd:<15:11>Rt:<20:16>

“31”

“27”

0 1 2 3

InstructionFetch

RegisterFile

ALU

WriteBack

PCWB 00 IRWB YWB

Memory

We wanted a simple, clean pipeline but…

• added CLK EN to freeze IF/RF stages so we can wait for lw to reach WB stage

NOP

Data Memory

broke the sequential semantics of ISA by adding a branch delay-slot and early branch resolution logicadded A/B bypass muxes to get data before it’s written to regfile

Page 27: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 27Comp 411 – Spring 2011 4/20/2011

Pipeline Summary (I)

• Started with unpipelined implementation– direct execute, 1 cycle/instruction

– it had a long cycle time: mem + regs + alu + mem + wb

• We ended up with a 5-stage pipelined implementation

– increase throughput (3x???)

– delayed branch decision (1 cycle)

Choose to execute instruction after branch

– delayed register writeback (3 cycles)

Add bypass paths (6 x 2 = 12) to forward correct value

– memory data available only in WB stage

Introduce NOPs at IRALU, to stall IF and RF stages until LD result was ready

Page 28: L16 – Pipelining 1 Comp 411 – Spring 2011 4/20/2011 Pipelining Between 411 problems sets, I haven’t had a minute to do laundry Now that’s what I call dirty

L16 – Pipelining 28Comp 411 – Spring 2011 4/20/2011

Pipeline Summary (II)

Fallacy #1: Pipelining is easySmart people get it wrong all of the time!

Fallacy #2: Pipelining is independent of ISAMany ISA decisions impact how easy/costly it is to implement pipelining (i.e. branch semantics, addressing modes).

Fallacy #3: Increasing Pipeline stages improves performanceDiminishing returns. Increasing complexity.