1 pipeline datapath with some slides from: john lazzaro and dan garcia

Pipeline Datapath

With some slides from:

John Lazzaro and Dan Garcia

Flip-Flops

A D flip-flop (input D, output Q) “samples” D right before the clock edge, and then “holds” the value.

[Figure: flip-flop built from a sampling circuit followed by a circuit that holds the value]

Which circuit contributes the t_setup delay?

Which circuit contributes the t_clk-to-Q delay?

CLK == 0: Sense D, but Q outputs the old value.

CLK 0->1: Capture D, pass the value to Q.

[Figure: timing diagram marking setup, hold, and clk-to-Q]

Performance Equation

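$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

Goal is to optimize execution time, not individual equation terms.

Instructions/Program: machines are optimized with respect to program workload.

Cycles/Instruction: the CPI of the program. Reflects the program’s instruction mix.

Seconds/Cycle: clock period. Optimize jointly with machine CPI.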
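As a minimal sketch of the performance equation, the function below multiplies the three terms. The function names, the instruction count, and the CPI value are hypothetical and chosen only for illustration; the 200 MHz clock is the one used in the clock example of these slides.

```python
def clock_period_seconds(clock_hz):
    """Seconds per cycle is the reciprocal of the clock frequency."""
    return 1.0 / clock_hz


def execution_time_seconds(instructions, cpi, clock_hz):
    """Seconds/Program = Instructions/Program * Cycles/Instruction * Seconds/Cycle."""
    return instructions * cpi * clock_period_seconds(clock_hz)


# Hypothetical workload: one million instructions at CPI = 1 on a 200 MHz clock.
example_time = execution_time_seconds(1_000_000, 1.0, 200e6)
```

Halving the CPI or doubling the clock rate gives the same improvement here, which is why the terms are optimized jointly rather than one at a time.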

The clock: Hertz = 1/sec

A Pentium computer running at 200 MHz executes 2*10^8 clock cycles per second.

Each clock cycle takes 1/(2*10^8) sec = 5*10^-9 sec = 5 nanoseconds.

How long does an instruction take nowadays?

Structure of the datapath

1. Instruction Fetch

2. Decode/Register Read

3. Execute

4. Memory

5. Write Back

Gotta Do Laundry

° Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, fold, and put away

° Washer takes 30 minutes

° Dryer takes 30 minutes

° “Folder” takes 30 minutes

° “Stasher” takes 30 minutes to put clothes into drawers

Sequential Laundry

• Sequential laundry takes 8 hours for 4 loads

[Figure: timeline from 6 PM to 2 AM; loads A to D run one after another, each passing through four 30-minute steps]

Pipelined Laundry

• Pipelined laundry takes 3.5 hours for 4 loads!

[Figure: timeline starting at 6 PM; loads A to D overlap, with a new load entering the washer every 30 minutes]

General Definitions

• Latency: time to completely execute a certain task

– for example, time to read a sector from disk is disk access time or disk latency

• Throughput: amount of work that can be done over a period of time

Pipelining Lessons (1/2)

• Pipelining doesn’t help latency of single task, it helps throughput of entire workload

• Multiple tasks operating simultaneously using different resources

• Potential speedup = Number pipe stages

• Time to “fill” pipeline and time to “drain” it reduces speedup: 2.3X v. 4X in this example

[Figure: pipelined laundry timeline, time axis 6 PM to 9 PM, tasks in order A to D]

Pipelining Lessons (2/2)

• Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is pipeline?

• Pipeline rate limited by slowest pipeline stage

• Unbalanced lengths of pipe stages also reduce speedup

[Figure: pipelined laundry timeline, time axis 6 PM to 9 PM, tasks in order A to D]
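The laundry numbers can be checked with a small sketch. It assumes a lock-step pipeline in which every stage advances once per “tick” and the tick length is set by the slowest stage; that model and the function names are assumptions made for illustration, not something stated on the slides.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Lock-step pipeline: the tick equals the slowest stage.

    The first load needs len(stage_minutes) ticks to pass through;
    every further load finishes one tick later.
    """
    tick = max(stage_minutes)
    return (len(stage_minutes) + loads - 1) * tick


balanced = [30, 30, 30, 30]      # washer, dryer, folder, stasher
unbalanced = [20, 30, 30, 20]    # new 20-minute washer and stasher

speedup = sequential_minutes(balanced, 4) / pipelined_minutes(balanced, 4)
```

With four loads the balanced case gives 480 minutes sequentially and 210 minutes pipelined, a speedup of about 2.3 rather than 4. Under the lock-step assumption the faster washer and stasher leave the pipelined time at 210 minutes, because the 30-minute stages still set the rate.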

Inspiration: Automobile assembly line

Assembly line moves on a steady clock.

Each station does the same task on each car.

[Figure: assembly line in which a car body shell and a car chassis meet at a merge station, followed by a bolting station, all paced by the clock]

Inspiration: Automobile assembly line

Simpler station tasks → more cars per hour. Simple tasks take less time, clock is faster.

Inspiration: Automobile assembly line

Line speed limited by slowest task.

Most efficient if all tasks take same time to do.

Inspiration: Automobile assembly line

Simpler tasks, complex car → long line!

These lines go 24 x 7, and rarely shut down. Why?

Lessons from car assembly lines

Faster line movement yields more cars per hour off the line.

Faster line movement requires more stages, each doing simpler tasks.

To maximize efficiency, all stages should take the same amount of time (if not, workers in fast stages are idle).

“Filling”, “flushing”, and “stalling” assembly line are all bad news.

Structure of the datapath

1. Instruction Fetch

2. Decode/Register Read

3. Execute

4. Memory

5. Write Back

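Key Analogy: The instruction is the car

[Figure: five pipeline stages; the instruction register (IR) contents are passed along from stage to stage]

Pipeline Stage #1: Instruction Fetch

Stage #2: the IR controls hardware in stage 2

Stage #3: the IR controls hardware in stage 3

Stage #4: the IR controls hardware in stage 4

Stage #5: the IR controls hardware in stage 5

“Data-stationary control”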

Representation #1: Timeline

Stages: IF (Fetch), ID (Decode), EX (ALU), MEM, WB

Sample Program (instructions are labeled I1: to I5: in the figure):

ADD R4,R3,R2

OR R7,R6,R5

SUB R1,R9,R8

XOR R3,R2,R1

AND R6,R5,R4

[Figure: timeline; columns are times t1 to t8, rows are instructions I1: to I5:, each cell shows the stage the instruction occupies, and the cycles in which the pipeline is “full” are marked]

Good for visualizing pipeline fills.

Representation #2: Resource Usage

[Figure: resource usage chart for the same sample program; columns are times t1 to t8, rows are the stages IF:, ID:, EX:, MEM:, WB:, each cell shows the instruction (I1, I2, ...) using that stage, and the cycles in which the pipeline is “full” are marked]

Good for visualizing pipeline stalls.

Review: Datapath for MIPS

[Figure: MIPS datapath divided into Stage 1 to Stage 5]

1. Instruction Fetch

2. Decode/Register Read

3. Execute

4. Memory

5. Write Back

• Use datapath figure to represent pipeline: IFtch, Dcd, Exec, Mem, WB, drawn as I$, Reg, ALU, D$, Reg

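Graphical Pipeline Representation

[Figure: pipeline diagram; horizontal axis is time (clock cycles), vertical axis is instruction order, and each instruction is drawn as I$, Reg, ALU, D$, Reg]

(In Reg, right half highlight read, left half write)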

Example

• Suppose 2 ns for memory access, 2 ns for ALU operation, and 1 ns for register file read or write

• Nonpipelined Execution:

– lw: IF + Read Reg + ALU + Memory + Write Reg = 2 + 1 + 2 + 2 + 1 = 8 ns

– add: IF + Read Reg + ALU + Write Reg = 2 + 1 + 2 + 1 = 6 ns

• Pipelined Execution:

– Max(IF, Read Reg, ALU, Memory, Write Reg) = 2 ns
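The arithmetic of this example can be reproduced with a short sketch. The stage times come from the example; the dictionary and function names are hypothetical, and the total-time formula assumes an ideal pipeline with no stalls.

```python
# Stage delays in nanoseconds, taken from the example above.
STAGE_NS = {"IF": 2, "Read Reg": 1, "ALU": 2, "Memory": 2, "Write Reg": 1}

# Which stages each instruction uses when executed without pipelining.
USES = {
    "lw": ["IF", "Read Reg", "ALU", "Memory", "Write Reg"],
    "add": ["IF", "Read Reg", "ALU", "Write Reg"],
}


def nonpipelined_ns(instr):
    """Latency of one instruction when its stages simply run back to back."""
    return sum(STAGE_NS[stage] for stage in USES[instr])


def pipelined_cycle_ns():
    """The pipelined clock period is set by the slowest stage."""
    return max(STAGE_NS.values())


def pipelined_total_ns(count):
    """Time for `count` instructions in an ideal pipeline (assumes no stalls)."""
    return (len(STAGE_NS) + count - 1) * pipelined_cycle_ns()
```

A single instruction takes 5 x 2 = 10 ns to pass through the pipeline, longer than the 8 ns nonpipelined lw, but once the pipeline is full one instruction completes every 2 ns.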

Division into stages

[Figure: datapath (instruction memory, registers, sign extend, shift left 2, adders, ALU, data memory) divided into five stages]

IF: Instruction fetch

ID: Instruction decode/register file read

EX: Execute/address calculation

MEM: Memory access

WB: Write back

Adding the registers

[Figure: the same datapath with the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB placed between the stages]

[Figures: the pipelined datapath with the registers IF/ID, ID/EX, EX/MEM, and MEM/WB, shown five times, highlighting in turn these stages:]

Instruction fetch

Instruction decode

Execution

Memory

Write back

A correction!

Keep the right Rd all the way!

[Figures: pipelined datapath before and after the correction; the write register number is carried through the pipeline registers together with its instruction instead of being taken from IF/ID]

So here is the updated CPU:

[Figure: pipelined datapath with the registers IF/ID, ID/EX, EX/MEM, MEM/WB and the control lines RegDst, ALUSrc, ALU control, Branch, MemRead, MemWrite, RegWrite, and MemtoReg; Instruction[20–16] feeds the write-register mux and Instruction[15–0] is sign-extended from 16 to 32 bits]

The control lines

             Execution/Address Calculation     Memory access stage         Write-back stage
             stage control lines               control lines               control lines
Instruction  RegDst  ALUOp1  ALUOp0  ALUSrc    Branch  MemRead  MemWrite   RegWrite  MemtoReg
R-format     1       1       0       0         0       0        0          1         0
lw           0       0       0       1         0       1        0          1         1
sw           X       0       0       1         0       0        1          0         X
beq          X       0       1       0         1       0        0          0         X
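One way to read the control table is as a lookup whose result is split into three groups that travel with the instruction. The sketch below encodes the table rows as given; the names `SIGNALS`, `CONTROL`, and `split_by_stage` are hypothetical, and "X" stands for a don't-care value.

```python
# Signal names in the column order of the control-line table.
SIGNALS = ("RegDst", "ALUOp1", "ALUOp0", "ALUSrc",
           "Branch", "MemRead", "MemWrite",
           "RegWrite", "MemtoReg")

# One row per instruction class; "X" marks a don't-care.
CONTROL = {
    "R-format": (1, 1, 0, 0, 0, 0, 0, 1, 0),
    "lw":       (0, 0, 0, 1, 0, 1, 0, 1, 1),
    "sw":       ("X", 0, 0, 1, 0, 0, 1, 0, "X"),
    "beq":      ("X", 0, 1, 0, 1, 0, 0, 0, "X"),
}


def split_by_stage(instr):
    """Return the (EX, MEM, WB) control groups for one instruction class."""
    values = dict(zip(SIGNALS, CONTROL[instr]))
    ex = {name: values[name] for name in SIGNALS[0:4]}
    mem = {name: values[name] for name in SIGNALS[4:7]}
    wb = {name: values[name] for name in SIGNALS[7:9]}
    return ex, mem, wb
```

Grouping by stage mirrors the pipeline: the EX group is consumed in the execute stage, while the MEM and WB groups are carried further along the pipeline registers until their stage is reached.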

Datapath with Control

[Figure: pipelined datapath in which the Control unit output is split into WB, M, and EX groups that are carried through the ID/EX, EX/MEM, and MEM/WB registers]

Example: a demonstration of a sequence of instructions:

lw $10, 20($1)

sub $11, $2, $3

and $12, $4, $5

or $13, $6, $7

add $14, $8, $9

[Figures: the datapath with control, shown once per clock cycle for Clock 1 to Clock 9. The instruction occupying each stage is:]

Clock  IF               ID               EX              MEM             WB
1      lw $10, 20($1)   before<1>        before<2>       before<3>       before<4>
2      sub $11, $2, $3  lw $10, 20($1)   before<1>       before<2>       before<3>
3      and $12, $4, $5  sub $11, $2, $3  lw $10, ...     before<1>       before<2>
4      or $13, $6, $7   and $12, $4, $5  sub $11, ...    lw $10, ...     before<1>
5      add $14, $8, $9  or $13, $6, $7   and $12, ...    sub $11, ...    lw $10, ...
6      after<1>         add $14, $8, $9  or $13, ...     and $12, ...    sub $11, ...
7      after<2>         after<1>         add $14, ...    or $13, ...     and $12, ...
8      after<3>         after<2>         after<1>        add $14, ...    or $13, ...
9      after<4>         after<3>         after<2>        after<1>        add $14, ...
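The stage occupancy in the clock-by-clock figures follows a simple rule: at clock c, the stage at depth d holds instruction number c - 1 - d. The sketch below applies that rule, assuming no stalls; the function name and the `before<n>`/`after<n>` formatting mimic the figure labels and are otherwise hypothetical.

```python
STAGES = ("IF", "ID", "EX", "MEM", "WB")

PROGRAM = [
    "lw $10, 20($1)",
    "sub $11, $2, $3",
    "and $12, $4, $5",
    "or $13, $6, $7",
    "add $14, $8, $9",
]


def occupancy(program, clock):
    """Return {stage: instruction} for a 1-based clock cycle (no stalls assumed)."""
    result = {}
    for depth, stage in enumerate(STAGES):
        index = clock - 1 - depth  # program index of the instruction in this stage
        if index < 0:
            result[stage] = f"before<{-index}>"
        elif index >= len(program):
            result[stage] = f"after<{index - len(program) + 1}>"
        else:
            result[stage] = program[index]
    return result


clock5 = occupancy(PROGRAM, 5)
```

At clock 5 the pipeline is full: all five stages hold instructions of the example sequence.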

Problems for Computers

• Limits to pipelining: Hazards prevent the next instruction from executing during its designated clock cycle

– Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)

– Control hazards: Pipelining of branches & other instructions stalls the pipeline until the hazard is resolved, leaving “bubbles” in the pipeline

– Data hazards: Instruction depends on result of prior instruction still in the pipeline

An example for data hazards:

sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Register $2 is updated only at the WB phase of the sub instruction, i.e., the 5th clock cycle (actually at the end of the 5th clock cycle). However, we try to use it at the 3rd clock cycle, when we read $2 at the decode phase of the and instruction.

Graphic representation of data hazards:

[Figure: pipeline diagram (IM, Reg, DM, Reg per instruction) over time in clock cycles CC 1 to CC 9, with the instructions below in program execution order]

sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2:

CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
10    10    10    10    10/–20  –20   –20   –20   –20

Solving data hazards by adding nops

[Figure: the same pipeline diagram stretched to CC 1 to CC 12, with nops inserted after sub $2, $1, $3 so that the and, or, add, and sw instructions read $2 only after it has been written]

Value of register $2: 10 in CC 1 to CC 4, 10/–20 in CC 5, –20 in CC 6 to CC 12.

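The internal structure of the Register File

[Figure: register file with inputs Rd reg 1 (= Rs), Rd reg 2 (= Rt), Wr reg (= Rd), write data (32 bits), and RegWrite, and outputs Read data 1 and Read data 2]

We read the values of two different registers simultaneously from the two outputs, and write to one of the other registers on the (next) rising edge of the clock.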

We could gain 1 clock cycle if the GPR is “transparent”, i.e., if we could see the data being written to the GPR at the GPR outputs (when the write address equals the read address) during clock cycle 5.

[Figure: the nop-padded pipeline diagram over CC 1 to CC 12; register $2 changes from 10 to –20 in CC 5]

The internal structure of the modified Register File. We “bypass” the input data (the write data) to the read data 1 output whenever read reg 1 (Rs) equals the write register (Rd/Rt) and is not register zero. We “bypass” the input data (the write data) to the read data 2 output whenever read reg 2 (Rt) equals the write register (Rd/Rt) and is not register zero.

[Figure: the register file with added bypass paths; the 32-bit write data and the 5-bit Wr reg number are compared against Rd reg 1 (= Rs) and Rd reg 2 (= Rt) and can be routed directly to Read data 1 and Read data 2]

We read the values of two different registers simultaneously from the two outputs, and write to one of the other registers on the (next) rising edge of the clock.

sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

After that change we need only 2 nops

After the change, the WB of an earlier instruction can happen at the same time as the register read (decode) phase of a newer instruction (the one 3 instructions later, i.e., with two other instructions in between). So when we have a data hazard, we need to add only two nop instructions.

Unfortunately, this happens too often. We need a better solution!
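The nop counts can be derived from the stage timing: an instruction issued in slot i writes back in cycle i + 4 and a reader issued in slot j decodes in cycle j + 1. The sketch below assumes a five-stage pipeline without forwarding; `count_nops` and the tuple encoding of the program are hypothetical names chosen for illustration.

```python
def count_nops(program, transparent_register_file):
    """Count the nops an assembler would insert (assumes no forwarding).

    program: list of (dest, sources) register numbers; dest is None when the
    instruction writes no register. A reader must be issued at least `gap`
    slots after the writer: 4 with a plain register file (decode after WB),
    3 when the register file is "transparent" (decode in the same cycle as WB).
    """
    gap = 3 if transparent_register_file else 4
    last_write_slot = {}  # register number -> issue slot of its latest writer
    slot = -1
    nops = 0
    for dest, sources in program:
        earliest = slot + 1
        for reg in sources:
            if reg != 0 and reg in last_write_slot:
                earliest = max(earliest, last_write_slot[reg] + gap)
        nops += earliest - (slot + 1)
        slot = earliest
        if dest is not None and dest != 0:
            last_write_slot[dest] = slot
    return nops


HAZARD_EXAMPLE = [
    (2, (1, 3)),      # sub $2, $1, $3
    (12, (2, 5)),     # and $12, $2, $5
    (13, (6, 2)),     # or  $13, $6, $2
    (14, (2, 2)),     # add $14, $2, $2
    (None, (15, 2)),  # sw  $15, 100($2)
]
```

For the example sequence this gives three nops with a plain register file and two with the transparent one, all of them placed between the sub and the and.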

Graphic representation of data hazards:

[Figure, shown twice: pipeline diagram over CC 1 to CC 9 for the instructions below, marking the dependences on register $2]

sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

Value of register $2: 10 in CC 1 to CC 4, 10/–20 in CC 5, –20 in CC 6 to CC 9.

Forwarding (“stealing” the values)

[Figure, shown twice: pipeline diagram over CC 1 to CC 9 with the instructions below in program execution order; the result of sub is taken from the pipeline registers instead of from the register file]

sub $2, $1, $3

and $12, $2, $5

or $13, $6, $2

add $14, $2, $2

sw $15, 100($2)

                       CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:  10    10    10    10    10/–20  –20   –20   –20   –20
Value of EX/MEM:       X     X     X     –20   X       X     X     X     X
Value of MEM/WB:       X     X     X     X     –20     X     X     X     X

Forwarding (done at the execute phase)

[Figure: pipelined datapath with a forwarding unit; its inputs include EX/MEM.RegisterRd and MEM/WB.RegisterRd, together with the Rs, Rt, and Rd fields carried on from IF/ID]

If ID/EX.Rs = EX/MEM.Rd, i.e., the Rd of the previous instruction equals the Rs of the current instruction (which is in the execute phase), then we use the “ALUout” of the previous instruction, held in EX/MEM, instead of the output of the GPR.

If ID/EX.Rs = MEM/WB.Rd, i.e., the Rd of the instruction two back equals the Rs of the current instruction, then we use the result of that instruction, held in MEM/WB, instead of the output of the GPR.

Similarly, compare also ID/EX.Rt to EX/MEM.Rd and to MEM/WB.Rd.
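The comparisons above can be sketched as a small selection function, called once for Rs and once for Rt. Two details are assumptions added here and are not spelled out in the text above: the check that the older instruction actually writes a register (a RegWrite flag) and giving EX/MEM priority over MEM/WB when both match. The function name and the returned labels are hypothetical.

```python
def forward_source(id_ex_reg, ex_mem_rd, ex_mem_reg_write,
                   mem_wb_rd, mem_wb_reg_write):
    """Choose where an ALU operand comes from for the instruction in EX.

    id_ex_reg is ID/EX.Rs or ID/EX.Rt. Register 0 is never forwarded.
    Assumption: the newer result (EX/MEM) wins if both stages match.
    """
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        return "EX/MEM"   # result of the previous instruction
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == id_ex_reg:
        return "MEM/WB"   # result of the instruction two back
    return "GPR"          # no hazard: use the register file output


# or $4, $4, $2 in EX, with and $4, ... in MEM and sub $2, ... in WB.
rs_source = forward_source(4, ex_mem_rd=4, ex_mem_reg_write=True,
                           mem_wb_rd=2, mem_wb_reg_write=True)
rt_source = forward_source(2, ex_mem_rd=4, ex_mem_reg_write=True,
                           mem_wb_rd=2, mem_wb_reg_write=True)
```

In this case Rs ($4) is taken from EX/MEM and Rt ($2) from MEM/WB, so one instruction can use both forwarding paths at once.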

An example for forwarding

sub $2, $1, $3

and $4, $2, $5 needs forwarding from the previous instruction

or $4, $4, $2 needs forwarding from two instructions back

add $9, $4, $2 needs forwarding from 3 instructions back (through the “transparent” GPR)

Here we discuss the $2 register only.

(The first two cases are handled in the execute phase, the last one in the decode phase.)

An example for forwarding

sub $2, $1, $3

and $4, $2, $5

or $4, $4, $2 needs forwarding from the previous instruction

add $9, $4, $2 needs forwarding from the previous instruction

Here we discuss the $4 register, and there are two cases (the 2nd one in purple).

[Figures: datapath with forwarding unit, shown for Clock 3 to Clock 6 of the sequence sub $2, $1, $3; and $4, $2, $5; or $4, $4, $2; add $9, $4, $2. The instruction occupying each stage is:]

Clock  IF              ID              EX              MEM          WB
3      or $4, $4, $2   and $4, $2, $5  sub $2, $1, $3  before<1>    before<2>
4      add $9, $4, $2  or $4, $4, $2   and $4, $2, $5  sub $2, ...  before<1>
5      after<1>        add $9, $4, $2  or $4, $4, $2   and $4, ...  sub $2, ...
6      after<2>        after<1>        add $9, $4, $2  or $4, ...   and $4, ...

Clock 4: since Rs = 2 and the Rd of the previous instruction was 2, we use ALUout instead of Rs.

In blue we see forwarding from two instructions back (WB->Exec.), in red, from the previous instruction (Mem->Exec.), in purple, from 3 instructions back (WB->Decode).


The solution does not always work: it fails for lw

(In lw we do not have the data in the pipe! It comes from the data memory!)

[Figure: pipeline diagram over CC 1 to CC 9 for the instructions below; the value loaded by lw is needed by the and instruction before it has left the data memory]

lw $2, 20($1)

and $4, $2, $5

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

If the previous instruction was a lw to a register and we try to use that register in the current instruction, we have a problem, since we cannot go back in time!

One solution is to avoid such cases by having the Assembler add a nop whenever Rt of the lw is equal to Rs or Rt of the following instruction.

Another solution, in hardware, is to add bubbles, i.e., to have the hardware insert the nop.

lw $2, 20($1)

and $4, $2, $5

or $8, $2, $6

add $9, $4, $2

slt $1, $6, $7

[Figure: pipeline diagram over time in clock cycles CC 1 to CC 10; a bubble (“nop”) is inserted after the lw, so the and instruction and everything behind it are delayed by one cycle]

We need to hold IF/ID for one clock cycle and insert a “nop” into ID/EX. This is equal to adding a nop instruction by the Assembler.

Hazard detection unit

[Figure: pipelined datapath with a hazard detection unit next to the forwarding unit; the hazard detection unit receives ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, and IF/ID.RegisterRt]

We need to hold the IF/ID and the PC for one clock cycle and insert a “nop” into ID/EX. This is equal to adding a nop instruction by the Assembler.

If (ID/EX.MemRead) && ((ID/EX.Rt == IF/ID.Rs) || (ID/EX.Rt == IF/ID.Rt)) we must “stall” the pipeline! This means that the previous instruction was a lw and that it loads into the current Rs or Rt (of course, if one of them is not used, don’t stall).

In this condition, ID/EX.MemRead identifies the lw, ID/EX.Rt is the Rt of the previous instruction, and IF/ID.Rs and IF/ID.Rt are the Rs and Rt of the current instruction.

Holding means “freezing” the IF/ID and the PC for 1 clock cycle.

Hold the IF/ID by not giving an IF/IDWrite signal, and do not increment the PC (which already points at the next instruction) by not giving the PCWrite signal. Inserting a nop is done by clearing all control signals.
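The stall condition can be written down almost literally. The sketch below is a minimal model of the hazard detection unit under the condition stated above; the function names and the dictionary of outputs are hypothetical, and it does not model the refinement of skipping the check when Rs or Rt is unused.

```python
def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in EX is a lw whose Rt is read by the
    instruction currently in ID (load-use hazard)."""
    return bool(id_ex_mem_read and
                (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt))


def hazard_outputs(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Control outputs for one cycle: freeze PC and IF/ID, insert a nop."""
    stall = must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt)
    return {
        "PCWrite": not stall,       # PC keeps pointing at the same instruction
        "IF/IDWrite": not stall,    # IF/ID keeps the stalled instruction
        "clear_control": stall,     # zeroed control signals act as a nop
    }


# lw $2, 20($1) is in EX while and $4, $2, $5 is in ID.
load_use = hazard_outputs(True, 2, 2, 5)
```

After the one-cycle stall the lw has reached WB when the and is in EX, so the forwarding path from MEM/WB can supply $2.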

An example for lw hazard detection

lw $2, 20($1)

and $4, $2, $5

or $4, $4, $2

add $9, $4, $2

[Figures: datapath with hazard detection unit and forwarding unit, shown for Clock 2 to Clock 7. The instruction occupying each stage is:]

Clock  IF              ID              EX              MEM          WB
2      and $4, $2, $5  lw $2, 20($1)   before<1>       before<2>    before<3>
3      or $4, $4, $2   and $4, $2, $5  lw $2, 20($1)   before<1>    before<2>
4      or $4, $4, $2   and $4, $2, $5  bubble          lw $2, ...   before<1>
5      add $9, $4, $2  or $4, $4, $2   and $4, $2, $5  bubble       lw $2, ...
6      after<1>        add $9, $4, $2  or $4, $4, $2   and $4, ...  bubble
7      after<2>        after<1>        add $9, $4, $2  or $4, ...   and $4, ...

Clock 5: the lw instruction is in the WB phase. $2 is “being written”. We can use $2 in the Execute phase of the and instruction, with the help of forwarding.

Just to remind us how branch is handled, we show again the Datapath with Control

[Figure: pipelined datapath with control, including the Branch signal, the Shift left 2 unit, and the branch-target adder]

Branch Hazards

[Figure: pipeline diagram over CC 1 to CC 9 for the instructions below, listed with their addresses]

40 beq $1, $3, 7

44 and $12, $2, $5

48 or $13, $6, $2

52 add $14, $2, $2

72 lw $4, 50($7)

Annotations on the beq row of the figure:

Here we calculate Rs-Rt.

Here we decide to branch (switching the address to the PC and issuing PCWriteCond).

The 3 instructions after the beq should be “killed” before they do harm, i.e., before they change any register.

In CC 5 we already use the new PC calculated by the branch (PC = 72).
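The target address PC = 72 follows from the MIPS rule that the branch offset counts words relative to the instruction after the branch, which is why the datapath contains a “Shift left 2” unit in front of the branch adder. A minimal sketch, with a hypothetical function name:

```python
def branch_target(branch_address, offset_words):
    """MIPS branch target: (address of branch + 4) + (offset shifted left by 2)."""
    return branch_address + 4 + (offset_words << 2)


# beq $1, $3, 7 located at address 40
target = branch_target(40, 7)
```

For the example this gives 40 + 4 + 28 = 72, the address of the lw instruction.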

Control Hazard: Branching (1/7)

Where do we do the compare for the branch?

[Figure: pipeline diagram; horizontal axis is time (clock cycles), vertical axis is instruction order (Instr 1 to Instr 4), and each instruction is drawn as I$, Reg, ALU, D$, Reg]

Control Hazard: Branching (2/7)

• We put branch decision-making hardware in ALU stage

– therefore two more instructions after the branch will always be fetched, whether or not the branch is taken

• Desired functionality of a branch

– if we do not take the branch, don’t waste any time and continue executing normally

– if we take the branch, don’t execute any instructions after the branch, just go to the desired label

• Initial Solution: Stall until decision is made

– insert “no-op” instructions: those that accomplish nothing, just take time

– Drawback: branches take 3 clock cycles each (assuming comparator is put in ALU stage)

Control Hazard: Branching (4/7)

• Optimization #1:

– move asynchronous comparator up to Stage 2

– as soon as instruction is decoded (Opcode identifies it as a branch), immediately make a decision and set the value of the PC (if necessary)

– Benefit: since branch is complete in Stage 2, only one unnecessary instruction is fetched, so only one no-op is needed

– Side Note: This means that branches are idle in Stages 3, 4 and 5.

[Figure: pipelined datapath with hazard detection unit, forwarding unit, and an IF.Flush control signal; the sign extend and shift left 2 units feed the branch address calculation]

• Insert a single no-op (bubble)

[Figure: pipeline diagram over time (clock cycles) in which a bubble follows the branch]

• Impact: 2 clock cycles per branch instruction → slow

• Optimization #2: Redefine branches

– Old definition: if we take the branch, none of the instructions after the branch get executed by accident

– New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (called the branch-delay slot)

Control Hazard: Branching (7/7)

• Notes on Branch-Delay Slot

– Worst-Case Scenario: can always put a no-op in the branch-delay slot

– Better Case: can find an instruction preceding the branch which can be placed in the branch-delay slot without affecting flow of the program

• re-ordering instructions is a common method of speeding up programs

• compiler must be very smart in order to find instructions to do this

• usually can find such an instruction at least 50% of the time

• Jumps also have a delay slot…

Example: Nondelayed vs. Delayed Branch

Nondelayed Branch

add $1, $2, $3

sub $4, $5, $6

beq $1, $4, Exit

or $8, $9, $10

xor $10, $1, $11

Exit:

Delayed Branch

add $1, $2, $3

sub $4, $5, $6

beq $1, $4, Exit

or $8, $9, $10

xor $10, $1, $11

Exit:

Question (1/2)

Assume 1 instr/clock, delayed branch, 5 stage pipeline, forwarding, interlock on unresolved load hazards (after 10^3 loops, so pipeline full)

Loop: lw    $t0, 0($s1)
      addu  $t0, $t0, $s2
      sw    $t0, 0($s1)
      addiu $s1, $s1, -4
      bne   $s1, $zero, Loop
      nop

• How many pipeline stages (clock cycles) per loop iteration to execute this code?

1 2 3 4 5 6 7 8 9 10

Answer (1/2)

• Assume 1 instr/clock, delayed branch, 5 stage pipeline, forwarding, interlock on unresolved load hazards. 10^3 iterations, so pipeline full.

Loop: lw    $t0, 0($s1)         1.
      addu  $t0, $t0, $s2       2. (data hazard so stall) 3.
      sw    $t0, 0($s1)         4.
      addiu $s1, $s1, -4        5.
      bne   $s1, $zero, Loop    6.
      nop                       7. (delayed branch so exec. nop)

• How many pipeline stages (clock cycles) per loop iteration to execute this code? 7

1 2 3 4 5 6 7 8 9 10

Question (2/2)

Assume 1 instr/clock, delayed branch, 5 stage pipeline, forwarding, interlock on unresolved load hazards (after 10^3 loops, so pipeline full). Rewrite this code to reduce pipeline stages (clock cycles) per loop to as few as possible.

Loop: lw    $t0, 0($s1)
      addu  $t0, $t0, $s2
      sw    $t0, 0($s1)
      addiu $s1, $s1, -4
      bne   $s1, $zero, Loop
      nop

• How many pipeline stages (clock cycles) per loop iteration to execute this code?

1 2 3 4 5 6 7 8 9 10

A (2/2) How long to execute?

• Rewrite this code to reduce clock cycles per loop to as few as possible:

Loop: lw    $t0, 0($s1)
      addiu $s1, $s1, -4
      addu  $t0, $t0, $s2       (no hazard since extra cycle)
      bne   $s1, $zero, Loop
      sw    $t0, +4($s1)        (modified sw to put past addiu)

• How many pipeline stages (clock cycles) per loop iteration to execute your revised code? (assume pipeline is full)

1 2 3 4 5 6 7 8 9 10
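Both loop versions can be counted with a small sketch under the stated assumptions: one instruction per clock, full forwarding, a one-cycle interlock when the instruction right after a lw reads the loaded register, and a delayed branch whose slot is part of the loop body. The tuple encoding and the function name are hypothetical; the wrap-around check from the last instruction to the first assumes the branch is taken.

```python
def cycles_per_iteration(loop):
    """loop: list of (op, dest, sources) in program order, delay slot included.

    Each instruction costs one cycle; a lw followed immediately by a reader
    of its destination register costs one extra stall cycle.
    """
    cycles = 0
    for i, (op, dest, _sources) in enumerate(loop):
        cycles += 1
        following = loop[(i + 1) % len(loop)]  # wraps to the next iteration
        if op == "lw" and dest in following[2]:
            cycles += 1  # load-use interlock
    return cycles


ORIGINAL = [
    ("lw", "t0", ("s1",)),
    ("addu", "t0", ("t0", "s2")),
    ("sw", None, ("t0", "s1")),
    ("addiu", "s1", ("s1",)),
    ("bne", None, ("s1", "zero")),
    ("nop", None, ()),
]

REVISED = [
    ("lw", "t0", ("s1",)),
    ("addiu", "s1", ("s1",)),
    ("addu", "t0", ("t0", "s2")),
    ("bne", None, ("s1", "zero")),
    ("sw", None, ("t0", "s1")),  # sw $t0, +4($s1) in the delay slot
]
```

Under these assumptions the original loop takes 7 cycles per iteration and the revised loop 5: moving the addiu behind the lw removes the stall, and placing the sw in the delay slot removes the nop.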

[Figures: datapath with hazard detection unit, forwarding unit, and IF.Flush, shown for Clock 3 and Clock 4 of the instruction sequence below. The instruction occupying each stage is:]

sub $10, $4, $8

beq $1, $3, 7

and $12, $2, $5

lw $4, 50($7)

Clock  IF               ID             EX               MEM           WB
3      and $12, $2, $5  beq $1, $3, 7  sub $10, $4, $8  before<1>     before<2>
4      lw $4, 50($7)    bubble (nop)   beq $1, $3, 7    sub $10, ...  before<1>

Summary of hazards

Data hazards:

* Forward from previous instruction

* Forward from two instructions ago

* (Forward through “transparent” GPR = from 3 instructions ago)

* If we cannot forward (after lw), we stall the pipe by inserting a nop and freezing IF/ID and PC for 1 clock cycle

Control hazards:

* If the branch is taken, we flush the instruction following the branch (which is in the IF/ID register; we just clear the register)

Notes:

In the real MIPS CPU, no flush was employed. This gives the compiler the opportunity to put a useful instruction after the branch. This explains why the simulator always performs the instruction following the branch. This is called a delayed branch.

Also, in the real MIPS CPU no lw stall was used. Again, this gives some freedom to the compiler to choose whether to put a nop following lw or some useful instruction. This is called a delayed load.
