multicycle datapath implementation - uc santa barbarastrukov/ece154winter2012/lecture8.pdf ·...

Multicycle Datapath Implementation

[Adapted from instructor’s supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2008, MK, and Computer Architecture: From Microprocessors to Supercomputers, B. Parhami, 2005, Oxford Press]

Review: A Multicycle Data Path

[Figure: block diagram connecting the PC, cache, instruction register (Inst Reg), register file, x, y, and z registers, ALU, data register (Data Reg), and control unit. Labeled signals include jta, imm, rs, rt, rd, (rs), (rt), Address, Data, op, and fn.]

Fig. 14.2 Abstract view of a multicycle instruction execution unit for MicroMIPS. For naming of instruction fields, see Fig. 13.1.

Feb. 2011, Computer Architecture, Data Path and Control, Slide 2

Review: Microprogramming

[Figure: control state machine, with states laid out across Cycle 1 through Cycle 5.]

Start → State 0 → State 1, then by instruction type:
  Jump/Branch: State 5
  lw/sw: State 2, then State 3 → State 4 (lw) or State 6 (sw)
  ALU-type: State 7 → State 8

State 0: InstData = 0, MemRead = 1, IRWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = ‘+’, PCSrc = 3, PCWrite = 1
State 1: ALUSrcX = 0, ALUSrcY = 3, ALUFunc = ‘+’
State 2: ALUSrcX = 1, ALUSrcY = 2, ALUFunc = ‘+’
State 3: cache load (lw)
State 4: RegDst = 0, RegInSrc = 0, RegWrite = 1
State 5: ALUSrcX = 1, ALUSrcY = 1, ALUFunc = ‘−’, JumpAddr = %, PCSrc = @, PCWrite = #
State 6: InstData = 1, MemWrite = 1
State 7: ALUSrcX = 1, ALUSrcY = 1 or 2, ALUFunc = Varies
State 8: RegDst = 0 or 1, RegInSrc = 1, RegWrite = 1

Notes for State 5:
  % 0 for j or jal, 1 for syscall, don’t-care for other instr’s
  @ 0 for j, jal, and syscall, 1 for jr, 2 for branches
  # 1 for j, jr, jal, and syscall, ALUZero (its complement) for beq (bne), bit 31 of ALUout for bltz
  For jal, RegDst = 2, RegInSrc = 1, RegWrite = 1
Note for State 7: ALUFunc is determined based on the op and fn fields

Microprogram:

  fetch:    PCnext, CacheFetch          # State 0 (start)
            PC + 4imm, μPCdisp1         # State 1
  lui1:     lui(imm)                    # State 7lui
            rt ← z, μPCfetch            # State 8lui
  add1:     x + y                       # State 7add
            rd ← z, μPCfetch            # State 8add
  sub1:     x - y                       # State 7sub
            rd ← z, μPCfetch            # State 8sub
  slt1:     x - y                       # State 7slt
            rd ← z, μPCfetch            # State 8slt
  addi1:    x + imm                     # State 7addi
            rt ← z, μPCfetch            # State 8addi
  slti1:    x - imm                     # State 7slti
            rt ← z, μPCfetch            # State 8slti
  and1:     x ∧ y                       # State 7and
            rd ← z, μPCfetch            # State 8and
  or1:      x ∨ y                       # State 7or
            rd ← z, μPCfetch            # State 8or
  xor1:     x ⊕ y                       # State 7xor
            rd ← z, μPCfetch            # State 8xor
  nor1:     ~(x ∨ y)                    # State 7nor
            rd ← z, μPCfetch            # State 8nor
  andi1:    x ∧ imm                     # State 7andi
            rt ← z, μPCfetch            # State 8andi
  ori1:     x ∨ imm                     # State 7ori
            rt ← z, μPCfetch            # State 8ori
  xori1:    x ⊕ imm                     # State 7xori
            rt ← z, μPCfetch            # State 8xori
  lwsw1:    x + imm, μPCdisp2           # State 2
  lw2:      CacheLoad                   # State 3
            rt ← Data, μPCfetch         # State 4
  sw2:      CacheStore, μPCfetch        # State 6
  j1:       PCjump, μPCfetch            # State 5j
  jr1:      PCjreg, μPCfetch            # State 5jr
  branch1:  PCbranch, μPCfetch          # State 5branch
  jal1:     PCjump, $31←PC, μPCfetch    # State 5jal
  syscall1: PCsyscall, μPCfetch         # State 5syscall

[Figure: microinstruction fields.]
  PC control: JumpAddr, PCSrc, PCWrite
  Cache control: InstData, MemRead, MemWrite, IRWrite
  Register control: RegInSrc, RegDst, RegWrite
  ALU inputs: ALUSrcX, ALUSrcY
  ALU function: FnType, LogicFn, AddSub
  Sequence control

[Figure: microprogrammed control unit. Components: microprogram memory or PLA (Address, Data), MicroPC, incrementer (Incr), a multiplexer with inputs 0 to 3, Dispatch table 1, Dispatch table 2, and the microinstruction register. Input: op (from instruction register). Outputs: control signals to data path and sequence control.]
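Because the control unit visits one state per clock cycle, the path an instruction takes through the state machine gives its cycle count. The following is a minimal sketch of that bookkeeping; the class labels (`lw`, `sw`, `alu`, `jump_branch`) and function names are chosen for this illustration and are not part of the slides.

```python
# State sequences read off the control state machine (States 0 through 8).
# The dictionary keys are illustrative labels for the instruction classes.
STATE_PATHS = {
    "lw": [0, 1, 2, 3, 4],
    "sw": [0, 1, 2, 6],
    "alu": [0, 1, 7, 8],
    "jump_branch": [0, 1, 5],
}


def cycles_for(instr_class: str) -> int:
    """One state per clock cycle, so the path length is the cycle count."""
    return len(STATE_PATHS[instr_class])


def total_cycles(program: list) -> int:
    """Total multicycle execution time, in cycles, for a list of class labels."""
    return sum(cycles_for(c) for c in program)


if __name__ == "__main__":
    for name in STATE_PATHS:
        print(name, cycles_for(name))
    print(total_cycles(["lw", "alu", "sw", "jump_branch"]))
```

Every path shares States 0 and 1 (fetch and decode), which is why the dispatch on instruction type happens only after State 1.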

Review: Exception Control

[Figure: control state machine across Cycle 1 through Cycle 5, with States 0 through 8 as before and two exception states added.]

State 9: IntCause = 1, CauseWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = ‘−’, EPCWrite = 1, JumpAddr = 1
State 10: IntCause = 0, CauseWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = ‘−’, EPCWrite = 1, JumpAddr = 1

The exception states are entered on an illegal operation or an overflow.

Fig. 14.10 Exception states 9 and 10 added to the control state machine.

MIPS Pipelined Datapath and Control

Single‐Cycle vs. Multicycle vs. Pipelined

[Figure: clock timelines comparing single-cycle and multicycle execution of Instr 1 through Instr 4. With one long clock period, every instruction is allotted the same time even when less time is needed. With a shorter multicycle clock, the instructions take 3 to 5 cycles each, and the difference is marked as time saved.]

[Figure: pipelined execution over cycles 1 through 11, shown as (a) a task-time diagram (instructions versus cycles) and (b) a space-time diagram (pipeline stages versus cycles), with the start-up region and drainage region marked. Legend: f = Fetch, r = Reg read, a = ALU op, d = Data access, w = Writeback.]

Pipelining Analogy (§4.5 An Overview of Pipelining)

• Pipelined laundry: overlapping execution
  – Parallelism improves performance
• Four loads: Speedup = 8/3.5 = 2.3
• Non‐stop: Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
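The two speedup figures on this slide come from the same formula evaluated at different n. A small sketch, assuming (as the formula implies) that n loads take 2n time units when done one after another and 0.5n + 1.5 time units when overlapped; the function names are illustrative.

```python
def sequential_time(n_loads: int) -> float:
    """Time for n loads done back to back: 2 time units per load."""
    return 2.0 * n_loads


def pipelined_time(n_loads: int) -> float:
    """Time for n overlapped loads, per the slide: 0.5n + 1.5."""
    return 0.5 * n_loads + 1.5


def speedup(n_loads: int) -> float:
    return sequential_time(n_loads) / pipelined_time(n_loads)


if __name__ == "__main__":
    for n in (4, 100, 10_000):
        print(n, round(speedup(n), 2))
```

As n grows, the constant 1.5 (the fill time) matters less and the ratio approaches 2/0.5 = 4, the number of stages.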

Chapter 4 — The Processor —7

MIPS Pipeline

Five stages, one step per stage

1. IF: Instruction fetch from memory

2. ID: Instruction decode & register read

3. EX: Execute operation or calculate address

4. MEM: Access memory operand

5. WB: Write result back to register

[Figure: lw passing through IFetch, Dec, Exec, Mem, and WB in Cycle 1 through Cycle 5.]


Pipeline Performance

• Assume time for stages is
  – 100ps for register read or write
  – 200ps for other stages
• Compare pipelined datapath with single‐cycle datapath

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps          -               700ps
R-format  200ps        100ps          200ps   -              100ps           600ps
beq       200ps        100ps          200ps   -              -               500ps
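The totals in the table, and the two clock periods they imply, can be recomputed from the per-stage latencies. This is a sketch using the slide's numbers; the stage keys and function names are illustrative.

```python
# Stage latencies in picoseconds, from the slide's assumptions.
LATENCY_PS = {"fetch": 200, "reg_read": 100, "alu": 200, "mem": 200, "reg_write": 100}

# Stages each instruction class actually uses, per the table.
USES = {
    "lw": ["fetch", "reg_read", "alu", "mem", "reg_write"],
    "sw": ["fetch", "reg_read", "alu", "mem"],
    "R-format": ["fetch", "reg_read", "alu", "reg_write"],
    "beq": ["fetch", "reg_read", "alu"],
}


def total_time_ps(instr: str) -> int:
    """Sum of the latencies of the stages this instruction class uses."""
    return sum(LATENCY_PS[stage] for stage in USES[instr])


def single_cycle_period_ps() -> int:
    """The slowest instruction sets the single-cycle clock period."""
    return max(total_time_ps(instr) for instr in USES)


def pipelined_period_ps() -> int:
    """The slowest stage sets the pipelined clock period."""
    return max(LATENCY_PS.values())


if __name__ == "__main__":
    for instr in USES:
        print(instr, total_time_ps(instr))
    print(single_cycle_period_ps(), pipelined_period_ps())
```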

Pipeline Performance

Single‐cycle (Tc = 800ps)

Pipelined (Tc = 200ps)


Pipeline Speedup

• If all stages are balanced
  – i.e., all take the same time
  – Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
• If not balanced, speedup is less
• Speedup due to increased throughput
  – Latency (time for each instruction) does not decrease
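Both points can be checked against the clock periods from the preceding slides (Tc = 800ps single-cycle, Tc = 200ps pipelined, five stages). This is a rough sketch that ignores hazards; the function names and default arguments are illustrative.

```python
def nonpipelined_time_ps(n_instr: int, tc_ps: int = 800) -> int:
    """Single-cycle datapath: every instruction takes one long clock period."""
    return n_instr * tc_ps


def pipelined_time_ps(n_instr: int, stages: int = 5, tc_ps: int = 200) -> int:
    """Ideal pipeline: the first instruction needs `stages` cycles,
    and each later instruction completes one cycle after its predecessor."""
    return (stages + n_instr - 1) * tc_ps


def speedup(n_instr: int) -> float:
    return nonpipelined_time_ps(n_instr) / pipelined_time_ps(n_instr)


if __name__ == "__main__":
    print(nonpipelined_time_ps(3), pipelined_time_ps(3))
    print(pipelined_time_ps(1))          # latency of a single instruction
    print(round(speedup(1_000_000), 3))  # long-run speedup
```

With these unbalanced stages the long-run speedup approaches 800/200 = 4 rather than 5, and a single instruction takes 1000ps in the pipeline versus 800ps without it, so the gain is in throughput, not latency.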


Pipelining and ISA Design

• MIPS ISA designed for pipelining
  – All instructions are 32 bits
    • Easier to fetch and decode in one cycle
    • cf. x86: 1‐ to 17‐byte instructions
  – Few and regular instruction formats
    • Can decode and read registers in one step
  – Load/store addressing
    • Can calculate address in 3rd stage, access memory in 4th stage
  – Alignment of memory operands
    • Memory access takes only one cycle


Graphically Representing MIPS Pipeline

[Figure: one instruction drawn as the stage sequence IM, Reg, ALU, DM, Reg.]

• Can help with answering questions like:
  – How many cycles does it take to execute this code?
  – What is the ALU doing during cycle 4?
  – Is there a hazard, why does it occur, and how can it be fixed?

Why Pipeline? For Performance!

[Figure: pipeline diagram with time (clock cycles) across and instruction order down. Inst 0 through Inst 4 each pass through IM, Reg, ALU, DM, Reg, each starting one cycle after the previous one. The initial cycles are marked "Time to fill the pipeline".]

Once the pipeline is full, one instruction is completed every cycle, so CPI = 1.
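The diagram can be generated mechanically, which also answers questions such as how many cycles a code sequence takes and what the ALU is doing in a given cycle. This is a minimal sketch for an ideal pipeline with no hazards or stalls; the function names are illustrative.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]


def stage_at(instr_index: int, cycle: int):
    """Stage occupied by instruction `instr_index` (0-based) in `cycle`
    (1-based), or None if the instruction is not in the pipeline then."""
    pos = cycle - 1 - instr_index
    return STAGES[pos] if 0 <= pos < len(STAGES) else None


def total_cycles(n_instr: int) -> int:
    """Cycles to fill the pipeline plus one completion per cycle after that."""
    return len(STAGES) + n_instr - 1


def diagram(n_instr: int) -> str:
    """Text rendering: one row per instruction, one column per cycle."""
    rows = []
    for i in range(n_instr):
        cells = [(stage_at(i, c) or "").ljust(4)
                 for c in range(1, total_cycles(n_instr) + 1)]
        rows.append(f"Inst {i}: " + " ".join(cells).rstrip())
    return "\n".join(rows)


if __name__ == "__main__":
    print(diagram(5))
    print("cycles:", total_cycles(5))
    print("in ALU during cycle 4:",
          [i for i in range(5) if stage_at(i, 4) == "ALU"])
```

For five instructions the sketch reports 9 cycles: 4 to fill the pipeline, then one instruction completing in each of the next 5.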

Hazards

• Situations that prevent starting the next instruction in the next cycle
• Structure hazards
  – A required resource is busy
• Data hazard
  – Need to wait for previous instruction to complete its data read/write
• Control hazard
  – Deciding on control action depends on previous instruction


Structure Hazards

• Conflict for use of a resource
• In MIPS pipeline with a single memory
  – Load/store requires data access
  – Instruction fetch would have to stall for that cycle
    • Would cause a pipeline “bubble”
• Hence, pipelined datapaths require separate instruction/data memories
  – Or separate instruction/data caches


A Single Memory Would Be a Structural Hazard

[Figure: pipeline diagram with time (clock cycles) across and instruction order down. lw and Inst 1 through Inst 4 each pass through Mem, Reg, ALU, Mem, Reg. In the same cycle, lw is reading data from memory while a later instruction is reading its instruction from memory.]

Fix with separate instr and data memories (I$ and D$)
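The conflict in the figure can be found by comparing the cycle in which each load or store accesses data with the cycles in which later instructions are fetched. This is a sketch assuming one instruction is issued per cycle with no stalls; `memory_conflicts` and the instruction labels are illustrative.

```python
def memory_conflicts(program, single_memory=True):
    """Return (cycle, accessing_index, fetching_index) for each cycle in which
    a load/store data access and an instruction fetch need the same memory.

    `program` is a list of mnemonics in issue order. Instruction i (0-based)
    is fetched in cycle i + 1 and reaches its memory stage in cycle i + 4.
    """
    if not single_memory:
        return []  # separate I$ and D$: fetch and data access never collide
    fetch_cycle = {i + 1: i for i in range(len(program))}
    conflicts = []
    for i, op in enumerate(program):
        if op in ("lw", "sw"):
            mem_cycle = i + 4
            if mem_cycle in fetch_cycle:
                conflicts.append((mem_cycle, i, fetch_cycle[mem_cycle]))
    return conflicts


if __name__ == "__main__":
    program = ["lw", "inst1", "inst2", "inst3", "inst4"]
    print(memory_conflicts(program))
    print(memory_conflicts(program, single_memory=False))
```

With a single memory the sketch reports a clash in cycle 4 between the lw data read and the fetch of the fourth instruction after it; with separate memories the list is empty.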

Data Hazards

• An instruction depends on completion of data access by a previous instruction
  – add $s0, $t0, $t1
    sub $t2, $s0, $t3


Register Usage Can Cause Data Hazards

• Dependencies backward in time cause hazards

[Figure: pipeline diagram for the sequence below, each instruction passing through IM, Reg, ALU, DM, Reg.]

  add $1,
  sub $4,$1,$5
  and $6,$1,$7
  or  $8,$1,$9
  xor $4,$1,$5

Read before write data hazard
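A read-before-write hazard can be detected by checking, for each source register, whether a nearby earlier instruction writes it. The sketch below assumes no forwarding, write-back in stage 5, and register read in stage 2. The `Instr` type and `raw_hazards` are illustrative names, and the source operands of the add are left empty because the slide cuts them off.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[int]      # register number written, or None
    srcs: Tuple[int, ...]    # register numbers read


def raw_hazards(program, split_cycle_regfile=True):
    """Return (producer_index, consumer_index, register) for each read that
    would see a stale value without forwarding.

    If the register file writes in the first half of a cycle and reads in the
    second half, only consumers 1 or 2 instructions behind are affected;
    otherwise a consumer 3 instructions behind is affected as well.
    """
    window = 2 if split_cycle_regfile else 3
    hazards = []
    for j, consumer in enumerate(program):
        for reg in dict.fromkeys(consumer.srcs):  # de-duplicate, keep order
            if reg == 0:
                continue  # $0 is hardwired to zero
            # Scan backward so the most recent writer of `reg` is found first.
            for i in range(j - 1, max(-1, j - window - 1), -1):
                if program[i].dest == reg:
                    hazards.append((i, j, reg))
                    break
    return hazards


if __name__ == "__main__":
    program = [
        Instr("add", 1, ()),      # source operands not shown on the slide
        Instr("sub", 4, (1, 5)),
        Instr("and", 6, (1, 7)),
        Instr("or", 8, (1, 9)),
        Instr("xor", 4, (1, 5)),
    ]
    print(raw_hazards(program))
    print(raw_hazards(program, split_cycle_regfile=False))
```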

Loads Can Cause Data Hazards

• Dependencies backward in time cause hazards

[Figure: pipeline diagram for the sequence below, each instruction passing through IM, Reg, ALU, DM, Reg.]

  lw  $1,4($2)
  sub $4,$1,$5
  and $6,$1,$7
  or  $8,$1,$9
  xor $4,$1,$5

Load‐use data hazard

How About Register File Access?

[Figure: pipeline diagram with time (clock cycles) across and instruction order down, for add $1, ; Inst 1; Inst 2; add $2,$1, ; each passing through IM, Reg, ALU, DM, Reg. One clock edge is marked as controlling register writing and the other as controlling loading of the pipeline state registers.]

Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half.

One Way to “Fix” a Data Hazard

[Figure: pipeline diagram for the sequence below, each instruction passing through IM, Reg, ALU, DM, Reg, with two stall cycles between add and sub.]

  add $1,
  stall
  stall
  sub $4,$1,$5
  and $6,$1,$7

Can fix data hazard by waiting (stall), but this impacts CPI.
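The two stalls in the figure follow from the stage positions: the producer writes in stage 5 and the consumer reads in stage 2. The sketch below assumes no forwarding, and the CPI estimate ignores pipeline fill time; `stalls_needed` and `effective_cpi` are illustrative names.

```python
def stalls_needed(distance: int, split_cycle_regfile: bool = True) -> int:
    """Bubbles required before a consumer issued `distance` instructions
    after its producer, when results are only available via the register file.

    With writes in the first half of a cycle and reads in the second half,
    a consumer 3 instructions behind reads in the same cycle as the write
    and needs no stall; without that, it must be 4 behind.
    """
    safe_distance = 3 if split_cycle_regfile else 4
    return max(0, safe_distance - distance)


def effective_cpi(n_instr: int, stall_cycles: int) -> float:
    """Steady-state CPI: one cycle per instruction plus the stall cycles."""
    return (n_instr + stall_cycles) / n_instr


if __name__ == "__main__":
    print(stalls_needed(1))        # sub directly after add
    print(stalls_needed(3))        # far enough away to need no stall
    print(effective_cpi(100, 20))
```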

Forwarding (aka Bypassing)

• Use result when it is computed
  – Don’t wait for it to be stored in a register
  – Requires extra connections in the datapath


Another Way to “Fix” a Data Hazard

[Figure: pipeline diagram for the sequence below, each instruction passing through IM, Reg, ALU, DM, Reg, with forwarding paths from add to the later instructions.]

  add $1,
  sub $4,$1,$5
  and $6,$1,$7
  or  $8,$1,$9
  xor $4,$1,$5

Fix data hazards by forwarding results as soon as they are available to where they are needed.

Forwarding Illustration

[Figure: pipeline diagram for the sequence below, each instruction passing through IM, Reg, ALU, DM, Reg. Two forwarding paths are labeled: EX forwarding and MEM forwarding.]

  add $1,
  sub $4,$1,$5
  and $6,$7,$1

Yet Another Complication!

• Another potential data hazard can occur when there is a conflict between the result of the WB stage instruction and the MEM stage instruction. Which should be forwarded?

[Figure: pipeline diagram for the sequence below, each instruction passing through IM, Reg, ALU, DM, Reg.]

  add $1,$1,$2
  add $1,$1,$3
  add $1,$1,$4
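The usual resolution is to forward the most recent result, which is the one from the instruction currently in the MEM stage. The sketch below shows one way to express that priority. The `StageReg` type, its fields, and the returned labels are illustrative names, not signals taken from the slides.

```python
from typing import NamedTuple, Optional


class StageReg(NamedTuple):
    reg_write: bool      # does the instruction in this stage write a register?
    rd: Optional[int]    # its destination register number, if any


def forward_source(src_reg: int, ex_mem: StageReg, mem_wb: StageReg) -> str:
    """Choose where an ALU operand that reads `src_reg` should come from.

    `ex_mem` describes the instruction now in the MEM stage and `mem_wb` the
    instruction now in the WB stage. The MEM-stage result is checked first
    because it is the more recent write to the register.
    """
    if ex_mem.reg_write and ex_mem.rd not in (None, 0) and ex_mem.rd == src_reg:
        return "EX/MEM"
    if mem_wb.reg_write and mem_wb.rd not in (None, 0) and mem_wb.rd == src_reg:
        return "MEM/WB"
    return "REGFILE"


if __name__ == "__main__":
    # Third add in EX reads $1; the second add is in MEM, the first in WB.
    second_add = StageReg(reg_write=True, rd=1)
    first_add = StageReg(reg_write=True, rd=1)
    print(forward_source(1, second_add, first_add))
```

Register $0 is excluded because it always reads as zero, so a result nominally written to it must never be forwarded.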

r

Load‐Use Data Hazard

• Can’t always avoid stalls by forwarding
  – If value not computed when needed
  – Can’t forward backward in time!


Code Scheduling to Avoid Stalls

• Reorder code to avoid use of load result in the next instruction
• C code for A = B + E; C = B + F;

Original order (13 cycles):

    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    (stall)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    lw  $t4, 8($t0)
    (stall)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)

Reordered (11 cycles):

    lw  $t1, 0($t0)
    lw  $t2, 4($t0)
    lw  $t4, 8($t0)
    add $t3, $t1, $t2
    sw  $t3, 12($t0)
    add $t5, $t1, $t4
    sw  $t5, 16($t0)
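The 13-cycle and 11-cycle counts can be reproduced with a small model. It assumes a five-stage pipeline with full forwarding, so the only stall is a single bubble when an instruction uses the result of the load immediately before it. The tuple encoding and the name `count_cycles` are illustrative.

```python
def count_cycles(program, stages=5):
    """Cycles to run `program`, a list of (op, dest, srcs) tuples.

    Cost = cycles to fill the pipeline (stages - 1) + one cycle per
    instruction + one bubble per load-use pair of adjacent instructions.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return (stages - 1) + len(program) + stalls


ORIGINAL = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),   # uses $t2 right after its load
    ("sw", None, ("$t3", "$t0")),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),   # uses $t4 right after its load
    ("sw", None, ("$t5", "$t0")),
]

REORDERED = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]

if __name__ == "__main__":
    print(count_cycles(ORIGINAL), count_cycles(REORDERED))
```

Moving the third lw ahead of the first add puts an independent instruction between each load and its first use, which removes both bubbles.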
