multicycle datapath implementation - uc santa barbarastrukov/ece154winter2012/lecture8.pdf ·...


Multicycle Datapath Implementation

Adapted from instructor's supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2008, MK, and Computer Architecture: From Microprocessors to Supercomputers, B. Parhami, 2005 Oxford Press

Review: A Multicycle Data Path

[Figure: data path blocks PC, Cache, Inst Reg, Reg file, x Reg, y Reg, ALU, z Reg, and Data Reg, with signal labels jta, imm, rs, rt, rd, (rs), (rt), Address, and Data; the Control unit takes op and fn as inputs.]

Fig. 14.2 Abstract view of a multicycle instruction execution unit for MicroMIPS. For naming of instruction fields, see Fig. 13.1.

Feb. 2011, Computer Architecture, Data Path and Control, Slide 2

Review: Microprogramming

[Figure: control state machine spanning cycles 1 to 5. From Start, State 0 leads to State 1, which branches by instruction class: Jump/Branch to State 5; lw/sw to State 2 (then lw to States 3 and 4, sw to State 6); ALU-type to States 7 and 8.]

State 0: InstData = 0, MemRead = 1, IRWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = '+', PCSrc = 3, PCWrite = 1
State 1: ALUSrcX = 0, ALUSrcY = 3, ALUFunc = '+'
State 2: ALUSrcX = 1, ALUSrcY = 2, ALUFunc = '+'
State 3: InstData = 1, MemRead = 1
State 4: RegDst = 0, RegInSrc = 0, RegWrite = 1
State 5: ALUSrcX = 1, ALUSrcY = 1, ALUFunc = '−', JumpAddr = %, PCSrc = @, PCWrite = #
State 6: InstData = 1, MemWrite = 1
State 7: ALUSrcX = 1, ALUSrcY = 1 or 2, ALUFunc = Varies
State 8: RegDst = 0 or 1, RegInSrc = 1, RegWrite = 1

Notes for State 5:
% 0 for j or jal, 1 for syscall, don't-care for other instr's
@ 0 for j, jal, and syscall, 1 for jr, 2 for branches
# 1 for j, jr, jal, and syscall, ALUZero (its complement) for beq (bne), bit 31 of ALUout for bltz
For jal, RegDst = 2, RegInSrc = 1, RegWrite = 1

Note for State 7: ALUFunc is determined based on the op and fn fields

Microprogram:

fetch:     PCnext, CacheFetch           # State 0 (start)
           PC + 4imm, μPCdisp1          # State 1
lui1:      lui(imm)                     # State 7lui
           rt ← z, μPCfetch             # State 8lui
add1:      x + y                        # State 7add
           rd ← z, μPCfetch             # State 8add
sub1:      x - y                        # State 7sub
           rd ← z, μPCfetch             # State 8sub
slt1:      x - y                        # State 7slt
           rd ← z, μPCfetch             # State 8slt
addi1:     x + imm                      # State 7addi
           rt ← z, μPCfetch             # State 8addi
slti1:     x - imm                      # State 7slti
           rt ← z, μPCfetch             # State 8slti
and1:      x ∧ y                        # State 7and
           rd ← z, μPCfetch             # State 8and
or1:       x ∨ y                        # State 7or
           rd ← z, μPCfetch             # State 8or
xor1:      x ⊕ y                        # State 7xor
           rd ← z, μPCfetch             # State 8xor
nor1:      ¬(x ∨ y)                     # State 7nor
           rd ← z, μPCfetch             # State 8nor
andi1:     x ∧ imm                      # State 7andi
           rt ← z, μPCfetch             # State 8andi
ori1:      x ∨ imm                      # State 7ori
           rt ← z, μPCfetch             # State 8ori
xori1:     x ⊕ imm                      # State 7xori
           rt ← z, μPCfetch             # State 8xori
lwsw1:     x + imm, μPCdisp2            # State 2
lw2:       CacheLoad                    # State 3
           rt ← Data, μPCfetch          # State 4
sw2:       CacheStore, μPCfetch         # State 6
j1:        PCjump, μPCfetch             # State 5j
jr1:       PCjreg, μPCfetch             # State 5jr
branch1:   PCbranch, μPCfetch           # State 5branch
jal1:      PCjump, $31←PC, μPCfetch     # State 5jal
syscall1:  PCsyscall, μPCfetch          # State 5syscall

[Figure: microinstruction fields. PC control: JumpAddr, PCSrc, PCWrite. Cache control: InstData, MemRead, MemWrite, IRWrite. Register control: RegInSrc, RegDst, RegWrite. ALU inputs: ALUSrcX, ALUSrcY. ALU function: AddSub, LogicFn, FnType. Sequence control.]

[Figure: microprogrammed control unit. The MicroPC addresses the microprogram memory or PLA; its next value is selected by a 4-input multiplexer (inputs 0 to 3) from 0, the incremented MicroPC (Incr), Dispatch table 1, and Dispatch table 2, where the dispatch tables are indexed by op (from instruction register). The microinstruction register supplies control signals to the data path and the sequence control.]

Review: Exception Control

[Figure: control state machine spanning cycles 1 to 5, with the same States 0 to 8 (Start, lw/sw, Jump/Branch, and ALU-type paths) plus two exception states reached on the conditions labeled Illegal operation and Overflow.]

State 0: InstData = 0, MemRead = 1, IRWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = '+', PCSrc = 3, PCWrite = 1
State 1: ALUSrcX = 0, ALUSrcY = 3, ALUFunc = '+'
State 2: ALUSrcX = 1, ALUSrcY = 2, ALUFunc = '+'
State 3: InstData = 1, MemRead = 1
State 4: RegDst = 0, RegInSrc = 0, RegWrite = 1
State 5: ALUSrcX = 1, ALUSrcY = 1, ALUFunc = '−', JumpAddr = %, PCSrc = @, PCWrite = #
State 6: InstData = 1, MemWrite = 1
State 7: ALUSrcX = 1, ALUSrcY = 1 or 2, ALUFunc = Varies
State 8: RegDst = 0 or 1, RegInSrc = 1, RegWrite = 1
State 9: IntCause = 1, CauseWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = '−', EPCWrite = 1, JumpAddr = 1, PCSrc = 0, PCWrite = 1
State 10: IntCause = 0, CauseWrite = 1, ALUSrcX = 0, ALUSrcY = 0, ALUFunc = '−', EPCWrite = 1, JumpAddr = 1, PCSrc = 0, PCWrite = 1

Fig. 14.10 Exception states 9 and 10 added to the control state machine.

MIPS Pipelined Datapath and Control

Single-Cycle vs. Multicycle vs. Pipelined

[Figure, single-cycle vs. multicycle: with one long clock period, Instr 1 to Instr 4 each get the same time allotted regardless of the time needed; with a multicycle design the instructions take between 3 and 5 short cycles each (3, 3, 4, and 5 cycles in the figure), and the difference is marked as time saved.]

[Figure, pipelined: (a) task-time diagram and (b) space-time diagram for instructions 1 to 7 over cycles 1 to 11, with the pipeline start-up region and drainage region marked.]

f = Fetch, r = Reg read, a = ALU op, d = Data access, w = Writeback

Pipelining Analogy

§4.5 An Overview of Pipelining

• Pipelined laundry: overlapping execution
  – Parallelism improves performance
• Four loads: Speedup = 8/3.5 = 2.3
• Non-stop: Speedup = 2n / (0.5n + 1.5) ≈ 4 = number of stages
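The two speedup figures can be checked with a short sketch. It assumes what the formulas imply but the slide does not state outright: four laundry stages of 0.5 hours each. The function name and parameters are illustrative only.

```python
def laundry_speedup(loads, stages=4, stage_hours=0.5):
    """Speedup of pipelined laundry over doing the loads one after another.

    Assumption: every stage takes the same time (0.5 h) and there are 4 stages,
    which matches the slide's 2n and 0.5n + 1.5 expressions.
    """
    sequential = loads * stages * stage_hours          # 2n hours
    # The first load needs all stages; each later load finishes one stage-time later.
    pipelined = stages * stage_hours + (loads - 1) * stage_hours   # 0.5n + 1.5 hours
    return sequential / pipelined


print(round(laundry_speedup(4), 1))        # four loads
print(round(laundry_speedup(10_000), 2))   # many loads: close to the number of stages
```

For a small number of loads the start-up time dominates, which is why four loads fall well short of the 4x limit.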

Chapter 4 — The Processor —7

MIPS Pipeline

Five stages, one step per stage

1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register

[Figure: lw passing through IFetch, Dec, Exec, Mem, WB in Cycle 1 to Cycle 5.]

Pipeline Performance

• Assume time for stages is
  – 100ps for register read or write
  – 200ps for other stages
• Compare pipelined datapath with single-cycle datapath

Instr      Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw         200ps        100ps          200ps   200ps          100ps           800ps
sw         200ps        100ps          200ps   200ps                          700ps
R-format   200ps        100ps          200ps                  100ps           600ps
beq        200ps        100ps          200ps                                  500ps
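The totals in the table follow directly from the stage times. The sketch below recomputes them; the stage keys IF, ID, EX, MEM, WB are borrowed from the five-stage slide and mapped onto the table's columns, which is an assumption of this example rather than something the table states.

```python
# Stage latencies in picoseconds, from the assumptions listed above.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Stages each instruction class uses (the non-empty cells of each table row).
STAGES_USED = {
    "lw":       ["IF", "ID", "EX", "MEM", "WB"],
    "sw":       ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq":      ["IF", "ID", "EX"],
}


def total_time_ps(instr):
    """Sum of the stage times an instruction class needs."""
    return sum(STAGE_PS[stage] for stage in STAGES_USED[instr])


totals = {instr: total_time_ps(instr) for instr in STAGES_USED}
single_cycle_tc = max(totals.values())   # clock must fit the slowest instruction
pipelined_tc = max(STAGE_PS.values())    # clock must fit the slowest stage
print(totals, single_cycle_tc, pipelined_tc)
```

The single-cycle clock is set by the slowest instruction (lw), while the pipelined clock is set by the slowest stage.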

Pipeline Performance

Single-cycle (Tc = 800ps)

Pipelined (Tc = 200ps)
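To see what the two clock periods mean for a whole program, the following sketch compares total execution time for n instructions. It assumes a five-stage pipeline with no stalls; the function names are illustrative.

```python
def single_cycle_time_ps(n, tc=800):
    """Every instruction takes one full 800 ps cycle."""
    return n * tc


def pipelined_time_ps(n, tc=200, stages=5):
    """First instruction needs `stages` cycles; each later one completes one cycle after.

    Assumption: no hazards, so no stall cycles are added.
    """
    return (stages + n - 1) * tc


for n in (3, 1_000_000):
    s, p = single_cycle_time_ps(n), pipelined_time_ps(n)
    print(n, s, p, round(s / p, 2))
```

For three instructions the pipeline fill time keeps the gain modest; for long instruction streams the ratio approaches 800/200.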

Pipeline Speedup

• If all stages are balanced
  – i.e., all take the same time
  – Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
• If not balanced, speedup is less
• Speedup due to increased throughput
  – Latency (time for each instruction) does not decrease
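A small numeric check of the balanced and unbalanced cases. The unbalanced stage times are the 200/100/200/200/100 ps values used earlier in the lecture; the balanced 160 ps stages are a hypothetical split of the same 800 ps, introduced only for comparison.

```python
def time_between_instructions_ps(stage_ps):
    """In a pipeline, a new instruction can start once per clock, set by the slowest stage."""
    return max(stage_ps)


NONPIPELINED_PS = 800                   # time between instructions without pipelining
unbalanced = [200, 100, 200, 200, 100]  # stage times from the lecture's example
balanced = [160] * 5                    # hypothetical: same total, evenly split

for name, stages in (("balanced", balanced), ("unbalanced", unbalanced)):
    t = time_between_instructions_ps(stages)
    print(name, t, NONPIPELINED_PS / t)
```

With balanced stages the speedup equals the number of stages (5); with the unbalanced stages it drops to 4, even though the latency of one instruction is no shorter in either case.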

Pipelining and ISA Design

• MIPS ISA designed for pipelining
  – All instructions are 32 bits
    • Easier to fetch and decode in one cycle
    • c.f. x86: 1- to 17-byte instructions
  – Few and regular instruction formats
    • Can decode and read registers in one step
  – Load/store addressing
    • Can calculate address in 3rd stage, access memory in 4th stage
  – Alignment of memory operands
    • Memory access takes only one cycle

Graphically Representing MIPS Pipeline

[Figure: one instruction drawn as the stage sequence IM, Reg, ALU, DM, Reg.]

• Can help with answering questions like:
  – How many cycles does it take to execute this code?
  – What is the ALU doing during cycle 4?
  – Is there a hazard, why does it occur, and how can it be fixed?
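The first two questions can also be answered programmatically. This is a minimal sketch assuming an ideal pipeline with no stalls, instructions numbered from 0, and cycles numbered from 1; all function names are illustrative.

```python
STAGES = ["IM", "Reg", "ALU", "DM", "Reg"]   # stage labels as drawn in the diagrams


def total_cycles(num_instructions):
    """Cycles to run the code when one instruction enters the pipeline per cycle."""
    return num_instructions + len(STAGES) - 1


def stage_of(instr, cycle):
    """Stage occupied by instruction `instr` (0-based) in `cycle` (1-based), or None."""
    k = cycle - 1 - instr
    return STAGES[k] if 0 <= k < len(STAGES) else None


def instruction_in_alu(cycle, num_instructions):
    """Index of the instruction using the ALU in `cycle`, or None if the ALU is idle."""
    instr = cycle - 1 - STAGES.index("ALU")
    return instr if 0 <= instr < num_instructions else None


def diagram(num_instructions):
    """Text version of the pipeline diagram, one row per instruction."""
    rows = []
    for i in range(num_instructions):
        cells = [(stage_of(i, c) or ".").ljust(4)
                 for c in range(1, total_cycles(num_instructions) + 1)]
        rows.append(f"Inst {i}: " + "".join(cells).rstrip())
    return "\n".join(rows)


print(diagram(3))
print(total_cycles(3), instruction_in_alu(4, 3))
```

Hazards are not modeled here; they would show up as extra stall cycles in the rows.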

Why Pipeline? For Performance!

[Figure: Inst 0 to Inst 4, in instruction order against time (clock cycles), each passing through IM, Reg, ALU, DM, Reg and offset by one cycle from the previous instruction. The initial cycles are labeled: Time to fill the pipeline.]

Once the pipeline is full, one instruction is completed every cycle, so CPI = 1

Hazards

• Situations that prevent starting the next instruction in the next cycle
• Structure hazards
  – A required resource is busy
• Data hazard
  – Need to wait for previous instruction to complete its data read/write
• Control hazard
  – Deciding on control action depends on previous instruction

Structure Hazards

• Conflict for use of a resource
• In MIPS pipeline with a single memory
  – Load/store requires data access
  – Instruction fetch would have to stall for that cycle
    • Would cause a pipeline "bubble"
• Hence, pipelined datapaths require separate instruction/data memories
  – Or separate instruction/data caches

A Single Memory Would Be a Structural Hazard

[Figure: lw, Inst 1, Inst 2, Inst 3, Inst 4, in instruction order against time (clock cycles), each passing through Mem, Reg, ALU, Mem, Reg. In the same cycle, lw is reading data from memory while a later instruction is reading an instruction from memory.]

Fix with separate instr and data memories (I$ and D$)
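The conflict can be located mechanically. The sketch below assumes a five-stage pipeline with a single shared memory, one instruction issued per cycle, and no stalls; instruction i (0-based) is fetched in cycle i + 1 and reaches its memory stage in cycle i + 4. The function name and the tuple layout are illustrative.

```python
def memory_conflicts(program):
    """Return (cycle, data_access_instr, fetched_instr) for each shared-memory clash.

    `program` is a list of opcode strings. Only lw and sw use memory in their
    fourth stage; every instruction uses memory in its first stage (fetch).
    """
    conflicts = []
    for i, op in enumerate(program):
        if op in ("lw", "sw"):
            mem_cycle = i + 4          # cycle in which instruction i accesses data
            fetched = mem_cycle - 1    # index of the instruction fetched in that cycle
            if fetched < len(program):
                conflicts.append((mem_cycle, i, fetched))
    return conflicts


# lw followed by four other instructions, as in the figure.
print(memory_conflicts(["lw", "inst1", "inst2", "inst3", "inst4"]))
```

With separate instruction and data memories the two accesses go to different resources, so the list of conflicts would be empty by construction.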

Data Hazards

• An instruction depends on completion of data access by a previous instruction
  – add $s0, $t0, $t1
    sub $t2, $s0, $t3
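A dependence like the one between add and sub can be found by comparing destination and source registers. This sketch assumes a five-stage pipeline without forwarding in which a register written in a cycle can be read in the same cycle, so only the next two instructions are at risk; the tuple encoding (op, dest, src1, src2) is hypothetical.

```python
def raw_hazards(program, window=2):
    """Find read-after-write hazards: (producer_index, consumer_index, register).

    `window` is how many following instructions would read a stale value;
    2 is an assumption tied to the pipeline model described above.
    """
    hazards = []
    for i, instr in enumerate(program):
        dest = instr[1]
        if dest is None:          # e.g. a store or branch writes no register
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if dest in program[j][2:]:
                hazards.append((i, j, dest))
    return hazards


program = [
    ("add", "$s0", "$t0", "$t1"),
    ("sub", "$t2", "$s0", "$t3"),
]
print(raw_hazards(program))
```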

Register Usage Can Cause Data Hazards

• Dependencies backward in time cause hazards

[Figure: the following instructions, in instruction order, each passing through IM, Reg, ALU, DM, Reg.]

add $1,
sub $4,$1,$5
and $6,$1,$7
or  $8,$1,$9
xor $4,$1,$5

Read before write data hazard

Loads Can Cause Data Hazards

• Dependencies backward in time cause hazards

[Figure: the following instructions, in instruction order, each passing through IM, Reg, ALU, DM, Reg.]

lw  $1,4($2)
sub $4,$1,$5
and $6,$1,$7
or  $8,$1,$9
xor $4,$1,$5

Load-use data hazard

How About Register File Access?

[Figure: add $1, ; Inst 1; Inst 2; add $2,$1, in instruction order against time (clock cycles), each passing through IM, Reg, ALU, DM, Reg. Two clock edges are marked: the clock edge that controls register writing, and the clock edge that controls loading of pipeline state registers.]

Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half

One Way to "Fix" a Data Hazard

[Figure: the following sequence, in instruction order, each instruction passing through IM, Reg, ALU, DM, Reg.]

add $1,
stall
stall
sub $4,$1,$5
and $6,$1,$7

Can fix data hazard by waiting – stall – but impacts CPI
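The two stall cycles in the figure, and their effect on CPI, can be derived with a small sketch. It assumes a five-stage pipeline with no forwarding and a register file that writes in the first half of a cycle and reads in the second half, as described on the register file access slide; the function names are illustrative.

```python
def stalls_without_forwarding(distance):
    """Stall cycles needed before a dependent instruction can read its operand.

    `distance` is 1 when the consumer immediately follows the producer.
    The producer writes in its 5th cycle; the consumer reads in its 2nd,
    and may read in the same cycle as the write (assumption stated above).
    """
    return max(0, 3 - distance)


def effective_cpi(num_instructions, stall_cycles):
    """Steady-state CPI: one cycle per instruction plus the stall cycles (fill time ignored)."""
    return (num_instructions + stall_cycles) / num_instructions


print(stalls_without_forwarding(1))          # sub right after add
print(round(effective_cpi(3, 2), 2))         # add, sub, and with two stalls
```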

Forwarding (aka Bypassing)

• Use result when it is computed
  – Don't wait for it to be stored in a register
  – Requires extra connections in the datapath

Another Way to "Fix" a Data Hazard

[Figure: the following instructions, in instruction order, each passing through IM, Reg, ALU, DM, Reg, with forwarding paths from the add result to the later instructions.]

add $1,
sub $4,$1,$5
and $6,$1,$7
or  $8,$1,$9
xor $4,$1,$5

Fix data hazards by forwarding results as soon as they are available to where they are needed

Forwarding Illustration

[Figure: the following instructions, in instruction order, each passing through IM, Reg, ALU, DM, Reg, with two forwarding paths labeled EX forwarding and MEM forwarding.]

add $1,
sub $4,$1,$5
and $6,$7,$1

Yet Another Complication!

• Another potential data hazard can occur when there is a conflict between the result of the WB stage instruction and the MEM stage instruction – which should be forwarded?

[Figure: the following instructions, in instruction order, each passing through IM, Reg, ALU, DM, Reg.]

add $1,$1,$2
add $1,$1,$3
add $1,$1,$4
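The usual resolution is to forward the most recent result, which is the one held by the MEM stage instruction. The sketch below models that priority; the `StageResult` type, the string labels, and the function name are all hypothetical, and the logic follows the standard textbook forwarding conditions rather than anything shown on this slide.

```python
from typing import NamedTuple


class StageResult(NamedTuple):
    """What a pipeline register exposes to the forwarding logic (illustrative)."""
    reg_write: bool   # will this instruction write a register?
    rd: int           # destination register number


def forward_source(src_reg, ex_mem, mem_wb):
    """Pick where an ALU operand should come from for the instruction in EX."""
    if src_reg == 0:
        return "REGFILE"                 # $0 is hardwired to zero, never forwarded
    if ex_mem.reg_write and ex_mem.rd == src_reg:
        return "EX/MEM"                  # MEM stage instruction: most recent value wins
    if mem_wb.reg_write and mem_wb.rd == src_reg:
        return "MEM/WB"                  # WB stage instruction: used only if no newer match
    return "REGFILE"


# Third add is in EX; the second add is in MEM and the first add is in WB.
second_add = StageResult(reg_write=True, rd=1)
first_add = StageResult(reg_write=True, rd=1)
print(forward_source(1, second_add, first_add))
```

Checking the MEM stage first is what makes the sequence of three dependent adds accumulate correctly.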

Load-Use Data Hazard

• Can't always avoid stalls by forwarding
  – If value not computed when needed
  – Can't forward backward in time!

Code Scheduling to Avoid Stalls

• Reorder code to avoid use of load result in the next instruction
• C code for A = B + E; C = B + F;

Original order (13 cycles)      Reordered (11 cycles)
lw  $t1, 0($t0)                 lw  $t1, 0($t0)
lw  $t2, 4($t0)                 lw  $t2, 4($t0)
stall                           lw  $t4, 8($t0)
add $t3, $t1, $t2               add $t3, $t1, $t2
sw  $t3, 12($t0)                sw  $t3, 12($t0)
lw  $t4, 8($t0)                 add $t5, $t1, $t4
stall                           sw  $t5, 16($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)
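The 13-cycle and 11-cycle counts can be reproduced by counting load-use stalls. This sketch assumes a five-stage pipeline with full forwarding, so the only stall is one cycle when an instruction uses the result of the lw immediately before it; the tuple encoding (op, dest, sources...) is hypothetical.

```python
def load_use_stalls(program):
    """Count one stall for each instruction that reads the register loaded just before it."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in curr[2:]:
            stalls += 1
    return stalls


def cycles(program, stages=5):
    """Total cycles: pipeline fill plus one cycle per instruction plus stalls."""
    return len(program) + stages - 1 + load_use_stalls(program)


original = [
    ("lw", "$t1", "$t0"), ("lw", "$t2", "$t0"),
    ("add", "$t3", "$t1", "$t2"), ("sw", None, "$t3", "$t0"),
    ("lw", "$t4", "$t0"),
    ("add", "$t5", "$t1", "$t4"), ("sw", None, "$t5", "$t0"),
]
reordered = [
    ("lw", "$t1", "$t0"), ("lw", "$t2", "$t0"), ("lw", "$t4", "$t0"),
    ("add", "$t3", "$t1", "$t2"), ("sw", None, "$t3", "$t0"),
    ("add", "$t5", "$t1", "$t4"), ("sw", None, "$t5", "$t0"),
]
print(cycles(original), cycles(reordered))
```

Moving the third lw up puts an independent instruction between each load and its first use, which removes both stalls.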