
Advanced Computer Architectures

Laboratory on DLX Pipelining

Vittorio Zaccaria

Vittorio Zaccaria – Laboratory of

Architectures

DLX Load/Store Architecture

- Registers are faster than memory
- The compiler can do deeper optimization
- 16-bit offsets and immediates
- 32-bit integer registers
- 64-bit floating point registers
- Fixed operation encoding:
  - Addressing mode contained in the operation code
  - Fits in one word
  - Faster decoding


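DLX (cont.)

- 32 general purpose registers
- 32-bit instructions, in four formats (bit positions shown above the fields):

Register-Register
  | 31..26 | 25..21 | 20..16 | 15..11 | 10..6 | 5..0 |
  | Op     | Rs1    | Rs2    | Rd     | Opx          |

Register-Immediate
  | 31..26 | 25..21 | 20..16 | 15..0     |
  | Op     | Rs1    | Rd     | immediate |

Branch
  | 31..26 | 25..21 | 20..16  | 15..0     |
  | Op     | Rs1    | Rs2/Opx | immediate |

Jump / Call
  | 31..26 | 25..0  |
  | Op     | target |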
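The field boundaries of the Register-Immediate format can be checked with a few shifts and masks. The following is a minimal sketch, not part of the original lab: the function names are hypothetical, the opcode value 8 is an arbitrary placeholder, and sign-extending the 16-bit immediate is an assumption.

```python
def encode_register_immediate(op, rs1, rd, imm):
    """Pack the four Register-Immediate fields into one 32-bit word."""
    return (op << 26) | (rs1 << 21) | (rd << 16) | (imm & 0xFFFF)


def decode_register_immediate(word):
    """Split a 32-bit word into (op, rs1, rd, immediate)."""
    op = (word >> 26) & 0x3F     # bits 31..26
    rs1 = (word >> 21) & 0x1F    # bits 25..21
    rd = (word >> 16) & 0x1F     # bits 20..16
    imm = word & 0xFFFF          # bits 15..0
    if imm & 0x8000:             # assumption: the immediate is sign-extended
        imm -= 0x10000
    return op, rs1, rd, imm


# Placeholder opcode 8; registers r3 -> r3 with immediate 4.
word = encode_register_immediate(op=8, rs1=3, rd=3, imm=4)
print(decode_register_immediate(word))
```

The same shift-and-mask pattern applies to the other three formats with their own field widths.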


DLX Pipeline


Pipeline Visualization


Hazards

Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.

- Structural hazards: the hardware cannot support this combination of instructions
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: pipelining of branches and other instructions that change the PC

The common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline.


Structural Hazards


Data Hazards


Control Hazards


An example program:

        .data
dati_a: .word 1,2,3,4,5,6,7,8
dati_b: .word 2,3,4,5,6,7,7,9

        .text
        .global main

        add  r3,r0,0
loop:   lw   r4,dati_a(r3)
        lw   r5,dati_b(r3)
        sub  r5,r5,r4
        addi r3,r3,4
        bnez r5,loop
exit:


Hazard Identification

1st Exercise: draw the pipeline chart and indicate:

- Data hazards between WB stages and ID stages.
- Control hazards between the EX stage and the IF stage.

                  CK1  CK2  CK3  CK4  CK5  CK6  CK7  CK8  CK9  CK10 CK11 CK12 CK13 CK14
add r3,r0,0       IF   ID   EX   MEM  WB
lw r4,dati_a(r3)       IF   ID   EX   MEM  WB
lw r5,dati_b(r3)            IF   ID   EX   MEM  WB
sub r5,r5,r4                     IF   ID   EX   MEM  WB
addi r3,r3,4                          IF   ID   EX   MEM  WB
bnez r5,loop                               IF   ID   EX   MEM  WB
...
bnez r5,loop                                                        IF   ID   EX   MEM


2nd Exercise: Hazard Resolution

- Software solution
  - NOP insertion
- Hardware solutions
  - Bubble/stall generation
  - Register forwarding
- Software optimizations
  - Code rescheduling


NOP insertion

        add  r3,r0,0
        NOP
        NOP
Loop:   lw   r4,dati_a(r3)
        lw   r5,dati_b(r3)
        NOP
        NOP
        sub  r5,r5,r4
        addi r3,r3,4
        NOP
        bnez r5,Loop
        NOP


NOP dynamic execution

                  CK1  CK2  CK3  CK4  CK5  CK6  CK7  CK8  CK9  CK10 CK11 CK12 CK13 CK14 CK15 CK16 CK17
add r3,r0,0       IF   ID   EX   MEM  WB
NOP                    IF   ID   EX   MEM  WB
NOP                         IF   ID   EX   MEM  WB
First loop:
lw r4,dati_a(r3)                 IF   ID   EX   MEM  WB
lw r5,dati_b(r3)                      IF   ID   EX   MEM  WB
NOP                                        IF   ID   EX   MEM  WB
NOP                                             IF   ID   EX   MEM  WB
sub r5,r5,r4                                         IF   ID   EX   MEM  WB
addi r3,r3,4                                              IF   ID   EX   MEM  WB
NOP                                                            IF   ID   EX   MEM  WB
bnez r5,loop                                                        IF   ID   EX   MEM  WB
NOP                                                                      IF   ID   EX   MEM  WB
Second loop: ........

The loop consists of 5 instructions and 4 NOPs.


Performance Indexes

- CPI = average clock cycles per instruction
- Clock cycles = n° instr + n° stalls/nops + 4
  (4 is the number of additional cycles needed to complete the last instruction)
- CPI = [Clock cycles] / [n° instr]
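As a rough sketch of these two indexes, the helpers below compute CPI and MIPS from the instruction count and the number of lost slots (stalls, NOPs or aborts). The function names are hypothetical; the sample values (7 instructions, 6 lost slots, 200 MHz clock) are the ones used for the first trace in this lab.

```python
def total_cycles(instructions, lost_slots, drain=4):
    """Clock cycles = instructions + stalls/NOPs + 4 cycles to complete the last one."""
    return instructions + lost_slots + drain


def cpi(instructions, lost_slots, drain=4):
    """Average clock cycles per (useful) instruction."""
    return total_cycles(instructions, lost_slots, drain) / instructions


def mips(clock_hz, cpi_value):
    """Millions of instructions per second for a given clock and CPI."""
    return clock_hz / (cpi_value * 1e6)


trace_cpi = cpi(instructions=7, lost_slots=6)
print(round(trace_cpi, 2))                 # 2.43
print(round(mips(200e6, trace_cpi), 2))    # 82.35
```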


Performance evaluation of NOPs

Actual CPI = (Instructions + NOPs + 4) / Instructions = (13 + 4) / 7 ≈ 2.43

MIPS = frequency [= 200 MHz] / (CPI * 10^6) = 82.35 MIPS


NOPs Manual Exercise

Manually execute the loop for two iterations (finishing on the NOP after the 2nd bnez) and calculate CPI and MIPS.

10 minutes


Results

- CPI = (21+4)/11 ≈ 2.27
- MIPS = 88


Asymptotic loop performance

- Consider an intermediate iteration of the loop.
- Count the instructions + NOPs of the iteration and divide by the number of effective instructions -> asymptotic CPI

10 minutes


Performance evaluation of NOPs (asymptotic)

Asymptotic loop CPI = [(Instructions + NOPs)*n + 4] / (Instructions*n) = (9n+4)/5n ≈ 1.8

MIPS = frequency [= 200 MHz] / (CPI * 10^6) = 111 MIPS
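A short numerical check shows how the 4 pipeline-drain cycles stop mattering as the number of iterations n grows. This is only an illustrative sketch; `loop_cpi` is a hypothetical helper, and the values 5 and 4 are the instructions and NOPs per iteration counted above.

```python
def loop_cpi(n, instr_per_iter, lost_per_iter, drain=4):
    """CPI after n loop iterations: ((instr + lost) * n + drain) / (instr * n)."""
    return ((instr_per_iter + lost_per_iter) * n + drain) / (instr_per_iter * n)


# 5 instructions and 4 NOPs per iteration, as in the NOP version of the loop.
for n in (1, 10, 100, 1000):
    print(n, round(loop_cpi(n, 5, 4), 3))
```

The printed values approach (5+4)/5 = 1.8, which is why the asymptotic CPI can be read directly from one intermediate iteration.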


Bubbles

- Bubbles are NOPs inserted by the hardware.
- Branch instructions cause a NOP to be generated.
- The following instructions are stalled.
- The preceding instructions continue to execute.


Bubbles Example

                  CK1     CK2     CK3     CK4     CK5     CK6     CK7     CK8     CK9     CK10    CK11    CK12    CK13    CK14    CK15    CK16    CK17
add r3,r0,0       IF      ID      EX      MEM     WB
lw r4,dati_a(r3)          IF      Bubble  Bubble  ID      EX      MEM     WB
lw r5,dati_b(r3)                                  IF      ID      EX      MEM     WB
sub r5,r5,r4                                              IF      Bubble  Bubble  ID      EX      MEM     WB
addi r3,r3,4                                                                      IF      ID      EX      MEM     WB
bnez r5,loop                                                                              IF      Bubble  ID      EX      MEM     WB
lw r4,dati_a(r3)                                                                                          Aborted IF      ID      EX      MEM     WB


Performance evaluation of bubbles

Actual CPI = (Instructions + Bubbles/aborts + 4) / Instructions = (7 + 6 + 4) / 7 ≈ 2.43

MIPS = frequency [= 200 MHz] / (CPI * 10^6) = 82.35 MIPS
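The bubble count can be cross-checked with a very small timing model. The sketch below is an assumption-laden simplification, not the simulator used in the lab: without forwarding, a consumer may enter ID no earlier than the cycle in which its producer is in WB, and each taken branch costs one aborted fetch. The function name and the tuple encoding of the trace are hypothetical.

```python
def wb_cycle_without_forwarding(trace):
    """Return the clock cycle in which the last instruction of the trace is in WB.

    trace: list of (dest, sources, taken_branch) tuples in execution order.
    """
    ready = {}      # register -> WB cycle of its most recent producer
    next_id = 2     # first instruction: IF in cycle 1, ID in cycle 2
    last_id = next_id
    for dest, sources, taken_branch in trace:
        # Wait in front of ID until every source register has been written back.
        id_cycle = max([next_id] + [ready.get(reg, 0) for reg in sources])
        if dest is not None:
            ready[dest] = id_cycle + 3          # ID -> EX -> MEM -> WB
        last_id = id_cycle
        # A taken branch wastes one fetch slot (the aborted instruction).
        next_id = id_cycle + (2 if taken_branch else 1)
    return last_id + 3


EXAMPLE_TRACE = [
    ("r3", ("r0",), False),        # add  r3,r0,0
    ("r4", ("r3",), False),        # lw   r4,dati_a(r3)
    ("r5", ("r3",), False),        # lw   r5,dati_b(r3)
    ("r5", ("r5", "r4"), False),   # sub  r5,r5,r4
    ("r3", ("r3",), False),        # addi r3,r3,4
    (None, ("r5",), True),         # bnez r5,loop (taken)
    ("r4", ("r3",), False),        # lw   r4,dati_a(r3), second iteration
]

cycles = wb_cycle_without_forwarding(EXAMPLE_TRACE)
lost_slots = cycles - 4 - len(EXAMPLE_TRACE)
print(cycles, lost_slots)          # 17 6
```

Under these assumptions the model reproduces the 7 + 6 + 4 = 17 cycles counted above; it ignores structural hazards and treats every branch in the trace as explicitly marked taken or not taken.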


Verify on the simulator

- File -> load code ... -> pipe1.s -> select -> load -> yes
- Configuration -> disable forwarding
- Open clock cycle diagram
- Execute -> single cycle (until the 1st load of the 2nd loop iteration has been executed)


Result


Manual Exercise

- Predict what happens in an intermediate iteration of the loop
- Calculate the asymptotic CPI and MIPS

10 minutes


Let’s simulate it

Simulate the program until the 4th iteration of the loop.


Solutions

After the 1st iteration, we note the same behavior in every iteration:

- 5 instructions
- 1 nop
- 3 stalls

So the asymptotic values are: CPI=1.8, MIPS=111.11


Result Forwarding


Forwarding Example

                  CK1     CK2     CK3     CK4     CK5     CK6     CK7     CK8     CK9     CK10    CK11    CK12    CK13
add r3,r0,0       IF      ID      EX      MEM     WB
lw r4,dati_a(r3)          IF      ID      EX      MEM     WB
lw r5,dati_b(r3)                  IF      ID      EX      MEM     WB
sub r5,r5,r4                              IF      ID      Bubble  EX      MEM     WB
addi r3,r3,4                                      IF      Bubble  ID      EX      MEM     WB
bnez r5,loop                                                      IF      ID      EX      MEM     WB
lw r4,dati_a(r3)                                                          Aborted IF      ID      EX      MEM     WB


Simulation of 2 iterations of the loop

- Configuration -> enable forwarding
- Open clock cycle diagram
- File -> Reset DLX
- Execute -> single cycle (up to the WB of the 2nd bnez)


Simulation results


Manual Exercise

- Calculate CPI and MIPS for the 2 iterations.
- Calculate the asymptotic CPI and MIPS.

15 minutes


Results

2 iterations:

- 11 instructions
- 1 nop
- 2 stalls
- 4 cycles to flush the pipe

CPI=18/11≈1.64, MIPS=122


Asymptotic Results

- 5 instructions
- 1 nop
- 1 stall

CPI=[7n+4]/5n≈1.4, MIPS=142.86


Speedup

Speedup of A w.r.t. B = Exec. Time B / Exec. Time A


Calculate the asymptotic speedup

- Speedup(NOPs,Bubbles)
- Speedup(Forwarding,NOPs)
- Speedup(Forwarding,Bubbles)

5 minutes


Asymptotic speedup results

- Speedup(NOPs,Bubbles)=1
- Speedup(Forwarding,NOPs)=1.29
- Speedup(Forwarding,Bubbles)=1.29
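Since the three versions execute the same useful instructions at the same clock frequency, execution time is proportional to CPI, so the speedups can be computed as CPI ratios. A minimal sketch, with hypothetical names and the asymptotic CPIs found earlier (1.8 for NOPs and bubbles, 1.4 for forwarding):

```python
# Asymptotic CPIs of the example loop, taken from the previous results.
ASYMPTOTIC_CPI = {"nops": 9 / 5, "bubbles": 9 / 5, "forwarding": 7 / 5}


def speedup(a, b, cpi=ASYMPTOTIC_CPI):
    """Speedup of version a w.r.t. version b = time(b) / time(a) = CPI(b) / CPI(a)."""
    return cpi[b] / cpi[a]


print(round(speedup("nops", "bubbles"), 2))         # 1.0
print(round(speedup("forwarding", "nops"), 2))      # 1.29
print(round(speedup("forwarding", "bubbles"), 2))   # 1.29
```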


Scheduling

Optimization: change the order of operations to minimize stalls/bubbles (forwarding enabled).

Original order:

        lw  r3,0(r2)
        add r3,r3,r7
        lw  r4,0(r2)
        add r4,r4,r8
        add r4,r4,r3

CPI=(5+2+4)/5

Rescheduled:

        lw  r3,0(r2)
        lw  r4,0(r2)
        add r3,r3,r7
        add r4,r4,r8
        add r4,r4,r3

CPI=(5+4)/5
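With forwarding enabled, the only stalls left in these two sequences are load-use stalls: an instruction that reads a register loaded by the instruction immediately before it. The sketch below counts them; the tuple encoding and function name are hypothetical, and the model deliberately ignores branches and any other hazard.

```python
def load_use_stalls(program):
    """Count one stall for every instruction that uses the result of the lw just before it.

    program: list of (opcode, dest, sources) tuples in program order.
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_dest, _ = previous
        _, _, cur_sources = current
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls


ORIGINAL = [
    ("lw", "r3", ("r2",)),
    ("add", "r3", ("r3", "r7")),
    ("lw", "r4", ("r2",)),
    ("add", "r4", ("r4", "r8")),
    ("add", "r4", ("r4", "r3")),
]
RESCHEDULED = [ORIGINAL[0], ORIGINAL[2], ORIGINAL[1], ORIGINAL[3], ORIGINAL[4]]

for name, program in (("original", ORIGINAL), ("rescheduled", RESCHEDULED)):
    stalls = load_use_stalls(program)
    print(name, stalls, (len(program) + stalls + 4) / len(program))
```

Moving the second load up separates each load from its consumer, which is what removes both stalls.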


1st Exercise

        addi r1,r0,1
        seq  r2,r1,r1
        add  r3,r3,r3
Loop:   lw   r4,0(r3)
        sub  r3,r3,r4
        bnez r1,Loop


Manual Exercises

- Draw the conflicts between operations until the end of the 3rd iteration of the loop (last instruction: bnez). No forwarding possible.
- Insert bubbles/aborts in the right place to solve hazards.
- Calculate the CPI and throughput of the trace.
- Calculate the asymptotic CPI of the loop.

20 minutes


Hazard Diagram

addi r1,r0,1      IF   ID   EX   MEM  WB
seq r2,r1,r1           IF   ID   EX   MEM  WB
add r3,r3,r3                IF   ID   EX   MEM  WB
lw r4,0(r3)                      IF   ID   EX   MEM  WB
sub r3,r3,r4                          IF   ID   EX   MEM  WB
bnez r1,Loop                               IF   ID   EX   MEM  WB








Bubbles/Stall insertion


CPIs

- Trace CPI=[24+4]/12≈2.33
- Asymptotic CPI=[6n+4]/3n≈2


Manual Exercises

- Suppose now that forwarding is possible.
- Draw the new execution pipeline diagram (until the execution of the 3rd bnez) and indicate when stalls must be generated by the hardware.
- Calculate CPI and MIPS
- Calculate the asymptotic CPI and MIPS

20 minutes


Pipeline Diagram


Results

- CPI=21/12=1.75
- Asymptotic CPI=[(4+1)n+4]/3n≈5/3≈1.67


2nd exercise

loop:   lw   r2,dati_a(r4)
        lw   r3,dati_b(r5)
        add  r1,r2,r3
        sw   dati_a(r6),r1
        addi r4,r4,4
        addi r5,r5,4
        addi r6,r6,4
        j    loop


1st part

- Assume no forwarding is possible.
- Insert bubbles/aborts in the right place to solve hazards.
- Calculate the asymptotic CPI of the loop.
- Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.

20 minutes


Results

8 instructions, 1 NOP, 4 stalls => CPI≈13/8


Results

No forwarding and no scheduling, asymptotic result: 13/8


A Possible Re-Scheduling

loop:   lw   r2,dati_a(r4)
        lw   r3,dati_b(r5)
        addi r4,r4,4
        addi r5,r5,4
        add  r1,r2,r3
        sw   dati_a(r6),r1
        addi r6,r6,4
        j    loop

Idea: increase the distance of the add from the last lw.


Re-Scheduling results

The scheduled code decreases the CPI to 11/8.


2nd part

- Now assume that forwarding is possible.
- Insert the needed bubbles/aborts in the right place to solve hazards.
- Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
- Calculate the asymptotic CPI of the two loops.
- Calculate the speedup between the original code (w/o fw.) and the last rescheduled and forwarded code.

10 minutes


Forwarding Results

With forwarding but without rescheduling we obtain: 10/8


Re-scheduling

We use the same re-scheduled code. By rescheduling the loop we obtain 9/8.


Speedup Results

The total requested speedup is:

CPI[unscheduled,unforwarded] / CPI[scheduled,forwarded] = 13/9


3rd Exercise

loop:   lw   r2,dati_a(r1)
        addi r2,r2,4
        lw   r3,dati_b(r1)
        addi r3,r3,4
        lw   r4,dati_a(r1)
        addi r4,r4,4
        add  r2,r2,r3
        add  r2,r2,r4
        sw   dati_a(r1),r2
        addi r1,r1,4
        bnez r1,loop


1st part

- Assume no forwarding is possible.
- Insert bubbles/aborts in the right place to solve hazards.
- Calculate the asymptotic CPI of the loop.
- Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.

20 minutes


Bubbles insertion

11 instructions, 1 nop, 12 stalls => CPI=24/11


Rescheduled code

loop:   lw   r2,dati_a(r1)
        lw   r3,dati_b(r1)
        lw   r4,dati_a(r1)
        addi r2,r2,4
        addi r3,r3,4
        addi r4,r4,4
        add  r2,r2,r3
        add  r2,r2,r4
        sw   dati_a(r1),r2
        addi r1,r1,4
        bnez r1,loop

Idea: perform the computations after all data has been loaded.


Scheduled code results

11 instructions, 1 nop, 7 stalls => CPI=19/11


2nd part

- Now assume that forwarding is possible.
- Insert the needed bubbles/aborts in the right place to solve hazards.
- Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
- Calculate the asymptotic CPI of the loop.
- Calculate the speedup between the original code (w/o fw.) and the last rescheduled and forwarded code.

10 minutes


Bubbles insertion

11 instructions + 1 NOP + 4 stalls => CPI=16/11


Rescheduling Results

11 instructions + 1 NOP + 1 stall => CPI=13/11

Requested speedup=24/13


Floating Point Pipeline Hazards

DLX FPU Pipeline


DLX FPU Pipeline

- Latency of a FU: number of cycles that must intervene between an instruction that produces a value through the FU and an instruction that uses this value (-1).
- Initiation Interval of the FU: time that must elapse between issuing two operations to the same FU.
- A stall in a pipeline does not mean a stall in the entire processor.


FPU Latencies and I.I.

FU                        Latency   Initiation Interval
Integer ALU               0         1
FP add                    1         1
FP and integer multiply   4         1
FP and integer divide     18        19 [structural hazards!]

WINDLX default latencies
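Read with the latency definition given above, the table tells how many stall cycles a dependent instruction needs when it follows its producer too closely. The sketch below is only a first-order illustration: the dictionary keys and function name are hypothetical, and it ignores structural hazards, the initiation interval and any interaction with loads.

```python
# Latencies from the table (WINDLX defaults).
LATENCY = {"int_alu": 0, "fp_add": 1, "fp_mul": 4, "fp_div": 18}


def raw_stalls(producer_unit, instructions_between=0):
    """Stall cycles for a consumer placed `instructions_between` slots after its producer."""
    return max(0, LATENCY[producer_unit] - instructions_between)


# A consumer issued right after a multiply waits for the full latency.
print(raw_stalls("fp_mul"))                           # 4
# Two independent instructions in between hide part of the latency.
print(raw_stalls("fp_mul", instructions_between=2))   # 2
```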


Problems with FPUs

- Divide instructions can cause structural hazards and need to be stalled in the ID stage.
- More than one write to the register file can be required in the same cycle.
- WAW hazards are possible because WB can be reached out of order.
- RAW hazards are more frequent due to the longer latency of operations.


Long Stalls even with Full Forwarding


Register file structural hazard solution

- Structural hazards can occur on the register file.
- Solution: stall one of the instructions before it enters the MEM stage.


FPU WAW Hazards

        ld    f6,dati_a(r2)
        ld    f2,dati_b(r3)
        multd f6,f2,f4
        subd  f6,f2,f2
        addd  f6,f8,f2

subd finishes before multd! There is a WAW conflict: if we don't stall subd, multd will overwrite its result.


Exercise: execute only one iteration of this loop:

loop:   ld    f0,dati_a(r2)
        ld    f4,dati_b(r3)
        multd f0,f0,f4
        addd  f2,f0,f2
        addi  r2,r2,8
        addi  r3,r3,8
        sub   r5,r4,r2
        bnez  r5,loop

How many cycles elapse between the IF of the 1st ld and the WB of the 1st bnez?


Results

CPI of the trace = 19/8 (19 cycles for 8 instructions).
