Upload: nicole-weeks

Post on 14-Dec-2015

Advanced Computer Architectures

Laboratory on DLX Pipelining

Vittorio Zaccaria

Vittorio Zaccaria – Laboratory of

Architectures

DLX Load/Store Architecture

Registers are faster than memory
The compiler can do deeper optimization
16-bit offsets and immediates
32-bit integer registers
64-bit floating-point registers
Fixed operation encoding:
– Addressing mode contained in the operation code
– Fits in one word
– Faster decoding

DLX (cont.)

32 general-purpose registers
32-bit instructions, in four formats (bit 31 is the most significant):

Format              Fields (bit ranges)
Register-Register   Op [31:26]  Rs1 [25:21]  Rs2 [20:16]  Rd [15:11]  Opx [10:0]
Register-Immediate  Op [31:26]  Rs1 [25:21]  Rd [20:16]  immediate [15:0]
Branch              Op [31:26]  Rs1 [25:21]  Rs2/Opx [20:16]  immediate [15:0]
Jump / Call         Op [31:26]  target [25:0]
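As a rough illustration of the fixed encoding, the fields of the four formats can be extracted from a 32-bit word with shifts and masks. This is a minimal Python sketch, not part of the original slides: the function names and the example opcode values are hypothetical, the 11-bit width of Opx is an assumption, and immediates are returned unsigned (no sign extension).

```python
def field(word, hi, lo):
    """Return bits hi..lo (inclusive) of a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode(word, fmt):
    """Split an instruction word into named fields for one of the four formats."""
    op = field(word, 31, 26)
    if fmt == "register-register":
        return {"op": op, "rs1": field(word, 25, 21), "rs2": field(word, 20, 16),
                "rd": field(word, 15, 11), "opx": field(word, 10, 0)}
    if fmt == "register-immediate":
        return {"op": op, "rs1": field(word, 25, 21), "rd": field(word, 20, 16),
                "imm": field(word, 15, 0)}
    if fmt == "branch":
        return {"op": op, "rs1": field(word, 25, 21), "rs2_opx": field(word, 20, 16),
                "imm": field(word, 15, 0)}
    if fmt == "jump":
        return {"op": op, "target": field(word, 25, 0)}
    raise ValueError("unknown format: " + fmt)


# Hypothetical register-immediate word: opcode 8, Rs1 = r3, Rd = r3, immediate = 4.
word = (8 << 26) | (3 << 21) | (3 << 16) | 4
print(decode(word, "register-immediate"))
```

Because the opcode always sits in bits 31..26, the decoder can pick the format before looking at any other field, which is the point of the "faster decoding" remark.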

DLX Pipeline

Pipeline Visualization

Hazards

Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.

– Structural hazards: the hardware cannot support this combination of instructions
– Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
– Control hazards: pipelining of branches and other instructions that change the PC

A common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline.

Structural Hazards

Data Hazards

Control Hazards

An example program:

        .data
dati_a: .word 1,2,3,4,5,6,7,8
dati_b: .word 2,3,4,5,6,7,7,9

        .text
        .global main

        add  r3,r0,0
loop:   lw   r4,dati_a(r3)
        lw   r5,dati_b(r3)
        sub  r5,r5,r4
        addi r3,r3,4
        bnez r5,loop
exit:

Hazard Identification

1st Exercise: draw the pipeline chart and indicate:
– Data hazards between WB stages and ID stages
– Control hazards between the EX stage and the IF stage

Instruction         CK1  CK2  CK3  CK4  CK5  CK6  CK7  CK8  CK9  CK10 CK11 CK12 CK13 CK14
add  r3,r0,0        IF   ID   EX   MEM  WB
lw   r4,dati_a(r3)       IF   ID   EX   MEM  WB
lw   r5,dati_b(r3)            IF   ID   EX   MEM  WB
sub  r5,r5,r4                      IF   ID   EX   MEM  WB
addi r3,r3,4                            IF   ID   EX   MEM  WB
bnez r5,loop                                 IF   ID   EX   MEM  WB
lw   r4,dati_a(r3)                                IF   ID   EX   MEM  WB
lw   r5,dati_b(r3)                                     IF   ID   EX   MEM  WB
sub  r5,r5,r4                                               IF   ID   EX   MEM  WB
addi r3,r3,4                                                     IF   ID   EX   MEM  WB
bnez r5,loop                                                          IF   ID   EX   MEM
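The data hazards asked for in this exercise can also be listed mechanically. The sketch below is a minimal Python illustration, not part of the original lab. It assumes an ideal 5-stage pipeline without forwarding in which a value written in WB can be read by an ID stage in the same clock cycle, so a consumer is safe only when it sits at least 3 instructions after its producer. The tuple layout and the function name are hypothetical.

```python
# Dynamic trace of the example program: (text, destination register, source registers).
TRACE = [
    ("add  r3,r0,0",       "r3", ["r0"]),
    ("lw   r4,dati_a(r3)", "r4", ["r3"]),
    ("lw   r5,dati_b(r3)", "r5", ["r3"]),
    ("sub  r5,r5,r4",      "r5", ["r5", "r4"]),
    ("addi r3,r3,4",       "r3", ["r3"]),
    ("bnez r5,loop",       None, ["r5"]),
    ("lw   r4,dati_a(r3)", "r4", ["r3"]),   # first instruction of the 2nd iteration
]


def raw_hazards(trace, min_distance=3):
    """Return (producer index, consumer index, register) for every RAW hazard."""
    hazards = []
    for j, (_, _, sources) in enumerate(trace):
        for src in sources:
            if src == "r0":          # r0 is hard-wired to zero, never a hazard
                continue
            # Walk backwards to the most recent writer of this source register.
            for i in range(j - 1, -1, -1):
                if trace[i][1] == src:
                    if j - i < min_distance:
                        hazards.append((i, j, src))
                    break
    return hazards


for producer, consumer, reg in raw_hazards(TRACE):
    print(f"{TRACE[consumer][0]:<20} needs {reg} from {TRACE[producer][0]}")
```

The control hazard is not covered by this check: it comes from the bnez itself, whose outcome is not known when the following instruction is fetched.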

2nd Exercise: Hazard Resolution

Software solution:
– NOP insertion
Hardware solutions:
– Bubble/stall generation
– Register forwarding
Software optimizations:
– Code rescheduling

NOP insertion

        add  r3,r0,0
        NOP
        NOP
Loop:   lw   r4,dati_a(r3)
        lw   r5,dati_b(r3)
        NOP
        NOP
        sub  r5,r5,r4
        addi r3,r3,4
        NOP
        bnez r5,Loop
        NOP
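The NOP placement shown here can be reproduced with a simple rule: before emitting an instruction, pad with NOPs until every producer of its source registers is at least 3 slots back, and add one NOP after each branch. The following Python sketch is only an illustration under those assumptions (no forwarding, same-cycle WB/ID access to the register file); the function name and tuple layout are hypothetical, and the single pass does not check dependences carried around the loop.

```python
# (text, destination register, source registers, is_branch)
PROGRAM = [
    ("add  r3,r0,0",       "r3", ["r0"],       False),
    ("lw   r4,dati_a(r3)", "r4", ["r3"],       False),
    ("lw   r5,dati_b(r3)", "r5", ["r3"],       False),
    ("sub  r5,r5,r4",      "r5", ["r5", "r4"], False),
    ("addi r3,r3,4",       "r3", ["r3"],       False),
    ("bnez r5,Loop",       None, ["r5"],       True),
]


def insert_nops(program, min_distance=3):
    """Return the instruction texts with NOPs added to resolve hazards."""
    out = []
    last_write = {}   # register -> position in `out` of its latest writer
    for text, dest, sources, is_branch in program:
        needed = 0
        for src in sources:
            if src in last_write:
                gap = len(out) - last_write[src]   # distance if emitted right now
                needed = max(needed, min_distance - gap)
        out.extend(["NOP"] * needed)
        if dest is not None:
            last_write[dest] = len(out)
        out.append(text)
        if is_branch:
            out.append("NOP")   # fills the slot fetched before the branch resolves
    return out


print("\n".join(insert_nops(PROGRAM)))
```

On this program the sketch emits 6 NOPs in total, 4 of them inside the loop body, in the same positions as the listing above.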

NOP dynamic execution

Instruction         CK1  CK2  CK3  CK4  CK5  CK6  CK7  CK8  CK9  CK10 CK11 CK12 CK13 CK14 CK15 CK16 CK17
add  r3,r0,0        IF   ID   EX   MEM  WB
NOP                      IF   ID   EX   MEM  WB
NOP                           IF   ID   EX   MEM  WB
First loop iteration:
lw   r4,dati_a(r3)                 IF   ID   EX   MEM  WB
lw   r5,dati_b(r3)                      IF   ID   EX   MEM  WB
NOP                                          IF   ID   EX   MEM  WB
NOP                                               IF   ID   EX   MEM  WB
sub  r5,r5,r4                                          IF   ID   EX   MEM  WB
addi r3,r3,4                                                IF   ID   EX   MEM  WB
NOP                                                              IF   ID   EX   MEM  WB
bnez r5,loop                                                          IF   ID   EX   MEM  WB
NOP                                                                        IF   ID   EX   MEM  WB
Second loop iteration:
lw   r4,dati_a(r3)                                                              IF   ID   EX   MEM  WB
........

The loop is composed of 5 instructions and 4 NOPs.

Performance Indexes

CPI = average clock cycles per instruction
Clock cycles = n° instr + n° stalls/nops + 4
  (4 is the number of cycles needed to complete the last instruction)
CPI = [Clock cycles] / [n° instr]
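These two indexes are easy to evaluate in a few lines. The sketch below is a minimal Python illustration, with hypothetical function names, of the formulas on this slide plus the usual MIPS = frequency / (CPI × 10^6) relation used later in the lab. The example numbers (7 instructions, 6 NOPs, 200 MHz) are those of the NOP trace in these slides.

```python
def total_cycles(instructions, stalls):
    """Clock cycles of a trace: one per issued slot, plus 4 to drain the pipeline."""
    return instructions + stalls + 4


def cpi(instructions, stalls):
    """Average clock cycles per useful instruction."""
    return total_cycles(instructions, stalls) / instructions


def mips(frequency_hz, cpi_value):
    """Millions of instructions per second at the given clock frequency."""
    return frequency_hz / (cpi_value * 1e6)


example_cpi = cpi(instructions=7, stalls=6)
print(round(example_cpi, 2), round(mips(200e6, example_cpi), 2))   # 2.43 82.35
```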

Performance evaluation of NOPs

Actual CPI = (Instructions + NOPs + 4) / Instructions = (13 + 4) / 7 ≈ 2.43
  (13 = 7 instructions + 6 NOPs)
MIPS = frequency [= 200 MHz] / (CPI × 10^6) ≈ 82.35 MIPS

NOPs Manual Exercise

Execute the loop manually for two iterations (finishing on the NOP after the 2nd bnez) and calculate CPI and MIPS.
10 minutes

Results

CPI = (21 + 4) / 11 ≈ 2.27
MIPS ≈ 88

Asymptotic loop performance

Consider an intermediate iteration of the loop.
Count the instructions + NOPs of the iteration and divide by the number of effective instructions -> asymptotic CPI.
10 minutes

Performance evaluation of NOPs (asymptotic)

Asymptotic loop CPI = [(Instructions + NOPs) × n + 4] / (Instructions × n) = (9n + 4) / 5n ≈ 1.8
MIPS = frequency [= 200 MHz] / (CPI × 10^6) ≈ 111 MIPS
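To see why the drain term stops mattering, one can evaluate (9n + 4) / 5n for growing n. This is a small Python sketch with a hypothetical function name; the per-iteration counts (5 instructions, 4 NOPs) are the ones from the slide, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction


def loop_cpi(n, instr_per_iter=5, nops_per_iter=4, drain=4):
    """CPI of n loop iterations: ((instr + nops) * n + drain) / (instr * n)."""
    return Fraction((instr_per_iter + nops_per_iter) * n + drain, instr_per_iter * n)


for n in (1, 10, 100, 1000):
    print(n, float(loop_cpi(n)))
```

For n = 1 the value is 13/5 = 2.6; as n grows it approaches 9/5 = 1.8, which is the asymptotic CPI.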

Bubbles

Bubbles are NOPs inserted by the hardware.
Branch instructions cause the generation of a NOP.
Next instructions are stalled.
Previous instructions continue to execute.

Bubbles Example

(Bub = bubble, Abt = aborted fetch)

Instruction         CK1  CK2  CK3  CK4  CK5  CK6  CK7  CK8  CK9  CK10 CK11 CK12 CK13 CK14 CK15 CK16 CK17
add  r3,r0,0        IF   ID   EX   MEM  WB
lw   r4,dati_a(r3)       IF   Bub  Bub  ID   EX   MEM  WB
lw   r5,dati_b(r3)                      IF   ID   EX   MEM  WB
sub  r5,r5,r4                                IF   Bub  Bub  ID   EX   MEM  WB
addi r3,r3,4                                                IF   ID   EX   MEM  WB
bnez r5,loop                                                     IF   Bub  ID   EX   MEM  WB
lw   r4,dati_a(r3)                                                         Abt  IF   ID   EX   MEM  WB
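The stall pattern in this chart follows from three rules, which the Python sketch below applies to the same trace. It is a simplified timing model written for illustration, not the simulator's algorithm, and it assumes: no forwarding, an instruction waits in front of ID until the WB of every producer has happened (same-cycle write and read allowed), the next instruction is fetched in the cycle the current one enters ID, and the fetch after a taken branch is aborted and repeated one cycle later. All names are hypothetical.

```python
# (text, destination register, source registers, is_taken_branch)
TRACE = [
    ("add  r3,r0,0",       "r3", ["r0"],       False),
    ("lw   r4,dati_a(r3)", "r4", ["r3"],       False),
    ("lw   r5,dati_b(r3)", "r5", ["r3"],       False),
    ("sub  r5,r5,r4",      "r5", ["r5", "r4"], False),
    ("addi r3,r3,4",       "r3", ["r3"],       False),
    ("bnez r5,loop",       None, ["r5"],       True),
    ("lw   r4,dati_a(r3)", "r4", ["r3"],       False),
]


def simulate_no_forwarding(trace):
    """Return (text, IF cycle, ID cycle, WB cycle) for each instruction, 1-based."""
    written_at = {}   # register -> cycle of the WB that produces its value
    rows = []
    next_fetch = 1
    for text, dest, sources, taken in trace:
        if_cycle = next_fetch
        id_cycle = if_cycle + 1
        for src in sources:
            id_cycle = max(id_cycle, written_at.get(src, 0))   # bubbles before ID
        wb_cycle = id_cycle + 3                                # EX, MEM, WB follow
        if dest is not None:
            written_at[dest] = wb_cycle
        rows.append((text, if_cycle, id_cycle, wb_cycle))
        # The successor is fetched while this instruction is in ID; after a
        # taken branch that fetch is aborted and redone in the next cycle.
        next_fetch = id_cycle + (1 if taken else 0)
    return rows


for text, if_c, id_c, wb_c in simulate_no_forwarding(TRACE):
    print(f"{text:<20} IF={if_c:<3} ID={id_c:<3} WB={wb_c}")
```

Under these assumptions the last lw reaches WB in cycle 17, matching the 7 + 6 + 4 cycles counted for this trace.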

Performance evaluation of bubbles

Actual CPI = (Instructions + Bubbles/aborts + 4) / Instructions = (7 + 6 + 4) / 7 ≈ 2.43
MIPS = frequency [= 200 MHz] / (CPI × 10^6) ≈ 82.35 MIPS

Verify on the simulator

File -> load code ... -> pipe1.s -> select -> load -> yes
Configuration -> disable forwarding
Open clock cycle diagram
Execute -> single cycle (until the 1st load of the 2nd iteration has been executed)

Result

Manual Exercise

Predict what happens in an intermediate iteration.
Calculate the asymptotic CPI and MIPS.
10 minutes

Let’s simulate it

Simulate the program until the 4th iteration.

Solutions

After the 1st iteration, we observe the same behavior in every iteration:
– 5 instructions
– 1 nop
– 3 stalls
so the asymptotic values are: CPI = 9/5 = 1.8, MIPS = 111.11

Result Forwarding

Forwarding Example

(Bub = bubble, Abt = aborted fetch)

Instruction         CK1  CK2  CK3  CK4  CK5  CK6  CK7  CK8  CK9  CK10 CK11 CK12 CK13
add  r3,r0,0        IF   ID   EX   MEM  WB
lw   r4,dati_a(r3)       IF   ID   EX   MEM  WB
lw   r5,dati_b(r3)            IF   ID   EX   MEM  WB
sub  r5,r5,r4                      IF   ID   Bub  EX   MEM  WB
addi r3,r3,4                            IF   Bub  ID   EX   MEM  WB
bnez r5,loop                                      IF   ID   EX   MEM  WB
lw   r4,dati_a(r3)                                     Abt  IF   ID   EX   MEM  WB

Simulation of 2 iterations of the loop

Configuration -> enable forwarding
Open clock cycle diagram
File -> Reset DLX
Execute -> single cycle, up to the WB of the 2nd bnez

Simulation results

Manual Exercise

Calculate CPI and MIPS for the 2 iterations.
Calculate the asymptotic CPI and MIPS.
15 minutes

Results

2 iterations:
– 11 instructions
– 1 nop
– 2 stalls
– 4 cycles to flush the pipe

CPI = 18/11 ≈ 1.64
MIPS ≈ 122

Asymptotic Results

5 instructions, 1 nop, 1 stall
CPI = (7n + 4) / 5n ≈ 1.4
MIPS = 142.86

Speedup

Speedup of A w.r.t. B = Exec. Time B / Exec. Time A
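When A and B run the same instructions at the same clock frequency, the execution-time ratio reduces to a ratio of CPIs. The Python sketch below illustrates this under that assumption; the function names are hypothetical, and the CPI values 1.8 and 1.4 are the asymptotic results obtained earlier for NOPs/bubbles and for forwarding.

```python
def execution_time(instructions, cpi, frequency_hz):
    """Execution time in seconds: instructions * CPI / clock frequency."""
    return instructions * cpi / frequency_hz


def speedup(time_a, time_b):
    """Speedup of A with respect to B (greater than 1 means A is faster)."""
    return time_b / time_a


# Same instruction count (5 per iteration) and same 200 MHz clock for both.
t_nops = execution_time(5, 1.8, 200e6)
t_forwarding = execution_time(5, 1.4, 200e6)
print(round(speedup(t_forwarding, t_nops), 2))   # 1.29
```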

Calculate asymptotic speedup

Speedup(NOPs, Bubbles)
Speedup(Forwarding, NOPs)
Speedup(Forwarding, Bubbles)
5 minutes

Calculate Asym. speedup

Speedup(NOPs, Bubbles) = 1
Speedup(Forwarding, NOPs) = 1.29
Speedup(Forwarding, Bubbles) = 1.29

Scheduling Optimizations

Change the order of operations to minimize stalls/bubbles (forwarding enabled).

Original order:
        lw   r3,0(r2)
        add  r3,r3,r7
        lw   r4,0(r2)
        add  r4,r4,r8
        add  r4,r4,r3
CPI = (5 + 2 + 4) / 5

Rescheduled:
        lw   r3,0(r2)
        lw   r4,0(r2)
        add  r3,r3,r7
        add  r4,r4,r8
        add  r4,r4,r3
CPI = (5 + 4) / 5
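The two stall counts can be checked with a one-rule model: with forwarding enabled, the only remaining data stall in this code is a load immediately followed by an instruction that reads the loaded register. The Python sketch below applies that rule to both orderings. It is an illustration under that assumption only (integer pipeline, no branches), and the names are hypothetical.

```python
from fractions import Fraction

# (opcode, destination register, source registers)
ORIGINAL = [
    ("lw",  "r3", ["r2"]),
    ("add", "r3", ["r3", "r7"]),
    ("lw",  "r4", ["r2"]),
    ("add", "r4", ["r4", "r8"]),
    ("add", "r4", ["r4", "r3"]),
]
RESCHEDULED = [ORIGINAL[0], ORIGINAL[2], ORIGINAL[1], ORIGINAL[3], ORIGINAL[4]]


def load_use_stalls(code):
    """Count instructions that read a register loaded by the instruction just before."""
    stalls = 0
    for previous, current in zip(code, code[1:]):
        if previous[0] == "lw" and previous[1] in current[2]:
            stalls += 1
    return stalls


def trace_cpi(code):
    """(instructions + stalls + 4) / instructions, as an exact fraction."""
    return Fraction(len(code) + load_use_stalls(code) + 4, len(code))


print(load_use_stalls(ORIGINAL), trace_cpi(ORIGINAL))         # 2 11/5
print(load_use_stalls(RESCHEDULED), trace_cpi(RESCHEDULED))   # 0 9/5
```

Moving the second load up puts one independent instruction between each load and its first use, which is enough to hide the load delay.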

1st Exercise

        addi r1,r0,1
        seq  r2,r1,r1
        add  r3,r3,r3
Loop:   lw   r4,0(r3)
        sub  r3,r3,r4
        bnez r1,Loop

Manual Exercises

Draw the conflicts between operations until the end of the 3rd iteration of the loop (last instruction: bnez). No forwarding possible.
Insert bubbles/aborts in the right places to solve the hazards.
Calculate CPI and throughput of the trace.
Calculate the asymptotic CPI of the loop.
20 minutes

Hazard Diagram

addi r1,r0,1    IF   ID   EX   MEM  WB
seq  r2,r1,r1        IF   ID   EX   MEM  WB
add  r3,r3,r3             IF   ID   EX   MEM  WB
lw   r4,0(r3)                  IF   ID   EX   MEM  WB
sub  r3,r3,r4                       IF   ID   EX   MEM  WB
bnez r1,Loop                             IF   ID   EX   MEM  WB
lw   r4,0(r3)                                 IF   ID   EX   MEM  WB
sub  r3,r3,r4                                      IF   ID   EX   MEM  WB
bnez r1,Loop                                            IF   ID   EX   MEM  WB
lw   r4,0(r3)                                                IF   ID   EX   MEM  WB
sub  r3,r3,r4                                                     IF   ID   EX   MEM  WB
bnez r1,Loop                                                           IF   ID   EX   MEM  WB

Bubbles/Stall insertion

CPIs

Trace CPI = (24 + 4) / 12 ≈ 2.33
Asymptotic CPI = (6n + 4) / 3n ≈ 2

Manual Exercises

Suppose now that forwarding is possible.
Draw the new pipeline execution diagram (until the execution of the 3rd bnez) and indicate when stalls must be generated by the hardware.
Calculate CPI and MIPS.
Calculate the asymptotic CPI and MIPS.
20 minutes

Pipeline Diagram

Results

CPI = 21/12 = 1.75
Asymptotic CPI = [(4 + 1)n + 4] / 3n ≈ 5/3 ≈ 1.67

2nd exercise

loop:   lw   r2,dati_a(r4)
        lw   r3,dati_b(r5)
        add  r1,r2,r3
        sw   dati_a(r6),r1
        addi r4,r4,4
        addi r5,r5,4
        addi r6,r6,4
        j    loop

1st part

Assume no forwarding is possible.
Insert bubbles/aborts in the right places to solve the hazards.
Calculate the asymptotic CPI of the loop.
Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
20 minutes

Results

8 instructions, 1 NOP, 4 stalls => CPI ≈ 13/8

Results

No forwarding and no scheduling: asymptotic CPI = 13/8

A Possible Re-Scheduling

loop:   lw   r2,dati_a(r4)
        lw   r3,dati_b(r5)
        addi r4,r4,4
        addi r5,r5,4
        add  r1,r2,r3
        sw   dati_a(r6),r1
        addi r6,r6,4
        j    loop

Idea: increase the distance of the add from the last lw.
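Before counting stalls on a rescheduled loop it is worth checking that the new order is still correct, that is, that every pair of dependent instructions keeps its relative order. The Python sketch below is one possible way to do that check and is not taken from the lab material: it tests RAW, WAR and WAW dependences on registers, models memory conservatively as a single pseudo-register named "mem" (an assumption), and ignores control dependences, so the final jump must be kept last by hand.

```python
# Original loop body: (text, destination, sources). "mem" stands for all of memory.
LOOP = [
    ("lw   r2,dati_a(r4)", "r2",  ["r4", "mem"]),   # 0
    ("lw   r3,dati_b(r5)", "r3",  ["r5", "mem"]),   # 1
    ("add  r1,r2,r3",      "r1",  ["r2", "r3"]),    # 2
    ("sw   dati_a(r6),r1", "mem", ["r6", "r1"]),    # 3
    ("addi r4,r4,4",       "r4",  ["r4"]),          # 4
    ("addi r5,r5,4",       "r5",  ["r5"]),          # 5
    ("addi r6,r6,4",       "r6",  ["r6"]),          # 6
    ("j    loop",          None,  []),              # 7
]


def depends(earlier, later):
    """True if `later` must stay after `earlier` (RAW, WAR or WAW)."""
    _, e_dest, e_srcs = earlier
    _, l_dest, l_srcs = later
    raw = e_dest is not None and e_dest in l_srcs
    war = l_dest is not None and l_dest in e_srcs
    waw = e_dest is not None and e_dest == l_dest
    return raw or war or waw


def is_legal_reordering(original, order):
    """`order` lists original indices in their new sequence."""
    position = {index: new_pos for new_pos, index in enumerate(order)}
    for i in range(len(original)):
        for j in range(i + 1, len(original)):
            if depends(original[i], original[j]) and position[i] > position[j]:
                return False
    return True


print(is_legal_reordering(LOOP, [0, 1, 4, 5, 2, 3, 6, 7]))   # True: the slide's schedule
print(is_legal_reordering(LOOP, [0, 1, 2, 6, 3, 4, 5, 7]))   # False: r6 updated before sw reads it
```

The schedule on this slide passes the check because the two pointer increments move only past instructions that neither read nor write r4 and r5 after the loads.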

Re-Scheduling results

Scheduled code decreases CPI to 11/8

2nd part

Now assume that forwarding is possible.
Insert the needed bubbles/aborts in the right places to solve the hazards.
Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
Calculate the asymptotic CPI of the two loops.
Calculate the speedup between the original code (w/o fw.) and the last rescheduled and forwarded code.
10 minutes

Forwarding Results

With forwarding but without rescheduling we obtain CPI = 10/8.

Re-scheduling

We use the same rescheduled code.
With forwarding and rescheduling, the loop CPI becomes 9/8.

Speedup Results

The total requested speedup is:

CPI[unscheduled, unforwarded] / CPI[scheduled, forwarded] = 13/9 ≈ 1.44

3rd Exercise

loop:   lw   r2,dati_a(r1)
        addi r2,r2,4
        lw   r3,dati_b(r1)
        addi r3,r3,4
        lw   r4,dati_a(r1)
        addi r4,r4,4
        add  r2,r2,r3
        add  r2,r2,r4
        sw   dati_a(r1),r2
        addi r1,r1,4
        bnez r1,loop

1st part

Assume no forwarding is possible.
Insert bubbles/aborts in the right places to solve the hazards.
Calculate the asymptotic CPI of the loop.
Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
20 minutes

Bubbles insertion

11 instructions, 1 nop, 12 stalls => CPI= 24/11

Rescheduled code

loop:   lw   r2,dati_a(r1)
        lw   r3,dati_b(r1)
        lw   r4,dati_a(r1)
        addi r2,r2,4
        addi r3,r3,4
        addi r4,r4,4
        add  r2,r2,r3
        add  r2,r2,r4
        sw   dati_a(r1),r2
        addi r1,r1,4
        bnez r1,loop

Idea: perform the computations after all the data has been loaded.

Scheduled code results

11 instr., 1 nop, 7 stalls => CPI=19/11

2nd part

Now assume that forwarding is possible.
Insert the needed bubbles/aborts in the right places to solve the hazards.
Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
Calculate the asymptotic CPI of the loop.
Calculate the speedup between the original code (w/o fw.) and the last rescheduled and forwarded code.
10 minutes

Bubbles insertion

11 + 1 NOP + 4 stalls => CPI=16/11

Rescheduling Results

11 instr. + 1 NOP + 1 stall => CPI = 13/11
Requested speedup = 24/13

Floating Point Pipeline Hazards

DLX FPU Pipeline

DLX FPU Pipeline

Latency of a FU: number of cycles that must intervene between an instruction that produces a value through the FU and an instruction that uses this value (that is, the number of cycles the operation spends in the FU, minus 1).
Initiation interval of the FU: number of cycles that must elapse between issuing two operations to the same FU.
A stall in one pipeline does not mean a stall in the entire processor.

FPU Latencies and I.I.

FU                        Latency   Initiation interval
Integer ALU               0         1
FP add                    1         1
FP and integer multiply   4         1
FP and integer divide     18        19 [structural hazards!]

(WINDLX default latencies)
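The two columns answer different questions: latency says when a dependent instruction may start, the initiation interval says when the same unit may accept another operation. The Python sketch below works through both with the values from this table. It is only an illustration under simplifying assumptions (in-order issue of one operation per cycle, no data dependences in the issue schedule), and the names are hypothetical.

```python
# name: (latency, initiation interval), values from the table above
FU = {
    "int_alu":  (0, 1),
    "fp_add":   (1, 1),
    "multiply": (4, 1),
    "divide":   (18, 19),
}


def issue_cycles(ops):
    """Earliest start cycle of each operation, in program order."""
    unit_free_at = {}   # FU name -> first cycle it can accept a new operation
    cycle = 0
    starts = []
    for op in ops:
        start = max(cycle, unit_free_at.get(op, 0))
        starts.append(start)
        unit_free_at[op] = start + FU[op][1]   # respect the initiation interval
        cycle = start + 1                      # in-order, one issue per cycle
    return starts


def earliest_consumer_start(producer_start, op):
    """First cycle in which an instruction using the result may start executing."""
    return producer_start + 1 + FU[op][0]


print(issue_cycles(["multiply", "multiply"]))   # [0, 1]
print(issue_cycles(["divide", "divide"]))       # [0, 19]
print(earliest_consumer_start(0, "multiply"))   # 5
```

Back-to-back multiplies flow through the unit one per cycle, while the second divide has to wait 19 cycles for the first one, which is the structural hazard flagged in the table.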

Problems with FPUs

Divide instructions can cause structural hazards and need to be stalled in the ID stage.
More than one instruction may try to write the register file in the same cycle.
WAW hazards are possible because WB can be reached out of order.
RAW hazards are more frequent due to the longer latency of operations.

Long Stalls even with Full Forwarding

Register file structural hazard solution

Structural hazards on the register file: stall one of the instructions before it enters the MEM stage.

FPU WAW Hazards

        ld    f6,dati_a(r2)
        ld    f2,dati_b(r3)
        multd f6,f2,f4
        subd  f6,f2,f2
        addd  f6,f8,f2

subd finishes before multd! There is a WAW conflict: if we don’t stall subd, multd will overwrite its result.
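A WAW conflict of this kind can be spotted by comparing when two writers of the same register would leave their functional units. The Python sketch below is a deliberately naive illustration, not the hardware's interlock logic: it assumes one instruction issued per cycle with no stalls at all, and execution lengths of 5 cycles for multd and 2 for subd/addd (the table latencies plus 1, which is an assumption), with loads treated as 1 cycle. The names are hypothetical.

```python
# Assumed execution lengths in cycles (latency + 1).
EX_CYCLES = {"ld": 1, "multd": 5, "subd": 2, "addd": 2}

# (opcode, destination register) in program order
CODE = [
    ("ld",    "f6"),
    ("ld",    "f2"),
    ("multd", "f6"),
    ("subd",  "f6"),
    ("addd",  "f6"),
]


def waw_conflicts(code):
    """Pairs (earlier, later) writing the same register where the later one
    would finish executing no later than the earlier one."""
    conflicts = []
    for i, (op_i, dest_i) in enumerate(code):
        finish_i = i + EX_CYCLES[op_i]          # issued at cycle i, no stalls
        for j in range(i + 1, len(code)):
            op_j, dest_j = code[j]
            if dest_j == dest_i and j + EX_CYCLES[op_j] <= finish_i:
                conflicts.append((i, j))
    return conflicts


for earlier, later in waw_conflicts(CODE):
    print(f"{CODE[later][0]} would write {CODE[later][1]} before {CODE[earlier][0]} does")
```

With these numbers the sketch reports subd against multd, the case described on the slide. It also reports addd against multd, which follows from the no-stall assumption; once subd is stalled the later instructions shift as well.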

Exercise: execute only one iteration of this loop:

loop:   ld    f0,dati_a(r2)
        ld    f4,dati_b(r3)
        multd f0,f0,f4
        addd  f2,f0,f2
        addi  r2,r2,8
        addi  r3,r3,8
        sub   r5,r4,r2
        bnez  r5,loop

How many cycles elapse between the IF of the 1st ld and the WB of the 1st bnez?

Results

CPI of the trace = 19 cycles / 8 instructions ≈ 2.38