Computer Architectures M

Parallel architectures

Parallelism

Architecture

• Architecture: the functional behaviour of a computer, for instance a processor which executes DLX code.

The architecture is defined by its machine language, that is, by its instruction set (assembly language): the Instruction Set Architecture (ISA).

• Implementation: a logical network implementing the architecture, also called the microarchitecture. There are many implementations of the same architecture. Example: the x86 family.

• Synthesis: a physical realisation of an implementation. There are many possible syntheses of the same implementation (for instance in different technologies).

The ISA varies slowly, while the implementations change rapidly (see for instance IA8, IA16, IA32…). The longer an ISA survives, the more programs are written for it, and therefore compatibility becomes the main issue.

Instruction level parallelism

• Sequential: a single instruction executed at a time

• Pipelined: multiple instructions executed simultaneously

• Superpipelined: multiple stages for each operation (EX, MEM etc.) in order to increase the clock frequency (e.g. Pentium IV)

• Scalar: a single pipeline

• Superscalar: multiple pipelines; many instructions started at the same time. Possible out-of-order execution (run-time decision)

• Very Long Instruction Word (VLIW): multiple pipelines; many instructions started at the same time. Instruction order decided at compile time

• Superscalar superpipelined (e.g. Pentium IV, i5, i7 etc.)

Parallel architectures

• Memory level parallelism: a memory able to provide multiple data at different addresses at the same time (outstanding requests: DDR2, DDR3 etc.)

• Multicore (core level parallelism): many processors in the same chip (e.g. Core Duo, Nehalem, Sandy Bridge…)

• Multithread (thread level parallelism): the pipelines of the same processor are used by different processes at the same time (time sharing), as if it were a multicore (e.g. Pentium IV, Nehalem, Sandy Bridge etc.)

Deep Pipeline (Superpipeline)

[Figure: a five-stage pipeline (Fetch, Decode, Execute, Memory, Writeback) and its superpipelined version, with the branch penalty marked on each.]

• Each stage is subdivided into three substages: higher clock frequency but higher branch penalty

• Higher power consumption!

Parallel pipelines

[Figure: four execution schemes compared: sequential; time parallelism (pipeline); space parallelism (VLIW); space-time parallelism (e.g. i5, i7…).]

Diversified pipelines - 1

[Figure: common IF, ID and RD stages feeding the dedicated execution (EX) pipelines ALU; MEM1, MEM2; FP1, FP2, FP3 (F => Floating point); BR; followed by a common WB stage.]

• Multi-instruction buffer to prevent the pipelines from blocking.

• Dedicated pipelines. The instruction sequence is defined at compile time. Careful compilation is fundamental to avoid underexploiting the pipelines.

• Problem of different execution times

• Problem of instruction interdependency

Diversified pipelines - 2

[Figure: the same diversified pipelines (IF, ID, RD; EX with ALU, MEM1/MEM2, FP1/FP2/FP3, BR; WB) with a Dispatch Buffer and a Reorder Buffer added. Labels: "in order" execution, "out of order" execution, "in order" retirement at WB.]

Floating Point DLX – F instructions

[Figure: DLX pipeline with multicycle execution stages. Between ID and MEM an instruction goes through one of four units:
 - integer unit: EX
 - FP multiplier, pipelined: M1 M2 M3 M4 M5 M6 M7
 - FP adder, pipelined: A1 A2 A3 A4
 - FP/integer divider: e.g. 24 clock cycles, one instruction executed at a time
MEM and WB follow.]

DLX revisited

• Example (no interdependency between the instructions in this sequence):

FMUL F1,F2, F2
FADD F3, F4, F5
FLD F6, 10(R8)
FST 40(R10), F9

FMUL  IF  ID  M1  M2  M3  M4  M5  M6  M7  MEM WB
FADD      IF  ID  A1  A2  A3  A4  MEM WB
FLD           IF  ID  EX  MEM WB
FST               IF  ID  EX  MEM (WB)

[Figure legend: violet marks the stages where the operands are needed, green the stages where new results are produced, red squares the execution stages. Further labels: data written; data required for computing the address; nop; same destination register, write sequence error.]

• Very important structural changes (more intermediate registers, a more complex ID stage to send each instruction to the appropriate execution stage)

• Because of the different instruction execution times, Read After Write (RAW) hazards are more frequent

• Hazard problems: the instructions do not end in the same order as their issue (instructions are not completed in order)

• Since the divider is normally a single functional unit, stalls of up to 40 clocks may occur in this case

• Multiple instructions can be in the same stage at the same time (in particular in WB)

• Write After Write (WAW) hazards: e.g. if a FADD F6, F4, F5 (four EX cycles) directly preceded a FLD F6, 10(R8) (one EX cycle), the two writes to F6 would occur in the wrong order (although in this case the FADD would have been dropped by the compiler, being useless)
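The out-of-order completion of the four-instruction example can be checked with a minimal sketch. It assumes one instruction fetched per clock and no stalls; the names `EX_CYCLES` and `writeback_clock` are illustrative, and the EX lengths are simply read off the pipeline diagram.

```python
# Number of execution stages per instruction, as drawn in the diagram
# (M1..M7 for FMUL, A1..A4 for FADD, a single EX for FLD and FST).
EX_CYCLES = {"FMUL": 7, "FADD": 4, "FLD": 1, "FST": 1}

def writeback_clock(op, fetch_clock):
    """Clock in which `op`, fetched in `fetch_clock`, reaches WB.

    IF and ID take one clock each, then the EX stages, then MEM, then WB.
    Assumption: no stalls at all.
    """
    return fetch_clock + 1 + EX_CYCLES[op] + 1 + 1

program = ["FMUL", "FADD", "FLD", "FST"]
wb = {op: writeback_clock(op, clock) for clock, op in enumerate(program, start=1)}
completion_order = sorted(program, key=wb.get)

print(wb)
print(completion_order)
```

The instruction fetched first (FMUL) is the last one to reach WB, which is exactly the "instructions are not completed in order" point of the slide.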

DLX revisited

• For WAW hazards consider the following example

FMUL F0, F4, F6  IF  ID  M1  M2  M3  M4  M5  M6  M7  MEM WB
...                  IF  ID  EX  MEM WB
...                      IF  ID  EX  MEM WB
FADD F2, F4, F6              IF  ID  A1  A2  A3  A4  MEM WB
...                              IF  ID  EX  MEM WB
...                                  IF  ID  EX  MEM WB
FLD  F2, 0(R2)                           IF  ID  EX  MEM WB

Multiple RF write operations: FMUL, FADD and FLD reach WB in the same clock.

If FADD had started one clock later, a Write After Write hazard would have occurred!

• To cope with simultaneous writes to different registers, either the number of input ports of the RF is increased (expensive) or stalls are introduced (normally in the MEM or WB stage, so as to choose which instructions to stall). The pipelines become more complex

• RAW hazards are solved through forwarding

• Normally the hazards are detected in the ID stage, considering the preceding and following instructions, so as to introduce the required stalls (in this case FLD would have been stalled for one clock)

• Hazards normally occur among homogeneous registers (FP or integer), except for FP loads and stores (FLD, FST), which use integer registers for address computation
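A small sketch can reproduce the two situations discussed for the FMUL / FADD / FLD example: writes landing in the same clock, and a true WAW when FADD is fetched one clock later. It assumes no stalls and omits the filler instructions of the diagram; `wb_clock` and `classify_writes` are hypothetical helper names.

```python
EX_CYCLES = {"FMUL": 7, "FADD": 4, "FLD": 1}  # EX stages as drawn in the diagram

def wb_clock(op, fetch_clock):
    # IF + ID + EX stages + MEM, then WB (assumption: no stalls)
    return fetch_clock + EX_CYCLES[op] + 3

def classify_writes(program):
    """program: list of (fetch_clock, op, dest) in program order."""
    found = []
    for i, (c1, op1, d1) in enumerate(program):
        for c2, op2, d2 in program[i + 1:]:
            w1, w2 = wb_clock(op1, c1), wb_clock(op2, c2)
            if w1 == w2:
                found.append(("simultaneous write", op1, op2))
            elif d1 == d2 and w2 < w1:
                # the later instruction writes the shared register first
                found.append(("WAW", op1, op2))
    return found

as_drawn = [(1, "FMUL", "F0"), (4, "FADD", "F2"), (7, "FLD", "F2")]
fadd_one_clock_later = [(1, "FMUL", "F0"), (5, "FADD", "F2"), (7, "FLD", "F2")]

print(classify_writes(as_drawn))
print(classify_writes(fadd_one_clock_later))
```

With the timing as drawn, the three instructions only collide on the RF write ports; delaying FADD by one clock makes FLD write F2 before FADD does, which is the WAW case.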

DLX revisited

• How can we guarantee that the final result is the one intended by the program?

• In the previous case FLD F2, 0(R2) must be stalled until FADD F2, F4, F6 has reached the MEM stage. It must however be assumed that between the two instructions there is at least one instruction which uses, through forwarding, the result of FADD F2, F4, F6; otherwise the compiler would have dropped the FADD!

• The situation would have been even worse if FLD had completed before the FADD.

• In any case it is always possible for instructions to complete in an order different from that of their issue

Compiler

Let's consider these high-level language statements

X = Y + Z
A = B * C

to be executed in a processor with the following pipeline: Fetch (F), Decode (D), Issue (I), three Execute stages (E), Writeback (W).

In-order emission. The issue of the addition (multiplication) is possible only AFTER the execution of the preceding instruction which computes R2 (R5), that is, after the last EX stage of R2 <= Z (R5 <= C), possibly with forwarding.

[Figure: pipeline timing diagram of the compiled code. Annotations: RAW stalls (decoder occupied, data not available); busy decoder; the multiply waits for its operands; the decoder is freed by the previous addition instruction; the addition result is not yet ready, it becomes available at the end of its last execution stage; the issue is possible once the data for R1 and R2 have been produced.]

Compiler

But we can modify the emission order without modifying the result: 16 cycles instead of 22!

[Figure: pipeline timing diagrams before and after the reordering, with the stages Fetch (F), Decode (D), Issue (I), three Execute stages (E), Writeback (W). Annotations: waiting for R5; waiting for R6; busy decoder; emission possible since R1 and R2 are already available.]

Multicycle hazards

Let's suppose we have an FP adder (1 cycle, in red) and a multiplier (3 cycles, in green).

I1  F1 = F2 + F3
I2  F2 = F4 x F5
I3  F3 = F3 + F4
I4  F6 = F6 x F6
I5  F1 = F3 + F5
I6  F2 = F3 + F4

[Figure: execution of I1 … I6 over the clocks T1 … T10, and the dependency graph among I1, I2, I3, I5 and I6 with the edges WAR (F2), WAW (F2), WAR (F3), RAW (F3), RAW (F3), WAW (F1).]

NB: in this graph the hazards are potential, since only the registers are considered, no matter how many cycles the executions require
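The potential hazards of the six-instruction sequence can be enumerated mechanically by comparing destination and source registers only, as the note on the graph says. The sketch below is a plain pairwise comparison (the function name `potential_hazards` is illustrative); being pairwise, it may also report an edge that the figure leaves out because it is implied by others.

```python
# (name, destination, sources) for I1..I6 of the slide
PROGRAM = [
    ("I1", "F1", ("F2", "F3")),
    ("I2", "F2", ("F4", "F5")),
    ("I3", "F3", ("F3", "F4")),
    ("I4", "F6", ("F6", "F6")),
    ("I5", "F1", ("F3", "F5")),
    ("I6", "F2", ("F3", "F4")),
]

def potential_hazards(program):
    """Compare every earlier/later pair on register names only."""
    hazards = []
    for i, (first, d1, s1) in enumerate(program):
        for second, d2, s2 in program[i + 1:]:
            if d1 in s2:
                hazards.append(("RAW", first, second, d1))  # later reads what earlier writes
            if d2 in s1:
                hazards.append(("WAR", first, second, d2))  # later writes what earlier reads
            if d1 == d2:
                hazards.append(("WAW", first, second, d1))  # both write the same register
    return hazards

for hazard in potential_hazards(PROGRAM):
    print(hazard)
```

Whether any of these potential hazards actually occurs depends on the execution times of the adder and of the multiplier, which this check deliberately ignores.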

Dynamic instruction scheduling

• Systems with out-of-order execution but always in-order commitment

• Temporal dependencies (hazards) are not known at compile time

• It allows the same code to be executed on different pipelines and on superscalar processors with no implications for the compiler.

• It allows instructions to be executed ahead of their position (in the following case FSUB F12, F8, F14) if the conditions allow it

FDIV F0, F2, F4
FADD F10, F0, F8    (RAW - must wait for F0)
FSUB F12, F8, F14   (can be executed anyway)
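As a rough illustration of "executing ahead of position", the sketch below selects the instructions whose source registers are not waiting for a result still in flight. It is a deliberate simplification: only RAW dependencies are checked, and structural, WAR and WAW conditions are ignored. `ready_instructions` and `pending_writes` are hypothetical names.

```python
def ready_instructions(window, pending_writes):
    """window: (name, dest, sources) tuples in program order.
    pending_writes: registers whose producer has not delivered its result yet."""
    return [name for name, _dest, sources in window
            if not set(sources) & pending_writes]

# FDIV is still computing F0, so FADD must wait while FSUB may go ahead.
window = [
    ("FADD", "F10", ("F0", "F8")),
    ("FSUB", "F12", ("F8", "F14")),
]
print(ready_instructions(window, pending_writes={"F0"}))
```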

Scoreboard

• Consider the following sequence

FDIV F0, F2, F4
FADD F10, F0, F8
FSUB F8, F8, F14

Read After Write (RAW): FADD needs the F0 produced by FDIV. Write After Read (WAR): FADD and FSUB must read the same value of F8, which FSUB then overwrites.

• There is an antidependency (WAR hazard) between FADD and FSUB: should FSUB finish before FADD has read F8, an error would occur (F8 already updated)

• A Write After Write (WAW) hazard would be possible if FSUB used F10 instead of F8 as its destination (in case FSUB ended before FADD; but then FADD would probably be dropped by the compiler)

• "Scoreboard" technique: each instruction is executed as soon as possible, with the aim of completing one instruction per clock.

Scoreboard

[Figure: the scoreboard controls the registers and the functional units: two FP multipliers, an FP divider, an FP adder and an integer unit.]

The scoreboard is roughly equivalent to the ID stage (just after the fetch) and determines when an instruction can read its operands and start its execution. The scoreboard tracks all system state changes and decides when the first instruction in the FIFO queue (in the order produced by the compiler) can be started.

Scoreboard

• Obviously some stalls can be induced because the number of busses available for transfers is small

• The four stages, equivalent to ID, EX and WB in DLX, are:

1. Emission: if a functional unit for the instruction is available (free), the instruction is issued, unless another functional unit already holds an instruction which must write into the same destination register (therefore no WAW hazards). In the latter case the instruction is stalled, which blocks the emission of all the following instructions in the prefetch queue, even when all other conditions for them are met!

2. Operand read: the instruction has been emitted. If the operands are available and no already executing instruction has yet to write them, the operands are read; otherwise the instruction stalls in the functional unit

3. Execution: when the result has been computed and stored, the scoreboard is informed so as to unblock a possibly waiting instruction

4. Write result: in case of a possible WAR the instruction is stalled and does not write its result as long as there is a previous instruction which has not yet read its operands and one (or both) of them is the destination register of the considered instruction. Once the operands have been read, the result can be written

• It must be noticed that with this organisation forwarding is avoided, since the results are written as soon as produced (except for the WAR wait of point 4)

• The scoreboard technique allows instructions to pass directly from the EX stage to the WB stage (reducing the RAW risks).

An example

Hypothetical timing for the different instructions (including operand read and execution):

FLD          1 cycle
FADD, FSUB   2 cycles
FMUL        10 cycles
FDIV        40 cycles

LD   F6, 34(R2)           (integer unit)
LD   F2, 45(R3)           (integer unit)
FMUL F0, F2, F4   (MULTD)
FSUB F8, F6, F2   (SUBD)
FDIV F10, F0, F6  (DIVD)
FADD F6, F8, F2   (ADDD)

Hazards marked in the figure: RAW, WAR, RAW.

Can you find more potential hazards? For instance, what about F0?

Scoreboard entities

Instruction stages: emission, operand read, execution and writeback

Status of each functional unit (FU): 9 parameters

Busy     Unit busy
Op       Operation code presently executed
Fi       Instruction destination (result) register
Fj, Fk   Operand source registers
Qj, Qk   Functional units producing the required operands for the registers Fj and Fk (if not yet ready)
Rj, Rk   Flags (yes) indicating whether Fj, Fk have already been updated

Result status register: indicates which functional unit will write each register. Void when no functional unit is going to write the specific register

N.B. It must be remembered that in case of a possible WAW the emission of instructions is stalled (point 1 of the rules)

N.B. In the following example we suppose that two multiplication units and one division unit are available
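The nine parameters and the result status register map naturally onto a small data structure. The following is only a sketch of the bookkeeping done at emission time, under the assumption that the result status register is a dictionary from register name to producing unit; `FunctionalUnitStatus`, `can_issue` and `issue` are illustrative names, not part of the slides.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FunctionalUnitStatus:
    busy: bool = False
    op: Optional[str] = None   # operation being executed
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source register 1
    fk: Optional[str] = None   # source register 2
    qj: Optional[str] = None   # unit producing Fj, if not yet ready
    qk: Optional[str] = None   # unit producing Fk, if not yet ready
    rj: bool = False           # Fj ready?
    rk: bool = False           # Fk ready?

def can_issue(unit, dest, result_status):
    """Rule 1: the unit is free and nobody else is going to write `dest` (no WAW)."""
    return not unit.busy and result_status.get(dest) is None

def issue(unit_name, unit, op, dest, src_j, src_k, result_status):
    """Fill the unit status and book the destination register."""
    unit.busy, unit.op, unit.fi, unit.fj, unit.fk = True, op, dest, src_j, src_k
    unit.qj, unit.qk = result_status.get(src_j), result_status.get(src_k)
    unit.rj, unit.rk = unit.qj is None, unit.qk is None
    result_status[dest] = unit_name

# Emission of MULTD F0, F2, F4 while the integer unit is still loading F2
result_status = {"F2": "Integer"}
mult1 = FunctionalUnitStatus()
if can_issue(mult1, "F0", result_status):
    issue("Mult1", mult1, "Mult", "F0", "F2", "F4", result_status)
print(mult1)
```

The resulting entry (Qj = Integer, Rj = no, Rk = yes) corresponds to the Mult1 row shown at cycle 6 of the example that follows.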

Example (here we assume that F0 is a "normal" register and not always "0")

Clock 0

Instruction status (progression of the instruction states)
Instr. dest j    k    Issue  Read op  Ex complete  Write result
LD     F6   34   R2
LD     F2   45   R3
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Functional unit status (FU = Functional Unit)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer
      Mult1
      Mult2
      Add
      Divide

Time = number of clock cycles of execution yet to elapse; Fi = destination; Fj, Fk = source 1, source 2; Qj, Qk = FU for j, FU for k; Rj, Rk = Fj ready?, Fk ready?

Register result status: for each floating point result register (F0 F2 F4 F6 F8 F10 F12 ... F31), the functional unit producing its new value

Available units: 1 integer unit, 2 multiplication units, 1 add/sub unit, 1 division unit

Rj and Rk indicate whether the data can be read (possibly in the next cycle, if just produced) from the source registers of the instruction to be executed. Qj and Qk are the functional units which produce them (if not yet ready). Fj and Fk are the registers where the data produced by Qj and Qk are stored (or will be stored in the next clock cycle; the data are available if the corresponding Rj or Rk is yes), to be used by the executed instruction

NB: LD = FLD, MULTD = FMUL, SUBD = FSUB, DIVD = FDIV, ADDD = FADD

FLD          1 cycle
FADD, FSUB   2 cycles
FMUL        10 cycles
FDIV        40 cycles

Cycle 1

Instruction status (state changes are shown in brown in the original slides)
Instr. dest j    k    Issue  Read op  Ex complete  Write result
LD     F6   34   R2   1
LD     F2   45   R3
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6   -    R2   -        -        -    Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status (clock 1): F6 = Integer (the functional unit used for producing the result in F6)

At clock 1 the instruction state of LD F6, 34(R2) is Issue. R2 is supposed to be already available and can therefore be used in the next clock. LD uses the integer unit

Cycle 2

Instruction status
Instr. dest j    k    Issue  Read op  Ex complete  Write result
LD     F6   34   R2   1      2
LD     F2   45   R3
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6   -    R2   -        -        -    -
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status (clock 2): F6 = Integer

Data ready in R2: the instruction can proceed to execution

NB: the second LD cannot be emitted because the only integer unit is busy. The same applies to MULTD and the following instructions, because instructions must be emitted in order, although their functional units are free!

Cycle 3

Instruction status
Instr. dest j    k    Issue  Read op  Ex complete  Write result
LD     F6   34   R2   1      2        3
LD     F2   45   R3
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6   -    R2   -        -        -    -
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status (clock 3): F6 = Integer

Cycle 4

Instruction status
Instr. dest j    k    Issue  Read op  Ex complete  Write result
LD     F6   34   R2   1      2        3            4
LD     F2   45   R3
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F6   -    R2   -        -        -    -
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status (clock 4): F6 = Integer

The register F6 has been written at the end of the period, and the integer functional unit is freed at the end of the period

The status changes of the FUs indicate their value at the positive clock edge ending the current cycle (future status). For instance the integer functional unit is freed at the end of cycle 4, together with the result writeback. LD F6, 34(R2) disappears completely from the scoreboard at the positive clock edge concluding cycle 4.

Cycle 5

Instruction status
Instr. dest j    k    Issue  Read op  Ex complete  Write result
LD     F6   34   R2   1      2        3            4
LD     F2   45   R3   5
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2   -    R3   -        -        -    Yes
      Mult1    No
      Mult2    No
      Add      No
      Divide   No

Register result status (clock 5): F2 = Integer (the integer functional unit must produce a new value for F2)

At the beginning of cycle 5 the integer unit is already free, so LD F2, 45(R3) can be emitted and start. R3 is supposed to be already available, as in the previous case

Cycle 6

Instruction status
Instr. dest j    k    Issue  Read op  Ex complete  Write result
LD     F6   34   R2   1      2        3            4
LD     F2   45   R3   5      6
MULTD  F0   F2   F4   6
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2   -    R3   -        -        -    -
      Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
      Mult2    No
      Add      No
      Divide   No

Register result status (clock 6): F0 = Mult1, F2 = Integer

MULTD F0, F2, F4 can be issued because its FU is free and no other unit is going to write its destination register F0. F4 is supposed to be already present

MULTD waits for F2 from the integer unit!

Cycle 7

Instruction status
Instr. dest j    k    Issue  Read op  Ex complete  Write result
LD     F6   34   R2   1      2        3            4
LD     F2   45   R3   5      6        7
MULTD  F0   F2   F4   6
SUBD   F8   F6   F2   7
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2   -    R3   -        -        -    -
      Mult1    Yes   Mult  F0   F2   F4   Integer  -        No   Yes
      Mult2    No
      Add      Yes   Subd  F8   F6   F2   -        Integer  Yes  No
      Divide   No

Register result status (clock 7): F0 = Mult1, F2 = Integer, F8 = Add

MULTD is stalled in the execution unit because F2 is not yet ready.

SUBD F8, F6, F2 can start because the FP add/subtract unit is free (NB: the FP adder executes FP subtractions too). SUBD needs F2

Cycle 8

Instruction status
Instr. dest j    k    Issue  Read op  Ex complete  Write result
LD     F6   34   R2   1      2        3            4
LD     F2   45   R3   5      6        7            8
MULTD  F0   F2   F4   6
SUBD   F8   F6   F2   7
DIVD   F10  F0   F6   8
ADDD   F6   F8   F2

Functional unit status (Rj and Rk are updated at the end of the cycle)
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  Yes   Load  F2   -    R3   -        -        -    -
      Mult1    Yes   Mult  F0   F2   F4   -        -        Yes  Yes
      Mult2    No
      Add      Yes   Sub   F8   F6   F2   -        -        Yes  Yes
      Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 8): F0 = Mult1, F8 = Add, F10 = Divide

DIVD F10, F0, F6 can start because the FP divide FU is free. F0 is not yet available

F2 available! F2 is written, and therefore the integer unit is freed. The write of F2 allows MULTD and SUBD to read their operands during the next cycle

Cycle 9 - 10

Instruction status
Instr. dest j    k    Issue  Read op  Ex complete  Write result
LD     F6   34   R2   1      2        3            4
LD     F2   45   R3   5      6        7            8
MULTD  F0   F2   F4   6      9
SUBD   F8   F6   F2   7      9
DIVD   F10  F0   F6   8
ADDD   F6   F8   F2

Functional unit status
Time      Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
          Integer  No
10 clock  Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
          Mult2    No
2 clock   Add      Yes   Sub   F8   F6   F2   -        -        -    -
40 clock  Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 9-10): F0 = Mult1, F8 = Add, F10 = Divide

N.B.: MULTD and SUBD can read their operands because F2 is available (see cycle 8). DIVD is still stalled because of F0.

ADDD cannot start because SUBD is using the adder FU

Cycle 11

Instruction status
Instr. dest j    k    Issue  Read op  Ex complete  Write result
LD     F6   34   R2   1      2        3            4
LD     F2   45   R3   5      6        7            8
MULTD  F0   F2   F4   6      9
SUBD   F8   F6   F2   7      9        11
DIVD   F10  F0   F6   8
ADDD   F6   F8   F2

Functional unit status
Time           Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
               Integer  No
8 clocks more  Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
               Mult2    No
0              Add      Yes   Sub   F8   F6   F2   -        -        -    -
               Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 11): F0 = Mult1, F8 = Add, F10 = Divide

Note: the Add FU requires 2 cycles for the SUBD, therefore nothing happens in cycle 10 while MULTD is still processing its data

NB: ADDD will use the result of the SUBD, but it has not yet been started because of the SUBD itself (the FU is busy)

Cycle 12

Instruction status
Instr. dest j    k    Issue  Read op  Ex complete  Write result
LD     F6   34   R2   1      2        3            4
LD     F2   45   R3   5      6        7            8
MULTD  F0   F2   F4   6      9
SUBD   F8   F6   F2   7      9        11           12
DIVD   F10  F0   F6   8
ADDD   F6   F8   F2

Functional unit status
Time           Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
               Integer  No
7 clocks more  Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
               Mult2    No
               Add      No
               Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 12): F0 = Mult1, F10 = Divide

SUBD ends, freeing the FU: F8 is written and the ADD/SUB FU is freed. In the next period ADDD can start

FLD          1 cycle
FADD, FSUB   2 cycles
FMUL        10 cycles
FDIV        40 cycles

Cycle 13

Instruction status
Instr. dest j    k    Issue  Read op  Ex complete  Write result
LD     F6   34   R2   1      2        3            4
LD     F2   45   R3   5      6        7            8
MULTD  F0   F2   F4   6      9
SUBD   F8   F6   F2   7      9        11           12
DIVD   F10  F0   F6   8
ADDD   F6   F8   F2   13

Functional unit status
Time           Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
               Integer  No
6 clocks more  Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
               Mult2    No
               Add      Yes   Add   F6   F8   F2   -        -        Yes  Yes
               Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 13): F0 = Mult1, F6 = Add, F10 = Divide

Now ADDD can start because SUBD has finished its execution and has freed the FU

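Cycle 14

Instruction status changes: SUBD 7, 9, 11, 12 (completed); ADDD issued at 13, operands read at 14

Functional unit status
Time           Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
               Integer  No
5 clocks more  Mult1    Yes   Mult  F0   F2   F4   -        -        -    -
               Mult2    No
2 clocks more  Add      Yes   Add   F6   F8   F2   -        -        -    -
               Divide   Yes   Div   F10  F0   F6   Mult1    -        No   Yes

Register result status (clock 14): F0 = Mult1, F6 = Add, F10 = Divide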

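Cycle 15

ADDD requires two cycles, therefore there is no change in the system status

Instruction status so far: SUBD 7, 9, 11, 12; ADDD 13, 14

Time remaining: Mult1 4 clocks more; Add 1 clock more

Register result status (clock 15): F0 = Mult1, F6 = Add, F10 = Divide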

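Cycle 16

ADDD has ended its EX stage, while MULTD keeps executing and DIVD is still waiting for F0

Instruction status so far: SUBD 7, 9, 11, 12; ADDD 13, 14, 16

Time remaining: Mult1 3 clocks more

Register result status (clock 16): F0 = Mult1, F6 = Add, F10 = Divide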

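Cycle 17

NB! ADDD is stalled (it cannot write) because of a WAR with DIVD on F6. DIVD has not read F6 yet because it is waiting for F0 produced by MULTD (the operands are read in parallel). MULTD keeps executing and DIVD keeps waiting

Instruction status so far: SUBD 7, 9, 11, 12; ADDD 13, 14, 16 (write stalled because of the WAR on F6)

Time remaining: Mult1 2 clocks more

Register result status (clock 17): F0 = Mult1, F6 = Add, F10 = Divide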

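Cycle 18

MULTD is still executing; DIVD is still stalled

Instruction status so far: SUBD 7, 9, 11, 12; ADDD 13, 14, 16

Time remaining: Mult1 1 clock more

Register result status (clock 18): F0 = Mult1, F6 = Add, F10 = Divide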

Cycle 19

MULTD ends its execution (after 10 cycles) and will write in cycle 20, which will unblock DIVD and then ADDD

Instruction status so far: MULTD 6, 9, 19; SUBD 7, 9, 11, 12; ADDD 13, 14, 16

Register result status (clock 19): F0 = Mult1, F6 = Add, F10 = Divide

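Cycle 20

MULTD writes F0, unblocking DIVD

Instruction status so far: MULTD 6, 9, 19, 20; SUBD 7, 9, 11, 12; ADDD 13, 14, 16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      Yes   Add   F6   F8   F2   -        -        -    -
      Divide   Yes   Div   F10  F0   F6   -        -        Yes  Yes

Register result status (clock 20): F6 = Add, F10 = Divide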

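Cycle 21

DIVD reads both F0 and F6 (which could not be written by ADDD because of the WAR), unblocking ADDD, which can write F6 in the next cycle

Instruction status so far: MULTD 6, 9, 19, 20; SUBD 7, 9, 11, 12; DIVD 8, 21; ADDD 13, 14, 16

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      Yes   Add   F6   F8   F2   -        -        -    -
      Divide   Yes   Div   F10  F0   F6   -        -        -    -

Register result status (clock 21): F6 = Add, F10 = Divide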

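Cycle 22

Now ADDD can write F6, since the WAR hazard with DIVD has disappeared. For 6 cycles ADDD couldn't write F6 although its result was available

Instruction status so far: MULTD 6, 9, 19, 20; SUBD 7, 9, 11, 12; DIVD 8, 21; ADDD 13, 14, 16, 22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   Yes   Div   F10  F0   F6   -        -        -    -

Register result status (clock 22): F10 = Divide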

Cycle 61

DIVD execution ends after 40 cycles

Instruction status
Instr. dest j    k    Issue  Read op  Ex complete  Write result
LD     F6   34   R2   1      2        3            4
LD     F2   45   R3   5      6        7            8
MULTD  F0   F2   F4   6      9        19           20
SUBD   F8   F6   F2   7      9        11           12
DIVD   F10  F0   F6   8      21       61
ADDD   F6   F8   F2   13     14       16           22

Functional unit status
Time  Name     Busy  Op    Fi   Fj   Fk   Qj       Qk       Rj   Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
      Divide   Yes   Div   F10  F0   F6   -        -        -    -

Register result status (clock 61): F10 = Divide

Cycle 62

All executions ended

Instruction status
Instr. dest j    k    Issue  Read op  Ex complete  Write result
LD     F6   34   R2   1      2        3            4
LD     F2   45   R3   5      6        7            8
MULTD  F0   F2   F4   6      9        19           20
SUBD   F8   F6   F2   7      9        11           12
DIVD   F10  F0   F6   8      21       61           62
ADDD   F6   F8   F2   13     14       16           22

Functional unit status
Time  Name     Busy
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   No

Register result status (clock 62): no register is waiting for a functional unit
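The whole walkthrough can be cross-checked with a compact sketch that applies the four scoreboard rules in program order. It is a simplification under stated assumptions: one emission per clock, a unit becomes free in the clock after its instruction writes, a register written in clock t can be read from clock t+1, and bus limits are ignored. The names `scoreboard_schedule`, `LATENCY`, `UNIT_OF` and `UNIT_COUNT` are illustrative.

```python
# Hypothetical latencies and unit counts of the example
LATENCY = {"LD": 1, "MULTD": 10, "SUBD": 2, "ADDD": 2, "DIVD": 40}
UNIT_OF = {"LD": "Integer", "MULTD": "Mult", "SUBD": "Add", "ADDD": "Add", "DIVD": "Divide"}
UNIT_COUNT = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}

PROGRAM = [  # (op, destination, source registers)
    ("LD", "F6", ("R2",)),
    ("LD", "F2", ("R3",)),
    ("MULTD", "F0", ("F2", "F4")),
    ("SUBD", "F8", ("F6", "F2")),
    ("DIVD", "F10", ("F0", "F6")),
    ("ADDD", "F6", ("F8", "F2")),
]

def scoreboard_schedule(program):
    """Return one dict per instruction with its issue/read/exec/write clocks."""
    free_from = {kind: [1] * n for kind, n in UNIT_COUNT.items()}
    rows, last_issue = [], 0
    for op, dest, srcs in program:
        units = free_from[UNIT_OF[op]]
        u = units.index(min(units))  # the unit of this kind that frees first
        # 1. Emission: in order, unit free, no pending write to dest (WAW)
        issue = max(last_issue + 1, units[u])
        issue = max([issue] + [r["write"] + 1 for r in rows if r["dest"] == dest])
        # 2. Operand read: every producer of a source must have written (RAW)
        read = max([issue + 1] + [r["write"] + 1 for r in rows if r["dest"] in srcs])
        # 3. Execution complete
        done = read + LATENCY[op]
        # 4. Write result: earlier readers of dest must have read (WAR)
        write = max([done + 1] + [r["read"] + 1 for r in rows if dest in r["srcs"]])
        units[u] = write + 1  # the unit is free from the following clock
        last_issue = issue
        rows.append({"op": op, "dest": dest, "srcs": srcs, "issue": issue,
                     "read": read, "exec": done, "write": write})
    return rows

schedule = scoreboard_schedule(PROGRAM)
for r in schedule:
    print(f'{r["op"]:6}{r["dest"]:4}{r["issue"]:4}{r["read"]:4}{r["exec"]:4}{r["write"]:4}')
```

Processing the instructions in program order is enough here because every rule only looks at earlier instructions: structural and WAW conditions depend on earlier writes, RAW on earlier writes, WAR on earlier reads.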

Scoreboard limits

• An instruction can be emitted only if all previous instructions have been emitted

FDIV  F0, F2, F4
FADD  F6, F0, F8
FSTOR F6, 0(R1)
FSUB  F8, F10, F14
FMUL  F6, F10, F8

Hazards marked in the figure: RAW, WAR, WAW (for instance RAW on F0 between FDIV and FADD, WAR on F8 between FADD and FSUB, WAW on F6 between FADD and FMUL)

N.B. The hazards of the sequence are only potential: their occurrence depends on the instruction execution times

• Register values must in any case be read in parallel, and only from the register file (which means that they must already have been stored in the registers: no RAW problem)

Renaming – Tomasulo Algorithm

Tomasulo algorithm: "renaming" is based on the concept of "reservation stations", which are functional unit buffers where instructions can be «parked» while waiting for the availability of the requested FU and of the needed data.

«Renaming» indicates a location different from the RF where a requested datum is produced/stored and can be obtained. The name «renaming» is used because it is as if the source registers of an instruction were renamed

A reservation station is a slot of a FU where an instruction emitted from the instruction queue waits until the FU is free and the needed data arrive, as soon as they are produced (N.B. before being written in the RF). For each of its operands EITHER the source register data OR the reservation station producing it is indicated (whence renaming). The renaming occurs at run time

The following benefits occur:

• A reservation station captures a required operand exactly when and where it is produced, without waiting until it is written and thus avoiding the register file access. This is similar to forwarding

• When multiple writes to the same register occur (WAW, possible only if multiple busses between FUs and RF are available), only the data of the most recent instruction are written (for each register a TAG indicates the FU which has the right to write)

• Hazard detection and execution control are distributed (not centralised as in the scoreboard): only the information stored in the reservation stations of each functional unit determines whether an instruction can execute in the FU, since the source (where the data is being produced, if not yet in the RF) and NOT the RF is indicated. RAW hazards are no longer possible, since the requested data are provided as soon as they are produced. The same holds for WAR (data are read by the reservation stations while being written)

• Results are transferred directly to the reservation stations of the waiting FUs through the common data busses, without the need to read the RF (multiple reservation stations, in addition to the RF registers, can be accessed at the same time when multiple busses are available)

Tomasulo Algorithm

Tomasulo eliminates not only WAWs but also WARs

FLD  F6, 32(R2)
FLD  F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, F6
FDIV F10, F0, F6
FADD F6, F8, F2

Possible WAW: FLD F6 and FADD F6. Possible WAR: FDIV (reads F6) and FADD (writes F6).

After renaming (T = the functional unit producing the data):

FLD  [T/F6], 32(R2)
FLD  F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, [T]
FDIV F10, F0, [T]
FADD F6, F8, F2

NB: when an instruction is inserted in an RS, it is checked whether one or more of its operands are being produced elsewhere by other RS: if so, they are renamed

As far as the WAW between FLD and FADD on F6 is concerned, the mechanism guarantees that only the most recent instruction in the RS with a given destination register can write that register.

For the FADD a potential WAR with the FDIV could occur if FADD ended before FDIV had read its operands (F8 of FSUB and F2 of FLD were both immediately available for FADD), but since FDIV points for F6 to the RS of FLD F6, 32(R2) and not to the RF, the problem does not occur. The same holds for FSUB.
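The effect of renaming on this sequence can be illustrated with a generic sketch in which every destination receives a fresh tag and every source is replaced by the tag of its latest producer. This is a simplification: in Tomasulo the tags are the names of a limited set of reservation stations, whereas here the tags `T1`, `T2`, … are unlimited and purely hypothetical.

```python
from itertools import count

def rename(program):
    """program: list of (op, dest, sources) in program order."""
    latest_tag = {}           # architectural register -> tag of its latest producer
    fresh = count(1)
    renamed = []
    for op, dest, sources in program:
        # sources not produced inside the window keep their register name
        new_sources = tuple(latest_tag.get(s, s) for s in sources)
        tag = f"T{next(fresh)}"
        latest_tag[dest] = tag
        renamed.append((op, tag, new_sources))
    return renamed

PROGRAM = [
    ("FLD", "F6", ("R2",)),
    ("FLD", "F2", ("R3",)),
    ("FMUL", "F0", ("F2", "F4")),
    ("FSUB", "F8", ("F2", "F6")),
    ("FDIV", "F10", ("F0", "F6")),
    ("FADD", "F6", ("F8", "F2")),
]

renamed = rename(PROGRAM)
for row in renamed:
    print(row)
```

After renaming, FDIV reads the tag of the first FLD while FADD writes a different tag, so neither the WAR between FDIV and FADD nor the WAW between FLD and FADD on F6 can produce a wrong value.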

Tomasulo Algorithm

Very high performance without special compilers

Differences with the scoreboard

• Buffers and controls are distributed directly in the FUs (there is no centralized control): the buffers are called "reservation stations"

• Source register names are replaced by pointers to the buffers of the reservation stations (if the requested data are being produced there)

• "Renaming": a direct pointer to the sources and not to the register

• One or more Common Data Busses (CDB) for sending results to all FUs requiring them

• Loads and stores are considered as FUs (a STORE can also be a source for an RS executing a LOAD)

Tomasulo Algorithm

[Figure: structure of a Tomasulo machine, with the reservation stations in front of the FUs and the Common Data Bus carrying the data produced by the FUs.]

In this example it is assumed that the MUL unit executes the DIVs too and that the ADD unit executes the SUBs too. LOADs and STOREs are handled like the other instructions

In this example: 3 RS for add/sub, 2 RS for mult/div, 5 RS for store, 5 RS for load

In this example there is only one Common Data Bus. Please notice that the same Common Data Bus is also used by the RS waiting for data

Each RS (more than one for each FU) stores an emitted instruction and, for each operand, one of two elements: either the operand value (e.g. read from the RF) or the name of the RS which is producing it (renaming)

Tomasulo Algorithm

• Load buffers are used to store the load addresses

• Store buffers contain the computed addresses and the data to be written in memory

• Loads and stores must be executed in sequence if they refer to the same address. In the other cases it is possible to anticipate the LOADs (never the STOREs)

• In the figure there are 3 phases (each of which can last several clocks):

• Emission: the instructions are extracted in order from the general instruction queue when there is a free RS for the requested FU (the only condition); otherwise the instruction queue stalls. Operands are taken from the RF or from the producing FU, as indicated. In case of WAW it must be determined which instruction must provide the data

• Execution: if one or more operands are not yet available, the CDB(s) must be monitored (data must be transferred over a bus anyway) in order to catch them (and their sources) as soon as they are available: RAW hazards are therefore avoided (we are sure not to read stale data in the RF).

• Writeback: as soon as a datum is produced, it is transferred over one CDB (when more than one is available) to the RF and to the RS waiting for it.
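The way waiting reservation stations catch a result on the bus can be sketched as follows. The example assumes a single CDB, a register status table mapping each register to the station that will write it, and made-up operand values; `ReservationStation` and `broadcast` are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None   # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None     # producers still awaited
    qk: Optional[str] = None

    def ready(self):
        return self.busy and self.qj is None and self.qk is None

def broadcast(tag, value, stations, register_status, register_file):
    """Writeback: one result travels on the CDB to the waiting RS and to the RF."""
    for rs in stations:
        if rs.qj == tag:
            rs.vj, rs.qj = value, None
        if rs.qk == tag:
            rs.vk, rs.qk = value, None
    for reg, producer in list(register_status.items()):
        if producer == tag:      # only the latest producer may write the register
            register_file[reg] = value
            register_status[reg] = None

# MULTD and SUBD both wait for F2 from Load2 (operand values are invented)
mult1 = ReservationStation("Mult1", True, "Mult", vk=2.5, qj="Load2")
add1 = ReservationStation("Add1", True, "Sub", vj=1.5, qk="Load2")
register_status = {"F0": "Mult1", "F2": "Load2", "F8": "Add1"}
register_file = {}

broadcast("Load2", 7.0, [mult1, add1], register_status, register_file)
print(mult1.ready(), add1.ready(), register_file)
```

A single broadcast satisfies both waiting stations and the register file at once, which is why no separate read of the RF is needed.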

Tomasulo Algorithm

Let's see the scoreboard example in a Tomasulo architecture. Let's suppose that the execution times are the same as for the scoreboard, with "+1" for the writeback:

FLD          1+1 cycles
FADD, FSUB   2+1 cycles
FMUL        10+1 cycles
FDIV        40+1 cycles

LD   F6, 34(R2)
LD   F2, 45(R3)
FMUL F0, F2, F4
FSUB F8, F6, F2
FDIV F10, F0, F6
FADD F6, F8, F2

Reservation Station

Op: opcode of the instruction to be executed

Vj, Vk: the source operands, read either from the RF or from the FUs producing them. If blank, the datum is being produced by the corresponding Qj or Qk

Qj, Qk: functional units producing the operands. A blank indicates that the source operands are already in Vj or Vk, or that they are not required

Busy: the FU is busy

Register File Status: indicates which FU will write the register (if needed). A blank means that there are no instructions which must write the register, and therefore its value can be used directly

N.B. One instruction per clock is emitted from the general instruction queue when an RS of the FU for that instruction is available; otherwise the queue stalls. In our example we assume only one CDB.
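How these fields are filled at emission time can be shown with a short sketch. It assumes that the register file status is a dictionary from register name to producing station (absent or `None` when the register value is valid) and uses an invented value for F4; `issue_to_station` is a hypothetical helper name.

```python
def issue_to_station(station, op, dest, src_j, src_k, register_status, register_file):
    """Build a reservation-station entry following the field meanings above."""
    entry = {"Busy": True, "Op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
    for v, q, src in (("Vj", "Qj", src_j), ("Vk", "Qk", src_k)):
        producer = register_status.get(src)
        if producer is None:
            entry[v] = register_file[src]   # value already valid in the RF
        else:
            entry[q] = producer             # renaming: point to the producer
    register_status[dest] = station         # this station will write dest
    return entry

# Emission of MULTD F0, F2, F4 while Load1 and Load2 are still in flight
register_status = {"F6": "Load1", "F2": "Load2"}
register_file = {"F4": 2.5}                 # invented value
mult1 = issue_to_station("Mult1", "Mult", "F0", "F2", "F4",
                         register_status, register_file)
print(mult1)
print(register_status)
```

The entry has Vk filled from the RF and Qj pointing to Load2, the same situation in which MULTD is emitted in the worked example although F2 is not yet available.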

Cycle 0

Instruction status
Instr. dest j    k    Issue  Execution  Write
LD     F6   34   R2
LD     F2   45   R3
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Reservation stations (Vj, Vk = S1, S2; Qj, Qk = RS for j, RS for k)
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  No
0     Mult2  No

Register result status (clock 0): the FU producing the new value of each register; none at this point

Vj, Vk: operand registers. If blank, the datum is being produced by the FU in the corresponding Q field

Qj, Qk: producing FU. If blank, the datum is in the RF

NB. For LD (ST is not used here) there is a limited number of RS. Their busy status is displayed differently from that of the FUs (see next slide): loads and stores are not indicated in the status table

For the sake of simplicity the Rj and Rk flags (ready/not ready) of the scoreboard are not displayed, since their values are implicit in the status of Qj and Qk

FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles (NB. "+1" for the writeback)


Cycle 1

Instruction status
Instr. dest j    k    Issue  Execution  Write  Load RS  Busy  Address
LD     F6   34   R2   1                        Load1    Yes   34+R2
LD     F2   45   R3
MULTD  F0   F2   F4
SUBD   F8   F6   F2
DIVD   F10  F0   F6
ADDD   F6   F8   F2

Reservation stations
Time  Name   Busy  Op    Vj   Vk   Qj     Qk
0     Add1   No
0     Add2   No
      Add3   No
0     Mult1  No
0     Mult2  No

Register result status (clock 1): F6 = Load1

3 RS for add/sub, 2 RS for mul/div, 5 RS for the LOAD

NB: here it is assumed that R2 and R3 are already available


Execution WriteInstruction j k Issue BusyLD F6 34 R2 1LD F2 45 R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj QkAdd1 NoAdd2 NoAdd3 NoMult1 NoMult2 No


2 FU Load1

AddressLoad1 Yes 34+R2

5 RS for LOAD

2-2 Load2 Yes 45+R3

Load2

The second LD is emitted. One instruction per clock is emitted (when possible)

N.B. A second LOAD has been emitted (not possible with the scoreboard) and parked in the RS. The value of R3 is already available in the RF

NB: a Load takes 2 cycles: the first for computing the address and the second for reading the data


57


Execution WriteInstruction j k Issue BusyLD F6 34 R2 1 2--3 Load1 YesLD F2 45 R3 2 3- Load2 YesMULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj QkAdd1 NoAdd2 NoAdd3 NoMult1Mult2 No


3 FU Load2 Load1

Address34+R245+R3

Yet 10 cycles

LD takes two cycles

MULTD can be emitted although F2 is NOT yet available (F2 is renamed)

3

Yes Mult F4

Mult1

Load2

MULTD emitted (free RS)

Data assumed to be already in the RF


Cycle 4

The FUs execute both sums and subtractions

Instruction status Execution WriteInstruction j k Issue Busy

LD F6 34 R2 1 2--3LD F2 45 R3 2 3--4 Load2 YesMULTD F0 F2 F4 3SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj QkAdd1Add2 NoAdd3 NoMult1 Yes Mult F4 Load2Mult2 No


4 FU Mult1 Load2

Address

45+R3

Yet 3 cycles

Yet 10 cycles

4

The datum read from memory by LD F6 34(R2) is written both to the RF and to the RS of SUBD, which is waiting for it

4

Add1

Yes Sub F6 (captured on the fly) Load2

SUBD is emitted (RS free)

F6 is available in the RF at the end of the cycle

FU freed at the end of the clock cycle


59


Execution WriteInstruction j k IssueLD F6 34 R2 1 2--3 4LD F2 45 R3 2 3--4MULTD F0 F2 F4 3SUBD F8 F6 F2 4DIVD F10 F0 F6ADDD F6 F8 F2Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk3 Add1 Yes Sub F6 (capt.) F2 (capt)0 Add2 No

Add3 No10 Mult1 Yes Mult F2 (capt) F4

0 Mult2Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31

5 FU Mult1 Add1

Cycles yet to be executed for completing the execution

5

5

Yes Div F6 Mult1

Mult2

DIVD is emitted (RS free)

Wait for F0

FU freed

The datum read from memory by LD F2 45(R3) is written both to register F2 and to the RS of SUBD and MULTD, which are waiting for it


60

Cycle 6


Instruction status

Instruction    j    k    Issue  Execution complete  Write Result
LD     F6      34   R2   1      2--3                4
LD     F2      45   R3   2      3--4                5
MULTD  F0      F2   F4   3      6 --
SUBD   F8      F6   F2   4      6 --
DIVD   F10     F0   F6   5
ADDD   F6      F8   F2   6

Reservation Stations (S1, S2: Vj, Vk; RS for j, RS for k: Qj, Qk)

Time  Name   Busy  Op    Vj         Vk         Qj     Qk
2     Add1   Yes   Sub   F6 (capt)  F2 (capt)
      Add2   Yes   Add              F2         Add1
      Add3   No
9     Mult1  Yes   Mult  F2         F4
40    Mult2  Yes   Div              F6         Mult1

Register result status

Clock  F0     F2  F4  F6    F8    F10
6      Mult1          Add2  Add1  Mult2

Now MULTD can execute (F2 and F4 available). Mult2 (DIVD): yet 40 cycles

ADDD is emitted (RS free) and waits for F8

DIVD waits for F0



Execution WriteInstruction j k IssueLD F6 34 R2 1 2--3 4LD F2 45 R3 2 3--4 5MULTD F0 F2 F4 3 6 --SUBD F8 F6 F2 4DIVD F10 F0 F6 5ADDD F6 F8 F2 6Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk1 Add1 Yes Sub F6 (capt) F2 (capt)

Add2 Yes Add F2 Add1Add3 No



7 FU Mult1 Add2 Add1 Mult2

6 -- 7

SUBD (like ADDD) takes two cycles

ADDD is stalled waiting for SUBD (F8)

The datum in F6 will be overwritten by ADDD, but it has already been read and is present in the RS of DIVD

Yet 40 cycles



Execution WriteInstruction j k IssueLD F6 34 R2 1 2--3 4LD F2 45 R3 2 3--4 5MULTD F0 F2 F4 3 6 --SUBD F8 F6 F2 4 6 -- 7DIVD F10 F0 F6 5ADDD F6 F8 F2 6Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj QkAdd1 No

2 Add2 Yes Add F8 F2Add3 No




Yet 40

0

8

NB: SUBD ends before MULTD and allows ADDD (which captures the result F8) to start executing

FU freed


63


Execution WriteInstruction j k IssueLD F6 34 R2 1 2--3 4LD F2 45 R3 2 3--4 5MULTD F0 F2 F4 3 6 --SUBD F8 F6 F2 4 6 -- 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 9 --Reservation Stations S1 S2 RS for j RS for k






Yet 40

ADDD executing

Parallelism


64


Execution WriteInstruction j k IssueLD F6 34 R2 1 2--3 4LD F2 45 R3 2 3--4 5MULTD F0 F2 F4 3 6 --SUBD F8 F6 F2 4 6 -- 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6Reservation Stations S1 S2 RS for j RS for k



5 Mult1 Yes Mult F2 F4Mult2 Yes Div Mult1



9 -- 10

Two execution cycles

Yet 40 F6

Parallelism


65


Execution WriteInstruction j k IssueLD F6 34 R2 1 2--3 4LD F2 45 R3 2 3--4 5MULTD F0 F2 F4 3 6 --SUBD F8 F6 F2 4 6 -- 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 9 -- 10Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj QkAdd1 NoAdd2 NoAdd3 No



11 FU Mult1 Mult2

40

0

11

ADDD too ends before MULTD and DIVD

FU freed


Parallelism


66


Execution WriteInstruction j k IssueLD F6 34 R2 1 2--3 4LD F2 45 R3 2 3--4 5MULTD F0 F2 F4 3 6 --SUBD F8 F6 F2 4 6 -- 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 9 -- 10 11Reservation Stations S1 S2 RS for j RS for k




12 FU Mult1 Mult2

40

Waiting for the data produced by MULTD


Parallelism


67


Execution WriteInstruction j k IssueLD F6 34 R2 1 2--3 4LD F2 45 R3 2 3--4 5MULTD F0 F2 F4 3 6 -- 15SUBD F8 F6 F2 4 6 -- 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 9 -- 10 11Reservation Stations S1 S2 RS for j RS for k


1 Mult1 Yes Mult F2 F4Yet 40 Mult2 Yes Div F6 Mult1


15 FU Mult1 Mult2

Waiting for the data produced by MULTD

Parallelism


68


Execution WriteInstruction j k IssueLD F6 34 R2 1 2--3 4LD F2 45 R3 2 3--4 5MULTD F0 F2 F4 3 6 -- 15SUBD F8 F6 F2 4 6 -- 7 8DIVD F10 F0 F6 5ADDD F6 F8 F2 6 9 -- 10 11Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj QkAdd1 NoAdd2 NoAdd3 NoMult1 No

40 Mult2 Yes Div F0 F6Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31

16 FU Mult2

Now DIVD can execute

0

0

16

FU freed


69


Execution WriteInstruction j k IssueLD F6 34 R2 1 2--3 4LD F2 45 R3 2 3--4 5MULTD F0 F2 F4 3 6 -- 15 16SUBD F8 F6 F2 4 6 -- 7 8DIVD F10 F0 F6 5 17 -- 56ADDD F6 F8 F2 6 9 -- 10 11Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj QkAdd1 NoAdd2 NoAdd3 NoMult1 No

0 Mult2 Yes Div F0 F6Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31

56 FU Mult2

Parallelism

70

Cycle 57

Instruction status

Instruction    j    k    Issue  Execution complete  Write Result
LD     F6      34   R2   1      2--3                4
LD     F2      45   R3   2      3--4                5
MULTD  F0      F2   F4   3      6 -- 15             16
SUBD   F8      F6   F2   4      6 -- 7              8
DIVD   F10     F0   F6   5      17 -- 56            57
ADDD   F6      F8   F2   6      9 -- 10             11

Reservation Stations (S1, S2: Vj, Vk; RS for j, RS for k: Qj, Qk)

Time  Name   Busy  Op  Vj  Vk  Qj  Qk
      Add1   No
      Add2   No
      Add3   No
      Mult1  No
      Mult2  No

Register result status, clock 57: FU row empty
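The issue, execution and write cycles of this example follow from a simple rule, which the sketch below reproduces. It is a minimal Python model, not the full algorithm: it assumes (as happens to hold in this example) that an RS is always free at emission and that no two results compete for the single CDB. The names `LATENCY`, `Timing`, `schedule` and `PROGRAM` are illustrative.

```python
from dataclasses import dataclass

# Execution cycles per operation, as used in the walk-through.
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}


@dataclass
class Timing:
    issue: int
    exec_start: int
    exec_end: int
    write: int


def schedule(program):
    """One instruction is emitted per cycle; execution starts the cycle after
    both the emission and the write-back of every pending source operand."""
    producer = {}  # register -> Timing of the latest emitted instruction writing it
    timings = []
    for cycle, (op, dest, sources) in enumerate(program, start=1):
        pending = [producer[s].write for s in sources if s in producer]
        start = max([cycle] + pending) + 1
        end = start + LATENCY[op] - 1
        timing = Timing(cycle, start, end, end + 1)  # write-back takes one more cycle
        producer[dest] = timing                      # later readers wait for this producer
        timings.append(timing)
    return timings


PROGRAM = [
    ("LD", "F6", ["R2"]),
    ("LD", "F2", ["R3"]),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]

for (op, dest, _), t in zip(PROGRAM, schedule(PROGRAM)):
    print(op, dest, t.issue, t.exec_start, t.exec_end, t.write)
```

Because the producer of each source is looked up at emission time, DIVD keeps reading the F6 loaded by the first LD even though ADDD later overwrites F6.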


A demo can be found at

http://www.ecs.umass.edu/ece/koren/architecture/Tomasulo1/tomasulo_files/tomasulo.htm

Parallelism

Limits of Tomasulo Algorithm

73

• Interrupts are NOT precise

• Very complex

• Each CDB must be connected to each RS: complex cabling. Reducing the number of CDBs reduces efficiency

• If a single CDB is present, only one instruction per cycle can end

• Out-of-order instruction completion!
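The effect of the number of CDBs on instruction completion can be illustrated with a small arbitration sketch. This is a hedged Python example: the function `assign_writeback` and its oldest-first priority policy are assumptions made for illustration, not something stated in the slides.

```python
def assign_writeback(exec_end, n_cdb=1):
    """Return the write-back cycle of each instruction.

    exec_end: cycle in which each instruction finishes executing,
              listed in program order (oldest first).
    n_cdb:    number of common data buses; each carries one result per cycle.
    """
    used = {}  # cycle -> number of buses already taken in that cycle
    write = [0] * len(exec_end)
    # Serve instructions in order of completion; ties go to the older one (assumed policy).
    for i in sorted(range(len(exec_end)), key=lambda i: (exec_end[i], i)):
        cycle = exec_end[i] + 1
        while used.get(cycle, 0) >= n_cdb:
            cycle += 1  # bus busy: the result waits in the FU
        used[cycle] = used.get(cycle, 0) + 1
        write[i] = cycle
    return write
```

With three instructions all finishing in cycle 5, a single CDB serialises the write-backs over three cycles, while two CDBs need only two.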

Exceptions

74

• Exception/interrupt: non-programmed control transfer
  - The return address and all other information necessary to restore the interrupted situation must be saved
  - A «response» subroutine (handler) must be executed

• Two exception types: interrupt and trap

• Interrupts: external causes
  - The user program is interrupted and then restored
  - Asynchronous to the current process
  - Acknowledged at the end of the current instruction (if interrupts are enabled)
  - The handler is the responsibility of the user program

• Traps: internal causes
  - Exceptional conditions (overflow, division by zero etc.)
  - Errors (e.g. parity)
  - Page fault (or, see later, segment fault): data not available in memory
  - Synchronous to the current process
  - Handled by the operating system
  - An instruction can be interrupted during its execution (e.g. by a page fault) and therefore must be «restartable»
  - The executing program is normally temporarily aborted

Examples

75

Instruction Restart

Parallelism

Precise exceptions/interrupts

76

• Precise exceptions (interrupts): in-order instruction commitment

• Exceptions must be “precise”, that is, their behaviour must be the same as would occur in a “non-pipelined” architecture

• Precise: the machine status is saved as if the code had been executed up to the exception:
  - All preceding instructions must be terminated
  - All instructions following the instruction which provoked the exception must be handled as if they had never started
  - The same code must execute identically on different architectures

• Complex problem with pipelines, OOO execution (see later) etc.

• Scoreboard and Tomasulo have in-order emission but out-of-order execution (and therefore out-of-order termination)

78

• Automatic WAW avoidance

[Figure: FP Op Queue, ROB, FP Regs and two FP adders with their reservation stations]

Reorder Buffer (ROB)

Parallelism

• FIFO queue

• Stores pointers to all instructions in FIFO order as they are emitted. For the sake of simplicity we say that the instruction is virtually inserted in the ROB

• When instructions terminate, the results are stored in the ROB (instead of the RF), which also provides the operands to the other instructions that require them (renaming!)

• Easy “undo” of speculated instructions (see later), of erroneously predicted branches and of exceptions

• Commitment: the results of the instruction which has reached the top slot of the FIFO are transferred to the architectural registers (the registers which could be read by a test program)

79

Tomasulo again

Parallelism

80

Tomasulo in 4 steps

N.B. Sometimes several instructions can be committed simultaneously. If they have the same destination (unlikely, since the compiler would normally have dropped the first one), the result of the most recent instruction is used.

Parallelism

• Emission: an instruction is emitted from the instruction queue when an RS and a ROB slot are available. The RS records the sources of the operands and the ROB slot where the instruction will be “parked” after its execution (this phase is also called “dispatch”). The results are NOT written to the RF until the commitment phase. NB: the lack of either of the two conditions blocks the emission of the following instructions

• Execution: the operation is performed on the operands. Operands not yet available in the RF can be in the ROB (in this case the values computed by the nearest preceding instructions are used) or still being computed in the FUs. This phase is also called “issue”

• Result writeback: execution ends. The result is transmitted on the CDB to the RSs waiting for it and to the ROB

• Commitment: the architectural registers (or memory) are updated with the results stored in the ROB when the instruction is at the top of the ROB FIFO. In case of an erroneously predicted branch the ROB results are simply dropped. This phase is also called “graduation”

EMISSION IN ORDER
COMMITMENT IN ORDER
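The interplay of in-order emission, out-of-order writeback and in-order commitment can be sketched with a FIFO. This is a minimal Python illustration under stated assumptions: the class `ReorderBuffer`, its method names and the instruction tags are hypothetical, and reservation stations, the CDB and branches are left out.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self, size):
        self.size = size
        self.slots = deque()  # oldest instruction on the left (top of the FIFO)
        self.arch_regs = {}   # architectural register file

    def emit(self, tag, dest):
        """Reserve a slot in program order; return False (stall) if the ROB is full."""
        if len(self.slots) == self.size:
            return False
        self.slots.append({"tag": tag, "dest": dest, "value": None, "done": False})
        return True

    def write_result(self, tag, value):
        """Result writeback: the value goes into the ROB slot, not into the RF."""
        for slot in self.slots:
            if slot["tag"] == tag:
                slot["value"], slot["done"] = value, True

    def commit(self):
        """Commit terminated instructions from the top of the FIFO, in order."""
        committed = []
        while self.slots and self.slots[0]["done"]:
            slot = self.slots.popleft()
            if slot["dest"] is not None:
                self.arch_regs[slot["dest"]] = slot["value"]
            committed.append(slot["tag"])
        return committed
```

If a younger instruction terminates first, `commit()` returns nothing until the instruction at the top has terminated too; both are then committed together, which is what keeps the architectural state precise.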

Parallelism 81

HW with ROB

[Figure: FP Op Queue, Reorder Buffer, FP Regs, two FP adders with their reservation stations, Compar network]

• ROB is a circular queue

• ROB fields: Destination Register, Result, Exception?, Valid (terminated), Program Counter

• The Program Counter is used, for instance, for branches

82

Example

LD F0, 10(R2)       3 cycles
FADD F10, F4, F0    5 cycles
FDIV F2, F10, F6    20 cycles
BRNE F2, +100
LD F4, 0(R3)
FADD F0, F4, F6
ST 0(R3), F4

Parallelism

83

Tomasulo with ROB – cycle 1

[Figure: FP Op queue, ROB (slots ROB1 to ROB7, with ROB top and ROB end pointers), FP registers, reservation stations of the FP adders and FP multipliers (columns: ROB position, Cod. Op., Operands), memory unit M1 with the paths to and from memory]

FP Op queue:
LD F0, 10(R2)
FADD F10, F4, F0
FDIV F2, F10, F6
BRNE F2, +100
LD F4, 0(R3)
FADD F0, F4, F6
ST 0(R3), F4

ROB (columns: Source, Dest., Instruction, Completed?):
ROB1: Dest. F0, LD F0,10(R2), Completed? N

Memory slots: ROB position 1, address 10+R2

84

[Figure: same structure as the previous slide, after the emission of FADD]

Reservation stations: ROB position 2, FADD F10, F4, ROB1

ROB (columns: Source, Dest., Instruction, Completed?):
ROB1 (Top): Dest. F0, LD F0,10(R2), Ex
ROB2 (End): Source ROB1, Dest. F10, FADD F10, F4, F0 [ROB1], N

Memory slots (three slots for memory operations; a memory access takes 2 clocks): ROB position 1, address 10+R2

Renaming: the RAW dependence on F0 is resolved by pointing to ROB1. There can also be two ROB sources

85

[Figure: same structure as the previous slide, after the emission of FDIV]

Reservation stations:
ROB position 2: FADD F10, F4, ROB1
ROB position 3: FDIV F2, ROB2, F6

ROB (columns: Source, Dest., Instruction, Completed?):
ROB1 (Top): Dest. F0, LD F0,10(R2), Ex
ROB2: Source ROB 1, Dest. F10, FADD F10, F4, F0 [ROB1], N
ROB3 (End): Source ROB 2, Dest. F2, FDIV F2, F10 [ROB2], F6, N

Memory slots: ROB position 1, address 10+R2

86

32 FADD F10, F4, F06 FADD F0, ROB5, F6



FP Opqueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

F0 ROB5 FADD F0, F4 [ROB5], F6 N

F4 LD F4,0(R3) Ex

-- N

F2F10

ROB2

Completed and committed (F0)

N

Ex Top

End

5 0+R3


FADD F10, F4, F0

FDIV F2, F10 [ROB2], F6

FDIV F2, ROB2, F6

To memory

From memory

BRNE F2 [ROB3], +100

M1

F0(Updated by memory op ROB 1)

In cycle 4 (end of the first LD) FADD F10, F4, F0 started executing

Emitted in cycle 4 in parallel with LD F4, 0(R3)

Data captured on the fly. No longer present in the ROB


Parallelism

ROBPosition ROB

Position


Op. Operands

ROBPosition

Not yetcommitted

87

32 FADD F10, F4, F06 FADD F0, ROB5, F6



FP Opqueue

ROB7ROB6

ROB5

ROB4

ROB3

ROB2

ROB1

F0ROB5ROB5

ST 0(R3), F4[ROB5]

FADD F0, F4 [ROB5], F6 N

F4 LD F4, 0(R3) Ex

-- N

F2F10

ROB2 N

Ex Top

End

5 0+R3


FADD F10, F4, F0

FDIV F2, F10[ROB2], F6

FDIV F2, ROB2, F6

To memory

From memory

FP registers

BRNE F2 [ROB3], +100

M1

F0

NB: ST can start its execution when LD F4, 0(R3) has terminated its execution, NOT when it is committed



Parallelism

N

ROBPosition ROB

Position


Op. Operands

ROBPosition

(Updated by memory op ROB 1)

ROB3

88

Register Renaming

• But when an emitted instruction must use a register, where can it be found? In the ROB or in the RF? The entire ROB would have to be searched for the most recent slot (if any) whose destination is the required register: the instruction should point either to that slot or, if there is none, to the RF. A complex and slow procedure

• Solution: use a number of physical registers greater than the number of architectural registers (the registers known to the assembly language programmer, i.e. the ISA) and keep a pointer to the most recent one (possibly not yet architectural)

• Whenever an instruction inserted in the ROB must write a register (e.g. F17), it points to a new physical register associated with that register (F17), where the result will be temporarily stored. Any following instruction which must use the register (F17) will use that physical register

• At each commitment the pointer to the architectural register is moved to the physical register linked to the committed instruction. When a new instruction writing the same architectural register is committed, the pointer is changed again (and the physical register previously embodying the architectural register is freed)
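The pointer to the most recent physical register can be kept in a small mapping table, so that no search of the ROB is needed. The following Python sketch assumes a single shared pool of physical registers; the class `Renamer`, the `P0`, `P1`, ... naming and the stall signalled by `None` are illustrative choices, and the freeing of registers at commitment is not modelled.

```python
class Renamer:
    def __init__(self, arch_regs, n_physical):
        # Initially architectural register number i lives in physical register Pi.
        self.rat = {reg: f"P{i}" for i, reg in enumerate(arch_regs)}
        self.free = [f"P{i}" for i in range(len(arch_regs), n_physical)]

    def rename(self, dest, sources):
        """Return (physical_dest, physical_sources), or None to signal a stall."""
        if not self.free:
            return None  # no physical register available: the emission stalls
        # Sources are looked up BEFORE the destination mapping is updated,
        # so an instruction such as FMUL F4, F4, F4 reads the previous F4.
        phys_sources = [self.rat[s] for s in sources]
        phys_dest = self.free.pop(0)
        self.rat[dest] = phys_dest  # later readers of `dest` use the new register
        return phys_dest, phys_sources
```

Two successive writers of the same architectural register receive different physical registers, which is what removes the WAW and WAR hazards.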

89

An example with R2

Circular queue of register R2: R2-0, R2-1, R2-2, R2-3, R2-4, R2-5, R2-6, R2-7, R2-8

Let's suppose that R2-2 and R2-3 are already occupied by previous, not yet committed instructions. When LD R2, 10(R5) is emitted, the pointer to the first free register of R2 points to R2-4

LD R2, 10(R5) ; R2-4 (destination): first free physical register associated with R2

Parallelism

• When LD R2, 10(R5) is emitted, register R2-4 is given to it as destination; R2-4 will then be used by MUL (as soon as the new datum is computed). Now R2-2, R2-3 and R2-4 are «busy» and the first free register is R2-5. R2-2, R2-3 and R2-4 will be freed as soon as the related instructions end. If the commitment is “in-order” all hazards disappear. R2-1 is the architectural R2 register. R2-2 will become the architectural R2 register at the commitment of the related instruction. The busy registers are freed when no longer needed

• No more distinction between register file and ROB locations. Normally there are 40-120 physical registers

Architectural register: R2-1, then R2-2 (at the commitment of the instruction using R2-2)

MUL R8, R2, R5 ; R2-4 (source)
RADD R2, R9, R6 ; R2-5 (destination)
DIV R2, R2, R10 ; R2-6 (destination) and R2-5 (source)
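The R2 example can be traced with a tiny model of one circular queue per architectural register. This is a sketch under assumptions: the class `RegisterQueue` is hypothetical, the queue holds the nine copies R2-0 to R2-8, and the check that the next copy has already been freed is omitted.

```python
class RegisterQueue:
    """Circular queue of physical copies of one architectural register."""

    def __init__(self, name, size, newest):
        self.name = name
        self.size = size
        self.newest = newest  # index of the most recently allocated copy

    def source(self):
        """Copy that an instruction reading the register must use."""
        return f"{self.name}-{self.newest}"

    def allocate(self):
        """Give the next copy to an instruction writing the register."""
        self.newest = (self.newest + 1) % self.size
        return self.source()


r2 = RegisterQueue("R2", size=9, newest=3)  # R2-2 and R2-3 already in use
print(r2.allocate())               # destination of LD R2, 10(R5)
print(r2.source())                 # source of MUL R8, R2, R5
print(r2.allocate())               # destination of RADD R2, R9, R6
print(r2.source(), r2.allocate())  # source and destination of DIV R2, R2, R10
```

For DIV the source is read before the new destination is allocated, which is why the same instruction can use one copy as input and the next one as output.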

90

HW support for register renaming

• If no physical register is available (in the circular queue) the instruction is stalled. There is also no emission if no free slot in the ROB or no RS is available

• Free/busy register table. Two solutions: one pool of physical registers for all architectural registers, or one pool for each architectural register

• Fast mapping between architectural and physical registers (at run time)

• Large number of physical registers

ROB «without Tomasulo»

Parallelism 91

• Instructions are emitted as soon as a free slot in the ROB and a physical destination register are available, using register renaming

• For each FU there is a virtual queue whose slots point to the ROB slots which require that FU

• The instructions of this queue are executed as soon as the required operands are available

• When two instructions are ready for execution, the FIFO rule applies (so as to speed up the commitment, which is always in order)
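The choice of the next instruction for a given FU can be sketched as a scan of the ROB from the oldest slot. This is a minimal Python illustration; the function `select` and the dictionary fields `tag`, `fu`, `started` and `ready` are assumptions introduced for the example.

```python
def select(rob, fu):
    """Pick the oldest not-yet-started ROB entry that needs `fu` and has all
    its operands available. `rob` is ordered oldest first (FIFO rule)."""
    for entry in rob:
        if entry["fu"] == fu and not entry["started"] and all(entry["ready"]):
            return entry["tag"]
    return None  # nothing can start on this FU in this cycle


rob = [
    {"tag": "I1", "fu": "mul", "started": False, "ready": [True, False]},
    {"tag": "I2", "fu": "mul", "started": False, "ready": [True, True]},
    {"tag": "I3", "fu": "mul", "started": False, "ready": [True, True]},
    {"tag": "I4", "fu": "add", "started": False, "ready": [True, True]},
]
print(select(rob, "mul"))
```

Here I1 is older but still waits for an operand, so I2 is chosen ahead of I3: among the ready instructions the oldest one wins.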

92

ROB and speculation

• Dynamic instruction execution with precise interrupts, which are checked at instruction commitment (always in order)

• Cancellation of speculative instructions when a branch is erroneously predicted

The prediction error must be detected ASAP. The cancellation of the erroneously executed post-branch instructions allows the preceding instructions to keep executing. The erroneously executed instructions have not yet been committed

Detecting the prediction error early avoids the execution of useless instructions (sometimes very time-expensive). It must be remembered that not only is the ROB flushed, but all the instructions already in the pipeline are cancelled too

A separate Return Stack Buffer is needed for speculative calls (otherwise the stack could be damaged). It is a separate stack whose content is copied onto the stack if the branch has been correctly predicted as taken. All instructions following a not yet committed branch use this stack. In case of misprediction the RSB content is cancelled
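The cancellation of the instructions that follow a mispredicted branch amounts to cutting the ROB right after the branch. The following Python sketch is only an illustration: the function `squash_after` is hypothetical, the ROB is reduced to a list of tags, and the release of physical registers and pipeline stages is not modelled.

```python
def squash_after(rob, branch_tag):
    """Split the ROB at a mispredicted branch.

    rob: list of instruction tags, oldest first.
    Returns (kept, squashed): the branch and everything older keep executing
    and will commit in order; everything younger is dropped uncommitted.
    """
    idx = rob.index(branch_tag)
    return rob[: idx + 1], rob[idx + 1 :]


kept, squashed = squash_after(["FDIV", "BRNE", "LD", "FADD", "ST"], "BRNE")
print(kept, squashed)
```

Since the squashed instructions never reached the top of the FIFO, none of their results has touched the architectural registers or memory.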

93

FLD F4, 0(R10)
FDIV F8, F0, F4
FMUL F4, F2, F3
FMUL F4, F4, F4
FADD F6, F10, F4
FLD F4, 0(R5)

Dependencies marked in the figure: RAW, WAW, RAW, WAW, RAW

Example - 1

Parallelism

Same execution times as in the previous Tomasulo example

94

F10F8F6F4F2

FU

Instruction status Exe.Instruction j k Issue Compl. Busy

Load1Load2

Address

Load2

WriteResult

Store 1Store 2


Time Name Busy Op Vj Vk Qj Qk

Register result status

Clock 0

Add1Add2Add3Mult1Mult2

FLD F4 0 R10FDIV F8 F0 F4FMUL F4 F2 F3FMUL F4 F4 F4FADD F6 F10 F4FLD F4 0 R5

Parallelism

Tomasulo without ROB and with renaming (reservation stations). The multiplication FUs execute the divisions too.

Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

95

Load1

F10F8F6F4F2

R10yes1

FU


Load1Load2

Address

Load2

WriteResult

Store 1Store 2




Clock 1



Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

CLOCK 1

96

Mult1

F10F8F6F4F2

F0divyes

2R10yes2-1

FU


Load1Load2

Address

Load2

WriteResult

Store 1Store 2




Clock 2


Load1

Load1



CLOCK 2

97

Mult1

F10F8F6F4F2mulyes

F0divyes

32

R10yes2-31

FU


Load1Load2

Address

Load2

WriteResult

Store 1Store 2




Clock 3


Mult2

Load1F2 F3



CLOCK 3

98

Mult1

F10F8F6F4F2mulyes

F0divyesyet 9 cycles

32

2-31

FU


Load1Load2

Address

Load2

WriteResult

Store 1Store 2




Clock 4


Mult2

F2 F3

4-

4FLD F4 0 R10FDIV F8 F0 F4FMUL F4 F2 F3FMUL F4 F4 F4FADD F6 F10 F4FLD F4 0 R5

Stalled for lack of a free RS until cycle 13 (end of the preceding multiplication; only two slots in the multiply FU), blocking the emission of FADD, which could be executed since there are two free slots in the corresponding RS.

4-

yet 39 cycles F4

Three RS for LOAD, 2 for STORE, 2 for MUL/DIV

CLOCK 4

99

80000000: FLD F4, 0(R10)
80000004: FDIV F8, F0, F4
80000008: FMUL F4, F2, F3
8000000C: FMUL F4, F4, F4
80000010: FADD F6, F10, F4
80000014: FLD F4, 0(R5)

Dependencies marked in the figure: RAW, WAW, RAW, WAW, RAW

ROB and register renaming.

The instructions are in any case inserted in the ROB when a free slot and a physical register (one of the many associated with the same architectural register) are available, and are then executed when the FU and the operands are available (the policy of all modern processors). In this way instructions are not only terminated OOO (with the results reordered in the ROB) but also emitted even if the FU is not available. The execution is totally OOO, but with an in-order commitment

Example - 2

Same instruction stream

Parallelism

100

Addr Op. Des Sorg

P0 P1

Free Free. Free Free Free Arch

P2 P3 P4 P5

F4

ROB RAT

Q0 Q1

Busy Free. Free Free Free Arch

Q2 Q3 Q4 Q5

F6

Z0 Z1

Busy Free Free Free Free Arch

Z2 Z3 Z4 Z5

F8

Initial situation

Renaming registers for F4, F6 and F8

12345

Parallelism

Register Allocation Table

Top free registers of the circular queues

These are the architectural registers which a program monitor would display

These are registers in use by not yet committed instructions. They will become architectural registers when the related instructions are committed

Here we assume that the instruction using Z0 precedes the instruction using Q0. The RAT for R5, R10, F0, F2, F10 is not displayed

102

R10yes1


Load1Load2

AddressWrite

Result

Load3Store 1Store 2


Clock 1



Parallelism

CLOCK 1

0,R10P0FLD80000000

Addr Op. Des Sorg

P0 P1

Busy Free. Free Free Free Arch

P2 P3 P4 P5

Q0 Q1


Q2 Q3 Q4 Q5

F6

Z0 Z1


Z2 Z3 Z4 Z5

F8

ROB RAT

12345

Renaming: the first available register for F4 is used

F4

103

Mult2

F0divyes

2R10yes2-1


Load1Load2

Address

Load3

WriteResult

Store 1Store 2


Clock 2

Add1Add2Add3Mult1 P0


Parallelism

CLOCK 2

F0,P0Z1FDIV80000004

0,R10P0FLD80000000

Addr Op. Des Sorg

P0 P1


P2 P3 P4 P5

Q0 Q1


Q2 Q3 Q4 Q5

F6

Z0 Z1

Busy Busy Free Free Free Arch

Z2 Z3 Z4 Z5

F8

Most recently attributed physical register for F4

12345

ROBRAT

Renaming

F4

104

mulyesF0divyes

32

R10yes2-31


Load1Load2

Address

Load3

WriteResult

Store 1Store 2


Clock 3


P0F2 F3


F2,F3P1FMUL80000008

F0,P0Z1FDIV80000004

0,R10P0FLD80000000

Addr Op. Des Sorg

P0 P1

Busy Busy. Free Free Free Arch

P2 P3 P4 P5F4

ROBRAT

Q0 Q1


Q2 Q3 Q4 Q5

F6

Z0 Z1

Arch Busy Free Free Free Free

Z2 Z3 Z4 Z5

F8

12345

Parallelism

CLOCK 3

waiting for F4 (P0)

Previous instruction using Z0 has been committed

Z0 is now the architectural register

P0

Op Vj Vk Qj Qk

105

mulyesF0divyes

Yet 9 cycles

32

2-31


Load1Load2

AddressWrite

Result

Time Name Busy Op

Clock 4

Add1Add2Add3Mult1Mult2 F2 F3

4-

4FLD F4 0 R10FDIV F8 F0 F4FMUL F4 F2 F3FMUL F4 F4 F4FADD F6 F10 F4FLD F4 0 R5

4

Yet 39 cycles

4-

Not yet executable, but inserted in the ROB all the same

It does not block the emission of the following instructions

Parallelism

Load3Store 1Store 2

Integer

Load2Store 1Store 2

Qk

P1,P1P2FMUL8000000C

F2,F3P1FMUL80000008

F0,P0Z1FDIV80000004

0,R10P0FLD80000000

Addr Op. Des Sorg

P1P0

Busy Busy. Busy Free Free Arch

P2 P3 P4 P5F4

ROBRAT

Q0 Q1

Arch Free Free Free Free Free

Q2 Q3 Q4 Q5

F6

Z0 Z1


Z2 Z3 Z4 Z5

F8

12345

Ended but not yet committed!

CLOCK 4

Instruction using Q0 has ended its execution

Q0 is now the architectural register

106


mulyesF0divyes

Yet 8 cycles

32

2-31


Load1Address

WriteResult

Time Name Busy Op Vj Vk

Clock 5


P0

F2 F3

4-

4

4

5

yes add F10 P2

Yet 38 cycles

Load23-

Parallelism

Load3Store 1Store 2

CLOCK 5

waiting for F4 (P1)

Qj Qk

Integer

Load2Store 1Store 2

Qk

F10,P2Q1FADD80000010

P1,P1P2FMUL8000000C

F2,F3P2FMUL80000008

F0,P1Z1FDIV80000004

Addr Op. Des Sorg

P1P0

Arch Busy. Busy Busy Free Free

P2 P3 P4 P5

F4ROBRAT

Q0 Q1


Q2 Q3 Q4 Q5

F6

Z0 Z1


Z2 Z3 Z4 Z5

F8

12345

FLD committed: the architectural register F4 is now P0
