computer architectures m
1
Parallel architectures
Computer Architectures M
Parallelism
2
Architecture
•Architecture: the functional behaviour of a computer, for instance a processor which executes DLX code
The architecture is defined by the machine language, that is the instruction set (assembly language): Instruction Set Architecture -> ISA
•Implementation: a logical network implementing the architecture, also called microarchitecture. There are many implementations of the same architecture. Example: the x86 family
•Synthesis: a physical implementation. There are many possible syntheses of the same implementation (for instance in different technologies)
The ISA varies slowly while the implementations change rapidly (see for instance IA8, IA16, IA32…). The longer an ISA survives, the more programs are written for it, and therefore compatibility becomes the main issue.
3
Parallelism
Instruction level parallelism
• Sequential: a single instruction executed at a time
• Pipelined: multiple instructions executed simultaneously
• Superpipelined: multiple stages for each operation (EX, MEM etc.) in order to increase the clock frequency (e.g. Pentium IV)
• Scalar: a single pipeline
• Superscalar: multiple pipelines; many instructions started at the same time. Possible Out Of Order execution (run time decision)
• Very Long Instruction Word: multiple pipelines; many instructions started at the same time. Instruction order decided at compile time
• Superscalar superpipelined (e.g. Pentium IV, I5, I7 etc.)
4
Parallelism architectures
• Memory level parallelism: a memory able to provide multiple data at different addresses at the same time (outstanding requests: DDR2, DDR3 etc.)
• Multicore (core level parallelism): many processors in the same chip (e.g. Core Duo, Nehalem, Sandy Bridge…)
• Multithread (thread level parallelism): the pipelines of the same processor are used by different threads at the same time (time sharing), as if it were a multicore (e.g. Pentium IV, Nehalem, Sandy Bridge etc.)
5
Deep Pipeline (Superpipeline)
[Figure: a five-stage pipeline (Fetch, Decode, Execute, Memory, Writeback) compared with its superpipelined version; the branch penalty is marked on both]
•Each stage is subdivided into three substages: higher clock frequency but higher branch penalty
•Higher power consumption!
6
Parallel pipelines
Sequential
Time parallelism: pipeline
Space parallelism: VLIW
Space-time parallelism (e.g. I5, I7…)
7
Diversified pipelines - 1
[Figure: stages IF, ID, RD feed four dedicated EX pipelines (ALU; MEM1-MEM2; FP1-FP2-FP3; BR), followed by WB. F => Floating]
Multi-instruction buffer to avoid blocking the pipelines.
Dedicated pipelines. The instruction sequence is defined at compile time. Careful compilation is fundamental in order to avoid underexploiting the pipelines.
Problem of different execution times
Problem of instruction interdependency
8
Diversified pipelines - 2
[Figure: stages IF, ID, RD («in order» execution), then a Dispatch Buffer feeding the EX pipelines ALU, MEM1-MEM2, FP1-FP2-FP3 and BR («out of order» execution), then a Reorder Buffer and WB («in order» retirement)]
9
Floating Point DLX – F instructions
[Figure: IF and ID feed four parallel execution units, followed by MEM and WB]
Multicycle stages:
Integer: EX
FP Multiply: M1 M2 M3 M4 M5 M6 M7 (pipelined)
FP Add: A1 A2 A3 A4 (pipelined)
FP/Int. Divide: e.g. 24 clock cycles, one instruction executed at a time
10
DLX revisited
• Example (no interdependency between the instructions in this sequence):
FMUL F1, F2, F2
FADD F3, F4, F5
FLD F6, 10(R8)
FST 40(R10), F9

FMUL  IF  ID  M1  M2  M3  M4  M5  M6  M7  MEM WB
FADD      IF  ID  A1  A2  A3  A4  MEM WB
FLD           IF  ID  EX  MEM WB
FST               IF  ID  EX  MEM (WB)

[Figure legend: in violet the stages where the operands are needed, in green the stages where new results are produced. Labels: data written; data required for computing the address; nop; same destination register: write sequence error]
• Because of the different execution times of the instructions, Read After Write (RAW) hazards are more frequent
• Very important structural change (more intermediate registers, a more complex ID stage to send each instruction to the appropriate execution stage)
• Hazard problems: the instructions do not end in the order of their issue
• Since the divider is normally a single functional unit, stalls of up to 40 clocks may occur in this case
• Multiple instructions at the same time in the same stage (in particular in WB)
• Write After Write (WAW) hazards: e.g. if a FADD F6, F4, F5 (four EX cycles) directly preceded a FLD F6, 10(R8) (one EX cycle), although in this case the FADD would have been dropped by the compiler since it is useless
• Instructions are not completed in order
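The out-of-order completion of this example can be checked with a few lines of Python. This is only a sketch: it assumes one instruction fetched per clock, no stalls, and the EX lengths read from the diagram (FMUL 7, FADD 4, FLD and FST 1); the names `EX_CYCLES`, `wb_cycle` and `completion_order` are hypothetical.

```python
# EX-stage lengths taken from the slide diagram (assumption: no stalls occur).
EX_CYCLES = {"FMUL": 7, "FADD": 4, "FLD": 1, "FST": 1}

def wb_cycle(fetch_clock, opcode):
    """Clock of the WB stage: IF, then ID, then the EX stages, then MEM and WB."""
    return fetch_clock + 1 + EX_CYCLES[opcode] + 2

def completion_order(program):
    """Return (opcode, WB clock) pairs sorted by completion time."""
    timed = [(op, wb_cycle(clock, op)) for clock, op in enumerate(program, start=1)]
    return sorted(timed, key=lambda pair: pair[1])

order = completion_order(["FMUL", "FADD", "FLD", "FST"])
print(order)  # [('FLD', 7), ('FST', 8), ('FADD', 9), ('FMUL', 11)]
```

Under these assumptions FMUL, the first instruction issued, is the last one to complete.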
Red squares: execution
11
DLX revisited
• For WAW hazards consider the following example

FMUL F0, F4, F6   IF  ID  M1  M2  M3  M4  M5  M6  M7  MEM WB
…                     IF  ID  EX  MEM WB
…                         IF  ID  EX  MEM WB
FADD F2, F4, F6               IF  ID  A1  A2  A3  A4  MEM WB
…                                 IF  ID  EX  MEM WB
…                                     IF  ID  EX  MEM WB
FLD F2, 0(R2)                             IF  ID  EX  MEM WB

If FADD were started one clock later, a Write After Write hazard would take place!
Multiple RF write operations
• To cope with multiple simultaneous write operations to different registers, either the number of input ports of the RF is increased (expensive) or stalls are introduced (normally in the MEM or WB stages, so as to choose the instructions to be stalled). More complex pipelines
• RAW hazards are solved through forwarding
Normally the hazards are detected in the ID stage, considering the preceding and following instructions, so as to introduce the required stalls (in this case FLD would have been stalled one clock)
Hazards normally occur among homogeneous registers (FP or integer), except for FLOAD and FSTOR, which use integer registers for address computation
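Both observations (several instructions in WB in the same clock, and the WAW that appears when FADD is delayed by one clock) can be reproduced with a small Python sketch. It assumes the fetch clocks 1, 4 and 7 read from the diagram and no stalls; the unnamed instructions in between are left out because their destinations are not given. `analyse` and `wb_clock` are hypothetical names.

```python
EX_CYCLES = {"FMUL": 7, "FADD": 4, "FLD": 1}  # EX lengths from the diagram

def wb_clock(fetch_clock, opcode):
    """Clock of the WB stage when no stall occurs."""
    return fetch_clock + 1 + EX_CYCLES[opcode] + 2

def analyse(program):
    """program: (fetch clock, opcode, destination) in program order.
    Returns (clocks with more than one register write, WAW violations)."""
    writes = [(wb_clock(clock, op), op, dest) for clock, op, dest in program]
    by_clock = {}
    for wb, op, _ in writes:
        by_clock.setdefault(wb, []).append(op)
    conflicts = {wb: ops for wb, ops in by_clock.items() if len(ops) > 1}
    # WAW violation: an earlier instruction writes the same register later.
    waw = [(op_a, op_b, dest_a)
           for i, (wb_a, op_a, dest_a) in enumerate(writes)
           for wb_b, op_b, dest_b in writes[i + 1:]
           if dest_a == dest_b and wb_a > wb_b]
    return conflicts, waw

as_drawn = [(1, "FMUL", "F0"), (4, "FADD", "F2"), (7, "FLD", "F2")]
delayed = [(1, "FMUL", "F0"), (5, "FADD", "F2"), (7, "FLD", "F2")]
print(analyse(as_drawn))  # three writes in clock 11, no WAW
print(analyse(delayed))   # FADD now writes F2 after FLD: WAW
```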
12
DLX revisited
• How can we guarantee that the final result is that of the program?
• In the previous case FLD F2, 0(R2) must be stalled until FADD F2, F4, F6 has reached the MEM stage. It must however be assumed that between the two instructions there is at least one that uses, through forwarding, the result of FADD F2, F4, F6; otherwise the compiler would have dropped the instruction!
• The situation would have been even worse if FLD had completed before the FADD.
• In any case it is always possible that different instructions complete in an order different from that of their issue
Compiler
13
Let’s consider these high level language statements
X = Y + Z
A = B * C
to be executed in a processor with the following pipeline
Fetch (F), Decode (D), Issue (I), Execute (E, E, E), Writeback (W)
In-order emission: the issue of the addition (multiplication) is possible only AFTER the execution of the previous instruction calculating R2 (R5), that is after the last EX stage of R2 <= Z (R5 <= C), possibly with forwarding
[Figure: pipeline timing diagram. Annotations: busy decoder; stalls because the decoder is occupied or the data are not available (RAW); D freed by the previous addition instruction; the issue is possible here since the data for R1 and R2 have already been produced; multiply waits for results; addition result not yet ready; at the end of this stage the addition result is available]
Compiler
14
But we can modify the emission order without modifying the result
16 cycles instead of 22!
[Figure: pipeline timing diagrams before and after the reordering, with stages Fetch (F), Decode (D), Issue (I), Execute (E, E, E), Writeback (W). Annotations: waiting for R5; waiting for R6; busy decoder; emission possible since R1 and R2 are already available]
Multicycle hazards
15
Let’s suppose we have an FP adder (1 cycle, in red) and a multiplier (3 cycles, in green).
I1 F1 = F2 + F3
I2 F2 = F4 x F5
I3 F3 = F3 + F4
I4 F6 = F6 x F6
I5 F1 = F3 + F5
I6 F2 = F3 + F4
[Figure: timing diagram of I1 to I6 over clocks T1 to T10, and dependency graph]
Potential hazards in the graph:
I1 -> I2: WAR (F2)
I2 -> I6: WAW (F2)
I1 -> I3: WAR (F3)
I3 -> I5: RAW (F3)
I3 -> I6: RAW (F3)
I1 -> I5: WAW (F1)
NB: in this graph the hazards are potential, since only the registers are considered, no matter how many cycles the executions require
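The potential hazards of the sequence I1 to I6 can be enumerated mechanically from the register names alone. The following Python sketch is a plain pairwise check (the function name `potential_hazards` and the tuple layout are assumptions made for illustration); like the graph, it ignores execution times.

```python
from itertools import combinations

# (name, destination, sources), in program order, as in the slide
PROGRAM = [
    ("I1", "F1", ("F2", "F3")),
    ("I2", "F2", ("F4", "F5")),
    ("I3", "F3", ("F3", "F4")),
    ("I4", "F6", ("F6", "F6")),
    ("I5", "F1", ("F3", "F5")),
    ("I6", "F2", ("F3", "F4")),
]

def potential_hazards(program):
    """Compare every earlier/later pair of instructions by register names only."""
    hazards = []
    for (n1, d1, s1), (n2, d2, s2) in combinations(program, 2):
        if d1 in s2:
            hazards.append((n1, n2, "RAW", d1))  # later reads what earlier writes
        if d2 in s1:
            hazards.append((n1, n2, "WAR", d2))  # later writes what earlier reads
        if d1 == d2:
            hazards.append((n1, n2, "WAW", d1))  # both write the same register
    return hazards

for hazard in potential_hazards(PROGRAM):
    print(hazard)
```

Besides the six hazards listed for the graph, this exhaustive pairwise check also reports a WAR on F2 between I1 and I6, because it does not filter out pairs already ordered by an intermediate instruction (here I2).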
Dynamic instructions scheduling
16
• Systems with out-of-order execution but commitment always in order
• Temporal dependencies (hazards) are not known at compile time
• It allows the execution of the code on different pipelines and on superscalar processors with no implications for the compiler.
• It allows the execution of instructions ahead of their position (in the following case FSUB F12, F8, F14) if the conditions allow it
FDIV F0, F2, F4
FADD F10, F0, F8 (RAW: must wait for F0)
FSUB F12, F8, F14 (can be executed anyway)
17
Scoreboard
• Consider the following sequence
FDIV F0, F2, F4
FADD F10, F0, F8
FSUB F8, F8, F14
FDIV -> FADD: Read After Write (RAW) on F0
FADD -> FSUB: Write After Read (WAR) on F8; they must read the same value
•There is an antidependency (WAR hazard) between FADD and FSUB: should FSUB end before FADD has read F8, an error would occur (F8 already updated)
•A Write After Write (WAW) hazard would occur if FSUB used F10 instead of F8 as destination (should FSUB end before FADD; but FADD would probably be dropped by the compiler)
•“Scoreboard” technique: each instruction is executed as soon as possible, with the aim of completing one instruction per clock.
18
Scoreboard
[Figure: the registers, the functional units (two FP MUL, FP DIV, FP ADD, INTEG) and the scoreboard controlling them]
The scoreboard is roughly equivalent to the ID stage (just after the fetch) and determines when an instruction can read its operands and start its execution. The scoreboard tracks all system state changes and decides when the first instruction in the FIFO queue (in the order produced by the compiler) can be started.
19
Scoreboard
• Obviously some stalls can be induced because the number of buses available for transfers is small
• The four stages, equivalent to ID, EX and WB in DLX, are:
1. Emission: if a functional unit for the instruction is available (free), the instruction is issued, unless another functional unit already holds an instruction which must write into the same destination register. Therefore no WAW hazards. In this latter case the instruction is stalled, which blocks the emission of all the following instructions in the prefetch queue even when all other conditions for them are met!
2. Operand read: the instruction has been emitted. If the operands are available and no already executing instruction must write them, the operands are read; otherwise the instruction stalls in the functional unit
3. Execution: when the result has been computed and stored, the scoreboard is informed so as to unblock a possibly waiting instruction
4. Writeback: in case of a possible WAR, the instruction is stalled and does not write the result if there is a previous instruction which has not yet read its operands and one of them is the destination register of the considered instruction. Once the operands have been read, the result can be written
• It must be noticed that with this organisation forwarding is avoided, since the results are written as soon as produced (except for the WAR wait of point 4)
The scoreboard technique allows instructions to move directly from the EX to the WB stage (reducing the RAW risks).
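The conditions of steps 1, 2 and 4 can be written as three small predicates. This is a minimal Python sketch, not the actual scoreboard logic of any processor: the record `Active` and the function names are hypothetical, and the list `active` is assumed to hold, in issue order, only the instructions that have been issued and have not yet written their result.

```python
from dataclasses import dataclass

@dataclass(eq=False)  # identity comparison, so list.index finds this exact record
class Active:
    """One issued instruction that has not yet written its result."""
    unit: str
    dest: str
    sources: tuple
    operands_read: bool = False

def can_issue(unit, dest, active):
    """Step 1: the unit is free and no active instruction targets dest (no WAW)."""
    return all(a.unit != unit and a.dest != dest for a in active)

def can_read(instr, active):
    """Step 2: no earlier active instruction still has to write a source."""
    earlier = active[:active.index(instr)]
    return all(e.dest not in instr.sources for e in earlier)

def can_write(instr, active):
    """Step 4: no earlier instruction still has to read the destination (WAR)."""
    earlier = active[:active.index(instr)]
    return all(e.operands_read or instr.dest not in e.sources for e in earlier)

# A state with a multiply executing, a divide waiting for F0 and an add finished.
mult = Active("Mult1", "F0", ("F2", "F4"), operands_read=True)
div = Active("Divide", "F10", ("F0", "F6"))
add = Active("Add", "F6", ("F8", "F2"), operands_read=True)
active = [mult, div, add]
print(can_read(div, active))   # False: F0 is still being produced
print(can_write(add, active))  # False: the divide has not yet read F6
```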
20
An example
Hypothetical timing for the different instructions (including operand read and execution)
FLD: 1 cycle
FADD, FSUB: 2 cycles
FMUL: 10 cycles
FDIV: 40 cycles
LD F6, 34(R2)
LD F2, 45(R3)
FMUL F0, F2, F4 (MULTD)
FSUB F8, F6, F2 (SUBD)
FDIV F10, F0, F6 (DIVD)
FADD F6, F8, F2 (ADDD)
[Figure: arrows mark RAW and WAR hazards between the instructions; the loads use the integer unit]
Do you find more hypothetical hazards? For instance, what about F0?
21
Scoreboard entities
Instruction stages: emission, operand read, execution and writeback
Status of the functional units (FU): 9 parameters
Busy: unit busy
Op: operation code presently executed
Fi: destination (result) register of the instruction
Fj, Fk: source registers of the operands
Qj, Qk: functional units producing the required operands (if not yet ready) for the registers Fj and Fk
Rj, Rk: flags (yes) indicating whether Fj, Fk have already been updated
Result status register: indicates which functional unit will write each register. Void when no functional unit is going to write the specific register
N.B. It must be remembered that in case of a possible WAW the instruction emission is stalled (point 1 of the rules)
N.B. In the following example we suppose that two multiplication units (and one division unit) are available
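The nine parameters and the result status register map naturally onto a small record and a dictionary. The Python sketch below is an illustration under assumptions: `FUStatus` and `issue` are hypothetical names, and the "Time" column of the tables is not modelled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FUStatus:
    """The nine parameters kept for each functional unit."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source register 1
    fk: Optional[str] = None   # source register 2
    qj: Optional[str] = None   # FU producing Fj, if not yet ready
    qk: Optional[str] = None   # FU producing Fk, if not yet ready
    rj: bool = False           # Fj ready?
    rk: bool = False           # Fk ready?

def issue(units, result_status, unit, op, fi, fj, fk):
    """Fill the FU entry at emission; return False if the instruction must stall."""
    if units[unit].busy or fi in result_status:  # busy unit or possible WAW
        return False
    qj, qk = result_status.get(fj), result_status.get(fk)
    units[unit] = FUStatus(True, op, fi, fj, fk, qj, qk, qj is None, qk is None)
    result_status[fi] = unit  # this unit will write register fi
    return True

units = {name: FUStatus() for name in ("Integer", "Mult1", "Mult2", "Add", "Divide")}
result_status = {}
issue(units, result_status, "Integer", "Load", "F2", None, "R3")  # LD F2, 45(R3)
issue(units, result_status, "Mult1", "Mult", "F0", "F2", "F4")    # MULTD F0, F2, F4
print(units["Mult1"])
print(result_status)
```

With a load of F2 pending on the integer unit, the entry created for MULTD has Qj = Integer, Rj = no and Rk = yes.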
22
Example (here we assume that F0 is a “normal” register and not always “0”)
Functional units: 1 integer unit, 2 multiplication units, 1 add/sub unit, 1 division unit
NB: LD = FLD, MULTD = FMUL, SUBD = FSUB, DIVD = FDIV, ADDD = FADD
FLD 1 cycle; FADD, FSUB 2 cycles; FMUL 10 cycles; FDIV 40 cycles

Instruction status (progression of the instruction states)
Instruction  dest  j    k    Issue  Read Op  Execution complete  Write Result
LD           F6    34   R2
LD           F2    45   R3
MULTD        F0    F2   F4
SUBD         F8    F6   F2
DIVD         F10   F0   F6
ADDD         F6    F8   F2

Functional unit status (FU = Functional Unit)
Time  Name     Busy  Op  Fi  Fj  Fk  Qj  Qk  Rj  Rk
      Integer
      Mult1
      Mult2
      Add
      Divide
Time = number of clock cycles of execution yet to elapse; Fi = destination; Fj, Fk = source 1, source 2; Qj, Qk = FU for j, FU for k; Rj, Rk = Fj ready?, Fk ready?

Register result status (floating point result registers)
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F31
       FU: functional unit producing the result for the floating point register Fx (Qj, Qk)
Clock = progression clock

Rj and Rk indicate whether the data can be read (possibly in the next cycle, if just produced) from the operand source registers of the instruction which must be executed. Qj and Qk are the Functional Units which produce them (if not yet ready). Fj and Fk are the registers where the data produced by Qj and Qk are stored (or will be stored in the next clock cycle; the data are available if the corresponding R flag is yes) to be used by the executed instruction
23
Cycle 1Instruction status Read Execution WriteInstruction j k Issue Op/Excomplete ResultLD F6 34 R2LD F2 45 R3MULTDF0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2Functional unit status dest S1 S2 FUj FUk Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkIntegerMult1 NoMult2 NoAdd NoDivide No
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Integer
Functional unit used for producing the result in F6: Integer
Integer unit status after the issue: Busy = Yes, Op = Load, Fi = F6, Fk = R2, Rk = Yes
R2 is supposed to be already available and can therefore be used in the next clock. LD uses the integer unit
At clock 1 the instruction state of LD F6, 34(R2) is Issue (issue = 1)
(In the original slides the state changes are shown in brown.)
24
Cycle 2Instruction status Read Execution WriteInstruction j k Issue Op/Ex complete ResultLD F6 34 R2 1 2LD F2 45 R3MULTDF0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2Functional unit status dest S1 S2 FUj FUk Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F6 R2Mult1 NoMult2 NoAdd NoDivide No
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Integer
Data ready in R2: the instruction can proceed to execution (operand read in cycle 2)
NB: the second LD cannot be emitted because the only integer unit is busy, and the same applies to MULTD and the following instructions, because instructions must be emitted in order, although their functional units are free!
25
Cycle 3Instruction status Read Execution WriteInstruction j k Issue complete ResultLD F6 34 R2 1 2 3LD F2 45 R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2Functional unit status dest S1 S2 FUj FUk Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F6 R2Mult1 NoMult2 NoAdd NoDivide No
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Integer
26
Cycle 4Instruction status Read Execution WriteInstruction j k Issue complete ResultLD F6 34 R2 1 2 3LD F2 45 R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2Functional unit status dest S1 S2 FUj FUk Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load R2Mult1 NoMult2 NoAdd NoDivide No
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU
LD F6, 34(R2) writes its result in cycle 4
The register has been written at the end of the period
The integer functional unit is freed at the end of the period
The change of status of the FUs indicates their value at the positive clock edge ending the current cycle (future status). For instance the integer functional unit is freed at the end of cycle 4 together with the result writeback. LD F6, 34(R2) disappears totally from the scoreboard at the positive clock edge concluding the current cycle 4.
27
Cycle 5Instruction status Read Execution WriteInstruction j k Issue complete ResultLD F6 34 R2 1 2 3LD F2 45 R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2Functional unit status dest
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkIntegerMult1 NoMult2 NoAdd NoDivide No
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU (F2): Integer. The integer functional unit must produce a new value for F2
Integer unit status: Busy = Yes, Op = Load, Fi = F2, Fk = R3, Rk = Yes
LD F6, 34(R2): write result 4
R3 is supposed to be already available, as in the previous case
At the beginning of cycle 5 the integer unit is already free, so LD F2, 45(R3) can be emitted and start (issue = 5)
28
Cycle 6Instruction status Read Execution WriteInstruction j k Issue complete ResultLD F6 34 R2 1 2 3 4LD F2 45 R3 5 6MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2Functional unit status dest
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F2 R3Mult1Mult2 NoAdd NoDivide No
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU: F0 Mult1, F2 Integer
Mult1 status: Busy = Yes, Op = Mult, Fi = F0, Fj = F2, Fk = F4, Qj = Integer, Rj = No, Rk = Yes
F4 is supposed to be already present
MULTD F0, F2, F4 can be issued because its FU is free and no other active instruction has F0 as destination register (issue = 6)
MULTD waits for F2 from the integer unit!
29
Cycle 7
MULTD is stalled in its functional unit because F2 is not yet ready.
Instruction status Read Execution WriteInstruction j k Issue complete ResultLD F6 34 R2 1 2 3 4LD F2 45 R3 5 6 7MULTD F0 F2 F4 6SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2Functional unit status dest
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger Yes Load F2 R3Mult1 Yes Mult F0 F2 F4 Integer No YesMult2 NoAddDivide No
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU: F0 Mult1, F2 Integer, F8 Add
Add status: Busy = Yes, Op = Sub, Fi = F8, Fj = F6, Fk = F2, Qk = Integer, Rj = Yes, Rk = No
(NB: the FP adder executes FP subtractions too)
SUBD F8, F6, F2 can be issued because the FP add/subtract unit is free (issue = 7)
SUBD needs F2
30
Cycle 8Instruction status Read EX Write
Instruction j k Issue complete. Result
LD F6 34 R2 1 2 3 4LD F2 45 R3 5 6 7 8MULTDF0 F2 F4 6SUBD F8 F6 F2 7DIVD F10 F0 F6ADDD F6 F8 F2Functional unit status dest
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkIntegerMult1 Yes Mult F0 F2 F4 YesMult2 NoAdd Yes Sub F8 F6 F2 YesDivide
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Mult1 Add
S1 S2 FUj FUk Fj? Fk?
Divide status: Busy = Yes, Op = Div, Fi = F10, Fj = F0, Fk = F6, Qj = Mult1, Rj = No, Rk = Yes
DIVD F10, F0, F6 can be issued because the FP divide FU is free (issue = 8), but F0 is not yet available
F2 available! LD F2, 45(R3) writes F2 in cycle 8 (status updated at the end of the cycle), which allows MULTD and SUBD to read their operands during the next cycle
F2 is written and therefore the integer unit is free
31
Cycle 9 - 10Instruction status Read EX WriteInstruction j k Issue complete ResultLD F6 34 R2 1 2 3 4LD F2 45 R3 5 6 7 8MULTD F0 F2 F4 6SUBD F8 F6 F2 7
N.B.: MULTD and SUBD can read their operands (both in cycle 9) because F2 is available (see cycle 8). DIVD is still stalled because of F0.
DIVD F10 F0 F6 8ADDD F6 F8 F2Functional unit status dest S1 S2 FUj FUk Fj? Fk?
Time Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger No
10 clock Mult1 Yes Mult F0 F2 F4Mult2 No
2 clock Add Yes Sub F8 F6 F2Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Mult1 Add Divide
40 clock
ADDD cannot be issued because SUBD is using the adder FU
32
Cycle 11
Note: the Add FU requires 2 cycles for SUBD, and therefore nothing happens in cycle 10, while MULTD is still processing its data
NB: ADDD will use the result of SUBD, but it has not yet been issued because SUBD keeps the FU busy
Functional unit status dest S1 S2 FUj FUk Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer No8 clocks more Mult1 Yes Mult F0 F2 F4
Mult2 No0 Add Yes Sub F8 F6 F2
Divide Yes Div F10 F0 F6 Mult1 No YesRegister result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Mult1 Add Divide
Instruction status Read EX WriteInstruction j k Issue complete ResultLD F6 34 R2 1 2 3 4LD F2 45 R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDD F6 F8 F2
SUBD completes its execution in cycle 11
33
Cycle 12
Functional unit status dest S1 S2 FUj FUk Fj? Fk?Time Name BusyOp Fi Fj Fk Qj Qk Rj Rk
Integer No7 clocks more Mult1 Yes Mult F0 F2 F4
Mult2 NoAdd NoDivide Yes Div F10 F0 F6 Mult1 No Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Mult1 Divide
Instruction status Read EX WriteInstruction j k Issue completeResultLD F6 34 R2 1 2 3 4LD F2 45 R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDD F6 F8 F2
SUBD (execution complete in cycle 11) writes F8 in cycle 12 and the ADD/SUB FU is freed. In the next period ADDD can start
34
Cycle 13
Functional unit status dest S1 S2 FUj FUk Fj? Fk?Time Name BusyOp Fi Fj Fk Qj Qk Rj Rk
Integer NoMult1 Yes Mult F0 F2 F4Mult2 NoAddDivide Yes Div F10 F0 F6 Mult1 No Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Mult1 Divide
Instruction status
Instruction  dest  j    k    Issue  Read Op  Execution complete  Write Result
LD           F6    34   R2   1      2        3                   4
LD           F2    45   R3   5      6        7                   8
MULTD        F0    F2   F4   6      9
SUBD         F8    F6   F2   7      9        11                  12
DIVD         F10   F0   F6   8
ADDD         F6    F8   F2   13
Now ADDD can be issued because SUBD has finished its execution and has freed the FU
Add status: Busy = Yes, Op = Add, Fi = F6, Fj = F8, Fk = F2, Rj = Yes, Rk = Yes
Register result status: F6 Add
MULTD: 6 clocks more
35
Cycle 14
Functional unit status dest S1 S2 FUj FUk Fj? Fk?Time Name BusyOp Fi Fj Fk Qj Qk Rj Rk
Integer NoMult1 Yes Mult F0 F2 F4Mult2 NoAdd Yes Add F6 F8 F2Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Mult1 Add Divide
Instruction status Read EX WriteInstruction j k Issue completeResultLD F6 34 R2 1 2 3 4LD F2 45 R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDD F6 F8 F2
SUBD: execution complete 11, write result 12
ADDD: issue 13, read operands 14
MULTD: 5 clocks more; ADDD: 2 clocks more
36
Cycle 15
ADDD requires two cycles and therefore no system status change
Functional unit status dest S1 S2 FUj FUk Fj? Fk?Time Name BusyOp Fi Fj Fk Qj Qk Rj Rk
Integer NoMult1 Yes Mult F0 F2 F4Mult2 NoAdd Yes Add F6 F8 F2Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Mult1 Add Divide
Instruction status Read EX WriteInstruction j k Issue complete ResultLD F6 34 R2 1 2 3 4LD F2 45 R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDD F6 F8 F2
SUBD: execution complete 11, write result 12
ADDD: issue 13, read operands 14
MULTD: 4 clocks more; ADDD: 1 clock more
37
Cycle 16
Functional unit status dest S1 S2 FUj FUk Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer NoMult1 Yes Mult F0 F2 F4Mult2 NoAdd Yes Add F6 F8 F2Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Mult1 Add Divide
Instruction status Read EX WriteInstruction j k Issue completeResultLD F6 34 R2 1 2 3 4LD F2 45 R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDD F6 F8 F2
SUBD: execution complete 11, write result 12
ADDD: issue 13, read operands 14, execution complete 16
ADDD has ended its EX stage, while MULTD keeps executing and DIVD keeps waiting for F0
MULTD: 3 clocks more
38
Cycle 17
Functional unit status dest S1 S2 FUj FUk Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer NoMult1 Yes Mult F0 F2 F4Mult2 NoAdd Yes Add F6 F8 F2Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Mult1 Add Divide
Instruction status Read EX WriteInstruction j k Issue completeResultLD F6 34 R2 1 2 3 4LD F2 45 R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDD F6 F8 F2
SUBD: execution complete 11, write result 12
ADDD: issue 13, read operands 14, execution complete 16
NB: ADDD is stalled (it cannot write) because of a WAR with DIVD on F6. DIVD has not yet read F6 because it is waiting for F0 produced by MULTD (operands are read in parallel). MULTD keeps executing and DIVD keeps waiting
MULTD: 2 clocks more
39
Cycle 18
MULTD still executing
DIVD still stalled
Functional unit status dest S1 S2 FUj FUk Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer NoMult1 Yes Mult F0 F2 F4Mult2 NoAdd Yes Add F6 F8 F2Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Mult1 Add Divide
Instruction status Read EX WriteInstruction j k Issue complete ResultLD F6 34 R2 1 2 3 4LD F2 45 R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDD F6 F8 F2
SUBD: execution complete 11, write result 12
ADDD: issue 13, read operands 14, execution complete 16 (write still stalled)
MULTD: 1 clock more
40
Cycle 19
Functional unit status dest S1 S2 FUj FUk Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer NoMult1 Yes Mult F0 F2 F4Mult2 NoAdd Yes Add F6 F8 F2Divide Yes Div F10 F0 F6 Mult1 No Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Mult1 Add Divide
Instruction status Read EX WriteInstruction j k Issue completeResultLD F6 34 R2 1 2 3 4LD F2 45 R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDD F6 F8 F2
SUBD: execution complete 11, write result 12
ADDD: issue 13, read operands 14, execution complete 16
MULTD ends its execution in cycle 19 (10 cycles after reading its operands) and will write in cycle 20, which will unblock DIVD and then ADDD
41
Cycle 20
Functional unit status dest S1 S2 FUj FUk Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer NoMult1 NoMult2 NoAdd Yes Add F6 F8 F2Divide Yes Div F10 F0 F6 Yes Yes
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Add Divide
Instruction status Read EX WriteInstruction j k Issue complete ResultLD F6 34 R2 1 2 3 4LD F2 45 R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDD F6 F8 F2
SUBD: execution complete 11, write result 12
ADDD: issue 13, read operands 14, execution complete 16
MULTD: execution complete 19, write result 20. MULTD writes F0, unblocking DIVD
42
Cycle 21
Functional unit status dest S1 S2 FUj FUk Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer NoMult1 NoMult2 NoAdd Yes Add F6 F8 F2Divide Yes Div F10 F0 F6
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Add Divide
Instruction status Read EX WriteInstruction j k Issue complete ResultLD F6 34 R2 1 2 3 4LD F2 45 R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDD F6 F8 F2
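SUBD: execution complete 11, write result 12
ADDD: issue 13, read operands 14, execution complete 16
MULTD: execution complete 19, write result 20
DIVD reads both F0 and F6 in cycle 21 (F6 could not be written by ADDD because of the WAR), unblocking ADDD, which can write F6 in the next cycle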
43
Cycle 22
Functional unit status dest S1 S2 FUj FUk Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer NoMult1 NoMult2 NoAdd NoDivide Yes Div F10 F0 F6
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Divide
Instruction status Read EX WriteInstruction j k Issue complete ResultLD F6 34 R2 1 2 3 4LD F2 45 R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDD F6 F8 F2
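SUBD: execution complete 11, write result 12
MULTD: execution complete 19, write result 20
DIVD: read operands 21
ADDD: issue 13, read operands 14, execution complete 16, write result 22
Now ADDD can write F6, since the WAR hazard with DIVD has disappeared. The result of ADDD was available at the end of cycle 16 but could be written only in cycle 22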
44
Cycle 61
Functional unit status dest S1 S2 FUj FUk Fj? Fk?Time Name Busy Op Fi Fj Fk Qj Qk Rj Rk
Integer NoMult1 NoMult2 NoAdd NoDivide Yes Div F10 F0 F6
Register result statusClock F0 F2 F4 F6 F8 F10 F12 ... F31
FU Divide
Instruction status Read EX WriteInstruction j k Issue completeLD F6 34 R2 1 2 3 4LD F2 45 R3 5 6 7 8MULTD F0 F2 F4 6 9SUBD F8 F6 F2 7 9DIVD F10 F0 F6 8ADDD F6 F8 F2
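SUBD: execution complete 11, write result 12
MULTD: execution complete 19, write result 20
ADDD: issue 13, read operands 14, execution complete 16, write result 22
DIVD: read operands 21, execution complete 61. The execution of DIVD ends after 40 cycles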
45
Cycle 62
All executions ended

Instruction status
Instruction  dest  j    k    Issue  Read Op  Execution complete  Write Result
LD           F6    34   R2   1      2        3                   4
LD           F2    45   R3   5      6        7                   8
MULTD        F0    F2   F4   6      9        19                  20
SUBD         F8    F6   F2   7      9        11                  12
DIVD         F10   F0   F6   8      21       61                  62
ADDD         F6    F8   F2   13     14       16                  22

Functional unit status
Time  Name     Busy  Op  Fi  Fj  Fk  Qj  Qk  Rj  Rk
      Integer  No
      Mult1    No
      Mult2    No
      Add      No
0     Divide   No

Register result status
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F31
62     FU (no pending writes)
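The whole instruction status table can be recomputed with a short Python sketch. It relies on the timing rules that can be read from the cycles above, stated here as assumptions: one instruction issued per clock and in order; a FU can be reused in the cycle after its instruction writes; an operand can be read in the cycle after its producer writes; execution completes the given number of cycles after the operand read; a write waits for earlier instructions that still have to read the destination (WAR). All names are hypothetical. Since every constraint refers only to earlier instructions, one pass in program order is enough.

```python
LATENCY = {"LD": 1, "MULTD": 10, "SUBD": 2, "ADDD": 2, "DIVD": 40}
UNIT = {"LD": "Integer", "MULTD": "Mult", "SUBD": "Add", "ADDD": "Add", "DIVD": "Divide"}
COPIES = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}

PROGRAM = [  # (opcode, destination, sources)
    ("LD", "F6", ("R2",)),
    ("LD", "F2", ("R3",)),
    ("MULTD", "F0", ("F2", "F4")),
    ("SUBD", "F8", ("F6", "F2")),
    ("DIVD", "F10", ("F0", "F6")),
    ("ADDD", "F6", ("F8", "F2")),
]

def scoreboard_timeline(program):
    """Return (opcode, dest, issue, read, execution complete, write) per instruction."""
    free_from = {u: [1] * n for u, n in COPIES.items()}  # first free cycle of each FU
    last_write = {}   # register -> cycle of its most recent write
    reads = []        # (sources, read cycle) of the earlier instructions
    prev_issue, table = 0, []
    for op, dest, srcs in program:
        copies = free_from[UNIT[op]]
        k = copies.index(min(copies))                    # earliest free FU copy
        # in-order issue, free FU, and no pending write to dest (WAW)
        issue = max(prev_issue + 1, copies[k], last_write.get(dest, 0) + 1)
        # operands readable the cycle after their producer has written (RAW)
        read = max([issue + 1] + [last_write.get(s, 0) + 1 for s in srcs])
        done = read + LATENCY[op]
        # write only after earlier readers of dest have read it (WAR)
        write = max([done + 1] + [r + 1 for s, r in reads if dest in s])
        copies[k] = write + 1
        last_write[dest] = write
        reads.append((srcs, read))
        prev_issue = issue
        table.append((op, dest, issue, read, done, write))
    return table

for row in scoreboard_timeline(PROGRAM):
    print(row)
```

Under these assumptions the sketch yields the same cycles as the final table, including the write of ADDD delayed to cycle 22 by the WAR on F6.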
46
Scoreboard limits
• An instruction can be emitted only if all previous instructions have been emitted
FDIV F0, F2, F4
FADD F6, F0, F8
FSTOR F6, 0(R1)
FSUB F8, F10, F14
FMUL F6, F10, F8
Potential hazards: RAW (F0 between FDIV and FADD; F6 between FADD and FSTOR; F8 between FSUB and FMUL), WAR (F8 between FADD and FSUB; F6 between FSTOR and FMUL), WAW (F6 between FADD and FMUL)
N.B. The hazards of the sequence are only potential: their occurrence depends on the execution times of the instructions
• Register values must in any case be read in parallel and only from the register file (which means that they must already have been stored in the registers: no RAW problem)
Renaming – Tomasulo Algorithm
Tomasulo algorithm: “renaming” is based on the concept of “reservation stations”, which are buffers of the functional units where instructions can be «parked» while waiting for the availability of the requested FU and of the needed data.
«Renaming» indicates a location different from the RF where a requested datum is produced/stored and from which it can be obtained. The name «renaming» is used because it is as if the source registers of an instruction were renamed
A reservation station is a place in a FU where an instruction emitted from the instruction queue waits until the FU is free and the needed data arrive, as soon as they are produced (N.B. before being written in the RF). For its operands EITHER the source register data OR the reservation stations producing them are indicated (whence renaming). The renaming occurs at run time
The following benefits occur
A reservation station captures a required operand exactly when and where it is produced (without waiting until it is written, thus avoiding the register file access). Similar to the case of forwarding
When multiple writes to the same register occur (WAW, possible only if multiple buses between FUs and RF are available), only the most recently produced data are written (for each register a TAG is used, indicating the FU which has the right to write)
Hazard detection and execution control are distributed (not centralised as in the scoreboard): only the information stored in the reservation stations of each functional unit determines whether an instruction can execute in the FU, since the source (where the data are being produced, if not yet in the RF) and NOT the RF is indicated. RAW hazards are no longer a problem, since the requested data are provided as soon as produced. The same holds for WAR (data are read by the reservation stations while they are written)
Results are transferred through the common data buses directly to the reservation stations of the waiting FUs, without the necessity of reading the RF (multiple reservation stations, in addition to the RF registers, can be accessed at the same time when multiple buses are available)
48
Tomasulo Algorithm
Tomasulo eliminates not only WAWs but also WARs
FLD F6, 32(R2)
FLD F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, F6
FDIV F10, F0, F6
FADD F6, F8, F2
Possible WAW: FLD F6 and FADD F6. Possible WAR: FDIV (reads F6) and FADD (writes F6).
As far as the WAW on F6 between FLD and FADD is concerned, the mechanism guarantees that only the most recent instruction in the RS using a destination register can write the register.
Renaming (T = functional unit producing the data):
FLD [T/F6], 32(R2)
FLD F2, 44(R3)
FMUL F0, F2, F4
FSUB F8, F2, [T]
FDIV F10, F0, [T]
FADD F6, F8, F2
NB: when an instruction is inserted in a RS, it is checked whether one or more of its operands are being produced elsewhere by other RSs: if so, the operand is renamed
For the FADD a potential WAR with the FDIV could occur if FADD ended before FDIV had read its operands (F8 from FSUB and F2 from FLD were both immediately available for FADD), but since FDIV points for F6 to the RS of FLD F6, 32(R2) and not to the RF, the problem does not occur. The same holds for FSUB.
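The renaming step can be illustrated with a few lines of Python. This is a sketch under simplifying assumptions: no instruction completes while the six instructions are being issued (so every producer is still pending), memory operands are ignored, DIVs share the multiplier stations and SUBs the adder stations, and the tags such as `Load1` or `Add2` are hypothetical names.

```python
PROGRAM = [  # (opcode, destination, register sources); memory operands omitted
    ("FLD", "F6", ()),
    ("FLD", "F2", ()),
    ("FMUL", "F0", ("F2", "F4")),
    ("FSUB", "F8", ("F2", "F6")),
    ("FDIV", "F10", ("F0", "F6")),
    ("FADD", "F6", ("F8", "F2")),
]
STATION_KIND = {"FLD": "Load", "FMUL": "Mult", "FDIV": "Mult",
                "FSUB": "Add", "FADD": "Add"}

def rename(program):
    """Issue in order, replacing each source by the tag of the RS producing it."""
    producer = {}  # register -> tag of the RS that has the right to write it
    counters = {}
    issued = []
    for op, dest, srcs in program:
        kind = STATION_KIND[op]
        counters[kind] = counters.get(kind, 0) + 1
        tag = f"{kind}{counters[kind]}"
        # a source still being produced is renamed to its producer's tag
        operands = tuple(producer.get(s, s) for s in srcs)
        issued.append((tag, op, operands))
        producer[dest] = tag  # the most recent writer takes over the register
    return issued, producer

issued, producer = rename(PROGRAM)
for entry in issued:
    print(entry)
print(producer["F6"])
```

FDIV ends up pointing to the load station for F6 and not to the register, so FADD can write F6 at any time (no WAR); the register F6 is tagged with the FADD station only, so the older load can no longer overwrite it (no WAW).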
49
Tomasulo Algorithm
Very high performance without special compilers
Differences with the scoreboard
Buffers and controls are distributed directly in the FUs (there is no centralized control): the buffers are called “reservation stations”
Source register names are substituted by pointers to the buffers of the reservation stations (if the requested data are being produced there)
“Renaming”: a direct pointer to the sources and not to the register
One or more Common Data Buses for sending results to all the FUs requiring them
Loads and Stores are considered as FUs (a STORE can also be a source for a RS executing a LOAD)
50
Tomasulo Algorithm
In this example it is assumed that the MUL unit executes the DIVs too and that the ADD unit executes the SUBs too. LOADs and STOREs are handled like the other instructions
In this example: 3 RS for add/sub, 2 RS for mult/div, 5 RS for store, 5 RS for load
In this example there is only one Common Data Bus, for the data produced by the FUs. Please notice that the same Common Data Bus is also used by the RSs waiting for data
Each RS (more than one for each FU) stores an emitted instruction and, for each operand, either the operand value (e.g. read from the RF) or the name of the RS which is producing it (renaming)
51
Tomasulo Algorithm
• Load buffers are used to store the load addresses
• Store buffers contain the computed addresses and the data to be written to memory
• Loads and stores must be executed in sequence if they refer to the same address. Otherwise it is possible to anticipate the LOADs (never the STOREs)
• In the figure there are 3 phases (each of which can last several clocks):
• Emission: the instructions are extracted in order from the general instruction queue when there is a free RS for the requested FU (the only condition); otherwise the instruction queue stalls. Operands are taken from the RF or from the producing FU, as indicated. In case of a WAW it must be determined which instruction must provide the data
• Execution: if one or more operands are not yet available, the CDB(s) must be monitored (data must be transferred over a bus anyway) in order to catch them (and their sources) as soon as they are available. RAW hazards are therefore avoided (we are sure not to read stale data from the RF)
• Writeback: as soon as a datum is produced, it is transferred over a CDB (when more than one is available) to the RF and to the RSs waiting for it
52
Tomasulo Algorithm
Let’s see the scoreboard example on a Tomasulo architecture. Let’s suppose that the execution times are the same as for the scoreboard (FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles). NB: the “+1” is for the writeback.

LD   F6, 34(R2)
LD   F2, 45(R3)
FMUL F0, F2, F4
FSUB F8, F6, F2
FDIV F10, F0, F6
FADD F6, F8, F2
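The timing of this example can be estimated with a very small model. The sketch below is not a full Tomasulo simulator: it assumes one instruction emitted per clock, a free RS always available, no contention on the single CDB, base registers R2 and R3 already available, and that a waiting instruction starts executing the cycle after the last awaited result is written back. The load is given 2 execution cycles (address, then data), which is an assumption taken from the cycle-by-cycle tables of this example rather than from the "1+1" figure.

```python
# Execution cycles per operation (the writeback takes one more cycle).
LATENCY = {"LD": 2, "ADDD": 2, "SUBD": 2, "MULTD": 10, "DIVD": 40}

PROGRAM = [
    ("LD", "F6", []),
    ("LD", "F2", []),
    ("MULTD", "F0", ["F2", "F4"]),
    ("SUBD", "F8", ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6", ["F8", "F2"]),
]


def schedule(program):
    """Return (op, dest, issue, exec_start, exec_end, write) for each instruction."""
    last_write = {}  # register -> writeback cycle of its most recent producer
    rows = []
    for issue, (op, dest, sources) in enumerate(program, start=1):
        # Cycle in which the last awaited operand appears on the CDB (0 = already in RF).
        ready = max([last_write.get(src, 0) for src in sources], default=0)
        start = max(issue, ready) + 1
        end = start + LATENCY[op] - 1
        write = end + 1
        rows.append((op, dest, issue, start, end, write))
        last_write[dest] = write  # later readers are renamed to this producer
    return rows


for row in schedule(PROGRAM):
    print(row)
```

Because `last_write` is updated in program order, DIVD takes F6 from the first LD and not from the later ADDD, which is exactly the effect of renaming.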
53
Reservation Station
Op: opcode of the instruction to be executed
Vj, Vk: the source operands, read either from the RF or from the FUs producing them. If blank, the datum is being produced by the unit named in the corresponding Qj or Qk
Qj, Qk: the functional units producing the source operands. A blank indicates that the source operand is already in Vj or Vk, or that it is not required
Busy: indicates that the RS (and its FU) is busy
Register File Status: indicates which FU will write the register (if any). A blank means that there is no instruction which must write the register, and therefore its value can be used directly
N.B. One instruction per clock is emitted from the general instruction queue when an RS of the FU required by that instruction is available; otherwise the queue stalls. In our example we assume only one CDB.
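The fields above map naturally onto a small record. The following Python sketch is an illustration only (the class and function names are hypothetical, and operand values are plain floats): it shows how an RS captures a value broadcast on the CDB and how the register file status is cleared at writeback.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    name: str
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # operand values, once known
    vk: Optional[float] = None
    qj: Optional[str] = None    # name of the unit producing the operand, if pending
    qk: Optional[str] = None

    def ready(self) -> bool:
        """An instruction can execute when no operand is pending."""
        return self.busy and self.qj is None and self.qk is None

    def capture(self, producer: str, value: float) -> None:
        """Monitor the CDB: take the value if this RS is waiting for `producer`."""
        if self.qj == producer:
            self.vj, self.qj = value, None
        if self.qk == producer:
            self.vk, self.qk = value, None


def broadcast(producer, value, stations, register_status, register_file):
    """Writeback: send a result to the waiting RSs and to the RF."""
    for rs in stations:
        rs.capture(producer, value)
    for reg, unit in list(register_status.items()):
        if unit == producer:  # only the most recent writer updates the RF
            register_file[reg] = value
            del register_status[reg]


# Situation similar to cycle 5 of the example: Load2 delivers F2.
mult1 = ReservationStation("Mult1", busy=True, op="Mult", vk=2.0, qj="Load2")
add1 = ReservationStation("Add1", busy=True, op="Sub", vj=1.5, qk="Load2")
status = {"F0": "Mult1", "F2": "Load2", "F8": "Add1"}
rf = {"F4": 2.0, "F6": 1.5}
broadcast("Load2", 3.0, [mult1, add1], status, rf)
```

After the broadcast both stations are ready, F2 is in the register file, and F2 no longer appears in the register status table.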
54
Cycle 0

Instruction status
Instruction   j     k     Issue   Execution   Write
LD     F6     34    R2
LD     F2     45    R3
MULTD  F0     F2    F4
SUBD   F8     F6    F2
DIVD   F10    F0    F6
ADDD   F6     F8    F2

Reservation stations (Vj = S1, Vk = S2, Qj = RS for j, Qk = RS for k)
Time   Name    Busy   Op   Vj   Vk   Qj   Qk
0      Add1    No
0      Add2    No
       Add3    No
0      Mult1   No
0      Mult2   No

Register result status (the FU producing the new value)
Clock        F0   F2   F4   F6   F8   F10   F12 ... F31
0       FU

Vj, Vk: operand registers. If blank, the datum is produced by the FU in the corresponding Q field.
Qj, Qk: producing FU. If blank, the datum is in the RF.
NB: for LD (ST is not used here) there is a limited number of RSs. Their busy status is displayed differently from that of the FUs (see next slide). Load/store are not indicated in the status table.
For the sake of simplicity, Rj and Rk (ready/not ready) of the scoreboard are not displayed, since their values are implicit in the status of Qj and Qk.
FLD 1+1 cycles, FADD and FSUB 2+1 cycles, FMUL 10+1 cycles, FDIV 40+1 cycles. NB: the “+1” is for the writeback.
55
Cycle 1

Instruction status
Instruction   j     k     Issue   Execution   Write
LD     F6     34    R2    1
LD     F2     45    R3
MULTD  F0     F2    F4
SUBD   F8     F6    F2
DIVD   F10    F0    F6
ADDD   F6     F8    F2

Load buffers (5 RS for the LOAD)
Name    Busy   Address
Load1   Yes    34+R2

Reservation stations (3 RS for adder/sub, 2 RS for mul/div)
Time   Name    Busy   Op   Vj   Vk   Qj   Qk
0      Add1    No
0      Add2    No
       Add3    No
0      Mult1   No
0      Mult2   No

Register result status
Clock        F0   F2   F4   F6      F8   F10   F12 ... F31
1       FU                  Load1

NB: here it is assumed that R2 and R3 are already available.
56
Cycle 2

Instruction status
Instruction   j     k     Issue   Execution   Write
LD     F6     34    R2    1       2-
LD     F2     45    R3    2
MULTD  F0     F2    F4
SUBD   F8     F6    F2
DIVD   F10    F0    F6
ADDD   F6     F8    F2

Load buffers (5 RS for LOAD)
Name    Busy   Address
Load1   Yes    34+R2
Load2   Yes    45+R3

Reservation stations
Time   Name    Busy   Op   Vj   Vk   Qj   Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
       Mult2   No

Register result status
Clock        F0   F2      F4   F6      F8   F10   F12 ... F31
2       FU        Load2        Load1

The second LD is emitted. One instruction per clock is emitted (when possible).
N.B. A second LOAD has been emitted (not possible with the scoreboard) and parked in the RS. The value of R3 is already available in the RF.
NB: Load -> 2 cycles: the first one for computing the address and the second for reading the data.
57
Cycle 3

Instruction status
Instruction   j     k     Issue   Execution   Write
LD     F6     34    R2    1       2--3
LD     F2     45    R3    2       3-
MULTD  F0     F2    F4    3
SUBD   F8     F6    F2
DIVD   F10    F0    F6
ADDD   F6     F8    F2

Load buffers
Name    Busy   Address
Load1   Yes    34+R2
Load2   Yes    45+R3

Reservation stations
Time   Name    Busy   Op     Vj   Vk   Qj      Qk
       Add1    No
       Add2    No
       Add3    No
10     Mult1   Yes    Mult        F4   Load2
       Mult2   No

Register result status
Clock        F0      F2      F4   F6      F8   F10   F12 ... F31
3       FU   Mult1   Load2        Load1

LD takes two cycles.
MULTD is emitted (free RS). It can be emitted although F2 is NOT yet available: F2 -> renaming. F4 is supposed to be already in the RF. 10 cycles are still to be executed.
58
Cycle 4

Instruction status
Instruction   j     k     Issue   Execution   Write
LD     F6     34    R2    1       2--3        4
LD     F2     45    R3    2       3--4
MULTD  F0     F2    F4    3
SUBD   F8     F6    F2    4
DIVD   F10    F0    F6
ADDD   F6     F8    F2

Load buffers
Name    Busy   Address
Load2   Yes    45+R3

Reservation stations
Time   Name    Busy   Op     Vj                          Vk   Qj      Qk
3      Add1    Yes    Sub    F6 (captured on the fly)                 Load2
       Add2    No
       Add3    No
10     Mult1   Yes    Mult                               F4   Load2
       Mult2   No

Register result status
Clock        F0      F2      F4   F6   F8     F10   F12 ... F31
4       FU   Mult1   Load2             Add1

The add FUs execute both sums and subtractions.
SUBD is emitted (RS free).
The datum read from memory by LD F6 34(R2) is written both to the RF and to the RS of SUBD, which is waiting for it. F6 is available in the RF at the end of the cycle.
The load FU is freed at the end of the clock cycle.
59
Cycle 5

Instruction status
Instruction   j     k     Issue   Execution   Write
LD     F6     34    R2    1       2--3        4
LD     F2     45    R3    2       3--4        5
MULTD  F0     F2    F4    3
SUBD   F8     F6    F2    4
DIVD   F10    F0    F6    5
ADDD   F6     F8    F2

Reservation stations (Time = cycles yet to be executed for completing the execution)
Time   Name    Busy   Op     Vj           Vk           Qj      Qk
3      Add1    Yes    Sub    F6 (capt.)   F2 (capt.)
0      Add2    No
       Add3    No
10     Mult1   Yes    Mult   F2 (capt.)   F4
       Mult2   Yes    Div                 F6           Mult1

Register result status
Clock        F0      F2   F4   F6   F8     F10     F12 ... F31
5       FU   Mult1                  Add1   Mult2

DIVD is emitted (RS free) and waits for F0.
The datum read from memory by LD F2 45(R3) is written both to register F2 and to the RSs of SUBD and MULTD, which are waiting for it.
The load FU is freed.
60
Cycle 6

Instruction status
Instruction   j     k     Issue   Execution   Write
LD     F6     34    R2    1       2--3        4
LD     F2     45    R3    2       3--4        5
MULTD  F0     F2    F4    3       6 --
SUBD   F8     F6    F2    4       6 --
DIVD   F10    F0    F6    5
ADDD   F6     F8    F2    6

Reservation stations (Time = cycles yet to be executed for completing the execution)
Time   Name    Busy   Op     Vj          Vk          Qj      Qk
2      Add1    Yes    Sub    F6 (capt)   F2 (capt)
       Add2    Yes    Add                F2          Add1
       Add3    No
9      Mult1   Yes    Mult   F2          F4
40     Mult2   Yes    Div                F6          Mult1

Register result status
Clock        F0      F2   F4   F6     F8     F10     F12 ... F31
6       FU   Mult1             Add2   Add1   Mult2

Now MULTD can execute (F2 and F4 available).
ADDD is emitted (RS free) and waits for F8.
DIVD waits for F0 (40 cycles still to be executed).
61
Cycle 7

Instruction status
Instruction   j     k     Issue   Execution   Write
LD     F6     34    R2    1       2--3        4
LD     F2     45    R3    2       3--4        5
MULTD  F0     F2    F4    3       6 --
SUBD   F8     F6    F2    4       6 -- 7
DIVD   F10    F0    F6    5
ADDD   F6     F8    F2    6

Reservation stations
Time   Name    Busy   Op     Vj          Vk          Qj      Qk
1      Add1    Yes    Sub    F6 (capt)   F2 (capt)
       Add2    Yes    Add                F2          Add1
       Add3    No
8      Mult1   Yes    Mult   F2          F4
40     Mult2   Yes    Div                F6          Mult1

Register result status
Clock        F0      F2   F4   F6     F8     F10     F12 ... F31
7       FU   Mult1             Add2   Add1   Mult2

SUBD (like ADDD) takes two cycles.
ADDD is stalled waiting for SUBD (F8).
The datum in F6 will be overwritten by ADDD, but it was already read and is present in the RS of DIVD.
62
Cycle 8

Instruction status
Instruction   j     k     Issue   Execution   Write
LD     F6     34    R2    1       2--3        4
LD     F2     45    R3    2       3--4        5
MULTD  F0     F2    F4    3       6 --
SUBD   F8     F6    F2    4       6 -- 7      8
DIVD   F10    F0    F6    5
ADDD   F6     F8    F2    6

Reservation stations
Time   Name    Busy   Op     Vj   Vk   Qj      Qk
0      Add1    No
2      Add2    Yes    Add    F8   F2
       Add3    No
7      Mult1   Yes    Mult   F2   F4
40     Mult2   Yes    Div         F6   Mult1

Register result status
Clock        F0      F2   F4   F6     F8   F10     F12 ... F31
8       FU   Mult1             Add2        Mult2

NB: SUBD ends before MULTD and allows ADDD (which captures the result F8) to start executing.
The FU of SUBD (Add1) is freed.
63
Cycle 9

Instruction status
Instruction   j     k     Issue   Execution   Write
LD     F6     34    R2    1       2--3        4
LD     F2     45    R3    2       3--4        5
MULTD  F0     F2    F4    3       6 --
SUBD   F8     F6    F2    4       6 -- 7      8
DIVD   F10    F0    F6    5
ADDD   F6     F8    F2    6       9 --

Reservation stations
Time   Name    Busy   Op     Vj   Vk   Qj      Qk
       Add1    No
2      Add2    Yes    Add    F8   F2
       Add3    No
6      Mult1   Yes    Mult   F2   F4
40     Mult2   Yes    Div         F6   Mult1

Register result status
Clock        F0      F2   F4   F6     F8   F10     F12 ... F31
9       FU   Mult1             Add2        Mult2

ADDD is executing.
64
Cycle 10

Instruction status
Instruction   j     k     Issue   Execution   Write
LD     F6     34    R2    1       2--3        4
LD     F2     45    R3    2       3--4        5
MULTD  F0     F2    F4    3       6 --
SUBD   F8     F6    F2    4       6 -- 7      8
DIVD   F10    F0    F6    5
ADDD   F6     F8    F2    6       9 -- 10

Reservation stations
Time   Name    Busy   Op     Vj   Vk   Qj      Qk
       Add1    No
1      Add2    Yes    Add    F8   F2
       Add3    No
5      Mult1   Yes    Mult   F2   F4
40     Mult2   Yes    Div         F6   Mult1

Register result status
Clock        F0      F2   F4   F6     F8   F10     F12 ... F31
10      FU   Mult1             Add2        Mult2

ADDD takes two execution cycles.
65
Cycle 11

Instruction status
Instruction   j     k     Issue   Execution   Write
LD     F6     34    R2    1       2--3        4
LD     F2     45    R3    2       3--4        5
MULTD  F0     F2    F4    3       6 --
SUBD   F8     F6    F2    4       6 -- 7      8
DIVD   F10    F0    F6    5
ADDD   F6     F8    F2    6       9 -- 10     11

Reservation stations (Time = cycles yet to be executed for completing the execution)
Time   Name    Busy   Op     Vj   Vk   Qj      Qk
       Add1    No
0      Add2    No
       Add3    No
4      Mult1   Yes    Mult   F2   F4
40     Mult2   Yes    Div         F6   Mult1

Register result status
Clock        F0      F2   F4   F6   F8   F10     F12 ... F31
11      FU   Mult1                       Mult2

ADDD too ends before MULTD and DIVD.
The FU of ADDD (Add2) is freed.
66
Cycle 12

Instruction status
Instruction   j     k     Issue   Execution   Write
LD     F6     34    R2    1       2--3        4
LD     F2     45    R3    2       3--4        5
MULTD  F0     F2    F4    3       6 --
SUBD   F8     F6    F2    4       6 -- 7      8
DIVD   F10    F0    F6    5
ADDD   F6     F8    F2    6       9 -- 10     11

Reservation stations (Time = cycles yet to be executed for completing the execution)
Time   Name    Busy   Op     Vj   Vk   Qj      Qk
       Add1    No
       Add2    No
       Add3    No
3      Mult1   Yes    Mult   F2   F4
40     Mult2   Yes    Div         F6   Mult1

Register result status
Clock        F0      F2   F4   F6   F8   F10     F12 ... F31
12      FU   Mult1                       Mult2

DIVD is waiting for the data produced by MULTD.
67
Cycle 15

Instruction status
Instruction   j     k     Issue   Execution   Write
LD     F6     34    R2    1       2--3        4
LD     F2     45    R3    2       3--4        5
MULTD  F0     F2    F4    3       6 -- 15
SUBD   F8     F6    F2    4       6 -- 7      8
DIVD   F10    F0    F6    5
ADDD   F6     F8    F2    6       9 -- 10     11

Reservation stations
Time   Name    Busy   Op     Vj   Vk   Qj      Qk
       Add1    No
       Add2    No
       Add3    No
1      Mult1   Yes    Mult   F2   F4
40     Mult2   Yes    Div         F6   Mult1

Register result status
Clock        F0      F2   F4   F6   F8   F10     F12 ... F31
15      FU   Mult1                       Mult2

DIVD is waiting for the data produced by MULTD.
68
Cycle 16

Instruction status
Instruction   j     k     Issue   Execution   Write
LD     F6     34    R2    1       2--3        4
LD     F2     45    R3    2       3--4        5
MULTD  F0     F2    F4    3       6 -- 15     16
SUBD   F8     F6    F2    4       6 -- 7      8
DIVD   F10    F0    F6    5
ADDD   F6     F8    F2    6       9 -- 10     11

Reservation stations
Time   Name    Busy   Op    Vj   Vk   Qj   Qk
       Add1    No
       Add2    No
       Add3    No
0      Mult1   No
40     Mult2   Yes    Div   F0   F6

Register result status
Clock        F0   F2   F4   F6   F8   F10     F12 ... F31
16      FU                            Mult2

Now DIVD can execute.
The FU of MULTD (Mult1) is freed.
69
Cycle 56

Instruction status
Instruction   j     k     Issue   Execution   Write
LD     F6     34    R2    1       2--3        4
LD     F2     45    R3    2       3--4        5
MULTD  F0     F2    F4    3       6 -- 15     16
SUBD   F8     F6    F2    4       6 -- 7      8
DIVD   F10    F0    F6    5       17 -- 56
ADDD   F6     F8    F2    6       9 -- 10     11

Reservation stations
Time   Name    Busy   Op    Vj   Vk   Qj   Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
0      Mult2   Yes    Div   F0   F6

Register result status
Clock        F0   F2   F4   F6   F8   F10     F12 ... F31
56      FU                            Mult2
70
Cycle 57

Instruction status
Instruction   j     k     Issue   Execution complete   Write Result
LD     F6     34    R2    1       2--3                 4
LD     F2     45    R3    2       3--4                 5
MULTD  F0     F2    F4    3       6 -- 15              16
SUBD   F8     F6    F2    4       6 -- 7               8
DIVD   F10    F0    F6    5       17 -- 56             57
ADDD   F6     F8    F2    6       9 -- 10              11

Reservation stations
Time   Name    Busy   Op   Vj   Vk   Qj   Qk
       Add1    No
       Add2    No
       Add3    No
       Mult1   No
       Mult2   No

Register result status
Clock        F0   F2   F4   F6   F8   F10   F12 ... F31
57      FU
71
A demo can be found at
http://www.ecs.umass.edu/ece/koren/architecture/Tomasulo1/tomasulo_files/tomasulo.htm
Limits of Tomasulo Algorithm
73
• NOT precise interrupts
• Very complex
• Each CDB must be connected to each RS: complex wiring. Reducing the number of CDBs means reduced efficiency
• If a single CDB is present, only one instruction per cycle can end
• Out-of-order instruction completion!
Exceptions
74
• Exception/interrupt: non-programmed control transfer
  Return address and all other information necessary to restore the interrupted situation must be saved
  A «response» subroutine (handler) must be executed
• Two exception types: interrupt and trap
• Interrupts: external causes
  The user program is interrupted and then restored
  Asynchronous to the current process
  Acknowledged at the end of the current instruction (if interrupts are enabled)
  The handler is the responsibility of the user program
• Traps: internal causes
  Exceptional conditions (overflow, division by zero, etc.)
  Errors (e.g. parity)
  Page fault (or, see later, segment fault): data not available in memory
  Synchronous to the current process
  Operating system handler
  The instruction can be interrupted during its execution (e.g. page fault) and therefore must be «restartable». The executing program is normally temporarily aborted.
Examples
75
Instruction Restart
Precise exceptions/interrupts
76
• Exceptions must be “precise”, that is, their behaviour must be the same that would occur in a “non-pipelined” architecture
• Precise: the machine status is saved as if the code had been executed up to the exception:
  All preceding instructions must be terminated
  All instructions following the instruction which provoked the exception must be handled as if they had never started
  The same code must execute identically on different architectures
• Complex problem with pipelines, OOO execution (see later), etc.
• Scoreboard and Tomasulo have in-order emission but out-of-order execution (and therefore out-of-order termination)
• Precise exceptions (interrupts): in-order instruction commitment
78
Reorder Buffer (ROB)
[Figure: FP Op Queue, ROB, FP Regs, reservation stations and FP adders]
• FIFO queue
• Stores pointers to all instructions in FIFO order as they are emitted. For the sake of simplicity we say that the instruction is virtually inserted in the ROB
• When instructions terminate, their results are stored in the ROB (instead of the RF), which also provides the operands to the other instructions which require them (renaming!)
• Commitment: the results of the instruction which has reached the top slot of the FIFO are transferred to the architectural registers (the registers which could be read by a test program)
• Easy “undo” of speculated instructions (see later), of erroneously predicted branches, or of exceptions
• Automatic WAW avoidance
79
Tomasulo again
80
Tomasulo in 4 steps
• Emission: an instruction is emitted from the instruction queue when an RS and a ROB slot are available. The RS records the operand sources and the ROB slot where the instruction will be “parked” after its execution (this phase is called “dispatch”). The results are NOT written to the RF until the commitment phase. NB: the lack of either of the two conditions blocks the emission of the following instructions
• Execution: the operation is performed on the operands. If they are not yet ready, they can be in the ROB (in this case the operand values computed by the nearest previous instructions are used) or still being computed in the FU. This phase is referred to as “issue”
• Result writeback: execution ends. The result is transmitted on the CDB to the RSs waiting for it and to the ROB
• Commitment: the architectural registers (or memory) are updated with the results stored in the ROB when the instruction is at the top of the ROB FIFO. In case of an erroneously predicted branch the ROB results are simply dropped (“graduation”)
N.B. Sometimes several instructions can be committed simultaneously. If the destination is the same (unlikely, otherwise the compiler would have dropped the first one) the result of the most recent instruction is used.
EMISSION IN ORDER, COMMITMENT IN ORDER
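The in-order commitment can be illustrated with a minimal Python sketch. All names here (`RobEntry`, `ReorderBuffer`, the `mispredicted` flag) are hypothetical and chosen for illustration; reservation stations, the CDB, and exceptions are left out, and the register file is a plain dictionary.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    instruction: str
    dest: Optional[str]           # architectural destination (None for branches)
    value: Optional[float] = None
    done: bool = False            # set at result writeback
    mispredicted: bool = False    # set when a branch turns out to be wrong


class ReorderBuffer:
    def __init__(self, size: int):
        self.size = size
        self.entries = deque()    # FIFO: left end is the top of the ROB

    def emit(self, instruction: str, dest: Optional[str]):
        if len(self.entries) == self.size:
            return None           # no free slot: emission stalls
        entry = RobEntry(instruction, dest)
        self.entries.append(entry)
        return entry

    def commit(self, register_file: dict) -> list:
        """Retire terminated instructions from the top of the FIFO, in order."""
        committed = []
        while self.entries and self.entries[0].done:
            head = self.entries.popleft()
            committed.append(head.instruction)
            if head.mispredicted:
                self.entries.clear()   # younger, speculative results are dropped
                break
            if head.dest is not None:
                register_file[head.dest] = head.value
        return committed


rob = ReorderBuffer(size=4)
rf = {}
ld = rob.emit("LD F0, 10(R2)", "F0")
add = rob.emit("FADD F10, F4, F0", "F10")
add.value, add.done = 7.0, True   # terminates first, out of order
first = rob.commit(rf)            # nothing commits: LD is still at the top
ld.value, ld.done = 3.0, True
second = rob.commit(rf)           # now both commit, in program order
```

The FADD result waits in the ROB until the older LD has committed, so the architectural registers are always updated in program order.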
81
HW with ROB
[Figure: FP Op Queue, Reorder Buffer, FP Regs, compare network, reservation stations and FP adders]
• ROB is a circular queue
• ROB slot fields: Destination Register, Result, Exception?, Valid (terminated), Program Counter
• The Program Counter is used, for instance, for branches
82
Example
LD   F0, 10(R2)      3 cycles
FADD F10, F4, F0     5 cycles
FDIV F2, F10, F6     20 cycles
BRNE F2, +100
LD   F4, 0(R3)
FADD F0, F4, F6
ST   0(R3), F4
83
Tomasulo with ROB - cycle 1

ROB
Slot   Dest.   Instruction      Source   Completed?
ROB1   F0      LD F0,10(R2)              N            (ROB top, ROB end)
ROB2 to ROB7: empty

Reservation stations (FP adders and FP multipliers; fields: ROB position, Cod. Op., Operands): empty
Memory unit M1 (to/from memory): ROB position 1, address 10+R2
[Figure also shows the FP Op queue and the FP registers]
84
Tomasulo with ROB - cycle 2

ROB
Slot   Dest.   Instruction               Source   Completed?
ROB1   F0      LD F0,10(R2)                       Ex           (Top)
ROB2   F10     FADD F10, F4, F0 [ROB1]   ROB1     N            (End)
ROB3 to ROB7: empty

Reservation stations, FP adders: ROB position 2, FADD F10, F4, ROB1
Memory unit M1 (to/from memory): ROB position 1, address 10+R2

Renaming: the RAW on F0 is resolved by pointing to ROB1.
There can also be two ROB sources.
Memory takes 2 clocks. There are three slots for memory operations.
85
Tomasulo with ROB - cycle 3

ROB
Slot   Dest.   Instruction                Source   Completed?
ROB1   F0      LD F0,10(R2)                        Ex           (Top)
ROB2   F10     FADD F10, F4, F0 [ROB1]    ROB 1    N
ROB3   F2      FDIV F2, F10 [ROB2], F6    ROB 2    N            (End)
ROB4 to ROB7: empty

Reservation stations, FP adders: ROB position 2, FADD F10, F4, ROB1
Reservation stations, FP multipliers: ROB position 3, FDIV F2, ROB2, F6
Memory unit M1 (to/from memory): ROB position 1, address 10+R2
86
Tomasulo with ROB - cycle 5

ROB
Slot   Dest.   Instruction                Source   Completed?
ROB1                                               Completed and committed (F0)
ROB2   F10     FADD F10, F4, F0                    Ex           (Top)
ROB3   F2      FDIV F2, F10 [ROB2], F6    ROB2     N
ROB4   --      BRNE F2 [ROB3], +100                N
ROB5   F4      LD F4,0(R3)                         Ex
ROB6   F0      FADD F0, F4 [ROB5], F6     ROB5     N            (End)
ROB7: empty

Reservation stations, FP adders: ROB position 2, FADD F10, F4, F0; ROB position 6, FADD F0, ROB5, F6
Reservation stations, FP multipliers: ROB position 3, FDIV F2, ROB2, F6
Memory unit M1 (to/from memory): ROB position 5, address 0+R3
FP registers: F0 (updated by memory op ROB 1)

In cycle 4 (end of the first LD) FADD F10, F4, F0 started executing: its F0 operand was captured on the fly and is no longer present in the ROB.
BRNE was emitted in cycle 4 in parallel with LD F4, 0(R3).
The instructions still in the ROB are not yet committed.
87
Tomasulo with ROB - cycle 6

ROB
Slot   Dest.   Instruction                Source   Completed?
ROB2   F10     FADD F10, F4, F0                    Ex           (Top)
ROB3   F2      FDIV F2, F10 [ROB2], F6    ROB2     N
ROB4   --      BRNE F2 [ROB3], +100       ROB3     N
ROB5   F4      LD F4, 0(R3)                        Ex
ROB6   F0      FADD F0, F4 [ROB5], F6     ROB5     N
ROB7           ST 0(R3), F4 [ROB5]        ROB5     N            (End)

Reservation stations, FP adders: ROB position 2, FADD F10, F4, F0; ROB position 6, FADD F0, ROB5, F6
Reservation stations, FP multipliers: ROB position 3, FDIV F2, ROB2, F6
Memory unit M1 (to/from memory): ROB position 5, address 0+R3
FP registers: F0 (updated by memory op ROB 1)

NB: ST can start its execution when LD F4, 0(R3) has terminated its execution, NOT when it is committed.
88
Register Renaming
• But when an emitted instruction must use a register, where can it be found? In the ROB or in the RF? The entire ROB would have to be searched for the most recent slot (if any) whose destination is the required register: the instruction should point either to that slot or, if there is none, to the RF. A complex and slow procedure
• Solution: use a number of physical registers greater than the number of architectural registers (the registers known to the assembly language programmer, i.e. the ISA) and keep a pointer to the most recent one (possibly not yet architectural)
• Whenever an instruction inserted in the ROB must write a register (e.g. F17), it points to a new physical register associated with that register (F17), where the result will be temporarily stored. Any following instruction which must use the register (F17) will use that physical register
• At each commitment, the pointer to the architectural register is made to point to the physical register linked to the committed instruction. When a new instruction regarding the same architectural register is committed, the pointer is changed again (and the physical register previously embodying the architectural register is freed)
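The mechanism can be sketched as a mapping table plus a list of free physical registers. The Python class below is a hypothetical illustration, assuming a single pool of physical registers shared by all architectural registers and numbering physical registers with plain integers.

```python
class Renamer:
    def __init__(self, arch_regs, n_physical):
        # Initially architectural register number i lives in physical register i.
        self.committed = {reg: i for i, reg in enumerate(arch_regs)}
        self.latest = dict(self.committed)  # most recent (possibly uncommitted) copy
        self.free = list(range(len(arch_regs), n_physical))

    def rename(self, dest, sources):
        """Map the sources to their latest copies and give dest a new register."""
        src_phys = [self.latest[src] for src in sources]  # read before remapping dest
        if not self.free:
            return None  # no physical register available: the instruction stalls
        new = self.free.pop(0)
        self.latest[dest] = new
        return new, src_phys

    def commit(self, dest, phys):
        """The committed copy becomes architectural; the old copy is freed."""
        old = self.committed[dest]
        self.committed[dest] = phys
        self.free.append(old)


r = Renamer(["R2", "R5", "R6", "R8", "R9", "R10"], n_physical=10)
ld = r.rename("R2", ["R5"])           # LD  R2, 10(R5)
mul = r.rename("R8", ["R2", "R5"])    # MUL R8, R2, R5 reads the new copy of R2
add = r.rename("R2", ["R9", "R6"])    # a second writer of R2 gets another copy
div = r.rename("R2", ["R2", "R10"])   # DIV R2, R2, R10 reads the copy written by add
r.commit("R2", ld[0])                 # the LD commits
```

Every writer of R2 receives its own physical register, so later readers simply follow the `latest` pointer and no search of the ROB is needed.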
89
An example with R2
Circular queue of register R2: R2-0, R2-1, R2-2, R2-3, R2-4, R2-5, R2-6, R2-7, R2-8
Architectural register: R2-1, then R2-2
Pointer to the first free register R2 when LD R2, 10(R5) is emitted: R2-4 (the first free physical register associated with R2)

Let’s suppose that R2-2 and R2-3 are already occupied by previous, not yet committed instructions.

LD   R2, 10(R5)     ; R2-4 (destination)
MUL  R8, R2, R5     ; R2-4 (source)
RADD R2, R9, R6     ; R2-5 (destination)
DIV  R2, R2, R10    ; R2-6 (destination) and R2-5 (source) (here commitment of the instruction using R2-2)

• When LD R2, 10(R5) is emitted, register R2-4 is given to it as destination; it will be used by MUL (as soon as the new datum is computed). Now R2-2, R2-3 and R2-4 are «busy» and the first free register is R2-5. R2-2, R2-3 and R2-4 will be freed as soon as the related instructions end. If the commitment is “in order” all hazards disappear. R2-1 is the architectural R2 register. R2-2 will become the architectural R2 register at the commitment of the related instruction. The busy registers are freed when no longer needed
• There is no longer any distinction between register file and ROB locations. Normally there are 40-120 physical registers
90
HW support for register renaming
• A large number of physical registers
• Fast mapping between architectural and physical registers (at run time)
• Free/busy register table. Two solutions: one pool of physical registers for all architectural registers, or one pool for each architectural register
• If no physical register is available (in the circular queue), the instruction is stalled. There is also no emission if no free slot in the ROB or no RS is available
ROB «without Tomasulo»
91
• Instructions are emitted as soon as a free slot in the ROB and a physical destination register are available, using register renaming
• For each FU there is a virtual queue whose slots point to the ROB slots which require that FU
• The instructions of this queue are executed as soon as the required operands are available
• When two instructions are ready for execution, the FIFO rule applies (so as to speed up the commitment, which is always in order)
92
ROB and speculation
• Dynamic instruction execution which guarantees precise interrupts; these are checked at instruction commitment, always in order
• Cancellation of speculative instructions when a branch is erroneously predicted
The prediction error must be detected ASAP. The cancellation of the post-branch instructions erroneously executed allows the preceding instructions to keep executing. The erroneously executed instructions have not yet been committed.
Detecting the misprediction early avoids the execution of useless instructions (sometimes very time-expensive). It must be remembered that not only is the ROB flushed, but all the instructions already in the pipeline are cancelled too.
A separate Return Stack Buffer (RSB) is needed for speculative calls (otherwise the stack could be damaged). It is a separate stack whose content is copied onto the stack if the branch has been correctly predicted as taken. All instructions following a not yet committed branch use this stack. In case of misprediction the RSB content is cancelled.
93
Example - 1
Same execution times as in the previous Tomasulo example.

FLD  F4, 0(R10)
FDIV F8, F0, F4
FMUL F4, F2, F3
FMUL F4, F4, F4
FADD F6, F10, F4
FLD  F4, 0(R5)

The instructions are linked by RAW and WAW hazards.
94
Tomasulo without ROB and with renaming (reservation stations). The multiplication FU executes the divisions too.
Three RS for LOAD, 2 for STORE, 2 for MUL/DIV.

Clock 0

Instruction status
Instruction   j     k     Issue   Exe. Compl.   Write Result
FLD   F4      0     R10
FDIV  F8      F0    F4
FMUL  F4      F2    F3
FMUL  F4      F4    F4
FADD  F6      F10   F4
FLD   F4      0     R5

Load and store buffers (fields: Busy, Address): all free
Reservation stations (fields: Time, Name, Busy, Op, Vj, Vk, Qj, Qk): Add1, Add2, Add3, Mult1, Mult2 all free
Register result status (FU for F2, F4, F6, F8, F10): all blank
95
CLOCK 1

Instruction status
Instruction   j     k     Issue   Exe. Compl.   Write Result
FLD   F4      0     R10   1
FDIV  F8      F0    F4
FMUL  F4      F2    F3
FMUL  F4      F4    F4
FADD  F6      F10   F4
FLD   F4      0     R5

Load buffers: Load1 busy (yes), address R10
Reservation stations: Add1, Add2, Add3, Mult1, Mult2 all free
Register result status: F4 = Load1
Three RS for LOAD, 2 for STORE, 2 for MUL/DIV.
96
CLOCK 2

Instruction status
Instruction   j     k     Issue   Exe. Compl.   Write Result
FLD   F4      0     R10   1       2-
FDIV  F8      F0    F4    2
FMUL  F4      F2    F3
FMUL  F4      F4    F4
FADD  F6      F10   F4
FLD   F4      0     R5

Load buffers: Load1 busy (yes), address R10

Reservation stations
Time   Name    Busy   Op    Vj   Vk   Qj   Qk
       Mult1   yes    div   F0             Load1
Add1, Add2, Add3, Mult2: free

Register result status: F4 = Load1, F8 = Mult1
Three RS for LOAD, 2 for STORE, 2 for MUL/DIV.
97
CLOCK 3

Instruction status
Instruction   j     k     Issue   Exe. Compl.   Write Result
FLD   F4      0     R10   1       2-3
FDIV  F8      F0    F4    2
FMUL  F4      F2    F3    3
FMUL  F4      F4    F4
FADD  F6      F10   F4
FLD   F4      0     R5

Load buffers: Load1 busy (yes), address R10

Reservation stations
Time   Name    Busy   Op    Vj   Vk   Qj   Qk
       Mult1   yes    div   F0             Load1
       Mult2   yes    mul   F2   F3
Add1, Add2, Add3: free

Register result status: F4 = Mult2, F8 = Mult1
Three RS for LOAD, 2 for STORE, 2 for MUL/DIV.
98
CLOCK 4

Instruction status
Instruction   j     k     Issue   Exe. Compl.   Write Result
FLD   F4      0     R10   1       2-3           4
FDIV  F8      F0    F4    2       4-
FMUL  F4      F2    F3    3       4-
FMUL  F4      F4    F4
FADD  F6      F10   F4
FLD   F4      0     R5

Reservation stations
Time   Name    Busy   Op    Vj   Vk   Qj   Qk
39     Mult1   yes    div   F0   F4
9      Mult2   yes    mul   F2   F3
Add1, Add2, Add3: free

Register result status: F4 = Mult2, F8 = Mult1

The second FMUL is stalled for lack of a free RS until cycle 13 (end of the preceding multiplication: there are only two slots in the multiply FU). It blocks the emission of FADD, which could otherwise be emitted since there are two free slots in the corresponding RS.
Three RS for LOAD, 2 for STORE, 2 for MUL/DIV.
99
Example - 2
ROB and register renaming. Same instruction stream:

80000000: FLD  F4, 0(R10)
80000004: FDIV F8, F0, F4
80000008: FMUL F4, F2, F3
8000000C: FMUL F4, F4, F4
80000010: FADD F6, F10, F4
80000014: FLD  F4, 0(R5)

The instructions are linked by RAW and WAW hazards.

The instructions are in any case inserted in the ROB when a free slot and a physical register (one of the many associated with the same architectural register) are available, and are then executed when the FU and the operands are available (the policy of all modern processors). In this way instructions are not only terminated OOO (with the results reordered in the ROB) but also emitted even if the FU is not available. The execution is totally OOO, but with in-order commitment.
100
Initial situation

ROB (slots 1 to 5, fields: Addr, Op., Des, Source): empty

RAT (Register Allocation Table): renaming registers for F4, F6 and F8
F4:   P0 Free   P1 Free   P2 Free   P3 Free   P4 Free   P5 Arch
F6:   Q0 Busy   Q1 Free   Q2 Free   Q3 Free   Q4 Free   Q5 Arch
F8:   Z0 Busy   Z1 Free   Z2 Free   Z3 Free   Z4 Free   Z5 Arch

Free: the first free register of each row is the top free register of that circular queue.
Arch: these are the architectural registers, which a program monitor would display.
Busy: these are registers in use by not yet committed instructions. They will become architectural registers when the related instructions are committed.
Here we assume that the instruction using Z0 precedes the instruction using Q0. The RAT for R5, R10, F0, F2, F10 is not displayed.
102
CLOCK 1

Instruction status
Instruction   j     k     Issue   Exe. Compl.   Write Result
FLD   F4      0     R10   1
FDIV  F8      F0    F4
FMUL  F4      F2    F3
FMUL  F4      F4    F4
FADD  F6      F10   F4
FLD   F4      0     R5

Load and store buffers (Load1, Load2, Load3, Store 1, Store 2): Load1 busy (yes), address R10
Reservation stations: Add1, Add2, Add3, Mult1, Mult2 all free

ROB
Slot   Addr       Op.    Des   Source
1      80000000   FLD    P0    0,R10

RAT
F4:   P0 Busy   P1 Free   P2 Free   P3 Free   P4 Free   P5 Arch
F6:   Q0 Busy   Q1 Free   Q2 Free   Q3 Free   Q4 Free   Q5 Arch
F8:   Z0 Busy   Z1 Free   Z2 Free   Z3 Free   Z4 Free   Z5 Arch

Renaming: the first available register for F4 is used.
103
CLOCK 2

Instruction status
Instruction   j     k     Issue   Exe. Compl.   Write Result
FLD   F4      0     R10   1       2-
FDIV  F8      F0    F4    2
FMUL  F4      F2    F3
FMUL  F4      F4    F4
FADD  F6      F10   F4
FLD   F4      0     R5

Load and store buffers: Load1 busy (yes), address R10

Reservation stations
Time   Name    Busy   Op    Vj   Vk   Qj   Qk
       Mult1   yes    div   F0             P0
Add1, Add2, Add3, Mult2: free

ROB
Slot   Addr       Op.    Des   Source
1      80000000   FLD    P0    0,R10
2      80000004   FDIV   Z1    F0,P0

RAT
F4:   P0 Busy   P1 Free   P2 Free   P3 Free   P4 Free   P5 Arch
F6:   Q0 Busy   Q1 Free   Q2 Free   Q3 Free   Q4 Free   Q5 Arch
F8:   Z0 Busy   Z1 Busy   Z2 Free   Z3 Free   Z4 Free   Z5 Arch

Renaming: P0 is the most recently attributed physical register for F4.
104
CLOCK 3

Instruction status
Instruction   j     k     Issue   Exe. Compl.   Write Result
FLD   F4      0     R10   1       2-3
FDIV  F8      F0    F4    2
FMUL  F4      F2    F3    3
FMUL  F4      F4    F4
FADD  F6      F10   F4
FLD   F4      0     R5

Load and store buffers: Load1 busy (yes), address R10

Reservation stations
Time   Name    Busy   Op    Vj   Vk   Qj   Qk
       Mult1   yes    div   F0             P0     (waiting for F4 (P0))
       Mult2   yes    mul   F2   F3
Add1, Add2, Add3: free

ROB
Slot   Addr       Op.    Des   Source
1      80000000   FLD    P0    0,R10
2      80000004   FDIV   Z1    F0,P0
3      80000008   FMUL   P1    F2,F3

RAT
F4:   P0 Busy   P1 Busy   P2 Free   P3 Free   P4 Free   P5 Arch
F6:   Q0 Busy   Q1 Free   Q2 Free   Q3 Free   Q4 Free   Q5 Arch
F8:   Z0 Arch   Z1 Busy   Z2 Free   Z3 Free   Z4 Free   Z5 Free

The previous instruction using Z0 has been committed: Z0 is now the architectural register.
105
CLOCK 4

Instruction status
Instruction   j     k     Issue   Exe. Compl.   Write Result
FLD   F4      0     R10   1       2-3           4
FDIV  F8      F0    F4    2       4-
FMUL  F4      F2    F3    3       4-
FMUL  F4      F4    F4    4
FADD  F6      F10   F4
FLD   F4      0     R5

Reservation stations
Time   Name    Busy   Op    Vj   Vk   Qj   Qk
39     Mult1   yes    div   F0
9      Mult2   yes    mul   F2   F3
Add1, Add2, Add3: free

ROB
Slot   Addr       Op.    Des   Source
1      80000000   FLD    P0    0,R10
2      80000004   FDIV   Z1    F0,P0
3      80000008   FMUL   P1    F2,F3
4      8000000C   FMUL   P2    P1,P1

RAT
F4:   P0 Busy   P1 Busy   P2 Busy   P3 Free   P4 Free   P5 Arch
F6:   Q0 Arch   Q1 Free   Q2 Free   Q3 Free   Q4 Free   Q5 Free
F8:   Z0 Arch   Z1 Busy   Z2 Free   Z3 Free   Z4 Free   Z5 Free

The first FLD has ended but is not yet committed!
The second FMUL is not yet executable but is nevertheless inserted in the ROB. It does not block the emission of the following instructions.
The instruction using Q0 has ended its execution: Q0 is now the architectural register.
106
FLD F4 0 R10FDIV F8 F0 F4FMUL F4 F2 F3FMUL F4 F4 F4FADD F6 F10 F4FLD F4 0 R5
mulyesF0divyes
Yet 8 cycles
32
2-31
Instruction status Exe.Instruction j k Issue Compl. Busy
Load1Address
WriteResult
Time Name Busy Op Vj Vk
Clock 5
Add1Add2Add3Mult1Mult2
P0
F2 F3
4-
4
4
5
yes add F10 P2
Yet 38 cycles
Load23-
Parallelism
Load3Store 1Store 2
CLOCK 5
waiting for F4 (P1)
Qj Qk
Integer
Load2Store 1Store 2
Qk
F10,P2Q1FADD80000010
P1,P1P2FMUL8000000C
F2,F3P2FMUL80000008
F0,P1Z1FDIV80000004
Addr Op. Des Sorg
P1P0
Arch Busy. Busy Busy Free Free
P2 P3 P4 P5
F4ROBRAT
Q0 Q1
Arch Busy Free Free Free Free
Q2 Q3 Q4 Q5
F6
Z0 Z1
Arch Busy Free Free Free Free
Z2 Z3 Z4 Z5
F8
12345
FLD committed: the architectural register F4 is now P0