EE457
Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm)
By Gandhi Puvvada
Out of Order (OoO) Execution
2
References
• EE557 Textbook
• Prof. Dubois’ EE557 Classnotes
• Prof. Annavaram’s slides
• Prof. Patterson’s Lecture slides
3
Instruction Scheduling (Re-ordering of instructions)
• We will limit our discussion to scheduling of instructions mostly within the basic block (basic block = a straight-line code sequence with no branches).
• The compiler can perform static instruction scheduling.
• The Tomasulo Algorithm lets us schedule instructions dynamically (in hardware).
4
Static Scheduling (based on Prof. Dubois’ slides)
• Strengths
-- Hardware simplicity
-- Compiler has a global view of the code
• Weaknesses
-- cannot be CPU-implementation specific
-- cannot foresee dynamic events: cache misses, data-dependent delays, conditional branches
-- cannot pre-compute memory addresses
6
Simple 5-stage pipeline
In-order execution
RAW dependency: solve it by forwarding, if not, by stalling
Dependent instructions are stalled in the ID stage
[Figure: IF, ID, EX, M, WB pipeline stages with IM and DM]
9
Provide multiple functional units (for simplicity, we avoid talking about the floating-point execution unit and the floating-point register file)
Stall, after decoding, in queues
[Figure: IM, IF, ID, queues and functional units (Load/Store with DM, Integer, Multiply, Divide), WB]
10
Tomasulo’s plan
• Out-of-order (OoO) execution
• Multiple functional units (say, Integer, DM, Multiplier, Divider)
• Queues between the ID and EX stages (in place of the ID/EX register)
11
Out-of-order execution?! Problems all over??!!
• For the time being, no branch prediction and no speculative execution beyond branches; just stall on a conditional branch
• No support for precise exceptions
Even then, …
12
RAW, WAR, and WAW
RAW = Read After Write
lw $8, 40($2); add $9, $8, $7;
WAR = Write After Read
add $9, $8, $6; lw $8, 40($2);
WAW = Write After Write
add $9, $8, $6; lw $9, 40($2);
WAW? How is it possible? Consider a printer or a FIFO
13
RAW, WAR, and WAW (some terminology to remember)
RAW = Read After Write
lw $8, 40($2); add $9, $8, $7;
WAR = Write After Read
add $9, $8, $6; lw $8, 40($2);
WAW = Write After Write
add $9, $8, $6; lw $9, 40($2);
RAW: a true dependency
WAR: an anti-dependency (a name dependence)
WAW: an output dependency (a name dependence)
14
RAW, WAR, and WAW
• In-order execution: We need to deal with RAW only.
• Out-of-order execution: Now we need to deal with WAR and WAW besides RAW.
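The three definitions can be checked mechanically. The sketch below is only an illustration, not part of the original slides: the `Instr` record and the `hazards` function are hypothetical names, and each instruction is reduced to the register it writes and the registers it reads.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    dest: Optional[str]        # register written, or None (e.g. for sw)
    srcs: Tuple[str, ...]      # registers read


def hazards(first: Instr, second: Instr) -> set:
    """Dependencies of `second` (later in program order) on `first`."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")       # true dependency
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")       # anti-dependency (name dependence)
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")       # output dependency (name dependence)
    return found


# The three instruction pairs used on the slides.
lw_8 = Instr("$8", ("$2",))             # lw  $8, 40($2)
add_9_87 = Instr("$9", ("$8", "$7"))    # add $9, $8, $7
add_9_86 = Instr("$9", ("$8", "$6"))    # add $9, $8, $6
lw_9 = Instr("$9", ("$2",))             # lw  $9, 40($2)

print(hazards(lw_8, add_9_87))   # {'RAW'}
print(hazards(add_9_86, lw_8))   # {'WAR'}
print(hazards(add_9_86, lw_9))   # {'WAW'}
```

Only the RAW case involves a value flowing from one instruction to the other; the other two arise purely from reusing a register name.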
15
Limited Architectural Registers, More Physical Registers
Register Renaming
lw $8, 40($2); add $8, $8, $8; sw $8, 40($2);
lw $8, 60($3); add $8, $8, $8; sw $8, 60($3);
It is clear that the compiler is using $8 as a temporary register.
If there is a delay in obtaining $2, the first part of the code cannot proceed.
Unfortunately, the second part of the code cannot proceed either, because of the name dependency on $8.
16
If we had 64 registers instead of 32 registers, then perhaps the compiler might have used $48 instead of $8 and we could have executed the second part of the code before the first part!
lw $8, 40($2); add $8, $8, $8; sw $8, 40($2);
lw $48, 60($3); add $48, $48, $48; sw $48, 60($3);
This is an example of a name dependency.
17
Four different temporary registers can be used here as shown: $8, $18, $28, and $48 (or, with coded names, LION, TIGER, CAT, and ANT).
lw $8, 40($2); add $18, $8, $8; sw $18, 40($2);
lw $28, 60($3); add $48, $28, $28; sw $48, 60($3);
lw LION, 40($2); add TIGER, LION, LION; sw TIGER, 40($2);
lw CAT, 60($3); add ANT, CAT, CAT; sw ANT, 60($3);
18
Can a later implementation provide 64 registers (instead of 32) while maintaining binary compatibility with previously compiled codes?
Answer: Yes / No
Why?
19
Answer: No. We cannot change the number of architectural registers (in MIPS, the 5-bit register fields of an instruction can name only 32 registers).
Register Renaming Through Tagging Registers
This solves name dependency problems (WAR and WAW) while attending to true dependency (RAW) through waiting in queues.
20
lw $8, 40($2); add $8, $8, $8; sw $8, 40($2);
lw $8, 60($3); add $8, $8, $8; sw $8, 60($3);
square_root $2, $10;
[Figure: RST and RF, each with one entry per register $1, $2, ..., $31; the destination and dependent source registers are marked]
RST = Register Status Table
RF = Register File
24
The dispatch unit decodes and dispatches instructions.
For the destination operand, an instruction carries a TAG (but not the actual register name)!
For source operands, an instruction carries either the values or the TAGs of the operands (but not the actual register names)!
lw $8, 40($2); add $8, $8, $8; sw $8, 40($2);
lw $8, 60($3); add $8, $8, $8; sw $8, 60($3);
square_root $2, $10;
25
TAGs for destinations or sources or for both?
• A new tag is assigned to the destination register of the instruction being dispatched.
• For each of the source registers (source operands) of the instruction being dispatched, either the value of the source register (if it has not been previously tagged) or the existing tag associated with the source register (if it has been tagged already) is conveyed to the instruction.
• If a tag is conveyed for a source, then the instruction needs to wait until the original instruction with that destination tag goes onto the CDB (Common Data Bus) and announces the value.
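The three rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual EE457 design: the `Dispatcher` class, its method names, the zero-initialized register file, and the plain deque standing in for the TAG FIFO are all hypothetical. The issue queues, where a waiting instruction would capture the broadcast value, are not modeled.

```python
from collections import deque


class Dispatcher:
    def __init__(self, num_tags=64):
        self.rf = {f"${i}": 0 for i in range(32)}   # Register File (values), assumed zeroed
        self.rst = {}                                # Register Status Table: register -> pending tag
        self.free_tags = deque(range(num_tags))      # stand-in for the TAG FIFO

    def dispatch(self, op, dest, srcs):
        # Sources are looked up before the destination is retagged, so that
        # "add $8, $8, $8" picks up the tag of the earlier producer of $8.
        operands = []
        for reg in srcs:
            if reg in self.rst:
                operands.append(("tag", self.rst[reg]))    # must wait for the CDB
            else:
                operands.append(("value", self.rf[reg]))   # value is available now
        dest_tag = None
        if dest is not None:
            dest_tag = self.free_tags.popleft()            # a new tag for the destination
            self.rst[dest] = dest_tag
        return {"op": op, "dest_tag": dest_tag, "operands": operands}

    def cdb_broadcast(self, tag, value):
        # Update the RF only if this tag is still the latest one for a register.
        for reg, pending in list(self.rst.items()):
            if pending == tag:
                self.rf[reg] = value
                del self.rst[reg]
        self.free_tags.append(tag)                         # the tag can be reused


d = Dispatcher()
d.rf["$2"] = 1000
lw = d.dispatch("lw", "$8", ["$2"])            # destination gets tag 0; $2 is read as a value
add = d.dispatch("add", "$8", ["$8", "$8"])    # both sources carry tag 0; destination gets tag 1
sw = d.dispatch("sw", None, ["$8", "$2"])      # no destination; $8 carries tag 1
print(lw, add, sw, sep="\n")
```

Note that when tag 0 is later broadcast, the RF entry for $8 is not written, because $8 has already been retagged with tag 1; only the waiting add uses that value.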
26
Unique TAG
• Like an SSN, we need a unique TAG.
• SSNs are reused.
• Similarly, TAGs can be reused.
• TAGs are similar to numbered TOKENs.
27
Take a number vs. Take a token
At the State Bank of India, brass tokens are issued to customers waiting for service.
Tokens are reclaimed and reused.
28
TAGs (= Tokens)
• How many tokens should the bank cashier have to start with?
• What happens if the tokens run out?
• Does he need to keep the tokens in any particular order when holding and issuing them?
• Does he have to collect tokens back?
29
TAG FIFO (FIFOs are taught in EE560)
• To issue and collect tokens (TAGs), use a circular FIFO (First-In-First-Out) unit.
• Filled with (say) 64 tokens (in any order) initially on reset.
• Tokens return out of order anyway.
• Put returned tokens back in the FIFO and issue them again.
[Figure: circular TAG FIFO with entries 0 to 63 and write/read pointers (wp, rp), shown in three states: Full, 2 tokens issued, 1 token returned]
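A behavioral sketch of such a TAG FIFO follows. It is an illustration only: the `TagFifo` class and its method names are hypothetical, and the explicit `count` field is a software simplification for telling full from empty (a hardware FIFO may do this differently).

```python
class TagFifo:
    """Circular FIFO of free tags, filled on reset."""

    def __init__(self, depth=64):
        self.depth = depth
        self.slots = list(range(depth))   # on reset: tags 0 .. depth-1
        self.rp = 0                        # read pointer: next tag to issue
        self.wp = 0                        # write pointer: next slot for a returned tag
        self.count = depth                 # free tags currently held (full on reset)

    def issue(self):
        if self.count == 0:
            return None                    # tokens have run out: dispatch must stall
        tag = self.slots[self.rp]
        self.rp = (self.rp + 1) % self.depth
        self.count -= 1
        return tag

    def give_back(self, tag):
        assert self.count < self.depth, "more tags returned than issued"
        self.slots[self.wp] = tag          # tags may come back in any order
        self.wp = (self.wp + 1) % self.depth
        self.count += 1


fifo = TagFifo()
a = fifo.issue()          # tag 0
b = fifo.issue()          # tag 1; 2 tokens issued, rp is now 2
fifo.give_back(b)         # 1 token returned (out of order); wp is now 1
print(a, b, fifo.rp, fifo.wp, fifo.count)   # 0 1 2 1 63
```

Because a tag can only be returned after it was issued, the write pointer can never overtake the read pointer, so the FIFO depth equals the total number of tags.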
30
Block Diagram (provided by Prof. Dubois, simplified for EE457)
[Figure: block diagram; recoverable labels: Integer, Multiplier, Int. Divider, Issue Unit, TAG FIFO]
31
Front-End & Back-End
• IFQ: Instruction Fetch Queue (a FIFO structure)
• Dispatch unit (including RST, RF, Tag FIFO)
• Load Store and other Issue Queues
• Issue Unit
• Functional units
• CDB (Common Data Bus)
33
Bottleneck in the design
• CDB = Common Data Bus
Do all instructions use the CDB?
• sw?
• j (jump)?
• beq?
35
EE557 approach for address calculation
EE457/560 approach for address calculation: a dedicated adder, attached to the load-store queue, computes the address.
Address calculation for lw and sw
37
Memory Disambiguation
RAW
sw $2, 2000($0);
lw $8, 2000($0);
WAR
lw $2, 2000($0);
sw $8, 2000($0);
WAW
sw $2, 2000($0);
sw $8, 2000($0);
RAW: The later lw can proceed only if there is no store ahead of it with the same address.
WAR: The later sw can proceed only if there is no load ahead of it with the same address.
WAW: The later sw can proceed only if there is no store ahead of it with the same address.
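The three rules can be written as one check over a load-store queue kept in program order. This is a sketch under assumptions that go beyond the slide: the function name `can_proceed` and the queue layout are hypothetical, completed entries are assumed to have left the queue, and an older entry whose address is not yet computed is conservatively treated as a possible match.

```python
def can_proceed(queue, index):
    """queue: list of (kind, address) in program order, kind is "lw" or "sw";
    address is None while it is still being computed.
    Returns True if the entry at `index` may access memory now."""
    kind, addr = queue[index]
    if addr is None:
        return False                      # own address not known yet
    for older_kind, older_addr in queue[:index]:
        # A load conflicts only with older stores (RAW).
        # A store conflicts with older loads (WAR) and older stores (WAW).
        if kind == "lw" and older_kind != "sw":
            continue
        if older_addr is None or older_addr == addr:
            return False                  # same address, or unknown (assumed conflict)
    return True


print(can_proceed([("sw", 2000), ("lw", 2000)], 1))   # False (RAW)
print(can_proceed([("lw", 2000), ("sw", 2000)], 1))   # False (WAR)
print(can_proceed([("sw", 2000), ("sw", 2000)], 1))   # False (WAW)
print(can_proceed([("sw", 2004), ("lw", 2000)], 1))   # True  (different addresses)
```

A later load is never held back by an older load, since two reads of the same address do not conflict.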
39
Maintaining instructions in the order of arrival (issue order/program order) in a queue
Is it necessary or is it desirable?
In the case of the L-S Queue?
In the case of the Integer and other queues (mult queue, div queue)?
40
Maintaining instructions in the order of arrival (issue order/program order) in a queue
Is it necessary or is it desirable?
In the case of the L-S Queue? NECESSARY, to enforce the memory disambiguation rules.
In the case of the Integer and other queues (mult queue, div queue)?
DESIRABLE, so that an earlier instruction gets executed whenever possible, thereby perhaps reducing the number of instructions waiting on it.
41
Priority (based on the order of arrival) among instructions ready to execute
• Is it necessary or is it desirable?
• Local priority within the queues
• Global priority across the queues
42
Issue Unit
• CDB availability constraint
• Pipelined functional unit vs. multi-cycle functional unit
• Conflict resolution: is round-robin priority adequate? Well, …
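For reference, plain round-robin conflict resolution for the single CDB could look like the sketch below. It is an illustration only, not the issue unit of the actual design: the function name `round_robin_grant` and the ordering of the units in the request list are assumptions.

```python
def round_robin_grant(requests, last_granted):
    """requests: one bool per functional unit that wants the CDB this clock.
    last_granted: index of the unit that was granted most recently.
    Returns the index granted this clock, or None if nobody is requesting."""
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n   # start just after the last winner
        if requests[candidate]:
            return candidate
    return None


# Hypothetical unit order: 0 = integer, 1 = load/store, 2 = multiplier, 3 = divider
print(round_robin_grant([True, False, True, False], last_granted=0))   # 2
print(round_robin_grant([True, False, True, False], last_granted=2))   # 0
print(round_robin_grant([False, False, False, False], last_granted=1)) # None
```

Round-robin is fair across units but looks only at who was served last, not at the order of arrival of the instructions, so an older instruction can lose the CDB to a younger one.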
43
Conditional branches
• The dispatch unit stops dispatching until the branch is resolved.
• The CDB broadcasts the result of the branch.
• Dispatching continues thereafter, either at the fall-through instruction or at the target instruction.
• A successful (taken) branch shall cause flushing of the IFQ, very much like a jump.
44
Conditional branches
• Since we stop dispatching instructions after a branch, does it mean that this branch is the last instruction to be executed in the back-end?
• Is it possible that the back-end simultaneously holds (a) some instructions dispatched before the branch and (b) some instructions issued after the branch was resolved?
45
Tomasulo Loop Example
Loop: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop;
• Assume Multiply takes 4 clocks
• Assume the first load takes 8 clocks (cache miss) and the second load takes 1 clock (hit)
Based on Prof. Annavaram’s lecture slide
46
How could Tomasulo overlap iterations of loops?
The destination registers get different TAGs in different iterations. These tags are given, in place of the source operands, to the dependent instructions that follow them.
Loop: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop;
47
Say, only two iterations. Let us unroll the two iterations.
Loop: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop;
Loop: LW $2, 40($1); MULT $4, $2, $3; SW $4, 40($1); ADDI $1, $1, -4; BNE $1, $0, Loop;
[Figure legend: destination register; dependent source register(s)]
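The tagging of the two unrolled iterations can be listed with a small script. This is a hedged sketch, not the hardware: the `rename` function and the tag names T0, T1, ... are hypothetical, tags are never returned and reused here, and offsets and immediates are left out because only register names matter for renaming.

```python
from itertools import count


def rename(instructions):
    """instructions: list of (op, dest, srcs) in program order.
    Returns the stream with tags in place of register names wherever an
    earlier instruction in the stream produces that register."""
    next_tag = count()
    latest = {}                 # register -> tag of its most recent producer
    renamed = []
    for op, dest, srcs in instructions:
        new_srcs = [latest.get(reg, reg) for reg in srcs]   # sources are read first
        new_dest = None
        if dest is not None:
            new_dest = f"T{next(next_tag)}"                  # fresh tag per destination
            latest[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


iteration = [
    ("LW",   "$2", ["$1"]),
    ("MULT", "$4", ["$2", "$3"]),
    ("SW",   None, ["$4", "$1"]),
    ("ADDI", "$1", ["$1"]),
    ("BNE",  None, ["$1", "$0"]),
]

for op, dest, srcs in rename(iteration * 2):   # two unrolled iterations
    print(op, dest, srcs)
```

In the listing, the first LW writes T0 and the second writes T3, so the second MULT waits on T3 rather than on the slow first load. That is what lets the second iteration run ahead while the first is still waiting on its cache miss.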