EE457

Introduction to Dynamic Scheduling of Instructions (The Tomasulo Algorithm)

By Gandhi Puvvada

Out of Order (OoO) Execution

2

References

• EE557 Textbook

• Prof. Dubois’ EE557 Classnotes

• Prof. Annavaram’s slides

• Prof. Patterson’s Lecture slides

3

Instruction Scheduling (Re-ordering of instructions)

• We will limit our discussion to scheduling of instructions mostly within the basic block (basic block = a straight-line code sequence with no branches).

• Compiler can perform static instruction scheduling.

• Tomasulo Algorithm lets us schedule instructions dynamically (in hardware).

4

Static Scheduling (based on Prof. Dubois' slides)

• Strengths
-- Hardware simplicity
-- Compiler has a global view of the code

• Weaknesses
-- Cannot be CPU-implementation specific
-- Cannot foresee dynamic events: cache misses, data-dependent delays, conditional branches
-- Cannot pre-compute memory addresses

6

Simple 5-stage pipeline
In-order execution
RAW dependency

Solve it by forwarding; if that is not possible, by stalling.

Dependent instructions are stalled in the ID stage.

[Figure: 5-stage pipeline IF, ID, EX, M, WB with instruction memory (IM) and data memory (DM)]

7

Simple 5-stage pipeline: Dependent instructions are stalled in the ID stage

[Figure: pipeline diagram with an and instruction that depends on the lw ahead of it]

8

Simple 5-stage pipeline: Dependent instructions cannot be stalled in the EX stage. Why?

[Figure: pipeline diagram with the and and lw instructions]

9

Provide multiple functional units
(for simplicity, we avoid talking about the floating-point execution unit and the floating-point register file)

Stall, after decoding, in queues

[Figure: IF and ID stages (with IM) feed queues and functional units (Load/Store with DM, Integer, Multiply, Divide), followed by WB]

10

Tomasulo’s plan

• Out-of-order (OoO) execution

• Multiple functional units (say, Integer, DM, Multiplier, Divider)

• Queues between the ID and EX stages (in place of the ID/EX register)

11

Out-of-order execution?! Problems all over??!!

• For the time being: no branch prediction and no speculative execution beyond branches; just stall on a conditional branch.

• No support for precise exceptions.

Even then, …

12

RAW, WAR, and WAW

RAW = Read After Write
lw $8, 40($2); add $9, $8, $7;

WAR = Write After Read
add $9, $8, $6; lw $8, 40($2);

WAW = Write After Write
add $9, $8, $6; lw $9, 40($2);

WAW? How is it possible? Consider a printer or a FIFO.

13

RAW, WAR, and WAW (some terminology to remember)

RAW = Read After Write: a true dependency
lw $8, 40($2); add $9, $8, $7;

WAR = Write After Read: an anti-dependency
add $9, $8, $6; lw $8, 40($2);

WAW = Write After Write: an output dependency
add $9, $8, $6; lw $9, 40($2);

WAR and WAW are name dependencies.
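The three definitions can be checked mechanically for any pair of instructions. Below is a minimal Python sketch; the `Instr` record and the `hazards` function are hypothetical names introduced here for illustration, and each instruction is reduced to the one register it writes and the set of registers it reads.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    text: str              # assembly text, for display only
    dest: Optional[str]    # register written (None for sw, branches)
    srcs: frozenset        # registers read


def hazards(first: Instr, second: Instr) -> set:
    """Dependencies of `second` on `first`, where `first` is earlier in program order."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")   # second reads what first writes (true dependency)
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")   # second overwrites what first reads (anti-dependency)
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")   # both write the same register (output dependency)
    return found


# The instruction pairs from the slide.
lw_8 = Instr("lw $8, 40($2)", "$8", frozenset({"$2"}))
add_9a = Instr("add $9, $8, $7", "$9", frozenset({"$8", "$7"}))
add_9b = Instr("add $9, $8, $6", "$9", frozenset({"$8", "$6"}))
lw_9 = Instr("lw $9, 40($2)", "$9", frozenset({"$2"}))

print(hazards(lw_8, add_9a))   # {'RAW'}
print(hazards(add_9b, lw_8))   # {'WAR'}
print(hazards(add_9b, lw_9))   # {'WAW'}
```
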

14

RAW, WAR, and WAW

• In-order execution: We need to deal with RAW only.

• Out-of-order execution: Now we need to deal with WAR and WAW besides RAW.

15

Limited Architectural Registers
More Physical Registers

Register Renaming

lw $8, 40($2);
add $8, $8, $8;
sw $8, 40($2);

lw $8, 60($3);
add $8, $8, $8;
sw $8, 60($3);

It is clear that the compiler is using $8 as a temporary register.

If there is a delay in obtaining $2, the first part of the code cannot proceed.

Unfortunately, the second part of the code cannot proceed either, because of the name dependency on $8.

16

If we had 64 registers instead of 32 registers, then perhaps the compiler might have used $48 instead of $8, and we could have executed the second part of the code before the first part!

lw $8, 40($2);
add $8, $8, $8;
sw $8, 40($2);

lw $48, 60($3);
add $48, $48, $48;
sw $48, 60($3);

This is an example of name dependency.

17

Four different temporary registers can be used here as shown: $8, $18, $28, and $48 (or, with code names, LION, TIGER, CAT, and ANT).

lw $8, 40($2);
add $18, $8, $8;
sw $18, 40($2);

lw $28, 60($3);
add $48, $28, $28;
sw $48, 60($3);

lw LION, 40($2);
add TIGER, LION, LION;
sw TIGER, 40($2);

lw CAT, 60($3);
add ANT, CAT, CAT;
sw ANT, 60($3);

18

Can a later implementation provide 64 registers (instead of 32) while maintaining binary compatibility with previously compiled code?

Answer: Yes / No

Why?

19

Answer: We cannot change the number of Architectural Registers.

Register Renaming Through Tagging Registers

This solves name dependency problems (WAR and WAW), while true dependencies (RAW) are handled by waiting in queues.

20

lw $8, 40($2);
add $8, $8, $8;
sw $8, 40($2);

lw $8, 60($3);
add $8, $8, $8;
sw $8, 60($3);

square_root $2, $10;

[Figure: RST and RF, each with one entry per register $1 ... $31. Legend: destination register, dependent source register.]

RST = Register Status Table
RF = Register File

21

24

Dispatch unit decodes and dispatches instructions.

For the destination operand, an instruction carries a TAG (but not the actual register name)!

For source operands, an instruction carries either the values or the TAGs of the operands (but not the actual register names)!

lw $8, 40($2);
add $8, $8, $8;
sw $8, 40($2);

lw $8, 60($3);
add $8, $8, $8;
sw $8, 60($3);

square_root $2, $10;

25

TAGs for destinations or sources or for both?

• A new tag is assigned to the destination register of the instruction being dispatched.

• For each of the source registers (source operands) of the instruction being dispatched, either the value of the source register (if it has not been previously tagged) or the existing tag associated with the source register (if it has been tagged already) is conveyed to the instruction.

• If a tag is conveyed for a source, then the instruction needs to wait for the original instruction with that destination tag to go on to the CDB and announce the value.
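The three rules above can be sketched as a small dispatch model. This is a minimal Python sketch under stated assumptions, not the actual hardware design: the `Dispatcher` class, its method names, the dictionary layout of a dispatched instruction, and the register values are all hypothetical, and tags come from a simple counter rather than a reusable pool.

```python
import itertools


class Dispatcher:
    """Toy dispatch unit: RF holds values, RST holds the tag of each pending writer."""

    def __init__(self, reg_values):
        self.rf = dict(reg_values)       # register name -> value
        self.rst = {}                    # register name -> tag (only if tagged)
        self._tags = itertools.count()   # assumption: endless tags, no reuse

    def dispatch(self, op, dest, srcs):
        # Sources are looked up BEFORE the destination is retagged, so that
        # "add $8, $8, $8" waits on the older tag of $8, not on its own tag.
        operands = []
        for reg in srcs:
            if reg in self.rst:
                operands.append(("tag", self.rst[reg]))    # must wait for the CDB
            else:
                operands.append(("value", self.rf[reg]))   # value is ready now
        dest_tag = None
        if dest is not None:
            dest_tag = next(self._tags)   # a new tag for the destination
            self.rst[dest] = dest_tag
        return {"op": op, "dest_tag": dest_tag, "operands": operands}


# Made-up register values; $2 and $3 are assumed to be ready here.
d = Dispatcher({"$2": 100, "$3": 200, "$8": 0})
i1 = d.dispatch("lw", "$8", ["$2"])
i2 = d.dispatch("add", "$8", ["$8", "$8"])
i3 = d.dispatch("sw", None, ["$8", "$2"])
i4 = d.dispatch("lw", "$8", ["$3"])
for instr in (i1, i2, i3, i4):
    print(instr)
```

In this sketch the second lw receives a fresh tag and carries the value of $3 directly, so it does not wait on anything that writes or reads $8 in the first part of the code.
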

26

Unique TAG

• Like an SSN, each TAG must be unique.

• SSNs are reused.

• Similarly, TAGs can be reused.

• TAGs are similar to numbered TOKENs.

27

Take a number vs. Take a token

In State Bank of India, they issue brass tokens to customers waiting for service.

Tokens are reclaimed and reused.

28

TAGs (= Tokens)

• How many Tokens should the bank cashier have to start with?

• What happens if the tokens run out?

• Does he need to have any order in holding tokens and issuing tokens?

• Does he have to collect tokens back?

29

TAG FIFO (FIFOs are taught in EE560)

• To issue and collect Tokens (TAGs), use a circular FIFO (First-in-First-Out) unit.

• Filled with (say) 64 tokens (in any order) initially on reset.

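• Tokens return out of order anyway.

• Put returned tokens back in the stack and issue them again.

[Figure: circular TAG FIFO with entries 0 to 63, a write pointer (wp) and a read pointer (rp), shown in three states: full, 2 tokens issued, 1 token returned]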
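The issue-and-return behavior of the TAG FIFO can be sketched in a few lines of Python. This is a minimal illustration, not the EE560 design: the `TagFifo` class and its method names are hypothetical, and a `count` field is assumed for tracking full/empty instead of whatever pointer comparison the hardware uses.

```python
class TagFifo:
    """Circular FIFO of tags; full on reset, tags may come back in any order."""

    def __init__(self, size=64):
        self.size = size
        self.slots = list(range(size))   # tokens 0..size-1 loaded on reset
        self.rp = 0                      # read pointer: next token to issue
        self.wp = 0                      # write pointer: slot for next returned token
        self.count = size                # tokens currently held (assumed bookkeeping)

    def issue(self):
        if self.count == 0:
            return None                  # out of tokens: dispatch has to stall
        tag = self.slots[self.rp]
        self.rp = (self.rp + 1) % self.size
        self.count -= 1
        return tag

    def give_back(self, tag):
        if self.count == self.size:
            raise ValueError("FIFO is already full")
        self.slots[self.wp] = tag
        self.wp = (self.wp + 1) % self.size
        self.count += 1


# A 4-entry FIFO keeps the trace short.
fifo = TagFifo(size=4)
a = fifo.issue()            # 0
b = fifo.issue()            # 1
fifo.give_back(b)           # token 1 returns before token 0
rest = [fifo.issue() for _ in range(4)]
print(a, b, rest)           # 0 1 [2, 3, 1, None]
```

The last `None` corresponds to the question of what happens when the tokens run out: in this sketch the caller simply gets nothing and would have to wait until a token is returned.
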

30

[Figure: block diagram provided by Prof. Dubois, simplified for EE457, showing the TAG FIFO, the Issue Unit, and the integer, multiplier, and divider units]

31

Front-End & Back-End

• IFQ Instruction Fetch Queue (a FIFO structure)

• Dispatch unit (including RST, RF, Tag FIFO)

• Load Store and other Issue Queues

• Issue Unit

• Functional units

• CDB (Common Data Bus)

33

Bottleneck in the design

• CDB = Common Data Bus

Do all instructions use the CDB?

• sw?

• j (jump)?

• beq?

34

Load-Store Queue

• Address calculation

• Memory disambiguation

35

EE557 approach for address calculation

EE457/560 approach for address calculation
A dedicated adder, attached to the load-store queue, computes the address.

Address calculation for lw and sw

36

Memory Disambiguation (EE557)

37

Memory Disambiguation

RAW
sw $2, 2000($0);
lw $8, 2000($0);

WAR
lw $2, 2000($0);
sw $8, 2000($0);

WAW
sw $2, 2000($0);
sw $8, 2000($0);

38

Memory Disambiguation

RAW
sw $2, 2000($0);
lw $8, 2000($0);
This later lw can proceed only if there is no store ahead of it with the same address.

WAR
lw $2, 2000($0);
sw $8, 2000($0);
This later sw can proceed only if there is no load ahead of it with the same address.

WAW
sw $2, 2000($0);
sw $8, 2000($0);
This later sw can proceed only if there is no store ahead of it with the same address.
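The three rules can be combined into one check over a load-store queue kept in program order. The Python sketch below is a simplification under stated assumptions: `can_proceed` is a hypothetical helper, every entry is just an (operation, address) pair, and all addresses are assumed to have been computed already.

```python
def can_proceed(queue, index):
    """True if the lw/sw at `index` may access memory ahead of earlier entries.

    `queue` is a list of (op, address) pairs in program order, oldest first.
    Assumption: every address in the queue is already known.
    """
    op, addr = queue[index]
    for earlier_op, earlier_addr in queue[:index]:
        if earlier_addr != addr:
            continue           # different address: no memory dependency
        if earlier_op == "sw":
            return False       # RAW (later lw) or WAW (later sw)
        if op == "sw":
            return False       # WAR: later sw behind an earlier lw
    return True


print(can_proceed([("sw", 2000), ("lw", 2000)], 1))   # RAW: False
print(can_proceed([("lw", 2000), ("sw", 2000)], 1))   # WAR: False
print(can_proceed([("sw", 2000), ("sw", 2000)], 1))   # WAW: False
print(can_proceed([("sw", 2000), ("lw", 2004)], 1))   # different address: True
```
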

39

Maintaining instructions in the order of arrival (issue order/program order) in a queue

Is it necessary or is it desirable?

In the case of the L-S Queue?

In the case of the Integer and other queues (mult queue, div queue)?

40

Maintaining instructions in the order of arrival (issue order/program order) in a queue

Is it necessary or is it desirable?

In the case of the L-S Queue?
NECESSARY, to enforce the memory disambiguation rules.

In the case of the Integer and other queues (mult queue, div queue)?
DESIRABLE, so that an earlier instruction gets executed whenever possible, thereby perhaps reducing the number of instructions waiting on it.

41

Priority (based on the order of arrival) among instructions ready to execute

• Is it necessary or is it desirable?

• Local priority within the queues

• Global priority across the queues

42

Issue Unit

• CDB availability constraint

• Pipelined functional unit vs. multi-cycle functional unit

• Conflict resolution: Is round-robin priority adequate? Well, …
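For reference, round-robin conflict resolution for the single CDB can be sketched as follows. This is only an illustration of the round-robin idea named on the slide, not the issue unit's actual logic: the `RoundRobinArbiter` class and the unit names are hypothetical, and the sketch ignores functional-unit latency entirely.

```python
class RoundRobinArbiter:
    """Grants the CDB to one requester per cycle, rotating the top priority."""

    def __init__(self, units):
        self.units = list(units)
        self.next_index = 0    # unit that has the highest priority this cycle

    def grant(self, requests):
        n = len(self.units)
        for offset in range(n):
            i = (self.next_index + offset) % n
            if self.units[i] in requests:
                # The winner gets the lowest priority in the next cycle.
                self.next_index = (i + 1) % n
                return self.units[i]
        return None            # nobody asked for the CDB


arb = RoundRobinArbiter(["int", "ls", "mult", "div"])
grants = [
    arb.grant({"int", "mult"}),
    arb.grant({"int", "mult"}),
    arb.grant({"int"}),
    arb.grant(set()),
]
print(grants)   # ['int', 'mult', 'int', None]
```

Note that this policy pays no attention to the order of arrival, which is one reason to ask whether round-robin alone is adequate.
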

43

Conditional branches

• The dispatch unit stops dispatching until the branch is resolved.

• The CDB broadcasts the result of the branch.

• Dispatching continues thereafter, either at the fall-through instruction or at the target instruction.

• A successful (taken) branch causes the IFQ to be flushed, very much like a jump.

44

Conditional branches

• Since we stop dispatching instructions after a branch, does it mean that this branch is the last instruction to be executed in the back-end?

• Is it possible that the back-end holds simultaneously (a) some instructions dispatched before the branch and (b) some instructions issued after the branch was resolved?

45

Tomasulo Loop Example

Loop: LW $2, 40($1);
      MULT $4, $2, $3;
      SW $4, 40($1);
      ADDI $1, $1, -4;
      BNE $1, $0, Loop;

• Assume Multiply takes 4 clocks

• Assume the first load takes 8 clocks (cache miss) and the second load takes 1 clock (hit)

Based on Prof. Annavaram’s lecture slide

46

How could Tomasulo overlap iterations of loops?

The destination registers receive different TAGs in different iterations. These tags are given, in place of the source operands, to the dependent instructions that follow them.

Loop: LW $2, 40($1);
      MULT $4, $2, $3;
      SW $4, 40($1);
      ADDI $1, $1, -4;
      BNE $1, $0, Loop;

47



Say, only two iterations. Let us unroll the two iterations.

[Figure: the two unrolled iterations, with each destination register and its dependent source register(s) marked]

48

Because there is no reorder buffer. Note: Your EE560 project will use a reorder buffer!
