Out-of-Order Commit Processors

Adrián Cristal (UPC), Daniel Ortega (HP Labs),

Josep Llosa (UPC) and Mateo Valero (UPC)

HPCA-10, Madrid, February 14-17, 2004

Motivation I

[Figure: IPC (0 to 4) versus number of in-flight instructions (128, 256, 512, 1024, 2048, 4096) for Spec FP 2000. Series: L2 Perfect, and memory latencies of 100, 500, and 1000. Annotations: 0.30X and 3.5X.]

Cristal et al.: “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002

Motivation II – Resources – ROB

Often nearly full

[Figure: Instructions in-flight (ROB=2048, Mem 500 cycles). Vertical axis: number of in-flight instructions (0 to 2000). Horizontal axis: Number of In-flight Instructions (SpecFP), with values 1168, 1382, 1607, 1868, 1955, 2034 and percentile marks 10%, 25%, 50%, 75%, 90%.]

A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003

Motivation III – Resources – FP Queue

[Figure: State of FP Queues (ROB=2048, Mem 500 cycles). Distribution of in-flight instructions. Vertical axis: FP Queue (0 to 600). Horizontal axis: Number of Instructions (1168, 1382, 1607, 1868, 1955), with percentile marks 1, 10, 25, 50, 75, 90, 100. Categories: Blocked-Long, Blocked-Short, Ready.]

• Long/Short Lat. Inst.
• Remove – Reinsert
• Dependence Chain

A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003

Outline

• Motivation
• Out-of-Order Commit
• Multicheckpointing ROB
• Slow Lane Instruction Queue
• Performance Evaluation
• Conclusion

Out-of-Order Commit

[Diagram: instruction flow between Oldest and Newest, containing Ld, I1, I2, Br 1, Ld, I3, I4, St, Br 2, I5, I6, Br 3. Labels: Oldest Checkpoint, Checkpoint, New Checkpoint.]

[Diagram: the same instruction flow with the Store Buffer. Labels: Oldest Checkpoint, Checkpoint, New Checkpoint, Store Buffer, Gang Commit, To Memory.]

[Diagram: Miss Branch Prediction, Recover from Checkpoint. Instruction flow between Oldest and Newest, containing St, I3, I4, Br 2, I5, Br 3, I7, I8. Labels: Oldest Checkpoint, Checkpoint, Store Buffer (St).]

Out-of-Order Commit II

Checkpoint Table. Each entry has:
• PC of the next instruction
• Instruction Counter: counts the number of instructions still alive
• Map Table: allows the register file to be recovered
• Pointer to the Store Buffer
• Mechanism to recover free registers
  – Future Free: one bit for each physical register
  – Large Virtual ROB: Tech. Rep. UPC-DAC-2002-39
  – Ephemeral Registers: Tech. Rep. UPC-DAC-2003.51

Checkpoint Creation

• Save the PC
• Save the Map Table
• Clear the Future Free bits
• Clear the Instruction Counter
• Get a pointer to the first free entry of the store buffer, and mark this entry in the store buffer
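The creation steps can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design: the names `Checkpoint` and `create_checkpoint` are hypothetical, the store buffer is assumed to be an append-only list, and the "mark" in the store buffer is modelled by the saved index.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """One checkpoint table entry (hypothetical layout based on the slides)."""
    next_pc: int          # PC of the next instruction
    map_table: dict       # saved copy of the logical -> physical register map
    store_ptr: int        # index of the first free store buffer entry
    instr_count: int = 0  # instructions still alive; starts cleared
    future_free: set = field(default_factory=set)  # starts cleared


def create_checkpoint(table, pc, map_table, store_buffer):
    """Append a new checkpoint to `table` and return it."""
    # Assumption: the first free entry is simply the end of the list.
    store_ptr = len(store_buffer)
    # Copy the map table so later renaming does not alter the saved state.
    ckpt = Checkpoint(next_pc=pc, map_table=dict(map_table), store_ptr=store_ptr)
    table.append(ckpt)
    return ckpt
```

Copying the map table at creation is what makes later recovery possible: the live map keeps changing as instructions are renamed, while the saved copy stays fixed.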

Instruction Decode

• Add 1 to the Instruction Counter of the newest checkpoint
• For R1 ← R2 op R3, if R1 is mapped to PhyReg_N:
  – Set the PhyReg_N bit of the Future Free vector
  – Map R1 to the new physical register
• Associate the instruction with the last created checkpoint
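As a rough illustration of the decode steps, the sketch below renames the destination of one instruction. The function name `decode`, the dictionary layout of the checkpoint, and the list used as a free-register pool are all assumptions made for this example.

```python
def decode(dest, map_table, free_list, newest_ckpt):
    """Rename the destination register `dest` of one decoded instruction.

    `newest_ckpt` is a dict with the hypothetical keys 'instr_count' and
    'future_free'. Returns the checkpoint the instruction is associated with.
    """
    # The instruction is now alive under the newest checkpoint.
    newest_ckpt["instr_count"] += 1

    # The previous mapping of `dest` can be freed once this checkpoint
    # is eliminated, so record it in the future-free set.
    old_phys = map_table.get(dest)
    if old_phys is not None:
        newest_ckpt["future_free"].add(old_phys)

    # Map the logical register to a new physical register.
    map_table[dest] = free_list.pop()
    return newest_ckpt
```

The returned checkpoint stands for the association mentioned on the slide: the instruction carries it along so that writeback can find the right counter.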

Instruction Writeback

• Decrement the Instruction Counter of the checkpoint associated with the instruction
• If the instruction is a mispredicted branch, recover from the associated checkpoint:
  – Fetch instructions from the saved PC
  – Release all entries in the store buffer from the pointed entry onwards
  – Free all registers in the Future Free vector of the entry and of all newer checkpoint entries
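The writeback and recovery steps might be modelled as below. This is a sketch under several assumptions: the name `writeback` and the dictionary keys are hypothetical, the register step follows the slide's wording literally, and discarding the newer checkpoints and resetting the recovered entry are assumptions that the slide does not spell out.

```python
def writeback(table, idx, store_buffer, mispredicted=False):
    """Handle writeback of an instruction associated with table[idx].

    Each checkpoint is a dict with the hypothetical keys 'pc', 'count',
    'store_ptr' and 'future_free'. Returns (fetch_pc, freed_registers);
    fetch_pc is None when no recovery is needed.
    """
    ckpt = table[idx]
    ckpt["count"] -= 1
    if not mispredicted:
        return None, set()

    # Release all store buffer entries from the pointed entry onwards.
    del store_buffer[ckpt["store_ptr"]:]

    # Collect the future-free registers of this entry and all newer ones.
    freed = set()
    for c in table[idx:]:
        freed |= c["future_free"]

    # Assumption: newer checkpoints are discarded, and the recovered entry
    # starts over because its instructions will be fetched again.
    del table[idx + 1:]
    ckpt["count"] = 0
    ckpt["future_free"] = set()

    # Fetch resumes from the PC saved in the checkpoint.
    return ckpt["pc"], freed
```

Note that recovery restarts from the checkpoint rather than from the branch itself, which is why fewer checkpoints mean cheaper hardware but more re-executed work after a misprediction.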

Checkpoint Elimination

If the Instruction Counter of a checkpoint is 0 and it is the oldest checkpoint, then the checkpoint is removed:
• Clear the corresponding mark in the store buffer
• Free the registers marked in the Future Free vector
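The elimination rule can be illustrated with a small queue model. The function name `retire_checkpoints`, the dictionary keys, and the representation of store buffer marks as a set of indices are assumptions for this sketch only.

```python
from collections import deque


def retire_checkpoints(table, free_list, store_marks):
    """Remove checkpoints from the old end of `table` while their counter is 0.

    `table` is a deque of dicts with the hypothetical keys 'count',
    'store_ptr' and 'future_free'. Returns how many checkpoints were removed.
    """
    removed = 0
    while table and table[0]["count"] == 0:
        ckpt = table.popleft()
        # Clear the mark this checkpoint left in the store buffer.
        store_marks.discard(ckpt["store_ptr"])
        # The registers recorded as future free can now be reused.
        free_list.extend(sorted(ckpt["future_free"]))
        removed += 1
    return removed


# Usage: a younger checkpoint that is already empty must wait for the oldest.
table = deque([
    {"count": 1, "store_ptr": 0, "future_free": {4}},
    {"count": 0, "store_ptr": 3, "future_free": {5}},
])
free_regs, marks = [], {0, 3}
retire_checkpoints(table, free_regs, marks)   # removes nothing yet
```

Instructions inside a checkpoint may finish in any order, but checkpoints themselves leave in order, which is what keeps the architectural state recoverable.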

Outline

• Motivation
• Out-of-Order Commit
• Slow Lane Instruction Queue
• Performance Evaluation
• Conclusions

Slow Lane Instruction Queue

[Diagram: the Pseudo ROB holds the instruction flow between Oldest and Newest: Ld, x, x, x, a, x, x, x, b, x, with data dependence arrows. Other structures shown: Load/Store Queue, Instruction Queue, Slow Lane Instruction Queue. Entries shown in the queues: LD, a, b.]

[Diagram, next animation step: same Pseudo ROB contents (Ld, x, x, x, a, x, x, x, b, x) and queues (Load/Store Queue, Instruction Queue, Slow Lane Instruction Queue), with entries LD, a, b.]


[Diagram, next animation step: same Pseudo ROB contents and queues, with entries LD, a, b. Labels: Load End, Begin reinsert.]


Slow Lane Instruction Queue II

• Very simple buffer: Slow Lane Instruction Queue (SLIQ)
• Each load that misses in L2 has a pointer to an entry in the SLIQ
• Pseudo ROB

Slow Lane Instruction Queue III

When an instruction is retired from the Pseudo ROB, its state is checked:
• If the instruction is a load miss, the pointer is written
• If the instruction depends on a long-latency instruction, it is moved to the SLIQ

When a load that missed in L2 finishes its execution:
• The SLIQ is traversed starting from the instruction pointed to by the load, if this point is older than the current traversal position
• The load's dependent instructions are reinserted into the IQ
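A toy model of this behaviour is sketched below. It is not the hardware algorithm: the class `SlowLane` and its method names are hypothetical, registers are assumed to be physical (one producer each), the traversal is modelled as a single pass from the load's pointer to the end of the SLIQ, and an instruction that also waits on another miss is simply reinserted and left to wait in the IQ.

```python
class SlowLane:
    """Toy SLIQ model; all names are illustrative assumptions."""

    def __init__(self):
        self.sliq = []          # slow lane buffer; None marks a removed entry
        self.iq = []            # regular instruction queue
        self.long_regs = set()  # registers produced by long-latency chains
        self.load_ptr = {}      # load destination register -> SLIQ index

    def retire_from_pseudo_rob(self, dest, srcs, l2_miss=False):
        """Check the state of an instruction leaving the pseudo ROB."""
        if l2_miss:
            # Load miss: write the pointer to the next SLIQ entry.
            self.load_ptr[dest] = len(self.sliq)
            self.long_regs.add(dest)
        elif self.long_regs & set(srcs):
            # Depends on a long-latency instruction: move it to the SLIQ.
            self.sliq.append((dest, srcs))
            self.long_regs.add(dest)

    def load_done(self, dest):
        """The missing load finished: reinsert its dependents into the IQ."""
        woken = {dest}
        start = self.load_ptr.pop(dest)
        for i in range(start, len(self.sliq)):
            entry = self.sliq[i]
            if entry is not None and woken & set(entry[1]):
                self.iq.append(entry)
                woken.add(entry[0])   # its own dependents wake up as well
                self.sliq[i] = None
        self.long_regs -= woken


# Usage, mirroring the Ld / x / a / b example from the diagram slides.
s = SlowLane()
s.retire_from_pseudo_rob("p1", [], l2_miss=True)   # Ld
s.retire_from_pseudo_rob("p2", ["p0"])             # x: independent
s.retire_from_pseudo_rob("p3", ["p1"])             # a: depends on Ld
s.retire_from_pseudo_rob("p4", ["p3", "p2"])       # b: depends on a
s.load_done("p1")
```

Because the SLIQ is filled in program order, a single forward pass is enough to follow the dependence chain, which is what allows the buffer to stay very simple.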

Performance Evaluation

Processor Configuration (Baseline 4096):
• Fetch/Commit width: 4
• Branch Predictor: 16K entries, Gshare
• Instruction L1: 32 KB, 4-way, 32-byte line, 2 cycles
• Data L1: 32 KB, 4-way, 32-byte line, 2 cycles
• L2: 512 KB, 4-way, 64-byte line, 10 cycles
• Memory Latency: 1000 cycles
• Physical Registers: 4096 entries
• Load/Store Queue: 4096 entries
• Reorder Buffer: 4096 entries
• Integer General Units: 4 (lat/rep 1/1)
• Integer Mult/Div Units: 2 (lat/rep 3/1 and 20/20)
• FP Functional Units: 4 (lat/rep 2/1)
• FP Mult/Div/Sqrt Units: 2 (lat/rep 4/1, 12/12, 24/24)

Performance Evaluation - Some Considerations

• We mix both models.
• The processor takes the checkpoints when the instructions are retired from the pseudo ROB.
• Many branches are resolved by this time, so the probability of rolling back to the checkpoint is reduced.
• If a mispredicted branch is detected in the pseudo ROB, a normal rollback mechanism is used.

IPC – Different Configurations

[Figure: IPC (0 to 3.5) versus Slow Lane Instruction Queue size (512, 1024, 2048). Series: COoO 32, COoO 64, COoO 128, Baseline. Reference levels: Baseline 4096 and Baseline 128.]

Number of Checkpoints and Performance

[Figure: IPC (0 to 3.5) for the Baseline and for 4, 8, 16, 32, 64, and 128 checkpoints.]

Baseline: 2048 IQ. SLIQ 2048 entries and 128 IQ. 2048 Physical Registers

In-Flight Instructions

[Figure: in-flight instructions (0 to 3000) versus Slow Lane Instruction Queue size (512, 1024, 2048). Series: COoO 32, COoO 64, COoO 128, Baseline. Reference levels: Baseline 4096 and Baseline 128.]

Delay in re-insertion from SLIQ

[Figure: vertical axis 0 to 3; horizontal axis 32, 64, 128; series labelled 1, 4, 8, 12. SLIQ: 1024 entries.]

Towards an Affordable Kilo-Instruction Processor

• Adding Ephemeral Registers to the Out-of-Order Commit Processors
• Changing the SLIQ into a list of buckets of instructions

J. Martínez et al., “Ephemeral Registers”, Technical Report CSL-TR-2003-1035, 2003.

Putting It All Together

[Figure: IPC (0 to 3.5) versus Virtual Tags (512, 1024, 2048), grouped by Memory Latency (100, 500, 1000). Series labelled 256 and 512 (Physical Registers). Each group shows the reference levels Limit 4096 and Baseline 128. Other labels: Virtual Registers, IQs of 128 entries.]

Conclusion

• To tolerate increasing memory latencies in floating-point applications, a large number of in-flight instructions must be maintained.
  – The resources must be up-sized.
  – The resources are underutilized.
• We present two techniques that reduce the need for resources, and we show their effectiveness:
  – Out-of-Order Commit
  – Slow Lane Instruction Queue


Thank you very much

State of ST Queues (SpecInt, ROB=2048)

[Figure: vertical axis: ST Queue (0 to 250). Horizontal axis: values 20, 108, 435, 1004, 1361 (INT), with percentile marks 1, 10, 25, 50, 75, 90, 100. Categories: Ready, Address Ready, Blocked-Long, Blocked-Short.]

• Locality

State of Int Queues (SpecInt, ROB=2048)

[Figure: vertical axis: Int. Queue (0 to 450). Horizontal axis: values 20, 108, 435, 1004, 1361 (INT), with percentile marks 1, 10, 25, 50, 75, 90, 100. Categories: Blocked-Long, Blocked-Short, Ready.]

• Long/Short Lat. Inst.
• Remove – Reinsert
• Dependence Chain

State of Registers (Int, ROB=2048)

[Figure: vertical axis: Int. Registers (0 to 1000). Horizontal axis: Number of In-flight Instructions (SpecInt), with values 20, 108, 435, 1004, 1361, 1756 and percentile marks 10%, 25%, 50%, 75%, 90%. Categories: Dead, Blocked-Long, Blocked-Short, Live.]

• Early Release
• Virtual Registers
