Out-of-Order Commit Processors

Adrián Cristal (UPC), Daniel Ortega (HP Labs), Josep Llosa (UPC) and Mateo Valero (UPC)
HPCA-10, Madrid, February 14-17, 2004


TRANSCRIPT

Page 1: Out-of-Order Commit Processors

Out-of-Order Commit Processors

Adrián Cristal (UPC), Daniel Ortega (HP Labs), Josep Llosa (UPC) and Mateo Valero (UPC)

HPCA-10, Madrid, February 14-17, 2004

Page 2: Out-of-Order Commit Processors

Motivation I

[Figure: IPC (0 to 4) versus in-flight instructions (128, 256, 512, 1024, 2048, 4096) for Spec FP 2000. Legend: L2 Perfect, 100, 500, 1000. Annotations: 0.30X, 3.5X.]

Cristal et al.: “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002

Page 3: Out-of-Order Commit Processors

Motivation II – Resources – ROB

Instructions in-flight (ROB=2048, Mem 500 cycles)

[Figure: Number of in-flight instructions (SpecFP), y-axis 0 to 2000, x-axis 10%, 25%, 50%, 75%, 90%. Data labels: 1168, 1382, 1607, 1868, 1955, 2034.]

Often nearly full

A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003

Page 4: Out-of-Order Commit Processors

Motivation III – Resources – FP Queue

State of FP Queues (ROB=2048, Mem 500 cycles)

[Figure: Distribution of in-flight instructions in the FP queue (y-axis 0 to 600) versus number of instructions (x-axis labels 1, 10, 25, 50, 75, 90, 100 and 1168, 1382, 1607, 1868, 1955). Legend: Blocked-Long, Blocked-Short, Ready.]

Long/Short Lat. Inst.
Remove – Reinsert
Dependence Chain

A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003

Page 5: Out-of-Order Commit Processors

Outline

Motivation
Out-of-Order Commit
Multicheckpointing ROB
Slow Lane Instruction Queue
Performance Evaluation
Conclusion

Page 6: Out-of-Order Commit Processors

Out-of-Order Commit

[Diagram: instruction flow between Oldest and Newest. Instructions shown: I5, Br 3, I6, Br 2, St, I4, I3, Ld, Br 1, I2, I1, Ld. Markers: Oldest Checkpoint, Checkpoint, New Checkpoint.]

Page 7: Out-of-Order Commit Processors

Out-of-Order Commit

[Diagram: instruction flow between Oldest and Newest. Instructions shown: I5, Br 3, I6, Br 2, St, I4, I3, Ld, Br 1, I2, I1, Ld. Markers: Oldest Checkpoint, Checkpoint, New Checkpoint. A Store Buffer, tagged with the Oldest Checkpoint, drains To Memory by Gang Commit.]

Page 8: Out-of-Order Commit Processors

Out-of-Order Commit

[Diagram: branch misprediction and recovery from a checkpoint. Instruction flow between Oldest and Newest. Instructions shown: St, I4, I3, I5, Br 3, Br 2, I7, I8. Markers: Oldest Checkpoint, Checkpoint. The Store Buffer holds St.]

Page 9: Out-of-Order Commit Processors

Out-of-Order Commit II

Checkpoint Table. Each entry has:
PC of the next instruction
Instruction Counter: counts the number of instructions still alive
Map Table: allows the register file to be recovered
Pointer to the Store Buffer
Mechanism to recover free registers
• Future Free
– One bit for each physical register

• Large Virtual ROB: Tech. Rep. UPC-DAC-2002-39
• Ephemeral Registers: Tech. Rep. UPC-DAC-2003.51
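The fields of a checkpoint-table entry can be pictured as a small record. The following is a minimal sketch, assuming Python as a modelling language; the class name `CheckpointEntry`, the field names, and the use of a set to stand in for the one-bit-per-register Future Free vector are illustrative choices, not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class CheckpointEntry:
    """Hypothetical model of one checkpoint-table entry."""
    next_pc: int                     # PC of the next instruction
    map_table: Dict[str, int]        # logical register -> physical register
    store_buffer_ptr: int            # pointer into the store buffer
    instruction_counter: int = 0     # instructions still alive
    # Future Free: modelled as the set of physical registers whose bit is set
    future_free: Set[int] = field(default_factory=set)


entry = CheckpointEntry(next_pc=0x400, map_table={"R1": 7, "R2": 3}, store_buffer_ptr=0)
entry.instruction_counter += 1   # one instruction is associated with this entry
entry.future_free.add(7)         # physical register 7 can be freed later
```

A set is used only for readability; a hardware table would hold a bit vector of fixed width.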

Page 10: Out-of-Order Commit Processors

Checkpoint Creation

Save the PC
Save the Map Table
Clear the Future Free bits
Clear the Instruction Counter
Get a pointer to the first free entry of the store buffer, and mark this entry in the store buffer
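The creation steps above can be sketched as a single function. This is an illustration under stated assumptions: the store buffer is modelled as an append-only Python list, the marks as a set of indices, and the names `Checkpoint`, `create_checkpoint`, `store_marks` and `table` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Checkpoint:
    next_pc: int
    map_table: Dict[str, int]
    store_buffer_ptr: int
    instruction_counter: int = 0                        # starts cleared
    future_free: Set[int] = field(default_factory=set)  # starts cleared


def create_checkpoint(pc, map_table, store_buffer, store_marks, table):
    """Create a checkpoint and append it to the checkpoint table."""
    ptr = len(store_buffer)     # first free store-buffer entry (assumed list model)
    store_marks.add(ptr)        # mark that entry in the store buffer
    # Copy the map table so later renaming does not disturb the saved state.
    cp = Checkpoint(next_pc=pc, map_table=dict(map_table), store_buffer_ptr=ptr)
    table.append(cp)
    return cp


table: List[Checkpoint] = []
stores = ["st A"]               # one store already buffered
marks: Set[int] = set()
rename_map = {"R1": 7}
cp = create_checkpoint(0x400, rename_map, stores, marks, table)
rename_map["R1"] = 9            # later renaming; the checkpoint keeps R1 -> 7
```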

Page 11: Out-of-Order Commit Processors

Instruction Decode

Add 1 to the Instruction Counter of the newest checkpoint

For R1 ← R2 op R3, if R1 is mapped to PhyReg_N:
• Set the PhyReg_N bit of the future free vector
• Map R1 to the new physical register

Associate the instruction with the last created checkpoint
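A minimal sketch of the decode-time bookkeeping follows. It assumes a free list of physical registers modelled as a `deque`; the names `decode`, `free_list` and `newest` are hypothetical. The sketch always maps the destination to a new register, and sets a Future Free bit only when a previous mapping exists.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Set


@dataclass
class Checkpoint:
    instruction_counter: int = 0
    future_free: Set[int] = field(default_factory=set)


def decode(dest: str, map_table: Dict[str, int],
           free_list: Deque[int], newest: Checkpoint) -> Checkpoint:
    """Rename `dest` and return the checkpoint the instruction belongs to."""
    newest.instruction_counter += 1       # one more instruction alive
    old = map_table.get(dest)
    if old is not None:
        newest.future_free.add(old)       # old mapping is freed with the checkpoint
    map_table[dest] = free_list.popleft() # map dest to a new physical register
    return newest                         # instruction is associated with it


rename_map = {"R1": 7, "R2": 3, "R3": 4}
free_regs = deque([20, 21])
newest = Checkpoint()
owner = decode("R1", rename_map, free_regs, newest)   # R1 <- R2 op R3
```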

Page 12: Out-of-Order Commit Processors

Instruction Writeback

Decrement the Instruction Counter of the checkpoint associated with the instruction

If the instruction is a mispredicted branch, recover from the associated checkpoint:
• Fetch instructions from the saved PC
• Release all entries in the store buffer from the pointed entry onwards
• Free all registers in the future free vector of the entry and of all newer checkpoint entries
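The writeback steps can be sketched as follows. This follows the three recovery bullets literally and nothing more: restoring the map table, resetting counters and discarding newer checkpoints are not modelled. The list-based store buffer, the `free_list`, the function name `writeback` and clearing the Future Free sets after freeing (to avoid freeing a register twice) are assumptions of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Checkpoint:
    next_pc: int
    store_buffer_ptr: int
    instruction_counter: int = 0
    future_free: Set[int] = field(default_factory=set)


def writeback(idx: int, mispredicted: bool, table: List[Checkpoint],
              store_buffer: list, free_list: List[int]) -> Optional[int]:
    """Return the PC to refetch from on a misprediction, otherwise None."""
    cp = table[idx]
    cp.instruction_counter -= 1               # the instruction has finished
    if not mispredicted:
        return None
    del store_buffer[cp.store_buffer_ptr:]    # release entries from the pointed one
    for entry in table[idx:]:                 # this entry and all newer entries
        free_list.extend(sorted(entry.future_free))
        entry.future_free.clear()             # assumption: never free twice
    return cp.next_pc                         # fetch resumes at the saved PC


table = [Checkpoint(0x100, 1, 3, {5}), Checkpoint(0x200, 2, 1, {6, 4})]
stores = ["s0", "s1", "s2"]
free_regs: List[int] = []
no_redirect = writeback(0, False, table, stores, free_regs)
redirect = writeback(0, True, table, stores, free_regs)
```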

Page 13: Out-of-Order Commit Processors

Checkpoint Elimination

If the Instruction Counter is 0 and the checkpoint is the oldest one, the checkpoint is removed:
• Clear the corresponding mark in the store buffer
• Free the registers marked in the Future Free vector
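The elimination rule can be sketched as a loop over the oldest end of the checkpoint table. The table is modelled as a `deque` with the oldest checkpoint on the left; applying the rule repeatedly until the oldest checkpoint still has live instructions is an assumption of this sketch, as are the names `retire_checkpoints`, `store_marks` and `free_list`.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Set


@dataclass
class Checkpoint:
    store_buffer_ptr: int
    instruction_counter: int = 0
    future_free: Set[int] = field(default_factory=set)


def retire_checkpoints(table: Deque[Checkpoint], store_marks: Set[int],
                       free_list: List[int]) -> int:
    """Remove finished checkpoints from the oldest end; return how many."""
    removed = 0
    while table and table[0].instruction_counter == 0:
        cp = table.popleft()                       # only the oldest may leave
        store_marks.discard(cp.store_buffer_ptr)   # clear its store-buffer mark
        free_list.extend(sorted(cp.future_free))   # free the marked registers
        removed += 1
    return removed


table = deque([
    Checkpoint(0, 0, {3}),   # oldest, all instructions finished
    Checkpoint(2, 2, {8}),   # still has two live instructions
    Checkpoint(5, 0, {9}),   # finished, but not the oldest
])
marks = {0, 2, 5}
free_regs: List[int] = []
removed = retire_checkpoints(table, marks, free_regs)
```

In this run only the first checkpoint is removed: the third has a counter of 0 but must wait until it becomes the oldest.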

Page 14: Out-of-Order Commit Processors

Outline

Motivation
Out-of-Order Commit
Slow Lane Instruction Queue
Performance Evaluation
Conclusions

Page 15: Out-of-Order Commit Processors

Slow Lane Instruction Queue

[Diagram: Pseudo ROB holding the instruction flow between Oldest and Newest (Ld, x, x, x, a, x, x, x, b, x), with a data dependence marked among Ld, a and b. Structures shown: Load/Store Queue, Instruction Queue, Slow Lane Instruction Queue. Entries shown in the queues: LD, a, b.]

Page 16: Out-of-Order Commit Processors

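Slow Lane Instruction Queue

[Diagram: same layout as the previous slide. Pseudo ROB holding the instruction flow between Oldest and Newest (Ld, x, x, x, a, x, x, x, b, x), with a data dependence marked among Ld, a and b. Structures shown: Load/Store Queue, Instruction Queue, Slow Lane Instruction Queue. Entries shown in the queues: LD, a, b.]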

Page 17: Out-of-Order Commit Processors

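Slow Lane Instruction Queue

[Diagram: same layout as the previous slide, at the point where the load completes. Pseudo ROB holding the instruction flow between Oldest and Newest (Ld, x, x, x, a, x, x, x, b, x). Structures shown: Load/Store Queue, Instruction Queue, Slow Lane Instruction Queue. Entries shown in the queues: LD, a, b. Annotations: Load End, Begin reinsert.]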

Page 18: Out-of-Order Commit Processors

Slow Lane Instruction Queue II

Very simple buffer: the Slow Lane Instruction Queue (SLIQ)

Each load that misses in L2 has a pointer to an entry in the SLIQ

Pseudo ROB

Page 19: Out-of-Order Commit Processors

Slow Lane Instruction Queue III

When an instruction is retired from the Pseudo ROB, its state is checked:
• If the instruction is a load miss, the pointer is written
• If the instruction depends on a long-latency instruction, it is moved to the SLIQ

When a load that misses in L2 finishes its execution:
• The SLIQ is traversed from the instruction pointed to by the load, if this point is older than the current traversal position
• The load's dependent instructions are reinserted into the IQ
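The retire and wake-up rules can be sketched with a small queue model. Several simplifications are assumptions of this sketch and not part of the slides: each SLIQ entry is tagged with the load at the head of its dependence chain, the pointer written for a load miss is the SLIQ tail position at that moment, and the check against the current traversal position is omitted. The class and method names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class Sliq:
    """Toy model of a Slow Lane Instruction Queue."""

    def __init__(self) -> None:
        # Each slot holds (instruction, load it waits on), or None once reinserted.
        self.entries: List[Optional[Tuple[str, str]]] = []
        self.load_ptr: Dict[str, int] = {}    # load miss -> SLIQ position

    def retire_from_pseudo_rob(self, name: str, is_load_miss: bool = False,
                               waits_on: Optional[str] = None) -> None:
        if is_load_miss:
            self.load_ptr[name] = len(self.entries)  # the pointer is written
        elif waits_on is not None:
            self.entries.append((name, waits_on))    # moved to the SLIQ
        # Other instructions stay on the normal path and are not modelled.

    def load_done(self, load: str) -> List[str]:
        """Traverse from the load's pointer; return instructions sent to the IQ."""
        start = self.load_ptr.pop(load)
        reinserted = []
        for i in range(start, len(self.entries)):
            slot = self.entries[i]
            if slot is not None and slot[1] == load:
                reinserted.append(slot[0])
                self.entries[i] = None               # slot is now empty
        return reinserted


sliq = Sliq()
sliq.retire_from_pseudo_rob("Ld", is_load_miss=True)
sliq.retire_from_pseudo_rob("a", waits_on="Ld")
sliq.retire_from_pseudo_rob("Ld2", is_load_miss=True)
sliq.retire_from_pseudo_rob("c", waits_on="Ld2")
sliq.retire_from_pseudo_rob("b", waits_on="Ld")
woken_first = sliq.load_done("Ld")
woken_second = sliq.load_done("Ld2")
```

Starting the traversal at the load's pointer means entries that entered the SLIQ before the load are never scanned.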

Page 20: Out-of-Order Commit Processors

Performance Evaluation

Processor Configuration (Baseline 4096):
Fetch/Commit width: 4
Branch Predictor: 16K entries Gshare
Instruction L1: 32 KB, 4-way, 32-byte line, 2 cycles
Data L1: 32 KB, 4-way, 32-byte line, 2 cycles
L2 size: 512 KB, 4-way, 64-byte line, 10 cycles
Memory Latency: 1000 cycles
Physical Registers: 4096 entries
Load/Store Queue: 4096 entries
Reorder Buffer: 4096 entries
Integer General Units: 4 (lat/rep 1/1)
Integer Mult/Div Units: 2 (lat/rep 3/1 and 20/20)
FP Functional Units: 4 (lat/rep 2/1)
FP Mult/Div/Sqrt Units: 2 (lat/rep 4/1, 12/12, 24/24)
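One way to read the baseline size, offered here as back-of-the-envelope arithmetic and not as a claim from the slides: to keep fetching at full width while a single memory access is outstanding, the window must hold roughly width times latency instructions. The function name below is hypothetical.

```python
def instructions_to_cover(latency_cycles: int, width: int) -> int:
    """Instructions fetched at full width during one memory access."""
    return latency_cycles * width


# Values from the configuration above: 1000-cycle memory, fetch width 4.
needed = instructions_to_cover(latency_cycles=1000, width=4)
fits_in_baseline = needed <= 4096   # compare with the 4096-entry Reorder Buffer
```

With these values the product is 4000, which fits within the 4096-entry baseline structures.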

Page 21: Out-of-Order Commit Processors

Performance Evaluation – Some Considerations

We mix both models.

The processor takes the checkpoints when the instructions are retired from the pseudo ROB.

Many branches are resolved by this time, so the probability of having to roll back to the checkpoint is reduced.

If a mispredicted branch is detected in the pseudo ROB, a normal rollback mechanism is used.

Page 22: Out-of-Order Commit Processors

IPC – Different Configurations

[Figure: IPC (0 to 3.5) versus Slow Lane Instruction Queue size (512, 1024, 2048). Legend: COoO 32, COoO 64, COoO 128, Baseline. Reference levels: Baseline 4096, Baseline 128.]

Page 23: Out-of-Order Commit Processors

Number of Checkpoints and Performance

[Figure: IPC (0 to 3.5) for the Baseline and for 4, 8, 16, 32, 64 and 128 checkpoints.]

Baseline: 2048 IQ. SLIQ 2048 entries and 128 IQ. 2048 Physical Registers

Page 24: Out-of-Order Commit Processors

In-Flight Instructions

[Figure: In-flight instructions (0 to 3000) versus Slow Lane Instruction Queue size (512, 1024, 2048). Legend: COoO 32, COoO 64, COoO 128, Baseline. Reference levels: Baseline 4096, Baseline 128.]

Page 25: Out-of-Order Commit Processors

Delay in re-insertion from SLIQ

[Figure: y-axis 0 to 3; x-axis groups 32, 64, 128. Legend: 1, 4, 8, 12.]

SLIQ: 1024 entries

Page 26: Out-of-Order Commit Processors

Towards an Affordable Kilo-Instruction Processor

Adding Ephemeral Registers to the Out-of-Order Commit Processors

Changing the SLIQ into a list of buckets of instructions

J. Martínez et al., “Ephemeral Registers”, Technical Report CSL-TR-2003-1035, 2003.

Page 27: Out-of-Order Commit Processors

Putting It All Together

[Figure: IPC (0 to 3.5) versus Virtual Tags / Virtual Registers (512, 1024, 2048), grouped by Memory Latency (100, 500, 1000). Legend (Physical Registers): 256, 512. Reference levels in each group: Limit 4096, Baseline 128.]

IQs of 128 entries

Page 28: Out-of-Order Commit Processors

Conclusion

To tolerate increasing memory latencies in floating-point applications, a large number of in-flight instructions must be maintained:
• The resources must be up-sized
• The resources are underutilized

We present two techniques that reduce the need for resources, and we show their effectiveness:
• Out-of-Order Commit
• Slow Lane Instruction Queue

Page 29: Out-of-Order Commit Processors


Thank you very much

Page 30: Out-of-Order Commit Processors

State of ST Queues (specInt, ROB=2048)

[Figure: Distribution of in-flight instructions in the ST queue (y-axis 0 to 250) versus number of instructions, INT (x-axis labels 1, 10, 25, 50, 75, 90, 100 and 20, 108, 435, 1004, 1361). Legend: Ready, Address Ready, Blocked-Long, Blocked-Short.]

Locality

Page 31: Out-of-Order Commit Processors

State of Int Queues (specInt, ROB=2048)

[Figure: Distribution of in-flight instructions in the Int. queue (y-axis 0 to 450) versus number of instructions, INT (x-axis labels 1, 10, 25, 50, 75, 90, 100 and 20, 108, 435, 1004, 1361). Legend: Blocked-Long, Blocked-Short, Ready.]

Long/Short Lat. Inst.
Remove – Reinsert
Dependence Chain

Page 32: Out-of-Order Commit Processors

State of Registers (Int, ROB=2048)

[Figure: Int. registers (y-axis 0 to 1000) versus number of in-flight instructions (SpecInt), x-axis 10%, 25%, 50%, 75%, 90%. Data labels: 20, 108, 435, 1004, 1361, 1756. Legend: Dead, Blocked-Long, Blocked-Short, Live.]

Early Release

Virtual Registers