icfp: tolerating all level cache misses in in-order processors

HPCA-15 :: Feb 18, 2009

iCFP: Tolerating All Level Cache Misses in In-Order Processors

Andrew Hilton, Santosh Nagarakatte, Amir Roth

University of Pennsylvania

{adhilton,santoshn,amir}@cis.upenn.edu

A Brief History …

• Pentium (in-order)
• PentiumII (out-of-order)
• Core2Duo (out-of-order, 2 cores)
• Nehalem (out-of-order, 4 cores, 8 threads)
• Niagara2 (in-order, 16 cores, 64 threads)

[Figure labels: performance, power]

POWER!

In-order vs. Out-of-Order

Out-of-order cores
• Single thread IPC (+63%)

In-order cores
• Power efficiency
• More cores

Is there a compromise?

Key idea
• Main benefit of out-of-order: data cache miss tolerance
• Can we add to in-order in a simple way?

Runahead

Runahead execution [Dundas+, ICS’97]
• In-order + miss-level parallelism (MLP)
• Checkpoint and “advance” under miss
• Restore checkpoint when miss returns

Additional hardware?
• Regfile checkpoint-restore
• Per register “poison” bits
• Forwarding cache

[Figure: in-order pipeline with I$, D$, RF0, poison bits, forwarding $]

Can we do better?

Yes We Can! (Sorry)

iCFP: in-order Continual Flow Pipeline
• Runahead, but …
• Save miss-independent work
• Re-execute only miss forward slice

Additional hardware?
• Slice buffer
• Replace forwarding cache with store buffer
• Hijack additional regfile used for multi-threading

In-order adaptation of CFP [Srinivasan+, ASPLOS’04]
• Unblock pipeline latches, not issue queue and regfile
• Apply to misses at all cache levels, not just L2

[Figure: in-order pipeline with I$, D$, RF0, RF1, poison bits, store buffer, slice buffer]

iCFP Roadmap

Motivation and overview

(Not fully) working example

Correctness features
• Register communication for miss-dependent instructions
• Store-load forwarding
• Multiprocessor safety

Performance features

Evaluation

Example

A1: load [r1] -> r2
B1: load [r2] -> r3
C1: add r3, r4 -> r5
D1: store r5 -> [r6]
E1: add r1, #4 -> r1
F1: branch r1, #40, A
A2: load [r1] -> r2
B2: load [r2] -> r3
C2: add r3, r4 -> r5
D2: store r5 -> [r6]

[Figure: pipeline diagram with I$, D$, Store Buffer, RF0 (Tail), RF1, Slice Buffer, and per-register poison bits. Instruction labels are PC/instance. Bold paths are active. A1, B1, C1 are flowing through the pipeline. Tail: last completed instruction.]

Load A1 misses, transition to “advance” mode
• Checkpoint regfile
• Poison A1’s output register r2
• Divert A1 to slice buffer

[Figure: A1 misses in the D$ and moves to the slice buffer as a pending miss (red); r2 is poisoned.]

Advance
• Propagate poison through data dependences
• Divert miss-dependent instructions to slice buffer
• Buffer stores in store buffer
• Miss-independent instructions execute as usual, update regfile
• Can “un-poison” tail registers

[Figure: miss-dependent instructions B1, C1, D1 follow A1 into the slice buffer, and D1 is also buffered in the store buffer. Miss-independent instructions (green) E1, F1, A2, B2 complete at the tail. Poisoned registers: r2, then r2 r3, then r2 r3 r5; with the tail at A2, r3 r5.]
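The advance-mode rules above (poison the missing load's output, propagate poison through data dependences, divert miss-dependent instructions, let miss-independent instructions complete and un-poison their destinations) can be sketched in a few lines of Python. This is only an illustrative model, not the hardware described in the talk: the `Insn` class, the `advance` function, and the idea of passing the set of missing loads as an argument are assumptions made for this example. It does not model timing, the store buffer, or stalls on stores with unknown addresses.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Insn:
    label: str               # PC/instance label from the slides, e.g. "A1"
    srcs: Tuple[str, ...]    # registers read
    dst: Optional[str]       # register written, or None (stores, branches)


def loop_body(i: int):
    """One iteration of the example loop (hypothetical encoding)."""
    return [
        Insn(f"A{i}", ("r1",), "r2"),        # load [r1] -> r2
        Insn(f"B{i}", ("r2",), "r3"),        # load [r2] -> r3
        Insn(f"C{i}", ("r3", "r4"), "r5"),   # add r3, r4 -> r5
        Insn(f"D{i}", ("r5", "r6"), None),   # store r5 -> [r6]
        Insn(f"E{i}", ("r1",), "r1"),        # add r1, #4 -> r1
        Insn(f"F{i}", ("r1",), None),        # branch r1, #40, A
    ]


PROGRAM = loop_body(1) + loop_body(2)[:4]    # A1..F1, A2..D2


def advance(program, missing_loads):
    """Walk the program in order, tracking poison and the slice buffer."""
    poisoned = set()
    slice_buffer, completed = [], []
    for insn in program:
        dependent = any(src in poisoned for src in insn.srcs)
        if insn.label in missing_loads or dependent:
            # Missing load or miss-dependent instruction: poison the
            # destination and defer the instruction to the slice buffer.
            if insn.dst is not None:
                poisoned.add(insn.dst)
            slice_buffer.append(insn.label)
        else:
            # Miss-independent: completes and un-poisons its destination.
            if insn.dst is not None:
                poisoned.discard(insn.dst)
            completed.append(insn.label)
    return slice_buffer, completed, poisoned


slice_buffer, completed, poisoned = advance(PROGRAM, {"A1"})
print(slice_buffer)  # ['A1', 'B1', 'C1', 'D1']
print(completed)     # ['E1', 'F1', 'A2', 'B2', 'C2', 'D2']
print(poisoned)      # set()
```

With only A1 missing, the sketch defers A1 through D1 and completes E1 through D2, which matches the slice buffer contents shown in the walkthrough.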

Miss Returns

When A1 miss returns, transition to “rally”
• Stall fetch
• Pipe in contents of slice buffer
• Drain advance instructions already in pipeline (C2–D2)

[Figure: the A1 fill arrives; C2 and D2 drain while A1 and then B1 re-enter the pipeline from the slice buffer. r5 is the last poisoned register, and no poison remains once the drain begins.]
Rally
• Execute deferred instructions from slice buffer
• When slice buffer is empty, un-block fetch
• Wait for deferred instructions to complete

[Figure: C1 and D1 follow A1 and B1 out of the slice buffer; fetch resumes with F2 behind E2.]
Back To Normal

When last deferred instruction completes
• Release register checkpoint
• Resume normal execution at the tail
• Drain stores from store buffer to D$

[Figure: the store buffer holds D1 and D2; D1 drains to the D$ first.]

One Way Or The Other

If rally hits mis-predicted branch, exception, etc.
• Flush pipeline
• Discard store buffer contents
• Restore regfile from checkpoint

[Figure: same pipeline diagram and example code, with A1 shown again.]

iCFP Roadmap

Motivation and overview

(Not fully) working example

Correctness features
• Register communication for miss-dependent instructions
• Store-load forwarding
• Multiprocessor safety

Performance features

Evaluation

Rally Register Communication

Where do A1–C1 write r2, r3, r5 during rally?
• Not in Tail RF0
• Already written by logically younger A2–C2

Use RF1 as rally scratch-pad
• Update Tail RF0 if youngest writer (not in this example)

[Figure: same pipeline diagram, with RF1 labeled Rally; A1 and B1 appear next to the Rally regfile as the rally proceeds.]

Store-Load Forwarding

iCFP is in-order but …
• Rally loads out-of-order wrt advance stores (possible WAR hazards)

Store-load forwarding mechanism should
• Avoid WAR hazards
• Avoid redoing stores

Forwarding cache? D$ with speculative writes?
• Not what we want

What we really want is a large (64-entry+) store queue
• Like in an out-of-order processor
– Associative search doesn’t scale nicely

Chained Store Buffer

Replace associative search with iterative indexed search
• Exploit fact that stores enter store buffer in order
• Address must be known: otherwise stall
• Overlay store buffer with address-based hash table

Store buffer, from Tail (younger) to Head (older):

SSN   address   value   poison   link
86    7B0       90      0        44
85    2AC       78      0        81
84    388       ??      1        0
83    1B4       56      0        15
82    384       ??      1        0
81    1AC       34      0        77
80    380       12      0        0

Root table (64 entries):

index   SSN
AC      85
B0      86
B4      83
B8      21

Loads follow chain starting at appropriate root table entry
• For example, load to address 1AC
• Root entry AC gives SSN 85 (address 2AC, no match); its link gives SSN 81 (address 1AC): match, forward

Rally loads ignore younger stores, avoid WAR hazards
• For example, rally load to address 1B4 …
• … whose immediately older store is 81 (noted during advance)
• Root entry B4 gives SSN 83: younger store, ignore; its link, 15, is older than the head (80): go to D$

+ Non-speculative (including no WAR hazards)
+ Scalable
+ Average number of excess hops < 0.05 with 64-entry root table
– Must stall on (miss-dependent) stores with unknown addresses
• These are rare
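The chained lookup on these slides can be sketched as follows. This is a minimal software model under stated assumptions, not the design from the paper: the class and method names are invented for illustration, the root table is indexed by word address modulo 64 (a guess that is consistent with the AC/B0/B4/B8 indices shown), a poisoned value is modeled as `None`, and SSN 0 is reserved to mean "no older store in this chain". The slide's links to already-drained stores (44, 77, 15) simply show up as 0 here because the sketch starts from an empty buffer.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class StoreEntry:
    ssn: int                # store sequence number
    addr: int
    value: Optional[int]    # None models a poisoned (unknown) value
    link: int               # SSN of previous store in the same chain, 0 if none


class ChainedStoreBuffer:
    ROOT_ENTRIES = 64       # root table size from the slides

    def __init__(self, first_ssn: int = 1):
        # SSN 0 is reserved as the "end of chain" marker.
        self.entries: Dict[int, StoreEntry] = {}
        self.root: Dict[int, int] = {}
        self.next_ssn = first_ssn

    def _slot(self, addr: int) -> int:
        # Assumed hash: word address modulo the root table size.
        return (addr >> 2) % self.ROOT_ENTRIES

    def store(self, addr: int, value: Optional[int]) -> int:
        """Append a store in program order and chain it to the previous
        store that mapped to the same root slot."""
        ssn = self.next_ssn
        slot = self._slot(addr)
        self.entries[ssn] = StoreEntry(ssn, addr, value, self.root.get(slot, 0))
        self.root[slot] = ssn
        self.next_ssn += 1
        return ssn

    def load(self, addr: int, older_store_ssn: Optional[int] = None):
        """Follow the chain from the root slot. Rally loads pass the SSN of
        their immediately older store so that younger stores are ignored.
        Returns the forwarding entry, or None (go to D$)."""
        ssn = self.root.get(self._slot(addr), 0)
        while ssn in self.entries:
            entry = self.entries[ssn]
            is_younger = older_store_ssn is not None and ssn > older_store_ssn
            if not is_younger and entry.addr == addr:
                return entry
            ssn = entry.link
        return None


# Rebuild the seven stores from the slides (SSNs 80 through 86).
buf = ChainedStoreBuffer(first_ssn=80)
for addr, value in [(0x380, 0x12), (0x1AC, 0x34), (0x384, None),
                    (0x1B4, 0x56), (0x388, None), (0x2AC, 0x78), (0x7B0, 0x90)]:
    buf.store(addr, value)

hit = buf.load(0x1AC)                          # hops 85 -> 81, then forwards
miss = buf.load(0x1B4, older_store_ssn=81)     # store 83 is younger: ignored
```

Because links always point to older SSNs, each hop moves strictly toward the head, so the walk terminates without any associative search.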

Multi-Processor Safety

iCFP is in-order but … (yeah again)
• Advance loads are vulnerable to stores from other threads
• Just like in an out-of-order processor

Must snoop/verify these
• Associative load queue too expensive for in-order processor
• Paper describes scheme based on local signatures

Methodology

Cycle-level simulation
• 2-way issue 9-stage in-order pipeline
• 32KByte D$
• 20-cycle 1MByte, 8-way L2 (8 8-entry stream buffers)
• 400 cycle main memory, 4Bytes/cycle, 32 outstanding misses
• 128-entry chained store buffer, 128-entry slice buffer

Spec2000 benchmarks
• Alpha AXP ISA
• DEC OSF compiler -O4 optimization
• 2% sampling with warm-up

Initial Evaluation

[Figure: % speedup over 2-way in-order (0 to 50) for applu, mgrid, swim, bzip2, vortex, vpr, plus SpecFP and SpecINT; bars for Runahead-L2, Runahead-D$, iCFP*-L2, iCFP*-D$]

iCFP vs. Runahead: advance on L2 misses
• Roughly same performance: +10%
• Dominated by MLP
• iCFP’s ability to reuse work rarely significant (vortex)

Runahead advance on D$ misses too: performance drops
• Chance for MLP is low and can’t reuse work
• Overhead of restoring checkpoint is high
• Especially because baseline stalls on use, not miss

iCFP advance under D$ misses too
• Can reuse work without restoring checkpoint but …
• iCFP* executes rallies until completion in blocking fashion
• No efficient way to handle D$ misses under L2 misses

iCFP Performance Features

Non-blocking rallies
• Miss during rally (dependent or just pending)? Don’t stall, slice it out

Fine-grain multi-threaded rallies
• Proceed in parallel with advance execution at the tail
• Rallies process dependence chains, can’t exploit superscalar

These need: incremental updates of tail register state
• Both values and poison bits
• Note: store buffer is not a tail snapshot, so no additional support

Incremental Tail Updates

Question: should current rally instruction update Tail RF?
• A1? B1? C1?
• No, no, yes

Advance execution tags registers with sequence numbers
• Distance of writing instruction from checkpoint

Rally updates Tail RF if seqnum matches
• A1’s is 1, so no
• B1’s is 2, so no
• C1’s is 3, so yes

Proper slicing can continue at tail
• C2 sliced because r3 poison preserved

[Figure: same pipeline diagram, with RF1 labeled Rally. Instructions A1 through B2 carry sequence numbers 1 through 8. Tail RF0 sequence tags for r2, r3, r5 are 7, 8, 3 during the rally and 7, 8, 9 after C2. Poisoned registers: r2 r3 r5, then r2 r3 after C1 updates the tail, then r2 r3 r5 again after C2.]
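The sequence-number check above can be sketched as a small Python model. This is an illustration under assumptions, not the hardware: `TailRegfile`, `advance_write`, and `rally_write` are hypothetical names, a poisoned advance write is modeled by passing no value, and the register values 100, 200, 300 are made up. Rally results always go to the RF1 scratch-pad (a plain dict here) and reach the tail only when the sequence numbers match.

```python
class TailRegfile:
    """Tail register state: value, poison bit, and last-writer sequence tag."""

    def __init__(self):
        self.value = {}
        self.seq = {}
        self.poison = set()

    def advance_write(self, reg, seq, value=None):
        """Advance instruction number `seq` writes `reg`.
        value=None models a poisoned (miss-dependent) result."""
        self.seq[reg] = seq
        self.value[reg] = value
        if value is None:
            self.poison.add(reg)
        else:
            self.poison.discard(reg)

    def rally_write(self, reg, seq, value):
        """A rally instruction updates the tail only if it is still the
        youngest writer of `reg`, i.e. the sequence tags match."""
        if self.seq.get(reg) != seq:
            return False
        self.value[reg] = value
        self.poison.discard(reg)
        return True


# Tail state from the slides: r2 tagged 7 (A2), r3 tagged 8 (B2), r5 tagged 3 (C1).
tail = TailRegfile()
tail.advance_write("r5", seq=3)
tail.advance_write("r2", seq=7)
tail.advance_write("r3", seq=8)

rally_rf = {}    # stands in for RF1, the rally scratch-pad
updated = {}
for label, reg, seq, value in [("A1", "r2", 1, 100),
                               ("B1", "r3", 2, 200),
                               ("C1", "r5", 3, 300)]:
    rally_rf[reg] = value                            # always write the scratch-pad
    updated[label] = tail.rally_write(reg, seq, value)

print(updated)               # {'A1': False, 'B1': False, 'C1': True}
print(sorted(tail.poison))   # ['r2', 'r3']
```

Only C1 updates the tail, so r2 and r3 stay poisoned there and a later C2 reading r3 is still sliced out, as on the final slide of this sequence.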

Another iCFP Performance Feature

Minimal rallies
• Only traverse slice of returned miss, not entire slice buffer

Implementation: borrow trick from TCI [AlZawawi+, ISCA’07]
• Replace poison bits with bitvectors
• Re-organize slice buffer to support sparse access
• See paper for details
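One way to picture poison bitvectors is sketched below. The slides defer the actual mechanism to the paper, so everything here is an assumption for illustration: each outstanding miss is given one bit, each slice buffer entry carries the OR of the miss bits it depends on, and a rally for a returned miss visits only entries with that bit set. The function name, the tuple layout, and the labels X1 and Y1 are hypothetical, and the sparse slice buffer organization is not modeled.

```python
def rally_for_miss(slice_buffer, miss_bit):
    """Visit only the entries that depend on the returned miss.

    slice_buffer: list of (label, poison_vector) in program order.
    Returns (visited labels, entries that stay deferred).
    """
    mask = 1 << miss_bit
    visited, still_deferred = [], []
    for label, vector in slice_buffer:
        if vector & mask:
            visited.append(label)
            vector &= ~mask                      # this miss no longer blocks it
            if vector:                           # still waiting on another miss
                still_deferred.append((label, vector))
        else:
            still_deferred.append((label, vector))   # untouched by this rally
    return visited, still_deferred


# Two independent misses: bit 0 and bit 1. Y1 depends on both.
entries = [("A1", 0b01), ("B1", 0b01), ("X1", 0b10), ("Y1", 0b11)]
visited, remaining = rally_for_miss(entries, miss_bit=0)
print(visited)     # ['A1', 'B1', 'Y1']
print(remaining)   # [('X1', 2), ('Y1', 2)]
```

With a single poison bit per register, the rally for miss 0 would also have to walk X1; the bitvector lets it skip entries that belong only to other misses.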

Tolerating All Level Cache Misses

iCFP performance features?
• Help iCFP-L2 (now better than Runahead-L2)
• Help iCFP-D$ even more (now better than iCFP-L2)

[Figure: bar chart, scale 0 to 50; bars for Runahead-L2, iCFP*-L2, iCFP-L2, iCFP-D$]

Feature Contribution Analysis

iCFP*-D$: no “performance” features

Non-blocking rallies
• Most significant performance feature
• Helps programs with dependent misses (vpr, mcf)
• Helps programs with D$ misses under L2 misses (applu)

Multi-threaded rallies: one slot of 2-way superscalar
• “Free” with support for non-blocking rallies
• Helps uniformly

Minimal rallies: 8-bit poison vectors
• Helps uniformly (most misses are independent)

[Figure: bar chart, scale 0 to 50; bars for iCFP*, + non-blocking, + multi-threading, + minimal]

Out of Slice Buffer?

iCFP defaults to runahead when out of slice or store buffer
• Not overly sensitive to slice buffer size

[Figure: bar chart, scale 0 to 50; slice buffer sizes 0 (Runahead), 32, 64, 128]

What About Store Buffer?

• A little more sensitive to store buffer size
• Chaining essentially performance equivalent to associative search

[Figure: bar chart, scale 0 to 50; store buffer sizes 32, 64, 128, 128-assoc]

Performance vs. Hardware Cost

• Runahead: +11% for checkpoints, poison bits, forwarding cache
• iCFP: +17%, for checkpoints, poison bits, store buffer, slice buffer
• Basically: Runahead + 6% for a 128-entry slice buffer
• OoO: +63% for 128-entry window, 32-entry issue queue, etc.
• CFP: +75% for OoO and 128-entry slice buffer

[Figure: % speedup over 2-way in-order (0 to 100) for Runahead, iCFP, OoO, CFP]

Related Work

Multipass pipelining [Barnes+, MICRO’05]

• Rallies re-execute everything, but with higher ILP

Simple Latency Tolerant Processor [Nekkalapu+, ICCD’08]

• Similar, but … single, blocking rallies, speculative cache writes

Rock [Tremblay+, ISSCC’08]

• “Upon encountering a long latency instruction, the pipeline takes a checkpoint … creates future state and only reruns dependent instructions accumulated since the original checkpoint …. While one thread is completing the future created by the ahead thread, it continues execution to create the next future version of the architected state … This leapfrogging continues …”

• Sounds similar, what does it really do?

Conclusion

iCFP: in-order Continual Flow Pipeline
• In-order + ability to flow around cache misses at all levels
• Minimal hardware: runahead + slice buffer

Key features: not present elsewhere (afawk)
• Non-blocking, multi-threaded, minimal rallies

Supporting technologies
• Chained store buffer
• Incremental tail register state updates

Incremental is a good thing!

Comparative Performance

[Figure: bar chart, scale 0 to 50, for applu, swim, SpecFP, bzip2, vpr, SpecInt; bars for Runahead, Multipass, SLTP, iCFP]
