iCFP: Tolerating All Level Cache Misses in In-Order Processors
Andrew Hilton, Santosh Nagarakatte, Amir Roth
University of Pennsylvania
{adhilton,santoshn,amir}@cis.upenn.edu
HPCA-15 :: Feb 18, 2009
A Brief History …
[Figure labels: performance, power]
• Pentium (in-order)
• Pentium II (out-of-order)
• Core 2 Duo (out-of-order, 2 cores)
• Nehalem (out-of-order, 4 cores, 8 threads)
• Niagara2 (in-order, 16 cores, 64 threads)
POWER!
In-order vs. Out-of-Order
Out-of-order cores
• Single thread IPC (+63%)
In-order cores
• Power efficiency
• More cores
Is there a compromise?
Key idea
• Main benefit of out-of-order: data cache miss tolerance
• Can we add it to in-order in a simple way?
Runahead
Runahead execution [Dundas+, ICS’97]
• In-order + miss-level parallelism (MLP)
• Checkpoint and “advance” under miss
• Restore checkpoint when miss returns
Additional hardware?
• Regfile checkpoint-restore
• Per register “poison” bits
• Forwarding cache
Can we do better?
[Pipeline diagram: I$, D$, RF0, poison bits, Forwarding$]
Yes We Can! (Sorry)
iCFP: in-order Continual Flow Pipeline
• Runahead, but …
• Save miss-independent work
• Re-execute only miss forward slice
Additional hardware?
• Slice buffer
• Replace forwarding cache with store buffer
• Hijack additional regfile used for multi-threading
In-order adaptation of CFP [Srinivasan+, ASPLOS’04]
• Unblock pipeline latches, not issue queue and regfile
• Apply to misses at all cache levels, not just L2
[Pipeline diagram: I$, D$, RF0, RF1, poison bits, Store Buffer (replacing Forwarding$), Slice Buffer]
iCFP Roadmap
Motivation and overview
(Not fully) working example
Correctness features
• Register communication for miss-dependent instructions
• Store-load forwarding
• Multiprocessor safety
Performance features
Evaluation
Example
A1: load [r1] -> r2
B1: load [r2] -> r3
C1: add r3, r4 -> r5
D1: store r5 -> [r6]
E1: add r1, #4 -> r1
F1: branch r1, #40, A
A2: load [r1] -> r2
B2: load [r2] -> r3
C2: add r3, r4 -> r5
D2: store r5 -> [r6]
[Pipeline diagram: I$, D$, Store Buffer, RF0 (Tail), RF1, Slice Buffer, poison bits; A1, B1, C1 in the pipeline]
Legend
• Instructions are labeled PC/instance (e.g., A1)
• Bold paths are active
• Instructions flow through the pipeline
• Tail: last completed instruction
Load A1 misses, transition to “advance” mode
• Checkpoint regfile
• Poison A1’s output register r2
• Divert A1 to slice buffer
[Pipeline diagram: A1 misses in D$ and moves to the slice buffer as a pending miss (red); poisoned: r2]
Advance
• Propagate poison through data dependences
• Divert miss-dependent instructions to slice buffer
• Buffer stores in store buffer
• Miss-independent instructions execute as usual, update regfile
• Can “un-poison” tail registers
[Pipeline diagram sequence: B1, C1, D1 (miss-dependent) follow A1 into the slice buffer; D1 also enters the store buffer; E1, F1, A2 (miss-independent, green) complete at the tail; poisoned registers grow from r2 to r2, r3, r5, then shrink to r3, r5 once A2 rewrites r2]
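The advance-mode rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Instr` record and the `advance` function are hypothetical names, only the loads named in `missing_loads` miss, every other load hits, and store addresses (r6 here) are never poisoned.

```python
from collections import namedtuple

# Hypothetical instruction record: name, kind, source registers, destination.
Instr = namedtuple("Instr", "name kind srcs dst")

# The example loop from the slides, two iterations.
PROGRAM = [
    Instr("A1", "load",   ["r1"],       "r2"),
    Instr("B1", "load",   ["r2"],       "r3"),
    Instr("C1", "add",    ["r3", "r4"], "r5"),
    Instr("D1", "store",  ["r5", "r6"], None),
    Instr("E1", "add",    ["r1"],       "r1"),
    Instr("F1", "branch", ["r1"],       None),
    Instr("A2", "load",   ["r1"],       "r2"),
    Instr("B2", "load",   ["r2"],       "r3"),
    Instr("C2", "add",    ["r3", "r4"], "r5"),
    Instr("D2", "store",  ["r5", "r6"], None),
]

def advance(program, missing_loads):
    """Walk the program in advance mode.

    Returns (slice_buffer, store_buffer, poisoned_registers).
    """
    poisoned = set()
    slice_buffer = []
    store_buffer = []
    for ins in program:
        if ins.kind == "store":
            # All advance stores are buffered, poisoned or not.
            store_buffer.append(ins.name)
        dependent = (ins.name in missing_loads
                     or any(reg in poisoned for reg in ins.srcs))
        if dependent:
            # Miss-dependent: divert to the slice buffer, poison the output.
            slice_buffer.append(ins.name)
            if ins.dst is not None:
                poisoned.add(ins.dst)
        elif ins.dst is not None:
            # Miss-independent: executes normally and un-poisons its output.
            poisoned.discard(ins.dst)
    return slice_buffer, store_buffer, poisoned
```

Calling `advance(PROGRAM, {"A1"})` defers A1 through D1, buffers both stores, and leaves no register poisoned once A2, B2, and C2 have overwritten r2, r3, and r5.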
Miss Returns
When A1 miss returns, transition to “rally”
• Stall fetch
• Pipe in contents of slice buffer
[Pipeline diagram: A1’s fill arrives at D$; slice buffer holds A1, B1, C1, D1; poisoned: r5]
Drain
• Drain advance instructions already in pipeline (C2–D2)
[Pipeline diagram: fetch stalled at E2; A1 and then B1 re-enter the pipeline from the slice buffer; no registers poisoned]
Rally
• Execute deferred instructions from slice buffer
• When slice buffer is empty, un-block fetch
• Wait for deferred instructions to complete
[Pipeline diagram: C1 and D1 follow A1 and B1 through the pipeline; fetch resumes with F2]
Back To Normal
When last deferred instruction completes
• Release register checkpoint
• Resume normal execution at the tail
• Drain stores from store buffer to D$
[Pipeline diagram: D1 and D2 leave the store buffer for D$]
One Way Or The Other
If rally hits mis-predicted branch, exception, etc.
• Flush pipeline
• Discard store buffer contents
• Restore regfile from checkpoint
[Pipeline diagram: execution restarts at A1]
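The walk-through above moves between three modes, which can be summarized as a small transition table. This is a sketch of the simple (blocking) example only; the `Mode` enum, the event names (`miss`, `miss_returns`, `slice_done`, `squash`), and the `step`/`run` helpers are hypothetical labels chosen for illustration, and the action strings paraphrase the slide bullets.

```python
from enum import Enum

class Mode(Enum):
    NORMAL = "normal"
    ADVANCE = "advance"
    RALLY = "rally"

# (current mode, event) -> (next mode, actions taken on the transition)
TRANSITIONS = {
    (Mode.NORMAL, "miss"): (
        Mode.ADVANCE,
        ["checkpoint regfile", "poison output register"],
    ),
    (Mode.ADVANCE, "miss_returns"): (
        Mode.RALLY,
        ["stall fetch", "pipe in slice buffer"],
    ),
    (Mode.RALLY, "slice_done"): (
        Mode.NORMAL,
        ["release checkpoint", "resume at tail", "drain store buffer to D$"],
    ),
    (Mode.RALLY, "squash"): (
        Mode.NORMAL,
        ["flush pipeline", "discard store buffer",
         "restore regfile from checkpoint"],
    ),
}

def step(mode, event):
    """Return (next_mode, actions); unknown events leave the mode unchanged."""
    return TRANSITIONS.get((mode, event), (mode, []))

def run(events, mode=Mode.NORMAL):
    """Apply a sequence of events and collect the actions in order."""
    log = []
    for event in events:
        mode, actions = step(mode, event)
        log.extend(actions)
    return mode, log
```

Both the successful path (`slice_done`) and the recovery path (`squash`) end in normal mode; they differ only in whether the advance work is kept or thrown away.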
iCFP Roadmap
Motivation and overview
(Not fully) working example
Correctness features
• Register communication for miss-dependent instructions
• Store-load forwarding
• Multiprocessor safety
Performance features
Evaluation
Rally Register Communication
Where do A1–C1 write r2, r3, r5 during rally?
• Not in Tail RF0
• Already written by logically younger A2–C2
Use RF1 as rally scratch-pad
• Update Tail RF0 if youngest writer (not in this example)
[Pipeline diagram: RF0 (Tail) and RF1 (Rally); A1, B1, C1 write their results to RF1 as they rally]
Store-Load Forwarding
iCFP is in-order but …
• Rally loads out-of-order wrt advance stores (possible WAR hazards)
Store-load forwarding mechanism should
• Avoid WAR hazards
• Avoid redoing stores
Forwarding cache? D$ with speculative writes?
• Not what we want
What we really want is a large (64-entry+) store queue
• Like in an out-of-order processor
– Associative search doesn’t scale nicely
Chained Store Buffer
Replace associative search with iterative indexed search
• Exploit fact that stores enter store buffer in order
• Address must be known: otherwise stall
• Overlay store buffer with address-based hash table

Store buffer (Tail, younger, on the left; Head, older, on the right)
SSN      86   85   84   83   82   81   80
address  7B0  2AC  388  1B4  384  1AC  380
value    90   78   ??   56   ??   34   12
poison   0    0    1    0    1    0    0
link     44   81   0    15   0    77   0

Root table (64 entries)
index    …  AC  B0  B4  B8  …
SSN      …  85  86  83  21  …
Loads follow chain starting at appropriate root table entry
• For example, load to address 1AC
– Root entry AC points to store 85 (address 2AC): no match, follow link to 81
– Store 81 (address 1AC): match, forward
Rally loads ignore younger stores, avoid WAR hazards
• For example, rally load to address 1B4 …
• … whose immediately older store is 81 (noted during advance)
– Root entry B4 points to store 83 (address 1B4): younger store, ignore, follow link to 15
– Store 15 is no longer in the store buffer: go to D$
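The two lookups traced on these slides can be reproduced with a short sketch. The store buffer contents, links, and the four root entries are taken from the slide; everything else is an assumption made for illustration: the `lookup` and `root_index` names, the choice of word-address bits as the 64-entry hash, keeping values as the strings printed on the slide, and treating any SSN below the head as already drained.

```python
# Store buffer entries from the slide, keyed by store sequence number (SSN).
# Each entry: (address, value, poison, link). "??" marks a poisoned value.
STORE_BUFFER = {
    86: (0x7B0, "90", False, 44),
    85: (0x2AC, "78", False, 81),
    84: (0x388, "??", True, 0),
    83: (0x1B4, "56", False, 15),
    82: (0x384, "??", True, 0),
    81: (0x1AC, "34", False, 77),
    80: (0x380, "12", False, 0),
}
HEAD_SSN = 80  # oldest store still buffered; smaller SSNs have drained

def root_index(address):
    # Assumption: a 64-entry root table indexed by word-address bits.
    return (address >> 2) % 64

# Root table: index -> SSN of the youngest store that hashed there.
# Only the four entries visible on the slide are modeled.
ROOT = {
    root_index(0xAC): 85,
    root_index(0xB0): 86,
    root_index(0xB4): 83,
    root_index(0xB8): 21,
}

def lookup(address, older_than=None):
    """Follow the chain for `address`.

    Returns (ssn, value, poison) of the forwarding store, or None when the
    load must go to D$. For rally loads, `older_than` is the SSN of the
    immediately older store; younger stores on the chain are skipped.
    """
    ssn = ROOT.get(root_index(address), 0)
    while ssn >= HEAD_SSN:
        addr, value, poison, link = STORE_BUFFER[ssn]
        visible = older_than is None or ssn <= older_than
        if visible and addr == address:
            return ssn, value, poison
        ssn = link  # hop to the next older store with the same hash
    return None
```

Here `lookup(0x1AC)` takes one extra hop (85, then 81) before forwarding, while `lookup(0x1B4, older_than=81)` skips the younger store 83 and falls through to the data cache.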
Chained Store Buffer
+ Non-speculative (including no WAR hazards)
+ Scalable
+ Average number of excess hops < 0.05 with 64-entry root table
– Must stall on (miss-dependent) stores with unknown addresses
• These are rare
Multi-Processor Safety
iCFP is in-order but … (yeah again)
• Advance loads are vulnerable to stores from other threads
• Just like in an out-of-order processor
Must snoop/verify these
• Associative load queue too expensive for in-order processor
• Paper describes scheme based on local signatures
Methodology
Cycle-level simulation
• 2-way issue 9-stage in-order pipeline
• 32KByte D$
• 20-cycle 1MByte, 8-way L2 (8 8-entry stream buffers)
• 400 cycle main memory, 4Bytes/cycle, 32 outstanding misses
• 128-entry chained store buffer, 128-entry slice buffer
Spec2000 benchmarks
• Alpha AXP ISA
• DEC OSF compiler, -O4 optimization
• 2% sampling with warm-up
Initial Evaluation
[Chart: % speedup over 2-way in-order (axis 0–50); benchmarks: applu, mgrid, swim, bzip2, vortex, vpr; groups: SpecFP, SpecINT; bars: Runahead-L2, Runahead-D$, iCFP*-L2, iCFP*-D$]
iCFP vs. Runahead: advance on L2 misses
• Roughly same performance: +10%
• Dominated by MLP
• iCFP’s ability to reuse work rarely significant (vortex)
Runahead advance on D$ misses too: performance drops
• Chance for MLP is low and can’t reuse work
• Overhead of restoring checkpoint is high
• Especially because baseline stalls on use, not miss
iCFP advance under D$ misses too
• Can reuse work without restoring checkpoint but …
• iCFP* executes rallies until completion in blocking fashion
• No efficient way to handle D$ misses under L2 misses
iCFP Performance Features
Non-blocking rallies
• Miss during rally (dependent or just pending)? Don’t stall, slice it out
Fine-grain multi-threaded rallies
• Proceed in parallel with advance execution at the tail
• Rallies process dependence chains, can’t exploit superscalar
These need: incremental updates of tail register state
• Both values and poison bits
• Note: store buffer is not a tail snapshot, so no additional support
Incremental Tail Updates
Question: should current rally instruction update Tail RF?
• A1? B1? C1?
• No, no, yes
Advance execution tags registers with sequence numbers
• Distance of writing instruction from checkpoint
Rally updates Tail RF if seqnum matches
• A1’s is 1, so no
• B1’s is 2, so no
• C1’s is 3, so yes
Proper slicing can continue at tail
• C2 sliced because r3 poison preserved
[Pipeline diagram sequence: instructions numbered 1–9 from the checkpoint; Tail RF0 holds poison bits and sequence numbers for r2, r3, r5 (7, 8, 3, then 7, 8, 9 after C2); poisoned registers go from r2, r3, r5 to r2, r3 when C1 updates r5, and back to r2, r3, r5 when C2 is sliced]
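The sequence-number check can be written down directly. The sketch below assumes a per-register record with `value`, `poison`, and `seq` fields and a hypothetical `rally_writeback` helper; the result values passed in are made up, and only the sequence numbers (r2: 7, r3: 8, r5: 3) come from the slides.

```python
def rally_writeback(tail_rf, rally_rf, reg, value, seq):
    """Write one rally result.

    The rally scratch-pad (RF1) always receives the value. The tail register
    file (RF0) is updated only if this instruction is still the youngest
    writer of `reg`, i.e. its sequence number matches the tag left by
    advance execution. Returns True when the tail was updated.
    """
    rally_rf[reg] = value
    entry = tail_rf[reg]
    if entry["seq"] != seq:
        return False  # a younger advance instruction owns this register
    entry["value"] = value
    entry["poison"] = False
    return True

def make_tail_rf():
    # Tail state when the rally starts: tags 7 (A2), 8 (B2), 3 (C1).
    return {
        "r2": {"value": None, "poison": True, "seq": 7},
        "r3": {"value": None, "poison": True, "seq": 8},
        "r5": {"value": None, "poison": True, "seq": 3},
    }
```

Replaying A1 (seq 1, r2), B1 (seq 2, r3), and C1 (seq 3, r5) against `make_tail_rf()` updates the tail only for C1, which leaves r2 and r3 poisoned so that later consumers such as C2 are still sliced.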
Another iCFP Performance Feature
Minimal rallies
• Only traverse slice of returned miss, not entire slice buffer
Implementation: borrow trick from TCI [AlZawawi+, ISCA’07]
• Replace poison bits with bitvectors
• Re-organize slice buffer to support sparse access
• See paper for details
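One way to picture poison bitvectors is to give each outstanding miss its own bit, so a rally can pick out only the entries that depend on the miss that just returned. The sketch below is an illustration under that assumption, not the paper's design: `propagate` and `rally_candidates` are hypothetical helpers, P1 and Q1 are invented instructions depending on a second miss, and the sparse slice buffer organization mentioned on the slide is not modeled (a plain list is scanned instead).

```python
NUM_MISSES = 8  # vector width; the evaluation in this talk uses 8-bit vectors

def propagate(src_vectors):
    """Poison vector of a result: bitwise OR of its source operands' vectors."""
    out = 0
    for vec in src_vectors:
        out |= vec
    return out & ((1 << NUM_MISSES) - 1)

def rally_candidates(slice_buffer, returned_miss):
    """Names of slice-buffer entries that depend on the returned miss."""
    mask = 1 << returned_miss
    return [name for name, vec in slice_buffer if vec & mask]

# Miss 0 is A1's miss; miss 1 belongs to a hypothetical second load P1.
SLICE_BUFFER = [
    ("A1", 0b01),
    ("B1", 0b01),
    ("P1", 0b10),
    ("C1", 0b01),
    ("Q1", propagate([0b01, 0b10])),  # hypothetical: depends on both misses
]
```

When miss 0 returns, only A1, B1, C1, and Q1 need to be revisited; P1 stays put until its own miss comes back.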
Tolerating All Level Cache Misses
[Chart: % speedup over 2-way in-order (axis 0–50); benchmarks: applu, mgrid, swim, bzip2, vortex, vpr; groups: SpecFP, SpecINT; bars: Runahead-L2, iCFP*-L2, iCFP-L2, iCFP-D$]
iCFP performance features?
• Help iCFP-L2 (now better than Runahead-L2)
• Help iCFP-D$ even more (now better than iCFP-L2)
Feature Contribution Analysis
[Chart: % speedup over 2-way in-order (axis 0–50); benchmarks: applu, mgrid, swim, bzip2, vortex, vpr; groups: SpecFP, SpecINT; bars: iCFP*, + non-blocking, + multi-threading, + minimal]
iCFP*-D$: no “performance” features
Non-blocking rallies
• Most significant performance feature
• Helps programs with dependent misses (vpr, mcf)
• Helps programs with D$ misses under L2 misses (applu)
Multi-threaded rallies: one slot of 2-way superscalar
• “Free” with support for non-blocking rallies
• Helps uniformly
Minimal rallies: 8-bit poison vectors
• Helps uniformly (most misses are independent)
Out of Slice Buffer?
iCFP defaults to runahead when out of slice or store buffer
• Not overly sensitive to slice buffer size
[Chart: % speedup over 2-way in-order (axis 0–50); benchmarks: applu, mgrid, swim, bzip2, vortex, vpr; groups: SpecFP, SpecINT; bars: slice buffer sizes 0 (Runahead), 32, 64, 128]
What About Store Buffer?
• A little more sensitive to store buffer size
• Chaining essentially performance equivalent to associative search
[Chart: % speedup over 2-way in-order (axis 0–50); benchmarks: applu, mgrid, swim, bzip2, vortex, vpr; groups: SpecFP, SpecINT; bars: store buffer sizes 32, 64, 128, 128-assoc]
Performance vs. Hardware Cost
• Runahead: +11% for checkpoints, poison bits, forwarding cache
• iCFP: +17%, for checkpoints, poison bits, store buffer, slice buffer
• Basically: Runahead + 6% for a 128-entry slice buffer
• OoO: +63% for 128-entry window, 32-entry issue queue, etc.
• CFP: +75% for OoO and 128-entry slice buffer
[Chart: % speedup over 2-way in-order (axis 0–100); benchmarks: applu, mgrid, swim, bzip2, vortex, vpr; groups: SpecFP, SpecINT; bars: Runahead, iCFP, OoO, CFP]
Related Work
Multipass pipelining [Barnes+, MICRO’05]
• Rallies re-execute everything, but with higher ILP
Simple Latency Tolerant Processor [Nekkalapu+, ICCD’08]
• Similar, but … single, blocking rallies, speculative cache writes
Rock [Tremblay+, ISSCC’08]
• “Upon encountering a long latency instruction, the pipeline takes a checkpoint … creates future state and only reruns dependent instructions accumulated since the original checkpoint …. While one thread is completing the future created by the ahead thread, it continues execution to create the next future version of the architected state … This leapfrogging continues …”
• Sounds similar, what does it really do?
Conclusion
iCFP: in-order Continual Flow Pipeline
• In-order + ability to flow around cache misses at all levels
• Minimal hardware: runahead + slice buffer
Key features: not present elsewhere (afawk)
• Non-blocking, multi-threaded, minimal rallies
Supporting technologies
• Chained store buffer
• Incremental tail register state updates
Incremental is a good thing!
Comparative Performance
[Chart: axis 0–50; benchmarks: applu, swim, SpecFP, bzip2, vpr, SpecInt; bars: Runahead, Multipass, SLTP, iCFP]