EKIVOLOS: Self-Contained, Accurate Precomputation Prefetching
Islam Atta
Xin Tong
Andreas Moshovos
Viji Srinivasan
Ioana Baldini
[Figure: growth of the digital universe, 4.4 ZB (2013) to 44 ZB (2020). Source: EMC² Digital Universe Study. Graphic Credit: www.editeddaily.com]
Prefetching is the traditional remedy
Unconventional Data Sources: Unstructured & Semi-Structured
Logs, XML, Sparse Matrices, Graphs
Memory-Bound
Hardware Prefetchers
History of Accesses + Current State → Predict Future Accesses
History-based predictions may not be sufficient!
Non-Repetitive, Irregular Accesses
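As an illustration only (not taken from the talk), the sketch below uses a simple last-stride predictor as a stand-in for a history-based prefetcher. The function name and both address streams are assumptions made up for this example.

```python
def stride_accuracy(addresses):
    """Fraction of accesses that a last-stride predictor guesses correctly.

    The predictor assumes the next address equals the last address plus
    the most recent stride (last - previous).
    """
    predictions = 0
    correct = 0
    for i in range(2, len(addresses)):
        stride = addresses[i - 1] - addresses[i - 2]
        predicted = addresses[i - 1] + stride
        predictions += 1
        if predicted == addresses[i]:
            correct += 1
    return correct / predictions if predictions else 0.0


# Hypothetical streams: a linear scan and a data-dependent (indexed) walk.
linear = [4 * i for i in range(8)]
irregular = [4 * idx for idx in [7, 2, 9, 0, 5, 3, 8, 1]]

print(stride_accuracy(linear))     # every prediction matches
print(stride_accuracy(irregular))  # no two consecutive strides repeat
```

The linear scan is fully predictable from history, while the indexed walk never repeats a stride, so the same predictor gets nothing right.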
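[Figure: a Precomputation Slice (P-Slice) and the main thread share the cache (LLC) in front of Memory. Timeline: the P-slice issues a Prefetch for the Target Load before the main thread executes the Load.]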
Delinquent Load: a problematic load which accounts for a significant amount of memory stalls.
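A minimal sketch of how delinquent loads could be picked from a profile, assuming per-load stall counts are already available. The function name, the 80% coverage threshold, and the stall numbers are hypothetical; the slide only says "a significant amount of memory stalls".

```python
def find_delinquent_loads(stall_cycles_by_pc, coverage=0.8):
    """Return the load PCs that together cover `coverage` of all stall cycles.

    `stall_cycles_by_pc` maps a load's program counter to the memory stall
    cycles attributed to it. The threshold is an assumption for this sketch.
    """
    total = sum(stall_cycles_by_pc.values())
    if total == 0:
        return []
    picked, covered = [], 0
    # Rank loads from most to least stall cycles.
    ranked = sorted(stall_cycles_by_pc.items(), key=lambda kv: kv[1], reverse=True)
    for pc, cycles in ranked:
        if covered / total >= coverage:
            break
        picked.append(pc)
        covered += cycles
    return picked


# Hypothetical profile; the addresses are load PCs from the SpVM loop.
profile = {0x9576: 900, 0x956E: 60, 0x9572: 40}
delinquent = find_delinquent_loads(profile)
print([hex(pc) for pc in delinquent])
```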
[Figure labels: Hit; P-Slice on Context 1; Main Thread on Context 0]
Precomputation Prefetchers
Program Slice + Current State → Precompute Future Accesses
Prediction-based vs. Precomputation-based Prefetching
Yet Another Precomputation Prefetcher?
Past work constructed P-slices…
• Manually: Burdensome Task
• At Compile Time: Requires Source Code, Dense P-slices
• Traces from Binary: Inaccurate P-slices
Re-design binary-based implementations to prioritize accuracy
P-slices ought to be… Accurate, Fast
Conventional binary-based methods over-simplify P-slices
Correctness: Do not modify the state of the main-thread.
Fast: Aggressively optimize a p-slice.
• Ignore Control Flow
• Ignore Memory Dependencies
Potential Inaccuracy → Monitor & Correct Mechanisms: Abort & Restart
[Figure: Iterations vs. Time; Main Thread; α; Variable Run-ahead distance]
Inaccurate, Light P-slice
Applications with intense code divergence or memory dependencies foil “Monitor & Correct” mechanisms
Paradigm shift – Accuracy-First P-slice
Memory Dependencies: p-slice uses a local store buffer.
Control Flow: Merge multiple traces, instead of the single dominant trace.
All data dependencies can be maintained.
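The idea of a local store buffer can be sketched as follows. This is an illustrative model, not the hardware design from the talk: the class name, the dictionary-based memory, and the unbounded buffer are all assumptions. Stores issued by the p-slice stay private, later p-slice loads see them, and the main thread's state is never modified.

```python
class PSliceMemory:
    """Memory view of a p-slice: reads shared memory, writes stay local."""

    def __init__(self, main_memory):
        self._main = main_memory   # shared state, only ever read here
        self._store_buffer = {}    # private to the p-slice (unbounded in this sketch)

    def store(self, addr, value):
        # A p-slice store never reaches the main thread's memory.
        self._store_buffer[addr] = value

    def load(self, addr):
        # Forward from the local buffer first to honor store-to-load dependencies.
        if addr in self._store_buffer:
            return self._store_buffer[addr]
        return self._main.get(addr, 0)


# Hypothetical example: a loop counter kept in memory and incremented by the slice.
main_memory = {0x1000: 5}
mem = PSliceMemory(main_memory)
counter = mem.load(0x1000)
mem.store(0x1000, counter + 1)
reloaded = mem.load(0x1000)
print(reloaded, main_memory[0x1000])
```

The p-slice observes its own update while the shared value is left untouched, which is the correctness condition stated earlier: do not modify the state of the main thread.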
• No monitoring
• Accurate
• Maybe slightly slower, but can still run ahead
Accurately replicate the main thread’s execution path.
Eventually higher run-ahead distance
[Figure: Iterations vs. Time; Main Thread; α; Inaccurate, Light P-slice vs. More Accurate, Denser P-slice]
EKIVOLOS – “Slow and Steady Wins the Race”
Sparse Matrices: Web Graphs, Circuit Simulation, DNA Analysis, Social Networks, Graph Partitioning, Clustering, Fluid Dynamics
Example of Hard-to-Predict Accesses
SpVM – Sparse-Vector Sparse-Matrix Multiplication
V[] x M[][] = RV[]
V[]: V_val, V_idx
M[][]: M_val, M_idx, M_begin
Access pattern labels: Linear; Fragmented Linear
Outer: Scan over V_idx[], find corresponding Row
Inner: Scan over M_idx[], find corresponding RV[]
RV Accesses: History does not entail Future
Random!
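The two loops described above can be sketched in Python. The array names follow the slide; the assumption that M_begin[r] and M_begin[r + 1] delimit row r, the function signature, and the example data are mine.

```python
def spvm(V_val, V_idx, M_val, M_idx, M_begin, n_cols):
    """Multiply a sparse vector by a sparse matrix, returning a dense RV."""
    RV = [0.0] * n_cols
    # Outer loop: scan over V_idx[] and find the corresponding row of M.
    for i in range(len(V_idx)):
        row = V_idx[i]
        # Inner loop: scan over M_idx[] for that row (linear within the row).
        for j in range(M_begin[row], M_begin[row + 1]):
            col = M_idx[j]
            # RV[col] is the data-dependent access that history cannot predict.
            RV[col] += V_val[i] * M_val[j]
    return RV


# Hypothetical data: M = [[1, 0, 2], [0, 3, 0], [4, 0, 5]], V = [2, 0, 1].
M_val = [1, 2, 3, 4, 5]
M_idx = [0, 2, 1, 0, 2]
M_begin = [0, 2, 3, 5]
V_val = [2, 1]
V_idx = [0, 2]

print(spvm(V_val, V_idx, M_val, M_idx, M_begin, n_cols=3))
```

The address of each RV access depends on the value just loaded from M_idx, so it can only be known by computing it.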
Binary-based P-Slice Construction
Pre-Compute RV Addresses
Steps (over time):
0. Identify Delinquent Load
1. Execute on the CPU
2. Collect Instruction Trace
3. Identify Dominant Loop
4. Apply Backward Slicing

Instruction trace: the inner-loop iteration listed below repeats many times, interleaved with outer-loop transitions.

Dominant loop:
0x957a mla r5, r6, ip, r5
0x957e str r5, [r2,r4,lsl#2]
0x9582 ldr r4, [r0, #16]
0x9584 cmp r4, r7
0x9586 blt 0x95b4
0x9588 ldr r5, [sl]
0x958c add r3, #1
0x958e cmp r3, r5
0x9590 blt 0x9566
0x9566 ldr r4, [r0, #12]
0x9568 add r1, #4
0x956a cmp r3, r4
0x956c bge 0x95d0
0x956e ldr r4, [r9,r1]
0x9572 ldr r6, [r8,r1]
0x9576 ldr r5, [r2,r4,lsl#2]
SpVM P-Slice: Backward Slicing
Dominant loop:
0x957a mla r5, r6, ip, r5
0x957e str r5, [r2,r4,lsl#2]
0x9582 ldr r4, [r0, #16]
0x9584 cmp r4, r7
0x9586 blt 0x95b4
0x9588 ldr r5, [sl]
0x958c add r3, #1
0x958e cmp r3, r5
0x9590 blt 0x9566
0x9566 ldr r4, [r0, #12]
0x9568 add r1, #4
0x956a cmp r3, r4
0x956c bge 0x95d0
0x956e ldr r4, [r9,r1]
0x9572 ldr r6, [r8,r1]
0x9576 ldr r5, [r2,r4,lsl#2]   <- Delinquent Load

Eliminate Control Flow:
0x957a mla r5, r6, ip, r5
0x957e str r5, [r2,r4,lsl#2]
0x9582 ldr r4, [r0, #16]
0x9588 ldr r5, [sl]
0x958c add r3, #1
0x9566 ldr r4, [r0, #12]
0x9568 add r1, #4
0x956e ldr r4, [r9,r1]
0x9572 ldr r6, [r8,r1]
0x9576 ldr r5, [r2,r4,lsl#2]

Eliminate Stores:
0x957a mla r5, r6, ip, r5
0x9582 ldr r4, [r0, #16]
0x9588 ldr r5, [sl]
0x958c add r3, #1
0x9566 ldr r4, [r0, #12]
0x9568 add r1, #4
0x956e ldr r4, [r9,r1]
0x9572 ldr r6, [r8,r1]
0x9576 ldr r5, [r2,r4,lsl#2]

Retain Only Register Dependencies:
0x9568 add r1, #4
0x956e ldr r4, [r9,r1]
0x9576 ldr r5, [r2,r4,lsl#2]
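The conventional, register-only backward slicing shown on this slide can be sketched as below. This is a simplified model under stated assumptions: the destination and source register sets were transcribed by hand from the loop, the function and table names are hypothetical, and a single backward pass over one iteration is assumed to be enough (it is for this loop, since the only loop-carried input is r1).

```python
# Each entry: (address, mnemonic, destination registers, source registers).
LOOP = [
    (0x957A, "mla", {"r5"}, {"r6", "ip", "r5"}),
    (0x957E, "str", set(), {"r5", "r2", "r4"}),
    (0x9582, "ldr", {"r4"}, {"r0"}),
    (0x9584, "cmp", set(), {"r4", "r7"}),
    (0x9586, "blt", set(), set()),
    (0x9588, "ldr", {"r5"}, {"sl"}),
    (0x958C, "add", {"r3"}, {"r3"}),
    (0x958E, "cmp", set(), {"r3", "r5"}),
    (0x9590, "blt", set(), set()),
    (0x9566, "ldr", {"r4"}, {"r0"}),
    (0x9568, "add", {"r1"}, {"r1"}),
    (0x956A, "cmp", set(), {"r3", "r4"}),
    (0x956C, "bge", set(), set()),
    (0x956E, "ldr", {"r4"}, {"r9", "r1"}),
    (0x9572, "ldr", {"r6"}, {"r8", "r1"}),
    (0x9576, "ldr", {"r5"}, {"r2", "r4"}),  # delinquent load
]

# Control flow (compares, branches) and stores are dropped by this scheme.
SKIPPED = {"cmp", "blt", "bge", "b", "str"}


def backward_slice(loop, target_addr):
    """Return addresses of instructions the target depends on via registers."""
    target_pos = next(i for i, ins in enumerate(loop) if ins[0] == target_addr)
    needed = set(loop[target_pos][3])  # registers whose producers we still need
    kept = [target_addr]
    for addr, op, dests, srcs in reversed(loop[:target_pos]):
        if op in SKIPPED:
            continue
        if dests & needed:
            kept.append(addr)
            needed -= dests   # this instruction produces them
            needed |= srcs    # and in turn needs its own sources
    return list(reversed(kept))


print([hex(a) for a in backward_slice(LOOP, 0x9576)])
```

Registers left in the needed set at the end (r2, r9, and the loop-carried r1) are live-in values that the p-slice must receive from the main thread's state.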
Dominant-Path P-slice (Inner-most Dominant Loop):
0x9568 add r1, #4
0x956e ldr r4, [r9,r1]
0x9576 ldr r5, [r2,r4,lsl#2]
[Figure: V[] (V_val, V_idx) x M[][] (M_val, M_idx, M_begin) = RV[]]
Fails to Pre-Compute RV Addresses for Multiple Rows
EKIVOLOS
Memory Dependencies: Local Store Buffer
Control Flow: Keep Control Flow, Merge Multiple Traces
Maintains All Data Dependencies
Accurately Replicates Main Thread’s Execution Path
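One way to picture "merge multiple traces" is to take the union of the control-flow edges seen in each trace, so that branches with more than one observed successor are kept instead of being flattened into a single dominant path. The sketch below is only an illustration of that idea, not the authors' algorithm; the function names are hypothetical and the two short traces are excerpts of the inner and outer paths from the slides.

```python
from collections import defaultdict


def merge_traces(traces):
    """Union the (pc -> next pc) edges observed across several traces."""
    successors = defaultdict(set)
    for trace in traces:
        for pc, next_pc in zip(trace, trace[1:]):
            successors[pc].add(next_pc)
    return dict(successors)


def divergence_points(successors):
    """Instructions with more than one observed successor."""
    return sorted(pc for pc, nxt in successors.items() if len(nxt) > 1)


# Excerpts: after `blt 0x9566` at 0x9590, the inner path loops back to 0x9566
# while the outer path falls through to 0x9592.
inner = [0x958E, 0x9590, 0x9566, 0x9568]
outer = [0x958E, 0x9590, 0x9592, 0x9594]

merged = merge_traces([inner, outer])
print([hex(pc) for pc in divergence_points(merged)])
```

A p-slice built from the merged edges has to keep the compare and branch at the divergence point, which is what lets it follow the main thread across rows.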
Outer:
0x957a mla r5, r6, ip, r5
0x957e str r5, [r2,r4,lsl#2]
0x9588 ldr r5, [sl]
0x958c add r3, #1
0x958e cmp r3, r5
0x9590 blt 0x9566
0x9592 ldr r3, [sp,#16]
0x9594 ldr r5, [r3,#8]
0x9596 ldr r1, [sp,#4]
0x9598 add r1, #1
0x959a str r1, [sp,#4]
0x959c cmp r1, r5
0x959e bge 0x95c2
0x95a2 ldr r3, [r1,#4]!
0x9538 add r7, r3, #1
0x953a ldr r3, [fp,r3,lsl#2]
0x9546 add sl, fp, r7, lsl#2
0x9554 ldr r1, [r0,#4]
0x9556 mov r8, r3, lsl#2
0x955a ldr r4, [r0,#0]
0x955c add r9, r1, r8
0x9560 mov r1, #0
0x9562 add r8, r4
0x9564 b 0x956e
0x956e ldr r4, [r9,r1]
0x9572 ldr r6, [r8,r1]
0x9576 ldr r5, [r2,r4,lsl#2]

Inner:
0x957a mla r5, r6, ip, r5
0x957e str r5, [r2,r4,lsl#2]
0x9582 ldr r4, [r0, #16]
0x9584 cmp r4, r7
0x9586 blt 0x95b4
0x9588 ldr r5, [sl]
0x958c add r3, #1
0x958e cmp r3, r5
0x9590 blt 0x9566
0x9566 ldr r4, [r0, #12]
0x9568 add r1, #4
0x956a cmp r3, r4
0x956c bge 0x95d0
0x956e ldr r4, [r9,r1]
0x9572 ldr r6, [r8,r1]
0x9576 ldr r5, [r2,r4,lsl#2]
[Figure: Prefetch Core, L1, L2, LSB (Local Store Buffer)]
Simple Algorithm, Much Better Accuracy
Evaluation – Methodology
System Setup
ESESC Simulator, ARM ISA
Main core: Out-of-Order, 3GHz
Prefetch core: In-Order, 3GHz
Area & Energy: McPAT 1.2
Evaluated Workloads Represent:
Computational Biology, Data Mining, Floating Point Differential, Graph Search, Hash Table Joins, Image Processing, Optimization, Scheduling, Simulation, Sorting, Sparse Matrix Multiplication, Support Vector Machines
Key Results
[Figure: LLC Misses (Normalized MPKI, axis 0 to 1) and Speedup (Relative Speedup, axis 1 to 2.8) for:
• Ekivolos (Control Flow Only)
• Dominant-Path Precomputation Prefetcher
• Ekivolos (Control Flow and Memory Dependencies)]
Summary labels: Energy 10%; Control Flow, Memory Dependencies: 70%, 267% (0-12X)
Legend:
• SMS – Spatial Address Correlation
• AMPM – Pattern Matching
• PC/AC – Address Correlation with PC-Localization
• Ekivolos+ASP – Adding Simple Stream Prefetcher
Limitations of Ekivolos
Currently Requires Offline Profiling
Effectiveness depends on Profiling Input
Targets only Delinquent Loads
Future Work Directions
• Enhancements: Online Profiling
• P-Core Architecture: “In-Memory” or “In-Cache” Processing
• Benchmarks: Suites suitable for Architectural Studies; Big Data; Diverse Memory Access Patterns
What We Learned
P-slices Need Not be Aggressively Optimized
Simple Algorithm: Control Flow & Memory Dependencies
Emerging Algorithms not Studied Before
Prefetch-cores can be Simplified