EKIVOLOS Self Contained, Accurate Precomputation Prefetching Islam Atta Xin Tong Andreas Moshovos Viji Srinivasan Ioana Baldini


Post on 19-Jan-2016


TRANSCRIPT

Page 1

EKIVOLOS

Self Contained, Accurate Precomputation Prefetching

Islam Atta

Xin Tong

Andreas Moshovos

Viji Srinivasan

Ioana Baldini

Page 2

Figure: the digital universe grows from 4.4 ZB in 2013 to 44 ZB in 2020 (EMC2 Digital Universe Study).

Graphic Credit: www.editeddaily.com

Page 3

Prefetching is the traditional remedy

Unconventional Data Sources: Unstructured & Semi-Structured

• Log
• XML
• Sparse Matrices
• Graphs

Memory-Bound

Hardware Prefetchers: History of Accesses + Current State → Predict Future Accesses

History-based predictions may not be sufficient!

Non-Repetitive, Irregular Accesses

Graphic Credit: www.editeddaily.com
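The limitation of history-based prediction can be illustrated with a minimal sketch. The last-stride predictor below is a deliberately simple stand-in chosen for illustration; it is not one of the hardware prefetchers evaluated in the talk, and the function name, base address, and stream sizes are all assumptions.

```python
import random

def stride_prediction_coverage(addresses):
    """Fraction of accesses that a last-stride predictor guesses correctly.

    The predictor remembers the previous address and the previous stride and
    predicts next = last + stride (a simple history-based scheme).
    """
    last = None
    stride = None
    hits = 0
    for addr in addresses:
        if last is not None and stride is not None and last + stride == addr:
            hits += 1
        if last is not None:
            stride = addr - last
        last = addr
    return hits / len(addresses)

# Regular stream: a linear scan over 4-byte elements (hypothetical base address).
linear = [0x1000 + 4 * i for i in range(1000)]

# Irregular stream: the same elements visited through a shuffled index array.
rng = random.Random(0)
index = list(range(1000))
rng.shuffle(index)
irregular = [0x1000 + 4 * i for i in index]

linear_cov = stride_prediction_coverage(linear)
irregular_cov = stride_prediction_coverage(irregular)
print(f"linear: {linear_cov:.3f}, irregular: {irregular_cov:.3f}")
```

On the linear scan the predictor is right for every access after the first two; on the shuffled stream past strides say almost nothing about the next address.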

Page 4

Prediction-based / Precomputation-based Prefetching

Precomputation Prefetchers: Program Slice + Current State → Precompute Future Accesses

Figure (timeline): the Main Thread runs in Context 0 and the Precomputation Slice (P-Slice) runs in Context 1, over a Shared Cache (LLC) and Memory. The P-Slice issues a Prefetch for the Target Load; the Main Thread's later Load is a Hit.

Delinquent Load: a problematic load which accounts for a significant amount of memory stalls.

Page 5

Yet Another Precomputation Prefetcher?

Past Work constructed P-slices…

• Manually: Burdensome Task
• At Compile Time: Requires Source Code, Dense P-slices
• Traces from Binary: Inaccurate P-slices

P-slices ought to be…

• Accurate
• Fast

Re-design binary-based implementations to prioritize accuracy

Page 6

Conventional Binary-based Methods Over-Simplify P-slices

Correctness: Do not modify the state of the main thread.

Fast: Aggressively optimize a p-slice.

• Ignore Control Flow
• Ignore Memory Dependencies

Potential Inaccuracy → Monitor & Correct Mechanisms

Figure (Iterations vs. Time). Labels: Main Thread, Inaccurate Light P-slice, Variable Run-ahead distance, α, Abort & Restart.

Applications with intense code divergence or memory dependencies foil “Monitor & Correct” mechanisms.

Page 7

Paradigm shift – Accuracy-First P-slice

Memory Dependencies: p-slice uses a local store buffer.

Control Flow: Merge multiple traces, instead of the single dominant trace.

All data dependencies can be maintained.

Accurately replicate main thread’s execution path.

• No monitoring
• Accurate
• Maybe slightly slower but can still run ahead

Eventually higher run-ahead distance

Figure (Iterations vs. Time). Labels: Main Thread, α, Inaccurate Light P-slice, More Accurate Denser P-slice.

EKIVOLOS – “Slow and Steady Wins The Race”
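The local store buffer idea can be sketched in a few lines. This is a toy model under stated assumptions, not the hardware design from the talk: the class name, the dictionary-backed memory, and the unbounded buffer are all hypothetical. It shows the two properties the slide relies on: p-slice stores never modify main-thread state, and later p-slice loads still see the values those stores wrote.

```python
class LocalStoreBuffer:
    """Private store buffer for a p-slice (hypothetical toy model).

    Stores are captured locally, so shared memory is never written; loads
    check the buffer first and fall back to shared memory on a miss.
    """

    def __init__(self, memory):
        self._memory = memory  # shared state, only ever read by the p-slice
        self._buffer = {}      # address -> value written by the p-slice

    def store(self, addr, value):
        # Keep the write private to the p-slice.
        self._buffer[addr] = value

    def load(self, addr):
        # Forward from the p-slice's own earlier store when one exists.
        if addr in self._buffer:
            return self._buffer[addr]
        return self._memory.get(addr, 0)


main_memory = {0x100: 7}
lsb = LocalStoreBuffer(main_memory)
lsb.store(0x100, 8)

forwarded = lsb.load(0x100)     # value written by the p-slice
untouched = main_memory[0x100]  # main-thread state is unchanged
print(forwarded, untouched)
```

A real buffer would be finite and would need a replacement policy; the sketch leaves that out.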

Page 8

Sparse Matrices

• Web Graphs
• Circuit Simulation
• DNA Analysis
• Social Networks
• Graph Partitioning
• Clustering
• Fluid Dynamics

Example of Hard-to-Predict Accesses

SpVM – Sparse-Vector Sparse-Matrix Multiplication

Page 9

Example of Hard-to-Predict Accesses

SpVM – Sparse-Vector Sparse-Matrix Multiplication

V[] x M[][] = RV[]

Arrays: V_val, V_idx (sparse vector V); M_val, M_idx, M_begin (sparse matrix M)

Access patterns: V[] is Linear; M[][] is Fragmented Linear; RV[] is Random!

Outer loop: Scan over V_idx[], find corresponding Row

Inner loop: Scan over M_idx[], find corresponding RV[]

RV Accesses: History does not entail Future
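The two loops described on this slide can be written out as a short sketch. The array names follow the slide; the row-compressed layout (M_begin holding one start offset per row plus a final end offset), the dense result vector, and the example values are assumptions made for illustration.

```python
def spvm(V_idx, V_val, M_begin, M_idx, M_val, n_cols):
    """Sparse-vector times sparse-matrix product RV = V x M.

    Assumes M is stored row by row: the nonzeros of row r occupy positions
    M_begin[r] .. M_begin[r + 1] - 1 of M_idx (column) and M_val (value).
    """
    RV = [0] * n_cols
    # Outer loop: scan over V_idx[] and find the corresponding row of M.
    for i in range(len(V_idx)):
        row = V_idx[i]
        v = V_val[i]
        # Inner loop: scan over M_idx[] for that row.
        for j in range(M_begin[row], M_begin[row + 1]):
            col = M_idx[j]
            # The RV access is indexed by data, so its address is irregular.
            RV[col] += M_val[j] * v
    return RV


# Hypothetical 3x4 matrix:
#   row 0: (col 1, 2), (col 3, 1)
#   row 1: (col 0, 5)
#   row 2: (col 2, 4), (col 3, 3)
M_begin = [0, 2, 3, 5]
M_idx = [1, 3, 0, 2, 3]
M_val = [2, 1, 5, 4, 3]

# Hypothetical sparse vector with nonzeros at positions 0 and 2.
V_idx = [0, 2]
V_val = [10, 1]

result = spvm(V_idx, V_val, M_begin, M_idx, M_val, n_cols=4)
print(result)
```

V_idx and each row of M_idx are read sequentially, while the order of RV accesses depends entirely on the values stored in M_idx.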

Page 10

Binary-based P-Slice Construction

Goal: Pre-Compute RV Addresses

Steps over time:

0. Identify Delinquent Load
1. Execute (CPU)
2. Collect Instruction Trace
3. Identify Dominant Loop
4. Apply Backward Slicing

Figure: the collected instruction trace, showing repeated iterations of the inner loop (0x9566 to 0x9590) interleaved with outer-loop code (0x9592 to 0x95a2 and 0x9538 to 0x9564).

Dominant loop:

    0x957a mla r5, r6, ip, r5
    0x957e str r5, [r2,r4,lsl#2]
    0x9582 ldr r4, [r0, #16]
    0x9584 cmp r4, r7
    0x9586 blt 0x95b4
    0x9588 ldr r5, [sl]
    0x958c add r3, #1
    0x958e cmp r3, r5
    0x9590 blt 0x9566
    0x9566 ldr r4, [r0, #12]
    0x9568 add r1, #4
    0x956a cmp r3, r4
    0x956c bge 0x95d0
    0x956e ldr r4, [r9,r1]
    0x9572 ldr r6, [r8,r1]
    0x9576 ldr r5, [r2,r4,lsl#2]

Page 11

SpVM P-Slice: Backward Slicing

Inner-most Dominant Loop:

    0x957a mla r5, r6, ip, r5
    0x957e str r5, [r2,r4,lsl#2]
    0x9582 ldr r4, [r0, #16]
    0x9584 cmp r4, r7
    0x9586 blt 0x95b4
    0x9588 ldr r5, [sl]
    0x958c add r3, #1
    0x958e cmp r3, r5
    0x9590 blt 0x9566
    0x9566 ldr r4, [r0, #12]
    0x9568 add r1, #4
    0x956a cmp r3, r4
    0x956c bge 0x95d0
    0x956e ldr r4, [r9,r1]
    0x9572 ldr r6, [r8,r1]
    0x9576 ldr r5, [r2,r4,lsl#2]    (Delinquent Load)

1. Eliminate Control Flow:

    0x957a mla r5, r6, ip, r5
    0x957e str r5, [r2,r4,lsl#2]
    0x9582 ldr r4, [r0, #16]
    0x9588 ldr r5, [sl]
    0x958c add r3, #1
    0x9566 ldr r4, [r0, #12]
    0x9568 add r1, #4
    0x956e ldr r4, [r9,r1]
    0x9572 ldr r6, [r8,r1]
    0x9576 ldr r5, [r2,r4,lsl#2]

2. Eliminate Stores:

    0x957a mla r5, r6, ip, r5
    0x9582 ldr r4, [r0, #16]
    0x9588 ldr r5, [sl]
    0x958c add r3, #1
    0x9566 ldr r4, [r0, #12]
    0x9568 add r1, #4
    0x956e ldr r4, [r9,r1]
    0x9572 ldr r6, [r8,r1]
    0x9576 ldr r5, [r2,r4,lsl#2]

3. Retain Only Register Dependencies (the Dominant-Path P-slice):

    0x9568 add r1, #4
    0x956e ldr r4, [r9,r1]
    0x9576 ldr r5, [r2,r4,lsl#2]

Figure: V[] x M[][] = RV[], with arrays V_val, V_idx, M_val, M_idx, M_begin.

Fails to Pre-Compute RV Addresses for Multiple Rows
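The slicing steps on this slide can be reproduced with a small sketch. It is an illustration under stated assumptions, not the tool used in the talk: the `Insn` record, the `kind` labels, and the hand-annotated register read/write sets are all hypothetical. The instruction list is the inner-most dominant loop from the slide. The function walks backward from the delinquent load, skips control flow and stores, and keeps only instructions that produce a register the slice still needs.

```python
from typing import List, NamedTuple, Tuple


class Insn(NamedTuple):
    pc: int
    text: str
    writes: Tuple[str, ...]
    reads: Tuple[str, ...]
    kind: str  # "alu", "load", "store", or "control" (compares and branches)


# Inner-most dominant loop from the slide; register sets annotated by hand.
LOOP: List[Insn] = [
    Insn(0x957A, "mla r5, r6, ip, r5", ("r5",), ("r6", "ip", "r5"), "alu"),
    Insn(0x957E, "str r5, [r2,r4,lsl#2]", (), ("r5", "r2", "r4"), "store"),
    Insn(0x9582, "ldr r4, [r0, #16]", ("r4",), ("r0",), "load"),
    Insn(0x9584, "cmp r4, r7", (), ("r4", "r7"), "control"),
    Insn(0x9586, "blt 0x95b4", (), (), "control"),
    Insn(0x9588, "ldr r5, [sl]", ("r5",), ("sl",), "load"),
    Insn(0x958C, "add r3, #1", ("r3",), ("r3",), "alu"),
    Insn(0x958E, "cmp r3, r5", (), ("r3", "r5"), "control"),
    Insn(0x9590, "blt 0x9566", (), (), "control"),
    Insn(0x9566, "ldr r4, [r0, #12]", ("r4",), ("r0",), "load"),
    Insn(0x9568, "add r1, #4", ("r1",), ("r1",), "alu"),
    Insn(0x956A, "cmp r3, r4", (), ("r3", "r4"), "control"),
    Insn(0x956C, "bge 0x95d0", (), (), "control"),
    Insn(0x956E, "ldr r4, [r9,r1]", ("r4",), ("r9", "r1"), "load"),
    Insn(0x9572, "ldr r6, [r8,r1]", ("r6",), ("r8", "r1"), "load"),
    Insn(0x9576, "ldr r5, [r2,r4,lsl#2]", ("r5",), ("r2", "r4"), "load"),
]


def backward_slice(trace: List[Insn], target_pc: int) -> List[Insn]:
    """Register-only backward slice of the last occurrence of target_pc."""
    target = max(i for i, insn in enumerate(trace) if insn.pc == target_pc)
    live = set(trace[target].reads)  # registers the slice still needs
    kept = [trace[target]]
    for insn in reversed(trace[:target]):
        if insn.kind in ("store", "control"):
            continue  # eliminate stores and control flow
        if live.intersection(insn.writes):
            live.difference_update(insn.writes)
            live.update(insn.reads)
            kept.append(insn)
    kept.reverse()
    return kept


p_slice = backward_slice(LOOP, 0x9576)
for insn in p_slice:
    print(f"{insn.pc:#06x} {insn.text}")
```

Because the store at 0x957e and every branch are dropped, the resulting slice only advances r1 within one row, which is consistent with the slide's point that it cannot follow the main thread across rows.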

Page 12

EKIVOLOS

Memory Dependencies: Local Store Buffer

Control Flow: Keep Control Flow, Merge Multiple Traces

Maintains All Data Dependencies

Accurately Replicates Main Thread’s Execution Path

Outer trace:

    0x957a mla r5, r6, ip, r5
    0x957e str r5, [r2,r4,lsl#2]
    0x9588 ldr r5, [sl]
    0x958c add r3, #1
    0x958e cmp r3, r5
    0x9590 blt 0x9566
    0x9592 ldr r3, [sp,#16]
    0x9594 ldr r5, [r3,#8]
    0x9596 ldr r1, [sp,#4]
    0x9598 add r1, #1
    0x959a str r1, [sp,#4]
    0x959c cmp r1, r5
    0x959e bge 0x95c2
    0x95a2 ldr r3, [r1,#4]!
    0x9538 add r7, r3, #1
    0x953a ldr r3, [fp,r3,lsl#2]
    0x9546 add sl, fp, r7, lsl#2
    0x9554 ldr r1, [r0,#4]
    0x9556 mov r8, r3, lsl#2
    0x955a ldr r4, [r0,#0]
    0x955c add r9, r1, r8
    0x9560 mov r1, #0
    0x9562 add r8, r4
    0x9564 b 0x956e
    0x956e ldr r4, [r9,r1]
    0x9572 ldr r6, [r8,r1]
    0x9576 ldr r5, [r2,r4,lsl#2]

Inner trace:

    0x957a mla r5, r6, ip, r5
    0x957e str r5, [r2,r4,lsl#2]
    0x9582 ldr r4, [r0, #16]
    0x9584 cmp r4, r7
    0x9586 blt 0x95b4
    0x9588 ldr r5, [sl]
    0x958c add r3, #1
    0x958e cmp r3, r5
    0x9590 blt 0x9566
    0x9566 ldr r4, [r0, #12]
    0x9568 add r1, #4
    0x956a cmp r3, r4
    0x956c bge 0x95d0
    0x956e ldr r4, [r9,r1]
    0x9572 ldr r6, [r8,r1]
    0x9576 ldr r5, [r2,r4,lsl#2]

Figure: Prefetch Core with L1, L2, and a Local Store Buffer (LSB).

Simple Algorithm, Much Better Accuracy
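One way to picture "merge multiple traces" is to fold the traces into a single successor map, so that points where the traces diverge become branches the p-slice must keep. The sketch below is only an illustration of that idea, not the merging algorithm from the talk; the function name and the use of plain program-counter sequences are assumptions. The two PC lists are copied from the outer and inner traces on this slide.

```python
from collections import defaultdict

# Program-counter sequences of the two traces shown on the slide.
INNER = [0x957A, 0x957E, 0x9582, 0x9584, 0x9586, 0x9588, 0x958C, 0x958E,
         0x9590, 0x9566, 0x9568, 0x956A, 0x956C, 0x956E, 0x9572, 0x9576]
OUTER = [0x957A, 0x957E, 0x9588, 0x958C, 0x958E, 0x9590, 0x9592, 0x9594,
         0x9596, 0x9598, 0x959A, 0x959C, 0x959E, 0x95A2, 0x9538, 0x953A,
         0x9546, 0x9554, 0x9556, 0x955A, 0x955C, 0x9560, 0x9562, 0x9564,
         0x956E, 0x9572, 0x9576]


def merge_traces(traces):
    """Union the observed PC-to-PC transitions of several traces."""
    successors = defaultdict(set)
    for trace in traces:
        for current, following in zip(trace, trace[1:]):
            successors[current].add(following)
    return successors


merged = merge_traces([INNER, OUTER])

# PCs with more than one observed successor are where the traces diverge.
divergent = {pc: sorted(nxt) for pc, nxt in merged.items() if len(nxt) > 1}
for pc, targets in sorted(divergent.items()):
    print(f"{pc:#06x} -> {[f'{t:#06x}' for t in targets]}")
```

The loop-closing branch at 0x9590 shows up with two successors (back to 0x9566 for the inner loop, on to 0x9592 for the outer loop), which is the control flow a dominant-path slice discards.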

Page 13

Evaluation – Methodology

System Setup

• ESESC Simulator, ARM ISA
• Main core: Out-of-Order, 3GHz
• Prefetch core: In-Order, 3GHz
• Area & Energy: McPAT 1.2

Evaluated Workloads Represent

• Computational Biology
• Data Mining
• Floating Point Differential
• Graph Search
• Hash Table Joins
• Image Processing
• Optimization
• Scheduling
• Simulation
• Sorting
• Sparse Matrix Multiplication
• Support Vector Machines

Page 14

Key Results

Figure, left panel (LLC Misses): Normalized MPKI, axis 0 to 1.

Figure, right panel (Speedup): Relative Speedup, axis 1 to 2.8.

Series:

• Dominant-Path Precomputation Prefetcher
• Ekivolos (Control Flow Only)
• Ekivolos (Control Flow and Memory Dependencies)

Compared prefetchers:

• SMS – Spatial Address Correlation
• AMPM – Pattern Matching
• PC/AC – Address Correlation with PC-Localization
• Ekivolos+ASP – Adding Simple Stream Prefetcher

Callouts on the slide: Energy 10%; Control Flow; Memory Dependencies 70%; 267% (0-12X)

Page 15

Limitations of Ekivolos

• Currently Requires Offline Profiling
• Effectiveness Depends on Profiling Input
• Targets Only Delinquent Loads

Page 16

Future Work Directions

Enhancements

• Online Profiling

P-Core Architecture

• “In-Memory” or “In-Cache” Processing

Benchmarks

• Suites suitable for Architectural Studies
• Big Data
• Diverse Memory Access Patterns

Page 17

What We Learned

• P-slices Need Not be Aggressively Optimized
• Simple Algorithm: Control Flow & Memory Dependencies
• Emerging Algorithms not Studied Before
• Prefetch-cores can be Simplified