DASX : Hardware Accelerator for Software Data Structures Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman (Simon Fraser University), Vijayalakshmi Srinivasan (IBM Research)

Post on 02-Jan-2016

Executive Summary

void simple() {
  for (int i = 0; i < size; ++i) {
    a[i] = b[i] + c[i];
  }
}

For each array element!

mov $400, %r1
mov $4, %r2
mul %r3, %r2
add %r1, %r2
ld (%r2), %r4

[Figure: CORE with a Reorder Buffer holding mov, mul, add, ld; data blocks H1 to H5]

Extra work encumbers the core!

High level info lost!

DASX : Accelerate the access of and compute on software data structures

Outline

– Challenges of data-centric applications

– Existing mechanisms to address challenges

– DASX : Data Structure Accelerator

– Benchmarks and Evaluation

Challenge 1/3 : Instruction Overhead

Instructions / Element

1D Vector : 2
2D Vector : 3
6D Vector : 12
OLAP Cube [Gray et al. DMKD ‘96] : up to 15D!

Unordered Set : avg. 12 instructions
BTree : 100s of instructions

COMPUTE : 9%
DATA : 66%

void simple() {
  for (int i = 0; i < size; ++i) {
    a[i] = b[i] + c[i];
  }
}
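A rough sketch of where the per-element address-generation work comes from: a row-major offset into an N-dimensional array costs one multiply and one add for every dimension after the first. The helper name and the `ops` counter are illustrative assumptions, not part of the slides, and they count only index arithmetic, so they are not expected to reproduce the instruction counts listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the slides): row-major linearization of an
// N-dimensional index using Horner's rule. `ops` reports the multiplies and
// adds spent purely on address generation.
std::size_t flat_index(const std::vector<std::size_t>& dims,
                       const std::vector<std::size_t>& idx,
                       std::size_t& ops) {
    assert(!dims.empty() && dims.size() == idx.size());
    ops = 0;
    std::size_t flat = idx[0];
    for (std::size_t d = 1; d < dims.size(); ++d) {
        flat = flat * dims[d] + idx[d];  // one multiply + one add per dimension
        ops += 2;
    }
    return flat;
}
```

Under these assumptions a 6-dimensional index spends 10 arithmetic operations before the load itself is issued, and the cost grows with every added dimension.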

Challenge 2/3 : Memory Level Parallelism

Each element independent

For each array element!

mov $400, %r1
mov $4, %r2
mul %r3, %r2
add %r1, %r2
ld (%r2), %r4

[Figure: CORE with a Reorder Buffer filled by mov, mul, add, ld]

Can't discover more MLP!

Accessing multiple data structures makes this worse!

Challenge 3/3 : Managing Cache Space

[Figure: memory hierarchy with CPU, L1, L2, MEM]

Not enough space in cache

Existing Mechanisms – Prefetching

+ Increases Memory Level Parallelism
– Increases instructions (SW PF)
– Best effort (HW PF)
– Can cause cache thrashing

void simple() {
  for (int i = 0; i < size; ++i) {
    // k : prefetch distance, in elements
    prefetch(&a[i + k]);
    prefetch(&b[i + k]);
    prefetch(&c[i + k]);
    a[i] += b[i] + c[i];
  }
}

[Figure: Reorder Buffer holding add, pref, mov, load]
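A compilable sketch of the software-prefetch loop, assuming GCC or Clang, whose `__builtin_prefetch` stands in for the slide's `prefetch`. The function name and the `dist` parameter (the slide's `k`) are illustrative; a useful distance is machine-dependent.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Software-prefetched version of the slide's loop. `dist` is the prefetch
// distance in elements (an assumed tuning parameter).
void simple_prefetch(std::vector<int>& a, const std::vector<int>& b,
                     const std::vector<int>& c, std::size_t dist) {
    assert(a.size() == b.size() && b.size() == c.size());
    const std::size_t n = a.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + dist < n) {
            // Extra instructions per element: the overhead the slide notes.
            __builtin_prefetch(&a[i + dist], 1);  // 1 = prefetch for write
            __builtin_prefetch(&b[i + dist], 0);  // 0 = prefetch for read
            __builtin_prefetch(&c[i + dist], 0);
        }
        a[i] += b[i] + c[i];
    }
}
```

The prefetches are only hints: the result is the same with or without them, but the bounds check and three prefetch instructions are paid on every iteration.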

Existing Mechanisms – SIMD

+ Reduce Instructions
– Algorithm change
– Increase power

void simple() {
  for (int i = 0; i < size; i += k) {
    SIMD_LOAD(a[i]:a[i+k]);
    SIMD_LOAD(b[i]:b[i+k]);
    SIMD_LOAD(c[i]:c[i+k]);
    SIMD_ADD(a[…], b[…], c[…]);
  }
}

[Figure: Reorder Buffer holding add, load, add]
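One way to make the SIMD pseudocode concrete, assuming GCC or Clang vector extensions (`vector_size`) rather than any particular instruction set. It computes a[i] = b[i] + c[i] as in the original loop; the type and function names are illustrative. The remainder loop is part of the "algorithm change" cost.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// A 16-byte vector of ints (GCC/Clang vector extension).
typedef int v4i __attribute__((vector_size(16)));

void simple_simd(std::vector<int>& a, const std::vector<int>& b,
                 const std::vector<int>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    const std::size_t n = a.size();
    const std::size_t k = sizeof(v4i) / sizeof(int);  // lanes per vector
    std::size_t i = 0;
    for (; i + k <= n; i += k) {
        v4i vb, vc;
        std::memcpy(&vb, &b[i], sizeof vb);  // SIMD_LOAD(b[i]:b[i+k])
        std::memcpy(&vc, &c[i], sizeof vc);  // SIMD_LOAD(c[i]:c[i+k])
        v4i va = vb + vc;                    // SIMD_ADD: one op for k elements
        std::memcpy(&a[i], &va, sizeof va);  // vector store
    }
    for (; i < n; ++i) a[i] = b[i] + c[i];   // scalar remainder loop
}
```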

Our Approach – DASX

[Figure: OOO CORE, CACHE, SHARED LAST LEVEL CACHE, and DASX (Collector, Processing Elements (PEs))]

Collector : Data structure specific fetch engine

Processing Elements (PEs) : Lightweight pipelines, all ins. fixed latency

DASX – Sample Programmer’s API

void simple() {
  for (int i = 0; i < size; ++i) {
    a[i] = b[i] + c[i];
  }
}

BEGIN SIMPLE

  // Initialize Collectors
  coll_a = new coll(ST, &a, INT, size, 0, VEC);
  coll_b = new coll(LD, &b, INT, size, 0, VEC);
  coll_c = new coll(LD, &c, INT, size, 0, VEC);

  // Run in lock-step
  group::add(coll_a, coll_b, coll_c);

  auto kfn = [](auto i, auto j) { return i + j; };

  // Start processing
  start(kfn, size);

END SIMPLE
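To make the intent of the API easier to follow, here is a software-only stand-in that can be compiled and run. Everything in it (`Coll`, `Group`, the `start` signature and its semantics) is an assumption made for illustration; the slides do not show the actual DASX runtime, and this sketch drops the element-type, offset and data-structure arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical emulation of the slide's API; not the DASX implementation.
enum Mode { LD, ST };

struct Coll {             // one collector: access mode + the vector it walks
    Mode mode;
    int* base;
    std::size_t size;
};

struct Group {            // collectors that advance in lock-step
    Coll* dst;
    Coll* src0;
    Coll* src1;
};

// Emulates start(kfn, size): each iteration takes one element from every LD
// collector, applies the kernel, and stores through the ST collector.
template <typename Kernel>
void start(const Group& g, Kernel kfn, std::size_t size) {
    assert(g.dst->mode == ST && g.src0->mode == LD && g.src1->mode == LD);
    assert(size <= g.dst->size && size <= g.src0->size && size <= g.src1->size);
    for (std::size_t i = 0; i < size; ++i) {
        g.dst->base[i] = kfn(g.src0->base[i], g.src1->base[i]);
    }
}
```

Note that the kernel only sees element values: it contains no indexing or address arithmetic, which is the property the API is built around.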

DASX – Data Structure Accelerator

1 Translate key, fetch elements
2 Allocate
3 Lock iteration data
4 Fill local storage
5 Compute (SPMD)
6 Write back dirty data
7 Unlock iteration data

[Figure: Collector and PEs next to CACHE and MEM, with STOP / GO signals]

DASX – Data Structure Accelerator

DECOUPLED ACCESS (1 – 3)
1 Translate key, fetch elements
2 Allocate
3 Lock iteration data

4 Fill local storage

EXECUTE (5 – 7)
5 Compute (SPMD)
6 Write back dirty data
7 Unlock iteration data

[Figure: Collector and PEs next to CACHE and MEM, with a STOP signal]
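The seven steps can be read as a batched loop. The sketch below is a sequential software analogy only: in the hardware the access phase runs ahead of the execute phase, while here the phases simply alternate. `TILE`, `run_tiled` and the local arrays are assumed names, and locking is reduced to comments.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Elements handled per batch of iterations (an assumed parameter).
constexpr std::size_t TILE = 8;

void run_tiled(std::vector<int>& a, const std::vector<int>& b,
               const std::vector<int>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    int lb[TILE], lc[TILE], la[TILE];             // stand-in for local storage
    for (std::size_t base = 0; base < a.size(); base += TILE) {
        const std::size_t n = std::min(TILE, a.size() - base);
        // Steps 1-4: fetch this batch's elements and fill local storage
        // (allocation and locking of the data are not modeled).
        std::copy_n(b.data() + base, n, lb);
        std::copy_n(c.data() + base, n, lc);
        // Step 5: compute; each loop trip models one PE running one iteration.
        for (std::size_t pe = 0; pe < n; ++pe) la[pe] = lb[pe] + lc[pe];
        // Steps 6-7: write back dirty data, then release the batch.
        std::copy_n(la, n, a.data() + base);
    }
}
```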

Challenges Recap

Challenge 1 : Reduce Instruction Overhead

Challenge 2 : Increase Memory Level Parallelism

Challenge 3 : Better Cache Management

DASX – Processing Elements

[Figure: Instruction Memory (1KB) feeding LANE 1 … LANE 8, each lane with REG (32)]

Features
• 3 stage pipeline
• Single Program Multiple Data
• Each PE – exec. 1 iteration
• No address generation
• Reference data using “keys”

“Reduce Instruction Overhead” by using SPMD Model and removing address generation.

DASX – Key Interface

Vector Keys : LD Key == LD Iter * Size + Offset

Hash Table Keys : LD KEY

BTree Keys

[Figure: keys 0, 1, 2, 3 paired with their data (Key, Data)]

Remove address generation overhead
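A small sketch of one possible reading of the vector-key formula. It assumes that Size is the number of elements one iteration touches and Offset selects one of them, and that the collector (not the PE) turns a key into a byte address. Both function names and the address formula are illustrative assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed reading of "LD Key == LD Iter * Size + Offset".
inline std::size_t vector_key(std::size_t iter, std::size_t size,
                              std::size_t offset) {
    assert(offset < size);   // offset picks an element within one iteration
    return iter * size + offset;
}

// Hypothetical collector-side translation: the PE only names the key, and
// the byte address is formed outside the PE's instruction stream.
inline std::uintptr_t key_to_address(std::uintptr_t base, std::size_t key,
                                     std::size_t elem_bytes) {
    assert(elem_bytes > 0);
    return base + key * elem_bytes;
}
```

With this split, the multiply and add needed for the address no longer appear among the instructions a PE executes per element.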

DASX – Collector

Data structure fetch engine
• Specialize traversal
• User defined elements

Data Structure : Collector – HW OP
Vector : Address / Stride Calc. – ADD, CMP
Hash Table : Index Calc + Bucket Traversal – INT ALU
BTree : Traversal – CMOV, ADD, CMP

Tasks – 1) Prefetch 2) Manage Cache Space
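To illustrate why a BTree traversal needs little more than compares, adds and a select, here is a software sketch with an assumed node layout. The compare-and-accumulate loop is written without data-dependent branches; whether a compiler emits a conditional move for it is not guaranteed, and the slides do not describe the collector's actual datapath.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal B-tree node layout, assumed for illustration.
struct Node {
    std::vector<int> keys;      // sorted keys
    std::vector<int> children;  // indices into the node pool; empty for a leaf
};

// Returns true if `target` is stored in the tree rooted at pool[root].
bool btree_contains(const std::vector<Node>& pool, int root, int target) {
    int cur = root;
    while (true) {
        const Node& n = pool[static_cast<std::size_t>(cur)];
        std::size_t pos = 0;
        bool hit = false;
        for (int k : n.keys) {
            pos += (k < target);   // CMP + ADD: count keys below the target
            hit |= (k == target);  // CMP: exact match in this node
        }
        if (hit) return true;
        if (n.children.empty()) return false;  // leaf reached without a match
        assert(pos < n.children.size());
        cur = n.children[pos];     // descend into the child between the bounds
    }
}
```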

Collector Task 1 : Prefetch

1 Translate keys, fetch elements
2 Allocate

• Run asynchronously with compute
• Reduce address generation cost
• Granularity of access : Data structure element
• Enhanced memory level parallelism

[Figure: Collector between CACHE and MEM]

Collector Task 2 : Manage Cache Space

3 Lock iteration data
4 Fill local storage
6 Write back dirty data
7 Unlock iteration data

• Manage cache fill and replacement
• Bulk fill OBJ-Store before iteration
• Per element refill from cache to OBJ-Store

[Figure: Collector, CACHE, OBJ-Store, PEs]

Benchmarks

• Recommender
• Text Search
• Hash Table
• OLAP Cubing
• BTree
• Black-Scholes

Evaluation – Setup

[Figure: DASX vs MT vs OOO]

DASX : 8, 1KB (labels as shown in the figure)
MT (8 threads) : IO CORE, 32 KB L1 each
OOO : OOO CORE, 64 KB L1

LLC – 4MB, 16 WAY, NUCA

DRAM – DDR2-400, 16GB, 4 Chn.

Evaluation – Performance Breakdown

[Figure: execution time normalized to OOO Core (lower is better), axis 0.00 to 1.25, for D. Cube (Memory Bound) and Black. (Compute Bound). Bars: 1 In-Order Core at LLC; + Collector (data structure engine); – Address Gen. + Local Store; X 8; MT]

Evaluation – Performance

[Figure: speedup normalized to OOO (higher is better), axis 0 to 20, for D.Cube, Reco., BTree, Hash., Black., Text. Series: DASX, 2C-4T, MT (8). Off-scale bars labeled 23.2 and 158]

Evaluation – Energy vs Performance

[Figure: Data-Cubing, Energy (axis 0.00 to 1.00) vs Execution Cycles (axis 0.04 to 0.22). Points: MT-32, MT-16, MT-8, DASX-4, DASX-8, OOO. Annotation: Best]

Summary

Highlighted the challenges of data-centric workloads

Demonstrated the effectiveness of using data structure specific information

Data structure aware hardware accelerator achieves 4.4X performance improvement

Q & A

Backup

1. Percentage of data structure instructions
2. Why collector groups?
3. Energy breakdown
4. Obj-Store details
5. Address Translation for keys

Percentage of data structure instructions

Why collector groups

Evaluation – Energy Reduction

[Figure: energy reduction for D.Cube, Reco., BTree, Hash., Black., Text. Two axes, Network (higher is better) and Cache (higher is better), with ticks 0 to 5 and 0 to 75. Series: NW-DASX, NW-2C-4T, Cache-DASX, Cache-2C-4T. Off-scale values labeled 32.7, 6.5, 12.2. Annotations: Streaming, Cache Thrashing]

DASX – OBJ-Store

Reduce energy – filter access to LLC

Organization : Decoupled sector cache (1KB)
• Minimize tag overhead for vectors
• Adapt to spatial locality (e.g. struct fields)

[Figure: Tag array with fields KEY, V/I, LLC*; Data array; LD / ST from PE; write backs]

DASX – Address Translation for Keys

• Reduce energy overhead

• Keys are coalesced by the collector into cache lines

• Only one translation per line vs. per access

• No reverse translation, due to back pointer (refer to OBJ-Store)
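The saving from translating once per line rather than once per access can be estimated with a small counting sketch. The function name and the `line_bytes` parameter are assumptions; the model simply shares one translation among consecutive accesses that fall in the same cache line.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts translations for a stream of element addresses when consecutive
// accesses to the same cache line share a single translation.
std::size_t translations_per_line(const std::vector<std::uintptr_t>& addrs,
                                  std::uintptr_t line_bytes) {
    assert(line_bytes > 0);
    std::size_t count = 0;
    bool have_line = false;
    std::uintptr_t last_line = 0;
    for (std::uintptr_t addr : addrs) {
        const std::uintptr_t line = addr / line_bytes;
        if (!have_line || line != last_line) {
            ++count;           // first access to a new line: translate once
            last_line = line;
            have_line = true;
        }
    }
    return count;
}
```

For a unit-stride walk over 4-byte elements with 64-byte lines, this model performs one translation for every 16 accesses.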