Alembic: Distilling C++ into High-Performance Grappa


⚗ Alembic

Slide 1

Automatic Locality Extraction via Migration
Brandon Holt, Preston Briggs, Luis Ceze, Mark Oskin

2

Memory

Cores

...Memory

Cores

Memory

Cores

Interconnect

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

...

Node 0 Node 1 Node 2 Node 3

Node 4 Node 5 Node 6 Node 7

2

Memory

Cores

...Memory

Cores

Memory

Cores

Interconnect

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

...

Node 0 Node 1 Node 2 Node 3

Node 4 Node 5 Node 6 Node 7

Thread

2

Memory

Cores

...Memory

Cores

Memory

Cores

Interconnect

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

...

Node 0 Node 1 Node 2 Node 3

Node 4 Node 5 Node 6 Node 7

Thread

Data

2

Partitioned Global Address Space (PGAS)

Memory

Cores

...Memory

Cores

Memory

Cores

Interconnect

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

...

Node 0 Node 1 Node 2 Node 3

Node 4 Node 5 Node 6 Node 7

Thread

Data

X10

Grappa

2

Partitioned Global Address Space (PGAS)

Memory

Cores

...Memory

Cores

Memory

Cores

Interconnect

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

...

Node 0 Node 1 Node 2 Node 3

Node 4 Node 5 Node 6 Node 7

Thread

Data

Slide 3: Move data or computation?

[Figure: the PGAS cluster from Slide 2, with a thread running on one node and the data it needs sitting on another.]

– move data: remote get/put operations, as generated by PGAS compilers
– move computation:
  – RPC† (explicit on/at constructs)
  – thread migration*
(A small message-counting sketch contrasting the two options appears after the references below.)

Slide 4: Move data or computation?

Why not automatically move computation?

* M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. PPOPP '95. ACM.
† L. V. Kale and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. OOPSLA '93. ACM.
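To make the contrast concrete, here is a hedged, self-contained illustration of the two options. The remote_get, remote_put, and on_owner helpers are local stand-ins that merely count messages; they are not the API of any real PGAS runtime.

#include <cstdio>

static long remote_counter = 0;   // pretend this word lives on another node
static int  messages_sent  = 0;   // count the messages each style would need

long remote_get()       { ++messages_sent; return remote_counter; }
void remote_put(long v) { ++messages_sent; remote_counter = v; }

template <typename F>
void on_owner(F f) { ++messages_sent; f(); }   // explicit "on/at"-style RPC

int main() {
  // (a) move data: the thread stays put and the data makes two trips.
  long v = remote_get();
  remote_put(v + 1);
  std::printf("move data:        %d messages\n", messages_sent);

  // (b) move computation: ship the increment to the node that owns the data.
  messages_sent = 0;
  on_owner([] { remote_counter += 1; });
  std::printf("move computation: %d messages\n", messages_sent);
}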

Slide 5: ⚗ Alembic

Automatically move computation to reduce communication.

Static optimizing migration algorithm
– Constrained by anchor points
– Greedy heuristic to reduce communication

Implementation for C++ in LLVM

Evaluation
– 6x better than naive compiler-generated communication
– 82% of hand-tuned performance

Slide 6: ⚗ algorithm

Locality analysis
– Identify anchor points
– Partition anchors into locality sets

Heuristic region selection
– Divide into regions that minimize communication
– Transform task to migrate at region boundaries

Slide 7: ⚗ algorithm (locality analysis)

HOPS Benchmark

forall(0, B.size, [A,B](long i) {
  Counter global* a = A + B[i];
  long prev = fetch_add(&a->count, 1);
  if (prev == 0)     // first to arrive
    a->winner = i;   // is winner
});

[Figure: the Counter array A (each entry starting with count 0 and no winner) and the index array B are distributed across Node 0, Node 1, and Node 2. For iteration i = 4, B[4] is 2, which refers to A[2]; after the update, A[2] has count 1 and winner 4.]

Anchor points (⚓︎)
– memory locations are owned by one node
– so memory references are anchored to that node
– these are constraints on the thread's execution
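For reference, a hedged sketch of the naive "put/get" lowering that the evaluation later uses as its baseline: each anchored reference in the HOPS body becomes its own message to the owning node. The remote_* helpers are local stand-ins so the sketch compiles; they are not the Alembic or Grappa API, and the data is made up.

#include <cstdio>
#include <vector>

struct Counter { long count = 0; long winner = -1; };

static std::vector<Counter> A(3);                       // pretend A and B live on remote nodes
static std::vector<long>    B{2, 1, 0, 2, 2, 0, 1, 0, 2};  // toy contents; B[4] == 2
static int messages = 0;

long remote_read(const long& x)        { ++messages; return x; }
long remote_fetch_add(long& x, long n) { ++messages; long p = x; x += n; return p; }
void remote_write(long& x, long v)     { ++messages; x = v; }

void hops_body_put_get(long i) {
  long bi = remote_read(B[i]);                  // anchor 1: B[i]
  Counter& a = A[bi];
  long prev = remote_fetch_add(a.count, 1);     // anchor 2: a->count
  if (prev == 0)                                // first to arrive
    remote_write(a.winner, i);                  // anchor 3: a->winner
}

int main() {
  hops_body_put_get(4);
  std::printf("messages for one iteration: %d\n", messages);  // one round trip per anchor
}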

Slide 8: ⚗ algorithm (locality analysis)

[Figure: the anchors from HOPS, B[i], fetch_add(&a->count, 1), and a->winner = i, drawn with the values A, B, and i that feed them.]

Slide 9: ⚗ algorithm (locality analysis)

Locality partitioning: pessimistic value partitioning* (value numbering)
– each anchor starts in its own set
– merge sets if you can prove they are congruent
– for locality partitioning, congruence means "on the same node"

[Figure: the anchors and values from Slide 8, grouped into locality sets.]

Plug in your own locality-congruence rules!

* B. Alpern, M. N. Wegman, and F. K. Zadeck. Detecting equality of variables in programs. POPL '88, pages 1–11. ACM, 1988.
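A hedged sketch of the partitioning idea (not the paper's LLVM pass): every anchor starts in its own set, and sets are unioned when their addresses are provably on the same node. The Anchor and LocalityPartition types and the single "same base pointer" congruence rule here are illustrative simplifications; as the slide says, richer rules can be plugged in.

#include <cstdio>
#include <map>
#include <numeric>
#include <vector>

struct Anchor { int base_value; };  // value number of the anchor's base pointer

struct LocalityPartition {
  std::vector<int> parent;
  explicit LocalityPartition(std::size_t n) : parent(n) {
    std::iota(parent.begin(), parent.end(), 0);   // each anchor in its own set
  }
  int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
  void merge(int a, int b) { parent[find(a)] = find(b); }
};

// Merge anchors sharing a base pointer: fields of the same object
// (e.g. a->count and a->winner) live on the node that owns *a.
LocalityPartition partition(const std::vector<Anchor>& anchors) {
  LocalityPartition p(anchors.size());
  std::map<int, int> first_with_base;             // base value number -> anchor index
  for (int i = 0; i < (int)anchors.size(); ++i) {
    auto [it, fresh] = first_with_base.try_emplace(anchors[i].base_value, i);
    if (!fresh) p.merge(i, it->second);
  }
  return p;
}

int main() {
  // HOPS anchors: B[i] (base B), a->count (base a), a->winner (base a).
  std::vector<Anchor> anchors = {{/*B*/ 1}, {/*a*/ 2}, {/*a*/ 2}};
  LocalityPartition p = partition(anchors);
  std::printf("sets: %d %d %d\n", p.find(0), p.find(1), p.find(2));  // anchors 1 and 2 merge
}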

Slide 10: ⚗ algorithm (region selection)

Region selection (heuristic optimization)

[A,B](long i) {
  Counter global* a = A + B[i];
  long prev = fetch_add(&a->count, 1);
  if (prev == 0)     // first to arrive
    a->winner = i;   // is winner
}

region: a contiguous sequence of instructions (or a DAG of basic blocks) which can all execute on the same node

communication cost heuristic: a function of the number of messages and the message size (continuation size)

Slide 11: ⚗ algorithm (region selection)

Region selection (heuristic optimization)

[Figure: dependence graph of the HOPS task. The inputs A, B, and i feed B[i]; A and B[i] feed a = _ + _; a feeds fetch_add(&a->count, 1), which produces prev; prev feeds if (prev == 0), which guards a->winner = i. Region boundaries determine the migration messages, and the values crossing each boundary are the continuation data (message size).]
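As a back-of-the-envelope for the "continuation data (message size)" label: the only values that must cross into the HOPS task's second region are a and i, exactly the [a,i] captures in the transformed code later in the deck, so the migrating message carries roughly two 64-bit words plus whatever fixed header the runtime adds. A hedged sketch; the HopsContinuation struct is purely illustrative.

#include <cstdio>

struct Counter { long count; long winner; };

// Hypothetical continuation for the region anchored at the Counter.
struct HopsContinuation {
  Counter* a;   // 8 bytes on a typical 64-bit node
  long     i;   // 8 bytes
};

int main() {
  std::printf("continuation payload: %zu bytes\n", sizeof(HopsContinuation));  // 16
}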

Slide 12: ⚗ algorithm (region selection)

Message cost experiment

Modified HOPS to use extra data after the remote operation.

blocking operation
– 2 messages
– extra data stays
– fixed 8-byte payload

migrate
– 1 message
– payload includes the extra data

Slide 13: ⚗ algorithm (region selection)

[Figure: results of the message cost experiment. Throughput (GUPS, millions of ops/sec) versus the size of the extra data carried (16 to 128 bytes), comparing async, blocking, and migrate variants.]

Region selection (heuristic optimization)
– for each anchor, expand a region as far as possible
– at region intersections, compute the cost heuristic for the possible choices
– evaluate intersections pair-wise, and greedily choose the best in each case
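A hedged, self-contained toy of the greedy idea, restricted to straight-line code (the real pass works over a DAG of basic blocks): a boundary is forced wherever consecutive anchors disagree, and each one is greedily placed where the continuation that must travel is smallest. The Instr type, the live-byte figures, and the 8-byte overhead are all made-up illustrative numbers, not values from the paper.

#include <cstddef>
#include <cstdio>
#include <vector>

struct Instr {
  int locality = -1;            // -1: unanchored; otherwise a locality-set id
  std::size_t live_bytes = 0;   // rough size of the values live after this instruction
};

constexpr std::size_t kOverhead = 8;  // assumed per-message overhead (illustrative)

// Returns the chosen boundaries: migrate before the instruction at each returned index.
std::vector<std::size_t> place_boundaries(const std::vector<Instr>& code) {
  std::vector<std::size_t> boundaries;
  std::size_t cost = 0, last_anchor = 0;
  for (std::size_t i = 1; i < code.size(); ++i) {
    if (code[i].locality < 0) continue;                      // not a constraint
    if (code[last_anchor].locality >= 0 &&
        code[i].locality != code[last_anchor].locality) {
      // A migration must fall somewhere in (last_anchor, i]; pick the cheapest spot.
      std::size_t best = last_anchor + 1;
      for (std::size_t b = last_anchor + 1; b <= i; ++b)
        if (code[b - 1].live_bytes < code[best - 1].live_bytes) best = b;
      boundaries.push_back(best);
      cost += kOverhead + code[best - 1].live_bytes;
    }
    last_anchor = i;
  }
  std::printf("estimated communication cost: %zu bytes\n", cost);
  return boundaries;
}

int main() {
  // HOPS-like shape: B[i] anchored to set 0, both accesses through `a` to set 1.
  std::vector<Instr> hops = {
    {0, 24},   // load B[i]                ({A, B[i], i} live afterwards)
    {-1, 16},  // a = A + B[i]             ({a, i} live afterwards)
    {1, 24},   // fetch_add(&a->count, 1)  ({a, i, prev} live afterwards)
    {-1, 16},  // if (prev == 0)           ({a, i} live afterwards)
    {1, 0},    // a->winner = i
  };
  for (std::size_t b : place_boundaries(hops))
    std::printf("migrate before instruction %zu\n", b);   // boundary right after a = A + B[i]
}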

Slide 14: ⚗ algorithm

Transform thread to migrate at region boundaries
– create continuations for values that cross regions, and pack them into active messages

[A,B](long i) {
  Counter global* a = A + B[i];
  long prev = fetch_add(&a->count, 1);
  if (prev == 0)     // first to arrive
    a->winner = i;   // is winner
}

Slide 15: ⚗ algorithm

The same task, split at its region boundaries:

[A,B](long i) {
  migrate(node(B+i), _);
}

[A,B,i]{
  Counter global* a = A + B[i];
  migrate(node(a), _);
}

[a,i]{
  long prev = fetch_add(&a->count, 1);
  if (prev == 0)     // first to arrive
    a->winner = i;   // is winner
}
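A hedged, self-contained toy model of what the migrate(...) calls above amount to (not the Grappa runtime): each "core" is just a work queue, and migrating enqueues the continuation, with its captured values, on the destination queue. Data placement is faked here; a real runtime would ship the continuation over the network as an active message and schedule it alongside the owner's other tasks.

#include <cstdio>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

std::vector<std::deque<std::function<void()>>> cores(3);   // three toy "cores"

void migrate(int dest, std::function<void()> k) { cores[dest].push_back(std::move(k)); }

// Drain the queues round-robin, standing in for the per-core schedulers.
void run_all() {
  for (bool any = true; any; ) {
    any = false;
    for (auto& q : cores)
      if (!q.empty()) { auto k = std::move(q.front()); q.pop_front(); k(); any = true; }
  }
}

int main() {
  // The HOPS body split at its region boundaries, as on the slide above.
  // Pretend B[i] lives on core 1 and the Counter it names lives on core 2.
  long B_i = 2, count = 0, winner = -1, i = 4;

  migrate(1, [&] {                             // region anchored at B+i
    long bi = B_i;                             // read B[i] on its owner
    int counter_owner = (bi == 2) ? 2 : 0;     // stand-in for node(A + bi)
    migrate(counter_owner, [&] {               // region anchored at the Counter
      long prev = count; count += 1;           // fetch_add(&a->count, 1)
      if (prev == 0) winner = i;               // first to arrive is winner
    });
  });

  run_all();
  std::printf("count=%ld winner=%ld\n", count, winner);   // count=1 winner=4
}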

Slide 16: ⚗ Alembic (recap of the Slide 5 outline, moving on to the implementation)

Slide 17: ⚗ Implementation

– C++ extensions to support global pointers
– Anchor point / locality partitioning analysis pass
– Region selection and continuation-passing transform pass
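For a flavor of the "C++ extensions to support global pointers", here is a hedged sketch using Clang's address-space attribute. The macro spelling and the address-space numbers are illustrative guesses at the kind of annotation involved, not the actual Alembic headers, and whether this exact form compiles depends on the compiler.

// Pointers that may refer to another node's memory are distinguished by type,
// so the analysis passes can treat every load/store through them as an anchor point.
#define global    __attribute__((address_space(100)))   // illustrative number
#define symmetric __attribute__((address_space(200)))   // illustrative number

struct Counter { long count; long winner; };

// Ordinary-looking code; the qualifier is what marks the accesses as anchors.
long bump(Counter global* a, long i) {
  long prev = a->count;    // anchored to the node that owns *a
  a->count  = prev + 1;
  if (prev == 0) a->winner = i;
  return prev;
}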

Slide 18: ⚗ Alembic (recap of the Slide 5 outline, moving on to the evaluation)

Slide 19: ⚗ evaluation

Benchmarks
– Ported Grappa applications (irregular, data-intensive, …): BFS, Pagerank, Connected Components, Intsort

Performance (12 nodes)
– naive put/get compiler-generated communication
– hand-tuned migration decisions
– Alembic-generated migrations

Connected Components (excerpt):

GlobalHashSet symmetric* set;
Graph symmetric* g;

void explore(VertexID r, color_t color) {
  Vertex global* vs = g->vertices();
  phaser.enroll(vs[r].nadj);
  forall<async>(adj(g, vs+r), [=](VertexID j) {
    auto& v = vs[j];
    if (cmp_swap(&v.color, -1, color)) {
      spawn([=]{ explore(j, color); });
    } else if (v.color != color) {
      Edge edge(color, v.color);
      set->insert(edge);
      phaser.complete(1);
    }
  });
  phaser.complete(1);
}

Slides 20 and 21: ⚗ evaluation

[Figure: for each benchmark (BFS, Pagerank, Connected Components, Intsort), bar charts compare the put/get, Alembic, and manual versions on two axes: performance (millions of traversed edges or operations per second; higher is better) and data moved (GB; lower is better).]

22

⚗ Alembic Algorithm to make automatic migration decisions

– Analyze locality by partitioning anchors – Greedy optimization to reduce communication cost heuristic

LLVM implementation for Grappa C++ Performance — near hand-tuned, much better than PGAS baseline

Brandon Holt, Preston Briggs, Luis Ceze, Mark Oskin