Alembic: Distilling C++ into High-Performance Grappa


⚗ Alembic

Slide 1

Automatic Locality Extraction via Migration
Brandon Holt, Preston Briggs, Luis Ceze, Mark Oskin

2

Memory

Cores

...Memory

Cores

Memory

Cores

Interconnect

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

...

Node 0 Node 1 Node 2 Node 3

Node 4 Node 5 Node 6 Node 7

2

Memory

Cores

...Memory

Cores

Memory

Cores

Interconnect

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

...

Node 0 Node 1 Node 2 Node 3

Node 4 Node 5 Node 6 Node 7

Thread

2

Memory

Cores

...Memory

Cores

Memory

Cores

Interconnect

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

...

Node 0 Node 1 Node 2 Node 3

Node 4 Node 5 Node 6 Node 7

Thread

Data

2

Partitioned Global Address Space (PGAS)

Memory

Cores

...Memory

Cores

Memory

Cores

Interconnect

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

...

Node 0 Node 1 Node 2 Node 3

Node 4 Node 5 Node 6 Node 7

Thread

Data

X10

Grappa

2

Partitioned Global Address Space (PGAS)

Memory

Cores

...Memory

Cores

Memory

Cores

Interconnect

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

Memory

Cores

...

Node 0 Node 1 Node 2 Node 3

Node 4 Node 5 Node 6 Node 7

Thread

Data

Slide 3: Move data or computation?

[Figure: the PGAS cluster from Slide 2, with a thread running on one node and the data it needs sitting on another.]

– move data: remote get/put operations, as generated by PGAS compilers
– move computation:
  – RPC† (explicit on/at constructs)
  – thread migration*
(A small message-counting sketch contrasting the two options appears after the references below.)

Slide 4: Move data or computation?

Why not automatically move computation?

* M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. PPOPP '95. ACM.
† L. V. Kale and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. OOPSLA '93. ACM.
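To make the contrast concrete, here is a hedged, self-contained illustration of the two options. The remote_get, remote_put, and on_owner helpers are local stand-ins that merely count messages; they are not the API of any real PGAS runtime.

#include <cstdio>

static long remote_counter = 0;   // pretend this word lives on another node
static int  messages_sent  = 0;   // count the messages each style would need

long remote_get()       { ++messages_sent; return remote_counter; }
void remote_put(long v) { ++messages_sent; remote_counter = v; }

template <typename F>
void on_owner(F f) { ++messages_sent; f(); }   // explicit "on/at"-style RPC

int main() {
  // (a) move data: the thread stays put and the data makes two trips.
  long v = remote_get();
  remote_put(v + 1);
  std::printf("move data:        %d messages\n", messages_sent);

  // (b) move computation: ship the increment to the node that owns the data.
  messages_sent = 0;
  on_owner([] { remote_counter += 1; });
  std::printf("move computation: %d messages\n", messages_sent);
}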

Slide 5: ⚗ Alembic

Automatically move computation to reduce communication.

Static optimizing migration algorithm
– Constrained by anchor points
– Greedy heuristic to reduce communication

Implementation for C++ in LLVM

Evaluation
– 6x better than naive compiler-generated communication
– 82% of hand-tuned performance

Slide 6: ⚗ algorithm

Locality analysis
– Identify anchor points
– Partition anchors into locality sets

Heuristic region selection
– Divide into regions that minimize communication
– Transform task to migrate at region boundaries

Slide 7: ⚗ algorithm (locality analysis)

HOPS Benchmark

forall(0, B.size, [A,B](long i) {
  Counter global* a = A + B[i];
  long prev = fetch_add(&a->count, 1);
  if (prev == 0)     // first to arrive
    a->winner = i;   // is winner
});

[Figure: the Counter array A (each entry starting with count 0 and no winner) and the index array B are distributed across Node 0, Node 1, and Node 2. For iteration i = 4, B[4] is 2, which refers to A[2]; after the update, A[2] has count 1 and winner 4.]

Anchor points (⚓︎)
– memory locations are owned by one node
– so memory references are anchored to that node
– these are constraints on the thread's execution
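For reference, a hedged sketch of the naive "put/get" lowering that the evaluation later uses as its baseline: each anchored reference in the HOPS body becomes its own message to the owning node. The remote_* helpers are local stand-ins so the sketch compiles; they are not the Alembic or Grappa API, and the data is made up.

#include <cstdio>
#include <vector>

struct Counter { long count = 0; long winner = -1; };

static std::vector<Counter> A(3);                       // pretend A and B live on remote nodes
static std::vector<long>    B{2, 1, 0, 2, 2, 0, 1, 0, 2};  // toy contents; B[4] == 2
static int messages = 0;

long remote_read(const long& x)        { ++messages; return x; }
long remote_fetch_add(long& x, long n) { ++messages; long p = x; x += n; return p; }
void remote_write(long& x, long v)     { ++messages; x = v; }

void hops_body_put_get(long i) {
  long bi = remote_read(B[i]);                  // anchor 1: B[i]
  Counter& a = A[bi];
  long prev = remote_fetch_add(a.count, 1);     // anchor 2: a->count
  if (prev == 0)                                // first to arrive
    remote_write(a.winner, i);                  // anchor 3: a->winner
}

int main() {
  hops_body_put_get(4);
  std::printf("messages for one iteration: %d\n", messages);  // one round trip per anchor
}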

Slide 8: ⚗ algorithm (locality analysis)

[Figure: the anchors from HOPS, B[i], fetch_add(&a->count, 1), and a->winner = i, drawn with the values A, B, and i that feed them.]

Slide 9: ⚗ algorithm (locality analysis)

Locality partitioning: pessimistic value partitioning* (value numbering)
– each anchor starts in its own set
– merge sets if you can prove they are congruent
– for locality partitioning, congruence means "on the same node"

[Figure: the anchors and values from Slide 8, grouped into locality sets.]

Plug in your own locality-congruence rules!

* B. Alpern, M. N. Wegman, and F. K. Zadeck. Detecting equality of variables in programs. POPL '88, pages 1–11. ACM, 1988.
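A hedged sketch of the partitioning idea (not the paper's LLVM pass): every anchor starts in its own set, and sets are unioned when their addresses are provably on the same node. The Anchor and LocalityPartition types and the single "same base pointer" congruence rule here are illustrative simplifications; as the slide says, richer rules can be plugged in.

#include <cstdio>
#include <map>
#include <numeric>
#include <vector>

struct Anchor { int base_value; };  // value number of the anchor's base pointer

struct LocalityPartition {
  std::vector<int> parent;
  explicit LocalityPartition(std::size_t n) : parent(n) {
    std::iota(parent.begin(), parent.end(), 0);   // each anchor in its own set
  }
  int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
  void merge(int a, int b) { parent[find(a)] = find(b); }
};

// Merge anchors sharing a base pointer: fields of the same object
// (e.g. a->count and a->winner) live on the node that owns *a.
LocalityPartition partition(const std::vector<Anchor>& anchors) {
  LocalityPartition p(anchors.size());
  std::map<int, int> first_with_base;             // base value number -> anchor index
  for (int i = 0; i < (int)anchors.size(); ++i) {
    auto [it, fresh] = first_with_base.try_emplace(anchors[i].base_value, i);
    if (!fresh) p.merge(i, it->second);
  }
  return p;
}

int main() {
  // HOPS anchors: B[i] (base B), a->count (base a), a->winner (base a).
  std::vector<Anchor> anchors = {{/*B*/ 1}, {/*a*/ 2}, {/*a*/ 2}};
  LocalityPartition p = partition(anchors);
  std::printf("sets: %d %d %d\n", p.find(0), p.find(1), p.find(2));  // anchors 1 and 2 merge
}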

Slide 10: ⚗ algorithm (region selection)

Region selection (heuristic optimization)

[A,B](long i) {
  Counter global* a = A + B[i];
  long prev = fetch_add(&a->count, 1);
  if (prev == 0)     // first to arrive
    a->winner = i;   // is winner
}

region: a contiguous sequence of instructions (or a DAG of basic blocks) which can all execute on the same node

communication cost heuristic: a function of the number of messages and the message size (continuation size)

Slide 11: ⚗ algorithm (region selection)

Region selection (heuristic optimization)

[Figure: dependence graph of the HOPS task. The inputs A, B, and i feed B[i]; A and B[i] feed a = _ + _; a feeds fetch_add(&a->count, 1), which produces prev; prev feeds if (prev == 0), which guards a->winner = i. Region boundaries determine the migration messages, and the values crossing each boundary are the continuation data (message size).]
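As a back-of-the-envelope for the "continuation data (message size)" label: the only values that must cross into the HOPS task's second region are a and i, exactly the [a,i] captures in the transformed code later in the deck, so the migrating message carries roughly two 64-bit words plus whatever fixed header the runtime adds. A hedged sketch; the HopsContinuation struct is purely illustrative.

#include <cstdio>

struct Counter { long count; long winner; };

// Hypothetical continuation for the region anchored at the Counter.
struct HopsContinuation {
  Counter* a;   // 8 bytes on a typical 64-bit node
  long     i;   // 8 bytes
};

int main() {
  std::printf("continuation payload: %zu bytes\n", sizeof(HopsContinuation));  // 16
}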

Slide 12: ⚗ algorithm (region selection)

Message cost experiment

Modified HOPS to use extra data after the remote operation.

blocking operation
– 2 messages
– extra data stays
– fixed 8-byte payload

migrate
– 1 message
– payload includes the extra data

Slide 13: ⚗ algorithm (region selection)

[Figure: results of the message cost experiment. Throughput (GUPS, millions of ops/sec) versus the size of the extra data carried (16 to 128 bytes), comparing async, blocking, and migrate variants.]

Region selection (heuristic optimization)
– for each anchor, expand a region as far as possible
– at region intersections, compute the cost heuristic for the possible choices
– evaluate intersections pair-wise, and greedily choose the best in each case
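A hedged, self-contained toy of the greedy idea, restricted to straight-line code (the real pass works over a DAG of basic blocks): a boundary is forced wherever consecutive anchors disagree, and each one is greedily placed where the continuation that must travel is smallest. The Instr type, the live-byte figures, and the 8-byte overhead are all made-up illustrative numbers, not values from the paper.

#include <cstddef>
#include <cstdio>
#include <vector>

struct Instr {
  int locality = -1;            // -1: unanchored; otherwise a locality-set id
  std::size_t live_bytes = 0;   // rough size of the values live after this instruction
};

constexpr std::size_t kOverhead = 8;  // assumed per-message overhead (illustrative)

// Returns the chosen boundaries: migrate before the instruction at each returned index.
std::vector<std::size_t> place_boundaries(const std::vector<Instr>& code) {
  std::vector<std::size_t> boundaries;
  std::size_t cost = 0, last_anchor = 0;
  for (std::size_t i = 1; i < code.size(); ++i) {
    if (code[i].locality < 0) continue;                      // not a constraint
    if (code[last_anchor].locality >= 0 &&
        code[i].locality != code[last_anchor].locality) {
      // A migration must fall somewhere in (last_anchor, i]; pick the cheapest spot.
      std::size_t best = last_anchor + 1;
      for (std::size_t b = last_anchor + 1; b <= i; ++b)
        if (code[b - 1].live_bytes < code[best - 1].live_bytes) best = b;
      boundaries.push_back(best);
      cost += kOverhead + code[best - 1].live_bytes;
    }
    last_anchor = i;
  }
  std::printf("estimated communication cost: %zu bytes\n", cost);
  return boundaries;
}

int main() {
  // HOPS-like shape: B[i] anchored to set 0, both accesses through `a` to set 1.
  std::vector<Instr> hops = {
    {0, 24},   // load B[i]                ({A, B[i], i} live afterwards)
    {-1, 16},  // a = A + B[i]             ({a, i} live afterwards)
    {1, 24},   // fetch_add(&a->count, 1)  ({a, i, prev} live afterwards)
    {-1, 16},  // if (prev == 0)           ({a, i} live afterwards)
    {1, 0},    // a->winner = i
  };
  for (std::size_t b : place_boundaries(hops))
    std::printf("migrate before instruction %zu\n", b);   // boundary right after a = A + B[i]
}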

Slide 14: ⚗ algorithm

Transform thread to migrate at region boundaries
– create continuations for values that cross regions, and pack them into active messages

[A,B](long i) {
  Counter global* a = A + B[i];
  long prev = fetch_add(&a->count, 1);
  if (prev == 0)     // first to arrive
    a->winner = i;   // is winner
}

Slide 15: ⚗ algorithm

The same task, split at its region boundaries:

[A,B](long i) {
  migrate(node(B+i), _);
}

[A,B,i]{
  Counter global* a = A + B[i];
  migrate(node(a), _);
}

[a,i]{
  long prev = fetch_add(&a->count, 1);
  if (prev == 0)     // first to arrive
    a->winner = i;   // is winner
}
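A hedged, self-contained toy model of what the migrate(...) calls above amount to (not the Grappa runtime): each "core" is just a work queue, and migrating enqueues the continuation, with its captured values, on the destination queue. Data placement is faked here; a real runtime would ship the continuation over the network as an active message and schedule it alongside the owner's other tasks.

#include <cstdio>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

std::vector<std::deque<std::function<void()>>> cores(3);   // three toy "cores"

void migrate(int dest, std::function<void()> k) { cores[dest].push_back(std::move(k)); }

// Drain the queues round-robin, standing in for the per-core schedulers.
void run_all() {
  for (bool any = true; any; ) {
    any = false;
    for (auto& q : cores)
      if (!q.empty()) { auto k = std::move(q.front()); q.pop_front(); k(); any = true; }
  }
}

int main() {
  // The HOPS body split at its region boundaries, as on the slide above.
  // Pretend B[i] lives on core 1 and the Counter it names lives on core 2.
  long B_i = 2, count = 0, winner = -1, i = 4;

  migrate(1, [&] {                             // region anchored at B+i
    long bi = B_i;                             // read B[i] on its owner
    int counter_owner = (bi == 2) ? 2 : 0;     // stand-in for node(A + bi)
    migrate(counter_owner, [&] {               // region anchored at the Counter
      long prev = count; count += 1;           // fetch_add(&a->count, 1)
      if (prev == 0) winner = i;               // first to arrive is winner
    });
  });

  run_all();
  std::printf("count=%ld winner=%ld\n", count, winner);   // count=1 winner=4
}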

Slide 16: ⚗ Alembic (recap of the Slide 5 outline, moving on to the implementation)

Slide 17: ⚗ Implementation

– C++ extensions to support global pointers
– Anchor point / locality partitioning analysis pass
– Region selection and continuation-passing transform pass
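For a flavor of the "C++ extensions to support global pointers", here is a hedged sketch using Clang's address-space attribute. The macro spelling and the address-space numbers are illustrative guesses at the kind of annotation involved, not the actual Alembic headers, and whether this exact form compiles depends on the compiler.

// Pointers that may refer to another node's memory are distinguished by type,
// so the analysis passes can treat every load/store through them as an anchor point.
#define global    __attribute__((address_space(100)))   // illustrative number
#define symmetric __attribute__((address_space(200)))   // illustrative number

struct Counter { long count; long winner; };

// Ordinary-looking code; the qualifier is what marks the accesses as anchors.
long bump(Counter global* a, long i) {
  long prev = a->count;    // anchored to the node that owns *a
  a->count  = prev + 1;
  if (prev == 0) a->winner = i;
  return prev;
}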

Slide 18: ⚗ Alembic (recap of the Slide 5 outline, moving on to the evaluation)

Slide 19: ⚗ evaluation

Benchmarks
– Ported Grappa applications (irregular, data-intensive, …): BFS, Pagerank, Connected Components, Intsort

Performance (12 nodes)
– naive put/get compiler-generated communication
– hand-tuned migration decisions
– Alembic-generated migrations

Connected Components (excerpt):

GlobalHashSet symmetric* set;
Graph symmetric* g;

void explore(VertexID r, color_t color) {
  Vertex global* vs = g->vertices();
  phaser.enroll(vs[r].nadj);
  forall<async>(adj(g, vs+r), [=](VertexID j) {
    auto& v = vs[j];
    if (cmp_swap(&v.color, -1, color)) {
      spawn([=]{ explore(j, color); });
    } else if (v.color != color) {
      Edge edge(color, v.color);
      set->insert(edge);
      phaser.complete(1);
    }
  });
  phaser.complete(1);
}

Slides 20 and 21: ⚗ evaluation

[Figure: for each benchmark (BFS, Pagerank, Connected Components, Intsort), bar charts compare the put/get, Alembic, and manual versions on two axes: performance (millions of traversed edges or operations per second; higher is better) and data moved (GB; lower is better).]

22

⚗ Alembic Algorithm to make automatic migration decisions

– Analyze locality by partitioning anchors – Greedy optimization to reduce communication cost heuristic

LLVM implementation for Grappa C++ Performance — near hand-tuned, much better than PGAS baseline

Brandon Holt, Preston Briggs, Luis Ceze, Mark Oskin