alembic: distilling c++ into high-performance grappa
TRANSCRIPT
⚗ Alembic
1
Automatic Locality Extraction via MigrationBrandon Holt, Preston Briggs, Luis Ceze, Mark Oskin
2
Memory
Cores
...Memory
Cores
Memory
Cores
Interconnect
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
...
Node 0 Node 1 Node 2 Node 3
Node 4 Node 5 Node 6 Node 7
2
Memory
Cores
...Memory
Cores
Memory
Cores
Interconnect
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
...
Node 0 Node 1 Node 2 Node 3
Node 4 Node 5 Node 6 Node 7
Thread
2
Memory
Cores
...Memory
Cores
Memory
Cores
Interconnect
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
...
Node 0 Node 1 Node 2 Node 3
Node 4 Node 5 Node 6 Node 7
Thread
Data
2
Partitioned Global Address Space (PGAS)
Memory
Cores
...Memory
Cores
Memory
Cores
Interconnect
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
...
Node 0 Node 1 Node 2 Node 3
Node 4 Node 5 Node 6 Node 7
Thread
Data
X10
Grappa
2
Partitioned Global Address Space (PGAS)
Memory
Cores
...Memory
Cores
Memory
Cores
Interconnect
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
...
Node 0 Node 1 Node 2 Node 3
Node 4 Node 5 Node 6 Node 7
Thread
Data
Memory
Cores
...Memory
Cores
Memory
Cores
Interconnect
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
...
Node 0 Node 1 Node 2 Node 3
Node 4 Node 5 Node 6 Node 7
3
Move data or computation?
Thread
Data
* M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. In PPOPP ’95, ACM.
† L. V. Kale and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. OOPSLA ’93. ACM.
Memory
Cores
...Memory
Cores
Memory
Cores
Interconnect
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
...
Node 0 Node 1 Node 2 Node 3
Node 4 Node 5 Node 6 Node 7
3
Move data or computation?
Thread
Data
move data
* M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. In PPOPP ’95, ACM.
† L. V. Kale and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. OOPSLA ’93. ACM.
Memory
Cores
...Memory
Cores
Memory
Cores
Interconnect
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
...
Node 0 Node 1 Node 2 Node 3
Node 4 Node 5 Node 6 Node 7
3
Move data or computation?
Thread
Data
move data
* M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. In PPOPP ’95, ACM.
† L. V. Kale and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. OOPSLA ’93. ACM.
Memory
Cores
...Memory
Cores
Memory
Cores
Interconnect
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
...
Node 0 Node 1 Node 2 Node 3
Node 4 Node 5 Node 6 Node 7
3
Move data or computation?
Thread
Data
move data move computation
Thread
* M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. In PPOPP ’95, ACM.
† L. V. Kale and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. OOPSLA ’93. ACM.
Memory
Cores
...Memory
Cores
Memory
Cores
Interconnect
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
...
Node 0 Node 1 Node 2 Node 3
Node 4 Node 5 Node 6 Node 7
3
Move data or computation?
Thread
Data
move data move computation
Thread
* M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. In PPOPP ’95, ACM.
† L. V. Kale and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. OOPSLA ’93. ACM.
Memory
Cores
...Memory
Cores
Memory
Cores
Interconnect
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
...
Node 0 Node 1 Node 2 Node 3
Node 4 Node 5 Node 6 Node 7
3
Move data or computation?
Thread
Data
move data move computation
Thread
thread migration*
* M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. In PPOPP ’95, ACM.
† L. V. Kale and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. OOPSLA ’93. ACM.
Memory
Cores
...Memory
Cores
Memory
Cores
Interconnect
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
...
Node 0 Node 1 Node 2 Node 3
Node 4 Node 5 Node 6 Node 7
3
Move data or computation?
Thread
Data
move data move computation
Thread
RPC†
thread migration*
* M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. In PPOPP ’95, ACM.
† L. V. Kale and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. OOPSLA ’93. ACM.
Memory
Cores
...Memory
Cores
Memory
Cores
Interconnect
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
...
Node 0 Node 1 Node 2 Node 3
Node 4 Node 5 Node 6 Node 7
3
Move data or computation?
Thread
Data
move data move computation
Thread
RPC†
thread migration*
explicit (on/at)
* M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. In PPOPP ’95, ACM.
† L. V. Kale and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. OOPSLA ’93. ACM.
Memory
Cores
...Memory
Cores
Memory
Cores
Interconnect
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
...
Node 0 Node 1 Node 2 Node 3
Node 4 Node 5 Node 6 Node 7
3
Move data or computation?
Thread
Data
move data move computation
Thread
RPC†
thread migration*
explicit (on/at)
* M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. In PPOPP ’95, ACM.
† L. V. Kale and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. OOPSLA ’93. ACM.
…
Memory
Cores
...Memory
Cores
Memory
Cores
Interconnect
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
...
Node 0 Node 1 Node 2 Node 3
Node 4 Node 5 Node 6 Node 7
3
Move data or computation?
Thread
Data
move data move computation
Thread
generated by PGAS compilers
RPC†
thread migration*
explicit (on/at)
* M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. In PPOPP ’95, ACM.
† L. V. Kale and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. OOPSLA ’93. ACM.
…
Memory
Cores
...Memory
Cores
Memory
Cores
Interconnect
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
...
Node 0 Node 1 Node 2 Node 3
Node 4 Node 5 Node 6 Node 7
3
Move data or computation?
Thread
Data
move data move computation
Thread
generated by PGAS compilers
RPC†
thread migration*
explicit (on/at)
* M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. In PPOPP ’95, ACM.
† L. V. Kale and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. OOPSLA ’93. ACM.
…
Why not automatically move computation?
Memory
Cores
...Memory
Cores
Memory
Cores
Interconnect
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
Memory
Cores
...
Node 0 Node 1 Node 2 Node 3
Node 4 Node 5 Node 6 Node 7
4
Move data or computation?
Thread
Data
move data move computation
Thread
Why not automatically move computation?generated by
PGAS compilers
RPC†
thread migration*
…
* M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. In PPOPP ’95, ACM.
† L. V. Kale and S. Krishnan. CHARM++: A portable concurrent object oriented system based on C++. OOPSLA ’93. ACM.
⚗ Alembic automatically move computation
to reduce communication
5
⚗ Alembic Static optimizing migration algorithm
– Constrained by anchor points – Greedy heuristic to reduce communication
Implementation for C++ in LLVM
Evaluation – 6x better than naive compiler-generated
communication – 82% of hand-tuned performance
5
⚗ Alembic Static optimizing migration algorithm
– Constrained by anchor points – Greedy heuristic to reduce communication
Implementation for C++ in LLVM
Evaluation – 6x better than naive compiler-generated
communication – 82% of hand-tuned performance
⚗ algorithm Locality analysis
– Identify anchor points – Partition anchors into locality sets
Heuristic region selection – Divide into regions that minimize communication – Transform task to migrate at region boundaries
6
Node 0 Node 1 Node 2
locality analysis
7
HOPS Benchmark
count: 0winner:
count: 0winner:A: count: 0
winner:
⚗ algorithm
Node 0 Node 1 Node 2
locality analysis
7
HOPS Benchmark
count: 0winner:
count: 0winner:A: count: 0
winner:
210200 102B:
⚗ algorithm
Node 0 Node 1 Node 2
locality analysis
7
HOPS Benchmark
forall(0, B.size, [A,B](long i) { Counter global* a = A + B[i]; long prev = fetch_add(&a->count, 1); if (prev == 0) // first to arrive a->winner = i; // is winner });
count: 0winner:
count: 0winner:A: count: 0
winner:
210200 102B:
⚗ algorithm
Node 0 Node 1 Node 2
locality analysis
7
HOPS Benchmark
forall(0, B.size, [A,B](long i) { Counter global* a = A + B[i]; long prev = fetch_add(&a->count, 1); if (prev == 0) // first to arrive a->winner = i; // is winner });
count: 0winner:
count: 0winner:A: count: 0
winner:
210200 102B:
i=4
B[4]2
A[2] count: 1winner: 4
⚗ algorithm
Node 0 Node 1 Node 2
locality analysis
7
HOPS Benchmark
forall(0, B.size, [A,B](long i) { Counter global* a = A + B[i]; long prev = fetch_add(&a->count, 1); if (prev == 0) // first to arrive a->winner = i; // is winner });
count: 0winner:
count: 0winner:A: count: 0
winner:
210200 102B:
i=4
B[4]2
A[2] count: 1winner: 4
⚗ algorithm
Anchor points – memory locations are owned by one node – so memory references are anchored to that node – these are constraints on the thread’s execution
Node 0 Node 1 Node 2
locality analysis
7
HOPS Benchmark
forall(0, B.size, [A,B](long i) { Counter global* a = A + B[i]; long prev = fetch_add(&a->count, 1); if (prev == 0) // first to arrive a->winner = i; // is winner });
count: 0winner:
count: 0winner:A: count: 0
winner:
⚓︎⚓︎
⚓︎
210200 102B:
i=4
B[4]2
A[2] count: 1winner: 4
⚗ algorithm
⚓︎
locality analysis
8
⚗ algorithm
Locality partitioning: pessimistic value partitioning* (value numbering) – each anchor starts in its own set – merge sets if you can prove they are congruent – for locality partitioning: congruence means on the same node
AB
i
fetch_add(&a->count, 1)a->winner = i
B[i]
* B. Alpern, M. N. Wegman, and F. K. Zadeck. Detecting equality of variables in programs. POPL ’88, pages 1–11. ACM, 1988.
locality analysis
9
⚗ algorithm
Locality partitioning: pessimistic value partitioning* (value numbering) – each anchor starts in its own set – merge sets if you can prove they are congruent – for locality partitioning: congruence means on the same node
A
B
i
fetch_add(&a->count, 1)
a->winner = i
B[i]
* B. Alpern, M. N. Wegman, and F. K. Zadeck. Detecting equality of variables in programs. POPL ’88, pages 1–11. ACM, 1988.
locality analysis
9
⚗ algorithm
Locality partitioning: pessimistic value partitioning* (value numbering) – each anchor starts in its own set – merge sets if you can prove they are congruent – for locality partitioning: congruence means on the same node
A
B
i
fetch_add(&a->count, 1)
a->winner = i
B[i]
Plug in your own locality-congruence rules!
* B. Alpern, M. N. Wegman, and F. K. Zadeck. Detecting equality of variables in programs. POPL ’88, pages 1–11. ACM, 1988.
region selection
Region selection (heuristic optimization)
10
⚗ algorithm
[ A,B ](long i ) { Counter global* a = A + B[i]; long prev = fetch_add(&a->count, 1); if (prev == 0) // first to arrive a->winner = i; // is winner}
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
region selection
Region selection (heuristic optimization)
10
⚗ algorithm
[ A,B ](long i ) { Counter global* a = A + B[i]; long prev = fetch_add(&a->count, 1); if (prev == 0) // first to arrive a->winner = i; // is winner}
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
region: contiguous sequence of instructions (or a DAG of basic blocks) which can all execute on the same node
region selection
Region selection (heuristic optimization)
10
⚗ algorithm
[ A,B ](long i ) { Counter global* a = A + B[i]; long prev = fetch_add(&a->count, 1); if (prev == 0) // first to arrive a->winner = i; // is winner}
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
communication cost heuristic: function of # of messages and message size (continuation size)
region: contiguous sequence of instructions (or a DAG of basic blocks) which can all execute on the same node
region selection
Region selection (heuristic optimization)
11
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
prev
if (prev == 0)
a = _ + _
Dependence Graph
communication cost heuristic: function of # of messages and message size (continuation size)
region: contiguous sequence of instructions (or a DAG of basic blocks) which can all execute on the same node
⚗ algorithm
region selection
Region selection (heuristic optimization)
11
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
prev
if (prev == 0)
a = _ + _
Dependence Graph
communication cost heuristic: function of # of messages and message size (continuation size)
region: contiguous sequence of instructions (or a DAG of basic blocks) which can all execute on the same node
⚗ algorithm
region selection
Region selection (heuristic optimization)
11
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
prev
if (prev == 0)
a = _ + _
Dependence Graph
communication cost heuristic: function of # of messages and message size (continuation size)
region: contiguous sequence of instructions (or a DAG of basic blocks) which can all execute on the same node
⚗ algorithm
region selection
Region selection (heuristic optimization)
11
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
prev
if (prev == 0)
a = _ + _
Dependence Graph
communication cost heuristic: function of # of messages and message size (continuation size)
migrationmessages
region: contiguous sequence of instructions (or a DAG of basic blocks) which can all execute on the same node
⚗ algorithm
region selection
Region selection (heuristic optimization)
11
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
prev
if (prev == 0)
a = _ + _
Dependence Graph
communication cost heuristic: function of # of messages and message size (continuation size)
region: contiguous sequence of instructions (or a DAG of basic blocks) which can all execute on the same node
⚗ algorithm
region selection
Region selection (heuristic optimization)
11
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
prev
if (prev == 0)
a = _ + _
Dependence Graph
communication cost heuristic: function of # of messages and message size (continuation size)
region: contiguous sequence of instructions (or a DAG of basic blocks) which can all execute on the same node
continuation data(message size)
⚗ algorithm
region selection
Region selection (heuristic optimization)
11
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
prev
if (prev == 0)
a = _ + _
Dependence Graph
communication cost heuristic: function of # of messages and message size (continuation size)
region: contiguous sequence of instructions (or a DAG of basic blocks) which can all execute on the same node
⚗ algorithm
- 2 messages - extra data stays - fixed 8-byte payload
- 1 message - payload includes
the extra data
Modified HOPS to use extra data after the remote operation
Message cost experiment
12
blocking operation
migrate
???
region selection⚗ algorithm
- 2 messages - extra data stays - fixed 8-byte payload
- 1 message - payload includes
the extra data
Modified HOPS to use extra data after the remote operation
Message cost experiment
12
blocking operation
migrate
???
region selection⚗ algorithm
0.000
0.005
0.010
0.015
0.020
0.025
16 32 48 64 80 96 112 128Data size (bytes)
GU
PSasyncblocking
migrate blocking
25
20
15
10
05
00
ops/
sec
(mill
ions
)
Thread state size (bytes)
region selection
13
⚗ algorithm
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
prev
if (prev == 0)
a = _ + _
region selection
Region selection (heuristic optimization)
13
⚗ algorithm
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
prev
if (prev == 0)
a = _ + _
region selection
Region selection (heuristic optimization)– for each anchor, expand a region as far as
possible
13
⚗ algorithm
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
prev
if (prev == 0)
a = _ + _
region selection
Region selection (heuristic optimization)– for each anchor, expand a region as far as
possible
13
⚗ algorithm
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
prev
if (prev == 0)
a = _ + _
region selection
Region selection (heuristic optimization)– for each anchor, expand a region as far as
possible– at region intersections, compute cost
heuristic for the possible choices
13
⚗ algorithm
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
prev
if (prev == 0)
a = _ + _
region selection
Region selection (heuristic optimization)– for each anchor, expand a region as far as
possible– at region intersections, compute cost
heuristic for the possible choices
13
⚗ algorithm
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
prev
if (prev == 0)
a = _ + _
region selection
Region selection (heuristic optimization)– for each anchor, expand a region as far as
possible– at region intersections, compute cost
heuristic for the possible choices
13
⚗ algorithm
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
prev
if (prev == 0)
a = _ + _
region selection
Region selection (heuristic optimization)– for each anchor, expand a region as far as
possible– at region intersections, compute cost
heuristic for the possible choices– evaluate intersections pair-wise, greedily
choose the best in each case
13
⚗ algorithm
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
prev
if (prev == 0)
a = _ + _
region selection
Region selection (heuristic optimization)– for each anchor, expand a region as far as
possible– at region intersections, compute cost
heuristic for the possible choices– evaluate intersections pair-wise, greedily
choose the best in each case
13
⚗ algorithm
A B i
fetch_add(&a->count, 1)
a->winner = i
B[i]
prev
if (prev == 0)
a = _ + _
14
⚗ algorithm
Transform thread to migrate at region boundaries – create continuations for values that cross regions,
and pack them into active messages
14
⚗ algorithm
Transform thread to migrate at region boundaries – create continuations for values that cross regions,
and pack them into active messages
[A,B](long i) { Counter global* a = A + B[i]; long prev = fetch_add(&a->count, 1); if (prev == 0) // first to arrive a->winner = i; // is winner}
15
⚗ algorithm
Transform thread to migrate at region boundaries – create continuations for values that cross regions,
and pack them into active messages
[a,i]{
long prev = fetch_add(&a->count, 1); if (prev == 0) // first to arrive a->winner = i; // is winner}
[A,B](long i) {
migrate(node(B+i), _);
}
[A,B,i]{
Counter global* a = A + B[i];
migrate(node(a), _);
}
15
⚗ algorithm
Transform thread to migrate at region boundaries – create continuations for values that cross regions,
and pack them into active messages
[a,i]{
long prev = fetch_add(&a->count, 1); if (prev == 0) // first to arrive a->winner = i; // is winner}
[A,B](long i) {
migrate(node(B+i), _);
}
[A,B,i]{
Counter global* a = A + B[i];
migrate(node(a), _);
}
16
⚗ Alembic Static optimizing migration algorithm
– Constrained by anchor points – Greedy heuristic to reduce communication
Implementation for C++ in LLVM
Evaluation – 6x better than naive compiler-generated
communication – 82% of hand-tuned performance
⚗ Implementation C++ extensions to support global pointers
Anchor point / locality partitioning analysis pass
Region selection and continuation-passing transform pass
17
18
⚗ Alembic Static optimizing migration algorithm
– Constrained by anchor points – Greedy heuristic to reduce communication
Implementation for C++ in LLVM
Evaluation – 6x better than naive compiler-generated
communication – 82% of hand-tuned performance
Benchmarks – Ported Grappa applications (irregular, data-intensive, …)
Performance (12 nodes) – naive put/get compiler-generated
communication – hand-tuned migration decisions – Alembic-generated migrations
19
⚗ evaluation
Connected ComponentsPagerankBFS
Intsort
Benchmarks – Ported Grappa applications (irregular, data-intensive, …)
Performance (12 nodes) – naive put/get compiler-generated
communication – hand-tuned migration decisions – Alembic-generated migrations
19
⚗ evaluation
Connected ComponentsPagerankBFS
Intsort
GlobalHashSet symmetric* set;Graph symmetric* g;void explore(VertexID r, color_t color) { Vertex global* vs = g->vertices(); phaser.enroll(vs[r].nadj) forall<async>(adj(g,vs+r), [=](VertexID j){ auto& v = vs[j]; if (cmp_swap(&v.color, -1, color)){ spawn([=]{ explore(j, color); }); } else if (v.color != color) { Edge edge(color, v.color); set->insert(edge); phaser.complete(1); } }); phaser.complete(1);}
Benchmarks – Ported Grappa applications (irregular, data-intensive, …)
Performance (12 nodes) – naive put/get compiler-generated
communication – hand-tuned migration decisions – Alembic-generated migrations
19
⚗ evaluation
Connected ComponentsPagerankBFS
Intsort
GlobalHashSet symmetric* set;Graph symmetric* g;void explore(VertexID r, color_t color) { Vertex global* vs = g->vertices(); phaser.enroll(vs[r].nadj) forall<async>(adj(g,vs+r), [=](VertexID j){ auto& v = vs[j]; if (cmp_swap(&v.color, -1, color)){ spawn([=]{ explore(j, color); }); } else if (v.color != color) { Edge edge(color, v.color); set->insert(edge); phaser.complete(1); } }); phaser.complete(1);}
20
Performance(MTEPS)
Data moved(GB)
0
25
50
75
100
0
100
200
put/get
alembic
manual
put/get
alembic
manual
Performance(MOPS)
Data moved(GB)
0
50
100
150
200
0
100
200
300
put/get
alembic
manual
put/get
alembic
manual
Performance(MTEPS)
Data moved(GB)
0
2
4
6
8
0
25
50
75
put/get
alembic
manual
put/get
alembic
manual
Performance(MTEPS)
Data moved(GB)
0
10
20
30
40
0
50
100
150
200
250
put/get
alembic
manual
put/get
alembic
manual
BFS PagerankConnected
components Intsort
⚗ evaluation
bette
r
21
Performance(MTEPS)
Data moved(GB)
0
25
50
75
100
0
100
200
put/get
alembic
manual
put/get
alembic
manual
Performance(MOPS)
Data moved(GB)
0
50
100
150
200
0
100
200
300
put/get
alembic
manual
put/get
alembic
manual
Performance(MTEPS)
Data moved(GB)
0
2
4
6
8
0
25
50
75
put/get
alembic
manual
put/get
alembic
manual
Performance(MTEPS)
Data moved(GB)
0
10
20
30
40
0
50
100
150
200
250
put/get
alembic
manual
put/get
alembic
manual
BFS PagerankConnected
components Intsort
⚗ evaluation
Performance(MTEPS)
Data moved(GB)
0
25
50
75
100
0
100
200
put/get
alembic
manual
put/get
alembic
manual
Performance(MOPS)
Data moved(GB)
0
50
100
150
200
0
100
200
300
put/get
alembic
manual
put/get
alembic
manual
Performance(MTEPS)
Data moved(GB)
0
2
4
6
8
0
25
50
75
put/get
alembic
manual
put/get
alembic
manual
Performance(MTEPS)
Data moved(GB)
0
10
20
30
40
0
50
100
150
200
250
put/get
alembic
manual
put/get
alembic
manual
better
bette
r
22
⚗ Alembic Algorithm to make automatic migration decisions
– Analyze locality by partitioning anchors – Greedy optimization to reduce communication cost heuristic
LLVM implementation for Grappa C++ Performance — near hand-tuned, much better than PGAS baseline
Brandon Holt, Preston Briggs, Luis Ceze, Mark Oskin