region-based memory management for task dataflow modelsdimitrios s. nikolopoulos (forth-ics)...
TRANSCRIPT
Region-Based Memory Managementfor Task Dataflow Models
Dimitrios S. Nikolopoulos
Joint work with: Spyros Lyberis and Polyvios PratikakisComputer Architecture and VLSI Systems Laboratory (CARV)
Institute of Computer Science (ICS)Foundation for Research and Technology – Hellas (FORTH)
EPoPPEA, January 24, 2012
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 1 / 25
Outline
1 IntroductionContribution
2 Memory Management with Nested Regions
3 Task and Region SemanticsBy exampleFormal Definition
4 Distributed Memory Regions
5 Early Experimental Results
6 Conclusions
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 2 / 25
The ProblemManycore Processors and Shared Memory
Shared memory can be hard to scale to many coresI Cache coherence becomes expensiveI Causes excessive communication, memory contention
Manycore processors that relax shared memory abstractionI AMD Opteron: 12 cores, NUMA, MMU per coreI Cell Broadband Engine: 1 + 8 cores, no shared memoryI GPUs: Thousands of cores, small shared memories between few coresI Intel Single Chip Cloud: 48 cores, no shared memory
Increased demand for parallel software developmentI Threads and shared memory: accessible, error-proneI Message-passing: tedious, experts-onlyI Both are non-deterministic: difficult to debug and understand
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 3 / 25
Task-parallelism for distributed and shared memory
Available in several contemporary programming models (Sequoia,Cilk, OMPSs)
Recursive task-parallelismI Parallel programs are composed from nested parallel tasks
Annotate each task with its memory footprint
Runtime system performs dependency analysis, scheduling, allrequired data transfers
Support region-based memory managementI Easier to express irregular task footprintsI Enables use of dynamic data structures
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 4 / 25
Benefits
Shared memory abstractionI Resemble shared memory programmingI Runtime system hides data transfers among core memoriesI Use memory footprints to transfer required data locally before starting
a taskI Alternatively, schedule a task near the data
Message-passing execution semanticsI Tasks compute on local dataI Hierarchical scheduling, easier to scale to more coresI Remove shared-memory bottleneckI No need for complicated SDSM protocol on every access
DeterministicI Implicit, correct synchronizationI Repeatable behaviorI Formal proof
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 5 / 25
Outline
1 IntroductionContribution
2 Memory Management with Nested Regions
3 Task and Region SemanticsBy exampleFormal Definition
4 Distributed Memory Regions
5 Early Experimental Results
6 Conclusions
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 6 / 25
Motivation for Regions
Task memory footprints are easy to express if the are composed offew objects, contiguous in memory
What if the task memory footprint is:I A linked list or part of it?I A tree or part of it?I A graph or part of it?
Hard to use many common linked data structures
Hard to express irregular algorithms and applications
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 7 / 25
ExampleRegion-based Memory Management
region G, L, R;Object ∗v0;init () {G = newregion();L = newsubregion(G);R = newsubregion(G);v0 = ralloc (G, sizeof (Object));v0->left = ralloc (L, sizeof (Object));v0->right = ralloc (R, sizeof (Object));}
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 8 / 25
ExampleRegion-based Memory Management
region G, L, R;Object ∗v0;init () {G = newregion();L = newsubregion(G);R = newsubregion(G);v0 = ralloc (G, sizeof (Object));v0->left = ralloc (L, sizeof (Object));v0->right = ralloc (R, sizeof (Object));}
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 8 / 25
ExampleRegion-based Memory Management
region G, L, R;Object ∗v0;init () {G = newregion();L = newsubregion(G);R = newsubregion(G);v0 = ralloc (G, sizeof (Object));v0->left = ralloc (L, sizeof (Object));v0->right = ralloc (R, sizeof (Object));}
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 8 / 25
ExampleRegion-based Memory Management
region G, L, R;Object ∗v0;init () {G = newregion();L = newsubregion(G);R = newsubregion(G);v0 = ralloc (G, sizeof (Object));v0->left = ralloc (L, sizeof (Object));v0->right = ralloc (R, sizeof (Object));}
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 8 / 25
ExampleRegion-based Memory Management
region G, L, R;Object ∗v0;init () {G = newregion();L = newsubregion(G);R = newsubregion(G);v0 = ralloc (G, sizeof (Object));v0->left = ralloc (L, sizeof (Object));v0->right = ralloc (R, sizeof (Object));}
region G
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 8 / 25
ExampleRegion-based Memory Management
region G, L, R;Object ∗v0;init () {G = newregion();L = newsubregion(G);R = newsubregion(G);v0 = ralloc (G, sizeof (Object));v0->left = ralloc (L, sizeof (Object));v0->right = ralloc (R, sizeof (Object));}
region G
region L
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 8 / 25
ExampleRegion-based Memory Management
region G, L, R;Object ∗v0;init () {G = newregion();L = newsubregion(G);R = newsubregion(G);v0 = ralloc (G, sizeof (Object));v0->left = ralloc (L, sizeof (Object));v0->right = ralloc (R, sizeof (Object));}
region G
region L region R
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 8 / 25
ExampleRegion-based Memory Management
region G, L, R;Object ∗v0;init () {G = newregion();L = newsubregion(G);R = newsubregion(G);v0 = ralloc (G, sizeof (Object));v0->left = ralloc (L, sizeof (Object));v0->right = ralloc (R, sizeof (Object));}
region G
region L region R
v0
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 8 / 25
ExampleRegion-based Memory Management
region G, L, R;Object ∗v0;init () {G = newregion();L = newsubregion(G);R = newsubregion(G);v0 = ralloc (G, sizeof (Object));v0->left = ralloc (L, sizeof (Object));v0->right = ralloc (R, sizeof (Object));}
region G
region L region R
v0
v1
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 8 / 25
ExampleRegion-based Memory Management
region G, L, R;Object ∗v0;init () {G = newregion();L = newsubregion(G);R = newsubregion(G);v0 = ralloc (G, sizeof (Object));v0->left = ralloc (L, sizeof (Object));v0->right = ralloc (R, sizeof (Object));}
region G
region L region R
v0
v1 v2
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 8 / 25
Outline
1 IntroductionContribution
2 Memory Management with Nested Regions
3 Task and Region SemanticsBy exampleFormal Definition
4 Distributed Memory Regions
5 Early Experimental Results
6 Conclusions
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 9 / 25
ExampleTasks and Regions
f(Object ∗v0) {if (done) return;spawn f(v0->left) [ inout region L];spawn f(v0->right) [ inout region R];spawn h(v0->left) [ inout v0->left ];spawn h(v0->right) [inout v0->right ];}
region G
region L region R
v0
v1 v2
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 10 / 25
ExampleTasks and Regions
f(Object ∗v0) {if (done) return;spawn f(v0->left) [ inout region L];spawn f(v0->right) [ inout region R];spawn h(v0->left) [ inout v0->left ];spawn h(v0->right) [inout v0->right ];}
region G
f[G]
region L region R
v0
v1 v2
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 10 / 25
ExampleTasks and Regions
f(Object ∗v0) {if (done) return;spawn f(v0->left) [ inout region L];spawn f(v0->right) [ inout region R];spawn h(v0->left) [ inout v0->left ];spawn h(v0->right) [inout v0->right ];}
region G
f[G]
region L
f[L]
region R
v0
v1 v2
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 10 / 25
ExampleTasks and Regions
f(Object ∗v0) {if (done) return;spawn f(v0->left) [ inout region L];spawn f(v0->right) [ inout region R];spawn h(v0->left) [ inout v0->left ];spawn h(v0->right) [inout v0->right ];}
region G
f[G]
region L
f[L]
region R
f[R]
v0
v1 v2
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 10 / 25
ExampleTasks and Regions
f(Object ∗v0) {if (done) return;spawn f(v0->left) [ inout region L];spawn f(v0->right) [ inout region R];spawn h(v0->left) [ inout v0->left ];spawn h(v0->right) [inout v0->right ];}
region G
f[G]
region L
f[L]
region R
f[R]
v0
v1
h[v1]
v2
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 10 / 25
ExampleTasks and Regions
f(Object ∗v0) {if (done) return;spawn f(v0->left) [ inout region L];spawn f(v0->right) [ inout region R];spawn h(v0->left) [ inout v0->left ];spawn h(v0->right) [inout v0->right ];}
region G
f[G]
region L
f[L]
region R
f[R]
v0
v1
h[v1]
v2
h[v2]
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 10 / 25
Deterministic Scheduler Operational Semantics
Define small-step operational semanticsI 〈T ,D,S ,R〉 →p 〈T ′,D ′,S ′,R ′〉
Memory model:I Global address space: the store S includes all memory addressesI Distributed memory implementation: each task only accesses memory
locations declared in its footprint
Scheduling algorithm:I Dependency metadata D maintain a task queue per memory locationI Always possible to spawn a taskI But, it can only run when at the head of queue for all locations in the
footprint
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 11 / 25
Deterministic Proof Technique(Pratikakis et. al, MSPC’11)
Define sequential operational semanticsI 〈S , e〉 →s 〈S ′, e′〉
Prove sequential equivalence on execution tracesI Intuitively: Every parallel execution will produce the same value and
memory state as the sequential executionI More precisely: Given a program e and a finite (terminating) parallel
execution trace for e that produces a value v and a memory state S ,we can always construct a sequential execution trace for e that alsoreturns v and produces S
Proof by induction on the parallel traceI We can always reorder steps in the parallel trace to bring the next
“sequential” step to the front of the traceI Proof does not require scoped parallelism (sync)I Similar to a confluence proof
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 12 / 25
Outline
1 IntroductionContribution
2 Memory Management with Nested Regions
3 Task and Region SemanticsBy exampleFormal Definition
4 Distributed Memory Regions
5 Early Experimental Results
6 Conclusions
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 13 / 25
Distributed Memory Management API
• Object allocation:
void *sys_alloc(size_t size, rid_t region);
void sys_balloc(size_t size, rid_t region,
int num_objects, void **objects);
void sys_free(void *ptr);
void *sys_realloc(void *ptr, size_t size, rid_t rgn);
• Region allocation:
rid_t sys_ralloc(rid_t parent, char level_hint);
void sys_rfree(rid_t region);
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 14 / 25
Distributed Memory Implementation by ExampleA = malloc(10);B = malloc(20);C = malloc(5);D = malloc(80);
A: 0
alloc() hash dep waiting ready queue
B: 1
C: 0
D: 2
core0
A, C
core1
B
core2
D
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 15 / 25
Distributed Memory Implementation by ExampleA = malloc(10);B = malloc(20);C = malloc(5);D = malloc(80);
# task in A, out Btask1(A, B);
A: 0 0
alloc() hash dep waiting ready queue
B: 1
C: 0
D: 2
0(10), 1(20)
core0
A, C
core1
B
core2
D
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 15 / 25
Distributed Memory Implementation by ExampleA = malloc(10);B = malloc(20);C = malloc(5);D = malloc(80);
# task in A, out Btask1(A, B);
A: 1 0
alloc() hash dep waiting ready queue
B: 1
C: 0
D: 2
core0
C
core1
A, B
core2
D
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 15 / 25
Distributed Memory Implementation by ExampleA = malloc(10);B = malloc(20);C = malloc(5);D = malloc(80);
# task in A, out Btask1(A, B);
# task in C, out Dtask2(C, D);
A: 1 0
0
alloc() hash dep waiting ready queue
B: 1
C: 2
D: 2
0(5), 2(80)
core0
C
core1
A, B
core2
C, D
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 15 / 25
Distributed Memory Implementation by ExampleA = malloc(10);B = malloc(20);C = malloc(5);D = malloc(80);
# task in A, out Btask1(A, B);
# task in C, out Dtask2(C, D);
# task in B, inout Dtask3(B, D);
A: 1 0
0
2
alloc() hash dep waiting ready queue
B: 1
C: 2
D: 2
core0 core1
A, B
core2
C, D
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 15 / 25
Distributed Memory Implementation by Example
A: 1 0
1
2
alloc() hash dep waiting ready queue
B: 1
C: 2
D: 2
core0 core1
A, B
core2
C, D
A = malloc(10);B = malloc(20);C = malloc(5);D = malloc(80);
# task in A, out Btask1(A, B);
# task in C, out Dtask2(C, D);
# task in B, inout Dtask3(B, D);
task2 (C, D) { ... # task inout D task4 (D); # wait on D}
2(80)
0
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 15 / 25
Distributed Memory Implementation by Example
A: 1
1
1
alloc() hash dep waiting ready queue
B: 1
C: 2
D: 2
core0 core1
A, B
core2
C, D
A = malloc(10);B = malloc(20);C = malloc(5);D = malloc(80);
# task in A, out Btask1(A, B);
# task in C, out Dtask2(C, D);
# task in B, inout Dtask3(B, D);
task2 (C, D) { ... # task inout D task4 (D); # wait on D}
0
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 15 / 25
Distributed Memory Implementation by Example
A: 1
0
1
alloc() hash dep waiting ready queue
B: 1
C: 2
D: 2
core0 core1
A, B
core2
C, D
A = malloc(10);B = malloc(20);C = malloc(5);D = malloc(80);
# task in A, out Btask1(A, B);
# task in C, out Dtask2(C, D);
# task in B, inout Dtask3(B, D);
task2 (C, D) { ... # task inout D task4 (D); # wait on D}
2(5), 2(80)
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 15 / 25
Distributed Memory Implementation by Example
A: 1
0
alloc() hash dep waiting ready queue
B: 1
C: 2
D: 2
core0 core1
A, B
core2
B, C, D
A = malloc(10);B = malloc(20);C = malloc(5);D = malloc(80);
# task in A, out Btask1(A, B);
# task in C, out Dtask2(C, D);
# task in B, inout Dtask3(B, D);
task2 (C, D) { ... # task inout D task4 (D); # wait on D}
1(20), 2(80)
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 15 / 25
Distributed Memory Implementation by Example
A: 1 0
1
2
alloc() hash dep waiting ready queue
B: 1
C: 2
D: 2
A = malloc(10);B = malloc(20);C = malloc(5);D = malloc(80);
# task in A, out Btask1(A, B);
# task in C, out Dtask2(C, D);
# task in B, inout Dtask3(B, D);
task2 (C, D) { ... # task inout D task4 (D); # wait on D}
2(80)
0
Decode: Enqueue task args into alloc() hash fromleft/right side; Count non-ownershipargs in dep waiting
Issue: If dep waiting reaches 0, put task inready queue along with affinity stats
Prepare: Decide where to run, fetch non-local data
Run: Execute code when fetch is finished
Collect: Update alloc() hash ownership; Update depwaiting for these objects; Maybe goto Issue
1(10), 1(20)
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 15 / 25
Outline
1 IntroductionContribution
2 Memory Management with Nested Regions
3 Task and Region SemanticsBy exampleFormal Definition
4 Distributed Memory Regions
5 Early Experimental Results
6 Conclusions
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 16 / 25
Hierarchical schedulers topology
Level 2
Level 1
Level 0
(Level 3: Workers)
20
13121110
00
21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 17 / 25
Splitting region trees for hierarchical scheduling
root
R1 Ra
R
Mb
Ra1
L
L2L1 Ma
Mb1 Mb2
M
Ma1
1
root region
region
object
L0 scheduler
L1 scheduler
L1 schedulerL2 scheduler
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 18 / 25
Barnes-Hut, 1.1M bodies, Single scheduler
0
100
200
300
400
500
600
Serial 1 2 4 8 16 32 64 128 256 512
Exe
cutio
n tim
e (s
econ
ds)
Number of workers
ComputationWorker commSched comm
Barrier waitSimulate
Load balancingNew local tree
Fetch remote tree
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 19 / 25
Barnes-Hut, 1.1M bodies, Two-level scheduler hierarchy
0
100
200
300
400
500
600
Serial 1(1) 2(1) 4(1) 8(1) 16(2) 32(4) 64(8) 128(16) 256(32) 512(64)
Exe
cutio
n tim
e (s
econ
ds)
Number of workers (number of L1 schedulers)
ComputationWorker commSched comm
Barrier waitSimulate
Load balancingNew local tree
Fetch remote tree
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 20 / 25
UPC comparison, 16 worker cores
0
10
20
30
40
50
60
70
80
90
100
15,000 30,000 60,000 120,000 240,000
Exe
cutio
n tim
e (s
econ
ds)
Number of objects
UPC, 248-B objectsUPC, 504-B objects
UPC, 1016-B objectsUPC, 2040-B objects
Myrmics, 248-B objectsMyrmics, 504-B objects
Myrmics, 1016-B objectsMyrmics, 2040-B objects
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 21 / 25
UPC comparison, 15,000 504-B objects
2
4
8
16
32
64
128
256
16 32 64 128
Exe
cutio
n tim
e (s
econ
ds)
Number of workers
UPCMyrmics, single scheduler
Myrmics, hierarchical schedulers
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 22 / 25
Outline
1 IntroductionContribution
2 Memory Management with Nested Regions
3 Task and Region SemanticsBy exampleFormal Definition
4 Distributed Memory Regions
5 Early Experimental Results
6 Conclusions
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 23 / 25
Conclusions
Regions language construct:I arbitrary task memory footprints, better productivityI shared memory programming abstractionsI distributed memory execution semanticsI scalability
Work in progress:I Distributed memory task-based runtime (Myrmics)I Compiler support for regions (SCOOP)I Implementation and testing on FPGA prototype (Formic)
Acknowledgments:
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 24 / 25
CARV (FORTH-ICS) Formic 512-core FPGA Prototype
Dimitrios S. Nikolopoulos (FORTH-ICS) Region-Based Memory Management EPoPPEA 2012 25 / 25