TRANSCRIPT
Embracing Explicit Communication in Work-Stealing Runtime Systems
Andreas Prell
Manycore Processors
• “Cluster-on-chip” architectures
• Increasing thread- and data-level parallelism
• Growing importance of scalable communication
Left: S. Bell et al., TILE64™ Processor: A 64-Core SoC with Mesh Interconnect, ISSCC 2008 | Center: T. Mattson et al., The 48-Core SCC Processor: The Programmer’s View, SC 2010 | Right: A. Sodani et al., Knights Landing: Second-Generation Intel Xeon Phi Product, IEEE Micro 2016
From Threads to Tasks
Make it easy to express fine-grained task parallelism

Threads:

int recurse(int n) {
    if (n < 2) return base_case();
    int x;
    std::thread t([&] { x = recurse(n-1); });
    int y = recurse(n-2);
    t.join();
    return x + y;
}
From Threads to Tasks
Make it easy to express fine-grained task parallelism

Tasks:

int recurse(int n) {
    if (n < 2) return base_case();
    int x = spawn recurse(n-1);
    int y = recurse(n-2);
    sync;
    return x + y;
}
From Threads to Tasks
The runtime system manages parallel execution
[Diagram: tasks flow from a task pool to a thread pool of worker threads.]
Central versus Distributed Task Pools
[Plot: speedup over sequential execution vs. number of threads (0–48) for GCC 4.9.1 (GNU libgomp) and ICC 14.0.1 (Intel OpenMP RTL). Benchmark: UTS T3L, a binomial tree of ~111 million nodes. An annotation marks a 223x gap between the two curves.]
Load Balancing through Work Stealing
[Diagram sequence, shared deques: three workers W1–W3. The owner pushes tasks A, B, C onto its own deque and pops them from the same end; an idle worker steals a task directly from the opposite end of a victim's deque.]
Load Balancing through Work Stealing
[Diagram sequence, private deques: the same three workers, but each deque is private to its owner. An idle worker sends a steal request to a victim; the victim removes a task from its private deque and sends it back to the thief.]
Embracing Explicit Communication
Requirements
• Work-stealing scheduling
• Explicit communication → efficient message passing, private-access deques
• Task synchronization → collective and individual
• Coarse-grained parallelism → polling
• Fine-grained parallelism → adaptive stealing strategies, granularity control
Channels
Simple message-passing abstraction:

bool channel_send(Channel *, void *, size_t);
bool channel_receive(Channel *, void *, size_t);

Bounded FIFO message queues
Building blocks: MPSC and SPSC channels
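To make the abstraction concrete, here is a minimal sketch of a bounded single-producer/single-consumer (SPSC) channel along these lines; the field names, the fixed element size, and channel_alloc are assumptions for illustration, not the runtime's actual layout or API.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

// Minimal bounded SPSC channel (sketch). One thread sends, one receives.
typedef struct Channel {
    char *buf;            // ring buffer of capacity * item_size bytes
    size_t capacity;      // number of slots
    size_t item_size;     // size of each message
    _Atomic size_t head;  // next slot to read  (consumer-owned)
    _Atomic size_t tail;  // next slot to write (producer-owned)
} Channel;

Channel *channel_alloc(size_t item_size, size_t capacity) {
    Channel *c = calloc(1, sizeof(Channel));
    c->buf = malloc(item_size * capacity);
    c->capacity = capacity;
    c->item_size = item_size;
    return c;
}

bool channel_send(Channel *c, void *data, size_t size) {
    size_t tail = atomic_load_explicit(&c->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&c->head, memory_order_acquire);
    if (tail - head == c->capacity) return false;       // channel full
    memcpy(c->buf + (tail % c->capacity) * c->item_size, data, size);
    atomic_store_explicit(&c->tail, tail + 1, memory_order_release);
    return true;
}

bool channel_receive(Channel *c, void *data, size_t size) {
    size_t head = atomic_load_explicit(&c->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&c->tail, memory_order_acquire);
    if (tail == head) return false;                      // channel empty
    memcpy(data, c->buf + (head % c->capacity) * c->item_size, size);
    atomic_store_explicit(&c->head, head + 1, memory_order_release);
    return true;
}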
Steal Requests

struct steal_request {
    Channel *chan;   // channel on which the victim's reply is expected
    int thief;
    // ...
};

struct steal_request req = {
    .chan = /* this worker's channel for receiving stolen tasks */,
    .thief = ID,
    // ...
};
int i = select_victim();
channel_send(/* victim i's steal-request channel */, &req, sizeof(req));

Two channels per worker: an MPSC channel for incoming steal requests and an SPSC channel for receiving stolen tasks.
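A sketch of what that per-worker setup could look like; the array names steal_req_chan and task_chan, the Task descriptor type, and channel_alloc are assumptions carried over from the channel sketch above. With these names, the blanks in the snippet above would read task_chan[ID] and steal_req_chan[i].

#define MAX_WORKERS 256   // assumed upper bound for this sketch

static Channel *steal_req_chan[MAX_WORKERS]; // MPSC: any thief -> worker i (steal requests)
static Channel *task_chan[MAX_WORKERS];      // SPSC: one victim -> worker i (stolen tasks)

// Each worker sets up its two channels once, before stealing begins.
void worker_init_channels(int ID, int num_workers) {
    steal_req_chan[ID] = channel_alloc(sizeof(struct steal_request), num_workers);
    task_chan[ID]      = channel_alloc(sizeof(Task *), 1);
}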
Steal Requests
[Diagram sequence: an idle worker sends a steal request to a victim; a victim without tasks replies with an acknowledgment ("Nothing"), and the thief then sends a new steal request to another victim, and so on.]
Idea: Eliminate ACKs by forwarding steal requests
[Diagram sequence: instead of replying "Nothing", a victim without tasks forwards the thief's steal request to another worker.]

Forwarding
• reduces the number of messages
• facilitates asynchronous stealing
• improves performance (see the sketch below)
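A minimal sketch of how a victim might handle an incoming steal request under this scheme; my_deque, deque_is_empty, deque_pop, select_victim, and the channel arrays are assumed helpers from the earlier sketches, not the runtime's actual API.

// Called when a worker polls its MPSC channel of steal requests (sketch).
void handle_steal_request(struct steal_request *req) {
    if (deque_is_empty(my_deque)) {
        // No tasks to give away: instead of sending an ACK ("Nothing") back
        // to the thief, forward the request to another potential victim.
        int i = select_victim();
        channel_send(steal_req_chan[i], req, sizeof(*req));
    } else {
        // Remove a task from the private deque and send it to the thief
        // on the channel carried inside the steal request.
        Task *task = deque_pop(my_deque);
        channel_send(req->chan, &task, sizeof(task));
    }
}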
Steal Requests
[Plot: execution time (ms, 400–1100) vs. task length (µs, 0–10), comparing acknowledging and forwarding of steal requests. Benchmark: BPC with d = 10^5, n = 9, and t as shown on the x-axis.]
Stealing Tasks
[Diagram: a private deque with head H and tail T.]
Steal-one: send a single task T (or a pointer &T) to the thief — share memory by communicating.*
*A. Gerrand, https://blog.golang.org/share-memory-by-communicating, 13 July 2010
Steal-half: send ⌊n/2⌋ tasks at once by passing &H', a pointer to the head of the stolen half of the deque (sketched below).
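A sketch of the steal-half reply and the matching thief-side receive, assuming the deque stores tasks as a linked list and a hypothetical deque_steal_half helper that unlinks the older half and returns a pointer to its first task; task_chan and ID are carried over from the earlier sketches.

// Steal-half (sketch): hand over floor(n/2) tasks in a single message.
void reply_steal_half(struct steal_request *req) {
    Task *h_prime = deque_steal_half(my_deque);          // unlink half of the deque, return its head H'
    channel_send(req->chan, &h_prime, sizeof(h_prime));  // send the pointer H'; the thief now owns that list
}

// Thief side (sketch): the reply arrives on the thief's own task channel.
Task *receive_stolen_tasks(int ID) {
    Task *head = NULL;
    while (!channel_receive(task_chan[ID], &head, sizeof(head)))
        ;                  // poll until the victim (or a forwarder) replies
    return head;           // steal-one: a single task; steal-half: the head of a task list
}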
Stealing Tasks
[Plot: speedup over sequential execution vs. number of workers (0–48) for Steal-one and Steal-half. Benchmark: SPC with n = 10^6 and t = 100 µs (task length 100 µs).]
Stealing Tasks
[Plot: speedup over sequential execution vs. number of workers (0–48) for Steal-one and Steal-half. Benchmark: SPC with n = 10^6 and t = 10 µs (task length 10 µs).]
Task Synchronization
Task Barrier

#include <stdio.h>
#include "tasking.h"

ASYNC_VOID_DECL(puts, const char *s, s);

int main(void) {
    TASKING_INIT();
    ASYNC(puts, "Order");
    ASYNC(puts, "Undefined");
    TASKING_BARRIER();
    ASYNC(puts, "Last");
    TASKING_EXIT();
    return 0;
}
The OpenMP equivalent:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        #pragma omp master
        {
            #pragma omp task
            puts("Order");
            #pragma omp task
            puts("Undefined");
        }
        #pragma omp barrier
        #pragma omp master
        {
            #pragma omp task
            puts("Last");
        }
    }
    return 0;
}
Termination Detection with Steal Requests
Problem: Detect when all workers are idle without resorting to implicit communication
Idea: “Color” steal requests
→ Avoids separate control messages
→ Termination follows from forwarding
Extended Steal Requests

struct steal_request {
    Channel *chan;
    int thief;
    enum { working, idle, reg_idle } state;
    // ...
};

[State diagram: steal-request states working → idle → reg_idle.]

// Manager receives req
switch (req.state) {
case idle:
    // Mark the thief as registered idle (reg_idle)
    break;
}
Notifying the Manager
[Diagram sequence: a worker whose steal request has been registered as idle (reg_idle) later receives tasks again and must notify the manager with an update message.]

Updates
[Diagram: message timeline between Worker i, Worker j, and the Manager, showing steal requests, tasks, and update messages.]
struct steal_request {
    Channel *chan;
    int thief;
    enum { working, idle, reg_idle, update } state;
    // ...
};

// Manager receives req
switch (req.state) {
case update:
    // ...
    break;
case idle:
    // ...
    break;
}
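One plausible shape for the manager's bookkeeping, sketched under the assumption that it simply counts registered-idle workers; the counter and signal_termination are invented for illustration and omit the care needed for tasks still in flight.

static int num_reg_idle = 0;   // manager-local count of registered-idle workers

// Manager-side handling of an incoming steal request (sketch).
void manager_handle(struct steal_request *req, int num_workers) {
    switch (req->state) {
    case idle:
        // First report of idleness from this worker: register it.
        req->state = reg_idle;
        num_reg_idle++;
        if (num_reg_idle == num_workers)
            signal_termination();   // every worker is registered idle
        break;
    case update:
        // A registered-idle worker has received tasks again.
        num_reg_idle--;
        break;
    default:
        break;
    }
}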
Task Barrier
Impact of explicit communication
[Plot: maximum task-barrier latency (µs) vs. number of workers N (60–240), comparing "return steal request after N attempts" with "return steal request after 10 attempts".]
Task Barrier
Impact of explicit communication
[Plot: minimum task-barrier latency (µs) vs. number of workers N (60–240), comparing "steal requests + cancel after barrier", "steal requests + worker 0 as manager", and the Intel OpenMP barrier.]
Worker 0 cancels a random worker's steal request
No additional communication required
Channel-based Futures

With spawn/sync:
int x = spawn f(n-1);
int y = f(n-2);
sync;

With futures:
future fx = FUTURE(f, n-1);
int y = f(n-2);
int x = AWAIT(fx, int);

FUTURE
1. Allocates a one-element SPSC channel
2. Creates a task, passing the channel
3. Returns a handle to the channel (future)

AWAIT
1. Waits for the task to send its result
2. Receives and returns the result
3. Frees or recycles the channel
While waiting, the worker tries to schedule other work to avoid idling.
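A hedged sketch of how FUTURE and AWAIT could be layered on the channel primitive; the helper names (task_spawn, f_result_task, try_schedule_other_work, channel_free) and the exact expansion are assumptions, not the runtime's actual macros.

typedef Channel *future;   // a future is a handle to a one-element SPSC channel

// What FUTURE(f, n-1) could expand to for an int-returning f (sketch):
future future_spawn_f(int arg) {
    future fut = channel_alloc(sizeof(int), 1);  // 1. allocate a one-element SPSC channel
    task_spawn(f_result_task, arg, fut);         // 2. create a task that runs f(arg) and
                                                 //    sends the result into fut
    return fut;                                  // 3. return the handle (the future)
}

// What AWAIT(fx, int) could expand to (sketch):
int future_await_int(future fut) {
    int result;
    // Wait for the task to send its result, scheduling other work instead of idling.
    while (!channel_receive(fut, &result, sizeof(result)))
        try_schedule_other_work();
    channel_free(fut);   // free or recycle the channel
    return result;
}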
Performance
Fork/join parallelism
[Three plots: speedup over sequential execution vs. number of workers (0–48), comparing futures with channel caching against Cilk Plus. Benchmarks from left to right: tree recursion with n = 34 and t = 1 µs | 14 Queens Problem | Cilksort of 10^8 integers.]
Adaptive Strategies
Adaptive Stealing
Idea: Re-evaluate the strategy after every N steals
Count how many tasks M have been executed since the last evaluation:
[State diagram: Steal-one and Steal-half. In Steal-one, stay while M/N > 1 and switch to Steal-half when M/N = 1; in Steal-half, stay while M/N < 2 and switch to Steal-one when M/N ≥ 2.]
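A small sketch of how a worker might implement this rule; the counters, STEAL_ADAPT_N, and the places where they are updated are assumptions that follow the state diagram above.

#define STEAL_ADAPT_N 10          // N: re-evaluate after this many successful steals (assumed value)

enum strategy { STEAL_ONE, STEAL_HALF };
static enum strategy current_strategy = STEAL_ONE;
static int num_steals = 0;        // successful steals since the last evaluation
static int num_tasks  = 0;        // M: tasks executed since the last evaluation (incremented elsewhere)

// Called after every successful steal (sketch).
void maybe_switch_strategy(void) {
    if (++num_steals < STEAL_ADAPT_N) return;
    double ratio = (double)num_tasks / num_steals;          // M/N
    if (current_strategy == STEAL_ONE && ratio <= 1.0)
        current_strategy = STEAL_HALF;   // every steal yielded just one task: grab more at once
    else if (current_strategy == STEAL_HALF && ratio >= 2.0)
        current_strategy = STEAL_ONE;    // stolen tasks spawn enough follow-up work
    num_steals = num_tasks = 0;
}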
Adaptive Stealing
[Plot: execution time (ms) vs. number of workers (8–48) for Steal-one, Steal-half, and Steal-adaptive with N = 3, 5, 10, 25, 50. Benchmark: BPC with d = 10^5, n = 9, and t = 10 µs (task length 10 µs).]
Adaptive Stealing
[Plot: execution time (ms) vs. number of workers (8–48) for Steal-one, Steal-half, and Steal-adaptive with N = 3, 5, 10, 25, 50. Benchmark: BPC with d = 1, n = 999,999, and t = 10 µs (task length 10 µs).]
Very Fine-grained Tasks
[Plot: speedup over sequential execution (0–8) vs. number of workers (0–48) for Steal-one, Steal-half, and Steal-adaptive. Benchmark: SPC with n = 10^6 and t = 1 µs (task length 1 µs).]

for (i = 0; i < N; i++)
    ASYNC(f, i, ...);
Splittable Tasks
[Plot: speedup over sequential execution (0–48) vs. number of workers (0–48) for Steal-adaptive and lazy work splitting. Benchmark: SPC with n = 10^6 and t = 1 µs (task length 1 µs).]

Instead of spawning individual tasks,
for (i = 0; i < N; i++)
    ASYNC(f, i, ...);
the loop becomes a single splittable task:
ASYNC_FOR(f, 0, N, ...);
Lazy Binary Splitting
Assumes concurrent work-stealing deques: a worker splits when its local deque is empty, otherwise it executes tasks sequentially
→ Splitting is lazy as opposed to eager
→ Chunking (granularity control) is implicit
A. Tzannes et al., Lazy Binary Splitting: A Run-Time Adaptive Work-Stealing Scheduler, PPoPP 2010
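A minimal sketch of that deque-based check, assuming a loop task over a half-open range and hypothetical my_deque, deque_is_empty, deque_push, and make_loop_task helpers (f is the loop body).

// Lazy binary splitting (sketch): split only when the local deque is empty,
// i.e. only when a thief could actually need the work.
void run_loop_lbs(int lo, int hi) {
    while (lo < hi) {
        if (deque_is_empty(my_deque) && hi - lo > 1) {
            int mid = lo + (hi - lo) / 2;
            deque_push(my_deque, make_loop_task(mid, hi)); // upper half becomes stealable
            hi = mid;                                      // keep the lower half
        }
        f(lo++);   // execute one iteration sequentially
    }
}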
Lazy Binary Splitting
[Diagram: when the worker splits, it makes 1/2 of the remaining iterations available to thieves.]
Lazy Guided Splitting (example: four workers)
[Diagram: the worker makes 3/4 of the remaining iterations available to thieves.]
Lazy Adaptive Splitting (example: two workers are idle)
[Diagram: the worker makes 2/3 of the remaining iterations available to thieves.]
Lazy Splitting
Neither strategy is truly lazy
Difficult to know which strategy works best → Explicit communication solves this problem

Lazy Adaptive Splitting (example: a worker receives two steal requests)
[Diagram: the worker splits off 1/3 of the remaining iterations for each of the two thieves and keeps 1/3 for itself; see the sketch below.]
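A sketch of splitting driven by steal requests, which makes splitting truly lazy: the worker only splits when requests have actually arrived, and it splits off one share per pending thief. The helpers (steal_requests_pending, make_loop_task, steal_req_chan, ID, f) are assumptions carried over from the earlier sketches.

typedef struct { int lo, hi; } LoopTask;   // a splittable loop task over [lo, hi)

void run_splittable(LoopTask *task) {
    while (task->lo < task->hi) {
        int pending = steal_requests_pending();   // requests waiting in this worker's MPSC channel
        int remaining = task->hi - task->lo;
        if (pending > 0 && remaining > pending) {
            // Keep 1/(pending+1) of the remaining iterations and send the rest
            // to the thieves in equal shares (two requests -> 1/3 each).
            int share = remaining / (pending + 1);
            for (int k = 0; k < pending; k++) {
                struct steal_request req;
                channel_receive(steal_req_chan[ID], &req, sizeof(req));
                Task *piece = make_loop_task(task->hi - share, task->hi);
                task->hi -= share;
                channel_send(req.chan, &piece, sizeof(piece));
            }
        }
        f(task->lo++);   // execute one iteration sequentially
    }
}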
Performance
Lazy Splitting
[Bar plot: speedup over sequential execution (0–48) for binary, guided, and adaptive splitting. Benchmark: parallel loops of Fine (FG), Coarse (CG), Random (RG), Increasing (IG), and Decreasing (DG) granularity; FG and CG are balanced, RG, IG, and DG are unbalanced.]
Performance
Mixing tasks and splittable tasks
[Plot: speedup over sequential execution (0–48) vs. size of splittable consumer tasks (9, 99, 999, 9999, 99999), comparing Cilk Plus with binary, guided, and adaptive splitting. Benchmark: BPC with d = [10^4, 10^3, …, 1], n = [9, 99, …, 99999], and t = 10 µs.]
Performance
Mixing tasks and splittable tasks
[Plot: speedup over sequential execution (0–48) vs. size of splittable consumer tasks (9, 99, 999, 9999, 99999), comparing single tasks with binary, guided, and adaptive splitting. Benchmark: BPC with d = [10^5, 10^4, …, 10], n = [9, 99, …, 99999], and t = 1 µs.]
Conclusion
Performance Ranking
Average deviations from the best median speedups across 21 benchmarks/workloads (20 in the case of Cilk Plus):

2-socket Intel Xeon (24 threads): 1. Chase-Lev WS -1.6 % | 2. Channel WS -2.4 % | 3. Cilk Plus -4.6 % | 4. Intel OpenMP -10.7 %
4-socket AMD Opteron (48 threads): 1. Chase-Lev WS -2.2 % | 2. Channel WS -2.4 % | 3. Cilk Plus -6.9 % | 4. Intel OpenMP -21.8 %
60-core Intel Xeon Phi (240 threads): 1. Chase-Lev WS -13.7 % | 2. Channel WS -13.7 % | 3. Intel OpenMP -22.2 % | 4. Cilk Plus -28.1 %
David Chase and Yossi Lev, Dynamic Circular Work-Stealing Deque, SPAA 2005
Summary
• Work-stealing runtime system with
  • private deques
  • channel communication
• Workers
  • forward steal requests
  • adapt their stealing strategy
  • split tasks lazily
Flexibility ✔
Performance ✔