TRANSCRIPT
Embracing Explicit Communication in Work-Stealing Runtime Systems
Andreas Prell
Manycore Processors
• “Cluster-on-chip” architectures
• Increasing thread- and data-level parallelism
• Growing importance of scalable communication
Left: S. Bell et al., TILE64™ Processor: A 64-Core SoC with Mesh Interconnect, ISSCC 2008 | Center: T. Mattson et al., The 48-Core SCC Processor: The Programmer’s View, SC 2010 | Right: A. Sodani et al., Knights Landing: Second-Generation Intel Xeon Phi Product, IEEE Micro 2016
From Threads to Tasks
Make it easy to express fine-grained task parallelism

Threads:

int recurse(int n) {
    if (n < 2) return base_case();
    int x;
    std::thread t([&] { x = recurse(n-1); });
    int y = recurse(n-2);
    t.join();
    return x + y;
}
From Threads to Tasks
Make it easy to express fine-grained task parallelism

Tasks:

int recurse(int n) {
    if (n < 2) return base_case();
    int x = spawn recurse(n-1);
    int y = recurse(n-2);
    sync;
    return x + y;
}
From Threads to Tasks
The runtime system manages parallel execution
[Diagram: tasks flow from a task pool to a thread pool of worker threads.]
Central versus Distributed Task Pools
[Plot: speedup over sequential execution vs. number of threads (0–48) for GCC 4.9.1 (GNU libgomp) and ICC 14.0.1 (Intel OpenMP RTL). Benchmark: UTS T3L, a binomial tree of ~111 million nodes. An annotation marks a 223x gap between the two curves.]
Load Balancing through Work Stealing
[Diagram sequence, shared deques: three workers W1–W3. The owner pushes tasks A, B, C onto its own deque and pops them from the same end; an idle worker steals a task directly from the opposite end of a victim's deque.]
Load Balancing through Work Stealing
[Diagram sequence, private deques: the same three workers, but each deque is private to its owner. An idle worker sends a steal request to a victim; the victim removes a task from its private deque and sends it back to the thief.]
Embracing Explicit Communication
Requirements
• Work-stealing scheduling
• Explicit communication → efficient message passing, private-access deques
• Task synchronization → collective and individual
• Coarse-grained parallelism → polling
• Fine-grained parallelism → adaptive stealing strategies, granularity control
Channels
Simple message-passing abstraction:

bool channel_send(Channel *, void *, size_t);
bool channel_receive(Channel *, void *, size_t);

Bounded FIFO message queues
Building blocks: MPSC and SPSC channels
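To make the abstraction concrete, here is a minimal sketch of a bounded single-producer/single-consumer (SPSC) channel along these lines; the field names, the fixed element size, and channel_alloc are assumptions for illustration, not the runtime's actual layout or API.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

// Minimal bounded SPSC channel (sketch). One thread sends, one receives.
typedef struct Channel {
    char *buf;            // ring buffer of capacity * item_size bytes
    size_t capacity;      // number of slots
    size_t item_size;     // size of each message
    _Atomic size_t head;  // next slot to read  (consumer-owned)
    _Atomic size_t tail;  // next slot to write (producer-owned)
} Channel;

Channel *channel_alloc(size_t item_size, size_t capacity) {
    Channel *c = calloc(1, sizeof(Channel));
    c->buf = malloc(item_size * capacity);
    c->capacity = capacity;
    c->item_size = item_size;
    return c;
}

bool channel_send(Channel *c, void *data, size_t size) {
    size_t tail = atomic_load_explicit(&c->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&c->head, memory_order_acquire);
    if (tail - head == c->capacity) return false;       // channel full
    memcpy(c->buf + (tail % c->capacity) * c->item_size, data, size);
    atomic_store_explicit(&c->tail, tail + 1, memory_order_release);
    return true;
}

bool channel_receive(Channel *c, void *data, size_t size) {
    size_t head = atomic_load_explicit(&c->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&c->tail, memory_order_acquire);
    if (tail == head) return false;                      // channel empty
    memcpy(data, c->buf + (head % c->capacity) * c->item_size, size);
    atomic_store_explicit(&c->head, head + 1, memory_order_release);
    return true;
}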
Steal Requests

struct steal_request {
    Channel *chan;   // channel on which the victim's reply is expected
    int thief;
    // ...
};

struct steal_request req = {
    .chan = /* this worker's channel for receiving stolen tasks */,
    .thief = ID,
    // ...
};
int i = select_victim();
channel_send(/* victim i's steal-request channel */, &req, sizeof(req));

Two channels per worker: an MPSC channel for incoming steal requests and an SPSC channel for receiving stolen tasks.
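A sketch of what that per-worker setup could look like; the array names steal_req_chan and task_chan, the Task descriptor type, and channel_alloc are assumptions carried over from the channel sketch above. With these names, the blanks in the snippet above would read task_chan[ID] and steal_req_chan[i].

#define MAX_WORKERS 256   // assumed upper bound for this sketch

static Channel *steal_req_chan[MAX_WORKERS]; // MPSC: any thief -> worker i (steal requests)
static Channel *task_chan[MAX_WORKERS];      // SPSC: one victim -> worker i (stolen tasks)

// Each worker sets up its two channels once, before stealing begins.
void worker_init_channels(int ID, int num_workers) {
    steal_req_chan[ID] = channel_alloc(sizeof(struct steal_request), num_workers);
    task_chan[ID]      = channel_alloc(sizeof(Task *), 1);
}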
Steal Requests
[Diagram sequence: an idle worker sends a steal request to a victim; a victim without tasks replies with an acknowledgment ("Nothing"), and the thief then sends a new steal request to another victim, and so on.]
Idea: Eliminate ACKs by forwarding steal requests
[Diagram sequence: instead of replying "Nothing", a victim without tasks forwards the thief's steal request to another worker.]

Forwarding
• reduces the number of messages
• facilitates asynchronous stealing
• improves performance (see the sketch below)
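A minimal sketch of how a victim might handle an incoming steal request under this scheme; my_deque, deque_is_empty, deque_pop, select_victim, and the channel arrays are assumed helpers from the earlier sketches, not the runtime's actual API.

// Called when a worker polls its MPSC channel of steal requests (sketch).
void handle_steal_request(struct steal_request *req) {
    if (deque_is_empty(my_deque)) {
        // No tasks to give away: instead of sending an ACK ("Nothing") back
        // to the thief, forward the request to another potential victim.
        int i = select_victim();
        channel_send(steal_req_chan[i], req, sizeof(*req));
    } else {
        // Remove a task from the private deque and send it to the thief
        // on the channel carried inside the steal request.
        Task *task = deque_pop(my_deque);
        channel_send(req->chan, &task, sizeof(task));
    }
}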
Steal Requests
[Plot: execution time (ms, 400–1100) vs. task length (µs, 0–10), comparing acknowledging and forwarding of steal requests. Benchmark: BPC with d = 10^5, n = 9, and t as shown on the x-axis.]
Stealing Tasks
[Diagram: a private deque with head H and tail T.]
Steal-one: send a single task T (or a pointer &T) to the thief — share memory by communicating.*
*A. Gerrand, https://blog.golang.org/share-memory-by-communicating, 13 July 2010
Steal-half: send ⌊n/2⌋ tasks at once by passing &H', a pointer to the head of the stolen half of the deque (sketched below).
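A sketch of the steal-half reply and the matching thief-side receive, assuming the deque stores tasks as a linked list and a hypothetical deque_steal_half helper that unlinks the older half and returns a pointer to its first task; task_chan and ID are carried over from the earlier sketches.

// Steal-half (sketch): hand over floor(n/2) tasks in a single message.
void reply_steal_half(struct steal_request *req) {
    Task *h_prime = deque_steal_half(my_deque);          // unlink half of the deque, return its head H'
    channel_send(req->chan, &h_prime, sizeof(h_prime));  // send the pointer H'; the thief now owns that list
}

// Thief side (sketch): the reply arrives on the thief's own task channel.
Task *receive_stolen_tasks(int ID) {
    Task *head = NULL;
    while (!channel_receive(task_chan[ID], &head, sizeof(head)))
        ;                  // poll until the victim (or a forwarder) replies
    return head;           // steal-one: a single task; steal-half: the head of a task list
}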
Stealing Tasks
[Plot: speedup over sequential execution vs. number of workers (0–48) for Steal-one and Steal-half. Benchmark: SPC with n = 10^6 and t = 100 µs (task length 100 µs).]
Stealing Tasks
[Plot: speedup over sequential execution vs. number of workers (0–48) for Steal-one and Steal-half. Benchmark: SPC with n = 10^6 and t = 10 µs (task length 10 µs).]
Task Synchronization
Task Barrier

#include <stdio.h>
#include "tasking.h"

ASYNC_VOID_DECL(puts, const char *s, s);

int main(void) {
    TASKING_INIT();
    ASYNC(puts, "Order");
    ASYNC(puts, "Undefined");
    TASKING_BARRIER();
    ASYNC(puts, "Last");
    TASKING_EXIT();
    return 0;
}
The OpenMP equivalent:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        #pragma omp master
        {
            #pragma omp task
            puts("Order");
            #pragma omp task
            puts("Undefined");
        }
        #pragma omp barrier
        #pragma omp master
        {
            #pragma omp task
            puts("Last");
        }
    }
    return 0;
}
Termination Detection with Steal Requests
Problem: Detect when all workers are idle without resorting to implicit communication
Idea: “Color” steal requests
→ Avoids separate control messages
→ Termination follows from forwarding
Extended Steal Requests

struct steal_request {
    Channel *chan;
    int thief;
    enum { working, idle, reg_idle } state;
    // ...
};

[State diagram: steal-request states working → idle → reg_idle.]

// Manager receives req
switch (req.state) {
case idle:
    // Mark the thief as registered idle (reg_idle)
    break;
}
Notifying the Manager
[Diagram sequence: a worker whose steal request has been registered as idle (reg_idle) later receives tasks again and must notify the manager with an update message.]

Updates
[Diagram: message timeline between Worker i, Worker j, and the Manager, showing steal requests, tasks, and update messages.]
struct steal_request {
    Channel *chan;
    int thief;
    enum { working, idle, reg_idle, update } state;
    // ...
};

// Manager receives req
switch (req.state) {
case update:
    // ...
    break;
case idle:
    // ...
    break;
}
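One plausible shape for the manager's bookkeeping, sketched under the assumption that it simply counts registered-idle workers; the counter and signal_termination are invented for illustration and omit the care needed for tasks still in flight.

static int num_reg_idle = 0;   // manager-local count of registered-idle workers

// Manager-side handling of an incoming steal request (sketch).
void manager_handle(struct steal_request *req, int num_workers) {
    switch (req->state) {
    case idle:
        // First report of idleness from this worker: register it.
        req->state = reg_idle;
        num_reg_idle++;
        if (num_reg_idle == num_workers)
            signal_termination();   // every worker is registered idle
        break;
    case update:
        // A registered-idle worker has received tasks again.
        num_reg_idle--;
        break;
    default:
        break;
    }
}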
Task Barrier
Impact of explicit communication
[Plot: maximum task-barrier latency (µs) vs. number of workers N (60–240), comparing "return steal request after N attempts" with "return steal request after 10 attempts".]
Task Barrier
Impact of explicit communication
[Plot: minimum task-barrier latency (µs) vs. number of workers N (60–240), comparing "steal requests + cancel after barrier", "steal requests + worker 0 as manager", and the Intel OpenMP barrier.]
Worker 0 cancels a random worker's steal request
No additional communication required
Channel-based Futures

With spawn/sync:
int x = spawn f(n-1);
int y = f(n-2);
sync;

With futures:
future fx = FUTURE(f, n-1);
int y = f(n-2);
int x = AWAIT(fx, int);

FUTURE
1. Allocates a one-element SPSC channel
2. Creates a task, passing the channel
3. Returns a handle to the channel (future)

AWAIT
1. Waits for the task to send its result
2. Receives and returns the result
3. Frees or recycles the channel
While waiting, the worker tries to schedule other work to avoid idling.
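A hedged sketch of how FUTURE and AWAIT could be layered on the channel primitive; the helper names (task_spawn, f_result_task, try_schedule_other_work, channel_free) and the exact expansion are assumptions, not the runtime's actual macros.

typedef Channel *future;   // a future is a handle to a one-element SPSC channel

// What FUTURE(f, n-1) could expand to for an int-returning f (sketch):
future future_spawn_f(int arg) {
    future fut = channel_alloc(sizeof(int), 1);  // 1. allocate a one-element SPSC channel
    task_spawn(f_result_task, arg, fut);         // 2. create a task that runs f(arg) and
                                                 //    sends the result into fut
    return fut;                                  // 3. return the handle (the future)
}

// What AWAIT(fx, int) could expand to (sketch):
int future_await_int(future fut) {
    int result;
    // Wait for the task to send its result, scheduling other work instead of idling.
    while (!channel_receive(fut, &result, sizeof(result)))
        try_schedule_other_work();
    channel_free(fut);   // free or recycle the channel
    return result;
}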
Performance
Fork/join parallelism
[Three plots: speedup over sequential execution vs. number of workers (0–48), comparing futures with channel caching against Cilk Plus. Benchmarks from left to right: tree recursion with n = 34 and t = 1 µs | 14 Queens Problem | Cilksort of 10^8 integers.]
Adaptive Strategies
Adaptive Stealing
Idea: Re-evaluate the strategy after every N steals
Count how many tasks M have been executed since the last evaluation:
[State diagram: Steal-one and Steal-half. In Steal-one, stay while M/N > 1 and switch to Steal-half when M/N = 1; in Steal-half, stay while M/N < 2 and switch to Steal-one when M/N ≥ 2.]
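A small sketch of how a worker might implement this rule; the counters, STEAL_ADAPT_N, and the places where they are updated are assumptions that follow the state diagram above.

#define STEAL_ADAPT_N 10          // N: re-evaluate after this many successful steals (assumed value)

enum strategy { STEAL_ONE, STEAL_HALF };
static enum strategy current_strategy = STEAL_ONE;
static int num_steals = 0;        // successful steals since the last evaluation
static int num_tasks  = 0;        // M: tasks executed since the last evaluation (incremented elsewhere)

// Called after every successful steal (sketch).
void maybe_switch_strategy(void) {
    if (++num_steals < STEAL_ADAPT_N) return;
    double ratio = (double)num_tasks / num_steals;          // M/N
    if (current_strategy == STEAL_ONE && ratio <= 1.0)
        current_strategy = STEAL_HALF;   // every steal yielded just one task: grab more at once
    else if (current_strategy == STEAL_HALF && ratio >= 2.0)
        current_strategy = STEAL_ONE;    // stolen tasks spawn enough follow-up work
    num_steals = num_tasks = 0;
}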
Adaptive Stealing
[Plot: execution time (ms) vs. number of workers (8–48) for Steal-one, Steal-half, and Steal-adaptive with N = 3, 5, 10, 25, 50. Benchmark: BPC with d = 10^5, n = 9, and t = 10 µs (task length 10 µs).]
Adaptive Stealing
[Plot: execution time (ms) vs. number of workers (8–48) for Steal-one, Steal-half, and Steal-adaptive with N = 3, 5, 10, 25, 50. Benchmark: BPC with d = 1, n = 999,999, and t = 10 µs (task length 10 µs).]
Very Fine-grained Tasks
[Plot: speedup over sequential execution (0–8) vs. number of workers (0–48) for Steal-one, Steal-half, and Steal-adaptive. Benchmark: SPC with n = 10^6 and t = 1 µs (task length 1 µs).]

for (i = 0; i < N; i++)
    ASYNC(f, i, ...);
Splittable Tasks
[Plot: speedup over sequential execution (0–48) vs. number of workers (0–48) for Steal-adaptive and lazy work splitting. Benchmark: SPC with n = 10^6 and t = 1 µs (task length 1 µs).]

Instead of spawning individual tasks,
for (i = 0; i < N; i++)
    ASYNC(f, i, ...);
the loop becomes a single splittable task:
ASYNC_FOR(f, 0, N, ...);
Lazy Binary Splitting
Assumes concurrent work-stealing deques: a worker splits when its local deque is empty, otherwise it executes tasks sequentially
→ Splitting is lazy as opposed to eager
→ Chunking (granularity control) is implicit
A. Tzannes et al., Lazy Binary Splitting: A Run-Time Adaptive Work-Stealing Scheduler, PPoPP 2010
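A minimal sketch of that deque-based check, assuming a loop task over a half-open range and hypothetical my_deque, deque_is_empty, deque_push, and make_loop_task helpers (f is the loop body).

// Lazy binary splitting (sketch): split only when the local deque is empty,
// i.e. only when a thief could actually need the work.
void run_loop_lbs(int lo, int hi) {
    while (lo < hi) {
        if (deque_is_empty(my_deque) && hi - lo > 1) {
            int mid = lo + (hi - lo) / 2;
            deque_push(my_deque, make_loop_task(mid, hi)); // upper half becomes stealable
            hi = mid;                                      // keep the lower half
        }
        f(lo++);   // execute one iteration sequentially
    }
}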
Lazy Binary Splitting
[Diagram: when the worker splits, it makes 1/2 of the remaining iterations available to thieves.]
Lazy Guided Splitting (example: four workers)
[Diagram: the worker makes 3/4 of the remaining iterations available to thieves.]
Lazy Adaptive Splitting (example: two workers are idle)
[Diagram: the worker makes 2/3 of the remaining iterations available to thieves.]
Lazy Splitting
Neither strategy is truly lazy
Difficult to know which strategy works best → Explicit communication solves this problem

Lazy Adaptive Splitting (example: a worker receives two steal requests)
[Diagram: the worker splits off 1/3 of the remaining iterations for each of the two thieves and keeps 1/3 for itself; see the sketch below.]
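A sketch of splitting driven by steal requests, which makes splitting truly lazy: the worker only splits when requests have actually arrived, and it splits off one share per pending thief. The helpers (steal_requests_pending, make_loop_task, steal_req_chan, ID, f) are assumptions carried over from the earlier sketches.

typedef struct { int lo, hi; } LoopTask;   // a splittable loop task over [lo, hi)

void run_splittable(LoopTask *task) {
    while (task->lo < task->hi) {
        int pending = steal_requests_pending();   // requests waiting in this worker's MPSC channel
        int remaining = task->hi - task->lo;
        if (pending > 0 && remaining > pending) {
            // Keep 1/(pending+1) of the remaining iterations and send the rest
            // to the thieves in equal shares (two requests -> 1/3 each).
            int share = remaining / (pending + 1);
            for (int k = 0; k < pending; k++) {
                struct steal_request req;
                channel_receive(steal_req_chan[ID], &req, sizeof(req));
                Task *piece = make_loop_task(task->hi - share, task->hi);
                task->hi -= share;
                channel_send(req.chan, &piece, sizeof(piece));
            }
        }
        f(task->lo++);   // execute one iteration sequentially
    }
}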
Performance
Lazy Splitting
[Bar plot: speedup over sequential execution (0–48) for binary, guided, and adaptive splitting. Benchmark: parallel loops of Fine (FG), Coarse (CG), Random (RG), Increasing (IG), and Decreasing (DG) granularity; FG and CG are balanced, RG, IG, and DG are unbalanced.]
Performance
Mixing tasks and splittable tasks
[Plot: speedup over sequential execution (0–48) vs. size of splittable consumer tasks (9, 99, 999, 9999, 99999), comparing Cilk Plus with binary, guided, and adaptive splitting. Benchmark: BPC with d = [10^4, 10^3, …, 1], n = [9, 99, …, 99999], and t = 10 µs.]
Performance
Mixing tasks and splittable tasks
[Plot: speedup over sequential execution (0–48) vs. size of splittable consumer tasks (9, 99, 999, 9999, 99999), comparing single tasks with binary, guided, and adaptive splitting. Benchmark: BPC with d = [10^5, 10^4, …, 10], n = [9, 99, …, 99999], and t = 1 µs.]
Conclusion
Performance Ranking
Average deviations from the best median speedups across 21 benchmarks/workloads (20 in the case of Cilk Plus):

2-socket Intel Xeon (24 threads): 1. Chase-Lev WS -1.6 % | 2. Channel WS -2.4 % | 3. Cilk Plus -4.6 % | 4. Intel OpenMP -10.7 %
4-socket AMD Opteron (48 threads): 1. Chase-Lev WS -2.2 % | 2. Channel WS -2.4 % | 3. Cilk Plus -6.9 % | 4. Intel OpenMP -21.8 %
60-core Intel Xeon Phi (240 threads): 1. Chase-Lev WS -13.7 % | 2. Channel WS -13.7 % | 3. Intel OpenMP -22.2 % | 4. Cilk Plus -28.1 %
David Chase and Yossi Lev, Dynamic Circular Work-Stealing Deque, SPAA 2005
Summary
• Work-stealing runtime system with
  • private deques
  • channel communication
• Workers
  • forward steal requests
  • adapt their stealing strategy
  • split tasks lazily
Flexibility ✔
Performance ✔