embracing explicit communication in work- stealing runtime...

80
Embracing Explicit Communication in Work- Stealing Runtime Systems Andreas Prell

Upload: others

Post on 10-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Embracing Explicit Communication in Work-

Stealing Runtime SystemsAndreas Prell

Page 2: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Manycore Processors• “Cluster-on-chip” architectures

• Increasing thread- and data-level parallelism

• Growing importance of scalable communication

Left: S. Bell et al.,TILE64TM Processor: A 64-Core SoC with Mesh Interconnect, ISSCC 2008 Center: T. Mattson et al., The 48-Core SCC Processor: The Programmer’s View, SC 2010 Right: A. Sodani et al., Knights Landing: Second-Generation Intel Xeon Phi Product, IEEE Micro 2016

Page 3: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

From Threads to TasksMake it easy to express fine-grained task parallelism

int recurse(int n) { if (n < 2) return base_case();

int x;

std::thread t([&] { x = recurse(n-1); });

int y = recurse(n-2); t.join();

return x + y; } Threads

Page 4: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

From Threads to Tasksint recurse(int n) { if (n < 2) return base_case();

int x = spawn recurse(n-1); int y = recurse(n-2);

sync;

return x + y; }

Make it easy to express fine-grained task parallelism

Tasks

Page 5: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Task Pool

Thread Pool

From Threads to TasksRuntime system manages parallel execution

:::

:::

:::

:::

Page 6: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Central versus Distributed Task Pools

0

5

10

15

20

25

0 8 16 24 32 40 48

Spee

dup

over

seq

. exe

cutio

n

Number of threads

GCC 4.9.1ICC 14.0.1

GCC: GNU libgomp, ICC: Intel OpenMP RTL

Benchmark: UTS T3L (binomial tree of ~111 million nodes)

223x}

Page 7: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Load Balancing through Work Stealing

W1 W2 W3

A

Page 8: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

B

Load Balancing through Work Stealing

W1 W2 W3

push B

A

B

Page 9: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Load Balancing through Work Stealing

W1 W2 W3

push C

C

B

A

C

Page 10: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Load Balancing through Work Stealing

W1 W2 W3

pop

B

A

Page 11: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Load Balancing through Work Stealing

W1 W2 W3

B

A

pop

Page 12: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Load Balancing through Work Stealing

W1 W2 W3

B

A

pop

Page 13: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Load Balancing through Work Stealing

W1 W2 W3

B

A

steal

Page 14: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Load Balancing through Work Stealing

W1 W2 W3

B

A

Page 15: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Load Balancing through Work Stealing

W1 W2 W3

B

AShared deques

Page 16: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Load Balancing through Work Stealing

W1 W2 W3

B

APrivate deques

Page 17: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Load Balancing through Work Stealing

W1 W2 W3

B

A

pop

Private deques

Page 18: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Steal Request

Load Balancing through Work Stealing

W1 W2 W3

B

APrivate deques

Page 19: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Load Balancing through Work Stealing

W1 W2 W3

B

APrivate deques

steal

Page 20: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Load Balancing through Work Stealing

W1 W2 W3

B

Private deques

steal

A

Page 21: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Load Balancing through Work Stealing

W1 W2 W3

B

Private deques

Page 22: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Embracing Explicit Communication

Page 23: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Requirements• Work-stealing scheduling

• Explicit communication → Efficient message passing, private-access deques

• Task synchronization → Collective, individual

• Coarse-grained parallelism → Polling

• Fine-grained parallelism → Adaptive stealing strategies, granularity control

Page 24: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

ChannelsSimple message passing abstraction: bool channel_send(Channel *, void *, size_t);

bool channel_receive(Channel *, void *, size_t);

Bounded FIFO message queues

Building blocks: MPSC SPSC

Page 25: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

struct steal_request { Channel *chan; int thief; // ... };

Steal Requests

Page 26: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

struct steal_request req = {

.chan = ,

.thief = ID,

// ...

};

int i = select_victim();

channel_send( [i], &req, sizeof(req));

Steal Requests

MPSC

SPSC

Two channels per worker

Page 27: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Steal Request

Steal Requests

W1 W2 W3

Page 28: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

ACK: Nothing

Steal Requests

W1 W2 W3

Page 29: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Steal Request

Steal Requests

W1 W2 W3

Page 30: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Steal Requests

W1 W2 W3

Page 31: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Steal Request

Steal Requests

W1 W2 W3

Idea: Eliminate ACKs by forwarding steal requests

Page 32: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Steal Request

Steal Requests

W1 W2 W3

Idea: Eliminate ACKs by forwarding steal requests

Page 33: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Steal Requests

W1 W2 W3

Idea: Eliminate ACKs by forwarding steal requests

Page 34: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Forwarding

• reduces number of messages

• facilitates asynchronous stealing

• improves performance

Steal Requests

Benchmark: BPC with d = 105, n = 9, and t as shown (x-axis)

400

500

600

700

800

900

1000

1100

0 1 2 3 4 5 6 7 8 9 10

Exec

utio

n tim

e (m

s)

Task length (µs)

AcknowledgingForwarding

Page 35: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

H

Stealing Tasks

T

One task

T’

Send T or &T to thief

Page 36: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

H

Stealing Tasks

T

One task

T’

Send T or &T to thief Share memory by communicating*

*A. Gerrand, https://blog.golang.org/share-memory-by-communicating, 13 July 2010

Page 37: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

H

Stealing Tasks

T

tasks

Send &H’ to thief

bn/2c

T’ H’

Page 38: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Stealing Tasks

0

8

16

24

32

40

48

0 8 16 24 32 40 48

Spee

dup

over

seq

. exe

cutio

n

Number of workers

Steal-oneSteal-half

Benchmark: SPC with n = 106 and t = 100 µs

Task length 100 µs

Page 39: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Stealing Tasks

0

8

16

24

32

40

48

0 8 16 24 32 40 48

Spee

dup

over

seq

. exe

cutio

n

Number of workers

Steal-oneSteal-half

Benchmark: SPC with n = 106 and t = 10 µs

Task length 10 µs

Page 40: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Task Synchronization

Page 41: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Task Barrier#include <stdio.h> #include "tasking.h"

ASYNC_VOID_DECL ( puts, const char *s, s );

int main(void) { TASKING_INIT();

ASYNC(puts, "Order"); ASYNC(puts, "Undefined");

TASKING_BARRIER();

ASYNC(puts, "Last");

TASKING_EXIT(); return 0; }

#include <stdio.h> #include <omp.h>

int main(void) { #pragma omp parallel { #pragma omp master { #pragma omp task puts("Order"); #pragma omp task puts("Undefined"); } #pragma omp barrier #pragma omp master { #pragma omp task puts("Last"); } } return 0; }

Page 42: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Termination Detection with Steal Requests

Problem: Detect when all workers are idle without resorting to implicit communication

Idea: “Color” steal requests

→ Avoids separate control messages

→ Termination follows from forwarding

Page 43: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Extended Steal Requestsstruct steal_request { Channel *chan; int thief; enum { working, idle, reg_idle } state; // ... };

Page 44: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Extended Steal Requests

struct steal_request { Channel *chan; int thief; enum { working, idle, reg_idle } state; // ... };

working idle reg_idle

Page 45: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Extended Steal Requests

working idle reg_idle

// Manager receives req switch (req.state) { case idle: // Mark reg_idle break; }

Page 46: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Steal Request

(reg_idle)

Notifying the Manager

W1 W2 Manager

Page 47: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Update!

Notifying the Manager

W1 W2 Manager

Page 48: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Notifying the Manager

W1 W2 Manager

Page 49: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Steal request

UpdatesWorker i

Worker j j2

i3

Manager

m2 m3

i1

m1

Update

Steal request

i2

j1

Task

(reg_idle)

struct steal_request { Channel *chan; int thief; enum { working, idle, reg_idle, update } state; // ... };

// Manager receives req switch (req.state) { case update: // ... break; case idle: // ... break; }

Page 50: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Task BarrierImpact of explicit communication

0

1000

2000

3000

4000

5000

6000

60 120 180 240

Max

. tas

k ba

rrier

late

ncy

(µs)

Number of workers N

Return steal request after N attemptsReturn steal request after 10 attempts

Page 51: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Task BarrierImpact of explicit communication

0

20

40

60

80

100

60 120 180 240

Min

. tas

k ba

rrier

late

ncy

(µs)

Number of workers N

Steal requests + cancel after barrierSteal requests + worker 0 as manager

Intel OpenMP barrier

Worker 0 cancels a random worker’s steal request

No additional communication required

Page 52: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Channel-based Futuresint x = spawn f(n-1); int y = f(n-2); sync;

future fx = FUTURE(f, n-1); int y = f(n-2); int x = AWAIT(fx, int);

Page 53: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

future fx = FUTURE(f, n-1); int y = f(n-2); int x = AWAIT(fx, int);

Channel-based Futures

1. Allocates a one-element SPSC channel 2. Creates a task, passing the channel

3. Returns a handle to the channel (future)

Page 54: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

future fx = FUTURE(f, n-1); int y = f(n-2); int x = AWAIT(fx, int);

Channel-based Futures

1. Waits for the task to send its result 2. Receives and returns the result

3. Frees or recycles the channel

Page 55: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

future fx = FUTURE(f, n-1); int y = f(n-2); int x = AWAIT(fx, int);

Channel-based Futures

Tries to schedule other work to avoid idling

Page 56: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Performance

0

8

16

24

32

40

48

0 8 16 24 32 40 48

Spee

dup

over

seq

. exe

cutio

n

Number of workers

Futures w/ cachingCilk Plus

0

8

16

24

32

40

48

0 8 16 24 32 40 48

Spee

dup

over

seq

. exe

cutio

n

Number of workers

Futures w/ cachingCilk Plus

0

8

16

24

32

40

48

0 8 16 24 32 40 48

Spee

dup

over

seq

. exe

cutio

n

Number of workers

Futures w/ cachingCilk Plus

Fork/join parallelism

Benchmarks from left to right: Tree recursion with n = 34 and t = 1 µs | 14 Queens Problem | Cilksort of 108 integers

Page 57: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Adaptive Strategies

Page 58: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Adaptive Stealing

Page 59: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Adaptive Stealing

Steal-one Steal-half

Idea: Reevaluate strategy after stealsN

Count how many tasks have been executed:

M/N = 1

M/N < 2

M/N > 1 M/N � 2

M

Page 60: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Adaptive Stealing

Benchmark: BPC with d = 105, n = 9, and t = 10 µs

Task length 10 µs

600

800

1000

1200

1400

1600

1800

8 16 24 32 40 48

Exec

utio

n tim

e (m

s)

Number of workers

Steal-halfSteal-adaptive N = 3Steal-adaptive N = 5

Steal-adaptive N = 10Steal-adaptive N = 25Steal-adaptive N = 50

Steal-one

Page 61: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Adaptive Stealing

Benchmark: BPC with d = 1, n = 999,999, and t = 10 µs

Task length 10 µs

0

500

1000

1500

2000

2500

8 16 24 32 40 48

Exec

utio

n tim

e (m

s)

Number of workers

Steal-oneSteal-adaptive N = 3Steal-adaptive N = 5

Steal-adaptive N = 10Steal-adaptive N = 25Steal-adaptive N = 50

Steal-half

Page 62: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Very Fine-grained Tasks

Benchmark: SPC with n = 106 and t = 1 µs

0 1 2 3 4 5 6 7 8

0 8 16 24 32 40 48

Spee

dup

over

seq

. exe

cutio

n

Number of workers

Steal-oneSteal-half

Steal-adaptive

Task length 1 µs

for (i = 0; i < N; i++) ASYNC(f, i, ...);

Page 63: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Splittable Tasks

Benchmark: SPC with n = 106 and t = 1 µs

Task length 1 µs

0

8

16

24

32

40

48

0 8 16 24 32 40 48

Spee

dup

over

seq

. exe

cutio

n

Number of workers

Steal-adaptiveLazy work splitting

for (i = 0; i < N; i++) ASYNC(f, i, ...);

ASYNC_FOR ( f, 0, N, ... );

Page 64: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Lazy Binary SplittingAssumes concurrent work-stealing deques: Worker splits when local deque is empty, otherwise executes tasks sequentially

→ Splitting is lazy as opposed to eager

→ Chunking (granularity control) is implicit

A. Tzannes et al., Lazy Binary Splitting: A Run-Time Adaptive Work-Stealing Scheduler, PPoPP 2010

Page 65: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Lazy Binary Splitting

A. Tzannes et al., Lazy Binary Splitting: A Run-Time Adaptive Work-Stealing Scheduler, PPoPP 2010

Page 66: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Lazy Binary Splitting

A. Tzannes et al., Lazy Binary Splitting: A Run-Time Adaptive Work-Stealing Scheduler, PPoPP 2010

Available to thieves

1/2

Page 67: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Lazy Guided SplittingExample: Four workers

Page 68: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Lazy Guided SplittingExample: Four workers

Available to thieves

3/4

Page 69: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Lazy Adaptive SplittingExample: Two workers are idle

Page 70: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Lazy Adaptive SplittingExample: Two workers are idle

Available to thieves

2/3

Page 71: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Lazy Splitting

Neither strategy is truly lazy

Difficult to know which strategy works best → Explicit communication solves this problem

Page 72: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Lazy Adaptive SplittingExample: Worker receives two steal requests

Page 73: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Lazy Adaptive SplittingExample: Worker receives two steal requests

Page 74: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Lazy Adaptive SplittingExample: Worker receives two steal requests

Send to thieves

1/31/3

Page 75: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

PerformanceLazy Splitting

Benchmark: Parallel loops of Fine, Coarse, Random, Increasing, and Decreasing Granularity

|{z}Balanced

|{z}Unbalanced

0

8

16

24

32

40

48

FG CG RG IG DG

Spee

dup

over

seq

. exe

cutio

nBinary splitting

Guided splittingAdaptive splitting

Page 76: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

PerformanceMixing tasks and splittable tasks

0

8

16

24

32

40

48

9 99 999 9999 99999Spee

dup

over

seq

. exe

cutio

n

Size of splittable consumer tasks

Cilk PlusBinary splitting

Guided splittingAdaptive splitting

Benchmark: BPC with d = [104, 103, …, 1], n = [9, 99, …, 99999], and t = 10 µs

Page 77: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

PerformanceMixing tasks and splittable tasks

Benchmark: BPC with d = [105, 104, …, 10], n = [9, 99, …, 99999], and t = 1 µs

0

8

16

24

32

40

48

9 99 999 9999 99999Spee

dup

over

seq

. exe

cutio

n

Size of splittable consumer tasks

Single tasksBinary splitting

Guided splittingAdaptive splitting

Page 78: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Conclusion

Page 79: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Performance Ranking

2-socket Intel Xeon (24 threads)

1. Chase-Lev WS -1.6 %

2. Channel WS -2.4 %3. Cilk Plus -4.6 %4. Intel OpenMP -10.7 %

4-socket AMD Opteron(48 threads)

1. Chase-Lev WS -2.2 %

2. Channel WS -2.4 %3. Cilk Plus -6.9 %4. Intel OpenMP -21.8 %

60-core Intel Xeon Phi (240 threads)

1. Chase-Lev WS -13.7 %

2. Channel WS -13.7 %3. Intel OpenMP -22.2 %4. Cilk Plus -28.1 %

Average deviations from the best median speedups

21 benchmarks/workloads (20 in the case of Cilk Plus)

David Chase and Yossi Lev, Dynamic Circular Work-Stealing Deque, SPAA 2005

Page 80: Embracing Explicit Communication in Work- Stealing Runtime ...aprell.github.io/papers/thesis_slides.pdf · Manycore Processors • “Cluster-on-chip” architectures • Increasing

Summary• Work-stealing runtime system with

• private deques

• channel communication

• Workers

• forward steal requests

• adapt their stealing strategy

• split tasks lazily

Flexibility ✔

Performance ✔

}

}