
Page 1: Mit cilk

Intel XTRL / USA 1

The Era of Multicore Is Here

Processor                      Type                          Price     Number of Cores
Dell Inspiron R15              Intel Core i3 370M 2.4GHz     $649.99   2
Dell Inspiron N5030            Intel Pentium T4500 2.30GHz   $479.99   2
Lenovo IdeaPad Y560            Intel Core i7 740QM 1.73GHz   $849.99   4
ASUS G Series G73JW-XN1        Intel Core i7 740QM 1.73GHz   $1449.99  4
MSI CR620-691US                Intel Core i3 380M 2.53GHz    $599.99   2
Toshiba Satellite L675D-S7102  AMD Athlon II P360 2.30GHz    $599.99   2

Source: www.newegg.com

Page 2: Mit cilk

Intel XTRL / USA 2

Multicore Architecture*

[Figure: a chip multiprocessor (CMP) with processor cores (P), each with a private cache ($), connected by an on-chip network to shared memory.]

*The first non-embedded multicore microprocessor was the Power4 from IBM (2001).

Page 3: Mit cilk

Intel XTRL / USA 3

Concurrency Platforms

Operating System

Concurrency Platform

User Application

A concurrency platform that provides linguistic support and handles load balancing can ease the task of parallel programming.

Page 4: Mit cilk

Using Memory Mapping to Support Cactus Stacks in

Work-Stealing Runtime Systems

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology

I-Ting Angelina Lee

March 22, Intel XTRL / USA

Joint work with Silas Boyd-Wickizer, Zhiyi Huang, and Charles Leiserson


Page 6: Mit cilk

Intel XTRL / USA 6

Three Desirable Criteria

Serial-Parallel Reciprocity: interoperability with serial code, including binaries.

Good Performance: ample parallelism ⇒ linear speedup.

Bounded Stack Space: reasonable space usage compared to serial execution.

Page 7: Mit cilk

Intel XTRL / USA 7

Various Strategies

Strategy | SP Reciprocity | Time Bound | Space Bound

1. Recompile Everything

2. One Stack Per Worker

3. Limited-Depth Stacks

4. Depth-Restricted Stealing

5. New Stack When Needed

6. Recycle Ancestor Stacks

7. TLMM Cactus Stacks

(Annotations in the original table mark which of these strategies Cilk++, TBB, and Cilk Plus adopt.)

The Cactus-Stack Problem: how to satisfy all three criteria simultaneously.

Page 8: Mit cilk

Intel XTRL / USA 8

The Cactus-Stack Problem

CustomerEngineer

Space Usage Performance

SP Reciprocity

Page 9: Mit cilk

Intel XTRL / USA 9

The Cactus-Stack Problem

Parallelize my software?

Space Usage Performance

SP Reciprocity

Page 10: Mit cilk

Intel XTRL / USA 10

The Cactus-Stack Problem

Sure! Use my concurrency platform!

Space Usage Performance

SP Reciprocity

Page 11: Mit cilk

Intel XTRL / USA 11

The Cactus-Stack Problem

Sure! Use my concurrency platform!

Space Usage Performance

SP Reciprocity

Page 12: Mit cilk

Intel XTRL / USA 12

The Cactus-Stack Problem

Just be sure to recompile your whole codebase.

Space Usage Performance

Page 13: Mit cilk

Intel XTRL / USA 13

The Cactus-Stack Problem

Hm … I use third party binaries …

Space Usage Performance

Page 14: Mit cilk

Intel XTRL / USA 14

The Cactus-Stack Problem

*Sigh*. Ok fine.

Space Usage Performance

SP Reciprocity

Page 15: Mit cilk

Intel XTRL / USA 15

The Cactus-Stack Problem

Upgrade your RAM then …

Performance

SP Reciprocity

Page 16: Mit cilk

Intel XTRL / USA 16

The Cactus-Stack Problem

… you are gonna need extra memory.

Performance

SP Reciprocity

Page 17: Mit cilk

Intel XTRL / USA 17

The Cactus-Stack Problem

… no?

Performance

SP Reciprocity

Page 18: Mit cilk

Intel XTRL / USA 18

The Cactus-Stack Problem

Space Usage Performance

SP Reciprocity

… no?

Page 19: Mit cilk

Intel XTRL / USA 19

The Cactus-Stack Problem


Well … you didn’t say you wanted any performance guarantee, did you?

Space Usage

SP Reciprocity

Page 20: Mit cilk

Intel XTRL / USA 20

The Cactus-Stack Problem


Gee … I can get that just by running serially.

Space Usage

SP Reciprocity

Page 21: Mit cilk

Intel XTRL / USA 21

The Cactus-Stack Problem

Serial-Parallel Reciprocity: interoperability with serial code, including binaries.

Good Performance: ample parallelism ⇒ linear speedup.

Bounded Stack Space: reasonable space usage compared to serial execution.

Page 22: Mit cilk

Intel XTRL / USA 22

Legacy Linear Stack

[Figure: invocation tree (A invokes B and C; C invokes D and E) alongside the views of the stack seen by A, B, C, D, and E.]

An execution of a serial Algol-like language can be viewed as a serial walk of an invocation tree.

Page 23: Mit cilk

Intel XTRL / USA 23

Legacy Linear Stack

[Figure: the same invocation tree and views of the stack.]

Rule for pointers: A parent can pass pointers to its stack variables down to its children, but not the other way around.

Page 24: Mit cilk

Intel XTRL / USA 24

Legacy Linear Stack — 1960*

[Figure: the same invocation tree and views of the stack.]

Rule for pointers: A parent can pass pointers to its stack variables down to its children, but not the other way around.

* Stack-based space management for recursive subroutines developed with compilers for Algol 60.

Page 25: Mit cilk

Intel XTRL / USA 25

Cactus Stack — 1968*

[Figure: the same invocation tree and views of the stack.]

A cactus stack supports multiple views in parallel.

* Cactus stacks were supported directly in hardware by the Burroughs B6500 / B7500 computers.

Page 26: Mit cilk

Intel XTRL / USA 26

Heap-Based Cactus Stack

[Figure: frames A, B, C, D, E allocated off the heap, linked parent to child.]

A heap-based cactus stack allocates frames off the heap.

Mesa (1979), Ada (1979), Cedar (1986), MultiLisp (1985), Mul-T (1989), Id (1991), pH (1995), and more use this strategy.

Page 27: Mit cilk

Intel XTRL / USA 27

Modern Concurrency Platforms

Cilk++ (Intel), Cilk-5 (MIT), Cilk-M (MIT), Cilk Plus (Intel), Fortress (Oracle Labs), Habanero (Rice), JCilk (MIT), OpenMP, StreamIt (MIT), Task Parallel Library (Microsoft), Threading Building Blocks (Intel), X10 (IBM), …

Page 28: Mit cilk

Intel XTRL / USA 28

Heap-Based Cactus Stack

[Figure: frames A, B, C, D, E allocated off the heap.]

A heap-based cactus stack allocates frames off the heap.

MIT Cilk-5 (1998) and Intel Cilk++ (2009) use this strategy as well.

Good time and space bounds can be obtained …

Page 29: Mit cilk

Intel XTRL / USA 29

Heap-Based Cactus Stack

[Figure: frames allocated off the heap.]

Heap linkage: call/return via frames in the heap.

With heap linkage, parallel functions fail to interoperate with legacy serial code.

Page 30: Mit cilk

Intel XTRL / USA 30

Various Strategies

Strategy | SP Reciprocity | Time Bound | Space Bound

1. Recompile Everything

2. One Stack Per Worker

3. Limited-Depth Stacks

4. Depth-Restricted Stealing

5. New Stack When Needed

6. Recycle Ancestor Stacks

7. TLMM Cactus Stacks

The main constraint: once allocated, a frame’s location in virtual address space cannot change.

Page 31: Mit cilk

Intel XTRL / USA 31

Outline

Cilk-M:
  The Cactus-Stack Problem
  Cilk-M Overview
  Cilk-M’s Work-Stealing Scheduler
  TLMM-Based Cactus Stacks
  The Analysis of Cilk-M
  OS Support for TLMM
Survey of My Other Work
Direction for Future Work

Page 32: Mit cilk

Intel XTRL / USA 32

The Cilk Programming Model

int fib(int n) {
    if (n < 2) { return n; }
    int x = spawn fib(n-1);
    int y = fib(n-2);
    sync;
    return (x + y);
}

Control cannot pass this point until all spawned children have returned.

Cilk keywords grant permission for parallel execution. They do not command parallel execution.

The named child function may execute in parallel with the continuation of its parent.

Page 33: Mit cilk

Intel XTRL / USA 33

Cilk-M

A work-stealing runtime system based on Cilk that solves the cactus-stack problem by thread-local memory mapping (TLMM).

Page 34: Mit cilk

Intel XTRL / USA 34

Cilk-M Overview

Thread-local memory mapped (TLMM) region:

A virtual-address range in which each thread can map physical memory independently.

[Figure: virtual address space, from high to low addresses — stack, TLMM region, heap, uninitialized data (bss), initialized data, code; everything but the TLMM region is shared.]

Idea: Allocate the stacks for each worker in the TLMM region.

Page 35: Mit cilk

Intel XTRL / USA 35

Basic Cilk-M Idea

Unreasonable simplification: Assume that we can map with arbitrary granularity.

[Figure: invocation tree A→{B, C}, C→{D, E}; workers P1, P2, P3 each have a TLMM stack at 0x7f000 — P1 holds A (x: 42) and B (y: &x); P2 holds A, C (y: &x), and D; P3 holds A, C, and E.]

Workers achieve sharing by mapping the same physical memory at the same virtual address.

Page 36: Mit cilk

Intel XTRL / USA 36

Cilk Guarantees with a Heap-Based Cactus Stack

Time bound: TP = T1/P + O(T∞) ⇒ linear speedup when P ≪ T1/T∞.

Space bound: SP/P ≤ S1.

Does not support SP reciprocity.

Definitions: TP — execution time on P processors; T1 — work; T∞ — span; T1/T∞ — parallelism; SP — stack space on P processors; S1 — stack space of a serial execution.

Page 37: Mit cilk

Intel XTRL / USA 37

Cilk Depth

Cilk depth is the maximum number of Cilk functions nested on the stack during a serial execution.

[Figure: an invocation tree of Cilk functions A–G mixing spawned and called children.]

Cilk depth (3) is not the same as spawn depth (2).

Page 38: Mit cilk

Intel XTRL / USA 38

Cilk-M Guarantees

Time bound: TP = T1/P + O((S1+D)·T∞) ⇒ linear speedup when P ≪ T1 / ((S1+D)·T∞).

Space bound: SP/P ≤ S1+D, where S1 is measured in pages.

SP reciprocity: no longer need to distinguish function types — whether a function runs in parallel is dictated only by how it is invoked (spawn vs. call).

Definitions: TP — execution time on P processors; T1 — work; T∞ — span; T1/T∞ — parallelism; SP — stack space on P processors; S1 — stack space of a serial execution; D — Cilk depth.

Page 39: Mit cilk

Intel XTRL / USA 39

We implemented a prototype Cilk-M runtime system based on the open-source Cilk-5 runtime system.

We modified the open-source Linux kernel (2.6.29 running on x86 64-bit CPU’s) to provide support for TLMM (~600 lines of code).

We have ported the runtime system to work with Intel’s Cilk Plus compiler in place of the native Cilk Plus runtime.

System Overview

Page 40: Mit cilk

Intel XTRL / USA 40

Performance Comparison

Time bound: TP = T1/P + C·T∞, where C = O(S1+D).

Machine: AMD 4 × quad-core 2GHz Opteron, 64KB private L1, 512KB private L2, 2MB shared L3.

[Chart: Cilk-M running time / Cilk Plus running time (y-axis 0–1.2) for cholesky, cilksort, fft, fib, fib_weird, heat, lu, matmul, nqueens, qsort, rectmul, and strassen.]

Page 41: Mit cilk

Intel XTRL / USA 41

Space Usage

Benchmark   D    S1   S16/16   (S16/16)/S1   S1+D
cholesky    12   3    3.44     1.15          15
cilksort    18   3    3.63     1.21          22
fft         22   6    4.81     0.80          28
fib         43   4    4.44     1.11          47
fib_weird   281  22   18.63    0.85          303
heat        10   2    2.75     1.38          12
lu          10   2    3.43     1.72          38
matmul      22   3    4.00     1.33          12
nqueen      16   3    3.38     1.13          25
qsort       72   6    6.31     1.05          19
rectmul     27   4    4.75     1.19          31
strassen    8    2    3.50     1.75          10

Space bound: SP/P ≤ S1+D

Page 42: Mit cilk

Intel XTRL / USA 42

Outline

Cilk-M:
  The Cactus-Stack Problem
  Cilk-M Overview
  Cilk-M’s Work-Stealing Scheduler
  TLMM-Based Cactus Stacks
  The Analysis of Cilk-M
  OS Support for TLMM
Survey of My Other Work
Direction for Future Work

Page 43: Mit cilk

Intel XTRL / USA 43

Cilk-M’s Work-Stealing Scheduler

Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98].

[Figure: four workers P, each with a deque of spawn/call frames.]

Page 44: Mit cilk

Intel XTRL / USA 44

call! [Animation: the calling worker pushes the new frame onto the bottom of its deque.]

Page 45: Mit cilk

Intel XTRL / USA 45

spawn! [Animation: the spawning worker pushes the new frame onto the bottom of its deque.]

Page 46: Mit cilk

Intel XTRL / USA 46

spawn! call! spawn! [Animation: several workers push frames onto the bottoms of their own deques.]

Page 47: Mit cilk

Intel XTRL / USA 47

return! [Animation: a returning worker pops the frame from the bottom of its deque.]

Page 48: Mit cilk

Intel XTRL / USA 48

steal! When a worker runs out of work, it steals from the top of a random victim’s deque. [Animation: an idle worker takes the topmost frame of a victim’s deque.]

Page 49: Mit cilk

Intel XTRL / USA 49

spawn! [Animation: the thief resumes the stolen frame and continues spawning.]

Page 50: Mit cilk

Intel XTRL / USA 50

Theorem [BL94]: With sufficient parallelism, workers steal infrequently ⇒ linear speedup.

Page 51: Mit cilk

Intel XTRL / USA 51

Outline

Cilk-M:
  The Cactus-Stack Problem
  Cilk-M Overview
  Cilk-M’s Work-Stealing Scheduler
  TLMM-Based Cactus Stacks
  The Analysis of Cilk-M
  OS Support for TLMM
Survey of My Other Work
Direction for Future Work

Page 52: Mit cilk

Intel XTRL / USA 52

TLMM-Based Cactus Stacks

Unreasonable simplification: Assume that we can map with arbitrary granularity.

[Animation: P1 runs A (x: 42) and B (y: &x) at 0x7f000.]

Use a standard linear stack in virtual memory.

Page 53: Mit cilk

Intel XTRL / USA 53

[Animation: P2 steals A — the stolen prefix is mapped, not copied, into P2’s TLMM region at the same virtual addresses.]

Map (not copy) the stolen prefix to the same virtual addresses.

Page 54: Mit cilk

Intel XTRL / USA 54

TLMM-Based Cactus Stacks

[Animation: P2’s subsequent work (C: y = &x) grows downward below the mapped prefix.]

Subsequent spawns and calls grow downward in the thief’s TLMM region.

Page 55: Mit cilk

Intel XTRL / USA 55

TLMM-Based Cactus Stacks

[Animation: P1 and P2 both map A at 0x7f000.]

Both workers see the same virtual address value for &x.

Page 56: Mit cilk

Intel XTRL / USA 56

TLMM-Based Cactus Stacks

[Animation: P2 calls D below C.]

Both workers see the same virtual address value for &x.

Page 57: Mit cilk

Intel XTRL / USA 57

TLMM-Based Cactus Stacks

[Animation: P3 steals C — again the stolen prefix (A, C) is mapped, not copied.]

Map (not copy) the stolen prefix to the same virtual addresses.

Page 58: Mit cilk

Intel XTRL / USA 58

TLMM-Based Cactus Stacks

[Animation: P3’s subsequent work (E: z = &x) grows downward below the mapped prefix.]

Subsequent spawns and calls grow downward in the thief’s TLMM region.

Page 59: Mit cilk

Intel XTRL / USA 59

TLMM-Based Cactus Stacks

[Animation: P1, P2, and P3 all map the shared prefix at identical addresses.]

All workers see the same virtual address value for &x.

Page 60: Mit cilk

Intel XTRL / USA 60

Handling Page Granularity

[Figure: worker stacks mapped at page granularity — pages at 0x7d000, 0x7e000, 0x7f000; P1 holds A and B.]

Page 61: Mit cilk

Intel XTRL / USA 61

Handling Page Granularity

Map the stolen prefix.

[Animation: P2 steals A and maps the page containing A.]

Page 62: Mit cilk

Intel XTRL / USA 62

Handling Page Granularity

Advance the stack pointer ⇒ fragmentation.

[Animation: P2 advances its stack pointer past the page boundary, leaving a fragmented gap.]

Page 63: Mit cilk

Intel XTRL / USA 63

Handling Page Granularity

[Animation: P2 continues with C and D on fresh pages.]

Page 64: Mit cilk

Intel XTRL / USA 64

Handling Page Granularity

[Animation: P3 steals C and maps the pages containing A and C.]

Page 65: Mit cilk

Intel XTRL / USA 65

Handling Page Granularity

Advance the stack pointer again ⇒ additional fragmentation.

[Animation: P3 advances its stack pointer past the next page boundary.]

Page 66: Mit cilk

Intel XTRL / USA 66

Handling Page Granularity

Advance the stack pointer again ⇒ additional fragmentation.

[Animation: P3 continues with E below the stolen prefix.]

Page 67: Mit cilk

Intel XTRL / USA 67

Handling Page Granularity

[Figure: the final stacks of P1, P2, and P3, showing per-steal fragmentation.]

Space-reclaiming heuristic: reset the stack pointer upon successful sync.

Page 68: Mit cilk

Intel XTRL / USA 68

Outline

Cilk-M:
  The Cactus-Stack Problem
  Cilk-M Overview
  Cilk-M’s Work-Stealing Scheduler
  TLMM-Based Cactus Stacks
  The Analysis of Cilk-M
  OS Support for TLMM
Survey of My Other Work
Direction for Future Work

Page 69: Mit cilk

Intel XTRL / USA 69

Space Bound with a Heap-Based Cactus Stack

Theorem [BL94]. Let S1 be the stack space required by a serial execution of a program. The stack space per worker of a P-worker execution using a heap-based cactus stack is at most SP/P ≤ S1.

Proof. The work-stealing algorithm maintains the busy-leaves property: every active leaf frame has a worker executing on it. ■

[Figure: P = 4 workers, each busy on an active leaf of a tree of depth at most S1.]

Page 70: Mit cilk

Intel XTRL / USA 70

Cilk-M Space Bound

Claim. Let S1 be the stack space required by a serial execution of a program, and let D be the Cilk depth. The stack space per worker of a P-worker execution using a TLMM cactus stack is at most SP/P ≤ S1+D.

Proof. The work-stealing algorithm maintains the busy-leaves property: every active leaf frame has a worker executing on it. ■

[Figure: P = 4 workers, each busy on an active leaf of a tree of depth at most S1.]

Page 71: Mit cilk

Intel XTRL / USA 71

Space Usage

Benchmark   D    S1   S16/16   (S16/16)/S1   S1+D
cholesky    12   3    3.44     1.15          15
cilksort    18   3    3.63     1.21          22
fft         22   6    4.81     0.80          28
fib         43   4    4.44     1.11          47
fib_weird   281  22   18.63    0.85          303
heat        10   2    2.75     1.38          12
lu          10   2    3.43     1.72          38
matmul      22   3    4.00     1.33          12
nqueen      16   3    3.38     1.13          25
qsort       72   6    6.31     1.05          19
rectmul     27   4    4.75     1.19          31
strassen    8    2    3.50     1.75          10

Space bound: SP/P ≤ S1+D

Page 72: Mit cilk

Intel XTRL / USA 72

Performance Bound with a Heap-Based Cactus Stack

Theorem [BL94]. A work-stealing scheduler can achieve expected running time TP = T1/P + O(T∞) on P processors.

Definitions: TP — execution time on P processors; T1 — work; T∞ — span; T1/T∞ — parallelism.

Corollary. If the computation exhibits sufficient parallelism (P ≪ T1/T∞), this bound guarantees near-perfect linear speedup (T1/TP ≈ P).

Page 73: Mit cilk

Intel XTRL / USA 73

Cilk-M Performance Bound

Claim. A work-stealing scheduler can achieve expected running time TP = T1/P + C·T∞ on P processors, where C = O(S1+D).

Definitions: TP — execution time on P processors; T1 — work; T∞ — span; T1/T∞ — parallelism; D — Cilk depth.

Corollary. If the computation exhibits sufficient parallelism (P ≪ T1/((S1+D)·T∞)), this bound guarantees near-perfect linear speedup (T1/TP ≈ P).

Page 74: Mit cilk

Intel XTRL / USA 74

Outline

Cilk-M:
  The Cactus-Stack Problem
  Cilk-M Overview
  Cilk-M’s Work-Stealing Scheduler
  TLMM-Based Cactus Stacks
  The Analysis of Cilk-M
  OS Support for TLMM
Survey of My Other Work
Direction for Future Work

Page 75: Mit cilk

Intel XTRL / USA 75

To Be or Not To Be … a Process

A Worker = A Process:
  Every worker has its own page table.
  By default, nothing is shared.
  Nonstack memory must be shared manually (i.e., via mmap).
  User calls to mmap do not work (which may include malloc).

A Worker = A Thread:
  Workers share a single page table.
  By default, everything is shared.
  Reserve a region to be independently mapped.
  User calls to mmap operate properly.

Page 76: Mit cilk

Intel XTRL / USA 76

Page Table for TLMM (Ideally)

[Figure: one page table mixing a shared entry (page 32) with per-thread TLMM entries (page 7 → TLMM 0, page 12 → TLMM 1, page 28 → TLMM 2).]

x86: hardware walks the page table. Each thread has a single root-page directory!

Page 77: Mit cilk

Intel XTRL / USA 77

Support for TLMM

[Figure: thread 0 maps page 7 and thread 1 maps page 12 in their TLMM regions, while both share page 32.]

Must synchronize the root-page directory among threads.

Page 78: Mit cilk

Intel XTRL / USA 78

Limitation of TLMM Cactus Stacks

TLMM does not work for code that requires one thread to see another thread’s stack. E.g., MCS locks [MCS91]:

When a thread A attempts to acquire a mutual-exclusion lock L, it may be blocked if another thread B already owns L.

Rather than spin-waiting on L itself, A adds itself to a queue associated with L and spins on a local variable LA, thereby reducing coherence traffic.

When B releases L, it resets LA, which wakes A up for another attempt to acquire the lock.

If A allocates LA on its stack using TLMM, LA may not be visible to B!

Page 79: Mit cilk

Intel XTRL / USA 79

Cilk-M Summary

Cilk-M is a C-based concurrency platform that satisfies all three criteria simultaneously:
  Serial-parallel reciprocity
  Good performance
  Bounded and efficient use of memory for the cactus stack

Cilk-M employs:
  TLMM-based cactus stacks
  OS support for TLMM (~600 lines of code)
  Legacy-compatible linkage

Page 80: Mit cilk

Intel XTRL / USA 80

Outline

Cilk-M
Survey of My Other Work:
  The JCilk Language
  Location-Based Memory Fences
  Ownership-Aware Transactional Memory
Direction for Future Work

Page 81: Mit cilk

Intel XTRL / USA 81

The JCilk Language

[Figure: Java core functionalities combined with parallel constructs from Cilk: spawn & sync.]

Joint work with John Danaher and Charles Leiserson

Page 82: Mit cilk

Intel XTRL / USA 82

The JCilk Language

[Figure: Java core functionalities combined with parallel constructs from Cilk (spawn & sync) and exception handling.]

Joint work with John Danaher and Charles Leiserson

Page 83: Mit cilk

Intel XTRL / USA 83

JCilk provides a faithful extension of Java’s exception mechanism consistent with Cilk’s primitives.

JCilk’s exception semantics include an implicit abort mechanism, which allows speculative parallelism to be expressed succinctly in JCilk.

Other researchers [I91, TY00, BM00] pursued new linguistic mechanisms.

Exception Handling in a Concurrent Context

Page 84: Mit cilk

Intel XTRL / USA 84

The JCilk System

[Figure: Fib.jcilk → JCilk compiler (JCilk to Java + goto) → Fib.jgo → Jgo compiler (GCJ + goto support) → Fib.class → JVM with the JCilk runtime system.]

Page 85: Mit cilk

Intel XTRL / USA 85

JCilk's strategy of integrating multithreading with Java's exception semantics is synergistic – it obviates the need for Cilk’s inlet and abort.

JCilk’s abort mechanism extends Java’s existing exception mechanism in a natural way to propagate an abort, allowing the programmer to clean up.

What We Discovered

Page 86: Mit cilk

Intel XTRL / USA 86

Outline

Cilk-M
Survey of My Other Work:
  The JCilk Language
  Location-Based Memory Fences
  Ownership-Aware Transactional Memory
Direction for Future Work

Page 87: Mit cilk

Intel XTRL / USA 87

Dekker’s Protocol (Simplified)

Initially, L1 = 0 and L2 = 0.

Thread 1:
1 L1 = 1;
2 if (L2 == 0) {
3     /* critical section */
4     ...
5 }
6 L1 = 0;

Thread 2:
1 L2 = 1;
2 if (L1 == 0) {
3     /* critical section */
4     ...
5 }
6 L2 = 0;

Reads may be reordered with older writes.

Page 88: Mit cilk

Intel XTRL / USA 88

Dekker’s Protocol (Simplified)

Initially, L1 = 0 and L2 = 0.

Thread 1:
1 L1 = 1;
2 mfence();
3 if (L2 == 0) {
4     /* critical section */
5     ...
6 }
7 L1 = 0;

Thread 2:
1 L2 = 1;
2 mfence();
3 if (L1 == 0) {
4     /* critical section */
5     ...
6 }
7 L2 = 0;

Memory fences are needed ⇒ stalling.

Page 89: Mit cilk

Intel XTRL / USA 89

Applications of Dekker’s Protocol

The THE protocol used by Cilk’s work-stealing scheduler [FLR98]: the victim vs. the thief.

Java monitors using Quickly Reacquirable Locks or Biased Locking [DMS03, OKK04]: the bias-holding thread vs. a revoker thread.

The JNI reentry barrier in the JVM: a Java mutator thread vs. the garbage collector.

Network packet processing [VNE10]: the owner thread vs. other threads.

These applications exhibit asymmetric synchronization patterns.

Page 90: Mit cilk

Intel XTRL / USA 90

Location-Based Memory Fences
Joint work with Edya Ladan-Mozes and Dmitriy Vyukov

We introduce location-based memory fences, which cause a thread’s instruction stream to serialize only when another thread attempts to access the guarded memory location.

Some applications can benefit from a software implementation [DHY03] that uses interrupts.

A lightweight hardware mechanism can piggyback on the cache-coherence protocol.

Page 91: Mit cilk

Intel XTRL / USA 91

Outline

Cilk-M
Survey of My Other Work:
  The JCilk Language
  Location-Based Memory Fences
  Ownership-Aware Transactional Memory
Direction for Future Work

Page 92: Mit cilk

Intel XTRL / USA 92

Transactional Memory

atomic { // A          atomic { // B
    x++;                   w = x;
}                      }

[Figure: transactions A (Rset: x; Wset: x) and B (Rset: x; Wset: w) accessing shared memory.]

Transactional Memory (TM) [HM93] provides a transactional interface for accessing memory.

Page 93: Mit cilk

Intel XTRL / USA 93

Transactional Memory

[Figure: the same transactions A and B.]

TM guarantees that transactions are serializable [P79].

Page 94: Mit cilk

Intel XTRL / USA 94

Nested Transactions

atomic { // A
    int a = x;
    ...
    atomic { // B
        w++;
    }
    int b = y;
    z = x + y;
}

[Figure: inner transaction B (Rset: w; Wset: w) nested inside outer transaction A (Rset: w, x, y, z; Wset: w, x, y, z).]

Closed nesting: propagate B’s changes to A.

Page 95: Mit cilk

Intel XTRL / USA 95

Nested Transactions

[Figure: the same nested transactions A and B.]

Open nesting: commit B’s changes globally.

Page 96: Mit cilk

Intel XTRL / USA 96

Nested Transactions

                                      Safety   Efficiency
Closed Nesting [M85]                    ✓
Open Nesting [MH06, MAC+06, NMA+07]              ✓

All memory is treated equally – there is only one level of abstraction.

Page 97: Mit cilk

Intel XTRL / USA 97

Ownership-Aware Transactions (OAT)

Ownership-aware transactions is a hybrid between open-nesting and closed-nesting; it provides multiple levels of abstraction.

In OAT, the programmer writes code with transactional modules, and the OAT system uses the concept of ownership types [BLS03] to ensure data encapsulation within a module.

The OAT system guarantees abstract serializability as long as the program conforms to a set of well-defined constraints on how the modules share data.

Joint work with Kunal Agrawal and Jim Sukha

Page 98: Mit cilk

Intel XTRL / USA 98

Outline

Cilk-M
Survey of My Other Work:
  The JCilk Language
  Location-Based Memory Fences
  Ownership-Aware Transactional Memory
Direction for Future Work

Page 99: Mit cilk

Intel XTRL / USA 99

Parallelism Abstraction

Operating System

Concurrency Platform

User Application

A concurrency platform provides a layer of parallelism abstraction to help load balancing and task scheduling.

Page 100: Mit cilk

Intel XTRL / USA 100

Memory Abstraction

A memory abstraction provides a different “view” of a memory location depending on the execution context in which the memory access is made.

TLMM cactus stack: each worker gets its own linear local view of the tree-structured call stack.

Hyperobject [FHLL09]: a linguistic mechanism that supports coordinated local views of the same nonlocal object.

Transactional memory [HM93]: memory accesses dynamically enclosed by an atomic block appear to occur atomically.

Can a concurrency platform also mitigate the complexity of synchronization by providing the right memory abstractions?

Page 101: Mit cilk

Intel XTRL / USA 101

OS / Hardware Support for Memory Abstraction

Recently, researchers have begun to explore ways to enable memory abstractions using the page-mapping / page-protection mechanism:

C# with atomic sections [AHM09] (strong atomicity)
Grace [BYL+09] (deterministic execution)
Sammati [PV10] (deadlock avoidance)
Cilk-M [LSH+10] (TLMM cactus stack)

Can we relax the limitation of manipulating virtual memory at page granularity?

Page 102: Mit cilk

Intel XTRL / USA 102

THANK YOU!

Cilk-M
Survey of My Other Work: The JCilk Language, Location-Based Memory Fences, Ownership-Aware Transactional Memory
Direction for Future Work


Quadratic Stack Growth [Robison08]

Assume one linear stack per worker. P: parallel function; S: serial function; functions are invoked by spawn or call as shown in the figure.

[Figure: a computation of depth d mixing spawned parallel (P) and called serial (S) frames; the pattern repeats d times.]

Page 106: Mit cilk

Intel XTRL / USA 106

Quadratic Stack Growth [Robison08]

[Figure: the same computation, repeated d times.]

The green worker repeatedly blocks, then steals, using Θ(d²) stack space.

Page 107: Mit cilk

Intel XTRL / USA

Performance Comparison

Time bound: TP = T1/P + C·T∞, where C = O(S1+D).

Machine: AMD 4 × quad-core 2GHz Opteron, 64KB private L1, 512KB private L2, 2MB shared L3.

[Chart: Cilk-M running time / Cilk-5 running time (y-axis 0–1.2) for cholesky, cilksort, fft, fib, fib_weird, heat, knapsack, lu, matmul, nqueens, rectmul, and strassen.]

Page 108: Mit cilk

Intel XTRL / USA

Space Usage (Hand-Compiled)

Benchmark   D    S1   S16/16   S1+D
cholesky    12   2    3.19     14
cilksort    18   2    3.19     20
fft         22   4    3.69     26
fib         43   2    3.69     45
fib_weird   281  8    8.44     289
heat        10   2    2.38     12
knapsack    34   2    5.00     36
lu          10   2    3.19     12
matmul      22   2    3.31     24
nqueen      16   2    3.25     18
rectmul     27   2    4.06     29
strassen    8    2    3.13     10

Space bound: SP/P ≤ S1+D

Page 109: Mit cilk

Intel XTRL / USA 109

Space Usage

Benchmark   Cilk-M S16   Cilk-5 S16   Cilk-M H16   Cilk-5 H16   Cilk-M S16+H16   Cilk-5 S16+H16
cholesky    51           16           193          345          244              361
cilksort    51           16           193          265          244              281
fft         60           48           169          1017         229              1065
fib         59           16           169          185          228              201
fib_weird   135          64           217          217          353              281
heat        38           16           209          273          247              289
knapsack    80           16           169          361          249              377
lu          51           16           185          265          236              281
matmul      53           16           169          257          222              273
nqueen      52           16           161          249          213              265
rectmul     65           32           169          240          234              272
strassen    50           16           161          417          211              433

Page 110: Mit cilk

Intel XTRL / USA 110

GCC/Linux C Subroutine Linkage

[Figure: frames for A and B overlapped on the linear stack — args to A; A’s return address; A’s parent’s base pointer; A’s local variables; then the linkage region holding args to B; B’s return address; A’s base pointer; B’s local variables; args to B’s callees. bp and sp mark the current frame.]

The legacy linear stack obtains efficiency by overlapping frames.

Page 111: Mit cilk

Intel XTRL / USA 111

Handling Page Granularity

The thief advances its stack pointer past the next page boundary, reserving space for the linkage region for the next callee.

[Animation: P2 steals A and reserves linkage space past the page boundary.]

Intel XTRL / USA 112

Handling Page Granularity

The thief advances its stack pointer past the next page boundary, reserving space for the linkage region for the next callee.

[Animation: P2 continues with C and D below the reserved linkage region.]

Page 113: Mit cilk

Intel XTRL / USA 113

Handling Page Granularity

[Stack diagram (continued): P3 steals C from P2.]

The thief advances its stack pointer past the next page boundary, reserving space for the linkage region for the next callee.

Page 114: Mit cilk

Intel XTRL / USA 114

Handling Page Granularity

[Stack diagram (continued): frames A, B, C, D, and E distributed across the three worker stacks, each thief's frames starting past a page boundary.]

The thief advances its stack pointer past the next page boundary, reserving space for the linkage region for the next callee.

Page 115: Mit cilk

Intel XTRL / USA 115

Key Invocation Invariants

1. Arguments are passed via the stack pointer with positive offset.
2. Local variables are referenced via the base pointer with negative offset.
3. Live registers are flushed onto the stack immediately before each spawn.
4. Live registers are flushed onto the stack before returning to the runtime if a sync fails.
5. When resuming a stolen function after a spawn or sync, live registers are restored from the stack.
6. When returning from a spawn, the return value is flushed from its register onto the stack.
7. The frame size is fixed before any spawn statements.

Page 116: Mit cilk

Intel XTRL / USA 116

GCC/Linux C Subroutine Linkage

[Stack diagram, frame for A, high addresses at top:

  args to A
  A's return address / A's parent's base ptr   <- bp
  A's local variables
  args to A's callees                          <- sp

Invocation tree: A calls B and C; C calls D and E.]

Legacy linear stacks enable efficient passing of arguments from caller to callee.

Page 117: Mit cilk

Intel XTRL / USA 117

GCC/Linux C Subroutine Linkage

[Stack diagram: frame for A as above; "args to A" occupies the linkage region just above A's return address.]

Frame A accesses its arguments through positive offsets from its base pointer.

Page 118: Mit cilk

Intel XTRL / USA 118

GCC/Linux C Subroutine Linkage

[Stack diagram: frame for A as above; A's local variables sit just below the base pointer.]

Frame A accesses its local variables through negative offsets from its base pointer.

Page 119: Mit cilk

Intel XTRL / USA 119

GCC/Linux C Subroutine Linkage

[Stack diagram: "args to B" placed in the linkage region at the bottom of A's frame.]

Before invoking B, A places the arguments for B into the reserved linkage region it will share with B, which A indexes using positive offsets from its stack pointer.

Page 120: Mit cilk

Intel XTRL / USA 120

GCC/Linux C Subroutine Linkage

[Stack diagram: below "args to B", the call pushes "B's return address"; sp now points to it.]

A then calls B; the call instruction saves B's return address and transfers control to B.

Page 121: Mit cilk

Intel XTRL / USA 121

[Stack diagram: B pushes "A's base pointer" below B's return address; bp moves down to meet sp.]

Upon entry, B saves A's base pointer and sets the base pointer to the current stack pointer.

Page 122: Mit cilk

Intel XTRL / USA 122

GCC/Linux C Subroutine Linkage

[Stack diagram: below the saved base pointer, B's local variables and "args to B's callees"; sp advanced past them.]

B advances the stack pointer to allocate space for its local variables and linkage region.

Page 123: Mit cilk

Intel XTRL / USA 123

GCC/Linux C Subroutine Linkage

[Stack diagram: the completed layout, with frame for A and frame for B overlapping in the shared linkage region ("args to B").]

The legacy linear stack obtains efficiency by overlapping frames.

Page 124: Mit cilk

Intel XTRL / USA 124

Legacy Linear Stack

[Invocation tree: A calls B and C; C calls D and E. Next to it, the linear stack (high addresses at top) holds the frames of the current call path.]

An execution of a serial Algol-like language can be viewed as a serial walk of an invocation tree.

Page 125: Mit cilk

Intel XTRL / USA 125

Legacy Linear Stack

[Invocation tree with the path A → C → E active; x: 42 lives in an ancestor's frame, and a descendant holds y: &x, a pointer passed down.]

Rule for pointers: A parent can pass pointers to its stack variables down to its children, but a child cannot pass a pointer to its stack variable up to its parent.

Page 126: Mit cilk

Intel XTRL / USA 126

Legacy Linear Stack

[Invocation tree with the path A → C → E active; here z: 42 lives in the child's frame, so an ancestor holding y: &z would be left with a dangling pointer once the child returns, which the rule forbids.]

Rule for pointers: A parent can pass pointers to its stack variables down to its children, but a child cannot pass a pointer to its stack variable up to its parent.

Page 127: Mit cilk

Intel XTRL / USA 127

Given n > 0, search for one way to arrange n queens on an n-by-n chessboard so that none attacks another.

[Board diagrams: a legal configuration and an illegal configuration.]

The Queens Problem

Page 128: Mit cilk

Intel XTRL / USA 128

Exploring the Search Tree for Queens

[Search tree: from start, the root branches on row-0 placements (r0,c0 through r0,c3); each node branches on row-1 placements (r1,c0 through r1,c3), then row 2, and so on.]

Serial strategy: depth-first search with backtracking. The search-tree size grows exponentially as n increases.

Page 129: Mit cilk

Intel XTRL / USA 129

Exploring the Search Tree for Queens

[Search tree as in the serial case, with the row-0 subtrees explored in parallel.]

Parallel strategy: spawn the searches in parallel. This is speculative computation: some work may be wasted.

Page 130: Mit cilk

Intel XTRL / USA 130

Exploring the Search Tree for Queens

[Search tree as in the serial case; once a solution is found, the remaining subtrees are aborted.]

Parallel strategy: spawn the searches in parallel, and abort the other parallel searches once a solution is found.

Page 131: Mit cilk

Intel XTRL / USA 131

Various Parallel Programming Models

Page 132: Mit cilk

Intel XTRL / USA 132

class SAT_Solver {
public:
    int solve( … );
private:
    …
};

1. Convert the entire code base to the Cilk++ language.
2. Structure the project so that Cilk++ code calls C++ code, but not conversely.
3. Allow C++ functions to call Cilk++ functions, but convert the entire subtree to use Cilk++.
   a. Use C++ wrapper functions
   b. Use "extern C++"
   c. Limited call back to C++ code

Parallelize Your Code using Cilk++

Page 133: Mit cilk

Intel XTRL / USA 133

class SAT_Solver {
public:
    int solve( … );
private:
    …
};

1. Convert the entire project to the Cilk++ language.
2. Structure the project so that Cilk++ code calls C++ code, but not conversely.
3. Allow C++ functions to call Cilk++ functions, but convert the entire subtree to use Cilk++.
   a. Use C++ wrapper functions
   b. Use "extern C++"
   c. Limited call back to C++ code

Parallelize Your Code using TBB

Your program may end up using a lot more stack space or fail to get good speedup.

Page 134: Mit cilk

Intel XTRL / USA 134

[Diagram: Chip Multiprocessor (CMP): processors P P P, each with a private cache (¢ ¢ ¢), connected by a Network to shared Memory.]

Multicore Architecture — 2001*

*The first non-embedded multicore microprocessor was the Power4 from IBM (2001).

Page 135: Mit cilk

Intel XTRL / USA 135

The Era of Multicore IS Here

[Bar chart: number of CPUs (0 to 50) versus number of cores (1, 2, 3, 4, 6, 8, 12), for desktop and server processors. Source: www.newegg.com]

The single-core processor is becoming obsolete.

Page 136: Mit cilk

Intel XTRL / USA 136

My Sister Is Buying a New Laptop …

Model         Display   Processor Type                  Number of Cores
MacBook       13.3"     2.4GHz Intel Core 2             2
MacBook Pro   13"       2.3-2.7GHz Intel Core i5 / i7   2
MacBook Pro   15"       2.0-2.3GHz Intel Core i7        4
MacBook Pro   17"       2.2-2.3GHz Intel Core i7        4
MacBook Air   11"       1.4-1.6GHz Intel Core 2         2
MacBook Air   13"       1.86-2.13GHz Intel Core 2       2

Source: www.apple.com

The era of multicore IS here!