
Page 1: Mit cilk

Intel XTRL / USA 1

The Era of Multicore Is Here

Processor                      Type                          Price     Number of Cores
Dell Inspiron R15              Intel Core i3 370M 2.4GHz     $649.99   2
Dell Inspiron N5030            Intel Pentium T4500 2.30GHz   $479.99   2
Lenovo IdeaPad Y560            Intel Core i7 740QM 1.73GHz   $849.99   4
ASUS G Series G73JW-XN1        Intel Core i7 740QM 1.73GHz   $1449.99  4
MSI CR620-691US                Intel Core i3 380M 2.53GHz    $599.99   2
Toshiba Satellite L675D-S7102  AMD Athlon II P360 2.30GHz    $599.99   2

Source: www.newegg.com

Page 2: Mit cilk

Intel XTRL / USA 2

Multicore Architecture*

[Figure: a chip multiprocessor (CMP) with processor cores (P), each with a private cache ($), connected by an on-chip network to shared memory.]

*The first non-embedded multicore microprocessor was the Power4 from IBM (2001).

Page 3: Mit cilk

Intel XTRL / USA 3

Concurrency Platforms

Operating System

Concurrency Platform

User Application

A concurrency platform that provides linguistic support and handles load balancing can ease the task of parallel programming.

Page 4: Mit cilk

Using Memory Mapping to Support Cactus Stacks in

Work-Stealing Runtime Systems

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology

I-Ting Angelina Lee

March 22, Intel XTRL / USA

Joint work with Silas Boyd-Wickizer, Zhiyi Huang, and Charles Leiserson


Page 6: Mit cilk

Intel XTRL / USA 6

Three Desirable Criteria

Serial-Parallel Reciprocity: interoperability with serial code, including binaries.

Good Performance: ample parallelism ⇒ linear speedup.

Bounded Stack Space: reasonable space usage compared to serial execution.

Page 7: Mit cilk

Intel XTRL / USA 7

Various Strategies

Strategy | SP Reciprocity | Time Bound | Space Bound

1. Recompile Everything

2. One Stack Per Worker

3. Limited-Depth Stacks

4. Depth-Restricted Stealing

5. New Stack When Needed

6. Recycle Ancestor Stacks

7. TLMM Cactus Stacks

(Annotations in the original table mark which of these strategies Cilk++, TBB, and Cilk Plus adopt.)

The Cactus-Stack Problem: how to satisfy all three criteria simultaneously.

Page 8: Mit cilk

Intel XTRL / USA 8

The Cactus-Stack Problem

CustomerEngineer

Space Usage Performance

SP Reciprocity

Page 9: Mit cilk

Intel XTRL / USA 9

The Cactus-Stack Problem

Parallelize my software?

Space Usage Performance

SP Reciprocity

Page 10: Mit cilk

Intel XTRL / USA 10

The Cactus-Stack Problem

Sure! Use my concurrency platform!

Space Usage Performance

SP Reciprocity

Page 11: Mit cilk

Intel XTRL / USA 11

The Cactus-Stack Problem

Sure! Use my concurrency platform!

Space Usage Performance

SP Reciprocity

Page 12: Mit cilk

Intel XTRL / USA 12

The Cactus-Stack Problem

Just be sure to recompile your whole codebase.

Space Usage Performance

Page 13: Mit cilk

Intel XTRL / USA 13

The Cactus-Stack Problem

Hm … I use third party binaries …

Space Usage Performance

Page 14: Mit cilk

Intel XTRL / USA 14

The Cactus-Stack Problem

*Sigh*. Ok fine.

Space Usage Performance

SP Reciprocity

Page 15: Mit cilk

Intel XTRL / USA 15

The Cactus-Stack Problem

Upgrade your RAM then …

Performance

SP Reciprocity

Page 16: Mit cilk

Intel XTRL / USA 16

The Cactus-Stack Problem

… you are gonna need extra memory.

Performance

SP Reciprocity

Page 17: Mit cilk

Intel XTRL / USA 17

The Cactus-Stack Problem

… no?

Performance

SP Reciprocity

Page 18: Mit cilk

Intel XTRL / USA 18

The Cactus-Stack Problem

Space Usage Performance

SP Reciprocity

… no?

Page 19: Mit cilk

Intel XTRL / USA 19

The Cactus-Stack Problem


Well … you didn’t say you wanted any performance guarantee, did you?

Space Usage

SP Reciprocity

Page 20: Mit cilk

Intel XTRL / USA 20

The Cactus-Stack Problem


Gee … I can get that just by running serially.

Space Usage

SP Reciprocity

Page 21: Mit cilk

Intel XTRL / USA 21

The Cactus-Stack Problem

Serial-Parallel Reciprocity: interoperability with serial code, including binaries.

Good Performance: ample parallelism ⇒ linear speedup.

Bounded Stack Space: reasonable space usage compared to serial execution.

Page 22: Mit cilk

Intel XTRL / USA 22

Legacy Linear Stack

[Figure: invocation tree (A invokes B and C; C invokes D and E) alongside the views of the stack seen by A, B, C, D, and E.]

An execution of a serial Algol-like language can be viewed as a serial walk of an invocation tree.

Page 23: Mit cilk

Intel XTRL / USA 23

Legacy Linear Stack

[Figure: the same invocation tree and views of the stack.]

Rule for pointers: A parent can pass pointers to its stack variables down to its children, but not the other way around.

Page 24: Mit cilk

Intel XTRL / USA 24

Legacy Linear Stack — 1960*

[Figure: the same invocation tree and views of the stack.]

Rule for pointers: A parent can pass pointers to its stack variables down to its children, but not the other way around.

* Stack-based space management for recursive subroutines developed with compilers for Algol 60.

Page 25: Mit cilk

Intel XTRL / USA 25

Cactus Stack — 1968*

[Figure: the same invocation tree and views of the stack.]

A cactus stack supports multiple views in parallel.

* Cactus stacks were supported directly in hardware by the Burroughs B6500 / B7500 computers.

Page 26: Mit cilk

Intel XTRL / USA 26

Heap-Based Cactus Stack

[Figure: frames A, B, C, D, E allocated off the heap, linked parent to child.]

A heap-based cactus stack allocates frames off the heap.

Mesa (1979), Ada (1979), Cedar (1986), MultiLisp (1985), Mul-T (1989), Id (1991), pH (1995), and more use this strategy.

Page 27: Mit cilk

Intel XTRL / USA 27

Modern Concurrency Platforms

Cilk++ (Intel), Cilk-5 (MIT), Cilk-M (MIT), Cilk Plus (Intel), Fortress (Oracle Labs), Habanero (Rice), JCilk (MIT), OpenMP, StreamIt (MIT), Task Parallel Library (Microsoft), Threading Building Blocks (Intel), X10 (IBM), …

Page 28: Mit cilk

Intel XTRL / USA 28

Heap-Based Cactus Stack

[Figure: frames A, B, C, D, E allocated off the heap.]

A heap-based cactus stack allocates frames off the heap.

MIT Cilk-5 (1998) and Intel Cilk++ (2009) use this strategy as well.

Good time and space bounds can be obtained …

Page 29: Mit cilk

Intel XTRL / USA 29

Heap-Based Cactus Stack

[Figure: frames allocated off the heap.]

Heap linkage: call/return via frames in the heap.

With heap linkage, parallel functions fail to interoperate with legacy serial code.

Page 30: Mit cilk

Intel XTRL / USA 30

Various Strategies

Strategy | SP Reciprocity | Time Bound | Space Bound

1. Recompile Everything

2. One Stack Per Worker

3. Limited-Depth Stacks

4. Depth-Restricted Stealing

5. New Stack When Needed

6. Recycle Ancestor Stacks

7. TLMM Cactus Stacks

The main constraint: once allocated, a frame’s location in virtual address space cannot change.

Page 31: Mit cilk

Intel XTRL / USA 31

Outline

Cilk-M:
  The Cactus-Stack Problem
  Cilk-M Overview
  Cilk-M’s Work-Stealing Scheduler
  TLMM-Based Cactus Stacks
  The Analysis of Cilk-M
  OS Support for TLMM
Survey of My Other Work
Direction for Future Work

Page 32: Mit cilk

Intel XTRL / USA 32

The Cilk Programming Model

int fib(int n) {
    if (n < 2) { return n; }
    int x = spawn fib(n-1);
    int y = fib(n-2);
    sync;
    return (x + y);
}

Control cannot pass this point until all spawned children have returned.

Cilk keywords grant permission for parallel execution. They do not command parallel execution.

The named child function may execute in parallel with the continuation of its parent.

Page 33: Mit cilk

Intel XTRL / USA 33

Cilk-M

A work-stealing runtime system based on Cilk that solves the cactus-stack problem by thread-local memory mapping (TLMM).

Page 34: Mit cilk

Intel XTRL / USA 34

Cilk-M Overview

Thread-local memory mapped (TLMM) region:

A virtual-address range in which each thread can map physical memory independently.

[Figure: virtual address space, from high to low addresses — stack, TLMM region, heap, uninitialized data (bss), initialized data, code; everything but the TLMM region is shared.]

Idea: Allocate the stacks for each worker in the TLMM region.

Page 35: Mit cilk

Intel XTRL / USA 35

Basic Cilk-M Idea

Unreasonable simplification: Assume that we can map with arbitrary granularity.

[Figure: invocation tree A→{B, C}, C→{D, E}; workers P1, P2, P3 each have a TLMM stack at 0x7f000 — P1 holds A (x: 42) and B (y: &x); P2 holds A, C (y: &x), and D; P3 holds A, C, and E.]

Workers achieve sharing by mapping the same physical memory at the same virtual address.

Page 36: Mit cilk

Intel XTRL / USA 36

Cilk Guarantees with a Heap-Based Cactus Stack

Time bound: TP = T1/P + O(T∞) ⇒ linear speedup when P ≪ T1/T∞.

Space bound: SP/P ≤ S1.

Does not support SP reciprocity.

Definitions: TP — execution time on P processors; T1 — work; T∞ — span; T1/T∞ — parallelism; SP — stack space on P processors; S1 — stack space of a serial execution.

Page 37: Mit cilk

Intel XTRL / USA 37

Cilk Depth

Cilk depth is the maximum number of Cilk functions nested on the stack during a serial execution.

[Figure: an invocation tree of Cilk functions A–G mixing spawned and called children.]

Cilk depth (3) is not the same as spawn depth (2).

Page 38: Mit cilk

Intel XTRL / USA 38

Cilk-M Guarantees

Time bound: TP = T1/P + O((S1+D)·T∞) ⇒ linear speedup when P ≪ T1 / ((S1+D)·T∞).

Space bound: SP/P ≤ S1+D, where S1 is measured in pages.

SP reciprocity: no longer need to distinguish function types — whether a function runs in parallel is dictated only by how it is invoked (spawn vs. call).

Definitions: TP — execution time on P processors; T1 — work; T∞ — span; T1/T∞ — parallelism; SP — stack space on P processors; S1 — stack space of a serial execution; D — Cilk depth.

Page 39: Mit cilk

Intel XTRL / USA 39

We implemented a prototype Cilk-M runtime system based on the open-source Cilk-5 runtime system.

We modified the open-source Linux kernel (2.6.29 running on x86 64-bit CPU’s) to provide support for TLMM (~600 lines of code).

We have ported the runtime system to work with Intel’s Cilk Plus compiler in place of the native Cilk Plus runtime.

System Overview

Page 40: Mit cilk

Intel XTRL / USA 40

Performance Comparison

Time bound: TP = T1/P + C·T∞, where C = O(S1+D).

Machine: AMD 4 × quad-core 2GHz Opteron, 64KB private L1, 512KB private L2, 2MB shared L3.

[Chart: Cilk-M running time / Cilk Plus running time (y-axis 0–1.2) for cholesky, cilksort, fft, fib, fib_weird, heat, lu, matmul, nqueens, qsort, rectmul, and strassen.]

Page 41: Mit cilk

Intel XTRL / USA 41

Space Usage

Benchmark   D    S1   S16/16   (S16/16)/S1   S1+D
cholesky    12   3    3.44     1.15          15
cilksort    18   3    3.63     1.21          22
fft         22   6    4.81     0.80          28
fib         43   4    4.44     1.11          47
fib_weird   281  22   18.63    0.85          303
heat        10   2    2.75     1.38          12
lu          10   2    3.43     1.72          38
matmul      22   3    4.00     1.33          12
nqueen      16   3    3.38     1.13          25
qsort       72   6    6.31     1.05          19
rectmul     27   4    4.75     1.19          31
strassen    8    2    3.50     1.75          10

Space bound: SP/P ≤ S1+D

Page 42: Mit cilk

Intel XTRL / USA 42

Outline

Cilk-M:
  The Cactus-Stack Problem
  Cilk-M Overview
  Cilk-M’s Work-Stealing Scheduler
  TLMM-Based Cactus Stacks
  The Analysis of Cilk-M
  OS Support for TLMM
Survey of My Other Work
Direction for Future Work

Page 43: Mit cilk

Intel XTRL / USA 43

Cilk-M’s Work-Stealing Scheduler

Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98].

[Figure: four workers P, each with a deque of spawn/call frames.]

Page 44: Mit cilk

Intel XTRL / USA 44

call! [Animation: the calling worker pushes the new frame onto the bottom of its deque.]

Page 45: Mit cilk

Intel XTRL / USA 45

spawn! [Animation: the spawning worker pushes the new frame onto the bottom of its deque.]

Page 46: Mit cilk

Intel XTRL / USA 46

spawn! call! spawn! [Animation: several workers push frames onto the bottoms of their own deques.]

Page 47: Mit cilk

Intel XTRL / USA 47

return! [Animation: a returning worker pops the frame from the bottom of its deque.]

Page 48: Mit cilk

Intel XTRL / USA 48

steal! When a worker runs out of work, it steals from the top of a random victim’s deque. [Animation: an idle worker takes the topmost frame of a victim’s deque.]

Page 49: Mit cilk

Intel XTRL / USA 49

spawn! [Animation: the thief resumes the stolen frame and continues spawning.]

Page 50: Mit cilk

Intel XTRL / USA 50

Theorem [BL94]: With sufficient parallelism, workers steal infrequently ⇒ linear speedup.

Page 51: Mit cilk

Intel XTRL / USA 51

Outline

Cilk-M:
  The Cactus-Stack Problem
  Cilk-M Overview
  Cilk-M’s Work-Stealing Scheduler
  TLMM-Based Cactus Stacks
  The Analysis of Cilk-M
  OS Support for TLMM
Survey of My Other Work
Direction for Future Work

Page 52: Mit cilk

Intel XTRL / USA 52

TLMM-Based Cactus Stacks

Unreasonable simplification: Assume that we can map with arbitrary granularity.

[Animation: P1 runs A (x: 42) and B (y: &x) at 0x7f000.]

Use a standard linear stack in virtual memory.

Page 53: Mit cilk

Intel XTRL / USA 53

[Animation: P2 steals A — the stolen prefix is mapped, not copied, into P2’s TLMM region at the same virtual addresses.]

Map (not copy) the stolen prefix to the same virtual addresses.

Page 54: Mit cilk

Intel XTRL / USA 54

TLMM-Based Cactus Stacks

[Animation: P2’s subsequent work (C: y = &x) grows downward below the mapped prefix.]

Subsequent spawns and calls grow downward in the thief’s TLMM region.

Page 55: Mit cilk

Intel XTRL / USA 55

TLMM-Based Cactus Stacks

[Animation: P1 and P2 both map A at 0x7f000.]

Both workers see the same virtual address value for &x.

Page 56: Mit cilk

Intel XTRL / USA 56

TLMM-Based Cactus Stacks

[Animation: P2 calls D below C.]

Both workers see the same virtual address value for &x.

Page 57: Mit cilk

Intel XTRL / USA 57

TLMM-Based Cactus Stacks

[Animation: P3 steals C — again the stolen prefix (A, C) is mapped, not copied.]

Map (not copy) the stolen prefix to the same virtual addresses.

Page 58: Mit cilk

Intel XTRL / USA 58

TLMM-Based Cactus Stacks

[Animation: P3’s subsequent work (E: z = &x) grows downward below the mapped prefix.]

Subsequent spawns and calls grow downward in the thief’s TLMM region.

Page 59: Mit cilk

Intel XTRL / USA 59

TLMM-Based Cactus Stacks

[Animation: P1, P2, and P3 all map the shared prefix at identical addresses.]

All workers see the same virtual address value for &x.

Page 60: Mit cilk

Intel XTRL / USA 60

Handling Page Granularity

[Figure: worker stacks mapped at page granularity — pages at 0x7d000, 0x7e000, 0x7f000; P1 holds A and B.]

Page 61: Mit cilk

Intel XTRL / USA 61

Handling Page Granularity

Map the stolen prefix.

[Animation: P2 steals A and maps the page containing A.]

Page 62: Mit cilk

Intel XTRL / USA 62

Handling Page Granularity

Advance the stack pointer ⇒ fragmentation.

[Animation: P2 advances its stack pointer past the page boundary, leaving a fragmented gap.]

Page 63: Mit cilk

Intel XTRL / USA 63

Handling Page Granularity

[Animation: P2 continues with C and D on fresh pages.]

Page 64: Mit cilk

Intel XTRL / USA 64

Handling Page Granularity

[Animation: P3 steals C and maps the pages containing A and C.]

Page 65: Mit cilk

Intel XTRL / USA 65

Handling Page Granularity

Advance the stack pointer again ⇒ additional fragmentation.

[Animation: P3 advances its stack pointer past the next page boundary.]

Page 66: Mit cilk

Intel XTRL / USA 66

Handling Page Granularity

Advance the stack pointer again ⇒ additional fragmentation.

[Animation: P3 continues with E below the stolen prefix.]

Page 67: Mit cilk

Intel XTRL / USA 67

Handling Page Granularity

[Figure: the final stacks of P1, P2, and P3, showing per-steal fragmentation.]

Space-reclaiming heuristic: reset the stack pointer upon successful sync.

Page 68: Mit cilk

Intel XTRL / USA 68

Outline

Cilk-M:
  The Cactus-Stack Problem
  Cilk-M Overview
  Cilk-M’s Work-Stealing Scheduler
  TLMM-Based Cactus Stacks
  The Analysis of Cilk-M
  OS Support for TLMM
Survey of My Other Work
Direction for Future Work

Page 69: Mit cilk

Intel XTRL / USA 69

Space Bound with a Heap-Based Cactus Stack

Theorem [BL94]. Let S1 be the stack space required by a serial execution of a program. The stack space per worker of a P-worker execution using a heap-based cactus stack is at most SP/P ≤ S1.

Proof. The work-stealing algorithm maintains the busy-leaves property: every active leaf frame has a worker executing on it. ■

[Figure: P = 4 workers, each busy on an active leaf of a tree of depth at most S1.]

Page 70: Mit cilk

Intel XTRL / USA 70

Cilk-M Space Bound

Claim. Let S1 be the stack space required by a serial execution of a program, and let D be the Cilk depth. The stack space per worker of a P-worker execution using a TLMM cactus stack is at most SP/P ≤ S1+D.

Proof. The work-stealing algorithm maintains the busy-leaves property: every active leaf frame has a worker executing on it. ■

[Figure: P = 4 workers, each busy on an active leaf of a tree of depth at most S1.]

Page 71: Mit cilk

Intel XTRL / USA 71

Space Usage

Benchmark   D    S1   S16/16   (S16/16)/S1   S1+D
cholesky    12   3    3.44     1.15          15
cilksort    18   3    3.63     1.21          22
fft         22   6    4.81     0.80          28
fib         43   4    4.44     1.11          47
fib_weird   281  22   18.63    0.85          303
heat        10   2    2.75     1.38          12
lu          10   2    3.43     1.72          38
matmul      22   3    4.00     1.33          12
nqueen      16   3    3.38     1.13          25
qsort       72   6    6.31     1.05          19
rectmul     27   4    4.75     1.19          31
strassen    8    2    3.50     1.75          10

Space bound: SP/P ≤ S1+D

Page 72: Mit cilk

Intel XTRL / USA 72

Performance Bound with a Heap-Based Cactus Stack

Theorem [BL94]. A work-stealing scheduler can achieve expected running time TP = T1/P + O(T∞) on P processors.

Definitions: TP — execution time on P processors; T1 — work; T∞ — span; T1/T∞ — parallelism.

Corollary. If the computation exhibits sufficient parallelism (P ≪ T1/T∞), this bound guarantees near-perfect linear speedup (T1/TP ≈ P).

Page 73: Mit cilk

Intel XTRL / USA 73

Cilk-M Performance Bound

Claim. A work-stealing scheduler can achieve expected running time TP = T1/P + C·T∞ on P processors, where C = O(S1+D).

Definitions: TP — execution time on P processors; T1 — work; T∞ — span; T1/T∞ — parallelism; D — Cilk depth.

Corollary. If the computation exhibits sufficient parallelism (P ≪ T1/((S1+D)·T∞)), this bound guarantees near-perfect linear speedup (T1/TP ≈ P).

Page 74: Mit cilk

Intel XTRL / USA 74

Outline

Cilk-M:
  The Cactus-Stack Problem
  Cilk-M Overview
  Cilk-M’s Work-Stealing Scheduler
  TLMM-Based Cactus Stacks
  The Analysis of Cilk-M
  OS Support for TLMM
Survey of My Other Work
Direction for Future Work

Page 75: Mit cilk

Intel XTRL / USA 75

To Be or Not To Be … a Process

A Worker = A Process:
  Every worker has its own page table.
  By default, nothing is shared.
  Nonstack memory must be shared manually (i.e., via mmap).
  User calls to mmap do not work (which may include malloc).

A Worker = A Thread:
  Workers share a single page table.
  By default, everything is shared.
  Reserve a region to be independently mapped.
  User calls to mmap operate properly.

Page 76: Mit cilk

Intel XTRL / USA 76

Page Table for TLMM (Ideally)

[Figure: one page table mixing a shared entry (page 32) with per-thread TLMM entries (page 7 → TLMM 0, page 12 → TLMM 1, page 28 → TLMM 2).]

x86: hardware walks the page table. Each thread has a single root-page directory!

Page 77: Mit cilk

Intel XTRL / USA 77

Support for TLMM

[Figure: thread 0 maps page 7 and thread 1 maps page 12 in their TLMM regions, while both share page 32.]

Must synchronize the root-page directory among threads.

Page 78: Mit cilk

Intel XTRL / USA 78

Limitation of TLMM Cactus Stacks

TLMM does not work for code that requires one thread to see another thread’s stack. E.g., MCS locks [MCS91]:

When a thread A attempts to acquire a mutual-exclusion lock L, it may be blocked if another thread B already owns L.

Rather than spin-waiting on L itself, A adds itself to a queue associated with L and spins on a local variable LA, thereby reducing coherence traffic.

When B releases L, it resets LA, which wakes A up for another attempt to acquire the lock.

If A allocates LA on its stack using TLMM, LA may not be visible to B!

Page 79: Mit cilk

Intel XTRL / USA 79

Cilk-M Summary

Cilk-M is a C-based concurrency platform that satisfies all three criteria simultaneously:
  Serial-parallel reciprocity
  Good performance
  Bounded and efficient use of memory for the cactus stack

Cilk-M employs:
  TLMM-based cactus stacks
  OS support for TLMM (~600 lines of code)
  Legacy-compatible linkage

Page 80: Mit cilk

Intel XTRL / USA 80

Outline

Cilk-M
Survey of My Other Work:
  The JCilk Language
  Location-Based Memory Fences
  Ownership-Aware Transactional Memory
Direction for Future Work

Page 81: Mit cilk

Intel XTRL / USA 81

The JCilk Language

[Figure: Java core functionalities combined with parallel constructs from Cilk: spawn & sync.]

Joint work with John Danaher and Charles Leiserson

Page 82: Mit cilk

Intel XTRL / USA 82

The JCilk Language

[Figure: Java core functionalities combined with parallel constructs from Cilk (spawn & sync) and exception handling.]

Joint work with John Danaher and Charles Leiserson

Page 83: Mit cilk

Intel XTRL / USA 83

JCilk provides a faithful extension of Java’s exception mechanism consistent with Cilk’s primitives.

JCilk’s exception semantics include an implicit abort mechanism, which allows speculative parallelism to be expressed succinctly in JCilk.

Other researchers [I91, TY00, BM00] pursued new linguistic mechanisms.

Exception Handling in a Concurrent Context

Page 84: Mit cilk

Intel XTRL / USA 84

The JCilk System

[Figure: Fib.jcilk → JCilk compiler (JCilk to Java + goto) → Fib.jgo → Jgo compiler (GCJ + goto support) → Fib.class → JVM with the JCilk runtime system.]

Page 85: Mit cilk

Intel XTRL / USA 85

JCilk's strategy of integrating multithreading with Java's exception semantics is synergistic – it obviates the need for Cilk’s inlet and abort.

JCilk’s abort mechanism extends Java’s existing exception mechanism in a natural way to propagate an abort, allowing the programmer to clean up.

What We Discovered

Page 86: Mit cilk

Intel XTRL / USA 86

Outline

Cilk-M
Survey of My Other Work:
  The JCilk Language
  Location-Based Memory Fences
  Ownership-Aware Transactional Memory
Direction for Future Work

Page 87: Mit cilk

Intel XTRL / USA 87

Dekker’s Protocol (Simplified)

Initially, L1 = 0 and L2 = 0.

Thread 1:
1 L1 = 1;
2 if (L2 == 0) {
3     /* critical section */
4     ...
5 }
6 L1 = 0;

Thread 2:
1 L2 = 1;
2 if (L1 == 0) {
3     /* critical section */
4     ...
5 }
6 L2 = 0;

Reads may be reordered with older writes.

Page 88: Mit cilk

Intel XTRL / USA 88

Dekker’s Protocol (Simplified)

Initially, L1 = 0 and L2 = 0.

Thread 1:
1 L1 = 1;
2 mfence();
3 if (L2 == 0) {
4     /* critical section */
5     ...
6 }
7 L1 = 0;

Thread 2:
1 L2 = 1;
2 mfence();
3 if (L1 == 0) {
4     /* critical section */
5     ...
6 }
7 L2 = 0;

Memory fences are needed ⇒ stalling.

Page 89: Mit cilk

Intel XTRL / USA 89

Applications of Dekker’s Protocol

The THE protocol used by Cilk’s work-stealing scheduler [FLR98]: the victim vs. the thief.

Java monitors using Quickly Reacquirable Locks or Biased Locking [DMS03, OKK04]: the bias-holding thread vs. a revoker thread.

The JNI reentry barrier in the JVM: a Java mutator thread vs. the garbage collector.

Network packet processing [VNE10]: the owner thread vs. other threads.

These applications exhibit asymmetric synchronization patterns.

Page 90: Mit cilk

Intel XTRL / USA 90

Location-Based Memory Fences
Joint work with Edya Ladan-Mozes and Dmitriy Vyukov

We introduce location-based memory fences, which cause a thread’s instruction stream to serialize only when another thread attempts to access the guarded memory location.

Some applications can benefit from a software implementation [DHY03] that uses interrupts.

A lightweight hardware mechanism can piggyback on the cache-coherence protocol.

Page 91: Mit cilk

Intel XTRL / USA 91

Outline

Cilk-M
Survey of My Other Work:
  The JCilk Language
  Location-Based Memory Fences
  Ownership-Aware Transactional Memory
Direction for Future Work

Page 92: Mit cilk

Intel XTRL / USA 92

Transactional Memory

atomic { // A          atomic { // B
    x++;                   w = x;
}                      }

[Figure: transactions A (Rset: x; Wset: x) and B (Rset: x; Wset: w) accessing shared memory.]

Transactional Memory (TM) [HM93] provides a transactional interface for accessing memory.

Page 93: Mit cilk

Intel XTRL / USA 93

Transactional Memory

[Figure: the same transactions A and B.]

TM guarantees that transactions are serializable [P79].

Page 94: Mit cilk

Intel XTRL / USA 94

Nested Transactions

atomic { // A
    int a = x;
    ...
    atomic { // B
        w++;
    }
    int b = y;
    z = x + y;
}

[Figure: inner transaction B (Rset: w; Wset: w) nested inside outer transaction A (Rset: w, x, y, z; Wset: w, x, y, z).]

Closed nesting: propagate B’s changes to A.

Page 95: Mit cilk

Intel XTRL / USA 95

Nested Transactions

[Figure: the same nested transactions A and B.]

Open nesting: commit B’s changes globally.

Page 96: Mit cilk

Intel XTRL / USA 96

Nested Transactions

                                      Safety   Efficiency
Closed Nesting [M85]                    ✓
Open Nesting [MH06, MAC+06, NMA+07]              ✓

All memory is treated equally – there is only one level of abstraction.

Page 97: Mit cilk

Intel XTRL / USA 97

Ownership-Aware Transactions (OAT)

Ownership-aware transactions is a hybrid between open-nesting and closed-nesting; it provides multiple levels of abstraction.

In OAT, the programmer writes code with transactional modules, and the OAT system uses the concept of ownership types [BLS03] to ensure data encapsulation within a module.

The OAT system guarantees abstract serializability as long as the program conforms to a set of well-defined constraints on how the modules share data.

Joint work with Kunal Agrawal and Jim Sukha

Page 98: Mit cilk

Intel XTRL / USA 98

Outline

Cilk-M
Survey of My Other Work:
  The JCilk Language
  Location-Based Memory Fences
  Ownership-Aware Transactional Memory
Direction for Future Work

Page 99: Mit cilk

Intel XTRL / USA 99

Parallelism Abstraction

Operating System

Concurrency Platform

User Application

A concurrency platform provides a layer of parallelism abstraction to help load balancing and task scheduling.

Page 100: Mit cilk

Intel XTRL / USA 100

Memory Abstraction

A memory abstraction provides a different “view” of a memory location depending on the execution context in which the memory access is made.

TLMM cactus stack: each worker gets its own linear local view of the tree-structured call stack.

Hyperobject [FHLL09]: a linguistic mechanism that supports coordinated local views of the same nonlocal object.

Transactional memory [HM93]: memory accesses dynamically enclosed by an atomic block appear to occur atomically.

Can a concurrency platform also mitigate the complexity of synchronization by providing the right memory abstractions?

Page 101: Mit cilk

Intel XTRL / USA 101

OS / Hardware Support for Memory Abstraction

Recently, researchers have begun to explore ways to enable memory abstractions using the page-mapping / page-protection mechanism:

C# with atomic sections [AHM09] (strong atomicity)
Grace [BYL+09] (deterministic execution)
Sammati [PV10] (deadlock avoidance)
Cilk-M [LSH+10] (TLMM cactus stack)

Can we relax the limitation of manipulating virtual memory at page granularity?

Page 102: Mit cilk

Intel XTRL / USA 102

THANK YOU!

Cilk-M
Survey of My Other Work: The JCilk Language, Location-Based Memory Fences, Ownership-Aware Transactional Memory
Direction for Future Work


Quadratic Stack Growth [Robison08]

Assume one linear stack per worker. P: parallel function; S: serial function; functions are invoked by spawn or call as shown in the figure.

[Figure: a computation of depth d mixing spawned parallel (P) and called serial (S) frames; the pattern repeats d times.]

Page 106: Mit cilk

Intel XTRL / USA 106

Quadratic Stack Growth [Robison08]

[Figure: the same computation, repeated d times.]

The green worker repeatedly blocks, then steals, using Θ(d²) stack space.

Page 107: Mit cilk

Intel XTRL / USA

Performance Comparison

Time bound: TP = T1/P + C·T∞, where C = O(S1+D).

Machine: AMD 4 × quad-core 2GHz Opteron, 64KB private L1, 512KB private L2, 2MB shared L3.

[Chart: Cilk-M running time / Cilk-5 running time (y-axis 0–1.2) for cholesky, cilksort, fft, fib, fib_weird, heat, knapsack, lu, matmul, nqueens, rectmul, and strassen.]

Page 108: Mit cilk

Intel XTRL / USA

Space Usage (Hand-Compiled)

Benchmark   D    S1   S16/16   S1+D
cholesky    12   2    3.19     14
cilksort    18   2    3.19     20
fft         22   4    3.69     26
fib         43   2    3.69     45
fib_weird   281  8    8.44     289
heat        10   2    2.38     12
knapsack    34   2    5.00     36
lu          10   2    3.19     12
matmul      22   2    3.31     24
nqueen      16   2    3.25     18
rectmul     27   2    4.06     29
strassen    8    2    3.13     10

Space bound: SP/P ≤ S1+D

Page 109: Mit cilk

Intel XTRL / USA 109

Space Usage

Benchmark   Cilk-M S16   Cilk-5 S16   Cilk-M H16   Cilk-5 H16   Cilk-M S16+H16   Cilk-5 S16+H16
cholesky    51           16           193          345          244              361
cilksort    51           16           193          265          244              281
fft         60           48           169          1017         229              1065
fib         59           16           169          185          228              201
fib_weird   135          64           217          217          353              281
heat        38           16           209          273          247              289
knapsack    80           16           169          361          249              377
lu          51           16           185          265          236              281
matmul      53           16           169          257          222              273
nqueen      52           16           161          249          213              265
rectmul     65           32           169          240          234              272
strassen    50           16           161          417          211              433

Page 110: Mit cilk

Intel XTRL / USA 110

GCC/Linux C Subroutine Linkage

[Figure: frames for A and B overlapped on the linear stack — args to A; A’s return address; A’s parent’s base pointer; A’s local variables; then the linkage region holding args to B; B’s return address; A’s base pointer; B’s local variables; args to B’s callees. bp and sp mark the current frame.]

The legacy linear stack obtains efficiency by overlapping frames.

Page 111: Mit cilk

Intel XTRL / USA 111

Handling Page Granularity

The thief advances its stack pointer past the next page boundary, reserving space for the linkage region for the next callee.

[Animation: P2 steals A and reserves linkage space past the page boundary.]

Intel XTRL / USA 112

Handling Page Granularity

The thief advances its stack pointer past the next page boundary, reserving space for the linkage region for the next callee.

[Animation: P2 continues with C and D below the reserved linkage region.]

Page 113: Mit cilk

Intel XTRL / USA 113

Handling Page Granularity

[Stack diagram (continued): P3 steals C from P2.]

The thief advances its stack pointer past the next page boundary, reserving space for the linkage region for the next callee.

Page 114: Mit cilk

Intel XTRL / USA 114

Handling Page Granularity

[Stack diagram (continued): frames A, B, C, D, and E distributed across the three worker stacks, each thief's frames starting past a page boundary.]

The thief advances its stack pointer past the next page boundary, reserving space for the linkage region for the next callee.

Page 115: Mit cilk

Intel XTRL / USA 115

Key Invocation Invariants

1. Arguments are passed via the stack pointer with positive offset.
2. Local variables are referenced via the base pointer with negative offset.
3. Live registers are flushed onto the stack immediately before each spawn.
4. Live registers are flushed onto the stack before returning to the runtime if a sync fails.
5. When resuming a stolen function after a spawn or sync, live registers are restored from the stack.
6. When returning from a spawn, the return value is flushed from its register onto the stack.
7. The frame size is fixed before any spawn statements.

Page 116: Mit cilk

Intel XTRL / USA 116

GCC/Linux C Subroutine Linkage

[Stack diagram, frame for A, high addresses at top:

  args to A
  A's return address / A's parent's base ptr   <- bp
  A's local variables
  args to A's callees                          <- sp

Invocation tree: A calls B and C; C calls D and E.]

Legacy linear stacks enable efficient passing of arguments from caller to callee.

Page 117: Mit cilk

Intel XTRL / USA 117

GCC/Linux C Subroutine Linkage

[Stack diagram: frame for A as above; "args to A" occupies the linkage region just above A's return address.]

Frame A accesses its arguments through positive offsets from its base pointer.

Page 118: Mit cilk

Intel XTRL / USA 118

GCC/Linux C Subroutine Linkage

[Stack diagram: frame for A as above; A's local variables sit just below the base pointer.]

Frame A accesses its local variables through negative offsets from its base pointer.

Page 119: Mit cilk

Intel XTRL / USA 119

GCC/Linux C Subroutine Linkage

[Stack diagram: "args to B" placed in the linkage region at the bottom of A's frame.]

Before invoking B, A places the arguments for B into the reserved linkage region it will share with B, which A indexes using positive offsets from its stack pointer.

Page 120: Mit cilk

Intel XTRL / USA 120

GCC/Linux C Subroutine Linkage

[Stack diagram: below "args to B", the call pushes "B's return address"; sp now points to it.]

A then calls B; the call instruction saves B's return address and transfers control to B.

Page 121: Mit cilk

Intel XTRL / USA 121

[Stack diagram: B pushes "A's base pointer" below B's return address; bp moves down to meet sp.]

Upon entry, B saves A's base pointer and sets the base pointer to the current stack pointer.

Page 122: Mit cilk

Intel XTRL / USA 122

GCC/Linux C Subroutine Linkage

[Stack diagram: below the saved base pointer, B's local variables and "args to B's callees"; sp advanced past them.]

B advances the stack pointer to allocate space for its local variables and linkage region.

Page 123: Mit cilk

Intel XTRL / USA 123

GCC/Linux C Subroutine Linkage

[Stack diagram: the completed layout, with frame for A and frame for B overlapping in the shared linkage region ("args to B").]

The legacy linear stack obtains efficiency by overlapping frames.

Page 124: Mit cilk

Intel XTRL / USA 124

Legacy Linear Stack

[Invocation tree: A calls B and C; C calls D and E. Next to it, the linear stack (high addresses at top) holds the frames of the current call path.]

An execution of a serial Algol-like language can be viewed as a serial walk of an invocation tree.

Page 125: Mit cilk

Intel XTRL / USA 125

Legacy Linear Stack

[Invocation tree with the path A → C → E active; x: 42 lives in an ancestor's frame, and a descendant holds y: &x, a pointer passed down.]

Rule for pointers: A parent can pass pointers to its stack variables down to its children, but a child cannot pass a pointer to its stack variable up to its parent.

Page 126: Mit cilk

Intel XTRL / USA 126

Legacy Linear Stack

[Invocation tree with the path A → C → E active; here z: 42 lives in the child's frame, so an ancestor holding y: &z would be left with a dangling pointer once the child returns, which the rule forbids.]

Rule for pointers: A parent can pass pointers to its stack variables down to its children, but a child cannot pass a pointer to its stack variable up to its parent.

Page 127: Mit cilk

Intel XTRL / USA 127

Given n > 0, search for one way to arrange n queens on an n-by-n chessboard so that none attacks another.

[Board diagrams: a legal configuration and an illegal configuration.]

The Queens Problem

Page 128: Mit cilk

Intel XTRL / USA 128

Exploring the Search Tree for Queens

[Search tree: from start, the root branches on row-0 placements (r0,c0 through r0,c3); each node branches on row-1 placements (r1,c0 through r1,c3), then row 2, and so on.]

Serial strategy: depth-first search with backtracking. The search-tree size grows exponentially as n increases.

Page 129: Mit cilk

Intel XTRL / USA 129

Exploring the Search Tree for Queens

[Search tree as in the serial case, with the row-0 subtrees explored in parallel.]

Parallel strategy: spawn the searches in parallel. This is speculative computation: some work may be wasted.

Page 130: Mit cilk

Intel XTRL / USA 130

Exploring the Search Tree for Queens

[Search tree as in the serial case; once a solution is found, the remaining subtrees are aborted.]

Parallel strategy: spawn the searches in parallel, and abort the other parallel searches once a solution is found.

Page 131: Mit cilk

Intel XTRL / USA 131

Various Parallel Programming Models

Page 132: Mit cilk

Intel XTRL / USA 132

class SAT_Solver {
public:
    int solve( … );
private:
    …
};

1. Convert the entire code base to the Cilk++ language.
2. Structure the project so that Cilk++ code calls C++ code, but not conversely.
3. Allow C++ functions to call Cilk++ functions, but convert the entire subtree to use Cilk++.
   a. Use C++ wrapper functions
   b. Use "extern C++"
   c. Limited call back to C++ code

Parallelize Your Code using Cilk++

Page 133: Mit cilk

Intel XTRL / USA 133

class SAT_Solver {
public:
    int solve( … );
private:
    …
};

1. Convert the entire project to the Cilk++ language.
2. Structure the project so that Cilk++ code calls C++ code, but not conversely.
3. Allow C++ functions to call Cilk++ functions, but convert the entire subtree to use Cilk++.
   a. Use C++ wrapper functions
   b. Use "extern C++"
   c. Limited call back to C++ code

Parallelize Your Code using TBB

Your program may end up using a lot more stack space or fail to get good speedup.

Page 134: Mit cilk

Intel XTRL / USA 134

[Diagram: Chip Multiprocessor (CMP): processors P P P, each with a private cache (¢ ¢ ¢), connected by a Network to shared Memory.]

Multicore Architecture — 2001*

*The first non-embedded multicore microprocessor was the Power4 from IBM (2001).

Page 135: Mit cilk

Intel XTRL / USA 135

The Era of Multicore IS Here

[Bar chart: number of CPUs (0 to 50) versus number of cores (1, 2, 3, 4, 6, 8, 12), for desktop and server processors. Source: www.newegg.com]

The single-core processor is becoming obsolete.

Page 136: Mit cilk

Intel XTRL / USA 136

My Sister Is Buying a New Laptop …

Model         Display   Processor Type                  Number of Cores
MacBook       13.3"     2.4GHz Intel Core 2             2
MacBook Pro   13"       2.3-2.7GHz Intel Core i5 / i7   2
MacBook Pro   15"       2.0-2.3GHz Intel Core i7        4
MacBook Pro   17"       2.2-2.3GHz Intel Core i7        4
MacBook Air   11"       1.4-1.6GHz Intel Core 2         2
MacBook Air   13"       1.86-2.13GHz Intel Core 2       2

Source: www.apple.com

The era of multicore IS here!