mit cilk
Intel XTRL / USA 1
The Era of Multicore Is Here

| | Processor Type | Price | Number of Cores |
|---|---|---|---|
| Dell Inspiron R15 | Intel Core i3 370M 2.4GHz | $649.99 | 2 |
| Dell Inspiron N5030 | Intel Pentium T4500 2.30GHz | $479.99 | 2 |
| Lenovo IdeaPad Y560 | Intel Core i7 740QM 1.73GHz | $849.99 | 4 |
| ASUS G Series G73JW-XN1 | Intel Core i7 740QM 1.73GHz | $1449.99 | 4 |
| MSI CR620-691US | Intel Core i3 380M 2.53GHz | $599.99 | 2 |
| Toshiba Satellite L675D-S7102 | AMD Athlon II P360 2.30GHz | $599.99 | 2 |

Source: www.newegg.com
Multicore Architecture*
[Figure: a chip multiprocessor (CMP) in which several processors (P), each with its own cache ($), connect to memory and to the network.]
*The first non-embedded multicore microprocessor was the Power4 from IBM (2001).
Concurrency Platforms
Operating System
Concurrency Platform
User Application
A concurrency platform that provides linguistic support and handles load balancing can ease the task of parallel programming.
Using Memory Mapping to Support Cactus Stacks in Work-Stealing Runtime Systems
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
I-Ting Angelina Lee
March 22, Intel XTRL / USA
Joint work with Silas Boyd-Wickizer, Zhiyi Huang, and Charles Leiserson
Three Desirable Criteria
Serial-Parallel Reciprocity: Interoperability with serial code, including binaries
Good Performance: Ample parallelism ⇒ linear speedup
Bounded Stack Space: Reasonable space usage compared to serial execution
Various Strategies
Strategy | SP Reciprocity | Time Bound | Space Bound
1. Recompile Everything
2. One Stack Per Worker
3. Limited-Depth Stacks
4. Depth-Restricted Stealing
5. New Stack When Needed
6. Recycle Ancestor Stacks
7. TLMM Cactus Stacks
Cilk++
TBB
Cilk Plus
The Cactus-Stack Problem: how to satisfy all three criteria simultaneously.
The Cactus-Stack Problem
A dialogue between a Customer and an Engineer. The criteria shown on each slide follow each line in parentheses.
"Parallelize my software?" (Space Usage, Performance, SP Reciprocity)
"Sure! Use my concurrency platform!" (Space Usage, Performance, SP Reciprocity)
"Just be sure to recompile all your codebase." (Space Usage, Performance)
"Hm … I use third party binaries …" (Space Usage, Performance)
"*Sigh*. Ok fine." (Space Usage, Performance, SP Reciprocity)
"Upgrade your RAM then … you are gonna need extra memory." (Performance, SP Reciprocity)
"… no?" (Performance, SP Reciprocity)
"Well … you didn’t say you want any performance guarantee, did you?" (Space Usage, SP Reciprocity)
"Gee … I can get that just by running serially." (Space Usage, SP Reciprocity)
The Cactus-Stack Problem
Serial-Parallel Reciprocity: Interoperability with serial code, including binaries
Good Performance: Ample parallelism ⇒ linear speedup
Bounded Stack Space: Reasonable space usage compared to serial execution
Legacy Linear Stack (1960)*
[Figure: an invocation tree in which A calls B and C, and C calls D and E. Views of stack: A; A, B; A, C; A, C, D; A, C, E.]
An execution of a serial Algol-like language can be viewed as a serial walk of an invocation tree.
Rule for pointers: A parent can pass pointers to its stack variables down to its children, but not the other way around.
* Stack-based space management for recursive subroutines was developed with compilers for Algol 60.
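The pointer rule can be sketched in a few lines of C. This is only an illustration; the function names `child_increment` and `parent_allowed` are hypothetical and do not come from the talk.

```c
#include <assert.h>
#include <stddef.h>

/* Child: receives a pointer into its parent's frame and writes through it.
   This is the allowed direction, because the parent's frame outlives the child. */
static void child_increment(int *parent_var) {
    assert(parent_var != NULL);
    *parent_var += 1;
}

/* Parent: owns x on its own stack frame and passes &x down to the child. */
static int parent_allowed(void) {
    int x = 41;
    child_increment(&x);
    return x;
}

/* The disallowed direction would look like the function below. It is left in a
   comment because dereferencing the returned pointer is undefined behavior:
   the child's frame is popped as soon as the child returns.

       int *child_bad(void) { int y = 7; return &y; }
*/
```

Calling `parent_allowed()` exercises only the parent-to-child direction, which is the one a linear stack supports.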
Cactus Stack (1968)*
[Figure: the same invocation tree (A calls B and C; C calls D and E) with the views of stack A; A, B; A, C; A, C, D; A, C, E.]
A cactus stack supports multiple views in parallel.
* Cactus stacks were supported directly in hardware by the Burroughs B6500 / B7500 computers.
Heap-Based Cactus Stack
[Figure: frames A, B, C, D, and E allocated in the heap.]
A heap-based cactus stack allocates frames off the heap.
Mesa (1979), Ada (1979), Cedar (1986), MultiLisp (1985), Mul-T (1989), Id (1991), pH (1995), and more use this strategy.
Modern Concurrency Platforms
  Cilk++ (Intel)
  Cilk-5 (MIT)
  Cilk-M (MIT)
  Cilk Plus (Intel)
  Fortress (Oracle Labs)
  Habanero (Rice)
  JCilk (MIT)
  OpenMP
  StreamIt (MIT)
  Task Parallel Library (Microsoft)
  Threading Building Blocks (Intel)
  X10 (IBM)
  …
MIT Cilk-5 (1998) and Intel Cilk++ (2009) use the heap-based strategy as well.
Good time and space bounds can be obtained …
Heap linkage: call/return via frames in the heap.
Heap linkage ⇒ parallel functions fail to interoperate with legacy serial code.
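A minimal sketch of heap-allocated frames linked by parent pointers may make the idea concrete. The `frame` struct and the helpers `frame_push` and `view_depth` are hypothetical names for illustration only; they are not the Cilk-5 or Cilk++ runtime data structures.

```c
#include <assert.h>
#include <stdlib.h>

/* One activation frame, allocated off the heap and linked to its parent. */
typedef struct frame {
    struct frame *parent;  /* NULL for the root frame */
    char name;             /* label used in the invocation tree */
    int local;             /* stands in for the frame's local variables */
} frame;

/* Allocate a child frame under `parent`. Siblings share the same parent, so
   several leaves can be live at once, which is the cactus shape. */
static frame *frame_push(frame *parent, char name, int local) {
    frame *f = malloc(sizeof *f);
    assert(f != NULL);
    f->parent = parent;
    f->name = name;
    f->local = local;
    return f;
}

/* Number of frames in the "view of stack" seen from a leaf up to the root. */
static int view_depth(const frame *leaf) {
    int depth = 0;
    for (const frame *f = leaf; f != NULL; f = f->parent) {
        depth++;
    }
    return depth;
}
```

Because calls and returns go through these heap links rather than the machine stack, a legacy function compiled for a linear stack cannot call back into such code, which is the interoperability problem the slide points out.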
Various Strategies
The main constraint: once allocated, a frame’s virtual address cannot change.
Outline
Cilk-M
  The Cactus Stack Problem
  Cilk-M Overview
  Cilk-M’s Work-Stealing Scheduler
  TLMM-Based Cactus Stacks
  The Analysis of Cilk-M
  OS Support for TLMM
Survey of My Other Work
Direction for Future Work
The Cilk Programming Model
    int fib(int n) {
      if (n < 2) { return n; }
      int x = spawn fib(n-1);
      int y = fib(n-2);
      sync;
      return (x + y);
    }
spawn: The named child function may execute in parallel with the continuation of its parent.
sync: Control cannot pass this point until all spawned children have returned.
Cilk keywords grant permission for parallel execution. They do not command parallel execution.
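Because the keywords only grant permission for parallelism, deleting them leaves an ordinary serial C program. The sketch below imitates that by defining `spawn` and `sync` as empty macros; these macro definitions are stand-ins for illustration, not how a Cilk compiler treats the keywords.

```c
#include <assert.h>

/* Hypothetical stand-ins: defining the keywords away yields a serial program. */
#define spawn
#define sync

int fib(int n) {
    assert(n >= 0);
    if (n < 2) {
        return n;
    }
    int x = spawn fib(n - 1);  /* with real Cilk, the child may run in parallel */
    int y = fib(n - 2);
    sync;                      /* with real Cilk, waits for spawned children */
    return x + y;
}
```

With the macros in place, `fib` compiles with any C compiler and computes the same result a serial execution of the Cilk program would.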
Cilk-M
A work-stealing runtime system based on Cilk that solves the cactus-stack problem by
thread-local memory mapping (TLMM).
Cilk-M Overview
Thread-local memory mapped (TLMM) region: A virtual-address range in which each thread can map physical memory independently.
[Figure: address-space layout from high to low virtual address: stack (TLMM), then heap, uninitialized data (bss), initialized data, and code (shared).]
Idea: Allocate the stacks for each worker in the TLMM region.
Basic Cilk-M Idea
Unreasonable simplification: Assume that we can map with arbitrary granularity.
[Figure: the stacks of workers P1, P2, and P3, each starting at virtual address 0x7f000, for the invocation tree in which A calls B and C, and C calls D and E. Every worker maps A (x: 42); P1 holds B (y: &x), P2 holds C (y: &x) and D, and P3 maps C (y: &x) and holds E.]
Workers achieve sharing by mapping the same physical memory at the same virtual address.
Cilk Guarantees with a Heap-Based Cactus Stack
Time bound: $T_P = T_1/P + O(T_\infty)$ ⇒ linear speedup when $P \ll T_1/T_\infty$.
Space bound: $S_P/P \le S_1$.
Does not support SP-reciprocity.
Definition.
  $T_P$: execution time on $P$ processors
  $T_1$: work
  $T_\infty$: span
  $T_1/T_\infty$: parallelism
  $S_P$: stack space on $P$ processors
  $S_1$: stack space of a serial execution
Cilk Depth
Cilk depth is the maximum number of Cilk functions nested on the stack during a serial execution.
[Figure: an invocation tree over the functions A through G.]
Cilk depth (3) is not the same as spawn depth (2).
Cilk-M Guarantees
Time bound: $T_P = T_1/P + O((S_1 + D)\,T_\infty)$ ⇒ linear speedup when $P \ll T_1 / ((S_1 + D)\,T_\infty)$.
Space bound: $S_P/P \le S_1 + D$, where $S_1$ is measured in pages.
SP reciprocity:
  No longer need to distinguish function types.
  Parallelism or not is dictated only by how a function is invoked (spawn vs. call).
Definition.
  $T_P$: execution time on $P$ processors
  $T_1$: work
  $T_\infty$: span
  $T_1/T_\infty$: parallelism
  $S_P$: stack space on $P$ processors
  $S_1$: stack space of a serial execution
  $D$: Cilk depth
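As a rough illustration of the two sufficient-parallelism conditions, the worked example below plugs in numbers. The values $T_1 = 10^9$ and $T_\infty = 10^3$ are assumptions chosen only for illustration, not measurements from the talk; $S_1 = 4$ pages and $D = 43$ are the values the talk's space-usage table lists for fib.

```latex
\begin{align*}
\text{Heap-based cactus stack:}\quad
  \frac{T_1}{T_\infty} &= \frac{10^9}{10^3} = 10^6 \\
\text{Cilk-M:}\quad
  \frac{T_1}{(S_1 + D)\,T_\infty} &= \frac{10^9}{(4 + 43)\cdot 10^3}
  = \frac{10^9}{4.7 \times 10^4} \approx 2.1 \times 10^4
\end{align*}
```

Under these assumed numbers, $P = 16$ is far below both thresholds, so both bounds predict near-linear speedup. The Cilk-M condition is stricter by the factor $S_1 + D = 47$.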
System Overview
We implemented a prototype Cilk-M runtime system based on the open-source Cilk-5 runtime system.
We modified the open-source Linux kernel (2.6.29 running on x86 64-bit CPUs) to provide support for TLMM (~600 lines of code).
We have ported the runtime system to work with Intel’s Cilk Plus compiler in place of the native Cilk Plus runtime.
Performance Comparison
[Figure: bar chart of Cilk-M running time / Cilk Plus running time (vertical axis from 0 to 1.2) for the benchmarks cholesky, cilksort, fft, fib, fib_weird, heat, lu, matmul, nqueens, qsort, rectmul, and strassen.]
Time bound: $T_P = T_1/P + C\,T_\infty$, where $C = O(S_1 + D)$
Machine: AMD 4 quad-core 2GHz Opteron, 64KB private L1, 512KB private L2, 2MB shared L3
Space Usage

| Benchmark | D | S1 | S16/16 | (S16/16) / S1 | S1 + D |
|---|---|---|---|---|---|
| cholesky | 12 | 3 | 3.44 | 1.15 | 15 |
| cilksort | 18 | 3 | 3.63 | 1.21 | 22 |
| fft | 22 | 6 | 4.81 | 0.80 | 28 |
| fib | 43 | 4 | 4.44 | 1.11 | 47 |
| fib_weird | 281 | 22 | 18.63 | 0.85 | 303 |
| heat | 10 | 2 | 2.75 | 1.38 | 12 |
| lu | 10 | 2 | 3.43 | 1.72 | 38 |
| matmul | 22 | 3 | 4.00 | 1.33 | 12 |
| nqueen | 16 | 3 | 3.38 | 1.13 | 25 |
| qsort | 72 | 6 | 6.31 | 1.05 | 19 |
| rectmul | 27 | 4 | 4.75 | 1.19 | 31 |
| strassen | 8 | 2 | 3.50 | 1.75 | 10 |

Space bound: $S_P/P \le S_1 + D$
Cilk-M’s Work-Stealing Scheduler
Each worker maintains a work deque of frames, and it manipulates the bottom of the deque like a stack [MKH90, BL94, FLR98].
[Figure: four workers (P), each with a deque of frames marked spawn or call. The animation steps through call!, spawn!, return!, and steal! events.]
When a worker runs out of work, it steals from the top of a random victim’s deque.
Theorem [BL94]: With sufficient parallelism, workers steal infrequently ⇒ linear speedup.
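The deque discipline can be sketched as follows. This is a single-threaded illustration under simplifying assumptions: frames are plain integers, the capacity is fixed, and there is no synchronization between owner and thief (a real scheduler needs a protocol such as THE for that). All names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEQUE_CAP 64

typedef struct {
    int frames[DEQUE_CAP];
    size_t top;     /* index of the oldest frame; thieves take from here */
    size_t bottom;  /* one past the newest frame; the owner works here */
} work_deque;

static void deque_init(work_deque *d) {
    d->top = 0;
    d->bottom = 0;
}

/* Owner: push a frame on spawn or call, as on a stack. */
static void push_bottom(work_deque *d, int frame) {
    assert(d->bottom < DEQUE_CAP);
    d->frames[d->bottom++] = frame;
}

/* Owner: pop the newest frame on return. Returns 0 if the deque is empty. */
static int pop_bottom(work_deque *d, int *out) {
    if (d->bottom == d->top) {
        return 0;
    }
    *out = d->frames[--d->bottom];
    return 1;
}

/* Thief: take the oldest frame from the top. Returns 0 if nothing to steal. */
static int steal_top(work_deque *d, int *out) {
    if (d->bottom == d->top) {
        return 0;
    }
    *out = d->frames[d->top++];
    return 1;
}
```

The owner and the thief work on opposite ends, so with ample parallelism they rarely contend for the same frame.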
TLMM-Based Cactus Stacks
Unreasonable simplification: Assume that we can map with arbitrary granularity.
[Figure: the stacks of workers P1, P2, and P3, each starting at virtual address 0x7f000, for the invocation tree in which A calls B and C, and C calls D and E.]
Use standard linear stack in virtual memory. P1 holds A (x: 42) and B (y: &x).
steal A: Map (not copy) the stolen prefix to the same virtual addresses. P2 now maps A (x: 42).
Subsequent spawns and calls grow downward in the thief’s TLMM region. P2 holds C (y: &x).
Both workers see the same virtual address value for &x. P2 then holds D below C.
steal C: Map (not copy) the stolen prefix to the same virtual addresses. P3 now maps A (x: 42) and C (y: &x).
Subsequent spawns and calls grow downward in the thief’s TLMM region. P3 holds E (z: &x).
All workers see the same virtual address value for &x.
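The "map, not copy" step can be approximated in user space with ordinary POSIX calls. The sketch below maps one backing page twice with `mmap` and checks that a write through one view is visible through the other. It is only an analogy: the two views sit at different virtual addresses in one address space, whereas TLMM (which needs the modified kernel) lets each thread map the page at the same virtual address. The function name `mapped_views_share_memory` is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a write through one mapping is visible through the other,
   0 if it is not, and -1 if the environment could not set up the mappings. */
static int mapped_views_share_memory(void) {
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    FILE *backing = tmpfile();  /* temporary file standing in for a stack page */
    if (backing == NULL) {
        return -1;
    }
    int fd = fileno(backing);
    if (ftruncate(fd, (off_t)page) != 0) {
        fclose(backing);
        return -1;
    }

    size_t len = (size_t)page;
    int *victim_view = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    int *thief_view = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    int result = -1;
    if (victim_view != MAP_FAILED && thief_view != MAP_FAILED) {
        victim_view[0] = 42;  /* the victim's x: 42 */
        result = (thief_view[0] == 42 && victim_view != thief_view);
    }
    if (victim_view != MAP_FAILED) {
        munmap(victim_view, len);
    }
    if (thief_view != MAP_FAILED) {
        munmap(thief_view, len);
    }
    fclose(backing);
    return result;
}
```

No bytes are copied between the views; both refer to the same underlying page, which is why a pointer such as &x stays meaningful for every worker once the addresses also coincide.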
Handling Page Granularity
[Figure: the stacks of workers P1, P2, and P3 laid out over pages at virtual addresses 0x7f000, 0x7e000, and 0x7d000 (one page size apart), for the invocation tree in which A calls B and C, and C calls D and E.]
P1 holds A and B.
steal A: Map the stolen prefix. P2 now maps A.
Advance the stack pointer ⇒ fragmentation. P2 holds C and D.
steal C: Map the stolen prefix. P3 now maps A and C.
Advance the stack pointer again ⇒ additional fragmentation. P3 holds E.
Space-reclaiming heuristic: reset the stack pointer upon successful sync.
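The "advance the stack pointer" step amounts to rounding the stack pointer to a page boundary, so the thief's new frames start on a page it does not share. The sketch below assumes a downward-growing stack and a 4 KB page, matching the 0x1000 spacing of the addresses on the slide; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE ((uintptr_t)0x1000)  /* assumed 4 KB pages */

/* The stack grows downward, so advancing means rounding down to the start
   of the current page; the next frame then lands on the page below. */
static uintptr_t advance_to_page_boundary(uintptr_t sp) {
    assert((PAGE_SIZE & (PAGE_SIZE - 1)) == 0);  /* power of two */
    return sp & ~(PAGE_SIZE - 1);
}

/* Bytes skipped by the advance: this is the fragmentation the slide mentions. */
static uintptr_t fragmentation_bytes(uintptr_t sp) {
    return sp - advance_to_page_boundary(sp);
}
```

For example, a stack pointer at 0x7e800 advances to 0x7e000 and leaves 0x800 bytes unused, which is why the talk adds a heuristic to reclaim that space on a successful sync.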
Space Bound with a Heap-Based Cactus Stack
Theorem [BL94]. Let $S_1$ be the stack space required by a serial execution of a program. The stack space per worker of a P-worker execution using a heap-based cactus stack satisfies $S_P/P \le S_1$.
Proof. The work-stealing algorithm maintains the busy-leaves property: Every active leaf frame has a worker executing on it. ■
[Figure: an invocation tree with P = 4 workers (P) on its leaves, with the depth $S_1$ marked.]
Cilk-M Space Bound
Claim. Let $S_1$ be the stack space required by a serial execution of a program. Let $D$ be the Cilk depth. The stack space per worker of a P-worker execution using a TLMM cactus stack satisfies $S_P/P \le S_1 + D$.
Proof. The work-stealing algorithm maintains the busy-leaves property: Every active leaf frame has a worker executing on it. ■
[Figure: an invocation tree with P = 4 workers (P) on its leaves, with the depth $S_1$ marked.]
Performance Bound with a Heap-Based Cactus Stack
Theorem [BL94]. A work-stealing scheduler can achieve expected running time $T_P = T_1/P + O(T_\infty)$ on $P$ processors.
Definition.
  $T_P$: execution time on $P$ processors
  $T_1$: work
  $T_\infty$: span
  $T_1/T_\infty$: parallelism
Corollary. If the computation exhibits sufficient parallelism ($P \ll T_1/T_\infty$), this bound guarantees near-perfect linear speedup ($T_1/T_P \approx P$).
Cilk-M Performance Bound
Claim. A work-stealing scheduler can achieve expected running time $T_P = T_1/P + C\,T_\infty$ on $P$ processors, where $C = O(S_1 + D)$.
Definition.
  $T_P$: execution time on $P$ processors
  $T_1$: work
  $T_\infty$: span
  $T_1/T_\infty$: parallelism
  $D$: Cilk depth
Corollary. If the computation exhibits sufficient parallelism ($P \ll T_1 / ((S_1 + D)\,T_\infty)$), this bound guarantees near-perfect linear speedup ($T_1/T_P \approx P$).
To Be or Not To Be … a Process
A Worker = A Process
  Every worker has its own page table.
  By default, nothing is shared.
  Manually (i.e., mmap) share nonstack memory.
  User calls to mmap do not work (which may include malloc).
A Worker = A Thread
  Workers share a single page table.
  By default, everything is shared.
  Reserve a region to be independently mapped.
  User calls to mmap operate properly.
Page Table for TLMM (Ideally)
[Figure: a page table with a shared entry (Page 32) and per-thread TLMM entries: TLMM 0 (Page 7), TLMM 1 (Page 12), and TLMM 2 (Page 28).]
x86: Hardware walks the page table. Each thread has a single root-page directory!
Support for TLMM
[Figure: root-page directories for Thread 0 (Page 7) and Thread 1 (Page 12), both covering Page 32.]
Must synchronize the root-page directory among threads.
Limitation of TLMM Cactus Stacks
TLMM does not work for codes that require one thread to see another thread’s stack. E.g., MCS locks [MCS91]:
  When a thread A attempts to acquire a mutual-exclusion lock L, it may be blocked if another thread B already owns L.
  Rather than spin-waiting on L itself, A adds itself to a queue associated with L and spins on a local variable LA, thereby reducing coherence traffic.
  When B releases L, it resets LA, which wakes A up for another attempt to acquire the lock.
  If A allocates LA on its stack using TLMM, LA may not be visible to B!
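A minimal C11 sketch of an MCS-style queue lock shows where the cross-thread write happens. It follows the commonly described algorithm rather than any code from the talk, and the type and function names are assumptions. The `locked` field plays the role of LA: the waiter spins on it, and the releasing thread clears it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;  /* the waiter's local spin variable (LA on the slide) */
} mcs_node;

typedef struct {
    _Atomic(mcs_node *) tail;  /* last waiter in the queue, or NULL if free */
} mcs_lock;

static void mcs_acquire(mcs_lock *lock, mcs_node *me) {
    assert(lock != NULL && me != NULL);
    atomic_store(&me->next, (mcs_node *)NULL);
    atomic_store(&me->locked, true);
    mcs_node *pred = atomic_exchange(&lock->tail, me);
    if (pred != NULL) {
        atomic_store(&pred->next, me);      /* enqueue behind the predecessor */
        while (atomic_load(&me->locked)) {  /* spin on our own node only */
        }
    }
}

static void mcs_release(mcs_lock *lock, mcs_node *me) {
    mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong(&lock->tail, &expected,
                                           (mcs_node *)NULL)) {
            return;  /* no waiter: the lock is now free */
        }
        /* A waiter is enqueueing; wait until it links itself in. */
        while ((succ = atomic_load(&me->next)) == NULL) {
        }
    }
    /* This store writes into the successor's node. If that node lived on the
       successor's TLMM stack, the releasing thread could not reach it. */
    atomic_store(&succ->locked, false);
}
```

Callers typically declare the `mcs_node` as a local variable, which is exactly the pattern that breaks when each thread's stack is mapped privately.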
Cilk-M Summary
Cilk-M is a C-based concurrency platform that satisfies all three criteria simultaneously:
  Serial-Parallel Reciprocity
  Good Performance
  Bounded and efficient use of memory for the cactus stack
Cilk-M employs:
  TLMM-based cactus stacks
  OS support for TLMM (~600 lines of code)
  Legacy compatible linkage
Outline
Cilk-M
Survey of My Other Work
  The JCilk Language
  Location-Based Memory Fences
  Ownership-Aware Transactional Memory
Direction for Future Work
The JCilk Language
  Java Core Functionalities
  Parallel Constructs from Cilk: spawn & sync
  Exception Handling
Joint work with John Danaher and Charles Leiserson
Exception Handling in a Concurrent Context
JCilk provides a faithful extension of Java’s exception mechanism consistent with Cilk’s primitives.
JCilk’s exception semantics include an implicit abort mechanism, which allows speculative parallelism to be expressed succinctly in JCilk.
Other researchers [I91, TY00, BM00] pursued new linguistic mechanisms.
The JCilk System
[Figure: the JCilk compilation pipeline. The JCilk compiler translates the JCilk source Fib.jcilk to Java + goto (Fib.jgo); the Jgo compiler (GCJ + goto support) produces Fib.class, which runs on the JVM together with the JCilk runtime system.]
What We Discovered
JCilk’s strategy of integrating multithreading with Java’s exception semantics is synergistic: it obviates the need for Cilk’s inlet and abort.
JCilk’s abort mechanism extends Java’s existing exception mechanism in a natural way to propagate an abort, allowing the programmer to clean up.
![Page 86: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/86.jpg)
![Page 87: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/87.jpg)
Dekker’s Protocol (Simplified)
Initially, L1 = 0 and L2 = 0
Thread 1
    L1 = 1;
    if (L2 == 0) {
        /* critical section */
        …
    }
    L1 = 0;
Thread 2
    L2 = 1;
    if (L1 == 0) {
        /* critical section */
        …
    }
    L2 = 0;
Reads may be reordered with older writes.
![Page 88: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/88.jpg)
Dekker’s Protocol (Simplified)
Initially, L1 = 0 and L2 = 0
Thread 1
    L1 = 1;
    mfence();
    if (L2 == 0) {
        /* critical section */
        …
    }
    L1 = 0;
Thread 2
    L2 = 1;
    mfence();
    if (L1 == 0) {
        /* critical section */
        …
    }
    L2 = 0;
The memory fences are needed, but they cause stalling.
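The fenced protocol can be sketched in portable C++ as follows. This is a minimal sketch, not the code from the talk: `std::atomic_thread_fence(std::memory_order_seq_cst)` stands in for `mfence()`, and the names `try_critical_section`, `run_dekker`, `inside`, and `violations` are hypothetical helpers added only to observe whether two critical sections ever overlap.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Flags from the slide, made atomic so the program is race-free in the C++ memory model.
std::atomic<int> L1{0}, L2{0};

// Hypothetical instrumentation (not on the slide): counts overlapping critical sections.
std::atomic<int> inside{0};
std::atomic<int> violations{0};

// One attempt of the simplified protocol: raise my flag, fence, then check the other flag.
// Returns true if the critical section was entered.
bool try_critical_section(std::atomic<int>& mine, std::atomic<int>& other) {
    mine.store(1, std::memory_order_relaxed);
    // Plays the role of mfence(): keeps the load below from being ordered before the store above.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    bool entered = false;
    if (other.load(std::memory_order_acquire) == 0) {
        /* critical section */
        if (inside.fetch_add(1) != 0) violations.fetch_add(1);
        inside.fetch_sub(1);
        entered = true;
    }
    mine.store(0, std::memory_order_release);
    return entered;
}

// Runs both threads for `iterations` attempts each and returns the number of overlaps observed.
int run_dekker(int iterations) {
    violations.store(0);
    std::thread t1([=] { for (int i = 0; i < iterations; ++i) try_critical_section(L1, L2); });
    std::thread t2([=] { for (int i = 0; i < iterations; ++i) try_critical_section(L2, L1); });
    t1.join();
    t2.join();
    return violations.load();
}
```

Calling `run_dekker(100000)` should report zero overlaps. Removing the fence makes the store-then-load pair eligible for reordering, which is the failure mode the slide describes.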
![Page 89: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/89.jpg)
Applications of Dekker’s Protocol
The THE protocol used by Cilk’s work-stealing scheduler [FLR98]: the victim vs. the thief
Java monitors using Quickly Reacquirable Locks or Biased Locking [DMS03] [OKK04]: the bias-holding thread vs. a revoker thread
JNI reentry barrier in the JVM: a Java mutator thread vs. the garbage collector
Network packet processing [VNE10]: the owner thread vs. other threads
Applications exhibit asymmetric synchronization patterns.
![Page 90: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/90.jpg)
Location-Based Memory Fences
Joint work with Edya Ladan-Mozes and Dmitriy Vyukov
We introduce location-based memory fences, which cause a thread’s instruction stream to serialize when another thread attempts to access the guarded memory location.
Some applications can benefit from a software implementation [DHY03] that uses interrupts.
A lightweight hardware mechanism can piggyback on the cache coherence protocol.
![Page 91: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/91.jpg)
![Page 92: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/92.jpg)
Transactional Memory
Transaction A
    atomic { // A
        x++;
    }
    Rset: x    Wset: x
Transaction B
    atomic { // B
        w = x;
    }
    Rset: x    Wset: w
Memory
    Rset: w,x    Wset: w,x
Transactional Memory (TM) [HM93] provides a transactional interface for accessing memory.
![Page 93: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/93.jpg)
Transactional Memory
[Same transactions A and B and the same read/write sets as on the previous slide.]
TM guarantees that transactions are serializable [P79].
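Serializability for transactions A and B can be illustrated with a small C++ sketch. It assumes the simplest possible stand-in for an `atomic { ... }` block, a single global lock; real TM systems instead track read and write sets and detect conflicts. The names `Memory`, `atomically`, `transaction_A`, `transaction_B`, and `run_both` are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>

struct Memory { int x = 0; int w = 0; };

// Stand-in for an `atomic { ... }` block: one global lock (an assumption of this sketch).
std::mutex tm_lock;

template <typename Body>
void atomically(Body body) {
    std::lock_guard<std::mutex> guard(tm_lock);
    body();
}

void transaction_A(Memory& m) { atomically([&] { m.x++; }); }      // Rset: x, Wset: x
void transaction_B(Memory& m) { atomically([&] { m.w = m.x; }); }  // Rset: x, Wset: w

// Runs A and B concurrently and returns the final memory state.
Memory run_both() {
    Memory m;
    std::thread a(transaction_A, std::ref(m));
    std::thread b(transaction_B, std::ref(m));
    a.join();
    b.join();
    return m;
}
```

Whatever the interleaving, the result matches one of the two serial orders: A then B gives `x == 1, w == 1`, and B then A gives `x == 1, w == 0`.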
![Page 94: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/94.jpg)
Nested Transactions
    atomic { // A
        int a = x;
        ...
        atomic { // B
            w++;
        }
        int b = y;
        z = x + y;
    }
B:      Rset: w    Wset: w
A:      Rset: x    Wset: (empty)
Memory: Rset: w,x,y,z    Wset: w,x,y,z
Closed-nesting: propagate the changes to A.
![Page 95: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/95.jpg)
Nested Transactions
[Same nested transactions A and B and the same read/write sets as on the previous slide.]
Open-nesting: commit the changes globally.
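The difference between the two commit rules can be sketched as operations on read/write sets. This is an illustrative model only, using the sets shown for A and B at the moment B commits; `Transaction`, `commit_closed`, and `commit_open` are hypothetical names, and a real TM would also handle values, conflicts, and aborts.

```cpp
#include <cassert>
#include <set>
#include <string>

using VarSet = std::set<std::string>;

struct Transaction {
    VarSet rset;  // locations read so far
    VarSet wset;  // locations written so far
};

// Closed nesting: the child's footprint is merged into the parent;
// nothing becomes globally visible until the parent commits.
void commit_closed(const Transaction& child, Transaction& parent) {
    parent.rset.insert(child.rset.begin(), child.rset.end());
    parent.wset.insert(child.wset.begin(), child.wset.end());
}

// Open nesting: the child's writes are published globally right away;
// the parent's footprint is left unchanged.
void commit_open(const Transaction& child, VarSet& globally_committed) {
    globally_committed.insert(child.wset.begin(), child.wset.end());
}
```

With A holding `Rset = {x}` and B holding `Rset = Wset = {w}`, a closed commit of B leaves A with `Rset = {w, x}` and `Wset = {w}`, while an open commit publishes `w` and leaves A untouched.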
![Page 96: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/96.jpg)
Nested Transactions
| | Safety | Efficiency |
|---|---|---|
| Closed Nesting [M85] | ✓ | |
| Open Nesting [MH06, MAC+06, NMA+07] | | ✓ |
All memories are treated equally; there is only one level of abstraction.
![Page 97: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/97.jpg)
Ownership-Aware Transactions (OAT)
Ownership-aware transactions are a hybrid between open nesting and closed nesting; they provide multiple levels of abstraction.
In OAT, the programmer writes code with transactional modules, and the OAT system uses the concept of ownership types [BLS03] to ensure data encapsulation within a module.
The OAT system guarantees abstract serializability as long as the program conforms to a set of well-defined constraints on how the modules share data.
Joint work with Kunal Agrawal and Jim Sukha
![Page 98: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/98.jpg)
![Page 99: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/99.jpg)
Parallelism Abstraction
Operating System
Concurrency Platform
User Application
A concurrency platform provides a layer of parallelism abstraction to help load balancing and task scheduling.
![Page 100: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/100.jpg)
Memory Abstraction
A memory abstraction provides a different “view” of a memory location depending on the execution context in which the memory access is made.
TLMM cactus stack: each worker gets its own linear local view of the tree-structured call stack.
Hyperobject [FHLL09]: a linguistic mechanism that supports coordinated local views of the same nonlocal object.
Transactional Memory [HM93]: memory accesses dynamically enclosed by an atomic block appear to occur atomically.
Can a concurrency platform also mitigate the complexity of synchronization by providing the right memory abstractions?
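The idea of coordinated local views can be sketched with plain C++ threads. This is not the Cilk hyperobject API; it only illustrates the pattern in which each worker updates a private view of a shared accumulator and the views are combined afterwards. The function name `sum_with_local_views` and the one-view-per-thread layout are assumptions of this sketch.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Sums 1..n using `workers` threads. Each worker accumulates into its own local view;
// the views are combined only after every worker has joined.
long long sum_with_local_views(int n, int workers) {
    std::vector<long long> views(workers, 0);  // one local view per worker
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&views, n, workers, w] {
            // Each worker touches only views[w], so no locking is needed here.
            for (int i = w + 1; i <= n; i += workers) views[w] += i;
        });
    }
    for (auto& t : pool) t.join();
    long long total = 0;
    for (long long v : views) total += v;  // reduce step: combine the local views
    return total;
}
```

For example, `sum_with_local_views(1000, 4)` returns 500500, the same value a serial loop would produce, without any synchronization inside the loop body.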
![Page 101: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/101.jpg)
OS / Hardware Support for Memory Abstraction
Recently, researchers have begun to explore ways to enable memory abstractions using page-mapping / page-protection mechanisms:
C# with atomic sections [AHM09] (strong atomicity)
Grace [BYL+09] (deterministic execution)
Sammati [PV10] (deadlock avoidance)
Cilk-M [LSH+10] (TLMM cactus stack)
Can we relax the limitation of manipulating virtual memory at page granularity?
![Page 102: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/102.jpg)
THANK YOU!
![Page 103: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/103.jpg)
![Page 104: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/104.jpg)
![Page 105: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/105.jpg)
Quadratic Stack Growth [Robison08]
[Figure: invocation tree of depth d built from parallel (P) and serial (S) functions connected by spawn and call edges; the pattern repeats d times.]
Assume one linear stack per worker.
![Page 106: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/106.jpg)
Quadratic Stack Growth [Robison08]
[Figure: invocation tree of depth d built from parallel (P) and serial (S) functions connected by spawn and call edges; the pattern repeats d times.]
Assume one linear stack per worker.
The green worker repeatedly blocks, then steals, using $\Theta(d^2)$ stack space.
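One way to read the quadratic figure, assuming that each of the d repetitions strands a chain of $\Theta(d)$ frames on the green worker's single linear stack (an interpretation of the diagram, not a statement taken from the slide):

```latex
S_{\text{green}} \;=\; \sum_{i=1}^{d} \Theta(d) \;=\; d \cdot \Theta(d) \;=\; \Theta(d^2)
```

Under the alternative assumption that the i-th repetition strands only $\Theta(i)$ frames, the sum becomes $\sum_{i=1}^{d} \Theta(i) = \Theta(d^2)$ as well, so the asymptotic conclusion does not depend on which reading is correct.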
![Page 107: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/107.jpg)
Performance Comparison
[Bar chart: Cilk-M running time / Cilk-5 running time (y-axis from 0 to 1.2) for the benchmarks cholesky, cilksort, fft, fib, fib_weird, heat, knapsack, lu, matmul, nqueens, rectmul, and strassen.]
Time bound: $T_P = T_1/P + C \cdot T_\infty$, where $C = O(S_1 + D)$
Machine: AMD 4 quad-core 2GHz Opteron, 64KB private L1, 512K private L2, 2MB shared L3
![Page 108: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/108.jpg)
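Space Usage (Hand Compiled)
| Benchmark | $D$ | $S_1$ | $S_{16}/16$ | $S_1 + D$ |
|---|---|---|---|---|
| cholesky | 12 | 2 | 3.19 | 14 |
| cilksort | 18 | 2 | 3.19 | 20 |
| fft | 22 | 4 | 3.69 | 26 |
| fib | 43 | 2 | 3.69 | 45 |
| fib_weird | 281 | 8 | 8.44 | 289 |
| heat | 10 | 2 | 2.38 | 12 |
| knapsack | 34 | 2 | 5.00 | 36 |
| lu | 10 | 2 | 3.19 | 12 |
| matmul | 22 | 2 | 3.31 | 24 |
| nqueen | 16 | 2 | 3.25 | 18 |
| rectmul | 27 | 2 | 4.06 | 29 |
| strassen | 8 | 2 | 3.13 | 10 |
Space bound: $S_P / P \le S_1 + D$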
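The space bound can be checked mechanically against the measurements. The sketch below copies the values of D, S1, and S16 / 16 from the table and tests each row against the bound with P = 16; the names `Row`, `kRows`, `satisfies_space_bound`, and `count_violations` are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Row {
    std::string benchmark;
    int D;                // column D
    int S1;               // column S1
    double S16_per_16;    // column S16 / 16
};

// Values copied from the table.
const std::vector<Row> kRows = {
    {"cholesky", 12, 2, 3.19},  {"cilksort", 18, 2, 3.19}, {"fft", 22, 4, 3.69},
    {"fib", 43, 2, 3.69},       {"fib_weird", 281, 8, 8.44}, {"heat", 10, 2, 2.38},
    {"knapsack", 34, 2, 5.00},  {"lu", 10, 2, 3.19},       {"matmul", 22, 2, 3.31},
    {"nqueen", 16, 2, 3.25},    {"rectmul", 27, 2, 4.06},  {"strassen", 8, 2, 3.13},
};

// True if the measured per-worker usage respects S_P / P <= S_1 + D.
bool satisfies_space_bound(const Row& r) {
    return r.S16_per_16 <= static_cast<double>(r.S1 + r.D);
}

// Number of benchmarks that exceed the bound.
int count_violations() {
    int bad = 0;
    for (const Row& r : kRows) {
        if (!satisfies_space_bound(r)) ++bad;
    }
    return bad;
}
```

Every row satisfies the bound with a wide margin; for instance, cholesky uses 3.19 per worker against a bound of 14.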
![Page 109: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/109.jpg)
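Space Usage
| Benchmark | Cilk-M $S_{16}$ | Cilk-5 $S_{16}$ | Cilk-M $H_{16}$ | Cilk-5 $H_{16}$ | Cilk-M $S_{16} + H_{16}$ | Cilk-5 $S_{16} + H_{16}$ |
|---|---|---|---|---|---|---|
| cholesky | 51 | 16 | 193 | 345 | 244 | 361 |
| cilksort | 51 | 16 | 193 | 265 | 244 | 281 |
| fft | 60 | 48 | 169 | 1017 | 229 | 1065 |
| fib | 59 | 16 | 169 | 185 | 228 | 201 |
| fib_weird | 135 | 64 | 217 | 217 | 353 | 281 |
| heat | 38 | 16 | 209 | 273 | 247 | 289 |
| knapsack | 80 | 16 | 169 | 361 | 249 | 377 |
| lu | 51 | 16 | 185 | 265 | 236 | 281 |
| matmul | 53 | 16 | 169 | 257 | 222 | 273 |
| nqueen | 52 | 16 | 161 | 249 | 213 | 265 |
| rectmul | 65 | 32 | 169 | 240 | 234 | 272 |
| strassen | 50 | 16 | 161 | 417 | 211 | 433 |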
![Page 110: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/110.jpg)
GCC/Linux C Subroutine Linkage
[Stack diagram, high to low addresses: args to A; A’s return address; A’s parent’s base ptr; A’s local variables; args to B; B’s return address; A’s base pointer (bp); B’s local variables; args to B’s callees (sp). The frame for A and the frame for B overlap in the linkage region that holds the args to B.]
The legacy linear stack obtains efficiency by overlapping frames.
![Page 111: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/111.jpg)
Handling Page Granularity
[Diagram: stacks of workers P1, P2, P3 laid out over pages at 0x7d000, 0x7e000, and 0x7f000 (one page size apart). P1 holds A and B; a thief steals A and holds its own view of A.]
The thief advances its stack pointer past the next page boundary, reserving space for the linkage region for the next callee.
![Page 112: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/112.jpg)
Handling Page Granularity
[Diagram: pages at 0x7d000, 0x7e000, and 0x7f000 (one page size apart). P1 holds A and B; the thief holds A, C, and D.]
The thief advances its stack pointer past the next page boundary, reserving space for the linkage region for the next callee.
![Page 113: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/113.jpg)
Handling Page Granularity
[Diagram: pages at 0x7d000, 0x7e000, and 0x7f000 (one page size apart). P1 holds A and B; the first thief holds A, C, and D; a second thief steals C and holds A and C.]
The thief advances its stack pointer past the next page boundary, reserving space for the linkage region for the next callee.
![Page 114: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/114.jpg)
Handling Page Granularity
[Diagram: pages at 0x7d000, 0x7e000, and 0x7f000 (one page size apart). P1 holds A and B; P2 holds A, C, and D; P3 holds A, C, and E.]
The thief advances its stack pointer past the next page boundary, reserving space for the linkage region for the next callee.
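The stack-pointer adjustment can be sketched as simple address arithmetic. This assumes a downward-growing stack (so "advancing" means moving to a lower address) and the 0x1000 page size implied by the addresses in the diagram; the function names and the `linkage_bytes` parameter are hypothetical, and the runtime's actual computation is not shown in the slides.

```cpp
#include <cassert>
#include <cstdint>

// Page size implied by the 0x7d000 / 0x7e000 / 0x7f000 spacing in the diagram.
constexpr std::uintptr_t kPageSize = 0x1000;

// Highest page boundary strictly below sp (the stack is assumed to grow downward).
constexpr std::uintptr_t page_boundary_below(std::uintptr_t sp) {
    return (sp - 1) & ~(kPageSize - 1);
}

// New stack pointer for the thief: skip to the next page boundary, then
// reserve `linkage_bytes` for the linkage region shared with the next callee.
constexpr std::uintptr_t advance_past_boundary(std::uintptr_t sp, std::uintptr_t linkage_bytes) {
    return page_boundary_below(sp) - linkage_bytes;
}
```

For example, a thief whose stolen frame ends at 0x7e840 would move to the boundary at 0x7e000 and, with a 0x40-byte linkage region, start its next frame at 0x7dfc0, so its own frames never share a page with the stolen ones.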
![Page 115: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/115.jpg)
Key Invocation Invariants
1. Arguments are passed via stack pointer with positive offset.
2. Local variables are referenced via base pointer with negative offset.
3. Live registers are flushed onto the stack immediately before each spawn.
4. Live registers are flushed onto the stack before returning back to runtime if sync fails.
5. When resuming a stolen function after a spawn or sync, live registers are restored from the stack.
6. When returning from a spawn, the return value is flushed from its register onto the stack.
7. The frame size is fixed before any spawn statements.
![Page 116: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/116.jpg)
GCC/Linux C Subroutine Linkage
[Stack diagram, high to low addresses: args to A; A’s return address; A’s parent’s base ptr (bp); A’s local variables; args to A’s callees (sp). The frame for A is marked.]
Legacy linear stacks enable efficient passing of arguments from caller to callee.
![Page 117: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/117.jpg)
GCC/Linux C Subroutine Linkage
[Stack diagram, high to low addresses: args to A (linkage region); A’s return address; A’s parent’s base ptr (bp); A’s local variables; args to A’s callees (sp). The frame for A is marked.]
Frame A accesses its arguments through positive offsets indexed from its base pointer.
![Page 118: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/118.jpg)
GCC/Linux C Subroutine Linkage
[Stack diagram, high to low addresses: args to A; A’s return address; A’s parent’s base ptr (bp); A’s local variables; args to A’s callees (sp). The frame for A is marked.]
Frame A accesses its local variables through negative offsets indexed from its base pointer.
![Page 119: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/119.jpg)
GCC/Linux C Subroutine Linkage
[Stack diagram, high to low addresses: args to A; A’s return address; A’s parent’s base ptr (bp); A’s local variables; args to A’s callees, now holding the args to B (linkage region, sp). The frame for A is marked.]
Before invoking B, A places the arguments for B into the reserved linkage region it will share with B, which A indexes using positive offsets off its stack pointer.
![Page 120: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/120.jpg)
GCC/Linux C Subroutine Linkage
[Stack diagram, high to low addresses: args to A; A’s return address; A’s parent’s base ptr (bp); A’s local variables; args to B; B’s return address (sp). The frame for A is marked.]
A then makes the call to B, which saves the return address for B and transfers control to B.
![Page 121: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/121.jpg)
GCC/Linux C Subroutine Linkage
[Stack diagram, high to low addresses: args to A; A’s return address; A’s parent’s base ptr (old bp); A’s local variables; args to B; B’s return address; A’s base pointer (new bp, sp). The frame for A is marked.]
Upon entering, B saves A’s base pointer and sets the base pointer to where the stack pointer is.
![Page 122: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/122.jpg)
GCC/Linux C Subroutine Linkage
[Stack diagram, high to low addresses: args to A; A’s return address; A’s parent’s base ptr; A’s local variables; args to B; B’s return address; A’s base pointer (bp); B’s local variables; args to B’s callees (sp). The frames for A and B are marked.]
B advances the stack pointer to allocate space for its local variables and linkage region.
![Page 123: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/123.jpg)
GCC/Linux C Subroutine Linkage
[Stack diagram, high to low addresses: args to A; A’s return address; A’s parent’s base ptr; A’s local variables; args to B; B’s return address; A’s base pointer (bp); B’s local variables; args to B’s callees (sp). The frames for A and B are marked and overlap at the args to B.]
The legacy linear stack obtains efficiency by overlapping frames.
![Page 124: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/124.jpg)
Legacy Linear Stack
[Diagram: an invocation tree (A calls B and C; C calls D and E) next to the corresponding linear stack, which runs from high addresses to low addresses.]
An execution of a serial Algol-like language can be viewed as a serial walk of an invocation tree.
![Page 125: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/125.jpg)
Legacy Linear Stack
[Diagram: the invocation tree next to a linear stack holding frames A, C, and E from high to low addresses; an ancestor frame holds x: 42 and a descendant frame holds y: &x.]
Rule for pointers: A parent can pass pointers to its stack variables down to its children, but a child cannot pass a pointer to its stack variable up to its parent.
![Page 126: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/126.jpg)
Legacy Linear Stack
[Diagram: the invocation tree next to a linear stack holding frames A, C, and E from high to low addresses; an ancestor frame holds y: &z and a descendant frame holds z: 42, marked ✗.]
Rule for pointers: A parent can pass pointers to its stack variables down to its children, but a child cannot pass a pointer to its stack variable up to its parent.
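The pointer rule corresponds to ordinary C/C++ lifetime rules, as the following minimal sketch shows. The function names are hypothetical; the disallowed direction is left commented out because dereferencing its result would be undefined behavior.

```cpp
#include <cassert>

// Child: may dereference a pointer into an ancestor's frame, because that frame is still live.
int child_reads_parent(const int* px) {
    return *px + 1;
}

// Parent: passes a pointer to its own stack variable down to the child (allowed).
int parent_passes_down() {
    int x = 42;
    return child_reads_parent(&x);
}

// Not allowed: a child handing a pointer to its own stack variable up to its parent.
// The frame holding z is popped on return, so the pointer would dangle.
// int* child_passes_up() { int z = 42; return &z; }
```

Here `parent_passes_down()` returns 43, since `x` outlives the call to the child.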
![Page 127: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/127.jpg)
The Queens Problem
Given n > 0, search for one way to arrange n queens on an n-by-n chessboard so that none attacks another.
[Figure: a legal configuration and an illegal configuration.]
![Page 128: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/128.jpg)
Exploring the Search Tree for Queens
[Search tree: start → r0,c0 … r0,c3; under each of these, r1,c0 … r1,c3; then r2,c0 … and so on.]
Serial strategy: Depth-first search with backtracking. The search tree size grows exponentially as n increases.
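The serial strategy can be sketched as a standard depth-first search with backtracking. This is a generic textbook formulation, not code from the talk; `safe`, `place`, and `solve_queens` are hypothetical names, and `cols[r]` stores the column of the queen in row `r`.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// True if a queen at (row, col) is not attacked by the queens already placed in rows 0..row-1.
bool safe(const std::vector<int>& cols, int row, int col) {
    for (int r = 0; r < row; ++r) {
        int c = cols[r];
        if (c == col || std::abs(c - col) == row - r) return false;  // same column or diagonal
    }
    return true;
}

// Tries every column in `row`, recursing into the next row; returns true once all rows are filled.
bool place(std::vector<int>& cols, int row, int n) {
    if (row == n) return true;
    for (int col = 0; col < n; ++col) {
        if (safe(cols, row, col)) {
            cols[row] = col;
            if (place(cols, row + 1, n)) return true;
            // Otherwise backtrack and try the next column.
        }
    }
    return false;
}

// Returns one solution (column per row), or an empty vector if none exists.
std::vector<int> solve_queens(int n) {
    std::vector<int> cols(n, -1);
    if (!place(cols, 0, n)) cols.clear();
    return cols;
}
```

For n = 4 the search backtracks out of the subtree under r0,c0 and finds the solution with columns 1, 3, 0, 2 under r0,c1; for n = 3 it exhausts the tree and returns an empty result.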
![Page 129: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/129.jpg)
Exploring the Search Tree for Queens
[Search tree: start → r0,c0 … r0,c3; under each of these, r1,c0 … r1,c3; then r2,c0 … and so on.]
Parallel strategy: spawn searches in parallel. Speculative computation: some work may be wasted.
![Page 130: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/130.jpg)
Exploring the Search Tree for Queens
[Search tree: start → r0,c0 … r0,c3; under each of these, r1,c0 … r1,c3; then r2,c0 … and so on.]
Parallel strategy: spawn searches in parallel. Abort other parallel searches once a solution is found.
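The speculative strategy can be sketched with plain C++ threads. This is not JCilk or Cilk code: `std::thread` stands in for spawn, joining stands in for sync, and a shared atomic flag polled by each search stands in for the abort mechanism. All function names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>
#include <mutex>
#include <thread>
#include <vector>

// True if a queen at (row, col) is not attacked by the queens in rows 0..row-1.
bool safe(const std::vector<int>& cols, int row, int col) {
    for (int r = 0; r < row; ++r) {
        if (cols[r] == col || std::abs(cols[r] - col) == row - r) return false;
    }
    return true;
}

// Depth-first search that gives up as soon as another search has set `abort_flag`.
bool place(std::vector<int>& cols, int row, int n, const std::atomic<bool>& abort_flag) {
    if (abort_flag.load(std::memory_order_relaxed)) return false;  // aborted
    if (row == n) return true;
    for (int col = 0; col < n; ++col) {
        if (safe(cols, row, col)) {
            cols[row] = col;
            if (place(cols, row + 1, n, abort_flag)) return true;
        }
    }
    return false;
}

// Checks a complete board; used to validate whichever solution wins the race.
bool is_valid(const std::vector<int>& cols) {
    for (int r = 0; r < static_cast<int>(cols.size()); ++r) {
        if (!safe(cols, r, cols[r])) return false;
    }
    return true;
}

// Starts one search per first-row column; the first to succeed publishes its board and aborts the rest.
std::vector<int> solve_queens_parallel(int n) {
    std::atomic<bool> found{false};
    std::mutex result_lock;
    std::vector<int> result;
    std::vector<std::thread> searches;
    for (int c0 = 0; c0 < n; ++c0) {
        searches.emplace_back([&, c0] {
            std::vector<int> cols(n, -1);
            cols[0] = c0;
            if (place(cols, 1, n, found)) {
                std::lock_guard<std::mutex> guard(result_lock);
                if (!found.exchange(true)) result = cols;  // first finisher wins
            }
        });
    }
    for (auto& t : searches) t.join();  // plays the role of sync
    return result;
}
```

Which solution is returned depends on scheduling, so callers should check validity rather than compare against a fixed board. Searches that were still running when the flag was set represent the wasted speculative work.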
![Page 131: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/131.jpg)
Various Parallel Programming Models
![Page 132: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/132.jpg)
Parallelize Your Code using Cilk++
    class SAT_Solver {
    public:
        int solve( … );
        …
    private:
        …
    };
1. Convert the entire code base to the Cilk++ language.
2. Structure the project so that Cilk++ code calls C++ code, but not conversely.
3. Allow C++ functions to call Cilk++ functions, but convert the entire subtree to use Cilk++.
   a. Use C++ wrapper functions
   b. Use “extern C++”
   c. Limited call back to C++ code
![Page 133: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/133.jpg)
Parallelize Your Code using TBB
    class SAT_Solver {
    public:
        int solve( … );
        …
    private:
        …
    };
1. Convert the entire project to the Cilk++ language.
2. Structure the project so that Cilk++ code calls C++ code, but not conversely.
3. Allow C++ functions to call Cilk++ functions, but convert the entire subtree to use Cilk++.
   a. Use C++ wrapper functions
   b. Use “extern C++”
   c. Limited call back to C++ code
Your program may end up using a lot more stack space or fail to get good speedup.
![Page 134: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/134.jpg)
Multicore Architecture: 2001*
[Diagram: a chip multiprocessor (CMP) in which several processors (P), each with its own cache ($), are connected through a network to memory.]
*The first non-embedded multicore microprocessor was the Power4 from IBM (2001).
![Page 135: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/135.jpg)
The Era of Multicore IS Here
[Bar chart: # of CPUs (y-axis, 0 to 50) by # of cores (x-axis: 1, 2, 3, 4, 6, 8, 12), with separate bars for desktop and server processors. Source: www.newegg.com]
The single-core processor is becoming obsolete.
![Page 136: Mit cilk](https://reader038.vdocuments.us/reader038/viewer/2022102717/5589f350d8b42ace6e8b4688/html5/thumbnails/136.jpg)
My Sister Is Buying a New Laptop …
| Model | Display | Processor Type | Number of Cores |
|---|---|---|---|
| MacBook | 13.3” | 2.4GHz Intel Core 2 | 2 |
| MacBook Pro | 13” | 2.3-2.7GHz Intel Core i5 / i7 | 2 |
| MacBook Pro | 15” | 2.0-2.3GHz Intel Core i7 | 4 |
| MacBook Pro | 17” | 2.2-2.3GHz Intel Core i7 | 4 |
| MacBook Air | 11” | 1.4-1.6GHz Intel Core 2 | 2 |
| MacBook Air | 13” | 1.86-2.13GHz Intel Core 2 | 2 |
Source: www.apple.com
The era of multicore IS here!