cso.io
Scalable Performance for Scala Message-Passing Concurrency
Andrew Bate
Department of Computer Science University of Oxford
Motivation
Multi-core commodity hardware
Non-uniform shared memory
Expose potential parallelism
Correctness and formal verification
Compatibility
int arr[x][y];
EMBEDDED DOMAIN-SPECIFIC LANGUAGE
1. Embedded DSL
2. Bytecode rewriting
3. Channels
4. Scheduler
5. Deadlock detection
Why an Embedded DSL?
Ease of implementation
Leverage existing tools
Leverage known syntax
Higher-order functions
Rich type system
Lightweight syntax
Compile-time macros
Examples
def map[I, O](f: I => O)(in: ?[I], out: ![O]) = proc {
  repeat { out ! (f(in?)) }
  run (proc { in.closein } || proc { out.closeout })
}
[Diagram: map f reads each value v from in and writes f(v) to out.]
Examples
def tee[@specialized T](in: ?[T], outs: Seq[![T]]) = proc {
  // Typed as T so that the assignment v = in? type-checks
  var v = null.asInstanceOf[T]
  val outputs = (|| (out <- outs) proc { out ! v })
  repeat { v = in?; run outputs }
  run (proc { in.closein } || (|| (out <- outs) proc { out.closeout }))
}
[Diagram: tee reads each value v from in and writes it to every output out1, out2, …, outn.]
SCALABLE PERFORMANCE through bytecode rewriting
CPS Transformation
[Diagram: an original method (Init, Call n(), Return) beside its transformed form (Prelude, Init, Pre-call, Call n(f), Post-call, Return), with edges labelled "rewinding" and "pausing".]
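The pausing and rewinding idea can be illustrated by hand on a single method. The sketch below is not the bytecode that the rewriter generates; `Fiber`, `Frame`, `read`, and `addTen` are hypothetical names, and the layout of the saved state is an assumption made for the example. The untransformed method would be `def addTen(): Int = { val base = 10; base + read() }`.

```scala
// A hand-written sketch of a pausable method. All names are hypothetical.
object PauseRewindSketch {
  /** Saved activation record: the live locals of one paused method. */
  final case class Frame(locals: Vector[Int])

  /** The extra argument f threaded through every transformed call. */
  final class Fiber {
    var pausing = false               // set by a callee that cannot proceed
    var frames: List[Frame] = Nil     // frames saved while unwinding
    var inbox: Option[Int] = None     // stands in for a channel
    def rewinding: Boolean = !pausing && frames.nonEmpty
    def resume(): Unit = { pausing = false }
  }

  /** A pausable leaf operation: returns a value, or requests a pause. */
  def read(f: Fiber): Int = f.inbox match {
    case Some(v) => f.inbox = None; v
    case None    => f.pausing = true; 0
  }

  /** Transformed form of: val base = 10; base + read() */
  def addTen(f: Fiber): Int = {
    var base = 0
    if (f.rewinding) {
      // Prelude: restore saved locals and skip Init
      base = f.frames.head.locals(0)
      f.frames = f.frames.tail
    } else {
      base = 10                       // Init
    }
    val v = read(f)                   // Call n(f)
    if (f.pausing) {
      // Post-call: save live locals and unwind with a dummy result
      f.frames = Frame(Vector(base)) :: f.frames
      0
    } else base + v                   // Return
  }

  /** Runs addTen until it pauses, supplies a value, then rewinds it. */
  def demo(): (Boolean, Int) = {
    val f = new Fiber
    addTen(f)                         // no value available: the method pauses
    val paused = f.pausing
    f.inbox = Some(32)
    f.resume()
    (paused, addTen(f))               // rewinds to the call site and finishes
  }
}
```

Only the locals that are still live after the call need to be saved in the frame, which is one reason a live variable analysis is useful for this kind of transformation.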
Analysing the call graph
[Diagram: call graph over the methods do(), x(), y() and z() and the channel operation ?(); do() and y() are marked.]
Transform these methods: do(), y()
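One way to pick the methods to transform is a backwards reachability pass over the call graph: mark every method that can transitively reach a pausable operation. The sketch below assumes this is the selection rule; the edges in `calls` are hypothetical, since the slide's diagram does not survive in text form, and `methodsToTransform` is an illustrative name.

```scala
// Sketch: select the methods that can reach a pausable operation.
object CallGraphSketch {
  // Hypothetical call graph (caller -> callees); the real edges are not known.
  val calls: Map[String, Set[String]] = Map(
    "do" -> Set("x", "y"),
    "x"  -> Set("z"),
    "y"  -> Set("?"),
    "z"  -> Set.empty[String]
  )

  // Operations that may pause, such as a channel read.
  val pausable: Set[String] = Set("?")

  /** Grows the marked set until no caller of a marked method is left out. */
  def methodsToTransform(calls: Map[String, Set[String]],
                         pausable: Set[String]): Set[String] = {
    @annotation.tailrec
    def grow(marked: Set[String]): Set[String] = {
      val callers = calls.collect {
        case (caller, callees) if callees.exists(marked) => caller
      }
      val next = marked ++ callers
      if (next == marked) marked else grow(next)
    }
    grow(pausable) -- pausable
  }

  def main(args: Array[String]): Unit =
    println(methodsToTransform(calls, pausable).toList.sorted.mkString(", "))
}
```

With these assumed edges, x() and z() never reach ?() and are left untouched, so they pay no cost for the transformation.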
Engineering
Live variable analysis
Lazy load and store
Constant inlining
Functional Expressions
for (i <- 0 until n; j <- i until n) println(i)
Compiles to
intWrapper(0).until(n).foreach(
  (i: Int) => intWrapper(i).until(n).foreach((j: Int) => println(i))
)
Transforms to
var i = 0
while (i < n) {
  var j = i
  while (j < n) { println(i); j += 1 }
  i += 1
}
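The two source-level forms can be checked against each other directly. The sketch below only compares behaviour; it does not exercise the rewriter. It records the values in a buffer instead of printing them, and `viaFor` and `viaWhile` are hypothetical names.

```scala
import scala.collection.mutable.ArrayBuffer

// Sketch: the for expression and the hand-written while loops visit the
// same values in the same order.
object LoopFormsSketch {
  /** The for expression from the slide, recording i instead of printing it. */
  def viaFor(n: Int): Vector[Int] = {
    val out = ArrayBuffer.empty[Int]
    for (i <- 0 until n; j <- i until n) out += i
    out.toVector
  }

  /** The while-loop form, which allocates no Range and no closures. */
  def viaWhile(n: Int): Vector[Int] = {
    val out = ArrayBuffer.empty[Int]
    var i = 0
    while (i < n) {
      var j = i
      while (j < n) { out += i; j += 1 }
      i += 1
    }
    out.toVector
  }

  def main(args: Array[String]): Unit =
    println(viaFor(3) == viaWhile(3))
}
```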
More Features
Tail call optimisations
Shared memory
SBT plugin support
CHANNELS
More Features
Generalised alt
Specialization for primitives
Optimised extended rendezvous
SCHEDULER
Scheduler States
Created
Waiting
Terminated
Paused
Running
Scheduling: Central FIFO
[Diagram: Thread 1, Thread 2, …, Thread m all take work from a single Scheduler that holds one queue of processes P1, P2, P3, …, Pn.]
Scheduling: FIFO per thread
[Diagram: Thread 1, Thread 2, …, Thread m each have their own Scheduler with its own queue of processes (P1, P3, …, Pn).]
Scheduling: Batches per thread
[Diagram: Thread 1, Thread 2, …, Thread m, each with its own Scheduler; the contents of the per-thread queues are shown as "?".]
Scheduling: Batches per thread
[Diagram: one Scheduler holding batches of processes: P1, P2, …, Pn; Q1, …, Qm; R1, …, Rk.]
Dispatch Count = max(const × Batch Length, Dispatch Limit)
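The dispatch count formula can be tried out on small batches. The sketch below assumes, beyond what the slide states, that the count bounds how many process dispatches a thread performs on one batch before moving on, and that processes within a batch are dispatched round-robin. `dispatchCount`, `runBatch`, and the tuning values used in `main` are hypothetical.

```scala
// Sketch of the dispatch count formula under the assumptions stated above.
object BatchDispatchSketch {
  /** Dispatch Count = max(const x Batch Length, Dispatch Limit). */
  def dispatchCount(batchLength: Int, const: Int, dispatchLimit: Int): Int =
    math.max(const * batchLength, dispatchLimit)

  /** Trace of the processes dispatched while one thread works on one batch. */
  def runBatch(batch: Vector[String], const: Int, dispatchLimit: Int): Vector[String] =
    if (batch.isEmpty) Vector.empty
    else {
      val count = dispatchCount(batch.length, const, dispatchLimit)
      Vector.tabulate(count)(k => batch(k % batch.length))
    }

  def main(args: Array[String]): Unit = {
    // Assumed tuning values: const = 2, Dispatch Limit = 4.
    println(runBatch(Vector("P1", "P2", "P3"), 2, 4).mkString(" "))
    println(runBatch(Vector("Q1"), 2, 4).mkString(" "))
  }
}
```

Under these assumptions a long batch gets a budget proportional to its length, while a very short batch still gets at least Dispatch Limit dispatches.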
DEADLOCK DETECTION
Example
[Diagram: a process network built from Prefix 1, three Tee processes, the multipliers x2, x3 and x5, two Merge processes, and Console.]
Example
[Diagram: the same network, with the pending output (!) and input (?) requests marked on the channels.]
Deadlock detected! The cycle of ungranted requests is:
Prefix1 -!-> Tee1
Tee1 -!-> Tee2
Tee2 -!-> Tee3
Tee3 -!-> x5
x5 -!-> Merge2
Merge2 -!-> Prefix1
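The reported cycle can be reproduced with a simple walk over a wait-for graph. The sketch below is an illustration of cycle detection in general, not the detector used by the library; it assumes each blocked process waits on exactly one peer, and `waitsFor` and `findCycle` are hypothetical names. The edges are the ungranted requests from the message above.

```scala
// Sketch: find a cycle of ungranted requests in a wait-for graph.
object DeadlockSketch {
  // Each blocked process mapped to the peer whose response it is waiting for.
  val waitsFor: Map[String, String] = Map(
    "Prefix1" -> "Tee1",
    "Tee1"    -> "Tee2",
    "Tee2"    -> "Tee3",
    "Tee3"    -> "x5",
    "x5"      -> "Merge2",
    "Merge2"  -> "Prefix1"
  )

  /** Follows requests from start; returns the cycle reached, if any. */
  def findCycle(waitsFor: Map[String, String], start: String): Option[List[String]] = {
    @annotation.tailrec
    def walk(current: String, path: Vector[String]): Option[List[String]] = {
      val seenAt = path.indexOf(current)
      if (seenAt >= 0) Some(path.drop(seenAt).toList) // back on the path: a cycle
      else waitsFor.get(current) match {
        case Some(next) => walk(next, path :+ current)
        case None       => None                       // current is not blocked
      }
    }
    walk(start, Vector.empty)
  }

  def main(args: Array[String]): Unit =
    findCycle(waitsFor, "Prefix1").foreach(c => println(c.mkString(" -!-> ")))
}
```

A process that is still able to run has no entry in the map, so any chain of requests that reaches it ends the walk without reporting a deadlock.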
PERFORMANCE EVALUATION
Ring topology
[Chart, logarithmic scale: time to pass a message 300 times around an n-process ring (ms) against the number n of processes spawned. Series: CSO2 FIFO Scheduler, Java primitives, CSO2 Batch Scheduler.]
Ring topology
[Chart, linear scale from 0 to 300000: time to pass a message 300 times around an n-process ring (ms) against the number n of processes spawned. Series: CSO2 FIFO Scheduler, Java primitives, CSO2 Batch Scheduler.]
Fully connected topology
[Chart, logarithmic scale: time to pass n² messages (ms) against the number n of processes / actors spawned. Series: Erlang, Scala Actors, JCSP, Java Primitives, Occam, CSO2 FIFO Scheduler, CSO2 Batch Scheduler, Go. The two CSO2 series are labelled on the plot.]
Fully connected topology
[Chart, linear scale from 0 to 2000000: time to pass n² messages (ms) against the number n of processes / actors spawned. Series: Erlang, Scala Actors, JCSP, Java Primitives, Occam, CSO2 FIFO Scheduler, CSO2 Batch Scheduler, Go. The two CSO2 series are labelled on the plot.]
Fully connected topology
[Chart, linear scale from 0 to 60000: time to pass n² messages (ms) against the number n of processes / actors spawned. Series: JCSP, Occam, CSO2 FIFO Scheduler, CSO2 Batch Scheduler, Go. The two CSO2 series are labelled on the plot.]
Fully connected topology
[Chart, linear scale from 0 to 16000: time to pass n² messages (ms) against the number n of processes / actors spawned. Series: CSO2 Batch Scheduler, CSO2 FIFO Scheduler, Go.]
Summary
• High-performance library for building massively concurrent systems on the JVM
• Deadlock detection
• Outperforms Java primitives, JCSP, Scala Actors and Occam, and comes very close to Go