SimGrid Kernel 101
Introducing the SimGrid Kernel
Da SimGrid Team
May 27, 2014
About this Presentation
Goals and Contents
- Present Simix, the simulation kernel of SimGrid
- Show some gory details about where our performance comes from
- This is NOT for newcomers but for hardcore SimGrid users (or curious folks)
The SimGrid 101 series
- This is part of a series of presentations introducing various aspects of SimGrid
  - SimGrid 101. Introduction to the SimGrid Scientific Project
  - SimGrid User 101. Practical introduction to SimGrid and MSG
  - SimGrid User::Platform 101. Defining platforms and experiments in SimGrid
  - SimGrid User::SimDag 101. Practical introduction to the use of SimDag
  - SimGrid User::Visualization 101. Visualization of SimGrid simulation results
  - SimGrid User::SMPI 101. Simulating MPI applications in practice
  - SimGrid User::Model-checking 101. Formal Verification of SimGrid programs
  - SimGrid Internal::Models. The Platform Models underlying SimGrid
  - SimGrid Internal::Kernel. Under the Hood of SimGrid
- Get them from http://simgrid.gforge.inria.fr/documentation.html
Da SimGrid Team Kernel 101 Introduction Basics Simulated OS Parallel? Parallel! Evaluation CC 2/15
SimGrid Internals in a Nutshell
Example of user code to execute
Alice
    Send "toto" to Bob
    Listen from Bob
Bob
    Listen from Alice
    Send "blah" to Alice
SimGrid internal Main Loop
1. Run every ready user process in a row
   - Each wants to consume resources
   - Assign actions on resources
2. Compute share for actions
3. Get earliest finishing action
4. Unlock user code waiting on this action
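Steps 2 and 3 of the loop above can be illustrated with a minimal Python sketch. It assumes a single resource whose capacity is split equally among the actions using it; this equal split is a simplification chosen for illustration, not the sharing model SimGrid actually uses, and the names `Action`, `share_equally`, `earliest_finishing` and `advance` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Action:
    name: str
    remaining: float  # amount of work left (e.g. bytes or flops)


def share_equally(actions, capacity):
    """Step 2 (simplified): give each action an equal share of one resource."""
    rate = capacity / len(actions)
    return {a.name: rate for a in actions}


def earliest_finishing(actions, shares):
    """Step 3: find the action that completes first and the delay until then."""
    first = min(actions, key=lambda a: a.remaining / shares[a.name])
    return first, first.remaining / shares[first.name]


def advance(actions, shares, delay):
    """Move simulated time forward by `delay`, consuming work on every action."""
    for a in actions:
        a.remaining = max(0.0, a.remaining - shares[a.name] * delay)


# Hypothetical workload: two transfers competing for a capacity of 10 units/s.
actions = [Action("send toto", 100.0), Action("send blah", 40.0)]
shares = share_equally(actions, capacity=10.0)
first, delay = earliest_finishing(actions, shares)
advance(actions, shares, delay)
print(first.name, delay)  # send blah 8.0
```

Step 4 would then wake the process blocked on `first`, and the next round recomputes the shares for the actions still in flight.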
SimGrid Functional Organization
- MSG: User-friendly syntactic sugar
- Simix: Processes, synchro (SimPosix)
- SURF: Resources usage interface
- Models: Action completion computation
[Figure: simulated-time diagram of Alice, the maestro (simulation kernel: "who's next?") and Bob exchanging the "toto" and "blah" messages, next to the MSG / SIMIX / SURF / LMM layers: processes running user code and conditions, actions with their work, remaining work and variable, and LMM variables x1 ... xn under constraints ≤ CP, ≤ CL1 ... ≤ CL4]
Introduction Example
function P1
    // Compute... Send()...
end function

function P2
    // Compute... Recv()...
end function

time ← 0
Ptime ← {P1, P2}
while Ptime ≠ ∅ do
    schedule(Ptime)
    time ← solve(&done_actions)
    Ptime ← proc_unblock(done_actions)
end while

SimGrid's Main Loop
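The loop above can be sketched as runnable Python, under strong simplifying assumptions: user processes are generators that yield a hypothetical `(kind, duration)` pair each time they block, every action lasts a fixed time (no resource sharing), and a Send is not matched against a Recv. The function name `main_loop` and the durations are made up for illustration.

```python
import heapq


def p1():
    yield ("compute", 2.0)  # Compute...
    yield ("send", 1.0)     # Send()...


def p2():
    yield ("compute", 1.0)  # Compute...
    yield ("recv", 3.0)     # Recv()...


def main_loop(processes):
    time = 0.0
    ready = dict(processes)  # Ptime: processes that can run right now
    pending = []             # heap of (finish_time, name, kind, process)
    trace = []
    while ready:
        # schedule(Ptime): run each ready process until it blocks on an action
        for name, proc in ready.items():
            action = next(proc, None)  # None means the process terminated
            if action is not None:
                kind, duration = action
                heapq.heappush(pending, (time + duration, name, kind, proc))
        if not pending:
            break
        # solve(): jump to the completion date of the earliest action(s)
        time = pending[0][0]
        done = []
        while pending and pending[0][0] == time:
            done.append(heapq.heappop(pending))
        # proc_unblock(): processes waiting on finished actions become ready
        ready = {}
        for _, name, kind, proc in done:
            trace.append((time, name, kind))
            ready[name] = proc
    return time, trace


final_time, trace = main_loop({"P1": p1(), "P2": p2()})
print(final_time)  # 4.0
```

Simulated time only moves inside the solve step; user code itself runs in zero simulated time between two blocking calls.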
Simix as an OS (Operating Simulator)
Requirements
- User code must run in a thread-like thing whose scheduling we control
- We want portability ⇒ generic mechanisms; several implementations
- We want to run the processes in parallel; we want model-checking ⇒ isolate processes from each other
- We want it as efficient as possible ⇒ that's what an OS does!
Chosen Design
- Processes are perfectly isolated from their environment
  - simcalls: the only way of interacting with others or with the platform
  - The maestro runs that code "in kernel mode"
- Processes virtualized with context factories
  - Threads (pthread/win); ucontexts; raw assembly
  - Java contexts, Java continuations, Ruby contexts
How efficiently can we simulate P2P protocols?
P2P is a nightmare for the simulator
- People want huge fine-grained systems (many events in large platforms)
- As a result, no standard tool. Many short-lived ones (even one-shot ones)
- If we manage to be efficient on this workload, others will be easy
PeerSim
- Simple enough to get adapted, but no network in the model (abstracted)
- Query-cycle mode (application as automata): 10^6 nodes; DES: 10^3
- Query-cycle: user-unfriendly way to express dist. apps; DES: sequential
OverSim
- Scalable: 10^5 nodes using simplistic network models
- Realistic: can leverage the OMNeT++ packet-level simulator
- Simplistic models are sequential, parallel OMNeT++ seemingly never used
PlanetSim
- Parallel execution, but query-cycle mode only (embarrassingly parallel)
Parallel P2P simulators: the dPeerSim attempt
dPeerSim
- Parallel implementation of PeerSim/DES (not by PeerSim main authors)
- Classical parallelization: spreads the load over several Logical Processes (LP)
[Figure: four Logical Processes, LP #1 to LP #4]
Experimental Results
- Uses Chord as a standard workload: e.g. 320,000 nodes ⇒ 320,000 requests
- The results are impressive at first glance
  - 4h10 using two Logical Processes; only 1h06 using 16 LPs
  - Speedup of 4 using 8 times more resources, that is really not bad
- But this is to be compared to sequential results
  - The same simulation takes 47 seconds in the original sequential PeerSim
  - (and 5 seconds using the precise network models of SimGrid in sequential)
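The arithmetic behind this comparison can be checked with a few lines of Python. Only the four durations come from the slide; the helper `seconds` and the variable names are introduced here for illustration.

```python
def seconds(hours=0, minutes=0, secs=0):
    """Convert a duration to seconds."""
    return hours * 3600 + minutes * 60 + secs


dpeersim_2lp = seconds(hours=4, minutes=10)   # dPeerSim with 2 LPs
dpeersim_16lp = seconds(hours=1, minutes=6)   # dPeerSim with 16 LPs
peersim_seq = seconds(secs=47)                # sequential PeerSim
simgrid_seq = seconds(secs=5)                 # sequential SimGrid, precise model

# Gain from 8 times more Logical Processes (the "speedup of 4" of the slide)
speedup_16_vs_2 = dpeersim_2lp / dpeersim_16lp
# How much slower the best parallel run is than each sequential tool
slowdown_vs_peersim = dpeersim_16lp / peersim_seq
slowdown_vs_simgrid = dpeersim_16lp / simgrid_seq

print(round(speedup_16_vs_2, 2))    # 3.79
print(round(slowdown_vs_peersim))   # 84
print(slowdown_vs_simgrid)          # 792.0
```

Even with 16 LPs, the parallel run stays roughly 84 times slower than sequential PeerSim on these figures.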
Parallel Simulation vs. Dist. Apps Simulators
Classical Parallel Simulation Schema [Balakrishnan et al.]
- Simulation Workload
  - Granularity, communication pattern
  - Events population, probability & delay
  - #simulation objects, #processors
- Simulation Engine
  - Parallel protocol, if any: conservative (lookahead, ...) or optimistic (state save & restore, ...)
  - Event list management, timing model...
- Execution Environment
  - OS, programming language (C, Java...), networking interface (MPI, ...)
  - Hardware aspects (CPU, mem., net)
Layered View of Dist. App. Simulators
[Figure: layer stack labelled Simulation Workload, User Code, Virtualization Layer, Networking Models, Simulation Engine, Execution Environment]
- The classical approach is to distribute the Simulation Engine entirely
- Hard issues: conservative ⇒ too little parallelism; optimistic ⇒ rollbacks
- From our experience, most of the time is spent in the so-called "simulation workload"
  - User code is executed as threads, which are scheduled according to the simulation
  - The user code itself can prove resource hungry: numerous / large processes
Main Idea here
Split at Virtualization, not Simulation Engine
- Virtualization contains threads (user's stack)
- Engine & Models remain sequential
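[Figure: timeline over tn, tn+1, tn+2 with the maestro M (Models + Engines) and user processes U1, U2, U3 (Virtualization + Synchro, User), next to the layer stack: Simulation Workload, User Code, Virtualization Layer, Networking Models, Simulation Engine, Execution Environment]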
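Understanding the trade-off
- Sequential time: $\sum_{SR} \left( \mathit{engine} + \mathit{model} + \mathit{virtu} + \mathit{user} \right)$
- Classical schema: $\sum_{SR} \left( \max_{i \in LP} \left( \mathit{engine}_i + \mathit{model}_i + \mathit{virtu}_i + \mathit{user}_i \right) + \mathit{proto} \right)$
- Proposed schema: $\sum_{SR} \left( \mathit{engine} + \mathit{model} + \max_{i \in WT} \left( \mathit{virtu}_i + \mathit{user}_i \right) + \mathit{sync} \right)$
- Synchronization protocol expensive wrt the engine's load to be distributed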
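To get a feel for the trade-off, the three cost formulas can be evaluated on made-up numbers. The sketch below reads SR as the scheduling rounds, LP as the logical processes and WT as the worker threads, which is an interpretation of the subscripts, and it assumes a perfectly balanced load so that each max equals the per-worker average. All per-round costs, and the `proto` and `sync` overheads, are hypothetical and do not come from any measurement.

```python
from typing import NamedTuple


class Round(NamedTuple):
    engine: float  # time spent in the simulation engine
    model: float   # time spent solving the models
    virtu: float   # time spent in the virtualization layer
    user: float    # time spent in user code


def sequential(rounds):
    return sum(r.engine + r.model + r.virtu + r.user for r in rounds)


def classical(rounds, n_lp, proto):
    # Everything is split over n_lp logical processes, plus a protocol cost
    return sum((r.engine + r.model + r.virtu + r.user) / n_lp + proto
               for r in rounds)


def proposed(rounds, n_wt, sync):
    # Engine and models stay sequential; only virtu + user are split
    return sum(r.engine + r.model + (r.virtu + r.user) / n_wt + sync
               for r in rounds)


# Hypothetical workload: 100 identical rounds dominated by user code
rounds = [Round(engine=1.0, model=2.0, virtu=1.0, user=6.0)] * 100

print(sequential(rounds))                   # 1000.0
print(classical(rounds, n_lp=4, proto=8.0)) # 1050.0
print(proposed(rounds, n_wt=4, sync=0.5))   # 525.0
```

With these invented values the classical schema loses to the sequential run because the protocol cost outweighs the distributed load, while the proposed schema wins as long as `sync` stays small compared to the user code.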
Enabling Parallel Simulation of Dist. Apps
Challenge: Allow User-Code to run Concurrently
- DES simulator full of shared data structures: how to avoid race conditions?
- Fine-grained locking would be difficult and inefficient; it wouldn't avoid app-level races
  - A: recv, B: send, C: send ⇒ Which send matches the recv from A in simulation?
  - Depends on execution order in host system ⇒ simulation not reproducible...
Solution: OS-inspired Separation of Simulated Processes
- Mediate any interaction of processes with their environment, as in real OSes
  e.g. don't create a new process directly, but issue a simcall to request creation
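[Figure: the maestro M (Models + Engines) and isolated user processes U1, U2, U3 (Virtualization + Synchro, User); a simcall is a request from a user process to M followed by an answer, and the actual interaction happens in M]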
t ← 0
Pt ← P
while Pt ≠ ∅ do
    parallel_schedule(Pt)
    handle_simcalls()
    (t, events) ← models_solve()
    Pt ← proc_to_wake(events)
end while
- Processes isolated from each other
  - Simcalls data locally stored
- Simcalls handled centrally once users blocked
  - Arbitrary fixed order for reproducibility
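The design above can be sketched in Python on the A/B/C example. This is only an illustration under several assumptions: user processes are generators that yield hypothetical simcall tuples, a thread pool stands in for the worker threads, communications complete instantly (so `models_solve()` and `proc_to_wake()` collapse into the matching step), and the fixed order is simply the sorted process names. None of these names are SimGrid's API.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def receiver(inbox):
    msg = yield ("recv", "box")  # simcall: the only way to interact
    inbox.append(msg)


def sender(payload):
    yield ("send", "box", payload)


def run_until_simcall(entry):
    pid, proc, answer = entry
    try:
        return pid, proc.send(answer)  # user code runs up to its next simcall
    except StopIteration:
        return pid, None               # process terminated


def simulate(procs, workers=4):
    ready = [(pid, proc, None) for pid, proc in procs.items()]
    senders, receivers = {}, {}  # per-mailbox queues of blocked processes
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while ready:
            # parallel_schedule(Pt): user code runs concurrently, in isolation;
            # each simcall is only recorded, nothing shared is touched yet
            simcalls = dict(pool.map(run_until_simcall, ready))
            ready = []
            # handle_simcalls(): sequential, in pid order, once all are blocked
            for pid in sorted(simcalls):
                call = simcalls[pid]
                if call is None:
                    continue
                kind, box = call[0], call[1]
                if kind == "send":
                    senders.setdefault(box, deque()).append((pid, call[2]))
                else:
                    receivers.setdefault(box, deque()).append(pid)
                # match the oldest send with the oldest receive on this mailbox
                while senders.get(box) and receivers.get(box):
                    src, payload = senders[box].popleft()
                    dst = receivers[box].popleft()
                    ready.append((src, procs[src], None))
                    ready.append((dst, procs[dst], payload))


inbox = []
procs = {"A": receiver(inbox), "B": sender("from B"), "C": sender("from C")}
simulate(procs)
print(inbox)  # ['from B']
```

Whichever thread runs B or C first on the host, A always receives B's message, because the matching happens centrally in a fixed order; C's send simply stays unmatched in this toy scenario.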
Efficient Parallel Simulation
Leveraging Multicores
- P2P involves millions of user processes, but dozens of cores at best
- Having millions of system threads is difficult (when possible at all)
- Co-routines (Unix ucontexts, Windows fibers): highly efficient but not parallel
- N:M model used: millions of coroutines executed on a few threads
[Figure: logical view (maestro M and worker threads T1, T2 between tn and tn+1) and ideal algorithm (threads T1, T2, ..., Tn synchronizing with fetch_add(), futex_wait() and futex_wake())]
Reducing Synchronization Costs
- Inter-thread synchronization achieved through system calls (of real OS)
- Costs of syscalls are critical to performance ⇒ save all possible syscalls
- Assembly reimplementation of ucontext: no syscall on context switch
- Synchronize only at scheduling round boundaries using futexes
- Dynamic load distribution: hardware fetch-and-add on the next process' index
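The dynamic load distribution can be sketched as follows. Python has no hardware fetch-and-add, so the hypothetical `FetchAddCounter` class emulates it with a lock, and `Thread.join()` stands in for the futex-based synchronization at the round boundary; both are substitutions made for illustration, as are the names `run_round` and `make_process`.

```python
import threading


class FetchAddCounter:
    """Stand-in for a hardware fetch-and-add; a lock emulates atomicity."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self, n=1):
        with self._lock:
            old = self._value
            self._value += n
            return old


def run_round(processes, n_threads):
    """Run every process of one scheduling round on a few worker threads."""
    next_index = FetchAddCounter()
    executed_by = [None] * len(processes)

    def worker(tid):
        while True:
            i = next_index.fetch_add()  # grab the next process' index
            if i >= len(processes):
                return                  # nothing left in this round
            processes[i]()              # run that user process
            executed_by[i] = tid

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # round boundary: the only synchronization point
    return executed_by


# Hypothetical workload: 1000 tiny processes, each touching only its own slot
done = [0] * 1000


def make_process(i):
    def process():
        done[i] += 1
    return process


executed_by = run_round([make_process(i) for i in range(1000)], n_threads=4)
print(sum(done))  # 1000
```

No thread is assigned a fixed slice in advance: a thread that gets short processes simply grabs more indexes, which balances the load without any per-process handshake.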
Microbenchmarking Synchronization Costs
Note: P2P and Chord are ultra fine grain: this is thus a worst case scenario
Comparing our user context containers
- pthreads hit a scalability limit at 32,000 processes (number of semaphores)
- System contexts and ASM contexts have no hard limit (besides available RAM)
- pthreads are about 10 times slower than our own ASM contexts
- ASM contexts are about 20% faster than system ones (only difference: they avoid any syscall on user context switches)
Measuring intrinsic synchronization costs
- Disabling parallelism at runtime: no noticeable performance change
- Enabling parallelism over 1 thread: performance drop of 15%
- This demonstrates the difficulty, despite the careful optimization
Sequential Performance in State of the Art
- Scenario: Initialize Chord, and simulate 1000 seconds of protocol
- Arbitrary time limit: 12 hours (simulation killed afterward)
[Figure: running time in seconds (0 to 40000) versus number of nodes (0 to 2e+06) for OverSim (OMNeT++), PeerSim, OverSim (simple underlay), SimGrid (precise, sequential) and SimGrid (constant, sequential)]
Largest simulated scenario
Simulator     Size   Time
OMNeT++       10k    1h40
PeerSim       100k   4h36
OverSim       300k   10h
SG, precise   10k    130s
SG, precise   300k   32mn
SG, precise   2M     6h23
SG, simple    2M     5h30
Memory Usage
- 2M precise nodes: 32 GiB
- That is 18 kiB per process (user stack: 12 kiB)
The extra complexity that allows parallel execution does not impact sequential performance
Benefits of the Parallel Execution
[Figure: speedup (0.8 to 1.4) versus number of nodes (0 to 2e+06), one panel for the precise model and one for the constant model, with curves for 1, 2, 4, 8, 16 and 24 threads]
- Speedup ($t_{seq} / t_{par}$): up to 45%
- More efficient with simple model:
  - Less work in engine + Amdahl's law
- Speedup depends on the number of threads
  - 8 threads (of 24 cores) often better
  - Synch costs remain hard to amortize
  - They depend on the number of threads
Parallel Efficiency ($\mathit{speedup} / \#\mathit{cores}$) for 2M nodes
Model      4 threads   8 threads   16 threads   24 threads
Precise    0.28        0.15        0.07         0.05
Constant   0.33        0.16        0.08         0.06
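Since efficiency is the speedup divided by the number of cores, the speedups behind this table can be recovered by multiplying back. The sketch below assumes that the thread count is the number of cores used, and the results are approximate because the efficiencies in the table are rounded to two digits.

```python
# Efficiencies copied from the table above (2M nodes)
efficiency = {
    "Precise":  {4: 0.28, 8: 0.15, 16: 0.07, 24: 0.05},
    "Constant": {4: 0.33, 8: 0.16, 16: 0.08, 24: 0.06},
}

# speedup = efficiency * number of threads (assumed equal to cores used)
speedup = {
    model: {threads: round(eff * threads, 2) for threads, eff in row.items()}
    for model, row in efficiency.items()
}

for model, row in speedup.items():
    print(model, row)
# Precise {4: 1.12, 8: 1.2, 16: 1.12, 24: 1.2}
# Constant {4: 1.32, 8: 1.28, 16: 1.28, 24: 1.44}
```

The largest recovered value, about 1.44 for the constant model on 24 threads, is in line with the "up to 45%" speedup quoted on the slide.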
- Baaaaad efficiency results
- Remember, P2P and Chord: worst case scenarios
Yet, this is the first time that Chord's parallel simulation is faster than the best known sequential one
Conclusions on Parallel Simulation
Problem: Classical parallelization is suboptimal (spatial decomposition)
- Optimistic: rollbacks are difficult with complex network models
- Pessimistic: lookahead is limited because the P2P app topology ≠ the network one
⇒ dPeerSim: 2 LPs: 4h; 16 LPs: 1h, but 47 seconds sequential without LPs
Proposal: Better to keep a central engine and leverage virtualization threads
- Making this possible mandates an OS-inspired separation of processes
- Making this efficient for P2P mandates reducing synchronizations to the bare minimum
Evaluation: Implemented in SimGrid (http://simgrid.gforge.inria.fr)
- Still orders of magnitude faster than PeerSim and OverSim in sequential
- Parallel execution (somewhat) beneficial for (very) large numbers of processes
Take home message
- A parallel P2P simulator mandates creative approaches and careful optimization
Future work
- Further technical improvements (automatic tuning of the number of threads; Java bindings)
- Attempt distribution (beyond memory limit and for HPC tasks)
- Leverage this tool to conduct nice studies