Functional approaches to parallel programming and meta-computing
Semantics, implementations and certification
Frédéric Gava, under the supervision of Frédéric Loulergue
2/51
Background
Parallel programming, from implicit to explicit:
• Automatic parallelization
• Skeletons
• Data-parallelism
• Parallel extensions
• Concurrent programming
3/51
Projects
• 2002-2004
• ACI Grid
• 4 partners
• Design of parallel and grid libraries of primitives for OCaml, with applications to distributed database management systems (DBMS) and numeric computations
• 2004-2007
• ACI Young researchers
• Production of a programming environment in which certified parallel programs can be written, proved and safely executed
4/51
Outline
• Introduction
I. Semantics of BSML and certification
II. Extensions
a. New primitives: parallel composition & parallel IO
b. Library of parallel data structures
III. Globalized operations
• Conclusion and future work
5/51
Introduction
6/51
The BSP model
BSP architecture: processor/memory pairs (P/M) connected by a network, with a synchronization unit.
Characterized by:
– p: number of processors
– r: processor speed
– L: global synchronization
– g: communication phase (1 word at most sent or received by each processor)
7/51
BSP model of execution
Cost of a superstep s: $T(s) = \left(\max_{0 \le i < p} w_i\right) + h \times g + L$
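The cost formula can be evaluated mechanically. The sketch below is only an illustration: the record type bsp_params and the functions superstep_cost and program_cost are hypothetical names, not part of BSMLlib. It assumes that w_i is the local computation time of processor i, that h is the largest number of words sent or received by one processor, and that the cost of a program is the sum of the costs of its supersteps.

```ocaml
(* Hypothetical record of BSP parameters; the field names are illustrative. *)
type bsp_params = { p : int; g : float; l : float }

(* Cost of one superstep: (max of the local work w_i) + h * g + L.
   [work] holds w_i for each processor; [h] is the largest number of words
   sent or received by any single processor during the superstep. *)
let superstep_cost params ~work ~h =
  if Array.length work <> params.p then
    invalid_arg "superstep_cost: one work value per processor is expected";
  let w_max = Array.fold_left max 0.0 work in
  w_max +. float_of_int h *. params.g +. params.l

(* Cost of a program, assumed to be the sum of the costs of its supersteps. *)
let program_cost params steps =
  List.fold_left
    (fun acc (work, h) -> acc +. superstep_cost params ~work ~h)
    0.0 steps
```

For instance, with p = 4, g = 2 and L = 10, a superstep with local work 3, 7, 5, 1 and h = 4 costs 7 + 4 × 2 + 10 = 25.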
8/51
λ-calculus + parallel constructions = BSλ-calculus
ML + parallel primitives = BSML
• Structured parallelism as an explicit parallel extension of ML
• Functional language with BSP cost predictions
• Allows the implementation of skeletons
• Implemented as a parallel library for the "Objective Caml" language
• Using a parallel data structure called parallel vector
The BSML language
9/51
A BSML program
Figure: a sequential part, replicated on every processor, and a parallel part made of parallel vectors <f0, f1, …, fp-1> and <g0, g1, …, gp-1>.
10/51
Asynchronous primitives
mkpar: (int → α) → α par
(mkpar f) = <(f 0), (f 1), …, (f (p-1))>
apply: (α → β) par → α par → β par
apply <f0, f1, …, fp-1> <v0, v1, …, vp-1> = <f0 v0, f1 v1, …, fp-1 vp-1>
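The behaviour of the two asynchronous primitives can be mimicked sequentially. The sketch below is an assumption-laden simulation, not the BSMLlib implementation: a parallel vector is modelled as an OCaml array with one cell per processor, and bsp_p () is fixed to 4.

```ocaml
(* Sequential simulation: one array cell per processor.
   This representation is an assumption made for illustration only. *)
type 'a par = 'a array

(* Assumed number of processors. *)
let bsp_p () = 4

(* mkpar f builds the vector <f 0, f 1, ..., f (p-1)>. *)
let mkpar (f : int -> 'a) : 'a par = Array.init (bsp_p ()) f

(* apply applies, on each processor i, the local function to the local value. *)
let apply (fs : ('a -> 'b) par) (vs : 'a par) : 'b par =
  Array.init (bsp_p ()) (fun i -> fs.(i) vs.(i))

(* Usage: each processor doubles its own processor identifier. *)
let pids = mkpar (fun i -> i)
let doubled = apply (mkpar (fun _ x -> 2 * x)) pids
```

Neither primitive needs communication: every cell is computed from data already held by the same processor, which is why they are called asynchronous.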
11/51
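Synchronous primitives
put: (int → α option) par → (int → α option) par
Example with 4 processors. Before put, processor j holds a function giving the message for each destination:

               proc. 0   proc. 1   proc. 2   proc. 3
to proc. 0     None      Some v2   None      None
to proc. 1     None      None      Some v5   None
to proc. 2     None      Some v3   None      None
to proc. 3     Some v1   Some v4   None      None

After put, processor i holds a function giving the message received from each source:

               proc. 0   proc. 1   proc. 2   proc. 3
from proc. 0   None      None      None      Some v1
from proc. 1   Some v2   None      Some v3   Some v4
from proc. 2   None      Some v5   None      None
from proc. 3   None      None      None      None

proj: α option par → (int → α option)
proj <v0, v1, …, vp-1> = f such that (f i) = vi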
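The two synchronous primitives can also be simulated sequentially. This is a sketch under assumptions: parallel vectors are modelled as arrays, bsp_p () is fixed to 4, and returning None for an out-of-range processor number is a choice made here, not something stated on the slide. The sample data follows the v1…v5 exchange of the slide, read with one column per processor.

```ocaml
(* Sequential simulation: one array cell per processor (illustrative only). *)
type 'a par = 'a array

(* Assumed number of processors. *)
let bsp_p () = 4

(* put: processor j holds a function from a destination to an optional message.
   In the result, processor i holds a function from a source j to the message
   that j sent to i: the "matrix" of messages is transposed. *)
let put (fs : (int -> 'a option) par) : (int -> 'a option) par =
  Array.init (bsp_p ()) (fun i ->
      fun j -> if 0 <= j && j < bsp_p () then fs.(j) i else None)

(* proj turns a parallel vector into a function f such that f i = v_i. *)
let proj (vs : 'a option par) : int -> 'a option =
  fun i -> if 0 <= i && i < bsp_p () then vs.(i) else None

(* Messages to send, one function per processor. *)
let send : (int -> string option) par =
  [| (fun dst -> if dst = 3 then Some "v1" else None);
     (fun dst ->
        match dst with
        | 0 -> Some "v2"
        | 2 -> Some "v3"
        | 3 -> Some "v4"
        | _ -> None);
     (fun dst -> if dst = 1 then Some "v5" else None);
     (fun _ -> None) |]

let received = put send
```

After the exchange, received.(3) 0 is the message that processor 0 sent to processor 3. A real implementation performs a communication phase and a global synchronization here, which this simulation does not model.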
12/51
Semantics and certification
13/51
Outline
• Natural semantics: programming model, easy for proofs
• Small steps semantics: programming model, easy for costs
• Distributed semantics: execution model, makes asynchronous steps appear
• Abstract machine: execution model, close to a real implementation
14/51
Mini language
Expressions of our mini language:
e ::= λx.e | (e e) | …     functional core language
    | (mkpar e) | …        parallel primitives
    | <e, e, …, e>         parallel vector
    | (e)[s]               substitution
    | λx.e[s]              closure
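One possible way to represent this grammar is an OCaml variant type. The sketch below is only an illustration: the constructor names are invented here, Var stands for a case hidden behind the "…" of the grammar, and substitutions are assumed to be association lists.

```ocaml
(* Hypothetical abstract syntax for the mini language. *)
type expr =
  | Var of string                     (* variable (assumed, elided on the slide) *)
  | Lam of string * expr              (* λx.e *)
  | App of expr * expr                (* (e e) *)
  | Mkpar of expr                     (* (mkpar e) *)
  | Vector of expr list               (* <e, e, ..., e> *)
  | Subst of expr * subst             (* (e)[s] *)
  | Closure of string * expr * subst  (* λx.e[s] *)

(* A substitution is assumed to map variable names to expressions. *)
and subst = (string * expr) list

(* Number of constructors in a term, as a small example of a traversal. *)
let rec size = function
  | Var _ -> 1
  | Lam (_, e) | Mkpar e -> 1 + size e
  | App (e1, e2) -> 1 + size e1 + size e2
  | Vector es -> List.fold_left (fun n e -> n + size e) 1 es
  | Subst (e, s) | Closure (_, e, s) -> 1 + size e + subst_size s

and subst_size s = List.fold_left (fun n (_, e) -> n + size e) 0 s
```

For example, size (Mkpar (Lam ("i", Var "i"))) counts three constructors.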
15/51
Natural semantics
• Semantics = set of axioms and inference rules
• Easy to understand, makes proofs easier
• Example: (inference rule shown on the slide)
• Confluent
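A natural (big-step) semantics maps directly onto a recursive evaluator, one case per inference rule. The sketch below is an assumption, not the semantics of the thesis: it uses environments instead of explicit substitutions, adds integers and addition to have something to compute, fixes the number of processors to 3, and does not forbid nested parallel vectors. The mkpar case reads as the rule "if e evaluates to f and (f i) evaluates to v_i for every i, then (mkpar e) evaluates to <v_0, …, v_{p-1}>".

```ocaml
(* Tiny expression language; Int and Add are additions made for this example. *)
type expr =
  | Int of int
  | Var of string
  | Lam of string * expr
  | App of expr * expr
  | Add of expr * expr
  | Mkpar of expr

type value =
  | VInt of int
  | VClos of string * expr * env  (* function with its environment *)
  | VVec of value array           (* parallel vector, one value per processor *)

and env = (string * value) list

(* Assumed number of processors. *)
let p = 3

(* Big-step evaluation: one case per inference rule. *)
let rec eval env = function
  | Int n -> VInt n
  | Var x -> List.assoc x env
  | Lam (x, e) -> VClos (x, e, env)
  | Add (e1, e2) ->
    (match eval env e1, eval env e2 with
     | VInt a, VInt b -> VInt (a + b)
     | _ -> failwith "Add: integers expected")
  | App (e1, e2) ->
    let f = eval env e1 in
    let v = eval env e2 in
    apply_value f v
  | Mkpar e ->
    (* Evaluate e to a function f, then evaluate (f i) for each processor i. *)
    let f = eval env e in
    VVec (Array.init p (fun i -> apply_value f (VInt i)))

and apply_value f v =
  match f with
  | VClos (x, body, cenv) -> eval ((x, v) :: cenv) body
  | _ -> failwith "application of a non-function"
```

Evaluating Mkpar (Lam ("i", Add (Var "i", Int 10))) in the empty environment gives the vector of the values 10, 11 and 12.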
16/51
Small steps semantics
• Semantics = set of rewriting rules
• Using contexts for the strategy
• Easier understanding of costs and errors
• Example: (rewriting rule shown on the slide, with a global cost and local costs)
• Confluent (costs and values)
• Equivalent to the previous semantics
17/51
Distributed semantics
• Semantics = set of parallel rewriting rules
• SPMD style: the same program (Prog) is evaluated on every processor
• Small steps
• Distributed evaluation: each processor holds one part of the parallel vector
• Confluent
• Equivalent to the previous semantics
Prog:
let scan op vec =
  let rec scan' fst lst op vec =
    if fst >= lst then vec
    else
      let mid = (fst + lst) / 2 in
      let vec' =
        mix mid (super (fun () -> scan' fst mid op vec)
                       (fun () -> scan' (mid + 1) lst op vec)) in
      let com = ... (* send wm to processes m+1…p+1 *) in
      let op' = ... (* applies op to wm and wi, m<i<p *) in
      parfun2 op' com vec'
  in
  scan' 0 (bsp_p () - 1) op vec
18/51
Abstract machine
• BSP-CAM = p × CAM + BSP instructions (SPMD style)
• Each CAM runs the same code, e.g. PUSH SWAP PID CONS APP SEND PUSH SWAP APP
• PID: PID of the machine, for mkpar
• SEND: synchronous instruction, for put (communications)
• Minimal set of parallel instructions
• Equivalence with the distributed semantics
19/51
Certification of BSML programs
• The Coq proof assistant:
  – Typed λ-calculus with dependent types
  – Specification = term (goal)
  – Language of tactics to build a proof of this goal
  – Extraction of the proof (certified program)
• BSML and Coq:
  – Axiomatization of the semantics of the primitives in Coq
  – Proof of BSML programs as usual proofs of ML programs
  – Certification and extraction of BSML programs:
    a) Broadcast, total exchange, …
    b) Prefixes
    c) Sort
20/51
Example: replicate
Specification of replicate:
Proof script:
intros T a.
exists (mkpar T (fun pid: Z => a)).
rewrite mkpar_def.
Certified extraction: let replicate a = mkpar (fun pid -> a)
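The extracted program can be tried out against a sequential stand-in for mkpar. The sketch below assumes, for illustration only, that a parallel vector is an array with one cell per processor and that there are 4 processors. The property checked informally is the one a specification of replicate would plausibly state: every processor holds the value a.

```ocaml
(* Sequential stand-in for the BSML primitive (an assumption, not BSMLlib). *)
type 'a par = 'a array

(* Assumed number of processors. *)
let bsp_p () = 4

let mkpar (f : int -> 'a) : 'a par = Array.init (bsp_p ()) f

(* The extracted program from the slide: the same value on every processor. *)
let replicate a = mkpar (fun _pid -> a)

(* Informal check of the expected property on one input. *)
let all_equal_to a (v : 'a par) = Array.for_all (fun x -> x = a) v
```

Such a test only samples the property; the point of the Coq development is that the property is proved for every input.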
21/51
Extensions and parallel data structures
22/51
Outline
Extensions of BSML:
• Parallel composition: new primitive, divide-and-conquer; properties: confluent semantics, two equivalent semantics
• External memory (IO): new primitives, new cost model; property: confluent semantics
• Parallel data structures, implemented with these extensions: simplify programming, OCaml interfaces, load-balancing
23/51
Multiprogramming
• Several programs on the same machine
• New primitives for parallel composition:
  – Superposition
  – Juxtaposition (implemented with the superposition)
• Divide-and-conquer BSP algorithms
24/51
Parallel superposition
• super: (unit → α) → (unit → β) → α × β
  super E1 E2 = (E1 (), E2 ())
• Fusion of communications/synchronization
• Preserves the BSP model
• Pure functional semantics
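From the point of view of values, super behaves like evaluating both thunks and pairing the results. The sketch below captures only that functional behaviour; it is a sequential stand-in and does not model the fusion of the communication and synchronization phases of the two computations. The function sum_range is a hypothetical divide-and-conquer example written for this illustration.

```ocaml
(* Value-level behaviour of superposition: super e1 e2 = (e1 (), e2 ()).
   The interleaving of supersteps done by a real implementation is not modelled. *)
let super (e1 : unit -> 'a) (e2 : unit -> 'b) : 'a * 'b =
  let v1 = e1 () in
  let v2 = e2 () in
  (v1, v2)

(* Divide-and-conquer written with super: sum of the integers lo..hi. *)
let rec sum_range lo hi =
  if lo > hi then 0
  else if lo = hi then lo
  else
    let mid = lo + (hi - lo) / 2 in
    let (left, right) =
      super (fun () -> sum_range lo mid) (fun () -> sum_range (mid + 1) hi)
    in
    left + right
```

The two recursive calls are independent, which is what allows an implementation to run them as two superposed BSP programs.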
25/51
Parallel superposition
Figure: supersteps of E1, of E2 and of super E1 E2 on processors 0, 1 and 2; the communication and synchronization phases of E1 and E2 are fused in super E1 E2.
Properties: confluent, BSP, equivalence.
26/51
Example: parallel prefixes
Figure: time (s) against the size of the polynomials for the direct version (BSML+MPI), the superposition version and the juxtaposition version.
27/51
• Observations:
Data Structures are as important as algorithms
Symbolic computations use these data structures massively
• A parallel implementation of data structures:
Interfaces as close as possible to the sequential ones
Modular implementation for straightforward maintenance
Load-balancing of the data
Parallel data structures
28/51
• 5 modules: Set, Map, Stack, Queue, Hashtable
• Interfaces:
Same as in OCaml
With some specific parallel functions such as parallel reductions
• A parallel data structure = one data structure on each processor
• Manual or Automatic load-balancing:
To get similar sizes of the local data structures
Better performance for parallel iterations
A two super-steps algorithm using histograms
Parallel data structures
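The idea "a parallel data structure = one data structure on each processor" can be sketched with the standard Set functor. Everything below is an assumption made for illustration: the names par_set, of_list, reduce and imbalance are hypothetical, the placement by value modulo the number of processors is arbitrary, and the histogram-based load-balancing algorithm of the slide is not reproduced.

```ocaml
module IntSet = Set.Make (struct
    type t = int
    let compare (a : int) (b : int) = compare a b
  end)

(* One local set per processor (sequential simulation). *)
type par_set = IntSet.t array

(* Assumed number of processors. *)
let nprocs = 4

(* Build a parallel set; each element goes to processor (x mod nprocs). *)
let of_list (xs : int list) : par_set =
  let sets = Array.make nprocs IntSet.empty in
  List.iter
    (fun x ->
       let i = ((x mod nprocs) + nprocs) mod nprocs in
       sets.(i) <- IntSet.add x sets.(i))
    xs;
  sets

(* Same interface as the sequential structure: membership test. *)
let mem x (s : par_set) = Array.exists (IntSet.mem x) s

(* Parallel reduction: a local computation on each processor, then a combination. *)
let reduce local combine init (s : par_set) =
  Array.fold_left (fun acc set -> combine acc (local set)) init s

let cardinal s = reduce IntSet.cardinal ( + ) 0 s

(* Difference between the largest and smallest local sizes:
   a simple measure of how unbalanced the structure is. *)
let imbalance (s : par_set) =
  let sizes = Array.map IntSet.cardinal s in
  Array.fold_left max 0 sizes - Array.fold_left min max_int sizes
```

A load-balancing step would aim at bringing imbalance close to zero, so that parallel iterations do similar amounts of work on every processor.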
29/51
Example
Computation of the “nth” nearest neighbor atoms in a molecule.
Figure: time (s) against the number of atoms for the sequential version and the parallel version (BSML+PUB).
30/51
Example with load balancing
Figure: time (s) against the number of atoms, without balancing and with balancing.
31/51
External memories
Motivations:
Figure: measured and predicted time (s) against the number of elements.
32/51
The EM-BSP model
Figure: each processor has a memory and, through a bus, D local disks (Disk 1, Disk 2, …, Disk D); the processor/memory pairs (P/M) are connected by a network.
We add to the BSP model:
• D = the number of disks
• B = the size of the blocks
• O = latency of the disks
• G = time to read/write a byte
33/51
Shared disks
Figure: processor/memory pairs (P/M) connected by a network that also gives access to shared disks (Disk 1, Disk 2, …, Disk M).
We add to the BSP model:
• Shared disks
• With parameters similar to those of the local disks
34/51
• For safety, two kinds of files: local and global ones
• New primitives to manipulate these files (IO primitives)
• New semantics
• Confluent
• EM-BSP cost of the primitives
External memory in BSML
35/51
Modular implementation
• BSMLlib: primitives, Std library, parallel data structures
• Modules: Comm, Super, IO
• Lower level: PUB, MPI, TCP/IP, Threads
36/51
Cost prediction
Figure: time (s) against the number of elements for lists and arrays, with predicted (max) and predicted (avg) curves.
37/51
IO cost prediction
Figure: time (s) against the number of elements: predicted BSML, measured BSML-IO and predicted BSML-IO.
38/51
Globalized operations
39/51
Outline
• Desynchronizing BSML gives MSPML
• BSML + MSPML gives DMML
• For each: semantics, cost models, implementations
40/51
MSPML
• Uses the MPM model (parameters similar to those of BSP)
• But with a different execution model
• Same language as BSML (parallel vectors) but with new communication primitives: put → mget
41/51
MSPML
• Natural semantics: programming model, easy for proofs; similar to BSML
• Small steps semantics: programming model, easy for costs; very different
• Distributed semantics: execution model, makes asynchronous steps appear; similar to BSML
42/51
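Asynchronous communications
Figure: processors 0, 1 and 2, each with an environment of communications (empty, or holding entries such as 0,v; 0,v'; 0,v''). After a local computation, a get v 1 sends a communication request; the value v' arrives a bit later.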
43/51
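Asynchronous communications
Figure: processors 0, 1 and 2 with their environments of communications (empty, or holding entries such as 0,v; 0,v'; 0,v''; 1,w; 1,w'; 2,w''). A request for an entry that is not yet available is answered "Not ready".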
44/51
Departmental meta-computing
Figure: several BSML parallel machines coordinated by MSPML over an intranet.
45/51
Departmental Meta-computing ML
• BSML + MSPML-like for coordination
• Two kinds of vectors:
  – parallel vectors: par
  – departmental vectors: dep
• Operational semantics (confluent)
• Performance model (the DMM model)
• Implementation
46/51
• Computation of the prefixes where each processor contains a value
• Naive method: each processor sends its value to other processors
• Better method:
1) Each BSP unit computes a parallel prefix
2) One processor of each BSP unit receives values of other units
3) Each BSP unit finishes its computation with this value
Example: departmental prefixes
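The three steps of the better method can be illustrated with a sequential simulation. The sketch below is an assumption, not DMML code: the machine is modelled as an array of BSP units, each unit being an array with one value per processor; the operator op is assumed associative; and the exchange between units is collapsed into a running carry holding the combined value of all previous units.

```ocaml
(* Step 1: inclusive prefix inside one BSP unit. *)
let local_prefix op (unit_vals : 'a array) : 'a array =
  let out = Array.copy unit_vals in
  for i = 1 to Array.length out - 1 do
    out.(i) <- op out.(i - 1) out.(i)
  done;
  out

(* Two-level prefix over all units (sequential simulation). *)
let departmental_prefix op (units : 'a array array) : 'a array array =
  (* Step 1: every unit computes its own prefix. *)
  let locals = Array.map (local_prefix op) units in
  (* Steps 2 and 3: the combined value of the previous units is received,
     then each unit finishes by combining it with its local prefixes. *)
  let carry = ref None in
  Array.iteri
    (fun k pre ->
       (match !carry with
        | None -> ()
        | Some c -> locals.(k) <- Array.map (fun x -> op c x) pre);
       let n = Array.length locals.(k) in
       if n > 0 then carry := Some locals.(k).(n - 1))
    locals;
  locals
```

With addition and the units holding 1, 2, 3, then 4, 5, then 6, the result is 1, 3, 6, then 10, 15, then 21. Only one value per unit crosses the slower links between units, instead of one value per processor in the naive method.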
47/51
Experiments
Figure: time (s) against the size of the polynomials for the better algorithm, the naive algorithm and the BSP algorithm (one cluster).
48/51
Conclusion and future work
49/51
Conclusion
I. Semantics of BSML:
1) Confluent and equivalent semantics
2) Abstract machine
3) Proof of BSML programs
II. Expressivity:
1) Parallel composition
2) Parallel data structures
3) Parallel IO
III. Meta-computing:
1) Desynchronization of BSML (MSPML)
2) Departmental Meta-computing ML (DMML)
For each: semantics, cost models, implementations
50/51
Future work in the Propac project
• Cost prediction:
1. Static analysis of the programs
2. Cost prediction of certified programs
• Proofs of BSP imperative programs:
Figure: from the correctness of IMP and ML programs in Coq to BSML, by an extension with BSP operations and an extension of the logical assertions.
51/51
Efficient verification by interaction of techniques (VITE, "Vérification efficace par Interaction de Techniques")
• Design of parallel model checkers for High-level Petri Nets
• Using BSML to implement a toolkit:
a) Using the BSP model to dynamically load-balance
b) Using a modular and generic implementation to ease the use of this toolkit
• Using the Propac tools to certify this implementation
Thank you for your attention
53/51
BSML and MSPML
• Proofs of programs (with Coq)
• Natural semantics and small steps semantics: programming model (small steps semantics: useful for costs)
• Distributed semantics and CAM: execution model
• BSML: BSP model, over PUB, MPI, TCP/IP
• MSPML: MPM model, over TCP/IP
54/51
Petri nets
Figure labels: place, token, transition, state, arc.
55/51
Figure: overview of the work on BSML
• High-level semantics: natural (Nat), small steps (Step), distributed (Distr)
• Parallel semantics, distributed evaluation
• Abstract machines: design of the BSP-CAM
• Sequential implementation, parallel implementation
• Performance model, dynamic cost analysis
• Coq axiomatisation, proofs of BSML programs
• Propac