Integrative Parallel Programming in HPC
Lecture in the Georgia Tech "Hot CSE" seminar, 2014-09-22
TRANSCRIPT
Integrative Parallel Programming in HPC
Victor Eijkhout
2014/09/22
• Introduction
• Motivating example
• Type system
• Demonstration
• Other applications
• Tasks and processes
• Task execution
• Research
• Conclusion
Introduction
My aims for a new parallel programming system

1. There are many types of parallelism ⇒ uniform treatment of parallelism
2. Data movement is more important than computation ⇒ while acknowledging the realities of hardware
3. CS theory seems to ignore HPC-style parallelism ⇒ strongly theory-based
IMP: Integrative Model for Parallelism
Design of a programming system
One needs to distinguish:
Programming model: how does it look in code?
Execution model: how is it actually executed?
Data model: how is data placed and moved about?
Three different vocabularies!
Programming model: sequential semantics

[A]n HPF program may be understood (and debugged) using sequential semantics, a deterministic world that we are comfortable with. Once again, as in traditional programming, the programmer works with a single address space, treating an array as a single, monolithic object, regardless of how it may be distributed across the memories of a parallel machine. (Nikhil 1993)

As opposed to

[H]umans are quickly overwhelmed by concurrency and find it much more difficult to reason about concurrent than sequential code. Even careful people miss possible interleavings among even simple collections of partially ordered operations. (Sutter and Larus 2005)
Programming model
Sequential semantics is close to the mathematics of the problem.
Note: sequential semantics in the programming model does not mean BSP synchronization in the execution.

Also note: sequential semantics is subtly different from SPMD (but at least SPMD puts you in the asynchronous mindset).
Execution model
Virtual machine: data flow.
• Data flow expresses the essential dependencies in an algorithm.
• Data flow applies to multiple parallelism models.
• But it would be a mistake to program dataflow explicitly.
Data model
Distribution: mapping from processors to data. (Note: traditionally it is defined the other way around.)
Needed (and missing from existing systems such as UPC, HPF):
• distributions need to be first-class objects ⇒ we want an algebra of distributions
• algorithms need to be expressed in distributions
Integrative Model for Parallelism (IMP)
• Theoretical model for describing parallelism
• Library (or maybe language) for describing operations on parallel data
• Minimal, yet sufficient, specification of parallel aspects
• Many aspects are formally derived (often as first-class objects), including messages and task dependencies.
• ⇒ Specify what, not how
• ⇒ Improve programmer productivity, code quality, efficiency and robustness
Motivating example
1D example: 3-point averaging
Data-parallel calculation: $y_i = f(x_{i-1}, x_i, x_{i+1})$

Each point has a dependency on three points, some of which live on other processing elements.
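For reference, the sequential semantics of this kernel is just a loop over the array. A minimal C++ sketch (the function name and the boundary convention are my own illustration):

```cpp
#include <vector>

// 3-point averaging with sequential semantics: every interior output
// point is the average of its left neighbor, itself, and its right neighbor.
std::vector<double> threepoint_average(const std::vector<double> &x) {
  if (x.size() < 3) return x;              // nothing to average
  std::vector<double> y(x.size());
  for (std::size_t i = 1; i + 1 < x.size(); ++i)
    y[i] = (x[i-1] + x[i] + x[i+1]) / 3.0;
  y.front() = x.front();                   // one possible boundary convention:
  y.back()  = x.back();                    // keep the end points unchanged
  return y;
}
```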
α, β, γ distributions
Distribution: processor-to-elements mapping

• α distribution: data assignment on input
• γ distribution: data assignment on output
• β distribution: ‘local data’ assignment
• ⇒ β is dynamically defined from the algorithm
Dataflow
We get a dependency structure:

Interpretation:

• Tasks: local task graph
• Message passing: messages

Note: this structure follows from the distributions of the algorithm; it is not programmed.
Algorithms in the Integrative Model
Kernel: mapping between two distributed objects

• An algorithm consists of kernels
• Each kernel consists of independent operations/tasks
• Traditional elements of parallel programming are derived from the kernel specification.
Type system
Generalized data parallelism

Functions

$$f : \mathbb{R}^k \rightarrow \mathbb{R}$$

applied to arrays, $y = f(x)$:

$$y_i = f\bigl(x(I_f(i))\bigr)$$

This defines an index function

$$I_f : \mathbb{N} \rightarrow 2^{\mathbb{N}},$$

for instance $I_f(i) = \{i, i-1, i+1\}$.
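A minimal sketch of such an index function in C++ (the name and signature are illustrative, not the IMP API):

```cpp
#include <vector>

// Index function I_f for the 3-point stencil:
// output index i depends on input indices {i, i-1, i+1}.
std::vector<int> stencil_index_function(int i) {
  return { i, i - 1, i + 1 };
}
```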
Distributions
A distribution is a (non-disjoint, non-unique) mapping from processors to sets of indices:

$$d : P \rightarrow 2^{\mathbb{N}}$$

Distributed data:

$$x(d) : p \mapsto \{x_i : i \in d(p)\}$$

Operations on distributions:

$$g : \mathbb{N} \rightarrow \mathbb{N} \;\Rightarrow\; g(d) : p \mapsto \{g(i) : i \in d(p)\}$$
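As a concrete sketch (the types and names are my own illustration, not the IMP library): a distribution can be held as a map from processor rank to an index set, and a pointwise index mapping g lifts to distributions by mapping each index set.

```cpp
#include <functional>
#include <map>
#include <set>

using IndexSet     = std::set<int>;
using Distribution = std::map<int, IndexSet>; // processor rank -> indices

// Lift g : N -> N to distributions: g(d) : p -> { g(i) : i in d(p) }.
Distribution apply(const std::function<int(int)> &g, const Distribution &d) {
  Distribution result;
  for (const auto &[p, indices] : d)
    for (int i : indices)
      result[p].insert(g(i));
  return result;
}
```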
Algorithms in terms of distributions
If d is a distribution, and (funky notation)

$$x \gg y \equiv x + y, \qquad x \ll y \equiv x - y,$$

the motivating example becomes:

$$y(d) = x(d) + x(d \gg 1) + x(d \ll 1)$$

and the β distribution is

$$\beta = d \,\cup\, d \gg 1 \,\cup\, d \ll 1$$

To reiterate: the β distribution comes from the structure of the algorithm.
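As a concrete illustration (my example, not from the slides): with a block distribution of four points per processor, $d(p) = \{4p, \dots, 4p+3\}$,

$$\beta(p) = d(p) \,\cup\, \bigl(d(p)+1\bigr) \,\cup\, \bigl(d(p)-1\bigr) = \{4p-1,\, 4p,\, \dots,\, 4p+3,\, 4p+4\},$$

so each processor's local data is its own block plus one ghost point on either side.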
Transformations of distributions
How do you go from the α to the β distribution of a distributed object?

$$x(\beta) = T(\alpha,\beta)\,x(\alpha) \quad\text{where}\quad T(\alpha,\beta) = \alpha^{-1}\beta$$

Define $\alpha^{-1}\beta : P \rightarrow 2^P$ by:

$$q \in \alpha^{-1}\beta(p) \;\equiv\; \alpha(q) \cap \beta(p) \neq \emptyset$$

‘If $q \in \alpha^{-1}\beta(p)$, the task on q has data for the task on p’

• OpenMP: task wait
• MPI: message between q and p
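A hedged sketch of how this derivation could look in code (my own illustration, not the IMP implementation): for each receiver p, find every q whose α-set intersects β(p).

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <set>

using IndexSet     = std::set<int>;
using Distribution = std::map<int, IndexSet>; // processor -> indices

// q must send to p iff alpha(q) intersects beta(p).
std::map<int, std::set<int>> message_pattern(const Distribution &alpha,
                                             const Distribution &beta) {
  std::map<int, std::set<int>> senders; // p -> {q}
  for (const auto &[p, needed] : beta)
    for (const auto &[q, owned] : alpha) {
      IndexSet overlap;
      std::set_intersection(needed.begin(), needed.end(),
                            owned.begin(), owned.end(),
                            std::inserter(overlap, overlap.begin()));
      if (!overlap.empty())
        senders[p].insert(q); // MPI: message q -> p; OpenMP: task dependency
    }
  return senders;
}
```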
Parallel computing with transformations
Let y(γ) be the distributed output; then the total needed input is

$$\beta = I_f(\gamma)$$

so

$$y(\gamma) = f\bigl(x(\beta)\bigr), \qquad \beta = I_f(\gamma)$$

is a local operation. However, the input is given as x(α), so

$$y = f(Tx) \quad\equiv\quad
\begin{cases}
y \text{ is distributed as } y(\gamma) \\
x \text{ is distributed as } x(\alpha) \\
\beta = I_f(\gamma) \\
T = \alpha^{-1}\beta
\end{cases}$$
Dataflow

$$q \in \alpha^{-1}\beta(p)$$

Parts of a dataflow graph can be realized with OpenMP tasks or MPI messages.

The total dataflow graph comes from all kernels and all processes in the kernels.
To summarize
• Distribution language is global with sequential semantics
• Leads to dataflow formulation
• Can be interpreted in multiple parallelism modes
• Execution likely to be efficient
Demonstration
Can you code this?
• As a library / internal DSL: express distributions in a custom API, write the local operation in ordinary C/Fortran
• ⇒ easy integration in existing codes
• As a programming language / external DSL: requires compiler technology
• ⇒ prospect for interactions between data movement and local code
Approach taken
• Program expresses the sequential semantics of kernels
• Base class to realize the IMP concepts
• One derived class that turns IMP into MPI
• One derived class that turns IMP into OpenMP+tasks
Total: a few thousand lines.
Code
```cpp
// Create a disjoint block distribution over the global index space,
// then allocate one distributed object per time step.
IMP_distribution *blocked =
  new IMP_distribution
    ("disjoint-block",problem_environment,globalsize);
for (int step=0; step<=nsteps; ++step) {
  IMP_object
    *output_vector = new IMP_object( blocked );
  all_objects[step] = output_vector;
}
```
```cpp
// For each step, define a kernel that computes that step's output
// vector from the previous step's vector.
for (int step=1; step<=nsteps; ++step) { // step 0 holds the initial data
  IMP_object
    *input_vector  = all_objects[step-1],
    *output_vector = all_objects[step];
  IMP_kernel *update_step =
    new IMP_kernel(input_vector,output_vector);
  update_step->localexecutefn = &threepoint_execute;
  // The beta distribution is derived from these stencil operators:
  // shift right, shift left, and the identity.
  update_step->add_beta_oper( new ioperator(">>1") );
  update_step->add_beta_oper( new ioperator("<<1") );
  update_step->add_beta_oper( new ioperator("none") );
  queue->add_kernel( step,update_step );
}
```
Inspector-executor

```cpp
// Inspector: derive all messages and task dependencies once.
queue->analyze_dependencies();
// Executor: run the analyzed kernel queue, as often as needed.
queue->execute();
```

• Analysis is done once (expensive)
• Execution happens multiple times (very efficient)

(In an MPI context you can dispense with the queue and execute kernels directly.)
(Do I really have to put up performance graphs?)
[Performance graphs: "Gflop under strong scaling of vector averaging" (OpenMP vs. IMP) and "Gflop under weak scaling of vector averaging" (MPI vs. IMP).]
Summary: the motivating example in parallel language

Write the three-point averaging as

$$y(u) = \bigl(x(u) + x(u \gg 1) + x(u \ll 1)\bigr)/3$$

• Global description, sequential semantics
• Execution is driven by dataflow, no synchronization
• α-distribution given by context
• β-distribution is $u \,\cup\, u \gg 1 \,\cup\, u \ll 1$
• Messages and task dependencies are derived.
Other applications
N-body problems
Distributions of the N-body problem

Going up the levels:

$$\gamma^{(k-1)} = \gamma^{(k)}/2$$

$$\beta^{(k)} = 2\times\gamma^{(k)} \,\cup\, \bigl(2\times\gamma^{(k)} + |\gamma^{(k)}|\bigr).$$

Redundant computation is never explicitly mentioned.

(This can be coded; the code is essentially the same as the formulas.)
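As a hedged sketch of that claim, here is what building $\beta^{(k)}$ from $\gamma^{(k)}$ could look like, using the illustrative Distribution type from the earlier sketches rather than the actual IMP classes:

```cpp
#include <map>
#include <set>

using IndexSet     = std::set<int>;
using Distribution = std::map<int, IndexSet>;

// Illustrative: beta^(k) = 2*gamma^(k)  union  2*gamma^(k) + |gamma^(k)|,
// where gamma_size stands for |gamma^(k)|.
Distribution nbody_beta(const Distribution &gamma, int gamma_size) {
  Distribution beta;
  for (const auto &[p, indices] : gamma)
    for (int i : indices) {
      beta[p].insert(2 * i);               // 2 x gamma^(k)
      beta[p].insert(2 * i + gamma_size);  // 2 x gamma^(k) + |gamma^(k)|
    }
  return beta;
}
```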
Tasks and processes
Task graph
A task is a local execution:

$$\mathrm{Task} \equiv \mathrm{Kernel} \times P$$

Task numbering:

$$\langle i, p \rangle \quad\text{where } i \leq n,\ p \in P$$

Dependency edge:

$$\bigl\langle \langle i, q \rangle, \langle i+1, p \rangle \bigr\rangle \quad\text{iff}\quad q \in \alpha^{-1}\beta(p),$$

also written

$$t' = \langle i, q \rangle, \quad t = \langle i+1, p \rangle, \quad t' < t.$$
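A minimal sketch of this structure (my own types, not the IMP implementation): a task is a (kernel, processor) pair, and the edges come from the derived $\alpha^{-1}\beta$ relation, here taken as the p → {q} map computed by message_pattern() above.

```cpp
#include <map>
#include <set>
#include <utility>
#include <vector>

// A task <i,p>: kernel number i executed on processor p.
struct Task { int kernel; int proc; };

// Edge <i,q> -> <i+1,p> iff q is in alpha^{-1}beta(p), i.e. q owns
// input data that kernel i+1 on processor p needs.
std::vector<std::pair<Task,Task>>
dependency_edges(int i, const std::map<int, std::set<int>> &senders) {
  std::vector<std::pair<Task,Task>> edges;
  for (const auto &[p, qs] : senders)
    for (int q : qs)
      edges.emplace_back(Task{i, q}, Task{i + 1, p});
  return edges;
}
```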
Processors and synchronization
A processor $C_p$ is a (non-disjoint) subset of tasks:

$$\mathrm{Task} = \bigcup_p C_p.$$

For a task $t \in T$ we define a task $t'$ as a synchronization point if $t'$ is an immediate predecessor on another processor:

$$t \in C_p \;\wedge\; t' < t \;\wedge\; t' \in C_{p'} \;\wedge\; p \neq p'.$$

If $L \subset \mathrm{Task}$, the base $B_L$ is

$$B_L = \{t \in L : \mathrm{pred}(t) \not\subset L\}.$$
Local computations
A two-parameter covering $\{L_{k,p}\}_{k,p}$ of $T$ is called a local computation if

1. the p index corresponds to the division in processors:
   $$C_p = \bigcup_k L_{k,p};$$
2. the k index corresponds to the partial ordering on tasks: the sets $L_k = \bigcup_p L_{k,p}$ satisfy
   $$t \in L_k \wedge t' < t \;\Rightarrow\; t' \in \bigcup_{\ell \leq k} L_\ell;$$
3. the synchronization points synchronize only with previous levels:
   $$\mathrm{pred}(B_{k,p}) - C_p \subset \bigcup_{\ell < k} L_\ell.$$

For a given k, all $L_{k,p}$ can be executed independently.
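One illustrative instance (my example, not from the slides): the trivial covering with one kernel level per k, $L_{k,p} = \{\langle k, p\rangle\}$, satisfies all three conditions, since dependency edges only run from kernel $k-1$ to kernel $k$:

$$C_p = \bigcup_k L_{k,p}, \qquad t \in L_k \wedge t' < t \;\Rightarrow\; t' \in L_{k-1} \subset \bigcup_{\ell \leq k} L_\ell.$$

This recovers BSP-style execution, one kernel per superstep; making each $L_{k,p}$ span several kernels is what yields communication avoidance, as the next slides discuss.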
[Figure: three candidate coverings (a), (b), (c) of a task graph.]

Are these local computations? Yes, No, Yes.
Communication-avoiding compiler

Definitions can be given purely in terms of the task graph.

The programmer decides how ‘thick’ to make the $L_{k,p}$ covering; communication-avoiding scheduling is formally derived.
Co-processors
Distributions can describe data placement.

Our main worry is the latency of data movement: in IMP, data can be sent as early as possible; our communication-avoiding compiler transforms algorithms to maximize granularity.
Task execution
What is a task?
A task is a finite state automaton with five states; transitions are triggered by receiving signals from other tasks:

requesting: Each task starts out by posting a request for incoming data to each of its predecessors.
accepting: The requested data is in the process of arriving or being made available.
exec: The data dependencies are satisfied and the task can execute locally; in a refinement of this model there can be a separate exec state for each predecessor.
avail: Data that was produced and that serves as origin for some dependency is published to all successor tasks.
used: All published origin data has been absorbed by the endpoint of the data dependency, and any temporary buffers can be released.
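A minimal sketch of the automaton (my own encoding; the transition guards follow the state list above, not the IMP source):

```cpp
// The five task states, in the order a task passes through them.
enum class TaskState { Requesting, Accepting, Exec, Avail, Used };

struct TaskFSA {
  TaskState state = TaskState::Requesting;
  int pending_inputs;   // predecessors whose data has not yet arrived
  int unacked_outputs;  // successors that have not yet absorbed our data

  // A predecessor's data arrived; execute once all inputs are in.
  TaskState on_input_arrived() {
    state = (--pending_inputs == 0) ? TaskState::Exec : TaskState::Accepting;
    return state;
  }
  // Local execution finished: publish origin data to all successors.
  TaskState on_local_execution_done() {
    state = TaskState::Avail;
    return state;
  }
  // A successor absorbed our data; free buffers once all have.
  TaskState on_output_acknowledged() {
    if (--unacked_outputs == 0) state = TaskState::Used;
    return state;
  }
};
```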
[Protocol diagram: the task states requesting, accepting, exec, avail, and used on processor p, side by side with its predecessor tasks (q < p) and successor tasks (s > p), connected by the control messages notifyReadyToSend, requestToSend, sendData, and acknowledgeReceipt.]
How does a processor manage tasks?
Theorem: if you get a request-to-send, you can release the send buffers of your predecessor tasks.

Corollary: we have a functional model that doesn’t need garbage collection.
Research
Open questions
Many!

• Software is barely in the demonstration stage: needs much more functionality
• Theoretical questions: SSA, cost, scheduling
• Practical questions: interaction with local code, heterogeneity, interaction with hardware
• Applications: this works for traditional HPC, N-body, probably sorting and graph algorithms. Beyond?
• Software-hardware co-design: the IMP model has semantics for data movement; hardware can be made more efficient using this.
Conclusion
The future’s so bright, I gotta wear shades
• IMP has the right abstraction level: global expression, yet natural derivation of practical concepts.
• Concept notation looks humanly possible: basis for an expressive programming system
• Global description without talking about processes/processors: prospect for heterogeneous programming
• All concepts are explicit: middleware for scheduling, resilience, et cetera
• Applications to most conceivable scientific operations