Parallel Programming
Sathish S. Vadhiyar
Motivations of Parallel Computing

Parallel machine: a computer system with more than one processor.

Motivations:
• Faster execution time due to non-dependencies between regions of code
• Presents a level of modularity
• Resource constraints, e.g. large databases
• A certain class of algorithms lends itself naturally to parallelism
• Aggregate bandwidth to memory/disk; increase in data throughput
• Clock rate improvement in the past decade – 40%
• Memory access time improvement in the past decade – 10%
Parallel Programming and Challenges
Recall the advantages and motivation of parallelism
But parallel programs incur overheads not seen in sequential programs:
• Communication delay
• Idling
• Synchronization
Challenges
[Figure: execution timelines of two processes P0 and P1, showing computation, communication, synchronization, and idle time]
How do we evaluate a parallel program?
• Execution time, Tp
• Speedup, S: S(p, n) = T(1, n) / T(p, n). Usually S(p, n) < p; sometimes S(p, n) > p (superlinear speedup)
• Efficiency, E: E(p, n) = S(p, n) / p. Usually E(p, n) < 1; sometimes greater than 1
• Scalability – limitations in parallel computing, relation to n and p
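These metrics are straightforward to compute from measured times; a minimal sketch (the timing numbers below are hypothetical, for illustration only):

```python
# Speedup and efficiency from measured execution times.
def speedup(t_serial, t_parallel):
    """S(p, n) = T(1, n) / T(p, n)"""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """E(p, n) = S(p, n) / p"""
    return speedup(t_serial, t_parallel) / p

# Hypothetical measurements: a 100 s serial run takes 16 s on 8 processors.
s = speedup(100.0, 16.0)        # 6.25
e = efficiency(100.0, 16.0, 8)  # 0.78125 -- below 1, as is typical
print(s, e)
```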
Speedups and efficiency
[Figure: speedup S vs. p – the ideal curve is the line S = p, while practical speedup falls below it; efficiency E vs. p – ideally constant at 1, but decreasing in practice]
Limitations on speedup – Amdahl’s law
Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
The overall speedup is expressed in terms of the fractions of computation time with and without the enhancement and the speedup of the enhancement itself.
This places a limit on the speedup due to parallelism:
Speedup = 1 / (fs + fp/P)
where fs is the serial fraction, fp = 1 – fs the parallel fraction, and P the number of processors.
Amdahl’s law Illustration
S = 1 / (s + (1-s)/p)
[Figure: speedup and efficiency curves as functions of p for several serial fractions s; efficiency drops from 1 toward 0 as p grows]
Courtesy:
http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html
http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm
Amdahl’s law analysis
f P=1 P=4 P=8 P=16 P=32
1.00 1.0 4.00 8.00 16.00 32.00
0.99 1.0 3.88 7.48 13.91 24.43
0.98 1.0 3.77 7.02 12.31 19.75
0.96 1.0 3.57 6.25 10.00 14.29
• For the same fraction f, the speedup falls further and further below the processor count as P increases.
• Thus Amdahl’s law is a bit depressing for parallel programming.
• In practice, the amount of parallelizable work has to be large enough to match a given number of processors.
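The table entries follow directly from the formula; a quick check (a sketch, with f the parallel fraction):

```python
def amdahl_speedup(f, P):
    """Amdahl's law with parallel fraction f on P processors:
    S = 1 / ((1 - f) + f/P)."""
    return 1.0 / ((1.0 - f) + f / P)

# Reproduce the rows of the table above.
for f in (1.00, 0.99, 0.98, 0.96):
    row = [round(amdahl_speedup(f, P), 2) for P in (1, 4, 8, 16, 32)]
    print(f, row)
# f = 0.99 gives [1.0, 3.88, 7.48, 13.91, 24.43], matching the table
```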
Gustafson’s Law
Amdahl’s law keeps the parallel work fixed. Gustafson’s law keeps the computation time on the parallel processors fixed and changes the problem size (the fraction of parallel/sequential work) to match that computation time.
• For a particular number of processors, find the problem size for which the parallel time equals the constant time
• For that problem size, find the sequential time and the corresponding speedup
The speedup is thus scaled – a scaled speedup, also called weak speedup.
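Gustafson's scaled speedup is commonly written S = s + (1 - s)·P, where s is the serial fraction of the execution time on the parallel machine; a small sketch of how it compares to Amdahl's prediction:

```python
def gustafson_speedup(s, P):
    """Scaled (weak) speedup: S = s + (1 - s) * P,
    where s is the serial fraction of the *parallel* execution time."""
    return s + (1.0 - s) * P

# With a 5% serial fraction, scaled speedup grows almost linearly with P:
for P in (4, 8, 16, 32):
    print(P, gustafson_speedup(0.05, P))
# 32 processors give 30.45, far more optimistic than Amdahl's
# fixed-size prediction 1/(0.05 + 0.95/32), which is about 12.55
```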
Metrics (contd.)

N     P=1   P=4   P=8   P=16
64    1.0   0.80  0.57  0.33
192   1.0   0.92  0.80  0.60
512   1.0   0.97  0.91  0.80

Table 5.1: Efficiency as a function of n and p.
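These values are consistent with the classic example of adding n numbers on p processors with parallel time T_p = n/p + 2·log2(p), which gives E = n / (n + 2·p·log2(p)). That this is the table's source is an assumption, but the numbers match:

```python
import math

def efficiency_add_n(n, p):
    """Efficiency of adding n numbers on p processors, assuming
    T_p = n/p + 2*log2(p); then E = n / (n + 2*p*log2(p)).
    (Assumed model -- it reproduces Table 5.1.)"""
    if p == 1:
        return 1.0
    return n / (n + 2 * p * math.log2(p))

for n in (64, 192, 512):
    print(n, [round(efficiency_add_n(n, p), 2) for p in (1, 4, 8, 16)])
# 64 -> [1.0, 0.8, 0.57, 0.33], matching the first row of the table
```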
Scalability
• Efficiency decreases with increasing P and increases with increasing N
• Scalability: how effectively the parallel algorithm can use an increasing number of processors
• How the amount of computation performed must scale with P to keep E constant; this function of computation in terms of P is called the isoefficiency function
• An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable
Parallel Program Models
Single Program Multiple Data (SPMD)
Multiple Program Multiple Data (MPMD)
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Programming Paradigms
Shared memory model – Threads, OpenMP, CUDA
Message passing model – MPI
PARALLELIZATION
Parallelizing a Program
Given a sequential program/algorithm, how to go about producing a parallel version?
Four steps in program parallelization:
1. Decomposition: identifying parallel tasks with a large extent of possible concurrent activity; splitting the problem into tasks
2. Assignment: grouping the tasks into processes with the best load balancing
3. Orchestration: reducing synchronization and communication costs
4. Mapping: mapping of processes to processors (if possible)
Steps in Creating a Parallel Program
[Figure: a sequential computation is partitioned – decomposition into tasks, assignment of tasks to processes p0–p3, orchestration into a parallel program, and mapping of processes onto processors P0–P3]
Orchestration
Goals:
• Structuring communication
• Synchronization
Challenges:
• Organizing data structures – packing
• Small or large messages?
• How to organize communication and synchronization?
Orchestration
• Maximizing data locality
  • Minimizing the volume of data exchange: not communicating intermediate results – e.g. dot product
  • Minimizing the frequency of interactions – packing
• Minimizing contention and hot spots
  • Do not use the same communication pattern with the other processes in all the processes
• Overlapping computations with interactions
  • Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2)
  • Initiate communication for type 1; during the communication, perform type 2
• Replicating data or computations
  • Balancing the extra computation or storage cost against the gain due to less communication
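The type-1/type-2 phase split can be sketched with a thread standing in for the communication; the function and data names here are hypothetical, and the communication delay is simulated:

```python
import threading
import time

def fetch_remote_data(result):
    """Stand-in for a non-blocking receive of the type-1 input data."""
    time.sleep(0.1)                   # simulated communication delay
    result["halo"] = [1, 2, 3]        # hypothetical received values

def overlapped_step():
    result = {}
    comm = threading.Thread(target=fetch_remote_data, args=(result,))
    comm.start()                      # initiate communication for type-1 work
    local = sum(range(1000))          # type-2 work: independent of the message
    comm.join()                       # block only when type-1 data is needed
    return local + sum(result["halo"])  # type-1 work uses the received data

print(overlapped_step())              # 499500 + 6 = 499506
```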
Mapping
Which process runs on which particular processor?
• Can depend on the network topology and the communication pattern of the processes
• Can depend on processor speeds in the case of heterogeneous systems
Mapping
Static mapping
• Mapping based on data partitioning
  • Applicable to dense matrix computations
  • Block distribution: 0 0 0 1 1 1 2 2 2
  • Block-cyclic distribution (block size 1): 0 1 2 0 1 2 0 1 2
• Graph-partitioning-based mapping
  • Applicable for sparse matrix computations
• Mapping based on task partitioning
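The two distributions can be expressed as owner-computes mapping functions; a minimal sketch for 9 items on 3 processes:

```python
def block_owner(i, n, p):
    """Owner of element i under a block distribution of n items over
    p processes (assumes p divides n, for simplicity)."""
    return i // (n // p)

def cyclic_owner(i, p):
    """Owner of element i under a cyclic distribution
    (block-cyclic with block size 1)."""
    return i % p

print([block_owner(i, 9, 3) for i in range(9)])   # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print([cyclic_owner(i, 3) for i in range(9)])     # [0, 1, 2, 0, 1, 2, 0, 1, 2]
```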
Mapping Based on Task Partitioning
• Based on the task dependency graph
• In general the problem is NP-complete
[Figure: a binary task-dependency tree mapped onto 8 processes – the root on process 0, the next level on processes 0 and 4, then 0, 2, 4, 6, and the leaves on processes 0 through 7]
Mapping
Dynamic mapping
• A process/global memory can hold a set of tasks
• Distribute some tasks to all processes
• Once a process completes its tasks, it asks the coordinator process for more tasks
• Referred to as self-scheduling or work stealing
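A minimal sketch of self-scheduling with a shared task pool (threads stand in for processes, and the squaring task is a placeholder for real work):

```python
import queue
import threading

def worker(tasks, results):
    # Each worker repeatedly asks the shared pool for more work --
    # the essence of self-scheduling.
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return                    # no tasks left: worker finishes
        results.append(item * item)   # placeholder for the real task

tasks = queue.Queue()
for i in range(20):
    tasks.put(i)
results = []
threads = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results) == [i * i for i in range(20)])  # True
```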
High-level Goals
Table 2.1: Steps in the Parallelization Process and Their Goals

Step           Architecture-Dependent?  Major Performance Goals
Decomposition  Mostly no                Expose enough concurrency, but not too much
Assignment     Mostly no                Balance workload; reduce communication volume
Orchestration  Yes                      Reduce noninherent communication via data locality; reduce communication and synchronization cost as seen by the processor; reduce serialization at shared resources; schedule tasks to satisfy dependences early
Mapping        Yes                      Put related processes on the same processor if necessary; exploit locality in network topology
PARALLEL ARCHITECTURE
Classification of Architectures – Flynn’s classification
In terms of parallelism in instruction and data stream
Single Instruction Single Data (SISD): Serial Computers
Single Instruction Multiple Data (SIMD)
- Vector processors and processor arrays
- Examples: CM-2, Cray-90, Cray YMP, Hitachi 3600
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Flynn’s classification
Multiple Instruction Single Data (MISD): not popular
Multiple Instruction Multiple Data (MIMD)
- Most popular: IBM SP and most other supercomputers, clusters, computational Grids, etc.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Based on Memory
Shared memory – 2 types: UMA and NUMA
Examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification 2: Shared Memory vs Message Passing
• Shared memory machine: the n processors share a physical address space; communication can be done through this shared memory
• The alternative is sometimes referred to as a message passing machine or a distributed memory machine
[Figure: shared memory – processors connected through an interconnect to a common main memory; distributed memory – each processor paired with its own memory module, the pairs connected by an interconnect]
Shared Memory Machines
• The shared memory could itself be distributed among the processor nodes
• Each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time
• Terms: NUMA (Non-Uniform Memory Access) vs UMA (Uniform Memory Access) architecture
Classification of Architectures – Based on Memory
Distributed memory
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Recently, multi-cores
Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids
Parallel Architecture: Interconnection Networks
An interconnection network is defined by switches, links and interfaces
• Switches – provide mapping between input and output ports, buffering, routing etc.
• Interfaces – connect nodes with the network
Parallel Architecture: Interconnections
• Indirect interconnects: nodes are connected to an interconnection medium, not directly to each other
  • Shared bus, multiple bus, crossbar, MIN
• Direct interconnects: nodes are connected directly to each other
  • Topology: linear, ring, star, mesh, torus, hypercube
• Routing techniques: how the route taken by a message from source to destination is decided
• Network topologies
  • Static – point-to-point communication links among processing nodes
  • Dynamic – communication links are formed dynamically by switches
Interconnection Networks
Static
• Bus – SGI Challenge
• Completely connected
• Star
• Linear array, ring (1-D torus)
• Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
• k-d mesh: d dimensions with k nodes in each dimension
• Hypercubes – 2-log p mesh – e.g. many MIMD machines
• Trees – our campus network
Dynamic – communication links are formed dynamically by switches
• Crossbar – Cray X series – non-blocking network
• Multistage – SP2 – blocking network
For more details, and an evaluation of topologies, refer to the book by Grama et al.
Indirect Interconnects
[Figure: shared bus, multiple bus, crossbar switch, and a multistage interconnection network built from 2x2 crossbars]
Direct Interconnect Topologies
[Figure: linear array, ring, star, 2-D mesh, torus, and hypercube (binary n-cube) for n=2 and n=3]
Evaluating Interconnection Topologies
• Diameter – maximum distance between any two processing nodes
  • Fully connected – 1
  • Star – 2
  • Ring – ⌊p/2⌋
  • Hypercube – log p
• Connectivity – multiplicity of paths between 2 nodes; the minimum number of arcs that must be removed from the network to break it into two disconnected networks
  • Linear array – 1
  • Ring – 2
  • 2-D mesh – 2
  • 2-D mesh with wraparound – 4
  • d-dimensional hypercube – d
Evaluating Interconnection Topologies
• Bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves
  • Ring – 2
  • P-node 2-D mesh – √P
  • Tree – 1
  • Star – 1
  • Completely connected – P²/4
  • Hypercube – P/2
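These metrics are simple closed forms; a small sketch collecting diameter and bisection width for a ring, a 2-D mesh without wraparound, and a hypercube (assuming, for the formulas to apply, that P is a perfect square for the mesh and a power of two for the hypercube):

```python
import math

def ring_metrics(P):
    """P-node ring: diameter floor(P/2), bisection width 2."""
    return {"diameter": P // 2, "bisection_width": 2}

def mesh2d_metrics(P):
    """sqrt(P) x sqrt(P) mesh, no wraparound: diameter 2(sqrt(P)-1),
    bisection width sqrt(P). Assumes P is a perfect square."""
    side = math.isqrt(P)
    return {"diameter": 2 * (side - 1), "bisection_width": side}

def hypercube_metrics(P):
    """log2(P)-dimensional hypercube: diameter log2(P),
    bisection width P/2. Assumes P is a power of two."""
    d = int(math.log2(P))
    return {"diameter": d, "bisection_width": P // 2}

print(ring_metrics(16))       # {'diameter': 8, 'bisection_width': 2}
print(mesh2d_metrics(16))     # {'diameter': 6, 'bisection_width': 4}
print(hypercube_metrics(16))  # {'diameter': 4, 'bisection_width': 8}
```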
Evaluating Interconnection Topologies
• Channel width – number of bits that can be simultaneously communicated over a link, i.e. the number of physical wires between 2 nodes
• Channel rate – performance of a single physical wire
• Channel bandwidth – channel rate times channel width
• Bisection bandwidth – maximum volume of communication between the two halves of the network, i.e. bisection width times channel bandwidth
Shared Memory Architecture: Caches
[Figure: P1 and P2 both read X = 0 from memory into their caches; P1 then writes X = 1, updating its own cache; a subsequent read of X by P2 hits in P2's cache and returns the stale value 0 – a cache hit on wrong data]
Cache Coherence Problem
• If each processor in a shared memory multiprocessor machine has a data cache, there is a potential data consistency problem: the cache coherence problem
• Cause: modification of a shared variable in a private cache
• Objective: processes shouldn't read `stale' data
• Solution: hardware cache coherence mechanisms
Cache Coherence Protocols
• Write update – propagate the modified cache line to the other processors on every write
• Write invalidate – a write invalidates the copies in other caches; a processor fetches the updated cache line the next time it reads that data
• Which is better?
Invalidation-Based Cache Coherence
[Figure: P1 and P2 both read X = 0 into their caches; when P1 writes X = 1, an invalidate transaction removes the copy from P2's cache; P2's next read of X misses and fetches the updated value X = 1]
Cache Coherence Using Invalidate Protocols
• 3 states associated with data items
  • Shared – a variable shared by 2 caches
  • Invalid – another processor (say P0) has updated the data item
  • Dirty – the state of the data item in P0
• Implementations
  • Snoopy
    • For bus-based architectures: a shared bus interconnect where all cache controllers monitor all bus activity
    • There is only one operation through the bus at a time; cache controllers can be built to take corrective action and enforce coherence in the caches
    • Memory operations are propagated over the bus and snooped
  • Directory-based
    • Instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors
    • A central directory maintains the states of cache blocks and the associated processors
    • Implemented with presence bits
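The invalidation scenario above can be mimicked with a toy simulation. This is only a sketch: real protocols track the Shared/Invalid/Dirty states and bus transactions, while here write-through memory is assumed for simplicity:

```python
# Toy write-invalidate simulation (illustrative only; real protocols such
# as MSI/MESI also track per-line states and snoop bus transactions).
class InvalidateSim:
    def __init__(self, nprocs):
        self.memory = {}
        self.caches = [dict() for _ in range(nprocs)]

    def read(self, p, addr):
        cache = self.caches[p]
        if addr not in cache:                 # miss: fetch from memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, p, addr, value):
        # Write invalidate: remove every other cached copy before writing.
        for q, cache in enumerate(self.caches):
            if q != p:
                cache.pop(addr, None)
        self.caches[p][addr] = value
        self.memory[addr] = value             # write-through, for simplicity

sim = InvalidateSim(2)
sim.read(0, "X"); sim.read(1, "X")   # both caches now hold X = 0
sim.write(0, "X", 1)                 # P0's write invalidates P1's copy
print(sim.read(1, "X"))              # P1 misses and fetches the new value: 1
```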
END