Parallel Programming
Sathish S. Vadhiyar
Motivations of Parallel Computing

Parallel machine: a computer system with more than one processor.

Motivations:
• Faster execution time due to non-dependencies between regions of code
• Presents a level of modularity
• Resource constraints, e.g. large databases
• A certain class of algorithms lends itself naturally to parallelism
• Aggregate bandwidth to memory/disk; increase in data throughput
• Clock rate improvement in the past decade – 40%
• Memory access time improvement in the past decade – 10%
Parallel Programming and Challenges
Recall the advantages and motivation of parallelism
But parallel programs incur overheads not seen in sequential programs:
• Communication delay
• Idling
• Synchronization
Challenges
[Figure: execution timelines of two processes P0 and P1, showing computation, communication, synchronization, and idle time]
How do we evaluate a parallel program?
• Execution time, Tp
• Speedup, S: S(p, n) = T(1, n) / T(p, n). Usually S(p, n) < p; sometimes S(p, n) > p (superlinear speedup)
• Efficiency, E: E(p, n) = S(p, n) / p. Usually E(p, n) < 1; sometimes greater than 1
• Scalability – limitations in parallel computing, relation to n and p
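These metrics are straightforward to compute from measured times; a minimal sketch (the timing numbers below are hypothetical, for illustration only):

```python
# Speedup and efficiency from measured execution times.
def speedup(t_serial, t_parallel):
    """S(p, n) = T(1, n) / T(p, n)"""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, p):
    """E(p, n) = S(p, n) / p"""
    return speedup(t_serial, t_parallel) / p

# Hypothetical measurements: a 100 s serial run takes 16 s on 8 processors.
s = speedup(100.0, 16.0)        # 6.25
e = efficiency(100.0, 16.0, 8)  # 0.78125 -- below 1, as is typical
print(s, e)
```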
Speedups and efficiency
[Figure: speedup S vs. p – the ideal curve is the line S = p, while practical speedup falls below it; efficiency E vs. p – ideally constant at 1, but decreasing in practice]
Limitations on speedup – Amdahl’s law
Amdahl's law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.
The overall speedup is expressed in terms of the fractions of computation time with and without the enhancement and the speedup of the enhancement itself.
This places a limit on the speedup due to parallelism:
Speedup = 1 / (fs + fp/P)
where fs is the serial fraction, fp = 1 – fs the parallel fraction, and P the number of processors.
Amdahl’s law Illustration
S = 1 / (s + (1-s)/p)
[Figure: speedup and efficiency curves as functions of p for several serial fractions s; efficiency drops from 1 toward 0 as p grows]
Courtesy:
http://www.metz.supelec.fr/~dedu/docs/kohPaper/node2.html
http://nereida.deioc.ull.es/html/openmp/pdp2002/sld008.htm
Amdahl’s law analysis
f P=1 P=4 P=8 P=16 P=32
1.00 1.0 4.00 8.00 16.00 32.00
0.99 1.0 3.88 7.48 13.91 24.43
0.98 1.0 3.77 7.02 12.31 19.75
0.96 1.0 3.57 6.25 10.00 14.29
• For the same fraction f, the speedup falls further and further below the processor count as P increases.
• Thus Amdahl’s law is a bit depressing for parallel programming.
• In practice, the amount of parallelizable work has to be large enough to match a given number of processors.
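The table entries follow directly from the formula; a quick check (a sketch, with f the parallel fraction):

```python
def amdahl_speedup(f, P):
    """Amdahl's law with parallel fraction f on P processors:
    S = 1 / ((1 - f) + f/P)."""
    return 1.0 / ((1.0 - f) + f / P)

# Reproduce the rows of the table above.
for f in (1.00, 0.99, 0.98, 0.96):
    row = [round(amdahl_speedup(f, P), 2) for P in (1, 4, 8, 16, 32)]
    print(f, row)
# f = 0.99 gives [1.0, 3.88, 7.48, 13.91, 24.43], matching the table
```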
Gustafson’s Law
Amdahl’s law keeps the parallel work fixed. Gustafson’s law keeps the computation time on the parallel processors fixed and changes the problem size (the fraction of parallel/sequential work) to match that computation time.
• For a particular number of processors, find the problem size for which the parallel time equals the constant time
• For that problem size, find the sequential time and the corresponding speedup
The speedup is thus scaled – a scaled speedup, also called weak speedup.
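Gustafson's scaled speedup is commonly written S = s + (1 - s)·P, where s is the serial fraction of the execution time on the parallel machine; a small sketch of how it compares to Amdahl's prediction:

```python
def gustafson_speedup(s, P):
    """Scaled (weak) speedup: S = s + (1 - s) * P,
    where s is the serial fraction of the *parallel* execution time."""
    return s + (1.0 - s) * P

# With a 5% serial fraction, scaled speedup grows almost linearly with P:
for P in (4, 8, 16, 32):
    print(P, gustafson_speedup(0.05, P))
# 32 processors give 30.45, far more optimistic than Amdahl's
# fixed-size prediction 1/(0.05 + 0.95/32), which is about 12.55
```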
Metrics (contd.)

N     P=1   P=4   P=8   P=16
64    1.0   0.80  0.57  0.33
192   1.0   0.92  0.80  0.60
512   1.0   0.97  0.91  0.80

Table 5.1: Efficiency as a function of n and p.
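These values are consistent with the classic example of adding n numbers on p processors with parallel time T_p = n/p + 2·log2(p), which gives E = n / (n + 2·p·log2(p)). That this is the table's source is an assumption, but the numbers match:

```python
import math

def efficiency_add_n(n, p):
    """Efficiency of adding n numbers on p processors, assuming
    T_p = n/p + 2*log2(p); then E = n / (n + 2*p*log2(p)).
    (Assumed model -- it reproduces Table 5.1.)"""
    if p == 1:
        return 1.0
    return n / (n + 2 * p * math.log2(p))

for n in (64, 192, 512):
    print(n, [round(efficiency_add_n(n, p), 2) for p in (1, 4, 8, 16)])
# 64 -> [1.0, 0.8, 0.57, 0.33], matching the first row of the table
```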
Scalability
• Efficiency decreases with increasing P and increases with increasing N
• Scalability: how effectively the parallel algorithm can use an increasing number of processors
• How the amount of computation performed must scale with P to keep E constant; this function of computation in terms of P is called the isoefficiency function
• An algorithm with an isoefficiency function of O(P) is highly scalable, while an algorithm with a quadratic or exponential isoefficiency function is poorly scalable
Parallel Program Models
Single Program Multiple Data (SPMD)
Multiple Program Multiple Data (MPMD)
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Programming Paradigms
Shared memory model – Threads, OpenMP, CUDA
Message passing model – MPI
PARALLELIZATION
Parallelizing a Program
Given a sequential program/algorithm, how to go about producing a parallel version?
Four steps in program parallelization:
1. Decomposition: identifying parallel tasks with a large extent of possible concurrent activity; splitting the problem into tasks
2. Assignment: grouping the tasks into processes with the best load balancing
3. Orchestration: reducing synchronization and communication costs
4. Mapping: mapping of processes to processors (if possible)
Steps in Creating a Parallel Program
[Figure: a sequential computation is partitioned – decomposition into tasks, assignment of tasks to processes p0–p3, orchestration into a parallel program, and mapping of processes onto processors P0–P3]
Orchestration
Goals:
• Structuring communication
• Synchronization
Challenges:
• Organizing data structures – packing
• Small or large messages?
• How to organize communication and synchronization?
Orchestration
• Maximizing data locality
  • Minimizing the volume of data exchange: not communicating intermediate results – e.g. dot product
  • Minimizing the frequency of interactions – packing
• Minimizing contention and hot spots
  • Do not use the same communication pattern with the other processes in all the processes
• Overlapping computations with interactions
  • Split computations into phases: those that depend on communicated data (type 1) and those that do not (type 2)
  • Initiate communication for type 1; during the communication, perform type 2
• Replicating data or computations
  • Balancing the extra computation or storage cost against the gain due to less communication
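The type-1/type-2 phase split can be sketched with a thread standing in for the communication; the function and data names here are hypothetical, and the communication delay is simulated:

```python
import threading
import time

def fetch_remote_data(result):
    """Stand-in for a non-blocking receive of the type-1 input data."""
    time.sleep(0.1)                   # simulated communication delay
    result["halo"] = [1, 2, 3]        # hypothetical received values

def overlapped_step():
    result = {}
    comm = threading.Thread(target=fetch_remote_data, args=(result,))
    comm.start()                      # initiate communication for type-1 work
    local = sum(range(1000))          # type-2 work: independent of the message
    comm.join()                       # block only when type-1 data is needed
    return local + sum(result["halo"])  # type-1 work uses the received data

print(overlapped_step())              # 499500 + 6 = 499506
```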
Mapping
Which process runs on which particular processor?
• Can depend on the network topology and the communication pattern of the processes
• Can depend on processor speeds in the case of heterogeneous systems
Mapping
Static mapping
• Mapping based on data partitioning
  • Applicable to dense matrix computations
  • Block distribution: 0 0 0 1 1 1 2 2 2
  • Block-cyclic distribution (block size 1): 0 1 2 0 1 2 0 1 2
• Graph-partitioning-based mapping
  • Applicable for sparse matrix computations
• Mapping based on task partitioning
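The two distributions can be expressed as owner-computes mapping functions; a minimal sketch for 9 items on 3 processes:

```python
def block_owner(i, n, p):
    """Owner of element i under a block distribution of n items over
    p processes (assumes p divides n, for simplicity)."""
    return i // (n // p)

def cyclic_owner(i, p):
    """Owner of element i under a cyclic distribution
    (block-cyclic with block size 1)."""
    return i % p

print([block_owner(i, 9, 3) for i in range(9)])   # [0, 0, 0, 1, 1, 1, 2, 2, 2]
print([cyclic_owner(i, 3) for i in range(9)])     # [0, 1, 2, 0, 1, 2, 0, 1, 2]
```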
Mapping Based on Task Partitioning
• Based on the task dependency graph
• In general the problem is NP-complete
[Figure: a binary task-dependency tree mapped onto 8 processes – the root on process 0, the next level on processes 0 and 4, then 0, 2, 4, 6, and the leaves on processes 0 through 7]
Mapping
Dynamic mapping
• A process/global memory can hold a set of tasks
• Distribute some tasks to all processes
• Once a process completes its tasks, it asks the coordinator process for more tasks
• Referred to as self-scheduling or work stealing
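A minimal sketch of self-scheduling with a shared task pool (threads stand in for processes, and the squaring task is a placeholder for real work):

```python
import queue
import threading

def worker(tasks, results):
    # Each worker repeatedly asks the shared pool for more work --
    # the essence of self-scheduling.
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return                    # no tasks left: worker finishes
        results.append(item * item)   # placeholder for the real task

tasks = queue.Queue()
for i in range(20):
    tasks.put(i)
results = []
threads = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results) == [i * i for i in range(20)])  # True
```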
High-level Goals
Table 2.1: Steps in the Parallelization Process and Their Goals

Step           Architecture-Dependent?  Major Performance Goals
Decomposition  Mostly no                Expose enough concurrency, but not too much
Assignment     Mostly no                Balance workload; reduce communication volume
Orchestration  Yes                      Reduce noninherent communication via data locality; reduce communication and synchronization cost as seen by the processor; reduce serialization at shared resources; schedule tasks to satisfy dependences early
Mapping        Yes                      Put related processes on the same processor if necessary; exploit locality in network topology
PARALLEL ARCHITECTURE
Classification of Architectures – Flynn’s classification
In terms of parallelism in instruction and data stream
Single Instruction Single Data (SISD): Serial Computers
Single Instruction Multiple Data (SIMD)
- Vector processors and processor arrays
- Examples: CM-2, Cray-90, Cray YMP, Hitachi 3600
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Flynn’s classification
Multiple Instruction Single Data (MISD): not popular
Multiple Instruction Multiple Data (MIMD)
- Most popular: IBM SP and most other supercomputers, clusters, computational Grids, etc.
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification of Architectures – Based on Memory
Shared memory – 2 types: UMA and NUMA
Examples: HP-Exemplar, SGI Origin, Sequent NUMA-Q
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Classification 2: Shared Memory vs Message Passing
• Shared memory machine: the n processors share a physical address space; communication can be done through this shared memory
• The alternative is sometimes referred to as a message passing machine or a distributed memory machine
[Figure: shared memory – processors connected through an interconnect to a common main memory; distributed memory – each processor paired with its own memory module, the pairs connected by an interconnect]
Shared Memory Machines
• The shared memory could itself be distributed among the processor nodes
• Each processor might have some portion of the shared physical address space that is physically close to it and therefore accessible in less time
• Terms: NUMA (Non-Uniform Memory Access) vs UMA (Uniform Memory Access) architecture
Classification of Architectures – Based on Memory
Distributed memory
Courtesy: http://www.llnl.gov/computing/tutorials/parallel_comp/
Recently, multi-cores
Yet another classification – MPPs, NOW (Berkeley), COW, Computational Grids
Parallel Architecture: Interconnection Networks
An interconnection network is defined by switches, links and interfaces
• Switches – provide mapping between input and output ports, buffering, routing etc.
• Interfaces – connect nodes with the network
Parallel Architecture: Interconnections
• Indirect interconnects: nodes are connected to an interconnection medium, not directly to each other
  • Shared bus, multiple bus, crossbar, MIN
• Direct interconnects: nodes are connected directly to each other
  • Topology: linear, ring, star, mesh, torus, hypercube
• Routing techniques: how the route taken by a message from source to destination is decided
• Network topologies
  • Static – point-to-point communication links among processing nodes
  • Dynamic – communication links are formed dynamically by switches
Interconnection Networks
Static
• Bus – SGI Challenge
• Completely connected
• Star
• Linear array, ring (1-D torus)
• Mesh – Intel ASCI Red (2-D), Cray T3E (3-D), 2-D torus
• k-d mesh: d dimensions with k nodes in each dimension
• Hypercubes – 2-log p mesh – e.g. many MIMD machines
• Trees – our campus network
Dynamic – communication links are formed dynamically by switches
• Crossbar – Cray X series – non-blocking network
• Multistage – SP2 – blocking network
For more details, and an evaluation of topologies, refer to the book by Grama et al.
Indirect Interconnects
[Figure: shared bus, multiple bus, crossbar switch, and a multistage interconnection network built from 2x2 crossbars]
Direct Interconnect Topologies
[Figure: linear array, ring, star, 2-D mesh, torus, and hypercube (binary n-cube) for n=2 and n=3]
Evaluating Interconnection Topologies
• Diameter – maximum distance between any two processing nodes
  • Fully connected – 1
  • Star – 2
  • Ring – ⌊p/2⌋
  • Hypercube – log p
• Connectivity – multiplicity of paths between 2 nodes; the minimum number of arcs that must be removed from the network to break it into two disconnected networks
  • Linear array – 1
  • Ring – 2
  • 2-D mesh – 2
  • 2-D mesh with wraparound – 4
  • d-dimensional hypercube – d
Evaluating Interconnection Topologies
• Bisection width – minimum number of links to be removed from the network to partition it into 2 equal halves
  • Ring – 2
  • P-node 2-D mesh – √P
  • Tree – 1
  • Star – 1
  • Completely connected – P²/4
  • Hypercube – P/2
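These metrics are simple closed forms; a small sketch collecting diameter and bisection width for a ring, a 2-D mesh without wraparound, and a hypercube (assuming, for the formulas to apply, that P is a perfect square for the mesh and a power of two for the hypercube):

```python
import math

def ring_metrics(P):
    """P-node ring: diameter floor(P/2), bisection width 2."""
    return {"diameter": P // 2, "bisection_width": 2}

def mesh2d_metrics(P):
    """sqrt(P) x sqrt(P) mesh, no wraparound: diameter 2(sqrt(P)-1),
    bisection width sqrt(P). Assumes P is a perfect square."""
    side = math.isqrt(P)
    return {"diameter": 2 * (side - 1), "bisection_width": side}

def hypercube_metrics(P):
    """log2(P)-dimensional hypercube: diameter log2(P),
    bisection width P/2. Assumes P is a power of two."""
    d = int(math.log2(P))
    return {"diameter": d, "bisection_width": P // 2}

print(ring_metrics(16))       # {'diameter': 8, 'bisection_width': 2}
print(mesh2d_metrics(16))     # {'diameter': 6, 'bisection_width': 4}
print(hypercube_metrics(16))  # {'diameter': 4, 'bisection_width': 8}
```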
Evaluating Interconnection Topologies
• Channel width – number of bits that can be simultaneously communicated over a link, i.e. the number of physical wires between 2 nodes
• Channel rate – performance of a single physical wire
• Channel bandwidth – channel rate times channel width
• Bisection bandwidth – maximum volume of communication between the two halves of the network, i.e. bisection width times channel bandwidth
Shared Memory Architecture: Caches
[Figure: P1 and P2 both read X = 0 from memory into their caches; P1 then writes X = 1, updating its own cache; a subsequent read of X by P2 hits in P2's cache and returns the stale value 0 – a cache hit on wrong data]
Cache Coherence Problem
• If each processor in a shared memory multiprocessor machine has a data cache, there is a potential data consistency problem: the cache coherence problem
• Cause: modification of a shared variable in a private cache
• Objective: processes shouldn't read `stale' data
• Solution: hardware cache coherence mechanisms
Cache Coherence Protocols
• Write update – propagate the modified cache line to the other processors on every write
• Write invalidate – a write invalidates the copies in other caches; a processor fetches the updated cache line the next time it reads that data
• Which is better?
Invalidation-Based Cache Coherence
[Figure: P1 and P2 both read X = 0 into their caches; when P1 writes X = 1, an invalidate transaction removes the copy from P2's cache; P2's next read of X misses and fetches the updated value X = 1]
Cache Coherence Using Invalidate Protocols
• 3 states associated with data items
  • Shared – a variable shared by 2 caches
  • Invalid – another processor (say P0) has updated the data item
  • Dirty – the state of the data item in P0
• Implementations
  • Snoopy
    • For bus-based architectures: a shared bus interconnect where all cache controllers monitor all bus activity
    • There is only one operation through the bus at a time; cache controllers can be built to take corrective action and enforce coherence in the caches
    • Memory operations are propagated over the bus and snooped
  • Directory-based
    • Instead of broadcasting memory operations to all processors, propagate coherence operations only to the relevant processors
    • A central directory maintains the states of cache blocks and the associated processors
    • Implemented with presence bits
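The invalidation scenario above can be mimicked with a toy simulation. This is only a sketch: real protocols track the Shared/Invalid/Dirty states and bus transactions, while here write-through memory is assumed for simplicity:

```python
# Toy write-invalidate simulation (illustrative only; real protocols such
# as MSI/MESI also track per-line states and snoop bus transactions).
class InvalidateSim:
    def __init__(self, nprocs):
        self.memory = {}
        self.caches = [dict() for _ in range(nprocs)]

    def read(self, p, addr):
        cache = self.caches[p]
        if addr not in cache:                 # miss: fetch from memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, p, addr, value):
        # Write invalidate: remove every other cached copy before writing.
        for q, cache in enumerate(self.caches):
            if q != p:
                cache.pop(addr, None)
        self.caches[p][addr] = value
        self.memory[addr] = value             # write-through, for simplicity

sim = InvalidateSim(2)
sim.read(0, "X"); sim.read(1, "X")   # both caches now hold X = 0
sim.write(0, "X", 1)                 # P0's write invalidates P1's copy
print(sim.read(1, "X"))              # P1 misses and fetches the new value: 1
```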
END