
1

Parallel Computing
Slides from Prof. Jeffrey Hollingsworth

2

What is Parallel Computing?

■ Does it include:
  – super-scalar processing (more than one instruction at once)?
  – client/server computing?
    • what if RPC calls are non-blocking?
  – vector processing (same instruction to several values)?
  – collection of PCs not connected to a network?
■ For us, parallel computing requires:
  – more than one processing element
  – nodes connected to a communication network
  – nodes working together to solve a single problem

3

Why Parallelism

■ Speed
  – need to get results faster than is possible sequentially
    • a weather forecast that is late is useless
  – could come from
    • more processing elements (P.E.s)
    • more memory (or cache)
    • more disks
■ Cost: cheaper to buy many smaller machines
  – this is only recently true, due to
    • VLSI
    • commodity parts

4

What Does a Parallel Computer Look Like?

■ Hardware
  – processors
  – communication
  – memory
  – coordination
■ Software
  – programming model
  – communication libraries
  – operating system

5

Processing Elements (PE)

■ Key Processor Choices
  – How many?
  – How powerful?
  – Custom or off-the-shelf?
■ Major Styles of Parallel Computing
  – SIMD - Single Instruction Multiple Data
    • one master program counter (PC)
  – MIMD - Multiple Instruction Multiple Data
    • separate code for each processor
  – SPMD - Single Program Multiple Data
    • same code on each processor, separate PCs on each
  – Dataflow - instruction waits for operands
    • "automatically" finds parallelism

6

SIMD

[Diagram: one program and one program counter drive all processors; a per-processor mask flag (0/1) enables or disables each PE.]

7

MIMD

[Diagram: each processor runs its own program (#1, #2, #3) with its own program counter.]

8

SPMD

[Diagram: every processor runs the same program, but each has its own program counter.]

9

Dataflow

[Diagram: instructions I1, I2, and I3 feed instruction I4; an instruction fires when its operands arrive.]

10

Communication Networks

■ Connect
  – PEs, memory, I/O
■ Key Performance Issues
  – latency: time for the first byte
  – throughput: average bytes/second
■ Possible Topologies
  – bus - simple, but doesn't scale
  – ring - orders delivery of messages

[Diagram: PEs and memory modules attached to a shared bus and connected in a ring.]

11

Topologies (cont)

– tree - needs to increase bandwidth near the top
– mesh - two or three dimensions
– hypercube - needs a power-of-two number of nodes

[Diagram: tree, mesh, and hypercube topologies built from PEs.]


14

Memory Systems

■ Key Performance Issues
  – latency: time for the first byte
  – throughput: average bytes/second
■ Design Issues
  – Where is the memory?
    • divided among the nodes
    • centrally located (on the communication network)
  – Access by processors
    • can all processors get to all memory?
    • is the access time uniform?

15

Coordination

■ Synchronization
  – protection of a single object (locks)
  – coordination of processors (barriers)
■ Size of a unit of work by a processor
  – need to manage two issues
    • load balance - processors have equal work
    • coordination overhead - communication and synchronization
  – often called "grain" size - large grain vs. fine grain

16

Sources of Parallelism

■ Statements
  – called "control parallel"
  – can perform a series of steps in parallel
■ Loops
  – called "data parallel"
  – most common source of parallelism
  – each processor gets one (or more) iterations to perform

17

Example of Parallelism

■ Easy (embarrassingly parallel)
  – multiple independent jobs (e.g., different simulations)
■ Scientific
  – largest users of parallel computing
  – dense linear algebra (divide up the matrix)
  – physical system simulations (divide physical space)
■ Databases
  – biggest commercial success of parallel computing (divide tuples)
    • exploits the semantics of relational calculus
■ AI
  – search problems (divide the search space)
  – pattern recognition and image processing (divide the image)

18

Metrics in Application Performance

■ Speedup (often called strong scaling)
  – ratio of the time on a single node to the time on n nodes
  – hold the problem size fixed
  – should really compare to the best serial time
  – goal is linear speedup
  – super-linear speedup is possible due to:
    • adding more memory
    • search problems
■ Weak Scaling (also called Iso-Speedup)
  – scale the data size up with the number of nodes
  – goal is a flat, horizontal curve
■ Amdahl's Law
  – max speedup is 1/(serial fraction of time)
■ Computation to Communication Ratio
  – goal is to maximize this ratio
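A compact restatement of the strong-scaling limit above, as a hedged math sketch in my notation (s is the serial fraction of the time, p the number of processors):

  \[ S(p) \;=\; \frac{T_1}{T_p} \;\le\; \frac{1}{\,s + (1-s)/p\,} \;\le\; \frac{1}{s} \]

For example, a code that is 5% serial (s = 0.05) can never exceed a 20x speedup, no matter how many nodes are added.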


20

How to Write Parallel Programs

■ Use old serial code
  – compiler converts it to parallel
  – called the "dusty deck" problem
■ Serial language plus a communication library
  – no compiler changes required!
  – PVM and MPI use this approach
■ New language for parallel computing
  – requires all code to be re-written
  – hard to create a language that provides performance on different platforms
■ Hybrid approach
  – HPF - add data distribution commands to the code
  – add parallel loops and synchronization operations

21

Application Example - Weather

■ Typical of many scientific codes
  – computes results for a three-dimensional space
  – computes results at multiple time steps
  – uses equations to describe the physics/chemistry of the problem
  – grids are used to discretize continuous space
    • granularity of the grids is important to speed/accuracy
■ Simplifications (for this example; not in real codes)
  – earth is flat (no mountains)
  – earth is round (poles are really flat, earth bulges at the equator)
  – second-order properties ignored

22

Grid Points

■ Divide continuous space into discrete parts
  – for this code, the grid size is fixed and uniform
    • possible to change the grid size or use multiple grids
  – use three grids
    • two for latitude and longitude
    • one for elevation
    • total of M * N * L points
■ Design choice: where is the grid point?
  – left (L), right (R), or center (C) of the grid cell
  – in multiple dimensions this multiplies:
    • for 3 dimensions there are 27 possible points

23

Variables

■ One-dimensional
  – m - geo-potential (gravitational effects)
■ Two-dimensional
  – pi - "shifted" surface pressure
  – sigmadot - vertical component of the wind velocity
■ Three-dimensional (primary variables)
  – <u,v> - wind velocity/direction vector
  – T - temperature
  – q - specific humidity
  – p - pressure
■ Not included
  – clouds
  – precipitation
  – can be derived from the others

24

Serial Computation

■ Convert the equations to discrete form
■ Update from time t to t + delta t:

  foreach longitude, latitude, altitude
    ustar[i,j,k] = n * pi[i,j] * u[i,j,k]
    vstar[i,j,k] = m[j] * pi[i,j] * v[i,j,k]
    sdot[i,j,k] = pi[i,j] * sigmadot[i,j]
  end

  foreach longitude, latitude, altitude
    D = 4 * ((ustar[i,j,k] + ustar[i-1,j,k]) * (q[i,j,k] + q[i-1,j,k]) +
             terms in {i,j,k}{+,-}{1,2} ...)
    piq[i,j,k] = piq[i,j,k] + D * deltat
    similar terms for piu, piv, piT, and pi
  end

  foreach longitude, latitude, altitude
    q[i,j,k] = piq[i,j,k]/pi[i,j,k]
    u[i,j,k] = piu[i,j,k]/pi[i,j,k]
    v[i,j,k] = piv[i,j,k]/pi[i,j,k]
    T[i,j,k] = piT[i,j,k]/pi[i,j,k]
  end

25

Shared Memory Version

■ In each loop nest, iterations are independent
■ Use a parallel for-loop for each loop nest
■ Synchronize (barrier) after each loop nest
  – this is overly conservative, but works
  – could use a single sync variable per item, but that would incur excessive overhead
■ Potential parallelism is M * N * L
■ Private variables: D, i, j, k
■ Advantages of shared memory
  – easier to get something working (ignoring performance)
■ Hard to debug
  – other processors can modify shared data
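A minimal sketch of one of these loop nests written with OpenMP directives (OpenMP is covered later in these slides). The array names follow the pseudocode above but are treated here as 3-D C arrays, and n_coef/m_coef are my placeholders for the n and m map factors; this is an illustration under those assumptions, not the slide's code.

  /* the outer loop iterations are independent; i is private as the loop
     variable, j and k are declared private explicitly */
  #pragma omp parallel for private(j, k)
  for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < L; k++) {
        ustar[i][j][k] = n_coef[j] * pi[i][j] * u[i][j][k];
        vstar[i][j][k] = m_coef[j] * pi[i][j] * v[i][j][k];
        sdot[i][j][k]  = pi[i][j] * sigmadot[i][j];
      }
  /* the implicit barrier at the end of the parallel for provides the
     per-loop-nest synchronization described on this slide */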

26

Distributed Memory Weather

■ Decompose the data onto specific processors
  – assign a cube to each processor
    • maximize the volume-to-surface ratio
    • minimizes the communication/computation ratio
  – called a <block,block,block> distribution
■ Need to communicate the {i,j,k}{+,-}{1,2} terms at the boundaries
  – use send/receive to move the data
  – no need for barriers; send/receive operations provide the synchronization
    • send earlier in the computation to hide communication time
■ Advantages
  – easier to debug?
  – data locality is considered explicitly via the data decomposition
■ Problems
  – harder to get the code running

27

Seismic Code

■ Given echo data, compute an undersea map
■ Computation model
  – designed for a collection of workstations
  – uses a variation of the RPC model
  – workers are given an independent trace to compute
    • requires little communication
    • supports load balancing (1,000 traces is typical)
■ Performance
  – max MFLOPS = O((F * nz * B*)^(1/2))
  – F - single processor MFLOPS
  – nz - linear dimension of the input array
  – B* - effective communication bandwidth
    • B* = B / (1 + B·L/w) ≈ B/7 for Ethernet (latency L = 10 ms, w = 1400)
  – the real limit to performance was latency, not bandwidth

28

Database Applications

■ Too much data to fit in memory (or sometimes on disk)
  – data mining applications (K-Mart has a 4-5 TB database)
  – imaging applications (NASA has a site with 0.25 petabytes)
    • uses a fork lift to load tapes by the pallet
■ Sources of parallelism
  – within a large transaction
  – among multiple transactions
■ Join operation
  – form a single table from two tables based on a common field
  – try to split the join attribute into disjoint buckets
    • if the data distribution is known to be uniform, it's easy
    • if not, try hashing

29

Speedup in Join parallelism

■ Book claims a speedup of p^2 is possible
  – split each relation into p buckets
    • each bucket is a disjoint subset of the join attribute
  – each processor only has to consider N/p tuples per relation
    • join is O(N^2), so each processor does O((N/p)^2) work
    • so the apparent speedup is O(N^2) / O(N^2/p^2) = O(p^2)
■ This is a lie!
  – a single processor could also split its input into p buckets
    • its time would then be O(p * (N/p)^2) = O(N^2/p)
    • so the real speedup is O(N^2/p) / O(N^2/p^2) = O(p)
  – Amdahl's law is not violated

30

Parallel Search (TSP)

■ May appear to achieve better than n-fold speedup
  – but this is not really the case either
■ Algorithm
  – compute a path on each processor
    • if our path is shorter than the shortest one so far, send it to the others
    • stop searching a path when it is longer than the shortest
  – before computing the next path, check for word of a new minimum path
  – stop when all paths have been explored
■ Why it appears to exceed n-fold speedup
  – we found a shorter path sooner
  – however, the reason for this is a different search order!

31

Ensuring a fair speedup

■ T_serial = fastest of
  – the best known serial algorithm
  – a simulation of the parallel computation
    • use the parallel algorithm
    • run all processes on one processor
  – the parallel algorithm run on one processor
■ If the speedup appears to be super-linear
  – check the memory hierarchy
    • increased cache or real memory may be the reason
  – verify that the order of operations is the same in the parallel and serial cases

32

Quantitative Speedup

■ Consider master-worker
  – one master and P worker processes
  – communication time increases as a linear function of P

  T_P = T_COMP_P + T_COMM_P
  T_COMP_P = T_S / P
  1/S_P = T_P / T_S = 1/P + T_COMM_P / T_S
  T_COMM_P = P * T_COMM_1
  1/S_P = 1/P + P * T_COMM_1 / T_S = 1/P + P / r_1,   where r_1 = T_S / T_COMM_1
  d(1/S_P)/dP = 0  =>  P_opt = r_1^(1/2)  and  S_opt = 0.5 * r_1^(1/2)

■ For a hierarchy of masters
  – T_COMM_P = (1 + log P) * T_COMM_1
  – P_opt = r_1  and  S_opt = r_1 / (1 + log r_1)
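A quick numeric check of the flat (non-hierarchical) case; the value of r_1 is purely illustrative:

  \[ r_1 = \frac{T_S}{T_{COMM_1}} = 100 \;\Rightarrow\; P_{opt} = \sqrt{100} = 10, \qquad \frac{1}{S_{10}} = \frac{1}{10} + \frac{10}{100} = 0.2 \;\Rightarrow\; S_{opt} = 5 = 0.5\sqrt{r_1}. \]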

33

MPI

■ Goals:
  – standardize previous message passing systems:
    • PVM, P4, NX
  – support copy-free message passing
  – portable to many platforms
■ Features:
  – point-to-point messaging
  – group communications
  – profiling interface: every function has a name-shifted version
■ Buffering
  – no guarantee that there are buffers
  – possible that a send will block until the receive is called
■ Delivery Order
  – two sends from the same process to the same destination will arrive in order
  – no guarantee of fairness between processes on receive

34

MPI Communicators

■ Provide a named set of processes for communication
■ All processes within a communicator can be named
  – numbered from 0 … n-1
■ Allows libraries to be constructed
  – the application creates communicators
  – the library uses them
  – prevents problems with posting wildcard receives
    • adds a communicator scope to each receive
■ All programs start with MPI_COMM_WORLD
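One common way for a library to get its own communicator scope is to duplicate the application's communicator; a minimal sketch (my illustration, not from the slides):

  /* Duplicate MPI_COMM_WORLD so the library's receives (including
     wildcard receives) can never match application messages. */
  MPI_Comm libcomm;
  MPI_Comm_dup(MPI_COMM_WORLD, &libcomm);
  /* ... the library communicates only on libcomm ... */
  MPI_Comm_free(&libcomm);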

35

Non-Blocking Functions

■ Two parts
  – post the operation
  – wait for the result
■ Also includes a poll option
  – checks whether the operation has finished
■ Semantics
  – must not alter the buffer while the operation is pending
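A minimal sketch of the post/wait pattern; the buffers, COUNT, TAG, and the left/right ranks are placeholders of mine:

  MPI_Request reqs[2];
  MPI_Irecv(inbuf,  COUNT, MPI_INT, left,  TAG, MPI_COMM_WORLD, &reqs[0]);  /* post receive */
  MPI_Isend(outbuf, COUNT, MPI_INT, right, TAG, MPI_COMM_WORLD, &reqs[1]);  /* post send    */
  /* ... do work that touches neither inbuf nor outbuf ... */
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* wait; MPI_Test could poll instead */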

36

MPI Misc.

■ MPI Types
  – all messages are typed
    • base types are pre-defined:
      – int, double, real, {,unsigned}{short, char, long}
    • can construct user-defined types
      – includes non-contiguous data types
■ Processor Topologies
  – allows construction of Cartesian & arbitrary graphs
  – may allow some systems to run faster
■ What's not in MPI-1
  – process creation
  – I/O
  – one-sided communication

37

MPI Housekeeping Calls

■ Include <mpi.h> in your program
■ If using mpich, …
■ First call MPI_Init(&argc, &argv)
■ MPI_Comm_rank(MPI_COMM_WORLD, &myrank)
  – myrank is set to the id of this process
■ MPI_Wtime
  – returns the wall-clock time
■ At the end, call MPI_Finalize()

38

MPI Communication Calls

■ Parameters
  – var – a variable
  – num – number of elements in the variable to use
  – type – {MPI_INT, MPI_REAL, MPI_BYTE}
  – root – rank of the processor at the root of a collective operation
  – dest – rank of the destination processor (for MPI_Recv, the source rank)
  – status – variable of type MPI_Status
■ Calls (all return a code – check for MPI_SUCCESS)
  – MPI_Send(var, num, type, dest, tag, MPI_COMM_WORLD)
  – MPI_Recv(var, num, type, dest, MPI_ANY_TAG, MPI_COMM_WORLD, &status)
  – MPI_Bcast(var, num, type, root, MPI_COMM_WORLD)
  – MPI_Barrier(MPI_COMM_WORLD)
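A minimal, self-contained sketch (mine, not from the slides) that strings the housekeeping and communication calls above together; it assumes the job is started with at least two ranks:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank, value;
      MPI_Status status;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          value = 42;
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* send to rank 1 */
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
          printf("rank 1 received %d\n", value);
      }
      MPI_Barrier(MPI_COMM_WORLD);   /* everyone synchronizes before exiting */
      MPI_Finalize();
      return 0;
  }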

39

PVM

■ Provides a simple, free, portable parallel environment
■ Runs on everything
  – parallel hardware: SMPs, MPPs, vector machines
  – networks of workstations: ATM, Ethernet
    • UNIX machines and PCs running Win*
  – works on a heterogeneous collection of machines
    • handles type conversion as needed
■ Provides two things
  – a message passing library
    • point-to-point messages
    • synchronization: barriers, reductions
  – OS support
    • process creation (pvm_spawn)

40

PVM Environment (UNIX)

[Diagram: one PVMD daemon per machine (Sun SPARCs, IBM RS/6000, Cray Y-MP, DECmpp 12000) connected by a bus network; the application processes on each machine communicate through their local PVMD.]

■ One PVMD per machine
  – all processes communicate through the pvmd (by default)
■ Any number of application processes per node

41

PVM Message Passing

■ All messages have tags
  – an integer to identify the message
  – defined by the user
■ Messages are constructed, then sent
  – pvm_pk{int,char,float}(*var, count, stride)
  – pvm_upk{int,char,float} to unpack
■ All processes are named by task ids (tids)
  – local and remote processes are treated the same
■ Primary message passing functions
  – pvm_send(tid, tag)
  – pvm_recv(tid, tag)

42

PVM Process Control

■ Creating a process
  – pvm_spawn(task, argv, flag, where, ntask, tids)
  – flag and where provide control over where tasks are started
  – ntask controls how many copies are started
  – the program must be installed on the target machine
■ Ending a task
  – pvm_exit()
  – does not exit the process, it just leaves the PVM virtual machine
■ Info functions
  – pvm_mytid() - get the process task id

43

PVM Group Operations

■ Group is the unit of communication
  – a collection of one or more processes
  – processes join a group with pvm_joingroup("<group name>")
  – each process in the group has a unique id
    • pvm_gettid("<group name>")
■ Barrier
  – can involve a subset of the processes in the group
  – pvm_barrier("<group name>", count)
■ Reduction operations
  – pvm_reduce(void (*func)(), void *data, int count, int datatype,
               int msgtag, char *group, int rootinst)
    • result is returned to the rootinst node
    • does not block
  – pre-defined funcs: PvmMin, PvmMax, PvmSum, PvmProduct

44

PVM Performance Issues

■ Messages have to go through the PVMD
  – can use the direct-route option to avoid this problem
■ Packing messages
  – semantics imply a copy
  – extra function call to pack messages
■ Heterogeneous support
  – information is sent in a machine-independent format
  – has a short-circuit option for known-homogeneous communication
    • data is then passed in native format

45

Sample PVM Program

#include <stdio.h>
#include <stdlib.h>
#include "pvm3.h"

/* The slide did not show these constants; the values below are assumptions. */
#define MESSAGESIZE 1024
#define ITERATIONS  100
#define MYNAME      "pingpong"   /* executable name as installed for pvm_spawn */

int main(int argc, char **argv) {
  int myGroupNum;
  int friendTid;
  int mytid;
  int tids[2];
  int message[MESSAGESIZE];
  int i, okSpawn;
  int msgid = 1;                 /* user-chosen message tag */

  /* Initialize process and spawn the partner if necessary */
  myGroupNum = pvm_joingroup("ping-pong");
  mytid = pvm_mytid();
  if (myGroupNum == 0) {         /* I am the first process */
    pvm_catchout(stdout);
    okSpawn = pvm_spawn(MYNAME, argv, 0, "", 1, &friendTid);
    if (okSpawn != 1) {
      printf("Can't spawn a copy of myself!\n");
      pvm_exit();
      exit(1);
    }
    tids[0] = mytid;
    tids[1] = friendTid;
  } else {                       /* I am the second process */
    friendTid = pvm_parent();
    tids[0] = friendTid;
    tids[1] = mytid;
  }
  pvm_barrier("ping-pong", 2);

  /* Main loop body: bounce the message back and forth with the partner */
  if (myGroupNum == 0) {
    for (i = 0; i < MESSAGESIZE; i++) {   /* initialize the message */
      message[i] = '1';
    }
    for (i = 0; i < ITERATIONS; i++) {
      pvm_initsend(PvmDataDefault);
      pvm_pkint(message, MESSAGESIZE, 1);
      pvm_send(friendTid, msgid);
      pvm_recv(friendTid, msgid);
      pvm_upkint(message, MESSAGESIZE, 1);
    }
  } else {
    for (i = 0; i < ITERATIONS; i++) {    /* echo every iteration back */
      pvm_recv(friendTid, msgid);
      pvm_upkint(message, MESSAGESIZE, 1);
      pvm_initsend(PvmDataDefault);
      pvm_pkint(message, MESSAGESIZE, 1);
      pvm_send(friendTid, msgid);
    }
  }
  pvm_exit();
  exit(0);
}

46

Defect Patterns in High Performance Computing

Based on Materials Developed by

Taiga Nakamura

47

What is This Lecture?

■ Debugging and testing parallel code is hard
  – What kinds of software defects (bugs) are common?
  – How can they be prevented or found/fixed effectively?
■ Hypothesis: knowing common defects (bugs) will reduce the time spent debugging
  – … during programming assignments, course projects
■ Here: common defect types in parallel programming
  – "Defect patterns" in HPC
  – based on the empirical data we collected in past studies
  – examples are in C/MPI (suspect similar defect types in Fortran/MPI, OpenMP, UPC, CAF, …)

48

Example Problem

■ Consider the following problem:
  1. N cells, each of which holds an integer in [0..9]
     • e.g., cell[0]=2, cell[1]=1, …, cell[N-1]=3
  2. In each step, every cell is updated using the values of its neighboring cells
     • cellnext[x] = (cell[x-1] + cell[x+1]) mod 10
     • cellnext[0] = (3+1), cellnext[1] = (2+6), …
     • (assume the last cell is adjacent to the first cell)
  3. Repeat step 2 for steps iterations

  A sequence of N cells:  2 1 6 8 7 1 0 2 4 5 1 … 3

What defects can appear when implementing a parallel solution in MPI?

49

First, Sequential Solution

■ Approach to implementation
  – use an integer array buffer[] to represent the cell values
  – use a second array nextbuffer[] to store the values for the next step, and swap the buffers
  – straightforward implementation!

50

Sequential C Code

/* Initialize cells */
int x, n, *tmp;
int *buffer = (int*)malloc(N * sizeof(int));
int *nextbuffer = (int*)malloc(N * sizeof(int));
FILE *fp = fopen("input.dat", "r");
if (fp == NULL) { exit(-1); }
for (x = 0; x < N; x++) { fscanf(fp, "%d", &buffer[x]); }
fclose(fp);

/* Main loop */
for (n = 0; n < steps; n++) {
  for (x = 0; x < N; x++) {
    nextbuffer[x] = (buffer[(x-1+N)%N] + buffer[(x+1)%N]) % 10;
  }
  tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;
}

/* Final output */
...

free(nextbuffer); free(buffer);

51

Approach to a Parallel Version

■ Each process keeps (1/size) of the cells
  – size: the number of processes

  [Diagram: the N cells (2 1 6 8 7 1 0 2 4 5 1 … 3) are split into contiguous chunks owned by Process 0, Process 1, Process 2, …, Process (size-1).]

• Each process needs to:
  • update the locally-stored cells
  • exchange boundary cell values between neighboring processes (nearest-neighbor communication)

52

Recurring HPC Defects

■ Now we will simulate the process of writing parallel code and discuss what kinds of defects can appear.
■ Defect types are shown as:
  – pattern descriptions
  – concrete examples in an MPI implementation

53

Pattern: Erroneous Use of Language Features
• Simple mistakes in understanding that are common for novices
  • e.g., inconsistent parameter types between send and recv
  • e.g., forgotten mandatory function calls
  • e.g., inappropriate choice of functions

Symptoms:
• Compile-time error (easy to fix)
• Some defects may surface only under specific conditions
  • (number of processors, value of input, hardware/software environment, …)

Causes:
• Lack of experience with the syntax and semantics of new language features

Cures & preventions:
• Check unfamiliar language features carefully

54

/* Initialize MPI */

MPI_Status status;

status = MPI_Init(NULL, NULL);

if (status != MPI_SUCCESS) { exit(-1); }

/* Initialize cells */

fp = fopen("input.dat", "r");

if (fp == NULL) { exit(-1); }

for (x = 0; x < N; x++) { fscanf(fp, "%d", &buffer[x]); }

fclose(fp);

/* Main loop */

...

/* Final output */

...

/* Finalize MPI */

MPI_Finalize();

Adding basic MPI functions

What are the bugs?

55

/* Initialize MPI */

MPI_Status status;

status = MPI_Init(NULL, NULL);

if (status != MPI_SUCCESS) { exit(-1); }

/* Initialize cells */

fp = fopen("input.dat", "r");

if (fp == NULL) { exit(-1); }

for (x = 0; x < N; x++) { fscanf(fp, "%d", &buffer[x]); }

fclose(fp);

/* Main loop */

...

What are the defects?

MPI_Init(&argc, &argv);

MPI_Finalize();

• Passing NULL to MPI_Init is invalid in MPI-1 (ok in MPI-2)

• MPI_Finalize must be called by all processors in every execution path

56

Does MPI Have Too Many Functions To Remember?

■ Yes (100+ functions), but…
■ Advanced features are not necessarily needed
■ Try to understand a few basic language features thoroughly

MPI keywords in Conjugate Gradient in C/C++ (15 students)

[Bar chart (log scale, 1-1000): number of uses of each MPI keyword across the students' codes. Keywords that appear: MPI_Address, MPI_Aint, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Alltoallv, MPI_Barrier, MPI_Bcast, MPI_Comm_rank, MPI_Comm_size, MPI_Datatype, MPI_Finalize, MPI_Init, MPI_Irecv, MPI_Isend, MPI_Recv, MPI_Reduce, MPI_Request, MPI_Send, MPI_Sendrecv, MPI_Status, MPI_Type_commit, MPI_Type_struct, MPI_Waitall, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_CHAR, MPI_COMM_WORLD, MPI_DOUBLE, MPI_INT, MPI_LONG, MPI_SUM.]

24 functions, 8 constants

57

Pattern: Space Decomposition
• Incorrect mapping between the problem space and the program memory space

Symptoms:
• Segmentation fault (if an array index is out of range)
• Incorrect or slightly incorrect output

Causes:
• The mapping in the parallel version can be different from that in the serial version
  • e.g., the array origin is different on every processor
  • e.g., additional memory space for communication can complicate the mapping logic

Cures & preventions:
• Validate the memory allocation carefully when parallelizing the code

58

MPI_Comm_size(MPI_COMM_WORLD, &size);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

nlocal = N / size;

buffer = (int*)malloc((nlocal+2) * sizeof(int));

nextbuffer = (int*)malloc((nlocal+2) * sizeof(int));

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 0; x < nlocal; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

...

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Decompose the problem space

buffer[]

0 (nlocal+1)

What are the bugs?

59

MPI_Comm_size(MPI_COMM_WORLD, &size);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

nlocal = N / size;

buffer = (int*)malloc((nlocal+2) * sizeof(int));

nextbuffer = (int*)malloc((nlocal+2) * sizeof(int));

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {
  nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

...

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

What are the defects?

N may not be divisible by size

• N may not be divisible by size
• Off-by-one error in the inner loop
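One hedged way to handle the first defect is to spread the remainder over the low-numbered ranks; a sketch using the same variable names as the example code:

  int base   = N / size;
  int rem    = N % size;
  int nlocal = base + (rank < rem ? 1 : 0);              /* my number of cells */
  int first  = rank * base + (rank < rem ? rank : rem);  /* global index of my first cell */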

60

Pattern: Side-Effect of Parallelization
• Ordinary serial constructs can cause defects when they are accessed in parallel contexts

Symptoms:
• Various correctness/performance problems

Causes:
• The "sequential part" tends to be overlooked
  • Typical parallel programs contain only a few parallel primitives, and the rest of the code is a sequential program running in parallel

Cures & preventions:
• Don't just focus on the parallel code
• Check that the serial code is working on one processor, but remember that the defect may surface only in a parallel context

61

/* Initialize cells with input file */

fp = fopen("input.dat", "r");

if (fp == NULL) { exit(-1); }

nskip = ...

for (x = 0; x < nskip; x++) { fscanf(fp, "%d", &dummy);}

for (x = 0; x < nlocal; x++) { fscanf(fp, "%d", &buffer[x+1]);}

fclose(fp);

/* Main loop */

...

Data I/O

• What are the defects?

62

/* Initialize cells with input file */

if (rank == 0) {

fp = fopen("input.dat", "r");

if (fp == NULL) { exit(-1); }

for (x = 0; x < nlocal; x++) { fscanf(fp, "%d", &buffer[x+1]);}

for (p = 1; p < size; p++) {

/* Read initial data for process p and send it */

}

fclose(fp);

}

else {

/* Receive initial data*/

}

Data I/O

• Filesystem may cause performance bottleneck if all processors access the same file simultaneously

• (Schedule I/O carefully, or let “master” processor do all I/O)

63

/* What if we initialize cells with random values... */

srand(time(NULL));

for (x = 0; x < nlocal; x++) {

buffer[x+1] = rand() % 10;

}

/* Main loop */

...

Generating Initial Data

• What are the defects?

• (Other than the fact that rand() is not a good pseudo-random number generator in the first place…)

64

/* What if we initialize cells with random values... */

srand(time(NULL));

for (x = 0; x < nlocal; x++) {

buffer[x+1] = rand() % 10;

}

/* Main loop */

...

What are the Defects?

• All procs might use the same pseudo-random sequence, spoiling independence

• Hidden serialization in rand() causes performance bottleneck

srand(time(NULL) + rank);

65

Pattern: Synchronization
• Improper coordination between processes
  • Well-known defect type in parallel programming
  • Deadlocks, race conditions

Symptoms:
• Program hangs
• Incorrect/non-deterministic output

Causes:
• Some defects can be very subtle
• Use of asynchronous (non-blocking) communication can lead to more synchronization defects

Cures & preventions:
• Make sure that all communications are correctly coordinated

66

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Communication

• What are the defects?

67

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

What are the Defects?

• Obvious example of deadlock (can’t avoid noticing this)

0 (nlocal+1)

68

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Ssend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Ssend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Another Example

• What are the defects?

69

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Ssend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Ssend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

What are the Defects?

• This causes deadlock too
• MPI_Ssend is a synchronous send (see the next slides)

70

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Send (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Yet Another Example

• What are the defects?

71

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Send (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Potential deadlock

• This may work (many novice programmers write this code)

• but it can cause deadlock with some implementations or parameters

72

Modes of MPI blocking communication

■ http://www.mpi-forum.org/docs/mpi-11-html/node40.html
  – Standard (MPI_Send): may either return immediately once the outgoing message is buffered in the MPI buffers, or block until a matching receive has been posted.
  – Buffered (MPI_Bsend): the send completes when MPI buffers the outgoing message. An error is returned when there is insufficient buffer space.
  – Synchronous (MPI_Ssend): the send completes only when the matching receive operation has started to receive the message.
  – Ready (MPI_Rsend): the send can be started only after the matching receive has been posted.
■ In our code MPI_Send probably won't block in most implementations (each message is just one integer), but it should still be avoided.
■ A "correct" solution could be:
  – (1) alternate the order of send and recv,
  – (2) use MPI_Bsend with sufficient buffer size,
  – (3) use MPI_Sendrecv (sketched below), or
  – (4) use MPI_Isend/Irecv.

73

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Isend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &request1);

MPI_Irecv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &request2);

MPI_Ssend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &request3);

MPI_Irecv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &request4);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Non-Blocking Communication

• What are the defects?

74

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Isend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &request1);

MPI_Irecv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &request2);

MPI_Ssend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &request3);

MPI_Irecv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &request4);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

What are the Defects?

• Synchronization (e.g. MPI_Wait, MPI_Barrier) is needed at each iteration (but too many barriers can cause a performance problem)

75

Pattern: Performance Defect
• Scalability problem because processors are not working in parallel
  • The program output itself is correct
  • Perfect parallelization is often difficult: need to evaluate whether the execution speed is acceptable

Symptoms:
• Sub-linear scalability
• Performance much less than expected (e.g., most time spent waiting)

Causes:
• Unbalanced amount of computation
• Load balance may depend on the input data

Cures & preventions:
• Make sure all processors are "working" in parallel
• A profiling tool might help

if (rank != 0) {

MPI_Ssend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

}

if (rank != size-1) {

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Ssend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

}

Scheduling communication

• Complicated communication pattern - does not cause deadlock

What are the defects?

77

if (rank != 0) {

MPI_Ssend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

}

if (rank != size-1) {

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Ssend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

}

What are the bugs?

1 Send → 0 Recv → 0 Send → 1 Recv
2 Send → 1 Recv → 1 Send → 2 Recv
3 Send → 2 Recv → 2 Send → 3 Recv

0 (nlocal+1)

• Communication requires O(size) time (a “correct” solution takes O(1))

78

Summary

■ This is an attempt to share knowledge about common defects in parallel programming
  – Erroneous use of language features
  – Space decomposition
  – Side-effect of parallelization
  – Synchronization
  – Performance defects
■ Try to avoid these defect patterns in your code

79

CMSC 714

Lecture 4

OpenMP and UPC

Chau-Wen Tseng (from A. Sussman)

80

Programming Model Overview

■ Message passing (MPI, PVM)
  – separate address spaces
  – explicit messages to access shared data
    • send / receive (MPI 1.0), put / get (MPI 2.0)
■ Multithreading (Java threads, pthreads)
  – shared address space
    • only local variables on the thread stack are private
  – explicit thread creation, synchronization
■ Shared-memory programming (OpenMP, UPC)
  – mixed shared / separate address spaces
  – implicit threads & synchronization

81

Shared Memory Programming Model

■ Attempts to ease the task of parallel programming
  – hide details
    • thread creation, messages, synchronization
  – compiler generates parallel code
    • based on user annotations
■ Possibly lower performance
  – less control over
    • synchronization
    • locality
    • message granularity
■ May inadvertently introduce data races
  – read & write the same shared memory location in a parallel loop

82

OpenMP

■ Supports parallelism for SMPs, multi-core
  – provides a simple, portable model
  – allows both shared and private data
  – provides parallel for/do loops
■ Includes
  – automatic support for fork/join parallelism
  – reduction variables
  – atomic statement
    • only one thread executes it at a time
  – single statement
    • only one thread runs this code (the first thread to reach it)

83

OpenMP

■ Characteristics
  – both local & shared memory (depending on directives)
  – parallelism directives for parallel loops & functions
  – compilers convert into multi-threaded programs (i.e., pthreads)
  – not supported on clusters
■ Example

  #pragma omp parallel for private(i)
  for (i=0; i<NUPDATE; i++) {
    int ran = random();
    table[ ran & (TABSIZE-1) ] ^= stable[ ran >> (64-LSTSIZE) ];
  }

  "parallel for" indicates that loop iterations may be executed in parallel

84

More on OpenMP

■ Characteristics
  – not a full parallel language, but a language extension
  – a set of standard compiler directives and library routines
  – used to create parallel Fortran, C and C++ programs
  – usually used to parallelize loops
  – standardizes the last 15 years of SMP practice
■ Implementation
  – compiler directives using #pragma omp <directive>
  – parallelism can be specified for regions & loops
  – data can be
    • private – each processor has a local copy
    • shared – a single copy for all processors

85

OpenMP – Programming Model

■ Fork-join parallelism (a restricted form of MIMD)
  – normally a single thread of control (the master)
  – worker threads are spawned when a parallel region is encountered
  – barrier synchronization is required at the end of a parallel region

[Diagram: a master thread forks into a team of threads at each parallel region and joins back to a single thread afterwards.]

86

OpenMP – Example Parallel Region

■ Task-level parallelism – #pragma omp parallel { … }

  Source code:
    double a[1000];
    omp_set_num_threads(4);
    #pragma omp parallel
    {
      int id = omp_get_thread_num();
      foo(id, a);
    }
    printf("all done \n");

  What the OpenMP compiler generates (conceptually):
    double a[1000];
    omp_set_num_threads(4);
    foo(0,a); foo(1,a); foo(2,a); foo(3,a);   /* run concurrently, one call per thread */
    printf("all done \n");

87

OpenMP – Example Parallel Loop

#pragma omp parallel

{

int id, i, nthreads,start, end;

id = omp_get_thread_num();

nthreads = omp_get_num_threads();

start = id * N / nthreads ; // assigning

end = (id+1) * N / nthreads ; // work

for (i=start; i<end; i++) {

foo(i);

}

}

#pragma omp parallel for

for (i=0;i<N;i++) {

foo(i);

}

■ Loop-level parallelism – #pragma omp parallel for
  – loop iterations are assigned to threads and invoked as functions
  – (OpenMP compiler; loop iterations scheduled in blocks)

88

Iteration Scheduling

■ Parallel for loop
  – simply specifies that loop iterations may be executed in parallel
  – actual processor assignment is up to the compiler / run-time system
■ Scheduling goals
  – reduce load imbalance
  – reduce synchronization overhead
  – improve data locality
■ Scheduling approaches (see the sketch below)
  – block (chunks of contiguous iterations)
  – cyclic (round-robin)
  – dynamic (threads request additional iterations when done)
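A sketch of how the three approaches map onto OpenMP schedule clauses; foo, i, and N are placeholders, as in the earlier loop example:

  #pragma omp parallel for schedule(static)        /* block: contiguous chunks per thread */
  for (i = 0; i < N; i++) foo(i);

  #pragma omp parallel for schedule(static, 1)     /* cyclic: iterations dealt round-robin */
  for (i = 0; i < N; i++) foo(i);

  #pragma omp parallel for schedule(dynamic, 16)   /* dynamic: threads grab 16 iterations at a time */
  for (i = 0; i < N; i++) foo(i);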

89

Parallelism May Cause Data Races

■ Data race
  – multiple accesses to shared data in parallel
  – at least one access is a write
  – the result depends on the order of the shared accesses
■ May be introduced by a parallel loop
  – if a data dependence exists between loop iterations
  – the result depends on the order in which loop iterations are executed
  – example:

  #pragma omp parallel for
  for (i=1;i<N-1;i++) {
    a[i] = ( a[i-1] + a[i+1] ) / 2;
  }
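One hedged way to remove this race is to write into a second array, mirroring the buffer/nextbuffer idea used earlier in these slides. Note this is a Jacobi-style update (all reads come from the old values), which differs numerically from the in-place sequential loop; b[] is an array I introduce for illustration.

  #pragma omp parallel for
  for (i = 1; i < N-1; i++) {
      b[i] = ( a[i-1] + a[i+1] ) / 2;   /* reads only old values of a[] */
  }
  /* copy or swap b back into a outside the parallel loop */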

90

Sample Fortran77 OpenMP Code

program compute_pi

integer n, i

double precision w, x, sum, pi, f, a

c function to integrate

f(a) = 4.d0 / (1.d0 + a*a)

print *, “Enter # of intervals: “

read *,n

c calculate the interval size

w = 1.0d0/n

sum = 0.0d0

!$OMP PARALLEL DO PRIVATE(x), SHARED(w)

!$OMP& REDUCTION(+: sum)

do i = 1, n

x = w * (i - 0.5d0)

sum = sum + f(x)

enddo

pi = w * sum

print *, “computed pi = “, pi

stop

end

91

Reductions

■ Specialized computations where
  – partial results may be computed in parallel
  – partial results are combined into a final result
  – examples
    • addition, multiplication, minimum, maximum, count
■ OpenMP reduction variable
  – compiler inserts code to
    • compute the partial result locally
    • use synchronization / communication to combine the results
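The same idea in C, as a hedged sketch paralleling the Fortran pi example above; i, n, w, and pi are assumed to be declared elsewhere (with w = 1.0/n):

  double x, sum = 0.0;
  #pragma omp parallel for private(x) reduction(+ : sum)
  for (i = 0; i < n; i++) {
      x = w * (i + 0.5);               /* midpoint of interval i */
      sum += 4.0 / (1.0 + x * x);      /* each thread accumulates a private partial sum */
  }
  pi = w * sum;                        /* partial sums are combined automatically */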

92

UPC

■ Extension to C for parallel computing
■ Target environments
  – distributed memory machines
  – cache-coherent multi-processors
  – multi-core processors
■ Features
  – explicit control of data distribution
  – includes a parallel for statement
  – MPI-like run-time library support

93

UPC

■ Characteristics
  – local memory, plus shared arrays accessed through global pointers
  – parallelism: single program on multiple nodes (SPMD)
  – provides the illusion of shared one-dimensional arrays
  – features
    • data distribution declarations for arrays
    • one-sided communication routines (memput / memget)
    • compilers translate shared pointers & generate communication
    • can cast shared pointers to local pointers for efficiency
■ Example

  shared int *x, *y, z[100];
  upc_forall (i = 0; i < 100; i++; i) { z[i] = *x++ * *y++; }

94

More UPC

■ Shared pointer
  – key feature of UPC
    • enables support for distributed memory architectures
  – a local (private) pointer pointing into a shared array
  – consists of two parts
    • processor number
    • local address on that processor
  – read operations on a shared pointer
    • if for non-local data, the compiler translates them into memget()
  – write operations on a shared pointer
    • if for non-local data, the compiler translates them into memput()
  – cast into a local private pointer
    • accesses the local portion of the shared array without communication

95

UPC Execution Model

■ SPMD-based
  – one thread per processor
  – each thread starts with the same entry to main
■ Different consistency models possible
  – "strict" model is based on sequential consistency
    • results must match some sequential execution order
  – "relaxed" model is based on release consistency
    • writes are visible only after release synchronization
    • increased freedom to reorder operations
    • reduced need to communicate results
  – consistency models are tricky
    • avoid data races altogether

96

Forall Loop

■ Forms the basis of parallelism
■ Adds a fourth parameter to the for loop: "affinity"
  – where code is executed is based on the affinity expression
  – attempt to assign each loop iteration to the processor holding the shared data
    • to reduce communication
■ Lacks an explicit barrier before / after execution
  – differs from OpenMP
■ Supports nested forall loops

97

Split-phase Barriers

■ Traditional barriers
  – once a thread enters the barrier, it busy-waits until all threads arrive
■ Split-phase
  – announce the intention to enter the barrier (upc_notify)
  – perform some local operations
  – wait for the other threads (upc_wait)
■ Advantage
  – allows work to be done while waiting for the other threads to arrive
■ Disadvantage
  – must find work to do
  – takes time to communicate both notify and wait
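A minimal sketch of the split-phase pattern in UPC; the local-work function name is a placeholder of mine:

  upc_notify;                                 /* announce that this thread has arrived */
  do_local_work_needing_no_remote_data();     /* overlap useful local work             */
  upc_wait;                                   /* complete the barrier: wait for the rest */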

98

Computing Environment

■ Cost-effective high performance computing
  – dedicated servers are expensive
  – non-dedicated machines are useful
    • high processing power (~1 GHz), fast networks (100 Mbps+)
    • long idle time (~50%), low resource usage

[Diagram: a user who "needs cycles to run my simulations" can draw on machines in offices, a computer lab, a supercomputer, a clustered server, and workstations/PCs, all connected by a network.]

99

OS Support For Parallel Computing

■ Many applications need raw compute power
  – computer H/W and S/W simulations
  – scientific/engineering computation
  – data mining, optimization problems
■ Goal
  – exploit computation cycles on idle workstations
■ Projects
  – Condor
  – Linger-Longer

100

Issues

■ Scheduling
  – what jobs to run on which machines?
  – when to start / stop using idle machines?
■ Transparency
  – can applications execute as if on their home machine?
■ Checkpoints
  – can work be saved if a job is interrupted?

101

What Is Condor?

■ Condor
  – exploits computation cycles in collections of
    • workstations
    • dedicated clusters
  – manages both
    • resources (machines)
    • resource requests (jobs)
  – has several mechanisms
    • ClassAd matchmaking
    • process checkpoint / restart / migration
    • remote system calls
    • grid awareness
  – scalable to thousands of jobs / machines

102

Condor – Dedicated Resources

■ Dedicated resources
  – compute clusters
■ Manages
  – node monitoring, scheduling
  – job launch, monitoring & cleanup

103

Condor – Non-dedicated Resources

■ Examples
  – desktop workstations in offices
  – workstations in student labs
■ Often idle
  – approximately 70% of the time!
■ Condor policy
  – use a workstation if it is idle
  – interrupt and move the job if user activity is detected

104

Mechanisms in Condor

■ Transparent process checkpoint / restart
■ Transparent process migration
■ Transparent redirection of I/O
  – Condor's remote system calls

105

CondorView Usage Graph

106

What is ClassAd Matchmaking?

■ Condor uses ClassAd matchmaking to make sure that work gets done within the constraints of both users and owners.
■ Users (jobs) have constraints:
  – "I need an Alpha with 256 MB RAM"
■ Owners (machines) have constraints:
  – "Only run jobs when I am away from my desk and never run jobs owned by Bob."
■ Semi-structured data – no fixed schema

107

Some Challenges

■ Condor does whatever it takes to run your jobs, even if some machines…
  – crash (or are disconnected)
  – run out of disk space
  – don't have your software installed
  – are frequently needed by others
  – are far away & managed by someone else

108

Condor’s Standard Universe

■ Condor can support various combinations of features/environments
  – in different "Universes"
■ Different Universes provide different functionality
  – Vanilla
    • run any serial job
  – Scheduler
    • plug in a meta-scheduler
  – Standard
    • support for transparent process checkpoint and restart

109

Process Checkpointing

■ Condor's process checkpointing mechanism saves all the state of a process into a checkpoint file
  – memory, CPU, I/O, etc.
■ The process can then be restarted
  – from right where it left off
■ Typically no changes to your job's source code are needed
  – however, your job must be relinked with Condor's Standard Universe support library

110

When Will Condor Checkpoint Your Job?

■ Periodically, if desired
  – for fault tolerance
■ To free the machine to do a higher-priority task (a higher-priority job, or a job from a user with higher priority)
  – preemptive-resume scheduling
■ When you explicitly run
  – condor_checkpoint
  – condor_vacate
  – condor_off
  – condor_restart

111

Condor Daemon Layout

[Diagram: a Personal Condor / Central Manager machine, where the master daemon spawns the collector, negotiator, schedd, and startd processes.]

112

Layout of the Condor Pool

[Diagram: the Central Manager (Frieda's) runs master, collector, negotiator, schedd, and startd; each desktop runs master, schedd, and startd; each cluster node runs master and startd. Arrows mark ClassAd communication pathways and spawned processes.]

113

Access to Data in Condor

■ Use a shared filesystem if available
■ No shared filesystem?
  – remote system calls (in the Standard Universe)
  – Condor file transfer service
    • can automatically send back changed files
    • atomic transfer of multiple files
  – remote I/O proxy socket

114

Standard Universe Remote System Calls

■ I/O system calls are trapped
  – sent back to the submit machine
■ Allows transparent migration across domains
  – checkpoint on machine A, restart on B
■ No source code changes required
■ Language independent
■ Opportunities
  – for application steering
    • Condor tells the customer process "how" to open files
  – for compression on the fly

115

Job Startup

[Diagram: on the submit machine, the schedd spawns a shadow process; on the execute machine, the startd spawns a starter, which runs the customer job linked against the Condor syscall library.]

116

Secure Remote I/O

[Diagram: at the submission site, the shadow and an I/O server access the home file system; at the execution site, the starter forks the job, whose I/O library talks to an I/O proxy over the Chirp protocol, while local I/O uses local system calls.]

117

[Diagram: a Condor-G job submission machine (Condor-G scheduler with a persistent job queue, GridManager, GASS server, Condor-G collector, and a Condor shadow process for job X, driven by end-user requests) transfers job X to a remote job execution site running Globus daemons, the local site scheduler, and Condor daemons; job X is linked with the Condor system call trapping & checkpoint library, and redirected system call data and resource information flow back to the submission machine.]

118

Exploiting Idle Cycles

in Networks of Workstations

Kyung Dong Ryu

© Copyright 2001, K.D. Ryu, All Rights Reserved.

Ph.D. Defense

119

High Performance Computing in NOW

■ Many systems support harvesting idle machines
  – traditional approach: coarse-grained cycle stealing
    • while the owner is away: send a guest job and run it
    • when the owner returns: stop, then
      – migrate the guest job: Condor, NOW system
      – suspend or kill the guest job: Butler, LSF, DQS systems
■ But…

120

Additional CPU Time and Memory is Available

■ When a user is active
  – CPU usage is < 10%, 75% of the time
  – 30 MB of memory is available, 70% of the time (total: 64 MB)
  – trace from UC Berkeley

[Plots: cumulative distributions of CPU usage (%) and available memory (MB) for all, idle, and busy periods.]

121

Questions

■ Can we exploit fine-grained idle resources?
  – for sequential programs and parallel programs
  – improve throughput
■ How to reduce the effect on the user?
  – two-level CPU scheduling
  – memory limits
  – network and I/O throttling

122

Fine-Grain Idle Cycles

■ Coarse-grain idle cycles
  – (t1,t3): keyboard/mouse events
  – (t4,~): high CPU usage
  – recruitment threshold
■ Fine-grain idle cycles
  – all empty slots
  – whenever the resource (CPU) is not used

[Timeline: CPU usage vs. time with keyboard/mouse events at t1..t4, marking idle and non-idle periods and the recruitment threshold.]

123

Linger Longer: Fine-Grain Cycle Stealing

■ Goals:
  – harvest more of the available resources
  – limit the impact on local jobs
■ Technique: lower the guest job's resource priority
  – exploit fine-grained idle intervals even when the user is active
    • starvation-level low CPU priority
    • dynamically limited memory use
    • dynamically throttled I/O and network bandwidth
■ Adaptive migration
  – no need to move the guest job to avoid delaying local jobs
  – could move the guest job to improve guest performance

124

Adaptive Migration

■ Migrate when the migration benefit outweighs the migration cost
  – Non-idle_Period ≥ Linger_Time + Migration_Cost / Non-idle_Usage
  – Linger_Time ∝ Migration_Cost / Non-idle_Usage
    • Migration_Cost = Suspend_Time(source) + Process_Size / Network_Bandwidth + Resume_Time(dest.)

[Diagram: timelines for a guest job and a local job on nodes A and B, comparing the "migration" and "no migration" cases; the migration cost, non-idle period, non-idle usage, and linger time are marked.]

125

Need a Suite of Mechanisms

■ Policy:
  – most unused resources should be made available
  – resources should be quickly revoked when local jobs reclaim them
■ Dynamic bounding mechanisms for:
  1. CPU
  2. Memory
  3. I/O bandwidth
  4. Network bandwidth

Goal: maximize usage of idle resources
Constraint: limit the impact on local jobs

126

CPU bounding: Is Unix “nice” sufficient ?

■ CPU priority is not strict
  – run two empty-loop processes (guest: nice 19)

  OS                    Host    Guest
  Solaris (SunOS 5.5)    84%      15%
  Linux (2.0.32)         91%       8%
  OSF1                   99%       0%
  AIX (4.2)              60%      40%

■ Why?
  – anti-starvation policy

127

CPU Bounding: Starvation Level Priority

■ Original Linux CPU scheduler
  – one level: process priority
  – run-time scheduling priority
    • nice value & remaining time quanta
    • T_i = 20 - nice_level + (1/2) * T_{i-1}
  – a low-priority process can preempt a high-priority process
■ Extended Linux CPU scheduler
  – two levels: 1) process class, 2) priority
  – if runnable host processes exist
    • schedule a host process as in the unmodified scheduler
  – only when no host process is runnable
    • schedule a guest process

128

Memory Bounding: Page Limits

■ Extended page replacement algorithm
■ Adaptive page-out speed
  – when a host job steals a guest's page, page out multiple pages
    • faster than the default
  – no limit on taking free pages
  – High Limit:
    • maximum pages the guest can hold
  – Low Limit:
    • minimum pages guaranteed to the guest

[Diagram: main memory pages divided by the High and Low limits; above the High Limit the host job has priority, below the Low Limit the guest job has priority, and in between replacement is based only on LRU.]

129

Experiment: Memory Bounding

■ Prioritized memory page replacement
  – total available memory: 180 MB
  – guest memory thresholds: High Limit (70 MB), Low Limit (50 MB)

[Plot: host job and guest job memory usage (MB) vs. time (sec), with the High and Low limits marked.]

130

Experiment: Nice vs. CPU & Memory Bounding

■ Large memory footprint job
  – each job takes 82 sec to run in isolation

  Policy and Setup             Host time (secs)   Guest time (secs)   Host delay
  Host starts, then guest:
    Guest niced 19                    89                 176              8.0%
    Linger priority                   83                 165              0.8%
  Guest starts, then host:
    Guest niced 19                > 5 hours           > 5 hours         > 2,000%
    Linger priority                   99                 255              8.1%

  – Host-then-guest:
    • reduces host job delay from 8% to 0.8%
  – Guest-then-host:
    • nice causes memory thrashing
    • CPU & memory bounding serializes the execution

131

I/O and Network Throttling

Problem 1: guest I/O & communication can slow down local jobs
Problem 2: migration/checkpointing bothers local users

■ Policy: limit guest I/O and communication bandwidth
  – only when host I/O or communication is active
■ Mechanism: Rate Windows
  – keep track of the I/O rates of host and guest
  – throttle the guest I/O rate when host I/O is active
■ Implementation: a loadable kernel module
  – highly portable and deployable
  – light-weight: I/O call interception

132

I/O Throttling Mechanism: Rate Windows

■ Regulate the I/O rate

[Flow chart: an I/O or communication request from the application passes through the library into the kernel's Rate Windows module. If the process is regulated and its average rate exceeds the target rate, the module computes a delay, possibly splits the request, and sleeps: delay < dmin is ignored; delay > dmax splits the request and sleeps dmax; otherwise it sleeps for the computed delay. The average rate is tracked over a sliding window of the last N requests (e.g., 4 kB/100 ms, 60 kB/500 ms, 12 kB/75 ms, 16 kB/80 ms).]

133

Experiment: I/O Throttling

■ tar programs as host and guest jobs
  – guest I/O limit: 500 kB/sec (~10%)
  – throttling thresholds: Lo: 500 kB/sec, Hi: 1000 kB/s

[Plots: host and guest I/O rates (kB/s) vs. time (sec), without and with I/O throttling.]

■ Without I/O throttling
  – host tar takes 72 seconds
■ With I/O throttling
  – host tar takes 42 seconds

134

Dilation Factor in I/O Throttling

■ File I/O rate ≠ disk I/O rate
  – buffer cache, prefetching
■ Control disk I/O by throttling file I/O
  – adjust the delay using
    • dilation factor = avg. disk I/O rate / avg. file I/O rate
  – compile test (I/O limit: 500 kB/s)

[Plots: file I/O and disk I/O rates (kB/s) vs. time (sec) for (a) no limit, (b) a file I/O limit, and (c) a disk I/O limit.]

135

Experiment: Network Throttling

■ Guest job migration vs. httpd as the host job
  – guest job migration disrupts host job communication
  – throttle migration when host job communication is active
  – guest job communication limit: 500 kB/s
■ Without communication throttling
  – the web server loses bandwidth to the migration
■ With communication throttling
  – the web server takes its full bandwidth immediately

[Plots: communication bandwidth (kB/s) vs. time (sec) for the guest migration and the web server, without and with throttling.]

136

Guest Job Performance

– Overall, LL improves throughput 50%~70% over IE
– Less improvement for larger jobs (lu.B)
  • only 36% improvement for lu.B.30m
  • less memory is available while non-idle
– LF is slightly better than LL
– Less variation for LL
  • lu.B.30m: 23.6% for LL, 47.5% for LF

[Bar chart: throughput (base = 8) for the LL, LF, PM, and IE policies on mg.W.1m, mg.W.30m, sp.A.10m, lu.B.1.5m, and lu.B.30m.]

137

Host Job Slowdown

– LL/LF delay the host less for small and medium size jobs
  • 0.8%~1.1% for LL/LF, 1.4%~2.3% for PM/IE
  • due to the non-prioritized migration operations of PM/IE
– More delay for large jobs
  • memory contention

[Bar chart: host job delay (%) for the LL, LF, PM, and IE policies on mg.W.1m, mg.W.30m, sp.A.10m, lu.B.1.5m, and lu.B.30m.]

138

Conclusions

■ Identified opportunities for fine-grain idle resources
■ Linger-Longer can exploit up to 60% more idle time
  – fine-grain cycle stealing
  – adaptive migration
■ Linger-Longer can improve parallel applications in NOWs
■ A suite of mechanisms insulates local jobs' performance
  – CPU scheduling: starvation-level priority
  – memory priority: lower and upper limits
  – I/O and network bandwidth throttling: Rate Windows
■ Linger-Longer really improves
  – guest job throughput by 50% to 70%
  – with only a 3% host job slowdown

139

Related Work

■ Idle cycle stealing systems
  – Condor [Litzkow88]
  – NOW project [Anderson95]
  – Butler [Dannenberg85], LSF [Green93], DQS [Zhou93]
■ Process migration in the OS
  – Sprite [Douglis 91], Mosix [Barak 95]
■ Idle memory stealing systems
  – Dodo [Acharya 99], GMS [Freely 95]
  – Cooperative caching [Dahlin 94][Sarkar 96]
■ Parallel programs on non-dedicated workstations
  – Reconfiguration [Acharya 97]
  – MIST/MPVM [Clark 95], Silk-NOW [Brumofe 97]
  – CARMI [Pruyne 95] (master-worker model)
■ Performance isolation
  – Eclipse [Bruno 98]
  – Resource containers [Banga 99]
