
1

Parallel Computing
Slides from Prof. Jeffrey Hollingsworth


2

What is Parallel Computing?

■ Does it include:

– super-scalar processing (more than one instruction at once)?

– client/server computing?

• what if RPC calls are non-blocking?

– vector processing (same instruction to several values)?

– collection of PC’s not connected to a network?

■ For us, parallel computing requires:

– more than one processing element

– nodes connected to a communication network

– nodes working together to solve a single problem


3

Why Parallelism

■ Speed

– need to get results faster than possible with sequential

• a weather forecast that is late is useless

– could come from

• more processing elements (P.E.)

• more memory (or cache)

• more disks

■ Cost: cheaper to buy many smaller machines

– this is only recently true due to

• VLSI

• commodity parts


4

What Does a Parallel Computer Look Like?

■ Hardware

– processors

– communication

– memory

– coordination

■ Software

– programming model

– communication libraries

– operating system


5

Processing Elements (PE)

■ Key Processor Choices

– How many?

– How powerful?

– Custom or off-the-shelf?

■ Major Styles of Parallel Computing

– SIMD - Single Instruction Multiple Data

• one master program counter (PC)

– MIMD - Multiple Instruction Multiple Data

• separate code for each processor

– SPMD - Single Program Multiple Data

• same code on each processor, separate PC’s on each

– Dataflow - instruction waits for operands

• “automatically” finds parallelism


6

SIMD

[Diagram: a single program and one program counter drive an array of processors; each processor has a mask flag (0/1) that enables or disables it for the current instruction]


7

MIMD

[Diagram: each processor has its own program counter and runs its own program (Program #1, #2, #3)]


8

SPMD

[Diagram: each processor has its own program counter, but all processors run copies of the same program]


9

Dataflow

[Diagram: a dataflow graph in which instructions I1, I2, and I3 feed their results to instruction I4; an instruction executes as soon as its operands arrive]


10

Communication Networks

■ Connect

– PE’s, memory, I/O

■ Key Performance Issues

– latency: time for first byte

– throughput: average bytes/second

■ Possible Topologies

– bus - simple, but doesn’t scale

– ring - orders delivery of messages

[Diagram: bus and ring networks connecting PE and MEM nodes]


11

Topologies (cont)

– tree - needs to increase bandwidth near the top

[Diagram: tree and mesh topologies built from PE nodes]

– mesh - two or three dimensions

– hypercube - number of nodes must be a power of two


14

Memory Systems

■ Key Performance Issues

– latency: time for first byte

– throughput: average bytes/second

■ Design Issues

– Where is the memory

• divided among each node

• centrally located (on communication network)

– Access by processors

• can all processors get to all memory?

• is the access time uniform?


15

Coordination

■ Synchronization

– protection of a single object (locks)

– coordination of processors (barriers)

■ Size of a unit of work by a processor

– need to manage two issues

• load balance - processors have equal work

• coordination overhead - communication and sync.

– often called “grain” size - large grain vs. fine grain


16

Sources of Parallelism

■ Statements

– called “control parallel”

– can perform a series of steps in parallel

■ Loops

– called “data parallel”

– most common source of parallelism

– each processor gets one (or more) iterations to perform


17

Example of Parallelism

■ Easy (embarrassingly parallel)

– multiple independent jobs (e.g., different simulations)

■ Scientific

– Largest users of parallel computing

– dense linear algebra (divide up matrix)

– physical system simulations (divide physical space)

■ Databases

– biggest commercial success of parallel computing (divide tuples)

• exploits semantics of relational calculus

■ AI

– search problems (divide search space)

– pattern recognition and image processing (divide image)


18

Metrics in Application Performance

■ Speedup (often called strong scaling)

– ratio of time on a single node to time on n nodes

– hold problem size fixed

– should really compare to best serial time

– goal is linear speedup

– super-linear speedup is possible due to:

• adding more memory

• search problems

■ Weak Scaling (also called Iso-Speedup)

– scale data size up with number of nodes

– goal is a flat horizontal curve

■ Amdahl's Law

– max speedup is 1/(serial fraction of time)

■ Computation to Communication Ratio

– goal is to maximize this ratio
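To make the Amdahl's Law bullet concrete, a small sketch (not from the original slides; the 5% serial fraction is just an illustration):

#include <stdio.h>

/* Amdahl's Law: with serial fraction s and P processors,
   speedup(P) = 1 / (s + (1 - s)/P), which can never exceed 1/s. */
static double amdahl_speedup(double s, int P) {
    return 1.0 / (s + (1.0 - s) / P);
}

int main(void) {
    for (int P = 1; P <= 1024; P *= 4)
        printf("P = %4d  speedup = %.2f\n", P, amdahl_speedup(0.05, P));
    return 0;   /* with 5% serial code the speedup saturates near 20 */
}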


20

How to Write Parallel Programs

■ Use old serial code

– compiler converts it to parallel

– called the dusty deck problem

■ Serial Language plus Communication Library

– no compiler changes required!

– PVM and MPI use this approach

■ New language for parallel computing

– requires all code to be re-written

– hard to create a language that provides performance on different platforms

■ Hybrid Approach

– HPF - add data distribution commands to code

– add parallel loops and synchronization operations


21

Application Example - Weather

■ Typical of many scientific codes

– computes results for three dimensional space

– compute results at multiple time steps

– uses equations to describe physics/chemistry of the problem

– grids are used to discretize continuous space

• granularity of grids is important to speed/accuracy

■ Simplifications (for this example, not in the real code)

– earth is flat (no mountains)

– earth is round (in reality the poles are flattened and the earth bulges at the equator)

– second order properties


22

■ Divide continuous space into discrete parts

– for this code, grid size is fixed and uniform

• possible to change grid size or use multiple grids

– use three grids

• two for latitude and longitude

• one for elevation

• Total of M * N * L points

■ Design Choice: where is the grid point?

– left, right, or center of the grid

– in multiple dimensions this multiplies:

• for 3 dimensions have 27 possible points

[Diagram: Grid Points - a value may be associated with the left (L), center (C), or right (R) of each grid cell]


23

Variables

■ One dimensional

– m - geo-potential (gravitational effects)

■ Two dimensional

– pi - “shifted” surface pressure

– sigmadot - vertical component of the wind velocity

■ Three dimensional (primary variables)

– <u,v> - wind velocity/direction vector

– T - temperature

– q - specific humidity

– p - pressure

■ Not included

– clouds

– precipitation

– can be derived from others


24

Serial Computation

■ Convert equations to discrete form

■ Update from time t to t + delta t

foreach longitude, latitude, altitude

ustar[i,j,k] = n * pi[i,j] * u[i,j,k]

vstar[i,j,k] = m[j] * pi[i,j] * v[i,j,k]

sdot[i,j,k] = pi[i,j] * sigmadot[i,j]

end

foreach longitude, latitude, altitude

D = 4 * ((ustar[i,j,k] + ustar[i-1,j,k]) * (q[i,j,k] + q[i-1,j,k]) +

terms in {i,j,k}{+,-}{1,2}

piq[i,j,k] = piq[i,j,k] + D * delta t

similar terms for piu, piv, piT, and pi

end

foreach longitude, latitude, altitude

q[i,j,k] = piq[i,j,k]/pi[i,j,k]

u[i,j,k] = piu[i,j,k]/pi[i,j,k]

v[i,j,k] = piv[i,j,k]/pi[i,j,k]

T[i,j,k] = piT[i,j,k]/pi[i,j,k]

end


25

Shared Memory Version

■ in each loop nest, iterations are independent

■ use a parallel for-loop for each loop nest

■ synchronize (barrier) after each loop nest

– this is overly conservative, but works

– could use a single sync variable per item, but would incur excessive overhead

■ potential parallelism is M * N * L

■ private variables: D, i, j, k

■ Advantages of shared memory

– easier to get something working (ignoring performance)

■ Hard to debug

– other processors can modify shared data
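A minimal sketch of this shared-memory strategy in OpenMP syntax (the slides do not mandate OpenMP; the arrays are those of the pseudocode, and the coefficient names n_coef and m_coef stand in for the pseudocode's n and m[j] and are assumed to be declared elsewhere):

/* one loop nest of the update; the implicit barrier at the end of the
   parallel for provides the per-loop-nest synchronization */
#pragma omp parallel for
for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < L; k++) {
            ustar[i][j][k] = n_coef[j] * pi[i][j] * u[i][j][k];
            vstar[i][j][k] = m_coef[j] * pi[i][j] * v[i][j][k];
            sdot[i][j][k]  = pi[i][j] * sigmadot[i][j];
        }
/* loop indices declared inside the loop are automatically private;
   the remaining loop nests follow the same pattern */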


26

Distributed Memory Weather

■ decompose data to specific processors

– assign a cube to each processor

• maximize volume to surface ratio

• minimizes communication/computation ratio

– called a <block,block,block> distribution

■ need to communicate {i,j,k}{+,-}{1,2} terms at boundaries

– use send/receive to move the data

– no need for barriers, send/receive operations provide sync

• send earlier in the computation to hide communication time

■ Advantages

– easier to debug?

– consider data locality explicitly with data decomposition

■ Problems

– harder to get the code running
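As an illustration of hiding communication time, a sketch in one dimension (names are illustrative; compute_interior and compute_boundary are hypothetical helpers, and u[0] / u[nlocal+1] are ghost cells):

MPI_Request req[4];
int left  = (rank + size - 1) % size;
int right = (rank + 1) % size;

/* post the boundary exchange early ... */
MPI_Irecv(&u[0],        1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
MPI_Irecv(&u[nlocal+1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
MPI_Isend(&u[nlocal],   1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[2]);
MPI_Isend(&u[1],        1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[3]);

/* ... compute the interior cells, which need no neighbor data ... */
compute_interior(u, unew, 2, nlocal - 1);

/* ... then wait and finish the two boundary cells */
MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
compute_boundary(u, unew, 1, nlocal);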


27

Seismic Code

■ Given echo data, compute an undersea map

■ Computation model

– designed for a collection of workstations

– uses variation of RPC model

– workers are given an independent trace to compute

• requires little communication

• supports load balancing (1,000 traces is typical)

■ Performance

– max MFLOPS = O((F * nz * B*)^(1/2))

– F - single processor MFLOPS

– nz - linear dimension of input array

– B* - effective communication bandwidth

• B* = B/(1 + BL/w) ≈ B/7 for Ethernet (10msec lat., w=1400)

– real limit to performance was latency not bandwidth


28

Database Applications

■ Too much data to fit in memory (or sometimes disk)

– data mining applications (K-Mart has a 4-5TB database)

– imaging applications (NASA has a site with 0.25 petabytes)

• use a fork lift to load tapes by the pallet

■ Sources of parallelism

– within a large transaction

– among multiple transactions

■ Join operation

– form a single table from two tables based on a common field

– try to split join attribute in disjoint buckets

• if the data distribution is known to be uniform, it's easy

• if not, try hashing


29

Speedup in Join parallelism

■ Book claims a speedup of 1/p^2 is possible

– split each relation into p buckets

• each bucket is a disjoint subset of the join attribute

– each processor only has to consider N/p tuples per relation

• join is O(N^2), so each processor does O((N/p)^2) work

• so speedup is O(N^2/p^2) / O(N^2) = O(1/p^2)

■ this is a lie!

• could split into p buckets on one processor

• time would then be O(p * (N/p)^2) = O(N^2/p)

• so speedup is O(N^2/p^2) / O(N^2/p) = O(1/p)

– Amdahl's law is not violated
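A quick numeric check (numbers illustrative, not from the slides): with N = 10^6 tuples per relation and p = 100, each bucket join costs about (10^4)^2 = 10^8 steps. One hundred processors do one bucket each (10^8 steps); a single processor using the same bucketing does all 100 buckets (10^10 steps), versus 10^12 for the naive un-bucketed join. The fair speedup is therefore 10^10 / 10^8 = 100 = p, not 10^12 / 10^8 = p^2.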


30

Parallel Search (TSP)

■ may appear to be faster than 1/n

– but this is not really the case either

■ Algorithm

– compute a path on a processor

• if our path is shorter than the shortest one, send it to the others.

• stop searching a path when it is longer than the shortest.

– before computing next path, check for word of a new min path

– stop when all paths have been explored.

■ Why it appears to be faster than 1/n speedup

– we found a shorter path sooner

– however, the reason for this is a different search order!


31

Ensuring a fair speedup

■ Tserial = fastest of

– best known serial algorithm

– simulation of parallel computation

• use parallel algorithm

• run all processes on one processor

– parallel algorithm run on one processor

■ If it appears to be super-linear

– check for memory hierarchy

• increased cache or real memory may be reason

– verify the order of operations is the same in the parallel and serial cases


32

Quantitative Speedup

■ Consider master-worker

– one master and P worker processes

– communication time increases as a linear function of P

Tp = TCOMPp + TCOMMp

TCOMPp = Ts / P

1/Sp = Tp / Ts = 1/P + TCOMMp / Ts

TCOMMp = P * TCOMM1

1/Sp = 1/P + P * TCOMM1 / Ts = 1/P + P / r1, where r1 = Ts / TCOMM1

d(1/Sp)/dP = -1/P^2 + 1/r1 = 0  -->  Popt = r1^(1/2) and Sopt = 0.5 * r1^(1/2)

■ For hierarchy of masters

– TCOMMp = (1 + log P) * TCOMM1

– Popt = r1 and Sopt = r1 / (1 + log r1)
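Plugging in illustrative numbers: if Ts = 100 units and one worker's exchange with the master costs TCOMM1 = 1 unit, then r1 = 100, so the flat master-worker scheme is best at Popt = sqrt(100) = 10 workers with Sopt = 0.5 * 10 = 5, while the hierarchical scheme allows Popt = 100 workers with Sopt = 100 / (1 + log2 100) ≈ 13 (taking the log base 2).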


33

MPI

■ Goals:

– Standardize previous message passing:

• PVM, P4, NX

– Support copy free message passing

– Portable to many platforms

■ Features:

– point-to-point messaging

– group communications

– profiling interface: every function has a name-shifted version

■ Buffering

– no guarantee that there are buffers

– possible that send will block until receive is called

■ Delivery Order

– two sends from same process to same dest. will arrive in order

– no guarantee of fairness between processes on recv.


34

MPI Communicators

■ Provide a named set of processes for communication

■ All processes within a communicator can be named

– numbered from 0…n-1

■ Allows libraries to be constructed

– application creates communicators

– library uses it

– prevents problems with posting wildcard receives

• adds a communicator scope to each receive

■ All programs start with MPI_COMM_WORLD
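A sketch of the library-scoping idea (solver_init is a hypothetical library entry point; MPI_Comm_dup and MPI_Comm_free are standard MPI):

MPI_Comm lib_comm;
MPI_Comm_dup(MPI_COMM_WORLD, &lib_comm);   /* a private scope for the library */

solver_init(lib_comm);   /* the library posts all of its sends/receives on lib_comm,
                            so an application wildcard receive on MPI_COMM_WORLD
                            (MPI_ANY_SOURCE, MPI_ANY_TAG) can never match them */

MPI_Comm_free(&lib_comm);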


35

Non-Blocking Functions

■ Two Parts

– post the operation

– wait for results

■ Also includes a poll option

– checks if the operation has finished

■ Semantics

– must not alter buffer while operation is pending
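In MPI terms, a minimal sketch of the post / poll / wait discipline (buf, COUNT, TAG, peer, and do_other_work are illustrative names, not from the slides):

MPI_Request req;
int done = 0;

MPI_Irecv(buf, COUNT, MPI_DOUBLE, peer, TAG, MPI_COMM_WORLD, &req);   /* post */

while (!done) {
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);   /* poll: has it finished yet? */
    do_other_work();                            /* but do not touch buf here  */
}
/* MPI_Wait(&req, MPI_STATUS_IGNORE) is the blocking alternative; only after
   completion may buf be read (or, for a send, be reused) */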


36

MPI Misc.

■ MPI Types

– All messages are typed

• base types are pre-defined:

– int, double, real, {,unsigned}{short, char, long}

• can construct user defined types

– includes non-contiguous data types

■ Processor Topologies

– Allows construction of Cartesian & arbitrary graphs

– May allow some systems to run faster

■ What’s not in MPI-1

– process creation

– I/O

– one sided communication


37

MPI Housekeeping Calls

■ Include <mpi.h> in your program

■ If using mpich, …

■ First call MPI_Init(&argc, &argv)

■ MPI_Comm_rank(MPI_COMM_WORLD, &myrank)

– myrank is set to the id (rank) of this process

■ MPI_Wtime

– returns wall-clock time

■ At the end, call MPI_Finalize()


38

MPI Communication Calls

■ Parameters

– var – a variable

– num – number of elements in the variable to use

– type {MPI_INT, MPI_REAL, MPI_BYTE}

– root – rank of processor at root of collective operation

– dest – rank of destination processor

– status - variable of type MPI_Status;

■ Calls (all return a code – check for MPI_SUCCESS)

– MPI_Send(var, num, type, dest, tag, MPI_COMM_WORLD)

– MPI_Recv(var, num, type, source, MPI_ANY_TAG, MPI_COMM_WORLD, &status)

– MPI_Bcast(var, num, type, root, MPI_COMM_WORLD)

– MPI_Barrier(MPI_COMM_WORLD)
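A minimal sketch tying the housekeeping and communication calls together (two processes; checking of the return codes is omitted for brevity, and this program is not from the original slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int myrank, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* to rank 1 */
    } else if (myrank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d at time %f\n", value, MPI_Wtime());
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}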


39

PVM

■ Provide a simple, free, portable parallel environment

■ Run on everything

– Parallel Hardware: SMP, MPPs, Vector Machines

– Network of Workstations: ATM, Ethernet,

• UNIX machines and PCs running Win*

– Works on a heterogeneous collection of machines

• handles type conversion as needed

■ Provides two things

– message passing library

• point-to-point messages

• synchronization: barriers, reductions

– OS support

• process creation (pvm_spawn)


40

PVM Environment (UNIX)

[Diagram: several hosts (Sun SPARC, IBM RS/6000, DECmpp 12000, Cray Y-MP) connected by a bus network; each host runs one PVMD daemon, and the application processes on a host communicate through its PVMD]

■ One PVMD per machine

– all processes communicate through pvmd (by default)

■ Any number of application processes per node


41

PVM Message Passing

■ All messages have tags

– an integer to identify the message

– defined by the user

■ Messages are constructed, then sent

– pvm_pk{int,char,float}(*var, count, stride)

– pvm_unpk{int,char,float} to unpack

■ All processes are named by task ids (tids)

– local/remote processes are the same

■ Primary message passing functions

– pvm_send(tid, tag)

– pvm_recv(tid, tag)


42

PVM Process Control

■ Creating a process

– pvm_spawn(task, argv, flag, where, ntask, tids)

– flag and where provide control of where tasks are started

– ntask controls how many copies are started

– program must be installed on target machine

■ Ending a task

– pvm_exit

– does not exit the process, just the PVM machine

■ Info functions

– pvm_mytid() - get the process task id


43

PVM Group Operations

■ Group is the unit of communication

– a collection of one or more processes

– processes join a group with pvm_joingroup("<group name>")

– each process in the group has a unique id

• pvm_gettid("<group name>")

■ Barrier

– can involve a subset of the processes in the group

– pvm_barrier("<group name>", count)

■ Reduction Operations

– pvm_reduce(void (*func)(), void *data, int count, int datatype, int msgtag, char *group, int rootinst)

• result is returned to the rootinst node

• does not block

– pre-defined funcs: PvmMin, PvmMax, PvmSum, PvmProduct


44

PVM Performance Issues

■ Messages have to go through PVMD

– can use direct route option to prevent this problem

■ Packing messages

– semantics imply a copy

– extra function call to pack messages

■ Heterogeneous Support

– information is sent in a machine-independent format

– has a short-circuit option for known homogeneous communication

• data is then passed in native format


45

Sample PVM Program

/* ping-pong between two PVM tasks; MESSAGESIZE, ITERATIONS, and MYNAME
   (the executable name passed to pvm_spawn) are assumed to be defined elsewhere */
int main(int argc, char **argv) {
  int myGroupNum;
  int friendTid;
  int mytid;
  int tids[2];
  int message[MESSAGESIZE];
  int i, okSpawn;
  int msgtag = 1;                     /* message tag: any agreed-upon integer */

  /* Initialize process and spawn if necessary */
  myGroupNum = pvm_joingroup("ping-pong");
  mytid = pvm_mytid();
  if (myGroupNum == 0) {              /* I am the first process */
    pvm_catchout(stdout);
    okSpawn = pvm_spawn(MYNAME, argv, 0, "", 1, &friendTid);
    if (okSpawn != 1) {
      printf("Can't spawn a copy of myself!\n");
      pvm_exit();
      exit(1);
    }
    tids[0] = mytid;
    tids[1] = friendTid;
  } else {                            /* I am the second process */
    friendTid = pvm_parent();
    tids[0] = friendTid;
    tids[1] = mytid;
  }
  pvm_barrier("ping-pong", 2);

  /* Main Loop Body */
  if (myGroupNum == 0) {
    /* Initialize the message */
    for (i = 0; i < MESSAGESIZE; i++) {
      message[i] = '1';
    }
    /* Now start passing the message back and forth */
    for (i = 0; i < ITERATIONS; i++) {
      pvm_initsend(PvmDataDefault);
      pvm_pkint(message, MESSAGESIZE, 1);
      pvm_send(friendTid, msgtag);
      pvm_recv(friendTid, msgtag);
      pvm_upkint(message, MESSAGESIZE, 1);
    }
  } else {
    /* echo the message back ITERATIONS times (mirrors the loop above) */
    for (i = 0; i < ITERATIONS; i++) {
      pvm_recv(friendTid, msgtag);
      pvm_upkint(message, MESSAGESIZE, 1);
      pvm_initsend(PvmDataDefault);
      pvm_pkint(message, MESSAGESIZE, 1);
      pvm_send(friendTid, msgtag);
    }
  }
  pvm_exit();
  exit(0);
}


46

Defect Patterns in High Performance Computing

Based on Materials Developed by

Taiga Nakamura


47

What is This Lecture?

■ Debugging and testing parallel code is hard

– What kinds of software defects (bugs) are common?

– How can they be prevented or found/fixed effectively?

■ Hypothesis: Knowing common defects (bugs) will reduce the time spent debugging

– … during programming assignments, course projects

■ Here: Common defect types in parallel programming

– “Defect patterns” in HPC

– Based on the empirical data we collected in past studies

– Examples are in C/MPI (suspect similar defect types in Fortran/MPI, OpenMP, UPC, CAF, …)


48

Example Problem

■ Consider the following problem:

1. N cells, each of which holds an integer in [0..9]

• E.g., cell[0]=2, cell[1]=1, …, cell[N-1]=3

2. In each step, cells are updated using the values of the neighboring cells

• cellnext[x] = (cell[x-1] + cell[x+1]) mod 10

• cellnext[0] = (3+1), cellnext[1] = (2+6), …

• (Assume the last cell is adjacent to the first cell)

3. Repeat step 2 a total of steps times

A sequence of N cells

2 1 6 8 7 1 0 2 4 5 1 … 3

What defects can appear when implementing a parallel solution in MPI?


49

First, Sequential Solution

■ Approach to implementation

– Use an integer array buffer[] to represent the cell values

– Use a second array nextbuffer[] to store the values in the next step, and swap the buffers

– Straightforward implementation!


50

/* Initialize cells */

int x, n, *tmp;

int *buffer = (int*)malloc(N * sizeof(int));

int *nextbuffer = (int*)malloc(N * sizeof(int));

FILE *fp = fopen("input.dat", "r");

if (fp == NULL) { exit(-1); }

for (x = 0; x < N; x++) { fscanf(fp, "%d", &buffer[x]); }

fclose(fp);

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 0; x < N; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

/* Final output */

...

free(nextbuffer); free(buffer);

Sequential C Code


51

Approach to a Parallel Version

■ Each process keeps (1/size) of the cells

– size: number of processes

[Diagram: the sequence of N cells (2 1 6 8 7 1 0 2 4 5 1 … 3) is split into contiguous chunks owned by Process 0, Process 1, Process 2, …, Process (size-1)]

• Each process needs to:

• update the locally-stored cells

• exchange boundary cell values between neighboring processes (nearest-neighbor communication)


52

Recurring HPC Defects

■ Now, we will simulate the process of writing parallel code and discuss what kinds of defects can appear.

■ Defect types are shown as:

– Pattern descriptions

– Concrete examples in MPI implementation


53

Pattern: Erroneous use of language features

• Simple mistakes in understanding that are common for novices

• E.g., inconsistent parameter types between send and recv

• E.g., forgotten mandatory function calls

• E.g., inappropriate choice of functions

Symptoms:

• Compile-time error (easy to fix)

• Some defects may surface only under specific conditions (number of processors, value of input, hardware/software environment…)

Causes:

• Lack of experience with the syntax and semantics of new language features

Cures & preventions:

• Check unfamiliar language features carefully


54

/* Initialize MPI */

MPI_Status status;

status = MPI_Init(NULL, NULL);

if (status != MPI_SUCCESS) { exit(-1); }

/* Initialize cells */

fp = fopen("input.dat", "r");

if (fp == NULL) { exit(-1); }

for (x = 0; x < N; x++) { fscanf(fp, "%d", &buffer[x]); }

fclose(fp);

/* Main loop */

...

/* Final output */

...

/* Finalize MPI */

MPI_Finalize();

Adding basic MPI functions

What are the bugs?


55

/* Initialize MPI */

MPI_Status status;

status = MPI_Init(NULL, NULL);

if (status != MPI_SUCCESS) { exit(-1); }

/* Initialize cells */

fp = fopen("input.dat", "r");

if (fp == NULL) { exit(-1); }

for (x = 0; x < N; x++) { fscanf(fp, "%d", &buffer[x]); }

fclose(fp);

/* Main loop */

...

What are the defects?

MPI_Init(&argc, &argv);

MPI_Finalize();

• Passing NULL to MPI_Init is invalid in MPI-1 (ok in MPI-2)

• MPI_Finalize must be called by all processors in every execution path


56

Does MPI Have Too Many Functions To Remember?

■ Yes (100+ functions), but…

■ Advanced features are not necessarily used

■ Try to understand a few, basic language features thoroughly

MPI keywords in Conjugate Gradient in C/C++ (15 students)

[Bar chart (log-scale axis, 1 to 1000): occurrence counts of MPI keywords in the students' Conjugate Gradient codes. Functions: MPI_Address, MPI_Aint, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Alltoallv, MPI_Barrier, MPI_Bcast, MPI_Comm_rank, MPI_Comm_size, MPI_Datatype, MPI_Finalize, MPI_Init, MPI_Irecv, MPI_Isend, MPI_Recv, MPI_Reduce, MPI_Request, MPI_Send, MPI_Sendrecv, MPI_Status, MPI_Type_commit, MPI_Type_struct, MPI_Waitall. Constants: MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_CHAR, MPI_COMM_WORLD, MPI_DOUBLE, MPI_INT, MPI_LONG, MPI_SUM.]

24 functions, 8 constants


57

Pattern: Space Decomposition

• Incorrect mapping between the problem space and the program memory space

Symptoms:

• Segmentation fault (if array index is out of range)

• Incorrect or slightly incorrect output

Causes:

• Mapping in the parallel version can be different from that in the serial version

• E.g., Array origin is different in every processor

• E.g., Additional memory space for communication can complicate the mapping logic

Cures & preventions:

• Validate the memory allocation carefully when parallelizing the code


58

MPI_Comm_size(MPI_COMM_WORLD, &size);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

nlocal = N / size;

buffer = (int*)malloc((nlocal+2) * sizeof(int));

nextbuffer = (int*)malloc((nlocal+2) * sizeof(int));

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 0; x < nlocal; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

...

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Decompose the problem space

[Diagram: buffer[] is indexed 0 .. nlocal+1; positions 1..nlocal hold the local cells and positions 0 and nlocal+1 hold copies of the neighbors' boundary cells]

What are the bugs?


59

MPI_Comm_size(MPI_COMM_WORLD, &size);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

nlocal = N / size;

buffer = (int*)malloc((nlocal+2) * sizeof(int));

nextbuffer = (int*)malloc((nlocal+2) * sizeof(int));

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

...

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

What are the defects?

• N may not be divisible by size

• Off-by-one error in the inner loop


60

Pattern: Side-effect of Parallelization

• Ordinary serial constructs can cause defects when they are accessed in parallel contexts

Symptoms:

• Various correctness/performance problems

Causes:

• “Sequential part” tends to be overlooked

• Typical parallel programs contain only a few parallel primitives, and the rest of the code is made of a sequential program running in parallel

Cures & preventions:

• Don’t just focus on the parallel code

• Check that the serial code is working on one processor, but remember that the defect may surface only in a parallel context


61

/* Initialize cells with input file */

fp = fopen("input.dat", "r");

if (fp == NULL) { exit(-1); }

nskip = ...

for (x = 0; x < nskip; x++) { fscanf(fp, "%d", &dummy);}

for (x = 0; x < nlocal; x++) { fscanf(fp, "%d", &buffer[x+1]);}

fclose(fp);

/* Main loop */

...

Data I/O

• What are the defects?


62

/* Initialize cells with input file */

if (rank == 0) {

fp = fopen("input.dat", "r");

if (fp == NULL) { exit(-1); }

for (x = 0; x < nlocal; x++) { fscanf(fp, "%d", &buffer[x+1]);}

for (p = 1; p < size; p++) {

/* Read initial data for process p and send it */

}

fclose(fp);

}

else {

/* Receive initial data*/

}

Data I/O

• Filesystem may cause performance bottleneck if all processors access the same file simultaneously

• (Schedule I/O carefully, or let “master” processor do all I/O)


63

/* What if we initialize cells with random values... */

srand(time(NULL));

for (x = 0; x < nlocal; x++) {

buffer[x+1] = rand() % 10;

}

/* Main loop */

...

Generating Initial Data

• What are the defects?

• (Other than the fact that rand() is not a good pseudo-random number generator in the first place…)


64

/* What if we initialize cells with random values... */

srand(time(NULL));

for (x = 0; x < nlocal; x++) {

buffer[x+1] = rand() % 10;

}

/* Main loop */

...

What are the Defects?

• All procs might use the same pseudo-random sequence, spoiling independence

• Hidden serialization in rand() causes performance bottleneck

srand(time(NULL) + rank);


65

Pattern: Synchronization

• Improper coordination between processes

• Well-known defect type in parallel programming

• Deadlocks, race conditions

Symptoms:

• Program hangs

• Incorrect/non-deterministic output

Causes:

• Some defects can be very subtle

• Use of asynchronous (non-blocking) communication can lead to more synchronization defects

Cures & preventions:

• Make sure that all communications are correctly coordinated


66

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Communication

• What are the defects?


67

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

What are the Defects?

• Obvious example of deadlock (can’t avoid noticing this)



68

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Ssend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Ssend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Another Example

• What are the defects?


69

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Ssend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Ssend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

What are the Defects?

• This causes deadlock too

• MPI_Ssend is a synchronous send (see the next slides)


70

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Send (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Yet Another Example

• What are the defects?


71

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Send (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Potential deadlock

• This may work (many novice programmers write this code)

• but it can cause deadlock with some implementation or parameters


72

Modes of MPI blocking communication

■ http://www.mpi-forum.org/docs/mpi-11-html/node40.html

– Standard (MPI_Send): may either return immediately when the outgoing message is buffered in the MPI buffers, or block until a matching receive has been posted.

– Buffered (MPI_Bsend): a send operation is completed when MPI buffers the outgoing message. An error is returned when there is insufficient buffer space.

– Synchronous (MPI_Ssend): a send operation is complete only when the matching receive operation has started to receive the message.

– Ready (MPI_Rsend): a send can be started only after the matching receive has been posted.

■ In our code MPI_Send probably won't block in most implementations (each message is just one integer), but it should still be avoided.

■ A “correct” solution could be:

– (1) alternate the order of send and recv

– (2) use MPI_Bsend with sufficient buffer size

– (3) MPI_Sendrecv (a sketch follows below), or

– (4) MPI_Isend/recv
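A sketch of option (3) for this example: MPI_Sendrecv pairs each send with a receive, so no ordering of neighbors can deadlock regardless of buffering (only the exchange is shown; the surrounding loop is unchanged):

/* exchange boundary cells with both neighbors, deadlock-free */
MPI_Sendrecv(&nextbuffer[nlocal], 1, MPI_INT, (rank+1)%size,      tag,
             &nextbuffer[0],      1, MPI_INT, (rank+size-1)%size, tag,
             MPI_COMM_WORLD, &status);
MPI_Sendrecv(&nextbuffer[1],        1, MPI_INT, (rank+size-1)%size, tag,
             &nextbuffer[nlocal+1], 1, MPI_INT, (rank+1)%size,      tag,
             MPI_COMM_WORLD, &status);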


73

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Isend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &request1);

MPI_Irecv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &request2);

MPI_Isend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &request3);

MPI_Irecv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &request4);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Non-Blocking Communication

• What are the defects?


74

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Isend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &request1);

MPI_Irecv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &request2);

MPI_Isend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &request3);

MPI_Irecv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &request4);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

What are the Defects?

• Synchronization (e.g. MPI_Wait, MPI_Barrier) is needed at each iteration (but too many barriers can cause a performance problem)


75

Pattern: Performance defect

• Scalability problem because processors are not working in parallel

• The program output itself is correct

• Perfect parallelization is often difficult: need to evaluate whether the execution speed is acceptable

Symptoms:

• Sub-linear scalability

• Performance much less than expected (e.g., most time spent waiting)

Causes:

• Unbalanced amount of computation

• Load balancing may depend on input data

Cures & preventions:

• Make sure all processors are “working” in parallel

• Profiling tool might help


76

if (rank != 0) {

MPI_Ssend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

}

if (rank != size-1) {

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Ssend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

}

Scheduling communication

• Complicated communication pattern - does not cause deadlock

What are the defects?


77

if (rank != 0) {

MPI_Ssend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

}

if (rank != size-1) {

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Ssend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

}

What are the bugs?

1 Send → 0 Recv → 0 Send → 1 Recv

2 Send → 1 Recv → 1 Send → 2 Recv

3 Send → 2 Recv → 2 Send → 3 Recv


• Communication requires O(size) time (a “correct” solution takes O(1))


78

Summary

■ This is an attempt to share knowledge about common defects in parallel programming

– Erroneous use of language features

– Space Decomposition

– Side-effect of Parallelization

– Synchronization

– Performance defect

– Try to avoid these defect patterns in your code


79

CMSC 714

Lecture 4

OpenMP and UPC

Chau-Wen Tseng (from A. Sussman)


80

Programming Model Overview

■ Message passing (MPI, PVM)

– Separate address spaces

– Explicit messages to access shared data

• Send / receive (MPI 1.0), put / get (MPI 2.0)

■ Multithreading (Java threads, pthreads)

– Shared address space

• Only local variables on thread stack are private

– Explicit thread creation, synchronization

■ Shared-memory programming (OpenMP, UPC)

– Mixed shared / separate address spaces

– Implicit threads & synchronization


81

Shared Memory Programming Model

■ Attempts to ease the task of parallel programming

– Hide details

• Thread creation, messages, synchronization

– Compiler generates parallel code

• Based on user annotations

■ Possibly lower performance

– Less control over

• Synchronization

• Locality

• Message granularity

■ May inadvertently introduce data races

– Read & write same shared memory location in parallel loop


82

OpenMP

■ Supports parallelism for SMPs, multi-core

– Provides a simple portable model

– Allows both shared and private data

– Provides parallel for/do loops

■ Includes

– Automatic support for fork/join parallelism

– Reduction variables

– Atomic statement

• only one thread executes it at a time

– Single statement

• only one process runs this code (first thread to reach it)


83

OpenMP

■ Characteristics

– Both local & shared memory (depending on directives)

– Parallelism directives for parallel loops & functions

– Compilers convert into multi-threaded programs (i.e. pthreads)

– Not supported on clusters

■ Example

#pragma omp parallel for private(i)

for (i=0; i<NUPDATE; i++) {

int ran = random();

table[ ran & (TABSIZE-1) ] ^= stable[ ran >> (64-LSTSIZE) ];

}

Parallel for indicates loop iterations may be

executed in parallel
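(As a usage note, not from the slides and compiler-specific: with GCC this kind of example is typically built with the -fopenmp flag, e.g. gcc -fopenmp prog.c, and the number of threads can be set through the OMP_NUM_THREADS environment variable.)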


84

More on OpenMP

■ Characteristics

– Not a full parallel language, but a language extension

– A set of standard compiler directives and library routines

– Used to create parallel Fortran, C and C++ programs

– Usually used to parallelize loops

– Standardizes last 15 years of SMP practice

■ Implementation

– Compiler directives using #pragma omp <directive>

– Parallelism can be specified for regions & loops

– Data can be

• Private – each processor has local copy

• Shared – single copy for all processors


85

OpenMP – Programming Model

■ Fork-join parallelism (restricted form of MIMD)

– Normally single thread of control (master)

– Worker threads spawned when parallel region encountered

– Barrier synchronization required at end of parallel region

[Diagram: a single master thread forks a team of worker threads at each parallel region and joins them at the end of the region]


86

OpenMP – Example Parallel Region

■ Task level parallelism – #pragma omp parallel { … }

double a[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
  int id = omp_get_thread_num();
  foo(id, a);
}
printf("all done \n");

[OpenMP compiler output, conceptually: the parallel region becomes four threads that run foo(0,a), foo(1,a), foo(2,a), and foo(3,a) concurrently, followed by the single printf("all done \n").]


87

OpenMP – Example Parallel Loop

■ Loop level parallelism – #pragma omp parallel for

– Loop iterations are assigned to threads, invoked as functions

#pragma omp parallel for
for (i = 0; i < N; i++) {
  foo(i);
}

[The OpenMP compiler transforms this into a parallel region in which each thread computes its own block of iterations:]

#pragma omp parallel
{
  int id, i, nthreads, start, end;
  id = omp_get_thread_num();
  nthreads = omp_get_num_threads();
  start = id * N / nthreads;        // assigning
  end = (id+1) * N / nthreads;      // work
  for (i = start; i < end; i++) {
    foo(i);
  }
}

Loop iterations scheduled in blocks


88

Iteration Scheduling

■ Parallel for loop

– Simply specifies loop iterations may be executed in parallel

– Actual processor assignment is up to compiler / run-time system

■ Scheduling goals

– Reduce load imbalance

– Reduce synchronization overhead

– Improve data locality

■ Scheduling approaches

– Block (chunks of contiguous iterations)

– Cyclic (round-robin)

– Dynamic (threads request additional iterations when done)
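These three approaches correspond to OpenMP's schedule clause; a sketch (the chunk size of 16 is illustrative):

/* block: each thread gets one contiguous chunk of iterations */
#pragma omp parallel for schedule(static)
for (i = 0; i < N; i++) foo(i);

/* cyclic: iterations handed out round-robin, one at a time */
#pragma omp parallel for schedule(static, 1)
for (i = 0; i < N; i++) foo(i);

/* dynamic: idle threads request the next chunk of 16 iterations */
#pragma omp parallel for schedule(dynamic, 16)
for (i = 0; i < N; i++) foo(i);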


89

Parallelism May Cause Data Races

■ Data race

– Multiple accesses to shared data in parallel

– At least one access is a write

– Result dependent on order of shared accesses

■ May be introduced by parallel loop

– If data dependence exists between loop iterations

– Results depend on the order in which loop iterations execute

– Example

#pragma omp parallel for

for (i=1;i<N-1;i++) {

a[i] = ( a[i-1] + a[i+1] ) / 2;

}
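One common fix, shown here only as a sketch (a_new is an assumed second buffer, not part of the original slide), is to read from one array and write to another so no iteration depends on a value written by a concurrent iteration:

/* Race-free variant: every iteration reads only the old array a[]
   and writes only its own element of a_new[]. */
#pragma omp parallel for
for (i = 1; i < N-1; i++) {
    a_new[i] = ( a[i-1] + a[i+1] ) / 2;
}
/* Afterwards, copy or swap a_new back into a. */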


90

Sample Fortran77 OpenMP Code

program compute_pi

integer n, i

double precision w, x, sum, pi, f, a

c function to integrate

f(a) = 4.d0 / (1.d0 + a*a)

print *, "Enter # of intervals: "

read *,n

c calculate the interval size

w = 1.0d0/n

sum = 0.0d0

!$OMP PARALLEL DO PRIVATE(x), SHARED(w)

!$OMP& REDUCTION(+: sum)

do i = 1, n

x = w * (i - 0.5d0)

sum = sum + f(x)

enddo

pi = w * sum

print *, "computed pi = ", pi

stop

end


91

Reductions

� Specialized computations in which

– Partial results may be computed in parallel

– Partial results are then combined into the final result

– Examples

• Addition, multiplication, minimum, maximum, count

� OpenMP reduction variable

– Compiler inserts code to

• Compute partial result locally

• Use synchronization / communication to combine results
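A C analogue of the Fortran pi program on the previous slide, using the reduction clause (n is the interval count supplied by the caller):

double compute_pi(int n) {
    double w = 1.0 / n, sum = 0.0, x;
    int i;

    /* Each thread keeps a private partial sum; OpenMP combines the
       partial sums into sum when the loop finishes. */
    #pragma omp parallel for private(x) reduction(+: sum)
    for (i = 1; i <= n; i++) {
        x = w * (i - 0.5);
        sum = sum + 4.0 / (1.0 + x * x);
    }
    return w * sum;
}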


92

UPC

� Extension to C for parallel computing

� Target Environment

– Distributed memory machines

– Cache coherent multi-processors

– Multi-core processors

� Features

– Explicit control of data distribution

– Includes parallel for statement

– MPI-like run-time library support


93

UPC

� Characteristics

– Local memory, shared arrays accessed by global pointers

– Parallelism : single program on multiple nodes (SPMD)

– Provides illusion of shared one-dimensional arrays

– Features

• Data distribution declarations for arrays

• One-sided communication routines (memput / memget)

• Compilers translate shared pointers & generate communication

• Can cast shared pointers to local pointers for efficiency

� Example

shared int *x, *y, z[100];
upc_forall (i = 0; i < 100; i++) { z[i] = *x++ * *y++; }


94

More UPC

� Shared pointer

– Key feature of UPC

• Enables support for distributed memory architectures

– Local (private) pointer pointing to shared array

– Consists of two parts

• Processor number

• Local address on processor

– Read operations on shared pointer

• If for nonlocal data, compiler translates into memget( )

– Write operations on shared pointer

• If for nonlocal data, compiler translates into memput( )

– Cast into local private pointer

• Accesses local portion of shared array w/o communication
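A minimal UPC sketch of the last point (the block size and array length are illustrative, not from the slides): with a blocked distribution, each thread can cast the shared pointer to its own block into a private pointer and work on it with no generated communication.

#include <upc.h>

#define B 100
shared [B] int data[B*THREADS];    /* thread t owns data[t*B .. t*B+B-1] */

void scale_local_block(void) {
    /* Cast to a local (private) pointer: accesses below compile to
       plain loads/stores, no memget()/memput() calls. */
    int *local = (int *) &data[MYTHREAD * B];
    int i;
    for (i = 0; i < B; i++)
        local[i] *= 2;
}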


95

UPC Execution Model

� SPMD-based

– One thread per processor

– Each thread starts with same entry to main

� Different consistency models possible

– “Strict” model is based on sequential consistency

• Results must match some sequential execution order

– “Relaxed” based on release consistency

• Writes visible only after release synchronization

– Increased freedom to reorder operations

– Reduced need to communicate results

– Consistency models are tricky

• Avoid data races altogether


96

Forall Loop

� Forms basis of parallelism

� Add fourth parameter to for loop, “affinity”

– Where code is executed is based on “affinity”

– Attempt to assign loop iteration to processor with shared data

• To reduce communication

� Lacks explicit barrier before / after execution

– Differs from OpenMP

� Supports nested forall loops
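A sketch of the fourth parameter (array names here are illustrative): the affinity expression pins each iteration to the thread that owns the element it writes, so the assignment stays local.

#include <upc.h>

#define N 100
shared int x[N], y[N], z[N];

void add_vectors(void) {
    int i;
    /* Fourth clause is the affinity expression: iteration i runs on the
       thread that has affinity to &z[i], keeping the write local. */
    upc_forall (i = 0; i < N; i++; &z[i]) {
        z[i] = x[i] + y[i];
    }
}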


97

Split-phase Barriers

� Traditional Barriers

– Once enter barrier, busy-wait until all threads arrive

� Split-phase

– Announce intention to enter barrier (upc_notify)

– Perform some local operations

– Wait for other threads (upc_wait)

� Advantage

– Allows work while waiting for processes to arrive

� Disadvantage

– Must find work to do

– Takes time to communicate both notify and wait
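A minimal sketch of the split-phase pattern (the two work routines are placeholders): notify as soon as the shared data others need is ready, do purely local work, then wait.

#include <upc.h>

extern void update_shared_boundary(void);   /* work other threads must see */
extern void compute_on_private_data(void);  /* purely local work */

void split_phase_step(void) {
    update_shared_boundary();

    upc_notify;                  /* announce arrival at the barrier */
    compute_on_private_data();   /* overlap local work with others' arrival */
    upc_wait;                    /* block until every thread has notified */
}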


98

Computing Environment

� Cost Effective High Performance Computing

– Dedicated servers are expensive

– Non-dedicated machines are useful

• high processing power(~1GHz), fast network (100Mbps+)

• Long idle time(~50%), low resource usage

[Diagram: a user who needs cycles to run simulations, connected over a network to machines in offices, a computer lab, a supercomputer, a clustered server, and workstations and PCs]


99

OS Support For Parallel Computing

� Many applications need raw compute power

– Computer H/W and S/W Simulations

– Scientific/Engineering Computation

– Data Mining, Optimization problems

� Goal

– Exploit computation cycles on idle workstations

� Projects

– Condor

– Linger-Longer


100

Issues

� Scheduling

– What jobs to run on which machines?

– When to start / stop using idle machines?

� Transparency

– Can applications execute as if on home machine?

� Checkpoints

– Can work be saved if job is interrupted?


101

What Is Condor?

� Condor – Exploits computation cycles in collections of

• workstations

• dedicated clusters

– Manages both

• resources (machines)

• resource requests (jobs)

– Has several mechanisms

• ClassAd Matchmaking

• Process checkpoint/ restart / migration

• Remote System Calls

• Grid Awareness

– Scalable to thousands of jobs / machines


102

Condor – Dedicated Resources

� Dedicated Resources

– Compute Clusters

� Manage

– Node monitoring, scheduling

– Job launch, monitor & cleanup


103

Condor – Non-dedicated Resources

� Examples

– Desktop workstations in offices

– Workstations in student labs

� Often idle

– Approx 70% of the time!

� Condor policy

– Use workstation if idle

– Interrupt and move job if user activity detected


104

Mechanisms in Condor

� Transparent Process Checkpoint / Restart

� Transparent Process Migration

� Transparent Redirection of I/O

– Condor’s Remote System Calls


105

CondorView Usage Graph


106

What is ClassAd Matchmaking?

� Condor uses ClassAd Matchmaking to make sure that work gets done within the constraints of both users and owners.

� Users (jobs) have constraints:

– “I need an Alpha with 256 MB RAM”

� Owners (machines) have constraints:

– “Only run jobs when I am away from my desk and never run jobs owned by Bob.”

� Semi-structured data --- no fixed schema
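For illustration only (attribute values are made up, not from the slides), such constraints are written as ClassAd expressions — a job-side requirement in a submit file and an owner-side policy in the machine configuration might look like:

# Job ClassAd constraint (submit description file)
requirements = (Arch == "ALPHA") && (Memory >= 256)

# Machine ClassAd policy (condor_config): only start jobs when the
# keyboard has been idle for 15 minutes, and never run Bob's jobs
START = (KeyboardIdle > 15 * $(MINUTE)) && (RemoteUser != "bob@cs.wisc.edu")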


107

Some Challenges

� Condor does whatever it takes to run your jobs, even if some machines…

– Crash (or are disconnected)

– Run out of disk space

– Don’t have your software installed

– Are frequently needed by others

– Are far away & managed by someone else


108

Condor’s Standard Universe

� Condor can support various combinations of features/environments

– In different “Universes”

� Different Universes provide different functionality

– Vanilla

• Run any Serial Job

– Scheduler

• Plug in a meta-scheduler

– Standard

• Support for transparent process checkpoint and restart


109

Process Checkpointing

� Condor’s Process Checkpointing mechanism saves all the state of a process into a checkpoint file

– Memory, CPU, I/O, etc.

� The process can then be restarted

– From right where it left off

� Typically no changes to your job’s source code needed

– However, your job must be relinked with Condor’s Standard Universe support library
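As an illustrative sketch (program and file names are made up), relinking and submitting a standard-universe job typically looks like:

# Relink with Condor's standard-universe (checkpoint / remote syscall) library
condor_compile gcc -o mysim mysim.c

# mysim.submit -- minimal standard-universe submit description
universe   = standard
executable = mysim
output     = mysim.out
error      = mysim.err
log        = mysim.log
queue

# Hand the job to Condor
condor_submit mysim.submit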


110

When Will Condor Checkpoint Your Job?

� Periodically, if desired

– For fault tolerance

� To free the machine to do a higher priority task (higher priority job, or a job from a user with higher priority)

– Preemptive-resume scheduling

� When you explicitly run

– condor_checkpoint

– condor_vacate

– condor_off

– condor_restart


111

Condor Daemon Layout

[Diagram: Personal Condor / Central Manager — the master daemon spawns the collector, negotiator, schedd, and startd processes]


112

Layout of the Condor Pool

[Diagram: a Condor pool — the Central Manager (Frieda's) runs master, collector, negotiator, schedd, and startd; each desktop runs master, schedd, and startd; each cluster node runs master and startd. The original figure marks ClassAd communication pathways and which processes are spawned.]


113

Access to Data in Condor

� Use Shared Filesystem if available

� No shared filesystem?

– Remote System Calls (in the Standard Universe)

– Condor File Transfer Service

• Can automatically send back changed files

• Atomic transfer of multiple files

– Remote I/O Proxy Socket


114

Standard Universe Remote System Calls

� I/O System calls trapped

– Sent back to submit machine

� Allows Transparent Migration Across Domains

– Checkpoint on machine A, restart on B

� No Source Code changes required

� Language Independent

� Opportunities

– For Application Steering

• Condor tells customer process “how” to open files

– For compression on the fly


115

Job Startup

[Diagram: job startup — on the submit side, the schedd spawns a shadow; on the execute side, the startd spawns a starter, which runs the customer job linked against the Condor syscall library]


116

[Diagram: secure remote I/O (Chirp) — at the submission site, the shadow runs an I/O server with access to the home file system; at the execution site, the starter forks the job, whose I/O library sends remote I/O through a local I/O proxy to the I/O server, while local I/O uses local system calls]


117

[Diagram: Condor-G — on the job submission machine, end-user requests go to the Condor-G scheduler, which keeps a persistent job queue and forks a Condor-G GridManager, a GASS server, a Condor-G collector, and a Condor shadow process for job X; at the job execution site, Globus daemons and the local site scheduler fork Condor daemons, which run job X linked with the Condor system call trapping & checkpoint library; job X is transferred to the execution site, resource information flows back, and redirected system call data returns to the submission machine]


118

Exploiting Idle Cycles

in Networks of Workstations

Kyung Dong Ryu

© Copyright 2001, K.D. Ryu, All Rights Reserved.

Ph.D. Defense


119

High Performance Computing in NOW

� Many systems support harvesting idle machines

– Traditional Approach : Coarse-Grained Cycle Stealing

• while owner is away: send guest job and run

• when owner returns: stop, then

– migrate guest job: Condor, NOW system

– suspend or kill guest job: Butler, LSF, DQS system

� But…


120

Additional CPU Time and Memory is Available

� When a user is active

[Graph: cumulative distribution of CPU usage (%), with curves for all, idle, and busy periods]

– CPU usage is < 10% for 75% of the time

– 30 MB of memory is available 70% of the time

(Trace from UC Berkeley)

[Graph: cumulative probability of available memory size (MB), with curves for all, idle, and busy periods; total memory: 64 MB]


121

Questions

� Can we exploit fine grained idle resources?

– For sequential programs and parallel programs

– Improve throughput

� How to reduce effect on user?

– Two level CPU scheduling

– Memory limits

– Network and I/O throttling


122

Fine-Grain Idle Cycles

� Coarse-Grain Idle Cycles

– (t1,t3): Keyboard/mouse events

– (t4,~): High CPU usage

– Recruitment threshold

� Fine-Grain Idle Cycles

– All empty slots

– Whenever resource(CPU) is not used

[Figure: timeline of CPU usage with keyboard/mouse events and the recruitment threshold; the intervals around t1–t4 are labeled non-idle, idle, idle, non-idle]


123

Linger Longer: Fine-Grain Cycle Stealing

� Goals:

– Harvest more available resources

– Limit impact on local jobs

� Technique: Lower Guest Job’s Resource Priority

– Exploit fine-grained idle intervals even when user is active

• Starvation-level low CPU priority

• Dynamically limited memory use

• Dynamically throttled I/O and network bandwidth

� Adaptive Migration

– No need to move guest job to avoid local job delay

– Could move guest job to improve guest performance


124

Adaptive Migration

� When Migration Benefit Outweighs Migration Cost

– Non-idle_Period ≥ Linger_Time + Migration_Cost / Non-idle_Usage

– Linger_Time ∝ Migration_Cost / Non-idle_Usage

• Migration_cost = Suspend_Time(source) + Process_Size/Network_Bandwidth + Resume_Time(dest.)

[Figure: timelines for a guest job on nodes A and B, with and without migration, when a local job starts — showing the non-idle period, non-idle usage, linger time, migration cost, and finish times tf1 and tf2]


125

Need a Suite of Mechanisms

� Policy:

– Most unused resources should be available

– Resource should be quickly revoked when local jobs reclaim

� Dynamic Bounding Mechanisms for:

1. CPU

2. Memory

3. I/O Bandwidth

4. Network Bandwidth

Goal: maximize usage of idle resources

Constraint: limit impact on local jobs


126

CPU Bounding: Is Unix "nice" Sufficient?

� CPU priority is not strict

– run two empty loop processes (guest: nice 19)

OS Host Guest

Solaris (SunOS 5.5) 84% 15%

Linux (2.0.32) 91% 8%

OSF1 99% 0%

AIX (4.2) 60% 40%

� Why ?

– Anti-Starvation Policy


127

CPU Bounding: Starvation Level Priority

� Original Linux CPU Scheduler

– One Level : process priority

– Run-time Scheduling Priority

• nice value & remaining time quanta

• T_i = 20 - nice_level + 1/2 * T_{i-1}

– Low priority process can preempt high priority process

� Extended Linux CPU Scheduler

– Two Level : 1) process class, 2) priority

– If runnable host processes exist

• Schedule a host process as in unmodified scheduler

– Only when no host process is runnable

• Schedule a guest process
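A conceptual sketch in C of that two-level decision (this is not the actual kernel patch; the task structure is invented for illustration):

/* Two-level pick-next: guests have starvation-level priority --
   a guest process is chosen only if no host process is runnable. */
struct task {
    int is_guest;
    int runnable;
    int goodness;            /* normal run-time scheduling priority */
    struct task *next;
};

struct task *pick_next(struct task *runqueue) {
    struct task *best_host = 0, *best_guest = 0, *t;
    for (t = runqueue; t; t = t->next) {
        if (!t->runnable)
            continue;
        if (!t->is_guest) {
            if (!best_host || t->goodness > best_host->goodness)
                best_host = t;
        } else {
            if (!best_guest || t->goodness > best_guest->goodness)
                best_guest = t;
        }
    }
    return best_host ? best_host : best_guest;
}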


128

Memory Bounding: Page Limits

� Extended page replacement algorithm

� Adaptive Page-Out Speed

– When a host job steals a guest’s page, page-out multiple pages

• faster than default

[Figure: main memory pages split by a High Limit and a Low Limit — above the High Limit the host job has priority, below the Low Limit the guest job has priority, and between the limits replacement is based only on LRU]

– No limit on taking free pages

– High Limit :

• Maximum pages guest can hold

– Low Limit :

• Minimum pages guaranteed to guest


129

Experiment: Memory Bounding

� Prioritized Memory Page Replacement

– Total Available Memory : 180MB

– Guest Memory Thresholds: High Limit (70MB), Low Limit (50MB)

[Graph: host job and guest job memory (MB) vs. time (sec), with the guest High Limit and Low Limit marked]


130

Experiment: Nice vs. CPU & Memory Bounding

� Large Memory Footprint Job

– Each job takes 82 sec to run in isolation

Policy and Setup             Host time (secs)   Guest time (secs)   Host Delay
Host starts then guest:
  Guest niced 19             89                 176                 8.0%
  Linger priority            83                 165                 0.8%
Guest starts then host:
  Guest niced 19             > 5 hours          > 5 hours           > 2,000%
  Linger priority            99                 255                 8.1%

– Host-then-guest:

• Reduce host job delay 8% to 0.8%

– Guest-then-host:

• Nice causes memory thrashing

• CPU & memory bounding serializes the execution


131

I/O and Network Throttling

Problem 1: Guest I/O & comm. can slow down local jobs

Problem 2: Migration/checkpoint bothers local users

� Policy: Limit guest I/O and comm. bandwidth

– Only when host I/O or communication is active

� Mechanism: Rate Windows

– Keep track of I/O rate by host and guest

– Throttle guest I/O rate when host I/O is active

� Implementation: a loadable kernel module

– Highly portable and deployable

– Light-weight : I/O call intercept


132

I/O Throttling Mechanism: Rate Windows

� Regulate I/O Rate

– Rate Windows sit between the application library and the kernel, intercepting I/O and communication requests

– A sliding window of the last N requests (e.g., 4 kB / 100 msec, 60 kB / 500 msec, 12 kB / 75 msec, 16 kB / 80 msec) gives the average rate

– If the requesting process is regulated (a guest) and its average rate exceeds the target rate, a delay is computed:

• delay < dmin: ignore

• delay > dmax: split the request and sleep dmax

• otherwise: sleep for delay seconds
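A simplified user-level sketch of the rate-window bookkeeping (window size, limits, and helper names are assumptions; the real mechanism is a loadable kernel module that intercepts I/O calls):

#define WINDOW 8    /* remember the last N requests */

struct rate_window {
    double bytes[WINDOW];    /* size of each remembered request */
    double secs[WINDOW];     /* elapsed time attributed to it */
    int    next;
};

/* Record one request and return the average rate over the window (bytes/sec). */
double record_and_rate(struct rate_window *w, double bytes, double secs) {
    double tb = 0.0, ts = 0.0;
    int i;
    w->bytes[w->next] = bytes;
    w->secs[w->next]  = secs;
    w->next = (w->next + 1) % WINDOW;
    for (i = 0; i < WINDOW; i++) { tb += w->bytes[i]; ts += w->secs[i]; }
    return ts > 0.0 ? tb / ts : 0.0;
}

/* Delay (seconds) to impose on a regulated (guest) request. */
double throttle_delay(double avg_rate, double target_rate,
                      double window_bytes, double dmin, double dmax) {
    double delay;
    if (avg_rate <= target_rate)
        return 0.0;                  /* under the limit: let it through */
    delay = window_bytes / target_rate - window_bytes / avg_rate;
    if (delay < dmin) return 0.0;    /* too small to bother: ignore */
    if (delay > dmax) return dmax;   /* too large: split the request, sleep dmax */
    return delay;                    /* otherwise sleep for delay seconds */
}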


133

Experiment: I/O Throttling

� tar programs as host and guest jobs

– Guest I/O Limit: 500 kB/sec (~10%)

– Throttling Threshold : Lo: 500 kB/sec Hi: 1000 kB/s

[Graphs: I/O rate (kB/s) vs. time (sec) for the host and guest tar jobs, without and with I/O throttling]

� Without I/O throttling: host tar takes 72 seconds

� With I/O throttling: host tar takes 42 seconds


134

Dilation Factor in I/O Throttling

� File I/O rate ≠ disk I/O rate

– Buffer Cache, Prefetching

� Control disk I/O by throttling file I/O

– Adjust delay using

• dilation factor = avg. disk I/O rate / avg. file I/O rate

– Compile test (I/O Limit: 500kB/s)

[Graphs: file I/O and disk I/O rate (kB/s) vs. time (sec) for the compile test: (a) no limit, (b) file I/O limit, (c) disk I/O limit]


135

Experiment: Network Throttling

� Guest job migration vs. httpd as a host job

– Guest job migration disrupts host job communication

– Throttling migration when host job comm. is active

– Guest job comm. Limit: 500 kB/s

� Without comm. throttling: the web server loses bandwidth to the migration traffic

� With comm. throttling: the web server takes full bandwidth immediately

[Graphs: communication bandwidth (kB/s) vs. time (sec) for the guest migration and the web server, without and with communication throttling]


136

Guest Job Performance

– Overall, LL improves 50%~70% over IE

– Less improvement for larger jobs (lu.B)

• Only 36% improvement for lu.B.30m

• Less memory is available while non-idle

– LF is slightly better than LL

– Less Variation for LL

• lu.B.30m: 23.6% for LL, 47.5% for LF

[Bar chart: guest job throughput (base = 8) for LL, LF, PM, and IE on mg.W.1m, mg.W.30m, sp.A.10m, lu.B.1.5m, and lu.B.30m]


137

Host Job Slowdown

– LL/LF delays less for small and medium size jobs

• 0.8%~1.1% for LL/LF, 1.4%~2.3% for PM/IE

• Non-prioritized migration operations of PM/IE

– More delay for large jobs

• Memory contention

[Bar chart: host job delay (%) for LL, LF, PM, and IE on mg.W.1m, mg.W.30m, sp.A.10m, lu.B.1.5m, and lu.B.30m]


138

Conclusions

� Identified opportunities for fine-grain idle resources

� Linger-Longer can exploit up to 60% more idle time

– Fine-Grain Cycle Stealing

– Adaptive Migration

� Linger-Longer can improve parallel applications in NOW

� A suite of mechanisms insulates local job’s performance

– CPU scheduling: starvation-level priority

– Memory Priority: lower and upper limits

– I/O and Network Bandwidth Throttling: Rate Windows

� Linger-Longer improves

– Guest job throughput by 50% to 70%

– With a 3% host job slowdown


139

Related Work

� Idle Cycle Stealing Systems
– Condor [Litzkow 88]
– NOW project [Anderson 95]
– Butler [Dannenberg 85], LSF [Green 93], DQS [Zhou 93]

� Process Migration in OS
– Sprite [Douglis 91], Mosix [Barak 95]

� Idle Memory Stealing Systems
– Dodo [Acharya 99], GMS [Feeley 95]
– Cooperative Caching [Dahlin 94][Sarkar 96]

� Parallel Programs on Non-dedicated Workstations
– Reconfiguration [Acharya 97]
– MIST/MPVM [Clark 95], Cilk-NOW [Blumofe 97]
– CARMI [Pruyne 95] (Master-worker model)

� Performance Isolation
– Eclipse [Bruno 98]
– Resource container [Banga 99]