
1

Parallel Computing
Slides from Prof. Jeffrey Hollingsworth

2

What is Parallel Computing?

■ Does it include:
  – super-scalar processing (more than one instruction at once)?
  – client/server computing?
    • what if RPC calls are non-blocking?
  – vector processing (same instruction to several values)?
  – collection of PCs not connected to a network?
■ For us, parallel computing requires:
  – more than one processing element
  – nodes connected to a communication network
  – nodes working together to solve a single problem

3

Why Parallelism

■ Speed
  – need to get results faster than is possible sequentially
    • a weather forecast that is late is useless
  – could come from
    • more processing elements (P.E.s)
    • more memory (or cache)
    • more disks
■ Cost: cheaper to buy many smaller machines
  – this is only recently true, due to
    • VLSI
    • commodity parts

4

What Does a Parallel Computer Look Like?

■ Hardware
  – processors
  – communication
  – memory
  – coordination
■ Software
  – programming model
  – communication libraries
  – operating system

5

Processing Elements (PE)

■ Key Processor Choices
  – How many?
  – How powerful?
  – Custom or off-the-shelf?
■ Major Styles of Parallel Computing
  – SIMD - Single Instruction Multiple Data
    • one master program counter (PC)
  – MIMD - Multiple Instruction Multiple Data
    • separate code for each processor
  – SPMD - Single Program Multiple Data
    • same code on each processor, separate PCs on each
  – Dataflow - instruction waits for operands
    • "automatically" finds parallelism

6

SIMD

[Diagram: one program and one program counter drive all processors; a per-processor mask flag (0/1) enables or disables each PE.]

7

MIMD

[Diagram: each processor runs its own program (#1, #2, #3) with its own program counter.]

8

SPMD

[Diagram: every processor runs the same program, but each has its own program counter.]

9

Dataflow

[Diagram: instructions I1, I2, and I3 feed instruction I4; an instruction fires when its operands arrive.]

10

Communication Networks

■ Connect
  – PEs, memory, I/O
■ Key Performance Issues
  – latency: time for the first byte
  – throughput: average bytes/second
■ Possible Topologies
  – bus - simple, but doesn't scale
  – ring - orders delivery of messages

[Diagram: PEs and memory modules attached to a shared bus and connected in a ring.]

11

Topologies (cont)

– tree - needs to increase bandwidth near the top
– mesh - two or three dimensions
– hypercube - needs a power-of-two number of nodes

[Diagram: tree, mesh, and hypercube topologies built from PEs.]


14

Memory Systems

■ Key Performance Issues
  – latency: time for the first byte
  – throughput: average bytes/second
■ Design Issues
  – Where is the memory?
    • divided among the nodes
    • centrally located (on the communication network)
  – Access by processors
    • can all processors get to all memory?
    • is the access time uniform?

15

Coordination

■ Synchronization
  – protection of a single object (locks)
  – coordination of processors (barriers)
■ Size of a unit of work by a processor
  – need to manage two issues
    • load balance - processors have equal work
    • coordination overhead - communication and synchronization
  – often called "grain" size - large grain vs. fine grain

16

Sources of Parallelism

■ Statements
  – called "control parallel"
  – can perform a series of steps in parallel
■ Loops
  – called "data parallel"
  – most common source of parallelism
  – each processor gets one (or more) iterations to perform

17

Example of Parallelism

■ Easy (embarrassingly parallel)
  – multiple independent jobs (e.g., different simulations)
■ Scientific
  – largest users of parallel computing
  – dense linear algebra (divide up the matrix)
  – physical system simulations (divide physical space)
■ Databases
  – biggest commercial success of parallel computing (divide tuples)
    • exploits the semantics of relational calculus
■ AI
  – search problems (divide the search space)
  – pattern recognition and image processing (divide the image)

18

Metrics in Application Performance

■ Speedup (often called strong scaling)
  – ratio of the time on a single node to the time on n nodes
  – hold the problem size fixed
  – should really compare to the best serial time
  – goal is linear speedup
  – super-linear speedup is possible due to:
    • adding more memory
    • search problems
■ Weak Scaling (also called Iso-Speedup)
  – scale the data size up with the number of nodes
  – goal is a flat, horizontal curve
■ Amdahl's Law
  – max speedup is 1/(serial fraction of time)
■ Computation to Communication Ratio
  – goal is to maximize this ratio
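A compact restatement of the strong-scaling limit above, as a hedged math sketch in my notation (s is the serial fraction of the time, p the number of processors):

  \[ S(p) \;=\; \frac{T_1}{T_p} \;\le\; \frac{1}{\,s + (1-s)/p\,} \;\le\; \frac{1}{s} \]

For example, a code that is 5% serial (s = 0.05) can never exceed a 20x speedup, no matter how many nodes are added.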


20

How to Write Parallel Programs

■ Use old serial code
  – compiler converts it to parallel
  – called the "dusty deck" problem
■ Serial language plus a communication library
  – no compiler changes required!
  – PVM and MPI use this approach
■ New language for parallel computing
  – requires all code to be re-written
  – hard to create a language that provides performance on different platforms
■ Hybrid approach
  – HPF - add data distribution commands to the code
  – add parallel loops and synchronization operations

21

Application Example - Weather

■ Typical of many scientific codes
  – computes results for a three-dimensional space
  – computes results at multiple time steps
  – uses equations to describe the physics/chemistry of the problem
  – grids are used to discretize continuous space
    • granularity of the grids is important to speed/accuracy
■ Simplifications (for this example; not in real codes)
  – earth is flat (no mountains)
  – earth is round (poles are really flat, earth bulges at the equator)
  – second-order properties ignored

22

Grid Points

■ Divide continuous space into discrete parts
  – for this code, the grid size is fixed and uniform
    • possible to change the grid size or use multiple grids
  – use three grids
    • two for latitude and longitude
    • one for elevation
    • total of M * N * L points
■ Design choice: where is the grid point?
  – left (L), right (R), or center (C) of the grid cell
  – in multiple dimensions this multiplies:
    • for 3 dimensions there are 27 possible points

23

Variables

■ One-dimensional
  – m - geo-potential (gravitational effects)
■ Two-dimensional
  – pi - "shifted" surface pressure
  – sigmadot - vertical component of the wind velocity
■ Three-dimensional (primary variables)
  – <u,v> - wind velocity/direction vector
  – T - temperature
  – q - specific humidity
  – p - pressure
■ Not included
  – clouds
  – precipitation
  – can be derived from the others

24

Serial Computation

■ Convert the equations to discrete form
■ Update from time t to t + delta t:

  foreach longitude, latitude, altitude
    ustar[i,j,k] = n * pi[i,j] * u[i,j,k]
    vstar[i,j,k] = m[j] * pi[i,j] * v[i,j,k]
    sdot[i,j,k] = pi[i,j] * sigmadot[i,j]
  end

  foreach longitude, latitude, altitude
    D = 4 * ((ustar[i,j,k] + ustar[i-1,j,k]) * (q[i,j,k] + q[i-1,j,k]) +
             terms in {i,j,k}{+,-}{1,2} ...)
    piq[i,j,k] = piq[i,j,k] + D * deltat
    similar terms for piu, piv, piT, and pi
  end

  foreach longitude, latitude, altitude
    q[i,j,k] = piq[i,j,k]/pi[i,j,k]
    u[i,j,k] = piu[i,j,k]/pi[i,j,k]
    v[i,j,k] = piv[i,j,k]/pi[i,j,k]
    T[i,j,k] = piT[i,j,k]/pi[i,j,k]
  end

25

Shared Memory Version

■ In each loop nest, iterations are independent
■ Use a parallel for-loop for each loop nest
■ Synchronize (barrier) after each loop nest
  – this is overly conservative, but works
  – could use a single sync variable per item, but that would incur excessive overhead
■ Potential parallelism is M * N * L
■ Private variables: D, i, j, k
■ Advantages of shared memory
  – easier to get something working (ignoring performance)
■ Hard to debug
  – other processors can modify shared data
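A minimal sketch of one of these loop nests written with OpenMP directives (OpenMP is covered later in these slides). The array names follow the pseudocode above but are treated here as 3-D C arrays, and n_coef/m_coef are my placeholders for the n and m map factors; this is an illustration under those assumptions, not the slide's code.

  /* the outer loop iterations are independent; i is private as the loop
     variable, j and k are declared private explicitly */
  #pragma omp parallel for private(j, k)
  for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < L; k++) {
        ustar[i][j][k] = n_coef[j] * pi[i][j] * u[i][j][k];
        vstar[i][j][k] = m_coef[j] * pi[i][j] * v[i][j][k];
        sdot[i][j][k]  = pi[i][j] * sigmadot[i][j];
      }
  /* the implicit barrier at the end of the parallel for provides the
     per-loop-nest synchronization described on this slide */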

26

Distributed Memory Weather

■ Decompose the data onto specific processors
  – assign a cube to each processor
    • maximize the volume-to-surface ratio
    • minimizes the communication/computation ratio
  – called a <block,block,block> distribution
■ Need to communicate the {i,j,k}{+,-}{1,2} terms at the boundaries
  – use send/receive to move the data
  – no need for barriers; send/receive operations provide the synchronization
    • send earlier in the computation to hide communication time
■ Advantages
  – easier to debug?
  – data locality is considered explicitly via the data decomposition
■ Problems
  – harder to get the code running

27

Seismic Code

■ Given echo data, compute an undersea map
■ Computation model
  – designed for a collection of workstations
  – uses a variation of the RPC model
  – workers are given an independent trace to compute
    • requires little communication
    • supports load balancing (1,000 traces is typical)
■ Performance
  – max MFLOPS = O((F * nz * B*)^(1/2))
  – F - single processor MFLOPS
  – nz - linear dimension of the input array
  – B* - effective communication bandwidth
    • B* = B / (1 + B·L/w) ≈ B/7 for Ethernet (latency L = 10 ms, w = 1400)
  – the real limit to performance was latency, not bandwidth

28

Database Applications

■ Too much data to fit in memory (or sometimes on disk)
  – data mining applications (K-Mart has a 4-5 TB database)
  – imaging applications (NASA has a site with 0.25 petabytes)
    • uses a fork lift to load tapes by the pallet
■ Sources of parallelism
  – within a large transaction
  – among multiple transactions
■ Join operation
  – form a single table from two tables based on a common field
  – try to split the join attribute into disjoint buckets
    • if the data distribution is known to be uniform, it's easy
    • if not, try hashing

29

Speedup in Join parallelism

■ Book claims a speedup of p^2 is possible
  – split each relation into p buckets
    • each bucket is a disjoint subset of the join attribute
  – each processor only has to consider N/p tuples per relation
    • join is O(N^2), so each processor does O((N/p)^2) work
    • so the apparent speedup is O(N^2) / O(N^2/p^2) = O(p^2)
■ This is a lie!
  – a single processor could also split its input into p buckets
    • its time would then be O(p * (N/p)^2) = O(N^2/p)
    • so the real speedup is O(N^2/p) / O(N^2/p^2) = O(p)
  – Amdahl's law is not violated

30

Parallel Search (TSP)

■ May appear to achieve better than n-fold speedup
  – but this is not really the case either
■ Algorithm
  – compute a path on each processor
    • if our path is shorter than the shortest one so far, send it to the others
    • stop searching a path when it is longer than the shortest
  – before computing the next path, check for word of a new minimum path
  – stop when all paths have been explored
■ Why it appears to exceed n-fold speedup
  – we found a shorter path sooner
  – however, the reason for this is a different search order!

31

Ensuring a fair speedup

■ T_serial = fastest of
  – the best known serial algorithm
  – a simulation of the parallel computation
    • use the parallel algorithm
    • run all processes on one processor
  – the parallel algorithm run on one processor
■ If the speedup appears to be super-linear
  – check the memory hierarchy
    • increased cache or real memory may be the reason
  – verify that the order of operations is the same in the parallel and serial cases

32

Quantitative Speedup

■ Consider master-worker
  – one master and P worker processes
  – communication time increases as a linear function of P

  T_P = T_COMP_P + T_COMM_P
  T_COMP_P = T_S / P
  1/S_P = T_P / T_S = 1/P + T_COMM_P / T_S
  T_COMM_P = P * T_COMM_1
  1/S_P = 1/P + P * T_COMM_1 / T_S = 1/P + P / r_1,   where r_1 = T_S / T_COMM_1
  d(1/S_P)/dP = 0  =>  P_opt = r_1^(1/2)  and  S_opt = 0.5 * r_1^(1/2)

■ For a hierarchy of masters
  – T_COMM_P = (1 + log P) * T_COMM_1
  – P_opt = r_1  and  S_opt = r_1 / (1 + log r_1)
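A quick numeric check of the flat (non-hierarchical) case; the value of r_1 is purely illustrative:

  \[ r_1 = \frac{T_S}{T_{COMM_1}} = 100 \;\Rightarrow\; P_{opt} = \sqrt{100} = 10, \qquad \frac{1}{S_{10}} = \frac{1}{10} + \frac{10}{100} = 0.2 \;\Rightarrow\; S_{opt} = 5 = 0.5\sqrt{r_1}. \]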

33

MPI

■ Goals:
  – standardize previous message passing systems:
    • PVM, P4, NX
  – support copy-free message passing
  – portable to many platforms
■ Features:
  – point-to-point messaging
  – group communications
  – profiling interface: every function has a name-shifted version
■ Buffering
  – no guarantee that there are buffers
  – possible that a send will block until the receive is called
■ Delivery Order
  – two sends from the same process to the same destination will arrive in order
  – no guarantee of fairness between processes on receive

34

MPI Communicators

■ Provide a named set of processes for communication
■ All processes within a communicator can be named
  – numbered from 0 … n-1
■ Allows libraries to be constructed
  – the application creates communicators
  – the library uses them
  – prevents problems with posting wildcard receives
    • adds a communicator scope to each receive
■ All programs start with MPI_COMM_WORLD
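One common way for a library to get its own communicator scope is to duplicate the application's communicator; a minimal sketch (my illustration, not from the slides):

  /* Duplicate MPI_COMM_WORLD so the library's receives (including
     wildcard receives) can never match application messages. */
  MPI_Comm libcomm;
  MPI_Comm_dup(MPI_COMM_WORLD, &libcomm);
  /* ... the library communicates only on libcomm ... */
  MPI_Comm_free(&libcomm);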

35

Non-Blocking Functions

■ Two parts
  – post the operation
  – wait for the result
■ Also includes a poll option
  – checks whether the operation has finished
■ Semantics
  – must not alter the buffer while the operation is pending
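A minimal sketch of the post/wait pattern; the buffers, COUNT, TAG, and the left/right ranks are placeholders of mine:

  MPI_Request reqs[2];
  MPI_Irecv(inbuf,  COUNT, MPI_INT, left,  TAG, MPI_COMM_WORLD, &reqs[0]);  /* post receive */
  MPI_Isend(outbuf, COUNT, MPI_INT, right, TAG, MPI_COMM_WORLD, &reqs[1]);  /* post send    */
  /* ... do work that touches neither inbuf nor outbuf ... */
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* wait; MPI_Test could poll instead */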

36

MPI Misc.

■ MPI Types
  – all messages are typed
    • base types are pre-defined:
      – int, double, real, {,unsigned}{short, char, long}
    • can construct user-defined types
      – includes non-contiguous data types
■ Processor Topologies
  – allows construction of Cartesian & arbitrary graphs
  – may allow some systems to run faster
■ What's not in MPI-1
  – process creation
  – I/O
  – one-sided communication

37

MPI Housekeeping Calls

■ Include <mpi.h> in your program
■ If using mpich, …
■ First call MPI_Init(&argc, &argv)
■ MPI_Comm_rank(MPI_COMM_WORLD, &myrank)
  – myrank is set to the id of this process
■ MPI_Wtime
  – returns the wall-clock time
■ At the end, call MPI_Finalize()

38

MPI Communication Calls

■ Parameters
  – var – a variable
  – num – number of elements in the variable to use
  – type – {MPI_INT, MPI_REAL, MPI_BYTE}
  – root – rank of the processor at the root of a collective operation
  – dest – rank of the destination processor (for MPI_Recv, the source rank)
  – status – variable of type MPI_Status
■ Calls (all return a code – check for MPI_SUCCESS)
  – MPI_Send(var, num, type, dest, tag, MPI_COMM_WORLD)
  – MPI_Recv(var, num, type, dest, MPI_ANY_TAG, MPI_COMM_WORLD, &status)
  – MPI_Bcast(var, num, type, root, MPI_COMM_WORLD)
  – MPI_Barrier(MPI_COMM_WORLD)
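A minimal, self-contained sketch (mine, not from the slides) that strings the housekeeping and communication calls above together; it assumes the job is started with at least two ranks:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank, value;
      MPI_Status status;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          value = 42;
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* send to rank 1 */
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
          printf("rank 1 received %d\n", value);
      }
      MPI_Barrier(MPI_COMM_WORLD);   /* everyone synchronizes before exiting */
      MPI_Finalize();
      return 0;
  }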

39

PVM

■ Provides a simple, free, portable parallel environment
■ Runs on everything
  – parallel hardware: SMPs, MPPs, vector machines
  – networks of workstations: ATM, Ethernet
    • UNIX machines and PCs running Win*
  – works on a heterogeneous collection of machines
    • handles type conversion as needed
■ Provides two things
  – a message passing library
    • point-to-point messages
    • synchronization: barriers, reductions
  – OS support
    • process creation (pvm_spawn)

40

PVM Environment (UNIX)

[Diagram: one PVMD daemon per machine (Sun SPARCs, IBM RS/6000, Cray Y-MP, DECmpp 12000) connected by a bus network; the application processes on each machine communicate through their local PVMD.]

■ One PVMD per machine
  – all processes communicate through the pvmd (by default)
■ Any number of application processes per node

41

PVM Message Passing

■ All messages have tags
  – an integer to identify the message
  – defined by the user
■ Messages are constructed, then sent
  – pvm_pk{int,char,float}(*var, count, stride)
  – pvm_upk{int,char,float} to unpack
■ All processes are named by task ids (tids)
  – local and remote processes are treated the same
■ Primary message passing functions
  – pvm_send(tid, tag)
  – pvm_recv(tid, tag)

42

PVM Process Control

■ Creating a process
  – pvm_spawn(task, argv, flag, where, ntask, tids)
  – flag and where provide control over where tasks are started
  – ntask controls how many copies are started
  – the program must be installed on the target machine
■ Ending a task
  – pvm_exit()
  – does not exit the process, it just leaves the PVM virtual machine
■ Info functions
  – pvm_mytid() - get the process task id

43

PVM Group Operations

■ Group is the unit of communication
  – a collection of one or more processes
  – processes join a group with pvm_joingroup("<group name>")
  – each process in the group has a unique id
    • pvm_gettid("<group name>")
■ Barrier
  – can involve a subset of the processes in the group
  – pvm_barrier("<group name>", count)
■ Reduction operations
  – pvm_reduce(void (*func)(), void *data, int count, int datatype,
               int msgtag, char *group, int rootinst)
    • result is returned to the rootinst node
    • does not block
  – pre-defined funcs: PvmMin, PvmMax, PvmSum, PvmProduct

44

PVM Performance Issues

■ Messages have to go through the PVMD
  – can use the direct-route option to avoid this problem
■ Packing messages
  – semantics imply a copy
  – extra function call to pack messages
■ Heterogeneous support
  – information is sent in a machine-independent format
  – has a short-circuit option for known-homogeneous communication
    • data is then passed in native format

45

Sample PVM Program

#include <stdio.h>
#include <stdlib.h>
#include "pvm3.h"

/* The slide did not show these constants; the values below are assumptions. */
#define MESSAGESIZE 1024
#define ITERATIONS  100
#define MYNAME      "pingpong"   /* executable name as installed for pvm_spawn */

int main(int argc, char **argv) {
  int myGroupNum;
  int friendTid;
  int mytid;
  int tids[2];
  int message[MESSAGESIZE];
  int i, okSpawn;
  int msgid = 1;                 /* user-chosen message tag */

  /* Initialize process and spawn the partner if necessary */
  myGroupNum = pvm_joingroup("ping-pong");
  mytid = pvm_mytid();
  if (myGroupNum == 0) {         /* I am the first process */
    pvm_catchout(stdout);
    okSpawn = pvm_spawn(MYNAME, argv, 0, "", 1, &friendTid);
    if (okSpawn != 1) {
      printf("Can't spawn a copy of myself!\n");
      pvm_exit();
      exit(1);
    }
    tids[0] = mytid;
    tids[1] = friendTid;
  } else {                       /* I am the second process */
    friendTid = pvm_parent();
    tids[0] = friendTid;
    tids[1] = mytid;
  }
  pvm_barrier("ping-pong", 2);

  /* Main loop body: bounce the message back and forth with the partner */
  if (myGroupNum == 0) {
    for (i = 0; i < MESSAGESIZE; i++) {   /* initialize the message */
      message[i] = '1';
    }
    for (i = 0; i < ITERATIONS; i++) {
      pvm_initsend(PvmDataDefault);
      pvm_pkint(message, MESSAGESIZE, 1);
      pvm_send(friendTid, msgid);
      pvm_recv(friendTid, msgid);
      pvm_upkint(message, MESSAGESIZE, 1);
    }
  } else {
    for (i = 0; i < ITERATIONS; i++) {    /* echo every iteration back */
      pvm_recv(friendTid, msgid);
      pvm_upkint(message, MESSAGESIZE, 1);
      pvm_initsend(PvmDataDefault);
      pvm_pkint(message, MESSAGESIZE, 1);
      pvm_send(friendTid, msgid);
    }
  }
  pvm_exit();
  exit(0);
}

46

Defect Patterns in High Performance Computing

Based on Materials Developed by

Taiga Nakamura

47

What is This Lecture?

■ Debugging and testing parallel code is hard
  – What kinds of software defects (bugs) are common?
  – How can they be prevented or found/fixed effectively?
■ Hypothesis: knowing common defects (bugs) will reduce the time spent debugging
  – … during programming assignments, course projects
■ Here: common defect types in parallel programming
  – "Defect patterns" in HPC
  – based on the empirical data we collected in past studies
  – examples are in C/MPI (suspect similar defect types in Fortran/MPI, OpenMP, UPC, CAF, …)

48

Example Problem

■ Consider the following problem:
  1. N cells, each of which holds an integer in [0..9]
     • e.g., cell[0]=2, cell[1]=1, …, cell[N-1]=3
  2. In each step, every cell is updated using the values of its neighboring cells
     • cellnext[x] = (cell[x-1] + cell[x+1]) mod 10
     • cellnext[0] = (3+1), cellnext[1] = (2+6), …
     • (assume the last cell is adjacent to the first cell)
  3. Repeat step 2 for steps iterations

  A sequence of N cells:  2 1 6 8 7 1 0 2 4 5 1 … 3

What defects can appear when implementing a parallel solution in MPI?

49

First, Sequential Solution

■ Approach to implementation
  – use an integer array buffer[] to represent the cell values
  – use a second array nextbuffer[] to store the values for the next step, and swap the buffers
  – straightforward implementation!

50

Sequential C Code

/* Initialize cells */
int x, n, *tmp;
int *buffer = (int*)malloc(N * sizeof(int));
int *nextbuffer = (int*)malloc(N * sizeof(int));
FILE *fp = fopen("input.dat", "r");
if (fp == NULL) { exit(-1); }
for (x = 0; x < N; x++) { fscanf(fp, "%d", &buffer[x]); }
fclose(fp);

/* Main loop */
for (n = 0; n < steps; n++) {
  for (x = 0; x < N; x++) {
    nextbuffer[x] = (buffer[(x-1+N)%N] + buffer[(x+1)%N]) % 10;
  }
  tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;
}

/* Final output */
...

free(nextbuffer); free(buffer);

51

Approach to a Parallel Version

■ Each process keeps (1/size) of the cells
  – size: the number of processes

  [Diagram: the N cells (2 1 6 8 7 1 0 2 4 5 1 … 3) are split into contiguous chunks owned by Process 0, Process 1, Process 2, …, Process (size-1).]

• Each process needs to:
  • update the locally-stored cells
  • exchange boundary cell values between neighboring processes (nearest-neighbor communication)

52

Recurring HPC Defects

■ Now we will simulate the process of writing parallel code and discuss what kinds of defects can appear.
■ Defect types are shown as:
  – pattern descriptions
  – concrete examples in an MPI implementation

53

Pattern: Erroneous Use of Language Features
• Simple mistakes in understanding that are common for novices
  • e.g., inconsistent parameter types between send and recv
  • e.g., forgotten mandatory function calls
  • e.g., inappropriate choice of functions

Symptoms:
• Compile-time error (easy to fix)
• Some defects may surface only under specific conditions
  • (number of processors, value of input, hardware/software environment, …)

Causes:
• Lack of experience with the syntax and semantics of new language features

Cures & preventions:
• Check unfamiliar language features carefully

54

/* Initialize MPI */

MPI_Status status;

status = MPI_Init(NULL, NULL);

if (status != MPI_SUCCESS) { exit(-1); }

/* Initialize cells */

fp = fopen("input.dat", "r");

if (fp == NULL) { exit(-1); }

for (x = 0; x < N; x++) { fscanf(fp, "%d", &buffer[x]); }

fclose(fp);

/* Main loop */

...

/* Final output */

...

/* Finalize MPI */

MPI_Finalize();

Adding basic MPI functions

What are the bugs?

55

/* Initialize MPI */

MPI_Status status;

status = MPI_Init(NULL, NULL);

if (status != MPI_SUCCESS) { exit(-1); }

/* Initialize cells */

fp = fopen("input.dat", "r");

if (fp == NULL) { exit(-1); }

for (x = 0; x < N; x++) { fscanf(fp, "%d", &buffer[x]); }

fclose(fp);

/* Main loop */

...

What are the defects?

MPI_Init(&argc, &argv);

MPI_Finalize();

• Passing NULL to MPI_Init is invalid in MPI-1 (ok in MPI-2)

• MPI_Finalize must be called by all processors in every execution path

56

Does MPI Have Too Many Functions To Remember?

■ Yes (100+ functions), but…
■ Advanced features are not necessarily needed
■ Try to understand a few basic language features thoroughly

MPI keywords in Conjugate Gradient in C/C++ (15 students)

[Bar chart (log scale, 1-1000): number of uses of each MPI keyword across the students' codes. Keywords that appear: MPI_Address, MPI_Aint, MPI_Allgatherv, MPI_Allreduce, MPI_Alltoall, MPI_Alltoallv, MPI_Barrier, MPI_Bcast, MPI_Comm_rank, MPI_Comm_size, MPI_Datatype, MPI_Finalize, MPI_Init, MPI_Irecv, MPI_Isend, MPI_Recv, MPI_Reduce, MPI_Request, MPI_Send, MPI_Sendrecv, MPI_Status, MPI_Type_commit, MPI_Type_struct, MPI_Waitall, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_CHAR, MPI_COMM_WORLD, MPI_DOUBLE, MPI_INT, MPI_LONG, MPI_SUM.]

24 functions, 8 constants

57

Pattern: Space Decomposition
• Incorrect mapping between the problem space and the program memory space

Symptoms:
• Segmentation fault (if an array index is out of range)
• Incorrect or slightly incorrect output

Causes:
• The mapping in the parallel version can be different from that in the serial version
  • e.g., the array origin is different on every processor
  • e.g., additional memory space for communication can complicate the mapping logic

Cures & preventions:
• Validate the memory allocation carefully when parallelizing the code

58

MPI_Comm_size(MPI_COMM_WORLD, &size);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

nlocal = N / size;

buffer = (int*)malloc((nlocal+2) * sizeof(int));

nextbuffer = (int*)malloc((nlocal+2) * sizeof(int));

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 0; x < nlocal; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

...

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Decompose the problem space

buffer[]

0 (nlocal+1)

What are the bugs?

59

MPI_Comm_size(MPI_COMM_WORLD, &size);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

nlocal = N / size;

buffer = (int*)malloc((nlocal+2) * sizeof(int));

nextbuffer = (int*)malloc((nlocal+2) * sizeof(int));

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {
  nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

...

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

What are the defects?

N may not be divisible by size

• N may not be divisible by size
• Off-by-one error in the inner loop
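One hedged way to handle the first defect is to spread the remainder over the low-numbered ranks; a sketch using the same variable names as the example code:

  int base   = N / size;
  int rem    = N % size;
  int nlocal = base + (rank < rem ? 1 : 0);              /* my number of cells */
  int first  = rank * base + (rank < rem ? rank : rem);  /* global index of my first cell */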

60

Pattern: Side-Effect of Parallelization
• Ordinary serial constructs can cause defects when they are accessed in parallel contexts

Symptoms:
• Various correctness/performance problems

Causes:
• The "sequential part" tends to be overlooked
  • Typical parallel programs contain only a few parallel primitives, and the rest of the code is a sequential program running in parallel

Cures & preventions:
• Don't just focus on the parallel code
• Check that the serial code is working on one processor, but remember that the defect may surface only in a parallel context

61

/* Initialize cells with input file */

fp = fopen("input.dat", "r");

if (fp == NULL) { exit(-1); }

nskip = ...

for (x = 0; x < nskip; x++) { fscanf(fp, "%d", &dummy);}

for (x = 0; x < nlocal; x++) { fscanf(fp, "%d", &buffer[x+1]);}

fclose(fp);

/* Main loop */

...

Data I/O

• What are the defects?

62

/* Initialize cells with input file */

if (rank == 0) {

fp = fopen("input.dat", "r");

if (fp == NULL) { exit(-1); }

for (x = 0; x < nlocal; x++) { fscanf(fp, "%d", &buffer[x+1]);}

for (p = 1; p < size; p++) {

/* Read initial data for process p and send it */

}

fclose(fp);

}

else {

/* Receive initial data*/

}

Data I/O

• Filesystem may cause performance bottleneck if all processors access the same file simultaneously

• (Schedule I/O carefully, or let “master” processor do all I/O)

63

/* What if we initialize cells with random values... */

srand(time(NULL));

for (x = 0; x < nlocal; x++) {

buffer[x+1] = rand() % 10;

}

/* Main loop */

...

Generating Initial Data

• What are the defects?

• (Other than the fact that rand() is not a good pseudo-random number generator in the first place…)

64

/* What if we initialize cells with random values... */

srand(time(NULL));

for (x = 0; x < nlocal; x++) {

buffer[x+1] = rand() % 10;

}

/* Main loop */

...

What are the Defects?

• All procs might use the same pseudo-random sequence, spoiling independence

• Hidden serialization in rand() causes performance bottleneck

srand(time(NULL) + rank);

65

Pattern: Synchronization
• Improper coordination between processes
  • Well-known defect type in parallel programming
  • Deadlocks, race conditions

Symptoms:
• Program hangs
• Incorrect/non-deterministic output

Causes:
• Some defects can be very subtle
• Use of asynchronous (non-blocking) communication can lead to more synchronization defects

Cures & preventions:
• Make sure that all communications are correctly coordinated

66

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Communication

• What are the defects?

67

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

What are the Defects?

• Obvious example of deadlock (can’t avoid noticing this)

0 (nlocal+1)

68

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Ssend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Ssend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Another Example

• What are the defects?

69

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Ssend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Ssend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

What are the Defects?

• This causes deadlock too
• MPI_Ssend is a synchronous send (see the next slides)

70

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Send (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Yet Another Example

• What are the defects?

71

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Send (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Send (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Potential deadlock

• This may work (many novice programmers write this code)

• but it can cause deadlock with some implementations or parameters

72

Modes of MPI blocking communication

■ http://www.mpi-forum.org/docs/mpi-11-html/node40.html
  – Standard (MPI_Send): may either return immediately once the outgoing message is buffered in the MPI buffers, or block until a matching receive has been posted.
  – Buffered (MPI_Bsend): the send completes when MPI buffers the outgoing message. An error is returned when there is insufficient buffer space.
  – Synchronous (MPI_Ssend): the send completes only when the matching receive operation has started to receive the message.
  – Ready (MPI_Rsend): the send can be started only after the matching receive has been posted.
■ In our code MPI_Send probably won't block in most implementations (each message is just one integer), but it should still be avoided.
■ A "correct" solution could be:
  – (1) alternate the order of send and recv,
  – (2) use MPI_Bsend with sufficient buffer size,
  – (3) use MPI_Sendrecv (sketched below), or
  – (4) use MPI_Isend/Irecv.

73

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Isend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &request1);

MPI_Irecv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &request2);

MPI_Ssend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &request3);

MPI_Irecv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &request4);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

Non-Blocking Communication

• What are the defects?

74

/* Main loop */

for (n = 0; n < steps; n++) {

for (x = 1; x < nlocal+1; x++) {

nextbuffer[x] = (buffer[(x-1+N)%N]+buffer[(x+1)%N]) % 10;

}

/* Exchange boundary cells with neighbors */

MPI_Isend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &request1);

MPI_Irecv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &request2);

MPI_Ssend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &request3);

MPI_Irecv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &request4);

tmp = buffer; buffer = nextbuffer; nextbuffer = tmp;

}

What are the Defects?

• Synchronization (e.g. MPI_Wait, MPI_Barrier) is needed at each iteration (but too many barriers can cause a performance problem)

75

Pattern: Performance Defect
• Scalability problem because processors are not working in parallel
  • The program output itself is correct
  • Perfect parallelization is often difficult: need to evaluate whether the execution speed is acceptable

Symptoms:
• Sub-linear scalability
• Performance much less than expected (e.g., most time spent waiting)

Causes:
• Unbalanced amount of computation
• Load balance may depend on the input data

Cures & preventions:
• Make sure all processors are "working" in parallel
• A profiling tool might help

if (rank != 0) {

MPI_Ssend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

}

if (rank != size-1) {

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Ssend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

}

Scheduling communication

• Complicated communication pattern - does not cause deadlock

What are the defects?

77

if (rank != 0) {

MPI_Ssend (&nextbuffer[nlocal],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD);

MPI_Recv (&nextbuffer[0], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD, &status);

}

if (rank != size-1) {

MPI_Recv (&nextbuffer[nlocal+1],1,MPI_INT, (rank+1)%size,

tag, MPI_COMM_WORLD, &status);

MPI_Ssend (&nextbuffer[1], 1, MPI_INT, (rank+size-1)%size,

tag, MPI_COMM_WORLD);

}

What are the bugs?

1 Send → 0 Recv → 0 Send → 1 Recv
2 Send → 1 Recv → 1 Send → 2 Recv
3 Send → 2 Recv → 2 Send → 3 Recv

0 (nlocal+1)

• Communication requires O(size) time (a “correct” solution takes O(1))

78

Summary

■ This is an attempt to share knowledge about common defects in parallel programming
  – Erroneous use of language features
  – Space decomposition
  – Side-effect of parallelization
  – Synchronization
  – Performance defects
■ Try to avoid these defect patterns in your code

79

CMSC 714

Lecture 4

OpenMP and UPC

Chau-Wen Tseng (from A. Sussman)

80

Programming Model Overview

■ Message passing (MPI, PVM)
  – separate address spaces
  – explicit messages to access shared data
    • send / receive (MPI 1.0), put / get (MPI 2.0)
■ Multithreading (Java threads, pthreads)
  – shared address space
    • only local variables on the thread stack are private
  – explicit thread creation, synchronization
■ Shared-memory programming (OpenMP, UPC)
  – mixed shared / separate address spaces
  – implicit threads & synchronization

81

Shared Memory Programming Model

■ Attempts to ease the task of parallel programming
  – hide details
    • thread creation, messages, synchronization
  – compiler generates parallel code
    • based on user annotations
■ Possibly lower performance
  – less control over
    • synchronization
    • locality
    • message granularity
■ May inadvertently introduce data races
  – read & write the same shared memory location in a parallel loop

82

OpenMP

■ Supports parallelism for SMPs, multi-core
  – provides a simple, portable model
  – allows both shared and private data
  – provides parallel for/do loops
■ Includes
  – automatic support for fork/join parallelism
  – reduction variables
  – atomic statement
    • only one thread executes it at a time
  – single statement
    • only one thread runs this code (the first thread to reach it)

83

OpenMP

■ Characteristics
  – both local & shared memory (depending on directives)
  – parallelism directives for parallel loops & functions
  – compilers convert into multi-threaded programs (i.e., pthreads)
  – not supported on clusters
■ Example

  #pragma omp parallel for private(i)
  for (i=0; i<NUPDATE; i++) {
    int ran = random();
    table[ ran & (TABSIZE-1) ] ^= stable[ ran >> (64-LSTSIZE) ];
  }

  "parallel for" indicates that loop iterations may be executed in parallel

84

More on OpenMP

■ Characteristics
  – not a full parallel language, but a language extension
  – a set of standard compiler directives and library routines
  – used to create parallel Fortran, C and C++ programs
  – usually used to parallelize loops
  – standardizes the last 15 years of SMP practice
■ Implementation
  – compiler directives using #pragma omp <directive>
  – parallelism can be specified for regions & loops
  – data can be
    • private – each processor has a local copy
    • shared – a single copy for all processors

85

OpenMP – Programming Model

■ Fork-join parallelism (a restricted form of MIMD)
  – normally a single thread of control (the master)
  – worker threads are spawned when a parallel region is encountered
  – barrier synchronization is required at the end of a parallel region

[Diagram: a master thread forks into a team of threads at each parallel region and joins back to a single thread afterwards.]

86

OpenMP – Example Parallel Region

■ Task-level parallelism – #pragma omp parallel { … }

  Source code:
    double a[1000];
    omp_set_num_threads(4);
    #pragma omp parallel
    {
      int id = omp_get_thread_num();
      foo(id, a);
    }
    printf("all done \n");

  What the OpenMP compiler generates (conceptually):
    double a[1000];
    omp_set_num_threads(4);
    foo(0,a); foo(1,a); foo(2,a); foo(3,a);   /* run concurrently, one call per thread */
    printf("all done \n");

87

OpenMP – Example Parallel Loop

#pragma omp parallel

{

int id, i, nthreads,start, end;

id = omp_get_thread_num();

nthreads = omp_get_num_threads();

start = id * N / nthreads ; // assigning

end = (id+1) * N / nthreads ; // work

for (i=start; i<end; i++) {

foo(i);

}

}

#pragma omp parallel for

for (i=0;i<N;i++) {

foo(i);

}

■ Loop-level parallelism – #pragma omp parallel for
  – loop iterations are assigned to threads and invoked as functions
  – (OpenMP compiler; loop iterations scheduled in blocks)

88

Iteration Scheduling

■ Parallel for loop
  – simply specifies that loop iterations may be executed in parallel
  – actual processor assignment is up to the compiler / run-time system
■ Scheduling goals
  – reduce load imbalance
  – reduce synchronization overhead
  – improve data locality
■ Scheduling approaches (see the sketch below)
  – block (chunks of contiguous iterations)
  – cyclic (round-robin)
  – dynamic (threads request additional iterations when done)
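A sketch of how the three approaches map onto OpenMP schedule clauses; foo, i, and N are placeholders, as in the earlier loop example:

  #pragma omp parallel for schedule(static)        /* block: contiguous chunks per thread */
  for (i = 0; i < N; i++) foo(i);

  #pragma omp parallel for schedule(static, 1)     /* cyclic: iterations dealt round-robin */
  for (i = 0; i < N; i++) foo(i);

  #pragma omp parallel for schedule(dynamic, 16)   /* dynamic: threads grab 16 iterations at a time */
  for (i = 0; i < N; i++) foo(i);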

89

Parallelism May Cause Data Races

■ Data race
  – multiple accesses to shared data in parallel
  – at least one access is a write
  – the result depends on the order of the shared accesses
■ May be introduced by a parallel loop
  – if a data dependence exists between loop iterations
  – the result depends on the order in which loop iterations are executed
  – example:

  #pragma omp parallel for
  for (i=1;i<N-1;i++) {
    a[i] = ( a[i-1] + a[i+1] ) / 2;
  }
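One hedged way to remove this race is to write into a second array, mirroring the buffer/nextbuffer idea used earlier in these slides. Note this is a Jacobi-style update (all reads come from the old values), which differs numerically from the in-place sequential loop; b[] is an array I introduce for illustration.

  #pragma omp parallel for
  for (i = 1; i < N-1; i++) {
      b[i] = ( a[i-1] + a[i+1] ) / 2;   /* reads only old values of a[] */
  }
  /* copy or swap b back into a outside the parallel loop */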

90

Sample Fortran77 OpenMP Code

program compute_pi

integer n, i

double precision w, x, sum, pi, f, a

c function to integrate

f(a) = 4.d0 / (1.d0 + a*a)

print *, “Enter # of intervals: “

read *,n

c calculate the interval size

w = 1.0d0/n

sum = 0.0d0

!$OMP PARALLEL DO PRIVATE(x), SHARED(w)

!$OMP& REDUCTION(+: sum)

do i = 1, n

x = w * (i - 0.5d0)

sum = sum + f(x)

enddo

pi = w * sum

print *, “computed pi = “, pi

stop

end

91

Reductions

■ Specialized computations where
  – partial results may be computed in parallel
  – partial results are combined into a final result
  – examples
    • addition, multiplication, minimum, maximum, count
■ OpenMP reduction variable
  – compiler inserts code to
    • compute the partial result locally
    • use synchronization / communication to combine the results
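The same idea in C, as a hedged sketch paralleling the Fortran pi example above; i, n, w, and pi are assumed to be declared elsewhere (with w = 1.0/n):

  double x, sum = 0.0;
  #pragma omp parallel for private(x) reduction(+ : sum)
  for (i = 0; i < n; i++) {
      x = w * (i + 0.5);               /* midpoint of interval i */
      sum += 4.0 / (1.0 + x * x);      /* each thread accumulates a private partial sum */
  }
  pi = w * sum;                        /* partial sums are combined automatically */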

92

UPC

■ Extension to C for parallel computing
■ Target environments
  – distributed memory machines
  – cache-coherent multi-processors
  – multi-core processors
■ Features
  – explicit control of data distribution
  – includes a parallel for statement
  – MPI-like run-time library support

93

UPC

■ Characteristics
  – local memory, plus shared arrays accessed through global pointers
  – parallelism: single program on multiple nodes (SPMD)
  – provides the illusion of shared one-dimensional arrays
  – features
    • data distribution declarations for arrays
    • one-sided communication routines (memput / memget)
    • compilers translate shared pointers & generate communication
    • can cast shared pointers to local pointers for efficiency
■ Example

  shared int *x, *y, z[100];
  upc_forall (i = 0; i < 100; i++; i) { z[i] = *x++ * *y++; }

94

More UPC

■ Shared pointer
  – key feature of UPC
    • enables support for distributed memory architectures
  – a local (private) pointer pointing into a shared array
  – consists of two parts
    • processor number
    • local address on that processor
  – read operations on a shared pointer
    • if for non-local data, the compiler translates them into memget()
  – write operations on a shared pointer
    • if for non-local data, the compiler translates them into memput()
  – cast into a local private pointer
    • accesses the local portion of the shared array without communication

95

UPC Execution Model

■ SPMD-based
  – one thread per processor
  – each thread starts with the same entry to main
■ Different consistency models possible
  – "strict" model is based on sequential consistency
    • results must match some sequential execution order
  – "relaxed" model is based on release consistency
    • writes are visible only after release synchronization
    • increased freedom to reorder operations
    • reduced need to communicate results
  – consistency models are tricky
    • avoid data races altogether

96

Forall Loop

■ Forms the basis of parallelism
■ Adds a fourth parameter to the for loop: "affinity"
  – where code is executed is based on the affinity expression
  – attempt to assign each loop iteration to the processor holding the shared data
    • to reduce communication
■ Lacks an explicit barrier before / after execution
  – differs from OpenMP
■ Supports nested forall loops

97

Split-phase Barriers

■ Traditional barriers
  – once a thread enters the barrier, it busy-waits until all threads arrive
■ Split-phase
  – announce the intention to enter the barrier (upc_notify)
  – perform some local operations
  – wait for the other threads (upc_wait)
■ Advantage
  – allows work to be done while waiting for the other threads to arrive
■ Disadvantage
  – must find work to do
  – takes time to communicate both notify and wait
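A minimal sketch of the split-phase pattern in UPC; the local-work function name is a placeholder of mine:

  upc_notify;                                 /* announce that this thread has arrived */
  do_local_work_needing_no_remote_data();     /* overlap useful local work             */
  upc_wait;                                   /* complete the barrier: wait for the rest */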

98

Computing Environment

■ Cost-effective high performance computing
  – dedicated servers are expensive
  – non-dedicated machines are useful
    • high processing power (~1 GHz), fast networks (100 Mbps+)
    • long idle time (~50%), low resource usage

[Diagram: a user who "needs cycles to run my simulations" can draw on machines in offices, a computer lab, a supercomputer, a clustered server, and workstations/PCs, all connected by a network.]

99

OS Support For Parallel Computing

■ Many applications need raw compute power
  – computer H/W and S/W simulations
  – scientific/engineering computation
  – data mining, optimization problems
■ Goal
  – exploit computation cycles on idle workstations
■ Projects
  – Condor
  – Linger-Longer

100

Issues

■ Scheduling
  – what jobs to run on which machines?
  – when to start / stop using idle machines?
■ Transparency
  – can applications execute as if on their home machine?
■ Checkpoints
  – can work be saved if a job is interrupted?

101

What Is Condor?

■ Condor
  – exploits computation cycles in collections of
    • workstations
    • dedicated clusters
  – manages both
    • resources (machines)
    • resource requests (jobs)
  – has several mechanisms
    • ClassAd matchmaking
    • process checkpoint / restart / migration
    • remote system calls
    • grid awareness
  – scalable to thousands of jobs / machines

102

Condor – Dedicated Resources

■ Dedicated resources
  – compute clusters
■ Manages
  – node monitoring, scheduling
  – job launch, monitoring & cleanup

103

Condor – Non-dedicated Resources

■ Examples
  – desktop workstations in offices
  – workstations in student labs
■ Often idle
  – approximately 70% of the time!
■ Condor policy
  – use a workstation if it is idle
  – interrupt and move the job if user activity is detected

104

Mechanisms in Condor

■ Transparent process checkpoint / restart
■ Transparent process migration
■ Transparent redirection of I/O
  – Condor's remote system calls

105

CondorView Usage Graph

106

What is ClassAd Matchmaking?

■ Condor uses ClassAd matchmaking to make sure that work gets done within the constraints of both users and owners.
■ Users (jobs) have constraints:
  – "I need an Alpha with 256 MB RAM"
■ Owners (machines) have constraints:
  – "Only run jobs when I am away from my desk and never run jobs owned by Bob."
■ Semi-structured data – no fixed schema

107

Some Challenges

■ Condor does whatever it takes to run your jobs, even if some machines…
  – crash (or are disconnected)
  – run out of disk space
  – don't have your software installed
  – are frequently needed by others
  – are far away & managed by someone else

108

Condor’s Standard Universe

■ Condor can support various combinations of features/environments
  – in different "Universes"
■ Different Universes provide different functionality
  – Vanilla
    • run any serial job
  – Scheduler
    • plug in a meta-scheduler
  – Standard
    • support for transparent process checkpoint and restart

109

Process Checkpointing

■ Condor's process checkpointing mechanism saves all the state of a process into a checkpoint file
  – memory, CPU, I/O, etc.
■ The process can then be restarted
  – from right where it left off
■ Typically no changes to your job's source code are needed
  – however, your job must be relinked with Condor's Standard Universe support library

110

When Will Condor Checkpoint Your Job?

■ Periodically, if desired
  – for fault tolerance
■ To free the machine to do a higher-priority task (a higher-priority job, or a job from a user with higher priority)
  – preemptive-resume scheduling
■ When you explicitly run
  – condor_checkpoint
  – condor_vacate
  – condor_off
  – condor_restart

111

Condor Daemon Layout

[Diagram: a Personal Condor / Central Manager machine, where the master daemon spawns the collector, negotiator, schedd, and startd processes.]

112

Layout of the Condor Pool

[Diagram: the Central Manager (Frieda's) runs master, collector, negotiator, schedd, and startd; each desktop runs master, schedd, and startd; each cluster node runs master and startd. Arrows mark ClassAd communication pathways and spawned processes.]

113

Access to Data in Condor

■ Use a shared filesystem if available
■ No shared filesystem?
  – remote system calls (in the Standard Universe)
  – Condor file transfer service
    • can automatically send back changed files
    • atomic transfer of multiple files
  – remote I/O proxy socket

114

Standard Universe Remote System Calls

■ I/O system calls are trapped
  – sent back to the submit machine
■ Allows transparent migration across domains
  – checkpoint on machine A, restart on B
■ No source code changes required
■ Language independent
■ Opportunities
  – for application steering
    • Condor tells the customer process "how" to open files
  – for compression on the fly

115

Job Startup

[Diagram: on the submit machine, the schedd spawns a shadow process; on the execute machine, the startd spawns a starter, which runs the customer job linked against the Condor syscall library.]

116

Secure Remote I/O

[Diagram: at the submission site, the shadow and an I/O server access the home file system; at the execution site, the starter forks the job, whose I/O library talks to an I/O proxy over the Chirp protocol, while local I/O uses local system calls.]

117

[Diagram: a Condor-G job submission machine (Condor-G scheduler with a persistent job queue, GridManager, GASS server, Condor-G collector, and a Condor shadow process for job X, driven by end-user requests) transfers job X to a remote job execution site running Globus daemons, the local site scheduler, and Condor daemons; job X is linked with the Condor system call trapping & checkpoint library, and redirected system call data and resource information flow back to the submission machine.]

118

Exploiting Idle Cycles

in Networks of Workstations

Kyung Dong Ryu

© Copyright 2001, K.D. Ryu, All Rights Reserved.

Ph.D. Defense

119

High Performance Computing in NOW

■ Many systems support harvesting idle machines
  – traditional approach: coarse-grained cycle stealing
    • while the owner is away: send a guest job and run it
    • when the owner returns: stop, then
      – migrate the guest job: Condor, NOW system
      – suspend or kill the guest job: Butler, LSF, DQS systems
■ But…

120

Additional CPU Time and Memory is Available

■ When a user is active
  – CPU usage is < 10%, 75% of the time
  – 30 MB of memory is available, 70% of the time (total: 64 MB)
  – trace from UC Berkeley

[Plots: cumulative distributions of CPU usage (%) and available memory (MB) for all, idle, and busy periods.]

121

Questions

■ Can we exploit fine-grained idle resources?
  – for sequential programs and parallel programs
  – improve throughput
■ How to reduce the effect on the user?
  – two-level CPU scheduling
  – memory limits
  – network and I/O throttling

122

Fine-Grain Idle Cycles

■ Coarse-grain idle cycles
  – (t1,t3): keyboard/mouse events
  – (t4,~): high CPU usage
  – recruitment threshold
■ Fine-grain idle cycles
  – all empty slots
  – whenever the resource (CPU) is not used

[Timeline: CPU usage vs. time with keyboard/mouse events at t1..t4, marking idle and non-idle periods and the recruitment threshold.]

123

Linger Longer: Fine-Grain Cycle Stealing

■ Goals:
  – harvest more of the available resources
  – limit the impact on local jobs
■ Technique: lower the guest job's resource priority
  – exploit fine-grained idle intervals even when the user is active
    • starvation-level low CPU priority
    • dynamically limited memory use
    • dynamically throttled I/O and network bandwidth
■ Adaptive migration
  – no need to move the guest job to avoid delaying local jobs
  – could move the guest job to improve guest performance

124

Adaptive Migration

■ Migrate when the migration benefit outweighs the migration cost
  – Non-idle_Period ≥ Linger_Time + Migration_Cost / Non-idle_Usage
  – Linger_Time ∝ Migration_Cost / Non-idle_Usage
    • Migration_Cost = Suspend_Time(source) + Process_Size / Network_Bandwidth + Resume_Time(dest.)

[Diagram: timelines for a guest job and a local job on nodes A and B, comparing the "migration" and "no migration" cases; the migration cost, non-idle period, non-idle usage, and linger time are marked.]

125

Need a Suite of Mechanisms

■ Policy:
  – most unused resources should be made available
  – resources should be quickly revoked when local jobs reclaim them
■ Dynamic bounding mechanisms for:
  1. CPU
  2. Memory
  3. I/O bandwidth
  4. Network bandwidth

Goal: maximize usage of idle resources
Constraint: limit the impact on local jobs

126

CPU bounding: Is Unix “nice” sufficient ?

■ CPU priority is not strict
  – run two empty-loop processes (guest: nice 19)

  OS                    Host    Guest
  Solaris (SunOS 5.5)    84%      15%
  Linux (2.0.32)         91%       8%
  OSF1                   99%       0%
  AIX (4.2)              60%      40%

■ Why?
  – anti-starvation policy

127

CPU Bounding: Starvation Level Priority

■ Original Linux CPU scheduler
  – one level: process priority
  – run-time scheduling priority
    • nice value & remaining time quanta
    • T_i = 20 - nice_level + (1/2) * T_{i-1}
  – a low-priority process can preempt a high-priority process
■ Extended Linux CPU scheduler
  – two levels: 1) process class, 2) priority
  – if runnable host processes exist
    • schedule a host process as in the unmodified scheduler
  – only when no host process is runnable
    • schedule a guest process

128

Memory Bounding: Page Limits

■ Extended page replacement algorithm
■ Adaptive page-out speed
  – when a host job steals a guest's page, page out multiple pages
    • faster than the default
  – no limit on taking free pages
  – High Limit:
    • maximum pages the guest can hold
  – Low Limit:
    • minimum pages guaranteed to the guest

[Diagram: main memory pages divided by the High and Low limits; above the High Limit the host job has priority, below the Low Limit the guest job has priority, and in between replacement is based only on LRU.]

129

Experiment: Memory Bounding

■ Prioritized memory page replacement
  – total available memory: 180 MB
  – guest memory thresholds: High Limit (70 MB), Low Limit (50 MB)

[Plot: host job and guest job memory usage (MB) vs. time (sec), with the High and Low limits marked.]

130

Experiment: Nice vs. CPU & Memory Bounding

■ Large memory footprint job
  – each job takes 82 sec to run in isolation

  Policy and Setup             Host time (secs)   Guest time (secs)   Host delay
  Host starts, then guest:
    Guest niced 19                    89                 176              8.0%
    Linger priority                   83                 165              0.8%
  Guest starts, then host:
    Guest niced 19                > 5 hours           > 5 hours         > 2,000%
    Linger priority                   99                 255              8.1%

  – Host-then-guest:
    • reduces host job delay from 8% to 0.8%
  – Guest-then-host:
    • nice causes memory thrashing
    • CPU & memory bounding serializes the execution

131

I/O and Network Throttling

Problem 1: guest I/O & communication can slow down local jobs
Problem 2: migration/checkpointing bothers local users

■ Policy: limit guest I/O and communication bandwidth
  – only when host I/O or communication is active
■ Mechanism: Rate Windows
  – keep track of the I/O rates of host and guest
  – throttle the guest I/O rate when host I/O is active
■ Implementation: a loadable kernel module
  – highly portable and deployable
  – light-weight: I/O call interception

132

I/O Throttling Mechanism: Rate Windows

■ Regulate the I/O rate

[Flow chart: an I/O or communication request from the application passes through the library into the kernel's Rate Windows module. If the process is regulated and its average rate exceeds the target rate, the module computes a delay, possibly splits the request, and sleeps: delay < dmin is ignored; delay > dmax splits the request and sleeps dmax; otherwise it sleeps for the computed delay. The average rate is tracked over a sliding window of the last N requests (e.g., 4 kB/100 ms, 60 kB/500 ms, 12 kB/75 ms, 16 kB/80 ms).]

133

Experiment: I/O Throttling

■ tar programs as host and guest jobs
  – guest I/O limit: 500 kB/sec (~10%)
  – throttling thresholds: Lo: 500 kB/sec, Hi: 1000 kB/s

[Plots: host and guest I/O rates (kB/s) vs. time (sec), without and with I/O throttling.]

■ Without I/O throttling
  – host tar takes 72 seconds
■ With I/O throttling
  – host tar takes 42 seconds

134

Dilation Factor in I/O Throttling

■ File I/O rate ≠ disk I/O rate
  – buffer cache, prefetching
■ Control disk I/O by throttling file I/O
  – adjust the delay using
    • dilation factor = avg. disk I/O rate / avg. file I/O rate
  – compile test (I/O limit: 500 kB/s)

[Plots: file I/O and disk I/O rates (kB/s) vs. time (sec) for (a) no limit, (b) a file I/O limit, and (c) a disk I/O limit.]

135

Experiment: Network Throttling

■ Guest job migration vs. httpd as the host job
  – guest job migration disrupts host job communication
  – throttle migration when host job communication is active
  – guest job communication limit: 500 kB/s
■ Without communication throttling
  – the web server loses bandwidth to the migration
■ With communication throttling
  – the web server takes its full bandwidth immediately

[Plots: communication bandwidth (kB/s) vs. time (sec) for the guest migration and the web server, without and with throttling.]

136

Guest Job Performance

– Overall, LL improves throughput 50%~70% over IE
– Less improvement for larger jobs (lu.B)
  • only 36% improvement for lu.B.30m
  • less memory is available while non-idle
– LF is slightly better than LL
– Less variation for LL
  • lu.B.30m: 23.6% for LL, 47.5% for LF

[Bar chart: throughput (base = 8) for the LL, LF, PM, and IE policies on mg.W.1m, mg.W.30m, sp.A.10m, lu.B.1.5m, and lu.B.30m.]

137

Host Job Slowdown

– LL/LF delay the host less for small and medium size jobs
  • 0.8%~1.1% for LL/LF, 1.4%~2.3% for PM/IE
  • due to the non-prioritized migration operations of PM/IE
– More delay for large jobs
  • memory contention

[Bar chart: host job delay (%) for the LL, LF, PM, and IE policies on mg.W.1m, mg.W.30m, sp.A.10m, lu.B.1.5m, and lu.B.30m.]

138

Conclusions

■ Identified opportunities for fine-grain idle resources
■ Linger-Longer can exploit up to 60% more idle time
  – fine-grain cycle stealing
  – adaptive migration
■ Linger-Longer can improve parallel applications in NOWs
■ A suite of mechanisms insulates local jobs' performance
  – CPU scheduling: starvation-level priority
  – memory priority: lower and upper limits
  – I/O and network bandwidth throttling: Rate Windows
■ Linger-Longer really improves
  – guest job throughput by 50% to 70%
  – with only a 3% host job slowdown

139

Related Work

■ Idle cycle stealing systems
  – Condor [Litzkow88]
  – NOW project [Anderson95]
  – Butler [Dannenberg85], LSF [Green93], DQS [Zhou93]
■ Process migration in the OS
  – Sprite [Douglis 91], Mosix [Barak 95]
■ Idle memory stealing systems
  – Dodo [Acharya 99], GMS [Freely 95]
  – Cooperative caching [Dahlin 94][Sarkar 96]
■ Parallel programs on non-dedicated workstations
  – Reconfiguration [Acharya 97]
  – MIST/MPVM [Clark 95], Silk-NOW [Brumofe 97]
  – CARMI [Pruyne 95] (master-worker model)
■ Performance isolation
  – Eclipse [Bruno 98]
  – Resource containers [Banga 99]
