
Parallel Concepts

An Introduction

The Goal of Parallelization

• Reduction of elapsed time of a program
• Reduction in turnaround time of jobs
• Overhead:
   – total increase in CPU time
   – communication
   – synchronization
   – additional work in algorithm
   – non-parallel part of the program (one processor works, others spin idle)
• Overhead vs. elapsed time is better expressed as Speedup and Efficiency

[Figure: elapsed time from start to finish on 1 processor and on 4 processors. Second panel: elapsed time on 1, 2, 4 and 8 processors, split into CPU time and communication overhead, showing the reduction in elapsed time.]

Speedup and Efficiency

Both measure the parallelization properties of a program.

• Let T(p) be the elapsed time on p processors
• The Speedup S(p) and the Efficiency E(p) are defined as:

    S(p) = T(1)/T(p)
    E(p) = S(p)/p

• For ideal parallel speedup we get:

    T(p) = T(1)/p
    S(p) = T(1)/T(p) = p
    E(p) = S(p)/p = 1 or 100%

• Scalable programs remain efficient for a large number of processors

[Figure: Speedup vs. number of processors, with curves labelled ideal, super-linear, saturation and disaster. Second panel: Efficiency vs. number of processors, with the ideal value 1.]

Amdahl’s Law

This rule states the following for parallel programs:

• The non-parallel (serial) fraction s of the program includes the communication and synchronization overhead
• Thus, the maximum parallel Speedup S(p) for a program that has parallel fraction f:

    (1) 1 = s + f                        ! program has serial and parallel fractions
    (2) T(1) = T(parallel) + T(serial)
             = T(1) * (f + s)
             = T(1) * (f + (1-f))
    (3) T(p) = T(1) * (f/p + (1-f))
    (4) S(p) = T(1)/T(p)
             = 1/(f/p + 1-f)
             < 1/(1-f)                   ! bound approached as p -> inf.
    (5) S(p) < 1/(1-f)

The non-parallel fraction of the code (i.e. overhead) imposes the upper limit on the scalability of the code.

The Practicality of Parallel Computing

• In practice, making programs parallel is not as difficult as it may seem from Amdahl’s law
• It is clear that a program has to spend a significant portion (most) of its run time in the parallel region

[Figure: Speedup (1.0 to 8.0) vs. percentage of parallel code (0% to 100%), with curves for P=2, P=4 and P=8. Annotations along the percentage axis: 1970s, 1980s, 1990s, and best hand-tuned codes in the ~99% range.]

David J. Kuck, High Performance Computing, Oxford Univ. Press, 1996

Fine-Grained vs. Coarse-Grained

• Fine-grain parallelism (typically loop level)
   – can be done incrementally, one loop at a time
   – does not require deep knowledge of the code
   – a lot of loops have to be parallel for decent speedup
   – potentially many synchronization points (at the end of each parallel loop)
• Coarse-grain parallelism
   – make larger loops parallel at a higher call-tree level, potentially enclosing many small loops
   – more code is parallel at once
   – fewer synchronization points, reducing overhead
   – requires deeper knowledge of the code

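[Figure: call tree with MAIN at the root, routines A to O on the levels below, and loops p, q, r, s, t at the bottom; regions of the tree are marked "Coarse-grained" and "Fine-grained".]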

Other Impediments to Scalability

Load imbalance:

• the time to complete a parallel execution of a code segment is determined by the longest running thread
• unequal work load distribution leads to some processors being idle, while others work too much
• with coarse grain parallelization, more opportunities for load imbalance exist

Too many synchronization points:

• the compiler will put synchronization points at the start and exit of each parallel region
• if too many small loops have been made parallel, synchronization overhead will compromise scalability

[Figure: elapsed time from start to finish for threads p0, p1, p2 and p3.]

Parallel Programming Models

Classification of Programming models:

• Control flow - number of explicit threads of execution
• Address space - access to global data from multiple threads
• Communication - data transfer part of language or library
• Synchronization - mechanism to regulate access to data
• Data allocation - control of the data distribution to execution threads

                 Implicit                                              Explicit
Feature          Originally sequential         Data parallel           Message passing    Shared variable
Examples         Auto-parallelising compilers  F90, HPF                MPI, PVM           OpenMP, pthreads
Control flow     Single                        Single                  Multiple           Multiple
Address space    Single                        Single                  Multiple           Single
Communication    Implicit                      Implicit                Explicit           Implicit
Synchronisation  Implicit                      Implicit                Implicit/explicit  Explicit
Data allocation  Implicit                      Implicit/semi-explicit  Explicit           Implicit/semi-explicit

Computing π with DPL

Notes:
– essentially sequential form
– automatic detection of parallelism
– automatic work sharing
– all variables shared by default
– number of processors specified outside of the code

compile with: f90 -apo -O3 -mips4 -mplist
– the -mplist switch will show the intermediate representation

$$\pi = \int_0^1 \frac{4}{1+x^2}\,dx \approx \sum_{0 \le i < N} \frac{4}{N\left(1+\left(\frac{i+0.5}{N}\right)^2\right)}$$

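    PROGRAM PIPROG
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 1000000
      INTEGER :: I
      REAL (KIND=8) :: PI, W = 1.0D0/N
      ! Midpoint rule: sample at x = (I+0.5)*W for I = 0 .. N-1
      PI = SUM( (/ (4.0D0*W/(1.0D0+((I+0.5D0)*W)**2), I=0,N-1) /) )
      PRINT *, PI
    END PROGRAM PIPROG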

Computing π with Shared Memory

Notes:
– essentially sequential form
– automatic work sharing
– all variables shared by default
– directives to request parallel work distribution
– number of processors specified outside of the code

$$\pi = \int_0^1 \frac{4}{1+x^2}\,dx \approx \sum_{0 \le i < N} \frac{4}{N\left(1+\left(\frac{i+0.5}{N}\right)^2\right)}$$

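    #include <stdio.h>
    #define n 1000000

    int main(void) {
        double l, ls = 0.0, w = 1.0/n;
        int i;

        /* Each thread accumulates a private ls; reduction(+:ls) sums them at the end */
        #pragma omp parallel private(i,l) reduction(+:ls)
        {
            #pragma omp for
            for (i = 0; i < n; i++) {
                l = (i + 0.5) * w;          /* midpoint of interval i */
                ls += 4.0 / (1.0 + l*l);
            }
        }
        /* Print after the region: inside it, ls is only a thread's partial sum */
        printf("pi is %f\n", ls * w);
        return 0;
    }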

Computing π with Message Passing

Notes:
– thread identification first
– explicit work sharing
– all variables are private
– explicit data exchange (reduce)
– all code is parallel
– number of processors is specified outside of code

$$\pi = \int_0^1 \frac{4}{1+x^2}\,dx \approx \sum_{0 \le i < N} \frac{4}{N\left(1+\left(\frac{i+0.5}{N}\right)^2\right)}$$

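    #include <stdio.h>
    #include <mpi.h>
    #define N 1000000

    int main(int argc, char **argv) {
        double pi, l, ls = 0.0, w = 1.0/N;
        int i, mid, nth;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &mid);    /* rank of this process */
        MPI_Comm_size(MPI_COMM_WORLD, &nth);    /* total number of processes */

        /* Cyclic work sharing: rank mid handles i = mid, mid+nth, ... */
        for (i = mid; i < N; i += nth) {
            l = (i + 0.5) * w;
            ls += 4.0 / (1.0 + l*l);
        }
        /* Sum the partial results onto rank 0 */
        MPI_Reduce(&ls, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (mid == 0) printf("pi is %f\n", pi * w);
        MPI_Finalize();
        return 0;
    }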

Computing π with POSIX Threads

Comparing Parallel Paradigms

• Automatic parallelization combined with explicit Shared Memory programming (compiler directives) is used on machines with global memory
   – Symmetric Multi-Processors, CC-NUMA, PVP
   – These methods are collectively known as Shared Memory Programming (SMP)
   – The SMP programming model works at loop level and at coarse level parallelism:
      • the coarse level parallelism has to be specified explicitly
      • loop level parallelism can be found by the compiler (implicitly)
• Explicit Message Passing Methods are necessary with machines that have no global memory addressability:
   – clusters of all sorts, NOW & COW
   – Message Passing Methods require coarse level parallelism to be scalable
• Choosing a programming model is largely a matter of the application, personal preference and the target machine.
• It has nothing to do with scalability. Scalability limitations:
   – communication overhead
   – process synchronization
• Scalability is mainly a function of the hardware and (your) implementation of the parallelism

Summary

• The serial part or the communication overhead of the code limits the scalability of the code (Amdahl’s Law)
• Programs have to be >99% parallel to use large (>30 processor) machines
• Several Programming Models are in use today:
   – Shared Memory programming (SMP) (with Automatic Compiler parallelization, Data-Parallel and explicit Shared Memory models)
   – Message Passing model
• Choosing a Programming Model is largely a matter of the application, personal choice and target machine. It has nothing to do with scalability.
   – Don’t confuse algorithm and implementation
• Machines with a global address space can run applications based on both SMP and Message Passing programming models
