programming for performance cs433 spring 2001 laxmikant kale
TRANSCRIPT
Programming for Performance CS433
Spring 2001
Laxmikant Kale
2
Causes of performance loss• If each processor is rated at k MFLOPS, and there are p
processors, why don’t we see k.p MFLOPS performance?– Several causes,
– Each must be understood separately
– but they interact with each other in complex ways• Solution to one problem may create another
• One problem may mask another, which manifests itself under other conditions (e.g. increased p).
3
Causes• Sequential: cache performance
• Communication overhead
• Algorithmic overhead (“extra work”)
• Speculative work
• Load imbalance
• (Long) Critical paths
• Bottlenecks
4
Algorithmic overhead• Parallel algorithms may have a higher operation count
• Example: parallel prefix (also called “scan”)– How to parallelize this?
B[0] = A[0];
for (I=1; I<N; I++)
B[I] = B[I-1]+A[I];
5
Parallel Prefix: continued• How to this operation in parallel?
– Seems inherently sequential
– Recursive doubling algorithm
– Operation count: log(P) . N
• A better algorithm: – Take blocking of data into account
– Each processor calculate its sum, then participates in a prallel algorithm to get sum to its left, and then adds to all its elements
– N + log(P) +N: doubling of op. Count
6
Bottleneck• Consider the “primes” program (or the “pi”)
– What happens when we run it on 1000 pes?
• How to eliminate bottlenecks:– Two structures are useful in most such cases:
• Spanning trees: organize processors in a tree
• Hypercube-based dimensional exchange
7
Communication overhead• Components:
– per message and per byte
– sending, receiving and network
– capacity constraints
• Grainsize analysis:– How much computation per message
– Computation-to-communication ratio
8
Communication overhead examples• Usually, must reorganize data or work to reduce communication
• Combining communication also helps
• Examples:
9
Communication delay: time interval between sending on one processor to receipt on another:
time = a + b. N
Communication overhead: the time a processor is held up (both sender and receiver are held up): again of the form a+ bN
Typical values: a = 10 - 100 microseconds, b: 2-10 ns
Communication overhead
10
Grainsize control• A Simple definition of grainsize:
– Amount of computation per message
– Problem: short message/ long message
• More realistic:– Computation to communication ratio
– computation time / (a + bN) for one message
11
Example: matrix multiplication• How to parallelize this?
For (I=0; I<N; I++)
For (J=0; j<N; J++) // c[I][j] ==0
For(k=0; k<N; k++)
C[I][J] += A[I][K] * B[K][J];
12
A simple algorithm:• Distribute A by rows, B by columns
– So,any processor can request a row of A and get it (in two messages). Same for a col of B,
– Distribute the work of computing each element of C using some load balancing scheme
• So it works even on machines with varying processor capabilities (e.g. timeshared clusters)
– What is the computation-to-communication ratio?• For each object: 2.N ops, 2 messages with N bytes
• 2N / (2 a + 2N b) = 2N * 0.01 / (2*10 + 2*0.002N)
13
A better algorithm:• Store A as a collection row-bunches
– each bunch stores g rows
– Same of B’s columns
• Each object now computes a gxg section of C
• Comp to commn ratio:– 2*g*g*N ops
– 2 messages, gN bytes each
– alpha ratio: 2g*g*N/2, beta ratio: 2g
14
Alpha vs beta• The per message cost is significantly larger than per byte cost
– factor of several thousands
– So, several optimizations are possible that trade off : • get larger beta cost in return for smaller alpha
– I.e. send fewer messages
– Applications of this idea:• Examined in the last two lectures
15
Programming for performance:steps• Select/design Parallel algorithm
• Decide on Decomposition
• Select Load balancing strategy
• Plan Communication structure
• Examine synchronization needs– global synchronizations, critical paths
16
Design Philosophy:
• Parallel Algorithm design:– Ensure good performance (total op count)
– Generate sufficient parallelism
– Avoid/minimize “extra work”
• Decomposition:– Break into many small pieces:
• Smallest grain that sufficiently amortizes overhead
17
Design principles: contd.• Load balancing
– Select static, dynamic, or quasi-dynamic strategy• Measurement based vs prediction based load estimation
– Principle: let a processor idle but avoid overloading one • (think about this)
• Reduce communication overhead– Algorithmic reorganization (change mapping)
– Message combining
– Use efficient communication libraries
18
Design principles: Synchronization• Eliminate unnecessary global synchronization
– If T(i,j) is the time during i’th phase on j’th PE• With synch: sum ( max {T(i,j)})
• Without: max { sum(T (i,j) }
• Critical Paths:– Look for long chains of dependences
• Draw timeline pictures with dependences
19
Diagnosing performance problems• Tools:
– Back of the envelope (I.e. simple) analysis
– Post-mortem analysis, with performance logs• Visualization of performance data
• Automatic analysis
• Phase-by-phase analysis (prog. may have many phases)
– What to measure• load distribution, (commun.) overhead, idle time
• Their averages, max/min, and variances
• Profiling: time spent in individual modules/subroutines
20
Diagnostic technniques• Tell-tale signs:
– max load >> average, and # PEs > average is >>1
– max load >> average, and # PEs > average is ~ 1
– Profile shows increase in total time in routine f with increase in PEs:
– Communication overhead: obvious
Load imbalance
Possible bottleneck (if there is dependence)
Algorithmic overhead
21
Communication Optimization• Example problem from earlier lecture: Molecular Dynamics
– Each Processor, assumed to house just one cell, needs to send 26 short messages to “neighboring” processors
– Assume Send/Receive each: alpha = 10 us, beta: 2ns
– Time spent (notice: 26 sends and 26 receives):• 26*2(10 ) = 520 us
– If there are more than one cells on each PE, multiply this number!
– Can this be improved? How?
22
Message combining• If there are multiple cells per processor:
– Neighbors of a cell may be on the same neighboring processor.
– Neighbors of two different cells on the same processor
– Combine messages going to the same processor
23
Communication Optimization I • Take advantage of the structure of communication, and do
communication in stages:– If my coordinates are: (x,y,z):
• Send to (x+1, y,z), anything that goes to (x+1, *, *)
• Send to (x-1, y,z), anything that goes to (x-1, *, *)
• Wait for messages from x neighbors, then
• Send to y neighbors a combined message
– A total of 6 messages instead of 26
– Apparently longer critical path
24
Communication Optimization II• Send all migrating atoms to processor 0
– Let processor 0 sort them out and send 1 message to each processor
– Works ok if the number of processors is small• Otherwise, bottleneck at 0
25
Communication Optimization 3• Generalized problem:
– Each to all, individualized messages
– Apply all previously learned techniques
26
Intro to Load Balancing• Example: 500 processors, 50000 units of work
• What should the objective of load balancing be?
27
Causes of performance loss
• If each processor is rated at k MFLOPS, and there are p processors, why don’t we see k.p MFLOPS performance?– Several causes,
– Each must be understood separately
– but they interact with each other in complex ways• Solution to one problem may create another
• One problem may mask another, which manifests itself under other conditions (e.g. increased p).
28
Causes• Sequential: cache performance
• Communication overhead
• Algorithmic overhead (“extra work”)
• Speculative work
• Load imbalance
• (Long) Critical paths
• Bottlenecks
29
Algorithmic overhead• Parallel algorithms may have a higher operation count
• Example: parallel prefix (also called “scan”)– How to parallelize this?
B[0] = A[0];
for (I=1; I<N; I++)
B[I] = B[I-1]+A[I];
30
Parallel Prefix: continued
• How to this operation in parallel?– Seems inherently sequential
– Recursive doubling algorithm
– Operation count: log(P) . N
• A better algorithm: – Take blocking of data into account
– Each processor calculate its sum, then participates in a prallel algorithm to get sum to its left, and then adds to all its elements
– N + log(P) +N: doubling of op. Count
31
Bottleneck• Consider the “primes” program (or the “pi”)
– What happens when we run it on 1000 pes?
• How to eliminate bottlenecks:– Two structures are useful in most such cases:
• Spanning trees: organize processors in a tree
• Hypercube-based dimensional exchange
32
Communication overhead• Components:
– per message and per byte
– sending, receiving and network
– capacity constraints
• Grainsize analysis:– How much computation per message
– Computation-to-communication ratio
33
Communication overhead examples• Usually, must reorganize data or work to reduce communication
• Combining communication also helps
• Examples:
34
Communication overhead
Communication delay: time interval between sending on one processor to receipt on another:
time = a + b. N
Communication overhead: the time a processor is held up (both sender and receiver are held up): again of the form a+ bN
Typical values: a = 10 - 100 microseconds, b: 2-10 ns
35
Grainsize control• A Simple definition of grainsize:
– Amount of computation per message
– Problem: short message/ long message
• More realistic:– Computation to communication ratio
36
Example: matrix multiplication• How to parallelize this?
For (I=0; I<N; I++)
For (J=0; j<N; J++) // c[I][j] ==0
For(k=0; k<N; k++)
C[I][J] += A[I][K] * B[K][J];
37
A simple algorithm:
• Distribute A by rows, B by columns– So,any processor can request a row of A and get it (in two
messages). Same for a col of B,
– Distribute the work of computing each element of C using some load balancing scheme
• So it works even on machines with varying processor capabilities (e.g. timeshared clusters)
– What is the computation-toc-mmunication ratio?• For each object: 2.N ops, 2 messages with N bytes
38
A better algorithm:
• Store A as a collection row-bunches – each bunch stores g rows
– Same of B’s columns
• Each object now computes a gxg section of C
• Comp to commn ratio:– 2*g*g*N ops
– 2 messages, gN bytes each
– alpha ratio: 2g*g*N/2, beta ratio: g
39
Alpha vs beta
• The per message cost is significantly larger than per byte cost – factor of several thousands
– So, several optimizations are possible that trade off : get larger beta cost for smaller alpha
– I.e. send fewer messages
– Applications of this idea:• Message combining
• Complex communication patterns: each-to-all, ..
40
Example:• Each to all communication:
– each processor wants to send N bytes, distinct message to each other processor
– Simple implementation: alpha*P + N * beta *P• typical values?
41
Programming for performance:steps
• Select/design Parallel algorithm
• Decide on Decomposition
• Select Load balancing strategy
• Plan Communication structure
• Examine synchronization needs– global synchronizations, critical paths
42
Design Philosophy:
• Parallel Algorithm design:– Ensure good performance (total op count)
– Generate sufficient parallelism
– Avoid/minimize “extra work”
• Decomposition:– Break into many small pieces:
• Smallest grain that sufficiently amortizes overhead
43
Design principles: contd.
• Load balancing– Select static, dynamic, or quasi-dynamic strategy
• Measurement based vs prediction based load estimation
– Principle: let a processor idle but avoid overloading one (think about this)
• Reduce communication overhead– Algorithmic reorganization (change mapping)
– Message combining
– Use efficient communication libraries
44
Design principles: Synchronization
• Eliminate unnecessary global synchronization– If T(i,j) is the time during i’th phase on j’th PE
• With synch: sum ( max {T(i,j)})
• Without: max { sum(T (i,j) }
• Critical Paths:– Look for long chains of dependences
• Draw timeline pictures with dependences
45
Diagnosing performance problems
• Tools: – Back of the envelope (I.e. simple) analysis
– Post-mortem analysis, with performance logs• Visualization of performance data
• Automatic analysis
• Phase-by-phase analysis (prog. may have many phases)
– What to measure• load distribution, (commun.) overhead, idle time
• Their averages, max/min, and variances
• Profiling: time spent in individual modules/subroutines
46
Diagnostic technniques
• Tell-tale signs:– max load >> average, and # Pes > average is >>1
• Load imbalance
– max load >> average, and # Pes > average is ~ 1• Possible bottleneck (if there is dependence)
– profile shows increase in total time in routine f with increase in Pes: algorithmic overhead
– Communication overhead: obvious