Chapter 4
Performance
Times
• User CPU time – time the CPU spends executing the program
• System CPU time – time the CPU spends executing OS routines for the program
• Waiting time – time spent waiting for I/O to complete or waiting for the CPU due to time sharing
• Wall clock time – sum of all the above, including time waiting to start execution
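The difference between these times can be seen by measuring a small piece of work. This is a minimal sketch, assuming a standard Python interpreter; `measure` and `busy_sum` are hypothetical helper names, and the wall clock value here covers only the measured call, not time waiting to start execution.

```python
import os
import time

def measure(fn, *args):
    """Run fn(*args) and return (result, user, system, wall) in seconds."""
    cpu_before = os.times()            # cumulative user/system CPU time of this process
    wall_before = time.perf_counter()  # high-resolution wall clock
    result = fn(*args)
    wall_after = time.perf_counter()
    cpu_after = os.times()
    user = cpu_after.user - cpu_before.user
    system = cpu_after.system - cpu_before.system
    return result, user, system, wall_after - wall_before

def busy_sum(k):
    """CPU-bound work: sum of squares below k."""
    return sum(i * i for i in range(k))

result, user, system, wall = measure(busy_sum, 200_000)
print(f"user={user:.4f}s system={system:.4f}s wall={wall:.4f}s")
```

For CPU-bound work like this, user time should account for most of the wall clock time; a large gap between the two suggests waiting time.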
MIPS and MFLOPS
• MIPS – Millions of Instructions Per Second
• MFLOPS – Millions of Floating-point Operations Per Second
– Floating-point operations generally take longer than integer instructions
– Most scientific codes are dominated by floating-point operations
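Both rates are just a count divided by a time and scaled by one million. A small sketch, with hypothetical function names and made-up numbers:

```python
def mips(instruction_count, seconds):
    """Millions of instructions per second."""
    return instruction_count / seconds / 1e6

def mflops(flop_count, seconds):
    """Millions of floating-point operations per second."""
    return flop_count / seconds / 1e6

# Hypothetical run: 2,000,000 floating-point operations in 0.5 seconds.
print(mflops(2_000_000, 0.5))  # 4.0
```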
CPI and Memory Hierarchy
• These topics are discussed in 311
• Usually important for single-processor performance evaluation, not for parallel performance evaluation
Benchmark Programs
• Linpack benchmark
– http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
– A standard program to solve a system of equations
– Has grown into a widely used standard for benchmarking systems
• Real applications
– SPEC benchmarks – use real programs such as compression and compiling
– May or may not be applicable to parallel systems
Parallel Performance
• Parallel runtime
– Time from the start of the program until the last processor has finished
– Usually considered a function of n and p
– Made up of the times for
   • Local computations
   • Exchange of data
   • Synchronization
   • Wait time – unequal distribution of work, mutual exclusion
Parallel Program Cost
• The “cost” of a parallel program with input size n and p processors is
– C(n,p) = p * T(n,p)
– where T(n,p) is the time it takes to run the program with input size n on p processors
• Cost-optimal
– A parallel implementation is cost-optimal if C(n,p) = T(n), where T(n) is the best sequential time
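A quick numeric check of the cost definition. The timings are made up for illustration, and `is_cost_optimal` with its tolerance is a hypothetical helper, since measured times never match exactly.

```python
def cost(t_parallel, p):
    """C(n,p) = p * T(n,p)."""
    return p * t_parallel

def is_cost_optimal(t_sequential, t_parallel, p, tol=1e-9):
    """True if C(n,p) equals the best sequential time T(n), within a relative tolerance."""
    return abs(cost(t_parallel, p) - t_sequential) <= tol * t_sequential

t_seq = 100.0                           # hypothetical best sequential time
print(cost(30.0, 4))                    # 120.0
print(is_cost_optimal(t_seq, 30.0, 4))  # False: 120 units of processor time for 100 units of work
print(is_cost_optimal(t_seq, 25.0, 4))  # True: 4 * 25 = 100
```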
Speedup
• The speedup is the ratio of the best sequential time to the parallel time
– S(n,p) = T(n)/T(n,p)
• Theoretically S(n,p) <= p
– Measurements can contradict this bound, e.g., because p processors together have more cache
– A speedup above p can also mean that a better sequential algorithm exists (which contradicts the requirement to use the “best” sequential algorithm)
Best Sequential Algorithm
• May not be known
• Different data distributions may have different best sequential algorithms
• Implementing the best algorithm may be too much work
• So many people use a sequential version of the parallel algorithm as the base for speedup rather than the best sequential algorithm
Efficiency
• Captures the utilization of the processors
• E(n,p) = S(n,p)/p
• Ideal is 1, typically less than 1
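Speedup and efficiency can be tabulated directly from measured times. In this sketch the timings are hypothetical, chosen only to show efficiency dropping as p grows.

```python
def speedup(t_sequential, t_parallel):
    """S(n,p) = T(n) / T(n,p)."""
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, p):
    """E(n,p) = S(n,p) / p."""
    return speedup(t_sequential, t_parallel) / p

t_seq = 100.0                           # hypothetical best sequential time
timings = {2: 55.0, 4: 30.0, 8: 20.0}   # hypothetical T(n,p) for each p

for p, t_par in timings.items():
    s = speedup(t_seq, t_par)
    e = efficiency(t_seq, t_par, p)
    print(f"p={p}  S={s:.3f}  E={e:.3f}")
```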
Amdahl’s Law
• Let f be the fraction of time that must be done sequentially
• 1-f is the fraction of time that can be parallelized
• If the time for sequential execution is 1 time unit, then the time for the parallel version is f + (1-f)/p
• Speedup is 1/[f + (1-f)/p]
• Even with infinitely many processors, the best speedup is 1/f
– For example, if 5% of the time must be sequential, the speedup is bounded by 20, even with 1000 processors
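The 5% example can be checked numerically. A minimal sketch; the function name is hypothetical.

```python
def amdahl_speedup(f, p):
    """Speedup 1 / (f + (1 - f) / p) for sequential fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)

for p in (10, 100, 1000):
    print(p, round(amdahl_speedup(0.05, p), 2))
# 10 6.9
# 100 16.81
# 1000 19.63
```

The values approach, but never reach, the limit 1/f = 20.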
Scalability
• How performance changes as the number of processors increases
– As processors are added, communication time may increase
• A limitation of Amdahl’s law is that as n increases, f probably decreases
• So consider increasing n as the number of processors increases
Asymptotic Times for Communications
• Uses the big O notation from 271 (and big Θ, and big Ω).
• Need to take the connection topology into account.
• Assume
– Each link is bidirectional
– Every link on a processor can be active at the same time
– Time for message transmission is t_s + m*t_b
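The transmission-time model is linear in the message size. In this sketch ts is read as a fixed startup time, m as the message size, and tb as the transfer time per unit of data; that reading and the numeric values are assumptions, not stated on the slide.

```python
def message_time(m, ts, tb):
    """Time to send one message of size m: ts + m * tb."""
    return ts + m * tb

ts, tb = 50.0, 0.01   # hypothetical values, in microseconds
for m in (1, 1_000, 1_000_000):
    print(m, message_time(m, ts, tb))
```

With values like these, small messages are dominated by ts and large ones by m*tb.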
Different Topologies
• Complete graph
– All communications are O(1)
• Linear array
– Single broadcast O(p) (actually p/2 on average)
– Multi-broadcast O(p) (best p/2, worst p-1)
– Scatter O(p)
– Total exchange O(p^2) (look at it as p scatters)
• Ring – like a linear array with ends connected, so same results
Mesh
• Looking at a d-dimensional mesh (i.e., 2*d links per node)
• Each dimension is of size p^(1/d)
• For analysis, we treat d as a constant since it is determined by the hardware and we cannot change it
• Single broadcast – worst case d*p^(1/d) -> O(p^(1/d))
• Scatter – sending p-1 messages out over 2*d links, so time is (p-1)/(2d) -> O(p)
• Multi-broadcast – same as scatter
• Total exchange -
Hypercube
• A d-dimensional hypercube has d links per node and 2^d nodes. Use a d-bit number to identify each node.
• Send a message – O(d) = O(log p)
• Multi-broadcast – each node receives p-1 messages over d links. Time (p-1)/d -> O(p/log p)
• Scatter – same as multi-broadcast
• Total exchange – O(p)
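The d-bit node numbering makes the O(d) message bound easy to see. This sketch assumes the usual hypercube wiring, in which two nodes are linked exactly when their numbers differ in one bit; the function names are hypothetical.

```python
def hypercube_neighbors(node, d):
    """Neighbors of a node in a d-dimensional hypercube: flip each of the d bits."""
    return [node ^ (1 << k) for k in range(d)]

def hops(a, b):
    """Minimum number of links between nodes a and b: the number of differing bits."""
    return bin(a ^ b).count("1")

d = 3                                  # 2^3 = 8 nodes
print(hypercube_neighbors(0b000, d))   # [1, 2, 4]
print(hops(0b000, 0b111))              # 3, the worst case for d = 3
```

Since two d-bit numbers differ in at most d bits, no message needs more than d = log p hops.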
Parallel Scalar Product
• Looking at vectors a and b and computing a^T b
– Vectors are of size n; the result is the sum of the pairwise products
• Assume r = n/p
• Each processor computes its part of the sum (r values summed). This takes 2r floating-point operations.
• Then the parallel system does an accumulate.
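The two phases can be simulated sequentially to make the data distribution concrete. This is only a sketch: it assumes a block distribution in which processor k owns elements k*r through (k+1)*r - 1, and the "processors" are loop iterations rather than real parallel workers.

```python
def parallel_scalar_product(a, b, p):
    """Simulate the scalar product of a and b on p processors, r = n/p elements each."""
    n = len(a)
    assert n == len(b) and n % p == 0, "this sketch assumes p divides n"
    r = n // p
    # Phase 1: each processor sums the r pairwise products in its own block.
    partial = [
        sum(a[i] * b[i] for i in range(k * r, (k + 1) * r))
        for k in range(p)
    ]
    # Phase 2: accumulate, i.e. combine the p partial sums into one result.
    return sum(partial)

print(parallel_scalar_product([1, 2, 3, 4], [5, 6, 7, 8], 2))  # 70
```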
Linear Array
• Best processor for the root is the middle processor
• Takes p/2 communications and floating-point adds
• If a floating-point operation takes time α and a communication takes time β, the total time is
– T(p) = 2rα + (p/2)(α+β) = 2(n/p)α + (p/2)(α+β)
• Easy to see that as p increases, the first term goes down but the second term goes up
• What is the best p? Find the minimum
• T’(p) = -2(n/p^2)α + (α+β)/2 = 0 -> (α+β)p^2 = 4nα -> p^2 = 4nα/(α+β) -> p = 2*sqrt(nα/(α+β))
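The minimum for the linear array can be checked numerically against the runtime formula. The values of n, α and β are made up for illustration, and the function names are hypothetical.

```python
import math

def t_linear(n, p, alpha, beta):
    """T(p) = 2(n/p)*alpha + (p/2)*(alpha + beta) on a linear array."""
    return 2 * n * alpha / p + (p / 2) * (alpha + beta)

def best_p_linear(n, alpha, beta):
    """Root of T'(p) = 0, i.e. p = 2*sqrt(n*alpha / (alpha + beta))."""
    return 2 * math.sqrt(n * alpha / (alpha + beta))

n, alpha, beta = 1_000_000, 1.0, 99.0   # hypothetical: communication 99x a flop
p_star = best_p_linear(n, alpha, beta)
print(p_star)                            # 200.0
for p in (199, 200, 201):
    print(p, t_linear(n, p, alpha, beta))
```

At the minimum the two terms of T(p) are equal, which is where the falling computation term and the rising communication term balance.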
Hypercube
• Same computation time 2rα
• Accumulate takes log p steps
• T(p) = 2rα + (log p)(α+β) = 2nα/p + (log p)(α+β)
• T’(p) = -2(n/p^2)α + (α+β)(1/(p*ln 2))
– Min at 2nα/p^2 = (α+β)/(p*ln 2) -> p = 2nα*ln 2/(α+β)
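The same numeric check works for the hypercube. Setting T’(p) = 0 gives p = 2nα*ln 2/(α+β), with log taken base 2. The values of n, α and β are hypothetical and the function names are not from the slides.

```python
import math

def t_hypercube(n, p, alpha, beta):
    """T(p) = 2*n*alpha/p + log2(p)*(alpha + beta) on a hypercube."""
    return 2 * n * alpha / p + math.log2(p) * (alpha + beta)

def best_p_hypercube(n, alpha, beta):
    """Root of T'(p) = 0, i.e. p = 2*n*alpha*ln(2) / (alpha + beta)."""
    return 2 * n * alpha * math.log(2) / (alpha + beta)

n, alpha, beta = 1_000_000, 1.0, 99.0   # hypothetical: communication 99x a flop
p_star = best_p_hypercube(n, alpha, beta)
print(round(p_star))                     # 13863
for p in (0.9 * p_star, p_star, 1.1 * p_star):
    print(round(p), round(t_hypercube(n, p, alpha, beta), 2))
```

Because a hypercube has 2^d nodes, a real machine would use one of the powers of two on either side of this value (8192 or 16384 here), whichever gives the smaller T(p).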