Chapter 4

Performance

Times

• User CPU time – Time that the CPU is executing the program

• System CPU time – time the CPU is executing OS routines for the program

• Waiting time – waiting for the completion of I/O or waiting for CPU due to time sharing

• Wall clock time – sum of all the above including time waiting to start execution
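The four times above can be observed directly for a running program. The following is a minimal sketch, assuming Python's standard `os.times()` and `time.perf_counter()`; the helper `measure` and the workload `busy_sum` are hypothetical names introduced only for illustration.

```python
import os
import time

def measure(fn, *args):
    """Run fn(*args) and return user, system, and wall-clock time in seconds."""
    t0 = os.times()            # CPU times accumulated by this process so far
    w0 = time.perf_counter()   # high-resolution wall clock
    result = fn(*args)
    w1 = time.perf_counter()
    t1 = os.times()
    return {
        "result": result,
        "user": t1.user - t0.user,        # CPU executing the program itself
        "system": t1.system - t0.system,  # CPU executing OS routines for it
        "wall": w1 - w0,                  # elapsed time, including any waiting
    }

def busy_sum(n):
    # Hypothetical workload: pure computation, so most time should be user CPU time.
    return sum(i * i for i in range(n))
```

Calling `measure(busy_sum, 200_000)` returns the three times for one run. Any gap between wall time and user plus system time is waiting time.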

MIPS and MFLOPS

• MIPS – Millions of Instructions Per Second

• MFLOPS – Millions of Floating-point Operations Per Second
  – Floating-point operations generally take longer than integer instructions
  – Most scientific codes are dominated by floating-point operations
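Both rates are an operation count divided by a running time, scaled by one million. A small sketch, with hypothetical function names and made-up example numbers:

```python
def mips(instruction_count, seconds):
    """Millions of instructions per second."""
    return instruction_count / (seconds * 1e6)

def mflops(flop_count, seconds):
    """Millions of floating-point operations per second."""
    return flop_count / (seconds * 1e6)

# Illustrative numbers only: 2e9 floating-point operations in 4 seconds.
rate = mflops(2e9, 4.0)
```

With these example numbers `rate` is 500.0 MFLOPS.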

CPI and Memory Hierarchy

• These topics are discussed in 311

• Usually important for single-processor performance evaluation, not for parallel performance evaluation

Benchmark Programs

• Linpack benchmark
  – http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
  – A standard program to solve a system of equations
  – Has grown into a widely used standard for benchmarking systems

• Real applications
  – SPEC benchmarks – use real programs such as compression and compiling
  – May or may not be applicable to parallel systems

Parallel Performance

• Parallel runtime
  – Time from the start of the program until the last processor has finished
  – Usually considered a function of n and p
  – Composed of time spent in
    • Local computations
    • Exchange of data
    • Synchronization
    • Wait time – unequal distribution of work, mutual exclusion

Parallel Program Cost

• The “cost” of a parallel program with input size n and p processors is
  – C(n,p) = p * T(n,p)
  – where T(n,p) is the time it takes to run the program with input size n on p processors

• Cost-optimal
  – A parallel implementation is cost-optimal if
    • C(n,p) = T(n), the best sequential time

Speedup

• The speedup is the ratio of the best sequential time to the parallel time
  – S(n,p) = T(n)/T(n,p)

• Theoretically S(n,p) <= p
  – However, a larger speedup can be observed in practice because p processors have more cache in total
  – A larger speedup could also mean that a better sequential algorithm exists (which contradicts the requirement to use the “best” sequential algorithm)

Best Sequential Algorithm

• May not be known

• Different data distributions may have different best sequential algorithms

• Implementing the best algorithm may take too much effort

• So, many people use a sequential version of the parallel algorithm as the base for speedup rather than the best sequential algorithm

Efficiency

• Captures the utilization of the processors

• E(n,p) = S(n,p)/p

• Ideal is 1, typically less than 1
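Cost, speedup, and efficiency are all simple functions of the sequential time T(n), the parallel time T(n,p), and p. A minimal sketch with hypothetical function names and made-up timings:

```python
def cost(t_parallel, p):
    """C(n,p) = p * T(n,p)"""
    return p * t_parallel

def speedup(t_seq, t_parallel):
    """S(n,p) = T(n) / T(n,p)"""
    return t_seq / t_parallel

def efficiency(t_seq, t_parallel, p):
    """E(n,p) = S(n,p) / p"""
    return speedup(t_seq, t_parallel) / p

# Illustrative timings only: 100 s sequentially, 50 s on 4 processors.
t_seq, t_par, p = 100.0, 50.0, 4
summary = (cost(t_par, p), speedup(t_seq, t_par), efficiency(t_seq, t_par, p))
```

For these numbers the cost is 200, the speedup is 2, and the efficiency is 0.5. Note that E(n,p) = T(n)/C(n,p), so efficiency 1 corresponds to C(n,p) = T(n).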

Amdahl’s Law

• Let f be the fraction of time that must be done sequentially

• 1-f is the fraction of time that can be parallelized

• If the time for sequential execution is 1 time unit, then the time for the parallel version is f+(1-f)/p

• Speedup is 1/[f+(1-f)/p]

• Even with infinite processors, the best speedup is 1/f
  – For example, if 5% of the time must be sequential, the best speedup would be 20, even with 1000 processors
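The formula is easy to evaluate for concrete values. A short sketch (the function name is hypothetical) that checks the 5% example:

```python
def amdahl_speedup(f, p):
    """Speedup 1 / (f + (1 - f) / p) for sequential fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)

# The example from the slide: f = 0.05 with 1000 processors.
s_1000 = amdahl_speedup(0.05, 1000)
# The limit for p -> infinity is 1 / f.
limit = 1.0 / 0.05
```

Here `s_1000` is about 19.6, just under the limit of 20.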

Scalability

• How performance changes with an increased number of processors
  – As the number of processors increases, communication time may increase

• A limitation of Amdahl’s law is that as n increases, f probably decreases

• Consider increasing n as the number of processors increases

Asymptotic Times for Communications

• Uses the big O notation from 271 (and big Θ, and big Ω).

• Need to take the connection topology into account.

• Assume
  – Each link is bidirectional
  – Every link on a processor can be active at the same time
  – Time for message transmission is ts+m*tb
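The transmission-time model can be written as a one-line function. In this sketch, ts is read as a fixed startup time, m as the message size, and tb as the transfer time per unit of data; the slide does not define these symbols, so that reading is an assumption, and the numbers are made up.

```python
def message_time(m, ts, tb):
    """Time to send one message of size m: ts + m * tb."""
    return ts + m * tb

# Illustrative units only: startup 10, per-unit transfer 2.
one_big = message_time(100, ts=10, tb=2)         # one message of size 100
ten_small = 10 * message_time(10, ts=10, tb=2)   # same data as ten messages of size 10
```

Under this model the ten small messages pay the startup time ten times (300 versus 210 time units), which is one reason to combine data into fewer messages.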

Different Topologies

• Complete graph
  – All communications are O(1)

• Linear array
  – Single broadcast O(p) (actually average p/2)
  – Multi-broadcast O(p) (best p/2, worst p-1)
  – Scatter O(p)
  – Total exchange O(p^2) (look at using p scatters)

• Ring – like a linear array with ends connected, so same results

Mesh

• Looking at a d-dimensional mesh (i.e., 2*d links per node)

• Each dimension is size p^(1/d)

• For analysis, we treat d as a constant, since it is determined by the hardware and we cannot change it

• Single broadcast – worst case d*p^(1/d) -> O(p^(1/d))

• Scatter – sending p-1 messages out over 2*d links, so time is (p-1)/(2d) -> O(p)

• Multi-broadcast – same as scatter

• Total exchange –

Hypercube

• A d-dimensional hypercube has d links per node and 2^d nodes. Use a d-bit number to identify each node.

• Send a message – O(d) = O(log p)

• Multi-broadcast – each node receives p-1 messages over d links. Time (p-1)/d -> O(p/log p)

• Scatter – same as multi-broadcast

• Total exchange – O(p)
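The d-bit node numbering makes the hypercube structure easy to compute: two nodes are linked exactly when their numbers differ in one bit. The sketch below uses hypothetical function names, and the broadcast schedule is the standard dimension-by-dimension scheme, shown as an illustration rather than something taken from the slides.

```python
def hypercube_neighbors(node, d):
    """Neighbors of a node in a d-dimensional hypercube: flip each of the d bits."""
    return [node ^ (1 << k) for k in range(d)]

def broadcast_schedule(root, d):
    """List of d steps; in step k every node holding the message sends it across dimension k."""
    have = {root}
    steps = []
    for k in range(d):
        sends = [(src, src ^ (1 << k)) for src in sorted(have)]
        have.update(dst for _, dst in sends)
        steps.append(sends)
    return steps

# d = 3 gives p = 8 nodes; node 5 (binary 101) is linked to 4, 7, and 1.
neighbors_of_5 = hypercube_neighbors(5, 3)
schedule = broadcast_schedule(0, 3)
```

The schedule has d = log p steps and the number of informed nodes doubles in each step, which matches the O(log p) bound for sending a message to any node.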

Parallel Scalar Product

• Looking at vectors a and b and computing a^T b
  – Vectors are of size n; the result is the sum of pairwise products

• Assume r = n/p

• Each processor computes its part of the sum (r values summed). This takes 2r floating-point operations.

• Then the parallel system does an accumulate.
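The two phases can be sketched in a few lines. This version simulates the p processors sequentially in one Python process, so it only illustrates the data partitioning and the accumulate step, not real parallel execution; the function name is hypothetical.

```python
def parallel_scalar_product(a, b, p):
    """Simulate a block-distributed scalar product on p processors."""
    n = len(a)
    if n != len(b) or n % p != 0:
        raise ValueError("vectors must have equal length divisible by p")
    r = n // p  # number of elements per processor
    # Phase 1: each "processor" k computes its partial sum over r elements
    # (r multiplications and r additions, i.e. 2r floating-point operations).
    partial = [
        sum(a[i] * b[i] for i in range(k * r, (k + 1) * r))
        for k in range(p)
    ]
    # Phase 2: accumulate the p partial sums into one result.
    return sum(partial), partial

total, parts = parallel_scalar_product([1, 2, 3, 4], [5, 6, 7, 8], p=2)
```

For this input the partial sums are 17 and 53, and the accumulated result is 70.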

Linear Array

• Best processor for the root is the middle processor

• Takes p/2 communications and floating-point adds

• If a floating-point operation takes time α and a communication takes time β, the total time is
  – T(p) = 2rα + (p/2)(α+β) = 2(n/p)α + (p/2)(α+β)

• Easy to see that as p increases, the first term goes down, but the second term goes up.

• What is the best p? Find the minimum

• T’(p) = -2(n/p^2)α + (α+β)/2 = 0 -> (α+β)p^2 = 4nα -> p^2 = 4nα/(α+β)
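Taking the square root of the last expression gives p = 2*sqrt(nα/(α+β)). The sketch below evaluates T(p) and this optimum for made-up values of n, α, and β; the function names are hypothetical.

```python
import math

def t_linear(n, p, alpha, beta):
    """T(p) = 2(n/p)α + (p/2)(α+β) for the linear array."""
    return 2 * (n / p) * alpha + (p / 2) * (alpha + beta)

def best_p_linear(n, alpha, beta):
    """Minimizer of T(p): p = 2 * sqrt(nα / (α+β))."""
    return 2 * math.sqrt(n * alpha / (alpha + beta))

# Illustrative values only: n = 10000, α = 1, β = 3.
p_opt = best_p_linear(10_000, 1.0, 3.0)
t_opt = t_linear(10_000, p_opt, 1.0, 3.0)
```

For these values `p_opt` is 100 and `t_opt` is 400, and T(p) is larger for both p = 99 and p = 101.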

Hypercube

• Same computation time 2rα

• Accumulate takes (log p) steps, each costing α+β

• T(p) = 2rα + (log p)*(α+β) = 2nα/p + (log p)*(α+β)

• T’(p) = -2(n/p^2)α + (α+β)*(1/(p*ln 2))
  – Min at 2nα/p^2 = (α+β)/(p*ln 2) -> p = 2nα*ln 2/(α+β)
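The hypercube optimum can be checked numerically in the same way. This is a sketch with hypothetical function names and made-up parameter values; log p is taken as the base-2 logarithm, consistent with the 1/(p*ln 2) factor in the derivative.

```python
import math

def t_hypercube(n, p, alpha, beta):
    """T(p) = 2nα/p + (log2 p)(α+β) for the hypercube."""
    return 2 * n * alpha / p + math.log2(p) * (alpha + beta)

def best_p_hypercube(n, alpha, beta):
    """Stationary point of T(p): p = 2nα * ln 2 / (α+β)."""
    return 2 * n * alpha * math.log(2) / (alpha + beta)

# Illustrative values only: n = 10000, α = 1, β = 3.
p_opt = best_p_hypercube(10_000, 1.0, 3.0)
t_opt = t_hypercube(10_000, p_opt, 1.0, 3.0)
```

For the same n, α, and β the optimum is far larger than for the linear array (roughly 3466 instead of 100), because the accumulate cost grows only with log p. A real hypercube has p = 2^d nodes, so in practice one would compare the nearest powers of two.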
