1 lecture 5: part 1 performance laws: speedup and scalability
TRANSCRIPT
![Page 1: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/1.jpg)
1
Lecture 5: Part 1
Performance Laws:
Speedup and Scalability
![Page 2: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/2.jpg)
2
Sequential Execution Time
Execution time = Seconds = Instructions x Cycles x Seconds
(Te) Program Program Instruction Cycle
Execution time = Seconds = Instructions x Cycles x Seconds
(Te) Program Program Instruction Cycle
500 MHz CPU: 500 x 106 clock cycles/secMy program consists of operations:
addition: 500 x 106 (1 cycle per inst)mul: 180 x 106 (3 cycles/inst)div: 120 x 106 (2 cycles/inst)data move: 300 x 106 (1 cycle/inst)
Expected execution time: 2 ns x (500x1+180x3+120x2+300x1)x 106
![Page 3: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/3.jpg)
3
Basic Performance Metrics
MIPS = (instructions/second)x 10-6
MFLOPS = (floating point ops/second)x 10-6
CPI = Average cycles per instruction Throughput: number of results per second Workload: W, number of Ops. required to
complete the program Speed: W/TE
Speedup (S)= Te / Timprove
Efficiency (using P processors) = Speedup / P
![Page 4: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/4.jpg)
4
Part I: Speedup
![Page 5: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/5.jpg)
5
Speedup
General concepts of Speedup in parallel computing: How much faster an application runs on parallel
computer ? What benefits derive from the use of parallelism?
General agreement : speedup = serial time/parallel time
![Page 6: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/6.jpg)
6
Speedup
Fixed-Workload Speedup (Amdahl’s Law) Fixed-Time Speedup (Gustafson’s Law) Fixed-Memory Speedup (Sun and Ni)
![Page 7: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/7.jpg)
7
Amdahl's Law
Speedup due to enhancement E:
ExTime w/o E Speedup(E) = ------------- ExTime w/ E
Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected
ExTime: execution time
1/S
F F/S
![Page 8: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/8.jpg)
8
Amdahl’s Law
ExTimenew = ExTimeold x (1 - Fractionenhanced) + Fractionenhanced
Speedupoverall =ExTimeold
ExTimenew
Speedupenhanced
=
1
(1 - Fractionenhanced) + Fractionenhanced
Speedupenhanced
![Page 9: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/9.jpg)
9
Amdahl’s Law
Floating point instructions improved to run 2X; but only 10% of actual instructions are FP
Speedupoverall =
ExTimenew =
![Page 10: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/10.jpg)
10
Amdahl’s Law
Floating point instructions improved to run 2X; but only 10% of actual instructions are FP
Speedupoverall = 1
0.95= 1.053
ExTimenew = ExTimeold x (0.9 + .1/2) = 0.95 x ExTimeold
![Page 11: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/11.jpg)
11
Amdahl’s Law
Some applications need real-time response Amdahl’s law shows the upper bound of the
achievable speedup for a given problem size. Limit on the achievable speedup:
W: total workload
of W must be executed sequentially
Sp = W / ( W+(1- )(W/P))
= P / (1+(P-1) ) 1 / as P
![Page 12: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/12.jpg)
12
Fixed-Time Speedup
Gustafson’s Law (1988) Scaling the problem size along with the increase of
machine size within the same execution time Scaling for higher accuracy Parameters:
W: workload done by a single node W’ = W+(1- )WP = work done by P nodes
Sp = ( W+(1- )WP) / W
= +(1-)P Speedup is linear function of P
![Page 13: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/13.jpg)
13
Fixed-Memory Speedup
Sun and Ni’s Law: Memory bounding (1993) To solve the largest possible problem, limited
only by the available memory space. As P increases, use up all the increased memory
by scaling the problem size also See Kai’s book (Section 3.6.3)
![Page 14: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/14.jpg)
14
More on Speedup
Can you measure the sequential time? What if the memory is too small ?
Is the sequential machine using the same processor as the one in parallel machine ?
Can the sequential time be shorter than the parallel time ?
Is speedup really important ? Or only the execution time is important ?
![Page 15: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/15.jpg)
15
More on Speedup Speedup larger than “P” (Superlinear) ?
What if the disk swapping effects happen in sequential execution?
More caches/memory used in the parallel machines (PxC, PxM)
Both sequential and parallel machines use the same OS/compiler ?
Is the data distribution/collection time included in the parallel time? Some parallel machines provide parallel I/O !!
![Page 16: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/16.jpg)
16
More on Speedup
What if we use different algorithms for sequential and parallel solutions Take a look at the parallel sorting assignment.
What is the maximum speedup for parallel sorting?
![Page 17: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/17.jpg)
17
Definitions of Speedup
Diverse definitions of serial and parallel execution times [Sahni:1996] Relative Real Absolute Asymptotic Asymptotic relative
![Page 18: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/18.jpg)
18
Parameters Used in Speedup Definitions I = problem instance P = number of processors Q = parallel program n = size of I
![Page 19: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/19.jpg)
19
Relative Speedup
serial time : execution time of the parallel program on a single node of the parallel computer. (mpirun -np 1)
Relative speedup (I,P) = time to solve I using program Q and 1 processor ÷ time to solve I using Q program and P processors
![Page 20: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/20.jpg)
20
Relative Speedup
Depends on the characteristics of the instance I being solved as well as the size P of the parallel computer
Same OS, same node architecture, using same compiler
Extra overheads in serial time (distribution/collection).
![Page 21: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/21.jpg)
21
Real Speedup
Parallel time vs. the fastest serial algorithm or program running on a single node of the parallel computer
Real speedup (I,P) = time to solve I using best serial program and 1 processor ÷ time to solve I using Q program and P processors
![Page 22: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/22.jpg)
22
Problems on Real Speedup The fastest algorithm might not be
known. No single algorithm might be fastest in
all instances for some applications In practice, we use the runtime of the
most frequently used sequential algorithm.
![Page 23: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/23.jpg)
23
Absolute Speedup
Parallel time vs. the fastest sequential algorithm run on the fastest serial computer
Absolute speedup (I,P) = time to solve I using best serial program and 1 fastest processor ÷ time to solve I using Q program and P processors
![Page 24: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/24.jpg)
24
Absolute Speedup
Can also use the sequential algorithm most often used in practice.
Time-variant: researchers keep designing new algorithms
Speedup could be less than 1
![Page 25: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/25.jpg)
25
Asymptotic Real Speedup
Compares the execution time of the best serial algorithm for a particular problem with the asymptotic complexity of the parallel algorithm
Assuming the the parallel computer has all the processors it can use
![Page 26: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/26.jpg)
26
Asymptotic Real Speedup
Asymptotic real speedup (n) = asymptotic complexity of best serial algorithm ÷ asymptotic complexity of Q using as many processors as needed
![Page 27: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/27.jpg)
27
Asymptotic Real Speedup
For problem such as sorting where the asymptotic complexity is not uniquely characterized by the instance size n, the worst-case complexity is used -- O(n2), O(n log n)
Unbounded number of processors.
![Page 28: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/28.jpg)
28
Asymptotic Relative Speedup
Uses the asymptotic time complexity of the parallel algorithm when run on a single processor.
Asymptotic relative speedup = asymptotic complexity of Q using 1 processor ÷ asymptotic complexity of Q using as many processors as needed
![Page 29: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/29.jpg)
29
Matrix Multiplication (n X n) Serial time: O(n3) Parallel time (on n3/log n processors
hypercube): O(log n) time Asymptotic Relative Speedup: O(n3/log n)
Others are measured Speedup Relative Real Absolute
Asymptotic Relative Speedup
![Page 30: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/30.jpg)
30
Part II: Scalability
![Page 31: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/31.jpg)
31
Scalability
Algorithmic Scalability: the available parallelism increases at
least linearly with problem size. Architectural (Size) Scalability:
the architecture continues to yield the same performance per processor, as the number of processors is increased and as the problem size is increased.
![Page 32: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/32.jpg)
32
Architectural Scalability
Architecturalscalability
Interconnection Network: latency/bandwidth
I/O performance
OS support(lock, processessynchronizationoverheads)
Processor SpeedNo. of issues, depthof pipelining
Memory Subsystem:cache/memory speed
Performance grows in all aspects
![Page 33: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/33.jpg)
33
Discussion of Architectural Scalability: bus-based SMPs
Bus (single set of wires) : fixed bandwidth shared by all processors (P) Bandwidth accessible by each processor decreases as P
increases Physical constraints: fixed number of slots Electrical constraints:
bus loading, wire length determine frequency power
Cost scaling: X (bus is the core) OS scaling : scheduling, locking, I/O
![Page 34: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/34.jpg)
34
Scalable Interconnection Network
Bandwidth (P)? Latency (P)? Cost (P)? More bandwidth,
smaller L greater cost SGI XL: 1.5 M HK$
(1.2 GB/s) ALR SMP: 0.1 M HK$
(533 MB/s)
Network (bus ? multistage,mesh, torus)
PE PE PE PE
![Page 35: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/35.jpg)
35
Bottom Line
There is NO “truly scalable” machine Goal is to design for a given “range of scale”
one (instruction level) two to tens (SMP):
Top-end: Sun Enterprise 1000 : 64 processors, COMPACQ 64 Alpha processors
tens to a few thousands (Distributed-Memory Multicomputers: T3E, SP2, Paragon, ...)
tens of thousands (ASCI machines: cluster of SMPs) Techniques at one scale may not be cost effect at another
![Page 36: 1 Lecture 5: Part 1 Performance Laws: Speedup and Scalability](https://reader035.vdocuments.us/reader035/viewer/2022062714/56649d1e5503460f949f2326/html5/thumbnails/36.jpg)
36
Bottom Line
Even though we can meet the engineering requirements to physically scale over the range of interest, we must ensure that the communication and synchronization operations required to support target programming models also scale and are cost effective.