Principles of Scalable Performance
EENG-630 Chapter 3
- Performance measures
- Speedup laws
- Scalability principles
- Scaling up vs. scaling down
Performance metrics and measures
- Parallelism profiles
- Asymptotic speedup factor
- System efficiency, utilization, and quality
- Standard performance measures
Degree of parallelism
- Reflects the matching of software and hardware parallelism
- Discrete time function: measures, for each time period, the # of processors used
- The parallelism profile is a plot of the DOP as a function of time
- Ideally assumes unlimited resources
Degree of Parallelism
The number of processors used at any instant to execute a program is called the degree of parallelism (DOP); this can vary over time.

DOP assumes an infinite number of processors are available; this is not achievable in real machines, so some parallel program segments must be executed sequentially as smaller parallel segments. Other resources may impose limiting conditions.

A plot of DOP vs. time is called a parallelism profile.
Factors affecting parallelism profiles
- Algorithm structure
- Program optimization
- Resource utilization
- Run-time conditions
- Realistically limited by # of available processors, memory, and other nonprocessor resources
Average Parallelism - 1

Assume the following:
- n homogeneous processors
- maximum parallelism in a profile is m
- Ideally, n ≫ m
- Δ, the computing capacity of a single processor, is something like MIPS or Mflops w/o regard for memory latency, etc.
- i is the number of processors busy in an observation period (i.e., DOP = i)
- W is the total work (instructions or computations) performed by a program
- A is the average parallelism in the program
Average Parallelism - 2
In continuous form, the total work performed is

$$W = \Delta \int_{t_1}^{t_2} \mathrm{DOP}(t)\,dt$$

In discrete form,

$$W = \Delta \sum_{i=1}^{m} i \cdot t_i$$

where $t_i$ = total time that DOP = $i$, and $\sum_{i=1}^{m} t_i = t_2 - t_1$.

The total amount of work performed is proportional to the area under the profile curve.
Average Parallelism - 3
$$A = \frac{1}{t_2 - t_1} \int_{t_1}^{t_2} \mathrm{DOP}(t)\,dt$$

In discrete form,

$$A = \sum_{i=1}^{m} i \cdot t_i \Big/ \sum_{i=1}^{m} t_i$$
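As a small numeric sketch of these definitions, the work W and average parallelism A can be computed from a discrete profile. The profile values below are invented for illustration, with the per-processor capacity Δ normalized to 1:

```python
# Hypothetical parallelism profile: time t_i spent at each DOP value i
# (Delta, the per-processor computing capacity, is normalized to 1).
profile = {1: 2.0, 2: 3.0, 4: 1.0}   # DOP i -> t_i (time units)

W = sum(i * t_i for i, t_i in profile.items())   # W = sum of i * t_i
total_time = sum(profile.values())                # t_2 - t_1
A = W / total_time                                # average parallelism

print(W, A)   # 12.0 2.0
```

Here 6 time units of wall clock produce 12 units of work, so the program averaged 2 busy processors.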
Example: parallelism profile and average parallelism
Available Parallelism
Various studies have shown that the potential parallelism in scientific and engineering calculations can be very high (e.g. hundreds or thousands of instructions per clock cycle). But in real machines, the actual parallelism is much smaller (e.g. 10 or 20).
Basic Blocks
A basic block is a sequence or block of instructions with one entry and one exit. Basic blocks are frequently used as the focus of optimizers in compilers (since it's easier to manage the use of registers utilized in the block). Limiting optimization to basic blocks limits the instruction-level parallelism that can be obtained (to about 2 to 5 in typical code).
Asymptotic Speedup - 1
$$W_i = \Delta \, i \, t_i \quad\text{(work done while DOP = } i\text{)}$$

$$W = \sum_{i=1}^{m} W_i \quad\text{(relates the sum of the } W_i \text{ terms to } W\text{)}$$

$$t_i(k) = \frac{W_i}{k\Delta} \quad\text{(execution time of } W_i \text{ with } k \text{ processors)}$$

$$t_i(\infty) = \frac{W_i}{i\Delta} \quad\text{(for } 1 \le i \le m\text{)}$$
Asymptotic Speedup - 2
Response time with 1 processor:

$$T(1) = \sum_{i=1}^{m} t_i(1) = \sum_{i=1}^{m} \frac{W_i}{\Delta}$$

Response time with ∞ processors:

$$T(\infty) = \sum_{i=1}^{m} t_i(\infty) = \sum_{i=1}^{m} \frac{W_i}{i\Delta}$$

Asymptotic speedup:

$$S_\infty = \frac{T(1)}{T(\infty)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} W_i / i} = A \quad\text{(in the ideal case)}$$
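Under these ideal assumptions the asymptotic speedup works out to the average parallelism. A quick check with made-up W_i values (Δ = 1):

```python
# Hypothetical work distribution: W_i = work performed while DOP = i (Delta = 1).
work = {1: 2.0, 2: 6.0, 4: 4.0}      # DOP i -> W_i

T1 = sum(work.values())                           # T(1): one processor does everything
T_inf = sum(W_i / i for i, W_i in work.items())   # T(inf): each W_i runs at DOP i
S_inf = T1 / T_inf                                # asymptotic speedup

print(S_inf)   # 2.0
```

For the matching profile (t_i = W_i / i), this is exactly the average parallelism A computed earlier.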
Performance measures
Consider n processors executing m programs in various modes. We want to define the mean performance of these multimode computers:

- Arithmetic mean performance
- Geometric mean performance
- Harmonic mean performance
Mean Performance Calculation
We seek to obtain a measure that characterizes the mean, or average, performance of a set of benchmark programs with potentially many different execution modes (e.g. scalar, vector, sequential, parallel). We may also wish to associate weights with these programs to emphasize these different modes and yield a more meaningful performance measure.
Arithmetic mean performance
Arithmetic mean execution rate (assumes equal weighting):

$$R_a = \frac{1}{m} \sum_{i=1}^{m} R_i$$

Weighted arithmetic mean execution rate:

$$R_a^* = \sum_{i=1}^{m} f_i R_i$$

The arithmetic mean is proportional to the sum of the inverses of the execution times.
Geometric mean performance
Geometric mean execution rate:

$$R_g = \prod_{i=1}^{m} R_i^{1/m}$$

Weighted geometric mean execution rate:

$$R_g^* = \prod_{i=1}^{m} R_i^{f_i}$$

The geometric mean does not summarize the real performance, since it does not have the inverse relation with the total time.
Harmonic mean performance
Mean execution time per instruction for program i:

$$T_i = \frac{1}{R_i}$$

Arithmetic mean execution time per instruction:

$$T_a = \frac{1}{m} \sum_{i=1}^{m} T_i = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{R_i}$$
Harmonic mean performance
Harmonic mean execution rate:

$$R_h = \frac{1}{T_a} = \frac{m}{\sum_{i=1}^{m} 1/R_i}$$

Weighted harmonic mean execution rate:

$$R_h^* = \frac{1}{\sum_{i=1}^{m} f_i / R_i}$$

The harmonic mean corresponds to the total # of operations divided by the total time (closest to the real performance).
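The three means can be compared directly; the rates below are invented figures. Only the harmonic mean reproduces the "total operations / total time" rate when each program performs the same amount of work:

```python
import math

R = [10.0, 20.0, 40.0]   # hypothetical execution rates for m = 3 programs
m = len(R)

R_a = sum(R) / m                   # arithmetic mean rate
R_g = math.prod(R) ** (1 / m)      # geometric mean rate
R_h = m / sum(1 / r for r in R)    # harmonic mean rate

# "Real" rate: equal work per program, total work divided by total time.
work = 1.0
real = m * work / sum(work / r for r in R)

print(R_h == real, R_h <= R_g <= R_a)   # True True
```

The ordering R_h ≤ R_g ≤ R_a always holds, which is why the arithmetic mean tends to flatter a machine's performance.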
Geometric Mean
A geometric mean of n terms is the nth root of the product of the n terms. Like the arithmetic mean, the geometric mean of a set of execution rates does not have an inverse relationship with the total execution time of the programs. (The geometric mean has been advocated for use with normalized performance numbers for comparison with a reference machine.)
Harmonic Mean
Instead of using arithmetic or geometric mean, we use the harmonic mean execution rate, which is just the inverse of the arithmetic mean of the execution time (thus guaranteeing the inverse relation not exhibited by the other means).
$$R_h = \frac{m}{\sum_{i=1}^{m} 1/R_i}$$
Weighted Harmonic Mean
If we associate weights fi with the benchmarks, then we can compute the weighted harmonic mean:
$$R_h^* = \frac{1}{\sum_{i=1}^{m} f_i / R_i}$$
Weighted Harmonic Mean Speedup

T1 = 1/R1 = 1 is the sequential execution time on a single processor with rate R1 = 1, and Ti = 1/Ri = 1/i is the execution time using i processors with a combined execution rate of Ri = i. Now suppose a program has n execution modes with associated weights f1 … fn. The weighted harmonic mean speedup is defined as:
$$S = \frac{T_1}{T^*} = \frac{1}{\sum_{i=1}^{n} f_i / R_i}$$

where $T^* = 1/R_h^*$ is the weighted arithmetic mean execution time.
Harmonic Mean Speedup Performance
Amdahl's Law

Assume Ri = i, and that the weights are (α, 0, …, 0, 1−α). Basically this means the system is used sequentially (with probability α) or with all n processors (with probability 1−α). This yields the speedup equation known as Amdahl's law:
$$S_n = \frac{n}{1 + (n-1)\alpha}$$

The implication is that the best speedup possible is 1/α, regardless of n, the number of processors.
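A minimal sketch of the law as a function (the helper name is ours), with a quick numeric check that matches the worked examples that follow:

```python
def amdahl_speedup(alpha, n):
    """Amdahl speedup on n processors: S_n = n / (1 + (n - 1) * alpha),
    where alpha is the sequential fraction of the workload."""
    return n / (1 + (n - 1) * alpha)

print(round(amdahl_speedup(0.05, 8), 1))    # 5.9
print(round(amdahl_speedup(0.05, 10**9)))   # 20, approaching the 1/alpha bound
```

Note that n/(1 + (n−1)α) is algebraically the same as the more familiar 1/(α + (1−α)/n) form used in the examples below.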
Illustration of Amdahl Effect
[Figure: speedup vs. number of processors for problem sizes n = 100; 1,000; and 10,000, illustrating the Amdahl effect.]
Example 1
95% of a program’s execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs?
$$S = \frac{1}{0.05 + (1 - 0.05)/8} \approx 5.9$$
System Efficiency – 1
Assume the following definitions:

- O(n) = total number of "unit operations" performed by an n-processor system in completing a program P.
- T(n) = execution time required to execute the program P on an n-processor system.

O(n) can be considered similar to the total number of instructions executed by the n processors, perhaps scaled by a constant factor. If we define O(1) = T(1), then it is logical to expect that T(n) < O(n) when n > 1 if the program P is able to make any use at all of the extra processor(s).
Example 2
5% of a parallel program's execution time is spent within inherently sequential code. The maximum speedup achievable by this program, regardless of how many PEs are used, is
$$\lim_{p \to \infty} \frac{1}{0.05 + (1 - 0.05)/p} = \frac{1}{0.05} = 20$$
Pop Quiz

An oceanographer gives you a serial program and asks you how much faster it might run on 8 processors. You can only find one function amenable to a parallel solution. Benchmarking on a single processor reveals 80% of the execution time is spent inside this function. What is the best speedup a parallel version is likely to achieve on 8 processors?
Answer: 1/(0.2 + (1 − 0.2)/8) ≈ 3.3
System Efficiency – 2

Clearly, the speedup factor (how much faster the program runs with n processors) can now be expressed as

S(n) = T(1) / T(n)

Recall that we expect T(n) < T(1), so S(n) ≥ 1. System efficiency is defined as

E(n) = S(n) / n = T(1) / (n · T(n))

It indicates the actual degree of speedup achieved in a system as compared with the maximum possible speedup. Thus 1/n ≤ E(n) ≤ 1. The value is 1/n when only one processor is used (regardless of n), and the value is 1 when all processors are fully utilized.
Redundancy

The redundancy in a parallel computation is defined as

R(n) = O(n) / O(1)

What values can R(n) take?

- R(n) = 1 when O(n) = O(1), i.e. when the number of operations performed is independent of the number of processors n. This is the ideal case.
- R(n) = n when each processor performs the same number of operations as when only a single processor is used; this implies that n completely redundant computations are performed!

The R(n) figure indicates to what extent the software parallelism is carried over to the hardware implementation without extra operations being performed.
System Utilization
System utilization is defined as

U(n) = R(n) · E(n) = O(n) / (n · T(n))

It indicates the degree to which the system resources were kept busy during execution of the program. Since 1 ≤ R(n) ≤ n and 1/n ≤ E(n) ≤ 1, the best possible value for U(n) is 1, and the worst is 1/n:

1/n ≤ E(n) ≤ U(n) ≤ 1
1 ≤ R(n) ≤ 1/E(n) ≤ n
Quality of Parallelism
The quality of a parallel computation is defined as
Q(n) = S(n) · E(n) / R(n) = T³(1) / (n · T²(n) · O(n))

This measure is directly related to speedup (S) and efficiency (E), and inversely related to redundancy (R). The quality measure is bounded by the speedup (that is, Q(n) ≤ S(n)).
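The metrics S, E, R, U, and Q fit together as shown below. The measured values are invented for illustration, and O(1) = T(1) as defined earlier:

```python
def parallel_metrics(T1, Tn, On, n):
    """Speedup, efficiency, redundancy, utilization, and quality,
    assuming O(1) = T(1) as in the definitions above."""
    S = T1 / Tn          # speedup
    E = S / n            # efficiency
    R = On / T1          # redundancy (O(n) / O(1))
    U = R * E            # utilization
    Q = S * E / R        # quality
    return S, E, R, U, Q

# Hypothetical measurements: T(1) = 100, T(8) = 20, O(8) = 120 unit operations.
S, E, R, U, Q = parallel_metrics(T1=100.0, Tn=20.0, On=120.0, n=8)
print(S, E, R, U)   # 5.0 0.625 1.2 0.75; and Q <= S always holds
```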
Standard Industry Performance Measures
MIPS and Mflops, while easily understood, are poor measures of system performance, since their interpretation depends on machine clock cycles and instruction sets. For example, which of these machines is faster?
- a 10 MIPS CISC computer
- a 20 MIPS RISC computer
It is impossible to tell without knowing more details about the instruction sets on the machines. Even the question, “which machine is faster,” is suspect, since we really need to say “faster at doing what?”
Doing What?

To answer the "doing what?" question, several standard programs are frequently used.
The Dhrystone benchmark uses no floating point instructions, system calls, or library functions. It uses exclusively integer data items. Each execution of the entire set of high-level language statements is a Dhrystone, and a machine is rated as having a performance of some number of Dhrystones per second (sometimes reported as KDhrystones/sec).

The Whetstone benchmark uses a more complex program involving floating point and integer data, arrays, subroutines with parameters, conditional branching, and library functions. It does not, however, contain any obviously vectorizable code.
The performance of a machine on these benchmarks depends in large measure on the compiler used to generate the machine language. [Some companies have, in the past, actually “tweaked” their compilers to specifically deal with the benchmark programs!]
What's VAX Got To Do With It?

The Digital Equipment VAX-11/780 computer was for many years commonly agreed to be a 1-MIPS machine (whatever that means). Since the VAX-11/780 also has a rating of about 1.7 KDhrystones, this gives a method whereby a relative MIPS rating for any other machine can be derived: just run the Dhrystone benchmark on the other machine, divide by 1.7K, and you obtain the relative MIPS rating for that machine (sometimes also called VUPs, or VAX units of performance).
Other Measures

Transactions per second (TPS) is a measure that is appropriate for online systems like those used to support ATMs, reservation systems, and point-of-sale terminals. The measure may include communication overhead, database search and update, and logging operations. The benchmark is also useful for rating relational database performance.

KLIPS is the measure of the number of kilo logical inferences per second that can be performed by a system, presumably to relate how well that system will perform at certain AI applications. Since one inference requires about 100 instructions (in the benchmark), a rating of 400 KLIPS is roughly equivalent to 40 MIPS.
Parallel Processing Applications
- Drug design
- High-speed civil transport
- Ocean modeling
- Ozone depletion research
- Air pollution
- Digital anatomy
Application Models for Parallel Computers
- Fixed-load model: constant workload
- Fixed-time model: demands constant program execution time
- Fixed-memory model: limited by the memory bound
Algorithm Characteristics
- Deterministic vs. nondeterministic
- Computational granularity
- Parallelism profile
- Communication patterns and synchronization requirements
- Uniformity of operations
- Memory requirement and data structures
Isoefficiency Concept
Relates workload to machine size n needed to maintain a fixed efficiency
The smaller the power of n, the more scalable the system
$$E = \frac{w(s)}{w(s) + h(s,n)}$$

where $w(s)$ is the workload and $h(s,n)$ is the overhead.
Isoefficiency Function
To maintain a constant E, w(s) should grow in proportion to h(s,n)
C = E/(1-E) is constant for fixed E
$$w(s) = \frac{E}{1 - E}\, h(s,n)$$

Isoefficiency function:

$$f_E(n) = C \cdot h(s,n)$$
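As a sketch, given a target efficiency E and an overhead value h(s,n), the workload needed scales as w = E/(1−E) · h. The overhead function used below (n log₂ n) is a made-up example, not taken from the slides:

```python
import math

def isoefficiency_workload(E, h_value):
    """Workload w(s) needed to hold efficiency E given overhead h(s,n):
    w = E / (1 - E) * h(s,n), with C = E / (1 - E)."""
    return E / (1 - E) * h_value

# Hypothetical overhead that grows as n * log2(n):
for n in (8, 64):
    print(isoefficiency_workload(0.8, n * math.log2(n)))   # 96.0 then 1536.0
```

Going from 8 to 64 processors multiplies the required workload by 16 here; the faster the required workload grows in n, the less scalable the system.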
The Isoefficiency Metric (Terminology)
- Parallel system: a parallel program executing on a parallel computer
- Scalability of a parallel system: a measure of its ability to increase performance as the number of processors increases
- A scalable system maintains efficiency as processors are added
- Isoefficiency: a way to measure scalability
Notation Needed for the Isoefficiency Relation
- n : data size
- p : number of processors
- T(n,p) : execution time using p processors
- ψ(n,p) : speedup
- σ(n) : inherently sequential computations
- φ(n) : potentially parallel computations
- κ(n,p) : communication operations
- ε(n,p) : efficiency
Note: At least in some printings, there appears to be a misprint on page 170 in Quinn's textbook, where one of the Greek symbols above is replaced with another; the definitions above are the intended ones.
Isoefficiency Concepts
T0(n,p) is the total time spent by processes doing work not done by the sequential algorithm:

T0(n,p) = (p − 1)σ(n) + pκ(n,p)

We want the algorithm to maintain a constant level of efficiency as the data size n increases. Hence, ε(n,p) is required to be a constant. Recall that T(n,1) represents the sequential execution time.
The Isoefficiency Relation

Suppose a parallel system exhibits efficiency ε(n,p).
Define

$$C = \frac{\varepsilon(n,p)}{1 - \varepsilon(n,p)}, \qquad T_0(n,p) = (p-1)\,\sigma(n) + p\,\kappa(n,p)$$

In order to maintain the same level of efficiency as the number of processors increases, n must be increased so that the following inequality is satisfied:

$$T(n,1) \ge C \cdot T_0(n,p)$$
Speedup Performance Laws
- Amdahl's law: for a fixed workload or fixed problem size
- Gustafson's law: for scaled problems (problem size increases with increased machine size)
- Speedup model: for scaled problems bounded by memory capacity
Amdahl’s Law
- As the # of processors increases, the fixed load is distributed to more processors
- Minimal turnaround time is the primary goal
- The speedup factor is upper-bounded by a sequential bottleneck
- Two cases: DOP ≥ n and DOP < n
Fixed Load Speedup Factor
Case 1 (DOP ≥ n): the workload $W_i$ must be shared among the n processors, so

$$t_i(n) = \frac{W_i}{i\Delta} \left\lceil \frac{i}{n} \right\rceil$$

Case 2 (DOP < n): $t_i(n) = t_i(\infty) = W_i / (i\Delta)$, since all of the available parallelism fits on the machine.

Total execution time:

$$T(n) = \sum_{i=1}^{m} \frac{W_i}{i\Delta} \left\lceil \frac{i}{n} \right\rceil$$

Fixed-load speedup factor:

$$S_n = \frac{T(1)}{T(n)} = \frac{\sum_{i=1}^{m} W_i}{\sum_{i=1}^{m} \dfrac{W_i}{i} \left\lceil \dfrac{i}{n} \right\rceil}$$
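The ceiling term covers both cases above (it equals 1 whenever DOP ≤ n). A small sketch with an invented workload profile (Δ = 1):

```python
import math

def fixed_load_speedup(work, n):
    """Fixed-load speedup S_n = T(1) / T(n) for a profile work[i] = W_i
    (Delta = 1), using the ceil(i / n) sharing factor."""
    T1 = sum(work.values())
    Tn = sum((W_i / i) * math.ceil(i / n) for i, W_i in work.items())
    return T1 / Tn

# Hypothetical workload: 4 units of sequential work, 8 units at DOP = 4.
W = {1: 4.0, 4: 8.0}
print(fixed_load_speedup(W, 4))   # 2.0 (enough processors for the full DOP)
print(fixed_load_speedup(W, 2))   # 1.5 (DOP = 4 work takes two passes)
```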
Gustafson’s Law
- With Amdahl's law, the workload cannot scale to match the available computing power as n increases
- Gustafson's law fixes the time, allowing the problem size to increase with higher n
- Not saving time, but increasing accuracy
Fixed-time Speedup
As the machine size increases, we have an increased workload and a new profile. In general, $W_i' > W_i$ for $2 \le i \le m'$, and $W_1' = W_1$.
Assume T(1) = T’(n)
Gustafson’s Scaled Speedup
$$S_n' = \frac{\sum_{i=1}^{m'} W_i'}{\sum_{i=1}^{m'} \dfrac{W_i'}{i} \left\lceil \dfrac{i}{n} \right\rceil + Q(n)} = \frac{W_1 + n W_n}{W_1 + W_n}$$

where the last equality holds in the special case of two execution modes (sequential work $W_1$ and fully parallel work scaled to $W_n' = nW_n$) with negligible overhead $Q(n)$.
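In the two-mode case, the scaled speedup reduces to a simple expression. A sketch (the function name and the 5%/95% split are ours, chosen to mirror the earlier Amdahl examples):

```python
def gustafson_speedup(W1, Wn, n):
    """Two-mode Gustafson scaled speedup: S'_n = (W1 + n * Wn) / (W1 + Wn),
    where W1 is the sequential work and Wn the parallel work (before scaling)."""
    return (W1 + n * Wn) / (W1 + Wn)

# Hypothetical normalized workload: 5% sequential, 95% parallel, 8 processors.
print(gustafson_speedup(0.05, 0.95, 8))   # approx 7.65
```

Compare this with the fixed-load Amdahl speedup of about 5.9 for the same fractions: scaling the problem with the machine recovers much of the lost speedup.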
Memory Bounded Speedup Model
- The idea is to solve the largest problem possible, limited by memory space
- Results in a scaled workload and higher accuracy
- Each node can handle only a small subproblem with distributed memory
- Using a large # of nodes collectively increases the memory capacity proportionally
Fixed-Memory Speedup
Let M be the memory requirement and W the computational workload: W = g(M). When the memory is scaled up n-fold, g*(nM) = G(n)·g(M) = G(n)·Wn.
$$S_n^* = \frac{\sum_{i=1}^{m^*} W_i^*}{\sum_{i=1}^{m^*} \dfrac{W_i^*}{i} \left\lceil \dfrac{i}{n} \right\rceil + Q(n)} = \frac{W_1 + G(n) W_n}{W_1 + G(n) W_n / n}$$

where the last equality holds in the simplified two-mode case.
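The two-mode form makes the relationship to the other laws explicit: choosing G(n) recovers each model, as the next slide notes. A sketch (names and fractions are ours):

```python
def memory_bounded_speedup(W1, Wn, n, G):
    """Two-mode memory-bounded (Sun-Ni style) speedup:
    S*_n = (W1 + G(n) * Wn) / (W1 + G(n) * Wn / n)."""
    g = G(n)
    return (W1 + g * Wn) / (W1 + g * Wn / n)

# G(n) = 1 recovers Amdahl; G(n) = n recovers Gustafson (made-up 5%/95% split).
amdahl = memory_bounded_speedup(0.05, 0.95, 8, lambda n: 1)
gustafson = memory_bounded_speedup(0.05, 0.95, 8, lambda n: n)
print(round(amdahl, 1), round(gustafson, 2))   # 5.9 7.65
```

Any G(n) between 1 and n interpolates between the fixed-load and fixed-time models; G(n) > n exceeds both.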
Relating Speedup Models
- G(n) reflects the increase in workload as memory increases n times
- G(n) = 1 : fixed problem size (Amdahl)
- G(n) = n : workload increases n times when memory is increased n times (Gustafson)
- G(n) > n : workload increases faster than the memory requirement
Scalability Metrics
- Machine size (n) : # of processors
- Clock rate (f) : determines the basic machine cycle
- Problem size (s) : amount of computational workload; directly proportional to T(s,1)
- CPU time (T(s,n)) : actual CPU time for execution
- I/O demand (d) : demand in moving the program, data, and results for a given run
Interpreting Scalability Function
[Figure: memory needed per processor vs. number of processors. Isoefficiency functions Cp log p, Cp, C log p, and C are plotted against the memory-size limit; functions that stay below the limit can maintain efficiency, while those above it cannot.]
Scalability Metrics
- Memory capacity (m) : max # of memory words demanded
- Communication overhead (h(s,n)) : amount of time for interprocessor communication, synchronization, etc.
- Computer cost (c) : total cost of h/w and s/w resources required
- Programming overhead (p) : development overhead associated with an application program
Speedup and Efficiency
The problem size is the independent parameter
$$S(s,n) = \frac{T(s,1)}{T(s,n) + h(s,n)}, \qquad E(s,n) = \frac{S(s,n)}{n}$$
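These definitions translate directly into code; the timing figures below are invented for illustration:

```python
def speedup_and_efficiency(T_s1, T_sn, h_sn, n):
    """S(s,n) = T(s,1) / (T(s,n) + h(s,n)) and E(s,n) = S(s,n) / n."""
    S = T_s1 / (T_sn + h_sn)
    return S, S / n

# Hypothetical run: 100 s serial; 20 s compute + 5 s overhead on 8 processors.
S, E = speedup_and_efficiency(100.0, 20.0, 5.0, 8)
print(S, E)   # 4.0 0.5
```

Note how the overhead term h(s,n) enters the denominator: without it the speedup would be 5, but communication costs pull it down to 4.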
Scalable Systems
Ideally, if E(s,n) = 1 for all algorithms and any s and n, the system is scalable. Practically, we consider the scalability of a machine:
$$\Phi(s,n) = \frac{S(s,n)}{S_I(s,n)} = \frac{T_I(s,n)}{T(s,n)}$$

where $S_I(s,n)$ and $T_I(s,n)$ denote the speedup and execution time on an idealized machine.
Summary (2)
Some factors preventing linear speedup:

- Serial operations
- Communication operations
- Process start-up
- Imbalanced workloads
- Architectural limitations