Understanding Application Scaling
NAS Parallel Benchmarks 2.2 on NOW and SGI Origin 2000
Frederick Wong, Rich Martin, Remzi Arpaci-Dusseau, David Wu, and David Culler
{fredwong, rmartin, remzi, davidwu, culler}@CS.Berkeley.EDU
Department of Electrical Engineering and Computer Science
Computer Science Division
University of California, Berkeley
June 15th, 1998
Introduction
- The NAS Parallel Benchmarks suite 2.2 (NPB) has been widely used to evaluate modern parallel systems
- 7 scientific benchmarks that represent the most common computation kernels
- NPB is written on top of the Message Passing Interface (MPI) for portability
- NPB is a Constant Problem Size (CPS) scaling benchmark suite
- This study focuses on understanding NPB scaling on both NOW and the SGI Origin 2000
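Under CPS scaling the total problem stays fixed while processors are added, so the usual speedup and parallel-efficiency definitions apply directly. A minimal sketch of those metrics; the 32-node time of 100 s below is a made-up illustration, not a result from the study (only the 2469 s single-node LU-on-NOW time appears later in the slides):

```python
# Constant Problem Size (CPS) scaling: total work is fixed, so adding
# processors shrinks per-processor work. Speedup and efficiency follow
# from measured running times.

def speedup(t1, tp):
    """Speedup of a p-processor run (time tp) over the 1-processor run (t1)."""
    return t1 / tp

def efficiency(t1, tp, p):
    """Fraction of ideal linear speedup achieved on p processors."""
    return speedup(t1, tp) / p

# e.g. a job taking 2469 s on 1 node and (hypothetically) 100 s on 32 nodes:
print(speedup(2469, 100))          # 24.69x
print(efficiency(2469, 100, 32))   # ~0.77
```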
Speedup on NOW
[Figure: speedup vs. nodes (1-32) for LU, MG, and SP on NOW, against ideal linear speedup]
Motivation
- An early study of NPB shows ideal speedup on NOW!
- Scaling as good as the T3D and better than the SP-2
- Per-node performance better than the T3D, close to the SP-2
Speedup on SGI Origin 2000
[Figure: speedup vs. nodes (1-32) for LU, MG, and SP on the SGI Origin 2000, against ideal linear speedup]
Submitted results for Origin 2000 show a spread
Presentation Outline
- Hardware Configuration
- Time Breakdown of the Applications
- Communication Performance
- Computation Performance
- Conclusion
Hardware Configuration
SGI Origin 2000 (64 nodes)
- MIPS R10000 processor, 195 MHz, 32KB/32KB L1 caches
- 4MB external L2 cache per processor
- 16GB memory total
- MPI performance: 13 µsec one-way latency, 150 MB/s peak bandwidth, half-power point at 8KB message size

Network Of Workstations (NOW)
- UltraSPARC I processor, 167 MHz, 16KB/16KB L1 caches
- 512KB external L2 cache per processor
- 128 MB memory per processor
- MPI performance: 22 µsec one-way latency, 27 MB/s peak bandwidth, half-power point at 4KB message size
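A rough sanity check on these half-power points, assuming the usual linear cost model T(n) = L + n/B (an assumption of this sketch, not stated in the slides). Under that model the delivered bandwidth n/T(n) reaches half of peak at n = L*B:

```python
# Half-power message size under a simple linear latency/bandwidth model.
# With T(n) = L + n/B, delivered bandwidth n/T(n) equals B/2 at n = L*B.
# Latency and peak-bandwidth figures are the slides' measured numbers.

def n_half(latency_s, peak_bw_bytes_per_s):
    """Message size (bytes) at which delivered bandwidth is half the peak."""
    return latency_s * peak_bw_bytes_per_s

origin = n_half(13e-6, 150e6)   # SGI Origin 2000
now = n_half(22e-6, 27e6)       # Berkeley NOW

print(f"Origin 2000: {origin/1024:.1f} KB, NOW: {now/1024:.2f} KB")
# -> Origin 2000: 1.9 KB, NOW: 0.58 KB
```

The simple model predicts roughly 2 KB and 0.6 KB, well below the measured 8 KB and 4 KB half-power points, which suggests per-message costs (protocol handshakes, copies) beyond raw wire latency on both machines.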
Time Breakdown -- LU
[Figure: time breakdown of LU on NOW -- cumulative, computation, communication, and ideal time (seconds) vs. processors (1-32)]
- Black line -- total (cumulative) running time
- Analogy: a one-man job of 10 secs ideally requires 5 secs for 2 men; the total amount of work stays 10 secs
- Any work beyond that is extra: more work means the need to communicate
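The decomposition behind these breakdown charts can be sketched as follows; the timing numbers in the usage example are made up for illustration:

```python
# Decomposing a measured parallel run into the slide's categories:
# ideal time (perfect speedup), extra work introduced by scaling, and
# communication time.

def time_breakdown(t1, p, t_measured, t_comm):
    """Split measured time on p processors into (ideal, extra work, comm)."""
    t_ideal = t1 / p               # the one-man 10 s job shared by p men
    t_comp = t_measured - t_comm   # time actually spent computing
    extra = t_comp - t_ideal       # extra work caused by parallelization
    return t_ideal, extra, t_comm

# e.g. a 10 s sequential job measured at 6.0 s on 2 processors,
# 0.5 s of which is communication:
print(time_breakdown(10.0, 2, 6.0, 0.5))   # -> (5.0, 0.5, 0.5)
```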
[Figure: time breakdown of LU on the SGI Origin 2000 -- cumulative, computation, communication, and ideal time (seconds) vs. processors (1-32)]
Time Breakdown -- SP
[Figure: time breakdown of SP on NOW -- cumulative, computation, communication, and ideal time (seconds) vs. processors (1-25)]
[Figure: time breakdown of SP on the SGI Origin 2000 -- cumulative, computation, communication, and ideal time (seconds) vs. processors (1-25)]
Communication Performance
Micro-benchmarks show that the SGI Origin 2000 has better pt2pt communication performance when compared to NOW
[Figure: MPI pt2pt one-way latency (µsec) vs. message size (bytes, log-log), Origin 2000 vs. NOW]
[Figure: MPI pt2pt one-way bandwidth (MB/sec) vs. message size (bytes), SGI vs. NOW, with half-power points marked]
Communication Efficiency
[Figure: communication efficiency (%) vs. processors (0-40) for NOW-LU, SGI-LU, NOW-SP, and SGI-SP]
- Absolute bandwidth delivered is close: SP/32 on NOW -- 215 s; SP/32 on SGI -- 289 s
- Communication efficiency on SGI achieves only 30% of the potential bandwidth
- Protocol tradeoffs are pronounced: hand-shake vs. bulk-send in pt2pt and collective ops
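The efficiency metric plotted here can be sketched as delivered bandwidth during the application's communication phases divided by the micro-benchmark peak; the byte counts and times below are illustrative, not measurements from the study:

```python
# Communication efficiency: bandwidth the application actually achieves
# during communication, as a fraction of the pt2pt micro-benchmark peak.

def comm_efficiency(bytes_moved, comm_time_s, peak_bw_bytes_per_s):
    """Fraction of peak bandwidth delivered to the application."""
    delivered = bytes_moved / comm_time_s
    return delivered / peak_bw_bytes_per_s

# e.g. 1 GB moved in 10 s of communication on a 150 MB/s network:
print(comm_efficiency(1_000_000_000, 10, 150e6))   # ~0.67, i.e. 67% of peak
```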
Computation Performance
- Relative performance of the benchmarks on a single node is roughly close to the processor performance difference

  Single-node time (s)   LU     SP
  SGI                    1373   1652
  NOW                    2469   2807

- Both computational CPI and L2 misses change significantly on both platforms when scaled

                         LU     SP
  CPI decrease           94%    93%
  L2 misses decrease     25%    27%
LU Working Set
[Figure: miss rate (%) vs. cache size (1-10000 KB, log scale) for 4-, 8-, 16-, and 32-node runs]
- 4-processor knee starts at 256KB
- 8-processor knee starts at 128KB
- 16-processor knee starts at 64KB
- 32-processor knee starts at 32KB
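The knee halves each time the processor count doubles, which is consistent with the dominant working set being each processor's share of a fixed-size (CPS-scaled) problem. A sketch of that relationship, anchored at the measured 4-processor knee:

```python
# Under CPS scaling the per-processor working set is ~1/p of the fixed
# problem, so the cache-size knee should halve as p doubles.

def knee_kb(p, knee_at_4proc_kb=256):
    """Estimated knee (KB) for p processors, assuming working set ~ 1/p."""
    return knee_at_4proc_kb * 4 // p

for p in (4, 8, 16, 32):
    print(p, knee_kb(p))   # 256, 128, 64, 32 KB -- matching the measured knees
```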
- Miss rate drops as the global cache grows from 2MB to 4MB
[Figure: miss rate (%) vs. cache size (KB, log scale) for 4-, 8-, 16-, and 32-node runs]
Cost under scaling: the extra work worsens the memory system's performance
SP Working Set
Total memory references on SGI:
- 4-processor run: 64.38 billion memory references
- 25-processor run: 72.35 billion memory references
- a 12.38% increase
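The 12.38% figure checks out as the relative growth in total memory references as SP scales from 4 to 25 processors:

```python
# Relative increase in total memory references for SP on the SGI
# when scaling from 4 to 25 processors (figures from the slide).
refs_4p = 64.38e9
refs_25p = 72.35e9
increase = (refs_25p - refs_4p) / refs_4p
print(f"{increase:.2%}")   # -> 12.38%
```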
Cost/Benefit
Conclusion
NPB:
- µ-benchmarks make it hard to predict comm. performance
- global cache increases effectively reduce computation time
- sequential node architecture is a dominant factor in NPB performance
NOW:
- an inexpensive way to go parallel; absolute performance is excellent
- MPI on NOW has good scalability and performance
- NOW vs. proprietary systems -- detailed instrumentation ability
Speedup cannot tell the whole story; scalability involves:
- the interplay of program and machine scaling
- delivered comm. performance, not µ-benchmarks
- complicated memory system performance