TRANSCRIPT
IFS Benchmark with Federation Switch
John Hague, IBM
Introduction
• Federation has dramatically improved POWER4 p690 communication, so:
– Measure Federation performance with Small Pages and Large Pages using a simulation program
– Compare Federation and pre-Federation (Colony) performance of IFS
– Compare Federation performance of IFS with and without Large Pages and Memory Affinity
– Examine IFS communication using MPI profiling
Colony v Federation
• Colony (hpca)
– 1.3GHz 32-processor p690s
– Four 8-processor Affinity LPARs per p690
  • Needed to get communication performance
– Two 180MB/s adapters per LPAR
• Federation (hpcu)
– 1.7GHz p690s
– One 32-processor LPAR per p690
– Memory and MPI MCM Affinity
  • MPI Task and Memory from same MCM
  • Slightly better than binding task to specific processor
– Two 2-link 1.2GB/s Federation adapters per p690
  • Four 1.2GB/s links per node
IFS Communication: transpositions
1. MPI Alltoall in all rows simultaneously
   • Mostly shared memory
2. MPI Alltoall in all columns simultaneously
(a minimal code sketch of this pattern appears after the diagram)
[Diagram: layout of MPI tasks across nodes, illustrating the Alltoall within a row of tasks on one node and the Alltoall within a column of tasks across nodes.]
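The two-phase pattern can be sketched in a few lines of MPI. This is an illustration, not the IFS source: the nrows x ncols task grid, the communicator construction and the chunk size are all assumptions.

  /* Two-phase transposition sketch: MPI_Alltoall within rows (shared memory),
     then MPI_Alltoall within columns (switch).  Illustrative only. */
  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int ncols = 4;                          /* assumed tasks per node: one grid row per node */
      int nrows = size / ncols;
      int row = rank / ncols, col = rank % ncols;

      MPI_Comm row_comm, col_comm;
      MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);   /* tasks on the same node */
      MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);   /* tasks on different nodes */

      int chunk = 100000;                     /* illustrative chunk size, in doubles */
      double *sr = calloc((size_t)chunk * ncols, sizeof *sr);
      double *rr = calloc((size_t)chunk * ncols, sizeof *rr);
      double *sc = calloc((size_t)chunk * nrows, sizeof *sc);
      double *rc = calloc((size_t)chunk * nrows, sizeof *rc);

      /* 1. Alltoall in all rows simultaneously (mostly shared memory) */
      MPI_Alltoall(sr, chunk, MPI_DOUBLE, rr, chunk, MPI_DOUBLE, row_comm);
      /* 2. Alltoall in all columns simultaneously (over the switch) */
      MPI_Alltoall(sc, chunk, MPI_DOUBLE, rc, chunk, MPI_DOUBLE, col_comm);

      free(sr); free(rr); free(sc); free(rc);
      MPI_Comm_free(&row_comm);
      MPI_Comm_free(&col_comm);
      MPI_Finalize();
      return 0;
  }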
Simulation of transpositions
• All transpositions in a “row” use shared memory
• All transpositions in a “column” use the switch
• Number of MPI tasks per node varied
– But all processors used, via OpenMP threads
• Bandwidth measured for MPI_Sendrecv calls
– Buffers allocated and filled by threads between each call
• Large Pages give best switch performance
– With current switch software
(a sketch of the measurement loop follows)
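A minimal sketch of such a measurement (not the actual simulation program; the pairing of tasks, message size and iteration count are assumptions):

  /* Bandwidth sketch: each task exchanges a buffer with a partner via
     MPI_Sendrecv, and OpenMP threads refill the buffer between calls so
     that all processors are used.  Illustrative only; assumes an even
     number of MPI tasks. */
  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      long nbytes = 1 << 20;                  /* illustrative message size */
      int  niter  = 50;
      char *sbuf  = malloc(nbytes), *rbuf = malloc(nbytes);
      int  partner = rank ^ 1;                /* illustrative pairing of tasks */

      double t0 = MPI_Wtime();
      for (int it = 0; it < niter; it++) {
          /* Threads fill the send buffer between calls */
          #pragma omp parallel for
          for (long i = 0; i < nbytes; i++)
              sbuf[i] = (char)(i + it);

          MPI_Sendrecv(sbuf, (int)nbytes, MPI_BYTE, partner, 0,
                       rbuf, (int)nbytes, MPI_BYTE, partner, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }
      double t1 = MPI_Wtime();

      if (rank == 0)
          printf("approx. %.1f MB/s per task (send+recv)\n",
                 2.0 * nbytes * niter / (t1 - t0) / 1.0e6);

      free(sbuf); free(rbuf);
      MPI_Finalize();
      return 0;
  }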
“Transposition” Bandwidth per link (8 nodes, 4 links/node, 8 tasks/node, 4 threads/task, 2 tasks/link)
[Chart: MB/sec per link (0 to 1600) versus message size in bytes (100 to 10,000,000) for four cases: LP with EAGER_LIMIT=64K, LP with MIN_BULK=50K, LP base, and SP. SP = Small Pages; LP = Large Pages.]
“Transposition” Bandwidth per link (8 nodes, 4 links/node)
[Chart: MB/sec per link (0 to 1600) versus message size in bytes (100 to 10,000,000) for 32, 16, 8 and 4 tasks per node. Multiple threads ensure all processors are used.]
hpcu v hpca with IFS
• Benchmark jobs (provided 3 years ago)
– Same executable used on hpcu and hpca
– 256 processors used
– All jobs run with mpi_profiling (and barriers before data exchange)
                     Procs     Grid Points   hpca (sec)   hpcu (sec)   Speedup
  T399               10x1_4       213988        5828         3810        1.52
  T799               16x8_2       843532        9907         5527        1.79
  4D-Var T511/T255   16x8_2                     4869         2737        1.78
IFS Speedups: hpcu v hpca
[Three charts: Total speedup, Communication speedup and CPU speedup of hpcu over hpca for T799, T399 and 4D-Var, each plotted for SP no MA, SP w MA, LP no MA and LP w MA (y-axis 1 to 2 for Total and CPU speedup, 1 to 5 for Communication speedup).]
LP = Large Pages; SP = Small Pages; MA = Memory Affinity
LP/SP & MA/noMA CPU comparison
[Chart: percentage CPU improvement (-5% to 15%) for LP/SP no MA, LP/SP w MA, MA/noMA w SP and MA/noMA w LP, shown for T799, T399 and 4D-Var.]
LP/SP & MA/noMA Comms comparison
[Chart: percentage communication improvement (-20% to 120%) for LP/SP no MA, LP/SP w MA, MA/noMA w SP and MA/noMA w LP, shown for T799, T399 and 4D-Var.]
Percentage Communication
[Chart: percentage of time spent in communication (0% to 50%) for T799, T399 and 4D-Var: SP no MA on hpca, and SP no MA, SP w MA, LP no MA, LP w MA on hpcu.]
Extra Memory needed by Large Pages
Large Pages are allocated in Real Memory in segments of 256 MB
• MPI_INIT
– 80MB which may not be used
– MP_BUFFER_MEM (default 64MB) can be reduced
– MPI_BUFFER_ALLOCATE needs memory which may not be used
• OpenMP threads
– Stack allocated with XLSMPOPTS="stack=…" may not be used
• Fragmentation
– Memory is "wasted"
• Last 256 MB segment
– Only a small part of it may be used
(a rough worked example of this rounding follows)
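A back-of-the-envelope illustration of the 256 MB granularity (the component sizes below are made-up numbers, not measurements from IFS):

  /* Illustrative arithmetic only: large-page data is reserved in whole
     256 MB segments, so the last segment is usually only partly used
     (fragmentation wastes further memory on top of this). */
  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
      const double SEG = 256.0;                        /* MB per large-page segment */
      double need[] = { 80.0, 64.0, 300.0, 150.0 };    /* hypothetical components, MB */
      double total  = 0.0;
      for (int i = 0; i < 4; i++) total += need[i];

      int segs = (int)ceil(total / SEG);               /* round up to whole segments */
      printf("need %.0f MB -> %d segments = %.0f MB reserved (%.0f MB unused)\n",
             total, segs, segs * SEG, segs * SEG - total);
      return 0;
  }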
mpi_profile
• Examine IFS communication using MPI profiling
– Use libmpiprof.a
– Calls and MB/s rate for each type of call
  • Overall
  • For each higher-level subroutine
– Histogram of block size for each type of call
(a sketch of the underlying PMPI mechanism follows)
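libmpiprof.a produces the tables on the following slides; the fragment below is not its source, only a minimal illustration of the standard PMPI interposition such a library is built on (here counting calls and bytes for MPI_Send):

  /* Minimal PMPI interposition sketch (illustrative, not libmpiprof.a):
     wrap MPI_Send to count calls and bytes, and report at MPI_Finalize. */
  #include <mpi.h>
  #include <stdio.h>

  static long long send_calls = 0, send_bytes = 0;

  int MPI_Send(const void *buf, int count, MPI_Datatype type,
               int dest, int tag, MPI_Comm comm)
  {
      int size;
      MPI_Type_size(type, &size);
      send_calls++;
      send_bytes += (long long)count * size;
      return PMPI_Send(buf, count, type, dest, tag, comm);
  }

  int MPI_Finalize(void)
  {
      int rank;
      PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
      printf("task %d: MPI_Send #calls=%lld  avg. bytes=%.1f\n",
             rank, send_calls,
             send_calls ? (double)send_bytes / send_calls : 0.0);
      return PMPI_Finalize();
  }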
mpi_profile for T799
128 MPI tasks, 2 threads
WALL time = 5495 sec
--------------------------------------------------------------
MPI Routine       #calls    avg. bytes      Mbytes    time(sec)
--------------------------------------------------------------
MPI_Send           49784       52733.2      2625.3       7.873
MPI_Bsend           6171      454107.3      2802.3       1.331
MPI_Isend          84524     1469867.4    124239.1       1.202
MPI_Recv           91940     1332252.1    122487.3     359.547
MPI_Waitall        75884           0.0         0.0      59.772
MPI_Bcast            362          26.6         0.0       0.028
MPI_Barrier         9451           0.0         0.0     436.818
--------------------------------------------------------------
TOTAL                                                  866.574
--------------------------------------------------------------
Barrier indicates load imbalance
mpi_profile for 4D_Var min0
128 MPI tasks, 2 threads
WALL time = 1218 sec
--------------------------------------------------------------
MPI Routine       #calls    avg. bytes      Mbytes    time(sec)
--------------------------------------------------------------
MPI_Send           43995        7222.9       317.8       1.033
MPI_Bsend          38473       13898.4       534.7       0.843
MPI_Isend         326703      168598.3     55081.6       6.368
MPI_Recv          432364      127061.8     54936.9     220.877
MPI_Waitall       276222           0.0         0.0      23.166
MPI_Bcast            288      374491.7       107.9       0.490
MPI_Barrier        27062           0.0         0.0      94.168
MPI_Allgatherv       466      285958.8       133.3      26.250
MPI_Allreduce       1325          73.2         0.1       1.027
--------------------------------------------------------------
TOTAL                                                  374.223
--------------------------------------------------------------
Barrier indicates load imbalance
MPI Profiles for send/recv
[Three histograms of the number of send/recv calls versus message size (KBytes): T799, T399 (hpca) and 4D-Var min0.]
mpi_profiles for recv/send
  T799 (4 tasks per link)           Avg MB     MB/s per task
                                               hpca     hpcu
    trltom  (inter node)              1.84       35      224
    trltog  (shrd memory)             4.00      116      890
    slcomm2 (halo)                    0.66       65      363

  4D-Var min0 (4 tasks per link)    Avg MB     MB/s per task
    trltom  (inter node)              0.167        160
    trltog  (shrd memory)             0.373        490
    slcomm2 (halo)                    0.088        222
Conclusions
• Speedups of hpcu over hpca

    Large Pages   Memory Affinity   Speedup
        N               N           1.32 – 1.60
        Y               N           1.43 – 1.62
        N               Y           1.47 – 1.78
        Y               Y           1.52 – 1.85
• Best Environment Variables
– MPI.network=ccc0 (instead of cccs)
– MEMORY_AFFINITY=yes
– MP_AFFINITY=MCM  ! with new pvmd
– MP_BULK_MIN_MSG_SIZE=50000
– LDR_CNTRL="LARGE_PAGE_DATA=Y": don’t use – else system calls in LP very slow
– MP_EAGER_LIMIT=64K
hpca v hpcu
                             -------- Time (sec) --------    -------- Speedup --------      %
         I/O*    LP   Aff     Total     CPU      Comms        Total     CPU     Comms     Comms
min0:
  hpca   ***      N    N       2499     1408      1091                                     43.6
  hpcu   H+/22    N    N       1502     1119       383         1.66     1.26     2.85      25.5
  hpcu   H+/21    N    Y       1321      951       370         1.89     1.48     2.95      28.0
  hpcu   H+/20    Y    N       1444     1165       279         1.73     1.21     3.91      19.3
  hpcu   H+/19    Y    Y       1229      962       267         2.03     1.46     4.08      21.7
min1:
  hpca   ***      N    N       1649     1065       584                                     43.6
  hpcu   H+/22    N    N       1033      825       208         1.60     1.29     2.81      20.1
  hpcu   H+/21    N    Y        948      734       214         1.74     1.45     2.73      22.5
  hpcu   H+/15    Y    N       1019      856       163         1.62     1.24     3.58      16.0
  hpcu   H+/19    Y    Y        914      765       149         1.80     1.39     3.91      16.3
Conclusions
• Memory Affinity with binding (see the sketch after this slide’s bullets)
– Program binds to: MOD(task_id*nthrds+thrd_id, 32), or
– Use new /usr/lpp/ppe.poe/bin/pmdv4
– How to bind if whole node not used?
– Try VSRAC code from Montpellier
– Bind adapter link to MCM?
• Large Pages
– Advantages
  • Need LP for best communication B/W with current software
– Disadvantages
  • Uses extra memory (4GB more per node in 4D-Var min1)
  • LoadLeveler scheduling
– Prototype switch software indicates Large Pages not necessary
• Collective Communication
– To be investigated
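A minimal sketch of the binding rule quoted above (not the IFS or pmdv4 implementation; binding per OpenMP thread with the AIX bindprocessor() call is an assumption for illustration):

  /* Each OpenMP thread of each MPI task binds itself to logical CPU
     MOD(task_id*nthrds + thrd_id, 32), using the AIX bindprocessor() call. */
  #include <sys/processor.h>    /* bindprocessor(), BINDTHREAD (AIX) */
  #include <sys/thread.h>       /* thread_self() (AIX) */
  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int task_id;
      MPI_Comm_rank(MPI_COMM_WORLD, &task_id);

      #pragma omp parallel
      {
          int nthrds  = omp_get_num_threads();
          int thrd_id = omp_get_thread_num();
          int cpu     = (task_id * nthrds + thrd_id) % 32;   /* 32 CPUs per p690 */

          /* Bind the calling kernel thread to the chosen logical CPU */
          if (bindprocessor(BINDTHREAD, thread_self(), cpu) != 0)
              perror("bindprocessor");
      }

      MPI_Finalize();
      return 0;
  }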
Linux compared to PWR4 for IFS
• Linux (run by Peter Mayes)
– Opteron, 2GHz, 2 CPUs/node, 6GB/node, Myrinet switch
– Portland Group compiler
– Compiler flags: -O3 -Mvect=sse
– No code optimisation or OpenMP
– Linux 1: 1 CPU/node, Myrinet IP
– Linux 1A: 1 CPU/node, Myrinet GM
– Linux 2: using 2 CPUs/node
• IBM POWER4
– MPI (intra-node shared memory) and OpenMP
– Compiler flags: -O3 -qstrict
– hpca: 1.3GHz p690, 8 CPUs/node, 8GB/node, Colony switch
– hpcu: 1.7GHz p690, 32 CPUs/node, 32GB/node, Federation switch
Linux compared to Pwr4
[Chart: seconds for steps 1 to 11 (0 to 1000) versus number of processors (0 to 70) for T511 on Linux 1, Linux 1A, hpca and hpcu, and for T159 on Linux 2, Linux 1, hpca and hpcu.]