TRANSCRIPT
IFS Benchmark with Federation Switch
John Hague, IBM
Introduction
• Federation has dramatically improved POWER4 p690 communication, so:
– Measure Federation performance with Small Pages and Large Pages using a simulation program
– Compare Federation and pre-Federation (Colony) performance of IFS
– Compare Federation performance of IFS with and without Large Pages and Memory Affinity
– Examine IFS communication using MPI profiling
Colony v Federation
• Colony (hpca)
– 1.3GHz 32-processor p690s
– Four 8-processor Affinity LPARs per p690
  • Needed to get communication performance
– Two 180MB/s adapters per LPAR
• Federation (hpcu)
– 1.7GHz p690s
– One 32-processor LPAR per p690
– Memory and MPI MCM Affinity
  • MPI Task and Memory from same MCM
  • Slightly better than binding task to specific processor
– Two 2-link 1.2GB/s Federation adapters per p690
  • Four 1.2GB/s links per node
IFS Communication: transpositions
1. MPI Alltoall in all rows simultaneously
   • Mostly shared memory
2. MPI Alltoall in all columns simultaneously
(a minimal code sketch of this pattern appears after the diagram)
[Diagram: layout of MPI tasks across nodes, illustrating the Alltoall within a row of tasks on one node and the Alltoall within a column of tasks across nodes.]
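The two-phase pattern can be sketched in a few lines of MPI. This is an illustration, not the IFS source: the nrows x ncols task grid, the communicator construction and the chunk size are all assumptions.

  /* Two-phase transposition sketch: MPI_Alltoall within rows (shared memory),
     then MPI_Alltoall within columns (switch).  Illustrative only. */
  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int ncols = 4;                          /* assumed tasks per node: one grid row per node */
      int nrows = size / ncols;
      int row = rank / ncols, col = rank % ncols;

      MPI_Comm row_comm, col_comm;
      MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);   /* tasks on the same node */
      MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);   /* tasks on different nodes */

      int chunk = 100000;                     /* illustrative chunk size, in doubles */
      double *sr = calloc((size_t)chunk * ncols, sizeof *sr);
      double *rr = calloc((size_t)chunk * ncols, sizeof *rr);
      double *sc = calloc((size_t)chunk * nrows, sizeof *sc);
      double *rc = calloc((size_t)chunk * nrows, sizeof *rc);

      /* 1. Alltoall in all rows simultaneously (mostly shared memory) */
      MPI_Alltoall(sr, chunk, MPI_DOUBLE, rr, chunk, MPI_DOUBLE, row_comm);
      /* 2. Alltoall in all columns simultaneously (over the switch) */
      MPI_Alltoall(sc, chunk, MPI_DOUBLE, rc, chunk, MPI_DOUBLE, col_comm);

      free(sr); free(rr); free(sc); free(rc);
      MPI_Comm_free(&row_comm);
      MPI_Comm_free(&col_comm);
      MPI_Finalize();
      return 0;
  }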
Simulation of transpositions
• All transpositions in a “row” use shared memory
• All transpositions in a “column” use the switch
• Number of MPI tasks per node varied
– But all processors used, via OpenMP threads
• Bandwidth measured for MPI_Sendrecv calls
– Buffers allocated and filled by threads between each call
• Large Pages give best switch performance
– With current switch software
(a sketch of the measurement loop follows)
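A minimal sketch of such a measurement (not the actual simulation program; the pairing of tasks, message size and iteration count are assumptions):

  /* Bandwidth sketch: each task exchanges a buffer with a partner via
     MPI_Sendrecv, and OpenMP threads refill the buffer between calls so
     that all processors are used.  Illustrative only; assumes an even
     number of MPI tasks. */
  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      long nbytes = 1 << 20;                  /* illustrative message size */
      int  niter  = 50;
      char *sbuf  = malloc(nbytes), *rbuf = malloc(nbytes);
      int  partner = rank ^ 1;                /* illustrative pairing of tasks */

      double t0 = MPI_Wtime();
      for (int it = 0; it < niter; it++) {
          /* Threads fill the send buffer between calls */
          #pragma omp parallel for
          for (long i = 0; i < nbytes; i++)
              sbuf[i] = (char)(i + it);

          MPI_Sendrecv(sbuf, (int)nbytes, MPI_BYTE, partner, 0,
                       rbuf, (int)nbytes, MPI_BYTE, partner, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }
      double t1 = MPI_Wtime();

      if (rank == 0)
          printf("approx. %.1f MB/s per task (send+recv)\n",
                 2.0 * nbytes * niter / (t1 - t0) / 1.0e6);

      free(sbuf); free(rbuf);
      MPI_Finalize();
      return 0;
  }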
“Transposition” Bandwidth per link (8 nodes, 4 links/node, 8 tasks/node, 4 threads/task, 2 tasks/link)
[Chart: MB/sec per link (0 to 1600) versus message size in bytes (100 to 10,000,000) for four cases: LP with EAGER_LIMIT=64K, LP with MIN_BULK=50K, LP base, and SP. SP = Small Pages; LP = Large Pages.]
“Transposition” Bandwidth per link (8 nodes, 4 links/node)
[Chart: MB/sec per link (0 to 1600) versus message size in bytes (100 to 10,000,000) for 32, 16, 8 and 4 tasks per node. Multiple threads ensure all processors are used.]
hpcu v hpca with IFS
• Benchmark jobs (provided 3 years ago)
– Same executable used on hpcu and hpca
– 256 processors used
– All jobs run with mpi_profiling (and barriers before data exchange)
                     Procs     Grid Points   hpca (sec)   hpcu (sec)   Speedup
  T399               10x1_4       213988        5828         3810        1.52
  T799               16x8_2       843532        9907         5527        1.79
  4D-Var T511/T255   16x8_2                     4869         2737        1.78
IFS Speedups: hpcu v hpca
[Three charts: Total speedup, Communication speedup and CPU speedup of hpcu over hpca for T799, T399 and 4D-Var, each plotted for SP no MA, SP w MA, LP no MA and LP w MA (y-axis 1 to 2 for Total and CPU speedup, 1 to 5 for Communication speedup).]
LP = Large Pages; SP = Small Pages; MA = Memory Affinity
LP/SP & MA/noMA CPU comparison
[Chart: percentage CPU improvement (-5% to 15%) for LP/SP no MA, LP/SP w MA, MA/noMA w SP and MA/noMA w LP, shown for T799, T399 and 4D-Var.]
LP/SP & MA/noMA Comms comparison
[Chart: percentage communication improvement (-20% to 120%) for LP/SP no MA, LP/SP w MA, MA/noMA w SP and MA/noMA w LP, shown for T799, T399 and 4D-Var.]
Percentage Communication
[Chart: percentage of time spent in communication (0% to 50%) for T799, T399 and 4D-Var: SP no MA on hpca, and SP no MA, SP w MA, LP no MA, LP w MA on hpcu.]
Extra Memory needed by Large Pages
Large Pages are allocated in Real Memory in segments of 256 MB
• MPI_INIT
– 80MB which may not be used
– MP_BUFFER_MEM (default 64MB) can be reduced
– MPI_BUFFER_ALLOCATE needs memory which may not be used
• OpenMP threads
– Stack allocated with XLSMPOPTS="stack=…" may not be used
• Fragmentation
– Memory is "wasted"
• Last 256 MB segment
– Only a small part of it may be used
(a rough worked example of this rounding follows)
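A back-of-the-envelope illustration of the 256 MB granularity (the component sizes below are made-up numbers, not measurements from IFS):

  /* Illustrative arithmetic only: large-page data is reserved in whole
     256 MB segments, so the last segment is usually only partly used
     (fragmentation wastes further memory on top of this). */
  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
      const double SEG = 256.0;                        /* MB per large-page segment */
      double need[] = { 80.0, 64.0, 300.0, 150.0 };    /* hypothetical components, MB */
      double total  = 0.0;
      for (int i = 0; i < 4; i++) total += need[i];

      int segs = (int)ceil(total / SEG);               /* round up to whole segments */
      printf("need %.0f MB -> %d segments = %.0f MB reserved (%.0f MB unused)\n",
             total, segs, segs * SEG, segs * SEG - total);
      return 0;
  }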
mpi_profile
• Examine IFS communication using MPI profiling
– Use libmpiprof.a
– Calls and MB/s rate for each type of call
  • Overall
  • For each higher-level subroutine
– Histogram of block size for each type of call
(a sketch of the underlying PMPI mechanism follows)
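libmpiprof.a produces the tables on the following slides; the fragment below is not its source, only a minimal illustration of the standard PMPI interposition such a library is built on (here counting calls and bytes for MPI_Send):

  /* Minimal PMPI interposition sketch (illustrative, not libmpiprof.a):
     wrap MPI_Send to count calls and bytes, and report at MPI_Finalize. */
  #include <mpi.h>
  #include <stdio.h>

  static long long send_calls = 0, send_bytes = 0;

  int MPI_Send(const void *buf, int count, MPI_Datatype type,
               int dest, int tag, MPI_Comm comm)
  {
      int size;
      MPI_Type_size(type, &size);
      send_calls++;
      send_bytes += (long long)count * size;
      return PMPI_Send(buf, count, type, dest, tag, comm);
  }

  int MPI_Finalize(void)
  {
      int rank;
      PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
      printf("task %d: MPI_Send #calls=%lld  avg. bytes=%.1f\n",
             rank, send_calls,
             send_calls ? (double)send_bytes / send_calls : 0.0);
      return PMPI_Finalize();
  }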
mpi_profile for T799
128 MPI tasks, 2 threads
WALL time = 5495 sec
--------------------------------------------------------------
MPI Routine       #calls    avg. bytes      Mbytes    time(sec)
--------------------------------------------------------------
MPI_Send           49784       52733.2      2625.3       7.873
MPI_Bsend           6171      454107.3      2802.3       1.331
MPI_Isend          84524     1469867.4    124239.1       1.202
MPI_Recv           91940     1332252.1    122487.3     359.547
MPI_Waitall        75884           0.0         0.0      59.772
MPI_Bcast            362          26.6         0.0       0.028
MPI_Barrier         9451           0.0         0.0     436.818
--------------------------------------------------------------
TOTAL                                                  866.574
--------------------------------------------------------------
Barrier indicates load imbalance
mpi_profile for 4D_Var min0
128 MPI tasks, 2 threads
WALL time = 1218 sec
--------------------------------------------------------------
MPI Routine       #calls    avg. bytes      Mbytes    time(sec)
--------------------------------------------------------------
MPI_Send           43995        7222.9       317.8       1.033
MPI_Bsend          38473       13898.4       534.7       0.843
MPI_Isend         326703      168598.3     55081.6       6.368
MPI_Recv          432364      127061.8     54936.9     220.877
MPI_Waitall       276222           0.0         0.0      23.166
MPI_Bcast            288      374491.7       107.9       0.490
MPI_Barrier        27062           0.0         0.0      94.168
MPI_Allgatherv       466      285958.8       133.3      26.250
MPI_Allreduce       1325          73.2         0.1       1.027
--------------------------------------------------------------
TOTAL                                                  374.223
--------------------------------------------------------------
Barrier indicates load imbalance
MPI Profiles for send/recv
[Three histograms of the number of send/recv calls versus message size (KBytes): T799, T399 (hpca) and 4D-Var min0.]
mpi_profiles for recv/send
  T799 (4 tasks per link)           Avg MB     MB/s per task
                                               hpca     hpcu
    trltom  (inter node)              1.84       35      224
    trltog  (shrd memory)             4.00      116      890
    slcomm2 (halo)                    0.66       65      363

  4D-Var min0 (4 tasks per link)    Avg MB     MB/s per task
    trltom  (inter node)              0.167        160
    trltog  (shrd memory)             0.373        490
    slcomm2 (halo)                    0.088        222
Conclusions
• Speedups of hpcu over hpca

    Large Pages   Memory Affinity   Speedup
        N               N           1.32 – 1.60
        Y               N           1.43 – 1.62
        N               Y           1.47 – 1.78
        Y               Y           1.52 – 1.85
• Best Environment Variables
– MPI.network=ccc0 (instead of cccs)
– MEMORY_AFFINITY=yes
– MP_AFFINITY=MCM  ! with new pvmd
– MP_BULK_MIN_MSG_SIZE=50000
– LDR_CNTRL="LARGE_PAGE_DATA=Y": don’t use – else system calls in LP very slow
– MP_EAGER_LIMIT=64K
hpca v hpcu
                             -------- Time (sec) --------    -------- Speedup --------      %
         I/O*    LP   Aff     Total     CPU      Comms        Total     CPU     Comms     Comms
min0:
  hpca   ***      N    N       2499     1408      1091                                     43.6
  hpcu   H+/22    N    N       1502     1119       383         1.66     1.26     2.85      25.5
  hpcu   H+/21    N    Y       1321      951       370         1.89     1.48     2.95      28.0
  hpcu   H+/20    Y    N       1444     1165       279         1.73     1.21     3.91      19.3
  hpcu   H+/19    Y    Y       1229      962       267         2.03     1.46     4.08      21.7
min1:
  hpca   ***      N    N       1649     1065       584                                     43.6
  hpcu   H+/22    N    N       1033      825       208         1.60     1.29     2.81      20.1
  hpcu   H+/21    N    Y        948      734       214         1.74     1.45     2.73      22.5
  hpcu   H+/15    Y    N       1019      856       163         1.62     1.24     3.58      16.0
  hpcu   H+/19    Y    Y        914      765       149         1.80     1.39     3.91      16.3
Conclusions
• Memory Affinity with binding (see the sketch after this slide’s bullets)
– Program binds to: MOD(task_id*nthrds+thrd_id, 32), or
– Use new /usr/lpp/ppe.poe/bin/pmdv4
– How to bind if whole node not used?
– Try VSRAC code from Montpellier
– Bind adapter link to MCM?
• Large Pages
– Advantages
  • Need LP for best communication B/W with current software
– Disadvantages
  • Uses extra memory (4GB more per node in 4D-Var min1)
  • LoadLeveler scheduling
– Prototype switch software indicates Large Pages not necessary
• Collective Communication
– To be investigated
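A minimal sketch of the binding rule quoted above (not the IFS or pmdv4 implementation; binding per OpenMP thread with the AIX bindprocessor() call is an assumption for illustration):

  /* Each OpenMP thread of each MPI task binds itself to logical CPU
     MOD(task_id*nthrds + thrd_id, 32), using the AIX bindprocessor() call. */
  #include <sys/processor.h>    /* bindprocessor(), BINDTHREAD (AIX) */
  #include <sys/thread.h>       /* thread_self() (AIX) */
  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int task_id;
      MPI_Comm_rank(MPI_COMM_WORLD, &task_id);

      #pragma omp parallel
      {
          int nthrds  = omp_get_num_threads();
          int thrd_id = omp_get_thread_num();
          int cpu     = (task_id * nthrds + thrd_id) % 32;   /* 32 CPUs per p690 */

          /* Bind the calling kernel thread to the chosen logical CPU */
          if (bindprocessor(BINDTHREAD, thread_self(), cpu) != 0)
              perror("bindprocessor");
      }

      MPI_Finalize();
      return 0;
  }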
Linux compared to PWR4 for IFS
• Linux (run by Peter Mayes)
– Opteron, 2GHz, 2 CPUs/node, 6GB/node, Myrinet switch
– Portland Group compiler
– Compiler flags: -O3 -Mvect=sse
– No code optimisation or OpenMP
– Linux 1: 1 CPU/node, Myrinet IP
– Linux 1A: 1 CPU/node, Myrinet GM
– Linux 2: using 2 CPUs/node
• IBM POWER4
– MPI (intra-node shared memory) and OpenMP
– Compiler flags: -O3 -qstrict
– hpca: 1.3GHz p690, 8 CPUs/node, 8GB/node, Colony switch
– hpcu: 1.7GHz p690, 32 CPUs/node, 32GB/node, Federation switch
Linux compared to Pwr4
[Chart: seconds for steps 1 to 11 (0 to 1000) versus number of processors (0 to 70) for T511 on Linux 1, Linux 1A, hpca and hpcu, and for T159 on Linux 2, Linux 1, hpca and hpcu.]