Scaling Up User Codes on the SP
David Skinner, NERSC Division, Berkeley Lab

TRANSCRIPT

Page 1: Scaling Up User Codes on the SP
David Skinner, NERSC Division, Berkeley Lab

Page 2: Motivation

• NERSC’s focus is on capability computation
  – Capability == jobs that use ¼ or more of the machine’s resources

• Scientists whose work involves large-scale computation or HPC should stay ahead of workstation-sized problems

• “Big Science” problems are more interesting!

Page 3: Challenges

• CPUs are outpacing memory bandwidth and switches, leaving FLOPs increasingly isolated.

• Vendors often have machines < ½ the size of NERSC machines: system software may be operating in uncharted regimes
  – MPI implementation
  – Filesystem metadata systems
  – Batch queue system

Users need information on how to mitigate the impact of these issues for large-concurrency applications.

Page 4: Seaborg.nersc.gov

MP_EUIDEVICE (switch fabric)   MPI Bandwidth (MB/sec)     MPI Latency (usec)
css0, css1                     500 / 350                  8 / 16
csss                           500 / 350 (single task)    8 / 16

Page 5: Switch Adapter Performance

[Chart: MPI bandwidth of the csss and css0 switch adapters]

Page 6: Switch considerations

• For data-decomposed applications with some locality, partition the problem along SMP boundaries (minimize the surface-to-volume ratio); a sketch of grouping tasks by node follows below

• Use MP_SHAREDMEMORY to minimize switch traffic

• csss is most often the best route to the switch
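
A minimal sketch (not from the slides) of grouping tasks by SMP node with MPI_Comm_split, assuming 16 MPI tasks per 16-way node and block task placement (consecutive ranks on one node). Exchanges kept inside one of these groups can travel through shared memory (MP_SHAREDMEMORY=yes) instead of the switch.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      /* Assumption: 16 tasks per node, ranks placed in consecutive blocks. */
      const int tasks_per_node = 16;
      int rank, node_rank, node_size;
      MPI_Comm node_comm;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Tasks with the same color share an SMP node; keep the heaviest
         exchanges inside this communicator so they stay off the switch. */
      MPI_Comm_split(MPI_COMM_WORLD, rank / tasks_per_node, rank, &node_comm);
      MPI_Comm_rank(node_comm, &node_rank);
      MPI_Comm_size(node_comm, &node_size);

      printf("global task %d is local task %d of %d on its node\n",
             rank, node_rank, node_size);

      MPI_Comm_free(&node_comm);
      MPI_Finalize();
      return 0;
  }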

Page 7: Synchronization

• On the SP each SMP image is scheduled independently, and while user code is waiting the OS will schedule other tasks

• A fully synchronizing MPI call requires everyone’s attention

• By analogy, imagine trying to go to lunch with 1024 people

• The probability that everyone is ready at any given time scales poorly (a rough estimate follows below)
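
As a rough illustration (not from the slides): if each task is independently ready to enter a collective with probability p at a given instant, the chance that all N tasks are ready together is

  P(N) = p^{N}

so even p = 0.999 gives P(1024) \approx 0.36; at scale the collective effectively starts only when the slowest task arrives.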

Page 8: Synchronization (continued)

• MPI_Alltoall and MPI_Allreduce can be particularly bad in the range of 512 tasks and above

• Use MPI_Bcast if possible
  – Not fully synchronizing

• Remove unneeded MPI_Barrier calls (see the sketch below)

• Use asynchronous I/O when possible
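
A minimal sketch (not from the slides) of trimming synchronization: the barrier commented out below adds nothing except an extra full synchronization point in front of the broadcast, and at 512+ tasks removing it shortens the wait. The parameter array is a hypothetical example.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, ntasks;
      double params[4] = {0.0, 0.0, 0.0, 0.0};

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

      if (rank == 0) {
          /* hypothetical: task 0 reads run parameters from input */
          params[0] = 1.0; params[1] = 2.0; params[2] = 3.0; params[3] = 4.0;
      }

      /* MPI_Barrier(MPI_COMM_WORLD);   redundant: MPI_Bcast already provides
         all the ordering needed, and the extra barrier only adds wait time. */
      MPI_Bcast(params, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);

      if (rank == 0)
          printf("broadcast complete on %d tasks\n", ntasks);

      MPI_Finalize();
      return 0;
  }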

Page 9: Load Balance

• If one task lags the others in time to complete, synchronization suffers; e.g., a 3% slowdown in one task can mean a 50% slowdown for the code overall

• Seek out and eliminate sources of variation (a timing sketch follows below)
• Distribute the problem uniformly among nodes/cpus

[Chart: per-task timelines for tasks 0–3, broken down into FLOP, I/O, and SYNC components]
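
One way to seek out variation, sketched here (not from the slides) around a hypothetical per-task routine do_my_work(): time each task's work with MPI_Wtime and reduce the extremes to task 0 to expose the imbalance.

  #include <mpi.h>
  #include <stdio.h>

  static void do_my_work(void) { /* hypothetical per-task workload */ }

  int main(int argc, char **argv)
  {
      int rank;
      double t0, dt, dt_max, dt_min;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      t0 = MPI_Wtime();
      do_my_work();
      dt = MPI_Wtime() - t0;

      /* Gather the slowest and fastest task times on task 0. */
      MPI_Reduce(&dt, &dt_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
      MPI_Reduce(&dt, &dt_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);

      if (rank == 0)
          printf("work time: max %.3f s, min %.3f s, imbalance %.1f%%\n",
                 dt_max, dt_min,
                 dt_max > 0.0 ? 100.0 * (dt_max - dt_min) / dt_max : 0.0);

      MPI_Finalize();
      return 0;
  }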

Page 10: Alternatives to MPI

• CHARM++ and NAMD
  – Spatially decomposed molecular dynamics with periodic load balancing; data decomposition is adaptive
• AMPI http://charm.cs.uiuc.edu/
  – An automatic approach to load balancing

• BlueGene/L-type machines with > 10K CPUs will require re-examining these issues altogether

Page 11: Improving MPI Scaling on Seaborg

Page 12: The SP switch

• Use MP_SHAREDMEMORY=yes (default)

• Use MP_EUIDEVICE=csss for 32 bit applications (default)

• Run /usr/common/usg/bin/phost prior to your parallel program to map machine names to POE tasks
  – MPI and LAPI versions available
  – Host lists are useful in general

Page 13: 64 bit MPI

• 32 bit MPI has inconvenient memory limits
  – 256MB per task default and 2GB maximum
  – 1.7GB can be used in practice, but this depends on MPI usage
  – The scaling of this internal usage is complicated, but larger concurrency jobs have more of their memory “stolen” by MPI’s internal buffers and pipes
• 64 bit MPI removes these barriers
  – But it must run on css0 only, which has less switch bandwidth

• Seaborg has 16, 32, and 64 GB per node available

Page 14: 64 bit MPI Howto

At compile time:

  * module load mpi64
  * compile with the "-q64" option using mpcc_r, mpxlf_r, or mpxlf90_r

At run time:

  * module load mpi64
  * use "#@ network.MPI = css0,us,shared" in your job scripts; the multilink adapter "csss" is not currently supported
  * run your POE code as you normally would

Page 15: MP_LABELIO, phost

• Labeled I/O will let you know which task generated the message “segmentation fault”, gave the wrong answer, etc.

export MP_LABELIO=yes

• Run /usr/common/usg/bin/phost prior to your parallel program to map machine names to POE tasks
  – MPI and LAPI versions available
  – Host lists are useful in general

Page 16: Core files

• Core dumps don’t scale (no parallel work)

• MP_COREDIR=/dev/null (no corefile I/O)
• MP_COREFILE_FORMAT=light_core (less I/O)
• LL script to save just one full-fledged core file and throw away the others:

  ...
  if [ "$MP_CHILD" -ne 0 ]; then
      export MP_COREDIR=/dev/null
  fi
  ...

Page 17: Debugging

• In general, debugging at 512 tasks and above is error-prone and cumbersome.

• Debug at a smaller scale when possible.

• Use the shared-memory device of MPICH on a workstation with lots of memory to simulate 1024 CPUs.

• For crashed jobs, examine the LL logs for the memory usage history.

Page 18: Parallel I/O

• Parallel I/O can be a significant source of variation in task completion times prior to synchronization

• Limit the number of readers or writers when appropriate (a sketch follows below). Pay attention to file creation rates.

• Output reduced quantities when possible
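
A minimal sketch (not from the slides) of limiting writers: each task computes a local value, the values are funneled to task 0 with MPI_Gather, and only task 0 touches the filesystem, so 1024 tasks create one file instead of 1024. The filename result.dat is just an example.

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, ntasks;
      double local, *all = NULL;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

      local = (double)rank;            /* stand-in for a per-task result */

      if (rank == 0)
          all = malloc(ntasks * sizeof(double));

      /* Funnel the data to one task rather than opening ntasks files. */
      MPI_Gather(&local, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

      if (rank == 0) {
          FILE *fp = fopen("result.dat", "w");
          if (fp) {
              fwrite(all, sizeof(double), (size_t)ntasks, fp);
              fclose(fp);
          }
          free(all);
      }

      MPI_Finalize();
      return 0;
  }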

Page 19: OpenMP

• Using a mixed model, even when no underlying fine-grained parallelism is present, can take strain off of the MPI implementation; e.g., on seaborg a 2048-way job can run with only 128 MPI tasks and 16 OpenMP threads

• Having hybrid code whose concurrency can be tuned between MPI tasks and OpenMP threads has portability advantages (a sketch of the hybrid pattern follows below)
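
A minimal sketch (not from the slides) of the hybrid pattern: each MPI task fans out into OpenMP threads, and all MPI calls stay on the master thread. On seaborg this would likely be built with mpcc_r and the XL OpenMP flag (-qsmp=omp), with OMP_NUM_THREADS and the tasks-per-node count set in the job script; treat those build and job-script details as assumptions.

  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, provided;

      /* Ask only for funneled threading: all MPI calls are made by the
         master thread, the least demanding mode for the MPI library. */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      #pragma omp parallel
      {
          int tid = omp_get_thread_num();
          int nthreads = omp_get_num_threads();
          printf("MPI task %d, OpenMP thread %d of %d\n", rank, tid, nthreads);
      }

      MPI_Finalize();
      return 0;
  }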

Page 20: Summary

• Resources are available to address the challenges posed by scaling up MPI applications on seaborg.

• Scientists should expand their problem scopes to tackle increasingly challenging computational problems.

• NERSC consultants can provide help in achieving scaling goals.
