Transcript
Page 1:

Kernel and Application Code Performance for a Spectral Atmospheric Global Circulation Model on the

Cray T3E and IBM SP

Patrick H. Worley Computer Science and Mathematics Division

Oak Ridge National Laboratory

NERSC Users’ Group Meeting, Oak Ridge, TN, June 6, 2000

Page 2:

Alternative Title

… random collection of benchmarks, looking at communication, serial, and parallel performance on the IBM SP and other MPPs at NERSC and ORNL.

Page 3:

Acknowledgements

Research sponsored by the Atmospheric and Climate Research Division and the Office of Mathematical, Information, and Computational Sciences, Office of Science, U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.

These slides have been authored by a contractor of the U.S. Government under contract No. DE-AC05-00OR22725. Accordingly, the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.

Oak Ridge National Laboratory is managed by UT-Battelle, LLC for the United States Department of Energy under Contract No. DE-AC05-00OR22725.

Page 4:

Platforms at NERSC

IBM SP
• 2-way Winterhawk I SMP “wide” nodes with 1 GB memory
• 200 MHz Power3 processors with 4 MB L2 cache
• 1.6 GB/sec node memory bandwidth (single bus)
• Omega multistage interconnect

SGI/Cray Research T3E-900
• single-processor nodes with 256 MB memory
• 450 MHz Alpha 21164 (EV5) processors with 96 KB L2 cache
• 1.2 GB/sec node memory bandwidth
• 3D torus interconnect

Page 5:

Platforms at ORNL

IBM SP
• 4-way Winterhawk II SMP “thin” nodes with 2 GB memory
• 375 MHz Power3-II processors with 8 MB L2 cache
• 1.6 GB/sec node memory bandwidth (single bus)
• Omega multistage interconnect

Compaq AlphaServer SC
• 4-way ES40 SMP nodes with 2 GB memory
• 667 MHz Alpha 21264a (EV67) processors with 8 MB L2 cache
• 5.2 GB/sec node memory bandwidth (dual bus)
• Quadrics “fat tree” interconnect

Page 6:

Other Platforms

SGI/Cray Research Origin 2000 at LANL
• 128-way SMP node with 32 GB memory
• 250 MHz MIPS R10000 processors with 4 MB L2 cache
• NUMA memory subsystem

IBM SP
• 16-way Nighthawk II SMP node
• 375 MHz Power3-II processors with 8 MB L2 cache
• switch-based memory subsystem
• results obtained using prerelease hardware and software

Page 7:

Topics

Interprocessor communication performance

Serial performance
• PSTSWM spectral dynamics kernel
• CRM column physics kernel

Parallel performance
• CCM/MP-2D atmospheric global circulation model

Page 8:

Communication Tests

Interprocessor communication performance
• within an SMP node
• between SMP nodes
• with and without contention
• with and without cache invalidation
for both bidirectional and unidirectional communication protocols

Brief description of some results. For more details, see

http://www.epm.ornl.gov/~worley/studies/pt2pt.html
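As a rough illustration of the kind of point-to-point test described above (a hedged sketch, not the benchmark code at the URL; the message size and repetition count are arbitrary assumptions), a bidirectional swap between two ranks can be timed like this:

    /* Minimal sketch of a two-rank bidirectional swap bandwidth test.      */
    /* Illustrative only; the actual tests are described at the URL above.  */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int nbytes = 1 << 20;   /* 1 MB messages (assumed)            */
        const int reps   = 100;       /* timing repetitions (assumed)       */
        int rank;
        char *sbuf = malloc(nbytes), *rbuf = malloc(nbytes);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        if (rank < 2) {               /* only ranks 0 and 1 participate     */
            int partner = 1 - rank;
            for (int i = 0; i < reps; i++)
                /* bidirectional: both ranks send and receive at once; the  */
                /* unidirectional variant instead uses MPI_Send on one rank */
                /* and MPI_Recv on the other                                */
                MPI_Sendrecv(sbuf, nbytes, MPI_BYTE, partner, 0,
                             rbuf, nbytes, MPI_BYTE, partner, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("swap bandwidth: %.1f MB/s per direction\n",
                   (double)nbytes * reps / (t1 - t0) / 1.0e6);

        free(sbuf);
        free(rbuf);
        MPI_Finalize();
        return 0;
    }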

Page 9:

Communication Tests

MPI_SENDRECV bidirectional and MPI_SEND/MPI_RECV unidirectional bandwidth between nodes on the IBM SP at NERSC

Page 10:

Communication Tests

MPI_SENDRECV bidirectional and MPI_SEND/MPI_RECV unidirectional bandwidth between nodes on the IBM SP at NERSC

Page 11:

Communication Tests

MPI_SENDRECV bidirectional and MPI_SEND/MPI_RECV unidirectional bandwidth between nodes on the IBM SP at ORNL

Page 12:

Communication Tests

Bidirectional bandwidth comparison across platforms: swap between processors 0-1

Page 13:

Communication Tests

Bidirectional bandwidth comparison across platforms: swap between processors 0-4

Page 14:

Communication Tests

Bidirectional bandwidth comparison across platforms: simultaneous swap between processors 0-4, 1-5, 2-6, 3-7

Page 15:

Communication Tests

Bidirectional bandwidth comparison across platforms: 8-processor send/recv ring 0-1-2-3-4-5-6-7-0
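A hedged sketch of this ring pattern (assumed message size and repetition count; the benchmark itself may order its sends and receives differently). The simultaneous-swap test on the previous slide differs only in the partner choice, e.g. partner = (rank + 4) % 8 for the 0-4, 1-5, 2-6, 3-7 pairing.

    /* Sketch of the send/recv ring 0-1-2-...-7-0: each rank sends to the    */
    /* next rank and receives from the previous one.  Illustrative only.     */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int nbytes = 1 << 20, reps = 100;   /* assumed sizes */
        int rank, size;
        char *sbuf = malloc(nbytes), *rbuf = malloc(nbytes);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int next = (rank + 1) % size;             /* neighbour to send to    */
        int prev = (rank - 1 + size) % size;      /* neighbour to recv from  */

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++)
            MPI_Sendrecv(sbuf, nbytes, MPI_BYTE, next, 0,
                         rbuf, nbytes, MPI_BYTE, prev, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("per-link bandwidth: %.1f MB/s\n",
                   (double)nbytes * reps / (t1 - t0) / 1.0e6);

        free(sbuf);
        free(rbuf);
        MPI_Finalize();
        return 0;
    }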

Page 16:

Communication Tests

Summary
• Decent intranode performance is possible.
• Message-passing functionality is good.
• Switch/NIC performance is limiting in internode communication.
• Contention for switch/NIC bandwidth in SMP nodes can be significant.

Page 17:

Serial Performance Issues

• Compiler optimization
• Domain decomposition
• Memory contention in SMP nodes

Kernel codes
• PSTSWM - spectral dynamics
• CRM - column physics

Page 18:

Spectral Dynamics

PSTSWM
• solves the nonlinear shallow water equations on a sphere using the spectral transform method
• 99% of floating point operations are fmul, fadd, or fmadd
• accessing memory linearly, but not much reuse
• (longitude, vertical, latitude) array index ordering
  computation independent between horizontal layers (fixed vertical index)
  as vertical dimension size increases, demands on memory increase
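A hedged sketch of the access pattern these bullets describe (a C stand-in for the Fortran layout; the array sizes are assumptions, and this is not PSTSWM code):

    /* The Fortran arrays are dimensioned (longitude, vertical, latitude),   */
    /* so longitude is contiguous in memory.  The C equivalent below streams */
    /* through the data with little reuse, and each (vertical, latitude)     */
    /* horizontal layer can be processed independently of the others.        */
    #define NLON 128   /* assumed sizes, roughly a T42 grid with 18 levels */
    #define NVER 18
    #define NLAT 64

    static double field[NLAT][NVER][NLON];   /* lon varies fastest */
    static double tend [NLAT][NVER][NLON];

    void advance(double dt)
    {
        for (int lat = 0; lat < NLAT; lat++)
            for (int k = 0; k < NVER; k++)          /* layers independent */
                for (int lon = 0; lon < NLON; lon++)
                    field[lat][k][lon] += dt * tend[lat][k][lon];
    }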

Page 19:

Spectral Dynamics

PSTSWM on the IBM SP at NERSC

Horizontal Resolutions: T5: 8x16, T10: 16x32, T21: 32x64, T42: 64x128, T85: 128x256, T170: 256x512
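Reading each pair as latitudes x longitudes, these grids are consistent with the usual unaliased (quadratic) transform grid for triangular truncation T_M, with the longitude count rounded up to an FFT-friendly size:

    N_{\mathrm{lon}} \ge 3M + 1, \qquad N_{\mathrm{lat}} = N_{\mathrm{lon}} / 2,
    \qquad \text{e.g. } M = 42:\; 3 \cdot 42 + 1 = 127 \rightarrow N_{\mathrm{lon}} = 128,\; N_{\mathrm{lat}} = 64 .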

Page 20:

Spectral Dynamics

PSTSWM on the IBM SP at NERSC

Page 21:

Spectral Dynamics

PSTSWM
Platform comparisons - 1 processor per SMP node

Horizontal Resolutions: T5: 8x16, T10: 16x32, T21: 32x64, T42: 64x128, T85: 128x256, T170: 256x512

Page 22:

Spectral Dynamics

PSTSWM
Platform comparisons - all processors active in SMP node (except Origin-250)

Horizontal Resolutions: T5: 8x16, T10: 16x32, T21: 32x64, T42: 64x128, T85: 128x256, T170: 256x512

Page 23:

Spectral Dynamics

PSTSWM
Platform comparisons - 1 processor per SMP node

Page 24:

Spectral Dynamics

PSTSWM
Platform comparisons - all processors active in SMP node (except Origin-250)

Page 25:

Spectral Dynamics

Summary
• Math libraries and relaxed mathematical semantics improve performance significantly on the IBM SP.
• Node memory bandwidth is important (for this kernel code), especially on bus-based SMP nodes.
• The IBM SP serial performance is a significant improvement over the (previous generation) Origin and T3E systems.
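The bandwidth sensitivity in the second bullet can be seen with a simple STREAM-style triad run once with one process per node and once with one process per processor; this is an illustrative sketch (array and repetition sizes are assumptions), not one of the kernels above:

    /* STREAM-style triad: a[i] = b[i] + s*c[i] moves three doubles per      */
    /* element, so the sustained rate is bounded by node memory bandwidth.   */
    /* On a bus-based SMP node the per-process rate drops when all           */
    /* processors run a copy at once.                                        */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1L << 22)   /* ~32 MB per array: far larger than L2 cache */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        const double s = 3.0;
        const int reps = 10;

        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        clock_t t0 = clock();
        for (int r = 0; r < reps; r++)
            for (long i = 0; i < N; i++)
                a[i] = b[i] + s * c[i];
        double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("triad rate: %.1f MB/s (check value %.1f)\n",
               (double)reps * 3.0 * N * sizeof(double) / sec / 1.0e6, a[N - 1]);
        free(a); free(b); free(c);
        return 0;
    }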

Page 26:

Column Physics

CRM
• Column Radiation Model extracted from the Community Climate Model
• 6% of floating point operations are sqrt, 3% are fdiv
• exp, log, and pow are among top six most frequently called functions
• (longitude, vertical, latitude) array index ordering
  computations independent between vertical columns (fixed longitude, latitude)
  as longitude dimension size increases, demands on memory increase
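A hedged sketch of the access pattern these bullets describe (again a C stand-in for the Fortran layout, with made-up field names and sizes; not CRM code):

    /* With (longitude, vertical, latitude) ordering, each vertical column   */
    /* (fixed lon, lat) is independent, but walking a column strides through */
    /* memory by NLON doubles, so a larger longitude dimension spreads the   */
    /* working set over more cache lines.                                    */
    #include <math.h>

    #define NLON 128   /* assumed sizes */
    #define NVER 18
    #define NLAT 64

    static double t[NLAT][NVER][NLON];   /* placeholder "temperature" field */
    static double q[NLAT][NVER][NLON];   /* placeholder "moisture" field    */

    void column_physics(void)
    {
        for (int lat = 0; lat < NLAT; lat++)
            for (int lon = 0; lon < NLON; lon++)        /* independent columns */
                for (int k = 0; k < NVER; k++)
                    /* divide / sqrt / exp heavy, as in the operation mix above */
                    q[lat][k][lon] = exp(-q[lat][k][lon] / sqrt(t[lat][k][lon] + 1.0));
    }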

Page 27:

Column Physics

CRM on the NERSC SP
longitude-vertical slice, with varying number of longitudes

Page 28:

Column Physics

CRM
longitude-vertical slice, with varying number of longitudes
1 processor per SMP node

Page 29:

Column Physics

Summary
• Performance is less sensitive to node memory bandwidth for this kernel code.
• Performance on the IBM SP is very sensitive to compiler optimization and domain decomposition.

Page 30:

Parallel Performance

Issues
• Scalability
• Overhead growth and analysis

Codes
• CCM/MP-2D

Page 31:

CCM/MP-2D

Message-passing parallel implementation of the National Center for Atmospheric Research (NCAR) Community Climate Model

Computational Domains
• Physical Domain: Longitude x Latitude x Vertical levels
• Fourier Domain: Wavenumber x Latitude x Vertical levels
• Spectral Domain: (Wavenumber x Polynomial degree) x Vertical levels

Page 32:

CCM/MP-2D

Problem Sizes

• T42L18
  128 x 64 x 18 physical domain grid
  42 x 64 x 18 Fourier domain grid
  946 x 18 spectral domain grid
  ~59.5 GFlops per simulated day

• T170L18
  512 x 256 x 18 physical domain grid
  170 x 256 x 18 Fourier domain grid
  14706 x 18 spectral domain grid
  ~3231 GFlops per simulated day
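The spectral-domain sizes follow from the triangular truncation: T_M retains (M+1)(M+2)/2 (wavenumber, polynomial degree) pairs per vertical level, which matches the grids above:

    \frac{(M+1)(M+2)}{2} = \frac{43 \cdot 44}{2} = 946 \quad (\mathrm{T42}), \qquad
    \frac{171 \cdot 172}{2} = 14706 \quad (\mathrm{T170}).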

Page 33:

CCM/MP-2D

Computations
• Column Physics
  independent between vertical columns
• Spectral Dynamics
  Fourier transform in longitude direction
  Legendre transform in latitude direction
  tendencies for timestepping calculated in spectral domain, independent between spectral coordinates
• Semi-Lagrangian Advection
  Use local approximations to interpolate wind fields and particle distributions away from grid points.

Page 34:

CCM/MP-2D

Decomposition across latitude
  parallelizes the Legendre transform: use distributed global sum algorithm
  currently requires north/south halo updates for semi-Lagrangian advection

Decomposition across longitude
  parallelizes the Fourier transform: either use distributed FFT algorithm or transpose fields and use serial FFT
  requires east/west halo updates for semi-Lagrangian advection
  requires night/day vertical column swaps to load balance physics
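A hedged sketch of the kind of north/south halo update the latitude decomposition needs (illustrative only, not CCM/MP-2D code; the field layout and neighbour handling are assumptions):

    /* Each process owns a band of latitudes stored as nlat_local+2 rows of  */
    /* nlon*nver values, with rows 0 and nlat_local+1 reserved for halos.    */
    /* Neighbours at the poles can be passed as MPI_PROC_NULL so the calls   */
    /* degenerate gracefully.                                                */
    #include <mpi.h>

    void halo_update(double *band, int nlon, int nver, int nlat_local,
                     int north, int south, MPI_Comm comm)
    {
        int row = nlon * nver;   /* values per latitude row */

        /* send northernmost owned row north, receive into southern halo */
        MPI_Sendrecv(band + (long)nlat_local * row, row, MPI_DOUBLE, north, 0,
                     band,                          row, MPI_DOUBLE, south, 0,
                     comm, MPI_STATUS_IGNORE);

        /* send southernmost owned row south, receive into northern halo */
        MPI_Sendrecv(band + row,                          row, MPI_DOUBLE, south, 1,
                     band + (long)(nlat_local + 1) * row, row, MPI_DOUBLE, north, 1,
                     comm, MPI_STATUS_IGNORE);
    }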

Page 35:

CCM/MP-2D

Sensitivity of message volume to domain decomposition

Page 36:

Scalability

CCM/MP-2D T42L18 Benchmark

Page 37:

Scalability

CCM/MP-2D T170L18 Benchmark

Page 38:

Overhead

CCM/MP-2D T42L18 Benchmark: Overhead Time Diagnosis

Page 39:

Overhead

CCM/MP-2D T170L18 Benchmark: Overhead Time Diagnosis

Page 40:

CCM/MP-2D

Summary
• Parallel algorithm optimization is (still) important for achieving peak performance
• Bottlenecks
  Message-passing bandwidth and latency
  SMP node memory bandwidth on the SP

