Kernel and Application Code Performance for a Spectral Atmospheric Global Circulation Model on the
Cray T3E and IBM SP
Patrick H. Worley Computer Science and Mathematics Division
Oak Ridge National Laboratory
NERSC Users’ Group Meeting, Oak Ridge, TN, June 6, 2000
Alternative Title

… a random collection of benchmarks, looking at communication, serial, and parallel performance on the IBM SP and other MPPs at NERSC and ORNL.
Acknowledgements

Research sponsored by the Atmospheric and Climate Research Division and the Office of Mathematical, Information, and Computational Sciences, Office of Science, U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.

These slides have been authored by a contractor of the U.S. Government under Contract No. DE-AC05-00OR22725. Accordingly, the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.

Oak Ridge National Laboratory is managed by UT-Battelle, LLC for the United States Department of Energy under Contract No. DE-AC05-00OR22725.
Platforms at NERSC
IBM SP
• 2-way Winterhawk I SMP “wide” nodes with 1 GB memory
• 200 MHz Power3 processors with 4 MB L2 cache
• 1.6 GB/sec node memory bandwidth (single bus)
• Omega multistage interconnect

SGI/Cray Research T3E-900
• Single-processor nodes with 256 MB memory
• 450 MHz Alpha 21164 (EV5) with 96 KB L2 cache
• 1.2 GB/sec node memory bandwidth
• 3D torus interconnect
Platforms at ORNL
IBM SP
• 4-way Winterhawk II SMP “thin” nodes with 2 GB memory
• 375 MHz Power3-II processors with 8 MB L2 cache
• 1.6 GB/sec node memory bandwidth (single bus)
• Omega multistage interconnect

Compaq AlphaServer SC
• 4-way ES40 SMP nodes with 2 GB memory
• 667 MHz Alpha 21264a (EV67) processors with 8 MB L2 cache
• 5.2 GB/sec node memory bandwidth (dual bus)
• Quadrics “fat tree” interconnect
Other Platforms
SGI/Cray Research Origin 2000 at LANL
• 128-way SMP node with 32 GB memory
• 250 MHz MIPS R10000 processors with 4 MB L2 cache
• NUMA memory subsystem

IBM SP
• 16-way Nighthawk II SMP node
• 375 MHz Power3-II processors with 8 MB L2 cache
• Switch-based memory subsystem
• Results obtained using prerelease hardware and software
Topics
• Interprocessor communication performance
• Serial performance
 - PSTSWM spectral dynamics kernel
 - CRM column physics kernel
• Parallel performance
 - CCM/MP-2D atmospheric global circulation model
Communication Tests
Interprocessor communication performance
• within an SMP node
• between SMP nodes
• with and without contention
• with and without cache invalidation
for both bidirectional and unidirectional communication protocols.
Brief description of some results. For more details, see
http://www.epm.ornl.gov/~worley/studies/pt2pt.html
Communication Tests
MPI_SENDRECV bidirectional and MPI_SEND/MPI_RECV unidirectional bandwidth
between nodes on the IBM SP at NERSC
Communication Tests
MPI_SENDRECV bidirectional and MPI_SEND/MPI_RECV unidirectional bandwidth
between nodes on the IBM SP at ORNL
Communication Tests

Bidirectional bandwidth comparison across platforms: swap between processors 0-1
Communication Tests

Bidirectional bandwidth comparison across platforms: swap between processors 0-4
Communication Tests

Bidirectional bandwidth comparison across platforms: simultaneous swap between processors 0-4, 1-5, 2-6, 3-7
Communication Tests

Bidirectional bandwidth comparison across platforms: 8-processor send/recv ring 0-1-2-3-4-5-6-7-0
Communication Tests

Summary
• Decent intranode performance is possible.
• Message-passing functionality is good.
• Switch/NIC performance is the limiting factor in internode communication.
• Contention for switch/NIC bandwidth within SMP nodes can be significant.
Serial Performance Issues
• Compiler optimization
• Domain decomposition
• Memory contention in SMP nodes

Kernel codes
• PSTSWM - spectral dynamics
• CRM - column physics
Spectral Dynamics
PSTSWM
• solves the nonlinear shallow water equations on a sphere using the spectral transform method
• 99% of floating-point operations are fmul, fadd, or fmadd
• accesses memory linearly, but with little reuse
• (longitude, vertical, latitude) array index ordering
 - computation is independent between horizontal layers (fixed vertical index)
 - as the vertical dimension size increases, demands on memory increase
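The layout described above can be made concrete with a small sketch (the array names and loop body are invented; PSTSWM itself is Fortran, where the first index varies fastest, which the flat indexing below emulates in C):

```c
/* (longitude, vertical, latitude) layout with longitude fastest-varying,
 * mimicking Fortran column-major order via flat indexing. */
#define IDX(i, k, j, nlon, nver) ((i) + (nlon) * ((k) + (nver) * (j)))

void add_layers(double *u, const double *v, int nlon, int nver, int nlat)
{
    /* Each horizontal layer (fixed vertical index k) is independent of
     * the others; a larger nver means a larger working set and more
     * pressure on node memory bandwidth. */
    for (int j = 0; j < nlat; j++)
        for (int k = 0; k < nver; k++)
            for (int i = 0; i < nlon; i++)   /* unit-stride inner loop */
                u[IDX(i, k, j, nlon, nver)] += v[IDX(i, k, j, nlon, nver)];
}
```

Because the inner loop is unit stride with little reuse, sustained performance is bounded by node memory bandwidth rather than peak flops.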
Spectral Dynamics
PSTSWM on the IBM SP at NERSC
Horizontal Resolutions
T5: 8x16
T10: 16x32
T21: 32x64
T42: 64x128
T85: 128x256
T170: 256x512
Spectral Dynamics
PSTSWM platform comparisons - 1 processor per SMP node
Spectral Dynamics
PSTSWM platform comparisons - all processors active in SMP node (except Origin-250)
Spectral Dynamics
Summary
• Math libraries and relaxed mathematical semantics improve performance significantly on the IBM SP.
• Node memory bandwidth is important (for this kernel code), especially on bus-based SMP nodes.
• IBM SP serial performance is a significant improvement over the (previous-generation) Origin and T3E systems.
Column Physics
CRM
• Column Radiation Model extracted from the Community Climate Model
• 6% of floating-point operations are sqrt; 3% are fdiv
• exp, log, and pow are among the six most frequently called functions
• (longitude, vertical, latitude) array index ordering
 - computations are independent between vertical columns (fixed longitude, latitude)
 - as the longitude dimension size increases, demands on memory increase
Column Physics
CRM on the NERSC SP: longitude-vertical slice, with varying number of longitudes
Column Physics
CRM: longitude-vertical slice, with varying number of longitudes
1 processor per SMP node
Column Physics
Summary
• Performance is less sensitive to node memory bandwidth for this kernel code.
• Performance on the IBM SP is very sensitive to compiler optimization and domain decomposition.
Parallel Performance
Issues
• Scalability
• Overhead growth and analysis

Codes
• CCM/MP-2D
CCM/MP-2D
Message-passing parallel implementation of the National Center for Atmospheric Research (NCAR) Community Climate Model
Computational Domains
• Physical domain: Longitude x Latitude x Vertical levels
• Fourier domain: Wavenumber x Latitude x Vertical levels
• Spectral domain: (Wavenumber x Polynomial degree) x Vertical levels
CCM/MP-2D
Problem Sizes
• T42L18
 - 128 x 64 x 18 physical domain grid
 - 42 x 64 x 18 Fourier domain grid
 - 946 x 18 spectral domain grid
 - ~59.5 GFlops per simulated day

• T170L18
 - 512 x 256 x 18 physical domain grid
 - 170 x 256 x 18 Fourier domain grid
 - 14706 x 18 spectral domain grid
 - ~3231 GFlops per simulated day
CCM/MP-2D
Computations
• Column Physics
 - independent between vertical columns
• Spectral Dynamics
 - Fourier transform in the longitude direction
 - Legendre transform in the latitude direction
 - tendencies for timestepping calculated in the spectral domain, independent between spectral coordinates
• Semi-Lagrangian Advection
 - uses local approximations to interpolate wind fields and particle distributions away from grid points
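The forward Legendre transform is a weighted reduction over latitude (a Gaussian quadrature), and that reduction is what a latitude decomposition must parallelize. A hedged serial sketch (all array names and layouts are invented; the real code works wavenumber by wavenumber with symmetry optimizations):

```c
/* Forward Legendre transform sketch: spectral coefficient s[n] is a
 * weighted sum over latitudes of the Fourier coefficient f[n][j] times
 * the associated Legendre function value P[n][j], with Gaussian
 * quadrature weight w[j]. Under a latitude decomposition, each process
 * forms its partial sum over local latitudes and the full sum is
 * completed by a distributed global sum. */
void legendre_forward(double *s, const double *f, const double *P,
                      const double *w, int nspec, int nlat)
{
    for (int n = 0; n < nspec; n++) {
        double sum = 0.0;
        for (int j = 0; j < nlat; j++)   /* the parallelized reduction */
            sum += w[j] * P[n * nlat + j] * f[n * nlat + j];
        s[n] = sum;
    }
}
```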
CCM/MP-2D
Decomposition across latitude
• parallelizes the Legendre transform: uses a distributed global sum algorithm
• currently requires north/south halo updates for semi-Lagrangian advection

Decomposition across longitude
• parallelizes the Fourier transform: either uses a distributed FFT algorithm or transposes fields and uses a serial FFT
• requires east/west halo updates for semi-Lagrangian advection
• requires night/day vertical column swaps to load balance physics
CCM/MP-2D
Sensitivity of message volume to domain decomposition
Scalability
CCM/MP-2D T42L18 Benchmark
Scalability
CCM/MP-2D T170L18 Benchmark
Overhead
CCM/MP-2D T42L18 Benchmark: overhead time diagnosis
Overhead
CCM/MP-2D T170L18 Benchmark: overhead time diagnosis
CCM/MP-2D
Summary
• Parallel algorithm optimization is (still) important for achieving peak performance.
• Bottlenecks:
 - message-passing bandwidth and latency
 - SMP node memory bandwidth on the SP