benchmark performance on bassi jonathan carter user services group lead jtcarter@lbl

1

Benchmark performance on Bassi

Jonathan Carter
User Services Group Lead


NERSC User Group Meeting
June 12, 2006

2

Architectural Comparison

Node Type  Where  Network     CPU/Node  Clock (MHz)  Peak (GFlop/s)  Stream BW (GB/s/P)  Peak byte/flop  MPI BW (GB/s/P)  MPI Latency (µsec)  Network Topology
Power3     NERSC  Colony      16        375          1.5             0.4                 0.26            0.13             16.3                Fat-tree
Itanium2   LLNL   Quadrics    4         1400         5.6             1.1                 0.19            0.25             3.0                 Fat-tree
Opteron    NERSC  InfiniBand  2         2200         4.4             2.3                 0.51            0.59             6.0                 Fat-tree
Power5     NERSC  HPS         8         1900         7.6             6.8                 0.85            0.69             4.7                 Fat-tree
X1E        ORNL   Custom      4         1130         18.0            9.7                 0.54            2.9              5.0                 4D-Hypercube
ES         ESC    IN          8         1000         8.0             26.3                3.29            1.5              5.6                 Crossbar
SX-8       HLRS   INX         8         2000         16.0            41.0                2.56            2.0              5.0                 Crossbar
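The "Peak byte/flop" column appears to be the STREAM bandwidth divided by the peak floating-point rate. The short Python sketch below recomputes it from the table's own numbers; the function and variable names are illustrative only. Most rows agree with the table to within rounding; Power5 comes out slightly higher than the listed 0.85, presumably because the slide used unrounded inputs.

```python
# Peak rate (GFlop/s per processor) and STREAM bandwidth (GB/s per processor),
# copied from the Architectural Comparison table.
systems = {
    "Power3":   (1.5, 0.4),
    "Itanium2": (5.6, 1.1),
    "Opteron":  (4.4, 2.3),
    "Power5":   (7.6, 6.8),
    "X1E":      (18.0, 9.7),
    "ES":       (8.0, 26.3),
    "SX-8":     (16.0, 41.0),
}


def bytes_per_flop(peak_gflops, stream_gbs):
    """Memory balance: bytes deliverable from memory per peak flop."""
    return stream_gbs / peak_gflops


# Assumption: byte/flop = Stream BW / Peak GFlop/s (not stated on the slide).
balance = {name: bytes_per_flop(p, s) for name, (p, s) in systems.items()}

for name, value in balance.items():
    print(f"{name:9s} {value:.2f} byte/flop")
```

A higher ratio means more memory bandwidth is available per unit of peak compute, which is why the vector systems (ES, SX-8) stand out in this column.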

3

NERSC 5 Application Benchmarks

• CAM3: Climate model, NCAR

• GAMESS: Computational chemistry, Iowa State, Ames Lab

• GTC: Fusion, PPPL

• MADbench: Astrophysics (CMB analysis), LBL

• MILC: QCD, multi-site collaboration

• PARATEC: Materials science, developed at LBL and UC Berkeley

• PMEMD: Computational chemistry, University of North Carolina-Chapel Hill

4

Application Summary

Application  Science Area              Basic Algorithm            Language    Library Use  Comment
CAM3         Climate (BER)             CFD, FFT                   FORTRAN 90  netCDF       IPCC
GAMESS       Chemistry (BES)           DFT                        FORTRAN 90  DDI, BLAS
GTC          Fusion (FES)              Particle-in-cell           FORTRAN 90  FFT (opt)    ITER emphasis
MADbench     Astrophysics (HEP & NP)   Power Spectrum Estimation  C           Scalapack    1024 proc. 730 MB per task, 200 GB disk
MILC         QCD (NP)                  Conjugate gradient         C           none         2048 proc. 540 MB per task
PARATEC      Materials (BES)           3D FFT                     FORTRAN 90  Scalapack    Nanoscience emphasis
PMEMD        Life Science (BER)        Particle Mesh Ewald        FORTRAN 90  none

5

CAM3

• Community Atmospheric Model version 3
  – Developed at NCAR with substantial DOE input, both scientific and software.
• The atmosphere model for CCSM, the coupled climate system model.
  – Also the most time-consuming part of CCSM.
  – Widely used by both American and foreign scientists for climate research.
    • For example, carbon and bio-geochemistry models are built upon (integrated with) CAM3.
    • IPCC predictions use CAM3 (in part).
  – About 230,000 lines of Fortran 90 code.
    • 1D decomposition, runs on up to 128 processors at T85 resolution (150 km)
    • 2D decomposition, runs on up to 1680 processors at 0.5 deg (60 km) resolution.

6

CAM3: Performance

P     Power3 Seaborg  Itanium2 Thunder  Opteron Jacquard  Power5 Bassi
      GFs/P  %pk      GFs/P  %pk        GFs/P  %pk        GFs/P  %pk
56    0.22   15%      0.35   6%         -      -          0.93   12%
240   0.18   13%      0.38   6%         -      -          0.83   11%
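The %pk columns in the performance tables relate the sustained rate per processor (GFs/P) to the peak rate listed on the Architectural Comparison slide. The Python sketch below shows that arithmetic for the CAM3 P = 56 row. It assumes the three reported pairs belong to Power3, Itanium2, and Power5, which is what the percentages imply; the dictionary and function names are illustrative only.

```python
# Peak GFlop/s per processor, from the Architectural Comparison slide.
PEAK_GFLOPS = {"Power3": 1.5, "Itanium2": 5.6, "Opteron": 4.4, "Power5": 7.6}


def percent_of_peak(gflops_per_proc, system):
    """Sustained rate as a percentage of the system's peak rate."""
    return 100.0 * gflops_per_proc / PEAK_GFLOPS[system]


# CAM3 at P = 56 (assumed column assignment, see the text above).
cam3_p56 = {"Power3": 0.22, "Itanium2": 0.35, "Power5": 0.93}

pct = {system: round(percent_of_peak(rate, system))
       for system, rate in cam3_p56.items()}
print(pct)
```

Some entries in other rows differ from this calculation by a percentage point, most likely because the slides computed %pk from unrounded rates.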

7

GAMESS

• Computational chemistry application
  – Variety of electronic structure algorithms available
• About 550,000 lines of Fortran 90
• Communication layer makes use of highly optimized vendor libraries
• Many methods available within the code
  – Benchmarks are DFT energy and gradient calculation, MP2 energy and gradient calculation
  – Many computational chemistry studies rely on these techniques
• Exactly the same as DOD HPCMP TI-06 GAMESS benchmark
  – Vendors will only have to do the work once

8

GAMESS: Performance

P     Power3 Seaborg  Itanium2 Thunder  Opteron Jacquard  Power5 Bassi
      GFs/P  %pk      GFs/P  %pk        GFs/P  %pk        GFs/P  %pk
64    0.02   1%       0.07   1%         0.07   2%         0.06   1%
384   0.03   2%       0.32   5%         -      -          0.31   4%

• Small case: large, messy, low computational-intensity kernels problematic for compilers

• Large case depends on asynchronous messaging

9

GTC

• Gyrokinetic Toroidal Code
• Important code for Fusion SciDAC Project and for the International Fusion collaboration ITER.
• Transport of thermal energy via plasma microturbulence using particle-in-cell approach (PIC)

[Figure: 3D visualization of electrostatic potential in magnetic fusion device]
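To illustrate why particle-in-cell codes tend to have irregular memory access, here is a minimal Python sketch of the charge-deposition (scatter) step on a periodic 1D grid with linear weighting. This is a generic textbook PIC step under simplifying assumptions, not GTC's actual gyrokinetic algorithm; `deposit_charge` and its parameters are hypothetical names.

```python
def deposit_charge(positions, charges, n_cells, dx=1.0):
    """Scatter particle charge onto a periodic 1D grid with linear weighting.

    Assumes non-negative particle positions.
    """
    grid = [0.0] * n_cells
    for x, q in zip(positions, charges):
        s = x / dx
        cell = int(s)                 # index of the grid point to the left
        frac = s - cell               # fractional distance past that point
        left = cell % n_cells
        right = (cell + 1) % n_cells
        # The grid indices depend on particle data, so successive updates
        # touch memory in an order the compiler cannot predict.
        grid[left] += q * (1.0 - frac)
        grid[right] += q * frac
    return grid


rho = deposit_charge([0.25, 3.5, 3.75], [1.0, 1.0, 1.0], n_cells=4)
print(rho)
```

Total charge is conserved by construction, since the two weights for each particle sum to one.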

10

GTC: Performance

P     Power3 Seaborg  Itanium2 Thunder  Opteron Jacquard  Power5 Bassi  X1E Phoenix   SX6 ES        SX8 HLRS
      GFs/P  %pk      GFs/P  %pk        GFs/P  %pk        GFs/P  %pk    GFs/P  %pk    GFs/P  %pk    GFs/P  %pk
64    0.15   10%      0.51   9%         0.64   15%        0.72   9%     1.7    10%    1.9    23%    2.3    14%
256   0.13   8%       0.44   7%         0.58   13%        0.68   9%     1.7    10%    1.8    22%    2.3    15%

• SX8 highest raw performance (ever) but lower efficiency than ES
• Scalar architectures suffer from low computational intensity, irregular data access, and register spilling
• Opteron/IB is 50% faster than Itanium2/Quadrics and only 1/2 speed of X1
  – Opteron: on-chip memory controller and caching of FP L1 data
• X1 suffers from overhead of scalar code portions

11

MADbench

• Cosmic microwave background radiation analysis tool (MADCAP)
  – Used large amount of time in FY04 and one of the highest scaling codes at NERSC
• MADbench is a benchmark version of the original code
  – Designed to be easily run with synthetic data for portability.
  – Used in a recent study in conjunction with Berkeley Institute for Performance Studies (BIPS).
• Written in C making extensive use of ScaLAPACK libraries
• Has extensive I/O requirements

12

MADbench: Performance

P     Power3 Seaborg  Itanium2 Thunder  Opteron Jacquard  Power5 Bassi
      GFs/P  %pk      GFs/P  %pk        GFs/P  %pk        GFs/P  %pk
64    0.56   37%      2.6    43%        1.7    40%        4.1    54%
256   0.50   34%      2.2    36%        1.8    40%        3.2    44%
2048  0.70   47%      1.6    27%        -      -          -      -

• Dominated by
  – BLAS3
  – I/O

13

MILC

• Quantum ChromoDynamics application
  – Widespread community use, large allocation
  – Easy to build, no dependencies, standards conforming
  – Can be set up to run on a wide range of concurrency
• Conjugate gradient algorithm
• Physics on a 4D lattice
• Local computations are 3x3 complex matrix multiplies, with sparse (indirect) access pattern
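The local kernel described above can be sketched in a few lines of Python: small 3x3 complex matrix products whose operands are located through a neighbor index table rather than by contiguous addressing. The tiny four-site "lattice", the `links` data, and the `neighbor` table are hypothetical stand-ins chosen for illustration; they are not MILC's data layout.

```python
def matmul3(a, b):
    """Multiply two 3x3 complex matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def identity3():
    """3x3 complex identity matrix."""
    return [[1 + 0j if i == j else 0j for j in range(3)] for i in range(3)]


# Hypothetical lattice: one 3x3 complex matrix per site.
n_sites = 4
links = [[[complex(s + i, j) for j in range(3)] for i in range(3)]
         for s in range(n_sites)]

# Indirect (gather) addressing: neighbor[s] is the site multiplied with site s.
neighbor = [1, 2, 3, 0]

products = [matmul3(links[s], links[neighbor[s]]) for s in range(n_sites)]

# A complex multiply costs 6 real flops and a complex add costs 2, so one
# 3x3 complex product (27 multiplies, 18 adds) costs 198 flops.
FLOPS_PER_MATMUL = 27 * 6 + 18 * 2
```

Each product does little arithmetic relative to the data it loads through the index table, which is consistent with the kernel being sensitive to memory access rather than peak floating-point rate.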

14

MILC: Performance

P     Power3 Seaborg  Itanium2 Thunder  Opteron Jacquard  Power5 Bassi
      GFs/P  %pk      GFs/P  %pk        GFs/P  %pk        GFs/P  %pk
64    0.18   12%      0.26   4%         0.60   14%        1.35   18%
256   0.14   9%       0.26   4%         0.51   12%        0.86   11%
2048  0.12   8%       0.25   4%         0.47   11%        -      -

15

PARATEC

• Parallel Total Energy Code
• Plane Wave DFT using custom 3D FFT
• 70% of Materials Science Computation at NERSC is done via Plane Wave DFT codes. PARATEC captures the performance of a wide range of codes (VASP, CPMD, PETOT).

16

PARATEC: Performance

P     Power3 Seaborg  Itanium2 Thunder  Opteron Jacquard  Power5 Bassi  X1E Phoenix   SX6 ES        SX8 HLRS
      GFs/P  %pk      GFs/P  %pk        GFs/P  %pk        GFs/P  %pk    GFs/P  %pk    GFs/P  %pk    GFs/P  %pk
64    0.60   40%      1.8    29%        2.3    53%        4.4    58%    3.8    21%    5.1    64%    7.5    49%
256   0.41   27%      0.79   13%        1.7    38%        3.3    43%    3.3    18%    5.0    62%    6.8    43%

• All architectures generally perform well due to computational intensity of code (BLAS3, FFT)
• SX8 achieves highest per-processor performance
• X1/X1E shows lowest % of peak
  – Non-vectorizable code much more expensive on X1/X1E (32:1)
  – Lower bisection bandwidth to computational ratio (4D-hypercube)
  – X1 performance is comparable to Itanium2
• Itanium2 outperforms Opteron because
  – PARATEC less sensitive to memory access issues (BLAS3)
  – Opteron lacks FMA unit
  – Quadrics shows better scaling of all-to-all at large concurrencies

17

PMEMD

• Particle Mesh Ewald Molecular Dynamics
  – An F90 code with advanced MPI coding; should test the compiler and stress asynchronous point-to-point messaging.
• PMEMD is very similar to the MD engine in AMBER 8.0, used in both chemistry and biosciences
• Test system is a 91K atom blood coagulation protein

18

PMEMD: Performance

P     Power3 Seaborg  Itanium2 Thunder  Opteron Jacquard  Power5 Bassi
      GFs/P  %pk      GFs/P  %pk        GFs/P  %pk        GFs/P  %pk
64    0.13   9%       0.21   3%         0.46   10%        0.52   7%
256   0.05   3%       0.10   2%         0.19   4%         0.32   4%

19

Summary

[Figure: bar chart comparing seaborg, bassi, jacquard, and thunder for each benchmark case (MILC M, L, XL; GTC M, L; PARA M, L; GAM M, L; MAD M, L, XL; PME M, L; CAM M, L). Vertical axis runs from 0 to 10000.]

20

Summary

Benchmark  seaborg   bassi    jacquard  thunder  s/b
MILC M     1028.9    138      312       708.0    7.5
MILC L     9562.7    1496     2530      5069.0   6.4
MILC XL    12697.3   1945     3289      6129.0   -
GTC M      8236.9    1667     1876      2345.0   4.9
GTC L      9572.3    1790     2079      2759.0   5.3
PARA M     3306.4    451.0    861.0     1134.0   7.3
PARA L     6811.0    854.0    1654.0    3534.0   8.0
GAM M      18665.0   5837.0   5404.0    5277.0   3.2
GAM L      42167.0   4683.0   -         4516.0   9.0
MAD M      8013.9    1094.0   2585.0    1727.0   7.3
MAD L      8421.6    1277.0   2417.0    1942.0   6.6
MAD XL     2943.9    447.0    846.0     1291.0   -
PME M      2080.0    538      606       1344.0   3.9
PME L      3020.0    475      782       1541.0   6.4
CAM M      7932.8    1886.0   -         4988     4.2
CAM L      2439.0    527.0    -         1158.0   4.6

21

Summary

• Average seaborg-to-bassi ratio (the s/b column) is 6.0 for the NERSC 5 (N5) application benchmarks
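The 6.0 average can be reproduced from the Summary table. The Python sketch below divides the seaborg value by the bassi value for each row that has an s/b entry and takes the arithmetic mean. The values appear to be run times, although the slide does not state units; the variable names are illustrative only.

```python
# (seaborg, bassi) values from the Summary table, for rows with an s/b entry.
times = {
    "MILC M": (1028.9, 138),
    "MILC L": (9562.7, 1496),
    "GTC M":  (8236.9, 1667),
    "GTC L":  (9572.3, 1790),
    "PARA M": (3306.4, 451.0),
    "PARA L": (6811.0, 854.0),
    "GAM M":  (18665.0, 5837.0),
    "GAM L":  (42167.0, 4683.0),
    "MAD M":  (8013.9, 1094.0),
    "MAD L":  (8421.6, 1277.0),
    "PME M":  (2080.0, 538),
    "PME L":  (3020.0, 475),
    "CAM M":  (7932.8, 1886.0),
    "CAM L":  (2439.0, 527.0),
}

# s/b: how many times larger the seaborg value is than the bassi value.
ratios = {name: seaborg / bassi for name, (seaborg, bassi) in times.items()}

# Plain arithmetic mean over the 14 benchmark cases.
mean_ratio = sum(ratios.values()) / len(ratios)
print(f"mean s/b = {mean_ratio:.1f}")
```

The XL cases are left out because the table lists no s/b value for them.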
