Benchmark performance on Bassi. Jonathan Carter, User Services Group Lead. NERSC User Group Meeting, June 12, 2006


Page 1: Benchmark performance on Bassi Jonathan Carter User Services Group Lead jtcarter@lbl

Benchmark performance on Bassi

Jonathan Carter
User Services Group Lead

NERSC User Group Meeting
June 12, 2006

Page 2: Architectural Comparison

Node Type  Where  Network     CPU/Node  Clock (MHz)  Peak (GFlop/s)  Stream BW (GB/s/P)  Peak byte/flop  MPI BW (GB/s/P)  MPI Latency (µsec)  Network Topology
Power3     NERSC  Colony      16        375          1.5             0.4                 0.26            0.13             16.3                Fat-tree
Itanium2   LLNL   Quadrics    4         1400         5.6             1.1                 0.19            0.25             3.0                 Fat-tree
Opteron    NERSC  InfiniBand  2         2200         4.4             2.3                 0.51            0.59             6.0                 Fat-tree
Power5     NERSC  HPS         8         1900         7.6             6.8                 0.85            0.69             4.7                 Fat-tree
X1E        ORNL   Custom      4         1130         18.0            9.7                 0.54            2.9              5.0                 4D-Hypercube
ES         ESC    IN          8         1000         8.0             26.3                3.29            1.5              5.6                 Crossbar
SX-8       HLRS   INX         8         2000         16.0            41.0                2.56            2.0              5.0                 Crossbar
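The Peak byte/flop column is the ratio of Stream bandwidth to peak floating-point rate, both per processor. The following is a minimal Python sketch that recomputes it from the table's own figures; the dictionary layout and the function name are illustrative choices, not part of the original slides. Because the inputs are already rounded, the recomputed values can differ from the listed ones in the second decimal place (Power5 shows the largest gap).

```python
# (peak GFlop/s, Stream BW in GB/s per processor, listed byte/flop),
# copied from the Architectural Comparison table.
MACHINES = {
    "Power3":   (1.5,  0.4,  0.26),
    "Itanium2": (5.6,  1.1,  0.19),
    "Opteron":  (4.4,  2.3,  0.51),
    "Power5":   (7.6,  6.8,  0.85),
    "X1E":      (18.0, 9.7,  0.54),
    "ES":       (8.0,  26.3, 3.29),
    "SX-8":     (16.0, 41.0, 2.56),
}


def bytes_per_flop(peak_gflops, stream_gbs):
    """Machine balance: bytes of memory bandwidth available per peak flop."""
    return stream_gbs / peak_gflops


# Recompute the ratio for every machine and print it next to the listed value.
recomputed = {
    name: bytes_per_flop(peak, stream)
    for name, (peak, stream, _listed) in MACHINES.items()
}

for name, (_peak, _stream, listed) in MACHINES.items():
    print(f"{name:9s} recomputed {recomputed[name]:.2f}  listed {listed:.2f}")
```

Read this way, the vector systems (ES, SX-8) offer several bytes per flop, while the scalar systems offer well under one.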

Page 3: NERSC 5 Application Benchmarks

• CAM3
  – Climate model, NCAR
• GAMESS
  – Computational chemistry, Iowa State, Ames Lab
• GTC
  – Fusion, PPPL
• MADbench
  – Astrophysics (CMB analysis), LBL
• MILC
  – QCD, multi-site collaboration
• PARATEC
  – Materials science, developed at LBL and UC Berkeley
• PMEMD
  – Computational chemistry, University of North Carolina-Chapel Hill

Page 4: Application Summary

Application  Science Area             Basic Algorithm            Language    Library Use  Comment
CAM3         Climate (BER)            CFD, FFT                   FORTRAN 90  netCDF       IPCC
GAMESS       Chemistry (BES)          DFT                        FORTRAN 90  DDI, BLAS
GTC          Fusion (FES)             Particle-in-cell           FORTRAN 90  FFT (opt)    ITER emphasis
MADbench     Astrophysics (HEP & NP)  Power Spectrum Estimation  C           Scalapack    1024 proc. 730 MB per task, 200 GB disk
MILC         QCD (NP)                 Conjugate gradient         C           none         2048 proc. 540 MB per task
PARATEC      Materials (BES)          3D FFT                     FORTRAN 90  Scalapack    Nanoscience emphasis
PMEMD        Life Science (BER)       Particle Mesh Ewald        FORTRAN 90  none

Page 5: CAM3

• Community Atmosphere Model version 3
  – Developed at NCAR with substantial DOE input, both scientific and software.
• The atmosphere model for CCSM, the coupled climate system model.
  – Also the most time-consuming part of CCSM.
  – Widely used by both American and foreign scientists for climate research.
    • For example, carbon and bio-geochemistry models are built upon (integrated with) CAM3.
    • IPCC predictions use CAM3 (in part)
  – About 230,000 lines of code in Fortran 90.
• 1D decomposition, runs up to 128 processors at T85 resolution (150 km)
• 2D decomposition, runs up to 1680 processors at 0.5 deg (60 km) resolution.

Page 6: CAM3: Performance

P     Power3 Seaborg   Itanium2 Thunder   Opteron Jacquard   Power5 Bassi
      GFs/P  %pk       GFs/P  %pk         GFs/P  %pk         GFs/P  %pk
56    0.22   15%       0.35   6%          -      -           0.93   12%
240   0.18   13%       0.38   6%          -      -           0.83   11%

Page 7: GAMESS

• Computational chemistry application
  – Variety of electronic structure algorithms available
• About 550,000 lines of Fortran 90
• Communication layer makes use of highly optimized vendor libraries
• Many methods available within the code
  – Benchmarks are DFT energy and gradient calculation, MP2 energy and gradient calculation
  – Many computational chemistry studies rely on these techniques
• Exactly the same as DOD HPCMP TI-06 GAMESS benchmark
  – Vendors will only have to do the work once

Page 8: GAMESS: Performance

P     Power3 Seaborg   Itanium2 Thunder   Opteron Jacquard   Power5 Bassi
      GFs/P  %pk       GFs/P  %pk         GFs/P  %pk         GFs/P  %pk
64    0.02   1%        0.07   1%          0.07   2%          0.06   1%
384   0.03   2%        0.32   5%          -      -           0.31   4%

• Small case: large, messy, low computational-intensity kernels problematic for compilers
• Large case depends on asynchronous messaging

Page 9: GTC

• Gyrokinetic Toroidal Code
• Important code for Fusion SciDAC Project and for the International Fusion collaboration ITER.
• Transport of thermal energy via plasma microturbulence using particle-in-cell approach (PIC)

[Figure: 3D visualization of electrostatic potential in magnetic fusion device]
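To make the particle-in-cell idea concrete, here is a deliberately simplified sketch of the charge-deposition (scatter) step on a periodic 1D grid with linear weighting. It is a generic textbook illustration, not GTC's gyrokinetic scheme; the function name, the 1D geometry, and the periodic boundary are all assumptions made for the example. The point it shows is that each particle updates grid cells chosen by its position, which produces data-dependent memory access.

```python
import math


def deposit_charge(positions, charges, n_cells, dx):
    """Scatter particle charges onto a periodic 1D grid with linear weights.

    Hypothetical helper for illustration only: each particle splits its
    charge between the two grid points that bracket its position.
    """
    grid = [0.0] * n_cells
    for x, q in zip(positions, charges):
        s = x / dx                 # position measured in cell widths
        left = math.floor(s)       # index of the grid point to the left
        w = s - left               # fractional distance past that point
        grid[left % n_cells] += q * (1.0 - w)
        grid[(left + 1) % n_cells] += q * w
    return grid


# A particle a quarter of the way between points 0 and 1 gives
# three quarters of its charge to point 0 and one quarter to point 1.
example = deposit_charge([0.25], [1.0], n_cells=4, dx=1.0)
print(example)
```

The total deposited charge equals the sum of the particle charges, which is a convenient sanity check for any deposition routine.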

Page 10: GTC: Performance

P     Power3 Seaborg   Itanium2 Thunder   Opteron Jacquard   Power5 Bassi   X1E Phoenix   SX6 ES       SX8 HLRS
      GFs/P  %pk       GFs/P  %pk         GFs/P  %pk         GFs/P  %pk     GFs/P  %pk    GFs/P  %pk   GFs/P  %pk
64    0.15   10%       0.51   9%          0.64   15%         0.72   9%      1.7    10%    1.9    23%   2.3    14%
256   0.13   8%        0.44   7%          0.58   13%         0.68   9%      1.7    10%    1.8    22%   2.3    15%

• SX8 highest raw performance (ever) but lower efficiency than ES
• Scalar architectures suffer from low computational intensity, irregular data access, and register spilling
• Opteron/IB is 50% faster than Itanium2/Quadrics and only 1/2 speed of X1
  – Opteron: on-chip memory controller and caching of FP L1 data
• X1 suffers from overhead of scalar code portions
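The %pk columns in the performance tables are the sustained rate per processor divided by the peak rate from the Architectural Comparison slide. A short Python sketch of that arithmetic for the GTC results at P = 64 follows; the variable names are illustrative, and the pairing of machine names with peak rates follows the column headings of the slides. Since GFs/P is reported to only two significant figures, the recomputed percentages can differ from the listed ones by about one point.

```python
# Peak GFlop/s per processor, from the Architectural Comparison table.
PEAK = {
    "Power3 Seaborg":   1.5,
    "Itanium2 Thunder": 5.6,
    "Opteron Jacquard": 4.4,
    "Power5 Bassi":     7.6,
    "X1E Phoenix":      18.0,
    "SX6 ES":           8.0,
    "SX8 HLRS":         16.0,
}

# GTC at P = 64: (sustained GFs/P, listed % of peak), from the GTC table.
GTC_P64 = {
    "Power3 Seaborg":   (0.15, 10),
    "Itanium2 Thunder": (0.51, 9),
    "Opteron Jacquard": (0.64, 15),
    "Power5 Bassi":     (0.72, 9),
    "X1E Phoenix":      (1.7, 10),
    "SX6 ES":           (1.9, 23),
    "SX8 HLRS":         (2.3, 14),
}


def percent_of_peak(sustained_gflops, peak_gflops):
    """Sustained rate expressed as a percentage of the peak rate."""
    return 100.0 * sustained_gflops / peak_gflops


pct = {
    name: percent_of_peak(rate, PEAK[name])
    for name, (rate, _listed) in GTC_P64.items()
}

for name, (_rate, listed) in GTC_P64.items():
    print(f"{name:17s} recomputed {pct[name]:5.1f}%  listed {listed}%")
```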

Page 11: MADbench

• Cosmic microwave background radiation analysis tool (MADCAP)
  – Used large amount of time in FY04 and one of the highest scaling codes at NERSC
• MADbench is a benchmark version of the original code
  – Designed to be easily run with synthetic data for portability.
  – Used in a recent study in conjunction with Berkeley Institute for Performance Studies (BIPS).
• Written in C making extensive use of ScaLAPACK libraries
• Has extensive I/O requirements

Page 12: MADbench: Performance

P     Power3 Seaborg   Itanium2 Thunder   Opteron Jacquard   Power5 Bassi
      GFs/P  %pk       GFs/P  %pk         GFs/P  %pk         GFs/P  %pk
64    0.56   37%       2.6    43%         1.7    40%         4.1    54%
256   0.50   34%       2.2    36%         1.8    40%         3.2    44%
2048  0.70   47%       1.6    27%         -      -           -      -

• Dominated by
  – BLAS3
  – I/O

Page 13: MILC

• Quantum ChromoDynamics application
  – Widespread community use, large allocation
  – Easy to build, no dependencies, standards conforming
  – Can be set up to run on a wide range of concurrency
• Conjugate gradient algorithm
• Physics on a 4D lattice
• Local computations are 3x3 complex matrix multiplies, with sparse (indirect) access pattern
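The last bullet can be illustrated with a toy kernel: a 3x3 complex matrix applied to a 3-component complex vector fetched from a neighbouring lattice site through an index table. This is only a sketch of the access pattern, not MILC's actual data layout or code; the function names, the list-based storage, and the single neighbour table are assumptions made for the example.

```python
def mat_vec_3x3(m, v):
    """Multiply a 3x3 complex matrix by a 3-component complex vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]


def apply_links(links, field, neighbor):
    """For every site s, compute links[s] * field[neighbor[s]].

    The neighbor table supplies the indirect (gather) access: which
    vector is read depends on a lookup rather than on the loop index.
    """
    return [mat_vec_3x3(links[s], field[neighbor[s]])
            for s in range(len(field))]


# Tiny example: 4 sites, identity matrices, neighbours shifted by one site.
n_sites = 4
identity = [[1 + 0j if r == c else 0j for c in range(3)] for r in range(3)]
links = [identity for _ in range(n_sites)]
field = [[complex(s, c) for c in range(3)] for s in range(n_sites)]
neighbor = [(s + 1) % n_sites for s in range(n_sites)]

result = apply_links(links, field, neighbor)
print(result[0])
```

Each matrix-vector product does little arithmetic per value loaded, so the gather through the neighbour table tends to make memory access, rather than floating-point work, the limiting factor.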

Page 14: MILC: Performance

P     Power3 Seaborg   Itanium2 Thunder   Opteron Jacquard   Power5 Bassi
      GFs/P  %pk       GFs/P  %pk         GFs/P  %pk         GFs/P  %pk
64    0.18   12%       0.26   4%          0.60   14%         1.35   18%
256   0.14   9%        0.26   4%          0.51   12%         0.86   11%
2048  0.12   8%        0.25   4%          0.47   11%         -      -

Page 15: PARATEC

• Parallel Total Energy Code
• Plane Wave DFT using custom 3D FFT
• 70% of Materials Science Computation at NERSC is done via Plane Wave DFT codes. PARATEC captures the performance of a wide range of codes (VASP, CPMD, PETOT).

Page 16: PARATEC: Performance

P     Power3 Seaborg   Itanium2 Thunder   Opteron Jacquard   Power5 Bassi   X1E Phoenix   SX6 ES       SX8 HLRS
      GFs/P  %pk       GFs/P  %pk         GFs/P  %pk         GFs/P  %pk     GFs/P  %pk    GFs/P  %pk   GFs/P  %pk
64    0.60   40%       1.8    29%         2.3    53%         4.4    58%     3.8    21%    5.1    64%   7.5    49%
256   0.41   27%       0.79   13%         1.7    38%         3.3    43%     3.3    18%    5.0    62%   6.8    43%

• All architectures generally perform well due to computational intensity of code (BLAS3, FFT)
• SX8 achieves highest per-processor performance
• X1/X1E shows lowest % of peak
  – Non-vectorizable code much more expensive on X1/X1E (32:1)
  – Lower bisection bandwidth to computational ratio (4D-hypercube)
  – X1 performance is comparable to Itanium2
• Itanium2 outperforms Opteron because
  – PARATEC less sensitive to memory access issues (BLAS3)
  – Opteron lacks FMA unit
  – Quadrics shows better scaling of all-to-all at large concurrencies
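The remark that the Opteron lacks an FMA (fused multiply-add) unit can be cross-checked against the Architectural Comparison table: dividing the peak rate by the clock gives the number of floating-point operations each processor can complete per cycle. The sketch below does this for the four scalar systems; the names are illustrative, and the result is consistent with the remark rather than a proof of it.

```python
# (clock in MHz, peak GFlop/s per processor), from the Architectural
# Comparison table.
SCALAR_SYSTEMS = {
    "Power3":   (375, 1.5),
    "Itanium2": (1400, 5.6),
    "Opteron":  (2200, 4.4),
    "Power5":   (1900, 7.6),
}


def flops_per_cycle(clock_mhz, peak_gflops):
    """Peak floating-point operations completed per clock cycle."""
    return peak_gflops * 1000.0 / clock_mhz


per_cycle = {
    name: flops_per_cycle(clock, peak)
    for name, (clock, peak) in SCALAR_SYSTEMS.items()
}

for name, value in per_cycle.items():
    print(f"{name:9s} {value:.1f} flops/cycle")
```

The Power3, Itanium2, and Power5 come out at 4 flops per cycle and the Opteron at 2, so for kernels that are not memory bound the Opteron's higher clock does not translate into a higher peak.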

Page 17: PMEMD

• Particle Mesh Ewald Molecular Dynamics
  – An F90 code with advanced MPI coding; it should test the compiler and stress asynchronous point-to-point messaging.
• PMEMD is very similar to the MD engine in AMBER 8.0, used in both chemistry and biosciences
• Test system is a 91K atom blood coagulation protein

Page 18: PMEMD: Performance

P     Power3 Seaborg   Itanium2 Thunder   Opteron Jacquard   Power5 Bassi
      GFs/P  %pk       GFs/P  %pk         GFs/P  %pk         GFs/P  %pk
64    0.13   9%        0.21   3%          0.46   10%         0.52   7%
256   0.05   3%        0.10   2%          0.19   4%          0.32   4%
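One way to read the two rows together is to ask how much of the per-processor rate each system keeps when going from 64 to 256 processors. The sketch below computes that ratio from the table; the term "retention" and the function name are made up for this example, and since the slides do not say whether the two runs use the same problem size, the ratio should be read as relative per-processor throughput rather than strict parallel efficiency.

```python
# Per-processor rates (GFs/P) at P = 64 and P = 256, from the PMEMD table.
PMEMD = {
    "Power3 Seaborg":   (0.13, 0.05),
    "Itanium2 Thunder": (0.21, 0.10),
    "Opteron Jacquard": (0.46, 0.19),
    "Power5 Bassi":     (0.52, 0.32),
}


def rate_retention(rate_small, rate_large):
    """Fraction of the per-processor rate kept at the larger concurrency."""
    return rate_large / rate_small


retention = {
    name: rate_retention(r64, r256)
    for name, (r64, r256) in PMEMD.items()
}

for name, value in retention.items():
    print(f"{name:17s} keeps {100.0 * value:.0f}% of its P=64 rate at P=256")
```

All four systems lose a large share of their per-processor rate, which fits the earlier description of PMEMD as a stress test for point-to-point messaging.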

Page 19: Summary

[Bar chart comparing seaborg, bassi, jacquard, and thunder on an axis from 0.0 to 10000.0 for MILC M, MILC L, MILC XL, GTC M, GTC L, PARA M, PARA L, GAM M, GAM L, MAD M, MAD L, MAD XL, PME M, PME L, CAM M, and CAM L]

Page 20: Summary

Benchmark  seaborg   bassi    jacquard  thunder  s/b
MILC M     1028.9    138      312       708.0    7.5
MILC L     9562.7    1496     2530      5069.0   6.4
MILC XL    12697.3   1945     3289      6129.0   -
GTC M      8236.9    1667     1876      2345.0   4.9
GTC L      9572.3    1790     2079      2759.0   5.3
PARA M     3306.4    451.0    861.0     1134.0   7.3
PARA L     6811.0    854.0    1654.0    3534.0   8.0
GAM M      18665.0   5837.0   5404.0    5277.0   3.2
GAM L      42167.0   4683.0   -         4516.0   9.0
MAD M      8013.9    1094.0   2585.0    1727.0   7.3
MAD L      8421.6    1277.0   2417.0    1942.0   6.6
MAD XL     2943.9    447.0    846.0     1291.0   -
PME M      2080.0    538      606       1344.0   3.9
PME L      3020.0    475      782       1541.0   6.4
CAM M      7932.8    1886.0   -         4988     4.2
CAM L      2439.0    527.0    -         1158.0   4.6

Page 21: Summary

• Average performance ratio of bassi to seaborg is 6.0 for the N5 application benchmarks
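The 6.0 figure can be reproduced from the summary table by dividing each seaborg entry by the matching bassi entry and averaging over the 14 rows that list an s/b value. The sketch below is a minimal check of that arithmetic; the slides do not state the units of the table entries (they are presumably run times), and the variable names are illustrative. The MILC XL and MAD XL rows are left out because the table gives no s/b value for them.

```python
# (seaborg, bassi) entries from the summary table, for rows with an s/b value.
RESULTS = {
    "MILC M": (1028.9, 138.0),
    "MILC L": (9562.7, 1496.0),
    "GTC M":  (8236.9, 1667.0),
    "GTC L":  (9572.3, 1790.0),
    "PARA M": (3306.4, 451.0),
    "PARA L": (6811.0, 854.0),
    "GAM M":  (18665.0, 5837.0),
    "GAM L":  (42167.0, 4683.0),
    "MAD M":  (8013.9, 1094.0),
    "MAD L":  (8421.6, 1277.0),
    "PME M":  (2080.0, 538.0),
    "PME L":  (3020.0, 475.0),
    "CAM M":  (7932.8, 1886.0),
    "CAM L":  (2439.0, 527.0),
}

# Ratio of the seaborg entry to the bassi entry for each benchmark.
ratios = {name: seaborg / bassi for name, (seaborg, bassi) in RESULTS.items()}

# Arithmetic mean over all listed benchmarks.
average_ratio = sum(ratios.values()) / len(ratios)

for name, value in ratios.items():
    print(f"{name:7s} {value:.1f}")
print(f"average {average_ratio:.1f}")
```

The per-benchmark ratios range from about 3.2 (GAM M) to 9.0 (GAM L), so the average hides a wide spread across applications.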