TRANSCRIPT
Performance modeling in Germany
Why care about cluster performance?
G. Wellein, G. Hager, T. Zeiser, M. Meier
Regional Computing Center Erlangen (RRZE), Friedrich-Alexander-University Erlangen-Nuremberg
April 16, 2008, IDC HPC User Forum
HPC Centers in Germany: A view from Erlangen
Berlin, Hannover
Jülich Supercomputing Center (FZ Jülich): 8.9 TFlop/s IBM Power4+, 46 TFlop/s BlueGene/L, 228 TFlop/s BlueGene/P
HLRS Stuttgart: 12 TFlop/s NEC SX8
LRZ Munich: SGI Altix (62 TFlop/s)
Erlangen/Nürnberg (RRZE)
Friedrich-Alexander-University Erlangen-Nuremberg (FAU)
2nd largest university in Bavaria
26,500+ students (winter semester 2006/07)
12,000+ employees
11 faculties
83 institutes
23 hospitals
265 chairs (C4 / W3)
141 fields of study
250 buildings scattered across 4 cities (Erlangen, Nuremberg, Fürth, Bamberg)
RRZE provides all “IT” services for the university
Introduction: Modeling and Simulation – Interdisciplinary Research Focus of FAU
Theoretical Physics
Fluid Dynamics
Material Sciences
Life Sciences / Chemistry
Nano-Sciences
Applied Mathematics
Applied Physics
Computational Sciences
Compute cycles in 2007 (RRZE only): > 8 million CPU-hours
Introduction: HPC strategy of RRZE
[Diagram: the High Performance Computing support group at RRZE sits between Science (problem, analysis, methods & algorithms, software engineering, solution) and Computer Science / Mathematics, providing access, parallelization/debugging, optimization, and data handling]
Introduction – RRZE: Compute Resources (2003-2007)
Compute cluster:
216 2-way compute nodes: 86 nodes Intel Xeon 2.6 GHz (FSB533), 64 nodes Intel Xeon 3.2 GHz (FSB800), 66 nodes Intel Xeon 2.66 GHz (dual-core)
25 4-way compute nodes: AMD Opteron 270 (2.0 GHz), dual-core
GBit Ethernet network; InfiniBand: 24 nodes (SDR) + 66 nodes (DDR)
5.5 + 13 TByte disk space
Installation: 4/2003; upgrades: 12/2004, Q4/2005, Q3/2007
Compute servers:
SGI Altix 3700: 32 Itanium2 1.3 GHz, 128 GByte memory, 3 TByte disk space, installed 11/2003
SGI Altix 330: 16 Itanium2 1.5 GHz, 32 GByte memory, installed 3/2006
4+16 CPUs paid from scientists' funding
374 of 532 CPUs paid from scientists' funding
RRZE “Woody” Cluster (HP / Bechtle)
876 Intel Xeon 5160 processor cores at 3.0 GHz -> 12 GFlop/s per core
HP DL140 G3 nodes (217 compute + 2 login nodes)
Peak performance: 10512 GFlop/s; LINPACK: 7315 GFlop/s
Main memory: 8 GByte per compute node
Voltaire “DDRx” IB switch: 240 ports
OS: SLES9
Parallel filesystem (SFS): 15 TByte (4 OSS)
NFS filesystem: 15 TByte
Installation: Oct. 2006
Top500 Nov. 2006: rank 124 (760 cores)
Top500 Nov. 2007: rank 329 (876 cores)
Power consumption > 100 kW
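A quick check of these numbers (my arithmetic, not from the slide): a Woodcrest core can retire up to four double-precision flops per cycle (one packed SSE add plus one packed SSE multiply), so

$$ P_{\text{core}} = 3.0\,\text{GHz} \times 4\,\tfrac{\text{flops}}{\text{cycle}} = 12\,\text{GFlop/s}, \qquad P_{\text{peak}} = 876 \times 12\,\text{GFlop/s} = 10512\,\text{GFlop/s} $$

and the LINPACK result of 7315 GFlop/s corresponds to roughly 70% of peak.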
RRZE: Our HPC backbone
Dr. Wellein (Physics), Dr. Hager (Physics), Dr. Zeiser (CFD), M. Meier (Computer Science)
Tasks: user support; software; parallelization & optimization; evaluation of new hardware; HPC tutorials & lectures; system administration
Architecture of cluster nodes – ccNUMA: why care about it?
Cluster nodes – Basic architecture of compute nodes
[Block diagrams: dual-CPU Intel Xeon node (old) and dual-socket Intel “Core” node, where the sockets reach memory through a shared chipset (northbridge); dual-socket AMD Opteron node, where each socket has an on-chip memory interface (MI) and its own local memory]
The Intel platform provides one path per socket to memory (still UMA)
HyperTransport provides scalable bandwidth on Opteron systems but introduces a ccNUMA architecture: where does my data finally end up?
Intel will move to ccNUMA with its QuickPath (CSI) technology
Cluster nodes: ccNUMA pitfalls – Simple Lattice Boltzmann Method (LBM) kernel

      double precision f(0:xMax+1,0:yMax+1,0:zMax+1,0:18,0:1)
!$OMP PARALLEL DO PRIVATE(y,x,...) SCHEDULE(RUNTIME)
      do z=1,zMax
        do y=1,yMax
          do x=1,xMax
            if( fluidcell(x,y,z) ) then
              ! Collide: load all 19 distributions of the local cell
              LOAD f(x,y,z, 0:18,t)
              Relaxation (complex computations)
              ! Stream: store the 19 updated distributions to neighboring cells
              SAVE f(x  ,y  ,z  ,  0,t+1)
              SAVE f(x+1,y+1,z  ,  1,t+1)
              ...
              SAVE f(x  ,y-1,z-1, 18,t+1)
            endif
          enddo
        enddo
      enddo
!$OMP END PARALLEL DO

#load operations:  19*xMax*yMax*zMax + 19*xMax*yMax*zMax
#store operations: 19*xMax*yMax*zMax
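A rough bandwidth-limited estimate based on these counts (my reading, not stated on the slide: the second 19·N load term is the write-allocate traffic caused by the stores), assuming 8-byte values:

$$ B_{\text{cell}} \approx (19 + 19 + 19) \times 8\,\text{B} = 456\,\text{B} \quad\Rightarrow\quad P \approx \frac{b_{\text{mem}}}{456\,\text{B}}, \ \text{e.g. } 6\,\text{GB/s} \rightarrow \approx 13\,\text{MLUP/s} $$

This is why such kernels are quoted in lattice-site updates per second (MLUP/s) and why they are so sensitive to where their memory pages actually live.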
Cluster nodes: ccNUMA pitfalls – Simple LBM kernel on a 2-socket Intel Xeon node (UMA)
[Plot legend: correct parallel initialization; different thread scheduling in initialization and compute step; sequential initialization of data]
Cluster nodes: ccNUMA pitfalls – Simple LBM kernel on a 4-socket dual-core Opteron node (ccNUMA)
[Plot legend: correct parallel initialization; different thread scheduling in initialization and compute step; sequential initialization of data]
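The “correct parallel initialization” case relies on first-touch page placement: on a ccNUMA node a memory page ends up in the NUMA domain of the thread that touches it first. A minimal C/OpenMP sketch of the idea (illustrative names and layout, not the RRZE code):

    #include <stdlib.h>

    /* Allocate the distribution array and initialize it with the SAME
     * parallel loop structure and schedule as the compute kernel, so each
     * thread first touches (and thereby places) the pages it will later
     * work on. */
    double *alloc_and_init(int zMax, int yMax, int xMax, int q)
    {
        size_t plane = (size_t)yMax * xMax * q;
        double *f = malloc((size_t)zMax * plane * sizeof(double));

    #pragma omp parallel for schedule(static)   /* must match the compute loop */
        for (int z = 0; z < zMax; z++)
            for (size_t i = z * plane; i < (z + 1) * plane; i++)
                f[i] = 0.0;                     /* first touch places the page */

        return f;
    }

Sequential initialization places all pages in the memory of one socket, and a different schedule in the initialization loop places pages in the wrong NUMA domains; both show up as the slower curves in the plot above.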
Cluster nodes: ccNUMA pitfalls – Filesystem cache: 2-socket server, UMA vs. ccNUMA

    for x in `seq 1 41` ; do
      dd if=/dev/zero of=/scratch/justatest bs=1M count=${x}00   # fill the filesystem cache with x*100 MB
      sync
      mpirun_rrze -np 4 ./triad.x < input.triads                 # then measure memory bandwidth
    done

[Diagram: ccNUMA node with two sockets, each with its own memory interface (MI) and local memory]
Main memory bandwidth – Did you ever check the STREAM numbers of your compute nodes?
Cluster nodes – Main memory bandwidth within a compute node
Theoretical (aggregate) bandwidth:
Intel Xeon 51xx (“Woodcrest”), 2 sockets: 21.3 GByte/s ( = 2 * 1333 MHz * 8 Byte)
Intel Conroe / Xeon 30xx, 1 socket: 8.5 GByte/s ( = 1 * 1066 MHz * 8 Byte)
Intel Kentsfield / QX6850, 1 socket: 10.6 GByte/s ( = 1 * 1333 MHz * 8 Byte)
AMD Opteron/Barcelona (memory controller on-chip), Socket F: 10.6 GByte/s per socket (DDR2-667 DIMMs)
Popular kernels to measure real-world bandwidth – STREAM: COPY A=B; SCALE A=s*B; ADD A=B+C; TRIAD A=B+s*C
“Optimized version”: suppress the additional read for ownership (RFO) of A via nontemporal stores
Array size = 20,000,000; offset = 0
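For reference, a minimal C/OpenMP triad kernel in the spirit of STREAM (a sketch, not the official benchmark and not the “optimized” builds, which additionally use nontemporal stores to avoid the RFO mentioned above):

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 20000000          /* array size as on the slide */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        double s = 3.0;

        /* parallel first-touch initialization (matters on ccNUMA nodes) */
    #pragma omp parallel for
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t = omp_get_wtime();
    #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + s * c[i];             /* TRIAD: A = B + s*C */
        t = omp_get_wtime() - t;

        /* 3 arrays x 8 bytes per iteration; write-allocate traffic not counted */
        printf("TRIAD: %.0f MB/s\n", 3.0 * 8.0 * N / t / 1e6);
        free(a); free(b); free(c);
        return 0;
    }

Compile with OpenMP enabled and run with one thread per core to mimic the “running on all cores” setting of the tables below.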
Cluster nodes (dual-cores): optimized version of STREAM running on all cores

    System                                       sockets/node  COPY [MB/s]  SCALE [MB/s]  ADD [MB/s]  TRIAD [MB/s]
    Intel slides, 3.0 GHz (WC; GC)               2             8204         8192          7680        7680
    RRZE: Intel EA box, 2.66 GHz (WC; BF)        2             6195         6198          6220        6250
    RRZE: HP DL140, 3.0 GHz (WC; GC)             2             7521         7519          6145        6149
    RRZE: transtec, 3.0 GHz (WC; GC)             2             8193         8159          6646        6796
    RRZE: CUDA workstation, 2.33 GHz (CT; GC)    2             8952         8962          7766        7796

There is not a single STREAM number, even though CPU, chipset, and memory DIMM speed are identical!
Cluster nodes (quad-cores): optimized version of STREAM running on all cores

    System                                                  sockets/node  COPY [MB/s]  SCALE [MB/s]  ADD [MB/s]  TRIAD [MB/s]
    RRZE: HP DL140, 3.0 GHz (WC; GC)                        2             7521         7519          6145        6149
    RRZE: Intel EA X5482, FSB1600, 3.2 GHz (“Harpertown”)   2             8180         8170          8840        9080
    RRZE: AMD Opteron Barcelona, 2 GHz / DDR2-667           2             17027        15500         16684       16700
    RRZE: Intel EA* X38ML server board, QX6850 (3.0 GHz)    1             6587         6566          6969        6962

    * FSB1333; only 2 threads used
Cluster nodes (quad-cores): “Optimized STREAM” – the vendors always choose the best measurements…
We do not yet know what happens here, but we are working hard on it
AMD K10 (“Barcelona”)
Barcelona design: relative memory alignment constraints?
Parallelization by compiler – The first thing they do is reduce performance…
[Plot: sequential performance of the Lattice Boltzmann solver vs. the compiler-parallelized version with OMP_NUM_THREADS=1; node diagram: two dual-core sockets behind one chipset]
Relative to the sequential version, 4 cores give a speed-up of only ~30%
Intra-socket scalability – Having a low baseline makes things easier…
[Plot: Lattice Boltzmann solver scaling within one socket, baseline: a single core of an Intel QX6850; node diagram: quad-core socket with per-core caches, a shared L3 cache, and on-chip memory interface / HT]
Scalability is important, but never forget the baseline
Experiences with cluster performance – Tales from the trenches…
Cluster nodes – Single-socket nodes: Intel S3000PT board
Intel S3000PT board: 1 socket (Intel Xeon 30xx series); 2 boards per 1U; FSB1066 with unbuffered DDR2; 1 PCIe x8 slot; 2 SATA ports; Intel AMT
http://www.intel.com/design/servers/boards/s3000PT/index.htm
Optimized for: MPI applications with high memory and/or MPI bandwidth requirements
Not optimized for: maximum LINPACK per $$
Cluster nodes – Single-socket nodes: Intel S3000PT board
RRZE S3000PT cluster (installation: 09/2007):
66 compute nodes, each with one 2.66 GHz Xeon 3070 (dual-core) and 4 GB memory (DDR2-533)
72-port Flextronics IB DDR switch (max. 144 ports)
Delivered by transtec

Application performance compared with WOODY, measured with the parallel RRZE benchmark suite (strong scaling):

    Application                        Cores  Performance S3000PT/WOODY
    AMBER8/pmemd (MD – Chemistry)      32     1.01
    IMD (MD – Materials Sciences)      64     1.12
    EXX (Quantum Chemistry)            16     1.14
    OAK3D (Theoretical Physics)        64     1.29
    trats/BEST (LBM solver – CFD)      64     1.37
Clusters – Never trust them… (S3000PT cluster)
[Plots: STREAM triad performance of the nodes on arrival vs. after choosing the correct BIOS settings and removing bad memory DIMMs; DIMM vendors: Samsung, Kingston]
Clusters – Never trust them… DDRx Voltaire IB switch (WOODY)
A simple ping-pong should achieve ~1500 MB/s (DDR) or ~1000 MB/s (SDR)
First measurement of bandwidth for each link: 950 MB/s; after several reboots and firmware upgrades: 1510 MB/s
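A simple way to check each link is a plain MPI ping-pong; a minimal sketch (illustrative, not the measurement harness used at RRZE):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int nbytes = 4 << 20;          /* 4 MB message */
        const int reps   = 100;
        char *buf = malloc(nbytes);

        MPI_Barrier(MPI_COMM_WORLD);
        double t = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {                  /* ping */
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {           /* pong */
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t = MPI_Wtime() - t;

        if (rank == 0)                        /* 2 messages per repetition */
            printf("uni-directional bandwidth: %.0f MB/s\n",
                   2.0 * reps * (double)nbytes / t / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }

Running one rank on each of the two nodes attached to the link under test should approach the ~1500 MB/s quoted above on a healthy DDR link.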
Clusters – Never trust them…
A “cheap cluster” acquired by a local research group
Putting DDR IB cards into PCIe x4 slots may work, but it is not a good idea…
Clusters – Never trust anyone…
A “cheap cluster” acquired by a local research group: “We were told that AMD is the best processor available!”
2-way nodes (AMD Opteron dual-core, 2.2 GHz) + DDR IB network
“Why buy a commercial compiler when a free one is available?” -> gfortran
Target application: AMBER9/pmemd

4 MPI processes on one node of Woody:

    Compiler / MPI                 Runtime [s]
    SUN Studio 12 / OpenMPI        3500
    gfortran / Intel MPI           3000
    Intel 64 9.1 / Intel MPI       2700

8 MPI processes (Intel 64 9.1 + Intel MPI); process layout nodes*sockets*cores:

    System                        Layout  Runtime [s]
    Opteron cluster (2.2 GHz)     2*2*2   1930
    Woody cluster (3.0 GHz)       2*2*2   1430
    2 x Intel QX6850 (3.0 GHz)    2*1*4   1440
Clusters – Yet another cluster OS?
7 AMD Opteron nodes (dual-core / dual-socket), 4 GB per node
Windows 2003 Enterprise + Compute Cluster Pack
Visual Studio 2005, Intel compilers, MKL, ACML, Star-CD
GBit Ethernet; access via RDP or ssh (sshd from Cygwin)
GUI tool for job control: Cluster Job Manager; CLI: job.cmd script
New users for RRZE: Chair for Statistics and Econometrics
Clusters – Windows CCS is really fast (at migrating processes)
[Plot: performance (MLUP/s) of a 2D Jacobi (heat conduction) solver with placement+pinning, placement only, and no placement; 4 MB L2 limit marked]
NUMA placement: +60%; additional pinning: +30%
The pinning benefit is only due to better NUMA locality!
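The measurements above used the Windows CCS placement and affinity mechanisms; as a sketch of the same pinning idea on Linux (illustrative core numbering, assuming one OpenMP thread per core):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <omp.h>

    /* Each OpenMP thread binds itself to one core, so the OS cannot migrate
     * it away from the NUMA domain holding the data it first touched. */
    int main(void)
    {
    #pragma omp parallel
        {
            cpu_set_t mask;
            CPU_ZERO(&mask);
            CPU_SET(omp_get_thread_num(), &mask);   /* thread i -> core i */
            sched_setaffinity(0, sizeof(mask), &mask);
            /* ... NUMA-aware initialization and Jacobi sweeps go here ... */
        }
        return 0;
    }

Without such pinning the OS may move a thread to another socket, after which all of its memory accesses become remote; that is why the pinning benefit above is purely a NUMA-locality effect.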
What’s next? SUN Niagara2 / IBM Cell / GPUs
[Block diagram of the SUN Niagara2 (© Sun): 8 cores (C1-C8), each with a 16 KB I$ and 8 KB D$, one FPU and one SPU, 8 threads per core, 2 execution pipes, 1 op/cycle per pipe; a crossbar connects the cores to 8 L2$ banks (4 MB L2$ total) and 4 memory controllers with dual-channel FB-DIMM links (x14 read / x10 write @ 4.0 GT/s, 2-8 DIMMs; 42 GB/s read, 21 GB/s write); NIU with 10 Gb Ethernet; PCIe x8 @ 2.5 GHz, 2 GB/s each direction; SSI/JTAG debug port]
• Massive parallelism
• Programming / optimization models are completely new for most of us: porting only the kernels means Amdahl’s law limits the speedup
• Most “accelerators” will stay in niche markets (remember: Itanium failed because of complex optimization and missing software compatibility!)
Summary
Clusters provide tremendous compute capacity at a low price tag, but they are far from being a standard product, far from being designed for optimal performance on HPC applications, and far from being a solution for highly parallel high-end applications.
(Heterogeneous) multi-/many-core architectures will further improve the price/performance ratio but will increase programming complexity.
Most users of HPC systems will not be able to adequately address the challenges and problems pointed out in this talk!