
Performance modeling in Germany

Why care about cluster performance?

G. Wellein, G. Hager, T. Zeiser, M. Meier

Regional Computing Center Erlangen (RRZE), Friedrich-Alexander-University Erlangen-Nuremberg

April 16, 2008, IDC HPC User Forum


HPC Centers in Germany: A view from Erlangen

[Map of Germany with sites marked: Berlin, Hannover, FZ Jülich, HLRS Stuttgart, LRZ München, Erlangen/Nürnberg]

Jülich Supercomputing Center: 8.9 TFlop/s IBM Power4+, 46 TFlop/s BlueGene/L, 228 TFlop/s BlueGene/P
LRZ Munich: SGI Altix (62 TFlop/s)
HLRS Stuttgart: 12 TFlop/s NEC SX8


Friedrich-Alexander-University Erlangen-Nuremberg (FAU)

2nd largest university in Bavaria

26,500+ students

12,000+ employees

11 faculties

83 institutes

23 hospitals

265 chairs (C4 / W3)

141 fields of study

250 buildings scattered across 4 cities (Erlangen, Nuremberg, Fürth, Bamberg)

(Figures as of winter semester 2006/07)


RRZE provides all “IT” services for the university


Theoretical Physics

Fluid Dynamics

Material Sciences

Life Sciences / Chemistry

Nano-Sciences

Applied Mathematics

Applied Physics

Computational Sciences

Compute cycles in 2007 (RRZE only): > 8 million CPU-hrs

Science & Research

Introduction: Modeling and Simulation – Interdisciplinary Research Focus of FAU


Introduction: HPC strategy of RRZE

[Diagram: the path from problem to solution in SCIENCE – methods & algorithms, software engineering, analysis – builds on Computer Science and Mathematics; the High Performance Computing support group (RRZE) contributes access, parallelization/debugging, optimization, and data handling]


Introduction – RRZE: Compute Resources (2003-2007)

Compute cluster:
• 216 2-way compute nodes:
  • 86 nodes: Intel Xeon 2.6 GHz, FSB533
  • 64 nodes: Intel Xeon 3.2 GHz, FSB800
  • 66 nodes: Intel Xeon 2.66 GHz, dual-core
• 25 4-way compute nodes: AMD Opteron 270 (2.0 GHz), dual-core
• GBit Ethernet network; InfiniBand: 24 nodes (SDR) + 66 nodes (DDR)
• 5.5 + 13 TByte disk space
• Installation: 4/2003; upgrades: 12/2004, Q4/2005, Q3/2007

Compute servers:
SGI Altix 3700
• 32 Itanium2 1.3 GHz
• 128 GByte memory
• 3 TByte disk space
• Installed: 11/2003
SGI Altix 330
• 16 Itanium2 1.5 GHz
• 32 GByte memory
• Installed: 3/2006
• 4+16 CPUs paid by scientists' funding

374 / 532 CPUs paid with scientists' own money


RRZE "Woody" cluster (HP / Bechtle)

876 Intel Xeon 5160 processor cores, 3.0 GHz → 12 GFlop/s per core
HP DL140 G3 compute nodes (217 compute + 2 login nodes)
Peak performance: 10512 GFlop/s; LINPACK: 7315 GFlop/s
Main memory: 8 GByte per compute node
Voltaire "DDRx" IB switch: 240 ports
OS: SLES9
Parallel filesystem (SFS): 15 TByte (4 OSS)
NFS filesystem: 15 TByte
Installation: Oct. 2006

Top500 Nov. 2006: rank 124 (760 cores)
Top500 Nov. 2007: rank 329 (876 cores)

Power consumption > 100 kW


RRZE: Our HPC backbone

Dr. Wellein (Physics), Dr. Hager (Physics), Dr. Zeiser (CFD), M. Meier (Computer Science)

User support; software; parallelization & optimization; evaluation of new hardware; HPC tutorials & lectures; system administration

Architecture of cluster nodes – ccNUMA: why care about it?


Cluster nodes: Basic architecture of compute nodes

[Block diagrams of three node types – dual-CPU Intel Xeon node (old), dual-socket Intel "Core" node, dual-socket AMD Opteron node – showing processors and caches, chipset (northbridge) or on-chip memory interface (MI), and memory]

Intel platform provides 1 path per socket to memory (still UMA)

HyperTransport (HT) provides scalable bandwidth on Opteron systems, but introduces a ccNUMA architecture: where does my data finally end up?

Intel will move to ccNUMA with QuickPath (CSI) technology


Cluster nodes: ccNUMA pitfalls – Simple Lattice Boltzmann Method (LBM) kernel

double precision f(0:xMax+1,0:yMax+1,0:zMax+1,0:18,0:1)
!$OMP PARALLEL DO PRIVATE(Y,X,…) SCHEDULE(RUNTIME)
do z=1,zMax
  do y=1,yMax
    do x=1,xMax
      if( fluidcell(x,y,z) ) then
        LOAD f(x,y,z, 0:18,t)
        Relaxation (complex computations)    ! "Collide" step
        SAVE f(x  ,y  ,z  ,  0,t+1)          ! "Stream" step
        SAVE f(x+1,y+1,z  ,  1,t+1)
        …
        SAVE f(x  ,y-1,z-1, 18,t+1)
      endif
    enddo
  enddo
enddo

#load operations: 19*xMax*yMax*zMax + 19*xMax*yMax*zMax

#store operations: 19*xMax*yMax*zMax
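Back-of-the-envelope estimate (my own reading of these counts, assuming the second 19*xMax*yMax*zMax load term accounts for the write-allocate traffic of the stores): each fluid cell update moves about (19 + 19 + 19) * 8 Byte = 456 Byte through the memory interface, so a node sustaining b MByte/s of memory bandwidth can update at most roughly b/456 million lattice sites per second.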


Cluster nodes: ccNUMA pitfalls – Simple LBM kernel on a 2-socket Intel Xeon node (UMA)

[Performance chart, UMA node, three variants: correct parallel initialization; different thread scheduling in initialization and compute step; sequential initialization of data]


Cluster nodes: ccNUMA pitfalls – Simple LBM kernel on a 4-socket dual-core Opteron node (ccNUMA)

[Performance chart, ccNUMA node, same three variants: correct parallel initialization; different thread scheduling in initialization and compute step; sequential initialization of data – see the first-touch sketch below]
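To make the three variants concrete: on a ccNUMA node, memory pages are placed in the locality domain of the thread that first writes ("touches") them. A minimal first-touch sketch (my own illustration, not the RRZE benchmark code; compile with OpenMP enabled):

! Minimal first-touch sketch: pages end up in the NUMA domain of the
! thread that first writes them, so initialization decides data placement.
program first_touch
  implicit none
  integer, parameter :: n = 20000000
  double precision, allocatable :: a(:), b(:)
  integer :: i

  allocate(a(n), b(n))

  ! Pitfall (sequential initialization): a single thread touches all pages,
  ! so every page lands in one locality domain and all compute threads
  ! later fight over one memory bus:
  !   do i = 1, n
  !      a(i) = 0.0d0; b(i) = dble(i)
  !   end do

  ! Correct parallel initialization: same static schedule as the compute
  ! loop, so each thread's pages are local to its own socket.
!$OMP PARALLEL DO SCHEDULE(STATIC)
  do i = 1, n
     a(i) = 0.0d0
     b(i) = dble(i)
  end do
!$OMP END PARALLEL DO

  ! Compute sweep with the same schedule -> NUMA-local accesses.
!$OMP PARALLEL DO SCHEDULE(STATIC)
  do i = 1, n
     a(i) = 2.0d0 * b(i)
  end do
!$OMP END PARALLEL DO

  print *, a(n)
end program first_touch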


Cluster nodes: ccNUMA pitfalls – Filesystem cache on a 2-socket server: UMA vs. ccNUMA

for x in `seq 1 41` ; do
  dd if=/dev/zero of=/scratch/justatest bs=1M count=${x}00
  sync
  mpirun_rrze -np 4 ./triad.x < input.triads
done

The point of this experiment, as I read it: the page cache filled by dd occupies physical memory, so on the ccNUMA node later triad runs are increasingly served from a remote locality domain, while the UMA node is unaffected.

[Block diagram: ccNUMA node with two sockets, each with its own memory interface (MI) and local memory]

Main memory bandwidth – Did you ever check the stream numbers of your compute nodes?


Cluster nodes – Main memory bandwidth within a compute node

Theoretical (aggregate) bandwidth:
Intel Xeon 51xx ("Woodcrest"), 2 sockets: 21.3 GByte/s (= 2 * 1333 MHz * 8 Byte)
Intel Conroe / Xeon 30XX, 1 socket: 8.5 GByte/s (= 1 * 1066 MHz * 8 Byte)
Intel Kentsfield / QX6850, 1 socket: 10.6 GByte/s (= 1 * 1333 MHz * 8 Byte)
AMD Opteron/Barcelona, Socket F (memory controller on-chip): 10.6 GByte/s per socket (DDR2-667 DIMMs)

Popular kernels to measure real-world bandwidth (STREAM): A=B; A=s*B; A=B+C; A=B+s*C
"Optimized version": suppress the additional RFO for A (nontemporal stores)
Array size = 20,000,000, offset = 0
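For reference, a minimal triad sketch along these lines (my own illustration, not the benchmark actually used at RRZE; plain stores rather than the nontemporal-store variant, array size as on the slide, compiled with OpenMP):

! Minimal STREAM-triad sketch: a = b + s*c over 20,000,000 elements,
! all threads active, first-touch initialization with the same schedule.
program stream_triad
  use omp_lib
  implicit none
  integer, parameter :: n = 20000000
  double precision, allocatable :: a(:), b(:), c(:)
  double precision :: s, t0, t1, mbs
  integer :: i

  allocate(a(n), b(n), c(n))
  s = 3.0d0

!$OMP PARALLEL DO SCHEDULE(STATIC)
  do i = 1, n
     a(i) = 0.0d0
     b(i) = 1.0d0
     c(i) = 2.0d0
  end do
!$OMP END PARALLEL DO

  t0 = omp_get_wtime()
!$OMP PARALLEL DO SCHEDULE(STATIC)
  do i = 1, n
     a(i) = b(i) + s * c(i)
  end do
!$OMP END PARALLEL DO
  t1 = omp_get_wtime()

  ! STREAM counts 3 * 8 Byte per iteration; with ordinary stores the
  ! write-allocate (RFO) adds a fourth, uncounted transfer.
  mbs = 3.0d0 * 8.0d0 * dble(n) / (t1 - t0) / 1.0d6
  print '(A,F10.1,A)', 'TRIAD: ', mbs, ' MB/s'
end program stream_triad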


Cluster nodes (dual-cores) – Optimized version of STREAM running on all cores

System (CPU)                               | Sockets/node | COPY [MB/s] | SCALE [MB/s] | ADD [MB/s] | TRIAD [MB/s]
Intel slides, 3.0 GHz (WC; GC)             | 2            | 8204        | 8192         | 7680       | 7680
RRZE: Intel EA box, 2.66 GHz (WC; BF)      | 2            | 6195        | 6198         | 6220       | 6250
RRZE: HP DL140, 3.0 GHz (WC; GC)           | 2            | 7521        | 7519         | 6145       | 6149
RRZE: transtec, 3.0 GHz (WC; GC)           | 2            | 8193        | 8159         | 6646       | 6796
RRZE: CUDA workstation, 2.33 GHz (CT; GC)  | 2            | 8952        | 8962         | 7766       | 7796

There is no single stream number, even though CPU, chipset, and memory DIMM speed are identical!


Cluster nodes (quad-cores) – Optimized version of STREAM running on all cores

System (CPU)                                          | Sockets/node | COPY [MB/s] | SCALE [MB/s] | ADD [MB/s] | TRIAD [MB/s]
RRZE: HP DL140, 3.0 GHz (WC; GC)                      | 2            | 7521        | 7519         | 6145       | 6149
RRZE: Intel EA X5482, FSB1600, 3.2 GHz ("Harpertown") | 2            | 8180        | 8170         | 8840       | 9080
RRZE: AMD Opteron Barcelona, 2 GHz / DDR2-667         | 2            | 17027       | 15500        | 16684      | 16700
RRZE: Intel EA* X38ML server board, QX6850 (3.0 GHz)  | 1            | 6587        | 6566         | 6969       | 6962

* FSB1333; use 2 threads only


Cluster nodes (quad-cores) – "Optimized STREAM": vendors always choose the best measurements…

AMD K10 ("Barcelona"): we do not yet know what happens here, but we are working very hard on it.

Barcelona design: relative memory alignment constraints?


Parallelization by the compiler – The first thing it does is reduce performance…

[Chart: sequential performance of the lattice Boltzmann solver, OMP_NUM_THREADS=1; node diagram: two dual-core sockets behind one chipset]

4 cores vs. the sequential version: speed-up ~ 30%


Intra-socket scalability – Having a low baseline makes things easier…

single core: Intel Q6850

Scalability is important, but never forget the baseline.

[Chart: lattice Boltzmann solver performance; node diagram: quad-core socket with per-core caches, a shared L3 cache, and on-chip memory interface / HyperTransport]

Experiences with cluster performance

Tales from the trenches…


Cluster nodes – Single-socket nodes: Intel S3000PT board

Intel S3000PT board:
• 1 socket: Intel Xeon 30XX series
• 2 boards per 1U
• FSB1066, unbuffered DDR2
• 1 PCIe x8
• 2 SATA ports
• Intel AMT

http://www.intel.com/design/servers/boards/s3000PT/index.htm

Optimized for: MPI apps with high memory and/or MPI bandwidth requirements!

Not optimized for: maximum LINPACK per $$


Cluster nodes – Single-socket nodes: Intel S3000PT board

RRZE S3000PT cluster (installation: 09/2007):
• 66 compute nodes: 2.66 GHz Xeon 3070 (dual-core), 4 GB memory (DDR2-533)
• 72-port Flextronics IB DDR switch (max. 144 ports)
• delivered by transtec

Application performance compared with Woody, measured with the parallel RRZE benchmark suite:

Application (strong scaling)       | Cores | Performance S3000PT / Woody
AMBER8/pmemd (MD – Chemistry)      | 32    | 1.01
IMD (MD – Materials Sciences)      | 64    | 1.12
EXX (Quantum Chemistry)            | 16    | 1.14
OAK3D (Theoretical Physics)        | 64    | 1.29
trats/BEST (LBM solver – CFD)      | 64    | 1.37


Clusters – Never trust them… (S3000PT cluster)

[Chart 1: STREAM triad performance on arrival]
[Chart 2: STREAM triad performance after choosing the correct BIOS and removing bad memory DIMMs]
(DIMMs: Samsung, Kingston)


Clusters – Never trust them… DDRx Voltaire IB switch (Woody)

A simple ping-pong benchmark should get ~1500 MB/s (DDR) or ~1000 MB/s (SDR).

After several reboots & firmware upgrades: 1510 MB/s / 950 MB/s.

[Chart: first measurement of bandwidth for each link]
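A minimal sketch of such a ping-pong bandwidth check (my own illustration, not the tool used at RRZE; assumes an MPI library with the Fortran "mpi" module, two ranks, and an illustrative 4 MB message size):

! Minimal MPI ping-pong: rank 0 and rank 1 bounce a message back and
! forth; one-way bandwidth = message size / (half the round-trip time).
program pingpong
  use mpi
  implicit none
  integer, parameter :: nbytes = 4*1024*1024, nrep = 100
  character(len=1), allocatable :: buf(:)
  integer :: rank, ierr, i, stat(MPI_STATUS_SIZE)
  double precision :: t0, t1, mbs

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  allocate(buf(nbytes))
  buf = 'x'

  t0 = MPI_Wtime()
  do i = 1, nrep
     if (rank == 0) then
        call MPI_Send(buf, nbytes, MPI_CHARACTER, 1, 0, MPI_COMM_WORLD, ierr)
        call MPI_Recv(buf, nbytes, MPI_CHARACTER, 1, 0, MPI_COMM_WORLD, stat, ierr)
     else if (rank == 1) then
        call MPI_Recv(buf, nbytes, MPI_CHARACTER, 0, 0, MPI_COMM_WORLD, stat, ierr)
        call MPI_Send(buf, nbytes, MPI_CHARACTER, 0, 0, MPI_COMM_WORLD, ierr)
     end if
  end do
  t1 = MPI_Wtime()

  if (rank == 0) then
     mbs = 2.0d0 * dble(nbytes) * dble(nrep) / (t1 - t0) / 1.0d6
     print '(A,F10.1,A)', 'ping-pong bandwidth: ', mbs, ' MB/s'
  end if

  call MPI_Finalize(ierr)
end program pingpong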


Clusters – Never trust them…

A "cheap cluster" acquired by a local research group

Putting DDR IB cards into PCIe x4 slots may work, but it is not a good idea: a PCIe 1.x x4 slot delivers only about 1 GB/s per direction, well below the ~1.5 GB/s a DDR IB link can sustain.


Clusters – Never trust anyone…

A "cheap cluster" acquired by a local research group:
"We were told that AMD is the best processor available!"
→ 2-way nodes (AMD Opteron dual-core, 2.2 GHz) + DDR IB network
"Why buy a commercial compiler when a free one is available?"
→ gfortran

Target application: AMBER9/pmemd

4 MPI processes on one node of Woody:

Compiler / MPI              | Runtime [s]
SUN Studio 12 / OpenMPI     | 3500
gfortran / Intel MPI        | 3000
Intel64 9.1 / Intel MPI     | 2700

8 MPI processes (Intel64 9.1 + Intel MPI):

System                      | Process layout | Runtime [s]
Opteron cluster (2.2 GHz)   | 2*2*2          | 1930
Woody cluster (3.0 GHz)     | 2*2*2          | 1430
2 x Intel QX6850 (3.0 GHz)  | 2*1*4          | 1440


Clusters – Yet another cluster OS?

7 AMD Opteron nodes (Dual Core / Dual Socket)

4 GB per node

Windows 2003 Enterprise + Compute Cluster Pack

Visual Studio 2005, Intel compilers, MKL, ACML

Star-CD

GBit Ethernet

Access via RDP or ssh (sshd from Cygwin)

GUI tool for job control: Cluster Job Manager / CLI: job.cmd script

New users for RRZE: Chair for Statistics and Econometrics


Clusters – Windows CCS is really fast (in migrating processes)

[Chart: performance in MLUPs/s for three variants: placement+pinning, placement only, no placement]

4 MB L2 limit; NUMA placement: +60%; additional pinning: +30%

Pinning benefit is only due to better NUMA locality!

(The chart shows the performance of a 2D Jacobi (heat conduction) solver.)
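For orientation, a minimal sketch of the kind of kernel behind this measurement (my own illustration; grid size and iteration count are made up, and the NUMA placement/pinning from the slide is done by the runtime or OS tools, not in the code):

! Minimal 2D Jacobi (heat conduction) sweep with OpenMP; performance is
! usually reported in million lattice-site updates per second (MLUPs/s).
program jacobi2d
  implicit none
  integer, parameter :: n = 4000, iters = 100
  double precision, allocatable :: u(:,:), unew(:,:)
  integer :: i, j, it

  allocate(u(0:n+1,0:n+1), unew(0:n+1,0:n+1))

  ! first-touch initialization: placement follows the thread layout
!$OMP PARALLEL DO SCHEDULE(STATIC)
  do j = 0, n+1
     u(:,j)    = 0.0d0
     unew(:,j) = 0.0d0
  end do
!$OMP END PARALLEL DO
  u(0,:) = 1.0d0   ! boundary condition

  do it = 1, iters
!$OMP PARALLEL DO SCHEDULE(STATIC) PRIVATE(i)
     do j = 1, n
        do i = 1, n
           unew(i,j) = 0.25d0 * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1))
        end do
     end do
!$OMP END PARALLEL DO
!$OMP PARALLEL DO SCHEDULE(STATIC)
     do j = 1, n
        u(1:n,j) = unew(1:n,j)
     end do
!$OMP END PARALLEL DO
  end do

  print *, u(n/2, n/2)
end program jacobi2d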

What's next? SUN Niagara2 / IBM Cell / GPUs

[Block diagram of the SUN Niagara2 (© Sun): 8 cores (C1–C8), each with 16 KB I$, 8 KB D$, an FPU and an SPU, 8 threads per core, 2 execution pipes, 1 op/cycle per pipe; a crossbar connects the cores to 8 L2$ banks (4 MB L2$ total) and 4 memory controllers with dual-channel FB-DIMM (x10 write / x14 read @ 4.0 GT/s, 2–8 DIMMs, 42 GB/s read, 21 GB/s write); plus NIU (10 Gb Ethernet), PCIe x8 @ 2.5 GHz (2 GB/s each direction) via a Sys I/F buffer switch, and an SSI/JTAG debug port]

• Massive parallelism

• Programming / optimization models are completely new for most of us: porting of kernels only → Amdahl's law (see below)

• Most "accelerators" will stay in niche markets (remember: Itanium failed because of complex optimization and missing SW compatibility!)
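As a reminder of why "porting of kernels only" is limiting, Amdahl's law for an accelerated fraction p of the runtime and a speedup factor N on that fraction:

S(N) = 1 / ((1 - p) + p/N)  ≤  1 / (1 - p)

For example, if 80% of the runtime lives in ported kernels, even an infinitely fast accelerator yields at most a 5x overall speedup.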


Summary

Clusters provide tremendous compute capacity at a low price tag, but they are far from being
• a standard product designed for optimal performance on HPC apps,
• a solution for highly parallel high-end apps.

(Heterogeneous) multi-/many-core architectures will further improve the price/performance ratio, but will also increase programming complexity.

Most users of HPC systems will not be able to adequately address the challenges and problems pointed out in this talk!