TRANSCRIPT
Performance modeling in Germany
Why care about cluster performance?
G. Wellein, G. Hager, T. Zeiser, M. Meier
Regional Computing Center Erlangen (RRZE), Friedrich-Alexander-University Erlangen-Nuremberg
April 16, 2008, IDC HPC User Forum
HPC Centers in Germany: A view from Erlangen
Berlin, Hannover
Jülich Supercomputing Center (FZ Jülich): 8.9 TFlop/s IBM Power4+, 46 TFlop/s BlueGene/L, 228 TFlop/s BlueGene/P
HLRS Stuttgart: 12 TFlop/s NEC SX8
LRZ Munich: SGI Altix (62 TFlop/s)
Erlangen/Nürnberg (RRZE)
Friedrich-Alexander-University Erlangen-Nuremberg (FAU)
2nd largest university in Bavaria
26,500+ students (winter semester 2006/07)
12,000+ employees
11 faculties
83 institutes
23 hospitals
265 chairs (C4 / W3)
141 fields of study
250 buildings scattered across 4 cities (Erlangen, Nuremberg, Fürth, Bamberg)
RRZE provides all “IT” services for the university
Introduction: Modeling and Simulation – Interdisciplinary Research Focus of FAU
Theoretical Physics
Fluid Dynamics
Material Sciences
Life Sciences / Chemistry
Nano-Sciences
Applied Mathematics
Applied Physics
Computational Sciences
Compute cycles in 2007 (RRZE only): > 8 million CPU-hours
Introduction: HPC strategy of RRZE
[Diagram: the High Performance Computing support group at RRZE sits between Science (problem, analysis, methods & algorithms, software engineering, solution) and Computer Science / Mathematics, providing access, parallelization/debugging, optimization, and data handling]
Introduction – RRZE: Compute Resources (2003-2007)
Compute cluster:
216 2-way compute nodes: 86 nodes Intel Xeon 2.6 GHz (FSB533), 64 nodes Intel Xeon 3.2 GHz (FSB800), 66 nodes Intel Xeon 2.66 GHz (dual-core)
25 4-way compute nodes: AMD Opteron 270 (2.0 GHz), dual-core
GBit Ethernet network; InfiniBand: 24 nodes (SDR) + 66 nodes (DDR)
5.5 + 13 TByte disk space
Installation: 4/2003; upgrades: 12/2004, Q4/2005, Q3/2007
Compute servers:
SGI Altix 3700: 32 Itanium2 1.3 GHz, 128 GByte memory, 3 TByte disk space, installed 11/2003
SGI Altix 330: 16 Itanium2 1.5 GHz, 32 GByte memory, installed 3/2006
4+16 CPUs paid from scientists' funding
374 of 532 CPUs paid from scientists' funding
RRZE “Woody” Cluster (HP / Bechtle)
876 Intel Xeon 5160 processor cores at 3.0 GHz -> 12 GFlop/s per core
HP DL140 G3 nodes (217 compute + 2 login nodes)
Peak performance: 10512 GFlop/s; LINPACK: 7315 GFlop/s
Main memory: 8 GByte per compute node
Voltaire “DDRx” IB switch: 240 ports
OS: SLES9
Parallel filesystem (SFS): 15 TByte (4 OSS)
NFS filesystem: 15 TByte
Installation: Oct. 2006
Top500 Nov. 2006: rank 124 (760 cores)
Top500 Nov. 2007: rank 329 (876 cores)
Power consumption > 100 kW
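A quick check of these numbers (my arithmetic, not from the slide): a Woodcrest core can retire up to four double-precision flops per cycle (one packed SSE add plus one packed SSE multiply), so

$$ P_{\text{core}} = 3.0\,\text{GHz} \times 4\,\tfrac{\text{flops}}{\text{cycle}} = 12\,\text{GFlop/s}, \qquad P_{\text{peak}} = 876 \times 12\,\text{GFlop/s} = 10512\,\text{GFlop/s} $$

and the LINPACK result of 7315 GFlop/s corresponds to roughly 70% of peak.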
RRZE: Our HPC backbone
Dr. Wellein (Physics), Dr. Hager (Physics), Dr. Zeiser (CFD), M. Meier (Computer Science)
Tasks: user support; software; parallelization & optimization; evaluation of new hardware; HPC tutorials & lectures; system administration
Architecture of cluster nodes – ccNUMA: why care about it?
Cluster nodes – Basic architecture of compute nodes
[Block diagrams: dual-CPU Intel Xeon node (old) and dual-socket Intel “Core” node, where the sockets reach memory through a shared chipset (northbridge); dual-socket AMD Opteron node, where each socket has an on-chip memory interface (MI) and its own local memory]
The Intel platform provides one path per socket to memory (still UMA)
HyperTransport provides scalable bandwidth on Opteron systems but introduces a ccNUMA architecture: where does my data finally end up?
Intel will move to ccNUMA with its QuickPath (CSI) technology
Cluster nodes: ccNUMA pitfalls – Simple Lattice Boltzmann Method (LBM) kernel

      double precision f(0:xMax+1,0:yMax+1,0:zMax+1,0:18,0:1)
!$OMP PARALLEL DO PRIVATE(y,x,...) SCHEDULE(RUNTIME)
      do z=1,zMax
        do y=1,yMax
          do x=1,xMax
            if( fluidcell(x,y,z) ) then
              ! Collide: load all 19 distributions of the local cell
              LOAD f(x,y,z, 0:18,t)
              Relaxation (complex computations)
              ! Stream: store the 19 updated distributions to neighboring cells
              SAVE f(x  ,y  ,z  ,  0,t+1)
              SAVE f(x+1,y+1,z  ,  1,t+1)
              ...
              SAVE f(x  ,y-1,z-1, 18,t+1)
            endif
          enddo
        enddo
      enddo
!$OMP END PARALLEL DO

#load operations:  19*xMax*yMax*zMax + 19*xMax*yMax*zMax
#store operations: 19*xMax*yMax*zMax
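A rough bandwidth-limited estimate based on these counts (my reading, not stated on the slide: the second 19·N load term is the write-allocate traffic caused by the stores), assuming 8-byte values:

$$ B_{\text{cell}} \approx (19 + 19 + 19) \times 8\,\text{B} = 456\,\text{B} \quad\Rightarrow\quad P \approx \frac{b_{\text{mem}}}{456\,\text{B}}, \ \text{e.g. } 6\,\text{GB/s} \rightarrow \approx 13\,\text{MLUP/s} $$

This is why such kernels are quoted in lattice-site updates per second (MLUP/s) and why they are so sensitive to where their memory pages actually live.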
Cluster nodes: ccNUMA pitfalls – Simple LBM kernel on a 2-socket Intel Xeon node (UMA)
[Plot legend: correct parallel initialization; different thread scheduling in initialization and compute step; sequential initialization of data]
Cluster nodes: ccNUMA pitfalls – Simple LBM kernel on a 4-socket dual-core Opteron node (ccNUMA)
[Plot legend: correct parallel initialization; different thread scheduling in initialization and compute step; sequential initialization of data]
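The “correct parallel initialization” case relies on first-touch page placement: on a ccNUMA node a memory page ends up in the NUMA domain of the thread that touches it first. A minimal C/OpenMP sketch of the idea (illustrative names and layout, not the RRZE code):

    #include <stdlib.h>

    /* Allocate the distribution array and initialize it with the SAME
     * parallel loop structure and schedule as the compute kernel, so each
     * thread first touches (and thereby places) the pages it will later
     * work on. */
    double *alloc_and_init(int zMax, int yMax, int xMax, int q)
    {
        size_t plane = (size_t)yMax * xMax * q;
        double *f = malloc((size_t)zMax * plane * sizeof(double));

    #pragma omp parallel for schedule(static)   /* must match the compute loop */
        for (int z = 0; z < zMax; z++)
            for (size_t i = z * plane; i < (z + 1) * plane; i++)
                f[i] = 0.0;                     /* first touch places the page */

        return f;
    }

Sequential initialization places all pages in the memory of one socket, and a different schedule in the initialization loop places pages in the wrong NUMA domains; both show up as the slower curves in the plot above.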
Cluster nodes: ccNUMA pitfalls – Filesystem cache: 2-socket server, UMA vs. ccNUMA

    for x in `seq 1 41` ; do
      dd if=/dev/zero of=/scratch/justatest bs=1M count=${x}00   # fill the filesystem cache with x*100 MB
      sync
      mpirun_rrze -np 4 ./triad.x < input.triads                 # then measure memory bandwidth
    done

[Diagram: ccNUMA node with two sockets, each with its own memory interface (MI) and local memory]
Main memory bandwidth – Did you ever check the STREAM numbers of your compute nodes?
Cluster nodes – Main memory bandwidth within a compute node
Theoretical (aggregate) bandwidth:
Intel Xeon 51xx (“Woodcrest”), 2 sockets: 21.3 GByte/s ( = 2 * 1333 MHz * 8 Byte)
Intel Conroe / Xeon 30xx, 1 socket: 8.5 GByte/s ( = 1 * 1066 MHz * 8 Byte)
Intel Kentsfield / QX6850, 1 socket: 10.6 GByte/s ( = 1 * 1333 MHz * 8 Byte)
AMD Opteron/Barcelona (memory controller on-chip), Socket F: 10.6 GByte/s per socket (DDR2-667 DIMMs)
Popular kernels to measure real-world bandwidth – STREAM: COPY A=B; SCALE A=s*B; ADD A=B+C; TRIAD A=B+s*C
“Optimized version”: suppress the additional read for ownership (RFO) of A via nontemporal stores
Array size = 20,000,000; offset = 0
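For reference, a minimal C/OpenMP triad kernel in the spirit of STREAM (a sketch, not the official benchmark and not the “optimized” builds, which additionally use nontemporal stores to avoid the RFO mentioned above):

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define N 20000000          /* array size as on the slide */

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        double s = 3.0;

        /* parallel first-touch initialization (matters on ccNUMA nodes) */
    #pragma omp parallel for
        for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t = omp_get_wtime();
    #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + s * c[i];             /* TRIAD: A = B + s*C */
        t = omp_get_wtime() - t;

        /* 3 arrays x 8 bytes per iteration; write-allocate traffic not counted */
        printf("TRIAD: %.0f MB/s\n", 3.0 * 8.0 * N / t / 1e6);
        free(a); free(b); free(c);
        return 0;
    }

Compile with OpenMP enabled and run with one thread per core to mimic the “running on all cores” setting of the tables below.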
Cluster nodes (dual-cores): optimized version of STREAM running on all cores

    System                                       sockets/node  COPY [MB/s]  SCALE [MB/s]  ADD [MB/s]  TRIAD [MB/s]
    Intel slides, 3.0 GHz (WC; GC)               2             8204         8192          7680        7680
    RRZE: Intel EA box, 2.66 GHz (WC; BF)        2             6195         6198          6220        6250
    RRZE: HP DL140, 3.0 GHz (WC; GC)             2             7521         7519          6145        6149
    RRZE: transtec, 3.0 GHz (WC; GC)             2             8193         8159          6646        6796
    RRZE: CUDA workstation, 2.33 GHz (CT; GC)    2             8952         8962          7766        7796

There is not a single STREAM number, even though CPU, chipset, and memory DIMM speed are identical!
Cluster nodes (quad-cores): optimized version of STREAM running on all cores

    System                                                  sockets/node  COPY [MB/s]  SCALE [MB/s]  ADD [MB/s]  TRIAD [MB/s]
    RRZE: HP DL140, 3.0 GHz (WC; GC)                        2             7521         7519          6145        6149
    RRZE: Intel EA X5482, FSB1600, 3.2 GHz (“Harpertown”)   2             8180         8170          8840        9080
    RRZE: AMD Opteron Barcelona, 2 GHz / DDR2-667           2             17027        15500         16684       16700
    RRZE: Intel EA* X38ML server board, QX6850 (3.0 GHz)    1             6587         6566          6969        6962

    * FSB1333; only 2 threads used
Cluster nodes (quad-cores): “Optimized STREAM” – the vendors always choose the best measurements…
We do not yet know what happens here, but we are working hard on it
AMD K10 (“Barcelona”)
Barcelona design: relative memory alignment constraints?
Parallelization by compiler – The first thing they do is reduce performance…
[Plot: sequential performance of the Lattice Boltzmann solver vs. the compiler-parallelized version with OMP_NUM_THREADS=1; node diagram: two dual-core sockets behind one chipset]
Relative to the sequential version, 4 cores give a speed-up of only ~30%
Intra-socket scalability – Having a low baseline makes things easier…
[Plot: Lattice Boltzmann solver scaling within one socket, baseline: a single core of an Intel QX6850; node diagram: quad-core socket with per-core caches, a shared L3 cache, and on-chip memory interface / HT]
Scalability is important, but never forget the baseline
Experiences with cluster performance – Tales from the trenches…
Cluster nodes – Single-socket nodes: Intel S3000PT board
Intel S3000PT board: 1 socket (Intel Xeon 30xx series); 2 boards per 1U; FSB1066 with unbuffered DDR2; 1 PCIe x8 slot; 2 SATA ports; Intel AMT
http://www.intel.com/design/servers/boards/s3000PT/index.htm
Optimized for: MPI applications with high memory and/or MPI bandwidth requirements
Not optimized for: maximum LINPACK per $$
Cluster nodes – Single-socket nodes: Intel S3000PT board
RRZE S3000PT cluster (installation: 09/2007):
66 compute nodes, each with one 2.66 GHz Xeon 3070 (dual-core) and 4 GB memory (DDR2-533)
72-port Flextronics IB DDR switch (max. 144 ports)
Delivered by transtec

Application performance compared with WOODY, measured with the parallel RRZE benchmark suite (strong scaling):

    Application                        Cores  Performance S3000PT/WOODY
    AMBER8/pmemd (MD – Chemistry)      32     1.01
    IMD (MD – Materials Sciences)      64     1.12
    EXX (Quantum Chemistry)            16     1.14
    OAK3D (Theoretical Physics)        64     1.29
    trats/BEST (LBM solver – CFD)      64     1.37
Clusters – Never trust them… (S3000PT cluster)
[Plots: STREAM triad performance of the nodes on arrival vs. after choosing the correct BIOS settings and removing bad memory DIMMs; DIMM vendors: Samsung, Kingston]
Clusters – Never trust them… DDRx Voltaire IB switch (WOODY)
A simple ping-pong should achieve ~1500 MB/s (DDR) or ~1000 MB/s (SDR)
First measurement of bandwidth for each link: 950 MB/s; after several reboots and firmware upgrades: 1510 MB/s
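A simple way to check each link is a plain MPI ping-pong; a minimal sketch (illustrative, not the measurement harness used at RRZE):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int nbytes = 4 << 20;          /* 4 MB message */
        const int reps   = 100;
        char *buf = malloc(nbytes);

        MPI_Barrier(MPI_COMM_WORLD);
        double t = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {                  /* ping */
                MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {           /* pong */
                MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t = MPI_Wtime() - t;

        if (rank == 0)                        /* 2 messages per repetition */
            printf("uni-directional bandwidth: %.0f MB/s\n",
                   2.0 * reps * (double)nbytes / t / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }

Running one rank on each of the two nodes attached to the link under test should approach the ~1500 MB/s quoted above on a healthy DDR link.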
Clusters – Never trust them…
A “cheap cluster” acquired by a local research group
Putting DDR IB cards into PCIe x4 slots may work, but it is not a good idea…
Clusters – Never trust anyone…
A “cheap cluster” acquired by a local research group: “We were told that AMD is the best processor available!”
2-way nodes (AMD Opteron dual-core, 2.2 GHz) + DDR IB network
“Why buy a commercial compiler when a free one is available?” -> gfortran
Target application: AMBER9/pmemd

4 MPI processes on one node of Woody:

    Compiler / MPI                 Runtime [s]
    SUN Studio 12 / OpenMPI        3500
    gfortran / Intel MPI           3000
    Intel 64 9.1 / Intel MPI       2700

8 MPI processes (Intel 64 9.1 + Intel MPI); process layout nodes*sockets*cores:

    System                        Layout  Runtime [s]
    Opteron cluster (2.2 GHz)     2*2*2   1930
    Woody cluster (3.0 GHz)       2*2*2   1430
    2 x Intel QX6850 (3.0 GHz)    2*1*4   1440
Clusters – Yet another cluster OS?
7 AMD Opteron nodes (dual-core / dual-socket), 4 GB per node
Windows 2003 Enterprise + Compute Cluster Pack
Visual Studio 2005, Intel compilers, MKL, ACML, Star-CD
GBit Ethernet; access via RDP or ssh (sshd from Cygwin)
GUI tool for job control: Cluster Job Manager; CLI: job.cmd script
New users for RRZE: Chair for Statistics and Econometrics
Clusters – Windows CCS is really fast (at migrating processes)
[Plot: performance (MLUP/s) of a 2D Jacobi (heat conduction) solver with placement+pinning, placement only, and no placement; 4 MB L2 limit marked]
NUMA placement: +60%; additional pinning: +30%
The pinning benefit is only due to better NUMA locality!
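The measurements above used the Windows CCS placement and affinity mechanisms; as a sketch of the same pinning idea on Linux (illustrative core numbering, assuming one OpenMP thread per core):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <omp.h>

    /* Each OpenMP thread binds itself to one core, so the OS cannot migrate
     * it away from the NUMA domain holding the data it first touched. */
    int main(void)
    {
    #pragma omp parallel
        {
            cpu_set_t mask;
            CPU_ZERO(&mask);
            CPU_SET(omp_get_thread_num(), &mask);   /* thread i -> core i */
            sched_setaffinity(0, sizeof(mask), &mask);
            /* ... NUMA-aware initialization and Jacobi sweeps go here ... */
        }
        return 0;
    }

Without such pinning the OS may move a thread to another socket, after which all of its memory accesses become remote; that is why the pinning benefit above is purely a NUMA-locality effect.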
What’s next? SUN Niagara2 / IBM Cell / GPUs
[Block diagram of the SUN Niagara2 (© Sun): 8 cores (C1-C8), each with a 16 KB I$ and 8 KB D$, one FPU and one SPU, 8 threads per core, 2 execution pipes, 1 op/cycle per pipe; a crossbar connects the cores to 8 L2$ banks (4 MB L2$ total) and 4 memory controllers with dual-channel FB-DIMM links (x14 read / x10 write @ 4.0 GT/s, 2-8 DIMMs; 42 GB/s read, 21 GB/s write); NIU with 10 Gb Ethernet; PCIe x8 @ 2.5 GHz, 2 GB/s each direction; SSI/JTAG debug port]
• Massive parallelism
• Programming / optimization models are completely new for most of us: porting only the kernels means Amdahl’s law limits the speedup
• Most “accelerators” will stay in niche markets (remember: Itanium failed because of complex optimization and missing software compatibility!)
Summary
Clusters provide tremendous compute capacity at a low price tag, but they are far from being a standard product, far from being designed for optimal performance on HPC applications, and far from being a solution for highly parallel high-end applications.
(Heterogeneous) multi-/many-core architectures will further improve the price/performance ratio but will increase programming complexity.
Most users of HPC systems will not be able to adequately address the challenges and problems pointed out in this talk!