Member of the Helmholtz Association
Towards PetaFlops Computing with IBM Blue Gene
N. Attig, F. Hoßfeld
26.02.2008 PASA Workshop 2
Outline
Part 1: Integration of the Jülich Supercomputing Centre (JSC) into local, regional, national and European structures
FZJ, JSC, NIC, IAS, JPSS, JARA, GCS, PRACE
Part 2: Supercomputing at JSC with IBM Blue Gene
Development of Blue Gene systems at JSC
Architectural Highlights
Part 3: Applications on Blue Gene
Summary
26.02.2008 PASA Workshop 3
Forschungszentrum Jülich (FZJ)
26.02.2008 PASA Workshop 4
Supercomputing at FZJ (I)
Supercomputing at FZJ is supported by the Jülich Supercomputing Centre (JSC, formerly ZAM) and the virtual institute John von Neumann Institute for Computing (NIC)
JSC is responsible for the operation of the supercomputers, for user support, for R&D work in the field of computer and computational science, and for education and training
NIC is responsible for the peer-reviewed provision of computer time to national and European computational science projects
26.02.2008 PASA Workshop 5
Organisation Structure
IAS Institute for Advanced Simulation
Institute for Computational Nanoscience
Institute for Computational Biology
Institute for Systems Biology
John von Neumann Institute for Computing
Jülich Supercomputing Centre (JSC)
German Research School (GRS)
Jülich-Aachen Research Alliance JARA-SIM
Jülich Platform for Simulation Sciences (JPSS)
Partnership for Advanced Computing in Europe (PRACE)
26.02.2008 PASA Workshop 6
JARA-SIM Structure
26.02.2008 PASA Workshop 7
Supercomputing in Germany
[Map of supercomputing centres in Germany: JSC Jülich, HLRS Stuttgart, LRZ Garching, HLRN Berlin, HLRN Hannover, RZG Garching, DWD Offenbach, DKRZ Hamburg, Wuppertal, Aachen, Dresden]
Legend: National Centres, State Centres, Topical Centres, Universities
26.02.2008 PASA Workshop 8
National HPC Pyramid
European HPC Centres
National HPC Centres (3): Garching, Jülich, Stuttgart
Topical HPC Centres, Centres with regional tasks (~10): Aachen, Berlin, DKRZ, Dresden, DWD, Karlsruhe, Hannover, MPG/RZG, Paderborn
HPC Servers at Universities/Institutes (~100)
26.02.2008 PASA Workshop 9
Gauss Centre for Supercomputing
Alliance of the three German SC centres HLRS, JSC, LRZ
Largest supercomputer complex in Europe
Creating one joint scientific governance
German representative in PRACE
Applicant for a European supercomputer centre in FP7
More information: http://www.gauss-centre.eu
26.02.2008 PASA Workshop 10
PRACE: Towards European HPC
PRACE: Partnership for Advanced Computing in Europe
Objective: Create and implement a persistent and sustainable pan-European High Performance Computing (HPC) service
Consortium of Austria, Finland, France, Germany, Greece, Ireland, Italy, the Netherlands, Norway, Poland, Portugal, Spain, Switzerland, Turkey and the United Kingdom. German representative: GCS
Memorandum of Understanding signed April 17, 2007
Submission of a joint proposal (also named PRACE) to the EU for the realization of a PRACE Preparatory Phase (May 2)
Launch of PRACE project: January 1, 2008; Kickoff: Jülich, January 29-30, 2008
26.02.2008 PASA Workshop 11
PRACE: Towards European HPC (2)
The Governance of an Open Structure
• “Principal partner” is a representative of a European country that has expressed its interest in hosting (and funding) one of the main Tier-0 HPC centres of the target Tier-0 HPC infrastructure.
• “General partner” is a representative of a European state that has expressed its interest in collaborating on many aspects related to definition, operation and scientific management.
• “Associate partner” is a concept that will permit the gradual involvement of scientific communities and industrial users (e.g. climate (ENES), fusion (EFDA) or bioinformatics (EBI), …)
26.02.2008 PASA Workshop 12
PRACE: Towards European HPC (3)
Other Stakeholders
• “The users” are academic and industrial groups and organizations that require capability computing to perform their scientific tasks or to ensure the competitiveness of their products and services.
• The “European Commission” is involved as a facilitator and catalyst, by sponsoring ESFRI and eIRG, by implementing the Capacities FP7 programme, by funding several user communities or projects of academic or industrial relevance, and by providing key infrastructures (GEANT, DEISA).
• “National funding agencies” will provide part of the funding of the European HPC infrastructure.
26.02.2008 PASA Workshop 13
HPC Infrastructure: Embedding of JSC
[Diagram: JSC embedded in PRACE, GCS, HGF, NIC, JARA and university cooperations]
26.02.2008 PASA Workshop 14
Part 2
Supercomputing at JSC with IBM Blue Gene
26.02.2008 PASA Workshop 15
FZJ Dual Supercomputer Complex
IBM p690 e-server JUMP (2004)
IBM Blue Gene/L JUBL (2005/6)
IBM Blue Gene/P JUGENE (2007/8)
JUMP successor, > 200 TFlop/s (2009/10)
Petaflop/s system
File Server
26.02.2008 PASA Workshop 16
Development of Blue Gene systems at JSC (I)
Summer 2005
Installation of a 1-rack Blue Gene/L
Summer/Autumn 2005
Porting applications to BG/L:
Lattice Quantum Chromodynamics (LQCD): no surprise, BG/L is a “spin-off” of QCDOC
Theoretical Chemistry: CPMD, VASP
CFD: Blood Flow in a Ventricular Assist Device
Materials Science: Crack Propagation
Laser-Plasma Interaction
Biophysics: Simulating Protein Folding
Quantum Computing
Parallel Performance Analysis
26.02.2008 PASA Workshop 17
Development of Blue Gene systems at JSC (II)
January 2006
System upgrade to 8-rack BG/L (16,384 procs), alias JUBL
March 2006
JUBL Tutorial, Inauguration (06/03/2006)
May 2006
Blue Gene Week: Optimising existing BG/L codes
December 2006
Blue Gene/L Scaling Workshop
Improving the scaling behaviour of selected applications: 7 teams, different research areas
26.02.2008 PASA Workshop 18
Development of Blue Gene systems at JSC (III)
Spring 2007
Decision to upgrade BG/L to a BG/P system
October 2007
Installation of a BG/P system, alias JUGENE
October / November 2007
Linpack benchmark → No. 2 in TOP500
Stabilizing hardware
December 2007 / January 2008
Stabilizing hardware
Testing of system software, first users (IBM/JSC)
February 22, 2008
JUGENE Tutorial, official inauguration of JUGENE
26.02.2008 PASA Workshop 19
Blue Gene/P design
Chip: 4 processors, 13.6 GF/s
Compute Card: 1 chip, 13.6 GF/s, 2.0 GB DDR2 (4.0 GB optional)
Node Card: 32 chips (4x4x2), 32 compute cards, 0-2 I/O cards, 435 GF/s, 64 GB
Rack: 32 Node Cards, cabled 8x8x16, 13.9 TF/s, 2 TB
System: 72 Racks (72x32x32), 1 PF/s, 144 TB
Source: IBM
26.02.2008 PASA Workshop 20
Blue Gene/P networks
3-Dimensional Torus
Interconnects all compute nodes
Virtual cut-through hardware routing
425 MB/s on all 12 node links (5.1 GB/s per node)
Communications backbone for computations
188 TB/s total bandwidth
Collective Network
Interconnects all compute and I/O nodes
One-to-all broadcast functionality
Reduction operations functionality
850 MB/s bandwidth per link
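Application codes see these networks mostly through MPI. Below is a minimal sketch (not taken from the slides) of how a code can request a periodic 3-D Cartesian communicator so that nearest-neighbour exchanges map naturally onto torus links; all calls are generic MPI, and the dimension split is left to MPI_Dims_create.

```c
/* Minimal sketch: map MPI ranks onto a periodic 3-D process grid so that
 * nearest-neighbour communication matches the 3-D torus network. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int dims[3] = {0, 0, 0};      /* let MPI factorise the process count */
    int periods[3] = {1, 1, 1};   /* periodic in x, y, z, like the torus */
    MPI_Dims_create(nprocs, 3, dims);

    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1 /* allow reorder */, &cart);

    /* Neighbours along x; the same call with direction 1 or 2 gives y and z. */
    int left, right;
    MPI_Cart_shift(cart, 0, 1, &left, &right);

    if (rank == 0)
        printf("process grid %d x %d x %d\n", dims[0], dims[1], dims[2]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```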
26.02.2008 PASA Workshop 21
Blue Gene/P networks
Low Latency Global Barrier and Interrupt
Latency of one way to reach all nodes: 0.65 µs (hardware), 1.6 µs (MPI)
External I/O Network
10 GBit Ethernet
Active in the I/O nodes
All external communication (file I/O, control, user interaction, etc.)
Control Network
1 GBit Ethernet
Boot, monitoring and diagnostics
26.02.2008 PASA Workshop 22
Special Features
• Double Hummer FPU (BG/L and BG/P)
Parallel floating-point operations on pairs of doubles
Loads and stores both in single and double precision
Cross operations that allow a complex multiplication to be computed in two instructions (see the sketch after this list)
• DMA (Direct Memory Access; BG/P only)
DMA engine interfaces with the torus network
DMA has separate access to the L3 cache
DMA can send messages to other nodes or to itself, capable of direct puts and gets
MPI_ISEND and MPI_IRECV implicitly use DMA
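As an illustration of the two-instruction complex multiply, here is a hedged sketch using the IBM XL C double-hummer built-ins __fxpmul and __fxcxnpma (check the XL compiler documentation for the exact names and header); HAVE_DOUBLE_HUMMER is a hypothetical configure guard, and the portable fallback keeps the code compilable elsewhere.

```c
/* Hedged sketch, not taken from the slides: complex multiply c = a*b in two
 * double-hummer operations, assuming the XL C built-ins behave as
 *   __fxpmul(b, s)      = ( s*Re(b),          s*Im(b)          )
 *   __fxcxnpma(t, b, s) = ( Re(t) - s*Im(b),  Im(t) + s*Re(b)  )
 * which combine to ( Re(a)Re(b) - Im(a)Im(b),  Re(a)Im(b) + Im(a)Re(b) ). */
#include <complex.h>

static inline double _Complex cmul2(double _Complex a, double _Complex b)
{
#ifdef HAVE_DOUBLE_HUMMER                        /* hypothetical guard */
    double _Complex t = __fxpmul(b, creal(a));   /* operation 1 */
    return __fxcxnpma(t, b, cimag(a));           /* operation 2 */
#else
    return a * b;                                /* portable fallback */
#endif
}
```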
26.02.2008 PASA Workshop 23
Why Blue Gene?
Advantages
Low power consumption
Small footprint
Scalable architecture
Transparent high-speed reliable network
Balanced system: processors, memory, network
Reasonable price-performance ratio
System for capability computing
Disadvantages
No OpenMP on BG/L (4-way SMP on BG/P)
Small memory: 0.5 GB per node (2 proc.) on BG/L, 2.0 GB per node (4 proc.) on BG/P
26.02.2008 PASA Workshop 24
JUBL: Jülich Blue Gene/L
16,384 processors: PowerPC 440, 700 MHz, 2 proc. per node
45.8 Tflop/s peak, 36.5 Tflop/s Linpack
4 TByte memory, 0.5 GByte per node
Main memory bandwidth per node: 5.6 GByte/s
Torus network, bandwidth: 2.1 GByte/s, latency: 6.4 µs
26.02.2008 PASA Workshop 25
JUGENE: Jülich Blue Gene/P
65,536 processors: PowerPC 450, 850 MHz, 4 proc. per node
222.8 Tflop/s peak, 167.3 Tflop/s Linpack
32 TByte memory, 2 GByte per node
560 kW power consumption
Main memory bandwidth per node: 13.6 GByte/s
Torus network, bandwidth: 5.1 GByte/s, latency: 3.2 µs
Highly scalable leadership-class system, No. 2 worldwide in the 11/2007 TOP500
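The peak figure is consistent with the clock rate and the width of the double-hummer FPU; a quick check, assuming 4 floating-point operations per core and cycle (one dual fused multiply-add) and 16,384 nodes of 4 cores each:

$$0.85\,\text{GHz} \times 4\,\text{flop/cycle} = 3.4\,\text{GFlop/s per core}$$
$$3.4 \times 4\,\text{cores} = 13.6\,\text{GFlop/s per node},\qquad 13.6 \times 16{,}384\,\text{nodes} \approx 222.8\,\text{TFlop/s}$$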
26.02.2008 PASA Workshop 26
Blue Gene/L vs Blue Gene/P
Property                               Blue Gene/L                      Blue Gene/P

Node Properties
Node Processors                        2 × PowerPC 440                  4 × PowerPC 450
Processor Frequency                    0.7 GHz                          0.85 GHz
Coherency                              Software managed                 SMP
L3 Cache size (shared)                 4 MB                             8 MB
Main Store                             512 MB / 1 GB                    2 GB / 4 GB
Main Store Bandwidth (1:2 pclk)        5.6 GB/s                         13.6 GB/s
Peak Performance                       5.6 GF/node                      13.9 GF/node

Torus Network
Bandwidth                              6×2×175 MB/s = 2.1 GB/s          6×2×425 MB/s = 5.1 GB/s
Hardware Latency (Nearest Neighbour)   200 ns (32 B), 1.6 µs (256 B)    100 ns (32 B), 800 ns (256 B)
Hardware Latency (Worst Case)          6.4 µs (64 hops)                 3.2 µs (64 hops)

Tree Network
Bandwidth                              2×350 MB/s = 700 MB/s            2×0.85 GB/s = 1.7 GB/s
Hardware Latency (Worst Case)          5.0 µs                           3.5 µs
26.02.2008 PASA Workshop 27
JUBL Usage
26.02.2008 PASA Workshop 28
User Research Fields
JUMP: ~150 Projects
JUBL: 25 Projects
26.02.2008 PASA Workshop 29
NIC Users and Access
[Map of NIC user sites across Germany: Aachen, Bayreuth, Berlin, Bielefeld, Bochum, Bonn, Bremen, Chemnitz, Darmstadt, Dortmund, Dresden, Duisburg, Düsseldorf, Erlangen, Frankfurt/Oder, Freiburg, Garching, Gießen, Göttingen, Halle, Hamburg, Hannover, Heidelberg, Ilmenau, Jena, Jülich, Kaiserslautern, Karlsruhe, Köln, Konstanz, Leipzig, Mainz, Marburg, München, Münster, Osnabrück, Potsdam, Siegen, Stuttgart, Ulm, Würzburg, Wuppertal, Zeuthen]
Eligibility
– Proposals accepted from Germany and Europe
– From academia, research institutions and industry
Procedure
– Peer review by NIC Scientific Council
– International referees
– Scientific quality counts
– One-year grants
Access via Grid
[Chart of research fields: Chemistry, Many-Particle Physics, Elementary Particle Physics, Life + Environment, Material Science, Soft Matter, Other]
26.02.2008 PASA Workshop 30
European Users
DEISA: Distributed Supercomputing Infrastructure
I3HP: Hadron Physics
NIC Initiative towards new Member States
Other Collaborations
[Map of European user and partner sites: Zagreb, Warsaw, Vienna, SARA, RZG, Roskilde, Rome, Prague, LRZ, IDRIS, Nicosia, Cracow, HLRS, Glasgow, Gdansk, Edinburgh, ECMWF, CSC, Coimbra, CINECA, BSC, Budapest, Brno, Bratislava, Athens]
26.02.2008 PASA Workshop 31
Part 3
Applications on Blue Gene
26.02.2008 PASA Workshop 32
Application Highlight
Spin forces on surfaces know about left and right
St. Blügel et al., Nature 447, 441–446 (10 May 2007)
Upper configuration can exist
Mirrored configuration is unstable
26.02.2008 PASA Workshop 33
Support: Scaling Workshops (05/06, 12/06)
Collaboration of users and scaling experts from Argonne National Laboratory, IBM and JSC
JSC donated in total about 4 million BG/L CPU hours
Hands-on training
Extremely efficient scalability achieved for QCD, materials science and CFD codes
Multi-rack applications become attractive in regular production mode
26.02.2008 PASA Workshop 34
Lattice Quantum Chromodynamics I
Research Area: Elementary Particle Physics on the Lattice
Code: Hybrid Monte Carlo with Wilson gauge action and improved Wilson fermions
Fortran90/MPI, compute kernel coded in assembler
Special Features:
Kernel: Conjugate Gradient solver (80% of CPU time) with even/odd preconditioning
matrix × vector; sparse complex matrix
Overlap of computation and communication
Usage of the (double hummer) 128-bit dual FPU
Lattice has to fit the torus network
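The overlap of computation and communication in such a matrix-vector kernel is typically organised with non-blocking MPI: post the halo exchange, work on interior lattice sites, then finish the boundary. The sketch below is illustrative only; apply_interior, apply_boundary and the buffer bookkeeping are hypothetical placeholders, not the actual LQCD code.

```c
/* Illustrative sketch of overlapping halo exchange with interior work in a
 * sparse complex matrix-vector product (the CG kernel). */
#include <mpi.h>

void apply_interior(double _Complex *out, const double _Complex *in);   /* placeholder */
void apply_boundary(double _Complex *out, const double _Complex *in,
                    double _Complex **halo);                            /* placeholder */

void matvec(double _Complex *out, const double _Complex *in,
            int nneigh, const int *neigh,            /* torus neighbours (<= 12) */
            double _Complex **sendbuf, double _Complex **recvbuf,
            const int *count, MPI_Comm cart)
{
    MPI_Request req[24];
    int nreq = 0;

    /* 1. Post receives and sends for the lattice boundary ("halo"). */
    for (int n = 0; n < nneigh; n++) {
        MPI_Irecv(recvbuf[n], 2 * count[n], MPI_DOUBLE, neigh[n], 0, cart, &req[nreq++]);
        MPI_Isend(sendbuf[n], 2 * count[n], MPI_DOUBLE, neigh[n], 0, cart, &req[nreq++]);
    }

    /* 2. Apply the matrix to interior sites while messages are in flight
     *    (on BG/P the DMA engine drives the transfers). */
    apply_interior(out, in);

    /* 3. Wait for the halo, then finish the boundary sites. */
    MPI_Waitall(nreq, req, MPI_STATUSES_IGNORE);
    apply_boundary(out, in, recvbuf);
}
```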
26.02.2008 PASA Workshop 35
Lattice Quantum Chromodynamics I (2)
[Scaling plots: Fortran version vs. assembler version]
26.02.2008 PASA Workshop 36
Lattice Quantum Chromodynamics II
Research Area: Elementary Particle Physics on the Lattice
Code: Hybrid Monte Carlo with Symanzik improved gauge action and dynamical UV-filtered Clover fermions
Code written mainly in C; communication implemented in ASM or SPI, Clover sparse matrix multiplication in ASM
Special Features:
Kernel: Clover sparse matrix multiplication (80% of CPU time)
Compute-intensive parts use low-level compiler macros
Overlap of computation and communication; usage of the (double hummer) 128-bit dual FPU; lattice has to fit the torus network
26.02.2008 PASA Workshop 37
Lattice Quantum Chromodynamics II (2)
Blue Gene/L, lattice size: 48³ × 96
Blue Gene/P, lattice size: 64⁴
26.02.2008 PASA Workshop 38
LQCD II (3)
Blue Gene/L, lattice size: 48³ × 96
Blue Gene/P, lattice size: 64⁴
26.02.2008 PASA Workshop 39
CFD I: Simulation of Blood Flow in a Ventricular Assist Device (VAD)
Research Area: CFD
Code: Finite Element techniques, Distributed Memory Code (distribution of subdomains), Fortran90 and C / MPI
Special Features:
Simulation of unsteady fluid flows
Investigation of design modifications which may improve the VAD's biocompatibility
Major problem in parallel code: communication bottlenecks; analyzed with the SCALASCA performance toolkit
26.02.2008 PASA Workshop 40
CFD I (2)
[Scaling plot: overall time steps per hour vs. number of processors]
26.02.2008 PASA Workshop 41
MD Studies with DL_POLY3
Research Area: Materials Science
Code: Fully distributed-memory code, Link-Cell algorithm, Fortran90/MPI
Special Features:
Combination of short-range (van der Waals) and long-range forces (Coulomb: Smooth Particle Mesh Ewald, FFT communication)
Scaling of the FFT is limited, e.g. by communication
Major problem: I/O (4 min for 500 time steps, 10 min to dump the coordinates)
Serial I/O processing has to be parallelized (see the sketch below)
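One common way to remove such a serial dump is collective MPI-IO, where every rank writes its own block of the coordinate file at a rank-dependent offset. The sketch below is generic and hedged: the file name, the plain x/y/z double layout and dump_coords are illustrative placeholders, not DL_POLY_3's actual trajectory format.

```c
/* Hedged sketch: each rank writes its local coordinates (x,y,z doubles) with
 * collective MPI-IO instead of funnelling everything through one process. */
#include <mpi.h>

void dump_coords(const double *xyz, int natoms_local, MPI_Comm comm)
{
    /* Atoms owned by lower ranks give this rank's offset into the file. */
    long long nlocal = natoms_local, nsum = 0;
    MPI_Scan(&nlocal, &nsum, 1, MPI_LONG_LONG, MPI_SUM, comm);
    long long nbefore = nsum - nlocal;

    MPI_File fh;
    MPI_File_open(comm, "coords.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)nbefore * 3 * sizeof(double);
    MPI_File_write_at_all(fh, offset, xyz, 3 * natoms_local,
                          MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}
```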
26.02.2008 PASA Workshop 42
DL_POLY3 (2)
26.02.2008 PASA Workshop 43
Car Parrinello Molecular Dynamics (CPMD)
Research Area: Materials Science
Plane-wave/pseudopotential implementation of Density Functional Theory, especially for ab-initio MD
Code: Well-parallelized code, many platforms, Fortran77
Plane-wave distribution and 3D-FFT parallelization
Mixed MPI/OpenMP parallelization (see the sketch below)
Hierarchical taskgroup parallelization for BG/L
Parallel linear algebra and parallel initialization for BG
Special Features:
Tuning for IBM systems done by A. Curioni, IBM, member of the CPMD development team
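The mixed MPI/OpenMP mode maps naturally onto BG/P's 4-way SMP nodes (one MPI task per node, one thread per core). Below is a generic hybrid skeleton, not CPMD's actual code; do_chunk is a hypothetical placeholder for the node-local work.

```c
/* Generic hybrid MPI/OpenMP skeleton: OpenMP threads inside a node,
 * MPI between nodes (e.g. 4 threads per task on BG/P in SMP mode). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

double do_chunk(int tid, int nthreads);   /* placeholder for node-local work */

int main(int argc, char **argv)
{
    int provided, rank;
    /* FUNNELED is enough as long as only the master thread calls MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = 0.0;
    #pragma omp parallel reduction(+:local)
    local += do_chunk(omp_get_thread_num(), omp_get_num_threads());

    double global;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global result: %f\n", global);

    MPI_Finalize();
    return 0;
}
```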
26.02.2008 PASA Workshop 44
CPMD (2)
CPMD is used in production on JUBL and JUGENE
Typical partition is between 1024 and 4096 processors; interesting investigations are constrained to these processor numbers
Code is extremely well adapted to the BG architecture
First BG/P experience: a system of ~500 atoms runs twice as fast on 2048 BG/P processors as on BG/L, much more than could be expected from the increase in clock rate alone
26.02.2008 PASA Workshop 45
CPMD on BG/P vs BG/L
Test case: methanol liquid/vapour interface ~ 700 atoms – 70 Ry Cutoff
[Plot: Time (s) vs. number of nodes (128-2048) for BG/P (SMP), BG/P (VNM) and BG/L (CO)]
BG/P vs BG/L (VNM): speedup 2.4
BG/P (SMP) thread scaling: 1 thread: 1.0, 2 threads: 1.9, 4 threads: 3.6
26.02.2008 PASA Workshop 46
Application Highlight
Activated amino acids form peptides outside cells
D. Marx et al., J. Am. Chem. Soc. (“Three-Page Communication” to the Editor), ASAP Article, DOI 10.1021/ja7108085 (7 February 2008)
26.02.2008 PASA Workshop 47
Summary
• JSC is fully embedded in local, regional, national and European HPC structures
• Blue Gene is a promising, though challenging, unique supercomputer platform that makes petaflops computing possible
• Blue Gene applications benefit from the enormous scalability of the system, the high-speed network and the balanced system architecture