Blue Gene/L Experiences at NCAR · 2005-06-16
TRANSCRIPT
Blue Gene/L Experiences at NCAR
Dr. Richard D. Loft
Deputy Director of R&D
Scientific Computing Division
National Center for Atmospheric Research
Outline
• Why is Blue Gene/L interesting?
• Why is it a bit frightening?
• How did we get one at NCAR?
• BG/L Case Studies:
– Tuning I/O subsystem performance.
– Getting application scalability.
Why Blue Gene/L is Interesting
• Features
– Massive parallelism - fastest in the world (137 Tflops)
– High packaging density (2048 processors/rack)
– Low power per processor (25 kW/rack)
– Dedicated reduction network (solver scalability)
– Network interfaces on chip (embedded technology)
– Conventional programming model:
• xlf90, xlc compilers
• MPI
BG/L Unknowns
• Questions
– High reliability? (1/N effect)
– Applications for 100k processors? (Amdahl’s Law)
– System robustness: I/O, scheduling flexibility.
• Limitations
– Node memory limitation (512 MB/node)
– Partitioning is quantized (in powers of two)
– Simple node kernel (no fork, hence no threads, hence no OpenMP)
– No support for multiple executables (so no CCSM).
Blue Gene/L Collaboration

[Diagram: the Blue Gene/L system at the intersection of NCAR, CU Denver, and CU Boulder.]
Project Objectives
• Proving out a new architecture…
– Investigation of performance & scalability of key applications
– System software stress testing
• Parallel file systems (Lustre, GPFS)
• Schedulers (SLURM, COBALT, LSF)
• Education
– Classroom access to massive parallelism
Details of the NCAR/CU “Frost” Blue Gene/L
• 2048 processors, 5.73 Tflops peak
• 32 I/O nodes
• 6 Tbytes of high performance disk
• Delivered to NCAR: March 15, 2005
• Acceptance tests began March 23, 2005
• Acceptance completed March 28, 2005
• First PI meeting March 30, 2005
Blue Gene/L @ NCAR
“Frost”
Bring-up of “Frost” BG/L System
• Criteria for user access…
– System partitions (e.g. 256, 128, 64, 32)
– Functional I/O subsystem
– Scheduler
– Connection to NCAR mass storage system
Current “Frost” BG/L Status
• Partition definitions (512, 256, 128, 4x32) in place.
• I/O system performance and correctness issues appear to be behind us (more on this later).
• Schedulers (COBALT, SLURM) coming along.
• MSS connections in place.
• Codes ported: POP, WRF, HOMME, BGC5
• Establishing relationships with other centers
– BG/L Consortium membership
– Other BG/L sites: SDSC, Argonne, LLNL, Edinburgh
BG/L: Getting I/O Subsystem Performance
“Frost” I/O Architecture
• One Blue Gene/L rack with one I/O node per 32 compute nodes (pset size of 32)
• Two DS4500 (FAStT900) fibre channel disk subsystem servers with dual controllers and ~3 TB of RAID5 disk on each, for a total of ~6 TB shared
• Four p720 servers, each with:
– 4 CPUs
– 8 GB memory
– 2 fibre channel adapters
– 8 dedicated GigE interfaces for I/O
• One 96-port Cisco 4006 switch
– does not support jumbo frames
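The pset ratio above follows directly from the numbers earlier in the deck (2048 processors at two per compute node, 32 I/O nodes); a quick check:

```python
# Counts taken from the slides: one rack = 2048 processors, two per
# compute node, served by 32 I/O nodes.
compute_nodes = 2048 // 2              # 1024 compute nodes in the rack
io_nodes = 32
pset_size = compute_nodes // io_nodes  # compute nodes per I/O node
print(pset_size)                       # -> 32, the pset size quoted above
```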
Frost I/O Problem Resolution
• Initial I/O Configuration
– Each of the eight GigE ports on each of the four p720s dedicated to one of the 32 BG/L I/O nodes
• Static routes and NFS-mount options were used to accomplish the one-to-one mapping.
• All GigE ports configured on the same network subnet (the BG/L functional network)
– I/O performance numbers were, at best, roughly half of what was expected (~300 MB/sec instead of ~600 MB/sec), and often far less.
– I/O tests failed sporadically and inconsistently.
Frost I/O Problem Resolution
[Diagram: Initial I/O configuration. The 32 I/O nodes of the Blue Gene/L rack (pset=32) connect one-to-one ("1x1 - BG/L I/O"), all on the same subnet, through a loaner Cisco 6509 switch to the four front-end servers (Front-End1 through Front-End4), which attach to the FAStT1 and FAStT2 disk subsystems.]
Frost I/O Problem Resolution
• Root Cause Analysis: two main problems…
– Low-level network communication (ARP) was getting confused due to multiple GigE interfaces on the same host on the same network subnet.
– Bandwidth on the Cisco 6509 switch was splitting by a factor of eight due to backplane limitations and suboptimal port assignment within the switch.
Frost I/O Problem Resolution
• Current I/O Configuration
– Two GigE ports on each of the four p720s are now configured to look like a single interface via channel bonding.
– GigE ports judiciously assigned to the new Cisco 4006 switch so as to prevent the bandwidth-splitting problem.
– The 32 BG/L I/O nodes configured into four groups of eight, with each group assigned to one of the p720s.
• Same linking mechanism used (i.e., static routes & NFS)
– NFS tuning parameters (rsize,wsize=32768) enabled.
– Max I/O performance numbers are in the ~674 MB/sec range.
– I/O tests ran to successful completion every time.
• Direct Cabling
– Elimination of the Cisco switch by cabling the BG/L I/O nodes directly to the p720 GigE ports is currently not possible due to limitations in the BG/L core software and database design.
Frost I/O Problem Resolution
[Diagram: Current I/O configuration. The 32 I/O nodes of the Blue Gene/L rack (pset=32) are split into four groups of eight ("1"x8 - BG/L I/O); each group reaches one of Front-End1 through Front-End4 over a 2x channel-bonded link (presented as a "single" interface on the subnet) through the Cisco 4006 switch, with the FAStT1 and FAStT2 disk subsystems behind the front ends.]
“Frost” BG/L I/O performance: 1 MB files
[Chart: mean aggregate I/O rates on compute nodes. Throughput (MB/sec, 0-800) vs. number of concurrent processes (0-1200), with separate write-rate and read-rate curves. Each process wrote or read 1 GB of data; the I/O request size was 1 MB.]
Then we uncovered a read correctness issue…

• Data validation tests run at the completion of the I/O performance test cycle uncovered an intermittent byte-reordering problem on the reread cycle…
– IBM engaged and worked the problem as a Severity 1 issue.
– Frequency of the problem was data dependent.
– Root cause was found to be a Linux kernel cache-coherency problem in the interaction with the low-level network card driver on the Blue Gene/L I/O nodes.
– An emergency e-fix, recently installed, has eliminated the problem.
BG/L: Getting Applications to Scale
The Spectral Element Computational Mesh: the “Cube-Sphere”

• Spectral Elements:
– A quadrilateral “patch” of gridpoints, N x N
– Gauss-Lobatto grid
– N=8 is optimal (Taylor)
• Cube
– Ne = elements on an edge
– 6*Ne*Ne elements total
• Cube Partitioning
– Metis
– Space-filling curve partitioning algorithm
• Ne=8 shown (~180 km)
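The element count above, and the standard unique-point count for a quadrilateral mesh on the sphere (Euler's formula gives V = F + 2 when every face is a quad), can be written out directly:

```python
def cube_sphere_elements(ne):
    # Cubed-sphere: six faces, Ne x Ne quadrilateral elements per face.
    return 6 * ne * ne

def unique_gridpoints(ne, n):
    # N x N Gauss-Lobatto points per element; neighbouring elements share
    # edge points, leaving 6*(Ne*(N-1))**2 + 2 unique points on the sphere.
    return 6 * (ne * (n - 1)) ** 2 + 2

print(cube_sphere_elements(8))  # -> 384 elements for the Ne=8 (~180 km) mesh
```

For Ne=1, N=2 this reduces to the 8 vertices of a plain cube, a quick sanity check; the later HOMME slides' "6 X 128 X 128 elements" is cube_sphere_elements(128) = 98304.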
6th Order Spectral Elements on the Ne=4 Cube-Sphere
Partitioning a cube-sphere on 8 processors
Partitioning a cubed-sphere on 8 processors
Default “Lexical” Mapping (2-D Example)
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
Coprocessor Mode
Default “Lexical” Mapping (2-D Example)
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
Virtual Node Mode
16 17 18 19
20 21 22 23
24 25 26 27
28 29 30 31
Desirable “Grouped” Mapping (2-D Example)
0 2 4 6
8 10 12 14
16 18 20 22
24 26 28 30
Virtual Node Mode
1 3 5 7
9 11 13 15
17 19 21 23
25 27 29 31
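The two virtual-node-mode layouts above can be generated programmatically; this is a sketch of the rank-placement rules the grids illustrate, not the actual BG/L map-file machinery:

```python
def lexical_vn(nx, ny):
    """Default lexical mapping, virtual node mode: ranks 0..N-1 fill the
    node grid row-major on CPU 0; ranks N..2N-1 reuse the same nodes on CPU 1,
    so rank r and rank r+N are far apart in rank space but share a node."""
    n = nx * ny
    cpu0 = [[y * nx + x for x in range(nx)] for y in range(ny)]
    cpu1 = [[n + y * nx + x for x in range(nx)] for y in range(ny)]
    return cpu0, cpu1

def grouped_vn(nx, ny):
    """Grouped mapping: consecutive rank pairs (2k, 2k+1) share node k,
    so neighbouring ranks tend to land on neighbouring nodes."""
    cpu0 = [[2 * (y * nx + x) for x in range(nx)] for y in range(ny)]
    cpu1 = [[2 * (y * nx + x) + 1 for x in range(nx)] for y in range(ny)]
    return cpu0, cpu1
```

With nx=ny=4 these reproduce the 4x4 grids shown on the two slides above.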
2x2 “Snaked” Mapping (2-D Example)
0 6 8 14
2 4 10 12
16 22 24 30
18 20 26 28
Virtual Node Mode
1 7 9 15
3 5 11 13
17 23 25 31
19 21 27 29
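One way to reproduce the 2x2 snaked grid above — again a sketch of the placement rule the slide illustrates, assuming nodes are visited in 2x2 blocks with a down-then-up snake through each block's columns:

```python
def snaked_vn(nx, ny, bx=2, by=2):
    """2x2 'snaked' mapping, virtual node mode: nodes are enumerated block
    by block (row-major over bx-by blocks), snaking through each block's
    columns; ranks 2k and 2k+1 share node k (CPU 0 and CPU 1)."""
    cpu0 = [[0] * nx for _ in range(ny)]
    cpu1 = [[0] * nx for _ in range(ny)]
    node = 0
    for brow in range(ny // by):
        for bcol in range(nx // bx):
            for dx in range(bx):
                # Even block columns go down, odd ones back up: the snake.
                rows = range(by) if dx % 2 == 0 else range(by - 1, -1, -1)
                for dy in rows:
                    r, c = brow * by + dy, bcol * bx + dx
                    cpu0[r][c] = 2 * node
                    cpu1[r][c] = 2 * node + 1
                    node += 1
    return cpu0, cpu1
```

This keeps each pair of neighbouring ranks within a 2x2 patch of nodes, which is the locality property the slide's mapping is after.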
BG/L HOMME - Moist Dynamics
Sustained MFLOPS per processor for moist Held-Suarez.
Explicit integration, Δt = 4 seconds.
6 x 128 x 128 elements, 96 vertical levels.
8 TFlops
BG/L HOMME - Moist Dynamics with Physics
Sustained MFLOPS per processor for Aquaplanet with Emanuel physics.
Explicit integration, Δt = 4 seconds.
6 x 128 x 128 elements, 40 vertical levels.
11.3 TFlops
Questions?