
Page 1:

Blue Gene/L Experiences at NCAR

Dr. Richard D. Loft, Deputy Director of R&D

Scientific Computing Division, National Center for Atmospheric Research

Page 2:

Outline

• Why is Blue Gene/L interesting?

• Why is it a bit frightening?

• How did we get one at NCAR?

• BG/L Case Studies:

– Tuning I/O subsystem performance.

– Getting application scalability.

Page 3:

Why Blue Gene/L is Interesting

• Features

– Massive parallelism: fastest in the world (137 Tflops)

– Achieves high packaging density (2048 PEs/rack)

– Lower power per processor (25 kW/rack)

– Dedicated reduction network (solver scalability)

– Puts network interfaces on chip (embedded technology)

– Conventional programming model:

• xlf90 and xlc compilers

• MPI

Page 4:

BG/L Unknowns

• Questions

– High reliability? (1/N effect)

– Applications for 100k processors? (Amdahl's Law)

– System robustness: I/O, scheduling flexibility

• Limitations

– Node memory limitation (512 MB/node)

– Partitioning is quantized (in powers of two)

– Simple node kernel (no fork, hence no threads or OpenMP)

– No support for multiple executables (rules out CCSM)

Page 5:

Blue Gene/L Collaboration

[Diagram: the Blue Gene/L system at the center of a collaboration among NCAR, CU Denver, and CU Boulder.]

Page 6:

Project Objectives

• Proving out a new architecture…

– Investigation of performance & scalability of key applications

– System software stress testing

• Parallel file systems (Lustre, GPFS)

• Schedulers (SLURM, COBALT, LSF)

• Education

– Classroom access to massive parallelism

Page 7:

Details of the NCAR/CU “Frost” Blue Gene/L

• 2048 processors, 5.73 Tflops peak

• 32 I/O nodes

• 6 Tbytes of high performance disk

• Delivered to NCAR: March 15, 2005

• Acceptance tests began March 23, 2005

• Acceptance completed March 28, 2005

• First PI meeting March 30, 2005

Page 8:

Blue Gene/L @ NCAR

“Frost”

Page 9:

Bring-up of “Frost” BG/L System

• Criteria for user access…

– System partitions (e.g., 256, 128, 64, 32)

– Functional I/O subsystem

– Scheduler

– Connection to NCAR mass storage system

Page 10:

Current “Frost” BG/L Status

• Partition definitions (512, 256, 128, 4x32) in place.

• I/O system performance and correctness issues appear to be behind us (more on this later).

• Schedulers (COBALT, SLURM) coming along.

• MSS connections in place.

• Codes ported: POP, WRF, HOMME, BGC5.

• Establishing relationships with other centers:

– BG/L Consortium membership

– Other BG/L sites: SDSC, Argonne, LLNL, Edinburgh

Page 11:

BG/L: Getting I/O Subsystem Performance

Page 12:

“Frost” I/O Architecture

• One Blue Gene/L rack with one I/O node per 32 compute nodes (pset size of 32)

• Two DS4500 (FAStT900) fibre channel disk subsystem servers with dual controllers and ~3 TB of RAID5 disk on each, for a total of ~6 TB shared

• Four p720 servers, each with:

– 4 CPUs

– 8 GB memory

– 2 fibre channel adapters

– 8 dedicated GigE interfaces for I/O

• One 96-port Cisco 4006 switch

– Does not support jumbo frames

Page 13:

Frost I/O Problem Resolution

• Initial I/O configuration

– Each of the eight GigE ports on each of the four p720s was dedicated to one of the 32 BG/L I/O nodes.

• Static routes and NFS-mount options were used to accomplish the one-to-one mapping.

• All GigE ports were configured on the same network subnet (the BG/L functional network).

– I/O performance numbers were, at best, roughly half of what was expected: ~300 MB/sec instead of ~600 MB/sec, and often far less.

– I/O tests failed sporadically and inconsistently.

Page 14:

Frost I/O Problem Resolution

Initial I/O Configuration

[Diagram: each of the 32 BG/L I/O nodes (pset=32) wired one-to-one ("1x1") to one of the eight front/back GigE I/O ports on Front-End1 through Front-End4, all on the same subnet, through a loaner Cisco 6509 switch; the front ends connect to the FAStT1 and FAStT2 disk subsystems.]

Page 15:

Frost I/O Problem Resolution

• Root cause analysis found two main problems…

– Low-level network communication (ARP) was getting confused by multiple GigE interfaces on the same host on the same network subnet.

– Bandwidth on the Cisco 6509 switch was splitting by a factor of eight due to backplane limitations and suboptimal port assignment within the switch.

Page 16:

Frost I/O Problem Resolution

• Current I/O configuration

– Two GigE ports on each of the four p720s are now configured to look like a single interface via channel bonding.

– GigE ports were judiciously assigned to the new Cisco 4006 switch so as to prevent the bandwidth-splitting problem.

– The 32 BG/L I/O nodes were configured into four groups of eight, with each group assigned to one of the p720s.

• The same linking mechanism is used (i.e., static routes & NFS).

– NFS tuning parameters (rsize, wsize = 32768) enabled.

– Max I/O performance numbers are in the ~674 MB/sec range.

– I/O tests ran to successful completion every time.

• Direct cabling

– Eliminating the Cisco switch by cabling the BG/L I/O nodes directly to the p720 GigE ports is currently not possible due to limitations in the BG/L core software and database design.

Page 17:

Frost I/O Problem Resolution

Current I/O Configuration

[Diagram: the 32 BG/L I/O nodes (pset=32) grouped eight-to-one ("1"x8) onto Front-End1 through Front-End4; on each front end, two GigE ports are channel-bonded (2x channel-bond) so they appear as a "single" interface on the subnet. Connectivity runs through the Cisco 4006 switch to the FAStT1 and FAStT2 disk subsystems.]

Page 18:

“Frost” BG/L I/O performance: 1 MB files

[Plot: mean aggregate I/O rates on the compute nodes. Throughput (MB/sec, 0 to 800) versus number of concurrent processes (0 to 1200), with separate write-rate and read-rate curves.]

- Each process wrote or read 1 GB of data.

- The I/O request size was 1 MB.
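To make the test setup concrete, here is a minimal sketch in C with MPI of the kind of per-process I/O loop described in the notes above: each rank writes 1 GB to its own file in 1 MB requests, and rank 0 reports the aggregate rate. This is not the benchmark actually run on Frost; the output path /ptmp, the per-rank file naming, and the write-only focus are assumptions for illustration.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define REQ_SIZE (1024 * 1024)   /* 1 MB per I/O request            */
#define N_REQS   1024            /* 1024 x 1 MB = 1 GB per process  */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Fill a 1 MB buffer with a rank-dependent pattern. */
    char *buf = malloc(REQ_SIZE);
    memset(buf, rank & 0xff, REQ_SIZE);

    /* Assumed path: one file per rank on the NFS-mounted filesystem. */
    char fname[256];
    snprintf(fname, sizeof(fname), "/ptmp/iotest.%05d", rank);
    FILE *fp = fopen(fname, "wb");
    if (fp == NULL) MPI_Abort(MPI_COMM_WORLD, 1);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < N_REQS; i++)
        fwrite(buf, 1, REQ_SIZE, fp);
    fclose(fp);                  /* a real test would also sync to disk */
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0) {
        double total_mb = (double)nprocs * N_REQS;   /* 1 MB per request */
        printf("aggregate write rate: %.1f MB/sec\n", total_mb / (t1 - t0));
    }
    free(buf);
    MPI_Finalize();
    return 0;
}
```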

Page 19:

Then we uncovered a read-correctness issue…

• Data validation tests run at the completion of the I/O performance test cycle uncovered an intermittent byte-reordering problem on the reread cycle…

– IBM engaged and worked the problem as a Severity 1 issue.

– The frequency of the problem was data dependent.

– The root cause was found to be a Linux kernel cache-coherency problem in the interaction with the low-level network card driver on the Blue Gene/L I/O nodes.

– An emergency e-fix recently installed has eliminated the problem.

Page 20:

BG/L: Getting Applications to Scale

Page 21:

The Spectral Element Computational Mesh: the “Cube-Sphere”

• Spectral elements:

– A quadrilateral “patch” of N x N gridpoints

– Gauss-Lobatto grid

– N=8 is optimal (Taylor)

• Cube:

– Ne = elements on an edge

– 6*Ne*Ne elements total

• Cube partitioning:

– Metis

– Space-filling curve partitioning algorithm

• Ne=8 shown (~180 km); the figure labels the Ne=8 cube face and an N x N element. (A small bookkeeping sketch follows below.)
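As a quick check of the mesh bookkeeping above, the following small C sketch evaluates the 6*Ne*Ne element count and the corresponding point count, with N x N points per element (shared edge points counted once per element). The Ne=8, N=8 values correspond to the ~180 km case shown; the same formula applies to the 6 x 128 x 128 element HOMME runs later in the talk.

```c
#include <stdio.h>

/* Cube-sphere bookkeeping from the slide above:
 * Ne elements along each cube edge gives 6*Ne*Ne spectral elements,
 * and each element carries an N x N Gauss-Lobatto patch of points. */
int main(void)
{
    int Ne = 8;                        /* elements along a cube edge   */
    int N  = 8;                        /* points along an element edge */
    long elements = 6L * Ne * Ne;      /* total spectral elements      */
    long points   = elements * N * N;  /* points, counting shared edge
                                          points once per element      */

    printf("Ne=%d, N=%d: %ld elements, %ld points per vertical level\n",
           Ne, N, elements, points);
    return 0;
}
```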

Page 22:

6th Order Spectral Elements on the Ne=4 Cube-Sphere

Page 23:

Partitioning a cube-sphere on 8 processors

Page 24:

Partitioning a cube-sphere on 8 processors

Page 25:

Default “Lexical” Mapping (2-D Example)

Coprocessor Mode: one MPI rank per node, assigned in row-major (lexical) order across the 4 x 4 node grid:

0   1   2   3
4   5   6   7
8   9  10  11
12  13  14  15

Page 26:

Default “Lexical” Mapping (2-D Example)

Virtual Node Mode: with the default lexical ordering, ranks 0-15 fill the first CPU of each node in row-major order and ranks 16-31 fill the second CPU in the same order, so the two ranks sharing a node are k and k+16.

First CPU of each node:

0   1   2   3
4   5   6   7
8   9  10  11
12  13  14  15

Second CPU of each node:

16  17  18  19
20  21  22  23
24  25  26  27
28  29  30  31

Page 27:

Desirable “Grouped” Mapping (2-D Example)

Virtual Node Mode: ranks are grouped so that the two ranks sharing a node are consecutive (2k and 2k+1), with nodes numbered in row-major order.

First CPU of each node (even ranks):

0   2   4   6
8  10  12  14
16  18  20  22
24  26  28  30

Second CPU of each node (odd ranks):

1   3   5   7
9  11  13  15
17  19  21  23
25  27  29  31

Page 28:

2x2 “Snaked” Mapping (2-D Example)

Virtual Node Mode: ranks 2k and 2k+1 still share a node, but nodes are numbered along a path that snakes through 2 x 2 blocks of the mesh, so consecutively numbered nodes stay close together. (A sketch of this ordering follows below.)

First CPU of each node (even ranks):

0   6   8  14
2   4  10  12
16  22  24  30
18  20  26  28

Second CPU of each node (odd ranks):

1   7   9  15
3   5  11  13
17  23  25  31
19  21  27  29
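Here is a minimal C sketch of how the snaked numbering in the figure above could be generated: the 4 x 4 node grid is visited 2 x 2 block by block, snaking within each block, and node k is then given virtual-node-mode ranks 2k and 2k+1. This only reproduces the numbering shown; on the real machine the placement would be handed to the system's task-mapping facility (e.g., a map file), which is not shown here.

```c
#include <stdio.h>

/* Build a 2x2 "snaked" placement of MPI ranks on a 4x4 node grid in
 * virtual node mode.  Node k hosts ranks 2k and 2k+1 (its two CPUs),
 * and nodes are numbered so that consecutive node indices stay inside
 * a 2x2 block of the mesh, matching the slide's figure. */
int main(void)
{
    const int rows = 4, cols = 4;     /* node grid, assumed 4x4 here */
    int node[4][4];

    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            int block  = (r / 2) * (cols / 2) + (c / 2); /* which 2x2 block  */
            int lr = r % 2, lc = c % 2;                  /* position in block */
            int within = (lc == 0) ? lr : 3 - lr;        /* snake inside block */
            node[r][c] = block * 4 + within;
        }
    }

    /* Print the even ranks (first CPU of each node); the second CPU of
     * node k holds rank 2k+1, giving the companion odd-rank grid. */
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++)
            printf("%4d", 2 * node[r][c]);
        printf("\n");
    }
    return 0;
}
```

Compiled and run, this prints the same even-rank grid as the figure (0 6 8 14 / 2 4 10 12 / 16 22 24 30 / 18 20 26 28).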


Page 30:

BG/L HOMME - Moist Dynamics

Sustained MFLOPS per processor for moist Held-Suarez. Explicit integration, Δt = 4 seconds. 6 x 128 x 128 elements, 96 vertical levels.

8 TFlops

Page 31:

BG/L HOMME - Moist Dynamics with Physics

Sustained MFLOPS per processor for Aquaplanet with Emanuel physics. Explicit integration, Δt = 4 seconds. 6 x 128 x 128 elements, 40 vertical levels.

11.3 TFlops

Page 32:

Questions?