
Page 1: Towards Petascale Computing

IBM Research | April 2003 | BlueGene/L | © 2003 IBM Corporation

Towards Petascale Computing

David Klepacki, [email protected]
IBM T. J. Watson Research Center

Page 2: Towards Petascale Computing

The Blue Gene Project

In December 1999, IBM Research announced a five-year, $100M (US) effort to build a petaflops-scale supercomputer to attack problems such as protein folding. The Blue Gene project has two primary goals:

• Advance the state of the art in biomolecular simulations.
• Advance the state of the art in computer design and software for extremely large-scale systems.

In November 2001, a research partnership with Lawrence Livermore National Laboratory was announced. In November 2002, LLNL's acquisition of a BG/L machine from IBM as part of the ASCI Purple contract was announced.

Page 3: Towards Petascale Computing

The Problem: A Hardware Perspective

The current approach to large systems is to build clusters of large SMPs (NEC Earth Simulator, ASCI machines, Linux clusters):

• SMPs are typically designed for something else (the sweet spot of the market) and then combined "unnaturally"
• Very expensive switches for high performance
• Very high electrical power consumption: low computing power density
• Significant resources (particularly in the memory hierarchy) devoted to improving single-thread performance

We would like a more modular/cellular approach, with a simple building block (or cell) that can be replicated ad infinitum as necessary – aggregate performance is what matters. The cell should have low-power/high-density characteristics, and should also be cheap to build and easy to interconnect.

Page 4: Towards Petascale Computing

The Problem: A Software Perspective

Limited success with systems of ~1,000 nodes:

• System management scalability issues
• System reliability
• Application performance scalability

Application scalability depends heavily on the specific application, but there are known applications that can use > 10,000 nodes:

• High-bandwidth, low-latency interconnect necessary
• Minimum operating system involvement

Most system management tools (e.g., job scheduling systems, debuggers, parallel shells) are unlikely to work well with 10,000 – 100,000 nodes:

• It is unlikely that everything will be up at the same time
• Time-outs, stampedes, other problems

Page 5: Towards Petascale Computing

Approach: Cellular Systems Architecture

• A homogeneous collection of simple independent processing units called cells, each with its own operating system image
• All cells have the same computational and communications capabilities (interchangeable from the application or OS view)
• Integrated connection hardware provides a straightforward path to scalable systems with thousands/millions of cells
• Challenges:
  – Programmability (particularly for performance)
  – System management
  – Fault tolerance and high availability
  – Mapping of computations to cells

Page 6: Towards Petascale Computing

Supercomputing Roadmap

[Chart: projected supercomputer performance (TeraFlops, log scale from 1 to 100,000) vs. year, 1995–2015, growing at 84% CGR. Points include IBM Deep Blue, Riken MDM, ASCI Lizard, and IBM Blue Gene; reference levels mark estimated mouse-brain and human-brain ops. Sources: ASCI Roadmap (www.llnl.gov/asci), IBM; brain ops/sec: Kurzweil 1999, The Age of Spiritual Machines; Moravec 1998, www.transhumanist.com/volume1/moravec.htm]

Page 7: Towards Petascale Computing

Compute Density

• 1U rack-mount servers: 2 Pentium 4 per 1U, 42 1U per rack → 84 processors/rack
• Blade servers: 2 Pentium 4 per blade (7U chassis), 14 blades per chassis, 6 chassis per frame → 168 processors/rack
• BlueGene/L: 2 dual-CPU chips per compute card, 16 compute cards per node card, 16 node cards per midplane, 2 midplanes per rack → 2,048 processors/rack

Page 8: Towards Petascale Computing

System Power Comparison

• BG/L: 20.1 kW
• 450 ThinkPads (LS Mok, 4/2002): 20.3 kW

Page 9: Towards Petascale Computing

Supercomputer Peak Speed

[Chart: supercomputer peak speed (flops, log scale from 1E+2 to 1E+16) vs. year introduced, 1940–2010; doubling time = 1.5 years. Machines shown include ENIAC (vacuum tubes), UNIVAC, IBM 701, IBM 704, IBM 7090 (transistors), IBM Stretch, CDC 6600 (ICs), CDC 7600, CDC STAR-100 (vectors), CRAY-1, Cyber 205, X-MP2 (parallel vectors), CRAY-2, X-MP4, Y-MP8, i860 (MPPs), Delta, CM-5, Paragon, NWT, ASCI Red, CP-PACS, Blue Pacific, ASCI White, NEC Earth Simulator, and Blue Gene/L.]

Page 10: Towards Petascale Computing

Supercomputer Power Efficiencies

[Chart: power efficiency (GFLOPS/Watt, log scale from 0.001 to 1) vs. year, 1997–2005. Points include QCDSP (Columbia), ASCI White, Earth Simulator, QCDOC (Columbia/IBM), and Blue Gene/L.]

Page 11: Towards Petascale Computing

Supercomputer Cost/Performance

[Chart: cost/performance (dollars/GFLOP, log scale from 100 to 1,000,000) vs. year, 1996–2005. Points include C/P ASCI Cray, T3E, Beowulfs/COTS (JPL), QCDSP (Columbia), ASCI Blue, ASCI White, ASCI Q, Earth Simulator, Red Storm, QCDOC (Columbia/IBM), and Blue Gene/L.]

Page 12: Towards Petascale Computing

Clock Frequencies

[Chart: clock frequency (MHz, log scale from 1 to 10,000) vs. year, 1985–2005, contrasting Intel processor clock rates with DRAM row-access (RAS) rates.]

Page 13: Towards Petascale Computing

Blue Gene Hardware Architecture

Page 14: Towards Petascale Computing

BlueGene/L Fundamentals

• A large number of nodes (65,536)
• Low-power nodes for density
• High floating-point performance
• System-on-a-chip technology
• Nodes interconnected as a 64x32x32 three-dimensional torus
  – Easy to build large systems, as each node connects only to its six nearest neighbors – full routing in hardware (a sketch of the neighbor computation follows this list)
  – Cross-section bandwidth per node is proportional to n²/n³ = 1/n
• Auxiliary networks for I/O and global operations
• Applications consist of multiple processes with message passing
  – Strictly one process per node
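To make the torus wiring concrete, here is a minimal sketch (ours, purely illustrative; the 64x32x32 dimensions come from the slide, the function name is an assumption) that computes the six nearest neighbors of a node, with wraparound in each dimension:

    #include <stdio.h>

    /* Torus dimensions from the slide: 64 x 32 x 32. */
    #define DIM_X 64
    #define DIM_Y 32
    #define DIM_Z 32

    /* Fill out[6][3] with the coordinates of the six nearest neighbors
     * of node (x, y, z), wrapping around in each dimension (torus links). */
    static void torus_neighbors(int x, int y, int z, int out[6][3]) {
        const int dim[3] = { DIM_X, DIM_Y, DIM_Z };
        int p[3] = { x, y, z };
        int n = 0;
        for (int d = 0; d < 3; d++) {
            for (int delta = -1; delta <= 1; delta += 2) {
                int q[3] = { p[0], p[1], p[2] };
                q[d] = (p[d] + delta + dim[d]) % dim[d];   /* wraparound */
                out[n][0] = q[0]; out[n][1] = q[1]; out[n][2] = q[2];
                n++;
            }
        }
    }

    int main(void) {
        int nbr[6][3];
        torus_neighbors(0, 0, 0, nbr);   /* a "corner" node still has exactly 6 neighbors */
        for (int i = 0; i < 6; i++)
            printf("neighbor %d: (%d, %d, %d)\n", i, nbr[i][0], nbr[i][1], nbr[i][2]);
        return 0;
    }

Because every node has exactly the same six links, the wiring pattern (and the routing logic) is identical everywhere, which is what makes the cellular approach easy to scale.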

Page 15: Towards Petascale Computing

BlueGene/L Fundamentals (continued)

• The machine should be dedicated to execution of applications, not system management
  – Avoid asynchronous events (e.g., daemons, interrupts)
  – Avoid complex operations on compute nodes
• The "I/O node" – an offload engine
  – System management functions are performed on an (N+1)th node
  – I/O (and other complex operations) are shipped from the compute node to the I/O node for execution
  – The number of I/O nodes is adjustable to needs (N = 64 for BG/L)
• This separation between application and system functions allows compute nodes to focus on application execution
• Communication to the I/O nodes must go through a separate tree interconnection network to avoid polluting the torus

Page 16: Towards Petascale Computing

BlueGene/L Fundamentals (continued)

• Machine monitoring and control should also be offloaded
  – Machine boot, health monitoring
  – Performance monitoring
• Concept of a "service node"
  – Separate nodes dedicated to machine control and monitoring
  – Non-architected from a user perspective
• Control network
  – A separate network that does not interfere with application or I/O traffic
  – A secure network, since its operations are critical
• By offloading services to other nodes, we let the compute nodes focus solely on application execution without interference
• BG/L is the extreme opposite of time sharing: everything is space shared

Page 17: Towards Petascale Computing

BlueGene/L packaging hierarchy (the totals are checked in the sketch below):

  Level                                              Peak            Memory
  Chip (2 processors)                                2.8/5.6 GF/s    4 MB
  Compute card (2 chips, 2x1x1)                      5.6/11.2 GF/s   0.5 GB DDR
  Node board (32 chips, 4x4x2; 16 compute cards)     90/180 GF/s     8 GB DDR
  Cabinet (32 node boards, 8x8x16)                   2.9/5.7 TF/s    256 GB DDR
  System (64 cabinets, 64x32x32)                     180/360 TF/s    16 TB DDR
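A quick sanity check of the totals in the table, as a sketch (we assume the larger figure of each peak pair counts both CPUs per chip; the variable names are ours):

    #include <stdio.h>

    int main(void) {
        /* Counts from the packaging table. */
        const int cpus_per_chip      = 2;
        const int chips_per_card     = 2;
        const int cards_per_board    = 16;
        const int boards_per_cabinet = 32;
        const int cabinets           = 64;

        const double gf_per_chip = 5.6;   /* peak GF/s per chip, both CPUs (assumed) */

        int cpus_per_cabinet = cpus_per_chip * chips_per_card
                             * cards_per_board * boards_per_cabinet;
        long cpus_total = (long)cpus_per_cabinet * cabinets;

        double tf_per_cabinet = gf_per_chip * chips_per_card * cards_per_board
                              * boards_per_cabinet / 1000.0;
        double tf_total = tf_per_cabinet * cabinets;

        printf("processors per cabinet: %d\n", cpus_per_cabinet);       /* 2048            */
        printf("processors in system:   %ld\n", cpus_total);            /* 131072 = 65,536 nodes x 2 */
        printf("peak per cabinet:       %.1f TF/s\n", tf_per_cabinet);  /* ~5.7            */
        printf("peak for system:        %.0f TF/s\n", tf_total);        /* ~367, i.e. ~360 */
        return 0;
    }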

Page 18: Towards Petascale Computing

BlueGene/L Compute System-on-a-Chip ASIC

[Block diagram of the compute ASIC, 5.6 GF peak per node: two PPC 440 CPU cores (one usable as an I/O processor), each with a "Double FPU" and 32 KB/32 KB L1 caches; per-core L2 with snoop; a multiported shared SRAM buffer; a shared L3 directory (with ECC) for 4 MB of embedded DRAM usable as L3 cache or memory; a DDR controller with ECC driving a 144-bit-wide external interface to 256 MB DDR; a torus interface (6 links out and 6 in, each at 1.4 Gb/s); a tree interface (3 links out and 3 in, each at 2.8 Gb/s); a global interrupt interface (4 global barriers or interrupts); Gbit Ethernet; and JTAG access, with slower devices on a 4:1 PLB. Indicated internal bandwidths include 2.7 GB/s, 5.5 GB/s, 11 GB/s, and 22 GB/s on the major buses, with 128-/256-bit datapaths and a 1024+144-bit (ECC) path to the EDRAM.]

Page 19: Towards Petascale Computing

BlueGene/L Interconnection Networks

3-Dimensional Torus
• Interconnects all compute nodes (65,536)
• Virtual cut-through hardware routing
• 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
• Communications backbone for computations
• 350/700 GB/s bisection bandwidth (see the arithmetic sketch below)

Global Tree
• One-to-all broadcast functionality
• Reduction operations functionality
• 2.8 Gb/s of bandwidth per link
• Latency of a tree traversal on the order of 2 µs
• Interconnects all compute and I/O nodes (1,024)

Ethernet
• Incorporated into every node ASIC
• Active in the I/O nodes (1:64)
• All external communication (file I/O, control, user interaction, etc.)
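The per-node and bisection figures follow from the link rate; a minimal check (link rate and torus dimensions are from the slides, the program itself is our sketch):

    #include <stdio.h>

    int main(void) {
        const double link_gbps = 1.4;      /* Gb/s per torus link, per direction */
        const int links_per_node = 12;     /* 6 out + 6 in */

        /* Per-node torus bandwidth: 12 x 1.4 Gb/s = 16.8 Gb/s = 2.1 GB/s. */
        double per_node_GBps = links_per_node * link_gbps / 8.0;

        /* Bisection: cut the 64x32x32 torus across its longest dimension.
         * The cut crosses 2 x 32 x 32 = 2048 links per direction (the factor
         * of 2 comes from the torus wraparound). */
        int cut_links = 2 * 32 * 32;
        double bisection_one_way = cut_links * link_gbps / 8.0;

        printf("per-node bandwidth:  %.1f GB/s\n", per_node_GBps);          /* 2.1        */
        printf("bisection (one way): %.0f GB/s\n", bisection_one_way);      /* ~358 ~ 350 */
        printf("bisection (two way): %.0f GB/s\n", 2 * bisection_one_way);  /* ~717 ~ 700 */
        return 0;
    }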

Page 20: Towards Petascale Computing

BlueGene/L Control System

• Control and monitoring are performed by a set of processes executing on the service nodes
• A non-architected network (Ethernet + JTAG) supports non-intrusive monitoring and control
• Core Monitoring and Control System
  – Controls machine initialization and configuration
  – Performs machine monitoring
  – Two-way communication with the node operating system
  – Integration with a database serving as the system repository

Page 21: Towards Petascale Computing

Compute Card
• 2 nodes per compute card
• Each BLC node in a 25 mm x 32 mm CBGA
• Field-replaceable unit (FRU)

[Photo: one compute node – a BL ASIC (2 cores) surrounded by nine DRAM chips]

Page 22: Towards Petascale Computing

I/O Card
• 2 nodes per I/O card
• Field-replaceable unit (FRU)
• Memory doubled by stacking DRAMs

[Photo: one I/O node – a BL ASIC (2 cores) with eighteen DRAM chips]

Page 23: Towards Petascale Computing

Node Card

[Photo: node card showing the power supplies, the Gb Ethernet cables from the I/O cards, an I/O card, and the 16 compute cards]

Page 24: Towards Petascale Computing

512-way Midplane, Side View

[Photo: link cards (without cables), 32-way node cards, and a 2-way compute card]

Page 25: Towards Petascale Computing

Rack, without Cables

[Photo: rack with the new card-cage design with integrated fans, showing the fan rail, fan card, midplane connector (19.55"), node cards, link card, and service card]

Per rack:
• 1,024 compute processors, 1,024 communication processors
• 256 GB DRAM, 2.8 TF peak
• 16 I/O nodes, 8 GB DRAM, 16 Gb Ethernet channels
• ~20 kW, air cooled; redundant power; redundant fans
• ~36" W x ~36" D x ~80" H (40–42U)

Page 26: Towards Petascale Computing

BlueGene/L System/Host

[Diagram: the BlueGene/L system (the chip – compute card – node board – cabinet – 64-cabinet system packaging hierarchy shown earlier) attached through a switch to the host environment: RAID disk servers running Linux with GPFS + NFS, archive (64), WAN (46), visualization (128), front-end nodes (AIX or Linux), and the service node; link counts of 1024, 512, 230, and 26 are indicated on the connections.]

Page 27: Towards Petascale Computing

Complete BlueGene/L System at LLNL

[Diagram: 65,536 BGL compute nodes (CN) and 1,024 BGL I/O nodes (ION), connected through two federated Gigabit Ethernet switches (2,048 ports each) to the front-end nodes (FEN, 8 ports), the service node (SN, 8 Gigabit Ethernet ports), the control management network (CMN), the cluster-wide file system (CWFS), archive, visualization (VIS), and the WAN; indicated port counts include 1024, 512, 128, 64, 48, and 8.]

Page 28: Towards Petascale Computing

BlueGene/L System – Livermore Plan-of-Record (POR) Design

[Floor plan: racks arranged with alternating hot and cold plenums in an area of 51' x 24', A = 1,225 sq. ft. (or 57' long, A = 1,368 sq. ft., including 3' aisles at each end).]

Page 29: Towards Petascale Computing

Blue Gene Software Architecture

Page 30: Towards Petascale Computing

BlueGene/L Software Architecture

• User applications execute exclusively on the compute nodes and see only the application volume as exposed by the user-level APIs
• The outside world interacts only with the I/O nodes and the processing sets (I/O node + compute nodes) they represent, through the operational surface – functionally, the machine behaves as a cluster of I/O nodes
• Internally, the machine is controlled through the service nodes in the control surface – the goal is to hide this surface as much as possible

Page 31: Towards Petascale Computing

BlueGene/L Software Architecture Rationale

• We view the system as a cluster of 1,024 I/O nodes, thereby reducing a 65,536-node machine to a manageable size
• From a system perspective, user processes "live" in the I/O node:
  – Process management (start, monitor, kill)
  – Process debugging
  – Authentication, authorization
  – System management daemons (LoadLeveler, xCAT, MPI)
  – Traditional Linux operating system
• User processes actually execute on compute nodes
  – Simple single-process (two-thread) kernel on the compute node
  – One processor per thread provides fast, predictable execution
  – A user process extends into the I/O node for complex operations
• The application model is a collection of private-memory processes communicating through messages

Page 32: Towards Petascale Computing

Programming Models for Compute Nodes

• Heater mode: CPU 0 does all the work for computation and communication, while CPU 1 spins idle (roles reversible)
  – Intended primarily for debugging the system
  – Some applications may actually benefit!
  – Reduces power consumption
• Communication coprocessor mode: CPU 0 executes the user application while CPU 1 handles communications (roles reversible)
  – Preferred mode of operation for communication-intensive codes
  – Requires coordination between CPUs, which is handled in libraries
• Symmetric mode: both CPUs do computation and communication within the same process

Page 33: Towards Petascale Computing

Software Stack in BlueGene/L Compute Node

• HPK controls all access to hardware, and enables a bypass for application use
• User-space libraries and applications can directly access the torus and tree through the bypass
• As a policy, user-space code should not directly touch hardware, but there is no enforcement of that policy
• Application code can use both processors in a compute node

[Diagram: software stack on a compute node – application code and user-space libraries sitting on HPK and its bypass, which in turn sit on the BG/L ASIC]

Page 34: Towards Petascale Computing

Role of I/O Nodes in System

[Diagram: several compute nodes, each a BG/L ASIC running HPK and an application, attached to an I/O node – a BG/L ASIC running Linux with application proxies, I/O control, and NFS – which connects to the outside world over IP/Ethernet]

Page 35: Towards Petascale Computing

File I/O in BlueGene/L

[Diagram: on the compute node, the user application calls glibc, which goes through ciolib; requests travel over the tree network to the I/O node, where a service process (ciolib, Linux glibc) above the I/O node kernel issues the operation to the file system, which reaches the file server or block storage over Ethernet]

Page 36: Towards Petascale Computing

Sockets I/O in BlueGene/L

[Diagram: same structure as file I/O – the user application's glibc calls go through ciolib and the tree SPI to the I/O node's service process, but on the I/O node they feed the IP stack, which reaches the remote process over Ethernet]

Page 37: Towards Petascale Computing

I/O Architecture: GPFS on a Cluster

[Diagram: groups of compute nodes (C) behind their I/O nodes (I) attach, together with the host, to the I/O-node and file-server network; multiple file servers sit on that network, each backed by RAID storage]

Page 38: Towards Petascale Computing

Communication Challenges

• Need to provide low-latency, high-bandwidth communication libraries to the user application
• The programming model is a set of processes communicating through messages
• Software layers tend to add significant latency to the system – matching sends and receives is a major culprit
• Efficient implementation of MPI is a major goal

Page 39: Towards Petascale Computing

Communication Infrastructure for BlueGene/L

• Three layers of communication libraries
  – SPI active packets layer: directly maps to hardware
  – SPI active messages layer: abstraction layer
  – MPI: for the end-user application
• Active packets and active messages are self-describing entities that cause actions to be executed on the receiving node (a generic dispatch sketch follows this list)
  – Packets have a maximum length determined by the architecture (256 B)
  – Messages can be much longer and are broken down into packets
  – Can be used by applications, but are intended more for library developers
  – One-sided communication, which can be acted upon as soon as it is received by the target node
• MPI will be based on the active messages layer
  – The first step is a simple port of MPICH based on active messages
  – We will then enhance MPI with knowledge of the machine topology
  – We will use the BG/L tree for global reductions and broadcasts
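The "self-describing entity" idea can be illustrated generically. The sketch below is our own illustration, not the BG/L SPI: a packet carries a handler index plus payload (capped at the 256-byte packet size from the slide), and the receiving node simply looks up and runs the named handler.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Hypothetical packet format: handler index + payload, 256 bytes total. */
    #define PACKET_SIZE 256
    #define PAYLOAD_MAX (PACKET_SIZE - sizeof(uint32_t))

    typedef struct {
        uint32_t handler_id;              /* which action to run on the receiver */
        uint8_t  payload[PAYLOAD_MAX];
    } active_packet;

    typedef void (*am_handler)(const uint8_t *payload, size_t len);

    #define MAX_HANDLERS 16
    static am_handler handler_table[MAX_HANDLERS];

    /* Receiver side: look up the handler named in the packet and run it. */
    static void dispatch(const active_packet *pkt, size_t payload_len) {
        if (pkt->handler_id < MAX_HANDLERS && handler_table[pkt->handler_id])
            handler_table[pkt->handler_id](pkt->payload, payload_len);
    }

    /* Example handler: print the payload as text. */
    static void print_handler(const uint8_t *payload, size_t len) {
        printf("received %zu bytes: %.*s\n", len, (int)len, (const char *)payload);
    }

    int main(void) {
        handler_table[0] = print_handler;

        /* Sender side: build a self-describing packet (transport omitted). */
        active_packet pkt = { .handler_id = 0 };
        const char *msg = "hello from a neighboring node";
        size_t len = strlen(msg);
        memcpy(pkt.payload, msg, len);

        dispatch(&pkt, len);   /* what the receiving node would do on arrival */
        return 0;
    }

Because the action is named in the packet itself, the receiver can act on it as soon as it arrives, without a matching receive having been posted – the one-sided behavior described above.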

Page 40: Towards Petascale Computing

MPI on BlueGene/L (1)

• MPI is based on the active messages layer, which is used as the transport
• After a send/receive match, the sender can put the data directly into the destination space using an active message – a good approach for large messages, as it avoids buffering and copies
• Buffers are still necessary for control and short messages

Example pattern from the slide (everyone sends to task 0; a runnable MPI rendering follows):

    size = # of tasks in program
    rank = my rank in set of tasks
    if (rank == 0) {
        for (int i = 1; i < size; i++)
            receive from task i into buffer[i]
    } else {
        send data to task 0
    }
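As a concrete illustration, here is a minimal, runnable MPI C version of that pattern (our own sketch, not BG/L-specific code; in practice MPI_Gather expresses the same pattern in one call):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int size, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* # of tasks in program    */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* my rank in set of tasks  */

        int data = rank * rank;                 /* arbitrary per-task payload */

        if (rank == 0) {
            int *buffer = malloc(size * sizeof(int));
            buffer[0] = data;
            for (int i = 1; i < size; i++)
                MPI_Recv(&buffer[i], 1, MPI_INT, i, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            for (int i = 0; i < size; i++)
                printf("task %d sent %d\n", i, buffer[i]);
            free(buffer);
        } else {
            MPI_Send(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

With 65,536 senders converging on task 0, the receive-side buffering question becomes the central issue, which is the subject of the next slide.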

Page 41: Towards Petascale Computing

MPI on BlueGene/L (2)

• The typical solution requires dedicated buffers for each possible sender (all tasks) in every task, plus flow control
  – 65,536 buffers of 1 KB each would consume 25% of our memory, and would not yield good performance
• A better solution is dynamic allocation of buffer space – only active senders get buffer space in the receiver
• Monitor communication patterns and adapt (a sketch of one such policy follows)
  – Reserve buffers based on recent history and inform the sender that the receiver is prepared – avoiding the handshake even if the "irecv" is not posted early
  – Give up the reservation when communication patterns change
• Flow control is still needed, and packets can still be dropped!
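A toy sketch of the "only active senders get buffers" idea (entirely our own illustration; the real BG/L MPI policy is not specified here): keep a small pool of receive buffers and hand them out to the most recently active senders, reclaiming the least recently used slot when the pool is full.

    #include <string.h>
    #include <stdio.h>

    #define POOL_SLOTS 64     /* receive buffers available: far fewer than 65,536 senders */
    #define BUF_BYTES  1024

    typedef struct {
        int  sender;          /* rank currently owning this slot, -1 if free */
        long last_used;       /* logical clock for LRU replacement */
        char data[BUF_BYTES];
    } recv_slot;

    static recv_slot pool[POOL_SLOTS];
    static long clock_now = 0;

    static void pool_init(void) {
        for (int i = 0; i < POOL_SLOTS; i++) pool[i].sender = -1;
    }

    /* Return a buffer reserved for `sender`: reuse its slot if it has been
     * active recently, otherwise evict the least recently used sender (the
     * reservation that is given up when communication patterns change). */
    static recv_slot *reserve_for(int sender) {
        int victim = 0;
        for (int i = 0; i < POOL_SLOTS; i++) {
            if (pool[i].sender == sender) {        /* already reserved: refresh */
                pool[i].last_used = ++clock_now;
                return &pool[i];
            }
            if (pool[i].last_used < pool[victim].last_used)
                victim = i;                        /* remember LRU candidate */
        }
        pool[victim].sender = sender;
        pool[victim].last_used = ++clock_now;
        memset(pool[victim].data, 0, BUF_BYTES);
        return &pool[victim];
    }

    int main(void) {
        pool_init();
        reserve_for(17);                  /* rank 17 becomes an "active sender"  */
        recv_slot *s = reserve_for(17);   /* a second message reuses the slot    */
        printf("slot owner: %d\n", s->sender);
        return 0;
    }

The 25% figure above is just 65,536 x 1 KB = 64 MB out of the 256 MB per node, which is why a small adaptive pool is attractive.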

Page 42: Towards Petascale Computing

Alignment

• Problem: the torus packet layer only handles aligned data; the message layer provides a data packetizer/unpacketizer (see the sketch below)
• When sender/receiver alignments are the same, packet transmission is easy
  – the head and tail are transmitted in a single "unaligned" packet
  – aligned packets go directly to/from the torus FIFOs
• When the alignments differ, one extra memory copy is needed
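A schematic packetizer in that spirit (our own illustration, not the BG/L message layer; for simplicity it sends the head and tail as two unaligned pieces rather than combining them into one packet as the slide describes): split a send buffer into an unaligned head, a run of aligned fixed-size packets, and an unaligned tail.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    #define PACKET_BYTES 256          /* packet size from the slides       */
    #define ALIGN        16           /* assumed alignment requirement     */

    /* Stand-in for handing one chunk to the torus hardware. */
    static void send_packet(const void *p, size_t len, int aligned) {
        printf("%s packet: %zu bytes\n", aligned ? "aligned" : "unaligned", len);
    }

    /* Split [buf, buf+len) into a head (up to the first aligned address),
     * a series of aligned PACKET_BYTES chunks, and a tail remainder. */
    static void packetize(const uint8_t *buf, size_t len) {
        size_t head = ((uintptr_t)buf % ALIGN)
                    ? ALIGN - ((uintptr_t)buf % ALIGN) : 0;
        if (head > len) head = len;
        size_t aligned_len = ((len - head) / PACKET_BYTES) * PACKET_BYTES;
        size_t tail = len - head - aligned_len;

        if (head) send_packet(buf, head, 0);                       /* unaligned head  */
        for (size_t off = 0; off < aligned_len; off += PACKET_BYTES)
            send_packet(buf + head + off, PACKET_BYTES, 1);        /* straight to FIFO */
        if (tail) send_packet(buf + head + aligned_len, tail, 0);  /* unaligned tail  */
    }

    int main(void) {
        static uint8_t data[1000];
        packetize(data + 3, sizeof(data) - 3);   /* deliberately misaligned start */
        return 0;
    }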

Page 43: Towards Petascale Computing

Compilers

• Support Fortran 95, C99, C++
• GNU C, Fortran, and C++ compilers can be used with BG/L, but they do not generate Hummer2 code
• IBM xlf/xlc compilers have been ported to BG/L, with code generation and optimization features for Hummer2 (licensing issues being resolved)
• Backend enhanced to support the PPC440 and to target the SIMD FPU on the nodes
  – Finds parallel operations that match SIMD instructions
  – Register allocator enhanced to handle register pairs
  – Instruction scheduling tuned to unit latencies
• The initial design of the (2-way) SIMD FPU architecture was driven by key workload kernels such as matrix-matrix product and FFT
• Identified several mux combinations for SIMD operations not usually seen in other SIMD ISA extensions (Intel SSE, PPC AltiVec), where P and S denote the two halves of the double FPU (illustrated in the sketch below), e.g.:
    dP = aP + bP * cP  ||  dS = aS - bS * cS
    dP = aP + bP * cP  ||  dS = aS + bP * cS
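Cross-pipe forms like these are exactly what complex arithmetic wants. A scalar C sketch of the second pattern (our illustration of the semantics only – not compiler output, intrinsics, or the actual instruction set):

    #include <stdio.h>

    /* One "paired FMA" in the spirit of the slide's second mux combination:
     *   dP = aP + bP * cP   ||   dS = aS + bP * cS
     * i.e., the primary operand bP is broadcast to both halves. Treating
     * (xP, xS) as the real and imaginary parts of a complex number, this is
     * half of a complex multiply-accumulate d += b * c. */
    static void paired_fma_cross(double aP, double aS,
                                 double bP,
                                 double cP, double cS,
                                 double *dP, double *dS) {
        *dP = aP + bP * cP;
        *dS = aS + bP * cS;
    }

    int main(void) {
        /* Accumulate d += b * c for complex b = 2+1i, c = 3+4i, starting d = 0. */
        double dP = 0.0, dS = 0.0;

        /* Step 1: multiply by the real part of b (broadcast bP = 2). */
        paired_fma_cross(dP, dS, 2.0, 3.0, 4.0, &dP, &dS);   /* d = (6, 8) */

        /* Step 2: the imaginary part of b needs a cross form with a subtract
         * on the primary half; done here in plain C. */
        dP = dP - 1.0 * 4.0;   /* 6 - 4 = 2  */
        dS = dS + 1.0 * 3.0;   /* 8 + 3 = 11 */

        printf("b*c = (%g, %g)\n", dP, dS);   /* (2+1i)(3+4i) = 2 + 11i */
        return 0;
    }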

Page 44: Towards Petascale Computing

Math Library

• A subset of ESSL (equivalent to LAPACK) is being developed
  – Collaboration with the Math department and IRL
  – Mainly dense matrix kernels – DGEMM, DGEMV, DDOT, Cholesky and LU factorization
• Early experience
  – Exposed problems with the compiler – many known problems fixed
  – Concern about memory subsystem performance
• Status
  – DGEMM (matrix multiplication) is getting excellent performance, estimated at ~2.4 GFLOPS on a 700 MHz node (86% of peak; see the arithmetic below). VHDL simulations show 93% of peak on both CPUs with a hot L1 cache, and 76% of peak with a cold L1 cache.

Page 45: Towards Petascale Computing

Event Analysis and Failure Prediction

• Failures in large clusters
  – Most failures are addressed through manual intervention as they are noticed
  – Most of the actions taken are based on individual experience and judgment
• Goal for BG/L: use event prediction techniques to predict and prevent critical failures, either automatically or with minimal human intervention
• Basic approach
  – Monitor event (failure) logs from the machine
  – Develop models from observation (control theory)
  – Predict future events (failures)
  – Use failure prediction to schedule maintenance and guide process migration

Page 46: Towards Petascale Computing

Fault Recovery Based on Checkpointing

• Different levels of transparency to the user application
  – User controlled: the application does its own checkpointing (a minimal sketch follows)
  – Semi-automatic: the application indicates program points for checkpointing; the checkpoint is performed by system software
  – Automatic: completely transparent checkpointing by system software
• The application is relaunched on an error-free partition after loading data from the previous checkpoint
• This approach is insufficient for dealing with undetected soft errors, which may occur about once a month on a 65,536-node machine
  – Consistency checks within the application
  – Error checking by system software – in general, this requires redundancy
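For the user-controlled case, the pattern is simply "write your own state every so often and restart from the last complete checkpoint". A generic single-process sketch (ours, not BG/L system software; names and the restart policy are illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    #define N 1000000
    #define CHECKPOINT_EVERY 100     /* iterations between checkpoints */

    static double state[N];
    static long step = 0;

    /* Write the full state plus the iteration counter. A real code would
     * write to a temporary file and rename it, so a crash mid-write cannot
     * corrupt the last good checkpoint. */
    static void checkpoint(const char *path) {
        FILE *f = fopen(path, "wb");
        if (!f) { perror("checkpoint"); exit(1); }
        fwrite(&step, sizeof step, 1, f);
        fwrite(state, sizeof(double), N, f);
        fclose(f);
    }

    /* Returns 1 if a previous checkpoint was restored. */
    static int restore(const char *path) {
        FILE *f = fopen(path, "rb");
        if (!f) return 0;
        int ok = fread(&step, sizeof step, 1, f) == 1
              && fread(state, sizeof(double), N, f) == (size_t)N;
        fclose(f);
        return ok;
    }

    int main(void) {
        const char *path = "app.ckpt";
        if (!restore(path)) step = 0;     /* fresh start */

        for (; step < 10000; step++) {
            state[step % N] += 1.0;       /* stand-in for real computation */
            if (step % CHECKPOINT_EVERY == 0)
                checkpoint(path);
        }
        checkpoint(path);
        return 0;
    }

On a parallel machine the same idea applies per process, with the additional requirement that all processes checkpoint a consistent state, which is what the semi-automatic and automatic levels push into system software.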

Page 47: Towards Petascale Computing

Job Scheduling in BlueGene/L

• Job scheduling strategies can significantly impact the utilization of large computer systems
  – Many large systems see utilization in the 50–70% range
• Machines with a toroidal topology (as opposed to an all-to-all switch) are particularly sensitive to job scheduling – this was demonstrated at LLNL with gang scheduling on the Cray T3D
• For BlueGene/L, we have investigated the impact of two scheduling techniques
  – Task migration
  – Backfilling (a simplified backfill test is sketched below)
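Backfilling in its simplest ("EASY"-style) form lets a small job jump the queue only if it will not delay the reservation held by the job at the head of the queue. A sketch of that admission test (our illustration, not the BG/L scheduler; it ignores the torus-shape constraints that make BG/L scheduling harder):

    #include <stdio.h>

    typedef struct {
        int    nodes;       /* nodes requested */
        double runtime;     /* user-estimated runtime (hours) */
    } job;

    /* May `candidate` start now without delaying the head-of-queue job?
     * free_now:     idle nodes at this moment.
     * shadow_time:  when enough nodes free up for the head job to start.
     * extra_nodes:  nodes left over at shadow_time after the head job starts. */
    static int can_backfill(job candidate, int free_now,
                            double now, double shadow_time, int extra_nodes) {
        if (candidate.nodes > free_now)
            return 0;                                 /* doesn't fit at all   */
        if (now + candidate.runtime <= shadow_time)
            return 1;                                 /* finishes before head */
        return candidate.nodes <= extra_nodes;        /* or uses spare nodes  */
    }

    int main(void) {
        /* Example: 4 nodes idle; the head job can start at t = 2.0 and will
         * leave 1 spare node. Candidate job: 3 nodes for 1.5 hours. */
        job candidate = { 3, 1.5 };
        printf("backfill %s\n",
               can_backfill(candidate, 4, 0.0, 2.0, 1) ? "allowed" : "denied");
        return 0;
    }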

Page 48: Towards Petascale Computing

BGLsim: System-Level Simulator

• A complete system-level simulator for BlueGene/L
  – Based on the Mambo simulation infrastructure from Austin Research
  – Complements our VHDL simulators (which are much slower)
• Architecturally accurate simulator
  – Executes the full BG/L instruction set, including Hummer2
  – Models all of the devices, including Ethernet, torus, tree, lock box, etc.
  – Supports development of system software and applications
  – Currently not performance aware: counts only instructions, not cycles
  – We have simulated systems with up to 64 nodes (not a limitation)
• Efficient simulation that supports code development
  – 1,000,000 – 2,000,000 BG/L instructions per second on a 1 GHz Pentium III
  – Can be extended to 5–10% performance accuracy without much drop in simulation speed
• We have used it to develop and test HPK, Linux, device drivers, networking, MPI, compilers, and FPU benchmarks

Page 49: Towards Petascale Computing

128x128x128 FFT Scaling

[Chart: 3D-FFT time (seconds, 0.001–1, log scale) vs. node count (1–100,000, log scale) for a 128x128x128 FFT on BG/L]

Page 50: Towards Petascale Computing

1024x1024x1024 FFT Scaling

Page 51: Towards Petascale Computing

Molecular Dynamics Scaling, 30,000 Atoms, P3ME (FFT)

Page 52: Towards Petascale Computing

ASCI Codes (graded for scientific importance / mapping to the BG/L architecture):

• Ab initio molecular dynamics in biology and materials science (JEEP): A / A
• Three-dimensional dislocation dynamics (MICRO3D and PARANOID): A / A
• Molecular dynamics (MDCASK code): A / A
• Kinetic Monte Carlo (BIGMAC): A / C
• Computational gene discovery (MPGSS): A / B
• Turbulence: Rayleigh-Taylor instability (MIRANDA): A / A
• Shock turbulence (AMRH): A / A
• Turbulence and instability modeling (sPPM benchmark code): B / A
• Hydrodynamic instability (ALE hydro): A / A

Page 53: Towards Petascale Computing

Conclusions

• Embedded technology promises to be an efficient path toward building massively parallel computers optimized at the system level
  – Low power is critical to achieving a dense, simple, inexpensive system
• We are developing a BG/L system software stack with a Linux-like personality for user applications
  – Custom solution (HPK) on compute nodes for highest performance
  – Linux solution on I/O nodes for flexibility and functionality
  – MPI is the default programming model, but others are being investigated
• BG/L is testing software approaches to the management and operation of very large scale machines
  – Hierarchical organization for management
  – "Flat" organization for programming
  – Mixed conventional/special-purpose operating systems
• Many challenges remain, particularly in performance and reliability

Page 54: Towards Petascale Computing

BlueGene/L Team

NR Adiga, G Almasi, GS Almasi, Y Aridor, M. Bae, D Beece, R Bellofatto, G Bhanot, R Bickford, M Blumrich, AA Bright, J Brunheroto, C Cascaval, J Castaños, W Chan, L Ceze, P Coteus, S Chatterjee, D Chen, G Chiu, TM Cipolla, P Crumley, KM Desai, A Deutsch, T Domany, MB Dombrowa, W Donath, M Eleftheriou, C Erway, J Esch, B Fitch, J Gagliano, A Gara, R Garg, R Germain, ME Giampapa, B Gopalsamy, J Gunnels, M Gupta, F Gustavson, S Hall, RA Haring, D Heidel, P Heidelberger, LM Herger, D Hoenicke, RD Jackson, T Jamal-Eddine, GV Kopcsay, E Krevat, MP Kurhekar, AP Lanzetta, D Lieber, LK Liu, M Lu, M Mendell, A Misra, Y Moatti, L Mok, JE Moreira, BJ Nathanson, M Newton, M Ohmacht, A Oliner, V Pandit, RB Pudota, William Pulleyblank, R Rand, R Regan, B Rubin, A Ruehli, S Rus, RK Sahoo, A Sanomiya, E Schenfeld, M Sharma, E Shmueli, S Singh, P Song, V Srinivasan, BD Steinmacher-Burow, K Strauss, C Surovic, R Swetz, T Takken, RB Tremaine, M Tsao, AR Umamaheshwaran, P Verma, P Vranas, TJC Ward, M Wazlowski (IBM)

B de Supinski, L Kissel, M Seager, J Vetter, RK Yates (LLNL)

Page 55: Towards Petascale Computing

BlueGene/L External Collaborators

Argonne National Lab; Barcelona; Boston University; Caltech; Columbia University; National Center for Atmospheric Research; Oak Ridge National Lab; San Diego Supercomputing Center; Stanford; Technical University of Vienna; Trinity College Dublin; Universidad Politecnica de Valencia; University of New Mexico; University of Edinburgh; University of Maryland

Page 56: Towards Petascale Computing

Page 57: Towards Petascale Computing

BACKUP SLIDES

Page 58: Towards Petascale Computing

Task Migration and Backfilling

[Diagram: two space-time schedules for the job queue {6, 3, 9, 5, 4} on 10 nodes, illustrating how backfilling and task migration fill holes in the schedule]

Page 59: Towards Petascale Computing

Results for Job Scheduling on BlueGene/L

Page 60: Towards Petascale Computing

IBM Research | April 2003 | BlueGene/L | © 2003 IBM Corporation