TRANSCRIPT
Kei Davis and Fabrizio Petrini <{kei,fabrizio}@lanl.gov>
Euro-Par 2004, Pisa, Italy
CCS-3 / PAL

STATE OF THE ART
Section 2
Overview

We are going to briefly describe some state-of-the-art supercomputers. The goal is to evaluate the degree of integration of the three main components: processing nodes, interconnection network, and system software. The analysis is limited to six supercomputers (ASCI Q, ASCI Thunder, System X, BlueGene/L, Cray XD1, and ASCI Red Storm) due to space and time limitations.
ASCI Q: Los Alamos National Laboratory
ASCI Q
• Total: 20.48 TF/s, #3 in the Top 500
• Systems: 2,048 AlphaServer ES45s; 8,192 EV68 1.25-GHz CPUs with 16-MB cache
• Memory: 22 Terabytes
• System interconnect: dual-rail Quadrics interconnect; 4,096 QSW PCI adapters; four 1024-way QSW federated switches
• Operational in 2002
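The quoted 20.48 TF/s peak is consistent with the processor count and clock, assuming the EV68 retires two floating-point operations per cycle (that flops-per-cycle figure is an assumption here, not stated on the slide):

```python
# Peak-performance sanity check for ASCI Q.
cpus = 8192
clock_hz = 1.25e9
flops_per_cycle = 2  # assumption: two FP results per cycle on the Alpha EV68

peak_tflops = cpus * clock_hz * flops_per_cycle / 1e12
print(peak_tflops)  # 20.48
```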
Node: HP (Compaq) AlphaServer ES45 21264 System Architecture

[Figure: ES45 block diagram. Four EV68 1.25-GHz CPUs, each with 16 MB of cache; up to 32 GB of memory on four memory boards (MMB 0-3), each at 64b 500 MHz (4.0 GB/s); two 256b 125-MHz (4.0 GB/s) paths through the quad C-chip controller; ten PCI slots spread across 64b 33-MHz (266 MB/s) and 64b 66-MHz (528 MB/s) PCI buses; plus PCI-USB and legacy serial/keyboard/mouse/floppy I/O.]
QsNet: Quaternary Fat Tree
• Hardware support for collective communication
• MPI latency 4 µs, bandwidth 300 MB/s
• Barrier latency less than 10 µs
Interconnection Network
[Figure: 1,024-node quaternary fat tree (2 rails = 2,048 nodes). Sixteen 64U64D first-stage switch groups (the 1st serving nodes 0-63, the 16th serving nodes 960-1023), rising through switch levels to mid-level and super-top-level switches.]
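A quaternary fat tree multiplies the number of reachable nodes by four at each switch stage, so one 1,024-node rail is spanned by five stages (4^5 = 1024). A minimal sketch of that relation:

```python
def fat_tree_levels(nodes, arity=4):
    """Number of switch stages a k-ary fat tree needs to span `nodes` leaves."""
    levels, capacity = 0, 1
    while capacity < nodes:
        capacity *= arity
        levels += 1
    return levels

print(fat_tree_levels(1024))  # 5 stages for one 1,024-node QsNet rail
```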
System Software
The operating system is Tru64. Nodes are organized in clusters of 32 for resource-allocation and administration purposes (TruCluster). Resource management is executed through Ethernet (RMS).
ASCI Q: Overview
Node Integration: Low (multiple boards per node, network interface on I/O bus)
Network Integration: High (HW support for atomic collective primitives)
System Software Integration: Medium/Low (TruCluster)
ASCI Thunder, 1,024 Nodes, 23 TF/s peak
ASCI Thunder, Lawrence Livermore National Laboratory
• 1,024 nodes, 4,096 processors, 23 TF/s
• #2 in the Top 500
ASCI Thunder: Configuration
1,024 nodes, each a quad 1.4-GHz Itanium2 with 8 GB DDR266 SDRAM (8 Terabytes total). MPI latency and bandwidth of 2.5 µs and 912 MB/s over Quadrics Elan4. Barrier synchronization in 6 µs, allreduce in 15 µs. 75 TB of local disk (73 GB/node UltraSCSI320). Lustre file system with 6.4 GB/s delivered parallel I/O performance. Linux RH 3.0, SLURM, Chaos.
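The 23 TF/s peak matches 4,096 Itanium2 processors at 1.4 GHz if each issues four floating-point operations per cycle (two fused multiply-adds; the per-cycle figure is an assumption here, not on the slide):

```python
procs = 4096
clock_hz = 1.4e9
flops_per_cycle = 4  # assumption: two FMA units on the Itanium2

peak_tflops = procs * clock_hz * flops_per_cycle / 1e12
print(round(peak_tflops, 2))  # 22.94, quoted as 23 TF/s peak
```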
CHAOS: Clustered High Availability Operating System. Derived from Red Hat, but differs in the following areas: a modified kernel (Lustre and hardware specific); new packages for cluster monitoring, system installation, and power/console management; and SLURM, an open-source resource manager.
ASCI Thunder: Overview
Node Integration: Medium/Low (network interface on I/O bus)
Network Integration: Very High (HW support for atomic collective primitives)
System Software Integration: Medium (Chaos)
System X: Virginia Tech
System X, 10.28 TF/s
1,100 dual Apple G5 2-GHz CPU-based nodes.
8 billion operations/second/processor (8 GFlops) peak double-precision floating-point performance. Each node has 4 GB of main memory and 160 GB of Serial ATA storage; 176 TB total secondary storage. InfiniBand interconnect with 8 µs latency and 870 MB/s bandwidth, and partial support for collective communication. System-level fault tolerance (Déjà vu).
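The headline 10.28 TF/s is presumably the measured (Linpack) number, since the stated per-processor peak implies a higher theoretical figure. A quick check:

```python
nodes, procs_per_node, gflops_per_proc = 1100, 2, 8

peak_tflops = nodes * procs_per_node * gflops_per_proc / 1000
print(peak_tflops)  # 17.6 TF/s theoretical peak

efficiency = 10.28 / peak_tflops  # measured / peak
print(round(efficiency, 2))  # 0.58
```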
System X: Overview
Node Integration: Medium/Low (network interface on I/O bus)
Network Integration: Medium (limited support for atomic collective primitives)
System Software Integration: Medium (system-level fault-tolerance)
BlueGene/L System

Packaging hierarchy (peak GF/s, both operating modes / memory):
• Chip (2 processors): 2.8/5.6 GF/s, 4 MB
• Compute card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
• Node card (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
• Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
• System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
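The hierarchy's peak numbers follow from straight multiplication of the per-chip figure (using the higher, 5.6 GF/s mode; the quoted totals are rounded):

```python
# BlueGene/L packaging hierarchy, multiplying up from one chip.
chip_gf = 5.6                  # per chip, higher of the 2.8/5.6 modes
compute_card = 2 * chip_gf     # 2 chips          -> 11.2 GF/s
node_card = 16 * compute_card  # 16 compute cards -> 179.2 GF/s (quoted 180)
cabinet = 32 * node_card       # 32 node boards   -> ~5.7 TF/s
system = 64 * cabinet          # 64 cabinets      -> ~367 TF/s (quoted 360)

print(round(cabinet / 1000, 2))  # 5.73 TF/s
print(round(system / 1000, 1))   # 367.0 TF/s
```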
BlueGene/L Compute ASIC
[Figure: BlueGene/L compute ASIC block diagram. Two PPC 440 CPU cores (one acting as I/O processor), each with 32k/32k L1 caches and a "Double FPU"; multiported shared SRAM buffer with snoop logic; per-core L2; shared L3 directory for EDRAM, including ECC; 4 MB EDRAM usable as L3 cache or memory; 144-bit-wide DDR controller with ECC (256/512 MB external DDR); torus interface (6 out and 6 in, each link at 1.4 Gbit/s); tree interface (3 out and 3 in, each link at 2.8 Gbit/s); global interrupt interface (4 global barriers or interrupts); Gbit Ethernet; JTAG access.]

• IBM CU-11, 0.13 µm
• 11 x 11 mm die size
• 25 x 32 mm CBGA
• 474 pins, 328 signal
• 1.5/2.5 Volt
[Figure: BlueGene/L node card: 16 compute cards, 2 I/O cards, and DC-DC converters (40 V in; 1.5 V and 2.5 V out).]
BlueGene/L Interconnection Networks
3-Dimensional Torus: interconnects all compute nodes (65,536); virtual cut-through hardware routing; 1.4 Gb/s on all 12 node links (2.1 GB/s per node); 350/700 GB/s bisection bandwidth; communications backbone for computations.

Global Tree: one-to-all broadcast functionality; reduction-operations functionality; 2.8 Gb/s of bandwidth per link; latency of tree traversal on the order of 5 µs; interconnects all compute and I/O nodes (1024).

Ethernet: incorporated into every node ASIC; active in the I/O nodes (1:64); carries all external communication (file I/O, control, user interaction, etc.).

Low-Latency Global Barrier: 8 single wires crossing the whole system, touching all nodes.

Control Network (JTAG): for booting, checkpointing, and error logging.
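The per-node torus bandwidth follows from the link count and link speed, with bits-to-bytes as the only conversion:

```python
links_per_node = 12  # 6 out + 6 in on the 3D torus
link_gbits = 1.4     # per link

node_gbytes = links_per_node * link_gbits / 8  # bits -> bytes
print(round(node_gbytes, 1))  # 2.1 GB/s per node, matching the slide
```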
BlueGene/L System Software Organization
Compute nodes are dedicated to running the user application, and almost nothing else: a simple compute node kernel (CNK).

I/O nodes run Linux and provide O/S services: file access, process launch/termination, debugging.

Service nodes perform system management services (e.g., system boot, heartbeat, error monitoring), largely transparent to application/system software.
Operating Systems
Compute nodes: CNK
• Specialized, simple O/S: 5,000 lines of code, 40 KBytes in core
• No thread support, no virtual memory
• Protection: protect kernel from application; some net devices in user space
• File I/O offloaded ("function shipped") to I/O nodes through kernel system calls
• "Boot, start app and then stay out of the way"

I/O nodes: Linux
• 2.4.19 kernel (2.6 underway) w/ ramdisk
• NFS/GPFS client
• CIO daemon to start/stop jobs and execute file I/O

Global O/S (CMCS, service node)
• Invisible to user programs
• Global and collective decisions; interfaces with external policy modules (e.g., job scheduler)
• Commercial database technology (DB2) stores static and dynamic state
• Handles partition selection, partition boot, running of jobs, system error logs, and the checkpoint/restart mechanism
• Scalability, robustness, security
• Execution mechanisms in the core; policy decisions in the service node
BlueGene/L: Overview
Node Integration: High (processing node integrates processors and network interfaces; network interfaces directly connected to the processors)
Network Integration: High (separate tree network)
System Software Integration: Medium/High (compute kernels are not globally coordinated)
#2 and #4 in the Top 500
Cray XD1
Cray XD1 System Architecture
Compute: 12 AMD Opteron 32/64-bit x86 processors; High Performance Linux.
RapidArray Interconnect: 12 communications processors; 1 Tb/s switch fabric.
Active Management: dedicated processor.
Application Acceleration: 6 co-processors.
Processors are directly connected to the interconnect.
Cray XD1 Processing Node
[Figure: Cray XD1 chassis. Front: six 2-way SMP blades. Rear: six SATA hard drives, four independent PCI-X slots, a 500 Gb/s crossbar switch with 12-port inter-chassis connector, a connector to a second 500 Gb/s crossbar switch and 12-port inter-chassis connector, and 4 fans.]
Cray XD1 Compute Blade
[Figure: Cray XD1 compute blade. Two AMD Opteron 2XX processors, each with 4 DIMM sockets for DDR 400 registered ECC memory; a RapidArray communications processor; and a connector to the main board.]
Fast Access to the Interconnect
Memory and interconnect bandwidth per processor:

              Memory              Interconnect
Cray XD1      6.4 GB/s (DDR 400)  8 GB/s (RapidArray)
Xeon server   5.3 GB/s (DDR 333)  0.25 GB/s (GigE) / 1 GB/s (PCI-X)
Communications Optimizations
RapidArray Communications Processor:
• HT/RA tunnelling with bonding
• Routing with route redundancy
• Reliable transport
• Short-message latency optimization
• DMA operations
• System-wide clock synchronization

[Figure: AMD Opteron 2XX processor linked to the RapidArray communications processor at 3.2 GB/s, with two 2 GB/s RapidArray links to the fabric.]
Active Management Software
• Usability: single-system command and control
• Resiliency: dedicated management processors, real-time OS and communications fabric; proactive background diagnostics with self-healing
• Synchronized Linux kernels
• Active Manager System
Cray XD1: Overview
Node Integration: High (direct access from HyperTransport to RapidArray)
Network Integration: Medium/High (HW support for collective communication)
System Software Integration: High (compute kernels are globally coordinated)
Early stage
ASCI Red Storm
Red Storm Architecture
Distributed-memory MIMD parallel supercomputer. Fully connected 3D mesh interconnect; each compute node processor has a bidirectional connection to the primary communication network. 108 compute node cabinets and 10,368 compute node processors (AMD Sledgehammer @ 2.0 GHz). ~10 TB of DDR memory @ 333 MHz. Red/Black switching: ~1/4, ~1/2, ~1/4. 8 service and I/O cabinets on each end (256 processors for each color). 240 TB of disk storage (120 TB per color).
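The compute-processor count matches the 27 x 16 x 24 mesh dimensions given on the system-layout slide:

```python
x, y, z = 27, 16, 24   # Red Storm mesh dimensions
print(x * y * z)       # 10368 compute node processors
```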
Red Storm Architecture
Functional hardware partitioning: service and I/O nodes, compute nodes, and RAS nodes
Partitioned Operating System (OS): LINUX on service and I/O nodes, LWK (Catamount) on compute nodes, stripped down LINUX on RAS nodes
Separate RAS and system management network (Ethernet)
Router table-based routing in the interconnect
[Figure: Red Storm architecture: a compute partition flanked by service and net I/O nodes on one side and file I/O nodes (/home) on the other, with users connecting through the service partition.]
System Layout (27 x 16 x 24 mesh)
[Figure: system layout: normally-unclassified and normally-classified sections at the two ends, switchable nodes in the middle, separated by disconnect cabinets.]
Red Storm System Software

Run-Time System:
• Logarithmic loader
• Fast, efficient node allocator
• Batch system: PBS
• Libraries: MPI, I/O, Math

File systems being considered include:
• PVFS: interim file system
• Lustre: Pathforward support
• Panasas…

Operating Systems:
• LINUX on service and I/O nodes
• Sandia's LWK (Catamount) on compute nodes
• LINUX on RAS nodes
ASCI Red Storm: Overview
Node Integration: High (direct access from HyperTransport to the network through a custom network interface chip)
Network Integration: Medium (no support for collective communication)
System Software Integration: Medium/High (scalable resource manager; no global coordination between nodes)
Expected to become the most powerful machine in the world (competition permitting)
Overview
                Node Integration   Network Integration   Software Integration
ASCI Q          Low                High                  Medium/Low
ASCI Thunder    Medium/Low         Very High             Medium
System X        Medium/Low         Medium                Medium
BlueGene/L      High               High                  Medium/High
Cray XD1        High               Medium/High           High
Red Storm       High               Medium                Medium/High