High performance cluster technology: the HPVM experience
Mario Lauria
Dept. of Computer and Information Science, The Ohio State University
Summer Institute on Advanced Computation
Wright State University - August 20-23, 2000
August 22, 2000
Thank You!
• My thanks to the organizers of SAIC 2000 for the invitation
• It is an honor and privilege to be here today
Acknowledgements
• HPVM is a project of the Concurrent Systems Architecture Group - CSAG (formerly UIUC Dept. of Computer Science, now UCSD Dept. of Computer Science & Engineering)
» Andrew Chien (Faculty)
» Phil Papadopoulos (Research Faculty)
» Greg Bruno, Mason Katz, Caroline Papadopoulos (Research Staff)
» Scott Pakin, Louis Giannini, Kay Connelly, Matt Buchanan, Sudha Krishnamurthy, Geetanjali Sampemane, Luis Rivera, Oolan Zimmer, Xin Liu, Ju Wang (Graduate Students)
• NT Supercluster: collaboration with NCSA Leading Edge Site
» Robert Pennington (Technical Program Manager)
» Mike Showerman, Qian Liu (Systems Programmers)
» Qian Liu, Avneesh Pant (Systems Engineers)
Outline
• The software/hardware interface (FM 1.1)
• The layer-to-layer interface (MPI-FM and FM 2.0)
• A production-grade cluster (NT Supercluster)
• Current status and projects (Storage Server)
Motivation for cluster technology
• Killer micros: low-cost Gigaflop processors are here for a few kilo-$ per processor
• Killer networks: Gigabit network hardware and high performance software (e.g. Fast Messages), soon at a few hundred $ per connection
• Leverage commodity HW and SW (Windows NT), build key technologies
» high performance computing in a RICH and ESTABLISHED software environment
Gigabit/sec networks: Myrinet, SCI, FC-AL, Giganet, Gigabit Ethernet, ATM
Ideal Model: HPVMs
• HPVM = High Performance Virtual Machine
• Provides a simple, uniform programming model; abstracts and encapsulates underlying resource complexity
• Simplifies use of complex resources
[Figure: application program running against a "Virtual Machine Interface" that hides the actual system configuration]
HPVM = Cluster Supercomputers
• High Performance Cluster Machine (HPVM)
» standard APIs hiding network topology and non-standard communication software
• Turnkey supercomputing clusters
» high performance communication, convenient use, coordinated resource management
• Runs on Windows NT and Linux; provides front-end queueing & management (LSF integrated)
[Figure: HPVM 1.0 stack (released Aug 19, 1997): MPI, Put/Get, Global Arrays, PGI HPF over Fast Messages, on Myrinet and Sockets]
Motivation for new communication software
• "Killer networks" have arrived...
» Gigabit links, moderate cost (dropping fast), low-latency routers
• ... however, network software delivers the network's performance only for large messages
[Figure: bandwidth (MB/s) vs. message size (bytes) for a 1 Gbit network (Ethernet, Myrinet): 125 µs overhead, N1/2 = 15 KB]
Motivation (cont.)
• Problem: Most messages are small
• Message size studies:
» < 576 bytes [Gusella90]
» 86-99% < 200 B [Kay & Pasquale]
» 300-400 B average size [U Buffalo monitors]
• => Most messages/applications see little performance improvement; overhead is the key (LogP model, Culler et al.)
• Communication is an enabling technology; how to fulfill its promise?
Fast Messages Project Goals
• Explore network architecture issues to enable delivery of underlying hardware performance (bandwidth, latency)
• Delivering performance means:
» considering realistic packet size distributions
» measuring performance at the application level
• Approach:
» minimize communication overhead
» hardware/software, multilayer integrated approach
Getting performance is hard!
[Figure: bandwidth (MB/s) vs. message size: theoretical peak vs. with link management]
• Slow Myrinet NIC processor (~5 MIPS)
• Early I/O bus (Sun's SBus) not optimized for small transfers
» 24 MB/s bandwidth with PIO, 45 MB/s with DMA
Simple Buffering and Flow Control
• Dramatically simplified buffering scheme, still performance critical
• Basic buffering + flow control can be implemented at acceptable cost
• Integration between NIC and host critical to provide services efficiently
» critical issues: division of labor, bus management, NIC-host interaction
[Figure: bandwidth (MB/s) vs. message size for PIO alone, with buffer management, and with flow control]
FM 1.x Performance (6/95)
• Latency 14 µs, peak BW 21.4 MB/s [Pakin, Lauria et al., Supercomputing '95]
• Hardware limits PIO performance, but N1/2 = 54 bytes
• Delivers 17.5 MB/s @ 128-byte messages (140 Mbit/s, more than OC-3 ATM can deliver)
[Figure: bandwidth (MB/s) vs. message size (bytes): FM vs. 1 Gb Ethernet]
Illinois Fast Messages 1.x
• API: Berkeley Active Messages
» Key distinctions: guarantees (reliable, in-order, flow control), network-processor decoupling (DMA region)
• Focus on short-packet performance
» Programmed I/O (PIO) instead of DMA
» Simple buffering and flow control
» User-space communication

Sender:   FM_send(NodeID, Handler, Buffer, size);   // handlers are remote procedures
Receiver: FM_extract()
The FM layering efficiency issue
• How good is the FM 1.1 API?
• Test: build a user-level library on top of it and measure the available performance
» MPI chosen as representative user-level library
» port of MPICH (ANL/MSU) to FM
• Purpose: study what services are important in layering communication libraries
» integration issues: what inefficiencies arise at the interface, and what is needed to reduce them [Lauria & Chien, JPDC 1997]
MPI on FM 1.x
• First implementation of MPI on FM was ready in Fall 1995
• Disappointing performance: only a fraction of FM bandwidth available to MPI applications
[Figure: bandwidth (MB/s) vs. message size: FM vs. MPI-FM]
MPI-FM Efficiency
• Result: FM fast, but its interface not efficient
[Figure: MPI-FM efficiency (%) vs. message size]
MPI-FM layering inefficiencies
[Figure: MPI over FM 1.x: header attachment/removal forces intermediate copies between MPI and FM buffers on both the send and receive paths]
• Too many copies due to header attachment/removal and lack of coordination between transport and application layers
The new FM 2.x API
• Sending
» FM_begin_message(NodeID, Handler, size), FM_end_message()
» FM_send_piece(stream, buffer, size)   // gather
• Receiving
» FM_receive(buffer, size)              // scatter
» FM_extract(total_bytes)               // receiver flow control
• Implementation based on the use of a lightweight thread for each message received
MPI-FM 2.x improved layering
• Gather-scatter interface + handler multithreading enables efficient layering, data manipulation without copies
[Figure: MPI over FM 2.x: header and source buffer gathered on send, scattered directly into header and destination buffer on receive]
MPI on FM 2.x
• MPI-FM: 91 MB/s, 13 µs latency, ~4 µs overhead
» Short messages much better than IBM SP2, PCI limited
» Latency ~ SGI O2K
[Figure: bandwidth (MB/s) vs. message size (4 B to 64 KB): FM vs. MPI-FM]
MPI-FM 2.x Efficiency
• High transfer efficiency, approaches 100% [Lauria, Pakin et al., HPDC-7 '98]
• Other systems much lower even at 1 KB (100 Mbit: 40%, 1 Gbit: 5%)
[Figure: MPI-FM 2.x efficiency (%) vs. message size (4 B to 64 KB)]
MPI-FM at work: the NCSA NT Supercluster
• 192 Pentium II, April 1998, 77 Gflops
» 3-level fat tree (large switches), scalable bandwidth, modular extensibility
• 256 Pentium II and III, June 1999, 110 Gflops (UIUC), w/ NCSA
• 512 x Merced, early 2001, Teraflop performance (@ NCSA)
[Photos: 77 GF, April 1998; 110 GF, June 99]
The NT Supercluster at NCSA
• 192 Hewlett Packard, 300 MHz
• 64 Compaq, 333 MHz
• Andrew Chien, CS UIUC --> UCSD
• Rob Pennington, NCSA
• Myrinet Network, HPVM, Fast Messages
• Microsoft NT OS, MPI API, etc.
HPVM III
MPI applications on the NT Supercluster
• Zeus-MP (192P, Mike Norman)
• ISIS++ (192P, Robert Clay)
• ASPCG (128P, Danesh Tafti)
• Cactus (128P, Paul Walker/John Shalf/Ed Seidel)
• QMC (128P, Lubos Mitas)
• Boeing CFD Test Codes (128P, David Levine)
• Others (no graphs):
» SPRNG (Ashok Srinivasan), Gamess, MOPAC (John McKelvey), freeHEP (Doug Toussaint), AIPS++ (Dick Crutcher), Amber (Balaji Veeraraghavan), Delphi/Delco codes, parallel sorting
=> No code retuning required (generally) after recompiling with MPI-FM
Solving 2D Navier-Stokes Kernel - Performance of Scalable Systems
[Figure: Gigaflops vs. processors for Origin-DSM, Origin-MPI, NT-MPI, SP2-MPI, T3E-MPI, SPP2000-DSM]
Preconditioned Conjugate Gradient Method with multi-level additive Schwarz Richardson preconditioner (2D 1024x1024)
Danesh Tafti, Rob Pennington, NCSA; Andrew Chien (UIUC, UCSD)
NCSA NT Supercluster Solving Navier-Stokes Kernel
Danesh Tafti, Rob Pennington, Andrew Chien NCSA
[Figures: speedup and Gigaflops vs. processors for NT MPI, Origin MPI, Origin SM, against perfect scaling]
Single processor performance: MIPS R10k 117 MFLOPS, Intel Pentium II 80 MFLOPS
Preconditioned Conjugate Gradient Method with multi-level additive Schwarz Richardson preconditioner (2D 1024x1024)
Solving 2D Navier-Stokes Kernel (cont.)
[Figure: Gigaflops vs. number of processors: SGI O2K vs. x86 NT]
Preconditioned Conjugate Gradient Method with multi-level additive Schwarz Richardson preconditioner (2D 4094x4094)
Danesh Tafti, Rob Pennington, NCSA; Andrew Chien (UIUC, UCSD)
• Excellent Scaling to 128P, Single Precision ~25% faster
Near Perfect Scaling of Cactus - 3D Dynamic Solver for the Einstein GR Equations
[Figure: scaling vs. processors (up to 120): Origin vs. NT SC]
Ratio of GFLOPs: Origin = 2.5x NT SC
Paul Walker, John Shalf, Rob Pennington, Andrew Chien NCSA
Cactus was developed by Paul Walker; MPI-Potsdam, UIUC, NCSA
Quantum Monte Carlo Origin and HPVM Cluster
[Figure: GFLOPS vs. processors (up to 120) for the Origin and the HPVM cluster]
T. Torelli (UIUC CS), L. Mitas (NCSA, Alliance Nanomaterials Team)
Origin is about 1.7x Faster than NT SC
Supercomputer Performance Characteristics
• Compute/communicate and compute/latency ratios
• Clusters can provide comparable characteristics at a dramatically lower system cost
System                  Mflops/Proc   Flops/Byte   Flops/Network RT
Cray T3E                1200          ~2           ~2,500
SGI Origin2000          500           ~0.5         ~1,000
HPVM NT Supercluster    300           ~3.2         ~6,000
Berkeley NOW II         100           ~3.2         ~2,000
IBM SP2                 550           ~3.7         ~38,000
Beowulf (100 Mbit)      300           ~25          ~500,000
HPVM today: HPVM 1.9
[Figure: HPVM 1.9 stack: MPI, SHMEM, Global Arrays, BSP over Fast Messages, on Myrinet or VIA and shared memory (SMP)]
• Added support for:
» shared memory
» VIA interconnect
• New API:
» BSP
Show me the numbers!
• Basics
» Myrinet
– FM: 100+ MB/s, 8.6 µs latency
– MPI: 91 MB/s @ 64K, 9.6 µs latency
• approximately 10% overhead
» Giganet
– FM: 81 MB/s, 14.7 µs latency
– MPI: 77 MB/s, 18.6 µs latency
• 5% BW overhead, 26% latency!
» Shared memory transport
– FM: 195 MB/s, 3.13 µs latency
– MPI: 85 MB/s, 5.75 µs latency
Bandwidth Graphs
[Figure: bandwidth (MB/s) vs. message size (0-16 KB) for MPI on VIA, FM on Myrinet, MPI on Myrinet, FM on VIA]
• N1/2 ~ 512 Bytes
• FM bandwidth usually a good indicator of deliverable bandwidth
• High BW attained for small messages
Other HPVM related projects
• Approximately three hundred groups had downloaded HPVM 1.2 at the last count
• Some interesting research projects:
» Low-level support for collective communication, OSU
» FM with multicast (FM-MC), Vrije Universiteit, Amsterdam
» Video server on demand, Univ. of Naples
» Together with AM, U-Net and VMMC, FM has been the inspiration for the VIA industrial standard by Intel, Compaq, IBM
• Latest release of HPVM is available from http://www-csag.ucsd.edu
Current project: a HPVM-based Terabyte Storage Server
• High performance parallel architectures increasingly associated with data-intensive applications:
» NPACI large dataset applications requiring 100's of GB:
– Digital Sky Survey, brain wave analysis
» digital data repositories, web indexing, multimedia servers:
– Microsoft TerraServer, AltaVista, RealPlayer/Windows Media servers (Audionet, CNN), streamed audio/video
» genomic and proteomic research:
– large centralized data banks (GenBank, SwissProt, PDB, ...)
• Commercial terabyte systems (StorageTek, EMC) have price tags in the M$ range
The HPVM approach to a Terabyte Storage Server
• Exploit commodity PC technologies to build a large (2 TB) and smart (50 Gflops) storage server
» benefits: inexpensive PC disks, modern I/O bus
• The cluster advantage:
» 10 µs communication latency vs. 10 ms disk access latency provides opportunity for data declustering, redistribution, aggregation of I/O bandwidth
» distributed buffering, data processing capability
» scalable architecture
• Integration issues:
» efficient data declustering, I/O bus bandwidth allocation, remote/local programming interface, external connectivity
Global Picture
[Figure: HPVM cluster (Myrinet) at the Dept. of CSE, UCSD, connected to the San Diego Supercomputer Center by a 1 GB/s link]
• 1 GB/s link between the two sites
» 8 parallel Gigabit Ethernet connections
» Ethernet cards installed in some of the nodes on each machine
The Hardware Highlights
• Main features:
» 1.6 TB = 64 * 25 GB disks = $30K (UltraATA disks)
» 1 GB/s of aggregate I/O bw (= 64 disks * 15 MB/s)
» 45 GB RAM, 48 Gflop/s
» 2.4 Gb/s Myrinet network
• Challenges:
» make the aggregate I/O bandwidth available to applications
» balance I/O load across nodes/disks
» transport of TBs of data in and out of the cluster
The Software Components
[Figure: software stack: SRB and Panda / MPI, Put/Get, Global Arrays / Fast Messages / Myrinet]
• Storage Resource Broker (SRB) used for interoperability with existing NPACI applications at SDSC
• Parallel I/O library (e.g. Panda, MPI-IO) to provide high performance I/O to code running on the cluster
• The HPVM suite provides support for fast communication and standard APIs on the NT cluster
Related Work
• User-level fast networking:
» VIA list: AM (Fast Sockets) [Culler92, Rodrigues97], U-Net (U-Net/MM) [Eicken95, Welsh97], VMMC-2 [Li97]
» RWCP PM [Tezuka96], BIP [Prylli97]
• High-performance cluster-based storage:
» UC Berkeley Tertiary Disks [Talagala98]
» CMU network-attached devices [Gibson97], UCSB Active Disks [Acharya98]
» UCLA Randomized I/O (RIO) server [Fabbrocino98]
» UC Berkeley River system (Arpaci-Dusseau, unpub.)
» ANL ROMIO and RIO projects (Foster, Gropp)
Conclusions
• HPVM provides all the necessary tools to transform a PC cluster into a production supercomputer
• Projects like HPVM demonstrate:
» the level of maturity achieved so far by cluster technology with respect to conventional HPC utilization
» a springboard for further research on new uses of the technology
• Efficient component integration at several levels is key to performance:
» tight coupling of the host and NIC is crucial to minimize communication overhead
» software layering on top of FM has exposed the need for a client-conscious design at the interface between layers
Future Work
• Moving toward a more dynamic model of computation:
» dynamic process creation, interaction between computations
» communication group management
» long term targets are dynamic communication, support for adaptive applications
• Wide-area computing:
» integration within computational grid infrastructure
» LAN/WAN bridges, remote cluster connectivity
• Cluster applications:
» enhanced-functionality storage, scalable multimedia servers
• Semi-regular network topologies