High performance cluster technology: the HPVM experience
Mario Lauria
Dept. of Computer and Information Science, The Ohio State University
Summer Institute on Advanced Computation
Wright State University - August 20-23, 2000
August 22, 2000
Thank You!
• My thanks to the organizers of SAIC 2000 for the invitation
• It is an honor and privilege to be here today
Acknowledgements
• HPVM is a project of the Concurrent Systems Architecture Group - CSAG (formerly UIUC Dept. of Computer Science, now UCSD Dept. of Computer Science & Engineering)
» Andrew Chien (Faculty)
» Phil Papadopoulos (Research Faculty)
» Greg Bruno, Mason Katz, Caroline Papadopoulos (Research Staff)
» Scott Pakin, Louis Giannini, Kay Connelly, Matt Buchanan, Sudha Krishnamurthy, Geetanjali Sampemane, Luis Rivera, Oolan Zimmer, Xin Liu, Ju Wang (Graduate Students)
• NT Supercluster: collaboration with NCSA Leading Edge Site
» Robert Pennington (Technical Program Manager)
» Mike Showerman, Qian Liu (Systems Programmers)
» Qian Liu, Avneesh Pant (Systems Engineers)
Outline
• The software/hardware interface (FM 1.1)
• The layer-to-layer interface (MPI-FM and FM 2.0)
• A production-grade cluster (NT Supercluster)
• Current status and projects (Storage Server)
Motivation for cluster technology
• Killer micros: low-cost Gigaflop processors are here for a few kilo-$ per processor
• Killer networks: Gigabit network hardware and high performance software (e.g. Fast Messages), soon at a few hundred $ per connection
• Leverage commodity HW and SW (Windows NT), build key technologies
» high performance computing in a RICH and ESTABLISHED software environment
Gigabit/sec networks: Myrinet, SCI, FC-AL, Giganet, Gigabit Ethernet, ATM
Ideal Model: HPVMs
• HPVM = High Performance Virtual Machine
• Provides a simple, uniform programming model; abstracts and encapsulates underlying resource complexity
• Simplifies use of complex resources
[Figure: application program running against a "Virtual Machine Interface" that hides the actual system configuration]
HPVM = Cluster Supercomputers
• High Performance Cluster Machine (HPVM)
» standard APIs hiding network topology and non-standard communication software
• Turnkey supercomputing clusters
» high performance communication, convenient use, coordinated resource management
• Runs on Windows NT and Linux; provides front-end queueing & management (LSF integrated)
[Figure: HPVM 1.0 stack (released Aug 19, 1997): MPI, Put/Get, Global Arrays, PGI HPF over Fast Messages, on Myrinet and Sockets]
Motivation for new communication software
• "Killer networks" have arrived...
» Gigabit links, moderate cost (dropping fast), low-latency routers
• ... however, network software delivers the network's performance only for large messages
[Figure: bandwidth (MB/s) vs. message size (bytes) for a 1 Gbit network (Ethernet, Myrinet): 125 µs overhead, N1/2 = 15 KB]
Motivation (cont.)
• Problem: Most messages are small
• Message size studies:
» < 576 bytes [Gusella90]
» 86-99% < 200 B [Kay & Pasquale]
» 300-400 B average size [U Buffalo monitors]
• => Most messages/applications see little performance improvement; overhead is the key (LogP model, Culler et al.)
• Communication is an enabling technology; how to fulfill its promise?
Fast Messages Project Goals
• Explore network architecture issues to enable delivery of underlying hardware performance (bandwidth, latency)
• Delivering performance means:
» considering realistic packet size distributions
» measuring performance at the application level
• Approach:
» minimize communication overhead
» hardware/software, multilayer integrated approach
Getting performance is hard!
[Figure: bandwidth (MB/s) vs. message size: theoretical peak vs. with link management]
• Slow Myrinet NIC processor (~5 MIPS)
• Early I/O bus (Sun's SBus) not optimized for small transfers
» 24 MB/s bandwidth with PIO, 45 MB/s with DMA
Simple Buffering and Flow Control
• Dramatically simplified buffering scheme, still performance critical
• Basic buffering + flow control can be implemented at acceptable cost
• Integration between NIC and host critical to provide services efficiently
» critical issues: division of labor, bus management, NIC-host interaction
[Figure: bandwidth (MB/s) vs. message size for PIO alone, with buffer management, and with flow control]
FM 1.x Performance (6/95)
• Latency 14 µs, peak BW 21.4 MB/s [Pakin, Lauria et al., Supercomputing '95]
• Hardware limits PIO performance, but N1/2 = 54 bytes
• Delivers 17.5 MB/s @ 128-byte messages (140 Mbit/s, more than OC-3 ATM can deliver)
[Figure: bandwidth (MB/s) vs. message size (bytes): FM vs. 1 Gb Ethernet]
Illinois Fast Messages 1.x
• API: Berkeley Active Messages
» Key distinctions: guarantees (reliable, in-order, flow control), network-processor decoupling (DMA region)
• Focus on short-packet performance
» Programmed I/O (PIO) instead of DMA
» Simple buffering and flow control
» User-space communication

Sender:   FM_send(NodeID, Handler, Buffer, size);   // handlers are remote procedures
Receiver: FM_extract()
The FM layering efficiency issue
• How good is the FM 1.1 API?
• Test: build a user-level library on top of it and measure the available performance
» MPI chosen as representative user-level library
» port of MPICH (ANL/MSU) to FM
• Purpose: study what services are important in layering communication libraries
» integration issues: what inefficiencies arise at the interface, and what is needed to reduce them [Lauria & Chien, JPDC 1997]
MPI on FM 1.x
• First implementation of MPI on FM was ready in Fall 1995
• Disappointing performance: only a fraction of FM bandwidth available to MPI applications
[Figure: bandwidth (MB/s) vs. message size: FM vs. MPI-FM]
MPI-FM Efficiency
• Result: FM fast, but its interface not efficient
[Figure: MPI-FM efficiency (%) vs. message size]
MPI-FM layering inefficiencies
[Figure: MPI over FM 1.x: header attachment/removal forces intermediate copies between MPI and FM buffers on both the send and receive paths]
• Too many copies due to header attachment/removal and lack of coordination between transport and application layers
The new FM 2.x API
• Sending
» FM_begin_message(NodeID, Handler, size), FM_end_message()
» FM_send_piece(stream, buffer, size)   // gather
• Receiving
» FM_receive(buffer, size)              // scatter
» FM_extract(total_bytes)               // receiver flow control
• Implementation based on the use of a lightweight thread for each message received
MPI-FM 2.x improved layering
• Gather-scatter interface + handler multithreading enables efficient layering, data manipulation without copies
[Figure: MPI over FM 2.x: header and source buffer gathered on send, scattered directly into header and destination buffer on receive]
MPI on FM 2.x
• MPI-FM: 91 MB/s, 13 µs latency, ~4 µs overhead
» Short messages much better than IBM SP2, PCI limited
» Latency ~ SGI O2K
[Figure: bandwidth (MB/s) vs. message size (4 B to 64 KB): FM vs. MPI-FM]
MPI-FM 2.x Efficiency
• High transfer efficiency, approaches 100% [Lauria, Pakin et al., HPDC-7 '98]
• Other systems much lower even at 1 KB (100 Mbit: 40%, 1 Gbit: 5%)
[Figure: MPI-FM 2.x efficiency (%) vs. message size (4 B to 64 KB)]
MPI-FM at work: the NCSA NT Supercluster
• 192 Pentium II, April 1998, 77 Gflops
» 3-level fat tree (large switches), scalable bandwidth, modular extensibility
• 256 Pentium II and III, June 1999, 110 Gflops (UIUC), w/ NCSA
• 512 x Merced, early 2001, Teraflop performance (@ NCSA)
[Photos: 77 GF, April 1998; 110 GF, June 99]
The NT Supercluster at NCSA
• 192 Hewlett Packard, 300 MHz
• 64 Compaq, 333 MHz
• Andrew Chien, CS UIUC --> UCSD
• Rob Pennington, NCSA
• Myrinet Network, HPVM, Fast Messages
• Microsoft NT OS, MPI API, etc.
HPVM III
MPI applications on the NT Supercluster
• Zeus-MP (192P, Mike Norman)
• ISIS++ (192P, Robert Clay)
• ASPCG (128P, Danesh Tafti)
• Cactus (128P, Paul Walker/John Shalf/Ed Seidel)
• QMC (128P, Lubos Mitas)
• Boeing CFD Test Codes (128P, David Levine)
• Others (no graphs):
» SPRNG (Ashok Srinivasan), Gamess, MOPAC (John McKelvey), freeHEP (Doug Toussaint), AIPS++ (Dick Crutcher), Amber (Balaji Veeraraghavan), Delphi/Delco codes, parallel sorting
=> No code retuning required (generally) after recompiling with MPI-FM
Solving 2D Navier-Stokes Kernel - Performance of Scalable Systems
[Figure: Gigaflops vs. processors for Origin-DSM, Origin-MPI, NT-MPI, SP2-MPI, T3E-MPI, SPP2000-DSM]
Preconditioned Conjugate Gradient Method with multi-level additive Schwarz Richardson preconditioner (2D 1024x1024)
Danesh Tafti, Rob Pennington, NCSA; Andrew Chien (UIUC, UCSD)
NCSA NT Supercluster Solving Navier-Stokes Kernel
Danesh Tafti, Rob Pennington, Andrew Chien NCSA
[Figures: speedup and Gigaflops vs. processors for NT MPI, Origin MPI, Origin SM, against perfect scaling]
Single processor performance: MIPS R10k 117 MFLOPS, Intel Pentium II 80 MFLOPS
Preconditioned Conjugate Gradient Method with multi-level additive Schwarz Richardson preconditioner (2D 1024x1024)
Solving 2D Navier-Stokes Kernel (cont.)
[Figure: Gigaflops vs. number of processors: SGI O2K vs. x86 NT]
Preconditioned Conjugate Gradient Method with multi-level additive Schwarz Richardson preconditioner (2D 4094x4094)
Danesh Tafti, Rob Pennington, NCSA; Andrew Chien (UIUC, UCSD)
• Excellent Scaling to 128P, Single Precision ~25% faster
Near Perfect Scaling of Cactus - 3D Dynamic Solver for the Einstein GR Equations
[Figure: scaling vs. processors (up to 120): Origin vs. NT SC]
Ratio of GFLOPs: Origin = 2.5x NT SC
Paul Walker, John Shalf, Rob Pennington, Andrew Chien NCSA
Cactus was developed by Paul Walker; MPI-Potsdam, UIUC, NCSA
Quantum Monte Carlo Origin and HPVM Cluster
[Figure: GFLOPS vs. processors (up to 120) for the Origin and the HPVM cluster]
T. Torelli (UIUC CS), L. Mitas (NCSA, Alliance Nanomaterials Team)
Origin is about 1.7x Faster than NT SC
Supercomputer Performance Characteristics
• Compute/communicate and compute/latency ratios
• Clusters can provide comparable characteristics at a dramatically lower system cost
System                  Mflops/Proc   Flops/Byte   Flops/Network RT
Cray T3E                1200          ~2           ~2,500
SGI Origin2000          500           ~0.5         ~1,000
HPVM NT Supercluster    300           ~3.2         ~6,000
Berkeley NOW II         100           ~3.2         ~2,000
IBM SP2                 550           ~3.7         ~38,000
Beowulf (100 Mbit)      300           ~25          ~500,000
HPVM today: HPVM 1.9
[Figure: HPVM 1.9 stack: MPI, SHMEM, Global Arrays, BSP over Fast Messages, on Myrinet or VIA and shared memory (SMP)]
• Added support for:
» shared memory
» VIA interconnect
• New API:
» BSP
Show me the numbers!
• Basics
» Myrinet
– FM: 100+ MB/s, 8.6 µs latency
– MPI: 91 MB/s @ 64K, 9.6 µs latency
• approximately 10% overhead
» Giganet
– FM: 81 MB/s, 14.7 µs latency
– MPI: 77 MB/s, 18.6 µs latency
• 5% BW overhead, 26% latency!
» Shared memory transport
– FM: 195 MB/s, 3.13 µs latency
– MPI: 85 MB/s, 5.75 µs latency
Bandwidth Graphs
[Figure: bandwidth (MB/s) vs. message size (0-16 KB) for MPI on VIA, FM on Myrinet, MPI on Myrinet, FM on VIA]
• N1/2 ~ 512 Bytes
• FM bandwidth usually a good indicator of deliverable bandwidth
• High BW attained for small messages
Other HPVM related projects
• Approximately three hundred groups had downloaded HPVM 1.2 at the last count
• Some interesting research projects:
» Low-level support for collective communication, OSU
» FM with multicast (FM-MC), Vrije Universiteit, Amsterdam
» Video server on demand, Univ. of Naples
» Together with AM, U-Net and VMMC, FM has been the inspiration for the VIA industrial standard by Intel, Compaq, IBM
• Latest release of HPVM is available from http://www-csag.ucsd.edu
Current project: a HPVM-based Terabyte Storage Server
• High performance parallel architectures increasingly associated with data-intensive applications:
» NPACI large dataset applications requiring 100's of GB:
– Digital Sky Survey, brain wave analysis
» digital data repositories, web indexing, multimedia servers:
– Microsoft TerraServer, AltaVista, RealPlayer/Windows Media servers (Audionet, CNN), streamed audio/video
» genomic and proteomic research:
– large centralized data banks (GenBank, SwissProt, PDB, ...)
• Commercial terabyte systems (StorageTek, EMC) have price tags in the M$ range
The HPVM approach to a Terabyte Storage Server
• Exploit commodity PC technologies to build a large (2 TB) and smart (50 Gflops) storage server
» benefits: inexpensive PC disks, modern I/O bus
• The cluster advantage:
» 10 µs communication latency vs. 10 ms disk access latency provides opportunity for data declustering, redistribution, aggregation of I/O bandwidth
» distributed buffering, data processing capability
» scalable architecture
• Integration issues:
» efficient data declustering, I/O bus bandwidth allocation, remote/local programming interface, external connectivity
Global Picture
[Figure: HPVM cluster (Myrinet) at the Dept. of CSE, UCSD, connected to the San Diego Supercomputer Center by a 1 GB/s link]
• 1 GB/s link between the two sites
» 8 parallel Gigabit Ethernet connections
» Ethernet cards installed in some of the nodes on each machine
The Hardware Highlights
• Main features:
» 1.6 TB = 64 * 25 GB disks = $30K (UltraATA disks)
» 1 GB/s of aggregate I/O bw (= 64 disks * 15 MB/s)
» 45 GB RAM, 48 Gflop/s
» 2.4 Gb/s Myrinet network
• Challenges:
» make the aggregate I/O bandwidth available to applications
» balance I/O load across nodes/disks
» transport of TBs of data in and out of the cluster
The Software Components
[Figure: software stack: SRB and Panda / MPI, Put/Get, Global Arrays / Fast Messages / Myrinet]
• Storage Resource Broker (SRB) used for interoperability with existing NPACI applications at SDSC
• Parallel I/O library (e.g. Panda, MPI-IO) to provide high performance I/O to code running on the cluster
• The HPVM suite provides support for fast communication and standard APIs on the NT cluster
Related Work
• User-level fast networking:
» VIA list: AM (Fast Sockets) [Culler92, Rodrigues97], U-Net (U-Net/MM) [Eicken95, Welsh97], VMMC-2 [Li97]
» RWCP PM [Tezuka96], BIP [Prylli97]
• High-performance cluster-based storage:
» UC Berkeley Tertiary Disks [Talagala98]
» CMU network-attached devices [Gibson97], UCSB Active Disks [Acharya98]
» UCLA Randomized I/O (RIO) server [Fabbrocino98]
» UC Berkeley River system (Arpaci-Dusseau, unpub.)
» ANL ROMIO and RIO projects (Foster, Gropp)
Conclusions
• HPVM provides all the necessary tools to transform a PC cluster into a production supercomputer
• Projects like HPVM demonstrate:
» the level of maturity achieved so far by cluster technology with respect to conventional HPC utilization
» a springboard for further research on new uses of the technology
• Efficient component integration at several levels is key to performance:
» tight coupling of the host and NIC is crucial to minimize communication overhead
» software layering on top of FM has exposed the need for a client-conscious design at the interface between layers
Future Work
• Moving toward a more dynamic model of computation:
» dynamic process creation, interaction between computations
» communication group management
» long term targets are dynamic communication, support for adaptive applications
• Wide-area computing:
» integration within computational grid infrastructure
» LAN/WAN bridges, remote cluster connectivity
• Cluster applications:
» enhanced-functionality storage, scalable multimedia servers
• Semi-regular network topologies