Transcript
Page 1: High performance cluster technology: the HPVM experience

August 22, 2000

Summer Institute on Advanced Computation

Wright State University - August 20-23, 2000


High performance cluster technology: the HPVM experience

Mario Lauria, Dept. of Computer and Information Science

The Ohio State University

Page 2: High performance cluster technology: the HPVM experience


Thank You!

• My thanks to the organizers of SAIC 2000 for the invitation

• It is an honor and privilege to be here today

Page 3: High performance cluster technology: the HPVM experience


Acknowledgements

• HPVM is a project of the Concurrent Systems Architecture Group (CSAG), formerly of the UIUC Dept. of Computer Science, now of the UCSD Dept. of Computer Science & Engineering

» Andrew Chien (Faculty)

» Phil Papadopoulos (Research Faculty)

» Greg Bruno, Mason Katz, Caroline Papadopoulos (Research Staff)

» Scott Pakin, Louis Giannini, Kay Connelly, Matt Buchanan, Sudha Krishnamurthy, Geetanjali Sampemane, Luis Rivera, Oolan Zimmer, Xin Liu, Ju Wang (Graduate Students)

• NT Supercluster: collaboration with the NCSA Leading Edge Site

» Robert Pennington (Technical Program Manager)

» Mike Showerman, Qian Liu (Systems Programmers)

» Qian Liu, Avneesh Pant (Systems Engineers)

Page 4: High performance cluster technology: the HPVM experience


Outline

• The software/hardware interface (FM 1.1)

• The layer-to-layer interface (MPI-FM and FM 2.0)

• A production-grade cluster (NT Supercluster)

• Current status and projects (Storage Server)

Page 5: High performance cluster technology: the HPVM experience


Motivation for cluster technology

• Killer micros: low-cost Gigaflop processors are here for a few kilo-$$ per processor

• Killer networks: Gigabit network hardware and high performance software (e.g. Fast Messages), soon at 100's of $$ per connection

• Leverage commodity HW and SW (Windows NT), build key technologies

» high performance computing in a RICH and ESTABLISHED software environment

Gigabit/sec networks: Myrinet, SCI, FC-AL, Giganet, Gigabit Ethernet, ATM

Page 6: High performance cluster technology: the HPVM experience


Ideal Model: HPVM’s

• HPVM = High Performance Virtual Machine

• Provides a simple uniform programming model; abstracts and encapsulates underlying resource complexity

• Simplifies use of complex resources

[Diagram: the application program sees a uniform "Virtual Machine Interface" that hides the actual system configuration]

Page 7: High performance cluster technology: the HPVM experience


HPVM = Cluster Supercomputers

• High Performance Virtual Machine (HPVM)

» Standard APIs hiding network topology and non-standard communication software

• Turnkey supercomputing clusters

» high performance communication, convenient use, coordinated resource management

• Windows NT and Linux; provides front-end queueing & management (LSF integrated)

[Diagram: HPVM 1.0 software stack (released Aug 19, 1997): MPI, Put/Get, Global Arrays, and PGI HPF layered over Fast Messages on Myrinet and sockets]

Page 8: High performance cluster technology: the HPVM experience


Motivation for a new communication software

• “Killer networks” have arrived ...

» Gigabit links, moderate cost (dropping fast), low latency routers

• ... however, network software only delivers network performance for large messages.

[Chart: delivered bandwidth (MB/s) vs. message size (bytes) for a 1 Gbit/s network (Ethernet, Myrinet); ~125 µs overhead, N1/2 = 15 KB]

Page 9: High performance cluster technology: the HPVM experience


Motivation (cont.)

• Problem: most messages are small

Message size studies:

» < 576 bytes [Gusella90]

» 86-99% < 200 B [Kay & Pasquale]

» 300-400 B average size [U. Buffalo monitors]

• => Most messages/applications see little performance improvement. Overhead is the key (LogP; Culler et al. studies); see the worked example below.

• Communication is an enabling technology; how to fulfill its promise?
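To see why overhead dominates at these sizes, here is a back-of-the-envelope calculation (my own arithmetic, using the 1 Gbit/s link and ~125 µs overhead figures from the previous slide; the formula is the usual fixed-overhead cost model, not anything HPVM-specific):

    r(n) = n / (o + n/B)        effective bandwidth for an n-byte message,
                                with per-message overhead o and link bandwidth B

    For o = 125 µs and B = 125 MB/s (1 Gbit/s):
      n = 200 B (a typical message):  r ≈ 200 B / 126.6 µs ≈ 1.6 MB/s, about 1% of the link
      N1/2 = o × B ≈ 125 µs × 125 MB/s ≈ 15.6 KB   (message size at which half of peak is reached)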

Page 10: High performance cluster technology: the HPVM experience


Fast Messages Project Goals

• Explore network architecture issues to enable delivery of underlying hardware performance (bandwidth, latency)

• Delivering performance means:

» considering realistic packet size distributions

» measuring performance at the application level

• Approach:

» minimize communication overhead

» hardware/software, multilayer integrated approach

Page 11: High performance cluster technology: the HPVM experience


Getting performance is hard!

[Chart: bandwidth (MB/s) vs. message size (16-512 bytes); series: Theoretical Peak, Link Mgmt]

• Slow Myrinet NIC processor (~5 MIPS)

• Early I/O bus (Sun's SBus) not optimized for small transfers

» 24 MB/s bandwidth with PIO, 45 MB/s with DMA

Page 12: High performance cluster technology: the HPVM experience


Simple Buffering and Flow Control

• Dramatically simplified buffering scheme, yet still performance critical

• Basic buffering + flow control can be implemented at acceptable cost (see the sketch below)

• Integration between NIC and host is critical to providing services efficiently

» critical issues: division of labor, bus management, NIC-host interaction
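As an illustration of the kind of flow control meant here, below is a minimal sketch of sender-side, credit-based flow control: the receiver preposts a fixed number of buffers per sender, and the sender transmits only while it holds credits. All names and the dummy "NIC" calls are hypothetical; FM's real scheme is integrated with the Myrinet NIC firmware and differs in detail.

    /* Minimal sketch (not the actual FM code) of credit-based flow control:
     * the receiver preposts CREDITS buffers per sender, and the sender only
     * transmits while it holds credits for that destination. */

    #include <assert.h>
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    #define MAX_NODES   4
    #define CREDITS     16           /* receive buffers preposted per peer        */
    #define PACKET_SIZE 256          /* fixed-size packets keep buffering simple  */

    static int credits[MAX_NODES];   /* packets we may still send to each peer    */

    /* Dummy NIC send: a real implementation would enqueue to the Myrinet NIC.    */
    static void send_to_nic(int dest, const void *pkt, int len)
    {
        (void)pkt;
        printf("sent %d bytes to node %d\n", len, dest);
    }

    /* Dummy credit return: a real implementation would drain ack packets that
     * carry credits back from receivers; here we simulate one buffer freed.      */
    static void poll_credit_returns(int dest)
    {
        credits[dest]++;
    }

    static void fc_init(void)
    {
        for (int i = 0; i < MAX_NODES; i++)
            credits[i] = CREDITS;
    }

    /* Send one packet, blocking until a credit (a remote buffer) is available.   */
    static void fc_send(int dest, const void *data, int len)
    {
        uint8_t pkt[PACKET_SIZE];

        assert(len <= PACKET_SIZE);
        while (credits[dest] == 0)   /* no guaranteed remote buffer: wait         */
            poll_credit_returns(dest);

        memcpy(pkt, data, (size_t)len);   /* packetization; real FM avoids copies */
        send_to_nic(dest, pkt, len);
        credits[dest]--;             /* one remote buffer is now in use           */
    }

    int main(void)
    {
        fc_init();
        const char msg[] = "hello";
        for (int i = 0; i < 20; i++)      /* more sends than credits forces waits */
            fc_send(1, msg, sizeof msg);
        return 0;
    }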

[Chart: bandwidth (MB/s) vs. message size (16-512 bytes); series: PIO, Buffer Mgmt, Flow Control]

Page 13: High performance cluster technology: the HPVM experience


FM 1.x Performance (6/95)

• Latency 14 µs, peak bandwidth 21.4 MB/s [Pakin, Lauria et al., Supercomputing '95]

• Hardware limits PIO performance, but N1/2 = 54 bytes

• Delivers 17.5 MB/s at 128-byte messages (140 Mbit/s, more than OC-3 ATM can deliver)

[Chart: bandwidth (MB/s) vs. message size (16-2048 bytes); series: FM, 1 Gb Ethernet]

Page 14: High performance cluster technology: the HPVM experience


Illinois Fast Messages 1.x

• API: Berkeley Active Messages

» Key distinctions: guarantees (reliable, in-order delivery, flow control), network-processor decoupling (DMA region)

• Focus on short-packet performance

» Programmed I/O (PIO) instead of DMA

» Simple buffering and flow control

» user-space communication

Sender:   FM_send(NodeID, Handler, Buffer, size);   // handlers are remote procedures

Receiver: FM_extract();
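For concreteness, here is a sketch of the usage pattern implied by the API above: the sender names a remote handler, and the receiver runs pending handlers by calling FM_extract(). Only FM_send and FM_extract come from the slide; the handler signature, the fm.h header, and the FM_initialize/FM_nodeid/FM_numnodes helpers are assumptions made for illustration.

    /* Sketch of the FM 1.x usage pattern: a handler is a remote procedure run at
     * the receiver for each arriving message. Declarations assumed in "fm.h". */

    #include <stdio.h>
    #include "fm.h"                       /* hypothetical FM header */

    /* Handler invoked at the receiving node for each arriving message
     * (signature assumed for illustration). */
    static void ping_handler(void *buf, int size)
    {
        printf("received %d bytes: %s\n", size, (char *)buf);
    }

    int main(void)
    {
        FM_initialize();                  /* assumed initialization call */

        char msg[] = "ping";
        int peer = (FM_nodeid() + 1) % FM_numnodes();   /* assumed helpers */

        /* Send: names the destination node, the remote handler, and the data. */
        FM_send(peer, ping_handler, msg, sizeof msg);

        /* Receive: drain the network and run handlers for arrived messages.   */
        for (;;)
            FM_extract();
    }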

Page 15: High performance cluster technology: the HPVM experience


The FM layering efficiency issue

• How good is the FM 1.1 API?

• Test: build a user-level library on top of it and measure the available performance

» MPI chosen as a representative user-level library

» port of MPICH (ANL/MSU) to FM

• Purpose: to study what services are important in layering communication libraries

» integration issues: what kind of inefficiencies arise at the interface, and what is needed to reduce them [Lauria & Chien, JPDC 1997]

Page 16: High performance cluster technology: the HPVM experience


MPI on FM 1.x

• First implementation of MPI on FM was ready in Fall 1995

• Disappointing performance: only a fraction of FM bandwidth was available to MPI applications

[Chart: bandwidth (MB/s) vs. message size (16-2048 bytes); series: FM, MPI-FM]

Page 17: High performance cluster technology: the HPVM experience


MPI-FM Efficiency

• Result: FM fast, but its interface not efficient

[Chart: MPI-FM efficiency (%) vs. message size (16-2048 bytes), relative to raw FM bandwidth]

Page 18: High performance cluster technology: the HPVM experience


MPI-FM layering inefficiencies

[Diagram: MPI layered over FM 1.x; the MPI header is attached to the source buffer on the send side and stripped from the destination buffer on the receive side]

• Too many copies due to header attachment/removal, lack of coordination between transport and application layers

Page 19: High performance cluster technology: the HPVM experience


The new FM 2.x API

• Sending

» FM_begin_message(NodeID, Handler, size), FM_end_message()

» FM_send_piece(stream, buffer, size)   // gather

• Receiving

» FM_receive(buffer, size)              // scatter

» FM_extract(total_bytes)               // receiver flow control

• Implementation based on a lightweight thread for each message received (see the sketch below)
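Here is a sketch of how the gather/scatter calls above let an upper layer such as MPI-FM avoid copies: the sender streams a small header and the user payload as separate pieces, and the receive handler pulls the header first, finds the matching destination buffer, then receives the payload directly into it. Only the call names and argument lists shown on the slide are from the talk; the FM_stream handle returned by FM_begin_message, the handler signature, the header layout, and lookup_receive_buffer() are assumptions for illustration.

    /* Sketch of the FM 2.x gather/scatter pattern, as an MPI-like layer might
     * use it. Declarations assumed in a hypothetical "fm.h".                   */

    #include "fm.h"

    typedef struct {
        int tag;                               /* MPI-like matching information  */
        int payload_bytes;
    } msg_header_t;

    void msg_handler(int total_bytes);         /* defined below                  */

    /* Sender: gather the header and the user buffer into one message, no copy. */
    void send_user_msg(int dest, int tag, void *buf, int nbytes)
    {
        msg_header_t hdr = { tag, nbytes };
        FM_stream *s = FM_begin_message(dest, msg_handler, sizeof hdr + nbytes);
        FM_send_piece(s, &hdr, sizeof hdr);    /* piece 1: header                */
        FM_send_piece(s, buf, nbytes);         /* piece 2: payload, straight     */
        FM_end_message(s);                     /* from the user's buffer         */
    }

    /* Receiver: the handler runs (in a lightweight thread) for each message and
     * scatters the pieces to their final destinations as it reads them.        */
    void msg_handler(int total_bytes)
    {
        msg_header_t hdr;
        FM_receive(&hdr, sizeof hdr);          /* header into library state      */

        void *dst = lookup_receive_buffer(hdr.tag);  /* e.g. a posted MPI recv   */
        FM_receive(dst, hdr.payload_bytes);    /* payload directly into the      */
    }                                          /* application buffer: no copy    */

    /* Progress: the main loop calls FM_extract(total_bytes) to process          */
    /* whatever has arrived.                                                     */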

Page 20: High performance cluster technology: the HPVM experience


MPI-FM 2.x improved layering

• Gather-scatter interface + handler multithreading enables efficient layering, data manipulation without copies

[Diagram: MPI layered over FM 2.x; the header and the source buffer are streamed as separate pieces and delivered directly into the MPI header structures and the destination buffer]

Page 21: High performance cluster technology: the HPVM experience


MPI on FM 2.x

• MPI-FM: 91 MB/s, 13 µs latency, ~4 µs overhead

» Short messages much better than IBM SP2; PCI limited

» Latency ~ SGI O2K

[Chart: bandwidth (MB/s) vs. message size (4 bytes to 64 KB); series: FM, MPI-FM]

Page 22: High performance cluster technology: the HPVM experience


MPI-FM 2.x Efficiency

• High transfer efficiency, approaches 100% [Lauria, Pakin et al., HPDC-7 '98]

• Other systems much lower even at 1 KB (100 Mbit: 40%, 1 Gbit: 5%)

[Chart: MPI-FM efficiency (%) vs. message size (4 bytes to 64 KB), relative to raw FM bandwidth]

Page 23: High performance cluster technology: the HPVM experience


MPI-FM at work: the NCSA NT Supercluster

• 192 Pentium II processors, April 1998, 77 Gflops

» 3-level fat tree (large switches), scalable bandwidth, modular extensibility

• 256 Pentium II and III processors, June 1999, 110 Gflops (UIUC), with NCSA

• 512 Merced processors, early 2001, Teraflop performance (at NCSA)

[Photos: the 77 GF system (April 1998) and the 110 GF system (June 1999)]

Page 24: High performance cluster technology: the HPVM experience


The NT Supercluster at NCSA: 192 Hewlett-Packard nodes (300 MHz) + 64 Compaq nodes (333 MHz)

• Andrew Chien, CS UIUC --> UCSD

• Rob Pennington, NCSA

• Myrinet network, HPVM, Fast Messages

• Microsoft NT OS, MPI API, etc.

Page 25: High performance cluster technology: the HPVM experience


HPVM III

Page 26: High performance cluster technology: the HPVM experience


MPI applications on the NT Supercluster

• Zeus-MP (192P, Mike Norman)

• ISIS++ (192P, Robert Clay)

• ASPCG (128P, Danesh Tafti)

• Cactus (128P, Paul Walker/John Shalf/Ed Seidel)

• QMC (128P, Lubos Mitas)

• Boeing CFD test codes (128P, David Levine)

• Others (no graphs):

» SPRNG (Ashok Srinivasan), GAMESS, MOPAC (John McKelvey), freeHEP (Doug Toussaint), AIPS++ (Dick Crutcher), Amber (Balaji Veeraraghavan), Delphi/Delco codes, parallel sorting

=> No code retuning required (generally) after recompiling with MPI-FM

Page 27: High performance cluster technology: the HPVM experience


[Chart: Gigaflops vs. number of processors (up to ~64); series: Origin-DSM, Origin-MPI, NT-MPI, SP2-MPI, T3E-MPI, SPP2000-DSM]

Solving 2D Navier-Stokes Kernel - Performance of Scalable Systems

Preconditioned Conjugate Gradient Method With Multi-level Additive Schwarz Richardson Pre-conditioner (2D 1024x1024)

Danesh Tafti, Rob Pennington, NCSA; Andrew Chien (UIUC, UCSD)

Page 28: High performance cluster technology: the HPVM experience


NCSA NT Supercluster Solving Navier-Stokes Kernel

Danesh Tafti, Rob Pennington, Andrew Chien NCSA

[Charts: speedup vs. processors and Gigaflops vs. processors (up to ~64 processors); series: NT MPI, Origin MPI, Origin SM, and perfect speedup]

Single processor performance: MIPS R10k 117 MFLOPS, Intel Pentium II 80 MFLOPS

Preconditioned Conjugate Gradient Method With Multi-level Additive Schwarz Richardson Pre-conditioner

(2D 1024x1024)

Page 29: High performance cluster technology: the HPVM experience


[Chart: Gigaflops vs. number of processors; series: SGI O2K, x86 NT]

Solving 2D Navier-Stokes Kernel (cont.)

Preconditioned Conjugate Gradient Method With Multi-level Additive Schwarz Richardson Pre-conditioner (2D 4094x4094)

Danesh Tafti, Rob Pennington, NCSA; Andrew Chien (UIUC, UCSD)

• Excellent Scaling to 128P, Single Precision ~25% faster

Page 30: High performance cluster technology: the HPVM experience


Near Perfect Scaling of Cactus - 3D Dynamic Solver for the Einstein GR Equations

[Chart: scaling vs. number of processors (up to 120); series: Origin, NT SC]

Ratio of GFLOPS: Origin = 2.5x NT SC

Paul Walker, John Shalf, Rob Pennington, Andrew Chien NCSA

Cactus was developed by Paul Walker, MPI-Potsdam, UIUC, NCSA

Page 31: High performance cluster technology: the HPVM experience


Quantum Monte Carlo Origin and HPVM Cluster

[Chart: GFLOPS vs. number of processors (up to 120) for the Origin and the HPVM NT cluster]

T. Torelli (UIUC CS), L. Mitas (NCSA, Alliance Nanomaterials Team)

Origin is about 1.7x Faster than NT SC

Page 32: High performance cluster technology: the HPVM experience


Supercomputer Performance Characteristics

• Compute/communicate and compute/latency ratios

• Clusters can provide programmable characteristics at a dramatically lower system cost

                       Mflops/Proc   Flops/Byte   Flops/NetworkRT
Cray T3E               1200          ~2           ~2,500
SGI Origin2000         500           ~0.5         ~1,000
HPVM NT Supercluster   300           ~3.2         ~6,000
Berkeley NOW II        100           ~3.2         ~2,000
IBM SP2                550           ~3.7         ~38,000
Beowulf (100 Mbit)     300           ~25          ~500,000
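As a rough indication of how such ratios arise (my own arithmetic, not from the slide), the two communication columns can be read as peak compute rate divided by network bandwidth, and peak compute rate multiplied by network round-trip time. For the HPVM NT Supercluster row, using the MPI-FM figures quoted elsewhere in this talk (~91 MB/s, ~10 µs one-way latency):

    Flops/Byte      ≈ (Mflops per processor) / (network MB/s per processor)
                    ≈ 300 Mflops / ~94 MB/s  ≈ 3.2

    Flops/NetworkRT ≈ (Mflops per processor) × (network round-trip time)
                    ≈ 300 Mflops × ~20 µs    ≈ 6,000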

Page 33: High performance cluster technology: the HPVM experience


HPVM today: HPVM 1.9

[Diagram: HPVM 1.9 software stack: MPI, SHMEM, Global Arrays, and BSP layered over Fast Messages, running on Myrinet or VIA and on shared memory (SMP)]

• Added support for:

» shared memory

» VIA interconnect

• New API: BSP (see the sketch below)
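HPVM's BSP layer is only named on this slide; as a reminder of what the BSP programming style looks like, below is a minimal sketch using the standard BSPlib primitives (bsp_begin, bsp_put, bsp_sync). Whether HPVM exposes exactly the BSPlib interface is not stated in the talk, so treat the calls as illustrative.

    /* Minimal BSP-style sketch using standard BSPlib primitives: each process
     * writes a value into its neighbor's registered variable with a one-sided
     * put, which becomes visible after the superstep barrier (bsp_sync).      */

    #include <stdio.h>
    #include "bsp.h"                      /* BSPlib header (e.g. Oxford BSP toolset) */

    int main(void)
    {
        bsp_begin(bsp_nprocs());          /* start one process per available node  */

        int p    = bsp_pid();
        int np   = bsp_nprocs();
        int val  = p * p;                 /* some local value to exchange          */
        int recv = -1;

        bsp_push_reg(&recv, sizeof recv); /* make 'recv' remotely writable         */
        bsp_sync();                       /* registration takes effect next step   */

        /* One-sided put: write our value into the next processor's 'recv'.        */
        bsp_put((p + 1) % np, &val, &recv, 0, sizeof val);
        bsp_sync();                       /* superstep barrier: puts are delivered */

        printf("proc %d received %d\n", p, recv);

        bsp_pop_reg(&recv);
        bsp_end();
        return 0;
    }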

Page 34: High performance cluster technology: the HPVM experience


Show me the numbers!

• Basics

» Myrinet

– FM: 100+ MB/s, 8.6 µs latency

– MPI: 91 MB/s @ 64 KB, 9.6 µs latency

• Approximately 10% overhead

» Giganet

– FM: 81 MB/s, 14.7 µs latency

– MPI: 77 MB/s, 18.6 µs latency

• 5% BW overhead, 26% latency!

» Shared memory transport

– FM: 195 MB/s, 3.13 µs latency

– MPI: 85 MB/s, 5.75 µs latency

Page 35: High performance cluster technology: the HPVM experience


Bandwidth Graphs

[Chart: bandwidth (MB/s) vs. message size (0-16384 bytes); series: MPI on VIA, FM on Myrinet, MPI on Myrinet, FM on VIA]

• N1/2 ~ 512 Bytes

• FM bandwidth usually a good indicator of deliverable bandwidth

• High BW attained for small messages

Page 36: High performance cluster technology: the HPVM experience


Other HPVM related projects

• Approximately three hundred groups have downloaded HPVM 1.2 at the last count

• Some interesting research projects:

» Low-level support for collective communication, OSU

» FM with multicast (FM-MC), Vrije Universiteit, Amsterdam

» Video-on-demand server, Univ. of Naples

» Together with AM, U-Net and VMMC, FM has been an inspiration for the VIA industrial standard by Intel, Compaq, IBM

• Latest release of HPVM is available from http://www-csag.ucsd.edu

Page 37: High performance cluster technology: the HPVM experience


Current project: a HPVM-based Terabyte Storage Server

• High performance parallel architectures are increasingly associated with data-intensive applications:

» NPACI large-dataset applications requiring 100's of GB:

– Digital Sky Survey, brain wave analysis

» digital data repositories, web indexing, multimedia servers:

– Microsoft TerraServer, AltaVista, RealPlayer/Windows Media servers (Audionet, CNN), streamed audio/video

» genomic and proteomic research:

– large centralized data banks (GenBank, SwissProt, PDB, ...)

• Commercial terabyte systems (StorageTek, EMC) have price tags in the M$ range

Page 38: High performance cluster technology: the HPVM experience


The HPVM approach to a Terabyte Storage Server

• Exploit commodity PC technologies to build a large (2 TB) and smart (50 Gflops) storage server

» benefits: inexpensive PC disks, modern I/O bus

• The cluster advantage:

» 10 µs communication latency vs 10 ms disk access latency provides the opportunity for data declustering, redistribution, and aggregation of I/O bandwidth (see the striping sketch below)

» distributed buffering, data processing capability

» scalable architecture

• Integration issues:

» efficient data declustering, I/O bus bandwidth allocation, remote/local programming interface, external connectivity
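To make "data declustering" concrete, the sketch below shows one common placement scheme: round-robin striping of file blocks across server nodes, so that a large sequential read draws on every node's disk at once. It illustrates the general technique only; the server's actual placement policy is not described in the talk, and all names and sizes here are my own.

    /* Round-robin declustering sketch: map a logical block of a large file onto
     * (server node, local block) so that sequential reads are spread across all
     * nodes and their disks, aggregating I/O bandwidth.                         */

    #include <stdio.h>
    #include <stdint.h>

    #define NUM_NODES   64            /* cluster nodes, one local disk each       */
    #define BLOCK_SIZE  (256 * 1024)  /* striping unit in bytes                   */

    typedef struct {
        int      node;                /* which server node holds the block        */
        uint64_t local_block;         /* block index within that node's disk file */
    } placement_t;

    static placement_t locate(uint64_t file_offset)
    {
        uint64_t block = file_offset / BLOCK_SIZE;
        placement_t p = { (int)(block % NUM_NODES), block / NUM_NODES };
        return p;
    }

    int main(void)
    {
        /* A 16 MB sequential read touches 64 blocks, one per node, so in the
         * ideal case every disk contributes its ~15 MB/s at the same time.      */
        for (uint64_t off = 0; off < 16ULL * 1024 * 1024; off += BLOCK_SIZE) {
            placement_t p = locate(off);
            printf("offset %10llu -> node %2d, local block %llu\n",
                   (unsigned long long)off, p.node,
                   (unsigned long long)p.local_block);
        }
        return 0;
    }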

Page 39: High performance cluster technology: the HPVM experience


Global Picture

[Diagram: the HPVM cluster (Myrinet) in the Dept. of CSE, UCSD, connected to the San Diego Supercomputer Center by a 1 GB/s link]

• 1 GB/s link between the two sites

» 8 parallel Gigabit Ethernet connections

» Ethernet cards installed in some of the nodes of each machine

Page 40: High performance cluster technology: the HPVM experience


The Hardware Highlights

• Main features:

» 1.6 TB = 64 × 25 GB disks = $30K (UltraATA disks)

» 1 GB/s of aggregate I/O bandwidth (= 64 disks × 15 MB/s)

» 45 GB RAM, 48 Gflop/s

» 2.4 Gb/s Myrinet network

• Challenges:

» make the aggregate I/O bandwidth available to applications

» balance I/O load across nodes/disks

» transport of TBs of data into and out of the cluster

Page 41: High performance cluster technology: the HPVM experience


The Software Components

[Diagram: storage server software stack: SRB and a parallel I/O library (Panda) on top of MPI, Put/Get and Global Arrays, layered over Fast Messages on Myrinet]

• Storage Resource Broker (SRB): used for interoperability with existing NPACI applications at SDSC

• Parallel I/O library (e.g. Panda, MPI-IO): provides high performance I/O to code running on the cluster (see the MPI-IO sketch below)

• The HPVM suite provides support for fast communication and standard APIs on the NT cluster
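Since MPI-IO is named as one candidate parallel I/O interface, here is a minimal sketch of the kind of collective parallel write such a library provides. This is plain, standard MPI-IO; the file name and sizes are arbitrary, and the code is not taken from the HPVM storage server.

    /* Minimal standard MPI-IO sketch: every rank writes its own contiguous slice
     * of a shared file with a single collective call, letting the I/O library
     * aggregate and stripe the requests across the storage nodes.              */

    #include <mpi.h>
    #include <stdlib.h>

    #define BLOCK_DOUBLES (1 << 20)               /* 8 MB of doubles per rank */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *buf = malloc(BLOCK_DOUBLES * sizeof *buf);
        for (int i = 0; i < BLOCK_DOUBLES; i++)
            buf[i] = rank;                        /* dummy data               */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "stripe.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes at its own offset; the collective form gives the  */
        /* library the whole access pattern at once.                          */
        MPI_Offset offset = (MPI_Offset)rank * BLOCK_DOUBLES * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, BLOCK_DOUBLES, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }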

Page 42: High performance cluster technology: the HPVM experience


Related Work

• User-level fast networking:

» VIA list: AM (Fast Sockets) [Culler92, Rodrigues97], U-Net (U-Net/MM) [Eicken95, Welsh97], VMMC-2 [Li97]

» RWCP PM [Tezuka96], BIP [Prylli97]

• High-performance cluster-based storage:

» UC Berkeley Tertiary Disks [Talagala98]

» CMU Network-Attached Devices [Gibson97], UCSB Active Disks [Acharya98]

» UCLA Randomized I/O (RIO) server [Fabbrocino98]

» UC Berkeley River system (Arpaci-Dusseau, unpublished)

» ANL ROMIO and RIO projects (Foster, Gropp)

Page 43: High performance cluster technology: the HPVM experience


Conclusions

• HPVM provides all the necessary tools to transform a PC cluster into a production supercomputer

• Projects like HPVM demonstrate:

» the level of maturity achieved so far by cluster technology with respect to conventional HPC utilization

» a springboard for further research on new uses of the technology

• Efficient component integration at several levels is key to performance:

» tight coupling of the host and NIC is crucial to minimize communication overhead

» software layering on top of FM has exposed the need for a client-conscious design at the interface between layers

Page 44: High performance cluster technology: the HPVM experience


Future Work

• Moving toward a more dynamic model of computation:

» dynamic process creation, interaction between computations

» communication group management

» long-term targets are dynamic communication and support for adaptive applications

• Wide-area computing:

» integration within computational grid infrastructure

» LAN/WAN bridges, remote cluster connectivity

• Cluster applications:

» enhanced-functionality storage, scalable multimedia servers

• Semi-regular network topologies

