CS160 – Lecture 3: Clusters. Introduction to PVM and MPI

Page 1: CS160 – Lecture 3

CS160 – Lecture 3

Clusters. Introduction to PVM and MPI

Page 2: CS160 – Lecture 3

Introduction to PC Clusters

• What are PC Clusters?

• How are they put together?

• Examining the lowest level messaging pipeline

• Relative application performance

• Starting with PVM and MPI

Page 3: CS160 – Lecture 3

Clusters, Beowulfs, and more

• How do you put a “Pile-of-PCs” into a room and make them do real work?
– Interconnection technologies
– Programming them
– Monitoring
– Starting and running applications
– Running at scale

Page 4: CS160 – Lecture 3

Beowulf Cluster

• Current working definition: a collection of commodity PCs running an open-source operating system with a commodity interconnection network
– Dual Intel PIIIs with fast Ethernet, Linux
– Single Alpha PCs running Linux

• Program with PVM, MPI, …

Page 5: CS160 – Lecture 3

Beowulf Clusters cont’d

• Interconnection network is usually fast Ethernet running TCP/IP
– (Relatively) slow network
– Programming model is message passing

• Most people now associate the name “Beowulf” with any cluster of PCs
– Beowulfs are differentiated from high-performance clusters by the network

• www.beowulf.org has lots of information

Page 6: CS160 – Lecture 3

High-Performance Clusters

• Killer micros: low-cost gigaflop processors here for a few kilo-$$s per processor

• Killer networks: gigabit network hardware, high-performance software (e.g. Fast Messages), soon at 100s of $$ per connection

• Leverage HW, commodity SW (*nix/Windows NT), build key technologies => high-performance computing in a RICH software environment

• Gigabit networks: Myrinet, SCI, FC-AL, Giganet, GigE, ATM

Page 7: CS160 – Lecture 3

Cluster Research Groups

• Many other cluster groups have had impact:
– Active Messages / Network of Workstations (NOW), UC Berkeley
– Basic Interface for Parallelism (BIP), Univ. of Lyon
– Fast Messages (FM) / High Performance Virtual Machines (HPVM), UIUC/UCSD
– Real World Computing Partnership, Japan
– Scalable High-performance Really Inexpensive Multi-Processor (SHRIMP), Princeton

Page 8: CS160 – Lecture 3

Clusters are Different

• A pile of PCs is not a large-scale SMP server
– Why? Performance and programming model

• A cluster’s closest cousin is an MPP
– What’s the major difference? Clusters run N copies of the OS; MPPs usually run one.

Page 9: CS160 – Lecture 3

Ideal Model: HPVMs

• HPVM = High Performance Virtual Machine
• Provides a simple uniform programming model; abstracts and encapsulates underlying resource complexity
• Simplifies use of complex resources

[Figure: an Application Program sits on a “Virtual Machine Interface”, which hides the actual system configuration.]

Page 10: CS160 – Lecture 3

Virtualization of Machines

• Want the illusion that a collection of machines is a single machine
– Start, stop, monitor distributed programs
– Programming and debugging should work seamlessly
– PVM (Parallel Virtual Machine) was the first widely-adopted virtualization for parallel computing

• This illusion is only partially complete in any software system. Some issues:
– Node heterogeneity
– Real network topology can lead to contention

• Unrelated – What is a Java Virtual Machine?

Page 11: CS160 – Lecture 3

High-Performance Communication

• Level of network interface support + NIC/network router latency
– Overhead and latency of communication limit deliverable bandwidth

• High-performance communication enables programmability!
– Low-latency, low-overhead, high-bandwidth cluster communication
– … much more is needed …

• Usability issues, I/O, reliability, availability
• Remote process debugging/monitoring

[Figure: switched 100 Mbit networks with OS-mediated access vs. switched multi-gigabit networks with user-level access.]

Page 12: CS160 – Lecture 3

Putting a cluster together

• (16, 32, 64, … X) individual nodes
– E.g. dual-processor Pentium III/733, 1 GB memory, Ethernet

• Scalable high-speed network
– Myrinet, Giganet, Servernet, Gigabit Ethernet

• Message-passing libraries
– TCP, MPI, PVM, VIA

• Multiprocessor job launch
– Portable Batch System
– Load Sharing Facility
– PVM spawn, mpirun, rsh

• Techniques for system management
– VA Linux Cluster Manager (VACM)
– High Performance Technologies Inc (HPTI)

Page 13: CS160 – Lecture 3

Communication style is message passing

[Figure: a message on Machine A is packetized (packets 4 3 2 1), sent across the network, and reassembled on Machine B.]

• How do we efficiently get a message from Machine A to Machine B?

• How do we efficiently break a large message into packets and reassemble at the receiver? (A minimal sketch follows these bullets.)

• How does the receiver differentiate among message fragments (packets) from different senders?
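To make the packetization and demultiplexing questions above concrete, here is a minimal sketch in C of one common answer: tag each packet with a (sender, message id, sequence number, packet count) header and reassemble at the receiver keyed on sender and message id. The header layout, packet size, and function names are illustrative assumptions, not the scheme used by any particular library.

```c
#include <stdint.h>
#include <string.h>

#define PKT_PAYLOAD 1024               /* illustrative fixed payload size per packet */

/* Hypothetical per-packet header: lets the receiver tell fragments apart. */
typedef struct {
    uint32_t sender;    /* which machine sent this packet           */
    uint32_t msg_id;    /* which message the packet belongs to      */
    uint32_t seq;       /* position of this fragment in the message */
    uint32_t total;     /* how many fragments make up the message   */
    uint32_t len;       /* bytes of payload in this fragment        */
} pkt_hdr;

typedef struct {
    pkt_hdr hdr;
    char    payload[PKT_PAYLOAD];
} packet;

/* Sender side: split a message into packets (the "4 3 2 1" stream above). */
int packetize(uint32_t sender, uint32_t msg_id,
              const char *msg, size_t msg_len,
              packet *out, size_t max_pkts)
{
    size_t total = (msg_len + PKT_PAYLOAD - 1) / PKT_PAYLOAD;
    if (total > max_pkts) return -1;
    for (size_t i = 0; i < total; i++) {
        size_t off = i * PKT_PAYLOAD;
        size_t len = msg_len - off < PKT_PAYLOAD ? msg_len - off : PKT_PAYLOAD;
        out[i].hdr = (pkt_hdr){ sender, msg_id, (uint32_t)i,
                                (uint32_t)total, (uint32_t)len };
        memcpy(out[i].payload, msg + off, len);
    }
    return (int)total;
}

/* Receiver side: copy a fragment into the right slot of the right buffer.
 * The (sender, msg_id) pair in the header is what lets the receiver keep
 * interleaved fragments from different senders apart. Returns 1 when the
 * whole message has arrived. */
int reassemble(const packet *p, char *buf, uint32_t *received)
{
    memcpy(buf + (size_t)p->hdr.seq * PKT_PAYLOAD, p->payload, p->hdr.len);
    return ++(*received) == p->hdr.total;
}
```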

Page 14: CS160 – Lecture 3

Will use the details of FM to illustrate some communication engineering

Page 15: CS160 – Lecture 3

FM on Commodity PCs

• Host library: API presentation, flow control, segmentation/reassembly, multithreading

• Device driver: protection, memory mapping, scheduling monitors

• NIC firmware: link management, incoming buffer management, routing, multiplexing/demultiplexing

[Figure: the FM host library runs on the Pentium II/III (~450 MIPS), the FM device driver mediates the P6 bus/PCI path, and the FM firmware runs on the NIC (~33 MIPS, 1280 Mbps link).]

Page 16: CS160 – Lecture 3

Fast Messages 2.x Performance

• Latency 8.8 μs, bandwidth 100+ MB/s, N1/2 ~250 bytes
• Fast in absolute terms (compares to MPPs, internal memory BW)
• Delivers a large fraction of hardware performance for short messages
• Technology transferred into emerging cluster standards: Intel/Compaq/Microsoft’s Virtual Interface Architecture

[Figure: bandwidth (MB/s) vs. message size (4 bytes to 64 KB), rising to 100+ MB/s, with N1/2 marked.]

Page 17: CS160 – Lecture 3

Comments about Performance

• Latency and bandwidth are the most basic measurements of message-passing machines
– Will discuss performance models in detail, because latency and bandwidth do not tell the entire story (a simple model is sketched below)

• High-performance clusters exhibit
– 10X deliverable bandwidth over Ethernet
– 20X–30X improvement in latency
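As a concrete example of such a model (the standard linear cost model, not one stated on these slides), let α be the per-message start-up cost and β the asymptotic bandwidth:

```latex
T(n) = \alpha + \frac{n}{\beta}, \qquad
B(n) = \frac{n}{T(n)} = \frac{n}{\alpha + n/\beta}, \qquad
B(n_{1/2}) = \frac{\beta}{2} \;\Rightarrow\; n_{1/2} = \alpha\beta
```

Here n_{1/2} is the message size at which half of peak bandwidth is delivered, the N1/2 quantity quoted on the FM performance slides; the smaller it is, the better the machine handles short messages.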

Page 18: CS160 – Lecture 3

How does FM really get speed?

• Protected user-level access to network (OS-bypass)

• Efficient credit-based flow control (a minimal sketch follows this list)
– Assumes a reliable hardware network [only OK for system area networks]
– No buffer overruns (stalls sender if no receive space)

• Early demultiplexing of incoming packets
– Multithreading, use of NT user-schedulable threads

• Careful implementation with many tuning cycles
– Overlapping DMAs (recv), programmed I/O send
– No interrupts! Polling only.
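A minimal sketch of credit-based flow control, assuming a fixed pool of receive buffers reserved per sender. All names, the credit counts, and the credit-return mechanism are illustrative assumptions, not FM’s actual implementation.

```c
#include <stddef.h>

#define MAX_NODES    64   /* illustrative cluster size                   */
#define INIT_CREDITS 32   /* assumed receive buffers reserved per sender */

/* Hypothetical lower-level routines provided by the messaging layer. */
void raw_send_packet(int dest, const void *pkt, size_t len);
void send_credit_return(int sender, int n);
void poll_network(void);                /* may deliver credit-return packets */

static int credits[MAX_NODES];          /* packets we may still send to each node */

void flow_init(void)
{
    for (int i = 0; i < MAX_NODES; i++)
        credits[i] = INIT_CREDITS;
}

/* Sender: spend one credit per packet. If none remain, stall and poll until
 * the receiver returns credits -- so its buffer pool can never be overrun. */
void send_packet(int dest, const void *pkt, size_t len)
{
    while (credits[dest] == 0)
        poll_network();
    credits[dest]--;
    raw_send_packet(dest, pkt, len);
}

/* Receiver: once a buffer has been drained and recycled, hand a credit back
 * to the sender (in practice credit returns are batched or piggybacked). */
void buffer_freed(int sender)
{
    send_credit_return(sender, 1);
}

/* Sender, when a credit-return packet arrives from `node`. */
void credit_returned(int node, int n)
{
    credits[node] += n;
}
```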

Page 19: CS160 – Lecture 3

OS-Bypass Background

• Suppose you want to perform a sendto on a standard IP socket (a minimal socket example follows this list)
– The operating system mediates access to the network device

• Must trap into the kernel to ensure authorization on each and every message (very time consuming)

• Message is copied from the user program to kernel packet buffers

• Protocol information about each packet is generated by the OS and attached to a packet buffer

• Message is finally sent out onto the physical device (Ethernet)

• Receiving does the inverse with a recvfrom
– Packet to kernel buffer, OS strips headers, reassembles data, OS mediates authorization, copy into user program
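For reference, this is the conventional kernel-mediated path the slide describes: every sendto/recvfrom below is a system call that traps into the kernel, which copies the data and builds the protocol headers. A minimal UDP sketch; the address and port are placeholders.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    /* One UDP socket; the kernel owns the device and the protocol stack. */
    int s = socket(AF_INET, SOCK_DGRAM, 0);

    struct sockaddr_in peer;
    memset(&peer, 0, sizeof peer);
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(5000);                       /* placeholder port */
    inet_pton(AF_INET, "192.168.0.2", &peer.sin_addr);   /* placeholder IP   */

    /* Each call traps into the kernel, copies the buffer into kernel packet
     * buffers, attaches UDP/IP headers, and queues the packet on the NIC. */
    const char msg[] = "hello";
    sendto(s, msg, sizeof msg, 0, (struct sockaddr *)&peer, sizeof peer);

    /* The inverse path: the kernel receives, strips headers, checks that
     * this socket may see the packet, then copies into the user buffer. */
    char buf[1500];
    recvfrom(s, buf, sizeof buf, 0, NULL, NULL);

    close(s);
    return 0;
}
```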

Page 20: CS160 – Lecture 3

OS-Bypass

• A user program is given a protected slice of the network interface (a conceptual sketch follows this list)
– Authorization is done once (not per message)

• Outgoing packets get directly copied or DMAed to the network interface
– Protocol headers added by the user-level library

• Incoming packets get routed by the network interface card (NIC) into user-defined receive buffers
– The NIC must know how to differentiate incoming packets. This is called early demultiplexing.

• Outgoing and incoming message copies are eliminated
• Traps to the OS kernel are eliminated
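A purely conceptual sketch of the user-level send path just described: the device driver has already mapped a slice of NIC memory into the process (so authorization happens once), after which sending is just building a header and copying the packet into that mapped region. Every name here (nic_send_queue, nic_doorbell, the header layout) is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Assume the driver mapped this NIC send region into our address space at
 * startup (e.g. via mmap on the device); after that, no system calls. */
extern volatile uint8_t  *nic_send_queue;  /* hypothetical mapped NIC memory */
extern volatile uint32_t *nic_doorbell;    /* hypothetical "go" register     */

typedef struct {
    uint16_t dest_node;   /* routing information, added at user level        */
    uint16_t handler_id;  /* tells the receiving NIC which buffer/handler    */
    uint32_t length;
} ul_header;              /* hypothetical user-level protocol header         */

/* User-level send: no kernel trap, no intermediate kernel copy. */
void ul_send(uint16_t dest, uint16_t handler, const void *buf, uint32_t len)
{
    ul_header h = { dest, handler, len };
    /* Programmed I/O: copy header + payload straight into NIC memory. */
    memcpy((void *)nic_send_queue, &h, sizeof h);
    memcpy((void *)(nic_send_queue + sizeof h), buf, len);
    *nic_doorbell = 1;    /* tell the NIC firmware a packet is ready */
}
```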

Page 21: CS160 – Lecture 3

Packet Pathway

[Figure: packet pathway. On the send side, the user message buffer is copied to the NIC by programmed I/O and DMAed onto the network; on the receive side, packets are DMAed from the NIC into a pinned DMA receive region and dispatched to user-level handlers (Handler 1, Handler 2) that deliver data into user message buffers.]

• Concurrency of I/O busses

• Sender specifies receiver handler ID

• Flow control keeps DMA region from being overflowed

Page 22: CS160 – Lecture 3

Fast Messages 1.x – An example message passing API and library

• API: Berkeley Active Messages
– Key distinctions: guarantees (reliable, in-order, flow control), network-processor decoupling (DMA region)

• Focus on short-packet performance:
– Programmed I/O (PIO) instead of DMA
– Simple buffering and flow control
– Map I/O device to user space (OS bypass)

Sender:
  FM_send(NodeID, Handler, Buffer, size);   // handlers are remote procedures
Receiver:
  FM_extract()

(A usage sketch follows.)
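A hedged sketch of how a program might use the two FM 1.x calls shown above. The slide only specifies FM_send and FM_extract; the handler prototype, whether Handler is a function pointer or a registered ID, and the fm.h header are assumptions made for illustration.

```c
#include "fm.h"   /* hypothetical header exposing FM_send / FM_extract */

/* Assumed handler prototype: FM runs this on the receiving node when the
 * message arrives ("handlers are remote procedures"); the receiver never
 * posts an explicit receive for it. */
void ping_handler(void *buf, int size)
{
    /* consume the data, e.g. copy it into an application buffer */
}

void sender(int dest_node)
{
    char msg[64] = "ping";
    /* Ship the buffer to dest_node and name the remote handler to invoke. */
    FM_send(dest_node, ping_handler, msg, sizeof msg);
}

void receiver_loop(void)
{
    /* FM is polling-based (no interrupts): FM_extract drains the network
     * and runs the handlers for any messages that have arrived. */
    for (;;)
        FM_extract();
}
```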

Page 23: CS160 – Lecture 3

What is an active message?

• Usually, message passing has a send with a corresponding explicit receive at the destination.

• Active messages specify a function to invoke (activate) when the message arrives
– The function is usually called a message handler
– The handler gets called when the message arrives, not by the destination doing an explicit receive.

Page 24: CS160 – Lecture 3

FM 1.x Performance (6/95)

• Latency 14 μs, peak BW 21.4 MB/s [Pakin, Lauria et al., Supercomputing ’95]

• Hardware limits PIO performance, but N1/2 = 54 bytes

• Delivers 17.5 MB/s @ 128-byte messages (140 Mbps, greater than OC-3 ATM deliverable)

[Figure: bandwidth (MB/s) vs. message size (16 to 2048 bytes), comparing FM with 1 Gb Ethernet.]

Page 25: CS160 – Lecture 3

The FM Layering Efficiency Issue

• How good is the FM 1.1 API?

• Test: build a user-level library on top of it and measure the available performance
– MPI chosen as representative user-level library
– Port of MPICH 1.0 (ANL/MSU) to FM

• Purpose: to study what services are important in layering communication libraries
– Integration issues: what kind of inefficiencies arise at the interface, and what is needed to reduce them [Lauria & Chien, JPDC 1997]

Page 26: CS160 – Lecture 3

MPI on FM 1.x - Inefficient Layering of Protocols

• First implementation of MPI on FM was ready in Fall 1995

• Disappointing performance, only a fraction of FM bandwidth available to MPI applications

[Figure: bandwidth (MB/s) vs. message size (16 to 2048 bytes) for FM and MPI-FM.]

Page 27: CS160 – Lecture 3

MPI-FM Efficiency

• Result: FM fast, but its interface not efficient

[Figure: MPI-FM efficiency (%) vs. message size (16 to 2048 bytes).]

Page 28: CS160 – Lecture 3

MPI-FM Layering Inefficiencies

[Figure: MPI layered over FM; headers are attached to the source buffer and removed at the destination buffer, forcing extra copies.]

• Too many copies due to header attachment/removal, lack of coordination between transport and application layers

Page 29: CS160 – Lecture 3

Redesign API - FM 2.x

• Sending (a usage sketch follows this list)
– FM_begin_message(NodeID, Handler, size)
– FM_send_piece(stream, buffer, size)   // gather
– FM_end_message()

• Receiving
– FM_receive(buffer, size)              // scatter
– FM_extract(total_bytes)               // receiver flow control
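A hedged sketch of the gather/scatter style these calls enable. It assumes FM_begin_message returns the stream handle used by FM_send_piece and that FM_end_message and the handler take that stream (the slide elides these details), so treat the exact signatures, the msg_header type, and lookup_user_buffer as assumptions.

```c
#include "fm.h"   /* hypothetical header for the FM 2.x calls listed above */

typedef struct { int tag; int src; int len; } msg_header;  /* illustrative MPI-style header */

void *lookup_user_buffer(const msg_header *h);   /* hypothetical MPI-level receive matching */

/* Sender: gather a header and a payload into one message without first
 * copying them into a contiguous staging buffer. */
void send_with_header(int dest, int handler_id,
                      const msg_header *hdr,
                      const void *payload, int payload_len)
{
    FM_stream *s = FM_begin_message(dest, handler_id,
                                    (int)sizeof *hdr + payload_len);
    FM_send_piece(s, hdr, sizeof *hdr);          /* gather piece 1 */
    FM_send_piece(s, payload, payload_len);      /* gather piece 2 */
    FM_end_message(s);
}

/* Receiver handler: scatter the incoming stream directly into the header
 * struct and then into the matching application buffer -- no intermediate
 * copy, which is exactly the inefficiency the FM 1.x layering suffered. */
void recv_handler(FM_stream *s, int total_len)
{
    msg_header h;
    FM_receive(&h, sizeof h);                    /* scatter piece 1 */
    void *dst = lookup_user_buffer(&h);
    FM_receive(dst, total_len - (int)sizeof h);  /* scatter piece 2 */
}
```

On the receive side the process still drives everything by calling FM_extract(total_bytes) in a polling loop, which pulls bytes off the network and invokes handlers such as recv_handler; the total_bytes argument is the receiver-side flow-control knob noted on the slide.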

Page 30: CS160 – Lecture 3

MPI-FM 2.x Improved Layering

• Gather-scatter interface + handler multithreading enables efficient layering, data manipulation without copies

[Figure: MPI layered over FM 2.x; the header and the source/destination buffers are gathered and scattered directly, with no extra copies.]

Page 31: CS160 – Lecture 3

MPI on FM 2.x

• MPI-FM: 91 MB/s, 13 μs latency, ~4 μs overhead
– Short messages much better than IBM SP2, PCI limited
– Latency ~ SGI O2K

[Figure: bandwidth (MB/s) vs. message size (4 bytes to 64 KB) for FM and MPI-FM.]

Page 32: CS160 – Lecture 3

MPI-FM 2.x Efficiency

• High transfer efficiency, approaches 100% [Lauria, Pakin et al., HPDC-7 ’98]
• Other systems much lower even at 1 KB (100 Mbit: 40%, 1 Gbit: 5%)

[Figure: MPI-FM efficiency (%) vs. message size (4 bytes to 64 KB).]

Page 33: CS160 – Lecture 3

HPVM III (“NT Supercluster”)

• 256x Pentium II, April 1998, 77 Gflops
– 3-level fat tree (large switches), scalable bandwidth, modular extensibility

• => 512x Pentium III (550 MHz), early 2000, 280 Gflops

• Both with the National Center for Supercomputing Applications

[Figure: the cluster at 77 GF, April 1998, and at 280 GF, early 2000.]

Page 34: CS160 – Lecture 3

Supercomputer Performance Characteristics

• Compute/communicate and compute/latency ratios

• Clusters can provide programmable characteristics at a dramatically lower system cost

                        Mflops/Proc   Flops/Byte   Flops/NetworkRT
  Cray T3E                  1200          ~2           ~2,500
  SGI Origin2000             500          ~0.5         ~1,000
  HPVM NT Supercluster       300          ~3.2         ~6,000
  Berkeley NOW II            100          ~3.2         ~2,000
  IBM SP2                    550          ~3.7        ~38,000
  Beowulf (100 Mbit)         300         ~25         ~200,000

Page 35: CS160 – Lecture 3

[Figure: Gigaflops vs. number of processors (up to ~64) for Origin-DSM, Origin-MPI, NT-MPI, SP2-MPI, T3E-MPI, and SPP2000-DSM.]

Solving 2D Navier-Stokes Kernel – Performance of Scalable Systems

Preconditioned Conjugate Gradient Method with Multi-level Additive Schwarz Richardson Pre-conditioner (2D 1024x1024)

Danesh Tafti, Rob Pennington, NCSA; Andrew Chien (UIUC, UCSD)

Page 36: CS160 – Lecture 3

Is the detail important? Is there something easier?

• Detail of a particular high-performance interface illustrates some of the complexity of these systems
– Performance and scaling are very important
– Sometimes the underlying structure needs to be understood to reason about applications

• Class will focus on distributed computing algorithms and interfaces at a higher level (message passing)

Page 37: CS160 – Lecture 3

How do we program/run such machines?

• PVM (Parallel Virtual Machine) provides
– Simple message passing API
– Construction of a virtual machine with a software console
– Ability to spawn (start), kill (stop), and monitor jobs

• XPVM is a graphical console and performance monitor

• MPI (Message Passing Interface) (a minimal example follows this list)
– Complex and complete message passing API
– De facto, community-defined standard
– No defined method for job management
  • mpirun provided as a tool with the MPICH distribution
– Commercial and non-commercial tools for monitoring/debugging
  • Jumpshot, VaMPIr, …
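For a first taste of the MPI API mentioned above, here is a minimal C program using only standard MPI calls; the values sent and the process count are arbitrary.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI runtime        */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?                    */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes in total? */

    if (rank == 0) {
        /* Rank 0 sends a number to every other rank: explicit sends ...    */
        for (int dest = 1; dest < size; dest++) {
            int value = 42 + dest;
            MPI_Send(&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
        }
    } else {
        /* ... matched by explicit receives (contrast with active messages). */
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d of %d received %d\n", rank, size, value);
    }

    MPI_Finalize();
    return 0;
}
```

Launched with the MPICH tool noted above, e.g. `mpirun -np 4 ./hello`; a PVM program would instead be started from the PVM console or via pvm_spawn.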

Page 38: CS160 – Lecture 3

Next Time …

• Parallel programming paradigms
– Shared memory
– Message passing