SCTP-Based Middleware for MPI
by
Humaira Kamal
B.S.E.E. University of Engineering and Technology, Lahore, Pakistan, 1994
M.Sc. Computer Science, Lahore Univ. of Management Sciences, Pakistan, 2001
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in
THE FACULTY OF GRADUATE STUDIES
(Computer Science)
The University of British Columbia
August 2005
© Humaira Kamal, 2005
Abstract
SCTP (Stream Control Transmission Protocol) is a recently standardized transport
level protocol with several features that better support the communication require-
ments of parallel applications; these features are not present in traditional TCP
(Transmission Control Protocol). These features make SCTP a good candidate as
a transport level protocol for MPI (Message Passing Interface). MPI is a message
passing middleware that is widely used to parallelize scientific and compute inten-
sive applications. TCP is often used as the transport protocol for MPI in both local
area and wide-area networks. Prior to this work, SCTP had not been used for MPI.
In this thesis, we compared and evaluated the benefits of using SCTP instead
of TCP as the underlying transport protocol for MPI. We redesigned LAM-MPI,
a public domain version of MPI, to use SCTP. We describe the advantages and
disadvantages of using SCTP, the necessary modifications to the MPI middleware to
use SCTP, and the performance of SCTP as compared to the stock implementation
that uses TCP.
ii
Contents
Abstract
Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
2 Overview
2.1 SCTP Overview
2.2 Overview of MPI Middleware
2.2.1 Message Progression Layer
2.2.2 Message Delivery Protocol
2.2.3 Overview of NAS Parallel Benchmarks
3 Evaluating TCP for MPI in WANs
3.1 TCP Performance Issues
3.2 Investigating the Use of TCP for MPI in WANs
3.3 MPI in an Emulated WAN Environment
3.3.1 Communication Characteristics of the NAS Benchmarks
3.4 Experiments
3.4.1 NAS Benchmarks, Loss
3.4.2 NAS Benchmarks, Latency
3.4.3 NAS Benchmarks, Nagle's Algorithm and Delayed Acknowledgments
3.5 Summary of Results
4 SCTP-based Middleware: Design and Implementation
4.1 Overview of using SCTP for MPI
4.2 Message Demultiplexing
4.3 Concurrency and SCTP Streams
4.3.1 Assigning Messages to Streams
4.3.2 Head-of-Line Blocking
4.3.3 Example of Head-of-Line Blocking
4.3.4 Maintaining State
4.4 Resource Management
4.5 Race Conditions
4.5.1 Race Condition in Long Message Protocol
4.5.2 Race Condition in MPI_Init
4.6 Reliability
4.6.1 Multihoming Feature
4.6.2 Added Protection
4.6.3 Use of a Single Underlying Protocol
4.7 Limitations
5 Experimental Evaluation
5.1 Evaluation Using Benchmark Programs
5.1.1 MPBench Ping-Pong Test
5.1.2 Using NAS Benchmarks
5.2 Evaluation Using a Latency Tolerant Program
5.2.1 Comparison Using a Real-World Program
5.2.2 Investigating Head-of-Line Blocking
6 Related Work
7 Conclusions and Future Work
Bibliography
List of Tables
3.1 Breakdown of message types for LU classes S, W, A and B using the default LAM settings for TCP
3.2 Breakdown of message types for SP classes S, W, A and B using the default LAM settings for TCP
5.1 The performance of SCTP and TCP using a standard ping-pong test under loss
List of Figures
2.1 Single association with multiple streams
2.2 MPI and LAM envelope format
2.3 Examples of valid and invalid message orderings in MPI
2.4 Message progression in MPI middleware
3.1 NAS benchmarks performance for dataset B under different loss rates using TCP SACK
3.2 NAS benchmarks performance for dataset B with different latencies using TCP SACK
3.3 Performance of LU using different combinations of Nagle's algorithm and delayed acknowledgments
3.4 Comparison of LU and SP benchmarks for different datasets
4.1 Similarities between the message protocol of MPI and SCTP
4.2 Example of head-of-line blocking in LAM TCP module
4.3 Example of multiple streams between two processes using non-blocking communication
4.4 Example of multiple streams between two processes using blocking communication
4.5 Example of long message race condition
4.6 SCTP's four-way handshake
5.1 The performance of SCTP using a standard ping-pong test normalized to TCP under no loss
5.2 TCP versus SCTP for the NAS benchmarks
5.3 TCP versus SCTP for (a) short and (b) long messages for the Bulk Processor Farm application
5.4 TCP versus SCTP for (a) short and (b) long messages for the Bulk Processor Farm application using Fanout of 10
5.5 Effect of head-of-line blocking in the Bulk Processor Farm program for (a) short and (b) long messages
Acknowledgements
I would like to acknowledge the invaluable contribution of my supervisor Dr. Alan
Wagner who has been a constant source of inspiration for me. He has always been
generous with his time and discussions with him have added a great deal to my
knowledge. One of the best things about working under his supervision is that he
has always encouraged dialogue and provided constructive feedback on our work.
I also want to thank Dr. Norm Hutchinson for reading my thesis and pro-
viding useful feedback.
I would also like to acknowledge the many useful discussions I have had with
my group partner, Brad Penoff, over the course of our project that contributed to
our mutual learning.
Many thanks to Randall Stewart, one of the primary designers of SCTP, for
answering our questions and providing timely fixes to the protocol stack that made
it possible to provide the performance evaluation. Thanks to Jeff Squyres, one of
the developers of LAM-MPI, for readily answering our many questions about the
LAM-MPI implementation.
Humaira Kamal
The University of British Columbia
August 2005
To my parents, sisters and brother, without whose encouragement and support
this would not have been possible.
Chapter 1
Introduction
Increases in network bandwidth and connectivity have spurred interest in distributed
systems that can utilize the network and computational resources available in wide-
area networks (WANs). The challenge is how to harness the power of the hundreds
and thousands of processors that may be available across the network to solve prob-
lems faster by executing them in parallel. Introduced about a decade ago, the
Message Passing Interface (MPI) [46, 24] has become the de facto standard for con-
structing parallel applications in distributed environments. The MPI forum defines
a library of functions that can be used to create parallel applications which commu-
nicate by message passing. The primary goals of MPI were to provide portability
of source code and allow efficient implementations across a range of architectures.
MPI facilitates the creation of efficient scientific and compute-intensive parallel programs, and implementations are available for a wide variety of platforms. Since its inception, MPI has been very popular, and virtually every vendor of parallel systems has implemented its own variant [49]. There are also open source, public domain
implementations available for execution of MPI programs in local area networks.
Notable among them are Local Area Multicomputer (LAM) [22] and MPICH [62].
Initial implementations of MPI were targeted towards tightly coupled cluster-
like machines, where reliability and security were not an issue. Implementations of
MPI for local area networks widely use TCP as the underlying transport protocol,
to provide reliable in-order delivery of messages. TCP was available in the first
public domain versions of MPI such as LAM and MPICH, and more recently the
use of MPI with TCP has been extended to computing grids [37], wide-area net-
works, the Internet and meta-computing environments that link together diverse,
geographically distributed, computing resources. The main advantage to using an
IP-based protocol (i.e., TCP/UDP) for MPI is portability and the ease with which it
can be used to execute MPI programs in diverse network environments. One well-
known problem with using TCP in wide-area environments is the presence of large
latencies and the difficulty in utilizing all of the available bandwidth. MPI processes in an application tend to be loosely synchronized, even when they are not directly
communicating with each other. If delays occur in one process, they can potentially
have a ripple effect on other processes, which can get delayed as well. As a result,
performance can be poor especially in the presence of network congestion and the
resulting increased latency which can cause processes to stall waiting for the delivery
or receipt of a message. Moreover, MPI processes typically establish a large number
of TCP connections for communicating with the other processes in the application.
This imposes a requirement on the operating system to maintain a large number of
socket descriptors.
Although applications sensitive to latency suffer when run over TCP or UDP,
there are latency tolerant programs such as those that are embarrassingly parallel,
or almost so, that can use an IP-based transport protocol to execute in environments
like the Internet. In addition, the dynamics of TCP are an active area of research
and there is interest in better models [7] and tools for instrumenting and tuning
TCP connections [43]. As well, TCP itself continues to evolve, especially for high
performance links, with research into new variants like TCP Vegas [21, 45]. Finally,
latency hiding techniques and exploiting trade-offs between bandwidth and latency
can further expand the range of MPI applications that may be suitable to execute
over IP in both local and wide-area networks. In the end, the ability for MPI
programs to execute unchanged in almost any environment is a strong motivation
for continued research in IP-based transport protocol support for MPI.
Currently, TCP is widely used as an IP-based transport protocol for MPI.
The TCP protocol, however, is not well matched for MPI applications, especially in
wide-area networks, for a number of reasons, which are discussed below:
1. TCP is a stream-based protocol, whereas MPI is message-oriented. This im-
poses a requirement on the implementation of the MPI library to carry out
message framing.
2. The TCP protocol provides strict in-order delivery of all user data transmit-
ted on a connection. MPI applications need reliable message transfer, but not
a globally strict sequenced delivery. In Chapter 2 we discuss the details of
MPI message matching and how messages with the same tag, rank and con-
text should be delivered in order, while those with different values for these
parameters are allowed to overtake each other. The strict in-order delivery of
messages by TCP can result in unnecessary head-of-line blocking in case of
network loss or delays.
3. The ability to recover quickly from loss in a network is especially desirable
in wide-area environments. TCP SACK has been shown to be more robust
under loss than other variants of TCP, like TCP Reno; however, SACK is not
an integral part of the TCP protocol and is provided as an option in some
implementations [19]. The user thus has to make sure that a proper TCP
SACK implementation is available on all the nodes in order to take advantage
of better performance under loss.
4. TCP is known to be vulnerable to certain denial-of-service attacks like SYN
flooding. In open environments, like wide-area networks or the Internet, re-
silience against such attacks is a concern.
5. TCP has no built-in support for multihomed hosts; in the case of a network
failure, there is no provision for it to fail over to an alternate path. Multihomed
hosts are increasingly common, and the ability to transparently switch over to
another path can be especially useful in a wide-area network.
Recently, a new transport protocol called SCTP (Stream Control Transmis-
sion Protocol) [59] has been standardized by the IETF. SCTP alleviates many of
the problems with TCP. SCTP is message oriented like UDP but has TCP-like con-
nection management, congestion and flow control mechanisms. In SCTP, there is
an ability to define streams that allow multiple independent and reliable message
subflows inside a single association. This eliminates the head-of-line blocking that
can occur in TCP-based middleware for MPI. In addition, SCTP associations and
streams closely match the message-ordering semantics of MPI when messages with
the same tag, source rank and context are used to define a stream (tag) within an
association (source rank).
SCTP includes several other mechanisms that make it an attractive target for
MPI in open network environments where secure connection management and con-
gestion control are important. It makes it possible to offload some MPI middleware
functionality onto a standardized protocol that will hopefully become universally
available. Although new, SCTP is currently available for all major operating sys-
tems and is part of the standard Linux kernel distribution.
In this thesis we propose the use of SCTP as a robust transport protocol
for MPI in wide-area networks. We first investigate the problems that arise from
the use of TCP for MPI in an emulated wide-area network. The objective is to
gain a better understanding of the effect TCP has on MPI performance and to learn
about what kind of applications are suitable for such environments. We use LAM as
the MPI middleware for our experiments and also extensively instrument the LAM
TCP module to gain an insight into its working. Our second step is to redesign the
LAM-MPI middleware to take advantage of the features SCTP provides. During
our design, we investigate the shortcomings of TCP, discussed above, and introduce
features using SCTP that alleviate those shortcomings. One novel advantage to us-
ing SCTP that we investigate is the elimination of the head-of-line blocking present
in TCP-based MPI middleware. Next, we carry out an extensive evaluation of our
implementation and compare its performance with the LAM TCP module. We use
standard benchmarks as well as real-world programs to evaluate the performance
of SCTP. Our experiments show that SCTP outperforms TCP under loss, when
latency tolerant applications are used. The benefits of using SCTP were substantial
even when a single stream was used between two endpoints. Use of SCTP’s mul-
tistreaming feature adds to the overall performance and we evaluate the effects of
head-of-line blocking in our applications. A second indirect contribution of our work
is with respect to SCTP. Our MPI middleware makes very aggressive use of SCTP.
In using the NAS parallel benchmarks [9] along with programs of our own design,
we were able to uncover problems in the FreeBSD implementation of the protocol
that led to improvements in the stack by the SCTP developers.
Overall, the use of SCTP makes for a more resilient implementation of MPI
that avoids many of the problems present in using TCP, especially in an open wide-
area network environment, such as the Internet. Of course, it doesn’t eliminate the
performance issues of operating in that environment, nor does it eliminate the need
for tuning connections. Many of the same issues remain and as mentioned, this is
an active area of research in the networking community where numerous variants of
TCP have been proposed. Because of the similarity between the two, SCTP will be
able to take advantage of improvements to TCP. Although the advantages of SCTP
have been investigated in other contexts, this is the first use of it for MPI. SCTP is
an attractive replacement for TCP in a wide-area network environment, and adds
to the set of IP transport protocols that can be used for MPI. The release of Open
MPI [23], and its ability to mix and match transport mechanisms, is an ideal target
of our work where an SCTP module can further extend MPI across networks that
require a robust TCP-like connection.
The remainder of this thesis is organized as follows. In Chapter 2, we give a back-
ground overview of SCTP, MPI middleware and NAS benchmarks used in our ex-
periments. Chapter 3 investigates the use of TCP in wide-area environments where
bandwidth and latency can vary and discusses the challenges of using TCP for MPI.
In Chapter 4, we discuss our motivation for using SCTP for MPI and discuss the
design and implementation details. In Chapter 5, we report the results of various
experiments carried out to evaluate our SCTP module and provide comparisons with
TCP. Research and related work on wide-area computing is discussed in Chapter 6.
Chapter 7 presents our conclusions and directions for future work. This thesis is
part of work done jointly with Brad Penoff, a fellow graduate student at UBC. This
thesis focuses on the higher level design and implementation issues of our SCTP
module. There were several other implementation issues, which will likely be detailed in Brad Penoff's thesis. Details about the design and evaluation of the SCTP module in Chapters 4 and 5 are part of our paper due to appear at the Supercomputing conference in November 2005 [35].
Chapter 2
Overview
2.1 SCTP Overview
SCTP is a general purpose unicast transport protocol for IP network data commu-
nications, which has been recently standardized by the IETF [59, 66, 56]. It was
initially introduced by a working group, SIGTRAN, of the IETF as a mechanism
for transportation of call control signaling over the Internet. Telephony signaling
has rigid timing and reliability requirements that must be met by a phone service
provider. Investigations by SIGTRAN led them to conclude that TCP is inadequate
for the transport of telephony signaling, for the following reasons. Independent mes-
sages over a TCP connection can encounter head-of-line blocking due to TCP’s
order-preserving property, which can cause critical call control timers to expire,
leading to setup failures. Also, there is a lack of path-level redundancy support in
TCP, which is a limitation compared to the signaling protocol SS7 that is designed
with full link-level redundancy. As well, lack of control over the TCP retransmission
timers is a shortcoming for applications with rigid timing requirements.
Development of SCTP was directly motivated by the need to find a bet-
ter transport mechanism for telephony signaling [59]. SCTP has since evolved for
more general use to satisfy the needs of applications that require a message-oriented
protocol with all the necessary TCP-like mechanisms and additional features not
present in either TCP or UDP.
SCTP has been referred to by its designers as “Super TCP”: it provides sequencing, flow control, reliability and full-duplex data transfer like TCP, but it also provides an enhanced set of capabilities not in TCP that make applications less susceptible to loss.
service is message oriented and supports the framing of application data. But like
TCP, SCTP is session-oriented and communicates by establishing an association
between two endpoints. In SCTP, the word association is used instead of a con-
nection between two endpoints because an association has a broader scope than a
TCP connection. In an SCTP association, an endpoint can be represented by mul-
tiple IP addresses and a port, whereas in TCP, each endpoint is bound to exactly
one IP address. SCTP supports two types of sockets: one-to-one style and one-to-
many style sockets. One of the design objectives of SCTP was to allow existing
TCP applications to be migrated to SCTP with little effort. The one-to-one SCTP
style socket provides this function, and the few changes required to achieve this
are as follows. The sockets used for communication among processes are created
using IPPROTO_SCTP rather than IPPROTO_TCP. Socket options like turning on/off
Nagle's algorithm are similar to TCP except that socket option SCTP_NODELAY is
used instead of TCP_NODELAY. In the one-to-many style a single socket can commu-
nicate with multiple SCTP associations similar to a UDP socket that can receive
datagrams from different UDP endpoints. An association identifier is used to dis-
tinguish an association on the one-to-many socket. This style eliminates the need
for maintaining a large number of socket descriptors.
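As a concrete sketch (ours, not LAM code), creating the two socket styles differs from TCP only in the arguments to socket(); the one-to-one style even reuses the familiar setsockopt() pattern:

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

int make_sctp_sockets(void)
{
    /* One-to-one style: a drop-in replacement for a TCP socket; only the
     * protocol argument changes from IPPROTO_TCP to IPPROTO_SCTP. */
    int one_to_one = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);

    /* SCTP_NODELAY disables Nagle's algorithm, as TCP_NODELAY does for TCP. */
    int on = 1;
    setsockopt(one_to_one, IPPROTO_SCTP, SCTP_NODELAY, &on, sizeof(on));

    /* One-to-many style: a single UDP-like socket that can carry many
     * associations, each distinguished by an association identifier. */
    int one_to_many = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);

    return (one_to_one < 0 || one_to_many < 0) ? -1 : 0;
}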
Additionally, SCTP supports multiple logical streams within an association,
where each stream has the property of independent, sequenced message delivery.
Loss in any of the streams affects only that stream and not others. In an SCTP
association, a maximum of 65,536 unidirectional streams can be defined by either
end for simultaneous delivery of data.
Multistreaming in SCTP is achieved through an independence between de-
livery of data and transmission of data. An SCTP packet has one common header
followed by one or more “chunks”. There are two types of chunks, control and data, and each chunk is fully self-descriptive. Control chunks carry information that is used for maintenance of an SCTP association, while user messages are carried in data chunks. Each data chunk is assigned two sets of sequence numbers: a transmission sequence number (TSN) that regulates the transmission of messages and the detection of message loss, and a stream identifier (SNo) and stream sequence number (SSN) pair, which is used to determine the sequence of delivery of received data. The receiver
can thus use this mechanism to determine when a gap occurs in the transmission
sequence, for example, due to loss, and which stream it affects.
Depending on the size of a user message and the path maximum transmission
unit (PMTU), SCTP may fragment the message into one or more data chunks,
however, a single data chunk cannot carry more than one user message. In order
to handle the case in which the user message is fragmented, the SSN remains the
same on all the data chunks containing the fragments, while consecutive TSNs are
assigned to the individual fragments in the order of the fragmentation. An SCTP
sender uses a round robin scheme to transmit user messages with different stream
identifiers to the receiver, so that none of the streams are starved [57].
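For illustration, the stream a message travels on is chosen by the sender at the socket API level; the sketch below (function name ours) uses sctp_sendmsg() from the standard SCTP sockets API, and the kernel assigns the TSN and per-stream SSN described above:

#include <string.h>
#include <sys/socket.h>
#include <netinet/sctp.h>

int send_on_streams(int sd, struct sockaddr *peer, socklen_t peerlen)
{
    const char *m1 = "message for stream 1";
    const char *m2 = "message for stream 2";

    /* The eighth argument selects the stream number (SNo); loss on
     * stream 1 does not delay delivery of the message on stream 2. */
    if (sctp_sendmsg(sd, m1, strlen(m1), peer, peerlen,
                     0 /* ppid */, 0 /* flags */, 1 /* stream */,
                     0 /* ttl */, 0 /* context */) < 0)
        return -1;
    return sctp_sendmsg(sd, m2, strlen(m2), peer, peerlen,
                        0, 0, 2 /* stream */, 0, 0);
}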
Figure 2.1 shows an association between two endpoints with three streams
identified as 0, 1 and 2. As Figure 2.1 shows, SCTP sequences data chunks and
not bytes as in TCP. Together SNo, SSN, and TSN are used to assemble messages
and guarantee the ordered delivery of messages within a stream but not between
streams. For example, receiver Y in Figure 2.1 can deliver Msg2 before Msg1 should
it happen to arrive at Y before Msg1.
In contrast, when a TCP source sends independent messages to the same
receiver at the same time, it has to open multiple independent TCP connections. It
is possible to have multiple streams by having parallel TCP connections, and parallel connections can also improve throughput in congested and un-congested links.

Figure 2.1: Single association with multiple streams. A logical view of endpoints X and Y exchanging user messages over streams 0, 1 and 2 of one association; each data chunk carries a TSN (Transmission Sequence Number), an SSN (Stream Sequence Number) and a message stream number (SNo), and chunks are bundled into SCTP packets handed to the IP layer.

There
are, however, fairness issues as, in a congested network, parallel connections claim
more than their fair share of the bandwidth thereby affecting the cross-traffic [30].
Also, the sender and receiver have to fragment and multiplex data over each of
the parallel TCP connections. One approach to making parallel connections TCP-
friendly is to couple them all to a single connection [25]. SCTP does precisely this by
ensuring that all the streams within a single SCTP association share a common set of
congestion control parameters. It obtains the benefits of parallel TCP connections
while keeping the protocol TCP-friendly [53]. SCTP’s property of being TCP-
friendly promotes its possible wide deployment in the future.
Both the SCTP and TCP protocols use a similar congestion control mech-
anism. An SCTP sender maintains a congestion window (cwnd) variable for each destination address; however, there is only one receive window (rwnd) variable for the entire association. In today's networks congestion control is critical to any transport protocol, and Atiquzzaman et al. [53] have shown that SCTP's congestion control mechanism has improvements over TCP that allow it to achieve a larger average congestion window size and faster recovery from segment loss, which results in higher throughput.
Another difference between SCTP and TCP is that endpoints in SCTP are
multihomed and can be bound to multiple IP addresses (i.e., interfaces). If a peer
is multihomed, then an SCTP endpoint will select one of the peer’s destination ad-
dresses as a primary address and all the other addresses of the peer become alternate
addresses. During normal operation, all data is sent to the primary address, with
the exception of retransmissions, for which one of the active alternate addresses
is selected. Recent work on the retransmission policies of protocols that support
multihoming, like SCTP, proposes a policy that sends fast retransmissions to the
primary address, and timeout retransmissions to an alternate peer IP address [31].
The authors show that sending all retransmissions to an alternate address is use-
ful if the primary address has become unreachable, and can result in performance
degradation otherwise.
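As a sketch of the multihoming API (the addresses and port below are placeholders), a multihomed endpoint binds additional local addresses with sctp_bindx() before connecting or listening:

#include <string.h>
#include <arpa/inet.h>
#include <netinet/sctp.h>

int bind_two_addrs(int sd)
{
    struct sockaddr_in addrs[2];
    memset(addrs, 0, sizeof(addrs));
    for (int i = 0; i < 2; i++) {
        addrs[i].sin_family = AF_INET;
        addrs[i].sin_port   = htons(5000);               /* placeholder port */
    }
    inet_pton(AF_INET, "10.0.0.1",    &addrs[0].sin_addr);  /* primary path  */
    inet_pton(AF_INET, "192.168.0.1", &addrs[1].sin_addr);  /* alternate     */

    /* Both addresses become part of any association formed on this socket;
     * SCTP can fail over to the alternate if the primary becomes unreachable. */
    return sctp_bindx(sd, (struct sockaddr *)addrs, 2, SCTP_BINDX_ADD_ADDR);
}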
When a destination address is determined to be down, SCTP’s multihom-
ing feature can provide fast path failover for an association. SCTP currently does
not support the simultaneous transfer of data across interfaces, but this will likely
change in the future. Currently, researchers at the University of Delaware [29, 28] are
investigating the use of SCTP’s multihoming feature to provide simultaneous trans-
fer of data between two endpoints through two or more end-to-end paths, and this
functionality may become part of the SCTP protocol.
Another factor that may contribute to rapid and widespread deployment of
SCTP is the availability of utilities like withsctp [42] and an SCTP shim translation
layer [3] that can transparently translate calls to the TCP API into equivalent SCTP
calls. The withsctp utility intercepts calls to the TCP socket at the application level
and converts them to SCTP calls. The SCTP shim translation layer works at the
kernel level and allows a hybrid approach, in which it first tries to establish an SCTP
association and, failing that, sets up a TCP connection.
2.2 Overview of MPI Middleware
MPI has a rich variety of message passing routines. These include MPI_Send and
MPI_Recv along with various combinations such as blocking, non-blocking, syn-
chronous, asynchronous, buffered and unbuffered versions of these calls. Processes
exchange messages by using these routines and the middleware is responsible for
progressing all the send and receive requests from their initiation to completion. As
an example, the format of the basic blocking and non-blocking send and receive calls
is as follows:
Blocking MPI_Send and MPI_Recv calls:
MPI_Send(msg, count, type, dest_rank, tag, context)
MPI_Recv(msg, count, type, source_rank, tag, context)
Non-blocking MPI_Isend and MPI_Irecv calls:
MPI_Isend(msg, count, type, dest_rank, tag, context, request_handle)
MPI_Irecv(msg, count, type, source_rank, tag, context, request_handle)
In non-blocking calls, the request handle is returned, which can be used to
check, later in the program execution, if the send/receive operation has completed
or not. An MPI message consists of a message envelope and payload as shown in
Figure 2.2. The matching of a send request to its corresponding receive request, for
example, an MPI_Send to its MPI_Recv, is based on three values inside the message
envelope: (i) context, (ii) source/destination rank and (iii) tag.

Figure 2.2: MPI and LAM envelope format. An MPI message is an envelope (context, rank, tag) followed by the payload; the corresponding LAM envelope carries the message length, tag, flags, rank, context and a sequence number.

Context identifies
a set of processes that can communicate with each other. Within a context, each
process is uniquely identified by its rank, an integer from zero to one less than the
number of processes in the context. A user can specify the type of information being
carried by a message by assigning a tag to it. MPI also allows the source and/or tag
of a message to be a wildcard in a receive request. For example, if MPI_ANY_SOURCE is used, the process is ready to receive messages from any sending process. Similarly, if MPI_ANY_TAG is specified, a message arriving with any
tag would be received.
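A small illustrative fragment (ours, not from LAM) shows matching with a wildcard receive; the status object reveals which source and tag actually matched:

#include <mpi.h>

void wildcard_recv_example(void)
{
    int rank, payload = 42;
    MPI_Status status;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send one tagged message to rank 1. */
        MPI_Send(&payload, 1, MPI_INT, 1, 7 /* tag */, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Wildcard receive: matches any source and any tag. */
        MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        /* status.MPI_SOURCE and status.MPI_TAG identify what arrived. */
    }
}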
In addition to their use in message matching, the tag, rank and context
(TRC) define the order of delivery of MPI messages. MPI semantics require that
any messages sent with the same message TRC value should be delivered in order,
i.e., they may not overtake each other, whereas MPI messages with different TRCs
are not required to be ordered. As seen in Figure 2.3, if messages between a pair
of processes are delivered strictly in the order in which they are sent, then this is
perfectly permissible according to the MPI semantics; however, this ordering is only a subset of the legitimate message orderings allowed by MPI. Out-of-order messages with the same tags, on the other hand, violate MPI semantics.

Figure 2.3: Examples of valid and invalid message orderings in MPI. In three exchanges between processes X and Y, messages sent with different tags (Tag_A, Tag_B) may overtake one another, but out-of-order delivery of messages with the same tag violates MPI semantics.
2.2.1 Message Progression Layer
Implementations of the MPI standard typically provide a request progression mod-
ule that is responsible for progressing requests from initialization to completion. One
of the main functions of this module is to support asynchronous and non-blocking
message progression in MPI. The request progression module must maintain state
information for all requests, including how far the request has progressed and what
type of data/event is required for it to reach completion. There are two approaches to implementing the MPI library: single-threaded and multithreaded implementations.
Multithreaded implementations [6, 14, 50] use a separate communication thread
for making progress asynchronously and provide increased overlap of communica-
tion and computation. Multithreaded implementations, however, have additional
overheads resulting from synchronization and cache and context switching, which
single-threaded implementations, like LAM and MPICH, avoid. Typically, single-
threaded implementations require special design considerations since requests are
only progressed by the MPI middleware when the user makes calls into it.
LAM’s request progression mechanism is representative of the way requests
are handled in other MPI middleware implementations and, due to its modular
design, provided a convenient platform for us to work in. LAM provides a TCP
Request Progression Interface (RPI) module that is used to perform point-to-point
message passing using TCP as its underlying transport protocol. Requests in LAM
go through four states: init, start, active and done, and the RPI module is responsi-
ble for progressing all requests through their life-cycle. We redesigned LAM’s request
progression interface layer module for TCP to use SCTP. Although LAM’s message
passing engine is single-threaded, our design is not limited to single-threaded imple-
mentations only. Of course, in a corresponding multithreaded implementation some
details will need to change to handle issues related to synchronization and resource
contention. In the next section we describe the different types of messages and the
message delivery protocols as implemented in LAM.
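The four request states just described can be pictured as a simple enumeration; the names below are ours, not LAM's actual identifiers:

typedef enum {
    REQ_INIT,   /* request created, e.g., by MPI_Isend            */
    REQ_START,  /* request handed to the progression engine       */
    REQ_ACTIVE, /* envelope/body transfer in progress             */
    REQ_DONE    /* complete; MPI_Wait or MPI_Test can now return  */
} req_state_t;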
2.2.2 Message Delivery Protocol
As shown in Figure 2.4, when an incoming message is received at the transport layer
it is first checked against a list of posted receive requests. If a match is found then
the message is dispatched to the correct receive request buffer. If a matching receive
was not issued at the time of message arrival, the message is placed in an unexpected message queue (implemented in LAM as an internal hash table).

Figure 2.4: Message progression in MPI middleware. An incoming message is matched against the receive request queue and, failing a match, buffered in the unexpected message queue.

Similarly, when a new receive request is posted, it is first checked against the unexpected message queue; if a match is found then the request is advanced; otherwise, it is placed in
the receive request queue.
In LAM, each message body is preceded by an envelope that is used to match
a message. We can broadly classify the messages into three categories with respect to
the way LAM treats them internally:
1. Short messages, which are, by default, of size 64 Kbytes or less.
2. Long messages which are of size greater than 64 Kbytes.
3. Synchronous short messages, which are also of size 64 Kbytes or less, but
require a different message delivery protocol than short messages, as discussed
below.
Figure 2.2 shows the format of an envelope in LAM. The flag field in the
envelope indicates what type of message body follows it.
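As a concrete rendering of Figure 2.2, the envelope can be sketched as a C struct; the field names here are illustrative rather than LAM's actual identifiers:

#include <stdint.h>

struct envelope {
    int32_t len;      /* length of the message body in bytes    */
    int32_t tag;      /* user-supplied message tag               */
    int32_t flags;    /* short, long, or synchronous-short type  */
    int32_t rank;     /* rank of the sending process             */
    int32_t context;  /* communication context identifier        */
    int32_t seq;      /* sequence number                         */
};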
Short messages are passed using eager-send and the message body immedi-
ately follows the envelope. If the receiving process has posted/issued a matching
receive buffer, the message is received and copied into the buffer and the request
is marked as done. If no matching receive has been posted, then this is treated
as an unexpected message and buffered. When a matching receive is later posted,
the message is copied into the request receive buffer and the corresponding buffered
message is discarded.
Long messages are handled differently than short messages and are not sent eagerly, but instead are sent using the following rendezvous scheme. Initially only the envelope of the long message is sent to the receiver; if the receiver has posted a matching receive request, it sends back an acknowledgment (ACK) to the sender to indicate that it is ready to receive the message body. In response to the ACK, the sender sends the envelope again, followed by the long message body. If no matching receive was posted at the time the initial long envelope was received, it is treated as an unexpected message; later, when a matching receive request is posted, the receiver sends back an ACK and the rendezvous proceeds as above. The use of eager send for long messages has been studied, and an eager protocol for long messages has been found to outperform a rendezvous protocol only if a significant number of receives have been pre-posted [4]. In [65] the authors report that protocols like eager send can lead to resource exhaustion in large cluster environments.
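The rendezvous exchange can be summarized by the following sender-side sketch; the helper functions are hypothetical stand-ins for LAM's internals, not its actual code:

#include <stddef.h>

struct envelope;  /* as sketched above */

/* Hypothetical transport helpers. */
void send_envelope(int peer, const struct envelope *env);
void send_body(int peer, const void *buf, size_t len);
void wait_for_ack(int peer);

void send_long(int peer, const struct envelope *env,
               const void *body, size_t len)
{
    send_envelope(peer, env);    /* 1. advertise: envelope only, no body  */
    wait_for_ack(peer);          /* 2. receiver matched a receive request */
    send_envelope(peer, env);    /* 3. envelope again ...                 */
    send_body(peer, body, len);  /*    ... followed by the message body   */
}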
Synchronous short messages are also communicated using eager-send; however, the send is not complete until the sender receives an ACK from the receiver.
The discussion above about the handling of unexpected messages also applies here.
In addition to point-to-point communication functions, MPI has a number of rou-
tines for collective operations. In LAM, these routines are implemented on top of
point-to-point communication.
2.2.3 Overview of NAS Parallel Benchmarks
In Chapters 3 and 5, we describe several experiments conducted using the Numerical
Aerodynamic Simulation (NAS) Parallel Benchmarks (NPB) [9]. These benchmarks
approximate the performance that can be expected from a portable parallel appli-
cation. A brief overview of these benchmarks is included in this section.
The NAS parallel benchmarks were developed by the NASA Ames Research
Centre and are a set of eight programs that are derived from computational fluid dy-
namics code. NPB was used in our experiments because it approximates real-world
parallel scientific applications better than most other available benchmarks. The
eight programs are: LU (Lower-upper symmetric Gauss-Seidel), IS (Integer Sort),
MG (Multi-grid method for Poisson equation), EP (Embarrassingly Parallel), CG
(Conjugate Gradient), SP (Scalar Pentadiagonal Systems), BT (Block Tridiagonal
Systems) and FT (Fast Fourier Transform for Laplace equation). Each program has
several problem sizes (S, W, A, B, C and D) ranging from small to very large.
Communication characteristics of these benchmarks are described in [20],
which shows that all of the benchmarks use collective communication. With the exception of the EP benchmark, all the benchmarks use point-to-point communication as well. EP uses only MPI_Allreduce and MPI_Barrier and has very few communication calls. The number of communication calls in the remaining benchmarks ranges from 1,000 to 50,000 for dataset B. The communication volume ranges from less
than 12 Kbytes for EP to several gigabytes of data for the other benchmarks.
Chapter 3
Evaluating TCP for MPI in
WANs
In this chapter we evaluate the use of TCP for MPI in wide-area networks and
Internet-like environments where bandwidth and latency can vary. Our objective is
to gain a better understanding of the impact of IP protocols on MPI performance
that will improve our ability to take advantage of wide-area networks for MPI and
may suggest techniques to make MPI programs less sensitive to latency, so that they
can execute in Internet-like environments. Not all programs will be suitable for these
environments; however, for those programs that are, it is important to understand
the limitations and to continue to take advantage of IP transport protocols.
We performed our experiments in an emulated wide-area network using the
NAS parallel benchmarks to evaluate TCP in this environment. Our experiments
concentrated on latency and the effect of packet delay and loss on performance,
and we measured their effect using two widely available variants of TCP, namely
New-Reno and SACK. We also evaluated these two versions with respect to the
TCP NODELAY option, which disables Nagle’s algorithm, and the effect of delayed
acknowledgments. In this chapter we discuss a number of challenges for using TCP
and MPI in a wide-area network. Our focus is on performance issues rather than
issues pertaining to management and access.
3.1 TCP Performance Issues
For improving TCP’s performance in high bandwidth delay environments, tech-
niques like tuning of TCP buffers and using parallel streams are well known in the
networking community. Work done in [15] implements a TCP tuning daemon that
uses network metrics to transparently tune TCP parameters for individual flows
over designated paths. The daemon decides on appropriate buffer sizes to use on
a per-flow basis and in the case of packet loss attempts to speed up recovery by mod-
ifying TCP congestion control parameters, disabling delayed acknowledgments or
using parallel streams. Their results show that their tuning techniques achieve im-
provement for bulk transfers over standard TCP. Other work on TCP shows that
the common practice of setting the socket buffer size based on the definition of
bandwidth-delay product (BDP) as a function of transmission capacity and round-
trip time can lead to suboptimal throughput [30]. They show that the socket buffer
size should be set to the BDP only in paths that do not carry cross traffic and do
not introduce queuing delays. They also present an application level mechanism
that automatically adjusts the socket buffer size to near its optimal value.
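To make the buffer-sizing rule concrete, a 100 Mbit/s path with a 50 ms round-trip time has a BDP of (100e6 / 8) bytes/s × 0.050 s = 625,000 bytes. A minimal sketch (function name ours) that applies this rule to a socket:

#include <sys/socket.h>

int set_buffers_to_bdp(int sd, double bw_bits_per_s, double rtt_s)
{
    /* BDP in bytes: (bandwidth in bits/s divided by 8) times RTT in seconds. */
    int bdp = (int)(bw_bits_per_s / 8.0 * rtt_s);
    if (setsockopt(sd, SOL_SOCKET, SO_SNDBUF, &bdp, sizeof(bdp)) < 0)
        return -1;
    return setsockopt(sd, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof(bdp));
}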
Work in [11, 21, 64] focuses on TCP’s role in high performance computing
and identifies both implementation-related and protocol-related problems in TCP.
Some of the issues that these papers look at are:
1. Operating system bottlenecks due to increasing differences between networking
speeds and processor and bus speeds. This work suggests the use of OS bypass or interrupt coalescing to help TCP deal with the OS's interrupt processing overhead, and the use of a larger effective MTU size, e.g., through the use of jumbograms.
2. Comparison of rate-based versions like TCP Vegas with TCP Reno on high
performance links. Their results show that, with proper selection of Vegas
parameters, TCP Vegas performs better than TCP Reno for real traffic dis-
tributions.
There are also approaches that use other transport protocols like UDP and
are aggressive about acquiring network capacity. However, these techniques raise
concerns about fairness when competing with other traffic in the network. In [13]
a new communication level protocol is proposed that uses a single UDP stream
for high-performance data transfers. Their protocol implements a communication
window that spans the entire data buffer at the user level and a selective acknowl-
edgment window that spans the entire data transfer. In addition to this, they use a
user-defined acknowledgment frequency. Their protocol achieves increased through-
put over both short- and long-haul high-performance networks; however, the paper does not discuss fairness issues or the effect of their protocol on cross traffic.
Our work does not investigate buffer tuning, because our focus is on the
effect of packet delay and loss on performance. Buffer tuning is certainly important
for large messages and it raises interesting questions for LAM, which sets up a fully
connected environment. During MPI_Init connections are set up between each pair
of nodes and the socket buffer size is set to the same value for all connections. As
shown with MPICH-G2 [37], it is possible to tune TCP parameters for individual
connections and this is one area for future exploration.
Given our focus on latency, we investigated the effect of Nagle’s algorithm
and delayed acknowledgments on the benchmarks. In TCP, Nagle's algorithm
holds back data in the TCP buffer until a sufficiently large segment can be sent. De-
layed acknowledgments do not ACK each segment received but wait for (a) data on
the reverse stream, (b) two segments, or (c) a 100 to 200 millisecond timer to go off,
before an ACK is sent. Both Nagle’s algorithm and delayed acknowledgments opti-
mize bandwidth utilization and normally both are active. In LAM/MPI, Nagle's algorithm is disabled (i.e., TCP_NODELAY is set) to reduce message latency. Ideally, to achieve minimum latency, Nagle's algorithm and delayed acknowledgments
should both be disabled. We investigated using different combinations of these two
parameters. These parameters are not independent, as reported in [47], and this
work proposes a modification to Nagle's algorithm that can improve latency and
also protect against transmission of an excessive number of short packets. Another
factor that can impact performance is the difference between various TCP stacks. It
has been shown that Apache web servers, which have the TCP_NODELAY option set on all their sockets, perform worse on Linux than on FreeBSD. FreeBSD-derived stacks have byte-oriented ACKs, whereas Linux uses packet-oriented ACKs. Linux asks for an acknowledgment after the first packet, while in FreeBSD the total size, and not the number of packets, matters, and it asks for an acknowledgment only after several packets [60]. This
enhancement is also part of SCTP’s congestion control mechanism. In SCTP, the
congestion window is increased based on the number of bytes acknowledged, not the
number of acknowledgments received [1].
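For reference, disabling Nagle's algorithm is a one-line socket option, as the sketch below shows; delayed acknowledgments, by contrast, are typically controlled system-wide (e.g., by a sysctl on FreeBSD) rather than per socket:

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

int disable_nagle(int sd)
{
    int on = 1;  /* TCP_NODELAY set means Nagle's algorithm is off */
    return setsockopt(sd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}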
3.2 Investigating the Use of TCP for MPI in WANs
For evaluating the use of TCP for MPI, we used the NAS benchmarks NPB-MPI
version 3.2 and LAM version 7.0.6 as the MPI middleware.
As a proof of concept and the first step in understanding the problems of us-
ing TCP for MPI in environments with high latency and loss, we explored different
environments for our experiments. Our first attempt was to use PlanetLab [51] which
provides a world-wide shared distributed compute environment. We attempted to
run MPI programs on a set of machines in North America and Asia. Although
they did run, as expected we found a huge variation in times due to network and
processing delays. Moreover, the timing information from PlanetLab was not reli-
able and the results obtained were not consistent. We also investigated the use of
EmuLab [16] as an experimental testbed for this work. However, the time required
to run the NAS benchmarks was often several days; therefore, we decided in favor of
using a dedicated collection of machines and created an emulated wide-area network
environment that uses Dummynet to vary packet loss and packet latency.
3.3 MPI in an Emulated WAN Environment
Our experimental setup consisted of four FreeBSD-5.3 nodes connected together via
a layer-two switch using 100 Mbit/s Ethernet connections. Two of the nodes used were Pentium 4 2.8 GHz machines, one was a Pentium III 558 MHz and one a Pentium II 400 MHz.
These nodes represented a mix of fast and slow machines that introduced some
measure of heterogeneity in our network, which is something that would be expected
while executing MPI programs over WANs.
We experimented with six of the eight NAS benchmarks; we did not use
the FT benchmark because it required Fortran 90 support and we did not use BT
because a single run took several days to complete on our setup. We investigated
both small (S, W) and large problem sizes (B).
Dummynet was configured on each of the nodes to allow us to vary the
amount of loss and latency on the links between the nodes. TCP is used as the
underlying transport mechanism in LAM and our goal was to determine how it
would perform under varying loss rates and latencies similar to those encountered
in WANs. We experimented with varying the latencies between 0 and 50 millisec-
onds and packet loss rates between 0% to 10%. The two different variants we
investigated were New-Reno and SACK. As reported by Floyd, New-Reno is widely
deployed and is used in 77% of the web servers that could be classified [45]. Since
we were investigating loss, we also chose TCP SACK as well. It is well-known that
TCP SACK performs better than Reno under loss conditions where TCP SACK’s
selective acknowledgment can avoid some retransmissions under packet loss. Finally,
we experimented with the socket settings for TCP NODELAY and delayed acknowledg-
ments.
3.3.1 Communication Characteristics of the NAS Benchmarks
The communication characteristics of the NAS parallel benchmarks were discussed
in Section 2.2.3 and it was noted that with the exception of the EP benchmark,
the remaining benchmarks use both collective and point-to-point communication.
EP uses very few communication calls; however, in the remaining benchmarks, the communication calls range from 1,000 to 50,000 for dataset B. Tables in [20] show
that the communication volume ranges from less than 12 Kbytes for EP to several
gigabytes of data for the other benchmarks.
As discussed in Section 2.2.2, MPI middleware keeps track of two types of
messages: expected messages and unexpected messages. There is slightly more
overhead for processing unexpected messages; however, a large number of expected
messages indicates that the application may be stalled waiting for a message to
arrive. We instrumented the LAM code to determine the number of expected and
unexpected short and long messages generated during the execution of the bench-
marks. A breakdown of different message types for classes S, W, A and B is shown
in Table 3.1 for the LU benchmark and in Table 3.2 for the SP benchmark.
              Expected            Unexpected
Dataset    Long     Short      Long     Short
S             0     1,097         0         0
W             0    18,978         0         0
A           300    31,145       208         0
B           254    50,221       251         0

Table 3.1: Breakdown of message types for LU classes S, W, A and B using the default LAM settings for TCP
LU and SP benchmarks are two of the most communication intensive bench-
marks in the set and they were chosen because of the different characteristics of their
communication calls and the difference in the type of messages exchanged during
their execution. LU uses only blocking send calls (MPI_Send) apart from collective communications, whereas SP uses non-blocking MPI send calls (MPI_Isend).

              Expected            Unexpected
Dataset    Long     Short      Long     Short
S             0     1,216         0         0
W             0     4,422         0         0
A         2,807         9     2,011         0
B         1,314         9     3,504         0

Table 3.2: Breakdown of message types for SP classes S, W, A and B using the default LAM settings for TCP
For larger datasets like A and B, LU mostly communicates using short messages,
whereas the number of short messages in SP is almost non-existent compared to
long messages.
As mentioned, we used a heterogeneous environment with two faster ma-
chines and two slower machines. The numbers in these tables are for the two
slower machines. For the faster machines, a greater proportion of long messages
were expected messages. These numbers demonstrate some of the problems in un-
derstanding the performance of these benchmarks, especially in a heterogeneous
environment. The fact that the vast majority of messages are expected indicates
that the application is very likely spending time waiting for messages to arrive. This
may indicate the benchmark is sensitive to message latency.
3.4 Experiments
In the following sections we describe the results of our experiments with varying loss
and delay for six of the eight NAS benchmarks for different problem sizes.
3.4.1 NAS Benchmarks, Loss
Experiments were performed with the NAS benchmarks where Dummynet was used
to introduce random packet loss of 1%, 2% and 10% between each of the four nodes.
For the baseline case, the experiments were also run when there was no loss on
the links. The benchmarks report performance by three measurements: the total
amount of time taken, the Mops/s total value and Mops/s per process. The Mops/s total value is linearly related to the total time.
Figure 3.1 shows the performance of TCP SACK for dataset B under different
loss rates. Performance of TCP New-Reno showed a similar trend; therefore, we did not include its graph.

Figure 3.1: NAS benchmarks performance for dataset B under different loss rates using TCP SACK (Nagle's algorithm disabled)

As expected, the performance falls with increasing loss. The
degradation in performance is more marked for the LU benchmark than others.
The LU benchmark contains a large number of neighbor-to-neighbor MPI_Send calls, which are for the most part short and expected messages. As a result, LU is sensitive
to the large delays that occur when packets are lost and need to be retransmitted.
The only benchmark not affected by loss is EP, which has only a few collective
communication calls. The reduction in bandwidth due to TCP’s congestion control
mechanism does not seem to be affecting performance.
Our results for 0%, 1% and 2% on all benchmarks showed that TCP SACK
always performed better than or equal to TCP New-Reno for all six NAS measurements.
Our results are consistent with the results in the literature that show that TCP
SACK under loss performs better than TCP New-Reno. As discussed in Fall and
Floyd’s paper [19], New-Reno does not perform as well as TCP SACK when multiple
packets are dropped from a window of data. During Fast Recovery, New-Reno is
limited to retransmitting at most one dropped packet per round trip time, which
affects its performance.
3.4.2 NAS Benchmarks, Latency
In order to study network latencies and their effect on MPI applications, we ran the NAS
benchmarks with 10ms and 50ms latencies. The increased latencies were introduced
equally on all network links.
As shown in Figure 3.2, the performance is similar to that obtained in Fig-
ure 3.1 for loss. This is further evidence that the major effect of loss on performance is increased latency.

Figure 3.2: NAS benchmarks performance for dataset B with different latencies using TCP SACK (Nagle's algorithm disabled)

It is surprising that there is no significant difference in behavior
between increasing the latency for all messages versus loss, which increases delay for
the small number of lost messages. This effect may be due to the loosely synchronous
nature of the computation where delaying one message delays them all.
3.4.3 NAS Benchmarks, Nagle's Algorithm and Delayed Acknowledgments
The last characteristic of TCP that we investigated was the enabling of Nagle’s
algorithm and delayed acknowledgments. Typically, Nagle’s algorithm and delayed
acknowledgments are both enabled on a connection. Commonly, implementations
of MPI, including LAM, disable Nagle’s algorithm by default, and retain the system
default of delayed acknowledgments. The results for different settings of these two
parameters for the LU benchmark are shown in Figure 3.3.
Figure 3.3: Performance of LU using different combinations of Nagle's algorithm and delayed acknowledgments (Mops/s for datasets S, W and B)
For LU, Figure 3.3 shows the significant improvement obtained by disabling
Nagle’s algorithm for small problem sizes. This advantage is less evident for larger
problem sizes. Similar observations were made for the SP benchmark. Although
the difference is not large, we consistently found that LAM’s default setting of dis-
abling Nagle’s algorithm and leaving delayed acknowledgments enabled gave the
best performance. In particular, it outperformed the case where both are disabled,
which should have achieved the smallest latency. Surprisingly, even though the
performance of the other combinations is close, the tcpdump trace files show very
different traffic patterns. The traces with Nagle disabled and no delayed acknowl-
edgments had a lot of variation in the RTT (round trip time) and faster recovery
during congestion avoidance, since acknowledgments return more quickly. We ex-
pected the setting with Nagle disabled and no delayed acknowledgments to give the
least amount of delay since segments and ACKs are sent out immediately. However,
as Figure 3.3 shows, this does not result in improved performance. Once again the
synchronous nature of the computation may imply that it is only the maximum
delays that matter.
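For concreteness, the following sketch shows the conventional way of disabling Nagle's algorithm on a connected TCP socket, which is the setting LAM applies by default; the helper name is ours, and delayed acknowledgments are typically controlled system-wide (for example, via the net.inet.tcp.delayed_ack sysctl on FreeBSD) rather than per socket.

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Disable Nagle's algorithm so that small MPI messages are sent
     * immediately instead of being coalesced by the kernel. */
    static int disable_nagle(int sd)
    {
        int on = 1;
        return setsockopt(sd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
    }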
Figure 3.4: Comparison of LU and SP benchmarks for different datasets (Loss=0%, SACK ON, Nagle OFF, Delayed ACKs ON)

Figure 3.4 shows the performance of LU and SP for problem sizes W, S and
B using the best set of parameters (SACK, Nagle disabled, and delayed acknowledg-
ments turned on). As Figure 3.4 shows, there is a significant difference between the
behavior of LU and SP across the problem set sizes. SP shows a significant drop in
performance for dataset B compared to LU. We investigated this further to find that
the major difference between S, W and B was the number of long messages used in
these benchmarks. For problem size B, the number of long messages ranged from
almost none in LU to almost all of the approximately 5000 messages exchanged in
SP. This shows the dramatic impact that the MPI middleware and the use of mes-
sage rendezvous for long messages can have on performance. SP uses non-blocking
MPI_ISEND but is not able to begin transferring data until the rendezvous point is
reached.
3.5 Summary of Results
Understanding the dynamics of TCP is difficult because TCP contains a range of
flow control and congestion avoidance mechanisms that result in varied performance
in different environments. There are also a number of different TCP variants. This
chapter highlights the performance problems with using TCP for MPI in WANs,
especially taking into account the TCP variants and settings typically used for
MPI in local and wide-area networks. We explored the performance of TCP in
an emulated WAN environment to better quantify the differences. TCP SACK
consistently outperformed TCP New-Reno and this points to the importance of
better congestion control for such environments. For NAS benchmarks, as expected,
there was a considerable drop in performance as the packet loss increased from 0%
to 10%. In the NAS benchmarks, for the latencies investigated, the majority of the
messages were expected and thus potentially very sensitive to variation in latencies.
We found that, for several of the benchmark programs, performance is determined
by the maximum RTT across all of the connections. Even a modest loss in one
link creates spikes in the RTT that in the end slow down all the processes. The
sensitivity of an application to latency results in the overall performance being only
as fast as the slowest process.
In terms of Nagle’s algorithm, the default configuration in LAM-MPI achieved
the best results. There was not much of a difference between the best results and
those obtained for other combinations except for TCP’s default setting: Nagle en-
abled with delayed acknowledgments. We found that the rendezvous mechanism
used by LAM can have a much greater impact on performance than adjusting the
Nagle and delayed acknowledgment settings.
Our experiments with TCP emphasize the importance of latency tolerant
MPI programs for wide-area networks and the ability to recover quickly from loss.
In Chapter 4, we present SCTP-based middleware for MPI and discuss many of the
features of SCTP, like multistreaming, multihoming, better congestion control and
a built-in SACK mechanism, that make it more resilient and better suited to wide-
area networks. Due to SCTP’s similarity with TCP, it will also be able to take
advantage of research on tuning of individual connections and other performance
enhancements.
Chapter 4
SCTP-based Middleware:
Design and Implementation
In this chapter we discuss the role that SCTP can play as the underlying trans-
port layer for MPI. The advantage of using MPI in distributed environments is that
its libraries exist for a wide variety of machines and environments ranging from
dedicated high performance architectures, local area networks, compute grids, to
execution across wide-area networks and the Internet. As a standard it has been
successful in creating a body of code that is portable to any of these environments.
However, the strict definition of portability is insufficient since scalability and per-
formance must also be considered. It is challenging to obtain good performance in
environments such as wide-area networks where bandwidth and latency vary. Our
experiments in Chapter 3 with TCP underscore the importance of latency tolerant
MPI programs for the wide-area environments. Programs that overlap communica-
tion with computation can hide much of the effects of the latency that is inevitable
in such environments.
There are several reasons why SCTP may be more suitable than TCP for
MPI in general, and why it can also prove to be more robust in wide-area environments.
1. As we discuss in detail in this chapter, SCTP has many similarities with MPI,
which allows us to offload much of MPI's functionality onto the transport layer.
2. SCTP’s multistreaming ability allows us to provide additional concurrency
and avoid the head-of-line blocking that is possible in TCP.
3. Congestion control and SACK in SCTP are more extensive than in TCP and
allow more efficient communication under loss conditions.
4. SCTP’s support for multihoming makes it more fault tolerant - an important
property in wide-area networks. It can take advantage of multiple alternate
paths between a pair of nodes and can recover from lost segments faster than
TCP.
In this thesis we investigate all of the above advantages of SCTP, except
multihoming. We first discuss the details of redesigning the LAM TCP RPI module
to use SCTP as its underlying transport protocol. We implemented three versions
of an SCTP-based RPI module to investigate the variety of new features that SCTP
offers and to explore ways in which MPI can
benefit from a new transport protocol. The design and implementation of our SCTP
module was a three-phased process where we iteratively added functionality to each
successive module. Each module was refined as we learned from our experiences
until we achieved the final version, which assimilates the benefits of the previous
modules.
One of the first challenges we faced in redesigning the TCP RPI mod-
ule was the lack of documentation about the module and its interaction with other
parts of LAM. The TCP RPI module maintains state information for all requests
such as how far the request has progressed and what type of data/event is re-
quired for it to reach completion. It has a complex state machine for supporting
asynchronous and non-blocking message progression in MPI. The only way to learn
about how it worked was by thorough code examination. In the process, we also
created a detailed technical document explaining the module’s working that other
people can benefit from [33]. This document is publicly available and is linked off
the LAM/MPI official website. We also extensively instrumented LAM code to
trace request progression and other key events using lam_debug statements. The
lam_debug statements allowed us to tune the level of verbosity in the output and
were used to create our own custom diagnostic system. During our initial learning
process and also during implementation and testing we made extensive use of the
diagnostic output to trace problems. Tracing problems in the execution of parallel
programs was quite challenging at times, especially when the output was not always
reproducible and when slight changes in timing of events produced different results.
In many instances, painstaking examination of the diagnostic traces had to be done
to identify the source of errors.
Another challenge was the use of a new transport protocol that is still in its
early stages of optimization and tuning. We found that the performance of SCTP was
highly dependent on the implementation used, with the Linux lksctp implementation
performing much worse than the FreeBSD KAME implementation. Moreover, while
using different features of the SCTP API, there were several coding nuances that we
had to discover for ourselves as not much documentation was available. During our
testing, the SCTP module also provided a testing ground for the FreeBSD KAME
SCTP implementation where we discovered several problems with the protocol that
led to improvements in the stack. We collaborated closely with Randall Stewart,
one of the primary designers of SCTP, in the identification of those problems.
In the subsequent sections we describe the design and implementation details
of the final version of the SCTP module, since it incorporates the features of previous
versions, and also compare it to the LAM-TCP design. Our iterative design process
is discussed in more detail in [34]. We first discuss why SCTP is a good match for
MPI as its underlying transport protocol. In Sections 4.2 and 4.3 we discuss how
multiple streams are used in our module and the enhanced concurrency obtained as
a result. In Section 4.4 we discuss resource management issues in the middleware
and provide a comparison with LAM-TCP. In Section 4.5 we describe the race
conditions in our module and the solution adopted to fix them. In Section 4.6 we
discuss enhanced reliability and fault tolerance in our module due to the features
provided by SCTP. In the last section we discuss some limitations of using SCTP.
4.1 Overview of using SCTP for MPI
SCTP promises to be particularly well-suited for MPI due to its message-oriented
nature and provision of multiple streams in an association.
As shown in Figure 4.1, there are some striking similarities between SCTP
and MPI.

    MPI                 SCTP
    Context             One-to-Many Socket
    Rank / Source       Association
    Message Tags        Streams

Figure 4.1: Similarities between the message protocol of MPI and SCTP

Contexts in an MPI program identify a set of processes that communicate
with each other, and this grouping of processes can be represented as a one-to-many
socket in SCTP that establishes associations with that set of processes. SCTP can
map each association to the unique rank of a process within a context and thus use
an association number to determine the source of a message arriving on its socket.
Each association can have multiple streams which are independently ordered and
this property directly corresponds with message delivery order semantics in MPI.
In MPI, messages sent with different tag, rank and context (TRC) to the same
receiver are allowed to overtake each other. This permits direct mapping of streams
to message tags.
This similarity between SCTP and MPI is workable at a conceptual level;
however, there are some implementation issues, to be discussed, that make the
socket-context mapping less practical. Therefore, we preserved the mappings from
associations to ranks and from streams to message tags but not the one from socket
to context in our implementation. Context creation within an MPI program can be
a dynamic operation and if we map sockets to contexts, this would require creating
a dynamic number of sockets during the program execution. Not only would this
add complexity and additional bookkeeping to the middleware, but there can also be
an overhead to managing a large number of sockets. Creating a lot of sockets in
the middleware counteracts the benefits we can get from using a single one-to-many
SCTP socket that can send/receive messages from multiple associations. Due to
these reasons, instead of mapping sockets to contexts, the context and tag pair was
used to map messages to streams.
In discussions with Randall Stewart we discovered another way of dealing
with contexts, and that is to map them to the PID (Payload Protocol Identifier) carried
in each SCTP DATA chunk. The PID is a 32-bit field that can be used at
the application level to label the contents of an SCTP packet and is ignored by the
SCTP layer. Using the PID field gives us the flexibility of dynamic context creation
in our application without the need for maintaining a corresponding number of
sockets. Also, the PID mapping can be easily incorporated in our module, with
minor modifications.
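As a sketch of this alternative, the sender can stamp each message with the MPI context through the PID argument of sctp_sendmsg; the wrapper below is illustrative and not part of our module.

    #include <arpa/inet.h>
    #include <netinet/sctp.h>
    #include <sys/socket.h>

    /* Send one MPI message, labelling it with the MPI context in the
     * PID field and selecting an SCTP stream from the message tag.
     * The SCTP layer ignores the PID, so the receiving middleware can
     * recover the context without a separate socket per context. */
    ssize_t send_with_context(int sd, const void *buf, size_t len,
                              struct sockaddr *to, socklen_t tolen,
                              uint32_t mpi_context, uint16_t stream)
    {
        return sctp_sendmsg(sd, buf, len, to, tolen,
                            htonl(mpi_context), /* PID, network byte order */
                            0,                  /* flags */
                            stream,             /* stream number */
                            0,                  /* timetolive: never expire */
                            0);                 /* opaque sender context */
    }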
4.2 Message Demultiplexing
As discussed in Section 2.1, SCTP provides a one-to-many UDP-like style of com-
munication that allows a single socket descriptor to receive messages from multiple
associations, eliminating the need for maintaining a large number of socket descrip-
tors. In the case of LAM’s TCP module there is a one-to-one mapping between
processes and socket descriptors, where every process creates individual sockets for
each of the other processes in the LAM environment.
In SCTP’s one-to-many style, the mapping between processes and sockets
no longer exists. With one-to-many sockets there is no way of knowing when a
particular association is ready to be read from or written to. At any time, a process
can receive data sent on any of the associations through its sole socket descriptor.
SCTP’s one-to-many socket API does not permit reception of messages from a par-
ticular association; therefore, messages are received by the application in the order
they arrive and only afterwards is the receive information examined to determine
the association it arrived on. It is likely that the ability to peek at the association
information before receiving data may become part of later versions of the SCTP
API [57].
In our SCTP module, each message goes through two levels of demultiplexing:
first on the association number the message arrived on, and second on the stream
number within that association. These two parameters allow us to invoke the correct
state function for that request, which directs the incoming message to the proper
request receive buffer. If no matching receive was posted, it is treated like an
unexpected message and buffered.
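A minimal sketch of this receive path follows; the buffer size and the helper routines lookup_request, advance_request and buffer_unexpected are hypothetical stand-ins for the module's state machine.

    #include <netinet/sctp.h>
    #include <stdint.h>
    #include <string.h>

    #define RECV_BUF_SIZE 65536            /* illustrative buffer size */

    typedef struct request request_t;      /* middleware request record */
    extern request_t *lookup_request(sctp_assoc_t assoc, uint16_t stream);
    extern void advance_request(request_t *req, const char *data, size_t n);
    extern void buffer_unexpected(sctp_assoc_t assoc, uint16_t stream,
                                  const char *data, size_t n);

    static void progress_one_message(int sd)
    {
        char buf[RECV_BUF_SIZE];
        struct sctp_sndrcvinfo sinfo;
        int flags = 0;
        ssize_t n;

        memset(&sinfo, 0, sizeof(sinfo));
        n = sctp_recvmsg(sd, buf, sizeof(buf), NULL, NULL, &sinfo, &flags);
        if (n <= 0)
            return;                        /* EAGAIN or error: try later */

        /* First level: the association id identifies the sender (rank).  */
        /* Second level: the stream number identifies the (context, tag). */
        request_t *req = lookup_request(sinfo.sinfo_assoc_id,
                                        sinfo.sinfo_stream);
        if (req != NULL)
            advance_request(req, buf, (size_t)n);
        else
            buffer_unexpected(sinfo.sinfo_assoc_id, sinfo.sinfo_stream,
                              buf, (size_t)n);
    }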
4.3 Concurrency and SCTP streams
MPI send/receive calls without wildcards define an ordered stream of messages be-
tween two processes. It is also possible, through the use of wildcards or appropriate
tags, to relax the delivery order of messages. Relaxing the ordering of messages can
be used to make the program more message driven and independent from network
delay and loss. However, when MPI is implemented on top of TCP, the stream-
oriented semantics of TCP with one connection per process precludes having un-
ordered message streams from a single process. This restriction is removed in our
implementation by mapping tags onto SCTP streams, which allows different tags
from the same source to be independently delivered.
4.3.1 Assigning Messages to Streams
As discussed in Chapter 2, message matching in MPI is based on the tag, rank
and context of the message. In SCTP, the number of streams is a static parameter
(short integer value) that is set when an association is initially established. For each
association, we use a fixed-size pool of stream numbers, ten by default, that is
used for sending and receiving messages between the endpoints of that association.
Messages with different TRC are mapped to different stream numbers within an
association to permit independent delivery. Of course, since the number of streams
is fixed, the degree of concurrency achieved depends on the number of streams.
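The two halves of this scheme can be sketched as follows; requesting the stream count through SCTP_INITMSG is the standard socket API mechanism, while the particular hash from the context and tag pair is only an illustrative choice, not the exact code in our module.

    #include <netinet/sctp.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>

    #define STREAM_POOL_SIZE 10  /* default stream pool per association */

    /* Request STREAM_POOL_SIZE outbound/inbound streams for the
     * associations established on this one-to-many socket. */
    static int request_streams(int sd)
    {
        struct sctp_initmsg init;
        memset(&init, 0, sizeof(init));
        init.sinit_num_ostreams  = STREAM_POOL_SIZE;
        init.sinit_max_instreams = STREAM_POOL_SIZE;
        return setsockopt(sd, IPPROTO_SCTP, SCTP_INITMSG,
                          &init, sizeof(init));
    }

    /* Messages with different tag/context map to different streams so
     * they can be delivered independently; the same TRC always maps to
     * the same stream, preserving MPI's pairwise ordering semantics. */
    static uint16_t trc_to_stream(unsigned context, unsigned tag)
    {
        return (uint16_t)((context + tag) % STREAM_POOL_SIZE);
    }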
4.3.2 Head-of-Line Blocking
Head-of-line blocking can occur in the LAM TCP module when messages have
the same rank and context but different tags. As shown in Figure 4.2, we use
non-blocking communication between two processes that communicate by send-
ing/receiving several messages with different tags. The communication/computation
overlap depends on the amount of data that is sent without blocking for it to be
received, and continuing with useful computation. As Figure 4.2 shows, loss of a
single TCP segment can block delivery of all subsequent data in a TCP connection
until the lost segment is delivered. Our assignment of TRC to streams in the SCTP
module alleviates this problem by allowing the unordered delivery of these messages.
It is often the case that MPI applications are loosely synchronized and as
a result alternate between computation and bursts of communication. We believe
this characteristic of MPI programs makes them more likely to be affected by head-
of-line blocking in high loss scenarios, which underpins our premise that SCTP is
better suited as a transport mechanism for MPI than TCP on WANs.
Figure 4.2: Example of head-of-line blocking in LAM TCP module
4.3.3 Example of Head-of-Line Blocking
As an illustration of the benefits of our SCTP module, consider the two MPI pro-
cesses P0 and P1 shown in Figure 4.3. Process P1 sends two messages Msg-A and
Msg-B to P0 in order, using different tags. P0 does not care what order it receives
the messages and, therefore, posts two non-blocking receives. P0 waits for any of the
receive requests to complete and then carries out some computation. Assume that
part of Msg-A is lost during transmission and has to be retransmitted while Msg-B
arrives safely at the receiver. In the case of TCP, its connection semantics require
that Msg-B stay in TCP’s receive buffer until the two sides recover from the loss and
the lost parts are retransmitted. Even though the programmer has specified that the
messages can be received in any order, P0 is forced to receive Msg-A first and incur
increased latency as a result of TCP’s semantics.

Figure 4.3: Example of multiple streams between two processes using non-blocking communication

In the case of the SCTP module,
since different message tags are mapped to different streams and each stream can
deliver messages independently, Msg-B can be delivered to P0 and the process can
continue executing until Msg-A is required. SCTP matches the MPI semantics and
makes it possible to take advantage of the concurrency that was specified in the
program. The TCP module offers concurrency at process level, while our SCTP
module adds to this with enhanced concurrency at the TRC level.
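In MPI terms, the receiver P0 of Figure 4.3 behaves as in the following sketch; the tag values, buffer handling and the compute routine are placeholders.

    #include <mpi.h>

    extern void compute(char *buf);   /* placeholder for useful work */

    /* P0 from Figure 4.3: post both receives, then work with whichever
     * message arrives first. Over SCTP, tag-A and tag-B travel on
     * different streams, so loss of Msg-A does not delay Msg-B. */
    void receiver_p0(char *buf_a, char *buf_b, int len, int p1)
    {
        MPI_Request reqs[2];
        int done_index;

        MPI_Irecv(buf_a, len, MPI_CHAR, p1, /* tag-A */ 0,
                  MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(buf_b, len, MPI_CHAR, p1, /* tag-B */ 1,
                  MPI_COMM_WORLD, &reqs[1]);

        MPI_Waitany(2, reqs, &done_index, MPI_STATUS_IGNORE);
        compute(done_index == 0 ? buf_a : buf_b);

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }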
Even when blocking communication takes place between a pair of processes
and there is loss, there are still advantages to using SCTP. Consider Figure 4.4, where
blocking sends and receives are being used with two different tags. In SCTP, if Msg-A
is delayed or lost, and Msg-B arrives at the receiver, it is treated as an unexpected
message and buffered. SCTP, thus, allows the receive buffer to be emptied and
hence does not slow down the sender due to the flow control mechanism. In TCP,
however, Msg-B will occupy the socket receive buffer until Msg-A arrives.

Figure 4.4: Example of multiple streams between two processes using blocking communication
4.3.4 Maintaining State
The LAM TCP module maintains state for each process (mapped to a unique socket)
that it reads from or writes to. Since TCP delivers bytes in strict sequential order
and the TCP module transmits/receives one message body at a time per process,
there is no need to maintain state information for any other messages that may arrive
from the same process since they cannot overtake the message body currently being
read. In our SCTP module, this assumption no longer holds true since subflows
on different stream numbers are only partially ordered with respect to the entire
association. Therefore, we need to maintain state for each stream number that a
message can arrive on from a particular process. We only need to maintain a finite
amount of state information per association since we limit the possible number of
streams to our pool size of ten streams. For each stream number, the state holds
information about how much of the message has been read, what stage of progression
the request is in and what needs to be done next. At the time when an attempt
is made to read an incoming message, we cannot tell in advance from where the
message will arrive and how large it will be; therefore, we specify a length equal
to the socket receive buffer. SCTP’s receive function sctp_recvmsg does take care
of message framing by returning the next message and not the number of bytes
specified by the size field in the function call. This frees us from having to look
through the receive buffer to locate the message boundaries.
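The state kept per association can be sketched as follows; the field names are illustrative, not the actual definitions in our module.

    #include <netinet/sctp.h>
    #include <stddef.h>

    #define STREAM_POOL_SIZE 10

    /* Progress state kept for every (association, stream) pair, since
     * subflows on different streams are only partially ordered with
     * respect to the whole association. */
    typedef struct stream_state {
        size_t bytes_read;   /* how much of the current message is in  */
        size_t msg_length;   /* total expected length, once known      */
        int    stage;        /* which state function to invoke next    */
    } stream_state_t;

    typedef struct assoc_state {
        sctp_assoc_t   assoc_id;                  /* maps to a rank    */
        stream_state_t streams[STREAM_POOL_SIZE]; /* finite state set  */
    } assoc_state_t;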
4.4 Resource Management
LAM’s TCP RPI module uses TCP as its communication mechanism and employs
a socket based interface in a fully connected environment. It uses a one-to-one
connection-oriented scheme and maintains N sockets, one for each of the N processes
in its environment. The select system call is used to poll these sockets to determine
their readiness for reading any incoming messages or writing outgoing messages.
Polling of these sockets is necessary because operating systems like Linux and UNIX
do not support asynchronous communication primitives for sockets. It has been
shown that the select system call has a cost associated with it
that grows linearly with the number of sockets [44]. Socket design was originally
based on the assumption that a process would initiate a small number of connections,
and it is known that performance is affected if a large number of sockets are used.
Implications of this are significant in large-scale commodity clusters with thousands
of nodes, which are becoming more frequently used. Use of collective communications
in MPI applications also strengthens the argument, since nearly all connections
become active at once when a collective communication starts, and handling a large
number of active connections may result in performance degradation.
SCTP’s one-to-many communication style eliminates the need for maintain-
ing a large number of socket descriptors. In our implementation each process creates
a single one-to-many SCTP socket for communicating with the other processes in
the environment. An association is established with each of the other processes us-
ing that socket and since each association has a unique identification, it maps to a
unique process in the environment. We do not use the select system call to detect
events on the socket; instead, an attempt is made to retrieve messages at the socket
using sctp_recvmsg, as long as there are any pending receive requests. In a simi-
lar way, if there are any pending send requests, they are written out to the socket
using sctp_sendmsg. If the socket returns EAGAIN, signaling that it is not ready to
perform the current read or write operation, the system attempts to advance other
outstanding requests.
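A sketch of how such an endpoint is created; the one-to-many style uses SOCK_SEQPACKET, and the non-blocking flag is what makes the EAGAIN-driven progression described above possible. The helper name and port handling are illustrative.

    #include <fcntl.h>
    #include <netinet/in.h>
    #include <netinet/sctp.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Create the single one-to-many SCTP socket used to reach every
     * other process, and make it non-blocking so that a full buffer
     * returns EAGAIN and the middleware can advance other requests. */
    static int open_sctp_endpoint(unsigned short port)
    {
        struct sockaddr_in addr;
        int sd = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);
        if (sd < 0)
            return -1;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family      = AF_INET;
        addr.sin_port        = htons(port);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        if (bind(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return -1;
        if (listen(sd, 1) < 0)        /* enable inbound associations */
            return -1;

        fcntl(sd, F_SETFL, fcntl(sd, F_GETFL, 0) | O_NONBLOCK);
        return sd;
    }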
The use of one-to-many sockets in SCTP results in a more scalable and
portable implementation since it does not impose a strong requirement on the system
to manage a large number of socket descriptors and the resources for the associated
buffers. As discussed in [5], managing these resources can present problems. We
investigated a process’s limits for the number of socket descriptors it can manage
and the number of associations it can handle on a single one-to-many socket. In our
setup, a process reached its maximum socket descriptors’ limit much earlier than
the limit on the maximum number of associations it can handle. The number of file
descriptors in a system usually has a user limit of 1024 and needs root privileges
to be changed. Moreover, clusters can potentially execute parallel jobs from several
users, and need to have the capability of providing a number of sockets to each user.
4.5 Race Conditions
In this section we describe the race conditions in our module and the solutions
adopted to fix them.
4.5.1 Race Condition in Long Message Protocol
In the SCTP module it was necessary to change the long message protocol. This
occurred because even though SCTP is message-based, the size of message that can
be sent in a single call to the sctp_sendmsg function is limited by the send buffer size.
As a result, large messages had to be broken up into multiple smaller messages of size
less than that of the send buffer. These messages then had to be reassembled at the
receiving side.
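The sending side of this protocol can be sketched as follows; SEND_BUF_SIZE stands in for the socket send buffer size, and error handling is reduced to the essentials.

    #include <netinet/sctp.h>
    #include <stdint.h>
    #include <sys/socket.h>

    #define SEND_BUF_SIZE (220 * 1024)  /* placeholder send buffer size */

    /* Send a long MPI message as several smaller SCTP messages, all on
     * the same stream so the pieces arrive in order and can be
     * reassembled by the middleware on the receiving side. */
    static int send_long(int sd, struct sockaddr *to, socklen_t tolen,
                         const char *msg, size_t len, uint16_t stream)
    {
        size_t off = 0;
        while (off < len) {
            size_t chunk = len - off;
            if (chunk > SEND_BUF_SIZE)
                chunk = SEND_BUF_SIZE;
            if (sctp_sendmsg(sd, msg + off, chunk, to, tolen,
                             0, 0, stream, 0, 0) < 0)
                return -1;  /* EAGAIN handled by the caller's state machine */
            off += chunk;
        }
        return 0;
    }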
All pieces of the large message are sent out on the same stream number to en-
sure in-order delivery. As a refinement to this scheme, we considered interleaving
messages sent on different streams with portions of a message larger than the send
buffer size, at the time it is passed to the SCTP transport layer. Since reassembly
of the large message is being done at the RPI level, and not at the SCTP level, this
may result in reduced latency for shorter messages on other streams especially when
processes use non-blocking communication.
While testing our SCTP module we encountered race conditions that oc-
curred due to the rendezvous mechanism of the long message protocol. Consider
the case when two processes P0 and P1 are exchanging long messages using the
same stream numbers. Both of them would send rendezvous initiating envelopes for
their long messages to the other and wait for an acknowledgment before starting
the transmission of the long message body.
Figure 4.5: Example of long message race condition

As shown in Figure 4.5, after P1 receives the ACK for its long message Msg1,
it begins transmitting it. It is possible that P1 is successful in writing only part
of Msg1 to the socket, and it then saves its current state which would include the
number of bytes sent to P0. Process P1 would now continue and try to advance
its other active requests and would return to writing the rest of Msg1 later. During
that time P1 sends out an ACK for long message Msg2 to P0 using the same stream
number that Msg1 was being sent on. From the perspective of process P0, it was
in the middle of receiving Msg1 and the next message it receives is from the same
process and on the same stream number. Since P0 decides what function to call,
after reading a message from its socket, based on the association and stream number
it arrived on, it calls the function that was directing the bytes received to the buffer
designated for Msg1. As a result, P0 erroneously reads the ACK for Msg2 as part of
body of Msg1.
Option A
In order to fix the above race condition we considered several options, and the trade-
offs associated with them. The first option was to stay in a loop while sending Msg1
until all of it has been written out to the socket. One advantage of this is that the
rendezvous mechanism introduces latency overhead and once the transmission of the
long message body starts, we do not want to delay it any longer. The disadvantage
is that it greatly reduces the amount of concurrency that would otherwise have been
possible, since we do not receive from or send to other streams or associations. Also,
if the receiver of the long message had issued a non-blocking receive call, then it
might continue to do other computations and, as a result, sending the data in a
tight loop may simply cause the sender to stall waiting for the receiver to remove
the data from the connection. Another disadvantage that may arise is when the
receiver is slow and we would be overloading it by sending a lot of data in a loop
instead of advancing other requests on other streams or associations.
Option B
The second option was to disallow P1 from writing a different message on the
same stream number to the same process if a previous message writing operation
was still in progress. The advantage of this was simplicity in design, but we reduce the
amount of overlap that was possible in the above case if we allowed both processes
to exchange long messages at the same time. However, it is still possible to send
and receive messages from other streams or associations. This option is the one we
implemented in the SCTP module.
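Option B amounts to a small guard in the send path, sketched below with illustrative names; a per-(association, stream) flag refuses a new write while another message body is partially written.

    #define STREAM_POOL_SIZE 10

    struct stream_state {
        int write_in_progress;  /* a message body is partially written */
    };

    struct assoc_state {
        struct stream_state streams[STREAM_POOL_SIZE];
    };

    /* Option B: a new message may start on a stream only if no other
     * message to the same process is partially written on that stream,
     * so a rendezvous ACK can never be mistaken for message body. */
    static int try_start_write(struct assoc_state *as, unsigned stream)
    {
        if (as->streams[stream].write_in_progress)
            return 0;           /* caller advances other requests instead */
        as->streams[stream].write_in_progress = 1;
        return 1;
    }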
Option C
A third option was to treat acknowledgments used within LAM to synchronize
message exchange, such as those used in the long message rendezvous mechanism,
as control messages. These control messages would be treated differently from other
messages containing actual data. Whenever a control message carrying an ACK
arrives, the middleware would know it is not part of any unfinished message, e.g.,
the long message in our example, and would invoke the correct state function to
handle it. This option introduces more complexity and bookkeeping to the code,
but may turn out to be one that offers the most concurrency.
4.5.2 Race Condition in MPI_Init
There was one other race condition that occurred in implementing the SCTP mod-
ule. Because the one-to-many SCTP style does not require any of the customary
accept and connect function calls before receiving messages, we had to be careful
about MPI_Init to ensure that each process establishes associations with all the
other processes before exchanging messages. In order to ensure that all associations
are established before messages are exchanged, we implemented a barrier at the end
of our association setup stage and before the message reception/transmission stage.
This is especially important in a heterogeneous environment with varying network
and machine speeds. It is easier with TCP because the MPI_Init connection setup
procedure automatically takes care of this issue by its use of connect and accept
function calls.
4.6 Reliability
In general, MPI programs are not fault tolerant and communication failure typically
causes the programs to fail. There are, however, projects like FT-MPI [18], which
introduce different modes of failure that can be controlled by an application using
a modified MPI API. This work is discussed in Chapter 6. One source of failure is
packet loss in networks, which can result in added latency and reduced bandwidth
that severely impact the overall performance of programs. There are several mech-
anisms in SCTP that help reduce the number of failures and improve the overall
reliability of executing an MPI program in a WAN environment.
4.6.1 Multihoming Feature
MPI applications have a strong requirement for the ability to rapidly switch to an
alternate path without excessive delays in the event of failure. MPI processes in a
given application’s environment tend to be loosely synchronized even though they
may not all be communicating directly with each other. Delays occurring in one
process due to failure of the network can have a domino effect on other processes and
they can potentially be delayed as well. We saw some evidence of this phenomenon in
our experiments in Chapter 3. It would be highly advantageous for MPI processes if
the underlying transport protocol supports fast path failover in the event of network
failure.
SCTP’s multihoming feature provides an automatic failover mechanism where
a communication failure between two endpoints on one path will cause switchover
to an alternate path with little interruption to the data transfer service. Of course,
this is useful only when there are independent paths, but having multiple interfaces
on different networks is not uncommon and SCTP makes it possible to exploit the
possibility when it exists. TCP has no built-in support for multihomed IP hosts and
cannot take advantage of path-level redundancy. It could be accomplished by using
multiple sockets and managing them in the middleware but this introduces added
complexity to the middleware.
SCTP uses a heartbeat control chunk to periodically probe the reachability
of the idle destination addresses of an association and to update the path round
trip times of the addresses. The heartbeat period is a configurable option, modi-
fiable through a socket option, with a default value usually set to 30 seconds [59].
The heartbeat mechanism detects failure by keeping count of consecutive missed re-
sponses to the data and/or heartbeat chunks to that address. If this count exceeds
a configurable threshold, spp_pathmaxrxt, then that destination address is marked
as unreachable. By default spp_pathmaxrxt is set to a value of 5. The value of this
parameter represents a trade-off between speed of detection and reliability of detec-
tion. If the value is set too low, an overloaded host may be declared unreachable.
In addition to this, there are a range of user-adjustable controls that can be used
to adjust maximum and minimum retransmission timeouts (RTO) and hence tune
the amount of time it takes to determine network failures [55]. There are trade-offs
associated with selecting the value of these control parameters. Setting max RTO
too low, for example, can result in unnecessary retransmissions in the event of delay
in the network. Ideally, these control parameters need to be tuned to suit the nature
and requirements of the application and the network to ensure fast failover, but to
avoid unnecessary switchovers due to network delay. If we contrast this with the
TCP protocol, there is a lack of control over TCP retransmission timers and there
is no built-in mechanism for a failover.
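These controls are exposed through socket options; the sketch below follows the field names of the standard SCTP socket API (older stacks expose slightly different fields), and the specific values shown are illustrative rather than recommendations.

    #include <netinet/sctp.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Tune failover detection: probe idle paths more often, declare a
     * path unreachable after fewer missed heartbeats, and bound the
     * retransmission timeout. */
    static int tune_failover(int sd, sctp_assoc_t assoc)
    {
        struct sctp_paddrparams pp;
        struct sctp_rtoinfo rto;

        memset(&pp, 0, sizeof(pp));
        pp.spp_assoc_id   = assoc;    /* all paths of this association */
        pp.spp_flags      = SPP_HB_ENABLE;
        pp.spp_hbinterval = 5000;     /* heartbeat every 5 s (default ~30 s) */
        pp.spp_pathmaxrxt = 3;        /* failover after 3 misses (default 5) */
        if (setsockopt(sd, IPPROTO_SCTP, SCTP_PEER_ADDR_PARAMS,
                       &pp, sizeof(pp)) < 0)
            return -1;

        memset(&rto, 0, sizeof(rto));
        rto.srto_assoc_id = assoc;
        rto.srto_min = 100;           /* ms */
        rto.srto_max = 2000;          /* ms: cap worst-case retransmit delay */
        return setsockopt(sd, IPPROTO_SCTP, SCTP_RTOINFO,
                          &rto, sizeof(rto));
    }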
As a side note on multihoming, an SCTP association can span both IPv4 and
IPv6 addresses and can aid in the progression of the Internet from IPv4 to IPv6 [58].
4.6.2 Added Protection
SCTP has several built-in features that provide an added measure of protection
against flooding and masquerade attacks. They are discussed below.
As SCTP is connection oriented, it exchanges setup messages at the initiation
of the communication, for which it uses a robust four-way handshake as shown in
Figure 4.6. The receiver of an INIT message does not reserve any resources until
the sender proves that its IP address is the one claimed in the association setup
message.

Figure 4.6: SCTP’s four-way handshake (INIT, INIT-ACK, COOKIE-ECHO, COOKIE-ACK)

The handshake uses a signed state cookie to prevent use of IP spoofing for
a SYN flooding denial-of-service attack. The cookie is authenticated by making sure
that it has a valid signature and then a check is made to verify that the cookie is
not stale. This check guards against replay attacks [59]. Some implementations of
TCP also use a cookie mechanism, but those typically do not use signed cookies;
moreover, the cookie mechanism is supplied as an option and is not an integral
part of the TCP protocol, as is the case with SCTP. In TCP, a user has to validate
that both sides of the connection have the required implementation. SCTP’s four-
way handshake may be seen to be an overhead, however, if a one-to-many style
socket is used, then user data can be piggy-backed on the third and fourth leg of
the handshake.
Every SCTP packet has a common header that contains, among other things,
a 32-bit verification tag. This verification tag protects against two things: (1) It
prevents an SCTP packet from a previous inactive association from being mistaken
as a part of a current association between the same endpoints. (2) It protects
against blind injection of data in an active association [59]. A study [63]1 done on
TCP Reset attacks has shown that TCP is vulnerable to a denial-of-service attack
in which the attacker tries to prematurely terminate an active TCP connection.
The study has found that this type of attack can utilize the TCP window size
to reduce the number of sequence numbers that must be guessed for a spoofed
packet to be considered valid and accepted by the receiver. This kind of attack is
especially effective on long standing TCP flows such as BGP peering connections,
which if successfully attacked, would disrupt routing in the Internet. SCTP is not
susceptible to the reset attacks due to its verification tag [57].
SCTP is able to use large socket buffer sizes by default because it is not
subject to denial-of-service attacks that TCP becomes susceptible to with large
buffers [57].
SCTP also provides an autoclose option that protects against accidental
denial-of-service attacks where a process opens an association but does not send
any data. Using autoclose, we can specify the maximum number of seconds that
an association remains idle, i.e., no user traffic in either direction. After this time
the association is automatically closed [55]. There is another difference between
SCTP’s and TCP’s close mechanisms; TCP allows a “half-closed” state, which is
not allowed in SCTP. In the half-closed state, one side closes a connection and can-
not send data, but may continue to receive data until the peer closes its connection.
SCTP avoids the possibility of a peer failing to close its connection by disallowing
a half-closed state.
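The autoclose option itself is a single socket option on a one-to-many socket, as in the following sketch; the helper name is ours.

    #include <netinet/sctp.h>
    #include <sys/socket.h>

    /* Close idle associations automatically: if no user data moves in
     * either direction for 'seconds', the association is torn down,
     * guarding against peers that open associations and go silent. */
    static int set_autoclose(int one_to_many_sd, int seconds)
    {
        return setsockopt(one_to_many_sd, IPPROTO_SCTP, SCTP_AUTOCLOSE,
                          &seconds, sizeof(seconds));
    }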
In this section we have highlighted the additional security features that are
present in SCTP. These features are especially important in open network environ-
ments, such as the Internet, where security is an issue and therefore the use of SCTP
adds to the reliability of the environment for MPI.
1 Network transport protocols rarely hit the news headlines. The work by Paul Watson was widely reported in the media; The Vancouver Sun featured a front-page article titled “The man who saved the Internet” on April 22, 2004. This problem, however, had been identified earlier by the network community [57].
4.6.3 Use of a Single Underlying Protocol
One issue not directly related to SCTP was the mixed use of UDP and TCP in
LAM. LAM uses user-level daemons that by default employ UDP as their transport
protocol. These daemons serve several purposes:
1. Daemons act as external agents that have previously fulfilled the authentica-
tion requirements for connecting to other nodes and can enable fast execution
of the MPI programs.
2. They enable external monitoring of running jobs and carry out cleanup
when a user aborts an MPI process.
3. LAM also provides remote I/O via the LAM daemon processes.
We modified the transport protocol used by daemons in LAM to SCTP so
that we have a single underlying protocol while executing MPI programs instead
of a combination of protocols that do not offer the set of features we want to take
advantage of, such as the capability to multihome by all the components of the
LAM environment.
4.7 Limitations
As discussed in Section 4.5, SCTP has a limit on the size of message it can send
in a single call to sctp_sendmsg. This prevents us from taking full advantage of the
message-framing property of SCTP since our long message protocol divides a long
message into smaller messages and then carries out message framing at the middle-
ware level.
Both TCP and SCTP use a flow control mechanism where a receiver adver-
tises its available receive buffer space to the sender. Normally when a message is
received it is kept in the kernel’s protocol stack buffer until a process issues a read
system call. When large messages are exchanged in MPI applications and mes-
sages are not read out quickly, the receiver advertises less buffer space to the sender,
which slows down the sender due to flow control. An event driven mechanism that
gets invoked as soon as a message arrives is currently not supported in our module.
Some factors that can affect the performance of SCTP are as follows:
• SCTP uses a comprehensive 32-bit CRC32c checksum which is expensive in
terms of CPU time, while TCP typically offloads checksum calculations to
the network interface card (NIC). However, CRC32c provides much stronger
data verification at the cost of additional CPU time. It has been proven to be
mathematically stronger than the 16-bit checksums used by TCP or UDP [58].
It is likely that hardware support for CRC32c may become available in a few
years.
• TCP can always pack a full MTU, but SCTP is limited by the fact that
it bundles different messages together, which may not always fit together to fill
a full MTU. However, in our experiments this was not observed to be a factor
impacting the performance.
• TCP performance has been fine-tuned over the past decades; however, opti-
mization of the SCTP stack is still in its early stages and will improve over
time.
In this chapter we discussed the design and implementation issues of our
SCTP module in detail. We have focused on the need for latency tolerant MPI
programs for wide-area environments and described SCTP’s many features, which
make it more resilient and suitable for such environments. In Chapter 5 we carry
out a thorough experimental evaluation of the SCTP module, and compare its per-
formance to the LAM-TCP module.
Chapter 5
Experimental Evaluation
In this chapter we describe the experiments carried out to compare the performance
of our SCTP module with the LAM-TCP module for different MPI applications.
Our experimental setup consists of a dedicated cluster of eight identical Pentium-4
3.2GHz, FreeBSD-5.3 nodes connected via a layer-two switch using 1Gbit/s Ethernet
connections. Kernels on all nodes are augmented with the kame.net SCTP stack [36].
The experiments were performed in a controlled environment and Dummynet was
configured on each of the nodes to allow us to vary the amount of loss on the
links between the nodes. We compare the performance under loss rates of 0%, 1%
and 2% between each of the eight nodes. The cluster used for our experiments in
this chapter is larger and faster than the one used for evaluating TCP for MPI,
described in Chapter 3. The new cluster had become available to us at the time
we were ready to test our SCTP module, and provided the opportunity to carry
out more extensive experiments with better machines and a larger number of nodes.
There were NAS benchmarks, like BT, which could not be run on the older cluster,
as they were taking an inordinate amount of time to execute. Since the new cluster
reduced the overall time required for each experiment run, we were able to move
quickly. All our experiments were executed several times to verify the consistency
of our results.
In Section 5.1, we evaluate the performance of the SCTP module using two
standard benchmark programs. In Section 5.2, we look at the effect of using a latency
tolerant program that overlaps communication with computation. We compare the
performance using a real-world program that uses a master-slave communication
pattern commonly found in MPI programs. We also discuss the effects of head-of-
line blocking. In order to make the comparison between SCTP and TCP as fair
as possible, the following settings were used in all the experiments discussed in
subsequent sections:
1. By default, SCTP uses much larger SO_SNDBUF/SO_RCVBUF buffer sizes than
TCP. In order to prevent any possible effects on performance due to this
difference, the send and receive buffers were set to a value of 220 Kbytes in
both the TCP and SCTP modules (see the sketch after this list).
2. Nagle’s algorithm is disabled by default in LAM-TCP and this setting was
used in the SCTP module as well.
3. An SCTP receiver uses Selective Acknowledgments (SACK) to report any missing
data to the sender; therefore, the SACK option for TCP was enabled on all the
nodes used in the experiment.
4. In our experimental setup, the ability to multihome between any two endpoints
was available since each node is equipped with three gigabit interface cards
and three independent paths between any two nodes were available. In our
experiments, however, this multihoming feature was not used so as to keep the
network settings as close as possible to that used by the LAM-TCP module.
5. TCP is able to offload checksum calculations onto the NICs on our nodes and
thus has zero CPU cost associated with its calculation. SCTP has an expensive
CRC32c checksum which can prove to be a considerable overhead in terms of
CPU cost. We modified the kernel to turn off the CRC32c checksum in SCTP
so that this factor does not affect the performance results.
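The settings in items 1 and 2 correspond to standard socket options; a sketch of how they can be applied follows, with BUF_BYTES standing in for the 220 Kbyte value quoted above and the helper name ours.

    #include <netinet/in.h>
    #include <netinet/sctp.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    #define BUF_BYTES (220 * 1024)  /* buffer size used in our experiments */

    /* Apply the experimental settings to a socket: equal send/receive
     * buffers for both transports and Nagle's algorithm disabled. */
    static int apply_test_settings(int sd, int is_sctp)
    {
        int buf = BUF_BYTES, on = 1;
        if (setsockopt(sd, SOL_SOCKET, SO_SNDBUF, &buf, sizeof(buf)) < 0 ||
            setsockopt(sd, SOL_SOCKET, SO_RCVBUF, &buf, sizeof(buf)) < 0)
            return -1;
        if (is_sctp)
            return setsockopt(sd, IPPROTO_SCTP, SCTP_NODELAY,
                              &on, sizeof(on));
        return setsockopt(sd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
    }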
5.1 Evaluation Using Benchmark Programs
We evaluated our implementation using two benchmark programs: the MPBench ping-
pong test and the NAS parallel benchmarks. The NAS benchmarks approximate the
performance of real applications. These benchmarks provided a point of reference
for our measurements and the results are discussed below.
5.1.1 MPBench Ping-Pong Test
We first report the output obtained by running the MPBench [48] ping-pong test
with no message loss. This is a standard benchmark program that reports the
throughput obtained when two processes repeatedly exchange messages of a speci-
fied size. All messages are assigned the same tag. Figure 5.1 shows the throughput
obtained for different message sizes in the ping-pong test under no loss.

Figure 5.1: The performance of SCTP using a standard ping-pong test normalized to TCP under no loss

The through-
put values for the LAM-SCTP module are normalized with respect to LAM-TCP.
The results show that SCTP is more efficient for larger message sizes, however,
TCP does better for small message sizes. The crossover point is approximately at a
message size of 22 Kbytes, where SCTP throughput equals that of TCP. The SCTP
stack is fairly new compared to TCP and these results may change as the SCTP
stack is further optimized.
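For reference, the structure of such a ping-pong measurement is sketched below; this is our own minimal rendition, not MPBench's actual code.

    #include <mpi.h>

    /* Minimal ping-pong in the spirit of the MPBench test: rank 0 and
     * rank 1 bounce a message of 'len' bytes 'iters' times, all with
     * the same tag; throughput = 2 * len * iters / elapsed time. */
    void ping_pong(char *buf, int len, int iters)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
    }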
We also used the ping-pong test to compare SCTP to TCP under 1% and 2%
loss rates. Since the LAM middleware treats short and long messages differently,
we experimented with loss for both types of messages. In the experiments short
messages were 30 Kbytes and long messages were 300 Kbytes. For both short and
long messages under 1% and 2% loss, SCTP performed better than TCP and the
results are summarized in Table 5.1.
                          Throughput (bytes/second)
    MPI Message Size      Loss: 1%               Loss: 2%
    (bytes)               SCTP       TCP         SCTP       TCP
    30K                   54,779     1,924       44,614     1,030
    300K                   5,870     1,818        2,825       885

Table 5.1: The performance of SCTP and TCP using a standard ping-pong test under loss
SCTP shows that it is more resilient under loss and can perform rapid re-
covery of lost segments. Work done in [1] and [53] compares the congestion control
mechanisms of SCTP and TCP and shows that SCTP has a better congestion con-
trol mechanism that allows it to achieve higher throughput in error prone networks.
Some features of SCTP’s congestion control mechanism are as follows:
• Use of SACK is an integral part of the SCTP protocol, whereas, for TCP it is
an option available in some implementations. In those implementations SACK
information is carried in TCP options and is, therefore, limited to reporting at
most four TCP segments. In SCTP, the number of gap ACK blocks allowed
is much larger as it is dictated by the PMTU [59].
• Increase in the congestion window in SCTP is based on the number of bytes
acknowledged and not on the number of acknowledgments received [1]. This
allows SCTP to recover faster after fast retransmit. Also, SCTP initiates slow
start when the congestion window is equal to the slow start threshold. This
helps in achieving a faster increase in the congestion window.
• The congestion window variable cwnd can achieve full utilization because when
a sender has 1 byte of space in the cwnd and space available in the receive window,
it can send a full PMTU of data.
• The FreeBSD KAME SCTP stack also includes a variant called New-Reno
SCTP that is more robust to multiple packet losses in a single window [32].
• When the receiver is multihomed, an SCTP sender maintains a separate con-
gestion window for each transport address of the receiver because the conges-
tion status of the network paths may differ from each other. SCTP’s retrans-
mission policy helps in increasing the throughput, since retransmissions are
sent on one of the active alternate transport addresses of the receiver. The
effect of multihoming was not a factor in our tests.
In order to determine the performance overhead of SCTP’s CRC32c check-
sum, we re-ran the ping-pong test for short and long message sizes under no loss
with CRC32c checksum turned-on. The throughput, on average, degraded by 12%
compared to the case when the checksum was turned-off. In wide-area networks,
the effect of CRC32c checksums on overall performance is likely to be small since
communication time is dominated by network latency. The effect of the CRC32c
checksum was more noticeable in the ping-pong tests because the experiments were
performed on a low latency local area network. In general, although the time for
calculating the checksum does add a few microseconds to message latency, it is likely
to be small in comparison to network latency.
5.1.2 Using NAS Benchmarks
In our second experiment, we used the NAS parallel benchmarks (NPB version 3.2)
to test the performance of MPI programs with SCTP and TCP as the underlying
transport protocol. The characteristics of the NAS benchmarks have already been
discussed in Chapter 3. We tested the following seven of those benchmarks: LU, IS,
MG, EP, CG, BT and SP. Dataset sizes S, W, A and B were used in the experiments
with the number of processes equal to eight. The dataset sizes increase in the order S,
W, A, and B, with S being the smallest. We have done an analysis of the type of
messages being exchanged in these benchmarks and we have found that in datasets
‘S’ and ‘W’, short messages (as defined in LAM middleware to be messages smaller
than or equal to 64 Kbytes) are predominantly being sent/received. In datasets ‘A’
and ‘B’ we see a greater number of long messages (i.e., messages larger than 64
Kbytes) being exchanged.
Figure 5.2: TCP versus SCTP for the NAS benchmarks
Figure 5.2 shows the results for dataset size ‘B’ under no loss. The results for
the other datasets are not shown in the figure, but as expected from the ping-pong
test results, TCP does better for the shorter datasets. These benchmarks use single
tags for communication between any pair of nodes and the benefits of using multiple
streams in the SCTP module are not being utilized. The SCTP module, in this case,
reduces to a single stream per association and as the results show, the performance
on average is comparable to TCP. TCP shows an advantage for the MG and BT
benchmarks and we believe the reason for this is that these benchmarks use a greater
proportion of short messages in dataset ‘B’ than the other benchmarks.
5.2 Evaluation Using a Latency Tolerant Program
In this section we compare the performance of the SCTP module with LAM-TCP
using a latency tolerant program. As our discussion in the previous chapters showed,
in order to take advantage of highly available, shared environments that have large
delays and loss, it is important to develop programs that are latency tolerant and
capable of overlapping computation with communication to a high degree. Moreover,
SCTP’s use of multiple streams can provide more concurrency. Here, we investigate
the performance of such programs with respect to TCP and also discuss the effect
of head-of-line blocking in these programs.
5.2.1 Comparison Using a Real-World Program
In this section we investigate the performance of a realistic program that makes use
of multiple tags. Our objectives were two-fold: first, we wanted to evaluate a real-
world parallel application that is able to overlap communication with computation,
and second, to examine the effect of introducing multiple tags which, in the case of
the SCTP module, can map to different streams. We describe the experiments
performed using an application which we call the Bulk Processor Farm program.
The communication characteristics of this program are typical of real-world master-
slave programs.
The Bulk Processor Farm is a request driven program with one master and
several slaves. The slaves ask the master for tasks to do, and the master is responsible
for creating tasks, distributing them to slaves and then collecting the results from
all the slaves. The master services all task requests from slaves in the order of their
arrival. Each task is assigned a different tag, which represents the type of that task.
There is a maximum number of different tags (MaxWorkTags) that can be distributed
at any time. The slaves can make multiple requests at the same time and when they
finish doing some task they send a request for a new one, so at any time, each of the
slaves has a fixed number of outstanding job requests. This number was chosen to
be ten in our experiments. The slaves use non-blocking MPI calls to issue send and
receive requests and all messages received by the slaves are expected messages. The
slaves use MPI_ANY_TAG to show that they are willing to perform a task of any type.
The master has a total number of tasks (NumTasks) that need to be performed
before the program can terminate. In our experiments we set that number to 10,000
tasks. We experimented with tasks of two different sizes; short tasks equal to 30
Kbytes and long tasks equal to 300 Kbytes. The size of the task represents the
message sizes that are exchanged between the master and the slave.
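The slave side of this protocol can be sketched as follows; process_task, the termination tag and the request tags are placeholders, and the request sends are shown as blocking calls for brevity even though the program uses non-blocking MPI calls.

    #include <mpi.h>

    #define OUTSTANDING 10            /* outstanding job requests per slave */
    #define TASK_BYTES  (30 * 1024)   /* short-task size from our experiments */
    #define DONE_TAG    0             /* hypothetical termination tag */

    extern void process_task(const char *task, int type);  /* placeholder */

    /* Keep OUTSTANDING receives posted with MPI_ANY_TAG, process
     * whichever task arrives first, then request a replacement task. */
    void slave_loop(int master)
    {
        static char tasks[OUTSTANDING][TASK_BYTES];
        MPI_Request reqs[OUTSTANDING];
        MPI_Status st;
        int idx;

        for (int i = 0; i < OUTSTANDING; i++) {
            MPI_Irecv(tasks[i], TASK_BYTES, MPI_CHAR, master, MPI_ANY_TAG,
                      MPI_COMM_WORLD, &reqs[i]);
            MPI_Send(NULL, 0, MPI_CHAR, master, i + 1, MPI_COMM_WORLD);
        }

        for (;;) {
            MPI_Waitany(OUTSTANDING, reqs, &idx, &st);
            if (st.MPI_TAG == DONE_TAG)
                break;
            process_task(tasks[idx], st.MPI_TAG);
            /* ask for a new task and re-post this receive slot */
            MPI_Send(NULL, 0, MPI_CHAR, master, st.MPI_TAG, MPI_COMM_WORLD);
            MPI_Irecv(tasks[idx], TASK_BYTES, MPI_CHAR, master, MPI_ANY_TAG,
                      MPI_COMM_WORLD, &reqs[idx]);
        }
    }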
In the case of the SCTP module, the tags will be mapped to streams and, in case
of congestion, if a message sent to a slave is lost, then it would still be possible for
messages on the other streams to be delivered and the slave program can continue
working without blocking on the lost message. In LAM-TCP, however, since all
messages between the master and a slave have to be delivered in order, there would
be less overlap of communication with computation. The experiments were run at
loss rates of 0%, 1% and 2%. The farm program was run six times for each of the
different combinations of loss rates and message sizes and the average value of total
run-times are reported. The average and the median values of the multiple runs
were very close to each other. We also calculated the standard deviation of the
average value, and found the variation across different runs to be very small.
Figure 5.3: TCP versus SCTP for (a) short and (b) long messages for the Bulk Processor Farm application. Average total run times (seconds) at 0%, 1% and 2% loss: short messages, LAM_SCTP 6.8, 7.7, 11.2 versus LAM_TCP 5.9, 79.9, 131.5; long messages, LAM_SCTP 83, 804, 1595 versus LAM_TCP 114, 2080, 4311.

Figure 5.3 shows the comparison between LAM-SCTP and LAM-TCP for
short and long message sizes under different loss rates. As seen in Figure 5.3, SCTP
outperforms TCP under loss. For short messages, LAM-TCP run-times are 10 to 11
times higher than that of LAM-SCTP at loss rates of 1 to 2%. For long messages,
LAM-TCP is slower than LAM-SCTP by 2.58 times at 1% and 2.7 times at 2%
loss. Although SCTP performs substantially better than TCP for both short and
long messages, the fact that the difference is more pronounced for short messages
is very positive. MPI implementations, typically, try to optimize short messages for
latency and long messages for bandwidth [4]. In our latency tolerant application,
the slaves, at any time, have a number of pre-posted outstanding receive requests.
Since short messages are sent eagerly by the master, they are copied to the correct
receive buffer as soon as they are received. Since messages are being transmitted
on different streams in LAM-SCTP, there is less variation in the amount of time
the slave might have to wait for a job to arrive, especially under loss conditions,
and this results in more overlap of communication with computation. The rendezvous
mechanism in the long messages introduces synchrony to the transmission of the
messages, and the payload can only be sent after the receiver has acknowledged its
readiness to accept it. This reduces the amount of overlap possible, compared to
the case when messages are sent eagerly. The long messages are also more prone to
be affected by loss by virtue of their size. Long messages would typically be used for
bulk transfers, and the cost of the rendezvous is amortized over the time required
to transfer the data.
We also introduced a tunable parameter to the program described above,
which we call Fanout. Fanout represents the number of tasks a master will send to
a slave in response to a single task request from that slave. In Figure 5.3 the Fanout
is 1.
We experimented with a Fanout value of 10 as shown in Figure 5.4. We
anticipated that Fanout=10 would create more opportunities for head-of-line
blocking in the LAM-TCP case, since ten tasks are sent to a slave in response
to one task request and these larger bursts are more likely to be affected by
loss. Figure 5.4 shows that with the larger Fanout, the average run-time for
TCP increases substantially for long messages, while there are only slight
changes for short messages. One possible explanation for this behavior is
TCP's flow-control mechanism. Ten long tasks are sent in response to each
job request, and under loss TCP blocks delivery of all subsequent messages
until the lost segment is recovered. If the receiver does not empty the receive
buffer quickly enough, the sender is forced to slow
down. Moreover, as discussed in Section 5.1.1, SCTP's better congestion control also
aids in faster recovery of lost segments, and this is also reflected in all our results.
Also, we expect the performance difference between TCP and SCTP to widen when
multihoming is present and retransmissions are sent on an alternate path.

Figure 5.4: TCP versus SCTP for (a) short and (b) long messages for the Bulk
Processor Farm application using a Fanout of 10. Total run time in seconds at
0%, 1%, and 2% loss: (a) short: LAM_SCTP 8.7, 11.7, 16.0; LAM_TCP 6.2,
88.1, 154.7. (b) long: LAM_SCTP 79, 786, 1585; LAM_TCP 129, 3103, 6414.
5.2.2 Investigating Head-of-line Blocking
In the experiments presented so far, we have shown that SCTP performs better
than TCP under loss, with two main factors affecting the results:
improvements in SCTP’s congestion control mechanism, and the ability to use mul-
tiple streams to reduce head-of-line blocking. In this section we examine the per-
formance improvement obtained as a result of using multiple streams in our SCTP
module. In order to isolate the effects of head-of-line blocking in the farm program,
we created another version of the SCTP module, one that uses only a single stream
to send and/or receive messages irrespective of the message tag, rank and context.
In all other respects it was identical to our multiple-stream SCTP module.
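The difference between the two module variants amounts to the stream number passed to the SCTP send call. The sketch below, using the standard sctp_sendmsg() interface, shows the idea; NUM_STREAMS and the simple modulo mapping are illustrative assumptions rather than the module's exact code.

    /* Sketch: choosing an SCTP stream for an outgoing MPI message;
     * NUM_STREAMS and the modulo mapping are illustrative. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/sctp.h>
    #include <stdint.h>

    #define NUM_STREAMS 10

    static uint16_t map_tag_to_stream(int tag, int single_stream) {
        if (single_stream)
            return 0; /* 1-stream variant: everything ordered on stream 0 */
        return (uint16_t)(tag % NUM_STREAMS); /* multi-stream: per tag */
    }

    static ssize_t send_mpi_message(int sd, struct sockaddr *to,
                                    socklen_t tolen, const void *msg,
                                    size_t len, int tag, int single_stream) {
        return sctp_sendmsg(sd, msg, len, to, tolen,
                            /* ppid */ 0, /* flags */ 0,
                            map_tag_to_stream(tag, single_stream),
                            /* timetolive */ 0, /* context */ 0);
    }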
The farm program was run at different loss rates for both short and long mes-
sage sizes. The results are shown in Figure 5.5. Since multihoming was not present
in the experiment, the results obtained show the effect of head-of-line blocking
and the advantage due to the use of multiple tags/streams.

Figure 5.5: Effect of head-of-line blocking in the Bulk Processor Farm program
for (a) short and (b) long messages (Fanout: 10). Total run time in seconds at
0%, 1%, and 2% loss: (a) short: 10 streams 8.7, 11.7, 16.0; 1 stream 9.3, 11.0,
21.6. (b) long: 10 streams 79, 786, 1585; 1 stream 79, 1000, 1942.

The reduction in average run-
times when using multiple streams compared to a single stream is about 25% under
loss for long messages. In the case of short messages, the benefit of using multiple
streams becomes apparent at 2% loss, where using a single stream shows an
increase in run-time of about 35%. This shows that head-of-line blocking has a
substantial effect in a latency-tolerant program like the Bulk Processor Farm.
The performance difference between the two cases can increase in a network
where it takes a long time to recover from packet loss [57]. These results may
improve even further as the FreeBSD KAME SCTP stack is optimized. Currently,
the receive path for an SCTP receiver needs to be completely re-written for optimal
performance [57]. As mentioned in Section 2.1, an SCTP sender uses a round-robin
scheme to send messages on different streams, and according to the KAME SCTP
developers, the receive function similarly needs to be carefully thought out,
with a proper scheme devised for when multiple messages on different streams
are ready to be received.
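On the receive side, the stream a message arrived on is reported through the sctp_sndrcvinfo structure. The following is a minimal sketch, assuming the sctp_data_io_event notification has been enabled so that the structure is filled in; the buffer size and the hand-off function are illustrative:

    /* Sketch: receiving and demultiplexing by stream on a one-to-many
     * socket; buffer size and hand-off function are illustrative. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/sctp.h>
    #include <stdint.h>
    #include <string.h>

    static void deliver(uint16_t stream, const char *buf, ssize_t len) {
        /* hand off to the middleware's expected/unexpected queues */
        (void)stream; (void)buf; (void)len;
    }

    void recv_one(int sd) {
        char buf[64 * 1024];
        struct sctp_sndrcvinfo sinfo;
        struct sockaddr_storage from;
        socklen_t fromlen = sizeof(from);
        int flags = 0;
        ssize_t n;

        memset(&sinfo, 0, sizeof(sinfo));
        n = sctp_recvmsg(sd, buf, sizeof(buf), (struct sockaddr *)&from,
                         &fromlen, &sinfo, &flags);
        if (n > 0) /* sinfo_stream identifies the stream (and hence tag) */
            deliver(sinfo.sinfo_stream, buf, n);
    }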
Chapter 6
Related Work
The effectiveness of SCTP has been explored for several protocols in high latency,
high-loss environments. Researchers have investigated the use of SCTP in FTP [41],
HTTP [52], and also over wireless [10] and satellite networks [1]. Using SCTP for
MPI has not been investigated.
There are a wide variety of projects that use TCP in an MPI environment.
MPICH-G2 [37] is a multi-protocol implementation of MPI for the Globus [27] en-
vironment that was primarily designed to link together clusters over wide-area net-
works. Globus provides a basic infrastructure for Grid computing and addresses
many of the issues with respect to the allocation, access and coordination of resources
in a Grid environment. The communication middleware in MPICH-G2 is able to
switch between the use of vendor-specific MPI libraries and TCP. In this regard, it
is well suited for environments where there is a connection between clusters or other
Grid elements. MPICH-G2 maintains a notion of locality and uses it to optimize
communication among local nodes in a cluster or local network, while nodes that
are further away communicate using TCP. The developers report that MPICH-G2 optimizes
TCP but there is no description of these optimizations and how they might work in
general when all the nodes are non-local. Another related work is PACX-MPI [38],
which is a grid-enabled MPI implementation. PACX-MPI enables execution of MPI
applications over a cluster of high-performance computers like Massively Parallel
Processors (MPPs), connected through high-speed networks or the Internet. Com-
munication within MPPs is done with vendor MPI and between meta-computers
is done through TCP connections. LAM/MPI, as well, provides limited support
for the Globus grid environment. LAM can be booted across a Globus grid, but only
restricted types of execution are possible [40, 54].
Another project on wide-area communication for grid computing is NetIbis
[12], which focuses on the connectivity, performance and security problems of TCP in
wide-area networks. NetIbis enhances performance through data compression over
parallel TCP connections. Moreover, it uses TCP splicing for connection establish-
ment through firewalls and can use encryption as well. MagPIe [39] is a library
of collective communication optimized for wide-area networks that is built on top
of MPICH. MagPIe constructs communication graphs that are wide-area optimal,
taking the hierarchical structure of network topology into account. Their results
show that their algorithms outperform MPICH in real wide-area environments.
Researchers have also focused on providing system fault tolerance in large
scale distributed computing environments. HARNESS FT-MPI [18] is one such
project that handles fault tolerance at the MPI communicator level. It aims at
providing application programmers with different methods of dealing with failures
in MPI rather than having to rely only on check-pointing and restart. In FT-MPI,
the semantics and modes of failure can be controlled by the application using a
modified MPI API. Our work provides the opportunity to add SCTP for transport
in a Grid environment, where we can take advantage of the improved performance
in the case of loss.
Several projects have used UDP rather than TCP. As mentioned, UDP is
message-based and one can avoid all the “heavy-weight” mechanisms present in TCP
to obtain better performance. However, when one adds reliability on top of UDP
the advantages begin to diminish. For example, LA-MPI [2], a high-performance,
reliable MPI library, uses UDP and supports sender-side retransmission of messages
based on acknowledgment timeout. Their retransmission approach has many simi-
larities to TCP/IP but they bypass the complexities of TCP’s flow and congestion
control mechanisms. They justify their use of a simple retransmission technique by
stating that their scheme is targeted at low-error rate cluster environments and is
thus appropriate. LA-MPI also supports fault tolerance through message striping
across multiple network interfaces and can achieve greater bandwidth through si-
multaneous transfer of data. Automatic network failover in case of network failures
is currently under development in their implementation. LA-MPI reports that
the performance of their implementation over UDP/IP is similar to the TCP/IP
performance of other MPI implementations over Ethernet.
WAMP [61] is an example of UDP for MPI over wide-area networks. This
work presents a user-level communication protocol where reliability was built on top
of UDP. Interestingly, WAMP only wins over TCP in heavily congested networks
where TCP’s congestion avoidance mechanisms limit bandwidth. One of the reasons
for WAMP’s improvement over TCP is the fact that their protocol does not attempt
to modify its behavior based on network congestion and maintains a constant flow
regardless of the network conditions. Their protocol, in fact, captures bandwidth
at the expense of other flows in the network. This technique may not be appropri-
ate for Internet-like environments where fairness and network congestion must be
considered. Limitation on achievable bandwidth due to TCP’s congestion control is
a problem but hopefully research in this area will lead to better solutions for TCP
that can also be incorporated into SCTP.
Another possible way to use streams is to send eager short messages on
one stream and long messages on a second one. LA-MPI followed this approach
with their UDP/IP implementation of MPI [2]. This potentially allows for the
fast delivery of short messages and a more optimized delivery for long messages
that require more bandwidth. It also may reduce head-of-line blocking for the
case of short messages waiting for long messages to complete. However, MPI
message ordering semantics must be maintained and thus, as in LA-MPI,
sequence numbers are introduced to ensure that messages are received strictly
in the order they were posted by the sender. Thus, in the end, there is no added
opportunity for concurrent progression of messages as there is in the case of our
implementation.
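A sketch of why such sequence numbering forfeits concurrency follows; the window size and structures are illustrative, not LA-MPI's actual code. Whatever channel a message arrives on, delivery stalls until the next expected sequence number has arrived:

    /* Sketch: a global sequence number restores strict posting order
     * at the receiver; names and window size are illustrative. */
    #include <stdint.h>

    #define WINDOW 64

    struct msg { uint32_t seq; const void *payload; };

    static struct msg pending[WINDOW]; /* out-of-order arrivals */
    static int       present[WINDOW];
    static uint32_t  next_seq;         /* next seq to deliver */

    /* Called on arrival from either the short- or long-message channel. */
    void on_arrival(struct msg m, void (*deliver)(struct msg)) {
        pending[m.seq % WINDOW] = m;
        present[m.seq % WINDOW] = 1;
        while (present[next_seq % WINDOW] &&
               pending[next_seq % WINDOW].seq == next_seq) {
            present[next_seq % WINDOW] = 0;
            deliver(pending[next_seq % WINDOW]); /* posted order only */
            next_seq++;
        }
    }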
There have been a number of MPI implementations that have explored im-
proving communication with emphasis on some particular aspect of the overall per-
formance. Work discussed in [44] integrates MPI’s receive queue management into
the TCP/IP protocol handler. The handler is interrupt driven and copies received
data from the TCP receive buffer to the user space. The focus of this work is on
avoiding polling of sockets and reducing system call overhead in large scale clusters.
In general, these implementations require special drivers or operating systems
and will not have a great degree of support, which limits their use. Although the
same can be said about SCTP, the fact that SCTP has been standardized and that
implementations have begun to emerge is a good indication of more widespread
support.
Recently, Open MPI has been announced: a new public domain ver-
sion of MPI-2 that builds on the experience gained from the design and implemen-
tation of LAM/MPI, LA-MPI, FT-MPI [17] and PACX-MPI. Open MPI takes
a component-based approach to allow the flexibility to mix and match components
for collectives and for transport and link management. It is designed to be scalable
and fault-tolerant, provides a platform for third-party research, and enables the
addition of new components. We hope that our work can be incorporated into Open
MPI as a transport module. TEG [65] is a fault-tolerant point-to-point communica-
tion module in Open MPI that supports multihoming and the ability to stripe data
across interfaces. It provides concurrent support for multiple network types, such
as Myrinet, InfiniBand, GigE etc., and can carry out message fragmentation and
delivery utilizing multiple network interfaces (NICs). The ability to sched-
ule data over different interfaces has been proposed for SCTP and may provide an
alternative way to provide the functionality of TEG in environments like the Inter-
net. A group at the University of Delaware is researching Concurrent Multipath
Transfer (CMT) [29, 28], which uses SCTP’s multihoming feature to provide simul-
taneous transfer of data between two endpoints via two or more end-to-end paths.
The objective of using CMT between multihomed hosts is to increase application
throughput. CMT operates at the transport layer and is thus more efficient than
multipath transfer at the application level, since it has access to finer details about
the end-to-end paths. CMT is currently being integrated into the FreeBSD KAME
SCTP stack and will be available as a sysctl option by the end of 2005.
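If CMT does ship as a sysctl, enabling it from the middleware could look like the sketch below; the OID name is our assumption, since the option did not yet exist at the time of writing.

    /* Sketch: toggling CMT via sysctl on FreeBSD; the OID name is
     * hypothetical and must be checked against the released stack. */
    #include <sys/types.h>
    #include <sys/sysctl.h>

    int enable_cmt(void) {
        int on = 1;
        return sysctlbyname("net.inet.sctp.cmt_on_off", NULL, NULL,
                            &on, sizeof(on));
    }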
Another recent project, still under development, is Open Run-Time Envi-
ronment (OpenRTE) [8], for supporting distributed high-performance computing in
heterogeneous environments. It is a spin-off from the Open MPI project and has
drawn from existing approaches to large scale distributed computing like LA-MPI,
HARNESS FT-MPI and the Globus project. It focuses on providing:
1. Ease-of-use, with focus on transparency and the ability to execute an applica-
tion on a variety of computing resources.
2. Resilience, with user-definable/selectable error management strategies.
3. Scalability and extensibility, with the ability to support addition of new fea-
tures.
It is based on the modular component architecture developed for the Open
MPI project and the behavior of any of the OpenRTE subsystems can be changed
by defining a new component and selecting it for use. Their design allows users to
customize the run-time behavior for their applications. A beta version of OpenRTE
is currently being evaluated by the developers. We believe this project and the
Open MPI project can benefit from using SCTP in their transport module. SCTP
70
is particularly valuable for applications that require monitoring and path/session
failure detection, since its heartbeat mechanism is designed for providing active
monitoring of associations. One potential module in Open MPI may be a wide-area
network message progression module.
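As an illustration of this monitoring, heartbeat behavior can be tuned per peer address through the SCTP_PEER_ADDR_PARAMS socket option. The sketch below follows the sctp_paddrparams interface of the later socket-API drafts (as in the lksctp and KAME stacks); the interval and retransmission threshold are illustrative choices, and field and flag names may differ across stack versions.

    /* Sketch: enabling heartbeats on a peer address to monitor path
     * liveness; interval and threshold values are illustrative. */
    #include <sys/socket.h>
    #include <netinet/sctp.h>
    #include <string.h>

    int enable_heartbeat(int sd, const struct sockaddr *peer,
                         socklen_t peerlen) {
        struct sctp_paddrparams p;
        memset(&p, 0, sizeof(p));
        memcpy(&p.spp_address, peer, peerlen);
        p.spp_flags = SPP_HB_ENABLE; /* heartbeats on for this path */
        p.spp_hbinterval = 5000;     /* heartbeat every 5000 ms */
        p.spp_pathmaxrxt = 5;        /* retransmits before path failure */
        return setsockopt(sd, IPPROTO_SCTP, SCTP_PEER_ADDR_PARAMS,
                          &p, sizeof(p));
    }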
71
Chapter 7
Conclusions and Future Work
In this thesis we discussed the design and evaluation of using SCTP for MPI. SCTP
is better suited as a transport layer for MPI because of several distinct features
not present in TCP. We have designed the SCTP module to address the head-of-
line blocking problem present in the LAM-TCP middleware. We have shown that SCTP
matches MPI semantics more closely than TCP and we have taken advantage of the
multistreaming feature of SCTP to provide a direct mapping from streams to MPI
message tags. This has resulted in increased concurrency at the TRC level in the
SCTP module, compared to concurrency at the process level in LAM-TCP.
Our SCTP module’s state machine uses one-to-many style sockets and avoids
the use of expensive select system calls, which leads to increased scalability in
large scale clusters. In addition, SCTP’s multihoming feature makes our module
fault tolerant and resilient to network path failures.
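A minimal sketch of the one-to-many socket setup follows, assuming the standard SCTP socket extensions; the single descriptor serves all peers, so the middleware can block in one sctp_recvmsg() call instead of select()ing over many one-to-one sockets. The port number handling is illustrative.

    /* Sketch: one-to-many SCTP socket; one descriptor serves every
     * association, so no select() over N sockets is needed. */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/sctp.h>
    #include <stdint.h>
    #include <string.h>

    int open_one_to_many(uint16_t port) {
        int sd = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);
        struct sockaddr_in addr;

        if (sd < 0) return -1;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);
        if (bind(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(sd, 1) < 0) /* enables implicit association setup */
            return -1;
        return sd;
    }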
We have evaluated our module and reported the results of several experi-
ments using standard benchmark programs as well as a real-world application and
compared the performance with LAM-TCP. Simple ping-pong tests, under no loss,
have shown that LAM-TCP outperforms the SCTP module for small message sizes
but SCTP does better for large messages. Additionally, SCTP’s performance is
comparable to TCP’s for standard benchmarks such as the NAS benchmarks when
larger dataset sizes are used. The strengths of SCTP over TCP become apparent
under loss conditions as seen in the results for ping-pong tests and our latency tol-
erant Bulk Processor Farm program. When different tags are used in a program,
the advantages due to multistreaming in our module can lead to further benefits in
performance.
SCTP is a robust transport level protocol that extends the range of MPI
programs that can effectively execute in a wide-area network. There are many
advantages of using MPI in open environments like the Internet, the first being
portability and the ability to execute a large body of MPI applications unchanged
in diverse environments. Secondly, not everybody has access to dedicated high per-
formance clusters and this provides the opportunity to take advantage of computing
resources in other environments as well as to be able to access specialized resources
not locally available. There are, of course, many performance issues of operating in
an open environment and the need for dynamically optimizing connections remains
an active area of research in the networking community. Several variants of TCP
have been proposed and we believe that SCTP, due to its similarity with TCP, will
also be able to take advantage of improvements in that area.
Our work has the potential to spawn a new body of research where other
features of SCTP like multihoming and load balancing can be investigated to better
support parallel applications using MPI. The current ongoing integration of CMT
in the SCTP stack poses interesting questions about potential benefits for MPI
applications. Work has also been done on prioritization of streams in SCTP [26],
and it would be interesting to investigate the benefits to MPI if a selection of
different latencies is made available to messages by assigning them to different
priority streams.
In general, the performance requirements of different MPI programs will
vary. There will be those programs that can only achieve satisfactory performance
on dedicated machines, with low-latency and high-bandwidth links. On the other
hand, there will be those latency tolerant programs that will be able to run just as
well in highly available, shared environments that have large delays and loss. Our
contribution is to the latter type of programs, extending their performance in
open environments such as the Internet. Ideally, we want to increase the portability
of MPI and in doing so, encourage programmers to develop programs more towards
this end of the spectrum.
Bibliography
[1] Rumana Alamgir, Mohammed Atiquzzaman, and William Ivancic. Effect of
congestion control on the performance of TCP and SCTP over satellite net-
works. In NASA Earth Science Technology Conference, Pasadena, CA, June
2002.
[2] Rob T. Aulwes, David J. Daniel, Nehal N. Desai, Richard L. Graham, L. Dean
Risinger, Mark A. Taylor, Timothy S. Woodall, and Mitchel W. Sukalski. Ar-
chitecture of LA-MPI, a network-fault-tolerant MPI. In 18th International
Parallel and Distributed Processing Symposium (IPDPS’04), Sante Fe, New
Mexico, April 2004.
[3] Ryan W. Bickhart. Transparent SCTP shim, personal communication. Univer-
sity of Delaware, http://www.cis.udel.edu/~bickhart/research/shim.html.
[4] Ron Brightwell and Keith Underwood. Evaluation of an eager protocol opti-
mization for MPI. In PVM/MPI, pages 327–334, 2003.
[5] Greg Burns and Raja Daoud. Robust message delivery with guaranteed re-
sources. In Proceedings of Message Passing Interface Developer’s and User’s
Conference (MPIDC), May 1995.
[6] Sadik G. Caglar, Gregory D. Benson, Qing Huang, and Cho-Wai Chu. A
multi-threaded implementation of MPI for Linux clusters. In 15th IASTED
International Conference on Parallel and Distributed Computing and Systems,
Los Angeles, November 2003.
[7] Neal Cardwell, Stefan Savage, and Thomas Anderson. Modeling TCP latency.
In INFOCOM, pages 1742–1751, 2000.
[8] R. H. Castain, T. S. Woodall, D. J. Daniel, J. M. Squyres, B. Barrett, and G. E.
Fagg. The open run-time environment (OpenRTE): A transparent multi-cluster
environment for high-performance computing, September 2005. To appear in
EURO PVM MPI 2005 12th European Parallel Virtual Machine and Message
Passing Interface Conference.
[9] NASA Ames Research Center. Numerical aerodynamic simulation (NAS) paral-
lel benchmark (NPB) benchmarks. http://www.nas.nasa.gov/Software/NPB/.
[10] Yoonsuk Choi, Kyungshik Lim, Hyun-Kook Kahng, and I. Chong. An experi-
mental performance evaluation of the stream control transmission protocol for
transaction processing in wireless networks. In ICOIN, pages 595–603, 2003.
[11] Wu chun Feng. The future of high-performance networking, 2001. Workshop
on New Visions for Large-Scale Networks: Research and Applications.
[12] A. Denis, O. Aumage, R. Hofman, K. Verstoep, T. Kielmann, and H. E. Bal.
Wide-area communication for grids: an integrated solution to connectivity,
performance and security problems. In High performance Distributed Comput-
ing, 2004. Proceedings. 13th IEEE International Symposium on, pages 97–106,
June 2004.
[13] P. Dickens, W. Gropp, and P. Woodward. High performance wide area data
transfers over high performance networks. In International Workshop on Per-
formance Modeling, Evaluation, and Optimization of Parallel and Distributed
Systems, 2002.
[14] Rossen Dimitrov and Anthony Skjellum. Software architecture and performance
comparison of MPI/Pro and MPICH. In Lecture Notes in Computer Science,
Volume 2659, Pages 307 - 315, January 2003.
[15] Tom Dunigan, Matt Mathis, and Brian Tierney. A TCP tuning daemon. In
Supercomputing ’02: Proceedings of the 2002 ACM/IEEE conference on Super-
computing, pages 1–16, Los Alamitos, CA, USA, 2002. IEEE Computer Society
Press.
[16] EmuLab. Network Emulation Testbed. http://www.emulab.net/.
[17] Graham E. Fagg and Jack J. Dongarra. Building and using a fault-tolerant
MPI implementation. International Journal of High Performance Computing
Applications, 18(3):353–361, 2004.
[18] Graham E. Fagg and Jack J. Dongarra. HARNESS fault tolerant MPI design,
usage and performance issues. Future Gener. Comput. Syst., 18(8):1127–1142,
2002.
[19] Kevin Fall and Sally Floyd. Simulation-based comparisons of Tahoe, Reno and
SACK TCP. SIGCOMM Comput. Commun. Rev., 26(3):5–21, 1996.
[20] Ahmad Faraj and Xin Yuan. Communication characteristics in the NAS parallel
benchmarks. In PDCS, pages 724–729, 2002.
[21] W. Feng and P. Tinnakornsrisuphap. The failure of TCP in high-performance
computational grids. In Supercomputing ’00: Proceedings of the 2000
ACM/IEEE conference on Supercomputing, page 37, Washington, DC, USA,
2000. IEEE Computer Society.
[22] G. Burns, R. Daoud and J. Vaigl. LAM: An Open Cluster Environment for
MPI. In Supercomputing Symposium ’94, Toronto, Canada, June 1994.
[23] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Don-
garra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett,
Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham,
and Timothy S. Woodall. Open MPI: Goals, concept and design of a next gen-
eration MPI implementation. In Proceedings, 11th European PVM/MPI Users’
Group Meeting, Budapest, Hungary, September 2004.
[24] Al Geist, William Gropp, Steve Huss-Lederman, Andrew Lumsdaine, Ewing L.
Lusk, William Saphir, Tony Skjellum, and Marc Snir. MPI-2: Extending the
message-passing interface. In Euro-Par, Vol. I, pages 128–135, 1996.
[25] Thomas J. Hacker, Brian D. Noble, and Brian D. Athey. Improving throughput
and maintaining fairness using parallel TCP. In IEEE INFOCOM, 2004.
[26] Gerard J. Heinz and Paul D. Amer. Priorities in SCTP multistreaming. In SCI
’04, Orlando, FL, July 2004.
[27] I. Foster and C. Kesselman. Globus: A Metacomputing Infrastructure Toolkit.
Intl Journal of Supercomputer Applications, 11(2):115–128, 1997.
[28] Janardhan R. Iyengar, Paul D. Amer, and Randall Stewart. Retransmission
policies for concurrent multipath transfer using SCTP multihoming. In ICON
2004, Singapore, November 2004.
[29] Janardhan R. Iyengar, Keyur C. Shah, Paul D. Amer, and Randall Stewart.
Concurrent multipath transfer using SCTP multihoming. In SPECTS 2004,
San Jose, July 2004.
[30] M. Jain, R. Prasad, and C. Dovrolis. The TCP bandwidth-delay product revis-
ited: Network buffering, cross traffic, and socket buffer auto-sizing. Technical
Report GIT-CERCS-03-02, Georgia Tech, February 2003.
[31] Armando L. Caro Jr., Paul D. Amer, and Randall R. Stewart. Retransmission
policies for multihomed transport protocols. Technical Report TR2005-15, CIS
Dept, University of Delaware, March 2005.
[32] Armando L. Caro Jr., Keyur Shah, Janardhan R. Iyengar, Paul D. Amer, and
Randall R. Stewart. SCTP and TCP variants: Congestion control under mul-
tiple losses. Technical Report TR2003-04, CIS Dept, U of Delaware, February
2003.
[33] Humaira Kamal. Description of LAM TCP RPI module. Technical re-
port, University of British Columbia, Computer Science Department, Available
at: http://www.cs.ubc.ca/labs/dsg/mpi-sctp/LAM_TCP_RPI_MODULE.pdf,
2004.
[34] Humaira Kamal, Brad Penoff, and Alan Wagner. SCTP-based middleware for
MPI in wide-area networks. In 3rd Annual Conf. on Communication Networks
and Services Research (CNSR2005), pages 157–162, Halifax, May 2005. IEEE
Computer Society.
[35] Humaira Kamal, Brad Penoff, and Alan Wagner. SCTP versus TCP for MPI.
In Supercomputing ’05: Proceedings of 2005 ACM/IEEE conference on Super-
computing (to appear), Seattle, WA, November 2005.
[36] KAME. SCTP stack implementation for FreeBSD. http://www.kame.net.
[37] Nicholas T. Karonis, Brian R. Toonen, and Ian T. Foster. MPICH-G2:
A grid-enabled implementation of the message passing interface. CoRR,
cs.DC/0206040, 2002.
[38] Rainer Keller, Edgar Gabriel, Bettina Krammer, Matthias S. Mueller, and
Michael M. Resch. Towards efficient execution of MPI applications on the grid:
Porting and optimization issues. Journal of Grid Computing, 1:133–149, June
2003.
[39] Thilo Kielmann, Rutger F. H. Hofman, Henri E. Bal, Aske Plaat, and Raoul
A. F. Bhoedjang. MagPIe: MPI’s collective communication operations for
clustered wide area systems. SIGPLAN Not., 34(8):131–140, 1999.
[40] The LAM/MPI Team Open Systems Lab. LAM/MPI user’s guide, September
2004. Pervasive Technology Labs, Indiana University.
[41] Sourabh Ladha and Paul Amer. Improving multiple file transfers using SCTP
multistreaming. In Proceedings IPCCC, April 2004.
[42] Linux Kernel Stream Control Transmission Protocol (lksctp) project. withsctp
utility, lksctp-tools package, http://lksctp.sourceforge.net/.
[43] Matt Mathis, John Heffner, and Raghu Reddy. Web100: Extended TCP instru-
mentation for research, education and diagnosis. SIGCOMM Comput. Com-
mun. Rev., 33(3):69–79, 2003.
[44] M. Matsuda, T. Kudoh, H. Tazuka, and Y. Ishikawa. The design and imple-
mentation of an asynchronous communication mechanism for the MPI commu-
nication model. In IEEE Intl. Conf. on Cluster Computing, pages 13–22, Dana
Point, Ca., Sept 2004.
[45] Alberto Medina, Mark Allman, and Sally Floyd. Measuring the evolution of
transport protocols in the Internet, April 2005. To appear in ACM CCR.
[46] Message Passing Interface Forum. MPI: A Message Passing Interface. In Super-
computing ’93: Proceedings of 1993 ACM/IEEE conference on Supercomputing,
pages 878–883. IEEE Computer Society Press, 1993.
[47] G. Minshall, Y. Saito, J. Mogul, and B. Verghese. Application performance pit-
falls and TCP’s Nagle algorithm. In Workshop on Internet Server Performance,
Atlanta, GA, USA, May 1999.
[48] Philip J. Mucci. MPBench: Benchmark for MPI functions. Available at:
http://icl.cs.utk.edu/projects/llcbench/mpbench.html.
[49] Peter S. Pacheco. Parallel programming with MPI. Morgan Kaufmann Pub-
lishers Inc., San Francisco, CA, USA, 1996.
[50] Scott Pakin, Mario Lauria, and Andrew Chien. High performance messaging
on workstations: Illinois Fast Messages (FM) for Myrinet. In Supercomputing
’95: Proceedings of the 1995 ACM/IEEE conference on Supercomputing, San
Diego, CA, December 1995.
[51] PlanetLab. An open platform for developing, deploying, and accessing
planetary-scale services. http://www.planet-lab.org/.
[52] R. Rajamani, S. Kumar, and N. Gupta. SCTP versus TCP: Comparing the
performance of transport protocols for web traffic. Technical report, University
of Wisconsin-Madison, May 2002.
[53] S. Fu, M. Atiquzzaman, and W. Ivancic. SCTP over satellite networks. In
IEEE Computer Communications Workshop (CCW 2003), pages 112–116,
Dana Point, Ca., October 2003.
[54] Jeffrey M. Squyres, Brian Barrett, and Andrew Lumsdaine. Boot systems ser-
vices interface (SSI) modules for LAM/MPI. Technical Report TR576, Indiana
University, Computer Science Department, 2003.
[55] W. Richard Stevens, Bill Fenner, and Andrew M. Rudoff. UNIX Network
Programming, Vol. 1, Third Edition. Pearson Education, 2003.
[56] R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. Schwarzbauer, T. Taylor,
I. Rytina, M. Kalla, L. Zhang, and V. Paxson. The stream control transmission
protocol (SCTP). Available from http://www.ietf.org/rfc/rfc2960.txt, October
2000.
[57] Randall Stewart. Primary designer of SCTP and developer of FreeBSD KAME
SCTP stack, private communication, 2005.
[58] Randall Stewart and Paul D. Amer. Why is SCTP needed given TCP and UDP
are widely available? Internet Society, ISOC Member Briefing #17. Available
from http://www.isoc.org/briefings/017/briefing17.pdf, June 2004.
[59] Randall R. Stewart and Qiaobing Xie. Stream control transmission protocol
(SCTP): a reference guide. Addison-Wesley Longman Publishing Co., Inc.,
2002.
[60] Alexander Tormasov and Alexey Kuznetsov. TCP/IP options for high-
performance data transmission, 2002. http://builder.com.com/5100-6372_14-1050878.html.
[61] Rajkumar Vinkat, Philip M. Dickens, and William Gropp. Efficient communi-
cation across the Internet in wide-area MPI. In Conference on Parallel and Dis-
tributed Programming Techniques and Applications, Las Vegas, Nevada, USA,
2001.
[62] W. Gropp, E. Lusk, N. Doss and A. Skjellum. High-performance, portable
implementation of the MPI message passing interface standard. Parallel Com-
puting, 22(6):789–828, September 1996.
[63] Paul A. Watson. Slipping in the window: TCP reset attacks, October 2003.
CanSecWest Security Conference.
[64] Eric Weigle and Wu chun Feng. A case for TCP Vegas in high-performance
computational grids. In HPDC ’01: Proceedings of the 10th IEEE International
Symposium on High Performance Distributed Computing (HPDC-10’01), page
158, Washington, DC, USA, 2001. IEEE Computer Society.
[65] T.S. Woodall, R.L. Graham, R.H. Castain, D.J. Daniel, M.W. Sukalski, G.E.
Fagg, E. Gabriel, G. Bosilca, T. Angskun, J.J. Dongarra, J.M. Squyres, V. Sa-
hay, P. Kambadur, B. Barrett, and A. Lumsdaine. TEG: A high-performance,
scalable, multi-network point-to-point communications methodology. In Pro-
ceedings, 11th European PVM/MPI Users’ Group Meeting, Budapest, Hungary,
September 2004.
[66] J. Yoakum and L. Ong. An introduction to the stream control transmission pro-
tocol (SCTP). Available from http://www.ietf.org/rfc/rfc3286.txt, May 2002.