SCTP-Based Middleware for MPI
by
Humaira Kamal
B.S.E.E. University of Engineering and Technology, Lahore, Pakistan, 1994
M.Sc. Computer Science, Lahore Univ. of Management Sciences, Pakistan, 2001
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in
THE FACULTY OF GRADUATE STUDIES
(Computer Science)
The University of British Columbia
August 2005
© Humaira Kamal, 2005
Abstract
SCTP (Stream Control Transmission Protocol) is a recently standardized transport
level protocol with several features that better support the communication require-
ments of parallel applications; these features are not present in traditional TCP
(Transmission Control Protocol). These features make SCTP a good candidate as
a transport level protocol for MPI (Message Passing Interface). MPI is a message
passing middleware that is widely used to parallelize scientific and compute inten-
sive applications. TCP is often used as the transport protocol for MPI in both local
area and wide-area networks. Prior to this work, SCTP had not been used for MPI.
In this thesis, we compared and evaluated the benefits of using SCTP instead
of TCP as the underlying transport protocol for MPI. We redesigned LAM-MPI,
a public domain version of MPI, to use SCTP. We describe the advantages and
disadvantages of using SCTP, the necessary modifications to the MPI middleware to
use SCTP, and the performance of SCTP as compared to the stock implementation
that uses TCP.
ii
Contents
Abstract
Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Introduction
2 Overview
2.1 SCTP Overview
2.2 Overview of MPI Middleware
2.2.1 Message Progression Layer
2.2.2 Message Delivery Protocol
2.2.3 Overview of NAS Parallel Benchmarks
3 Evaluating TCP for MPI in WANs
3.1 TCP Performance Issues
3.2 Investigating the Use of TCP for MPI in WANs
3.3 MPI in an Emulated WAN Environment
3.3.1 Communication Characteristics of the NAS Benchmarks
3.4 Experiments
3.4.1 NAS Benchmarks, Loss
3.4.2 NAS Benchmarks, Latency
3.4.3 NAS Benchmarks, Nagle's Algorithm and Delayed Acknowledgments
3.5 Summary of Results
4 SCTP-based Middleware: Design and Implementation
4.1 Overview of using SCTP for MPI
4.2 Message Demultiplexing
4.3 Concurrency and SCTP Streams
4.3.1 Assigning Messages to Streams
4.3.2 Head-of-Line Blocking
4.3.3 Example of Head-of-Line Blocking
4.3.4 Maintaining State
4.4 Resource Management
4.5 Race Conditions
4.5.1 Race Condition in Long Message Protocol
4.5.2 Race Condition in MPI_Init
4.6 Reliability
4.6.1 Multihoming Feature
4.6.2 Added Protection
4.6.3 Use of a Single Underlying Protocol
4.7 Limitations
5 Experimental Evaluation
5.1 Evaluation Using Benchmark Programs
5.1.1 MPBench Ping-Pong Test
5.1.2 Using NAS Benchmarks
5.2 Evaluation Using a Latency Tolerant Program
5.2.1 Comparison Using a Real-World Program
5.2.2 Investigating Head-of-Line Blocking
6 Related Work
7 Conclusions and Future Work
Bibliography
List of Tables
3.1 Breakdown of message types for LU classes S, W, A and B using the default LAM settings for TCP
3.2 Breakdown of message types for SP classes S, W, A and B using the default LAM settings for TCP
5.1 The performance of SCTP and TCP using a standard ping-pong test under loss
List of Figures
2.1 Single association with multiple streams
2.2 MPI and LAM envelope format
2.3 Examples of valid and invalid message orderings in MPI
2.4 Message progression in MPI middleware
3.1 NAS benchmarks performance for dataset B under different loss rates using TCP SACK
3.2 NAS benchmarks performance for dataset B with different latencies using TCP SACK
3.3 Performance of LU using different combinations of Nagle's algorithm and delayed acknowledgments
3.4 Comparison of LU and SP benchmarks for different datasets
4.1 Similarities between the message protocol of MPI and SCTP
4.2 Example of head-of-line blocking in LAM TCP module
4.3 Example of multiple streams between two processes using non-blocking communication
4.4 Example of multiple streams between two processes using blocking communication
4.5 Example of long message race condition
4.6 SCTP's four-way handshake
5.1 The performance of SCTP using a standard ping-pong test normalized to TCP under no loss
5.2 TCP versus SCTP for the NAS benchmarks
5.3 TCP versus SCTP for (a) short and (b) long messages for the Bulk Processor Farm application
5.4 TCP versus SCTP for (a) short and (b) long messages for the Bulk Processor Farm application using Fanout of 10
5.5 Effect of head-of-line blocking in the Bulk Processor Farm program for (a) short and (b) long messages
Acknowledgements
I would like to acknowledge the invaluable contribution of my supervisor Dr. Alan
Wagner who has been a constant source of inspiration for me. He has always been
generous with his time and discussions with him have added a great deal to my
knowledge. One of the best things about working under his supervision is that he
has always encouraged dialogue and provided constructive feedback on our work.
I also want to thank Dr. Norm Hutchinson for reading my thesis and pro-
viding useful feedback.
I would also like to acknowledge the many useful discussions I have had with
my group partner, Brad Penoff, over the course of our project that contributed to
our mutual learning.
Many thanks to Randall Stewart, one of the primary designers of SCTP, for
answering our questions and providing timely fixes to the protocol stack that made
it possible to provide the performance evaluation. Thanks to Jeff Squyres, one of
the developers of LAM-MPI, for readily answering our many questions about the
LAM-MPI implementation.
Humaira Kamal
The University of British Columbia
August 2005
To my parents, sisters and brother, without whose encouragement and support
this would not have been possible.
Chapter 1
Introduction
Increases in network bandwidth and connectivity have spurred interest in distributed
systems that can utilize the network and computational resources available in wide-
area networks (WANs). The challenge is how to harness the power of the hundreds
and thousands of processors that may be available across the network to solve prob-
lems faster by executing them in parallel. Introduced about a decade ago, the
Message Passing Interface (MPI) [46, 24] has become the de facto standard for con-
structing parallel applications in distributed environments. The MPI forum defines
a library of functions that can be used to create parallel applications which commu-
nicate by message passing. The primary goals of MPI were to provide portability
of source code and allow efficient implementations across a range of architectures.
MPI facilitates the creation of efficient scientific and compute-intensive parallel programs, and implementations are available for a wide variety of platforms. Since its inception, MPI has been very popular, and virtually every vendor of parallel systems has implemented its own variant [49]. There are also open source, public domain
implementations available for execution of MPI programs in local area networks.
Notable among them are Local Area Multicomputer (LAM) [22] and MPICH [62].
Initial implementations of MPI were targeted towards tightly coupled cluster-
like machines, where reliability and security were not an issue. Implementations of
MPI for local area networks widely use TCP as the underlying transport protocol,
to provide reliable in-order delivery of messages. TCP was available in the first
public domain versions of MPI such as LAM and MPICH, and more recently the
use of MPI with TCP has been extended to computing grids [37], wide-area net-
works, the Internet and meta-computing environments that link together diverse,
geographically distributed, computing resources. The main advantage to using an
IP-based protocol (i.e., TCP/UDP) for MPI is portability and the ease with which it
can be used to execute MPI programs in diverse network environments. One well-
known problem with using TCP in wide-area environments is the presence of large
latencies and the difficulty in utilizing all of the available bandwidth. MPI processes in an application tend to be loosely synchronized, even when they are not directly
communicating with each other. If delays occur in one process, they can potentially
have a ripple effect on other processes, which can get delayed as well. As a result,
performance can be poor especially in the presence of network congestion and the
resulting increased latency which can cause processes to stall waiting for the delivery
or receipt of a message. Moreover, MPI processes typically establish a large number
of TCP connections for communicating with the other processes in the application.
This imposes a requirement on the operating system to maintain a large number of
socket descriptors.
Although applications sensitive to latency suffer when run over TCP or UDP,
there are latency tolerant programs such as those that are embarrassingly parallel,
or almost so, that can use an IP-based transport protocol to execute in environments
like the Internet. In addition, the dynamics of TCP are an active area of research
and there is interest in better models [7] and tools for instrumenting and tuning
TCP connections [43]. As well, TCP itself continues to evolve, especially for high
performance links, with research into new variants like TCP Vegas [21, 45]. Finally,
latency hiding techniques and exploiting trade-offs between bandwidth and latency
can further expand the range of MPI applications that may be suitable to execute
over IP in both local and wide-area networks. In the end, the ability for MPI
programs to execute unchanged in almost any environment is a strong motivation
for continued research in IP-based transport protocol support for MPI.
Currently, TCP is widely used as an IP-based transport protocol for MPI.
The TCP protocol, however, is not well matched for MPI applications, especially in
wide-area networks, for a number of reasons, which are discussed below:
1. TCP is a stream-based protocol, whereas MPI is message-oriented. This im-
poses a requirement on the implementation of the MPI library to carry out
message framing.
2. The TCP protocol provides strict in-order delivery of all user data transmit-
ted on a connection. MPI applications need reliable message transfer, but not
a globally strict sequenced delivery. In Chapter 2 we discuss the details of
MPI message matching and how messages with the same tag, rank and con-
text should be delivered in order, while those with different values for these
parameters are allowed to overtake each other. The strict in-order delivery of
messages by TCP can result in unnecessary head-of-line blocking in case of
network loss or delays.
3. The ability to recover quickly from loss in a network is especially desirable
in wide-area environments. TCP SACK has been shown to be more robust
under loss than other variants of TCP, like TCP Reno; however, SACK is not
an integral part of the TCP protocol and is provided as an option in some
implementations [19]. The user thus has to make sure that a proper TCP
SACK implementation is available on all the nodes in order to take advantage
of better performance under loss.
4. TCP is known to be vulnerable to certain denial-of-service attacks like SYN
flooding. In open environments, like wide-area networks or the Internet, re-
silience against such attacks is a concern.
5. TCP has no built-in support for multihomed hosts; in the case of a network
failure, there is no provision for it to fail over to an alternate path. Multihomed
hosts are increasingly common, and the ability to transparently switch over to
another path can be especially useful in a wide-area network.
Recently, a new transport protocol called SCTP (Stream Control Transmis-
sion Protocol) [59] has been standardized by the IETF. SCTP alleviates many of
the problems with TCP. SCTP is message oriented like UDP but has TCP-like con-
nection management, congestion and flow control mechanisms. In SCTP, there is
an ability to define streams that allow multiple independent and reliable message
subflows inside a single association. This eliminates the head-of-line blocking that
can occur in TCP-based middleware for MPI. In addition, SCTP associations and
streams closely match the message-ordering semantics of MPI when messages with
the same tag, source rank and context are used to define a stream (tag) within an
association (source rank).
SCTP includes several other mechanisms that make it an attractive target for
MPI in open network environments where secure connection management and con-
gestion control are important. It makes it possible to offload some MPI middleware
functionality onto a standardized protocol that will hopefully become universally
available. Although new, SCTP is currently available for all major operating sys-
tems and is part of the standard Linux kernel distribution.
In this thesis we propose the use of SCTP as a robust transport protocol
for MPI in wide-area networks. We first investigate the problems that arise from
the use of TCP for MPI in an emulated wide-area network. The objective is to
gain a better understanding of the effect TCP has on MPI performance and to learn
about what kind of applications are suitable for such environments. We use LAM as
the MPI middleware for our experiments and also extensively instrument the LAM
TCP module to gain an insight into its working. Our second step is to redesign the
LAM-MPI middleware to take advantage of the features SCTP provides. During
our design, we investigate the shortcomings of TCP, discussed above, and introduce
features using SCTP that alleviate those shortcomings. One novel advantage to us-
ing SCTP that we investigate is the elimination of the head-of-line blocking present
in TCP-based MPI middleware. Next, we carry out an extensive evaluation of our
implementation and compare its performance with the LAM TCP module. We use
standard benchmarks as well as real-world programs to evaluate the performance
of SCTP. Our experiments show that SCTP outperforms TCP under loss, when
latency tolerant applications are used. The benefits of using SCTP were substantial
even when a single stream was used between two endpoints. Use of SCTP’s mul-
tistreaming feature adds to the overall performance and we evaluate the effects of
head-of-line blocking in our applications. A second indirect contribution of our work
is with respect to SCTP. Our MPI middleware makes very aggressive use of SCTP.
In using the NAS parallel benchmarks [9] along with programs of our own design,
we were able to uncover problems in the FreeBSD implementation of the protocol
that led to improvements in the stack by the SCTP developers.
Overall, the use of SCTP makes for a more resilient implementation of MPI
that avoids many of the problems present in using TCP, especially in an open wide-
area network environment, such as the Internet. Of course, it doesn’t eliminate the
performance issues of operating in that environment, nor does it eliminate the need
for tuning connections. Many of the same issues remain and as mentioned, this is
an active area of research in the networking community where numerous variants of
TCP have been proposed. Because of the similarity between the two, SCTP will be
able to take advantage of improvements to TCP. Although the advantages of SCTP
have been investigated in other contexts, this is the first use of it for MPI. SCTP is
an attractive replacement for TCP in a wide-area network environment, and adds
to the set of IP transport protocols that can be used for MPI. The release of Open
MPI [23], and its ability to mix and match transport mechanisms, is an ideal target
of our work where an SCTP module can further extend MPI across networks that
require a robust TCP-like connection.
The remainder of this thesis is organized as follows. In Chapter 2, we give a back-
ground overview of SCTP, MPI middleware and NAS benchmarks used in our ex-
periments. Chapter 3 investigates the use of TCP in wide-area environments where
bandwidth and latency can vary and discusses the challenges of using TCP for MPI.
In Chapter 4, we discuss our motivation for using SCTP for MPI and discuss the
design and implementation details. In Chapter 5, we report the results of various
experiments carried out to evaluate our SCTP module and provide comparisons with
TCP. Research and related work on wide-area computing is discussed in Chapter 6.
Chapter 7 presents our conclusions and directions for future work. This thesis is
part of work done jointly with Brad Penoff, a fellow graduate student at UBC. This
thesis focuses on the higher level design and implementation issues of our SCTP
module. There were several other implementation issues, which will likely be detailed in Brad Penoff's thesis. Details about the design and evaluation of the SCTP module in Chapters 4 and 5 are part of our paper due to appear at the Supercomputing conference in November 2005 [35].
Chapter 2
Overview
2.1 SCTP Overview
SCTP is a general purpose unicast transport protocol for IP network data commu-
nications, which has been recently standardized by the IETF [59, 66, 56]. It was
initially introduced by a working group, SIGTRAN, of the IETF as a mechanism
for transportation of call control signaling over the Internet. Telephony signaling
has rigid timing and reliability requirements that must be met by a phone service
provider. Investigations by SIGTRAN led them to conclude that TCP is inadequate
for the transport of telephony signaling, for the following reasons. Independent mes-
sages over a TCP connection can encounter head-of-line blocking due to TCP’s
order-preserving property, which can cause critical call control timers to expire,
leading to setup failures. Also, there is a lack of path-level redundancy support in
TCP, which is a limitation compared to the signaling protocol SS7 that is designed
with full link-level redundancy. As well, lack of control over the TCP retransmission
timers is a shortcoming for applications with rigid timing requirements.
Development of SCTP was directly motivated by the need to find a bet-
ter transport mechanism for telephony signaling [59]. SCTP has since evolved for
more general use to satisfy the needs of applications that require a message-oriented
protocol with all the necessary TCP-like mechanisms and additional features not
present in either TCP or UDP.
SCTP has been referred to by its designers as “Super TCP”: it provides sequencing, flow control, reliability and full-duplex data transfer like TCP, but it also provides an enhanced set of capabilities not in TCP that make applications less susceptible to loss.
service is message oriented and supports the framing of application data. But like
TCP, SCTP is session-oriented and communicates by establishing an association
between two endpoints. In SCTP, the word association is used instead of a con-
nection between two endpoints because an association has a broader scope than a
TCP connection. In an SCTP association, an endpoint can be represented by mul-
tiple IP addresses and a port, whereas in TCP, each endpoint is bound to exactly
one IP address. SCTP supports two types of sockets: one-to-one style and one-to-
many style sockets. One of the design objectives of SCTP was to allow existing
TCP applications to be migrated to SCTP with little effort. The one-to-one SCTP
style socket provides this function, and the few changes required to achieve this
are as follows. The sockets used for communication among processes are created
using IPPROTO_SCTP rather than IPPROTO_TCP. Socket options like turning on/off
Nagle's algorithm are similar to TCP except that socket option SCTP_NODELAY is
used instead of TCP_NODELAY. In the one-to-many style a single socket can commu-
nicate with multiple SCTP associations similar to a UDP socket that can receive
datagrams from different UDP endpoints. An association identifier is used to dis-
tinguish an association on the one-to-many socket. This style eliminates the need
for maintaining a large number of socket descriptors.
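As a concrete sketch (ours, not LAM code), creating the two socket styles differs from TCP only in the arguments to socket(); the one-to-one style even reuses the familiar setsockopt() pattern:

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/sctp.h>

int make_sctp_sockets(void)
{
    /* One-to-one style: a drop-in replacement for a TCP socket; only the
     * protocol argument changes from IPPROTO_TCP to IPPROTO_SCTP. */
    int one_to_one = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);

    /* SCTP_NODELAY disables Nagle's algorithm, as TCP_NODELAY does for TCP. */
    int on = 1;
    setsockopt(one_to_one, IPPROTO_SCTP, SCTP_NODELAY, &on, sizeof(on));

    /* One-to-many style: a single UDP-like socket that can carry many
     * associations, each distinguished by an association identifier. */
    int one_to_many = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);

    return (one_to_one < 0 || one_to_many < 0) ? -1 : 0;
}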
Additionally, SCTP supports multiple logical streams within an association,
where each stream has the property of independent, sequenced message delivery.
Loss in any of the streams affects only that stream and not others. In an SCTP
association, a maximum of 65,536 unidirectional streams can be defined by either
end for simultaneous delivery of data.
Multistreaming in SCTP is achieved through an independence between de-
livery of data and transmission of data. An SCTP packet has one common header
followed by one or more “chunks”. There are two types of chunks, control and data, and each chunk is fully self-descriptive. Control chunks carry information that is used for maintenance of an SCTP association, while user messages are carried in data chunks. Each data chunk is assigned two sets of sequence numbers: a transmission sequence number (TSN) that regulates the transmission of messages and the detection of message loss, and a stream identifier (SNo) and stream sequence number (SSN) pair, which is used to determine the sequence of delivery of received data. The receiver
can thus use this mechanism to determine when a gap occurs in the transmission
sequence, for example, due to loss, and which stream it affects.
Depending on the size of a user message and the path maximum transmission
unit (PMTU), SCTP may fragment the message into one or more data chunks,
however, a single data chunk cannot carry more than one user message. In order
to handle the case in which the user message is fragmented, the SSN remains the
same on all the data chunks containing the fragments, while consecutive TSNs are
assigned to the individual fragments in the order of the fragmentation. An SCTP
sender uses a round robin scheme to transmit user messages with different stream
identifiers to the receiver, so that none of the streams are starved [57].
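For illustration, the stream a message travels on is chosen by the sender at the socket API level; the sketch below (function name ours) uses sctp_sendmsg() from the standard SCTP sockets API, and the kernel assigns the TSN and per-stream SSN described above:

#include <string.h>
#include <sys/socket.h>
#include <netinet/sctp.h>

int send_on_streams(int sd, struct sockaddr *peer, socklen_t peerlen)
{
    const char *m1 = "message for stream 1";
    const char *m2 = "message for stream 2";

    /* The eighth argument selects the stream number (SNo); loss on
     * stream 1 does not delay delivery of the message on stream 2. */
    if (sctp_sendmsg(sd, m1, strlen(m1), peer, peerlen,
                     0 /* ppid */, 0 /* flags */, 1 /* stream */,
                     0 /* ttl */, 0 /* context */) < 0)
        return -1;
    return sctp_sendmsg(sd, m2, strlen(m2), peer, peerlen,
                        0, 0, 2 /* stream */, 0, 0);
}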
Figure 2.1 shows an association between two endpoints with three streams
identified as 0, 1 and 2. As Figure 2.1 shows, SCTP sequences data chunks and
not bytes as in TCP. Together SNo, SSN, and TSN are used to assemble messages
and guarantee the ordered delivery of messages within a stream but not between
streams. For example, receiver Y in Figure 2.1 can deliver Msg2 before Msg1 should
it happen to arrive at Y before Msg1.
In contrast, when a TCP source sends independent messages to the same
receiver at the same time, it has to open multiple independent TCP connections. It
is possible to have multiple streams by having parallel TCP connections, and parallel connections can also improve throughput in congested and un-congested links.

Figure 2.1: Single association with multiple streams. A logical view of endpoints X and Y exchanging user messages over streams 0, 1 and 2 of one association; each data chunk carries a TSN (Transmission Sequence Number), an SSN (Stream Sequence Number) and a message stream number (SNo), and chunks are bundled into SCTP packets handed to the IP layer.

There
are, however, fairness issues as, in a congested network, parallel connections claim
more than their fair share of the bandwidth thereby affecting the cross-traffic [30].
Also, the sender and receiver have to fragment and multiplex data over each of
the parallel TCP connections. One approach to making parallel connections TCP-
friendly is to couple them all to a single connection [25]. SCTP does precisely this by
ensuring that all the streams within a single SCTP association share a common set of
congestion control parameters. It obtains the benefits of parallel TCP connections
while keeping the protocol TCP-friendly [53]. SCTP’s property of being TCP-
friendly promotes its possible wide deployment in the future.
Both the SCTP and TCP protocols use a similar congestion control mech-
anism. An SCTP sender maintains a congestion window (cwnd) variable for each destination address; however, there is only one receive window (rwnd) variable for the entire association. In today's networks congestion control is critical to any transport protocol, and Atiquzzaman et al. [53] have shown that SCTP's congestion control mechanism has improvements over TCP that allow it to achieve a larger average congestion window size and faster recovery from segment loss, which results in higher throughput.
Another difference between SCTP and TCP is that endpoints in SCTP are
multihomed and can be bound to multiple IP addresses (i.e., interfaces). If a peer
is multihomed, then an SCTP endpoint will select one of the peer’s destination ad-
dresses as a primary address and all the other addresses of the peer become alternate
addresses. During normal operation, all data is sent to the primary address, with
the exception of retransmissions, for which one of the active alternate addresses
is selected. Recent work on the retransmission policies of protocols that support
multihoming, like SCTP, proposes a policy that sends fast retransmissions to the
primary address, and timeout retransmissions to an alternate peer IP address [31].
The authors show that sending all retransmissions to an alternate address is use-
ful if the primary address has become unreachable, and can result in performance
degradation otherwise.
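As a sketch of the multihoming API (the addresses and port below are placeholders), a multihomed endpoint binds additional local addresses with sctp_bindx() before connecting or listening:

#include <string.h>
#include <arpa/inet.h>
#include <netinet/sctp.h>

int bind_two_addrs(int sd)
{
    struct sockaddr_in addrs[2];
    memset(addrs, 0, sizeof(addrs));
    for (int i = 0; i < 2; i++) {
        addrs[i].sin_family = AF_INET;
        addrs[i].sin_port   = htons(5000);               /* placeholder port */
    }
    inet_pton(AF_INET, "10.0.0.1",    &addrs[0].sin_addr);  /* primary path  */
    inet_pton(AF_INET, "192.168.0.1", &addrs[1].sin_addr);  /* alternate     */

    /* Both addresses become part of any association formed on this socket;
     * SCTP can fail over to the alternate if the primary becomes unreachable. */
    return sctp_bindx(sd, (struct sockaddr *)addrs, 2, SCTP_BINDX_ADD_ADDR);
}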
When a destination address is determined to be down, SCTP’s multihom-
ing feature can provide fast path failover for an association. SCTP currently does
not support the simultaneous transfer of data across interfaces, but this will likely
change in the future. Currently, researchers at the University of Delaware [29, 28] are
investigating the use of SCTP’s multihoming feature to provide simultaneous trans-
fer of data between two endpoints through two or more end-to-end paths, and this
functionality may become part of the SCTP protocol.
Another factor that may contribute to rapid and widespread deployment of
SCTP is the availability of utilities like withsctp [42] and an SCTP shim translation
layer [3] that can transparently translate calls to the TCP API into equivalent SCTP
calls. The withsctp utility intercepts calls to the TCP socket at the application level
and converts them to SCTP calls. The SCTP shim translation layer works at the
kernel level and allows a hybrid approach, in which it first tries to establish an SCTP
association and, failing that, sets up a TCP connection.
2.2 Overview of MPI Middleware
MPI has a rich variety of message passing routines. These include MPI_Send and
MPI_Recv along with various combinations such as blocking, non-blocking, syn-
chronous, asynchronous, buffered and unbuffered versions of these calls. Processes
exchange messages by using these routines and the middleware is responsible for
progressing all the send and receive requests from their initiation to completion. As
an example, the format of the basic blocking and non-blocking send and receive calls
is as follows:
Blocking MPI_Send and MPI_Recv calls:
MPI_Send(msg, count, type, dest_rank, tag, context)
MPI_Recv(msg, count, type, source_rank, tag, context)
Non-blocking MPI_Isend and MPI_Irecv calls:
MPI_Isend(msg, count, type, dest_rank, tag, context, request_handle)
MPI_Irecv(msg, count, type, source_rank, tag, context, request_handle)
In non-blocking calls, the request handle is returned, which can be used to
check, later in the program execution, if the send/receive operation has completed
or not. An MPI message consists of a message envelope and payload as shown in
Figure 2.2. The matching of a send request to its corresponding receive request, for
example, an MPI_Send to its MPI_Recv, is based on three values inside the message
envelope: (i) context, (ii) source/destination rank and (iii) tag.

Figure 2.2: MPI and LAM envelope format. An MPI message is an envelope (context, rank, tag) followed by the payload; the corresponding LAM envelope carries the message length, tag, flags, rank, context and a sequence number.

Context identifies
a set of processes that can communicate with each other. Within a context, each
process is uniquely identified by its rank, an integer from zero to one less than the
number of processes in the context. A user can specify the type of information being
carried by a message by assigning a tag to it. MPI also allows the source and/or tag
of a message to be a wildcard in a receive request. For example, if MPI_ANY_SOURCE is used, the process is ready to receive messages from any sending process. Similarly, if MPI_ANY_TAG is specified, a message arriving with any
tag would be received.
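A small illustrative fragment (ours, not from LAM) shows matching with a wildcard receive; the status object reveals which source and tag actually matched:

#include <mpi.h>

void wildcard_recv_example(void)
{
    int rank, payload = 42;
    MPI_Status status;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send one tagged message to rank 1. */
        MPI_Send(&payload, 1, MPI_INT, 1, 7 /* tag */, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Wildcard receive: matches any source and any tag. */
        MPI_Recv(&payload, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        /* status.MPI_SOURCE and status.MPI_TAG identify what arrived. */
    }
}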
In addition to their use in message matching, the tag, rank and context
(TRC) define the order of delivery of MPI messages. MPI semantics require that
any messages sent with the same message TRC value should be delivered in order,
i.e., they may not overtake each other, whereas MPI messages with different TRCs
are not required to be ordered. As seen in Figure 2.3, if messages between a pair
of processes are delivered strictly in the order in which they are sent, then this is
perfectly permissible according to the MPI semantics; however, this ordering is only a subset of the legitimate message orderings allowed by MPI. Out-of-order messages with the same tags, on the other hand, violate MPI semantics.

Figure 2.3: Examples of valid and invalid message orderings in MPI. In three exchanges between processes X and Y, messages sent with different tags (Tag_A, Tag_B) may overtake one another, but out-of-order delivery of messages with the same tag violates MPI semantics.
2.2.1 Message Progression Layer
Implementations of the MPI standard typically provide a request progression mod-
ule that is responsible for progressing requests from initialization to completion. One
of the main functions of this module is to support asynchronous and non-blocking
message progression in MPI. The request progression module must maintain state
information for all requests, including how far the request has progressed and what
type of data/event is required for it to reach completion. There are two approaches to implementing the MPI library: single-threaded and multithreaded implementations.
Multithreaded implementations [6, 14, 50] use a separate communication thread
for making progress asynchronously and provide increased overlap of communica-
tion and computation. Multithreaded implementations, however, have additional
overheads resulting from synchronization and cache and context switching, which
single-threaded implementations, like LAM and MPICH, avoid. Typically, single-
threaded implementations require special design considerations since requests are
only progressed by the MPI middleware when the user makes calls into it.
LAM’s request progression mechanism is representative of the way requests
are handled in other MPI middleware implementations and, due to its modular
design, provided a convenient platform for us to work in. LAM provides a TCP
Request Progression Interface (RPI) module that is used to perform point-to-point
message passing using TCP as its underlying transport protocol. Requests in LAM
go through four states: init, start, active and done, and the RPI module is responsi-
ble for progressing all requests through their life-cycle. We redesigned LAM’s request
progression interface layer module for TCP to use SCTP. Although LAM’s message
passing engine is single-threaded, our design is not limited to single-threaded imple-
mentations only. Of course, in a corresponding multithreaded implementation some
details will need to change to handle issues related to synchronization and resource
contention. In the next section we describe the different types of messages and the
message delivery protocols as implemented in LAM.
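The four request states just described can be pictured as a simple enumeration; the names below are ours, not LAM's actual identifiers:

typedef enum {
    REQ_INIT,   /* request created, e.g., by MPI_Isend            */
    REQ_START,  /* request handed to the progression engine       */
    REQ_ACTIVE, /* envelope/body transfer in progress             */
    REQ_DONE    /* complete; MPI_Wait or MPI_Test can now return  */
} req_state_t;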
2.2.2 Message Delivery Protocol
As shown in Figure 2.4, when an incoming message is received at the transport layer
it is first checked against a list of posted receive requests. If a match is found then
the message is dispatched to the correct receive request buffer. If a matching receive
was not issued at the time of message arrival, the message is placed in an unexpected message queue (implemented in LAM as an internal hash table).

Figure 2.4: Message progression in MPI middleware. An incoming message is matched against the receive request queue and, failing a match, buffered in the unexpected message queue.

Similarly, when a new receive request is posted, it is first checked against the unexpected message queue; if a match is found then the request is advanced; otherwise, it is placed in
the receive request queue.
In LAM, each message body is preceded by an envelope that is used to match
a message. We can broadly classify the messages into three categories with respect to
the way LAM treats them internally:
1. Short messages, which are, by default, of size 64 Kbytes or less.
2. Long messages which are of size greater than 64 Kbytes.
3. Synchronous short messages, which are also of size 64 Kbytes or less, but
require a different message delivery protocol than short messages, as discussed
below.
Figure 2.2 shows the format of an envelope in LAM. The flag field in the
envelope indicates what type of message body follows it.
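As a concrete rendering of Figure 2.2, the envelope can be sketched as a C struct; the field names here are illustrative rather than LAM's actual identifiers:

#include <stdint.h>

struct envelope {
    int32_t len;      /* length of the message body in bytes    */
    int32_t tag;      /* user-supplied message tag               */
    int32_t flags;    /* short, long, or synchronous-short type  */
    int32_t rank;     /* rank of the sending process             */
    int32_t context;  /* communication context identifier        */
    int32_t seq;      /* sequence number                         */
};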
Short messages are passed using eager-send and the message body immedi-
ately follows the envelope. If the receiving process has posted/issued a matching
receive buffer, the message is received and copied into the buffer and the request
is marked as done. If no matching receive has been posted, then this is treated
as an unexpected message and buffered. When a matching receive is later posted,
the message is copied into the request receive buffer and the corresponding buffered
message is discarded.
Long messages are handled differently than short messages and are not sent eagerly, but instead are sent using the following rendezvous scheme. Initially only the envelope of the long message is sent to the receiver; if the receiver has posted a matching receive request, it sends back an acknowledgment (ACK) to the sender to indicate that it is ready to receive the message body. In response to the ACK, the sender sends the envelope again, followed by the long message body. If no matching receive was posted at the time the initial long envelope was received, it is treated as an unexpected message; later, when a matching receive request is posted, the receiver sends back an ACK and the rendezvous proceeds as above. The use of eager send for long messages has been studied, and an eager protocol for long messages has been found to outperform a rendezvous protocol only if a significant number of receives have been pre-posted [4]. In [65] the authors report that protocols like eager send can lead to resource exhaustion in large cluster environments.
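The rendezvous exchange can be summarized by the following sender-side sketch; the helper functions are hypothetical stand-ins for LAM's internals, not its actual code:

#include <stddef.h>

struct envelope;  /* as sketched above */

/* Hypothetical transport helpers. */
void send_envelope(int peer, const struct envelope *env);
void send_body(int peer, const void *buf, size_t len);
void wait_for_ack(int peer);

void send_long(int peer, const struct envelope *env,
               const void *body, size_t len)
{
    send_envelope(peer, env);    /* 1. advertise: envelope only, no body  */
    wait_for_ack(peer);          /* 2. receiver matched a receive request */
    send_envelope(peer, env);    /* 3. envelope again ...                 */
    send_body(peer, body, len);  /*    ... followed by the message body   */
}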
Synchronous short messages are also communicated using eager-send; however, the send is not complete until the sender receives an ACK from the receiver.
The discussion above about the handling of unexpected messages also applies here.
In addition to point-to-point communication functions, MPI has a number of rou-
tines for collective operations. In LAM, these routines are implemented on top of
point-to-point communication.
2.2.3 Overview of NAS Parallel Benchmarks
In Chapters 3 and 5, we describe several experiments conducted using the Numerical
Aerodynamic Simulation (NAS) Parallel Benchmarks (NPB) [9]. These benchmarks
approximate the performance that can be expected from a portable parallel appli-
cation. A brief overview of these benchmarks is included in this section.
The NAS parallel benchmarks were developed by the NASA Ames Research
Centre and are a set of eight programs that are derived from computational fluid dy-
namics code. NPB was used in our experiments because it approximates real-world
parallel scientific applications better than most other available benchmarks. The
eight programs are: LU (Lower-upper symmetric Gauss-Seidel), IS (Integer Sort),
MG (Multi-grid method for Poisson equation), EP (Embarrassingly Parallel), CG
(Conjugate Gradient), SP (Scalar Pentadiagonal Systems), BT (Block Tridiagonal
Systems) and FT (Fast Fourier Transform for Laplace equation). Each program has
several problem sizes (S, W, A, B, C and D) ranging from small to very large.
Communication characteristics of these benchmarks are described in [20],
which shows that all of the benchmarks use collective communication. With the exception of the EP benchmark, all the benchmarks use point-to-point communication as well. EP uses only MPI_Allreduce and MPI_Barrier and has very few communication calls. The number of communication calls in the remaining benchmarks ranges from 1,000 to 50,000 for dataset B. The communication volume ranges from less
than 12 Kbytes for EP to several gigabytes of data for the other benchmarks.
Chapter 3
Evaluating TCP for MPI in
WANs
In this chapter we evaluate the use of TCP for MPI in wide-area networks and
Internet-like environments where bandwidth and latency can vary. Our objective is
to gain a better understanding of the impact of IP protocols on MPI performance
that will improve our ability to take advantage of wide-area networks for MPI and
may suggest techniques to make MPI programs less sensitive to latency, so that they
can execute in Internet-like environments. Not all programs will be suitable for these
environments; however, for those programs that are, it is important to understand
the limitations and to continue to take advantage of IP transport protocols.
We performed our experiments in an emulated wide-area network using the
NAS parallel benchmarks to evaluate TCP in this environment. Our experiments
concentrated on latency and the effect of packet delay and loss on performance,
and we measured their effect using two widely available variants of TCP, namely
New-Reno and SACK. We also evaluated these two versions with respect to the
TCP NODELAY option, which disables Nagle’s algorithm, and the effect of delayed
acknowledgments. In this chapter we discuss a number of challenges for using TCP
and MPI in a wide-area network. Our focus is on performance issues rather than
issues pertaining to management and access.
3.1 TCP Performance Issues
For improving TCP’s performance in high bandwidth delay environments, tech-
niques like tuning of TCP buffers and using parallel streams are well known in the
networking community. Work done in [15] implements a TCP tuning daemon that
uses network metrics to transparently tune TCP parameters for individual flows
over designated paths. The daemon decides on appropriate buffer sizes to use on
a per-flow basis and in the case of packet loss attempts to speed up recovery by mod-
ifying TCP congestion control parameters, disabling delayed acknowledgments or
using parallel streams. Their results show that their tuning techniques achieve im-
provement for bulk transfers over standard TCP. Other work on TCP shows that
the common practice of setting the socket buffer size based on the definition of
bandwidth-delay product (BDP) as a function of transmission capacity and round-
trip time can lead to suboptimal throughput [30]. They show that the socket buffer
size should be set to the BDP only in paths that do not carry cross traffic and do
not introduce queuing delays. They also present an application level mechanism
that automatically adjusts the socket buffer size to near its optimal value.
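To make the buffer-sizing rule concrete, a 100 Mbit/s path with a 50 ms round-trip time has a BDP of (100e6 / 8) bytes/s × 0.050 s = 625,000 bytes. A minimal sketch (function name ours) that applies this rule to a socket:

#include <sys/socket.h>

int set_buffers_to_bdp(int sd, double bw_bits_per_s, double rtt_s)
{
    /* BDP in bytes: (bandwidth in bits/s divided by 8) times RTT in seconds. */
    int bdp = (int)(bw_bits_per_s / 8.0 * rtt_s);
    if (setsockopt(sd, SOL_SOCKET, SO_SNDBUF, &bdp, sizeof(bdp)) < 0)
        return -1;
    return setsockopt(sd, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof(bdp));
}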
Work in [11, 21, 64] focuses on TCP’s role in high performance computing
and identifies both implementation-related and protocol-related problems in TCP.
Some of the issues that these papers look at are:
1. Operating system bottlenecks due to increasing differences between networking
speeds and processor and bus speeds. This work suggests the use of OS bypass or interrupt coalescing to help TCP deal with the OS's interrupt processing overhead, and the use of a larger effective MTU size, e.g., through the use of jumbograms.
2. Comparison of rate-based versions like TCP Vegas with TCP Reno on high
performance links. Their results show that, with proper selection of Vegas
parameters, TCP Vegas performs better than TCP Reno for real traffic dis-
tributions.
There are also approaches that use other transport protocols like UDP and
are aggressive about acquiring network capacity. However, these techniques raise
concerns about fairness when competing with other traffic in the network. In [13]
a new communication level protocol is proposed that uses a single UDP stream
for high-performance data transfers. Their protocol implements a communication
window that spans the entire data buffer at the user level and a selective acknowl-
edgment window that spans the entire data transfer. In addition to this, they use a
user-defined acknowledgment frequency. Their protocol achieves increased through-
put over both short- and long-haul high-performance networks; however, the paper does not discuss fairness issues or the effect of their protocol on cross traffic.
Our work does not investigate buffer tuning, because our focus is on the
effect of packet delay and loss on performance. Buffer tuning is certainly important
for large messages and it raises interesting questions for LAM, which sets up a fully
connected environment. During MPI_Init connections are set up between each pair
of nodes and the socket buffer size is set to the same value for all connections. As
shown with MPICH-G2 [37], it is possible to tune TCP parameters for individual
connections and this is one area for future exploration.
Given our focus on latency, we investigated the effect of Nagle’s algorithm
and delayed acknowledgments on the benchmarks. In TCP, Nagle's algorithm
holds back data in the TCP buffer until a sufficiently large segment can be sent. De-
layed acknowledgments do not ACK each segment received but wait for (a) data on
the reverse stream, (b) two segments, or (c) a 100 to 200 millisecond timer to go off,
before an ACK is sent. Both Nagle’s algorithm and delayed acknowledgments opti-
mize bandwidth utilization and normally both are active. In LAM/MPI, Nagle's algorithm is disabled (i.e., TCP_NODELAY is set) to reduce message latency. Ideally, to achieve minimum latency, Nagle's algorithm and delayed acknowledgments
should both be disabled. We investigated using different combinations of these two
parameters. These parameters are not independent, as reported in [47], and this
work proposes a modification to Nagle's algorithm that can improve latency and
also protect against transmission of an excessive number of short packets. Another
factor that can impact performance is the difference between various TCP stacks. It
has been shown that Apache web servers, which have the TCP_NODELAY option set on all their sockets, perform worse on Linux than on FreeBSD. FreeBSD-derived stacks have byte-oriented ACKs, whereas Linux uses packet-oriented ACKs. Linux asks for an acknowledgment after the first packet, while in FreeBSD the total size, and not the number of packets, matters, and it asks for an acknowledgment only after several packets [60]. This
enhancement is also part of SCTP’s congestion control mechanism. In SCTP, the
congestion window is increased based on the number of bytes acknowledged, not the
number of acknowledgments received [1].
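For reference, disabling Nagle's algorithm is a one-line socket option, as the sketch below shows; delayed acknowledgments, by contrast, are typically controlled system-wide (e.g., by a sysctl on FreeBSD) rather than per socket:

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

int disable_nagle(int sd)
{
    int on = 1;  /* TCP_NODELAY set means Nagle's algorithm is off */
    return setsockopt(sd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}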
3.2 Investigating the Use of TCP for MPI in WANs
For evaluating the use of TCP for MPI, we used the NAS benchmarks NPB-MPI
version 3.2 and LAM version 7.0.6 as the MPI middleware.
As a proof of concept and the first step in understanding the problems of us-
ing TCP for MPI in environments with high latency and loss, we explored different
environments for our experiments. Our first attempt was to use PlanetLab [51] which
provides a world-wide shared distributed compute environment. We attempted to
run MPI programs on a set of machines in North America and Asia. Although
they did run, as expected we found a huge variation in times due to network and
processing delays. Moreover, the timing information from PlanetLab was not reli-
able and the results obtained were not consistent. We also investigated the use of
EmuLab [16] as an experimental testbed for this work. However, the time required
to run the NAS benchmarks was often several days; therefore, we decided in favor of
using a dedicated collection of machines and created an emulated wide-area network
environment that uses Dummynet to vary packet loss and packet latency.
3.3 MPI in an Emulated WAN Environment
Our experimental setup consisted of four FreeBSD-5.3 nodes connected together via
a layer-two switch using 100 Mbit/s Ethernet connections. Two of the nodes used were Pentium 4 2.8 GHz machines, one was a Pentium III 558 MHz and one a Pentium II 400 MHz.
These nodes represented a mix of fast and slow machines that introduced some
measure of heterogeneity in our network, which is something that would be expected
while executing MPI programs over WANs.
We experimented with six of the eight NAS benchmarks; we did not use
the FT benchmark because it required Fortran 90 support and we did not use BT
because a single run took several days to complete on our setup. We investigated
both small (S, W) and large problem sizes (B).
Dummynet was configured on each of the nodes to allow us to vary the
amount of loss and latency on the links between the nodes. TCP is used as the
underlying transport mechanism in LAM and our goal was to determine how it
would perform under varying loss rates and latencies similar to those encountered
in WANs. We experimented with varying the latencies between 0 and 50 millisec-
onds and packet loss rates between 0% to 10%. The two different variants we
investigated were New-Reno and SACK. As reported by Floyd, New-Reno is widely
deployed and is used in 77% of the web servers that could be classified [45]. Since
we were investigating loss, we also chose TCP SACK as well. It is well-known that
TCP SACK performs better than Reno under loss conditions where TCP SACK’s
selective acknowledgment can avoid some retransmissions under packet loss. Finally,
we experimented with the socket settings for TCP NODELAY and delayed acknowledg-
ments.
3.3.1 Communication Characteristics of the NAS Benchmarks
The communication characteristics of the NAS parallel benchmarks were discussed
in Section 2.2.3 and it was noted that with the exception of the EP benchmark,
the remaining benchmarks use both collective and point-to-point communication.
EP uses very few communication calls; however, in the remaining benchmarks, the communication calls range from 1,000 to 50,000 for dataset B. Tables in [20] show
that the communication volume ranges from less than 12 Kbytes for EP to several
gigabytes of data for the other benchmarks.
As discussed in Section 2.2.2, MPI middleware keeps track of two types of
messages: expected messages and unexpected messages. There is slightly more
overhead for processing unexpected messages; however, a large number of expected
messages indicates that the application may be stalled waiting for a message to
arrive. We instrumented the LAM code to determine the number of expected and
unexpected short and long messages generated during the execution of the bench-
marks. A breakdown of different message types for classes S, W, A and B is shown
in Table 3.1 for the LU benchmark and in Table 3.2 for the SP benchmark.
              Expected            Unexpected
Dataset    Long     Short      Long     Short
S             0     1,097         0         0
W             0    18,978         0         0
A           300    31,145       208         0
B           254    50,221       251         0

Table 3.1: Breakdown of message types for LU classes S, W, A and B using the default LAM settings for TCP
LU and SP benchmarks are two of the most communication intensive bench-
marks in the set and they were chosen because of the different characteristics of their
communication calls and the difference in the type of messages exchanged during
their execution. LU uses only blocking send calls (MPI_Send) apart from collective communications, whereas SP uses non-blocking MPI send calls (MPI_Isend).

              Expected            Unexpected
Dataset    Long     Short      Long     Short
S             0     1,216         0         0
W             0     4,422         0         0
A         2,807         9     2,011         0
B         1,314         9     3,504         0

Table 3.2: Breakdown of message types for SP classes S, W, A and B using the default LAM settings for TCP
For larger datasets like A and B, LU mostly communicates using short messages,
whereas the number of short messages in SP is almost non-existent compared to
long messages.
As mentioned, we used a heterogeneous environment with two faster ma-
chines and two slower machines. The numbers in these tables are for the two
slower machines. For the faster machines, a greater proportion of long messages
were expected messages. These numbers demonstrate some of the problems in un-
derstanding the performance of these benchmarks, especially in a heterogeneous
environment. The fact that the vast majority of messages are expected indicates
that the application is very likely spending time waiting for messages to arrive. This
may indicate the benchmark is sensitive to message latency.
3.4 Experiments
In the following sections we describe the results of our experiments with varying loss
and delay for six of the eight NAS benchmarks for different problem sizes.
3.4.1 NAS Benchmarks, Loss
Experiments were performed with the NAS benchmarks where Dummynet was used
to introduce random packet loss of 1%, 2% and 10% between each of the four nodes.
For the baseline case, the experiments were also run when there was no loss on
the links. The benchmarks report performance by three measurements: the total
amount of time taken, the Mops/s total value and Mops/s per process. The Mops/s total value is linearly related to the total time.
Figure 3.1 shows the performance of TCP SACK for dataset B under different
loss rates. Performance of TCP New-Reno showed a similar trend; therefore, we did not include its graph.

Figure 3.1: NAS benchmarks performance for dataset B under different loss rates using TCP SACK (Nagle's algorithm disabled)

As expected, the performance falls with increasing loss. The
degradation in performance is more marked for the LU benchmark than others.
The LU benchmark contains a large number of neighbor-to-neighbor MPI_Send calls, which are for the most part short and expected messages. As a result, LU is sensitive
to the large delays that occur when packets are lost and need to be retransmitted.
The only benchmark not affected by loss is EP, which has only a few collective
communication calls. The reduction in bandwidth due to TCP’s congestion control
mechanism does not seem to be affecting performance.
Our results for 0%, 1% and 2% on all benchmarks showed that TCP SACK
always performed better than or equal to TCP New-Reno for all six NAS measurements.
Our results are consistent with the results in the literature that show that TCP
SACK under loss performs better than TCP New-Reno. As discussed in Fall and
Floyd’s paper [19], New-Reno does not perform as well as TCP SACK when multiple
packets are dropped from a window of data. During Fast Recovery, New-Reno is
limited to retransmitting at most one dropped packet per round trip time, which
affects its performance.
3.4.2 NAS Benchmarks, Latency
In order to study network latencies and their effect on MPI applications, we ran the NAS
benchmarks with 10ms and 50ms latencies. The increased latencies were introduced
equally on all network links.
As shown in Figure 3.2, the performance is similar to that obtained in Fig-
ure 3.1 for loss. This is further evidence that the major effect of loss on performance is increased latency.

Figure 3.2: NAS benchmarks performance for dataset B with different latencies using TCP SACK (Nagle's algorithm disabled)

It is surprising that there is no significant difference in behavior
between increasing the latency for all messages versus loss, which increases delay for
the small number of lost messages. This effect may be due to the loosely synchronous
nature of the computation where delaying one message delays them all.
3.4.3 NAS Benchmarks, Nagle's Algorithm and Delayed Acknowledgments
The last characteristic of TCP that we investigated was the enabling of Nagle’s
algorithm and delayed acknowledgments. Typically, Nagle’s algorithm and delayed
acknowledgments are both enabled on a connection. Commonly, implementations
of MPI, including LAM, disable Nagle’s algorithm by default, and retain the system
default of delayed acknowledgments. The results for different settings of these two
parameters for the LU benchmark are shown in Figure 3.3.
Figure 3.3: Performance of LU using different combinations of Nagle's algorithm and delayed acknowledgments (Mops/s for datasets S, W and B)
For LU, Figure 3.3 shows the significant improvement obtained by disabling
Nagle’s algorithm for small problem sizes. This advantage is less evident for larger
problem sizes. Similar observations were made for the SP benchmark. Although
the difference is not large, we consistently found that LAM’s default setting of dis-
abling Nagle’s algorithm and leaving delayed acknowledgments enabled gave the
best performance. In particular, it outperformed the case where both are disabled,
which should have achieved the smallest latency. Surprisingly, even though the
performance of the other combinations is close, the tcpdump trace files show very
different traffic patterns. The traces with Nagle disabled and no delayed acknowl-
edgments had a lot of variation in the RTT (round trip time) and faster recovery
during congestion avoidance, since acknowledgments return more quickly. We ex-
pected the setting with Nagle disabled and no delayed acknowledgments to give the
least amount of delay since segments and ACKs are sent out immediately. However,
as Figure 3.3 shows, this does not result in improved performance. Once again the
synchronous nature of the computation may imply that it is only the maximum
delays that matter.
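For concreteness, the following sketch shows the conventional way of disabling Nagle's algorithm on a connected TCP socket, which is the setting LAM applies by default; the helper name is ours, and delayed acknowledgments are typically controlled system-wide (for example, via the net.inet.tcp.delayed_ack sysctl on FreeBSD) rather than per socket.

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Disable Nagle's algorithm so that small MPI messages are sent
     * immediately instead of being coalesced by the kernel. */
    static int disable_nagle(int sd)
    {
        int on = 1;
        return setsockopt(sd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
    }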
Figure 3.4: Comparison of LU and SP benchmarks for different datasets (Loss=0%, SACK ON, Nagle OFF, Delayed ACKs ON)

Figure 3.4 shows the performance of LU and SP for problem sizes W, S and
B using the best set of parameters (SACK, Nagle disabled, and delayed acknowledg-
ments turned on). As Figure 3.4 shows, there is a significant difference between the
behavior of LU and SP across the problem set sizes. SP shows a significant drop in
performance for dataset B compared to LU. We investigated this further to find that
the major difference between S, W and B was the number of long messages used in
these benchmarks. For problem size B, the number of long messages ranged from
almost none in LU to almost all of the approximately 5000 messages exchanged in
SP. This shows the dramatic impact that the MPI middleware and the use of mes-
sage rendezvous for long messages can have on performance. SP uses non-blocking
MPI_ISEND but is not able to begin transferring data until the rendezvous point is
reached.
3.5 Summary of Results
Understanding the dynamics of TCP is difficult because TCP contains a range of
flow control and congestion avoidance mechanisms that result in varied performance
in different environments. There are also a number of different TCP variants. This
chapter highlights the performance problems with using TCP for MPI in WANs,
especially taking into account the TCP variants and settings typically used for
MPI in local and wide-area networks. We explored the performance of TCP in
an emulated WAN environment to better quantify the differences. TCP SACK
consistently outperformed TCP New-Reno and this points to the importance of
better congestion control for such environments. For NAS benchmarks, as expected,
there was a considerable drop in performance as the packet loss increased from 0%
to 10%. In the NAS benchmarks, for the latencies investigated, the majority of the
messages were expected and thus potentially very sensitive to variation in latencies.
We found that, for several of the benchmark programs, performance is determined
by the maximum RTT across all of the connections. Even a modest loss in one
link creates spikes in the RTT that in the end slow down all the processes. The
sensitivity of an application to latency results in the overall performance being only
as fast as the slowest process.
In terms of Nagle’s algorithm, the default configuration in LAM-MPI achieved
the best results. There was not much of a difference between the best results and
those obtained for other combinations except for TCP’s default setting: Nagle en-
abled with delayed acknowledgments. We found that the rendezvous mechanism
used by LAM can have a much greater impact on performance than adjusting the
Nagle and delayed acknowledgment settings.
Our experiments with TCP emphasize the importance of latency tolerant
MPI programs for wide-area networks and the ability to recover quickly from loss.
In Chapter 4, we present SCTP-based middleware for MPI and discuss many of the
features of SCTP, like multistreaming, multihoming, better congestion control and
a built-in SACK mechanism, that make it more resilient and better suited to wide-
area networks. Due to SCTP’s similarity with TCP, it will also be able to take
advantage of research on tuning of individual connections and other performance
enhancements.
Chapter 4
SCTP-based Middleware:
Design and Implementation
In this chapter we discuss the role that SCTP can play as the underlying trans-
port layer for MPI. The advantage of using MPI in distributed environments is that
its libraries exist for a wide variety of machines and environments ranging from
dedicated high performance architectures, local area networks, compute grids, to
execution across wide-area networks and the Internet. As a standard it has been
successful in creating a body of code that is portable to any of these environments.
However, the strict definition of portability is insufficient since scalability and per-
formance must also be considered. It is challenging to obtain good performance in
environments such as wide-area networks where bandwidth and latency vary. Our
experiments in Chapter 3 with TCP underscore the importance of latency tolerant
MPI programs for the wide-area environments. Programs that overlap communica-
tion with computation can hide much of the effects of the latency that is inevitable
in such environments.
There are several reasons why SCTP may be more suitable than TCP for
MPI in general, and why it can also prove to be more robust in wide-area environments.
1. As we discuss in detail in this chapter, SCTP has many similarities with MPI,
which allows us to offload much of MPI's functionality onto the transport layer.
2. SCTP’s multistreaming ability allows us to provide additional concurrency
and avoid the head-of-line blocking that is possible in TCP.
3. Congestion control and SACK in SCTP are more extensive than in TCP and
allow more efficient communication under loss conditions.
4. SCTP’s support for multihoming makes it more fault tolerant - an important
property in wide-area networks. It can take advantage of multiple alternate
paths between a pair of nodes and can recover from lost segments faster than
TCP.
In this thesis we investigate all of the above advantages of SCTP, except
multihoming. We first discuss the details of redesigning the LAM TCP RPI module
to use SCTP as its underlying transport protocol. We implemented three versions
of an SCTP-based RPI module to investigate the variety of new features that SCTP
offers and to explore ways in which MPI can
benefit from a new transport protocol. The design and implementation of our SCTP
module was a three-phased process where we iteratively added functionality to each
successive module. Each module was refined as we learned from our experiences
until we achieved the final version, which assimilates the benefits of the previous
modules.
One of the first challenges we faced in redesigning the TCP RPI mod-
ule was the lack of documentation about the module and its interaction with other
parts of LAM. The TCP RPI module maintains state information for all requests
such as how far the request has progressed and what type of data/event is re-
quired for it to reach completion. It has a complex state machine for supporting
asynchronous and non-blocking message progression in MPI. The only way to learn
about how it worked was by thorough code examination. In the process, we also
created a detailed technical document explaining the module’s working that other
people can benefit from [33]. This document is publicly available and is linked off
the LAM/MPI official website. We also extensively instrumented LAM code to
trace request progression and other key events using lam_debug statements. The
lam_debug statements allowed us to tune the level of verbosity in the output and
were used to create our own custom diagnostic system. During our initial learning
process and also during implementation and testing we made extensive use of the
diagnostic output to trace problems. Tracing problems in the execution of parallel
programs was quite challenging at times, especially when the output was not always
reproducible and when slight changes in timing of events produced different results.
In many instances, painstaking examination of the diagnostic traces had to be done
to identify the source of errors.
Another challenge was the use of a new transport protocol that is still in its
early stages of optimization and tuning. We found that the performance of SCTP was
highly dependent on the implementation used, with the Linux lksctp implementation
performing much worse than the FreeBSD KAME implementation. Moreover, while
using different features of the SCTP API, there were several coding nuances that we
had to discover for ourselves as not much documentation was available. During our
testing, the SCTP module also provided a testing ground for the FreeBSD KAME
SCTP implementation where we discovered several problems with the protocol that
led to improvements in the stack. We collaborated closely with Randall Stewart,
one of the primary designers of SCTP, in the identification of those problems.
In the subsequent sections we describe the design and implementation details
of the final version of the SCTP module, since it incorporates the features of previous
versions, and also compare it to the LAM-TCP design. Our iterative design process
is discussed in more detail in [34]. We first discuss why SCTP is a good match for
MPI as its underlying transport protocol. In Sections 4.2 and 4.3 we discuss how
multiple streams are used in our module and the enhanced concurrency obtained as
a result. In Section 4.4 we discuss resource management issues in the middleware
and provide a comparison with LAM-TCP. In Section 4.5 we describe the race
conditions in our module and the solution adopted to fix them. In Section 4.6 we
discuss enhanced reliability and fault tolerance in our module due to the features
provided by SCTP. In the last section we discuss some limitations of using SCTP.
4.1 Overview of using SCTP for MPI
SCTP promises to be particularly well-suited for MPI due to its message-oriented
nature and provision of multiple streams in an association.
As shown in Figure 4.1, there are some striking similarities between SCTP
and MPI.

    MPI                 SCTP
    Context             One-to-Many Socket
    Rank / Source       Association
    Message Tags        Streams

Figure 4.1: Similarities between the message protocol of MPI and SCTP

Contexts in an MPI program identify a set of processes that communicate
with each other, and this grouping of processes can be represented as a one-to-many
socket in SCTP that establishes associations with that set of processes. SCTP can
map each association to the unique rank of a process within a context and thus use
an association number to determine the source of a message arriving on its socket.
Each association can have multiple streams which are independently ordered and
this property directly corresponds with message delivery order semantics in MPI.
In MPI, messages sent with different tag, rank and context (TRC) to the same
receiver are allowed to overtake each other. This permits direct mapping of streams
to message tags.
This similarity between SCTP and MPI is workable at a conceptual level;
however, there are some implementation issues, to be discussed, that make the
socket-context mapping less practical. Therefore, we preserved the mappings from
associations to ranks and from streams to message tags but not the one from socket
to context in our implementation. Context creation within an MPI program can be
a dynamic operation and if we map sockets to contexts, this would require creating
a dynamic number of sockets during the program execution. Not only would this
add complexity and additional bookkeeping to the middleware, but there can also be
an overhead to managing a large number of sockets. Creating a lot of sockets in
the middleware counteracts the benefits we can get from using a single one-to-many
SCTP socket that can send/receive messages from multiple associations. Due to
these reasons, instead of mapping sockets to contexts, the context and tag pair was
used to map messages to streams.
In discussions with Randall Stewart we discovered another way of dealing
with contexts, and that is to map them to the PID (Payload Protocol Identifier) carried
in each SCTP DATA chunk. The PID is a 32-bit field that can be used at
the application level to label the contents of an SCTP packet and is ignored by the
SCTP layer. Using the PID field gives us the flexibility of dynamic context creation
in our application without the need for maintaining a corresponding number of
sockets. Also, the PID mapping can be easily incorporated in our module, with
minor modifications.
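As a sketch of this alternative, the sender can stamp each message with the MPI context through the PID argument of sctp_sendmsg; the wrapper below is illustrative and not part of our module.

    #include <arpa/inet.h>
    #include <netinet/sctp.h>
    #include <sys/socket.h>

    /* Send one MPI message, labelling it with the MPI context in the
     * PID field and selecting an SCTP stream from the message tag.
     * The SCTP layer ignores the PID, so the receiving middleware can
     * recover the context without a separate socket per context. */
    ssize_t send_with_context(int sd, const void *buf, size_t len,
                              struct sockaddr *to, socklen_t tolen,
                              uint32_t mpi_context, uint16_t stream)
    {
        return sctp_sendmsg(sd, buf, len, to, tolen,
                            htonl(mpi_context), /* PID, network byte order */
                            0,                  /* flags */
                            stream,             /* stream number */
                            0,                  /* timetolive: never expire */
                            0);                 /* opaque sender context */
    }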
4.2 Message Demultiplexing
As discussed in Section 2.1, SCTP provides a one-to-many UDP-like style of com-
munication that allows a single socket descriptor to receive messages from multiple
associations, eliminating the need for maintaining a large number of socket descrip-
tors. In the case of LAM’s TCP module there is a one-to-one mapping between
processes and socket descriptors, where every process creates individual sockets for
each of the other processes in the LAM environment.
In SCTP’s one-to-many style, the mapping between processes and sockets
no longer exists. With one-to-many sockets there is no way of knowing when a
particular association is ready to be read from or written to. At any time, a process
can receive data sent on any of the associations through its sole socket descriptor.
SCTP’s one-to-many socket API does not permit reception of messages from a par-
ticular association; therefore, messages are received by the application in the order
they arrive and only afterwards is the receive information examined to determine
the association it arrived on. It is likely that the ability to peek at the association
information before receiving data may become part of later versions of the SCTP
API [57].
In our SCTP module, each message goes through two levels of demultiplexing:
first on the association number the message arrived on, and second on the stream
number within that association. These two parameters allow us to invoke the correct
state function for that request, which directs the incoming message to the proper
request receive buffer. If no matching receive was posted, it is treated like an
unexpected message and buffered.
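A minimal sketch of this receive path follows; the buffer size and the helper routines lookup_request, advance_request and buffer_unexpected are hypothetical stand-ins for the module's state machine.

    #include <netinet/sctp.h>
    #include <stdint.h>
    #include <string.h>

    #define RECV_BUF_SIZE 65536            /* illustrative buffer size */

    typedef struct request request_t;      /* middleware request record */
    extern request_t *lookup_request(sctp_assoc_t assoc, uint16_t stream);
    extern void advance_request(request_t *req, const char *data, size_t n);
    extern void buffer_unexpected(sctp_assoc_t assoc, uint16_t stream,
                                  const char *data, size_t n);

    static void progress_one_message(int sd)
    {
        char buf[RECV_BUF_SIZE];
        struct sctp_sndrcvinfo sinfo;
        int flags = 0;
        ssize_t n;

        memset(&sinfo, 0, sizeof(sinfo));
        n = sctp_recvmsg(sd, buf, sizeof(buf), NULL, NULL, &sinfo, &flags);
        if (n <= 0)
            return;                        /* EAGAIN or error: try later */

        /* First level: the association id identifies the sender (rank).  */
        /* Second level: the stream number identifies the (context, tag). */
        request_t *req = lookup_request(sinfo.sinfo_assoc_id,
                                        sinfo.sinfo_stream);
        if (req != NULL)
            advance_request(req, buf, (size_t)n);
        else
            buffer_unexpected(sinfo.sinfo_assoc_id, sinfo.sinfo_stream,
                              buf, (size_t)n);
    }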
4.3 Concurrency and SCTP streams
MPI send/receive calls without wildcards define an ordered stream of messages be-
tween two processes. It is also possible, through the use of wildcards or appropriate
tags, to relax the delivery order of messages. Relaxing the ordering of messages can
be used to make the program more message driven and independent from network
delay and loss. However, when MPI is implemented on top of TCP, the stream-
oriented semantics of TCP with one connection per process precludes having un-
ordered message streams from a single process. This restriction is removed in our
implementation by mapping tags onto SCTP streams, which allows different tags
from the same source to be independently delivered.
4.3.1 Assigning Messages to Streams
As discussed in Chapter 2, message matching in MPI is based on the tag, rank
and context of the message. In SCTP, the number of streams is a static parameter
(short integer value) that is set when an association is initially established. For each
association, we use a fixed-size pool of stream numbers, ten by default, that is
used for sending and receiving messages between the endpoints of that association.
Messages with different TRC are mapped to different stream numbers within an
association to permit independent delivery. Of course, since the number of streams
is fixed, the degree of concurrency achieved depends on the number of streams.
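The two halves of this scheme can be sketched as follows; requesting the stream count through SCTP_INITMSG is the standard socket API mechanism, while the particular hash from the context and tag pair is only an illustrative choice, not the exact code in our module.

    #include <netinet/sctp.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>

    #define STREAM_POOL_SIZE 10  /* default stream pool per association */

    /* Request STREAM_POOL_SIZE outbound/inbound streams for the
     * associations established on this one-to-many socket. */
    static int request_streams(int sd)
    {
        struct sctp_initmsg init;
        memset(&init, 0, sizeof(init));
        init.sinit_num_ostreams  = STREAM_POOL_SIZE;
        init.sinit_max_instreams = STREAM_POOL_SIZE;
        return setsockopt(sd, IPPROTO_SCTP, SCTP_INITMSG,
                          &init, sizeof(init));
    }

    /* Messages with different tag/context map to different streams so
     * they can be delivered independently; the same TRC always maps to
     * the same stream, preserving MPI's pairwise ordering semantics. */
    static uint16_t trc_to_stream(unsigned context, unsigned tag)
    {
        return (uint16_t)((context + tag) % STREAM_POOL_SIZE);
    }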
4.3.2 Head-of-Line Blocking
Head-of-line blocking can occur in the LAM TCP module when messages have
the same rank and context but different tags. As shown in Figure 4.2, we use
non-blocking communication between two processes that communicate by send-
ing/receiving several messages with different tags. The communication/computation
overlap depends on the amount of data that is sent without blocking for it to be
received, and continuing with useful computation. As Figure 4.2 shows, loss of a
single TCP segment can block delivery of all subsequent data in a TCP connection
until the lost segment is delivered. Our assignment of TRC to streams in the SCTP
module alleviates this problem by allowing the unordered delivery of these messages.
It is often the case that MPI applications are loosely synchronized and as
a result alternate between computation and bursts of communication. We believe
this characteristic of MPI programs makes them more likely to be affected by head-
of-line blocking in high loss scenarios, which underpins our premise that SCTP is
better suited as a transport mechanism for MPI than TCP on WANs.
Figure 4.2: Example of head-of-line blocking in LAM TCP module
4.3.3 Example of Head-of-Line Blocking
As an illustration of the benefits of our SCTP module, consider the two MPI pro-
cesses P0 and P1 shown in Figure 4.3. Process P1 sends two messages Msg-A and
Msg-B to P0 in order, using different tags. P0 does not care what order it receives
the messages and, therefore, posts two non-blocking receives. P0 waits for any of the
receive requests to complete and then carries out some computation. Assume that
part of Msg-A is lost during transmission and has to be retransmitted while Msg-B
arrives safely at the receiver. In the case of TCP, its connection semantics require
that Msg-B stay in TCP’s receive buffer until the two sides recover from the loss and
the lost parts are retransmitted. Even though the programmer has specified that the
messages can be received in any order, P0 is forced to receive Msg-A first and incur
increased latency as a result of TCP’s semantics.

Figure 4.3: Example of multiple streams between two processes using non-blocking communication

In the case of the SCTP module,
since different message tags are mapped to different streams and each stream can
deliver messages independently, Msg-B can be delivered to P0 and the process can
continue executing until Msg-A is required. SCTP matches the MPI semantics and
makes it possible to take advantage of the concurrency that was specified in the
program. The TCP module offers concurrency at process level, while our SCTP
module adds to this with enhanced concurrency at the TRC level.
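In MPI terms, the receiver P0 of Figure 4.3 behaves as in the following sketch; the tag values, buffer handling and the compute routine are placeholders.

    #include <mpi.h>

    extern void compute(char *buf);   /* placeholder for useful work */

    /* P0 from Figure 4.3: post both receives, then work with whichever
     * message arrives first. Over SCTP, tag-A and tag-B travel on
     * different streams, so loss of Msg-A does not delay Msg-B. */
    void receiver_p0(char *buf_a, char *buf_b, int len, int p1)
    {
        MPI_Request reqs[2];
        int done_index;

        MPI_Irecv(buf_a, len, MPI_CHAR, p1, /* tag-A */ 0,
                  MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(buf_b, len, MPI_CHAR, p1, /* tag-B */ 1,
                  MPI_COMM_WORLD, &reqs[1]);

        MPI_Waitany(2, reqs, &done_index, MPI_STATUS_IGNORE);
        compute(done_index == 0 ? buf_a : buf_b);

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }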
Even when blocking communication takes place between a pair of processes
and there is loss, there are still advantages to using SCTP. Consider Figure 4.4, where
blocking sends and receives are being used with two different tags. In SCTP, if Msg-A
is delayed or lost, and Msg-B arrives at the receiver, it is treated as an unexpected
message and buffered. SCTP, thus, allows the receive buffer to be emptied and
hence does not slow down the sender due to the flow control mechanism. In TCP,
however, Msg-B will occupy the socket receive buffer until Msg-A arrives.

Figure 4.4: Example of multiple streams between two processes using blocking communication
4.3.4 Maintaining State
The LAM TCP module maintains state for each process (mapped to a unique socket)
that it reads from or writes to. Since TCP delivers bytes in strict sequential order
and the TCP module transmits/receives one message body at a time per process,
there is no need to maintain state information for any other messages that may arrive
from the same process since they cannot overtake the message body currently being
read. In our SCTP module, this assumption no longer holds true since subflows
on different stream numbers are only partially ordered with respect to the entire
association. Therefore, we need to maintain state for each stream number that a
message can arrive on from a particular process. We only need to maintain a finite
amount of state information per association since we limit the possible number of
streams to our pool size of ten streams. For each stream number, the state holds
information about how much of the message has been read, what stage of progression
the request is in and what needs to be done next. At the time when an attempt
is made to read an incoming message, we cannot tell in advance from where the
message will arrive and how large it will be; therefore, we specify a length equal
to the socket receive buffer. SCTP’s receive function sctp_recvmsg does take care
of message framing by returning the next message and not the number of bytes
specified by the size field in the function call. This frees us from having to look
through the receive buffer to locate the message boundaries.
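The state kept per association can be sketched as follows; the field names are illustrative, not the actual definitions in our module.

    #include <netinet/sctp.h>
    #include <stddef.h>

    #define STREAM_POOL_SIZE 10

    /* Progress state kept for every (association, stream) pair, since
     * subflows on different streams are only partially ordered with
     * respect to the whole association. */
    typedef struct stream_state {
        size_t bytes_read;   /* how much of the current message is in  */
        size_t msg_length;   /* total expected length, once known      */
        int    stage;        /* which state function to invoke next    */
    } stream_state_t;

    typedef struct assoc_state {
        sctp_assoc_t   assoc_id;                  /* maps to a rank    */
        stream_state_t streams[STREAM_POOL_SIZE]; /* finite state set  */
    } assoc_state_t;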
4.4 Resource Management
LAM’s TCP RPI module uses TCP as its communication mechanism and employs
a socket based interface in a fully connected environment. It uses a one-to-one
connection-oriented scheme and maintains N sockets, one for each of the N processes
in its environment. The select system call is used to poll these sockets to determine
their readiness for reading any incoming messages or writing outgoing messages.
Polling of these sockets is necessary because operating systems like Linux and UNIX
do not support asynchronous communication primitives for sockets. It has been
shown that the select system call has a cost associated with it
that grows linearly with the number of sockets [44]. Socket design was originally
based on the assumption that a process would initiate a small number of connections,
and it is known that performance is affected if a large number of sockets are used.
Implications of this are significant in large-scale commodity clusters with thousands
of nodes, which are becoming more frequently used. Use of collective communications
in MPI applications also strengthens the argument, since nearly all connections
become active at once when a collective communication starts, and handling a large
number of active connections may result in performance degradation.
SCTP’s one-to-many communication style eliminates the need for maintain-
ing a large number of socket descriptors. In our implementation each process creates
a single one-to-many SCTP socket for communicating with the other processes in
the environment. An association is established with each of the other processes us-
ing that socket and since each association has a unique identification, it maps to a
unique process in the environment. We do not use the select system call to detect
events on the socket; instead, an attempt is made to retrieve messages at the socket
using sctp_recvmsg, as long as there are any pending receive requests. In a simi-
lar way, if there are any pending send requests, they are written out to the socket
using sctp_sendmsg. If the socket returns EAGAIN, signaling that it is not ready to
perform the current read or write operation, the system attempts to advance other
outstanding requests.
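A sketch of how such an endpoint is created; the one-to-many style uses SOCK_SEQPACKET, and the non-blocking flag is what makes the EAGAIN-driven progression described above possible. The helper name and port handling are illustrative.

    #include <fcntl.h>
    #include <netinet/in.h>
    #include <netinet/sctp.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Create the single one-to-many SCTP socket used to reach every
     * other process, and make it non-blocking so that a full buffer
     * returns EAGAIN and the middleware can advance other requests. */
    static int open_sctp_endpoint(unsigned short port)
    {
        struct sockaddr_in addr;
        int sd = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);
        if (sd < 0)
            return -1;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family      = AF_INET;
        addr.sin_port        = htons(port);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        if (bind(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return -1;
        if (listen(sd, 1) < 0)        /* enable inbound associations */
            return -1;

        fcntl(sd, F_SETFL, fcntl(sd, F_GETFL, 0) | O_NONBLOCK);
        return sd;
    }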
The use of one-to-many sockets in SCTP results in a more scalable and
portable implementation since it does not impose a strong requirement on the system
to manage a large number of socket descriptors and the resources for the associated
buffers. As discussed in [5], managing these resources can present problems. We
investigated a process’s limits for the number of socket descriptors it can manage
and the number of associations it can handle on a single one-to-many socket. In our
setup, a process reached its maximum socket descriptors’ limit much earlier than
the limit on the maximum number of associations it can handle. The number of file
descriptors in a system usually has a user limit of 1024 and needs root privileges
to be changed. Moreover, clusters can potentially execute parallel jobs from several
users, and need to have the capability of providing a number of sockets to each user.
4.5 Race Conditions
In this section we describe the race conditions in our module and the solutions
adopted to fix them.
4.5.1 Race Condition in Long Message Protocol
In the SCTP module it was necessary to change the long message protocol. This
occurred because even though SCTP is message-based, the size of message that can
be sent in a single call to the sctp_sendmsg function is limited by the send buffer size.
As a result, large messages had to be broken up into multiple smaller messages of size
less than that of the send buffer. These messages then had to be reassembled at the
receiving side.
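The sending side of this protocol can be sketched as follows; SEND_BUF_SIZE stands in for the socket send buffer size, and error handling is reduced to the essentials.

    #include <netinet/sctp.h>
    #include <stdint.h>
    #include <sys/socket.h>

    #define SEND_BUF_SIZE (220 * 1024)  /* placeholder send buffer size */

    /* Send a long MPI message as several smaller SCTP messages, all on
     * the same stream so the pieces arrive in order and can be
     * reassembled by the middleware on the receiving side. */
    static int send_long(int sd, struct sockaddr *to, socklen_t tolen,
                         const char *msg, size_t len, uint16_t stream)
    {
        size_t off = 0;
        while (off < len) {
            size_t chunk = len - off;
            if (chunk > SEND_BUF_SIZE)
                chunk = SEND_BUF_SIZE;
            if (sctp_sendmsg(sd, msg + off, chunk, to, tolen,
                             0, 0, stream, 0, 0) < 0)
                return -1;  /* EAGAIN handled by the caller's state machine */
            off += chunk;
        }
        return 0;
    }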
All pieces of the large message are sent out on the same stream number to en-
sure in-order delivery. As a refinement to this scheme, we considered interleaving
messages sent on different streams with portions of a message larger than the send
buffer size, at the time it is passed to the SCTP transport layer. Since reassembly
of the large message is being done at the RPI level, and not at the SCTP level, this
may result in reduced latency for shorter messages on other streams especially when
processes use non-blocking communication.
While testing our SCTP module we encountered race conditions that oc-
curred due to the rendezvous mechanism of the long message protocol. Consider
the case when two processes P0 and P1 are exchanging long messages using the
same stream numbers. Both of them would send rendezvous initiating envelopes for
their long messages to the other and wait for an acknowledgment before starting
the transmission of the long message body.
Figure 4.5: Example of long message race condition

As shown in Figure 4.5, after P1 receives the ACK for its long message Msg1,
it begins transmitting it. It is possible that P1 is successful in writing only part
of Msg1 to the socket, and it then saves its current state which would include the
number of bytes sent to P0. Process P1 would now continue and try to advance
its other active requests and would return to writing the rest of Msg1 later. During
that time P1 sends out an ACK for long message Msg2 to P0 using the same stream
number that Msg1 was being sent on. From the perspective of process P0, it was
in the middle of receiving Msg1 and the next message it receives is from the same
process and on the same stream number. Since P0 decides what function to call,
after reading a message from its socket, based on the association and stream number
it arrived on, it calls the function that was directing the bytes received to the buffer
designated for Msg1. As a result, P0 erroneously reads the ACK for Msg2 as part of
body of Msg1.
Option A
In order to fix the above race condition we considered several options, and the trade-
offs associated with them. The first option was to stay in a loop while sending Msg1
until all of it has been written out to the socket. One advantage of this is that the
rendezvous mechanism introduces latency overhead and once the transmission of the
long message body starts, we do not want to delay it any longer. The disadvantage
is that it greatly reduces the amount of concurrency that would otherwise have been
possible, since we do not receive from or send to other streams or associations. Also,
if the receiver of the long message had issued a non-blocking receive call, then it
might continue to do other computations and, as a result, sending the data in a
tight loop may simply cause the sender to stall waiting for the receiver to remove
the data from the connection. Another disadvantage that may arise is when the
receiver is slow and we would be overloading it by sending a lot of data in a loop
instead of advancing other requests on other streams or associations.
Option B
The second option was to disallow P1 from writing a different message on the
same stream number to the same process if a previous message writing operation
was still in progress. The advantage of this was simplicity in design, but we reduce the
amount of overlap that was possible in the above case if we allowed both processes
to exchange long messages at the same time. However, it is still possible to send
and receive messages from other streams or associations. This option is the one we
implemented in the SCTP module.
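Option B amounts to a small guard in the send path, sketched below with illustrative names; a per-(association, stream) flag refuses a new write while another message body is partially written.

    #define STREAM_POOL_SIZE 10

    struct stream_state {
        int write_in_progress;  /* a message body is partially written */
    };

    struct assoc_state {
        struct stream_state streams[STREAM_POOL_SIZE];
    };

    /* Option B: a new message may start on a stream only if no other
     * message to the same process is partially written on that stream,
     * so a rendezvous ACK can never be mistaken for message body. */
    static int try_start_write(struct assoc_state *as, unsigned stream)
    {
        if (as->streams[stream].write_in_progress)
            return 0;           /* caller advances other requests instead */
        as->streams[stream].write_in_progress = 1;
        return 1;
    }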
Option C
A third option was to treat acknowledgments used within LAM to synchronize
message exchange, such as those used in the long message rendezvous mechanism,
as control messages. These control messages would be treated differently from other
messages containing actual data. Whenever a control message carrying an ACK
arrives, the middleware would know it is not part of any unfinished message, e.g.,
the long message in our example, and would invoke the correct state function to
handle it. This option introduces more complexity and bookkeeping to the code,
but may turn out to be one that offers the most concurrency.
4.5.2 Race Condition in MPI_Init
There was one other race condition that occurred in implementing the SCTP mod-
ule. Because the one-to-many SCTP style does not require any of the customary
accept and connect function calls before receiving messages, we had to be careful
about MPI_Init to ensure that each process establishes associations with all the
other processes before exchanging messages. In order to ensure that all associations
are established before messages are exchanged, we implemented a barrier at the end
of our association setup stage and before the message reception/transmission stage.
This is especially important in a heterogeneous environment with varying network
and machine speeds. It is easier with TCP because the MPI_Init connection setup
procedure automatically takes care of this issue by its use of connect and accept
function calls.
4.6 Reliability
In general, MPI programs are not fault tolerant and communication failure typically
causes the programs to fail. There are, however, projects like FT-MPI [18], which
introduce different modes of failure that can be controlled by an application using
a modified MPI API. This work is discussed in Chapter 6. One source of failure is
packet loss in networks, which can result in added latency and reduced bandwidth
that severely impact the overall performance of programs. There are several mech-
anisms in SCTP that help reduce the number of failures and improve the overall
reliability of executing an MPI program in a WAN environment.
4.6.1 Multihoming Feature
MPI applications have a strong requirement for the ability to rapidly switch to an
alternate path without excessive delays in the event of failure. MPI processes in a
given application’s environment tend to be loosely synchronized even though they
may not all be communicating directly with each other. Delays occurring in one
process due to failure of the network can have a domino effect on other processes and
they can potentially be delayed as well. We saw some evidence of this phenomenon in
our experiments in Chapter 3. It would be highly advantageous for MPI processes if
the underlying transport protocol supports fast path failover in the event of network
failure.
SCTP’s multihoming feature provides an automatic failover mechanism where
a communication failure between two endpoints on one path will cause switchover
to an alternate path with little interruption to the data transfer service. Of course,
this is useful only when there are independent paths, but having multiple interfaces
on different networks is not uncommon and SCTP makes it possible to exploit the
possibility when it exists. TCP has no built-in support for multihomed IP hosts and
cannot take advantage of path-level redundancy. It could be accomplished by using
multiple sockets and managing them in the middleware but this introduces added
complexity to the middleware.
SCTP uses a heartbeat control chunk to periodically probe the reachability
of the idle destination addresses of an association and to update the path round
trip times of the addresses. The heartbeat period is a configurable option, modi-
fiable through a socket option, with a default value usually set to 30 seconds [59].
The heartbeat mechanism detects failure by keeping count of consecutive missed re-
sponses to the data and/or heartbeat chunks to that address. If this count exceeds
a configurable threshold, spp_pathmaxrxt, then that destination address is marked
as unreachable. By default spp_pathmaxrxt is set to a value of 5. The value of this
parameter represents a trade-off between speed of detection and reliability of detec-
tion. If the value is set too low, an overloaded host may be declared unreachable.
In addition to this, there are a range of user-adjustable controls that can be used
to adjust maximum and minimum retransmission timeouts (RTO) and hence tune
the amount of time it takes to determine network failures [55]. There are trade-offs
associated with selecting the value of these control parameters. Setting max RTO
too low, for example, can result in unnecessary retransmissions in the event of delay
in the network. Ideally, these control parameters need to be tuned to suit the nature
and requirements of the application and the network to ensure fast failover, but to
avoid unnecessary switchovers due to network delay. If we contrast this with the
TCP protocol, there is a lack of control over TCP retransmission timers and there
is no built-in mechanism for a failover.
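These controls are exposed through socket options; the sketch below follows the field names of the standard SCTP socket API (older stacks expose slightly different fields), and the specific values shown are illustrative rather than recommendations.

    #include <netinet/sctp.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Tune failover detection: probe idle paths more often, declare a
     * path unreachable after fewer missed heartbeats, and bound the
     * retransmission timeout. */
    static int tune_failover(int sd, sctp_assoc_t assoc)
    {
        struct sctp_paddrparams pp;
        struct sctp_rtoinfo rto;

        memset(&pp, 0, sizeof(pp));
        pp.spp_assoc_id   = assoc;    /* all paths of this association */
        pp.spp_flags      = SPP_HB_ENABLE;
        pp.spp_hbinterval = 5000;     /* heartbeat every 5 s (default ~30 s) */
        pp.spp_pathmaxrxt = 3;        /* failover after 3 misses (default 5) */
        if (setsockopt(sd, IPPROTO_SCTP, SCTP_PEER_ADDR_PARAMS,
                       &pp, sizeof(pp)) < 0)
            return -1;

        memset(&rto, 0, sizeof(rto));
        rto.srto_assoc_id = assoc;
        rto.srto_min = 100;           /* ms */
        rto.srto_max = 2000;          /* ms: cap worst-case retransmit delay */
        return setsockopt(sd, IPPROTO_SCTP, SCTP_RTOINFO,
                          &rto, sizeof(rto));
    }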
As a side note on multihoming, an SCTP association can span both IPv4 and
IPv6 addresses and can aid in the progression of the Internet from IPv4 to IPv6 [58].
4.6.2 Added Protection
SCTP has several built-in features that provide an added measure of protection
against flooding and masquerade attacks. They are discussed below.
As SCTP is connection oriented, it exchanges setup messages at the initiation
of the communication, for which it uses a robust four-way handshake as shown in
Figure 4.6. The receiver of an INIT message does not reserve any resources until
the sender proves that its IP address is the one claimed in the association setup
message.

Figure 4.6: SCTP’s four-way handshake (INIT, INIT-ACK, COOKIE-ECHO, COOKIE-ACK)

The handshake uses a signed state cookie to prevent use of IP spoofing for
a SYN flooding denial-of-service attack. The cookie is authenticated by making sure
that it has a valid signature and then a check is made to verify that the cookie is
not stale. This check guards against replay attacks [59]. Some implementations of
TCP also use a cookie mechanism, but those typically do not use signed cookies;
moreover, the cookie mechanism is supplied as an option and is not an integral
part of the TCP protocol, as is the case with SCTP. In TCP, a user has to validate
that both sides of the connection have the required implementation. SCTP’s four-
way handshake may be seen to be an overhead, however, if a one-to-many style
socket is used, then user data can be piggy-backed on the third and fourth leg of
the handshake.
Every SCTP packet has a common header that contains, among other things,
a 32-bit verification tag. This verification tag protects against two things: (1) It
prevents an SCTP packet from a previous inactive association from being mistaken
as a part of a current association between the same endpoints. (2) It protects
against blind injection of data in an active association [59]. A study [63]1 done on
TCP Reset attacks has shown that TCP is vulnerable to a denial-of-service attack
in which the attacker tries to prematurely terminate an active TCP connection.
The study has found that this type of attack can utilize the TCP window size
to reduce the number of sequence numbers that must be guessed for a spoofed
packet to be considered valid and accepted by the receiver. This kind of attack is
especially effective on long standing TCP flows such as BGP peering connections,
which if successfully attacked, would disrupt routing in the Internet. SCTP is not
susceptible to the reset attacks due to its verification tag [57].
SCTP is able to use large socket buffer sizes by default because it is not
subject to denial-of-service attacks that TCP becomes susceptible to with large
buffers [57].
SCTP also provides an autoclose option that protects against accidental
denial-of-service attacks where a process opens an association but does not send
any data. Using autoclose, we can specify the maximum number of seconds that
an association remains idle, i.e., no user traffic in either direction. After this time
the association is automatically closed [55]. There is another difference between
SCTP’s and TCP’s close mechanisms; TCP allows a “half-closed” state, which is
not allowed in SCTP. In the half-closed state, one side closes a connection and can-
not send data, but may continue to receive data until the peer closes its connection.
SCTP avoids the possibility of a peer failing to close its connection by disallowing
a half-closed state.
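The autoclose option itself is a single socket option on a one-to-many socket, as in the following sketch; the helper name is ours.

    #include <netinet/sctp.h>
    #include <sys/socket.h>

    /* Close idle associations automatically: if no user data moves in
     * either direction for 'seconds', the association is torn down,
     * guarding against peers that open associations and go silent. */
    static int set_autoclose(int one_to_many_sd, int seconds)
    {
        return setsockopt(one_to_many_sd, IPPROTO_SCTP, SCTP_AUTOCLOSE,
                          &seconds, sizeof(seconds));
    }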
In this section we have highlighted the additional security features that are
present in SCTP. These features are especially important in open network environ-
ments, such as the Internet, where security is an issue and therefore the use of SCTP
adds to the reliability of the environment for MPI.
1 Network transport protocols rarely hit the news headlines. The work by Paul Watson was widely reported in the media; The Vancouver Sun featured a front-page article titled “The man who saved the Internet” on April 22, 2004. This problem, however, had been identified earlier by the network community [57].
4.6.3 Use of a Single Underlying Protocol
One issue not directly related to SCTP was the mixed use of UDP and TCP in
LAM. LAM uses user-level daemons that by default employ UDP as their transport
protocol. These daemons serve several purposes:
1. Daemons act as external agents that have previously fulfilled the authentica-
tion requirements for connecting to other nodes and can enable fast execution
of the MPI programs.
2. They enable external monitoring of running jobs and carry out cleanup
when a user aborts an MPI process.
3. LAM also provides remote I/O via the LAM daemon processes.
We modified the transport protocol used by daemons in LAM to SCTP so
that we have a single underlying protocol while executing MPI programs instead
of a combination of protocols that do not offer the set of features we want to take
advantage of, such as the capability to multihome by all the components of the
LAM environment.
4.7 Limitations
As discussed in Section 4.5, SCTP has a limit on the size of message it can send
in a single call to sctp_sendmsg. This prevents us from taking full advantage of the
message-framing property of SCTP since our long message protocol divides a long
message into smaller messages and then carries out message framing at the middle-
ware level.
Both TCP and SCTP use a flow control mechanism where a receiver adver-
tises its available receive buffer space to the sender. Normally when a message is
received it is kept in the kernel’s protocol stack buffer until a process issues a read
system call. When large messages are exchanged in MPI applications and mes-
sages are not read out quickly, the receiver advertises less buffer space to the sender,
which slows down the sender due to flow control. An event driven mechanism that
gets invoked as soon as a message arrives is currently not supported in our module.
Some factors that can affect the performance of SCTP are as follows:
• SCTP uses a comprehensive 32-bit CRC32c checksum which is expensive in
terms of CPU time, while TCP typically offloads checksum calculations to
the network interface card (NIC). However, CRC32c provides much stronger
data verification at the cost of additional CPU time. It has been proven to be
mathematically stronger than the 16-bit checksums used by TCP or UDP [58].
It is likely that hardware support for CRC32c may become available in a few
years.
• TCP can always pack a full MTU, but SCTP is limited by the fact that
it bundles different messages together, which may not always fit together to fill
a full MTU. However, in our experiments this was not observed to be a factor
impacting the performance.
• TCP performance has been fine-tuned over the past decades; however, opti-
mization of the SCTP stack is still in its early stages and will improve over
time.
In this chapter we discussed the design and implementation issues of our
SCTP module in detail. We have focused on the need for latency tolerant MPI
programs for wide-area environments and described SCTP’s many features, which
make it more resilient and suitable for such environments. In Chapter 5 we carry
out a thorough experimental evaluation of the SCTP module, and compare its per-
formance to the LAM-TCP module.
Chapter 5
Experimental Evaluation
In this chapter we describe the experiments carried out to compare the performance
of our SCTP module with the LAM-TCP module for different MPI applications.
Our experimental setup consists of a dedicated cluster of eight identical Pentium-4
3.2GHz, FreeBSD-5.3 nodes connected via a layer-two switch using 1Gbit/s Ethernet
connections. Kernels on all nodes are augmented with the kame.net SCTP stack [36].
The experiments were performed in a controlled environment and Dummynet was
configured on each of the nodes to allow us to vary the amount of loss on the
links between the nodes. We compare the performance under loss rates of 0%, 1%
and 2% between each of the eight nodes. The cluster used for our experiments in
this chapter is larger and faster than the one used for evaluating TCP for MPI,
described in Chapter 3. The new cluster had become available to us at the time
we were ready to test our SCTP module, and provided the opportunity to carry
out more extensive experiments with better machines and a larger number of nodes.
There were NAS benchmarks, like BT, which could not be run on the older cluster,
as they were taking an inordinate amount of time to execute. Since the new cluster
reduced the overall time required for each experiment run, we were able to move
quickly. All our experiments were executed several times to verify the consistency
of our results.
In Section 5.1, we evaluate the performance of the SCTP module using two
standard benchmark programs. In Section 5.2, we look at the effect of using a latency
tolerant program that overlaps communication with computation. We compare the
performance using a real-world program that uses a master-slave communication
pattern commonly found in MPI programs. We also discuss the effects of head-of-
line blocking. In order to make the comparison between SCTP and TCP as fair
as possible, the following settings were used in all the experiments discussed in
subsequent sections:
1. By default, SCTP uses much larger SO_SNDBUF/SO_RCVBUF buffer sizes than
TCP. In order to prevent any possible effects on performance due to this
difference, the send and receive buffers were set to a value of 220 Kbytes in
both the TCP and SCTP modules (see the sketch after this list).
2. Nagle’s algorithm is disabled by default in LAM-TCP and this setting was
used in the SCTP module as well.
3. An SCTP receiver uses Selective Acknowledgments (SACK) to report any missing
data to the sender; therefore, the SACK option for TCP was enabled on all the
nodes used in the experiment.
4. In our experimental setup, the ability to multihome between any two endpoints
was available since each node is equipped with three gigabit interface cards
and three independent paths between any two nodes were available. In our
experiments, however, this multihoming feature was not used so as to keep the
network settings as close as possible to that used by the LAM-TCP module.
5. TCP is able to offload checksum calculations onto the NICs on our nodes and
thus has zero CPU cost associated with its calculation. SCTP has an expensive
CRC32c checksum which can prove to be a considerable overhead in terms of
CPU cost. We modified the kernel to turn off the CRC32c checksum in SCTP
so that this factor does not affect the performance results.
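The settings in items 1 and 2 correspond to standard socket options; a sketch of how they can be applied follows, with BUF_BYTES standing in for the 220 Kbyte value quoted above and the helper name ours.

    #include <netinet/in.h>
    #include <netinet/sctp.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    #define BUF_BYTES (220 * 1024)  /* buffer size used in our experiments */

    /* Apply the experimental settings to a socket: equal send/receive
     * buffers for both transports and Nagle's algorithm disabled. */
    static int apply_test_settings(int sd, int is_sctp)
    {
        int buf = BUF_BYTES, on = 1;
        if (setsockopt(sd, SOL_SOCKET, SO_SNDBUF, &buf, sizeof(buf)) < 0 ||
            setsockopt(sd, SOL_SOCKET, SO_RCVBUF, &buf, sizeof(buf)) < 0)
            return -1;
        if (is_sctp)
            return setsockopt(sd, IPPROTO_SCTP, SCTP_NODELAY,
                              &on, sizeof(on));
        return setsockopt(sd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
    }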
5.1 Evaluation Using Benchmark Programs
We evaluated our implementation using two benchmark programs: the MPBench ping-
pong test and the NAS parallel benchmarks. The NAS benchmarks approximate the
performance of real applications. These benchmarks provided a point of reference
for our measurements and the results are discussed below.
5.1.1 MPBench Ping-Pong Test
We first report the output obtained by running the MPBench [48] ping-pong test
with no message loss. This is a standard benchmark program that reports the
throughput obtained when two processes repeatedly exchange messages of a speci-
fied size. All messages are assigned the same tag. Figure 5.1 shows the throughput
obtained for different message sizes in the ping-pong test under no loss.

Figure 5.1: The performance of SCTP using a standard ping-pong test normalized to TCP under no loss

The through-
put values for the LAM-SCTP module are normalized with respect to LAM-TCP.
The results show that SCTP is more efficient for larger message sizes, however,
TCP does better for small message sizes. The crossover point is approximately at a
message size of 22 Kbytes, where SCTP throughput equals that of TCP. The SCTP
stack is fairly new compared to TCP and these results may change as the SCTP
stack is further optimized.
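For reference, the structure of such a ping-pong measurement is sketched below; this is our own minimal rendition, not MPBench's actual code.

    #include <mpi.h>

    /* Minimal ping-pong in the spirit of the MPBench test: rank 0 and
     * rank 1 bounce a message of 'len' bytes 'iters' times, all with
     * the same tag; throughput = 2 * len * iters / elapsed time. */
    void ping_pong(char *buf, int len, int iters)
    {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, len, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, len, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
    }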
We also used the ping-pong test to compare SCTP to TCP under 1% and 2%
loss rates. Since the LAM middleware treats short and long messages differently,
we experimented with loss for both types of messages. In the experiments short
messages were 30 Kbytes and long messages were 300 Kbytes. For both short and
long messages under 1% and 2% loss, SCTP performed better than TCP and the
results are summarized in Table 5.1.
                          Throughput (bytes/second)
    MPI Message Size      Loss: 1%               Loss: 2%
    (bytes)               SCTP       TCP         SCTP       TCP
    30K                   54,779     1,924       44,614     1,030
    300K                   5,870     1,818        2,825       885

Table 5.1: The performance of SCTP and TCP using a standard ping-pong test under loss
SCTP shows that it is more resilient under loss and can perform rapid re-
covery of lost segments. Work done in [1] and [53] compares the congestion control
mechanisms of SCTP and TCP and shows that SCTP has a better congestion con-
trol mechanism that allows it to achieve higher throughput in error prone networks.
Some features of SCTP’s congestion control mechanism are as follows:
• Use of SACK is an integral part of the SCTP protocol, whereas, for TCP it is
an option available in some implementations. In those implementations SACK
information is carried in TCP options and is, therefore, limited to reporting at
most four TCP segments. In SCTP, the number of gap ACK blocks allowed
is much larger as it is dictated by the PMTU [59].
• Increase in the congestion window in SCTP is based on the number of bytes
acknowledged and not on the number of acknowledgments received [1]. This
allows SCTP to recover faster after fast retransmit. Also, SCTP initiates slow
start when the congestion window is equal to the slow start threshold. This
helps in achieving a faster increase in the congestion window.
• The congestion window variable cwnd can achieve full utilization because when
a sender has 1 byte of space in the cwnd and space available in the receive window,
it can send a full PMTU of data.
• The FreeBSD KAME SCTP stack also includes a variant called New-Reno
SCTP that is more robust to multiple packet losses in a single window [32].
• When the receiver is multihomed, an SCTP sender maintains a separate con-
gestion window for each transport address of the receiver because the conges-
tion status of the network paths may differ from each other. SCTP’s retrans-
mission policy helps in increasing the throughput, since retransmissions are
sent on one of the active alternate transport addresses of the receiver. The
effect of multihoming was not a factor in our tests.
In order to determine the performance overhead of SCTP’s CRC32c check-
sum, we re-ran the ping-pong test for short and long message sizes under no loss
with CRC32c checksum turned-on. The throughput, on average, degraded by 12%
compared to the case when the checksum was turned-off. In wide-area networks,
the effect of CRC32c checksums on overall performance is likely to be small since
communication time is dominated by network latency. The effect of the CRC32c
checksum was more noticeable in the ping-pong tests because the experiments were
performed on a low latency local area network. In general, although the time for
calculating the checksum does add a few microseconds to message latency, it is likely
to be small in comparison to network latency.
5.1.2 Using NAS Benchmarks
In our second experiment, we used the NAS parallel benchmarks (NPB version 3.2)
to test the performance of MPI programs with SCTP and TCP as the underlying
transport protocol. The characteristics of the NAS benchmarks have already been
discussed in Chapter 3. We tested the following seven of those benchmarks: LU, IS,
MG, EP, CG, BT and SP. Dataset sizes S, W, A and B were used in the experiments
with the number of processes equal to eight. The dataset sizes increase in the order S,
W, A, and B, with S being the smallest. We have done an analysis of the type of
messages being exchanged in these benchmarks and we have found that in datasets
‘S’ and ‘W’, short messages (as defined in LAM middleware to be messages smaller
than or equal to 64 Kbytes) are predominantly being sent/received. In datasets ‘A’
and ‘B’ we see a greater number of long messages (i.e., messages larger than 64
Kbytes) being exchanged.
Figure 5.2: TCP versus SCTP for the NAS benchmarks
Figure 5.2 shows the results for dataset size ‘B’ under no loss. The results for
the other datasets are not shown in the figure, but as expected from the ping-pong
test results, TCP does better for the shorter datasets. These benchmarks use single
tags for communication between any pair of nodes and the benefits of using multiple
streams in the SCTP module are not being utilized. The SCTP module, in this case,
reduces to a single stream per association and as the results show, the performance
on average is comparable to TCP. TCP shows an advantage for the MG and BT
benchmarks and we believe the reason for this is that these benchmarks use a greater
proportion of short messages in dataset ‘B’ than the other benchmarks.
5.2 Evaluation Using a Latency Tolerant Program
In this section we compare the performance of the SCTP module with LAM-TCP
using a latency tolerant program. As our discussion in the previous chapters showed,
in order to take advantage of highly available, shared environments that have large
delays and loss, it is important to develop programs that are latency tolerant and
capable of overlapping computation with communication to a high degree. Moreover,
SCTP’s use of multiple streams can provide more concurrency. Here, we investigate
the performance of such programs with respect to TCP and also discuss the effect
of head-of-line blocking in these programs.
5.2.1 Comparison Using a Real-World Program
In this section we investigate the performance of a realistic program that makes use
of multiple tags. Our objectives were two-fold: first, we wanted to evaluate a real-
world parallel application that is able to overlap communication with computation,
and second, to examine the effect of introducing multiple tags which, in the case of
the SCTP module, can map to different streams. We describe the experiments
performed using an application which we call the Bulk Processor Farm program.
The communication characteristics of this program are typical of real-world master-
slave programs.
The Bulk Processor Farm is a request driven program with one master and
several slaves. The slaves ask the master for tasks to do, and the master is responsible
for creating tasks, distributing them to slaves and then collecting the results from
all the slaves. The master services all task requests from slaves in the order of their
arrival. Each task is assigned a different tag, which represents the type of that task.
There is a maximum number of different tags (MaxWorkTags) that can be distributed
at any time. The slaves can make multiple requests at the same time and when they
finish doing some task they send a request for a new one, so at any time, each of the
slaves has a fixed number of outstanding job requests. This number was chosen to
be ten in our experiments. The slaves use non-blocking MPI calls to issue send and
receive requests and all messages received by the slaves are expected messages. The
slaves use MPI_ANY_TAG to show that they are willing to perform a task of any type.
The master has a total number of tasks (NumTasks) that need to be performed
before the program can terminate. In our experiments we set that number to 10,000
tasks. We experimented with tasks of two different sizes; short tasks equal to 30
Kbytes and long tasks equal to 300 Kbytes. The size of the task represents the
message sizes that are exchanged between the master and the slave.
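The slave side of this protocol can be sketched as follows; process_task, the termination tag and the request tags are placeholders, and the request sends are shown as blocking calls for brevity even though the program uses non-blocking MPI calls.

    #include <mpi.h>

    #define OUTSTANDING 10            /* outstanding job requests per slave */
    #define TASK_BYTES  (30 * 1024)   /* short-task size from our experiments */
    #define DONE_TAG    0             /* hypothetical termination tag */

    extern void process_task(const char *task, int type);  /* placeholder */

    /* Keep OUTSTANDING receives posted with MPI_ANY_TAG, process
     * whichever task arrives first, then request a replacement task. */
    void slave_loop(int master)
    {
        static char tasks[OUTSTANDING][TASK_BYTES];
        MPI_Request reqs[OUTSTANDING];
        MPI_Status st;
        int idx;

        for (int i = 0; i < OUTSTANDING; i++) {
            MPI_Irecv(tasks[i], TASK_BYTES, MPI_CHAR, master, MPI_ANY_TAG,
                      MPI_COMM_WORLD, &reqs[i]);
            MPI_Send(NULL, 0, MPI_CHAR, master, i + 1, MPI_COMM_WORLD);
        }

        for (;;) {
            MPI_Waitany(OUTSTANDING, reqs, &idx, &st);
            if (st.MPI_TAG == DONE_TAG)
                break;
            process_task(tasks[idx], st.MPI_TAG);
            /* ask for a new task and re-post this receive slot */
            MPI_Send(NULL, 0, MPI_CHAR, master, st.MPI_TAG, MPI_COMM_WORLD);
            MPI_Irecv(tasks[idx], TASK_BYTES, MPI_CHAR, master, MPI_ANY_TAG,
                      MPI_COMM_WORLD, &reqs[idx]);
        }
    }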
In the case of the SCTP module, the tags will be mapped to streams and, in case
of congestion, if a message sent to a slave is lost, then it would still be possible for
messages on the other streams to be delivered and the slave program can continue
working without blocking on the lost message. In LAM-TCP, however, since all
messages between the master and a slave have to be delivered in order, there would
be less overlap of communication with computation. The experiments were run at
loss rates of 0%, 1% and 2%. The farm program was run six times for each of the
different combinations of loss rates and message sizes and the average value of total
run-times are reported. The average and the median values of the multiple runs
were very close to each other. We also calculated the standard deviation of the
average value, and found the variation across different runs to be very small.
Figure 5.3: TCP versus SCTP for (a) short and (b) long messages for the Bulk Processor Farm application. Average total run times (seconds) at 0%, 1% and 2% loss: short messages, LAM_SCTP 6.8, 7.7, 11.2 versus LAM_TCP 5.9, 79.9, 131.5; long messages, LAM_SCTP 83, 804, 1595 versus LAM_TCP 114, 2080, 4311.

Figure 5.3 shows the comparison between LAM-SCTP and LAM-TCP for
short and long message sizes under different loss rates. As seen in Figure 5.3, SCTP
outperforms TCP under loss. For short messages, LAM-TCP run-times are 10 to 11
times higher than that of LAM-SCTP at loss rates of 1 to 2%. For long messages,
LAM-TCP is slower than LAM-SCTP by 2.58 times at 1% and 2.7 times at 2%
loss. Although SCTP performs substantially better than TCP for both short and
long messages, the fact that the difference is more pronounced for short messages
is very positive. MPI implementations, typically, try to optimize short messages for
latency and long messages for bandwidth [4]. In our latency tolerant application,
the slaves, at any time, have a number of pre-posted outstanding receive requests.
Since short messages are sent eagerly by the master, they are copied to the correct
receive buffer as soon as they are received. Since messages are being transmitted
on different streams in LAM-SCTP, there is less variation in the amount of time
the slave might have to wait for a job to arrive, especially under loss conditions,
and this results in more overlap of communication with computation. The rendezvous
mechanism in the long messages introduces synchrony to the transmission of the
messages, and the payload can only be sent after the receiver has acknowledged its
readiness to accept it. This reduces the amount of overlap possible, compared to
the case when messages are sent eagerly. The long messages are also more prone to
be affected by loss by virtue of their size. Long messages would typically be used for
bulk transfers, and the cost of the rendezvous is amortized over the time required
to transfer the data.
We also introduced a tunable parameter to the program described above,
which we call Fanout. Fanout represents the number of tasks a master will send to
a slave in response to a single task request from that slave. In Figure 5.3 the Fanout
is 1.
We experimented with a Fanout value of 10 as shown in Figure 5.4. We
anticipated that Fanout=10 would create more opportunities for head-of-line
blocking in the LAM-TCP case, since ten tasks are sent to a slave in response
to one task request and these larger bursts are more likely to be affected by
loss. Figure 5.4 shows that with the larger Fanout, the average run-time for
TCP increases substantially for long messages, while there are only slight
changes for short messages. One possible explanation for this behavior is
TCP's flow-control mechanism. Ten long tasks are sent in response to each
job request, and under loss TCP blocks delivery of all subsequent messages
until the lost segment is recovered. If the receiver does not empty the receive
buffer quickly enough, the sender is forced to slow
down. Moreover, as discussed in Section 5.1.1, SCTP's better congestion control also
aids in faster recovery of lost segments, and this is also reflected in all our results.
Also, we expect the performance difference between TCP and SCTP to widen when
multihoming is present and retransmissions are sent on an alternate path.

Figure 5.4: TCP versus SCTP for (a) short and (b) long messages for the Bulk
Processor Farm application using a Fanout of 10. Total run time in seconds at
0%, 1%, and 2% loss: (a) short: LAM_SCTP 8.7, 11.7, 16.0; LAM_TCP 6.2,
88.1, 154.7. (b) long: LAM_SCTP 79, 786, 1585; LAM_TCP 129, 3103, 6414.
5.2.2 Investigating Head-of-line Blocking
In the experiments presented so far, we have shown that SCTP performs better
than TCP under loss, with two main factors affecting the results:
improvements in SCTP’s congestion control mechanism, and the ability to use mul-
tiple streams to reduce head-of-line blocking. In this section we examine the per-
formance improvement obtained as a result of using multiple streams in our SCTP
module. In order to isolate the effects of head-of-line blocking in the farm program,
we created another version of the SCTP module, one that uses only a single stream
to send and/or receive messages irrespective of the message tag, rank and context.
In all other respects it was identical to our multiple-stream SCTP module.
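The difference between the two module variants amounts to the stream number passed to the SCTP send call. The sketch below, using the standard sctp_sendmsg() interface, shows the idea; NUM_STREAMS and the simple modulo mapping are illustrative assumptions rather than the module's exact code.

    /* Sketch: choosing an SCTP stream for an outgoing MPI message;
     * NUM_STREAMS and the modulo mapping are illustrative. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/sctp.h>
    #include <stdint.h>

    #define NUM_STREAMS 10

    static uint16_t map_tag_to_stream(int tag, int single_stream) {
        if (single_stream)
            return 0; /* 1-stream variant: everything ordered on stream 0 */
        return (uint16_t)(tag % NUM_STREAMS); /* multi-stream: per tag */
    }

    static ssize_t send_mpi_message(int sd, struct sockaddr *to,
                                    socklen_t tolen, const void *msg,
                                    size_t len, int tag, int single_stream) {
        return sctp_sendmsg(sd, msg, len, to, tolen,
                            /* ppid */ 0, /* flags */ 0,
                            map_tag_to_stream(tag, single_stream),
                            /* timetolive */ 0, /* context */ 0);
    }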
The farm program was run at different loss rates for both short and long mes-
sage sizes. The results are shown in Figure 5.5. Since multihoming was not present
in the experiment, the results obtained show the effect of head-of-line blocking
and the advantage due to the use of multiple tags/streams.

Figure 5.5: Effect of head-of-line blocking in the Bulk Processor Farm program
for (a) short and (b) long messages (Fanout: 10). Total run time in seconds at
0%, 1%, and 2% loss: (a) short: 10 streams 8.7, 11.7, 16.0; 1 stream 9.3, 11.0,
21.6. (b) long: 10 streams 79, 786, 1585; 1 stream 79, 1000, 1942.

The reduction in average run-
times when using multiple streams compared to a single stream is about 25% under
loss for long messages. In the case of short messages, the benefit of using multiple
streams becomes apparent at 2% loss, where using a single stream shows an
increase in run-time of about 35%. This shows that head-of-line blocking has a
substantial effect in a latency-tolerant program like the Bulk Processor Farm.
The performance difference between the two cases can increase in a network
where it takes a long time to recover from packet loss [57]. These results may
improve even further as the FreeBSD KAME SCTP stack is optimized. Currently,
the receive path for an SCTP receiver needs to be completely re-written for optimal
performance [57]. As mentioned in Section 2.1, an SCTP sender uses a round-robin
scheme to send messages on different streams, and according to the KAME SCTP
developers, the receive function similarly needs to be carefully thought out,
with a proper scheme devised for when multiple messages on different streams
are ready to be received.
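On the receive side, the stream a message arrived on is reported through the sctp_sndrcvinfo structure. The following is a minimal sketch, assuming the sctp_data_io_event notification has been enabled so that the structure is filled in; the buffer size and the hand-off function are illustrative:

    /* Sketch: receiving and demultiplexing by stream on a one-to-many
     * socket; buffer size and hand-off function are illustrative. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/sctp.h>
    #include <stdint.h>
    #include <string.h>

    static void deliver(uint16_t stream, const char *buf, ssize_t len) {
        /* hand off to the middleware's expected/unexpected queues */
        (void)stream; (void)buf; (void)len;
    }

    void recv_one(int sd) {
        char buf[64 * 1024];
        struct sctp_sndrcvinfo sinfo;
        struct sockaddr_storage from;
        socklen_t fromlen = sizeof(from);
        int flags = 0;
        ssize_t n;

        memset(&sinfo, 0, sizeof(sinfo));
        n = sctp_recvmsg(sd, buf, sizeof(buf), (struct sockaddr *)&from,
                         &fromlen, &sinfo, &flags);
        if (n > 0) /* sinfo_stream identifies the stream (and hence tag) */
            deliver(sinfo.sinfo_stream, buf, n);
    }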
Chapter 6
Related Work
The effectiveness of SCTP has been explored for several protocols in high latency,
high-loss environments. Researchers have investigated the use of SCTP in FTP [41],
HTTP [52], and also over wireless [10] and satellite networks [1]. Using SCTP for
MPI has not been investigated.
There are a wide variety of projects that use TCP in an MPI environment.
MPICH-G2 [37] is a multi-protocol implementation of MPI for the Globus [27] en-
vironment that was primarily designed to link together clusters over wide-area net-
works. Globus provides a basic infrastructure for Grid computing and addresses
many of the issues with respect to the allocation, access and coordination of resources
in a Grid environment. The communication middleware in MPICH-G2 is able to
switch between the use of vendor-specific MPI libraries and TCP. In this regard, it
is well suited for environments where there is a connection between clusters or other
Grid elements. MPICH-G2 maintains a notion of locality and uses it to optimize
communication among local nodes in a cluster or local network, while nodes that
are further away communicate using TCP. The developers report that MPICH-G2 optimizes
TCP but there is no description of these optimizations and how they might work in
general when all the nodes are non-local. Another related work is PACX-MPI [38],
which is a grid-enabled MPI implementation. PACX-MPI enables execution of MPI
applications over a cluster of high-performance computers like Massively Parallel
Processors (MPPs), connected through high-speed networks or the Internet. Com-
munication within MPPs is done with vendor MPI and between meta-computers
is done through TCP connections. LAM/MPI, as well, provides limited support
for the Globus grid environment. LAM can be booted across a Globus grid, but only
restricted types of execution are possible [40, 54].
Another project on wide-area communication for grid computing is NetIbis
[12], which focuses on the connectivity, performance and security problems of TCP in
wide-area networks. NetIbis enhances performance through data compression over
parallel TCP connections. Moreover, it uses TCP splicing for connection establish-
ment through firewalls and can use encryption as well. MagPIe [39] is a library
of collective communication optimized for wide-area networks that is built on top
of MPICH. MagPIe constructs communication graphs that are wide-area optimal,
taking the hierarchical structure of network topology into account. Their results
show that their algorithms outperform MPICH in real wide-area environments.
Researchers have also focused on providing system fault tolerance in large
scale distributed computing environments. HARNESS FT-MPI [18] is one such
project that handles fault tolerance at the MPI communicator level. It aims at
providing application programmers with different methods of dealing with failures
in MPI rather than having to rely only on check-pointing and restart. In FT-MPI,
the semantics and modes of failure can be controlled by the application using a
modified MPI API. Our work provides the opportunity to add SCTP for transport
in a Grid environment, where we can take advantage of the improved performance
in the case of loss.
Several projects have used UDP rather than TCP. As mentioned, UDP is
message-based and one can avoid all the “heavy-weight” mechanisms present in TCP
to obtain better performance. However, when one adds reliability on top of UDP
the advantages begin to diminish. For example, LA-MPI [2], a high-performance,
reliable MPI library, uses UDP and supports sender-side retransmission of messages
based on acknowledgment timeout. Their retransmission approach has many simi-
larities to TCP/IP but they bypass the complexities of TCP’s flow and congestion
control mechanisms. They justify their use of a simple retransmission technique by
stating that their scheme is targeted at low-error rate cluster environments and is
thus appropriate. LA-MPI also supports fault tolerance through message striping
across multiple network interfaces and can achieve greater bandwidth through si-
multaneous transfer of data. Automatic network failover in case of network failures
is currently under development in their implementation. LA-MPI reports that
the performance of their implementation over UDP/IP is similar to the TCP/IP
performance of other MPI implementations over Ethernet.
WAMP [61] is an example of UDP for MPI over wide-area networks. This
work presents a user-level communication protocol where reliability was built on top
of UDP. Interestingly, WAMP only wins over TCP in heavily congested networks
where TCP’s congestion avoidance mechanisms limit bandwidth. One of the reasons
for WAMP’s improvement over TCP is the fact that their protocol does not attempt
to modify its behavior based on network congestion and maintains a constant flow
regardless of the network conditions. Their protocol, in fact, captures bandwidth
at the expense of other flows in the network. This technique may not be appropri-
ate for Internet-like environments where fairness and network congestion must be
considered. Limitation on achievable bandwidth due to TCP’s congestion control is
a problem but hopefully research in this area will lead to better solutions for TCP
that can also be incorporated into SCTP.
Another possible way to use streams is to send eager short messages on
one stream and long messages on a second one. LA-MPI followed this approach
with their UDP/IP implementation of MPI [2]. This potentially allows for the
fast delivery of short messages and a more optimized delivery for long messages
that require more bandwidth. It also may reduce head-of-line blocking for the
case of short messages waiting for long messages to complete. However, MPI
message ordering semantics must be maintained and thus, as in LA-MPI,
sequence numbers are introduced to ensure that messages are received strictly
in the order they were posted by the sender. Thus, in the end, there is no added
opportunity for concurrent progression of messages as there is in the case of our
implementation.
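A sketch of why such sequence numbering forfeits concurrency follows; the window size and structures are illustrative, not LA-MPI's actual code. Whatever channel a message arrives on, delivery stalls until the next expected sequence number has arrived:

    /* Sketch: a global sequence number restores strict posting order
     * at the receiver; names and window size are illustrative. */
    #include <stdint.h>

    #define WINDOW 64

    struct msg { uint32_t seq; const void *payload; };

    static struct msg pending[WINDOW]; /* out-of-order arrivals */
    static int       present[WINDOW];
    static uint32_t  next_seq;         /* next seq to deliver */

    /* Called on arrival from either the short- or long-message channel. */
    void on_arrival(struct msg m, void (*deliver)(struct msg)) {
        pending[m.seq % WINDOW] = m;
        present[m.seq % WINDOW] = 1;
        while (present[next_seq % WINDOW] &&
               pending[next_seq % WINDOW].seq == next_seq) {
            present[next_seq % WINDOW] = 0;
            deliver(pending[next_seq % WINDOW]); /* posted order only */
            next_seq++;
        }
    }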
There have been a number of MPI implementations that have explored im-
proving communication with emphasis on some particular aspect of the overall per-
formance. Work discussed in [44] integrates MPI’s receive queue management into
the TCP/IP protocol handler. The handler is interrupt driven and copies received
data from the TCP receive buffer to the user space. The focus of this work is on
avoiding polling of sockets and reducing system call overhead in large scale clusters.
In general, these implementations require special drivers or operating systems
and will not have a great degree of support, which limits their use. Although the
same can be said about SCTP, the fact that SCTP has been standardized and that
implementations have begun to emerge is a good indication of more widespread
support.
Recently, Open MPI has been announced: a new public domain ver-
sion of MPI-2 that builds on the experience gained from the design and implemen-
tation of LAM/MPI, LA-MPI, FT-MPI [17] and PACX-MPI. Open MPI takes
a component-based approach to allow the flexibility to mix and match components
for collectives and for transport and link management. It is designed to be scalable
and fault-tolerant, provides a platform for third-party research, and enables the
addition of new components. We hope that our work can be incorporated into Open
MPI as a transport module. TEG [65] is a fault-tolerant point-to-point communica-
tion module in Open MPI that supports multihoming and the ability to stripe data
across interfaces. It provides concurrent support for multiple network types, such
as Myrinet, InfiniBand, GigE etc., and can carry out message fragmentation and
delivery utilizing multiple network interfaces (NICs). The ability to sched-
ule data over different interfaces has been proposed for SCTP and may provide an
alternative way to provide the functionality of TEG in environments like the Inter-
net. A group at the University of Delaware is researching Concurrent Multipath
Transfer (CMT) [29, 28], which uses SCTP’s multihoming feature to provide simul-
taneous transfer of data between two endpoints via two or more end-to-end paths.
The objective of using CMT between multihomed hosts is to increase application
throughput. CMT operates at the transport layer and is thus more efficient than
multipath transfer at the application level, since it has access to finer details about
the end-to-end paths. CMT is currently being integrated into the FreeBSD KAME
SCTP stack and will be available as a sysctl option by the end of 2005.
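If CMT does ship as a sysctl, enabling it from the middleware could look like the sketch below; the OID name is our assumption, since the option did not yet exist at the time of writing.

    /* Sketch: toggling CMT via sysctl on FreeBSD; the OID name is
     * hypothetical and must be checked against the released stack. */
    #include <sys/types.h>
    #include <sys/sysctl.h>

    int enable_cmt(void) {
        int on = 1;
        return sysctlbyname("net.inet.sctp.cmt_on_off", NULL, NULL,
                            &on, sizeof(on));
    }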
Another recent project, still under development, is Open Run-Time Envi-
ronment (OpenRTE) [8], for supporting distributed high-performance computing in
heterogeneous environments. It is a spin-off from the Open MPI project and has
drawn from existing approaches to large scale distributed computing like LA-MPI,
HARNESS FT-MPI and the Globus project. It focuses on providing:
1. Ease-of-use, with focus on transparency and the ability to execute an applica-
tion on a variety of computing resources.
2. Resilience, with user-definable/selectable error management strategies.
3. Scalability and extensibility, with the ability to support addition of new fea-
tures.
It is based on the modular component architecture developed for the Open
MPI project and the behavior of any of the OpenRTE subsystems can be changed
by defining a new component and selecting it for use. Their design allows users to
customize the run-time behavior for their applications. A beta version of OpenRTE
is currently being evaluated by the developers. We believe this project and the
Open MPI project can benefit from using SCTP in their transport module. SCTP
70
is particularly valuable for applications that require monitoring and path/session
failure detection, since its heartbeat mechanism is designed for providing active
monitoring of associations. One potential module in Open MPI may be a wide-area
network message progression module.
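As an illustration of this monitoring, heartbeat behavior can be tuned per peer address through the SCTP_PEER_ADDR_PARAMS socket option. The sketch below follows the sctp_paddrparams interface of the later socket-API drafts (as in the lksctp and KAME stacks); the interval and retransmission threshold are illustrative choices, and field and flag names may differ across stack versions.

    /* Sketch: enabling heartbeats on a peer address to monitor path
     * liveness; interval and threshold values are illustrative. */
    #include <sys/socket.h>
    #include <netinet/sctp.h>
    #include <string.h>

    int enable_heartbeat(int sd, const struct sockaddr *peer,
                         socklen_t peerlen) {
        struct sctp_paddrparams p;
        memset(&p, 0, sizeof(p));
        memcpy(&p.spp_address, peer, peerlen);
        p.spp_flags = SPP_HB_ENABLE; /* heartbeats on for this path */
        p.spp_hbinterval = 5000;     /* heartbeat every 5000 ms */
        p.spp_pathmaxrxt = 5;        /* retransmits before path failure */
        return setsockopt(sd, IPPROTO_SCTP, SCTP_PEER_ADDR_PARAMS,
                          &p, sizeof(p));
    }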
71
Chapter 7
Conclusions and Future Work
In this thesis we discussed the design and evaluation of using SCTP for MPI. SCTP
is better suited as a transport layer for MPI because of several distinct features
not present in TCP. We have designed the SCTP module to address the head-of-
line blocking problem present in the LAM-TCP middleware. We have shown that SCTP
matches MPI semantics more closely than TCP and we have taken advantage of the
multistreaming feature of SCTP to provide a direct mapping from streams to MPI
message tags. This has resulted in increased concurrency at the TRC level in the
SCTP module, compared to concurrency at the process level in LAM-TCP.
Our SCTP module’s state machine uses one-to-many style sockets and avoids
the use of expensive select system calls, which leads to increased scalability in
large scale clusters. In addition, SCTP’s multihoming feature makes our module
fault tolerant and resilient to network path failures.
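A minimal sketch of the one-to-many socket setup follows, assuming the standard SCTP socket extensions; the single descriptor serves all peers, so the middleware can block in one sctp_recvmsg() call instead of select()ing over many one-to-one sockets. The port number handling is illustrative.

    /* Sketch: one-to-many SCTP socket; one descriptor serves every
     * association, so no select() over N sockets is needed. */
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/sctp.h>
    #include <stdint.h>
    #include <string.h>

    int open_one_to_many(uint16_t port) {
        int sd = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);
        struct sockaddr_in addr;

        if (sd < 0) return -1;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(port);
        if (bind(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(sd, 1) < 0) /* enables implicit association setup */
            return -1;
        return sd;
    }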
We have evaluated our module and reported the results of several experi-
ments using standard benchmark programs as well as a real-world application and
compared the performance with LAM-TCP. Simple ping-pong tests, under no loss,
have shown that LAM-TCP outperforms the SCTP module for small message sizes
but SCTP does better for large messages. Additionally, SCTP’s performance is
comparable to TCP’s for standard benchmarks such as the NAS benchmarks when
larger dataset sizes are used. The strengths of SCTP over TCP become apparent
under loss conditions as seen in the results for ping-pong tests and our latency tol-
erant Bulk Processor Farm program. When different tags are used in a program,
the advantages due to multistreaming in our module can lead to further benefits in
performance.
SCTP is a robust transport level protocol that extends the range of MPI
programs that can effectively execute in a wide-area network. There are many
advantages of using MPI in open environments like the Internet, the first being
portability and the ability to execute a large body of MPI applications unchanged
in diverse environments. Secondly, not everybody has access to dedicated high per-
formance clusters and this provides the opportunity to take advantage of computing
resources in other environments as well as to be able to access specialized resources
not locally available. There are, of course, many performance issues of operating in
an open environment and the need for dynamically optimizing connections remains
an active area of research in the networking community. Several variants of TCP
have been proposed and we believe that SCTP, due to its similarity with TCP, will
also be able to take advantage of improvements in that area.
Our work has the potential to spawn a new body of research where other
features of SCTP like multihoming and load balancing can be investigated to better
support parallel applications using MPI. The current ongoing integration of CMT
in the SCTP stack poses interesting questions about potential benefits for MPI
applications. Work has also been done on prioritization of streams in SCTP [26],
and it would be interesting to investigate the benefits to MPI if a selection of
different latencies is made available to messages by assigning them to different
priority streams.
In general, the performance requirements of different MPI programs will
vary. There will be those programs that can only achieve satisfactory performance
on dedicated machines, with low-latency and high-bandwidth links. On the other
hand, there will be those latency tolerant programs that will be able to run just as
well in highly available, shared environments that have large delays and loss. Our
contribution is to the latter type of programs, extending their performance in
open environments such as the Internet. Ideally, we want to increase the portability
of MPI and in doing so, encourage programmers to develop programs more towards
this end of the spectrum.
Bibliography
[1] Rumana Alamgir, Mohammed Atiquzzaman, and William Ivancic. Effect of
congestion control on the performance of TCP and SCTP over satellite net-
works. In NASA Earth Science Technology Conference, Pasadena, CA, June
2002.
[2] Rob T. Aulwes, David J. Daniel, Nehal N. Desai, Richard L. Graham, L. Dean
Risinger, Mark A. Taylor, Timothy S. Woodall, and Mitchel W. Sukalski. Ar-
chitecture of LA-MPI, a network-fault-tolerant MPI. In 18th International
Parallel and Distributed Processing Symposium (IPDPS’04), Sante Fe, New
Mexico, April 2004.
[3] Ryan W. Bickhart. Transparent SCTP shim, personal communication. Univer-
sity of Delaware, http://www.cis.udel.edu/~bickhart/research/shim.html.
[4] Ron Brightwell and Keith Underwood. Evaluation of an eager protocol opti-
mization for MPI. In PVM/MPI, pages 327–334, 2003.
[5] Greg Burns and Raja Daoud. Robust message delivery with guaranteed re-
sources. In Proceedings of Message Passing Interface Developer’s and User’s
Conference (MPIDC), May 1995.
[6] Sadik G. Caglar, Gregory D. Benson, Qing Huang, and Cho-Wai Chu. A
multi-threaded implementation of MPI for Linux clusters. In 15th IASTED
International Conference on Parallel and Distributed Computing and Systems,
Los Angeles, November 2003.
[7] Neal Cardwell, Stefan Savage, and Thomas Anderson. Modeling TCP latency.
In INFOCOM, pages 1742–1751, 2000.
[8] R. H. Castain, T. S. Woodall, D. J. Daniel, J. M. Squyres, B. Barrett, and G. E.
Fagg. The open run-time environment (OpenRTE): A transparent multi-cluster
environment for high-performance computing, September 2005. To appear in
EURO PVM MPI 2005 12th European Parallel Virtual Machine and Message
Passing Interface Conference.
[9] NASA Ames Research Center. Numerical aerodynamic simulation (NAS) paral-
lel benchmark (NPB) benchmarks. http://www.nas.nasa.gov/Software/NPB/.
[10] Yoonsuk Choi, Kyungshik Lim, Hyun-Kook Kahng, and I. Chong. An experi-
mental performance evaluation of the stream control transmission protocol for
transaction processing in wireless networks. In ICOIN, pages 595–603, 2003.
[11] Wu chun Feng. The future of high-performance networking, 2001. Workshop
on New Visions for Large-Scale Networks: Research and Applications.
[12] A. Denis, O. Aumage, R. Hofman, K. Verstoep, T. Kielmann, and H. E. Bal.
Wide-area communication for grids: an integrated solution to connectivity,
performance and security problems. In High performance Distributed Comput-
ing, 2004. Proceedings. 13th IEEE International Symposium on, pages 97–106,
June 2004.
[13] P. Dickens, W. Gropp, and P. Woodward. High performance wide area data
transfers over high performance networks. In International Workshop on Per-
formance Modeling, Evaluation, and Optimization of Parallel and Distributed
Systems, 2002.
[14] Rossen Dimitrov and Anthony Skjellum. Software architecture and performance
comparison of MPI/Pro and MPICH. In Lecture Notes in Computer Science,
Volume 2659, Pages 307 - 315, January 2003.
[15] Tom Dunigan, Matt Mathis, and Brian Tierney. A TCP tuning daemon. In
Supercomputing ’02: Proceedings of the 2002 ACM/IEEE conference on Super-
computing, pages 1–16, Los Alamitos, CA, USA, 2002. IEEE Computer Society
Press.
[16] EmuLab. Network Emulation Testbed. http://www.emulab.net/.
[17] Graham E. Fagg and Jack J. Dongarra. Building and using a fault-tolerant
MPI implementation. International Journal of High Performance Computing
Applications, 18(3):353–361, 2004.
[18] Graham E. Fagg and Jack J. Dongarra. HARNESS fault tolerant MPI design,
usage and performance issues. Future Gener. Comput. Syst., 18(8):1127–1142,
2002.
[19] Kevin Fall and Sally Floyd. Simulation-based comparisons of Tahoe, Reno and
SACK TCP. SIGCOMM Comput. Commun. Rev., 26(3):5–21, 1996.
[20] Ahmad Faraj and Xin Yuan. Communication characteristics in the NAS parallel
benchmarks. In PDCS, pages 724–729, 2002.
[21] W. Feng and P. Tinnakornsrisuphap. The failure of TCP in high-performance
computational grids. In Supercomputing ’00: Proceedings of the 2000
ACM/IEEE conference on Supercomputing, page 37, Washington, DC, USA,
2000. IEEE Computer Society.
[22] G. Burns, R. Daoud and J. Vaigl. LAM: An Open Cluster Environment for
MPI. In Supercomputing Symposium ’94, Toronto, Canada, June 1994.
[23] Edgar Gabriel, Graham E. Fagg, George Bosilca, Thara Angskun, Jack J. Don-
garra, Jeffrey M. Squyres, Vishal Sahay, Prabhanjan Kambadur, Brian Barrett,
Andrew Lumsdaine, Ralph H. Castain, David J. Daniel, Richard L. Graham,
and Timothy S. Woodall. Open MPI: Goals, concept and design of a next gen-
eration MPI implementation. In Proceedings, 11th European PVM/MPI Users’
Group Meeting, Budapest, Hungary, September 2004.
[24] Al Geist, William Gropp, Steve Huss-Lederman, Andrew Lumsdaine, Ewing L.
Lusk, William Saphir, Tony Skjellum, and Marc Snir. MPI-2: Extending the
message-passing interface. In Euro-Par, Vol. I, pages 128–135, 1996.
[25] Thomas J. Hacker, Brian D. Noble, and Brian D. Athey. Improving throughput
and maintaining fairness using parallel TCP. In IEEE INFOCOM, 2004.
[26] Gerard J. Heinz and Paul D. Amer. Priorities in SCTP multistreaming. In SCI
’04, Orlando, FL, July 2004.
[27] I. Foster and C. Kesselman. Globus: A Metacomputing Infrastructure Toolkit.
Intl Journal of Supercomputer Applications, 11(2):115–128, 1997.
[28] Janardhan R. Iyengar, Paul D. Amer, and Randall Stewart. Retransmission
policies for concurrent multipath transfer using SCTP multihoming. In ICON
2004, Singapore, November 2004.
[29] Janardhan R. Iyengar, Keyur C. Shah, Paul D. Amer, and Randall Stewart.
Concurrent multipath transfer using SCTP multihoming. In SPECTS 2004,
San Jose, July 2004.
[30] M. Jain, R. Prasad, and C. Dovrolis. The TCP bandwidth-delay product revis-
ited: Network buffering, cross traffic, and socket buffer auto-sizing. Technical
Report GIT-CERCS-03-02, Georgia Tech, February 2003.
[31] Armando L. Caro Jr., Paul D. Amer, and Randall R. Stewart. Retransmission
policies for multihomed transport protocols. Technical Report TR2005-15, CIS
Dept, University of Delaware, March 2005.
[32] Armando L. Caro Jr., Keyur Shah, Janardhan R. Iyengar, Paul D. Amer, and
Randall R. Stewart. SCTP and TCP variants: Congestion control under mul-
tiple losses. Technical Report TR2003-04, CIS Dept, U of Delaware, February
2003.
[33] Humaira Kamal. Description of LAM TCP RPI module. Technical re-
port, University of British Columbia, Computer Science Department, Available
at: http://www.cs.ubc.ca/labs/dsg/mpi-sctp/LAM_TCP_RPI_MODULE.pdf,
2004.
[34] Humaira Kamal, Brad Penoff, and Alan Wagner. SCTP-based middleware for
MPI in wide-area networks. In 3rd Annual Conf. on Communication Networks
and Services Research (CNSR2005), pages 157–162, Halifax, May 2005. IEEE
Computer Society.
[35] Humaira Kamal, Brad Penoff, and Alan Wagner. SCTP versus TCP for MPI.
In Supercomputing ’05: Proceedings of 2005 ACM/IEEE conference on Super-
computing (to appear), Seattle, WA, November 2005.
[36] KAME. SCTP stack implementation for FreeBSD. http://www.kame.net.
[37] Nicholas T. Karonis, Brian R. Toonen, and Ian T. Foster. MPICH-G2:
A grid-enabled implementation of the message passing interface. CoRR,
cs.DC/0206040, 2002.
[38] Rainer Keller, Edgar Gabriel, Bettina Krammer, Matthias S. Mueller, and
Michael M. Resch. Towards efficient execution of MPI applications on the grid:
Porting and optimization issues. Journal of Grid Computing, 1:133–149, June
2003.
[39] Thilo Kielmann, Rutger F. H. Hofman, Henri E. Bal, Aske Plaat, and Raoul
A. F. Bhoedjang. MagPIe: MPI’s collective communication operations for
clustered wide area systems. SIGPLAN Not., 34(8):131–140, 1999.
[40] The LAM/MPI Team Open Systems Lab. LAM/MPI user’s guide, September
2004. Pervasive Technology Labs, Indiana University.
[41] Sourabh Ladha and Paul Amer. Improving multiple file transfers using SCTP
multistreaming. In Proceedings IPCCC, April 2004.
[42] Linux Kernel Stream Control Transmission Protocol (lksctp) project. withsctp
utility, lksctp-tools package, http://lksctp.sourceforge.net/.
[43] Matt Mathis, John Heffner, and Raghu Reddy. Web100: Extended TCP instru-
mentation for research, education and diagnosis. SIGCOMM Comput. Com-
mun. Rev., 33(3):69–79, 2003.
[44] M. Matsuda, T. Kudoh, H. Tazuka, and Y. Ishikawa. The design and imple-
mentation of an asynchronous communication mechanism for the MPI commu-
nication model. In IEEE Intl. Conf. on Cluster Computing, pages 13–22, Dana
Point, Ca., Sept 2004.
[45] Alberto Medina, Mark Allman, and Sally Floyd. Measuring the evolution of
transport protocols in the Internet, April 2005. To appear in ACM CCR.
[46] Message Passing Interface Forum. MPI: A Message Passing Interface. In Super-
computing ’93: Proceedings of 1993 ACM/IEEE conference on Supercomputing,
pages 878–883. IEEE Computer Society Press, 1993.
[47] G. Minshall, Y. Saito, J. Mogul, and B. Verghese. Application performance pit-
falls and TCP’s Nagle algorithm. In Workshop on Internet Server Performance,
Atlanta, GA, USA, May 1999.
[48] Philip J. Mucci. MPBench: Benchmark for MPI functions. Available at:
http://icl.cs.utk.edu/projects/llcbench/mpbench.html.
[49] Peter S. Pacheco. Parallel programming with MPI. Morgan Kaufmann Pub-
lishers Inc., San Francisco, CA, USA, 1996.
[50] Scott Pakin, Mario Lauria, and Andrew Chien. High performance messaging
on workstations: Illinois Fast Messages (FM) for Myrinet. In Supercomputing
’95: Proceedings of the 1995 ACM/IEEE conference on Supercomputing, San
Diego, CA, December 1995.
[51] PlanetLab. An open platform for developing, deploying, and accessing
planetary-scale services. http://www.planet-lab.org/.
[52] R. Rajamani, S. Kumar, and N. Gupta. SCTP versus TCP: Comparing the
performance of transport protocols for web traffic. Technical report, University
of Wisconsin-Madison, May 2002.
[53] S. Fu, M. Atiquzzaman, and W. Ivancic. SCTP over satellite networks. In
IEEE Computer Communications Workshop (CCW 2003), pages 112–116,
Dana Point, Ca., October 2003.
[54] Jeffrey M. Squyres, Brian Barrett, and Andrew Lumsdaine. Boot systems ser-
vices interface (SSI) modules for LAM/MPI. Technical Report TR576, Indiana
University, Computer Science Department, 2003.
[55] W. Richard Stevens, Bill Fenner, and Andrew M. Rudoff. UNIX Network
Programming, Vol. 1, Third Edition. Pearson Education, 2003.
[56] R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. Schwarzbauer, T. Taylor,
I. Rytina, M. Kalla, L. Zhang, and V. Paxson. The stream control transmission
protocol (SCTP). Available from http://www.ietf.org/rfc/rfc2960.txt, October
2000.
[57] Randall Stewart. Primary designer of SCTP and developer of FreeBSD KAME
SCTP stack, private communication, 2005.
[58] Randall Stewart and Paul D. Amer. Why is SCTP needed given TCP and UDP
are widely available? Internet Society, ISOC Member Briefing #17. Available
from http://www.isoc.org/briefings/017/briefing17.pdf, June 2004.
[59] Randall R. Stewart and Qiaobing Xie. Stream control transmission protocol
(SCTP): a reference guide. Addison-Wesley Longman Publishing Co., Inc.,
2002.
[60] Alexander Tormasov and Alexey Kuznetsov. TCP/IP options for high-
performance data transmission, 2002. http://builder.com.com/5100-6372_14-1050878.html.
[61] Rajkumar Vinkat, Philip M. Dickens, and William Gropp. Efficient communi-
cation across the Internet in wide-area MPI. In Conference on Parallel and Dis-
tributed Programming Techniques and Applications, Las Vegas, Nevada, USA,
2001.
[62] W. Gropp, E. Lusk, N. Doss and A. Skjellum. High-performance, portable
implementation of the MPI message passing interface standard. Parallel Com-
puting, 22(6):789–828, September 1996.
[63] Paul A. Watson. Slipping in the window: TCP reset attacks, October 2003.
CanSecWest Security Conference.
[64] Eric Weigle and Wu chun Feng. A case for TCP Vegas in high-performance
computational grids. In HPDC ’01: Proceedings of the 10th IEEE International
Symposium on High Performance Distributed Computing (HPDC-10’01), page
158, Washington, DC, USA, 2001. IEEE Computer Society.
[65] T.S. Woodall, R.L. Graham, R.H. Castain, D.J. Daniel, M.W. Sukalski, G.E.
Fagg, E. Gabriel, G. Bosilca, T. Angskun, J.J. Dongarra, J.M. Squyres, V. Sa-
hay, P. Kambadur, B. Barrett, and A. Lumsdaine. TEG: A high-performance,
scalable, multi-network point-to-point communications methodology. In Pro-
ceedings, 11th European PVM/MPI Users’ Group Meeting, Budapest, Hungary,
September 2004.
[66] J. Yoakum and L. Ong. An introduction to the stream control transmission pro-
tocol (SCTP). Available from http://www.ietf.org/rfc/rfc3286.txt, May 2002.