A Hybrid MPI Design using SCTP and iWARP
Distributed Systems Group
Mike Tsai, Brad Penoff, and Alan Wagner
Department of Computer Science
University of British Columbia
Vancouver, Canada
April 14, 2008
A Hybrid Message Passing Interface Design using the
Stream Control Transmission Protocol and the Internet Wide Area
Remote Direct Memory Access Protocol
Research Background
• SCTP – Stream Control Transmission Protocol
– IETF standardized transport protocol for IP
– Can be used anywhere TCP or UDP are used
– Additional features
• SCTP and MPI middleware
– LAM (unreleased)
– MPICH2 (1.0.5 and on) ch3:sctp
– Open MPI SCTP BTL (in v1.3 trunk)
• Hardware acceleration techniques for IP
– Protocol offload
– OS bypass
– Zero copy
– RDMA
– 10 GigE
How would these look for SCTP?
Are there benefits here for using SCTP?
State-of-the-Art Networking
• iWARP - Internet Wide Area RDMA protocol
– IETF standard for RDMA over IP
• Use RDMA, point-to-point, or a mix?
• “Why Compromise?” (G. Shainer @ HPCWire.com)
– Depending on the application, use whichever is best.
• For MPI middleware, who decides what’s best?
Story/motivation
The programmer!
Contribution
• Hybrid MPI with functional decomposition lets the programmer decide:
– Let RMA use RDMA
– Let other communications use point-to-point
• Explore SCTP’s use within iWARP
– Extended OSC userspace software iWARP, making many internal OSC changes
iWARP : DDP & LLP
[Figure: iWARP protocol stack, top to bottom: Verbs or API, RDMAP, DDP, Lower Layer Protocol (LLP), IP]
Direct Data Placement
• Fragments messages
• Reassembles segments
• Segments self-contained
• Data delivery and placement separation
• Out-of-order delivery
Requires LLP to:
• Keep segment boundaries
• Be reliable
• Take a strong checksum
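The DDP bullets above can be illustrated with a small Python sketch. It is not the OSC iWARP code: the `Segment` fields, `fragment`, and `PlacementBuffer` are hypothetical names chosen for illustration. It assumes a reliable LLP that preserves segment boundaries and never duplicates a segment, as the slide requires. Each segment carries its own offset, so it can be placed the moment it arrives, in any order, while delivery waits until the whole message is in place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One self-contained segment: it carries everything needed to place it."""
    msg_id: int      # hypothetical message identifier
    offset: int      # byte offset of the payload within the message
    total_len: int   # length of the whole message
    payload: bytes


def fragment(msg_id: int, message: bytes, max_payload: int) -> list:
    """Split a message into segments of at most max_payload bytes."""
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    return [
        Segment(msg_id, off, len(message), message[off:off + max_payload])
        for off in range(0, len(message), max_payload)
    ]


class PlacementBuffer:
    """Destination buffer that accepts segments in any arrival order."""

    def __init__(self, total_len: int) -> None:
        self.buffer = bytearray(total_len)
        self.placed = 0  # bytes placed so far (assumes no duplicate segments)

    def place(self, seg: Segment) -> None:
        # Placement: copy the payload straight to its final offset.
        self.buffer[seg.offset:seg.offset + len(seg.payload)] = seg.payload
        self.placed += len(seg.payload)

    def delivered(self) -> bool:
        # Delivery: the message counts as complete once every byte is placed.
        return self.placed == len(self.buffer)
```

Placing the segments in reverse order, for example with `for seg in reversed(fragment(1, data, 5)): buf.place(seg)`, still reconstructs `data`, which is the separation of placement from delivery that the slide refers to.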
iWARP : LLP = MPA over TCP
[Figure: iWARP protocol stack, top to bottom: Verbs or API, RDMAP, DDP, MPA, TCP, IP; MPA over TCP together form the LLP]
Marker PDU Aligned (MPA) framing
• Message framing
• DDP segment vs. TCP stream
• Markers for out-of-order
• For middlebox fragmentation
• Stronger checksum
… is a complex layer (majority of OSC code)!
… can lead to non-compliant TCP stacks.
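To show why a framing layer is needed at all over TCP, here is a deliberately simplified Python sketch. It is not the MPA wire format: it only prefixes each segment with a 2-byte length, and it leaves out the markers and the stronger checksum listed above. `frame` and `Deframer` are hypothetical names. The point is that a byte stream can hand the receiver chunks that cut across segment boundaries, so the boundaries have to be recovered explicitly.

```python
import struct


def frame(segment: bytes) -> bytes:
    """Prefix a segment with a 2-byte big-endian length (simplified framing)."""
    if len(segment) > 0xFFFF:
        raise ValueError("segment too large for a 2-byte length field")
    return struct.pack("!H", len(segment)) + segment


class Deframer:
    """Recovers whole segments from arbitrarily chunked stream data."""

    def __init__(self) -> None:
        self._pending = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Add stream bytes; return the segments completed by this chunk."""
        self._pending += chunk
        segments = []
        while len(self._pending) >= 2:
            (length,) = struct.unpack_from("!H", self._pending, 0)
            if len(self._pending) < 2 + length:
                break  # the rest of this segment has not arrived yet
            segments.append(bytes(self._pending[2:2 + length]))
            del self._pending[:2 + length]
        return segments
```

A message-based transport makes this layer unnecessary, because each receive already returns exactly one message.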
SCTP is a better LLP
LLP’s needs are built in to SCTP:
• Reliable, message-based
• CRC32c checksum
• Out-of-order support:
– MSG_UNORDERED
– Multistreaming
• Multihoming
Unmodified stack supports:
• Path failover
• Multirail data striping
[Figure: iWARP protocol stack with two LLP options between DDP and IP: MPA over TCP, or SCTP alone; Verbs or API and RDMAP sit above DDP]
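For readers unfamiliar with the CRC32c checksum named above, the following is a minimal table-driven sketch in Python. It is for illustration only; a real SCTP stack computes this checksum itself, and the function names here are hypothetical. It uses the Castagnoli polynomial in reflected form, with the usual initial value and final XOR of 0xFFFFFFFF.

```python
def _make_table() -> list:
    """Build the 256-entry lookup table for the reflected CRC32c polynomial."""
    poly = 0x82F63B78  # reflected form of the Castagnoli polynomial 0x1EDC6F41
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Return the CRC32c of data as an unsigned 32-bit integer."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF
```

The standard check value for this algorithm is `crc32c(b"123456789") == 0xE3069283`.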
In the beginning, there was ch3:sctp
[Figure: MPICH2 stack: MPI-1 API and MPI-2 one-sided RMA API, over CH3:SCTP, over an SCTP socket]
OSC iWARP was modified and incorporated as a thread…
[Figure: the same MPICH2 stack (MPI-1 API, MPI-2 one-sided RMA API, CH3:SCTP, SCTP socket), now with iWARP added]
RMA done by modified OSC iWARP
[Figure: MPICH2 stack with CH3:HYBRID: MPI-1 API, MPI-2 one-sided RMA API, CH3:HYBRID, CH3:SCTP, iWARP, SCTP socket, and a shared data structure]
OSC iWARP changes to support MPI
• Running in a thread
• Use SCTP
• Making all OSC ops non-blocking
• Locks around shared data
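The last two bullets can be pictured with a short Python sketch. It is an assumption-laden illustration, not the modified OSC code: `SharedRequestQueue`, `iwarp_thread`, and the request tuples are hypothetical. The caller posts a request and returns at once, while a separate thread drains the queue, and a lock guards the structure both sides touch.

```python
import threading
from collections import deque


class SharedRequestQueue:
    """Work queue shared by the MPI thread and the iWARP thread."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._requests = deque()

    def post(self, request) -> None:
        """Non-blocking from the caller's view: enqueue and return at once."""
        with self._lock:
            self._requests.append(request)

    def take_all(self) -> list:
        """Remove and return every pending request, in posting order."""
        with self._lock:
            items = list(self._requests)
            self._requests.clear()
        return items


def iwarp_thread(queue, completed, stop) -> None:
    """Hypothetical worker loop: drain posted requests until told to stop."""
    while True:
        stopping = stop.is_set()  # sample the flag before the final drain
        completed.extend(queue.take_all())
        if stopping:
            return
        stop.wait(0.001)  # brief pause instead of spinning
```

Sampling the stop flag before draining means that anything posted before `stop.set()` is still picked up by the final pass.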
Connection Management Design
Connection establishment:
• Separate one-to-many socket for new QPs
– SCTP “peeloff” feature
• New QP sends request from one-to-many socket
• Request/ACK received, then QP socket peeled-off
• For conflicts, MPI rank resolves who sends ACK
Progress Engine
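[Figure: progress engine flowchart. Flow: Start, dequeue head event, then either "No Event" or "Valid event"; a valid event goes to Handle Event (read logic or write logic); then loop or break out early; End. Application-level events enqueue read events and write events on the event queue, and the iWARP poll enqueues iWARP events. The queue is drawn holding a mix of W and R entries, with the head marked.]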
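The flowchart can be read as a simple event loop. The Python sketch below is one interpretation of it, not the ch3:hybrid source. `ProgressEngine`, the `iwarp_poll` callback, and the `max_events` limit that models "break out early" are all assumptions made for illustration.

```python
from collections import deque

READ, WRITE = "R", "W"


class ProgressEngine:
    """Single event queue fed by the application and by an iWARP poll."""

    def __init__(self, iwarp_poll=None) -> None:
        self.events = deque()
        # Hypothetical callback returning a list of (kind, item) events.
        self.iwarp_poll = iwarp_poll or (lambda: [])
        self.log = []  # records what was handled, for inspection

    def enqueue_read(self, item) -> None:
        self.events.append((READ, item))

    def enqueue_write(self, item) -> None:
        self.events.append((WRITE, item))

    def progress(self, max_events=None) -> int:
        """Handle queued events; stop early after max_events if given."""
        # Completions reported by the iWARP side become queued events.
        self.events.extend(self.iwarp_poll())
        handled = 0
        while self.events:  # "No Event" ends the loop
            if max_events is not None and handled >= max_events:
                break  # break out early
            kind, item = self.events.popleft()  # dequeue head event
            if kind == READ:
                self.log.append(("read", item))   # stand-in for read logic
            else:
                self.log.append(("write", item))  # stand-in for write logic
            handled += 1
        return handled
```

Events left in the queue after an early break are simply handled on the next call to `progress`.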
Performance
What we tested…
– Compared our new ch3:hybrid to the original ch3:sctp
– Two 3.2 GHz Intel boxes (GigE + switch)
• OSU latency tests (MPI_Put & MPI_Get)
• Homemade synthetic benchmark
– Combination of RMA and MPI-1 calls
OSU One-sided Latency Tests
• ch3:hybrid adds 2-8% overhead
Synthetic Application
• ch3:hybrid was faster than ch3:sctp
– 3.8 seconds vs. 4.5 seconds
• Extra thread helps in some cases
Conclusions
• RDMA versus point-to-point for MPI
– Why choose?
• Functional decomposition lets programmer decide
• SCTP is a good match for iWARP
– Implementation of iWARP using SCTP shown.
– SCTP has its place in the state-of-the-art.
– It’d be more exciting to have SCTP-based devices…
Google “sctp mpi” for more information about our work
Thank you!
Connection Management Design
[Figure: connection setup timeline between Rank 0 and Rank 1, with time t running downward. Labels in order:]
• Rank 0 and Rank 1 each: Connect (send connection packet)
• Connect Request Discarded (Target rank > 0)
• Connect Request Accepted (Target rank > local rank)
• Peeloff, register with iWARP
• Connection ACK
• Peeloff, register with iWARP
• App. Level Connection formed
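The tie-break in the timeline can be written down as a small rule. The Python sketch below encodes one reading of the "Target rank > local rank" label: when both sides send a connect request at once, a process accepts the incoming request (and sends the ACK) only if the peer's rank is higher than its own, so the other side discards the request it received. That reading, and the function names, are assumptions rather than something the slides spell out.

```python
def accepts_request(local_rank: int, peer_rank: int) -> bool:
    """True if this process should accept the peer's request and send the ACK."""
    return peer_rank > local_rank


def resolve_simultaneous_connect(rank_a: int, rank_b: int) -> tuple:
    """Return (acker, waiter) when both ranks sent a connect request at once."""
    if rank_a == rank_b:
        raise ValueError("a rank does not connect to itself")
    if accepts_request(rank_a, rank_b):
        return rank_a, rank_b
    return rank_b, rank_a
```

Because the comparison is strict, exactly one side accepts for any pair of distinct ranks, which leaves a single ACK and a single application-level connection.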