Protocol · Engineering · Laboratory
University of Delaware
End-to-End Fault Tolerance Using Transport Layer Multihoming
Armando L. Caro, Jr.
Dissertation Proposal
April 8, 2003
Propose to investigate transport layer multihoming for
• end-to-end fault tolerance (primary goal)
• improved application performance (secondary goal)
[Figure: Host A (interfaces A1, A2) and Host B (interfaces B1, B2), connected to the Internet through four ISPs]
Why Investigate Transport Layer Multihoming?
• Many applications (e.g., mission-critical) require uninterrupted service
• Internet path outages are common
  • link failures
  • overloaded links
• Multiple network interfaces provide network layer redundancy
  • interfaces today are relatively cheap
Need transport layer support to increase connection resilience during path outages
Can’t Routing Handle Path Outages?
• Routing does not recover fast enough from link failures
  • [Labovitz 00] measured failure detection and recovery
    • minimum: 3 minutes
    • often: 10s of minutes
    • 40% required >30 minutes
  • [Chandra 01] (using probes)
    • 5% required 2.75 – 27.75 hours!
  • [Paxson 97] (using probes)
    • 1.5 – 3.3% of routes had “serious pathologies”
  • [Labovitz 98] (examining routing table logs)
    • 10% of routes available < 95% of time
    • 65% of routes available < 99.99% of time
• Routing does not recover at all from overloaded links
  • Flash crowds
  • DoS attacks - statistics in [Moore 01]
SCTP Multihoming
• hosts choose 1 of 4 possible TCP connections:
  • (A1,B1) or (A1,B2) or (A2,B1) or (A2,B2)
• 1 SCTP association
  • ({A1,A2}, {B1,B2})
  • concept of “primary” destination
    • Host A → B1
    • Host B → A1
  • network state (RTT, cwnd, ssthresh, …) maintained per destination
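The contrast between four candidate TCP connections and one SCTP association can be sketched as a small data structure. This is an illustration only: the class names, the MTU value, and the initial cwnd and ssthresh values are assumptions chosen for the example, not taken from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

MTU = 1500  # bytes; assumed value for illustration


@dataclass
class DestinationState:
    """Network state kept separately for each peer address."""
    address: str
    srtt: Optional[float] = None  # smoothed RTT in seconds, unknown until measured
    cwnd: int = 2 * MTU           # assumed initial congestion window
    ssthresh: int = 65535         # assumed initial slow-start threshold


@dataclass
class Association:
    """One association spanning all local and peer addresses."""
    local_addrs: List[str]
    peer_addrs: List[str]
    primary: str
    paths: Dict[str, DestinationState] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.primary not in self.peer_addrs:
            raise ValueError("primary must be one of the peer addresses")
        # One state record per destination address.
        for addr in self.peer_addrs:
            self.paths[addr] = DestinationState(addr)


def tcp_connection_choices(local_addrs, peer_addrs):
    """Every (local, peer) pair a single TCP connection could be bound to."""
    return [(src, dst) for src in local_addrs for dst in peer_addrs]


host_a = Association(["A1", "A2"], ["B1", "B2"], primary="B1")
```

Here `tcp_connection_choices(host_a.local_addrs, host_a.peer_addrs)` lists the four pairs named on the slide, while `host_a.paths` holds one state record each for B1 and B2.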
[Figure: Host A (interfaces A1, A2) and Host B (interfaces B1, B2), connected to the Internet through four ISPs]
Current SCTP Failover Mechanism
Sender: Host A, Primary: B1, Alternate: B2 (i = 1, j = 2)
[Figure: Hosts A and B, each with interfaces 1 and 2, connected to the Internet through four ISPs; state diagram with three phases]
• Phase I: Di primary, Di active; new => Di, rtx => Dj
• Phase II: Di primary, Di errors; new => Di, rtx => Dj
• Phase III: Di primary, Di failed; new => Dj, rtx => Dj
• Transition labels: Di times out; Di responds; Path.Max.Retrans exceeded; Di responds
Reachability probes
• Explicitly with heartbeats
• Implicitly with data
A. Caro, J. Iyengar, P. Amer, G. Heinz, R. Stewart. Using SCTP Multihoming for Fault Tolerance & Load Balancing. SIGCOMM 2002 Poster, August 2002.
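The three phases of the current failover mechanism can be read as a small per-destination state machine. The sketch below is one possible reading, not the actual SCTP implementation: it assumes the error counter counts consecutive timeouts, that "exceeded" means the count is strictly greater than Path.Max.Retrans, and that any response resets the counter. The class and function names are hypothetical.

```python
PATH_MAX_RETRANS = 5  # per-destination threshold (assumed default for this sketch)


class Destination:
    def __init__(self, name):
        self.name = name
        self.error_count = 0  # consecutive timeouts (assumption)
        self.failed = False

    def on_timeout(self):
        """Data or a heartbeat sent to this destination timed out."""
        self.error_count += 1
        if self.error_count > PATH_MAX_RETRANS:
            self.failed = True

    def on_response(self):
        """Data or a heartbeat was acknowledged by this destination."""
        self.error_count = 0
        self.failed = False

    @property
    def phase(self):
        if self.failed:
            return "III"
        return "II" if self.error_count > 0 else "I"


def choose_destinations(primary, alternate):
    """Return (destination for new data, destination for retransmissions)."""
    new_dst = alternate if primary.failed else primary
    # Retransmissions go to the alternate in every phase.
    return new_dst.name, alternate.name


b1, b2 = Destination("B1"), Destination("B2")
```

With these assumptions, six consecutive timeouts on B1 move it to Phase III and new data is redirected to B2; a single later response from B1 returns it to Phase I.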
SCTP Failover:
• Issue 1 - Failover is “temporary”
• Issue 2 - Retransmission Policy
• Issue 3 - Failure Detection Time
• Issue 4 - No Source Interface Selection
SCTP Failover: Issue 1 - Failover is “temporary”
• Current failover policy
  • Traffic is redirected back to primary when primary responds to a single heartbeat
  • i.e., primary destination is never changed
• Why keep the primary destination?
  • Assumes application has a preferred destination
• We found returning to the primary may be inefficient*
  • at time of return
    • primary’s cwnd = 1 MTU & ssthresh = 2 MTU
    • alternate’s cwnd > 1 MTU & ssthresh > 2 MTU
• We propose to investigate “permanent” failover when no destination is preferred
• One successful heartbeat may not accurately indicate that a path has recovered from an outage
  • overloaded links may need more probing
• We propose to investigate other probing techniques
*A. Caro, J. Iyengar, P. Amer, G. Heinz, R. Stewart. A Two-level Threshold Recovery Mechanism for SCTP. SCI 2002, July 2002.
Two-level Threshold Failover
α = temporary failover
β = auto change primary
(i = 1, j = 2)
[Figure: Hosts A and B, each with interfaces 1 and 2, connected to the Internet through four ISPs; state diagram with four phases]
• Phase I: Di primary, Di active; new => Di, rtx => Di
• Phase II: Di primary, Di errors; new => Di, rtx => Di
• Phase III: Di primary, Di failed; new => Dj, rtx => Dj
• Phase IV: Dj primary, Di failed; new => Dj, rtx => Dj
• Transition labels: Di times out; Di responds; α; Di responds; β; Di responds, i ↔ j
A. Caro, J. Iyengar, P. Amer, G. Heinz, R. Stewart. A Two-level Threshold Recovery Mechanism for SCTP. SCI 2002, July 2002.
A. Caro, J. Iyengar, P. Amer, G. Heinz, R. Stewart. Using SCTP Multihoming for Fault Tolerance & Load Balancing. SIGCOMM 2002 Poster, August 2002.
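A minimal sketch of the two-level threshold idea follows. It is an interpretation of the slide, not the published algorithm: it assumes α and β count consecutive timeouts on Di, that a threshold is crossed when the count is strictly greater than it, and that a response from Di in Phase IV leaves Dj as primary and simply exchanges the roles of i and j. All names are hypothetical.

```python
class TwoLevelFailover:
    def __init__(self, d_i, d_j, alpha, beta):
        if not 0 <= alpha <= beta:
            raise ValueError("expected 0 <= alpha <= beta")
        self.d_i, self.d_j = d_i, d_j
        self.alpha, self.beta = alpha, beta
        self.errors = 0     # consecutive timeouts on Di (assumption)
        self.primary = d_i

    @property
    def phase(self):
        if self.primary == self.d_j:
            return "IV"     # primary was changed automatically
        if self.errors > self.alpha:
            return "III"    # temporary failover
        return "II" if self.errors > 0 else "I"

    def on_timeout(self):
        """A transmission or heartbeat to Di timed out."""
        self.errors += 1
        if self.errors > self.beta:
            self.primary = self.d_j

    def on_response(self):
        """Di answered data or a heartbeat."""
        self.errors = 0
        if self.primary == self.d_j:
            # Assumed Phase IV behaviour: keep the new primary, swap roles.
            self.d_i, self.d_j = self.d_j, self.d_i

    def destinations(self):
        """Return (destination for new data, destination for retransmissions)."""
        target = self.d_j if self.phase in ("III", "IV") else self.d_i
        return target, target


failover = TwoLevelFailover("B1", "B2", alpha=2, beta=4)
```

Under these assumptions, a response that arrives after α is exceeded but before β restores B1 as the sending destination, whereas a response that arrives after β leaves B2 as the primary.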
SCTP Failover: Issue 2 – Retransmission Policy
• Current retransmission policy
  • If peer is multihomed, retransmit to an alternate destination
• Why the alternate destination?
  • Attempts to improve chances of success
  • No prior research to demonstrate benefits
• We found that this policy degrades performance in many circumstances*
  • Not enough traffic on the alternate path to accurately measure RTT … so timeouts are LONG! *
• We propose to investigate alternative policies
* A. Caro, P. Amer, R. Stewart. Transport Layer Multihoming for Fault Tolerance in FCS Networks. CTA 2003, April 2003. (Submitted to MILCOM 2003)
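Why too few RTT samples lead to long timeouts can be illustrated with a standard smoothed-RTT retransmission timer. The sketch assumes the SCTP specification's recommended constants (RTO.Initial = 3 s, α = 1/8, β = 1/4, RTO bounds of 1 s and 60 s); the 50 ms RTT and the class name are made up for the example.

```python
class RtoEstimator:
    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variance

    def __init__(self, rto_initial=3.0, rto_min=1.0, rto_max=60.0):
        self.srtt = None
        self.rttvar = None
        self.rto = rto_initial
        self.rto_min = rto_min
        self.rto_max = rto_max

    def on_rtt_sample(self, rtt):
        """Fold one RTT measurement (seconds) into the estimate."""
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        rto = self.srtt + 4 * self.rttvar
        self.rto = min(max(rto, self.rto_min), self.rto_max)

    def on_timeout(self):
        """Exponential backoff: double the RTO, capped at the maximum."""
        self.rto = min(self.rto * 2, self.rto_max)


primary = RtoEstimator()
for _ in range(20):
    primary.on_rtt_sample(0.05)  # busy path: many samples at an assumed 50 ms RTT

alternate = RtoEstimator()       # idle path: no samples yet
alternate.on_timeout()           # one lost retransmission doubles its RTO
```

In this example the primary's RTO settles at the 1 s minimum, while the alternate, having no samples, still uses the 3 s initial value and backs off to 6 s after a single lost retransmission.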
Potential Solutions
• Solution 1: Retransmissions to Same Destination
  • Pro: uses destination with accurate RTT; cwnd benefits for primary
  • Con: fewer successful transmits if primary failed
• Solution 2: Heartbeat After RTO (Randall Stewart’s idea)
  • Pro: immediate opportunity to measure RTT after RTO backoff
  • Con: still few samples to estimate alternate RTT
• Solution 3: Timestamps
  • Pro: Karn’s Algorithm not needed; more RTT samples on alternate
  • Con: 12-byte overhead in each packet
• Solution 4: Our Multiple Fast Retransmit Algorithm
  • Pro: minimizes number of timeouts
  • Con: no extra RTT samples on alternate
• Solution 5: Rtx to Same Destination & Multiple Fast Rtx
A. Caro, P. Amer, J. Iyengar, R. Stewart. Retransmission Policies with Transport Layer Multihoming. UD CIS TR2003-05, March 2003. (submitted to ICON 2003)
Simulation Topology
[Figure: SCTP Sender (A) and SCTP Receiver (B), each with multiple interfaces, joined by a Primary path and an Alternate path through routers (R). Link labels in the figure: 10Mbps 25ms (two links), 100Mbps 10ms (four links), 100Mbps 5-20ms (four links). Cross-traffic nodes are labelled P1, P2, …, P8.]
Methodology
• A→B traffic
  • 4MB file transfer
  • Packet sizes: 100% @ 1500B
• Cross-traffic
  • Self-similar (aggregation of Pareto sources)
  • Packet sizes: 50% @ 40B, 25% @ 576B, 25% @ 1500B
  • Load: 5Mbps – 11Mbps (producing varying loss rates)
• Simulation parameters (60 runs per combo)
  • Cross-traffic on primary destination path
  • Cross-traffic on alternate destination path
  • Retransmission policy (current policy, or 1 of 5 solutions)
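The cross-traffic description can be made concrete with a short sketch. It only illustrates the stated packet-size mix and the heavy-tailed burst lengths that aggregated Pareto sources rely on; the shape parameter 1.5, the function names, and the seed are assumptions, and this is not the simulation script used for the study.

```python
import random

# Packet-size mix from the methodology: (size in bytes, share of packets).
PACKET_MIX = [(40, 0.50), (576, 0.25), (1500, 0.25)]


def mean_packet_size(mix=PACKET_MIX):
    """Average cross-traffic packet size in bytes."""
    return sum(size * share for size, share in mix)


def sample_packet_sizes(n, rng, mix=PACKET_MIX):
    """Draw n packet sizes according to the mix."""
    sizes = [size for size, _ in mix]
    weights = [share for _, share in mix]
    return rng.choices(sizes, weights=weights, k=n)


def pareto_burst_lengths(n, rng, shape=1.5, scale=1.0):
    """Draw n heavy-tailed ON-period lengths (shape 1.5 is an assumed value)."""
    return [scale * rng.paretovariate(shape) for _ in range(n)]


rng = random.Random(1)  # fixed seed so the sketch is repeatable
sizes = sample_packet_sizes(1000, rng)
bursts = pareto_burst_lengths(1000, rng)
```

The mix averages 0.5 × 40 + 0.25 × 576 + 0.25 × 1500 = 539 bytes per packet, so most cross-traffic packets are far smaller than the 1500-byte packets of the file transfer.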
SCTP Failover: Issue 3 - Failure Detection Time
• Current SCTP recommends static parameter settings:
  • RTO (min, max): (1, 60) seconds
  • Path.Max.Retrans: 5 attempts per destination
  • Heartbeat Interval: 30 seconds
Best case failure detection is 1+2+4+8+16+32 = 63 seconds! *
• [Jungmaier 02] improves performance by lowering parameter settings, but
  • their experimental network had
    • fixed delays (i.e., no delay spikes)
    • no cross-traffic (i.e., no congestion)
  • RTO.Min < 1 second goes against the recommendation in [Allman 99]
• We propose to
  • further investigate static parameter settings in a more realistic environment
  • investigate dynamically changing parameters based on
    • path metrics (RTT, loss rate)
    • application requirements (high throughput, low delay, low loss)
*A. Caro, J. Iyengar, P. Amer, G. Heinz, R. Stewart. A Two-level Threshold Recovery Mechanism for SCTP. SCI 2002, July 2002.
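The 63-second figure follows from exponential backoff: with the RTO at its 1-second minimum, a destination is declared failed only after Path.Max.Retrans + 1 = 6 consecutive timeouts, each twice as long as the previous one. A small sketch of that arithmetic, with a hypothetical function name and ignoring heartbeat scheduling delays:

```python
def failure_detection_time(initial_rto, rto_max=60.0, path_max_retrans=5):
    """Seconds until Path.Max.Retrans is exceeded, with RTO doubling per timeout."""
    total = 0.0
    rto = initial_rto
    for _ in range(path_max_retrans + 1):
        total += rto                  # wait for this timeout to expire
        rto = min(rto * 2, rto_max)   # back off, capped at RTO.Max
    return total


best_case = failure_detection_time(1.0)    # RTO starts at RTO.Min
worst_case = failure_detection_time(60.0)  # RTO already at RTO.Max
```

Under the same assumptions, an RTO that has already backed off to the 60-second maximum gives 6 × 60 = 360 seconds.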
Congestion Control Improvement
• Introduce Fast Recovery mechanism
  • Avoids multiple cwnd reductions in a single RTT
  • Similar to New-Reno TCP’s Fast Recovery
• Introduce new policy which restricts cwnd increases during Fast Recovery
  • Maintains conservative behavior
• Modify SCTP’s Fast Retransmit
  • Avoids unnecessary delays of retransmissions
A. Caro, K. Shah, J. Iyengar, P. Amer, R. Stewart. SCTP and TCP Variants: Congestion Control Under Multiple Losses. UD CIS TR2003-04, February 2003. (submitted to ACM CCR)
R. Stewart, L. Ong, I. Arias-Rodriguez, K. Poon, P. Conrad, A. Caro, M. Tuexen. SCTP Implementer’s Guide. draft-ietf-tsvwg-sctpimpguide-08.txt, March 2003.
Drop Scenarios
[Figure: four panels titled One drop, Two drops, Three drops, Four drops]
Scenarios from:
Kevin Fall and Sally Floyd. Simulation-based Comparisons of Tahoe, Reno, and SACK TCP. In ACM Computer Communications Review, 26(3):5-21, July 1996.
SCTP Failover: Issue 4 - No Source Interface Selection
[Figure: Hosts A and B, each with interfaces 1 and 2, connected to the Internet through four ISPs. Sender: Host A, Primary: B1, Alternate: B2]
• Current SCTP
  • transport sender only specifies destination IP address
  • but network layer determines outgoing source IP address/interface
• Why is this a problem?
  • Suppose A’s network layer routes packets to B1 & B2 via A1
SCTP Failover: Issue 4 (cont’d)
• For full multihoming flexibility
  • endpoint’s IP stack should support multiple default routes
  • SCTP should specify the source-destination pair for sending traffic
• Stewart and Lei’s KAME implementation
  • supports experimental options for source interface selection
  • maintains network state per destination
  • varies source address to same destination until destination failure detected
• [Kubo 03] propose a failover scheme that
  • maintains network state per source-destination pair
  • detects failures per source-destination pair
• We propose to further investigate source interface selection
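The difference between per-destination and per-pair failure detection can be sketched as follows. This is an illustration of the idea of keeping an error counter per source-destination pair, not the scheme of [Kubo 03] or the KAME implementation; the class name, the threshold of 5, and the "strictly greater than" rule are assumptions.

```python
class PairFailureDetector:
    def __init__(self, local_addrs, peer_addrs, path_max_retrans=5):
        self.path_max_retrans = path_max_retrans
        # One consecutive-timeout counter per (source, destination) pair.
        self.errors = {(src, dst): 0 for src in local_addrs for dst in peer_addrs}

    def on_timeout(self, src, dst):
        self.errors[(src, dst)] += 1

    def on_response(self, src, dst):
        self.errors[(src, dst)] = 0

    def failed(self, src, dst):
        return self.errors[(src, dst)] > self.path_max_retrans

    def usable_pairs(self):
        """Pairs not yet declared failed, in (source, destination) order."""
        return [pair for pair in self.errors if not self.failed(*pair)]


detector = PairFailureDetector(["A1", "A2"], ["B1", "B2"])
# Hypothetical outage behind interface A1: both pairs that use A1 keep timing out.
for _ in range(6):
    detector.on_timeout("A1", "B1")
    detector.on_timeout("A1", "B2")
```

After the outage, `detector.usable_pairs()` still reports (A2, B1) and (A2, B2). A sender that tracked state only per destination and always left through A1 would instead conclude that both B1 and B2 had failed.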
Plan of Study
Plan of Study (in progress)
• Retransmission policies with multihoming (issue 2)
  • other file sizes
  • other cross traffic types (Exponential aggregate, etc.)
• SCTP vs TCP variants under multiple losses (issue 3)
  • more extensive loss scenarios
• Analytic SCTP model (issues 1 & 3)
  • build on TCP models in [Padhye 98] and [Cardwell 00]
  • use to investigate static failover parameter settings
Plan of Study (future)
• Adaptive failover algorithm (issue 3)
  • dynamically adjust thresholds based on
    • path metrics (RTT, loss rate)
    • app requirements (high throughput, low delay, low loss)
• Probing mechanism (issue 1)
  • investigate use of packet pairs or small packet trains
• Source interface selection (issue 4)
  • evaluate proposed solutions by [Stewart KAME] and [Kubo 03]
  • investigate other possible solutions
• Final failover mechanism evaluation
  • simulation
  • empirical study
Any Questions?