Towards an Internet that “Never Fails”
Hari Balakrishnan, MIT
Joint work with Nick Feamster, Scott Shenker, Mythili Vutukuru
What We Should Aim Toward
• Carrier airlines (2002 FAA Fact Book): 41 accidents in 6.7 million flights (five “nines” availability)
• 911 phone service (1993 NRIC report): 29 minutes of downtime per year per line (four “nines” availability)
• Standard phone service (various sources): 53 minutes of downtime per year per line (four “nines” availability)
• The Internet? One to two “nines”
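The “nines” figures above follow from simple arithmetic on the unavailable fraction. A minimal sketch, assuming a 365-day year and treating the airline accident rate as an unavailability fraction; the function names are illustrative and not from the talk:

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def nines(unavailability: float) -> float:
    """Continuous count of nines: 1e-4 unavailability -> 4.0 nines."""
    return -math.log10(unavailability)


def downtime_budget_minutes(n_nines: int) -> float:
    """Yearly downtime allowed at exactly n nines of availability."""
    return MINUTES_PER_YEAR * 10 ** (-n_nines)


# Figures quoted on the slide, expressed as unavailable fractions.
emergency_911 = 29 / MINUTES_PER_YEAR
standard_phone = 53 / MINUTES_PER_YEAR
airlines = 41 / 6.7e6  # accidents per flight

for name, u in [("911", emergency_911),
                ("phone", standard_phone),
                ("airlines", airlines)]:
    print(f"{name}: {nines(u):.1f} nines")
```

Under these assumptions the three services come out at roughly 4.3, 4.0, and 5.2 nines. Four nines corresponds to a budget of about 52.6 minutes of downtime per year, so the 53-minute phone figure sits essentially at that boundary.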
Example Catastrophic Failures
“…a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint.”
-- news.com, April 25, 1997

“Microsoft's websites were offline for up to 23 hours...because of a [router] misconfiguration…it took nearly a day to determine what was wrong and undo the changes.”
-- wired.com, January 25, 2001

“WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to "a route table issue."”
-- cnn.com, October 3, 2002

“A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration).”
-- dslreports.com, February 23, 2004
NANOG List Failure “Analysis”
[Figure: bar chart of the number of NANOG threads (y-axis, 0 to 90) per failure category (Filtering, Route Leaks, Route Hijacks, Route Instability, Routing Loops, Blackholes) over three periods: 1994-1997, 1998-2001, 2001-2004]
Note: Only includes problems openly discussed on this list.
More than 70% of threads discussing failures related to router configuration or route announcement problems
Faults and Failures
• Fault = underlying defect in a component that causes it to violate a specification
  - Latent or active (i.e., causing errors)
• Unmasked faults (errors) cause failures
  - Failure of a subsystem (spec violation) causes a fault in the system
• Internet faults occur for complex reasons
  - Hardware, software, protocol, design, implementation, and operational faults; any could be triggered by malice
• Internet failure: A cannot communicate with B
Three Directions
• Configuration as programming
  - Defines BGP behavior
  - Tools to cope with routing complexity
• Coping with protocol faults: failure-atomic interdomain routing
  - Prefix-based routing considered harmful
• End-to-end routing
  - Exposing multiple paths to end systems (and stubs)
Today: Reactive Operation
• Problems cause downtime
• Problems often not immediately apparent
What happens if I tweak this policy…?
Coping with Complexity
• View configuration as (distributed) programming
  - Large-scale: over 1M lines of code in some networks
• Programming tools to reduce fault frequency
  - Static analysis can detect many faults [rcc]
  - Sandboxing to overcome current “stimulus-response” reasoning [FR03]
• Centralize configuration platform
  - More “intentional” config specs
  - Push configs to routers
  - Push routes to routers [RCP:F+04]
  - Use static analysis and sandboxing tools
Proactive Operation with rcc
http://nms.csail.mit.edu/rcc
• Represent complex, distributed configuration
• Define a correctness specification
• Map specification to constraints
[Figure: workflow Configure, Detect Faults (rcc), Deploy. Inside rcc: distributed router configurations (single AS) are converted to a normalized representation; the correctness specification is mapped to constraints; checking the constraints yields faults]
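The idea of checking constraints against a normalized representation can be illustrated with a toy example. This is a minimal sketch, not rcc's actual data model or implementation: it assumes a hypothetical representation in which each configured iBGP session is a `(router, neighbor)` pair, and it assumes a full-mesh iBGP design with no route reflectors.

```python
from itertools import combinations


def incomplete_sessions(sessions):
    """Sessions configured on one end only (hypothetical constraint)."""
    return sorted((a, b) for (a, b) in sessions if (b, a) not in sessions)


def missing_full_mesh_pairs(routers, sessions):
    """Router pairs lacking a session configured on both ends.

    Assumes a full iBGP mesh with no route reflectors.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(routers), 2)
        if not ((a, b) in sessions and (b, a) in sessions)
    ]


# Hypothetical normalized configuration for a three-router AS.
routers = {"r1", "r2", "r3"}
sessions = {("r1", "r2"), ("r2", "r1"), ("r1", "r3")}

print(incomplete_sessions(sessions))
print(missing_full_mesh_pairs(routers, sessions))
```

Here `r1` names `r3` as a neighbor but `r3` has no matching statement, so the session is flagged as incomplete, and both the `r1`/`r3` and `r2`/`r3` pairs are flagged as missing from the mesh. Faults of this kind are detectable before deployment, which is the point of the proactive workflow.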
Correctness Specification
Path Visibility: Every destination with a usable path has a route advertisement
  - If there exists a path, then there exists a route
  - Example violation: Signaling partition
Route Validity: Every route advertisement corresponds to a usable path
  - If there exists a route, then there exists a path
  - Example violation: Routing loop
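The two properties are converse implications, so they can be checked as two set differences. A minimal sketch, assuming we are somehow given the set of destinations with a usable path and the set with a route advertisement; the function name and the documentation prefixes used as data are illustrative, not from the talk:

```python
def check_correctness(usable_paths, advertised_routes):
    """Return destinations violating each property.

    Path visibility: path exists  => route exists.
    Route validity:  route exists => path exists.
    """
    return {
        "path_visibility_violations": sorted(usable_paths - advertised_routes),
        "route_validity_violations": sorted(advertised_routes - usable_paths),
    }


# Hypothetical view from one router.
usable_paths = {"192.0.2.0/24", "198.51.100.0/24"}
advertised_routes = {"198.51.100.0/24", "203.0.113.0/24"}

report = check_correctness(usable_paths, advertised_routes)
print(report)
```

In this example `192.0.2.0/24` is reachable but has no route (a visibility violation, as a signaling partition would cause), while `203.0.113.0/24` has a route with no usable path behind it (a validity violation).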
Results: Faults across 17 ASes
[Figure: bar chart of the number of ASes (y-axis, 0 to 10) exhibiting each fault type, grouped under Route Validity and Path Visibility. Fault types: iBGP Signaling Partition, Duplicate Loopback, Incomplete iBGP Session, Inconsistent Export, Inconsistent Import, Transit Between Peers, Undefined Filter, Incomplete Filter]
Every AS had faults, regardless of network size
Most faults can be attributed to distributed configuration
Three Directions
• Configuration as programming
  - Tools to cope with routing complexity
• Coping with protocol faults: failure-atomic interdomain routing
  - Prefix-based routing considered harmful
• End-to-end routing
  - Exposing multiple paths to end systems
Prefixes are too coarse-grained
Validity: If a failure occurs that makes a network unreachable via a given path, then the route corresponding to that path must be withdrawn
70% of intra-AS failures not visible in BGP [FABK03]
…but they are also too fine-grained!
• ~70% of discontiguous prefix pairs from the same AS are announced from the same location
• Allocation explains about 60% of these cases
  - Registries often allocate discontiguous address blocks to a single AS on the same day
• Routes for these prefixes will “flap” together
  - Example: 135.36.0.0/16 (Agere) and 135.12.0.0/14 (Lucent)
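To see why such prefixes stay separate route objects even when they share fate, one can check that they cannot be merged into a single covering prefix. A minimal sketch using Python's standard `ipaddress` module; the helper name `can_aggregate` is illustrative:

```python
import ipaddress


def can_aggregate(a, b):
    """True if the two networks collapse into a single prefix."""
    return len(list(ipaddress.collapse_addresses([a, b]))) == 1


agere = ipaddress.ip_network("135.36.0.0/16")
lucent = ipaddress.ip_network("135.12.0.0/14")

print(agere.overlaps(lucent))        # do the address ranges intersect?
print(can_aggregate(agere, lucent))  # can one prefix cover exactly both?

# Contrast: two adjacent halves of a /24 do collapse into one prefix.
lo_half = ipaddress.ip_network("10.0.0.0/25")
hi_half = ipaddress.ip_network("10.0.0.128/25")
print(can_aggregate(lo_half, hi_half))
```

The two prefixes from the slide neither overlap nor aggregate, so BGP carries them as two independent routes. Prefix structure alone therefore cannot express that they belong to one fate-sharing unit.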
Route objects should correspond to an “atom” of hosts that share fate
Proposal: Atomic Interdomain Protocol (AIP)
• Exterminate prefixes
• Name “atomic domains” (AD) directly
  - Addressing, forwarding and routing on ADs
  - Like current AS numbers, but finer-grained
  - Example: MIT, Microsoft Redmond, one PoP of a large ISP, …
• Flat AD IDs can carry cryptographic meaning
  - Self-certifying (hash of public key)
• End-system addresses have the form [AD : LocalID]
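A self-certifying identifier lets anyone check a claimed public key against the ID without a trusted third party. A minimal sketch of that idea: the talk does not specify a hash function or key format, so SHA-256 and the random bytes standing in for a public key are assumptions, and all function names are hypothetical.

```python
import hashlib
import os


def ad_id_from_public_key(public_key: bytes) -> str:
    """Derive a flat AD ID as the hash of the AD's public key.

    SHA-256 is an assumption; the talk only says "hash of public key".
    """
    return hashlib.sha256(public_key).hexdigest()


def format_address(ad_id: str, local_id: str) -> str:
    """Render an end-system address in the [AD : LocalID] form."""
    return f"[{ad_id} : {local_id}]"


def verify_ad(ad_id: str, claimed_public_key: bytes) -> bool:
    """Check that a claimed key really belongs to the named AD."""
    return ad_id_from_public_key(claimed_public_key) == ad_id


# Random bytes stand in for real public keys in this sketch.
real_key = os.urandom(32)
forged_key = os.urandom(32)

ad = ad_id_from_public_key(real_key)
print(format_address(ad, "host-17"))
print(verify_ad(ad, real_key), verify_ad(ad, forged_key))
```

Because the ID is derived from the key, an attacker who announces someone else's AD cannot produce a matching key, which is what gives the flat identifier its cryptographic meaning.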
Summary
It’s worth shooting for a two or three order-of-magnitude improvement in Internet availability
It’s possible to get four or five nines of Internet availability, if we:
  - Develop tools to cope with configuration complexity
  - Develop a failure-atomic routing system
  - Expose multiple IP-layer paths to higher layers