
Self-organized fault-tolerant routing in P2P overlays

Wojciech Galuba, Karl Aberer
EPFL, Switzerland

Zoran Despotovic, Wolfgang Kellerer
Docomo Euro-Labs, Munich, Germany

© 2009 EPFL, Docomo Euro-Labs

What are the P2P overlays?

Underlying blue network (e.g. TCP/IP)
Red peers come and go
Peers form an overlay network (red links)


Routing in P2P overlays

Overlays (usually) have their own address space
Goal: provide point-to-point connectivity
  or rather point-to-service connectivity...

[Figure: a message routed through the overlay from source to destination]


What is the problem?

Failures in large-scale systems are the norm, not the exception

Permanent failures
  well understood
  Overlay maintenance algorithms

Intermittent failures
  Transient network connectivity problems
  Peer overload, resource exhaustion
  Cannot be addressed in the same way as permanent failures


Existing solutions - multipath

Multiple paths
Goal: at least one path reaches the destination
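The multipath goal can be made concrete with a small calculation. The sketch below is illustrative only: it assumes that hops fail independently with the same per-hop success probability and that the paths share no peers, neither of which is stated on the slides. The function names are hypothetical.

```python
def path_success(per_hop_success: float, hops: int) -> float:
    """Probability that a single path delivers the message.

    Assumption (not from the slides): every hop succeeds independently
    with the same probability.
    """
    return per_hop_success ** hops


def multipath_success(per_hop_success: float, hops: int, paths: int) -> float:
    """Probability that at least one of `paths` disjoint paths delivers."""
    single = path_success(per_hop_success, hops)
    return 1.0 - (1.0 - single) ** paths


# Redundancy raises the delivery probability, at the cost of sending
# `paths` copies of every message.
one_path = multipath_success(0.9, hops=5, paths=1)
four_paths = multipath_success(0.9, hops=5, paths=4)
```

Under these assumptions the gain from extra paths is bought with proportionally more messages.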


[Figure: multiple paths from source to destination; marked peers are lossy]

Existing solutions – iterative routing

Source controls the routing process
Successively ask nodes for their neighbors
High redundancy: if one node fails, use others


[Figure: iterative routing from source to destination; marked peers are lossy]

Existing solutions - problems

Heavily rely on message redundancy
  High bandwidth cost

Do not learn from failures
  Likely to repeat the same routing mistakes



Forward feedback protocol (FFP)

[Figure: exchange between Requestor and Provider in three numbered steps; labels: Request, Service, Feedback]

Requestor determines the quality of the provided service
  decision is binary: good or bad

Feedback follows the same path as the request
Feedback is obligatory: no feedback = bad feedback
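The rule "no feedback = bad feedback" implies that a peer remembers which requests it forwarded and applies a timeout. The sketch below shows one way this bookkeeping could look. The class name, method names, and the timeout value are assumptions made for illustration, not part of the protocol as presented.

```python
from typing import Dict, List, Optional


class FeedbackTracker:
    """Tracks forwarded requests that still await feedback (hypothetical helper)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s            # feedback deadline (assumed value)
        self.pending: Dict[str, float] = {}   # request id -> time it was forwarded

    def request_sent(self, request_id: str, now: float) -> None:
        self.pending[request_id] = now

    def feedback_received(self, request_id: str, good: bool) -> Optional[bool]:
        """Return the outcome to record, or None if the request is unknown
        or has already been counted as a timeout."""
        if self.pending.pop(request_id, None) is None:
            return None
        return good

    def expire(self, now: float) -> List[str]:
        """Return the ids whose feedback never arrived; the caller treats
        each of them as bad feedback."""
        expired = [rid for rid, sent in self.pending.items()
                   if now - sent >= self.timeout_s]
        for rid in expired:
            del self.pending[rid]
        return expired


tracker = FeedbackTracker(timeout_s=3.0)
tracker.request_sent("a", now=0.0)
tracker.request_sent("b", now=0.0)
outcome_a = tracker.feedback_received("a", good=True)   # explicit feedback
timed_out = tracker.expire(now=3.0)                     # "b" counts as bad
```

Feedback that arrives after the deadline is ignored here, so each request is counted exactly once.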


A peer on the path

Knows only its overlay neighbors
Based on feedback, learns which neighbors are reliable
Associates a success estimator with each (j, dz) pair:
  j – neighbor address
  dz – destination zone

A success estimator is an exponentially averaged success rate in [0, 1]
  Initially 0.5
  Increased on positive feedback
  Decreased on negative feedback or feedback timeout
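A success estimator of this kind can be sketched as an exponential moving average over binary outcomes. The smoothing weight `alpha` is an assumption, since the slides do not give its value. The class name and the string neighbor addresses are likewise illustrative.

```python
from collections import defaultdict
from typing import DefaultDict, Tuple


class SuccessEstimator:
    """Exponentially averaged success rate, kept in [0, 1]."""

    def __init__(self, alpha: float = 0.1, initial: float = 0.5) -> None:
        self.alpha = alpha      # smoothing weight (assumed, not from the slides)
        self.value = initial    # the slides state an initial value of 0.5

    def update(self, success: bool) -> float:
        """Fold one outcome into the average: 1 for success, 0 for failure."""
        sample = 1.0 if success else 0.0
        self.value = (1.0 - self.alpha) * self.value + self.alpha * sample
        return self.value


# One estimator per (neighbor address j, destination zone dz) pair.
estimators: DefaultDict[Tuple[str, int], SuccessEstimator] = defaultdict(SuccessEstimator)

estimators[("nh1", 2)].update(True)    # positive feedback raises the estimate
estimators[("nh1", 2)].update(False)   # negative feedback or a timeout lowers it
```

Because each update is a convex combination of the old value and 0 or 1, the estimate never leaves [0, 1].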


[Figure: path segment ph, peer, nh]

Next hop selection

Based on the state of the success estimators

Pick the neighbor j whose success estimator currently has the highest value
i.e. maximize the probability of success based on performance history
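The selection rule amounts to an argmax over the estimators for the destination zone of the request. In the minimal sketch below the estimates are plain floats in a dictionary keyed by (j, dz). Treating unseen pairs as 0.5 and breaking ties by neighbor order are assumptions.

```python
from typing import Dict, Sequence, Tuple

INITIAL_ESTIMATE = 0.5  # value used for (j, dz) pairs with no history yet


def select_next_hop(estimates: Dict[Tuple[str, int], float],
                    neighbors: Sequence[str],
                    dz: int) -> str:
    """Return the neighbor with the highest success estimate for zone dz.

    Ties go to the neighbor listed first (an assumption; the slides do
    not say how ties are broken).
    """
    if not neighbors:
        raise ValueError("peer has no neighbors to forward to")
    return max(neighbors, key=lambda j: estimates.get((j, dz), INITIAL_ESTIMATE))


estimates = {("nh1", 2): 0.9, ("nh2", 2): 0.6}
first_choice = select_next_hop(estimates, ["nh1", "nh2"], dz=2)

estimates[("nh1", 2)] = 0.3   # nh1 starts failing and its estimate decays
second_choice = select_next_hop(estimates, ["nh1", "nh2"], dz=2)
```

A purely greedy choice like this never revisits a neighbor whose estimate has dropped, so how a real deployment rediscovers recovered peers is outside this sketch.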



The FFP protocol in action

nh1 has a history of success but starts failing
The peer switches to nh2

[Figure: ph, peer, and next hops nh1 and nh2, annotated with positive (+) and negative (-) feedback]

Cumulative effect

The root cause of the failure receives the most negative feedback

The links to the faulty peer are avoided by its neighbors

[Figure: overlay with the lossy peer marked]

Scalability through dest zoning

O(log N) zones and O(log N) neighbors
Total state at each node: O(log^2 N)
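One way to obtain O(log N) zones whose sizes shrink exponentially is to bucket destinations by the position of the highest set bit of their clockwise distance on a Chord-like identifier ring. The sketch below is an assumption about how such zoning could work, not the scheme confirmed by the slides. In particular, the identifier width `bits` and the direction of zone numbering (zone 0 is the largest and farthest) are illustrative choices.

```python
def destination_zone(self_id: int, dest_id: int, bits: int) -> int:
    """Map a destination to a zone index in 0 .. bits-1.

    Assumed layout on a ring of 2**bits identifiers:
      zone 0 covers clockwise distances [2**(bits-1), 2**bits),
      zone 1 covers [2**(bits-2), 2**(bits-1)), and so on,
    so each zone is half the size of the previous one.
    """
    distance = (dest_id - self_id) % (1 << bits)
    if distance == 0:
        raise ValueError("destination is the peer itself")
    return bits - distance.bit_length()


# On a 4-bit ring (16 identifiers), seen from peer 0:
zones = [destination_zone(0, dest, bits=4) for dest in (8, 4, 2, 1)]
```

With one estimator per (neighbor, zone) pair, a peer that has O(log N) neighbors and O(log N) zones stores O(log^2 N) estimators in total.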


[Figure: destination zones 0, 1, 2, 3; labels: increasing overlay distance to destination, increasing destination zone number, exponentially decreasing zone size]

Evaluation

PlanetLab – a planetary-scale testbed
350 peers
Conditions:
  Median system load: 5.3
  Unpredictable delays and loss are "natural" on PlanetLab

Challenge:
  introduce loss and delays in a Chord-like DHT
  place a tight 3s timeout on service requests
  see if protocols can route around faulty peers

Workload: multi-source, multi-destination



The line-up

BASE – baseline, no fault-tolerance mechanisms
MULTI4 – 4-way multipath routing
ITER4 – Kademlia-based iterative routing, 4 parallel RPCs
FFP – forward feedback protocol



Every 5 mins: a new 10% of peers become delayers

Delayers delay all messages by 100-2000ms


25% of droppers arrive at 300s
Convergence time depends on the traffic pattern



Topology-oblivious routing

Starts with
  all success estimators = 0.5
  Empty routing tables

Learn by trial and error
  Which neighbors are good forwarders for which destinations

Routing tables are entirely emergent
Initially random walks converge to reliable routes



Warmup: initially use the original Chord routing tables
After some time switch to FFP routing tables


Summary

FFP uses 2-5 times less bandwidth than MULTI and ITER
Same or higher fault-tolerance
More suitable for high-rate workloads with fewer source-destination pairs



Benefits of the self-org approach

Decentralized
  scalability

Topology-oblivious
  Applicable to many networks

Agnostic to the causes of failures
  Robust to many failure scenarios
  Even those it was not designed for



FFP used for secure routing in MANETs

Additional crypto to prevent feedback forgery

No PKI!

Tech report: http://tinyurl.com/ffp-manet

