partition-tolerant distributed publish/subscribe systems
DESCRIPTION
Reza Sherafat Kazemzadeh * Hans-Arno Jacobsen University of Toronto IEEE SRDS October 6, 2011. Partition-Tolerant Distributed Publish/Subscribe Systems. Content-Based Publish/Subscribe. NY. London. P. P. Publish. P. Toronto. Pub/Sub. S. S. S. S. S. P. S. sub = [STOCK=IBM]. - PowerPoint PPT PresentationTRANSCRIPT
Partition-Tolerant Distributed Publish/Subscribe Systems
Reza Sherafat Kazemzadeh *Hans-Arno JacobsenUniversity of Toronto
IEEE SRDSOctober 6, 2011
msrg.orgSRDS 2011 2
London
Toronto
Trader 1
Content-Based Publish/Subscribe
Pub/Sub
S
SS SS
S
S
PPublish PP
P
sub = [STOCK=IBM]sub= [CHANGE>-8%]
NY
Trader 2Stock quote dissemination
msrg.orgSRDS 2011 3
Fault-tolerance (against concurrent failures):
Broker crashes Link failures Recoveries
Reliability: Publications match subscriptions Per-source in-order delivery After some point in time Exactly-once
delivery(no loss, no duplicates)
Assumptions: Clients are light-weight (broker
network is responsible for reliability) A time t after which the system
provides guaranteed delivery
P/S
GoalsPub
Sub Sub
Reliabledelivery
msrg.orgSRDS 2011 4
System Architecture
Tree dissemination networks: One path from source to destination Pros:
Simple, loop-free Preserves publication order
(difficult for non-tree content-based P/S) Cons:
Trees are highly susceptible to failures
Primary tree: Initial spanning tree that is formedas brokers join the system
Maintain neighborhood knowledge Allows brokers to reconfigure overlay
after failures on the fly
∆-Neighborhood knowledge: ∆ is configuration parameterensures handling ∆-1 concurrent failures (worst case)
Knowledge of other brokers within distance ∆ Join algorithm
Knowledge of routing paths within neighborhood Subscription propagation algorithm
3-neighborhood
2-neighborhood
1-neighborhood
msrg.orgSRDS 2011 5
Subscription PropagationPublication ForwardingBroker/Link RecoveryOverlay Management
Single chain
Overview of the Approach
SRDS 2011 6
Overlay Management Alg.Maintains end-to-end connectivity despite failures in the overlay.
msrg.orgSRDS 2011 7
When primary tree is setup, brokers communicate with their immediate neighbors in the primary tree through FIFO links.
Overlay partitions: Broker crash or link failures creates “partitions” and some neighbor brokers “on the partition” become unreachable from neighboring brokers
Active connections: At each point they try to maintain a connection to its closest neighbor in the primary tree. Only active connections are used by brokers
Overlay Partitions
ABCDEF SP D
pid1=<C, {D}>
Partition detectorBrokers on the partitionBrokers beyond
the partitionBrokers onthe partition
Active connection to E
?
x
msrg.orgSRDS 2011 8
What if there are more failures, particularly adjacent failures?
If ∆ is large enough the same process can be used for larger partitions.
Overlay Partitions – 2 Adjacent Failures
ABCDEF SP D
pid1=<C, {D}>Brokers beyondthe partition
Brokers onthe partition
E
+ pid2=<C, {D, E}>
Active connection to F
msrg.orgSRDS 2011 9
Worst case scenario: ∆-neighborhood knowledge is not sufficient to reconnect the overlay.
Brokers “on” and “beyond” the partition are unreachable.
No new active connection
Overlay Partitions - ∆ Adjacent Failures
ABCDEF SP D
pid1=<C, {D}>Brokers beyondthe partition
Brokers onthe partition
E
pid2=<C, {D, E}>
F
+ pid3=<C, {D, E, F}>
SRDS 2011 11
Subscription Propagation Alg.How correct routing tables are maintained despite overlay partitions?
msrg.orgSRDS 2011 12
Subscription Propagation Algorithm
Establishes end-to-end routing state among brokers while taking into account overlay partitions.
Subscriptions are dynamically inserted by subscribers and are propagated along branches of primary tree over active connections. Primary tree is the “basis” of constructing end-to-end forwarding paths.
Each subscription contains:
SUB = <Id, Predicates, Anchor>
Predicates specifies subscriber’s interest, e.g., [STOCK=“IBM”] Anchor is a reference to brokers along the propagation path of the
subscription
msrg.orgSRDS 2011 13
Subscription Propagation in Absence of Overlay Partitions Subscription anchor field is updated to a broker point up to ∆
hops closer to subscriber
Accepting a subscription is to add it into routing tables Only after confirmations are received, a subscription is accepted
(i.e., will be used for matching) Observation: Matching publications are delivered to a subscriber
once its local broker accepts subscription
ABCDEP S
Subscriptions
Confirmations
ssssss
☑conf
s.anchor
☑conf ☑conf ☑conf ☑conf☑conf
∆ hops ∆ hops
☑
msrg.orgSRDS 2011 14
Subscription Propagation in Presence of overlay Partitions
Broker B sends s via its active link to bypass the partition and awaits receipt of the corresponding confirmation
Once B receives confirmation and accepts s, it tags the confirmation with pid of partitions that s bypassed.
Brokers relay this tag in their confirmation messages towards the subscriber’s local broker which accepts and stores s tags along with the tag in its routing table.
ABCDEP S
Confirmations
Subscriptions
CD Bs
☑conf
s
s
s
☑
conf
☑conf* ☑conf* ☑* pid tag is alsostored alongwith s* Tag conf with pid
☑
SRDS 2011 15
Publication Forwarding Alg.How accepted subscriptions and their partition tags are used to achieve reliable delivery?
msrg.orgSRDS 2011 16
Publication Forwarding in Absence of Overlay Partitions Forwarding only uses subscriptions accepted brokers.
Steps in forwarding of publication p: Identify anchor of accepted subscriptions that match p Determine active connections towards matching subscriptions’
anchors Send p on those active connections and wait for confirmations If there are local matching subscribers, deliver to them If no downstream matching subscriber exists, issue confirmation
towards P Once confirmations arrive, discard p and send a conf towards P
PublicationsABCDEP S
Subscriptions
☑
p
☑ ☑ ☑ ☑ ☑ ☑CE
p p p p pDeliver to local
subscribersconfconfconfconfconfconf
p
msrg.orgSRDS 2011 17
Publication Forwarding in Presence of Overlay Partitions Key forwarding invariant to ensure reliability: we ensure
that no stream of publications are delivered to a subscriber after being forwarded by brokers that have not accepted its subscription.
Case1: Sub s has been accepted with no pid. It is safe to bypass intermediate brokers
conf
conf
conf
Publications
ABCDEP S
Subscriptionsp
C BD☑ ☑ ☑ ☑ ☑ ☑ ☑
p pDeliver to local
subscribersconf
p
msrg.orgSRDS 2011 19
Case2: Sub s has been accepted with some pid.
Case 2a: Publisher’s local broker has accepted s and we ensure all intermediate forwarding brokers have also done so:
It is safe to deliver publications from sources beyond the partition.
Publication Forwarding (cont’d)
conf
conf
conf
Publications
ABCDEP S
Subscriptionsp
C BD☑ ☑ ☑ ☑ ☑*
p p
Depending on when this link has been establishedeither recovery or subscription propagation ensure
C accepts s prior to receiving p
conf
p
msrg.orgSRDS 2011 20
Publication Forwarding (cont’d) Case2: Subscription s is accepted with some pid tags.
Case 2b: Publisher’s broker has not accepted s:
It is unsafe to deliver publications from this publisher (invariant).
Publications
ABCDEP S
Subscriptionsp
☑ ☑*
p p*
s was acceptedat S with the same pid tag
p p
p
Tag with pid
SRDS 2011 21
EvaluationUsing a mix of simulation and experimental deployments on large-scale testbed.
msrg.orgSRDS 2011 22
Simulation Results
SIZE OF BROKERS’ NEIGHBORHOODS AS A FUNCTION OF ∆
∆=4∆=3
∆=1∆=2
Size of ∆-neighborhoods
Network size of 1000 Broker fanout of 3
∆=1∆=2∆=3∆=4
msrg.orgSRDS 2011 23
Impact of Failures on End-to-End Broker Reachability
Using a graph simulation tool.
Overlay setup:▪ Network size 1000 Brokers with
fanout=3 Failure injection:▪ Failures: up to 100 brokers▪ We randomly marked a given
number of nodes as failed Measurements:▪ We counted the number of end-
to-end brokers whose intermediate primary tree path contains ∆ consecutive failed brokers in a chain.
∆=3∆=4
∆=2∆=1
∆=1
∆=4
msrg.orgSRDS 2011 24
Experimental Deployments:Impact of Failures on Pub Delivery
500 brokers deployed on 8-core machines in a cluster: Network setup: Overlay fanout=3. We measured aggregate pub. delivery count in an interval of 120s Expected bar is number of publications that must be delivered
despite failures (this excludes traffic to/from failed brokers).
∆=1
∆=3∆=2
∆=4
Expected
∆=4∆=3
∆=1
∆=1
msrg.orgSRDS 2011 25
Conclusions We developed a reliable P/S system that tolerate
concurrent broker and link failures:
Configuration parameter ∆ determines level of resiliency against failures (in the worst case).
Dissemination trees augmented with neighborhood knowledge.
Neighborhood knowledge allows brokers to maintain network connectivity and make forwarding decision despite failures.
We studied the performance of the system when numberof failures far exceeds ∆:
A small value for ∆ ensures good connectivity.
SRDS 2011 26
Questions…Thanks for your attention!
msrg.orgSRDS 2011 27
Challenges Why “end-to-end” principle does not work?
Publishers and subscribers are decoupled andunaware of each other.
Routing paths are established by dynamicallyinserted subscriptions Subscription propagation is also subject to
broker/link failure.
Selective delivery makes in-order deliveryover redundant path difficult Subscribers are only interested in a subset of
what is published.
Responsibility on P/S
messaging system
Subscription propagation algorithm
We use a special form
of tree dissemination
msrg.orgSRDS'09 28
A copy is first preserved on disk
Intermediate hops send an ACK to previous hop after preserving
ACKed copies can be dismissed from disk
Upon failures, unacknowledged copies survive failure and are re-transmitted after recovery This ensures reliable delivery but may cause delays while the machine is
down
Store-and-ForwardP P PPFrom
hereTohere
ackackack
msrg.orgSRDS'09 29
Use a mesh network to concurrently forward msgs on disjoint paths
Upon failures, the msg is delivered using alternative routes
Pros: Minimal impact on delivery delay
Cons: Imposes additional traffic & possibility of duplicate delivery
Mesh-Based Overlay Networks [Snoeren, et al., SOSP 2001]
PPPP
Fromhere
Tohere
msrg.orgSRDS'09 30
Replicas are grouped into virtual nodes Replicas have identical routing information
Replica-based Approach [Bhola , et al., DSN 2002]
PhysicalMachines
Virtual node
msrg.orgSRDS'09 31
Replicas are grouped into virtual nodes Replicas have identical routing information
We compare against this approach
PP
PP
P
P
Virtual node
Replica-based Approach[Bhola , et al., DSN 2002]
msrg.orgSRDS 2011 32
Publication Forwarding (cont’d) Case2: Sub s has been accepted with
some pid.
Case 2b (Partition barrier): Publisher’s broker has also not accepted sPublications
ABCDEP S
Subscriptions
p1
☑ ☑*
p1 p1*
s was acceptedat S with the same pid tag
p1 p1
p1
p1*
Tag with pid
p2 & p1
matches r & smatches s
☑r ☑r ☑r ☑r ☑r ☑r ☑ ☑rR
msrg.orgSRDS 2011 33
Subscription Propagation with Partitions Partition islands:
Simply confirm (and accept) subscriptions over available If partition brokers are reachable from
the other side of the partition
Intuition: Publications from P may
only be lost if they arrive at B But this will not happen since
there is no link towards B from F Correctness proof argues on the precedence of acceptance and
creation of links
ABCDEP SP BC
Leadbroker
A
Confirmations
Subscriptions
☑ ☑ ☑ ☑
Will acceptduring recovery
☑
msrg.orgSRDS 2011 34
Subscription Propagation with Partition Barriers If a portion of the network that includes publishers is
on/beyond a partition barrier, there is no way to communicate the subscription information for the duration of failures
Lead broker “partially confirms” the subscription and tags the confirmation with the partition information Accepting brokers store the partition information along
with the subscription This ensures liveness
ABCEFG AP SCD
Leadbroker
B
Forward
Partialconf
☑* ☑*Δ hops
msrg.orgSRDS 2011 35
Publication Forwarding Only accepted subscriptions are stored in
SRT and used for matching
At each point in time, a broker has a number of connections to its nearest reachable neighbors This set of active connections may change
over time
Publication forwarding steps:1. Store publication in a FIFO internal
message queue2. Match and compute set of {from} for
subscriptions that match3. For each partially confirmed
subscription, tag the publication with the partition information
4. Send the publication to the closest reachable neighbors towards {from}
5. Once all confirmations arrive, discard publication and issue confirmation towards publisher
queuePPPP
SS
S
P
(δ+1)-neighborhood
A
msrg.orgSRDS 2011 36
Network size of 1000 Broker fanout of 3
Network size of 1000 Broker fanout of 7
Evaluations
SIZE OF BROKERS’ NEIGHBORHOODS AS A FUNCTION OF ∆
∆=4∆=3
∆=1∆=2
Size of ∆-neighborhoods
∆=4∆=3∆=2∆=1
Size of ∆-neighborhoods
msrg.orgSRDS 2011 37
Overlay Links Management Sessions: FIFO
communication links between brokers.
Active sessions: Broker A’s session to B is active if A has no session to another broker C on the path between A and B.
∆ = 2
Primary tree
msrg.orgSRDS 2011 38
Agenda Challenges of reliability and fault-tolerance in P/S
Our approach Topology neighborhood knowledge Subscription propagation Publication forwarding Recovery procedure
Evaluation results
Conclusions