failover and takeover contingency mechanisms for network partition and node failure
DESCRIPTION
Proper definition of suitable mechanisms to cope with network partition and to recover from node failure are among the most common problems when designing and implementing a fault-tolerant distributed system. The concern is even more serious when the different scenarios could not be predicted beforehand and are detected once the system is at deployment stage. There are a number of decisions that can be made when choosing the right contingency mechanisms to deal with these distribution-bounded problems. The factors that must be taken into account include not only the technology in use, the node layout, the message protocol and the properties of the messages to be exchanged, certain desired/demanded features such as latency, bandwidth,... but also the communications network reliability, and even the hardware where the system is running on. In this paper we present ADVERTISE, a distributed system for advertisement transmission to on-customer-home set-top boxes (STBs) over a Digital TV network (iDTV) of a cable operator. We use this system as a case study to explain how we addressed the aforementioned problems, and present a set of good practices that can be extrapolated to comparable systems.TRANSCRIPT
![Page 1: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/1.jpg)
Failover and Takeover Contingency Mechanismsfor Network Partition and Node Failure
Macías López, Laura M. Castro, David Cabrero
MADS Research Group – Universidade da Coruña (Spain)
Erlang WorkshopCopenhaguen, 14th September 2012
Erlang Workshop (2012) Fail/Takeover Mechanisms 1 / 25
![Page 2: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/2.jpg)
Why are we (all) here?
Erlang Workshop (2012) Fail/Takeover Mechanisms 2 / 25
![Page 3: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/3.jpg)
Why are we (all) here?
Erlang Workshop (2012) Fail/Takeover Mechanisms 3 / 25
![Page 4: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/4.jpg)
Why are we (all) here?
Erlang Workshop (2012) Fail/Takeover Mechanisms 3 / 25
![Page 5: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/5.jpg)
Why are we (all) here?
Erlang Workshop (2012) Fail/Takeover Mechanisms 3 / 25
![Page 6: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/6.jpg)
Why are we (all) here?
Erlang Workshop (2012) Fail/Takeover Mechanisms 4 / 25
![Page 7: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/7.jpg)
Why are we (all) here?
Erlang Workshop (2012) Fail/Takeover Mechanisms 4 / 25
![Page 8: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/8.jpg)
Why are we (presenting this work) here?
concurrency!
high-availability!
distribution!
Erlang Workshop (2012) Fail/Takeover Mechanisms 5 / 25
![Page 9: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/9.jpg)
Why are we (presenting this work) here?
Unexpected problemsafter deployment!
node failures!system failure!
Erlang Workshop (2012) Fail/Takeover Mechanisms 6 / 25
![Page 10: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/10.jpg)
Why are we (presenting this work) here?
Erlang Workshop (2012) Fail/Takeover Mechanisms 7 / 25
![Page 11: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/11.jpg)
Why are we (presenting this work) here?
Erlang Workshop (2012) Fail/Takeover Mechanisms 7 / 25
![Page 12: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/12.jpg)
Outline
1 The system
2 The problems at deployment
3 The solution
4 Final remarks
Erlang Workshop (2012) Fail/Takeover Mechanisms 8 / 25
![Page 13: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/13.jpg)
The systemADVERTISE
Distributed system for advertisement transmission to on-customer-homeset-top boxes (STBs) over a Digital TV network (iDTV) of a cable operator
Erlang Workshop (2012) Fail/Takeover Mechanisms 9 / 25
![Page 14: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/14.jpg)
The systemADVERTISE’s requirements
ensure the appropriate coordination of advertising mechanisms:
I compilation of events
I emission of advertising signals to STBs during a period of time
I recording hits (displays) of a specific piece of advertisement
Major challengeManagement of the size of the communications network:
growing number of operator’s customers (∼ 100.000)
Erlang Workshop (2012) Fail/Takeover Mechanisms 10 / 25
![Page 15: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/15.jpg)
The systemADVERTISE’s architecture
Erlang Workshop (2012) Fail/Takeover Mechanisms 11 / 25
![Page 16: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/16.jpg)
The systemADVERTISE’s architecture
Erlang Workshop (2012) Fail/Takeover Mechanisms 11 / 25
![Page 17: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/17.jpg)
The systemADVERTISE’s architecture
Erlang Workshop (2012) Fail/Takeover Mechanisms 11 / 25
![Page 18: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/18.jpg)
The systemADVERTISE’s architecture
Erlang Workshop (2012) Fail/Takeover Mechanisms 11 / 25
![Page 19: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/19.jpg)
The systemADVERTISE’s architecture
Erlang Workshop (2012) Fail/Takeover Mechanisms 11 / 25
![Page 20: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/20.jpg)
The systemADVERTISE’s architecture
Erlang Workshop (2012) Fail/Takeover Mechanisms 11 / 25
![Page 21: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/21.jpg)
The systemADVERTISE’s structure
Erlang Workshop (2012) Fail/Takeover Mechanisms 12 / 25
![Page 22: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/22.jpg)
The systemADVERTISE as Erlang Distributed Application
To meet its requirements, ADVERTISE was designed
as a distributed application over several nodes
Erlang Workshop (2012) Fail/Takeover Mechanisms 13 / 25
![Page 23: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/23.jpg)
The systemADVERTISE as Erlang Distributed Application
To meet its requirements, ADVERTISE was designed
as a distributed application over several nodes
Erlang Workshop (2012) Fail/Takeover Mechanisms 13 / 25
![Page 24: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/24.jpg)
The problems at deploymentThe symptoms
ADVERTISE deployment environment
presented some particularities that had not been foreseen:
some nodes showed a tendency to fail more often than others
network partition was common during some time periods (noon,
night)
In this situation. . .Fault tolerance requirements were not met!
Erlang Workshop (2012) Fail/Takeover Mechanisms 14 / 25
![Page 25: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/25.jpg)
The problems at deploymentThe diagnosis
ADVERTISE was developed and tested over several physical machines
running on a single physical machine
using a shared hard disk
sharing the network link
sharing with other apps/VMs
Frequent saturation of shared resources was perceived by ADVERTISEnodes as short network partitions.
Erlang Workshop (2012) Fail/Takeover Mechanisms 15 / 25
![Page 26: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/26.jpg)
The problems at deploymentThe diagnosis
ADVERTISE was deployed over several virtual machines
running on a single physical machine
using a shared hard disk
sharing the network link
sharing with other apps/VMs
Frequent saturation of shared resources was perceived by ADVERTISEnodes as short network partitions.
Erlang Workshop (2012) Fail/Takeover Mechanisms 15 / 25
![Page 27: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/27.jpg)
The problems at deploymentThe diagnosis
ADVERTISE was deployed over several virtual machines
running on a single physical machine
using a shared hard disk
sharing the network link
sharing with other apps/VMs
Frequent saturation of shared resources was perceived by ADVERTISEnodes as short network partitions.
Erlang Workshop (2012) Fail/Takeover Mechanisms 15 / 25
![Page 28: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/28.jpg)
The problems at deploymentThe consequences
If nodes lose connectivity, believe that all the others are down and assume
system functions, there are likely to be inconsistencies when connectivity
is restored (duplicated responsibilities, data inconsistencies).
Perceived network partitions led to cascade failoversDuplicated registration of global names, random killing of conflicting
processes, overflow and eventual stop of the supervision mechanisms.
Erlang Workshop (2012) Fail/Takeover Mechanisms 16 / 25
![Page 29: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/29.jpg)
The solution
For ADVERTISE, data consistency was more important
than availability:
system could not afford that advertising campaigns, rules,
or media were lost or became inconsistent
instead, it was acceptable that no ads were sent to STBs
(or that they were delayed)
The solutionWe re-designed ADVERTISE to be deployed over a minimum of 3 nodes,
and never on an isolated node
Erlang Workshop (2012) Fail/Takeover Mechanisms 17 / 25
![Page 30: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/30.jpg)
The solutionADVERTISE initialisation
Erlang Workshop (2012) Fail/Takeover Mechanisms 18 / 25
![Page 31: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/31.jpg)
The solutionADVERTISE initialisation
Erlang Workshop (2012) Fail/Takeover Mechanisms 18 / 25
![Page 32: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/32.jpg)
The solutionADVERTISE initialisation
Erlang Workshop (2012) Fail/Takeover Mechanisms 18 / 25
![Page 33: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/33.jpg)
The solutionADVERTISE initialisation
Erlang Workshop (2012) Fail/Takeover Mechanisms 18 / 25
![Page 34: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/34.jpg)
The solutionADVERTISE initialisation
Erlang Workshop (2012) Fail/Takeover Mechanisms 18 / 25
![Page 35: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/35.jpg)
The solutionADVERTISE initialisation
Erlang Workshop (2012) Fail/Takeover Mechanisms 18 / 25
![Page 36: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/36.jpg)
The solutionADVERTISE boot
Erlang Workshop (2012) Fail/Takeover Mechanisms 19 / 25
![Page 37: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/37.jpg)
The solutionADVERTISE boot
Erlang Workshop (2012) Fail/Takeover Mechanisms 19 / 25
![Page 38: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/38.jpg)
The solutionADVERTISE boot
Erlang Workshop (2012) Fail/Takeover Mechanisms 19 / 25
![Page 39: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/39.jpg)
The solutionADVERTISE boot
Erlang Workshop (2012) Fail/Takeover Mechanisms 19 / 25
![Page 40: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/40.jpg)
The solutionNode integrity check
1 Retrieve the last known population of active nodes Listactives
2 Retrieve the list of all ADVERTISE nodes from the configuration Listall
3 Filter Listall removing ping-unreachable nodes
4 If
(filtered(Listall) 6= Listactives) ∧ (|Listactives| = 1)
ADVERTISE is suspended immediately,
and node is rebooted once connectivity is restored
Erlang Workshop (2012) Fail/Takeover Mechanisms 20 / 25
![Page 41: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/41.jpg)
The solutionDistributed AC check
1 DAC is queried on all nodes, to get PID of ADVERTISE local sup
2 If ∃n ∈ Listall for which ADVERTISE local sup PID could not be
retrieved, node failure is assumed
1 If n ∈ Listactives it means it replies to ping from the global supervisor but
cannot reach others; after a timeout
1 If n /∈ Listactives node failure is confirmed
2 If n ∈ Listactives node is up and we reboot it
Erlang Workshop (2012) Fail/Takeover Mechanisms 21 / 25
![Page 42: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/42.jpg)
The solutionCurrent ADVERTISE deployment
Cluster of 3 virtual nodes, handles an average of 18K STBs per node
with peaks of 23K STBs during prime time
Our tests reached a maximum of 45K STBs per node
System running with no incidents reported in the last 4 months
Most intensive advertising campaign was a 2-month promotion:
displayed over 66 million times, with a peak of 140K times in 1 hour
Average campaign can be displayed a total of 500K, with peaks of up
to 30K in 1 hour during prime time Saturday night
Erlang Workshop (2012) Fail/Takeover Mechanisms 22 / 25
![Page 43: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/43.jpg)
Final remarksLessons learned
When designing a distributed Erlang app, one must take into account:
Network reliability
Latency of requests
Bandwidth
Network security
Network topology
Heterogeneity of components
Scalability
Erlang Workshop (2012) Fail/Takeover Mechanisms 23 / 25
![Page 44: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/44.jpg)
Final remarksLessons learned
When designing a distributed Erlang app, one must take into account:
Network reliability
Latency of requests
Bandwidth
Network security
Network topology
Heterogeneity of components
Scalability
Erlang Workshop (2012) Fail/Takeover Mechanisms 23 / 25
![Page 45: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/45.jpg)
Final remarksYour mileage may vary!
Had ADVERTISE requirements been substantially different
we would probably have favoured
availability over consistency, for instance
And that would be a different story. . .
Erlang Workshop (2012) Fail/Takeover Mechanisms 24 / 25
![Page 46: Failover and takeover contingency mechanisms for network partition and node failure](https://reader038.vdocuments.us/reader038/viewer/2022103014/54b436394a79594f728b459d/html5/thumbnails/46.jpg)
Questions?
Audience ! thanks
Some images and icons were downloaded from: openclipart.org
Erlang Workshop (2012) Fail/Takeover Mechanisms 25 / 25