1
Software Fault Tolerance (SWFT)
Fault-Tolerance in Mobile Networks
Dependable Embedded Systems & SW Group www.deeds.informatik.tu-darmstadt.de
Prof. Neeraj Suri
Brahim Ayari
Dept. of Computer Science
TU Darmstadt, Germany
2
Mobile Wireless Networks
Can be classified into two major categories: cellular networks (also known as infrastructured networks) and ad-hoc networks
Cellular networks have fixed and wired gateways called base stations or mobile support stations responsible for routing messages
Ad-hoc networks have no fixed infrastructure and all nodes are capable of movement, which determines the network connectivity
Ad-hoc nodes communicate directly only with the nodes that are immediately within their transmission range
To communicate with other nodes, intermediate nodes forward the packets from the source towards the destination
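The multi-hop forwarding described above can be sketched as follows. This is a minimal illustration, not a routing protocol: it assumes every node has the same circular transmission range, and the names `neighbors`, `multi_hop_path`, `positions` and `tx_range` are hypothetical.

```python
from collections import deque
from math import dist

def neighbors(positions, node, tx_range):
    """Nodes that `node` can reach directly, i.e. those within transmission range."""
    return [other for other in positions
            if other != node and dist(positions[node], positions[other]) <= tx_range]

def multi_hop_path(positions, source, destination, tx_range):
    """Breadth-first search for a path with the fewest hops; None if unreachable."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == destination:
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for nxt in neighbors(positions, current, tx_range):
            if nxt not in parent:
                parent[nxt] = current
                queue.append(nxt)
    return None

# A and C are out of range of each other, so B acts as the intermediate node.
positions = {"A": (0, 0), "B": (4, 0), "C": (8, 0), "D": (30, 0)}
path_ac = multi_hop_path(positions, "A", "C", tx_range=5)
path_ad = multi_hop_path(positions, "A", "D", tx_range=5)
```

Because connectivity is derived from the node positions, moving a node changes which paths exist, which is the sense in which movement determines the network connectivity.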
3
Wireless Communication
Much more difficult to achieve than wired communication
  The surrounding environment interacts with the signal, blocking signal paths and introducing noise and echoes
Wireless connections are of lower quality than wired connections
  Lower bandwidths: 9-14 Kbps for cellular telephony, 1-54 Mbps for radio communication (WLAN); the bandwidth available per user also depends on the number of users communicating in that area
  Higher error rates
  More frequent spurious disconnections
4
Mobile Device Limitations
The implications of portability for mobile devices are small size and weight, and dependence on battery power
Small size and weight imply
  Restricted memory size
  Small storage capacity
  Limited user interface (both data entry and data display)
  Limited processing capabilities
5
Cellular Networks
[Figure: cellular network overview. Labels: Heterogeneity (wireless networks, mobile nodes); Perturbations (disconnections, node/communication failures); Application-dependent solutions; Wired Network; WLAN, UMTS, GPRS]
6
Cellular Networks
Most proposed solutions rely on base stations (BS) to add fault-tolerance to cellular networks
Mobile devices are exposed to physical damage and theft. They can also be lost at any time
Most fault-tolerance techniques rely on stable storage
Stable storage can be seen as an ideal storage medium that, given a set of failure assumptions, protects user data from corruption or loss
Stable storage should guarantee the atomicity of the write operation
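One common way to approximate an atomic write on an ordinary file system is to write a temporary file and rename it over the target. The sketch below assumes a POSIX-style file system and a JSON-serializable state; `atomic_write` and `read_state` are hypothetical names. It only covers a crash during the write, not the loss of the device itself, which is why the slides place stable storage at the base station.

```python
import json
import os
import tempfile

def atomic_write(path, state):
    """Write `state` so that `path` holds either the old or the new version."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())   # push the data to the device before renaming
        os.replace(tmp_path, path)   # the rename replaces the old file in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)      # never leave a half-written temporary file behind
        raise

def read_state(path):
    """Load the last completely written state."""
    with open(path) as f:
        return json.load(f)
```

A reader that opens the file at any time sees a complete checkpoint, because the partially written data only ever exists under the temporary name.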
7
Cellular Networks: Recovery Strategies
A recovery strategy essentially has two components: a state-saving strategy and a handoff strategy
State-saving techniques are based on traditional checkpointing and message logging strategies
The host periodically saves its state to stable storage and, upon a failure, execution can be restarted from the last-saved checkpoint
Handoff is used to achieve continuous service while a mobile device moves from one cell to another
The handoff process is initiated either by crossing a cell boundary or by a deterioration in the quality of the wireless link in the current cell
8
State Saving
The state of the process can get altered, either upon receipt of a message, or upon a user input. Messages or inputs which modify the state are called write events
No Logging Approach (N)
  The state is saved at the base station upon every write event
  Upon a failure, the last-saved state in the base station is loaded
  Need for frequent transmission of state in the wireless link
Logging Approach (L)
  The state is checkpointed periodically at the base station
  The write events between two checkpoints are logged
  After a failure, the mobile device loads the last checkpoint and the log of write events
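The two state-saving approaches can be contrasted with a small sketch. It rests on several assumptions that are not in the slides: the state is a dictionary, a write event is a `(key, value)` pair, the base station is an in-memory object, and "periodically" is modelled as "every `interval` write events". All class and function names are hypothetical.

```python
class BaseStation:
    """In-memory stand-in for the stable storage at the base station."""
    def __init__(self):
        self.checkpoint = {}
        self.log = []

def apply_event(state, event):
    """A write event modifies the state."""
    key, value = event
    state[key] = value

class NoLoggingHost:
    """Approach N: send the full state to the base station on every write event."""
    def __init__(self, bs):
        self.bs, self.state = bs, {}

    def write(self, event):
        apply_event(self.state, event)
        self.bs.checkpoint = dict(self.state)

    def recover(self):
        self.state = dict(self.bs.checkpoint)

class LoggingHost:
    """Approach L: checkpoint every `interval` events, log the events in between."""
    def __init__(self, bs, interval):
        self.bs, self.state = bs, {}
        self.interval, self.since_checkpoint = interval, 0

    def write(self, event):
        apply_event(self.state, event)
        self.since_checkpoint += 1
        if self.since_checkpoint == self.interval:
            self.bs.checkpoint = dict(self.state)  # new checkpoint replaces the old one
            self.bs.log = []                       # the log restarts after a checkpoint
            self.since_checkpoint = 0
        else:
            self.bs.log.append(event)              # only the event crosses the wireless link

    def recover(self):
        self.state = dict(self.bs.checkpoint)
        for event in self.bs.log:                  # replay events since the checkpoint
            apply_event(self.state, event)
```

With approach N every write event costs a full state transfer, whereas approach L usually sends only the event itself and pays for that with a replay step during recovery.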
9
Handoff
Pessimistic Strategy (P)
  The checkpoint (and the logs in case of Logging) is transferred to the new cell's BS during handoff
  The new BS sends an ack to the old BS so that the old BS can purge its copy of the checkpoint
  The disadvantage is the large volume of data transferred during each handoff
Lazy Strategy (L)
  No transfer of checkpoints and logs during handoff
  A linked list of the BSs of the cells visited is created
  In case of No Logging, the checkpoint is saved at the current cell's BS after each write event
  In case of Logging, a log of the last write events is maintained, in addition to the last checkpoint
  If a new checkpoint is taken at one BS, the old checkpoint and logs are deleted from the old BSs along the linked list
  Saves network overhead but recovery is more complicated
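The difference between the pessimistic and the lazy strategy can be sketched as below, assuming the Logging approach. The `BaseStation` fields and the function names are hypothetical, and the ack is reduced to a comment; the sketch only tracks where the recovery data ends up.

```python
class BaseStation:
    def __init__(self, name):
        self.name = name
        self.checkpoint = None   # last checkpoint stored here, if any
        self.log = []            # write events logged while the host was in this cell
        self.prev = None         # previous BS on the host's path (lazy strategy only)

def pessimistic_handoff(old_bs, new_bs):
    """Move checkpoint and log to the new BS, then purge the old copy."""
    new_bs.checkpoint, new_bs.log = old_bs.checkpoint, list(old_bs.log)
    # In the real exchange the purge happens once the new BS has sent its ack.
    old_bs.checkpoint, old_bs.log = None, []

def lazy_handoff(old_bs, new_bs):
    """Transfer nothing; only extend the linked list of visited base stations."""
    new_bs.prev = old_bs

def lazy_recover(current_bs):
    """Walk the list back to the last checkpoint and collect the logs in write order."""
    segments = []
    bs = current_bs
    while bs is not None:
        segments.append(bs.log)
        if bs.checkpoint is not None:
            ordered = [event for segment in reversed(segments) for event in segment]
            return bs.checkpoint, ordered
        bs = bs.prev
    raise LookupError("no checkpoint found along the list of visited base stations")
```

Lazy handoff costs almost nothing per move, but `lazy_recover` has to contact every BS back to the one holding the checkpoint, which is the more complicated recovery the slide refers to.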
10
Handoff (2)
Trickle Strategy (T)
  In the Lazy Strategy, the scattering of logs across different BSs increases as mobility increases, making recovery time-consuming
  A failure at one BS containing the log renders the entire state information useless
  Checkpoint and logs are kept in the preceding BS of the current one
  During handoff, a control message is sent to the preceding BS to transfer any checkpoint and logs to the current one, and the ID of the preceding BS is stored in the current one
11
Optimal Recovery Scheme
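Mobility   Wireless Bandwidth   Failure Rate   Optimal Scheme
High       Low                  Low            LL
High       Low                  High           NT
High       High                 All            LT
Low        All                  All            LL
Scheme codes: the first letter is the state-saving approach (N = No Logging, L = Logging), the second letter is the handoff strategy (P = Pessimistic, L = Lazy, T = Trickle)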
12
Mobile Ad-Hoc Systems
13
Mobile Ad-Hoc Systems
Main characteristics of ad-hoc systems are
  Self-organizing
  Fully decentralized
  Highly dynamic
Applications
  Conferences, meetings
  Wireless communications between vehicles in road traffic
  Disaster relief
  Rescue missions
  Battlefield operations
14
Routing Protocols
Due to the limited transmission range of wireless network interfaces, multiple network hops are needed for one node to exchange data with other nodes across the network
Routing protocols constitute the basic primitives on which most of the higher-level protocols are built
Routing protocols can be divided into
  Unicasting
  Unreliable Broadcasting and Multicasting
  Geocasting
15
Unicasting Protocols
They can be generally categorized as
1. Topology-based routing protocols
   These protocols use the information about the links in the network to perform packet forwarding
2. Position-based routing protocols
   These protocols aim to overcome some of the limitations of topology-based protocols by using additional information, i.e., the physical location of nodes
16
Topology-based Routing Protocols
Proactive Protocols, in which nodes periodically refresh the routing information so that every node always has consistent, up-to-date routing information from each node to every other node in the network
Reactive protocols, where the routing information is propagated to a node only when it is necessary, i.e., when the node requests it
Hybrid protocols, which make use of both reactive and proactive approaches so as to incorporate the merits of both of them
17
Position-based Routing Protocols
Require that information about the physical position of the ad-hoc node is available
Each node may determine its position using the Global Positioning System (GPS) or some other type of positioning service
A location service is used by the sender of a packet to determine the position of the destination (to include it in the packet)
Position-based routing does not necessarily require the establishment or maintenance of routes
Position-based routing supports the delivery of packets to all nodes in a given geographical area
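As an illustration of forwarding without route establishment, the sketch below uses greedy forwarding, a well-known position-based strategy that the slides do not name: each node hands the packet to the neighbor geographically closest to the destination. The location service is replaced by a plain `positions` dictionary, and all names are hypothetical.

```python
from math import dist

def greedy_next_hop(positions, current, destination_pos, tx_range):
    """Neighbor strictly closer to the destination than `current`, or None."""
    best, best_distance = None, dist(positions[current], destination_pos)
    for node, pos in positions.items():
        if node == current or dist(positions[current], pos) > tx_range:
            continue  # not a neighbor
        d = dist(pos, destination_pos)
        if d < best_distance:
            best, best_distance = node, d
    return best

def greedy_route(positions, source, destination, tx_range):
    """Forward hop by hop; give up (None) when no neighbor makes progress."""
    path = [source]
    while path[-1] != destination:
        nxt = greedy_next_hop(positions, path[-1], positions[destination], tx_range)
        if nxt is None:
            return None
        path.append(nxt)
    return path

positions = {"S": (0, 0), "A": (3, 1), "B": (6, 0), "D": (9, 1)}
route = greedy_route(positions, "S", "D", tx_range=4)
```

Each forwarding decision uses only the positions of the current node's neighbors and of the destination, so no routing table has to be built or maintained. The price is that greedy forwarding can get stuck when no neighbor is closer to the destination, in which case this sketch simply returns `None`.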
18
Some Optimizations
Power-aware routing protocols
Disconnected ad-hoc routing
Agent-based ad-hoc routing
19
Unreliable Broadcasting and Multicasting
Unreliable because no guarantee of message delivery is provided for partitionable networks
Four principal families are distinguished
1. Simple Flooding, where a source node broadcasts a packet to its neighbors, each of which in turn broadcasts the packet to its neighbors if it has not already done so
2. Probability-Based Methods, which are similar to flooding except that nodes only forward with a probability determined by their perception of the network topology
3. Area-Based Methods, where a node refrains from forwarding a packet received from another node if the additional area that would be so covered is too small
4. Neighbor Knowledge Methods, where each node maintains state on its neighbors so as to avoid unnecessary forwarding
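The first two families can be sketched with one function. The sketch makes simplifying assumptions: the topology is a static adjacency dictionary, and the forwarding probability is a fixed parameter rather than something each node derives from its view of the topology. The name `flood` and its parameters are hypothetical.

```python
import random

def flood(adjacency, source, forward_probability=1.0, rng=None):
    """Return the set of nodes that receive a broadcast started at `source`.

    Every node rebroadcasts a packet at most once (duplicate suppression).
    With forward_probability < 1, a node other than the source rebroadcasts
    only with that probability.
    """
    rng = rng or random.Random()
    received = {source}
    pending = [source]
    while pending:
        node = pending.pop()
        if node != source and rng.random() >= forward_probability:
            continue  # the node has the packet but decides not to forward it
        for neighbor in adjacency[node]:
            if neighbor not in received:
                received.add(neighbor)
                pending.append(neighbor)
    return received

# E has no links, so it sits in a different partition than the source.
adjacency = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"], "E": []}
reached_by_flooding = flood(adjacency, "A")
reached_without_forwarding = flood(adjacency, "A", forward_probability=0.0)
```

Even simple flooding never reaches `E`, which illustrates why no delivery guarantee can be given in a partitionable network.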
20
Geocasting
Geocasting is a variant of the conventional multicasting problem
Messages are delivered to all hosts within a given geographical region
In traditional multicasting, a host becomes a member of the multicast by explicitly joining the multicast group (usually a named entity)
A host is automatically a member of a geocast group if its location belongs to the region specified for the geocast. This region is referred to as the geocast region, and the set of nodes in the geocast region is said to form the geocast group
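Implicit geocast membership amounts to a point-in-region test. The sketch below assumes a circular geocast region and known host positions; `geocast_group`, `center` and `radius` are hypothetical names.

```python
from math import dist

def geocast_group(positions, center, radius):
    """Hosts whose current position lies inside the circular geocast region."""
    return {host for host, pos in positions.items() if dist(pos, center) <= radius}

positions = {"h1": (1, 1), "h2": (2, 3), "h3": (10, 10)}
group = geocast_group(positions, center=(0, 0), radius=5)
```

No join or leave message is involved: a host enters or leaves the group simply by moving, so the group has to be re-evaluated whenever positions change.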
21
Fault-Tolerance in Ad-hoc Networks
In distributed computing, several problems have been isolated, such as distributed mutual exclusion, consensus, leader election, distributed commit and group communication
All of these represent primitives to support fault-tolerance of distributed applications
In mobile computing, substantial real applications are still scarce, and the formal study of generic problems is quite recent
There are problems specific to the characteristics of the new domain (mobile computing) like location-dependent problems such as geocasting and location based group membership service
22
Transactional Applications
Because of mobility, transactional applications in the ad-hoc context must cope with the possibility that even normal system operation may lead to violations of database correctness
Research has focused on redefining the notion of correctness so as to adapt to the new constraints of ad-hoc networks
A number of alternative definitions of ACID properties have been identified that weaken one or more of the properties
The general trend is to allow a certain degree of autonomy in transaction processing during disconnections
23
Transactional Applications (2)
For example, in disconnected operation, a database client maintaining a local copy of the most recently used data could continue executing even while disconnected from the server
User transactions can be decomposed into a number of weak and strict sub-transactions according to the degree of consistency needed by the application
Strict transactions maintain the traditional notion of a transaction (if committed, then always globally). As a result, they can only be committed while connected to the server
Weak transactions are committed locally. Upon reconnection with the server, a global commit is performed, during which some of them may be aborted
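The split between weak and strict transactions can be sketched as follows. The slides do not say how the global commit decides which weak transactions to abort; this sketch assumes a per-key version check, so a weak write is aborted if the server copy changed while the client was disconnected. `Server`, `MobileClient` and their methods are hypothetical.

```python
class Server:
    def __init__(self, data):
        self.data = dict(data)
        self.version = {key: 0 for key in data}

    def write(self, key, value):
        self.data[key] = value
        self.version[key] = self.version.get(key, 0) + 1

class MobileClient:
    def __init__(self, server):
        self.server = server
        self.connected = True
        self.cache = dict(server.data)    # local copy of the data
        self.seen = dict(server.version)  # versions the local copy is based on
        self.weak_log = []                # locally committed weak writes

    def strict_write(self, key, value):
        """Strict: committed globally, so it needs the server."""
        if not self.connected:
            raise RuntimeError("strict transactions need a connection to the server")
        self.server.write(key, value)
        self.cache[key] = value
        self.seen[key] = self.server.version[key]

    def weak_write(self, key, value):
        """Weak: committed on the local copy only, even while disconnected."""
        self.cache[key] = value
        self.weak_log.append((key, value))

    def reconnect(self):
        """Global commit of the weak writes; conflicting ones are aborted."""
        self.connected = True
        committed, aborted = [], []
        for key, value in self.weak_log:
            if self.server.version.get(key, 0) == self.seen.get(key, 0):
                self.server.write(key, value)
                self.seen[key] = self.server.version[key]
                committed.append((key, value))
            else:
                aborted.append((key, value))  # someone else changed the key meanwhile
        self.weak_log = []
        self.cache = dict(self.server.data)
        self.seen = dict(self.server.version)
        return committed, aborted
```

The client keeps working during a disconnection, at the cost that a locally committed result may later be undone. Other conflict rules are possible; the version check is only the simplest one to show.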
24
Group communication
A group membership protocol manages the formation and maintenance of a set of processes called a group
For example, a group may be
  A set of processes that are cooperating toward a common task (e.g., the primary and backup servers of a database),
  A set of processes that share a common interest (e.g., clients that subscribe to a particular newsgroup), or
  The set of all processes in the system that are currently deemed to be operational
In general, a process may leave a group because
  It failed,
  It voluntarily requested to leave, or
  It is forcibly expelled by other members of the group
25
Group communication (2)
A process may also join a group (e.g., it may have been selected to act as a replica for the other processes in the group)
A group membership protocol must manage such dynamic changes in a coherent way:
  Each process has a local view of the current membership of the group, and
  Processes in the group need to agree on these local views despite failures
26
The Problem of Partitioning
By their nature, network applications for mobile computing involve cooperation among multiple sites
For these applications characterized by reliability and reconfigurability requirements, possible partitioning of the communication network is an extremely important aspect of the environment
In addition to accidental partitioning caused by failures and node movement, mobile computing systems typically support disconnected operation (additional cause of partitioning)
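At the network level, the partitions are the connected components of the current link graph. The sketch below computes them from a list of bidirectional links; `partitions`, `nodes` and `links` are hypothetical names, and what counts as a link is left to the caller.

```python
def partitions(nodes, links):
    """Group nodes into partitions, i.e. connected components of the link graph."""
    adjacency = {node: set() for node in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        result.append(component)
    return result

nodes = ["A", "B", "C", "D"]
before = partitions(nodes, [("A", "B"), ("B", "C"), ("C", "D")])
after = partitions(nodes, [("A", "B"), ("C", "D")])  # link B-C lost
```

Losing a single link, whether through a failure, node movement or a voluntary disconnection, is enough to split one partition into two.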
27
The Problem of Partitioning (2)
Two processes may appear to belong to two different partitions with respect to “ping” messages
But the same two processes may appear in the same partition when communicating through email
This is because the two communication services considered have significantly different message buffering, timeout and retransmission properties
Partitioning may result in service reduction or service degradation but need not necessarily render application services completely unavailable
28
The Problem of Partitioning (3)
Partition-aware applications are those that are able to make progress in multiple concurrent partitions without blocking. Service reduction and degradation depend heavily on the application semantics
For certain application classes with strong consistency requirements, it may be the case that all services have to be suspended completely in all but one partition
For applications with less stringent consistency requirements, partitionable group membership services can provide a useful framework to build on
29
Leader Election
Leader election algorithms for mobile ad hoc networks are classified into
1. Non-Compulsory protocols, which do not affect the motion of the nodes and try to take advantage of the mobile hosts' natural movement by exchanging information whenever mobile hosts meet incidentally
2. Compulsory protocols, which determine the motion of some or all of the nodes according to a specific scheme in order to meet the protocol demands (e.g., meet more often, spread over a geographical area, etc.)
In both protocol classes, it is assumed that the mobile node moves in a bounded three-dimensional space
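The non-compulsory idea of exchanging information at incidental meetings can be illustrated with a toy rule that is an assumption of this sketch, not a protocol from the slides: every host remembers the largest host ID it has heard of, and two hosts that meet keep the larger of their two candidates. `elect_by_meetings` is a hypothetical name.

```python
def elect_by_meetings(host_ids, meetings):
    """Each host tracks the largest ID it knows; meeting hosts merge their knowledge."""
    candidate = {host: host for host in host_ids}
    for a, b in meetings:
        best = max(candidate[a], candidate[b])
        candidate[a] = candidate[b] = best
    return candidate

# Hosts 3 and 5 meet first, then 5 and 7, then 3 and 7.
result = elect_by_meetings([3, 5, 7], [(3, 5), (5, 7), (3, 7)])
```

The hosts only agree once the meetings have carried the largest ID to everyone, so how quickly the election converges depends entirely on the movement pattern. A compulsory protocol would instead steer some hosts so that such meetings are guaranteed to happen.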
30
Literature
D. K. Pradhan, P. Krishna, and N. H. Vaidya. Recoverable mobile environments: Design and tradeoff analysis. FTCS-26, June 1996
Claudio Basile, Marc-Olivier Killijian, and David Powell. A survey of dependability issues in mobile wireless networks. Technical report, Laboratory for Analysis and Architecture of Systems, National Center for Scientific Research, Toulouse, France, Feb 2003