1

Software Fault Tolerance (SWFT): Fault-Tolerance in Mobile Networks

Dependable Embedded Systems & SW Group www.deeds.informatik.tu-darmstadt.de

Prof. Neeraj Suri

Brahim Ayari

Dept. of Computer Science, TU Darmstadt, Germany

2

Mobile Wireless Networks

Can be classified in two major categories: cellular networks (also known as infrastructured networks) and ad-hoc networks

Cellular networks have fixed and wired gateways called base stations or mobile support stations responsible for routing messages

Ad-hoc networks have no fixed infrastructure and all nodes are capable of movement, which determines the network connectivity

Ad-hoc nodes communicate directly only with the nodes that are immediately within their transmission range

To communicate with other nodes, intermediate nodes forward the packets from the source towards the destination

3

Wireless Communication

Much more difficult to achieve than wired communication: the surrounding environment interacts with the signal, blocking signal paths and introducing noise and echoes

Wireless connections are of lower quality than wired connections:
- Lower bandwidths: 9-14 Kbps for cellular telephony and 1-54 Mbps for radio communication (WLAN); the bandwidth available per user also depends on the number of users communicating in that area
- Higher error rates
- More frequent spurious disconnections

4

Mobile Device Limitations

The implications of portability for mobile devices are small size and weight, and dependence on battery power

Small size and weight imply:
- Restricted memory size
- Small storage capacity
- Limited user interface (both data entry and data display)
- Limited processing capabilities

5

Cellular Networks

Heterogeneity: wireless networks, mobile nodes

Perturbations: disconnections, node/communication failures

Application-dependent solutions

[Figure: wired networks interconnecting WLAN, UMTS and GPRS access networks]

6

Cellular Networks

Most proposed solutions rely on base stations (BS) to add fault-tolerance to cellular networks

Mobile devices are exposed to physical damage and theft. They can also be lost at any time

Most fault-tolerance techniques rely on stable storage

A stable storage can be seen as an ideal storage medium that, given a set of failure assumptions, protects user data from corruption or loss

A stable storage should guarantee the atomicity of the write operation
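The atomicity requirement can be illustrated with a minimal sketch. It assumes an ordinary local file system stands in for stable storage, and it uses the common write-to-temporary-file-then-rename idiom; the function name `atomic_write` is hypothetical and not part of the lecture material.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the file at `path` so readers see the old or the new content, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory (same file system),
    # so the final rename does not cross devices.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to the device before renaming
        os.replace(tmp_path, path)  # the rename is atomic on POSIX systems
    except BaseException:
        # On failure the original file is untouched; discard the partial copy.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A crash before `os.replace` leaves the previous checkpoint intact; a crash after it leaves the new one. Protection against media failure (for example, by replication) is outside this sketch.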

7

Cellular Networks: Recovery Strategies

A recovery strategy essentially has two components: a state-saving strategy and a handoff strategy

State saving techniques are based on traditional checkpointing and message logging strategies

The host periodically saves its state to stable storage and, upon a failure, execution can be restarted from the last-saved checkpoint

Handoff is used to achieve continuous service while a mobile device moves from one cell to another

The handoff process is initiated either by crossing a cell boundary or by a deterioration in the quality of the wireless link in the current cell

8

State Saving

The state of the process can get altered, either upon receipt of a message, or upon a user input. Messages or inputs which modify the state are called write events

No Logging Approach (N)
- The state is saved at the base station upon every write event
- Upon a failure, the last-saved state in the base station is loaded
- Need for frequent transmission of state in the wireless link

Logging Approach (L)
- The state is checkpointed periodically at the base station
- The write events between two checkpoints are logged
- After a failure, the mobile device loads the last checkpoint and the log of write events
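The two approaches can be compared with a small simulation. This is a sketch under stated assumptions: the state is a dictionary, a write event is a `(key, value)` pair, checkpoints are taken every few events rather than on a timer, and `bytes_received` is only a rough proxy for wireless traffic. All class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BaseStation:
    """Hypothetical stand-in for stable storage at a base station."""
    checkpoint: dict = field(default_factory=dict)
    log: list = field(default_factory=list)
    bytes_received: int = 0  # rough proxy for traffic on the wireless link


def apply_event(state: dict, event: tuple) -> None:
    key, value = event
    state[key] = value


class NoLoggingHost:
    def __init__(self, bs):
        self.bs, self.state = bs, {}

    def write_event(self, event):
        apply_event(self.state, event)
        # Approach N: ship the full state after every write event.
        self.bs.checkpoint = dict(self.state)
        self.bs.bytes_received += len(repr(self.state))

    def recover(self):
        self.state = dict(self.bs.checkpoint)


class LoggingHost:
    def __init__(self, bs, checkpoint_every=3):
        self.bs, self.state = bs, {}
        self.checkpoint_every = checkpoint_every  # assumed: count-based period
        self.events_since_checkpoint = 0

    def write_event(self, event):
        apply_event(self.state, event)
        self.events_since_checkpoint += 1
        if self.events_since_checkpoint == self.checkpoint_every:
            # Periodic checkpoint: send the full state and purge the log.
            self.bs.checkpoint = dict(self.state)
            self.bs.log.clear()
            self.bs.bytes_received += len(repr(self.state))
            self.events_since_checkpoint = 0
        else:
            # Between checkpoints only the write event itself is sent.
            self.bs.log.append(event)
            self.bs.bytes_received += len(repr(event))

    def recover(self):
        # Approach L: load the last checkpoint, then replay the logged events.
        self.state = dict(self.bs.checkpoint)
        for event in self.bs.log:
            apply_event(self.state, event)
        self.events_since_checkpoint = len(self.bs.log)
```

Both hosts recover the same state; the difference lies in how much is sent per write event and how much work recovery requires.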

9

Handoff

Pessimistic Strategy (P)
- The checkpoint (and the logs, in the case of Logging) is transferred to the new cell's BS during handoff
- The new BS sends an ack to the old BS so that the old BS can purge its copy of the checkpoint
- Disadvantage: a large volume of data is transferred during each handoff

Lazy Strategy (L)
- No transfer of checkpoints and logs during handoff
- A linked list of the BSs of the cells visited is created
- In the case of No Logging, the checkpoint is saved at the current cell’s BS after each write event
- In the case of Logging, a log of the last write events is maintained in addition to the last checkpoint
- If a new checkpoint is taken at one BS, the old checkpoint and logs are deleted from the old BSs along the linked list
- Saves network overhead, but recovery is more complicated
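The pessimistic and lazy strategies can be contrasted in a short sketch. It assumes the Logging approach, a dictionary as state, `(key, value)` write events, and that a BS is visited at most once between two checkpoints so the linked list has no cycles. The `BS` class and all function names are hypothetical.

```python
class BS:
    """Hypothetical base station holding recovery data for one mobile host."""
    def __init__(self, name):
        self.name = name
        self.checkpoint = None  # last checkpoint stored here, if any
        self.log = []           # write events logged while the host was in this cell
        self.prev = None        # lazy strategy: link to the previously visited BS


def pessimistic_handoff(old, new):
    """Move checkpoint and log to the new BS, then purge the old copy."""
    new.checkpoint, new.log = old.checkpoint, list(old.log)
    # The new BS would acknowledge here; on that ack the old BS purges its copy.
    old.checkpoint, old.log = None, []
    return len(new.log) + (1 if new.checkpoint is not None else 0)  # items moved


def lazy_handoff(old, new):
    """Transfer nothing; only remember where the earlier data lives."""
    new.prev = old
    return 0


def lazy_checkpoint(current, state):
    """A new checkpoint makes everything stored at earlier BSs obsolete."""
    bs = current.prev
    while bs is not None:
        nxt = bs.prev
        bs.checkpoint, bs.log, bs.prev = None, [], None
        bs = nxt
    current.prev = None
    current.checkpoint, current.log = dict(state), []


def lazy_recover(current):
    """Walk back to the BS holding the last checkpoint, then replay logs oldest first."""
    chain, bs = [], current
    while bs is not None:
        chain.append(bs)
        if bs.checkpoint is not None:
            break
        bs = bs.prev
    chain.reverse()
    state = dict(chain[0].checkpoint or {})
    for bs in chain:
        for key, value in bs.log:
            state[key] = value
    return state
```

The lazy handoff moves no data, but `lazy_recover` has to contact every BS on the list, which is the added recovery complexity noted above.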

10

Handoff (2)

Trickle Strategy (T)
- In Lazy Strategy, the scattering of logs in different BSs increases as mobility increases, making recovery time-consuming
- A failure at one BS containing the log renders the entire state information useless
- Checkpoint and logs are kept in the preceding BS of the current one
- During handoff a control message is sent to the preceding BS to transfer any checkpoint and logs to the current one, and the ID of the preceding BS is stored in the current one
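One possible reading of the trickle strategy is sketched below: at each handoff, whatever the BS before the old cell still holds is pulled forward into the old cell's BS, so recovery data never lies more than one hop behind the host. This interpretation, the Logging approach, the dictionary state, and all names (`BS`, `trickle_handoff`, `recover`) are assumptions made for illustration.

```python
class BS:
    """Hypothetical base station holding recovery data for one mobile host."""
    def __init__(self, name):
        self.name = name
        self.checkpoint = None
        self.log = []
        self.prev = None  # ID (here: a reference) of the preceding BS


def trickle_handoff(old, new):
    """Hand the host from `old` to `new`, pulling older data one hop forward."""
    before = old.prev
    if before is not None:
        # Control message to the BS preceding `old`: ship whatever it still holds.
        if old.checkpoint is None:
            old.checkpoint = before.checkpoint
            old.log = before.log + old.log  # older events first
        # If `old` already has its own checkpoint, the earlier data is obsolete.
        before.checkpoint, before.log = None, []
        old.prev = None
    new.prev = old  # the new BS only stores who preceded it


def recover(current):
    """Recovery data is on the current BS and, at most, on its predecessor."""
    state = {}
    for bs in (current.prev, current):
        if bs is None:
            continue
        if bs.checkpoint is not None:
            state = dict(bs.checkpoint)  # a newer checkpoint supersedes older data
        for key, value in bs.log:
            state[key] = value
    return state
```

Compared with the lazy sketch, recovery contacts at most two base stations regardless of how many cells the host has crossed.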

11

Optimal Recovery Scheme

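Mobility   Wireless Bandwidth   Failure Rate   Optimal Scheme
High       Low                  Low            LL
High       Low                  High           NT
High       High                 All            LT
Low        All                  All            LL

Each scheme code pairs a state-saving approach (N or L) with a handoff strategy (P, L or T); for example, LT is Logging with Trickle handoff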

12

Mobile Ad-Hoc Systems

13

Mobile Ad-Hoc Systems

Main characteristics of ad-hoc systems are:
- Self-organizing
- Fully decentralized
- Highly dynamic

Applications:
- Conferences, meetings
- Wireless communications between vehicles in road traffic
- Disaster relief
- Rescue missions
- Battlefield operations

14

Routing Protocols

Due to the limited transmission range of wireless network interfaces, multiple network hops are needed for one node to exchange data with other nodes across the network

Routing protocols constitute the basic primitives on which most of the higher-level protocols are built

Routing protocols can be divided into:
- Unicasting
- Unreliable Broadcasting and Multicasting
- Geocasting

15

Unicasting Protocols

They can be generally categorized as

1. Topology-based routing protocols: these protocols use the information about the links in the network to perform packet forwarding

2. Position-based routing protocols: these protocols aim to surpass some of the limitations of topology-based protocols by using additional information, i.e., the physical location of nodes

16

Topology-based Routing Protocols

Proactive Protocols, in which nodes periodically refresh the routing information so that every node always has consistent, up-to-date routing information from each node to every other node in the network

Reactive protocols, where the routing information is propagated to a node only when it is necessary, i.e., when the node requests it

Hybrid protocols, which make use of both reactive and proactive approaches so as to incorporate the merits of both of them

17

Position-based Routing Protocols

Require that information about the physical position of the ad-hoc node is available

Each node may determine its position using the Global Positioning System (GPS) or some other type of positioning service

A location service is used by the sender of a packet to determine the position of the destination (to include it in the packet)

Position-based routing does not necessarily require the establishment or maintenance of routes

Position-based routing supports the delivery of packets to all nodes in a given geographical area
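To illustrate how packets can be forwarded without established routes, the sketch below uses greedy forwarding, a well-known position-based technique that the slides do not name explicitly. It assumes 2-D positions, that each node knows its neighbors' positions, and that the destination's position has been obtained from a location service. The function and parameter names are hypothetical.

```python
import math


def greedy_route(positions, neighbors, source, destination):
    """Forward hop by hop to the neighbor closest to the destination's position.

    Returns the path, or None if some node has no neighbor closer to the
    destination than itself (a local minimum that plain greedy forwarding
    cannot escape).
    """
    def dist(a, b):
        (ax, ay), (bx, by) = positions[a], positions[b]
        return math.hypot(ax - bx, ay - by)

    path = [source]
    current = source
    while current != destination:
        best = min(neighbors[current],
                   key=lambda n: dist(n, destination),
                   default=None)
        # Each hop must make strict progress, which also guarantees termination.
        if best is None or dist(best, destination) >= dist(current, destination):
            return None
        path.append(best)
        current = best
    return path
```

No node stores a route: every forwarding decision uses only local neighbor positions and the destination position carried in the packet.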

18

Some Optimizations

Power-aware routing protocols

Disconnected ad-hoc routing

Agent-based ad-hoc routing

19

Unreliable Broadcasting and Multicasting

Unreliable because no guarantee of message delivery is provided for partitionable networks

Four principal families are distinguished:

1. Simple Flooding, where a source node broadcasts a packet to its neighbors, each of which in turn broadcasts the packet to its neighbors if this was not already done

2. Probability-Based Methods, which are similar to flooding except that nodes only forward with a probability determined by their perception of the network topology

3. Area-Based Methods, where a node refrains from forwarding a packet received from another node if the additional area that would be so covered is too low

4. Neighbor Knowledge Methods, where each node maintains state on its neighbors so as to avoid unnecessary forwarding
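The first two families can be sketched on a static graph. The example makes simplifying assumptions: the topology is given as an adjacency dictionary, and the probability-based variant uses one fixed forwarding probability rather than a value derived from each node's view of the topology. The function name `flood` and its parameters are hypothetical.

```python
import random
from collections import deque


def flood(neighbors, source, forward_probability=1.0, seed=0):
    """Return (reached nodes, number of broadcasts) for one packet.

    forward_probability=1.0 gives simple flooding; lower values give a
    probability-based variant with a fixed probability (a simplification).
    """
    rng = random.Random(seed)
    reached = {source}
    queue = deque([source])
    broadcasts = 0
    while queue:
        node = queue.popleft()
        # The source always transmits; other nodes rebroadcast with the given probability.
        if node != source and rng.random() >= forward_probability:
            continue
        broadcasts += 1
        for peer in neighbors[node]:
            if peer not in reached:  # each node handles the packet only once
                reached.add(peer)
                queue.append(peer)
    return reached, broadcasts
```

With simple flooding every reached node broadcasts once, and nodes in a different partition are never reached, which is why delivery cannot be guaranteed in partitionable networks.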

20

Geocasting

Geocasting is a variant of the conventional multicasting problem

Messages are delivered to all hosts within a given geographical region

In traditional multicasting, a host becomes a member of the multicast by explicitly joining the multicast group (usually a named entity)

A host is automatically a member of a geocast group if its location belongs to the region specified for the geocast. This region is referred to as the geocast region, and the set of nodes in the geocast region is said to form the geocast group
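Implicit membership can be shown in a few lines. The sketch assumes a circular geocast region and 2-D host positions; the `CircularRegion` class and `geocast_group` function are hypothetical names chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CircularRegion:
    """Hypothetical geocast region: a disc given by center and radius."""
    cx: float
    cy: float
    radius: float

    def contains(self, x: float, y: float) -> bool:
        return math.hypot(x - self.cx, y - self.cy) <= self.radius


def geocast_group(positions, region):
    """Hosts whose current position lies inside the region; no join step is involved."""
    return {host for host, (x, y) in positions.items() if region.contains(x, y)}
```

Membership changes purely through movement: recomputing the group after a host's position changes is all that is needed, in contrast to the explicit join of traditional multicasting.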

21

Fault-Tolerance in Ad-hoc Networks

In distributed computing, several problems have been isolated, such as distributed mutual exclusion, consensus, leader election, distributed commit and group communication

All of these represent primitives to support fault-tolerance of distributed applications

In mobile computing, substantial real applications are still scarce, and the formal study of generic problems is quite recent

There are problems specific to the characteristics of the new domain (mobile computing) like location-dependent problems such as geocasting and location based group membership service

22

Transactional Applications

Because of mobility, transactional applications in the ad-hoc context must cope with the possibility that even normal system operation may lead to violations of database correctness

Research has focused on redefining the notion of correctness so as to adapt to the new constraints of ad-hoc networks

A number of alternative definitions of ACID properties have been identified that weaken one or more of the properties

The general trend is to allow a certain degree of autonomy in transaction processing during disconnections

23

Transactional Applications (2)

For example, in disconnected operation, a database client maintaining a local copy of the most recently used data could continue executing even while being disconnected from the server

User transactions can be decomposed into a number of weak and strict sub-transactions according to the degree of consistency needed by the application

Strict transactions maintain the traditional notion of a transaction (if committed, then always globally). As a result, they can only be committed while connected to the server

Weak transactions are committed locally. Upon reconnection with the server, a global commit is performed, during which some of them can be aborted
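The difference between weak and strict transactions can be made concrete with a toy model. The sketch assumes single-key writes and a per-key version counter for conflict detection at global commit; the slides do not prescribe how conflicts are detected, so this rule and all names (`Server`, `MobileClient`, and their methods) are hypothetical.

```python
class Server:
    def __init__(self):
        self.data = {}
        self.version = {}  # per-key version counter (assumed conflict check)

    def commit(self, key, value, read_version):
        """Commit globally unless the key changed since the client read it."""
        if self.version.get(key, 0) != read_version:
            return False
        self.data[key] = value
        self.version[key] = read_version + 1
        return True


class MobileClient:
    def __init__(self, server):
        self.server = server
        self.connected = True
        self.cache = dict(server.data)             # local copy of the data
        self.cache_version = dict(server.version)
        self.pending = []                          # locally committed weak transactions

    def strict_write(self, key, value):
        """Strict transaction: needs the server and commits globally or not at all."""
        if not self.connected:
            raise ConnectionError("strict transactions require a server connection")
        ok = self.server.commit(key, value, self.cache_version.get(key, 0))
        if ok:
            self.cache[key] = value
            self.cache_version[key] = self.server.version[key]
        return ok

    def weak_write(self, key, value):
        """Weak transaction: commit locally against the cached copy."""
        self.cache[key] = value
        self.pending.append((key, value, self.cache_version.get(key, 0)))

    def reconnect(self):
        """Attempt the global commit of pending weak transactions; return the aborted ones."""
        self.connected = True
        aborted, base = [], {}  # base: versions produced by our own commits in this batch
        for key, value, read_version in self.pending:
            expected = base.get(key, read_version)
            if self.server.commit(key, value, expected):
                base[key] = expected + 1
            else:
                aborted.append((key, value))
        self.pending = []
        self.cache = dict(self.server.data)
        self.cache_version = dict(self.server.version)
        return aborted
```

A weak write to a key that another client changed during the disconnection is aborted on reconnection, while weak writes to untouched keys are committed globally.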

24

Group communication

A group membership protocol manages the formation and maintenance of a set of processes called a group

For example, a group may be:
- A set of processes that are cooperating toward a common task (e.g., the primary and backup servers of a database),
- A set of processes that share a common interest (e.g., clients that subscribe to a particular newsgroup), or
- The set of all processes in the system that are currently deemed to be operational

In general, a process may leave a group because:
- It failed,
- It voluntarily requested to leave, or
- It is forcibly expelled by other members of the group

25

Group communication (2)

A process may also join a group (e.g., it may have been selected to act as a replica for the other processes in the group)

A group membership protocol must manage such dynamic changes in a coherent way:
- Each process has a local view of the current membership of the group, and
- Processes in the group need to agree on these local views despite failures

26

The Problem of Partitioning

By their nature, network applications for mobile computing involve cooperation among multiple sites

For these applications characterized by reliability and reconfigurability requirements, possible partitioning of the communication network is an extremely important aspect of the environment

In addition to accidental partitioning caused by failures and node movement, mobile computing systems typically support disconnected operation (additional cause of partitioning)

27

The Problem of Partitioning (2)

Two processes may appear to belong to two different partitions with respect to “ping” messages

But the same two processes may appear in the same partition when communicating through email

This is because the two communication services considered have significantly different message buffering, timeout and retransmission properties

Partitioning may result in service reduction or service degradation but need not necessarily render application services completely unavailable

28

The Problem of Partitioning (3)

Partition-aware applications are those that are able to make progress in multiple concurrent partitions without blocking. Service reduction and degradation depend heavily on the application semantics

For certain application classes with strong consistency requirements, it may be the case that all services have to be suspended completely in all but one partition

For applications with less stringent consistency requirements, partitionable group membership services can provide a useful framework to build on
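The strong-consistency case, where service continues in at most one partition, can be sketched as follows. The example assumes that partitions are the connected components of the current link graph and that a strict majority rule selects the single partition allowed to continue; the majority rule is one common choice, not something the slides prescribe. Function names are hypothetical.

```python
def partitions(nodes, links):
    """Group nodes into partitions (connected components) given bidirectional links."""
    neighbors = {n: set() for n in nodes}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        result.append(component)
    return result


def may_update(partition, total_nodes):
    """Assumed rule: only a partition holding a strict majority keeps full service."""
    return len(partition) > total_nodes / 2
```

Since two disjoint partitions cannot both hold a strict majority, at most one partition continues to accept updates; a partition-aware application with weaker requirements could instead offer reduced service in every partition.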

29

Leader Election

Leader election algorithms for mobile ad hoc networks are classified into

1. Non-Compulsory protocols, which do not affect the motion of the nodes and try to take advantage of the mobile hosts' natural movement by exchanging information whenever mobile hosts meet incidentally

2. Compulsory protocols, which determine the motion of some or all the nodes according to a specific scheme in order to meet the protocol demands (e.g., meet more often, spread in a geographical area, etc.)

In both protocol classes, it is assumed that the mobile node moves in a bounded three-dimensional space
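A non-compulsory protocol can be illustrated with a minimal sketch in which nodes exchange information only when they happen to meet. It assumes unique, comparable node IDs, a "highest ID wins" rule, and a given sequence of pairwise meetings; these choices and the function name `elect_by_meetings` are illustrative assumptions, not a protocol taken from the slides.

```python
def elect_by_meetings(node_ids, meetings):
    """Each node tracks the highest ID it has heard of; meeting nodes exchange that value."""
    candidate = {n: n for n in node_ids}  # initially every node proposes itself
    for a, b in meetings:
        best = max(candidate[a], candidate[b])
        candidate[a] = candidate[b] = best
    return candidate
```

Agreement is reached only once the meetings have spread the highest ID to every node, so convergence time depends entirely on the nodes' natural movement; a compulsory protocol would steer some nodes to make such meetings happen sooner.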

30

Literature

D.K. Pradhan, P. Krishna and N.H. Vaidya. Recoverable mobile environments: Design and tradeoff analysis. FTCS-26, June 1996

Claudio Basile, Marc-Olivier Killijian and David Powell. A survey of dependability issues in mobile wireless networks. Technical report, Laboratory for Analysis and Architecture of Systems, National Center for Scientific Research, Toulouse, France, Feb 2003
