[ieee 2012 ieee asia-pacific conference on applied electromagnetics (apace) - melaka, malaysia...

Adaptive Modular Recovery Block

Ayesha Ubaid

Department of Computer Software Engineering

Military College of Signals

National University of Sciences and

Technology

Islamabad, Pakistan

[email protected]

Ubaid-UR-Rehman

Department of Electrical Engineering

Military College of Signals,

National University of Sciences and

Technology

Islamabad, Pakistan

[email protected]

Muhammad Ali Abidi

Engineering Wing

Military College of Signals,

National University of Sciences and

Technology

Islamabad, Pakistan

[email protected]

Abstract—Achieving ideal fault tolerance in distributed systems is an increasing demand. This paper gives the reader a general understanding of fault tolerance and surveys software fault tolerance models for design faults. The component based recovery block (CBRB) divides a system into modules and checks their reliability separately, but it does not address the integration issues that can cause instability. The proposed model addresses this gap in CBRB to enhance the reliability of software.

Keywords-N-version programming (NVP), recovery blocks (RB), component based recovery blocks (CBRB).

I. INTRODUCTION

Software fault tolerance is a necessary component in constructing reliable computing systems, ranging from embedded systems to data warehouse systems, and has been an area of research since the 1960s [1]. Since software is engineered and not manufactured, the more one goes into detail, the more research areas emerge.

Software fault tolerance is the ability of software to detect and recover from a fault that occurs at runtime. Software faults pose serious hazards, especially in safety critical systems. Hence it is important to find ways to reduce these faults [1]. Software fault tolerance techniques can be classified mainly into design diverse techniques and data diverse techniques. The former introduce design variants of components, which help to mask design faults. The latter use different forms of the data to check for data related faults. Since most failures are caused by design faults, this paper focuses on models that deal with design faults. Component based development is an increasing trend in the modern software industry, and it is accompanied by various testing and quality assurance techniques. The proposed model not only offers design diverse fault tolerance but also provides the developer with integration and black box testing, which are inherent to the model.

This paper has five sections. Section II explains the traditional techniques used for design fault masking. Section III explains the proposed model, named ‘adaptive component based recovery block’. Section IV discusses the effectiveness of the proposed model. Section V concludes the paper.

II. EXISTING TECHNIQUES FOR DESIGN DIVERSE FAULTS

Design errors are very difficult to catch. Research has been conducted to address this, but there is still an extreme lack of tools. Some useful models for design fault tolerance include N-version programming (NVP), self-checking software, recovery blocks, distributed recovery blocks and component based recovery blocks.

A. N-version Model

In an N-version software system, each module is made up of N different implementations. Each variant accomplishes the same task, but in a different way. Each version submits its answer to a voter, which determines the correct answer. In the case of NVP, removing independent faults will increase reliability more “than for RB since the versions are run in parallel” [8]. Keeping the specification flexible enough to encourage design diversity while still maintaining compatibility between versions is a difficult task [2]. The general pseudocode is:

run Version 1, Version 2, ..., Version n

if (Decision Mechanism (Result 1, Result 2, ..., Result n))

return Result

else failure exception

Figure 1: N-version programming structure (NVP entry; distribute inputs; Version 1, Version 2, Version 3; gather; DM; exception raised; return result or failure exception)

2012 IEEE Asia-Pacific Conference on Applied Electromagnetics (APACE 2012), December 11 - 13, 2012, Melaka, Malaysia

9978-1-4673-3115-9/12/$31.00 ©2012 IEEE
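The N-version scheme described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation of any cited work: the names `run_nvp`, `majority_vote` and `NVPFailure`, the three toy versions, and the choice of exact-match majority voting as the decision mechanism are all hypothetical. The versions run sequentially here for brevity, whereas an NVP system would normally run them in parallel.

```python
from collections import Counter


class NVPFailure(Exception):
    """Raised when the decision mechanism cannot select a result."""


def majority_vote(results, n_versions):
    """Assumed decision mechanism: accept a value returned by a strict
    majority of all n_versions versions (crashed versions count against it)."""
    if not results:
        raise NVPFailure("all versions failed")
    value, count = Counter(results).most_common(1)[0]
    if 2 * count > n_versions:
        return value
    raise NVPFailure("no majority among version results")


def run_nvp(versions, x):
    """Run every version on the same input, then vote on the outputs."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass  # a crashed version contributes no vote
    return majority_vote(results, len(versions))


# Three hypothetical, independently written versions of "square a number".
def square_a(x):
    return x * x


def square_b(x):
    return x ** 2


def square_faulty(x):
    return x + x  # design fault: only correct for x = 0 and x = 2
```

With these toy versions, `run_nvp([square_a, square_b, square_faulty], 3)` outvotes the faulty version, while three mutually disagreeing versions lead to `NVPFailure`.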

B. Self Checking Software

“Self-checking software is basically the extra checks including some amount of check pointing and rollback recovery methods that are added into fault-tolerant or safety critical systems” [3]. Such checks are implemented in some extremely reliable systems. The problem with self-checking software is that it is not rigorous. For a specific fault, it is very difficult to identify all the parts of the code involved in causing that fault. Furthermore, it is hard to determine how reliable a system has been made by adding self-checking software. Without proper rigor, self-checking software cannot be applied effectively. The general syntax (for n = 4) is:

run Variants 1 and 2 on Hardware Pair 1, Variants 3 and 4 on Hardware Pair 2

compare Results 1 and 2

if not (match), set NoMatch1

else set Result Pair 1

compare Results 3 and 4

if not (match), set NoMatch2

else set Result Pair 2

if NoMatch1 and not NoMatch2, Result = Result Pair 2

else if NoMatch2 and not NoMatch1, Result = Result Pair 1

else if NoMatch1 and NoMatch2, raise exception

else if not NoMatch1 and not NoMatch2, then compare Result Pair 1 and 2

if not (match), raise exception

if (match), Result = Result Pair 1 or 2

return Result
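The pair-wise comparison logic above can be written as a short Python sketch. The names `run_pair`, `run_self_checking` and `SelfCheckFailure` are hypothetical, and the hardware pairs are only simulated by ordinary function calls, so this illustrates the decision logic rather than a real deployment.

```python
class SelfCheckFailure(Exception):
    """Raised when no trustworthy result can be selected."""


def run_pair(variant_a, variant_b, x):
    """Run the two variants of one pair and compare their results.

    Returns (matched, result); result is only meaningful when matched is True.
    """
    result_a, result_b = variant_a(x), variant_b(x)
    return result_a == result_b, result_a


def run_self_checking(pair1, pair2, x):
    """Select a result from two self-checking pairs (the n = 4 case)."""
    match1, result1 = run_pair(pair1[0], pair1[1], x)
    match2, result2 = run_pair(pair2[0], pair2[1], x)

    if not match1 and match2:      # NoMatch1 and not NoMatch2
        return result2
    if not match2 and match1:      # NoMatch2 and not NoMatch1
        return result1
    if not match1 and not match2:  # NoMatch1 and NoMatch2
        raise SelfCheckFailure("both pairs disagree internally")
    if result1 != result2:         # both pairs match, but not each other
        raise SelfCheckFailure("the two pairs disagree with each other")
    return result1
```

For example, with a correct variant `good = lambda x: 2 * x` and a faulty one `bad = lambda x: 2 * x + 1`, the call `run_self_checking((good, bad), (good, good), 4)` discards the first pair and returns the second pair's result.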

C. Recovery Blocks

The recovery block method was developed by Randell [4]. The primary block is executed first. If the primary block fails the acceptance test, the first alternate is executed. In practice the primary module is usually the most complex and the fastest; the first alternate may be slower and comparatively simple. If the first alternate passes the acceptance test, its result is considered reliable and the process stops. If it fails, the process continues in the same manner with the next alternate until one passes. If none of the alternates passes, an error (exception) is reported, indicating that the software could not perform the requested operation. The general syntax is given by [5]:

Ensure Acceptance Test

by Primary Alternate

else by

Alternate 2

else by

Alternate 3

...

else by

Alternate n

else

failure exception [5]
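The recovery block structure above can be sketched in Python as follows. This is a minimal sketch under assumptions: the names `recovery_block`, `RecoveryBlockFailure` and the toy sorting alternates are hypothetical, and the rollback is modelled simply by running each alternate on a deep copy of the state, so a rejected attempt never touches the original.

```python
import copy


class RecoveryBlockFailure(Exception):
    """Raised when every alternate fails the acceptance test."""


def recovery_block(state, alternates, acceptance_test):
    """Try the alternates in order; return the first accepted new state."""
    for alternate in alternates:
        trial = copy.deepcopy(state)  # checkpoint: work on a private copy
        try:
            alternate(trial)
        except Exception:
            continue                  # a crash counts as a failed attempt
        if acceptance_test(trial):
            return trial              # accepted: this state is committed
        # rejected: the trial copy is discarded, which acts as the rollback
    raise RecoveryBlockFailure("all alternates failed the acceptance test")


# Hypothetical alternates for sorting state["data"] in place.
def fast_but_faulty_sort(state):
    state["data"] = sorted(set(state["data"]))  # design fault: drops duplicates


def simple_sort(state):
    state["data"].sort()


def make_acceptance_test(original):
    """Accept a result that is ordered and has the original length."""
    def test(state):
        data = state["data"]
        ordered = all(a <= b for a, b in zip(data, data[1:]))
        return ordered and len(data) == len(original)
    return test
```

Calling `recovery_block(state, [fast_but_faulty_sort, simple_sort], make_acceptance_test(state["data"]))` with `state = {"data": [3, 1, 3, 2]}` rejects the primary, because it loses an element, and accepts the simpler alternate.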

D. Distributed Recovery Blocks

The distributed recovery block (DRB) was formulated by K. H. Kim and H. O. Welch to incorporate hardware and software fault tolerance in a single real time application. The approach combines the recovery block with distributed processing. In this technique, the primary and alternate modules are replicated on two computing nodes “which are interconnected via networks” [6]. One is the primary node and the other is the secondary (backup) node. Both nodes receive the input at the same time and execute their modules in parallel. Each node holds the primary and alternate versions of the program along with the acceptance test. The acceptance test has two parts, the timer and the logical acceptance test. The timer measures the execution time, while the latter determines whether the output is acceptable. The timer is started as soon as the input is fed into the system. In Kim's terminology, the primary module is referred to as the “primary try block” [7] and the alternate module as the alternate try block. “In fault free circumstances, the backup node executes the primary alternative and secondary try block executes the alternative module if both will pass the acceptance test and update their local database” [7]. After a success, the result is forwarded to the backup node for notification, and the primary node then sends its result to the concurrent computing station. In case of failure, the primary node notifies the backup node. Since the backup node was computing concurrently, it immediately provides its output and takes over the role of the primary.

If the primary node stops executing entirely because of an operating system crash or an application hang, the backup node detects the failure by the expiration of a local timer.
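The behaviour described above can be approximated by the following Python sketch, in which two threads stand in for the primary and backup nodes. This is a simulation under assumptions, not Kim's protocol: the names `drb` and `DRBFailure` are hypothetical, the notification messages between nodes are omitted, and the timer part of the acceptance test is modelled by a shared deadline on each node's result.

```python
import concurrent.futures as cf
import time


class DRBFailure(Exception):
    """Raised when neither node delivers an acceptable result in time."""


def drb(primary_block, alternate_block, x, logical_test, deadline_s):
    """Run both try blocks concurrently and return (node_name, result).

    The primary node runs the primary try block and the backup node runs
    the alternate try block. A node's output is accepted only if it arrives
    before the deadline (timer) and passes logical_test(x, result).
    """
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        start = time.monotonic()
        nodes = [
            ("primary node", pool.submit(primary_block, x)),
            ("backup node", pool.submit(alternate_block, x)),
        ]
        for name, future in nodes:
            remaining = deadline_s - (time.monotonic() - start)
            try:
                result = future.result(timeout=max(remaining, 0))
            except Exception:
                continue  # timer expired, or the try block crashed
            if logical_test(x, result):
                return name, result
    raise DRBFailure("both nodes failed the acceptance test")
```

If the primary try block hangs past the deadline, the backup node's result, which was computed concurrently, is returned without recomputation. Note that the `with` block still waits for the hung thread to finish before returning, which a real node would not do.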

E. Component Based Recovery Blocks (CBRB)

Motivated by the advantages of modular structure, O. A. Abulnaja proposed a variant of the recovery block in 2005 and named it the “Component Based Recovery Block”. The basic procedure is the same as that of the traditional recovery block, but it does not test the whole system in one go. The system is divided into modules, and variants of each module are made. The primary module is tested first (it may be the fastest but the most complex). If it fails, the first alternate is run. If that fails as well, the next alternate is tested. A module that passes the acceptance test is considered reliable.
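The per-module selection described above can be sketched in Python as follows. The names `select_reliable_variant`, `build_component_system`, `ModuleFailure` and the two toy modules are hypothetical and serve only to illustrate applying a recovery block to each component separately.

```python
class ModuleFailure(Exception):
    """Raised when no variant of a module passes its acceptance test."""


def select_reliable_variant(name, variants, acceptance_test):
    """Return the first variant (primary first) that passes the test."""
    for variant in variants:
        try:
            if acceptance_test(variant):
                return variant
        except Exception:
            continue  # a crash during testing counts as a failure
    raise ModuleFailure(f"no variant of module {name!r} passed")


def build_component_system(modules):
    """modules maps a module name to (variants, acceptance_test).

    Each module is checked on its own instead of testing the whole
    system in one go; the result maps each name to its chosen variant.
    """
    return {
        name: select_reliable_variant(name, variants, test)
        for name, (variants, test) in modules.items()
    }


# Hypothetical system with two modules; the primary parser is faulty.
example_modules = {
    "parse": ([int, float], lambda f: f("2.5") == 2.5),
    "scale": ([lambda v: v * 10], lambda f: f(2) == 20),
}
```

Here `build_component_system(example_modules)` rejects `int` as the parser, since it cannot handle `"2.5"`, and falls back to `float`, while the `scale` module keeps its primary variant.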

The model increases system reliability by offering modularity and discouraging the monolithic system approach. It also improves the portability and maintainability of the system.

The series of studies undertaken by O. A. Abulnaja on the recovery block have shown that its effectiveness depends upon module diversity and a low probability of failing the acceptance tests [1]. Successful implementation of the technique has the following requirements:

1. Freedom of specification writing.

2. Independence in designing.

3. Few connected (correlated) faults.

4. Protection against hardware faults.

5. Modularity in software design.

6. High implementation cost [1].

Figure 2: Breaking a monolithic application into a component based application [1] (monolithic application; component application with components C-1, C-2, C-3, C-4)

The CBRB technique deals with the modular reliability of the system but does not address the integration issues involved. To deal with these issues, a new model, the “Adaptive Component Based Recovery Block”, is proposed as an advanced version of the component based recovery block.

III. ADAPTIVE COMPONENT BASED RECOVERY BLOCK

Since modular development is preferred over monolithic development, modern systems are composed of component based modular structures. The proposed model is therefore designed for modular systems and does not offer fault tolerance for monolithic systems. The model takes a single module as input, not the whole system. The proposed design requires not only a modular approach but also versioning of the modules.

At the entry of the RCB, the primary version of the system has to be checked. For this purpose the primary module is fed in as input and unit tested for reliability. Modules that pass the unit test are forwarded to the integration chamber, where they are tested for integration. Integration tests are very important for determining the compatibility among the modules. Modules that pass the integration test are moved to the ‘absolute chamber’, where system checking, i.e. black box testing, is performed.

The versions whose modules pass these three tests are considered reliable. If a module fails the unit test or the integration test, the existence of a new alternate is checked. If an alternate exists, the whole process is repeated until a reliable module is found. If no alternate exists, the module ID and related checkpoints are forwarded to the main version and a failure exception is raised, resulting in the rejection of the whole module. Failing any of these tests in this way also causes the failure of the version.

The working and technical effectiveness of this model are the same as those of traditional recovery blocks and component based recovery blocks. In addition, it provides an added advantage in the form of module diversity and integration testing, which saves time and resources during deployment.

Fig 3: Adaptive Component Based Recovery Block Model (RCB entry; Version 1 with modules M1, M2, M3, ..., Mn; execute; integration chamber (integration testing); absolute chamber (black box testing); on fail, new alternate or failure exception and restore; on pass, RCB exit)

General syntax:

Run Version 1

Run Module 1, Module 2, Module 3, ..., Module n

Ensure Unit Test

by Primary Module 1

else by Secondary Module 1

else by Tertiary Module 1

... (likewise up to Module n)

AND

Ensure Integration Test

by Primary Module n & Module n-1

else exception failure

compare results

If Result (Acceptance Test) = Result (Integration Test)

Ensure Black Box Testing

by integrated modules

else exception failure
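The three stages of the syntax above can be illustrated with the following Python sketch. It rests on several assumptions that go beyond the text: the names `adaptive_cbrb` and `VersionFailure` are hypothetical, both chambers are modelled as automated test functions even though they may involve human testers, "repeating the process with a new alternate" is implemented as trying combinations of unit-tested alternates in priority order, and a black box failure is treated as a failure of the version.

```python
from itertools import product


class VersionFailure(Exception):
    """Raised when a version cannot be assembled from reliable modules."""


def adaptive_cbrb(module_alternates, unit_tests, integration_test, black_box_test):
    """Return a tuple with one accepted alternate per module.

    module_alternates[i] lists the alternates of module i (primary first),
    unit_tests[i](alt) is that module's unit test, integration_test(a, b)
    checks two adjacent modules, and black_box_test(combo) checks the whole.
    """
    # Stage 1: unit test. Keep the alternates of each module that pass.
    passing = []
    for index, (alternates, unit_test) in enumerate(zip(module_alternates, unit_tests)):
        ok = [alt for alt in alternates if unit_test(alt)]
        if not ok:
            raise VersionFailure(f"module {index + 1}: no alternate passed the unit test")
        passing.append(ok)

    # Stage 2: integration chamber. Primaries are tried first; an
    # incompatible module is replaced by its next alternate.
    for combo in product(*passing):
        if all(integration_test(a, b) for a, b in zip(combo, combo[1:])):
            break
    else:
        raise VersionFailure("no combination passed the integration test")

    # Stage 3: absolute chamber. Black box test of the integrated version.
    if not black_box_test(combo):
        raise VersionFailure("integrated version failed the black box test")
    return combo


# Hypothetical modules: M1 parses a side length, M2 computes an area in m^2.
def to_millimetres(text):
    return float(text) * 1000  # passes its unit test, but wrong unit for M2


def to_metres(text):
    return float(text)


def area_m2(side_m):
    return side_m ** 2
```

With unit tests `lambda alt: alt("2") > 0` and `lambda alt: alt(3.0) == 9.0`, an integration test `lambda a, b: b(a("2")) == 4.0`, and a black box test `lambda combo: combo[1](combo[0]("3")) == 9.0`, the primary `to_millimetres` passes its unit test but is rejected in the integration chamber, and `to_metres` is selected instead. This is the kind of integration fault that per-module testing alone would not reveal.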

The concepts of versioning and modularization are very common. The integration and absolute chambers, however, need not be typical software based systems. Rather, both chambers refer to an environment in which a human can interact with the system under favorable conditions and perform the two tests (integration and black box, respectively).

The unit test (UT) is specific to the program under test, and all the acceptance criteria may be programmed as separate software whose output reflects the functional output of the particular module.

IV. ADAPTIVE COMPONENT BASED RECOVERY BLOCK EFFECTIVENESS

A series of experiments has been conducted to analyze the effectiveness of the proposed scheme. Different developers write specifications and construct components for every version. Thus, freedom of specification writing is achievable.

Each component of every program or program module is constructed by different developers using different algorithms, programming languages, design tools, compilers, implementation techniques, test approaches, etc. Hence design independence is met.

Furthermore, in this technique the components for every module of a version are constructed by different developers with varied experience and training, who may be widely spread geographically and use different programming languages. The risk of connected faults is therefore much lower, because the components joined together to make a specific version draw on a diversity of algorithms and tools.

Different components can be placed on different machines. Deploying each module on different hardware thus decreases hardware related faults.

Because software in the proposed scheme consists of components, a version is not a monolithic entity; rather, the model offers a ‘variant of the variants’. The modular or component based structure increases maintainability and eases change: replacing existing components with new ones speeds up the process of modification. In other words, system availability is improved. Finally, in this technique, instead of developing N+1 different modules for each program or program module, we can use (buy or rent) already existing components (modules). This will decrease the implementation cost. Thus, the sixth requirement for successful implementation of the RB scheme will be met [8].

From the above we can see that the adaptive component based recovery block technique meets the principal requirements for successful implementation of the RB scheme mentioned in [1], improving the system's reliability. But software systems are not only collections of modules; these modules have to be integrated to make bigger systems. The issues related to integration are dealt with by introducing the ‘integration chamber’, which independently checks the integration of reliable modules.

The overall system also needs to be checked for reliability after all the modules have been integrated. For this purpose black box testing is introduced as the last step.

These two tests assure that not only are the modules/components reliable individually, but the version composed of them is also reliable, and that no integration issue goes on to affect the system.

V. DISCUSSIONS & CONCLUSIONS

Since no model is perfect, it is important to analyze these models, their advantages and their shortcomings.

In the N-version programming model, the specifications must be very precise, although it is practically impossible to have ideal specifications. This lack of ideal specifications makes this fault tolerance model less effective. According to Daniels [9], NVP as well as recovery blocks “are based on software component redundancy and on assumption of rare correlated coincidental failure of components” [9], and they become very costly for a low-cost application system [10].

The self checking approach does not provide exhaustive testing of the code. Code coverage is not transparent, leaving the programmer unsure which portions of the code have been checked and which have not.

The recovery block is design diverse like the other software models, but it is complicated, as it requires the ability to roll back the state of the system after trying an alternate. It may have up to N alternates. This try and rollback ability makes the software appear highly transactional: only after a transaction is accepted is it committed to the system.

The distributed recovery block provides the combined services of the recovery block and parallel computing. Again, some shortcomings exist. If one of the two nodes passes, the algorithm is considered correct. But the state of the checking is still unknown; either of the nodes may have malfunctioned and be providing wrong results.

Along with the potential advantages of the traditional recovery block, the component based recovery block offers some further advantages. Running different modules on different hardware during testing helps to eliminate hardware faults. Software modification is favored by modularity. Design diversity and a low number of correlated faults are favored as well. But the implementation cost will rise.

Motivated by the effectiveness of recovery block (RB) techniques and of modularity in software systems, a more reliable scheme, the component based recovery block (CBRB), was studied in detail.

The proposed technique strongly discourages the monolithic system approach and promotes modularization of systems. It also adds two testing techniques, i.e. integration testing and black box testing, adding more reliability to the system. The scheme not only promotes fault tolerance but also offers inherent testing that adds to the reliability of the system under consideration.

This research work will serve as the foundation for future research work, including:

1. The implementation of an automated absolute chamber and integration chamber.

2. Modification of the unit test algorithm and standards.

REFERENCES

[1] O. A. Abulnaja, “Component Based Recovery Block,” AIML Journal, Volume (5), Issue (2), June 2005.

[2] V. Bharathi, “N-Version programming method of Software Fault Tolerance: A Critical Review,” National Conference on Nonlinear Systems Dynamics NCNSD 2003.

[3] C. Inacio, “Software Fault Tolerance - Dependable Embedded Systems,” Carnegie Mellon University, 1998.

[4] B. Randell, “System structure for Software Fault Tolerance,” IEEE Trans. Software Eng., vol. SE-1, pp. 220-232, June 1975.

[5] B. Randell and J. Xu, “The Evolution of the Recovery Block Concept,” in Software Fault Tolerance, M. R. Lyu, Ed. Wiley, 1995.

[6] T. Kikuchi and C. Kobayashi, “Communications networks and virtual economic integration: The case of three countries,” International Advances in Economic Research, Vol 9, Number 1.

[7] K. H. Kim, “The Distributed Recovery Block Scheme,” in Software Fault Tolerance, M. R. Lyu, Ed. Chichester: John Wiley & Sons, pp. 189-210, 1995.

[8] H. Hecht, “Fault tolerant software,” IEEE Transactions on Reliability, Volume (R-28): 227-232, 1979.

[9] F. Daniels, “The Reliable Hybrid Pattern-A Generalized Software Fault Tolerant Design Pattern,” PLoP ’97 Conference.

[10] G. K. Saha, “A Single-Version Algorithmic Approach to Fault Tolerant Computing Using Static Redundancy,” CLEI Electronic Journal, Vol 9, Number 2, Paper 9, Dec 2006.

