2012 IEEE Asia-Pacific Conference on Applied Electromagnetics (APACE 2012), December 11 - 13, 2012, Melaka, Malaysia
978-1-4673-3115-9/12/$31.00 ©2012 IEEE
Adaptive Modular Recovery Block
Ayesha Ubaid
Department of Comp Software Engg
Military College of Signals
National University of Sciences and
Technology
Islamabad, Pakistan
Ubaid-UR-Rehman
Department of Electrical Engineering
Military College of Signals,
National University of Sciences and
Technology
Islamabad, Pakistan
Muhammad Ali Abidi
Engineering Wing
Military College of Signals,
National University of Sciences and
Technology
Islamabad, Pakistan
Abstract: Achieving ideal fault tolerance in distributed systems is an increasing demand. This paper gives the reader a general understanding of fault tolerance and surveys several software fault tolerance models for design faults. The component based recovery block (CBRB) divides a system into modules and checks their reliability separately, but it does not address the integration issues that can cause instability. The proposed model addresses this unattended issue of CBRB to enhance the reliability of software.
Keywords: N-version programming (NVP), recovery blocks (RB), component based recovery blocks (CBRB).
I. INTRODUCTION
Software fault tolerance is a necessary component in the
construction of reliable computing systems, ranging from
embedded systems to data warehouse systems, and has been
an area of research since the 1960s [1]. Because software is
engineered rather than manufactured, the more closely it is
studied, the more research areas come to light.
Software fault tolerance is the ability of software to
detect and recover from a fault that occurs at runtime.
Software faults create hazardous circumstances, especially
in safety critical systems, so it is important to find ways
to reduce them [1]. Software fault tolerance techniques can
be classified mainly into design diverse techniques and data
diverse techniques. The former introduce design variant
components that help mask such faults. The latter introduce
data of different types to check for data related faults.
Since most failures are caused by design faults, this paper
focuses on models that deal with design faults. Component
based development is an increasing trend in the modern
software industry, and it is accompanied by different sorts
of testing and quality assurance techniques. The proposed
model not only offers design diverse fault tolerance but
also provides the developer with integration and black box
testing, which are inherent to the model.
This paper has five sections. Section II explains the
traditional techniques used for design fault masking.
Section III explains the proposed model, named ‘adaptive
component based recovery block’. Section IV explains the
effectiveness of the proposed model. Section V concludes
the paper.
II. EXISTING TECHNIQUES FOR DESIGN DIVERSE FAULTS
Design errors are very difficult to catch. Research has
been conducted on the problem, but there is still an extreme
lack of tools. Some useful models for design fault tolerance
include N-version programming (NVP), self-checking
software, recovery blocks, distributed recovery blocks and
component based recovery blocks.
A. N-version Model
In an N-version software system, each module has up to
N different implementations. Each variant accomplishes the
same task, but in a different way. Each version submits its
answer to a voter, which determines the correct answer. In
the case of NVP, removing independent faults will increase
reliability more “than for RB since the versions are run in
parallel” [8]. Keeping the specification flexible enough to
encourage design diversity while still maintaining
compatibility between versions is a difficult task [2]. The
general pseudocode is:
run Version 1, Version 2, ..., Version n
if (Decision Mechanism (Result 1, Result 2, ..., Result n))
    return Result
else failure exception
Figure 1: N-version programming structure
[Figure labels: NVP entry; distribute inputs; Version 1, Version 2, Version 3; gather; DM; exception raised; failure exception]
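The N-version pseudocode can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation. It assumes a strict-majority voter as the decision mechanism, runs the versions one after another instead of in parallel, and uses three hypothetical versions of a squaring routine (`version_1`, `version_2`, `version_3`) that do not appear in the paper.

```python
from collections import Counter
from typing import Callable, Sequence


class FailureException(Exception):
    """Raised when the decision mechanism cannot select a result."""


def run_nvp(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version on the same input and let a majority voter decide."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            continue  # a crashed version contributes no vote
    if not results:
        raise FailureException("all versions failed")
    # Decision mechanism (assumed): strict majority over all versions.
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(versions):
        raise FailureException("no majority among version results")
    return value


# Hypothetical, independently written versions of "square a number".
def version_1(x: int) -> int:
    return x * x


def version_2(x: int) -> int:
    return x ** 2


def version_3(x: int) -> int:
    return x + x  # deliberately faulty variant
```

With input 4 the two correct versions outvote the faulty one, so `run_nvp([version_1, version_2, version_3], 4)` returns 16. If no value reaches a majority, the sketch raises the failure exception.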
B. Self Checking Software
“Self-checking software is basically the extra checks
including some amount of check pointing and rollback
recovery methods that are added into fault-tolerant or safety
critical systems” [3]. Such checks are implemented in some
extremely reliable systems. The problem with self-checking
software is that it is not rigorous. For a specific fault, it is
very difficult to identify all the parts of the code involved
in causing that fault. Furthermore, it is hard to determine
how reliable a system has been made by self-checking
software. Without proper rigor, self-checking software
cannot be applied effectively. The general syntax (for n = 4) is:
run Variants 1 and 2 on Hardware Pair 1,
    Variants 3 and 4 on Hardware Pair 2
compare Results 1 and 2
    if not (match) set NoMatch1
    else set Result Pair 1
compare Results 3 and 4
    if not (match) set NoMatch2
    else set Result Pair 2
if NoMatch1 and not NoMatch2, Result = Result Pair 2
else if NoMatch2 and not NoMatch1, Result = Result Pair 1
else if NoMatch1 and NoMatch2, raise exception
else if not NoMatch1 and not NoMatch2
    then compare Result Pair 1 and 2
        if not (match), raise exception
        if (match), Result = Result Pair 1 or 2
return Result
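The selection logic above can be sketched in Python. This is an illustrative sketch under stated assumptions: the variants are plain integer functions, the two hardware pairs are simulated by ordinary function calls, and `None` stands in for the NoMatch flags. The names `run_pair` and `run_nscp` are hypothetical.

```python
from typing import Callable, Optional


class FailureException(Exception):
    """Raised when no trustworthy result can be selected."""


Variant = Callable[[int], int]


def run_pair(a: Variant, b: Variant, x: int) -> Optional[int]:
    """Run two variants and compare them; None plays the role of NoMatch."""
    result_a, result_b = a(x), b(x)
    return result_a if result_a == result_b else None


def run_nscp(v1: Variant, v2: Variant, v3: Variant, v4: Variant, x: int) -> int:
    pair_1 = run_pair(v1, v2, x)  # would run on Hardware Pair 1
    pair_2 = run_pair(v3, v4, x)  # would run on Hardware Pair 2
    if pair_1 is None and pair_2 is None:
        raise FailureException("both pairs disagree internally")
    if pair_1 is None:
        return pair_2  # only pair 2 is self-consistent
    if pair_2 is None:
        return pair_1  # only pair 1 is self-consistent
    if pair_1 != pair_2:
        raise FailureException("the two pairs disagree with each other")
    return pair_1
```

A single faulty variant is masked because its pair reports a mismatch and the other pair supplies the result. Faults in both pairs lead to the exception.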
C. Recovery Blocks
The recovery block method was developed by Randell
[4]. The primary block is executed first. If the primary
block fails the acceptance test, the first alternate is
executed. In practice the primary module is usually the most
complex and the fastest; the first alternate may be slower
and comparatively simple. If the first alternate passes the
acceptance test, its result is considered reliable and the
process stops. If it fails, the process continues in the same
manner until an alternate passes. If none of the alternates
passes, an error (exception) is reported, indicating that the
software could not perform the requested operation. The
general syntax is given by [5]:
ensure  Acceptance Test
by      Primary Alternate
else by Alternate 2
else by Alternate 3
...
else by Alternate n
else failure exception [5]
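A minimal Python sketch of this structure is given below. It assumes the system state is a dictionary, that a deep copy taken on entry serves as the recovery point, and that each alternate works on a fresh copy so a failed attempt is rolled back automatically. The sorting example (`primary`, `alternate_2`, `is_sorted`) is hypothetical and only for illustration.

```python
import copy
from typing import Callable, Sequence


class FailureException(Exception):
    """Raised when every alternate fails the acceptance test."""


def recovery_block(
    state: dict,
    alternates: Sequence[Callable[[dict], None]],
    acceptance_test: Callable[[dict], bool],
) -> dict:
    """Try each alternate in order, starting each from the checkpoint."""
    checkpoint = copy.deepcopy(state)  # recovery point taken on entry
    for alternate in alternates:
        candidate = copy.deepcopy(checkpoint)  # roll back before each try
        try:
            alternate(candidate)
        except Exception:
            continue  # a crash counts as a failed alternate
        if acceptance_test(candidate):
            return candidate  # the accepted state is committed
    raise FailureException("all alternates failed the acceptance test")


# Hypothetical example: sort the list stored under the key "data".
def primary(state: dict) -> None:
    state["data"] = state["data"][::-1]  # faulty: reverses instead of sorting


def alternate_2(state: dict) -> None:
    state["data"] = sorted(state["data"])


def is_sorted(state: dict) -> bool:
    data = state["data"]
    return all(a <= b for a, b in zip(data, data[1:]))
```

Calling `recovery_block({"data": [3, 1, 2]}, [primary, alternate_2], is_sorted)` rejects the faulty primary and returns the state produced by the second alternate, leaving the caller's original dictionary untouched.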
D. Distributed Recovery Blocks
The distributed recovery block (DRB) was formulated
by K. H. Kim and H. O. Welch to incorporate hardware and
software fault tolerance in a single real time application.
The approach is a combination of recovery blocks and
distributed processing. In this technique, the primary and
alternate modules are replicated on two computing nodes
“which are interconnected via networks” [6]. Both nodes
receive the input at the same time and compute the modules
in parallel. One is the primary node and the other is the
secondary (backup) node. Each of them holds the primary
and alternate versions of the program along with the
acceptance test. The acceptance test has two parts, the timer
and the logical acceptance test. The timer measures the
execution time, while the latter determines whether the
output is acceptable. The timer is started as soon as the
input is fed into the system. In Kim's terminology, the
primary module is referred to as the “primary try block” [7]
and the alternate module as the alternate try block. “In fault free circumstances, the backup node executes the primary alternative and secondary try block executes the alternative module if both will pass the acceptance test and update their local database” [7]. After success, the primary node forwards its
result to the backup node for notification and then sends
the result to the concurrent computing station. In case of
failure, the primary node notifies the backup node. Since
the backup node was computing concurrently, it immediately
provides its output and exchanges roles with the primary
node.
If the primary block stops executing entirely because of
an operating system crash or an application hang, the crash
is detected by the expiration of a local timer.
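The primary/backup arrangement can be sketched in Python as below. This is a simplified illustration, not Kim's scheme in full. The two nodes are simulated by two threads, the timer is checked only after a try block returns (so a genuine hang would not be detected here), and the role exchange is reduced to returning the backup's result. All names, including the integer square root try blocks, are hypothetical.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


class FailureException(Exception):
    """Raised when neither node produces an acceptable result."""


TryBlock = Callable[[int], int]
LogicalTest = Callable[[int, int], bool]


def run_node(block: TryBlock, x: int, test: LogicalTest,
             deadline_s: float) -> Optional[int]:
    """One node: run a try block, then apply the timer and the logical test."""
    start = time.monotonic()
    try:
        result = block(x)
    except Exception:
        return None
    too_slow = time.monotonic() - start > deadline_s
    return None if too_slow or not test(x, result) else result


def run_drb(primary_block: TryBlock, alternate_block: TryBlock, x: int,
            test: LogicalTest, deadline_s: float = 1.0) -> int:
    """Both simulated nodes receive the input and compute concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        primary_node = pool.submit(run_node, primary_block, x, test, deadline_s)
        backup_node = pool.submit(run_node, alternate_block, x, test, deadline_s)
        primary_result = primary_node.result()
        backup_result = backup_node.result()
    if primary_result is not None:
        return primary_result
    if backup_result is not None:
        return backup_result  # the backup node takes over
    raise FailureException("both nodes failed the acceptance test")


# Hypothetical try blocks for an integer square root.
def fast_sqrt(x: int) -> int:
    return x // 2  # deliberately faulty for most inputs


def safe_sqrt(x: int) -> int:
    return math.isqrt(x)


def accept(x: int, r: int) -> bool:
    return r * r <= x < (r + 1) ** 2
```

For input 16 the primary try block returns 8, which fails the logical test, so the result 4 computed concurrently by the backup is used instead.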
E. Component Based Recovery Blocks (CBRB)
Motivated by the advantages of modular structure,
O. A. Abulnaja proposed a variant of the recovery block in
2005, which he named “Component Based Recovery
Block”. The basic procedure is the same as that of the
traditional recovery block, but it does not test the whole
system in one go.
The system is divided into modules, and variants of each
module are made. The primary module is tested first (it may
be the fastest but the most complex). If it fails, the first
alternate is run; if that fails, the next alternate is tested.
A module that passes the acceptance test is considered
reliable.
The model increases system reliability by offering
modularity and discouraging the monolithic system
approach. It also increases the portability and
maintainability of the system.
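The per-module procedure can be sketched in Python as follows. This sketch assumes that the components form a simple pipeline in which each component's accepted output feeds the next, and that every component carries its own list of variants and its own acceptance test. The `Component` tuple layout and the example components are hypothetical.

```python
from typing import Callable, Sequence, Tuple


class FailureException(Exception):
    """Raised when every variant of a component fails its acceptance test."""


Variant = Callable[[int], int]
AcceptanceTest = Callable[[int, int], bool]
Component = Tuple[str, Sequence[Variant], AcceptanceTest]


def run_component(name: str, variants: Sequence[Variant],
                  test: AcceptanceTest, x: int) -> int:
    """A recovery block scoped to a single component."""
    for variant in variants:
        try:
            result = variant(x)
        except Exception:
            continue  # a crash counts as a failed variant
        if test(x, result):
            return result
    raise FailureException(f"component {name}: all variants failed")


def run_cbrb(components: Sequence[Component], x: int) -> int:
    """Run the components in order; each one recovers independently."""
    value = x
    for name, variants, test in components:
        value = run_component(name, variants, test, value)
    return value


# Hypothetical two-component system: double the input, then increment it.
components = [
    ("double", [lambda x: x + x + 1, lambda x: 2 * x],  # first variant is faulty
     lambda x, r: r % 2 == 0),
    ("increment", [lambda x: x + 1],
     lambda x, r: r > x),
]
```

Because recovery is local to each component, a faulty variant of one component is replaced without re-running the components that have already passed.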
The series of studies undertaken by O. A. Abulnaja on the
recovery block have shown that its effectiveness depends
upon module diversity and a low probability of failing the
acceptance tests [1]. Successful implementation of the
technique has the following requirements:
1. Freedom of specification writing.
2. Independence in designing.
3. Low connected faults.
4. Protection against hardware faults.
5. Modularity in software design.
6. High implementation cost [1].
Figure 2: Breaking a monolithic application into a component based
application [1]
The CBRB technique deals with the modular reliability
of the system but does not address the integration issues
involved. To deal with these issues, a new model, “Adaptive
Component Based Recovery Block”, is proposed as an
advanced version of the component based recovery block.
III. ADAPTIVE COMPONENT BASED RECOVERY BLOCK
Since modular development is preferred over
monolithic development, modern systems are composed of
component based modular structures. The proposed model
is therefore designed for modular systems and does not
offer fault tolerance for monolithic systems. The model
takes a single module as input, not the total system. The
proposed design requires not only a modular approach but
also versioning of the modules.
At the entry of the RCB, the primary version of the
system has to be checked. For this purpose the primary
module is fed in as input and unit tested for reliability.
Modules that pass the unit test are forwarded to the
integration chamber, where they are tested for integration.
Integration tests are very important for determining the
compatibility among the modules. Modules that pass the
integration test are forwarded to the ‘absolute chamber’,
where system checking, i.e. black box testing, is
performed.
The versions whose modules pass these three test criteria
are considered reliable. If a module fails the unit test or the
integration test, the existence of a new alternate is checked.
If an alternate exists, the whole process is repeated until a
reliable module is found. If no alternate exists, the module
id and related checkpoints are forwarded to the main version
and a failure exception is raised, resulting in the rejection
of the whole module.
Failing any of these tests also causes the failure of the
version.
The working and technical effectiveness of this model
are the same as those of traditional recovery blocks and
component based recovery blocks. In addition, the model
gives the system an added advantage in the form of module
diversity and integration testing. This leads to savings of
time and resources during deployment.
Fig 3: Adaptive Component Based Recovery Block Model
[Figure 3 labels: RCB entry; Version 1; execute; modules M1, M2, M3, ..., Mn; integration chamber (integration testing); absolute chamber (black box testing); pass; fail; new alternate (yes); restore; failure exception; RCB exit]
[Figure 2 labels: monolithic application; component application with components C-1, C-2, C-3 and C-4]
General syntax:
Run Version 1
    Run Module 1, Module 2, Module 3, ..., Module n
    Ensure Unit Test
        by Primary Module 1
        else by Secondary Module 1
        else by Tertiary Module 1
        ... Module n
    AND
    Ensure Integration Test
        by Primary Module n & Module n-1
        else exception failure
    compare results
    If Result (Acceptance Test) = Result (Integration Test)
        Ensure Black Box Testing
            by integrated modules
    Else exception failure
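One possible reading of this syntax is sketched in Python below. It is a sketch under several assumptions and not the authors' implementation. The modules of a version form a pipeline, the integration test checks each candidate against the previously accepted module, and both chambers are automated as plain functions. All names (`build_version`, `scale_a`, `offset_a`, `offset_b` and the test functions) are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


class FailureException(Exception):
    """Raised when a module has no reliable alternate or the version fails."""


Module = Callable[[int], int]
ModuleSpec = Tuple[str, Sequence[Module], Callable[[Module], bool]]


def build_version(
    specs: Sequence[ModuleSpec],
    integration_test: Callable[[Module, Module], bool],
    black_box_test: Callable[[List[Module]], bool],
) -> List[Module]:
    chosen: List[Module] = []
    for name, variants, unit_test in specs:
        for variant in variants:
            if not unit_test(variant):
                continue  # unit test failed: try the next alternate
            if chosen and not integration_test(chosen[-1], variant):
                continue  # integration chamber rejected it: next alternate
            chosen.append(variant)
            break
        else:
            raise FailureException(f"module {name}: no reliable alternate")
    if not black_box_test(chosen):  # absolute chamber
        raise FailureException("version failed black box testing")
    return chosen


# Hypothetical modules: scale the input by 10, then add an offset of 1.
def scale_a(x: int) -> int:
    return x * 10


def offset_a(x: int) -> int:
    if x >= 10:  # passes its unit test but cannot take scale_a's output
        raise ValueError("input out of range")
    return x + 1


def offset_b(x: int) -> int:
    return x + 1


def integration_test(prev: Module, nxt: Module) -> bool:
    try:
        nxt(prev(2))  # feed the previous module's output into the candidate
        return True
    except Exception:
        return False


def black_box_test(modules: List[Module]) -> bool:
    value = 3
    for module in modules:
        value = module(value)
    return value == 31


specs = [
    ("scale", [scale_a], lambda m: m(2) == 20),
    ("offset", [offset_a, offset_b], lambda m: m(1) == 2),
]
```

In this example `offset_a` passes its unit test but is rejected by the integration check, so `build_version(specs, integration_test, black_box_test)` selects `offset_b`. This is the kind of fault that a per-module acceptance test alone would miss.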
The concepts of versioning and modularization are
very common. The integration and absolute chambers,
however, need not be typical software based systems.
Rather, both chambers refer to an environment in which a
human can interact with the system under favorable
conditions and perform the two tests (integration and black
box testing, respectively).
The unit test (UT) is specific to the program under test,
and all the acceptance criteria may be programmed as
separate software whose output reflects the functional
output of the particular module.
IV. ADAPTIVE COMPONENT BASED RECOVERY BLOCK EFFECTIVENESS
A series of experiments has been conducted to analyze
the effectiveness of the proposed scheme. Different
developers write specifications and construct components
for every version. Thus, freedom of specification writing
is achievable.
The components of every program or program module
are constructed by different developers using different
algorithms, programming languages, design tools,
compilers, implementation techniques, test approaches, etc.
Hence design independence is met.
Furthermore, in this technique the components for every
module of a version are constructed by different developers
who vary in experience and training, may be widely spread
geographically, and use different programming languages.
Since these components are joined together to make a
specific version, this diversity of algorithms and tools
greatly lowers the risk of connected faults.
Different components can be situated on different
machines, so deploying each module on different hardware
decreases hardware related faults.
Because the software in the proposed scheme consists of
components, a version is not a monolithic entity; rather,
the model can be said to offer a ‘variant of the variants’.
The modular, component based structure increases
maintainability and eases change: replacing existing
components with new ones speeds up modification. In
other words, system availability is improved. Finally, in
this technique, instead of developing N+1 different
modules for each program or program module, already
existing components (modules) can be bought or rented.
This decreases the implementation cost. Thus, the sixth
requirement for successful implementation of the RB
scheme is met [8].
From the above we can see that the adaptive component
based recovery block technique meets the principal
requirements for successful implementation of the RB
scheme mentioned in [1], improving the system's
reliability. However, software systems are not merely
collections of modules; the modules have to be integrated
to make bigger systems. The issues related to integration
are dealt with by the introduction of the ‘integration
chamber’, which independently checks the integration of
reliable modules.
The overall system also needs to be checked for
reliability after all the modules have been integrated. For
this purpose black box testing is introduced as the last
step.
Together, the two tests assure that not only are the
modules/components reliable independently, but the
version composed of these modules is also reliable, and no
integration issues remain to affect the system.
V. DISCUSSIONS & CONCLUSIONS
Since none of the models is perfect, it is important to
analyze these models, their advantages and their
shortcomings.
In the N-version programming model, the specifications
must be very specific, although it is practically impossible
to have ideal specifications. This lack of ideal
specifications makes this fault tolerance model less
effective. According to Daniels [9], NVP as well as
recovery blocks “are based on software component
redundancy and on assumption of rare correlated
coincidental failure of components” [9] and become too
costly for a low-cost application system [10].
The self-checking approach does not provide
exhaustive testing of the code. Code coverage is not
transparent, leaving the programmer unsure which portions
of the code have been checked and which have not.
The recovery block is design diverse like the other
models, but it is complicated, as it requires the ability to
roll back the state of the system after trying an alternate.
It may have up to N alternates. This try and rollback
ability makes the software appear highly transactional:
only after a transaction is accepted is it committed to the
system.
The distributed recovery block provides the
combined services of the recovery block and parallel
computing. Again, some shortcomings exist. If one of the
two nodes passes, the algorithm is considered correct. But
the state of checking is still unknown; there is a chance
that one of the nodes has malfunctioned and is providing
wrong results.
Along with the potential advantages of the
traditional recovery block, the component based recovery
block offers some further advantages. Running different
modules on different hardware during testing helps
eliminate hardware faults. Modularity makes software
modification easier. Design diversity and a low rate of
correlated faults are favored as well. However, the
implementation cost will rise.
Motivated by the effectiveness of recovery block (RB)
techniques and by modularity in software systems, a more
reliable scheme, the component based recovery block
(CBRB), has been studied in detail.
The proposed technique strongly discourages the
monolithic system approach and promotes modularization
of systems. It also adds two testing techniques, integration
testing and black box testing, which add further reliability
to the system. The scheme not only promotes fault
tolerance but also offers inherent testing that increases the
reliability of the system under consideration.
This research will serve as the foundation for future
work, including:
1. The implementation of an automated absolute chamber
and integration chamber.
2. Modification of the unit test algorithm and standards.
REFERENCES
[1] O. A. Abulnaja, "Component Based Recovery Block," AIML Journal, vol. 5, no. 2, June 2005.
[2] V. Bharathi, "N-Version programming method of Software Fault Tolerance: A Critical Review," National Conference on Nonlinear Systems Dynamics (NCNSD), 2003.
[3] C. Inacio, "Software Fault Tolerance - Dependable Embedded Systems," Carnegie Mellon University, 1998.
[4] B. Randell, "System structure for Software Fault Tolerance," IEEE Trans. Software Eng., vol. SE-1, pp. 220-232, June 1975.
[5] B. Randell and J. Xu, "The Evolution of the Recovery Block Concept," in Software Fault Tolerance, M. R. Lyu, Ed. Wiley, 1995.
[6] T. Kikuchi and C. Kobayashi, "Communications networks and virtual economic integration: The case of three countries," International Advances in Economic Research, vol. 9, no. 1.
[7] K. H. Kim, "The Distributed Recovery Block Scheme," in Software Fault Tolerance, M. R. Lyu, Ed. Chichester: John Wiley & Sons, pp. 189-210, 1995.
[8] H. Hecht, "Fault tolerant software," IEEE Transactions on Reliability, vol. R-28, pp. 227-232, 1979.
[9] F. Daniels, "The Reliable Hybrid Pattern - A Generalized Software Fault Tolerant Design Pattern," PLoP '97 Conference.
[10] G. K. Saha, "A Single-Version Algorithmic Approach to Fault Tolerant Computing Using Static Redundancy," CLEI Electronic Journal, vol. 9, no. 2, paper 9, Dec. 2006.