Fault Tolerance by Design Diversity: Concepts and Experiments



Tolerance of design faults in software and in hardware is the challenge of the eighties. How can we meet the challenge? Design diversity offers a solution.


Algirdas Avizienis and John P. J. Kelly

UCLA Computer Science Department

Fault tolerance is the survival attribute of computer architectures; when a system is able to recover automatically from fault-caused errors, and to eliminate faults without suffering an externally perceivable failure, the system is said to be fault tolerant.1 Fault tolerance is supported by both hardware and software elements of the system, including redundant resources and detailed implementations of fault detection and recovery algorithms. Originally, fault-tolerant architectures were developed to tolerate physical faults that occur because of random failure phenomena in the hardware of a system. More recently, the tolerance of design faults, especially in software, has been added to the objectives of fault tolerance.

The use of redundant copies of hardware, data, and programs has proven to be quite effective in the detection of physical faults and in subsequent system recovery. However, design faults, which are introduced by human mistakes or defective design tools, are reproduced when redundant copies are made; such replication of faulty hardware or software elements fails to enhance the fault tolerance of the system with respect to design faults.

Design diversity is the approach in which the hardware and software elements that are to be used for multiple computations are not copies, but are independently designed to meet a system's requirements. Different designers and design tools are employed in each effort, and commonalities are systematically avoided. The obvious advantage of design diversity is that reliable computing does not require the complete absence of design faults, but only that those faults not produce similar errors in a majority of the designs. This and other advantages, as well as some limitations of design diversity, are discussed in this article.

Implementation and use of design diversity

The multiple-computation approach to fault tolerance. A very effective approach to the implementation of fault tolerance has been the execution of multiple (N-fold, with N ≥ 2) computations with the same objective: a set of results, derived from a given set of initial conditions and inputs. Two fundamental requirements apply to such multiple computations:

(1) the consistency of initial conditions and inputs for all N computations must be assured; and

(2) a reliable decision algorithm that determines a single decision result from the multiple results must be provided.

The decision algorithm may utilize only a subset of all N results for a decision; for example, the first result that passes an acceptance test may be chosen. It is also possible that the decision result cannot be determined, in which case a higher-level recovery procedure may be invoked. The decision algorithm itself is often implemented N times, once for each computation in which the decision result is used, in which case only one computation is affected by the failure of any particular decision element (such as the majority voter).
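As an illustration, the voting form of such a decision algorithm can be sketched in a few lines. Modern Python is used purely for illustration (it is obviously not from the article): a majority vote over the N results, returning an undetermined outcome that a higher-level recovery procedure would then handle.

```python
from collections import Counter

def majority_decision(results):
    """Decision algorithm: return the result produced by a majority of
    the N computations, or None when no majority exists (the caller
    must then invoke a higher-level recovery procedure)."""
    value, votes = Counter(results).most_common(1)[0]
    return value if votes > len(results) // 2 else None

# Three computations; one suffers a simplex fault:
print(majority_decision([42, 42, 41]))  # 42
# No majority; the decision result cannot be determined:
print(majority_decision([1, 2, 3]))     # None
```

In a real system this voter would itself be replicated N times, as noted above, so that the failure of one decision element affects only one computation.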

Multiple computations have been implemented by N-fold (N ≥ 2) replications in three domains: time (repetition), space (hardware), and program (software). The additional attribute of diversity will be addressed later. First, we must define and discuss a shorthand notation to describe the various possible approaches to multiple computation.

0018-9162/84/0800-0067$01.00 © 1984 IEEE. August 1984.


The reference, or simplex, case is that of one execution (simplex time 1T) of one program (simplex software 1S) on one hardware channel (simplex hardware 1H), described by the notation 1T/1H/1S. For simplex time (1T), multiple computation cases consist of concurrent execution of N copies of a program on N hardware channels (1T/NH/NS). Examples of such systems are NASA's Space Shuttle (N = 4), SRI's SIFT System (N ≥ 3), Draper Laboratory's Fault-Tolerant Multiprocessor (N = 3), and AT&T Bell Laboratories No. 1 ESS (N = 2) computers.1

In the N-fold time cases (NT), sequential executions of a computation are performed. Some transient faults are detectable by the repeated (N = 2) execution of the same copy of a program on the same machine (2T/1H/1S). A common practice in coping with system crashes has been the loading and running of a new copy of the program on the same hardware (2T/1H/2S). A more sophisticated fault-tolerance approach that is used in Tandem computers, for example, is the execution of a copy of the program on a standby duplicate channel (2T/2H/2S). The repetition of the same copy of a program on a duplicate channel (2T/2H/1S) is another possible variation.
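The shorthand is mechanical enough to generate. The sketch below (a hypothetical illustration in modern Python, not part of the article) renders the configurations named so far:

```python
def notation(t, h, s):
    """Render a multiple-computation configuration as xT/yH/zS:
    executions (time), hardware channels, and program copies."""
    return f"{t}T/{h}H/{s}S"

examples = {
    "simplex reference":        notation(1, 1, 1),  # 1T/1H/1S
    "Space Shuttle (N = 4)":    notation(1, 4, 4),  # 1T/4H/4S
    "retry of the same copy":   notation(2, 1, 1),  # 2T/1H/1S
    "reload of a new copy":     notation(2, 1, 2),  # 2T/1H/2S
    "Tandem standby duplicate": notation(2, 2, 2),  # 2T/2H/2S
}

for name, code in examples.items():
    print(f"{code:10s} {name}")
```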

The types of faults in multiple computations. The usual partitioning of faults into "single-fault" and "multiple-fault" classes needs to be refined when we consider multiple computations. Faults that affect only one in a set of N computations are called simplex faults. A simplex fault does not affect other computations, although it may be either a single or a multiple fault within one channel. Simplex faults are very effectively tolerated by the multiple-computation approach as long as input consistency and an adequate decision algorithm are provided.

Faults that affect M out of the N separate computations are called M-plex faults (2 ≤ M ≤ N); they affect computations that form a set which is used to arrive at a decision result. Typically, multiple occurrences of faults are divided into two classes: independent and related. We say that related M-plex faults are those for which a common cause, which affects M computations, exists or is hypothesized. Examples of common physical causes include interchannel shorts or "sneak paths," power-supply interruptions, and bursts of radiation. Their effects on the individual computations are not necessarily alike. Another class of related M-plex faults is quite different: it arises from human mistakes made during the design or the subsequent design modification of a hardware or software element. All N computations that use copies of that hardware or software element are ultimately affected in the same way by the design fault when a certain state of the element is reached.

Because of their differing consequences, related M-plex faults are likewise divided into two classes: design faults and operational faults. Design faults are generic to an entire set or family of hardware/software copies, since they result from a deviation of the design from its specification. All other related M-plex faults are operational. Their relationship is not traceable to a design mistake, but rather to the physical, logical, or temporal proximity of the affected computations.

The consequences of faults are errors that can be observed at points at which either single-computation fault-detection algorithms or N-computation decision algorithms are applied to the results. The M simplex errors that are caused by related M-plex faults may be either identical or distinct at the points of observation. Operational M-plex faults may produce errors of either type, although they are much more likely to cause distinct errors. Design faults will produce identical errors in all computations if identical copies of hardware and software elements are employed to execute multiple computations that have the same inputs and initial values. Such total susceptibility to design faults is the most serious limitation of the multiple-computation approach.


Design diversity in multiple computations. In N-fold implementations, an effective method of avoiding identical errors caused by design faults is to use independently designed software and/or hardware elements instead of identical copies. This approach applies directly to the parallel, simplex time system 1T/NH/NS, which can be converted to (1) 1T/NH/NDS (where NDS stands for N-fold diverse software, which is used in N-version programming,3 described below); (2) 1T/NDH/NS (where NDH stands for N-fold diverse hardware); and (3) 1T/NDH/NDS.

The sequential NT systems have been implemented as recovery blocks with N sequentially applicable alternate programs (NT/1H/NDS) that use the same hardware. An acceptance test is performed for fault detection, and the decision algorithm selects the first set of results that passes the test.
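The recovery-block scheme, sequential alternates filtered by an acceptance test, can be sketched as follows (modern Python, for illustration only; the sorting routines and acceptance test are invented for the example):

```python
def recovery_block(alternates, acceptance_test, inputs):
    """Try N sequentially applicable alternate programs on the same
    hardware; return the first set of results that passes the
    acceptance test, else escalate to higher-level recovery."""
    for alternate in alternates:
        results = alternate(inputs)
        if acceptance_test(results):
            return results
    raise RuntimeError("no alternate passed the acceptance test")

# Example: the primary contains a design fault (it forgets to sort);
# the independently written alternate recovers.
faulty_primary = lambda xs: list(xs)
alternate      = lambda xs: sorted(xs)
is_sorted      = lambda xs: all(a <= b for a, b in zip(xs, xs[1:]))

print(recovery_block([faulty_primary, alternate], is_sorted, [3, 1, 2]))
# [1, 2, 3]
```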

A number of variations and combinations of approaches can be readily identified, in addition to the recovery-block and N-version programming methods. First, the acceptance tests on individual versions can be used to support the decision algorithms in NVP. A consistent method for software exception handling has been constructed.4 Second, fault-detection algorithms in individual hardware channels are applicable in order to distinguish physical faults from design faults.

Third, the recovery-block approach can concurrently use two (or more) hardware channels, either copied or diverse: (N/2)T/2H/NDS or (N/2)T/2DH/NDS. A decision algorithm applied here can support the acceptance tests, since two sets of results are made available in parallel, and real-time constraints are met as long as one channel remains acceptable. The most general diverse system is, however, NT/NDH/NDS, with acceptance tests and detection of physical faults in each channel supporting the decision algorithm. Other systems result when some features are deleted or diversity is reduced.

Looking back at earlier machines, we may consider that "marginal testing" as used in first-generation hardware was a primitive method of imposing sequential hardware diversity on a single channel (2T/2DH/1S), and that repetition of a computation with a different clock rate may be thought of as a form of "time diversity": 2DT/1H/1S.

The use of design diversity introduces "similarity" considerations about results and errors. First, the decision algorithm may have to determine the decision result from a set of similar, but not identical, results. Similar results are defined as two or more results (good or erroneous) that are within the range of variation allowed by the decision algorithm; they are therefore used to determine the decision result. Similar results often differ within a certain range when different algorithms are used in diverse designs. Consequently, similar, but not identical, errors can cause the decision algorithm to arrive at an erroneous decision result. The errors are now classified as either distinct or similar (identical errors form a subset of similar errors). Finally, we note that a new type of M-plex design faults, independent M-plex design faults, can cause similar errors and lead to an erroneous decision result.
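An allowed range of variation turns exact voting into inexact voting. The sketch below (hypothetical modern Python, not from the article) groups numeric results that agree within a tolerance and treats the largest group of similar results as the basis for the decision:

```python
def inexact_majority(results, tolerance):
    """Group numeric results that agree within `tolerance` (i.e., are
    "similar") and return a representative of the largest group if it
    forms a majority, else None (decision undetermined)."""
    best = []
    for r in results:
        group = [x for x in results if abs(x - r) <= tolerance]
        if len(group) > len(best):
            best = group
    if len(best) > len(results) // 2:
        return sum(best) / len(best)   # representative of the group
    return None

# Diverse algorithms returning slightly different values still agree:
print(inexact_majority([3.141, 3.1415, 9.9], tolerance=0.01))
# No two results are similar, so no decision is reached:
print(inexact_majority([1.0, 2.0, 3.0], tolerance=0.1))  # None
```

Note that similar, but not identical, errors would pass such a decision undetected, which is exactly the hazard the article identifies.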

Eliminating related design faults. To reduce the likelihood of related design faults in a set of computations, it is necessary to employ (1) independent design and implementation techniques, such as diverse algorithms, programming languages, translators, design automation tools, and machine languages; and (2) independent (noninteracting) programmers or designers (preferably with a variety of training and experience).

The third and probably the most critical condition is the existence of a complete and accurate initial formulation of the requirements that the diverse designs are to meet. This is the "hard core" of this fault-tolerance approach. Latent defects such as inconsistencies and ambiguities in, or omissions from, the initial formulation are likely to bias otherwise entirely independent programming or logic-design efforts toward related design faults.

The most promising approach to the production of the initial formulation is the use of formal, very high level specification languages. Such specifications are executable and can be automatically tested for latent defects. Here, perfection is required only at the highest level of specification; the rest of the design and implementation process, as well as its tools, are required to be only as fault-free as existing resource constraints and time limits permit. The independent generation and subsequent comparison of two specifications, using two formal languages, is the next step toward increasing the dependability of current specifications.

Potential advantages of design diversity. The most immediate and direct opportunity to apply design diversity is in multichannel systems, such as SIFT, that have very complete tolerance of physical faults. The hardware resources and architectural features that can support software diversity are already present, and their implementation of diversity is an extension of existing physical fault-tolerance mechanisms. Hardware diversity can be introduced by choosing, for each channel, functionally compatible hardware building blocks from different suppliers.

A more speculative, and also more general, application of design diversity is its use as a partial substitute for current software verification and validation procedures. Instead of extensive V&V of a single program, two independent versions can be executed in an operational environment, completing V&V concurrently with productive operation. The doubled cost of producing the software is compensated for by a reduction of the V&V time. In some situations, there may also be a decrease in the amount, and therefore the cost in man-hours and software tools, needed for the very thorough V&V effort.

Design faults in the V&V software are also less critical. The user can take the less efficient ("back-up") version off line when adequate reliability of operation is reached, and then bring it back on line for special operating conditions that require greater reliability assurance, especially after modifications or after maintenance. A potential system-lifetime cost reduction exists because a duplex, diverse system can support continued operation after latent design faults are uncovered, providing nearly 100 percent availability. The cost of fault analysis and elimination should also be reduced because repairs are less urgent.


The possible use of a "mail order" approach to the production of two or more versions of software modules suggests a very intriguing long-range benefit of the design-diversity approach in software. Programmers, working at their own preferred times and locations and using their own personal computing equipment, can write the versions of software if they are given a formal specification that includes a set of acceptance tests. Two potential advantages of this approach are

* The overhead cost that accrues in highly controlled professional programming environments would be drastically reduced, since this approach allows the programmers free play to work on their own initiative and utilize low-cost home facilities.

* The potential of the rapidly growing number of computer hobbyists to serve as productive programmers would be tapped. For various reasons, many individuals with programming talents cannot fill the professional programmer's role as defined by today's rigorous approaches to quality control and use of centralized sites during the programming process.

Finally, an important reliability and availability advantage to be gained through design diversity may be expected for systems that use VLSI circuits. The growing complexity of VLSI circuits, with 500,000 gates/chip now available and one million gates/chip predicted for the near future, raises the probability of design faults, since a complete verification of the design is very difficult to attain. Furthermore, the design automation and verification tools are themselves subject to undiscovered design faults. Even with multichannel fault-tolerant system designs, a single design fault may require the replacement of all chips of that type, since on-chip modifications are impractical. Such a replacement is a costly and time-consuming process. However, use of design diversity of VLSI circuits does allow the continued use of chips with design faults, so long as their errors are not similar at the circuit boundaries where the decision algorithm is applied. Reliable operation throughout the lifetime of a system can be obtained even if none of the chips has a perfect design and none of the basic structure of the VLSI circuits has been modified.

Development of multiversion software: an experimental approach

Because of the potential advantages of design diversity and design fault tolerance, further studies have been conducted to determine their usefulness as alternatives to V&V, testing, and proof methodology, which aim at delivering fault-free software products and hardware circuits. A research effort at UCLA (started in 1976) was founded on 14 years of continuous investigation into tolerance of physical faults; the goal was to determine the feasibility of adapting N-fold modular redundancy (with majority voting) to provide fault tolerance of software design faults. The first experimental study of this approach, called "N-Version Programming," was completed in 1978. However, suggestions that this approach might be a viable method of software fault tolerance had been published previously.6 In fact, Dionysius Lardner first made the suggestion in his article "Babbage's Calculating Engine," published in the Edinburgh Review in July 1834:

The most certain and effectual check upon errors which arise in the process of computation, is to cause the same computations to be made by separate and independent computers; and this check is rendered still more decisive if they make their computations by different methods.7

A second approach, already under investigation in 1975, is the recovery block technique, in which alternate software versions are organized in a manner similar to the dynamic redundancy (standby sparing) technique used in hardware. The objective of the RB technique is to perform software design fault detection during runtime by an acceptance test performed on the results of one version. If the test fails, an alternate path of execution is taken to implement recovery. This technique is currently being investigated at several locations, and several important research activities related to NVP and RB have been reported recently.3,4,8-11

Initial studies of N-version programming. N-version programming is defined as the independent generation of N ≥ 2 software modules, called "versions," from the same initial specification.2 "Independent generation" means that programming efforts are carried out by individuals or groups that do not interact with respect to the programming process. Wherever possible, different algorithms, programming languages, and translators are used in each effort.

The goal of the initial specification is to state the functional requirements completely and unambiguously, while leaving the widest possible choice of implementations to the N programming efforts. It also states all the special features that are needed to execute the set of N versions in a fault-tolerant manner.6 An initial specification defines (1) the function to be implemented by an N-version software unit; (2) data formats for the special mechanisms: decision vectors ("d-vectors"), decision status indicators ("ds-indicators"), and synchronization mechanisms; (3) the cross-check points ("cc-points") for d-vector generation; (4) the decision algorithm; and (5) the responses to the possible outcomes of decisions. We note that "decision" is used as a general term to mean the determination of a decision result from multiple versions. The term "comparison" usually refers to the N = 2 case, and "voting" to a majority decision with N > 2. The decision algorithm explicitly states the range of variation, if such a range exists, that is allowed in determining whether version results are similar.
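The execution scheme defined by such a specification, N versions run against common inputs, with a d-vector formed and voted on at each cc-point, can be sketched as follows (hypothetical modern Python; the three "versions" are toy stand-ins invented for the example):

```python
def run_nvp(versions, cc_inputs, decide):
    """Execute N independently written versions in lockstep; at each
    cross-check point, form the decision vector (d-vector) from the
    versions' results and apply the decision algorithm to it."""
    decisions = []
    for x in cc_inputs:
        d_vector = [v(x) for v in versions]   # results at this cc-point
        decisions.append(decide(d_vector))
    return decisions

# Three hypothetical versions of a squaring routine; one contains a
# design fault that is triggered only for x >= 3.
v1 = lambda x: x * x
v2 = lambda x: x ** 2
v3 = lambda x: x * x if x < 3 else 0

vote = lambda d: max(set(d), key=d.count)     # simple majority pick
print(run_nvp([v1, v2, v3], [1, 2, 3], vote)) # [1, 4, 9]
```

At the cc-point where the faulty version produces a distinct error, the two healthy versions still agree, so the decision result is correct.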

It is the fundamental assumption of the N-version approach that the independence of programming efforts will greatly reduce the probability that software design faults will cause similar errors to occur in two or more versions. Together with a reasonable choice of d-vectors and cc-points, the independence of design faults is expected to make N-version programming an effective method of achieving design fault tolerance. The effectiveness of the entire approach depends on the validity of the assumption, and an experimental investigation, deemed essential, was undertaken at UCLA.

The initial research effort at UCLA addressed two questions: (1) What is required, in terms of formal specifications, choices of problems, and the nature of algorithms, timing constraints, and decision algorithms, to make N-version programming feasible, regardless of the cost? (2) What methods should be used to compare the cost and the effectiveness of the N-version programming approach to the two alternatives: single-version programming and the recovery-block approach?

The scarcity of previous results and an absence of formal theories on N-version programming led to the choice of an experimental approach: to choose some conveniently accessible programming problems, to assess the applicability of N-version programming, and then to generate a set of programs. Once generated, the programs were executed in a simulated multiple-hardware system, and the resulting observations were applied to refine the methodology and to formulate theories of N-version programming. A more detailed review of the UCLA research approach and a discussion of two sets of experimental results, one using 27 and the other using 16 independently written programs, have been published previously.2,6

The specification-oriented multiversion software experiment. Early research demonstrated the practicality of experimental investigation and confirmed the need for high-quality software specifications. The principal aim of the subsequent research was the investigation of software specification techniques. Other aims were investigating the types and causes of software design faults, proposing improvements both in software specification techniques and in the use of those techniques, and proposing future experiments in the investigation of fault-tolerant design in software and in hardware.

To examine the effect of specification techniques on multiversion software, an experiment was designed in which three different specifications were used.* The first was written in the formal specification language OBJ. The second specification language used was the nonformal PDL, which was characteristic of current industry practice. The English language was used as the third, "control," specification language, since English had been used in the previous studies. (A specification is "formal" if it is written in a language with explicitly and precisely defined syntax and semantics.) Formal specifications have some very advantageous properties: they can be studied mathematically; they can be mechanized and tested to gather empirical evidence of their correctness; and they can be computer-processed to remove ambiguities, to eliminate inconsistencies, and to be made complete enough for empirical testing. The precision of formal languages makes writing rigorous specifications easier with them than with nonformal languages.

OBJ was chosen for formal specification because the mechanism necessary to construct and test specifications with it was available at UCLA, as were persons experienced in OBJ use. This personal experience proved to be important, since, as for all other formal specification languages examined, the existing documentation was quite inadequate. OBJ did, however, promote modularity and explicit handling of error conditions.

The nonformal specification language, PDL, lacks the power and sophistication of OBJ, but it does have adequate documentation, is reasonably well known, and has been in use in industry for several years. Writing specifications in PDL is straightforward, while understanding them depends largely on the care taken by the writer. PDL provides extensive cross-referencing and indexing features that would be very useful in OBJ. Specifications written in PDL, however, do tend to be rather long.

The problem chosen for the experiment was an "airport scheduler" exercise. This database problem concerns the operation of an airport, in which flights are scheduled to depart for other airports and seats are reserved on those flights. The problem was discussed originally by Ehrig, Kreowski, and Weber14 and later used to illustrate OBJ by Goguen and Tardo.15 Because the problem is transaction-oriented, the natural choice of N-version cross-check points was at the end of each transaction. With the OBJ specification as a basis, one specification was written in PDL and another in English. The OBJ specification was 13 pages long, the English specification took 10 pages, and the PDL required 74 pages.

*OBJ is a formal algebraic specification language developed at UCLA from 1976 to 1979. PDL is a nonformal program design language, developed commercially as a program design and documentation tool.

Execution of the experiment. Programmers with reasonable proficiency in PL/1 were recruited among the computer science students at UCLA. Each was assigned to work with one of the three specifications; no specification was tackled by a group whose overall range of abilities was not representative of the total range. The programmers were given a realistic deadline and a monetary incentive to produce programs of at least minimal quality by the deadline. The experiment proceeded in several steps:

(1) recruiting;
(2) teaching OBJ and PDL;
(3) examining and ranking;
(4) assigning the problem; and
(5) evaluating programs.

A seminar was held at UCLA to announce our need for participants, and we recruited 30 programmers whose abilities ranged from good to excellent, who were senior or graduate students, and whose professional experience ranged from none at all to several years. The next stage was the presentation of a one-day course on OBJ and PDL, which was necessary because the programmers lacked familiarity with OBJ and had very little familiarity with PDL. Study material was distributed and an examination was held two days later, at which the 30 participants were ranked as good, average, or poor. The members of each group were then assigned (in roughly equal numbers) to use the OBJ, PDL, and English specifications, respectively, in such a way as to avoid loading any of the specifications with either predominantly good or predominantly bad programmers.

COMPUTER 72


At a subsequent meeting, each programmer was given a packet containing the specification; a notebook to record programmer effort, bugs encountered, and other problems; and a questionnaire on the specification and its use. It was also made clear that the programmers would not be paid for their work unless their programs passed a straightforward viability test. While an example of a typical viability test was given, the actual test to be used was not revealed. The programmers were requested not to discuss the project with other participants, and the goal of the experiment was once again carefully explained to support this request. Programmers were expected to analyze their specification and were allowed to ask questions individually. To preserve independence, however, group discussions were not held.

At the end of a four-week period, 18 of the 30 programmers returned working program versions of the airport scheduler written in PL/1. Of the 18 program versions, seven were written from the OBJ specification, five from the PDL specification, and six from English. All 18 programs were run with the standard viability test. After minor modifications were made to two programs by the original programmers, all 18 were judged satisfactory and were prepared for more detailed testing.

To conduct the more extensive testing, a very demanding set of 100 transactions was developed in an attempt to exercise as many features of the programs as possible. The immediate consequence of running the individual programs with these transactions was the discovery that 11 of the 18 programs aborted on invalid input. This had been noted as a very dangerous situation to encounter in N-version programming.6 When executed in sets of three, one aborting version usually causes operating-system intervention in all versions, effectively allowing the bad version to outweigh two otherwise healthy versions. To fix this situation, we provided all programs with an acceptance test that detected and attempted recovery from these otherwise catastrophic faults.
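The protective wrapping described above can be illustrated with a small sketch (ours, in Python rather than the experiment's PL/1 setting; the version and transaction shown are hypothetical): each version runs under an acceptance test that converts an abort into a "detected error" signal for the decision algorithm, so one crashing version cannot drag down the whole triple.

```python
# Hypothetical sketch of the abort-catching acceptance test; the
# experiment's actual versions were PL/1 programs.

def run_with_acceptance_test(version, transaction):
    """Run one version; turn any abort into a 'detected error' (D)."""
    try:
        result = version(transaction)
    except Exception:
        return ("D", None)        # abort caught; recovery is attempted
    return ("OK", result)         # result still subject to the vote

# A deliberately fragile version: it aborts on short (invalid) input.
def fragile_version(transaction):
    fields = transaction.split(",")
    return ("ACK", fields[4].strip())    # IndexError on short input

print(run_with_acceptance_test(fragile_version, "CANCEL 999"))
# -> ('D', None): the abort can no longer crash the whole triple
```

With the wrapper in place, an abort is reported to the decision algorithm as one detected error rather than ending all three computations.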

Results of the experiment

Program sizes and time requirements varied considerably (see Table 1). To conduct testing of individual versions, all 18 program versions were first executed with an 18-version decision algorithm in order to completely define the expected result for each of the 100 transactions. It would have been preferable to execute the OBJ specification to determine the expected result. Unfortunately, OBJ was not powerful enough to allow complete specification of all implementation constraints. This limitation is common to many formal specification languages.

Next, each version was executed alone, and the result of each case was compared to the expected result and classified as follows:

* "OK point," if the result was the expected result;
* "cosmetic error," if the result contained correct data but had, for example, misspelled text or contained bad formatting;
* "good point" (code G), if the result was an OK point or a cosmetic error;
* "detected error" (code D), if the program version failed the acceptance test that had been added to detect and attempt recovery from abort conditions;
* "undetected error" (code U or U*), if the version passed the acceptance test but was in error; in other words, errors were detectable only by comparison with the expected result. U is a distinct error that differs from the results of the other two versions, while U* is an error that is similar for two or all three of the versions.
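The classification above can be rendered as a small scoring routine (an illustrative sketch of ours; the function names and the normalization step are our own, not the study's):

```python
# Illustrative sketch: assign the result codes used in this study.
# The normalize() step stands in for the cosmetic-error screening.

def normalize(r):
    return " ".join(str(r).upper().split())

def classify(result, expected, passed_acceptance_test, others):
    """Return 'G', 'D', 'U', or 'U*' for one version's output.

    others: outputs of the two companion versions, used only to
    split undetected errors into distinct (U) vs. similar (U*).
    """
    if not passed_acceptance_test:
        return "D"                            # detected error
    if result == expected:
        return "G"                            # OK point
    if normalize(result) == normalize(expected):
        return "G"                            # cosmetic error: still good
    return "U*" if result in others else "U"  # undetected error

print(classify("flight 6 ok", "FLIGHT 6 OK", True, []))   # -> G
```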

The individual test results are summarized in Table 2. All 816 possible triplets of the 18 versions (see Table 3) were then executed with a three-version decision algorithm, using 100 transactions for each triplet. There were now three version results to consider for each transaction,

Table 1. Characteristics of the 18 versions of programs written in OBJ, PDL, and English by UCLA study participants. For each program version, the number of PL/1 statements used in the program (PL/1 STMTS), the number of procedures used (NO. PROCS), the compile time (COMPILE TIME), the execution time for the 100-transaction test case (EXEC. TIME), and the program size in bytes (SIZE, BYTES) are shown. The time unit is a machine-unit second, which is a measure of time and other resources, such as I/O needs.

VERSION   PL/1 STMTS   NO. PROCS   COMPILE TIME   EXEC. TIME   SIZE (BYTES)
OBJ1      423          22          15.14          3.89         37600
OBJ2      400          28          11.35          3.96         28048
OBJ3      398          17           7.42          4.33         30904
OBJ4      328          14           8.62          4.77         29920
OBJ5      455          14          14.79          3.10         32304
OBJ6      243          16           4.71          2.70         20960
OBJ7      336          23           8.30          4.92         34808
PDL1      455          27          16.96          3.16         24928
PDL2      501          33          19.58         19.68         29656
PDL3      242          19           4.31          4.09         27360
PDL4      437          39          16.31          2.84         30896
PDL5      217          11           4.26          4.30         26440
ENG1      260          21           4.75          3.33         27552
ENG2      372          19          12.41          3.89         37792
ENG3      385          30           8.12          2.41         20648
ENG4      689          25          28.23          2.94         26864
ENG5      481          15           8.76          2.42         24056
ENG6      387          12          19.23          3.99         24656

August 1984 73


Table 2. Test results for individual versions of the 18 programs written in OBJ, PDL, and English.

VERSION   OK POINTS   COSMETIC ERRORS   GOOD (OK OR COS.)   DETECTED ERRORS   UNDETECTED ERRORS
OBJ1      73           0                73                   2                25
OBJ2      71          18                89                   8                 3
OBJ3      67          11                78                   4                18
OBJ4      69           3                72                   8                20
OBJ5      67          12                79                   0                21
OBJ6      46           0                46                   0                54
OBJ7      52          17                69                   7                24
PDL1      59           2                61                   1                38
PDL2      54           2                56                  32                12
PDL3      95           0                95                   4                 1
PDL4      45          28                73                   0                27
PDL5      94           0                94                   5                 1
ENG1      74          12                86                   0                14
ENG2      67          27                94                   0                 6
ENG3      97           1                98                   0                 2
ENG4      30           5                35                  25                40
ENG5      55           6                61                   0                39
ENG6      53           3                56                   9                35

with the results coded as shown above. Results of the types G, U, and U* all passed the acceptance test and can only be classified here because the expected result had been determined in advance.

Now let us consider the decision algorithm that was used to produce one decision result from the three separate version results. Of the 64 possible triple combinations of the version result codes (G, D, U, U*), 27 contain a single U* code. Since a single U* error is meaningless, these combinations are discarded to leave 37 remaining combinations. These fall into 14 separate categories, as illustrated in Table 4. We review these 14 categories with respect to the number of versions used by the decision algorithm. The detected errors D were excluded by the decision algorithm. Table 5 summarizes the decision results of all 81,600 triple computations.
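The combination counts quoted above — 64 triples, 27 discarded for a lone U*, 37 remaining, 14 unordered categories — can be verified mechanically (a checking sketch of ours, not part of the original study):

```python
# Checking sketch for the combination counts given in the text.
from itertools import product
from collections import Counter

CODES = ("G", "D", "U", "U*")

triples = list(product(CODES, repeat=3))
print(len(triples))                    # 64 possible triples

# A lone U* is meaningless: "similar" needs at least two occurrences.
valid = [t for t in triples if t.count("U*") != 1]
print(len(valid))                      # 37 remain (64 - 27 discarded)

# Order is irrelevant to the decision algorithm, so the 37 ordered
# combinations collapse into unordered categories.
categories = Counter(tuple(sorted(t)) for t in valid)
print(len(categories))                 # 14 categories, as in Table 4
```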

Triplex decisions. The two cases in which all three results are similar, V1 = V(G,G,G) and V14 = V(U*,U*,U*), are indistinguishable to the decision algorithm and are both given the highest confidence rating (see Table 5). Of course, V1 produced the expected result and V14 did not. Also, we expect three similar errors (V14) to be rare. Actually, V1 occurred in 44.9 percent of the decisions and V14 in only 0.1 percent.

Table 3. A list of the three-version triplets, where O designates an OBJ version, P designates a PDL version, and E designates an English version.

TRIPLET COMPOSITION   NUMBER OF TRIPLETS
OOO                    35
PPP                    10
EEE                    20
OPE                   210
OOP                   105
OOE                   126
OPP                    70
OEE                   105
PPE                    60
PEE                    75
ALL                   816

In three cases, two of the three results were similar and therefore outweighed the third dissenting version. The case V3 = V(G,G,U) produced the expected result and appeared in 27.1 percent of the decision results; V7 = V(G,U*,U*) allowed two similar errors to outweigh a good result and occurred 2.4 percent of the time; V13 = V(U,U*,U*) occurred 0.4 percent of the time. Again, these three cases were indistinguishable to the decision algorithm, and all were given a confidence level of "two."

In two cases, V6 = V(G,U,U) and V12 = V(U,U,U), the decision algorithm identified three distinct version results and so declared that a decision result could not be obtained. V6 occurred in 8.4 percent of the decisions and V12 in 1.7 percent. The confidence level was declared to be "zero."

Duplex decisions. When the acceptance test of a program version detected an internal error, it signaled the decision algorithm and its result was not used in the decision. The remaining two version results were used in a duplex decision.

In two cases, V2 = V(G,G,D) and V11 = V(U*,U*,D), the results were similar and gave either the expected result (V2) or an error (V11). V2 occurred in 6.5 percent of the decisions and V11 in 0.1 percent. Both cases were given a confidence level of "two."

In V5 = V(G,D,U) and V10 = V(D,U,U), two distinct results remained after error detection and the decision algorithm declared that there was no decision result. These cases were said to have a confidence level of "zero" and occurred in 4.9 percent and 1.1 percent of the decisions, respectively.

Simplex decisions. When two versions failed their acceptance tests, the decision algorithm made the simplex decision of accepting the remaining result as valid with a confidence level of "one." Case V4 = V(G,D,D) gave the expected result and occurred in 1.6 percent of the decision results, while V9 = V(U,D,D) gave an undetected error and occurred in 0.6 percent of the decisions. The confidence level of "one," which they were given, may be insufficient in some applications; the result in such instances would be declared not sufficiently dependable.

Null decisions. When all three versions failed acceptance tests, there was no decision result; consequently, the confidence level was declared "zero." Case V8 = V(D,D,D), the null decision case, occurred in 0.2 percent of the decisions.
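Taken together, the triplex, duplex, simplex, and null cases amount to a compact decision procedure. The sketch below is our rendering of that logic (the no-decision outcome is written as None; similarity of undetected errors is judged, as in the study, by comparing the result values themselves):

```python
# Our rendering of the three-version decision logic (Table 4); the
# no-decision outcome is represented by None with confidence zero.

def decide(codes, results):
    """codes: 'G', 'D', or 'U' per version (U covers U and U*);
    results: the corresponding outputs. Returns (decision, confidence)."""
    # Detected errors are excluded by the decision algorithm.
    live = [r for c, r in zip(codes, results) if c != "D"]

    if len(live) == 0:                    # null decision (V8)
        return (None, 0)
    if len(live) == 1:                    # simplex (V4, V9)
        return (live[0], 1)
    if len(live) == 2:                    # duplex (V2, V5, V10, V11)
        return (live[0], 2) if live[0] == live[1] else (None, 0)
    for r in live:                        # triplex cases
        if live.count(r) == 3:
            return (r, 3)                 # unanimous (V1, V14)
        if live.count(r) == 2:
            return (r, 2)                 # two similar outweigh one
    return (None, 0)                      # three distinct (V6, V12)

print(decide(("G", "G", "D"), ("ok", "ok", None)))   # V2 -> ('ok', 2)
```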

Types and causes of errors. Several program versions produced many cosmetic errors, such as a result misalignment or a misspelling in the format, and a preprocessor to the decision algorithm was implemented to attempt to fix them. The cosmetic errors were caused by disregard of the specification and by personalization of the result. Personalization seems to occur in response to the programmer's need to leave a mark of individuality. Explaining that program results were to be computer processed and

Table 4. The three-version decision algorithm, where G = good point, D = detected error, and U or U* = undetected error (either distinct or similar, respectively). The confidence level describes how many version results the decision algorithm used in obtaining the decision result.

CASE   VERSION RESULTS   NO. OF ERRORS   DECISION ALGORITHM   DECISION RESULT   CONF. LEVEL
V1     G,G,G             0               TRIPLEX              G                 3
V2     G,G,D             1               DUPLEX               G                 2
V3     G,G,U             1               TRIPLEX              G                 2
V4     G,D,D             2               SIMPLEX              G                 1
V5     G,D,U             2               DUPLEX               none              0
V6     G,U,U             2               TRIPLEX              none              0
V7     G,U*,U*           2               TRIPLEX              U*                2
V8     D,D,D             3               NULL                 none              0
V9     U,D,D             3               SIMPLEX              U                 1
V10    D,U,U             3               DUPLEX               none              0
V11    U*,U*,D           3               DUPLEX               U*                2
V12    U,U,U             3               TRIPLEX              none              0
V13    U,U*,U*           3               TRIPLEX              U*                2
V14    U*,U*,U*          3               TRIPLEX              U*                3

Table 5. The results of the decision algorithm for all 81,600 triple computations, by triplet composition (count and percentage of each column's total). OTHER comprises the mixed compositions OOP, OOE, OPP, OEE, PPE, and PEE.

CASE    OOO           PPP           EEE           OPE            OTHER          ALL
V1      1703  48.7%    448  44.8%    820  41.0%    9354  44.5%   24340  45.0%   36665  44.9%
V2       129   3.7%    128  12.8%    166   8.3%    1341   6.4%    3528   6.5%    5292   6.5%
V3       842  24.1%    274  27.4%    554  27.7%    5939  28.3%   14496  26.8%   22105  27.1%
V4        48   1.4%      8   0.8%     32   1.6%     347   1.7%     848   1.6%    1283   1.6%
V5       141   4.0%     88   8.8%     80   4.0%    1046   5.0%    2631   4.9%    3986   4.9%
V6       264   7.5%     18   1.8%    206  10.3%    1747   8.3%    4603   8.5%    6838   8.4%
V7        86   2.5%     12   1.2%     82   4.1%     386   1.8%    1378   2.5%    1944   2.4%
V8         4   0.1%      0   0.0%      0   0.0%      65   0.3%     107   0.2%     176   0.2%
V9        20   0.6%      7   0.7%      4   0.2%     123   0.6%     323   0.6%     477   0.6%
V10       11   0.3%      6   0.6%     22   1.1%     274   1.3%     554   1.0%     867   1.1%
V11        6   0.2%      0   0.0%      0   0.0%      28   0.1%      53   0.1%      87   0.1%
V12      173   4.9%      1   0.1%     20   1.0%     275   1.3%     946   1.7%    1415   1.7%
V13       48   1.4%      0   0.0%      8   0.4%      62   0.3%     235   0.4%     353   0.4%
V14       25   0.7%     10   1.0%      6   0.3%      13   0.1%      58   0.1%     112   0.1%
Total   3500 100.0%   1000 100.0%   2000 100.0%   21000 100.0%   54100 100.0%   81600 100.0%

August 1984 75

Authorized licensed use limited to: University of Tartu Estonia. Downloaded on April 26,2021 at 11:25:19 UTC from IEEE Xplore. Restrictions apply.

Page 10: Fault Tolerance by Design Diversity: Concepts and Experiments

thus had to be generated exactly as specified might have helped to reduce this problem.

All program versions contained acceptance tests that performed limited self-checking and, on error detection, sent an error message to the decision algorithm. These detected errors were not considered by the decision algorithm. They were caused mainly by inadequate input checks and other interpretation faults.

The undetected errors were of greatest interest. These errors passed a version's acceptance test and reached the decision algorithm. Undetected similar errors are particularly dangerous in a multiversion software system because they can outweigh a good version. Every undetected similar error in the experiment was traced back to its cause, which was found to be either (1) a specification fault, (2) an interpretation fault, or (3) an implementation fault.

Specification faults. A fault that occurs in the specification phase of software development is potentially very costly, since it often remains undetected until after completion of the software. Five specification faults were found and are shown in Table 6. One fault, S1, was traced to the OBJ specification and was actually more of a language inadequacy than a true specification defect. This fault was found in four of the seven OBJ programs. The other four faults were due to ambiguities in the English specification, and there were no faults in the PDL specification.

Interpretation faults. An interpretation fault is caused by a misinterpretation or misunderstanding of the specification, rather than by a mistake in the specification. We distinguish between interpretation faults caused by inadequate understanding of the specification and implementation faults caused by faulty implementation of the specification. Failure to check input values, for example, is an interpretation fault, while being unable to retrieve records from the database is an implementation fault. There were nine interpretation faults (see Table 7), the most serious of which was the failure of all five PDL versions and one OBJ version to check input parameters adequately. Many of the interpretation faults could have been prevented by giving more examples in the specifications. Indeed, some programmers were unable to understand the specifications and relied, instead, on the examples given and on their intuition.

Implementation faults. An implementation fault is caused by a deficient implementation of the specification. There were seven implementation faults (see Table 8), of which four concerned the inability to correctly delete records in the database. Most of the implementation faults would have been detected by the programmers if more extensive testing had been performed. This was not encouraged, however, since one of our primary aims was to study the increase in reliability to be gained with imperfect programs used in N-version systems.

The causes of similar errors. In many cases, similar errors were caused by related faults. As an example, consider the transaction:

SCHEDULE 3, JFK, 1:00, 2, 727 (transaction 8)

This completely valid transaction was rejected by ENG4 and ENG6 because the departure time had not been input as "01:00" (fault S2).

A quite surprising result was that independent faults often produced similar errors. For example, consider transactions 3, 26, 36, and 69:

LIST (transaction 3)

After processing this, OBJ6 did no further processing of transactions (fault I8).

SCHEDULE 10, SAC, 23:00, 10, DC10 (transaction 26)

Because the destination code was not one of four sample codes, programs OBJ1, OBJ4, OBJ5, and OBJ7 would not accept this transaction (fault S1).

CANCEL 999 (transaction 36)

Because of a faulty cancel algorithm, this transaction was not performed correctly by OBJ3 (fault L2). Next, consider the following transaction:

SCHEDULE 6, LAX, 20:30, 10, 727 (transaction 69)

This transaction should have been rejected, because it used an illegal duplicate plane number, but all the OBJ versions listed above (OBJ1, OBJ3, OBJ4, OBJ5, OBJ6, OBJ7) accepted it because of earlier errors made in transactions 3, 26, or 36. Here, three independent design faults are illustrated: the arbitrary decision of one programmer to shut down after a certain sequence of transactions had been processed (fault I8); a poor cancel algorithm (fault L2); and the acceptance of only four destinations (fault S1). These all led to similar errors, in which an illegal transaction was accepted and simply acknowledged. This particular result, simple acknowledgment, contained no data and was the default result of several different transactions. A partial solution to this problem is never to allow a simple acknowledgment, since the acknowledgment may cause a similar error. Additional supporting data, such as identification of partial program state, should always be included and should lead to fewer similar errors.
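The danger of content-free acknowledgments is easy to reproduce in miniature (an illustrative toy of ours, not data from the experiment):

```python
# Illustrative toy: an illegal transaction is processed by three
# versions; two fail for unrelated reasons yet both emit a bare
# acknowledgment, forming a false majority.
def majority(results):
    return next((r for r in results if results.count(r) >= 2), None)

bare = ["ACK", "ACK", "REJECT: duplicate plane number"]
print(majority(bare))        # -> 'ACK': the similar error wins the vote

# If every result must carry identifying program state, the two
# independent faults no longer look alike, and no false majority forms.
with_state = [
    "ACK scheduled=6 seats=10",          # fault A: accepted, state A
    "ACK scheduled=7 seats=10",          # fault B: accepted, state B
    "REJECT: duplicate plane number",    # the correct rejection
]
print(majority(with_state))  # -> None: the disagreement is exposed
```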

Need for improvement in specificationtechniques

Despite real strengths, formal software specification techniques often have not paid off for conventional designs and produce specifications of limited value. Some areas of concern and suggested improvements are presented here from our experiences with OBJ.

Lack of experience and documentation. Formal specification languages are difficult to use, hard to write, and often sensitive to typographical errors.12 Programmers are usually unfamiliar with the applicative style of most formal specifications; documentation is inadequate; and training material does not exist. These problems are to be expected during the early stages of development of any new methodology and are probably transitory.

Automated tools. Current specification techniques, while capable of being computer processed, provide little support and few tools. A cross-referencing facility is particularly needed to keep track of function definitions and subsequent invocations. Databases containing predefined common objects and lower level specifications would be helpful, as would the capability of viewing the specification at different levels of abstraction.

Applicability to the problem domain. In order to produce a straightforward specification, the language must be appropriate for the particular problem domain. Existing specification languages are limited and cumbersome, partly because of the disparity between them and the problem and implementation domains. It is very difficult, for example, to specify implementation constraints and interface specifications. Since a specification should describe what is to be done, it is important to address known implementation constraints as early as possible. The new Larch family of specification languages16 is a step in the right direction.

Limited power. The lack of built-in functions puts a large burden on the writer and, by cluttering the specification with low-level detail, makes it more difficult to comprehend. Support for the specification of real arithmetic has been missing from formal specification languages, making numerical problems very difficult to define.

Work in progress and goals for long-rangeresearch

One major goal of the first-generation experiments just described was to apply our accumulated experience to the next generation of experiments. It has become evident that the general UCLA campus computing facility is a nonsupportive environment for multiversion software experiments. To establish a long-term research facility for further experimental investigations in the design and evaluation of N-fold-implemented fault-tolerant systems, we are designing a multichannel system testbed at the UCLA Center for Experimental Computer Science. The testbed will be a part of the center's advanced local network, which utilizes the Locus distributed operating system18 to operate a set of 20 VAX 11/750 computers.

A distributed system testbed for multiversion software. The testbed is intended to provide two distinct research and utilization opportunities. First, it will serve as an experimental vehicle for future fault-tolerant design investigations; and second, it will be a very high-reliability


Specifying temporal characteristics. Specification languages have not allowed us to define notions of time and concurrency. Specifying the nondeterminacy of asynchronous systems remains a difficult research problem.13,17 It is particularly important to be able to specify timing in a multiversion software environment; timing specifications are needed to address performance requirements for individual versions, as well as for the specification of the decision algorithm and its associated watchdog timers that prevent a failed version from hanging up the system.
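The watchdog-timer idea can be sketched as follows (our illustration; the timeout value and helper names are assumptions, and a production watchdog would also have to reclaim the hung computation rather than merely abandon it):

```python
# Our illustration of the watchdog: a version that hangs past its
# deadline is reported as a detected error ('D') instead of blocking
# the cc-point forever. Timeout values here are assumptions.
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as Timeout

def gather_with_watchdog(versions, transaction, limit_s=0.25):
    codes = []
    with ThreadPoolExecutor(max_workers=len(versions)) as pool:
        futures = [pool.submit(v, transaction) for v in versions]
        for f in futures:
            try:
                codes.append(("OK", f.result(timeout=limit_s)))
            except Timeout:
                codes.append(("D", None))   # watchdog fired: version hung
            except Exception:
                codes.append(("D", None))   # version aborted
    # NB: a real watchdog must also reclaim the hung computation;
    # the executor here merely abandons it at shutdown.
    return codes

ok = lambda t: "ACK " + t
hung = lambda t: time.sleep(1.5) or "ACK " + t

print(gather_with_watchdog([ok, ok, hung], "LIST"))
# -> [('OK', 'ACK LIST'), ('OK', 'ACK LIST'), ('D', None)]
```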

Language syntax. OBJ programmers were unanimous in their dislike of the language's syntax, even though the specification that they received was a substantial improvement over a "raw" OBJ specification. The OBJ interpreter treats a specification as a continuous, Lisp-like character string, making parsing very difficult. The specification used had been post-processed by a text editor to make it a little more intelligible.

Abstraction versus procedural specifications. More experience is needed in writing and using true abstract specifications as well as "rapid prototypes." What are the consequences of introducing bias in a "rapid prototype," and are the more difficult abstract specifications worth the extra effort? This experiment showed that, at the current level of development, a familiar procedural specification, PDL, was easier to use and understand than the more difficult, unfamiliar, abstract specification of OBJ.

Our study of both formal and informal specification languages has shown that none of the languages available is sufficiently automated. Formal specification languages, while showing promise for the future, thus far have been very difficult to use and understand and are severely limited in power.

Table 6. Specification faults causing similar errors.

SPEC. FAULT   APPEARS IN    DESCRIPTION
S1            OBJ 1,4,5,7   Only defines four destinations
S2            ENG 4,5,6     Time shown as 09.45 in example
S3            ENG 4,5,6     Error message order ambiguity
S4            ENG 2,4       Duplicate error message ambiguity
S5            ENG 6         Parameter checking ambiguity

Table 7. Interpretation faults causing similar errors.

INT. FAULT   APPEARS IN                         DESCRIPTION
I1           one OBJ version; PDL 1,2,3,4,5     Did not check input parameters first
I2           OBJ 1,2,5; PDL 4                   Expects time as 09:45
I3           ENG 1                              Wrong error message on illegal input
I4           ENG 1                              Create does not work the second time
I5           ENG 5                              Error message output on legal input
I6           ENG 6                              Allows null parameters
I7           ENG 4                              Allows invalid parameters
I8           OBJ 6                              No output after first list produced
I9           ENG 2                              Cannot handle invalid input

Table 8. Implementation faults causing similar errors.

IMPL. FAULT   APPEARS IN      DESCRIPTION
L1            —               Cancel does not work on last entry
L2            OBJ 3; PDL 5    Cancel works only on partial database
L3            —               Cancel unknown cancels last entry
L4            —               Cannot retrieve record
L5            —               Allows duplicate record
L6            ENG 4           Cancel, reserve-seat ignored
L7            ENG 4           Bad input leads to chaos


and continuous availability node for other users of thelocal network who have a need for exceptionally reliablecomputing.

Certain requirements for designing a distributed system presented themselves to us. It is necessary to provide independent simultaneous execution and to compare intermediate results when the program versions have reached identical cross-check points. A synchronization mechanism is needed that makes sure that the results compared are from identical cc-points. Because of reliability requirements, the synchronization must operate without any centralized components. To obtain such a mechanism in a distributed system is difficult. The Locus distributed operating system communicates by passing messages with inherently varying communication delays that are significant when compared with the elapsed time between events in a single process. As a consequence of these characteristics, a consistent view of the global state does not exist at all times.

For these reasons, the testbed was designed as a set of hierarchical layers, as shown in Figure 1. A layered approach has several advantages. It reduces complexity, shields higher layers from details of how the services offered by lower layers are actually implemented, and facilitates verification of the system. Each layer uses the services provided by the lower layers and adds new services, which are used in turn by the higher layers. The testbed is composed of the following four layers:

(1) The transport layer, which moves messages from one site to another or within one site. Its goal is to make sure that no message is lost, duplicated, damaged, or misaddressed. It is designed so that messages arrive in the order that they are sent. The transport layer can also set up connections between authorized entities, such as the transport layers, at the sites.
(2) The synchronization layer, which collects messages (that hold version results) from identical cc-points, delivers them to the transport layer, and synchronizes the restart of the versions. This layer can establish communication connections between versions.
(3) The decision and local-executive layer. This layer processes version results from identical cc-points by using the decision algorithm and determines whether a version is faulty or not. The decision result and the version fault status are forwarded to the synchronization layer to be used in further computations.
(4) The user and global-executive layer. This layer consists of two parts: the multiple versions of user programs and a replicated global executive program that has the same functions as the global executive found in SIFT.5 It performs the fault-tolerance functions that are not performed synchronously at each cc-point. These functions include system fault diagnosis, reconfiguration, and fault reporting for maintenance purposes.

Figure 1. An architectural model of the distributed system testbed.

A functional diagram of the current system is given in

Figure 2. Sync represents the synchronization module; SV,the supervisory module that performs local executive func-tions; DEC, the decision algorithm; and Vn represents ver-sion n of the user program. The global executive is notshown. The pipe and process systems of Locus are used toconnect sites and to provide reliable communication. Theusual operating system functions (file system, processmanagement, and conventional interprocess communica-tion) of Locus are available to the user software. The test-bed will therefore support the operation of realisticallylarge and complex programs. An initial implementation ofthe synchronization, supervisory, and decision algorithmmodules has been written in the C programming language.
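What the synchronization layer must accomplish — matching version results by cc-point before they reach DEC — can be sketched as follows (our Python rendering, not the testbed's C implementation):

```python
# Our Python rendering of the synchronization layer's core duty:
# batch version results by cross-check point and release a batch to
# the decision layer only when every live version has reported.
from collections import defaultdict

class SyncLayer:
    def __init__(self, n_versions):
        self.n = n_versions
        self.pending = defaultdict(dict)   # cc_point -> {version: result}

    def deliver(self, cc_point, version, result):
        batch = self.pending[cc_point]
        batch[version] = result
        if len(batch) == self.n:           # all versions at this cc-point
            del self.pending[cc_point]
            return sorted(batch.items())   # hand the batch to DEC
        return None                        # keep waiting

sync = SyncLayer(3)
print(sync.deliver("t8", "V1", "ACK"))     # -> None (waiting)
print(sync.deliver("t8", "V2", "ACK"))     # -> None (waiting)
print(sync.deliver("t8", "V3", "REJECT"))
# -> [('V1', 'ACK'), ('V2', 'ACK'), ('V3', 'REJECT')]
```

A real implementation must also handle the consistency problems noted above, since there is no centralized component and no global view of state; this sketch shows only the batching contract.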

Enhanced fault-tolerance mechanisms. As the testbed evolves, we plan to investigate various fault-tolerance mechanisms. The present implementation of the decision algorithm will be replaced by a more general decision algorithm. Two candidate algorithms are skew and adaptive voting.6 Skew voting handles numerical results that are unequal because of the differences in precision allowed around some unknown reference value. Adaptive voting weights version results; the weights can then be dynamically calculated to favor similar results. The decision algorithm will also have some ability to recognize cosmetic errors and deal with diverse data representations.
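The two candidate algorithms can be sketched as follows (our reading of them; the tolerance and the weights are illustrative assumptions):

```python
# Our sketches of the two candidate decision algorithms named above;
# the tolerance and weights are illustrative assumptions.

def skew_vote(values, tol):
    """Numerical results agree when within a precision skew 'tol';
    the vote returns the mean of the first agreeing cluster."""
    for v in values:
        cluster = [w for w in values if abs(w - v) <= tol]
        if len(cluster) >= 2:
            return sum(cluster) / len(cluster)
    return None                        # no two results agree

def adaptive_vote(values, weights):
    """Weighted vote; in adaptive voting the weights would be updated
    dynamically to favor versions whose results keep agreeing."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

print(round(skew_vote([3.1415, 3.1417, 2.9], tol=0.001), 4))   # -> 3.1416
print(adaptive_vote([10.0, 10.0, 16.0], [1.0, 1.0, 0.5]))      # -> 11.2
```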

The user interface is another important and difficult problem that will be addressed. Functions that must be supplied include distribution and execution of program versions, log recording for each trial, statistics generation, and interactive version and system manipulation. By providing and evaluating these, we expect to obtain quantitative experimental results about the effectiveness of the fault-tolerance mechanisms. We also plan to evaluate the possible loss of performance due to the operation of the fault-tolerance mechanisms in the absence of faults, as well as the cost of error recovery. A problem area that will be thoroughly examined is the

recovery of failed versions through rollback, rollforward, and reinitialization. Since we assume that all versions are likely to contain design faults, it is critical to be able to recover these versions as they fail, rather than merely degrade to N−1 versions, then N−2 versions, and so on.
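The recovery idea can be illustrated in miniature (our sketch; here a version's "state" is reduced to a single value, and the voted result is adopted as the reinitialized state):

```python
# Our miniature of version recovery at a cc-point: a version that
# fails is reinitialized from the voted state and rejoins the vote,
# instead of the system degrading to N-1 versions.

def majority(xs):
    return next((x for x in xs if xs.count(x) >= 2), None)

def run_cc_point(step_fns, states):
    new = []
    for fn, s in zip(step_fns, states):
        try:
            new.append(fn(s))          # advance one cc-point
        except Exception:
            new.append(None)           # this version failed
    voted = majority([s for s in new if s is not None])
    # rollforward: failed versions adopt the agreed state
    return [voted if s is None else s for s in new], voted

ok = lambda s: s + 1
crash = lambda s: 1 / 0
states, voted = run_cc_point([ok, ok, crash], [5, 5, 5])
print(states, voted)   # -> [6, 6, 6] 6: the failed version rejoins
```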


An important and interesting application area thatoften requires very high reliability and availability is real-time execution of time-critical applications. However, thecurrent Locus implementation of the testbed is likely to betoo slow for this purpose. Despite this limitation of Locus,its functional architecture can be used with fastertransport service and faster scheduling policies in a real-time system. In addition, Locus can be used to simulatereal-time execution.

Use of advanced specification languages. Significant progress has occurred in the development of formal specification languages since our previous experiments. The promising languages Larch,16 Clear,19 and Prolog are being evaluated for application in future experiments. We have been impressed by the ease with which Prolog specifications can be written even without such features as abstract data types and implicit-control structure. Once the testbed is operational, we will have an initial

statement of the attributes required for the components of the working fault-tolerance mechanisms implemented in C. The next step will be a formal specification of the protocol layers, the decision algorithm, and the local and global executives. We plan to investigate the use of Concurrent Prolog as the first specification language. The specification will provide an executable rapid prototype of the underlying supervisory operating system and the application versions. It will allow not only the migration to real-time systems, but also the use of multiversion software techniques for the fault-tolerance mechanisms themselves. The goal is an operating system that supports design diversity in application programs and which is itself diverse in design.

Independent specifications of some operating system modules in Larch and Clear will serve to compare their merits with those of Prolog. Further research is planned in the application of dual-diverse formal specifications to eliminate errors traceable to specification faults and increase reliability.

Software cost studies. It has been estimated that 70 percent of the overall cost of software is spent on software maintenance, which includes (1) correction of bugs and (2) enhancement due to changing requirements. We will carefully monitor the cost of producing program versions during all phases of development.

The cost of removing bugs. By combining software versions that have not been subjected to V&V testing to produce highly reliable multiversion software, we may be able to decrease cost while increasing reliability. Most errors in the software versions will be detected by the decision algorithm during on-line production use of the system. The software faults then can be fixed without affecting system availability.

The cost of enhancement. Enhancing multiple software versions is more difficult. Specifications should be sufficiently modular and structured so that enhancement will generally affect few modules. The extent to which each module is affected can then be used to determine whether (1) existing versions should be modified to reflect the enhancement, (2) existing versions should be discarded and new versions produced, or (3) new versions should be produced to implement the enhancements and old versions kept to implement the original requirements. Experiments will be conducted to gain insight into the criteria to be used for a choice.
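To make such a criterion concrete, the choice among (1), (2), and (3) could be driven by the fraction of modules an enhancement touches. The following C sketch is purely illustrative; the 0.2 and 0.8 thresholds are invented here, and finding realistic values is exactly what the planned experiments are for:

```c
/* Possible strategies for enhancing a multiversion system. */
enum enhance_policy {
    MODIFY_VERSIONS,    /* (1) patch each existing version */
    REPLACE_VERSIONS,   /* (2) discard versions, produce new ones */
    ADD_NEW_VERSIONS    /* (3) new versions for the enhancement,
                           old versions kept for old requirements */
};

/* Choose an enhancement strategy from the fraction of modules
 * affected. The 0.2 and 0.8 thresholds are hypothetical. */
enum enhance_policy choose_policy(int affected, int total)
{
    double frac = (double)affected / (double)total;
    if (frac < 0.2)
        return MODIFY_VERSIONS;   /* localized change */
    if (frac > 0.8)
        return REPLACE_VERSIONS;  /* sweeping change */
    return ADD_NEW_VERSIONS;      /* intermediate case */
}
```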

Mail-order software experiments. To test the "mail-order" concept, members of fault-tolerance research groups at several universities will write software versions for use in a large experiment. We expect that software versions produced at geographically separate locations, by people with different experience who use different programming languages, will contain substantial design diversity. It may be possible to utilize the rapidly growing population of computer hobbyists on a contractual basis by having them provide individual module versions at their own locations. This would not require a large concentration of skilled people and would allow for the loss of individual programmers. A model experiment has been specified, which includes 15 general guidelines and 10 specific tasks.20

The first-generation experiments described here have provided a proof-of-concept and have shown the feasibility of design diversity as a complement to fault avoidance. Whether it is a practical and cost-effective general solution to the design fault problem remains to be seen. However, tolerance of design faults in software and in hardware is the challenge of the eighties, and the results presented here encourage us to propose further experimentation. To this end, second-generation experimentation is now underway at four universities* to measure the efficacy of design diversity and to demonstrate conclusive reliability increases under large-scale, controlled experimental conditions.

Acknowledgments

This research has been supported by NSF grants MCS-78-18918 and MCS-81-21696, and by a research grant from the Battelle Memorial Institute. Current support is provided by the Advanced Computer Science program of the Federal Aviation Administration. Valuable contributions to the research have been made by Liming Chen, Baron Grey, Per Gunningberg, Srinivas Makam, Tetsuichiro Sasada, Lorenzo Strigini, and Pascal Traverse. This article has benefited from perceptive comments by Flaviu Cristian and Jack Goldberg.

*Experimentation is now underway at UCLA, the University of Virginia, the University of Illinois, and North Carolina State University.

Figure 2. A functional block diagram of the present system, in which SYNC = synchronization module, SV = supervisory module, Vn = version n of the user's application program, VOT = voting module, and right and left arrows indicate communication lines (provided by the "pipe" system call).

References

1. A. Avizienis, ed., "Special Issue on Fault-Tolerant Digital Systems," Proc. IEEE, Vol. 66, No. 10, Oct. 1978.

2. A. Avizienis and L. Chen, "On the Implementation of N-Version Programming for Software Fault-Tolerance During Execution," Proc. Compsac 77, Nov. 1977, pp. 149-155.

3. T. Anderson and P. A. Lee, Fault Tolerance: Principles and Practice, Prentice-Hall International, London, England, 1981, pp. 249-291.

4. F. Cristian, "Exception Handling and Software Fault Tolerance," IEEE Trans. Computers, Vol. C-31, No. 6, June 1982, pp. 531-539.

5. J. Goldberg, "SIFT: A Provable Fault-Tolerant Computer for Aircraft Flight Control," Proc. IFIP Congress 80, pp. 151-156.

6. L. Chen and A. Avizienis, "N-Version Programming: A Fault-Tolerance Approach to Reliability of Software Operation," Digest of Eighth Annual Int'l Conf. Fault-Tolerant Computing, June 1978, pp. 3-9.

7. P. Morrison and E. Morrison, eds., Charles Babbage and His Calculating Engines, Dover Publications, Inc., New York, 1961, p. 177.

8. L. Gmeiner and U. Voges, "Software Diversity in Reactor Protection Systems: An Experiment," Proc. IFAC Workshop Safecomp 79.

9. W. C. Carter, "Architectural Considerations for Detecting Run Time Errors in Programs," Digest of 13th Annual Int'l Symp. Fault-Tolerant Computing, June 1983, pp. 249-256.

10. T. Anderson and J. C. Knight, "A Framework for Software Fault Tolerance in Real-Time Systems," IEEE Trans. Software Eng., Vol. SE-9, No. 3, May 1983, pp. 355-364.

11. J. P. J. Kelly and A. Avizienis, "A Specification-Oriented Multiversion Software Experiment," Digest of 13th Annual Int'l Symp. Fault-Tolerant Computing, June 1983, pp. 120-126.

12. J. A. Goguen and J. J. Tardo, "An Introduction to OBJ: A Language for Writing and Testing Formal Algebraic Program Specifications," Proc. Specifications of Reliable Software, 79CH1401-9C, Apr. 1979, pp. 170-189.

13. B. H. Liskov and V. Berzins, "An Appraisal of Program Specifications," Research Directions in Software Technology, P. Wegner, ed., MIT Press, Cambridge, Mass., 1979, pp. 170-189.

14. H. Ehrig, H. Kreowski, and H. Weber, "Algebraic Specification Schemes for Database Systems," Proc. Fourth Int'l Conf. Very Large Databases, Sept. 1978, pp. 427-440.

15. D. L. Parnas, "The Role of Program Specification," Research Directions in Software Technology, P. Wegner, ed., MIT Press, Cambridge, Mass., 1979, pp. 364-370.

16. J. V. Guttag and J. J. Horning, "An Introduction to the Larch Shared Language," Proc. IFIP Congress 83, pp. 809-814.

17. P. M. Melliar-Smith, "System Specifications," Computing Systems Reliability, T. Anderson and B. Randell, eds., Cambridge University Press, Cambridge, England, 1979.

18. G. Popek et al., "LOCUS: A Network Transparent, High Reliability Distributed System," UCLA Computer Science Department Quarterly, Vol. 9, No. 1, Winter 1981, pp. 75-88.

19. R. M. Burstall and J. A. Goguen, "An Informal Introduction to Specifications Using CLEAR," The Correctness Problem in Computer Science, R. Boyer and J. Moore, eds., Academic Press, New York, 1981, pp. 185-213.

20. J. P. J. Kelly, "Specification of Fault-Tolerant Multiversion Software: Experimental Studies of a Design Diversity Approach," UCLA Computer Science Department tech. report CSD-820927, Sept. 1982.

Algirdas Avizienis is currently the chairman of the Computer Science department of UCLA, where he has served on the faculty since 1962. He also directs the UCLA Reliable Computing and Fault-Tolerance research group, which he established at UCLA in 1972. Current projects of the group include design diversity for the tolerance of design faults, fault tolerance in distributed systems, and fault-tolerant supercomputer architectures. Avizienis also directed research on fault-tolerant spacecraft computers at Caltech's Jet Propulsion Laboratory in Pasadena, California (1960 to 1973). This effort resulted in the experimental JPL Self-Testing And Repairing, or STAR, computer (IEEE Trans. on Computers, Vol. C-20, No. 11, 1971). For his pioneering work in fault-tolerant computing, Avizienis was elected Fellow of the IEEE and received the NASA Apollo Achievement Award, the Honor Roll award of the IEEE Computer Society, the AIAA Information Systems Award, and the NASA Exceptional Service Medal.

Avizienis received his BS (1954), MS (1955), and PhD (1960) in electrical engineering from the University of Illinois. He served as founding chairman of the IEEE-CS Technical Committee on Fault-Tolerant Computing (1969-1973) and of IFIP Working Group 10.4, "Reliable Computing and Fault Tolerance" (1980-1981).

John P. J. Kelly is a visiting assistant professor of the Computer Science department of UCLA and a member of the UCLA Reliable Computing and Fault-Tolerance research group. His primary research interests include the use of design diversity to provide fault tolerance in software, fault tolerance in distributed systems, and software engineering. Kelly is also a computer scientist at Ordain, Inc., where he is developing a reliable, distributed database machine.

Since 1972, Kelly has been a consulting computer scientist in the areas of fault tolerance and distributed systems to various organizations both in the USA and abroad, including the NASA Langley Research Center, the Jet Propulsion Laboratory, Transaction Technology Incorporated, and Telos Computing Corporation.

Kelly received his BS (1975) and MS (1977) degrees in mathematics from the University of Cambridge, England, and his PhD (1982) in computer science from UCLA.

Questions about this article can be directed to either author at 3731 Boelter Hall, University of California, Los Angeles, CA 90024.

Computer, August 1984
