Coding Approaches to Fault Tolerance in Combinational and Dynamic Systems


THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE


CODING APPROACHES TO FAULT TOLERANCE IN COMBINATIONAL AND DYNAMIC SYSTEMS

CHRISTOFOROS N. HADJICOSTIS
Coordinated Science Laboratory and Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign

SPRINGER SCIENCE+BUSINESS MEDIA, LLC


ISBN 978-1-4613-5271-6 ISBN 978-1-4615-0853-3 (eBook) DOI 10.1007/978-1-4615-0853-3

Library of Congress Cataloging-in-Publication Data

A C.I.P. Catalogue record for this book is available from the Library of Congress.

Copyright © 2002 Springer Science+Business Media New York. Originally published by Kluwer Academic Publishers in 2002. Softcover reprint of the hardcover 1st edition 2002.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, Springer Science+Business Media, LLC.

Printed on acid-free paper.


To Pani


Contents

List of Figures xi
List of Tables xiii
Foreword xv
Preface xvii
Acknowledgments xxi

1. INTRODUCTION 1
  1 Definitions, Motivation and Background 1
  2 Fault-Tolerant Combinational Systems 4
    2.1 Reliable Combinational Systems 6
    2.2 Minimizing Redundant Hardware 6
  3 Fault-Tolerant Dynamic Systems 7
    3.1 Redundant Implementations 10
    3.2 Faults in the Error-Correcting Mechanism 12
  4 Coding Techniques for Fault Diagnosis 13

Part I Fault-Tolerant Combinational Systems

2. RELIABLE COMBINATIONAL SYSTEMS OUT OF UNRELIABLE COMPONENTS 21
  1 Introduction 21
  2 Computational Models for Combinational Systems 22
  3 Von Neumann's Approach to Fault Tolerance 23
  4 Extensions of Von Neumann's Approach 27
    4.1 Maximum Tolerable Noise for 3-Input Gates 27
    4.2 Maximum Tolerable Noise for u-Input Gates 29
  5 Related Work and Further Reading 31

3. ABFT FOR COMBINATIONAL SYSTEMS 33
  1 Introduction 33
  2 Arithmetic Codes 35
  3 Algorithm-Based Fault Tolerance 37
  4 Generalizations of Arithmetic Coding to Operations with Algebraic Structure 41
    4.1 Fault Tolerance for Abelian Group Operations 41
      4.1.1 Use of Group Homomorphisms 44
      4.1.2 Error Detection and Correction 45
      4.1.3 Separate Group Codes 47
    4.2 Fault Tolerance for Semigroup Operations 49
      4.2.1 Use of Semigroup Homomorphisms 50
      4.2.2 Error Detection and Correction 51
      4.2.3 Separate Semigroup Codes 52
    4.3 Extensions 56

Part II Fault-Tolerant Dynamic Systems

4. REDUNDANT IMPLEMENTATIONS OF ALGEBRAIC MACHINES 61
  1 Introduction 61
  2 Algebraic Machines: Definitions and Decompositions 61
  3 Redundant Implementations of Group Machines 64
    3.1 Separate Monitors for Group Machines 66
    3.2 Non-Separate Redundant Implementations for Group Machines 69
  4 Redundant Implementations of Semigroup Machines 73
    4.1 Separate Monitors for Reset-Identity Machines 74
    4.2 Non-Separate Redundant Implementations for Reset-Identity Machines 75
  5 Summary 76

5. REDUNDANT IMPLEMENTATIONS OF DISCRETE-TIME LTI DYNAMIC SYSTEMS 79
  1 Introduction 79
  2 Discrete-Time LTI Dynamic Systems 79
  3 Characterization of Redundant Implementations 80
  4 Hardware Implementation and Fault Model 83
  5 Examples of Fault-Tolerant Systems 86
  6 Summary 96

6. REDUNDANT IMPLEMENTATIONS OF LINEAR FINITE-STATE MACHINES 99
  1 Introduction 99
  2 Linear Finite-State Machines 99
  3 Characterization of Redundant Implementations 102
  4 Examples of Fault-Tolerant Systems 104
  5 Hardware Minimization in Redundant LFSM Implementations 108
  6 Summary 112

7. UNRELIABLE ERROR CORRECTION IN DYNAMIC SYSTEMS 115
  1 Introduction 115
  2 Fault Model for Dynamic Systems 117
  3 Reliable Dynamic Systems using Distributed Voting Schemes 118
  4 Reliable Linear Finite-State Machines 123
    4.1 Low-Density Parity Check Codes and Stable Memories 123
    4.2 Reliable Linear Finite-State Machines using Constant Redundancy 127
  5 Other Issues 132

8. CODING APPROACHES FOR FAULT DETECTION AND IDENTIFICATION IN DISCRETE EVENT SYSTEMS 143
  1 Introduction 143
  2 Petri Net Models of Discrete Event Systems 145
  3 Fault Models for Petri Nets 148
  4 Separate Monitoring Schemes 151
    4.1 Separate Redundant Petri Net Implementations 151
    4.2 Fault Detection and Identification 154
  5 Non-Separate Monitoring Schemes 160
    5.1 Non-Separate Redundant Petri Net Implementations 160
    5.2 Fault Detection and Identification 166
  6 Applications in Control 170
    6.1 Monitoring Active Transitions 170
    6.2 Detecting Illegal Transitions 171
  7 Summary 174

9. CONCLUDING REMARKS 179
  1 Summary 179
  2 Future Research Directions 181

10. ABOUT THE AUTHOR 185

11. INDEX 187


List of Figures

1.1 Triple modular redundancy. 5
1.2 Fault-tolerant combinational system. 6
1.3 Triple modular redundancy with correcting feedback. 9
1.4 Fault-tolerant dynamic system. 10
2.1 Error correction using a "restoring organ." 23
2.2 Plots of functions f(q) and g(q) for two different values of p. 24
2.3 Two successive restoring iterations in von Neumann's construction for fault tolerance. 25
3.1 Arithmetic coding scheme for protecting binary operations. 34
3.2 aN arithmetic coding scheme for protecting integer addition. 37
3.3 ABFT scheme for protecting matrix multiplication. 39
3.4 Fault-tolerant computation of a group operation. 43
3.5 Fault tolerance using an abelian group homomorphism. 44
3.6 Coset-based error detection and correction. 46
3.7 Separate arithmetic coding scheme for protecting integer addition. 48
3.8 Separate coding scheme for protecting a group operation. 48
3.9 Partitioning of semigroup (N, ×) into congruence classes. 54
4.1 Series-parallel decomposition of a group machine. 62
4.2 Redundant implementation of a group machine. 65
4.3 Separate redundant implementation of a group machine. 66
4.4 Relationship between a separate monitor and a decomposed group machine. 69
5.1 Delay-adder-gain implementation and the corresponding signal flow graph for an LTI dynamic system. 84
5.2 State evolution equation and hardware implementation of the digital filter in Example 5.2. 89
5.3 Redundant implementation based on a checksum condition. 89
5.4 Second redundant implementation based on a checksum condition. 91
6.1 Hardware implementation of the linear feedback shift register in Example 6.1. 100
6.2 Different implementations of a convolutional encoder. 107
7.1 Reliable state evolution subject to faults in the error corrector. 119
7.2 Modular redundancy with distributed voting scheme. 120
7.3 Hardware implementation of Gallager's modified iterative decoding scheme for LDPC codes. 125
7.4 Replacing k LFSM's with n redundant LFSM's. 128
7.A.1 Encoded implementation of k LFSM's using n redundant LFSM's. 134
8.1 Petri net with three places and three transitions. 145
8.2 Cat-and-mouse maze. 147
8.3 Petri net model of a distributed processing system. 150
8.4 Concurrent monitoring scheme using a separate Petri net implementation. 151
8.5 Example of a separate redundant Petri net implementation that identifies single transition faults in the Petri net of Figure 8.1. 156
8.6 Example of a separate redundant Petri net implementation that identifies single place faults in the Petri net of Figure 8.1. 157
8.7 Example of a separate redundant Petri net implementation that identifies single transition or single place faults in the Petri net of Figure 8.1. 159
8.8 Concurrent monitoring scheme using a non-separate Petri net implementation. 161
8.9 Example of a non-separate redundant Petri net implementation that identifies single transition faults in the Petri net of Figure 8.1. 167
8.10 Example of a non-separate redundant Petri net implementation that identifies single place faults in the Petri net of Figure 8.1. 169
8.11 Example of a separate redundant Petri net implementation that enhances control in the Petri net of Figure 8.3. 171


List of Tables

2.1 Input-output table for the 3-input XNAND gate. 27
5.1 Syndrome-based error detection and identification in Example 5.1. 87


Foreword

Fault tolerance requires redundancy, but redundancy comes at a price. At one extreme of redundancy, fault tolerance may involve running several complete and independent replicas of the desired process; discrepancies then indicate faults, and the majority result is taken as correct. More modest levels of redundancy - for instance, adding parity check bits to the operands of a computation - can still be very effective, but need to be more carefully designed, so as to ensure that the redundancy conforms appropriately to the particular characteristics of the computation or process involved. The latter challenge is the focus of this book, which has grown out of the author's graduate theses at MIT.

The original stimulus for the approach taken here comes from the work of Beckmann and Musicus, developed in Beckmann's 1992 doctoral thesis, also at MIT. That work focused on computations having group structure. The essential idea was to map the group in which the computation occurred to a larger group via a homomorphism, thereby preserving the structure of the computation while introducing the necessary redundancy. Hadjicostis has significantly expanded the setting to processes occurring in more general algebraic and dynamic systems.

For combinational (i.e., memoryless) systems, this book shows how to recognize and exploit system structure in a way that leads to resource-efficient arithmetic coding and "ABFT" (algorithm-based fault-tolerant) schemes, and characterizes separate (parity-type) codes. These results are then extended to dynamic systems, providing a unified system theoretic framework that makes connections with traditional error correcting methodologies for communication systems, allows coding techniques to be studied in conjunction with the dynamics of the process that is being protected, and enables the development of fault-tolerance techniques that can account for faults in the error corrector itself. Numerous examples throughout the book illustrate how the framework and methodology translate to particular situations of interest, providing a parametrization of the range of possibilities for redundant implementation, and allowing one to examine features of and trade-offs among different possibilities and realizations.

The book responds to the growing need to handle faults in complex digital chips and complex networked systems, and to consider the effects of faults at the design stage rather than afterwards. I believe that the approach taken by the author points the way to addressing such needs in a systematic and fruitful fashion. The material here should be of interest to both researchers and practitioners in the area of fault tolerance.

George Verghese Massachusetts Institute of Technology


Preface

As the complexity of systems and networks grows, the likelihood of faults in certain components or communication links increases significantly and the consequences become highly unpredictable and severe. Even within a single digital device, the reduction of voltages and capacitances, the shrinking of transistor sizes and the sheer number of gates involved has led to a significant increase in the frequency of so-called "soft errors," and has prompted leading semiconductor manufacturers to admit that they may be facing difficult challenges in the future. The occurrence of faults becomes a major concern when the systems involved are life-critical (such as military, transportation or medical systems), or operate in remote or inaccessible environments (where repair may be difficult or even impossible).

A fault-tolerant system is able to tolerate internal faults and preserve desirable overall behavior and output. A necessary condition for a system to be fault-tolerant is that it exhibit redundancy, which enables it to distinguish between correct and incorrect results or between valid and invalid states. Redundancy is expensive and runs counter to the traditional notion of system design; thus, the success of a fault-tolerant design relies on making efficient use of hardware by adding redundancy in those parts of the system that are more liable to faults than others. Traditionally, the design of fault-tolerant systems has considered two quite distinct fault models: one model constructs reliable systems out of unreliable components (all of which may suffer faults with a certain probability) whereas the other model focuses on detecting and correcting a fixed number of faults (aiming at minimizing the required hardware). This book addresses both of these fault models and describes coding approaches that can be used to exploit the algorithmic/evolutionary structure in a particular combinational or dynamic system in order to avoid excessive use of redundancy. The book has grown out of thesis work at the Massachusetts Institute of Technology and research at the University of Illinois at Urbana-Champaign.


Chapters 2 and 3 describe coding approaches for designing fault-tolerant combinational systems, i.e., systems with no internal memory that perform a static function evaluation on their inputs. Chapter 2 reviews von Neumann's work on "Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components," which is one of the first systematic approaches to fault tolerance. Subsequent related results on combinational circuits that are constructed as interconnections of unreliable ("noisy") gates are also discussed. In these approaches, a combinational system is built out of components (e.g., gates) that suffer transient faults with constant probability; the goal is to assemble these unreliable components in a way that introduces "structured" redundancy and ensures that, with high probability, the overall functionality is the correct one.
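As a rough illustration of why redundancy built from unreliable components can still help, the short Python sketch below computes the probability that a majority of n replicas are faulty at the same time, which is when a majority vote can be wrong. It is not von Neumann's construction: it assumes independent replica faults with a common probability p and a fault-free voter, and the function name majority_failure_probability is chosen here for illustration only.

```python
from math import comb


def majority_failure_probability(p: float, n: int = 3) -> float:
    """Probability that more than half of n replicas are faulty.

    Assumptions (not from the book): replicas fail independently, each
    with probability p, and the voter itself never fails.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that a majority always exists")
    threshold = n // 2 + 1  # smallest number of faulty replicas that outvotes the rest
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(threshold, n + 1)
    )


if __name__ == "__main__":
    for p in (0.01, 0.1, 0.5):
        print(p, majority_failure_probability(p, n=3))
```

For p = 0.1 and three replicas the result is 3(0.1)^2(0.9) + (0.1)^3 = 0.028, compared with 0.1 for a single unprotected replica. At p = 0.5 the result is again 0.5, so replication no longer helps under these assumptions.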

Chapter 3 describes a distinctly different approach to fault tolerance which aims at protecting a given combinational system against a pre-specified number of component faults. Such designs become more dominant once system components are fairly reliable; they generally aim at using a minimal amount of structured redundancy to achieve detection and correction of a pre-specified number of faults. As explained in Chapter 3, coding techniques are particularly successful for arithmetic and linear operations; extensions of these techniques to operations with group or semigroup structure are also discussed.
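A minimal sketch of an arithmetic code for integer addition may make the idea concrete. Each operand x is encoded as A*x, so a fault-free sum is always a multiple of A, and any additive error that is not a multiple of A is detected. The modulus A = 3, the function names and the additive fault model are assumptions made for this illustration and are not taken from the book.

```python
A = 3  # hypothetical code multiplier; errors that are multiples of A go undetected


def encode(x: int) -> int:
    """Map an operand to its codeword A*x (a homomorphism of integer addition)."""
    return A * x


def is_valid(codeword: int) -> bool:
    """A result is a valid codeword only if it is a multiple of A."""
    return codeword % A == 0


def decode(codeword: int) -> int:
    """Recover the unencoded result, refusing results that fail the check."""
    if not is_valid(codeword):
        raise ValueError("fault detected: result is not a multiple of A")
    return codeword // A


def protected_add(x: int, y: int, fault: int = 0) -> int:
    """Add two encoded operands; `fault` models an additive error in the adder."""
    return encode(x) + encode(y) + fault


if __name__ == "__main__":
    print(decode(protected_add(5, 7)))            # fault-free addition
    print(is_valid(protected_add(5, 7, fault=1)))  # corrupted result fails the check
```

Because encode(x) + encode(y) equals encode(x + y), the redundancy follows the structure of the operation itself, which is the property that the group and semigroup generalizations build on.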

The remainder of the book focuses on fault tolerance in dynamic systems, such as finite-state controllers or computer simulations, whose internal state influences their future behavior. Modular redundancy (system replication) and other traditional techniques for fault tolerance are expensive, and rely heavily - particularly in the case of dynamic systems operating over extended time horizons - on the assumption that the error-correcting mechanism does not fail. The book describes a systematic methodology for adding structured redundancy to a dynamic system, exposing a wide range of possibilities between no redundancy and full replication. These possibilities can be parameterized in various settings, including algebraic machines (Chapter 4) and linear dynamic systems (Chapters 5 and 6). By adopting specific fault models and, in some cases, by making explicit connections with hardware implementations, the exposition in these chapters describes resource-efficient designs for redundant dynamic systems. Optimization criteria for choosing among different redundant implementations are not explicitly addressed; several examples, however, illustrate how such criteria can be posed and investigated.
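One point between no redundancy and full replication can be sketched for a linear system q[t+1] = A q[t] + b x[t] by adding a single checksum state variable. The Python sketch below is only one possible illustration, assuming integer arithmetic, an all-ones checksum vector and a fault that corrupts one state variable; the matrices and function names are hypothetical and do not reproduce any example from the book.

```python
def step(A, b, q, x):
    """One step of q[t+1] = A q[t] + b x[t] over the integers."""
    n = len(q)
    return [sum(A[i][j] * q[j] for j in range(n)) + b[i] * x for i in range(n)]


def make_redundant(A, b):
    """Append one checksum state variable equal to the sum of the original ones.

    The extra row holds the column sums of A and the extra input gain is the
    sum of b, so xi[-1] == sum(xi[:-1]) after every fault-free step.
    """
    n = len(b)
    check_row = [sum(A[i][j] for i in range(n)) for j in range(n)]
    A_red = [row + [0] for row in A] + [check_row + [0]]
    b_red = b + [sum(b)]
    return A_red, b_red


def syndrome(xi):
    """Zero for a consistent redundant state; nonzero flags a corrupted variable."""
    return xi[-1] - sum(xi[:-1])


if __name__ == "__main__":
    A = [[1, 1], [0, 1]]  # hypothetical two-state system
    b = [0, 1]
    A_red, b_red = make_redundant(A, b)
    xi = [0, 0, 0]        # original state followed by its checksum
    for x in (1, 2, 3):
        xi = step(A_red, b_red, xi, x)
        print(xi, syndrome(xi))
    xi[0] += 5            # inject a fault into one state variable
    print(xi, syndrome(xi))
```

In this particular sketch the checksum is recomputed from the original state variables at every step, so a corrupted state becomes consistent again after the next update. The check therefore has to be performed concurrently, after each step, rather than only at the end of a run.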

Chapter 7 relaxes the traditional assumption that the error-correcting mechanism does not fail. The basic idea is to use a distributed error-correcting mechanism so that the effects of faults are dispersed within the redundant system in a non-devastating fashion. As discussed in Chapter 7, one can employ these techniques to obtain a variant of modular redundancy that uses unreliable system replicas and unreliable voters to construct redundant dynamic systems that evolve in time with a low probability of failure. By combining these techniques with low-complexity error-correcting coding, one can efficiently protect identical unreliable linear finite-state machines that operate in parallel on distinct input sequences. The approach requires only a constant amount of redundant hardware per machine to achieve a probability of failure that remains below any pre-specified bound over any given finite time interval.

Chapter 8 applies coding techniques in other contexts. In particular, it presents a methodology for diagnosing faults in discrete event systems that are described by Petri net models. The method is based on embedding the given Petri net model in a larger Petri net that retains the functionality and properties of the given one, while introducing redundancy in a way that facilitates error detection and identification.
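The embedding idea can be sketched with a small Petri net whose marking evolves as q[t+1] = q[t] + B x[t], extended by two check places whose token counts are kept equal to C q. The net, the weight matrix C and the single-place fault model below are assumptions chosen for this illustration; the sketch ignores whether a transition is enabled and does not consider faults in the check places themselves.

```python
# Hypothetical net: rows are places, columns are transitions.
B = [[-1, 0, 1],
     [1, -1, 0],
     [0, 1, -1]]
# Check-place weights: a plain token count and a position-weighted count.
C = [[1, 1, 1],
     [1, 2, 3]]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]


CB = matmul(C, B)  # net token change of each check place per transition


def fire(marking, checks, t):
    """Fire transition t in the original places and in the check places."""
    new_marking = [m + B[i][t] for i, m in enumerate(marking)]
    new_checks = [c + CB[k][t] for k, c in enumerate(checks)]
    return new_marking, new_checks


def syndrome(marking, checks):
    """Difference between the check places and the value C q they should hold."""
    return [c - e for c, e in zip(checks, matvec(C, marking))]


def identify_place_fault(marking, checks):
    """Return None if consistent, else (place index, token error) for one faulty place."""
    s1, s2 = syndrome(marking, checks)
    if s1 == 0 and s2 == 0:
        return None
    if s1 != 0 and s2 % s1 == 0 and 1 <= s2 // s1 <= len(marking):
        return s2 // s1 - 1, -s1
    raise ValueError("syndrome is not explained by a single place fault")


if __name__ == "__main__":
    marking = [2, 0, 0]
    checks = matvec(C, marking)
    marking, checks = fire(marking, checks, 0)
    print(identify_place_fault(marking, checks))
    marking[1] += 1  # a spurious token appears in the second place
    print(identify_place_fault(marking, checks))
```

With these weights, an error of e tokens in place i produces the syndrome (-e, -(i+1)e) for a zero-based index i, so the ratio of the two components points to the faulty place and the first component gives the size of the error.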

Chapter 9 concludes with a look into emerging research directions in the areas of fault tolerance, reliable system design and fault diagnosis. Unlike traditional methodologies, which add error detecting and correcting capabilities on top of existing, non-redundant systems, the methodology developed in this book simultaneously considers the design for fault tolerance together with the implementation of a given system. This comprehensive approach to fault tolerance allows the study of a larger class of redundant implementations and can be used to better understand fundamental limitations in terms of system-, coding- and information-theoretic constraints. Future work should also focus on the implications of redundancy on the speed and power efficiency of digital systems, and also on the development of systematic ways to trade off various system parameters of interest, such as redundant hardware, fault coverage, detection/correction complexity and delay.

Christoforos N. Hadjicostis Urbana, Illinois


Acknowledgments

This book has grown out of research work at the Massachusetts Institute of Technology and the University of Illinois at Urbana-Champaign. There are many colleagues and friends that have been extremely generous with their help and advice during these years, and to whom I am indebted.

I am very thankful to many members of the faculty at MIT for their involvement and contribution to my graduate research. In particular, I would like to express my most sincere thanks to George Verghese for his inspiring guidance, and to Alan Oppenheim and Greg Wornell for their support during my tenure at the Digital Signal Processing Group. Also, the discussions that I had with Sanjoy Mitter, Alex Megretski, Bob Gallager, David Forney and Srinivas Devadas were thought-provoking and helpful in defining my research direction; I am very thankful to all of them.

I am also grateful to many members of the faculty at UIUC for their warm support during these first few years. In particular, I would like to thank Steve Kang and Dick Blahut, who served as heads of the Department of Electrical and Computer Engineering, Ravi Iyer, the director of the Coordinated Science Laboratory, and Tamer Başar, the director of the Decision and Control Laboratory, whose advice and direction have been a tremendous motivation for writing this book.

I would also like to thank my many friends and colleagues who made academic life at MIT and at UIUC both enjoyable and productive. Special thanks go to Carl Livadas, Babis Papadopoulos and John Apostolopoulos, who were a great source of advice during my graduate studies. At UIUC, Andy Singer, Francesco Bullo and Petros Voulgaris were encouraging and always willing to help in any way they could. Becky Lonberger, Francie Bridges, Darla Chupp, Vivian Mizuno, Maggie Beucler, Janice Zaganjori and Sally Bemus made life a lot simpler by meticulously taking care of administrative matters. I would also like to thank Eleftheria Athanasopoulou, Boon Pang Lim and Yingquan Wu for proof-reading portions of this book.


I am very grateful to many research agencies and companies that have supported my work as a graduate student and as a research professor. These include the Defense Advanced Research Projects Agency for support under the Rapid Prototyping of Application Specific Signal Processors project, the Electric Power Research Institute and the Department of Defense for support under the Complex Interactive Networks/Systems Initiative, the National Science Foundation for support under the Information Technology Research and Career programs, the Air Force Office for Scientific Research for support under their University Research Initiative, the UIUC Campus Research Board, the National Semiconductor Corporation, the Grass Instrument Company and Motorola.

Finally, I am extremely thankful to Jennifer Evans and the Kluwer Academic Publishers for encouraging me to make these ideas more widely available through the publication of this book.