Reliability Engineering

Alessandro Birolini

Reliability Engineering: Theory and Practice

Fifth edition

With 140 Figures, 60 Tables, 120 Examples, and 50 Problems

Prof. Dr. Alessandro Birolini∗

Ponte Vecchio – Torre degli Amidei
I-50122 Firenze
Tuscany, Italy

email: [email protected]

∗Ingénieur et penseur, Ph.D., Professor Emeritus of Reliability Engineering at the Swiss Federal Institute of Technology (ETH), Zürich
biography on: www.ethz.ch/people/whoiswho

Library of Congress Control Number: 2007921004

First and second edition printed under the title “Quality and Reliability of Technical Systems”

ISBN 978-3-540-49388-4 5th ed. Springer Berlin Heidelberg New York
ISBN-10 3-540-40287-X 4th ed. Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.

Springer is a part of Springer Science+Business Media

springer.com

© Springer-Verlag Berlin Heidelberg 1994, 1997, 1999, 2004, and 2007

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: Camera ready by author
Production: LE-TEX Jelonek, Schmidt & Vöckler GbR, Leipzig
Cover-Design: medio Technologies AG

Printed on acid-free paper 62/3100/YL - 5 4 3 2 1 0

"La chance vient à l'esprit qui est prêt à la recevoir." 1)

Louis Pasteur

"Quand on aperçoit combien la somme de nos ignorances dépasse celle de nos connaissances, on se sent peu porté à conclure trop vite." 2)

Louis De Broglie

"One has to learn to consider causes rather than symptoms of undesirable events and avoid hypocritical attitudes." A. B.

1) "Opportunity comes to the intellect which is ready to receive it."

2) "When one recognizes how much the sum of our ignorance exceeds that of our knowledge, one is less ready to draw rapid conclusions."

Preface to the 5th Edition

This 5th edition differs from the 4th in some refinements and extensions, mainly on the investigation and test of complex repairable systems. For phased-mission systems, a new approach is given for both reliability and availability (Section 6.8.6.2). Effects of common cause failures (CCF) are carefully investigated for a 1-out-of-2 redundancy (6.8.7). Petri nets and dynamic FTA are introduced as alternative investigation methods for repairable systems (6.9). Approximate expressions are further developed. A unified approach for availability estimation and demonstration is given for exponentially and Erlangian distributed failure-free and repair times (7.2.2, A8.2.2.4, A8.3.1.4). Confidence limits at system level are given for the case of constant failure rates (7.2.3.1). Investigation of nonhomogeneous Poisson processes is refined and more general point processes (superimposed, cumulative) are discussed (A7.8), with application to data analysis (7.6.2) & cost optimization (4.7). Trend tests to detect early failures or wearout are introduced (7.6.3). A simple demonstration for mean & variance in a cumulative process is given (A7.8.4). Expansion of a 2-out-of-3 redundancy to a 1-out-of-3 redundancy is discussed (2.2.6.5). Some present production-related reliability problems in VLSI ICs are shown (3.3.4). Maintenance strategies are reviewed (4.6).

As in the previous editions of this book, reliability figures at system level have indices Si (e.g. MTTF_Si), where S stands for system and i is the state entered at t = 0 (Table 6.2). Furthermore, considering that for a repairable system the operating times between system failures can be neither identically distributed nor independent, failure rate is confined to nonrepairable systems or to repairable systems which are as-good-as-new after repair. Failure intensity is used for general repairable systems. For the cases in which renewal is assumed to occur, the variable x, starting with x = 0 at each renewal, is used instead of t, as for interarrival times. Also, because of the estimate MTBF = T / k often used in practical applications, MTBF is confined to repairable systems whose failure occurrence can be described by a homogeneous Poisson process, for which (and only for which) interarrival times are independent, exponentially distributed random variables with the same parameter λ_S and mean MTBF_S = 1 / λ_S (p. 358). For Markov and semi-Markov models, MUT_S is used (pp. 265, 477). Repair is used as a synonym for restoration, with the assumption that repaired elements in a system are as-good-as-new after repair (the system is as-good-as-new, with respect to the state considered, only if all nonrepaired elements have constant failure rate). Reliability growth has been moved to Chapter 7, and Table 3.2 on electronic components has been placed in the new Appendix A10. A set of problems for homework assignment has been added in the new Appendix A11.
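As a minimal illustration of the MTBF convention just described, the following Python sketch computes the point estimate T / k and the corresponding constant failure rate 1 / MTBF. It is not taken from the book: the function names and the sample values are hypothetical, and the result is meaningful only under the assumption stated above, that failures follow a homogeneous Poisson process.

```python
def estimate_mtbf(cumulative_operating_time: float, failures: int) -> float:
    """Point estimate MTBF = T / k.

    Assumption: failures follow a homogeneous Poisson process, so that
    interarrival times are independent and exponentially distributed.
    """
    if cumulative_operating_time <= 0:
        raise ValueError("cumulative operating time T must be positive")
    if failures <= 0:
        raise ValueError("at least one failure (k >= 1) is needed for T / k")
    return cumulative_operating_time / failures


def failure_rate_from_mtbf(mtbf: float) -> float:
    """Constant failure rate lambda_S = 1 / MTBF_S."""
    if mtbf <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 / mtbf


if __name__ == "__main__":
    T = 10_000.0  # hypothetical cumulative operating time, in hours
    k = 4         # hypothetical number of observed failures
    mtbf_hat = estimate_mtbf(T, k)
    print(mtbf_hat, failure_rate_from_mtbf(mtbf_hat))
```

The sketch rejects k = 0 because T / k is undefined in that case; interval estimates for that situation are a separate topic (the preface points to Section 7.2.3.1 for confidence limits).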

This edition extends and replaces the previous editions. The comments of many friends and the agreeable cooperation with Springer-Verlag are gratefully acknowledged.

Zurich and Florence, September 13, 2006 Alessandro Birolini

Preface to the 4th Edition

The large interest granted to this book made a 4th edition necessary. The structure of the book is unchanged, with its main part in Chapters 1 - 8 and self-contained appendices A1 - A5 on management aspects and A6 - A8 on basic probability theory, stochastic processes & statistics.

Such a structure allows rapid access to practical results and a comprehensive introduction to the mathematical foundation of reliability theory. The content has been extended and reviewed. New models & considerations have been added to Appendix A7 for stochastic processes (NHPP), Chapter 4 for spare parts provisioning, Chapter 6 for complex repairable systems (imperfect switching, incomplete coverage, items with more than two states, phased-mission systems, fault tolerant reconfigurable systems with reward and frequency / duration aspects, Monte Carlo simulation), and Chapters 7 & 8 for reliability data analysis. Some results come from a stay in 2001 as Visiting Fellow at the Institute of Advanced Study of the University of Bologna.

Performance, dependability, cost, and time to market are key factors for today's products and services. However, failure of complex systems can have major safety consequences. Also here, one has to learn to consider causes rather than symptoms of undesirable events and avoid hypocritical attitudes. Reliability engineering can help. Its purpose is to develop methods and tools to evaluate and demonstrate reliability, maintainability, availability, and safety of components, equipment & systems, and to support development and production engineers in building in these characteristics. To build reliability, maintainability, and safety into complex systems, failure rate and failure mode analyses must be performed early in the development phase and be supported (as far as possible) by failure mechanism analysis, design guidelines, and design reviews. Before production, qualification tests are necessary to verify that targets have been achieved. In the production phase, processes have to be qualified and monitored to assure the required quality level. For many systems, availability requirements have to be met and stochastic processes are used to investigate and optimize reliability and availability, including logistic support as well. Software often plays a dominant role, requiring specific quality assurance activities. Finally, to be cost and time effective, reliability engineering has to be coordinated with quality management (TQM) efforts, including value engineering and concurrent engineering, as appropriate.

This book presents the state-of-the-art of reliability engineering in theory and practice. It is a textbook based on the author's experience of 30 years in this field, half in industry and as founder of the Swiss Test Lab. for VLSI ICs in Neuchâtel, and half as Professor (full since 1992) of Reliability Engineering at the Swiss Federal Institute of Technology (ETH), Zurich. It also reflects the experience gained in an effective cooperation between University and industry over 10 years with more than 30 medium and large industries [1.2 (1996)]+). Following Chapter 1, the book is structured in three parts:

1. Chapters 2 - 8 deal with reliability, maintainability, and availability analysis and test, with emphasis on practical aspects in Chapters 3, 5, and 8. This part answers the question of how to build in, evaluate, and demonstrate reliability, maintainability, and availability.

2. Appendices A1 - A5 deal with definitions, standards, and program plans for quality and reliability assurance / management of complex systems. This minor part of the book has been added to comment on definitions and standards, and to support managers in answering the question of how to specify and achieve high reliability targets for complex systems, when tailoring is not mandatory.

3. Appendices A6 - A8 give a comprehensive introduction to probability theory, stochastic processes, and statistics, as needed in Chapters 2, 6, and 7, respectively. Markov, semi-Markov, and semi-regenerative processes are introduced with a view developed by the author in [A7.2 (1975 & 1985)]. This part is addressed to system-oriented engineers.

Methods and tools are presented in a way that they can be tailored to cover different levels of reliability requirements (the reader has to select this level). Investigation of repairable systems is performed systematically for many of the structures occurring in practical applications, starting with constant failure and repair rates and generalizing step by step up to the case in which the process involved is regenerative with a minimum number of regeneration states. Considering for each element MTTR (mean time to repair) << MTTF (mean time to failure), it is shown that the shape of the repair time distribution has a small influence on the results at system level and, for constant failure rate, the reliability function at the system level can often be approximated by an exponential function. For large series - parallel systems, approximate expressions for reliability and availability are developed in depth, in particular using macro structures as introduced by the author in [6.5 (1991)]. Procedures to investigate repairable systems with complex structure (for which a reliability block diagram often does not exist) are given as further application of the tools introduced in Appendix A7, in particular for imperfect switching, incomplete fault coverage, elements with more than two states, phased-mission systems, and fault tolerant reconfigurable systems with reward & frequency / duration aspects. New design rules have been added for imperfect switching and incomplete coverage. A Monte Carlo approach useful for rare events is given. Spare parts provisioning is discussed for decentralized and centralized logistic support. Estimation and demonstration of a constant failure rate and statistical evaluation of general reliability data are considered in depth. Qualification tests and screening for components and assemblies are discussed in detail. Methods for causes-to-effects analysis, design guidelines for reliability, maintainability & software quality, and checklists for design reviews are considered carefully. Cost optimization is investigated for some practical applications. Standards and trends in quality management are discussed. A large number of tables, figures, and examples support practical aspects.

It is emphasized that care is necessary in the statistical analysis of reliability data (in particular for accelerated tests and reliability growth), that causes-to-effects analysis should be performed systematically at least where redundancy appears (also to support remote maintenance), and that further efforts should be made to develop approximate expressions for complex repairable systems as well as models for fault tolerant systems with hardware and software.

Most of the methods & tools given in this book can be used to investigate / improve safety as well, which no longer has to be considered separately from reliability (although modeling human aspects can lead to some difficulties). The same holds for process and services reliability.

The book has been used for many years (1st German Ed. 1985, Springer) as a three-semester textbook for beginning graduate students at the ETH Zurich and for courses aimed at engineers in industry. The basic course (Chapters 1, 2, 5 & 7, with introduction to Chapters 3, 4, 6 & 8) should belong to the curriculum of most engineering degrees.

This edition extends and reviews the 3rd Edition (1999). It aims further to establish a link between theory and practice, to be a contribution to a continuous learning program and a sustainable development, and to support creativity (stimulated by an internal confidence and a deep observation of nature, but restrained by excessive bureaucracy or depersonalization). The comments of many friends and the agreeable cooperation with Springer-Verlag are gratefully acknowledged.

Zurich and Florence, March 2003 Alessandro Birolini

+) For [...], see References at the end of the book.

Contents

1 Basic Concepts, Quality and Reliability Assurance of Complex Equipment & Systems
  1.1 Introduction
  1.2 Basic Concepts
    1.2.1 Reliability
    1.2.2 Failure
    1.2.3 Failure Rate
    1.2.4 Maintenance, Maintainability
    1.2.5 Logistic Support
    1.2.6 Availability
    1.2.7 Safety, Risk, and Risk Acceptance
    1.2.8 Quality
    1.2.9 Cost and System Effectiveness
    1.2.10 Product Liability
    1.2.11 Historical Development
  1.3 Basic Tasks & Rules for Quality & Reliability Assurance of Complex Equip. & Systems
    1.3.1 Quality and Reliability Assurance Tasks
    1.3.2 Basic Quality and Reliability Assurance Rules
    1.3.3 Elements of a Quality Assurance System
    1.3.4 Motivation and Training

2 Reliability Analysis During the Design Phase (Nonrepairable Items up to System Failure)
  2.1 Introduction
  2.2 Predicted Reliability of Equipment and Systems with Simple Structure
    2.2.1 Required Function
    2.2.2 Reliability Block Diagram
    2.2.3 Operating Conditions at Component Level, Stress Factors
    2.2.4 Failure Rate of Electronic Components
    2.2.5 Reliability of One-Item Structure
    2.2.6 Reliability of Series-Parallel Structures
      2.2.6.1 Systems without Redundancy
      2.2.6.2 Concept of Redundancy
      2.2.6.3 Parallel Models
      2.2.6.4 Series - Parallel Structures
      2.2.6.5 Majority Redundancy
    2.2.7 Part Count Method
  2.3 Reliability of Systems with Complex Structure
    2.3.1 Key Item Method
      2.3.1.1 Bridge Structure
      2.3.1.2 Rel. Block Diagram in which Elements Appear More than Once
    2.3.2 Successful Path Method
    2.3.3 State Space Method
    2.3.4 Boolean Function Method
    2.3.5 Parallel Models with Constant Failure Rates and Load Sharing
    2.3.6 Elements with more than one Failure Mechanism or one Failure Mode
    2.3.7 Basic Considerations on Fault Tolerant Structures
  2.4 Reliability Allocation
  2.5 Mechanical Reliability, Drift Failures
  2.6 Failure Mode Analysis
  2.7 Reliability Aspects in Design Reviews

3 Qualification Tests for Components and Assemblies
  3.1 Basic Selection Criteria for Electronic Components
    3.1.1 Environment
    3.1.2 Performance Parameters
    3.1.3 Technology
    3.1.4 Manufacturing Quality
    3.1.5 Long-Term Behavior of Performance Parameters
    3.1.6 Reliability
  3.2 Qualification Tests for Complex Electronic Components
    3.2.1 Electrical Test of Complex ICs
    3.2.2 Characterization of Complex ICs
    3.2.3 Environmental and Special Tests of Complex ICs
    3.2.4 Reliability Tests
  3.3 Failure Modes, Failure Mechanisms, and Failure Analysis of Electronic Components
    3.3.1 Failure Modes of Electronic Components
    3.3.2 Failure Mechanisms of Electronic Components
    3.3.3 Failure Analysis of Electronic Components
    3.3.4 Examples of VLSI Production-Related Reliability Problems
  3.4 Qualification Tests for Electronic Assemblies

4 Maintainability Analysis
  4.1 Maintenance, Maintainability
  4.2 Maintenance Concept
    4.2.1 Fault Recognition and Isolation
    4.2.2 Equipment and System Partitioning
    4.2.3 User Documentation
    4.2.4 Training of Operating and Maintenance Personnel
    4.2.5 User Logistic Support
  4.3 Maintainability Aspects in Design Reviews
  4.4 Predicted Maintainability
    4.4.1 Calculation of MTTR_S
    4.4.2 Calculation of MTTPM_S
  4.5 Basic Models for Spare Parts Provisioning
    4.5.1 Centralized Logistic Support, Nonrepairable Spare Parts
    4.5.2 Decentralized Logistic Support, Nonrepairable Spare Parts
    4.5.3 Repairable Spare Parts
  4.6 Repair Strategies
  4.7 Cost Considerations

5 Design Guidelines for Reliability, Maintainability, and Software Quality
  5.1 Design Guidelines for Reliability
    5.1.1 Derating
    5.1.2 Cooling
    5.1.3 Moisture
    5.1.4 Electromagnetic Compatibility, ESD Protection
    5.1.5 Components and Assemblies
      5.1.5.1 Component Selection
      5.1.5.2 Component Use
      5.1.5.3 PCB and Assembly Design
      5.1.5.4 PCB and Assembly Manufacturing
      5.1.5.5 Storage and Transportation
    5.1.6 Particular Guidelines for IC Design and Manufacturing
  5.2 Design Guidelines for Maintainability
    5.2.1 General Guidelines
    5.2.2 Testability
    5.2.3 Accessibility, Exchangeability
    5.2.4 Operation, Adjustment
  5.3 Design Guidelines for Software Quality
    5.3.1 Guidelines for Software Defect Prevention
    5.3.2 Configuration Management
    5.3.3 Guidelines for Software Testing
    5.3.4 Software Quality Growth Models

6 Reliability and Availability of Repairable Systems
  6.1 Introduction and General Assumptions
  6.2 One-Item Structure
    6.2.1 One-Item Structure New at Time t = 0
      6.2.1.1 Reliability Function
      6.2.1.2 Point Availability
      6.2.1.3 Average Availability
      6.2.1.4 Interval Reliability
      6.2.1.5 Special Kinds of Availability
    6.2.2 One-Item Structure New at Time t = 0 and with Constant Failure Rate λ
    6.2.3 One-Item Structure with Arbitrary Initial Conditions at Time t = 0
    6.2.4 Asymptotic Behavior
    6.2.5 Steady-State Behavior
  6.3 Systems without Redundancy
    6.3.1 Series Structure with Constant Failure and Repair Rates
    6.3.2 Series Structure with Constant Failure and Arbitrary Repair Rates
    6.3.3 Series Structure with Arbitrary Failure and Repair Rates
  6.4 1-out-of-2 Redundancy
    6.4.1 1-out-of-2 Redundancy with Constant Failure and Repair Rates
    6.4.2 1-out-of-2 Redundancy with Constant Failure and Arbitrary Repair Rates
    6.4.3 1-out-of-2 Red. with Const. Failure Rate in Res. State and Arbitr. Repair Rates
  6.5 k-out-of-n Redundancy
    6.5.1 k-out-of-n Warm Redundancy with Constant Failure and Repair Rates
    6.5.2 k-out-of-n Active Redundancy with Const. Failure and Arbitrary Repair Rates
  6.6 Simple Series - Parallel Structures
  6.7 Approximate Expressions for Large Series - Parallel Structures
    6.7.1 Introduction
    6.7.2 Application to a Practical Example
  6.8 Systems with Complex Structure
    6.8.1 General Considerations
    6.8.2 Preventive Maintenance
    6.8.3 Imperfect Switching
    6.8.4 Incomplete Coverage
    6.8.5 Elements with more than two States or one Failure Mode
    6.8.6 Fault Tolerant Reconfigurable Systems
      6.8.6.1 Ideal Case
      6.8.6.2 Time Censored Reconfiguration (Phased-Mission Systems)
      6.8.6.3 Failure Censored Reconfiguration
      6.8.6.4 With Reward and Frequency / Duration Aspects
    6.8.7 Systems with Common Cause Failures
    6.8.8 General Procedure for Modeling Complex Systems
  6.9 Alternative Investigation Methods
    6.9.1 Petri Nets
    6.9.2 Dynamic Fault Trees
    6.9.3 Computer-Aided Reliability and Availability Computation
      6.9.3.1 Numerical Solution of Equations for Reliability and Availability
      6.9.3.2 Monte Carlo Simulations

7 Statistical Quality Control and Reliability Tests
  7.1 Statistical Quality Control
    7.1.1 Estimation of a Defective Probability p
    7.1.2 Simple Two-sided Sampling Plans for Demonstration of a Def. Probability p
      7.1.2.1 Simple Two-sided Sampling Plans
      7.1.2.2 Sequential Tests
    7.1.3 One-sided Sampling Plans for the Demonstration of a Def. Probability p
  7.2 Statistical Reliability Tests
    7.2.1 Reliability & Availability Estimation & Demon. for the case of a given Mission
    7.2.2 Availability Estimation & Demonstration for Continuous Operation (steady-state)
      7.2.2.1 Availability Estimation
      7.2.2.2 Availability Demonstration
      7.2.2.3 Further Availability Evaluation Methods for Continuous Operation
    7.2.3 Estimation and Demonstration of a Constant Failure Rate λ (or of MTBF = 1/λ)
      7.2.3.1 Estimation of a Constant Failure Rate λ
      7.2.3.2 Simple Two-sided Test for the Demonstration of λ
      7.2.3.3 Simple One-sided Test for the Demonstration of λ
  7.3 Statistical Maintainability Tests
    7.3.1 Estimation of an MTTR
    7.3.2 Demonstration of an MTTR
  7.4 Accelerated Testing
  7.5 Goodness-of-fit Tests
    7.5.1 Kolmogorov-Smirnov Test
    7.5.2 Chi-square Test
  7.6 Statistical Analysis of General Reliability Data
    7.6.1 General considerations
    7.6.2 Tests for Nonhomogeneous Poisson Processes
    7.6.3 Trend Tests
      7.6.3.1 Tests of a HPP versus a NHPP with increasing intensity
      7.6.3.2 Tests of a HPP versus a NHPP with decreasing intensity
      7.6.3.3 Heuristic Tests to distinguish between HPP and Gen. Monotonic Trend
  7.7 Reliability Growth

8 Quality & Reliability Assurance During the Production Phase (Basic Considerations)
  8.1 Basic Activities
  8.2 Testing and Screening of Electronic Components
    8.2.1 Testing of Electronic Components
    8.2.2 Screening of Electronic Components
  8.3 Testing and Screening of Electronic Assemblies
  8.4 Test and Screening Strategies, Economic Aspects
    8.4.1 Basic Considerations
    8.4.2 Quality Cost Optimization at Incoming Inspection Level
    8.4.3 Procedure to handle first deliveries

Annexes

A1 Terms and Definitions

A2 Quality and Reliability Standards
  A2.1 Introduction
  A2.2 Requirements in the Industrial Field
  A2.3 Requirements in the Aerospace, Defense, and Nuclear Fields

A3 Definition and Realization of Quality and Reliability Requirements
  A3.1 Definition of Quality and Reliability Requirements
  A3.2 Realization of Quality and Reliability Requirements for Complex Equip. & Systems
  A3.3 Elements of a Quality and Reliability Assurance Program
    A3.3.1 Project Organization, Planning, and Scheduling
    A3.3.2 Quality and Reliability Requirements
    A3.3.3 Reliability and Safety Analysis
    A3.3.4 Selection and Qualification of Components, Materials & Manuf. Processes
    A3.3.5 Configuration Management
    A3.3.6 Quality Tests
    A3.3.7 Quality Data Reporting System

A4 Checklists for Design Reviews
  A4.1 System Design Review
  A4.2 Preliminary Design Reviews
  A4.3 Critical Design Review (System Level)

A5 Requirements for Quality Data Reporting Systems

A6 Basic Probability Theory
  A6.1 Field of Events
  A6.2 Concept of Probability
  A6.3 Conditional Probability, Independence
  A6.4 Fundamental Rules of Probability Theory
    A6.4.1 Addition Theorem for Mutually Exclusive Events
    A6.4.2 Multiplication Theorem for Two Independent Events
    A6.4.3 Multiplication Theorem for Arbitrary Events
    A6.4.4 Addition Theorem for Arbitrary Events  399
    A6.4.5 Theorem of Total Probability  400
  A6.5 Random Variables, Distribution Functions  401
  A6.6 Numerical Parameters of Random Variables  406
    A6.6.1 Expected Value (Mean)  406
    A6.6.2 Variance  410
    A6.6.3 Modal Value, Quantile, Median  412
  A6.7 Multidimensional Random Variables, Conditional Distributions  412
  A6.8 Numerical Parameters of Random Vectors  414
    A6.8.1 Covariance Matrix, Correlation Coefficient  415
    A6.8.2 Further Properties of Expected Value and Variance  416
  A6.9 Distribution of the Sum of Indep. Positive Random Variables and of τ_min, τ_max  416
  A6.10 Distribution Functions used in Reliability Analysis  419
    A6.10.1 Exponential Distribution  419
    A6.10.2 Weibull Distribution  420
    A6.10.3 Gamma Distribution, Erlangian Distribution, and χ²-Distribution  422
    A6.10.4 Normal Distribution  424
    A6.10.5 Lognormal Distribution  425
    A6.10.6 Uniform Distribution  427
    A6.10.7 Binomial Distribution  427
    A6.10.8 Poisson Distribution  429
    A6.10.9 Geometric Distribution  431
    A6.10.10 Hypergeometric Distribution  432
  A6.11 Limit Theorems  432
    A6.11.1 Law of Large Numbers  433
    A6.11.2 Central Limit Theorem  434

A7 Basic Stochastic-Processes Theory  438
  A7.1 Introduction  438
  A7.2 Renewal Processes  441
    A7.2.1 Renewal Function, Renewal Density  443
    A7.2.2 Recurrence Times  446
    A7.2.3 Asymptotic Behavior  447
    A7.2.4 Stationary Renewal Processes  449
    A7.2.5 Homogeneous Poisson Processes  450
  A7.3 Alternating Renewal Processes  452
  A7.4 Regenerative Processes  456
  A7.5 Markov Processes with Finitely Many States  458
    A7.5.1 Markov Chains with Finitely Many States  458
    A7.5.2 Markov Processes with Finitely Many States  460
    A7.5.3 State Probabilities and Stay (Sojourn) Times in a Given Class of States  469
      A7.5.3.1 Method of Differential Equations  469
      A7.5.3.2 Method of Integral Equations  473
      A7.5.3.3 Stationary State and Asymptotic Behavior  474
    A7.5.4 Frequency / Duration and Reward Aspects  476
      A7.5.4.1 Frequency / Duration  476
      A7.5.4.2 Reward  478
    A7.5.5 Birth and Death Process  479
  A7.6 Semi-Markov Processes with Finitely Many States  483
  A7.7 Semi-regenerative Processes  488
  A7.8 Nonregenerative Stochastic Processes  492
    A7.8.1 General Considerations  492
    A7.8.2 Nonhomogeneous Poisson Processes (NHPP)  493
    A7.8.3 Superimposed Renewal Processes  497
    A7.8.4 Cumulative Processes  498
    A7.8.5 General Point Processes  500

A8 Basic Mathematical Statistics  503
  A8.1 Empirical Methods  503
    A8.1.1 Empirical Distribution Function  504
    A8.1.2 Empirical Moments and Quantiles  506
    A8.1.3 Further Applications of the Empirical Distribution Function  507
  A8.2 Parameter Estimation  511
    A8.2.1 Point Estimation  511
    A8.2.2 Interval Estimation  516
      A8.2.2.1 Estimation of an Unknown Probability p  516
      A8.2.2.2 Estimation of the Param. λ for an Exp. Distribution, Fixed T  520
      A8.2.2.3 Estimation of the Param. λ for an Exp. Distribution, Fixed n  521
      A8.2.2.4 Availability Estimation (Erlangian Failure-Free & Repair Times)  523
  A8.3 Testing Statistical Hypotheses  525
    A8.3.1 Testing an Unknown Probability p  526
      A8.3.1.1 Simple Two-sided Sampling Plan  527
      A8.3.1.2 Sequential Test  528
      A8.3.1.3 Simple One-sided Sampling Plan  529
      A8.3.1.4 Availability Demonstration (Erlangian Failure-Free & Rep. Times)  531
    A8.3.2 Goodness-of-fit Tests for Completely Specified F_0(t)  533
    A8.3.3 Goodness-of-fit Tests for F_0(t) with Unknown Parameters  536

A9 Tables and Charts  539
  A9.1 Standard Normal Distribution  539
  A9.2 χ²-Distribution (Chi-Square Distribution)  540
  A9.3 t-Distribution (Student distribution)  541
  A9.4 F-Distribution (Fisher distribution)  542
  A9.5 Table for the Kolmogorov-Smirnov Test  543
  A9.6 Gamma Function  544
  A9.7 Laplace Transform  545
  A9.8 Probability Charts (Probability Plot Papers)  547
    A9.8.1 Lognormal Probability Chart  547
    A9.8.2 Weibull Probability Chart  548
    A9.8.3 Normal Probability Chart  549

A10 Basic Technological Component's Properties  550

A11 Problems for Home-Work  554

Acronyms  560

References  561

Index  581

1 Basic Concepts, Quality and Reliability Assurance of Complex Equipment and Systems

The purpose of reliability engineering is to develop methods and tools to evaluate and demonstrate reliability, maintainability, availability, and safety of components, equipment, and systems, as well as to support development and production engineers in building in these characteristics. In order to be cost and time effective, reliability engineering must be integrated in project activities, and support quality assurance and concurrent engineering efforts. This chapter introduces basic concepts, shows their relationships, and discusses the tasks necessary to assure quality and reliability of complex equipment and systems with high quality and reliability requirements. A comprehensive list of definitions is given in Appendix A1. Standards for quality assurance (management) systems are discussed in Appendix A2. Refinements of management aspects are given in Appendices A3 - A5 for the cases in which tailoring is not mandatory.

1.1 Introduction

Until the nineteen-sixties, quality targets were deemed to have been reached when the item considered was found to be free of defects or systematic failures at the time it left the manufacturer. The growing complexity of equipment and systems, as well as the rapidly increasing cost incurred by loss of operation as a consequence of failures, have brought to the forefront the aspects of reliability, maintainability, availability, and safety. The expectation today is that complex equipment and systems are not only free from defects and systematic failures at time t = 0 (when they are put into operation), but also perform the required function failure free for a stated time interval and have a fail-safe behavior in the case of critical or catastrophic failures. However, the question of whether a given item will operate without failures during a stated period of time cannot be simply answered by yes or no, on the basis of a compliance test. Experience shows that only a probability for this occurrence can be given. This probability is a measure of the item's reliability and can be interpreted as follows:

If n statistically identical items are put into operation at time t = 0 to perform a given mission and $\bar{\nu} \le n$ of them accomplish it successfully, then the ratio $\bar{\nu}/n$ is a random variable which converges for increasing n to the true value of the reliability (Appendix A6.11).
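The convergence of this ratio can be illustrated numerically. The following is a minimal sketch, assuming (purely for illustration) exponentially distributed failure-free times; the function name `simulate_reliability`, the failure rate, and the mission duration are hypothetical values, not taken from the text.

```python
import math
import random

def simulate_reliability(n, failure_rate, mission_time, seed=0):
    """Return the ratio of items (out of n) that survive the mission.

    Assumption: failure-free times are exponentially distributed.
    """
    rng = random.Random(seed)  # fixed seed so that a run is repeatable
    survivors = sum(
        1 for _ in range(n) if rng.expovariate(failure_rate) > mission_time
    )
    return survivors / n

# Hypothetical values: failure rate 1e-3 per hour, 100 h mission.
true_reliability = math.exp(-1e-3 * 100.0)
for n in (10, 1_000, 100_000):
    ratio = simulate_reliability(n, 1e-3, 100.0)
    print(n, round(ratio, 4), round(true_reliability, 4))
```

For small n the ratio scatters noticeably around the true value; the scatter shrinks as n grows, which is the behavior described by the law of large numbers.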

Performance parameters as well as reliability, maintainability, availability, and safety have to be built in during design & development and retained during production and operation of an item. After the introduction of some important concepts in Section 1.2, Section 1.3 gives basic tasks and rules for quality and reliability assurance of complex equipment and systems with high quality and reliability requirements (see Appendix A1 for a comprehensive list of definitions and Appendices A2 - A5 for a refinement of management aspects).

1.2 Basic Concepts

This section introduces important concepts used in reliability engineering and shows their relationships (see Appendix A1 for a more complete list).

1.2.1 Reliability

Reliability is a characteristic of an item, expressed by the probability that the item will perform its required function under given conditions for a stated time interval. It is generally designated by R. From a qualitative point of view, reliability can be defined as the ability of the item to remain functional. Quantitatively, reliability specifies the probability that no operational interruptions will occur during a stated time interval. This does not mean that redundant parts may not fail; such parts can fail and be repaired (without operational interruption at item (system) level). The concept of reliability thus applies to nonrepairable as well as to repairable items (Chapters 2 and 6, respectively). To make sense, a numerical statement of reliability (e.g., R = 0.9) must be accompanied by the definition of the required function, the operating conditions, and the mission duration. In general, it is also important to know whether or not the item can be considered new when the mission starts.

An item is a functional or structural unit of arbitrary complexity (e.g. component, assembly, equipment, subsystem, system) that can be considered as an entity for investigations. It may consist of hardware, software, or both and may also include human resources. Often, ideal human aspects and logistic support are assumed, even if (for simplicity) the term system is used instead of technical system.

The required function specifies the item's task. For example, for given inputs, the item outputs have to be constrained within specified tolerance bands (performance parameters should still be given with tolerances and not merely as fixed values). The definition of the required function is the starting point for any reliability analysis, as it defines failures.

Operating conditions have an important influence upon reliability, and must therefore be specified with care. Experience shows, for example, that the failure rate of semiconductor devices doubles for an operating temperature increase of 10 to 20°C.

The required function and/or operating conditions can be time dependent. In these cases, a mission profile has to be defined and all reliability figures will be related to it. A representative mission profile and the corresponding reliability targets should be given in the item's specification.

Often the mission duration is considered as a parameter t; the reliability function is then defined by R(t). R(t) is the probability that no failure at item level will occur in the interval (0, t]. The item's condition at t = 0 (new or not) influences final results. To consider this, reliability figures at system level will have indices Si (e.g. $R_{Si}(t)$), where S stands for system and i is the state entered at t = 0 (Table 6.2). A distinction between predicted and estimated or assessed reliability is important. The first one is calculated on the basis of the item's reliability structure and the failure rate of its components (Sections 2.2 & 2.3); the second is obtained from a statistical evaluation of reliability tests (Section 7.2) or from field data by known environmental and operating conditions.

The concept of reliability can be extended to processes and services as well, although human aspects can lead to modeling difficulties (see e.g. Section 1.2.7).

1.2.2 Failure

A failure occurs when the item stops performing its required function. As simple as this definition is, it can become difficult to apply it to complex items. The failure-free time (hereafter used as a synonym for failure-free operating time) is generally a random variable. It is often reasonably long, but it can be very short, for instance because of a failure caused by a transient event at turn-on. A general assumption in investigating failure-free times is that at t = 0 the item is free of defects and systematic failures. Besides their frequency, failures should be classified (as far as possible) according to the mode, cause, effect, and mechanism:

1. Mode: The mode of a failure is the symptom (local effect) by which a failure is observed; e.g., opens, shorts, or drift for electronic components (Table 3.4); brittle rupture, creep, cracking, seizure, fatigue for mechanical components.

2. Cause: The cause of a failure can be intrinsic, due to weaknesses in the item and/or wearout, or extrinsic, due to errors, misuse or mishandling during the design, production, or use. Extrinsic causes often lead to systematic failures, which are deterministic and should be considered like defects (dynamic defects in software quality). Defects are present at t = 0, even if often they cannot be discovered at t = 0. Failures always appear in time, even if the time to failure is short as it can be with systematic or early failures.

3. Effect: The effect (consequence) of a failure can be different if considered on the item itself or at higher level. A usual classification is: non relevant, partial, complete, and critical failure. Since a failure can also cause further failures, distinction between primary and secondary failure is important.

4. Mechanism: Failure mechanism is the physical, chemical, or other process resulting in a failure (see Table 3.5 for some examples).

Failures can also be classified as sudden and gradual. In this case, sudden and complete failures are termed cataleptic failures; gradual and partial failures are termed degradation failures. As failure is not the only cause for an item being down, the general term used to define the down state of an item (not caused by a preventive maintenance, other planned actions, or lack of external resources) is fault. Fault is thus a state of an item and can be due to a defect or a failure.

1.2.3 Failure Rate

The failure rate plays an important role in reliability analysis. This section introduces it heuristically; see Appendix A6.5 for an analytical derivation.

Let us assume that n statistically identical and independent items are put into operation at time t = 0, under the same conditions, and at the time t a subset $\bar{\nu}(t)$ of these items have not yet failed. $\bar{\nu}(t)$ is a right continuous decreasing step function (Fig. 1.1). $t_1, \ldots, t_n$, measured from t = 0, are the observed failure-free times (times to failure) of the n items considered. They are independent realizations of a random variable $\tau$ (hereafter identified as a failure-free time) and must not be confused with arbitrary points on the time axis ($t_1^*, t_2^*, \ldots$). The quantity

$$\hat{E}[\tau] = \frac{t_1 + \ldots + t_n}{n} \qquad (1.1)$$

is the empirical mean (empirical expected value) of $\tau$. Empirical quantities are statistical estimates, marked with ^ in this book. For $n \to \infty$, $\hat{E}[\tau]$ converges to the true value $E[\tau] = MTTF$ (given by Eq. (1.8)) of the mean failure-free time (Eq. (A6.147), see also Appendix A8.1.2). The function

$$\hat{R}(t) = \frac{\bar{\nu}(t)}{n} \qquad (1.2)$$

is the empirical reliability function. As shown in Appendix A8.1.1, $\hat{R}(t)$ converges to the reliability function R(t) for $n \to \infty$.

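Figure 1.1 Number $\bar{\nu}(t)$ of (nonrepairable) items still operating at time t

For an arbitrary time interval (t, t + δt], the empirical failure rate is defined as

$$\hat{\lambda}(t) = \frac{\bar{\nu}(t) - \bar{\nu}(t + \delta t)}{\bar{\nu}(t)\,\delta t} . \qquad (1.3)$$

$\hat{\lambda}(t)\,\delta t$ is the ratio of the items failed in the interval (t, t + δt] to the number of items still operating (or surviving) at time t. Applying Eq. (1.2) to Eq. (1.3) yields

$$\hat{\lambda}(t) = \frac{\hat{R}(t) - \hat{R}(t + \delta t)}{\delta t\,\hat{R}(t)} . \qquad (1.4)$$

For $n \to \infty$ and $\delta t \downarrow 0$, and assuming R(t) differentiable, $\hat{\lambda}(t)$ converges to the failure rate

$$\lambda(t) = \frac{-\,dR(t)/dt}{R(t)} . \qquad (1.5)$$

Considering R(0) = 1 (at t = 0 all items are new) it follows that

$$R(t) = e^{-\int_0^t \lambda(x)\,dx} . \qquad (1.6)$$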

The failure rate $\lambda(t)$ given by Eqs. (1.3) - (1.5) applies in particular to nonrepairable items (Figs. 1.1 & 1.2). However, considering Eq. (A6.25) it can also be used for repairable items which are as-good-as-new after repair (renewal), taking instead of t the variable x starting by x = 0 at each renewal (as for interarrival times). If a repairable system cannot be restored to be as-good-as-new after repair (with respect to the state considered), i.e. if at least one element with time dependent failure rate has not been renewed at every repair, failure intensity z(t) has to be used (see pp. 355, 356, 358 for comments). The use of hazard rate for $\lambda(t)$ should also be avoided.
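The empirical quantities introduced above can be computed directly from a list of observed failure-free times. The following is a minimal sketch; the function names and the sample failure times are hypothetical and serve only to illustrate the empirical reliability function (number of items still operating at t, divided by n) and the empirical failure rate over an interval (t, t + δt].

```python
def surviving(failure_times, t):
    """Number of items still operating at time t (right continuous step)."""
    return sum(1 for ti in failure_times if ti > t)

def empirical_reliability(failure_times, t):
    """Empirical reliability: surviving items at t divided by n."""
    return surviving(failure_times, t) / len(failure_times)

def empirical_failure_rate(failure_times, t, dt):
    """Items failed in (t, t + dt] per surviving item and unit time."""
    at_t = surviving(failure_times, t)
    if at_t == 0:
        raise ValueError("no items operating at time t")
    return (at_t - surviving(failure_times, t + dt)) / (at_t * dt)

# Hypothetical failure-free times in hours for n = 8 items.
times = [120.0, 340.0, 560.0, 610.0, 900.0, 1500.0, 2100.0, 3300.0]
print(empirical_reliability(times, 600.0))          # 5 of 8 still operating
print(empirical_failure_rate(times, 500.0, 200.0))  # 2 of 6 fail within 200 h
```

With so few items the estimates are coarse step functions; they approach the true functions only for large n and small δt.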


In many practical applications, $\lambda(t) = \lambda$ can be assumed. Eq. (1.6) then yields

$$R(t) = e^{-\lambda t}, \quad \text{for } \lambda(t) = \lambda . \qquad (1.7)$$

The failure-free time $\tau > 0$ is exponentially distributed ($F(t) = \Pr\{\tau \le t\} = 1 - e^{-\lambda t}$). For this case, and only in this case, the failure rate $\lambda$ can be estimated by $\hat{\lambda} = k/T$, where T is a given (fixed) cumulative operating time and k the total number of failures during T (Eqs. (7.28) and (A8.46)).

The mean (expected value) of the failure-free time $\tau > 0$ is given by (Eq. (A6.38))

$$E[\tau] = MTTF = \int_0^\infty R(t)\,dt , \qquad (1.8)$$

where MTTF stands for mean time to failure. For $\lambda(t) = \lambda$ it follows that $E[\tau] = 1/\lambda$. Constant (time independent) failure rate $\lambda$ is often assumed for repairable items too, considered as-good-as-new after repair (renewal). For this case, and only in this case, successive failure-free times are independent random variables, exponentially distributed with the same parameter $\lambda$, and have mean

$$MTBF = 1/\lambda, \quad \text{for } \lambda(x) = \lambda , \qquad (1.9)$$

where MTBF stands for mean operating time between failures. Also because of the statistical estimate $\widehat{MTBF} = T/k$ (Section 7.2.3.1), often used in practical applications, MTBF should be confined to the case of repairable items with constant failure rate (p. 358). For Markov and semi-Markov models, $MUT_S$ is used (Eqs. (6.287) or (A7.142)).
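The relation between the reliability function and the mean time to failure can be checked numerically. The sketch below, under the assumption of a constant failure rate, integrates R(t) with a simple trapezoidal rule and compares the result with 1/λ; it also shows the point estimates k/T and T/k. The function names and all numerical values are hypothetical.

```python
import math

def mttf_numeric(reliability, upper, steps=100_000):
    """Approximate the integral of R(t) over [0, upper] (trapezoidal rule)."""
    h = upper / steps
    total = 0.5 * (reliability(0.0) + reliability(upper))
    for i in range(1, steps):
        total += reliability(i * h)
    return total * h

def estimate_failure_rate(k, cumulative_time):
    """Point estimate k / T; meaningful only for a constant failure rate."""
    return k / cumulative_time

lam = 1e-3  # hypothetical constant failure rate in 1/h
# Truncate the integral at 50/lam, where R(t) is negligibly small.
mttf = mttf_numeric(lambda t: math.exp(-lam * t), upper=50.0 / lam)
print(mttf, 1.0 / lam)

# Hypothetical test result: k = 5 failures in T = 10000 h.
lam_hat = estimate_failure_rate(5, 10_000.0)
print(lam_hat, 1.0 / lam_hat)  # estimated failure rate and T / k
```

The same integration routine could be applied to any other reliability function, provided the upper limit is chosen large enough for the tail to be negligible.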

The failure rate of a large population of statistically identical and independent items often exhibits a typical bathtub curve (Fig. 1.2) with the following 3 phases:

1. Early failures: $\lambda(t)$ decreases (in general) rapidly with time; failures in this phase are attributable to randomly distributed weaknesses in materials, components, or production processes.

2. Failures with constant (or nearly so) failure rate: $\lambda(t)$ is approximately constant; failures in this period are Poisson distributed and often cataleptic.

3. Wearout failures: $\lambda(t)$ increases with time; failures in this period are attributable to aging, wearout, fatigue, etc. (e.g. corrosion, electromigration).

Early failures are not deterministic and appear in general randomly distributed in time and over the items. During the early failure period, $\lambda(t)$ need not necessarily decrease as in Fig. 1.2; in some cases it can oscillate. To eliminate early failures, burn-in or environmental stress screening is used (Chapter 8). Early failures must be distinguished from systematic failures, which are deterministic and caused by errors or mistakes, and whose elimination requires a change in design, production process, operational procedure, documentation or other. The length of the early failure period varies greatly in practice. However, in most applications it will be shorter than a few thousand hours. The presence of a period with constant (or nearly so) failure rate $\lambda(t) = \lambda$ is realistic for many equipment & systems, and useful for calculations. The memoryless property, which characterizes this period, leads to a homogeneous Poisson process for the flow of failures (Appendix A7.2.5) and to a Markov process for the time behavior of a repairable item if also constant repair rates can be assumed (Chapter 6). An increasing failure rate after a given operating time (> 10 years for many electronic equipment) is typical for most items and appears because of degradation phenomena due to wearout.

Figure 1.2 Typical shape for the failure rate of a large population of statistically identical and independent (nonrepairable) items (dashed is a possible shift for a higher stress, e.g. ambient temperature)

A possible explanation for the shape of $\lambda(t)$ given in Fig. 1.2 is that the population of n statistically identical and independent items contains $n\,p_f$ weak elements and $n(1 - p_f)$ good ones. The distribution of the failure-free time can then be expressed by a weighted sum of the form $F(t) = p_f F_1(t) + (1 - p_f) F_2(t)$. For calculation or simulation purposes, $F_1(t)$ could be a gamma distribution with $\beta < 1$ and $F_2(t)$ a shifted Weibull distribution with $\beta > 1$ (Eqs. (A6.34), (A6.96), (A6.97)).
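The effect of such a mixture on the failure rate can be sketched numerically. To stay within the standard library, the sketch below replaces the gamma distribution for the weak items by a Weibull distribution with β < 1 (a substitution made here for simplicity, not the choice named in the text); the good items follow a shifted Weibull distribution with β > 1. All parameter values and function names are hypothetical.

```python
import math

def weibull_cdf(t, lam, beta, shift=0.0):
    """F(t) = 1 - exp(-(lam * (t - shift))**beta) for t > shift, else 0."""
    if t <= shift:
        return 0.0
    return 1.0 - math.exp(-((lam * (t - shift)) ** beta))

def weibull_pdf(t, lam, beta, shift=0.0):
    """Density belonging to weibull_cdf."""
    if t <= shift:
        return 0.0
    x = lam * (t - shift)
    return beta * lam * x ** (beta - 1.0) * math.exp(-(x ** beta))

def mixture_failure_rate(t, p_weak, weak, good):
    """Failure rate f(t) / (1 - F(t)) of the weighted sum of two distributions."""
    f = p_weak * weibull_pdf(t, *weak) + (1.0 - p_weak) * weibull_pdf(t, *good)
    F = p_weak * weibull_cdf(t, *weak) + (1.0 - p_weak) * weibull_cdf(t, *good)
    return f / (1.0 - F)

# Hypothetical parameters: 5% weak items; tuples are (lam, beta[, shift]).
weak = (1e-3, 0.5)
good = (1e-5, 3.0, 20_000.0)
for t in (10.0, 10_000.0, 150_000.0):
    print(t, mixture_failure_rate(t, 0.05, weak, good))
```

With these values the failure rate is high at the beginning, drops by several orders of magnitude once most weak items have failed, and rises again when the good items start to wear out.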

The failure rate strongly depends upon the item's operating conditions. For semiconductor devices, experience shows for example that the value of $\lambda$ doubles for an operating temperature increase of 10 to 20°C and becomes more than an order of magnitude higher if the device is exposed to elevated mechanical stresses (Table 2.3). Typical figures for $\lambda$ are $10^{-10}$ to $10^{-7}\,\mathrm{h}^{-1}$ for electronic components.
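The doubling rule can be turned into a rough scaling estimate. The sketch below assumes that the doubling applies repeatedly over the whole temperature range considered, which is an extrapolation of the rule of thumb and not a statement from the text; the function name and the numerical values are hypothetical.

```python
def scaled_failure_rate(base_rate, base_temp_c, temp_c, doubling_step_c):
    """Scale a failure rate, assuming it doubles every doubling_step_c degrees."""
    if doubling_step_c <= 0:
        raise ValueError("doubling_step_c must be positive")
    return base_rate * 2.0 ** ((temp_c - base_temp_c) / doubling_step_c)

# Hypothetical device: 1e-8 per hour at 40 degrees C, evaluated at 60 degrees C.
optimistic = scaled_failure_rate(1e-8, 40.0, 60.0, 20.0)   # doubling per 20 degrees
pessimistic = scaled_failure_rate(1e-8, 40.0, 60.0, 10.0)  # doubling per 10 degrees
print(optimistic, pessimistic)
```

The spread between the two results shows how strongly a prediction depends on the assumed doubling step.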

The concept of failure rate also applies to humans and a shape similar to that depicted in Fig. 1.2 can be obtained from a mortality table.

As stated with Eqs. (1.3) - (1.5), the failure rate $\lambda(t)$ is a conditional density and must not be confused with the failure intensity z(t) (Eq. (A7.228)) or the intensity h(t) of a renewal process (Eq. (A7.18)) or m(t) of a Poisson process (Eq. (A7.193)). z(t), h(t), and m(t) are unconditional densities and differ basically from $\lambda(t)$. This distinction is important also for the case of a homogeneous Poisson process, for which $z(t) = h(t) = m(t) = \lambda$ holds for the intensity and $\lambda(x) = \lambda$ holds for the interarrival times (x starting by 0 at each interarrival time, see also p. 356). To reduce ambiguities, force of mortality has been suggested for $\lambda(t)$ in [6.3, A7.30].


1.2.4 Maintenance, Maintainability

Maintenance defines the set of activities performed on an item to retain it in or to restore it to a specified state. Maintenance is thus subdivided into preventive maintenance, carried out at predetermined intervals to reduce wearout failures, and corrective maintenance, carried out after failure recognition and intended to put the item into a state in which it can again perform the required function. Preventive maintenance also aims to detect and repair hidden failures, i.e. failures in redundant elements not identified at their occurrence. Corrective maintenance is also known as repair, and can include any or all of the following steps: recognition, isolation (localization & diagnosis), elimination (disassembly, replacement, reassembly), checkout. Repair is used hereafter as a synonym for restoration. To simplify calculations, it is generally assumed that the element in the reliability block diagram for which a maintenance action has been performed is as-good-as-new after maintenance. This assumption is valid for the whole equipment or system in the case of constant failure rate for all elements which have not been repaired or replaced.

Maintainability is a characteristic of an item, expressed by the probability that a preventive maintenance or a repair of the item will be performed within a stated time interval for given procedures and resources (skill level of personnel, spare parts, test facilities, etc.). From a qualitative point of view, maintainability can be defined as the ability of an item to be retained in or restored to a specified state. The expected value (mean) of the repair time is denoted by MTTR (mean time to repair), that of a preventive maintenance by MTTPM. Often used for unscheduled removals is also MTBUR. Maintainability has to be built into complex equipment or systems during design and development by realizing a maintenance concept. Due to the increasing maintenance cost, maintainability aspects have grown in importance. However, maintainability achieved in the field largely depends on the resources available for maintenance (human and material), as well as on the correct installation of the equipment or system, i.e. on the logistic support and accessibility.

1.2.5 Logistic Support

Logistic support designates all activities undertaken to provide effective and economical use of an item during its operating phase. To be effective, logistic support should be integrated into the maintenance concept of the item under consideration and include after-sales service.

An emerging aspect related to maintenance and logistic support is that of obsolescence management, i.e. how to assure functionality over a long operating period, e.g. 20 years, when technology is rapidly evolving and components needed for maintenance are no longer manufactured. Care has to be given here to design aspects, to assure interchangeability during the equipment's useful life without important redesign. Standardization in this direction is in progress [1.9].


1.2.6 Availability

Availability is a broad term, expressing the ratio of delivered to expected service. It is often designated by A and used for the stationary & steady-state value of the point and average availability (PA = AA). Point availability (PA(t)) is a characteristic of an item expressed by the probability that the item will perform its required function under given conditions at a stated instant of time t. From a qualitative point of view, point availability can be defined as the ability of the item to perform its required function under given conditions at a stated instant of time (dependability).

Availability evaluations are often difficult, as logistic support and human factors should be considered in addition to reliability and maintainability. Ideal human and logistic support conditions are thus often assumed, yielding the intrinsic (inherent) availability. Hereafter, availability is used as a synonym for intrinsic availability. Further assumptions for calculations are continuous operation and complete renewal for the repaired element in the reliability block diagram (assumed as-good-as-new after repair). For a given item, the point availability PA(t) rapidly converges to a stationary & steady-state value, given by (Eq. (6.48))

$$PA = AA = \frac{MTTF}{MTTF + MTTR} . \qquad (1.10)$$

PA is also the stationary & steady-state value of the average availability (AA), giving the expected value (mean) of the percentage of the time during which the item performs its required function. $PA_S$ and $AA_S$ are used for considerations at system level. Other availability measures can be defined, e.g. mission availability, work-mission availability, overall availability (Sections 6.2.1.5, 6.8.2). Application specific figures are also known, see e.g. [6.11]. In contrast to reliability analyses for which no failure at item (system) level is allowed (only redundant parts can fail and be repaired on line), availability analyses allow failures at item (system) level.
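The stationary availability is the ratio of the mean up time to the mean cycle time, MTTF / (MTTF + MTTR). A minimal sketch with hypothetical numbers follows; the function name and the yearly-downtime conversion (8760 h per year of continuous operation) are illustrative assumptions.

```python
def steady_state_availability(mttf, mttr):
    """Stationary availability MTTF / (MTTF + MTTR), same time unit for both."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf / (mttf + mttr)

# Hypothetical item: MTTF = 1000 h, MTTR = 4 h, continuous operation.
pa = steady_state_availability(1000.0, 4.0)
expected_downtime_per_year = (1.0 - pa) * 8760.0  # hours per year
print(pa, expected_downtime_per_year)
```

Halving MTTR has nearly the same effect on the unavailability as doubling MTTF, which is why maintainability matters as much as reliability for availability targets.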

1.2.7 Safety, Risk, and Risk Acceptance

Safety is the ability of the item not to cause injury to persons, nor significant material damage or other unacceptable consequences during its use. Safety evaluation must consider the following two aspects: safety when the item functions and is operated correctly, and safety when the item or a part of it has failed. The first aspect deals with accident prevention, for which a large number of national and international regulations exist. The second aspect is that of technical safety, which is investigated using the same tools as for reliability. However, a distinction between technical safety and reliability is necessary. While safety assurance examines measures which allow an item to be brought into a safe state in the case of failure (fail-safe behavior), reliability assurance deals more generally with measures for minimizing the total number of failures. Moreover, for technical safety the effects of external influences like human errors, catastrophes, sabotage, etc. are of great importance and must be considered carefully. The safety level of an item influences the number of product liability claims. However, increasing safety can reduce reliability.

Closely related to the concept of (technical) safety are those of risk, risk management, and risk acceptance, including risk analysis and risk assessment [1.21, 1.261. Risk problems are generally interdisciplinary and have to be solved in close cooperation between engineers und sociologists to find common solutions to controversial questions. An appropriate weighting between probability of occurrence and effect (consequence) of a given accident is important. The multiplicative rule is one among different possibilities. Also it is necessary to consider the different causes (machine, machine & human, human) and effects (location, time, involved people, effect duration) of an accident. Statistical tools can Support risk assessment. However, although the behavior of a homogenous human population is often known, experience shows that the reaction of a single Person can become unpredictable. Similar difficulties also arise in the evaluation of rare events in complex systems. Considerations on risk and risk acceptance should take into account that the probability p, for a given accident which can be caused by one of n statistically identical and independent items, each of them with occurrence probability p, is for np small nearly equal to np as per

Equation (1.11) follows from the binomial distribution and the Poisson approximation (Eqs. (A6.120) & (A6.129)). It also applies with n p = Atot T to the case in which one assumes that the accident occurs randomly in the interval (0, T], caused by one of n independent items (systems) with failure rates Al , ..., L,,, where ?L„, = ?Ll + ... + An . This is because the sum of n independent Poisson processes is again a Poisson process (Eq. (7.27)) and the probability ?L„ ~ e - ~ ~ ~ ~ ~ for one failure in the interval (0, T] is nearly equal to Atot T . Thus, for n p << 1 or Atot T << 1 it holds that

Also by assuming a reduction of the individual occurrence probability $p$ (or failure rate $\lambda_i$), one recognizes that in the future it will be necessary either to accept greater risks $p_a$ or to keep the spread of high-risk technologies under tighter control. Similar considerations could also be made for the problem of environmental stresses caused by mankind. Aspects of ecologically acceptable production, use, disposal, and recycling or reuse of products will become subject to international regulations, in the general context of sustainable development.
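The quality of the approximation $p_a \approx np$ can be checked numerically. The following is a minimal sketch, not part of the original text; the function names and the value of p are illustrative assumptions. It compares the exact binomial probability that exactly one of n items causes the accident with its Poisson approximation and with np.

```python
import math


def p_exactly_one(n: int, p: float) -> float:
    """Binomial probability that exactly one of n independent items causes the accident."""
    return n * p * (1.0 - p) ** (n - 1)


def p_poisson_one(n: int, p: float) -> float:
    """Poisson approximation n*p*exp(-n*p) of the same probability."""
    return n * p * math.exp(-n * p)


if __name__ == "__main__":
    p = 1e-6  # illustrative per-item occurrence probability (assumed value)
    for n in (10, 1_000, 100_000):
        # Columns: n, n*p, exact binomial value, Poisson approximation
        print(n, n * p, p_exactly_one(n, p), p_poisson_one(n, p))
```

For np = 0.001 the three values agree to about 0.1%, whereas for np = 0.1 the simple product np overestimates the exact value by roughly 10%, which illustrates why the condition np << 1 matters.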

In the context of a product development, risks related to feasibility and time to market within the given cost constraints must be considered during all development phases (feasibility checks in Fig. 1.6 and Tables A3.3 & 5.3).


Mandatory for risk management are psychological aspects related to risk awareness and safety communication. As long as a danger (risk) is not perceived, people often do not react. Since safe behavior presupposes risk awareness, communication is an important tool to prevent a risk related to the system considered from being underestimated, see e.g. [1.26].

1.2.8 Quality

Quality is understood as the degree to which a set of inherent characteristics fulfills requirements. This definition, given now also in the ISO 9000:2000 [A1.6], follows closely the traditional definition of quality, expressed by fitness for use, and applies to products and services as well.

1.2.9 Cost and System Effectiveness

All previously introduced concepts are interrelated. Their relationship is best shown through the concept of cost effectiveness, as given in Fig. 1.3. Cost effectiveness is a measure of the ability of the item to meet a service demand of stated quantitative characteristics, with the best possible usefulness to life-cycle cost ratio. It is often also referred to as system effectiveness. Figure 1.3 deals essentially with technical and cost aspects. Some management aspects are considered in Appendices A2 - A5. From Fig. 1.3, one recognizes the central role of quality assurance, bringing together all assurance activities (Section 1.3.3), and of dependability (collective term for availability performance and its influencing factors).

As shown in Fig. 1.3, life-cycle cost (LCC) is the sum of the cost for acquisition, operation, maintenance, and disposal of an item. For complex systems, higher reliability in general leads to a higher acquisition cost and lower operating cost, so that the optimum of life-cycle cost seldom lies at extremely low or high reliability figures. For such a system, per year operating and maintenance cost often lie between 3 and 6% of acquisition cost, and experience shows that up to 80% of the life-cycle cost is frequently generated by decisions early in the design phase. In the future, life-cycle cost will take more into account current and deferred damage to the environment caused by production, use, and disposal of an item. Life-cycle cost optimization is project specific, in general, and falls within the framework of cost effectiveness or systems engineering. It can be positively influenced by concurrent engineering [1.13, 1.15, 1.22]. Figure 1.4 shows as an example the influence of the attainment level of quality and reliability targets on the sum of cost for quality assurance and for the assurance of reliability, maintainability, and logistic support for two complex systems [2.3 (1986)]. To introduce this model, let us first consider Example 1.1.


Example 1.1 An assembly contains $n$ independent components each with a defective probability $p$. Let $c_k$ be the cost to replace $k$ defective components. Determine (i) the expected value (mean) $C_{(i)}$ of the total replacement cost (no defective components are allowed in the assembly) and (ii) the mean of the total cost (test and replacement) $C_{(ii)}$ if the components are submitted to an incoming inspection which reduces defective percentage from $p$ to $p_0$ (test cost $c_t$ per component).

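Solution

(i) The solution makes use of the binomial distribution (Appendix A6.10.7) and question (i) is also solved in Example A6.18. The probability of having exactly $k$ defective components in a lot of size $n$ is given by (Eq. (A6.120))

$$p_k = \binom{n}{k} p^k (1-p)^{n-k} . \qquad (1.13)$$

The mean $C_{(i)}$ of the total cost (deferred cost) caused by the defective components follows then from

$$C_{(i)} = \sum_{k=1}^{n} c_k \, p_k = \sum_{k=1}^{n} c_k \binom{n}{k} p^k (1-p)^{n-k} . \qquad (1.14)$$

(ii) To the cost caused by the defective components, calculated from Eq. (1.14) with $p_0$ instead of $p$, one must add the incoming inspection cost $n c_t$

$$C_{(ii)} = n c_t + \sum_{k=1}^{n} c_k \binom{n}{k} p_0^k (1-p_0)^{n-k} . \qquad (1.15)$$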

The difference between C(i) and C(ii) gives the gain (or loss) obtained by introducing the incoming inspection, allowing thus a cost optimization (see also Section 8.4 for a deeper discussion).
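The two means of Example 1.1 can be evaluated directly from the binomial distribution. The sketch below is illustrative only: the function names, the linear cost law c_k = k * c, and all numerical values are assumptions, not data from the example.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k defective components among n."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def mean_replacement_cost(n: int, p: float, cost_of_k) -> float:
    """Question (i): mean replacement cost, summing c_k * p_k over k = 1..n."""
    return sum(cost_of_k(k) * binomial_pmf(k, n, p) for k in range(1, n + 1))


def mean_cost_with_inspection(n: int, p0: float, cost_of_k, c_t: float) -> float:
    """Question (ii): replacement cost at reduced defective probability p0 plus test cost n * c_t."""
    return mean_replacement_cost(n, p0, cost_of_k) + n * c_t


if __name__ == "__main__":
    n, p, p0 = 100, 0.01, 0.001      # assumed lot size and defective probabilities
    c_t = 0.2                        # assumed test cost per component
    linear = lambda k: 50.0 * k      # assumed cost law c_k = k * 50

    c_i = mean_replacement_cost(n, p, linear)
    c_ii = mean_cost_with_inspection(n, p0, linear, c_t)
    print(c_i, c_ii, c_i - c_ii)     # a positive difference means the inspection pays off
```

With a linear cost law the sum reduces to c * n * p, which gives a quick plausibility check; for nonlinear c_k the full sum is needed.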

With similar considerations to those in Example 1.1 one obtains for the expected value (mean) of the total repair cost $C_{cm}$ during the cumulative operating time $T$ of an item with failure rate $\lambda$ and cost $c_{cm}$ per repair

$$C_{cm} = \lambda \, T \, c_{cm} = \frac{T}{MTBF} \, c_{cm} . \qquad (1.16)$$

In Eq. (1.16), the term $\lambda T$ gives the mean value of the number of failures during $T$ (Eq. (A7.42)), and MTBF is used as $MTBF = 1/\lambda$.
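As a small worked illustration of Eq. (1.16), the following sketch computes the mean repair cost from a constant failure rate. The function name and the numerical values are assumptions chosen for the example.

```python
def mean_repair_cost(failure_rate_per_h: float, operating_time_h: float,
                     cost_per_repair: float) -> float:
    """Mean total repair cost: expected number of failures times cost per repair."""
    expected_failures = failure_rate_per_h * operating_time_h  # equals T / MTBF
    return expected_failures * cost_per_repair


if __name__ == "__main__":
    # Assumed values: MTBF = 10,000 h, T = 50,000 h, 400 cost units per repair.
    # About 5 failures are expected, hence a mean repair cost of about 2,000 units.
    print(mean_repair_cost(1e-4, 50_000, 400.0))
```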

From the above considerations, the following equation expressing the mean $C$ of the sum of the cost for quality assurance and for the assurance of reliability, maintainability, and logistic support of a system can be obtained

$$C = C_q + C_r + C_{cm} + C_{pm} + C_l + \frac{T}{MTBF_S} \, c_{cm} + (1 - OA_S) \, T \, c_{off} + n_d \, c_d . \qquad (1.17)$$

Thereby, q denotes quality, r reliability, cm corrective maintenance, pm preventive maintenance, l logistic support, off down time, and d defects.

[Figure 1.3: block diagram. Upper part: Cost Effectiveness (System Effectiveness), with blocks including Capability, Operational Availability (Dependability), and Safety. Lower part: Cost Effectiveness Assurance (System Effectiveness Assurance), with activity lists covering design, development, and evaluation; configuration management; quality testing (incl. reliability, maintainability, and safety tests); quality control during production (hardware); quality data reporting system; software quality; cost analyses (life-cycle costs, VE, VA); reliability, maintainability, and safety targets; required function; environmental conditions; parts & materials; design guidelines; derating; screening; redundancy; FMEA/FMECA, FTA, etc.; reliability block diagram; reliability prediction; maintenance concept; partitioning in LRUs; operating control; diagnosis; maintainability analysis; safety analysis; design reviews; customer documentation; spare parts provisioning; tools and test equipment for maintenance; after-sales service.]

Figure 1.3 Cost Effectiveness (System Effectiveness) for complex equipment & systems with high quality and reliability requirements (see Appendices A1 - A5 for definitions and management aspects; dependability can be used instead of operational availability, for a qualitative meaning)

$MTBF_S$ and $OA_S$ are the system mean operating time between failures (assumed here $= 1/\lambda_S$) and the system steady-state overall availability (Eq. (6.196) with $T_{pm}$ instead of $T_{PM}$). $T$ is the total system operating time (useful life) and $n_d$ is the number of hidden defects discovered (and eliminated) in the field. $C_q$, $C_r$, $C_{cm}$, $C_{pm}$, and $C_l$ are the cost for quality assurance and for the assurance of reliability, repairability, serviceability, and logistic support, respectively. $c_{cm}$, $c_{off}$, and $c_d$ are the cost per repair, per down time hour, and per hidden defect, respectively (preventive maintenance costs are scheduled costs, considered here as a part of $C_{pm}$).
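The cost terms just defined can be combined in a short calculation. The sketch below assumes that Eq. (1.17) is the plain sum of the five assurance costs and of three deferred terms (repairs: c_cm * T / MTBF_S; down time: c_off * (1 - OA_S) * T; hidden defects: n_d * c_d). This reading, the class and field names, and all numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    # Assurance costs (part of the acquisition cost)
    C_q: float
    C_r: float
    C_cm: float
    C_pm: float
    C_l: float
    # System parameters
    MTBF_S: float   # mean operating time between failures, in h
    OA_S: float     # steady-state overall availability, between 0 and 1
    T: float        # total operating time (useful life), in h
    n_d: float      # number of hidden defects discovered in the field
    # Unit costs
    c_cm: float     # cost per repair
    c_off: float    # cost per down time hour
    c_d: float      # cost per hidden defect


def acquisition_part(m: CostModel) -> float:
    """Sum of the five assurance cost terms."""
    return m.C_q + m.C_r + m.C_cm + m.C_pm + m.C_l


def deferred_part(m: CostModel) -> float:
    """Sum of the three cost terms occurring during field operation."""
    repairs = m.c_cm * m.T / m.MTBF_S
    down_time = m.c_off * (1.0 - m.OA_S) * m.T
    hidden_defects = m.n_d * m.c_d
    return repairs + down_time + hidden_defects


def total_cost(m: CostModel) -> float:
    return acquisition_part(m) + deferred_part(m)
```

Keeping the acquisition and deferred parts in separate functions makes it easy to see how a higher investment in assurance (first part) trades against field cost (second part).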

The first five terms in Eq. (1.17) represent a part of the acquisition cost, the last three terms are deferred cost occurring during field operation. A model for investigating the cost $C$ according to Eq. (1.17) was developed in [2.3 (1986)] by assuming $C_q$, $C_r$, $C_{cm}$, $C_{pm}$, $C_l$, $MTBF_S$, $OA_S$, $T$, $c_{cm}$, $c_{off}$, and $c_d$ as parameters and investigating the variation of the total cost expressed by Eq. (1.17) as a function of the level of attainment of the specified targets, i.e. by introducing the variables $g_q = QA / QA_g$, $g_r = MTBF_S / MTBF_{Sg}$, $g_{cm} = MTTR_{Sg} / MTTR_S$, $g_{pm} = MTTPM_{Sg} / MTTPM_S$, and $g_l = MLD_{Sg} / MLD_S$, where the subscript g denotes the specified target for the corresponding quantity. A power relationship

$$C_i = C_{ig} \, g_i^{m_i} \qquad (1.18)$$

was assumed between the actual cost $C_i$, the cost $C_{ig}$ to reach the specified target (goal) of the considered quantity, and the level of attainment of the specified target ($0 < m_l < 1$ and all other $m_i > 1$). The following relationship between the number of hidden defects discovered in the field and the ratio $C_q / C_{qg}$ was also included in the model

The final equation for the cost $C$ as function of the variables $g_q$, $g_r$, $g_{cm}$, $g_{pm}$, and $g_l$ follows then as (using Eq. (6.196) for $OA_S$)

The relative cost $C / C_g$ given in Fig. 1.4 is obtained by dividing $C$ by the value $C_g$ from Eq. (1.20) with all $g_i = 1$. Extensive analyses with different values for $m_i$, $C_i$, $MTBF_S$, $OA_S$, $T$, $c_{cm}$, $c_{off}$, and $c_d$ have shown that the value $C / C_g$ is only moderately sensitive to the parameters $m_i$.

[Figure 1.4: plot; vertical axis: relative cost $C / C_g$]

Figure 1.4 Sum of the relative cost $C / C_g$ for quality assurance and for the assurance of reliability, maintainability, and logistic support of two complex systems with different mission profiles, as a function of the level of attainment of the specified quality and reliability targets $g_q$ and $g_r$, respectively (the specified targets are dashed, results based on Eq. (1.20))

1.2.10 Product Liability

Product liability is the onus on a manufacturer (producer) or others to compensate for losses related to injury to persons, material damage, or other unacceptable consequences caused by a product (item). The manufacturer has to specify a safe operational mode for the product (user documentation). In legal documents related to product liability, the term product often indicates hardware only and the term defective product is in general used instead of defective or failed product. Responsible in a product liability claim are all those people involved in the design, production, sale, and maintenance of the product (item), including suppliers. Basically, strict liability is applied (the manufacturer has to demonstrate that the product was free from defects). This holds in the USA and increasingly in Europe [1.8]. However, in Europe the causality between damage and defect has still to be demonstrated by the user.

The rapid increase of product liability claims (in the USA alone, 50,000 in 1970 and over one million in 1990) cannot be ignored by manufacturers. Although such a situation has probably been influenced by the peculiarity of US legal procedures, configuration management and safety analysis (in particular causes-to-effects analyses) as well as considerations on risk management should be performed to increase safety and avoid product liability claims (see Sections 1.2.7 & 2.6, and Appendix A3.3).


1.2.11 Historical Development

Methods and procedures of quality assurance and reliability engineering have been developed extensively over the last 50 years. For indicative purpose, Table 1.1 summarizes the major steps of this development and Fig. 1.5 shows the approximate distribution of the relative effort between quality assurance and reliability engineering during the same period of time. Because of the rapid progress of microelectronics, considerations on redundancy, fault tolerance, test strategy, and software quality have increased in importance. A skillful, allegorical presentation of the story of reliability (as an Odyssey) is given in [1.25].

Table 1.1 Historical development of quality assurance (management) and reliability engineering

before 1940   Quality attributes and characteristics are defined. In-process and final tests are carried out, usually in a department within the production area. The concept of quality of manufacture is introduced.

1940 - 50   Defects and failures are systematically collected and analyzed. Corrective actions are carried out. Statistical quality control is developed. It is recognized that quality must be built into an item. The concept quality of design becomes important.

1950 - 60   Quality assurance is recognized as a means for developing and manufacturing an item with a specified quality level. Preventive measures (actions) are added to tests and corrective actions. It is recognized that correct short-term functioning does not also signify reliability. Design reviews and systematic analysis of failures (failure data and failure mechanisms), performed often in the research & development area, lead to important reliability improvements.

1960 - 70   Difficulties with respect to reproducibility and change control, as well as interfacing problems during the integration phase, require a refinement of the concept of configuration management. Reliability engineering is recognized as a means of developing and manufacturing an item with specified reliability. Reliability estimation methods and demonstration tests are developed. It is recognized that reliability cannot easily be demonstrated by an acceptance test. Instead of a reliability figure ($\lambda$ or $MTBF = 1/\lambda$), the contractual requirement is for a reliability assurance program. Maintainability, availability, and logistic support become important.

1970 - 80   Due to the increasing complexity and cost for maintenance of equipment and systems, the aspects of man-machine interface and life-cycle cost become important. Terms like product assurance, cost effectiveness and systems engineering are introduced. Product liability becomes important. Quality and reliability assurance activities are made project specific and carried out in close cooperation with all engineers involved in a project. Customers require demonstration of reliability and maintainability during the warranty period.

1980 - 90   The aspect of testability gains in significance. Test and screening strategies are developed to reduce testing cost and warranty services. Because of the rapid progress in microelectronics, greater possibilities are available for redundant and fault tolerant structures. The concept of software quality is introduced.

after 1990   The necessity to further shorten the development time leads to the concept of concurrent engineering. Total Quality Management (TQM) appears as a refinement to the concept of quality assurance as used at the end of the seventies.

[Figure 1.5: plot of relative effort [%] (vertical axis, 0 to 100) versus year (horizontal axis). Labels: Quality assurance; System engineering (part); Fault causes / modes / effects / mechanisms analysis; Reliability analysis; Software quality; Configuration management; Quality testing, quality control, quality data reporting system.]

Figure 1.5 Approximate distribution of the relative effort between quality assurance and reliability engineering for complex equipment and systems

1.3 Basic Tasks & Rules for Quality and Reliability Assurance of Complex Systems

This section deals with some important considerations on the organization of quality and reliability assurance in the case of complex equipment and systems with high quality and reliability requirements. This minor part of the book aims to support managers in answering the question of how to specify and realize high reliability targets for complex equipment and systems when tailoring is not mandatory. Refinements are in Appendices A1 - A5, with considerations on quality management and total quality management (TQM) as well. As a general rule, quality assurance and reliability engineering must avoid bureaucracy, be integrated in project activities, and support quality management and concurrent engineering efforts, as per TQM.

1.3.1 Quality and Reliability Assurance Tasks

Experience shows that the development and production of complex equipment and systems with high reliability, maintainability, availability, and/or safety targets requires specific activities during all life-cycle phases of the item considered. For complex equipment and systems, Fig. 1.6 shows the life-cycle phases and Table 1.2 gives main tasks for quality and reliability assurance. Depicted in Table 1.2 is also the period of time over which the tasks have to be performed. Within a project, the tasks of Table 1.2 must be refined in a project-specific quality and reliability assurance program (Appendix A3).


Table 1.2 Main tasks for quality and reliability assurance of complex equipment and systems with high quality and reliability requirements (the bar height is a measure of the relative effort)

Main tasks for quality and reliability assurance of complex equipment and systems, conforming to TQM (see Table A3.2 for more details and for task assignment)

1. Customer and market requirements

2. Preliminary analyses

3. Quality and reliability aspects in specs, quotations, contracts, etc.

4. Quality and reliability assurance program

5. Reliability and maintainability analyses

6. Safety and human factor analyses

7. Selection and qualification of components and materials

8. Supplier selection and qualification

9. Project-dependent procedures and work instructions

10. Configuration management

11. Prototype qualification tests

12. Quality control during production

13. In-process tests

14. Final and acceptance tests

15. Quality data reporting system

16. Logistic support

17. Coordination and monitoring

18. Quality costs

19. Concepts, methods, and general procedures (quality and reliability)

20. Motivation and training

[Figure 1.6: diagram of life-cycle phases. Phases: Conception, Definition, Design, Development, Evaluation (incl. full development); Production (Manufacturing) with pilot production and series production; Installation, Operation. Inputs and outputs shown for the phases: idea, market requirements; evaluation of delivered equipment and systems; proposal for preliminary study; system specifications; feasibility checks; interface definition; proposal for the design phase; revised system specifications; qualified and released prototypes; technical documentation; proposal for pilot production; production documentation; qualified production processes; qualified and released first series item; proposal for series production; series item; customer documentation; logistic concept; spare part provisioning.]

Figure 1.6 Basic life-cycle phases of complex equipment and systems (the output of a given phase is the input of the next phase; see Tab. 5.3 for software)

1.3.2 Basic Quality and Reliability Assurance Rules

Performance, dependability, cost, and time to market are key factors for today's products and services. Taking care of the considerations in Section 1.3.1, the basic rules for a quality and reliability assurance optimized by considering cost and time schedule aspects (conforming to TQM) can be summarized as follows:

1. Quality and reliability targets should be just as high as necessary to satisfy real customer needs

→ Apply the rule "as-good-as-necessary".

2. Activities for quality & reliability assurance should be performed continuously throughout all project phases, from definition to operating phase (Table 1.2)

→ Do not change the project manager before ending the pilot production.

3. Activities must be performed in close cooperation between all engineers involved in the project (Table A3.2)

→ Use TQM and concurrent engineering approaches.

4. Quality and reliability assurance activities should be monitored by a central quality & reliability assurance department (Q & RA), which cooperates actively in all project phases (Fig. 1.7 and Table A3.2)

→ Establish an efficient and independent quality & reliability assurance department (Q & RA) active in the projects.


Figure 1.7 shows a basic organization which could embody the above rules and satisfy requirements of quality management standards (Appendix A2). As shown in Table A3.2, the assignment of quality and reliability assurance tasks should be such that every engineer in a project bears his/her own responsibilities (as per TQM).

A design engineer should for instance be responsible for all aspects of his/her own product (e.g. an assembly) including reliability, maintainability & safety, and the production department should be able to manufacture and test such an item within its own competence. The quality & reliability assurance department (Q & RA in Fig. 1.7) can be for instance responsible for

setting targets for reliability and quality levels,

coordination of the activities belonging to quality and reliability assurance,

preparation of guidelines and working documents (quality and reliability aspects),

qualification, testing and screening of components and material (quality and reliability aspects),

release of manufacturing processes (quality and reliability aspects),

development and operation of the quality data reporting system,

solution of quality and reliability problems at the equipment and system level,

acceptance testing.

This central quality and reliability department should not be too small (credibility) nor too large (sluggishness).

Figure 1.7 Basic organizational structure for quality & reliability assurance in a company producing complex equipment and systems with high quality (Q), reliability (R), and / or safety requirements (connecting lines indicate close cooperation; A denotes assurance, I inspection)


1.3.3 Elements of a Quality Assurance System

As stated in Section 1.3.1, many of the tasks associated with quality assurance (here in the sense of quality management as per TQM) are interdisciplinary in nature. In order to have a minimum impact on cost and time schedules, their solution requires the concurrent efforts of all engineers involved in a project. To improve coordination, it can be useful to group the quality assurance activities into the following basic areas (Fig. 1.3):

1. Configuration Management: Procedure used to specify, describe, audit & release the configuration of an item, as well as to control it during modifications or changes. Configuration management is an important tool for quality assurance. It can be subdivided into configuration identification, auditing (design reviews), control, and accounting (Appendix A3.3.5).

2. Quality Tests: Tests to verify whether the item conforms to specified requirements. Quality tests include incoming inspections, as well as qualification tests, production tests, and acceptance tests. They also cover reliability, maintainability, safety, and software aspects. To be cost effective, quality tests must be coordinated and integrated into a test strategy.

3. Quality Control During Production: Control (monitoring) of the production processes and procedures to reach a stated quality of manufacturing.

4. Quality Data Reporting System (QDS, FRACAS): A system to collect, analyze, and correct all defects and failures (faults) occurring during the production and test of an item, as well as to evaluate and feed back the corresponding quality and reliability data. Such a system is generally computer assisted. Analysis of failures and defects must be traced to the cause, to avoid repetition of the same problem.

5. Software Quality: Special procedures and tools to specify, develop, and test software (Section 5.3).

Configuration management spans from the definition up to the operating phase (Appendices A3 & A4). Quality tests encompass technical and statistical aspects (Chapters 3, 7, and 8). The concept of a quality data reporting system is depicted in Fig. 1.8 (see Appendix A5 for basic requirements). Table 1.3 shows an example of data reporting sheets for PCBs evaluation.

The quality and reliability assurance system must be described in an appropriate quality handbook supported by the company management. A possible content of such a handbook for a company producing complex equipment and systems with high quality & reliability requirements can be: General, Project Organization, Quality Assurance (Management), Quality & Reliability Assurance Program, Reliability Engineering, Maintainability Engineering, Safety Engineering, Software Quality Assurance.


Table 1.3 Example of information status for PCBs (populated printed circuit boards) from a quality data reporting system

a) Defects and failures at PCB level
Period: ....

b) Defects and failures at component level
Period: ....  PCB: ....  No. of PCBs: ....
Columns: component; manufacturer; no. of components (type, application); number of faults; faults (%); no. of faults per place of occurrence (incoming inspection, in-process test, final test, warranty period)

c) Cause analysis for defects and failures due to components
Period: ....

d) Correlation between components and PCBs
Period: ....

[Further column headers of the table: component; PCB; cause (systematic, inherent failure, not identified); percent defective (%) (observed, predicted); failure rate ($10^{-9}\,h^{-1}$) (observed, predicted); measures (short term, long term)]
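The kind of evaluation behind part b) of Table 1.3 can be sketched as a simple tally of fault records per place of occurrence. The record layout, the names `FaultRecord` and `faults_per_place`, and the sample data are hypothetical; they only illustrate how a computer-assisted quality data reporting system might aggregate such data.

```python
from collections import Counter
from dataclasses import dataclass

# Places of occurrence as listed in the table header
PLACES = ("incoming inspection", "in-process test", "final test", "warranty period")


@dataclass(frozen=True)
class FaultRecord:
    pcb: str        # PCB identifier (hypothetical)
    component: str  # component identifier (hypothetical)
    place: str      # one of PLACES
    cause: str      # "systematic", "inherent failure", or "not identified"


def faults_per_place(records, component: str) -> dict:
    """Number of faults of one component for each place of occurrence."""
    counts = Counter(r.place for r in records if r.component == component)
    return {place: counts.get(place, 0) for place in PLACES}


if __name__ == "__main__":
    records = [
        FaultRecord("PCB-A", "C17", "in-process test", "systematic"),
        FaultRecord("PCB-A", "C17", "final test", "inherent failure"),
        FaultRecord("PCB-B", "C17", "in-process test", "not identified"),
        FaultRecord("PCB-B", "R03", "warranty period", "inherent failure"),
    ]
    print(faults_per_place(records, "C17"))
```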


1.3.4 Motivation and Training

Cost effective quality and reliability assurance (management) can be achieved if every engineer involved in a project is made responsible for his/her assigned activities (e.g. as per Table A3.2). Figure 1.9 shows a comprehensive, practice oriented, motivation and training program in a company producing complex equipment and systems with high quality & reliability requirements.

Basic training
Title: Quality Management and Reliability Engineering
Aim: Introduction to tasks, methods, and organization of the company's quality and reliability assurance
Participants: Top and middle management, project managers, selected engineers
Duration: 4 h (seminar with discussion)
Documentation: ca. 30 pp.

Advanced training
Title: Methods of Reliability Engineering
Aim: Learning the methods used in reliability assurance
Participants: Project managers, engineers from marketing & production, selected engineers from development
Duration: 8 h (seminar with discussion)
Documentation: ca. 40 pp.

Advanced training
Title: Reliability Engineering
Aim: Learning the techniques used in reliability engineering (applications oriented and company specific)
Participants: Design engineers, Q&R specialists, selected engineers from marketing and production
Duration: 24 h (course with exercises)
Documentation: ca. 150 pp.

Special training
Title: Special Topics*
Aim: Learning special tools and techniques
Participants: Q&R specialists, selected engineers from development and production
Duration: 4 to 16 h per topic
Documentation: ca. 20 pp. per topic

* Examples: Statistical Quality Control, Test and Screening Strategies, Software Quality, Testability, Reliability and Availability of Repairable Systems, Fault-Tolerant Systems with Hardware and Software, Mechanical Reliability, Failure Mechanisms and Failure Analysis, etc.

Figure 1.9 Example for a practical oriented training and motivation program in a company producing complex equipment and systems with high quality (Q) & reliability (R) requirements

2 Reliability Analysis During the Design Phase (Nonrepairable Items up to System Failure)

Reliability analysis during the design and development of complex equipment and systems is important to detect and eliminate reliability weaknesses as early as possible and to perform comparative studies. Such an investigation includes failure rate and failure mode analysis, verification of the adherence to design guidelines, and cooperation in design reviews. This chapter presents methods and tools for failure rate and failure mode analysis of complex equipment and systems considered as nonrepairable (up to system failure). After a short introduction (Section 2.1), Section 2.2 deals with series -parallel structures. Complex structures, elements with more than one failure mode, and parallel models with load sharing are investigated in Section 2.3. Reliability allocation is introduced in Section 2.4. Stress / strength and drift analysis is discussed in Section 2.5. Section 2.6 deals with failure mode and causes-to-effects analyses. Section 2.7 gives a checklist for reliability aspects in design reviews. Maintainability is considered in Chapter 4 and repairable systems are investigated in Chapter 6 (including complex systems for which a reliability block diagram does not exist, imperfect switching, incomplete coverage, common cause failures, reconfigurable systems, as well as an introduction to Petri nets, dynamic FT, and computer-aided analysis). Design guidelines are in Chapter 5, qualification tests in Chapter 3, and reliability tests in Chapters 7 & 8. Theoretical foundations for this chapter are in Appendix A6.

2.1 Introduction

An important part of the reliability analysis during the design and development of complex equipment and systems deals with failure rate and failure mode investigation as well as with the verification of the adherence to appropriate design guidelines for reliability. Failure mode and causes-to-effects analysis is considered in Section 2.6, design guidelines are given in Chapter 5. Sections 2.2 - 2.5 are devoted to failure rate analysis. Investigating the failure rate of a complex equipment or system leads to the calculation of the predicted reliability, i.e. that reliability which can be calculated from the structure of the item and the reliability of its elements. Such a prediction is necessary for an early detection of reliability weaknesses, for comparative studies, for availability investigation taking care of maintainability and logistic support, and for the definition of quantitative reliability targets for subcontractors. However, because of different kinds of uncertainties, the predicted reliability can often be only given with a limited accuracy. To these uncertainties belong

simplifications in the mathematical modeling (independent elements, complete and sudden failures, no flaws during design and manufacturing, no damages),

insufficient consideration of faults caused by internal or external interference (switching, transients, EMC, etc.),

inaccuracies in the data used for the calculation of the component failure rates.

On the other hand, the true reliability of an item can only be determined by reliability tests, performed often at the prototype's qualification tests, i.e. late in the design and development phase. Practical applications also show that with an experienced reliability engineer, the predicted failure rate at equipment or system level often agrees reasonably well (within a factor of 2) with field data. Moreover, relative values obtained by comparative studies generally have a much greater accuracy than absolute values. All these reasons support the efforts for a reliability prediction during the design of equipment and systems with specified reliability targets.

Besides theoretical considerations, discussed in the following sections, practical aspects have to be considered when designing reliable equipment or systems, for instance with respect to operating conditions and to the mutual influence between elements (input/output, load sharing, effects of failures, transients, etc.). Concrete possibilities for reliability improvement are

reduction of thermal, electrical and mechanical stresses,

correct interfacing of components and materials,

simplification of design and construction,

use of qualitatively better components and materials,

protection against ESD and EMC,

screening of critical components and assemblies,

use of redundancy,

in that order. Design guidelines (Chapter 5) and design reviews (Tables A3.3, 2.8, 4.3, and 5.5, Appendix A4) are mandatory to support such improvements. This chapter deals with nonrepairable (up to system failure) equipment and systems. Maintainability is discussed in Chapter 4. Reliability and availability of repairable equipment and systems is considered carefully in Chapter 6.

[Figure 2.1: flowchart. Required function; set up the reliability block diagram (RBD), by performing an FMEA where redundancy appears; determine the component stresses; compute the failure rate $\lambda_i$ of each component; compute R(t) at the assembly level; check the fulfillment of reliability design rules; perform a preliminary design review; decision step (question text not recoverable): if not satisfied, eliminate reliability weaknesses (component/material selection, derating, screening, redundancy) and repeat; if yes, go to the next assembly or to the next integration level.]

Figure 2.1 Reliability analysis procedure at assembly level

Taking account of the above considerations, Fig. 2.1 shows the reliability analysis procedure for an assembly. The procedure of Fig. 2.1 is based on the part stress method given in Section 2.2.4 (see Section 2.2.7 for the part count method). Also included are a failure mode analysis (FMEA/FMECA, to check the validity of the assumed failure modes) and a verification of the adherence to design guidelines for reliability in a preliminary design review (Section 5.1, Appendix A3.3.5). Verification of the assumed failure mode is mandatory where redundancy appears, in particular because of the series element in the reliability block diagram (see for instance Examples 2.1-2.3 and, for a comparative investigation, Figs. 2.8 - 2.9, and 6.17 - 6.18). To simplify the notation in the following sections, reliability will be used for predicted reliability and system instead of technical system, i.e. for a system with ideal human factors and logistic support.


2.2 Predicted Reliability of Equipment and Systems with Simple Structure

Simple structures are those for which a reliability block diagram exists and can be reduced to a series/parallel form with independent elements. For such an item, the predicted reliability is calculated according to the following procedure, see Fig. 2.1:

1. Definition of the required function and of its associated mission profile.

2. Derivation of the corresponding reliability block diagram (RBD).

3. Determination of the operating conditions for each element of the RBD.

4. Determination of the failure rate for each element of the RBD.

5. Calculation of the reliability for each element of the RBD.

6. Calculation of the item (system) reliability function Rs ( t ).

7. Elimination of reliability weaknesses and return to step 1 or 2, as necessary.

This section discusses steps 1 to 6 at some length; see Example 2.6 for the application to a simple situation. For the investigation of equipment or systems for which a reliability block diagram does not exist, see Section 6.8.

2.2.1 Required Function

The required function specifies the item's task. Its definition is the starting point for any analysis, as it defines failures. For practical purposes, parameters should be defined with tolerances and not merely as fixed values.

In addition to the required function, environmental conditions at system level must also be defined. Among these are ambient temperature (e.g. +40°C), storage temperature (e.g. -20 to +60°C), humidity (e.g. 40 to 60%), dust, corrosive atmosphere, vibration (e.g. 0.5 $g_n$ at 2 to 60 Hz), shock, noise (e.g. 40 to 70 dB), and power supply voltage variations (e.g. ±20%). From these global environmental conditions, the constructive characteristics of the system, and the internal loads, operating conditions (actual stresses) for each element of the system can be determined.

Required function and environmental conditions are often time dependent, leading to a mission profile (operational profile for software). A representative mission profile and the corresponding reliability targets should be defined in the system specifications (initially as a rough description and then refined step by step), see the remark on p. 38 and Section 6.8.6.2 for phased-mission systems.

2.2.2 Reliability Block Diagram

The reliability block diagram (RBD) is an event diagram. It answers the following question: Which elements of the item under consideration are necessary for the fulfillment of the required function and which can fail without affecting it? Setting up a RBD involves, at first, partitioning the item into elements with clearly defined tasks. The elements which are necessary for the required function are connected in series, while elements which can fail with no effect on the required function (redundancy) are connected in parallel. Obviously, the ordering of the series elements in the reliability block diagram can be arbitrary. Elements which are not relevant for (or used in) the required function under consideration are removed (put into a reference list), after having verified (FMEA) that their failure does not affect elements involved in the required function. These considerations make it clear that for a given system, each required function has its own reliability block diagram.

Figure 2.2 Procedure for setting up the reliability block diagram (RBD) of a system with four levels

In setting up the reliability block diagram, care must be taken regarding the fact that only two states (good or failed) and one failure mode (e.g., opens or shorts) can be considered for each element. Particular attention must also be paid to the correct identification of the parts which appear in series with a redundancy (see e.g. Section 6.8). For large equipment or systems the reliability block diagram is derived top down as indicated in Fig. 2.2 (for 4 levels as an example). At each level, the corresponding required function is derived from that at the next higher level.

The technique of setting up reliability block diagrams is shown in the Examples 2.1 to 2.3 (see also Examples 2.6, 2.13, 2.14). One recognizes that a reliability block diagram basically differs from a functional block diagram. Examples 2.2, 2.3, 2.14 also show that one or more elements can appear more than once in a reliability block diagram, while the corresponding element is physically present only once in the item considered. To point out the strong dependence created by this fact, it is mandatory to use a box form other than a square for these elements (in Example 2.2, if $E_2$ fails the required function is fulfilled only if $E_1$, $E_3$ and $E_5$ work). To avoid ambiguities, each physically different element of the item should bear its own number. The typical structures of reliability block diagrams are summarized in Table 2.1 (see Section 6.8 for situations in which a reliability block diagram does not exist).

Example 2.1 Set up the reliability block diagrams for the following circuits:

[Circuit diagrams: (i) Resistive voltage divider, (ii) Electronic switch, (iii) Simplified radio receiver]

Solution Cases (i) and (iii) exhibit no redundancy, i.e. for the required function (tacitly assumed here) all elements must work. In case (ii), transistors TR1 and TR2 are redundant if their failure mode is a short between emitter and collector (the failure mode for resistors is generally an open). From these considerations, the reliability block diagrams follow as

[Reliability block diagrams: (i) Resistive voltage divider, (ii) Electronic switch, (iii) Simplified radio receiver]

Example 2.2 An item is used for two different missions with the corresponding reliability block diagrams given in the figures below. Give the reliability block diagram for the case in which both functions are simultaneously required in a common mission.

Mission 1 Mission 2

Solution The simultaneous fulfillment of both required functions leads to the series connection of both reliability block diagrams. Simplification is possible for element $E_1$ but not for element $E_2$. A deeper discussion on phased-mission reliability analysis is in Section 6.8.6.2.

[Reliability block diagram: Mission 1 and 2]


Table 2.1 Basic reliability block diagrams and associated reliability functions (nonrepairable up to system failure, new at t = 0 ($R_i(0) = 1$), independent elements (except $E_2$ in 9), active redundancy; 7-9 are complex structures and cannot be reduced to a series-parallel structure with indep. elements)

Columns: Reliability Block Diagram; Reliability Function ($R_S = R_{S0}(t)$; $R_i = R_i(t)$, $R_i(0) = 1$); Remarks

1. One-item structure, $\lambda_i(t) = \lambda_i \Rightarrow R_i(t) = e^{-\lambda_i t}$
2. Series structure, $\lambda_S(t) = \lambda_1(t) + \ldots + \lambda_n(t)$
3. 1-out-of-2 redundancy
4. k-out-of-n redundancy; for $k = 1 \Rightarrow R_S = 1 - (1 - R)^n$
5. Series/parallel structure
6. Majority redundancy, general case (n + 1)-out-of-(2n + 1), n = 1, 2, ...
7. Bridge structure (bi-directional on $E_5$)
8. Bridge structure (unidirectional on $E_5$)
9. The element $E_2$ appears twice in the reliability block diagram (not in the hardware)


Example 2.3 Set up the reliability block diagram for the electronic circuit shown on the right. The required function asks for operation of P2 (main assembly) and of P1 or P1' (control cards).

Solution

This example is not as trivial as Examples 2.1 and 2.2. A good way to derive the reliability block diagram is to consider the mission "P1 or P1' must work" and "P2 must work" separately, and then to put both missions together as in Example 2.2 (see also Example 2.14).

Also given in Table 2.1 are the associated reliability functions for the case of nonrepairable elements (up to system failure) with active redundancy and independent elements except case 9 (Sections 2.2.6, 2.3.1-2.3.4); see Section 2.3.5 for load sharing, Section 2.5 for mechanical systems, and Chapter 6 for repairable systems.

Table 2.2 Most important parameters influencing the failure rate of electronic components

Digital and linear ICs

Hybrid circuits

Bipolar transistors

Diodes

Thyristors D

Optoelectronic components D

Resistors D

Capacitors D

Coils, transformers D

Relays, switches D

Connectors D

D denotes dominant, X denotes important


Figure 2.3 Load capability and typical derating curve (dashed) for a bipolar Si-transistor as function of the ambient temperature $\theta_A$ (P = dissipated power, $P_N$ = rated power)

2.2.3 Operating Conditions at Component Level, Stress Factors

The operating conditions of each element in the reliability block diagram influence the item's reliability and have to be considered. These operating conditions are a function of the environmental conditions (Section 3.1.1) and internal loads, in operating and dormant state. Table 2.2 gives an overview of the most important parameters influencing electronic component failure rates.

A basic assumption is that components are in no way overstressed. In this context it is important to consider that the load capability of many electronic components decreases with increasing ambient temperature. This holds in particular for power, but also for voltage and current. As an example, Fig. 2.3 shows the variation of the power capability as function of the ambient temperature $\theta_A$ for a bipolar Si transistor (with constant thermal resistance $R_{JA}$). The continuous line represents the load capability. To the right of the break point the junction temperature is nearly equal to 175°C (max. specified operating temperature). The dashed line gives a typical derating curve for such a device. Derating is the designed (intentional) non utilization of the full load capability of a component with the purpose to reduce its failure rate. The stress factor (stress ratio, stress) S is defined as

$$S = \frac{\text{applied load}}{\text{rated load at } 40\,^{\circ}\mathrm{C}}. \tag{2.1}$$

To give an impression, Figs. 2.4 - 2.6 show the influence of the temperature (ambient $\theta_A$, case $\theta_C$ or junction $\theta_J$) and of the stress factor S on the failure rate of some electronic components (from IEC 61709 [2.23]). Experience shows that for a good design and $\theta_A \le 40$°C one should have 0.1 < S < 0.6 for power, voltage, and current, S ≤ 0.8 for fan-out, and S ≤ 0.7 for $U_{in}$ of lin. ICs (Table 5.1). S < 0.1 should also be avoided.
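As an illustration of the stress factor and of the rule of thumb quoted above for power, voltage, and current, the following minimal Python sketch computes S and checks it against the band 0.1 < S < 0.6. The function names and the resistor values are hypothetical and chosen only for the example.

```python
def stress_factor(applied_load: float, rated_load_40c: float) -> float:
    """Stress factor S = applied load / rated load at 40 °C."""
    if rated_load_40c <= 0:
        raise ValueError("rated load must be positive")
    return applied_load / rated_load_40c


def within_guideline(s: float, lower: float = 0.1, upper: float = 0.6) -> bool:
    """Check the rule of thumb 0.1 < S < 0.6 (power, voltage, current)."""
    return lower < s < upper


# Hypothetical example: a resistor rated 250 mW at 40 °C dissipating 100 mW.
s = stress_factor(applied_load=0.100, rated_load_40c=0.250)
print(f"S = {s:.2f}, within guideline: {within_guideline(s)}")
```

The bounds are passed as parameters because other limits apply to other quantities (e.g. fan-out or input voltage of linear ICs).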


[Figure panels; curves for: paper, metallized paper & plastic; ceramic; aluminum, non-solid electrolyte; tantalum, solid electrolyte]

Figure 2.4 Factor $\pi_T$ as function of the case temperature $\theta_C$ for capacitors and resistors, and factor $\pi_U$ as function of the voltage stress for capacitors (examples from IEC 61709 [2.23])

[Figure panels; curves for: ICs, transistors, reference and microwave diodes; EPROM, OTPROM, EEPROM, EAROM; diodes and/or power devices; CMOS (15 V); bipolar analog ICs; transistors]

Figure 2.5 Factor $\pi_T$ as function of the junction temperature $\theta_J$ (left, half log for semiconductors and right, linear for semiconductors, resistors, and coils) and factor $\pi_U$ as function of the power supply voltage for semiconductors (examples from IEC 61709 [2.23])


Figure 2.6 Factor $\pi_T$ as function of the junction temperature $\theta_J$ and factors $\pi_U$ and $\pi_I$ as function of voltage and current stress for optoelectronic devices (examples from IEC 61709 [2.23])

2.2.4 Failure Rate of Electronic Components

The failure rate $\lambda(t)$ of an item is the probability (referred to $\delta t$) of a failure in the interval $(t, t + \delta t]$ given that the item was new at t = 0 and did not fail in the interval (0, t], see Eqs. (1.3), (1.5), (A6.25). For a large population of statistically identical and independent items, $\lambda(t)$ often exhibits three successive phases: one of early failures, one with constant (or nearly so) failure rate, and one involving failures due to wearout. Early failures should be eliminated through screening (Chapter 8). Wearout failures can be expected for some electronic components (electrolytic capacitors, power and optoelectronic devices, ULSI-ICs) as well as for mechanical and electromechanical components. They must be considered on a case-by-case basis in setting up a preventive maintenance strategy (Sections 4.6, 6.8.2).

To simplify calculations, reliability prediction is often performed by assuming a constant (time independent) failure rate during the useful life

$$\lambda(t) = \lambda.$$

This approximation greatly simplifies calculations, since a constant failure rate leads to a flow of failures described by a homogeneous Poisson process, i.e. to a process with memoryless property (Eqs. (A6.30), (A6.87), Appendix A7.2.5). The failure rate of components can be assessed experimentally by accelerated reliability tests or from field data (if operating conditions are sufficiently well known) with appropriate statistical data analysis (Sections 7.2, 7.4 - 7.6). For established electronic / electromechanical components, models and figures are often given in failure rate handbooks [2.21-2.29]. Among these are FIDES Guide 2004 [2.21], HDBK-217Plus (2006) [2.22], IEC 61709 (1996) [2.23], IEC TR 62380 (2004) [2.24], IRPH 2003 [2.25], RDF-96 [2.28], and Telcordia SR-332 (2001) [2.29]. IEC 61709 gives laws of dependency of the failure rate on different stresses (temperature, voltage, etc.) and must be supported by a set of reference failure rates $\lambda_{ref}$ (e.g. for a standard industrial environment, i.e. 40°C ambient temperature $\theta_A$, GB as per Table 2.3, and steady-state conditions in the field). IRPH 2003 is based on IEC 61709 and gives reference failure rates. Effects of thermal cycling, dormant state, and ESD are considered in IEC TR 62380 and HDBK-217Plus. Refined models are in FIDES Guide 2004. HDBK-217Plus is the next generation of the PRISM software tool and aims to replace MIL-HDBK-217. An international agreement on failure rate models for reliability predictions at equipment and system level in practical applications should be found to simplify comparative investigations ([1.2 (1996)] and remark on p. 38).

Table 2.3 Indicative figures for environmental conditions and corresponding factors $\pi_E$

GF (Ground fixed): -40 to +45°C, 2 - 200 Hz
GM (Ground mobile): -40 to +45°C, 2 - 500 Hz
NS (Nav. sheltered): -40 to +45°C, 2 - 200 Hz, 2 $g_n$
NU (Nav. unsheltered): -40 to +70°C, 2 - 200 Hz, 5 $g_n$

C = capacitors, DS = discrete semicond., R = resistors, RH = rel. humidity, h = high, m = medium, l = low, $g_n$ = 10 m/s² (GB is Ground stationary weatherprotected in [2.25, 2.29] and is taken as reference value in [2.22, 2.23, 2.24])

Failure rates are taken from one of the above handbooks or from one's own field experience for the calculation of the predicted reliability. Models in these handbooks have often a simple structure, of the form

often further simplified to

by taking $\pi_E = \pi_Q = 1$ because of the assumed standard (industrial) environment (GB in Table 2.3) and standard quality level. Indicative figures for the factors $\pi_E$ and $\pi_Q$ are in Tables 2.3 and 2.4.

$\lambda$ lies between $10^{-10}$ h$^{-1}$ for passive components and $10^{-7}$ h$^{-1}$ for VLSI ICs (Table A10.1, Example 2.4). The unit $10^{-9}$ h$^{-1}$ is designated by FIT (failure in time).


Table 2.4 Reference values for the quality factors $\pi_Q$

Qualification: Reinforced | CECC* | no special

Monolithic ICs
Hybrid ICs
Discrete Semiconductors
Resistors
Capacitors 0.1 2.0

*Reference value in [2.22-2.25] and class II in [2.29] (corresponds to MIL-HDBK-217 F classes B, JANTX, M)

In general, $\lambda_0$ and $\lambda_{ref}$ increase exponentially with temperature (see Figs. 2.4 - 2.6 for some examples). The influence of the stress factor is illustrated by the factors $\pi_U$ and $\pi_I$. For the factor $\pi_T$ as a function of the junction temperature $\theta_J$, an Arrhenius model is often used. In the case of only one dominant failure mechanism, Eq. (7.56) gives the ratio of the $\pi_T$ factors at two temperatures $T_2$ and $T_1$

$$\frac{\pi_{T_2}}{\pi_{T_1}} = A \approx e^{\frac{E_a}{k}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)},$$

where A is the acceleration factor, k the Boltzmann constant ($8.6 \cdot 10^{-5}$ eV/K), T the junction temperature (in kelvin), and $E_a$ the activation energy in eV. As in Figs. 2.4 - 2.6, experience shows that a global value for $E_a$ often lies between 0.3 eV and 0.6 eV for Si devices. The design guideline $\theta_J \le 100$°C, if possible $\theta_J \le 80$°C, given in Section 5.1 for semiconductor devices is based on this consideration. Models in IEC 61709 assume for $\pi_T$ two dominant failure mechanisms with activation energies $E_{a1}$ and $E_{a2}$ (about 0.3 eV for $E_{a1}$ and 0.6 eV for $E_{a2}$). The corresponding equation for $\pi_T$ takes in this case the form

$$\pi_T = \frac{A\,e^{E_{a1} z} + (1 - A)\,e^{E_{a2} z}}{A\,e^{E_{a1} z_{ref}} + (1 - A)\,e^{E_{a2} z_{ref}}},$$

where $0 \le A \le 1$ is a constant, $z = (1/T_{ref} - 1/T_2)/k$, and $z_{ref} = (1/T_{ref} - 1/T_1)/k$ with $T_{ref}$ = 313 K (40°C).

For components of good commercial quality, and using $\pi_E = \pi_Q = 1$, failure rate calculations lead to figures which for practical applications in standard industrial environments ($\theta_A$ = 40°C, GB) often agree reasonably well with field data (up to a factor of 2). This holds at the equipment and system level, although deviations can occur at component level, depending on the failure rate catalog used (Example 2.4).
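The temperature dependence described above can be sketched numerically. The Python fragment below evaluates the Arrhenius acceleration factor for one dominant failure mechanism and a two-mechanism factor of the form used in IEC 61709. It is a sketch under stated assumptions: the function names are hypothetical, the weighting constant `a` in the usage example is purely illustrative, and 273.15 K is used for the conversion from °C (so that 40°C corresponds to 313.15 K rather than the rounded 313 K).

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def kelvin(theta_c: float) -> float:
    """Convert a temperature from degrees Celsius to kelvin."""
    return theta_c + 273.15


def arrhenius_factor(theta1_c: float, theta2_c: float, ea_ev: float) -> float:
    """Acceleration factor A = pi_T(theta2) / pi_T(theta1), one failure mechanism."""
    t1, t2 = kelvin(theta1_c), kelvin(theta2_c)
    return math.exp(ea_ev / K_BOLTZMANN_EV * (1.0 / t1 - 1.0 / t2))


def pi_t_two_mechanisms(theta_c: float, theta_ref_c: float, a: float,
                        ea1_ev: float = 0.3, ea2_ev: float = 0.6,
                        theta_0_c: float = 40.0) -> float:
    """pi_T for two failure mechanisms; equals 1 at the reference temperature.

    theta_c     -- actual junction temperature (T_2 in the text)
    theta_ref_c -- reference junction temperature (T_1 in the text)
    a           -- weighting constant, 0 <= a <= 1 (illustrative value only)
    theta_0_c   -- base temperature (T_ref in the text, 40 °C)
    """
    t0 = kelvin(theta_0_c)
    z = (1.0 / t0 - 1.0 / kelvin(theta_c)) / K_BOLTZMANN_EV
    z_ref = (1.0 / t0 - 1.0 / kelvin(theta_ref_c)) / K_BOLTZMANN_EV
    num = a * math.exp(ea1_ev * z) + (1.0 - a) * math.exp(ea2_ev * z)
    den = a * math.exp(ea1_ev * z_ref) + (1.0 - a) * math.exp(ea2_ev * z_ref)
    return num / den


# Junction temperature raised from 55 °C to 100 °C, for two activation energies.
for ea in (0.3, 0.6):
    print(f"Ea = {ea} eV: A = {arrhenius_factor(55.0, 100.0, ea):.1f}")
print(f"two mechanisms: pi_T = {pi_t_two_mechanisms(100.0, 55.0, a=0.9):.1f}")
```

With `a = 1` the two-mechanism expression reduces to the single-mechanism Arrhenius factor, which gives a simple consistency check between the two functions.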


Discussion over comparison with obsolete data should be dropped and it would seem to be opportune to unify models and data, taking from each model the "good part" and putting them together for new better models (strategy applicable to other situations as well). Models for prediction in practical applications should remain reasonably simple, laws for dominant failure mechanisms should be given in international standards, and the list of reference failure rates $\lambda_{ref}$ should be yearly updated. Models based on failure mechanisms have to be used as a basis for simplified models. The assumption of $\lambda < 10^{-9}$ h$^{-1}$ should be confined to components with a stable production process and a reserve to technological limits.

Calculation of the failure rate at equipment and system level often requires considerations on the mission profile. If the mission can be partitioned in time spans with almost homogeneous stresses, switching effects are negligible, and the failure rate is time independent (between successive state changes of the system), the contribution of each time span can be added linearly, as often assumed for duty cycles. With these assumptions, investigation of phased-mission systems (systems whose elements are used at different rates) becomes possible (Section 6.8.6.2).
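The linear addition over time spans can be sketched as follows. The Python fragment assumes a mission made of phases with constant failure rate in each phase and negligible switching effects, as stated above; the function names and the two-phase profile are hypothetical.

```python
import math


def mission_failure_rate(phases):
    """Time-weighted mean failure rate over a mission.

    phases -- list of (duration_h, failure_rate_per_h) tuples, one per time span
    """
    total_time = sum(duration for duration, _ in phases)
    expected_failures = sum(duration * rate for duration, rate in phases)
    return expected_failures / total_time


def mission_reliability(phases):
    """Probability of no failure over the whole mission (constant rate per phase)."""
    return math.exp(-sum(duration * rate for duration, rate in phases))


# Hypothetical duty cycle: 8 h operating at 2e-6 1/h, 16 h dormant at 0.5e-6 1/h.
profile = [(8.0, 2.0e-6), (16.0, 0.5e-6)]
print(f"mean failure rate: {mission_failure_rate(profile):.2e} 1/h")
print(f"mission reliability: {mission_reliability(profile):.6f}")
```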

Estimation and demonstration of component and system failure rates are considered in Sections 7.2.3.1 and 7.2.3.2 - 7.2.3.3, respectively.

Example 2.4 For indicative purposes, the following table gives failure rates calculated according to some different data bases [2.29, 2.25, 2.24] for continuous operation in non interface application; $\theta_A$ = 40°C, $\theta_J$ = 55°C, S = 0.5, GB, and $\pi_Q = 1$ as for CECC certified and class II Telcordia; Pl is used for plastic package; $\lambda$ in $10^{-9}$ h$^{-1}$ (FIT), quantified at $1 \cdot 10^{-9}$ h$^{-1}$.

Columns: Telcordia 2001; IRPH 2003; IEC** 2004; $\lambda_{ref}$*

DRAM, CMOS, 1 M, Pl
SRAM, CMOS, 1 M, Pl
EPROM CMOS, 1 M, Pl
16 Bit µP ($10^5$ TR), CMOS, Pl
Gate array, CMOS, 30,000 gates, 40 Pins, Pl
Lin, Bip, 70 Tr, Pl
GP diode, Si, 100 mA, lin, Pl
Bip. transistor, 300 mW, switching, Pl
JFET, 300 mW, switching, Pl
Ceramic capacitor, 100 nF, 125°C, class 1
Foil capacitor, 1 µF
Ta solid capacitor, herm., 100 µF, 0.3 Ω/V
MF resistor, 1/4 W, 100 kΩ
Cermet pot, 50 kΩ, < 10 annual shaft rot.

*Assumed value for computations as per IEC 61709 [2.23], $\theta_A$ = 40°C; **Production year 2001 for ICs


2.2.5 Reliability of One-Item Structure

A one-item nonrepairable structure is characterized by the distribution function $F(t) = \Pr\{\tau \le t\}$ of its failure-free time $\tau$, assumed > 0 with F(0) = 0, and hereafter used as a synonym for failure-free operating time. The reliability function R(t), i.e. the probability of no failure in the interval (0, t], follows as (Eq. (A6.24))

$$R(t) = \Pr\{\tau > t\} = 1 - F(t), \qquad R(0) = 1. \tag{2.7}$$

The expected value (mean) of the failure-free time $\tau$, designated as MTTF (mean time to failure), can be calculated from Eq. (A6.38)

$$MTTF = E[\tau] = \int_0^\infty R(t)\,dt. \tag{2.8}$$

Should the one-item structure exhibit a useful life limited to $T_L$, Eq. (2.8) yields

$$MTTF_L = \int_0^{T_L} R(t)\,dt, \qquad R(t) = 0 \ \text{for} \ t > T_L.$$

In the following, $T_L = \infty$ will be assumed (except in Example 6.21). Equation (2.8) is an important relationship. It is valid not only for a one-item structure (often considered as an indivisible entity), but it also holds for a one-item structure of arbitrary complexity. $R_{Si}(t)$ & $MTTF_{Si}$ are used to emphasize this

$$MTTF_{Si} = \int_0^\infty R_{Si}(t)\,dt. \tag{2.9}$$

Thereby, S stands for system and i for the state entered at t = 0 (Table 6.2). i = 0 holds for system new at t = 0, i.e. for $R_{S0}(0) = 1$ (this notation is used in the following sections, in particular in Chapter 6 dealing with repairable systems).

Back to the one-item structure, considered in this section as an indivisible entity, and assuming R(t) derivable, the failure rate $\lambda(t)$ of a nonrepairable one-item structure new at t = 0 is given by (Eq. (A6.25))

$$\lambda(t) = \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \Pr\{t < \tau \le t + \delta t \mid \tau > t\} = -\frac{dR(t)/dt}{R(t)}. \tag{2.10}$$

Considering R(0) = 1, it follows that

$$R(t) = e^{-\int_0^t \lambda(x)\,dx}, \tag{2.11}$$

from which, for $\lambda(t) = \lambda$,

$$R(t) = e^{-\lambda t}. \tag{2.12}$$

The mean time to failure in this case is equal to $1/\lambda$. In practical applications

$$\frac{1}{\lambda} \equiv MTBF \tag{2.13}$$

is often used, where MTBF stands for mean operating time between failures, expressing thus a figure applicable to repairable one-item structures. To avoid misuses, and also because of the often used estimate $\widehat{MTBF} = T/k$, MTBF should be confined to repairable items with constant (time independent) failure rate (constant failure rates for all elements in the case of a system, see remark on p. 358).

As shown by Eq. (2.11), the reliability function of a nonrepairable one-item structure new at t = 0 is completely defined by its failure rate $\lambda(t)$. In the case of electronic components, $\lambda(t) = \lambda$ can often be assumed. The failure-free time $\tau$ then exhibits an exponential distribution ($F(t) = \Pr\{\tau \le t\} = 1 - e^{-\lambda t}$). For a time dependent failure rate, the distribution function of the failure-free time can often be approximated by the weighted sum of a Gamma distribution (Eq. (A6.97)) with β < 1 and a shifted Weibull distribution (Eq. (A6.96)) with β > 1 (Eq. (A6.34)).
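The relation between the reliability function and the mean time to failure can be checked numerically. The following Python sketch integrates R(t) with the trapezoidal rule, once for a constant failure rate and once for a Weibull distribution. The parameterization $R(t) = e^{-(\lambda t)^\beta}$, the helper names, and the numerical values are assumptions made for the example; the integration is truncated at a finite `t_max` where R(t) is negligible.

```python
import math


def reliability_weibull(t: float, lam: float, beta: float) -> float:
    """R(t) = exp(-(lam*t)**beta); beta = 1 gives the exponential case."""
    return math.exp(-((lam * t) ** beta))


def mttf_numeric(reliability, t_max: float, steps: int = 100_000) -> float:
    """Trapezoidal approximation of the integral of R(t) from 0 to t_max."""
    h = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    for i in range(1, steps):
        total += reliability(i * h)
    return total * h


lam = 1.0e-3  # illustrative value, in 1/h

# Constant failure rate: the integral should be close to 1/lam.
mttf_exp = mttf_numeric(lambda t: reliability_weibull(t, lam, 1.0), 20.0 / lam)

# Weibull with beta = 2 (wearout): closed form is Gamma(1 + 1/beta) / lam.
mttf_wb = mttf_numeric(lambda t: reliability_weibull(t, lam, 2.0), 20.0 / lam)

print(f"exponential: {mttf_exp:.1f} h (1/lam = {1.0 / lam:.1f} h)")
print(f"Weibull:     {mttf_wb:.1f} h (closed form {math.gamma(1.5) / lam:.1f} h)")
```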

Equations (2.7) - (2.12) imply that the nonrepairable one-item structure is new at time t = 0. Also of interest in some applications is the probability of failure-free operation during an interval (0, t] under the condition that the item has already operated without failure for $x_0$ time units before t = 0. This quantity is a conditional probability, designated by $R(t, x_0)$ and given by (Eq. (A6.27))

$$R(t, x_0) = \Pr\{\tau > t + x_0 \mid \tau > x_0\} = \frac{R(t + x_0)}{R(x_0)} = e^{-\int_{x_0}^{t + x_0} \lambda(x)\,dx}. \tag{2.14}$$

For $\lambda(x) = \lambda$, Eq. (2.14) reduces to Eq. (2.12). This memoryless property occurs only with constant (time independent) failure rate. Its use greatly simplifies calculations in the next sections, in particular in Chapter 6 dealing with repairable systems.
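The memoryless property can be made concrete with a small numerical comparison. The Python sketch below evaluates the conditional reliability as the ratio $R(t + x_0)/R(x_0)$ for a constant failure rate and for an increasing one. The Weibull form $R(t) = e^{-(\lambda t)^\beta}$ with β = 2 and all numerical values are assumptions chosen for illustration.

```python
import math


def conditional_reliability(reliability, t: float, x0: float) -> float:
    """R(t, x0) = R(t + x0) / R(x0): survival over (0, t] after x0 failure-free hours."""
    return reliability(t + x0) / reliability(x0)


lam = 1.0e-4  # illustrative value, in 1/h


def r_exponential(t: float) -> float:
    """Constant failure rate lam."""
    return math.exp(-lam * t)


def r_wearout(t: float) -> float:
    """Weibull with beta = 2, i.e. a linearly increasing failure rate."""
    return math.exp(-((lam * t) ** 2))


t, x0 = 1000.0, 5000.0
print(f"exponential: R(t) = {r_exponential(t):.4f}, "
      f"R(t, x0) = {conditional_reliability(r_exponential, t, x0):.4f}")
print(f"wearout:     R(t) = {r_wearout(t):.4f}, "
      f"R(t, x0) = {conditional_reliability(r_wearout, t, x0):.4f}")
```

For the constant failure rate both figures coincide; for the increasing failure rate the item that has already operated for `x0` hours is less reliable over the next `t` hours than a new one.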

Equations (2.8) and (2.9) can also be used for repairable items. In fact, assuming that at failure the item is replaced by a statistically equivalent one, a new, independent, failure-free time $\tau$ with the same distribution function as the former one is started after repair (replacement), yielding the same expected value. However, for these cases the variable x starting by x = 0 after each repair has to be used instead of t (as for interarrival times). With this, $MTTF_{Si}$ (Eq. (2.9)) can be used for the mean time to failure of a given system, independently of whether it is repairable or not. The only assumption is that the system is as-good-as-new after repair, with respect to the state i considered (Table 6.2). At system level, this occurs only if all nonrepaired (renewed) elements in the system have constant failure rates. If the failure rate of one nonrenewed element is not constant, difficulties can arise. This holds even if the assumption of an as-bad-as-old situation (pp. 405 & 497) applies.

In some applications, it can appear that elements of a population of similar items exhibit different failure rates. Considering as an example the case of components delivered from two manufacturers with proportions p & (1 - p) and failure rates $\lambda_1$ & $\lambda_2$, the reliability function of an arbitrarily selected component is (Eq. (A6.34))

$$R(t) = p\,e^{-\lambda_1 t} + (1 - p)\,e^{-\lambda_2 t}.$$

According to Eq. (2.10), it follows for the failure rate that

$$\lambda(t) = \frac{p\,\lambda_1 e^{-\lambda_1 t} + (1 - p)\,\lambda_2 e^{-\lambda_2 t}}{p\,e^{-\lambda_1 t} + (1 - p)\,e^{-\lambda_2 t}}. \tag{2.15}$$

From Eq. (2.15) one recognizes that the failure rate of mixture distributions is time dependent and decreases monotonically from the average of the failure rates at t = 0 to the minimum of the failure rates as $t \to \infty$.
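The behavior of the mixture failure rate can be tabulated with a few lines of Python. The sketch assumes two constant failure rates mixed with proportions p and 1 - p, as in the text; the function name and the numerical values are hypothetical.

```python
import math


def mixture_failure_rate(t: float, p: float, lam1: float, lam2: float) -> float:
    """Failure rate -R'(t)/R(t) of R(t) = p*exp(-lam1*t) + (1-p)*exp(-lam2*t)."""
    w1 = p * math.exp(-lam1 * t)
    w2 = (1.0 - p) * math.exp(-lam2 * t)
    return (lam1 * w1 + lam2 * w2) / (w1 + w2)


# Illustrative values: 20% of the components come from a weaker manufacturer.
p, lam1, lam2 = 0.2, 1.0e-5, 1.0e-6
for t in (0.0, 1.0e5, 5.0e5, 5.0e6):
    rate = mixture_failure_rate(t, p, lam1, lam2)
    print(f"t = {t:9.0f} h: failure rate = {rate:.3e} 1/h")
```

At t = 0 the result is the weighted average of the two rates; for large t the weaker subpopulation has mostly failed and the rate approaches the smaller of the two values.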

2.2.6 Reliability of Series - Parallel Structures

For nonrepairable items (up to item failure), reliability calculation at equipment and system level can often be performed using the basic models given in Table 2.1. The one-item structure has been introduced in Section 2.2.5. Series, parallel, and series - parallel structures are considered in this section. The last three models of Table 2.1 are investigated in Section 2.3. To unify the notation, system will be used for item and it is assumed that at t = 0 the system is new (i.e. $R_{S0}(t)$ is given).

2.2.6.1 Systems without Redundancy

From a reliability point of view, a system has no redundancy (series system) if all elements must work in order to fulfill the required function. The reliability block diagram consists in this case of the series connection of all elements ($E_1$ to $E_n$) of the system. For calculation purposes it is often assumed that each element operates and fails independently from every other element (independent elements as defined on p. 52). For series systems, this assumption need not (in general) be verified, because the first failure is a system failure for reliability purposes. Let $e_i$ be the event

{element $E_i$ works without failure in the interval (0, t]}.

The probability of this event is the reliability function $R_i(t)$ of the element $E_i$, i.e.

$$\Pr\{e_i\} = R_i(t). \tag{2.16}$$

The system does not fail in the interval (0, t] if and only if all elements $E_1, \ldots, E_n$ do not fail in that interval, thus

$$R_{S0}(t) = \Pr\{e_1 \cap \ldots \cap e_n\}.$$

Here and in the following, S stands for system and 0 specifies that the system is new at t = 0. Due to the assumed independence among the elements $E_1, \ldots, E_n$ and thus among the events $e_1, \ldots, e_n$, it follows (Eq. (A6.9)) that for the reliability function $R_{S0}(t)$

$$R_{S0}(t) = \prod_{i=1}^{n} R_i(t). \tag{2.17}$$

The failure rate of the system can be calculated from Eq. (2.10)

$$\lambda_{S0}(t) = \sum_{i=1}^{n} \lambda_i(t). \tag{2.18}$$

Equation (2.18) leads to the following important conclusion:

The failure rate of a series system (system without redundancy), that consists of independent elements (p. 52), is equal to the sum of the failure rates of its elements.

The system's mean time to failure follows from Eq. (2.9). The special case in which all elements have a constant failure rate $\lambda_i(t) = \lambda_i$ leads to

$$\lambda_{S0}(t) = \lambda_{S0} = \sum_{i=1}^{n} \lambda_i, \qquad R_{S0}(t) = e^{-\lambda_{S0} t}, \qquad MTTF_{S0} = \frac{1}{\lambda_{S0}}. \tag{2.19}$$
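For a series system with constant failure rates, the calculation reduces to a sum. The following Python sketch adds element failure rates expressed in FIT ($10^{-9}$ h$^{-1}$) and derives the system reliability and mean time to failure. The element values and the one-year mission are hypothetical and serve only to illustrate the arithmetic.

```python
import math

FIT = 1.0e-9  # 1 FIT = 1e-9 failures per hour


def series_failure_rate(rates):
    """Failure rate of a series system of independent elements (sum of the rates)."""
    return sum(rates)


def series_reliability(rates, t: float) -> float:
    """Reliability of a series system with constant element failure rates."""
    return math.exp(-series_failure_rate(rates) * t)


# Hypothetical assembly with three elements in series.
rates = [50 * FIT, 20 * FIT, 30 * FIT]
lam_s = series_failure_rate(rates)
t_mission = 8760.0  # one year of continuous operation, in hours

print(f"system failure rate: {lam_s / FIT:.0f} FIT")
print(f"R_S0({t_mission:.0f} h) = {series_reliability(rates, t_mission):.6f}")
print(f"MTTF_S0 = {1.0 / lam_s:.3e} h")
```

The same reliability is obtained by multiplying the element reliabilities $e^{-\lambda_i t}$, which is the product rule for independent elements.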

2.2.6.2 Concept of Redundancy

High reliability, availability, and / or safety at equipment or system level can often only be reached with the help of redundancy. Redundancy is the existence of more than one means (in an item) for performing the required function. Redundancy does not just imply a duplication of hardware, since it can be implemented at the software level or as a time redundancy. However, to avoid common mode and single-point failures, redundant elements should be realized (designed and manufactured) independently from each other. Irrespective of the failure mode (e.g. shorts or opens), redundancy still appears in parallel on the reliability block diagram, not necessarily in the hardware (see e.g. Example 2.6). In setting up the reliability block diagram, particular attention must be paid to the series element to a redundancy. A FMEA is generally mandatory for such a decision. Should the redundant elements fulfill only a part of the required function, a pseudo redundancy exists. From the operating point of view, one distinguishes between active, warm, and standby redundancy:


1. Active Redundancy (parallel, hot): Redundant elements are subjected from the beginning to the same load as operating elements; load sharing is possible, but is not considered in the case of independent elements (see Section 2.2.6.3).

2. Warm Redundancy (lightly loaded): Redundant elements are subjected to a lower load until one of the operating elements fails; load sharing is present; however, the failure rate in the reserve state is lower than in the operating state.

3. Standby Redundancy (cold, unloaded): Redundant elements are subjected to no load until one of the operating elements fails; no load sharing is possible, and the failure rate in the reserve state is assumed to be zero ($\lambda \equiv 0$).

Important redundant structures with independent elements in active redundancy are considered in Sections 2.2.6.3 to 2.3.4. Warm and standby redundancies are investigated in Section 2.3.5 and Chapter 6 (repair rate $\mu = 0$).

2.2.6.3 Parallel Models

A parallel model consists of n (often statistically identical) elements in active redundancy, of which k are necessary to perform the required function and the remaining n - k are in reserve. Such a structure is designated as a k-out-of-n redundancy (also known as k-out-of-n: G). Investigations assume in general no load sharing, i.e. independent elements (see Sections 2.3.5 & 6.5 for load sharing).

Let us consider at first the case of an active 1-out-of-2 redundancy as given in Table 2.1 (3rd row). The required function is fulfilled if at least one of the elements $E_1$ or $E_2$ works without failure in the interval (0, t]. With the same notation as for Eq. (2.16) it follows that

$$R_{S0}(t) = \Pr\{e_1 \cup e_2\}. \tag{2.20}$$

Assuming that elements $E_1$ and $E_2$ are independent (p. 52), Eq. (2.20) yields for the reliability function $R_{S0}(t)$ (Eqs. (A6.13), (A6.8), (2.16)),

$$R_{S0}(t) = R_1(t) + R_2(t) - R_1(t)\,R_2(t). \tag{2.21}$$

The mean time to failure $MTTF_{S0}$ can be calculated from Eq. (2.9). The special case of two identical elements with constant failure rate ($R_1(t) = R_2(t) = e^{-\lambda t}$) leads to

$$R_{S0}(t) = 2e^{-\lambda t} - e^{-2\lambda t}, \qquad \lambda_{S0}(t) = 2\lambda\,\frac{1 - e^{-\lambda t}}{2 - e^{-\lambda t}}, \qquad MTTF_{S0} = \frac{3}{2\lambda}. \tag{2.22}$$

Equation (2.22) shows that in the presence of redundancy, the failure rate $\lambda_{S0}(t)$ at system level is a function of time, even if the element's failure rates are time independent. However, the stochastic behavior of the system is still described by a Markov process (see e.g. Section 2.3.5). This time dependence becomes negligible in the case of repairable systems with constant failure and repair rates (Eq. (6.93)).
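The time dependence of the system failure rate for an active 1-out-of-2 redundancy can be seen numerically. The Python sketch below assumes two identical, independent elements with constant failure rate λ, evaluates the system reliability $2e^{-\lambda t} - e^{-2\lambda t}$ and the corresponding failure rate $-R'(t)/R(t)$, and prints a few values. The function names and the value of λ are hypothetical.

```python
import math


def r_1oo2(t: float, lam: float) -> float:
    """Reliability of an active 1-out-of-2 redundancy, identical elements."""
    return 2.0 * math.exp(-lam * t) - math.exp(-2.0 * lam * t)


def failure_rate_1oo2(t: float, lam: float) -> float:
    """System failure rate -R'(t)/R(t) for the structure above."""
    e = math.exp(-lam * t)
    return 2.0 * lam * (1.0 - e) / (2.0 - e)


lam = 1.0e-4  # illustrative element failure rate, in 1/h
for t in (0.0, 1.0e3, 1.0e4, 1.0e5):
    print(f"t = {t:8.0f} h: R_S0 = {r_1oo2(t, lam):.6f}, "
          f"failure rate = {failure_rate_1oo2(t, lam):.3e} 1/h")

# Mean time to failure: integral of R_S0(t), i.e. 2/lam - 1/(2*lam) = 3/(2*lam).
mttf_1oo2 = 2.0 / lam - 1.0 / (2.0 * lam)
print(f"MTTF_S0 = {mttf_1oo2:.0f} h")
```

The system failure rate starts at zero, since both elements must fail, and approaches the element failure rate λ for large t.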

Generalization to an active k-out-of-n redundancy with identical and independent eleinents (R,(t) = ... = R,(t) = R(t)) follows from the binonlial distribution (Eq. (A6.120)) by setting p = R(t)

$R_{S0}(t)$ can be interpreted as the probability of observing at least k successes in n Bernoulli trials with $p = R(t)$. The mean time to failure $MTTF_{S0}$ can be calculated from Eq. (2.9). For k = 1 and $R(t) = e^{-\lambda t}$ it follows that

$$R_{S0}(t) = 1 - (1 - e^{-\lambda t})^n \quad \text{and} \quad MTTF_{S0} = \frac{1}{\lambda}\Big(1 + \frac{1}{2} + \ldots + \frac{1}{n}\Big). \qquad (2.24)$$
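The binomial interpretation above can be checked numerically. The following is a minimal sketch, assuming independent and identical elements; the function names and the numerical values of `lam` and `t` are illustrative choices, not taken from the text.

```python
from math import comb, exp

def k_out_of_n_reliability(k, n, r):
    """Probability that at least k of n independent, identical elements are up,
    each element being up with probability r (binomial sum)."""
    return sum(comb(n, i) * r**i * (1 - r)**(n - i) for i in range(k, n + 1))

def mttf_one_out_of_n(n, lam):
    """MTTF of an active 1-out-of-n redundancy with constant failure rate lam,
    i.e. (1/lam) * (1 + 1/2 + ... + 1/n) as in Eq. (2.24)."""
    return sum(1.0 / i for i in range(1, n + 1)) / lam

lam = 1e-3   # hypothetical failure rate in 1/h
t = 100.0    # hypothetical mission time in h
r = exp(-lam * t)

# For k = 1, n = 2 the binomial sum must agree with R1 + R2 - R1*R2 = 2r - r^2.
print(k_out_of_n_reliability(1, 2, r), 2 * r - r * r)
# MTTF gain of a 1-out-of-2 redundancy over a single element: factor 3/2.
print(mttf_one_out_of_n(2, lam) * lam)
```

For short missions ($\lambda t \ll 1$) the gain in reliability is much larger than the factor 3/2 in the mean time to failure suggests, which is the point made with Fig. 2.7.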

The improvement in $MTTF_{S0}$ shown by Eq. (2.24) becomes much greater when repair without interruption of operation at system level is allowed: factor $\mu / 2\lambda$ instead of $3/2$ for an active 1-out-of-2 redundancy, where $\mu = 1/MTTR$ is the constant repair rate (Tables 6.6 and 6.8). However, as shown in Fig. 2.7, the increase of the reliability function $R_{S0}(t)$ caused by redundancy is very important for short missions ($t \ll 1/\lambda$), even in the nonrepairable case. Other comparisons between series-parallel structures are given in Figs. 2.8 and 2.9 (Figs. 6.17 and 6.18 for the repairable case).

Figure 2.7 Reliability function for the one-item structure (as reference) and for some active redundancies (nonrepairable up to system failure, constant failure rates, identical and independent elements, no load sharing; see Section 2.3.5 for load sharing)

In addition to the k-out-of-n redundancy described by Eq. (2.23), attention has been paid in the literature to cases in which the fulfillment of the required function asks that not more than n - k consecutive elements fail. Such a structure, known as consecutive k-out-of-n system, is theoretically more reliable than the corresponding k-out-of-n redundancy. For a k-out-of-n consecutive system with n identical and independent elements in active redundancy (each with reliability R) it holds that [2.40]

$R_{S0}$ = Pr{no block with more than n - k consecutive failed elements}

with $g(n,i) = \binom{n}{i}$ for $i \le n-k$, $g(a,a) = 0$ for $a \ge n-k+1$, and $g(a,b) = g(a-1,b) + g(a-2,b-1) + \ldots + g(a-n+k-1,\,b-n+k)$ otherwise. n = 5 and k = 3 yields

$$R_S = R^5 + 5R^4(1-R) + 10R^3(1-R)^2$$

from Eq. (2.23) and

from Eq. (2.25), with $R_S = R_{S0}(t)$, $R = R(t)$, $R(0) = 1$. Examples for consecutive k-out-of-n systems are conveying systems and relay stations. However, for these kinds of application it is important to verify that all elements are independent (with respect to external influences, load sharing, etc.).
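The comparison between the two structures for n = 5 and k = 3 can be reproduced by brute force. The sketch below does not use the recursion for g(a, b); it simply enumerates all $2^n$ element states and counts, for each number of failed elements, the states in which the system is up. Function names are illustrative.

```python
from itertools import product

def count_up_states(n, k, consecutive):
    """counts[i] = number of states with exactly i failed elements in which the system is up.
    consecutive=False: k-out-of-n redundancy (at most n-k failed elements).
    consecutive=True:  no block with more than n-k consecutive failed elements."""
    max_failed = n - k
    counts = [0] * (n + 1)
    for state in product((1, 0), repeat=n):      # 1 = up, 0 = failed
        failed = n - sum(state)
        if consecutive:
            longest = run = 0
            for s in state:                      # longest run of failed elements
                run = run + 1 if s == 0 else 0
                longest = max(longest, run)
            up = longest <= max_failed
        else:
            up = failed <= max_failed
        if up:
            counts[failed] += 1
    return counts

def reliability(counts, r):
    """Sum of counts[i] * r^(n-i) * (1-r)^i for identical, independent elements."""
    n = len(counts) - 1
    return sum(c * r**(n - i) * (1 - r)**i for i, c in enumerate(counts))

plain = count_up_states(5, 3, consecutive=False)   # [1, 5, 10, 0, 0, 0]
consec = count_up_states(5, 3, consecutive=True)   # [1, 5, 10, 7, 1, 0]
print(plain, consec)
print(reliability(plain, 0.9), reliability(consec, 0.9))
```

The first three coefficients coincide with the k-out-of-n result above; the consecutive system gains the additional states with 3 or 4 failed elements that contain no block of 3 adjacent failures.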

2.2.6.4 Series - Parallel Structures

Series - parallel structures can be investigated through successive use of the results for series and parallel models. This holds in particular for nonrepairable systems with active redundancy and independent elements (p. 52). To demonstrate the procedure, let us consider the 5th row in Table 2.1:

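1st step: The series elements $E_1$ to $E_3$ are replaced by $E_8$, $E_4$ & $E_5$ by $E_9$, and $E_6$ & $E_7$ by $E_{10}$, yielding

[Reduced reliability block diagram with $E_8$, $E_9$, and $E_{10}$]

2nd step: The 1-out-of-2 redundancy $E_8$ and $E_9$ is replaced by $E_{11}$, giving

[Reduced reliability block diagram with $E_{10}$ and $E_{11}$]

with $R_{11}(t) = R_8(t) + R_9(t) - R_8(t)\,R_9(t)$

3rd step: From steps 1 and 2, the reliability function of the system follows as (with $R_S = R_{S0}(t)$, $R_i = R_i(t)$, $R_i(0) = 1$, $i = 1, \ldots, 7$)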

The mean time to failure can be calculated from Eq. (2.9). Should all elements have a constant failure rate ($\lambda_1$ to $\lambda_7$), then

and

Under the assumptions of active redundancy, nonrepairable (up to system failure), independent elements (p. 52), and constant failure rates, the reliability function $R_{S0}(t)$ of a system with series-parallel structure is given by a sum of exponential functions. The mean time to failure $MTTF_{S0}$ follows then directly from the exponent terms of $R_{S0}(t)$, see Eq. (2.27) for an example.

The use of redundancy implies the introduction of a series element in the reliability block diagram which takes into account the parts which are common to the redundant elements, creates the redundancy (Example 2.5), or assumes a control and/or switching function. For a design engineer it is important to evaluate the influence of the series element in a redundant structure. Figures 2.8 and 2.9 allow such an evaluation to be made for the case in which constant failure rates, independent elements, and active redundancy can be assumed. In Fig. 2.8, a one-item structure (element $E_1$ with failure rate $\lambda_1$) is compared with a 1-out-of-2 redundancy with a series element (element $E_2$ with failure rate $\lambda_2$). In Fig. 2.9, the 1-out-of-2 redundancy with a series element $E_2$ is compared with the structure which would be obtained if a 1-out-of-2 redundancy for element $E_2$ with a series element $E_3$ became necessary. Obviously $\lambda_3 < \lambda_2 < \lambda_1$ (the limiting cases $\lambda_1 = \lambda_2$ for Fig. 2.8 and $\lambda_1 = \lambda_2 = \lambda_3$ for Fig. 2.9 have an indicative purpose only). The three cases are labeled a), b), and c). The upper parts of Figs. 2.8 and 2.9 depict the reliability functions and the lower parts the ratios $MTTF_{S0b}/MTTF_{S0a}$ and $MTTF_{S0c}/MTTF_{S0b}$, respectively. The comparison between case a) of Fig. 2.8 and case c) of Fig. 2.9, given as $MTTF_{S0c}/MTTF_{S0a}$ on Fig. 2.8, shows a much lower dependency on $\lambda_2/\lambda_1$. From Figs. 2.8 and 2.9 the following design guideline can be formulated:

The failure rate $\lambda_2$ of the series element in a nonrepairable (up to system failure) 1-out-of-2 active redundancy should not be larger than 10% of the failure rate $\lambda_1$ of the redundant elements; the 10% rule applies also for the case of $\lambda_3$ in Fig. 2.9, i.e.

$$\lambda_2 < 0.1\,\lambda_1 \quad \text{and} \quad \lambda_3 < 0.1\,\lambda_2.$$

The investigation of the structures given in Figs. 2.8 and 2.9 for the repairable case (with $\mu = 1/MTTR$ as constant repair rate) leads in Section 6.6 to more severe conditions ($\lambda_2 < 0.01\,\lambda_1$ in general, and $\lambda_2 < 0.002\,\lambda_1$ for $\mu/\lambda_1 > 500$), see Figs. 6.17 and 6.18.
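The influence of the series element in the nonrepairable case can be illustrated numerically. The sketch below assumes the structure of Fig. 2.8 case b) to be a 1-out-of-2 active redundancy of two elements with failure rate $\lambda_1$ in series with one element with failure rate $\lambda_2$, so that $R_{S0b}(t) = (2e^{-\lambda_1 t} - e^{-2\lambda_1 t})\,e^{-\lambda_2 t}$; integrating this sum of exponentials term by term gives the mean time to failure used in the code. Function names and the chosen ratios are illustrative.

```python
def mttf_one_item(lam1):
    """Case a): single element with constant failure rate lam1."""
    return 1.0 / lam1

def mttf_redundant_with_series(lam1, lam2):
    """Case b): 1-out-of-2 active redundancy (lam1 each) in series with an element lam2.
    R(t) = 2 e^{-(lam1+lam2) t} - e^{-(2 lam1+lam2) t}, integrated over (0, inf)."""
    return 2.0 / (lam1 + lam2) - 1.0 / (2.0 * lam1 + lam2)

lam1 = 1.0  # normalized failure rate of the redundant elements
for ratio in (0.0, 0.01, 0.1, 0.5, 1.0):
    gain = mttf_redundant_with_series(lam1, ratio * lam1) / mttf_one_item(lam1)
    print(f"lambda2/lambda1 = {ratio:4.2f}   MTTF_b / MTTF_a = {gain:.3f}")
```

With $\lambda_2 = 0.1\,\lambda_1$ most of the ideal gain of 3/2 is retained, whereas for $\lambda_2 = \lambda_1$ the redundant structure has a lower mean time to failure than the single element.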

2.2.6.5 Majority Redundancy

Majority redundancy is a special case of a k-out-of-n redundancy, frequently used in, but not limited to, redundant digital circuits. $2n + 1$ outputs are fed to a voter whose output represents the majority of its $2n + 1$ input signals. The investigation is based on the previously described procedure for series-parallel structures, see for example the case of n = 1 (active redundancy 2-out-of-3 in series with the voter $E_\nu$) given in the 6th line of Table 2.1. The majority redundancy realizes in a simple way a fault-tolerant structure without the need for control or switching elements. The required function is performed with no operational interruption up to the time point of the second failure, since the first failure is automatically masked by the majority redundancy. In digital circuits, the voter for a majority redundancy with n = 1 consists of three two-input NAND gates and one three-input NAND gate, for a bit by bit solution. An alarm circuit is also simple to realize, and can be implemented with three two-input EXOR gates and one three-input OR gate (Example 2.5). A structure similar to that of the alarm circuit can be used to realize a second alarm circuit giving a pulse at the second failure, thus expanding the 2-out-of-3 active redundancy to a 1-out-of-3 active redundancy (Problem 2.7 in Appendix A11). A majority redundancy can also be realized with software (N-version programming).

Example 2.5

Realize a majority redundancy for n = 1, including voter and alarm signal at the first failure of a redundant element (bit by bit solution with "1" for operating and "0" for failure).

Solution

Using the same notation as for Eq. (2.16), the 2-out-of-3 active redundancy can be implemented by $(e_1 \cap e_2) \cup (e_1 \cap e_3) \cup (e_2 \cap e_3)$. With this, the functional block diagram per bit of the voter for a majority redundancy with n = 1 is obtained as the realization of the logic equation related to the above expression. The alarm circuit giving a logic 1 at the occurrence of the first failure is also easy to implement. It is also possible to realize a second alarm circuit to detect the second failure (Problem 2.7 in Appendix A11).

[Circuit diagrams: voter and alarm]
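The gate-level realization described for the voter and the alarm circuit can be sketched as a truth-table simulation. This is a minimal illustration for single bits; the function names are chosen here and are not part of the example.

```python
from itertools import product

def nand(*bits):
    """NAND gate on single bits (0 or 1)."""
    return 0 if all(bits) else 1

def majority_voter(e1, e2, e3):
    """2-out-of-3 majority from three two-input NAND gates and one three-input NAND gate,
    equivalent to (e1 AND e2) OR (e1 AND e3) OR (e2 AND e3)."""
    return nand(nand(e1, e2), nand(e1, e3), nand(e2, e3))

def first_failure_alarm(e1, e2, e3):
    """Logic 1 as soon as the three inputs disagree:
    three two-input EXOR gates followed by one three-input OR gate."""
    return (e1 ^ e2) | (e1 ^ e3) | (e2 ^ e3)

# Truth table: inputs, voter output, alarm output.
for bits in product((0, 1), repeat=3):
    print(bits, majority_voter(*bits), first_failure_alarm(*bits))
```

The voter output follows the majority of its inputs, so a single deviating input is masked, while the alarm output flags exactly those input combinations in which one input deviates from the other two.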


Figure 2.8 Comparison between the one-item structure and a 1-out-of-2 active redundancy with series element (nonrepairable up to system failure, independent elements, constant failure rates $\lambda_1$ & $\lambda_2$, $\lambda_1$ remains the same in both structures; equations according to Table 2.1; given on the right-hand side is $MTTF_{S0c}/MTTF_{S0a}$, with $MTTF_{S0c}$ from Fig. 2.9; see Fig. 6.17 for the repairable case)


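[Figure 2.9, case c): 1-out-of-2 active redundancy for $E_1$ in series with a 1-out-of-2 active redundancy for $E_2$ and with $E_3$; $R_{S0c}(t) = (2e^{-\lambda_1 t} - e^{-2\lambda_1 t})(2e^{-\lambda_2 t} - e^{-2\lambda_2 t})\,e^{-\lambda_3 t}$, $MTTF_{S0c} = 4/(\lambda_1+\lambda_2+\lambda_3) - 2/(\lambda_1+2\lambda_2+\lambda_3) - 2/(2\lambda_1+\lambda_2+\lambda_3) + 1/(2\lambda_1+2\lambda_2+\lambda_3)$]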

Figure 2.9 Comparison between basic series-parallel structures (nonrepairable up to system failure, active redundancy, independent elements, constant failure rates $\lambda_1$ to $\lambda_3$, $\lambda_1$ and $\lambda_2$ remain the same in both structures; equations according to Table 2.1; see Fig. 6.18 for the repairable case)


Example 2.6

Compute the predicted reliability for the following circuit, for which the required function asks that the LED must light when the control voltage $u_1$ is high. The environmental conditions correspond to $G_B$ in Table 2.3, with ambient temperature $\theta_A = 50°C$ inside the equipment and 30°C at the location of the LED; quality factor $\pi_Q = 1$ as per Table 2.4.

$u_1$ : 0.1 V and 4 V

$V_{CC}$ : 5 V

LED : 1 V at 20 mA, $I_{max}$ = 100 mA

$R_C$ : 150 Ω, 1/2 W, MF

$TR_1$ : Si, 0.3 W, 30 V, β > 100, plastic

$R_{B1}$ : 10 kΩ, 1/2 W, MF

Solution

The solution is based on the procedure given in Fig. 2.1.

1. The required function can be fulfilled since the transistor works as an electronic switch with $I_C$ = 20 mA and $I_B$ = 0.33 mA in the on state (saturated), and the off state is assured by $u_1$ = 0.1 V.

2. Since all elements are involved in the required function, the reliability block diagram consists of the series connection of the five items $E_1$ to $E_5$, where $E_5$ represents the printed circuit with soldering joints.

3. The stress factor of each element can be easily determined from the circuit and the given rated values. A stress factor 0.1 is assumed for all elements when the transistor is off. When the transistor is on, the stress factor is 0.2 for the diode and about 0.1 for all other elements. The ambient temperature is 30°C for the LED and 50°C for the remaining elements.

4. The failure rates of the individual elements are determined (approximately) with data from Section 2.2.4 (Example 2.4, Figs. 2.4 - 2.6, Tables 2.3 and 2.4 with $\pi_E = \pi_Q = 1$). Thus,

LED: $\lambda_1 = 1.3 \cdot 10^{-9}\,\mathrm{h}^{-1}$, Transistor: $\lambda_4 = 3 \cdot 10^{-9}\,\mathrm{h}^{-1}$, Resistor: $\lambda_2 = \lambda_3 = 0.3 \cdot 10^{-9}\,\mathrm{h}^{-1}$,

when the transistor is on. For the printed circuit board and soldering joints, $\lambda_5 = 2 \cdot 10^{-9}\,\mathrm{h}^{-1}$ is assumed. The above values for $\lambda$ remain practically unchanged when the transistor is off due to the low stress factors (the stress factor in the off state was set at 0.1).

5. Based on the results of Step 4, the reliability function of each element can be determined as $R_i(t) = e^{-\lambda_i t}$.

6. The reliability function $R_S(t)$ for the whole circuit can now be calculated. Equation (2.19) yields $R_S(t) = e^{-6.9 \cdot 10^{-9}\,t}$ (t in h). For 10 years of continuous operation, for example, the predicted reliability of the circuit is > 0.999.
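The arithmetic of steps 4 to 6 can be retraced with a few lines of code. This is a sketch under the series-model assumption of the example; the dictionary keys are labels chosen here, and one year is taken as 8760 h.

```python
from math import exp

# Failure rates in 1/h as given in step 4 (series model, all elements required).
failure_rates = {
    "LED (E1)": 1.3e-9,
    "resistor (E2)": 0.3e-9,
    "resistor (E3)": 0.3e-9,
    "transistor (E4)": 3.0e-9,
    "printed circuit and solder joints (E5)": 2.0e-9,
}

def series_failure_rate(rates):
    """For a series structure with constant failure rates, the rates add up."""
    return sum(rates.values())

def series_reliability(rates, t_hours):
    """R_S(t) = exp(-lambda_S * t) for independent elements in series."""
    return exp(-series_failure_rate(rates) * t_hours)

ten_years = 10 * 8760.0  # hours, assuming continuous operation
print(series_failure_rate(failure_rates))            # about 6.9e-9 per hour
print(series_reliability(failure_rates, ten_years))  # slightly above 0.999
```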

7. Supplementary result: To discuss this example further, let us assume that the failure rate of the transistor is too high (e.g. for safety reasons) and that no transistor of better quality can be obtained. Redundancy should be implemented for this element. Assuming as failure modes short between emitter and collector for transistors and open for resistors, the resulting circuit and the corresponding reliability block diagram are

[Circuit diagram with redundant transistor and corresponding reliability block diagram: $E_1$ to $E_5$ as in point 2; the added elements $E_6$ and $E_7$ correspond to $R_{B2}$ and $TR_2$]

Due to the very small stress factor, calculation of the individual element failure rates yields the same values as without redundancy. Thus, for the reliability function of the circuit one obtains (assuming independent elements)

from which it follows that

Circuit reliability is then practically no longer influenced by the transistor. This agrees with the discussion made with Fig. 2.7 for $\lambda t \ll 1$. If the failure mode of the transistors were an open between collector and emitter, both elements $E_4$ and $E_7$ would appear in series in the reliability block diagram; redundancy would be a disadvantage in this case. The intention to put $R_{B1}$ and $R_{B2}$ in parallel (redundancy) or to use just one base resistor is wrong; the functionality of the circuit would be compromised because of the saturation voltage of $TR_2$.

2.2.7 Part Count Method

In an early development phase, for logistic purposes, or in some particular applications, a rough estimate of the predicted reliability can be required. For such an analysis, it is generally assumed that the system under consideration is without redundancy (series structure as in Section 2.2.6.1) and the calculation of the failure rate at component level is made either using field data or by considering technology, environmental, and quality factors only. This procedure is known as part count method and differs basically from the part stress method introduced in Section 2.2.4. The advantage of a part count prediction is its great simplicity, but its usefulness is often limited to specific applications.


2.3 Reliability of Systems with Complex Structure

Complex structures arise in many applications, e.g. in power, telecommunications, defense, and aerospace systems. In the context of this book, a structure is complex when the reliability block diagram either cannot be reduced to a series-parallel structure with independent elements or does not exist. For instance, a reliability block diagram does not exist if more than two states (good/failed) or more than one failure mode (e.g. short or open) must be considered for an element. Moreover, the reduction of a reliability block diagram to a series-parallel structure with independent elements is in general not possible with distributed structures or when elements appear in the diagram more than once (cases 7, 8, 9 in Table 2.1). The term independent elements refers to independence up to the system failure, in particular without load sharing between redundant elements (load sharing is considered in Section 2.3.5 and Chapter 6). For comparative investigations in Chapter 6, the term totally independent elements will be used to indicate, for repairable systems, independence with respect to operation and repair (each element in the reliability block diagram operates and fails independently from every other element and has its own repair crew).

Analysis of complex structures can become difficult and time-consuming. However, methods are well developed, should the reliability block diagram exist and the system satisfy the following requirements:

1. Only active (parallel) redundancy is considered.

2. Elements can appear more than once in the reliability block diagram, but different elements are independent (totally independent for Eq. (2.48)).

3. On/off operations are either 100% reliable, or their effect has been considered in the reliability block diagram according to the above restrictions.

Under these assumptions, analysis can be performed using Boolean models. However, for practical applications, simple heuristically oriented methods apply well. Heuristic methods are given in Sections 2.3.1-2.3.3, Boolean models in Section 2.3.4.

Section 2.3.5 deals then with warm redundancy, allowing for load sharing. Section 2.3.6 considers elements with two failure modes. Stress/strength analyses are discussed in Section 2.5. Further aspects, as well as situations in which the reliability block diagram does not exist, are considered in Section 6.8 (see also Section 6.9 for an introduction to Petri nets, dynamic FT, and computer-aided analysis).

As in the previous sections, reliability figures have the indices SO, where S stands for System and 0 specifies System new at t = 0.

2.3.1 Key Item Method

The key item method is based on the theorem of total probability (Eq. (A6.17)). The event {the item operates failure free in the interval $(0, t]$}, or in a short form

{system up in $(0, t]$},

can be split into the following two complementary events

{Element $E_i$ up in $(0, t]$ ∩ system up in $(0, t]$} and

{Element $E_i$ fails in $(0, t]$ ∩ system up in $(0, t]$}.

From this it follows that, for the reliability function $R_{S0}(t)$,

$$R_{S0}(t) = R_i(t)\,\Pr\{\text{system up in } (0,t] \mid E_i \text{ up in } (0,t]\} + (1 - R_i(t))\,\Pr\{\text{system up in } (0,t] \mid E_i \text{ failed in } (0,t]\}, \qquad (2.29)$$

where $R_i(t) = \Pr\{E_i \text{ up in } (0, t]\}$ as in Eq. (2.16). The element $E_i$ must be chosen in such a way that a series-parallel structure is obtained for the reliability block diagrams conditioned by the events {$E_i$ up in $(0, t]$} and {$E_i$ failed in $(0, t]$}. Successive application of Eq. (2.29) is also possible (Examples 2.9 and 2.14). Sections 2.3.1.1 and 2.3.1.2 present two typical situations.

2.3.1.1 Bridge Structure

The reliability block diagram of a bridge structure with a bi-directional connection is shown in Fig. 2.10 (case 7 in Table 2.1). Element $E_5$ can work with respect to the required function in both directions, from $E_1$ via $E_5$ to $E_4$ and from $E_2$ via $E_5$ to $E_3$. It is therefore in a key position (key element). This property is used to calculate the reliability function by means of Eq. (2.29) with $E_i = E_5$. For the conditional probabilities in Eq. (2.29), the corresponding reliability block diagrams are

[Reliability block diagrams for the two conditions: $E_5$ did not fail in $(0, t]$; $E_5$ failed in $(0, t]$]

From Eq. (2.29), it follows that (with $R_S = R_{S0}(t)$, $R_i = R_i(t)$, $R_i(0) = 1$, $i = 1, \ldots, 5$)

$$R_S = R_5 (R_1 + R_2 - R_1 R_2)(R_3 + R_4 - R_3 R_4) + (1 - R_5)(R_1 R_3 + R_2 R_4 - R_1 R_2 R_3 R_4). \qquad (2.30)$$
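Equation (2.30) can be cross-checked against a direct enumeration of all element states. The sketch below conditions on $E_5$ as in the key item method and compares the result with a brute-force sum over the $2^5$ states, using the successful paths {1,3}, {2,4}, {1,5,4}, and {2,5,3} of the bridge; function names and the numerical reliabilities are illustrative.

```python
from itertools import product

def parallel(a, b):
    """Reliability of a 1-out-of-2 active redundancy of independent elements."""
    return a + b - a * b

def bridge_key_item(r1, r2, r3, r4, r5):
    """Eq. (2.30): condition on E5 up (two parallel pairs in series)
    and on E5 failed (two series paths in parallel)."""
    up = parallel(r1, r2) * parallel(r3, r4)
    failed = parallel(r1 * r3, r2 * r4)
    return r5 * up + (1 - r5) * failed

def bridge_enumeration(r):
    """Sum the probabilities of all states in which at least one path is intact.
    r is a sequence (R1, ..., R5); paths use 0-based element indices."""
    paths = [(0, 2), (1, 3), (0, 4, 3), (1, 4, 2)]
    total = 0.0
    for state in product((0, 1), repeat=5):
        if any(all(state[i] for i in p) for p in paths):
            prob = 1.0
            for s, ri in zip(state, r):
                prob *= ri if s else 1 - ri
            total += prob
    return total

r = (0.9, 0.8, 0.95, 0.85, 0.7)  # hypothetical element reliabilities
print(bridge_key_item(*r), bridge_enumeration(r))
```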

Figure 2.10 Reliability block diagram of a bridge circuit with a bi-directional connection on $E_5$


The same considerations apply to the bridge structure with a directed connection (case 8 in Table 2.1). Here, $E_i$ must be different from $E_5$. Choosing $E_i = E_4$ yields

The same result is obtained by choosing e.g. $E_i = E_1$.

Example 2.7 shows a further application of the key item method.

Example 2.7

Give the reliability of the item according to case a) below. How much would the reliability be improved if the structure were modified according to case b)? (Assumptions: nonrepairable up to system failure, active redundancy, independent elements, $R_{E_1}(t) = R_{E_{1'}}(t) = R_{E_{1''}}(t) = R_1(t)$ and $R_{E_2}(t) = R_{E_{2'}}(t) = R_2(t)$.)

[Reliability block diagrams: case a) and case b)]

Solution

Element $E_{1'}$ is in a key position in case a). Thus, similarly to Eq. (2.30), one obtains $R_a = R_1(2R_2 - R_2^2) + (1 - R_1)(2R_1R_2 - R_1^2R_2^2)$, with $R_a = R_{S0a}(t)$, $R_i = R_i(t)$, $R_i(0) = 1$, $i = 1, 2$. Case b) represents a series connection of a 1-out-of-3 redundancy with a 1-out-of-2 redundancy. From Sections 2.2.6.3 and 2.2.6.4 it follows that $R_b = R_1R_2(3 - 3R_1 + R_1^2)(2 - R_2)$, with $R_b = R_{S0b}(t)$, $R_i = R_i(t)$, $R_i(0) = 1$, $i = 1, 2$. From this,

$$R_b - R_a = 2R_1R_2(1 - R_2)(1 - R_1)^2. \qquad (2.32)$$

The difference $R_b - R_a$ reaches as maximum the value 2/27 for $R_1 = 1/3$ and $R_2 = 1/2$, i.e. $R_b = 57/108$ and $R_a = 49/108$ ($R_b - R_a = 0$ for $R_1 = 0$, $R_1 = 1$, $R_2 = 0$, $R_2 = 1$); the advantage of case b) is small, as far as reliability is concerned.
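The numbers quoted in Example 2.7 can be verified with exact rational arithmetic. The sketch below evaluates both expressions and searches a grid of rational values for the largest difference; the grid resolution of 1/60 is an arbitrary choice that happens to contain the points 1/3 and 1/2.

```python
from fractions import Fraction

def r_a(r1, r2):
    """Case a), key item method."""
    return r1 * (2 * r2 - r2**2) + (1 - r1) * (2 * r1 * r2 - r1**2 * r2**2)

def r_b(r1, r2):
    """Case b): 1-out-of-3 redundancy (R1) in series with 1-out-of-2 redundancy (R2)."""
    return r1 * r2 * (3 - 3 * r1 + r1**2) * (2 - r2)

def difference(r1, r2):
    """Closed form of Eq. (2.32)."""
    return 2 * r1 * r2 * (1 - r2) * (1 - r1) ** 2

r1, r2 = Fraction(1, 3), Fraction(1, 2)
print(r_a(r1, r2), r_b(r1, r2), r_b(r1, r2) - r_a(r1, r2))  # 49/108, 19/36 (= 57/108), 2/27

# Exact grid search for the maximum of R_b - R_a on multiples of 1/60.
best = max(
    (r_b(Fraction(i, 60), Fraction(j, 60)) - r_a(Fraction(i, 60), Fraction(j, 60)), i, j)
    for i in range(61)
    for j in range(61)
)
print(best)  # maximum 2/27 at R1 = 20/60, R2 = 30/60
```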

2.3.1.2 Reliability Block Diagram in Which at Least One Element Appears More than Once

In practice, situations often occur in which an element appears more than once in the reliability block diagram, although, physically, there is only one such element in the system considered. These situations can be investigated with the key item method introduced in Section 2.3.1.1, see Examples 2.8, 2.9, and 2.14.


Example 2.8

Give the reliability for the equipment introduced in Example 2.2.

Solution

In the reliability block diagram of Example 2.2, element $E_2$ is in a key position. Similarly to Eq. (2.30) it follows that

$$R_S = R_2 R_1 (R_4 + R_5 - R_4 R_5) + (1 - R_2) R_1 R_3 R_5, \qquad (2.33)$$

with $R_S = R_{S0}(t)$ and $R_i = R_i(t)$, $R_i(0) = 1$, $i = 1, \ldots, 5$.

Example 2.9

Give the reliability for the redundant circuit of Example 2.3.

Solution

In the reliability block diagram of Example 2.3, $U_1$ and $U_2$ are in a key position. Using the method introduced in Section 2.3.1 successively on $U_1$ and $U_2$, i.e. on $E_5$ and $E_6$, yields

With $R_1 = R_2 = R_3 = R_4 = R_D$, $R_5 = R_6 = R_U$, $R_7 = R_8 = R_I$, $R_9 = R_{II}$, it follows that

Rs = Ru R„[Ru(2RD R,-R; R ? ) ( ~ R , - ~ : ) + 2 ( 1 - R , ) R ~ R,], (2.34)

with $R_S = R_{S0}(t)$, $R_U = R_U(t)$, $R_D = R_D(t)$, $R_I = R_I(t)$, $R_{II} = R_{II}(t)$, $R_i(0) = 1$ ($i = 1, \ldots, 9$).

2.3.2 Successful Path Method

In this and in the next section, two general (closely related) methods are introduced. For simplicity, considerations will be based on the reliability block diagram given in Fig. 2.11. As in Section 2.2.6.1, $e_i$ stands for the event

{element $E_i$ up in the interval $(0, t]$},

hence $\Pr\{e_i\} = R_i(t)$, as in Eq. (2.16), and $\Pr\{\bar{e}_i\} = 1 - R_i(t)$. The successful path method is based on the following concept:

The system fulfills its required function if there is at least one path between the input and the output upon which all elements perform their required function.

Paths must lead from left to right and may not contain any loops. Only the given direction is possible along a directed connection. The following successful paths exist in the reliability block diagram of Fig. 2.11


Figure 2.11 Reliability block diagram of a complex structure (elements $E_3$ and $E_4$ each appear twice in the RBD, the directed connection has reliability 1)

Consequently it follows that

Using the addition theorem of probability theory (Eq. (A6.15)), Eq. (2.35) leads to

with $R_S = R_{S0}(t)$ and $R_i = R_i(t)$, $R_i(0) = 1$, $i = 1, \ldots, 5$.
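The step from the union of successful paths to the reliability function via the addition theorem can be sketched as an inclusion-exclusion sum. The path list below is an assumption: it is read off from the system function quoted for Fig. 2.11 in Section 2.3.4, namely {1,3,4}, {1,3,5}, {1,4,5}, {2,3,5}, {2,4,5}. Because an element appearing in several paths is physically the same element, the probability of an intersection of paths is the product of $R_i$ over the union of the elements involved.

```python
from itertools import combinations
from math import prod

# Successful paths assumed for Fig. 2.11 (taken from the system function in Section 2.3.4).
PATHS = [{1, 3, 4}, {1, 3, 5}, {1, 4, 5}, {2, 3, 5}, {2, 4, 5}]

def reliability_inclusion_exclusion(paths, r):
    """Pr{union of path events} by the addition theorem.
    r maps element index -> reliability; elements are assumed independent."""
    total = 0.0
    for m in range(1, len(paths) + 1):
        sign = (-1) ** (m + 1)
        for subset in combinations(paths, m):
            elements = set().union(*subset)       # each element counted once
            total += sign * prod(r[i] for i in elements)
    return total

r = {1: 0.9, 2: 0.9, 3: 0.9, 4: 0.9, 5: 0.9}  # hypothetical element reliabilities
print(reliability_inclusion_exclusion(PATHS, r))
```

With five paths the sum has $2^5 - 1 = 31$ terms, many of which cancel; the state space method of the next section avoids this by working with mutually exclusive events.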

2.3.3 State Space Method

This method is based on the following concept:

Every element $E_i$ is assigned an indicator $\zeta_i(t)$ with the following property: $\zeta_i(t) = 1$ as long as $E_i$ does not fail, and $\zeta_i(t) = 0$ if $E_i$ has failed. The vector with components $\zeta_i(t)$ determines the system state at time t. Since each element in the interval $(0, t]$ functions or fails independently of the others, $2^n$ states are possible for an item with n elements. After listing the $2^n$ possible states at time t, all those states are determined in which the system performs the required function. The probability that the system is in one of these states is the reliability function $R_{S0}(t)$ of the system considered.

The $2^n$ possible conditions at time t for the reliability block diagram of Fig. 2.11 are


A "1" in this table means that the element or item considered has not failed in $(0, t]$ (see footnote on p. 58 for fault tree analysis). For Fig. 2.11, the event

{system up in the interval $(0, t]$}

is equivalent to the event

After appropriate simplification, this reduces to

from which

Evaluation of Eq. (2.37) leads to Eq. (2.36). In contrast to the successful path method, all events in the state space method (columns in the state space table and terms in Eq. (2.37)) are mutually exclusive.
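The state space method lends itself to a direct enumeration. The sketch below lists all $2^5$ states for Fig. 2.11, keeps those in which the system is up, and adds their probabilities; since the states are mutually exclusive, no correction terms are needed. As an assumption, the up/down decision uses the path list read off from the system function in Section 2.3.4.

```python
from itertools import product

# Successful paths assumed for Fig. 2.11 (see the system function in Section 2.3.4).
PATHS = [(1, 3, 4), (1, 3, 5), (1, 4, 5), (2, 3, 5), (2, 4, 5)]

def system_up(state, paths=PATHS):
    """state maps element index -> 1 (up) or 0 (failed)."""
    return any(all(state[i] for i in p) for p in paths)

def state_space_reliability(r, paths=PATHS):
    """Return (reliability, list of up states); r maps element index -> reliability."""
    elements = sorted(r)
    total = 0.0
    up_states = []
    for bits in product((1, 0), repeat=len(elements)):
        state = dict(zip(elements, bits))
        if system_up(state, paths):
            prob = 1.0
            for i in elements:
                prob *= r[i] if state[i] else 1.0 - r[i]
            total += prob            # states are mutually exclusive, so probabilities add
            up_states.append(bits)
    return total, up_states

r = {1: 0.9, 2: 0.9, 3: 0.9, 4: 0.9, 5: 0.9}  # hypothetical element reliabilities
rel, ups = state_space_reliability(r)
print(rel, len(ups))
```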

2.3.4 Boolean Function Method

The Boolean function method generalizes and formalizes the methods introduced in Sections 2.3.2 & 2.3.3. For this analysis, besides the 3 assumptions given on p. 52, it is assumed that the item (system) considered is coherent, that is (basically)

1. The state of the system depends on the states of all of its elements; in particular, the system is up for all elements up and down for all elements down.

2. If the system is down, no additional failure of any element can bring it into an up state (monotony).

In the case of repairable systems, the second property must be extended to: If the system is up, it remains up if any element is repaired. Almost all systems in practical applications are coherent. In the following, up is used for system in operating state and down for system in a failed state (in repair if repairable).

The states of a coherent system can be described by a system function (structure function). A system function $\phi$ is a Boolean function +)

$$\phi(\zeta_1, \ldots, \zeta_n) = \begin{cases} 1 & \text{for item up} \\ 0 & \text{for item down} \end{cases}$$

of the indicators $\zeta_i = \zeta_i(t)$, defined in Section 2.3.3 ($\zeta_i = 1$ if element $E_i$ is up and $\zeta_i = 0$ if element $E_i$ is down), for which the following applies (coherent system):

1. $\phi$ depends on all the variables $\zeta_i$ ($i = 1, \ldots, n$).

2. $\phi$ is nondecreasing in all variables, $\phi = 0$ for all $\zeta_i = 0$ and $\phi = 1$ for all $\zeta_i = 1$.

Since the indicators $\zeta_i$ and the system function $\phi$ take on only the values 0 and 1,

$$R_i(t) = \Pr\{\zeta_i(t) = 1\} = E[\zeta_i(t)]$$

applies for the reliability function of element $E_i$, and

$$R_{S0}(t) = \Pr\{\phi(\zeta_1(t), \ldots, \zeta_n(t)) = 1\} = E[\phi(\zeta_1(t), \ldots, \zeta_n(t))]$$

applies for the reliability function of the system (calculation of $E[\phi]$ is in general easier than calculation of $\Pr\{\phi = 1\}$).

The Boolean function method thus transfers the problem of calculating $R_{S0}(t)$ to that of the determination of the system function $\phi(\zeta_1, \ldots, \zeta_n)$. Two methods with a great intuitive appeal are available for this purpose:

1. Minimal Path Set approach: A set $P_i$ of elements is a minimal path set if the system is up when $\zeta_j = 1$ for all $E_j \in P_i$ and $\zeta_k = 0$ for all $E_k \notin P_i$, but this does not apply for any subset of $P_i$ (for the bridge in Fig. 2.10, {1,3}, {2,4}, {1,5,4}, and {2,5,3} are the minimal path sets). The elements $E_j$ within $P_i$ form a series model with system function

$$\phi_{P_i} = \prod_{E_j \in P_i} \zeta_j.$$

If for a given system there are r minimal path sets, these form an active 1-out-of-r redundancy; i.e.,

$$\phi(\zeta_1, \ldots, \zeta_n) = 1 - \prod_{i=1}^{r} (1 - \phi_{P_i}) = 1 - \prod_{i=1}^{r} \Big(1 - \prod_{E_j \in P_i} \zeta_j\Big). \qquad (2.42)$$

+) In fault tree analysis (FTA), "0" for operation (up) and "1" for failure (down) is often used [A2.6 (IEC 61025)].


2. Minimal Cut Set approach: A set $C_i$ is a minimal cut set if the system is down when $\zeta_j = 0$ for all $E_j \in C_i$ and $\zeta_k = 1$ for all $E_k \notin C_i$, but this does not apply for any subset of $C_i$ (for the bridge in Fig. 2.10, {1,2}, {3,4}, {1,5,4}, and {2,5,3} are the minimal cut sets). The elements $E_j$ within $C_i$ form a parallel model (active redundancy with k = 1) with system function

$$\phi_{C_i} = 1 - \prod_{E_j \in C_i} (1 - \zeta_j).$$

If for a given system there are m minimal cut sets, these form a series model; i.e.,

$$\phi(\zeta_1, \ldots, \zeta_n) = \prod_{i=1}^{m} \phi_{C_i} = \prod_{i=1}^{m} \Big(1 - \prod_{E_j \in C_i} (1 - \zeta_j)\Big). \qquad (2.44)$$

A series model with elements $E_1, \ldots, E_n$ has one path set and n cut sets, a parallel model (1-out-of-n) has one cut set and n path sets. Algorithms for finding all minimal path sets and all minimal cut sets are known, see e.g. [2.33 (1975)]. From Eqs. (2.40), (2.42), and (2.44) it holds that

with $R_{S0}(0) = 1$. For practical applications, the following bounds for the reliability function $R_{S0}(t)$ can be used [2.36]

If the minimal path sets are mutually exclusive, the right-hand inequality of Eq. (2.46) becomes an equality; the same holds for the minimal cut sets (left-hand inequality).
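The relation between minimal path sets, minimal cut sets, and bounds on the reliability can be sketched for the bridge of Fig. 2.10. The code derives the minimal cut sets from the minimal path sets by a brute-force search (a cut set must contain at least one element of every path) and then evaluates a cut-based lower bound and a path-based upper bound. As an assumption, the bounds are taken in the commonly used product form, i.e. the product over cut sets of $1 - \prod (1 - R_j)$ and one minus the product over path sets of $1 - \prod R_j$; the exact form of Eq. (2.46) is not reproduced here.

```python
from itertools import combinations, product
from math import prod

BRIDGE_PATHS = [{1, 3}, {2, 4}, {1, 5, 4}, {2, 5, 3}]  # minimal path sets of Fig. 2.10

def minimal_cut_sets(path_sets):
    """Smallest element sets that intersect every path set (brute force by size)."""
    elements = sorted(set().union(*path_sets))
    cuts = []
    for size in range(1, len(elements) + 1):
        for cand in combinations(elements, size):
            c = set(cand)
            hits_all_paths = all(c & p for p in path_sets)
            if hits_all_paths and not any(k <= c for k in cuts):
                cuts.append(c)
    return cuts

def exact_reliability(path_sets, r):
    """State enumeration; r maps element index -> reliability."""
    elements = sorted(r)
    total = 0.0
    for bits in product((1, 0), repeat=len(elements)):
        up = {e for e, b in zip(elements, bits) if b}
        if any(p <= up for p in path_sets):
            total += prod(r[e] if e in up else 1 - r[e] for e in elements)
    return total

def bounds(path_sets, cut_sets, r):
    lower = prod(1 - prod(1 - r[j] for j in c) for c in cut_sets)
    upper = 1 - prod(1 - prod(r[j] for j in p) for p in path_sets)
    return lower, upper

r = {i: 0.9 for i in range(1, 6)}  # hypothetical element reliabilities
cuts = minimal_cut_sets(BRIDGE_PATHS)
print(cuts)  # the four minimal cut sets {1,2}, {3,4}, {1,4,5}, {2,3,5}
print(bounds(BRIDGE_PATHS, cuts, r), exact_reliability(BRIDGE_PATHS, r))
```

The bounds are not equalities here because the path sets (and the cut sets) of the bridge share elements.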

The paths given with Eq. (2.35) are the minimal path sets for the reliability block diagram of Fig. 2.11. Using Eq. (2.42) this leads to the system function

$$\phi(\zeta_1, \ldots, \zeta_5) = 1 - (1 - \zeta_1\zeta_3\zeta_4)(1 - \zeta_1\zeta_3\zeta_5)(1 - \zeta_1\zeta_4\zeta_5)(1 - \zeta_2\zeta_3\zeta_5)(1 - \zeta_2\zeta_4\zeta_5)$$

and then to Eq. (2.36). Investigation of the block diagram of Fig. 2.11 by the method of minimal cut sets is more laborious. Obviously, minimal path sets and minimal cut sets deliver the same system function $\phi(\zeta_1, \ldots, \zeta_n)$, with different effort depending on the structure of the reliability block diagram considered (structures with many series elements can be treated easily with minimal path sets, see Example 2.10).

Example 2.10

Give the system function according to the minimal path set and the minimal cut set approach for the following reliability block diagram, and calculate the reliability function assuming independent elements.

Solution

For the above reliability block diagram, there exist 2 minimal path sets $P_1$, $P_2$ and 4 minimal cut sets $C_1, \ldots, C_4$, as given below.

The system function follows then from Eq. (2.42) for the minimal path sets

or from Eq. (2.44) for the minimal cut sets (in both cases by considering $\zeta_i \zeta_i = \zeta_i$)

Assuming independence for the (different) elements, it follows for the reliability function (for both cases and with $R_S = R_{S0}(t)$ & $R_i = R_i(t)$, $R_i(0) = 1$, $i = 1, \ldots, 5$)

$$R_S = R_1R_2R_5 + R_2R_3R_4R_5 - R_1R_2R_3R_4R_5.$$

Supplementary results: Calculation with the key item method leads directly to $R_S = R_2(R_1 + R_3R_4 - R_1R_3R_4)R_5 + (1 - R_2) \cdot 0 = R_2(R_1 + R_3R_4 - R_1R_3R_4)R_5$.
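The expansion of the system function with the rule $\zeta_i \zeta_i = \zeta_i$ can be mechanized. In the sketch below a polynomial in the indicators is stored as a dictionary that maps a set of element indices (one monomial) to its integer coefficient; multiplying two monomials takes the union of their index sets, which is exactly the idempotency rule. The path sets {1,2,5}, {2,3,4,5} and cut sets {2}, {5}, {1,3}, {1,4} used here are an assumption inferred from the reliability function of Example 2.10, since the block diagram itself is not reproduced.

```python
from collections import defaultdict

ONE = frozenset()  # the empty monomial, i.e. the constant 1

def multiply(p, q):
    """Product of two multilinear polynomials with zeta_i * zeta_i = zeta_i."""
    out = defaultdict(int)
    for a, ca in p.items():
        for b, cb in q.items():
            out[a | b] += ca * cb          # union of index sets = idempotency
    return {m: c for m, c in out.items() if c != 0}

def one_minus(p):
    """Return the polynomial 1 - p."""
    out = {m: -c for m, c in p.items()}
    out[ONE] = out.get(ONE, 0) + 1
    return {m: c for m, c in out.items() if c != 0}

def system_function_from_paths(paths):
    """phi = 1 - prod_i (1 - prod_{j in P_i} zeta_j), as in Eq. (2.42)."""
    acc = {ONE: 1}
    for p in paths:
        acc = multiply(acc, one_minus({frozenset(p): 1}))
    return one_minus(acc)

def system_function_from_cuts(cuts):
    """phi = prod_i (1 - prod_{j in C_i} (1 - zeta_j)), as in Eq. (2.44)."""
    phi = {ONE: 1}
    for c in cuts:
        all_failed = {ONE: 1}
        for j in c:
            all_failed = multiply(all_failed, one_minus({frozenset({j}): 1}))
        phi = multiply(phi, one_minus(all_failed))
    return phi

paths = [{1, 2, 5}, {2, 3, 4, 5}]       # assumed minimal path sets
cuts = [{2}, {5}, {1, 3}, {1, 4}]       # assumed minimal cut sets
phi_p = system_function_from_paths(paths)
phi_c = system_function_from_cuts(cuts)
for monomial, coeff in sorted(phi_p.items(), key=lambda mc: sorted(mc[0])):
    print(coeff, sorted(monomial))
print(phi_p == phi_c)
```

Substituting $R_i$ for each $\zeta_i$ in the resulting monomials gives the reliability function for independent elements.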

For items (systems) with independent, nonrepairable elements (up to system failure), the reliability function $R_{S0}(t) = E[\phi(\zeta_1, \ldots, \zeta_n)]$ can generally be obtained directly from the system function by considering in Eqs. (2.42) and (2.44) the idempotency property ($\zeta_i \zeta_i = \zeta_i$) and substituting $R_i(t)$ for $\zeta_i$ (Eq. (A6.69)). A further possibility is to use the disjunctive normal form $\phi_D(\zeta_1, \ldots, \zeta_n)$ or its equivalent linear form $\phi_L(\zeta_1, \ldots, \zeta_n)$ of the system function $\phi(\zeta_1, \ldots, \zeta_n)$, yielding, for coherent systems with independent elements (and $R_i = R_i(t)$, $R_i(0) = 1$, $i = 1, \ldots, n$)


For coherent repairable systems with totally independent elements (every element works and is repaired independently from every other element and disposes of its own repair crew), Eq. (2.40) or Eq. (2.47) can be used to calculate the point availability $PA_{S0}(t)$, yielding for the case of Eq. (2.47)

with $PA_i = PA_{i0}(t)$ for the general case (Eq. (6.17)) or $PA_i = MTTF_i / (MTTF_i + MTTR_i)$ for steady-state or $t \to \infty$ (Eq. (6.48)). However, in practical applications, a repair crew for each element in the reliability block diagram of a system is rarely available. Nevertheless, Eq. (2.48) can be used as an approximation (upper bound) for $PA_{S0}(t)$. For repairable elements, $\zeta_i(t)$ given by Eq. (2.38) is defined as $\zeta_i(t) = 1$ for element $E_i$ operating (up) and $\zeta_i(t) = 0$ for $E_i$ in repair (down), yielding $E[\zeta_i(t)] = PA_{i0}(t)$. In practical applications, computation is often easily performed for the unavailability $1 - PA_{S0}(t)$.

2.3.5 Parallel Models with Constant Failure Rates and Load Sharing

In the redundancy structures investigated in the previous sections, all elements were operating under the same conditions. For this type of redundancy, called active (parallel) redundancy, the assumed statistical independence of the elements implies in particular that there is no load sharing. This assumption does not hold in many practical applications, for example, at component level or in the presence of power elements. The investigation of the reliability function in the case of load sharing or of other kinds of dependency involves the use of stochastic processes. The situation is simple if one can assume that the failure rate of each element changes only when a failure occurs. In this case, the general model for a k-out-of-n redundancy is a death process as given in Fig. 2.12 (birth and death process as in Fig. 6.13 for the repairable case with constant failure & repair rates). $Z_0, \ldots, Z_{n-k+1}$ are the states of the process. In state $Z_i$, i elements are down. At state $Z_{n-k+1}$ the system is down.

Figure 2.12 Diagram of the transition probabilities in $(t, t + \delta t]$ for a k-out-of-n redundancy (nonrepairable, constant failure rates during the sojourn time in each state (not necessarily at a state change, e.g. because of load sharing), t arbitrary, $\delta t \downarrow 0$, Markov process)

Assuming

$\lambda$ = failure rate of an element in the operating state (2.49)

and

$\lambda_r$ = failure rate of an element in the reserve state ($\lambda_r \le \lambda$), (2.50)

the model of Fig. 2.12 considers in particular the following cases:

1. Active redundancy without load sharing (independent elements)

$$\nu_i = (n - i)\,\lambda, \quad i = 0, \ldots, n - k, \qquad (2.51)$$

$\lambda$ is the same for all states.

2. Active redundancy with load sharing ($\lambda = \lambda(i)$)

$$\nu_i = (n - i)\,\lambda(i), \quad i = 0, \ldots, n - k, \qquad (2.52)$$

$\lambda(i)$ increases at each state change.

3. Warm (lightly loaded) redundancy ($\lambda_r < \lambda$)

$$\nu_i = k\lambda + (n - k - i)\,\lambda_r, \quad i = 0, \ldots, n - k, \qquad (2.53)$$

$\lambda$ and $\lambda_r$ are the same for all states.

4. Standby (cold) redundancy ($\lambda_r \equiv 0$)

$$\nu_i = k\lambda, \quad i = 0, \ldots, n - k, \qquad (2.54)$$

$\lambda$ is the same for all states.

For a standby redundancy, it is assumed that the failure rate in the reserve state is $\lambda_r \equiv 0$ (the reserve elements are switched on when needed). Warm redundancy is somewhere between active and standby ($0 < \lambda_r < \lambda$). It should be noted that the k-out-of-n active, warm, or standby redundancies are only the simplest representatives of the general concept of redundancy. Series-parallel structures, voting techniques, bridges, and more complex structures are frequently used (see Sections 2.2.6, 2.3.1 - 2.3.4, and 6.6 - 6.8 with repair rate $\mu = 0$, for some examples). Furthermore, redundancy can also appear in other forms, e.g. at software level, and the benefit of redundancy can be limited by the involved failure modes as well as by control and switching elements (see Section 6.8 for some examples). For the analysis of the model shown in Fig. 2.12, let

Pi(t) = Pr{ the process is in state Zi at time t } (2.55)

be the state probabilities ( i = 0, . . ., n - k + 1). Pi(t) is obtained by considering the process at two adjacent time points t and t + 6 t and by making use of the memoryless property resulting from the constant failure rate assumed between consecutive state changes (Appendix A7.5). The function Pi(t) thus satisfies the following differente equation


$P_i(t + \delta t) = P_i(t)\,(1 - \nu_i\,\delta t) + P_{i-1}(t)\,\nu_{i-1}\,\delta t$, i = 1, ..., n - k. (2.56)

For $\delta t \to 0$, there follows then a system of differential equations describing the death process.
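Written out, the limit of Eq. (2.56) can be sketched as follows. This is a reconstruction from Eq. (2.56) and the transition diagram of Fig. 2.12, not a quotation: the equations for the boundary states $Z_0$ and $Z_{n-k+1}$ are assumptions that follow from $Z_0$ having no incoming and $Z_{n-k+1}$ no outgoing transition.

```latex
\begin{align*}
\dot P_0(t)       &= -\nu_0\, P_0(t), \\
\dot P_i(t)       &= -\nu_i\, P_i(t) + \nu_{i-1}\, P_{i-1}(t), \qquad i = 1, \ldots, n-k, \\
\dot P_{n-k+1}(t) &= \nu_{n-k}\, P_{n-k}(t).
\end{align*}
```

The middle line is obtained by subtracting $P_i(t)$ on both sides of Eq. (2.56), dividing by $\delta t$, and letting $\delta t \to 0$.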

Assuming the initial conditions $P_i(0) = 1$ and $P_j(0) = 0$ for $j \ne i$ at t = 0, the solution (generally obtained using the Laplace transform) leads to $P_i(t)$, i = 0, ..., n - k + 1. Knowing $P_i(t)$, one can evaluate the reliability function

$R_S(t) = 1 - P_{n-k+1}(t)$

and the mean time to failure from Eq. (2.9). Assuming for instance $P_0(0) = 1$ as initial condition, one obtains for the Laplace transform of $R_{S0}(t)$,

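$\tilde R_{S0}(s) = \int_0^\infty R_{S0}(t)\, e^{-st}\, dt$,

the expression

$\tilde R_{S0}(s) = \dfrac{(s + \nu_0) \cdots (s + \nu_{n-k}) - \nu_0 \cdots \nu_{n-k}}{s\,(s + \nu_0) \cdots (s + \nu_{n-k})}$.

The mean time to failure follows then from

$MTTF_{S0} = \tilde R_{S0}(0)$

and leads to

$MTTF_{S0} = \sum_{i=0}^{n-k} \dfrac{1}{\nu_i}$. (2.62)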

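Thereby, S stands for system and 0 specifies the initial condition $P_0(0) = 1$ (Table 6.2). For a k-out-of-n standby redundancy (Eq. (2.54)), it follows that

$R_{S0}(t) = \sum_{i=0}^{n-k} \dfrac{(k\lambda t)^i}{i!}\, e^{-k\lambda t}$ (2.63)

and

$MTTF_{S0} = \dfrac{n - k + 1}{k\lambda}$. (2.64)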

Equation (2.63) shows the relation existing between the Poisson distribution and the occurrence of exponentially distributed events.


For the case of a k-out-of-n active redundancy without load sharing, it follows from Eqs. (2.62) and (2.51) that

$MTTF_{S0} = \dfrac{1}{\lambda} \sum_{i=k}^{n} \dfrac{1}{i}$,

see also Table 6.8 with $\mu = 0$ and $\lambda_r = \lambda$. Some examples for $R_{S0}(t)$ with different values for n and k are given in Fig. 2.7.
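The four cases of Fig. 2.12 and the mean time to failure of Eq. (2.62) can be evaluated numerically. The following is a minimal sketch, not part of the original text: the function names `death_rates` and `mttf`, the keyword values, and the numerical failure rates are illustrative assumptions.

```python
def death_rates(n, k, lam, kind="active", lam_r=0.0, lam_of_state=None):
    """Transition rates nu_0, ..., nu_{n-k} of the death process of Fig. 2.12."""
    states = range(n - k + 1)
    if kind == "active":        # Eq. (2.51): nu_i = (n - i) * lam
        return [(n - i) * lam for i in states]
    if kind == "load_sharing":  # Eq. (2.52): lam_of_state(i) is the element rate in state Z_i
        return [(n - i) * lam_of_state(i) for i in states]
    if kind == "warm":          # Eq. (2.53): nu_i = k * lam + (n - k - i) * lam_r
        return [k * lam + (n - k - i) * lam_r for i in states]
    if kind == "standby":       # Eq. (2.54): nu_i = k * lam
        return [k * lam for _ in states]
    raise ValueError(f"unknown redundancy kind: {kind!r}")


def mttf(rates):
    """Eq. (2.62): sum of the mean sojourn times 1 / nu_i."""
    return sum(1.0 / nu for nu in rates)


lam = 1e-3  # hypothetical failure rate per hour (assumption for illustration)
for kind in ("active", "warm", "standby"):
    rates = death_rates(3, 2, lam, kind=kind, lam_r=0.5e-3)
    print(kind, round(mttf(rates), 1))
```

For this hypothetical 2-out-of-3 redundancy the mean time to failure grows from about 833 h (active) over 900 h (warm) to 1000 h (standby), the last value agreeing with Eq. (2.64).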

2.3.6 Elements with more than one Failure Mechanism or one Failure Mode

In the previous sections, it was assumed that each element exhibits only one dominant failure mechanism, causing one dominant failure mode; for example intermetallic compound causing a short or corrosion causing an open for integrated circuits. However, in practical applications, components can have several failure mechanisms and fail in different manners (see e.g. Table 3.4). A simple way to consider more than one failure mechanism is to assume that each failure mechanism is independent of the others and causes a failure at item level. In this case, a series model can be used by assigning a failure rate to each failure mechanism, and Eq. (2.18) or Eq. (7.57) delivers the total failure rate of the item considered. More sophisticated models are possible. A mixture of failure rates and/or mechanisms has been discussed in Section 2.2.5 (Eq. (2.15)). This section will consider as an example the case of a diode exhibiting two failure modes. Let

$R(t) = \Pr\{\text{no failure in } (0, t] \mid \text{diode new at } t = 0\}$

$\bar R(t) = 1 - R(t) = \Pr\{\text{failure in } (0, t] \mid \text{diode new at } t = 0\}$

$\bar R_U(t) = \Pr\{\text{open in } (0, t] \mid \text{diode new at } t = 0\}$

$\bar R_K(t) = \Pr\{\text{short in } (0, t] \mid \text{diode new at } t = 0\}$.

Obviously (Example 2.11)

$1 - R = \bar R = \bar R_U + \bar R_K$.

The series connection of two diodes exhibits a circuit failure if either one open or two shorts occur. From this,

$\bar R_S = 1 - (1 - \bar R_U)^2 + \bar R_K^2 = 2\bar R_U - \bar R_U^2 + \bar R_K^2$, (2.67)

with $\bar R_S = \bar R_{S0}(t)$, $\bar R_K = \bar R_K(t)$, $\bar R_U = \bar R_U(t)$.


Example 2.11 In an accelerated test of 1000 diodes, 100 failures occur, of which 30 are opens and 70 shorts. Give an estimate for $\bar R$, $\bar R_U$, and $\bar R_K$.

Solution The maximum likelihood estimate of an unknown probability p is, according to Eq. (A8.29),

$\hat p = k/n$. Hence, $\hat{\bar R} = 0.1$, $\hat{\bar R}_U = 0.03$, and $\hat{\bar R}_K = 0.07$.

Similarly, for two diodes in parallel (Example 2.12),

$\bar R_S = 2\bar R_K - \bar R_K^2 + \bar R_U^2$.

To be simultaneously protected against at least one failure of arbitrary mode (short or open), a quad redundancy is necessary. Depending upon whether opens or shorts are more frequent, a quad redundancy with or without a bridge connection is used. For both these cases it follows that

and

Equations (2.67) to (2.70) can be obtained using the state space method introduced in Section 2.3.3, however with three states for every element (good, open (U), and short (K)), leading to a state space with $3^n$ elements, see Example 2.12.

Example 2.12 Using the state space method, give the reliability of two parallel connected diodes, assuming that opens and shorts are possible.

Solution Considering the three possible states (good (1), open (U), and short (K)), the state space for two parallel connected diodes is

D1   1  1  1  U  U  U  K  K  K
D2   1  U  K  1  U  K  1  U  K
S    1  1  0  1  0  0  0  0  0

From the above table, it follows that

$\bar R_S = \Pr\{S = 0\} = 2 R \bar R_K + \bar R_U^2 + 2 \bar R_U \bar R_K + \bar R_K^2$
$\quad = 2(1 - \bar R_U - \bar R_K)\bar R_K + \bar R_U^2 + 2 \bar R_U \bar R_K + \bar R_K^2 = 2\bar R_K - \bar R_K^2 + \bar R_U^2$.

The linear superposition of the two failure modes, appearing in the final result for $\bar R_S$, does not necessarily apply to arbitrary structures.
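The state space method of Example 2.12 lends itself to a brute-force check. The sketch below enumerates all $3^n$ state vectors for two diodes and sums the probabilities of the failed ones; the function names, the label "G" for the good state (written 1 in the table), and the numerical values (taken from the estimates of Example 2.11) are choices made for illustration only.

```python
from itertools import product


def failure_probability(n, fails, p_open, p_short):
    """Sum the probabilities of all 3**n state vectors for which fails(states) is True."""
    p = {"G": 1.0 - p_open - p_short, "U": p_open, "K": p_short}
    total = 0.0
    for states in product("GUK", repeat=n):
        if fails(states):
            prob = 1.0
            for s in states:
                prob *= p[s]  # independent elements assumed
            total += prob
    return total


def parallel_fails(states):
    # Parallel diodes: any short shorts the circuit; otherwise all must be open.
    return "K" in states or all(s == "U" for s in states)


def series_fails(states):
    # Series diodes: any open opens the circuit; otherwise all must be short.
    return "U" in states or all(s == "K" for s in states)


r_u, r_k = 0.03, 0.07  # open and short probabilities as estimated in Example 2.11
print(failure_probability(2, parallel_fails, r_u, r_k), 2 * r_k - r_k**2 + r_u**2)
print(failure_probability(2, series_fails, r_u, r_k), 2 * r_u - r_u**2 + r_k**2)
```

Both enumerated values agree with the closed-form expressions, the series case corresponding to Eq. (2.67). The same enumeration extends to n > 2 by supplying a different `fails` predicate.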


2.3.7 Basic Considerations on Fault Tolerant Structures

In applications with high reliability, availability or safety requirements, equipment and systems must be designed to be fault tolerant. This means that without external help (autonomously) the item considered should be able to recognize a fault (failure or defect) and quickly reconfigure itself in such a way as to remain safe and possibly continue to operate with minimal performance loss (fail-safe, graceful degradation).

Methods to investigate fault tolerant items have been introduced in Sections 2.2.6.2 through 2.3.6, in particular Sections 2.2.6.5 (majority redundancy) and 2.3.6 (quad redundancy). The latter is one of the few structures which can support at least one failure of any mode; the price paid is four devices instead of one. Other possibilities are known to implement fault tolerance at component level, e.g. [2.39].

Repairable fault tolerant systems are considered carefully in Chapter 6, in particular in Section 6.8 for nonideal reconfiguration (incomplete coverage, imperfect switching), phased-mission systems, common cause failures, and reward & frequency/duration aspects. It is shown that the stochastic processes introduced in Appendix A7 can be used to investigate reliability and availability of fault tolerant systems also for cases in which a reliability block diagram does not exist.

To avoid common cause or single-point failures, redundant elements should be designed and produced independently from each other, in critical cases with different technology, tools, and personnel. Investigation of all possible failure (fault) modes during the design of fault tolerant equipment or systems is mandatory. This is generally done using failure modes and effects analysis (FMEA/FMECA), fault tree analysis (FTA), causes-to-effects diagrams or similar tools (Section 2.6), supported by appropriate investigation models (see e.g. Examples 6.14 and 6.16). Failure mode analysis is essential where redundancy appears, among others to identify the parts which are in series to the ideal redundancy (in the reliability block diagram), to discover interactions between elements of the given item, and to find appropriate measures to avoid failure propagation (secondary failures).

Protection against secondary failures can be realized, at component level, with decoupling elements such as diodes, resistors, capacitors (diodes E1 - E4 in Example 2.3). Other possibilities are the introduction of standby elements which are activated at failure of active elements, the use of basically different technologies for redundant elements, etc. Quite generally, all parts which are essential for basic functions (e.g. interfaces and monitoring circuits) have to be designed with care. Adherence to appropriate design guidelines is important (Chapter 5). Recognition and localization of hidden failures as well as avoidance of false alarms (caused e.g. by synchronization problems) are mandatory. These and similar considerations apply in particular to equipment and systems with high reliability and/or safety requirements, as used e.g. in aerospace, automotive, and nuclear applications.

Many of the above aspects also apply to defects, both in hardware and software (see Section 5.3.1 for software defects).


2.4 Reliability Allocation

With complex equipment and systems, it is important to allocate reliability goals at subsystem and assembly levels early in the design phase. Such an allocation motivates the design engineer to consider reliability aspects at all system levels.

Allocation is simple if the item (system) has no redundancy and its components have constant failure rates. The system's failure rate $\lambda_S$ is then constant and equal to the sum of the failure rates of its elements (Eq. (2.19)). In such a case, the allocation of $\lambda_S$ can be done as follows:

1. Break down the system into elements $E_1, \ldots, E_n$.

2. Define a complexity factor $k_i$ for each element ($0 \le k_i \le 1$, $k_1 + \ldots + k_n = 1$).

3. Determine the duty cycle $d_i$ for each element ($d_i$ = operating time of element $E_i$ / operating time of the system).

4. Allocate the system's failure rate $\lambda_S$ among elements $E_1, \ldots, E_n$ according to

$\lambda_i = \lambda_S\, k_i / d_i$. (2.71)

Should all elements have the same complexity ($k_1 = \ldots = k_n = 1/n$) and the same duty cycle ($d_1 = \ldots = d_n = 1$), then

$\lambda_i = \lambda_S / n$. (2.72)
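The allocation rule of Eq. (2.71) can be sketched in a few lines. The function name `allocate_failure_rates` and the numerical values (system failure rate, complexity factors, duty cycles) are hypothetical and serve only as an illustration.

```python
def allocate_failure_rates(lambda_s, complexity, duty_cycle):
    """Eq. (2.71): lambda_i = lambda_S * k_i / d_i for each element E_i."""
    if len(complexity) != len(duty_cycle):
        raise ValueError("one complexity factor and one duty cycle per element")
    if abs(sum(complexity) - 1.0) > 1e-9:
        raise ValueError("complexity factors must sum to 1")
    if any(not 0.0 < d <= 1.0 for d in duty_cycle):
        raise ValueError("duty cycles must lie in (0, 1]")
    return [lambda_s * k / d for k, d in zip(complexity, duty_cycle)]


lambda_s = 1e-5                      # hypothetical system goal, per hour
k = [0.5, 0.3, 0.2]                  # hypothetical complexity factors
d = [1.0, 1.0, 0.5]                  # hypothetical duty cycles
lam = allocate_failure_rates(lambda_s, k, d)
print(lam)

# Consistency check: weighting each allocated rate by its duty cycle
# recovers the system failure rate.
print(sum(di * li for di, li in zip(d, lam)))
```

With equal complexity factors 1/n and all duty cycles equal to 1, the function reduces to Eq. (2.72).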

In addition to the above, cost, technology risks, and failure effects should also be considered. A case-by-case optimization is often possible.

Should the individual element failure rates not be constant and/or the system contain redundancy, allocation of reliability goals is more difficult. The results of Sections 2.2 and 2.3 can be used. If repairable series-parallel structures appear, one can often assume that the failure rate at equipment or system level is fixed by the series elements (Section 6.6), for which Eqs. (2.71) and (2.72) can be used.

2.5 Mechanical Reliability, Drift Failures

As long as the reliability is considered to be the probability R for a mission success (without relation to the distribution of the failure-free time), the reliability analysis procedure for mechanical equipment or systems is similar to that used for electronic equipment or systems and is based on the following steps:

1. Definition of the system and of its associated mission profile.

2. Derivation of the corresponding reliability block diagram.

3. Determination of the reliability for each element of the reliability block diagram.


4. Calculation of the system reliability $R_{S0}$.

5. Elimination of reliability weaknesses and return to step 1 or 2, as necessary.

Such a procedure is currently used in practical applications and is illustrated by Examples 2.13 and 2.14.

Example 2.13 The fastening of two mechanical parts should be easy and reliable. It is done by means of two flanges which are pressed together with 4 clamps $E_1$ to $E_4$ placed at 90° to each other. Experience has shown that the fastening holds when at least 2 opposing clamps work. Set up the reliability block diagram for this fixation and compute its reliability (each clamp is new at t = 0 and has reliability $R_1 = R_2 = R_3 = R_4 = R$).

Solution Since at least two opposing clamps ($E_1$ and $E_3$ or $E_2$ and $E_4$) have to function without failure, the reliability block diagram is obtained as the series connection of $E_1$ and $E_3$ in parallel with the series connection of $E_2$ and $E_4$, see graph on the right. Under the assumption that each clamp is independent from every other one, the item reliability follows as $R_{S0} = 2R^2 - R^4$.

Supplementary result: If two arbitrary clamps were sufficient for the required function, a 2-out-of-4 active redundancy would apply, yielding (Tab. 2.1) $R_{S0} = 6R^2 - 8R^3 + 3R^4$.
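Both results of Example 2.13 can be cross-checked by enumerating the 16 up/down combinations of the four clamps. The sketch below assumes independent clamps with equal reliability R; the function names and the value R = 0.9 are illustrative assumptions.

```python
from itertools import product


def system_reliability(works, r, n=4):
    """Sum the probabilities of all up/down state vectors in which the system works."""
    total = 0.0
    for state in product((True, False), repeat=n):
        if works(state):
            prob = 1.0
            for up in state:
                prob *= r if up else 1.0 - r  # independent elements assumed
            total += prob
    return total


def opposing_pair_holds(state):
    # Fastening holds if E1 and E3, or E2 and E4, both work.
    e1, e2, e3, e4 = state
    return (e1 and e3) or (e2 and e4)


def any_two_hold(state):
    # Supplementary case: any 2 of the 4 clamps are sufficient.
    return sum(state) >= 2


r = 0.9  # hypothetical clamp reliability
print(system_reliability(opposing_pair_holds, r), 2 * r**2 - r**4)
print(system_reliability(any_two_hold, r), 6 * r**2 - 8 * r**3 + 3 * r**4)
```

Each enumerated value coincides with the corresponding polynomial, which illustrates how much reliability is lost when only opposing clamps count.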

Example 2.14 To separate a satellite's protective shielding, a special electrical-pyrotechnic system described in the block diagram on the right is used. An electrical signal comes through the cables $E_1$ and $E_2$ (redundancy) to the electrical-pyrotechnic converter $E_3$ which lights the fuses. These carry the pyrotechnic signal to explosive charges for guillotining bolts $E_{12}$ and $E_{13}$ of the tensioning belt. The charges can be ignited from two sides, although one ignition will suffice (redundancy). For fulfillment of the required function, both bolts must be exploded simultaneously. Give the reliability of this separation system as a function of the reliabilities $R_1, \ldots, R_{13}$ of its elements (new at t = 0).

Solution The reliability block diagram is easily obtained by considering first the ignition of bolts E12 & E13 separately and then connecting these two parts of the reliability block diagram in series.

Elements $E_4$, $E_5$, $E_{10}$, and $E_{11}$ each appear twice in the reliability block diagram. Repeated application of the key item method (successively on $E_5$, $E_{11}$, $E_4$, and $E_{10}$, see Section 2.3.1 and Example 2.9), by assuming that the elements $E_1, \ldots, E_{13}$ are independent, leads to

Rso=Rg R ~ z R ~ ~ ( R ~ + R ~ - R ~ R ~ ) { R ~ ( R ~ ~ [ R ~ I R ~ ~ ( R ~ + R ~ - R ~ R8)(R7+R9-R7 4 ) + ~ ~ - R ~ ~ > R ~ R ~ I + ( ~ - R ~ ) R ~ R ~ I + ( ~ - R ~ ~ ) R ~ R ~ R ~ R ~ ~ ) + ( ~ - R ~ ) R ~ ~ R ~ R ~ ~ }

More complicated is the situation when the reliability function R(t) is required. For electronic components it is possible to operate with the failure rate, since models and data are often available. This is generally not the case for mechanical parts, although failure rate models for some parts and units (bearings, springs, couplings, valves, etc.) have been developed [2.26]. If no information about failure rates is available, a general approach based on the stress-strength method, often supported by finite element analysis, can be used. Let $\xi_L(t)$ be the stress (load) and $\xi_S(t)$ the strength; a failure occurs at the time t for which $|\xi_L(t)| > |\xi_S(t)|$ holds for the first time. Often, $\xi_L(t)$ and $\xi_S(t)$ can be considered as deterministic values and the ratio $\xi_S(t)/\xi_L(t)$ is the safety factor. In many practical applications, $\xi_L(t)$ and $\xi_S(t)$ are random variables, often stochastic processes. A practice-oriented procedure for the reliability analysis of mechanical systems in these cases is:

1. Definition of the system and of its associated mission profile.

2. Formulation of failure hypotheses (buckling, bending, etc.) and validation of them using a FMEA/FMECA (Section 2.6); failure hypotheses are often correlated, this dependence must be identified and considered.

3. Evaluation of the stresses applied with respect to the critical failure hypotheses.

4. Evaluation of the strength limits by considering also dynamic stresses, notches, surface condition, etc.

5. Calculation of the system reliability (Eqs. (2.74) - (2.80)).

6. Elimination of reliability weaknesses and return to step 1 or 2, as necessary.

Reliability calculation often leads to one of the following situations:

1. One failure hypothesis, stress and strength are > 0: The reliability function is given by

$R_{S0}(t) = \Pr\{\xi_S(x) > \xi_L(x),\ 0 < x \le t\}$, $R_{S0}(0) = 1$. (2.74)

2. More than one (n > 1) failure hypothesis that can be correlated, stresses and strength are > 0: The reliability function is given by


Equation (2.75) can take a complicated form, according to the degree of dependence encountered.

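The situation is easier when stress and strength can be assumed to be independent and positive random variables. In this case, $\Pr\{\xi_S > \xi_L \mid \xi_L = x\} = \Pr\{\xi_S > x\} = 1 - F_S(x)$ and the theorem of total probability leads to

$R_{S0} = \Pr\{\xi_S > \xi_L\} = \int_0^\infty f_L(x)\,(1 - F_S(x))\,dx$, (2.76)

where $f_L(x)$ is the density of the stress and $F_S(x)$ the distribution function of the strength.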

Examples 2.15 and 2.16 illustrate the use of Eq. (2.76).
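When no closed form is available, the stress-strength integral resulting from the total probability argument, i.e. the integral over $x \ge 0$ of the stress density times the probability that the strength exceeds x, can be approximated numerically. The sketch below uses a plain trapezoidal rule; the function name, the truncation point `upper`, and the exponential distributions chosen for the check are assumptions made for illustration (for exponential stress with rate a and exponential strength with rate b the exact value is a/(a + b)).

```python
import math


def stress_strength_reliability(f_load, F_strength, upper, steps=20000):
    """Trapezoidal approximation of the integral of f_L(x) * (1 - F_S(x)) over [0, upper]."""
    h = upper / steps
    total = 0.0
    for j in range(steps + 1):
        x = j * h
        weight = 0.5 if j in (0, steps) else 1.0  # end points count half
        total += weight * f_load(x) * (1.0 - F_strength(x))
    return total * h


# Hypothetical test case: exponential stress (rate a) and exponential strength (rate b).
a, b = 2.0, 0.5


def f_load(x):
    return a * math.exp(-a * x)


def F_strength(x):
    return 1.0 - math.exp(-b * x)


r = stress_strength_reliability(f_load, F_strength, upper=40.0)
print(r, a / (a + b))
```

The truncation point has to be chosen large enough for the neglected tail of the stress density to be insignificant; this matters in practice because the tails of the distributions dominate the result.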

Example 2.15 Let the stress $\xi_L$ of a mechanical joint be normally distributed with mean $m_L = 100$ N/mm² and standard deviation $\sigma_L = 40$ N/mm². The strength $\xi_S$ is also normally distributed with mean $m_S = 150$ N/mm² and standard deviation $\sigma_S = 10$ N/mm². Compute the reliability of the joint.

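Solution Since $\xi_L$ and $\xi_S$ are normally distributed, their difference is also normally distributed (A6.16). Its mean and standard deviation are $m_S - m_L = 50$ N/mm² and $\sqrt{\sigma_S^2 + \sigma_L^2} \approx 41$ N/mm², respectively. The reliability of the joint is then given by (Table A9.1) $R_{S0} = \Pr\{\xi_S - \xi_L > 0\} = \Phi(50/41) \approx 0.89$, with $\Phi$ the standard normal distribution function.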
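The numerical value for Example 2.15 can be checked without tables by expressing the standard normal distribution function through the error function. The helper names below are illustrative; independence of stress and strength is assumed, as in the example.

```python
import math


def std_normal_cdf(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_stress_strength_reliability(m_s, sigma_s, m_l, sigma_l):
    """Pr{strength > stress} for independent, normally distributed stress and strength."""
    sigma_diff = math.hypot(sigma_s, sigma_l)  # std. deviation of the difference
    return std_normal_cdf((m_s - m_l) / sigma_diff)


# Data of Example 2.15 (all values in N/mm^2)
r = normal_stress_strength_reliability(m_s=150.0, sigma_s=10.0, m_l=100.0, sigma_l=40.0)
print(round(r, 3))
```

The result is close to 0.89. Reducing the stress standard deviation from 40 to 20 N/mm² in this sketch raises the reliability noticeably, which shows how strongly the scatter of the stress drives the outcome.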

Example 2.16 Let the strength $\xi_S$ of a rod be normally distributed with mean $m_S(t) = 450\ \mathrm{N/mm^2} - 0.01\,t\ \mathrm{N/mm^2\,h^{-1}}$ and standard deviation $\sigma_S(t) = 25\ \mathrm{N/mm^2} + 0.001\,t\ \mathrm{N/mm^2\,h^{-1}}$. The stress $\xi_L$ is constant and equal to 350 N/mm². Calculate the reliability of the rod at t = 0 and $t = 10^4$ h.

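Solution At t = 0, $m_S = 450$ N/mm² and $\sigma_S = 25$ N/mm². Thus, $R_{S0} = \Pr\{\xi_S > 350\ \mathrm{N/mm^2}\} = \Phi((450 - 350)/25) = \Phi(4) \approx 0.99997$.

After 10,000 operating hours, $m_S = 350$ N/mm² and $\sigma_S = 35$ N/mm². The reliability is then $R_{S0} = \Phi(0) = 0.5$.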
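The drift of the strength in Example 2.16 can be followed over time with a small function. This is a sketch under the assumptions of the example (normally distributed strength with linearly drifting mean and standard deviation, constant stress); the function name `rod_reliability` is illustrative.

```python
import math


def rod_reliability(t_hours, stress=350.0):
    """Pr{strength > stress} at operating time t for the rod of Example 2.16 (N/mm^2)."""
    mean = 450.0 - 0.01 * t_hours    # mean strength decreases with time
    sigma = 25.0 + 0.001 * t_hours   # scatter of the strength increases with time
    z = (mean - stress) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


for t in (0.0, 2.5e3, 5e3, 7.5e3, 1e4):
    print(t, round(rod_reliability(t), 5))
```

The values fall from about 0.99997 at t = 0 to 0.5 at $t = 10^4$ h, where the mean strength has drifted down to the stress level.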

Equation (2.76) holds for a one-item structure. For a series model, i.e. in particular for the series connection of two independent elements, one obtains:

1. Same stress $\xi_L$

2. Independent stresses $\xi_{L1}$ and $\xi_{L2}$

For a parallel model, i.e. in particular for the parallel connection of two nonrepairable independent elements, it follows that:

1. Same stress $\xi_L$

2. Independent stresses $\xi_{L1}$ and $\xi_{L2}$
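One way to write out the four cases is sketched below. It is a derivation under stated assumptions, not a quotation of the original equations: the strengths $\xi_{S1}$, $\xi_{S2}$ are taken as independent of each other and of the stresses, $f_L$, $f_{L1}$, $f_{L2}$ denote stress densities, $F_{S1}$, $F_{S2}$ strength distribution functions, and $R_1$, $R_2$ the single-element reliabilities obtained from the one-item integral.

```latex
\begin{align*}
\text{Series, same stress:} \quad
  R_{S0} &= \Pr\{\xi_{S1} > \xi_L \,\cap\, \xi_{S2} > \xi_L\}
          = \int_0^\infty f_L(x)\,\bigl(1 - F_{S1}(x)\bigr)\bigl(1 - F_{S2}(x)\bigr)\,dx \\
\text{Series, independent stresses:} \quad
  R_{S0} &= \Bigl(\int_0^\infty f_{L1}(x)\bigl(1 - F_{S1}(x)\bigr)dx\Bigr)
            \Bigl(\int_0^\infty f_{L2}(x)\bigl(1 - F_{S2}(x)\bigr)dx\Bigr) = R_1 R_2 \\
\text{Parallel, same stress:} \quad
  R_{S0} &= 1 - \Pr\{\xi_{S1} \le \xi_L \,\cap\, \xi_{S2} \le \xi_L\}
          = 1 - \int_0^\infty f_L(x)\,F_{S1}(x)\,F_{S2}(x)\,dx \\
\text{Parallel, independent stresses:} \quad
  R_{S0} &= 1 - (1 - R_1)(1 - R_2) = R_1 + R_2 - R_1 R_2
\end{align*}
```

In the same-stress cases the two elements are no longer independent, because a high stress value affects both at once; only for independent stresses do the product rules for independent elements apply.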

As with Eqs. (2.78) and (2.80), the results of Table 2.1 can be applied in the case of independent stresses and elements. However, this ideal situation is seldom true for mechanical systems, for which Eqs. (2.77) and (2.79) are often more realistic. Moreover, the uncertainty about the exact form of the distributions for stress and strength far from the mean value severely reduces the accuracy of the results obtained from the above equations in practical applications. For mechanical items, tests are thus often the only way to evaluate their reliability. Investigations into new methods are in progress, paying particular attention to the dependence between stresses and to a realistic truncation of the stress and strength densities (Eq. (A6.33)). Other approaches are possible for mechanical systems, see e.g. [2.61 - 2.75].

For electronic items, Eqs. (2.76) and (2.77) - (2.80) can often be used to investigate drift failures. Quite generally, all considerations of Section 2.5 could be applied to electronic items. However, the method based on the failure rate, introduced in Section 2.2, is easier to use and works reasonably well in many practical applications dealing with electronic and electromechanical equipment and systems.


2.6 Failure Mode Analysis

Failure rate analysis (Sections 2.1-2.5) basically does not account for the mode and effect (consequence) of a failure. To understand the mechanism of system failures and in order to identify potential weaknesses of a fail-safe concept it is necessary to perform a failure mode analysis, at least where redundancy appears and for critical parts of the item considered. Such an analysis is termed FMEA (Failure Modes and Effects Analysis) or alternatively FMECA (Failure Modes, Effects, and Criticality Analysis) if also the failure severity is of interest. If failures and defects have to be considered, Fault is used instead of Failure. A FMEA/FMECA consists of the systematic analysis of failure (fault) modes, their causes, effects, and criticality [2.81 - 2.84, 2.87 - 2.93, 2.95, 2.97], including common-mode & common-cause failures as well. All possible failure (fault) modes (for the item considered), their causes and consequences are systematically investigated, in one run or in several steps (design FMEA/FMECA, process FMEA/FMECA). For critical cases, the possibilities to avoid the failure (fault) or to minimize its consequence are analyzed and corresponding corrective (or preventive) actions are initiated. The criticality describes the severity of the consequence of the failure (fault) and is designated by categories or levels which are a function of the risk for damage or loss of performance. Considerations on failure modes for electronic components are in Tables 3.4 & A10.1 and Section 3.3.

The FMEA/FMECA is performed bottom-up by the designer in cooperation with the reliability engineer. The procedure is well established in international standards [2.89]. It is easy to understand but can become time-consuming for complex equipment and systems. For this reason it is recommended to concentrate efforts on critical parts of the item considered, in particular where redundancy appears. Table 2.5 shows the procedure for a FMEA/FMECA (conforming to IEC 60812 [2.89]). Basic are steps 3 to 8. Table 2.6 gives an example of a detailed FMECA for the electronic switch given in Example 2.6, Point 7. Each row of Tab. 2.5 is a column in Tab. 2.6. Other worksheet forms for FMEA/FMECA are possible [2.82 - 2.84, 2.91, 2.95]. The FMEA/FMECA is mandatory for items with fail-safe behavior and for all parts of an item in which redundancy appears (to verify the effectiveness of the redundancy when failure occurs and to define the element in series on the reliability block diagram), as well as for failures which can cause a safety problem (liability claim). A FMEA/FMECA is also useful to support maintainability analyses.

For a visualization of the item's criticality, the FMECA is often completed by a criticality grid (criticality matrix), see e.g. [2.89]. In such a matrix, each failure mode gives an entry (dot or other) with criticality category as ordinate and corresponding probability (frequency) of occurrence as abscissa (Fig. 2.13). Generally accepted classifications are minor (I), major (II), critical (III), and catastrophic (IV) for the criticality level and very low, low, medium, and high for the probability of occurrence. In a criticality grid, the farther an entry is from the origin, the greater is the necessity for a corrective/preventive action.


Table 2.5 Basic procedure for performing a FMECA (according to IEC 60812 [2.89])*)

1. Sequential numbering of the step.

2. Designation of the element or part under consideration, short description of its function, and reference to the reliability block diagram, part list, etc. (3 steps in IEC 60812)

3. Assumption of a possible fault**) mode (all possible fault modes have to be considered).

4. Identification of possible causes for the fault mode assumed in step 3 (a cause for a fault can also be a flaw in the design phase, production phase, transportation, installation or use).

5. Description of the symptoms which will characterize the fault mode assumed in step 3 and of its local effect (output/input relationships, possibilities for secondary failures or faults, etc.).

6. Identification of the consequences of the fault mode assumed in step 3 on the next higher integration levels (up to the system level) and on the mission to be performed.

7. Identification of fault detection provisions and of corrective actions which can mitigate the severity of the fault mode assumed in step 3, reduce the probability of occurrence, or initiate an alternate operational mode which allows continued operation when the fault occurs.

8. Identification of possibilities to avoid the fault mode assumed in step 3.

9. Evaluation of the severity of the fault mode assumed in step 3 (FMECA only); e.g. I for minor, II for major, III for critical, IV for catastrophic (or alternatively, 1 for failure to complete a task, 2 for large economic loss, 3 for large material damage, 4 for loss of human life).

10. Estimation of the probability of occurrence (or failure rate) of the fault mode assumed in step 3 (FMECA only), with consideration of the cause of fault identified in step 4.

11. Formulation of pertinent remarks which complete the information in the previous columns and also of recommendations for corrective actions, which will reduce the consequences of the fault mode assumed in step 3.

*) FMEA by omitting steps 9 & 10; steps are columns in Tab. 2.6; **) fault includes failure & defect

The procedure for the FMEA/FMECA has been developed for hardware, but can also be used for software [2.87, 2.88, 5.64, 5.68]. For mechanical items, the FMEA/FMECA is an essential tool in reliability analysis (Section 2.5).

Figure 2.13 Example of criticality grid for a FMECA (according to IEC 60812 [2.89]); abscissa: probability of failure / fault (very low, low, medium, high)

Table 2.6 Example of a detailed FMECA for the electronic switch given in Example 2.6, Point 7 (worksheet "Failure (fault) modes and effects analysis / Failure (fault) modes, effects, and criticality analysis"; equipment: control cabinet XYZ, LED display circuit; mission / required function: fault signaling; state: operating phase; date: Sept. 13, 2000; column headings: (2) element, function, position; (3) assumed fault mode; (4) causes; (5) symptoms, local effects; (6) effect on mission; (7) fault detection possibilities; (8) possibilities to avoid the fault mode in (3); (9) severity; (10) probability of occurrence; (11) remarks and suggestions; worksheet entries not reproduced)


Figure 2.14 Example of Fault Tree (FT) for the electronic switch given in Example 2.6, Point 7, p. 51 (O = open, S = short, Ext. are possible external causes, such as power out, manufacturing error, etc.); as in use for FTA, "0" holds for operating and "1" for failure (Section 6.9.2)

A further possibility to investigate failure-causes-to-effects relationships is the Fault Tree Analysis (FTA) [2.6, 2.85, 2.86, 2.95, 2.96]. The FTA is a top-down procedure in which the undesired event, for example a critical failure at system level, is represented by AND, OR, and NOT combinations of causes at lower levels. It is a current rule in FTA [A2.6 (IEC 61025)] to use "0" for operating and "1" for failure (the top event "1" being in general a failure). An example of Fault Tree (FT) for the electronic switch of Example 2.6 (Point 7) is shown in Fig. 2.14. In a fault tree, a cut set is a set of basic events whose occurrence causes the top event to occur. If the top event is system failure, minimal cut sets defined by Eq. (2.43) can be identified. Algorithms have been developed to obtain from a fault tree the minimal cut sets (and minimal path sets) belonging to the system considered, see e.g. [2.35, 2.96]. From a complete and correct fault tree it is thus possible to calculate the reliability function and the point availability of the corresponding system in the case of parallel (active) redundancy and totally independent elements (p. 52). To consider some dependencies, dynamic gates have been introduced (Section 6.9.2).
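The notion of minimal cut sets can be illustrated with a brute-force search on a small tree. The sketch below is not one of the algorithms cited above: it simply tests all combinations of basic events by increasing size and is only practical for small trees restricted to AND and OR gates. The nested-tuple tree representation and the example tree (basic events A to D) are hypothetical.

```python
from itertools import combinations


def occurs(node, failed):
    """True ("1", failure) if the event described by node occurs, given the failed basic events."""
    if isinstance(node, str):          # basic event
        return node in failed
    gate, *children = node
    results = [occurs(child, failed) for child in children]
    if gate == "AND":
        return all(results)
    if gate == "OR":
        return any(results)
    raise ValueError(f"unsupported gate: {gate!r}")


def basic_events(node):
    """Set of all basic event names appearing in the tree."""
    if isinstance(node, str):
        return {node}
    return set().union(*(basic_events(child) for child in node[1:]))


def minimal_cut_sets(tree):
    """Minimal sets of basic events whose joint occurrence causes the top event."""
    events = sorted(basic_events(tree))
    cuts = []
    for size in range(1, len(events) + 1):
        for combo in combinations(events, size):
            candidate = set(combo)
            if any(cut <= candidate for cut in cuts):
                continue               # contains a smaller cut set, hence not minimal
            if occurs(tree, candidate):
                cuts.append(candidate)
    return cuts


# Hypothetical top event: A, or B together with C, or B together with D.
tree = ("OR", "A", ("AND", "B", "C"), ("AND", "B", "D"))
print(minimal_cut_sets(tree))
```

For the hypothetical tree the search returns the three cut sets {A}, {B, C}, and {B, D}. Because the number of combinations grows exponentially with the number of basic events, dedicated algorithms are needed for realistic trees.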

Compared to FMEA/FMECA, FTA can take external influences (human and/or environmental) better into account, and handle situations where more than one primary fault (multiple faults) has to occur in order to cause the undesired event at system level. However, it does not necessarily go through all possible fault modes. Combination of FMEA/FMECA with FTA leads to a causes-to-effects chart, showing the logical relationship between causes and their single or multiple consequences.

Further methods which can support causes-to-effects analyses are sneak analysis (circuit, path, timing), worst case analysis, drift analysis, stress-strength analysis, Ishikawa (fishbone) diagrams, Kepner-Tregoe method, Pareto diagrams, and Shewhart cycles (Plan-Analyze-Check-Do), see e.g. [1.19, 1.22, 2.13]. Table 2.7 gives a comparison of the most important tools used for causes-to-effects analyses. Figure 2.15 shows the basic structure of an Ishikawa (fishbone) diagram. The Ishikawa diagram is a graphical visualization of the relationships between causes and effect, grouping the causes into machine, material, method, and human (man), into failure mechanisms, or into a combination of all of them, as appropriate.


Figure 2.15 Typical structure of a cause and effect (Ishikawa or fishbone) diagram, with major and minor causes leading to the effect (causes can often be grouped into Machine, Material, Method, and Human (Man), into failure mechanisms, or into a combination of all of them, as appropriate)

Performing a FMEA/FMECA, FTA, or any other similar investigation presupposes a detailed technical knowledge and thorough understanding of the item and the technology considered. This is necessary to identify all relevant potential flaws (during design, development, manufacture, operation), their causes, and the most appropriate corrective or preventive actions.

2.7 Reliability Aspects in Design Reviews

Design reviews are important to point out, discuss, and eliminate design weaknesses. Their objective is also to decide about continuation or stopping of the project on the basis of objective considerations (feasibility checks in Fig. 1.6 and in Tables A3.3 and 5.3). The most important design reviews are described in Table A3.3 for hardware and in Table 5.5 for software. To be effective, design reviews must be supported by project specific checklists. Table 2.8 gives an example of a catalog of questions which can be used to generate project specific checklists for reliability aspects in design reviews (see Table 4.3 for maintainability and Appendix A4 for other aspects). As shown in Table 2.8, checking the reliability aspects during a design review is more than just verifying the value of the predicted reliability or the source used for failure rate calculation. The purpose of a design review is in particular to discuss the selection and use of components and materials, the adherence to given design guidelines, the presence of potential reliability weaknesses, and the results of analysis and tests. Table 2.8 and Table 2.9 can be used to support this aim.


Table 2.7 Important tools for causes-to-effects-analysis

Tool: FMEA/FMECA (Fault Mode Effects Analysis / Fault Mode, Effects and Criticality Analysis)*)
Description: Systematic bottom-up investigation of the effects (consequences) at system (item) level of the fault modes of all parts of the system considered, as well as of manufacturing flaws and (as far as possible) of user's errors / mistakes*)
Application: Development phase (design FMEA/FMECA) and production phase (process FMEA/FMECA); mandatory for all interfaces, in particular where redundancy appears and for safety relevant parts
Effort: Very large if performed for all elements (0.1 MM for a PCB)

Tool: FTA (Fault Tree Analysis)
Description: Quasi-systematic top-down investigation of the effects (consequences) of faults (failures and defects) as well as of external influences on the reliability and/or safety of the system (item) considered; the top event (e.g. a specific catastrophic failure) is the result of AND & OR combinations of elementary events
Application: Similar to FMEA/FMECA; however, combinations of more than one fault (or elementary event) can be better considered than by a FMEA/FMECA; the influence of external events (natural catastrophe, sabotage etc.) is also easier to consider
Effort: Large to very large if many top events are considered

Tool: Ishikawa Diagram (Fishbone Diagram)
Description: Graphical representation of the causes-to-effects relationships; the causes are often grouped in four classes: machine, material, method / process, and human dependent
Application: Ideal for team-work discussions, in particular for the investigation of design, development, or production weaknesses
Effort: Small to large

Tool: Kepner-Tregoe Method
Description: Structured problem detection, analysis, and solution by complex situations; the main steps of the method deal with a careful problem analysis, decision making, and ...
Application: Generally applicable, in particular by complex situations and in interdisciplinary work-groups
Effort: Largely dependent on the specific situation

Tool: Pareto Diagram
Description: Graphical presentation of the frequency (histogram) and (cumulative) distribution of the problem causes, grouped in application specific classes
Application: ... decision making in selecting the causes of a fault and thus in defining the appropriate corrective action (Pareto rule: 80% of the problems are generated by 20% of the possible causes)
Effort: Small

Tool: Correlation Diagram
Description: Graphical representation of (two) quantities with possible functional (deterministic or stochastic) relation on an appropriate x/y-Cartesian coordinate system
Application: Assessment of a relationship between two quantities
Effort: Small

*) Faults include failures and defects, allowing errors as possible causes as well; MM stands for man month


Table 2.8 Example of a catalog of questions for the preparation of project specific checklists for the evaluation of reliability aspects in preliminary design reviews (Appendices A3 and A4) of complex equipment and systems with high reliability requirements

1. Is it a new development, redesign, or change/modification?

2. Is there test or field data available from similar items? What were the problems?

3. Has a list of preferred components been prepared and consequently used?

4. Is the selection/qualification of nonstandard components and material specified? How?

5. Have the interactions among elements been minimized? Can interface problems be expected?

6. Have all the specification requirements of the item been fulfilled? Can individual requirements be reduced?

7. Has the mission profile been defined? How has it been considered in the analysis?

8. Has a reliability block diagram been prepared?

9. Have the environmental conditions for the item been clearly defined? How are the operating conditions for each element?

10. Have derating rules been appropriately applied?

11. Has the junction temperature of all semiconductor devices been kept lower than 100°C?

12. Have drift, worst-case, and sneak path analyses been performed? What are the results?

13. Has the influence of on-off switching and of external interference (EMC) been considered?

14. Is it necessary to improve the reliability by introducing redundancy?

15. Has a FMEA/FMECA been performed, at least for the parts where redundancy appears? How? Are single-point failures present? Can nothing be done against them? Are there safety problems? Can liability problems be expected?

16. Does the predicted reliability of each element correspond to its allocated value? With which π-factors has it been calculated?

17. Has the predicted reliability of the whole item been calculated? Does this value correspond to the target given in the item's specifications?

18. Are there elements with a limited useful life?

19. Are there components which require screening? Assemblies which require environmental stress screening (ESS)?

20. Can design or construction be further simplified?

21. Is failure detection, localization, and removal easy? Are hidden failures possible?

22. Have reliability tests been planned? What does this test program include?

23. Have the aspects of manufacturability, testability, and reproducibility been considered?

24. Have the supply problems (second source, long-term deliveries, obsolescence) been solved?


Table 2.9 Example of form sheets for detecting and investigating potential reliability weaknesses at assembly and equipment level

a) Assembly design
b) Assembly manufacturing
c) Prototype qualification tests
d) Equipment or system level

Column headings recoverable from the form sheets: Item, Component, Failure rate, Parameters, Layout, Soldering, Cleaning, Electrical tests, Screening, Screening (ESS), Assembling, Test, Environmental tests, Reliability tests, Fault (defect, failure) analysis, Corrective actions, Transportation and storage, Operation (field data)

3 Qualification Tests for Components and Assemblies

Components, materials, and assemblies have a great impact on the quality and reliability of the equipment and systems in which they are used. Their selection and qualification has to be considered with care for new technologies or important redesigns, on a case-by-case basis. Besides cost and availability on the market, important selection criteria are intended application, technology, quality, long-term behavior of relevant parameters, and reliability. A qualification test includes characterization at different stresses (for instance electrical and thermal for electronic components), environmental tests, reliability tests, and failure analysis. After some considerations about selection criteria for electronic components (Section 3.1), this chapter deals with qualification tests for complex integrated circuits (Section 3.2) and electronic assemblies (Section 3.4), and discusses basic aspects of failure modes, mechanisms, and analysis of electronic components (Section 3.3). Procedures given in this chapter can be extended to nonelectronic components and materials as well. Reliability related basic technological properties of electronic components are summarized in Appendix A10. Statistical tests are in Chapter 7, test and screening strategies in Chapter 8, design guidelines in Chapter 5.

3.1 Basic Selection Criteria for Electronic Components

As given in Section 2.2 (Eq. (2.18)), the failure rate of equipment and systems without redundancy is the sum of the failure rates of their elements. Thus, for large equipment or systems without redundancy, high reliability can only be achieved by selecting components and materials with sufficiently low failure rates. Useful information for such a selection includes:

1. Intended application, in particular required function, environmental conditions, as well as reliability and safety targets.

2. Specific properties of the component or material considered (technological limits, useful life, long term behavior of relevant parameters, etc.).

3. Possibility for accelerated tests.

4. Results of qualification tests on similar components or materials.

5. Influence of screening, experience from field operation.

6. Influence of derating.

7. Potential design problems (sensitivity of performance parameters, interface problems, EMC, etc.).

8. Limitations due to standardization or logistic aspects.

9. Potential production problems (assembling, testing, handling, storing, etc.).

10. Purchasing considerations (cost, delivery time, second sources, long-term availability, quality level).

As many of the above requirements are conflicting, component selection often results in a compromise. The following is a brief discussion of the most important aspects in selecting electronic components.
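The additive rule recalled at the beginning of this section can be illustrated with a short numerical sketch. It assumes constant (time-independent) failure rates, so that the reliability function is exponential; the element counts and failure rates below are hypothetical values chosen only for illustration.

```python
import math


def series_failure_rate(element_rates):
    """Failure rate of a system without redundancy: sum of the element rates (1/h)."""
    return sum(element_rates)


def reliability(failure_rate, hours):
    """R(t) = exp(-lambda * t), valid only for a constant failure rate."""
    return math.exp(-failure_rate * hours)


# Hypothetical parts list: (number of elements, failure rate per element in 1/h)
parts = [(50, 2e-7), (200, 5e-8), (10, 1e-6)]
rates = [rate for count, rate in parts for _ in range(count)]

lam = series_failure_rate(rates)
print(f"system failure rate: {lam:.2e} 1/h")
print(f"1/lambda:            {1 / lam:.0f} h")
print(f"R(1 year):           {reliability(lam, 8760):.3f}")
```

With 260 elements, even element failure rates in the range of 10^-7 1/h add up to a system failure rate of the order of 10^-5 1/h, which is why the selection criteria above matter most for large equipment.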

3.1.1 Environment

Environmental conditions have a major impact on the functionality and reliability of electronic components, equipment, and systems. They are defined in international standards [3.8]. Such standards specify stress limits and test conditions, among others for

heat (steady-state, rate of temperature change), cold, humidity, precipitation (rain, snow, hail), radiation (solar, heat, ionizing), salt, sand, dust, noise, vibration (sinusoidal, random), shock, fall, acceleration.

Several combinations of stresses have also been defined, for instance

temperature and humidity, temperature and vibration, humidity and vibration.

Not all stress combinations are relevant and by combining stresses, or in defining sequences of stresses, care must be taken to avoid the activation of failure mechanisms which would not appear in the field.

Environmental conditions at equipment and system level are given by the application. They can range from severe, as in aerospace and defense fields (with extreme low and high ambient temperatures, 100% relative humidity, rapid thermal changes, vibration, shock, and high electromagnetic interference), to favorable, as in computer rooms (with forced cooling at constant temperature and no mechanical stress). International standards can be used to fix representative environmental conditions for many applications, e.g. IEC 60721 [3.8]. Table 3.1 gives examples for environmental test conditions for electronic/electromechanical equipment and systems. The stress conditions given in Table 3.1 are indicative and have to be refined according to the specific application, to be cost and time effective.


Table 3.1 Examples for environmental test conditions for electronic/electromechanical equipment and systems (according to IEC 60068 [3.8])

Dry heat
Stress profile: 48 or 72 h at 55, 70 or 85°C
Procedure: El. test, warm up (2°C/min), hold (80% of test time), power-on (20% of test time), el. test, cool down (1°C/min), el. test between 2 and 16 h
Induced failures: Physical: Oxidation, structural changes, softening, drying out, viscosity reduction, expansion. Electrical: Drift parameters, noise, insulating resistance, opens, shorts

Damp heat (cycles)
Stress profile: 2, 6, 12 or 24 x 24 h cycles between 25 and 55°C with rel. humidity over 90% at 55°C and 95% at 25°C
Procedure: El. test, warm up (3 h), hold (9 h), cool down (3 h), hold (9 h), at the end dry with air and el. test between 6 and 16 h
Induced failures: Physical: Corrosion, electrolysis, absorption, diffusion. Electrical: Drift parameters, insulating resistance, leakage currents, shorts

Low temperature
Stress profile: 48 or 72 h at -25, -40 or -55°C
Procedure: El. test, cool down (2°C/min), hold (80% of test time), power-on (20% of test time), el. test, warm up (1°C/min), el. test between 6 and 16 h
Induced failures: Physical: Ice formation, structural changes, hardening, brittleness, increase in viscosity, contraction. Electrical: Drift parameters, opens

Vibrations (random)
Stress profile: 30 min random acceleration with rectangular spectrum 20 to 2000 Hz and an acceleration spectral density of 0.03, 0.1, or 0.3 g_n^2/Hz
Procedure: El. test, stress, visual inspection, el. test
Induced failures: Physical: Structural changes, fracture of fixings and housings, loosening of connections, fatigue. Electrical: Opens, shorts, contact problems, noise

Vibrations (sinusoidal)
Stress profile: 30 min at 2 g_n (0.15 mm), 5 g_n (0.35 mm), or 10 g_n (0.75 mm) at the resonant freq. and the same test duration for swept freq. (3 axes)
Procedure: El. test, resonance determination, stress at the resonant frequencies, stresses at swept freq. (10 to 500 Hz), visual inspection, el. test
Induced failures: Physical: Structural changes, fracture of fixings and housings, loosening of connections, fatigue. Electrical: Opens, shorts, contact problems, noise

Mechanical shock (impact)
Stress profile: 1000, 2000 or 4000 impacts (half sine curve, 30 or 50 g_n peak value and 6 ms duration) in the main loading direction or distributed in the various impact directions
Procedure: El. test, stress (1 to 3 impacts/s), inspection (shock absorber), visual inspection, el. test
Induced failures: Physical: Structural changes, fracture of fixings and housings, loosening of connections, fatigue. Electrical: Opens, shorts, contact problems, noise

Free fall
Stress profile: 26 free falls from 50 or 100 cm drop height distributed over all surfaces, corners and edges, with or without transport packaging
Procedure: El. test, fall onto a 5 cm thick wooden block (fir) on a 10 cm thick concrete base, visual insp., el. test
Induced failures: Physical: Structural changes, fracture of fixings and housings, loosening of connections, fatigue. Electrical: Opens, shorts, contact problems, noise

g_n = 10 m/s^2; el. = electrical


At component level, the stresses caused by the equipment or system environmental conditions are augmented by those produced by the component itself, due to its internal electrical or mechanical load. The sum of these stresses gives the operating conditions, necessary to determine the stress at component level and the corresponding failure rate. For instance, the ambient temperature inside an electronic assembly can be just a few °C higher than the temperature of the cooling medium, if forced cooling is used, but can become more than 30°C higher than the ambient temperature if cooling is poor.

3.1.2 Performance Parameters

The required performance parameters at component level are defined by the intended application. Once these requirements are established, the necessary derating is determined taking into account the quantitative relationship between failure rate and stress factors (Sections 2.2.3, 2.2.4, 5.1.1). It must be noted that the use of better components does not necessarily imply better performance and/or reliability. For instance, a faster IC family can cause EMC problems, besides higher power consumption and chip temperature. In critical cases, component selection should not be based only on short data sheet information. Knowledge of parameter sensitivity can be mandatory for the application considered.

3.1.3 Technology

Technology is rapidly evolving for many electronic components, see Fig. 3.1 and Table A10.1 for some basic information. As each technology has its advantages and weaknesses with respect to performance parameters and/or reliability, it is necessary to have a set of rules which can help to select a technology. Such rules (design guidelines in Section 5.1) are evolving and have to be periodically refined.

Of particular importance for integrated circuits (ICs) is the selection of the packaging form and type.

For the packaging form, distinction is made between inserted and surface-mounted devices. Inserted devices offer the advantage of easy handling during the manufacture of PCBs and also of lower sensitivity to manufacturing defects or deviations. However, the number of pins is limited. Surface mount devices (SMD) allow a large number of pins (more than 196 for PQFP and BGA), are cost and space saving, and have better electrical performance because of the shortened and symmetrical bond wires. However, compared to inserted devices, they have greater junction to ambient thermal resistance, are more stressed during soldering, and solder joints have a much lower mechanical strength (Section 3.4). Difficulties can be expected with pitch lower than 0.3 mm, in particular if thermal and/or mechanical stresses occur in field (Sections 3.4 and 8.3).

Figure 3.1 Basic IC technology evolution (vertical axis: approximate sales volume [%])

Packaging types are subdivided into hermetic (ceramic, cerdip, metal can) and nonhermetic (plastic) packages. Hermetic packages should be preferred in applications with high humidity or in corrosive ambiance, and in any case if moisture condensation occurs on the package surface. Compared to plastic packages they offer lower thermal resistance between chip and case (Table 5.2), but are more expensive and sensitive to damage (microcracks) caused by inappropriate handling (mechanical shocks during testing or PCB production). Plastic packages are inexpensive, less sensitive to thermal or mechanical damage, but are permeable to moisture (other problems related to epoxy, such as ionic contamination and low glass-transition temperature, have been solved). However, better epoxy quality as well as new passivation (glassivation) based on silicon nitride lead to a much better protection against corrosion than formerly (Section 3.2.3, point 8).

If the results of qualification tests are good, the use of ICs in plastic packages can be allowed if one of the following conditions is satisfied:

1. Continuous operation, relative humidity < 70%, noncorrosive or marginally corrosive environment, junction temperature ≤ 100°C, and equipment useful life less than 10 years.

2. Intermittent operation, relative humidity < 60%, noncorrosive environment, no moisture condensation on the package, junction temperature < 100°C, and equipment useful life less than 10 years.

For ICs with silicon nitride passivation (glassivation), the conditions stated in Point 1 above should also apply for the case of intermittent operation.
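The two conditions above can be written as a simple screening rule, for instance in a component selection checklist. The following is a minimal sketch, not a procedure from the text: the class name, field names, and the three corrosivity categories are hypothetical, the limits are those read from Points 1 and 2, and good qualification test results are assumed as a precondition.

```python
from dataclasses import dataclass


@dataclass
class Application:
    """Hypothetical description of the intended application."""
    continuous_operation: bool
    relative_humidity_pct: float
    corrosivity: str              # "none", "marginal", or "corrosive"
    condensation: bool            # moisture condensation on the package
    junction_temp_c: float
    useful_life_years: float
    si3n4_passivation: bool = False


def plastic_package_allowed(app: Application) -> bool:
    """Apply Points 1 and 2; assumes qualification test results are good."""
    if app.useful_life_years >= 10:
        return False
    point_1 = (app.relative_humidity_pct < 70
               and app.corrosivity in ("none", "marginal")
               and app.junction_temp_c <= 100)
    point_2 = (app.relative_humidity_pct < 60
               and app.corrosivity == "none"
               and not app.condensation
               and app.junction_temp_c < 100)
    if app.continuous_operation:
        return point_1
    # Intermittent operation: Point 2, or the limits of Point 1 for ICs
    # with silicon nitride passivation.
    return point_2 or (app.si3n4_passivation and point_1)


example = Application(False, 65, "none", False, 95, 8)
print(plastic_package_allowed(example))
```

The rule is deliberately conservative in one respect only where the text is: it does not add a condensation check to Point 1, although hermetic packages are to be preferred whenever condensation occurs.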


3.1.4 Manufacturing Quality

The quality of manufacture has a great influence on electronic component reliability. However, information about global defective probabilities (fraction of defective items) or agreed AQL values (even zero defects) is often not sufficient to monitor the reliability level (AQL is nothing more than an agreed upper limit of the defective probability, generally at a producer risk α = 10%, see Section 7.1.3). Information about changes in the defective probability and the results of the corresponding fault analysis are important. For this, a direct feedback to the component manufacturer is generally more useful than an agreement on an AQL value.

3.1.5 Long-Term Behavior of Performance Parameters

The long-term stability of performance parameters is an important selection criterion for electronic components, allowing differentiation between good and poor manufacturers (Fig. 3.2). Verification of this behavior is generally undertaken with accelerated reliability tests (trends are often enough for many practical applications).

3.1.6 Reliability

The reliability of an electronic component can often be specified by its failure rate λ. Failure rate figures obtained from field data are valid if intrinsic failures can be separated from extrinsic ones and reliable data/information are available. Those figures given by component manufacturers are useful if calculated with appropriate values for the (global) activation energy (for instance, 0.4 to 0.6 eV for ICs) and confidence level (> 60% two sided or ≥ 80% one sided, see Section 7.1.1). Moreover, besides the numerical value of λ, the influence of the stress factor (derating) S is important as a selection criterion (Eq. (2.1), Table 5.1).
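The sensitivity of a quoted failure rate to the activation energy can be made visible with a short calculation. The sketch below assumes the Arrhenius model, which is the usual context in which an activation energy is quoted but is not stated in this section; the two junction temperatures (55°C in use, 125°C under test) are hypothetical.

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(ea_ev, t_use_c, t_stress_c):
    """Acceleration factor between two junction temperatures (Arrhenius model)."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp(ea_ev / K_BOLTZMANN_EV * (1.0 / t_use_k - 1.0 / t_stress_k))


# Hypothetical temperatures: 55 °C in operation, 125 °C during the test
for ea in (0.4, 0.5, 0.6):
    factor = arrhenius_acceleration(ea, 55, 125)
    print(f"Ea = {ea:.1f} eV -> acceleration factor {factor:.1f}")
```

Under these assumptions the acceleration factor changes by more than a factor of three between 0.4 and 0.6 eV, so a failure rate extrapolated from a high-temperature test is only as good as the activation energy used.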

Figure 3.2 Long-term behavior of performance parameters (vertical axis: performance parameter [%], referred to 100%; recoverable curve labels: fair, unstable, bad)

3.2 Qualification Tests for Complex Electronic Components


The purpose of a qualification test is to verify the suitability of a given item (material, component, assembly, equipment, system) for a stated application. Qualification tests are often a part of a release procedure, for instance prototype release for a manufacturer and release for acceptance in a preferred list (qualified part list) for a user. Such a test is generally necessary for new technologies or after important redesigns or production process changes. Additionally, periodic requalification of critical parameters is often necessary to monitor quality and reliability.

Electronic component qualification tests cover characterization, environmental and special tests, as well as reliability tests. They must be supported by intensive failure (fault) analysis to investigate relevant failure mechanisms (and fault causes). For a user, such a qualification test must consider:

1. Range of validity, narrow enough to be representative, but sufficiently large to cover company's needs and to repay test cost.

2. Characterization, to investigate the electrical performance parameters.

3. Environmental and special tests, to check technology limits.

4. Reliability tests, to gain information on the failure rate.

5. Failure analysis, to detect failure causes and investigate failure mechanisms.

6. Supply conditions, to define cost, delivery schedules, second sources, etc.

7. Final report and feedback to the manufacturer.

The extent of the above steps depends on the importance of the component being considered, the effect (consequence) of its failure in an equipment or system, and the experience previously gained with similar components and with the same manufacturer. National and international activities are moving toward agreements which should make a qualification test by the user unnecessary for many components [3.8, 3.18]. Procedures for environmental tests are often defined in standards [3.8, 3.11].

A comprehensive qualification test procedure for ICs in plastic packages is given in Fig. 3.3. One recognizes the major steps (characterization, environmental and special tests, reliability tests, and failure analysis) of the above list. Environmental tests cover the thermal, climatic, and mechanical stresses expected in the application under consideration. The number of devices required for the reliability tests should be determined in order to expect 3 to 6 failures during burn-in. The procedure of Fig. 3.3 has been applied extensively (with device-specific aspects like data retention and programming cycles for nonvolatile memories, or modifications because of ceramic packages) to 12 memories each with 2 to 4 manufacturers for comparative investigations [3.2 (1993), 3.6, 3.16]. The cost for a qualification test based on Fig. 3.3 for 2 manufacturers (comparative studies) can exceed US$50,000.
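The rule of choosing the number of devices so that 3 to 6 failures are expected during burn-in can be sketched numerically. The calculation below assumes a constant failure rate under test conditions and neglects the reduction of the population by failed devices; the accelerated failure rate of 2·10^-5 1/h is an illustrative assumption, chosen here because it is consistent with 150 devices and a 2000 h burn-in.

```python
import math


def expected_failures(n_devices, test_hours, accelerated_rate):
    """Expected number of failures for a constant failure rate (failed items neglected)."""
    return n_devices * test_hours * accelerated_rate


def devices_needed(target_failures, test_hours, accelerated_rate):
    """Smallest number of devices giving at least the target expected failures."""
    # The small offset guards against floating-point noise just above an integer.
    return math.ceil(target_failures / (test_hours * accelerated_rate) - 1e-9)


A_LAMBDA = 2e-5        # assumed failure rate under burn-in conditions, in 1/h
BURN_IN_HOURS = 2000   # burn-in duration, in h

for target in (3, 6):
    n = devices_needed(target, BURN_IN_HOURS, A_LAMBDA)
    print(f"{target} expected failures -> {n} devices")
```

A lower accelerated failure rate increases the required number of devices in inverse proportion, which is one reason why such tests are run at elevated temperature.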


3.2.1 Electrical Test of Complex ICs

Electrical test of VLSI ICs is performed according to the following three steps:

1. Continuity test.

2. Test of DC parameters.

3. Functional and dynamic test (AC).

The continuity test checks whether every pin is connected to the chip. It consists in forcing a prescribed current (100 µA) into one pin after another (with all other pins grounded) and measuring the resulting voltage. For inputs with protection diodes and for normal outputs this voltage should lie between -0.1 and -1.5 V.
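The pass criterion of the continuity test amounts to a window check on the measured pin voltages. A minimal sketch follows; the function names and the measured values are hypothetical, and only the voltage window is taken from the text.

```python
def continuity_ok(voltage_v, low=-1.5, high=-0.1):
    """True if the measured pin voltage lies inside the expected window (in V)."""
    return low <= voltage_v <= high


def failing_pins(measurements):
    """Return the pin numbers whose voltage lies outside the window."""
    return [pin for pin, voltage in measurements.items() if not continuity_ok(voltage)]


# Hypothetical measurements: pin number -> voltage in V with the test current forced
measured = {1: -0.62, 2: -0.65, 3: 0.0, 4: -2.4}
print(failing_pins(measured))
```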

Verification of DC parameters is simple. It is performed according to the manufacturer's specifications without restrictions (disregarding very low input currents). For this purpose a precision measurement unit (PMU) is used to force a current and measure a voltage (V_OH, V_OL, etc.) or to force a voltage and measure a current (I_IH, I_IL, etc.). Before each step, the IC inputs and outputs are brought to the logical state necessary for the measurement.

Figure 3.3 Example for a comprehensive qualification test procedure for complex ICs in plastic (Pl) packages (industrial application, normal environmental conditions, 3 to 6 expected failures during the reliability test (Aλ = 2·10^-5 h^-1 in this example), RH = relative humidity); * 150°C by epoxy resin, 175°C by silicon resin; ** 1000 h by Si3N4 passivation

Characterization (20 ICs): DC characterization (histograms) and AC characterization (histograms and shmoo plots) at -55, 0, 25, 70 and 125°C

Environmental and Special Tests (66 ICs, in groups of 36 and 30 ICs): internal visual inspection (reference ICs); passivation (2 ICs); electrostatic discharge (ESD) at 500, 1000, 2000 V, until V_ESD and at V_ESD - 250 V (HBM), el. test before and after stress; high temperature storage (18 ICs, 168 h at 150°C)*, electr. test at 0, 16, 24 and 168 h at 70°C; thermal cycles (2000 cycles, upper temperature +150°C)*, electr. test at 0, 1000, 2000 cycles at 70°C; humidity test (85°C, 85% RH, V_CC = 7 V), electr. test at 0, 500, 1000, 2000 h at 70°C, failure analysis at 500, 1000, 2000 h (recovery 1-2 h, electr. test within 8 h after recovery); humidity test (120°C, 85% RH, V_CC = 5.5 V), electr. test at 0, 96, 192, 408 h** at 70°C, failure analysis at 192, 408 h** (recovery 1-2 h, electr. test within 8 h after recovery); investigations of latch-up (for CMOS), hot carriers, dielectric breakdown, electromigration, soft errors

Reliability Tests (154 ICs): screening (e.g. MIL-STD 883 class B without internal visual inspection); 150 ICs, 2000 h burn-in at 125°C, electr. test at 0, 16, 64, 250, 1000 and 2000 h at 70°C, failure analysis at 16, 64, 250, 1000 and 2000 h

All three branches are followed by Failure Analysis and the Final Report.

Figure 3.4 Principle of functional and AC testing for LSI and VLSI ICs (labels: test vector, expected output, result, strobe delayed by the specified propagation time)

The functional test is performed together with the verification of the dynamic parameters, as shown in Figure 3.4. The generator in Fig. 3.4 delivers one row after another of the truth table which has to be verified, with a frequency f0. For a 40-pin IC, these are 40-bit words. Of these binary words, called test vectors, the inputs are applied to the device under test (DUT) and the expected outputs to a logical comparator. The actual outputs from the DUT and the expected outputs are compared at a time point selected with high accuracy by a strobe. Modern VLSI automatic test equipment (ATE) for digital ICs has test frequencies f0 > 600 MHz and an overall precision better than 200 ps (resolution < 30 ps). In a VLSI ATE not only the strobe but other pulses can be varied over a wide range. The dynamic parameters can be verified in this way. However, the direct measurement of a time delay or of a rise time is in general time-consuming. The main problem with a functional test is that it is not possible to verify all the states and state sequences of a VLSI IC. To see this, consider for instance that for an n x 1 cell memory there are 2^n states and n! possible address sequences; the corresponding truth table would contain 2^n · n! rows, giving more than 10^100 for n = 64. The procedure used in practical applications takes into account one or more of the following:

partitioning the device into modules and testing each of them separately,

finding out regularities in the truth table or given by technological properties,

limiting the test to the part of the truth table which is important for the application under consideration.
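The size of the complete truth table quoted above for an n x 1 cell memory can be checked with exact integer arithmetic. The sketch below also converts the number of rows into a test duration, assuming (as an illustration only) that one row is applied per cycle at a test frequency of 600 MHz.

```python
import math


def truth_table_rows(n):
    """Rows of the complete truth table of an n x 1 cell memory: 2^n states times n! sequences."""
    return 2**n * math.factorial(n)


rows = truth_table_rows(64)
order_of_magnitude = len(str(rows)) - 1
print(f"2^64 * 64! is of the order of 10^{order_of_magnitude}")

# Assumed test rate: one row per cycle at 600 MHz
seconds = rows / 600e6
years = seconds / (365 * 24 * 3600)
print(f"exhaustive test duration: about {years:.1e} years")
```

Even for this very small memory an exhaustive test is out of reach by many orders of magnitude, which is why partitioning, regularities, and application-specific limitation are unavoidable.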

The above limitations raise the question of test coverage, i.e. the percentage of faults which are detected by the test. A precise answer to this question can only be given in some particular cases, because information about the faults which actually appear in a given IC is often lacking. Fault models, such as stuck-at-zero, stuck-at-one, or bridging are useful for PCB testing, but generally of limited utility for a test engineer at the component level.

For packaged VLSI ICs, the electrical test should be performed at 70°C or at the highest specified operating temperature.

3.2.2 Characterization of Complex ICs

Characterization is a parametric, experimental analysis of the electrical properties of a given IC. Its purpose is to investigate the influence of different operating conditions such as supply voltage, temperature, frequency, and logic levels on the IC's behavior and to deliver a cost-effective test program for incoming inspection. For this reason a characterization is performed at 3 to 5 different temperatures and with a large number of different patterns.

Figure 3.5 Example of test patterns for memories (see Table 3.2 for pattern sensitivity); patterns shown: Checkerboard, March, Diagonal, Surround, Butterfly, Galloping one


Table 3.2 Kindness of various test patterns for detecting faults in SRAMs, and approximate test times for a 100 ns 128 K x 8 SRAM (tests on a Sentry S50, scrambling table with IDS5000 EBT)

Test pattern | Functional: D, H, S, O | Functional: C* | Dyn. parameters: A, RA | Dyn. parameters: C** | Number of test steps | Approx. test time [s], bit addr. | Approx. test time [s], word addr.
Checkerboard | fair | poor | - | - | 4n | ... | ...
March | good | poor | poor | - | 5n | ... | ...
Diagonal | good | fair | poor | poor | 10n | 1 | 0.13
Surround | good | good | fair | fair | 26n - 16√n | 2.7 | 0.34
Butterfly | good | good | good | fair | 8n^(3/2) + 2n | 8·10^2 | 38
Galloping one | good | good | good | good | 4n^2 + 6n | 4·10^5 | 7·10^3

A = addressing, C = cap. coupling, D = decoder, H = stuck at 0 or at 1, O = open, S = short, RA = read amplifier recovery time, * pattern dependent, ** pattern and level dependent

Referring to the functional and AC measurements, Figure 3.5 shows some basic patterns for memories. These patterns are generally performed twice, direct and inverse. For the patterns of Fig. 3.5, Table 3.2 gives a qualitative indication of the corresponding pattern sensitivity for static random access memories (SRAMs), and the approximate test time for a 128 K x 8 SRAM. Quantitative evaluation of pattern sensitivity or of test coverage is seldom possible, in general because of the limited validity of the fault models available (Sections 4.2.1 and 5.2.2). As shown in Table 3.2, test time strongly depends on the pattern selected. As test times greater than 10 s per pattern are long also in the context of a characterization (the same pattern will be repeated several thousand times, see e.g. Fig. 3.6), development of efficient test patterns is mandatory [3.2 (1989), 3.6, 3.16, 3.19]. For such investigations, knowledge of the relationship between address and physical location (scrambling) of the corresponding cell on the chip is important. If design information is not available, an electron beam tester (EBT) can be used to establish the scrambling table.

An important evaluation tool during a characterization of complex ICs is the shmoo plot. A shmoo plot is the representation in an x/y-diagram of the operating region of an IC as a function of two parameters. As an example, Fig. 3.6 gives the shmoo plots for t_A versus V_CC of a 128 K x 8 SRAM for two patterns and two ambient temperatures [3.6]. For Fig. 3.6, the test pattern has been performed about 4000 times (2 x 29 x 61), each with a different combination of V_CC and t_A. If no fault is detected, an × or a • is plotted (defective cells are generally retested once, to confirm the fault). As shown in Fig. 3.6, a small (probably capacitive) coupling between nearby cells exists for this device, as the butterfly pattern is more sensitive than the diagonal pattern to this kind of fault. Statistical evaluation of shmoo plots is often done with composite shmoo-plots in which each record is labeled in 10% steps.
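The strong dependence of test time on the selected pattern follows directly from the number of test steps given in Table 3.2. The sketch below recomputes approximate test times for a 128 K x 8 SRAM, assuming that each test step takes one 100 ns cycle; this one-step-per-cycle assumption and the assignment of the step counts 4n and 5n to the checkerboard and march patterns are readings of the table, not statements of the text.

```python
import math

CYCLE_TIME_S = 100e-9  # assumed duration of one test step (100 ns SRAM)

# Number of test steps as a function of the number n of addressed cells
PATTERN_STEPS = {
    "Checkerboard": lambda n: 4 * n,
    "March": lambda n: 5 * n,
    "Diagonal": lambda n: 10 * n,
    "Surround": lambda n: 26 * n - 16 * math.sqrt(n),
    "Butterfly": lambda n: 8 * n**1.5 + 2 * n,
    "Galloping one": lambda n: 4 * n**2 + 6 * n,
}


def pattern_time_s(pattern, n_cells, cycle_time_s=CYCLE_TIME_S):
    """Approximate test time in seconds for one pass of the given pattern."""
    return PATTERN_STEPS[pattern](n_cells) * cycle_time_s


N_BIT = 128 * 1024 * 8   # bit addressing
N_WORD = 128 * 1024      # word addressing

for name in PATTERN_STEPS:
    bit_time = pattern_time_s(name, N_BIT)
    word_time = pattern_time_s(name, N_WORD)
    print(f"{name:14s} bit addr.: {bit_time:10.3g} s   word addr.: {word_time:10.3g} s")
```

Under these assumptions the linear patterns stay near or below one second, while the galloping pattern with bit addressing takes several days, which makes clear why efficient patterns are needed when a pattern is repeated thousands of times for a shmoo plot.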


Figure 3.6 Shmoo plots of a 100 ns 128 K x 8 SRAM for test patterns a) Diagonal and b) Butterfly at two ambient temperatures, 0°C (•) and 70°C (×); axes: t_A versus V_CC

From the above considerations one recognizes that in general only a small part of the possible states and state sequences can be tested. The definition of appropriate test patterns must thus pay attention to the specific device, its technology and regularities in the truth table, as well as to information about its application and experience with similar devices [3.2 (1989), 3.6]. A close cooperation between test engineer and user, and also if possible with the device designer and manufacturer, can help to reduce the amount of testing.

As stated in Section 3.2.1, measurement of DC parameters presents no difficulties. As an example, Table 3.3 gives some results for an application specific CMOS-IC (ASIC) specially developed for high noise immunity.

Table 3.3 DC parameters for a 40 pin CMOS ASIC specially developed for high noise immunity and with Schmitt-trigger inputs (20 ICs)

Recoverable column labels: min, mean, max (V). Recoverable values, in source order: 0.52, 2.65, 2.76, 2.85, 0.44, 3.19, 3.33, 3.44, 0.44, 3.89, 3.97, 4.09, 0.60, 2.70, 2.75, 2.85, 0.52, 3.19, 3.32, 3.44, 0.48, 3.79, 3.93, 4.04

3.2.3 Environmental and Special Tests of Complex ICs

The aim of environmental and special tests is to submit a given IC to stresses which can be more severe than those encountered in field operation, in order to investigate technological limits and failure mechanisms. Such tests are often destructive. A failure analysis after each stress is important to evaluate failure mechanisms and to detect degradation (Section 3.3). Kind and extent of environmental and special tests depend on the intended application (G_F for Fig. 3.3) and specific characteristics of the component considered. The following is a description of the environmental and special tests given in Fig. 3.3 (considerations on production related potential reliability problems are in Sections 3.3 & 3.4, see also Figs. 3.7, 3.9, 3.10):

1. Internal Visual Inspection: Two ICs are inspected and then kept as a reference for comparative investigation (check for damage after stresses). Before opening (using wet chemical or plasma etching), the ICs are X-rayed to locate the chip and to detect irregularities (package, bonding, die attach, etc.) or impurities. After opening, inspection is made with optical microscopes (conventional x1,000 and/or stereo x100). Improper placement of bonds, excessive height and looping of the bonding wires, contamination, etching, or metallization defects can be seen. Many of these deficiencies often have only a marginal effect on reliability. Figure 3.7a shows a limiting case (mask misalignment). Figure 3.7b shows voids in the metallization of a 1M DRAM.

2. Passivation Test: Passivation (glassivation) is the protective coating, usually silicon dioxide (PSG) and/or silicon nitride, placed on the entire (die) surface. For ICs in plastic packages it should ideally be free from cracks and pinholes. To check this, the chip is immersed for about 5 min in a 50°C warm mixture of nitric and phosphoric acid and then inspected with an optical microscope (e.g. as in MIL-STD-883 method 2021 [3.11]). Cracks occur in a silicon dioxide passivation if the content of phosphorus is < 2%. However, more than 4% phosphorus activates the formation of phosphoric acid. As a solution, silicon nitride passivation (often together with silicon dioxide in separate layers) has been introduced. Such a passivation shows much more resistance to the penetration of moisture (see humidity tests in Point 8 below) and of ionic contamination.

3. Solderability: Solderability of tinned pins should no longer constitute a problem today, except after a very long storage time in a non-protected ambient or after a long burn-in or high-temperature storage. However, problems can arise with gold or silver plated pins, see Section 5.1.5.4. The solderability test is performed according to established standards (e.g. IEC 60068-2 or MIL-STD-883 [3.8, 3.11]) after conditioning, generally using the solder bath or the meniscograph method.

4. Electrostatic Discharge (ESD): Electrostatic discharges during handling, assembling, and testing of electronic components and populated printed circuit boards (PCBs) can destroy or damage sensitive components, particularly semiconductor devices. All IC families and many discrete electronic components are sensitive to ESD. Integrated circuits have in general protection circuits, passive and more recently active (better protection by a factor ≥ 2). To determine ESD immunity, i.e. the voltage value at which damage occurs, different pulse shapes (models) and procedures to perform the test have been proposed. For semiconductor devices, the human body model (HBM) and the charged device model (CDM) are the most widely used. The CDM seems to apply better than the HBM in reproducing some of the damage observed in field applications (see Section 5.1.4 for further details). Based on the experience gained in qualifying 12 memory types according to Fig. 3.3 [3.2 (1993), 3.6], the following procedure can be suggested for the HBM:

1. 9 ICs divided into 3 equal groups are tested at 500, 1,000, and 2,000 V, respectively. Taking note of the results obtained during these preliminary tests, 3 new ICs are stressed with steps of 250 V up to the voltage at which damage occurs ($V_{ESD}$). 3 further ICs are then tested at $V_{ESD}$ − 250 V to confirm that no damage occurs.

2. The test consists of 3 positive and 3 negative pulses applied to each pin within 30 s. Pulses are generated by discharging a 100 pF capacitor through a 1.5 kΩ resistor placed in series with the capacitor (HBM), wiring inductance < 10 µH. Pulses are applied between pin and ground, unused pins open.

3. Before and after each test, leakage currents (when possible with the limits ±1 pA for open and ±200 nA for short) and electrical characteristics are measured (electrical test as after any other environmental test).

Experience shows that an electrostatic discharge often occurs between 1,000 and 4,000 V. The model parameters of 100 pF and 1.5 kΩ for the HBM are average values measured with humans (80 to 500 pF, 50 to 5,000 Ω, 2 kV on a synthetic floor and 0.8 kV on an antistatic floor with a relative humidity of about 50%). A new model for latent damages caused by ESD has been developed in [3.61 (1995)]. Protection against ESD is discussed in Sections 5.1.4 and 5.1.5.4, see also Section 3.3.4.
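The HBM parameters given above (100 pF discharged through 1.5 kΩ) fix the order of magnitude of the stress pulse. The following is a minimal sketch, assuming an ideal RC discharge with the wiring inductance neglected; the function name `hbm_discharge` is hypothetical and only serves the illustration.

```python
def hbm_discharge(v_charge, c=100e-12, r=1.5e3):
    """Ideal RC discharge of the human body model (assumption: no inductance).

    Returns (peak current in A, time constant in s, stored energy in J).
    """
    i_peak = v_charge / r             # initial current at the start of the pulse
    tau = r * c                       # exponential decay time constant
    energy = 0.5 * c * v_charge ** 2  # energy stored in the capacitor
    return i_peak, tau, energy


# Preliminary test voltages of the suggested HBM procedure
for v in (500, 1000, 2000):
    i_peak, tau, energy = hbm_discharge(v)
    print(f"{v:5d} V: I_peak = {i_peak:.2f} A, "
          f"tau = {tau * 1e9:.0f} ns, E = {energy * 1e6:.1f} uJ")
```

Under these assumptions the time constant is 150 ns regardless of the charging voltage, while peak current scales linearly and stored energy quadratically with voltage.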

Figure 3.7 Failure analyses on ICs (Rel. Laboratory at the ETH Zurich); see also Figs. 3.9 & 3.10
a) Alignment error at a contact window (SEM, ×10,000)
b) Opens in the metallization of a 1M DRAM bit line, due to particles present during the photolithographic process (SEM, ×2,500)
c) Cross section through two trench-capacitor cells of a 4M DRAM (SEM, ×5,000)
d) Silver dendrites near an Au bond ball (SEM, ×800)
e) Electromigration in a 16K Schottky TTL PROM after 7 years field operation (SEM, ×500)
f) Bond wire damage (delamination) in a plastic-packaged device after 500 × −50 / +150°C thermal cycles (SEM, ×500)

5. Technological Characterization: Technological investigations are performed to check technological and process parameters with respect to adequacy and maturity. The extent of these investigations can range from a simple check (Fig. 3.7c) to a comprehensive analysis, because of detected weaknesses. Refinement of techniques and evaluation methods for technological characterization is in progress, see e.g. [3.31 - 3.65, 3.71 - 3.89]. The following is a simplified, short description of some important technological characterization methods:

Latch-up is a condition in which an IC latches into a nonoperative state drawing an excessive current (often a short between power supply and ground), and can only be returned to an operating condition through removal and reapplication of the power supply. It is typical for CMOS structures, but can also occur in other technologies where a PNPN structure appears. Latch-up is primarily induced by voltage overstresses (on signals or power supply lines) or by radiation. Modern devices often have a relatively high latch-up immunity (up to 200 mA injection current). A verification of latch-up sensitivity can become necessary for some special devices (ASICs for instance). Latch-up tests simulate voltage overstresses on signal and power supply lines as well as power-on / power-off sequences.

Hot Carriers arise in micron and submicron MOSFETs as a consequence of high electric fields ($10^4$ to $10^5$ V/cm) in transistor channels. Carriers may gain sufficient kinetic energy (some eV, compared to 0.02 eV in thermal equilibrium) to surmount the potential barrier at the oxide interface. The injection of carriers into the gate oxide is generally followed by the creation of electron-hole pairs and causes an increasing degradation of the transistor parameters, in particular an increase with time of the threshold voltage $V_{TH}$ which can be measured in NMOS transistors. Effects on VLSI and ULSI ICs are an increase of switching times (access times in RAMs for instance), possible data retention problems (soft writing in EPROMs), and in general an increase of noise. Degradation through hot carriers is accelerated with increasing drain voltage and lowering temperature (negative activation energy of about −0.03 eV). The test is generally performed under dynamic conditions, at high power supply voltages (7 to 9 V) and at low temperatures (−70 to −20°C).
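The effect of a negative activation energy can be made concrete with the Arrhenius model (introduced with accelerated tests in Section 7.4). The following is a minimal sketch, assuming the usual form $A = \exp[(E_a/k)(1/T_{ref} - 1/T_{stress})]$, a reference temperature of 25°C chosen only for illustration, and temperature as the only stress (drain voltage is ignored); `arrhenius_factor` is a hypothetical helper name.

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def arrhenius_factor(ea_ev, temp_ref_c, temp_stress_c):
    """Arrhenius acceleration factor between a reference and a stress temperature.

    Values > 1 mean the mechanism runs faster at the stress temperature.
    """
    t_ref = temp_ref_c + 273.15        # convert to absolute temperature (K)
    t_stress = temp_stress_c + 273.15
    return math.exp(ea_ev / K_BOLTZMANN_EV * (1.0 / t_ref - 1.0 / t_stress))


# Hot carriers: Ea of about -0.03 eV, test at -70 degC versus an assumed 25 degC reference
a_hot = arrhenius_factor(-0.03, 25.0, -70.0)
print(f"acceleration at -70 degC: {a_hot:.2f}")
```

With these assumed values the factor is roughly 1.7, i.e. cooling accelerates the degradation only mildly, which is consistent with the test relying mainly on elevated supply voltage.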

Time-Dependent Dielectric Breakdown (TDDB) occurs in very thin gate oxide layers (< 20 nm) as a consequence of extremely high electric fields ($10^7$ to $10^8$ V/cm). The mechanism is described by the thermochemical ($E$) model up to about $10^7$ V/cm and by the carrier injection ($1/E$) model up to about $2 \cdot 10^7$ V/cm. An approach to unify both models has been proposed in [3.48 (1999)]. As soon as the critical threshold is reached, breakdown takes place, often suddenly. The effects of gate oxide breakdowns are increased leakage currents or shorts between gate and substrate. The development in time of this failure mechanism depends on process parameters and oxide defects. Particularly sensitive are memories > 4M. An Arrhenius model can be used for the temperature. Time-dependent dielectric breakdown tests are generally performed on special test structures (often capacitors).

Electromigration is the migration of metal atoms, and also of Si at the Al/Si interface, as a result of very high current densities, see Fig. 3.7e for an example of a 16K TTL PROM after 7 years of field operation. Earlier limited to ECL, electromigration also occurs today with other technologies (because of scaling). The median $t_{50}$ of the failure-free time as a function of the current density $j$ and temperature $T$ can be obtained from the empirical model given by Black [3.46], $t_{50} = B \, j^{-n} e^{E_a / kT}$, where $E_a = 0.55$ eV for pure Al (0.75 eV for Al-Cu alloy), $n = 2$, and $B$ is a process-dependent constant. Electromigration tests are generally performed at wafer level on test structures. Measures to avoid electromigration are optimization of grain structure (bamboo structures), use of Al-Si-Cu alloys for the metallization and of compressive passivation, as well as introduction of multilayer metallizations.
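Because the process-dependent constant $B$ cancels in a ratio, Black's model can be used directly to compare a stress condition with a use condition. The following is a minimal sketch, assuming the parameter values quoted in the text for pure Al ($n = 2$, $E_a = 0.55$ eV); the current densities and temperatures in the example are hypothetical and chosen only for illustration.

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K


def black_t50(j, temp_k, b=1.0, n=2.0, ea_ev=0.55):
    """Median failure-free time from Black's model: t50 = B * j**(-n) * exp(Ea / kT)."""
    return b * j ** (-n) * math.exp(ea_ev / (K_BOLTZMANN_EV * temp_k))


def black_acceleration(j_use, t_use_k, j_stress, t_stress_k, n=2.0, ea_ev=0.55):
    """Ratio t50(use) / t50(stress); the constant B cancels out."""
    return (black_t50(j_use, t_use_k, n=n, ea_ev=ea_ev)
            / black_t50(j_stress, t_stress_k, n=n, ea_ev=ea_ev))


# Hypothetical example: current density doubled and temperature raised from 85 to 150 degC
a = black_acceleration(j_use=1e5, t_use_k=358.15, j_stress=2e5, t_stress_k=423.15)
print(f"acceleration factor: {a:.1f}")
```

With $n = 2$, doubling the current density alone shortens $t_{50}$ by a factor of 4; the temperature term then multiplies this factor.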

Soft errors can be caused by the process or chip design as well as by process deviations. Key parameters are MOSFET threshold voltages, oxide thickness, doping concentrations, and line resistance. If for instance the post-implant of a silicon layer has been improperly designed, its conductivity might become too low. In this case, the word lines of a DRAM could suffer from signal reductions and at the end of the word line soft errors could be observed on some cells. As a further example, if logical circuits with different signal levels are unshielded and arranged close to the border of a cell array, stray coupling may destroy the information of cells located close to the circuit (chip design problem). Finally, process deviations can cause soft errors. For instance, signal levels can be degraded when metal lines are locally reduced to less than half of their width by the influence of dirt particles. The characterization of soft errors is difficult in general. At the chip level, an electron beam tester allows the measurement of signals within the chip circuitry. At the wafer level, single test structures located in the space between the chips (kerf) can be used to measure and characterize important parameters independently of the chip circuitry. These structures can usually be contacted by needles, so that a well equipped bench setup with high-resolution I-V and C-V measurement instrumentation would be a suitable characterization tool.

Data Retention and Program/Erase Cycles are important for nonvolatile memories (EPROM, EEPROM, Flash). A test for data retention generally consists of storage (bake) at high temperature (2,000 h at 125°C for plastic packages and 500 h at 250°C for ceramic packages) with an electrical test at 70°C at 0, 250, 500, 1,000, and 2,000 h (often using a checkerboard pattern with measurement of $t_{AA}$ and of the margin voltage). Experimental investigation of EPROM data retention at temperatures higher than 250°C has shown a deviation from the charge loss predicted by the thermionic model [3.6, 3.36]. Typical values for program/erase cycles during a qualification test are 100 for EPROMs and 10,000 for EEPROMs and Flash memories.

6. High-Temperature Storage: The purpose of high-temperature storage is the stabilization of the thermodynamic equilibrium, and consequently of the IC's electrical parameters. Failure mechanisms related to surface problems (contamination, oxidation, contacts, charge induced failures) are activated. To perform the test, the ICs are placed on a metal tray (pins on the tray to avoid thermal voltage stresses) in an oven at 150°C for 200 h. Should solderability be a problem, a protective atmosphere (N2) can be used. Experience shows that for a mature technology (design and production processes), high-temperature storage produces only very few failures (see also Section 8.2.2).

7. Thermal Cycles: The purpose of thermal cycles is to test the IC's ability to support rapid temperature changes. This activates failure mechanisms related to mechanical stresses caused by mismatch in the expansion coefficients of the materials used, as well as wearout because of fatigue, see Fig. 3.7f for an example. Thermal cycles are generally performed from air to air in a two-chamber oven (transfer from one chamber to the other with a lift). To perform the test, the ICs are placed on a metal tray (pins on the tray to avoid thermal voltage stresses) and subjected to 2,000 thermal cycles from −65°C (+0, −10) to +150°C (+15, −0), transfer time ≤ 1 min, time to reach the specified temperature ≤ 15 min, dwell time at the temperature extremes ≥ 10 min. Should solderability be a problem, a protective atmosphere (N2) can be used. Experience shows that for a mature technology (design and production processes), failures should not appear before some thousand thermal cycles (lower figures for power devices).

8. Humidity or Damp Heat Test, 85/85 and pressure cooker: The aim of humidity tests is to investigate the influence of moisture on the chip surface, in particular corrosion. The following two procedures are often used:

(i) Atmospheric pressure, 85 ± 2°C and 85 ± 5% rel. humidity (85/85 test) for 168 to 5,000 h.

(ii) Pressurized steam, 110 ± 2°C or 120 ± 2°C or 130 ± 2°C and 85 ± 5% rel. humidity (pressure-cooker test or highly accelerated stress test (HAST)) for 24 to 408 h (1,000 h for silicon nitride passivation).

In both cases, a voltage bias is applied during exposure in such a way that power consumption is as low as possible, while the voltage is kept as high as possible (reverse bias with adjacent metallization lines alternately polarized high and low, e.g. 1 h on / 3 h off intermittently if power consumption is greater than 0.01 W). For a detailed procedure one may refer to IEC 60749 [3.8]. In the procedure of Fig. 3.3, both 85/85 and HAST tests are performed in order to correlate results and establish (empirically) a conversion factor. Of great importance for applications is the relation between the failure rates at elevated temperature and humidity (e.g. 85/85 or 120/85) and at field operating conditions (e.g. 40/60). A large number of models have been proposed in the literature to empirically fit the acceleration factor $A$ associated with the 85/85 test

$$A = \frac{\text{mean time to failure at lower stress } (\theta_1 / RH_1)}{\text{mean time to failure at 85/85 } (\theta_2 / RH_2)}$$

The most important of these models are

In Eqs. (3.2) to (3.6), $E_a$ is the activation energy, $k$ the Boltzmann constant ($8.6 \cdot 10^{-5}$ eV/K), $\theta$ the temperature in °C, $T$ the absolute temperature (K), $RH$ the relative humidity, and $C_1$ to $C_4$ are constants. Equations (3.2) to (3.6) are based on the Eyring model (Eq. (7.59)); the influence of the temperature and the humidity is multiplicative in Eqs. (3.2) to (3.5). Eq. (3.2) has the same structure as in the case of electromigration (Eq. (7.60)). In all models, the technological parameters (type, thickness, and quality of the passivation, kind of epoxy, type of metallization, etc.) appear indirectly in the activation energy $E_a$ or in the constants $C_1$ to $C_4$. Relationships for HAST are more empirical. From the above considerations, 85/85 and HAST tests can be used as accelerated tests to assess the effect of damp heat combined with bias on ICs by accepting a numerical uncertainty in calculating the acceleration factor. As a global value for the acceleration factor referred to operating field conditions of 40°C and 60% RH, one can assume for PSG a value between 100 and 150 for the 85/85 test and between 1,000 and 1,500 for the 120/85 test. To assure 10 years field operation at 40°C and 60% RH, PSG-ICs should thus pass without evident corrosion damage about 1,000 h at 85/85 or 100 h at 120/85. Practical results show that silicon-nitride glassivation offers a much greater resistance to moisture than PSG by a factor up to 10 [3.6].
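The test durations quoted above follow from dividing the required field time by the acceleration factor. The following is a minimal sketch of that arithmetic, assuming 8,760 h per year and the global acceleration factors given in the text for PSG; `required_test_hours` is a hypothetical helper name.

```python
HOURS_PER_YEAR = 8760  # assumption: continuous operation, 365 days per year


def required_test_hours(field_years, acceleration_factor):
    """Test duration equivalent to the given field time at the stated acceleration."""
    return field_years * HOURS_PER_YEAR / acceleration_factor


# Global acceleration factors for PSG referred to 40 degC / 60% RH (from the text)
for test, (a_low, a_high) in {"85/85": (100, 150), "120/85": (1000, 1500)}.items():
    longest = required_test_hours(10, a_low)    # pessimistic (smallest) factor
    shortest = required_test_hours(10, a_high)  # optimistic (largest) factor
    print(f"{test}: {shortest:.0f} to {longest:.0f} h for 10 years in the field")
```

Taking the smaller acceleration factor of each range and rounding up gives the figures of about 1,000 h at 85/85 and 100 h at 120/85.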

Also related to the effects of humidity is metal migration in the presence of reactive chemicals and voltage bias, leading to the formation of conductive paths (dendrites) between electrodes, see an example in Fig. 3.7d. A further problem related to plastic-packaged ICs is that of bonding a gold wire to an aluminum contact surface. Because of the different interdiffusion constants of gold and aluminum, an inhomogeneous intermetallic layer (Kirkendall voids) appears at high temperature and/or in the presence of contaminants, considerably reducing the electrical and mechanical properties of the bond. Voids grow into the gold surface like a plague, from which the name purple plague derives. Purple plague was an important reliability problem in the sixties. It propagates exponentially at temperatures greater than about 180°C. Although almost generally solved (bond temperature, Al alloy, metallization thickness, wire diameter, etc.), verification after high-temperature storage and thermal cycles is a part of a qualification test, especially for ASICs and devices in small-scale production.

Table 3.4 Indicative values for failure modes of electronic components (%)

Component rows legible in the source: Resistors, variable (Cermet); Capacitors (foil, ceramic, Ta (solid)); Coils; Relays; Quartz crystals.

Ta (solid): 80, 15, 5, -

Notes: * input and output half each; short to VCC or to GND half each; + no output; improper output; fail to off; # localized wearout; fail to trip / spurious trip = 3/2


3.2.4 Reliability Tests

The aim of a reliability test for electronic components is to obtain information about the

• failure rate,
• long-term behavior of critical parameters,
• effectiveness of screening to be performed at the incoming inspection.

The test consists in general of a dynamic burn-in with electrical measurements and failure analysis at appropriate time points (Fig. 3.3), also including some components which have not failed (to check for degradation). The number $n$ of devices under test can be estimated from the predicted failure rate $\lambda$ (Section 2.2.4) and the acceleration factor $A$ (Eq. (7.56)) in order to expect 3 to 6 failures ($k$) during a burn-in of duration $t$ ($n = k / (\lambda A t)$). Half of the devices can be submitted to a screening (Section 8.2.2) in order to better isolate early failures (Fig. 3.3). Statistical data analyses are given in Section 7.2 and Appendix A8.
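The sample-size estimate $n = k / (\lambda A t)$ is simple to evaluate. The following is a minimal sketch; the numerical values (failure rate, acceleration factor, burn-in duration) are hypothetical and chosen only to illustrate the orders of magnitude, and `devices_under_test` is a hypothetical helper name.

```python
import math


def devices_under_test(k_failures, failure_rate_per_h, acceleration, duration_h):
    """Number of devices n such that about k failures are expected: n = k / (lambda * A * t)."""
    return k_failures / (failure_rate_per_h * acceleration * duration_h)


# Hypothetical example: lambda = 2e-7 1/h, A = 50, t = 2000 h of dynamic burn-in
for k in (3, 6):
    n = devices_under_test(k, 2e-7, 50, 2000)
    # Small tolerance before rounding up guards against floating-point noise
    print(f"k = {k}: n = {math.ceil(n - 1e-9)} devices")
```

The estimate shows why a high acceleration factor is needed: without acceleration ($A = 1$) the same example would require 50 times as many devices.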

3.3 Failure Modes, Failure Mechanisms, and Failure Analysis of Electronic Components

3.3.1 Failure Modes of Electronic Components

A failure mode is the symptom (local effect) through which a failure is observed. Typical failure modes are opens, shorts, drift, or functional faults for electronic components, and brittle rupture, creep, or cracking for mechanical components. Average values for the relative frequency of failure modes in electronic components are given in Table 3.4. The values given in Table 3.4 are indicative and have to be supplemented by application-specific results, as far as necessary.

The different failure modes of hardware, often influenced also by the specific application, cause difficulties in investigating the effect (consequence) of failure, and thus in the concrete implementation of redundancy (series if short, parallel if open). For critical situations it can become necessary to use quad redundancy (Section 2.3.6). Quad redundancy is the simplest fault tolerant structure which can accept at least one failure (short or open) of any one of the 4 elements involved in the redundancy.
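The single-failure tolerance of quad redundancy can be checked by enumeration. The following is a minimal sketch, assuming a topology of two parallel branches with two switching elements in series each (the exact structure is given in Section 2.3.6 and may differ), and assuming each element is either good, shorted, or open; all names are hypothetical.

```python
OK, SHORT, OPEN = "ok", "short", "open"


def conducts(state, commanded_on):
    """Behavior of one switching element under its failure state."""
    if state == SHORT:
        return True          # a shorted element always conducts
    if state == OPEN:
        return False         # an open element never conducts
    return commanded_on      # a good element follows the command


def quad_conducts(states, commanded_on):
    """Assumed quad structure: (E1 in series with E2) in parallel with (E3 in series with E4)."""
    e1, e2, e3, e4 = (conducts(s, commanded_on) for s in states)
    return (e1 and e2) or (e3 and e4)


def works(states):
    """The structure works if it conducts when commanded on and blocks when commanded off."""
    return quad_conducts(states, True) and not quad_conducts(states, False)


# Every single short or open on any one of the 4 elements
single_faults = [tuple(fault if i == pos else OK for i in range(4))
                 for pos in range(4) for fault in (SHORT, OPEN)]
print(all(works(s) for s in single_faults))
```

Under these assumptions every single short or open is tolerated, whereas some double failures are not (for instance two shorts in the same branch).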


3.3.2 Failure Mechanisms of Electronic Components

A failure mechanism is the physical, chemical, or other process which results in failure. A large number of failure mechanisms have been investigated in the literature, e.g. [3.31 - 3.65] & [3.71 - 3.89]. For some of them, appropriate physical explanations have been found. For others, the models are empirical and often of limited validity. Evaluation of models for failure mechanisms should be developed in two steps: (i) verify the physical validity of the model and (ii) give its analytical formulation with the appropriate set of parameters to fit the model to the data. In any case, experimental verification of the model should be performed with at least a second, independent experiment. The limits of the model should be clearly indicated. The two most important models used to describe failure mechanisms, the Arrhenius model and the Eyring model, are introduced in Section 7.4 with accelerated tests (Eqs. (7.56) - (7.60)). Models to describe the influence of temperature and humidity in damp heat tests have been given with Eqs. (3.2) - (3.6). A new model for latent damages caused by ESD is given in [3.61 (1995)]. Table 3.5 summarizes some important failure mechanisms for ICs, specifying influencing factors and the approximate distribution of the failure mechanisms for plastic-packaged ICs in industrial applications (GF in Table 2.3). The percentage of misuse and mishandling failures can vary over a large range (20 to 80%) depending on the design engineer using the device, the equipment manufacturer, and the end user. For ULSI ICs one can expect that the percentage of failure mechanisms related to oxide breakdown and hot carriers will grow in the future.

3.3.3 Failure Analysis of Electronic Components

The aim of a failure analysis is to investigate the failure mechanisms and find out the possible failure causes. A procedure for failure analysis of ICs (from a user's point of view) is shown in Fig. 3.8. It is based on the following steps and can be terminated as soon as the necessary information has been obtained:

1. Failure detection and description: A careful description of the failure, as observed in situ, and of the surrounding circumstances (operating conditions at the failure occurrence time) is important. Also necessary is information on the IC itself (type, manufacturer, manufacturing data, etc.), on the electrical circuit in which it was used, on the operating time, and if possible on the tests to which the IC was submitted previous to the final use (evaluation of possible damage, e.g. ESD). In a few cases the failure analysis procedure can be terminated, if evident mishandling or misuse failure can be confirmed.

2. Nondestructive analysis: The nondestructive analysis begins with an external visual inspection (mechanical damage, cracks, corrosion, burns, overheating, etc.), followed by an X-ray inspection (evident internal fault or damage) and a careful electrical test (Section 3.2.1). For ICs in hermetic packages, it can also be necessary to perform a seal test and if possible a dew-point test. The result of the nondestructive analysis is a careful description of the external failure mode and first information about possible failure causes and mechanisms. For evident failure causes, the failure analysis can be terminated.

Table 3.5 Some important failure mechanisms for ICs (failure mechanism; short description; causes; acceleration factors)

Bonding
• Purple plague. Short description: formation of an intermetallic layer at the interface between wire (Au) and metallization (Al), causing a brittle region (voids in Au due to diffusion) which can provoke bond lifting. Causes: different interdiffusion constants of Au and Al, bonding temperature, contamination, too thick metallization. Acceleration factors: temperature > 180°C ($E_a$ = 0.7 - 1.1 eV).
• Fatigue. Short description: mechanical fatigue of bonding wires or bonding pads because of thermomechanical stress (also because of vibrations at the resonance frequency for hermetically sealed devices). Causes: different expansion coefficients of the materials in contact (for hermetically sealed devices also wire resonance). Acceleration factors: thermal cycles with Δθ > 150°C (vibrations at res. freq. for hermetic dev.).

Surface
• Charge spreading (leakage currents, inversion). Short description: charge spread laterally from the metallization or along the isolation interface, resulting in an inversion layer outside the active region which can provide for instance a conduction path between two diffusion regions. Causes: contamination with Na+, K+, etc., too thin oxide layer (MOS), package material. Acceleration factors: $E$, $\theta_J$ ($E_a$ = 0.5 - 1.2 eV, up to 2 eV for linear ICs).

Metallization
• Corrosion. Short description: electrochemical or galvanic reaction in the presence of humidity and ionic contamination (P, Na, Cl, etc.), critical for PSG (SiO2) passivation with > 4% P (< 2% P gives cracks). Causes: humidity, voltage, contamination (Na+, Cl-, K+), cracks or pinholes in the passivation. Acceleration factors: $RH$, $E$, $\theta_J$ ($E_a$ = 0.5 - 0.7 eV).
• Metal migration. Short description: migration of metal atoms in the presence of reactive chemicals, water, and bias, leading to conductive paths (dendrites) between electrodes. Causes: humidity, voltage, migrating metals (Au, Ag, Pd, Cu, Pb, Sn), contaminant (encapsulant).
• Electromigration. Short description: migration of metal atoms (also of Si at contacts) in the direction of the electron flow, creating voids or opens in the structure. Causes: … temperature gradient, anomalies in the metallization. Acceleration factors: … 1 eV for large-grain Al.

Oxide
• Time-dependent dielectric breakdown (TDDB). Short description: breakdown of thin oxide layers occurring suddenly when sufficient charge has been injected to trigger a runaway process. Causes: high voltages, thin oxides, oxide defects. Acceleration factors: $E$, $\theta_J$ ($E_a$ = 0.2 - 0.4 eV for oxide with defects, 0.5 - 0.6 eV for intrinsic oxide).
• Ion migration (parasitic transistors, inversion). Short description: carrier injection in the gate oxide because of $E$ and $\theta_J$; creation of charges in the SiO2/Si interface. Causes: contamination with alkaline ions, pinholes, oxide or diffusion defects.

Others
• Intermetallic compound. Short description: formation of an intermetallic layer between metallization (Al) and substrate (Si). Causes: mask defects, overheating, pure Al.
• Hot carriers. Short description: injection of electrons because of high $E$. Causes: dimensions, diffusion profiles, $E$.
• α-particles. Short description: generation of electron-hole pairs by α-particles (DRAMs). Causes: package material, external radiation.
• Latch-up, etc. Short description: activation of PNPN paths. Causes: PNPN paths.
Acceleration factors listed for this group: external radiation, voltage overstress.

Misuse / Mishandling
Short description: electrical (ESD / EOS), thermal, mechanical, or climatic overstress. Causes: application, design, handling, test.

$E$ = electric field, $RH$ = relative humidity, $j$ = current density, $\theta_J$ = junction temperature, passivation (= glassivation), % = indicative distribution in percent

3. Semidestructive analysis: The semidestructive analysis begins by opening the package, mechanically for hermetic packages and with wet chemical (or plasma) etching for plastic ICs. A careful internal visual check is then performed with optical microscopes, conventional 1000× or stereo 100×.

This evaluation includes opens, shorts, state of the passivation / glassivation, bonding, damage due to ESD, corrosion, cracks in the metallization, electromigration, particles, etc. If the IC is still operating (at least partially), other procedures can be used to localize the fault on the die more accurately. Among these are the electron beam tester (or other voltage contrast techniques), liquid crystals (LC), infrared thermography (IRT), emission microscopy (EMMI), or one of the methods to detect irregular recombination centers, like electron beam induced current (EBIC) or optical beam induced current (OBIC). For further investigations it is then necessary to use a scanning electron microscope (SEM). The result of the semidestructive analysis is a careful description of the internal failure mode and improved information about possible failure causes and failure mechanisms. In the case of evident failure causes, the failure analysis procedure can be terminated.

4. Destructive analysis: A destructive analysis is performed if the previous investigations yield unsatisfactory results and there is a realistic chance of success through further analyses. After removal of the passivation and other layers (as necessary) an inspection is carried out with a scanning electron microscope supported by a material investigation (e.g. EDX spectrometry). Analyses are then continued using methods of microanalysis (electron microprobe, ion probe, diffraction, etc.) and performing microsections. The destructive analysis is the last possibility to recognize the original failure cause and the failure mechanisms involved. However, it cannot guarantee success, even with skilled personnel and suitable analysis equipment.

5. Failure mechanism analysis: This step implies a correct interpretation of the results from steps 1 through 4. Additional investigations have to be made in some cases, but questions related to failure mechanisms can still remain open. In general, feedback to the manufacturer at this stage is mandatory.

6. Final report: All relevant results of the steps 1 to 5 above and the agreed corrective actions must be included in a (short and clear) final report.

7. Corrective actions: Depending on the identified failure causes, appropriate corrective actions should be started. These have to be discussed with the IC manufacturer as well as with the equipment designer, manufacturer, or user, depending on the failure causes which have been identified.


1. Failure detection and description: component identification; reason / motivation for the analysis; operating conditions at failure; environmental conditions at failure

2. Nondestructive analysis: external visual inspection; X-ray microscope examination; ultrasonic microscope analysis; electrical test; high-temperature storage; seal test (possibly also a dew point test); some other special tests, as necessary

2a. Failure cause follows from the analysis of the external causes

3. Semidestructive analysis: package opening; optical microscope inspection; failure (fault) localization on the chip (liquid crystals, microthermography, electron beam tester, emission microscope, OBIC, EBIC, etc.); preliminary analysis with the scanning electron microscope (SEM)

3a. Failure cause follows from the …

4. Destructive analysis: material analysis at the surface (EDX); glassivation / passivation removal; material analysis (EDX); metallization removal; SEM examination; analysis in greater depth (possibly with microsections, FIB, TEM, SEM, etc.)

5. Failure mechanism analysis

6. Failure analysis report

7. Corrective actions (with manufacturer)

Figure 3.8 Basic procedure for failure analysis of electronic components (ICs as an example)

The failure analysis procedure described in Section 3.3.3 for ICs can be applied to other electronic or mechanical components and extended to cover populated printed circuit boards (PCBs) as well as subassemblies or assemblies.


3.3.4 Examples of VLSI Production-Related Reliability Problems

Production-related potential reliability problems, i.e. flaws or damages which can lead to failures, can occur for VLSI devices at packaging or soldering level (Fig. 3.10), as well as on silicon dies. Those on dies are often more difficult to identify. Following examples show three cases for production-related potential reliability problems on silicon dies, in grown difficulty with respect to their identification [3.49] (see also Fig. 3.7 for further exarnples).

Fig. 3.9a shows a contact step coverage flaw. The contact to a diffusion in bulk silicon is made by the first metal layer, which usually is protected by a barrier against Al penetration into bulk-silicon. However, the first metal layer often must adapt itself to some topography. Design rules make Sure that the contact is flat enough. However, if the contact slopes are too steep (e.g. etching process problem) the step coverage may be reduced. In this case, electric contact is often still given, but melting or electromigration rnay start, leading to a failure. OBIRCH (optical beam induced resistivity change) can help to detect such weak contacts.

Fig. 3.9b shows a wafer processing flaw. Semiconductor devices include at least one poly-Si layer, which usually performs MOS-transistor gates. It is isolated versus bulk silicon by a thin (some nm) gate-oxide, or by a more thick field oxide in active regions. The isolation against further poly-Si layers is given by a self-grown re-oxidation of the poly-Si surface and (in part) by doped silicate-glass (PSG, BPSG). In the structuration process of poly-Si (usually photolithography and plasma etching), an improper etching process may result in poly-Si residues or particles, which during subsequent re-oxidation form an irregular and thin oxide around themselves. A short at t = 0 will be avoided; however, a latent short path is created and a small voltage peak may be enough to breakdown the oxide causing a leakage path.

Figs. 3.9c and 3.9d show ESD damage giving failures at t = 0 or latent failures, formerly considered as mechanical surface damage. Silicon dies are often delivered as wafers to customers, who perform subsequent pre-assembly processes (wafer dicing, back grinding, and pick & place). These operations can entail great risks of electrostatic discharge from robotics equipment to the device via the device passivation (e.g. when the picker setup of the pneumatic handler moves rapidly on a Teflon bearing). The term ESDFOS (electrostatic discharge from outside-to-surface) has been introduced to describe this failure cause. Like a lightning strike, the electrostatic spark comes onto the passivation, cracks it, melts the aluminum of the top metal and cracks the interlevel dielectric (ILD), where the metal underneath locally melts and penetrates into the crack. Depending on the degree of Al penetration, the damage causes a failure at t = 0 or a latent failure. While great care has been taken with the ESD protection of persons and workplaces, less attention has often been paid to robotics tools. An extended ESD concept and periodic audits with survey and location of air ionizer fans, grounding concepts, materials, etc. are an effective method against this damage.

a) A steep slope topography causing bad contact coverage with Al (×5000)

b) Slightly oxidized poly residue (small white line) buried between a poly-Si gate and a neighboring contact (×5000)

c) Latent ESDFOS damage, see also Fig. 3.9d (×5000)

d) Short between two top metal layers as a consequence of an ESDFOS damage (×5000)

Figure 3.9 Examples of production-related (hidden) potential reliability problems in Si-dies [3.49] (see also Figs. 3.7 & 3.10)

3.4 Qualification Tests for Electronic Assemblies

As outlined in Section 3.2 for components, the purpose of a qualification test is to verify the suitability of a given item (electronic assemblies in this section) for a stated application. Such a qualification involves performance, environmental, and reliability tests, and has to be supported by a careful failure (fault) analysis. To be efficient, it should be performed on prototypes which are representative of the production line, in order to check not only the design but also the production process. Results of qualification tests are an important input to the critical design review (Table A3.3). This section deals with qualification tests of electronic assemblies, in particular of populated printed circuit boards (PCBs).

The aim of the performance test is similar to that of the characterization discussed in Section 3.2.2 for complex ICs. It is an experimental analysis of the electrical properties of the given assembly, with the purpose of investigating the influence of the most relevant electrical parameters on the behavior of the assembly at different ambient temperatures and power supply conditions (see Section 8.3 for considerations on electrical tests of PCBs).

Environmental tests have the purpose of submitting the given assembly to stresses which can be more severe than those encountered in the field, in order to investigate technological limits and failure mechanisms (see Section 3.2.3 for complex ICs). The following procedure, based on the experience with a large number of equipment [3.76], can be recommended for assemblies of standard technology used in high reliability (or safety) applications (total ≥ 10 assemblies):

1. Electrical behavior at extreme temperatures with functional monitoring, 100 h at -40°C, 0°C, and +80°C (2 assemblies).

2. 4,000 thermal cycles -40 / +120°C with functional monitoring, < 5°C/min or ≥ 20°C/min within the components according to the field application, ≥ 10 min dwell time at -40°C and ≥ 5 min at 120°C after the thermal equilibrium has been reached within ± 5°C (≥ 3 assemblies, metallographic analysis after 2,000 and 4,000 cycles).

3. Random vibrations at low temperature, 1 h with 2 - 6 g_rms, 20 - 500 Hz at -20°C (2 assemblies).

4. EMC and ESD tests (2 assemblies).

5. Humidity tests, 240 h 85/85 test (1 assembly).
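For test planning, a lower bound on the duration of the thermal cycling in point 2 can be estimated from the temperature swing, the ramp rate, and the dwell times. The following is a minimal sketch under stated assumptions: the ramp is taken at exactly the quoted rate, the dwell times at their minimum values, and the extra time needed to reach thermal equilibrium is ignored, so real test times will be longer. The function names are illustrative only.

```python
def min_cycle_minutes(t_low, t_high, ramp_c_per_min, dwell_low_min, dwell_high_min):
    """Lower bound on one thermal cycle: two ramps plus the two dwell times."""
    swing = t_high - t_low
    return 2.0 * swing / ramp_c_per_min + dwell_low_min + dwell_high_min


def min_test_hours(n_cycles, t_low, t_high, ramp_c_per_min, dwell_low_min, dwell_high_min):
    """Lower bound on the total cycling time in hours."""
    per_cycle = min_cycle_minutes(t_low, t_high, ramp_c_per_min,
                                  dwell_low_min, dwell_high_min)
    return n_cycles * per_cycle / 60.0


# 4,000 cycles -40 / +120 degC, dwell 10 min cold and 5 min hot (minimum values)
slow = min_test_hours(4000, -40, 120, 5.0, 10, 5)    # ramp at 5 degC/min
fast = min_test_hours(4000, -40, 120, 20.0, 10, 5)   # ramp at 20 degC/min
print(f"slow ramp: {slow:.0f} h, fast ramp: {fast:.0f} h")
```

Even under these optimistic assumptions the slow-ramp variant occupies a test chamber for several months, which is one reason why the number of assemblies and the analysis points have to be fixed early.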

Experience shows [3.76] that electronic equipment often behaves well even under extreme environmental conditions (operation at +120°C and -60°C, thermal cycles -40 / +120°C with up to 60°C/min within the components, humidity test 85/85, cycles of 4 h 95/95 followed by 4 h at -20°C, random vibrations 20 - 500 Hz at 4 g_rms and -20°C, ESD/EMC with pulses up to 15 kV). However, problems related to crack propagation in solder joints appear, and metallographic investigations on more than 1,000 microsections [3.76] confirm that cracks in solder joints are often initiated by production flaws, see Figs. 3.10d to 3.10f for some examples.

Many of the production flaws with inserted components can be avoided and would cause (if existent) only minor reliability problems. For instance, voids can be eliminated by a better plating of the through-holes (reduced surface roughness of the walls and optimization of the plating parameters). Since even voids up to 50% of the solder volume do not severely reduce the reliability of solder joints for inserted components, it is preferable to avoid rework. Poor wetting of the leads or the excessive formation of brittle intermetallic layers are major potential reliability problems for solder joints. Defects of this last kind must be avoided through a better production process.


More critical are surface mount devices (SMD), for which a detectable crack propagation in solder joints often begins after a few thousand thermal cycles. Extensive investigations [3.79, 3.80, 3.89] show that crack propagation is almost independent of pitch, at least down to a pitch of 0.3 mm. Experimental results indicate an increase in the reliability of solder joints of ICs with shrinking pitches, due to the increasing flexibility of the leads. A new model to describe the viscoplastic behavior of SMT solder joints has been developed in [3.89]. This model outlines the strong impact of deformation energy on damage evolution and points out, on the basis of experimental observations, that cracks begin in a locally restricted recrystallized area within the joint and propagate in a stripe along the main stress. The faster the deformation rate (the higher the thermal gradient) and the lower the temperature, the faster damage accumulates in the solder joint. Basically, two different deformation mechanisms are present: grain boundary sliding at rather low thermal gradient and dislocation climbing at higher thermal gradient. Hence, attention must be paid in defining environmental and reliability tests or screening procedures for assemblies in SMT (Section 8.3). In such a test, or screening, it is mandatory to activate only the failure mechanism which would also be activated in the field. Because of the elastic behavior of the components and PCB, the dwell time during thermal cycles also plays an important role. The dwell time must be long enough to allow relaxation of the stresses and depends on temperature, temperature swing, and materials stiffness. As for the thermal gradient, it is difficult to give general rules.

Reliability tests at the assembly and higher integration level have as a primary purpose the detection of all systematic failures (Section 7.7) and an estimation of the failure rate (Section 7.2.3). Precise information on the failure rate shape is seldom possible from qualification tests, because of cost and time limits. If reliability tests are necessary, the following procedure can be used (total ≥ 8 assemblies):

1. 4,000 h dynamic burn-in at 80°C ambient temperature (≥ 2 assemblies, functional monitoring, intermediate electrical tests at 24, 96, 240, 1,000, and 4,000 h).

2. 5,000 thermal cycles -20 / +100°C with < 5°C/min for applications with slow heat up and ≥ 20°C/min for rapid heat up, dwell time ≥ 10 min at -20°C and ≥ 5 min at 100°C after the thermal equilibrium has been reached within ± 5°C (≥ 3 assemblies, metallographic analysis after 1,000, 2,000, and 5,000 cycles; crack propagation can be estimated using a Coffin-Manson relationship of the form $N = A\,\varepsilon^{n}$ with $\varepsilon = (\alpha_B - \alpha_C)\, l\, \Delta\theta / d$ [3.89, 3.79], the parameter $A$ has to be determined with tests at different temperature swings).

3. 5,000 thermal cycles 0 / +80°C, with temperature gradient as in point 2 above, combined with random vibrations 1 g_rms, 20 - 500 Hz (≥ 3 assemblies, metallographic analysis after 1,000, 2,000, and 5,000 cycles).
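The Coffin-Manson relationship quoted in point 2 becomes linear after taking logarithms, log N = log A + n log ε, so its parameters can be fitted by least squares from tests at different temperature swings. The sketch below illustrates this under several assumptions that are not taken from the text: the coefficients in the strain expression are read as thermal expansion coefficients of board and component, l as a component length and d as a solder joint height, and all numerical values are synthetic, generated from a known A and n only to show that the fit recovers them.

```python
import math


def strain(alpha_b, alpha_c, length, delta_theta, d):
    """Strain (alpha_B - alpha_C) * l * delta_theta / d (dimensionless)."""
    return (alpha_b - alpha_c) * length * delta_theta / d


def fit_coffin_manson(strains, cycles):
    """Least-squares fit of log N = log A + n log eps; returns (A, n)."""
    xs = [math.log(e) for e in strains]
    ys = [math.log(c) for c in cycles]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    n = sxy / sxx
    a = math.exp(my - n * mx)
    return a, n


# Hypothetical geometry and materials (illustration only)
swings = [60.0, 100.0, 140.0]                        # temperature swings in K
eps = [strain(16e-6, 6e-6, 10e-3, dt, 0.1e-3) for dt in swings]

# Synthetic "test results" generated with A = 10 and n = -2
cycles_to_crack = [10.0 * e ** -2.0 for e in eps]

a_hat, n_hat = fit_coffin_manson(eps, cycles_to_crack)
print(f"A = {a_hat:.3f}, n = {n_hat:.3f}")
```

If the exponent n is already known for the solder alloy considered, only A remains to be determined and a single temperature swing would in principle suffice; using several swings, as recommended above, also checks that the power law holds over the range of interest.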


a) Void caused by an s-shaped pin gassing out in the area A (×20)

b) Flaw caused by the insertion of the insulation of a resistor network (×20)

c) Defect in the copper plating of a hole in a multilayer printed board (×50)

d) A row of voids along the pin of an SOP package (×30)

e) Soldering defect in a surface mounted resistor, area A (×30)

f) Detail A of Fig. 3.10e (×500)

Figure 3.10 Examples of production flaws responsible for the initiation of cracks in solder joints: a) - c) inserted devices, d) - f) SMD (Reliability Laboratory at the ETH Zurich); see also Figs. 3.7 & 3.9

Thermal cycles with random vibrations highly activate failure mechanisms at the assembly level, i.e. crack propagation in solder joints. If such a stress occurs in the field, insertion technology should be preferred to SMT. Figure 3.11 shows a comparative investigation of crack propagation [3.79 (1993)].

[Plot: crack length versus number of thermal cycles. Curves: QFP, 52 pins, pitch 0.65 mm (tin plated and unplated); SOP, 28 pins, pitch 1.27 mm (tin plated and unplated); ceramic capacitor (tin plated and unplated); MELF resistor (tin plated and unplated)]

Figure 3.11 Crack propagation in different SMD solder joints as a function of the number of thermal cycles (crack length in % of the solder joint length, mean over 20 values, thermal cycles -20/+100°C with 60°C/min inside the solder joint; Reliability Laboratory at the ETH Zurich)

4 Maintainability Analysis

At equipment and system level, maintainability has a great influence on reliability and availability. This holds in particular if redundancy has been implemented and redundant parts have to be repaired on line, i.e. without interruption of operation at system level. Maintainability is thus an important parameter in the optimization of availability and life-cycle cost. Achieving high maintainability in complex equipment and systems requires appropriate activities which must be started early in the design & development phase and be coordinated by a maintenance concept. To this concept belong failure recognition and isolation (built-in tests), partitioning of the equipment or system into (as far as possible) independent line replaceable units, and logistic support. A maintenance concept has to be tailored to the equipment or system considered. After some basic concepts (Section 4.1), Section 4.2 deals with a maintenance concept for complex equipment and systems. Section 4.3 considers maintainability aspects in design reviews and Section 4.4 presents methods and tools for maintainability prediction. Spare parts provisioning is investigated in Section 4.5, repair strategies in Section 4.6, and cost optimization in Section 4.7. Design guidelines for maintainability are given in Section 5.2. The influence of preventive maintenance, imperfect switching, and incomplete coverage on the system's reliability & availability is investigated in Section 6.8. For simplicity, repair is used as a synonym for restoration.

4.1 Maintenance, Maintainability

Maintenance defines all those activities performed on an item to retain it in or to restore it to a specified state. Maintenance thus includes preventive maintenance, carried out at predetermined intervals according to prescribed procedures to reduce the probability of failures or the degradation of the functionality of an item, and corrective maintenance, initiated after fault recognition and intended to bring the item into a state in which it can again perform the required function (Fig. 4.1). The aim of preventive maintenance must also be to detect and repair hidden failures, i.e. failures in redundant elements. Corrective maintenance is also known as repair (restoration) and can include any or all of the following steps: recognition, localization & diagnosis (isolation), correction (disassemble, remove, replace, reassemble, adjust), and function checkout. For simplicity, repair is hereafter used as a synonym for restoration. The time elapsed from the recognition of a failure until the start-up after failure correction, including all logistic delays (waiting for spare parts or tools), is the repair time (restoration time). Often, ideal logistic support, with no logistic delay, is assumed for calculations.

MAINTENANCE

PREVENTIVE MAINTENANCE (retainment of the item functionality)
• Test of all relevant functions, also to recognize hidden failures
• Activities to compensate for drift and to reduce wearout failures
• Overhaul to increase useful life

CORRECTIVE MAINTENANCE (reestablishment of the item functionality)
• Failure recognition
• Failure localization & diagnosis (isolation)
• Failure correction (removal)
• Function checkout

Figure 4.1 Maintenance tasks (failure can be replaced by fault, including defects and failures)

Maintainability is a characteristic of an item, expressed by the probability that preventive maintenance (serviceability) or repair (repairability) of the item will be performed within a stated time interval by given procedures and resources (number and skill level of the personnel, spare parts, test facilities, logistic support). If $\tau'$ and $\tau''$ are the (random) times required to carry out a repair or a preventive maintenance, then

$$\text{Repairability} = \Pr\{\tau' \le x\} \quad \text{and} \quad \text{Serviceability} = \Pr\{\tau'' \le x\}. \tag{4.1}$$

Considering $\tau'$ and $\tau''$ as interarrival times, the variable $x$ is used instead of $t$ in Eq. (4.1). For a rough characterization, the expected values (means) of $\tau'$ and $\tau''$

$$E[\tau'] = MTTR = \text{mean time to repair (restoration)}$$

$$E[\tau''] = MTTPM = \text{mean time to preventive maintenance}$$

are often used. Assuming $x$ as a parameter, Eq. (4.1) gives the distribution functions of $\tau'$ and $\tau''$, respectively. These distribution functions characterize the repairability and the serviceability of the item considered. Experience shows that $\tau'$ and $\tau''$ often exhibit a lognormal distribution (Eq. (A6.110)). The typical shape of the corresponding density is shown in Fig. 4.2. A characteristic of the lognormal density is the sudden increase after a period of time in which its value is practically zero, and the relatively fast decrease after reaching the maximum (modal value $x_M$).


Figure 4.2 Density of the lognormal distribution function for $\lambda = 0.6\,\mathrm{h}^{-1}$ and $\sigma = 0.3$ (dashed is the approximation given by a shifted exponential distribution with same mean)
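The quantities behind Fig. 4.2 can be evaluated numerically. The sketch below assumes the lognormal parameterization in which ln(λx) is normally distributed with mean 0 and standard deviation σ; this reading is an assumption, chosen because it gives a mean of about 1.75 h and a variance of about 0.29 h² for λ = 0.6 h⁻¹ and σ = 0.3. Function names are illustrative only.

```python
import math


def lognormal_pdf(x, lam, sigma):
    """Density g(x) of the repair time, zero for x <= 0."""
    if x <= 0.0:
        return 0.0
    z = math.log(lam * x) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


def lognormal_cdf(x, lam, sigma):
    """Repairability Pr{tau' <= x}."""
    if x <= 0.0:
        return 0.0
    z = math.log(lam * x) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def lognormal_mean(lam, sigma):
    return math.exp(sigma ** 2 / 2.0) / lam


def lognormal_var(lam, sigma):
    return (math.exp(2.0 * sigma ** 2) - math.exp(sigma ** 2)) / lam ** 2


def lognormal_mode(lam, sigma):
    """Location of the maximum of the density (modal value)."""
    return math.exp(-sigma ** 2) / lam


lam, sigma = 0.6, 0.3   # lam in 1/h, as in Fig. 4.2
print("MTTR =", round(lognormal_mean(lam, sigma), 3), "h")
print("Var  =", round(lognormal_var(lam, sigma), 3), "h^2")
print("mode =", round(lognormal_mode(lam, sigma), 3), "h")
print("Pr{tau' <= 2 h} =", round(lognormal_cdf(2.0, lam, sigma), 3))
```

The density stays practically at zero well below the modal value and falls off quickly after it, which matches the shape described for Fig. 4.2.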

This shape can be accepted, taking into consideration the main terms of a repair time (Fig. 4.1). However, calculations using a lognormal distribution can become time-consuming. In practical applications it is therefore useful to distinguish between the following two situations:

1. Investigation of maintenance times, often under the assumption of ideal logistic support: In this case, the actual distribution function must be considered, see Sections 7.3 and 7.5 for some examples with a lognormal distribution.

2. Investigation of the reliability and availability of repairable systems: The exact shape of the repair time distribution has in general less influence on the reliability and availability values at system level, as long as the MTTR is unchanged and MTTR << MTTF holds (Examples 6.7, 6.8, 6.9); in this case, the actual repair time distribution function can often be approximated by an exponential function with same mean.

A further possibility to Point 2 above is to use e.g. a shifted exponential distribution function (Examples 6.8 and 6.9). Figure 4.2 shows (dashed) an example with

$$g(x) = \begin{cases} 0 & \text{for } 0 \le x < \psi \\ \mu' e^{-\mu'(x - \psi)} & \text{for } x \ge \psi. \end{cases}$$

The parameter $\mu'$ of the exponential d.f. follows from the equality of the mean values

$$\psi + \frac{1}{\mu'} = MTTR.$$

For the numerical example given in Fig. 4.2 ($\lambda = 0.6\,\mathrm{h}^{-1}$, $\sigma = 0.3$; $MTTR = 1.75\,\mathrm{h}$, $\mathrm{Var} = 0.29\,\mathrm{h}^2$) one obtains $\psi = 0.99\,\mathrm{h}$ and $\mu' = 1.32\,\mathrm{h}^{-1}$. A shift which considers equal mean and variance leads to $\psi = 1.2\,\mathrm{h}$ and $\mu' = 1.9\,\mathrm{h}^{-1}$. For a deeper investigation, one can refer to Examples 6.7 - 6.9. In some cases, an Erlang distribution (Eq. (A6.102)) with $\beta \ge 3$ can be assumed for repair times, yielding simple results.
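The two parameter choices quoted for the numerical example can be checked with a few lines of arithmetic. The sketch assumes a shifted exponential with shift `psi` and rate `mu_prime`, whose mean is psi + 1/mu_prime and whose variance is 1/mu_prime²; the function names are illustrative only.

```python
import math


def rate_for_equal_mean(mttr, psi):
    """Rate mu' such that psi + 1/mu' equals the given mean, for a chosen shift psi."""
    return 1.0 / (mttr - psi)


def shift_and_rate_for_equal_mean_and_variance(mttr, var):
    """(psi, mu') such that psi + 1/mu' = mttr and 1/mu'**2 = var."""
    mu_prime = 1.0 / math.sqrt(var)
    psi = mttr - 1.0 / mu_prime
    return psi, mu_prime


mttr, var = 1.75, 0.29          # h and h^2, values of the numerical example

mu1 = rate_for_equal_mean(mttr, psi=0.99)
psi2, mu2 = shift_and_rate_for_equal_mean_and_variance(mttr, var)

print(f"equal mean, psi = 0.99 h:   mu' = {mu1:.2f} 1/h")
print(f"equal mean and variance:    psi = {psi2:.2f} h, mu' = {mu2:.2f} 1/h")
```

Matching the variance as well pushes the shift closer to the mean and makes the exponential part steeper, which is consistent with the narrow lognormal density for σ = 0.3.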

As in the case of the failure rate $\lambda(x)$, for a statistical evaluation of repair times ($\tau'$) it would be preferable to omit data attributable to systematic failures. For the remaining data, a repair rate $\mu(x)$ can be obtained from the distribution function $G(x) = \Pr\{\tau' \le x\}$, with density $g(x) = dG(x)/dx$, as per Eq. (A6.25)

$$\mu(x) = \lim_{\delta x \downarrow 0} \frac{1}{\delta x} \Pr\{x < \tau' \le x + \delta x \mid \tau' > x\} = \frac{g(x)}{1 - G(x)} \tag{4.3}$$

(considering that $\tau'$ starts anew at each repair (restoration), $x$ is used instead of $t$). In evaluating the maintainability achieved in the field, the influence of the logistic support must be considered. MTTR requirements are discussed in Appendix A3.1. MTTR estimation and demonstration is considered in Section 7.3.
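The ratio g(x) / (1 − G(x)) can be evaluated directly for any assumed repair time distribution. As an illustration only, the sketch below compares an exponential distribution, for which the repair rate is constant, with a lognormal distribution; the lognormal parameterization (ln(λx) normal with standard deviation σ) and all numerical values are assumptions made for this example.

```python
import math


def repair_rate(g, G, x):
    """Repair rate mu(x) = g(x) / (1 - G(x)) for density g and distribution function G."""
    return g(x) / (1.0 - G(x))


def exponential(mu):
    """Density and distribution function of an exponential repair time."""
    g = lambda x: mu * math.exp(-mu * x)
    G = lambda x: 1.0 - math.exp(-mu * x)
    return g, G


def lognormal(lam, sigma):
    """Density and distribution function of a lognormal repair time (x > 0)."""
    def g(x):
        z = math.log(lam * x) / sigma
        return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

    def G(x):
        return 0.5 * (1.0 + math.erf(math.log(lam * x) / (sigma * math.sqrt(2.0))))

    return g, G


g_exp, G_exp = exponential(0.5)      # constant repair rate 0.5 1/h
g_ln, G_ln = lognormal(0.6, 0.3)     # parameters as in Fig. 4.2

for x in (0.5, 1.0, 2.0, 3.0):       # repair time already elapsed, in h
    print(x, round(repair_rate(g_exp, G_exp, x), 4), round(repair_rate(g_ln, G_ln, x), 4))
```

For very large x the term 1 − G(x) approaches zero and the quotient becomes numerically unreliable; in that range a survival function computed directly would be preferable.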

4.2 Maintenance Concept

As for reliability, maintainability must be built into equipment and systems during the design and development phase. This holds in particular because maintainability cannot be easily predicted and a maintainability improvement often requires important changes in layout or construction of the item (system) considered. For these reasons, attaining a prescribed maintainability in complex equipment and systems generally requires the planning and realization of a maintenance concept. Such a concept deals with the following aspects:

1. Fault recognition and isolation, including checkout after repair (isolation can be subdivided into localization and diagnosis, and fault is used to consider failures and defects).

2. Partitioning of the equipment or system into independent line replaceable units (LRUs), i.e. spare parts at equipment or system level (line repairable, last repairable, or last replaceable is often used for line replaceable).

3. Preparation of the user documentation (operating & maintenance manuals).

4. Training of operating and maintenance personnel.

5. Logistic support for the user, including after-sales service.

This section introduces the above points for the case of complex equipment and systems with high maintainability requirements.

4.2.1 Fault Recognition and Isolation

For complex equipment and systems, recognition of partial failures or of hidden failures (failure of redundant elements) can be difficult. For this reason, a status test, initiated by operating personnel, or an operation monitoring, running autonomously, must often be implemented. Properties, advantages, and disadvantages of both methods are summarized in Table 4.1. The choice between a status test and a (more complete) operation monitoring must consider cost, reliability, availability, and safety requirements at system level.

The goal of fault isolation (localization and diagnosis) is to isolate faults (failures and defects) down to the line replaceable units (LRUs), i.e. to the part which is considered as a spare part at the equipment or system level. LRUs are generally assemblies, e.g. populated printed circuit boards, or units which for repair purposes are considered as an entity and replaced on a plug-out/plug-in basis to reduce repair times. Repair of LRUs is generally performed by specialized personnel and repaired LRUs are stored for reuse. Fault isolation should be performed using built-in test (BIT) facilities, if necessary supported by built-in test equipment (BITE).

Use of external special tools should be avoided; however, check lists and portable test equipment can be useful to limit the amount of built-in facilities.

Fault recognition and fault isolation are closely related and should be considered together using common hardware and/or software. A high degree of automation should be striven for, and test results should be automatically recorded. A one-to-one correspondence between test messages and content of the user documentation (operating and maintenance manuals) must be assured.

Table 4.1 Automatic and semiautomatic fault recognition

Status test, rough (quick test)
• Properties: Testing of all important functions, if necessary with help of external test equipment; initiated by the operating personnel, then runs automatically
• Advantages: Lower cost; allows fast checking of the functional conditions
• Disadvantages: Limited fault isolation (localization and diagnosis) capability

Status test, complete (functional test)
• Properties: Periodic testing of all important functions; initiated by the operating personnel, then runs automatically or semi-automatically (possibly without external stimulation or test equipment)
• Advantages: Gives a clear status of the functional conditions of the item considered; allows fault isolation down to LRU level
• Disadvantages: Relatively expensive; runs generally off-line (i.e. not in background)

Operation monitoring
• Properties: Monitoring of all important functions and automatic display of complete and partial faults; performed with built-in means (BIT/BITE)
• Advantages: Runs automatically on-line, i.e. in background
• Disadvantages: Expensive

LRU = line replaceable unit; BIT = built-in test; BITE = built-in test equipment

Built-in tests (BIT) should be able to identify hidden faults, i.e. faults (defects or failures) of redundant elements and, as far as possible, also software defects. This ability is generally characterized by the following testability parameters:

• degree of fault recognition (coverage, e.g. 99% of all relevant failures),
• degree of fault isolation (e.g. down to LRUs),
• correctness of the fault isolation (e.g. 95%),
• test duration (e.g. 1 s).

The first two parameters can be expressed by a probability. Distinction between failures and defects is important. As a measure of the correctness of the fault isolation capability, one can use the ratio between the number of correctly isolated faults and the number of isolation tests performed. This figure, similar to that of test coverage, must often remain at an empirical level, because of the lack of exact information about the defects and failures really present or assumed in the item considered. For the test duration, it is generally sufficient to work with mean values. Failure (fault) mode analysis methods (FMEA/FMECA, FTA, cause-to-effect charts, etc.) are useful to check the effectiveness of built-in facilities (Section 2.6).
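The empirical ratios mentioned above can be computed from a record of fault trials, for instance from a fault injection campaign. The following is a minimal sketch; the record layout, the LRU names, and the four trials are purely hypothetical and serve only to show how the degree of fault recognition and the correctness of the fault isolation are formed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FaultTrial:
    """One injected or observed fault (hypothetical record layout)."""
    true_lru: str                 # LRU that actually contains the fault
    recognized: bool              # True if the built-in test reported a fault
    reported_lru: Optional[str]   # LRU named by the isolation test, None if none was run


def fault_recognition_degree(trials: List[FaultTrial]) -> float:
    """Fraction of faults recognized by the built-in test (coverage)."""
    return sum(t.recognized for t in trials) / len(trials)


def isolation_correctness(trials: List[FaultTrial]) -> float:
    """Correctly isolated faults divided by the isolation tests performed."""
    performed = [t for t in trials if t.reported_lru is not None]
    correct = sum(t.reported_lru == t.true_lru for t in performed)
    return correct / len(performed)


# Both functions assume at least one trial (and one isolation test) is present.
trials = [
    FaultTrial("PSU", True, "PSU"),
    FaultTrial("CPU", True, "CPU"),
    FaultTrial("CPU", True, "IO"),    # recognized, but isolated to the wrong LRU
    FaultTrial("IO", False, None),    # hidden fault missed by the test
]

print("degree of fault recognition:", fault_recognition_degree(trials))
print("correctness of isolation:   ", round(isolation_correctness(trials), 3))
```

With so few trials the figures are of course only indicative; in practice the set of injected faults should reflect the failure modes identified by the FMEA/FMECA.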

Built-in test facilities, in particular built-in test equipment (BITE), must be defined taking into consideration not only price/performance aspects but also their impact on the reliability and availability of the equipment or system in which they are used. Standard BITE can often be integrated into the equipment or system considered. However, project specific BITE is generally more efficient than standard solutions. For such a selection, the following aspects are important:

1. Simplicity: Test sequences, procedures, and documentation should be as easy as possible.

2. Standardization: The greatest possible standardization should be striven for, in hardware and software.

3. Reliability: Built-in facilities should have a failure rate at least one order of magnitude lower than that of the equipment or system in which they are used; their failure should not influence the item's operation (FMEA/FMECA).

4. Maintenance: The maintenance of BIT/BITE must be simple and should not interfere with that of the equipment or system; the user should be connected to the field data change service of the manufacturer.

For some applications, it is important that fault isolation (or at least part of the diagnosis) can be remotely controlled. Such a requirement can often be satisfied, if stated early in the design phase. Remote diagnosis must be investigated on a case-by-case basis, using results from a careful failure modes and effects analysis (FMEA, FTA).


A further step beyond the above considerations leads to maintenance concepts which allow automatic or semiautomatic reconfiguration of an item after failure.

Design guidelines for maintainability are given in Section 5.2. Effects of imperfect switching and incomplete coverage are investigated in Section 6.8.

4.2.2 Equipment and System Partitioning

The consistent partitioning of complex equipment and systems into (as far as possible) independent line replaceable units (LRUs) is important for good maintainability. Partitioning must be performed early in the design phase, because of its impact on layout and construction of the equipment or system considered. LRUs should constitute functional units and have clearly defined interfaces with other LRUs. Ideally, LRUs should allow a modular construction of the equipment or system, i.e. constitute autonomous units which can each be tested independently of every other (for hardware as well as for software).

Related to the above aspects are those of accessibility, adjustment, and exchangeability. Accessibility should be easy for LRUs with limited useful life, high failure rate, or wearout. The use of digital techniques largely reduces the need for adjustment (alignment). As a general rule, hardware adjustment in the field should be avoided. Exchangeability can be a problem for equipment and systems with long useful life. Spare parts provisioning and aspects of obsolescence can in such cases become mandatory (Section 4.5).

4.2.3 User Documentation

User (or product) documentation for complex equipment and systems can include all of the following manuals or handbooks:

General Description

Operating Manual

Preventive Maintenance (Service) Manual

Corrective Maintenance (Repair) Manual

Illustrated Spare Parts Catalog

Logistic Support.

It is important for the content of the user documentation to be consistent with the hardware and software status of the item considered. Emphasis must be placed on a clear and concise presentation, with block diagrams, flow charts, and check lists. The language should be easily understandable to non-specialized personnel. Procedures should be self-sufficient and contain checkpoints to prevent the skipping of important steps.

4.2.4 Training of Operating and Maintenance Personnel

Suitably equipped, well trained, and motivated maintenance personnel are an important prerequisite to achieve short maintenance times and to avoid human errors. Training must be comprehensive enough to cover present needs. However, for a complex system it should be periodically updated to cover technological changes introduced in the system and to further motivate the operating and maintenance personnel.

4.2.5 User Logistic Support

For complex equipment and systems, customers (users) generally expect from the manufacturer a logistic support during the useful life of the item under consideration. This can range from support on an on-call basis up to a maintenance contract with manufacturer's personnel located at the user site. One important point in such a logistic support is the definition of responsibilities. For this reason, maintenance is often subdivided into different levels (four for military applications (Table 4.2) and three for industry, in general). The first level concerns simple maintenance work such as the status test, fault recognition, and fault isolation down to the subsystem level. This task is generally performed by operating personnel. At the second level, fault isolation is refined, the defective LRU is replaced by a good one, and the functional test is performed. For this task first line maintenance personnel is often required. At the third level, faulty LRUs are repaired by maintenance personnel and stored for reuse. The fourth level generally relates to overhaul or revision (essentially for large mechanical parts subjected to wear, erosion, scoring, etc.) and is often performed at the manufacturer's site by specialized personnel.

Table 4.2 Maintenance levels in the defense area

Level 1 (location: Field; carried out by: operating personnel)
• Simple maintenance work
• Status test
• Fault recognition
• Fault isolation down to subsystem level

Level 2 (location: Cover; carried out by: first line maintenance personnel)
• Preventive maintenance
• Fault isolation down to LRU level
• First line repair (LRU replacement)
• Functional test

Level 3 (location: Depot; carried out by: maintenance personnel)
• Difficult maintenance
• Repair of LRUs

Level 4 (location: Arsenal or Industry; carried out by: specialized personnel from arsenal or industry)
• Reconditioning work
• Important changes or modifications

LRU = line replaceable unit (spare part at system level)

For large mechanical systems, maintenance can account for over 30% of the operating cost. A careful optimization of these costs may be necessary in many cases. The part contributed by preventive maintenance is more or less deterministic. For the corrective maintenance, cost equations weighted by probabilities of occurrence can be established from considerations similar to those given in Sections 1.2.9 and 8.4, see also Sections 4.5, 4.6, and 4.7.

Table 4.3 Example of catalog of questions for the preparation of project specific checklists for the evaluation of maintainability aspects in preliminary design reviews (Appendices A3 and A4) of complex equipment and systems with high maintainability requirements

1. Has the equipment or system been conceived with modularity in mind? Are the modules functionally independent and separately testable?

2. Has a concept for fault recognition and isolation been planned and realized? Is fault detection automatic? Which kind of faults are recognized? How does fault isolation work? Is isolation down to line replaceable (repairable) units (LRUs) possible? How large are the values for fault recognition and fault isolation (coverage)? Is remote diagnostic possible?

3. Can redundant elements be repaired on-line?

4. Are enough test points provided? Do they have pull-up/pull-down resistors?

5. Have hardware adjustments (or alignments) been reduced to a minimum? Are the adjustable elements clearly marked and easily accessible? Is the adjustment uncritical?

6. Has the amount of external test equipment been kept to a minimum?

7. Has the standardization of components, materials, and maintenance tools been considered?

8. Are line replaceable units (LRUs) identical with spare parts? Can they be easily tested? Is a spare parts provisioning concept available?

9. Are all elements with limited useful life clearly marked and easily accessible?

10. Are access flaps (and doors) easy to open (without special tools) and self-latching? Have plug-in unit guide rails self-blocking devices? Can a standardized extender for PCBs be used?

11. Have indirect connectors been used? Is the plugging-out/plugging-in of PCBs (LRUs) easy? Are power supplies and ground distributed across different contacts?

12. Have wires and cables been conveniently placed? Also with regard to maintenance?

13. Are sensitive elements sufficiently protected against mishandling during maintenance?

14. Can preventive maintenance be performed on-line? Does preventive maintenance also allow the detection of hidden failures?

15. Can the item (the system) be considered as-good-as-new after a maintenance action?

16. Have man-machine aspects been sufficiently considered?

17. Have all safety aspects also for operating and maintenance personnel been considered? Also in the case of failure (FMEA/FMECA, FTA, etc.)?

4.3 Maintainability Aspects in Design Reviews


Design reviews are important to point out, discuss, and eliminate design weaknesses. Their objective is also to decide about continuation or stopping of the project on the basis of objective considerations (feasibility checks in Tables A3.3 & 5.3 and Fig. 1.6). The most important design reviews (PDR & CDR) are described in Table A3.3. To be effective, design reviews must be supported by project specific checklists. Table 4.3 gives an example of a catalog of questions which can be used to generate project specific checklists for maintainability aspects in design reviews (see Table 2.8 for reliability and Appendix A4 for other aspects).

4.4 Predicted Maintainability

Knowing the reliability structure of a system and the reliability and maintainability of its elements, it is possible to calculate the maintainability of the system considered as a one-item structure (e.g. calculating the reliability function and the point availability at system level and extracting g(t) as the density of the repair time at the system level using Eqs. (6.14) and (6.18)). However, such a calculation soon becomes laborious for arbitrary systems (Chapter 6). For many practical applications it is often sufficient to know the mean time to repair at the system level MTTR_S (expected value of the repair (renewal) time at system level) as a function of the system reliability structure and of the mean time to failure MTTF_i and mean time to repair MTTR_i of its elements. Such a calculation is discussed in Section 4.4.1. Section 4.4.2 then deals with the calculation of the mean time to preventive maintenance at system level MTTPM_S. The method used in Sections 4.4.1 and 4.4.2 is easy to understand and delivers mathematically exact results for MTTR_S and MTTPM_S. The use of statistical methods to estimate or demonstrate a maintainability or an MTTR is discussed in Sections 7.2.1, 7.3, 7.5, and 7.6.

4.4.1 Calculation of MTTR_S

Let us first consider a system without redundancy, with elements E_1, ..., E_n in series as given in Fig. 6.4. MTTF_i and MTTR_i are the mean time to failure and the mean time to repair of element E_i, respectively (i = 1, ..., n). Assume now that each element works for the same cumulative operating time T (the system is disconnected during repair, or repair times are neglected because MTTR_i << MTTF_i) and let T be arbitrarily large. In this case, the expected value (mean) of the number of failures of element E_i during T is given by (Eq. (A7.27))

$$\frac{T}{\mathrm{MTTF}_i}.$$

The mean of the total repair time necessary to restore the T / MTTF_i failures follows then from

$$\mathrm{MTTR}_i \, \frac{T}{\mathrm{MTTF}_i}.$$

For the whole system, there will be in mean

$$\sum_{i=1}^{n} \frac{T}{\mathrm{MTTF}_i} \qquad (4.4)$$

failures and a mean total repair time of

$$\sum_{i=1}^{n} \mathrm{MTTR}_i \, \frac{T}{\mathrm{MTTF}_i}. \qquad (4.5)$$

From Eqs. (4.4) and (4.5) it follows then for the mean time to repair (restoration) at the system level MTTR_S the final value

$$\mathrm{MTTR}_S = \frac{\sum_{i=1}^{n} \mathrm{MTTR}_i / \mathrm{MTTF}_i}{\sum_{i=1}^{n} 1 / \mathrm{MTTF}_i}. \qquad (4.6)$$

Equation (4.6) gives the mathematically exact value for the mean system repair time MTTR_S under the assumption that at system down (during a repair) no further failures can occur and that switching is ideal (no influence on the reliability). From Eq. (4.6) one can easily verify that

$$\mathrm{MTTR}_S = \mathrm{MTTR}, \quad \text{when } \mathrm{MTTR}_1 = \ldots = \mathrm{MTTR}_n = \mathrm{MTTR},$$

and

$$\mathrm{MTTR}_S = \frac{1}{n} \sum_{i=1}^{n} \mathrm{MTTR}_i, \quad \text{when } \mathrm{MTTF}_1 = \ldots = \mathrm{MTTF}_n.$$

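Example 4.1
Give the mean time to repair at system level MTTR_S for the following system.

[Figure: four elements in series with MTTF = 500 h, 400 h, 250 h, 100 h and MTTR = 2 h, 2.5 h, 1 h, 0.5 h, respectively (values as used in the solution)]

How large is the mean of the total system down time during the interval (0, t] for t → ∞?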

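Solution
From Eq. (4.6) it follows that

$$\mathrm{MTTR}_S = \frac{\dfrac{2\,\mathrm{h}}{500\,\mathrm{h}} + \dfrac{2.5\,\mathrm{h}}{400\,\mathrm{h}} + \dfrac{1\,\mathrm{h}}{250\,\mathrm{h}} + \dfrac{0.5\,\mathrm{h}}{100\,\mathrm{h}}}{\dfrac{1}{500\,\mathrm{h}} + \dfrac{1}{400\,\mathrm{h}} + \dfrac{1}{250\,\mathrm{h}} + \dfrac{1}{100\,\mathrm{h}}} = \frac{0.01925}{0.0185\,\mathrm{h}^{-1}} \approx 1.04\,\mathrm{h}.$$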

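The mean down time at the system level is also 1.04 h, since for a system without redundancy it holds that down time = repair time. The mean operating time at the system level in the interval (0, t] can be obtained from the expression for the average availability AA_S (Eqs. (6.23), (6.24), (6.48), and (6.49))

$$\lim_{t \to \infty} \mathrm{E}[\text{total operating time in } (0, t]] = t \cdot AA_S = t \cdot \frac{\mathrm{MTTF}_S}{\mathrm{MTTF}_S + \mathrm{MTTR}_S}.$$

From this, the mean of the total system down time during (0, t] for t → ∞ follows then from

$$\lim_{t \to \infty} \mathrm{E}[\text{total system down time in } (0, t]] = t - t \cdot AA_S = t \cdot \frac{\mathrm{MTTR}_S}{\mathrm{MTTF}_S + \mathrm{MTTR}_S}.$$

Numerical computation then leads to (with MTTF_S = 1 / 0.0185 h⁻¹ ≈ 54.05 h)

$$\lim_{t \to \infty} \mathrm{E}[\text{total system down time in } (0, t]] = t \cdot \frac{1.04\,\mathrm{h}}{54.05\,\mathrm{h} + 1.04\,\mathrm{h}} \approx 0.019\, t.$$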

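If every element exhibits a constant failure rate λ_i, then MTTF_i = 1 / λ_i and

$$\mathrm{MTTR}_S = \sum_{i=1}^{n} \frac{\lambda_i}{\lambda_S}\, \mathrm{MTTR}_i, \quad \text{with } \lambda_S = \sum_{i=1}^{n} \lambda_i. \qquad (4.7)$$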
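As an illustration, Eq. (4.6) can be evaluated numerically in a few lines. The following is a minimal sketch, not part of the original text: the function name `mttr_system` is hypothetical, the data are those of Example 4.1, and the series-system MTTF_S is computed as 1 / Σ(1 / MTTF_i), which assumes constant failure rates.

```python
def mttr_system(mttf, mttr):
    """Eq. (4.6): mean of the element repair times, weighted by failure frequency."""
    if len(mttf) != len(mttr):
        raise ValueError("mttf and mttr must have the same length")
    repair_hours_per_hour = sum(r / f for f, r in zip(mttf, mttr))  # sum of MTTR_i / MTTF_i
    failures_per_hour = sum(1.0 / f for f in mttf)                  # sum of 1 / MTTF_i
    return repair_hours_per_hour / failures_per_hour


# Data of Example 4.1 (all values in hours)
mttf = [500.0, 400.0, 250.0, 100.0]
mttr = [2.0, 2.5, 1.0, 0.5]

mttr_s = mttr_system(mttf, mttr)
mttf_s = 1.0 / sum(1.0 / f for f in mttf)        # series system, constant failure rates assumed
downtime_fraction = mttr_s / (mttf_s + mttr_s)   # long-run share of time spent in repair

print(f"MTTR_S = {mttr_s:.2f} h")
print(f"MTTF_S = {mttf_s:.2f} h")
print(f"down-time fraction = {downtime_fraction:.4f}")
```

The weighting by 1 / MTTF_i is the point of Eq. (4.6): elements that fail often dominate the system repair time, so the short repair time of the 100 h element pulls MTTR_S down toward 1 h.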

Equations (4.6) and (4.7) can also be used for systems with redundancy. However, in this case, a distinction at system level between repair time and down time is necessary. If the system contains only active redundancy, the mean time to repair at the system level MTTR_S is given by Eq. (4.6) or (4.7) by summing over all elements of the system, as if they were in series (a similar consideration holds for spare parts provisioning). By assuming that failures of redundant elements are repaired without interruption of operation at the system level, Eq. (4.6) or (4.7) can be used to obtain an approximate value of the mean down time at the system level, by summing only over all elements without redundancy (series elements), see Example 4.2.

Example 4.2

How does the MTTR_S of the system in Example 4.1 change, if an active redundancy is introduced to the element with MTTF = 100 h?

Under the assumption that the redundancy is repaired without interruption of operation at the system level, is there a difference between the mean time to repair and the mean down time at the system level?

[Figure: reliability block diagram with the elements (MTTF = 500 h, MTTR = 2 h), (MTTF = 400 h, MTTR = 2.5 h), and (MTTF = 250 h, MTTR = 1 h) in series with a 1-out-of-2 active redundancy of two elements, each with MTTF = 100 h and MTTR = 0.5 h]

Solution

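Because of the assumed active redundancy, the operating elements and the reserve elements show the same mean number of failures. The mean system repair time follows then from Eq. (4.6) by summing over all system elements, yielding

$$\mathrm{MTTR}_S = \frac{\dfrac{2\,\mathrm{h}}{500\,\mathrm{h}} + \dfrac{2.5\,\mathrm{h}}{400\,\mathrm{h}} + \dfrac{1\,\mathrm{h}}{250\,\mathrm{h}} + \dfrac{0.5\,\mathrm{h}}{100\,\mathrm{h}} + \dfrac{0.5\,\mathrm{h}}{100\,\mathrm{h}}}{\dfrac{1}{500\,\mathrm{h}} + \dfrac{1}{400\,\mathrm{h}} + \dfrac{1}{250\,\mathrm{h}} + \dfrac{1}{100\,\mathrm{h}} + \dfrac{1}{100\,\mathrm{h}}} = \frac{0.02425}{0.0285\,\mathrm{h}^{-1}} \approx 0.85\,\mathrm{h}.$$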

However, the system down time differs now from the system repair time. Assuming for the redundancy an availability equal to one (for constant failure rate λ = 1 / MTTF, constant repair rate μ = 1 / MTTR, and one repair crew, Table 6.6 (p. 195) gives for the 1-out-of-2 active redundancy PA = AA = μ(2λ + μ) / (2λ(λ + μ) + μ²), yielding AA = 0.99995 for this example), the system down time is defined by the elements in series on the reliability block diagram (see Point 9 in Section 6.8.8 (Eq. (6.291)) for precise considerations), thus

$$\text{mean down time at system level} = \frac{\dfrac{2\,\mathrm{h}}{500\,\mathrm{h}} + \dfrac{2.5\,\mathrm{h}}{400\,\mathrm{h}} + \dfrac{1\,\mathrm{h}}{250\,\mathrm{h}}}{\dfrac{1}{500\,\mathrm{h}} + \dfrac{1}{400\,\mathrm{h}} + \dfrac{1}{250\,\mathrm{h}}} = \frac{0.01425}{0.0085\,\mathrm{h}^{-1}} \approx 1.68\,\mathrm{h}.$$

Similarly to Example 4.1, the mean of the system down time during the interval (0, t ] follows then from

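$$\lim_{t \to \infty} \mathrm{E}[\text{total down time in } (0, t]] = t\,(1 - AA_S) \approx t \cdot \frac{\mathrm{MTTR}_S}{\mathrm{MTTF}_S} = t \cdot 1.68\,\mathrm{h} \cdot 0.0085\,\mathrm{h}^{-1} \approx 0.014\, t.$$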
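The two results of Example 4.2 differ only in which elements enter the sums of Eq. (4.6). A short sketch under stated assumptions makes this explicit: the helper name `eq_4_6` and the data layout are illustrative only, and the availability of the redundant pair uses the 1-out-of-2 expression quoted in the example.

```python
def eq_4_6(elements):
    """Eq. (4.6) for a list of (MTTF_i, MTTR_i) pairs, all in hours."""
    numerator = sum(mttr / mttf for mttf, mttr in elements)
    denominator = sum(1.0 / mttf for mttf, _ in elements)
    return numerator / denominator


series_elements = [(500.0, 2.0), (400.0, 2.5), (250.0, 1.0)]
redundant_pair = [(100.0, 0.5), (100.0, 0.5)]   # 1-out-of-2 active redundancy

# Mean repair time: every element fails and is repaired, so sum over all of them.
mttr_s = eq_4_6(series_elements + redundant_pair)

# Mean down time: only series elements interrupt operation (approximation).
mean_down_time = eq_4_6(series_elements)

# Availability of the redundant pair, one repair crew, constant rates.
lam, mu = 1.0 / 100.0, 1.0 / 0.5
aa_pair = mu * (2 * lam + mu) / (2 * lam * (lam + mu) + mu**2)

print(f"MTTR_S         = {mttr_s:.2f} h")
print(f"mean down time = {mean_down_time:.2f} h")
print(f"AA of the pair = {aa_pair:.5f}")
```

Adding the redundancy lowers the mean repair time (more short repairs are averaged in) while the mean down time per system failure rises, because the remaining series elements are the ones with long repair times.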

4.4.2 Calculation of MTTPM_S

Based on the results of Section 4.4.1, the calculation of the mean time to preventive maintenance at system level MTTPM_S can be performed for the following two cases:

1. Preventive maintenance is carried out at once for the entire system, one element after the other. If the system consists of elements E_1, ..., E_n (arbitrarily grouped on the reliability block diagram) and the mean time to preventive maintenance of element E_i is MTTPM_i, then

$$\mathrm{MTTPM}_S = \sum_{i=1}^{n} \mathrm{MTTPM}_i. \qquad (4.8)$$

2. Every element E_i of the system is serviced for preventive maintenance independently of all other elements and has a mean time to preventive maintenance MTTPM_i. In this case, Eq. (4.6) can be used with MTBPM_i instead of MTTF_i and MTTPM_i instead of MTTR_i, where MTBPM_i is the mean time between preventive maintenance for the element E_i.

Case 2 has a practical significance when preventive maintenance can be performed without interruption of the operation at the system level.
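The two cases can be contrasted numerically. The sketch below is illustrative only: the function names and the three-element data set are hypothetical and do not come from the text.

```python
def mttpm_all_at_once(mttpm):
    """Case 1: the whole system is serviced in one pass, element after element."""
    return sum(mttpm)


def mttpm_independent(mtbpm, mttpm):
    """Case 2: Eq. (4.6) with MTBPM_i in place of MTTF_i and MTTPM_i in place of MTTR_i."""
    weighted = sum(p / b for b, p in zip(mtbpm, mttpm))
    frequency = sum(1.0 / b for b in mtbpm)
    return weighted / frequency


# Hypothetical data (hours): maintenance intervals and durations of three elements
mtbpm = [1000.0, 2000.0, 4000.0]
mttpm = [1.0, 2.0, 4.0]

case_1 = mttpm_all_at_once(mttpm)
case_2 = mttpm_independent(mtbpm, mttpm)

print(f"Case 1: MTTPM_S = {case_1:.2f} h per system-wide maintenance")
print(f"Case 2: MTTPM_S = {case_2:.2f} h per individual maintenance action")
```

In Case 2 the frequently serviced element dominates the mean, exactly as frequently failing elements dominate MTTR_S in Eq. (4.6).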

4.5 Basic Models for Spare Parts Provisioning

Spare parts provisioning is important for systems with long useful life or when short repair times and/or independence from the manufacturer is required (spare part is used here e.g. for line replaceable unit (LRU)). Basically, a distinction is made between centralized and decentralized logistic support. It is also important to take into account whether spare parts are repairable or not. This section presents the basic models for the provision of nonrepairable and of repairable spare parts. For nonrepairable spare parts, the cases of centralized and decentralized logistic support are considered in order to quantify the advantage of a centralized logistic support with respect to a decentralized one. More general maintenance strategies are discussed in Section 4.6, cost specific aspects in Section 4.7.

4.5.1 Centralized Logistic Support, Nonrepairable Spare Parts

In centralized logistic support, spare parts are stocked at one place. The basic problem can be formulated as follows:

At time t = 0, the first part is put into operation; it fails at time t = τ_1 and is replaced (in a negligible time) by a second part which fails at time t = τ_1 + τ_2, and so forth; sought is the number n of parts which must be stocked in order that the requirement for parts during the cumulative operating time T is met with a given (fixed) probability γ.

To answer this question, the smallest integer n must be found for which

$$\Pr\Big\{ \sum_{i=1}^{n} \tau_i > T \Big\} \ge \gamma \qquad (4.9)$$

holds. In general, τ_1, ..., τ_n are assumed to be independent positive random variables with the same distribution function F(x), density f(x), and finite mean E[τ_i] = E[τ] = MTTF & variance Var[τ_i] = Var[τ]. If the number of parts is calculated from

$$n = \frac{T}{\mathrm{MTTF}}, \qquad (4.10)$$

the requirement can only be covered (for T large) with a probability of 0.5. Thus, more than T / MTTF parts are necessary to meet the requirement with γ > 0.5.

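According to Eq. (A7.12), the probability as per Eq. (4.9) can be expressed by the (n − 1)th convolution of the distribution function F(x) with itself, i.e.

$$\Pr\Big\{ \sum_{i=1}^{n} \tau_i > T \Big\} = 1 - \Pr\Big\{ \sum_{i=1}^{n} \tau_i \le T \Big\} = 1 - F_n(T),$$

$$\text{with } F_1(T) = F(T) \text{ and } F_n(T) = \int_0^T F_{n-1}(T - x)\, f(x)\, dx, \quad n > 1. \qquad (4.11)$$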

Of the distribution functions F(x) used in reliability theory, a closed, simple form for the function F_n(x) exists only for the exponential, gamma, and normal distribution functions, yielding a Poisson, gamma, and normal distribution, respectively. In particular, the exponential distribution F(x) = 1 − e^(−λx) leads to (Eq. (A7.39))

$$\Pr\Big\{ \sum_{i=1}^{n} \tau_i > T \Big\} = \sum_{i=0}^{n-1} \frac{(\lambda T)^i}{i!}\, e^{-\lambda T} \ge \gamma. \qquad (4.12)$$

The important case of the Weibull distribution F(x) = 1 − e^(−(λx)^β) must be solved numerically. Figure 4.3 shows the results with γ and β as parameters [4.2 (1974)].
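For the exponential case, the smallest n can be found by accumulating Poisson terms until the required probability γ is reached. The following is a minimal sketch; the function name `parts_needed_exponential` is hypothetical, and the plain summation assumes moderate values of λT (for λT above roughly 700 the term e^(−λT) underflows in double precision and a logarithmic formulation would be needed).

```python
import math


def parts_needed_exponential(lam_t, gamma):
    """Smallest n with sum_{i=0}^{n-1} (lam_t)^i / i! * exp(-lam_t) >= gamma."""
    if lam_t <= 0.0:
        raise ValueError("lam_t must be positive")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    term = math.exp(-lam_t)   # Poisson probability of 0 failures in T
    cdf = term                # running sum over i = 0 .. n - 1
    n = 1
    while cdf < gamma:
        term *= lam_t / n     # Poisson probability of n failures in T
        cdf += term
        n += 1
    return n


# lam_t = 30 with gamma = 0.90, and lam_t = 5 with gamma = 0.99
print(parts_needed_exponential(30.0, 0.90))
print(parts_needed_exponential(5.0, 0.99))
```

The loop exploits the recursion p_i = p_(i−1) · λT / i, which avoids computing factorials and large powers separately.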

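For large values of n, an approximate solution for a wide class of distribution functions F(x) can be obtained using the central limit theorem. From Eq. (A6.148) it follows that (for Var[τ] < ∞)

$$\lim_{n \to \infty} \Pr\Big\{ \frac{\sum_{i=1}^{n} \tau_i - n\,\mathrm{E}[\tau]}{\sqrt{n\,\mathrm{Var}[\tau]}} > x \Big\} = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-y^2/2}\, dy, \qquad (4.13)$$

and thus, using x √(n Var[τ]) + n E[τ] = T,

$$\lim_{n \to \infty} \Pr\Big\{ \sum_{i=1}^{n} \tau_i > T \Big\} = \frac{1}{\sqrt{2\pi}} \int_{\frac{T - n\,\mathrm{E}[\tau]}{\sqrt{n\,\mathrm{Var}[\tau]}}}^{\infty} e^{-y^2/2}\, dy. \qquad (4.14)$$

Setting (T − n E[τ]) / √(n Var[τ]) = −d, it follows that

$$n = \Big[ \frac{d\,\kappa}{2} + \sqrt{\Big(\frac{d\,\kappa}{2}\Big)^2 + \frac{T}{\mathrm{E}[\tau]}} \; \Big]^2, \quad \text{with } \kappa = \frac{\sqrt{\mathrm{Var}[\tau]}}{\mathrm{E}[\tau]}. \qquad (4.15)$$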

Figure 4.3 Number of parts (n) which are necessary to cover a total cumulative operating time T with a probability ≥ γ, i.e. smallest n for which Pr{τ_1 + ... + τ_n > T} ≥ γ holds, with Pr{τ_i ≤ x} = 1 − e^(−(λx)^β) and MTTF = Γ(1 + 1/β) / λ (dashed are the results given by the central limit theorem as per Eq. (4.15), β = 1 yields the exponential distribution function)

Figure 4.4 Coefficient of variation for the Weibull distribution for 1 ≤ β ≤ 3

From Eqs. (4.13) and (4.14) one recognizes that d is the γ quantile of the standard normal distribution (1 − Φ(−d) = Φ(d) = γ), yielding (Table A9.1), for example, d ≈ 1.28 for γ = 0.90, d ≈ 1.64 for γ = 0.95, and d ≈ 2.33 for γ = 0.99.

Equation (4.15) gives for γ ≤ 0.95 a good approximation of the number of parts n down to low values of n (see e.g. Fig. 4.3). κ = √(Var[τ]) / E[τ] is the coefficient of variation (κ = 1 for the exponential distribution and κ = √(Γ(1 + 2/β) / (Γ(1 + 1/β))² − 1) for the Weibull distribution (Fig. 4.4)). For the case of a Weibull distribution with β ≥ 1, approximate values for n obtained using the central limit theorem (Eq. (4.15)) are shown dashed in Fig. 4.3. For β = 1, the deviation from the exact value is < 1.3 for γ ≤ 0.95 and n ≥ 5; this deviation drops off rapidly for increasing values of β (F_n(x) already approaches a normal distribution for small n). From Eq. (4.14) one recognizes that for γ = 0.5, T − n E[τ] = 0 and thus, for n large, n = T / E[τ] (Eq. (4.10)).
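The central limit approximation is easy to evaluate once the quantile d and the coefficient of variation κ are known. The sketch below assumes the quadratic solution of (T − n E[τ]) / √(n Var[τ]) = −d for √n, rounded up to the next integer; the function names are illustrative, and `statistics.NormalDist` from the Python standard library supplies the normal quantile in place of Table A9.1.

```python
import math
from statistics import NormalDist


def weibull_cv(beta):
    """Coefficient of variation of a Weibull distribution with shape parameter beta."""
    g1 = math.gamma(1.0 + 1.0 / beta)
    g2 = math.gamma(1.0 + 2.0 / beta)
    return math.sqrt(g2 / g1**2 - 1.0)


def parts_needed_clt(t_over_mttf, gamma, kappa=1.0):
    """Central limit approximation of the number of parts n (rounded up)."""
    d = NormalDist().inv_cdf(gamma)          # gamma quantile of the standard normal
    half = d * kappa / 2.0
    sqrt_n = half + math.sqrt(half**2 + t_over_mttf)
    return math.ceil(sqrt_n**2)


# Exponential failure-free times (kappa = 1)
print(parts_needed_clt(30.0, 0.90))
print(parts_needed_clt(5.0, 0.99))

# Weibull failure-free times with beta = 2: smaller kappa, hence fewer parts
print(parts_needed_clt(30.0, 0.90, kappa=weibull_cv(2.0)))
```

Because κ enters only through the product dκ, a failure-free time with less scatter (larger β) reduces the safety margin over T / MTTF, in line with the trend shown in Fig. 4.3.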

Let us now consider the case in which the same part occurs k times in the system. For F(x) = 1 − e^(−λx), Eqs. (4.12) - (4.15) hold with

$$\lambda_S = k \lambda \qquad (4.16)$$

instead of λ. This is because the sum of independent Poisson processes is a Poisson process (Eq. (7.27)) and k parts must be operating for the required function (see also Point 2 on p. 131). The same holds if l systems use the same part, one or more per system with total k parts of the same type, and storage is centralized (Example 4.3).

Considering that k parts are available at t = 0 (operating at t = 0), it is reasonable to define as number of spare parts n_sp the quantity

$$n_{sp} = n - k, \qquad (4.17)$$

where n is the number of parts obtained from Eqs. (4.12) - (4.16), see Examples 4.3 and 4.4 for two practical applications.

Example 4.3
A part with a constant failure rate λ = 10⁻³ h⁻¹ is used three times in a system (k = 3). Give the number of spare parts n_sp which must be stored to cover a cumulative operating time T = 10,000 h with a probability γ ≥ 0.90.

Solution
Considering kλT = 30, the exact solution is given by the smallest integer n_sp = n − 3 for which

$$\sum_{i=0}^{n-1} \frac{30^i}{i!}\, e^{-30} \ge 0.90$$

holds (Eq. (4.12)). From Table A9.2 it follows, for q = 1 − 0.9 = 0.1 and t_ν,q = 2 · 30 = 60, the value ν = 75.2 (lin. interpolation); thus, ν = 76 and (Appendix A9.2 & Eq. (4.12)) n = ν / 2 = 38 (the same result is obtained with Fig. 7.3 for m = 30; Eq. (4.15) with κ = 1 and d = 1.28 also yields n = 38). Considering that 3 parts are operating at t = 0, it follows that (Eq. (4.17)) n_sp = 38 − 3 = 35.

4.5.2 Decentralized Logistic Support, Nonrepairable Spare Parts

For users who have the same system located at different places, spare parts are often stored decentralized, i.e. separately at each location (decentralized means that spare parts cannot be transferred from one location to another location). If there are l systems, each with a given part, and the storage of spare parts is decentralized at each system (or location), a first approach could be to store with each system the same number of spare parts obtained using Eqs. (4.9) and (4.17). In this case, the total number of parts would be n · l ((n − k) · l spare parts). This number of parts, which would be sufficient to meet, with a probability ≥ γ (often >> γ), the needs of the l systems with a centralized storage (Example 4.4), would now in general be too small to meet all the individual needs at each location. In fact, assuming that failures at each location are independent, and that with n parts the probability of meeting the needs at any location individually is γ, then the probability of meeting the need at all locations is γ^l. Thus, to meet the need at the l locations with a probability γ,

$$n_{dec} = l \cdot n_l \qquad (4.18)$$

parts are required, where n_l is computed for each location individually with

$$\Pr\Big\{ \sum_{i=1}^{n_l} \tau_i > T \Big\} \ge \gamma^{1/l}, \qquad (4.19)$$

e.g. using Eq. (4.15) with d_l instead of d (Φ(d) = γ, Φ(d_l) = γ^(1/l)). To make a comparison between a centralized and a decentralized logistic support, let us assume that the part considered appears k times in each of the l locations, has constant failure rate λ, and kλT >> d_l²/2 > d²/2 holds. In this case, Eqs. (4.15) & (4.16) lead to

$$n \approx k \lambda T + d \sqrt{k \lambda T}, \quad k \lambda T \gg d^2/2, \quad k = 1, 2, \ldots, \quad \text{probability } \gamma. \qquad (4.20)$$


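For centralized logistic support, Eq. (4.20) yields

$$n_{cen} \approx l k \lambda T + d \sqrt{l k \lambda T}, \quad l k \lambda T \gg d^2/2, \quad k, l = 1, 2, \ldots, \quad \text{probability } \gamma. \qquad (4.21)$$

For decentralized logistic support, Eq. (4.20) yields

$$n_{dec} \approx l \big( k \lambda T + d_l \sqrt{k \lambda T} \big), \quad k \lambda T \gg d_l^2/2, \quad k, l = 1, 2, \ldots, \quad \text{probability } \gamma, \qquad (4.22)$$

where d_l is obtained as for d in Eq. (4.15) with γ_l = γ^(1/l) instead of γ (for example, d = 1.64 for γ = 0.95 and d_l = 2.57 for l = 10, i.e. for γ_l = 0.9949, see Table A9.1). From the above considerations it follows that for kλT >> d_l²/2 > d²/2

$$\frac{n_{dec}}{n_{cen}} \approx \frac{1 + d_l / \sqrt{k \lambda T}}{1 + d / \sqrt{l k \lambda T}} \quad \text{and} \quad \frac{n_{sp,dec}}{n_{sp,cen}} \approx \frac{k \lambda T - k + d_l \sqrt{k \lambda T}}{k \lambda T - k + d \sqrt{k \lambda T / l}}, \qquad (4.23)$$

with Φ(d) = γ & Φ(d_l) = γ^(1/l) (see Example 4.4). Setting λT = T / MTTF, Eq. (4.23) can be used for arbitrary distribution of the failure-free time of the spare parts.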

Example 4.4

Let λ = 10⁻⁴ h⁻¹ be the constant failure rate of a part in a given system. The user has 6 locations (l = 6) and would like to achieve a cumulative operating time T = 50,000 h at each location with a probability γ ≥ 0.95. How many spare parts could be saved if the user would store all spare parts at the same location (centralized logistic support)?

Solution
From Fig. 4.3 (T / MTTF = 5, γ_l = 0.95^(1/6) ≈ 0.99), Fig. 7.3 (m = 5, γ = 0.99, c = n_l − 1), or from a χ² table (t_ν,q = 10, q = 1 − 0.99 = 0.01, ν = 2 n_l), each user would need n_l = 12 parts (n_l = 14 using Eq. (4.15) with d = d_l = 2.33 and λT = 5); thus n_dec = 6 · 12 = 72 parts and (Eq. (4.17)) n_sp,dec = 6 · 11 = 66 spare parts. Combining the storage (l = 6), it follows from Fig. 7.3 (m = 30, γ = 0.95, c = n_cen − 1) or from Table A9.2 (t_ν,q = 60, q = 0.05, ν = 2 n_cen) that n_cen = 40 (n_cen = 41 using Eq. (4.15) with d = 1.64 and λT = 30); thus, n_sp,cen = 40 − 6 = 34. A centralized storage would save 66 − 34 (or 72 − 40) = 32 spare parts (Eq. (4.23) gives 1.57 instead of 1.8 (left) and 1.67 instead of 1.94 (right), because kλT = 5 is not >> d_l²/2 = 2.71).

Supplementary result: Provisioning independently for each location with γ = 0.95 yields n_l = 10 (Fig. 4.3) and thus n = 6 · 10 = 60.
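The comparison of Example 4.4 can be reproduced with the exact Poisson sum for constant failure rate. The sketch below is self-contained; `parts_needed` and the variable names are hypothetical, and the per-location probability is taken as γ^(1/l) so that all l locations together are covered with probability γ.

```python
import math


def parts_needed(lam_t, gamma):
    """Smallest n with Pr{at most n - 1 failures in T} >= gamma (Poisson, mean lam_t)."""
    term = math.exp(-lam_t)
    cdf = term
    n = 1
    while cdf < gamma:
        term *= lam_t / n
        cdf += term
        n += 1
    return n


lam = 1e-4          # failure rate in 1/h
T = 50_000.0        # cumulative operating time per location in h
k = 1               # parts of this type per system
l = 6               # number of locations
gamma = 0.95

# Decentralized: each location must succeed with probability gamma**(1/l)
gamma_l = gamma ** (1.0 / l)
n_l = parts_needed(k * lam * T, gamma_l)
n_dec = l * n_l
nsp_dec = n_dec - l * k

# Centralized: one stock serves the pooled failure flow of all l * k parts
n_cen = parts_needed(l * k * lam * T, gamma)
nsp_cen = n_cen - l * k

print(f"decentralized: n_l = {n_l}, n_dec = {n_dec}, spares = {nsp_dec}")
print(f"centralized:   n_cen = {n_cen}, spares = {nsp_cen}")
print(f"spares saved by centralizing: {nsp_dec - nsp_cen}")
```

The saving comes from two effects visible in the code: the safety margin grows only with the square root of the pooled demand, and a single stock needs the quantile for γ rather than for the stricter γ^(1/l).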

4.5.3 Repairable Spare Parts

In Sections 4.5.1 and 4.5.2 it was assumed that the spare parts (LRUs) were nonrepairable, i.e. that a new spare part was necessary at each failure. In many cases, spare parts can be repaired and then stored for reuse. Calculation of the number of spare parts which should be stored can be performed in a way similar to the investigation of a k-out-of-n standby redundancy, where k is the number of parts used in the system (as in Eq. (4.17)) and n is the smallest integer to be determined such that the requirement is met with a given (fixed) probability γ. The following two cases have to be considered:

1. γ is the probability that a request for a spare part at a time point t can be met without time delay; in this case, γ can be considered as the point availability PA_S (in steady-state to simplify investigations) and n is the smallest integer such that PA_S ≥ γ for a given (fixed) γ.

2. γ is the probability that any request for a spare part during the time interval (0, t] will be met without time delay; in this case, γ can be considered as the reliability function R_S0(t) and n is the smallest integer such that R_S0(t) ≥ γ for given (fixed) γ and t.

If the spare parts have a constant failure rate λ = 1 / MTTF and a constant repair rate μ = 1 / MTTR, birth-and-death processes can be used (Section A7.5.5). To simplify investigations, it is assumed that only one spare part at a time can be repaired (only 1 repair crew is available) and no further failures are considered when a request for a spare part cannot be met (corresponds to the assumption no further failure at system down (Fig. 6.13)).

For Case 1 above, Eq. (6.138) with λ_r = 0 and Eq. (6.140) yield

$$PA_S = \sum_{j=0}^{n-k} P_j = 1 - P_{n-k+1} \ge \gamma, \qquad (4.24)$$

with

$$P_j = \frac{(k\lambda/\mu)^j}{\sum_{i=0}^{n-k+1} (k\lambda/\mu)^i}, \quad j = 0, \ldots, n-k+1. \qquad (4.25)$$

Sought is the smallest integer n which satisfies Eq. (4.24) for given (fixed) γ, k, λ, and μ. Often n = k + 1 (one spare part) or n = k + 2 (two spare parts) will be sufficient. In these cases, results of Table 6.8 yield

$$PA_{S1} \approx 1 - (k\lambda/\mu)^2, \quad n_{sp} = n - k = 1 \text{ spare part, 1 repair crew, Case 1}, \qquad (4.26)$$

$$PA_{S2} \approx 1 - (k\lambda/\mu)^3, \quad n_{sp} = n - k = 2 \text{ spare parts, 1 repair crew, Case 1}. \qquad (4.27)$$

If PA_S2 is still < γ, more than 2 spare parts are necessary. A good approximation for the number n_sp of spare parts can be obtained using the smallest integer n_sp = n − k satisfying (Table 6.8)

$$PA_{S n_{sp}} \approx 1 - (k\lambda/\mu)^{n_{sp}+1} \ge \gamma, \quad n_{sp} = n - k \text{ spare parts, 1 repair crew, Case 1}. \qquad (4.28)$$

Using results of Appendix A7.5.5 (Eq. (A7.157)) and considering kλ << μ, it can be shown that the approximations given by Eqs. (4.26) - (4.28) hold also if the assumption "no further failures are considered when a request for a spare part cannot be met" is not made. The case in which n_sp + 1 repair crews are available (instead of 1 repair crew) is considered by Eq. (4.32) for comparative investigations.

For Case 2 above, the reliability function can be approximated by an exponential function (Eq. (6.93)), yielding (Eqs. (6.144) & (6.145) with ν_i = kλ)

$$R_{S01}(t) \approx e^{-t\,(k\lambda)^2/\mu}, \quad n_{sp} = n - k = 1 \text{ spare part, 1 repair crew, Case 2}, \qquad (4.29)$$

$$R_{S02}(t) \approx e^{-t\,(k\lambda)^3/\mu^2}, \quad n_{sp} = n - k = 2 \text{ spare parts, 1 repair crew, Case 2}. \qquad (4.30)$$

If R_S02(t), with t as mission time, is still < γ, more than 2 spare parts are necessary. A good approximation for the number n_sp of spare parts can be obtained using the smallest integer n_sp = n − k satisfying (Table 6.8)

$$R_{S0 n_{sp}}(t) \approx e^{-t\,\mu\,(k\lambda/\mu)^{n_{sp}+1}} \ge \gamma, \quad n_{sp} = n - k \text{ spare parts, 1 repair crew, Case 2}. \qquad (4.31)$$

For Eqs. (4.29) to (4.31) it holds necessarily that no further failures are considered when a request for a spare part cannot be met (system down states are made absorbing for reliability calculations). The case in which n_sp repair crews are available is considered by Eq. (4.33) for comparative investigations.

Example 4.5

A system contains k = 100 identical parts (LRUs) with a constant failure rate λ = 10⁻⁵ h⁻¹ and which can be repaired with a constant repair rate μ = 10⁻¹ h⁻¹. (i) Give the number of spare parts which must be stored in order to meet without any time delay and with a probability γ ≥ 0.99 a request for a spare part at a time point t (consider the steady-state only, one repair crew, and no further failure when a request for a spare part cannot be met). (ii) If one spare part is stored (n = k + 1), how large is the probability that any request for a spare part during the time interval (0, 10⁴ h] will be met without any time delay?

Solution

(i) Taking n = k + 1 = 101, Eq. (4.26) yields

$$PA_{S1} \approx 1 - \Big( \frac{10^{-3}\,\mathrm{h}^{-1}}{10^{-1}\,\mathrm{h}^{-1}} \Big)^2 = 1 - 10^{-4} = 0.9999 > \gamma.$$

Thus only one spare part (n_sp = 1) must be stored.

(ii) For n = k + 1, Eq. (4.29) yields R_S01(t) ≈ e^(−0.00001 t) and thus R_S01(10⁴ h) ≈ e^(−0.1) ≈ 0.91.

Supplementary result: To reach R_S(10⁴ h) ≥ 0.99 one needs n_sp = 2 spare parts (R_S02(10⁴ h) = 0.999).
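The figures of Example 4.5 can be checked with a small birth-and-death model. The sketch below is written under stated assumptions: one repair crew, failure rate kλ in every up state, repair rate μ, and no further failures once a request cannot be met; the states count the parts currently failed. The function names are illustrative, and the reliability uses the exponential approximation rather than an exact solution.

```python
import math


def availability_one_crew(k, lam, mu, n_sp):
    """Case 1: steady-state probability that a spare request is met without delay."""
    rho = k * lam / mu
    weights = [rho**j for j in range(n_sp + 2)]   # states 0 .. n_sp + 1 failed parts
    return 1.0 - weights[-1] / sum(weights)       # 1 - P(no spare left)


def reliability_one_crew(k, lam, mu, n_sp, t):
    """Case 2: exponential approximation exp(-t * mu * (k*lam/mu)**(n_sp + 1))."""
    rho = k * lam / mu
    return math.exp(-t * mu * rho ** (n_sp + 1))


def spares_for_reliability(k, lam, mu, t, gamma, max_spares=50):
    """Smallest n_sp >= 1 whose approximate reliability over (0, t] reaches gamma."""
    for n_sp in range(1, max_spares + 1):
        if reliability_one_crew(k, lam, mu, n_sp, t) >= gamma:
            return n_sp
    raise ValueError("gamma not reached within max_spares")


k, lam, mu = 100, 1e-5, 1e-1      # data of Example 4.5 (rates in 1/h)

pa_1 = availability_one_crew(k, lam, mu, n_sp=1)
r_1 = reliability_one_crew(k, lam, mu, n_sp=1, t=1e4)
r_2 = reliability_one_crew(k, lam, mu, n_sp=2, t=1e4)
n_sp_needed = spares_for_reliability(k, lam, mu, t=1e4, gamma=0.99)

print(f"PA_S1 = {pa_1:.6f}")
print(f"R_S01(1e4 h) = {r_1:.4f}, R_S02(1e4 h) = {r_2:.4f}")
print(f"spares needed for R >= 0.99 over 1e4 h: {n_sp_needed}")
```

The contrast between the two cases is instructive: one spare is ample for the point-in-time requirement, while the interval requirement over 10⁴ h needs a second spare because every short stock-out counts as a failure of the mission.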

Assuming for comparative investigations that each of the n_sp = n − k spare parts can be repaired independently from each other (n_sp + 1 repair crews, no further failures when a request for a spare part cannot be met), results of Section A7.5.5, with ν_i = kλ, i = 0, ..., n − k, and θ_i = iμ, i = 1, ..., n − k + 1, yield (see also Eq. (6.149))

$$PA_{S n_{sp}} \approx 1 - \frac{(k\lambda/\mu)^{n_{sp}+1}}{(n_{sp}+1)!}, \quad n_{sp} = n - k \text{ spare parts, } n_{sp}+1 \text{ repair crews, Case 1}, \qquad (4.32)$$

and, with ν_i as before and θ_i = iμ, i = 1, ..., n − k,

$$R_{S0 n_{sp}}(t) \approx e^{-t\,\mu\,(k\lambda/\mu)^{n_{sp}+1} / n_{sp}!}, \quad n_{sp} = n - k \text{ spare parts, } n_{sp} \text{ repair crews, Case 2}. \qquad (4.33)$$

Using results of Appendix A7.5.5 (Eq. (A7.157)) and considering kλ << μ, it can be shown that the approximation given by Eq. (4.32) holds also if the assumption "no further failures are considered when a request for a spare part cannot be met" is not made. For Eq. (4.33) it holds necessarily that no further failures are considered when a request for a spare part cannot be met (system down states are made absorbing for reliability calculations).

Generalization of the repair rate leads to semi-regenerative processes with n − k + 1 regeneration states and n − k non-regeneration states (Sections 6.4.2 and 6.5.2, Appendix A7.7). For instance, assuming for the repair time a density g(t), a mean MTTR, and a variance Var[τ'], Eq. (6.109) with kλ instead of λ and λ_r = 0 (see remark on p. 490) and Eq. (6.113) for g̃(kλ) lead to

n_sp = n − k = 1 spare part, 1 repair crew, Case 1.

Similarly, Eq. (6.107) with kλ instead of λ and λ_r = 0 and Eq. (6.114) lead to

n_sp = n − k = 1 spare part, 1 repair crew, Case 2.

The last approximation in Eq. (4.34) assumes for the coefficient of variation κ that

which holds for most of the distribution functions used for repair times (Fig. 4.4). Assuming MTTR = 1/μ, i.e. the same mean time to repair disregarding the distribution of the repair time, the last approximations in Eqs. (4.34) and (4.35) yield the same result as given by Eqs. (4.26) and (4.29). This shows the small influence of the repair time distribution on results at system level. The last approximation in Eq. (4.35) is obtained by assuming kλ MTTR << 1, i.e. using g̃(kλ) ≈ 1 − kλ MTTR (Eq. (6.114)). For the approximation in Eq. (4.34) it was necessary to use g̃(kλ) ≈ 1 − kλ MTTR + (kλ)² (MTTR² + Var[τ']) / 2 (Eq. (6.113)).

Taking R_S(t) = e^(−t / MTTF_S) in Eqs. (4.31), (4.33) & (4.35), and PA_S as in Eqs. (4.28), (4.32) & (4.34), PA_S can be expressed as (Eq. (A7.189))

$$PA_S \approx \frac{\mathrm{MTTF}_S}{\mathrm{MTTF}_S + \mathrm{MTTR}_S}, \qquad (4.37)$$

with MTTR_S = 1/μ, MTTR_S = 1 / ((n − k + 1)μ) & MTTR_S = MTTR, respectively. The results of Sections 4.5.1 to 4.5.3, in particular those of Section 4.5.2 on decentralized logistic support, can be extended to cover the more general case of systems with different spare parts.

4.6 Repair Strategies

Repair (restoration) strategies can be very different according to the objective to be reached (choice between block and age replacement, minimization of the number of spare parts or of the down time at system level, maximization of the availability by given cost and/or logistic support constraints, etc.). In addition to the considerations of Section 4.5 on spare parts provisioning, this section deals with some basic repair strategies from a system performance point of view. Specific cost aspects are considered in Section 4.7.

In order to avoid wearout failures, replacement can be performed at a given (fixed) operating time θ or at failure if the operating time is smaller than θ (age replacement). Assuming that after replacement the system is as-good-as-new, each replacement is a renewal point for the underlying process. Fig. 4.5a gives a possible time schedule for this case. If F(x) is the distribution function of the involved failure-free time τ, results of Appendix A7.2 for renewal processes and of Section 4.5 for spare parts provisioning can be used, taking for the failure-free time τ the truncated distribution function F_θ(x)

$$F_\theta(x) = \begin{cases} F(x) & \text{for } 0 \le x < \theta \\ 1 & \text{for } x \ge \theta, \end{cases} \qquad (4.38)$$

instead of F(x). In particular, for F(x) = Pr{τ ≤ x} = 1 − e^(−λx) it holds that

$$\mathrm{MTTF} = \mathrm{E}[\tau] = \int_0^{\theta} (1 - F(x))\, dx = \frac{1 - e^{-\lambda\theta}}{\lambda} \qquad (4.39)$$

Figure 4.5 Possible time schedules for a repairable system with preventive maintenance: a) After θ operating hours or at failure; b) Only at fixed times θ, 2θ, ... (x starts at x = 0 at each renewal point)

and

$$\mathrm{Var}[\tau] = \sigma^2 = \mathrm{E}[\tau^2] - \mathrm{E}^2[\tau] = \frac{1}{\lambda^2}\big(1 - e^{-2\lambda\theta}\big) - \frac{2\theta}{\lambda}\, e^{-\lambda\theta} \qquad (4.40)$$

(Eqs. (A6.38), (A6.41), (A6.45)). For the number of replacements ν(t) in the interval (0, t] it follows in particular that (Eq. (A7.34))

with MTTF and σ from Eqs. (4.39) and (4.40) for the case of constant failure rate. A further possibility is to perform a replacement only at times θ, 2θ, ..., taking into account that if there is a failure between kθ and (k + 1)θ the system is down from the failure occurrence up to the time (k + 1)θ. Figure 4.5b shows a possible time schedule. If ν(nθ) is the number of failures in the time interval (0, nθ], the probability to have ν(nθ) = k is given by the binomial distribution (Eq. (A6.120) with p = F(θ))

$$\Pr\{\nu(n\theta) = k\} = \binom{n}{k}\, F(\theta)^k\, (1 - F(\theta))^{n-k}, \quad k = 0, \ldots, n. \qquad (4.42)$$

Mean and variance of the number of failures in (0, nθ] are then given by (Eqs. (A6.122) and (A6.123))

$$\mathrm{E}[\nu(n\theta)] = n\, F(\theta) \quad \text{and} \quad \mathrm{Var}[\nu(n\theta)] = n\, F(\theta)\,(1 - F(\theta)). \qquad (4.43)$$
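Both replacement strategies above lend themselves to a quick numerical check for the case of constant failure rate. The following sketch is illustrative only: the function names and the values λ = 10⁻³ h⁻¹, θ = 500 h, n = 10 are hypothetical. It computes mean and variance of the truncated failure-free time min(τ, θ) in closed form, verifies them by midpoint integration of the survival function, and evaluates the binomial mean and variance of Eq. (4.43).

```python
import math


def age_replacement_moments(lam, theta):
    """Closed-form mean and variance of min(tau, theta), tau exponential with rate lam."""
    mean = (1.0 - math.exp(-lam * theta)) / lam
    var = ((1.0 - math.exp(-2.0 * lam * theta)) / lam**2
           - 2.0 * theta * math.exp(-lam * theta) / lam)
    return mean, var


def numeric_moments(lam, theta, steps=100_000):
    """Same moments by midpoint integration of the survival function on (0, theta)."""
    h = theta / steps
    m1 = m2 = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        r = math.exp(-lam * x)      # Pr{tau > x}
        m1 += r * h                 # first moment: integral of R(x)
        m2 += 2.0 * x * r * h       # second moment: integral of 2 x R(x)
    return m1, m2 - m1**2


def block_replacement_failures(n, lam, theta):
    """Eq. (4.43): mean and variance of the number of failures in (0, n * theta]."""
    p = 1.0 - math.exp(-lam * theta)    # F(theta)
    return n * p, n * p * (1.0 - p)


lam, theta = 1e-3, 500.0                # hypothetical values: 1/h and h
mean, var = age_replacement_moments(lam, theta)
mean_num, var_num = numeric_moments(lam, theta)
fail_mean, fail_var = block_replacement_failures(10, lam, theta)

print(f"closed form: mean = {mean:.2f} h, variance = {var:.1f} h^2")
print(f"numerical:   mean = {mean_num:.2f} h, variance = {var_num:.1f} h^2")
print(f"failures in 10 intervals: mean = {fail_mean:.3f}, variance = {fail_var:.3f}")
```

Truncation at θ reduces both the mean and the scatter of the time between replacements, which is why the spare parts formulas of Section 4.5 can be reused with these moments.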

If the age replacement is too expensive, a further strategy is to assume that at times θ, 2θ, ... the system is inspected, but a replacement at the time (k + 1)θ is performed only if a failure has occurred between kθ and (k + 1)θ. If the failure-free time τ has distribution function F(x), the replacement time τ_repl has distribution

This case has been investigated in [6.16] with cost considerations. If

c_1 = inspection cost (c_1 > 0),
c_1 + c_2 = cost for inspection and replacement (c_2 > 0),
c_3 = cost per unit of time (h) in which the system is down waiting for replacement (c_3 > 0),

the total cost C per unit time is for t → ∞ given by

where MTTF = E[τ]. For θ → ∞, E[τ_repl] → ∞ and C → c_3; thus, inspections are useful for C < c_3. For given F(x) it is possible to find a θ which minimizes C.

For the mission availability and work-mission availability, as defined by Eqs. (6.28) and (6.31), it can be asked in some applications that the number of repairs (replacements) be limited to N (e.g. because just N − 1 spare parts are available). In this case, the summation in Eqs. (6.29) and (6.32) goes up to n = N. If k elements E_1, ..., E_k with constant failure rates λ_1, ..., λ_k and constant repair rates μ_1, ..., μ_k are in series, a good approximation for the work-mission availability with limited repairs is obtained by multiplying the probability for total system down time ≤ x for unlimited repairs (Eq. (7.22) with λ = λ_S and μ = μ_S from Table 6.10 (2nd row)) with the k probabilities that N_i − 1 spare parts will be sufficient for element E_i [6.10] (similar as for Eq. (4.19)).

A strategy can also be based on the repair time τ' itself. Assuming for example that if the repair is not finished at time Δ the failed element is replaced (at time Δ) by a new equivalent one in a negligible time, the distribution function G(x) of the repair (restoration) times τ' is truncated at Δ (Eq. (4.38) with Δ instead of θ). For the case of constant repair rate μ, the Laplace transform of G(x) to be used in reliability or availability computations is given by (Appendix A9.7)

However, a truncated distribution function will break the memoryless property and must thus be considered like a general distribution function, leading to semi-regenerative processes (Appendix A7.7 and Sections 6.4.2 and 6.5.2).

4.7 Cost Considerations

Cost considerations are important in practical applications and apply in particular to spare parts provisioning (Section 4.5) and maintenance strategies (Section 4.6). In the following, two basic models based on homogeneous Poisson processes (HPP) with fixed and random cost are discussed.

As a first example consider the case in which a constant cost c_0 is related to each repair (renewal) of a given item. Assuming that repair duration is negligible and times between successive failures are independent and exponentially distributed with parameter λ, the failure flow is a homogeneous Poisson process and the probability for n failures during the operating time t (ν(t) = n) is given by (Eq. (A7.41))

$$\Pr\{\nu(t) = n\} = \frac{(\lambda t)^n}{n!}\, e^{-\lambda t}, \quad n = 0, 1, 2, \ldots \qquad (4.47)$$

Eq. (4.47) is also the probability that the cumulated repair cost over t is C = n c_0. Mean and variance of C are (Eqs. (A6.40) and (A6.46) with Eq. (A7.42))

$$\mathrm{E}[C] = c_0\, \lambda t \quad \text{and} \quad \mathrm{Var}[C] = c_0^2\, \lambda t. \qquad (4.48)$$

For large λt, C is approximately normally distributed with mean and variance as per Eq. (4.48).

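If repair cost is a random variable ξ_i > 0 distributed according to F(x) = Pr{ξ_i ≤ x} (i = 1, 2, ...), ν(t) the count function giving the number of failures in the operating time interval (0, t], and ξ_t the sum of ξ_i over (0, t], it holds that (Eq. (A7.218))

$$\xi_t = \sum_{i=1}^{\nu(t)} \xi_i \quad \text{for } \nu(t) \ge 1, \qquad \xi_t = 0 \quad \text{for } \nu(t) = 0. \qquad (4.49)$$

ξ_t is distributed as the (cumulative) repair time for failures occurred in a total operating time t of a repairable item, and is given by the work-mission availability (Eq. (6.32) with T_0 = t). Assuming that the failure flow is a homogeneous Poisson process (HPP) with parameter λ and all ξ_i are independent of ν(t) and have the same exponential distribution with parameter μ, Eq. (6.32) with constant failure and repair rates λ(x) = λ and μ(x) = μ and T_0 = t yields (Eq. (A7.219))

$$\Pr\{\xi_t \le x\} = 1 - e^{-(\lambda t + \mu x)} \sum_{n=1}^{\infty} \frac{(\lambda t)^n}{n!} \sum_{k=0}^{n-1} \frac{(\mu x)^k}{k!}. \qquad (4.50)$$

Mean and variance of ξ_t follow as (Eq. (A7.220), see also Eqs. (4.50), (A6.38), (A6.45), (A6.41))

$$\mathrm{E}[\xi_t] = \frac{\lambda t}{\mu} \quad \text{and} \quad \mathrm{Var}[\xi_t] = \frac{2 \lambda t}{\mu^2}. \qquad (4.51)$$

Furthermore, for t → ∞ the distribution of ξ_t approaches a normal distribution with mean and variance as per Eq. (4.51). Moments of ξ_t can also be obtained for arbitrary F(x) = Pr{ξ_i ≤ x}, with F(0) = 0 (Example A7.14)

$$\mathrm{E}[\xi_t] = \lambda t\, \mathrm{E}[\xi_i] \quad \text{and} \quad \mathrm{Var}[\xi_t] = \lambda t\, \mathrm{E}[\xi_i^2]. \qquad (4.52)$$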
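The random-cost model can be explored by simulation. The sketch below draws samples of the cumulative cost for an HPP failure flow with independent, exponentially distributed costs, and compares sample mean and variance with the compound Poisson values λt/μ and 2λt/μ². The function name, the parameter values, and the fixed seed are assumptions chosen for illustration only.

```python
import random
import statistics


def simulate_total_cost(lam, mu, t, rng):
    """One sample of the cumulative cost over (0, t]: HPP failures, exponential costs."""
    total = 0.0
    clock = rng.expovariate(lam)           # time of the first failure
    while clock <= t:
        total += rng.expovariate(mu)       # cost of this repair, mean 1 / mu
        clock += rng.expovariate(lam)      # time of the next failure
    return total


lam, mu, t = 1e-3, 0.5, 10_000.0           # hypothetical: lam * t = 10 failures on average
rng = random.Random(12345)                 # fixed seed so that the run is repeatable
samples = [simulate_total_cost(lam, mu, t, rng) for _ in range(20_000)]

mean_sim = statistics.fmean(samples)
var_sim = statistics.variance(samples)
mean_theory = lam * t / mu                 # compound Poisson mean
var_theory = 2.0 * lam * t / mu**2         # compound Poisson variance

print(f"mean:     simulated {mean_sim:.2f}, theory {mean_theory:.2f}")
print(f"variance: simulated {var_sim:.2f}, theory {var_theory:.2f}")
```

The variance is twice what a fixed cost of 1/μ per repair would give (c_0² λt = λt/μ²), since the randomness of each individual cost adds to that of the number of failures.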


Of interest in some practical applications can also be the distribution of the time τ_C at which the cumulative cost ξ_t crosses a given (fixed) barrier C. For the case given by Eq. (4.50), i.e. in particular for ξ_i > 0, the events

$$\{\tau_C > t\} \quad \text{and} \quad \{\xi_t \le C\} \qquad (4.53)$$

are equivalent. From Eq. (4.50) it follows then (Eq. (A7.223))

$$\Pr\{\tau_C > t\} = \Pr\{\xi_t \le C\} = 1 - e^{-(\lambda t + \mu C)} \sum_{n=1}^{\infty} \frac{(\lambda t)^n}{n!} \sum_{k=0}^{n-1} \frac{(\mu C)^k}{k!}. \qquad (4.54)$$

More general cost optimization strategies are often necessary in practical applications. For example, spare parts provisioning has to be considered as a parameter in the optimization between performance, reliability, availability, logistic support and cost, taking care of obsolescence aspects as well. In some cases, one parameter is given (e.g. cost) and the best logistic structure is sought to maximize system availability or system performance. Basic considerations, as discussed above and in Sections 1.2.9, 8.4, A6.10.7, A7.5.3.3, apply. However, even assuming constant failure and repair rates, numerical solutions can become necessary, see [4.24] for an example.

5 Design Guidelines for Reliability, Maintainability, and Software Quality

Reliability, maintainability, and software quality have to be built into complex equipment and systems during the design and development phase. This has to be supported by analytical investigations (Chapters 2, 4, and 6) as well as by design guidelines. Adherence to such guidelines limits the influence of those aspects which can invalidate the models assumed for analytical investigations, and contributes greatly to building in reliability, maintainability, and software quality. This chapter gives a comprehensive list of design guidelines for reliability, maintainability, and software quality of complex equipment and systems, harmonized with industry's needs [1.2 (1996)].

5.1 Design Guidelines for Reliability

Reliability analysis in the design and development phase (Chapter 2) gives an estimate of an item's true reliability, based on some assumptions regarding data used, interface problems, dependence between components, compatibility between materials, environmental influences, transients, EMC, ESD, etc., as well as on the quality of manufacture and the user's skill level. To consider exhaustively all these aspects is difficult. The following design guidelines can be used to alleviate intrinsic weaknesses and improve the inherent reliability of complex equipment and systems.

5.1.1 Derating

Thermal and electrical stresses greatly influence the failure rate of electronic components. Derating is mandatory to improve the inherent reliability of equipment and systems. Table 5.1 gives recommended stress factors $S$ (Eq. (2.1)) to be used for industrial applications (40°C ambient temperature $\theta_A$, $G_B$ as per Table 2.3). For $\theta_A > 40$°C, a further reduction of $S$ is necessary, in general linearly up to the limit temperature, as shown in Fig. 2.3. Too low values of $S$ ($S < 0.1$) can also cause problems. $S = 0.1$ can be used in many cases to calculate the failure rate in a standby or dormant state. As a rule of thumb, $S \le 0.5$ is a good choice for reliability.

Table 5.1 Recommended derating values for electronic components at ambient temperature 20°C $\le \theta_A \le$ 40°C

* breakdown voltage; ** isolation voltage (0.7 for $U_{in}$); + sink current; ++ low values for inductive loads; × $\theta_J$ < 100°C
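The derating rule can be sketched in a few lines of Python. The sketch assumes that the stress factor is the ratio of applied to rated load and that, above 40°C, the recommended value is reduced linearly to zero at the limit temperature. The linear-to-zero shape, the function names, and the component values are assumptions made for illustration and have to be replaced by the data of Table 5.1 and Fig. 2.3.

```python
def stress_factor(applied, rated):
    """Stress factor S: applied load divided by rated load (same units)."""
    return applied / rated


def allowed_stress_factor(theta_a, s_40, theta_limit):
    """Largest S allowed at ambient temperature theta_a (deg C).

    Assumption of this sketch: the value s_40 recommended up to 40 deg C is
    reduced linearly to zero at the limit temperature theta_limit.
    """
    if theta_a <= 40.0:
        return s_40
    if theta_a >= theta_limit:
        return 0.0
    return s_40 * (theta_limit - theta_a) / (theta_limit - 40.0)


def derating_ok(applied, rated, theta_a, s_40, theta_limit):
    """True if the actual S stays within the temperature-adjusted limit."""
    limit = allowed_stress_factor(theta_a, s_40, theta_limit)
    return stress_factor(applied, rated) <= limit


if __name__ == "__main__":
    # Hypothetical resistor: 0.25 W rated, 0.1 W dissipated, limit 125 deg C
    print(stress_factor(0.1, 0.25))
    print(allowed_stress_factor(82.5, 0.6, 125.0))
    print(derating_ok(0.1, 0.25, 82.5, 0.6, 125.0))
```

A check of this kind is most useful when run over the complete parts list, so that components which are acceptable at 40°C but not at the actual ambient temperature are flagged early.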

5.1.2 Cooling

As a general rule, the junction temperature $\theta_J$ of semiconductor devices should be kept as near as possible to the ambient temperature of the equipment or system in which they are used. For a good design, $\theta_J \le 100$°C is recommended. In a steady-state situation, i.e. with constant power dissipation $P$, the following relationships

$$\theta_J = \theta_A + R_{JA}\,P \tag{5.1}$$

$$\theta_J = \theta_A + (R_{JC} + R_{CS} + R_{SA})\,P \tag{5.2}$$

can be established and used to define the thermal resistance

$R_{JA}$ for junction - ambient, $R_{JC}$ for junction - case,

$R_{CS}$ for case - surface, $R_{SA}$ for surface - ambient,

where surface is used for heat sink.

Example 5.1

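Determine the thermal resistance $R_{SA}$ of a heat sink by assuming $P = 400$ mW, $\theta_J = 70$°C, and $R_{JC} + R_{CS} = 35$°C/W.

Solution From Eq. (5.2) it follows that

$$R_{SA} = \frac{\theta_J - \theta_A}{P} - R_{JC} - R_{CS}$$

and thus, with $\theta_J - \theta_A = 30$°C (i.e. $\theta_A = 40$°C),

$$R_{SA} = \frac{30\,°\mathrm{C}}{0.4\,\mathrm{W}} - 35\,°\mathrm{C/W} = 40\,°\mathrm{C/W}.$$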
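The calculation of Example 5.1 can be reproduced with a short Python sketch. It assumes the steady-state relation between junction temperature, ambient temperature, power dissipation, and the series thermal resistances used in the example, with an ambient temperature of 40°C (implied by the 30°C difference in the solution). The function names are hypothetical.

```python
def junction_temperature(theta_a, p, r_total):
    """Steady-state junction temperature (deg C) for power p (W) and a total
    junction-to-ambient thermal resistance r_total (deg C/W)."""
    return theta_a + r_total * p


def heat_sink_resistance(p, theta_j, theta_a, r_jc_plus_r_cs):
    """Heat-sink resistance R_SA (deg C/W) that keeps the junction at theta_j."""
    return (theta_j - theta_a) / p - r_jc_plus_r_cs


r_sa = heat_sink_resistance(p=0.4, theta_j=70.0, theta_a=40.0, r_jc_plus_r_cs=35.0)
print(f"R_SA = {r_sa:.1f} degC/W")   # prints "R_SA = 40.0 degC/W"
```

Inserting the result back into `junction_temperature` with the total resistance 35 + 40 = 75°C/W returns the 70°C assumed in the example.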

For many practical applications, thermal resistance can be assumed to be independent of the temperature. However, $R_{JC}$ generally depends on the package used (lead frame, packaging form and type), $R_{CS}$ varies with the kind and thickness of thermal compound between the device package and the heat sink (or device support), and $R_{SA}$ is a function of the heat-sink dimensions and form as well as of the type of cooling used (free convection, forced air, liquid-cooled plate, etc.). Typical thermal resistance values $R_{JC}$ and $R_{JA}$ for free convection in ambient air without heat sinks are given in Table 5.2. The values of Table 5.2 are indicative and have to be replaced with specific values for exact calculations.

Cooling problems should not only be considered locally at the component level, but be integrated into a thermal design concept (thermal management). In defining the layout of an assembly, care must be taken in placing high power dissipation parts away from temperature-sensitive components like wet Al capacitors and optoelectronic devices (the useful life is reduced by a factor of 2 for a 10 - 20°C increase of the ambient temperature). In placing the assemblies in a rack, the cooling flow should be directed from the parts with low toward those with high power dissipation.

Table 5.2 Typical thermal resistance values for semiconductor component packages

| Package form | Package type | $R_{JC}$ [°C/W] | $R_{JA}$ [°C/W]** |
|---|---|---|---|
| DIL | Plastic | 10 - 40* | 30 - 100* |
| DIL | Ceramic/Cerdip | 7 - 20* | 30 - 100* |
| PGA | Ceramic | 6 - 10* | 20 - 40* |
| SOL, SOM, SOP | Plastic (SMT) | 20 - 60* | 70 - 240* |
| PLCC | Plastic | 10 - 20* | 30 - 70* |
| QFP | Plastic | 15 - 25* | 30 - 80* |
| TO | Plastic | 2 - 20 | 60 - 300 |

JC = junction to case; JA = junction to ambient; * lower values for ≥ 64 pins; ** free convection at 0.15 m/s (factor 1.5 - 2 lower for forced cooling at 4 m/s)

5.1.3 Moisture

For electronic components in nonhermetic packages, moisture can cause drift and activate various failure mechanisms such as corrosion and electrolysis (see Section 3.2.3, Point 8 for considerations on ICs). Critical in these cases is not the water itself, but the impurities and gases dissolved in it. If high relative humidity can occur, care must be taken to avoid the formation of galvanic couples as well as condensation or ice formation on the component packages or on conductive parts.

As stated in Section 3.1.3, the use of ICs in plastic packages can be allowed if one of the following conditions is satisfied:

1. Continuous operation, relative humidity < 70%, noncorrosive or marginally corrosive environment, junction temperature < 100°C, and equipment useful life less than 10 years.

2. Intermittent operation, relative humidity < 60%, noncorrosive environment, no moisture condensation on the package, junction temperature < 100°C, and equipment useful life less than 10 years.

For ICs with silicon nitride passivation, intermittent operation holds also for Point 1. Drying materials should be avoided, in particular if chlorine compounds are present. Conformal coating on the basis of acrylic, polyurethane, epoxy, silicone or fluorocarbon resin 25 - 125 µm thick, filling with gel, or encapsulation in epoxy or similar resins are currently used (attention must be given to thermomechanical stresses during hardening). The use of hermetic enclosures for assemblies or equipment should be avoided if condensation cannot be excluded. Indicators for the effects of moisture are an increase of leakage currents or a decrease of insulation resistance.


5.1.4 Electromagnetic Compatibility, ESD Protection

Electromagnetic compatibility (EMC) is the ability of an item to function properly in its intended electromagnetic environment without introducing unacceptable electromagnetic noise (disturbances) into that environment. EMC has thus two aspects, susceptibility and emission. Agreed susceptibility and emission levels are given in international standards (IEC 61000 [3.8]). Electrostatic discharge (ESD) protection is a part of an electromagnetic immunity concept, mandatory for semiconductor devices (Section 3.2.3). Causes for EMC problems in electronic equipment and systems are in particular

switching and transient phenomena, electrostatic discharges, stationary electromagnetic fields.

Coupling can be conductive (galvanic), through common impedance, or by radiated electromagnetic fields.

In the context of ESD or EMC, disturbances often appear as electrical pulses with rise times in the range 0.1 to 10 kV/ns, peak values of 0.1 to 10 kV, and energies of 0.1 to $10^3$ mJ (high values for equipment). EMC aspects, in particular ESD protection, have to be considered early in the design and development of equipment and systems. The following design guidelines can help to avoid problems:

1. For high-speed logic circuits ($f > 20$ MHz) use a whole plane (layer of a multilayer), or at least a tight grid for ground and power supply, to minimize inductance and to ensure a distributed decoupling capacitance (4 layers as signal / Vcc / ground / signal or better 6 layers as shield / signal / Vcc / ground / signal / shield are recommended).

2. For low-frequency digital circuits, analog circuits, and power circuits use a single-point ground concept, and wire all different grounds separately to a common ground point at system level (across antiparallel suppressor diodes).

3. Use low-inductance decoupling capacitors (generally 10 nF ceramic capacitors, placed where spikes may occur, i.e. at every IC for fast logic and bus drivers, every 4 ICs for HCMOS) and a 1 µF metallized paper (or a 10 µF electrolytic) capacitor per board; in the case of a highly pulsed load, locate the voltage regulator on the same board as the logic circuits.

4. Avoid logic which is faster than necessary and ICs with widely different rise times; adhere to required rise times and use Schmitt-trigger inputs if necessary.

5. Pay attention to dynamic stresses (particularly of breakdown voltages on semiconductor devices) as well as to switching phenomena on inductors or capacitors; implement noise reduction measures near the noise source (preferably with Zener diodes or suppressor diodes).

6. Match signal lines whose length is greater than $v \cdot t_r$, also when using differential transmission (often possible with a series resistor at the source or a parallel resistor at the sink, $v$ = signal propagation speed = $c/\sqrt{\varepsilon_r}$); for HCMOS also use a 1 to 2 kΩ pull-up resistor and a pull-down resistor equal to the line impedance $Z_0$, in series with a capacitor of about 200 pF per meter of line.

7. Capture induced noise at the beginning and at the end of long signal lines using parallel suppressors (suppressor diodes), series protectors (ferrite beads) or series/parallel networks (RC), in that order, taking into account the required rise and fall times.

8. Use twisted pairs for signal and return lines (one twist per centimeter); ground the return line at one end and the shield at both ends for magnetic shielding (at more points to shield against electric fields); provide a closed (360°) contact with the shield for the ground line; clock leads should have adjacent ground returns; for clock signals leaving a board consider the use of fiber optics, coax, trileads, or twisted pairs, in that order.

9. Avoid apertures in shielded enclosures (many small holes disturb less than a single aperture having the same area); use magnetic material to shield against low-frequency magnetic fields and materials with good surface conductivity against electric fields, plane waves, and high-frequency magnetic fields (above 10 MHz, absorption loss predominates and shield thickness is determined more by its mechanical than by its electrical characteristics); filter or trap all cables entering or leaving a shielded enclosure (filters and cable shields should make very low inductance contacts to the enclosure); RF parts of analog or mixed-signal equipment should be appropriately shielded (air core inductors have greater emission but less reception capability than magnetic core inductors); all signal lines entering or leaving a circuit should be investigated for common-mode emission; minimize common-mode currents.

10. Implement ESD current-flow paths with multipoint grounds at least for plug-in populated printed circuit boards (PCBs), e.g. with guard rings, ESD networks, or suppressor diodes, making sure in particular that all signal lines entering or leaving a PCB are sufficiently ESD protected (360° contact with the shield if shielded cables are used, latched and strobed inputs, etc.); ground to chassis ground all exposed metal, if necessary use secondary shields between sensitive parts and chassis; design keyboards and other operating parts to be immune to ESD.
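The guideline on matching signal lines compares the line length with the product of signal propagation speed and rise time. A minimal Python sketch of this check follows. It assumes that the propagation speed is the speed of light divided by the square root of the relative permittivity of the dielectric (nonmagnetic material); this formula, the function names, and the numerical values are assumptions of the sketch.

```python
import math

C0 = 299_792_458.0   # speed of light in vacuum, m/s


def propagation_speed(eps_r):
    """Signal propagation speed (m/s), assuming a nonmagnetic dielectric."""
    return C0 / math.sqrt(eps_r)


def critical_length(t_r, eps_r):
    """Line length v * t_r (m) above which the guideline asks for matching."""
    return propagation_speed(eps_r) * t_r


def needs_matching(line_length, t_r, eps_r):
    """True if a line of the given length (m) should be matched."""
    return line_length > critical_length(t_r, eps_r)


if __name__ == "__main__":
    # Hypothetical case: 2 ns rise time on a board with eps_r = 4
    print(critical_length(2e-9, 4.0))        # about 0.3 m
    print(needs_matching(0.5, 2e-9, 4.0))
```

Faster logic shortens the critical length proportionally, which is one more reason to avoid logic that is faster than necessary.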


5.1.5 Components and Assemblies

5.1.5.1 Component Selection

1. Pay attention to all specification limits given by the manufacturer and to company-specific rules, in particular to dynamic parameters and breakdown limits.

2. Limit the number of entries in the list of preferred parts (QPL) and whenever possible ensure a second source procurement; if obsolescence problems are possible (very long warranty or operation time), observe this aspect in the QPL and/or in the design/layout of the equipment or system considered.

3. Use nonqualified parts and components only after checking the technology and reliability risks involved (the learning phase at the manufacturer's plant can take more than 6 months); in the case of critical applications, intensify the feedback to the manufacturer and plan an appropriate incoming inspection with screening.

5.1.5.2 Component Use

1. Tie unused logic inputs to the power supply or ground, usually through pull-up / pull-down resistors (100 kΩ for CMOS), also to improve testability; pull-up / pull-down resistors are also recommended for inputs driven by three-state outputs; unused outputs remain basically open.

2. Protect all CMOS terminals from or to a connector with a 100 kΩ pull-up / pull-down resistor and a 1 to 10 kΩ series resistor (latch-up) for an input, or an appropriate series resistor for an output (add diodes if $V_{in}$ and $V_{out}$ cannot be limited between -0.3 V and $V_{DD}$ + 0.3 V); observe power-up and power-down sequences, make sure that the ground and power supply are applied before and disconnected after the signals.

3. Analyze the thermal stress (internal operating temperature) of each part and component carefully, placing dissipating devices away from temperature-sensitive ones, and adequately cooling components with high power dissipation (failure rates generally double for a temperature increase of 10 - 20°C); for semiconductor devices, design for a junction temperature $\theta_J \le 100$°C (if possible keep $\theta_J \le 80$°C).

4. Pay attention to transients, especially in connection with breakdown voltages of transistors ($V_{BEO} \le 5$ V; stress factor $S < 0.5$ for $V_{CE}$, $V_{GS}$, and $V_{DS}$).

5. Derate power devices more than signal devices (stress factor $S < 0.4$ if more than $10^5$ power cycles occur during the useful life).

6. Avoid special diodes (tunnel, step-recovery, pin, varactor, which are 2 to 20 times less reliable than normal Si diodes); Zener diodes are about one half as reliable as Si switching diodes, their stress factor should be > 0.1.

7. Allow a ±30% drift of the coupling factor for optocouplers during operation; regard optocouplers and LEDs as having a limited useful life (generally > $10^6$ h for $\theta_J$ < 40°C and < $10^5$ h for $\theta_J$ > 80°C), design for $\theta_J \le 70$°C (if possible keep $\theta_J \le 40$°C); pay attention to optocoupler voltage ($S \le 0.3$).

8. Observe operating temperature, voltage stress (DC and AC), and technological suitability of capacitors for a given application: foil capacitors have a reduced impulse handling capability; wet Al capacitors have a limited useful life (which halves for every 10°C increase in temperature), a large series inductance, and a moderately high series resistance; for solid Ta capacitors the AC impedance of the circuit as viewed from the capacitor terminals should not be too small (the failure rate is an order of magnitude higher with 0.1 Ω/V than with 2 Ω/V, although new types are less sensitive); use a 10 - 100 nF ceramic capacitor parallel to each electrolytic capacitor; avoid electrolytic capacitors < 1 µF.

9. Cover EPROM windows with metallized foils, also when stored.

10. Avoid the use of variable resistors in final designs (50 to 100 times less reliable than fixed resistors); for power resistors, check the internal operating temperature as well as the voltage stress.
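Two of the rules of thumb above are simple exponential laws in temperature: the useful life of wet Al capacitors halves for every 10°C increase, and failure rates generally double for a temperature increase of 10 - 20°C. The Python sketch below expresses both. The function names, the reference life, and the chosen doubling step are hypothetical values for illustration only.

```python
def capacitor_useful_life(life_ref_h, theta_ref, theta):
    """Useful life (h) of a wet Al capacitor at temperature theta (deg C),
    given life_ref_h at theta_ref and a halving for every 10 deg C increase."""
    return life_ref_h * 2.0 ** (-(theta - theta_ref) / 10.0)


def failure_rate_factor(delta_theta, doubling_step=15.0):
    """Factor by which the failure rate grows for a temperature increase
    delta_theta (deg C), doubling every doubling_step (10 to 20 deg C)."""
    if not 10.0 <= doubling_step <= 20.0:
        raise ValueError("doubling_step outside the 10 to 20 deg C range")
    return 2.0 ** (delta_theta / doubling_step)


if __name__ == "__main__":
    # Hypothetical capacitor specified for 2000 h at 85 deg C, used at 65 deg C
    print(capacitor_useful_life(2000.0, 85.0, 65.0))
    # Failure rate growth for a 30 deg C hotter location, both ends of the range
    print(failure_rate_factor(30.0, 10.0), failure_rate_factor(30.0, 20.0))
```

Evaluating both ends of the 10 - 20°C range shows how wide the uncertainty of such a rule of thumb is, so the result should be treated as an order-of-magnitude indication.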

5.1.5.3 PCB and Assembly Design

1. Design all power supplies to handle permanent short circuits and monitor for under/over voltage (protection diode across the voltage regulator to avoid $V_{out} > V_{in}$ at power shutdown); use a 10 to 100 nF decoupling ceramic capacitor parallel to each electrolytic capacitor.

2. Clearly define, and implement, interfaces between different logic families.

3. Establish timing diagrams using worst-case conditions, also taking the effects of glitches into consideration.

4. Pay attention to inductive and capacitive coupling in parallel signal leads (0.5 - 1 µH/m, 50 - 100 pF/m); place signal leads near to ground returns and away from power supply leads, in particular for clocks; for high-speed circuits, investigate the requirement for wave matching (parallel resistor at the sink, series resistor at the source); introduce guard rings or ground tracks to limit coupling effects.

5. Place all input/output drivers close together, near the connectors, but away from clock circuitry and power supply lines (inputs latched and strobed).

6. Protect PCBs against damage through insertion or removal under power (use appropriate connectors).

7. For PCBs employing surface mount technology (SMT), make sure that the component spacing is not smaller than 0.5 mm and that the lead width and spacing are not smaller than 0.25 mm; test pads and solder-stop pads should be provided; for large leadless ceramic ICs, use an appropriate lead frame (problems in SMT arise with soldering, heat removal, mismatch of expansion coefficients, pitch dimensions, pin alignment, cleaning, and contamination); pitch < 0.3 mm can give production problems.

8. Observe the power-up and power-down sequences, especially in the case of different power supplies (no signals applied to unpowered devices).

9. Make sure that the mechanical fixing of power devices is appropriate, in particular of those with high power dissipation; avoid having current-carrying contacts under thermomechanical stress.

10. The testability of PCBs and assemblies should be considered early in the design of the layout (number and dimension of test points, pull-up / pull-down resistors, activation/deactivation of three-state outputs, see also Section 5.2); manually extend the capability of CAD tools if necessary.

5.1.5.4 PCB and Assembly Manufacturing

1. Keep the workplaces for assembling, soldering, and testing conductive, in particular ground tools and personnel with 1 MΩ resistors; avoid touching the active parts of components during assembling; use soldering irons with transformers and grounded tips.

2. When using automatic placing machines for inserted devices, verify that only the parts of pins free from insulation go into the soldering holes (resistor networks, capacitors, relays) and that IC pins are not bent into the soldering holes (hindering degassing); for surface mount devices (SMD), make sure that the correct quantity of solder material is deposited, and that the stand-off height between the component body and the printed circuit surface is not less than 0.25 mm (pitch < 0.3 mm can give production problems, see also Section 3.3.4 for possible placing-related ESD damages).

3. Control the soldering temperature profile; for wave soldering choose the best compromise between soldering time and soldering temperature (about 3 s at 245°C) as well as an appropriate preheating (about 60 s to reach 100°C); check the solder bath periodically and make sure that there is sufficient distance between the solder joints and the package for temperature-sensitive devices; for surface mount technology (SMT) give preference to IR reflow soldering and provide good solder-stop pads (vapor-phase can be preferred for substrates with metal core or PCBs with high component density); avoid having inserted and surface mounted devices (SMD) on the same (two-sided) PCB (thermal shock on the SMD with consequent crack formation and possibly ingress of flux to the active part of the component, in particular for ceramic capacitors greater than 100 nF and large plastic ICs).

4. Avoid soldering gold-plated pins; if not possible, tin-plate the pins in order to reduce Au concentration to < 4% in the solder joint (intermetallic layers) and < 0.5% in the solder bath (contamination), 0.2 µm < Au thickness < 0.5 µm.

5. Avoid having more than one heating process that reaches the soldering temperature, and hence any kind of rework; for temperature-sensitive devices, consider the possibility of adequate protection during soldering (support, cooling ring, etc.).

6. For high reliability applications, wash PCBs and assemblies after soldering (deionized water (< 5 µS/cm), in any case with halogen-free liquids); periodically check the washing liquid for contamination; use ultrasonic cleaning only when resonance problems in components are excluded.

7. Avoid any kind of electrical overstress when testing components, PCBs or assemblies; avoid removal and insertion under power.

5.1.5.5 Storage and Transportation

1. Keep the storage temperature between 10 and 30°C and the relative humidity between 40 and 60%; avoid dust, corrosive atmospheres, and mechanical stresses (particularly for electromechanical components); use hermetically sealed containers for high-humidity environments only.

2. Limit the storage time by implementing first-in / first-out rules (storage time should be no longer than two years, just-in-time shipping is often only possible for a stable production line).

3. Ensure antistatic storage and transportation of all ESD sensitive electronic components, in particular semiconductor devices (use metallized, unplasticized bags, avoid PVC for bags).

4. Transport PCBs and assemblies in antistatic containers and with all connectors shorted.

5.1.6 Particular Guidelines for IC Design and Manufacturing

1. Reduce latch-up sensitivity by increasing critical distances, changing local doping, or introducing vertical thick-oxide isolation.

2. Avoid significant voltage drops along resistive leads (polysilicon) by increasing line conductivity and/or dimensions or by using multilayer metallizations.

3. Give sufficient size to the contact windows and avoid large contact depth and thus sharp edges (slopes); ensure material compatibility, in particular with respect to metallization layers.

4. Take into account chemical compatibility between materials and tools used in sequential processes; limit the use of planarization processes to uncritical metallization line distances; employ preferably stable processes (low-risk processes) which allow a reasonable parameter deviation; control carefully the wafer raw material (CZ/FZ material, crystal orientation, O2 conc., etc.).

5.2 Design Guidelines for Maintainability


Maintainability, even more than reliability, must be built into complex equipment and systems. This generally has to be done on a project-specific basis, with a maintenance concept. However, a certain number of design guidelines for maintainability apply quite generally. These will be discussed in this section for the case of complex electronic equipment and systems with high maintainability requirements.

5.2.1 General Guidelines

1. Plan and implement a concept for automatic fault recognition and automatic or semiautomatic fault isolation (localization and diagnosis) down to the line replaceable unit (LRU) level, including hidden failures and software defects, as far as possible.

2. Partition the equipment or system into line replaceable units (LRUs) and apply techniques of modular construction, starting from the functional structure; make modules functionally independent and electrically as well as mechanically separable; develop easily replaceable LRUs which can be tested with commonly available test equipment.

3. Aim for the greatest possible standardization of parts, tools, and testing equipment; keep the need for external testing facilities to a minimum.

4. Conceive operation and maintenance procedures to be as simple as possible, also considering personnel safety, and describe them in appropriate manuals.

5. Consider environmental conditions (thermal, climatic, mechanical) in field operation as well as during transportation and storage.

5.2.2 Testability

Testability includes the degrees of failure recognition and isolation, the correctness of test results, and test duration. High testability can generally be achieved by improving observability (the possibility to check internal signals at the outputs) and controllability (the possibility to modify internal signals from the inputs). Of the following design guidelines, the first five are more for assemblies, and the last five are more for ICs (ASICs in particular).

1. Avoid asynchronous logic (asynchronous signals should be latched and strobed at the inputs).

2. Simplify logical expressions as far as possible.

3. Improve testability of connection paths and simple circuitry using ICs with boundary-scan (IEEE STD 1149 [4.10]).

4. Separate analog and digital circuit paths, as well as circuitry with different supply voltages; make power supplies mechanically separable.

5. Make feedback paths separable.

[Figure: logic with feedback path, test point, and control signal]

6. Realize modules as self-contained as possible, with small sequential depth, electrically separable and individually testable.

[Figure: separation of modules with MUXs (control signal 1, control signal 2) and with gates]

7. Allow for external initialization of sequential logic.

[Figure: flip-flop with clock, external clock, and test points]

8. Develop and introduce built-in self-test (BIST); introduce test modes also for the detection of hidden failures.

9. Provide enough test points (at a minimum on functional-unit inputs and outputs as well as on bus lines) and support them with pull-up / pull-down resistors; provide access for a probe, taking into account the capacitive load (resistive in the case of DC measurements).

10. Make use of a scan path to reduce test time; the basic idea of a scan path is shown on the right-hand side of Fig. 5.1. The test procedure with a scan path is as follows ($n = 3$ in Fig. 5.1):

   1. Activate the MUX control signal (connect Z to B).

   2. Scan-in with $n$ clock pulses an appropriate $n$-bit test pattern; this pattern appears in parallel at the FF outputs and can be read serially with $n - 1$ additional clock pulses (repeat this step to completely test MUXs and FFs).

   3. Scan-in with $n$ clock pulses a first test pattern for the combinatorial logic (feedback part) and apply an appropriate pattern also to the input (both patterns are applied to the combinatorial circuit and generate corresponding results which appear at the output Y and at the inputs A of the MUXs).

   4. Verify the results at the output Y.

   5. Deactivate the MUX control signal (connect Z to A).

   6. Give one clock pulse (feedback results from the combinatorial circuit appear in parallel at the FF outputs).

   7. Activate the MUX control signal (connect Z to B).

   8. Scan-out with $n - 1$ clock pulses and verify the results; at the same time a second test pattern for the combinatorial circuit can be scanned-in.

   9. Repeat steps 3 - 8 up to a satisfactory test of the combinatorial part of the circuit (see e.g. [4.12, 4.13, 4.23] for test algorithms specially developed for combinatorial circuits).

Figure 5.1 Basic structure of a synchronous sequential circuit, without a scan path on the left-hand side and with a scan path on the right-hand side
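The scan-path test procedure can be illustrated with a small behavioral model. The Python sketch below is only an illustration under assumptions that are not in the text: the class `ScanPathCircuit`, the 3-bit counter used as combinatorial feedback logic, and the injected stuck-at-0 fault are hypothetical, and the scan-out step shifts in zeros instead of overlapping with the next scan-in.

```python
from itertools import product


class ScanPathCircuit:
    """Toy model of n flip-flops, each fed through a 2-to-1 MUX."""

    def __init__(self, n, feedback_logic):
        self.n = n
        self.ff = [0] * n
        self.feedback_logic = feedback_logic   # (state, x) -> next state

    def clock(self, scan_mode, scan_in=0, x=0):
        if scan_mode:      # MUX connects Z to B: FFs form a shift register
            self.ff = [scan_in] + self.ff[:-1]
        else:              # MUX connects Z to A: FFs load the feedback results
            self.ff = list(self.feedback_logic(tuple(self.ff), x))

    def scan_in(self, pattern):
        """Shift an n-bit pattern into the FFs with n clock pulses."""
        for bit in reversed(pattern):
            self.clock(scan_mode=True, scan_in=bit)

    def scan_out(self):
        """Read the FF contents serially using n - 1 clock pulses."""
        bits = [self.ff[-1]]                   # last FF is visible without a clock
        for _ in range(self.n - 1):
            self.clock(scan_mode=True)
            bits.append(self.ff[-1])
        return tuple(reversed(bits))


def counter_spec(state, x):
    """Reference feedback logic: counter with enable x (bit 0 = LSB)."""
    value = (sum(b << i for i, b in enumerate(state)) + x) % (2 ** len(state))
    return tuple((value >> i) & 1 for i in range(len(state)))


def counter_faulty(state, x):
    """Same logic with the most significant output stuck at 0."""
    return counter_spec(state, x)[:-1] + (0,)


def run_scan_test(circuit, spec):
    """Apply every (state, input) pattern and collect the mismatches."""
    failures = []
    for state in product((0, 1), repeat=circuit.n):
        for x in (0, 1):
            circuit.scan_in(state)               # scan-in the test pattern
            circuit.clock(scan_mode=False, x=x)  # one clock pulse in normal mode
            observed = circuit.scan_out()        # scan-out and verify
            if observed != spec(state, x):
                failures.append((state, x, observed))
    return failures


if __name__ == "__main__":
    print(len(run_scan_test(ScanPathCircuit(3, counter_spec), counter_spec)))
    print(len(run_scan_test(ScanPathCircuit(3, counter_faulty), counter_spec)))
```

The exhaustive pattern set is feasible only for such a small example; for real circuits the patterns would come from a test generation algorithm for combinatorial logic.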

5.2.3 Accessibility, Exchangeability

1. Provide self-latching access flaps of sufficient size; avoid the need for special tools (one-way screws, Allen screws, etc.); use clamp fastening.


2. Plan accessibility by considering the frequency of maintenance tasks.

3. Use preferably indirect plug connectors; distribute power supply and ground over several contacts (20% of the contacts should be used for power supply and ground); plan to have reserve contacts; avoid any external mechanical stress on connectors; define (if possible) only one kind of extender for PCBs and plan its use.

4. Provide for speedy replaceability by means of plug-out/plug-in techniques.

5. Prevent faulty installation or connection (of PCBs for instance) through mechanical keying.

5.2.4 Operation, Adjustment

1. Use high standardization in selecting operational tools and make any labeling simple and clear.

2. Consider human aspects in the layout of operating consoles and in defining operating and maintenance procedures.

3. Order all steps of a procedure in a logical sequence and document these steps by a visual feedback.

4. Describe system status, detected fault, or action to be accomplished concisely in full text.

5. Avoid any form of hardware adjustment (or alignment) in the field; if unavoidable, describe the procedure carefully.

5.3 Design Guidelines for Software Quality

Software plays an increasingly important role in equipment and systems, both in terms of technical relevance and of development cost (often higher than 70% even for small systems). Unlike hardware, software does not go through a production phase. Also, software cannot break or wear out. However, it can fail to satisfy its required function because of defects which manifest themselves while the system is operating (dynamic defects). A fault in the software is thus caused by a defect, even if it appears randomly distributed in time, and software problems are basically quality problems which have to be solved with quality assurance tools (defect prevention, configuration management, testing, and quality data reporting systems).

For equipment and systems exhibiting high reliability or safety requirements, software should be conceived and developed to be defect tolerant, i.e. to be able to continue operation despite the presence of software defects. For this purpose, redundancy considerations are necessary, in time domain (protocol with retransmission, cyclic redundancy check, assertions, exception handling, etc.), space domain (error correcting codes, parallel processes, etc.), or as a combination of both. Moreover, if the interaction between hardware and software in the realization of the required function at the system level is large (embedded software), redundancy considerations should also be extended to cover hardware defects and failures, i.e. to make the system fault tolerant (Sections 2.3.7 and 6.8.3 - 6.8.7). In this context, effort should be devoted to the investigation of causes-to-effects aspects (criticality) of hardware and software faults from a system level point of view, including hardware, software, human factors, and logistic support as well.
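Redundancy in the time domain can be illustrated with a minimal Python sketch that combines a cyclic redundancy check, an assertion, exception handling, and retransmission. Everything in it is an assumption made for illustration: the frame layout (payload followed by a 4-byte CRC-32), the function names, and the `FlakyChannel` class, which stands in for a transmission path that corrupts its first few frames.

```python
import zlib

CRC_BYTES = 4


def add_crc(payload: bytes) -> bytes:
    """Append the CRC-32 of the payload to form a frame."""
    return payload + zlib.crc32(payload).to_bytes(CRC_BYTES, "big")


def check_crc(frame: bytes) -> bytes:
    """Return the payload if the CRC matches, otherwise raise ValueError."""
    assert len(frame) >= CRC_BYTES, "frame shorter than its CRC field"
    payload, received = frame[:-CRC_BYTES], frame[-CRC_BYTES:]
    if zlib.crc32(payload).to_bytes(CRC_BYTES, "big") != received:
        raise ValueError("CRC mismatch")
    return payload


def send_with_retransmission(payload, channel, max_attempts=3):
    """Send a frame, retransmitting when the receiver detects corruption."""
    for attempt in range(1, max_attempts + 1):
        try:
            return check_crc(channel(add_crc(payload))), attempt
        except ValueError:
            continue                     # corrupted frame: try again
    raise RuntimeError("transmission failed after all attempts")


class FlakyChannel:
    """Hypothetical channel that flips one bit in its first frames."""

    def __init__(self, corrupted_transmissions):
        self.remaining = corrupted_transmissions

    def __call__(self, frame: bytes) -> bytes:
        if self.remaining > 0:
            self.remaining -= 1
            return bytes([frame[0] ^ 0x01]) + frame[1:]
        return frame


if __name__ == "__main__":
    data, attempts = send_with_retransmission(b"open valve 3", FlakyChannel(2))
    print(data, "received after", attempts, "attempts")
```

The bound on the number of attempts matters: when it is exceeded, the error is reported to the caller, so that a higher level can decide how to continue operation.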

This section introduces basic concepts and tools for software quality assurance, with particular emphasis on design guidelines and preventive actions. Because of their utility in debugging complex software packages, models for software quality growth are also discussed (Section 5.3.4). Greater details can be found in [5.31-5.70] and [A2.8], in particular [5.44, 5.50, 5.60, 5.67, A2.8 (730 for SQ Assurance Plans)].

A first difference between hardware and software appears in the life-cycle phases (Table 5.3). In contrast to Fig. 1.6, the production phase does not appear in the software life-cycle phases, since software can be copied without errors. A partition of the software life-cycle into clearly defined phases, each of them closed with an extensive design review, is mandatory for software quality assurance. A second basic distinction between hardware and software is given by the quality attributes or characteristics (Table 5.4). The definitions of Table 5.4 extend those given in Appendix A1 and take care of established standards [A2.8, 5.50]. Not all quality attributes of Table 5.4 can be fulfilled at the same time. In general, a priority list must be established and consequently followed by all engineers involved in a project. A further difficulty is the quantitative evaluation (assessment) of software quality attributes, i.e. the definition of software quality metrics. An attempt to aggregate (as user) some of the attributes in Table 5.4 is given in [5.45].

From the above considerations, software quality can be defined as the degree to which a software package possesses a stated combination of quality attributes (characteristics). If supported by an appropriate set of software quality metrics, this allows an objective assessment of the quality level achieved. Since only a limited number of quality attributes can be reasonably well satisfied by a specific software package, the main purpose of software quality assurance is to maximize the common part of the quality attributes needed, specified, and realized. To reach this target, specific activities have to be performed during all software life-cycle phases. Many of these activities can be derived from hardware quality assurance tasks, in particular regarding preventive actions (defect prevention), configuration management, testing, and corrective actions. However, auditing software quality assurance activities in a project should be more intensive and with a shorter feedback than for hardware (Fig. 5.2, Tab. 5.5).


Table 5.3 Software life-cycle phases (see Fig. 1.6 for hardware life-cycle phases)

| Phase | Objective / Tasks | Input | Output |
|---|---|---|---|
| Concept | Problem definition; feasibility check | Problem definition; constraints on computer size, programming languages, I/O, etc. | System specifications for functional (what) and performance (how) aspects; proposal for the definition phase |
| Definition | Investigation of alternative solutions; interface definitions; feasibility check | System specifications; proposal for the definition phase | Revised system specifications; interface specifications; updated estimation of cost and schedule; feedback from users; proposal for the design, coding, and testing phase |
| Design, Coding, Testing | Setup of detailed specifications; software design; coding; test of each module; verification of compliance with module specifications (design reviews); data acquisition; feasibility check | Revised system specifications; interface specifications; proposal for the design, coding, and testing phase | Definitive flowcharts, data flow diagrams, and data analysis diagrams; test procedures; completed and tested software modules; tested I/O facilities; proposal for the integration, validation, and installation phase; software documentation |
| Integration, Validation, Installation | Integration and validation of the software; verification of compliance with system specifications (design reviews); setup of the definitive documentation | Completed and tested software modules; tested I/O facilities; proposal for the integration, validation, and installation phase | Completed and tested software; complete and definitive documentation |
| Operation, Maintenance | Use / application of the software; maintenance (corrective and perfective) | Completed and tested software; complete and definitive documentation | |

Concerning the design and development of complex equipment and systems, the traditional separation between hardware and software should be overcome, taking from each side the "good part" of methods and tools and putting them together for new better methods and tools (strategy applicable to other situations as well).


Table 5.4 Important software quality attributes and characteristics

| Attribute | Definition |
|---|---|
| Compatibility | Degree to which two or more software modules or packages can perform their required functions while sharing the same hardware or software environment |
| Completeness | Degree to which a software module or package possesses the functions necessary and sufficient to satisfy user needs |
| Consistency | Degree of uniformity, standardization, and freedom from contradiction within the documentation or parts of a software package |
| Defect Freedom (Reliability) | Degree to which a software package can execute its required function without causing system failures |
| Defect Tolerance (Robustness) | Degree to which a software module or package can function correctly in the presence of invalid inputs or highly stressed environmental conditions |
| Documentation | Totality of documents necessary to describe, design, test, install, and maintain a software package |
| Efficiency | Degree to which a software module or package performs its required function with minimum consumption of resources (hardware and / or software) |
| Flexibility | Degree to which a software module or package can be modified for use in applications or environments other than those for which it was designed |
| Integrity | Degree to which a software package prevents unauthorized access to or modification of computer programs or data |
| Maintainability | Degree to which a software module or package can be easily modified to correct faults, improve the performance, or other attributes |
| Portability | Degree to which a software package can be transferred from one hardware or software environment to another |
| Reusability | Degree to which a software module can be used in another program |
| Simplicity | Degree to which a software module or package has been conceived and implemented in a straightforward and easily understandable way |
| Testability | Degree to which a software module or package facilitates the establishment of test criteria and the performance of tests to determine whether those criteria have been met |
| Usability | Degree to which a user can learn to operate, prepare inputs for, and interpret outputs of a software package |

Software module is used here also for software element

5.3.1 Guidelines for Software Defect Prevention

Defects can be introduced in different ways and at different points along the life cycle phases of software. The following are some causes for defects:

1. During the concept and definition phase
   • misunderstandings in the problem definition,
   • constraints on CPU performance, memory size, computing time, I/O facilities or others,
   • inaccurate interface specifications,
   • too little attention to user needs and/or skills.

2. During the design, coding, and testing phase
   • inaccuracies in detailed specifications,
   • misinterpretation of detailed specifications,
   • inconsistencies in procedures or algorithms,
   • timing problems,
   • data conversion errors,
   • complex software structuring or large dependence between software modules.

3. During the integration, validation, and installation phase
   • too large interaction between software modules,
   • errors during software corrections or modifications,
   • unclear or incomplete documentation,
   • changes in the hardware or software environment,
   • exceeding important resources (dynamic memory, disk, etc.).

Figure 5.2 Procedure for software development (top-down design and bottom-up integration with vertical and horizontal control loops)


Defects are thus generally caused by human errors (software developer or user). Their detection and removal become more expensive as the software life cycle progresses (often by a factor of 10 between each of the four main phases of Table 5.3, as in Fig. 8.2 for hardware). Considering that many defects can remain undiscovered for a long time after the software installation (since they are detected only by particular combinations of data and system states), defect prevention through an appropriate software quality assurance becomes mandatory. The following design guidelines can be useful:

1. Fix written procedures / rules and follow them during software development; such rules specify quality attributes with project specific priority and corresponding quality assurance procedures.

2. Formulate detailed specifications and interfaces as carefully as possible; such specifications / interfaces should exist before coding begins.

3. Give priority to object oriented programming.

4. Use well-behaved high-level programming languages, assembler only when a problem cannot be solved in any other way; use established Computer Aided Software Engineering (CASE) for program development and testing.

5. Partition software into independent software modules (modules should be individually testable, developed top-down, and integrated bottom-up).

6. Take into account all constraints given by I/O facilities.

7. Develop software able to protect itself and its data; plan for automatic testing and validation of data.

8. Consider aspects of testing / testability as early as possible in the development phase; increase testability through the use of definition languages (Vienna, RTRL, PSL, IORL).

9. Improve understandability and readability of software by introducing appropriate comments.

10. Document software carefully and carry out sufficient configuration management, in particular with respect to design reviews (Table 5.5).

Software for on-line systems (product and embedded software) should further be conceived to be as far as possible tolerant of hardware failures and to allow a system reconfiguration, particularly in the context of a fail-safe concept (hardware and software involved in fail-safe procedures should be periodically checked during the operation phase). For this purpose, redundancy considerations are necessary, in the time domain (protocol with retransmission, cyclic redundancy check, assertions, exception handling, etc.), in the space domain (error correcting codes, parallel processes, etc.), or a combination of both. Moreover, if the interaction between hardware and software in the realization of the required function at the system level is large (embedded software), redundancy considerations should be extended to cover hardware defects and failures, i.e. to make the system fault tolerant (Sections 2.3.7 and 6.8.6). In this context, effort should be devoted to the investigation of causes-to-effects aspects (criticality) of hardware and software faults from a system level point of view, including hardware, software, human factors, and logistic support as well (Section 2.6).
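As an illustration of redundancy in the time domain, the following minimal sketch combines a cyclic redundancy check with retransmission. It is not taken from the text: the frame layout (payload followed by a 4-byte CRC-32), the function names, and the single-bit-error channel model are assumptions chosen only to keep the example short.

```python
import random
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a CRC-32 checksum (4 bytes, big-endian) to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: bytes):
    """Return the payload if the CRC matches, otherwise None."""
    payload, received_crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == received_crc else None


def noisy_channel(frame: bytes, rng: random.Random, p_corrupt: float) -> bytes:
    """Hypothetical channel: with probability p_corrupt, flip one random bit."""
    if rng.random() >= p_corrupt:
        return frame
    damaged = bytearray(frame)
    position = rng.randrange(len(damaged))
    damaged[position] ^= 1 << rng.randrange(8)
    return bytes(damaged)


def send_with_retransmission(payload, rng, p_corrupt=0.3, max_attempts=5):
    """Retransmit until the receiver's CRC check passes or attempts run out."""
    frame = make_frame(payload)
    for attempt in range(1, max_attempts + 1):
        received = check_frame(noisy_channel(frame, rng, p_corrupt))
        if received is not None:
            return received, attempt
    # Exception handling: the caller decides how to react (e.g. fail-safe state).
    raise RuntimeError(f"transmission failed after {max_attempts} attempts")


if __name__ == "__main__":
    data, attempts = send_with_retransmission(b"sensor reading", random.Random(1))
    print(data, "delivered after", attempts, "attempt(s)")
```

The time-domain character lies in repeating the same transmission rather than duplicating hardware; a space-domain alternative would add an error correcting code so that the receiver can repair the frame without a retransmission.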

5.3.2 Configuration Management

Configuration management is an important quality assurance tool during the design and development of complex equipment and systems, both for hardware and software. Applicable methods and procedures are outlined in Section 1.3.3 and discussed in Appendices A3 and A4 for hardware. Some of these methods have been introduced in software standards [A2.8]. Of particular importance for software are design reviews, as given in Table 5.5 (see also Table A3.3 for hardware aspects) and configuration control, i.e. management of changes and modifications.

5.3.3 Guidelines for Software Testing

Planning for software testing is generally a difficult task, as even small programs can have an extremely large number of states which makes a complete test impossible. A test strategy is then necessary. The problem is also known for hardware, for which special design guidelines to increase testability have been developed (Section 5.2). The most important rule, which applies to both hardware and software, is the partitioning of the item (hardware or software) into independent modules which can be individually tested and integrated bottom-up to constitute the system. Many rules can be project specific. The following design guidelines can be useful in establishing a test strategy for software used in complex equipment and systems:

1. Plan software tests early in the design and coding phases, and integrate them step by step into a test strategy.

2. Use appropriate tools (debugger, coverage-analyzer, test generators, etc.).

3. Perform tests first at the module level, exercising all instructions, branches and logic paths.

4. Integrate and test successively the modules bottom-up to the system level.

5. Test carefully all suspected paths (with potential defects) and software parts whose incorrect running could cause major system failures.

6. Account for all defects which have been discovered with indication of running time, software & hardware environments at the occurrence time (state, parameter set, hardware facilities, etc.), changes introduced, and debugging effort.

7. Test the complete software in its final hardware and software environment.

Testing is the only practical possibility to find (and eliminate) defects. It includes debug tests (generally performed early in the design phase using breakpoints, desk checking, dumps, inspections, reversible executions, single-step operation, or traces) and run tests. Although costly (often up to 50% of the software development cost), tests cannot guarantee freedom from defects. A balanced distribution of the efforts between preventive actions (defect prevention) and testing must thus be found for each project.

Table 5.5 Software design reviews (IEEE Std 1028-1988 [A2.8])

| Type | Objective |
|---|---|
| Management Review | Provide recommendations for the following: activities progress, based on an evaluation of product development status; changing project direction or identifying the need for alternate planning; adequate allocation of resources through global control of the project |
| Technical Review | Evaluate a specific software element and provide management with evidence that: the software element conforms to its specifications; the design (or maintenance) of the software element is being done according to plans, standards, and guidelines applicable for the project; changes to the software element are properly implemented and affect only those system areas identified by change specifications |
| Software Inspection | Detect and identify software element defects, in particular: verify that every software element satisfies its specifications; verify that every software element conforms to applicable standards; identify deviations from standards and specifications; evaluate software engineering data (e.g. defect and effort data) |
| Walkthrough | Find defects, omissions, and contradictions in the software elements and consider alternative implementations (long associated with code examination, this process is also applicable to other aspects, e.g. architectural design, detailed design, test plans / procedures, and change control procedures) |

Software element is used here also for software module; see also Tab. A3.3 for system oriented design reviews

5.3.4 Software Quality Growth Models

Since the beginning of the seventies, a large number of models have been proposed to describe the occurrence of software defects during operation of complex equipment and systems. Such an occurrence can generate a failure at system level and appears often randomly distributed in time. For this reason, modeling has been done in a similar way as for hardware failures, i.e. by introducing the concept of software failure rate. Such an approach may be valid to investigate software quality growth during software validation and installation, as for the reliability growth models developed in the sixties for hardware (Section 7.7).


However, from the considerations of the preceding sections, the main target should be the development of software free from defects, and effort should thus focus on defect prevention rather than on defect modeling. Nevertheless, because of their use in investigating software quality growth, this section briefly introduces the basic models known for software defect modeling.

1. Between consecutive occurrence points of a software defect, the "failure rate" is a function of the number of defects present in the software. This model leads to a death process and is known as the Jelinski-Moranda model. If at $t = 0$ the software contains $n$ defects, the probability $P_i(t) = \Pr\{i \text{ defects have been removed up to the time } t \mid n \text{ defects were present at } t = 0\}$ can be calculated recursively from (see Fig. A7.9 with $\nu_0 = n\lambda$, $\nu_i = (n-i)\lambda$ and $\theta_i = 0$)

$$P_0(t) = e^{-n\lambda t}, \qquad P_i(t) = \int_0^t (n-i+1)\,\lambda\, e^{-(n-i)\lambda x}\, P_{i-1}(t-x)\,dx, \quad i = 1, \ldots, n, \tag{5.3}$$

or directly as

$$P_i(t) = \binom{n}{i}\,\bigl(1 - e^{-\lambda t}\bigr)^{i}\, e^{-(n-i)\lambda t}, \quad i = 0, \ldots, n.$$

Figure 5.3 shows $P_0(t)$ to $P_3(t)$ for $n = 10$. This model can be easily extended to cover the case in which the parameter $\lambda$ also depends on the number of defects still present in the software.

2. Between consecutive occurrence points of a software defect, the "failure rate" is a function of the number of defects still present in the software and of the time elapsed since the last occurrence point of a defect. This model generalizes Model 1 above and can be investigated using semi-Markov processes (Appendix A7.6).

Figure 5.3 $P_i(t) = \Pr\{i \text{ defects have been removed up to the time } t \mid n \text{ defects were present at } t = 0\}$ for $i = 0, \ldots, 3$ and $n = 10$ (the time interval between consecutive occurrence points of a defect is exponentially distributed with parameter $\lambda_i = (n-i)\lambda$; time axis marked at $0$, $1/n\lambda$, $2/n\lambda$, $3/n\lambda$)


Figure 5.4 Simplified modeling for the time behavior of a system whose failure is caused by a hardware failure (Zi -i z;') or by the occurrence of a software defect ( Z i -t Z i)

3. The flow of occurrence of software defects constitutes a nonhomogeneous Poisson process (Appendix A7.8.2). This model has been extensively investigated in the literature, together with reliability growth models for hardware, with different assumptions on the form of the process intensity (Section 7.7).

4. The flow of occurrence of software defects constitutes an arbitrary point process. This model is very general but difficult to investigate.
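For Model 1 (Jelinski-Moranda), the recursion $P_0(t) = e^{-n\lambda t}$, $P_i(t) = \int_0^t (n-i+1)\lambda e^{-(n-i)\lambda x} P_{i-1}(t-x)\,dx$ can be checked numerically against the binomial closed form $\binom{n}{i}(1-e^{-\lambda t})^i e^{-(n-i)\lambda t}$. The sketch below is only an illustration under assumptions: the function names, the trapezoidal rule, the grid size, and the parameter values are not from the text.

```python
import math


def p_removed_closed_form(i, n, lam, t):
    """Binomial closed form: each of the n defects is removed by time t
    independently with probability 1 - exp(-lam * t)."""
    q = 1.0 - math.exp(-lam * t)
    return math.comb(n, i) * q**i * math.exp(-(n - i) * lam * t)


def p_removed_recursion(n, lam, t, steps=400):
    """Return [P_0(t), ..., P_n(t)] by evaluating the recursion on a time grid
    with the trapezoidal rule (numerical approximation, assumed grid size)."""
    h = t / steps
    grid = [k * h for k in range(steps + 1)]
    prev = [math.exp(-n * lam * x) for x in grid]  # P_0 on the grid
    result = [prev[-1]]
    for i in range(1, n + 1):
        rate_in = (n - i + 1) * lam   # nu_{i-1}: rate of leaving state i-1
        rate_out = (n - i) * lam      # nu_i: rate of leaving state i
        cur = [0.0]                   # P_i(0) = 0 for i >= 1
        for k in range(1, steps + 1):
            vals = [rate_in * math.exp(-rate_out * grid[j]) * prev[k - j]
                    for j in range(k + 1)]
            cur.append(h * (sum(vals) - 0.5 * (vals[0] + vals[-1])))
        result.append(cur[-1])
        prev = cur
    return result


if __name__ == "__main__":
    n, lam = 10, 0.01                 # assumed values for illustration
    t = 1.0 / (n * lam)               # first tick of the time axis, 1/(n*lam)
    for i in range(4):
        print(i, round(p_removed_closed_form(i, n, lam, t), 4))
```

The closed form also makes the limits of the model visible: it needs the initial number of defects $n$, which is unknown in practice.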

All the above models have a theoretical foundation. However, in practical applications they often suffer from the lack of information (for instance about the number of defects actually present in the software) and data. Also they do not take care of the criticality (effect at system level) of the defects still present in the software under consideration (several minor faults are in general less critical than just one major fault). The use of nonhomogeneous Poisson processes is discussed in Section 7.7, see e.g. also [6.3, A7.30] for some critical comments. Oversimplified models should also be avoided [5.69].

For systems with hardware and software, one can often assume that defects in the software will be detected and eliminated one after the other. Only hardware failures should then remain. Figure 5.4 shows a possibility to take this into account [6.9]. However, interdependence between hardware and software can be greater than assumed in Fig. 5.4. Also, the number $n$ of defects in the software at the time $t = 0$ is unknown, and eliminating a software defect can introduce new defects. Modeling software defects as well as systems with hardware and software is still evolving.

6 Reliability and Availability of Repairable Systems

Reliability and availability analysis of repairable systems is generally performed using stochastic processes, including Markov, semi-Markov, and semi-regenerative processes. The mathematical foundation of these processes is in Appendix A7. Equations used to investigate Markov and semi-Markov models are summarized in Table 6.2. This chapter investigates systematically most of the reliability models encountered in practical applications. Reliability figures at system level have indices $Si$ (e.g. $MTTF_{Si}$), where $S$ stands for system and $i$ is the state entered at $t = 0$ (Table 6.2). After Section 6.1 (introduction, assumptions, conclusions), Section 6.2 investigates the one-item structure under general conditions. Sections 6.3 - 6.6 deal extensively with series, parallel, and series-parallel structures. To unify models and simplify calculations, it is assumed that the system has only one repair crew and no further failures occur at system down. Starting from constant failure and repair rates between successive states (Markov processes), generalization is performed step by step (beginning with the repair rates) up to the case in which the process involved is regenerative with a minimum number of regeneration states. Approximate expressions for large series-parallel structures are investigated in Section 6.7. Section 6.8 considers systems with complex structure for which a reliability block diagram often does not exist. On the basis of practical examples, preventive maintenance, imperfect switching, incomplete coverage, elements with more than two states, phased-mission systems, common cause failures, and general reconfigurable fault tolerant systems with reward & frequency / duration aspects are investigated. A general procedure for complex structures is given in Section 6.8.8. Section 6.9 introduces alternative investigation methods (Petri nets, dynamic FTA, computer-aided analysis), and gives a Monte Carlo approach useful for rare events. Asymptotic & steady-state is used as a synonym for stationary (p. 476). Results are summarized in tables. Selected examples illustrate the practical aspects.

6.1 Introduction, General Assumptions, Conclusions

Investigation of the time behavior of repairable systems spans a very large class of stochastic processes, from simple Poisson process through Markov and semi-Markov processes up to sophisticated regenerative processes with only one or just a few regeneration states. Nonregenerative processes are rarely considered because of mathematical difficulties. Important for the choice of the class of processes to be used are the distribution functions for the failure-free and repair times involved. If failure and repair rates of all elements in the system are constant during the stay time in every state (not necessarily at a state change, e.g. because of load sharing), the process involved is a (time-homogeneous) Markov process with finitely many states, for which the stay time in each state is exponentially distributed. The same holds if Erlang distributions occur (supplementary states, see e.g. Section 6.3.3). The possibility to transform a given stochastic process into a Markov process by introducing supplementary variables is not considered here. Generalization of the distribution functions for repair times leads to semi-regenerative processes, i.e. to processes with an embedded semi-Markov process. This holds in particular if the system has only one repair crew, since each termination of a repair is a renewal point (because of the constant failure rates). Arbitrary distributions of repair and failure-free times lead in general to nonregenerative stochastic processes.

Table 6.1 shows the processes used in reliability investigations of repairable systems, with their possibilities and limits. Appendix A7 introduces these processes with particular emphasis on reliability applications. All equations necessary for the reliability and availability calculation of systems described by time-homogeneous Markov processes and semi-Markov processes are summarized in Table 6.2.

Table 6.1 Stochastic processes used in reliability and availability analysis of repairable systems

| Stochastic process | Can be used in modeling | Background | Difficulty |
|---|---|---|---|
| Renewal process | Spare parts provisioning in the case of arbitrary failure rates and negligible replacement or repair time (Poisson process for const. $\lambda$) | Renewal theory | Medium |
| Alternating renewal process | One-item repairable (renewable) structure with arbitrary failure and repair rates | Renewal theory | Medium |
| Markov process (MP) (finite state space, time-homogeneous) | Systems of arbitrary structure whose elements have constant failure and repair rates ($\lambda_i$, $\mu_i$) during the stay time (sojourn time) in every state (not necessarily at a state change, e.g. because of load sharing) | Differential equations or integral equations | Low |
| Semi-Markov process (SMP) | Some systems whose elements have constant or Erlangian failure rates (Erlang distributed failure-free times) and arbitrary repair rates | Integral equations | Medium |
| Semi-regenerative process (process with only few regeneration states) | Systems with only one repair crew, arbitrary structure, and whose elements have constant failure rates and arbitrary repair rates | Integral equations | High |
| Nonregenerative process | Systems of arbitrary structure whose elements have arbitrary failure and repair rates | | |

Besides the assumption about the involved distribution functions for failure-free and repair times, reliability and availability calculation is largely influenced by the maintenance strategy, logistic support, type of redundancy, and dependence between elements. Existence of a reliability block diagram is assumed in Sections 6.2 - 6.7, not necessarily in Sections 6.8 and 6.9. Results are expressed as functions of time by solving appropriate systems of differential (or integral) equations, or given by the mean time to failure or the steady-state point availability at system level ($MTTF_{Si}$ or $PA_S$) by solving appropriate systems of algebraic equations. If the system has no redundancy, the reliability function is the same as in the nonrepairable case. In the presence of redundancy, it is generally assumed that redundant elements will be repaired without operational interruption at system level. Reliability investigations thus aim to find the occurrence of the first system down, whereas the point availability is the probability to find the system in an up state at a time $t$, independently of whether down states at system level have occurred before $t$.
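To make the step "solving appropriate systems of algebraic equations" concrete, the following sketch computes the mean time to failure and the steady-state point availability for a small time-homogeneous Markov model. Everything specific here is an assumption for illustration: the example system (1-out-of-2 active redundancy, identical elements, one repair crew), the state numbering, the rate values, and the helper names. The equations used are the usual balance equations of a continuous-time Markov chain, not a quotation of Table 6.2.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def mttf(rates, up):
    """Mean time to failure from each up state i:
    rho_i * M_i = 1 + sum over up states j != i of rho_ij * M_j."""
    a = [[sum(rates[i]) if j == i else -rates[i][j] for j in up] for i in up]
    return dict(zip(up, solve(a, [1.0] * len(up))))


def steady_state(rates):
    """Stationary state probabilities: balance equations plus normalization."""
    n = len(rates)
    a = [[-sum(rates[j]) if i == j else rates[i][j] for i in range(n)]
         for j in range(n - 1)]
    a.append([1.0] * n)
    return solve(a, [0.0] * (n - 1) + [1.0])


if __name__ == "__main__":
    lam, mu = 1e-3, 0.5  # assumed failure and repair rates per hour
    # State 0: both elements up, 1: one up and one in repair, 2: system down.
    rates = [[0.0, 2 * lam, 0.0],
             [mu, 0.0, lam],
             [0.0, mu, 0.0]]
    print("MTTF_S0 =", mttf(rates, up=[0, 1])[0])
    print("PA_S    =", sum(steady_state(rates)[:2]))
```

For this small model the results can be compared with the hand-derived expressions $(3\lambda + \mu)/(2\lambda^2)$ for the mean time to failure and $\mu(\mu + 2\lambda)/(\mu^2 + 2\lambda\mu + 2\lambda^2)$ for the steady-state availability.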

In order to unify models and simplify calculations, the following assumptions are made for the analyses in Sections 6.2 - 6.6 (partly also in Sections 6.7 - 6.9).

1. Continuous operation: Each element of the system is in operating or reserve state, when not under repair or waiting for repair. (6.1)

2. No further failures at system down: At system down the system is repaired (restored) according to a given maintenance strategy to an up state at system level from which operation is continued; failures during a repair at system down are not considered. (6.2)

3. Only one repair crew: At system level only one repair crew is available; repair is performed according to a stated strategy, e.g. first-in / first-out. (6.3)

4. Redundancy: Redundant elements are repaired without interruption of operation at system level; failure of redundant parts is immediately recognized. (6.4)

5. States: Each element in the reliability block diagram has only two states (good or failed); after repair (restoration) it is as-good-as-new. (6.5)

6. Independence: Failure-free and repair times of each element are stochastically independent, > 0, and continuous random variables with finite mean (MTTF, MTTR) and variance (failure-free time is used as a synonym for failure-free operating time and repair as a synonym for restoration). (6.6)

7. Support: Preventive maintenance is neglected; fault coverage, switching, and logistic support are ideal (repair time = restoration time = down time). (6.7)

The above assumptions hold for Sections 6.2 - 6.6, and apply in many practical situations. However, assumption (6.5) must be critically verified, in particular for the aspect as-good-as-new, when the repaired element does not consist of just one part which has been replaced by a new one, but contains parts which have not been replaced during the repair. This assumption is valid if the nonreplaced parts have constant (time independent) failure rates, and applies in this case to considerations at system level. At system level, reliability figures have indices $Si$ (e.g. $MTTF_{Si}$) where $S$ stands for system and $i$ is the state entered at $t = 0$ (Table 6.2). Assuming irreducible processes, asymptotic & steady-state is used as a synonym for stationary.

Section 6.2 considers the one-item repairable structure under general assumptions, allowing a careful investigation of the asymptotic and stationary behavior. For the basic reliability structures encountered in practical applications (series, parallel, and series-parallel), investigations in Sections 6.3 - 6.6 begin by assuming constant failure and repair rates for every element in the reliability block diagram. Distributions of the repair times, and as far as possible of the failure-free times, are then generalized step by step up to the case in which the process involved remains regenerative with a minimum number of regeneration states. This also shows the capabilities and limits of the models involved. For large series-parallel structures, approximate expressions are developed in depth in Section 6.7. Procedures for investigating repairable systems with complex structure (for which a reliability block diagram often does not exist) are given in Section 6.8 on the basis of practical examples, including imperfect switching, incomplete coverage, more than two states, phased-mission systems, common cause failures, and fault tolerant reconfigurable systems with reward & frequency / duration aspects. It is shown that the tools developed in Appendix A7 (summarized in Tab. 6.2) can be used to solve many of the problems occurring in practical applications, on a case-by-case basis working with the diagram of transition rates or a time schedule. Alternative investigation methods (Petri nets, dynamic FTA) as well as computer-aided analysis are discussed in Section 6.9, and a Monte Carlo approach useful for rare events is given.

From the results of Sections 6.2 - 6.9, the following conclusions can be drawn:

1. As long as for each element in the reliability block diagram the condition MTTR ≪ MTTF holds, the shape of the distribution function of the repair time has small influence on the mean time to failure and on the steady-state availability at system level (see for instance Examples 6.7, 6.8, 6.9).

2. As a consequence of Point 1, it is preferable to start investigations by assuming Markov models (constant failure and repair rates for all elements, Table 6.2); in a second step, more appropriate distribution functions can be considered.

3. The assumption (6.2) of no further failure at system down has no influence on the reliability function; it allows a reduction of the state space and simplifies calculation of the availability and interval reliability (yielding good approximate values for the cases in which this assumption does not apply).

4. Already for moderately large systems, the use of Markov models can become time-consuming (up to e · n! states for a reliability block diagram with n elements); approximate expressions are important, and the method based on macro-structures (Table 6.10) adheres well to many practical applications.

5. For large systems or complex structures, the following possibilities are available:
   • work directly with the diagram of transition rates (Section 6.8),
   • calculation of the mean time to failure and of the steady-state availability at system level only (Table 6.2, Eqs. (A7.126), (A7.173), (A7. l U ), (A7.175)),
   • use of approximate expressions (Section 6.7),
   • use of alternative methods or Monte Carlo simulation (Section 6.9).
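The growth of the state space mentioned in Point 4 can be made tangible with a short count. The sketch below rests on an assumption that is not stated in the text: states are distinguished by the ordered sequence of currently failed elements (as would be needed to track a first-in / first-out repair queue), so that the number of states is the sum over k of the ordered selections of k out of n elements. Under that assumption the count stays just below e · n!.

```python
import math


def ordered_failure_states(n):
    """Number of ordered selections of k failed elements out of n,
    summed over k = 0..n (assumed state definition, see lead-in)."""
    return sum(math.perm(n, k) for k in range(n + 1))


if __name__ == "__main__":
    for n in (2, 3, 5, 8, 10):
        bound = math.e * math.factorial(n)
        print(n, ordered_failure_states(n), round(bound, 1))
```

For n = 3 this gives 16 states against a bound of about 16.3; the factorial growth is the reason why approximate expressions and macro-structures become attractive already for moderately large n.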


Table 6.2 Relationships for the reliability, point availability, and interval reliability of systems described by time-homogeneous Markov processes & semi-Markov processes (Appendix A7.5 - A7.6)



6.2 One-Item Structure

A one-item structure is a unit of arbitrary complexity, generally considered as an entity for investigations. Its reliability block diagram is a single element (Fig. 6.1). Considering that in practical applications the one-item structure can have the complexity of a system, and also to use the same notation as in the following sections of this chapter, reliability figures are given with the indices $S$ or $S0$ (e.g. $PA_S$, $R_{S0}(t)$, $MTTF_{S0}$), where $S$ stands for system and 0 specifies item new at $t = 0$.

Under the assumptions (6.1) to (6.3) and (6.5) to (6.7), the repairable one-item structure is completely characterized by the distribution function of the failure-free times $\tau_0, \tau_1, \ldots$

$$F_A(x) = \Pr\{\tau_0 \le x\} \quad \text{and} \quad F(x) = \Pr\{\tau_i \le x\}, \tag{6.8}$$

with densities

$$f_A(x) = \frac{dF_A(x)}{dx} \quad \text{and} \quad f(x) = \frac{dF(x)}{dx}, \tag{6.9}$$

the distribution function of the repair times $\tau_0', \tau_1', \ldots$

$$G_A(x) = \Pr\{\tau_0' \le x\} \quad \text{and} \quad G(x) = \Pr\{\tau_i' \le x\}, \tag{6.10}$$

with densities

$$g_A(x) = \frac{dG_A(x)}{dx} \quad \text{and} \quad g(x) = \frac{dG(x)}{dx}, \tag{6.11}$$

and the probability $p$ that the one-item structure is up at $t = 0$

$$p = \Pr\{\text{up at } t = 0\} \tag{6.12}$$

or

$$1 - p = \Pr\{\text{down (i.e. under repair) at } t = 0\},$$

respectively ($\tau_i$ and $\tau_i'$ are interarrival times, and $x$ is used instead of $t$). The time behavior of the one-item structure can be investigated in this case with help of the alternating renewal process introduced in Appendix A7.3.

Figure 6.1 Reliability block diagram for a one-item structure

Figure 6.2 Possible time behavior of a repairable one-item structure new at $t = 0$ (repair times greatly exaggerated; alternating renewal process with renewal points $0, S_{duu1}, S_{duu2}, \ldots$ for a transition from down state to up state given that the item is up at $t = 0$, marked by •)

Section 6.2.1 considers the one-item structure new at $t = 0$, i.e. the case $p = 1$ and $F_A(x) = F(x)$, with arbitrary $F(x)$ and $G(x)$. Generalization of the initial conditions at $t = 0$ (Section 6.2.3) allows in Sections 6.2.4 and 6.2.5 an in-depth investigation of the asymptotic and steady-state behavior.

6.2.1 One-Item Structure New at Time t = 0

Figure 6.2 shows the time behavior of a one-item structure new at t = 0. $\tau_1, \tau_2, \ldots$ are the failure-free times. They are statistically independent and distributed according to F(x) as per Eq. (6.8). Similarly, $\tau_1', \tau_2', \ldots$ are the repair times, distributed according to G(x) as per Eq. (6.10). Considering assumption (6.5), the time points 0, $S_{duu1}$, ... are renewal points and constitute an ordinary renewal process embedded in the original alternating renewal process; investigations of this section are based on this property ($S_{duu}$ means a transition from down (repair) to up (operating) starting up at t = 0).

6.2.1.1 Reliability Function

The reliability function $R_{S0}(t)$ gives the probability that the item operates failure free in (0, t] given item new at t = 0

$$R_{S0}(t) = \Pr\{\text{up in } (0, t] \mid \text{new at } t = 0\}. \tag{6.13}$$

Considering Eqs. (2.7) and (6.8) it holds that

$$R_{S0}(t) = \Pr\{\tau_1 > t\} = 1 - F(t). \tag{6.14}$$

The mean time to failure given item new at t = 0 follows from Eq. (A6.38)

$$MTTF_{S0} = \int_0^\infty R_{S0}(t)\,dt, \tag{6.15}$$

with the upper limit of the integral being $T_L$ should the useful life of the item be limited to $T_L$ (in this case, $R_{S0}(t)$ jumps to 0 at $t = T_L$). In the following, $T_L = \infty$ will be assumed.
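As a numerical illustration of Eq. (6.15), the following sketch integrates the reliability function of a Weibull-distributed failure-free time and compares the result with the known closed form MTTF = Γ(1 + 1/β)/λ. The Weibull choice, the parameter values, and the helper names are assumptions made for this illustration only; they do not come from the text.

```python
import math

def weibull_reliability(t, lam, beta):
    """R(t) = exp(-(lam*t)**beta), an assumed example of 1 - F(t)."""
    return math.exp(-((lam * t) ** beta))

def mttf_numeric(reliability, upper, n=20000):
    """Composite Simpson approximation of the integral of R(t) over [0, upper].

    `upper` stands in for infinity (or for a limited useful life T_L);
    n must be even.
    """
    h = upper / n
    total = reliability(0.0) + reliability(upper)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * reliability(i * h)
    return total * h / 3.0

lam, beta = 1e-3, 1.5  # hypothetical values, lam in 1/h
mttf_closed = math.gamma(1.0 + 1.0 / beta) / lam
mttf_simpson = mttf_numeric(lambda t: weibull_reliability(t, lam, beta), upper=20000.0)
```

Replacing `upper` by a finite useful life would give the truncated integral mentioned above for limited $T_L$.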

6.2.1.2 Point Availability

The point availability $PA_{S0}(t)$ gives the probability of finding the item operating at time t given item new at t = 0

$$PA_{S0}(t) = \Pr\{\text{up at } t \mid \text{new at } t = 0\}. \tag{6.16}$$

For $PA_{S0}(t)$ it holds that

$$PA_{S0}(t) = 1 - F(t) + \int_0^t h_{duu}(x)\,(1 - F(t - x))\,dx. \tag{6.17}$$

A(t) is often used instead of $PA_{S0}(t)$. Equation (6.17) is derived in Appendix A7.3 (Eq. (A7.56)) using the theorem of total probability. 1 - F(t) is the probability of no failure in (0, t], $h_{duu}(x)\,dx$ gives the probability that any one of the renewal points $S_{duu1}, S_{duu2}, \ldots$ lies in (x, x + dx], and 1 - F(t - x) is the probability that no further failure occurs in (x, t]. Using the Laplace transform (Appendix A9.7) and considering Eq. (A7.50) with $F_A(x) = F(x)$, Eq. (6.17) yields

$$\tilde{PA}_{S0}(s) = \frac{1 - \tilde f(s)}{s\,(1 - \tilde f(s)\,\tilde g(s))}. \tag{6.18}$$

$\tilde f(s)$ and $\tilde g(s)$ are the Laplace transforms of the failure-free time and repair time densities, respectively (given by Eqs. (6.9) and (6.11)).

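Example 6.1

a) Give the Laplace transform of the point availability $PA_{S0}(t)$ for the case of a constant failure rate λ (λ(x) = λ).

b) Give the Laplace transform and the corresponding time function of the point availability for the case of constant failure and repair rates λ and μ (λ(x) = λ and μ(x) = μ).

Solution

a) With $F(x) = 1 - e^{-\lambda x}$ or $f(x) = \lambda e^{-\lambda x}$, Eq. (6.18) yields

$$\tilde{PA}_{S0}(s) = \frac{1}{s + \lambda\,(1 - \tilde g(s))}. \tag{6.19}$$

Supplementary results: A gamma-distributed repair time with density $g(x) = \alpha(\alpha x)^{\beta-1} e^{-\alpha x} / \Gamma(\beta)$ (Eq. (A6.98)) yields

$$\tilde{PA}_{S0}(s) = \frac{(s + \alpha)^\beta}{(s + \lambda)(s + \alpha)^\beta - \lambda\,\alpha^\beta}.$$

b) With $f(x) = \lambda e^{-\lambda x}$ and $g(x) = \mu e^{-\mu x}$, Eq. (6.18) yields

$$\tilde{PA}_{S0}(s) = \frac{s + \mu}{s\,(s + \lambda + \mu)},$$

and thus (Table A9.7b)

$$PA_{S0}(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}\, e^{-(\lambda + \mu)t}.$$

$PA_{S0}(t)$ converges rapidly, exponentially with a time constant

$$1/(\lambda + \mu) \approx 1/\mu = MTTR,$$

to the asymptotic value $\mu/(\lambda + \mu) \approx 1 - \lambda/\mu$; see Section 6.2.4 for an extensive discussion.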
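The time function of Example 6.1 b) can be cross-checked numerically. The sketch below evaluates the closed form $PA_{S0}(t) = \mu/(\lambda+\mu) + \lambda/(\lambda+\mu)\,e^{-(\lambda+\mu)t}$ and compares it with a classical Runge-Kutta integration of the two-state (up/down) Markov model dP/dt = -λP + μ(1 - P), P(0) = 1. The function names, the step count, and the parameter values used in any call are assumptions chosen for illustration.

```python
import math

def point_availability(t, lam, mu):
    """Closed form of PA_S0(t) for constant failure rate lam and repair rate mu."""
    s = lam + mu
    return mu / s + lam / s * math.exp(-s * t)

def point_availability_ode(t, lam, mu, steps=20000):
    """PA_S0(t) from the two-state Markov model, integrated with classical RK4.

    P is the probability of being up; the item is assumed new (up) at t = 0.
    """
    def rhs(p):
        return -lam * p + mu * (1.0 - p)

    h = t / steps
    p = 1.0
    for _ in range(steps):
        k1 = rhs(p)
        k2 = rhs(p + 0.5 * h * k1)
        k3 = rhs(p + 0.5 * h * k2)
        k4 = rhs(p + h * k3)
        p += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return p
```

For example, with the hypothetical values λ = 0.001 per hour and μ = 0.5 per hour, both functions agree closely at t = 5 h, and the closed form approaches μ/(λ + μ) for large t.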

$PA_{S0}(t)$ can also be obtained using renewal process arguments (Appendices A7.2, A7.3, A7.6). After the first repair the item is as-good-as-new. $S_{duu1}$ is a renewal point and from this time point the process restarts anew as at t = 0. Therefore

$$\Pr\{\text{up at } t \mid S_{duu1} = x\} = PA_{S0}(t - x), \quad 0 < x \le t.$$

Considering that the event

{up at t}

occurs with exactly one of the following two mutually exclusive events

{no failure in (0, t]}

or

{$S_{duu1} \le t$ ∩ up at t},

it follows that

$$PA_{S0}(t) = 1 - F(t) + \int_0^t (f(x) * g(x))\, PA_{S0}(t - x)\,dx, \tag{6.22}$$

where f(x) * g(x) is the density of the sum $\tau_1 + \tau_1'$ (see Fig. 6.2 and Eq. (A6.75)). The Laplace transform of $PA_{S0}(t)$ as per Eq. (6.22) is that given by Eq. (6.18).

6.2.1.3 Average Availability

The average availability $AA_{S0}(t)$ is defined as the expected proportion of time in which the item is operating in (0, t] given item new at t = 0

$$AA_{S0}(t) = \frac{1}{t}\, E[\text{total up time in } (0, t] \mid \text{new at } t = 0]. \tag{6.23}$$

Considering $PA_{S0}(x)$ from Eq. (6.17), it holds that

$$AA_{S0}(t) = \frac{1}{t} \int_0^t PA_{S0}(x)\,dx. \tag{6.24}$$

Eq. (6.24) has a great intuitive appeal. It can be proved by considering that the time behavior of a repairable item can be described by a binary random function ζ(t) taking values 1 for up and 0 for down. From this, $E[\zeta(t)] = 1 \cdot PA_{S0}(t) + 0 \cdot (1 - PA_{S0}(t)) = PA_{S0}(t)$ and, taking care of $\int_0^t \zeta(x)\,dx$ = total up time in (0, t], it follows that

$$E[\text{total up time in } (0, t]] = E\Big[\int_0^t \zeta(x)\,dx\Big] = \int_0^t E[\zeta(x)]\,dx = \int_0^t PA_{S0}(x)\,dx.$$
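For constant failure and repair rates the integral in Eq. (6.24) can be evaluated in closed form, $AA_{S0}(t) = \mu/(\lambda+\mu) + \lambda(1 - e^{-(\lambda+\mu)t}) / (t(\lambda+\mu)^2)$. The sketch below compares this expression with a composite Simpson integration of the point availability. The function names and the integration parameters are assumptions made for this illustration.

```python
import math

def pa_s0(t, lam, mu):
    """Point availability for constant failure rate lam and repair rate mu."""
    s = lam + mu
    return mu / s + lam / s * math.exp(-s * t)

def aa_s0_closed(t, lam, mu):
    """Average availability over (0, t], closed form for constant rates (t > 0)."""
    s = lam + mu
    return mu / s + lam * (1.0 - math.exp(-s * t)) / (t * s * s)

def aa_s0_numeric(t, lam, mu, n=2000):
    """Average availability as (1/t) times the Simpson integral of PA_S0 (n even)."""
    h = t / n
    total = pa_s0(0.0, lam, mu) + pa_s0(t, lam, mu)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * pa_s0(i * h, lam, mu)
    return total * h / 3.0 / t
```

Because $PA_{S0}(x)$ decreases from 1 toward its asymptotic value, the average over (0, t] always lies above the point availability at t.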

6.2.1.4 Interval Reliability

The interval reliability $IR_{S0}(t, t + \theta)$ gives the probability that the item operates failure free during an interval [t, t + θ] given item new at t = 0

$$IR_{S0}(t, t + \theta) = \Pr\{\text{up in } [t, t + \theta] \mid \text{new at } t = 0\}. \tag{6.25}$$

The same method used to obtain Eq. (6.17) leads to

$$IR_{S0}(t, t + \theta) = 1 - F(t + \theta) + \int_0^t h_{duu}(x)\,(1 - F(t + \theta - x))\,dx. \tag{6.26}$$

Example 6.2

Give the interval reliability $IR_{S0}(t, t + \theta)$ for the case of a constant failure rate λ (λ(x) = λ).

Solution

With $F(x) = 1 - e^{-\lambda x}$ it follows that

$$IR_{S0}(t, t + \theta) = e^{-\lambda(t + \theta)} + \int_0^t h_{duu}(x)\, e^{-\lambda(t + \theta - x)}\,dx = \Big[ e^{-\lambda t} + \int_0^t h_{duu}(x)\, e^{-\lambda(t - x)}\,dx \Big]\, e^{-\lambda\theta}.$$

Comparison with Eq. (6.17) for $F(x) = 1 - e^{-\lambda x}$ yields

$$IR_{S0}(t, t + \theta) = PA_{S0}(t)\, e^{-\lambda\theta}. \tag{6.27}$$

It must be pointed out that the product rule in Eq. (6.27), expressing Pr{up in [t, t + θ] | new at t = 0} = Pr{up at t | new at t = 0} · Pr{no failure in (t, t + θ] | up at t}, is valid only because of the constant failure rate (memoryless property, Eq. (2.14)); in the general case, the second term is Pr{no failure in (t, t + θ] | (up at t ∩ new at t = 0)}, which differs from Pr{no failure in (t, t + θ] | up at t}.
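The role of the memoryless property can be illustrated on a simpler quantity than the interval reliability itself: the conditional survival probability R(t + θ)/R(t) of a single failure-free time. The sketch below shows that this ratio is independent of t for an exponential distribution but not for a Weibull distribution with shape parameter 2. The Weibull example, the helper names, and all numerical values are assumptions for illustration and do not come from the text.

```python
import math

def conditional_survival(reliability, t, theta):
    """Pr{no failure in (t, t+theta] | no failure in (0, t]} = R(t+theta) / R(t)."""
    return reliability(t + theta) / reliability(t)

lam, theta = 1e-3, 100.0  # hypothetical values (1/h and h)

def exponential_reliability(x):
    """Constant failure rate lam: memoryless."""
    return math.exp(-lam * x)

def weibull_reliability(x):
    """Shape parameter 2: the failure rate increases with age."""
    return math.exp(-((lam * x) ** 2))

exp_young = conditional_survival(exponential_reliability, 0.0, theta)
exp_old = conditional_survival(exponential_reliability, 500.0, theta)
weib_young = conditional_survival(weibull_reliability, 0.0, theta)
weib_old = conditional_survival(weibull_reliability, 500.0, theta)
```

With a constant failure rate both exponential values equal $e^{-\lambda\theta}$, which is why the factor $e^{-\lambda\theta}$ can be split off in Eq. (6.27); for the Weibull case the older item has a visibly lower conditional survival probability.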

6.2.1.5 Special Kinds of Availability

In addition to the point and average availability (Sections 6.2.1.2 and 6.2.1.3), there are several other kinds of availability useful for practical applications [6.5 (1973)]:

1. Mission Availability: The mission availability $MA_{S0}(T_o, t_f)$ gives the probability that in a mission of total operating time (total up time) $T_o$ each failure can be repaired within a time span $t_f$, given item new at t = 0

$$MA_{S0}(T_o, t_f) = \Pr\{\text{each individual failure occurring in a mission with total operating time } T_o \text{ can be repaired in a time} \le t_f \mid \text{new at } t = 0\}. \tag{6.28}$$

Mission availability is important in applications where interruptions of length ≤ $t_f$ can be accepted. Its computation considers all cases with n = 0, 1, ... failures, taking care that at the end of the mission the item is operating (to reach the given (fixed) operating time $T_o$).*) Thus, for given $T_o > 0$ and $t_f \ge 0$,

$$MA_{S0}(T_o, t_f) = 1 - F(T_o) + \sum_{n=1}^{\infty} \big(F_n(T_o) - F_{n+1}(T_o)\big)\,\big(G(t_f)\big)^n \tag{6.29}$$

holds. $F_n(T_o) - F_{n+1}(T_o)$ is the probability for n failures during the total operating time $T_o$ (Eq. (A7.14)); $(G(t_f))^n$ is the probability that all n repair times will be shorter than $t_f$. For constant failure rate λ it holds that $F_n(T_o) - F_{n+1}(T_o) = (\lambda T_o)^n e^{-\lambda T_o} / n!$ and thus

$$MA_{S0}(T_o, t_f) = e^{-\lambda T_o (1 - G(t_f))}. \tag{6.30}$$

2. Work-Mission Availability: The work-mission availability $WMA_{S0}(T_o, x)$ gives the probability that the sum of the repair times for all failures occurring in a mission of total operating time (total up time) $T_o$ is ≤ x, given item new at t = 0

$$WMA_{S0}(T_o, x) = \Pr\{\text{sum of the repair times for all failures occurring in a mission of total operating time } T_o \text{ is} \le x \mid \text{new at } t = 0\}. \tag{6.31}$$

Similarly as for Eq. (6.29) it follows that for given (fixed) $T_o > 0$ and x > 0 *)

$$WMA_{S0}(T_o, x) = 1 - F(T_o) + \sum_{n=1}^{\infty} \big(F_n(T_o) - F_{n+1}(T_o)\big)\, G_n(x), \tag{6.32}$$

where $G_n(x)$ is the distribution function of the sum of n repair times with distribution G(x) (Eq. (A7.13)). As for the mission availability, the item is up at the end of the mission (to reach the given (fixed) operating time $T_o$). For constant failure and repair rates (λ, μ), Eq. (6.32) yields (see also Eq. (A7.219))

$$WMA_{S0}(T_o, x) = 1 - e^{-(\lambda T_o + \mu x)} \sum_{n=1}^{\infty} \Big[ \frac{(\lambda T_o)^n}{n!} \sum_{k=0}^{n-1} \frac{(\mu x)^k}{k!} \Big], \quad T_o > 0 \text{ given}, \; x > 0, \qquad WMA_{S0}(T_o, 0) = e^{-\lambda T_o}. \tag{6.33}$$

*) An unlimited number n of repairs is assumed here; see e.g. Section 4.6 (p. 136) for n limited.

Defining DT as total down time and UT = t - DT as total up time in (0, t], one can recognize that for given fixed t, $WMA_{S0}(t - x, x) = \Pr\{DT \text{ in } (0, t] \le x\}$ holds for an item described by Fig. 6.2 (t > 0, 0 < x ≤ t). However, the item can now be up or down at t, and the situation differs from that defined by Eq. (6.31). The function $WMA_{S0}(t - x, x)$ has been investigated in [A7.29 (1957)]. In particular, a closed analytical expression for $WMA_{S0}(t - x, x)$ is given for constant failure and repair rates (λ, μ), and it is shown that the distribution of DT converges for t → ∞ to a normal distribution with mean $t\lambda/(\lambda + \mu) \approx t\lambda/\mu$ and variance $t\,2\lambda\mu/(\lambda + \mu)^3 \approx t\,2\lambda/\mu^2$. It can be noted that, for the interpretation given by Eq. (6.31), mean and variance of the total repair time are given by $T_o \lambda/\mu$ and $T_o\, 2\lambda/\mu^2$, respectively (Eq. (A7.220)).

3. Joint Availability: The joint availability $JA_{S0}(t, t + \theta)$ gives the probability of finding the item operating at the time points t and t + θ, given item new at t = 0 (t and t + θ are given fixed time points, see e.g. [6.14 (1999), 6.28] for stochastic demand)

$$JA_{S0}(t, t + \theta) = \Pr\{(\text{up at } t \cap \text{up at } t + \theta) \mid \text{new at } t = 0\}. \tag{6.34}$$

For the case of a constant failure rate λ(x) = λ, the multiplication rule of Eq. (6.27) yields

$$JA_{S0}(t, t + \theta) = PA_{S0}(t)\, PA_{S0}(\theta). \tag{6.35}$$

For arbitrary failure rate, one has to consider that {up at t ∩ up at t + θ | new at t = 0} occurs with one of the following two mutually exclusive events (Appendix A7.3)

{up in [t, t + θ] | new at t = 0}

or

{(up at t ∩ next failure occurs before t + θ ∩ up at t + θ) | new at t = 0}.

The probability for the first event is the interval reliability $IR_{S0}(t, t + \theta)$ given by Eq. (6.26). For the second event, it is necessary to consider the distribution function of the forward recurrence time in the up state $\tau_{Ru}(t)$. As shown in Fig. 6.3, $\tau_{Ru}(t)$ can only be defined if the item is up at time t, hence

$$\Pr\{\tau_{Ru}(t) > x \mid \text{new at } t = 0\} = \Pr\{\text{up in } (t, t + x] \mid (\text{up at } t \cap \text{new at } t = 0)\}$$

and thus, as in Example A7.2 and considering Eqs. (6.16) and (6.25),

$$\Pr\{\tau_{Ru}(t) > x \mid \text{new at } t = 0\} = \frac{\Pr\{\text{up in } [t, t + x] \mid \text{new at } t = 0\}}{\Pr\{\text{up at } t \mid \text{new at } t = 0\}} = \frac{IR_{S0}(t, t + x)}{PA_{S0}(t)} = 1 - F_{\tau_{Ru}}(x). \tag{6.36}$$

For constant failure rate λ(x) = λ one has $1 - F_{\tau_{Ru}}(x) = e^{-\lambda x}$, as per Eq. (6.27). Considering Eq. (6.36) it follows that

$$JA_{S0}(t, t + \theta) = IR_{S0}(t, t + \theta) + PA_{S0}(t) \int_0^\theta f_{\tau_{Ru}}(x)\, PA_{S1}(\theta - x)\,dx, \tag{6.37}$$

where $PA_{S1}(t)$ = Pr{up at t | a repair begins at t = 0} is given by

$$PA_{S1}(t) = \int_0^t h_{dud}(x)\,(1 - F(t - x))\,dx, \tag{6.38}$$

with $h_{dud}(t) = g(t) + g(t) * f(t) * g(t) + g(t) * f(t) * g(t) * f(t) * g(t) + \ldots$ (Eq. (A7.50)). $JA_{S0}(t, t + \theta)$ can also be obtained in a similar way to $PA_{S0}(t)$ in Eq. (6.17), by considering the alternating renewal process starting up at the time t with $\tau_{Ru}(t)$ distributed according to $F_{\tau_{Ru}}(x)$ as per Eq. (6.36). This leads to

$$JA_{S0}(t, t + \theta) = IR_{S0}(t, t + \theta) + \int_0^\theta h'_{duu}(x)\,(1 - F(\theta - x))\,dx, \tag{6.39}$$

with $h'_{duu}(x) = f'_{\tau_{Ru}}(x) * g(x) + f'_{\tau_{Ru}}(x) * g(x) * f(x) * g(x) + \ldots$, see Eq. (A7.50), and $f'_{\tau_{Ru}}(x) = PA_{S0}(t)\, f_{\tau_{Ru}}(x) = PA_{S0}(t)\, dF_{\tau_{Ru}}(x)/dx = -\partial IR_{S0}(t, t + x)/\partial x$, see Eqs. (6.36) and (6.37). Similarly as for $\tau_{Ru}(t)$, the distribution function for the forward recurrence time in the down state $\tau_{Rd}(t)$ is given by (Fig. 6.3)

$$\Pr\{\tau_{Rd}(t) \le x \mid \text{new at } t = 0\} = 1 - \frac{\int_0^t h_{udu}(y)\,(1 - G(t + x - y))\,dy}{1 - PA_{S0}(t)}, \tag{6.40}$$

with $h_{udu}(t) = f(t) + f(t) * g(t) * f(t) + \ldots$ (Eq. (A7.50)). For constant failure rate λ(x) = λ, Eq. (6.37) or (6.39) leads to Eq. (6.35), by considering Eq. (6.19).

Figure 6.3 Forward recurrence times $\tau_{Ru}(t)$ and $\tau_{Rd}(t)$ in an alternating renewal process
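The series expressions for the mission availability and the work-mission availability lend themselves to a direct numerical evaluation. The sketch below assumes constant failure and repair rates, so that the number of failures in the operating time $T_o$ is Poisson distributed and the sum of n repair times is Erlang distributed; it truncates the series at a hypothetical `n_max`. Function names and parameter values are assumptions for illustration only.

```python
import math

def poisson_prob(n, mean):
    """Probability of exactly n failures during the total operating time."""
    return math.exp(-mean) * mean ** n / math.factorial(n)

def erlang_cdf(n, mu, x):
    """Distribution function G_n(x) of the sum of n exponential repair times."""
    partial = sum((mu * x) ** k / math.factorial(k) for k in range(n))
    return 1.0 - math.exp(-mu * x) * partial

def mission_availability(T_o, t_f, lam, mu, n_max=60):
    """Each of the n repairs must individually be finished within t_f."""
    g = 1.0 - math.exp(-mu * t_f)
    return sum(poisson_prob(n, lam * T_o) * g ** n for n in range(n_max + 1))

def work_mission_availability(T_o, x, lam, mu, n_max=60):
    """The sum of all repair times must not exceed x."""
    total = poisson_prob(0, lam * T_o)
    for n in range(1, n_max + 1):
        total += poisson_prob(n, lam * T_o) * erlang_cdf(n, mu, x)
    return total
```

For constant rates the mission availability series sums to $e^{-\lambda T_o (1 - G(t_f))}$, which gives a convenient check of the truncation; the work-mission availability reduces to $e^{-\lambda T_o}$ for x = 0 and increases with x.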


6.2.2 One-Item Structure New at Time t = 0 and with Constant Failure Rate h

In many practical applications, a constant failure rate λ can be assumed. In this case, the expressions of Section 6.2.1 can be simplified making use of the memoryless property given by the constant failure rate. Table 6.3 summarizes the results for the cases of constant failure rate (λ) and constant or arbitrary repair rate (μ or μ(x) = g(x) / (1 - G(x))).

6.2.3 One-Item Structure with Arbitrary Initial Conditions at Time t = 0

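Generalization of the initial conditions at time t = 0, i.e. the introduction of p, $F_A(x)$ and $G_A(x)$ as defined by Eqs. (6.12), (6.8), and (6.10), leads to a time behavior of the one-item repairable structure described by Fig. A7.3 and to the following results:

1. Reliability function $R_S(t)$

$$R_S(t) = \Pr\{\text{up in } (0, t] \mid \text{up at } t = 0\} = 1 - F_A(t). \tag{6.41}$$

Equation (6.41) follows from Pr{up in [0, t]} = Pr{up at t = 0 ∩ up in (0, t]} = Pr{up at t = 0} · Pr{up in (0, t] | up at t = 0} = $p \cdot (1 - F_A(t)) = p \cdot R_S(t)$.

2. Point availability $PA_S(t)$

$$PA_S(t) = \Pr\{\text{up at } t\} = p\,\Big[ 1 - F_A(t) + \int_0^t h_{duu}(x)\,(1 - F(t - x))\,dx \Big] + (1 - p) \int_0^t h_{dud}(x)\,(1 - F(t - x))\,dx, \tag{6.42}$$

with $h_{duu}(t) = f_A(t) * g(t) + f_A(t) * g(t) * f(t) * g(t) + \ldots$ and $h_{dud}(t) = g_A(t) + g_A(t) * f(t) * g(t) + g_A(t) * f(t) * g(t) * f(t) * g(t) + \ldots$.

3. Average availability $AA_S(t)$

$$AA_S(t) = \frac{1}{t}\, E[\text{total up time in } (0, t]] = \frac{1}{t} \int_0^t PA_S(x)\,dx. \tag{6.43}$$

4. Interval reliability $IR_S(t, t + \theta)$

$$IR_S(t, t + \theta) = \Pr\{\text{up in } [t, t + \theta]\} = p\,\Big[ 1 - F_A(t + \theta) + \int_0^t h_{duu}(x)\,(1 - F(t + \theta - x))\,dx \Big] + (1 - p) \int_0^t h_{dud}(x)\,(1 - F(t + \theta - x))\,dx. \tag{6.44}$$

5. Joint availability $JA_S(t, t + \theta)$

$$JA_S(t, t + \theta) = \Pr\{\text{up at } t \cap \text{up at } t + \theta\} = IR_S(t, t + \theta) - \int_0^\theta \frac{\partial IR_S(t, t + x)}{\partial x}\, PA_{S1}(\theta - x)\,dx, \tag{6.45}$$

with $IR_S(t, t + \theta)$ from Eq. (6.44) and $PA_{S1}(t)$ from Eq. (6.38).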

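Table 6.3 Results for a repairable one-item structure new at t = 0 and with constant failure rate λ

Each row gives the expression for arbitrary repair rate, the expression for constant repair rate (μ)*), and remarks or assumptions.

1. Reliability function $R_{S0}(t)$
   Arbitrary: $e^{-\lambda t}$
   Constant (μ): $e^{-\lambda t}$
   Remarks: $R_{S0}(t)$ = Pr{up in (0, t] | new at t = 0}

2. Point availability $PA_{S0}(t)$
   Arbitrary: $e^{-\lambda t} + \int_0^t h_{duu}(x)\, e^{-\lambda(t - x)}\,dx$
   Constant (μ): $\frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}\, e^{-(\lambda + \mu)t}$
   Remarks: $PA_{S0}(t)$ = Pr{up at t | new at t = 0}, $h_{duu} = f * g + f * g * f * g + \ldots$

3. Average availability $AA_{S0}(t)$
   Arbitrary: $\frac{1}{t} \int_0^t PA_{S0}(x)\,dx$
   Constant (μ): $\frac{\mu}{\lambda + \mu} + \frac{\lambda\,(1 - e^{-(\lambda + \mu)t})}{t\,(\lambda + \mu)^2}$
   Remarks: $AA_{S0}(t)$ = E[total up time in (0, t] | new at t = 0] / t

4. Interval reliability $IR_{S0}(t, t + \theta)$
   Arbitrary: $PA_{S0}(t)\, e^{-\lambda\theta}$
   Constant (μ): $\big[ \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}\, e^{-(\lambda + \mu)t} \big]\, e^{-\lambda\theta}$
   Remarks: $IR_{S0}(t, t + \theta)$ = Pr{up in [t, t + θ] | new at t = 0}

5. Joint availability $JA_{S0}(t, t + \theta)$
   Arbitrary: $PA_{S0}(t)\, PA_{S0}(\theta)$
   Constant (μ): $PA_{S0}(t)\, PA_{S0}(\theta)$
   Remarks: $JA_{S0}(t, t + \theta)$ = Pr{up at t ∩ up at t + θ | new at t = 0}, $PA_{S0}(x)$ as in point 2

6. Mission availability $MA_{S0}(T_o, t_f)$
   Arbitrary: $e^{-\lambda T_o (1 - G(t_f))}$
   Constant (μ): $e^{-\lambda T_o e^{-\mu t_f}}$
   Remarks: $MA_{S0}(T_o, t_f)$ = Pr{each failure in a mission with total operating time $T_o$ can be repaired in a time ≤ $t_f$ | new at t = 0}

λ = failure rate; Pr{$\tau_{Ru}(t) \le x$} = $1 - e^{-\lambda x}$ (Fig. 6.3); up means in the operating state; *) Markov process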

6. Forward recurrence time ($\tau_{Ru}(t)$ and $\tau_{Rd}(t)$ as in Fig. 6.3)

$$\Pr\{\tau_{Ru}(t) \le x\} = 1 - \frac{IR_S(t, t + x)}{PA_S(t)}, \tag{6.46}$$

with $IR_S(t, t + x)$ according to Eq. (6.44) and $PA_S(t)$ from Eq. (6.42), and

$$\Pr\{\tau_{Rd}(t) \le x\} = 1 - \frac{\Pr\{\text{down in } [t, t + x]\}}{1 - PA_S(t)}, \tag{6.47}$$

where

$$\Pr\{\text{down in } [t, t + x]\} = p \int_0^t h_{udu}(y)\,(1 - G(t + x - y))\,dy + (1 - p)\,\Big[ 1 - G_A(t + x) + \int_0^t h_{udd}(y)\,(1 - G(t + x - y))\,dy \Big],$$

with $h_{udu}(t) = f_A(t) + f_A(t) * g(t) * f(t) + f_A(t) * g(t) * f(t) * g(t) * f(t) + \ldots$ and $h_{udd}(t) = g_A(t) * f(t) + g_A(t) * f(t) * g(t) * f(t) + \ldots$

Expressions for mission availability and work-mission availability are generally only used with items new at time t = 0 (see [6.5 (1973)] for a generalization).

6.2.4 Asymptotic Behavior

As t → ∞, expressions for the point availability, average availability, interval reliability, joint availability, and distribution function of the forward recurrence time (Eqs. (6.42)-(6.47)) converge to quantities which are independent of t and of the initial conditions at t = 0. Using the key renewal theorem (Eq. (A7.29)) it follows that

$$\lim_{t \to \infty} PA_S(t) = PA_S = \frac{MTTF}{MTTF + MTTR}, \tag{6.48}$$

$$\lim_{t \to \infty} AA_S(t) = AA_S = \frac{MTTF}{MTTF + MTTR} = PA_S, \tag{6.49}$$

$$\lim_{t \to \infty} IR_S(t, t + \theta) = IR_S(\theta) = \frac{1}{MTTF + MTTR} \int_\theta^\infty (1 - F(y))\,dy, \tag{6.50}$$

$$\lim_{t \to \infty} JA_S(t, t + \theta) = JA_S(\theta) = \frac{MTTF}{MTTF + MTTR}\, PA_{S0e}(\theta), \tag{6.51}$$

$$\lim_{t \to \infty} \Pr\{\tau_{Ru}(t) \le x\} = \frac{1}{MTTF} \int_0^x (1 - F(y))\,dy, \tag{6.52}$$

$$\lim_{t \to \infty} \Pr\{\tau_{Rd}(t) \le x\} = \frac{1}{MTTR} \int_0^x (1 - G(y))\,dy, \tag{6.53}$$

where $MTTF = E[\tau_i]$, $MTTR = E[\tau_i']$, i = 1, 2, ..., and $PA_{S0e}(\theta)$ is the point availability according to Eq. (6.42) with p = 1 and $F_A(t)$ from Eq. (6.57) or Eq. (6.52). In practical applications, PA and AA (or $PA_S$ and $AA_S$ for system oriented values) are often referred to as availability and denoted by A. The use of $PA_S = AA_S$ = (MTBF - MTTR) / MTBF should be avoided, because it implies MTBF = MTTF + MTTR.
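The asymptotic expressions can be evaluated numerically for a non-exponential failure-free time. The sketch below assumes an Erlang (n = 2) distributed failure-free time with rate λ, for which 1 - F(y) = (1 + λy) e^{-λy} and MTTF = 2/λ, and computes $PA_S$ and $IR_S(\theta)$ from the limits above; the tail integral is approximated with a composite Simpson rule on a truncated range. The distribution choice, the truncation factor, and all names are assumptions for illustration.

```python
import math

def survival_erlang2(y, lam):
    """1 - F(y) for an Erlang (n = 2) failure-free time with rate lam."""
    return (1.0 + lam * y) * math.exp(-lam * y)

def simpson(func, a, b, n=20000):
    """Composite Simpson rule on [a, b] (n even)."""
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0

def asymptotic_values(lam, mttr, theta, horizon_factor=60.0):
    """Return (PA_S, IR_S(theta)); the infinite upper limit is truncated
    at theta + horizon_factor / lam, where the integrand is negligible."""
    mttf = 2.0 / lam
    pa = mttf / (mttf + mttr)
    tail = simpson(lambda y: survival_erlang2(y, lam), theta, theta + horizon_factor / lam)
    return pa, tail / (mttf + mttr)
```

For this distribution the tail integral has the closed form $e^{-\lambda\theta}(\theta + 2/\lambda)$, so the numerical result can be checked directly; for θ = 0 the interval reliability reduces to $PA_S$.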

Example 6.3

Show that for a repairable one-item structure in continuous operation, the limit

$$\lim_{t \to \infty} PA_S(t) = PA_S = \frac{MTTF}{MTTF + MTTR}$$

is valid for any distribution function F(x) of the failure-free time and G(x) of the repair time, if MTTF < ∞, MTTR < ∞, and the densities f(x) and g(x) go to 0 as x → ∞.

Solution

Using the renewal density theorem, Eq. (A7.31), it follows that

$$\lim_{t \to \infty} h_{duu}(t) = \lim_{t \to \infty} h_{dud}(t) = \frac{1}{MTTF + MTTR}.$$

Furthermore, applying the key renewal theorem, Eq. (A7.29), to $PA_S(t)$ given by Eq. (6.42) yields

$$\lim_{t \to \infty} PA_S(t) = p\,\Big( 1 - 1 + \frac{\int_0^\infty (1 - F(x))\,dx}{MTTF + MTTR} \Big) + (1 - p)\, \frac{\int_0^\infty (1 - F(x))\,dx}{MTTF + MTTR} = p\, \frac{MTTF}{MTTF + MTTR} + (1 - p)\, \frac{MTTF}{MTTF + MTTR} = \frac{MTTF}{MTTF + MTTR}.$$

The limit MTTF / (MTTF + MTTR) can also be obtained from the final value theorem of the Laplace transform (Table A9.7), considering for s → 0

$$\tilde f(s) = 1 - s\,MTTF + o(s) \approx 1 - s\,MTTF \quad\text{and}\quad \tilde g(s) = 1 - s\,MTTR + o(s) \approx 1 - s\,MTTR, \tag{6.54}$$

with o(s) as per Eq. (A7.89). When considering $\tilde g(\lambda)$ for availability calculations, the approximation given by Eq. (6.54) often leads to $PA_S = 1$, already for simple redundancy structures. In these cases, Eq. (6.113) has to be used.

In the case of constant failure & repair rates λ(x) = λ and μ(x) = μ, Eq. (6.42) yields

$$PA_S(t) = \frac{\mu}{\lambda + \mu} + \Big( p - \frac{\mu}{\lambda + \mu} \Big)\, e^{-(\lambda + \mu)t}. \tag{6.55}$$

Thus, for this important case, the convergence of $PA_S(t)$ toward $PA_S = \mu/(\lambda + \mu)$ is exponential with a time constant $1/(\lambda + \mu) < 1/\mu = MTTR$. In particular, for p = 1, i.e. for $PA_S(0) = 1$ and $PA_S(t) \equiv PA_{S0}(t)$, it follows that

$$PA_{S0}(t) - PA_S = \frac{\lambda}{\lambda + \mu}\, e^{-(\lambda + \mu)t} \le \frac{\lambda}{\mu}\, e^{-\mu t} = \lambda\, MTTR\; e^{-t/MTTR}.$$

Generalizing the distribution function G(x) of the repair time and / or F(x) of the failure-free time, $PA_{S0}(t)$ oscillates damped (as in general for the renewal density h(t) given by Eq. (A7.18)). However, for constant failure rate λ, and provided that λ MTTR is sufficiently small and that some rather weak conditions on the density g(x) hold, lower and upper bounds for $PA_{S0}(t)$ can be found [6.25]

$$PA_{S0}(t) \ge \frac{1}{1 + \lambda\, MTTR} - C_L\, \frac{\lambda\, MTTR}{1 + \lambda\, MTTR}\, e^{-(\lambda + 1/MTTR)\,t}, \quad t \ge 0,$$

and

$$PA_{S0}(t) \le \frac{1}{1 + \lambda\, MTTR} + C_U\, \frac{\lambda\, MTTR}{1 + \lambda\, MTTR}\, e^{-(\lambda + 1/MTTR)\,t}, \quad t \ge 0.$$

$C_U = 1$ holds for many practical applications (λ MTTR << 0.1). Sufficient conditions for $C_L = 1$ are given in [6.25]. However, conditions on $C_U$ are less important than those on $C_L$, since $PA_{S0}(t) \le 1$ is always true. The case of a gamma distribution with density $g(x) = \alpha^\beta x^{\beta - 1} e^{-\alpha x} / \Gamma(\beta)$, mean β/α, and shape parameter β ≥ 3, leads for instance to $|PA_{S0}(t) - PA_S| \le \lambda\, MTTR\; e^{-t/MTTR}$ at least for t ≥ 3 MTTR = 3β/α.

6.2.5 Steady-State Behavior

For

$$p = \frac{MTTF}{MTTF + MTTR}, \quad F_A(x) = \frac{1}{MTTF} \int_0^x (1 - F(y))\,dy, \quad G_A(x) = \frac{1}{MTTR} \int_0^x (1 - G(y))\,dy, \tag{6.57}$$

the alternating renewal process describing the time behavior of a one-item repairable structure is stationary (in steady-state), see Appendix A7.3. With p, $F_A(t)$, and $G_A(t)$ as per Eq. (6.57), the expressions for the point availability (6.42), average availability (6.43), interval reliability (6.44), joint availability (6.45), and the distribution functions of the forward recurrence time (6.46) and (6.47) take the values given by Eqs. (6.48) - (6.53) for all t ≥ 0, see Example 6.4 for the point availability $PA_S$. This relationship between asymptotic & steady-state (stationary) behavior is important in practical applications because it allows the following interpretation (see also the remark on p. 450):

A one-item repairable structure is in a steady-state (stationary behavior) if it began operating at the time t = -∞ and will be considered only for t ≥ 0, the time t = 0 being an arbitrary time point.


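Table 6.4 Results for a repairable one-item structure in asymptotic & steady-state (stationary) behavior

Each row gives the expression for constant failure and repair rates (λ, μ)*), the expression for arbitrary failure and repair rates, and remarks or assumptions.

1. Pr{up at t = 0} (p)
   Constant: $\frac{\mu}{\lambda + \mu}$
   Arbitrary: $\frac{MTTF}{MTTF + MTTR}$
   Remarks: $MTTF = E[\tau_i]$, i ≥ 1; $MTTR = E[\tau_i']$, i ≥ 1

2. Distribution of $\tau_0$ ($F_A(x)$)
   Constant: $1 - e^{-\lambda x}$
   Arbitrary: $\frac{1}{MTTF} \int_0^x (1 - F(y))\,dy$
   Remarks: $F_A(x)$ is also the distribution function of $\tau_{Ru}(t)$ as in Fig. 6.3 ($F_A(x) = \Pr\{\tau_{Ru}(t) \le x\}$)

3. Distribution of $\tau_0'$ ($G_A(x)$)
   Constant: $1 - e^{-\mu x}$
   Arbitrary: $\frac{1}{MTTR} \int_0^x (1 - G(y))\,dy$
   Remarks: $G_A(x)$ is also the distribution function of $\tau_{Rd}(t)$ as in Fig. 6.3 ($G_A(x) = \Pr\{\tau_{Rd}(t) \le x\}$)

4. Renewal densities $h_{du}(t)$ and $h_{ud}(t)$
   Constant: $\frac{\lambda\mu}{\lambda + \mu}$
   Arbitrary: $\frac{1}{MTTF + MTTR}$

5. Point availability ($PA_S$)
   Constant: $\frac{\mu}{\lambda + \mu}$
   Arbitrary: $\frac{MTTF}{MTTF + MTTR}$
   Remarks: $PA_S$ = Pr{up at t}, t ≥ 0

6. Average availability ($AA_S$)
   Constant: $\frac{\mu}{\lambda + \mu}$
   Arbitrary: $\frac{MTTF}{MTTF + MTTR}$
   Remarks: $AA_S = \frac{1}{t}\, E[\text{total up time in } (0, t]]$, t > 0

7. Interval reliability ($IR_S(\theta)$)
   Constant: $\frac{\mu}{\lambda + \mu}\, e^{-\lambda\theta}$
   Arbitrary: $\frac{1}{MTTF + MTTR} \int_\theta^\infty (1 - F(x))\,dx$
   Remarks: $IR_S(\theta)$ = Pr{up in [t, t + θ]}, t ≥ 0

8. Joint availability ($JA_S(\theta)$)
   Constant: $\Big(\frac{\mu}{\lambda + \mu}\Big)^2 + \frac{\lambda\mu\, e^{-(\lambda + \mu)\theta}}{(\lambda + \mu)^2}$
   Arbitrary: $\frac{MTTF \cdot PA_{S0e}(\theta)}{MTTF + MTTR}$
   Remarks: $JA_S(\theta)$ = Pr{up at t ∩ up at t + θ}, t ≥ 0; $PA_{S0e}(\theta) = PA_S(\theta)$ as per Eq. (6.42) with p = 1 and $F_A(t)$ as in point 2

λ, μ = failure, repair rate; up = operating state; $h_{ud}(t)$ = failure frequency, $h_{du}(t)$ = repair frequency; *) Markov process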

For constant failure rate λ and repair rate μ, the convergence of $PA_{S0}(t)$ to $PA_S$ is exponential with time constant ≈ 1/μ = MTTR, as per Eq. (6.55). Extrapolating the results of Section 6.2.4, one can assume that for practical applications, the function $PA_{S0}(t)$ is captured at least for some $t > t_0 > 0$ in the band $|PA_{S0}(t) - PA_S| \le \lambda\, MTTR\; e^{-t/MTTR}$ when generalizing the distribution function of repair times. Thus, for practical purposes one can assume that after a time t = 10 MTTR, the point availability $PA_{S0}(t)$ has reached its steady-state (stationary) value $PA_S = AA_S$.
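The rule of thumb t = 10 MTTR can be made concrete for constant failure and repair rates. The sketch below evaluates the distance between $PA_{S0}(t)$ and its steady-state value at t = 10 MTTR and compares it with the band λ MTTR e^{-t/MTTR}. The parameter values and names are hypothetical and chosen only for illustration.

```python
import math

def pa_s0(t, lam, mu):
    """Point availability for constant rates, item new at t = 0."""
    s = lam + mu
    return mu / s + lam / s * math.exp(-s * t)

def deviation_from_steady_state(t, lam, mu):
    """PA_S0(t) minus the steady-state value mu / (lam + mu)."""
    return pa_s0(t, lam, mu) - mu / (lam + mu)

lam, mu = 1e-3, 0.5          # hypothetical rates in 1/h
mttr = 1.0 / mu              # mean time to repair in h
deviation = deviation_from_steady_state(10.0 * mttr, lam, mu)
band = lam * mttr * math.exp(-10.0)
```

With these values the deviation after 10 MTTR is below 1e-7, far smaller than the unavailability λ/(λ + μ) ≈ 2e-3 itself, which is the sense in which the steady-state value can be used from then on.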

Important results for the steady-state behavior of a repairable one-item structure are given in Table 6.4.

Example 6.4

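Show that for a repairable one-item structure in steady-state, i.e. with p, $F_A(x)$, and $G_A(x)$ as per Eq. (6.57), the point availability is $PA_S(t) = PA_S$ = MTTF / (MTTF + MTTR) for all t ≥ 0.

Solution

Applying the Laplace transform to Eq. (6.42) and using Eqs. (A7.50) and (6.57), i.e. $\tilde f_A(s) = (1 - \tilde f(s)) / (s\,MTTF)$ and $\tilde g_A(s) = (1 - \tilde g(s)) / (s\,MTTR)$, yields

$$\tilde{PA}_S(s) = \frac{MTTF}{MTTF + MTTR}\, \Big[ \frac{1}{s} - \frac{1 - \tilde f(s)}{s^2\, MTTF} + \frac{1 - \tilde f(s)}{s\, MTTF} \cdot \frac{\tilde g(s)}{1 - \tilde f(s)\,\tilde g(s)} \cdot \frac{1 - \tilde f(s)}{s} \Big] + \frac{MTTR}{MTTF + MTTR} \cdot \frac{1 - \tilde g(s)}{s\, MTTR} \cdot \frac{1}{1 - \tilde f(s)\,\tilde g(s)} \cdot \frac{1 - \tilde f(s)}{s}$$

and finally

$$\tilde{PA}_S(s) = \frac{1}{MTTF + MTTR}\, \Big[ \frac{MTTF}{s} - \frac{1 - \tilde f(s)}{s^2} + \frac{(1 - \tilde f(s))\,\big(\tilde g(s)(1 - \tilde f(s)) + 1 - \tilde g(s)\big)}{s^2\,(1 - \tilde f(s)\,\tilde g(s))} \Big],$$

from which, since $\tilde g(s)(1 - \tilde f(s)) + 1 - \tilde g(s) = 1 - \tilde f(s)\,\tilde g(s)$,

$$\tilde{PA}_S(s) = \frac{MTTF}{MTTF + MTTR} \cdot \frac{1}{s},$$

and thus $PA_S(t) = PA_S$ for all t ≥ 0.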

6.3 Systems without Redundancy

The reliability block diagram of a system without redundancy consists of the series connection of all its elements $E_1$ to $E_n$, see Fig. 6.4. Each element $E_i$ in Fig. 6.4 is characterized by the distribution functions $F_i(x)$ for the failure-free time and $G_i(x)$ for the repair time.

6.3.1 Series Structure with Constant Failure and Repair Rates for Each Element

In this section, constant failure and repair rates are assumed, i.e.

$$F_i(x) = 1 - e^{-\lambda_i x}, \quad x \ge 0, \tag{6.58}$$

and

$$G_i(x) = 1 - e^{-\mu_i x}, \quad x \ge 0, \tag{6.59}$$

Figure 6.4 Reliability block diagram for a system without redundancy (series structure)

holds for i = 1, ..., n. Because of Eqs. (6.58) and (6.59), the stochastic behavior of the system is described by a time-homogeneous Markov process. Let $Z_0$ be the system up state and $Z_i$ the state in which element $E_i$ is down. Taking assumption (6.2) into account, i.e. neglecting further failures during a repair at system level (in short: no further failures at system down), the corresponding diagram of transition probabilities in (t, t + δt] is given in Fig. 6.5. Equations of Table 6.2 can be used to obtain the expressions for the reliability function, point availability and interval reliability. With $U = \{Z_0\}$, $\bar U = \{Z_1, \ldots, Z_n\}$ and the transition rates according to Fig. 6.5, the reliability function (see Table 6.2 for notation) follows as

$$R_{S0}(t) = e^{-\lambda_S t}, \quad \lambda_S = \sum_{i=1}^n \lambda_i, \tag{6.60}$$

and thus, for the mean time to failure,

$$MTTF_{S0} = \frac{1}{\lambda_S}. \tag{6.61}$$

Figure 6.5 Diagram of the transition probabilities in (t, t + δt] for a repairable series structure (constant failure and repair rates $\lambda_i$ and $\mu_i$, only one repair crew, ideal failure recognition & switch, no further failures at system down, arbitrary t, δt ↓ 0, Markov process)

The point availability is given by

$$PA_{S0}(t) = P_{00}(t), \tag{6.62}$$

with $P_{00}(t)$ from (Table 6.2)

$$P_{00}(t) = e^{-\lambda_S t} + \sum_{i=1}^n \int_0^t \lambda_i\, e^{-\lambda_S x}\, P_{i0}(t - x)\,dx,$$

$$P_{i0}(t) = \int_0^t \mu_i\, e^{-\mu_i x}\, P_{00}(t - x)\,dx, \quad i = 1, \ldots, n. \tag{6.63}$$

The solution of Eq. (6.63) leads to the following Laplace transform (Table A9.7) for $PA_{S0}(t)$

$$\tilde{PA}_{S0}(s) = \frac{1}{s\,\Big(1 + \sum_{i=1}^n \frac{\lambda_i}{s + \mu_i}\Big)}. \tag{6.64}$$

From Eq. (6.64) there follows the asymptotic & steady-state value of the point and average availability

$$PA_S = AA_S = \frac{1}{1 + \sum_{i=1}^n \frac{\lambda_i}{\mu_i}} \approx 1 - \sum_{i=1}^n \frac{\lambda_i}{\mu_i}. \tag{6.65}$$

Because of the constant failure rate of all elements, the interval reliability can be directly obtained from Eq. (6.27) by

$$IR_{S0}(t, t + \theta) = PA_{S0}(t)\, e^{-\lambda_S \theta}, \tag{6.66}$$

with the asymptotic & steady-state value

$$IR_S(\theta) = PA_S\, e^{-\lambda_S \theta}, \tag{6.67}$$

where $\lambda_S = \sum_{i=1}^n \lambda_i$.

6.3.2 Series Structure with Constant Failure Rate and Arbitrary Repair Rate for Each Element

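Generalization of the repair time distribution functions $G_i(x)$, with densities $g_i(x)$ and $G_i(0) = 0$, leads to a semi-Markov process with state space $Z_0, \ldots, Z_n$, as in Fig. 6.5 (this because of Assumption (6.2) of no further failures at system down). The reliability function and the mean time to failure are still given by Eqs. (6.60) and (6.61). For the point availability let us first calculate the semi-Markov transition probabilities $Q_{ij}(x)$ using Table 6.2

$$Q_{0i}(x) = \int_0^x \lambda_i\, e^{-\lambda_S y}\,dy = \frac{\lambda_i}{\lambda_S}\,(1 - e^{-\lambda_S x}), \qquad Q_{i0}(x) = G_i(x), \quad i = 1, \ldots, n.$$

The system of integral equations for the transition probabilities (conditional state probabilities) $P_{ij}(t)$ follows then from Table 6.2

$$P_{00}(t) = e^{-\lambda_S t} + \sum_{i=1}^n \int_0^t \lambda_i\, e^{-\lambda_S x}\, P_{i0}(t - x)\,dx, \qquad P_{i0}(t) = \int_0^t g_i(x)\, P_{00}(t - x)\,dx, \quad i = 1, \ldots, n. \tag{6.69}$$

For the Laplace transform of the point availability $PA_{S0}(t) = P_{00}(t)$ one obtains finally from Eq. (6.69)

$$\tilde{PA}_{S0}(s) = \frac{1}{s + \lambda_S - \sum_{i=1}^n \lambda_i\, \tilde g_i(s)}, \tag{6.70}$$

from which follows the asymptotic & steady-state value of the point and average availability

$$PA_S = AA_S = \frac{1}{1 + \sum_{i=1}^n \lambda_i\, MTTR_i}, \tag{6.71}$$

with $\lim_{s \to 0} (1 - \tilde g_i(s)) = s\, MTTR_i$, as per Eq. (6.54), and

$$MTTR_i = \int_0^\infty (1 - G_i(t))\,dt. \tag{6.72}$$

The interval reliability can be calculated either from Eq. (6.66) with $PA_{S0}(t)$ from Eq. (6.70) or from Eq. (6.67) with $PA_S$ from Eq. (6.71).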


Example 6.5

A system consists of elements $E_1$ to $E_4$ which are necessary for the fulfillment of the required function (series structure). Let the failure rates $\lambda_1 = 10^{-3}\,h^{-1}$, $\lambda_2 = 0.5 \cdot 10^{-3}\,h^{-1}$, $\lambda_3 = 10^{-4}\,h^{-1}$, $\lambda_4 = 2 \cdot 10^{-3}\,h^{-1}$ be constant and assume that the repair time of all elements is lognormally distributed with parameters $\lambda = 0.5\,h^{-1}$ and σ = 0.6. The system has only one repair crew and no further failure can occur at system down (failures during repair are neglected). Give the reliability function for a mission of duration t = 168 h, the mean time to failure, the asymptotic & stationary values of the point and average availability, and the asymptotic & stationary value of the interval reliability for θ = 12 h.

Solution

The system failure rate is $\lambda_S = \lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 36 \cdot 10^{-4}\,h^{-1}$ according to Eq. (6.60). The reliability function follows as $R_{S0}(t) = e^{-0.0036\,t}$, from which $R_{S0}(168\,h) \approx 0.55$. The mean time to failure is $MTTF_{S0} = 1/\lambda_S \approx 278\,h$. The mean time to repair is obtained from Table A6.2 as $E[\tau'] = e^{\sigma^2/2} / \lambda = MTTR \approx 2.4\,h$. For the asymptotic & steady-state values of the point and average availability as well as for the interval reliability for θ = 12 h it follows from Eqs. (6.71) and (6.67) that $PA_S = AA_S = 1 / (1 + 36 \cdot 10^{-4} \cdot 2.4) \approx 0.991$ and $IR_S(12) \approx 0.991 \cdot e^{-0.0036 \cdot 12} \approx 0.95$.
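The arithmetic of Example 6.5 can be retraced with a few lines of code. The individual failure rates below are an assumed reading of the example data (their sum, 36·10^-4 per hour, is the value stated in the solution); the variable names are chosen for this illustration only.

```python
import math

# Assumed reading of the four failure rates in 1/h; their sum matches
# the system failure rate 36e-4 1/h stated in the solution.
failure_rates = [1e-3, 0.5e-3, 1e-4, 2e-3]
lam_s = sum(failure_rates)

# Lognormal repair time with parameters lambda = 0.5 1/h and sigma = 0.6.
lognormal_lambda, sigma = 0.5, 0.6
mttr = math.exp(sigma ** 2 / 2.0) / lognormal_lambda      # mean repair time in h

reliability_168h = math.exp(-lam_s * 168.0)                # R_S0(168 h)
mttf = 1.0 / lam_s                                         # MTTF_S0 in h
availability = 1.0 / (1.0 + lam_s * mttr)                  # PA_S = AA_S
interval_reliability_12h = availability * math.exp(-lam_s * 12.0)  # IR_S(12 h)
```

Because all elements share the same repair time distribution, the sum of λ_i MTTR_i in the availability expression collapses to λ_S MTTR.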

6.3.3 Series Structure with Arbitrary Failure and Repair Rates for Each Element

Generalization of repair and failure-free time distribution functions leads to a nonregenerative stochastic process. This model can be investigated using supplementary variables, or by approximating the distribution functions of the failure-free time in such a way that the involved stochastic process can be reduced to a regenerative process. Using for the approximation an Erlang distribution function leads to a semi-Markov process. As an example, let us consider the case of a two-element series structure ($E_1$, $E_2$) and assume that the repair times are arbitrary, with densities $g_1(x)$ and $g_2(x)$, and the failure-free times have densities

$$f_1(x) = \lambda_1^2\, x\, e^{-\lambda_1 x} \tag{6.73}$$

and

$$f_2(x) = \lambda_2\, e^{-\lambda_2 x}. \tag{6.74}$$

Equation (6.73) is the density of the sum of two exponentially distributed random time intervals with density $\lambda_1 e^{-\lambda_1 x}$. Under these assumptions, the two-element series structure corresponds to a 1-out-of-2 standby redundancy with constant failure rate $\lambda_1$, in series with an element with constant failure rate $\lambda_2$. Figure 6.6 gives the equivalent reliability block diagram and the corresponding state transition diagram. This diagram only visualizes the possible transitions and cannot be considered as a diagram of the transition probabilities in (t, t + δt]. $Z_0$ is the system up state, $Z_{1'}$ and $Z_{2'}$ are supplementary states necessary for calculation only. For the semi-Markov transition probabilities $Q_{ij}(x)$ one obtains (see Table 6.2)

From Eq. (6.75) it follows that (Table 6.2 and Eq. (6.54))

Figure 6.6 Equivalent reliability block diagram (with $E_1$ represented as a 1-out-of-2 standby redundancy) and state transition diagram for a two series element system ($E_1$ and $E_2$) with arbitrarily distributed repair times, constant failure rate for $E_2$, and Erlangian (n = 2) distributed failure-free time for $E_1$, only one repair crew, ideal failure recognition & switch, no further failures at system down (5-state semi-Markov process)


(6.79)

(6.80)

The interval reliability $IR_{S0}(t, t + \theta)$ can be obtained from

$$IR_{S0}(t, t + \theta) = P_{00}(t)\, R_{S0}(\theta) + P_{01'}(t)\, R_{S1'}(\theta),$$

with $R_{S1'}(\theta) = e^{-(\lambda_1 + \lambda_2)\theta}$, because of the constant failure rates $\lambda_1$ and $\lambda_2$. Important results for repairable series structures are summarized in Table 6.5.

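Table 6.5 Results for a repairable system without redundancy (elements $E_1, \ldots, E_n$ in series), one repair crew, ideal failure recognition & switch, no further failures at system down

Each row gives the quantity, its expression, and remarks or assumptions.

1. Reliability function ($R_{S0}(t)$)
   Expression: $\prod_{i=1}^n R_i(t)$
   Remarks: independent elements (independent elements at least up to system failure)

2. Mean time to system failure ($MTTF_{S0}$)
   Expression: $\int_0^\infty R_{S0}(t)\,dt$
   Remarks: $R_i(t) = e^{-\lambda_i t}$ leads to $R_{S0}(t) = e^{-\lambda_S t}$ and $MTTF_{S0} = 1/\lambda_S$ with $\lambda_S = \lambda_1 + \ldots + \lambda_n$

3. System failure rate up to system failure ($\lambda_S(t)$)
   Expression: $\sum_{i=1}^n \lambda_i(t)$
   Remarks: independent elements (independent elements at least up to system failure)

4. Asymptotic & steady-state value of the point availability & average availability ($PA_S = AA_S$)
   Remarks: at system down, no further failures can occur:
   a) Constant failure rate $\lambda_i$ and constant repair rate $\mu_i$ for each element (i = 1, ..., n): $\frac{1}{1 + \sum_{i=1}^n \lambda_i/\mu_i} \approx 1 - \sum_{i=1}^n \frac{\lambda_i}{\mu_i}$ *)
   b) Constant failure rate $\lambda_i$ and arbitrary repair rate $\mu_i(t)$ with $MTTR_i$ = mean time to repair for each element (i = 1, ..., n): $\frac{1}{1 + \sum_{i=1}^n \lambda_i\, MTTR_i} \approx 1 - \sum_{i=1}^n \lambda_i\, MTTR_i$
   c) 2-element series structure with failure rates $\lambda_1^2 t / (1 + \lambda_1 t)$ for $E_1$ and $\lambda_2$ for $E_2$: see Eq. (6.80)

5. Asymptotic & steady-state value of the interval reliability ($IR_S(\theta)$)
   Expression: $PA_S\, e^{-\lambda_S \theta}$
   Remarks: each element has constant failure rate $\lambda_i$, $\lambda_S = \lambda_1 + \ldots + \lambda_n$

*) Supplementary results: If n repair crews were available, $PA_S = \prod_i \big(1 / (1 + \lambda_i/\mu_i)\big) \approx 1 - \sum_i \lambda_i/\mu_i$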

6.4 1-out-of-2 Redundancy


The 1-out-of-2 redundancy, also known as 1-out-of-2: G, is the simplest redundant structure arising in practical applications. It consists of two elements $E_1$ and $E_2$, one of which is in the operating state and the other in reserve. When a failure occurs, one element is repaired while the other continues the operation. The system is down when an element fails while the other one is being repaired. Assuming ideal switching and failure recognition, the reliability block diagram is a parallel connection of elements $E_1$ and $E_2$, see Fig. 6.7.

Investigations are based on the assumptions (6.1) to (6.7). This implies in particular that the repair of a redundant element begins immediately on failure occurrence and is performed without interruption of operation at system level. The distribution functions of the repair times and of the failure-free times are generalized step by step, beginning with the exponential distribution (memoryless), up to the case in which the process involved has only one regeneration state (Section 6.4.3). Influence of switching and incomplete coverage is considered in Sections 6.8.3 and 6.8.4, preventive maintenance in Section 6.8.2, common cause failures in Section 6.8.7.

6.4.1 1-out-of-2 Redundancy with Constant Failure and Repair Rates for Each Element

Because of the constant failure and repair rates, the time behavior of the 1-out-of-2 redundancy can be described by a time-homogeneous Markov process. The number of states is 3 if elements $E_1$ and $E_2$ are identical (Fig. 6.8) and 5 if they are different (Fig. 6.9); the corresponding diagrams of transition probabilities in (t, t + δt] are also given in Fig. A7.4.

Let us consider first the case of identical elements $E_1$ and $E_2$ (see Example 6.6 for different elements) and assume as distribution function of the failure-free time

$$F(x) = 1 - e^{-\lambda x} \tag{6.81}$$

in the operating state and

$$F_r(x) = 1 - e^{-\lambda_r x} \tag{6.82}$$

in the reserve state. This includes active (parallel) redundancy for $\lambda_r = \lambda$, warm redundancy for $\lambda_r < \lambda$, and standby redundancy for $\lambda_r \equiv 0$. Repair time is assumed to be distributed (independently of $\lambda_r$) according to

$$G(x) = 1 - e^{-\mu x}. \tag{6.83}$$

Figure 6.7 1-out-of-2 redundancy reliability block diagram (ideal failure recognition and switch)

For the investigation of more general situations (arbitrary load sharing, more than one repair crew, or other cases in which failure and/or repair rates change at a state transition) one can use the birth and death process introduced in Appendix A7.5.5. For all these cases, investigations are generally performed using the method of differential equations (Table 6.2 and Appendix A7.5.3.1). Figure 6.8 gives the diagram of transition probabilities in (t, t + δt] for the point availability (Fig. 6.8a) and the reliability function (Fig. 6.8b), respectively.

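Considering the system behavior at times $t$ and $t+\delta t$, the following difference equations can be established for the state probabilities $P_0(t)$, $P_1(t)$, and $P_2(t)$ according to Fig. 6.8a, where $P_i(t) = \Pr\{\text{process in } Z_i \text{ at } t\}$, $i = 0, 1, 2$,

$$\begin{aligned}
P_0(t+\delta t) &= P_0(t)\,(1-(\lambda+\lambda_r)\,\delta t) + P_1(t)\,\mu\,\delta t \\
P_1(t+\delta t) &= P_1(t)\,(1-(\lambda+\mu)\,\delta t) + P_0(t)\,(\lambda+\lambda_r)\,\delta t + P_2(t)\,\mu\,\delta t \\
P_2(t+\delta t) &= P_2(t)\,(1-\mu\,\delta t) + P_1(t)\,\lambda\,\delta t.
\end{aligned} \tag{6.84}$$

For $\delta t \downarrow 0$, it follows that

$$\begin{aligned}
\dot P_0(t) &= -(\lambda+\lambda_r)\,P_0(t) + \mu\,P_1(t) \\
\dot P_1(t) &= -(\lambda+\mu)\,P_1(t) + (\lambda+\lambda_r)\,P_0(t) + \mu\,P_2(t) \\
\dot P_2(t) &= -\mu\,P_2(t) + \lambda\,P_1(t).
\end{aligned} \tag{6.85}$$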

The system of differential equations (6.85) can also be obtained directly from Table 6.2 and Fig. 6.8a. Its solution leads to the state probabilities $P_i(t)$, $i = 0, 1, 2$.
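As a numerical cross-check, the state equations can be integrated directly. The sketch below is only an illustration: it assumes the transition rates read from the caption of Fig. 6.8a ($Z_0 \to Z_1$ with $\lambda+\lambda_r$, $Z_1 \to Z_0$ and $Z_2 \to Z_1$ with $\mu$, $Z_1 \to Z_2$ with $\lambda$), uses a classical fourth-order Runge-Kutta scheme, and the function names and rate values are hypothetical.

```python
def derivatives(p, lam, lam_r, mu):
    """Right-hand side of the state equations for Fig. 6.8a (Z2 is not absorbing)."""
    p0, p1, p2 = p
    return (
        -(lam + lam_r) * p0 + mu * p1,
        (lam + lam_r) * p0 - (lam + mu) * p1 + mu * p2,
        lam * p1 - mu * p2,
    )


def integrate_states(lam, lam_r, mu, t_end, steps=20000):
    """Classical RK4 from P0(0)=1, P1(0)=P2(0)=0; returns (P0, P1, P2) at t_end."""
    p = (1.0, 0.0, 0.0)
    h = t_end / steps
    for _ in range(steps):
        k1 = derivatives(p, lam, lam_r, mu)
        k2 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k1)), lam, lam_r, mu)
        k3 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k2)), lam, lam_r, mu)
        k4 = derivatives(tuple(x + h * k for x, k in zip(p, k3)), lam, lam_r, mu)
        p = tuple(
            x + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)
        )
    return p


# Illustrative values (assumed, in 1/h): active redundancy, lam_r = lam.
lam, lam_r, mu = 1e-2, 1e-2, 0.5
p0, p1, p2 = integrate_states(lam, lam_r, mu, t_end=40.0)
pa_s0 = p0 + p1  # point availability PA_S0(t) at t = 40 h
print(pa_s0)
```

For these rates the integration time of $20/\mu$ is long enough for $PA_{S0}(t)$ to settle at its steady-state value.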

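Assuming as initial conditions at $t = 0$, $P_0(0) = 1$ and $P_1(0) = P_2(0) = 0$, the above state probabilities are identical to the transition probabilities $P_{0i}(t)$, $i = 0, 1, 2$, i.e. $P_{00}(t) \equiv P_0(t)$, $P_{01}(t) \equiv P_1(t)$, and $P_{02}(t) \equiv P_2(t)$. The point availability $PA_{S0}(t)$ is then given by (see Table 6.2 for notation)

$$PA_{S0}(t) = P_0(t) + P_1(t). \tag{6.86}$$

$PA_{S1}(t)$ or $PA_{S2}(t)$ could have been determined for suitable initial conditions. From Eq. (6.86) it follows for the Laplace transform of $PA_{S0}(t)$ that

$$\tilde{PA}_{S0}(s) = \frac{(s+\mu)(s+\lambda+\lambda_r+\mu) + s\lambda}{s\,[(s+\lambda+\lambda_r)(s+\lambda+\mu) + \mu(s+\mu)]} \tag{6.87}$$

and thus for $t \to \infty$

$$\lim_{t\to\infty} PA_{S0}(t) = PA_S = \frac{\mu(\lambda+\lambda_r+\mu)}{(\lambda+\lambda_r)(\lambda+\mu)+\mu^2}. \tag{6.88}$$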


Figure 6.8 Diagram of the transition probabilities in $(t, t+\delta t]$ for a repairable 1-out-of-2 warm redundancy (Fig. 6.7, two identical elements, constant failure ($\lambda$, $\lambda_r$) and repair ($\mu$) rates, one repair crew, arbitrary $t$, $\delta t \downarrow 0$, Markov process): a) For the point availability; b) For the reliability function

If $PA_{S0}(t) = PA_S$ for all $t \ge 0$, then $PA_S$ is also the value of the point and average availability in steady state. Because of $P_0 + P_1 + P_2 = PA_S + P_2 = 1$ it follows that $P_2 = 1 - PA_S = \lambda(\lambda+\lambda_r)/((\lambda+\lambda_r)(\lambda+\mu)+\mu^2)$, with $P_i = \lim_{t\to\infty} P_i(t)$, $i = 0, 1, 2$.
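The steady-state relation $P_0 + P_1 + P_2 = 1$ can be checked numerically. The following minimal sketch assumes the balance equations implied by Fig. 6.8a, namely $(\lambda+\lambda_r)P_0 = \mu P_1$ and $\mu P_2 = \lambda P_1$, and compares them with the closed form for $1 - PA_S$ given above; function names and rate values are illustrative only.

```python
def pa_steady_state(lam, lam_r, mu):
    """Closed-form steady-state availability of the identical-element 1-out-of-2 redundancy."""
    a = lam + lam_r
    return mu * (a + mu) / (a * (lam + mu) + mu ** 2)


def state_probabilities(lam, lam_r, mu):
    """P0, P1, P2 from the balance equations (lam+lam_r) P0 = mu P1 and mu P2 = lam P1."""
    r1 = (lam + lam_r) / mu   # ratio P1 / P0
    r2 = lam * r1 / mu        # ratio P2 / P0
    p0 = 1.0 / (1.0 + r1 + r2)
    return p0, r1 * p0, r2 * p0


# Active (lam_r = lam), warm (0 < lam_r < lam), and standby (lam_r = 0) cases, assumed rates in 1/h.
for lam_r in (1e-2, 2e-3, 0.0):
    p0, p1, p2 = state_probabilities(1e-2, lam_r, 0.5)
    print(lam_r, p0 + p1, pa_steady_state(1e-2, lam_r, 0.5))
```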

Further important results for a 1-out-of-2 redundancy are in Sections 6.8.3 (imperfect switching), 6.8.4 (incomplete coverage), and 6.8.7 (common cause failures).

To calculate the reliability function (by the method of differential equations) it is necessary to consider that the 1-out-of-2 redundancy will operate failure free in $(0, t]$ only if in this time interval the down state at system level (state $Z_2$ in Fig. 6.8) will not be visited. To recognize if the state $Z_2$ has been entered before $t$ it is sufficient to make $Z_2$ absorbing (Fig. 6.8b). In this case, if $Z_2$ is entered the process remains there indefinitely, thus the probability of being in $Z_2$ at $t$ is the probability of having entered $Z_2$ before the time $t$, i.e. the unreliability $1 - R_S(t)$. To avoid ambiguities, the state probabilities in Fig. 6.8b will be marked by an apostrophe (prime). The procedure is similar to that for Eq. (6.85) and leads to the following system of differential equations

$$\begin{aligned}
\dot P'_0(t) &= -(\lambda+\lambda_r)\,P'_0(t) + \mu\,P'_1(t) \\
\dot P'_1(t) &= -(\lambda+\mu)\,P'_1(t) + (\lambda+\lambda_r)\,P'_0(t) \\
\dot P'_2(t) &= \lambda\,P'_1(t)
\end{aligned} \tag{6.89}$$

and to the corresponding state probabilities $P'_0(t)$, $P'_1(t)$, and $P'_2(t)$. With the initial conditions at $t = 0$, $P'_0(0) = 1$ and $P'_1(0) = P'_2(0) = 0$, the state probabilities $P'_0(t)$, $P'_1(t)$, and $P'_2(t)$ are identical to the transition probabilities $P'_{00}(t) \equiv P'_0(t)$, $P'_{01}(t) \equiv P'_1(t)$, and $P'_{02}(t) \equiv P'_2(t)$. The reliability function is then given by (see Table 6.2 for notation)

$$R_{S0}(t) = P'_{00}(t) + P'_{01}(t). \tag{6.90}$$

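With the initial condition $P'_1(0) = 1$, $R_{S1}(t)$ would have been obtained. Eq. (6.90) yields the following Laplace transform for $R_{S0}(t)$

$$\tilde R_{S0}(s) = \frac{s + 2\lambda + \lambda_r + \mu}{(s+\lambda+\lambda_r)(s+\lambda) + s\mu}, \tag{6.91}$$

from which the mean time to failure ($MTTF_{S0} = \tilde R_{S0}(0)$, Eq. (2.61)) follows as

$$MTTF_{S0} = \frac{2\lambda + \lambda_r + \mu}{\lambda(\lambda+\lambda_r)}. \tag{6.92}$$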

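Important for practical applications is the situation for $\lambda, \lambda_r \ll \mu$. To investigate this case let us consider an active redundancy ($\lambda_r = \lambda$). From Eq. (6.91) it follows that

$$\tilde R_{S0}(s) = \frac{s + 3\lambda + \mu}{(s+\lambda)(s+2\lambda) + s\mu} = \frac{s + 3\lambda + \mu}{(s - r_1)(s - r_2)}$$

with

$$r_{1,2} = \frac{-(3\lambda+\mu) \pm \sqrt{(3\lambda+\mu)^2 - 8\lambda^2}}{2}$$

and thus (Table A9.7b)

$$R_{S0}(t) = \frac{r_2\,e^{r_1 t} - r_1\,e^{r_2 t}}{r_2 - r_1}.$$

For $\lambda \ll \mu$, it follows that $r_1 \approx 0$ and $r_2 \approx -\mu$, yielding

$$R_{S0}(t) \approx e^{r_1 t}.$$

Using $\sqrt{1-\varepsilon} \approx 1 - \varepsilon/2$ for $2 r_1 = -(3\lambda+\mu)\left(1 - \sqrt{1 - 8\lambda^2/(3\lambda+\mu)^2}\right)$ leads to $r_1 \approx -2\lambda^2/(3\lambda+\mu)$. $R_{S0}(t)$ can thus be approximated by a decreasing exponential function with time constant $MTTF_{S0} \approx (3\lambda+\mu)/2\lambda^2$. +) This important result shows that: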

For $\lambda \ll \mu$, a repairable 1-out-of-2 active redundancy with constant failure rate $\lambda$ and constant repair rate $\mu$ behaves approximately like a one-item structure with constant failure rate $\lambda_S = 2\lambda^2/(3\lambda+\mu)$; an equivalent repair rate $\mu_S$ for the one-item structure can be obtained by comparing the equations for the steady-state point availability and leads to $\mu_S = \lambda_S/(1 - PA_S) \approx \mu$, with $PA_S$ from Eq. (6.88) (see also Table 6.10).
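The quality of the one-item approximation can be illustrated numerically. The sketch below evaluates the roots $r_{1,2}$ and the exact two-exponential form of $R_{S0}(t)$ for active redundancy and compares it with $e^{-\lambda_S t}$, $\lambda_S = 2\lambda^2/(3\lambda+\mu)$. The rate values are assumed for illustration and the function names are hypothetical.

```python
import math


def roots(lam, mu):
    """Roots r1 (close to 0) and r2 (close to -mu) of s^2 + (3 lam + mu) s + 2 lam^2."""
    s = 3.0 * lam + mu
    d = math.sqrt(s * s - 8.0 * lam * lam)
    return (-s + d) / 2.0, (-s - d) / 2.0


def r_exact(t, lam, mu):
    """R_S0(t) = (r2 e^{r1 t} - r1 e^{r2 t}) / (r2 - r1) for active redundancy."""
    r1, r2 = roots(lam, mu)
    return (r2 * math.exp(r1 * t) - r1 * math.exp(r2 * t)) / (r2 - r1)


def r_one_item(t, lam, mu):
    """One-item approximation exp(-lam_S t) with lam_S = 2 lam^2 / (3 lam + mu)."""
    return math.exp(-2.0 * lam * lam * t / (3.0 * lam + mu))


lam, mu = 1e-3, 1.0  # assumed rates in 1/h, lam << mu
r1, r2 = roots(lam, mu)
mttf_from_roots = -(r1 + r2) / (r1 * r2)  # integral of R_S0(t) over (0, infinity)
for t in (0.0, 100.0, 1000.0, 100000.0):
    print(t, r_exact(t, lam, mu), r_one_item(t, lam, mu))
```

With $\lambda/\mu = 10^{-3}$ the two curves differ only in the sixth decimal place, which is the order of $2\lambda^2/\mu^2$.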

Extension of the above result to warm redundancy ($\lambda_r < \lambda$) leads to

As in all these considerations, $\lambda_r = \lambda$ yields active and $\lambda_r \equiv 0$ standby redundancy. Because of the memoryless property of the time-homogeneous Markov process, the interval reliability follows directly from the transition probabilities $P_{ij}(t)$ and the reliability functions $R_{Si}(t)$, see Table 6.2. Assuming $P_0(0) = 1$ yields

+) More exactly: $R_{S0}(t) = e^{r_1 t}/(1 - r_1/r_2) - e^{r_2 t}/(r_2/r_1 - 1) \approx e^{-2\lambda^2 t/(3\lambda+\mu)}\,(1 + 2\lambda^2/\mu^2) - e^{-\mu t}\,2\lambda^2/\mu^2$.

The Laplace transform of $IR_{S0}(t, t+\theta)$ is then given by

which leads to the following asymptotic & steady-state value (Table 6.2)

To compare the effectiveness of calculation methods, let us now express the reliability function, point availability, and interval reliability using the method of integral equations (Appendix A7.5.3.2). The $Q_{ij}(x)$ are given according to Eq. (A7.102) and Fig. 6.8a by

$$Q_{21}(x) = \Pr\{\tau_{21} \le x\} = 1 - e^{-\mu x}.$$

From Table 6.2 it follows then that

$$\begin{aligned}
R_{S0}(t) &= e^{-(\lambda+\lambda_r)t} + \int_0^t (\lambda+\lambda_r)\,e^{-(\lambda+\lambda_r)x}\,R_{S1}(t-x)\,dx \\
R_{S1}(t) &= e^{-(\lambda+\mu)t} + \int_0^t \mu\,e^{-(\lambda+\mu)x}\,R_{S0}(t-x)\,dx
\end{aligned} \tag{6.96}$$

for the reliability functions $R_{S0}(t)$ and $R_{S1}(t)$, as well as

and

for the transition probabilities. The solution of Eqs. (6.96) and (6.97) yields Eqs. (6.87), (6.91), and (6.94). Equations (6.96) and (6.97) show how the use of integral equations leads to a quicker solution than differential equations for arbitrary initial conditions at t = 0.

Table 6.6 summarizes the main results of Section 6.4.1. It gives approximate expressions valid for $\lambda \ll \mu$ and distinguishes between the cases of active redundancy ($\lambda_r = \lambda$), warm redundancy ($\lambda_r < \lambda$), and standby redundancy ($\lambda_r \equiv 0$). From Table 6.6, the improvement in $MTTF_{S0}$ through repair, without interruption of operation at system level, is given as lower and upper bounds by

active standby

Investigation of the unavailability in steady-state 1 - PAS leads to

active standby

1 - PAs = 1 - AAS - C1 MTBF p MTBF

The above results can easily be extended to cover situations in which failure or repair rates are modified at state changes (e.g. because of load sharing, differences within the element, repair priority, etc.). In these cases, one simply modifies the transition rates on the diagram of transition probabilities in $(t, t+\delta t]$, see for example Figs. 2.12 and A7.4 to A7.6.

Example 6.6

Give the mean time to failure $MTTF_{S0}$ and the asymptotic & steady-state value of the point availability $PA_S$ for a 1-out-of-2 active redundancy with two different elements $E_1$ and $E_2$, constant failure rates $\lambda_1$, $\lambda_2$, and constant repair rates $\mu_1$, $\mu_2$ (one repair crew).


Table 6.6 Reliability function $R_{S0}(t)$, mean time to failure $MTTF_{S0}$, steady-state availability $PA_S = AA_S$, and interval reliability $IR_S(\theta)$ for a repairable 1-out-of-2 redundancy with identical elements (Fig. 6.7, constant failure rates $\lambda$, $\lambda_r$ and repair rate $\mu$ ($\lambda, \lambda_r \ll \mu$), one repair crew, Markov process)

* new at $t = 0$, ** asymptotic & steady-state value (for practical applications with $\lambda/\mu > 0.01$, convergence of $PA_S(t)$ to $PA_S$ is good after $t = 10/\mu$, see also p. 181)

Supplementary results: See Table 6.9 for the case with two repair crews; assuming in Fig. 6.8a $Z_2 \to Z_0$ with $\mu_g$ instead of $Z_2 \to Z_1$ with $\mu$ yields $PA_S = AA_S \approx 1 - 2\lambda^2/(\mu\,\mu_g)$ (active red.)

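Solution

Figure 6.9 gives the reliability block diagram and the diagram of transition probabilities in $(t, t+\delta t]$. $MTTF_{S0}$ and $PA_S$ can be calculated from appropriate systems of algebraic equations. According to Table 6.2 and considering Fig. 6.9 it follows for the mean time to failure that

$$MTTF_{S0} = \frac{(\lambda_1+\mu_2)(\lambda_2+\mu_1) + \lambda_1(\lambda_1+\mu_2) + \lambda_2(\lambda_2+\mu_1)}{\lambda_1\lambda_2\,(\lambda_1+\lambda_2+\mu_1+\mu_2)}, \tag{6.98}$$

and in particular for $\lambda_1 \ll \mu_1$ and $\lambda_2 \ll \mu_2$,

$$MTTF_{S0} \approx \frac{\mu_1\mu_2}{\lambda_1\lambda_2\,(\mu_1+\mu_2)}.$$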

As for Eq. (6.93), the reliability function can be expressed by

For the asymptotic & steady-state value of the point availability and average availability, $PA_S = P_0 + P_1 + P_2$ holds with $P_0$, $P_1$, and $P_2$ as solution of (Table 6.2)

$$\begin{aligned}
(\lambda_1+\lambda_2)\,P_0 &= \mu_1 P_1 + \mu_2 P_2 \\
(\lambda_2+\mu_1)\,P_1 &= \lambda_1 P_0 + \mu_2 P_4 \\
(\lambda_1+\mu_2)\,P_2 &= \lambda_2 P_0 + \mu_1 P_3 \\
\mu_1 P_3 &= \lambda_2 P_1 \\
\mu_2 P_4 &= \lambda_1 P_2.
\end{aligned}$$

One (arbitrarily chosen) of the five equations must be dropped and replaced by $P_0 + P_1 + P_2 + P_3 + P_4 = 1$. The solution yields $P_0$ through $P_4$, from which

Equation (6.101) can also be written in the form

With $\lambda_1 = \lambda_2 = \lambda$ and $\mu_1 = \mu_2 = \mu$, Eqs. (6.98) and (6.101) become Eqs. (6.92) and (6.88), respectively (with $\lambda_r = \lambda$).
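The reduction to the identical-element case can be verified numerically. The sketch below is an illustration under stated assumptions: the five-state model is taken as $Z_0$ (both up), $Z_1$ ($E_1$ in repair), $Z_2$ ($E_2$ in repair), $Z_3$ and $Z_4$ (both failed, $E_1$ respectively $E_2$ in repair), the mean time to failure is obtained from the mean sojourn equations of the two up-state paths, and a small hand-written Gaussian elimination (hypothetical helper `solve`) replaces any library routine.

```python
def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system a x = b."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def availability(l1, l2, m1, m2):
    """PA_S = P0 + P1 + P2; the balance equation for Z4 is replaced by sum(P) = 1."""
    a = [
        [l1 + l2, -m1, -m2, 0.0, 0.0],   # (l1+l2) P0 = m1 P1 + m2 P2
        [-l1, l2 + m1, 0.0, 0.0, -m2],   # (l2+m1) P1 = l1 P0 + m2 P4
        [-l2, 0.0, l1 + m2, -m1, 0.0],   # (l1+m2) P2 = l2 P0 + m1 P3
        [0.0, -l2, 0.0, m1, 0.0],        # m1 P3 = l2 P1
        [1.0, 1.0, 1.0, 1.0, 1.0],       # normalization
    ]
    p = solve(a, [0.0, 0.0, 0.0, 0.0, 1.0])
    return p[0] + p[1] + p[2]


def mttf(l1, l2, m1, m2):
    """Mean time to failure from Z0, with Z3 and Z4 made absorbing."""
    num = (l1 + m2) * (l2 + m1) + l1 * (l1 + m2) + l2 * (l2 + m1)
    return num / (l1 * l2 * (l1 + l2 + m1 + m2))


# Identical elements (assumed rates in 1/h) must reproduce the results of Section 6.4.1.
print(mttf(0.01, 0.01, 0.5, 0.5), availability(0.01, 0.01, 0.5, 0.5))
```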

Figure 6.9 Reliability block diagram and diagram of transition probabilities in $(t, t+\delta t]$ for a repairable 1-out-of-2 active redundancy with different elements (ideal failure recognition and switch, constant failure and repair rates $\lambda_1$, $\lambda_2$, $\mu_1$, and $\mu_2$, one repair crew, arbitrary $t$, $\delta t \downarrow 0$, Markov process)


6.4.2 1-out-of-2 Redundancy with Constant Failure Rate and Arbitrary Repair Rate

Consider now a 1-out-of-2 warm redundancy with 2 identical elements $E_1$ and $E_2$, failure-free times distributed according to Eqs. (6.81) and (6.82), and repair time with mean $MTTR$, distributed according to an arbitrary distribution function $G(x)$ with $G(0) = 0$ and density $g(x)$. The time behavior of this system can be described by a process with states $Z_0$, $Z_1$, and $Z_2$. Because of the arbitrary repair rate, only states $Z_0$ and $Z_1$ are regeneration states. These states constitute a semi-Markov process embedded in the original semi-regenerative process (Fig. A7.11). The semi-Markov transition probabilities $Q_{ij}(x)$ are given by Eq. (A7.183). Setting these quantities in the equations of Table 6.2 (SMP), by considering $Q_0(x) = Q_{01}(x)$ and $Q_1(x) = Q_{10}(x) + Q'_{12}(x)$ with $Q'_{12}(x)$ as per Eq. (A7.184), it follows for the reliability functions $R_{Si}(t)$

and for the transition probabilities $P_{ij}(t)$ of the embedded semi-Markov process

The solution of Eq. (6.104) leads in particular to

and (with $MTTF_{S0} = \tilde R_{S0}(0)$, Eq. (2.61))

The Laplace transform of the point availability $PA_{S0}(t) = P_{00}(t) + P_{01}(t)$ follows as a solution of Eq. (6.105)

$$\tilde{PA}_{S0}(s) = \frac{(s+\lambda)(1-\tilde g(s)) + \lambda_r(1-\tilde g(s+\lambda)) + \lambda + s\,\tilde g(s+\lambda)}{(s+\lambda)\,[(s+\lambda+\lambda_r)(1-\tilde g(s)) + s\,\tilde g(s+\lambda)]} \tag{6.108}$$

and leads to the asymptotic & steady-state value of the point availability $PA_S$ and average availability $AA_S$ (considering $PA_S = \lim_{s \to 0} s\,\tilde{PA}_{S0}(s)$ and $1 - \tilde g(s) = s \cdot MTTR + o(s)$ for $s \to 0$, as per Eq. (6.54))

$$PA_S = AA_S = \frac{\lambda + \lambda_r\,(1 - \tilde g(\lambda))}{\lambda(\lambda+\lambda_r)\,MTTR + \lambda\,\tilde g(\lambda)}, \tag{6.109}$$

where

$$MTTR = \int_0^\infty (1 - G(x))\,dx$$

and $\tilde g(\lambda)$ is the Laplace transform of the density $g(t)$ for $s = \lambda$, see Eq. (6.88) for $g(t) = \mu e^{-\mu t}$, i.e. $\tilde g(\lambda) = \mu/(\lambda+\mu)$, and Examples 6.7 & 6.8 for the approximation of $\tilde g(\lambda)$. Calculation of the interval reliability is difficult because state $Z_1$ is regenerative only at its occurrence point (Fig. A7.11). However, for $\lambda\,MTTR \ll 1$, $\tilde g(\lambda) \to 1$ and the asymptotic value of the state probability for $Z_1$ ($P_1 = \lim_{t\to\infty} P_{01}(t)$) becomes very small with respect to the state probability for $Z_0$ ($P_0 = \lim_{t\to\infty} P_{00}(t)$). For the asymptotic & steady-state value of the interval reliability it holds then that

In many practical applications, it holds that $\lambda\,MTTR < 0.01$. In such cases, Eq. (6.111) can be further simplified to

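Example 6.7

Let the density $g(x)$ of the repair time $\tau'$ of a system with constant failure rate $\lambda > 0$ be continuous and assume furthermore that $\lambda\,E[\tau'] = \lambda\,MTTR \ll 1$ and $\lambda\sqrt{\mathrm{Var}[\tau']} \ll 1$. Investigate the quantity $\tilde g(\lambda)$ for $\lambda \to 0$.

Solution

For $\lambda \to 0$, $\lambda\,MTTR \ll 1$, and $\lambda\sqrt{\mathrm{Var}[\tau']} \ll 1$, the three first terms of the series expansion of $e^{-\lambda t}$ lead to

$$\tilde g(\lambda) = \int_0^\infty g(t)\,e^{-\lambda t}\,dt \approx \int_0^\infty \left(1 - \lambda t + \frac{(\lambda t)^2}{2}\right) g(t)\,dt = 1 - \lambda\,E[\tau'] + \frac{\lambda^2\,E[\tau'^2]}{2}.$$

From this, follows the approximate expression

$$\tilde g(\lambda) \approx 1 - \lambda\,MTTR + \frac{\lambda^2\,(MTTR^2 + \mathrm{Var}[\tau'])}{2}. \tag{6.113}$$

In many practical applications,

$$\tilde g(\lambda) \approx 1 - \lambda\,MTTR \tag{6.114}$$

is a sufficiently good approximation, however not in calculating steady-state availability (Eq. (6.114) would give for Eq. (6.109) $PA_S = 1$, thus Eq. (6.113) has to be used).

Supplementary results: Assuming $g(x) = \mu e^{-\mu x}$, it follows that $\tilde g(\lambda) = \mu/(\lambda+\mu) \approx 1 - \lambda/\mu = 1 - \lambda\,MTTR$.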

Example 6.8

In a 1-out-of-2 warm redundancy with identical elements $E_1$ and $E_2$ let the failure rates $\lambda$ in the operating state and $\lambda_r$ in the reserve state be constant. For the repair time let us assume that it is distributed according to $G(x) = 1 - e^{-\mu'(x-\psi)}$ for $x \ge \psi$ and $G(x) = 0$ for $x < \psi$, with $MTTR = 1/\mu > \psi$. Assuming $\lambda\psi \ll 1$, investigate the influence of $\psi$ on the mean time to failure $MTTF_{S0}$ and on the asymptotic & steady-state value of the point availability $PA_S$.

Solution

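With

$$\tilde g(\lambda) = \int_\psi^\infty \mu'\,e^{-\lambda t}\,e^{-\mu'(t-\psi)}\,dt = \frac{\mu'\,e^{-\lambda\psi}}{\lambda+\mu'} \approx \frac{\mu'\,(1-\lambda\psi)}{\lambda+\mu'}$$

and considering that

$$MTTR = \int_0^\infty t\,g(t)\,dt = \int_\psi^\infty t\,\mu'\,e^{-\mu'(t-\psi)}\,dt = \psi + \frac{1}{\mu'} = \frac{1}{\mu},$$

i.e. $\mu' = \mu/(1-\mu\psi)$ and thus $\tilde g(\lambda) \approx \mu(1-\lambda\psi)/(\lambda + \mu(1-\lambda\psi))$, Eq. (6.107) (left-hand equality) and Eq. (6.109) lead to the approximate expressions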

and

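On the other hand, $\psi = 0$ leads to $1 - \tilde g(\lambda) = \lambda/(\lambda+\mu)$ and thus (Eqs. (6.92) and (6.88))

Assuming $\mu \gg \lambda, \lambda_r$ yields (considering $0 \le \lambda\psi < \lambda/\mu$)

$$\frac{MTTF_{S0,\,\psi>0}}{MTTF_{S0,\,\psi=0}} \approx 1 - \lambda\psi \qquad \text{and} \qquad \frac{PA_{S,\,\psi>0}}{PA_{S,\,\psi=0}} \approx 1 + \lambda\psi\,\frac{\lambda+\lambda_r}{\mu} \approx 1. \tag{6.115}$$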

Equation (6.115) allows the conclusion to be made that:

For $\lambda\,MTTR \ll 1$, the shape of the distribution function of the repair time has (as long as $MTTR$ is unchanged) a small influence on results at system level, in particular on the mean time to failure $MTTF_{S0}$ and on the asymptotic & steady-state value of the point availability $PA_S$ of a 1-out-of-2 redundancy.

Example 6.9 shows a numerical comparison. This important result can be extended to complex structures.

Example 6.9

A 1-out-of-2 parallel redundancy with identical elements $E_1$ and $E_2$ has failure rate $\lambda = 10^{-2}\,\mathrm{h}^{-1}$ and lognormally distributed repair times with mean $MTTR = 2.4\,\mathrm{h}$ and variance $0.6\,\mathrm{h}^2$ (Eqs. (A6.112), (A6.113) with $\lambda = 0.438\,\mathrm{h}^{-1}$, $\sigma = 0.315$). Compute the mean time to failure $MTTF_{S0}$ and the asymptotic & steady-state point and average availability $PA_S$ with approximate expressions: (i) $\tilde g(\lambda)$ from Eq. (6.114); (ii) $\tilde g(\lambda)$ from Eq. (6.113); (iii) $g(t) = \mu' e^{-\mu'(t-\psi)}$, $t \ge \psi$, $\psi = 1.3\,\mathrm{h}$, $1/\mu' = 1.1\,\mathrm{h}$, $1/\mu = 2.4\,\mathrm{h}$ (Eq. (4.2)); (iv) $g(t) = \mu e^{-\mu t}$ and $1/\mu = 2.4\,\mathrm{h}$.

Solution

(i) With $\tilde g(\lambda) = 0.976$ it follows (Eq. (6.107)) that $MTTF_{S0} = 2183\,\mathrm{h}$ and (Eq. (6.109)) $PA_S = 1$.
(ii) With $\tilde g(\lambda) = 0.9763$ it follows (Eq. (6.107)) that $MTTF_{S0} = 2211\,\mathrm{h}$ and (Eq. (6.109)) $PA_S \approx 0.9994$.
(iii) Example 6.8 yields $MTTF_{S0,\,\psi=1.3\mathrm{h}} = 2206\,\mathrm{h}$ and $PA_{S,\,\psi=1.3\mathrm{h}} = 0.9995$.
(iv) From Eqs. (6.92) and (6.88) it follows that $MTTF_{S0} = 2233\,\mathrm{h}$ and $PA_S = 0.9989$.

Supplementary results: Numerical computation with the lognormal distribution ($MTTR = 2.4\,\mathrm{h}$, $\mathrm{Var}[\tau'] = 0.6\,\mathrm{h}^2$) yields $MTTF_{S0} = 2186\,\mathrm{h}$ and $PA_S \approx 0.9995$. For a failure rate $\lambda = 10^{-3}\,\mathrm{h}^{-1}$, results were: 209'333 h, 1; 209'611 h, 0.999997; 209'563 h, 0.999995; 209'833 h, 0.999989; 209'513 h, 0.999994.
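Cases (i), (ii), and (iv) of Example 6.9 can be reproduced with a few lines of code. This is a hedged sketch: the expressions used for $MTTF_{S0}$ and $PA_S$ as functions of $\tilde g(\lambda)$ are assumptions of this illustration, rederived from the model of Section 6.4.2 (they reduce to Eqs. (6.92) and (6.88) for exponentially distributed repair times), and all function names are hypothetical.

```python
def g_first_order(lam, mttr):
    """Approximation g~(lam) = 1 - lam MTTR, as in Eq. (6.114)."""
    return 1.0 - lam * mttr


def g_second_order(lam, mttr, var):
    """Approximation with the second-moment term, as in Eq. (6.113)."""
    return 1.0 - lam * mttr + 0.5 * lam ** 2 * (mttr ** 2 + var)


def mttf_s0(lam, lam_r, g):
    """Assumed form: mean time in Z0 plus repeated repair cycles until a failure wins."""
    return 1.0 / lam + 1.0 / ((lam + lam_r) * (1.0 - g))


def pa_s(lam, lam_r, mttr, g):
    """Assumed form of the steady-state availability for arbitrary repair times."""
    return (lam + lam_r * (1.0 - g)) / (lam * (lam + lam_r) * mttr + lam * g)


lam, mttr, var = 1e-2, 2.4, 0.6  # values of Example 6.9 (h^-1, h, h^2)
mu = 1.0 / mttr
cases = {
    "(i)": g_first_order(lam, mttr),
    "(ii)": g_second_order(lam, mttr, var),
    "(iv)": mu / (lam + mu),  # exact transform for exponential repair times
}
results = {
    key: (mttf_s0(lam, lam, g), pa_s(lam, lam, mttr, g)) for key, g in cases.items()
}
for key, (m, pa) in results.items():
    print(key, round(m), round(pa, 4))
```

Case (iii) is left out because it relies on the approximate expressions of Example 6.8.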

6.4.3 1-out-of-2 Redundancy with Constant Failure Rate only in the Reserve State and Arbitrary Repair Rates

Generalization of the repair and failure rates of a 1-out-of-2 redundancy leads to a nonregenerative stochastic process. However, in many practical applications it can be assumed that the failure rate in the reserve state is constant. If this holds, and the 1-out-of-2 redundancy has only one repair crew, then the process involved is regenerative with exactly one regeneration state [6.5 (1975)].

To see this, consider a 1-out-of-2 warm redundancy, satisfying assumptions (6.1) to (6.7), with failure-free times distributed according to $F(x)$ in the operating state and $V(x) = 1 - e^{-\lambda_r x}$ in the reserve state, and repair times distributed according to $G(x)$ for repair of failures in the operating state and $W(x)$ for repair of failures in the reserve state ($F(0) = V(0) = G(0) = W(0) = 0$, densities $f(x)$, $g(x)$, $w(x)$). Figure 6.10a shows a time schedule of such a system and Fig. 6.10b gives the state transition diagram (to visualize possible state transitions) of the involved stochastic process.


Figure 6.10 Repairable 1-out-of-2 warm redundancy with constant failure rate $\lambda_r$ in the reserve state, arbitrary failure rate in the operating state, arbitrary repair rates, one repair crew, ideal failure recognition and switch: a) Possible time schedule (operating and reserve phases, renewal points marked, repair times greatly exaggerated); b) State transition diagram to visualize possible state transitions (only $Z_1$ is a regeneration state)

States $Z_0$, $Z_1$, and $Z_2$ are up states. State $Z_1$ is the only regeneration state present here (Fig. 6.10a). The occurrence of $Z_1$ brings the process to a situation of total independence from the previous time development. It is therefore sufficient to investigate the time behavior from $t = 0$ up to the first regeneration point and between two consecutive regeneration points (Appendix A7.7).

Let us consider first the case in which the regeneration state $Z_1$ is entered at $t = 0$ ($S_{RP0}$) and let $S_{RP1}$ be the first renewal point after $t = 0$. The reliability function $R_{S1}(t)$ is given by (see Table 6.2 for definitions)

$$R_{S1}(t) = 1 - F(t) + \int_0^t u_1(x)\,R_{S1}(t-x)\,dx, \tag{6.116}$$

with

$$1 - F(t) = \Pr\{\text{failure-free operating time of the element operating at } t = 0 \text{ is} > t \mid Z_1 \text{ entered at } t = 0\}$$

and

$$\int_0^t u_1(x)\,R_{S1}(t-x)\,dx = \Pr\{(S_{RP1} \le t \,\cap\, \text{up in } (S_{RP1}, t]) \mid Z_1 \text{ entered at } t = 0\}.$$

Figure 6.11 Possible time schedules at $t = 0$ for the 1-out-of-2 redundancy according to Fig. 6.10

The first renewal point $S_{RP1}$ occurs at the time $x$ (i.e. within the interval $(x, x+dx]$) only if at this time the operating element fails and the reserve element is ready to enter the operating state. The quantity $u_1(x)$, defined as

$$u_1(x) = \lim_{\delta x \downarrow 0} \frac{1}{\delta x}\,\Pr\{x < S_{RP1} \le x + \delta x \mid Z_1 \text{ entered at } t = 0\},$$

can be obtained as (Fig. 6.11a)

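The point availability $PA_{S1}(t)$ is given by

$$PA_{S1}(t) = 1 - F(t) + \int_0^t u_1(x)\,PA_{S1}(t-x)\,dx + \int_0^t u_2(x)\,PA_{S1}(t-x)\,dx, \tag{6.120}$$

with

$$1 - F(t) = \Pr\{\text{failure-free operating time of the element operating at } t = 0 \text{ is} > t \mid Z_1 \text{ entered at } t = 0\},$$

$$\int_0^t u_1(x)\,PA_{S1}(t-x)\,dx = \Pr\{(S_{RP1} \le t \,\cap\, \text{up at } t) \mid Z_1 \text{ entered at } t = 0\},$$

and

$$\int_0^t u_2(x)\,PA_{S1}(t-x)\,dx = \Pr\{(S_{RP1} \le t \,\cap\, \text{system failed in } (0, S_{RP1}] \,\cap\, \text{up at } t) \mid Z_1 \text{ entered at } t = 0\}.$$

The quantity $u_2(x)$, defined as

$$u_2(x) = \lim_{\delta x \downarrow 0} \frac{1}{\delta x}\,\Pr\{(x < S_{RP1} \le x + \delta x \,\cap\, \text{system failed in } (0, x]) \mid Z_1 \text{ entered at } t = 0\},$$

can be obtained as (Fig. 6.11b)

$$u_2(x) = g(x)\,F(x) + \int_0^x h'_{dud}(y)\,w(x-y)\,(F(x) - F(y))\,dy \tag{6.121}$$

with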

One recognizes that $u_1(x) + u_2(x)$ is the density of the random variable giving successive occurrence times of state $Z_1$, i.e. interarrival times separating the renewal points $0, S_{RP1}, S_{RP2}, \ldots$ of the embedded renewal process.

Consider now the case in which at $t = 0$ the state $Z_0$ is entered. The reliability function $R_{S0}(t)$ is given by

$$R_{S0}(t) = 1 - F(t) + \int_0^t u_3(x)\,R_{S1}(t-x)\,dx \tag{6.123}$$

with (Fig. 6.11c)

$$u_3(x) = \lim_{\delta x \downarrow 0} \frac{1}{\delta x}\,\Pr\{x < S_{RP1} \le x + \delta x \mid Z_0 \text{ entered at } t = 0\} = f(x)\,PA_0(x), \tag{6.124}$$

where

$$PA_0(x) = \Pr\{\text{reserve element up at time } x \mid Z_0 \text{ entered at } t = 0\}$$

The point availability $PA_{S0}(t)$ is given by

$$PA_{S0}(t) = 1 - F(t) + \int_0^t u_3(x)\,PA_{S1}(t-x)\,dx + \int_0^t u_4(x)\,PA_{S1}(t-x)\,dx \tag{6.127}$$

with (Fig. 6.11d)

$$u_4(x) = \lim_{\delta x \downarrow 0} \frac{1}{\delta x}\,\Pr\{(x < S_{RP1} \le x + \delta x \,\cap\, \text{system failed in } (0, x]) \mid Z_0 \text{ entered at } t = 0\} = \int_0^x h_{udu}(y)\,w(x-y)\,(F(x) - F(y))\,dy \tag{6.128}$$

and

$$h_{udu}(y) = v(y) + v(y) * w(y) * v(y) + v(y) * w(y) * v(y) * w(y) * v(y) + \ldots \,. \tag{6.129}$$

Equations (6.116), (6.120), (6.123), and (6.127) can be solved using Laplace transforms. However, analytical difficulties can arise when calculating Laplace transforms for $F(x)$, $G(x)$, $W(x)$, $u_1(x)$, $u_2(x)$, $u_3(x)$, and $u_4(x)$ as well as at the inversion of the final equations. Easier is the calculation of the mean time to failure $MTTF_{S0}$ and of the asymptotic & steady-state values of the point and average availability $PA_S = AA_S$, for which the following expressions can be found using Laplace transforms, see Eqs. (6.123), (6.116) for $MTTF_{S0}$ and Eqs. (6.120), (6.127) for $PA_S$

and

with

Eq. (6.130) considers Eqs. (2.59), (2.61), and (6.132). Eq. (6.131) considers that $PA_S$ exists, given by $PA_S = AA_S = \lim_{s \to 0} s\,\tilde{PA}_{S0}(s)$, and that $u_1(x) + u_2(x)$ is the density of a random variable with finite mean (p. 203), and thus $\int_0^\infty (u_3(x) + u_4(x))\,dx = 1$.


The model investigated in this section has as special cases that of Section 6.4.2, with $F(x) = 1 - e^{-\lambda x}$ and $W(x) = G(x)$, as well as the 1-out-of-2 standby redundancy with identical elements and arbitrarily distributed failure-free and repair times, see Example 6.10.

Important results for a 1-out-of-2 redundancy with arbitrary repair rates, and failure rates as general as possible within a regenerative process, are given in Table 6.7.

Example 6.10

Using the results of Section 6.4.3, give the expressions for the reliability function $R_{S0}(t)$ and the point availability $PA_{S0}(t)$ for a 1-out-of-2 standby redundancy with 2 identical elements, failure-free time distributed according to $F(x)$, with density $f(x)$, and repair time distributed according to $G(x)$ with density $g(x)$.

Solution

For a standby redundancy, $u_1(x) = f(x)\,G(x)$, $u_2(x) = g(x)\,F(x)$, $u_3(x) = f(x)$, and $u_4(x) \equiv 0$ (Eqs. (6.117), (6.121), (6.124), and (6.128)). From this, the expressions for $R_{S0}(t)$, $R_{S1}(t)$, $PA_{S0}(t)$, and $PA_{S1}(t)$ can be given. The Laplace transforms of $R_{S0}(t)$ and $PA_{S0}(t)$ are

with

The mean time to failure $MTTF_{S0}$ follows from Eq. (6.133) (or Eq. (6.130) with $u_3(x) = f(x)$, $u_1(x) = f(x)\,G(x)$, and considering $F(\infty) = 1$ and Eq. (6.132)) as

$$MTTF_{S0} = MTTF + \frac{MTTF}{1 - \int_0^\infty f(x)\,G(x)\,dx}.$$

For the asymptotic & steady-state value of the point and average availability $PA_S = AA_S$, Eq. (6.134) (or Eq. (6.131) with $u_1(x) = f(x)\,G(x)$ and $u_2(x) = g(x)\,F(x)$) yields

$$PA_S = AA_S = \frac{MTTF}{MTTF + \int_0^\infty F(x)\,(1 - G(x))\,dx}.$$
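For a standby redundancy the mean time to failure has the form $MTTF + MTTF/(1 - \int_0^\infty f(x)G(x)\,dx)$ (see Table 6.7), which lends itself to numerical evaluation for arbitrary $F$ and $G$. The sketch below is an illustration only: it uses a plain trapezoidal rule, truncates the integrals at an assumed upper limit, and takes exponential distributions so that the result can be compared with Eq. (6.92) for $\lambda_r = 0$, i.e. $(2\lambda+\mu)/\lambda^2$.

```python
import math


def trapezoid(fn, upper, steps=200000):
    """Composite trapezoidal rule on [0, upper]."""
    h = upper / steps
    total = 0.5 * (fn(0.0) + fn(upper))
    for i in range(1, steps):
        total += fn(i * h)
    return total * h


lam, mu = 0.01, 0.5  # assumed exponential failure and repair rates in 1/h


def F(x):
    return 1.0 - math.exp(-lam * x)   # failure-free time distribution


def f(x):
    return lam * math.exp(-lam * x)   # its density


def G(x):
    return 1.0 - math.exp(-mu * x)    # repair time distribution


upper = 3000.0  # truncation point, about 30 mean failure-free times
mttf = trapezoid(lambda x: 1.0 - F(x), upper)
p_repair_first = trapezoid(lambda x: f(x) * G(x), upper)  # Pr{repair ends before next failure}
mttf_s0 = mttf + mttf / (1.0 - p_repair_first)
print(mttf_s0)
```

Replacing `F`, `f`, and `G` by other distributions (for example a Weibull failure-free time) only requires changing the three functions and, if needed, the truncation point.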


Table 6.7 Mean time to failure $MTTF_{S0}$, steady-state point & average availability $PA_S = AA_S$, and interval reliability $IR_S(\theta)$ for a repairable 1-out-of-2 redundancy with two identical elements (Fig. 6.10, one repair crew, arbitrary repair rates, failure rates as general as possible within a regenerative process)

Columns: Standby ($\lambda_r \equiv 0$); Active ($\lambda_r = \lambda$)
Rows: Distribution of the failure-free times; Distribution of the repair times; Mean of the failure-free times ($MTTF$, or $1/\lambda_r$ in the reserve state); Mean of the repair times ($MTTR = \int_0^\infty (1 - G(x))\,dx$); Mean time to failure ($MTTF_{S0}$); Point & average availability ($PA_S = AA_S$)*; Interval reliability $IR_S(\theta)$*
Entry for standby: $MTTF_{S0} = MTTF + MTTF / \left(1 - \int_0^\infty f(x)\,G(x)\,dx\right)$

$u_1(x)$, $u_2(x)$, $u_3(x)$ as per Eqs. (6.117), (6.121), and (6.124); OS = operating state, RS = reserve state

* asymptotic & steady-state value

6.5 k-out-of-n Redundancy

A k-out-of-n redundancy, also known as k-out-of-n: G, consists of $n$ (often identical) elements, of which $k$ are necessary for the required function and $n-k$ are in reserve state (or repair). Assuming ideal failure recognition and switching, the reliability block diagram is as given in Fig. 6.12. Investigations in this section assume identical elements $E_1, \ldots, E_n$, only one repair crew, and no further failures at system down (failures during a repair at system level are neglected, as per assumption (6.2)). Section 6.5.1 considers the case of warm redundancy with constant failure rate $\lambda$ in the operating state and $\lambda_r < \lambda$ in the reserve state as well as constant repair rate $\mu$. This case includes active redundancy ($\lambda_r = \lambda$) and standby redundancy ($\lambda_r \equiv 0$). An extension to cover other situations in which the failure rate is modified at state changes (e.g. for load sharing) is possible using the equations for the birth and death process developed in Appendix A7.5.5 (see also Section 2.3.5). Section 6.5.2 investigates a k-out-of-n active redundancy with constant failure rate and arbitrary repair rate. The influence of series elements (including switching elements) is considered in Sections 6.6 and 6.7. Imperfect switching, incomplete coverage, and common cause failures are investigated in Section 6.8.

Figure 6.12 k-out-of-n redundancy reliability block diagram (ideal failure recognition & switch)

6.5.1 k-out-of-n Warm Redundancy with Identical Elements and Constant Failure and Repair Rates

Assuming constant failure and repair rates, the time behavior of the k-out-of-n redundancy with identical elements can be investigated using a birth and death process (Appendix A7.5.5). Figure 6.13 gives the corresponding diagram of transition probabilities in $(t, t+\delta t]$. From Fig. 6.13 and Table 6.2, the following system of differential equations can be established for the state probabilities $P_j(t) = \Pr\{\text{in state } Z_j \text{ at } t\}$ of a k-out-of-n warm redundancy with one repair crew and no further failures at system down (constant failure rates $\lambda$ & $\lambda_r$ and repair rate $\mu$)

Figure 6.13 Diagram of transition probabilities in $(t, t+\delta t]$ for a repairable k-out-of-n warm redundancy ($n$ identical elements, constant failure & repair rates, no further failures at system down, one repair crew, ideal failure recognition & switch, arbitrary $t$, $\delta t \downarrow 0$, birth and death process, $Z_0$ to $Z_{n-k}$ up states)

with $\nu_j = k\lambda + (n-k-j)\,\lambda_r$, $j = 0, \ldots, n-k$. (6.138)

For the investigation of more general situations (arbitrary load sharing, more than one repair crew, or other cases in which failure and/or repair rates change at a state transition) one can use the birth and death process introduced in Appendix A7.5.5. The solution of the system (6.137) with the initial conditions at $t = 0$, $P_i(0) = 1$ and $P_j(0) = 0$ for $j \ne i$, yields the point availability (see Table 6.2 for definitions)

$$PA_{Si}(t) = \sum_{j=0}^{n-k} P_{ij}(t), \tag{6.139}$$

with $P_{ij}(t) \equiv P_j(t)$ from Eq. (6.137) with $P_i(0) = 1$. In many practical applications, only the asymptotic & steady-state value of the point availability $PA_S$ is required. This can be obtained by setting $\dot P_j(t) = 0$ and $P_j(t) = P_j$ ($j = 0, \ldots, n-k+1$) in Eq. (6.137). The solution is (Appendix A7.5.5)

$$PA_S = \sum_{j=0}^{n-k} P_j = 1 - P_{n-k+1}, \quad \text{with } P_j = \frac{\pi_j}{\sum_{i=0}^{n-k+1} \pi_i}, \quad \pi_i = \frac{\nu_0 \cdots \nu_{i-1}}{\mu^i}, \quad \pi_0 = 1. \tag{6.140}$$
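The product form of the steady-state solution translates directly into code. The following minimal sketch assumes the birth rates $\nu_j = k\lambda + (n-k-j)\lambda_r$ of Eq. (6.138) and a single repair rate $\mu$ in all states $Z_1, \ldots, Z_{n-k+1}$ (one repair crew); function names and rate values are illustrative.

```python
def nu(j, n, k, lam, lam_r):
    """Failure (birth) rate out of state Z_j: k operating and n-k-j reserve elements."""
    return k * lam + (n - k - j) * lam_r


def availability(n, k, lam, lam_r, mu):
    """PA_S = 1 - P_{n-k+1} with P_j proportional to pi_j = nu_0 ... nu_{j-1} / mu^j."""
    pi = [1.0]                                   # pi_0 = 1
    for i in range(1, n - k + 2):                # pi_1 ... pi_{n-k+1}
        pi.append(pi[-1] * nu(i - 1, n, k, lam, lam_r) / mu)
    return 1.0 - pi[-1] / sum(pi)


# Assumed rates in 1/h; active redundancy corresponds to lam_r = lam.
for n, k in ((2, 1), (3, 2), (3, 1), (5, 3)):
    print(n, k, availability(n, k, 0.01, 0.01, 0.5))
```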

$PA_S$ is also the asymptotic & steady-state value of the average availability $AA_S$. As shown in Example A7.11 (Eq. (A7.157)), for $2\nu_j < \mu$ it holds that

From this, the following bounds for $PA_S$ can be used in many practical applications (assuming $2\nu_j < \mu$, $j = 0, \ldots, n-k$) to obtain an approximate expression for $PA_S$


The reliability function follows from Table 6.2 and Fig. 6.13

with $\nu_i$ as in Eq. (6.138). Similar results hold for the mean time to failure

The solution of Eqs. (6.142) and (6.143) shows that $R_{Si}(t)$ and $MTTF_{Si}$ depend on $n-k$ only. This leads for $n-k = 1$ to

and for $n-k = 2$ to

This property holds for the point availability $PA_S$ as well, see Table 6.8 for results. Because of the constant failure rate, the interval reliability follows directly from

$$IR_{Si}(t, t+\theta) = \sum_{j=0}^{n-k} P_{ij}(t)\,R_{Sj}(\theta), \qquad i = 0, \ldots, n-k, \tag{6.146}$$

with $P_{ij}(t)$ as in Eq. (6.139) and $R_{Sj}(\theta)$ from Eq. (6.142) with $t = \theta$. The asymptotic & steady-state value is then given by

$$IR_S(\theta) = \sum_{j=0}^{n-k} P_j\,R_{Sj}(\theta), \tag{6.147}$$

with $P_j$ from Eq. (6.140). Table 6.8 summarizes the main results for a k-out-of-n warm redundancy with constant failure and repair rates.
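For the mean time to failure of the birth and death process, a compact alternative to solving the algebraic system is the standard first-passage recursion: with $T_j$ the mean time to move from $Z_j$ to $Z_{j+1}$, $T_0 = 1/\nu_0$ and $T_j = (1 + \mu\,T_{j-1})/\nu_j$, so that $MTTF_{S0} = T_0 + \ldots + T_{n-k}$. This recursion is not taken from the text; it is a swapped-in technique used here as a sketch, with hypothetical function names and assumed rate values.

```python
def mttf_s0(n, k, lam, lam_r, mu):
    """Mean time from Z_0 to the down state Z_{n-k+1} via first-passage times T_j."""
    total = 0.0
    t_prev = 0.0
    for j in range(n - k + 1):
        nu_j = k * lam + (n - k - j) * lam_r   # failure rate out of Z_j
        repair = mu if j > 0 else 0.0          # no repair is running in Z_0
        t_j = (1.0 + repair * t_prev) / nu_j   # mean time to go from Z_j to Z_{j+1}
        total += t_j
        t_prev = t_j
    return total


lam, mu = 0.01, 0.5  # assumed rates in 1/h
print(mttf_s0(2, 1, lam, lam, mu))   # 1-out-of-2 active
print(mttf_s0(3, 2, lam, lam, mu))   # 2-out-of-3 active
print(mttf_s0(2, 1, lam, 0.0, mu))   # 1-out-of-2 standby
```

For $n-k = 1$ the recursion gives $(\nu_0 + \nu_1 + \mu)/(\nu_0\nu_1)$, which agrees with Eq. (6.92) for the 1-out-of-2 redundancy.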

Assuming, only for comparative investigations with the results of Table 6.8 and Section 6.7, $n$ repair crews (one for each element), the following approximate expressions can be found for active redundancy (totally independent elements) [6.27, 6.43]

$$MTTF_{S0} \approx \frac{1}{k\lambda\binom{n}{k}}\left(\frac{\mu}{\lambda}\right)^{n-k}, \qquad n \text{ repair crews, active red., } \lambda/\mu \ll 1 \tag{6.148}$$

and for standby redundancy [6.42]

$$MTTF_{S0} \approx \frac{(n-k)!\,\mu^{n-k}}{(k\lambda)^{n-k+1}}, \qquad PA_S \approx 1 - \frac{(k\lambda/\mu)^{n-k+1}}{(n-k+1)!} = 1 - \frac{1}{(n-k+1)\,\mu\,MTTF_{S0}}, \qquad n \text{ repair crews, standby red., } \lambda/\mu \ll 1. \tag{6.149}$$

According to Eq. (A7.189), $PA_S$ in Eqs. (6.148) and (6.149) can be expressed as

$$PA_S \approx 1 - \frac{MTTR_S}{MTTF_S} \quad \text{with } MTTR_S \approx \frac{1}{(n-k+1)\,\mu} \text{ and } MTTF_S \approx MTTF_{S0}.$$

6.5.2 k-out-of-n Active Redundancy with Identical Elements, Constant Failure Rate, and Arbitrary Repair Rate

Generalization of the repair rate (by conserving constant failure rates ($\lambda$, $\lambda_r$), only one repair crew, and no further failure at system down) leads to stochastic processes with $n-k+1$ regeneration and $n-k$ nonregeneration states ($\{Z_0, Z_1\}$ and $\{Z_2\}$ in Fig. A7.11 for $n-k = 1$, $\{Z_0, Z_1, Z_{2'}\}$ and $\{Z_2, Z_3\}$ in Fig. A7.12 for $n-k = 2$). As an example let us consider a 2-out-of-3 active redundancy with 3 identical elements, failure rate $\lambda$ and repair time distributed according to $G(x)$ with $G(0) = 0$ and density $g(x)$. Because of the assumption of no further failure at system down, results of Section 6.4.2 for the 1-out-of-2 warm redundancy can be used for $n-k = 1$ by setting $k\lambda$ instead of $\lambda$ (see Table 6.8 as well as Eqs. (A7.183) & (A7.184) for $n-k = 1$ and Eqs. (A7.185) & (A7.186) for $n-k = 2$). For the 2-out-of-3 active redundancy one has to set $2\lambda$ instead of $\lambda$ and $\lambda$ instead of $\lambda_r$ in Eqs. (6.107) & (6.109) to obtain Eqs. (6.152) & (6.155). However, in order to show the utility of considering time schedules, an alternative derivation is given below.


Table 6.8 Mean time to failure $MTTF_{S0}$, steady-state point & average availability $PA_S = AA_S$, and interval reliability $IR_S(\theta)$ for a repairable k-out-of-n warm redundancy with $n$ identical elements (one repair crew, constant failure and repair rates $\lambda$, $\lambda_r$, $\mu$ ($\lambda_r < \lambda$ in reserve state, $\lambda_r \equiv 0$ for standby), no further failures at system down, ideal failure recognition & switch, Markov process)

Rows: $n-k = 1$ (general case; $n = 2$, $k = 1$; $n = 3$, $k = 2$); $n-k = 2$ (general case; $n = 3$, $k = 1$; $n = 5$, $k = 3$); $n-k$ arbitrary
Columns: Mean time to failure ($MTTF_{S0}$); Asymptotic & steady-state point and average availability ($PA_S = AA_S$); Interval reliability ($IR_S(\theta)$)*
Entry for $n-k = 2$ (general case): $PA_S = \dfrac{\nu_0\nu_1\mu + \nu_0\mu^2 + \mu^3}{\nu_0\nu_1\nu_2 + \nu_0\nu_1\mu + \nu_0\mu^2 + \mu^3} \approx 1 - \dfrac{\nu_0\nu_1\nu_2}{\mu^3}$, $IR_S(\theta) \approx R_{S0}(\theta)$

$\nu_i = k\lambda + (n-k-i)\,\lambda_r$, $i = 0, \ldots, n-k$; $\lambda$, $\lambda_r$ = failure rates ($\lambda_r = \lambda$: active redundancy, $\nu_0 \cdots \nu_{n-k} = \lambda^{n-k+1}\,n!/(k-1)!$; $\lambda_r \equiv 0$: standby redundancy, $\nu_0 \cdots \nu_{n-k} = (k\lambda)^{n-k+1}$); $\mu$ = repair rate (only one repair crew); $R_{S0}(\theta)$ from Eq. (6.142); * see [6.5 (1985)] for exact solutions

Using Fig. 6.14a, the following integral equation can be established for the reliability function $R_{S0}(t)$, see Table 6.2 for definitions,


The Laplace transform of $R_{S0}(t)$ follows as

and the mean time to failure as

$$MTTF_{S0} = \frac{5 - 3\,\tilde g(2\lambda)}{6\lambda\,(1 - \tilde g(2\lambda))}. \tag{6.152}$$
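The expression $MTTF_{S0} = (5 - 3\tilde g(2\lambda))/(6\lambda(1 - \tilde g(2\lambda)))$ only needs the Laplace transform of the repair-time density at $s = 2\lambda$. The sketch below evaluates it for two repair-time models with the same mean: exponential ($\tilde g(s) = \mu/(s+\mu)$) and deterministic of length $T$ ($\tilde g(s) = e^{-sT}$). Both models, the rate values, and the function names are assumptions made for illustration.

```python
import math


def mttf_2oo3(lam, g_tilde):
    """MTTF_S0 of the 2-out-of-3 active redundancy from g~(2 lam)."""
    g = g_tilde(2.0 * lam)
    return (5.0 - 3.0 * g) / (6.0 * lam * (1.0 - g))


def exponential_repair(mu):
    """Laplace transform of the density mu e^{-mu t}."""
    return lambda s: mu / (s + mu)


def deterministic_repair(duration):
    """Laplace transform of a repair time that always equals `duration`."""
    return lambda s: math.exp(-s * duration)


lam, mttr = 0.01, 2.0  # assumed failure rate (1/h) and mean repair time (h)
mttf_exp = mttf_2oo3(lam, exponential_repair(1.0 / mttr))
mttf_det = mttf_2oo3(lam, deterministic_repair(mttr))
print(mttf_exp, mttf_det)
```

With $\lambda\,MTTR = 0.02$ the two results differ by about 2 %, in line with the remark of Section 6.4.2 that the shape of the repair-time distribution has a small influence as long as $MTTR$ is unchanged.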

For the point availability, Fig. 6.14b yields

$$\begin{aligned}
PA_{S0}(t) &= e^{-3\lambda t} + \int_0^t 3\lambda\,e^{-3\lambda x}\,PA_{S1}(t-x)\,dx \\
PA_{S1}(t) &= e^{-2\lambda t}\,(1 - G(t)) + \int_0^t g(x)\,e^{-2\lambda x}\,PA_{S0}(t-x)\,dx + \int_0^t g(x)\,(1 - e^{-2\lambda x})\,PA_{S1}(t-x)\,dx
\end{aligned} \tag{6.153}$$

from which,

Asymptotic & steady-state value of the point and average availability follows from

by considering $1 - \tilde g(s) = s \cdot MTTR + o(s)$ for $s \to 0$, as per Eq. (6.54), see Eq. (6.113) for the approximation of $\tilde g(2\lambda)$. For the asymptotic & steady-state value of the interval reliability, Eq. (6.112) can be used in most applications. Generalization of failure and repair rates leads to nonregenerative stochastic processes.

Figure 6.14 Possible time schedule for a repairable 2-out-of-3 active redundancy (constant failure rate, arbitrary repair rate, one repair crew, no further failures at system down, repair times exaggerated, renewal points marked): a) Calculation of $R_{S0}(t)$; b) Calculation of $PA_{S0}(t)$


6.6 Simple Series - Parallel Structures

A series-parallel structure is an arbitrary combination of series and parallel models, see Table 2.1 for some examples. Such a structure is generally investigated on a case-by-case basis using the methods of Sections 6.3 to 6.5. If the time behavior can be described by a Markov or semi-Markov process, Table 6.2 can be used to establish equations for the reliability function, point availability, and interval reliability.

As a first example, let us consider a repairable 1-out-of-2 active redundancy with elements $E_1 = E_2 = E$ in series with a switching element $E_\nu$. The failure rates $\lambda$ and $\lambda_\nu$ as well as the repair rates $\mu$ and $\mu_\nu$ are constant. The system has only one repair crew, repair priority on $E_\nu$ (a repair on $E_1$ or $E_2$ is stopped as soon as a failure of $E_\nu$ occurs, see Example 6.11 for the case of no priority), and no further failures at system down (failures during a repair at system level are neglected). Figure 6.15 gives the reliability block diagram and the diagram of transition probabilities in $(t, t+\delta t]$. The reliability function can be calculated using Table 6.2, or directly by considering that for a series structure the reliability at system level is still the product of the reliability of the elements

$$R_{S0}(t) = e^{-\lambda_\nu t}\,R_{S0,\,1\text{-out-of-}2}(t). \tag{6.156}$$

Because of the term $e^{-\lambda_\nu t}$, the Laplace transform of $R_{S0}(t)$ follows directly from the Laplace transform of the reliability function for the 1-out-of-2 parallel redundancy $R_{S0,\,1\text{-out-of-}2}$, by replacing $s$ with $s + \lambda_\nu$ (Table A9.7)

The mean time fo failuve MTTFso follows from = ~ ~ ~ ( 0 )

The last part of Eq. (6.158) clearly shows the effect of the series element $E_\nu$. The asymptotic & steady-state value of the point and average availability $PA_S = AA_S$ is obtained as the solution of the following system of algebraic equations, see Fig. 6.15 and Table 6.2,


Figure 6.15 Reliability block diagram and diagram of transition probabilities in $(t, t+\delta t]$ for a repairable 1-out-of-2 active redundancy ($E_1 = E_2 = E$) with a switch element $E_\nu$ (two identical elements, constant failure and repair rates ($\lambda$, $\lambda_\nu$, $\mu$, $\mu_\nu$), one repair crew, repair priority on $E_\nu$, no further failures at system down, ideal failure recognition, $t$ arbitrary, $\delta t \downarrow 0$, Markov process, $Z_0$ and $Z_2$ are up states)

For the solution of the system given by Eq. (6.159), one (arbitrarily chosen) equation must be dropped and replaced by $P_0 + P_1 + P_2 + P_3 + P_4 = 1$. The solution yields $P_0$ through $P_4$, from which

As for the mean time to failure (Eq. (6.158)), the last part of Eq. (6.160) shows the influence of the series element $E_\nu$. For the asymptotic & steady-state value of the interval reliability one obtains (Table 6.2)

Example 6.11 Give the reliability function and the asymptotic & steady-state value of the point and average availability for a 1-out-of-2 active redundancy in series with a switching element, as in Fig. 6.15, but without repair priority on the switching element.

Figure 6.16 Reliability block diagram and state transition diagram for a 2-out-of-3 majority redundancy (2-out-of-3 active, $E_1 = E_2 = E_3 = E$, in series with $E_\nu$; constant failure rates $\lambda$ for $E$ and $\lambda_\nu$ for $E_\nu$, repair time distributed according to $G(x)$ with density $g(x)$, one repair crew, no repair priority, no further failures at system down; $Z_0$, $Z_1$, and $Z_4$ constitute an embedded semi-Markov process, $Z_0$ and $Z_1$ are up states)

Solution

The diagram of transition probabilities in $(t, t+\delta t]$ of Fig. 6.15 can be used by replacing the transition from state $Z_3$ to state $Z_2$ with one from $Z_3$ to $Z_1$ and changing its rate from $\mu_\nu$ to $\mu$. The reliability function is still given by Eq. (6.156), since states $Z_1$, $Z_3$, and $Z_4$ are absorbing states for reliability calculations. For the asymptotic & steady-state value of the point and average availability $PA_S = AA_S$, Eq. (6.159) is modified to

and the solution leads to

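$$PA_S = AA_S = \frac{1}{1 + \dfrac{\lambda_\nu}{\mu_\nu} + \dfrac{2\lambda(\lambda+\lambda_\nu)/\mu^2}{1 + (2\lambda+\lambda_\nu)/\mu}} \approx 1 - \frac{\lambda_\nu}{\mu_\nu} - \frac{2\lambda(\lambda+\lambda_\nu)}{\mu^2}. \qquad (6.162)$$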
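The steady-state availability with and without repair priority on the switching element can be cross-checked numerically by solving the balance equations directly. The following is only a sketch under stated assumptions: the state numbering and transition rates are read off the verbal description of Fig. 6.15 ($Z_1$: $E_\nu$ failed, $Z_2$: one redundant element failed, $Z_3$: $E_\nu$ failed while one redundant element is down, $Z_4$: both redundant elements failed), the function names are hypothetical, and the closed forms used for comparison were re-derived from these same balance equations.

```python
def steady_state(rates, n):
    """Stationary probabilities of an n-state Markov chain given as {(i, j): rate}."""
    # Row j holds the balance equation of state j: inflow - outflow = 0.
    a = [[0.0] * (n + 1) for _ in range(n)]
    for (i, j), r in rates.items():
        a[j][i] += r   # inflow into j from i
        a[i][i] -= r   # outflow from i
    # Drop one (arbitrary) equation and replace it by the normalization sum(P) = 1.
    a[n - 1] = [1.0] * n + [1.0]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col and a[r][col] != 0.0:
                f = a[r][col]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][n] for i in range(n)]


def chain(lam, lam_v, mu, mu_v, priority):
    """Transition rates (assumed reading of Fig. 6.15); Z0 and Z2 are up states."""
    rates = {(0, 1): lam_v, (0, 2): 2 * lam, (1, 0): mu_v, (2, 0): mu,
             (2, 3): lam_v, (2, 4): lam, (4, 2): mu}
    if priority:
        rates[(3, 2)] = mu_v   # E_v is repaired first
    else:
        rates[(3, 1)] = mu     # running repair on E is completed first
    return rates


def availability(lam, lam_v, mu, mu_v, priority):
    p = steady_state(chain(lam, lam_v, mu, mu_v, priority), 5)
    return p[0] + p[2]


def closed_form(lam, lam_v, mu, mu_v, priority):
    """Closed forms re-derived from the balance equations above."""
    if priority:
        tail = 2 * (lam / mu) ** 2 / (1 + 2 * lam / mu)
    else:
        tail = 2 * lam * (lam + lam_v) / mu ** 2 / (1 + (2 * lam + lam_v) / mu)
    return 1 / (1 + lam_v / mu_v + tail)


if __name__ == "__main__":
    for prio in (True, False):
        print(prio, availability(1e-3, 1e-4, 0.5, 1.0, prio))
```

For $\lambda, \lambda_\nu \ll \mu, \mu_\nu$ the two strategies differ only in the second-order term, which is why repair priority has a small effect on availability in this parameter range.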

As a second example let us consider a 2-out-of-3 majority redundancy (2-out-of-3 active redundancy in series with a voter $E_\nu$) with arbitrary repair rate. Assumptions (6.1) - (6.7) also hold here, in particular Assumption (6.2), i.e. no further failures at system down. The system has constant failure rates, $\lambda$ for the three redundant elements and $\lambda_\nu$ for the series element $E_\nu$, and repair time distributed according to $G(x)$ with $G(0) = 0$ and density $g(x)$. Figure 6.16 shows the corresponding reliability block diagram and the state transition diagram. $Z_0$ and $Z_1$ are up states. $Z_0$, $Z_1$, and $Z_4$ are regeneration states and constitute a semi-Markov process embedded in the original process. This property will be used for the investigations. From Fig. 6.16 and Table 6.2 there follows for the semi-Markov transition probabilities $Q_{01}(x)$, $Q_{10}(x)$, $Q_{04}(x)$, $Q_{40}(x)$, $Q_{121}(x)$, and $Q_{134}(x)$ the expressions

$Q_{121}(x)$ is used to calculate the point availability. It accounts for the process returning from state $Z_2$ to state $Z_1$ and for the fact that $Z_2$ is not a regeneration state (probability for the transition $Z_1 \to Z_2 \to Z_1$, see also Fig. A7.11a); similarly for $Q_{134}(x)$. $Q'_{12}(x)$ and $Q'_{13}(x)$ as given in Fig. 6.16 are not semi-Markov transition probabilities ($Z_2$ and $Z_3$ are not regeneration states). However,

yields an equivalent $Q_1(x) = Q_{10}(x) + Q'_{12}(x) + Q'_{13}(x)$ useful for the calculation of the reliability function. Considering that $Z_0$ and $Z_1$ are up states and at the same time regeneration states, as well as the above expressions, the following system of integral equations can be established for the reliability functions $R_{S0}(t)$ and $R_{S1}(t)$

The solution of Eq. (6.164) yields

and

$\tilde R_{S0}(s)$ and $MTTF_{S0}$ could have been obtained as for Eq. (6.157) by replacing $s$ with $s + \lambda_\nu$ in Eq. (6.151).

For the point availability, calculation of the transition probabilities $P_{ij}(t)$ with Table 6.2 and Eq. (6.163) leads to

and

From Eqs. (6.167) and (6.168), the point availability follows as $PA_{S0}(t) = P_{00}(t) + P_{01}(t)$, and from this (using the Laplace transform) the asymptotic & steady-state value

(6.169) with MTTR as per Eq. (6.110).

For the asymptotic & steady-state value of the interval reliability, the following approximate expression can be used for practical applications (Eq. (6.111))

In Eq. (6.170) it holds that $P_0 = \lim_{t \to \infty} P_{00}(t)$, with $P_{00}(t)$ from Eqs. (6.167). For $\tilde g(2\lambda + \lambda_\nu) \approx 1$, $IR_S(\theta) \approx R_{S0}(\theta)$ can be used.

Example 6.12

(i) Give, using Eqs. (6.166) and (6.169), the mean time to failure $MTTF_{S0}$ and the asymptotic & steady-state point and average availability $PA_S = AA_S$ for the case of a constant repair rate $\mu$. (ii) Compare, for the case of constant repair rate, the true value of the interval reliability $IR_S(\theta)$ with the approximate expression given by Eq. (6.170).

Solution

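(i) With $G(x) = 1 - e^{-\mu x}$ it follows that $\tilde g(2\lambda + \lambda_\nu) = \mu/(2\lambda + \lambda_\nu + \mu)$ and thus from Eq. (6.166)

$$MTTF_{S0} = \frac{5\lambda + \lambda_\nu + \mu}{(3\lambda + \lambda_\nu)(2\lambda + \lambda_\nu) + \mu\lambda_\nu} = \frac{1}{\lambda_\nu + 6\lambda^2/(5\lambda + \lambda_\nu + \mu)} \approx \frac{1}{\lambda_\nu + 6\lambda^2/\mu}, \qquad (6.171)$$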
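The closed form for $MTTF_{S0}$ with constant repair rate can be checked against the two equations for the mean times to failure from the up states $Z_0$ and $Z_1$. The sketch below assumes the transitions described for Fig. 6.16 with a constant repair rate $\mu$ ($Z_0 \to Z_1$ with $3\lambda$, $Z_1 \to Z_0$ with $\mu$, and system failure with $\lambda_\nu$ from $Z_0$ and $2\lambda + \lambda_\nu$ from $Z_1$); the function names are hypothetical.

```python
def mttf_closed_form(lam, lam_v, mu):
    """MTTF_S0 of the 2-out-of-3 majority redundancy, constant repair rate."""
    return (5 * lam + lam_v + mu) / ((3 * lam + lam_v) * (2 * lam + lam_v) + mu * lam_v)


def mttf_from_states(lam, lam_v, mu):
    """Solve M0 = 1/r0 + (3*lam/r0)*M1 and M1 = 1/r1 + (mu/r1)*M0 by substitution."""
    r0 = 3 * lam + lam_v          # total rate out of Z0
    r1 = 2 * lam + lam_v + mu     # total rate out of Z1
    return (1 / r0 + 3 * lam / (r0 * r1)) / (1 - 3 * lam * mu / (r0 * r1))


def mttf_approx(lam, lam_v, mu):
    """Approximation valid for lam << mu."""
    return 1 / (lam_v + 6 * lam ** 2 / mu)


if __name__ == "__main__":
    args = (1e-4, 1e-6, 0.1)      # rates in 1/h, illustrative values only
    print(mttf_closed_form(*args), mttf_from_states(*args), mttf_approx(*args))
```

With these illustrative rates the approximation differs from the exact value by well under 1%, and the voter term $\lambda_\nu$ already dominates the denominator.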

and from Eq. (6.169)

P

(ii) With $P_{00}(t)$ and $P_{01}(t)$ from Eqs. (6.167) & (6.168) it follows for the asymptotic & steady-state value of the interval reliability (Table 6.2) that

The approximate expression according to Eq. (6.170) yields

i.e. the same value as per Eq. (6.173) for $3\lambda \ll \mu$ and considering $R_{S1}(\theta) \le R_{S0}(\theta)$.

To give a better feeling for the mutual influence of the different parameters involved, Figs. 6.17 and 6.18 compare the mean time to failure $MTTF_{S0}$ and the asymptotic & steady-state unavailability $1 - PA_S$ of some basic series-parallel structures. The equations are taken from Table 6.10, which summarizes the results of Sections 6.2 to 6.6 for constant failure and repair rates (approximate expressions are used to simplify calculation, see Section 6.7.2). Comparison with Figs. 2.8 and 2.9 (nonrepairable case) confirms that the most important gain is obtained by the first step (structure b), and shows that the influence of series elements is much greater in the repairable than in the nonrepairable case. Referring to the structures a), b), and c) of Figs. 6.17 and 6.18, the following design rule can be formulated:

The failure rate of the series element in a repairable 1-out-of-2 active redundancy should not be greater than 1% (0.2% for $\mu/\lambda_1 > 500$) of the failure rate of the redundant elements, i.e. with respect to Fig. 6.17

$\lambda_2 < 0.01\,\lambda_1$ in general, and $\lambda_2 < 0.002\,\lambda_1$ for $\mu/\lambda_1 > 500$. (6.174)

6.7 Approximate Expressions for Large Series - Parallel Structures

6.7.1 Introduction

Reliability and availability calculation of large series-parallel structures rapidly becomes time-consuming, even if a constant failure rate $\lambda_i$ and repair rate $\mu_i$ are assumed for each element $E_i$ of the reliability block diagram and only the mean time to failure $MTTF_{S0}$ or the steady-state availability $PA_S = AA_S$ is required. This is because of the large number of states involved, which for a reliability block diagram with $n$ elements can reach $1 + \sum_{i=1}^{n} \prod_{k=1}^{i} (n-k+1) = n! \sum_{i=0}^{n} 1/i! \approx e \cdot n!$, considering all possible repair strategies. For instance, the system of Fig. 6.19 with 4 elements would have more than 50 states if the assumption of no further failure at system down (6.2) were dropped. $2^n$ states hold for nonrepairable systems or for systems with totally independent elements (Point 1 below). Use of approximate expressions thus becomes important. Besides the assumption of one repair crew and no further failure at system down (Sections 6.2 - 6.6, partly 6.7 & 6.8), given below as Point 3, further assumptions yielding approximate expressions for system reliability and availability are possible, provided that $\lambda_i \ll \mu_i$ holds for each element $E_i$. Some examples:

1. Totally independent elements: If each element of the reliability block diagram operates independently from every other (active redundancy, independent elements, one repair crew for each element), series-parallel structures can be reduced to one-item structures, which are themselves successively integrated


Figure 6.17 Comparison between a one-item structure and a 1-out-of-2 active redundancy with a series element (repairable, one repair crew, repair priority on $E_2$, no further failure at system down, constant failure rates $\lambda_1$ & $\lambda_2$ and repair rate $\mu$, Markov process, $\lambda_1$ remains the same in both structures, equations according to Table 6.10; given on the right are $MTTF_{S0b}/MTTF_{S0a}$ and $(1 - PA_{Sb})/(1 - PA_{Sa})$ with $MTTF_{S0c}$ and $1 - PA_{Sc}$ from Fig. 6.18; see Fig. 2.8 for the nonrepairable case)

Figure 6.18 Comparison between basic series-parallel structures (repairable, one repair crew with repair priority on $E_3$, active redundancy, no further failure at system down, constant failure rates $\lambda_1$ to $\lambda_3$ and repair rate $\mu$, Markov process, $\lambda_1$ and $\lambda_2$ remain the same in both structures, equations according to Table 6.10; see Fig. 2.9 for the nonrepairable case)

into further series-parallel structures up to the system level. For each of the one-item structures obtained, the mean time to failure $MTTF_{S0}$ and the steady-state availability $PA_S$, calculated for the underlying series-parallel structure, are used to calculate an equivalent $MTTR_S$ from $PA_S = MTTF_S/(MTTF_S + MTTR_S)$ using $MTTF_S = MTTF_{S0}$. To simplify calculations, and considering the comments given to Eq. (6.93), a constant failure rate $\lambda_S = 1/MTTF_{S0}$ and a constant repair rate $\mu_S = 1/MTTR_S$ are assumed for each of the one-item structures obtained. Table 6.9 summarizes basic series-parallel structures based on totally independent elements (see Section 6.7.2 for an example).

2. Macro-structures: A macro-structure is a series, parallel, or simple series-parallel structure which is considered as a one-item structure for calculations at higher levels (integration into further macro-structures up to system level) [6.5 (1991)]. It satisfies Assumptions (6.1) - (6.7), in particular one repair crew for each macro-structure and no further failures during a repair at the macro-structure level. The procedure is similar to that of Point 1 above (see also the remark to Eq. (4.37) for the calculation of an equivalent $MTTR_S$). Table 6.10 summarizes basic macro-structures useful for practical applications, see Sections 6.2 to 6.6 for results and Section 6.7.2 for an example.

3. One repair crew and no further failures at system down: Assumptions (6.3) and (6.2), valid for all models investigated in Sections 6.3 - 6.6, apply in many practical applications. No further failures at system down means that failures during a repair at system level are neglected. This assumption has no influence on the reliability function at system level and its influence on the availability is limited (if $\lambda_i \ll \mu_i$ can be assumed for each element $E_i$).

4. Cutting states: Removing the states with more than $k$ failures from the diagram of transition probabilities in $(t, t+\delta t]$, or the state transition diagram, produces in general an important reduction of the state diagram. The choice of $k$ (often $k = 2$) is based on the required precision. An upper bound of the error for the asymptotic & steady-state value of the point and average availability $PA_S = AA_S$ (based on the mapping of states with $k$ failures at the system level in the state $Z_k$ of a birth and death process and using $P_k \ge \sum_{i=k+1}^{n} P_i$ (Eq. (A7.157)), valid for $2(\lambda_1 + \ldots + \lambda_n) < \min\{\mu_1, \ldots, \mu_n\}$) has been given in [2.50 (1992)].

5. Clustering of states: Grouping of elements in the reliability block diagram or of states in the diagram of transition probabilities in $(t, t+\delta t]$ produces in general an important reduction of the number of states in the state diagram.

Combination of the above methods is possible. In any case, series elements must be grouped before any analysis (see Section 6.3 and the second row of Table 6.10).
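The reduction to a one-item structure used in Points 1 and 2 amounts to a few arithmetic steps. The sketch below illustrates them for a series structure of totally independent elements; it assumes each element has steady-state availability $\mu_i/(\lambda_i + \mu_i)$, and the function names are hypothetical.

```python
from math import prod


def series_independent(elements):
    """elements: list of (lam_i, mu_i) of totally independent series elements.

    Returns (lam_S, PA_S): failure rates add, availabilities multiply.
    """
    lam_s = sum(lam for lam, _ in elements)
    pa_s = prod(mu / (lam + mu) for lam, mu in elements)
    return lam_s, pa_s


def equivalent_one_item(mttf_s0, pa_s):
    """Equivalent (lam_S, mu_S) from PA_S = MTTF_S / (MTTF_S + MTTR_S), MTTF_S = MTTF_S0."""
    lam_s = 1.0 / mttf_s0
    mttr_s = mttf_s0 * (1.0 - pa_s) / pa_s
    return lam_s, 1.0 / mttr_s


if __name__ == "__main__":
    lam_s, pa_s = series_independent([(1e-3, 0.5), (2e-3, 1.0)])
    print(equivalent_one_item(1.0 / lam_s, pa_s))
```

For $\lambda_i \ll \mu_i$ the equivalent repair rate is close to $(\lambda_1 + \ldots + \lambda_n)/(\lambda_1/\mu_1 + \ldots + \lambda_n/\mu_n)$, here about 0.75 per hour.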

Considering that the steady-state probability for states with more than one failure decreases rapidly as the number of failures increases ($\approx \lambda/\mu$ for each failure, see e.g. pp. 230 and 258 and the corresponding Figs. 6.20 and 6.34), all methods given

Figure 6.19 Basic reliability block diagram for an uninterruptible electrical power supply (UPS)

above yield good approximate expressions for $MTTF_{S0}$ and $PA_S$ in practical applications. However, referring to the unavailability $1 - PA_S$, method 1 above can deliver lower values, for instance by a factor 2 with an order of magnitude $(\lambda/\mu)^2$ for a 1-out-of-2 active redundancy (compare Table 6.9 with Table 6.10). An analytical comparison of the above methods is difficult, in general. Numerical investigations show a close convergence of the results given by the different methods, as illustrated for instance in Section 6.7.2 (p. 230) for a practical example with extremely low values for $\mu/\lambda$ (down to 20).
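The factor 2 between the two unavailabilities of a 1-out-of-2 active redundancy can be reproduced with a few lines. The sketch assumes two identical elements with constant rates $\lambda$ and $\mu$; with one repair crew the states form a birth and death process with failure rates $2\lambda$, $\lambda$ and repair rate $\mu$ in each state, while with totally independent elements the element unavailabilities simply multiply. The function names are hypothetical.

```python
def unavailability_independent(lam, mu):
    """1-out-of-2 active, one repair crew per element (totally independent elements)."""
    return (lam / (lam + mu)) ** 2


def unavailability_one_crew(lam, mu):
    """1-out-of-2 active, one repair crew: P1 = (2*lam/mu)*P0, P2 = (lam/mu)*P1."""
    r = lam / mu
    return 2 * r ** 2 / (1 + 2 * r + 2 * r ** 2)


if __name__ == "__main__":
    lam, mu = 1e-3, 1.0
    ratio = unavailability_one_crew(lam, mu) / unavailability_independent(lam, mu)
    print(ratio)  # close to 2 for lam << mu
```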

6.7.2 Application to a Practical Example

To illustrate how methods 1 to 3 of Section 6.7.1 work, let us consider the system with a reliability block diagram as in Fig. 6.19, and assume system new at $t = 0$, active redundancy, constant failure rates $\lambda_1$ to $\lambda_3$, constant repair rates $\mu_1$ to $\mu_3$, and repair priority $E_1$, $E_3$, $E_2$ [6.5 (1988)]. Except for some series elements (to be considered separately in a final step), the reliability block diagram of Fig. 6.19 describes an uninterruptible power supply (UPS) used for instance to buffer electrical power network failures in computer systems ($E_1$ being the power network). +)

Although limited to 4 elements, the stochastic process describing the system of Fig. 6.19 would contain more than 50 states if the assumption of no further failure at system down were dropped. Assuming no further failure at system down, the state space is reduced to 12 states (Fig. 6.20). In the following, the mean time to failure ($MTTF_{S0}$) and the asymptotic & steady-state point and average availability ($PA_S = AA_S$) of the system given by Fig. 6.20 are investigated using method 1 (Table 6.9), method 2 (Table 6.10), and method 3 (Table 6.2) of Section 6.7.1. For a numerical comparison, results are given on p. 230 (also for method 4 and for the exact solution obtained by dropping the assumption of no further failure at system down), showing that all methods used deliver good approximate expressions.
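The growth of the state space can be made concrete with a short count. The sketch below assumes that, when all repair strategies are considered, a state is identified by the ordered sequence of failed elements (the order fixes the repair queue), so that the maximum number of states is $\sum_{i=0}^{n} n!/(n-i)!$; the function name is hypothetical.

```python
from math import e, factorial


def max_states(n):
    """Number of ordered sequences of 0..n failed elements out of n."""
    return sum(factorial(n) // factorial(n - i) for i in range(n + 1))


if __name__ == "__main__":
    for n in (2, 4, 6, 8):
        # Compare with e * n! and with the 2**n states of independent elements.
        print(n, max_states(n), round(e * factorial(n), 1), 2 ** n)
```

For $n = 4$ the count is 65, in line with the statement that more than 50 states would be needed, against 12 states under the assumption of no further failure at system down.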

+) A refinement to include the battery discharge has been investigated recently [6.45 (2002)].


Method 1 of Section 6.7.1 yields, using Table 6.9,

System

From Eqs. (6.175) - (6.177) it follows that

and

Method 2 of Section 6.7.1 yields, using Table 6.10,

System

Table 6.9 Basic structures for the investigation of large series-parallel systems by assuming totally independent elements (each element operates and is repaired independently of every other element, Point 1 p. 221), constant failure & repair rates ($\lambda$, $\mu$), active redundancy, one repair crew for each element, ideal failure recognition, Markov process (for rows 1 to 5 see Eqs. (6.48), (2.48) & (6.60), (2.48) & (6.99), (2.48) & (6.171) with $\lambda_\nu = 0$, and (2.48) & (6.148), respectively; $\lambda_S = 1/MTTF_{S0}$ and $\mu_S = 1/MTTR_S \approx \lambda_S/(1 - PA_S)$ are used to simplify the notation; approximations valid for $\lambda_i \ll \mu_i$; $PA_S = AA_S$ = asymptotic & steady-state point and average availability, often denoted by $A$)

1. One-item structure: $\lambda_S = \lambda$, $\mu_S = \mu$, $PA_S = \dfrac{1}{1 + \lambda_S/\mu_S} \approx 1 - \lambda_S/\mu_S$
2. Series structure: $PA_S = PA_1 \cdots PA_n \approx 1 - (\lambda_1/\mu_1 + \ldots + \lambda_n/\mu_n)$, $\lambda_S = \lambda_1 + \ldots + \lambda_n$, $\mu_S \approx \dfrac{\lambda_S}{1 - PA_S} \approx \dfrac{\lambda_1 + \ldots + \lambda_n}{\lambda_1/\mu_1 + \ldots + \lambda_n/\mu_n}$
3. 1-out-of-2 (active): $PA_S \approx 1 - \dfrac{\lambda_1 \lambda_2}{\mu_1 \mu_2}$, $MTTF_{S0} \approx \dfrac{\mu_1 \mu_2}{\lambda_1 \lambda_2 (\mu_1 + \mu_2)}$
4. 2-out-of-3 active ($E_1 = E_2 = E_3 = E$)
5. k-out-of-n active ($E_1 = \ldots = E_n = E$)


From Eqs. (6.180) and (6.181) it follows that

and

Method 3 of Section 6.7.1 yields, using Table 6.2 and Fig. 6.20, the following system of algebraic equations for the mean time to failure ($M_i = MTTF_{Si}$)

where

From Eqs. (6.184) and (6.185) it follows that

Table 6.10 Basic macro-structures for the investigation of large series-parallel systems by successive building of macro-structures bottom up to system level (Point 2 p. 222), constant failure & repair rates ($\lambda$, $\mu$), active redundancy, one repair crew for each macro-structure, no further failure at system down, ideal failure recognition, Markov process (for rows 1-6 see Eqs. (6.48), (6.65) & (6.60), (6.103) & (6.99), (6.160) & (6.158), (6.65), (6.60) & Tab. 6.8, and same as for row 5, respectively; $\lambda_S = 1/MTTF_{S0}$ and $\mu_S = 1/MTTR_S \approx \lambda_S/(1 - PA_S)$ are used to simplify the notation; approximations valid for $\lambda_i \ll \mu_i$)

Structures covered: one-item structure; series structure, with $\mu_S \approx \dfrac{\lambda_S}{1 - PA_S} \approx \dfrac{\lambda_1 + \ldots + \lambda_n}{\lambda_1/\mu_1 + \ldots + \lambda_n/\mu_n}$; 1-out-of-2 (active); 1-out-of-2 active ($E_1 = E_2 = E$) with series element $E_\nu$, repair priority on $E_\nu$; 2-out-of-3 active ($E_1 = E_2 = E_3 = E$) with series element $E_\nu$, repair priority on $E_\nu$; k-out-of-n active ($E_1 = \ldots = E_n = E$) with series element $E_\nu$, repair priority on $E_\nu$.


with

Similarly, for the asymptotic & steady-state value of the point and average availability $PA_S = AA_S$, the following system of algebraic equations can be obtained using Table 6.2 and Fig. 6.20

with $\rho_i$ as in Eq. (6.185). One (arbitrarily chosen) of the Eqs. (6.188) must be dropped and replaced by $P_0 + P_1 + \ldots + P_{11} = 1$. The solution yields $P_0$ to $P_{11}$, from which

with

and

Figure 6.20 Reliability block diagram and diagram of transition probabilities in $(t, t+\delta t]$ for the system described by Fig. 6.19 (active redundancy, one repair crew, repair priority in the sequence $E_1$, $E_3$, $E_2$, no further failures at system down, ideal failure recognition, $t$ arbitrary, $\delta t \downarrow 0$, Markov process, $\rho_i = \sum_j \rho_{ij}$)

An analytical comparison of Eqs. (6.186) with Eqs. (6.178) and (6.182) or of Eq. (6.189) with Eqs. (6.179) and (6.183) is difficult. Numerical evaluation yields ($\lambda$ and $\mu$ in h$^{-1}$, $MTTF$ in h)

$\lambda_1$, $\lambda_2$, $\lambda_3$

$\mu_1$, $\mu_2$, $\mu_3$

$MTTF_{S0}$ (Eq. (6.178), totally IE)

$MTTF_{S0}$ (Eq. (6.182), MS)

$MTTF_{S0}$ (Eq. (6.186), no FF)

$MTTF_{S0}$ (Method 4, Cutting)

$MTTF_{S0}$ (only one repair crew)

$1 - PA_S$ (Eq. (6.179), totally IE)

$1 - PA_S$ (Eq. (6.183), MS)

$1 - PA_S$ (Eq. (6.189), no FF)

$1 - PA_S$ (Method 4, Cutting)

$1 - PA_S$ (only one repair crew)

Also given in the above numerical comparison are the results obtained by method 4 of Section 6.7.1 (for a given precision of 1 0 - ~ on the unavailability $1 - PA_S$) and by dropping the assumption of no further failures at system down in method 3. These results confirm that for $\lambda_i \ll \mu_i$ good approximate expressions for practical applications can be obtained from all the methods presented in Section 6.7.1. The influence of $\lambda_i/\mu_i$ appears clearly when comparing columns 1 with 2 and 3 with 4. The results obtained with method 1 of Section 6.7.1 (Eqs. (6.178) and (6.179)) give higher values for $MTTF_{S0}$ and $PA_S$, because of the assumption that each element has its own repair crew (totally independent elements). Comparing the results from Eqs. (6.186) and (6.189) with those for the case in which the assumption of no further failures at system down is dropped shows (for this example) the small influence of this assumption on the final results.

For indicative purposes and to support the validity of approximate expressions, the following are the state probabilities for the numerical example according to the first column above, obtained by solving Eq. (6.188), i.e. with the assumption of one repair crew and no further failure at system down (Fig. 6.20):

6.8 Systems with Complex Structure


Structures and models investigated in the previous sections of this chapter were based on the existence of a reliability block diagram and on some simplifying assumptions ((6.1) - (6.7)); in particular, elements with only two states (good/failed) and ideal fault coverage & switching. This was, so far, useful to understand basic investigation methods and tools, see e.g. Figs. 6.9, 6.10 & A7.6, Example 6.11, Section 6.7.2, and Table 6.2. However, in practical applications more complex situations can arise. This section uses tools developed in Appendix A7 (summarized in Table 6.2 for Markov & semi-Markov processes) to investigate complex fault tolerant repairable systems for cases in which a reliability block diagram does not exist or cannot easily be found. Constant failure and, in general, also constant repair rates are assumed. On the basis of practical examples it is shown that, working with the diagram of transition probabilities or a time schedule, problems occurring in practical applications can be solved on a case-by-case basis. To improve readability, the diagram of transition probabilities in $(t, t+\delta t]$ will be replaced in this section by the diagram of transition rates, which considers the transition rates $\rho_{ij}$ only, by omitting $\delta t$ and $1 - \rho_i \delta t$. Of course, each new system can provide a starting point for a better model, and a large number of papers is known on this subject too. After some general considerations (Section 6.8.1), Section 6.8.2 deals with aspects of preventive maintenance. Sections 6.8.3 & 6.8.4 consider imperfect switching and incomplete coverage. Elements with more than two states or one failure mode are discussed in Section 6.8.5. Section 6.8.6 investigates fault tolerant reconfigurable systems by considering that reconfiguration can occur because of mission profile (phased-mission systems) or failure. For this last case, reward and frequency/duration aspects are involved in the analysis as well. Section 6.8.8 summarizes the procedure for modeling systems with complex structure. Alternative investigation methods (Petri nets, dynamic FTA, computer-aided analysis) are introduced in Section 6.9, and a Monte Carlo procedure, useful for rare events, is given. As a general rule, modeling complex systems is a task which must be solved in close cooperation between project and reliability engineers.

6.8.1 General Considerations

In the context of this book, a structure is complex when the reliability block diagram either does not exist or cannot be reduced to a series-parallel structure with independent elements (p. 52). If the reliability block diagram exists, but not as a series-parallel structure, reliability and availability analysis can be performed using one or more of the following assumptions (as in previous sections, failure-free time is used as a synonym for failure-free operating time, repair as a synonym for restoration):

1. For each element in the reliability block diagram, failure-free times and repair times are statistically independent.
2. Failure and repair rates of each element are constant (time independent).
3. Each element in the reliability block diagram has a constant failure rate.
4. The flow of failures is a Poisson process (homogeneous or nonhomogeneous).
5. No further failures can occur at system down.
6. Redundant elements are repaired on-line (no interruptions at system level).
7. After each repair, the repaired element is as-good-as-new.
8. After each repair, the entire system is as-good-as-new.
9. Only one repair crew is available, repair is started as soon as the repair crew is free (first-in first-out) or according to a given repair priority.
10. Totally independent elements, i.e. each element operates and is repaired independently of every other element ($n$ repair crews for $n$ elements).
11. Ideal failure recognition (in particular no hidden failures or false alarms).
12. All failure-free times and repair times are > 0, continuous, and have a finite mean and variance.
13. For each element, the mean time to repair is much lower than the mean time to failure ($MTTR_i \ll MTTF_i$).
14. Switches and switching operations are 100% reliable and have no influence on the reliability of the system.
15. Preventive maintenance is not considered.

A clear definition of the assumptions stated is important to fix the validity of the results obtained. It is often tacitly assumed that each element has only 2 states (good/failed), one failure mode (e.g. shorts or opens), and a time invariant required function (e.g. continuous operation). Elements with more than two states or one failure mode are discussed in Section 6.8.5 (see also Section 2.3.6 for the nonrepairable case). A time dependent operation and/or required function can be investigated when constant failure rate is assumed (Section 6.8.6.2).

The following is a brief discussion of the above assumptions. With assumptions 1 and 2, the time behavior of the system can be described by a time-homogeneous Markov process with finitely many states. Equations can be established using the diagram of transition probabilities in $(t, t+\delta t]$ and Table 6.2. Difficulties can arise because of the large number of states involved. In such cases, a first possibility is to limit investigation to the calculation of the mean time to failure $MTTF_{S0}$ and the asymptotic & steady-state value of the point and average availability $PA_S = AA_S$, i.e. to the solution of algebraic equations. A second possibility is to use approximate expressions (Section 6.7) or special software tools (Section 6.9.3). Assumption 4 often applies to systems with a large number of elements. As shown in Sections 6.3 - 6.6, assumption 5 simplifies calculation of the point availability and interval reliability. It has no influence on the reliability function, in particular on $MTTF_{S0}$, and can be used for approximate expressions when assumption 13 applies

(see Section 6.7.2 for an example). Assumption 6 must be met during the system design. If not satisfied, improvements given by redundancy are questionable (see Example 6.16 and Fig. 6.26) and at least fault recognition should be required and implemented. Assumptions 7 and 8 are satisfied if either assumption 2 or 3 holds. Assumption 7 is frequently used; its validity must be checked. Assumption 8 is rarely used (only with assumptions 2 or 3). Assumption 9 simplifies calculation and is useful for deriving approximate expressions, especially if assumption 13 holds. Together with assumption 3, the behavior of the system can be described by a semi-regenerative process (process with an embedded semi-Markov process). Assumption 3 alone can assure that the process is regenerative. With assumption 10, point availability can be computed using the reliability equation for the nonrepairable case (Eqs. (2.47) & (2.48)). This assumption rarely applies in practical applications. However, it allows a simple calculation of an upper bound for the point availability. Assumption 13 is generally met. It leads to approximate expressions, as illustrated in Section 6.7 or by using asymptotic expansions, see e.g. [6.19, A7.26]. As shown in Examples 6.7 - 6.9, the shape of the distribution function of the repair time has small influence on the results at system level ($PA_S$, $IR_S(\theta)$), if assumption 13 holds. Assumptions 14 and 15 simplify investigations. They are valid for all models discussed in Sections 6.2 - 6.7.
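The use of assumption 10 can be illustrated for a k-out-of-n active redundancy: the steady-state availability of each element is inserted into the reliability formula of the nonrepairable structure. The sketch assumes identical, totally independent elements with constant rates, so that each element has availability $\mu/(\lambda + \mu)$, and it uses the binomial k-out-of-n formula as a stand-in for Eqs. (2.47) & (2.48); the function names are hypothetical.

```python
from math import comb


def element_availability(lam, mu):
    """Steady-state availability of a single repairable element with its own repair crew."""
    return mu / (lam + mu)


def k_out_of_n(p, k, n):
    """Probability that at least k of n independent elements are up, each with probability p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    pa = element_availability(1e-3, 0.5)
    # Upper bound for the availability of a 2-out-of-3 active redundancy.
    print(k_out_of_n(pa, 2, 3))
```

Because each element is given its own repair crew, the value obtained is optimistic compared with a model with a single repair crew, which is why it serves as an upper bound.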

Investigation of large series -parallel structures or of complex structures is in general time-consuming and can become mathematically intractable. As a first step it is useful to operate with Markov models. Refinements can then be considered on a case-by-case basis.

If the reliability block diagram does not exist, stochastic processes and tools introduced in Appendix A7 can be used to investigate reliability and availability of fault tolerant systems, on the basis of the diagram of transition rates or a time schedule; see Sections 6.8.3 - 6.8.7 for some examples on systems with imperfect switching, incomplete coverage, more than two states or one failure mode, reconfigurable structure, and common cause failures. A general procedure for the investigation of complex fault tolerant systems is given in Section 6.8.8. Alternative investigation methods (Petri nets, dynamic FTA, computer-aided analysis) are introduced in Section 6.9, and a Monte Carlo procedure, useful for rare events, is given.

6.8.2 Preventive Maintenance

Preventive maintenance is necessary to avoid wearout failures and to identify and repair latent or hidden failures, i.e. failures of redundant elements which cannot be recognized during normal operation. This section investigates a one-item repairable structure with preventive maintenance at $T_{PM}$, $2T_{PM}$, ... . The results are basic for the investigation of more complex structures and will be useful in the


Figure 6.21 Reliability functions $R(t)$ and $R_{PM}(t)$ for a one-item structure with preventive maintenance (of negligible duration) at times $T_{PM}$, $2T_{PM}$, ... for two distribution functions $F(t) = 1 - R(t)$ of the failure-free time (item new at $t = 0$, $T_{PM}$, $2T_{PM}$, ...; left: increasing failure rate, right: constant failure rate with $R_{PM}(t) = R(t) = e^{-\lambda t}$)

following sections to investigate some aspects of fault tolerant repairable systems (Section 6.8.6.2). Further models / strategies for preventive maintenance or maintenance optimization are possible (Section 4.6).

The item considered is new at t = 0. Its failure-free time is distributed according to F(x) with density f(x), the repair time has distribution G(x) with density g(x). Preventive maintenance is of negligible time duration (e.g. specialized personnel is available) and restores the item to as-good-as-new. If a preventive maintenance is due at a time in which the item is under repair, one of the following cases will apply:

1. Preventive maintenance will not be performed (as included in the running repair, considering that after each repair the item is as-good-as-new).

2. Preventive maintenance is performed, i.e. a running repair is terminated with the preventive maintenance in a negligible time span (this maintenance strategy is also known as age replacement policy, see also Section 4.6).

Both situations can occur in practical applications. In case 2, the times 0, $T_{PM}$, $2T_{PM}$, ... are renewal points. Case 2 will be considered in the following.

The reliability function $R_{PM}(t)$ can be calculated from

with $R(x) = 1 - F(x)$, where $F(x)$ is the distribution function of the failure-free time of the one-item structure considered. Figure 6.21 shows the shape of $R(t)$ and $R_{PM}(t)$ for an arbitrary $F(x)$, and for $F(x) = 1 - e^{-\lambda x}$. Because of the memoryless property which characterizes the exponential distribution function,

$R_{PM}(t) = R(t) = e^{-\lambda t}$ holds for $F(x) = 1 - e^{-\lambda x}$, (6.193)

independently of $T_{PM}$. From Eq. (6.192), the mean time to failure with preventive maintenance $MTTF_{PM}$ follows as

Figure 6.22 Point availability for a one-item structure with repair at every failure and preventive maintenance (of negligible duration) at times $T_{PM}$, $2T_{PM}$, ... (item new at $t = 0$, $T_{PM}$, $2T_{PM}$, ... and after each repair)

For $F(x) = 1 - e^{-\lambda x}$, Eq. (6.194) yields $MTTF_{PM} = 1/\lambda$ independently of $T_{PM}$. Determination of optimal preventive maintenance periods must consider Eq. (6.194) as well as cost and logistic support aspects (for $f(0) = 0$, $MTTF_{PM} \to \infty$ for $T_{PM} \to 0$).

Example 6.13 shows a further practical application of preventive maintenance.

Calculation of the point availability is easy if preventive maintenance is performed at $T_{PM}$, $2T_{PM}$, ... (case 2 above) and leads to

with $PA_{S0}(t)$ from Eq. (6.17). Figure 6.22 shows $PA_{PM}(t)$. Contrary to $R_{PM}(t)$, $PA_{PM}(t)$ goes to 1 at 0 (item is new), $T_{PM}$, $2T_{PM}$, ..., i.e. at each renewal point.

If the time duration for the preventive maintenance is not negligible, it is useful to define, in addition to the availability introduced in Section 6.2.1, the overall (or operational) availability OA, defined for $t \to \infty$ as the ratio of the total up time in $(0, t]$ to the sum of total up and down time in $(0, t]$, i.e. to $t$. Defining MTTF = mean time to failure and MDT = mean down time (with MTTR = mean time to repair (restore), MTTPM = mean time to carry out preventive maintenance, MLD = mean logistic delay, and $T_{PM}$ = preventive maintenance period) it follows that (see p. 122)

$$OA = \frac{MTTF}{MTTF + MDT} = \frac{MTTF}{MTTF + MTTR + MLD + MTTPM\,(MTTF/T_{PM})}. \qquad (6.196)$$

For MLD = 0, the overall availability is often called technical availability. Other availability measures are possible, e.g. as in [6.11] for railway applications.
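As a quick numerical reading of Eq. (6.196), the following Python sketch evaluates the overall availability. The function name and all numerical values are assumptions chosen only for illustration; they are not data from the text.

```python
def overall_availability(mttf, mttr, mld, mttpm, t_pm):
    """Overall availability as in Eq. (6.196); all times in the same unit."""
    # Mean down time per failure cycle: repair, logistic delay, and the share
    # of preventive maintenance (on average MTTF / T_PM actions per cycle).
    mdt = mttr + mld + mttpm * (mttf / t_pm)
    return mttf / (mttf + mdt)


# Hypothetical figures in hours (illustration only).
oa = overall_availability(mttf=5000.0, mttr=4.0, mld=20.0, mttpm=2.0, t_pm=1000.0)
technical = overall_availability(mttf=5000.0, mttr=4.0, mld=0.0, mttpm=2.0, t_pm=1000.0)
print(f"OA = {oa:.5f}, technical availability (MLD = 0) = {technical:.5f}")
```

Setting `mld=0.0` gives the technical availability mentioned above, which is never smaller than OA for the same remaining parameters.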

Example 6.13

Assume a nonrepairable (up to system failure) 1-out-of-2 active redundancy with two identical elements with constant failure rate $\lambda$. Give the mean time to failure $MTTF_{PM}$ by assuming a preventive maintenance with period $T_{PM} \ll 1/\lambda$. The preventive maintenance is performed in a negligible time span and restores the 1-out-of-2 active redundancy as-good-as-new.

Solution

For a nonrepairable (up to system failure) 1-out-of-2 active redundancy with two identical elements with constant failure rate $\lambda$, the reliability function is given by Eq. (2.22)

$$R(t) = 2e^{-\lambda t} - e^{-2\lambda t}.$$

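The mean time to failure with preventive maintenance follows from Eq. (6.194) as

$$MTTF_{PM} = \frac{\int_0^{T_{PM}} \left(2e^{-\lambda x} - e^{-2\lambda x}\right) dx}{1 - 2e^{-\lambda T_{PM}} + e^{-2\lambda T_{PM}}}.$$

Using $e^{-x} \approx 1 - x + x^2/2$ it follows that

$$MTTF_{PM} \approx \frac{T_{PM}}{\lambda^2 T_{PM}^2} = \frac{1}{\lambda^2 T_{PM}} \quad (= MTBF \cdot MTBF / T_{PM} \ \text{for } MTBF \equiv 1/\lambda). \qquad (6.197)$$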

Without preventive maintenance, Eq. (2.22) yields $MTTF = 3/(2\lambda)$. Equation (6.197) clearly shows the gain given by the preventive maintenance.
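The gain can be made tangible numerically. The sketch below is a minimal illustration, assuming that renewal at every $T_{PM}$ gives $MTTF_{PM} = \int_0^{T_{PM}} R(x)\,dx \,/\, (1 - R(T_{PM}))$; it integrates with a trapezoidal rule and compares the result with the approximation $1/(\lambda^2 T_{PM})$ and with $3/(2\lambda)$. The helper `mttf_pm` and the values of `lam` and `t_pm` are assumptions made for this example.

```python
import math


def mttf_pm(reliability, t_pm, steps=20000):
    """MTTF with renewal at every t_pm: integral of R over (0, t_pm) / (1 - R(t_pm))."""
    h = t_pm / steps
    # composite trapezoidal rule on (0, t_pm)
    area = 0.5 * (reliability(0.0) + reliability(t_pm))
    area += sum(reliability(i * h) for i in range(1, steps))
    area *= h
    return area / (1.0 - reliability(t_pm))


lam = 1e-3     # assumed failure rate per element, in 1/h
t_pm = 100.0   # assumed maintenance period in h, so that lam * t_pm = 0.1


def r_active(t):
    """Reliability function of the nonrepairable 1-out-of-2 active redundancy."""
    return 2.0 * math.exp(-lam * t) - math.exp(-2.0 * lam * t)


exact = mttf_pm(r_active, t_pm)
approx = 1.0 / (lam ** 2 * t_pm)      # approximation for lam * t_pm << 1
without_pm = 3.0 / (2.0 * lam)        # no preventive maintenance
print(f"exact = {exact:.1f} h, approx = {approx:.1f} h, without PM = {without_pm:.1f} h")
```

Passing `lambda t: math.exp(-lam * t)` to the same helper returns $1/\lambda$ for any period, in line with the memoryless property.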

6.8.3 Imperfect Switching

In practical applications, switching is necessary for powering down failed elements and powering up repaired elements. In some cases it is sufficient to locate the switching element in series with the redundancy on the reliability block diagram, yielding series-parallel structures as investigated in Section 6.6. However, such an approach is often too simple to cover real situations. This section shows this on the basis of practical examples. Further considerations are given in Section 6.8.4 dealing with incomplete coverage. As a first example, Fig. 6.23 shows a situation in which measurement points $M_1$ and $M_2$, switches $S_1$ and $S_2$, as well as a control unit $C$ must be considered. To simplify, let us consider only the reliability function in the nonrepairable case (up to system failure). From a reliability point of view, switch $S_i$, element $E_i$, and measurement point $M_i$ in Fig. 6.23 are in series ($i = 1, 2$). Let $\tau_{b1}$ and $\tau_{b2}$ be the corresponding failure-free times with distribution function $F_b(x)$ and density $f_b(x)$. $\tau_c$ is the failure-free time of the control device with distribution function $F_c(x)$ and density $f_c(x)$.

Figure 6.23 Functional block diagram for a 1-out-of-2 redundancy with switches $S_1$ and $S_2$, measurement points $M_1$ and $M_2$, and control device $C$

Consider first the case of standby redundancy and assume that at $t = 0$ element $E_1$ is switched on. A system failure in the interval $(0, t]$ occurs with one of the following mutually exclusive events

$$\{\tau_c > \tau_{b1} \ \cap\ (\tau_{b1} + \tau_{b2}) \le t\} \quad \text{or} \quad \{\tau_c < \tau_{b1} \le t\}.$$

It is implicitly assumed here that a failure of the control device has no influence on the operating element, and does not lead to a commutation to $E_1$. A verification of these conditions by a FMEA (Section 2.6) is necessary. With these assumptions, the reliability function $R_{S0}(t)$ of the system described by Fig. 6.23 is given by (nonrepairable case, system new at $t = 0$)

$$R_{S0}(t) = 1 - \left[\int_0^t f_b(x)\,(1 - F_c(x))\,F_b(t - x)\,dx + \int_0^t f_b(x)\,F_c(x)\,dx\right]. \qquad (6.198)$$

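Assuming further $f_b(x) = \lambda_b e^{-\lambda_b x}$ and $f_c(x) = \lambda_c e^{-\lambda_c x}$, Eq. (6.198) yields

$$R_{S0}(t) = e^{-\lambda_b t} + \frac{\lambda_b}{\lambda_c}\, e^{-\lambda_b t}\left(1 - e^{-\lambda_c t}\right) \qquad (6.199)$$

and

$$MTTF_{S0} = \frac{2\lambda_b + \lambda_c}{\lambda_b(\lambda_b + \lambda_c)}. \qquad (6.200)$$

$\lambda_c \equiv 0$ leads to the results of Section 2.3.5 for the 1-out-of-2 standby redundancy. Assuming now an active redundancy (at $t = 0$, $E_1$ is put into operation and $E_2$ into the reserve state), a system failure occurs in the interval $(0, t]$ with one of the following mutually exclusive events

$$\{\tau_{b1} \le t \ \cap\ \tau_c > \tau_{b1} \ \cap\ \tau_{b2} \le t\} \quad \text{or} \quad \{\tau_c < \tau_{b1} \le t\}.$$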

The reliability function is then given by (nonrepairable case, system new at $t = 0$)

$$R_{S0}(t) = 1 - \left[F_b(t)\int_0^t f_b(x)\,(1 - F_c(x))\,dx + \int_0^t f_b(x)\,F_c(x)\,dx\right]. \qquad (6.201)$$
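The two event descriptions (standby and active) can be checked by simulation. The following Python sketch assumes exponentially distributed failure-free times and draws the system lifetime directly from the events: if the control device fails before $E_1$, the system is lost when $E_1$ fails. The closed forms used for comparison, $1/\lambda_b + 1/(\lambda_b + \lambda_c)$ for standby and $1/\lambda_b + 1/(2\lambda_b + \lambda_c)$ for active redundancy, follow from the memoryless property and are derived for this sketch rather than quoted from the text; all names and numerical values are illustrative assumptions.

```python
import random


def simulate_mttf(lam_b, lam_c, standby, runs=50_000, seed=1):
    """Monte Carlo estimate of the mean time to system failure."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        tb1 = rng.expovariate(lam_b)   # failure-free time of branch 1
        tb2 = rng.expovariate(lam_b)   # failure-free time of branch 2
        tc = rng.expovariate(lam_c)    # failure-free time of the control device
        if tc < tb1:
            life = tb1                 # no switching possible: system fails with E1
        elif standby:
            life = tb1 + tb2           # E2 starts when E1 fails
        else:
            life = max(tb1, tb2)       # both elements run from t = 0
        total += life
    return total / runs


def mttf_standby(lam_b, lam_c):
    return 1.0 / lam_b + 1.0 / (lam_b + lam_c)


def mttf_active(lam_b, lam_c):
    return 1.0 / lam_b + 1.0 / (2.0 * lam_b + lam_c)


lam_b, lam_c = 1.0, 0.5   # assumed rates, arbitrary time unit
print("standby:", simulate_mttf(lam_b, lam_c, standby=True), mttf_standby(lam_b, lam_c))
print("active: ", simulate_mttf(lam_b, lam_c, standby=False), mttf_active(lam_b, lam_c))
```

For $\lambda_c \to 0$ the two closed forms reduce to $2/\lambda_b$ and $3/(2\lambda_b)$, and for $\lambda_c \gg \lambda_b$ both tend to $1/\lambda_b$.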

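From Eq. (6.201) and assuming $f_b(x) = \lambda_b e^{-\lambda_b x}$ and $f_c(x) = \lambda_c e^{-\lambda_c x}$ it follows that

$$R_{S0}(t) = \frac{2\lambda_b + \lambda_c}{\lambda_b + \lambda_c}\, e^{-\lambda_b t} - \frac{\lambda_b}{\lambda_b + \lambda_c}\, e^{-(2\lambda_b + \lambda_c)t} \qquad (6.202)$$

and

$$MTTF_{S0} = \frac{2\lambda_b + \lambda_c}{\lambda_b(\lambda_b + \lambda_c)} - \frac{\lambda_b}{(\lambda_b + \lambda_c)(2\lambda_b + \lambda_c)}. \qquad (6.203)$$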

$\lambda_c \equiv 0$ leads to the results of Section 2.2.6.3 for the 1-out-of-2 active redundancy. From Eqs. (6.200) and (6.203) one recognizes that for $\lambda_c \gg \lambda_b$

$$MTTF_{S0} \approx 1/\lambda_b, \quad \text{for } \lambda_c \gg \lambda_b, \qquad (6.204)$$

for both standby and active redundancy, i.e. the situation is the same as with no redundancy.

As a second example consider a 1-out-of-2 warm redundancy with constant failure rates $\lambda$, $\lambda_r$ and repair rate $\mu$. The switching element can fail with constant failure rate $\lambda_\sigma$ and failure mode stuck at the state occupied just before failure. At first, let us consider the case in which the failure of the switch can be immediately recognized and repaired with constant repair rate $\mu_\sigma$. Furthermore, assume only one repair crew, repair priority on the switch, and no further failure at system down. Asked are the mean time to system failure $MTTF_{S0}$ for system new (state $Z_0$) at $t = 0$ and the asymptotic & steady-state (stationary) point and average availability $PA_S = AA_S$. The involved process is a time-homogeneous Markov process. Figure 6.24 gives the diagrams of transition rates for reliability and availability calculation, respectively (down states hatched). From Fig. 6.24a and Table 6.2 or Eq. (A7.126) it follows that $MTTF_{S0}$ is given as solution of the following system ($M_i = MTTF_{Si}$)

yielding

$$MTTF_{S0} \approx \frac{\mu}{\lambda(\lambda + \lambda_r + \lambda_\sigma)}. \qquad (6.206)$$

The approximation assumes $\mu_\sigma = \mu$ and $\lambda, \lambda_r, \lambda_\sigma \ll \mu$. From this approximate expression it follows that the effect of imperfect switching with failure mode stuck at the state occupied just before failure, immediately recognized and repaired, is minor and becomes negligible (yielding results for ideal switch as in Tab. 6.6) for

$$\lambda_\sigma \ll \lambda + \lambda_r, \quad (\text{for } \mu = \mu_\sigma \gg \lambda, \lambda_r, \lambda_\sigma). \qquad (6.207)$$

The case $\lambda_\sigma = 0$ implies $\mu_\sigma = 0$ and must be investigated using the exact expression for $MTTF_{S0}$, yielding $MTTF_{S0} = (\mu + 2\lambda + \lambda_r) / (\lambda(\lambda + \lambda_r))$ as per Table 6.6.

Figure 6.24 Diagram of transition rates for a repairable 1-out-of-2 warm redundancy (const. failure & repair rates $\lambda$, $\lambda_r$, $\mu$) with imperfect switching (failure rate $\lambda_\sigma$, repair rate $\mu_\sigma$, failure mode stuck at the state occupied), failure of the switch immediately recognized and repaired with repair priority, one repair crew; system down in $Z_2$, $Z_2'$, $Z_2''$; no further failure at system down; Markov process

From Fig. 6.24b and Table 6.2 or Eq. (A7.127) it follows that $PA_S = AA_S$ is given as solution of the following system of algebraic equations

One of the Eqs. (6.208), arbitrarily chosen, must be replaced by $\sum P_i = 1$. The asymptotic & steady-state point and average availability follows from

The approximation assumes $\mu_\sigma = \mu$ and $\lambda, \lambda_r, \lambda_\sigma \ll \mu$. Equation (6.209) allows the same conclusion as for Eq. (6.206). $\lambda_\sigma = 0$ implies $\mu_\sigma = 0$ and yields results for ideal switch (Table 6.6).

Further models for imperfect switching are conceivable. For instance, one can assume that for the model of Fig. 6.24 failure of the switch (with stuck at the state occupied just before failure and failure rate $\lambda_\sigma$) can only be recognized and repaired at system down together with failed elements (one or both) at a repair rate $\mu_g$. This situation occurs e.g. in power systems (refuse to start). Figure 6.25 gives the corresponding diagrams of transition rates for reliability and availability calculation, respectively (down state hatched). Results are given in Example 6.14. A further possibility is to assume no connection as failure mode (Fig. 6.31) or a constant probability $c$ that the switch will perform correctly when called to operate (Figs. 6.27, 6.28).

a) For reliability b) For availability

$\rho_{00'} = \lambda_\sigma$; $\rho_{01} = \lambda + \lambda_r$; $\rho_{10} = \mu$; $\rho_{12} = \rho_{0'2} = \lambda$; $\rho_0 = \lambda + \lambda_r + \lambda_\sigma$; $\rho_{0'} = \lambda$; $\rho_1 = \lambda + \mu$; $\rho_2 = \mu_g$

Figure 6.25 Diagram of transition rates for a repairable 1-out-of-2 warm redundancy (const. failure & repair rates $\lambda$, $\lambda_r$, $\mu$) with imperfect switching (failure rate $\lambda_\sigma$, failure mode stuck at the state occupied), failure of the switch recognized and repaired only at system down ($Z_2$) together with failed elements at a global repair rate $\mu_g$; no further failure at system down; Markov process

Example 6.14

Compute the mean time to system failure $MTTF_{S0}$ for system new (in $Z_0$) at $t = 0$ and the steady-state point and average availability $PA_S = AA_S$ of the 1-out-of-2 warm redundancy as per Fig. 6.25.

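Solution

From Fig. 6.25a and Table 6.2, $MTTF_{S0}$ is given as solution of (with $M_i = MTTF_{Si}$)

$$\rho_0 M_0 = 1 + \lambda_\sigma M_{0'} + (\lambda + \lambda_r) M_1, \quad \rho_1 M_1 = 1 + \mu M_0, \quad \rho_{0'} M_{0'} = 1, \qquad (6.210)$$

yielding

$$MTTF_{S0} = \frac{(\lambda + \lambda_\sigma)(\lambda + \mu) + \lambda(\lambda + \lambda_r)}{\lambda\,[\lambda(\lambda + \lambda_r) + \lambda_\sigma(\lambda + \mu)]}. \qquad (6.211)$$

Because of the not recognized failure of the switch, the condition on $\lambda_\sigma$ to yield results valid for ideal switching (Table 6.6) is more severe than Eq. (6.207) and is given by (see also Eq. (6.240))

$$\lambda_\sigma \ll \lambda(\lambda + \lambda_r)/\mu. \qquad (6.212)$$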

From Fig. 6.25b and Table 6.2 or Eq. (A7.127) it follows that $PA_S = AA_S$ is given as solution of

$$\rho_0 P_0 = \mu P_1 + \mu_g P_2, \quad \rho_{0'} P_{0'} = \lambda_\sigma P_0, \quad \rho_1 P_1 = (\lambda + \lambda_r) P_0, \quad \rho_2 P_2 = \lambda P_1 + \lambda P_{0'}. \qquad (6.213)$$

One of the Eqs. (6.213), arbitrarily chosen, must be replaced by $P_0 + P_{0'} + P_1 + P_2 = 1$. The asymptotic & steady-state point and average availability follows from

Equation (6.214) allows the same conclusion for $\lambda_\sigma$ as for Eq. (6.211). If Eq. (6.212) is not satisfied, i.e. for $\mu\lambda_\sigma \gg \lambda(\lambda + \lambda_r)$, Eq. (6.211) yields $MTTF_{S0} \approx 1/\lambda_\sigma + 1/\lambda$ (nonrepairable 1-out-of-2 standby redundancy with $\lambda_\sigma$ and $\lambda$) and Eq. (6.214) $PA_S = AA_S \approx 1 - \lambda/\mu_g$ (one-item).
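The limiting cases of Example 6.14 can be reproduced numerically. The Python sketch below solves the two linear systems by substitution, assuming the transition rates of Fig. 6.25 as $\rho_0 = \lambda + \lambda_r + \lambda_\sigma$, $\rho_{0'} = \lambda$, $\rho_1 = \lambda + \mu$, $\rho_2 = \mu_g$; the function name and the numerical values are assumptions for illustration.

```python
def example_6_14(lam, lam_r, lam_s, mu, mu_g):
    """Return (MTTF_S0, PA_S) for the model of Fig. 6.25 (lam_s = switch failure rate)."""
    rho0 = lam + lam_r + lam_s
    rho1 = lam + mu

    # Reliability: rho0' M0' = 1 and rho1 M1 = 1 + mu M0, inserted into
    # rho0 M0 = 1 + lam_s M0' + (lam + lam_r) M1 and solved for M0.
    m0_prime = 1.0 / lam
    m0 = (1.0 + lam_s * m0_prime + (lam + lam_r) / rho1) / (
        rho0 - (lam + lam_r) * mu / rho1
    )

    # Availability: state probabilities relative to P0, then normalised.
    p0_prime = lam_s / lam
    p1 = (lam + lam_r) / rho1
    p2 = (lam * p1 + lam * p0_prime) / mu_g
    pa = (1.0 + p0_prime + p1) / (1.0 + p0_prime + p1 + p2)
    return m0, pa


lam, lam_r, mu = 1e-3, 5e-4, 0.5   # assumed rates in 1/h
for lam_s in (0.0, 1e-6, 1e-2):
    mttf, pa = example_6_14(lam, lam_r, lam_s, mu, mu_g=mu)
    print(f"lam_s = {lam_s:g}: MTTF_S0 = {mttf:.1f} h, PA_S = {pa:.6f}")
```

With `lam_s = 0` the result equals the ideal-switch value $(\mu + 2\lambda + \lambda_r)/(\lambda(\lambda + \lambda_r))$, while for $\mu\lambda_\sigma \gg \lambda(\lambda + \lambda_r)$ it approaches $1/\lambda_\sigma + 1/\lambda$.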


6.8.4 Incomplete Coverage

Incomplete fault (failure) coverage occurs because of lack or failure in the diagnosis. Fault coverage is defined as the proportion of faults of an item that can be recognized under given conditions. A fault coverage greater than 0.9 is often required for complex equipment and systems. Lacks in the diagnosis lead to hidden or latent failures, i.e. failures (in the main system) which are not covered by diagnosis and can be recognized only during a repair or a preventive maintenance. Hidden or latent failures can cause serious reduction of the advantage offered by a redundancy (see for instance Eq. (6.223)). Failure modes of a diagnosis have to be investigated on a case-by-case basis. However, from a logical point of view, two basic failure modes are (Fig. 6.29)

false alarm,

no alarm emitted (alarm defection).

Incomplete coverage acts on the switching operation and is often investigated as part of imperfect switching. Following an illustrative example, this section discusses some basic possibilities to consider incomplete coverage.

Consider a 1-out-of-2 active redundancy with two different elements $E_1$ and $E_2$, and assume that failures of $E_1$ can be recognized only during the repair of $E_2$ or at a preventive maintenance (hidden failures in $E_1$). Elements $E_1$ and $E_2$ have constant failure rates $\lambda_1$ and $\lambda_2$, the repair time of $E_2$ is distributed according to an arbitrary function $G(x)$ with $G(0) = 0$ and density $g(x)$, and the repair of $E_1$ takes a negligible time (see Example 6.16 for constant repair rate).

If no preventive maintenance is performed, Fig. 6.26a shows a possible time schedule of the system (new at $t = 0$), yielding for the reliability function

The Laplace transform of Rso(t) follows as

and the mean time to failure becomes

If preventive maintenance is performed at times $T_{PM}, 2T_{PM}, \ldots$ independently of the state of element $E_2$, and after each preventive maintenance (assumed of negligible duration as in Section 6.8.2) the entire system is as-good-as-new, then the times $0, T_{PM}, 2T_{PM}, \ldots$ are renewal points for the system. For the reliability function $R_{S0_{PM}}(t)$ it follows that (considering Eq. (6.192) and Fig. 6.26b)

$$R_{S0_{PM}}(t) = R_{S0}^{\,n}(T_{PM})\, R_{S0}(t - nT_{PM}) \ \text{for } nT_{PM} \le t < (n+1)T_{PM},\ n = 0, 1, 2, \ldots, \qquad (6.218)$$

with $R_{S0}(t)$ as per Eq. (6.215). The mean time to failure $MTTF_{S0_{PM}}$ follows as

$$MTTF_{S0_{PM}} = \frac{\int_0^{T_{PM}} R_{S0}(x)\,dx}{1 - R_{S0}(T_{PM})}. \qquad (6.219)$$

a) Without preventive maintenance b) With preventive maintenance (period $T_{PM}$); ○ renewal point

Figure 6.26 Possible time schedules for reliability calculation of a repairable 1-out-of-2 active redundancy with hidden failures in element $E_1$ (new at $t = 0$, repair times greatly exaggerated)

The time $T_{PM}$ between two consecutive preventive maintenance operations can be optimized considering Eq. (6.217) or Eq. (6.219) as well as cost and logistic aspects ($MTTF_{S0_{PM}} \to \infty$ for $T_{PM} \to 0$). Examples 6.15 and 6.16 show practical applications of the above incomplete coverage model.

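Example 6.15

Give approximate expressions for the mean time to failure $MTTF_{S0}$ given by Eq. (6.217).

Solution

For $\tilde g(\lambda_1) \to 1$, it follows from Eq. (6.217) that

$$MTTF_{S0} \approx \frac{1}{\lambda_1} + \frac{1}{\lambda_2}. \qquad (6.220)$$

A better approximation is obtained by considering $\tilde g(\lambda_1) \approx 1 - \lambda_1\, MTTR$

$$MTTF_{S0} \approx \frac{\lambda_1 + \lambda_2 + \lambda_2^2\, MTTR}{\lambda_1 \lambda_2\, (1 + \lambda_2\, MTTR)}, \qquad (6.221)$$

with MTTR as mean time to repair element $E_2$ (Eq. (6.110)).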
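The approximations of Example 6.15 can be compared numerically. The sketch below is only an illustration under the model stated above (hidden failures in $E_1$, repair of $E_2$ with density $g(x)$). The closed form in `mttf_hidden` is an assumption of this sketch, obtained from a regeneration argument at state $Z_0$ with $\tilde g(\lambda_1)$ the Laplace transform of $g(x)$ at $s = \lambda_1$; it is not quoted from the text. All numerical values are hypothetical.

```python
def mttf_hidden(lam1, lam2, g_tilde):
    """MTTF of the 1-out-of-2 active redundancy with hidden failures in E1.

    g_tilde: Laplace transform of the repair-time density of E2 at s = lam1,
    i.e. the probability that E1 survives one repair of E2.
    """
    q = 1.0 - g_tilde
    return (lam1 * (lam1 + lam2) + lam2 ** 2 * q) / (lam1 * lam2 * (lam1 + lam2 * q))


lam1, lam2, mttr = 1e-3, 2e-3, 5.0      # assumed values (1/h, 1/h, h)

rough = 1.0 / lam1 + 1.0 / lam2         # g_tilde -> 1
better = (lam1 + lam2 + lam2 ** 2 * mttr) / (lam1 * lam2 * (1.0 + lam2 * mttr))
first_order = mttf_hidden(lam1, lam2, 1.0 - lam1 * mttr)

mu = 1.0 / mttr                         # constant repair rate: g_tilde = mu / (lam1 + mu)
constant_rate = mttf_hidden(lam1, lam2, mu / (lam1 + mu))

print(rough, better, first_order, constant_rate)
```

Inserting $\tilde g(\lambda_1) = 1 - \lambda_1\, MTTR$ reproduces the better approximation exactly, and all values stay close to $1/\lambda_1 + 1/\lambda_2$ as long as $\lambda_2\, MTTR \ll 1$.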

Example 6.16

Investigate $R_{S0}(t)$ per Eq. (6.216) and $R_{S0_{PM}}(t)$ per Eq. (6.218) as well as $MTTF_{S0}$ per Eq. (6.217) and $MTTF_{S0_{PM}}$ per Eq. (6.219) for the case of constant repair rate $\mu$ ($g(x) = \mu e^{-\mu x}$).

Solution

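With $\tilde g(s + \lambda_1) = \mu/(s + \lambda_1 + \mu)$ it follows from Eq. (6.216) that

$$\tilde R_{S0}(s) = \frac{(s + \lambda_1 + \lambda_2)(s + \lambda_1 + \mu) + \lambda_2 (s + \lambda_2)}{(s + \lambda_1)(s + \lambda_2)(s + \lambda_1 + \lambda_2 + \mu)}. \qquad (6.222)$$

The mean time to failure $MTTF_{S0}$ follows from Eq. (6.222)

$$MTTF_{S0} = \tilde R_{S0}(0) = \frac{(\lambda_1 + \lambda_2)(\lambda_1 + \mu) + \lambda_2^2}{\lambda_1 \lambda_2 (\lambda_1 + \lambda_2 + \mu)}. \qquad (6.223)$$

One recognizes that $\lambda_1 + \lambda_2 \ll \mu$ leads directly to

$$MTTF_{S0} \approx \frac{1}{\lambda_1} + \frac{1}{\lambda_2}. \qquad (6.224)$$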

Equations (6.223) and (6.224) show that the repairable 1-out-of-2 active redundancy with hidden failures in one element behaves like a nonrepairable 1-out-of-2 standby redundancy. This result shows how important it is, in the presence of redundancy, to investigate failure recognition and failure modes.

In the case of periodic preventive maintenance (period $T_{PM}$), Eq. (6.219) yields

The last part of Eq. (6.225) has been obtained using $e^{-\lambda x} \approx 1 - \lambda x + (\lambda x)^2 / 2$. The optimization of the time $T_{PM}$ between two consecutive preventive maintenance operations must consider Eq. (6.225), cost, and logistic aspects. Equation (6.222) follows also from Table 6.2 (Markov process with up states $Z_0$, $Z_1$ & $Z_2$, absorbing state $Z_3$, transition rates $\rho_{01} = \lambda_1$, $\rho_{02} = \lambda_2$, $\rho_{13} = \lambda_2$, $\rho_{20} = \mu$, $\rho_{23} = \lambda_1$, and $P_0(0) = 1$).

Figure 6.27 State transition diagram for availability calculation of a repairable 1-out-of-2 active redundancy (const. failure & repair rates $\lambda$, $\mu$), incomplete coverage (switch to reserve element with probability $c$), one repair crew; semi-Markov process; $Q_{21}(x) \equiv 0$ for reliability; see also Fig. 6.28

A further possibility to consider incomplete coverage is to assume that a failure will be recognized (only) with a probability $c$. This case is similar to that of imperfect switching mentioned at the end of Section 6.8.3 and is known in the literature [6.45 (2001)]. Figure 6.27 gives the state transition diagram of the corresponding semi-Markov process for availability calculation (down state hatched, $Q_{21}(x) \equiv 0$ for reliability calculation). The transition from state $Z_{1'}$ occurs instantaneously to $Z_1$ with probability $p_{1'1} = c$ or to $Z_2$ with probability $p_{1'2} = 1 - c$.

Assuming constant failure and repair rates, the model of Fig. 6.27 can be investigated using a time-homogeneous Markov process with the diagram of transition rates given in Fig. 6.28 (also known in the literature of power systems as redundancy with no start at call [6.34]). Examples 6.17 and 6.18 investigate the models of Figs. 6.27 & 6.28, showing their equivalence.

Example 6.17

Give the mean time to system failure $MTTF_{S0}$ for system new (in $Z_0$) at $t = 0$ and the steady-state point and average availability $PA_S = AA_S$ of the 1-out-of-2 active redundancy as per Fig. 6.27.

Solution

From Fig. 6.27 and Table 6.2 or Eq. (A7.173), $MTTF_{S0}$ is given as solution of (with $M_i = MTTF_{Si}$)

with $T_i = \int_0^\infty (1 - Q_i(x))\,dx$ and $Q_i(x) = \sum_j Q_{ij}(x)$ (Eqs. (A7.166) and (A7.165)). Considering Fig. 6.27 it follows that $T_0 = 1/(2\lambda)$, $T_{1'} = 0$, $T_1 = 1/(\lambda + \mu)$, $T_2 = 1/\mu$, yielding

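From Fig. 6.27 and Table 6.2 or Eq. (A7.178), $PA_S = AA_S = P_0 + P_{1'} + P_1$ is given as

$$PA_S = AA_S = \frac{\mathcal{P}_0 T_0 + \mathcal{P}_{1'} T_{1'} + \mathcal{P}_1 T_1}{\mathcal{P}_0 T_0 + \mathcal{P}_{1'} T_{1'} + \mathcal{P}_1 T_1 + \mathcal{P}_2 T_2}. \qquad (6.228)$$

Thereby, $\mathcal{P}_j$ is the state probability of the embedded Markov chain, obtained as solution of

$$\mathcal{P}_0 = \mathcal{P}_1\,\frac{\mu}{\lambda + \mu}, \quad \mathcal{P}_{1'} = \mathcal{P}_0, \quad \mathcal{P}_1 = \mathcal{P}_{1'}\,c + \mathcal{P}_2, \quad \mathcal{P}_2 = \mathcal{P}_{1'}\,(1 - c) + \mathcal{P}_1\,\frac{\lambda}{\lambda + \mu} \qquad (6.229)$$

(Table 6.2 or Eq. (A7.175)), yielding (considering $\sum_j \mathcal{P}_j = 1$) $\mathcal{P}_0 = \mathcal{P}_{1'} = \mu / (2(\lambda + 2\mu) - \mu c)$, $\mathcal{P}_1 = (\lambda + \mu) / (2(\lambda + 2\mu) - \mu c)$, $\mathcal{P}_2 = (\lambda + \mu - \mu c) / (2(\lambda + 2\mu) - \mu c)$. From Eq. (6.228) it follows then

$$PA_S = AA_S = \frac{\mu^2 + 2\lambda\mu}{\mu^2 + 2\lambda\mu + 2\lambda(\lambda + \mu - \mu c)} \approx 1 - \frac{2\lambda(\lambda + \mu - \mu c)}{\mu^2 + 2\lambda\mu}. \qquad (6.230)$$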
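The semi-Markov calculation of Example 6.17 can be retraced step by step in Python. The sketch assumes the embedded chain of Fig. 6.27 as described above ($Z_0 \to Z_{1'}$ with probability 1, $Z_{1'} \to Z_1$ with probability $c$, $Z_1 \to Z_0$ with probability $\mu/(\lambda + \mu)$, $Z_2$ absorbing for reliability); function and variable names, and the numerical values, are illustrative assumptions.

```python
def example_6_17(lam, mu, c):
    """Return (MTTF_S0, PA_S) for the coverage model of Fig. 6.27."""
    # mean sojourn times T0, T1', T1, T2
    t0, t1p, t1, t2 = 1.0 / (2.0 * lam), 0.0, 1.0 / (lam + mu), 1.0 / mu

    # Reliability (Z2 absorbing): M0 = T0 + M1', M1' = T1' + c M1,
    # M1 = T1 + mu/(lam+mu) M0, solved for M0 by substitution.
    back = mu / (lam + mu)
    m0 = (t0 + t1p + c * t1) / (1.0 - c * back)

    # Availability: unnormalised stationary probabilities of the embedded chain
    # (the common denominator cancels in the ratio).
    p0 = p1p = mu
    p1 = lam + mu
    p2 = lam + mu - mu * c
    up = p0 * t0 + p1p * t1p + p1 * t1
    return m0, up / (up + p2 * t2)


lam, mu = 1e-3, 0.1   # assumed rates in 1/h
for c in (1.0, 0.9, 0.0):
    mttf, pa = example_6_17(lam, mu, c)
    print(f"c = {c}: MTTF_S0 = {mttf:.1f} h, PA_S = {pa:.6f}")
```

For $c = 1$ the sketch returns the ideal active-redundancy value $(3\lambda + \mu)/(2\lambda^2)$, and for $c = 0$ it returns $1/(2\lambda)$, i.e. a one-item with failure rate $2\lambda$.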


$\rho_{12} = \lambda$; $\rho_{21} = \mu$ ($\rho_{21} \equiv 0$ for reliability); $\rho_0 = 2\lambda$; $\rho_1 = \lambda + \mu$; $\rho_2 = \mu$ ($\rho_2 \equiv 0$ for reliability)

Figure 6.28 Diagram of transition rates for availability calculation of a repairable 1-out-of-2 active redundancy (const. failure & repair rates $\lambda$, $\mu$), incomplete coverage (switch to the reserve element with probability $c$), one repair crew; Markov process; $\rho_{21} \equiv 0$ for reliability; see also Fig. 6.27

Example 6.18

Give the mean time to system failure $MTTF_{S0}$ for system new (in $Z_0$) at $t = 0$ and the steady-state point and average availability $PA_S = AA_S$ of the 1-out-of-2 active redundancy as per Fig. 6.28.

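Solution

From Fig. 6.28 & Table 6.2 or Eq. (A7.126), $MTTF_{S0}$ is given as solution of (with $M_i = MTTF_{Si}$)

From Fig. 6.28 and Table 6.2 or Eq. (A7.127), $PA_S = AA_S = P_0 + P_1$ is given as solution of

One of the Eqs. (6.233), arbitrarily chosen, must be replaced by $P_0 + P_1 + P_2 = 1$; the solution yields $P_0 = \mu^2 / (\mu^2 + 2\lambda\mu + 2\lambda(\lambda + \mu - \mu c))$ and $P_1 = 2\lambda\mu / (\mu^2 + 2\lambda\mu + 2\lambda(\lambda + \mu - \mu c))$, from which

$$PA_S = AA_S = P_0 + P_1 = \frac{\mu^2 + 2\lambda\mu}{\mu^2 + 2\lambda\mu + 2\lambda(\lambda + \mu - \mu c)}. \qquad (6.234)$$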

Comparison of Eqs. (6.227) with (6.232) and (6.230) with (6.234) shows the equivalence of the models given by Figs. 6.27 and 6.28 (for constant failure and repair rates). For $c = 1$, Eqs. (6.232) and (6.234) yield results of Table 6.6 for active redundancy. For $c = 0$, Eqs. (6.232) and (6.234) yield results for a one-item with failure rate $2\lambda$ and repair rate $\mu$; this is the most unfavorable case, since at the first failure it is not possible to identify the failed element, leading to a system down. Using results of Table 6.6 one recognizes that (for the model considered in Figs. 6.27 & 6.28) the effect of incomplete coverage is negligible for

Other possibilities to consider incomplete coverage are conceivable. Assuming for instance that the second element continues to operate after nonrecognition of a failure leads to the model considered in Fig. 6.25 with $\lambda_\sigma = 2\lambda(1 - c)$ & $\lambda + \lambda_r = 2\lambda c$, yielding for $c = 0$ a nonrepairable 1-out-of-2 active redundancy. A more elaborated model which considers 2 failure modes for the diagnosis, false alarm ($\lambda_{DF}$) and alarm defection ($\lambda_{DD}$), has been proposed in [6.42]. Figure 6.29 shows this model by considering one repair rate ($\mu$) for all failure modes, yielding (Table 6.2)

$$MTTF_{S0} = \frac{\lambda + \lambda_{DD}}{\lambda(\lambda + \lambda_D)} \quad \text{and} \quad PA_S = AA_S = \frac{\mu(\lambda + \lambda_{DD})}{\mu\lambda + \mu\lambda_{DD} + \lambda(\lambda + \lambda_D)}.$$

$\rho_{01} = \lambda_{DD}$; $\rho_{03} + \rho_{04} = \lambda_C + \lambda_{NC} = \lambda$; $\rho_{05} = \lambda_{DF}$; $\rho_{12} = \lambda_C + \lambda_{NC} = \lambda$; $\rho_{20} = \rho_{30} = \rho_{40} = \rho_{50} = \mu$; $\rho_0 = (\lambda_C + \lambda_{NC}) + (\lambda_{DF} + \lambda_{DD}) = \lambda + \lambda_D$; $\rho_1 = \lambda_C + \lambda_{NC} = \lambda$; $\rho_2 = \rho_3 = \rho_4 = \rho_5 = \mu$

C = covered, NC = not covered, DF = false alarm, DD = alarm defection

Figure 6.29 Diagram of transition rates for availability calculation of a one-item structure with incomplete coverage and 2 failure modes for the diagnosis (constant failure and repair rates $\lambda_C$, $\lambda_{NC}$, $\lambda_{DF}$, $\lambda_{DD}$, $\mu$); Markov process; $\mu \equiv 0$ for reliability calculation

$\lambda_{DF} = \lambda_{DD} = \lambda_D = 0$ leads to results of Section 6.2.1. A possible diagram of transition rates for a 1-out-of-2 active redundancy with 2 repair crews on the basis of Fig. 6.29 is Fig. 3 of [6.42]. A further example for a duplex system is Fig. 1 of [1.11].

6.8.5 Elements with more than two States or one Failure Mode

Elements with more than two states (good / failed for instance) or one failure mode (e.g. open or short) often arise in practical applications. Some considerations have been given in Sections 2.3.6 and 6.8.4. This section shows, on the basis of practical examples, that items with more than two states or one failure mode can be investigated operating directly with the diagram of transition rates.

As a first example consider an item with the three states good, waiting for repair, repair [6.13]. Figure 6.30 shows this model. From Fig. 6.30 & Table 6.2 it holds that

$$MTTF_{S0} = \frac{1}{\lambda} \quad \text{and} \quad PA_S = AA_S = \frac{\mu\mu'}{\mu\mu' + \lambda(\mu + \mu')} \approx 1 - \lambda\left(\frac{1}{\mu} + \frac{1}{\mu'}\right). \qquad (6.237)$$

The item in Fig. 6.30 behaves like a one-item structure with failure rate $\lambda$ and repair rate $1/(1/\mu + 1/\mu')$. More complex structures can also be investigated, see e.g. [6.13].
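The equivalence with a one-item structure is easy to check numerically. The short Python sketch below uses assumed rates; `mu_prime` denotes the failure recognition / isolation rate, and the function name is hypothetical.

```python
def three_state_item(lam, mu, mu_prime):
    """Steady-state availability of the item with states good, waiting for repair, repair."""
    exact = mu * mu_prime / (mu * mu_prime + lam * (mu + mu_prime))
    approx = 1.0 - lam * (1.0 / mu + 1.0 / mu_prime)
    # equivalent repair rate of a one-item structure with the same availability
    mu_equivalent = 1.0 / (1.0 / mu + 1.0 / mu_prime)
    return exact, approx, mu_equivalent


exact, approx, mu_eq = three_state_item(lam=1e-3, mu=0.5, mu_prime=2.0)  # assumed rates in 1/h
one_item = mu_eq / (mu_eq + 1e-3)
print(exact, approx, one_item)
```

The value `one_item` coincides with `exact`, and `approx` differs from both only in second order of $\lambda/\mu$ and $\lambda/\mu'$.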

$\lambda$ = failure rate, $\mu$ = repair rate, $\mu'$ = failure recognition / isolation rate

Figure 6.30 Diagram of transition rates for availability calculation of a one-item with 3 states good, waiting for repair, repair; constant failure, failure recognition & repair rates ($\lambda$, $\mu'$, $\mu$); Markov process

Figure 6.31 Diagram of transition rates for reliability calculation of a repairable 1-out-of-2 warm redundancy (const. failure and repair rates $\lambda$, $\lambda_r$, $\mu$); switch with failure modes stuck at the state occupied (const. failure & repair rates $\lambda_\sigma$, $\mu_\sigma$) and no connection (const. failure rate $\lambda_{\sigma 0}$), failure of the switch immediately recognized and repaired with repair priority, one repair crew; Markov process

As a second example consider a 1-out-of-2 warm redundancy with constant failure rates $\lambda$, $\lambda_r$ and repair rate $\mu$. The switching element can fail with constant failure rate $\lambda_\sigma$ for failure mode stuck at the state occupied just before failure or $\lambda_{\sigma 0}$ for failure mode no connection. Failure of the switch can be immediately recognized and repaired with constant repair rate $\mu_\sigma$ or $\mu_{\sigma 0}$. Furthermore, assume only one repair crew, repair priority on the switch, and no further failure at system down (also for the switch, no further failure is possible after a failure with one of the two possible failure modes). Asked is the mean time to system failure $MTTF_{S0}$ for system new (in state $Z_0$) at $t = 0$. The involved process is a time-homogeneous Markov process. Figure 6.31 gives the diagram of transition rates for reliability calculation (extension for availability is as in Fig. 6.24). From Fig. 6.31 and Table 6.2 it follows that $MTTF_{S0}$ is given as solution of the following system of algebraic equations ($M_i = MTTF_{Si}$)

The system given by Eq. (6.238) is identical to that of Eq. (6.205), the only difference being $\rho_0$ and $\rho_1$, to which $\lambda_{\sigma 0}$ has been added (compare Fig. 6.24a with Fig. 6.31). $MTTF_{S0}$ is thus given by Eq. (6.206) with $\rho_0$ and $\rho_1$ as per Fig. 6.31, yielding for $\mu_\sigma = \mu$ and $\lambda, \lambda_r, \lambda_\sigma \ll \mu$ the approximate expression

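and $MTTF_{S0} \approx 1/\lambda_{\sigma 0}$ for $\lambda_{\sigma 0}$ large. The failure mode no connection ($\lambda_{\sigma 0}$) is more disturbing than the failure mode stuck at the state occupied just before failure ($\lambda_\sigma$), see Eq. (6.207), and the effect of imperfect switching is negligible (Tab. 6.6) only for

$$\lambda_{\sigma 0} \ll \lambda(\lambda + \lambda_r)/\mu \quad \text{and} \quad \lambda_\sigma \ll \lambda + \lambda_r, \quad (\mu = \mu_\sigma \gg \lambda, \lambda_r, \lambda_\sigma, \lambda_{\sigma 0}). \qquad (6.240)$$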

The condition given by Eq. (6.240) is similar to that given by Eq. (6.212). Further models for systems with more than two states or one failure mode are conceivable.


6.8.6 Fault Tolerant reconfigurable Systems

Fault tolerant structures are able to recognize and isolate failures / faults and reconfigure themselves to continue operation with minimum loss of performance and / or safety (graceful degradation). Such a characteristic must be built in during design & development. Typical examples of fault tolerant systems are safety circuits as well as power and telecommunication networks. Following a short discussion on ideal reconfiguration, this section deals with reconfiguration occurring at given fixed times or at failure by considering also non-ideal conditions, for instance imperfect switching in Section 6.8.6.3. Investigation is based on tools introduced in Appendix A7 and summarized in Table 6.2. Constant failure and repair rates are assumed, leading to time-homogeneous Markov processes. Procedures are illustrated on a case-by-case basis using diagrams of transition rates.

6.8.6.1 Ideal case

Each redundant structure belongs to a fault tolerant reconfigurable structure and must be validated for this purpose during design & development, for instance with a FMEA (Section 2.6). For the redundant structures investigated in Sections 2.2, 2.3, 6.4 - 6.7 and Appendix A7, independent elements (p. 52), ideal fault coverage, ideal switching, and no reduction of system performance at failure of a redundant element were assumed. Because of these assumptions, investigations often lead to series-parallel structures (Sections 6.6 and 6.7). Imperfect switching, incomplete coverage, and items with more than two states or one failure mode have been considered in Sections 6.8.3 - 6.8.5. In Sections 6.8.6.2 and 6.8.6.3, time and failure censored reconfiguration is investigated. Section 6.8.6.4 considers reward and frequency / duration aspects. In addition, Section 6.8.7 deals with common cause failures, Section 6.8.8 gives a general procedure for complex repairable systems, and Section 6.9 presents alternative investigation methods for complex systems.

6.8.6.2 Time Censored Reconfiguration (Phased-Mission Systems)

In some practical applications, systems are used for different required functions. If each required function can be considered separately from one another, investigation is performed by considering a reliability block diagram (if it exists) for each required function. Otherwise, if mission phases follow each other, investigation must consider the system reconfiguration at the end of each phase, and one speaks of a phased-mission system. Investigation of phased-mission systems can be more complicated than stated in some literature [2.7, 2.17, 6.2, 6.7, 6.24, 6.33, 6.41], dealing with binary state assignment (limited to totally independent elements (p. 52)), considering time dependent failure or repair rates (breaking the Markov property), using semi-Markov processes (of limited validity), or missing Assumption 4 below (important when transferring state probabilities at the end of phase $k$ to initial probabilities for phase $k + 1$). A lower bound $R_{S0_l}$ for the mission reliability $R_{S0}$ is obtained by connecting the reliability block diagrams for each phase in series for the whole mission duration (Example 2.5). An upper bound for $R_{S0}$ is given by the smallest of the reliabilities for each phase taken separately, assuming that all elements involved are as-good-as-new at the beginning of the phase considered; thus,

$k = 1, \ldots, n$ (for $n$ phases).

Examples 6.19 - 6.21 illustrate some general considerations and Example 6.22 gives a numerical application of Eq. (6.241). For availability, Eq. (6.246) applies.

The following practice-oriented procedure (Point (ii) below) for reliability and availability analysis of repairable phased-mission systems allows, in particular, consideration of standby redundancy and arbitrary repair strategy.

(i) General assumptions:

1. Failure and repair rates ($\lambda_i$ and $\mu_i$) of all elements are constant during the sojourn time in any state within each phase, but can change (stepwise) at a state (or phase) change because of change in configuration, component use, stress, repair strategy or other; for all elements it holds that $\lambda_i \ll \mu_i$.

2. At the beginning of the mission all elements are as-good-as-new.

3. Phase durations $T_1, \ldots, T_n$ are given (fixed) values, each of them so large that asymptotic & steady-state values for availability can be assumed for every phase ($T_1, \ldots, T_n \gg 1/\mu_i$ for all elements, see Section 6.2.5 and Table 6.6).

4. For availability investigation, elements not used in a phase are either as-good-as-new and put in standby (failure rate $\lambda = 0$) at the beginning of the phase or repaired (Assumption 3) and then put in standby (repair priority on elements used); for reliability investigation, down states at system level are absorbing states and the above rule holds for elements which have not caused system down.

5. System has only one repair crew and no further failures can occur at system down; system down is an absorbing state for reliability; for availability, the system is restored to an operating state according to a given repair strategy.

6. Fault coverage, switch, and logistic support are ideal.

7. For each phase, a reliability block diagram exists.

Example 6.19

A one-item is used in a mission with phase 1 (duration $T_1$, const. failure rate $\lambda_1$), followed by phase 2 (duration $T_2$, const. failure rate $\lambda_2$). Compute the reliability function for item new at $t = 0$.

Solution

For the reliability function of the whole mission it holds that ($T_1$, $T_2$ given (fixed))

$$R_S = \Pr\{\text{phase 1 failure free} \cap \text{phase 2 failure free}\} = \Pr\{\text{phase 1 failure free}\} \cdot \Pr\{\text{phase 2 failure free} \mid \text{phase 1 failure free}\} = e^{-\lambda_1 T_1}\, e^{-\lambda_2 T_2} = e^{-(\lambda_1 T_1 + \lambda_2 T_2)}. \qquad (6.242)$$

The product rule in Eq. (6.242) holds only because of constant failure rates (see also Eq. (6.27)).
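The product rule can also be read as a transfer of state probabilities from one phase to the next, which is the idea used for phased-mission systems in this section. The following Python sketch, with assumed failure rates and phase durations, propagates the pair (up, down) through successive phases of a nonrepairable one-item and compares the result with $e^{-\sum_k \lambda_k T_k}$; function names are illustrative assumptions.

```python
import math


def run_phase(p_up, p_down, failure_rate, duration):
    """Propagate (P_up, P_down) through one phase; the down state is absorbing."""
    survive = math.exp(-failure_rate * duration)
    return p_up * survive, p_down + p_up * (1.0 - survive)


def mission_reliability(phases):
    """phases: list of (failure_rate, duration) tuples in mission order."""
    p_up, p_down = 1.0, 0.0              # item new at t = 0
    for failure_rate, duration in phases:
        # final probabilities of one phase are the initial ones of the next
        p_up, p_down = run_phase(p_up, p_down, failure_rate, duration)
    return p_up


phases = [(1e-4, 100.0), (5e-4, 20.0)]   # assumed (lambda_k in 1/h, T_k in h)
r_mission = mission_reliability(phases)
r_product = math.exp(-sum(lam * t for lam, t in phases))
print(r_mission, r_product)
```

The same loop handles any number of phases; for elements with nonconstant failure rate the per-phase survival factor would have to depend on the accumulated age, and the simple product would no longer hold.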


Figure 6.32 Diagrams of transition rates for a one-item used in a mission with phase 1 (duration $T_1$, const. failure rate $\lambda_1$), followed by phase 2 (duration $T_2$, const. failure rate $\lambda_2$); Markov process

Example 6.20

Show that Eq. (6.242) can be obtained using a Markov approach, i.e. working with two separate transition rate diagrams for phase 1 and for phase 2, and setting final state probabilities from phase 1 as initial state probabilities for phase 2.

Solution

Figure 6.32 gives the diagrams of transition rates for phase 1 and 2 (separately). For phase 1, the state probability $P_{1,0}(t)$ follows from $\dot P_{1,0}(t) = -\lambda_1 P_{1,0}(t)$ (Table 6.2, Eq. (A7.115)), yielding $P_{1,0}(t) = e^{-\lambda_1 t}$, for $P_{1,0}(0) = 1$. Thus, $R_{S0}(T_1) = P_{1,0}(T_1) = e^{-\lambda_1 T_1}$ and $P_{1,1}(T_1) = 1 - e^{-\lambda_1 T_1}$. $P_{1,1}(t)$ follows from $P_{1,0}(t) + P_{1,1}(t) = 1$ or by solving (Fig. 6.32 and Table 6.2) $\dot P_{1,1}(t) = \lambda_1 P_{1,0}(t)$ with $P_{1,1}(0) = 0$. Similarly, for phase 2 with $t$ starting at $t = T_1$,

Example 6.21

A one-item system with reliability function $R_S(t)$ is used for a mission of random duration $\tau_w > 0$ distributed according to $F_w(t) = \Pr\{\tau_w \le t\}$ with $F_w(0) = 0$ and density $f_w(t)$. Give the reliability, first for the general case and then by assuming constant failure rate $\lambda$ and exponentially distributed mission duration ($f_w(t) = \delta e^{-\delta t}$).

Solution

As mission duration can take any value in $(0, \infty)$, reliability takes a constant value given by

$$R_S = \int_0^\infty f_w(t)\, R_S(t)\,dt, \qquad (6.244)$$

(see also Eq. (2.76)). For $f_w(t) = \delta e^{-\delta t}$ and constant failure rate $\lambda$, Eq. (6.244) yields

$$R_S = \frac{\delta}{\delta + \lambda}. \qquad (6.245)$$

Supplementary results: In practical applications, mission duration is limited to $T_w$ and $\tau_w > 0$ is a truncated random variable with $\Pr\{\tau_w = T_w\} = 1 - F_w(T_w - 0)$; for this case, Eq. (6.245) becomes

$$R_S = \frac{\delta}{\delta + \lambda} + e^{-(\delta + \lambda) T_w}\, \frac{\lambda}{\delta + \lambda}.$$
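Both results of Example 6.21 can be verified by direct numerical integration of Eq. (6.244). The Python sketch below assumes exponential mission duration and constant failure rate; in the truncated case the probability mass $e^{-\delta T_w}$ concentrated at $T_w$ is added explicitly. Function names, the integration cut-off and the numerical values are assumptions of this sketch.

```python
import math


def mission_reliability_numeric(lam, delta, t_w=None, steps=100_000):
    """Trapezoidal evaluation of the integral of f_w(t) R_S(t), optionally truncated at t_w."""
    # Without truncation, integrate far enough for the tail to be negligible.
    upper = t_w if t_w is not None else 50.0 / (lam + delta)
    h = upper / steps

    def integrand(t):
        return delta * math.exp(-(delta + lam) * t)

    total = 0.5 * (integrand(0.0) + integrand(upper))
    total += sum(integrand(i * h) for i in range(1, steps))
    total *= h
    if t_w is not None:
        # missions cut off at t_w: probability exp(-delta t_w), reliability exp(-lam t_w)
        total += math.exp(-delta * t_w) * math.exp(-lam * t_w)
    return total


def mission_reliability_closed(lam, delta, t_w=None):
    base = delta / (delta + lam)
    if t_w is None:
        return base
    return base + math.exp(-(delta + lam) * t_w) * lam / (delta + lam)


lam, delta = 1e-3, 1e-2   # assumed rates in 1/h
print(mission_reliability_numeric(lam, delta), mission_reliability_closed(lam, delta))
print(mission_reliability_numeric(lam, delta, t_w=100.0),
      mission_reliability_closed(lam, delta, t_w=100.0))
```

Truncation raises the mission reliability, since long missions (with low survival probability) are cut off at $T_w$.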


(ii) Procedure for reliability & availability computation of repairable phased-mission systems with fixed phase durations $T_1, \ldots, T_n$, satisfying the general assumptions (i):

1. Group series elements used in all phases (power supply, cooling, etc.) in one element and put this as a series element in final results (Table 6.10, 2nd row).

2. Draw the diagram of transition rates for reliability evaluation, separately for each phase ($1, \ldots, n$), beginning with phase 1 with $Z_{1,0}$ (1 referring to phase 1 and 0 being the state in which all elements are as-good-as-new); down states at system level are absorbing states; use the same state numbering for the same state appearing in successive phases; however, state $Z_{k,j}$ corresponding to a state $Z_{i,j}$ in a phase $i$ preceding phase $k$ can also contain as-good-as-new elements appearing in phase $k$ but not in a previous phase, or standby elements (not used in phase $k$) with failure rate $\lambda = 0$; for $k > 1$, state $Z_{k,0}$ contains all as-good-as-new elements used in phase $k$ and (as necessary) elements not used in phase $k$ which are standby with failure rate $\lambda = 0$; verify correctness of all transition rate diagrams (as-good-as-new is the same as operating or ready to operate, because of the assumed constant failure rate).

3. For availability investigation, use results of Table 6.10 (or extend diagrams of transition rates, allowing a return to an operating state after system down according to a given repair strategy) to compute the asymptotic & steady-state availability for each phase separately ($PA_{k,S} = AA_{k,S}$ for phase $k$), taking care of elements which are not used in the phase considered and can act as standby redundancy ($\lambda = 0$) for working elements; for the whole mission it holds then

$$PA_S = AA_S \le \min\,(PA_{k,S} = AA_{k,S}), \quad k = 1, \ldots, n \ (\text{for } n \text{ phases}). \qquad (6.246)$$

4. For reliability investigation, compute the reliability function R l , (Tl ) at the end of phase 1 starting in state Z l , o at t = 0 in the same way as for a one mission system (Table 6.2), as well as states probabilities Pi, (T1) for all up states Z l , ; if Z i , (possibly with further as-good-as-new elements used in phase 2) is an up state in phase 2, Pi, (T1) becomes the probability P;, ( 0 ) to start phase 2 in Z2 , ; if Z1, is a down state in phase 2, Pi, (T1) adds to the initial probability of starting phase 2 in the down state; if Zl, does not appear in phase 2, Pi, (T1 ) adds to the initial probability in state Z2, to give Pi, o ( 0 ) (from rule 2 above and verifying that for each phase the sum of all states probabilities is 1); reliability calculation must take care of elements which are not used in the phase considered and can act as standby redundancy ( h = 0 ) for working elements; continuing in this way, following equation can be found for the mission reliability Rso starting phase 1 in Z l ,

$R_{S0} = \sum_{Z_{n,i} \in U_n} P'_{n,i}(T_n)$, $U_n$ = set of up states in phase n. (6.247)

To simplify the notation used in Example 6.20, the variable x, starting with x = 0 at the beginning of each phase, is used in Rule 4 instead of t (starting with t = 0 at phase 1).
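The transfer of state probabilities from phase to phase described in Rule 4 can be sketched numerically. The following is a minimal illustration only, not the system of Fig. 6.33: the two-phase mission, the state names, the failure rate `lam`, and the durations `T1`, `T2` are hypothetical, and a simple Runge-Kutta integration is assumed in place of any particular solver. Two identical nonrepairable elements work as a 1-out-of-2 active redundancy in phase 1 and are both required in phase 2, so that the state with one failed element becomes a down state in phase 2.

```python
import math

def propagate(rates, p0, duration, steps=2000):
    """Integrate dP_j/dx = sum_i P_i*rho_ij - P_j*rho_j over one phase (RK4).

    rates: {(source, target): transition rate}; down states have no outgoing
    transitions (absorbing). p0: {state: probability at the start of the phase}.
    """
    states = list(p0)

    def deriv(p):
        d = dict.fromkeys(states, 0.0)
        for (src, dst), rate in rates.items():
            flow = rate * p[src]
            d[src] -= flow
            d[dst] += flow
        return d

    def shifted(p, k, factor):
        return {s: p[s] + factor * k[s] for s in states}

    h = duration / steps
    p = dict(p0)
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv(shifted(p, k1, h / 2))
        k3 = deriv(shifted(p, k2, h / 2))
        k4 = deriv(shifted(p, k3, h))
        p = {s: p[s] + h / 6 * (k1[s] + 2 * k2[s] + 2 * k3[s] + k4[s]) for s in states}
    return p

# Hypothetical values (assumptions for illustration only)
lam, T1, T2 = 1e-3, 100.0, 200.0

# Phase 1: 1-out-of-2 active redundancy (Z0: both up, Z1: one up)
phase1 = {("Z0", "Z1"): 2 * lam, ("Z1", "down"): lam}
end1 = propagate(phase1, {"Z0": 1.0, "Z1": 0.0, "down": 0.0}, T1)

# Phase 2: both elements required; Z1 is a down state in phase 2, so its
# probability adds to the initial probability of the down state
phase2 = {("Z0", "down"): 2 * lam}
start2 = {"Z0": end1["Z0"], "down": end1["Z1"] + end1["down"]}
end2 = propagate(phase2, start2, T2)

mission_reliability = end2["Z0"]   # sum over the up states of the last phase
closed_form = math.exp(-2 * lam * (T1 + T2))
print(mission_reliability, closed_form)
```

For this simple case the result can be compared with the closed form $e^{-2\lambda(T_1+T_2)}$; for repairable systems the same propagation applies with repair transitions added to each phase.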


Figure 6.33 Reliability block diagrams and diagrams of transition rates for reliability calculation of a phased-mission system with 3 phases (the diagram of transition rates for phase 2 takes care that one element $E_2$ is put in standby with $\lambda_2 = 0$ as soon as available from phase 1); dashed lines indicate to which states the final state probabilities of phase 1 and phase 2 are transferred as initial probabilities for phase 2 and phase 3, respectively; constant failure and repair rates ($\lambda$, $\mu$); one repair crew; Markov process

As an example let us consider the phased-mission system with 3 phases of given (fixed) duration $T_1$, $T_2$ and $T_3$, described by the 3 reliability block diagrams and the corresponding diagrams of transition rates for reliability investigation given in Fig. 6.33. The diagram of transition rates for phase 2 considers that in phase 2 only one element $E_2$ is used and assumes that the second element $E_2$ is put in standby redundancy with failure rate $\lambda_2 = 0$ (either from state $Z_{1,0}$ or as soon as repaired if from state $Z_{1,2}$). Dashed lines indicate to which states the final state probabilities at time $T_1$ for phase 1 and $T_2$ ($T_1 + T_2$ with respect to time t) for phase 2 are transferred as initial probabilities for the successive phase. Let us first consider the asymptotic & steady-state mission availability $PA_S = AA_S$. From Tables 6.10 and 6.6, it follows for the 3 phases (taken separately) that

The 2nd equation considers that in phase 2 one of the elements $E_2$ acts as standby redundancy with failure rate $\lambda_2 = 0$, thus combining results from Table 6.6 ($1 - (\lambda_2/\mu_2)^2$) and Table 6.10 (2nd row). Equation (6.246) then yields


For the mission reliability $R_{S0}$, starting in state $Z_{1,0}$ (all elements are as-good-as-new) at $t = 0$, the diagrams of transition rates of Fig. 6.33 yield for phases 1, 2, 3 the following coupled system of differential equations for the state probabilities (Table 6.2)

In Eq. (6.250), $P'_{i,j}$ is used instead of $P'_{i,j}(x)$. From Eq. (6.247) it then follows that

Analytical solution of the system given by Eq. (6.250) is possible, but time consuming. A numerical solution can be obtained quickly (Example 6.22). A lower bound $R_{S0_l}$ for the mission reliability $R_{S0}$ is obtained by connecting the reliability block diagrams for each phase in series. For Fig. 6.33, this corresponds (practically) to considering phase 3 for a time span $T_1 + T_2 + T_3$ (in phase 2, for element $E_2$ a second element $E_2$ is available in standby redundancy). A good approximation for $R_{S0_l}$ is

Example 6.22 Give the numerical solution of Eqs. (6.250) and (6.25 1 ) for h l = 1 0 - ~ h-' , h2 = 10-~ h-' , h3 = 10-~ h-', =p2 =p3 = 0.5 h-' , = 168 h , Tz = 336 h, and T3 = 672 h.

Solution Numerical solution of the 3 coupled systems of differential equations given by Eq. (6.250) yields

P;,, (T3) = 0.598655, (T3)= 0.023493, (T3)= 0.002388,

P;,, (T3 ) = 0.000092, P& (T3 )= 0.000094, P;,' (Ti )= 0.375278 (6.252)

(with 6 digits because OE P ; ,~ (?)arid (Ti )). RSO follows then from Eq. (6.251)

Rso = I - P ; , ~ ( ? ) = 0.625. (6.253)

Supplementary results: Computing lower and upper bounds for $R_{S0}$ as per Eqs. (6.241) and (6.254) yields, for the above numerical example, $0.55 \le R_{S0} \le 0.71$.


quickly obtained by computing $MTTF_{S0}$ using Table 6.10 and setting this in

$R_{S0_l} = e^{-(T_1+T_2+T_3)/MTTF_{S0}}$; from this, $MTTF_{S0} \approx 1/(\lambda_1 + 2\lambda_2^2/\mu_2 + 2\lambda_3^2/\mu_3)$ and

Eq. (6.241) allows computation of an upper bound for RSO (Example 6.22). If the second element E2 were not available in phase 2 as standby redundancy,

$PA_{2,S} = AA_{2,S} = 1 - \lambda_2/\mu_2$ and, from Eq. (6.249), $PA_S = AA_S = 1 - \lambda_2/\mu_2$, since $\lambda_1/\mu_1 < \lambda_2/\mu_2$ can be assumed when considering the reliability block diagram for phase 1. Assuming furthermore that the second element $E_2$ would be repaired before the end of phase 2, if in a failed state at the end of phase 1 ($Z_{1,2}$), the diagram of transition rates for phase 2 would be equal to that for phase 1, with

+&, h2+h3 9 P ~ - + c L ~ , arid Z1,o + Z2,0, Zl,l+ Z2,1, Z1,2 Z2,3 with

The corresponding initial probabilities for phase 3 would be

If an element E„, where common to all 3 phases in Fig. 6.33 (i.e. in series with all3 reliability block diagrams), Table 6.10 (2nd row) can be used to find

(considering Eq. (6.249)) and, with $R_{S0}$ from Eq. (6.251),

The above procedure can be extended to consider more than one repair crew at system level or any kind of repair (restore) strategy. Other procedures (models) are conceivable. For instance, for nonrepairable systems (up to system failure) of complex structure and with independent elements (parallel redundancy), it can be useful to number the states using binary considerations.

For randomly distributed phase duration, Eq. (6.246) can be used for availability. Reliability can be obtained by expanding results in Examples 6.19 - 6.21.

An alternative approach for phased-mission systems is to assume that at the beginning of each mission phase, the system is as-good-as-new with respect to the elements used in the mission phase considered (required elements are repaired in a negligible time at the beginning of the mission phase, if they are in a failed state, and not required elements can be repaired during a phase in which they are not used). This assumption can be reasonable for some repairable systems and highly simplifies investigation. For this case, results developed in Section 6.8.2 for preventive maintenance lead to (for phases 1, 2, ...)


for the reliability function, and

for the point availability. $S_i$ is the state from which the i-th mission phase starts; $0, T_1^*, T_2^*, \dots$ are the time points on the time axis at which mission phases 1, 2, 3, ... begin (the duration of phase i being here $T_i^* - T_{i-1}^*$, with $T_0^* = 0$).

6.8.6.3 Failure Censored Reconfiguration

In most applications, reconfiguration occurs at the failure of a redundant element. Besides cases with ideal fault coverage, ideal switching, and no system performance reduction at failure (Sections 2.2, 2.3, and 6.4-6.7), more complex structures often arise in practical applications. Such structures must be investigated on a case-by-case basis. An FMEA / FMECA (Section 2.6) is mandatory to validate investigations. Often it is necessary to consider that after a reconfiguration, the system performance is reduced, i.e. reward and frequency / duration aspects have to be involved in the analysis.

A reasonably simple and comprehensive example is a power system substation. Figure 6.34 gives the functional block diagram and the diagram of transition rates for availability calculation ($\mu_g = 0$ for reliability investigation). $Z_{12}$ is the down state. The substation is powered by a reliable network and consists of:

Two branches designated by $A_1$ & $A_2$ and capable of performing 100% load, each with HV switch, HV circuit breaker and control elements, transformer, measurement & control elements, and LV switch.

Two busbars designated by $C_1$ & $C_2$ and capable of performing 100% load (failure rate basically given by double contingency of faults on control elements).

A coupler between the busbars, designated by B and capable of performing 100% load (two failure modes, each with its own constant failure rate: stuck at the state occupied just before failure (does not open), and no connection (does not close)).

Load is distributed between $C_1$ and $C_2$ at 50% rate each. The diagram of transition rates is based on an extensive FMEA / FMECA [6.20 (2002)] showing in particular the key position of the coupler B in the reconfiguration strategy. Coupler B is normally open. A failure of B is recognized only at a failure of A or C. From state $Z_0$, B can fail only with failure mode no connection, from $Z_1$ or $Z_2$ only with failure mode


Figure 6.34 Functional block diagram and diagram of transition rates for availability calculation of a power system substation; active redundancy, constant failure rates ($\lambda_{A1}$, $\lambda_{A2}$, $\lambda_{C1}$, $\lambda_{C2}$, and one rate for each failure mode of B) and constant repair rates ($\mu_A$, $\mu_C$, $\mu_g$); imperfect switching of B with failure modes does not open (from $Z_1$ and $Z_2$) or no connection (from $Z_0$), failure of B recognized only at failure of A or C; one repair crew, priority on C, no further failure at system down; Markov process; $\mu_g = 0$ for reliability

stuck at the state occupied just before failure. Constant failure rates ($\lambda_{A1}$, $\lambda_{A2}$, $\lambda_{C1}$, $\lambda_{C2}$, and one rate for each of the two failure modes of B) and constant repair rates $\mu_A$, $\mu_C$, $\mu_g$ are assumed. $\mu_A$ and $\mu_C$ remain the same also if a repair of B is necessary; $1/\mu_g$ is larger than $1/\mu_A$ and $1/\mu_C$ (12 h versus 4 h in the numerical evaluation given in this section). From the down state ($Z_{12}$) the system returns to state $Z_0$. Furthermore, only one repair crew, repair priority on C (followed by C+B, A, A+B), and no further failure at system down (50% load is an up state with reduced performance, see Section 6.8.6.4) are assumed. Sought are the mean time to system failure $MTTF_{S0}$ for system new (in state $Z_0$) at $t = 0$ and the asymptotic & steady-state point and average availability $PA_S = AA_S$. The involved process is a time-homogeneous Markov process. If results are required for 100% load, $Z_6$ to $Z_{11}$ are down states (see Section 6.8.6.4 for reward considerations). To simplify investigation, equations use $\lambda_{A1} = \lambda_{A2} = \lambda_A$ and $\lambda_{C1} = \lambda_{C2} = \lambda_C$. To increase readability, the number of states in Fig. 6.34 has been reduced according to Point 2 on p. 264.

From Fig. 6.34 and Table 6.2 or Eq. (A7.126) it follows that $MTTF_{S0}$ is given as solution of the following system of algebraic equations (with $M_i = MTTF_{Si}$)


Because of $\lambda_{A1} = \lambda_{A2} = \lambda_A$, $\lambda_{C1} = \lambda_{C2} = \lambda_C$, and the symmetry in Fig. 6.34 it follows that $P_2 = P_1$, $P_4 = P_3$, $P_7 = P_6$, $P_9 = P_8$, $P_{11} = P_{10}$ and $M_2 = M_1$, $M_4 = M_3$, $M_7 = M_6$, $M_9 = M_8$, $M_{11} = M_{10}$.

This has been considered in solving the system of algebraic equations (6.261). From Eq. (6.261) it follows that

a l w o - 2 h A ~ A p s ( ~ : h B o ) ( a 1 + h C ~ C ~ 5 ~ 1 0 ~ - 2 u 2 c ~ c ~ 5 ~ s ~ l o - a B u 2 u 5 + '6

'

with (6.262)

MTTFso per Eq. (6.262) can be approximated by

yielding $MTTF_{S0} \approx \mu_A / (2(\lambda_A+\lambda_C)(\lambda_A+\lambda_C\,\mu_A/\mu_C))$ when both failure rates of B are zero (1-out-of-2 active redundancy with A and C in series, as per Table 6.10, 2nd & 3rd row).

From Fig. 6.34 and Table 6.2 or Eq. (A7.127) it follows that the asymptotic & steady-state point and average availability $PA_S = AA_S$ is given as solution of

One of the Eq. (6.266) must be dropped and replaced by $P_0 + \dots + P_{12} = 1$. The solution yields


with

From Eqs. (6.267) - (6.269) it follows that

1 + b3 b, + 2b,(1 +hBo1p3 ) + I P5 + 2 h c ( ~ ~ + I P ~ P ~ +2h.A.A~01 ~ 5 ~ 1 0

6.270) PAs =AAS per Eq. (6.270) can be approximated by

yielding $PA_S = AA_S \approx 1 - 2((\lambda_A+\lambda_C)/\mu)^2$ when both failure rates of B are zero and $\mu_A = \mu_C = \mu_g = \mu$ (1-out-of-2 active redundancy with A and C in series, as per Table 6.10). Equations (6.265) and (6.271) show the small influence of the coupler B. A numerical evaluation with

-6 -1 hAl =hAS=hA=4.10 h (= 0.035 expected failures per year) h c l = h C 2 =AC = 0.12.10-~ h-I (= 0.001 expected failures per year) hB,=0.08. 1oW6 h-" (= 0.0007 expected failures per year) ABO= 0.6.10-~ h" (= 0.005 expected failures per year) pA =pC =1 /4h, pg =1/12h

yields

$MTTF_{S0} = 7.36 \cdot 10^{9}$ h and $PA_S = AA_S = 1 - 1.63 \cdot 10^{-9}$

from Eqs. (6.262) & (6.270), as well as $MTTF_{S0} \approx 7.3 \cdot 10^{9}$ h and $PA_S = AA_S \approx 1 - 0.9 \cdot 10^{-9}$ from Eqs. (6.265) & (6.271), respectively; moreover,

Considering the substation as a macro-structure (first row in Table 6.10), it holds that $PA_S = AA_S = 1 - \lambda_S/\mu_S$ and $R_S(t) = e^{-\lambda_S t}$, with $\mu_S = \mu_g$ and $\lambda_S = 1/MTTF_{S0}$.
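The orders of magnitude quoted for the substation can be cross-checked with a few lines of arithmetic. The sketch below is a plausibility check only, not the solution of the full diagram of Fig. 6.34: it neglects the coupler B, groups A and C as a series macro-structure, and applies the Table 6.10 approximation for a 1-out-of-2 active redundancy with one repair crew. The numerical values ($\lambda_A = 4 \cdot 10^{-6}$ h$^{-1}$, $\lambda_C = 0.12 \cdot 10^{-6}$ h$^{-1}$, repair times of 4 h and 12 h) are read from the numerical evaluation of this section; the exponents are an assumption, chosen to agree with the quoted failures per year.

```python
HOURS_PER_YEAR = 8760.0

# Values read from the numerical evaluation (exponents assumed, see lead-in)
lam_A = 4e-6            # failure rate of a branch A, per hour
lam_C = 0.12e-6         # failure rate of a busbar C, per hour
mu_A = mu_C = 1 / 4.0   # repair rates of A and C, per hour
mu_g = 1 / 12.0         # repair rate from the down state, per hour

# Series macro-structure A + C (Table 6.10, first row)
lam_S = lam_A + lam_C
mu_S = lam_S / (lam_A / mu_A + lam_C / mu_C)

# 1-out-of-2 active redundancy of two such branches, one repair crew,
# coupler B neglected: MTTF_S0 is approximately mu_S / (2 * lam_S^2)
mttf_approx = mu_S / (2 * lam_S ** 2)

# Whole substation as a macro-structure: lambda = 1/MTTF_S0, mu = mu_g
unavailability = (1 / mttf_approx) / mu_g

print("failures per year, A:", lam_A * HOURS_PER_YEAR)
print("failures per year, C:", lam_C * HOURS_PER_YEAR)
print("MTTF_S0 approx [h]:", mttf_approx)
print("1 - PA_S approx:", unavailability)
```

Under these assumptions the approximation lands close to the quoted $7.36 \cdot 10^{9}$ h and $1.63 \cdot 10^{-9}$, which supports the statement that the coupler B has only a small influence.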


6.8.6.4 With Reward and Frequency / Duration Aspects

For some applications, e.g. in power and communication systems, it is of importance to consider system performance also in the presence of failures. Reward and frequency / duration aspects are of interest to evaluate system performability. For constant failure and repair rates (Markov processes), asymptotic & steady-state system failure frequency $f_{udS}$ and system mean down time $MDT_S$ (mean repair (restoration) duration at system level) are given as (Eqs. (A7.143) & (A7.144))

respectively. Similar results hold for semi-Markov processes. U is the set of states considered as up states for $f_{udS}$ and $MDT_S$ calculation, $\bar U$ is the complement to the totality of states considered. $P_j$ is the asymptotic & steady-state probability of state $Z_j$ and $\rho_{ji}$ the transition rate from $Z_j$ to $Z_i$. In Eq. (6.272), all transition rates $\rho_{ji}$ leaving state $Z_j \in U$ toward $Z_i \in \bar U$ are considered (cumulated states). Example 6.23 gives an application to the substation investigated in Fig. 6.34. Considering $f_{udS} = f_{duS}$ (Eq. (A7.145)), $f_{udS}$ can be replaced by $f_{duS}$.

Example 6.23 Give the failure frequency fuds and the mean failure duration MDTs in steady-state for the substation of Fig. 6.34 for failures referred to a load loss of 100% and 50%, respectively.

For loss of 50% load, Fig. 6.34 with $U = \{Z_0 - Z_5\}$ and $\bar U = \{Z_6 - Z_{12}\}$ yields

From Eq. (6.273) it follows that

$MDT_{S\,\mathrm{loss}\,100\%} = P_{12} / f_{udS\,\mathrm{loss}\,100\%}$

and

MDTs I„ 50% = 1 - (Po + 2P1 + 2P3 + P51 1 fuds las 50% . The numerical example on p. 258 yields fuds lasioos = 1 3 6 ~ 1 0 - ' ~ h ~ ' (= 10" expected failures per year), fudS lass 50% = 783 l ~ - ~ h - ' (= 7 .10 -~ expected failures per year), MDTs ioo"/o =12h, ard MDTs ~ o s s 5 ~ % = 4 h .

Example 6.24 Give the expected instantaneous reward rate in steady-state for the substation of Fig. 6.34.

Solution Considering Fig. 6.34 and the numerical example on p. 258 it follows that

The reward rate $r_i$ takes care of the performance reduction in the state considered ($r_i = 0$ for down states, $0 < r_i < 1$ for partially down states, and $r_i = 1$ for up states with 100% performance). From this, the expected instantaneous reward rate in steady-state or for $t \to \infty$, $MIR_S$, is given as (Eq. (A7.147))

The expected accumulated reward in steady-state (or for $t \to \infty$) follows as $MAR_S(t) = MIR_S \cdot t$, see Example 6.24 for an application. $P_i$ in Eq. (6.274) is the asymptotic & steady-state probability of state $Z_i$, giving also the expected percentage of time the system stays at the performance level specified by $Z_i$ (Eq. (A7.132)).
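The reward computation can be illustrated with a short sketch, assuming the usual form $MIR_S = \sum_i r_i P_i$ for the expected instantaneous reward rate. The state labels, steady-state probabilities, and reward rates below are hypothetical and do not correspond to the substation of Fig. 6.34.

```python
def expected_reward_rate(probabilities, rewards):
    """MIR_S as the sum of r_i * P_i over all states."""
    return sum(rewards[state] * p for state, p in probabilities.items())

def expected_accumulated_reward(mir, t):
    """MAR_S(t) = MIR_S * t in steady-state."""
    return mir * t

# Hypothetical steady-state probabilities, grouped by performance level
probabilities = {"full": 0.98, "half": 0.015, "down": 0.005}
# Reward rates: 1 for 100% performance, 0.5 for 50% load, 0 for down states
rewards = {"full": 1.0, "half": 0.5, "down": 0.0}

mir = expected_reward_rate(probabilities, rewards)
mar = expected_accumulated_reward(mir, 1000.0)   # over 1000 h
print(mir, mar)
```

The value of `mir` can be read as the expected fraction of nominal performance delivered in steady-state.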

6.8.7 Systems with Common Cause Failures

In some practical applications it is necessary to consider that common cause failures (CCF) can occur. Common cause failures are multiple failures resulting from a single cause. They must be distinguished from common mode failures (CMF), which are multiple failures showing the same symptom. Common cause failures can occur in hardware as well as in software, and their causes can be quite different. Some possible causes for common cause failures in hardware are:

overload (electrical, thermal, mechanical), technological weakness (material, design, production),

misuse (caused e.g. by operating or maintenance personnel),

external event.

Similar causes can be found for software. In the following, a comprehensive example for investigating effects of common cause failures is considered. Results (Eqs. (6.276) and (6.280) in particular) show that a common cause failure acts as a series element in the system's reliability structure, with failure rate ($\lambda_C$) equal to the occurrence rate of the common cause failure and repair (restoration) rate ($\mu_C$) equal to the removal rate of the corresponding failure. Graphs given by Figs. 2.8 & 2.9 for nonrepairable systems and Figs. 6.17 & 6.18 for repairable systems can be used to visualize results and to support rules (2.28) and (6.174).

a) C only on working elements, repair for C includes all other failures

b) C only on working elements, repair for C has priority but does not include other failures

c) C on elements in working or repair state, repair for C includes all other failures

d) C on elements in working or repair state, repair as for case b)

Figure 6.35 Diagrams of transition rates for availability calculation of the 1-out-of-2 active redundancy of Fig. 6.36 with common cause failures for 4 different basic possibilities (constant failure and repair rates ($\lambda$, $\lambda_C$, $\lambda_{Ci}$, $\mu$, $\mu_C$, $\mu_{Ci}$), one repair crew, repair priority for failures caused by common cause (C), no further failures at system down (except $\lambda_{C41}$, $\lambda_{C45}$); Markov process, $Z_0$, $Z_2$ up states)

Figure 6.35 gives the diagrams of transition rates for the repairable 1-out-of-2 active redundancy of Fig. 6.36 with common cause failures for 4 different basic possibilities (C refers to common cause, repair priority for failures caused by C, one repair crew, no further failures at system down). The 4 possibilities of Fig. 6.35 are summarized in Fig. 6.36 for investigation. From Fig. 6.36 and Table 6.2 or Eq. (A7.126) it follows that $MTTF_{S0}$ is given as solution of the following system of algebraic equations (all down states are absorbing for reliability investigation)

$(2\lambda + \lambda_C)\, MTTF_{S0} = 1 + 2\lambda\, MTTF_{S2}$

$(\lambda + \lambda_C + \mu)\, MTTF_{S2} = 1 + \mu\, MTTF_{S0}$. (6.275)
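The effect of the common cause failure on the mean time to system failure can be explored numerically. The sketch below assumes that the two equations for $MTTF_{S0}$ and $MTTF_{S2}$ read $(2\lambda+\lambda_C)\,MTTF_{S0} = 1 + 2\lambda\,MTTF_{S2}$ and $(\lambda+\lambda_C+\mu)\,MTTF_{S2} = 1 + \mu\,MTTF_{S0}$ (a reading of the printed system, labeled here as an assumption), solves them with Cramer's rule, and compares the result with the series-element interpretation $1/(\lambda_C + 2\lambda^2/\mu)$. The function name and the numerical values are hypothetical.

```python
def mttf_with_common_cause(lam, lam_c, mu):
    """Solve the assumed 2x2 system for (MTTF_S0, MTTF_S2) by Cramer's rule."""
    a11, a12, b1 = 2 * lam + lam_c, -2 * lam, 1.0
    a21, a22, b2 = -mu, lam + lam_c + mu, 1.0
    det = a11 * a22 - a12 * a21
    m0 = (b1 * a22 - a12 * b2) / det
    m2 = (a11 * b2 - a21 * b1) / det
    return m0, m2

# Hypothetical values, with lam_c / lam = 0.005
lam, lam_c, mu = 1e-4, 5e-7, 0.5

m0, m2 = mttf_with_common_cause(lam, lam_c, mu)

# Closed form obtained by eliminating MTTF_S2 from the two equations
closed_form = (3 * lam + lam_c + mu) / ((2 * lam + lam_c) * (lam + lam_c) + lam_c * mu)

# Common cause failure seen as a series element with failure rate lam_c
series_view = 1 / (lam_c + 2 * lam ** 2 / mu)

# Same redundancy without common cause failures, for comparison
m0_no_ccf, _ = mttf_with_common_cause(lam, 0.0, mu)

print(m0, closed_form, series_view, m0_no_ccf)
```

With these hypothetical values, even a common cause rate of 0.5% of $\lambda$ dominates $2\lambda^2/\mu$ and reduces $MTTF_{S0}$ by more than an order of magnitude, which is the practical content of the series-element rule.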

From Eq. (6.275), $MTTF_{S0}$ follows as (with $\lambda_C < \lambda$),

Furthermore, from Fig. 6.36 and Table 6.2 or Eq. (A7.127) it follows that the asymptotic & steady-state point and average availability PAs =AAS is given as solution of the following system of algebraic equations

One of the Eq. (6.277) must be dropped and replaced by Po+ ... +P5=1 (the first equation because of the particular cases investigated below). The solution yields

and po=2h+hc, P ~ = P C , P Z = ~ + ~ C ~ I + ~ C Z ~ + P . P3=k32. ~ 4 = h c 4 1 + ~ ~ 4 5 + ~ > P5 =Pc54 (Fig. 6.36, Eq. (A7.103)). Considering h «P, h, «F,, hci «pci it follows that

Equations (6.276) & (6.278) can be used to investigate Fig. 6.35, yielding (Ac< h )

a) C only on working elements, repair for C includes all other failures

C) C on elements in working or repair state, repair for C includes all other failures ( L C 4 , « !J )

b) C only on working elements, repair for C does not include other failures

d) C on elements in working or repair state, repair for C does not include other failures


1-out-of-2 active ($E_1 = E_2 = E$)

Figure 6.36 Reliability block diagram and diagram of transition rates for availability calculation of a 1-out-of-2 active redundancy with common cause failures, for different possibilities as per Fig. 6.35 (constant failure and repair rates ($\lambda$, $\lambda_C$, $\lambda_{Ci}$, $\mu$, $\mu_C$, $\mu_{Ci}$), one repair crew, repair priority for common cause (C), no further failures at system down (except $\lambda_{C41}$, $\lambda_{C45}$); Markov process, $Z_0$, $Z_2$ up states)

Case b) corresponds to a 1-out-of-2 active redundancy in series with a switch (Eqs. (6.158), (6.160)). Further approximations are possible, e.g. with Es =Fi +P3 +P4+P5.

Equations (6.276) & (6.280) clearly show the effect (consequence) of a common cause failure on a 1-out-of-2 active redundancy:

The common cause failure acts as a series element with failure rate ($\lambda_C$) equal to the occurrence rate of the common cause failure and repair (restoration) rate ($\mu_C$) equal to the removal rate of the corresponding failure; results given by Figs. 2.8 & 2.9 for nonrepairable systems and Figs. 6.17 & 6.18 for repairable systems (with rules (2.28) and (6.174)) apply.

The above rule holds quite generally if the common cause failure acts at the same time on all redundant elements of a redundant structure. From this,

Good protection against common cause failures can only be given if each element of a redundant structure is realized with different technology (tools), electrically, mechanically and thermally separated, and not designed by the same designer (basically true also for software).

Concrete protection against common cause failures must be worked out on a case-by-case basis, see Example 2.3 for a simple practical situation. In verifying such a protection, an FMEA / FMECA (Section 2.6) is mandatory for hardware and software. In some cases, common cause failures can occur with a time delay on elements of a redundant structure (e.g. because of the drop of a cooling ventilator); in these cases,

automatic fault recognition can avoid multiple failures. Some practical considerations on failure rates for common cause failures in electronic equipment are in [A2.6 (61508-6)], giving $\lambda_C / \lambda = 0.005$ as achievable value (rule (6.174)).

6.8.8 General Procedure for Modeling Complex Systems

On the basis of the tools introduced in Appendix A7 and results in Sections 6.8.1 - 6.8.7, the following procedure can be given for reliability and availability investigation of complex systems, both when a reliability block diagram exists or not (for series-parallel structures, Section 6.7 applies, in particular Table 6.10).

1. As a first step operate with (time-homogeneous) Markov processes, i.e. assume that failure and repair rates of all elements are constant during the stay time in every state, and can change (stepwise) only at state changes, e.g. because of change in configuration, component use, stress, repair strategy or other (dropping this assumption leads to non-Markovian processes, as shown e.g. in Section 6.4.2); in a further step, refinements can be considered on a case-by-case basis using semi-regenerative processes.

2. Group series elements and assign to each macro-structure $E_1, \dots, E_n$ a failure rate $\lambda_S = \lambda_1 + \dots + \lambda_n$ and repair (restoration) rate $\mu_S = \lambda_S / (\lambda_1/\mu_1 + \dots + \lambda_n/\mu_n)$ (Table 6.10); a further reduction of the diagram of transition rates is possible in some cases (see pp. 222 & 230 as well as Figs. 6.27 & 6.28, 6.30, 6.37).

3. Perform an FMEA (Section 2.6) to fix all relevant failure modes and to verify actual system capability for recognition, diagnosis, reconfiguration, graceful degradation at failure, and protection against common cause / mode failures.

4. Draw the diagram of transition rates and verify its correctness (see Fig. 6.20 & Fig. 6.34 for two comprehensive examples); of particular importance is the identification of up states which have a direct transition to a down state at system level (e.g. in Fig. 6.20), i.e. of critical operating states.

5. Identify the transition rates between each state (combination of failure and repair rates), by considering assumed repair (restoration) priorities, retained failure modes, and particularities specific to the system considered (dependence between elements, sequence of failure or failure modes, etc.).

Figure 6.37 Example for a reduction of a diagram of transition rates for M7TFS0 calculation (notethat (hO+h,) l ( l+h,~h,)= (l+h,/h,)l(l/hO+llhl))


6. For reliability calculation, the mean time to system failure MTTFsi for system entering state Zi at t = 0 is obtained by solving (Eq. (A7.126))

$\rho_i\, MTTF_{Si} = 1 + \sum_{Z_j \in U,\ j \ne i} \rho_{ij}\, MTTF_{Sj}$, with $\rho_i = \sum_{j=0,\ j \ne i}^{m} \rho_{ij}$, $Z_i \in U$. (6.281)

Thereby, U is the set of up states, $\bar U$ the set of down states ($U \cup \bar U = \{Z_0, \dots, Z_m\}$), $\rho_{ij}$ the transition rate from state $Z_i \in U$ to state $Z_j \in U$, and $\rho_i$ the sum of all transition rates leaving state $Z_i$ (Table 6.2). The system of algebraic equations (6.281) delivers all $MTTF_{Si}$ for any $Z_i \in U$ entered at $t = 0$ (for Markov processes the condition "$Z_i$ is entered at $t = 0$" can be replaced by "system in $Z_i$ at $t = 0$"). At system level,

can often be used (in Zo all elements are operating or ready to operate, i.e. as-good-as-new because of the memoryless Markov property).

7. The asymptotic ($t \to \infty$) & steady-state (stationary) point and average availability $PA_S = AA_S$ is given as

with $P_j$ as solution of (Eq. (A7.127))

$\rho_j P_j = \sum_{i=0,\ i \ne j}^{m} P_i\, \rho_{ij}$, with $P_j > 0$, $\sum_{j=0}^{m} P_j = 1$, $\rho_i = \sum_{j=0,\ j \ne i}^{m} \rho_{ij}$, $j = 0, \dots, m$. (6.284)

In Eq. (6.284), all transition rates $\rho_{ij}$ leading to state $Z_j$ have to be considered. One equation for $P_j$, arbitrarily chosen, must be dropped and replaced by $\sum P_j = 1$ (see Section 6.2.1.5 for further availability figures).

8. Considering the constant failure rate for all elements, the asymptotic & steady-state interval reliability follows as (Eq. (6.27))

9. The asymptotic & steady-state system failure frequency $f_{udS}$ and system mean up time $MUT_S$ are given as (Eqs. (A7.141) & (A7.142))

respectively. U is the set of states considered as up states for $f_{udS}$ and $MUT_S$ calculation, $\bar U$ the complement to the totality of states considered.

The same holds for the system repair (restoration) frequency $f_{duS}$ and the system mean down time $MDT_S$, given as

respectively. $MUT_S$ is the mean of the time in which the system is moving in the set of up states $Z_j \in U$ (e.g. $Z_0$ to $Z_7$ in Fig. 6.20) before a transition to the set of down states $Z_i \in \bar U$ (e.g. $Z_8$ to $Z_{11}$ in Fig. 6.20) occurs, in steady-state or for $t \to \infty$. $MDT_S$ is the mean repair (restoration) duration at system level. $f_{udS}$ is the system failure intensity $z_S(t) = z_S$, as defined by Eq. (A7.230), in steady-state or for $t \to \infty$. It is not difficult to recognize that one has $f_{udS} = f_{duS}$ and thus

$f_{udS} = f_{duS} = z_S = 1 / (MUT_S + MDT_S)$, (6.290)

see Example 6.25 for a practical application. Equations (6.287), (6.289), (6.290) lead to the following important relation

$MDT_S = MUT_S\,(1 - PA_S) / PA_S$. (6.291)

Considering that the asymptotic & steady-state probability $P_0$ is much greater than all other $P_j$, the approximation

can often be used ( for MUTs is not allowed, see example 6.25). Zj EU

10. The asymptotic & steady-state expected instantaneous reward rate MIRs is given by (Eq. (A7.147))

Thereby, $r_i = 0$ for down states, $0 < r_i < 1$ for partially down states, and $r_i = 1$ for up states with 100% performance. The asymptotic & steady-state expected accumulated reward $MAR_S$ follows as (Eq. (A7.148))

$MAR_S(t) = MIR_S \cdot t$. (6.293)

In some cases it can be useful to operate with a time schedule (e.g. Fig. A7.11). Alternative investigation methods (Petri nets, dynamic FTA, computer-aided analysis) are introduced in Section 6.9. As in the previous sections, failure-free time is used as a synonym for failure-free operating time and repair as a synonym for restoration.
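Steps 6 and 7 of the procedure reduce to two small linear systems once the diagram of transition rates is available. The following minimal sketch assumes the usual form of these systems ($\rho_i\,MTTF_{Si} = 1 + \sum \rho_{ij}\,MTTF_{Sj}$ over up states, and the balance equations with one of them replaced by $\sum P_j = 1$). The helper names `solve`, `mttf`, `steady_state` and the numerical values are hypothetical; the example is a 1-out-of-2 active redundancy with one repair crew, chosen because its closed-form results are simple to check.

```python
def solve(a, b):
    """Solve the linear system a*x = b (Gauss-Jordan, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def mttf(rates, up):
    """Step 6: rho_i * M_i = 1 + sum over up states j != i of rho_ij * M_j."""
    up = list(up)
    a = [[0.0] * len(up) for _ in up]
    for row, state in enumerate(up):
        for (src, dst), rate in rates.items():
            if src == state:
                a[row][row] += rate                   # rho_i: all rates leaving Z_i
                if dst in up:
                    a[row][up.index(dst)] -= rate     # -rho_ij toward up states
    return dict(zip(up, solve(a, [1.0] * len(up))))

def steady_state(rates, states):
    """Step 7: balance equations, last one replaced by sum(P_j) = 1."""
    states = list(states)
    n = len(states)
    a = [[0.0] * n for _ in range(n)]
    for (src, dst), rate in rates.items():
        i, j = states.index(src), states.index(dst)
        a[i][i] += rate     # outflow from Z_i
        a[j][i] -= rate     # inflow to Z_j from Z_i
    a[-1] = [1.0] * n
    return dict(zip(states, solve(a, [0.0] * (n - 1) + [1.0])))

# Hypothetical example: 1-out-of-2 active redundancy, one repair crew
lam, mu = 1e-3, 0.5
rates = {("Z0", "Z1"): 2 * lam, ("Z1", "Z0"): mu,
         ("Z1", "Z2"): lam, ("Z2", "Z1"): mu}

m = mttf(rates, up=["Z0", "Z1"])          # Z2 is treated as absorbing here
p = steady_state(rates, ["Z0", "Z1", "Z2"])
availability = p["Z0"] + p["Z1"]
print(m["Z0"], availability)
```

The same rate dictionary serves both computations: for reliability only the rows of the up states are used, so the repair transition out of the down state is ignored automatically.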

6.9 Alternative Investigation Methods

Example 6.25 Investigate $MUT_S$, $MDT_S$, $f_{udS}$, and $f_{duS}$ for the 1-out-of-2 redundancy of Fig. 6.8a.

Solution The solution of Eq. (6.85) with $\dot P_i(t) = 0$ yields (Eq. (6.88))

$P_0 = \mu^2 / [(\lambda+\lambda_r)(\lambda+\mu) + \mu^2]$ and $P_1 = \mu(\lambda+\lambda_r) / [(\lambda+\lambda_r)(\lambda+\mu) + \mu^2]$.

From Fig. 6.8a and Eqs. (6.286)-(6.289) it follows that

For this example it holds that $MUT_S = MTTF_{S1}$ (with $MTTF_{S1}$ as solution of Eq. (6.89) with $P_1(0) = 1$ or Eq. (6.281), see also Example A7.9), this because the system enters state $Z_1$ after each system failure; furthermore, $MDT_S = 1/\mu$ because only one repair crew is available.
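The statements of this example can be checked numerically. The sketch below assumes the steady-state probabilities $P_0$, $P_1$ as given above for the 1-out-of-2 warm redundancy (failure rates $\lambda$, $\lambda_r$, repair rate $\mu$, one repair crew), takes $MUT_S = PA_S / f_{udS}$ and $MDT_S = (1 - PA_S) / f_{duS}$, and uses hypothetical numerical values. The expression for $MTTF_{S1}$ is derived here from the two equations $(\lambda+\lambda_r)\,M_0 = 1 + (\lambda+\lambda_r)\,M_1$ and $(\lambda+\mu)\,M_1 = 1 + \mu\,M_0$.

```python
# Hypothetical values, per hour
lam, lam_r, mu = 1e-3, 2e-4, 0.5

den = (lam + lam_r) * (lam + mu) + mu ** 2
P0 = mu ** 2 / den                     # both elements up
P1 = mu * (lam + lam_r) / den          # one element failed (up state)
P2 = lam * (lam + lam_r) / den         # both failed (down state)

f_udS = P1 * lam      # only the transition Z1 -> Z2 leads from up to down
f_duS = P2 * mu       # only the transition Z2 -> Z1 leads from down to up
MUT_S = (P0 + P1) / f_udS
MDT_S = P2 / f_duS

# MTTF starting in Z1, from the two equations stated in the lead-in
MTTF_S1 = (lam + lam_r + mu) / (lam * (lam + lam_r))

print(f_udS, f_duS, MUT_S, MTTF_S1, MDT_S)
```

The run reproduces $f_{udS} = f_{duS} = 1/(MUT_S + MDT_S)$, $MUT_S = MTTF_{S1}$, and $MDT_S = 1/\mu$ for these values.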


The methods given in Sections 6.1 to 6.8 are based on Markov, semi-Markov and semi-regenerative processes, according to the involved distributions for failure-free and repair (restoration) times. They have the advantage of great flexibility (arbitrary redundancy or repair strategy, incomplete coverage or switch, common cause failures) and transparency. Further tools are known to model repairable systems, e.g. based on Petri nets or dynamic fault trees. For very large or complex systems, numerical solution or Monte Carlo simulation can also become necessary. Many of these tools are similar in performance and versatility (Petri nets are equivalent to Markov models), others have limitations (fault tree analyses are basically limited to totally independent elements and Monte Carlo simulations deliver only numerical solutions), so that the choice of the tool is often related to the personal experience of the analyst (see e.g. [A2.6 (61165, 60300-3-1), 6.30, 2.48 (2005)] for comparisons). However, modeling large systems still requires a close cooperation between project and reliability engineers. In the following, Sections 6.9.1 and 6.9.2 give a short introduction to Petri nets and dynamic fault trees for reliability investigations. Section 6.9.3 considers some aspects of numerical solutions.

6.9.1 Petri Nets

Petri nets (PN) were introduced in 1962 by C. A. Petri [6.35, 6.2, 6.6] to investigate in particular synchronization, sequentiality, concurrency, and conflict in parallel working digital systems. Several extensions have been at the origin of a large literature [6.1, 6.6, 6.8, 6.30, 2.37, 2.48 (1999)]. Important for reliability investigations was the possibility to create algorithmically the diagram of transition rates belonging to a given Petri net. With this, investigation of time behavior on the basis of (time-homogeneous) Markov processes was opened (stochastic Petri nets).


Extension to semi-Markov processes is straightforward [6.8], but less useful for reliability investigations (Sections 6.3 & 6.4). This section gives a short introduction to Petri nets from a reliability analysis point of view. A Petri net (PN) is a directed graph involving 3 kinds of elements:

Places $P_1, \dots, P_n$ (drawn as circles): A place $P_i$ is an input to a transition $T_j$ if an arc exists from $P_i$ to $T_j$, and is an output of a transition $T_k$ if an arc exists from $T_k$ to $P_i$; places may contain tokens (black spots) and a PN with tokens is a marked PN.

Transitions $T_1, \dots, T_m$ (drawn as empty rectangles for timed transitions or bars for immediate transitions); a transition can fire, taking one token from each input place and putting one token in each output place.

Directed arcs: An arc connects a place with a transition or vice versa and has an arrowhead to indicate the direction; multiple arcs are possible and indicate that by firing of the involved transition a corresponding number of tokens is taken from the involved input place (for input multiple arc) or put in the involved output place (for output multiple arc); inhibitor arcs with a circle instead of the arrowhead are also possible and indicate that for firing condition no token must be contained in the corresponding place.

Firing rules for a transition are:

1. A transition is enabled (can fire) only if all places with an input arc to the given transition contain at least one token (no token for inhibitor arcs).

2. Only one transition can fire at a given time; the selection occurs according to the embedded Markov chain describing the stochastic behavior of the PN.

3. Firing of a transition can be immediate or occurs after a time interval $\tau_{ij} > 0$ (timed PN); $\tau_{ij} > 0$ is in general a random variable (stochastic PN) with distribution function $F_{ij}(x)$ when firing occurs from transition $T_i$ to place $P_j$ (yielding a Markov process for $F_{ij}(x) = 1 - e^{-\lambda_{ij} x}$, i.e. with transition rate $\lambda_{ij}$, or a semi-Markov process for $F_{ij}(x)$ arbitrary, with $F_{ij}(0) = 0$).

From rule 3, practically only Markov processes (i.e. constant failure and repair rates) will occur in Petri nets for reliability applications (Section 6.4.2). Two further concepts useful when dealing with Petri nets are those of marking and reachability:

A marking $M = \{m_1, \dots, m_n\}$ gives the number $m_i$ of tokens in the place $P_i$ at a given time point and thus defines the state of the PN; $M_j$ is immediately reachable from $M_i$ if $M_j$ can be obtained by firing a transition enabled by $M_i$.

With $M_0$ as marking at time $t = 0$, $M_1, \dots, M_k$ are all the (different) markings reachable from $M_0$; they define the PN states and give the reachability tree, from which the diagram of transition rates of the corresponding Markov model follows. Figure 6.38 gives some examples of reliability structures with corresponding PN.
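The construction of the reachable markings from the firing rules can be sketched in a few lines. The net below is a hypothetical minimal example (two places, two transitions) for two identical elements with one repair crew; it is not the net drawn in Fig. 6.38, and the data layout (dictionaries with `inputs`, `outputs`, and optional `inhibitors`) is an assumption made for illustration. Timing is left out: only the enabling rule and the breadth-first enumeration of markings are shown.

```python
from collections import deque

def enabled(marking, transition):
    """Firing rule 1: enough tokens in every input place, none in inhibitor places."""
    return (all(marking[p] >= n for p, n in transition["inputs"].items())
            and all(marking[p] == 0 for p in transition.get("inhibitors", ())))

def fire(marking, transition):
    """Return the marking obtained by firing an enabled transition."""
    new = dict(marking)
    for place, n in transition["inputs"].items():
        new[place] -= n
    for place, n in transition["outputs"].items():
        new[place] += n
    return new

def reachability(m0, transitions):
    """Enumerate all markings reachable from m0, with the firing edges between them."""
    key = lambda m: tuple(sorted(m.items()))
    seen = {key(m0): m0}
    edges = []
    queue = deque([m0])
    while queue:
        m = queue.popleft()
        for name, t in transitions.items():
            if enabled(m, t):
                nxt = fire(m, t)
                edges.append((key(m), name, key(nxt)))
                if key(nxt) not in seen:
                    seen[key(nxt)] = nxt
                    queue.append(nxt)
    return list(seen.values()), edges

# Hypothetical net: tokens in "up" are operating elements, tokens in "failed"
# are elements waiting for or under repair
transitions = {
    "failure": {"inputs": {"up": 1}, "outputs": {"failed": 1}},
    "repair": {"inputs": {"failed": 1}, "outputs": {"up": 1}},
}
markings, edges = reachability({"up": 2, "failed": 0}, transitions)
for m in markings:
    print(m)
```

Each reachable marking corresponds to one state of the Markov model, and each edge to one transition of the diagram of transition rates; the rates themselves (e.g. $2\lambda$ from the initial marking) would have to be attached to the edges in a further step.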


1-out-of-2 active, repair priority on $E_\nu$ ($E_1 = E_2 = E$)

Figure 6.38 Top: Reliability block diagram (a), diagram of transition rates (c), Petri net (PN) (b), and reachability tree (d) for a repairable 1-out-of-2 warm redundancy (two identical elements, constant failure ($\lambda$, $\lambda_r$) and repair ($\mu$) rates, one repair (restoration) crew, Markov process); Bottom: Reliability block diagram (a), diagram of transition rates (c), Petri net (b), and reachability tree (d) for a repairable 1-out-of-2 active redundancy with two identical elements and switch in series (constant failure ($\lambda$, $\lambda_\nu$) and repair ($\mu$, $\mu_\nu$) rates, one repair (restoration) crew, repair priority on switch, Markov processes)


6.9.2 Dynamic Fault Tree

A fault tree (FT) is a graphical representation of the conditions or other factors causing or contributing to the occurrence of a defined undesirable event, referred as the top event [A2.6 (IEC 61025)l. In its original form, as introduced in Section 2.6 for failure mode analysis, a fault tree contains only static gates (essentially AND & OR), is termed static fault tree, and can handle combinatorial events (qualitatively, similar as for a FMEA (Section 2.6) or quantitatively, as with Boolean functions (Section 2.3.3). However, as the top event is in general a failure at system level, "0" is used for operating and "1" for failure. This is opposite to the notation used in Sections 2.3.3 & 2.3.4 for reliability investigations based on state space or Boolean functions. In fault trees, OR gates represent thus a series structure and AND gates a parallel structure. Figure 6.39 gives two examples of reliability structures with corresponding static fault trees.

Static fault trees can be used to compute reliability and availability for the case of totally independent elements (active redundancy and each element has its own repair crew), see e.g. [A2.6 (IEC 61025)] for a comprehensive description. Reliability computation for the nonrepairable case (up to system failure) using fault tree analysis (FTA) leads for the series structure to (Eq. (2.17))

$$1 - R_{S0}(t) = 1 - \prod_{i=1}^{n} R_i(t) = 1 - \prod_{i=1}^{n} \bigl(1 - \bar{R}_i(t)\bigr),$$

and for the k-out-of-n active redundancy to (Eq. (2.23))

$$1 - R_{S0}(t) = 1 - \sum_{i=k}^{n} \binom{n}{i} R^{i}(t)\,\bigl(1 - R(t)\bigr)^{n-i} = \sum_{i=0}^{k-1} \binom{n}{i} R^{i}(t)\,\bar{R}^{\,n-i}(t)$$

($\bar{R}_i(t) = 1 - R_i(t)$ = failure probability). For complex structures, computation often uses the method of minimal cut sets (Section 2.3.4).
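As a minimal numerical sketch of the two top-event probabilities (function names are illustrative and not from the text; independent elements are assumed, and identical elements for the k-out-of-n case):

```python
from math import comb, prod

def series_failure_probability(reliabilities):
    """Top event of an OR gate (series structure): 1 - prod(R_i)."""
    return 1.0 - prod(reliabilities)

def k_out_of_n_failure_probability(k, n, r):
    """Top event for a k-out-of-n active redundancy of identical,
    independent elements with reliability r: fewer than k elements are up."""
    return sum(comb(n, i) * r**i * (1.0 - r)**(n - i) for i in range(k))

print(series_failure_probability([0.99, 0.98, 0.95]))
print(k_out_of_n_failure_probability(2, 3, 0.9))
```

The values passed in are the element reliabilities at one fixed time t; repeating the call over a grid of t gives the failure probability as a function of time.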

However, because of their structure, static fault trees cannot handle states or time dependencies (e.g. standby redundancy or repair strategy). For these cases, it is necessary to extend static fault trees by adding so-called dynamic gates, to obtain dynamic fault trees. The most important dynamic gates are [2.85, 6.36, 6.38, A2.6 (IEC 61025)]:

Priority AND gate (PAND), the output event (failure) occurs only if all input events occur and in sequence from left to right.

Sequence enforcing gate (SEQ), the output event occurs only if input events occur in sequence from left to right and there are more than two input events.

Spare gate (SPARE), the output event occurs if the number of spares is less than required.

Further gates (choice gate, redundancy gate, warm spare gate) have been suggested,


[Figure 6.39: panels a) and b), each with block diagram and fault tree; panel a) labeled "2-out-of-3 active ($E_1 = E_2 = E_3 = E$)". Graphics not reproduced.]

Figure 6.39 a) Reliability block diagram and corresponding static fault tree for a 2-out-of-3 active redundancy with switch element in series; b) Functional block diagram and corresponding static fault tree for a redundant computer system [6.30]; "0" is used for operating and "1" for failure

e.g. in [6.38]. All the above dynamic gates require a Markov analysis, i.e. state probabilities must be computed by a Markov approach (constant failure & repair rates), and the results are then used as occurrence probability for the basic event replacing the corresponding dynamic gate. Use of dynamic gates in dynamic fault tree analysis, with corresponding computer programs, has been carefully investigated, e.g. in [2.85, 6.36, 6.38].

Fault tree analysis (FTA) is an established methodology for reliability and availability analysis (emerging in the nineteen-sixties with investigations on nuclear power plants). However, the necessity to use Markov approaches to solve dynamic gates can limit its use in practical applications. The same limits apply to all methods based on binary considerations (fault trees, reliability block diagrams (RBD), binary decision diagrams (BDD), etc.). Nevertheless, reliability block diagrams and fault trees are a valid support in generating transition rate diagrams for Markov analysis. So once more, a combination of investigation tools is often a good way to solve difficult problems.


6.9.3 Computer-Aided Reliability and Availability Computation

Investigation of large series-parallel structures or of complex systems (for which a reliability block diagram often does not exist) is in general time-consuming and can become mathematically intractable. A large number of computer programs for numerical solution of reliability and availability equations as well as for Monte Carlo simulation have been developed. Such a numerical computation can be in some cases the only way to get results. Section 6.9.3.1 discusses requirements for a versatile program for the numerical solution of reliability and availability equations. Section 6.9.3.2 gives basic considerations on Monte Carlo simulation and introduces an approach useful for rare events. Although appealing, numerical solutions can deliver only case-by-case solutions and can cause problems (instabilities in the presence of sparse matrices, prohibitive run times for Monte Carlo simulation of rare events or if confidence limits are required). As a general rule, analytical solutions (Sections 6.2 - 6.6, 6.8) or approximate expressions (Section 6.7) should be preferred whenever possible.

6.9.3.1 Numerical Solution of Equations for Reliability and Availability

Analytical solution of algebraic or differential / integral equations for reliability and availability computation of large or complex systems can become time-consuming. Software tools exist to solve this kind of problem. From such a software package one generally expects high completeness, usability, robustness, integrity, and portability (Table 5.4). The following is a comprehensive list of requirements:

General requirements:

1. Support interface with CAD/CAE and configuration management packages.

2. Provide a large component data bank with the possibility for manufacturer and company-specific labeling, and storage of non application-specific data.

3. Support different failure rate models [2.21 - 2.29].

4. Have flexible output (regarding medium, sorting capability, weighting), graphic interface, single & multi-user capability, high usability & integrity.

5. Be portable to different platforms.

Specific for nonrepairable (up to system failure) systems:

1. Consider reliability block diagrams (RBD) of arbitrary complexity and with a large number of elements ($\ge 1{,}000$) and levels ($\ge 10$); possibility for any element to appear more than once in the RBD; automatic editing of series and parallel models; powerful method to handle complex structures; constant or time dependent failure rate for each element; possibility to handle as element macro-structures or items with more than one failure mode.


2. Easy editing of application-specific data, with user features such as: automatic computation of the ambient temperature at component level with freely selectable temperature difference between elements, freely selectable duty cycle from the system level downwards, global change of environmental and quality factors, manual selection of stress factors for tradeoff studies or risk assessment, manual introduction of field data and of default values for component families or assemblies.

3. Allow reuse of elements with arbitrary complexity in a RBD (libraries).

Specific for repairable systems:

1. Consider elements with constant failure rate and constant or arbitrary repair rate, i.e. handle Markov, semi-Markov, and (as far as possible) semi-regenerative processes.

2. Have automatic generation of the transition rates $\rho_{ij}$ for Markov models and of the involved semi-Markov transition probabilities $Q_{ij}(x)$ for systems with constant failure rates, one repair crew, and arbitrary repair rate (starting e.g. from a given set of successful paths); automatic generation and solution of the equations describing the system's behavior.

3. Allow different repair strategies (first-in first-out, one repair crew or other).

4. Use sophisticated algorithms for quick inversion of sparse matrices.

5. Consider at least 20,000 states for the exact solution of the asymptotic & steady-state availability $PA_S = AA_S$ and mean time to system failure $MTTF_{S0}$.

6. Support investigations yielding approximate expressions (macro-structures, totally independent elements, cutting states or other, see Section 6.7.1).

A scientific software package satisfying many of the above requirements has been developed at the Reliability Lab. of the ETH [2.50]. Refinement of the requirements is possible. For basic reliability computation, commercial programs are available [2.51 - 2.60]. Specialized programs are e.g. in [2.7, 2.17, 2.59, 2.85, 6.23, 6.24, 6.42]; considerations on numerical methods for reliability evaluation are e.g. in [2.56].

6.9.3.2 Monte Carlo Simulations

The Monte Carlo technique is a numerical method based on a probabilistic interpretation of quantities obtained from algorithmically generated random variables. It was introduced in 1949 by N. Metropolis and S. Ulam [6.32]. Since this first paper, a large amount of literature has been published, see e.g. [6.12, 6.31, A7.18]. This section deals with some basic considerations on Monte Carlo simulation useful for reliability engineering and gives an approach for the simulation of rare events which avoids the difficulty of time truncation because of amplitude quantization of the digital number used.


For reliability purposes, a Monte Carlo simulation can basically be used to estimate a value (e.g. an unknown probability) or simulate (reproduce) the stochastic process describing the behavior of a complex system. In this sense, a Monte Carlo simulation is useful to achieve results, numerically verify an analytical solution, get an idea of the possible time behavior of a complex system, or determine interaction among variables. Two main problems related to Monte Carlo simulation are the generation of uniformly distributed random numbers in the interval (0,1) and the transformation of these numbers into random variables with prescribed distribution function. A congruential relation

$$\varsigma_{n+1} = (a\,\varsigma_n + b) \bmod m, \qquad (6.296)$$

where mod is used for modulo, is frequently used to generate pseudo-random numbers (for simplicity, pseudo will be omitted in the following). Transformation to an arbitrary distribution function $F(x)$ is often performed with help of the inverse function $F^{-1}(x)$, see Example A6.17. The method of the inverse function is simple but not necessarily good enough for critical applications.
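The two steps can be sketched as follows. The constants a = 16807, b = 0, m = 2^31 - 1 are a classic textbook choice and an assumption here, not values from the text; the exponential distribution is used only as an example of a distribution with a closed-form inverse.

```python
import math

class CongruentialGenerator:
    """Pseudo-random numbers from s_{n+1} = (a*s_n + b) mod m, scaled to (0, 1)."""

    def __init__(self, seed=1, a=16807, b=0, m=2**31 - 1):
        # a, b, m: illustrative constants (assumption, not from the text)
        self.a, self.b, self.m = a, b, m
        self.state = seed % m or 1  # avoid the absorbing state 0 when b = 0

    def uniform(self):
        self.state = (self.a * self.state + self.b) % self.m
        return self.state / self.m

def exponential_by_inversion(u, lam):
    """Apply the inverse of F(x) = 1 - exp(-lam*x) to a uniform number u."""
    return -math.log(1.0 - u) / lam

gen = CongruentialGenerator(seed=1)
print([exponential_by_inversion(gen.uniform(), 2.0) for _ in range(3)])
```

For production work a well-tested generator (for instance the one behind Python's `random` module) would normally be preferred over a hand-written congruential relation.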

A further question arising with Monte Carlo simulation is that of how many repetitions n must be run to have an estimate of the unknown quantity within a given interval $\pm\,\varepsilon$ at a given confidence level $\gamma$. For the case of an event with probability p and assuming n sufficiently large as well as p or (1 - p) not very small, Eq. (A6.152) yields for p known

$$n = \Bigl(\frac{t_{(1+\gamma)/2}}{\varepsilon}\Bigr)^{2} p\,(1-p),$$

where $t_{(1+\gamma)/2}$ is the $(1+\gamma)/2$ quantile of the standard normal distribution; for instance, $t_{(1+\gamma)/2} = 1.645$ for $\gamma = 0.9$ and 1.96 for $\gamma = 0.95$ (Appendix A9.1). For p totally unknown, the value p = 0.5 has to be taken. Knowing the number of realizations k in n trials, Eq. (A8.43) can be used to find confidence limits for p.
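A short sketch of this sample-size rule, assuming the normal approximation holds; the function name and the rounding up to the next integer are choices made here for illustration.

```python
from math import ceil
from statistics import NormalDist

def monte_carlo_repetitions(eps, gamma, p=0.5):
    """Repetitions n so that the estimate of p lies within +/- eps
    with confidence gamma (p = 0.5 is the worst case for unknown p)."""
    t = NormalDist().inv_cdf((1.0 + gamma) / 2.0)
    return ceil((t / eps) ** 2 * p * (1.0 - p))

print(monte_carlo_repetitions(0.01, 0.95))   # p unknown, worst case
print(monte_carlo_repetitions(0.01, 0.95, p=0.1))
```

The quadratic growth of n with 1/eps is what makes Monte Carlo estimation of small probabilities expensive when a tight interval is required.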

To simulate (reproduce) a time-homogeneous Markov process, the following procedure is useful, starting with a transition into state $Z_i$ at the arbitrary time $t = 0$:

1. Select the next state $Z_j$ to be visited by generating an event with probability

$$P_{ij} = \frac{\rho_{ij}}{\rho_i}, \quad j \ne i \qquad \Bigl(\rho_i = \sum_{j \ne i} \rho_{ij}, \;\; P_{ii} \equiv 0\Bigr),$$

according to the embedded Markov chain (for uniformly distributed random numbers $\zeta$ in (0,1) it holds that $\Pr\{\zeta \le x\} = x$).

2. Find the stay time (sojourn time) in state $Z_i$ up to the jump to the next state $Z_j$ by generating a random variable with distribution function (Example A6.17)

$$F_{ij}(x) = 1 - e^{-\rho_i x}.$$

3. Jump to state $Z_j$.
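The three steps above can be sketched as follows. The nested dictionary `rates[i][j]` holding the transition rates, the function name, and the use of Python's built-in generator are assumptions made for this illustration.

```python
import math
import random

def simulate_markov(rates, start, t_end, rng=random):
    """rates[i][j] = transition rate from state i to state j (j != i).
    Returns the list of (entry time, state) pairs visited in [0, t_end]."""
    t, state, path = 0.0, start, [(0.0, start)]
    while True:
        out = rates.get(state, {})
        rho_i = sum(out.values())
        if rho_i == 0:                      # absorbing state, nothing to leave
            break
        # Step 1: next state from the embedded chain, P_ij = rho_ij / rho_i
        u, acc = rng.random() * rho_i, 0.0
        for nxt, rho_ij in out.items():
            acc += rho_ij
            if u < acc:
                break
        # Step 2: stay time with F(x) = 1 - exp(-rho_i * x), by inversion
        t += -math.log(1.0 - rng.random()) / rho_i
        if t > t_end:
            break
        # Step 3: jump to the selected state
        state = nxt
        path.append((t, state))
    return path

# Repairable item: state 0 = up, state 1 = down (rates chosen arbitrarily)
print(simulate_markov({0: {1: 0.5}, 1: {0: 2.0}}, 0, 10.0))
```

Only two random numbers are consumed per transition, which is the advantage noted for this procedure.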

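[Figure 6.40: blocks are a generator of uniformly distributed random numbers, a comparator, a function generator creating $\lambda_k$, and a reset path; the output is a pulse train with a pulse at each renewal point $S_1, S_2, \ldots$. Graphics not reproduced.]

Figure 6.40 Block diagram of the programmable generator for renewal processes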

Extension to semi-Markov processes is easy [A7.2 (1974 & 1977)]. For semi-regenerative processes, states visited during a cycle must be considered (see e.g. Fig. A7.11). The advantage of this procedure is that transition sequence and stay (sojourn) times are generated with only a few random numbers. A disadvantage is that the stay times are truncated because of the amplitude quantization of $F_{ij}(x)$.

To avoid truncation problems, in particular when dealing with rare events distributed on the time axis, an alternative approach implemented as hardware generator for semi-Markov processes in [A7.2 (1974 & 1977)] can be used. To illustrate the basic idea, Fig. 6.40 shows the structure of the generator for renewal processes. The programmable generator for arbitrary renewal processes is driven by a clock $\Delta t = \Delta x$ and consists of three main elements:


a generator for (pseudo-) random numbers $\zeta_i$ uniformly distributed in (0,1); a comparator, comparing at each clock the actual random number $\zeta_i$ with $\lambda_k$ and giving an output pulse, marking a renewal point, for $\zeta_i < \lambda_k$; a function generator creating $\lambda_k$ and starting with $\lambda_1$ at each renewal point.

It can be shown ($\lambda_k \equiv w_k$ in [A7.2 (1974 & 1977)]) that for

$$\lambda_k = \frac{F(k\,\Delta x) - F((k-1)\,\Delta x)}{1 - F((k-1)\,\Delta x)}, \qquad k = 1, 2, \ldots,$$

the sequence of output pulses constitutes a realization of an ordinary renewal process with distribution function $F(k\,\Delta x)$ for the times between successive renewal points. $\lambda_k$ is the failure rate belonging to the arithmetic random variable with distribution function $F(k\,\Delta x)$ (p. 405, Appendix A7.2). Generated random times are in this case not truncated, since the last part of $F(k\,\Delta x)$ can be approximated by a geometric distribution ($\lambda_k$ constant as per Eq. (A6.132)). A software implementation of the approach shown by Fig. 6.40 is easy, and hardware limitations disappear.
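A possible software version of the clocked generator is sketched below; the function name, the argument layout, and the guard for the case where F has already reached 1 are assumptions of this illustration.

```python
import math
import random

def renewal_points(F, dx, n_clocks, rng=random):
    """At each clock tick a uniform number is compared with the discrete
    failure rate lambda_k; a pulse (renewal point) is produced when the
    number falls below lambda_k, and k restarts at 1."""
    points, k = [], 1
    for tick in range(1, n_clocks + 1):
        prev = F((k - 1) * dx)
        # lambda_k = [F(k dx) - F((k-1) dx)] / [1 - F((k-1) dx)]
        lam_k = (F(k * dx) - prev) / (1.0 - prev) if prev < 1.0 else 1.0
        if rng.random() < lam_k:
            points.append(tick * dx)   # renewal point S_1, S_2, ...
            k = 1                      # function generator restarts with lambda_1
        else:
            k += 1
    return points

# Exponential F gives a constant lambda_k, i.e. geometric intervals
pts = renewal_points(lambda x: 1.0 - math.exp(-x), 0.01, 100000)
print(len(pts))
```

Because the comparison is repeated at every clock tick, arbitrarily long intervals can occur; no inverse function with finite resolution is involved.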



The homogeneous Poisson process (HPP) is a particular renewal process (Appendix A7.2.5) and can thus be generated (reproduced) with the generator given by Fig. 6.40 ($\lambda_k$ is constant, and the generated random time intervals have a geometric distribution). For a nonhomogeneous Poisson process (NHPP) with mean value function $M(t) = \mathrm{E}[\nu(t)]$, generation can be based on the considerations given at the end of Appendix A7.8.2 (for fixed $t = T$, generate k according to a Poisson distribution with parameter $M(T)$ (Eq. (A7.190)) and then k random variables with density $m(t)/M(T)$; the ordered values are the k occurrence times of the NHPP on $(0, T)$).
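The NHPP recipe can be sketched for one concrete case. The power-law mean value function M(t) = a t^b is an assumption chosen here because M(t)/M(T) = (t/T)^b has a closed-form inverse; the Poisson count is drawn with the simple multiplication method, adequate for moderate M(T) only.

```python
import math
import random

def poisson_variate(mean, rng=random):
    """Poisson-distributed integer by multiplying uniform numbers
    until the product drops below exp(-mean)."""
    limit, k, product = math.exp(-mean), 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

def nhpp_power_law(a, b, T, rng=random):
    """Occurrence times on (0, T) of an NHPP with M(t) = a * t**b
    (illustrative choice of M)."""
    k = poisson_variate(a * T**b, rng)
    # each point has distribution function M(t)/M(T) = (t/T)**b
    return sorted(T * rng.random() ** (1.0 / b) for _ in range(k))

print(nhpp_power_law(2.0, 1.5, 10.0)[:5])
```

For another M(t), only the inversion in the last line changes; if no closed-form inverse exists, a numerical root search can take its place.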

7 Statistical Quality Control & Reliability Tests

Statistical quality control and reliability tests are performed to estimate or demonstrate quality and reliability characteristics on the basis of data collected from sampling tests. Estimation leads to a point or interval estimate (marked with ^ in this book), demonstration is a test of a given hypothesis on the unknown characteristic. Estimation and demonstration of an unknown probability is investigated in Section 7.1 for the case of a defective probability p and in Section 7.2.1 for some reliability figures. Procedures for availability estimation and demonstration for the case of continuous operation are given in Section 7.2.2. Estimation and demonstration of a constant failure rate $\lambda$ (or MTBF for the case $MTBF = 1/\lambda$) are discussed in depth in Section 7.2.3. The case of an MTTR is considered in Section 7.3. Basic models for accelerated tests are discussed in Section 7.4. Goodness-of-fit tests based on graphical and analytical procedures are summarized in Section 7.5. Some considerations on general reliability data analysis, with tests on nonhomogeneous Poisson processes and trend tests, are given in Section 7.6. Models for reliability growth are introduced in Section 7.7. To simplify the notation, sample is used for random sample and the index S, referring to system, is omitted in this chapter (MTBF instead of $MTBF_{S0}$ and $\lambda$ or PA instead of $\lambda_S$ or $PA_S$). Theoretical foundations for this chapter are in Appendix A8. Selected examples illustrate the practical aspects.

7.1 Statistical Quality Control

One of the main purposes of statistical quality control is to use sampling tests to estimate or demonstrate the defective probability p of a given item, to a required accuracy and often on the basis of tests by attributes (i.e. tests of type good / bad). However, considering p as an unknown probability, a broader field of applications can be covered by the same methods. Other tasks, such as tests by variables and statistical process control [7.1 - 7.5], are not considered hereafter.

In this section, p will be considered as a defective probability (fraction of defective items). It will be assumed that p is the same for each element in the sample considered and that each sample element is statistically independent from each other. These assumptions presuppose that the lot is homogeneous and much larger than the sample. They allow the use of the binomial distribution (Appendix A6.10.7).


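7.1.1 Estimation of a Defective Probability p

Let n be the size of a (random) sample from a large homogeneous lot. If k defective items have been observed within the sample of size n, then

$$\hat{p} = \frac{k}{n} \qquad (7.1)$$

is the maximum likelihood point estimate of the defective probability p for an item in the lot under consideration, see Eq. (A8.29). For a given confidence level $\gamma = 1 - \beta_1 - \beta_2$ ($0 < \beta_1 < 1 - \beta_2 < 1$), the lower limit $\hat{p}_l$ and the upper limit $\hat{p}_u$ of the confidence interval of p can be obtained from

$$\sum_{i=k}^{n} \binom{n}{i} \hat{p}_l^{\,i} (1 - \hat{p}_l)^{n-i} = \beta_2 \quad\text{and}\quad \sum_{i=0}^{k} \binom{n}{i} \hat{p}_u^{\,i} (1 - \hat{p}_u)^{n-i} = \beta_1, \qquad (7.2)$$

for $0 < k < n$, and from

$$\hat{p}_l = 0 \quad\text{and}\quad \hat{p}_u = 1 - \sqrt[n]{\beta_1} \quad\text{for } k = 0 \quad (\gamma = 1 - \beta_1), \qquad (7.3)$$

or from

$$\hat{p}_l = \sqrt[n]{\beta_2} \quad\text{and}\quad \hat{p}_u = 1 \quad\text{for } k = n \quad (\gamma = 1 - \beta_2), \qquad (7.4)$$

see Eqs. (A8.37) to (A8.40) and the remarks given there. $\beta_1$ is the risk that the true value of p is larger than $\hat{p}_u$ and $\beta_2$ the risk that the true value of p is smaller than $\hat{p}_l$. The confidence level is nearly equal to (but not less than) $\gamma = 1 - \beta_1 - \beta_2$. It can be considered as the relative frequency of cases in which the interval $[\hat{p}_l, \hat{p}_u]$ overlaps (covers) the true value of p, in an increasing series of repetitions of the experiment of taking a random sample of size n.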
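Equations (7.2) to (7.4) have no closed-form solution for 0 < k < n, but a root search is straightforward. The sketch below uses plain bisection; the helper names and the tolerance are choices made for this illustration.

```python
from math import comb

def binom_cdf(c, n, p):
    """Pr{no more than c events in n Bernoulli trials | p}."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(c + 1))

def _root_decreasing(f, lo=0.0, hi=1.0, tol=1e-10):
    """Bisection for a function that decreases in p on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def exact_confidence_limits(k, n, beta1, beta2):
    """Lower and upper confidence limits for p as per Eqs. (7.2)-(7.4)."""
    # sum_{i=k}^{n} (...) = beta2  is the same as  binom_cdf(k-1) = 1 - beta2
    lower = 0.0 if k == 0 else _root_decreasing(
        lambda p: binom_cdf(k - 1, n, p) - (1.0 - beta2))
    upper = 1.0 if k == n else _root_decreasing(
        lambda p: binom_cdf(k, n, p) - beta1)
    return lower, upper

print(exact_confidence_limits(5, 25, 0.05, 0.05))
```

For k = 0 and k = n the search reproduces the closed forms of Eqs. (7.3) and (7.4), which gives a simple consistency check.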

In many practical applications, a graphical determination of $\hat{p}_l$ and $\hat{p}_u$ is sufficient. The upper diagram in Fig. 7.1 can be used for $\beta_1 = \beta_2 = 0.05$, the lower diagram for $\beta_1 = \beta_2 = 0.1$ ($\gamma = 0.9$ and $\gamma = 0.8$, respectively). The continuous lines in Fig. 7.1 are the envelopes of staircase functions (k, n integer) given by Eq. (7.2). They converge rapidly, for $\min(np, n(1-p)) \ge 5$, to the confidence ellipses (dashed lines in Fig. 7.1). Using the confidence ellipses (Eq. (A8.42)), $\hat{p}_l$ and $\hat{p}_u$ can be calculated from

$$\hat{p}_{u,l} = \frac{k + 0.5\,b^2 \pm b\,\sqrt{k\,(1 - k/n) + b^2/4}}{n + b^2}. \qquad (7.5)$$

b is the $(1+\gamma)/2$ quantile of the standard normal distribution $\Phi(t)$, given for some typical values of $\gamma$ by (Table A9.1)

$\gamma$ =  0.6    0.8    0.9    0.95   0.98   0.99
b =  0.84   1.28   1.64   1.96   2.33   2.58
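A direct transcription of Eq. (7.5), as a sketch (the function name is illustrative); b is passed in explicitly so that rounded table values such as 1.64 can be used.

```python
from math import sqrt

def ellipse_confidence_limits(k, n, b):
    """Approximate lower and upper confidence limits for p from the
    confidence ellipse, Eq. (7.5); b is the (1 + gamma)/2 normal quantile."""
    centre = k + 0.5 * b * b
    half_width = b * sqrt(k * (1.0 - k / n) + b * b / 4.0)
    return (centre - half_width) / (n + b * b), (centre + half_width) / (n + b * b)

print(ellipse_confidence_limits(5, 25, 1.64))   # gamma = 0.9
print(ellipse_confidence_limits(5, 25, 1.28))   # gamma = 0.8
```

With n = 25 and k = 5 the condition min(np, n(1 - p)) >= 5 is only just met, which is why these limits differ visibly from the exact ones.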


Figure 7.1 Confidence limits $\hat{p}_l$ and $\hat{p}_u$ for an unknown probability p (e.g. defective probability) as a function of the observed relative frequency k/n (n = sample size, k = observed events, $\gamma$ = confidence level = $1 - \beta_1 - \beta_2$, here with $\beta_1 = \beta_2$; continuous lines are the exact solution (Eqs. (7.2) - (7.4)), dashed lines the confidence ellipses (Eqs. (7.5), (A8.42), (A6.149)))

Example: n = 25, k = 5 gives $\hat{p}$ = k/n = 0.2 and for $\gamma = 0.9$ the confidence interval [0.08, 0.38] ([0.0823, 0.3754] using Eq. (7.2), and [0.1011, 0.3572] using Eq. (7.5))

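The confidence limits $\hat{p}_l$ and $\hat{p}_u$ can also be used as one-sided confidence intervals. In this case,

$$0 \le p \le \hat{p}_u \;\;(\text{or simply } p \le \hat{p}_u), \text{ with } \gamma = 1 - \beta_1, \qquad \hat{p}_l \le p \le 1 \;\;(\text{or simply } p \ge \hat{p}_l), \text{ with } \gamma = 1 - \beta_2. \qquad (7.6)$$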

Example 7.1

In a sample of size n = 25, exactly k = 5 items were found to be defective. Determine for the underlying defective probability p, (i) the point estimate, (ii) the interval estimate for $\gamma = 0.8$ ($\beta_1 = \beta_2 = 0.1$), (iii) the upper bound of p for a one-sided confidence interval with $\gamma = 0.9$.

Solution (i) Equation (7.1) yields the point estimate $\hat{p} = 5/25 = 0.2$. (ii) For the interval estimate, the lower part of Fig. 7.1 leads to the confidence interval [0.10, 0.34], [0.1006, 0.3397] using Eq. (7.2) and [0.1175, 0.3194] using Eq. (7.5). (iii) With $\gamma = 0.9$ it holds $p \le 0.34$.

Supplementary result: The upper part of Fig. 7.1 would lead to $p \le 0.38$ with $\gamma = 0.95$.

The role of k/n and p can be reversed and Eq. (7.5) can be used to calculate the limits $k_1$ and $k_2$ of the number of observations k in n independent trials (e.g. the number k of defective items in a sample of size n) for given probability $\gamma = 1 - \beta_1 - \beta_2$ (with $\beta_1 = \beta_2$) and known values of p and n (Eq. (A8.45))

$$k_{2,1} = n\,p \pm b\,\sqrt{n\,p\,(1-p)}. \qquad (7.7)$$

As in Eq. (7.5), the quantity b in Eq. (7.7) is the $(1+\gamma)/2$ quantile of the standard normal distribution (e.g. b = 1.64 for $\gamma = 0.9$, Table A9.1). For a graphical solution, Fig. 7.1 can be used by taking the ordinate p as known, and by reading $k_1/n$ and $k_2/n$ from the abscissa.

7.1.2 Simple Two-sided Sampling Plans for the Demonstration of a Defective Probability p

In the context of acceptance testing, the demonstration of a defective probability p is often required, instead of its estimation (Section 7.1.1). The main concern of this test is to check a null hypothesis $H_0: p < p_0$ against an alternative hypothesis $H_1: p > p_1$ on the basis of the following agreement between producer and consumer:

The lot should be accepted with a probability nearly equal to (but not less than) $1 - \alpha$ if the true (unknown) defective probability p is lower than $p_0$ but rejected with a probability nearly equal to (but not less than) $1 - \beta$ if p is greater than $p_1$ ($p_0$, $p_1 > p_0$, and $0 < \alpha < 1 - \beta < 1$ are given (fixed) values).

$p_0$ is the specified defective probability and $p_1$ is the maximum acceptable defective probability. $\alpha$ is the allowed producer's risk (type I error), i.e. the probability of rejecting a true hypothesis $H_0: p < p_0$. $\beta$ is the allowed consumer's risk (type II error), i.e. the probability of accepting the hypothesis $H_0: p < p_0$ when the alternative hypothesis $H_1: p > p_1$ is true. Verification of the agreement stated above is a problem of statistical hypothesis testing (Appendix A8.3) and can be performed, for instance, with a simple two-sided sampling plan or a sequential test. In both cases, the basic model is the sequence of Bernoulli trials, as introduced in Appendix A6.10.7.

7.1.2.1 Simple Two-sided Sampling Plans

The procedure (test plan) for the simple two-sided sampling plan is as follows (Appendix A8.3.1.1):

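1. From $p_0$, $p_1$, $\alpha$, and $\beta$, determine the smallest integers c and n which satisfy

$$\sum_{i=0}^{c} \binom{n}{i} p_0^{\,i} (1 - p_0)^{n-i} \ge 1 - \alpha \qquad (7.8)$$

and

$$\sum_{i=0}^{c} \binom{n}{i} p_1^{\,i} (1 - p_1)^{n-i} \le \beta. \qquad (7.9)$$

2. Take a sample of size n, determine the number k of defective items in the sample, and

reject $H_0: p < p_0$, if $k > c$,

accept $H_0: p < p_0$, if $k \le c$.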
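Step 1 can be carried out by direct search on the binomial sums. The sketch below is one possible search order, an assumption of this illustration: for each c it takes the smallest n that meets inequality (7.9) and then checks inequality (7.8), returning the first c for which both hold.

```python
from math import comb

def acceptance_probability(c, n, p):
    """Pr{no more than c defective items in a sample of size n | p}."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(c + 1))

def simple_two_sided_plan(p0, p1, alpha, beta, c_max=100):
    """Smallest c, with the smallest matching n, meeting (7.8) and (7.9)."""
    for c in range(c_max + 1):
        n = c + 1
        while acceptance_probability(c, n, p1) > beta:        # enforce (7.9)
            n += 1
        if acceptance_probability(c, n, p0) >= 1.0 - alpha:   # check (7.8)
            return c, n
    raise ValueError("no plan found up to c_max")

c, n = simple_two_sided_plan(0.05, 0.15, 0.1, 0.1)
print(c, n, acceptance_probability(c, n, 0.05), acceptance_probability(c, n, 0.15))
```

Since the acceptance probability decreases with n for fixed c, the smallest n meeting (7.9) is also the most favorable one for (7.8), so no larger n needs to be tried for that c.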

The graph of Fig. 7.2 visualizes the validity of the above rule (see Appendix A8.3.1.1 for a proof). It satisfies the inequalities (7.8) and (7.9), and is known as operating characteristic curve. For each value of p, it gives the probability of having no more than c defective items in a sample of size n. Since the operating characteristic curve as a function of p decreases monotonically, the risk for a false decision decreases for $p < p_0$ and $p > p_1$, respectively. It can be shown that the quantities c and $n\,p_0$ depend only on $\alpha$, $\beta$, and the ratio $p_1/p_0$ (discrimination ratio). Table 7.3 (p. 301) gives c and $n\,p_0$ for some important values of $\alpha$, $\beta$ and $p_1/p_0$ for the case where the Poisson approximation (Eq. (A6.129)) applies.

Using the operating characteristic, the Average Outgoing Quality (AOQ) can be calculated. AOQ represents the percentage of defective items that reach the customer, assuming that all rejected samples have been 100% inspected and that the defective items have been replaced by good ones, and is given by

$$AOQ = p\,\Pr\{\text{acceptance} \mid p\} = p \sum_{i=0}^{c} \binom{n}{i} p^{i} (1-p)^{n-i}. \qquad (7.11)$$

The maximum value of AOQ is the Average Outgoing Quality Limit [7.4, 7.5].

[Figure 7.2: ordinate Pr{acceptance | p} = Pr{no more than c defects in a sample of size n | p}, from 0 to 1.0; abscissa p, marked at 0.05 and 0.1. Graphics not reproduced.]

Figure 7.2 Operating characteristic curve as a function of the defective probability p for given (fixed) n and c ($p_0 = 2\%$, $p_1 = 4\%$, $\alpha = \beta = 0.1$; n = 510 and c = 14 as per Table 7.3)

Obtaining the solution of inequalities (7.8) and (7.9) is time-consuming. For small values of $p_0$ & $p_1$ (up to a few %), the Poisson approximation (Eq. (A6.129))

$$\binom{n}{i} p^{i} (1-p)^{n-i} \approx \frac{(n\,p)^{i}}{i!}\, e^{-n p} \qquad (7.12)$$

can be used. Substituting the approximate value obtained by Eq. (7.12) in Eqs. (7.8) and (7.9) leads to a Poisson distribution with parameters $m_1 = n\,p_1$ and $m_0 = n\,p_0$, which can be solved using a table of the $\chi^2$ distribution (Table A9.2). Alternatively, the curves of Fig. 7.3 provide graphical solutions, sufficiently good for practical applications. Exact solutions are in Table 7.3 (p. 301).
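The AOQ of Eq. (7.11) and its maximum can be evaluated numerically. In the sketch below the grid search for the maximum, its step count, and the upper bound `p_max` are assumptions made for illustration.

```python
from math import comb

def aoq(p, n, c):
    """Average outgoing quality p * Pr{acceptance | p}, Eq. (7.11)."""
    accept = sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(c + 1))
    return p * accept

def aoql(n, c, steps=2000, p_max=0.2):
    """Largest AOQ found on an evenly spaced grid of p in [0, p_max]."""
    return max(aoq(j * p_max / steps, n, c) for j in range(steps + 1))

# Plan of Fig. 7.2: n = 510, c = 14
print(aoq(0.02, 510, 14), aoq(0.04, 510, 14), aoql(510, 14))
```

AOQ rises with p while the lot is almost always accepted and falls again once rejection (followed by full inspection) dominates, so the maximum lies between the two regimes.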

Example 7.2

Determine the sample size n and the number of allowed defective items c to test the null hypothesis $H_0: p < p_0 = 1\%$ against the alternative hypothesis $H_1: p > p_1 = 2\%$ with producer and consumer risks $\alpha = \beta = 0.1$ (which means $\alpha \approx \beta \lesssim 0.1$).

Solution For $\alpha = \beta = 0.1$, Table A9.2 yields $\nu = 30$ (value of $\nu$ for which $t_{\nu, q_1} / t_{\nu, q_2} = 2$ with $q_1 \ge 1 - \alpha = 0.9$ and $q_2 \le \beta = 0.1$) and, with linear interpolation, $F(20.4) = 0.095 < \beta$ and $F(40.8) = 0.908 > 1 - \alpha$ ($\nu = 28$ falls just short). Thus $c = \nu/2 - 1 = 14$ and $n = 20.4 / (2 \cdot 0.01) = 1020$. The values of c and n according to Table 7.3 would be c = 14 and n = 10.17 / 0.01 = 1017. Using the graph of Fig. 7.3 yields practically the same result: c = 14, $m_0 = 10.2$ and $m_1 = 20.4$ for $\alpha = \beta = 0.1$. Both the analytical and graphical methods require a solution by successive approximation (choice of c and check of conditions for $\alpha$ and $\beta$ by considering the ratio $p_1/p_0$).
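The successive approximation can be automated under the Poisson approximation. In this sketch (names and bisection bounds are assumptions) the smallest m1 = n p1 meeting the consumer risk is found for each c, and the producer risk is then checked at m0 = m1 / ratio; the returned m0 therefore sits at the lower end of the admissible range and can differ slightly from tabulated values.

```python
from math import exp

def poisson_cdf(c, m):
    """Pr{no more than c events} for a Poisson distribution with mean m."""
    term = total = exp(-m)
    for i in range(1, c + 1):
        term *= m / i
        total += term
    return total

def poisson_plan(ratio, alpha, beta, c_max=200):
    """Smallest c with m1 = n*p1 and m0 = m1/ratio meeting both risks."""
    for c in range(c_max + 1):
        lo, hi = 0.0, 10.0 * (c + 1) + 50.0
        while hi - lo > 1e-9:              # solve poisson_cdf(c, m1) = beta
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if poisson_cdf(c, mid) > beta else (lo, mid)
        m1 = hi
        m0 = m1 / ratio
        if poisson_cdf(c, m0) >= 1.0 - alpha:
            return c, m0, m1
    raise ValueError("no plan found up to c_max")

c, m0, m1 = poisson_plan(2.0, 0.1, 0.1)
print(c, m0, m1, "n about", m0 / 0.01)
```

For a discrimination ratio of 2 and both risks at 0.1 the search stops at c = 14, in line with the example; dividing m0 by p0 then gives the sample size.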


7.1.2.2 Sequential Tests

The procedure for a sequential test is as follows (Appendix A8.3.1.2):

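1. In a Cartesian coordinate system draw the acceptance line $y_1(n) = a\,n - b_1$ and the rejection line $y_2(n) = a\,n + b_2$, with

$$a = \frac{\ln\dfrac{1-p_0}{1-p_1}}{\ln\dfrac{p_1 (1-p_0)}{p_0 (1-p_1)}}, \qquad b_1 = \frac{\ln\dfrac{1-\alpha}{\beta}}{\ln\dfrac{p_1 (1-p_0)}{p_0 (1-p_1)}}, \qquad b_2 = \frac{\ln\dfrac{1-\beta}{\alpha}}{\ln\dfrac{p_1 (1-p_0)}{p_0 (1-p_1)}}.$$

2. Select one item after another from the lot, test the item, enter the test result in the diagram drawn in step 1, and stop the test as soon as either the rejection or the acceptance line is crossed.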
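The two steps can be sketched in code. The line constants follow the standard sequential probability ratio test for Bernoulli trials; the function names and the encoding of test results as 0 (good) and 1 (defective) are assumptions of this illustration.

```python
from math import log

def sequential_test_lines(p0, p1, alpha, beta):
    """Slope a and offsets b1, b2 of the acceptance and rejection lines."""
    d = log(p1 * (1.0 - p0) / (p0 * (1.0 - p1)))
    a = log((1.0 - p0) / (1.0 - p1)) / d
    b1 = log((1.0 - alpha) / beta) / d
    b2 = log((1.0 - beta) / alpha) / d
    return a, b1, b2

def run_sequential_test(results, p0, p1, alpha, beta):
    """results: iterable of 0 (good) / 1 (defective) in test order.
    Returns the decision and the number of items tested."""
    a, b1, b2 = sequential_test_lines(p0, p1, alpha, beta)
    defects, n = 0, 0
    for n, x in enumerate(results, start=1):
        defects += x
        if defects >= a * n + b2:     # rejection line reached or crossed
            return "reject", n
        if defects <= a * n - b1:     # acceptance line reached or crossed
            return "accept", n
    return "continue", n

print(sequential_test_lines(0.01, 0.02, 0.2, 0.2))
print(run_sequential_test([0] * 200, 0.01, 0.02, 0.2, 0.2))
```

The number of items tested depends on the observed sequence, which is the random test duration mentioned for sequential tests.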

Figure 7.3 Poisson distribution (• results for Examples 7.2 (c = 14), 7.4 (7), 7.5 (0), 7.6 (2 & 0), 7.9 (6))


Figure A8.8 shows acceptance and rejection lines for $p_0 = 1\%$, $p_1 = 2\%$ and $\alpha = \beta = 0.2$. The advantage of the sequential test is that on average it requires a smaller sample size than the corresponding simple two-sided sampling plan (Ex. 7.10 or Fig. 7.8). A disadvantage is that the test duration (sample size) is random.

7.1.3 One-sided Sampling Plans for the Demonstration of a Defective Probability p

The two-sided sampling plans of Section 7.1.2 are fair in the sense that for $\alpha = \beta$, both producer and consumer run the same risk of making a false decision. In practical applications however, one-sided sampling plans are often used, i.e. only $p_0$ and $\alpha$ or $p_1$ and $\beta$ are specified. In these cases, the operating characteristic curve is not completely defined. For every value of c (c = 0, 1, ...) a largest n (n = 1, 2, ...) exists which satisfies inequality (7.8) for a given $p_0$ and $\alpha$, or a smallest n exists which satisfies inequality (7.9) for a given $p_1$ and $\beta$. It can be shown that the operating characteristic curves become steeper as the value of c increases (see e.g. Figs. 7.4 or A8.9). Hence, for small values of c, the producer (if $p_0$ and $\alpha$ are given) or the consumer (if $p_1$ and $\beta$ are given) can be favored. Figure 7.4 visualizes the reduction of the consumer risk ($\beta = 0.95$ for p = 0.0065) by increasing values of the defective probability p or values of c, see Fig. 7.9 for a counterpart.

When only $p_0$ and $\alpha$ or $p_1$ and $\beta$ are given, it is usual to set in these cases

$p_0$ = AQL and $p_1$ = LTPD, (7.13)

respectively, where AQL is the Acceptable Quality Level and LTPD is the Lot Tolerance Percent Defective (Eqs. (A8.77) to (A8.82)).

A large number of one-sided sampling plans for the demonstration of AQL values are given in national and international standards (IEC 60410, ISO 2859, MIL-STD-105, DIN 40080 [7.3]). Many of these plans have been established empirically. The following remarks can be useful when evaluating such plans:

1. AQL values are given in %.

2. The values for n and c are in general obtained using the Poisson approximation.

3. Not all values of c are listed; the value of $\alpha$ often decreases with increasing c.

4. Sample size is related to lot size, and this relationship is empirical.

5. A distinction is made between reduced tests (level I), normal tests (level II) and tightened tests (level III); level II is normally used; transition from one level to another is often given empirically (e.g. transition from level II to level III is necessary if 2 out of 5 successive independent lots have been rejected and a return to level II follows if 5 successive independent lots are passed).

6. The value of $\alpha$ is not given explicitly (for c = 0, for example, $\alpha$ is approximately 0.05 for level I, 0.1 for level II, and 0.2 for level III).


[Figure 7.4: ordinate Pr{acceptance | p}; abscissa p from 0.02 to 0.16; curves for Code F (n = 20, c = 0), Code J (n = 80, c = 1), and Code N (n = 500, c = 7). Graphics not reproduced.]

Figure 7.4 Operating characteristic curves for demonstration of an AQL = 0.65% with sample sizes n = 20, 80, and 500 as per Table 7.1 ($\alpha \approx 0.11$ for c = 0, $\approx 0.09$ for c = 1, $\approx 0.03$ for c = 7)

Table 7.1 presents some test procedures for AQL values from IEC 60410 [7.3] and Fig. 7.4 shows the corresponding operating characteristic curves for AQL = 0.65% and sample size n = 20, 80, and 500.

Test procedures for demonstration of LTPD values with given (fixed) customer risk $\beta$ are for example in [3.11 (S-19500)]. They are often based on the Poisson approximation (Eq. (A6.129)) and can be easily established using a $\chi^2$-table (Appendix A9.2) or Fig. 7.3. For given $\beta$ and LTPD, the values of n and c can be obtained taking in Fig. 7.3

$$\sum_{i=0}^{c} \frac{m^{i}}{i!}\, e^{-m} = \beta$$

and reading m = n p = n LTPD for c = 0, 1, 2, ... (Example: $\beta = 0.1$, LTPD = 2% yields m = 3.9 for c = 1, and from this n = 3.9 / 0.02 = 195; the procedure is thus: test 195 items and reject LTPD = 2% if more than 1 defect occurs.)
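Instead of reading m from Fig. 7.3, the Poisson sum can be solved numerically. The sketch below uses bisection; the function names, the search bounds, and the rounding of n up to the next integer are assumptions made here.

```python
from math import ceil, exp

def poisson_cdf(c, m):
    """Pr{no more than c events} for a Poisson distribution with mean m."""
    term = total = exp(-m)
    for i in range(1, c + 1):
        term *= m / i
        total += term
    return total

def ltpd_plan(ltpd, beta, c):
    """Mean m = n * LTPD with Pr{no more than c defects} = beta,
    and the resulting sample size n."""
    lo, hi = 0.0, 10.0 * (c + 1) + 50.0
    while hi - lo > 1e-10:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if poisson_cdf(c, mid) > beta else (lo, mid)
    m = 0.5 * (lo + hi)
    return m, ceil(m / ltpd)

for c in range(3):
    print(c, ltpd_plan(0.02, 0.1, c))
```

For c = 1, beta = 0.1 and LTPD = 2% this gives m close to 3.9 and n = 195, matching the example in the text.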

In addition to the simple one-sided sampling plans described above, multiple one-sided sampling plans are often used to demonstrate AQL values. In a double one-sided sampling plan, the following procedure is used:

1. Take a first sample of size $n_1$ and accept definitely if no more than $c_1$ defects occur, but reject definitely if $d_1$ or more defects have occurred.

2. If after the first sample the number of defects is greater than $c_1$ but less than $d_1$, take a second sample of size $n_2$ and accept if there are in total (in the first and second sample) no more than $c_2$ defects; otherwise reject.

The operating characteristic curve or acceptance probability for a double one-sided sampling plan can be calculated as

$$\Pr\{\text{acceptance} \mid p\} = \sum_{i=0}^{c_1} \binom{n_1}{i} p^{i} (1-p)^{n_1-i} + \sum_{i=c_1+1}^{d_1-1} \Bigl[ \binom{n_1}{i} p^{i} (1-p)^{n_1-i} \sum_{j=0}^{c_2-i} \binom{n_2}{j} p^{j} (1-p)^{n_2-j} \Bigr].$$
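The acceptance probability of a double plan follows directly from the two-step procedure; a sketch with illustrative function names:

```python
from math import comb

def binom_pmf(i, n, p):
    """Pr{exactly i defects in a sample of size n | p}."""
    return comb(n, i) * p**i * (1.0 - p)**(n - i)

def double_plan_acceptance(p, n1, n2, c1, d1, c2):
    """Acceptance probability of a double one-sided sampling plan."""
    # accepted on the first sample: no more than c1 defects
    accept = sum(binom_pmf(i, n1, p) for i in range(c1 + 1))
    # second sample needed: c1 < i < d1 defects in the first sample
    for i in range(c1 + 1, d1):
        second = sum(binom_pmf(j, n2, p) for j in range(c2 - i + 1))
        accept += binom_pmf(i, n1, p) * second
    return accept

# Plan for lot size 281 - 500: n1 = n2 = 32, c1 = 0, d1 = 2, c2 = 1
for p in (0.005, 0.01, 0.05):
    print(p, double_plan_acceptance(p, 32, 32, 0, 2, 1))
```

When c2 - i is negative the inner sum is empty, so first-sample outcomes that already exceed the total allowance contribute nothing.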


Multiple one-sided sampling plans are also given in national and international standards, see for example IEC 60410 [7.3] for the following double one-sided sampling plan to demonstrate AQL = 1%

Lot size           n1     n2     c1    d1    c2
281 - 500          32     32     0     2     1
501 - 1,200        50     50     0     3     3
1,201 - 3,200      80     80     1     4     4
3,201 - 10,000    125    125     2     5     6

The advantage of multiple one-sided sampling plans is that on average they require smaller sample sizes than would be necessary for simple one-sided sampling plans. A disadvantage is that the test duration is not fixed in advance.

Table 7.1 Test procedures for AQL demonstration (test level II, from IEC 60410 [7.3])

                                                          AQL in %
Lot size N     Code     n    0.04 0.065  0.10  0.15  0.25  0.40  0.65   1.0   1.5   2.5   4.0   6.5
2 - 8            A      2      ↓     ↓     ↓     ↓     ↓     ↓     ↓     ↓     ↓     ↓     ↓     0
9 - 15           B      3      ↓     ↓     ↓     ↓     ↓     ↓     ↓     ↓     ↓     ↓     0     ↑
16 - 25          C      5      ↓     ↓     ↓     ↓     ↓     ↓     ↓     ↓     ↓     0     ↑     ↓
26 - 50          D      8      ↓     ↓     ↓     ↓     ↓     ↓     ↓     ↓     0     ↑     ↓     1
51 - 90          E     13      ↓     ↓     ↓     ↓     ↓     ↓     ↓     0     ↑     ↓     1     2
91 - 150         F     20      ↓     ↓     ↓     ↓     ↓     ↓     0     ↑     ↓     1     2     3
151 - 280        G     32      ↓     ↓     ↓     ↓     ↓     0     ↑     ↓     1     2     3     5
281 - 500        H     50      ↓     ↓     ↓     ↓     0     ↑     ↓     1     2     3     5     7
501 - 1200       J     80      ↓     ↓     ↓     0     ↑     ↓     1     2     3     5     7    10
1.2k - 3.2k      K    125      ↓     ↓     0     ↑     ↓     1     2     3     5     7    10    14
3.2k - 10k       L    200      ↓     0     ↑     ↓     1     2     3     5     7    10    14    21
10k - 35k        M    315      0     ↑     ↓     1     2     3     5     7    10    14    21     ↑
35k - 150k       N    500      ↑     ↓     1     2     3     5     7    10    14    21     ↑     ↑
150k - 500k      P    800      ↓     1     2     3     5     7    10    14    21     ↑     ↑     ↑
over 500k        Q   1250      1     2     3     5     7    10    14    21     ↑     ↑     ↑     ↑

Use the first sampling plan above for ↑ or below for ↓; n = sample size, c (table entries) = number of allowed defects


7.2 Statistical Reliability Tests

Reliability tests are useful to evaluate the reliability achieved in a given item. Early initiation of such tests allows quick identification and cost-effective correction of weaknesses not discovered by reliability analyses. This supports a learning process, often related to a reliability growth program (Section 7.7). Since reliability tests are generally time-consuming and expensive, they must be coordinated with other tests. Test conditions should be as close as possible to those experienced in the field. As with quality control, a distinction is made between estimation and demonstration of a specific reliability figure. Section 7.2.1 uses results of Section 7.1 for reliability and availability testing for the case of a given (fixed) mission. In Section 7.2.2 a unified method for availability estimation and demonstration for the case of continuous operation is introduced. Section 7.2.3 deals carefully with estimation & demonstration of a constant failure rate $\lambda$ (or of MTBF for the case $MTBF = 1/\lambda$). In addition, maintainability tests are considered in Section 7.3, accelerated tests in Section 7.4, goodness-of-fit tests in Section 7.5, general reliability data analysis and trend tests in Section 7.6, and reliability growth in Section 7.7. To simplify notation, the index S, referring to system, is omitted ($R$, $PA$, $MTBF$, $\lambda$ for $R_{S0}$, $PA_S$, $MTBF_{S0}$, $\lambda_S$).

7.2.1 Reliability & Availability Estimation and Demonstration for the Case of a given fixed Mission

Reliability ($R$) and availability (asymptotic & steady-state point and average availability $PA = AA$) are often defined as success probability for a given (fixed) mission. Their estimation and demonstration can thus be performed as for an unknown probability $p$ (Section 7.1) by setting

$$p = 1 - R \qquad \text{or} \qquad p = 1 - PA = 1 - AA .$$

For a demonstration, the null hypothesis $H_0: p < p_0$ is converted to $H_0: R > R_0$ or $H_0: PA > PA_0$, which adheres better to the concept of reliability or availability. The same holds for any other reliability figure expressed as an unknown probability $p$.

The above considerations hold for a given (fixed) mission, repeated for reliability tests as $n$ Bernoulli trials. However, for the case of continuous operation, estimation and demonstration of an availability can lead to a difficulty in defining the time points $t_1, t_2, \ldots, t_n$ at which the $n$ observations according to Eqs. (7.2) - (7.4) or (7.8) - (7.10) have to be performed. The case of continuous operation is considered in Section 7.2.2 for availability and Section 7.2.3 for reliability. Examples 7.3 - 7.6 illustrate some cases of reliability tests for a given (fixed) mission.

Example 7.3 In a reliability test 95 of 100 items pass. Give the confidence interval for $R$ at $\gamma = 0.9$ ($\beta_1 = \beta_2$).

Solution With $p = 1 - R$ and $\hat{R} = 0.95$, the confidence interval for $p$ follows from Fig. 7.1 as [0.03, 0.10]. The confidence interval for $R$ is then [0.90, 0.97]. (Eq. (7.5) leads to [0.901, 0.975] for $R$.)
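The interval of Example 7.3 can be cross-checked numerically. The following is a minimal sketch, assuming an exact (Clopper-Pearson type) interval in which each limit is the value of $p$ at which the binomial tail probability equals $\beta_1$ or $\beta_2$; the function names `binom_cdf`, `bisect`, and `exact_limits` are illustrative and not from the text, and the result need not coincide to the last digit with the graphical or approximate values quoted above.

```python
from math import comb


def binom_cdf(c, n, p):
    """Pr{at most c failures in n Bernoulli trials with failure probability p}."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(c + 1))


def bisect(func, lo, hi, iters=100):
    """Root of a function that decreases from positive to negative on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if func(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_limits(k, n, beta1, beta2):
    """Exact limits for p, given k failures in n trials (assumed formulation)."""
    # Lower limit: Pr{at least k failures | p_l} = beta2
    p_l = 0.0 if k == 0 else bisect(
        lambda p: binom_cdf(k - 1, n, p) - (1 - beta2), 0.0, 1.0)
    # Upper limit: Pr{at most k failures | p_u} = beta1
    p_u = 1.0 if k == n else bisect(
        lambda p: binom_cdf(k, n, p) - beta1, 0.0, 1.0)
    return p_l, p_u


# Example 7.3: 5 failures in 100 trials, gamma = 0.9, beta1 = beta2 = 0.05
p_l, p_u = exact_limits(k=5, n=100, beta1=0.05, beta2=0.05)
r_l, r_u = 1 - p_u, 1 - p_l
print(f"p in [{p_l:.3f}, {p_u:.3f}], R in [{r_l:.3f}, {r_u:.3f}]")
```

Bisection is used here only to keep the sketch free of external libraries; a statistics package with beta or F quantiles would give the same limits directly.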

Example 7.4 The reliability of a given subassembly was $R = 0.9$ and should have been improved through constructive measures. In a test of 100 subassemblies, 94 of them pass the test. Check with a type I error $\alpha = 20\%$ the hypothesis $H_0: R > 0.95$.

Solution For $p_0 = 1 - R_0 = 0.05$, $\alpha = 20\%$, and $n = 100$, Eq. (7.8) yields $c = 7$ (see also the graphical solution from Fig. 7.3 with $m = n p_0 = 5$ and acceptance probability $\ge 1 - \alpha = 0.8$, yielding $\alpha \approx 0.15$ for $m = 5$ and $c = 7$). As just $k = 6$ subassemblies have failed the test, the hypothesis $H_0: R > 0.95$ can be accepted (must not be rejected) at the level $1 - \alpha = 0.8$.

Supplementary result: Assuming as an alternative hypothesis $H_1: R < 0.90$, or $p > p_1 = 0.1$, the type II error $\beta$ can be calculated from Eq. (7.9) with $c = 7$ & $n = 100$ or graphically from Fig. 7.3 with $m = n p_1 = 10$, yielding $\beta \approx 0.2$.
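The numbers of Example 7.4 can be reproduced with a few lines of code. This is a sketch under the assumption that Eq. (7.8) asks for the smallest $c$ whose binomial acceptance probability at $p_0$ is not less than $1 - \alpha$, and that Eq. (7.9) gives the acceptance probability at $p_1$ as type II error; `binom_cdf` and `smallest_c` are illustrative names.

```python
from math import comb


def binom_cdf(c, n, p):
    """Pr{at most c failures in n Bernoulli trials with failure probability p}."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(c + 1))


def smallest_c(n, p0, alpha):
    """Smallest number of allowed failures c with Pr{acceptance | p0} >= 1 - alpha."""
    c = 0
    while binom_cdf(c, n, p0) < 1 - alpha:
        c += 1
    return c


n, p0, p1, alpha = 100, 0.05, 0.10, 0.20
c = smallest_c(n, p0, alpha)
type_one = 1 - binom_cdf(c, n, p0)   # actual producer risk at p0
type_two = binom_cdf(c, n, p1)       # consumer risk at p1
print(f"c = {c}, alpha = {type_one:.3f}, beta = {type_two:.3f}")
```

The actual type I error is lower than the 20% allowed because $c$ is an integer, which is consistent with the value of about 0.15 read from Fig. 7.3.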

Example 7.5 Determine the minimum number of tests $n$ that must be repeated to verify the hypothesis $H_0: R > R_1 = 0.95$ with a consumer risk $\beta = 0.1$. What is the allowed number of failures $c$?

Solution The inequality (7.9) must be fulfilled with $p_1 = 1 - R_1 = 0.05$ and $\beta = 0.1$; $n$ and $c$ must thus satisfy

$$\sum_{i=0}^{c} \binom{n}{i}\, 0.05^{\,i}\; 0.95^{\,n-i} \le 0.1 .$$

The number of tests $n$ is a minimum for $c = 0$. From $0.95^{\,n} = 0.1$, it follows that $n = 45$ (calculation with the Poisson approximation (Eq. (7.12)) yields $n = 46$, graphical solution with Fig. 7.3 leads to $m \approx 2.3$ and then $n = m/p = 46$).

Example 7.6 Continuing with Example 7.5, (i) find $n$ for $c = 2$ and (ii) how large would the producer risk be for $c = 0$ and $c = 2$ if the true reliability were $R = 0.97$?

Solution (i) From Eq. (7.9),

$$\sum_{i=0}^{2} \binom{n}{i}\, 0.05^{\,i}\; 0.95^{\,n-i} \le 0.1 ,$$

and thus $n = 105$ (Fig. 7.3 yields $m \approx 5.3$ and $n = 106$; from Table A9.2, $\nu = 6$, $\chi^2_{6,\,0.9} = 10.645$ and $n = 107$).

(ii) The producer risk is

$$\alpha = 1 - \sum_{i=0}^{c} \binom{n}{i}\, 0.03^{\,i}\; 0.97^{\,n-i} ,$$

hence, $\alpha \approx 0.75$ for $c = 0$ and $n = 45$, $\alpha \approx 0.61$ for $c = 2$ and $n = 105$ (Fig. 7.3 yields $\alpha \approx 0.75$ for $c = 0$ and $m = 1.35$, $\alpha \approx 0.62$ for $c = 2$ and $m = 3.15$; from Table A9.2, $\alpha \approx 0.73$ for $\nu = 2$ and $\chi^2_{2,\alpha} = 2.7$, $\alpha \approx 0.61$ for $\nu = 6$ and $\chi^2_{6,\alpha} = 6.3$).
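Examples 7.5 and 7.6 can be checked by direct search. The sketch below assumes that Eq. (7.9) requires the binomial acceptance probability at $p_1$ to be at most $\beta$; `smallest_n` and `producer_risk` are illustrative helper names.

```python
from math import comb


def binom_cdf(c, n, p):
    """Pr{at most c failures in n Bernoulli trials with failure probability p}."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(c + 1))


def smallest_n(c, p1, beta):
    """Smallest n for which Pr{at most c failures | p1} <= beta."""
    n = c + 1
    while binom_cdf(c, n, p1) > beta:
        n += 1
    return n


def producer_risk(c, n, p_true):
    """Probability of rejection (more than c failures) at the true p."""
    return 1 - binom_cdf(c, n, p_true)


p1, beta, p_true = 0.05, 0.10, 0.03
for c in (0, 2):
    n = smallest_n(c, p1, beta)
    print(f"c = {c}: n = {n}, alpha = {producer_risk(c, n, p_true):.2f}")
```

The high producer risk for both plans reflects that a one-sided plan built only on $p_1$ and $\beta$ gives no protection to the producer when the true $p$ is close to $p_1$.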

7.2.2 Availability Estimation and Demonstration for the Case of Continuous Operation (asymptotic & steady-state)

Availability estimation & demonstration for a repairable item in continuous operation can be based on results given in Section 6.2 for the one-item repairable structure. Point estimate (with corresponding mean and variance) for the availability can be found for arbitrary distributions of failure-free and repair times (Section 7.2.2.3). However, interval estimation and demonstration tests can lead to some difficulties. A unified approach for estimating & demonstrating the asymptotic and steady-state point and average availability $PA = AA$ for the case of exponentially or Erlangian distributed failure-free and repair times is introduced in Appendices A8.2.2.4 & A8.3.1.4 (to simplify the notation, $PA = AA$ is used for $PA_S = AA_S$).

Sections 7.2.2.1 and 7.2.2.2 deal with this approach. Only the case of exponentially distributed failure-free and repair times, i.e. constant failure and repair rates ($\lambda(x) = \lambda$, $\mu(x) = \mu$), is considered here; extension to Erlangian distributions is easy. Point and average unavailability converge for this case rapidly ($1 - PA_{S0}(t)$ and $1 - AA_{S0}(t)$ in Table 6.3) to the asymptotic & steady-state value $\overline{PA} = 1 - PA = 1 - AA = \lambda/(\lambda + \mu) \approx \lambda/\mu$. To simplify considerations, it will be assumed that the observed time interval $(0, t]$ is $\gg 1/\mu$, terminates by a repair, and exactly $k$ (or $n$) failure-free times $\tau_i$ and corresponding repair times $\tau_i'$ have occurred (see Section 7.2.2.3 for other possibilities). Furthermore, $\lambda \ll \mu$ will be assumed for the estimation, i.e.

$$\overline{PA}_a = \lambda/\mu$$

is considered instead of $\overline{PA} = \lambda/(\lambda + \mu)$ in Section 7.2.2.1 (relative error of the same magnitude as $\overline{PA}$). $\lambda/\mu$ is a probabilistic value of the asymptotic & steady-state unavailability and has its statistical counterpart in $DT/UT$, where $DT$ and $UT$ are the observed down and up times. The procedure given in Appendices A8.2.2.4 and A8.3.1.4 is based on the fact that the quantity $(\mu \cdot DT)/(\lambda \cdot UT)$ is distributed according to a Fisher distribution (F-distribution) with $\nu_1 = \nu_2 = 2k$ degrees of freedom. Section 7.2.2.1 deals with estimation and Section 7.2.2.2 with demonstration of $\overline{PA}$.

7.2.2.1 Availability Estimation

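Having observed for a repairable item described by Fig. 6.2, with constant failure and repair rates $\lambda(x) = \lambda$ & $\mu(x) = \mu \gg \lambda$, an operating time $UT = t_1 + \ldots + t_k$ and a repair time $DT = t_1' + \ldots + t_k'$, the maximum likelihood point estimate for $\overline{PA}_a = \lambda/\mu$ is

$$\hat{\overline{PA}}_a = \widehat{(\lambda/\mu)} = DT/UT = (t_1' + \ldots + t_k')\,/\,(t_1 + \ldots + t_k),$$

unbiased being $(1 - 1/k)\, DT/UT$, $k > 1$ (Example A8.10). $\overline{PA}_a = \lambda/\mu$ is an approximation for $\overline{PA} = \overline{AA} = \lambda/(\lambda + \mu)$, sufficiently good for practical applications (relative error of the same magnitude as $\overline{PA}$). For given $\beta_1$, $\beta_2$, $\gamma = 1 - \beta_1 - \beta_2$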

Figure 7.5 Confidence limits $\hat{\overline{PA}}_{a_l} / \hat{\overline{PA}}_a$ and $\hat{\overline{PA}}_{a_u} / \hat{\overline{PA}}_a$ (Eq. (7.17)) for an unknown asymptotic & steady-state unavailability $\overline{PA} = 1 - PA = 1 - AA$ ($\hat{\overline{PA}}_a = DT/UT$ = maximum likelihood estimate for $\lambda/\mu$; $UT = t_1 + \ldots + t_k$, $DT = t_1' + \ldots + t_k'$; $\gamma = 1 - \beta_1 - \beta_2$ = confidence level (here $\beta_1 = \beta_2 = (1 - \gamma)/2$)); result for Example A8.8

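($0 < \beta_1 < 1 - \beta_2 < 1$), lower and upper confidence limits for $\overline{PA}_a$ are (Eq. (A8.65))

$$\hat{\overline{PA}}_{a_l} = \hat{\overline{PA}}_a \,/\, F_{2k,\,2k,\,1-\beta_2} \qquad \text{and} \qquad \hat{\overline{PA}}_{a_u} = \hat{\overline{PA}}_a \cdot F_{2k,\,2k,\,1-\beta_1}, \qquad (7.17)$$

where $F_{2k,\,2k,\,1-\beta_2}$ & $F_{2k,\,2k,\,1-\beta_1}$ are the $1 - \beta_2$ & $1 - \beta_1$ quantiles of the Fisher (F) distribution (Appendix A9.4, [A9.3 - A9.6]). Figure 7.5 gives the confidence limits $\hat{\overline{PA}}_{a_l} / \hat{\overline{PA}}_a$ and $\hat{\overline{PA}}_{a_u} / \hat{\overline{PA}}_a$ for $\beta_1 = \beta_2 = (1 - \gamma)/2$, useful for practical applications (Example A8.8). One-sided confidence intervals are

$$0 < \overline{PA}_a \le \hat{\overline{PA}}_{a_u}, \ \text{with } \gamma = 1 - \beta_1, \qquad \text{and} \qquad \hat{\overline{PA}}_{a_l} \le \overline{PA}_a < 1, \ \text{with } \gamma = 1 - \beta_2. \qquad (7.18)$$

Corresponding values for the availability can be obtained using $PA = 1 - \overline{PA}$. If failure-free and/or repair times are Erlangian distributed (Eq. (A6.102)) with $\beta_\lambda$ & $\beta_\mu$, $F_{2k,\,2k,\,1-\beta_2}$ and $F_{2k,\,2k,\,1-\beta_1}$ have to be replaced by $F_{2k\beta_\mu,\,2k\beta_\lambda,\,1-\beta_2}$ and $F_{2k\beta_\lambda,\,2k\beta_\mu,\,1-\beta_1}$ (for unchanged MTTF & MTTR, see Example A8.11). $\hat{\overline{PA}}_a = DT/UT$ remains valid. Results based only on $DT$ are not free of parameters (Section 7.2.2.3).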
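The estimation procedure of this section can be tried out numerically. The sketch below is not the method of the appendices: instead of reading F quantiles from a table, it estimates them by Monte Carlo simulation, using the fact stated above that $(\mu \cdot DT)/(\lambda \cdot UT)$ follows an F-distribution with $\nu_1 = \nu_2 = 2k$ degrees of freedom (simulated here as the ratio of two independent sums of $k$ unit exponentials). It further assumes that the lower limit is $DT/UT$ divided by the $1 - \beta_2$ quantile and the upper limit is $DT/UT$ multiplied by the $1 - \beta_1$ quantile. The observed values `dt`, `ut`, and `k` are hypothetical, and the function names are illustrative.

```python
import random


def f_quantile_mc(k, q, n_sim=100_000, seed=1):
    """Monte Carlo estimate of the q quantile of F(2k, 2k).

    Each Gamma(k, 1) variate is a sum of k unit exponentials; the ratio of two
    independent such sums is F-distributed with 2k and 2k degrees of freedom.
    """
    rng = random.Random(seed)
    ratios = sorted(
        rng.gammavariate(k, 1.0) / rng.gammavariate(k, 1.0) for _ in range(n_sim)
    )
    return ratios[int(q * n_sim)]


def unavailability_limits(dt, ut, k, beta1, beta2):
    """Point estimate DT/UT of lambda/mu with lower and upper confidence limits."""
    est = dt / ut
    lower = est / f_quantile_mc(k, 1 - beta2)
    upper = est * f_quantile_mc(k, 1 - beta1)
    return est, lower, upper


# Hypothetical observation: k = 10 failures, 10000 h up time, 20 h down time
est, lower, upper = unavailability_limits(dt=20.0, ut=10_000.0, k=10,
                                          beta1=0.1, beta2=0.1)
print(f"estimate {est:.2e}, limits [{lower:.2e}, {upper:.2e}] for gamma = 0.8")
```

With a numerical library that provides F quantiles, `f_quantile_mc` could be replaced by an exact quantile call; the simulation is used only to keep the sketch self-contained.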


7.2.2.2 Availability Demonstration

In the context of acceptance testing, demonstration of the asymptotic & steady-state point and average availability ($PA = AA$) is often required. For practical applications it is useful to work with the unavailability $\overline{PA} = 1 - PA$. The main concern of this test is to check a null hypothesis $H_0: \overline{PA} < \overline{PA}_0$ against an alternative hypothesis $H_1: \overline{PA} > \overline{PA}_1$ on the basis of the following agreement between producer and consumer:

The item should be accepted with a probability nearly equal to (but not less than) $1 - \alpha$ if the true (unknown) unavailability $\overline{PA}$ is lower than $\overline{PA}_0$, but rejected with a probability nearly equal to (but not less than) $1 - \beta$ if $\overline{PA}$ is greater than $\overline{PA}_1$ ($\overline{PA}_0$, $\overline{PA}_1 > \overline{PA}_0$, $0 < \alpha < 1 - \beta < 1$ are given (fixed) values).

$\overline{PA}_0$ is the specified unavailability and $\overline{PA}_1$ is the maximum acceptable unavailability. $\alpha$ is the allowed producer's risk (type I error), i.e. the probability of rejecting a true hypothesis $H_0: \overline{PA} < \overline{PA}_0$. $\beta$ is the allowed consumer's risk (type II error), i.e. the probability of accepting the hypothesis $H_0: \overline{PA} < \overline{PA}_0$ when the alternative hypothesis $H_1: \overline{PA} > \overline{PA}_1$ is true. Verification of the agreement stated above is a problem of statistical hypothesis testing (Appendix A8.3) and different approaches are possible. In the following, the method introduced in Appendix A8.3.1.4 is given (comparison with other methods is in Section 7.2.2.3).

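Assuming constant failure and repair rates $\lambda(x) = \lambda$ and $\mu(x) = \mu$, the procedure is as follows (see also [A8.28, A2.6 (IEC 61070)]):

1. For given (fixed) $\overline{PA}_0$, $\overline{PA}_1$, $\alpha$, and $\beta$ ($0 < \alpha < 1 - \beta < 1$), find the smallest integer $n$ (1, 2, ...) which satisfies (Eq. (A8.91))

$$F_{2n,\,2n,\,1-\alpha} \cdot F_{2n,\,2n,\,1-\beta} \;\le\; \frac{\overline{PA}_1}{\overline{PA}_0} \cdot \frac{PA_0}{PA_1} = \frac{(1 - PA_1)\, PA_0}{(1 - PA_0)\, PA_1},$$

where $F_{2n,\,2n,\,1-\alpha}$ and $F_{2n,\,2n,\,1-\beta}$ are the $1 - \alpha$ & $1 - \beta$ quantiles of the F-distribution (Appendix A9.4), and compute the limiting value (Eq. (A8.92))

$$\delta = F_{2n,\,2n,\,1-\alpha} \cdot \overline{PA}_0 / PA_0 = F_{2n,\,2n,\,1-\alpha} \cdot (1 - PA_0)/PA_0 .$$

2. Observe $n$ failure-free times $t_1, \ldots, t_n$ and the corresponding repair times $t_1', \ldots, t_n'$, and

reject $H_0: \overline{PA} < \overline{PA}_0$, if $\dfrac{t_1' + \ldots + t_n'}{t_1 + \ldots + t_n} > \delta$,

accept $H_0: \overline{PA} < \overline{PA}_0$, if $\dfrac{t_1' + \ldots + t_n'}{t_1 + \ldots + t_n} \le \delta$.

Table 7.2 gives $n$ and $\delta$ for some values of $\overline{PA}_1 / \overline{PA}_0$ used in practical applications. It must be noted that the test duration is not fixed in advance. However, results for fixed time sample plans are not free of parameters (remark to Eq. (7.22)).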

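Table 7.2 Number $n$ of failure-free times $\tau_1, \ldots, \tau_n$ & corresponding repair (restoration) times $\tau_1', \ldots, \tau_n'$, and limiting value $\delta$ of the observed ratio $(t_1' + \ldots + t_n') / (t_1 + \ldots + t_n)$ to demonstrate $\overline{PA} < \overline{PA}_0$ against $\overline{PA} > \overline{PA}_1$ for various values of $\alpha$ (producer risk), $\beta$ (consumer risk), and $\overline{PA}_1 / \overline{PA}_0$

| | $\overline{PA}_1 / \overline{PA}_0 = 2$ | $\overline{PA}_1 / \overline{PA}_0 = 4$ | $\overline{PA}_1 / \overline{PA}_0 = 6$ |
|---|---|---|---|
| $\alpha \approx \beta \lesssim 0.1$ | $n = 29$, $\delta = 1.41\, \overline{PA}_0 / PA_0$ ($PA_0 > 0.99$)* | $n = 8$, $\delta = 1.93\, \overline{PA}_0 / PA_0$ ($PA_0 > 0.99$)* | $n = 5$, $\delta = 2.32\, \overline{PA}_0 / PA_0$ ($PA_0 > 0.98$)* |
| $\alpha \approx \beta \lesssim 0.2$ | $n = 13$, $\delta = 1.39\, \overline{PA}_0 / PA_0$ ($PA_0 > 0.99$)* | $n = 4$, $\delta = 1.86\, \overline{PA}_0 / PA_0$ ($PA_0 > 0.98$)* | $n = 3$, $\delta = 2.06\, \overline{PA}_0 / PA_0$ ($PA_0 > 0.99$)* |

* a lower $n$ can be given (with corresponding $F_{2n,\,2n,\,1-\alpha}$) for $PA_0$ smaller than the limit given

Corresponding values for the availability can be obtained using $PA = 1 - \overline{PA}$. If failure-free and/or repair times are Erlangian distributed (Eq. (A6.102)) with $\beta_\lambda$ & $\beta_\mu$, $F_{2n,\,2n,\,1-\alpha}$ and $F_{2n,\,2n,\,1-\beta}$ have to be replaced by $F_{2n\beta_\mu,\,2n\beta_\lambda,\,1-\alpha}$ and $F_{2n\beta_\lambda,\,2n\beta_\mu,\,1-\beta}$ (for unchanged MTTF & MTTR, see Example A8.11).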
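A test plan of the kind listed in Table 7.2 can be checked by simulation. The sketch below takes the plan $n = 5$, $\delta = 2.32\,\overline{PA}_0/PA_0$ (read from the table for $\overline{PA}_1/\overline{PA}_0 = 6$) and estimates the acceptance probability when the true unavailability equals $\overline{PA}_0$ and when it equals $\overline{PA}_1$. The specified availability $PA_0 = 0.99$, the repair rate, and the function name `acceptance_probability` are assumptions made for illustration only.

```python
import random


def acceptance_probability(n, delta, lam, mu, n_sim=50_000, seed=1):
    """Monte Carlo estimate of Pr{DT/UT <= delta} for n failure/repair cycles."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(n_sim):
        ut = rng.gammavariate(n, 1.0 / lam)  # sum of n exponential failure-free times
        dt = rng.gammavariate(n, 1.0 / mu)   # sum of n exponential repair times
        if dt / ut <= delta:
            accepted += 1
    return accepted / n_sim


pa0 = 0.99                       # specified availability (assumed value)
ua0 = 1 - pa0                    # specified unavailability
ua1 = 6 * ua0                    # maximum acceptable unavailability
n, delta = 5, 2.32 * ua0 / pa0   # plan taken from Table 7.2

mu = 1.0                         # repair rate, arbitrary time unit
lam0 = mu * ua0 / (1 - ua0)      # failure rate giving unavailability ua0
lam1 = mu * ua1 / (1 - ua1)      # failure rate giving unavailability ua1

acc0 = acceptance_probability(n, delta, lam0, mu)
acc1 = acceptance_probability(n, delta, lam1, mu)
print(f"Pr(accept | PA0) = {acc0:.3f}, Pr(accept | PA1) = {acc1:.3f}")
```

The first probability should come out close to $1 - \alpha \approx 0.9$ and the second below $\beta \approx 0.1$, which is the agreement between producer and consumer stated above.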

7.2.2.3 Further Availability Evaluation Methods (for Continuous Operation)

The approach introduced in Appendices A8.2.2.4 & A8.3.1.4 and given in Sections 7.2.2.1 & 7.2.2.2 yields an exact solution based on the Fisher distribution for estimating and demonstrating an availability $PA = AA$, obtained by investigating $DT/UT$ for exponentially or Erlangian distributed failure-free and repair times. Exponentially distributed failure-free times arise in many practical applications. The distribution of repair (restoration) times can often be approximated by an Erlang distribution (Eq. (A6.102) with $\beta \ge 3$). Generalization of the distribution of failure-free or repair times can lead to analytical difficulties. In the following, some alternative approaches for estimating and demonstrating an availability $PA = AA$ are briefly discussed and compared with the approach given in Sections 7.2.2.1 & 7.2.2.2 (item's behavior described by the alternating renewal process of Fig. 6.2).

A first possibility is to consider only the distribution of the down time $DT$ (total repair or restoration time) in a given time interval $(0, t]$. At the given (fixed) time point $t$, the item can be up or down. Eq. (6.33) with $t - x$ instead of $T_0$ gives the distribution function of $DT$. Moments of $DT$ have been investigated in [A7.29 (1957)]. Mean and variance of the unavailability $\overline{PA} = 1 - PA = E[DT/t]$ can thus be given for arbitrary distributions of failure-free and repair times. In particular, for the case of constant failure and repair rates ($\lambda(x) = \lambda$, $\mu(x) = \mu$) it holds that

$$\lim_{t \to \infty} E[DT/t] = \frac{\lambda}{\lambda + \mu} \approx \lambda/\mu, \qquad \text{and} \qquad \lim_{t \to \infty} Var[DT/t] = \frac{2\lambda\mu}{t\,(\lambda + \mu)^3} \approx 2\lambda / (t\mu^2). \qquad (7.22)$$

However, already for the case of constant failure and repair rates, results for interval estimation and demonstration tests are not free of parameters (function of $\mu$ [A8.28] or $\lambda$ [A8.18]). The use of the distribution of $DT$, or $DT/t$ for fixed $t$, would bring the advantage of a test duration $t$ fixed in advance, but results are not free of parameters and the method is thus of limited utility.

A second possibility is to assign to the state of the item an indicator $\zeta(t)$ taking values 1 for item up and 0 for item down (Boolean variable in Section 2.3.4). In this case it holds that $PA(t) = \Pr\{\zeta(t) = 1\}$, and thus $E[\zeta(t)] = PA(t)$ and $Var[\zeta(t)] = E[\zeta(t)^2] - E^2[\zeta(t)] = PA(t)\,(1 - PA(t))$. Investigation on $PA(t)$ reduces to that on $\zeta(t)$, see e.g. [A7.4 (1962)]. In particular, estimation and demonstration of $PA(t)$ can be based on observations of $\zeta(t)$ at time points $t_1 < t_2 < \ldots$. A basic problem here is the choice of the observation time points (randomly, at constant time intervals $\Delta = t_{i+1} - t_i$, or other). For the case of constant failure and repair rates ($\lambda$, $\mu$), Eq. (6.20) yields $PA(t) = PA_{S0}(t) = \mu/(\lambda + \mu) + (\lambda/(\lambda + \mu))\, e^{-(\lambda + \mu)\,t}$ (item new at $t = 0$). Convergence to $PA = AA = \mu/(\lambda + \mu) \approx 1 - \lambda/\mu$ is very fast. Furthermore, because of the constant failure rate, the joint availability is given by $JA_{S0}(t, t + \Delta) = PA_{S0}(t) \cdot PA_{S0}(\Delta)$ (Eqs. (6.34) & (6.35)). Estimation and demonstration for the case of observations at constant time intervals $\Delta$ can thus be reduced to the case of an unknown probability $p = \overline{PA}(\Delta) = 1 - PA(\Delta) \approx (1 - e^{-\mu\Delta})\, \lambda/\mu \approx \lambda\Delta$ for $\Delta \ll 1/\mu$, or $p = \overline{PA} = \overline{AA} = \lambda/(\lambda + \mu) \approx \lambda/\mu$ for $\Delta \gg 1/\mu$ (Section 7.1).
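The speed of convergence of the point availability and the two limiting cases for the sampling interval $\Delta$ can be illustrated with a short calculation. The sketch below evaluates the expression for $PA_{S0}(t)$ quoted from Eq. (6.20); the rates `lam` and `mu` are hypothetical values chosen only for illustration.

```python
from math import exp


def point_availability(t, lam, mu):
    """PA_S0(t) for constant failure rate lam and repair rate mu, item new at t = 0."""
    return mu / (lam + mu) + lam / (lam + mu) * exp(-(lam + mu) * t)


lam, mu = 1e-3, 0.5  # hypothetical rates in 1/h (MTTF = 1000 h, MTTR = 2 h)
steady_state_unavailability = lam / (lam + mu)

# Unavailability 1 - PA(t) approaches lam / (lam + mu) within a few 1/mu
for t in (0.0, 1.0, 2.0, 5.0, 10.0, 20.0):
    print(f"t = {t:5.1f} h: 1 - PA(t) = {1 - point_availability(t, lam, mu):.3e}")

# Sampling interval much smaller than 1/mu: p is close to lam * delta
delta_small = 0.01
p_small = 1 - point_availability(delta_small, lam, mu)
# Sampling interval much larger than 1/mu: p is close to lam / (lam + mu)
delta_large = 100.0
p_large = 1 - point_availability(delta_large, lam, mu)
print(p_small, lam * delta_small, p_large, steady_state_unavailability)
```

The two printed pairs at the end show why the choice of $\Delta$ matters: the same observation scheme estimates roughly $\lambda\Delta$ in one regime and the steady-state unavailability in the other.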

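A further (empirical) possibility is to estimate (point estimation) and demonstrate $\lambda$ and $\mu$ separately (Section 7.2.3), and put results in $\hat{\overline{PA}}_a = \hat{\overline{AA}}_a = \hat{\lambda}/\hat{\mu}$. For an (empirical) interval estimation, Chebyshev's inequality (Eq. (A6.49)), expressing (Eq. (7.22)) $\Pr\{\,|DT/t - \lambda/\mu| > \varepsilon\,\} \le 2\lambda / (t\mu^2\varepsilon^2)$, can be used, yielding from a statistical point of view

$$\Pr\{\,|DT/t - \lambda/\mu| > \varepsilon\,\} = \Pr\{\,|\overline{PA} - \hat{\lambda}/\hat{\mu}| > \varepsilon\,\} \ge \Pr\{\,\overline{PA} - \hat{\lambda}/\hat{\mu} > \varepsilon\,\} = \Pr\{\,\overline{PA} > \hat{\lambda}/\hat{\mu} + \varepsilon\,\} = 1 - \Pr\{\,\overline{PA} \le \hat{\lambda}/\hat{\mu} + \varepsilon\,\},$$

so that $1 - \Pr\{\,\overline{PA} \le \hat{\lambda}/\hat{\mu} + \varepsilon\,\} \le 2\hat{\lambda} / (t\hat{\mu}^2\varepsilon^2) = \beta_1 = 1 - \gamma$, or

$$\Pr\{\,\overline{PA} \le \hat{\overline{PA}}_u\,\} \ge \gamma, \qquad \text{with} \quad \hat{\overline{PA}}_u = \hat{\lambda}/\hat{\mu} + \sqrt{2\hat{\lambda} / (t\,\hat{\mu}^2\,(1 - \gamma))},$$

by replacing $\varepsilon$ from the last equality in the preceding line, and with $t$ as test time [7.14].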

The different methods can basically be discussed by comparing Fig. 7.5 with Fig. 7.6 and Table 7.2 with Table 7.3. Results based on the Fisher distribution yield broader confidence intervals and longer demonstration tests (this can be accepted, considering that $\lambda$ and $\mu$ are unknown and that for high availability values higher $\hat{\overline{PA}}_{a_u} / \hat{\overline{PA}}_a$ or $\overline{PA}_1 / \overline{PA}_0$ can be agreed), the advantage being exact knowledge of the involved errors ($\beta_1$, $\beta_2$) or risks ($\alpha$, $\beta$). However, for some aspects (test duration, possibility to verify maintainability with selected failures) it can become more appropriate to estimate and demonstrate $\lambda$ and $\mu$ separately.


7.2.3 Estimation & Demonstration of a Constant Failure Rate λ (or of MTBF for the Case MTBF = 1/λ)

A constant (time independent) failure rate $\lambda(x) = \lambda$ occurs in many practical applications, for nonrepairable items as well as for repairable items which are assumed as-good-as-new after repair ($x$ being the variable starting by $x = 0$ at the beginning of the failure-free time considered, as for interarrival times). $\lambda(x) = \lambda$ implies that failure-free times are independent and exponentially distributed with the same parameter $\lambda$ (Eq. (A6.81)). In this case, the reliability function is given by $R(x) = e^{-\lambda x}$ and for the mean time to failure, $MTTF = 1/\lambda$ holds for all failure-free times (Eq. (A6.84)). For the repairable case, MTBF (mean operating time between failures) is often used in practical applications instead of MTTF. However, $MTBF = 1/\lambda$ holds only for the particular case $\lambda(x) = \lambda$. To avoid misuses, MTBF is confined in this book to the case $MTBF = 1/\lambda$. A reason for the assumption of $\lambda(x) = \lambda$ is that, by neglecting repair times, the flow of failures constitutes a homogeneous Poisson process (Appendix A7.2.5). This property characterizes exponentially distributed failure-free times and highly simplifies investigations.

This section deals with the estimation and demonstration of a constant failure rate $\lambda$ or of MTBF for the case $MTBF = 1/\lambda$ (see Appendix A8 for basic considerations and Sections 7.5 - 7.7 for further results). In particular, the case of a given (fixed) cumulative operating time $T$ is considered, when repair times are neglected (immediate renewal) and individual failure-free times are assumed to be independent. Due to the relationship between exponentially distributed failure-free times and the homogeneous Poisson process (Eq. (A7.39)) as well as the additive property of Poisson processes, the fixed cumulative operating time $T$ can be partitioned in an arbitrary way from failure-free times of statistically identical items (Example 7.7); see the note to Table 7.3 for a practical rule. The following are some examples:

1. Operation of a single item that is immediately renewed after each failure (renewal time = 0); here, T = t = calendar time = T„„.

2. Operation of $n$ identical items, each of them being immediately renewed after each failure (renewal time = 0); here, $T = n\,t$ ($n = 1, 2, \ldots$).

As stated above, in the case of a constant failure rate $\lambda$ and immediate renewal, the failure process is a homogeneous Poisson process (HPP) with intensity $\lambda$ (for $n = 1$) or $n\lambda$ (for $n > 1$) over the fixed time interval $(0, t]$, with $T = n\,t$. Hence, the probability of $k$ failures occurring within the cumulative operating time $T$ is (Eq. (A7.41))

$$\Pr\{k \text{ failures within } T \mid n\lambda\} = \frac{(n\lambda t)^k}{k!}\, e^{-n\lambda t} = \frac{(\lambda T)^k}{k!}\, e^{-\lambda T}, \qquad n = 1, 2, \ldots, \quad k = 0, 1, 2, \ldots .$$

Statistical procedures for estimation and demonstration of a failure rate $\lambda$ can thus be based on the evaluation of the parameter ($m = n\lambda t = \lambda T$) of a Poisson distribution.

In addition to the case of a given (fixed) cumulative operating time $T$ and immediate renewal (discussed above and investigated in Sections 7.2.3.1 - 7.2.3.3), for which the number $k$ of failures in $T$ is a sufficient statistic and $\hat{\lambda} = k/T$ is an unbiased estimate for $\lambda$, further possibilities are known. Assuming $n$ identical items at $t = 0$ and labeling the individual failure times as $t_1^* < t_2^* < \ldots$, the following cases are important for practical applications: +)

1. Fixed number $k$ of failures, the test is stopped at the $k$th failure and failed items are instantaneously renewed; an unbiased point estimate for $\lambda$ is ($k > 1$)

$$\hat{\lambda} = (k - 1) \,/\, (n\, t_k^*). \qquad (7.23)$$

2. Fixed number $k$ of failures, the test is stopped at the $k$th failure and failed items are not renewed; an unbiased point estimate of the failure rate $\lambda$ is

$$\hat{\lambda} = \frac{k - 1}{n t_1^* + (n - 1)(t_2^* - t_1^*) + \ldots + (n - k + 1)(t_k^* - t_{k-1}^*)} = \frac{k - 1}{t_1^* + \ldots + t_k^* + (n - k)\, t_k^*}. \qquad (7.24)$$

3. Fixed test time $t$, failed items are not renewed; a point estimate of the failure rate $\lambda$ (given $k$ items have failed in $(0, t]$) is

$$\hat{\lambda} = \frac{k}{t_1^* + \ldots + t_k^* + (n - k)\, t}. \qquad (7.25)$$
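The three test schemes listed above differ only in how the cumulative operating time is accumulated. The sketch below assumes the usual form of the estimates, namely $k - 1$ (failure-censored cases 1 and 2) or $k$ (time-censored case 3) divided by the cumulative operating time; the function names and the failure times are hypothetical and serve only as an illustration.

```python
def rate_fixed_k_with_renewal(n, failure_times):
    """Case 1: n items, failed items renewed, test stopped at the k-th failure."""
    k = len(failure_times)
    cumulative_time = n * failure_times[-1]      # all n positions run until t_k
    return (k - 1) / cumulative_time


def rate_fixed_k_no_renewal(n, failure_times):
    """Case 2: n items, no renewal, test stopped at the k-th failure."""
    k = len(failure_times)
    # failed items contribute their failure time, survivors run until t_k
    cumulative_time = sum(failure_times) + (n - k) * failure_times[-1]
    return (k - 1) / cumulative_time


def rate_fixed_time_no_renewal(n, failure_times, t_test):
    """Case 3: n items, no renewal, test stopped at the fixed time t_test."""
    k = len(failure_times)
    # failed items contribute their failure time, survivors run until t_test
    cumulative_time = sum(failure_times) + (n - k) * t_test
    return k / cumulative_time


# Hypothetical data: n = 10 items, ordered failure times in hours
times = [120.0, 340.0, 610.0, 1000.0]
print(rate_fixed_k_with_renewal(10, times))
print(rate_fixed_k_no_renewal(10, times))
print(rate_fixed_time_no_renewal(10, times, t_test=1200.0))
```

All three functions expect the failure times in increasing order, so that the last list element is the time of the $k$th failure.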

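Example 7.7 An item with constant failure rate $\lambda$ operates first for a fixed time $T_1$ and then for a fixed time $T_2$. Repair times are neglected. Give the probability that $k$ failures will occur in $T = T_1 + T_2$.

Solution The item's behavior within each of the time periods $T_1$ and $T_2$ can be described by a homogeneous Poisson process with intensity $\lambda$. From Eq. (A7.39) it follows that

$$\Pr\{i \text{ failures in the time period } T_1 \mid \lambda\} = \frac{(\lambda T_1)^i}{i!}\, e^{-\lambda T_1}$$

and, because of the memoryless property of the homogeneous Poisson process,

$$\Pr\{k \text{ failures in } T = T_1 + T_2 \mid \lambda\} = \sum_{i=0}^{k} \frac{(\lambda T_1)^i}{i!}\, e^{-\lambda T_1} \cdot \frac{(\lambda T_2)^{k-i}}{(k-i)!}\, e^{-\lambda T_2} = e^{-\lambda T}\, \lambda^k \sum_{i=0}^{k} \frac{T_1^{\,i}\; T_2^{\,k-i}}{i!\,(k-i)!} = \frac{(\lambda T)^k}{k!}\, e^{-\lambda T}. \qquad (7.26)$$

The last part of Eq. (7.26) follows from the binomial expansion of $(T_1 + T_2)^k$. Eq. (7.26) shows that for $\lambda$ constant, the cumulative operating time $T$ can be partitioned in any arbitrary way (see the remark to Table 7.3 for a practical rule).

Supplementary result: The same procedure can be used to prove that the sum of two independent homogeneous Poisson processes with intensities $\lambda_1$ and $\lambda_2$ is a homogeneous Poisson process with intensity $\lambda_1 + \lambda_2$; in fact,

$$\Pr\{k \text{ failures in } (0, T] \mid \lambda_1, \lambda_2\} = \sum_{i=0}^{k} \frac{(\lambda_1 T)^i}{i!}\, e^{-\lambda_1 T} \cdot \frac{(\lambda_2 T)^{k-i}}{(k-i)!}\, e^{-\lambda_2 T} = \frac{((\lambda_1 + \lambda_2)\, T)^k}{k!}\, e^{-(\lambda_1 + \lambda_2)\, T}. \qquad (7.27)$$

This result can be extended to nonhomogeneous Poisson processes.

+) * is used to distinguish $t_1^*, t_2^*, \ldots$ as arbitrary points on the time axis, from $t_1, t_2, \ldots$ as independent observations of a failure-free time $\tau$ (starting at $t = 0$ as in Fig. 1.1).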
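The partition property of Example 7.7 and its supplementary result can be verified numerically: convolving the Poisson probabilities of the two periods (or of the two processes) should reproduce the Poisson probability for the total. The following sketch assumes nothing beyond the Poisson distribution itself; the rates and times are hypothetical values.

```python
from math import exp, factorial


def poisson_pmf(k, m):
    """Pr{k events} for a Poisson distribution with parameter m."""
    return m**k / factorial(k) * exp(-m)


def pmf_two_periods(k, lam, t1, t2):
    """Pr{k failures in T1 + T2}, summing over the split i / (k - i)."""
    return sum(
        poisson_pmf(i, lam * t1) * poisson_pmf(k - i, lam * t2)
        for i in range(k + 1)
    )


def pmf_two_processes(k, lam1, lam2, t):
    """Pr{k failures in (0, t]} for two independent processes lam1, lam2."""
    return sum(
        poisson_pmf(i, lam1 * t) * poisson_pmf(k - i, lam2 * t)
        for i in range(k + 1)
    )


lam, t1, t2 = 1e-3, 2000.0, 3000.0   # hypothetical values, lam * (t1 + t2) = 5
for k in range(6):
    direct = poisson_pmf(k, lam * (t1 + t2))
    split = pmf_two_periods(k, lam, t1, t2)
    print(k, round(direct, 6), round(split, 6))
```

The two printed columns agree up to rounding, which is the numerical counterpart of the binomial expansion used in the solution.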

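7.2.3.1 Estimation of a Constant Failure Rate λ +) (or of MTBF for the Case MTBF = 1/λ)

Let us consider an item with a constant failure rate $\lambda$. If during the given (fixed) cumulative operating time $T$ exactly $k$ failures have occurred, the maximum likelihood point estimate for the unknown parameter $\lambda$ follows from Eq. (A8.46) as

$$\hat{\lambda} = \frac{k}{T}, \qquad \text{with } E[\hat{\lambda}] = \lambda \text{ and } Var[\hat{\lambda}] = \lambda/T. \qquad (7.28)$$

For a given confidence level $\gamma = 1 - \beta_1 - \beta_2$ (with $0 < \beta_1 < 1 - \beta_2 < 1$) and $k > 0$, the lower $\hat{\lambda}_l$ and upper $\hat{\lambda}_u$ limits of the confidence interval for the failure rate $\lambda$ can be obtained from (Eqs. (A8.47) to (A8.51))

$$\sum_{i=k}^{\infty} \frac{(\hat{\lambda}_l T)^i}{i!}\, e^{-\hat{\lambda}_l T} = \beta_2 \qquad \text{and} \qquad \sum_{i=0}^{k} \frac{(\hat{\lambda}_u T)^i}{i!}\, e^{-\hat{\lambda}_u T} = \beta_1,$$

or from

$$\hat{\lambda}_l = \frac{\chi^2_{2k,\,\beta_2}}{2T} \qquad \text{and} \qquad \hat{\lambda}_u = \frac{\chi^2_{2(k+1),\,1-\beta_1}}{2T}, \qquad (7.29)$$

using the quantile of the $\chi^2$-distribution (Table A9.2). For $k = 0$, Eq. (A8.49) yields

$$\hat{\lambda}_l = 0 \qquad \text{and} \qquad \hat{\lambda}_u = \frac{\ln(1/\beta_1)}{T}, \qquad \text{with } \gamma = 1 - \beta_1. \qquad (7.30)$$

Figure 7.6 gives confidence limits $\hat{\lambda}_l / \hat{\lambda}$ and $\hat{\lambda}_u / \hat{\lambda}$ for $\beta_1 = \beta_2 = (1 - \gamma)/2$, useful for practical applications.

For the case $MTBF = 1/\lambda$, $\widehat{MTBF} = T/k$ is biased (unbiased is $\widehat{MTBF} = T/(k + 1)$); $\widehat{MTBF}_l = 1/\hat{\lambda}_u$ and $\widehat{MTBF}_u = 1/\hat{\lambda}_l$ can be used in practical applications.

+) The case considered in Sections 7.2.3.1 to 7.2.3.3 corresponds to a sampling plan with $n$ elements with replacement and $k$ failures in the given (fixed) time interval $(0, T/n]$, Type I (time) censoring; the underlying process is a homogeneous Poisson process with constant intensity $n\lambda$.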

Figure 7.6 Confidence limits $\hat{\lambda}_l / \hat{\lambda}$ and $\hat{\lambda}_u / \hat{\lambda}$ ($\hat{\lambda} = k/T$) for an unknown constant failure rate $\lambda$ per Eqs. (7.28) & (7.29) ($T$ = given (fixed) cumulative operating time (time censoring); $k$ = number of failures during $T$; $\gamma = 1 - \beta_1 - \beta_2$ = confidence level (here $\beta_1 = \beta_2 = (1 - \gamma)/2$); for the case $MTBF = 1/\lambda$, it holds that $\widehat{MTBF} = 1/\hat{\lambda}$ (unbiased for $k \gg 1$), $\widehat{MTBF}_l = 1/\hat{\lambda}_u$, and $\widehat{MTBF}_u = 1/\hat{\lambda}_l$); Examples 7.8, 7.13

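Confidence limits $\hat{\lambda}_l$, $\hat{\lambda}_u$ can also be used to give one-sided confidence intervals:

$$\lambda \le \hat{\lambda}_u, \ \text{with } \beta_2 = 0 \text{ and } \gamma = 1 - \beta_1, \qquad \text{or} \qquad \lambda \ge \hat{\lambda}_l, \ \text{with } \beta_1 = 0 \text{ and } \gamma = 1 - \beta_2, \qquad (7.31)$$

i.e. $MTBF \ge \widehat{MTBF}_l = 1/\hat{\lambda}_u$ or $MTBF \le \widehat{MTBF}_u = 1/\hat{\lambda}_l$ for the case $MTBF = 1/\lambda$.

Example 7.8 In testing a subassembly with constant failure rate $\lambda$, 4 failures occur during $T = 10^4$ cumulative operating hours. Find the confidence interval of $\lambda$ for a confidence level $\gamma = 0.8$ ($\beta_1 = \beta_2 = 0.1$).

Solution From Fig. 7.6 it follows that for $k = 4$ and $\gamma = 0.8$, $\hat{\lambda}_l / \hat{\lambda} \approx 0.44$ and $\hat{\lambda}_u / \hat{\lambda} \approx 2$. With $T = 10^4$ h, $k = 4$, and $\hat{\lambda} = 4 \cdot 10^{-4}\ \mathrm{h}^{-1}$, the confidence limits are $\hat{\lambda}_l \approx 1.7 \cdot 10^{-4}\ \mathrm{h}^{-1}$ and $\hat{\lambda}_u \approx 8 \cdot 10^{-4}\ \mathrm{h}^{-1}$.

Supplementary result: The corresponding one-sided confidence interval is $\lambda \le 8 \cdot 10^{-4}\ \mathrm{h}^{-1}$ with $\gamma = 0.9$.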
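The limits of Example 7.8 can be computed without a $\chi^2$ table. The sketch below assumes that the lower limit is the value of $\lambda$ for which the Poisson probability of $k$ or more failures in $T$ equals $\beta_2$, and the upper limit the value for which the probability of $k$ or fewer failures equals $\beta_1$; both equations are solved by bisection. The helper names are illustrative.

```python
from math import exp


def poisson_cdf(c, m):
    """Pr{at most c events} for a Poisson distribution with parameter m."""
    if c < 0:
        return 0.0
    term = total = exp(-m)
    for i in range(1, c + 1):
        term *= m / i
        total += term
    return total


def solve_m(c, target, hi=1000.0):
    """Parameter m with poisson_cdf(c, m) = target (the CDF decreases in m)."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if poisson_cdf(c, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def failure_rate_limits(k, T, beta1, beta2):
    """Two-sided confidence limits for lambda, k failures in cumulative time T."""
    # Pr{at least k failures | lower} = beta2, i.e. Pr{at most k - 1} = 1 - beta2
    lower = 0.0 if k == 0 else solve_m(k - 1, 1 - beta2) / T
    # Pr{at most k failures | upper} = beta1
    upper = solve_m(k, beta1) / T
    return lower, upper


lam_hat = 4 / 1e4
lam_l, lam_u = failure_rate_limits(k=4, T=1e4, beta1=0.1, beta2=0.1)
print(f"{lam_l:.2e} <= lambda <= {lam_u:.2e} (gamma = 0.8)")
print(f"ratios to the point estimate: {lam_l / lam_hat:.2f}, {lam_u / lam_hat:.2f}")
```

The printed ratios can be compared with the values read from Fig. 7.6 in the solution of the example.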

In the above considerations (Eqs. (7.28) - (7.31)), the cumulative operating time $T$ was given (fixed), independent of the individual failure-free times and the number $m$ of items involved (Type I censoring). The situation is different when the number of failures $k$ is given (fixed), i.e. when the test is stopped at the occurrence of the $k$th failure (Type II censoring). Here, the cumulative operating time is a random variable (term $(k - 1)/\hat{\lambda}$ of Eqs. (7.23) and (7.24)). Using the memoryless property of homogeneous Poisson processes, it can be shown that the quantities

$$m\,(t_i^* - t_{i-1}^*) \ \text{ for renewal}, \qquad \text{and} \qquad (m - i + 1)\,(t_i^* - t_{i-1}^*) \ \text{ for no renewal}, \qquad (7.32)$$

with $i = 1, \ldots, k$ and $t_0^* = 0$, are independent observations of a random variable distributed according to $F(x) = 1 - e^{-\lambda x}$. This is necessary and sufficient to prove that the $\hat{\lambda}$ given by Eqs. (7.23) and (7.24) are maximum likelihood estimates for $\lambda$. For confidence intervals, results of Appendix A8.2.2.3 can be used.

In some practical applications, the system's failure rate confidence limits as a function of the components' failure rate confidence limits are sought. Monte Carlo simulation can help. However, for series systems, constant failure rates $\lambda_1, \ldots, \lambda_n$, time censoring, and same observation time $T$, Eqs. (2.19), (7.28), and (7.27) yield $\hat{\lambda}_S = \hat{\lambda}_1 + \ldots + \hat{\lambda}_n$. Furthermore, for given fixed $T$, $2T\lambda_i$ (considered here as a random variable, see Appendix A8.2.2.2) has a $\chi^2$ distribution with $2(k_i + 1)$ degrees of freedom (Eq. (A8.48), Table A9.2); thus, $2T\lambda_S$ has a $\chi^2$ distribution with $\sum 2(k_i + 1)$ degrees of freedom. From a $\chi^2$ distribution table (Table A9.2) one can recognize that for $\Pr\{2T\lambda_i \le 2T\hat{\lambda}_{i_u}\} = \Pr\{\lambda_i \le \hat{\lambda}_{i_u}\} \ge 0.8 = \gamma$ ($i = 1, \ldots, n$) and $\nu = 10, 20, \ldots$ ($k_1 = \ldots = k_n = 4$ as an example) one has $\Pr\{\lambda_S \le \hat{\lambda}_{1_u} + \ldots + \hat{\lambda}_{n_u}\} \ge \gamma$ ($\gamma = 1 - e^{-3/2} \approx 0.777$ is given e.g. in [7.18]). Extension to different observation times $T_i$, series-parallel structures, or Erlangian distributed failure-free times is possible [7.18]. Estimation of $\lambda/\mu$ as approximation for an unavailability $\lambda/(\lambda + \mu)$ is given in Section 7.2.2.1.

7.2.3.2 Simple Two-sided Test for the Demonstration of a Constant Failure Rate λ (or of MTBF for the Case MTBF = 1/λ)

In the context of an acceptance test, demonstration of a constant failure rate $\lambda$ (or of MTBF for the case $MTBF = 1/\lambda$) is often required, not merely its estimation as in Section 7.2.3.1. The main concern of this test is to check a null hypothesis $H_0: \lambda < \lambda_0$ against an alternative hypothesis $H_1: \lambda > \lambda_1$, on the basis of the following agreement between producer and consumer:

Items should be accepted with a probability nearly equal to (but not less than) $1 - \alpha$, if the true (unknown) $\lambda$ is less than $\lambda_0$, but rejected with a probability nearly equal to (but not less than) $1 - \beta$, if $\lambda$ is greater than $\lambda_1$ ($\lambda_0$, $\lambda_1 > \lambda_0$, and $0 < \alpha < 1 - \beta < 1$ are given (fixed) values).

$\lambda_0$ is the specified $\lambda$ and $\lambda_1$ is the maximum acceptable $\lambda$ ($1/m_0$ and $1/m_1$ in IEC 60605 [7.19] or $1/\theta_0$ and $1/\theta_1$ in MIL-STD-781 [7.22] for the case $MTBF = 1/\lambda$).

$\alpha$ is the allowed producer's risk (type I error), i.e. the probability of rejecting a true hypothesis $H_0: \lambda < \lambda_0$. $\beta$ is the allowed consumer's risk (type II error), i.e. the probability of accepting $H_0$ when the alternative hypothesis $H_1: \lambda > \lambda_1$ is true. Evaluation of the above agreement is a problem of statistical hypothesis testing (Appendix A8.3), and can be performed e.g. with a simple two-sided test or a sequential test.

With the simple two-sided test (also known as the fixed length test), the cumulative operating time $T$ and the number of allowed failures $c$ during $T$ are fixed quantities. The procedure (test plan) is as follows:

1. From $\lambda_0$, $\lambda_1$, $\alpha$, $\beta$ determine the smallest integer $c$ and the value $T$ satisfying

$$\sum_{i=0}^{c} \frac{(\lambda_0 T)^i}{i!}\, e^{-\lambda_0 T} \ge 1 - \alpha \qquad (7.33)$$

and

$$\sum_{i=0}^{c} \frac{(\lambda_1 T)^i}{i!}\, e^{-\lambda_1 T} \le \beta. \qquad (7.34)$$

2. Perform a test with a total cumulative operating time $T$, determine the number of failures $k$ during the test, and

reject $H_0: \lambda < \lambda_0$, if $k > c$,
accept $H_0: \lambda < \lambda_0$, if $k \le c$. (7.35)

For the case $MTBF = 1/\lambda$, the above procedure can be used to test $H_0: MTBF > MTBF_0$ against $H_1: MTBF < MTBF_1$, by replacing $\lambda_0 = 1/MTBF_0$ and $\lambda_1 = 1/MTBF_1$.

Example 7.9 The following conditions have been specified for the demonstration (acceptance test) of the constant (time independent) failure rate $\lambda$ of an assembly: $\lambda_0 = 1/2000$ h (specified $\lambda$), $\lambda_1 = 1/1000$ h (maximum acceptable $\lambda$), producer risk $\alpha \approx 0.2$, consumer risk $\beta \approx 0.2$. Give: (i) the cumulative test time $T$ and the allowed number of failures $c$ during $T$; (ii) the probability of acceptance if the true failure rate $\lambda$ were 1/3000 h.

Solution (i) From Fig. 7.3, $c = 6$ and $m \approx 4.6$ for Pr{acceptance} $\approx 0.82$, $c = 6$ and $m \approx 9.2$ for Pr{acceptance} $\approx 0.19$; thus $c = 6$ and $T = 9200$ h. These values agree well with those obtained from Table A9.2 ($\nu = 14$) and are given in Table 7.3. (ii) For $\lambda = 1/3000$ h, $T = 9200$ h, $c = 6$,

$$\Pr\{\text{acceptance} \mid \lambda = 1/3000\ \mathrm{h}\} = \Pr\{\text{no more than 6 failures in } T = 9200\ \mathrm{h} \mid \lambda = 1/3000\ \mathrm{h}\} = \sum_{i=0}^{6} \frac{3.07^{\,i}}{i!}\, e^{-3.07} \approx 0.96,$$

see also Fig. 7.3 for $m = 3.07$ and $c = 6$.
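The test plan of Example 7.9 can be derived numerically. The sketch below assumes that the plan must give a Poisson acceptance probability of at least $1 - \alpha$ at $\lambda_0$ and of at most $\beta$ at $\lambda_1$, with exactly $\alpha = \beta = 0.2$; it searches for the smallest $c$ for which a common $T$ exists and returns the admissible range of $T$. Function names are illustrative.

```python
from math import exp


def poisson_cdf(c, m):
    """Pr{at most c failures} for a Poisson distribution with parameter m."""
    term = total = exp(-m)
    for i in range(1, c + 1):
        term *= m / i
        total += term
    return total


def solve_m(c, target, hi=1000.0):
    """Parameter m with poisson_cdf(c, m) = target (the CDF decreases in m)."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if poisson_cdf(c, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def two_sided_plan(lam0, lam1, alpha, beta, c_max=200):
    """Smallest c, with the range of T, meeting both risk requirements."""
    for c in range(c_max + 1):
        t_max = solve_m(c, 1 - alpha) / lam0  # largest T with Pr{accept | lam0} >= 1 - alpha
        t_min = solve_m(c, beta) / lam1       # smallest T with Pr{accept | lam1} <= beta
        if t_min <= t_max:
            return c, t_min, t_max
    raise ValueError("no plan found up to c_max")


c, t_min, t_max = two_sided_plan(1 / 2000, 1 / 1000, 0.2, 0.2)
print(f"c = {c}, admissible T between {t_min:.0f} h and {t_max:.0f} h")

# Part (ii): acceptance probability for T = 9200 h if lambda were 1/3000 h
print(f"Pr(accept | 1/3000 h) = {poisson_cdf(6, 9200 / 3000):.3f}")
```

Any $T$ in the returned range satisfies both requirements for this $c$; the value $T = 9200$ h used in the example lies inside it.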

Figure 7.7 Operating characteristic curve (acceptance probability curve) Pr{acceptance | $\lambda$} = Pr{no more than $c$ failures in $T$ | $\lambda$} as a function of $\lambda$ [h$^{-1}$] for fixed $T$ and $c$ ($\lambda_0 = 1/2000$ h, $\lambda_1 = 1/1000$ h, $\alpha \approx \beta \approx 0.2$; $T = 9200$ h and $c = 6$ as per Table 7.3; see also Fig. 7.3) (holds for $MTBF_0 = 2000$ h and $MTBF_1 = 1000$ h, for the case $MTBF = 1/\lambda$)

The graph of Fig. 7.7 visualizes the validity of the above agreement between producer and consumer (customer). It satisfies the inequalities (7.33) and (7.34), and is known as the operating characteristic curve (acceptance probability curve). For each value of $\lambda$, it gives the probability of having not more than $c$ failures during a cumulative operating time $T$. Since the operating characteristic curve as a function of $\lambda$ is monotonically decreasing, the risk for a false decision decreases for $\lambda < \lambda_0$ and $\lambda > \lambda_1$, respectively. It can be shown that the quantities $c$ and $\lambda_0 T$ depend only on $\alpha$, $\beta$, and the ratio $\lambda_1 / \lambda_0$ (discrimination ratio).

Table 7.3 gives $c$ and $\lambda_0 T$ for some values of $\alpha$, $\beta$ and $\lambda_1 / \lambda_0$ useful for practical applications. For the case $MTBF = 1/\lambda$, Table 7.3 holds for testing $H_0: MTBF > MTBF_0$ against $H_1: MTBF < MTBF_1$, by replacing $\lambda_0 = 1/MTBF_0$ and $\lambda_1 = 1/MTBF_1$. Table 7.3 can also be used for the demonstration of an unknown probability $p$ (Eqs. (7.8) and (7.9)) in the case where the Poisson approximation applies. A large number of test plans are in international standards [7.19 (61124)].

In addition to the simple two-sided test described above, a sequential test is often used (see Appendix A8.3.1.2 and Section 7.1.2.2 for basic considerations and Fig. 7.8 for an example). In this test, neither the cumulative operating time $T$, nor the number $c$ of allowed failures during $T$ are specified before the test begins. The number of failures is recorded as a function of the cumulative operating time (normalized to $1/\lambda_0$). As soon as the resulting staircase curve crosses the acceptance line or the rejection line the test is stopped. Sequential tests offer the advantage that on average the test duration is shorter than with simple two-sided tests. Using Eq. (7.12) with $p_0 = 1 - e^{-\lambda_0 \delta t}$, $p_1 = 1 - e^{-\lambda_1 \delta t}$, $n = T/\delta t$, and $\delta t \to 0$ (continuous in time), the acceptance and rejection lines are obtained as

acceptance line: $y_1(x) = a\,x - b_1$, (7.36)

rejection line: $y_2(x) = a\,x + b_2$, (7.37)

with $x = \lambda_0 T$, and

$$a = \frac{\lambda_1/\lambda_0 - 1}{\ln(\lambda_1/\lambda_0)}, \qquad b_1 = \frac{\ln((1 - \alpha)/\beta)}{\ln(\lambda_1/\lambda_0)}, \qquad b_2 = \frac{\ln((1 - \beta)/\alpha)}{\ln(\lambda_1/\lambda_0)}. \qquad (7.38)$$

Sequential tests used in practical applications are given in international standards [7.19 (61124)]. To limit testing effort, restrictions are often placed on the test duration and the number of allowed failures. Figure 7.8 shows two truncated sequential test plans for $\alpha \approx \beta \approx 0.2$ and $\lambda_1/\lambda_0 = 1.5$ and 2, respectively. The lines defined by Eqs. (7.36) - (7.38) are shown dashed in Fig. 7.8a.

Table 7.3 Number of allowed failures $c$ during the cumulative operating time $T$ and value of $\lambda_0 T$ to demonstrate $\lambda < \lambda_0$ against $\lambda > \lambda_1$ for various values of $\alpha$ (producer risk), $\beta$ (consumer risk), and $\lambda_1/\lambda_0$ (can be used to test $MTBF > MTBF_0$ against $MTBF < MTBF_1$ for the case $MTBF = 1/\lambda$ or, using $\lambda_0 T = n p_0$, to test $p < p_0$ against $p > p_1$ for an unknown probability $p$)

| | $\lambda_1/\lambda_0 = 1.5$ | $\lambda_1/\lambda_0 = 2$ | $\lambda_1/\lambda_0 = 3$ |
|---|---|---|---|
| $\alpha \approx \beta \lesssim 0.1$ | $c = 40$, $\lambda_0 T \approx 32.98$ ($\alpha \approx \beta \approx 0.098$) | $c = 14$*, $\lambda_0 T \approx 10.17$ ($\alpha \approx \beta \approx 0.093$) | $c = 5$, $\lambda_0 T \approx 3.12$ ($\alpha \approx \beta \approx 0.096$) |
| $\alpha \approx \beta \lesssim 0.2$ | $c = 17$, $\lambda_0 T \approx 14.33$ ($\alpha \approx \beta \approx 0.197$) | $c = 6$, $\lambda_0 T \approx 4.62$ ($\alpha \approx \beta \approx 0.185$) | $c = 2$, $\lambda_0 T \approx 1.47$ ($\alpha \approx \beta \approx 0.184$) |
| $\alpha \approx \beta \lesssim 0.3$ | $c = 6$, $\lambda_0 T \approx 5.41$ ($\alpha \approx \beta \approx 0.2997$) | $c = 2$, $\lambda_0 T \approx 1.85$ ($\alpha \approx \beta \approx 0.284$) | $c = 1$, $\lambda_0 T \approx 0.92$ ($\alpha \approx \beta \approx 0.236$) |

* $c = 13$ yields $\lambda_0 T = 9.48$ and $\alpha \approx \beta \approx 0.1003$; number of items under test $\approx T\lambda_0$, as a rule of thumb

Example 7.10 Continuing with Example 7.9, give the expected test duration by assuming that the true $\lambda$ equals $\lambda_0$ and a sequential test as per Fig. 7.8 is used.

Solution From Fig. 7.8 with $\lambda_1/\lambda_0 = 2$ it follows that E[test duration | $\lambda = \lambda_0$] $\approx 2.4/\lambda_0 = 4800$ h.

Figure 7.8 a) Sequential test plan to demonstrate $\lambda < \lambda_0$ against $\lambda > \lambda_1$ for $\alpha \approx \beta \approx 0.2$ and $\lambda_1/\lambda_0 = 1.5$ (top), $\lambda_1/\lambda_0 = 2$ (bottom), as per IEC 61124 and MIL-HDBK-781 [7.19, 7.22] (vertical axis: number of failures; dashed on the left are the lines given by Eqs. (7.36)-(7.38)); b) Expected test duration until acceptance (continuous) and operating characteristic curve (dashed) as a function of $\lambda_0/\lambda$ (can be used to test $MTBF > MTBF_0$ against $MTBF < MTBF_1$ for the case $MTBF = 1/\lambda$)
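The untruncated decision rule behind the dashed lines of Fig. 7.8a can be sketched in a few lines. This is only an illustration: it assumes the usual Wald constants for a Poisson process, $a = (\lambda_1/\lambda_0 - 1)/\ln(\lambda_1/\lambda_0)$, $b_1 = \ln((1-\alpha)/\beta)/\ln(\lambda_1/\lambda_0)$, and $b_2 = \ln((1-\beta)/\alpha)/\ln(\lambda_1/\lambda_0)$, and the function names `sprt_lines` and `decide` are hypothetical. The truncation used in the standardized plans is not modeled.

```python
import math

def sprt_lines(alpha, beta, ratio):
    """Return (a, b1, b2) for y1 = a*x - b1 (accept) and y2 = a*x + b2 (reject).

    ratio is lambda1/lambda0 (> 1); x = lambda0 * T is the normalized test time.
    Assumption: untruncated Wald sequential test for a constant failure rate.
    """
    log_ratio = math.log(ratio)
    a = (ratio - 1.0) / log_ratio
    b1 = math.log((1.0 - alpha) / beta) / log_ratio
    b2 = math.log((1.0 - beta) / alpha) / log_ratio
    return a, b1, b2

def decide(failures, x, a, b1, b2):
    """Compare the observed number of failures at normalized time x with both lines."""
    if failures <= a * x - b1:
        return "accept"
    if failures >= a * x + b2:
        return "reject"
    return "continue"

# Plan with alpha = beta = 0.2 and lambda1/lambda0 = 2
a, b1, b2 = sprt_lines(0.2, 0.2, 2.0)
print(a, b1, b2)
print(decide(0, 2.0, a, b1, b2))   # no failure up to x = 2
print(decide(4, 1.0, a, b1, b2))   # four failures already at x = 1
```

With zero failures the staircase stays at 0, so acceptance occurs as soon as $a\,x - b_1 \ge 0$, i.e. at $x = b_1/a$.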

7.2.3.3 Simple One-sided Test for the Demonstration of a Constant Failure Rate $\lambda$ (or of MTBF for the Case $MTBF = 1/\lambda$)

Simple two-sided tests (Fig. 7.7) and sequential tests (Fig. 7.8) have the advantage that, for $\alpha = \beta$, producer and consumer run the same risk of making a false decision. However, in practical applications often only $\lambda_0$ and $\alpha$ or $\lambda_1$ and $\beta$, i.e. simple one-sided tests, are used. The considerations of Section 7.1.3 apply, and care should be taken with small values of $c$, because operating with $\lambda_0$ and $\alpha$ (or $\lambda_1$ and $\beta$) can favor the producer (or consumer). Figure 7.9 shows the operating characteristic curves for various values of $c$ as a function of $\lambda$ for the demonstration of $\lambda < 1/1000\,\mathrm{h}$ against $\lambda > 1/1000\,\mathrm{h}$ with consumer risk $\beta = 0.2$ for $\lambda = 1/1000\,\mathrm{h}$, and visualizes the reduction of the producer's risk ($\alpha \approx 0.8$ for $\lambda = 1/1000\,\mathrm{h}$) by decreasing $\lambda$ or increasing $c$ (counterpart of Fig. 7.4).


Figure 7.9 Operating characteristic curves (acceptance probability curves) for $\lambda_1 = 1/1000\,\mathrm{h}$, $\beta = 0.2$, and $c = 0$ ($T = 1610\,\mathrm{h}$), $c = 1$ ($T = 2995\,\mathrm{h}$), $c = 2$ ($T = 4280\,\mathrm{h}$), $c = 5$ ($T = 7905\,\mathrm{h}$), and $c = \infty$ ($T = \infty$) (holds for $MTBF_1 = 1000\,\mathrm{h}$, for the case $MTBF = 1/\lambda$)

7.3 Statistical Maintainability Tests

Maintainability is generally expressed as a probability. In this case, results of Sections 7.1 and 7.2.1 can be used to estimate or demonstrate maintainability. However, estimation and demonstration of specific parameters, for instance MTTR (mean time to repair), is important for practical applications. If the underlying random values are exponentially distributed (constant repair rate $\mu$), the results of Section 7.2.3 for a constant failure rate $\lambda$ can be used. This section deals with the estimation and demonstration of an MTTR by assuming that the repair time is lognormally distributed (for Erlangian distributed repair times, results of Section 7.2.3 can be used, considering Eqs. (A6.102) & (A6.103)). To simplify the notation, realizations (observations) of a repair time $\tau'$ will be denoted by $t_1, \ldots, t_n$ instead of $t'_1, \ldots, t'_n$.

7.3.1 Estimation of an MTTR

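Let $t_1, \ldots, t_n$ be independent observations (realizations) of the repair time $\tau'$ of a given item. From Eqs. (A8.6) and (A8.10), the empirical mean and variance of $\tau'$ are given by

$$\hat{\mathrm{E}}[\tau'] = \frac{1}{n}\sum_{i=1}^{n} t_i , \tag{7.39}$$

$$\widehat{\mathrm{Var}}[\tau'] = \frac{1}{n-1}\sum_{i=1}^{n} \bigl(t_i - \hat{\mathrm{E}}[\tau']\bigr)^2 . \tag{7.40}$$

For these estimates it holds that $\mathrm{E}[\hat{\mathrm{E}}[\tau']] = \mathrm{E}[\tau'] = MTTR$, $\mathrm{Var}[\hat{\mathrm{E}}[\tau']] = \mathrm{Var}[\tau']/n$, and $\mathrm{E}[\widehat{\mathrm{Var}}[\tau']] = \mathrm{Var}[\tau']$ (Appendix A8.1.2). As stated above, the repair time $\tau'$ can often be assumed lognormally distributed with distribution function (Eq. (A6.110))

$$F(t) = \frac{1}{\sigma\sqrt{2\pi}} \int_0^t \frac{1}{y}\, e^{-\frac{(\ln(\lambda y))^2}{2\sigma^2}}\, dy = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\ln(\lambda t)/\sigma} e^{-x^2/2}\, dx , \tag{7.41}$$

and with mean and variance given by (Eqs. (A6.112) and (A6.113))

$$\mathrm{E}[\tau'] = MTTR = \frac{e^{\sigma^2/2}}{\lambda}, \qquad \mathrm{Var}[\tau'] = \frac{e^{2\sigma^2} - e^{\sigma^2}}{\lambda^2} = MTTR^2\,(e^{\sigma^2} - 1) . \tag{7.42}$$

From Eq. (7.41) one recognizes that $\ln \tau'$ is normally distributed with mean $\ln(1/\lambda)$ and variance $\sigma^2$. Using Eqs. (A8.24) and (A8.27), the maximum likelihood estimation of $\lambda$ and $\sigma^2$ is obtained from

$$\hat\lambda = \Bigl[\prod_{i=1}^{n} t_i\Bigr]^{-1/n} \qquad \text{and} \qquad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \bigl(\ln(\hat\lambda\, t_i)\bigr)^2 . \tag{7.43}$$

A point estimate for $\lambda$ and $\sigma$ can also be obtained by the method of quantiles. The idea is to substitute some particular quantiles with the corresponding empirical quantiles to obtain estimates for $\lambda$ or $\sigma$. For $t = 1/\lambda$, $\ln(\lambda t) = 0$ and $F(1/\lambda) = 0.5$; therefore, $1/\lambda$ is the 0.5 quantile (median) $t_{0.5}$ of the distribution function $F(t)$ given by Eq. (7.41). From the empirical 0.5 quantile $\hat t_{0.5} = \inf\{t : \hat F_n(t) \ge 0.5\}$, an estimate for $\lambda$ follows as

$$\hat\lambda = \frac{1}{\hat t_{0.5}} . \tag{7.44}$$

Moreover, $t = e^{\sigma}/\lambda$ yields $F(e^{\sigma}/\lambda) = 0.841$ (Table A9.1); thus $e^{\sigma}/\lambda = t_{0.841}$ is the 0.841 quantile of $F(t)$ given by Eq. (7.41). Using $\lambda = 1/t_{0.5}$ and $\sigma = \ln(\lambda\, t_{0.841}) = \ln(t_{0.841}/t_{0.5})$, an estimate for $\sigma$ is obtained as

$$\hat\sigma = \ln\bigl(\hat t_{0.841}/\hat t_{0.5}\bigr) . \tag{7.45}$$

Furthermore, considering $F(e^{-\sigma}/\lambda) = 1 - 0.841 = 0.159$, i.e. $t_{0.159} = e^{-\sigma}/\lambda$, it follows that $e^{2\sigma} = \lambda\, t_{0.841} / (\lambda\, t_{0.159})$, and thus Eq. (7.45) can be replaced by

$$\hat\sigma = \tfrac{1}{2} \ln\bigl(\hat t_{0.841}/\hat t_{0.159}\bigr) . \tag{7.46}$$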


The possibility of representing a lognormal distribution function as a straight line, to simplify the interpretation of data, is discussed in Section 7.5.1 (Fig. 7.14, Appendix A9.8.1).

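To obtain interval estimates for the parameters $\lambda$ and $\sigma$, note that the logarithm of a lognormally distributed variable is normally distributed with mean $\ln(1/\lambda)$ and variance $\sigma^2$. Applying the transformation $t_i \to \ln t_i$ to the individual observations $t_1, \ldots, t_n$ and using the results known for the interval estimation of the parameters of a normal distribution [A6.1, A6.4], the confidence intervals

$$\Bigl[\, \frac{n\,\hat\sigma^2}{\chi^2_{n-1,\,(1+\gamma)/2}}\,,\ \frac{n\,\hat\sigma^2}{\chi^2_{n-1,\,(1-\gamma)/2}} \,\Bigr] \tag{7.47}$$

for $\sigma^2$, and

$$\bigl[\, \hat\lambda\, e^{-\varepsilon},\ \hat\lambda\, e^{\varepsilon} \,\bigr], \qquad \text{with} \qquad \varepsilon = \frac{\hat\sigma}{\sqrt{n-1}}\; t_{n-1,\,(1+\gamma)/2} , \tag{7.48}$$

for $\lambda$ can be found with $\hat\lambda$ and $\hat\sigma$ as in Eq. (7.43). $\chi^2_{n-1,\,q}$ and $t_{n-1,\,q}$ are the $q$ quantiles of the $\chi^2$- and t-distribution with $n - 1$ degrees of freedom, respectively (Tables A9.2 and A9.3).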

Example 7.11 Let 1.1, 1.3, 1.6, 1.9, 2.0, 2.3, 2.4, 2.7, 3.1, and 4.2 h be 10 independent observations (realizations) of a lognormally distributed repair time. Give the maximum likelihood estimate and, for $\gamma = 0.9$, the confidence interval for the parameters $\lambda$ and $\sigma^2$, as well as the maximum likelihood estimate for MTTR.

Solution Equation (7.43) yields $\hat\lambda \approx 0.476\,\mathrm{h}^{-1}$ and $\hat\sigma^2 \approx 0.146$ as maximum likelihood estimates for $\lambda$ and $\sigma^2$. From Eq. (7.42), $\widehat{MTTR} \approx e^{0.073}/0.476\,\mathrm{h}^{-1} \approx 2.26\,\mathrm{h}$. Using Eqs. (7.47) and (7.48), as well as Tables A9.2 and A9.3, the confidence intervals are $[1.46/16.919,\ 1.46/3.325] \approx [0.086,\ 0.44]$ for $\sigma^2$ and $[0.476\, e^{-0.127 \cdot 1.833},\ 0.476\, e^{0.127 \cdot 1.833}]\,\mathrm{h}^{-1} \approx [0.38,\ 0.60]\,\mathrm{h}^{-1}$ for $\lambda$, respectively.
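The numbers of Example 7.11 can be reproduced with a short script. It is a sketch under stated assumptions: the estimators are taken as $\hat\lambda = \exp(-\frac{1}{n}\sum \ln t_i)$ and $\hat\sigma^2 = \frac{1}{n}\sum (\ln(\hat\lambda t_i))^2$, the intervals as $[n\hat\sigma^2/\chi^2_{n-1,(1+\gamma)/2},\ n\hat\sigma^2/\chi^2_{n-1,(1-\gamma)/2}]$ and $\hat\lambda e^{\mp\varepsilon}$ with $\varepsilon = \hat\sigma\, t_{n-1,(1+\gamma)/2}/\sqrt{n-1}$, and the quantiles 16.919, 3.325, and 1.833 are copied from the example rather than computed. Function names are illustrative.

```python
import math

def lognormal_mle(times):
    """Maximum likelihood estimates (lambda, sigma^2) and the resulting MTTR."""
    n = len(times)
    logs = [math.log(t) for t in times]
    mean_log = sum(logs) / n
    lam = math.exp(-mean_log)                      # geometric mean of the t_i, inverted
    sigma2 = sum((x - mean_log) ** 2 for x in logs) / n
    mttr = math.exp(sigma2 / 2.0) / lam
    return lam, sigma2, mttr

def confidence_intervals(lam, sigma2, n, chi2_upper, chi2_lower, t_upper):
    """Two-sided intervals for sigma^2 and lambda; quantiles are supplied from tables."""
    var_ci = (n * sigma2 / chi2_upper, n * sigma2 / chi2_lower)
    eps = math.sqrt(sigma2 / (n - 1)) * t_upper
    lam_ci = (lam * math.exp(-eps), lam * math.exp(eps))
    return var_ci, lam_ci

repair_times = [1.1, 1.3, 1.6, 1.9, 2.0, 2.3, 2.4, 2.7, 3.1, 4.2]   # hours
lam, sigma2, mttr = lognormal_mle(repair_times)
# Quantiles for n - 1 = 9 degrees of freedom and gamma = 0.9, as quoted in the example
var_ci, lam_ci = confidence_intervals(lam, sigma2, len(repair_times), 16.919, 3.325, 1.833)
print(lam, sigma2, mttr)
print(var_ci, lam_ci)
```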

7.3.2 Demonstration of an MTTR

The demonstration of an MTTR (in an acceptance test) will be investigated here by assuming that the repair time $\tau'$ is lognormally distributed with known $\sigma^2$ (method 1A of MIL-STD-471 [7.22]). A rule is sought to test the null hypothesis $H_0$: $MTTR = MTTR_0$ against the alternative hypothesis $H_1$: $MTTR = MTTR_1$ for given type I error $\alpha$ and type II error $\beta$ (Appendix A8.3). The procedure (test plan) is as follows:

1. From $\alpha$ and $\beta$ ($0 < \alpha < 1 - \beta < 1$), determine the quantiles $t_\beta$ and $t_{1-\alpha}$ of the standard normal distribution (Table A9.1)

$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t_{1-\alpha}} e^{-x^2/2}\, dx = 1 - \alpha, \qquad \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t_{\beta}} e^{-x^2/2}\, dx = \beta . \tag{7.49}$$

From $MTTR_0$ and $MTTR_1$, compute the sample size $n$ (next highest integer)

$$n = \frac{(t_{1-\alpha}\, MTTR_0 - t_\beta\, MTTR_1)^2}{(MTTR_1 - MTTR_0)^2}\, (e^{\sigma^2} - 1) . \tag{7.50}$$

2. Perform $n$ independent repairs and record the observed repair times $t_1, \ldots, t_n$ (representative sample of repair times).

3. Compute $\hat{\mathrm{E}}[\tau']$ according to Eq. (7.39) and reject $H_0$: $MTTR = MTTR_0$ if

$$\hat{\mathrm{E}}[\tau'] > c = MTTR_0 \Bigl(1 + t_{1-\alpha} \sqrt{(e^{\sigma^2} - 1)/n}\,\Bigr) , \tag{7.51}$$

otherwise accept $H_0$.

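The proof of the above rule implies a sample size $n > 10$, so that the quantity $\hat{\mathrm{E}}[\tau']$ can be assumed to have a normal distribution with mean $MTTR$ and variance $\mathrm{Var}[\tau']/n$ (Eqs. (A6.148), (A8.7), (A8.8)). Considering the type I and type II errors

$$\alpha = \Pr\{\hat{\mathrm{E}}[\tau'] > c \mid MTTR = MTTR_0\}, \qquad \beta = \Pr\{\hat{\mathrm{E}}[\tau'] \le c \mid MTTR = MTTR_1\},$$

and using Eqs. (A6.105) and (7.49), the relationship

$$c = MTTR_0 + t_{1-\alpha} \sqrt{\mathrm{Var}_0[\tau']/n} = MTTR_1 + t_\beta \sqrt{\mathrm{Var}_1[\tau']/n} \tag{7.52}$$

can be found, with $\mathrm{Var}_0[\tau'] = (e^{\sigma^2} - 1)\, MTTR_0^2$ for $t_{1-\alpha}$ and $\mathrm{Var}_1[\tau'] = (e^{\sigma^2} - 1)\, MTTR_1^2$ for $t_\beta$ according to Eq. (7.42). The sample size $n$ (Eq. (7.50)) then follows from Eq. (7.52), and the right-hand side of Eq. (7.51) is equal to the constant $c$ as per Eq. (7.52).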

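The operating characteristic curve can be calculated from

$$\Pr\{\text{acceptance} \mid MTTR\} = \Pr\{\hat{\mathrm{E}}[\tau'] \le c \mid MTTR\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{d} e^{-x^2/2}\, dx , \tag{7.53}$$

with

$$d = \frac{MTTR_0}{MTTR}\, t_{1-\alpha} - \Bigl(1 - \frac{MTTR_0}{MTTR}\Bigr) \sqrt{n/(e^{\sigma^2} - 1)} .$$

Replacing in $d$ the quantity $n/(e^{\sigma^2} - 1)$ from Eq. (7.50), one recognizes that the operating characteristic curve is independent of $\sigma^2$ (rounding of $n$ neglected).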


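Example 7.12 Determine the rejection condition (Eq. (7.51)) and the related operating characteristic curve for the demonstration of $MTTR = MTTR_0 = 2\,\mathrm{h}$ against $MTTR = MTTR_1 = 2.5\,\mathrm{h}$ with $\alpha = \beta = 0.1$. $\sigma^2$ is assumed to be 0.2.

Solution For $\alpha = \beta = 0.1$, Eq. (7.49) and Table A9.1 yield $t_{1-\alpha} \approx 1.28$ and $t_\beta \approx -1.28$. From Eq. (7.50) it follows that $n = 30$. The rejection condition is then given by

$$\hat{\mathrm{E}}[\tau'] > 2\,\mathrm{h}\, \Bigl(1 + 1.28 \sqrt{(e^{0.2} - 1)/30}\,\Bigr) \approx 2.22\,\mathrm{h}, \qquad \text{i.e.} \qquad \sum_{i=1}^{30} t_i > 66.6\,\mathrm{h}.$$

From Eq. (7.53), the operating characteristic curve follows as

$$\Pr\{\text{acceptance} \mid MTTR\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{d} e^{-x^2/2}\, dx ,$$

with $d \approx 25.84\,\mathrm{h}/MTTR - 11.64$ (see graph: acceptance probability as a function of MTTR in h).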
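The test plan of this example can be recomputed as a cross-check. The sketch assumes the sample size rule $n = (t_{1-\alpha} MTTR_0 - t_\beta MTTR_1)^2 (e^{\sigma^2}-1)/(MTTR_1 - MTTR_0)^2$, the limit $c = MTTR_0 (1 + t_{1-\alpha}\sqrt{(e^{\sigma^2}-1)/n})$, and a normal approximation for the empirical mean; `mttr_test_plan` and `acceptance_probability` are hypothetical names. Exact normal quantiles are used instead of the rounded table value 1.28.

```python
import math
from statistics import NormalDist

def mttr_test_plan(mttr0, mttr1, sigma2, alpha, beta):
    """Return sample size n, rejection limit c, and the quantile t_(1-alpha)."""
    std_normal = NormalDist()
    t_upper = std_normal.inv_cdf(1.0 - alpha)
    t_lower = std_normal.inv_cdf(beta)
    spread = math.exp(sigma2) - 1.0                 # Var[tau'] / MTTR^2 for a lognormal
    n = math.ceil((t_upper * mttr0 - t_lower * mttr1) ** 2 * spread / (mttr1 - mttr0) ** 2)
    c = mttr0 * (1.0 + t_upper * math.sqrt(spread / n))
    return n, c, t_upper

def acceptance_probability(mttr, mttr0, sigma2, n, t_upper):
    """Operating characteristic: Pr{empirical mean <= c | true MTTR}."""
    spread = math.exp(sigma2) - 1.0
    d = (mttr0 / mttr) * t_upper - (1.0 - mttr0 / mttr) * math.sqrt(n / spread)
    return NormalDist().cdf(d)

n, c, t_upper = mttr_test_plan(2.0, 2.5, 0.2, 0.1, 0.1)
print(n, c, n * c)                                   # sample size, limit on the mean, limit on the sum
for mttr in (1.8, 2.0, 2.25, 2.5, 3.0):
    print(mttr, acceptance_probability(mttr, 2.0, 0.2, n, t_upper))
```

Because $n$ is rounded up, the acceptance probability at $MTTR_1$ comes out slightly below the nominal $\beta$.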

7.4 Accelerated Testing

The failure rate $\lambda$ of electronic components lies typically between $10^{-10}$ and $10^{-7}\,\mathrm{h}^{-1}$, and that of assemblies in the range of $10^{-7}$ to $10^{-5}\,\mathrm{h}^{-1}$. With such figures, cost and scheduling considerations demand the use of accelerated testing for $\lambda$ estimation and demonstration, in particular if reliable field data are not available. An accelerated test is a test in which the applied stress is chosen to exceed that encountered in field operation, but still below the technological limits. This is done to shorten the time to failure of the item considered while avoiding an alteration of the involved failure mechanism (genuine acceleration). In accelerated tests, failure mechanisms are assumed to be activated selectively by increased stress. The quantitative relationship between degree of activation and extent of stress, i.e. the acceleration factor $A$, is determined via specific tests. Generally it is assumed that the stress will not change the type of the failure-free time distribution function of the item under test, but only modify the parameters. In the following, this hypothesis is assumed to be valid (its verification should precede each statistical evaluation of data issued from accelerated tests).

Many electronic component failure mechanisms are activated through an increase in temperature. In calculating the acceleration factor $A$, the Arrhenius model can often be applied over a reasonably large temperature range (for instance 0 to 150°C for ICs). The Arrhenius model is based on the Arrhenius rate law [3.44], which states that the rate $v$ of a simple (first-order) chemical reaction depends on temperature $T$ as

$$v = v_0\, e^{-E_a/kT} . \tag{7.54}$$

$E_a$ and $v_0$ are parameters, $k$ is the Boltzmann constant ($k = 8.6 \cdot 10^{-5}\,\mathrm{eV/K}$), and $T$ is the absolute temperature in kelvin. $E_a$ is the activation energy and is expressed in eV. Assuming that the event considered (for example the diffusion between two liquids) occurs when the chemical reaction has reached a given threshold, and that the reaction time dependence is given by a function $r(t)$, the relationship between the times $t_1$ and $t_2$ necessary to reach a given level of the chemical reaction at two temperatures $T_1$ and $T_2$ can be expressed as

$$v_1\, r(t_1) = v_2\, r(t_2) .$$

Furthermore, assuming $r(t) \sim t$, i.e. a linear time dependence, it follows that

$$v_1\, t_1 = v_2\, t_2 .$$

Substituting in Eq. (7.54) and rearranging yields

$$\frac{t_1}{t_2} = \frac{v_2}{v_1} = e^{\frac{E_a}{k}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)} .$$

By transferring this deterministic model to the mean times to failure $MTTF_1$ and $MTTF_2$ or to the constant failure rates $\lambda_2$ and $\lambda_1$ (using $MTTF = 1/\lambda$) of a given item at temperatures $T_1$ and $T_2$, it is possible to define an acceleration factor $A$

$$A = \frac{MTTF_1}{MTTF_2}, \qquad \text{or} \qquad A = \frac{\lambda_2}{\lambda_1} \quad \text{for constant failure rate}, \tag{7.55}$$

expressed by

$$A = e^{\frac{E_a}{k}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)} . \tag{7.56}$$

The right-hand side of Eq. (7.55) applies to the case of a constant (time independent but stress dependent) failure rate $\lambda(t) = \lambda$. In some cases, $A = t_{0.5,1}/t_{0.5,2}$ (ratio of the medians at $T_1$ and $T_2$) is assumed (Appendix A6.6.3). Eq. (7.56) can be inverted to give an estimate $\hat E_a$ for the activation energy $E_a$ based on $\hat\lambda_1$ and $\hat\lambda_2$ obtained empirically from two life tests at temperatures $T_1$ and $T_2$. To verify the model, at least three tests at $T_1$, $T_2$, and $T_3$ are necessary. Activation energy is highly dependent upon the particular failure mechanism involved (see e.g. Table 3.6 for some indications for semiconductor devices). High $E_a$ values lead to high acceleration factors, due to the assumed relations $v_1 t_1 = v_2 t_2$ and $v \sim 1/e^{E_a/kT}$. For ICs, global values of $E_a$ lie between 0.3 and 0.7 eV (Table 3.6), values which could basically be obtained empirically from the curves of the failure rate as a function of the junction temperature. It must be noted that the Arrhenius model does not hold for all electronic devices and for any temperature range. Figure 7.10 shows the acceleration factor $A$ from Eq. (7.56) as a function of $\theta_2$ in °C, for $\theta_1 = 35$ and 55°C and with $E_a$ as parameter ($\theta_i = T_i - 273$).

In the case of a constant failure rate $\lambda$, the acceleration factor $A = \lambda_2/\lambda_1$ can be used as a multiplicative factor in the conversion of the cumulative operating time from stress $T_2$ to stress $T_1$ (Example 7.13). In practical applications, the acceleration factor $A$ often lies between 10 and a few hundred, seldom above 1000 (Examples 7.13 and 7.14).
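As a numerical illustration of the Arrhenius model, the following sketch evaluates $A = \exp[(E_a/k)(1/T_1 - 1/T_2)]$ and its inversion for $E_a$. It assumes $k = 8.6 \cdot 10^{-5}$ eV/K and $T_i = \theta_i + 273$ as in the text; the function names are illustrative only.

```python
import math

BOLTZMANN_EV_PER_K = 8.6e-5   # value quoted in the text

def arrhenius_factor(ea_ev, theta1_c, theta2_c):
    """Acceleration factor between temperatures theta1 and theta2 (both in deg C)."""
    t1 = theta1_c + 273.0
    t2 = theta2_c + 273.0
    return math.exp((ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t1 - 1.0 / t2))

def activation_energy(accel, theta1_c, theta2_c):
    """Invert the Arrhenius relation: activation energy (eV) from an observed factor A."""
    t1 = theta1_c + 273.0
    t2 = theta2_c + 273.0
    return BOLTZMANN_EV_PER_K * math.log(accel) / (1.0 / t1 - 1.0 / t2)

a_factor = arrhenius_factor(0.4, 35.0, 130.0)
print(a_factor)                                   # about 35 for Ea = 0.4 eV, 35 -> 130 deg C
print(activation_energy(a_factor, 35.0, 130.0))   # recovers 0.4 eV
print(1.0e7 * a_factor)                           # equivalent operating hours at 35 deg C
```

In practice the observed factor would be the ratio $\hat\lambda_2/\hat\lambda_1$ of two estimated failure rates, so the recovered $E_a$ inherits their statistical uncertainty.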

Figure 7.10 Acceleration factor $A$ according to the Arrhenius model (Eq. (7.56)) as a function of $\theta_2$ for $\theta_1 = 35$ and 55°C, and with $E_a$ in eV as parameter ($\theta_i = T_i - 273$)


If the item under consideration exhibits more than one dominant failure mechanism or consists of elements $E_1, \ldots, E_n$ having different failure mechanisms, the series reliability model (Sections 2.2.6.1 and 2.3.6) can often be used to calculate the compound failure rate $\lambda_S(T_2)$ at temperature (stress) $T_2$ by considering the failure rates of the individual elements $\lambda_i(T_1)$ and the corresponding acceleration factors $A_i$

$$\lambda_S(T_2) = \sum_{i=1}^{n} A_i\, \lambda_i(T_1) . \tag{7.57}$$

Example 7.13 Four failures have occurred during $10^7$ cumulative operating hours of a digital CMOS IC at a chip temperature of 130°C. Assuming $\theta_1 = 35$°C, a constant failure rate $\lambda$, and an activation energy $E_a = 0.4\,\mathrm{eV}$, give the interval estimation of $\lambda$ for $\gamma = 0.8$.

Solution For $\theta_1 = 35$°C, $\theta_2 = 130$°C, and $E_a = 0.4\,\mathrm{eV}$ it follows from Fig. 7.10 or Eq. (7.56) that $A \approx 35$. The cumulative operating time at 35°C is thus $T = 0.35 \cdot 10^9\,\mathrm{h}$, and the point estimate for $\lambda$ is $\hat\lambda = k/T \approx 11.4 \cdot 10^{-9}\,\mathrm{h}^{-1}$. With $k = 4$ and $\gamma = 0.8$, it follows from Fig. 7.6 that $\hat\lambda_l/\hat\lambda \approx 0.43$ and $\hat\lambda_u/\hat\lambda \approx 2$; the confidence interval of $\lambda$ is therefore $[4.9,\ 22.8] \cdot 10^{-9}\,\mathrm{h}^{-1}$.

Example 7.14

A PCB contains 10 metal film resistors with stress factor $S = 0.1$ and $\lambda(25\,°\mathrm{C}) = 0.2 \cdot 10^{-9}\,\mathrm{h}^{-1}$, 5 ceramic capacitors (class 1) with $S = 0.4$ and $\lambda(25\,°\mathrm{C}) = 0.8 \cdot 10^{-9}\,\mathrm{h}^{-1}$, 2 electrolytic capacitors (Al wet) with $S = 0.6$ and $\lambda(25\,°\mathrm{C}) = 6 \cdot 10^{-9}\,\mathrm{h}^{-1}$, and 4 ceramic-packaged linear ICs with $\Delta\theta_{JA} = 10$°C and $\lambda(35\,°\mathrm{C}) = 20 \cdot 10^{-9}\,\mathrm{h}^{-1}$. Neglecting the contribution of the printed wiring and of the solder joints, give the failure rate of the PCB at a burn-in temperature $\theta_A$ of 80°C on the basis of failure rate relationships as given in Fig. 2.4.

Solution The resistor and capacitor acceleration factors can be obtained from Fig. 2.4 as

resistor: $A \approx 2.5/0.7 \approx 3.6$
ceramic capacitor (class 1): $A \approx 4.2/0.5 = 8.4$
electrolytic capacitor (Al wet): $A \approx 13.6/0.35 \approx 38.9$.

Using Eq. (2.4) for the ICs, it follows that $\lambda \sim \pi_T$. With $\theta_J = 35$°C and 90°C, the acceleration factor for the linear ICs can then be obtained from Fig. 2.5 as $A \approx 7.5/0.8 \approx 9.4$. From Eq. (7.57), the failure rate of the PCB is then

$$\lambda(80\,°\mathrm{C}) \approx (10 \cdot 0.2 \cdot 3.6 + 5 \cdot 0.8 \cdot 8.4 + 2 \cdot 6 \cdot 38.9 + 4 \cdot 20 \cdot 9.4) \cdot 10^{-9}\,\mathrm{h}^{-1} \approx 1.26 \cdot 10^{-6}\,\mathrm{h}^{-1} .$$
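The compound failure rate of this example is a weighted sum, which the following sketch evaluates. It assumes the series model $\lambda_S(T_2) = \sum_i A_i \lambda_i(T_1)$, with each element type counted as many times as it appears on the board; the data structure and the name `compound_failure_rate` are illustrative, and the acceleration factors are the ratios read from Figs. 2.4 and 2.5 as quoted in the solution.

```python
# (element type, quantity, failure rate at reference temperature in 1/h, acceleration factor)
PCB_ELEMENTS = [
    ("metal film resistor",            10, 0.2e-9, 2.5 / 0.7),
    ("ceramic capacitor (class 1)",     5, 0.8e-9, 4.2 / 0.5),
    ("electrolytic capacitor (Al wet)", 2, 6.0e-9, 13.6 / 0.35),
    ("linear IC (ceramic package)",     4, 20.0e-9, 7.5 / 0.8),
]

def compound_failure_rate(elements):
    """Series model: sum of quantity * reference failure rate * acceleration factor."""
    return sum(count * rate * accel for _, count, rate, accel in elements)

def contributions(elements):
    """Share of each element type in the compound failure rate."""
    total = compound_failure_rate(elements)
    return {name: count * rate * accel / total for name, count, rate, accel in elements}

lambda_burn_in = compound_failure_rate(PCB_ELEMENTS)
print(lambda_burn_in)                 # roughly 1.26e-6 per hour at 80 deg C ambient
for name, share in contributions(PCB_ELEMENTS).items():
    print(f"{name}: {share:.1%}")
```

The breakdown shows that the ICs and the electrolytic capacitors dominate the burn-in failure rate, which is why their acceleration factors deserve the most scrutiny.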


A further model for investigating the time scale reduction (time compression) resulting from an increase in temperature has been proposed by H. Eyring [3.44, 7.24]. The Eyring model defines the acceleration factor as

where B is not necessarily an activation energy. Eyring also suggests the following model, which considers the influences of temperature T and of a further stress X

Equation (7.59) is known as the generalized Eyring model. In this model, a function of the normalized variable $x = X/X_0$ can also be used instead of the quantity $X$ itself (for example $x^n$, $1/x^n$, $\ln x^n$, $\ln(1/x^n)$). $B$ is not necessarily an activation energy; $C$ and $D$ are constants. The generalized Eyring model led to accepted models for electromigration (Black), corrosion (Peck), and voltage stress (Kemeny)

where $j$ = current density, $RH$ = relative humidity, and $V$ = voltage, respectively (see also Eqs. (3.2)-(3.6) and Table 3.6). For failure mechanisms related to mechanical fatigue, Coffin-Manson simplified models [2.61, 2.72] (based on the inverse power law) can often be used, yielding for the number of cycles to failure

where $\Delta T$ refers to thermal cycles and $G$ refers to $g_{rms}$ values in vibration tests ($0.5 < \beta_T < 0.8$ and $0.7 < \beta_M < 0.9$ often occur in practical applications). For damage accumulation, Miner's hypothesis of independent damage increments [3.56] must be considered with care. Also known for conductive filament formation is Rudra's model [7.26].

Critical remarks on accelerated tests are given e.g. in [7.13, 7.15, 7.21]. Refinement of the above models is in progress, in particular for ULSI ICs, with emphasis on:

1. New failure mechanisms in oxide and package, as well as new externally induced failure mechanisms.

2. Identification and analysis of causes for early failures or premature wearout.

3. Development of physical models for failure mechanisms and of simplified models for reliability predictions in practical applications.

Such efforts will give better physical understanding of the component's failure rate.

In addition to the accelerated tests discussed above, a rough estimate of component life time can often be obtained through short-term tests under extreme stresses (HALT, HAST, etc.). Examples are humidity testing of plastic-packaged ICs at high pressure and nearly 100% RH, or tests of ceramic-packaged ICs at temperatures up to 350°C. Experience shows that under high stress, life time is often lognormally distributed, thus with a strong time dependence of the failure rate (see e.g. Table A6.1 and Appendix A6.10.5). Highly accelerated tests (HAST) can activate failure mechanisms which would not occur during normal operation, so care is necessary in extrapolating results to situations exhibiting lower stresses. Often, the purpose of such tests is to force (not only to activate) failures. They thus belong to the class of semi-destructive or destructive tests, often used at the qualification of prototypes to investigate possible failure modes. The same holds for step-stress accelerated tests (often used as life tests or in screening procedures), for which accumulation of damage can be more complex than given e.g. in [7.20, 7.28]. A case-by-case investigation is mandatory for all tests of this kind.

7.5 Goodness-of-fit Tests

Let $t_1, \ldots, t_n$ be $n$ independent observations of a random variable $\tau$ distributed according to $F(t)$. A rule is sought to test the null hypothesis $H_0$: $F(t) = F_0(t)$, for a given type I error $\alpha$ (probability of rejecting a true hypothesis $H_0$), against a general alternative hypothesis $H_1$: $F(t) \ne F_0(t)$. Goodness-of-fit tests deal with such testing of hypotheses and are often based on the empirical distribution function (EDF); see Appendix A8.3 for an introduction. This section shows the use of the Kolmogorov-Smirnov and chi-square tests (see p. 534 for some related tests). Trend tests are discussed in Section 7.6.

7.5.1 Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test is based on the convergence for $n \to \infty$ of the empirical distribution function (Eq. (A8.1))

$$\hat F_n(t) = \begin{cases} 0 & \text{for } t < t_{(1)} \\ i/n & \text{for } t_{(i)} \le t < t_{(i+1)} \\ 1 & \text{for } t \ge t_{(n)} \end{cases} \tag{7.62}$$

to the true distribution function, and compares the experimentally obtained $\hat F_n(t)$ with the given (postulated) $F_0(t)$. $F_0(t)$ is assumed here to be known and continuous; $t_{(1)}, \ldots, t_{(n)}$ are the ordered observations. The procedure is as follows:

Figure 7.11 Largest deviation $y_{1-\alpha}$ between a postulated distribution function $F_0(t)$ and the corresponding empirical distribution function $\hat F_n(t)$ at the level $1 - \alpha$ ($\Pr\{D_n \le y_{1-\alpha} \mid F_0(t) \text{ true}\} = 1 - \alpha$)

1. Determine the largest deviation $D_n$ between $\hat F_n(t)$ and $F_0(t)$

$$D_n = \sup_{-\infty < t < \infty} \bigl|\hat F_n(t) - F_0(t)\bigr| . \tag{7.63}$$

2. From the given type I error $\alpha$ and the sample size $n$, use Table A9.5 or Fig. 7.11 to determine the critical value $y_{1-\alpha}$.

3. Reject $H_0$: $F(t) = F_0(t)$ if $D_n > y_{1-\alpha}$; otherwise accept $H_0$.

This procedure can be easily combined with a graphical evaluation of data. For this purpose, $\hat F_n(t)$ and the band $F_0(t) \pm y_{1-\alpha}$ are drawn using a probability chart on which $F_0(t)$ can be represented by a straight line. If $\hat F_n(t)$ leaves the band $F_0(t) \pm y_{1-\alpha}$, the hypothesis $H_0$: $F(t) = F_0(t)$ is to be rejected (note that the band width is not constant when using a probability chart). Probability charts are discussed in Appendix A8.1.3; examples are in Appendix A9.8 and Figs. 7.12 - 7.14. Example 7.15 (Fig. 7.12) shows a graphical evaluation of data for the case of a Weibull distribution, Example 7.16 (Fig. 7.13) investigates the distribution function of a population with early failures and a constant failure rate using a Weibull probability chart, and Example 7.17 (Fig. 7.14) uses the Kolmogorov-Smirnov test to check agreement with a lognormal distribution. If $F_0(t)$ is not completely known, a modification is necessary (Appendix A8.3.3).

Example 7.15 Accelerated life testing of a wet Al electrolytic capacitor leads to the following 13 ordered observations of lifetime: 59, 71, 153, 235, 347, 589, 837, 913, 1185, 1273, 1399, 1713, and 2567 h. (i) Draw the empirical distribution function of the data on a Weibull probability chart. (ii) Assuming that the underlying distribution function is Weibull, determine $\hat\lambda$ and $\hat\beta$ graphically. (iii) The maximum likelihood estimation of $\lambda$ & $\beta$ yields $\hat\beta = 1.12$; calculate $\hat\lambda$ and compare the results of (iii) with (ii).

Solution (i) Figure 7.12 presents the empirical distribution function $\hat F_n(t)$ on Weibull probability paper. (ii) The graphical determination of $\lambda$ and $\beta$ leads to (straight line (ii)) $\hat\lambda \approx 1/840\,\mathrm{h}$ and $\hat\beta \approx 1.05$. (iii) With $\hat\beta \approx 1.12$, Eq. (A8.31) yields $\hat\lambda \approx 1/908\,\mathrm{h}$ (straight line (iii)) (see also Example A8.11).
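Part (iii) of this example can be checked numerically. The sketch assumes the Weibull parametrization $F(t) = 1 - e^{-(\lambda t)^\beta}$ and the standard maximum likelihood relation for complete samples, $\hat\lambda = [\,n / \sum t_i^{\hat\beta}\,]^{1/\hat\beta}$, which is taken here as the content of Eq. (A8.31); `weibull_scale_mle` is a hypothetical name, and $\hat\beta = 1.12$ is used as given rather than estimated.

```python
def weibull_scale_mle(times, beta):
    """ML estimate of lambda for F(t) = 1 - exp(-(lambda*t)**beta), beta given."""
    n = len(times)
    return (n / sum(t ** beta for t in times)) ** (1.0 / beta)

lifetimes = [59, 71, 153, 235, 347, 589, 837, 913, 1185, 1273, 1399, 1713, 2567]  # hours
lam_hat = weibull_scale_mle(lifetimes, 1.12)
print(1.0 / lam_hat)        # characteristic life 1/lambda in hours, close to 908 h

# For beta = 1 the estimate reduces to the exponential case: 1/lambda = sample mean
print(1.0 / weibull_scale_mle(lifetimes, 1.0), sum(lifetimes) / len(lifetimes))
```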


Figure 7.12 Empirical distribution function $\hat F_n(t)$ and estimated Weibull distribution functions ((ii) and (iii)) as per Example 7.15



Figure 7.13 Shape of a weighted sum of a Weibull distribution $F_a(t)$ and an exponential distribution $F_b(t)$ as per Example 7.16, useful to detect (describe) early failures (similar for wearout failures)

Example 7.16 Investigate the mixed distribution function $F(t) = 0.2\,[1 - e^{-(0.1 t)^{0.5}}] + 0.8\,[1 - e^{-0.0005 t}]$ on a Weibull probability chart (describing a possible early failure period).

Solution The weighted sum of a Weibull distribution ($\beta = 0.5$, $\lambda = 0.1\,\mathrm{h}^{-1}$, and $MTTF = 20\,\mathrm{h}$) with an exponential distribution ($\lambda = 0.0005\,\mathrm{h}^{-1}$ and $MTTF = MTBF = 1/\lambda = 2000\,\mathrm{h}$) represents the distribution function of a population of items with failure rate

$$\lambda(t) = \frac{f(t)}{1 - F(t)} = \frac{0.01\,(0.1 t)^{-0.5}\, e^{-(0.1 t)^{0.5}} + 0.0004\, e^{-0.0005 t}}{0.2\, e^{-(0.1 t)^{0.5}} + 0.8\, e^{-0.0005 t}} ,$$

i.e. with early failures up to about $t = 200\,\mathrm{h}$, see graph ($\lambda(t)$ is practically constant at $0.0005\,\mathrm{h}^{-1}$ for $t$ between 300 h and 400,000 h, so that for $t > 300\,\mathrm{h}$ a constant failure rate can be assumed for practical purposes). Figure 7.13 gives the function $F(t)$ on a Weibull probability chart, showing the typical s-shape.
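The behavior of the failure rate in this example can be tabulated directly. The sketch evaluates $\lambda(t) = f(t)/(1 - F(t))$ for the mixture $F(t) = 0.2[1 - e^{-(0.1t)^{0.5}}] + 0.8[1 - e^{-0.0005t}]$; the function name is illustrative.

```python
import math

def mixture_failure_rate(t):
    """Failure rate f(t) / (1 - F(t)) of the Weibull/exponential mixture, t in hours (t > 0)."""
    weibull_survival = math.exp(-math.sqrt(0.1 * t))
    exponential_survival = math.exp(-0.0005 * t)
    # Density: derivative of each weighted component of F(t)
    density = (0.2 * 0.5 * 0.1 * (0.1 * t) ** -0.5 * weibull_survival
               + 0.8 * 0.0005 * exponential_survival)
    survival = 0.2 * weibull_survival + 0.8 * exponential_survival
    return density / survival

for t in (1, 10, 50, 100, 200, 300, 1000):
    print(t, mixture_failure_rate(t))
```

The printed values fall steeply over the first couple of hundred hours and then settle near $0.0005\,\mathrm{h}^{-1}$, which is the early-failure pattern described in the solution.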

Example 7.17 Use the Kolmogorov-Smirnov test to verify, with a type I error $\alpha = 0.2$, whether the repair times defined by the observations $t_1, \ldots, t_{10}$ of Example 7.11 are distributed according to a lognormal distribution function with parameters $\lambda = 0.5\,\mathrm{h}^{-1}$ and $\sigma = 0.4$ (hypothesis $H_0$).

Solution The lognormal distribution (Eq. (7.41)) with $\lambda = 0.5\,\mathrm{h}^{-1}$ and $\sigma = 0.4$ is represented by a straight line on Fig. 7.14 ($F_0(t)$). With $\alpha = 0.2$ and $n = 10$, Table A9.5 or Fig. 7.11 yields $y_{1-\alpha} = 0.323$ and thus the band $F_0(t) \pm 0.323$. Since the empirical distribution function $\hat F_n(t)$ does not leave the band $F_0(t) \pm y_{1-\alpha}$, the hypothesis $H_0$ can be accepted.
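The graphical check of this example can be complemented by computing $D_n$ directly. The sketch below assumes the lognormal distribution function in the form $F_0(t) = \Phi(\ln(\lambda t)/\sigma)$ and evaluates the supremum at the jump points of the empirical distribution function, just before and just after each ordered observation. The critical value 0.323 is taken from the example, and the function names are illustrative.

```python
import math
from statistics import NormalDist

def lognormal_cdf(t, lam, sigma):
    """Lognormal distribution function F0(t) = Phi(ln(lam*t)/sigma)."""
    return NormalDist().cdf(math.log(lam * t) / sigma)

def ks_statistic(observations, cdf):
    """Largest deviation between the empirical distribution function and cdf."""
    ordered = sorted(observations)
    n = len(ordered)
    largest = 0.0
    for i, t in enumerate(ordered, start=1):
        f0 = cdf(t)
        # The EDF jumps from (i-1)/n to i/n at t; check both sides of the jump
        largest = max(largest, abs(i / n - f0), abs((i - 1) / n - f0))
    return largest

repair_times = [1.1, 1.3, 1.6, 1.9, 2.0, 2.3, 2.4, 2.7, 3.1, 4.2]   # hours, Example 7.11
d_n = ks_statistic(repair_times, lambda t: lognormal_cdf(t, 0.5, 0.4))
critical_value = 0.323                      # alpha = 0.2, n = 10, as quoted in the example
print(d_n, "reject H0" if d_n > critical_value else "accept H0")
```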

Figure 7.14 Kolmogorov-Smirnov test to check the repair time distribution as per Example 7.17 (the distribution function with $\hat\lambda$ and $\hat\sigma$ from Example 7.11 is shown dashed for information only)

7.5.2 Chi-square Test

The chi-square test ($\chi^2$ test) can be used for continuous or noncontinuous distribution functions $F_0(t)$ of $\tau$. Furthermore, $F_0(t)$ need not be completely known.

For $F_0(t)$ completely known, the procedure is as follows:

1. Partition the definition range of the random variable $\tau$ into $k$ intervals (classes) $(a_1, a_2]$, $(a_2, a_3]$, ..., $(a_k, a_{k+1}]$; the choice of the classes must be made independently of the observations $t_1, \ldots, t_n$ (rule: $n\,p_i \ge 5$, with $p_i$ as in point 3).

2. Determine the number of observations $k_i$ in each class $(a_i, a_{i+1}]$, $i = 1, \ldots, k$ ($k_i$ = number of $t_j$ with $a_i < t_j \le a_{i+1}$).

3. Assuming the hypothesis $H_0$, compute the expected number of observations for each class $(a_i, a_{i+1}]$

$$n\,p_i = n\,\bigl(F_0(a_{i+1}) - F_0(a_i)\bigr), \qquad i = 1, \ldots, k . \tag{7.64}$$

4. Compute the statistic

$$X_n^2 = \sum_{i=1}^{k} \frac{(k_i - n\,p_i)^2}{n\,p_i} = \sum_{i=1}^{k} \frac{k_i^2}{n\,p_i} - n . \tag{7.65}$$

5. For a given type I error $\alpha$, use Table A9.2 or Fig. 7.15 to determine the $(1 - \alpha)$ quantile of the chi-square distribution with $k - 1$ degrees of freedom, $\chi^2_{k-1,\,1-\alpha}$.

6. Reject $H_0$: $F(t) = F_0(t)$ if $X_n^2 > \chi^2_{k-1,\,1-\alpha}$; otherwise accept $H_0$.

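If $F_0(t)$ is not completely known ($F_0(t) = F_0(t, \theta_1, \ldots, \theta_r)$, where $\theta_1, \ldots, \theta_r$ are unknown parameters), modify the above procedure after step 2 as follows:

3'. On the basis of the observations $k_i$ in each class $(a_i, a_{i+1}]$, $i = 1, \ldots, k$, determine the maximum likelihood estimates for the parameters $\theta_1, \ldots, \theta_r$ from the following system of $r$ algebraic equations

$$\sum_{i=1}^{k} \frac{k_i}{p_i(\theta_1, \ldots, \theta_r)}\, \frac{\partial p_i(\theta_1, \ldots, \theta_r)}{\partial \theta_j} \bigg|_{\theta = \hat\theta} = 0, \qquad j = 1, \ldots, r , \tag{7.66}$$

with $p_i = F_0(a_{i+1}, \theta_1, \ldots, \theta_r) - F_0(a_i, \theta_1, \ldots, \theta_r) > 0$, $p_1 + \ldots + p_k = 1$, and $k_1 + \ldots + k_k = n$; for each class $(a_i, a_{i+1}]$, compute the expected number of observations, i.e.

$$n\,\hat p_i = n\,\bigl[F_0(a_{i+1}, \hat\theta_1, \ldots, \hat\theta_r) - F_0(a_i, \hat\theta_1, \ldots, \hat\theta_r)\bigr], \qquad i = 1, \ldots, k . \tag{7.67}$$

4'. Calculate the statistic

$$\hat X_n^2 = \sum_{i=1}^{k} \frac{(k_i - n\,\hat p_i)^2}{n\,\hat p_i} = \sum_{i=1}^{k} \frac{k_i^2}{n\,\hat p_i} - n . \tag{7.68}$$

5'. For a given type I error $\alpha$, use Table A9.2 or Fig. 7.15 to determine the $(1 - \alpha)$ quantile of the $\chi^2$ distribution with $k - 1 - r$ degrees of freedom.

6'. Reject $H_0$: $F(t) = F_0(t)$ if $\hat X_n^2 > \chi^2_{k-1-r,\,1-\alpha}$; otherwise accept $H_0$.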

Comparing the above two procedures, it can be noted that the number of degrees of freedom has been reduced from $k - 1$ to $k - 1 - r$, where $r$ is the number of parameters of $F_0(t)$ which have been estimated from the observations $t_1, \ldots, t_n$ using the multinomial distribution (Example A8.13).

Example 7.18 Let 160, 380, 620, 650, 680, 730, 750, 920, 1000, 1100, 1400, 1450, 1700, 2000, 2200, 2800, 3000, 4600, 4700, and 5000 h be 20 independent observations (realizations) of the failure-free time $\tau$ for a given assembly. Using the chi-square test for $\alpha = 0.1$ and the 4 classes $(0, 500]$, $(500, 1000]$, $(1000, 2000]$, $(2000, \infty)$, determine whether or not $\tau$ is exponentially distributed (hypothesis $H_0$: $F(t) = 1 - e^{-\lambda t}$, $\lambda$ unknown).

Solution The given classes yield the numbers of observations $k_1 = 2$, $k_2 = 7$, $k_3 = 5$, and $k_4 = 6$. The point estimate of $\lambda$ is then given by Eq. (7.66) with $p_i = e^{-\lambda a_i} - e^{-\lambda a_{i+1}}$, yielding for $\lambda$ the numerical solution $\hat\lambda \approx 0.562 \cdot 10^{-3}\,\mathrm{h}^{-1}$. Thus, the numbers of expected observations in each of the 4 classes are, according to Eq. (7.67), $n\,\hat p_1 = 4.899$, $n\,\hat p_2 = 3.699$, $n\,\hat p_3 = 4.90$, and $n\,\hat p_4 = 6.499$. From Eq. (7.68) it follows that $\hat X_{20}^2 = 4.70$ and from Table A9.2, $\chi^2_{2,\,0.9} = 4.605$. The hypothesis $H_0$: $F(t) = 1 - e^{-\lambda t}$ must be rejected since $\hat X_n^2 > \chi^2_{k-1-r,\,1-\alpha}$.
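The numerical solution quoted in this example can be reproduced as follows. The sketch assumes that the grouped-data maximum likelihood estimate maximizes $\sum_i k_i \ln p_i(\lambda)$ with $p_i = e^{-\lambda a_i} - e^{-\lambda a_{i+1}}$, and uses a plain ternary search over a hand-picked bracket instead of solving the likelihood equation analytically. The function names, the bracket, and the search method are illustrative choices; the quantile 4.605 is copied from the example.

```python
import math

def class_probabilities(lam, bounds):
    """Class probabilities p_i for an exponential distribution; bounds may end with inf."""
    survival = [0.0 if a == math.inf else math.exp(-lam * a) for a in bounds]
    return [survival[i] - survival[i + 1] for i in range(len(bounds) - 1)]

def grouped_mle(counts, bounds, lo=1e-5, hi=1e-2):
    """Maximize the grouped log-likelihood sum(k_i * ln p_i) by ternary search."""
    def log_likelihood(lam):
        return sum(k * math.log(p) for k, p in zip(counts, class_probabilities(lam, bounds)))
    for _ in range(200):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(m1) < log_likelihood(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0

def chi_square_statistic(counts, probabilities):
    """Sum of (observed - expected)^2 / expected over all classes."""
    n = sum(counts)
    return sum((k - n * p) ** 2 / (n * p) for k, p in zip(counts, probabilities))

bounds = [0.0, 500.0, 1000.0, 2000.0, math.inf]    # class limits in hours
counts = [2, 7, 5, 6]                              # observations per class
lam_hat = grouped_mle(counts, bounds)
p_hat = class_probabilities(lam_hat, bounds)
statistic = chi_square_statistic(counts, p_hat)
print(lam_hat, [20 * p for p in p_hat], statistic)
print("reject H0" if statistic > 4.605 else "accept H0")   # k - 1 - r = 2 degrees of freedom
```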

Figure 7.15 $(1 - \alpha)$ quantile ($\alpha$ percentage point) of the chi-square distribution with $\nu$ degrees of freedom ($\chi^2_{\nu,\,1-\alpha}$)


7.6 Statistical Analysis of General Reliability Data

7.6.1 General Considerations

In Sections 7.2 - 7.5, data were issued from a sample of a random variable $\tau$, i.e. they were $n$ statistically independent realizations (observations) $t_1, \ldots, t_n$ of a random variable $\tau$ distributed according to $F(t) = \Pr\{\tau \le t\}$, and belonging to one of the following equivalent situations:

1. Life times $t_1, \ldots, t_n$ of $n$ statistically identical and independent items, all starting at $t = 0$ when plotted on the time axis (e.g. as in Figs. 1.1, 7.12, 7.14).

2. Failure-free times separating successive failure occurrences of a repairable item (system) with negligible repair times and repaired (restored) as a whole to as-good-as-new at each repair; i.e., statistically identical and independent interarrival times with a common distribution function $F(x)$, yielding a renewal process.

To this data structure belongs also the case considered in Example 7.19.

A basically different situation arises when the observations are arbitrary points on the time axis, i.e. when considering a general point process. To distinguish this case, the involved random variables are labeled $\tau_1^*, \tau_2^*, \ldots$, with $t_1^*, t_2^*, \ldots$ for the corresponding realizations ($t_1^* < t_2^* < \ldots$ is assumed). This situation occurs in reliability tests when only the failed element in a system is repaired to as-good-as-new, and there is at least one element in the system which has a time dependent failure rate. Failure-free times (interarrival times, by assuming negligible repair times) are in this case neither independent nor equally distributed. Only the case of a series system with constant failure rates for all elements ($\lambda_1, \ldots, \lambda_n$) leads (if repaired elements are as-good-as-new) to a homogeneous Poisson process (Appendix A7.2.5), for which interarrival times are statistically independent random variables with the common distribution function $F(x) = 1 - e^{-(\lambda_1 + \ldots + \lambda_n)\,x}$ (Eqs. (2.19), (7.27)). Shortcomings because of neglecting this basic property are known, see e.g. [6.3, 7.11, A7.30].

Example 7.19 Let $F(t)$ be the distribution function of the failure-free time of a given item. Suppose that at $t = 0$ an unknown number $n$ of items are put into operation and that at the time $t_0$ exactly $k$ items have failed (no replacement or repair has been done). Give a point estimate for $n$.

Solution Setting $p = F(t_0)$, the number $k$ of failures in $(0, t_0]$ is binomially distributed (Eq. (A6.120))

$$\Pr\{k \text{ failures in } (0, t_0]\} = p_k = \binom{n}{k}\, p^k (1 - p)^{n-k}, \qquad \text{with } p = F(t_0) . \tag{7.69}$$

An estimate for $n$ can be obtained using the maximum likelihood method, yielding (Eq. (A8.23)) $L = \binom{n}{k}\, p^k (1 - p)^{n-k}$ and finally, with $\partial L / \partial n = 0$ for $n = \hat n$ ($n$ is the unknown parameter),

$$\hat n = k/p = k/F(t_0) . \tag{7.70}$$

For Eq. (7.70), the approximation $\binom{n}{k} \approx (e^{-k}/k!)\, n^n / (n - k)^{n-k}$ has been used (Stirling formula). The Poisson approximation $p_k \approx e^{-np}(np)^k/k!$ (Eq. (A6.129)) also yields $\hat n = k/p$.
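The estimate $\hat n = k/p$ can be compared with a brute-force maximization of the exact binomial likelihood over integer values of $n$. The sketch below uses the made-up values $k = 20$ and $p = 0.3$ purely for illustration; the function names and the search range are assumptions.

```python
import math

def binomial_likelihood(n, k, p):
    """Probability of exactly k failures among n items, each failing with probability p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def estimate_population(k, p, search_factor=10):
    """Integer n maximizing the binomial likelihood, together with the estimate k/p."""
    upper = max(k + 1, int(search_factor * k / p))
    best_n = max(range(k, upper + 1), key=lambda n: binomial_likelihood(n, k, p))
    return best_n, k / p

best_n, approx = estimate_population(k=20, p=0.3)   # hypothetical numbers
print(best_n, approx)
```

The exact integer maximizer is the integer part of $k/p$ (with a tie between $k/p$ and $k/p - 1$ when $k/p$ is itself an integer), so the Stirling-based result agrees with it up to rounding.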


Cases involving nonhomogeneous Poisson processes are easy to investigate when observing data on the time axis (Sections 7.6.2, 7.6.3, 7.7, Appendix A7.8.2). For more general situations, difficulties can arise (except for some general results valid for stationary point processes (Appendices A7.8.3 - A7.8.5)), and the following basic rule should apply:

If neither a Poisson process (homogeneous or nonhomogeneous) nor a renewal process can be assumed for the underlying point process, care is necessary in identifying possible models; in any case, validation of model assumptions (from a physical and statistical point of view) should precede data analysis.

The homogeneous Poisson process (HPP), introduced in Appendix A7.2.5 as a particular case of a renewal process, is the simplest point process. It is memoryless, and tools for a statistical investigation are known. Nonhomogeneous Poisson processes (NHPPs) are without aftereffect (Appendix A7.8.2), and for investigation purposes they can be transformed into an HPP (Eq. (A7.200)). Investigation of renewal processes (Appendix A7.2) can be reduced to that of independent random variables with a common distribution function (cases 1 and 2 above). However, disregarding the last part of the above general rule can lead to mistakes, even in the case of renewal processes or independent realizations of a random variable $\tau$. As an example, consider an item with two independent failure mechanisms, one appearing with constant failure rate $\lambda_0 = 10^{-3}\,\mathrm{h}^{-1}$ and the second (wearout) with a shifted Weibull distribution $F(t) = 1 - e^{-(\lambda (t - \psi))^{\beta}}$ with $\lambda = 10^{-2}\,\mathrm{h}^{-1}$, $\psi = 10^4\,\mathrm{h}$, and $\beta = 3$ ($t > \psi$, $F(t) = 0$ for $t \le \psi$). As in case 2 of Eq. (A6.34), the failure-free time $\tau$ has the distribution function $F(t) = 1 - e^{-\lambda_0 t}$ for $0 \le t \le \psi$ and $F(t) = 1 - e^{-\lambda_0 t} \cdot e^{-(\lambda (t - \psi))^{\beta}}$ for $t > \psi$ (failure rate $\lambda(t) = \lambda_0$ for $t \le \psi$ and $\lambda(t) = \lambda_0 + \beta \lambda^{\beta} (t - \psi)^{\beta - 1}$ for $t > \psi$, similar to a series model with independent elements (Eq. (2.17))). If the presence of the above two failure mechanisms is ignored and the test is stopped (censored) after $10^4\,\mathrm{h}$, the wrong conclusion can be drawn that the item has a constant failure rate of about $10^{-3}\,\mathrm{h}^{-1}$.

Investigation of cases involving general point processes is beyond the scope of this book (only some general results are given in Appendices A7.8.3 - A7.8.5). A large number of ad hoc procedures are known in the literature, but they often only apply to specific situations and their use needs a careful validation of the assumptions stated with the model.

After some considerations on tests for nonhomogeneous Poisson processes in Section 7.6.2, Sections 7.6.3.1 and 7.6.3.2 deal with trend tests to check the hypothesis of a homogeneous Poisson process against that of a nonhomogeneous Poisson process with increasing or decreasing intensity. A heuristic test to distinguish a homogeneous Poisson process from a general monotonic trend is discussed in Section 7.6.3.3. However, as stated in the above general rule, the validity of a model should be checked also on the basis of physical considerations on the item considered. This holds in particular for the property without aftereffect, which characterizes Poisson processes.

7.6.2 Tests for Nonhomogeneous Poisson Processes

A nonhomogeneous Poisson process (NHPP) is a point process whose count function $\nu(t)$ has unit jumps, independent increments (in nonoverlapping intervals), and satisfies for any $b > a \ge 0$ (Appendix A7.8.2)

$$\Pr\{k \text{ events in } (a, b]\} = \frac{(M(b)-M(a))^{k}}{k!}\, e^{-(M(b)-M(a))}, \qquad k = 0, 1, 2, \ldots, \quad 0 \le a < b. \tag{7.71}$$

$M(t)$ is the mean value function of the NHPP, giving the expected number of points (events) in $(0, t]$

$$M(t) = \mathrm{E}[\nu(t)], \qquad M(0) = 0. \tag{7.72}$$

Assuming $M(t)$ differentiable,

$$m(t) = \frac{dM(t)}{dt} \ge 0 \tag{7.73}$$

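is the intensity of the NHPP and has for $\delta t \downarrow 0$ the following interpretation (Eq. (A7.89))

$$\Pr\{\nu(t+\delta t) - \nu(t) = 1\} = m(t)\,\delta t + o(\delta t). \tag{7.74}$$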

Because of independent increments, the number of events (failures) in a time interval $(t, t+\theta]$ (Eq. (7.71) with $a = t$ and $b = t + \theta$) and the rest waiting time $\tau_R(t)$ to the next event

$$\Pr\{\tau_R(t) > x\} = \Pr\{\text{no event in } (t, t+x]\} = e^{-(M(t+x)-M(t))}, \qquad x > 0, \tag{7.75}$$

are independent of the process development up to time $t$ (Eqs. (A7.193), (A7.196)). Thus, also the mean $\mathrm{E}[\tau_R(t)]$ is independent of the process development up to time $t$, and given by (Eq. (A7.197))

$$\mathrm{E}[\tau_R(t)] = \int_0^{\infty} e^{-(M(t+x)-M(t))}\, dx .$$

Furthermore, if $0 < \tau_1^* < \tau_2^* < \ldots$ are the occurrence times (arrival times) of the event considered (e.g. failures of a repairable system), measured from $t = 0$, it holds for $m(t) > 0$ that the quantities

$$\psi_1^* = M(\tau_1^*) < \psi_2^* = M(\tau_2^*) < \ldots \tag{7.76}$$

are the occurrence times of a homogeneous Poisson process with intensity one (Eq. (A7.200)). Moreover, for given (fixed) $t = T$ and $\nu(T) = n$, the occurrence times $0 < \tau_1^* < \ldots < \tau_n^* < T$ have the same distribution as if they were the order statistics of $n$ independent identically distributed random variables with density

$$\frac{m(t)}{M(T)}, \qquad 0 < t < T, \tag{7.77}$$

and distribution function M(t) / M(T) on (0, T) (Eq. (A7.205)).
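Both properties can be illustrated with a short simulation. The sketch below is not from the original text: it assumes a power-law mean value function $M(t) = \alpha t^{\beta}$ with hypothetical parameters, draws occurrence times by inverting the distribution function $M(t)/M(T)$, and then applies the time transformation of Eq. (7.76).

```python
import random

ALPHA, BETA = 0.5, 0.6   # hypothetical parameters of M(t) = ALPHA * t**BETA
T = 1000.0               # hypothetical observation time in h

def mean_value(t):
    """Mean value function M(t) of the assumed NHPP."""
    return ALPHA * t**BETA

def sample_occurrence_times(n, rng):
    """Draw n occurrence times on (0, T] as order statistics of i.i.d.
    variables with distribution function M(t)/M(T) (inversion method)."""
    # M(t)/M(T) = (t/T)**BETA = u  implies  t = T * u**(1/BETA)
    return sorted(T * (1.0 - rng.random()) ** (1.0 / BETA) for _ in range(n))

def to_unit_hpp(times):
    """Transformed occurrence times psi_i = M(t_i), as in Eq. (7.76)."""
    return [mean_value(t) for t in times]

rng = random.Random(42)
times = sample_occurrence_times(10, rng)
psi = to_unit_hpp(times)
# Normalized values M(t_i)/M(T) behave like ordered uniform variables on (0, 1].
normalized = [p / mean_value(T) for p in psi]
print([round(u, 3) for u in normalized])
```

Because the transformation is monotonic, the order of the points is preserved; only the time scale changes.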

Equation (7.74) gives the unconditional probability for one event in $(t, t+\delta t]$. Thus, $m(t)$ refers to the occurrence of any one of the events considered. It corresponds to the renewal density $h(t)$ and the failure intensity $z(t)$, but differs basically from the failure rate $\lambda(t)$ (see remark to Eq. (A7.24)).

Nonhomogeneous Poisson processes (NHPPs) are introduced in Appendix A7.8.2. Some examples are discussed in Section 7.7 with applications to reliability growth. Assuming that the underlying process is a NHPP, estimation of the model parameters (parameters $\theta$ of $m(t, \theta)$) can be performed using the maximum likelihood method on the basis of the observed data $0 < t_1^* < t_2^* < \ldots < t_n^* < T$ (time censoring). Considering Eqs. (7.71) and (7.74), the likelihood function follows as (Eq. (7.102))

$$L\big((0, t_1^*), \ldots, (t_n^*, T)\big) = \prod_{i=1}^{n} m(t_i^*, \theta)\; e^{-M(T, \theta)} \tag{7.78}$$

and delivers the maximum likelihood estimate $\hat\theta$ for the parameters $\theta$ of $m(t, \theta)$ by solving $\partial L / \partial \theta = 0$ for $\theta = \hat\theta$, where $\theta$ can be a vector (see e.g. Eq. (7.104) for the parameters $\alpha$ and $\beta$ of the NHPP with $m(t) = \alpha \beta t^{\beta-1}$). Using the property stated by Eq. (7.76), statistical tests for exponential distribution or for homogeneous Poisson processes (Appendices A8.2.2.2, A8.3.2, A8.3.3 and Sections 7.2.3, 7.5, 7.6.3.1) can be applied to NHPPs as well. Furthermore, using the property stated by Eq. (7.77), the goodness-of-fit tests introduced in Appendix A8.3.2 (Kolmogorov-Smirnov, Cramér - von Mises, chi-square) can be used to verify agreement of the observed data $t_1^*, \ldots, t_n^* < T$ with a postulated $M_0(t)$ ($t_1^*, t_2^*, \ldots$ are the observed values (realizations) of $\tau_1^*, \tau_2^*, \ldots$, and $*$ is used to explicitly show that $t_1^*, t_2^*, \ldots$ are points on the time axis and not independent realizations of a random variable $\tau$, e.g. as in Figs. 1.1, 7.12, 7.14). For the Kolmogorov-Smirnov test, the procedure given in Section 7.5.1 applies with

$$\hat{F}_n(t) = \frac{\hat\nu(t)}{\hat\nu(T)} \tag{7.79}$$

and

$$F_0(t) = \frac{M_0(t)}{M_0(T)}, \qquad 0 < t < T, \tag{7.80}$$

where $\hat\nu(t)$ is the observed number of events in $(0, t]$.

More difficult is the situation when the assumption that the underlying model is a NHPP must also be verified by a statistical data analysis, for instance with a goodness-of-fit test. The problem is not completely solved. However, the properties given by Eqs. (7.76) and (7.77) can be used for a goodness-of-fit test of the NHPP with incompletely specified (up to the parameters) mean value function $M_0(t)$. The chi-square test holds with the procedure given in Section 7.5.2 and Appendix A8.3.3. For a first evaluation, the Kolmogorov-Smirnov test (and tests based on a quadratic statistic) can be used, taking half (randomly selected) of the observations $t_1^*, \ldots, t_n^*$ to estimate the parameters and continuing with the whole sample the procedure given in Section 7.5.1 for the goodness-of-fit test [A8.11, A8.31].
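The comparison between the empirical function $\hat\nu(t)/\hat\nu(T)$ and a postulated $M_0(t)/M_0(T)$ can be sketched as follows. This is an illustration under stated assumptions, not the procedure of Section 7.5.1 itself: the function name is hypothetical, the data are those of Example 7.22, and the postulated mean value function $M_0(t) = \lambda t$ (a HPP) is chosen only to keep the example simple. The resulting distance would still have to be compared with the critical value given in Section 7.5.1.

```python
def ks_distance(times, T, mean_value):
    """Largest deviation between F_n(t) = nu(t)/n and F_0(t) = M_0(t)/M_0(T),
    evaluated just before and at each observed occurrence time."""
    ordered = sorted(times)
    n = len(ordered)
    distance = 0.0
    for i, t in enumerate(ordered, start=1):
        f0 = mean_value(t) / mean_value(T)
        distance = max(distance, abs(i / n - f0), abs((i - 1) / n - f0))
    return distance

# Data of Example 7.22 (occurrence times in h, time censoring at T).
observed = [850, 1200, 2100, 3900, 4950, 5100, 8300, 9050]
T = 10_000

# Postulated mean value function; the rate value cancels in M_0(t)/M_0(T).
def postulated_mean_value(t, rate=8e-4):
    return rate * t

d = ks_distance(observed, T, postulated_mean_value)
print(round(d, 3))
```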

7.6.3 Trend Tests

In reliability engineering one is often interested in testing whether there is a monotonic trend in the times between successive failures (interarrival times) of a repairable system with negligible repair (restoration) times, e.g. in order to detect the end of an early failure period or the beginning of a wearout period. Such tests extend the tests for exponentiality or for homogeneous Poisson processes introduced in Sections 7.2.3, 7.5, A8.2.1, A8.2.2, A8.3.2, and A8.3.3. If the underlying point process can be approximated by a renewal process, a graphical procedure can be used to detect the presence of trends, see e.g. Fig. 7.13 for the case of early failures. In the case of a nonhomogeneous Poisson process (NHPP), a trend is given by an increasing or a decreasing intensity $m(t)$, e.g. $\beta > 1$ or $\beta < 1$ in Eq. (7.99). Trend tests can also be useful in investigating what kind of alternative should be considered when an assumption is to be made about the statistical properties of a given data set. However, trend tests check in general only a postulated hypothesis against a more or less general alternative hypothesis. Care is therefore necessary in drawing conclusions from this kind of statistical test, and the basic rule given on p. 320 applies. In the following, some trend tests used in reliability data analysis are discussed, among them the Laplace test (see e.g. [A8.1] for greater details).

7.6.3.1 Tests of a HPP versus a NHPP with increasing intensity

The homogeneous Poisson process (HPP) is a point process whose count function $\nu(t)$ has stationary, independent Poisson distributed increments (Eq. (A7.41)). Interarrival times in a HPP are independent and distributed according to the same exponential distribution $F(x) = 1 - e^{-\lambda x}$ (occurrence times are Gamma distributed). The parameter $\lambda$ characterizes the HPP completely. $\lambda$ is at the same time the intensity of the HPP and the failure rate $\lambda(x)$ of all interarrival times, $x$ starting at 0 at each occurrence time of the event considered (e.g. failure of a repairable system with negligible repair (restoration) times). This numerical equality has been the cause of misinterpretations and misuse in practical applications, see e.g. [6.3, 7.11, A7.30]. The homogeneous Poisson process has been introduced in Appendix A7.2.5 as a particular case of a renewal process. Considering $\nu(t)$ as the count function giving the number of events (failures) in $(0, t]$, in Example A7.13 (Eq. (A7.213)) it is shown that:

For given (fixed) $T$ and $\nu(T) = n$ (time censoring), the normalized arrival times $0 < \tau_1^*/T < \ldots < \tau_n^*/T < 1$ of a homogeneous Poisson process (HPP) have the same distribution as if they were the order statistics of $n$ independent identically uniformly distributed random variables on $(0, 1)$. (7.81)

Similar results hold for a NHPP (Eq. (A7.206)) :

For given (fixed) $T$ and $\nu(T) = n$ (time censoring), the normalized arrival times $0 < M(\tau_1^*)/M(T) < \ldots < M(\tau_n^*)/M(T) < 1$ of a nonhomogeneous Poisson process (NHPP) with mean value function $M(t)$ have the same distribution as if they were the order statistics of $n$ independent identically uniformly distributed random variables on $(0, 1)$. (7.82)

With the above transformations, properties of the uniform distribution can be used to support statistical tests on homogeneous and nonhomogeneous Poisson processes.

Let $\omega$ be a uniformly distributed random variable with density

$$f_\omega(x) = 1 \text{ on } (0, 1), \qquad f_\omega(x) = 0 \text{ outside } (0, 1), \tag{7.83}$$

and distribution function $F_\omega(x) = x$ on $(0, 1)$. Mean and variance of $\omega$ are given by (Eqs. (A6.36) and (A6.44))

$$\mathrm{E}[\omega] = 1/2 \quad \text{and} \quad \mathrm{Var}[\omega] = 1/12. \tag{7.84}$$

The sum $\omega_1 + \ldots + \omega_n$ of $n$ independent random variables, each distributed as $\omega$, has mean $n/2$ and variance $n/12$. The distribution function $F_{\omega_n}(x)$ of $\omega_1 + \ldots + \omega_n$ is defined on $(0, n)$ and can be computed using Eq. (A7.12). $F_{\omega_n}(x)$ has been investigated in [A8.8], leading to the conclusion that $F_{\omega_n}(x)$ rapidly approaches a normal distribution as $n$ increases. For practical applications one can assume that for given (fixed) $T$ and $\nu(T) = n > 5$, the arrival times $0 < \tau_1^* < \ldots < \tau_n^* < T$ of a HPP satisfy

$$\Pr\Big\{ \Big[\Big(\sum_{i=1}^{n} \tau_i^*/T\Big) - n/2\Big] \Big/ \sqrt{n/12} \le x \Big\} \approx \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2}\, dy . \tag{7.85}$$

Equation (7.85) can be used to test a HPP ($m(t) = \lambda$) versus a NHPP with increasing density $m(t) = dM(t)/dt$. Using Eq. (7.85) and considering the observations (realizations) $t_1^* < t_2^* < \ldots < t_n^* < T$, the procedure is (Example 7.20):

1. Compute the statistic

$$\Big[\Big(\sum_{i=1}^{n} t_i^*/T\Big) - n/2\Big] \Big/ \sqrt{n/12}. \tag{7.86}$$

2. For given type I error $\alpha$ determine the critical value $t_{1-\alpha}$ ($1-\alpha$ quantile) from a table of the standard normal distribution (Tab. A9.1).

3. Reject the hypothesis $H_0$: the underlying point process is a HPP, against $H_1$: the underlying process is a NHPP with increasing density, at $1-\alpha$ confidence, if $\big(\sum_{i=1}^{n} t_i^*/T - n/2\big) / \sqrt{n/12} > t_{1-\alpha}$; otherwise accept $H_0$. (7.87)

A test based on Eqs. (7.86)-(7.87) is called Laplace test and was first introduced by Laplace as a test of randomness. From Eq. (7.87) one recognizes that $\sum_{i=1}^{n} t_i^*/T$ is a sufficient statistic (Appendix A8.2.1). It can be noted that in Point 2 above, $\alpha = \Pr\{\text{reject } H_0 \mid H_0 \text{ true}\}$ and $\big(\sum_{i=1}^{n} t_i^*/T - n/2\big)/\sqrt{n/12}$ tends to assume large values for $H_0$ false (i.e. for $m(t)$ increasing). For $T = t_n^*$ (failure censoring), Eq. (7.86) holds with $n-1$ (see e.g. [A8.1]).

A further possibility to test a HPP ($m(t) = \lambda$) versus a NHPP with increasing density $m(t) = dM(t)/dt$ is to use the statistic

$$2 \sum_{i=1}^{n} \ln(T / t_i^*). \tag{7.88}$$

As shown in Example A7.13 (Eq. (A7.213), see also (7.81)), for given (fixed) $T$ and $\nu(T) = n$, the normalized arrival times $0 < \tau_1^*/T < \ldots < \tau_n^*/T < 1$ of a HPP have the same distribution as if they were the order statistics of $n$ independent identically uniformly distributed random variables on $(0, 1)$. Moreover, Example 7.21 shows for $\omega_i = \tau_i^*/T$ that $2 \sum_{i=1}^{n} \ln(T/\tau_i^*)$ has a $\chi^2$ distribution (Eq. (A6.103)) with $2n$ degrees of freedom ($\tau_i^*$ has to be used instead of the observations $t_i^*$, see footnote on p. 504)

$$\Pr\Big\{ 2 \sum_{i=1}^{n} \ln(T/\tau_i^*) \le x \Big\} = \frac{1}{2^{n} (n-1)!} \int_0^{x} y^{\,n-1} e^{-y/2}\, dy . \tag{7.89}$$

Example 7.20 In a reliability test, 8 failures have occurred in $T = 10{,}000\,\mathrm{h}$ and $t_1^* + \ldots + t_8^* = 43{,}000\,\mathrm{h}$ has been observed. Test with a risk $\alpha = 5\%$ (at 95% confidence), using the rule (7.87), the hypothesis $H_0$: the underlying point process is a HPP, against $H_1$: the underlying process is a NHPP with increasing density.

Solution From Table A9.1, $t_{0.95} = 1.64 > (4.3 - 4)/0.816 = 0.367$ and $H_0$ can not be rejected.
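The arithmetic of Example 7.20 can be retraced with a few lines of code. The sketch below assumes the Laplace statistic in the form $\big(\sum t_i^*/T - n/2\big)/\sqrt{n/12}$ used in rule (7.87); the function name is illustrative, and the critical value 1.64 is the one quoted from Table A9.1.

```python
import math

def laplace_statistic(sum_of_times, n, T):
    """Standardized sum of the normalized occurrence times (time censoring at T)."""
    return (sum_of_times / T - n / 2) / math.sqrt(n / 12)

# Data of Example 7.20: 8 failures in T = 10,000 h, sum of occurrence times 43,000 h.
statistic = laplace_statistic(43_000, 8, 10_000)

T_CRITICAL = 1.64                    # 0.95 quantile of the standard normal (Table A9.1)
reject_h0 = statistic > T_CRITICAL   # rule (7.87): reject for large values

print(round(statistic, 3), reject_h0)
```

The statistic stays well below the critical value, so the data give no reason to reject the HPP in favor of an increasing intensity.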

Example 7.21 Let the random variable $\omega$ be uniformly distributed on $(0, 1)$. Show that $\eta = -\ln(\omega)$ is distributed according to $F_\eta(t) = 1 - e^{-t}$ on $(0, \infty)$, and thus $2 \sum_{i=1}^{n} -\ln(\omega_i) = 2 \sum_{i=1}^{n} \eta_i = \chi^2_{2n}$.

Solution Considering that for $0 < \omega < 1$, $-\ln(\omega)$ is a decreasing function defined on $(0, \infty)$, it follows that the events $\{\omega < x\}$ and $\{\eta = -\ln(\omega) > -\ln(x)\}$ are equivalent. From this (see also Eq. (A6.31)),

$$x = \Pr\{\omega < x\} = \Pr\{\eta > -\ln(x)\}$$

and thus, using $-\ln x = t$, one obtains $\Pr\{\eta > t\} = e^{-t}$ and finally

$$F_\eta(t) = \Pr\{\eta \le t\} = 1 - e^{-t}. \tag{7.90}$$

From Eqs. (A6.102)-(A6.104), it follows that $2 \sum_{i=1}^{n} -\ln(\omega_i) = 2 \sum_{i=1}^{n} \eta_i$ has a $\chi^2$ distribution with $2n$ degrees of freedom.
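The result of Example 7.21 can be checked by simulation. The following sketch is an illustration only, with a hypothetical sample size and seed; it relies on the known facts that a $\chi^2$ distribution with $2n$ degrees of freedom has mean $2n$ and variance $4n$.

```python
import math
import random

def log_sum_statistic(n, rng):
    """2 * sum of -ln(omega_i) for n independent uniform variables on (0, 1]."""
    # 1 - rng.random() lies in (0, 1], which avoids ln(0).
    return 2 * sum(-math.log(1.0 - rng.random()) for _ in range(n))

rng = random.Random(2024)
n = 8
samples = [log_sum_statistic(n, rng) for _ in range(20_000)]

mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)

# For a chi-square distribution with 2n = 16 degrees of freedom one expects
# a mean close to 16 and a variance close to 32.
print(round(mean, 2), round(variance, 2))
```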

Example 7.22 In a reliability test, 8 failures have occurred in $T = 10{,}000\,\mathrm{h}$ at 850, 1200, 2100, 3900, 4950, 5100, 8300, 9050 h. Test with a risk $\alpha = 5\%$ (at 95% confidence), using the rule (7.92), the hypothesis $H_0$: the underlying point process is a HPP, against the alternative hypothesis $H_1$: the underlying process is a NHPP with increasing density.

Solution From Table A9.2, $\chi^2_{16,\,0.05} = 7.96 < 2\,(\ln(T/t_1^*) + \ldots + \ln(T/t_8^*)) = 17.5$ and $H_0$ can not be rejected.
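The value 17.5 in the solution of Example 7.22 can be recomputed directly from the eight occurrence times. The sketch below uses an illustrative function name; the critical value 7.96 is the one quoted from Table A9.2.

```python
import math

def log_ratio_statistic(times, T):
    """2 * sum of ln(T / t_i) over the observed occurrence times t_i."""
    return 2 * sum(math.log(T / t) for t in times)

# Data of Example 7.22 (occurrence times in h, time censoring at T = 10,000 h).
times = [850, 1200, 2100, 3900, 4950, 5100, 8300, 9050]
statistic = log_ratio_statistic(times, 10_000)

CHI2_CRITICAL = 7.96                    # 0.05 quantile, 16 degrees of freedom (Table A9.2)
reject_h0 = statistic < CHI2_CRITICAL   # rule (7.92): reject for small values

print(round(statistic, 1), reject_h0)
```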

Thus, the statistic given by Eq. (7.88) can be used to test a HPP ($m(t) = \lambda$) versus a NHPP with increasing density $m(t) = dM(t)/dt$. Considering Eqs. (7.89) and (7.90), the test procedure is (Example 7.22):

1. Compute the statistic

$$2 \sum_{i=1}^{n} \ln(T / t_i^*). \tag{7.91}$$

2. For given type I error $\alpha$ determine the critical value $\chi^2_{2n,\,\alpha}$ ($\alpha$ quantile) from a table of the $\chi^2$ distribution (Table A9.2).

3. Reject the hypothesis $H_0$: the underlying point process is a HPP, against $H_1$: the underlying process is a NHPP with increasing density, at $1-\alpha$ confidence, if $2 \sum_{i=1}^{n} \ln(T/t_i^*) < \chi^2_{2n,\,\alpha}$; otherwise accept $H_0$. (7.92)

From Eq. (7.92) one recognizes that $2 \sum_{i=1}^{n} \ln(T/t_i^*)$ is a sufficient statistic (Appendix A8.2.1). It can be noted that in Point 2 above, $\alpha = \Pr\{\text{reject } H_0 \mid H_0 \text{ true}\}$ and $2 \sum_{i=1}^{n} \ln(T/t_i^*)$ tends to assume small values for $H_0$ false (i.e. for $m(t)$ increasing). For $T = t_n^*$ (failure censoring), Eq. (7.91) holds with $n-1$ (see e.g. [A8.1]).

7.6.3.2 Tests of a HPP versus a NHPP with decreasing intensity

Tests of a homogeneous Poisson process (HPP) versus a nonhomogeneous Poisson process (NHPP) with a decreasing intensity $m(t) = dM(t)/dt$ can be deduced from those for increasing intensity given in Section 7.6.3.1. Equations (7.85) and (7.89) remain true. However, if the intensity is decreasing, most of the failures tend to occur before $T/2$ and the test procedure for the Laplace test has to be changed to (Example 7.23):


1. Compute the statistic

$$\Big[\Big(\sum_{i=1}^{n} t_i^*/T\Big) - n/2\Big] \Big/ \sqrt{n/12}. \tag{7.93}$$

2. For given type I error $\alpha$ determine the critical value $t_{\alpha}$ ($\alpha$ quantile) from a table of the standard normal distribution (Tab. A9.1).

3. Reject the hypothesis $H_0$: the underlying point process is a HPP, against $H_1$: the underlying process is a NHPP with decreasing density, at $1-\alpha$ confidence, if $\big(\sum_{i=1}^{n} t_i^*/T - n/2\big)/\sqrt{n/12} < t_{\alpha}$; otherwise accept $H_0$. (7.94)

From Eq. (7.93) one recognizes that $\sum_{i=1}^{n} t_i^*/T$ is a sufficient statistic (Appendix A8.2.1). It can be noted that in Point 2 above, $\alpha = \Pr\{\text{reject } H_0 \mid H_0 \text{ true}\}$ and $\big(\sum_{i=1}^{n} t_i^*/T - n/2\big)/\sqrt{n/12}$ tends to assume small values for $H_0$ false (i.e. for $m(t)$ decreasing). For $T = t_n^*$ (failure censoring), Eq. (7.93) holds with $n-1$ (see e.g. [A8.1]).

For the test according to the statistics (7.88), the test procedure is (Example 7.24):

1. Compute the statistic

$$2 \sum_{i=1}^{n} \ln(T / t_i^*). \tag{7.95}$$

2. For given type I error $\alpha$ determine the critical value $\chi^2_{2n,\,1-\alpha}$ ($1-\alpha$ quantile) from a table of the $\chi^2$ distribution (Table A9.2).

3. Reject the hypothesis $H_0$: the underlying point process is a HPP, against $H_1$: the underlying process is a NHPP with decreasing density, at $1-\alpha$ confidence, if $2 \sum_{i=1}^{n} \ln(T/t_i^*) > \chi^2_{2n,\,1-\alpha}$; otherwise accept $H_0$. (7.96)

From Eq. (7.95) one recognizes that $2 \sum_{i=1}^{n} \ln(T/t_i^*)$ is a sufficient statistic (Appendix A8.2.1). It can be noted that in Point 2 above, $\alpha = \Pr\{\text{reject } H_0 \mid H_0 \text{ true}\}$ and $2 \sum_{i=1}^{n} \ln(T/t_i^*)$ tends to assume large values for $H_0$ false (i.e. for $m(t)$ decreasing). For $T = t_n^*$ (failure censoring), Eq. (7.95) holds with $n-1$ (see e.g. [A8.1]).

7.6.3.3 Heuristic Tests to distinguish between HPP and General Monotonic Trend

In some applications, little information is available about the underlying point process describing the occurrence of failures of a complex repairable system. As in the previous sections, it will be assumed that repair times are neglected. What is sought is a test to identify a monotonic trend of the failure intensity against a constant failure intensity given by a homogeneous Poisson process (HPP).

Example 7.23 Continuing Example 7.20, test using the rule (7.94) and the data of Example 7.20, with a risk $\alpha = 5\%$ (at 95% confidence), the hypothesis $H_0$: the underlying point process is a HPP, against the alternative hypothesis $H_1$: the underlying process is a NHPP with decreasing density.

Solution From Table A9.1, $t_{0.05} = -1.64 < 0.367$ and $H_0$ can not be rejected.

Example 7.24 Continuing Example 7.22, test using the rule (7.96) and the data of Example 7.22, with a risk $\alpha = 5\%$ (at 95% confidence), the hypothesis $H_0$: the underlying point process is a HPP, against the alternative hypothesis $H_1$: the underlying process is a NHPP with decreasing density.

Solution From Table A9.2, $\chi^2_{16,\,0.95} = 26.3 > 2\,(\ln(T/t_1^*) + \ldots + \ln(T/t_8^*)) = 17.5$ and $H_0$ can not be rejected.
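The decision rules of Examples 7.23 and 7.24 differ from those for increasing intensity only in the side on which the hypothesis is rejected. The sketch below makes this explicit; the helper `reject_hpp` and its `reject_when` argument are hypothetical names introduced for illustration, and the critical values are those quoted from Tables A9.1 and A9.2.

```python
import math

def laplace_statistic(sum_of_times, n, T):
    """Standardized sum of the normalized occurrence times."""
    return (sum_of_times / T - n / 2) / math.sqrt(n / 12)

def log_ratio_statistic(times, T):
    """2 * sum of ln(T / t_i) over the observed occurrence times."""
    return 2 * sum(math.log(T / t) for t in times)

def reject_hpp(statistic, critical_value, reject_when):
    """Decide on H0 (HPP); reject_when is 'below' or 'above' the critical value."""
    if reject_when == "below":
        return statistic < critical_value
    if reject_when == "above":
        return statistic > critical_value
    raise ValueError("reject_when must be 'below' or 'above'")

# Example 7.23: Laplace test against decreasing intensity (reject for small values).
laplace = laplace_statistic(43_000, 8, 10_000)
decision_7_23 = reject_hpp(laplace, -1.64, "below")

# Example 7.24: chi-square based test against decreasing intensity (reject for large values).
times = [850, 1200, 2100, 3900, 4950, 5100, 8300, 9050]
log_ratio = log_ratio_statistic(times, 10_000)
decision_7_24 = reject_hpp(log_ratio, 26.3, "above")

print(decision_7_23, decision_7_24)
```

With the same data, neither the test for increasing nor the test for decreasing intensity rejects the HPP, which is consistent with the solutions given in the examples.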

Consider first investigations based on successive interarrival times. Such an investigation should be performed at the beginning of data analysis, also because it can quickly deliver first information about a possible monotonic trend (e.g. interarrival times become longer and longer or shorter and shorter). Moreover, if the underlying point process describing the occurrence of failures can be approximated by a renewal process (successive interarrival times are independent and identically distributed), procedures of Section 7.5 based on the empirical distribution function (EDF) have a great intuitive appeal and can be useful in testing for monotonic trends as well, see Examples 7.15 - 7.17 (Figs. 7.12 - 7.14). In particular, the graphical procedures given in Example 7.16 (Fig. 7.13) would allow the detection and quantification of an early failure period. The same would hold for a wearout period. Similar considerations hold if the involved point process can be approximated by a nonhomogeneous Poisson process (NHPP), see Sections 7.6.1 - 7.6.3 and 7.7.

If a trend in successive interarrival times is recognized, but the underlying point process can not be approximated by a renewal process or a NHPP, a further possibility is to consider the observed failure time points $t_1^* < t_2^* < \ldots$ directly. As shown in Appendix A7.8.5, a mean value function $Z(t) = \mathrm{E}[\nu(t)]$ can be associated to each point process, where $\nu(t)$ is the count function giving the number of failures occurred in $(0, t]$ ($Z_S(t)$ and $\nu_S(t)$ should be used for considerations at system level). From the observed failure time points (observed occurrence times) $t_1^* < t_2^* < \ldots$, the empirical mean value function $\hat{Z}(t)$ can be obtained (Eq. (7.97), see e.g. [7.23]).

The mean value function $Z(t)$ corresponds to the renewal function $H(t)$ in a renewal process (Eq. (A7.15)); $z(t) = dZ(t)/dt$ is the failure intensity and corresponds to the renewal density $h(t)$ in a renewal process (Eqs. (A7.18), (A7.24)). For a homogeneous Poisson process, $Z(t)$ takes the form (Eq. (A7.42))

$$Z(t) = \lambda t. \tag{7.98}$$

Each deviation from a straight line $Z(t) = a\,t$ is thus an indication for a possible trend (besides statistical deviations). As shown in Example A7.1 (Fig. A7.2) for a renewal process, early failures or wearout give a basically different shape of the underlying renewal function; a convex shape for the case of early failures and a concave shape for the case of wearout. This property can be used to recognize the presence of trends in a point process, by considering the shape of the associated empirical mean value function $\hat{Z}(t)$ given by Eq. (7.97). Such a procedure can help in detecting possible trends, but remains a rough evaluation (see Fig. A7.2 for the case of a renewal process). Care is thus necessary when extrapolating results, e.g. about the failure rate value after the early failure period or the percentage of early failures.
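A rough numerical version of this graphical check is sketched below. It rests on two assumptions that are not taken from the text: the empirical mean value function is taken as the step function counting the failures observed up to $t$, and the reference line is drawn through the origin with slope $n/T$. The data are those of Example 7.22, used here only for illustration.

```python
from bisect import bisect_right

def empirical_mean_value(sorted_times, t):
    """Number of observed failures in (0, t] (assumed step-function estimate)."""
    return bisect_right(sorted_times, t)

def deviations_from_line(sorted_times, T):
    """Difference between the empirical mean value function and the straight
    line with slope n / T, evaluated at each observed failure time."""
    slope = len(sorted_times) / T
    return [empirical_mean_value(sorted_times, t) - slope * t for t in sorted_times]

times = [850, 1200, 2100, 3900, 4950, 5100, 8300, 9050]   # h, data of Example 7.22
T = 10_000

deviations = deviations_from_line(times, T)
for t, d in zip(times, deviations):
    print(f"t = {t:>5} h   deviation = {d:+.2f}")
```

Deviations that stay on one side of the line and grow systematically may hint at a trend; as stated above, this remains a rough evaluation and does not replace a formal test.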

7.7 Reliability Growth


At the prototype qualification tests, the reliability of complex equipment or systems can be less than expected. Disregarding any imprecision of data used or model used in calculating the predicted reliability (Chapter 2), such a discrepancy is often the consequence of weaknesses (errors and flaws) during design or manufacturing. For instance, use of components or materials at their technological limits or with internal weaknesses, cooling problems, interface problems, transient phenomena, interference between hardware and software, assembly or soldering problems, damage during handling or testing, etc. Errors and flaws cause defects and systematic failures. Superimposed to these are early failures and failures with constant failure rate (wearout should not be present at this stage). A distinction between deterministic faults (defects and systematic failures) and random faults (early failures and failures with constant failure rate) is only possible with a cause analysis. Such an analysis is necessary to identify and eliminate causes of observed faults, i.e. change or redesign for defects and systematic failures, screening for early failures, and repair for failures with constant failure rate. Of course, defects and systematic failures can also be randomly distributed on the time axis, e.g. caused by a mission dependent time-limited overload, by software defects, or simply because of the system complexity. However, they still differ from failures, as they are basically independent of operating time (disregarding systematic failures which can appear only after a certain operating time, e.g. as for some cooling or software problems).

The aim of a reliability growth program is the cost-effective improvement of an item's reliability through successful correction / elimination of the causes of design or production weaknesses. Early failures should be precipitated with an appropriate screening (environmental stress screening, ESS, see Section 8.2 for electronic components, Section 8.3 for electronic assemblies, and Section 8.4 for cost aspects). Considering that flaws found during reliability growth are in general deterministic (defects and systematic failures), reliability growth is performed during prototype qualification tests and pilot production, seldom for series-produced items (Fig. 7.16). Stresses during reliability growth are often higher than those expected in the field (as for ESS). Furthermore, the statistical methods used to investigate reliability growth are in general basically different from those given in Section 7.2 for basic reliability tests (e.g. to estimate or demonstrate a constant failure rate $\lambda$). This is because during the reliability growth program, design and / or production changes are introduced in the item(s) considered and statistical evaluation is not restarted after a change.


Figure 7.16 Qualitative visualization of a possible reliability growth (ordinate: reliability; abscissa: life-cycle phases design, qualification, production; marked points: prototype, first series unit)

A large number of models have been proposed to describe reliability growth for hardware and software, see e.g. [5.58, 5.60, 7.31-7.47, A2.6 (61014 & 61164)], some of them on the basis of theoretical considerations. A practice oriented model, proposed by J.T. Duane [7.36] and refined as a statistical model by L.H. Crow [7.35 (1975)], known also as the AMSAA model, assumes that the flow of events (system failures) constitutes a nonhomogeneous Poisson process (NHPP) with intensity

$$m(t) = \alpha \beta\, t^{\beta-1} \tag{7.99}$$

and mean value function

$$M(t) = \alpha\, t^{\beta}. \tag{7.100}$$

$M(t)$ gives the expected number of failures in $(0, t]$. $m(t)\,\delta t$ is the probability for one failure (any one) in $(t, t+\delta t]$ (Eq. (7.74)). It can be shown that for a NHPP, $m(t)$ is equal to the failure rate $\lambda(t)$ of the first occurrence time (Eq. (A7.209)). Comparing Eq. (7.99) with Eq. (A6.91) one recognizes that for the NHPP described by Eq. (7.99), the first occurrence time has a Weibull distribution. However, $m(t)$ and $\lambda(t)$ are fundamentally different (see the remark on p. 356) and, also for this reason, all other interarrival times do not follow a Weibull distribution and are neither independent nor identically distributed. Because of the distribution of the first occurrence time, the NHPP described by Eq. (7.99) is often called Weibull process, causing great confusion. Also used is the term power law process. Nonhomogeneous Poisson processes are investigated in Appendix A7.8.2.
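The relation between intensity, mean value function, and the distribution of the first occurrence time can be sketched numerically. The code below assumes the power-law form $m(t) = \alpha\beta t^{\beta-1}$, $M(t) = \alpha t^{\beta}$; the parameter values and function names are hypothetical and serve only as an illustration.

```python
import math

ALPHA, BETA = 0.47, 0.4   # hypothetical parameters of the power-law NHPP

def intensity(t):
    """m(t) = ALPHA * BETA * t**(BETA - 1)."""
    return ALPHA * BETA * t ** (BETA - 1)

def mean_value(t):
    """M(t) = ALPHA * t**BETA, expected number of failures in (0, t]."""
    return ALPHA * t**BETA

def first_occurrence_survival(t):
    """Pr{no event in (0, t]} = exp(-M(t)): a Weibull survival function."""
    return math.exp(-mean_value(t))

def first_occurrence_failure_rate(t, h=1e-4):
    """Numerical failure rate -d/dt ln(survival) of the first occurrence time."""
    upper = math.log(first_occurrence_survival(t + h))
    lower = math.log(first_occurrence_survival(t - h))
    return -(upper - lower) / (2 * h)

t = 100.0
print(intensity(t), first_occurrence_failure_rate(t))
```

The two printed values agree up to the error of the numerical derivative, which reflects the equality of $m(t)$ and the failure rate of the first occurrence time only; it says nothing about later interarrival times.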


In the following it will be assumed that the underlying model is a NHPP. Verification of this assumption should also be based on physical considerations on the nature / causes of the defects and systematic failures involved, not only on statistical aspects. If the underlying process is a NHPP, estimation of the model parameters ($\alpha$ and $\beta$ in the case of Eq. (7.99)) can easily be performed using observed data.

Let us consider first the time censored case (Type I censoring) and assume that up to the given (fixed) time $T$, exactly $n$ events have occurred at times $t_1^* < t_2^* < \ldots < t_n^* < T$. $t_1^*, t_2^*, \ldots$ are the realizations (observations) of the arrival times $\tau_1^*, \tau_2^*, \ldots$, and $*$ indicates that $t_1^*, t_2^*, \ldots$ are points on the time axis and not independent realizations of a random variable $\tau$ with a given (fixed) distribution function (e.g. as in Figs. 1.1, 7.12, 7.14). Considering the main property of a NHPP, i.e. that the numbers of events in nonoverlapping intervals are independent and distributed according to

$$\Pr\{k \text{ events in } (a, b]\} = \frac{(M(b)-M(a))^{k}}{k!}\, e^{-(M(b)-M(a))}, \tag{7.101}$$

with $M(0) = 0$ and $0 \le a < b$ (Eq. (A7.195)), and the interpretation of the intensity $m(t)$ given by Eq. (7.74) or Eq. (A7.194), the following likelihood function (Eq. (A8.24)) can be found for the parameter estimation of the intensity $m(t)$

$$L\big((0, t_1^*), \ldots, (t_n^*, T)\big) = \prod_{i=1}^{n} m(t_i^*)\; e^{-M(T)}. \tag{7.102}$$

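Equation (7.102) considers no event ($k = 0$ in Eq. (7.101)) in each of the nonoverlapping intervals $(0, t_1^*)$, $(t_1^*, t_2^*)$, ..., $(t_n^*, T)$ and applies to an arbitrary NHPP. For the Duane model it follows that

$$L\big((0, t_1^*), \ldots, (t_n^*, T)\big) = \alpha^{n} \beta^{n}\, e^{-\alpha T^{\beta}} \prod_{i=1}^{n} t_i^{*\,(\beta-1)}. \tag{7.103}$$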

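The maximum likelihood estimates $\hat\alpha$ and $\hat\beta$ of the parameters $\alpha$ and $\beta$ are then obtained from

$$\frac{\partial L}{\partial \alpha} = 0 \quad \text{and} \quad \frac{\partial L}{\partial \beta} = 0 \qquad \text{for } \alpha = \hat\alpha,\ \beta = \hat\beta,$$

yielding

$$\hat\beta = \frac{n}{\sum_{i=1}^{n} \ln(T / t_i^*)} \quad \text{and} \quad \hat\alpha = \frac{n}{T^{\hat\beta}}. \tag{7.104}$$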

An estimate for the intensity of the underlying nonhomogeneous Poisson process is

$$\hat m(t) = \hat\alpha \hat\beta\, t^{\hat\beta - 1}. \tag{7.105}$$

With known values for $\hat\alpha$ and $\hat\beta$, Eq. (7.105) can be used to extrapolate the attainable intensity if the reliability growth process were to be continued with the same statistical properties for a further time span $\Delta$ after $T$, yielding

$$\hat m(T + \Delta) = \hat\alpha \hat\beta\, (T + \Delta)^{\hat\beta - 1}, \tag{7.106}$$

see Example 7.25 for a practical application. In the case of event censoring, i.e. when the test is stopped at the occurrence of the $n$th event (Type II censoring), Eq. (7.104) holds with $t_n^*$ instead of $T$ and $n-1$ instead of $n$.
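The estimation and extrapolation steps can be sketched in code. The following assumes the closed-form estimates $\hat\beta = n / \sum \ln(T/t_i^*)$ and $\hat\alpha = n / T^{\hat\beta}$ for the time censored case, which reproduce the figures quoted in Example 7.25; the function names are illustrative.

```python
def duane_estimates(n, T, sum_log_ratio):
    """Maximum likelihood estimates (alpha, beta) of the power-law NHPP for
    time censoring at T, given n and the sum of ln(T / t_i)."""
    beta_hat = n / sum_log_ratio
    alpha_hat = n / T**beta_hat
    return alpha_hat, beta_hat

def duane_intensity(t, alpha, beta):
    """Intensity m(t) = alpha * beta * t**(beta - 1)."""
    return alpha * beta * t ** (beta - 1)

# Data of Example 7.25: T = 1200 h, n = 8, sum of ln(T / t_i) = 20.
alpha_hat, beta_hat = duane_estimates(8, 1200.0, 20.0)

m_now = duane_intensity(1200.0, alpha_hat, beta_hat)     # estimate at T
m_later = duane_intensity(3000.0, alpha_hat, beta_hat)   # extrapolation to T + 1800 h

print(round(beta_hat, 2), round(alpha_hat, 2))
print(f"{m_now:.2e}  {m_later:.2e}")
```

The extrapolated value is meaningful only if the growth process really continues with the same statistical properties, as the text points out.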

Example 7.25 During the reliability growth program of a complex equipment, the following data was gathered: $T = 1200\,\mathrm{h}$, $n = 8$, and $\sum_{i=1}^{n} \ln(T/t_i^*) = 20$. Assuming that the underlying process can be described by a Duane model, estimate the intensity at $t = 1200\,\mathrm{h}$ and the value attainable at $t + \Delta = 3000\,\mathrm{h}$ if the reliability growth would continue with the same statistical properties.

Solution With $T = 1200\,\mathrm{h}$, $n = 8$, and $\sum_{i=1}^{n} \ln(T/t_i^*) = 20$, it follows from Eq. (7.104) that $\hat\beta = 0.4$ and $\hat\alpha = 0.47$. From Eq. (7.105), the estimate for the intensity leads to $\hat m(1200) = 2.67 \cdot 10^{-3}\,\mathrm{h}^{-1}$ ($\hat M(1200) = 8$). The attainable intensity after an extension of the program for reliability growth by 1800 h is given by Eq. (7.105) as $\hat m(3000) = 1.54 \cdot 10^{-3}\,\mathrm{h}^{-1}$.

Interval estimation for the parameters $\alpha$ and $\beta$ can be found, see e.g. [A8.1]. For goodness-of-fit tests one can consider the property of nonhomogeneous Poisson processes that, for given (fixed) $T$ and knowing that $n$ events have been observed in $(0, T]$, i.e. for given $T$ and $\nu(T) = n$, the occurrence times $0 < \tau_1^* < \ldots < \tau_n^* < T$ have the same distribution as if they were the order statistics of $n$ independent and identically distributed random variables with density $m(t)/M(T)$, $0 < t < T$ (Eq. (A7.205)). For example, the Kolmogorov-Smirnov test (Section 7.5) can be used with $\hat{F}_n(t) = \hat\nu(t)/\hat\nu(T)$ (Eq. (7.79)) and $F_0(t) = M_0(t)/M_0(T)$ (Eq. (7.80)), see also Appendices A7.8.2 and A8.3.2. Furthermore it holds that if $\tau_1^* < \tau_2^* < \ldots$ are the occurrence times of a NHPP, then $\psi_1^* = M(\tau_1^*) < \psi_2^* = M(\tau_2^*) < \ldots$ are the occurrence times in a homogeneous Poisson process (HPP) with intensity one (Eq. (A7.200)). Results for independent and identically distributed random variables, for HPP or for exponential distribution function can thus be used. Important is also that the mean value of the random time $\tau_R(t)$ from an arbitrary (fixed) time point $t \ge 0$ to the next failure is independent of the process development up to the time $t$ and is given by (Eq. (A7.197))

$$\mathrm{E}[\tau_R(t)] = \int_0^{\infty} \Pr\{\text{no event in } (t, t+x]\}\, dx = \int_0^{\infty} e^{-(M(t+x)-M(t))}\, dx, \tag{7.107}$$

yielding, for instance,

$$\mathrm{E}[\tau_R(t)] = 1/\lambda, \qquad t \text{ given (fixed)}, \; x > 0, \tag{7.109}$$

for $M(t+x) = M(t) + \lambda x$, i.e. $m(t+x) = \lambda$ for $t$ given (fixed) and $x > 0$, and

$$\mathrm{E}[\tau_R(t)] = \frac{\Gamma(1 + 1/\beta)}{\alpha^{1/\beta}}, \qquad t \text{ given (fixed)}, \; x > 0, \tag{7.108}$$

for $M(t+x) = M(t) + \alpha x^{\beta}$ (Appendix A9.6 or Eq. (A6.92) with $\lambda = \alpha^{1/\beta}$).

The Duane model often applies to electronic, electromechanical, and mechanical equipment and systems. It can also be used to describe the occurrence of software defects (dynamic defects). However, other models have been discussed in the literature especially for software (Section 5.3.4). Among these, the logarithmic Poisson model, which assumes a nonhomogeneous Poisson process with intensity

For the logarithmic Poisson model, $m(t)$ is monotonically decreasing with $m(0) < \infty$ and $m(\infty) = 0$. Considering $M(0) = 0$, it follows that

Models combining in a multiplicative way two possible mean value functions M(t) have been investigated in [7.33] by assuming

$$M(t) = a \ln(1 + t/b) \cdot (1 - e^{-t/b}) \quad \text{and} \quad M(t) = \alpha\, t^{\beta} \cdot \big[1 - (1 + t/\gamma)\, e^{-t/\gamma}\big], \tag{7.112}$$

with $a, b, \alpha, \gamma > 0$, $0 < \beta < 1$, $t \ge 0$. In both cases, the intensity $m(t)$ grows from 0 to a maximum, from which it goes to 0 with a shape similar to that of the models given by Eq. (7.110). The models described by Eqs. (7.100), (7.111), and (7.112) are based on nonhomogeneous Poisson processes, satisfying thus the properties discussed in Appendix A7.8.2.
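The stated shape of the intensity can be checked for the first of the two mean value functions, $M(t) = a \ln(1 + t/b)\,(1 - e^{-t/b})$. The sketch below is an illustration with hypothetical parameters $a = b = 1$; the expression for $m(t)$ is obtained here by the product rule and is not quoted from the text.

```python
import math

A, B = 1.0, 1.0   # hypothetical parameters a and b

def mean_value(t):
    """M(t) = a * ln(1 + t/b) * (1 - exp(-t/b))."""
    return A * math.log(1 + t / B) * (1 - math.exp(-t / B))

def intensity(t):
    """m(t) = dM(t)/dt, derived with the product rule."""
    decay = math.exp(-t / B)
    return A * ((1 - decay) / (B + t) + math.log(1 + t / B) * decay / B)

# Locate the maximum of the intensity on a grid from 0 to 20 (step 0.01).
grid = [0.01 * k for k in range(2001)]
t_peak = max(grid, key=intensity)

print(intensity(0.0), round(t_peak, 2), round(intensity(t_peak), 3), intensity(1000.0))
```

The intensity starts at 0, passes through a single maximum on the grid, and then decays slowly, roughly like $a/(b+t)$ for large $t$.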

Although appealing, nonhomogeneous Poisson processes (NHPP) can not solve all reliability growth modeling problems, basically because of their intrinsic simplicity related to the assumption of independent increments. The consequence of this assumption is that the NHPP is a process without aftereffect for which the waiting time to the next event from an arbitrary time point $t$ is independent of the process development up to time $t$ (Eq. (7.107) or Eq. (A7.197)). Furthermore, the first occurrence time $\tau_1^*$ characterizes the NHPP (Eq. (A7.201)). In particular, a NHPP can not be used to estimate the number of defects present in a software package, see e.g. [A7.30] for further comments.
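Since the mean waiting time of Eq. (7.107) depends only on $M(t+x) - M(t)$, it can be evaluated by direct numerical integration. The sketch below uses a simple trapezoidal rule with a truncated upper limit; the function names, the truncation point, and the parameter values are assumptions made for illustration. It is checked against the two closed forms $1/\lambda$ and $\Gamma(1 + 1/\beta)/\alpha^{1/\beta}$ (the latter for $t = 0$).

```python
import math

def mean_waiting_time(t, mean_value, upper, steps=200_000):
    """Trapezoidal approximation of the integral over x in (0, upper) of
    exp(-(M(t + x) - M(t))); `upper` truncates the infinite integration range."""
    h = upper / steps
    m_t = mean_value(t)
    total = 0.5 * (1.0 + math.exp(-(mean_value(t + upper) - m_t)))
    for k in range(1, steps):
        total += math.exp(-(mean_value(t + k * h) - m_t))
    return total * h

# Case 1: HPP with M(t) = lam * t; the mean waiting time is 1/lam for any t.
lam = 1e-3
hpp_mean = mean_waiting_time(500.0, lambda t: lam * t, upper=30_000.0)

# Case 2: M(x) = alpha * x**beta seen from t = 0 (hypothetical alpha, beta).
alpha, beta = 0.47, 0.4
power_mean = mean_waiting_time(0.0, lambda t: alpha * t**beta, upper=30_000.0)
closed_form = math.gamma(1 + 1 / beta) / alpha ** (1 / beta)

print(round(hpp_mean, 1), round(power_mean, 2), round(closed_form, 2))
```

For a power-law NHPP and $t > 0$ the same routine can be used, but the result then depends on $t$ because the increments are no longer stationary.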

In general, it is not possible to fix a priori the model to be used in a given situation. For hardware as well as for software, a physical motivation of the model, based on failure or defect causes / mechanisms, can help in such a choice. Having the "best model", the next step should be to verify that the assumptions made are compatible with the model and after that to check the compatibility with data. Misuses or misinterpretations can occur, often because of dependencies between the involved random variables.

8 Quality and Reliability Assurance During the Production Phase (Basic Considerations)

Reliability assurance has to be continued during the production phase, coordinated with other quality assurance activities. This holds in particular for monitoring and controlling production processes, item configuration, in-process and final tests, screening procedures, and collection, analysis & correction of defects and failures. The last measure leads to a learning process whose purpose is to optimize the quality of manufacture, taking into account cost and time schedule limitations. This chapter introduces some basic aspects of quality and reliability assurance during production, discusses test and screening procedures for electronic components and assemblies, introduces the concept of cost optimization related to a test strategy, and develops it for a cost optimized test and screening strategy at the incoming inspection. For greater details on qualification & monitoring of production processes one may refer to [7.1-7.5, 8.1-8.15]. Models for reliability growth are discussed in Section 7.7.

8.1 Basic Activities

The quality and reliability level achieved during the design and development phase must be retained during production (pilot and series production). The following basic activities support this purpose.

1. Management of the item's configuration (review and release of the production documentation, control and accounting of changes and modifications).

2. Selection and qualification of production facilities and processes.

3. Monitoring and control of the production procedures (assembling, testing, transportation, storage, etc.).

4. Protection against damage during production (electrostatic discharge (ESD), mechanical, thermal or electrical stresses).

5. Systematic collection, analysis, and correction of defects and failures occurring during the item's production or testing (back to the root cause).

6. Quality and reliability assurance during procurement (documentation, incoming inspection, supplier audits).


7. Calibration of measurement and testing equipment.

8. Performance of in-process and final tests (functional and environmental).

9. Screening of critical components and assemblies.

10. Realization of a test and screening strategy (optimization of the cost and time schedule for testing and screening).

Configuration management, monitoring of corrective actions, and some important aspects of statistical quality control and reliability tests have been considered in Section 1.3, Chapter 7, and Appendices A3-A5. The following sections present test and screening procedures for electronic components and assemblies, introduce the concept of test and screening strategy, and develop it for a cost optimized test and screening strategy at the incoming inspection. Although focused on electronic systems, many of the considerations given below apply to mechanical systems as well. For greater details on qualification & monitoring of production processes one may refer to [7.1-7.5, 8.1-8.15] (see also Section 7.7 for reliability growth).

8.2 Testing and Screening of Electronic Components

8.2.1 Testing of Electronic Components

Most electronic components are tested today by the end user only on a sampling basis. To be cost effective, sampling plans should take into consideration the quality assurance effort of the component's manufacturer, in particular the confidence which can be given to the data furnished by the manufacturer. In critical cases, the sample should be large enough to allow acceptance of more than 2 defective components (Sections 7.1.3, 3.1.4). 100% incoming inspection can be necessary for components used in high reliability and / or safety equipment and systems, new components, components with important changes in design or manufacturing, or for some critical components like power semiconductors, mixed-signal ICs, and complex logic ICs used at the limits of their dynamic parameters. This holds so long as the fraction of defective items remains over a certain limit, fixed by technical and cost considerations. Advantages of a 100% incoming inspection of electronic components are:

1. Quick detection of all relevant defects.

2. Reduction of the number of defective populated printed circuit boards (PCBs).

3. Simplification of the tests at PCB level.

4. Replacement of the defective components by the supplier.

5. Protection against quality changes from lot to lot, or within the same lot.


Despite such advantages, different kinds of damage (overstress during testing, assembling, soldering) can cause problems at PCB level. The defective probability p (fraction of defective items) lies for today's established components in the range of a few ppm (parts per million) for passive components up to thousands of ppm for complex active components. In defining a test strategy, a possible change of p from lot to lot or within the same lot should also be considered. An example of a test procedure for electronic components is given in Section 3.2.1 for VLSI ICs. Test strategies with cost consideration are developed in Section 8.4.

8.2.2 Screening of Electronic Components

Electronic components new on the market, produced in small series, subjected to an important redesign, or manufactured with insufficiently stable process parameters can exhibit early failures, i.e. failures during the first operating hours (generally up to a few thousand hours). Because of high replacement cost at equipment level or in the field, components exhibiting early failures should be eliminated before they are mounted on printed circuit boards. Defining a cost-effective screening strategy is difficult for at least the following two reasons:

1. It may activate failure mechanisms that would not appear in field operation.

2. It could introduce damage (ESD, transients) which may be the cause of further early failures.

Ideally, screening should be performed by skilled personnel, be focused on the failure mechanisms which have to be activated, and not cause damage or alteration. Experience on a large number of components shows that for established technologies and stable process parameters, thermal cycles for discrete (in particular power) devices and burn-in for ICs are the most effective steps to precipitate early failures. Table 8.1 gives possible screening procedures for electronic components used in high reliability or safety equipment and systems.

Screening procedures and sequences are given in national and international standards [8.27, 8.32]. The following is an example of a screening procedure for ICs in hermetic packages for high reliability or safety applications:

1. High-temperature storage: The purpose of high temperature storage is the stabilization of the thermodynamic equilibrium and thus of the IC electrical parameters. Failure mechanisms related to surface problems (contamination, oxidation, contacts) are activated. The ICs are placed on a metal tray (pins on the tray to avoid thermal voltage stresses) in an oven at 150°C for 24 h. Should solderability be a problem, a protective atmosphere (N2) can be used.

2. Thermal cycles: The purpose of thermal cycles is to test the ICs' ability to endure rapid temperature changes; this activates failure mechanisms related to mechanical stresses caused by mismatch in expansion coefficients of the


Table 8.1 Example of test and screening procedures for electronic components used in high reliability or safety equipment and systems (apply in part to SMD)

Component | Sequence

Resistors | Visual inspection, 20 thermal cycles for resistor networks (-40/+125°C)*, 48 h steady-state burn-in at 100°C and 0.6 P_N*, el. test at 25°C*

Capacitors, film | Visual inspection, 48 h steady-state burn-in at 0.9 θ_max and U_N*, el. test at 25°C (C, tan δ, R_is)*, measurement of R_is at 70°C*

Capacitors, ceramic | Visual inspection, 20 thermal cycles (θ_extr)*, 48 h steady-state burn-in at U_N and 0.9 θ_max*, el. test at 25°C (C, tan δ, R_is)*, measurement of R_is at 70°C*

Capacitors, tantalum (solid) | Visual inspection, 10 thermal cycles (θ_extr)*, 48 h steady-state burn-in at U_N and 0.9 θ_max (low Z_Q)*, el. test at 25°C (C, tan δ, I_r)*, meas. of I_r at 70°C*

Capacitors, aluminum | Visual inspection, forming (as necessary), 48 h steady-state burn-in at U_N and 0.9 θ_max*, el. test at 25°C (C, tan δ, I_r)*, measurement of I_r at 70°C*

Diodes (Si) | Visual inspection, 30 thermal cycles (-40/+125°C)*, 48 h reverse bias burn-in at 125°C*, el. test at 25°C (I_r, U_F, U_R min)*, seal test (fine/gross leak)*+

Transistors (Si) | Visual inspection, 20 thermal cycles (-40/+125°C)*, 50 power cycles (25/125°C, ca. 1 min on / 2 min off) for power elements*, el. test at 25°C (β, I_CEO, U_CEO)*, seal test (fine/gross leak)*+

Optoelectronic, LED, IRED | Visual inspection, 72 h high temp. storage at 100°C*, 20 thermal cycles (-20/+80°C)*, el. test at 25°C (U_F, U_R min)*, seal test (fine/gross leak)*+

Optoelectronic, optocoupler | Visual inspection, 20 thermal cycles (-25/100°C), 72 h reverse bias burn-in (HTRB) at 85°C*, el. test at 25°C (I_C/I_F, U_F, U_R min, U_CE sat, I_CEO), seal test (fine/gross leak)*+

Digital ICs, BiCMOS | Visual inspection, reduced el. test at 25°C, 48 h dyn. burn-in at 125°C*, el. test at 70°C*, seal test (fine/gross leak)*+

Digital ICs, MOS (VLSI) | Visual inspection, reduced el. test at 25°C (rough functional test, IDD), 72 h dyn. burn-in at 125°C*, el. test at 70°C*, seal test (fine/gross leak)*+

Digital ICs, CMOS (VLSI) | Visual inspection, reduced el. test at 25°C (rough functional test, IDD), 48 h dyn. burn-in at 125°C*, el. test at 70°C*, seal test (fine/gross leak)*+

Digital ICs, EPROM, EEPROM (> 1M) | Visual inspection, programming (CHB), high temp. storage (48 h / 125°C), erase, programming (inv. CHB), high temp. storage (48 h / 125°C), erase, el. test at 70°C, seal test (fine/gross leak)*+

Linear ICs | Visual inspection, reduced el. test at 25°C (rough functional test, ICC, offsets), 20 thermal cycles (-40/+125°C)*, 96 h reverse bias burn-in (HTRB) at 125°C with red. el. test at 25°C*, el. test at 70°C*, seal test (fine/gross leak)*+

Hybrid ICs | Visual inspection, high temp. storage (24 h / 125°C), 20 thermal cycles (-40/+125°C), constant acceleration (2,000 to 20,000 g_n / 60 s)*, red. el. test at 25°C, 96 h dynamic burn-in at 85 to 125°C, el. test at 25°C, seal test (fine/gross leak)*+


material used. Thermal cycles are generally performed air to air in a two-chamber oven (transfer from low to high temperature chamber and vice versa using a lift). The ICs are placed on a metal tray (pins on the tray to avoid thermal voltage stresses) and subjected to at least 10 thermal cycles from -65 to +150°C (transfer time ≤ 1 min, time to reach the specified temperature ≤ 15 min, dwell time at the temperature extremes ≥ 10 min). Should solderability be a problem, a protective atmosphere (N2) can be used.

3. Constant acceleration: The purpose of the constant acceleration is to check the mechanical stability of die-attach, bonding, and package. This step is only performed for ICs in hermetic packages, when used in critical applications. The ICs are placed in a centrifuge and subjected to an acceleration of 30,000 g_n (≈ 300,000 m/s²) for 60 seconds (generally z-axis only).

4. Burn-in: Burn-in is a relatively expensive, but efficient screening step that provokes for ICs up to 80% of the chip-related and 30% of the package-related early failures. The ICs are placed in an oven at 125°C for 24 to 168 h and are operated statically or dynamically at this temperature (cooling under power at the end of burn-in is often required). Ideally, ICs should operate with electrical signals as in the field. The consequence of the high burn-in temperature is a time acceleration factor A often given by the Arrhenius model (Eq. (7.56))

$$ A = \frac{\lambda_2}{\lambda_1} \approx e^{\frac{E_a}{k}\left(\frac{1}{T_1} - \frac{1}{T_2}\right)}, $$

where E_a is the activation energy, k the Boltzmann constant (8.6 · 10^-5 eV/K), and λ_1 and λ_2 are the failure rates at chip temperatures T_1 and T_2 (in K), respectively; see Fig. 7.10 for a graphical representation. The activation energy E_a varies according to the failure mechanisms involved. Global average values for ICs lie between 0.3 and 0.7 eV. Using Eq. (7.56), the burn-in duration can be calculated for a given application. For instance, if the period of early failures is 3,000 h, θ_1 = 55°C, and θ_2 = 130°C (junction temperature in °C), the effective burn-in duration would be about 50 h for E_a = 0.65 eV and 200 h for E_a = 0.4 eV. It is often difficult to decide whether a static or a dynamic burn-in is more effective. Should surface, oxide, and metallization problems be dominant, a static burn-in is better. On the other hand, a dynamic burn-in activates practically all failure mechanisms. It is therefore important to make such a choice on the basis of practical experience.

5. Seal: A seal test is performed to check the seal integrity of the cavity around the chip in hermetically-packaged ICs. It begins with the fine leak test: ICs are placed in a vacuum (1 h at 0.5 mmHg) and then stored in a helium atmosphere under pressure (ca. 4 h at 5 atm); after a waiting period in open air (30 min), helium leakage is measured with the help of a specially calibrated mass spectrometer (required sensitivity approx. 10^-8 atm cm³/s, depending on the cavity volume). After the fine leak test, ICs are tested for gross leak: ICs are placed in a vacuum (1 h at 5 mmHg) and then stored under pressure (2 h at 5 atm) in fluorocarbon FC-72; after a short waiting period in open air (2 min), the ICs are immersed in a fluorocarbon indicator bath (FC-40) at 125°C; a continuous stream of small bubbles or two large bubbles from the same place within 30 s indicates a defect.
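The burn-in duration estimate of step 4 can be sketched numerically. The following is a minimal illustration, assuming the usual Arrhenius form A = exp[(E_a / k)(1/T_1 - 1/T_2)] with temperatures in kelvin and the rounded Boltzmann constant quoted above; the function and variable names are illustrative only and do not come from the book.

```python
import math

BOLTZMANN_EV_PER_K = 8.6e-5  # rounded value quoted in the text, in eV/K


def acceleration_factor(e_a_ev, theta1_c, theta2_c, k=BOLTZMANN_EV_PER_K):
    """Arrhenius acceleration factor between two junction temperatures in deg C."""
    t1 = theta1_c + 273.15  # field junction temperature in K
    t2 = theta2_c + 273.15  # burn-in junction temperature in K
    return math.exp((e_a_ev / k) * (1.0 / t1 - 1.0 / t2))


def burn_in_hours(early_failure_period_h, e_a_ev, theta1_c, theta2_c):
    """Burn-in time at theta2 equivalent to the early failure period at theta1."""
    return early_failure_period_h / acceleration_factor(e_a_ev, theta1_c, theta2_c)


# Example of the text: 3,000 h early failure period, 55 deg C in the field,
# 130 deg C during burn-in, for two assumed activation energies.
for e_a in (0.65, 0.4):
    a = acceleration_factor(e_a, 55.0, 130.0)
    hours = burn_in_hours(3000.0, e_a, 55.0, 130.0)
    print(f"E_a = {e_a} eV: A = {a:.1f}, burn-in duration = {hours:.0f} h")
```

The results are of the same order as the rounded durations given in the text; the strong dependence on E_a shows why the activation energy of the dominant failure mechanism has to be known before fixing the burn-in time.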

8.3 Testing and Screening of Electronic Assemblies

Electrical testing of electronic assemblies, for instance populated printed circuit boards (PCBs), can be basically performed in one of the following ways:

1. Functional test within the assembly or unit in which the PCB is used.

2. Functional test with the help of functional test equipment.

3. In-circuit test followed by a functional test with the assembly or unit in which the PCB is used.

The first method is useful for small series production. It assumes that components have been tested (or are of sufficient quality) and that automatic or semi-automatic isolation of defects on the PCB is possible. The second method is suitable for large series production, in particular from the point of view of protection against damage (ESD, backdriving, mechanical stresses), but can be expensive. The third and most commonly used method assumes the availability of appropriate in-circuit test equipment. With such equipment, each component is electrically isolated and tested statically or quasi-statically. This can be sufficient for passive components and discrete semiconductors, as well as for SSI and MSI ICs, but it cannot replace an electrical test at the incoming inspection for LSI and VLSI ICs (functional tests on in-circuit test equipment are limited to a few 100 kHz, and dynamic tests (Fig. 3.4) are not possible). Thus, even if in-circuit testing is used, incoming inspection of critical components should not be omitted. A further disadvantage of in-circuit testing is that the outputs of an IC can be forced to a LOW or a HIGH state. This stress (backdriving) is generally short (50 ns), but may be sufficient to cause damage to the IC in question. In spite of this, and of some other problems (polarity of electrolytic capacitors, paralleled components, tolerance of analog devices), in-circuit testing is today the most effective means to test populated printed circuit boards (PCBs), on account of its good defect isolation capability.

Because of the large number of components and solder joints involved, the defective probability of a PCB can be relatively high even in stable production conditions. Experience shows that for a PCB with about 500 components and 3,000 solder joints, the following indicative values can be expected (see e.g. Table 1.3 for a fault report form):

0.5 to 2% defective PCBs (often 3/4 attributable to assembling and 1/4 to components),

1.5 defects per defective PCB (mean value).

Considering such figures, it is important to remember that defective PCBs are often reworked and that a repair or rework can have a negative influence on the quality and reliability of a PCB.

Screening populated printed circuit boards (PCBs) or assemblies with higher integration level is generally a difficult task, because of the many different technologies involved. Experience on a large number of PCBs [3.76] leads to the following screening procedure which can be recommended for PCBs of standard technology used in high reliability applications (in limited amount also for mixed technology):

1. Visual inspection and reduced electrical test.

2. 100 thermal cycles between 0°C and +80°C, with temperature gradient ≤ 5°C/min (within the components), dwell time ≥ 10 min, and power off during cooling (gradient ≥ 20°C/min only if this also occurs in the field and is compatible with the PCB technology).

3. 15 min random vibration at 2 g_rms, 20 - 500 Hz (to be performed if significant vibrations occur in the field).

4. 48 h run-in at ambient temperature, with periodic power on/off switching.

5. Final electrical and functional test.

Careful investigations on SMT assemblies down to pitch 0.3 mm [3.79, 3.80, 3.89] have shown that basically two different deformation mechanisms can be present in tin based solder joints (see Section 3.4): grain boundary sliding at rather low temperature (or thermal) gradients and low stiffness of the structure component PCB, and dislocation climbing at higher temperature gradients and high stiffness (e.g. for leadless ceramic components). For this reason, screening of populated PCBs in SMT should be avoided if the temperature gradient occurring in the field is not known. Preventive actions, to build in quality and reliability during manufacturing, have to be preferred here.

The above procedure can be considered as an environmental stress screening (ESS), often performed on a 100% basis in a series production of PCBs used in high reliability or safety applications to provoke early failures. It can serve as a basis for screening at higher integration levels.

Thermal cycles can be combined with power on / off switching or vibration to increase effectiveness. However, in general a screening strategy for PCBs (or at higher integration level) should be established on a case-by-case basis, and be periodically reconsidered (reduced or even canceled if the percentage of early failures drops below a given value, 1% for instance).

Burn-in at assembly level can be used in the context of a reliability test to validate the predicted failure rate λ_S of an assembly. Assuming that the assembly consists of elements E_1, ..., E_n in series, with failure rates λ_1(T_1), ..., λ_n(T_1) at temperature T_1 and acceleration factors A_1, ..., A_n for a stress at temperature T_2, the assembly failure rate λ_S(T_2) at temperature T_2 (stress) can be calculated from (Eq. (7.57))

$$ \lambda_S(T_2) = \sum_{i=1}^{n} A_i \, \lambda_i(T_1). $$

Comparison of the predicted failure rate λ_S(T_2) with real data can be performed by submitting the assembly to a burn-in at temperature T_2 and evaluating the experimentally obtained failure rate (Section 7.2.3). However, because of the many different technologies often used in an assembly (e.g. populated PCB), T_2 is generally chosen < 100°C.
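The prediction step can be illustrated with a short sketch. It assumes the series-structure form referred to as Eq. (7.57), i.e. that the assembly failure rate under stress is the sum of the element failure rates at T_1, each multiplied by its own acceleration factor. The element failure rates and acceleration factors below are hypothetical values chosen for illustration, not data from the book.

```python
def assembly_failure_rate(rates_t1, accel_factors):
    """Failure rate of a series structure at stress temperature T2.

    rates_t1      -- element failure rates at temperature T1 (per hour)
    accel_factors -- acceleration factor of each element for the stress at T2
    """
    if len(rates_t1) != len(accel_factors):
        raise ValueError("one acceleration factor is needed per element")
    return sum(a * lam for a, lam in zip(accel_factors, rates_t1))


# Hypothetical assembly with three element types (values are assumptions).
rates_at_t1 = [2e-8, 5e-8, 1e-7]   # per hour, at T1
factors = [10.0, 4.0, 2.0]         # A_i for the stress at T2

lambda_s_t1 = sum(rates_at_t1)
lambda_s_t2 = assembly_failure_rate(rates_at_t1, factors)
print(f"failure rate at T1: {lambda_s_t1:.2e} per hour")
print(f"failure rate at T2: {lambda_s_t2:.2e} per hour")
```

Because each technology has its own acceleration factor, the ratio of the two printed failure rates is not a single Arrhenius factor; it depends on the mix of elements in the assembly.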

8.4 Test and Screening Strategies, Economic Aspects

8.4.1 Basic Considerations

In view of the optimization of cost associated with testing and screening during production, each manufacturer of high-performance equipment and systems is confronted with the following question:

What is the most cost-effective approach to eliminate all defects, systematic failures, and early failures prior to shipment to the customer?

The answer to this question depends essentially on the level of quality, reliability, and safety required for the item considered, the consequence of a defect or a failure, the effectiveness of each test or screening step, as well as on the direct and deferred cost involved (warranty cost for instance). A test and screening strategy should thus be tailored to the item considered, in particular to its complexity, technology, and production procedures, but also to the facilities and skill of the manufacturer. In setting up such a strategy, the following aspects must be considered:

1. Cost equations should include deferred cost (for instance, warranty cost and cost for loss of image).

2. Testing and screening should begin at the lowest level of integration and be selective, i.e. consider the effectiveness of each test or screening step.


3. Qualification tests on prototypes are important to eliminate defects and systematic failures; they should include performance, environmental & reliability tests.

4. Testing and screening should be carefully planned to allow high interpretability of the results, and be supported by a quality data reporting system (Fig. 1.8).

5. Testing and screening strategy should be discussed early in the design phase, during design reviews.

Figure 8.1 can be used as a starting point for the development of a test and screening strategy at the assembly level.

Figure 8.1 Flow chart as a basis for the development of a test and screening strategy for electronic assemblies (e.g. populated printed circuit boards (PCBs)). [Flow chart steps: incoming inspection; PCB assembling and soldering; in-circuit test; functional test; unit assembling and testing; storage, shipping, use]

A basic relationship between test strategy and cost is illustrated in the example of Fig. 8.2, in which two different strategies are compared. Both cases in Fig. 8.2 deal with the production of a stated quantity of equipment or systems for which a total of 100,000 ICs of a given type are necessary. The ICs are delivered with a defective probability p = 0.5%. During production, additional defects occur as a result of incorrect handling, mounting, etc., with probabilities of 0.01% at the incoming inspection, 0.1% at assembly level, and 0.01% at equipment level. The cost of eliminating a defective IC is assumed to be $2 (US$) at the incoming inspection, $20 at assembly level, $200 at equipment level, and $2,000 during warranty. The two test strategies differ in the probability (DPr) of detecting / recognizing and eliminating a defect. This probability is for the four levels 0.1, 0.9, 0.8, 1.0 in the first strategy and 0.95, 0.9, 0.8, 1.0 in the second strategy. It is assumed, in this example, that the additional cost to improve the detection probability at incoming inspection (+ $20,000) is partly compensated by the savings in the test at the assembly level (- $10,000). As Fig. 8.2 shows, the total cost of the second test strategy is (for this example) lower, by $21,900, than that of the first one.

Number of defects and cost are, in all considerations of this kind, expected values (means of random variables). The use of arithmetic means in the example of Fig. 8.2, on the basis of 100,000 ICs at the input, is for convenience only.

Strategy a (defects at the input: 500, defective probability 0.5%)

Level | Incoming inspection | Assembly | Equipment | Warranty
DPr | 0.1 | 0.9 | 0.8 | 1
Additional defective probability | 0.01% | 0.1% | 0.01% | -
Additional defects | 10 | 100 | 10 | -
Discovered defects | 51 | 503 | 53 | 13

Strategy b (defects at the input: 500, defective probability 0.5%)

Level | Incoming inspection | Assembly | Equipment | Warranty
DPr | 0.95 | 0.9 | 0.8 | 1
Additional defective probability | 0.01% | 0.1% | 0.01% | -
Discovered defects | 485 | 113 | 18 | 4
Defects cost (in 1000 US$) | 1 | 2.3 | 3.6 | 8
Additional test cost (in 1000 US$) | +20 | -10 | - | -

Total cost of strategy b: 24,900 US$

Figure 8.2 Comparison between two possible test strategies (figures for defects and cost have to be considered as expected values): a) Emphasis on assembly test; b) Emphasis on incoming inspection (DPr = detection / recognition probability)


Models like that of Fig. 8.2 can be used to identify weak points in the production process (e.g. with respect to the defective probabilities at the different production steps) or to evaluate the effectiveness of additional measures introduced to decrease quality cost.
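A model of this kind is easy to evaluate numerically. The sketch below propagates expected defects through the four levels of the example (incoming inspection, assembly, equipment, warranty) using the probabilities and costs stated in the text. It assumes that no further defects are added during warranty and it does not round to whole defects, so the totals differ slightly from the rounded figures of Fig. 8.2; all function and variable names are illustrative.

```python
LEVELS = ("incoming inspection", "assembly", "equipment", "warranty")


def evaluate_strategy(n_items, p_input, p_added, dpr, unit_cost, extra_cost=0.0):
    """Propagate expected defects through the test levels.

    Returns the expected number of defects discovered at each level and the
    expected total cost (defect elimination cost plus extra test cost).
    """
    pending = n_items * p_input       # defects delivered with the components
    discovered = []
    total = extra_cost
    for p_add, d, cost in zip(p_added, dpr, unit_cost):
        pending += n_items * p_add    # defects introduced at this level
        found = d * pending           # expected number detected and eliminated
        pending -= found              # the rest escapes to the next level
        discovered.append(found)
        total += found * cost
    return discovered, total


N_ICS = 100_000
P_INPUT = 0.005                           # defective probability at delivery
P_ADDED = (0.0001, 0.001, 0.0001, 0.0)    # added defects per level (none in warranty)
UNIT_COST = (2.0, 20.0, 200.0, 2000.0)    # US$ per eliminated defect

found_a, cost_a = evaluate_strategy(
    N_ICS, P_INPUT, P_ADDED, (0.1, 0.9, 0.8, 1.0), UNIT_COST)
found_b, cost_b = evaluate_strategy(
    N_ICS, P_INPUT, P_ADDED, (0.95, 0.9, 0.8, 1.0), UNIT_COST,
    extra_cost=20_000.0 - 10_000.0)       # +20,000 at incoming, -10,000 at assembly

for name, found, cost in (("a", found_a, cost_a), ("b", found_b, cost_b)):
    summary = ", ".join(f"{lvl}: {f:.1f}" for lvl, f in zip(LEVELS, found))
    print(f"Strategy {name}: {summary}; total cost {cost:,.0f} US$")
```

Without rounding, the totals come out at roughly 47,100 US$ for strategy a and 25,900 US$ for strategy b, so the conclusion of the example is unchanged. Varying one detection probability or one defective probability at a time in such a script is a simple way to locate the production step that dominates quality cost.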

8.4.2 Quality Cost Optimization at Incoming Inspection Level

In this section, the optimization of quality cost in the context of a testing and screening strategy is solved for the choice between a 100% incoming inspection and an incoming inspection on a sampling basis, whichever is more cost effective. Two cases will be distinguished, incoming inspection without screening (test only, illustrated by Fig. 8.3 and Fig. 8.4) and incoming inspection with screening (test and screening, illustrated by Fig. 8.5 and Fig. 8.6). The following notation is used:

A_t = probability of acceptance at the sampling test, i.e. probability of having no more than c defective components in a sample of size n (function of p_d, given by Eq. (A6.121) with p = p_d and k = c, see also Fig. 7.2 for a graphical solution using the Poisson approximation)

A_s = same as A_t, but for screening (screening with test)

c_d = deferred cost per defective component

c_f = deferred cost per component with early failure

c_r = replacement cost per component at the incoming inspection

c_t = testing cost per component (test only)

c_s = screening cost per component (c_s includes cost for screening and for test)

C_t = expected value (mean) of the total cost (direct and deferred) for incoming inspection without screening (test only) of a lot of N components

C_s = expected value (mean) of the total cost (direct and deferred) for incoming inspection with screening (screening with test) of a lot of N components

n = sample size

N = lot size

p_d = defective probability (defects are recognized at the test)

p_f = probability for an early failure (early failures are precipitated by the screening)

Figure 8.3 Model for quality cost optimization (direct and deferred cost) at the incoming inspection without screening of a lot of N components (all cost are expected values, see Fig. 8.5 for screening). [Flow chart elements: lot of size N; assembly, test, use; deferred cost C_t'' = A_t p_d (N - n) c_d]

Consider first the incoming inspection without screening (test only). The corresponding model is shown in Fig. 8.3. From Fig. 8.3, the following cost equation can be established for the expected value (mean) of the total cost C_t

$$ C_t = n (c_t + p_d c_r) + (N - n) A_t p_d c_d + (N - n)(1 - A_t)(c_t + p_d c_r) = N (c_t + p_d c_r) + (N - n) A_t \left[ p_d c_d - (c_t + p_d c_r) \right]. \tag{8.1} $$

Investigating Eq. (8.1) leads to the following cases:

1. For p_d = 0, A_t = 1 and thus

$$ C_t = n c_t. \tag{8.2} $$

2. For a 100% incoming inspection, n = N and thus

$$ C_t = N (c_t + p_d c_r). \tag{8.3} $$

3. For

$$ c_d < \frac{c_t}{p_d} + c_r \tag{8.4} $$

it follows that

$$ C_t < N (c_t + p_d c_r) $$

and thus a sampling test is more cost effective.

4. For

$$ c_d > \frac{c_t}{p_d} + c_r \tag{8.5} $$

it follows that

$$ C_t > N (c_t + p_d c_r) $$

and thus a 100% incoming inspection is more cost effective.

Figure 8.4 Practical realization of the procedure described by the model of Fig. 8.3. [Flow chart elements: empirical values for p_d, c_t, c_d, c_r; lot of size N; sample of size n; inspection (test)]

The practical realization of the procedure according to the model of Fig. 8.3 is given in Fig. 8.4. The sample of size n, tested instead of the 100% incoming inspection if inequality (8.4) is fulfilled, is used to verify the value of p_d, which for the actual lot can differ from the assumed one. A table of AQL-values (Table 7.1) can be used to determine values for n and c of the sampling plan, with AQL = p_d in uncritical cases and AQL < p_d if a reduction of the risk of deferred cost is desired.
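The trade-off between a sampling test and a 100% incoming inspection can be explored with a few lines of code. The sketch below is an illustration under stated assumptions: the acceptance probability is taken as the binomial probability of at most c defectives in a sample of size n; the expected total cost is built from three terms (test and replacement in the sample, deferred cost for the untested remainder of an accepted lot, and 100% test of the remainder of a rejected lot); and all numerical values are hypothetical.

```python
from math import comb


def acceptance_probability(n, c, p_d):
    """Probability of at most c defective components in a sample of size n."""
    return sum(comb(n, i) * p_d**i * (1.0 - p_d)**(n - i) for i in range(c + 1))


def expected_cost_test_only(lot, n, c, p_d, c_t, c_r, c_d):
    """Expected total cost (direct and deferred) of a sampling test, no screening."""
    a_t = acceptance_probability(n, c, p_d)
    per_tested = c_t + p_d * c_r                      # test plus expected replacement
    sample = n * per_tested                           # sample is always tested
    accepted = a_t * (lot - n) * p_d * c_d            # deferred cost, lot accepted
    rejected = (1.0 - a_t) * (lot - n) * per_tested   # 100% test, lot rejected
    return sample + accepted + rejected


def sampling_is_cheaper(p_d, c_t, c_r, c_d):
    """Decision rule: sampling pays off if c_d < c_t / p_d + c_r."""
    return c_d < c_t / p_d + c_r


# Hypothetical lot and cost figures (assumptions, in arbitrary currency units).
LOT, SAMPLE, ACCEPT_NO = 1000, 50, 0
P_D, C_T, C_R = 0.005, 0.1, 1.0

full_inspection = expected_cost_test_only(LOT, LOT, ACCEPT_NO, P_D, C_T, C_R, 50.0)
for c_d in (10.0, 50.0):
    cost = expected_cost_test_only(LOT, SAMPLE, ACCEPT_NO, P_D, C_T, C_R, c_d)
    print(f"c_d = {c_d}: sampling {cost:.1f}, 100% inspection {full_inspection:.1f}, "
          f"sampling cheaper: {sampling_is_cheaper(P_D, C_T, C_R, c_d)}")
```

With these assumed figures the threshold c_t / p_d + c_r equals 21, so sampling is the cheaper option for c_d = 10 and the 100% inspection for c_d = 50. The cost of the 100% inspection does not depend on c_d, since no untested component reaches assembly.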


Figure 8.5 Model for quality cost optimization (direct and deferred cost) at the incoming inspection with screening of a lot of N components (all cost are expected values; screening includes test). [Flow chart elements: lot of size N; sample of size n (screening with el. test); screening; test; accept?; el. test (without screening) of the remaining (N - n) components]

As a second case, let us consider the situation of an incoming inspection with screening (Section 8.2). Figure 8.5 gives the corresponding model and leads to the following cost equation

The same considerations as with Eqs. (8.2) - (8.5) lead to the conclusion that if

holds, then a sampling screening (with test) is more cost effective than a 100% screening. The practical realization of the procedure according to the model of Fig. 8.5 is given in Fig. 8.6. As in Fig. 8.4, the sample of size n, screened instead of the 100% screening if inequality (8.7) is fulfilled, is used to verify the values of p_f and p_d, which for the actual lot can differ from the assumed ones.

Figure 8.6 Practical realization of the procedure described by Fig. 8.5 (screening includes test). [Flow chart elements: sample test of size n; 100% incoming inspection without screening (test only); 100% incoming inspection with screening (screening with test); assembly, test, use]

The lower part on the left-hand side of Fig. 8.6 is identical to Fig. 8.4. The first inequality in Fig. 8.6 follows from inequality (8.7) with the assumption

The second inequality in Fig. 8.6 refers to the cost for incoming inspection without screening (inequality (8.4)).

8.4.3 Procedure to Handle First Deliveries

Components, materials, and externally manufactured subassemblies or assemblies should be submitted at the first delivery to an appropriate selection procedure. Part of this procedure can be performed in cooperation with the manufacturer to avoid duplication of efforts. Figure 8.7 gives the basic structure of such a procedure; see Sections 3.2 and 3.4 for some examples of qualification tests for components and assemblies.

Figure 8.7 Selection procedure for non qualified components and materials. [Flow chart elements: qualification test; appropriate?; reject; first lot; 100% incoming inspection; experience; components and materials]

A1 Terms and Definitions

This appendix defines and comments on the terms most commonly used in reliability engineering (Fig. A1.1). Table 5.4 extends this appendix to software quality (see also [A1.5 (610)]). Attention has been paid to the adherence to relevant international standards (ISO, IEC) and recent trends [A1.1 - A1.8].

System, Systems Engineering, Concurrent Engineering, Cost Effectiveness, Quality

Capability

Availability, Dependability

Reliability

Item

Required Function, Mission Profile

Reliability Block Diagram, Redundancy

MTTF, MTBF

Failure, Failure Rate, Failure Intensity, Derating

FMEA, FMECA, FTA

Reliability Growth, Environmental Stress Screening, Burn-in

Maintainability

Preventive Maintenance, MTTPM, MTBUR

Corrective Maintenance, MTTR

Logistic Support

Fault

Defect, Nonconformity

Systematic Failure

Failure

Safety

Quality Management, Total Quality Management (TQM)

Quality Assurance

Configuration Management, Design Review

Quality Test

Quality Control during Production

Quality Data Reporting System

Life Time, Useful Life

Life-Cycle Cost, Value Engineering, Value Analysis

Product Assurance, Product Liability

Figure A1.1 Terms most commonly used in reliability engineering


Availability, Point Availability (A(t), PA(t))

Probability that the item is in a state to perform the required function at a given instant of time.

Instantaneous availability is often used. The use of A(t) should be avoided, to prevent confusion with other kinds of availability (e.g. average availability AA(t), mission availability MA(T_0, t_0), and work-mission availability WMA(T_0, x) in Section 6.2). A qualitative definition, focused on ability, is also possible. The term item stands for a structural unit of arbitrary complexity. Computation generally assumes continuous operation (item down only for repair), renewal at failure (good-as-new after repair), and ideal human factors & logistic support. For an item with more than one element, good-as-new after repair refers in this book to the repaired element in the reliability block diagram. This assumption is valid for the whole item (system) only in the case of constant failure rates for all elements. Assuming renewal for the whole item, the asymptotic & steady-state value of the point availability can be expressed by PA = MTTF / (MTTF + MTTR). PA is also the asymptotic & steady-state value of the average availability AA (often given as availability A).
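As a small numerical illustration of the steady-state expression PA = MTTF / (MTTF + MTTR), the following sketch computes the availability and the corresponding expected downtime per year. The MTTF and MTTR values are hypothetical, and the conversion to downtime assumes continuous operation over 8,760 h per year.

```python
HOURS_PER_YEAR = 8760.0  # continuous operation assumed


def steady_state_availability(mttf_h, mttr_h):
    """Asymptotic & steady-state point availability PA = MTTF / (MTTF + MTTR)."""
    return mttf_h / (mttf_h + mttr_h)


# Hypothetical item: MTTF = 1,000 h, MTTR = 4 h.
pa = steady_state_availability(1000.0, 4.0)
downtime_h = (1.0 - pa) * HOURS_PER_YEAR
print(f"PA = {pa:.5f}, expected downtime = {downtime_h:.1f} h per year")
```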

Burn-in (nonrepairable items)

Type of screening test while the item is in operation.

For electronic devices, stresses during burn-in are often constant higher ambient temperature (e.g. 125°C for ICs) and constant higher supply voltage. Burn-in can be considered as a part of a screening procedure, performed on a 100% basis to provoke early failures and to stabilize the characteristics of the item. Often it can be used as an accelerated reliability test to investigate the item's failure rate.

Burn-in (repairable items)

Process of increasing the reliability of hardware by employing functional operation of every item in a prescribed environment with corrective maintenance during the early failure period.

The term run-in is often used instead of burn-in. The stress conditions have to be chosen as near as possible to those expected in field operation. Flaws detected during burn-in can be deterministic (defects or systematic failures) during the pilot production (reliability growth), but should be attributable only to early failures (randomly distributed) during the series production.

Capability

Ability to meet a service demand of given quantitative characteristics under given internal conditions.

Performance (technical performance) is often used instead of capability.

A1 Terms and Definitions

Concurrent Engineering

Systematic approach to reduce the time to develop, manufacture, and market the item, essentially by integrating production activities into the design & development phase.

Concurrent engineering is achieved through intensive teamwork between all engineers involved in the design, production, and marketing of the item. It has a positive influence on the optimization of life-cycle cost.

Configuration Management

Procedure to specify, describe, audit, and release the configuration of the item, as well as to control it during modifications or changes.

Configuration includes all of the item's functional and physical characteristics as given in the documentation (to specify, build, test, accept, operate, maintain, and logistically support the item) and as present in the hardware and/or software. In practical applications, it is useful to subdivide configuration management into configuration identification, auditing, control (design reviews), and accounting. Configuration management is of particular importance during the design & development phase.

Corrective Maintenance

Maintenance carried out after failure to restore the required function.

Corrective maintenance is also known as repair and can include any or all of the following steps: recognition, isolation (localization & diagnosis), elimination or removal (disassemble, remove, replace, reassemble), and function checkout. Repair is used in this book as a synonym for restoration. To simplify computation it is generally assumed that the repaired element in the reliability block diagram is as-good-as-new after each repair (also including a possible environmental stress screening of the spare parts). This assumption applies to the whole item (equipment or system) if all elements of the item (which have not been renewed) have constant failure rates (see failure rate for further comments).

Cost Effectiveness

Measure of the ability of the item to meet a service demand of stated quantitative characteristics, with the best possible usefulness to life-cycle cost ratio.

System effectiveness is often used instead of cost effectiveness.


Defect

Nonfulfillment of a requirement related to an intended or specified use.

From a technical point of view, a defect is similar to a nonconformity, however not necessarily from a legal point of view (in relation to product liability, nonconformity should be preferred). Defects do not need to influence the item's functionality. They are caused by flaws (errors, mistakes) during design, development, production, or installation. The term defect should be preferred to that of error, which is a cause. Unlike failures, which always appear in time (randomly distributed), defects are present at t = 0. However, some defects can only be recognized when the item is operating and are referred to as dynamic defects (e.g. in software). Similar to defects, with regard to causes, are systematic failures (e.g. cooling problem); however, they are often not present at t = 0.

Dependability

Collective term used to describe the availability performance and its influencing factors (reliability, maintainability, and logistic support).

Dependability is used generally in a qualitative sense, often defined as ability to provide the required function when demanded.

Derating

Designed reduction of stress from the rated value to enhance reliability.

The stress factor S expresses the ratio of actual to rated stress under normal operating conditions (generally at 25°C ambient temperature). Designed is used as a synonym for deliberate.

Design Review

Independent examination of the design to identify shortcomings that could affect the fitness for purpose, reliability, maintainability or maintenance support requirements of the item.

Design reviews are an important tool for quality assurance and TQM during the design and development of hardware and software (Tables A3.3, 5.3, 5.5, 2.8, 4.3, Appendix A4). An important objective of design reviews is to decide about continuing or stopping the project considered, on the basis of objective considerations and a feasibility check (Tables A3.3 and 5.3, Fig. 1.6).

Environmental Stress Screening (ESS)

Test or set of tests intended to remove defective items, or those likely to exhibit early failures.

ESS is a screening procedure often performed at assembly (PCB) or equipment level on a 100% basis to find defects and systematic failures during the pilot production (reliability growth), or to provoke early failures in a series production. For electronic items, it consists generally of temperature cycles and/or random vibrations. Stresses are in general higher than in field operation, but not so high as to stimulate new failure mechanisms. Experience shows that to be cost effective, ESS has to be tailored to the item and production processes. At component level, the term screening is often used.

Failure

Termination of the ability to perform the required function.

Failures should be considered (classified) with respect to the mode, cause, effect, and mechanism. The cause of a failure can be intrinsic (early failure, failure with constant failure rate, wearout) or extrinsic (systematic failures, i.e. failures resulting from errors or mistakes in design, production, or operation, which are deterministic and have to be considered as defects). The effect (consequence) of a failure is often different if considered on the directly affected item or on a higher level. A failure is an event appearing in time (randomly distributed), in contrast to a fault which is a state.

Failure Intensity (z(t))

Limit, if it exists, of the mean number of failures of a repairable item within time interval (t, t + δt], to δt when δt → 0.

At system level, z_S(t) is used. Failure intensity applies for repairable items, in particular when repair times are neglected and failure occurrence is considered on the time axis (arrival times). It has been investigated for Poisson processes (homogeneous (z(t) = λ) & nonhomogeneous (z(t) = m(t))) and renewal processes (z(t) = h(t)) (Appendices A7.2, A7.8). For practical applications it holds that z(t)δt = Pr{ν(t + δt) - ν(t) = 1}, with ν(t) = number of failures in (0, t] (Eq. (A7.229)). See also failure rate.

Failure Modes and Effects Analysis (FMEA)

Qualitative method of analysis that involves the study of possible failure modes and faults in subitems, and their effects on the ability of the item to provide the required function.

See FMECA for comments.

Failure Modes, Effects, and Criticality Analysis (FMECA)

Quantitative or qualitative method of analysis that involves failure modes and effects analysis together with a consideration of the probability of the failure mode occurrence and the severity of the effects.

The goal of an FMEA or FMECA is to identify all potential hazards and to analyze the possibilities of reducing their effect and/or occurrence probability. All possible failure modes and faults with the corresponding causes have to be considered bottom-up from lowest to highest integration level of the item considered. Often one distinguishes between design and production (process) FMEA or FMECA. FMECA can be used for fault modes, effects, and criticality analysis (same for FMEA).

Failure Rate (λ(t))

Limit, if it exists, of the conditional probability that the failure occurs within time interval (t, t + δt], to δt when δt ↓ 0, given that the item was new at t = 0 and did not fail in the interval (0, t].

At system level, λ_S(t) is used. The failure rate applies in particular for nonrepairable items. In this case, if τ is the item failure-free time, with distribution function F(t) = Pr{τ ≤ t}, with F(0) = 0 and density f(t), the failure rate λ(t) follows as (Eq. (A6.25), R(t) = 1 - F(t))

$$\lambda(t) = \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \Pr\{t < \tau \le t + \delta t \mid \tau > t\} = \frac{f(t)}{1 - F(t)} = -\frac{dR(t)/dt}{R(t)}. \tag{A1.1}$$

Considering R(0) = 1, Eq. (A1.1) yields $R(t) = e^{-\int_0^t \lambda(x)\,dx}$ and thus $R(t) = e^{-\lambda t}$ for λ(t) = λ. This important result characterizes the memoryless property of the exponential distribution $F(t) = 1 - e^{-\lambda t}$, expressed by Eq. (A1.1) for λ(t) = λ. Only for λ(t) = λ can one estimate the failure rate λ by $\hat{\lambda} = k/T$, where T is the given (fixed) cumulative operating time and k > 0 the total number of failures during T (Eq. (7.28)). Figure 1.2 shows a typical shape of λ(t). However, considering Eq. (A1.1), the failure rate can be defined also for repairable items which are as-good-as-new after repair (restoration), taking instead of t the variable x starting by x = 0 at each repair (as for interarrival times). This is important when investigating repairable systems (Chapter 6), e.g. with constant failure & repair rates. If a repairable system cannot be restored to be as-good-as-new after repair (with respect to the state considered), i.e. if at least one element with time dependent failure rate has not been renewed at every repair, failure intensity z(t) has to be used. It is thus important to distinguish between failure rate λ(t) and failure intensity z(t) or intensity (h(t) or m(t) for a renewal or Poisson process). z(t), h(t), m(t) are unconditional densities (Eqs. (A7.229), (A7.24), (A7.194)) and differ basically from λ(t), which is a conditional density. This distinction is important also for the case of a homogeneous Poisson process, for which z(t) = h(t) = m(t) = λ holds for the intensity and λ(x) = λ holds for the interarrival times (x starting by 0 at each interarrival time, Eq. (A7.38)). To reduce ambiguities, force of mortality has been suggested for λ(t) [6.3, A7.30].
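The relation between failure rate and reliability function, and the estimate of a constant failure rate from k failures in a cumulative operating time T, can be illustrated numerically. The following is a minimal sketch, not a method given in this book: the function names, the trapezoidal integration, and all numerical values are assumptions chosen for illustration.

```python
import math


def reliability(failure_rate, t, steps=10_000):
    """R(t) = exp(-integral_0^t lambda(x) dx), integral by the trapezoidal rule."""
    if t == 0:
        return 1.0
    dx = t / steps
    area = 0.5 * (failure_rate(0.0) + failure_rate(t))
    area += sum(failure_rate(i * dx) for i in range(1, steps))
    return math.exp(-area * dx)


def estimate_constant_failure_rate(k, cumulative_time):
    """Estimate lambda by k / T; meaningful only for a constant failure rate."""
    if k <= 0 or cumulative_time <= 0:
        raise ValueError("k and T must be positive")
    return k / cumulative_time


lam = 1e-4  # hypothetical constant failure rate in 1/h
r_const = reliability(lambda x: lam, 1000.0)      # agrees with exp(-lam * t)
r_wear = reliability(lambda x: 2e-7 * x, 1000.0)  # hypothetical increasing rate
lam_hat = estimate_constant_failure_rate(k=5, cumulative_time=50_000.0)
```

For the constant rate the numerical result coincides with the closed form exp(-λt); the increasing rate is included only to show that the same integral applies when λ(t) depends on time, in which case the estimate k/T no longer applies.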

Fault

State characterized by an inability to perform the required function due to an internal reason.

A fault is a state and can be a defect or a failure, having thus as possible cause an error (for defects or systematic failures) or a failure mechanism (for failures).

Fault Tree Analysis (FTA)

Analysis utilizing fault trees to determine which faults of subitems, or external events, or combination thereof, may result in item faults.

FTA is a top-down approach, which allows the inclusion of external causes more easily than an FMEA/FMECA. However, it does not necessarily go through all possible fault modes. Combination of FMEA/FMECA with FTA leads to a causes-to-effects chart, showing the logical relationship between identified causes and their single or multiple consequences. A graphical description of cause-to-effect relationships is the cause-to-effect diagram (fishbone or Ishikawa diagram).


Item

Part, component, device, functional unit, subsystem or system that can be individually described and considered.

An item is a functional or structural unit, generally considered as an entity for investigations. It can consist of hardware and/or software and include human resources.

Life Cycle Cost (LCC)

Sum of the cost for acquisition, operation, maintenance, and disposal or recycling of the item.

Life-cycle costs also have to consider the effects on the environment of the production, use, and disposal or recycling of the item considered (sustainable development). Their optimization uses cost effectiveness or systems engineering tools and can be positively influenced by concurrent engineering.

Lifetime

Time span between initial operation and failure of a nonrepairable item.

Logistic Support

All activities undertaken to provide effective and economical use of the item during its operating phase.

An emerging aspect related to logistic support is that of obsolescence management, i.e. how to assure operation over e.g. 20 years when components needed for maintenance are no longer manufactured.

Maintainability

Probability that a given maintenance action, performed under stated conditions and using stated procedures and resources, can be carried out within a stated time interval.

Maintainability is a characteristic of the item and refers to preventive and corrective maintenance. A qualitative definition, focused on ability, is also possible. In specifying or evaluating maintainability, it is important to consider the logistic support available (procedures, personnel, spare parts, etc.).

Mission Profile

Specific task which must be fulfilled by the item during a stated time under given conditions.

The mission profile defines the required function and the environmental conditions as a function of time. A system with a variable required function is termed a phased-mission system.


MTBF

Mean operating time between failures.

At system level, MTBF_S is used. MTBF applies for repairable items. However, for practical applications it is important to recognize that successive operating times between system failures have the same mean (expected value) only if they are independent and have a common distribution function, i.e. if the system is as-good-as-new after each repair at system level. If only the failed element is restored to as-good-as-new after repair and at least one nonrestored element has a time dependent failure rate, successive operating times between system failures are neither independent nor have a common distribution. Only the case of a series-system with constant failure rates λ_1, ..., λ_n for all elements E_1, ..., E_n leads to a homogeneous Poisson process, for which successive interarrival times (operating times between system failures) are independent and exponentially distributed with common distribution function $F(x) = 1 - e^{-x(\lambda_1 + \dots + \lambda_n)} = 1 - e^{-x \lambda_S}$ and mean $MTBF_S = 1/\lambda_S$ (repaired elements are assumed as-good-as-new, yielding system as-good-as-new because of the constant failure rates λ_1, ..., λ_n). This result holds approximately also for systems with redundancy (see Eq. (6.93) and comments with MTTF). For all these reasons, and also because of the estimate $\widehat{MTBF} = T/k$, often used in practical applications, MTBF should be confined to repairable systems with constant failure rates for all elements. Shortcomings because of neglecting this basic property are known, see e.g. [6.3, 7.11, A7.30]. As in the previous editions of this book, MTBF_S will be reserved for the case $MTBF_S = 1/\lambda_S$.

For Markov and semi-Markov models, MUT_S is used.
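For the series system with constant failure rates discussed above, the system failure rate is the sum of the element failure rates and MTBF_S is its reciprocal. A minimal sketch under exactly those assumptions follows; the function names and the three element failure rates are hypothetical.

```python
import math


def series_failure_rate(rates):
    """lambda_S = lambda_1 + ... + lambda_n (series system, constant rates)."""
    if any(r < 0 for r in rates):
        raise ValueError("failure rates must be non-negative")
    return sum(rates)


def mtbf_series(rates):
    """MTBF_S = 1 / lambda_S; valid only for constant failure rates."""
    return 1.0 / series_failure_rate(rates)


def interarrival_cdf(x, rates):
    """F(x) = 1 - exp(-x * lambda_S) for the time between system failures."""
    return 1.0 - math.exp(-x * series_failure_rate(rates))


element_rates = [2e-5, 5e-5, 3e-5]  # hypothetical element failure rates in 1/h
lambda_s = series_failure_rate(element_rates)
mtbf_s = mtbf_series(element_rates)
```

With time dependent failure rates or partial renewal, these functions no longer describe the system, which is the restriction the text places on the use of MTBF.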

MTTF

Mean time to failure.

At system level, MTTF_S is used. MTTF is the mean (expected value) of the item failure-free time τ. It can be computed from the reliability function R(t) as $MTTF = \int_0^\infty R(t)\,dt$, with T_L as the upper limit of the integral if the lifetime is limited to T_L (R(t) = 0 for t > T_L). MTTF applies for both nonrepairable and repairable items if one assumes that after repair the item is as-good-as-new (p. 40). At system level, this occurs (with respect to the state considered) only if the repaired element is as-good-as-new and all nonrepaired elements have constant failure rates. To include for this case all situations, MTTF_Si is used in Chapter 6 (S stands for system and i for the state occupied (entered for a semi-Markov process) at the time at which the repair (restoration) is terminated, see e.g. Table 6.2). When dealing with failure-free and repair times, the variable x starting by x = 0 after each repair (restoration) has to be used instead of t (as for interarrival times). See p. 40 for further comments. An unbiased, empirical estimate for MTTF is $\widehat{MTTF} = (t_1 + \dots + t_n)/n$, where t_1, ..., t_n are observed failure-free times of n statistically identical and independent items.
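Both ways of obtaining MTTF mentioned above, integrating R(t) and averaging observed failure-free times, can be sketched as follows. The function names, the truncation of the integral at a finite upper limit, the trapezoidal rule, and the sample data are assumptions made for illustration only.

```python
import math


def mttf_from_reliability(reliability, upper_limit, steps=100_000):
    """MTTF = integral_0^upper_limit R(t) dt, evaluated by the trapezoidal rule.

    upper_limit is T_L for a limited lifetime, or a truncation point beyond
    which R(t) is negligible.
    """
    dt = upper_limit / steps
    area = 0.5 * (reliability(0.0) + reliability(upper_limit))
    area += sum(reliability(i * dt) for i in range(1, steps))
    return area * dt


def mttf_estimate(failure_free_times):
    """Empirical estimate (t_1 + ... + t_n) / n from observed failure-free times."""
    if not failure_free_times:
        raise ValueError("at least one observation is required")
    return sum(failure_free_times) / len(failure_free_times)


lam = 1e-3  # hypothetical constant failure rate in 1/h, so MTTF = 1 / lam
mttf_numeric = mttf_from_reliability(lambda t: math.exp(-lam * t), 20_000.0)
mttf_empirical = mttf_estimate([900.0, 1100.0, 1000.0])  # hypothetical data
```

For the exponential case the numerical integral reproduces 1/λ up to truncation and discretization error, which gives a simple check of the integration before it is applied to other reliability functions.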

MTTPM

Mean time to preventive maintenance.

See MTTR for comments.

MTBUR

Mean time between unscheduled removals.

MTTR

Mean time to repair.

At system level, MTTR_S is used. Repair is used in this book as a synonym for restoration. MTTR is the mean (expected value) of the item repair time. It can be computed from the distribution function G(t) of the repair time as $MTTR = \int_0^\infty (1 - G(t))\,dt$. In specifying or evaluating MTTR, it is necessary to consider the logistic support available for repair (procedures, personnel, spare parts, test facilities). Repair time is often lognormally distributed. However, for reliability or availability computation of repairable equipment and systems, a constant repair rate μ (i.e. exponentially distributed repair times with μ = 1/MTTR) can be used in general to get valid approximate results, as long as MTTR << MTTF holds for each element in the reliability block diagram (Examples 6.7, 6.8, 6.9). An unbiased, empirical estimate of MTTR is $\widehat{MTTR} = (t_1 + \dots + t_n)/n$, where t_1, ..., t_n are observed repair times of n statistically identical and independent items.
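As a short worked check of the constant repair rate case, assume exponentially distributed repair times with distribution function G(t) = 1 - e^{-μt}. The integral expression for MTTR then gives

$$MTTR = \int_0^\infty \bigl(1 - G(t)\bigr)\,dt = \int_0^\infty e^{-\mu t}\,dt = \frac{1}{\mu},$$

so that μ = 1/MTTR. For a lognormally distributed repair time the same integral applies, but the result is the lognormal mean and no single rate μ describes the distribution exactly; the constant rate is then only the approximation discussed in the text.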

Nonconformity

Nonfulfillment of a specified requirement.

From a technical point of view, nonconformity is close to defect, however not necessarily from a legal point of view. In relation to product liability, nonconformity should be preferred.

Preventive Maintenance

Maintenance carried out to reduce the probability of failure or degradation.

The aim of preventive maintenance must also be to detect and remove hidden failures, i.e. nonrecognized failures in redundant elements. To simplify computation it is generally assumed that the element in the reliability block diagram for which a preventive maintenance has been performed is as-good-as-new after each preventive maintenance. This assumption applies to the whole item (equipment or system) if all components of the item (which have not been renewed) have constant failure rates. Preventive maintenance is generally performed at scheduled time intervals.

Product Assurance

All planned and systematic activities necessary to reach specified targets for the reliability, maintainability, availability, and safety of the item, as well as to provide adequate confidence that the item will meet all given requirements.

The concept of product assurance is used in particular in aerospace programs. It includes quality assurance as well as reliability, maintainability, availability, safety, and logistic support engineering.


Product Liability

Generic term used to describe the onus on a producer or others to make restitution for loss related to personal injury, property damage, or other harm caused by the product.

The manufacturer (producer) has to specify a safe operational mode for the product (item). If strict liability applies, the manufacturer has to demonstrate (at a claim) that the product was free from defects when it left the production plant. This holds in the USA and partially also in Europe [1.8]. However, in Europe the causality between damage and defect has still to be demonstrated by the user and the limitation period is short (often 3 years after the identification of the damage, defect, and manufacturer, or 10 years after the appearance of the product on the market). One can expect that liability will more than before consider faults (defects & failures) and cover software as well. Product liability forces producers to place greater emphasis on quality assurance/management.

Quality

Degree to which a set of inherent characteristics fulfills requirements.

This definition, given also in the ISO 9000:2000 Standard [A1.6, A2.9], follows closely the traditional definition of quality (fitness for use) and applies to products and services as well.

Quality Assurance

All the planned and systematic activities needed to provide adequate confidence that quality requirements will be fulfilled.

Quality assurance is a part of quality management, as per ISO 9000:2000. It refers to hardware and software as well, and includes configuration management, quality tests, quality control during production, quality data reporting systems, and software quality (Fig. 1.3). For complex equipment and systems, quality assurance activities are coordinated by a quality assurance program (Appendix A3). An important target for quality assurance is to achieve the quality requirements with a minimum of cost and time. Concurrent engineering also strives to shorten the time to develop and market the product.

Quality Control During Production

Control of the production processes and procedures to reach a stated quality of manufacturing.

Quality Data Reporting System

System to collect, analyze, and correct all defects and failures occurring during production and testing of the item, as well as to evaluate and feedback the corresponding quality and reliability data.

A quality data reporting system is generally computer aided. Analysis of defects and failures must be traced to the cause in order to determine the best corrective action necessary to avoid repetition of the same problem. The quality data reporting system should also remain active during the operating phase. A quality data reporting system is important to monitor reliability growth.

Quality Management

Coordinated activities to direct and control an organization with regard to quality.

Organization is defined as a group of people and facilities (e.g. a company) with an arrangement of responsibilities, authorities, and relationships [A1.6].

Quality Test

Test to verify whether the item conforms to specified requirements.

Quality tests include incoming inspections, qualification tests, production tests, and acceptance tests. They also cover reliability, maintainability, and safety aspects. To be cost effective, quality tests must be coordinated and integrated in a test (and screening) strategy. The terms test and inspection are often used for quality test.

Redundancy

Existence of more than one means for performing the required function.

For hardware, distinction is made between active (hot, parallel), warm (lightly loaded), and standby (cold) redundancy. Redundancy does not necessarily imply a duplication of hardware; it can for instance be implemented at the software level or as a time redundancy. To avoid common mode failures, redundant elements should be realized independently from each other. Should the redundant elements fulfill only a part of the required function, a pseudo redundancy is present.

Reliability (R, R(t))

Probability that the required function will be provided under given conditions for a given time interval.

According to the above definition, reliability is a characteristic of the item, generally designated by R for the case of a fixed mission and R(t) for a mission with t as a parameter. At system level R_Si(t) is used, where S stands for system and i for the state entered at t = 0 (Table 6.2). A qualitative definition, focused on ability, is also possible. Reliability gives the probability that no operational interruption at item (system) level will occur during a stated mission, say of duration T. This does not mean that redundant parts may not fail; such parts can fail and be repaired. Thus, the concept of reliability applies for nonrepairable as well as for repairable items. Should T be considered as a variable t, the reliability function is given by R(t). If τ is the failure-free time, distributed according to F(t), with F(0) = 0, then R(t) = Pr{τ > t} = 1 - F(t). The concept of reliability can also be used for processes or services, although modeling human aspects can lead to some difficulties.


Reliability Block Diagram

Block diagram showing how failures of subitems, represented by the blocks, can result in a failure of the item.

The reliability block diagram (RBD) is an event diagram. It answers the question: Which elements of the item are necessary to fulfill the required function and which ones can fail without affecting it? The elements (blocks in the RBD) which must operate are connected in series (the ordering of these elements is not relevant for reliability computation) and the elements which can fail (redundant elements) are connected in parallel. Elements which are not relevant (used) for the required function are removed from the RBD and put into a reference list, after having verified (FMEA) that their failure does not affect elements involved in the required function. In a reliability block diagram, redundant elements still appear in parallel, irrespective of the failure mode. However, only one failure mode (e.g. short, open) and two states (good, failed) can be considered for each element.
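The series and parallel connections of an RBD translate into simple product rules when two further assumptions are made that are not part of the definition above: the elements fail independently, and the parallel structure is an active 1-out-of-n redundancy without repair. A minimal sketch under these assumptions, with hypothetical function names and element reliabilities:

```python
from math import prod


def series_reliability(element_reliabilities):
    """All elements must operate: R = R_1 * ... * R_n (independent elements)."""
    return prod(element_reliabilities)


def parallel_reliability(element_reliabilities):
    """At least one element must operate: R = 1 - (1 - R_1) * ... * (1 - R_n)."""
    return 1.0 - prod(1.0 - r for r in element_reliabilities)


# Hypothetical RBD: element E1 in series with the redundant pair E2, E3.
r_redundant_pair = parallel_reliability([0.9, 0.9])
r_system = series_reliability([0.99, r_redundant_pair])
```

The order of the factors in the series product does not matter, which mirrors the remark that the ordering of series elements is not relevant for reliability computation.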

Reliability Growth

Progressive improvement of a reliability measure with time.

Flaws (errors, mistakes) detected during a reliability growth program are in general deterministic (defects or systematic failures) and present in every item of a given lot. Reliability growth is thus often performed during the pilot production, seldom for series-produced items. Similarly to environmental stress screening (ESS), stresses during reliability growth often exceed those expected in field operation, but are not so high as to stimulate new failure mechanisms. Models for reliability growth can also often be used to investigate the occurrence of defects in software. Even if software defects often appear in time (dynamic defects), the term software reliability should be avoided (software quality should be preferred).

Required Function

Function or combination of functions of an item which is considered necessary to provide a given service.

The definition of the required function is the starting point for every reliability analysis, as it defines failures. However, difficulties can appear with complex items (systems). For practical purposes, parameters should be specified with tolerances.

Safety

Ability of the item to cause neither injury to persons, nor significant material damage or other unacceptable consequences.

Safety expresses freedom from unacceptable risk of harm. In practical applications, it is useful to subdivide safety into accident prevention (the item is safe working while it is operating correctly) and technical safety (the item has to remain safe even if a failure occurs). Technical safety can be defined as the probability that the item will not cause injury to persons, significant material damage or other unacceptable consequences above a given (fixed) level for a stated time interval, when operating under given conditions. Methods and procedures used to investigate technical safety are similar to those used for reliability analyses, however with emphasis on fault/failure effects.

System

Set of interrelated items considered as a whole for a defined purpose.

A system generally includes hardware, software, services, and personnel (for operation and support) to the degree that it can be considered self-sufficient in its intended operational environment. For computations, ideal conditions for human factors and logistic support are often assumed, leading to a technical system (for simplicity, the term system is often used instead of technical system). Elements of a system are e.g. components, assemblies, equipment, and subsystems, for hardware. For maintenance purposes, systems are partitioned into independent line replaceable units (LRUs), i.e. spare parts at equipment or system level. The term item is used for a functional or structural unit of arbitrary complexity that is in general considered as an entity for investigations.

Systematic Failure

Failure related in a deterministic way to a certain cause inherent in the design, manufacturing, operation or maintenance processes.

Systematic failures are also known as dynamic defects, for instance in software quality, and have a deterministic character. However, because of the item complexity they can appear as if they were randomly distributed in time.

Systems Engineering

Application of the mathematical and physical sciences to develop systems that utilize resources economically for the benefit of society.

TQM and concurrent engineering can help to optimize systems engineering.

Total Quality Management (TQM)

Management approach of an organization centered on quality, based on the participation of all its members, and aiming at long-term success through customer satisfaction, and benefits to all members of the organization and to society.

Within TQM, everyone involved in the product (directly during development, production, installation, and servicing, or indirectly with management or staff activity) is jointly responsible for the quality of that product.

Useful Life

Time interval starting when the item is first put into operation and ending when a limiting state is reached.

The limiting state can be an unacceptable failure intensity or other. Typical values for useful life are 3 to 6 years for commercial applications, 5 to 15 years for military installations, and 10 to 30 years for distribution or power systems (see also Lifetime).

Value Analysis

Optimization of the configuration of the item as well as of the production processes and procedures to provide the required item characteristics at the lowest possible cost without loss of capability, reliability, maintainability, or safety.

Value Engineering

Application of value analysis methods during the design phase to optimize the life-cycle cost of the item.

A2 Quality and Reliability Standards

Besides quantitative reliability requirements, such as MTBF = 1/λ, MTTR, and availability, customers often require a quality assurance/management system and for complex items also the realization of a quality and reliability assurance program. Such general requirements are covered by national and international standards, the most important of which are briefly discussed in this appendix. The term management is used explicitly where the organization (company) is involved as a whole, as per ISO 9000:2000 and TQM. A basic procedure for setting up and realizing quality and reliability requirements for complex equipment and systems, with the corresponding quality and reliability assurance program, is discussed in Appendix A3.

A2.1 Introduction

Customer requirements for quality and reliability can be quantitative or qualitative. As with performance parameters, quantitative reliability requirements are given in system specifications or contracts. They fix targets for reliability, maintainability, availability, and safety (as necessary) along with associated specifications for required function, operating conditions, logistic support, and criteria for acceptance tests. Qualitative requirements are in national or international standards and generally deal with a quality management system. Depending upon the field of application (aerospace, defense, nuclear, or industrial), these requirements may be more or less stringent. Objectives of such standards are in particular:

1. Harmonization of quality management systems and of terms & definitions.

2. Enhancement of customer satisfaction.

3. Standardization of configuration, operating conditions, logistic support, test procedures, and selection/qualification criteria for components, materials, and production processes.

Important standards for quality management systems are given in Table A2.1, see [A2.1 - A2.13] for a comprehensive list. Some of the standards in Table A2.1 are briefly discussed in the following sections.

A2.2 General Requirements in the Industrial Field

In the industrial field, the ISO 9000:2000 family of standards [A2.9] supersedes the ISO 9000:1994 family and opens a new era in quality management requirements. The previous 9001 - 9004 are replaced by 9001:2000 and 9004:2000. ISO 8402, on definitions, is replaced by ISO 9000:2000. Many definitions have been revised and the structure and content of 9001:2000 and 9004:2000 are new, and adhere better to the industrial needs and to the concept depicted in Fig. 1.3. Eight basic quality management principles have been identified and considered in the ISO 9000:2000 family: Customer Focus, Leadership, Involvement of People, Process Approach, System Approach to Management, Continuous Improvement, Factual Approach to Decision Making, and Mutually Beneficial Supplier Relationships.

ISO 9000:2000 describes fundamentals of quality management systems and specifies the terminology involved.

ISO 9001:2000 specifies requirements for a quality management system that an organization (company) needs to demonstrate its ability to provide products that satisfy customer and applicable regulatory requirements. It focuses on four main chapters: Management Responsibility, Resource Management, Product and/or Service Realization, and Measurement. A quality management system must ensure that everyone involved with a product (whether in its development, production, installation, or servicing, as well as in a management or staff function) shares responsibility for the quality of that product, in accordance with TQM. At the same time, the system must be cost effective and contribute to a reduction of the time to market. Thus, bureaucracy must be avoided and such a system must cover all aspects related to quality, reliability, maintainability, availability, and safety, including management, organization, planning, and engineering activities. Customers expect today that only items with agreed requirements will be delivered.

ISO 9004:2000 provides guidelines that consider efficiency and effectiveness of the quality management system.

The ISO 9000:2000 family deals with a broad class of products and services (technical and non-technical); its content is thus lacking in details, compared with application specific standards used e.g. in railway, aerospace, defense, and nuclear industries (Appendix A2.3). It has been accepted as national standards in many countries, and international recognition of certification has been partly achieved.

Dependability aspects, focusing on reliability, maintainability, and logistic support of systems, are considered in IEC standards, in particular IEC 60300 for global requirements and IEC 60605, 60706, 60812, 60863, 61025, 61078, 61124, 61163, 61164, 61165, 61508, and 61709 for specific procedures, see [A2.6] for a comprehensive list. IEC 60300 deals with dependability programs (management, task descriptions, application guides). Reliability tests for constant failure rate $\lambda$ (or of MTBF for the case MTBF = $1/\lambda$) are considered in IEC 61124. Maintainability aspects are in IEC 60706 and safety aspects in IEC 61508.


Table A2.1 Standards for quality and reliability assurance/management of equipment and systems

Year | Origin | Standard | Title

Industrial
2000 | Int. | ISO 9000:2000 | Quality management systems - Fundamentals and vocabulary
 | | ISO 9001:2000 | Quality management systems - Requirements
 | | ISO 9004:2000 | Quality management systems - Guidelines for performance improvement
 | | IEC 60300 | Dependability management (-1: Program management, -2: Program element tasks, -3: Application guides)
1986-06 | Int. | IEC 60605 | Equipment reliability testing (-2: Test cycles, -3: Test conditions, -4: Point and interval estimates, -6: Test for constant failure rate)
1994-06 | Int. | IEC 60706 | Guide on maintainability of equipment (-1: Maint. program, -2: Analysis, -3: Data evaluation, -4: Support planning, -5: Diagnostic, -6: Statistical methods)
2006 | Int. | IEC 61124 | Reliability testing - Compliance tests for constant failure rate and constant failure intensity (supersedes IEC 60605-7)
 | | IEC | 60068, 60319, 60410, 60447, 60721, 60749, 60812, 60863, 61000, 61014, 61025, 61070, 61078, 61123, 61160, 61163, 61164, 61165, 61508, 61649, 61650, 61703, 61709, 61710, 61882, 62198
1998 | Int. | IEEE Std 1332 | IEEE Standard Reliability Program for the Development and Production of Electronic Systems and Equipment (see also 1413)
 | | EN 50126 | Railway Applications - RAMS Specification & Demonstration
 | | | Product Liability

Software Quality
1987-98 | Int. | IEEE/ANSI | IEEE Software Eng. Standards Vol. 1 - 4, 1999 (in particular 610, 730, 1028, 1045, 1062, 1465 (ISO/IEC 12119))
 | | IEC, ISO/IEC | IEC 61713 (2000) and ISO/IEC 12119 (1998), 12207 (1995)

Defense
1963 | USA | MIL-Q-9858 | Quality Program Requirements (ed. A)
1980 | USA | MIL-STD-785 | Rel. Program for Systems and Eq. Devel. and Prod. (ed. B)
1986 | USA | MIL-STD-781 | Rel. Testing for Eng. Devel., Qualif. and Prod. (ed. D)
1983 | USA | MIL-STD-470 | Maintainability Program for Systems and Equip. (ed. A)
1984 | NATO | AQAP-1 | NATO Req. for an Industrial Quality Control System (ed. 3)

Aerospace
1974 | USA | NHB-5300.4 (NASA) | Safety, Reliability, Maintainability, and Quality Provisions for the Space Shuttle Program (1D-1)
1996 | Europe | ECSS (ESA) | European Cooperation for Space Standardization
 | | ECSS-E | Engineering (-00, -10)
 | | ECSS-M | Project Management (-00, -10, -20, -30, -40, -50, -60, -70)
 | | ECSS-Q | Product Assurance (-00, -20, -30, -40, -60, -70, -80)
2003 | Europe | prEN 9100-2003 | Quality Management System


For electronic equipment and systems, IEEE Std 1332-1998 [A2.7] has been issued as a guide to a reliability program for the development and production phases. This document gives in a short form the basic requirements, putting an accent on an active cooperation between supplier (manufacturer) and customer, and focusing on three main aspects: Determination of the Customer's Requirements, Determination of a Process that satisfies the Customer's Requirements, and Assurance that the Customer's Requirements are met. Examples of comprehensive requirements for industrial applications are given e.g. in [A2.2, A2.3]. Software aspects are considered in IEEE Software Engineering Standards [A2.8]. Requirements for product liability are given in national and international directives, see for instance [1.8].

A2.3 Requirements in the Aerospace, Railway, Defense, and Nuclear Fields

Requirements in the space and railway fields generally combine the aspects of quality, reliability, maintainability, safety, and software quality in a Product Assurance or RAMS document, well conceived in its structure & content [A2.3 - A2.5, A2.12]. In the railway field, EN 50126 [A2.3] requires a RAMS program with particular emphasis on safety aspects. The situation is similar in the avionics field, where EN 9100-2003 [A2.4] has been issued, reinforcing the requirements of the ISO 9000 family. It can be expected that space and avionics will unify standards in an Aerospace Series.

MIL-Standards have played an important role in the last 30 years, in particular MIL-Q-9858 and MIL-STD-470, -471, -781, -785 & -882 [A2.10]. MIL-Q-9858 (first Ed. 1959) was the basis for many quality assurance standards. However, as it does not cover specific aspects of reliability, maintainability, and safety, MIL-STD-785, -470, and -882 were issued. MIL-STD-785 requires the realization of a reliability program; tasks are carefully described and the program has to be tailored to satisfy user needs. MTBF = $1/\lambda$ acceptance procedures are in MIL-STD-781. MIL-STD-470 requires the realization of a maintainability program, with emphasis on design rules, design reviews, and FMEA/FMECA. Maintainability demonstration is covered by MIL-STD-471. MIL-STD-882 requires the realization of a safety program, in particular the analysis of all potential hazards. For NATO countries, AQAP Requirements were issued starting 1968. MIL Standards have since declined in importance. However, they can still be useful in developing procedures for industrial applications.

The nuclear field has its own specific, well established standards with emphasis on safety aspects, design reviews, configuration accounting, qualification of components/materials/production processes, quality control during production, and tests.

A3 Definition and Realization of Quality and Reliability Requirements

In defining quality and reliability requirements, it is important that market needs, life-cycle cost aspects, time to market, as well as development and production risks (for instance when using new technologies) are considered with care. For complex equipment and systems with high quality & reliability requirements, the realization of such requirements is best achieved with a quality and reliability assurance program, integrated in the project activities and performed without bureaucracy. Such a program (plan if time schedule is considered) defines the project specific activities for quality and reliability assurance and assigns responsibilities for their realization in agreement with TQM. This appendix first discusses important aspects in defining quality & reliability requirements and then the content of a quality and reliability assurance program for complex equipment and systems with high quality and reliability requirements for the case in which tailoring is not mandatory. For less stringent requirements, tailoring is necessary to meet real needs and to be cost and time effective. Software specific quality assurance aspects are considered in Section 5.3. Examples of checklists for design reviews are in Appendix A4, requirements for a quality data reporting system in Appendix A5.

A3.1 Definition of Quality and Reliability Requirements

In defining quantitative, project specific quality and reliability requirements, attention has to be paid to the actual possibility to realize them as well as to demonstrate them at a final or acceptance test. These requirements are derived from customer or market needs, taking care of limitations given by technical, cost, and ecological aspects. This section deals with some important considerations in setting MTBF, MTTR, and steady-state availability (PA = AA) requirements. MTBF is used for MTBF = $1/\lambda$, where $\lambda$ is the constant (time independent) failure rate of the item considered. Tentative targets for MTBF, MTTR, PA are set by considering

• operational requirements relating to reliability, maintainability, and availability,

• allowed logistic support,

• required function and expected environmental conditions,

• experience with similar equipment or systems,

• possibility for redundancy at higher integration level,

• requirements for life-cycle cost, dimensions, weight, power consumption, etc.,

• ecological consequences (sustainability).

Typical figures for failure rates $\lambda$ of electronic assemblies are between 100 and $1{,}000 \cdot 10^{-9}\,\mathrm{h}^{-1}$ at ambient temperature $\theta_A$ of 40°C and with a duty cycle d of 0.3, see Table A3.1 for some examples. The duty cycle ($0 < d \le 1$) gives the mean of the ratio between operational time and calendar time for the item considered. Assuming a constant failure rate $\lambda$ and no reliability degradation caused by power on/off, an equivalent failure rate

$$\lambda_d = d \cdot \lambda$$

can be used for practical purposes. Often it can be useful to operate with the mean expected number of failures per year and 100 items

$$m_{\%/y} \approx \lambda_d \cdot 10^{6}\,\mathrm{h}.$$
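The duty-cycle relation can be checked numerically. The following is a minimal Python sketch, not a definitive implementation: it assumes that the equivalent failure rate is the product of duty cycle and failure rate, and that the number of failures per year and 100 items is the equivalent failure rate times the calendar hours per year times 100. The function names and the value of 8,760 h per calendar year are assumptions introduced here; rounding 8,760 h · 100 up to 10^6 h gives the order of magnitude of the indicative values in Table A3.1.

```python
HOURS_PER_YEAR = 8_760  # assumption: calendar hours in a non-leap year


def equivalent_failure_rate(lam: float, d: float) -> float:
    """Equivalent failure rate d * lam (in 1/h) for a duty cycle 0 < d <= 1.

    Assumes a constant failure rate lam and no degradation from power on/off.
    """
    if not 0 < d <= 1:
        raise ValueError("duty cycle d must satisfy 0 < d <= 1")
    return d * lam


def failures_per_year_per_100(lam_d: float, hours_per_year: float = HOURS_PER_YEAR) -> float:
    """Mean expected number of failures per year and 100 items."""
    return lam_d * hours_per_year * 100


# Hypothetical item with lam = 6,000e-9 1/h when operated continuously (d = 1)
lam = 6_000e-9
lam_d = equivalent_failure_rate(lam, d=0.3)
m = failures_per_year_per_100(lam_d)
print(f"lambda_d = {lam_d * 1e9:.0f}e-9 1/h, about {m:.1f} failures per year and 100 items")
```

With the exact calendar year the result is somewhat lower than with the rounded factor 10^6 h, which is acceptable for indicative targets.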

$m_{\%/y} < 1$ is a good target for equipment and can influence acquisition cost. Tentative targets are refined successively by performing rough analyses and comparative studies (definition of goals down to assembly level can be necessary at this time, Eq. (2.71)).

Table A3.1 Indicative values of failure rates $\lambda$ and mean expected number $m_{\%/y}$ of failures per year and 100 items for a duty cycle d = 30% and d = 100% ($\theta_A$ = 40°C)

Item | d = 30%: $\lambda_d$ [$10^{-9}\,\mathrm{h}^{-1}$] | d = 30%: $m_{\%/y}$ | d = 100%: $\lambda$ [$10^{-9}\,\mathrm{h}^{-1}$] | d = 100%: $m_{\%/y}$
Telephone exchanger | 2,000 | 2 | 6,000 | 6
Telephone receiver (multifunction) | 200 | 0.2 | 600 | 0.6
Photocopier incl. mechanical parts | 30,000 | 30 | 100,000 | 100
Personal computer | 3,000 | 3 | |
Radar equipment (ground mobile) | 300,000 | 300 | 900,000 | 900
Control card for autom. process control | 300 | 0.3 | 900 | 0.9
Mainframe computer system | - | - | 20,000 | 20

For acceptance testing (demonstration) of an MTBF for the case MTBF = $1/\lambda$, the following data are important (Sections 7.2.3.2 and 7.2.3.3):

1. $\mathrm{MTBF}_0$ = specified MTBF and/or $\mathrm{MTBF}_1$ = minimum acceptable MTBF.

2. Required function (mission profile).

3. Environmental conditions (thermal, mechanical, climatic).

4. Allowed producer's and/or consumer's risks ($\alpha$ and/or $\beta$).

5. Cumulative operating time T and number c of allowed failures during T (acceptance conditions).

6. Number of systems under test ($T / \mathrm{MTBF}_0$ as a rule of thumb).

7. Parameters which should be tested and frequency of measurement.

8. Failures which should be ignored for the MTBF acceptance test.

9. Maintenance and screening before the acceptance test.

10. Maintenance procedures during the acceptance test.

11. Form and content of test protocols and reports.

12. Actions in the case of a negative test result.
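The interplay between the acceptance conditions (cumulative operating time T, allowed number of failures c) and the two risks can be illustrated with a short calculation. The following Python sketch assumes a constant failure rate, so that the number of failures in T is Poisson distributed with mean T/MTBF, and a fixed-length test that is passed if no more than c failures occur. The function names and the numerical plan are hypothetical and serve only as an illustration; the procedures of Sections 7.2.3.2 and 7.2.3.3 remain the reference.

```python
import math


def acceptance_probability(T: float, c: int, mtbf: float) -> float:
    """Probability of observing no more than c failures in cumulative time T.

    Assumes a constant failure rate 1/mtbf (Poisson distributed failure count).
    """
    mean = T / mtbf
    return math.exp(-mean) * sum(mean**i / math.factorial(i) for i in range(c + 1))


def risks(T: float, c: int, mtbf0: float, mtbf1: float) -> tuple[float, float]:
    """Producer's risk (rejecting at MTBF_0) and consumer's risk (accepting at MTBF_1)."""
    alpha = 1.0 - acceptance_probability(T, c, mtbf0)
    beta = acceptance_probability(T, c, mtbf1)
    return alpha, beta


# Hypothetical test plan: MTBF_0 = 2,000 h, MTBF_1 = 1,000 h, T = 9,240 h, c = 6
alpha, beta = risks(T=9_240.0, c=6, mtbf0=2_000.0, mtbf1=1_000.0)
print(f"producer's risk = {alpha:.3f}, consumer's risk = {beta:.3f}")
```

Increasing T at fixed c lowers the consumer's risk but raises the producer's risk, which is why T and c have to be fixed together in the acceptance conditions.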

For acceptance testing (demonstration) of an MTTR, the following data are important (Section 7.3.2):

1. Quantitative requirements (MTTR, variance, quantile).

2. Test conditions (environment, personnel, tools, external support, spare parts).

3. Number and extent of repairs to be undertaken (simulated/introduced failures).

4. Allocation of the repair time (diagnostic, repair, functional test, logistic time).

5. Acceptance conditions (number of repairs and observed empirical MTTR).

6. Form and content of test protocols and reports.

7. Actions in the case of a negative test result.
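The quantities named in items 1 and 5 (empirical MTTR, variance, quantile) can be obtained from a list of observed repair times. The sketch below is only an illustration: the repair times are invented, the function name is hypothetical, and the quantile is computed under the assumption of lognormally distributed repair times, which is an assumption of this sketch rather than a requirement stated above.

```python
import math
import statistics


def mttr_summary(repair_times_h: list[float], quantile: float = 0.90) -> dict[str, float]:
    """Empirical MTTR, sample variance, and a lognormal-based quantile of repair times."""
    if len(repair_times_h) < 2:
        raise ValueError("at least two repair times are needed")
    logs = [math.log(t) for t in repair_times_h]
    mu = statistics.mean(logs)          # mean of the log repair times
    sigma = statistics.stdev(logs)      # standard deviation of the log repair times
    z = statistics.NormalDist().inv_cdf(quantile)
    return {
        "n": float(len(repair_times_h)),
        "mttr": statistics.mean(repair_times_h),
        "variance": statistics.variance(repair_times_h),
        "quantile": math.exp(mu + z * sigma),  # assumes lognormal repair times
    }


# Hypothetical repair times in hours from a maintainability demonstration
observed = [1.2, 0.8, 2.5, 1.0, 3.1, 0.6, 1.9, 1.4]
summary = mttr_summary(observed)
print(f"empirical MTTR = {summary['mttr']:.2f} h, "
      f"90% quantile = {summary['quantile']:.2f} h")
```

The acceptance decision itself then compares the empirical MTTR with the limit fixed in the acceptance conditions.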

Availability usually follows from the relationship PA = MTBF / (MTBF + MTTR). However, specific test procedures for PA = AA are given in Section 7.2.2.
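As a small worked illustration of this relationship, the following Python sketch computes the steady-state availability from MTBF and MTTR, and inverts the relationship to obtain the largest MTTR compatible with a given availability target. The function names and numerical values are hypothetical.

```python
def steady_state_availability(mtbf: float, mttr: float) -> float:
    """PA = MTBF / (MTBF + MTTR), both given in the same time unit."""
    if mtbf <= 0 or mttr < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf / (mtbf + mttr)


def max_mttr_for_target(mtbf: float, pa_target: float) -> float:
    """Largest MTTR for which MTBF / (MTBF + MTTR) still reaches pa_target."""
    if not 0 < pa_target <= 1:
        raise ValueError("pa_target must satisfy 0 < pa_target <= 1")
    return mtbf * (1.0 - pa_target) / pa_target


# Hypothetical equipment: MTBF = 1,000 h, MTTR = 4 h
pa = steady_state_availability(1_000.0, 4.0)
print(f"PA = {pa:.4f}")
print(f"MTTR allowed for PA = 0.999: {max_mttr_for_target(1_000.0, 0.999):.2f} h")
```

Such a calculation helps to check early whether tentative MTBF and MTTR targets are consistent with the availability target.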

A3.2 Realization of Quality and Reliability Requirements for Complex Equipment and Systems

For complex items, in particular at equipment and system level, quality and reliability targets are best achieved with a quality and reliability assurance program, integrated in the project activities and performed without bureaucracy. In such a program, project specific tasks and activities are clearly described and assigned. Table A3.2 can be used as a checklist in defining the content of a quality and reliability assurance program for complex equipment and systems with high quality and reliability requirements, when tailoring is not mandatory (see also [A2.8 (730-2002)] and Section 5.3 for software specific quality assurance aspects). Table A3.2 is a refinement of Table 1.2 and shows a possible task assignment in a company as per Fig. 1.7. Depending on the item technology and complexity, or because of tailoring, Table A3.2 is to be shortened or extended. The given responsibilities for tasks (R, C, I) can be modified to reflect the company's personnel situation. For a comprehensive description of reliability assurance tasks see e.g. [A2.6 (60300), A2.10 (785), A3.1].


Table A3.2 Example of tasks and task assignment for quality and reliability assurance of complex equipment and systems with high quality and reliability requirements, when tailoring is not mandatory (see also Section 5.3 for software specific quality assurance aspects)

Example of tasks and task assignment for quality and reliability assurance, in agreement with Fig. 1.7 and TQM (checklist for the preparation of a quality and reliability assurance program). R stands for responsibility, C for cooperation (must cooperate), I for information (can cooperate). Assignment columns: M, R&D, P, Q&R.

1. Customer and market requirements
   1 Evaluation of delivered equipment and systems
   2 Determination of market and customer demands and real needs
   3 Customer support

2. Preliminary analyses
   1 Definition of tentative quantitative targets for reliability, maintainability, availability, safety, and quality level
   2 Rough analyses and identification of potential problems
   3 Comparative investigations

3. Quality and reliability aspects in specifications, quotations, contracts, etc.
   1 Definition of the required function
   2 Determination of external environmental stresses
   3 Definition of realistic quantitative targets for reliability, maintainability, availability, safety, and quality level
   4 Specification of test and acceptance criteria
   5 Identification of the possibility to obtain field data
   6 Cost estimate for quality & reliability assurance activities

4. Quality and reliability assurance program
   1 Preparation
   2 Realization
     - design and evaluation
     - production

5. Reliability and maintainability analyses
   1 Specification of the required function for each element
   2 Determination of environmental, functional, and time-dependent stresses (detailed operating conditions)
   3 Assessment of derating factors
   4 Reliability and maintainability allocation
   5 Preparation of reliability block diagrams
     - assembly level
     - system level
   6 Identification and analysis of reliability weaknesses (FMEA/FMECA, FTA, worst-case, drift, stress-strength analyses)
     - assembly level
     - system level
   7 Carrying out comparative studies
     - assembly level
     - system level
   8 Reliability improvement through redundancy
     - assembly level
     - system level
   9 Identification of components with limited lifetime
   10 Elaboration of the maintenance concept
   11 Elaboration of a test and screening strategy
   12 Analysis of maintainability
   13 Elaboration of mathematical models
   14 Calculation of the predicted reliability and maintainability
     - assembly level
     - system level
   15 Reliability and availability calculation at system level

6. Safety and human factor analyses
   1 Analysis of safety (avoidance of liability problems)
     - accident prevention
     - technical safety
       identification and analysis of critical failures and of risk situations (FMEA/FMECA, FTA, etc.)
         - assembly level
         - system level
       theoretical investigations
   2 Analysis of human factors (man-machine interface)

7. Selection and qualification of components and materials
   1 Updating of the list of preferred components and materials
   2 Selection of non-preferred components and materials
   3 Qualification of non-preferred components and materials
     - planning
     - realization
     - analysis of test results
   4 Screening of components and materials

8. Supplier selection and qualification
   1 Supplier selection
     - purchased components and materials
     - external production
   2 Supplier qualification (quality and reliability)
     - purchased components and materials
     - external production
   3 Incoming inspections
     - planning
     - realization
     - analysis of test results
     - decision on corrective actions
       purchased components and materials
       external production

9. Project-dependent procedures and work instructions
   1 Reliability guidelines
   2 Maintainability guidelines
   3 Safety guidelines
   4 Other procedures, rules, and work instructions
     for development
     for production
   5 Compliance monitoring

10. Configuration management
   1 Planning and monitoring
   2 Realization
     - configuration identification
       during design
       during production
       during use (warranty period)
     - configuration auditing (design reviews, Tables A3.3, 5.3, 5.5)
     - configuration control (evaluation, coordination, and release or rejection of changes and modifications)
       during design
       during production
       during use (warranty period)
     - configuration accounting

11. Prototype qualification tests
   1 Planning
   2 Realization
   3 Analysis of test results
   4 Special tests for reliability, maintainability, and safety

12. Quality control during production
   1 Selection and qualification of processes and procedures
   2 Production planning
   3 Monitoring of production processes

13. In-process tests
   1 Planning
   2 Realization

14. Final and acceptance tests
   1 Environmental tests and/or screening of series-produced items
     - planning
     - realization
     - analysis of test results
   2 Final and acceptance tests
     - planning
     - realization
     - analysis of test results
   3 Procurement, maintenance, and calibration of test equipment

15. Quality data reporting system
   1 Data collection
   2 Decision on corrective actions
     - during prototype qualification
     - during in-process tests
     - during final and acceptance tests
     - during use (warranty period)
   3 Realization of corrective actions on hardware or software (repair, rework, waiver, scrap)
   4 Implementation of the changes in the documentation (technical, production, customer)
   5 Data compression, processing, storage, and feedback
   6 Monitoring of the quality data reporting system

16. Logistic support
   1 Supply of special tools and test equipment for maintenance
   2 Preparation of customer documentation
   3 Training of operating and maintenance personnel
   4 Determination of the required number of spare parts, maintenance personnel, etc.
   5 After-sales (after market) support

17. Coordination and monitoring
   3 Planning and realization of quality audits
     - project-specific
     - project-independent
   4 Information feedback

18. Quality cost
   1 Collection of quality cost
   2 Cost analysis and initiation of appropriate actions
   3 Preparation of periodic and special reports
   4 Evaluation of the efficiency of quality & reliability assurance

19. Concepts, methods, and general procedures (quality and reliability)
   1 Development of concepts
   2 Investigation of methods
   3 Preparation and updating of the quality handbook
   4 Development of software packages
   5 Collection, evaluation, and distribution of data, experience and know-how

20. Motivation and training
   1 Planning
   2 Preparation of courses and documentation
   3 Realization of the motivation and training program


A3.3 Elements of a Quality and Reliability Assurance Program

The basic elements of a quality and reliability assurance program, as defined in Appendix A3.2, can be summarized as follows:

1. Project organization, planning, and scheduling

2. Quality and reliability requirements

3. Reliability and safety analysis

4. Selection and qualification of components, materials, and processes

5. Configuration management

6. Quality tests

7. Quality data reporting system

These elements are discussed in this section for the case of complex equipment and systems with high quality and reliability requirements, when tailoring is not mandatory. In addition, Appendix A4 gives a catalog of questions to generate checklists for design reviews and Appendix A5 specifies the requirements for a quality data reporting system. For software specific quality assurance aspects one can refer to Section 5.3. As suggested in task 4 of Table A3.2, the realization of a quality and reliability assurance program should be the responsibility of the project manager. It is often useful to start with a quality and reliability program for the development phase, covering items 1 to 5 of the above list, and continue with the production phase for points 5 to 7.

A3.3.1 Project Organization, Planning, and Scheduling

A clearly defined project organization and planning is necessary for the realization of a quality and reliability assurance program. Organization and planning must also satisfy modern needs for cost management and concurrent engineering.

The system specification is the basic document for all considerations at project level. The following is a typical outline for system specifications:

1. State of the art, need for a new product

2. Target to be achieved

3. Cost, time schedule

4. Market potential (turnover, price, competition)

5. Technical performance

6. Environmental conditions

7. Operational capabilities (reliability, maintainability, availability, logistic support)

8. Quality and reliability

9. Special aspects (new technologies, patents, value engineering, etc.)

10. Appendices

The organization of a project begins with the definition of the main task groups. The following groups are usual for a complex system: Project Management, System Engineering, Life-Cycle Cost, Quality and Reliability Assurance, Assembly Design, Prototype Qualification Tests, Production, Assembly and Final Testing. Project organization, task lists, task assignment, and milestones can be derived from the task groups, allowing the quantification of the personnel, material, and financial resources needed for the project. The quality and reliability assurance program must require that the project is clearly and suitably organized and planned.

A3.3.2 Quality and Reliability Requirements

The most important steps in defining quality and reliability targets for complex equipment and systems have been discussed in Appendix A3.1.

A3.3.3 Reliability and Safety Analysis

Reliability and safety analyses include failure rate analysis, failure mode analysis (FMEA/FMECA, FTA), sneak circuit analysis (to identify latent paths which can cause unwanted functions or inhibit desired functions, while all components are functioning properly), evaluation of concrete possibilities to improve reliability and safety (derating, screening, redundancy), as well as comparative studies; see Chapters 2 - 6 for methods and tools.

The quality and reliability assurance program must show what is actually being done for the project considered. For instance, it should be able to supply answers to the following questions:

1. Which derating rules are considered?

2. How are the actual component-level operating conditions determined?

3. Which failure rate data are used? Which are the associated factors ($\pi_E$ & $\pi_Q$)?

4. Which tool is used for failure mode analysis? To which items does it apply?

5. Which kind of comparative studies will be performed?

6. Which design guidelines for reliability, maintainability, safety, and software quality are used? How will their adherence be verified?

Additionally, interfaces to the selection and qualification of components and materials, design reviews, test and screening strategies, reliability tests, quality data reporting system, and subcontractor activities must be shown. The data used for component failure rate calculation should be critically evaluated (source, present relevance, assumed environmental and quality factors $\pi_E$ & $\pi_Q$).
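To make the role of the environmental and quality factors concrete, the following Python sketch computes a predicted failure rate for a small assembly. It is a simplified illustration under stated assumptions: each component failure rate is taken as a reference failure rate multiplied by an environmental factor and a quality factor, and the assembly is treated as a series structure without redundancy, so that the component failure rates add. The class, the function name, and all numerical values are hypothetical and do not come from any failure rate handbook.

```python
from dataclasses import dataclass


@dataclass
class Part:
    name: str
    lambda_ref: float   # reference failure rate in 1/h (hypothetical value)
    pi_e: float         # assumed environmental factor
    pi_q: float         # assumed quality factor
    quantity: int = 1

    def failure_rate(self) -> float:
        """Predicted failure rate of all items of this type, in 1/h."""
        return self.quantity * self.lambda_ref * self.pi_e * self.pi_q


def series_failure_rate(parts: list[Part]) -> float:
    """Failure rate of a series structure (no redundancy): sum of part failure rates."""
    return sum(p.failure_rate() for p in parts)


# Hypothetical parts list of an assembly
parts = [
    Part("resistor", 1e-9, pi_e=1.0, pi_q=1.0, quantity=40),
    Part("ceramic capacitor", 2e-9, pi_e=1.0, pi_q=1.0, quantity=25),
    Part("digital IC", 20e-9, pi_e=2.0, pi_q=1.5, quantity=6),
]

lam = series_failure_rate(parts)
print(f"predicted failure rate = {lam * 1e9:.0f}e-9 1/h, MTBF = {1 / lam:.0f} h")
for p in parts:
    print(f"  {p.name}: {100 * p.failure_rate() / lam:.0f}% of the total")
```

Listing the contribution of each part type shows directly how strongly the assumed factors drive the result, which is the reason for evaluating them critically.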


A3.3.4 Selection and Qualification of Components, Materials, and Manufacturing Processes

Components, materials, and production processes have a great impact on product quality and reliability. They must be carefully selected and qualified. Examples for qualification tests on electronic components and assemblies are given in Chapter 3. For production processes one may refer e.g. to [8.1 - 8.15].

The quality and reliability assurance program should state how components, materials, and processes are (or have already been) selected and qualified. For instance, the following questions should be answered:

1. Does a list of preferred components and materials exist? Will critical components be available on the market at least for the required production and warranty time?

2. How will obsolescence problems be solved?

3. Under what conditions can a designer use nonqualified components/materials?

4. How are new components selected? What is the qualification procedure?

5. How have the standard manufacturing processes been qualified?

6. How are special manufacturing processes qualified?

Special manufacturing processes are those whose quality cannot be tested directly on the product, which have high requirements with respect to reproducibility, or which can have an important negative effect on the product quality or reliability.

A3.3.5 Configuration Management

Configuration management is an important tool for quality assurance, in particular during design and development. Within a project, it is often subdivided into configuration identification, auditing, control, and accounting.

The identification of an item is recorded in its documentation. A possible documentation outline for complex equipment and systems is given in Fig. A3.1.

Fig. A3.1 Possible documentation outline for complex equipment and systems. The DOCUMENTATION is subdivided into four groups:

• System specifications; Quotations, requests; Interface documentation; Planning and control documentation; Concepts/strategies (maintenance, test); Analysis reports; Standards, handbooks, general rules

• TECHNICAL DOCUMENTATION: Work breakdown structures; Drawings; Schematics; Part lists; Wiring plans; Specifications; Purchasing doc.; Handling/transportation/storage/packaging doc.

• PRODUCTION DOCUMENTATION: Operations plans/records; Production procedures; Tool documentation; Assembly documentation; Test procedures; Test reports; Documents pertaining to the quality data reporting system

• CUSTOMER DOCUMENTATION: Customer system specifications; Operating and maintenance manuals; Spare part catalog

Configuration auditing is done via design reviews (often also termed gate reviews), the aim of which is to assure/verify that the system will meet all requirements. In a design review, all aspects of design and development (selection and use of components and materials, dimensioning, interfaces, etc.), production (manufacturability, testability, reproducibility), reliability, maintainability, safety, patent regulations, value engineering, and value analysis are critically examined with the help of checklists. The most important design reviews are described in Table A3.3. For complex systems a review of the first production unit (FCA/PCA) is often required. A further important objective of design reviews is to decide about continuation or stopping of the project considered, on the basis of objective considerations and feasibility checks (Tables A3.3 and 5.3 & Fig. 1.6). A week before the design review, participants should present project specific checklists, see Appendix A4 and Tables 2.8 & 4.3 for some suggestions. Design reviews are chaired by the project manager and should be co-chaired by the project quality and reliability assurance manager. For complex equipment and systems, the review team may vary according to the following list:

• project manager,

• project quality and reliability assurance manager,

• design engineers,

• representatives from production and marketing,

• independent design engineer or external expert,

• customer representatives (if appropriate).

Configuration control includes evaluation, coordination, and release or rejection of all proposed changes and modifications. Changes occur as a result of defects or failures; modifications are triggered by a revision of the system specifications.

Configuration accounting ensures that all approved changes and modifications have been implemented and recorded. This calls for a defined procedure, as changes/modifications must be realized in hardware, software, and documentation.

A one-to-one correspondence between hardware or software and documentation is important during all life-cycle phases of a product. Complete records over all life-cycle phases become necessary if traceability is explicitly required, as e.g. in the aerospace or nuclear field. Partial traceability can also be required for products which are critical with respect to safety, or because of product liability.

Referring to configuration management, the quality and reliability assurance program should for instance answer the following questions:


1. Which documents will be produced by whom, when, and with what content?

2. Are document contents in accordance with quality and reliability requirements?

3. Is the release procedure for technical and production documentation compatible with quality requirements?

4. Are the procedures for changes/modifications clearly defined?

5. How is compatibility (upward and/or downward) assured?

6. How is configuration accounting assured during production?

7. Which items are subject to traceability requirements?

A3.3.6 Quality Tests

Quality tests are necessary to verify whether an item conforms to specified requirements. Such tests cover performance, reliability, maintainability, and safety aspects, and include incoming inspections, qualification tests, production tests, and acceptance tests. To optimize cost and time schedule, tests should be integrated in a test (and screening) strategy at system level. Methods for statistical quality control and reliability tests are given in Chapter 7. Qualification tests and screening procedures are discussed in Sections 3.2 - 3.4 and 8.2 - 8.3. Basic considerations for test and screening strategies with cost considerations are in Section 8.4. Some aspects of testing software are discussed in Section 5.3. Reliability growth is investigated in Section 7.7.

The quality and reliability assurance program should for instance answer the following questions:

1. What are the test and screening strategies at system level?

2. How were subcontractors selected, qualified and monitored?

3. What is specified in the procurement documentation?

4. How is the incoming inspection performed?

5. Which components and materials are 100% tested? Which are 100% screened? What are the procedures for screening?

6. How are prototypes qualified? Who decides on test results?

7. How are production tests performed? Who decides on test results?

8. Which procedures are applied to defective or failed items?

9. What are the instructions for handling, transportation, storage, and shipping?

A3.3.7 Quality Data Reporting System

Starting at the prototype qualification tests, all defects and failures should be systematically collected, analyzed and corrected. Analysis should go back to the cause of the fault, in order to find those actions most appropriate for avoiding repetition of the same problem. The concept of a quality data reporting system is illustrated in Fig. 1.8 and applies basically to hardware and software; detailed requirements are given in Appendix A5.

Table A3.3 Design reviews during definition, design, and development of complex equipment and systems

System Design Review (SDR)
To be performed: At the end of the definition phase
Goals:
- Critical review of the system specifications on the basis of results from market research, rough analysis, comparative studies, patent situation, etc.
- Feasibility check
Input:
- Item list
- System specifications (draft)
- Documentation (analyses, reports, etc.)
- Checklists (one for each participant)*
Output:
- System specifications
- Proposal for the design phase
- Interface definitions
- Rough maintenance and logistic support concept
- Report

Preliminary Design Reviews (PDR)
To be performed: During the design phase, each time an assembly has been developed
Goals:
- Critical review of all documents belonging to the assembly under consideration (calculations, schematics, parts lists, test specifications, etc.)
- Comparison of the target achieved with the system specifications requirements
- Checking interfaces to other assemblies
- Feasibility check
Input:
- Item list
- Documentation (analyses, schematics, drawings, parts lists, test specifications, work breakdown structure, interface specifications, etc.)
- Reports of relevant earlier design reviews
- Checklists (one for each participant)*
Output:
- Reference configuration (baseline) of the assembly considered
- List of deviations from the system specifications
- Report

Critical Design Review (CDR)
To be performed: At the end of prototype qualification tests
Goals:
- Critical comparison of prototype qualification test results with system requirements
- Formal review of the correspondence between technical documentation and prototype
- Verification of manufacturability, testability, and reproducibility
- Feasibility check
Input:
- Item list
- Technical documentation
- Testing plan and procedures for prototype qualification tests
- Results of prototype qualification tests
- List of deviations from the system requirements
- Maintenance concept
- Checklists (one for each participant)*
Output:
- List of the final deviations from the system specs.
- Qualified and released prototypes
- Frozen technical documentation
- Revised maintenance concept
- Production proposal
- Report

* See Appendix A4 for a possible catalog of questions to generate project specific checklists and Tab. 5.5 for software specific aspects; gate review is often used instead of design review


The quality and reliability assurance program should for instance answer the following questions:

1. How is the collection of defect and failure data carried out? At which project phase does it start?

2. How are defects and failures analyzed?

3. Who carries out corrective actions? Who monitors their realization? Who checks the final configuration?

4. How is evaluation and feedback of quality and reliability data organized?

5. Who is responsible for the quality data reporting system? Does production have its own locally limited version of such a system? How does this system interface with the company's quality data reporting system?

A4 Checklists for Design Reviews

In a design review, all aspects of design, development, production, reliability, maintainability, safety, patent regulations, and value engineering / value analysis are critically examined with the help of checklists. The most important design reviews are described in Table A3.3 (see Table 5.5 for software specific aspects). A further objective of design reviews is to decide whether to continue or stop the project on the basis of objective considerations and a feasibility check (Tables A3.3 and 5.3 & Fig. 1.6). This appendix gives a catalog of questions which can be used to generate project specific checklists for design reviews for complex equipment and systems with high quality & reliability requirements, when tailoring is not mandatory.

A4.1 System Design Review

1. What experience exists with similar equipment or systems?

2. What are the goals for performance (capability), reliability, maintainability, availability, and safety? How have they been defined? Which mission profile (required function and environmental conditions) is applicable?

3. Are the requirements realistic? Do they correspond to a market need?

4. What tentative allocation of reliability and maintainability down to assembly / unit level was undertaken?

5. What are the critical items? Are potential problems to be expected (new technologies, interfaces)?

6. Have comparative studies been done? What are the results?

7. Are interference problems (external or internal EMC) to be expected?

8. Are there potential safety / liability problems?

9. Is there a maintenance concept? Do special ergonomic requirements exist?

10. Are there special software requirements?

11. Has the patent situation been verified? Are licenses necessary?

12. Are there estimates of life-cycle cost? Have these been optimized with respect to reliability and maintainability requirements?


13. Is there a feasibility study? Where does the competition stand? Has development risk been assessed?

14. Is the project time schedule realistic? Can the system be marketed at the right time?

15. Can supply problems be expected during production ramp-up?

A4.2 Preliminary Design Reviews

a) General

1. Is the assembly / unit under consideration a new development or only a change / modification? Can existing items (e.g. subassemblies) be used?

2. Is there experience with similar assemblies / units? What were the problems?

3. Is there redundancy in hardware and / or software?

4. Have customer and market demands changed since the beginning of development? Can individual requirements be reduced?

5. Can the chosen solution be further simplified?

6. Are there patent problems? Do licenses have to be purchased?

7. Have expected cost and deadlines been met? Was value engineering used?

b) Performance Parameters

1. How have the main performance parameters of the assembly / unit under consideration been defined? How was their fulfillment verified (calculations, simulation, tests)?

2. Have worst case situations been considered in calculations / simulations?

3. Have interference problems (EMC) been solved?

4. Have applicable standards been observed during design and development?

5. Have interface problems with other assemblies / units been solved?

6. Have prototypes been adequately tested in the laboratory?

c) Environmental Conditions

1. Have environmental conditions been defined? As a function of time? Were these consistently used to determine component operating conditions?

2. How has EMC interference been determined? Has its influence been taken into account in worst case calculation / simulation?


d) Components and Materials

1. Which components and materials do not appear in the preferred lists? For what reasons? How were these components and materials qualified?

2. Are incoming inspections necessary? For which components and materials? How and by whom will they be performed?

3. Which components and materials were screened? How and by whom will screening be performed?

4. Are suppliers guaranteed for series production? Is there at least one second source for each component and material? Have requirements for quality, reliability, and safety been met?

5. Are obsolescence problems to be expected? How will they be solved?

e) Reliability

See Table 2.8.

f) Maintainability

See Table 4.3.

g) Safety

1. Have applicable standards concerning accident prevention been observed?

2. Has safety been considered with regard to external causes (natural catastrophe, sabotage, etc.)?

3. Has an FMEA / FMECA or similar cause-to-effects analysis been performed? Are there failure modes with critical or even catastrophic consequence? Can these be avoided? Have all single-point failures been identified? Can these be avoided?

4. Has a fail-safe analysis been performed? What were the results?

5. What safety tests are planned? Are they sufficient?

6. Have safety aspects been dealt with adequately in the documentation?

h) Human Factors, Ergonomics

1. Have operating and maintenance sequences been defined with regard to the training level of operators and maintenance personnel?

2. Have ergonomic factors been taken into account by defining operating sequences?

3. Has the man-machine interface been sufficiently considered?


i) Standardization

1. Have standard components and materials been used wherever possible?

2. Has exchangeability of items been considered during design and construction?

j) Configuration

1. Is the technical documentation (schematics, drawings, etc.) complete, error-free, and does it reflect the present state of the project?

2. Have all interface problems between assemblies / units been solved?

3. Can the technical documentation be frozen and considered as reference documentation (baseline)?

4. How is compatibility (upward and / or downward) assured?

k) Production and Testing

1. Which qualification tests are foreseen for prototypes? Have reliability, maintainability, and safety aspects been considered sufficiently in these tests?

2. Have all questions been answered regarding manufacturability, testability, and reproducibility?

3. Are special production processes necessary? Were they qualified? What were the results?

4. Are special transport, packaging, or storage problems to be expected?

A4.3 Critical Design Review (System Level)

a) Technical Aspects

1. Does the documentation allow an exhaustive and correct interpretation of test procedures and results? Has the technical documentation been frozen? Has conformance with present hardware and software been checked?

2. Are test specifications and procedures complete? In particular, are conditions for functional, environmental, reliability, and safety tests clearly defined?

3. Have fault criteria been defined for critical parameters? Is an indirect measurement planned for those parameters which cannot be measured accurately enough during tests?

4. Has a representative mission profile, with the corresponding required function, been clearly defined for reliability tests?


5. Have test criteria for maintainability been defined? Which failures were simulated / introduced? How have personnel and material conditions been fixed?

6. Have test criteria for safety been defined (accident prevention and technical safety)?

7. Have ergonomic aspects been checked? How?

8. Can packaging, transport and storage cause problems?

9. Have defects and failures been systematically analyzed (mode, cause, effect)? Has the usefulness of corrective actions been verified? How? Also with respect to cost?

10. Have all deviations been recorded? Can they be accepted?

11. Does the system still satisfy customer / market needs?

12. Are manufacturability and reproducibility guaranteed within the framework of a production environment?

b) Formal Aspects

1. Is the technical documentation complete?

2. Has the technical documentation been checked for correctness? For coherency?

3. Is uniqueness in numbering guaranteed? Even in the case of changes?

4. Is hardware labeling appropriate? Does it satisfy production and maintenance requirements?

5. Has conformance between prototype and documentation been checked?

6. Is the maintenance concept mature? Are spare parts having a different change status fully interchangeable?

7. Are production tests sufficient from today's point of view?

A5 Requirements for Quality Data Reporting Systems

A quality data reporting system is a system to collect, analyze, and correct all defects and failures occurring during production and testing of an item, as well as to evaluate and feed back the corresponding quality and reliability data (Fig. 1.8). The system is generally computer-aided. Analysis of failures and defects must go back to the root cause in order to determine the most appropriate action necessary to avoid repetition of the same problem. The quality data reporting system applies basically to hardware and software. It should remain active during the operating phase, at least for the warranty time. This appendix summarizes the requirements for a computer-aided quality data reporting system for complex equipment and systems.

a) General Requirements

1. Up-to-dateness, completeness, and utility of the delivered information must be the primary concern (best compromise).

2. A high level of usability (user friendliness) and minimal manual intervention should be a goal.

3. Procedures and responsibilities should be clearly defined (several levels depending upon the consequence of defects or failures).

4. The system should be flexible and easily adaptable to new needs.

b) Requirements Relevant to Data Collection

1. All data concerning defects and failures (relevant to quality, reliability, maintainability, and safety) have to be collected, from the beginning of prototype qualification tests to (at least) the end of the warranty time.

2. Data collection forms should
- be preferably 8" x 11" or A4 format,
- be project-independent and easy to fill in,
- ensure that only the relevant information is entered and answers the questions: what, where, when, why, and how?
- have a separate field (20-30%) for free-format input for comments (requests for analysis, logistic information, etc.); these comments do not need to be processed and should be easily separable from the fixed portion of the form.

3. Description of the symptom (mode), analysis (cause, effect), and corrective action undertaken should be recorded in clear text and coded at data entry by trained personnel.

4. Data collection can be carried out in different ways
- at a single reporting location (adequate for simple problems which can be solved directly at the reporting location),
- from different reporting locations which report the fault (defect or failure), analysis result, and corrective action separately.
Operating reliability, maintainability, or logistic data can also be reported.

5. Data collection forms should be entered into the computer daily (on line if possible), so that corrective actions can be quickly initiated (for field data, a weekly or monthly entry can be sufficient for many purposes).

c) Requirements for Analysis

1. The cause should be found for each defect or failure
- at the reporting location, in the case of simple problems,
- by a fault review board, in critical cases.

2. Failures (and defects) should be classified according to
mode
- sudden failure (short, open, fracture, etc.)
- gradual failure (drift, wearout, etc.)
- intermittent failures, others if needed
cause
- intrinsic (inherent weaknesses, wearout, or some other intrinsic cause)
- extrinsic (systematic failure, i.e. misuse, mishandling, design, or manufacturing failure)
- secondary failure
effect
- irrelevant
- partial failure
- complete failure
- critical failure (safety problem).

3. Consequence of the analysis (repair, rework, change, scrapping) must be reported.

d) Requirements for Corrective Actions

1. Every record is considered pending until the necessary corrective action has been successfully completed and certified.

2. The quality data reporting system must monitor all corrective actions.


3. Procedures and responsibilities pertaining to corrective action have to be defined (simple cases usually solved by the reporting location).

4. The reporting location must be informed about a completed corrective action.
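The classification scheme of c) and the "pending until certified" rule of d) can be sketched as a small record type. This is only an illustrative sketch: the class and field names, the item identifier `PCB-0042`, and the rule that only safety-critical effects go to the fault review board are assumptions of this example, not part of the requirements above.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    SUDDEN = "sudden"              # short, open, fracture, ...
    GRADUAL = "gradual"            # drift, wearout, ...
    INTERMITTENT = "intermittent"
    OTHER = "other"


class Cause(Enum):
    INTRINSIC = "intrinsic"        # inherent weakness, wearout, ...
    EXTRINSIC = "extrinsic"        # systematic: misuse, mishandling, design, manufacturing
    SECONDARY = "secondary"


class Effect(Enum):
    IRRELEVANT = "irrelevant"
    PARTIAL = "partial failure"
    COMPLETE = "complete failure"
    CRITICAL = "critical failure"  # safety problem


class Consequence(Enum):
    REPAIR = "repair"
    REWORK = "rework"
    CHANGE = "change"
    SCRAPPING = "scrapping"


@dataclass
class FailureRecord:
    item_id: str                   # hypothetical identifier format
    description: str               # clear-text description of the symptom
    mode: Mode
    cause: Cause
    effect: Effect
    consequence: Consequence
    corrective_action_certified: bool = False

    @property
    def pending(self) -> bool:
        # A record stays pending until the corrective action is completed and certified.
        return not self.corrective_action_certified

    def needs_fault_review_board(self) -> bool:
        # Assumption of this sketch: only safety-critical effects are escalated.
        return self.effect is Effect.CRITICAL


rec = FailureRecord("PCB-0042", "output stage shorted", Mode.SUDDEN,
                    Cause.INTRINSIC, Effect.COMPLETE, Consequence.REPAIR)
print(rec.pending)  # True
```

Coding mode, cause, and effect as enumerations is one way to keep the coded fields separate from the free-text description, in line with the data collection requirements in b).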

e) Requirements Related to Data Processing, Feedback, and Storage

1. Adequate coding must allow data compression and simplify data processing.

2. Up-to-date information should be available on-line.

3. Problem-dependent and periodic data evaluation must be possible.

4. At the end of a project, relevant information should be stored for comparative investigations.

f) Requirements Related to Compatibility with other Software Packages

1. Compatibility with company's configuration management and data banks should be assured.

2. Data transfer with the following external software packages should be assured
- important reliability data banks,
- quality data reporting systems of subsidiary companies,
- quality data reporting systems of large contractors.

The effort required for implementing a quality data reporting system as described above can take 5 to 10 man-years for a medium-sized company. Competence for operation and maintenance of the quality data reporting system should be with the company's quality and reliability assurance department. The priority for the realization of corrective actions is project specific and should be fixed by the project manager. Major problems (defects and failures) should be discussed periodically by a fault review board chaired by the company's quality and reliability assurance manager, which should have, in critical cases defined in the company's quality assurance handbook, the competence to take go / no-go decisions.

A6 Basic Probability Theory

In many practical situations, experiments have a random outcome, i.e., the results cannot be predicted exactly, although the same experiment is repeated under identical conditions. Examples in reliability engineering are failure-free time of a given system, repair time of equipment, inspection of a given item during production, etc. Experience shows that as the number of repetitions of the same experiment increases, certain regularities appear regarding the occurrence of the event considered. Probability theory is a mathematical discipline which investigates the laws describing such regularities. The assumption of unlimited repeatability of the same experiment is basic to probability theory. This assumption permits the introduction of the concept of probability for an event starting from the properties of the relative frequency of its occurrence in a long series of trials. The axiomatic theory of probability, introduced in 1933 by A.N. Kolmogorov [A6.10], brought probability theory to a mathematical discipline. In reliability analysis, probability theory allows the investigation of the probability that a given item will operate failure-free for a stated period of time under given conditions, i.e. the calculation of the item's reliability on the basis of a mathematical model. The rules necessary for such calculations are presented in Sections A6.1 - A6.4. The following sections are devoted to the concept of random variables, necessary to investigate reliability as a function of time and as a basis for stochastic processes (Appendix A7) and mathematical statistics (Appendix A8). This appendix is a compendium of probability theory, consistent from a mathematical point of view but still with reliability engineering applications in mind. Selected examples illustrate the practical aspects.

A6.1 Field of Events

As introduced in 1933 by A.N. Kolmogorov [A6.10], the mathematical model of an experiment with random outcome is a triplet $[\Omega, \mathcal{F}, \Pr]$, also called probability space. $\Omega$ is the sample space, $\mathcal{F}$ the event field, and $\Pr$ the probability of each element of $\mathcal{F}$. $\Omega$ is a set containing as elements all possible outcomes of the experiment considered. Hence $\Omega = \{1, 2, 3, 4, 5, 6\}$ if the experiment consists of a single throw of a die, and $\Omega = [0, \infty)$ in the case of failure-free times of an item. The elements of $\Omega$ are called elementary events and are represented by $\omega$. If the logical statement "the outcome of the experiment lies in the subset $A$ of $\Omega$" is identified with the subset $A$ itself, combinations of statements become equivalent to operations with subsets of $\Omega$. If the sample space $\Omega$ is finite or countable, a probability can be assigned to every subset of $\Omega$. In this case, the event field $\mathcal{F}$ contains all subsets of $\Omega$ and all combinations of them. If $\Omega$ is continuous, restrictions are necessary. The event field $\mathcal{F}$ is thus a system of subsets of $\Omega$ to each of which a probability has been assigned according to the situation considered. Such a field is called a $\sigma$-field ($\sigma$-algebra) and has the following properties:

1. $\Omega$ is an element of $\mathcal{F}$.

2. If $A$ is an element of $\mathcal{F}$, its complement $\bar{A}$ is also an element of $\mathcal{F}$.

3. If $A_1, A_2, \ldots$ are elements of $\mathcal{F}$, the countable union $A_1 \cup A_2 \cup \ldots$ is also an element of $\mathcal{F}$.

From the first two properties it follows that the empty set $\emptyset$ belongs to $\mathcal{F}$. From the last two properties and De Morgan's law one recognizes that the countable intersection $A_1 \cap A_2 \cap \ldots$ also belongs to $\mathcal{F}$. In probability theory, the elements of $\mathcal{F}$ are called (random) events. The most important operations on events are the union, the intersection, and the complement:

1. The union of a finite or countable sequence $A_1, A_2, \ldots$ of events is an event which occurs if at least one of the events $A_1, A_2, \ldots$ occurs; it will be denoted by $A_1 \cup A_2 \cup \ldots$ or by $\bigcup_i A_i$.

2. The intersection of a finite or countable sequence $A_1, A_2, \ldots$ of events is an event which occurs if each one of the events $A_1, A_2, \ldots$ occurs; it will be denoted by $A_1 \cap A_2 \cap \ldots$ or by $\bigcap_i A_i$.

3. The complement of an event $A$ is an event which occurs if and only if $A$ does not occur; it is denoted by $\bar{A}$, with $\bar{A} = \{\omega : \omega \notin A\} = \Omega \setminus A$, $A \cup \bar{A} = \Omega$, and $A \cap \bar{A} = \emptyset$.

Important properties of set operations are:

Commutative law: $A \cup B = B \cup A$; $A \cap B = B \cap A$

Associative law: $A \cup (B \cup C) = (A \cup B) \cup C$; $A \cap (B \cap C) = (A \cap B) \cap C$

Distributive law: $A \cup (B \cap C) = (A \cup B) \cap (A \cup C)$; $A \cap (B \cup C) = (A \cap B) \cup (A \cap C)$

Complement law: $A \cap \bar{A} = \emptyset$; $A \cup \bar{A} = \Omega$

Idempotent law: $A \cup A = A$; $A \cap A = A$

De Morgan's law: $\overline{A \cup B} = \bar{A} \cap \bar{B}$; $\overline{A \cap B} = \bar{A} \cup \bar{B}$

Identity law: $\bar{\bar{A}} = A$; $A \cup (\bar{A} \cap B) = A \cup B$.

The sample space $\Omega$ is also called the sure event and $\emptyset$ is the impossible event. The events $A_1, A_2, \ldots$ are mutually exclusive if $A_i \cap A_j = \emptyset$ holds for any $i \ne j$. The events $A$ and $B$ are equivalent if either they occur together or neither of them occurs; equivalent events have the same probability. In the following, events will be mainly enclosed in braces { }.
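The set laws listed above can be checked on a small finite sample space. The following sketch uses the die example with Python's built-in `frozenset`; the particular events `A`, `B`, `C` and the helper `complement` are chosen here only for illustration.

```python
omega = frozenset({1, 2, 3, 4, 5, 6})   # sample space: single throw of a die
A = frozenset({2, 4, 6})                # "even number"
B = frozenset({4, 5, 6})                # "at least 4"
C = frozenset({1, 2})                   # "at most 2"


def complement(event):
    """Complement of an event with respect to the sample space omega."""
    return omega - event


# De Morgan's law
assert complement(A | B) == complement(A) & complement(B)
assert complement(A & B) == complement(A) | complement(B)

# Distributive law
assert A | (B & C) == (A | B) & (A | C)
assert A & (B | C) == (A & B) | (A & C)

# Complement law
assert A & complement(A) == frozenset()
assert A | complement(A) == omega

# Identity law
assert complement(complement(A)) == A
assert A | (complement(A) & B) == A | B

# A and its complement are mutually exclusive: their intersection is empty
mutually_exclusive = (A & complement(A)) == frozenset()
```

A check on one example is of course not a proof; it only illustrates how the operations on events map onto set operations.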

A6.2 Concept of Probability


Let us assume that 10 (random) samples of size n = 100 were taken from a large and homogeneous lot of populated printed circuit boards (PCBs), for incoming inspection. Examination yielded the following results:

Sample number:          1  2  3  4  5  6  7  8  9  10

No. of defective PCBs:  6  5  1  3  4  0  3  4  5   7

For 1000 repetitions of the "testing a PCB" experiment, the relative frequency of the occurrence of event {PCB defective} is

$$\frac{6 + 5 + 1 + 3 + 4 + 0 + 3 + 4 + 5 + 7}{1000} = \frac{38}{1000} = 0.038.$$

It is intuitively appealing to consider 0.038 as the probability of the event {PCB defective}. As shown below, 0.038 is a reasonable estimation of this probability (on the basis of the experimental observations made).
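The value 0.038 can be recomputed directly from the inspection results. A minimal sketch, with variable names chosen here for illustration:

```python
defectives = [6, 5, 1, 3, 4, 0, 3, 4, 5, 7]   # defective PCBs found in samples 1..10
sample_size = 100                              # PCBs per sample

n = sample_size * len(defectives)   # total number of tested PCBs
k = sum(defectives)                 # total number of defective PCBs
relative_frequency = k / n

print(n, k, relative_frequency)  # 1000 38 0.038
```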

Relative frequencies of the occurrence of events have the property that if $n$ is the number of trial repetitions and $n(A)$ the number of those trial repetitions in which the event $A$ occurred, then

$$\hat{p}_n(A) = \frac{n(A)}{n}$$

is the relative frequency of the occurrence of $A$, and the following rules apply:

1. R1: $\hat{p}_n(A) \ge 0$.

2. R2: $\hat{p}_n(\Omega) = 1$.

3. R3: if the events $A_1, \ldots, A_m$ are mutually exclusive, then $n(A_1 \cup \ldots \cup A_m) = n(A_1) + \ldots + n(A_m)$ and $\hat{p}_n(A_1 \cup \ldots \cup A_m) = \hat{p}_n(A_1) + \ldots + \hat{p}_n(A_m)$.

Experience shows that for a second group of $n$ trials, the relative frequency $\hat{p}_n(A)$ can be different from that of the first group. $\hat{p}_n(A)$ also depends on the number of trials $n$. On the other hand, experiments have confirmed that with increasing $n$, the value $\hat{p}_n(A)$ converges toward a fixed value $p(A)$, see Fig. A6.1 for an example. It therefore seems reasonable to designate the limiting value $p(A)$ as the probability $\Pr\{A\}$ of the event $A$, with $\hat{p}_n(A)$ as an estimate of $\Pr\{A\}$. Although intuitive, such a definition of probability would lead to problems in the case of continuous (non-denumerable) sample spaces.
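The stabilization of the relative frequency with increasing $n$ can be observed in a simulated coin-tossing experiment. The sketch below uses Python's pseudo-random generator as a stand-in for a symmetric coin; the function name and the seed are arbitrary choices of this example, and a simulation only illustrates the behavior, it does not prove convergence.

```python
import random


def relative_frequency_of_heads(n, seed=0):
    """Relative frequency k/n of "heads" in n simulated tosses of a symmetric coin."""
    rng = random.Random(seed)                       # fixed seed for reproducibility
    heads = sum(rng.random() < 0.5 for _ in range(n))
    return heads / n


for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

For small $n$ the printed values can deviate noticeably from 0.5; for large $n$ they are expected to lie close to it.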

Since Kolmogorov's work [A6.10], the probability $\Pr\{A\}$ has been defined as a function on the event field $\mathcal{F}$ of subsets of $\Omega$. The following axioms hold for this function:

Figure A6.1 Example of relative frequency k/n of "heads" when tossing a symmetric coin n times

1. Axiom 1: For each $A \in \mathcal{F}$, $\Pr\{A\} \ge 0$.

2. Axiom 2: $\Pr\{\Omega\} = 1$.

3. Axiom 3: If events $A_1, A_2, \ldots$ are mutually exclusive, then

$$\Pr\Big\{\bigcup_{i=1}^{\infty} A_i\Big\} = \sum_{i=1}^{\infty} \Pr\{A_i\}.$$

Axiom 3 is equivalent to the following statements taken together:

4. Axiom 3': For any finite collection of mutually exclusive events, $\Pr\{A_1 \cup \ldots \cup A_n\} = \Pr\{A_1\} + \ldots + \Pr\{A_n\}$.

5. Axiom 3'': If events $A_1, A_2, \ldots$ are increasing, i.e. $A_n \subseteq A_{n+1}$, $n = 1, 2, \ldots$, then $\lim_{n \to \infty} \Pr\{A_n\} = \Pr\{\bigcup_{i=1}^{\infty} A_i\}$.

The relationships between Axiom 1 and R1, and between Axiom 2 and R2 are obvious. Axiom 3 postulates the total additivity of the set function $\Pr\{A\}$. Axiom 3' corresponds to R3. Axiom 3'' implies a continuity property of the set function $\Pr\{A\}$ which cannot be derived from the properties of $\hat{p}_n(A)$, but which is of great importance in probability theory. It should be noted that the interpretation of the probability of an event as the limit of the relative frequency of occurrence of this event in a long series of trial repetitions appears as a theorem within probability theory (law of large numbers, Eqs. (A6.144) and (A6.146)).

From axioms 1 to 3 it follows that:

$\Pr\{\emptyset\} = 0$,

$\Pr\{A\} \le \Pr\{B\}$ if $A \subseteq B$,

$\Pr\{\bar{A}\} = 1 - \Pr\{A\}$,

$0 \le \Pr\{A\} \le 1$.


When modeling an experiment with random outcome by means of the probability space $[\Omega, \mathcal{F}, \Pr]$, the difficulty is often in the determination of the probabilities $\Pr\{A\}$ for every $A \in \mathcal{F}$. The structure of the experiment can help here. Beside the statistical probability, defined as the limit for $n \to \infty$ of the relative frequency $k/n$, the following rules can be used if one assumes that all elementary events $\omega$ have the same chance of occurrence:

1. Classical probability (discrete uniform distribution): If $\Omega$ is a finite set and $A$ a subset of $\Omega$, then

$$\Pr\{A\} = \frac{\text{number of elements in } A}{\text{number of elements in } \Omega} \qquad \text{(A6.2)}$$

or

$$\Pr\{A\} = \frac{\text{number of favorable outcomes}}{\text{number of possible outcomes}}.$$

2. Geometric probability (spatial uniform distribution): If $\Omega$ is a set in the plane $\mathbb{R}^2$ of finite area and $A$ a subset of $\Omega$, then

$$\Pr\{A\} = \frac{\text{area of } A}{\text{area of } \Omega}. \qquad \text{(A6.3)}$$

It should be noted that the geometric probability can also be defined if $\Omega$ is a part of the Euclidean space having a finite area. Examples A6.1 and A6.2 illustrate the use of Eqs. (A6.2) and (A6.3).

Example A6.1 From a shipment containing 97 good and 3 defective ICs, one IC is randomly selected. What is the probability that it is defective?

Solution From Eq. (A6.2),

$$\Pr\{\text{IC defective}\} = \frac{3}{100}.$$

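Example A6.2 Maurice and Matthew wish to meet between 8:00 and 9:00 a.m. according to the following rules: 1) They come independently of each other and each will wait 12 minutes. 2) The time of arrival is uniformly distributed between 8:00 and 9:00 a.m. What is the probability that they will meet?

Solution Equation (A6.3) can be applied. Taking the arrival times of Maurice and Matthew (in hours after 8:00) as coordinates in the unit square, the two meet when their arrival times differ by at most 12 minutes = 0.2 h. The complementary region consists of two right triangles with legs of length 0.8, so that

$$\Pr\{\text{Matthew meets Maurice}\} = \frac{1 - 2 \cdot \frac{0.8 \cdot 0.8}{2}}{1} = 0.36.$$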
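The result of Example A6.2 can be cross-checked numerically. The sketch below computes the geometric probability from the areas and compares it with a Monte Carlo estimate; the function name, the number of trials, and the seed are assumptions of this example.

```python
import random

WAIT = 12 / 60   # waiting time in hours

# Geometric probability: unit square minus two right triangles with legs (1 - WAIT).
exact = 1 - (1 - WAIT) ** 2


def estimate_meeting_probability(trials=200_000, seed=1):
    """Monte Carlo estimate of the probability that the two arrivals differ by at most WAIT."""
    rng = random.Random(seed)
    meets = 0
    for _ in range(trials):
        x = rng.random()   # arrival of Maurice, hours after 8:00
        y = rng.random()   # arrival of Matthew, hours after 8:00
        if abs(x - y) <= WAIT:
            meets += 1
    return meets / trials


print(round(exact, 4), estimate_meeting_probability())
```

The estimate fluctuates around 0.36 with a statistical error that decreases roughly as the inverse square root of the number of trials.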


Another way to determine probabilities is to calculate them from other probabilities which are known. This involves paying attention to the structure of the experiment and application of the rules of probability theory (Appendix A6.4). For example, the predicted reliability of a system can be calculated from the reliability of its elements and the system's structure. However, there is often no alternative to determining probabilities as the limits of relative frequencies, with the aid of statistical methods (Appendices A6.11 and A8).

A6.3 Conditional Probability, Independence

The concept of conditional probability is of great importance in practical applications. It is not difficult to accept that the information "event $A$ has occurred in an experiment" can modify the probabilities of other events. These new probabilities are defined as conditional probabilities and denoted by $\Pr\{B \mid A\}$. If for example $A \subseteq B$, then $\Pr\{B \mid A\} = 1$, which is in general different from the original unconditional probability $\Pr\{B\}$. The concept of conditional probability $\Pr\{B \mid A\}$ of the event $B$ under the condition "event $A$ has occurred", is introduced here using the properties of relative frequency. Let $n$ be the total number of trial repetitions and let $n(A)$, $n(B)$, and $n(A \cap B)$ be the number of occurrences of $A$, $B$ and $A \cap B$, respectively, with $n(A) > 0$ assumed. When considering only the $n(A)$ trials (trials in which $A$ occurs), then $B$ occurs in these $n(A)$ trials exactly when it occurred together with $A$ in the original trial series, i.e. $n(A \cap B)$ times. The relative frequency of $B$ in the trials with the information "$A$ has occurred" is therefore

$$\frac{n(A \cap B)}{n(A)} = \frac{n(A \cap B)/n}{n(A)/n} = \frac{\hat{p}_n(A \cap B)}{\hat{p}_n(A)}. \qquad \text{(A6.4)}$$

Equation (A6.4) leads to the following definition of the conditional probability $\Pr\{B \mid A\}$ of an event $B$ under the condition $A$, i.e. assuming that $A$ has occurred,

$$\Pr\{B \mid A\} = \frac{\Pr\{A \cap B\}}{\Pr\{A\}}, \qquad \Pr\{A\} > 0. \qquad \text{(A6.5)}$$

From Eq. (A6.5) it follows that

$$\Pr\{A \cap B\} = \Pr\{A\} \Pr\{B \mid A\} = \Pr\{B\} \Pr\{A \mid B\}. \qquad \text{(A6.6)}$$
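Equations (A6.5) and (A6.6) can be illustrated on the throw of a fair die, using the classical probability of Eq. (A6.2) and exact fractions. The events and the helper names `pr` and `pr_given` are chosen here for illustration only.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))   # fair die, all outcomes equally likely


def pr(event):
    """Classical probability: favorable outcomes over possible outcomes."""
    return Fraction(len(event & omega), len(omega))


def pr_given(b, a):
    """Conditional probability Pr{B | A}; requires Pr{A} > 0."""
    if pr(a) == 0:
        raise ValueError("conditioning event must have Pr{A} > 0")
    return pr(a & b) / pr(a)


A = frozenset({4, 5, 6})   # "at least 4"
B = frozenset({2, 4, 6})   # "even number"

# Pr{A and B} = Pr{A} Pr{B | A} = Pr{B} Pr{A | B}
assert pr(A & B) == pr(A) * pr_given(B, A) == pr(B) * pr_given(A, B)

print(pr_given(B, A))  # 2/3
```

Conditioning on an event contained in `B`, for example `frozenset({4, 6})`, gives a conditional probability of 1, in agreement with the remark on $A \subseteq B$ above.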

Using Eq. (A6.5), probabilities $\Pr\{B \mid A\}$ will be defined for all $B \in \mathcal{F}$. $\Pr\{B \mid A\}$ is a function of $B$ which satisfies Axioms 1 to 3 of Appendix A6.2, obviously with $\Pr\{A \mid A\} = 1$. The information "event $A$ has occurred" thus leads to a new probability space $[A, \mathcal{F}_A, \Pr_A]$, where $\mathcal{F}_A$ consists of events of the form $A \cap B$, with $B \in \mathcal{F}$ and $\Pr_A\{B\} = \Pr\{B \mid A\}$, see Example A6.5.

It is reasonable to define the events $A$ and $B$ as independent if the information "event $A$ has occurred" does not influence the probability of the occurrence of event $B$, i.e. if

$$\Pr\{B \mid A\} = \Pr\{B\}. \qquad \text{(A6.7)}$$

However, when considering Eq. (A6.6), another definition, with symmetry in $A$ and $B$, is obtained, where $\Pr\{A\} > 0$ is not required. Two events $A$ and $B$ are independent if and only if

$$\Pr\{A \cap B\} = \Pr\{A\} \Pr\{B\}. \qquad \text{(A6.8)}$$

The events $A_1, \ldots, A_n$ are (stochastically) independent if for each $k$ ($1 < k \le n$) and any selection of distinct $i_1, \ldots, i_k \in \{1, \ldots, n\}$

$$\Pr\{A_{i_1} \cap \ldots \cap A_{i_k}\} = \Pr\{A_{i_1}\} \cdots \Pr\{A_{i_k}\} \qquad \text{(A6.9)}$$

holds.
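The requirement that the product rule hold for each $k$, and not only for pairs, is essential. A standard textbook illustration (not taken from the text above) uses two tosses of a fair coin: the three events below are pairwise independent but not independent as a triple. The names `events` and `product_rule_holds` are assumptions of this sketch.

```python
from fractions import Fraction
from itertools import combinations, product

omega = list(product("HT", repeat=2))   # two tosses of a fair coin, 4 equally likely outcomes


def pr(event):
    """Classical probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))


events = {
    "A": lambda w: w[0] == "H",    # first toss shows heads
    "B": lambda w: w[1] == "H",    # second toss shows heads
    "C": lambda w: w[0] == w[1],   # both tosses show the same face
}


def product_rule_holds(names):
    """Check Pr{intersection} == product of the individual probabilities."""
    joint = pr(lambda w: all(events[name](w) for name in names))
    prod = Fraction(1)
    for name in names:
        prod *= pr(events[name])
    return joint == prod


pairwise = all(product_rule_holds(pair) for pair in combinations(events, 2))
triple = product_rule_holds(tuple(events))

print(pairwise, triple)  # True False
```

Here each pair satisfies the product rule with probability 1/4, while the intersection of all three events has probability 1/4 instead of 1/8.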

A6.4 Fundamental Rules of Probability Theory

The probability calculation of event combinations is based on the fundamental rules of probability theory introduced in this section.

A6.4.1 Addition Theorem for Mutually Exclusive Events

The events $A$ and $B$ are mutually exclusive if the occurrence of one event excludes the occurrence of the other, formally $A \cap B = \emptyset$. Considering a component which can fail due to a short or an open circuit, the events

{failure occurs due to a short circuit}

and

{failure occurs due to an open circuit}

are mutually exclusive. Application of Axiom 3 (Appendix A6.2) leads to

$$\Pr\{A \cup B\} = \Pr\{A\} + \Pr\{B\}. \qquad \text{(A6.10)}$$

Equation (A6.10) is considered a theorem by tradition only; indeed, it is a particular case of Axiom 3 in Appendix A6.2.

Example A6.3

A shipment of 100 diodes contains 3 diodes with shorts and 2 diodes with Opens. If one diode is randomly selected from the shipment, what is the probability that it is defective?

Solution

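From Eqs. (A6.10) and (A6.2),

$\Pr\{\text{diode defective}\} = \frac{3}{100} + \frac{2}{100} = 0.05$.

If the events $A_1, A_2, \dots$ are mutually exclusive ($A_i \cap A_j = \emptyset$ for all $i \ne j$), they are also totally exclusive. According to Axiom 3 it follows that

$\Pr\{A_1 \cup A_2 \cup \dots\} = \sum_i \Pr\{A_i\}$.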

A6.4.2 Multiplication Theorem for Two Independent Events

The events A and B are independent if the information about occurrence (or nonoccurrence) of one event has no influence on the probability of occurrence of the other event. In this case Eq. (A6.8) applies.

Example A6.4 A system consists of two elements E1 and E2 necessary to fulfill the required function. The failure of one element has no influence on the other. R1 = 0.8 is the reliability of E1 and R2 = 0.9 is that of E2 . What is the reliability RS of the system?

Solution

Considering the assumed independence between the elements $E_1$ and $E_2$ and the definition of $R_1$, $R_2$, and $R_S$ as $R_1 = \Pr\{E_1 \text{ fulfills the required function}\}$, $R_2 = \Pr\{E_2 \text{ fulfills the required function}\}$, and $R_S = \Pr\{E_1 \text{ fulfills the required function} \cap E_2 \text{ fulfills the required function}\}$, one obtains from Eq. (A6.8)

$R_S = R_1 R_2 = 0.8 \cdot 0.9 = 0.72$.


A6.4.3 Multiplication Theorem for Arbitrary Events

For arbitrary events A and B, with Pr{A} > 0 and Pr{B} > 0, Eq. (A6.6) applies

$\Pr\{A \cap B\} = \Pr\{A\}\,\Pr\{B \mid A\} = \Pr\{B\}\,\Pr\{A \mid B\}$.

Example A6.5 2 ICs are randomly selected from a shipment of 95 good and 5 defective ICs. What is the probability of having (i) no defective ICs, and (ii) exactly one defective IC?

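Solution (i) From Eqs. (A6.6) and (A6.2),

$\Pr\{\text{first IC good} \cap \text{second IC good}\} = \frac{95}{100} \cdot \frac{94}{99} = 0.902$.

(ii) $\Pr\{\text{exactly one defective IC}\} = \Pr\{(\text{first IC good} \cap \text{second IC defective}) \cup (\text{first IC defective} \cap \text{second IC good})\}$; from Eqs. (A6.6) and (A6.2),

$\Pr\{\text{exactly one defective IC}\} = \frac{95}{100} \cdot \frac{5}{99} + \frac{5}{100} \cdot \frac{95}{99} = 0.096$.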

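Generalization of Eq. (A6.6) leads to the multiplication theorem

$\Pr\{A_1 \cap \dots \cap A_n\} = \Pr\{A_1\}\,\Pr\{A_2 \mid A_1\}\,\Pr\{A_3 \mid A_1 \cap A_2\} \cdots \Pr\{A_n \mid A_1 \cap \dots \cap A_{n-1}\}$.

Here, $\Pr\{A_1 \cap \dots \cap A_{n-1}\} > 0$ is assumed. An important special case arises when the events $A_1, \dots, A_n$ are (stochastically) independent; in this case Eq. (A6.9) yields

$\Pr\{A_1 \cap \dots \cap A_n\} = \Pr\{A_1\} \cdots \Pr\{A_n\}$.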

A6.4.4 Addition Theorem for Arbitrary Events

The probability of occurrence of at least one of the (possibly non-exclusive) events A and B is given by

$\Pr\{A \cup B\} = \Pr\{A\} + \Pr\{B\} - \Pr\{A \cap B\}$. (A6.13)

To prove this theorem, consider Axiom 3 (Appendix A6.2) and the partitioning of the events $A \cup B$ and $B$ into mutually exclusive events ($A \cup B = A \cup (\bar{A} \cap B)$ and $B = (A \cap B) \cup (\bar{A} \cap B)$).

Example A6.6 To increase the reliability of a system, 2 machines are used in active (parallel) redundancy. The reliability of each machine is 0.9 and each machine operates and fails independently of the other. What is the system's reliability?

Solution From Eqs. (A6.13) and (A6.8), $\Pr\{\text{the first machine fulfills the required function} \cup \text{the second machine fulfills the required function}\} = 0.9 + 0.9 - 0.9 \cdot 0.9 = 0.99$.

The addition theorem can be generalized to n arbitrary events. For n = 3 one obtains

$\Pr\{A \cup B \cup C\} = \Pr\{A \cup (B \cup C)\} = \Pr\{A\} + \Pr\{B \cup C\} - \Pr\{A \cap (B \cup C)\} = \Pr\{A\} + \Pr\{B\} + \Pr\{C\} - \Pr\{B \cap C\} - \Pr\{A \cap B\} - \Pr\{A \cap C\} + \Pr\{A \cap B \cap C\}$. (A6.14)

In general, $\Pr\{A_1 \cup \dots \cup A_n\}$ can be obtained by the so-called inclusion/exclusion method

$\Pr\{A_1 \cup \dots \cup A_n\} = \sum_{k=1}^{n} (-1)^{k+1} S_k$

with

$S_k = \sum_{1 \le i_1 < \dots < i_k \le n} \Pr\{A_{i_1} \cap \dots \cap A_{i_k}\}$.

It can be shown that $S = \Pr\{A_1 \cup \dots \cup A_n\} \le S_1$, $S \ge S_1 - S_2$, $S \le S_1 - S_2 + S_3$, etc. Although the upper bounds do not necessarily decrease and the lower bounds do not necessarily increase, a good approximation for $S$ often results from only a few $S_i$. For a further investigation one can use the Fréchet theorem $S_{k+1} \le S_k\,(n-k)/(k+1)$, which follows from $S_{k+1} = S_k \binom{n}{k+1} / \binom{n}{k} = S_k\,(n-k)/(k+1)$ for $A_1 = A_2 = \dots = A_n$.
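The inclusion/exclusion sums and the alternating bounds can be checked numerically. The following Python sketch assumes, purely for illustration, that the events are independent (so that every intersection probability is a product) and uses made-up event probabilities; `partial_sums` and `union_probability` are hypothetical helper names.

```python
from itertools import combinations
from math import prod

def partial_sums(p):
    """Return [S_1, ..., S_n] for independent events with probabilities p.

    S_k sums Pr{A_i1 n ... n A_ik} over all k-subsets; independence
    (an assumption of this sketch) turns each term into a product.
    """
    n = len(p)
    return [sum(prod(c) for c in combinations(p, k)) for k in range(1, n + 1)]

def union_probability(p):
    """Pr{A_1 U ... U A_n} as the alternating sum of the S_k."""
    return sum((-1) ** (k + 1) * s for k, s in enumerate(partial_sums(p), start=1))

p = [0.1, 0.2, 0.05, 0.15]      # hypothetical event probabilities
S = partial_sums(p)
exact = union_probability(p)

upper_1 = S[0]                   # S <= S_1
lower_2 = S[0] - S[1]            # S >= S_1 - S_2
upper_3 = S[0] - S[1] + S[2]     # S <= S_1 - S_2 + S_3

print(lower_2, exact, upper_3, upper_1)
```

For independent events the result can be cross-checked against $1 - \prod_i (1 - \Pr\{A_i\})$; already the first three sums bracket the exact value closely in this example.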

A6.4.5 Theorem of Total Probability

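Let $A_1, A_2, \dots$ be mutually exclusive events ($A_i \cap A_j = \emptyset$ for all $i \ne j$), $\Omega = A_1 \cup A_2 \cup \dots$, and $\Pr\{A_i\} > 0$, $i = 1, 2, \dots$. For an arbitrary event $B$ one has $B = B \cap \Omega = B \cap (A_1 \cup A_2 \cup \dots) = (B \cap A_1) \cup (B \cap A_2) \cup \dots$, where the events $B \cap A_1, B \cap A_2, \dots$ are mutually exclusive. Use of Axiom 3 (Appendix A6.2) and Eq. (A6.6) yields

$\Pr\{B\} = \sum_i \Pr\{B \cap A_i\} = \sum_i \Pr\{A_i\}\,\Pr\{B \mid A_i\}$. (A6.17)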

Equation (A6.17) expresses the theorem (or formula) of total probability.


Example A6.7 ICs are purchased from 3 suppliers ($A_1$, $A_2$, $A_3$) in quantities of 1000, 600, and 400 pieces, respectively. The probabilities for an IC to be defective are 0.006 for $A_1$, 0.02 for $A_2$, and 0.03 for $A_3$. The ICs are stored in a common container disregarding their source. What is the probability that one IC randomly selected from the stock is defective?

Solution From Eqs. (A6.17) and (A6.2),

$\Pr\{\text{the selected IC is defective}\} = \frac{1000}{2000} \cdot 0.006 + \frac{600}{2000} \cdot 0.02 + \frac{400}{2000} \cdot 0.03 = 0.015$.

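Equations (A6.17) and (A6.6) lead to Bayes theorem, which allows calculation of the a posteriori probability $\Pr\{A_k \mid B\}$, $k = 1, 2, \dots$, as a function of a priori probabilities $\Pr\{A_i\}$,

$\Pr\{A_k \mid B\} = \frac{\Pr\{A_k \cap B\}}{\Pr\{B\}} = \frac{\Pr\{A_k\}\,\Pr\{B \mid A_k\}}{\sum_i \Pr\{A_i\}\,\Pr\{B \mid A_i\}}$. (A6.18)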

Example A6.8

Let the IC as selected in Example A6.7 be defective. What is the probability that it is from supplier $A_1$?

Solution From Eq. (A6.18),

$\Pr\{\text{IC from } A_1 \mid \text{IC defective}\} = \frac{(1000/2000) \cdot 0.006}{0.015} = 0.2$.
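Examples A6.7 and A6.8 can be reproduced with exact rational arithmetic. The sketch below is only an illustration; the dictionary layout and variable names are choices made here, not part of the text.

```python
from fractions import Fraction

# Data of Example A6.7: pieces delivered and defect probability per supplier.
quantities = {"A1": 1000, "A2": 600, "A3": 400}
p_defective = {"A1": Fraction(6, 1000), "A2": Fraction(2, 100), "A3": Fraction(3, 100)}

total = sum(quantities.values())
prior = {s: Fraction(q, total) for s, q in quantities.items()}   # Pr{A_i}

# Theorem of total probability: Pr{B} = sum_i Pr{A_i} Pr{B | A_i}
p_b = sum(prior[s] * p_defective[s] for s in prior)

# Bayes theorem: Pr{A_k | B} = Pr{A_k} Pr{B | A_k} / Pr{B}
posterior = {s: prior[s] * p_defective[s] / p_b for s in prior}

print(float(p_b))                                   # 0.015
print({s: float(v) for s, v in posterior.items()})
```

Although supplier $A_1$ delivers half of the stock, only one fifth of the defective ICs are expected to come from it, because its defect probability is the lowest.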

A6.5 Random Variables, Distribution Functions

If the result of an experiment with a random outcome is a (real) number, then the underlying quantity is a (real) random variable. For example, the number appearing when throwing a die is a random variable taking on values in $\{1, \dots, 6\}$. Random variables are designated hereafter with Greek letters $\tau$, $\xi$, $\zeta$, etc. The triplet $[\Omega, \mathcal{F}, \Pr]$ introduced in Appendix A6.2 becomes $[\mathbb{R}, \mathcal{B}, \Pr]$, where $\mathbb{R} = (-\infty, \infty)$ and $\mathcal{B}$ is the smallest event field containing all (semi) intervals $(a, b]$ with $a < b$. The probabilities $\Pr\{A\} = \Pr\{\tau \in A\}$, $A \in \mathcal{B}$, define the distribution law of the random variable $\tau$. Among the many possibilities to characterize this distribution law, the most frequently used is to define

$F(t) = \Pr\{\tau \le t\}$. (A6.19)

$F(t)$ is called the distribution function of the random variable $\tau$ +). For each $t$, $F(t)$ gives the probability that the random variable will assume a value smaller than or equal to $t$. Because for $s > t$ one has $\{\tau \le t\} \subseteq \{\tau \le s\}$, $F(t)$ is a nondecreasing function. Moreover, $F(-\infty) = 0$ and $F(\infty) = 1$. If $\Pr\{\tau = t_0\} > 0$ holds, then $F(t)$ has a jump of height $\Pr\{\tau = t_0\}$ at $t_0$. It follows from the above definition and Axiom 3' (Appendix A6.2) that $F(t)$ is continuous from the right. Due to Axiom 2, $F(t)$ can have at most a countable number of jumps. The probability that the random variable $\tau$ takes on a value within the interval $(a, b]$ is given by

$\Pr\{a < \tau \le b\} = F(b) - F(a)$.

The following classes of random variables are of particular importance:

1. Discrete random variables: A random variable $\tau$ is discrete if it can only assume a finite or countable number of values, i.e. if there is a sequence $t_1, t_2, \dots$ such that

$p_k = \Pr\{\tau = t_k\}$, with $\sum_k p_k = 1$. (A6.20)

A discrete random variable is best described by a table

Values of $\tau$: $t_1$, $t_2$, ...
Probabilities: $p_1$, $p_2$, ...

The distribution function $F(t)$ of a discrete random variable $\tau$ is a step function

$F(t) = \sum_{k:\, t_k \le t} p_k$.

If the sequence $t_1, t_2, \dots$ is ordered so that $t_k < t_{k+1}$, then

$F(t) = \sum_{j \le k} p_j$, for $t_k \le t < t_{k+1}$. (A6.21)

If only the value $k = 1$ occurs in Eqs. (A6.21), $\tau$ is a constant ($\tau = t_1 = C$). A constant $C$ can thus be regarded as a random variable with distribution function

$F(t) = 0$ for $t < C$, $F(t) = 1$ for $t \ge C$.

An important special case of discrete random variables is that of arithmetic random variables. The random variable $\tau$ is arithmetic if it can take the values $\dots, -\Delta t, 0, \Delta t, \dots$, with probabilities

$p_i = \Pr\{\tau = i\,\Delta t\}$, $i = \dots, -1, 0, 1, \dots$.

+) From a mathematical point of view, the random variable $\tau$ is defined as a measurable mapping of $\Omega$ onto the axis of real numbers $\mathbb{R} = (-\infty, \infty)$, i.e. a mapping such that for each real value $x$ the set of $\omega$ for which $\{\tau = \tau(\omega) \le x\}$ belongs to $\mathcal{F}$; the distribution function of $\tau$ is then obtained by setting $F(t) = \Pr\{\tau \le t\} = \Pr\{\omega : \tau(\omega) \le t\}$.


2. Continuous random variables: The random variable $\tau$ is absolutely continuous if a function $f(x) \ge 0$ exists such that

$F(t) = \Pr\{\tau \le t\} = \int_{-\infty}^{t} f(x)\,dx$.

$f(t)$ is called (probability) density of the random variable $\tau$ and satisfies the condition

$\int_{-\infty}^{\infty} f(t)\,dt = 1$.

The distribution function $F(t)$ and the density $f(t)$ are related (almost everywhere) by (see Fig. A6.2 for an example)

$f(t) = dF(t)/dt$.

Mixed distribution functions, exhibiting both jumps and continuous growth, can occur in some applications. These distribution functions can generally be represented by a mixture (weighted sum) of discrete and continuous distribution functions (Eq. (A6.34)).

Figure A6.2 Relationship between the distribution function $F(t)$ and the density $f(t)$ for a continuous random variable $\tau > 0$

In reliability theory, $\tau > 0$ denotes (in this book) the failure-free time (failure-free operating time) of an item, distributed according to $F(t) = \Pr\{\tau \le t\}$ with $F(0) = 0$. The reliability function (survival function) $R(t)$ gives the probability that the item considered will operate failure-free in $(0, t]$; thus,

$F(t) = \Pr\{\tau \le t\}$, $R(t) = \Pr\{\tau > t\} = 1 - F(t)$, $\tau > 0$, $F(0) = 0$, $R(0) = 1$. (A6.24)

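The failure rate $\lambda(t)$ of an item exhibiting a continuous failure-free time $\tau$ is defined as

$\lambda(t) = \lim_{\delta t \downarrow 0} \frac{1}{\delta t}\,\Pr\{t < \tau \le t + \delta t \mid \tau > t\}$.

Calculation leads to (Eq. (A6.5) and Fig. A6.3)

$\lambda(t) = \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \cdot \frac{\Pr\{t < \tau \le t + \delta t \,\cap\, \tau > t\}}{\Pr\{\tau > t\}} = \lim_{\delta t \downarrow 0} \frac{1}{\delta t} \cdot \frac{\Pr\{t < \tau \le t + \delta t\}}{\Pr\{\tau > t\}}$,

and thus, assuming $F(t)$ derivable,

$\lambda(t) = \frac{f(t)}{1 - F(t)} = -\frac{dR(t)/dt}{R(t)}$. (A6.25)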

It is important to distinguish between failure rate $\lambda(t)$, as conditional density for failure in $(t, t + \delta t]$ given that the item was new at $t = 0$ and has not failed in $(0, t]$, and density $f(t)$, as unconditional density for failure in $(t, t + \delta t]$ given only that the item was new at $t = 0$ (assumed with $F(0) = 0$). The failure rate $\lambda(t)$ applies in particular to nonrepairable items. However, considering Eq. (A6.25) it can also be defined for repairable items which are as-good-as-new after repair (renewal), taking instead of $t$ the variable $x$ starting by $x = 0$ at each renewal (as for interarrival times). If a repairable item cannot be restored to be as-good-as-new after repair, failure intensity $z(t)$ (Eq. (A7.228)) has to be used (see p. 356 for a discussion).

Considering $R(0) = 1$, Eq. (A6.25) yields

$R(t) = e^{-\int_0^t \lambda(x)\,dx}$. (A6.26)

Thus, $\lambda(t)$ completely defines the reliability function $R(t)$. For practical applications it can be useful to know the probability for failure-free operation in $(0, t]$ given that the item has already operated a failure-free time $x_0 > 0$

$\Pr\{\tau > t + x_0 \mid \tau > x_0\} = R(t, x_0) = R(t + x_0)/R(x_0) = e^{-\int_{x_0}^{t + x_0} \lambda(x)\,dx}$. (A6.27)
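How a failure rate determines the reliability function, via $R(t) = \exp(-\int_0^t \lambda(x)\,dx)$ and the conditional reliability $R(t, x_0) = R(t + x_0)/R(x_0)$, can be sketched numerically. The linearly increasing failure rate below is a hypothetical choice made only for illustration, and the trapezoidal rule is one of many possible quadratures.

```python
import math

def reliability(failure_rate, t, steps=10_000):
    """R(t) = exp(-integral_0^t lambda(x) dx), integral by the trapezoidal rule."""
    h = t / steps
    integral = 0.5 * (failure_rate(0.0) + failure_rate(t))
    integral += sum(failure_rate(i * h) for i in range(1, steps))
    return math.exp(-integral * h)

def conditional_reliability(failure_rate, t, x0):
    """R(t, x0) = R(t + x0) / R(x0): survive a further t, given survival up to x0."""
    return reliability(failure_rate, t + x0) / reliability(failure_rate, x0)

def rate(t):
    """Hypothetical increasing failure rate lambda(t) = 2e-6 * t (per hour)."""
    return 2e-6 * t

r_new = reliability(rate, 500.0)                     # new item, 500 h mission
r_aged = conditional_reliability(rate, 500.0, 500.0) # same mission after 500 h of use
print(r_new, r_aged)
```

With this increasing failure rate the aged item has a lower probability of completing the same mission than a new one, which is the aging behavior discussed in the text.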

Figure A6.3 Visual aid to compute the failure rate $\lambda(t)$ ($\lambda(x)$ for interarrival times)


From Eq. (A6.27) it follows that

$-\frac{dR(t, x_0)/dt}{R(t, x_0)} = \lambda(t, x_0) = \lambda(t + x_0)$ and $E[\tau - x_0 \mid \tau > x_0] = \int_{x_0}^{\infty} R(x)\,dx \,/\, R(x_0)$. (A6.28)

From the left-hand side of Eq. (A6.28) one recognizes that the conditional failure rate $\lambda(t, x_0)$ at time $t$ given that the item has operated failure-free a time $x_0$ is the failure rate at time $t + x_0$ ($= \lambda$ for $\lambda(x) = \lambda$). This leads to the concept of bad-as-old used in some considerations on repairable systems [6.3] (see also p. 497).

Important conclusions as to the aging behavior of an item can be drawn from the shape of its failure rate. For $\lambda(t)$ nondecreasing, it follows for $u < s$ and $t > 0$ that

$\Pr\{\tau > t + u \mid \tau > u\} \ge \Pr\{\tau > t + s \mid \tau > s\}$. (A6.29)

For an item with increasing failure rate, inequality (A6.29) shows that the probability of surviving a further period $t$ decreases as a function of the achieved age, i.e. the item ages. The contrary holds for an item with decreasing failure rate.

No aging exists in the case of a constant failure rate, i.e. for $R(t) = e^{-\lambda t}$, yielding (memoryless property of the exponential distribution)

$\Pr\{\tau > t + x_0 \mid \tau > x_0\} = R(t, x_0) = R(t) = e^{-\lambda t}$. (A6.30)

For an arithmetic random variable, the failure rate is defined as

$\lambda(k) = \Pr\{\tau = k\,\Delta t \mid \tau > (k-1)\,\Delta t\} = p_k \,/\, \sum_{i \ge k} p_i$, $k = 1, 2, \dots$.

The following concepts are important to reliability theory (see also Eqs. (A6.78) & (A6.79) for the minimum $\tau_{\min}$ & maximum $\tau_{\max}$ of a set of random variables $\tau_1, \dots, \tau_n$):

1. Function of a random variable: If $u(x)$ is a monotonically increasing function and $\tau$ a continuous random variable with distribution function $F_\tau(t)$, then

$\Pr\{\tau \le t\} = \Pr\{\eta = u(\tau) \le u(t)\}$,

and the random variable $\eta = u(\tau)$ has distribution function

$F_\eta(t) = \Pr\{\eta = u(\tau) \le t\} = \Pr\{\tau \le u^{-1}(t)\} = F_\tau(u^{-1}(t))$, (A6.31)

where $u^{-1}$ is the inverse function of $u$ (Example A6.17). If $du(t)/dt$ exists,

$f_\eta(t) = f_\tau(u^{-1}(t)) \cdot du^{-1}(t)/dt$.

(For $u(\tau)$ monotonically decreasing, $|du^{-1}(t)/dt|$ has to be used for $f_\eta(t)$.)

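2. Distribution with random parameter: If the distribution function of $\tau$ depends on a parameter $\delta$ with density $f_\delta(x)$, then for $\tau$ it holds that

$F(t) = \Pr\{\tau \le t\} = \int F(t, x)\, f_\delta(x)\,dx$, with integration over the range of $\delta$.

3. Truncated distribution: In some practical applications it can be assumed that realizations $\le a$ or $> b$ of a random variable $\tau$ with distribution function $F(t)$ are discarded (e.g. lifetimes $\le 0$). For a truncated random variable it holds that

$F(t \mid a < \tau \le b) = 0$ for $t \le a$, $\;= \frac{F(t) - F(a)}{F(b) - F(a)}$ for $a < t \le b$, $\;= 1$ for $t > b$.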

4. Mixture of distributions: In many practical applications the situation arises in which two or more failure mechanisms have to be considered for a given item. The following are some examples for the case of two failure mechanisms (e.g. early failures and wearout, early failures and constant failure rate, etc.) appearing with distribution functions $F_1(t)$ and $F_2(t)$, respectively:

• for any given item, only early failures (with probability $p$) or wearout (with probability $1 - p$) can appear,

• both failure mechanisms can appear in any item,

• a percentage $p$ will show both failure mechanisms and $1 - p$ only one failure mechanism, e.g. wearout governed by $F_2(t)$.

The distribution function $F(t)$ of the failure-free time is in these cases:

$F(t) = p\,F_1(t) + (1 - p)\,F_2(t)$,

$F(t) = 1 - (1 - F_1(t))(1 - F_2(t)) = F_1(t) + F_2(t) - F_1(t)\,F_2(t)$,

$F(t) = p\,[F_1(t) + F_2(t) - F_1(t)\,F_2(t)] + (1 - p)\,F_2(t)$. (A6.34)

The first case gives a mixture with weights $p$ and $1 - p$ (Example 7.16). The second case corresponds to a series model with two independent elements (Eq. (2.17)). The third case is a combination of both previous cases.
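The three cases can be compared numerically. The sketch below assumes, for illustration only, Weibull-type distribution functions of the form $1 - e^{-(\lambda t)^\beta}$ for early failures and wearout with hypothetical parameters; the function names are likewise illustrative.

```python
import math

def weibull_cdf(lam, beta):
    """Return F(t) = 1 - exp(-(lam*t)**beta) as a function of t."""
    return lambda t: 1.0 - math.exp(-((lam * t) ** beta))

def mixture(t, p, F1, F2):
    """Case 1: an item shows either mechanism 1 (prob. p) or mechanism 2 (prob. 1 - p)."""
    return p * F1(t) + (1 - p) * F2(t)

def series(t, F1, F2):
    """Case 2: both mechanisms act independently in every item (series model)."""
    return 1.0 - (1.0 - F1(t)) * (1.0 - F2(t))

def partial_series(t, p, F1, F2):
    """Case 3: a fraction p shows both mechanisms, the rest only mechanism 2."""
    return p * series(t, F1, F2) + (1 - p) * F2(t)

F_early = weibull_cdf(lam=1e-3, beta=0.5)   # hypothetical early failures
F_wear = weibull_cdf(lam=1e-4, beta=3.0)    # hypothetical wearout

for t in (100.0, 1_000.0, 10_000.0):
    print(t, mixture(t, 0.05, F_early, F_wear),
          series(t, F_early, F_wear),
          partial_series(t, 0.05, F_early, F_wear))
```

For the same $p$, the series case always gives the largest failure probability, since every item is exposed to both mechanisms.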

The main properties of the distribution functions frequently used in reliability theory are summarized in Table A6.1 and discussed in Appendix A6.10.

A6.6 Numerical Parameters of Random Variables

For a rough characterization of a random variable $\tau$, some typical values such as the expected value (mean), variance, and median can be used.

A6.6.1 Expected Value (Mean)

For a discrete random variable $\tau$ taking values $t_1, t_2, \dots$, with probabilities $p_1, p_2, \dots$, the expected value or mean $E[\tau]$ is given by

$E[\tau] = \sum_k t_k\,p_k$, (A6.35)

provided the series converges absolutely. If $\tau$ only takes the values $t_1, \dots, t_m$, Eq. (A6.35) can be heuristically explained as follows. Consider $n$ repetitions of a trial whose outcome is $\tau$ and assume that $k_1$ times the value $t_1$, ..., $k_m$ times the value $t_m$ has been observed ($n = k_1 + \dots + k_m$); the arithmetic mean of the observed values is

$(t_1 k_1 + \dots + t_m k_m)/n = t_1 k_1/n + \dots + t_m k_m/n$.

As $n \to \infty$, $k_i/n$ converges to $p_i$ (Eq. (A6.146)), and the arithmetic mean obtained above tends towards the expected value $E[\tau]$ given by Eq. (A6.35). For this reason, the terms expected value and mean are often used for the same quantity $E[\tau]$. From Eq. (A6.35), the mean of a constant $C$ is the constant itself, i.e. $E[C] = C$.

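The mean of a continuous random variable $\tau$ with density $f(t)$ is given by

$E[\tau] = \int_{-\infty}^{\infty} t\,f(t)\,dt$, (A6.36)

provided the integral converges absolutely. For positive continuous random variables, Eq. (A6.36) reduces to

$E[\tau] = \int_0^{\infty} t\,f(t)\,dt$, (A6.37)

which, for $E[\tau] < \infty$, can be expressed (Example A6.9) as

$E[\tau] = \int_0^{\infty} (1 - F(t))\,dt = \int_0^{\infty} R(t)\,dt$. (A6.38)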

Example A6.9 Prove the equivalence of Eqs. (A6.37) and (A6.38).

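Solution

$R(t) = 1 - F(t) = \int_t^{\infty} f(x)\,dx$ yields

$\int_0^{\infty} R(t)\,dt = \int_0^{\infty} \Big( \int_t^{\infty} f(x)\,dx \Big)\,dt$.

Changing the order of integration it follows that (see graph)

$\int_0^{\infty} R(t)\,dt = \int_0^{\infty} \Big( \int_0^{x} dt \Big)\, f(x)\,dx = \int_0^{\infty} x\,f(x)\,dx$.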
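The equivalence of Eqs. (A6.37) and (A6.38) can also be checked numerically. The sketch below uses an exponentially distributed failure-free time with a hypothetical rate of 0.5 per hour, truncates the integrals at a point where the tail is negligible, and applies a plain trapezoidal rule; `trapezoid` is an illustrative helper.

```python
import math

def trapezoid(g, a, b, steps=100_000):
    """Composite trapezoidal rule for the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, steps)))

lam = 0.5                                   # hypothetical failure rate (1/h)

def density(t):
    return lam * math.exp(-lam * t)         # f(t) of the exponential distribution

def survival(t):
    return math.exp(-lam * t)               # R(t) = 1 - F(t)

T = 80.0                                    # truncation point; exp(-40) is negligible

mean_from_density = trapezoid(lambda t: t * density(t), 0.0, T)   # Eq. (A6.37)
mean_from_reliability = trapezoid(survival, 0.0, T)               # Eq. (A6.38)

print(mean_from_density, mean_from_reliability, 1.0 / lam)
```

Both integrals agree with the known mean $1/\lambda$ of the exponential distribution up to the quadrature error.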

Table A6.1 Distribution functions used in reliability analysis (with $x$ instead of $t$ for interarrival times)

Columns: Name; Distribution function $F(t) = \Pr\{\tau \le t\}$; Density $f(t) = dF(t)/dt$; Parameter range; Failure rate $\lambda(t) = f(t)/(1 - F(t))$; Mean $E[\tau]$; Properties

Exponential: memoryless, $\Pr\{\tau > t + x_0 \mid \tau > x_0\} = \Pr\{\tau > t\} = e^{-\lambda t}$.

Weibull: monotonic failure rate, increasing for $\beta > 1$ ($\lambda(0) = 0$, $\lambda(\infty) = \infty$), decreasing for $\beta < 1$ ($\lambda(0) = \infty$, $\lambda(\infty) = 0$).

Gamma: Laplace transform exists, $\tilde{f}(s) = \lambda^\beta / (s + \lambda)^\beta$; monotonic failure rate with $\lambda(\infty) = \lambda$; exponential for $\beta = 1$, Erlangian for $\beta = n = 2, 3, \dots$ (sum of $n$ exponentially distributed random variables).

Chi-square ($\chi^2$): $t > 0$, $F(0) = 0$, $\nu = 1, 2, \dots$ (degrees of freedom); Gamma with $\beta = \nu/2$ and $\lambda = 1/2$.

Normal: $F(t) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{t} e^{-(x - m)^2 / 2\sigma^2}\,dx$.

Lognormal: $\ln \tau$ has a normal distribution; $F(t) = \Phi(\ln(\lambda t)/\sigma)$.

Binomial: $p_i = \Pr\{i \text{ successes in } n \text{ Bernoulli trials}\}$ ($n$ independent trials with $\Pr\{A\} = p$); random sample with replacement; failure rate not relevant.

Poisson: failure rate not relevant.

Geometric: $\Pr\{\zeta \le k\} = \sum_{i=1}^{k} p_i = 1 - (1 - p)^k$, $p_i = p\,(1 - p)^{i-1}$; memoryless, $\Pr\{\zeta > i + j \mid \zeta > i\} = (1 - p)^j$; $p_i = \Pr\{\text{first success in a sequence of Bernoulli trials occurs at the } i\text{th trial}\}$.

Hypergeometric: $\Pr\{\zeta \le k\} = \sum_{i=0}^{k} \binom{K}{i} \binom{N - K}{n - i} / \binom{N}{n}$; random sample without replacement; failure rate not relevant.

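For the expected value of the random variable $\eta = u(\tau)$

$E[\eta] = \sum_k u(t_k)\,p_k$ or $E[\eta] = \int_{-\infty}^{\infty} u(t)\,f(t)\,dt$ (A6.39)

holds, provided that series and integral converge absolutely. Two particular cases of Eq. (A6.39) are:

1. $u(x) = Cx$,

$E[C\tau] = \int_{-\infty}^{\infty} C\,t\,f(t)\,dt = C\,E[\tau]$. (A6.40)

2. $u(x) = x^k$, which leads to the $k$th moment of $\tau$,

$E[\tau^k] = \int_{-\infty}^{\infty} t^k f(t)\,dt$, $k > 1$. (A6.41)

Further important properties of the mean are given by Eqs. (A6.68) and (A6.69).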

A6.6.2 Variance

The variance of a random variable $\tau$ is a measure of the spread (or dispersion) of the random variable around its mean $E[\tau]$. Variance is defined as

$\mathrm{Var}[\tau] = E[(\tau - E[\tau])^2]$, (A6.42)

and can be calculated as

$\mathrm{Var}[\tau] = \sum_k (t_k - E[\tau])^2\,p_k$ (A6.43)

for a discrete random variable, and as

$\mathrm{Var}[\tau] = \int_{-\infty}^{\infty} (t - E[\tau])^2 f(t)\,dt$ (A6.44)

for a continuous random variable. In both cases,

$\mathrm{Var}[\tau] = E[\tau^2] - (E[\tau])^2$. (A6.45)

If $E[\tau]$ or $\mathrm{Var}[\tau]$ is infinite, $\tau$ is said to have an infinite variance. For arbitrary constants $C$ and $A$, Eqs. (A6.45) and (A6.40) yield

$\mathrm{Var}[C\tau - A] = C^2\,\mathrm{Var}[\tau]$

and

$\mathrm{Var}[C] = 0$.

The quantity

$\sigma = \sqrt{\mathrm{Var}[\tau]}$

is the standard deviation of $\tau$ and, for $\tau \ge 0$,

$\kappa = \sigma / E[\tau]$

is the coefficient of variation of $\tau$. The random variable

$(\tau - E[\tau]) / \sigma$

has mean 0 and variance 1, and is a standardized random variable. A good understanding of the variance as a measure of dispersion is given by the Chebyshev inequality, which states (Example A6.10) that for every $\varepsilon > 0$

$\Pr\{|\tau - E[\tau]| > \varepsilon\} \le \mathrm{Var}[\tau] / \varepsilon^2$. (A6.49)

The Chebyshev inequality (known also as Bienaymé-Chebyshev inequality) is more useful in proving convergence than as an approximation. Further important properties of the variance are given by Eqs. (A6.70) and (A6.71).

Example A6.10

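Prove the Chebyshev inequality for a continuous random variable (Eq. (A6.49)).

Solution For a continuous random variable $\tau$ with density $f(t)$, the definition of the variance implies

$\Pr\{|\tau - E[\tau]| > \varepsilon\} = \int_{|t - E[\tau]| > \varepsilon} f(t)\,dt \le \int_{|t - E[\tau]| > \varepsilon} \frac{(t - E[\tau])^2}{\varepsilon^2} f(t)\,dt \le \int_{-\infty}^{\infty} \frac{(t - E[\tau])^2}{\varepsilon^2} f(t)\,dt = \frac{\mathrm{Var}[\tau]}{\varepsilon^2}$,

which proves Eq. (A6.49).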
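How conservative the Chebyshev bound is can be illustrated for an exponentially distributed $\tau$, for which $E[\tau] = 1/\lambda$, $\mathrm{Var}[\tau] = 1/\lambda^2$, and the exact two-sided tail is available in closed form. The rate value and function names below are assumptions of this sketch.

```python
import math

def exact_tail(lam, eps):
    """Pr{|tau - E[tau]| > eps} for an exponential distribution with rate lam."""
    mean = 1.0 / lam
    upper = math.exp(-lam * (mean + eps))                # Pr{tau > mean + eps}
    # Pr{tau < mean - eps}; this part vanishes once eps reaches the mean.
    lower = 1.0 - math.exp(-lam * (mean - eps)) if eps < mean else 0.0
    return lower + upper

def chebyshev_bound(lam, eps):
    """Var[tau] / eps**2 with Var[tau] = 1 / lam**2."""
    return (1.0 / lam ** 2) / eps ** 2

lam = 1e-3                                               # hypothetical rate (1/h)
for k in (0.5, 1.0, 2.0, 3.0):                           # eps in multiples of sigma
    eps = k / lam
    print(k, exact_tail(lam, eps), chebyshev_bound(lam, eps))
```

For $\varepsilon = 2\sigma$ the exact probability is $e^{-3} \approx 0.05$ while the bound gives 0.25, which illustrates why the inequality is more useful for convergence proofs than as an approximation.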

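Generalization of the exponent in Eqs. (A6.43) and (A6.44) leads to the $k$th central moment of $\tau$

$E[(\tau - E[\tau])^k] = \int_{-\infty}^{\infty} (t - E[\tau])^k f(t)\,dt$, $k > 1$. (A6.50)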

A6.6.3 Modal Value, Quantile, Median

In addition to the moments discussed in Appendices A6.6.1 and A6.6.2, the modal value, quantile, and median are defined as follows:

1. For a continuous random variable $\tau$, the modal value is the value of $t$ for which $f(t)$ reaches its maximum; the distribution of $\tau$ is multimodal if $f(t)$ exhibits more than one maximum.

2. The $q$ quantile is the value $t_q$ for which $F(t)$ reaches the value $q$, $t_q = \inf\{t : F(t) \ge q\}$; in general, $F(t_q) = q$ for a continuous random variable ($t_p$, for which $1 - F(t_p) = Q(t_p) = p$, is termed percentage point).

3. The 0.5 quantile ($t_{0.5}$) is the median.

A6.7 Multidimensional Random Variables, Conditional Distributions

Multidimensional random variables (random vectors) are often required in reliability and availability investigations of repairable systems. For random vectors, the outcome of an experiment is an element of the $n$-dimensional space $\mathbb{R}^n$. The probability space $[\Omega, \mathcal{F}, \Pr]$ introduced in Appendix A6.1 becomes $[\mathbb{R}^n, \mathcal{B}^n, \Pr]$, where $\mathcal{B}^n$ is the smallest event field which contains all "intervals" of the form $(a_1, b_1] \times \dots \times (a_n, b_n] = \{(t_1, \dots, t_n) : t_i \in (a_i, b_i],\ i = 1, \dots, n\}$. Random vectors are designated by Greek letters with an arrow ($\vec{\tau} = (\tau_1, \dots, \tau_n)$, $\vec{\xi} = (\xi_1, \dots, \xi_n)$, etc.). The probabilities $\Pr\{A\} = \Pr\{\vec{\tau} \in A\}$, $A \in \mathcal{B}^n$, define the distribution law of $\vec{\tau}$. The function

$F(t_1, \dots, t_n) = \Pr\{\tau_1 \le t_1, \dots, \tau_n \le t_n\}$, (A6.51)

where $\{\tau_1 \le t_1, \dots, \tau_n \le t_n\} \equiv \{(\tau_1 \le t_1) \cap \dots \cap (\tau_n \le t_n)\}$, is the distribution function of the random vector $\vec{\tau}$, known as joint distribution function of $\tau_1, \dots, \tau_n$. $F(t_1, \dots, t_n)$ is:

• monotonically nondecreasing in each variable,

• zero (in the limit) if at least one variable goes to $-\infty$,

• one (in the limit) if all variables go to $\infty$,

• continuous from the right in each variable,

• such that the probabilities $\Pr\{a_1 < \tau_1 \le b_1, \dots, a_n < \tau_n \le b_n\}$, calculated for arbitrary $a_1, \dots, a_n, b_1, \dots, b_n$ with $a_i < b_i$, are not negative; for example, $n = 2$ yields $\Pr\{a_1 < \tau_1 \le b_1,\ a_2 < \tau_2 \le b_2\} = F(b_1, b_2) - F(a_1, b_2) - F(b_1, a_2) + F(a_1, a_2)$, see graph.


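It can be shown that every component $\tau_i$ of $\vec{\tau} = (\tau_1, \dots, \tau_n)$ is a random variable with distribution function (marginal distribution function)

$F_i(t_i) = \Pr\{\tau_i \le t_i\} = F(\infty, \dots, \infty, t_i, \infty, \dots, \infty)$. (A6.52)

The components $\tau_1, \dots, \tau_n$ of $\vec{\tau}$ are (stochastically) independent if and only if, for any $n$ and $n$-tuple $(t_1, \dots, t_n) \in \mathbb{R}^n$,

$F(t_1, \dots, t_n) = \prod_{i=1}^{n} F_i(t_i)$. (A6.53)

It can be shown that Eq. (A6.53) is equivalent to

$\Pr\{\bigcap_{i=1}^{n} (\tau_i \in B_i)\} = \prod_{i=1}^{n} \Pr\{\tau_i \in B_i\}$

for every $B_i \in \mathcal{B}$.

The random vector $\vec{\tau} = (\tau_1, \dots, \tau_n)$ is absolutely continuous if a function $f(x_1, \dots, x_n) \ge 0$ exists such that for any $n$ and $n$-tuple $t_1, \dots, t_n$

$F(t_1, \dots, t_n) = \int_{-\infty}^{t_1} \dots \int_{-\infty}^{t_n} f(x_1, \dots, x_n)\,dx_1 \dots dx_n$.

$f(x_1, \dots, x_n)$ is the density of $\vec{\tau}$, known also as joint density of $\tau_1, \dots, \tau_n$, and satisfies the condition

$\int_{-\infty}^{\infty} \dots \int_{-\infty}^{\infty} f(x_1, \dots, x_n)\,dx_1 \dots dx_n = 1$.

For any subset $A \in \mathcal{B}^n$, it follows that

$\Pr\{(\tau_1, \dots, \tau_n) \in A\} = \int \dots \int_A f(t_1, \dots, t_n)\,dt_1 \dots dt_n$. (A6.55)

The density of $\tau_i$ (marginal density) can be obtained from $f(t_1, \dots, t_n)$ as

$f_i(t_i) = \int_{-\infty}^{\infty} \dots \int_{-\infty}^{\infty} f(t_1, \dots, t_n)\,dt_1 \dots dt_{i-1}\,dt_{i+1} \dots dt_n$.

The components $\tau_1, \dots, \tau_n$ of a continuous random vector $\vec{\tau}$ are (stochastically) independent if and only if, for any $n$ and $n$-tuple $(t_1, \dots, t_n) \in \mathbb{R}^n$,

$f(t_1, \dots, t_n) = \prod_{i=1}^{n} f_i(t_i)$.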

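For a two dimensional continuous random vector $\vec{\tau} = (\tau_1, \tau_2)$, the function

$f_2(t_2 \mid t_1) = \frac{f(t_1, t_2)}{f_1(t_1)}$ (A6.58)

is the conditional density of $\tau_2$ under the condition $\tau_1 = t_1$, with $f_1(t_1) > 0$. Similarly, $f_1(t_1 \mid t_2) = f(t_1, t_2)/f_2(t_2)$ is the conditional density for $\tau_1$ given $\tau_2 = t_2$, with $f_2(t_2) > 0$. For the marginal density of $\tau_2$ it follows that

$f_2(t_2) = \int_{-\infty}^{\infty} f(t_1, t_2)\,dt_1 = \int_{-\infty}^{\infty} f_1(t_1)\,f_2(t_2 \mid t_1)\,dt_1$. (A6.59)

Therefore, for any $A \in \mathcal{B}$

$\Pr\{\tau_2 \in A\} = \int_A f_2(t_2)\,dt_2 = \int_A \Big( \int_{-\infty}^{\infty} f_1(t_1)\,f_2(t_2 \mid t_1)\,dt_1 \Big)\,dt_2$, (A6.60)

and in particular

$F_2(t) = \Pr\{\tau_2 \le t\} = \int_{-\infty}^{t} f_2(t_2)\,dt_2 = \int_{-\infty}^{t} \Big( \int_{-\infty}^{\infty} f_1(t_1)\,f_2(t_2 \mid t_1)\,dt_1 \Big)\,dt_2$. (A6.61)

Equations (A6.58) & (A6.59) lead to the Bayes theorem for continuous random variables, $f_2(t_2 \mid t_1) = f_2(t_2)\,f_1(t_1 \mid t_2) \,/ \int_{-\infty}^{\infty} f_2(t_2)\,f_1(t_1 \mid t_2)\,dt_2$, used in Bayesian statistics.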

A6.8 Numerical Parameters of Random Vectors

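Let $\vec{\tau} = (\tau_1, \dots, \tau_n)$ be a random vector, and $u$ a real-valued function in $\mathbb{R}^n$. The expected value or mean of the random variable $u(\vec{\tau})$ is

$E[u(\vec{\tau})] = \sum_{k_1} \dots \sum_{k_n} u(t_{1,k_1}, \dots, t_{n,k_n})\,p(k_1, \dots, k_n)$ (A6.62)

for the discrete case and

$E[u(\vec{\tau})] = \int_{-\infty}^{\infty} \dots \int_{-\infty}^{\infty} u(t_1, \dots, t_n)\,f(t_1, \dots, t_n)\,dt_1 \dots dt_n$ (A6.63)

for the continuous case, assuming that series and integral converge absolutely. The conditional expected value of $\tau_2$ given $\tau_1 = t_1$ follows, in the continuous case, from Eqs. (A6.36) and (A6.58) as

$E[\tau_2 \mid \tau_1 = t_1] = \int_{-\infty}^{\infty} t_2\,f_2(t_2 \mid t_1)\,dt_2$. (A6.64)

Thus the unconditional expected value of $\tau_2$ can be obtained from

$E[\tau_2] = \int_{-\infty}^{\infty} E[\tau_2 \mid \tau_1 = t_1]\,f_1(t_1)\,dt_1$. (A6.65)

Equation (A6.65) is known as the formula of total expectation and is useful in practical applications.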

A6.8.1 Covariance Matrix, Correlation Coefficient

Assuming for $\vec{\tau} = (\tau_1, \dots, \tau_n)$ that $\mathrm{Var}[\tau_i] < \infty$, $i = 1, \dots, n$, an important rough characterization of a random vector is the covariance matrix $|a_{ij}|$, where

$a_{ij} = \mathrm{Cov}[\tau_i, \tau_j] = E[(\tau_i - E[\tau_i])(\tau_j - E[\tau_j])]$

are given in the continuous case by

$a_{ij} = \int_{-\infty}^{\infty} \dots \int_{-\infty}^{\infty} (t_i - E[\tau_i])(t_j - E[\tau_j])\,f(t_1, \dots, t_n)\,dt_1 \dots dt_n$.

The diagonal elements of the covariance matrix are the variances of components $\tau_i$, $i = 1, \dots, n$. Elements outside the diagonal give a measure of the degree of dependency between components (obviously $a_{ij} = a_{ji}$). For $\tau_i$ independent of $\tau_j$, $a_{ij} = a_{ji} = 0$ holds.

For a two dimensional random vector $\vec{\tau} = (\tau_1, \tau_2)$, the quantity

$\rho = \frac{\mathrm{Cov}[\tau_1, \tau_2]}{\sigma_1\,\sigma_2}$

is the correlation coefficient of the random variables $\tau_1$ and $\tau_2$, provided $\sigma_i = \sqrt{\mathrm{Var}[\tau_i]} < \infty$, $i = 1, 2$.

The main properties of the correlation coefficient are:

1. $|\rho| \le 1$.

2. If $\tau_1$ and $\tau_2$ are independent, then $\rho = 0$.

3. $\rho = \pm 1$ if and only if $\tau_1$ and $\tau_2$ are linearly dependent.


A6.8.2 Further Properties of Expected Value and Variance

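Let $\tau_1, \dots, \tau_n$ be arbitrary random variables (components of a random vector $\vec{\tau}$) having finite variances and $C_1, \dots, C_n$ constants. From Eqs. (A6.62) or (A6.63) and (A6.40) it follows that

$E[C_1 \tau_1 + \dots + C_n \tau_n] = C_1 E[\tau_1] + \dots + C_n E[\tau_n]$. (A6.68)

If $\tau_1$ and $\tau_2$ are independent then, from Eq. (A6.63) and Eq. (A6.45),

$E[\tau_1 \tau_2] = E[\tau_1]\,E[\tau_2]$ and $\mathrm{Var}[\tau_1 \tau_2] = E[\tau_1^2]\,E[\tau_2^2] - E^2[\tau_1]\,E^2[\tau_2]$. (A6.69)

The variance of a sum of independent random variables $\tau_1, \dots, \tau_n$ is obtained from Eqs. (A6.62) or (A6.63) and (A6.69) as

$\mathrm{Var}[\tau_1 + \dots + \tau_n] = \mathrm{Var}[\tau_1] + \dots + \mathrm{Var}[\tau_n]$. (A6.70)

For a sum of arbitrary random variables $\tau_1, \dots, \tau_n$, the variance can be obtained for $i, j \in \{1, \dots, n\}$ as

$\mathrm{Var}[\tau_1 + \dots + \tau_n] = \sum_{j=1}^{n} \mathrm{Var}[\tau_j] + \sum_{i \ne j} \mathrm{Cov}[\tau_i, \tau_j]$. (A6.71)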

A6.9 Distribution of the Sum of Independent Positive Random Variables and of $\tau_{\min}$, $\tau_{\max}$

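Let $\tau_1$ and $\tau_2$ be independent non-negative arithmetic random variables with $a_i = \Pr\{\tau_1 = i\}$, $b_i = \Pr\{\tau_2 = i\}$, $i = 0, 1, \dots$. Obviously, $\tau_1 + \tau_2$ is also arithmetic, and therefore

$c_k = \Pr\{\tau_1 + \tau_2 = k\} = \Pr\{\bigcup_{i=0}^{k} (\tau_1 = i \cap \tau_2 = k - i)\} = \sum_{i=0}^{k} a_i\,b_{k-i}$. (A6.72)

The sequence $c_0, c_1, \dots$ is the convolution of the sequences $a_0, a_1, \dots$ and $b_0, b_1, \dots$. Now, let $\tau_1$ and $\tau_2$ be two independent positive continuous random variables with distribution functions $F_1(t)$, $F_2(t)$ and densities $f_1(t)$, $f_2(t)$, respectively ($F_1(0) = F_2(0) = 0$). Using Eq. (A6.55), it can be shown (Example A6.11 and Fig. A6.4) that for the distribution of $\eta = \tau_1 + \tau_2$

$F_\eta(t) = \Pr\{\eta \le t\} = \int_0^t f_1(x)\,F_2(t - x)\,dx$ (A6.73)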


Figure A6.4 Visual aid to compute the distribution of $\eta = \tau_1 + \tau_2$ ($\tau_1, \tau_2 > 0$)

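holds, and

$f_\eta(t) = \int_0^t f_1(x)\,f_2(t - x)\,dx$. (A6.74)

The extension to two independent continuous random variables $\tau_1$ and $\tau_2$ defined over $(-\infty, \infty)$ leads to

$F_\eta(t) = \int_{-\infty}^{\infty} f_1(x)\,F_2(t - x)\,dx$ and $f_\eta(t) = \int_{-\infty}^{\infty} f_1(x)\,f_2(t - x)\,dx$.

The right-hand side of Eq. (A6.74) represents the convolution of the densities $f_1(t)$ and $f_2(t)$, and will be denoted by

$\int_0^t f_1(x)\,f_2(t - x)\,dx = f_1(t) * f_2(t)$.

The Laplace transform (Appendix A9.7) of $f_\eta(t)$ is thus the product of the Laplace transforms of $f_1(t)$ and $f_2(t)$

$\tilde{f}_\eta(s) = \tilde{f}_1(s)\,\tilde{f}_2(s)$.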
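For arithmetic random variables, the convolution of the sequences $a_0, a_1, \dots$ and $b_0, b_1, \dots$ is a double loop over the two probability lists. The following sketch uses made-up probabilities; `convolve` is an illustrative helper written out by hand rather than taken from a library.

```python
def convolve(a, b):
    """Return c with c[k] = sum_i a[i] * b[k - i], the distribution of tau_1 + tau_2."""
    c = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            c[i + j] += a_i * b_j
    return c

a = [0.5, 0.3, 0.2]        # hypothetical Pr{tau_1 = 0}, Pr{tau_1 = 1}, Pr{tau_1 = 2}
b = [0.6, 0.4]             # hypothetical Pr{tau_2 = 0}, Pr{tau_2 = 1}
c = convolve(a, b)

print(c)                   # probabilities Pr{tau_1 + tau_2 = k}, k = 0, ..., 3
```

The resulting sequence again sums to 1, and repeated application gives the distribution of a sum of more than two independent arithmetic random variables.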

Example A6.11

Prove Eq. (A6.74).

Solution Let $\tau_1$ and $\tau_2$ be two independent positive and continuous random variables with distribution functions $F_1(t)$, $F_2(t)$ and densities $f_1(t)$, $f_2(t)$, respectively ($F_1(0) = F_2(0) = 0$). From Eq. (A6.55) with $f(x, y) = f_1(x)\,f_2(y)$ it follows that (see also the graph)

$F_\eta(t) = \Pr\{\eta = \tau_1 + \tau_2 \le t\} = \iint_{x + y \le t} f_1(x)\,f_2(y)\,dx\,dy = \int_0^t \Big( \int_0^{t - x} f_2(y)\,dy \Big)\,f_1(x)\,dx = \int_0^t F_2(t - x)\,f_1(x)\,dx$,

which proves Eq. (A6.73). Eq. (A6.74) follows with $F_2(0) = 0$ (Eq. (A6.74) follows also from Eq. (A6.65)).

Sums of positive random variables occur in reliability theory when investigating repairable systems (e.g. Example 6.12). For $n \ge 2$, the density $f_\eta(t)$ of $\eta = \tau_1 + \dots + \tau_n$ for independent positive continuous random variables $\tau_1, \dots, \tau_n$ follows as

$f_\eta(t) = f_1(t) * \dots * f_n(t)$. (A6.77)

Example A6.12 Two machines are used to increase the reliability of a system. The first is switched on at time $t = 0$, and the second at the time of failure of the first one (standby redundancy). The failure-free times of the machines, denoted by $\tau_1$ and $\tau_2$, are independent and exponentially distributed with parameter $\lambda$ (Eq. (A6.81)). What is the reliability function $R_S(t)$ of the system?

Solution From $R_S(t) = \Pr\{\tau_1 + \tau_2 > t\} = 1 - \Pr\{\tau_1 + \tau_2 \le t\}$ and Eq. (A6.73) it follows that

$R_S(t) = 1 - \int_0^t \lambda e^{-\lambda x}\,(1 - e^{-\lambda (t - x)})\,dx = e^{-\lambda t} + \lambda t\,e^{-\lambda t}$.

$R_S(t)$ gives the probability for no failures ($e^{-\lambda t}$) or exactly one failure ($\lambda t\,e^{-\lambda t}$) in $(0, t]$.
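The closed-form result of Example A6.12 can be cross-checked by evaluating the convolution integral numerically. The sketch below uses a midpoint rule and a hypothetical failure rate; both function names are illustrative.

```python
import math

def standby_reliability(lam, t):
    """Closed form of Example A6.12: exp(-lam*t) * (1 + lam*t)."""
    return math.exp(-lam * t) * (1.0 + lam * t)

def standby_reliability_numeric(lam, t, steps=20_000):
    """1 - integral_0^t f1(x) F2(t - x) dx for exponential f1 and F2 (midpoint rule)."""
    h = t / steps
    acc = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        acc += lam * math.exp(-lam * x) * (1.0 - math.exp(-lam * (t - x)))
    return 1.0 - acc * h

lam = 1e-3                  # hypothetical failure rate (1/h)
t = 1_000.0                 # mission duration (h)

print(standby_reliability(lam, t))          # closed form
print(standby_reliability_numeric(lam, t))  # numerical convolution
print(math.exp(-lam * t))                   # single machine, for comparison
```

For $\lambda t = 1$ the standby pair reaches $2 e^{-1} \approx 0.74$, twice the reliability of a single machine over the same mission.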

Other important distribution functions for reliability analyses are the minimum $\tau_{\min}$ and the maximum $\tau_{\max}$ of a finite set of positive, independent random variables $\tau_1, \dots, \tau_n$; for instance, as failure-free time of a series or a 1-out-of-$n$ parallel system, respectively. If $\tau_1, \dots, \tau_n$ are independent positive random variables with distribution functions $F_i(t) = \Pr\{\tau_i \le t\}$, $i = 1, \dots, n$, then

$\Pr\{\tau_{\min} > t\} = \Pr\{\tau_1 > t \cap \dots \cap \tau_n > t\} = \prod_{i=1}^{n} (1 - F_i(t))$, (A6.78)

and

$\Pr\{\tau_{\max} \le t\} = \Pr\{\tau_1 \le t \cap \dots \cap \tau_n \le t\} = \prod_{i=1}^{n} F_i(t)$. (A6.79)

It can be noted that the failure rate related to $\tau_{\min}$ is given by

$\lambda_S(t) = \lambda_1(t) + \dots + \lambda_n(t)$, (A6.80)

where $\lambda_i(t)$ is the failure rate related to $F_i(t)$. The distribution of $\tau_{\min}$ leads for $F_1(t) = \dots = F_n(t)$ and $n \to \infty$ to the Weibull distribution [A6.8]. For the mixture of distribution functions one refers to the considerations given by Eqs. (A6.34) & (2.15).
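The product formulas for $\tau_{\min}$ (series system) and $\tau_{\max}$ (1-out-of-$n$ parallel system) translate directly into code. The sketch below assumes exponentially distributed elements with hypothetical failure rates; the helper names are illustrative.

```python
import math

def exp_cdf(lam):
    """Return F(t) = 1 - exp(-lam*t) as a function of t."""
    return lambda t: 1.0 - math.exp(-lam * t)

def series_reliability(cdfs, t):
    """Pr{tau_min > t}: all elements survive t."""
    return math.prod(1.0 - F(t) for F in cdfs)

def parallel_distribution(cdfs, t):
    """Pr{tau_max <= t}: all elements have failed by t."""
    return math.prod(F(t) for F in cdfs)

rates = [1e-4, 2e-4, 5e-4]              # hypothetical failure rates (1/h)
cdfs = [exp_cdf(lam) for lam in rates]
t = 1_000.0

r_series = series_reliability(cdfs, t)
r_parallel = 1.0 - parallel_distribution(cdfs, t)

print(r_series, math.exp(-sum(rates) * t))   # both equal exp(-0.8)
print(r_parallel)
```

For the series case the result coincides with $e^{-(\lambda_1 + \dots + \lambda_n) t}$, i.e. the element failure rates simply add.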


A6.10 Distribution Functions used in Reliability Analysis


This section introduces the most important distribution functions used in reliability analysis; see Table A6.1 for a summary. The variable $t$, used here for convenience, applies in particular to nonrepairable items. For interarrival times (e.g. when considering repairable systems), $x$ has to be used instead of $t$.

A6.10.1 Exponential Distribution

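A continuous positive random variable $\tau$ has an exponential distribution if

$F(t) = 1 - e^{-\lambda t}$, $t \ge 0$ ($F(t) = 0$ for $t < 0$); $\lambda > 0$. (A6.81)

The density is given by

$f(t) = \lambda e^{-\lambda t}$, $t \ge 0$ ($f(t) = 0$ for $t < 0$); $\lambda > 0$, (A6.82)

and the failure rate (Eq. (A6.25)) by

$\lambda(t) = \lambda$. (A6.83)

The mean and the variance can be obtained from Eqs. (A6.38) and (A6.44) as

$E[\tau] = 1/\lambda$

and

$\mathrm{Var}[\tau] = 1/\lambda^2$.

The Laplace transform of $f(t)$ is, according to Table A9.7,

$\tilde{f}(s) = \lambda / (s + \lambda)$.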

Example A6.13 The failure-free time $\tau$ of an assembly is exponentially distributed with $\lambda = 10^{-5}\,\mathrm{h}^{-1}$. What is the probability of $\tau$ being (i) over 2,000 h, (ii) over 20,000 h, (iii) over 100,000 h, (iv) between 20,000 h and 100,000 h?

Solution From Eqs. (A6.81), (A6.24) and (A6.19) one obtains

(i) $\Pr\{\tau > 2{,}000\,\mathrm{h}\} = e^{-0.02} = 0.98$,

(ii) $\Pr\{\tau > 20{,}000\,\mathrm{h}\} = e^{-0.2} = 0.819$,

(iii) $\Pr\{\tau > 100{,}000\,\mathrm{h}\} = \Pr\{\tau > 1/\lambda = E[\tau]\} = e^{-1} = 0.368$,

(iv) $\Pr\{20{,}000\,\mathrm{h} < \tau \le 100{,}000\,\mathrm{h}\} = e^{-0.2} - e^{-1} = 0.451$.
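The four results of Example A6.13 follow from the survival function of the exponential distribution and can be reproduced in a few lines; the function name `survival` is an illustrative choice.

```python
import math

lam = 1e-5                               # failure rate of Example A6.13 (1/h)

def survival(t):
    """Pr{tau > t} = exp(-lam * t) for an exponentially distributed tau."""
    return math.exp(-lam * t)

p_i = survival(2_000)                          # (i)
p_ii = survival(20_000)                        # (ii)
p_iii = survival(100_000)                      # (iii), t equals the mean 1/lam
p_iv = survival(20_000) - survival(100_000)    # (iv)

print(round(p_i, 2), round(p_ii, 3), round(p_iii, 3), round(p_iv, 3))
```

Case (iii) shows that only about 37% of such assemblies are expected to survive a time equal to their mean failure-free time.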

For an exponential distribution, the failure rate is constant (time-independent) and equal to $\lambda$. This important property is a characteristic of the exponential distribution and does not appear with any other continuous distribution. It greatly simplifies calculation because of the following properties:

1. Memoryless property: Assuming that the failure-free time is exponentially distributed and knowing that the item is functioning at the present time, its behavior in the future will not depend on how long it has already been operating. In particular, the probability that it will fail in the next time interval $\delta t$ is constant and equal to $\lambda\,\delta t$. This is a consequence of Eq. (A6.30).

2. Constant failure rate at system level: If a system without redundancy consists of elements $E_1, \dots, E_n$ and the failure-free times $\tau_1, \dots, \tau_n$ of these elements are independent and exponentially distributed with parameters $\lambda_1, \dots, \lambda_n$ then, according to Eq. (A6.78), the system failure rate is also constant (time-independent) and equal to the sum of the failure rates of its elements

$R_S(t) = e^{-\lambda_1 t} \cdots e^{-\lambda_n t} = e^{-\lambda_S t}$, with $\lambda_S = \lambda_1 + \dots + \lambda_n$. (A6.88)

It should be noted that the expression $\lambda_S = \sum \lambda_i$ is a characteristic of the series model with independent elements, and also remains valid for the time-dependent failure rates $\lambda_i = \lambda_i(t)$, see Eqs. (A6.80) and (2.18).

A6.10.2 Weibull Distribution

The Weibull distribution can be considered as a generalization of the exponential distribution. A continuous positive random variable T has a Weibull distribution if


and the failure rate (Eq. (A6.25)) by

h is the scale Parameter ( F @ ) depends on At only) and ß the shape parameter. ß = 1 yields the exponential distribution. For ß > 1, the failure rate h ( t ) increases monotonically, with h(Oj = 0 and ?L(-) = - . For ß < 1, ?L( t j decreases monotonically, with h(0) = and ?L(-) = 0 . The mean and the variance are given

by

A6.10 Distribution Functions used in Reliability Analysis 42 1

h and

where C o

r(s) = Jxz-le-x dx 0

is the complete gamma function (Appendix A9.6). The coefficient of variation K = J ~ I E[T] =a I E[%] is plotted in Fig. 4.5. For a given E[T], the density of the Weibull distribution becomes peaked with increasing P. An analytical expression for the Laplace transform of the Weibull distribution function does not exist.

For a system without redundancy (series model) whose elements have independent failure-free times $\tau_1, \ldots, \tau_n$ distributed according to Eq. (A6.89), the reliability function is given by

$$R_S(t) = \left(e^{-(\lambda t)^\beta}\right)^n = e^{-(\lambda' t)^\beta}, \tag{A6.95}$$

with $\lambda' = \lambda\, n^{1/\beta}$. Thus, the failure-free time of the system has a Weibull distribution with parameters $\lambda'$ and $\beta$.

The Weibull distribution with $\beta > 1$ often occurs in applications as a distribution of the failure-free time of components which are subject to wearout and/or fatigue (lamps, relays, mechanical components, etc.). It was introduced by W. Weibull in 1951, related to investigations on fatigue in metals [A6.20]. B.W. Gnedenko showed that a Weibull distribution occurs as one of the extreme value distributions for the smallest of $n$ ($n \to \infty$) independent random variables with the same distribution function (Weibull-Gnedenko distribution [A6.7, A6.8]).

The Weibull distribution is often given with the parameter $\alpha = \lambda^\beta$ instead of $\lambda$ or also with three parameters

$$F(t) = 1 - e^{-(\lambda (t - \psi))^\beta}, \quad t > \psi,\ F(t) = 0 \text{ for } t \le \psi;\ \lambda, \beta > 0. \tag{A6.96}$$

Example A6.14 Show that for a three-parameter Weibull distribution, the time scale parameter $\psi$ can also be determined (graphically) on a Weibull probability chart, e.g. for the empirical evaluation of data.

Solution In the system of coordinates $\log_{10}(t)$ and $\log_{10}\log_{10}(1/(1 - F(t)))$ the two-parameter Weibull distribution function (Eq. (A6.89)) appears as a straight line, allowing a graphical determination of $\lambda$ and $\beta$ (see Eq. (A8.16) and Fig. A8.2). The three-parameter Weibull distribution (Eq. (A6.96)) leads to a concave curve. In this case, for two arbitrary points $t_1$ and $t_2 > t_1$ it holds for the mean point on the scale $\log_{10}\log_{10}(1/(1 - F(t)))$, defining $t_m$, that $\log_{10}(t_2 - \psi) + \log_{10}(t_1 - \psi) = 2\log_{10}(t_m - \psi)$, see Eq. (A8.16), the identity $a + (b - a)/2 = (a + b)/2$, and Fig. A8.2. From this, $(t_2 - \psi)(t_1 - \psi) = (t_m - \psi)^2$ and $\psi = (t_1 t_2 - t_m^2)/(t_1 + t_2 - 2 t_m)$, as function of $t_1$, $t_2$, $t_m$.
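The relation derived in Example A6.14 can be verified numerically. The sketch below assumes the three-parameter form $F(t) = 1 - e^{-(\lambda(t - \psi))^\beta}$ for $t > \psi$ and replaces the graphical reading of $t_m$ by an exact inversion of the chart ordinate; the parameter values and function names are hypothetical.

```python
import math

def weibull3_cdf(t, lam, beta, psi):
    """Three-parameter Weibull distribution function (assumed form)."""
    if t <= psi:
        return 0.0
    return 1.0 - math.exp(-((lam * (t - psi)) ** beta))

def chart_ordinate(F):
    """Ordinate of the Weibull probability chart: log10 log10(1 / (1 - F))."""
    return math.log10(math.log10(1.0 / (1.0 - F)))

def time_at_ordinate(y, lam, beta, psi):
    """Time at which the assumed distribution reaches chart ordinate y.
    From log10(1/(1-F)) = 10**y follows (lam*(t-psi))**beta = ln(10) * 10**y."""
    return psi + (math.log(10.0) * 10.0 ** y) ** (1.0 / beta) / lam

def estimate_psi(t1, t2, tm):
    """psi = (t1*t2 - tm**2) / (t1 + t2 - 2*tm), as in Example A6.14."""
    return (t1 * t2 - tm ** 2) / (t1 + t2 - 2.0 * tm)

# Hypothetical true parameters and two arbitrary points t1 < t2.
lam, beta, psi = 1e-3, 1.8, 200.0
t1, t2 = 400.0, 2000.0

y1 = chart_ordinate(weibull3_cdf(t1, lam, beta, psi))
y2 = chart_ordinate(weibull3_cdf(t2, lam, beta, psi))
tm = time_at_ordinate(0.5 * (y1 + y2), lam, beta, psi)  # mean point on the ordinate scale
psi_hat = estimate_psi(t1, t2, tm)
print(tm, psi_hat)
```

With empirical data, $t_m$ would be read from the plotted curve rather than computed from known parameters, so the recovered $\psi$ is only as accurate as that reading.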

A6.10.3 Gamma Distribution, Erlangian Distribution, and $\chi^2$ Distribution

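A continuous positive random variable $\tau$ has a Gamma distribution if

$$F(t) = \Pr\{\tau \le t\} = \frac{1}{\Gamma(\beta)} \int_0^{\lambda t} x^{\beta-1} e^{-x}\, dx = \frac{\gamma(\beta, \lambda t)}{\Gamma(\beta)}, \quad t > 0,\ F(0) = 0;\ \lambda, \beta > 0. \tag{A6.97}$$

$\Gamma$ is the complete Gamma function defined by Eq. (A6.94). $\gamma$ is the incomplete Gamma function (Appendix A9.6). The density of the Gamma distribution is given by

$$f(t) = \lambda\, \frac{(\lambda t)^{\beta-1}}{\Gamma(\beta)}\, e^{-\lambda t}, \quad t > 0;\ \lambda, \beta > 0, \tag{A6.98}$$

and the failure rate is calculated from $\lambda(t) = f(t)/(1 - F(t))$. $\lambda(t)$ is constant (time-independent) for $\beta = 1$, monotonically decreasing for $\beta < 1$ and monotonically increasing for $\beta > 1$. However, in contrast to the Weibull distribution, $\lambda(t)$ always converges to $\lambda$ for $t \to \infty$, see Table A6.1 for an example. A Gamma distribution with $\beta < 1$ mixed with a three-parameter Weibull distribution (Eq. (A6.33), case 1) can be used as an approximation to the distribution function for an item with failure rate as the bathtub curve given in Fig. 1.2.

The mean and the variance are given by

$$E[\tau] = \frac{\beta}{\lambda} \tag{A6.99}$$

and

$$Var[\tau] = \frac{\beta}{\lambda^2}. \tag{A6.100}$$

The Laplace transform (Table A9.7) of the Gamma distribution density is

$$\tilde f(s) = \frac{\lambda^\beta}{(s + \lambda)^\beta}. \tag{A6.101}$$

From Eqs. (A6.101) and (A6.76), it follows that the sum of two independent Gamma-distributed random variables with parameters $\lambda, \beta_1$ and $\lambda, \beta_2$ has a Gamma distribution with parameters $\lambda, \beta_1 + \beta_2$.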

Example A6.15 Let the random variables $\tau_1$ and $\tau_2$ be independent and distributed according to a Gamma distribution with the parameters $\lambda$ and $\beta$. Determine the density of the sum $\eta = \tau_1 + \tau_2$.

Solution According to Eq. (A6.98), $\tau_1$ and $\tau_2$ have density $f(t) = \lambda (\lambda t)^{\beta-1} e^{-\lambda t} / \Gamma(\beta)$. The Laplace transform of $f(t)$ is $\tilde f(s) = \lambda^\beta / (s + \lambda)^\beta$ (Table A9.7). From Eq. (A6.76), the Laplace transform of the density of $\eta = \tau_1 + \tau_2$ follows as $\tilde f_\eta(s) = \lambda^{2\beta} / (s + \lambda)^{2\beta}$. The random variable $\eta = \tau_1 + \tau_2$ thus has a Gamma distribution with parameters $\lambda$ and $2\beta$ (generalization to $n > 2$ is immediate).

For $\beta = n = 2, 3, \ldots$, the Gamma distribution given by Eq. (A6.97) leads to an Erlangian distribution with parameters $\lambda$ and $n$. Taking into account Eq. (A6.77) and comparing the Laplace transform of the exponential distribution $\lambda / (s + \lambda)$ with that of the Erlangian distribution $(\lambda / (s + \lambda))^n$ leads to the following conclusion:

If $\tau$ is Erlang distributed with parameters $\lambda$ and $n$, then $\tau$ can be considered as the sum of $n$ independent, exponentially distributed random variables with parameter $\lambda$, i.e. $\tau = \tau_1 + \ldots + \tau_n$ with $\Pr\{\tau_i \le t\} = 1 - e^{-\lambda t}$, $i = 1, \ldots, n$.

The Erlangian distribution can be obtained by partial integration of the right-hand side of Eq. (A6.97), with $\beta = n$. This leads to (see also Appendices A9.2 & A9.6)

$$F(t) = \Pr\{\tau_1 + \ldots + \tau_n \le t\} = 1 - \sum_{i=0}^{n-1} \frac{(\lambda t)^i}{i!}\, e^{-\lambda t}, \quad t > 0,\ F(0) = 0;\ \lambda > 0,\ n \ge 1. \tag{A6.102}$$

From Example A6.15, if failure-free times are Erlangian distributed with parameters $(n, \lambda)$, the sum of $k$ failure-free times is Erlangian distributed with parameters $(kn, \lambda)$.
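A quick numerical cross-check of the Erlangian distribution function: the sketch below compares the closed form $1 - \sum_{i=0}^{n-1} (\lambda t)^i e^{-\lambda t}/i!$ with a trapezoidal integration of the Gamma density $\lambda(\lambda t)^{\beta-1} e^{-\lambda t}/\Gamma(\beta)$ for $\beta = n$. The parameter values, the step count, and the helper names are assumptions made for illustration.

```python
import math

def erlang_cdf(t, lam, n):
    """Closed-form Erlangian distribution function for integer n >= 1."""
    if t <= 0.0:
        return 0.0
    x = lam * t
    return 1.0 - math.exp(-x) * sum(x ** i / math.factorial(i) for i in range(n))

def gamma_density(t, lam, beta):
    """Density of the Gamma distribution with parameters lam and beta."""
    return lam * (lam * t) ** (beta - 1.0) * math.exp(-lam * t) / math.gamma(beta)

def integrate(f, a, b, steps=20000):
    """Composite trapezoidal rule on [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

lam, n, t = 1.5, 3, 2.0  # assumed values
closed_form = erlang_cdf(t, lam, n)
numeric = integrate(lambda u: gamma_density(u, lam, n), 0.0, t)
print(closed_form, numeric)
```

Both numbers agree to within the integration error, which supports reading the Erlangian distribution as the Gamma distribution with integer shape parameter.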

For $\lambda = 1/2$ and $\beta = \nu/2$, $\nu = 1, 2, \ldots$, the Gamma distribution given by Eq. (A6.97) is a chi-square distribution ($\chi^2$ distribution) with $\nu$ degrees of freedom. The corresponding random variable is denoted $\chi^2_\nu$. The chi-square distribution with $\nu$ degrees of freedom is thus given by (see also Appendix A9.2)

$$F(t) = \Pr\{\chi^2_\nu \le t\} = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)} \int_0^t x^{\nu/2 - 1} e^{-x/2}\, dx, \quad t > 0,\ F(0) = 0;\ \nu = 1, 2, \ldots. \tag{A6.103}$$

From Eqs. (A6.97), (A6.102), and (A6.103) it follows that

$$2\lambda(\tau_1 + \ldots + \tau_n),$$

with $\tau_1, \ldots, \tau_n$ independent and exponentially distributed with parameter $\lambda$, has a $\chi^2$ distribution with $\nu = 2n$ degrees of freedom. If $\xi_1, \ldots, \xi_n$ are independent, normally distributed random variables with mean $m$ and variance $\sigma^2$, then

$$\sum_{i=1}^n \left(\frac{\xi_i - m}{\sigma}\right)^2$$

is $\chi^2$ distributed with $n$ degrees of freedom. The above considerations show the importance of the $\chi^2$ distribution in mathematical statistics. The $\chi^2$ distribution is also used to compute the Poisson distribution (Eq. (A6.102) with $n = \nu/2$ and $\lambda = 1/2$ or Eq. (A6.126) with $k = \nu/2 - 1$ and $m = t/2$, see also Table A9.2).


A6.10.4 Normal Distribution

A widely used distribution function, in theory and practice, is the normal distribution, or Gaussian distribution. The random variable $\tau$ has a normal distribution if

$$F(t) = \Pr\{\tau \le t\} = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{t} e^{-(y - m)^2 / 2\sigma^2}\, dy, \quad -\infty < t, m < \infty,\ \sigma > 0.$$

The density of the normal distribution is given by

$$f(t) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(t - m)^2 / 2\sigma^2}, \quad -\infty < t, m < \infty,\ \sigma > 0.$$

The failure rate is calculated from $\lambda(t) = f(t)/(1 - F(t))$. The mean and variance are

$$E[\tau] = m$$

and

$$Var[\tau] = \sigma^2,$$

respectively. The density of the normal distribution is symmetric with respect to the line $x = m$. Its width depends upon the variance. The area under the density curve is equal to (Table A9.1, [A9.1], see also Appendix A9.6 for the Poisson's integral)

0.6827 for the interval $m \pm \sigma$,
0.95450 for the interval $m \pm 2\sigma$,
0.99730 for the interval $m \pm 3\sigma$,
0.999533 for the interval $m \pm 3.5\sigma$,
0.9999367 for the interval $m \pm 4\sigma$,
0.9999932 for the interval $m \pm 4.5\sigma$,
0.99999943 for the interval $m \pm 5\sigma$,
0.9999999980 for the interval $m \pm 6\sigma$.

A normally distributed random variable takes values in $(-\infty, +\infty)$. However, for $m > 3\sigma$ it is often possible to consider it as a positive random variable in practical applications. $m \pm 6\sigma$ is frequently used as a sharp limit for controlling the process quality (6-$\sigma$ approach). If a shift of the mean of $\pm 1.5\sigma$ is accepted in the manufacturing process, the 6-$\sigma$ approach refers in this case to $m \pm 4.5\sigma$ with respect to the basic quantity, yielding 3.4 ppm to the right and 3.4 ppm to the left of the sharp limit.
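The areas and the ppm figure quoted above can be reproduced with the error function of the Python standard library. This is a minimal sketch; the helper names are assumptions, and `math.erfc` is used for the tail to avoid cancellation.

```python
import math

def area_within(k):
    """Probability mass of a normal distribution inside m +/- k*sigma."""
    return math.erf(k / math.sqrt(2.0))

def tail_ppm(k):
    """One-sided tail probability beyond m + k*sigma, in parts per million."""
    return 0.5 * math.erfc(k / math.sqrt(2.0)) * 1e6

for k in (1, 2, 3, 4, 4.5, 5, 6):
    print(k, area_within(k))

# One-sided tail at 4.5 sigma, the figure behind the 6-sigma approach.
print(tail_ppm(4.5))
```

The one-sided tail at $4.5\sigma$ evaluates to about 3.4 ppm, in agreement with the value stated for the 6-$\sigma$ approach.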

If $\tau$ has a normal distribution with parameters $m$ and $\sigma^2$, $(\tau - m)/\sigma$ is normally distributed with parameters 0 and 1, which is the standard normal distribution $\Phi(t)$

$$\Phi(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\, dx. \tag{A6.109}$$

If $\tau_1$ and $\tau_2$ are (stochastically) independent, normally distributed random variables with parameters $m_1, \sigma_1^2$ and $m_2, \sigma_2^2$, $\eta = \tau_1 + \tau_2$ is normally distributed with parameters $m_1 + m_2$, $\sigma_1^2 + \sigma_2^2$ (Example A6.16). This rule can be generalized to the sum of $n$ independent normally distributed random variables, and extended to dependent normally distributed random variables (Example A6.16).


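Example A6.16 Let the random variables $\tau_1$ and $\tau_2$ be (stochastically) independent and normally distributed with means $m_1$ and $m_2$ and variances $\sigma_1^2$ and $\sigma_2^2$. Give the density of the sum $\eta = \tau_1 + \tau_2$.

Solution According to Eq. (A6.74), the density of $\eta = \tau_1 + \tau_2$ follows as

$$f_\eta(t) = \frac{1}{2\pi\sigma_1\sigma_2} \int_{-\infty}^{\infty} e^{-\left(\frac{(x - m_1)^2}{2\sigma_1^2} + \frac{(t - x - m_2)^2}{2\sigma_2^2}\right)}\, dx.$$

Setting $u = x - m_1$, $v = t - m_1 - m_2$, and considering

$$\frac{u^2}{\sigma_1^2} + \frac{(v - u)^2}{\sigma_2^2} = \left[\frac{u\sqrt{\sigma_1^2 + \sigma_2^2}}{\sigma_1\sigma_2} - \frac{v\,\sigma_1}{\sigma_2\sqrt{\sigma_1^2 + \sigma_2^2}}\right]^2 + \frac{v^2}{\sigma_1^2 + \sigma_2^2},$$

the result

$$f_\eta(t) = \frac{1}{\sqrt{2\pi}\sqrt{\sigma_1^2 + \sigma_2^2}}\; e^{-\frac{(t - m_1 - m_2)^2}{2(\sigma_1^2 + \sigma_2^2)}}$$

is obtained. Thus the sum of independent normally distributed random variables is also normally distributed with mean $m_1 + m_2$ and variance $\sigma_1^2 + \sigma_2^2$. If $\tau_1$ and $\tau_2$ are not (stochastically) independent, the distribution function of $\tau_1 + \tau_2$ is still a normal distribution with $m = m_1 + m_2$, but with variance $\sigma^2 = \sigma_1^2 + \sigma_2^2 + 2\rho\,\sigma_1\sigma_2$, where $\rho$ is the correlation coefficient (Eq. (A6.67)).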

The normal distribution often occurs in practical applications, also because the distribution function of the sum of a large number of (stochastically) independent random variables converges under weak conditions to a normal distribution (central limit theorem, Eq. (A6.148)).

A6.10.5 Lognormal Distribution

A continuous positive random variable $\tau$ has a lognormal distribution if its logarithm is normally distributed (Example A6.17). For the lognormal distribution,

$$F(t) = \frac{1}{\sigma\sqrt{2\pi}} \int_0^t \frac{1}{y}\, e^{-(\ln \lambda y)^2 / 2\sigma^2}\, dy = \Phi\!\left(\frac{\ln(\lambda t)}{\sigma}\right), \quad t > 0,\ F(0) = 0;\ \lambda, \sigma > 0. \tag{A6.110}$$

The density is given by

$$f(t) = \frac{1}{t\,\sigma\sqrt{2\pi}}\, e^{-(\ln \lambda t)^2 / 2\sigma^2}, \quad t > 0,\ f(t) = 0 \text{ for } t \le 0;\ \lambda, \sigma > 0. \tag{A6.111}$$

The failure rate is calculated from $\lambda(t) = f(t)/(1 - F(t))$, see Table A6.1 for an example. The mean and the variance of $\tau$ are (Problem A6.6 in Appendix A11)

$$E[\tau] = \frac{e^{\sigma^2/2}}{\lambda}$$

and

$$Var[\tau] = \frac{e^{2\sigma^2} - e^{\sigma^2}}{\lambda^2},$$

respectively. The density of the lognormal distribution is practically zero for some $t$ at the origin, increases rapidly to a maximum, and decreases quickly (Fig. 4.2). It often applies as a model for repair times (Section 4.1) or for lifetimes in accelerated reliability tests (Section 7.4), and appears when a large number of (stochastically) independent random variables are combined in a multiplicative way. It is also the limit distribution for $n \to \infty$ of $x_n$ when $x_{n+1} = (1 + \varepsilon_n)\,x_n$, where $\varepsilon_n$ is a random variable independent of $x_n$ [A6.9, 6.19]. The notation with $m$ or $a = -\ln(\lambda)$ is often used. It must also be noted that $\sigma^2 = Var[\ln \tau]$ and $m = \ln(1/\lambda) = E[\ln \tau]$ (Example A6.17).

Example A6.17 Show that the logarithm of a lognormally distributed random variable is normally distributed.

Solution For

$$f_\tau(t) = \frac{1}{t\,\sigma\sqrt{2\pi}}\, e^{-(\ln t + \ln \lambda)^2 / 2\sigma^2}$$

and $\eta = \ln \tau$, Equation (A6.31) yields ($u(t) = \ln t$ and $u^{-1}(t) = e^t$)

$$f_\eta(t) = \frac{1}{e^t\,\sigma\sqrt{2\pi}}\, e^{-(t + \ln \lambda)^2 / 2\sigma^2}\, e^t = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(t - m)^2 / 2\sigma^2},$$

with $m = \ln(1/\lambda)$. This method can be used for other transformations, for example:

(i) $u(t) = e^t$; $u^{-1}(t) = \ln(t)$: Normal distribution → Lognormal distribution,

(ii) $u(t) = \ln(t)$; $u^{-1}(t) = e^t$: Lognormal distribution → Normal distribution,

(iii) $u(t) = t^\beta$; $u^{-1}(t) = t^{1/\beta}$: Weibull distribution → Exponential distribution,

(iv) $u(t) = t^{1/\beta}$; $u^{-1}(t) = t^\beta$: Exponential distribution → Weibull distribution,

(v) $u(t) = F_\eta^{-1}(t)$; $u^{-1}(t) = F_\eta(t)$: Uniform distribution on (0, 1) → $F_\eta(t)$,

(vi) $u(t) = F_\eta(t)$; $u^{-1}(t) = F_\eta^{-1}(t)$: $F_\eta(t)$ → Uniform distribution on (0, 1),

(vii) $\eta = C\,\tau$; $\tau = \eta / C$: $F_\eta(t) = F_\tau(t/C)$ and $f_\eta(t) = f_\tau(t/C)/C$.

In Monte Carlo simulations, more elaborate algorithms than $F_\eta^{-1}(t)$ are often used.
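Case (v) is the basis of the inverse transform method for generating random numbers. The sketch below applies it to a Weibull distribution, assuming the two-parameter form $F(t) = 1 - e^{-(\lambda t)^\beta}$; the function names, the seed, and the parameter values are illustrative assumptions.

```python
import math
import random

def weibull_cdf(t, lam, beta):
    """F(t) = 1 - exp(-(lam*t)**beta), assumed two-parameter form."""
    return 1.0 - math.exp(-((lam * t) ** beta))

def weibull_inverse_cdf(u, lam, beta):
    """Inverse of F: maps a number u in [0, 1) to a Weibull-distributed value."""
    return (-math.log(1.0 - u)) ** (1.0 / beta) / lam

def sample_weibull(n, lam, beta, seed=0):
    """Draw n values by applying the inverse of F to uniform random numbers."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    return [weibull_inverse_cdf(rng.random(), lam, beta) for _ in range(n)]

lam, beta = 1.0, 2.0
values = sample_weibull(20000, lam, beta)
sample_mean = sum(values) / len(values)
print(sample_mean, math.gamma(1.0 + 1.0 / beta) / lam)
```

The sample mean should be close to $\Gamma(1 + 1/\beta)/\lambda$; library generators (for instance `random.weibullvariate`) are usually preferable in production simulations.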

A6.10.6 Uniform Distribution

A random variable $\tau$ is uniformly distributed in the interval $(a, b)$ if it has the distribution function

$$F(t) = \Pr\{\tau \le t\} = \begin{cases} 0 & \text{if } t \le a \\ \dfrac{t - a}{b - a} & \text{if } a < t < b \\ 1 & \text{if } t \ge b. \end{cases}$$

The density is then given by

$$f(t) = \frac{1}{b - a} \quad \text{for } a < t < b.$$

The uniform distribution is a particular case of the geometric probability introduced by Eq. (A6.3), for x1 instead of x2. Because of the property mentioned by case (v) of Example A6.17, the uniform distribution in the interval (0, 1) plays an important role in simulation problems.

A6.10.7 Binomial Distribution

Consider a trial in which the only outcomes are either a given event $A$ or its complement $\bar A$. These outcomes can be represented by a random variable of the form

$$\delta = \begin{cases} 1 & \text{if } A \text{ occurs} \\ 0 & \text{otherwise.} \end{cases} \tag{A6.115}$$

$\delta$ is called a Bernoulli variable. If

$$\Pr\{\delta = 1\} = p \quad \text{and} \quad \Pr\{\delta = 0\} = 1 - p, \tag{A6.116}$$

then

$$E[\delta] = 1 \cdot p + 0 \cdot (1 - p) = p, \tag{A6.117}$$

and

$$Var[\delta] = E[\delta^2] - E^2[\delta] = p - p^2 = p(1 - p). \tag{A6.118}$$

An infinite sequence of independent Bernoulli variables

$$\delta_1, \delta_2, \ldots$$

with the same probability $\Pr\{\delta_i = 1\} = p$, $i \ge 1$, is called a Bernoulli model or a sequence of Bernoulli trials. The sequence $\delta_1, \delta_2, \ldots$ describes, for example, the model of the repeated sampling of a component from a lot of size $N$, with $K$ defective components ($p = K/N$) such that the component is returned to the lot after testing (sample with replacement). The random variable

$$\xi = \delta_1 + \ldots + \delta_n \tag{A6.119}$$

is the number of ones occurring in $n$ Bernoulli trials. The distribution of $\xi$ is given by

$$p_k = \Pr\{\xi = k\} = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k = 0, \ldots, n,\ 0 < p < 1. \tag{A6.120}$$

Equation (A6.120) is the binomial distribution. $\xi$ is obviously an arithmetic random variable taking on values in $\{0, 1, \ldots, n\}$ with probabilities $p_k$. To prove Eq. (A6.120), consider that

$$p^k (1 - p)^{n-k} = \Pr\{\delta_1 = 1 \cap \ldots \cap \delta_k = 1 \cap \delta_{k+1} = 0 \cap \ldots \cap \delta_n = 0\}$$

is the probability of the event $A$ occurring in the first $k$ trials and not occurring in the $n - k$ following trials ($\delta_1, \ldots, \delta_n$ are independent); furthermore in $n$ trials there are

$$\binom{n}{k} = \frac{n!}{k!\,(n - k)!}$$

different possibilities of occurrence of $k$ ones and $n - k$ zeros; the addition theorem (Eq. (A6.11)) then leads to Eq. (A6.120).

Example A6.18

A populated printed circuit board (PCB) contains 30 ICs. These are taken from a shipment in which the probability of each IC being defective is constant and equal to 1%. What are the probabilities that the PCB contains (i) no defective ICs, (ii) exactly one defective IC, and (iii) more than one defective IC?

Solution From Eq. (A6.120) with $p = 0.01$,

(i) $p_0 = 0.99^{30} \approx 0.74$,

(ii) $p_1 = 30 \cdot 0.01 \cdot 0.99^{29} \approx 0.224$,

(iii) $p_2 + \ldots + p_{30} = 1 - p_0 - p_1 \approx 0.036$.

Knowing $p_i$ and assuming $C_i$ = cost for $i$ repairs (because of $i$ defective ICs) it is easy to calculate the mean $C$ of the total cost caused by the defective ICs ($C = p_1 C_1 + \ldots + p_{30} C_{30}$) and thus to develop a test strategy based on cost considerations (Section 8.4).
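The numbers of Example A6.18 can be reproduced with a few lines of Python. The binomial probabilities follow the text; the cost figures $C_i$ are purely hypothetical and only show how the mean cost would be evaluated.

```python
import math

def binomial_pmf(k, n, p):
    """Pr{xi = k} for a binomial distribution with parameters n and p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

n, p = 30, 0.01
p0 = binomial_pmf(0, n, p)        # no defective IC
p1 = binomial_pmf(1, n, p)        # exactly one defective IC
p_more = 1.0 - p0 - p1            # more than one defective IC
print(round(p0, 3), round(p1, 3), round(p_more, 3))

# Hypothetical repair costs C_i = 20 * i (assumed, not from the text).
costs = [20.0 * i for i in range(n + 1)]
mean_cost = sum(binomial_pmf(i, n, p) * costs[i] for i in range(n + 1))
print(mean_cost)
```

With a cost proportional to the number of defective ICs, the mean cost reduces to the cost per repair times $np$; other cost structures only change the `costs` list.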

For the random variable $\xi$ defined by Eq. (A6.119) it follows that

$$E[\xi] = n\,p \quad \text{and} \quad Var[\xi] = n\,p\,(1 - p).$$

Example A6.19 Give mean and variance of a binomially distributed random variable with parameters $n$ and $p$.

Solution Considering the independence of $\delta_1, \ldots, \delta_n$, the definition of $\xi$ (Eq. (A6.119)), and from Eqs. (A6.117), (A6.118), (A6.68), and (A6.70) it follows that

$$E[\xi] = E[\delta_1] + \ldots + E[\delta_n] = n\,p$$

and

$$Var[\xi] = Var[\delta_1] + \ldots + Var[\delta_n] = n\,p\,(1 - p).$$

A further demonstration follows, as for Example A6.20, by considering that

$$\sum_{k=1}^{n} k \binom{n}{k} p^k (1 - p)^{n-k} = n\,p \sum_{k=1}^{n} \binom{n-1}{k-1} p^{k-1} (1 - p)^{n-k} = n\,p.$$

For large $n$, the binomial distribution converges to the normal distribution (Eq. (A6.149)). The convergence is good for $\min(n p,\ n(1 - p)) \ge 5$. For small values of $p$, the Poisson approximation (Eq. (A6.129)) can be used. Calculations of Eq. (A6.120) can be based upon the relationship between the binomial and the beta or the Fisher distribution (Appendix A9.4).

Generalization of Eq. (A6.120) for the case where one of the events $A_1, \ldots, A_m$ can occur with probability $p_1, \ldots, p_m$ at every trial, leads to the multinomial distribution

$$\Pr\{\text{in } n \text{ trials } A_1 \text{ occurs } k_1 \text{ times}, \ldots, A_m \text{ occurs } k_m \text{ times}\} = \frac{n!}{k_1! \cdots k_m!}\; p_1^{k_1} \cdots p_m^{k_m},$$

with $k_1 + \ldots + k_m = n$ and $p_1 + \ldots + p_m = 1$.

A6.10.8 Poisson Distribution

The arithmetic random variable $\xi$ has a Poisson distribution if

$$p_k = \Pr\{\xi = k\} = \frac{m^k}{k!}\, e^{-m}, \quad k = 0, 1, \ldots,\ m > 0, \tag{A6.125}$$

and thus

$$\Pr\{\xi \le k\} = \sum_{i=0}^{k} \frac{m^i}{i!}\, e^{-m}, \quad k = 0, 1, \ldots,\ m > 0. \tag{A6.126}$$

The mean and the variance of $\xi$ are

$$E[\xi] = m$$

and

$$Var[\xi] = m.$$

The Poisson distribution often occurs in connection with exponentially distributed failure-free times. In fact, Eq. (A6.125) with $m = \lambda t$ gives the probability of $k$ failures in the time interval $(0, t]$, given $\lambda$ and $t$ (Eq. (A7.41)).

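The Poisson distribution is also used as an approximation of the binomial distribution for $n \to \infty$ and $p \to 0$ such that $n p = m < \infty$. To prove this convergence, called the Poisson approximation, set $m = n p$; Eq. (A6.120) then yields

$$p_k = \frac{n!}{k!\,(n - k)!} \left(\frac{m}{n}\right)^k \left(1 - \frac{m}{n}\right)^{n-k} = \frac{n(n - 1) \cdots (n - k + 1)}{n^k} \cdot \frac{m^k}{k!} \left(1 - \frac{m}{n}\right)^{n-k} = 1 \cdot \left(1 - \frac{1}{n}\right) \cdots \left(1 - \frac{k - 1}{n}\right) \cdot \frac{m^k}{k!} \left(1 - \frac{m}{n}\right)^{n-k},$$

from which (for $k < \infty$ and $m = n p < \infty$) it follows that

$$\lim_{n \to \infty} p_k = \frac{m^k}{k!}\, e^{-m}, \quad m = n p. \tag{A6.129}$$

Using partial integration one can show that

$$\sum_{i=0}^{k} \frac{m^i}{i!}\, e^{-m} = 1 - \frac{1}{k!} \int_0^m y^k e^{-y}\, dy = 1 - \frac{1}{k!\, 2^{k+1}} \int_0^{2m} x^k e^{-x/2}\, dx. \tag{A6.130}$$

The right-hand side of Eq. (A6.130) is a special case of the chi-square distribution (Eq. (A6.103) with $\nu/2 = k + 1$ and $t = 2m$). A table of the chi-square distribution can then be used for numerical evaluation of the Poisson distribution (Table A9.2).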
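The link between the Poisson and the chi-square distribution can be checked numerically. The sketch below assumes the chi-square distribution function $F(t) = \frac{1}{2^{\nu/2}\Gamma(\nu/2)}\int_0^t x^{\nu/2-1}e^{-x/2}dx$ and evaluates it with a simple trapezoidal rule (restricted to $\nu \ge 2$, so that the integrand stays finite at the origin); names and step count are assumptions.

```python
import math

def poisson_cdf(k, m):
    """Pr{xi <= k} for a Poisson distribution with parameter m."""
    return math.exp(-m) * sum(m ** i / math.factorial(i) for i in range(k + 1))

def chi_square_cdf(t, nu, steps=20000):
    """Chi-square distribution function by trapezoidal integration (nu >= 2)."""
    c = 1.0 / (2.0 ** (nu / 2.0) * math.gamma(nu / 2.0))

    def density(x):
        return c * x ** (nu / 2.0 - 1.0) * math.exp(-x / 2.0)

    h = t / steps
    total = 0.5 * (density(0.0) + density(t))
    for i in range(1, steps):
        total += density(i * h)
    return total * h

k, m = 3, 5.0
lhs = poisson_cdf(k, m)
rhs = 1.0 - chi_square_cdf(2.0 * m, 2 * (k + 1))  # nu/2 = k + 1, t = 2m
print(lhs, rhs)
```

Both values agree up to the integration error, so a chi-square table (or routine) with $\nu = 2(k + 1)$ degrees of freedom evaluated at $t = 2m$ gives the Poisson distribution function.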

Example A6.20

Give mean and variance of a Poisson-distributed random variable.

Solution

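From Eqs. (A6.35) and (A6.125),

$$E[\xi] = \sum_{k=0}^{\infty} k\, \frac{m^k}{k!}\, e^{-m} = m\, e^{-m} \sum_{k=1}^{\infty} \frac{m^{k-1}}{(k - 1)!} = m\, e^{-m} e^{m} = m.$$

Similarly, from Eqs. (A6.45), (A6.41), (A6.125), and considering $k^2 = k(k - 1) + k$,

$$Var[\xi] = \sum_{k=0}^{\infty} k^2\, \frac{m^k}{k!}\, e^{-m} - m^2 = \sum_{k=0}^{\infty} [k(k - 1) + k]\, \frac{m^k}{k!}\, e^{-m} - m^2 = m^2 e^{-m} \sum_{k=2}^{\infty} \frac{m^{k-2}}{(k - 2)!} + m - m^2 = m.$$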


A6.10.9 Geometric Distribution

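Let $\delta_1, \delta_2, \ldots$ be a sequence of independent Bernoulli variables resulting from Bernoulli trials. The arithmetic random variable $\xi$ defining the number of trials to the first occurrence of the event $A$ has a geometric distribution given by

$$p_k = \Pr\{\xi = k\} = p\,(1 - p)^{k-1}, \quad k = 1, 2, \ldots,\ 0 < p < 1. \tag{A6.131}$$

Equation (A6.131) follows from the definition of Bernoulli variables $\delta_i$ (Eq. (A6.115))

$$p_k = \Pr\{\xi = k\} = \Pr\{\delta_1 = 0 \cap \ldots \cap \delta_{k-1} = 0 \cap \delta_k = 1\} = (1 - p)^{k-1} p.$$

The geometric distribution is the only discrete distribution which exhibits the memoryless property, as does the exponential distribution for the continuous case. In fact, from $\Pr\{\xi > k\} = \Pr\{\delta_1 = 0 \cap \ldots \cap \delta_k = 0\} = (1 - p)^k$ and, for any $k$ and $j > 0$, it follows that

$$\Pr\{\xi > k + j \mid \xi > k\} = \frac{(1 - p)^{k+j}}{(1 - p)^k} = (1 - p)^j.$$

The failure rate is time independent and given by

$$\lambda(k) = \Pr\{\xi = k \mid \xi > k - 1\} = \frac{p\,(1 - p)^{k-1}}{(1 - p)^{k-1}} = p.$$

For the distribution function of the random variable $\xi$ defined by Eq. (A6.131) one obtains

$$\Pr\{\xi \le k\} = \sum_{i=1}^{k} p_i = 1 - \Pr\{\xi > k\} = 1 - (1 - p)^k. \tag{A6.133}$$

Mean and variance are then (with $\sum_{n=1}^{\infty} n\,x^n = x/(1 - x)^2$ and $\sum_{n=1}^{\infty} n^2 x^n = x(1 + x)/(1 - x)^3$, $|x| < 1$)

$$E[\xi] = \frac{1}{p}$$

and

$$Var[\xi] = \frac{1 - p}{p^2}.$$

If Bernoulli trials are carried out at regular intervals $\Delta t$, then Eq. (A6.133) provides the distribution function of the number of time units $\Delta t$ between successive occurrences of the event $A$ under consideration; for example, breakdown of a capacitor, interference pulse in a digital network, etc.

Often the geometric distribution is considered with $p_k = p\,(1 - p)^k$, $k = 0, 1, \ldots$; in this case $E[\xi] = (1 - p)/p$ and $Var[\xi] = (1 - p)/p^2$.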
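The memoryless property and the moments of the geometric distribution can be verified numerically. The sketch assumes the form $p_k = p(1 - p)^{k-1}$, $k = 1, 2, \ldots$, and approximates the infinite sums by truncation; the value of $p$ and the truncation point are arbitrary assumptions.

```python
def geometric_pmf(k, p):
    """Pr{xi = k} = p * (1 - p)**(k - 1), k = 1, 2, ..."""
    return p * (1.0 - p) ** (k - 1)

def survival(k, p):
    """Pr{xi > k} = (1 - p)**k."""
    return (1.0 - p) ** k

def conditional_survival(k, j, p):
    """Pr{xi > k + j | xi > k}; independent of k for the geometric distribution."""
    return survival(k + j, p) / survival(k, p)

def truncated_moments(p, k_max=5000):
    """Mean and variance from sums truncated at k_max (tail is negligible)."""
    mean = sum(k * geometric_pmf(k, p) for k in range(1, k_max + 1))
    second = sum(k * k * geometric_pmf(k, p) for k in range(1, k_max + 1))
    return mean, second - mean ** 2

p = 0.2
print(conditional_survival(3, 4, p), conditional_survival(10, 4, p), (1 - p) ** 4)
print(truncated_moments(p), (1 / p, (1 - p) / p ** 2))
```

The conditional survival probabilities coincide for different $k$, and the truncated sums reproduce $1/p$ and $(1 - p)/p^2$.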


A6.10.10 Hypergeometric Distribution

The hypergeometric distribution describes the model of a random sample without replacement. For example, if it is known that there are exactly $K$ defective components in a lot of size $N$, then the probability of finding $k$ defective components in a random sample of size $n$ is given by

$$p_k = \Pr\{\xi = k\} = \frac{\binom{K}{k}\binom{N - K}{n - k}}{\binom{N}{n}}, \quad k = 0, \ldots, \min(K, n). \tag{A6.136}$$

Equation (A6.136) defines the hypergeometric distribution. Since for fixed $n$ and $k$ ($0 \le k \le n$)

$$\lim_{N \to \infty} \Pr\{\xi = k\} = \binom{n}{k} p^k (1 - p)^{n-k}, \quad \text{with } p = \frac{K}{N},$$

the hypergeometric distribution can, for large $N$, be approximated by the binomial distribution with $p = K/N$. For the random variable $\xi$ defined by Eq. (A6.136) it holds that

$$E[\xi] = n\,\frac{K}{N}$$

and

$$Var[\xi] = \frac{K\,n\,(N - K)(N - n)}{N^2 (N - 1)}.$$

A6.11 Limit Theorems

Limit theorems are of great importance in practical applications because they can be used to find approximate expressions with the help of known (tabulated) distributions. Two important cases will be discussed in this section, the law of large numbers and the central limit theorem. The law of large numbers provides additional justification for the construction of probability theory on the basis of relative frequencies. The central limit theorem shows that the normal distribution can be used as an approximation in many practical situations.

A6.11.1 Law of Large Numbers

Two notions used with the law of large numbers are convergence in probability and convergence with probability one. Let $\xi_1, \xi_2, \ldots$, and $\xi$ be random variables on a probability space $[\Omega, \mathcal{F}, \Pr]$. $\xi_n$ converge in probability to $\xi$ if for arbitrary $\varepsilon > 0$

$$\lim_{n \to \infty} \Pr\{|\xi_n - \xi| > \varepsilon\} = 0 \tag{A6.140}$$

holds. $\xi_n$ converge to $\xi$ with probability one if

$$\Pr\{\lim_{n \to \infty} \xi_n = \xi\} = 1. \tag{A6.141}$$

The convergence with probability one is also called convergence almost sure (a.s.). An equivalent condition for Eq. (A6.141) is

$$\lim_{n \to \infty} \Pr\{\sup_{k \ge n} |\xi_k - \xi| > \varepsilon\} = 0, \tag{A6.142}$$

for any $\varepsilon > 0$. This clarifies the difference between Eq. (A6.140) and the stronger condition given by Eq. (A6.141).

Let us now consider an infinite sequence of Bernoulli trials (Eqs. (A6.115), (A6.119), and (A6.120)), with parameter $p = \Pr\{A\}$, and let $S_n$ be the number of occurrences of the event $A$ in $n$ trials

$$S_n = \delta_1 + \ldots + \delta_n. \tag{A6.143}$$

The quantity $S_n / n$ is the relative frequency of the occurrence of $A$ in $n$ independent trials. The weak law of large numbers states that for every $\varepsilon > 0$,

$$\lim_{n \to \infty} \Pr\left\{\left|\frac{S_n}{n} - p\right| > \varepsilon\right\} = 0. \tag{A6.144}$$

Equation (A6.144) is a direct consequence of Chebyshev's inequality (Eq. (A6.49)). Similarly, for a sequence of independent identically distributed random variables $\tau_1, \ldots, \tau_n$ with mean $E[\tau_i] = a$ and variance $Var[\tau_i] = \sigma^2 < \infty$ ($i = 1, \ldots, n$),

$$\lim_{n \to \infty} \Pr\left\{\left|\frac{1}{n}\sum_{i=1}^{n} \tau_i - a\right| > \varepsilon\right\} = 0. \tag{A6.145}$$

According to Eq. (A6.144), the sequence $S_n / n$ converges in probability to $p = \Pr\{A\}$. Moreover, according to Eq. (A6.145), the arithmetic mean $(t_1 + \ldots + t_n)/n$ of $n$ independent observations of the random variable $\tau$ (with a finite variance) converges in probability to $E[\tau]$. Therefore, $\hat p = S_n / n$ and $\hat a = (t_1 + \ldots + t_n)/n$ are consistent estimates of $p = \Pr\{A\}$ and $a = E[\tau]$, respectively (Appendix A8.1 and A8.2). Equation (A6.145) is also a direct consequence of Chebyshev's inequality (Eq. (A6.49)).

A firmer statement than the weak law of large numbers is given by the strong law of large numbers,

$$\Pr\left\{\lim_{n \to \infty} \frac{S_n}{n} = p\right\} = 1. \tag{A6.146}$$

According to Eq. (A6.146), the relative frequency $S_n / n$ converges with probability one (a.s.) to $p = \Pr\{A\}$. Similarly, for a sequence of independent identically distributed random variables $\tau_1, \ldots, \tau_n$ with mean $E[\tau_i] = a < \infty$ and variance $Var[\tau_i] = \sigma^2 < \infty$ ($i = 1, 2, \ldots$),

$$\Pr\left\{\lim_{n \to \infty} \frac{1}{n}\sum_{i=1}^{n} \tau_i = a\right\} = 1. \tag{A6.147}$$

The proof of the strong law of large numbers (A6.146) and (A6.147) is more laborious than that of the weak law of large numbers, see e.g. [A6.6 (Vol. II), A6.7].
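The convergence of the relative frequency can be illustrated by simulation. The sketch below draws Bernoulli trials with a seeded generator and also evaluates the Chebyshev bound $p(1 - p)/(n\varepsilon^2)$ on $\Pr\{|S_n/n - p| > \varepsilon\}$, which is the bound behind the weak law; the value of $p$, the seed, and the function names are assumptions.

```python
import random

def relative_frequency(p, n, seed=1):
    """Relative frequency S_n / n of an event with probability p in n trials."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    successes = sum(1 for _ in range(n) if rng.random() < p)
    return successes / n

def chebyshev_bound(p, n, eps):
    """Upper bound p(1-p)/(n*eps**2) on Pr{|S_n/n - p| > eps} (capped at 1)."""
    return min(1.0, p * (1.0 - p) / (n * eps ** 2))

p = 0.3
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(p, n), chebyshev_bound(p, n, 0.01))
```

The bound is conservative: the observed deviations are usually much smaller than it suggests, which is consistent with the sharper statement provided by the central limit theorem.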

A6.11.2 Central Limit Theorem

Let $\tau_1, \tau_2, \ldots$ be independent, identically distributed random variables with mean $E[\tau_i] = a < \infty$ and variance $Var[\tau_i] = \sigma^2 < \infty$, $i = 1, 2, \ldots$. For every $t < \infty$,

$$\lim_{n \to \infty} \Pr\left\{\frac{\left(\sum_{i=1}^{n} \tau_i\right) - n\,a}{\sigma\sqrt{n}} \le t\right\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\, dx. \tag{A6.148}$$

Equation (A6.148) is the central limit theorem. It says that for large values of $n$, the distribution function of the sum $\tau_1 + \ldots + \tau_n$ can be approximated by the normal distribution with mean $E[\tau_1 + \ldots + \tau_n] = n\,E[\tau_i] = n\,a$ and variance $Var[\tau_1 + \ldots + \tau_n] = n\,Var[\tau_i] = n\,\sigma^2$. The central limit theorem is of great theoretical and practical importance, in probability theory and mathematical statistics. It includes the integral Laplace theorem (also known as the De Moivre-Laplace theorem) for the case where $\tau_i = \delta_i$ are Bernoulli variables,

$$\lim_{n \to \infty} \Pr\left\{\frac{\left(\sum_{i=1}^{n} \delta_i\right) - n\,p}{\sqrt{n\,p\,(1 - p)}} \le t\right\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\, dx. \tag{A6.149}$$

$\sum_{i=1}^{n} \delta_i$ is the random variable $\xi$ in Eq. (A6.120) for the binomial distribution, i.e. it is the total number of occurrences of the event considered in $n$ Bernoulli trials.

From Eq. (A6.149) it follows that for $n \to \infty$

$$\Pr\left\{\frac{\sum_{i=1}^{n} \delta_i}{n} - p \le \varepsilon\right\} \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{n\varepsilon/\sqrt{n p (1 - p)}} e^{-x^2/2}\, dx, \quad n \to \infty,$$

or, for each given $\varepsilon > 0$,

$$\Pr\left\{\left|\frac{\sum_{i=1}^{n} \delta_i}{n} - p\right| \le \varepsilon\right\} \to \frac{2}{\sqrt{2\pi}} \int_{0}^{n\varepsilon/\sqrt{n p (1 - p)}} e^{-x^2/2}\, dx, \quad n \to \infty. \tag{A6.150}$$

Setting the right-hand side of Eq. (A6.150) equal to $\gamma$ allows determination of the number of trials $n$ for given $\gamma$, $p$, and $\varepsilon$ which are necessary to fulfill the inequality $|(\delta_1 + \ldots + \delta_n)/n - p| \le \varepsilon$ with a probability $\gamma$. This result is important for reliability investigations using Monte Carlo simulations, see also Eq. (A6.152).

The central limit theorem can be generalized under weak conditions to the sum of independent random variables with different distribution functions [A6.6 (Vol. II), A6.7], the meaning of these conditions being that each individual standardized random variable $(\tau_i - E[\tau_i])/\sqrt{Var[\tau_i]}$ provides a small contribution to the standardized sum (Lindeberg conditions).

Examples A6.21-A6.23 give some applications of the central limit theorem.

Example A6.21

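The series production of a given assembly requires 5,000 ICs of a particular type. 0.5% of these ICs are defective. How many ICs must be bought in order to be able to produce the series with a probability of $\gamma = 0.99$?

Solution Setting $p = \Pr\{\text{IC defective}\} = 0.005$, the minimum value of $n$ satisfying

$$\Pr\left\{n - \sum_{i=1}^{n} \delta_i \ge 5{,}000\right\} = \Pr\left\{\sum_{i=1}^{n} \delta_i \le n - 5{,}000\right\} \ge 0.99$$

must be found. Rearrangement of Eq. (A6.149) and considering $t = t_\gamma$ leads to

$$\lim_{n \to \infty} \Pr\left\{\sum_{i=1}^{n} \delta_i \le t_\gamma \sqrt{n\,p\,(1 - p)} + n\,p\right\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t_\gamma} e^{-x^2/2}\, dx = \gamma,$$

where $t_\gamma$ denotes the $\gamma$ quantile of the standard normal distribution $\Phi(t)$ given by Eq. (A6.109) or Table A9.1. For $\gamma = 0.99$ one obtains from Table A9.1 $t_\gamma = t_{0.99} = 2.33$. With $p = 0.005$, it follows that

$$n - 5{,}000 = 2.33\,\sqrt{n \cdot 0.005 \cdot 0.995} + 0.005\,n.$$

Thus, $n = 5{,}037$ ICs must be bought (if only $5{,}025 = 5{,}000 + 5{,}000 \cdot 0.005$ ICs were ordered, then $t_\gamma = 0$ and $\gamma = 0.5$).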
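The result of Example A6.21 can be reproduced by a simple search. The sketch below uses the normal approximation described in the solution, i.e. it looks for the smallest $n$ with $n(1 - p) - 5{,}000 \ge t_\gamma\sqrt{np(1 - p)}$; the function name is an assumption, and the quantile is taken from `statistics.NormalDist` instead of a table (2.326 rather than the rounded 2.33).

```python
import math
from statistics import NormalDist

def min_items_to_buy(required, p, gamma):
    """Smallest n such that, under the normal approximation, at least
    `required` non-defective items are obtained with probability gamma."""
    t = NormalDist().inv_cdf(gamma)  # gamma quantile of the standard normal
    n = required
    while (1.0 - p) * n - required < t * math.sqrt(n * p * (1.0 - p)):
        n += 1
    return n

print(min_items_to_buy(5000, 0.005, 0.99))
```

For the data of the example the search stops at 5,037, in agreement with the value given in the solution.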


Example A6.22 Electronic components are delivered with a defective probability $p = 0.1\%$. (i) How large is the probability of having exactly 8 defective components in a (homogeneous) lot of size $n = 5{,}000$? (ii) In which interval $[k_1, k_2]$ around the mean value $n p = 5$ will the number of defective components lie in a lot of size $n = 5{,}000$ with a probability $\gamma$ as near as possible to 0.95?

Solution

(i) The use of the Poisson approximation (Eq. (A6.129)) leads to

$$p_8 \approx \frac{5^8}{8!}\, e^{-5} \approx 0.06528,$$

the exact value (obtained with Eq. (A6.120)) being 0.06527. For comparison, the following are the values of $p_k$ obtained with the Poisson approximation (Eq. (A6.129)) in the first row and the exact values from Eq. (A6.120) in the second row
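The comparison between the Poisson approximation and the exact binomial probabilities can be regenerated numerically. The following sketch prints both values for $k = 0, \ldots, 10$ with $n = 5{,}000$ and $p = 0.001$; the range of $k$ and the function names are assumptions.

```python
import math

def poisson_pmf(k, m):
    """Poisson probability m**k / k! * exp(-m)."""
    return m ** k / math.factorial(k) * math.exp(-m)

def binomial_pmf(k, n, p):
    """Exact binomial probability for k occurrences in n trials."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

n, p = 5000, 0.001
m = n * p
print(" k   Poisson    exact")
for k in range(11):
    print(f"{k:2d}  {poisson_pmf(k, m):.5f}  {binomial_pmf(k, n, p):.5f}")

# Probability mass of the interval [1, 9] under the Poisson approximation.
gamma_1_9 = sum(poisson_pmf(k, m) for k in range(1, 10))
print(gamma_1_9)
```

The two rows differ only in the fifth or sixth decimal place, which shows how well the Poisson approximation works for small $p$ and large $n$.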

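(ii) From the above table one recognizes that the interval $[k_1, k_2] = [1, 9]$ is centered on the mean value $n p = 5$ and satisfies the condition "$\gamma$ as near as possible to 0.95" ($\gamma = p_1 + p_2 + \ldots + p_9 \approx 0.96$). A good approximation for $k_1$ and $k_2$ can also be obtained using Eq. (A6.151) to determine $\varepsilon = (k_2 - k_1)/2n$ for given $p$, $n$, and $t_{(1+\gamma)/2}$

$$\varepsilon = \frac{k_2 - k_1}{2n} = \frac{\sqrt{n\,p\,(1 - p)}}{n}\; t_{(1+\gamma)/2}, \tag{A6.151}$$

where $t_{(1+\gamma)/2}$ is the $(1 + \gamma)/2$ quantile of the standard normal distribution $\Phi(t)$ (Eq. (A6.109)). Equation (A6.151) is a consequence of Eq. (A6.150) by considering that

$$\frac{2}{\sqrt{2\pi}} \int_0^A e^{-x^2/2}\, dx = \gamma \quad \text{yields} \quad \frac{1}{\sqrt{2\pi}} \int_{-\infty}^A e^{-x^2/2}\, dx = 0.5 + \frac{\gamma}{2} = \frac{1 + \gamma}{2},$$

from which,

$$\frac{n\,\varepsilon}{\sqrt{n\,p\,(1 - p)}} = A = t_{(1+\gamma)/2}.$$

With $\gamma = 0.95$, $t_{(1+\gamma)/2} = t_{0.975} = 1.96$ (Table A9.1), $n = 5{,}000$, and $p = 0.001$ one obtains $n\varepsilon = 4.38$, yielding $k_1 = np - n\varepsilon = 0.62$ ($\ge 0$) and $k_2 = np + n\varepsilon = 9.38$ ($\le n$). The same solution is also given by Eq. (A8.45)

$$k_{1,2} = n\,p \mp b\,\sqrt{n\,p\,(1 - p)},$$

considering $b = t_{(1+\gamma)/2}$.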

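Example A6.23 As an example belonging to both probability theory and statistics, determine the number $n$ of trials necessary to estimate an unknown probability $p$ within a given interval $\pm\varepsilon$ at a given probability $\gamma$ (e.g. for a Monte Carlo simulation).

Solution From Eq. (A6.150) it follows that for $n \to \infty$

$$\Pr\left\{\left|\frac{\sum_{i=1}^{n} \delta_i}{n} - p\right| \le \varepsilon\right\} \to \frac{2}{\sqrt{2\pi}} \int_0^{n\varepsilon/\sqrt{n p (1 - p)}} e^{-x^2/2}\, dx = \gamma.$$

Therefore,

$$\frac{1}{\sqrt{2\pi}} \int_0^{n\varepsilon/\sqrt{n p (1 - p)}} e^{-x^2/2}\, dx = \frac{\gamma}{2} \quad \text{yields} \quad \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{n\varepsilon/\sqrt{n p (1 - p)}} e^{-x^2/2}\, dx = 0.5 + \frac{\gamma}{2} = \frac{1 + \gamma}{2},$$

and thus $n\varepsilon / \sqrt{n\,p\,(1 - p)} = t_{(1+\gamma)/2}$, from which

$$n = \left(\frac{t_{(1+\gamma)/2}}{\varepsilon}\right)^2 p\,(1 - p), \tag{A6.152}$$

where $t_{(1+\gamma)/2}$ is the $(1 + \gamma)/2$ quantile of the standard normal distribution $\Phi(t)$ (Eq. (A6.109), Appendix A9.1). The number of trials $n$ depends on the value of $p$ and is a maximum ($n_{max}$) for $p = 0.5$. The following table gives $n_{max}$ for different values of $\varepsilon$ and $\gamma$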
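The number of trials can be evaluated with a few lines of Python. This sketch assumes the relation $n = (t_{(1+\gamma)/2}/\varepsilon)^2\, p(1 - p)$, rounds up to the next integer, and uses $p = 0.5$ for $n_{max}$; the $\varepsilon$ and $\gamma$ values in the loop are illustrative and not necessarily those tabulated in the original.

```python
import math
from statistics import NormalDist

def required_trials(p, eps, gamma):
    """Trials needed so that |S_n/n - p| <= eps holds with probability gamma
    (normal approximation, rounded up)."""
    t = NormalDist().inv_cdf((1.0 + gamma) / 2.0)
    return math.ceil((t / eps) ** 2 * p * (1.0 - p))

def max_required_trials(eps, gamma):
    """Worst case over p, reached at p = 0.5."""
    return required_trials(0.5, eps, gamma)

for eps in (0.05, 0.025, 0.01):
    print(eps, [max_required_trials(eps, g) for g in (0.8, 0.9, 0.95)])
```

Halving $\varepsilon$ multiplies the required number of trials by four, which is the main cost driver when planning a Monte Carlo simulation.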

Equation (A6.152) has been established by assuming that $p$ is known. Thus, $\varepsilon$ refers to the number of observations in $n$ trials ($2\varepsilon n = k_2 - k_1$ as per Eq. (A8.45) with $b = t_{(1+\gamma)/2}$). However, the meaning of Eq. (A8.45) can be reversed by assuming that the number $k$ of realizations in $n$ trials is known. In this case, for $n$ large and $p$ or $(1 - p)$ not very small, $\varepsilon$ refers to the width of the confidence interval for $p$ ($2\varepsilon = \hat p_u - \hat p_l$ as per Eq. (A8.43) with $k(1 - k/n) \gg b^2/4$ and thus also $n \gg b^2$). The two considerations yielding a relation of the form given by Eq. (A6.152) are basically different (probability theory and statistics) and agree only because of $n \to \infty$ (see also the remarks on pp. 508 and 520). For $n$, $p$ or $(1 - p)$ small, the binomial distribution has to be used (Eqs. (A8.37) and (A8.38)).

A7 Basic Stochastic-Processes Theory

Stochastic processes are a powerful tool for investigating reliability and availability of repairable equipment and systems. A stochastic process can be considered as a family of time-dependent random variables or as a random function in time, and thus has a theoretical foundation based on probability theory (Appendix A6). The use of stochastic processes allows analysis of the influence of the failure-free and repair time distributions of elements, as well as of the system's structure, repair strategy, and logistic support, on the reliability and availability of a given system. Considering applications given in Chapter 6, this appendix mainly deals with regenerative stochastic processes with a finite state space, to which belong renewal processes, Markov processes, semi-Markov processes, and semi-regenerative processes, including reward and frequency/duration aspects. However, because of their importance, some nonregenerative processes (in particular the nonhomogeneous Poisson process) are introduced in Appendix A7.8. This appendix is a compendium of the theory of stochastic processes, consistent from a mathematical point of view but still with reliability engineering applications in mind. Selected examples illustrate the practical aspects.

A7.1 Introduction

Stochastic processes are mathematical models for random phenomena evolving over time, such as the time behavior of a repairable system or the noise voltage of a diode. They are designated hereafter by Greek letters $\xi(t)$, $\zeta(t)$, $\eta(t)$, $\nu(t)$ etc.

To introduce the concept of stochastic process, consider the time behavior of a system subject to random influences and let $T$ be the time interval of interest, e.g. $T = [0, \infty)$. The set of possible states of the system, i.e. the state space, is assumed to be a subset of the set of real numbers. The state of the system at a given time $t_0$ is thus a random variable $\xi(t_0)$. The random variables $\xi(t)$, $t \in T$, may be arbitrarily coupled together. However, for any $n = 1, 2, \ldots$, and arbitrary values $t_1, \ldots, t_n \in T$, the existence of the $n$-dimensional distribution function (Eq. (A6.51))

$$F(x_1, \ldots, x_n, t_1, \ldots, t_n) = \Pr\{\xi(t_1) \le x_1, \ldots, \xi(t_n) \le x_n\} \tag{A7.1}$$

is assumed. $\xi(t_1), \ldots, \xi(t_n)$ are thus the components of a random vector $\vec\xi(t)$. It can be shown that the family of $n$-dimensional distribution functions (Eq. (A7.1)) satisfies the consistency condition

$$F(x_1, \ldots, x_k, \infty, \ldots, \infty, t_1, \ldots, t_k, t_{k+1}, \ldots, t_n) = F(x_1, \ldots, x_k, t_1, \ldots, t_k), \quad k < n,$$

and the symmetry condition

$$F(x_{i_1}, \ldots, x_{i_n}, t_{i_1}, \ldots, t_{i_n}) = F(x_1, \ldots, x_n, t_1, \ldots, t_n),$$

where $(i_1, \ldots, i_n)$ is an arbitrary permutation of $(1, \ldots, n)$.

Conversely, if a family of distribution functions $F(x_1, \ldots, x_n, t_1, \ldots, t_n)$ satisfying the above consistency and symmetry conditions is given, then according to a theorem of A.N. Kolmogorov [A6.10], a distribution law on a suitable event field of the space $\mathcal{R}^T$ consisting of all real functions on $T$ exists. This distribution law is the distribution of a random function $\xi(t)$, $t \in T$, usually referred to as a stochastic process. The time function resulting from a particular experiment is called a sample path or realization of the stochastic process. All sample paths are in $\mathcal{R}^T$; however, the set of sample paths for a particular stochastic process can be significantly smaller than $\mathcal{R}^T$, e.g. consisting only of increasing step functions. In the case of discrete time, the notion of a sequence of random variables $\xi_n$, $n \in T$, is generally used. The concept of a stochastic process generalizes the concept of a random variable introduced in Appendix A6.5. If the random variables $\xi(t)$ are defined as measurable functions $\xi(t) = \xi(t, \omega)$, $t \in T$, on a given probability space $[\Omega, \mathcal{F}, \Pr]$ then

$$F(x_1, \ldots, x_n, t_1, \ldots, t_n) = \Pr\{\omega : \xi(t_1, \omega) \le x_1, \ldots, \xi(t_n, \omega) \le x_n\},$$

and the consistency and symmetry conditions are fulfilled. $\omega$ represents the random influence. The function $\xi(t, \omega)$, $t \in T$, is for a given $\omega$ a realization of the stochastic process.

The Kolmogorov theorem assures the existence of a stochastic process. However, the determination of all $n$-dimensional distribution functions is practically impossible, in general. Sufficient for many applications are often some specific parameters of the stochastic process involved, such as state probabilities or stay (sojourn) times. The problem considered, and the model assumed, generally allow determination of

• the time domain $T$ (continuous, discrete, finite, infinite)
• the structure of the state space (continuous, discrete)
• the dependency structure of the process under consideration (e.g. memoryless)
• invariance properties with respect to time shifts (time-homogeneous, stationary).

The simplest process in discrete time is a sequence of independent random variables $\xi_1, \xi_2, \ldots$. Also easy to describe are processes with independent increments, for instance Poisson processes (Appendices A7.2.5 & A7.8.2), for which

$$\Pr\{\xi(t_0) \le x_0 \cap \xi(t_1) - \xi(t_0) \le x_1 \cap \ldots \cap \xi(t_n) - \xi(t_{n-1}) \le x_n\} = \Pr\{\xi(t_0) \le x_0\} \prod_{i=1}^{n} \Pr\{\xi(t_i) - \xi(t_{i-1}) \le x_i\} \tag{A7.2}$$

holds for arbitrary $n = 1, 2, \ldots$, $x_0, \ldots, x_n$, and $t_0 < \ldots < t_n \in T$. For reliability investigations, processes with continuous time parameter $t \ge 0$ and discrete state space $\{Z_0, \ldots, Z_m\}$ are important. Among these, the following processes will be discussed in the following sections (see Table 6.1 for a comparison).

• renewal processes
• Markov processes
• semi-Markov processes
• semi-regenerative processes (processes with an embedded semi-Markov process)
• particular nonregenerative processes (nonhomogeneous Poisson processes for instance).

Markov processes represent a straightforward generalization of sequences of independent random variables. They are processes without aftereffect: the evolution of the process after an arbitrary time point $t$ depends only on $t$ and on the state occupied at $t$, not on the evolution of the process before $t$. For time-homogeneous Markov processes, the dependence on $t$ also disappears (memoryless property). Markov processes are very simple regenerative stochastic processes. They are regenerative with respect to each state and, if time-homogeneous, also with respect to any time $t$. Semi-Markov processes have the Markov property at the time points of any state change; i.e., all states of a semi-Markov process are regeneration states. In a semi-regenerative process, a subset $Z_0, \ldots, Z_k$ of the states $\{Z_0, \ldots, Z_m\}$ are regeneration states and constitute an embedded semi-Markov process.

For an arbitrary regenerative stochastic process, there exists a sequence of random points (regeneration points) at which the process forgets its foregoing evolution and (from a probabilistic point of view) restarts anew. Typically, regeneration points occur when the process returns to some particular states (regeneration states). Between regeneration points, the dependency structure of the process can be very complicated.

In order to describe the time behavior of systems which are in statistical equilibrium (steady-state), stationary and time-homogeneous processes are suitable. The process $\xi(t)$ is stationary (strictly stationary) if for arbitrary $n = 1, 2, \ldots$, $t_1, \ldots, t_n$, and time span $a$ ($t_i,\ t_i + a \in T$, $i = 1, \ldots, n$)

$$F(x_1, \ldots, x_n, t_1 + a, \ldots, t_n + a) = F(x_1, \ldots, x_n, t_1, \ldots, t_n). \qquad \text{(A7.3)}$$

For $n = 1$, Eq. (A7.3) shows that the distribution function of the random variable $\xi(t)$ is independent of $t$. Hence, $\mathrm{E}[\xi(t)]$, $\mathrm{Var}[\xi(t)]$, and all other moments are independent of time. For $n = 2$, the distribution function of the two-dimensional random variable $(\xi(t), \xi(t + a))$ is only a function of $a$. From this it follows that the correlation coefficient between $\xi(t)$ and $\xi(t + a)$ is also only a function of $a$.


Besides stationarity in the strict sense, stationarity is also defined in the wide sense. The process $\xi(t)$ is stationary in the wide sense if the mean $\mathrm{E}[\xi(t)]$, the variance $\mathrm{Var}[\xi(t)]$, and the correlation coefficient $\rho_{\xi\xi}(t, t + a)$ are finite and independent of $t$. Stationarity in the strict sense of a process having a finite variance implies stationarity in the wide sense. The converse is true only in some particular cases, e.g. for the normal process (a process for which all n-dimensional distribution functions (Eq. (A7.1)) are n-dimensional normal distribution functions, see Example A6.16).

A process $\xi(t)$ is time-homogeneous if it has stationary increments, i.e. if for arbitrary $n = 1, 2, \ldots$, values $x_1, \ldots, x_n$, time span $a$, and disjoint intervals $(t_i, b_i)$ ($t_i,\ t_i + a,\ b_i,\ b_i + a \in T$, $i = 1, \ldots, n$)

If $\xi(t)$ is stationary, it is also time-homogeneous. The converse is not true, in general. However, time-homogeneous Markov processes (for instance) become stationary as $t \to \infty$.

The stochastic processes discussed in this appendix evolve in time, and their state space is a subset of the natural numbers. Both restrictions can be dropped, without particular difficulties, with a view to a general theory of stochastic processes.

A7.2 Renewal Processes

In reliability theory, renewal processes describe the model of an item in continuous operation which is replaced at each failure, in a negligible amount of time, by a new, statistically identical item. Results for renewal processes are basic and useful in many practical situations.

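To define the renewal process, let $\tau_0, \tau_1, \ldots$ be (stochastically) independent and non-negative random variables distributed according to

$$F_A(x) = \Pr\{\tau_0 \le x\}, \quad x \ge 0, \qquad \text{(A7.6)}$$

and

$$F(x) = \Pr\{\tau_i \le x\}, \quad i = 1, 2, \ldots, \quad x > 0. \qquad \text{(A7.7)}$$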

The random variables

$$S_n = \tau_0 + \ldots + \tau_{n-1}, \quad n = 1, 2, \ldots, \qquad \text{(A7.8)}$$

or equivalently the sequence $\tau_0, \tau_1, \ldots$, constitute a renewal process. The points $S_1, S_2, \ldots$ are renewal points (regeneration points). The renewal process is thus a particular point process. The arcs relating the time points $0, S_1, S_2, \ldots$ in Fig. A7.1a help to visualize the underlying point process. A count function $\nu(t)$ can be associated with a renewal process, giving the number of renewal points in the interval $(0, t]$ (Fig. A7.1b). Renewal processes are ordinary for $F_A(x) = F(x)$, otherwise they are modified (stationary for $F_A(x)$ as in Eq. (A7.35)). To simplify the analysis, let us assume in the following that

$$\mathrm{MTTF}_0 = \mathrm{E}[\tau_0] < \infty \quad \text{and} \quad \mathrm{MTTF} = \mathrm{E}[\tau_i] = \int_0^\infty (1 - F(x))\,dx < \infty, \quad i \ge 1. \qquad \text{(A7.11)}$$

As $\tau_0, \tau_1, \ldots$ are interarrival times, the variable $x$, starting from 0 at $t = 0$ and at each renewal point $S_1, S_2, \ldots$ (arrival times), is used instead of $t$ (Fig. A7.1a).

Figure A7.1 a) Possible time schedule for a renewal process; b) Corresponding count function $\nu(t)$ ($S_1, S_2, \ldots$ are renewal (regeneration) points; $x$ starts from 0 at $t = 0$ and at each renewal point)
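The construction of Fig. A7.1 can be sketched in a few lines of code. The following is a minimal illustration, not part of the original text: the helper names `renewal_points` and `count`, the exponential interarrival times, the rate `lam`, and the seed are all assumptions chosen for the example.

```python
import bisect
import random

def renewal_points(draw_first, draw_next, horizon, rng):
    """Return the renewal points S_1 < S_2 < ... that fall in (0, horizon]."""
    points = []
    s = draw_first(rng)          # S_1 = tau_0, distributed according to F_A
    while s <= horizon:
        points.append(s)
        s += draw_next(rng)      # S_{n+1} = S_n + tau_n, tau_n distributed according to F
    return points

def count(points, t):
    """Count function nu(t): number of renewal points in (0, t]."""
    return bisect.bisect_right(points, t)

rng = random.Random(1)           # fixed seed, only to make the sketch reproducible
lam = 0.5                        # assumed rate of the exponential interarrival times

def expo(r):
    return r.expovariate(lam)

# Ordinary renewal process: the same sampler is used for tau_0 and for tau_i, i >= 1.
pts = renewal_points(expo, expo, horizon=20.0, rng=rng)
print(len(pts), count(pts, 10.0))
```

Passing two different samplers for `draw_first` and `draw_next` gives a modified renewal process in the sense used above.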


A7.2.1 Renewal Function, Renewal Density

Consider first the distribution function of the number of renewal points $\nu(t)$ in the time interval $(0, t]$. From Fig. A7.1,

$$\Pr\{\nu(t) \le n - 1\} = \Pr\{S_n > t\} = 1 - \Pr\{S_n \le t\} = 1 - \Pr\{\tau_0 + \ldots + \tau_{n-1} \le t\} = 1 - F_n(t), \quad n = 1, 2, \ldots. \qquad \text{(A7.12)}$$

The functions $F_n(t)$ can be calculated recursively (Eq. (A6.73))

From Eq. (A7.12) it follows that

and thus, for the expected value (mean) of $\nu(t)$,

The function $H(t)$ defined by Eq. (A7.15) is the renewal function. Due to $F(0) = 0$, one has $H(0) = 0$. The distribution functions $F_n(t)$ have densities (Eq. (A6.74))

$$f_1(t) = f_A(t) \quad \text{and} \quad f_n(t) = \int_0^t f(x)\, f_{n-1}(t - x)\,dx, \quad n = 2, 3, \ldots, \qquad \text{(A7.16)}$$

and are thus the convolutions of $f(x)$ with $f_{n-1}(x)$. Changing the order of summation and integration one obtains from Eq. (A7.15)

The function

$$h(t) = \frac{dH(t)}{dt} = \sum_{n=1}^{\infty} f_n(t)$$

is the renewal density. $h(t)$ is the failure intensity $z(t)$ (Eq. (A7.228)) for the case in which failures of a repairable item (system) with negligible repair times can be described by a renewal process (see also Eqs. (A7.24) and (A7.229)).

$H(t)$, as per Eq. (A7.17), satisfies

Equation (A7.19) is the renewal equation. The corresponding equation for the renewal density is

It can be shown that Eq. (A7.20) has exactly one solution, whose Laplace transform $\tilde{h}(s)$ exists and is given by (Appendix A9.7)

For an ordinary renewal process ($F_A(x) = F(x)$) it holds that

Thus, an ordinary renewal process is completely characterized by its renewal density $h(t)$ or renewal function $H(t)$. In particular, it can be shown (e.g. [6.4]) that

$$\mathrm{Var}[\nu(t)] = H(t) + 2 \int_0^t h(x)\, H(t - x)\,dx - (H(t))^2. \qquad \text{(A7.23)}$$

It is not difficult to see that $H(t) = \mathrm{E}[\nu(t)]$ and $\mathrm{Var}[\nu(t)]$ are finite for all $t < \infty$.
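The relations linking $F_n$, $H$, and $h$ in this section follow directly from Eqs. (A7.12) and (A7.16). The chain below is a sketch under the definitions given above; the convention $F_0(t) \equiv 1$ is a notational convenience introduced here, and the numbered equations of the text remain the reference.

```latex
\begin{align*}
\Pr\{\nu(t) = n\} &= \Pr\{\nu(t) \le n\} - \Pr\{\nu(t) \le n-1\}
                   = F_n(t) - F_{n+1}(t), \quad n = 0, 1, \ldots, \quad F_0(t) \equiv 1, \\
H(t) = \mathrm{E}[\nu(t)] &= \sum_{n=1}^{\infty} n \bigl(F_n(t) - F_{n+1}(t)\bigr)
                   = \sum_{n=1}^{\infty} F_n(t), \\
h(t) = \frac{dH(t)}{dt} &= \sum_{n=1}^{\infty} f_n(t)
                   = f_A(t) + \int_0^t h(x)\, f(t - x)\, dx, \\
\tilde{h}(s) &= \tilde{f}_A(s) + \tilde{h}(s)\, \tilde{f}(s)
   \;\Longrightarrow\; \tilde{h}(s) = \frac{\tilde{f}_A(s)}{1 - \tilde{f}(s)} .
\end{align*}
```

For the ordinary case $f_A = f$ the last line reduces to $\tilde{f}(s) / (1 - \tilde{f}(s))$, which agrees with $\tilde{h}(s) = \lambda / s$ for exponentially distributed interarrival times.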

The renewal density h(t) has the following important meaning:

Due to the assumption $F_A(0) = F(0) = 0$, it follows that

$$\lim_{\delta t \downarrow 0} \frac{1}{\delta t} \Pr\{\nu(t + \delta t) - \nu(t) > 1\} = 0$$

and thus, for $\delta t \downarrow 0$,

$$\Pr\{\text{any one of the renewal points } S_1 \text{ or } S_2 \text{ or } \ldots \text{ lies in } (t, t + \delta t]\} = h(t)\,\delta t + o(\delta t). \qquad \text{(A7.24)}$$

Equation (A7.24) gives the unconditional probability for one renewal point in $(t, t + \delta t]$. $h(t)$ thus corresponds to the failure intensity $z(t)$ (Eq. (A7.228)) and to the intensity $m(t)$ of a Poisson process (homogeneous (Eq. (A7.42)) or nonhomogeneous (Eq. (A7.193))), but differs fundamentally from the failure rate $\lambda(t)$ (Eq. (A6.25)), which gives the conditional probability for a failure in $(t, t + \delta t]$ given that no failure has occurred in $(0, t]$, and can thus be used for $\tau_0$ only (as a function of $t$). This distinction is important also for the case of a homogeneous Poisson process ($F_A(x) = F(x) = 1 - e^{-\lambda x}$, Appendix A7.2.5), for which $\lambda(x) = \lambda$ holds for all interarrival times (with $x$ starting from 0 at each renewal point) and $h(t) = \lambda$ holds for the whole process. Misuses are known, see e.g. [6.3]. Example A7.1 discusses the shape of $H(t)$ for some practical applications.

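Example A7.1
Give the renewal function $H(t)$, analytically for

(i) $f_A(x) = f(x) = \lambda e^{-\lambda x}$ (Exponential)

(ii) $f_A(x) = f(x) = 0.5\,\lambda (\lambda x)^2 e^{-\lambda x}$ (Erlang with $n = 3$)

(iii) $f_A(x) = f(x) = \lambda (\lambda x)^{\beta - 1} e^{-\lambda x} / \Gamma(\beta)$ (Gamma),

and numerically for $\lambda(x) = \lambda$ for $0 \le x < \psi$ and $\lambda(x) = \lambda + \beta \lambda_w^{\beta} (x - \psi)^{\beta - 1}$ for $x \ge \psi$, i.e. for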

with h = 4.10-~ h-l, h w = 10-~ h-', ß = 5, W = 2.10' h (wearout), and for

(v) $F_A(x) = F(x)$ as in case (iv) but with $\beta = 0.3$ and $\psi = 0$ (early failures).

Give the solution in a graphical form for cases (iv) and (v).

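Solution
The Laplace transforms of $f_A(t)$ and $f(t)$ for the cases (i) to (iii) are (Table A9.7b)

(i) $\tilde{f}_A(s) = \tilde{f}(s) = \lambda / (s + \lambda)$

(ii) $\tilde{f}_A(s) = \tilde{f}(s) = \lambda^3 / (s + \lambda)^3$

(iii) $\tilde{f}_A(s) = \tilde{f}(s) = \lambda^{\beta} / (s + \lambda)^{\beta}$,

$\tilde{h}(s)$ then follows from Eq. (A7.22), yielding $h(t)$ or directly $H(t) = \int_0^t h(x)\,dx$:

(i) $\tilde{h}(s) = \lambda / s$ and $H(t) = \lambda t$

(ii) $\tilde{h}(s) = \dfrac{\lambda^3}{s\,(s^2 + 3\lambda s + 3\lambda^2)} = \dfrac{\lambda^3}{s\,[(s + \tfrac{3}{2}\lambda)^2 + \tfrac{3}{4}\lambda^2]}$

and $H(t) = \dfrac{1}{3}\left[\lambda t - 1 + \dfrac{2}{\sqrt{3}}\, e^{-3\lambda t / 2} \sin\!\left(\dfrac{\sqrt{3}}{2}\lambda t + \dfrac{\pi}{3}\right)\right]$

(iii) $\tilde{h}(s) = \dfrac{\lambda^{\beta} / (s + \lambda)^{\beta}}{1 - \lambda^{\beta} / (s + \lambda)^{\beta}} = \sum_{n=1}^{\infty} \left[\dfrac{\lambda^{\beta}}{(s + \lambda)^{\beta}}\right]^n = \sum_{n=1}^{\infty} \dfrac{\lambda^{n\beta}}{(s + \lambda)^{n\beta}}$

and $H(t) = \sum_{n=1}^{\infty} \int_0^t \dfrac{\lambda^{n\beta} x^{n\beta - 1}}{\Gamma(n\beta)}\, e^{-\lambda x}\,dx$.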

Cases (iv) and (v) can only be solved numerically or by simulation. Figure A7.2 gives the results for these two cases in a graphical form (see Eq. (A7.28) for the asymptotic behavior of $H(t)$, dashed line in Fig. A7.2a). Figure A7.2 shows that the convergence of $H(t)$ to its asymptotic value is reasonably fast. The shape of $H(t)$ allows recognition of the presence of wearout (iv) or early failures (v), but cannot deliver precise indications on the shape of the failure rate (see Section 7.6.3.3 and Problem A7.2 in Appendix A11).

Figure A7.2 a) Renewal function $H(t)$ and b) failure rate $\lambda(x)$ and density function $f(x)$ for cases (iv) and (v) in Example A7.1 ($H(t)$ was obtained empirically, simulating 1000 failure-free times and plotting $H(t)$ as a continuous curve; $\delta = [(\sigma / \mathrm{MTTF})^2 - 1] / 2$ according to Eq. (A7.28))
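Case (ii) of Example A7.1 lends itself to a numerical cross-check. The sketch below is an illustration added here, not the procedure used for Fig. A7.2: it assumes the standard relation $H(t) = \sum_{n \ge 1} F_n(t)$, uses the fact that for an ordinary process with Erlang ($n = 3$) interarrival times $F_n$ is an Erlang distribution with $3n$ stages, and compares the truncated series with the closed form of case (ii). The function names and the tolerance `tol` are hypothetical.

```python
import math

def erlang_cdf(k, lam, t):
    """Distribution function of a sum of k independent Exp(lam) times."""
    if t <= 0.0:
        return 0.0
    term, partial = 1.0, 1.0          # j = 0 term of sum_{j<k} (lam t)^j / j!
    for j in range(1, k):
        term *= lam * t / j
        partial += term
    return 1.0 - math.exp(-lam * t) * partial

def renewal_function_series(lam, t, n_stage=3, tol=1e-12):
    """H(t) = sum_{n>=1} F_n(t), with F_n the Erlang cdf with n * n_stage stages."""
    total, n = 0.0, 1
    while True:
        f_n = erlang_cdf(n * n_stage, lam, t)
        total += f_n
        if f_n < tol:                 # remaining terms are even smaller
            return total
        n += 1

def renewal_function_closed(lam, t):
    """Closed form of case (ii) in Example A7.1 (Erlang with n = 3)."""
    x = lam * t
    return (x - 1.0 + 2.0 / math.sqrt(3.0) * math.exp(-1.5 * x)
            * math.sin(math.sqrt(3.0) / 2.0 * x + math.pi / 3.0)) / 3.0

for x in (0.5, 2.0, 10.0):
    print(x, renewal_function_series(1.0, x), renewal_function_closed(1.0, x))
```

The two columns agree to within rounding, and for large $\lambda t$ both approach the asymptote $\lambda t / 3 - 1/3$ given by Eq. (A7.28) with $\sigma^2 / \mathrm{MTTF}^2 = 1/3$.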

A7.2.2 Recurrence Times

Consider now the distribution functions of the forward recurrence time $\tau_R(t)$ and the backward recurrence time $\tau_S(t)$. As shown in Fig. A7.1a, $\tau_R(t)$ and $\tau_S(t)$ are the time intervals from an arbitrary time point $t$ forward to the next renewal point and backward to the last renewal point (or to the time origin), respectively. It follows from Fig. A7.1a that the event $\tau_R(t) > x$ occurs with one of the following mutually exclusive events

$$A_0 = \{S_1 > t + x\}$$

$$A_n = \{(S_n \le t) \cap (\tau_n > t + x - S_n)\}, \quad n = 1, 2, \ldots.$$

Obviously, $\Pr\{A_0\} = 1 - F_A(t + x)$. The event $A_n$ means that exactly $n$ renewal points have occurred before $t$ and the $(n+1)$th renewal point occurs after $t + x$. Because $S_n$ and $\tau_n$ are independent, it follows that

$$\Pr\{A_n \mid S_n = y\} = \Pr\{\tau_n > t + x - y\}, \quad n = 1, 2, \ldots,$$

and thus, from the theorem of total probability (Eq. (A6.17))

yielding finally

$$\Pr\{\tau_R(t) \le x\} = F_A(t + x) - \int_0^t h(y)\,(1 - F(t + x - y))\,dy. \qquad \text{(A7.25)}$$

The distribution function of the backward recurrence time $\tau_S(t)$ can be obtained as

Since $\Pr\{S_1 > t\} = 1 - F_A(t)$, the distribution function of $\tau_S(t)$ makes a jump of height $1 - F_A(t)$ at the point $x = t$.

A7.2.3 Asymptotic Behavior

Asymptotic behavior of a renewal process (generally of a stochastic process) is understood to be the behavior of the process for $t \to \infty$. The following theorems hold with MTTF as in Eq. (A7.11):

1. Elementary Renewal Theorem [A6.6 (vol. II), A7.24]: If the conditions (A7.9) - (A7.11) are fulfilled, then

$$\lim_{t \to \infty} \frac{H(t)}{t} = \frac{1}{\mathrm{MTTF}}, \quad \text{where } H(t) = \mathrm{E}[\nu(t)]. \qquad \text{(A7.27)}$$

For $\mathrm{Var}[\nu(t)]$ it holds that $\lim_{t \to \infty} \mathrm{Var}[\nu(t)] / t = \sigma^2 / \mathrm{MTTF}^3$, with $\sigma^2 = \mathrm{Var}[\tau_i] < \infty$, $i \ge 1$.

(It can also be shown [6.16] that $\lim_{t \to \infty} (\nu(t) / t) = 1 / \mathrm{MTTF}$ holds with probability 1.)

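2. Tightened Elementary Renewal Theorem [A7.24, A7.29 (1957)]: If the conditions (A7.9) - (A7.11) are fulfilled, $\mathrm{E}[\tau_0] = \mathrm{MTTF}_0 < \infty$ and $\sigma^2 = \mathrm{Var}[\tau_i] < \infty$, $i \ge 1$, then

$$\lim_{t \to \infty} \left( H(t) - \frac{t}{\mathrm{MTTF}} \right) = \frac{\sigma^2}{2\,\mathrm{MTTF}^2} - \frac{\mathrm{MTTF}_0}{\mathrm{MTTF}} + \frac{1}{2}. \qquad \text{(A7.28)}$$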

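3. Key Renewal Theorem [A7.9 (vol. II), A7.24]: If the conditions (A7.9) - (A7.11) are fulfilled, $U(z) \ge 0$ is bounded, nonincreasing, and Riemann integrable over the interval $(0, \infty)$, and $h(t)$ is a renewal density, then

$$\lim_{t \to \infty} \int_0^t U(t - y)\, h(y)\,dy = \frac{1}{\mathrm{MTTF}} \int_0^\infty U(z)\,dz. \qquad \text{(A7.29)}$$

For any $a > 0$, the key renewal theorem leads, with

$$U(z) = \begin{cases} 1 & \text{for } 0 < z < a \\ 0 & \text{otherwise,} \end{cases}$$

to Blackwell's Theorem [A7.9 (vol. II), A7.24]

$$\lim_{t \to \infty} \frac{H(t + a) - H(t)}{a} = \frac{1}{\mathrm{MTTF}}.$$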

Conversely, the key renewal theorem can be obtained from Blackwell's theorem.

4. Renewal Density Theorem [A7.9 (1941), A7.24]: If the conditions (A7.9) - (A7.11) are fulfilled, $f_A(x)$ & $f(x)$ go to 0 as $x \to \infty$, and $\mathrm{Var}[\tau_i] < \infty$, $i \ge 1$, then

$$\lim_{t \to \infty} h(t) = \frac{1}{\mathrm{MTTF}}. \qquad \text{(A7.31)}$$

5. Recurrence Time Limit Theorems: Assuming $U(z) = 1 - F(x + z)$ in Eq. (A7.29) and considering $F_A(\infty) = 1$ & MTTF per Eq. (A7.11), Eq. (A7.25) yields

$$\lim_{t \to \infty} \Pr\{\tau_R(t) \le x\} = 1 - \frac{1}{\mathrm{MTTF}} \int_0^\infty (1 - F(x + z))\,dz = \frac{1}{\mathrm{MTTF}} \int_0^x (1 - F(y))\,dy. \qquad \text{(A7.32)}$$

For $t \to \infty$, the density of the forward recurrence time $\tau_R(t)$ is thus given by $f_{\tau_R}(x) = (1 - F(x)) / \mathrm{MTTF}$. Assuming $\mathrm{E}[\tau_i] = \mathrm{MTTF} < \infty$, $\sigma^2 = \mathrm{Var}[\tau_i] < \infty$, $i \ge 1$ & $\mathrm{E}[\tau_R(t)] < \infty$, it follows that $\lim_{x \to \infty} (x^2 (1 - F(x))) = 0$. Integration by parts yields

$$\lim_{t \to \infty} \mathrm{E}[\tau_R(t)] = \frac{1}{\mathrm{MTTF}} \int_0^\infty x\,(1 - F(x))\,dx = \frac{\mathrm{MTTF}}{2} + \frac{\sigma^2}{2\,\mathrm{MTTF}}. \qquad \text{(A7.33)}$$

The result of Eq. (A7.33) is important to clarify the waiting time paradox:

(i) $\lim_{t \to \infty} \mathrm{E}[\tau_R(t)] = \mathrm{MTTF} / 2$ holds for $\sigma^2 = 0$ (i.e. for $\tau_i = \mathrm{MTTF}$, $i \ge 0$), and

(ii) $\lim_{t \to \infty} \mathrm{E}[\tau_R(t)] = \mathrm{E}[\tau_i] = 1 / \lambda$, $i \ge 0$, holds for $F_A(x) = F(x) = 1 - e^{-\lambda x}$.

Similar results hold for the backward recurrence time $\tau_S(t)$. For a simultaneous observation of $\tau_R(t)$ and $\tau_S(t)$, it must be noted that in this case $\tau_R(t)$ and $\tau_S(t)$ belong to the same $\tau_i$ and are independent only for case (ii).

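6. Central Limit Theorem for Renewal Processes [A7.24 (1954), A7.29 (1957)]: If the conditions (A7.9) and (A7.11) are fulfilled and $\sigma^2 = \mathrm{Var}[\tau_i] < \infty$, $i \ge 1$, then

$$\lim_{t \to \infty} \Pr\left\{ \frac{\nu(t) - t / \mathrm{MTTF}}{\sigma \sqrt{t / \mathrm{MTTF}^3}} \le x \right\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2 / 2}\,dy. \qquad \text{(A7.34)}$$

Equation (A7.34) is a consequence of the central limit theorem for the sum of independent and identically distributed random variables (Eq. (A6.148)).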

Equations (A7.27) - (A7.34) show that the renewal process with an arbitrary initial distribution function $F_A(x)$ converges to a statistical equilibrium (steady-state) as $t \to \infty$; see Appendix A7.2.4 for a discussion of stationary renewal processes.
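The waiting time paradox of Eq. (A7.33) can be made concrete with a small calculation. The sketch below is an added illustration under stated assumptions: the function names are hypothetical, the MTTF of 100 h is an arbitrary example value, and the integral of Eq. (A7.33) is approximated by a midpoint rule truncated at `upper`.

```python
import math

def mean_forward_recurrence(mttf, variance):
    """Limit of E[tau_R(t)] for t -> infinity: MTTF/2 + sigma^2 / (2 MTTF)."""
    return mttf / 2.0 + variance / (2.0 * mttf)

def mean_forward_recurrence_numeric(survival, mttf, upper, steps=200_000):
    """Midpoint-rule approximation of (1/MTTF) * integral_0^upper x (1 - F(x)) dx."""
    dx = upper / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * dx
        total += x * survival(x)
    return total * dx / mttf

mttf = 100.0                                            # assumed example value, in h
deterministic = mean_forward_recurrence(mttf, 0.0)      # case (i): sigma^2 = 0
exponential = mean_forward_recurrence(mttf, mttf ** 2)  # case (ii): sigma = MTTF
numeric = mean_forward_recurrence_numeric(lambda x: math.exp(-x / mttf), mttf, upper=5000.0)
print(deterministic, exponential, numeric)
```

With constant interarrival times an observer arriving at a random instant waits MTTF/2 on average, while with exponentially distributed times the mean wait equals the full MTTF, as stated in (i) and (ii).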

A7.2.4 Stationary Renewal Processes

The results of Appendix A7.2.3 allow a stationary renewal process to be defined as follows:

A renewal process is stationary (in steady-state) if for all $t \ge 0$ the distribution function of $\tau_R(t)$ in Eq. (A7.25) does not depend on $t$.

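It is intuitively clear that such a situation can only occur if a particular relationship exists between the distribution functions $F_A(x)$ and $F(x)$ given by Eqs. (A7.6) and (A7.7). Assuming

$$F_A(x) = \frac{1}{\mathrm{MTTF}} \int_0^x (1 - F(y))\,dy, \qquad \text{(A7.35)}$$

it follows that $f_A(x) = (1 - F(x)) / \mathrm{MTTF}$, $\tilde{f}_A(s) = (1 - \tilde{f}(s)) / (s\,\mathrm{MTTF})$, and thus from Eq. (A7.21)

$$\tilde{h}(s) = \frac{1}{s\,\mathrm{MTTF}},$$

yielding

$$h(t) = \frac{1}{\mathrm{MTTF}}. \qquad \text{(A7.36)}$$

With $F_A(x)$ & $h(x)$ from Eqs. (A7.35) & (A7.36), Eq. (A7.25) yields for any $t \ge 0$

$$\Pr\{\tau_R(t) \le x\} = \frac{1}{\mathrm{MTTF}} \left[ \int_0^{t + x} (1 - F(y))\,dy - \int_0^t (1 - F(t + x - y))\,dy \right] = \frac{1}{\mathrm{MTTF}} \int_0^x (1 - F(y))\,dy. \qquad \text{(A7.37)}$$

Equation (A7.35) is thus a necessary and sufficient condition for stationarity of the renewal process with $\Pr\{\tau_i \le x\} = F(x)$, $i \ge 1$.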

It is not difficult to show that the count process $\nu(t)$ given in Fig. A7.1b, belonging to a stationary renewal process, is a process with stationary increments. For any $t$, $a > 0$, and $n = 1, 2, \ldots$ it follows that

with $F_{n+1}(a)$ as in Eq. (A7.13) and $F_A(x)$ as in Eq. (A7.35). Moreover, for a stationary renewal process, $H(t) = t / \mathrm{MTTF}$ and the mean number of renewals within an arbitrary interval $(t, t + a]$ is

$$H(t + a) - H(t) = \frac{a}{\mathrm{MTTF}}.$$

Comparing Eq. (A7.32) with Eq. (A7.37) it follows that under weak conditions, as $t \to \infty$, every renewal process becomes stationary. From this, the following interpretation can be made, which is useful for practical applications:

A stationary renewal process can be regarded as a renewal process with arbitrary initial condition $F_A(x)$, which has been started at $t = -\infty$ and will only be considered for $t > 0$ ($t = 0$ being an arbitrary time point).

The most important properties of stationary renewal processes are summarized in Table A7.1. Equation (A7.32) also obviously holds for $\tau_R(t)$ and $\tau_S(t)$ in the case of a stationary renewal process.

A7.2.5 Homogeneous Poisson Processes

The renewal process, defined by Eq. (A7.8), with

$$F_A(x) = F(x) = 1 - e^{-\lambda x}, \quad x > 0, \qquad \text{(A7.38)}$$

is a homogeneous Poisson process (HPP). $F_A(x)$ per Eq. (A7.38) fulfills Eq. (A7.35) and thus, the Poisson process is stationary. From Sections A7.2.1 to A7.2.3 it follows that (see also Example A6.20)

As a result of the memoryless property of the exponential distribution, the count process $\nu(t)$ (as in Fig. A7.1b) has independent increments. Quite generally, a point process is a homogeneous Poisson process (HPP), with intensity $\lambda$, if the associated count function $\nu(t)$ has stationary independent increments and satisfies Eq. (A7.41). Alternatively, a renewal process satisfying Eq. (A7.38) is an HPP.

Substituting for $\lambda t$ in Eq. (A7.41) a nondecreasing function $M(t) > 0$, a nonhomogeneous Poisson process (NHPP) is obtained. The NHPP is a point process with independent Poisson distributed increments. Because of independent increments, the NHPP is a process without aftereffect (memoryless if HPP) and the sum of Poisson processes is a Poisson process (Eq. (7.27) for HPP). Moreover, the sum of $n$ independent renewal processes with low occurrence converges for $n \to \infty$ to an NHPP, and to an HPP in the case of stationary independent renewal processes (Appendix A7.8.3). However, despite its intrinsic simplicity, the NHPP is not a regenerative process, and in statistical data analysis the property of independent increments is often difficult to prove. Nonhomogeneous Poisson processes are introduced in Appendix A7.8.2 and used in Sections 7.6 and 7.7 for reliability tests.
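The statement that the sum of Poisson processes is again a Poisson process can be explored by simulation. The following is a minimal sketch, not a proof: it builds two independent HPPs from exponential interarrival times, superposes their counts over $(0, t]$, and compares the sample mean and variance with the value $(\lambda_1 + \lambda_2)\,t$ expected for a Poisson distributed count. The helper `hpp_count`, the intensities, the seed, and the number of runs are assumptions made for the example.

```python
import random

def hpp_count(lam, t, rng):
    """Number of points in (0, t] of an HPP with intensity lam, built from exponential interarrival times."""
    n, s = 0, rng.expovariate(lam)
    while s <= t:
        n += 1
        s += rng.expovariate(lam)
    return n

rng = random.Random(2024)                 # fixed seed, only for reproducibility
lam1, lam2, t, runs = 0.2, 0.4, 5.0, 20000
counts = [hpp_count(lam1, t, rng) + hpp_count(lam2, t, rng) for _ in range(runs)]

mean = sum(counts) / runs
var = sum((c - mean) ** 2 for c in counts) / (runs - 1)
print(mean, var, (lam1 + lam2) * t)       # mean and variance should both be close to 3
```

Equality of mean and variance is only a necessary property of a Poisson count; a fuller check would compare the empirical frequencies with the Poisson probabilities.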

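Table A7.1 Main properties of a stationary renewal process

| | Expression | Comments, assumptions |
|---|---|---|
| 1. Distribution function of $\tau_0$ | $F_A(x) = \frac{1}{T} \int_0^x (1 - F(y))\,dy$ | $F_A(0) = 0$, $f_A(x) = dF_A(x)/dx$; $T = \mathrm{MTTF} = \mathrm{E}[\tau_i]$, $i \ge 1$ |
| 2. Distribution function of $\tau_i$, $i \ge 1$ | $F(x)$ | $F(0) = 0$, $f(x) = dF(x)/dx$ |
| 3. Renewal function | $H(t) = t / T$, $t \ge 0$ | $H(t) = \mathrm{E}[\nu(t)] = \mathrm{E}[\text{number of renewal points in } (0, t]]$ |
| 4. Renewal density | $h(t) = 1 / T$, $t \ge 0$ | $h(t) = dH(t)/dt$; $h(t)\,\delta t = \lim_{\delta t \downarrow 0} \Pr\{S_1 \text{ or } S_2 \text{ or } \ldots \text{ lies in } (t, t + \delta t]\}$ |
| 5. Distribution function & mean of the forward recurrence time | $\Pr\{\tau_R(t) \le x\} = F_A(x)$, $t \ge 0$; $\mathrm{E}[\tau_R(t)] = T/2 + \mathrm{Var}[\tau_i] / (2T)$ | $F_A(x)$ as in point 1, same for $\tau_S(t)$; $T = \mathrm{MTTF} = \mathrm{E}[\tau_i]$, $i \ge 1$ |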

A7.3 Alternating Renewal Processes

Generalization of the renewal process given in Fig. A7.1a by introducing a positive random replacement time, distributed according to $G(x)$, leads to the alternating renewal process. An alternating renewal process is a process with two states, which alternate from one state to the other after a stay (sojourn) time distributed according to $F(x)$ and $G(x)$, respectively. Considering the reliability and availability analysis of a repairable item in Section 6.2 and in order to simplify the notation, these two states will be referred to as the up state and the down state, abbreviated as u and d, respectively.

To define an alternating renewal process, consider two independent renewal processes $\{\tau_i\}$ and $\{\tau_i'\}$, $i = 0, 1, \ldots$. For reliability applications, $\tau_i$ denotes the $i$th failure-free time and $\tau_i'$ the $i$th repair (restoration) time. These random variables are distributed according to

$$F_A(x) \text{ for } \tau_0 \quad \text{and} \quad F(x) \text{ for } \tau_i, \quad i \ge 1, \quad x > 0, \qquad \text{(A7.45)}$$

$$G_A(x) \text{ for } \tau_0' \quad \text{and} \quad G(x) \text{ for } \tau_i', \quad i \ge 1, \quad x > 0, \qquad \text{(A7.46)}$$

with $F_A(0) = F(0) = G_A(0) = G(0) = 0$, densities $f_A(x)$, $f(x)$, $g_A(x)$, $g(x)$, and means ($< \infty$)

$$\mathrm{MTTF} = \mathrm{E}[\tau_i] = \int_0^\infty (1 - F(x))\,dx, \quad i \ge 1, \qquad \text{(A7.47)}$$

and

$$\mathrm{MTTR} = \mathrm{E}[\tau_i'] = \int_0^\infty (1 - G(x))\,dx, \quad i \ge 1, \qquad \text{(A7.48)}$$

where MTTF and MTTR are used for mean time to failure and mean time to repair (restoration). The sequences

form two modified alternating renewal processes, starting at $t = 0$ with $\tau_0$ and $\tau_0'$, respectively. Figure A7.3 shows a possible time schedule of these two alternating renewal processes (repair times greatly exaggerated). Embedded in every one of these processes are two renewal processes with renewal points $S_{udui}$ or $S_{uddi}$ marked with ▲ and $S_{duui}$ or $S_{dudi}$ marked with •, where $udu$ denotes a transition from up to down given up at $t = 0$, i.e.

Figure A7.3 Possible time schedule for two alternating renewal processes starting at $t = 0$ with $\tau_0$ and $\tau_0'$, respectively (shown are also the 4 embedded renewal processes with renewal points • and ▲)

These four embedded renewal processes are statistically identical up to the time intervals starting at $t = 0$, i.e. up to

The corresponding densities are

for the time interval starting at $t = 0$, and

$$f(x) * g(x)$$

for all others. The symbol $*$ denotes convolution (Eq. (A6.75)). The results of Section A7.2 can be used to investigate the embedded renewal processes of Fig. A7.3. Equation (A7.22) yields Laplace transforms of the renewal densities $h_{udu}(t)$, $h_{duu}(t)$, $h_{udd}(t)$, and $h_{dud}(t)$

$$\tilde{h}_{udu}(s) = \frac{\tilde{f}_A(s)}{1 - \tilde{f}(s)\,\tilde{g}(s)}, \qquad \tilde{h}_{duu}(s) = \frac{\tilde{f}_A(s)\,\tilde{g}(s)}{1 - \tilde{f}(s)\,\tilde{g}(s)},$$

$$\tilde{h}_{udd}(s) = \frac{\tilde{g}_A(s)\,\tilde{f}(s)}{1 - \tilde{f}(s)\,\tilde{g}(s)}, \qquad \tilde{h}_{dud}(s) = \frac{\tilde{g}_A(s)}{1 - \tilde{f}(s)\,\tilde{g}(s)}.$$

To describe the alternating renewal process defined above (Fig. A7.3), let us introduce the two-dimensional stochastic process $(\zeta(t), \tau_{R\zeta(t)}(t))$, where $\zeta(t)$ denotes the state of the process (repairable item in reliability applications)

$$\zeta(t) = \begin{cases} u & \text{if the item is up at time } t \\ d & \text{if the item is down at time } t. \end{cases}$$

$\tau_{Ru}(t)$ and $\tau_{Rd}(t)$ are thus the forward recurrence times in the up and down states, respectively, provided that the item is up or down at the time $t$, see Fig. 6.3.

To investigate the general case, both alternating renewal processes of Fig. A7.3 must be combined. For this let

$$p = \Pr\{\text{item up at } t = 0\} \quad \text{and} \quad 1 - p = \Pr\{\text{item down at } t = 0\}. \qquad \text{(A7.51)}$$

In terms of the process $(\zeta(t), \tau_{R\zeta(t)}(t))$,

Consecutive jumps from up to down form a renewal process with renewal density

Similarly, the renewal density for consecutive jumps from down to up is given by

Using Eqs. (A7.52) and (A7.53), and considering Eq. (A7.25), it follows that

$$\Pr\{\zeta(t) = u \cap \tau_{Ru}(t) > \theta\} = p\,(1 - F_A(t + \theta)) + \int_0^t h_{du}(x)\,(1 - F(t - x + \theta))\,dx \qquad \text{(A7.54)}$$

and

Setting $\theta = 0$ in Eq. (A7.54) yields

$$\Pr\{\zeta(t) = u\} = p\,(1 - F_A(t)) + \int_0^t h_{du}(x)\,(1 - F(t - x))\,dx. \qquad \text{(A7.56)}$$

The probability $\mathrm{PA}(t) = \Pr\{\zeta(t) = u\}$ is called the point availability and $\mathrm{IR}(t, t + \theta) = \Pr\{\zeta(t) = u \cap \tau_{Ru}(t) > \theta\}$ the interval reliability of the given item (Section 6.2).

An alternating renewal process, characterized by the parameters $p$, $F_A(x)$, $F(x)$, $G_A(x)$, and $G(x)$, is stationary if the two-dimensional process $(\zeta(t), \tau_{R\zeta(t)}(t))$ is stationary. As with the renewal process it can be shown that an alternating renewal process is stationary if and only if

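$$p = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}, \quad F_A(x) = \frac{1}{\mathrm{MTTF}} \int_0^x (1 - F(y))\,dy, \quad G_A(x) = \frac{1}{\mathrm{MTTR}} \int_0^x (1 - G(y))\,dy, \qquad \text{(A7.57)}$$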

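with MTTF and MTTR as in Eqs. (A7.47) and (A7.48). In particular, for $t \ge 0$ the following relationships apply for the stationary alternating renewal process (Examples 6.3 and 6.4)

$$\mathrm{PA}(t) = \Pr\{\text{item up at } t\} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = \mathrm{PA}, \qquad \text{(A7.58)}$$

$$\mathrm{IR}(t, t + \theta) = \Pr\{\text{item up at } t \text{ and remains up until } t + \theta\} = \frac{1}{\mathrm{MTTF} + \mathrm{MTTR}} \int_\theta^\infty (1 - F(y))\,dy.$$

Condition (A7.57) is equivalent to

Moreover, application of the key renewal theorem (Eq. (A7.29)) to Eqs. (A7.54) - (A7.56) yields (Example 6.4)

$$\lim_{t \to \infty} \Pr\{\zeta(t) = u \cap \tau_{Ru}(t) > \theta\} = \frac{1}{\mathrm{MTTF} + \mathrm{MTTR}} \int_\theta^\infty (1 - F(y))\,dy, \qquad \text{(A7.61)}$$

$$\lim_{t \to \infty} \Pr\{\zeta(t) = d \cap \tau_{Rd}(t) > \theta\} = \frac{1}{\mathrm{MTTF} + \mathrm{MTTR}} \int_\theta^\infty (1 - G(y))\,dy, \qquad \text{(A7.62)}$$

$$\lim_{t \to \infty} \Pr\{\zeta(t) = u\} = \lim_{t \to \infty} \mathrm{PA}(t) = \mathrm{PA} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}. \qquad \text{(A7.63)}$$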

Thus, irrespective of its initial conditions $p$, $F_A(x)$, and $G_A(x)$, an alternating renewal process has for $t \to \infty$ an asymptotic behavior which is identical to the stationary state (steady-state). In other words:

A stationary alternating renewal process can be regarded as an alternating renewal process with arbitrary initial conditions $p$, $F_A(x)$, and $G_A(x)$, which has been started at $t = -\infty$ and will only be considered for $t \ge 0$ ($t = 0$ being an arbitrary time point).

It should be noted that the results of this section remain valid even if independence between $\tau_i$ and $\tau_i'$ within a cycle (e.g. $\tau_0 + \tau_1'$, $\tau_1 + \tau_2'$, ...) is dropped; only independence between cycles is necessary. For exponentially distributed $\tau_i$ and $\tau_i'$, i.e. for constant failure rate $\lambda$ and repair rate $\mu$ in reliability applications, the convergence of $\mathrm{PA}(t)$ towards $\mathrm{PA}$ stated by Eq. (A7.63) is of the form $\mathrm{PA}(t) - \mathrm{PA} = (\lambda / (\lambda + \mu))\, e^{-(\lambda + \mu) t} \approx (\lambda / \mu)\, e^{-\mu t}$; see Eq. (6.20) and Section 6.2.4 for further considerations.
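The convergence of $\mathrm{PA}(t)$ for constant failure and repair rates is easy to evaluate numerically. The sketch below is an added illustration: it assumes an item that is up at $t = 0$, so that $\mathrm{PA}(0) = 1$, and uses $\mathrm{PA} = \mathrm{MTTF} / (\mathrm{MTTF} + \mathrm{MTTR}) = \mu / (\lambda + \mu)$ with $\mathrm{MTTF} = 1/\lambda$ and $\mathrm{MTTR} = 1/\mu$. The function name and the numerical rates are hypothetical example choices.

```python
import math

def point_availability(lam, mu, t):
    """PA(t) for an item up at t = 0, with constant failure rate lam and repair rate mu."""
    pa = mu / (lam + mu)                       # stationary value MTTF / (MTTF + MTTR)
    return pa + lam / (lam + mu) * math.exp(-(lam + mu) * t)

lam, mu = 1e-3, 0.5                            # assumed illustrative rates, in 1/h
pa_inf = mu / (lam + mu)

for t in (0.0, 1.0, 5.0, 10.0, 100.0):
    print(t, point_availability(lam, mu, t) - pa_inf)
```

The printed differences decay with the time constant $1 / (\lambda + \mu)$, i.e. after a few mean repair times the point availability is practically at its stationary value.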

A7.4 Regenerative Processes

A regenerative process is characterized by the property that there is a sequence of random points on the time axis, regeneration points, at which the process forgets its foregoing evolution and, from a probabilistic point of view, restarts anew. The times at which a regenerative process restarts occur when the process returns to some states, defined as regeneration states. The sequence of these time points for a specific regeneration state is a renewal process embedded in the original stochastic process. For example, both the states up and down of an alternating renewal process are regeneration states. All states of time-homogeneous Markov processes and of semi-Markov processes, defined by Eqs. (A7.95) and (A7.158), are regenerative. However, there are processes in discrete state space with only a few (two in Fig. A7.11, one in Fig. 6.10) or even with no regeneration states (see e.g. Appendix A7.8 for some considerations). A regenerative process must have at least one regeneration state.

A regenerative process thus consists of independent cycles which describe the time behavior of the process between two consecutive regeneration points of the same type (same regeneration state). The $i$th cycle is characterized by a positive random variable $\tau_{c_i}$ (duration of cycle $i$) and a stochastic process $\xi_i(t)$ defined for $0 \le t < \tau_{c_i}$ (content of the cycle). Let $\xi_n(t)$, $0 \le t < \tau_{c_n}$, $n = 0, 1, \ldots$ be (stochastically) independent and for $n \ge 1$ identically distributed cycles. For simplicity, let us assume that the time points $S_1 = \tau_{c_0}$, $S_2 = \tau_{c_0} + \tau_{c_1}$, ... form a renewal process. The random variables $\tau_{c_0}$ and $\tau_{c_i}$, $i \ge 1$, have distribution functions $F_A(x)$ for $\tau_{c_0}$ and $F(x)$ for $\tau_{c_i}$, densities $f_A(x)$ and $f(x)$, and finite means $T_A$ and $T_c$, respectively. The regenerative process $\xi(t)$ is then given by

The regenerative structure is sufficient for the existence of an asymptotic behavior (limiting distribution) for the process as $t \to \infty$ (provided that the mean time between regeneration points is finite). This limiting distribution is determined by the behavior of the process between two consecutive regeneration points of the same regeneration state.


Defining $h(t)$ as the renewal density of the renewal process given by $S_1, S_2, \ldots$ and setting

it follows, similarly to Eq. (A7.25), that

For any given distribution of the cycle $\xi_i(t)$, $0 \le t < \tau_{c_i}$, $i \ge 1$, with $T_c = \mathrm{E}[\tau_{c_i}] < \infty$, there exists a stationary regenerative process $\xi_e(t)$ with regeneration points $S_{e_i}$, $i \ge 1$. The cycles $\xi_{e_n}(t)$, $0 \le t < \tau_{e_n}$, have for $n \ge 1$ the same distribution law as $\xi_i(t)$, $0 \le t < \tau_{c_i}$. The distribution law of the starting cycle $\xi_{e_0}(t)$, $0 \le t < \tau_{e_0}$, can be calculated from the distribution law of $\xi_i(t)$, $0 \le t < \tau_{c_i}$, see Eq. (A7.57) for alternating renewal processes. In particular,

with $T_c = \mathrm{E}[\tau_{c_i}] < \infty$, $i \ge 1$. Furthermore, for every non-negative function $g(t)$ and $S_1 = 0$,

Equation (A7.66) is known as the stochastic mean value theorem. Since $U(t, B)$ is nonincreasing and $\le 1 - F(t)$ for all $t \ge 0$, it follows from Eq. (A7.64) and the key renewal theorem (Eq. (A7.29)) that

Equations (A7.65) and (A7.67) show that under general conditions, as $t \to \infty$, a regenerative process becomes stationary. As in the case of renewal and alternating renewal processes, the following interpretation is true:

A stationary regenerative process can be considered as a regenerative process with arbitrary distribution of the starting cycle, which has been started at $t = -\infty$ and will only be considered for $t \ge 0$ ($t = 0$ being an arbitrary time point).


A7.5 Markov Processes with Finitely Many States

Markov processes are processes without aftereffect. They are characterized by the property that for any (arbitrarily chosen) time point $t$ their evolution after $t$ depends on $t$ and the state occupied at $t$, but not on the process evolution up to the time $t$. In the case of time-homogeneous Markov processes, dependence on $t$ also disappears. In reliability theory, these processes describe the behavior of repairable systems with constant failure and repair rates for all elements. Constant rates are required during the stay (sojourn) time in any state, not necessarily at state changes (e.g. for load sharing). After an introduction to Markov chains, time-homogeneous Markov processes with finitely many states are considered in depth, as a basis for Chapter 6.

A7.5.1 Markov Chains with Finitely Many States

Let $\xi_0, \xi_1, \ldots$ be the sequence of consecutively occurring states. A stochastic process in discrete time $\xi_n$ with state space $\{Z_0, \ldots, Z_m\}$ is a Markov chain if for $n = 0, 1, 2, \ldots$ and arbitrary $i, j, i_0, \ldots, i_{n-1} \in \{0, \ldots, m\}$,

The quantities $\mathcal{P}_{ij}(n)$ are the (one step) transition probabilities of the Markov chain. Investigation will be limited here to time-homogeneous Markov chains, for which the transition probabilities $\mathcal{P}_{ij}(n)$ are independent of $n$

For simplicity, Markov chain will be used in the following as equivalent to time-homogeneous Markov chain. The probabilities $\mathcal{P}_{ij}$ satisfy the relationships

$$\mathcal{P}_{ij} \ge 0 \quad \text{and} \quad \sum_{j=0}^{m} \mathcal{P}_{ij} = 1, \quad i, j \in \{0, \ldots, m\}. \qquad \text{(A7.70)}$$

A matrix with elements $\mathcal{P}_{ij}$ as in Eq. (A7.70) is a stochastic matrix. The k-step transition probabilities are the elements of the kth power of the stochastic matrix with elements $\mathcal{P}_{ij}$. For example, $k = 2$ leads to (Example A7.2)

$$\Pr\{\xi_{n+2} = Z_j \mid \xi_n = Z_i\} = \sum_{k=0}^{m} \Pr\{(\xi_{n+2} = Z_j \cap \xi_{n+1} = Z_k) \mid \xi_n = Z_i\},$$

… without any relation to the time axis; this is important when considering embedded Markov chains in a stochastic process.

from which, considering the Markov property (A7.68),

Results for k > 2 follow by induction.
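The statement that the k-step transition probabilities are the elements of the kth power of the stochastic matrix can be tried out directly. The sketch below is an added illustration; the two-state matrix `P` and the helper names `mat_mul` and `k_step` are assumptions made for the example.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def k_step(P, k):
    """k-step transition matrix P^k (k >= 1) by repeated multiplication."""
    result = P
    for _ in range(k - 1):
        result = mat_mul(result, P)
    return result

# Assumed example chain with states Z_0 and Z_1; each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

P2 = k_step(P, 2)
print(P2)   # entry [i][j] is the sum over k of P[i][k] * P[k][j]
```

Each row of `k_step(P, k)` again sums to 1, so every power of a stochastic matrix is itself a stochastic matrix.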

Example A7.2

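Assuming $\Pr\{C\} > 0$, prove that $\Pr\{(A \cap B) \mid C\} = \Pr\{B \mid C\}\,\Pr\{A \mid (B \cap C)\}$.

Solution
For $\Pr\{C\} > 0$ it follows that

$$\Pr\{(A \cap B) \mid C\} = \frac{\Pr\{A \cap B \cap C\}}{\Pr\{C\}} = \frac{\Pr\{B \cap C\}\,\Pr\{A \mid (B \cap C)\}}{\Pr\{C\}} = \Pr\{B \mid C\}\,\Pr\{A \mid (B \cap C)\}.$$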

The distribution law of a Markov chain is completely given by the initial distribution

$$A_i = \Pr\{\xi_0 = Z_i\}, \quad i = 0, \ldots, m, \qquad \text{(A7.72)}$$

with $\sum A_i = 1$, and the transition probabilities $\mathcal{P}_{ij}$, since for every $n$ and arbitrary $i_0, \ldots, i_n \in \{0, \ldots, m\}$,

$$\Pr\{\xi_0 = Z_{i_0} \cap \xi_1 = Z_{i_1} \cap \ldots \cap \xi_n = Z_{i_n}\} = A_{i_0}\, \mathcal{P}_{i_0 i_1} \ldots \mathcal{P}_{i_{n-1} i_n}$$

and thus, using the theorem of total probability (A6.17),

A Markov chain with transition probabilities $\mathcal{P}_{ij}$ is stationary if and only if the state probabilities $\Pr\{\xi_n = Z_j\}$, $j = 0, \ldots, m$, are independent of $n$, i.e. (Eq. (A7.73) with $n = 1$) if the initial distribution $A_i$ (Eq. (A7.72)) is a solution ($\mathcal{P}_j$) of the system

$$\mathcal{P}_j = \sum_{i=0}^{m} \mathcal{P}_i\, \mathcal{P}_{ij}, \quad \text{with } \mathcal{P}_j \ge 0 \text{ and } \sum_{j=0}^{m} \mathcal{P}_j = 1, \quad j = 0, \ldots, m. \qquad \text{(A7.74)}$$

The system given by Eq. (A7.74) must be solved by canceling one (arbitrarily chosen) equation and replacing it by $\sum \mathcal{P}_j = 1$. $\mathcal{P}_0, \ldots, \mathcal{P}_m$ from Eq. (A7.74) define the stationary distribution of the Markov chain with transition probabilities $\mathcal{P}_{ij}$.

A Markov chain with transition probabilities $\mathcal{P}_{ij}$ is irreducible if every state can be reached from every other state, i.e. if for each $(i, j)$ there is an $n = n(i, j)$ such that

$$\mathcal{P}_{ij}^{(n)} > 0, \quad i, j \in \{0, \ldots, m\}, \quad n \ge 1. \qquad \text{(A7.75)}$$

It can be shown that the system (A7.74) possesses a unique solution with

$$\mathcal{P}_j > 0 \quad \text{and} \quad \mathcal{P}_0 + \mathcal{P}_1 + \ldots + \mathcal{P}_m = 1, \quad j = 0, \ldots, m, \qquad \text{(A7.76)}$$

only if the Markov chain is irreducible, see e.g. [A7.3, A7.13, A7.27, A7.29 (1968)].
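The solution procedure described for the system (A7.74), canceling one equation and replacing it by the normalization condition, can be sketched as follows. This is a minimal illustration and not a general-purpose solver: it assumes an irreducible chain (so that the linear system is regular), uses exact rational arithmetic, and the function name and the example matrix are hypothetical.

```python
from fractions import Fraction

def stationary_distribution(P):
    """Solve p_j = sum_i p_i P[i][j] together with sum_j p_j = 1 for a stochastic matrix P."""
    m = len(P)
    # Equations j = 0 .. m-2: sum_i p_i (P[i][j] - delta_ij) = 0; the last equation
    # of the system is canceled and replaced by the normalization sum_j p_j = 1.
    A = [[Fraction(P[i][j]) - (1 if i == j else 0) for i in range(m)] for j in range(m - 1)]
    A.append([Fraction(1)] * m)
    b = [Fraction(0)] * (m - 1) + [Fraction(1)]
    # Gauss-Jordan elimination with exact rational arithmetic.
    for col in range(m):
        pivot = next(r for r in range(col, m) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        inv = 1 / A[col][col]
        A[col] = [a * inv for a in A[col]]
        b[col] *= inv
        for r in range(m):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [ar - factor * ac for ar, ac in zip(A[r], A[col])]
                b[r] -= factor * b[col]
    return b

# Assumed example chain with two states.
P = [[Fraction(9, 10), Fraction(1, 10)],
     [Fraction(1, 2), Fraction(1, 2)]]
p = stationary_distribution(P)
print(p)
```

For this example the balance equation gives $\mathcal{P}_0 / 10 = \mathcal{P}_1 / 2$, so the stationary distribution is $(5/6, 1/6)$.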

A7.5.2 Markov Processes with Finitely Many States +I

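A stochastic process $\xi(t)$ in continuous time with state space $\{Z_0, \ldots, Z_m\}$ is a Markov process if for $n = 0, 1, 2, \ldots$, arbitrary time points $t + a > t > t_n > \ldots > t_0 \ge 0$, and arbitrary $i, j, i_0, \ldots, i_n \in \{0, \ldots, m\}$,

$$\Pr\{\xi(t + a) = Z_j \mid (\xi(t) = Z_i \cap \xi(t_n) = Z_{i_n} \cap \ldots \cap \xi(t_0) = Z_{i_0})\} = \Pr\{\xi(t + a) = Z_j \mid \xi(t) = Z_i\}. \tag{A7.77}$$

$\xi(t)$ ($t \ge 0$) is a jump function, as visualized in Fig. A7.10. The conditional state probabilities in Eq. (A7.77) are the transition probabilities of the Markov process and they will be designated by $P_{ij}(t, t + a)$

$$P_{ij}(t, t + a) = \Pr\{\xi(t + a) = Z_j \mid \xi(t) = Z_i\}. \tag{A7.78}$$

Equations (A7.77) and (A7.78) give the probability that $\xi(t + a)$ will be $Z_j$ given that $\xi(t)$ was $Z_i$. Between $t$ and $t + a$ the Markov process can visit any other state (this is not the case in Eq. (A7.95), in which $Z_j$ is the next state visited after $Z_i$).

The Markov process is time-homogeneous if

$$P_{ij}(t, t + a) = P_{ij}(a). \tag{A7.79}$$

In the following only time-homogeneous Markov processes will be considered. For simplicity, Markov process will often be used as equivalent to time-homogeneous Markov process. For arbitrary $t > 0$ and $a > 0$, $P_{ij}(t + a)$ satisfy the Chapman-Kolmogorov equations

$$P_{ij}(t + a) = \sum_{k=0}^{m} P_{ik}(t)\,P_{kj}(a), \qquad i, j \in \{0, \ldots, m\}, \tag{A7.80}$$

whose demonstration, for given fixed $i$ and $j$, is similar to that for Eq. (A7.71). Furthermore $P_{ij}(a)$ satisfy the conditions

$$P_{ij}(a) \ge 0 \quad \text{and} \quad \sum_{j=0}^{m} P_{ij}(a) = 1, \qquad i = 0, \ldots, m, \tag{A7.81}$$

and thus form a stochastic matrix. Together with the initial distribution

$$P_i(0) = \Pr\{\xi(0) = Z_i\}, \qquad i = 0, \ldots, m, \tag{A7.82}$$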

+) Continuous (parameter) Markov chain is often used in the literature. Using Markov process should help to avoid confusion with Markov chains embedded in stochastic processes (footnote on p. 458).


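the transition probabilities $P_{ij}(a)$ completely determine the distribution law of the Markov process. In particular, the state probabilities for $t > 0$

$$P_j(t) = \Pr\{\xi(t) = Z_j\}, \qquad j = 0, \ldots, m \tag{A7.83}$$

can be obtained from

$$P_j(t) = \sum_{i=0}^{m} P_i(0)\,P_{ij}(t). \tag{A7.84}$$

Setting

$$P_{ij}(0) = \delta_{ij} = \begin{cases} 0 & \text{for } i \ne j \\ 1 & \text{for } i = j \end{cases} \tag{A7.85}$$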

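and assuming that the transition probabilities $P_{ij}(t)$ are continuous at $t = 0$, it can be shown that $P_{ij}(t)$ are also differentiable at $t = 0$. The limiting values

$$\lim_{\delta t \downarrow 0} \frac{P_{ij}(\delta t)}{\delta t} = \rho_{ij} \ \text{ for } i \ne j, \quad \text{and} \quad \lim_{\delta t \downarrow 0} \frac{1 - P_{ii}(\delta t)}{\delta t} = \rho_i, \tag{A7.86}$$

exist and satisfy

$$\rho_i = \sum_{j=0,\, j \ne i}^{m} \rho_{ij}, \qquad i = 0, \ldots, m. \tag{A7.87}$$

Equation (A7.86) can be written in the form

$$P_{ij}(\delta t) = \rho_{ij}\,\delta t + o(\delta t) \quad \text{and} \quad 1 - P_{ii}(\delta t) = \rho_i\,\delta t + o(\delta t), \tag{A7.88}$$

where $o(\delta t)$ denotes a quantity having an order higher than that of $\delta t$, i.e.

$$\lim_{\delta t \downarrow 0} \frac{o(\delta t)}{\delta t} = 0. \tag{A7.89}$$

Considering for any $t \ge 0$

$$P_{ij}(\delta t) = \Pr\{\xi(t + \delta t) = Z_j \mid \xi(t) = Z_i\},$$

the following useful interpretation for $\rho_{ij}$ and $\rho_i$ can be obtained for $\delta t \downarrow 0$ and arbitrary $t$

$$\begin{aligned} \rho_{ij}\,\delta t &\approx \Pr\{\text{jump from } Z_i \text{ to } Z_j \text{ in } (t, t + \delta t] \mid \xi(t) = Z_i\} \\ \rho_i\,\delta t &\approx \Pr\{\text{leave } Z_i \text{ in } (t, t + \delta t] \mid \xi(t) = Z_i\}. \end{aligned} \tag{A7.90}$$

It is thus reasonable to define $\rho_{ij}$ and $\rho_i$ as transition rates (for a Markov process, $\rho_{ij}$ plays a similar role to that of the transition probability $\mathcal{P}_{ij}$ for a Markov chain).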


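Setting $a = \delta t$ in Eq. (A7.80) and considering Eqs. (A7.78) and (A7.79) yields

$$P_{ij}(t + \delta t) = \sum_{k=0,\, k \ne j}^{m} P_{ik}(t)\,P_{kj}(\delta t) + P_{ij}(t)\,P_{jj}(\delta t)$$

and then, taking into account Eq. (A7.86), it follows that

$$\dot{P}_{ij}(t) = -P_{ij}(t)\,\rho_j + \sum_{k=0,\, k \ne j}^{m} P_{ik}(t)\,\rho_{kj}, \qquad i, j \in \{0, \ldots, m\}. \tag{A7.91}$$

Equations (A7.91) are Kolmogorov's forward equations. With initial conditions $P_{ij}(0) = \delta_{ij}$ as in Eq. (A7.85), they have a unique solution which satisfies Eq. (A7.81). In other words, the transition rates according to Eq. (A7.86) or Eq. (A7.90) uniquely determine the transition probabilities $P_{ij}(t)$. Similarly as for Eq. (A7.91), it can be shown that $P_{ij}(t)$ also satisfy Kolmogorov's backward equations

$$\dot{P}_{ij}(t) = -\rho_i\,P_{ij}(t) + \sum_{k=0,\, k \ne i}^{m} \rho_{ik}\,P_{kj}(t), \qquad i, j \in \{0, \ldots, m\}. \tag{A7.92}$$

Equations (A7.91) & (A7.92) are also known as Chapman-Kolmogorov equations. They can be written in matrix form $\dot{\mathbf{P}} = \mathbf{P}\,\boldsymbol{\Lambda}$ & $\dot{\mathbf{P}} = \boldsymbol{\Lambda}\,\mathbf{P}$ and have the formal solution $\mathbf{P}(t) = e^{\boldsymbol{\Lambda} t}\,\mathbf{P}(0)$.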
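The formal matrix-exponential solution can be checked numerically on a small case. The sketch below assumes that the matrix in the exponent has the transition rates off the diagonal and the negative total exit rate of each state on the diagonal (an interpretation, not spelled out in the text), and that the initial matrix is the identity. The function names, the numerical scheme (scaling and squaring of a truncated Taylor series) and the rate values are illustrative choices, not the author's method.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(gen, t, squarings=10, terms=12):
    """P(t) = exp(gen * t), via scaling and squaring of a Taylor series."""
    n = len(gen)
    scale = t / 2 ** squarings
    small = [[g * scale for g in row] for row in gen]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # P(0) = I
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[r + x for r, x in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical 3-state process (rates chosen for illustration only)
LAM, MU = 0.1, 1.0
GEN = [[-2 * LAM, 2 * LAM, 0.0],
       [MU, -(LAM + MU), LAM],
       [0.0, MU, -MU]]

P1 = transition_matrix(GEN, 1.0)
P2 = transition_matrix(GEN, 2.0)
P3 = transition_matrix(GEN, 3.0)
P1_TIMES_P2 = mat_mul(P1, P2)  # should agree with P3 (Chapman-Kolmogorov)
```

Each row of the computed matrices sums to 1, and the product of the matrices for 1 and 2 time units reproduces the matrix for 3 time units up to rounding, as the Chapman-Kolmogorov equations require.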

The following description of the time-homogeneous Markov process with initial distribution $P_i(0)$ and transition rates $\rho_{ij}$, $i, j \in \{0, \ldots, m\}$, provides a better insight into the structure of a Markov process as a pure jump process (Fig. A7.10). It is the basis for investigations of Markov processes by means of integral equations (Section A7.5.3.2), and is the motivation for the introduction of semi-Markov processes (Section A7.6). Let $\xi_0, \xi_1, \ldots$ be a sequence of random variables taking values in $\{Z_0, \ldots, Z_m\}$ denoting the states successively occupied and $\eta_0, \eta_1, \ldots$ a sequence of positive random variables denoting the stay (sojourn) times between two consecutive state transitions. Define

$$\mathcal{P}_{ij} = \frac{\rho_{ij}}{\rho_i}, \quad i \ne j, \quad \text{and} \quad \mathcal{P}_{ii} \equiv 0, \qquad i, j \in \{0, \ldots, m\}, \tag{A7.93}$$

and assume furthermore that

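and, for $n = 0, 1, 2, \ldots$, arbitrary $i, j, i_0, \ldots, i_{n-1} \in \{0, \ldots, m\}$, and arbitrary $x, x_0, \ldots, x_{n-1} > 0$,

$$\begin{aligned} &\Pr\{(\xi_{n+1} = Z_j \cap \eta_n \le x) \mid (\xi_n = Z_i \cap \eta_{n-1} = x_{n-1} \cap \ldots \cap \xi_1 = Z_{i_1} \cap \eta_0 = x_0 \cap \xi_0 = Z_{i_0})\} \\ &\quad = \Pr\{(\xi_{n+1} = Z_j \cap \eta_n \le x) \mid \xi_n = Z_i\} = Q_{ij}(x) = \mathcal{P}_{ij}\,(1 - e^{-\rho_i x}). \end{aligned} \tag{A7.95}$$

In Eq. (A7.95), as well as in Eq. (A7.158), $Z_j$ is the next state visited after $Z_i$ (this is not the case in Eq. (A7.77), see also the remark with Eq. (A7.106)). $Q_{ij}(x)$ is thus defined only for $j \ne i$. $\xi_0, \xi_1, \ldots$ is a Markov chain, with an initial distribution

$$\Pr\{\xi_0 = Z_i\} = P_i(0), \qquad i = 0, \ldots, m,$$

and transition probabilities

$$\mathcal{P}_{ij} = \Pr\{\xi_{n+1} = Z_j \mid \xi_n = Z_i\}, \quad \text{with } \mathcal{P}_{ii} \equiv 0,$$

embedded in the original process. From Eq. (A7.95), it follows that (Example A7.2)

$$\Pr\{\eta_n \le x \mid (\xi_n = Z_i \cap \xi_{n+1} = Z_j)\} = F_{ij}(x) = 1 - e^{-\rho_i x}.$$

$Q_{ij}(x)$ is a semi-Markov transition probability and will as such be introduced and discussed in Section A7.6. Now, define

$$S_0 = 0, \qquad S_n = \eta_0 + \ldots + \eta_{n-1}, \qquad n = 1, 2, \ldots,$$

and

$$\xi(t) = \xi_n \quad \text{for } S_n \le t < S_{n+1}.$$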

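From Eq. (A7.98) and the memoryless property of the exponential distribution (Eq. (A6.87)) it follows that $\xi(t)$, $t \ge 0$, is a Markov process with initial distribution

$$P_i(0) = \Pr\{\xi(0) = Z_i\}$$

and transition rates

$$\rho_{ij} = \lim_{\delta t \downarrow 0} \frac{1}{\delta t}\,\Pr\{\text{jump from } Z_i \text{ to } Z_j \text{ in } (t, t + \delta t] \mid \xi(t) = Z_i\}, \qquad j \ne i,$$

and

$$\rho_i = \lim_{\delta t \downarrow 0} \frac{1}{\delta t}\,\Pr\{\text{leave } Z_i \text{ in } (t, t + \delta t] \mid \xi(t) = Z_i\} = \sum_{j=0,\, j \ne i}^{m} \rho_{ij}.$$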

The evolution of a time-homogeneous Markov process with transition rates $\rho_{ij}$ and $\rho_i$ can thus be described in the following way [A7.2 (1974 ETH)]:

If at $t = 0$ the process enters the state $Z_i$, i.e. $\xi_0 = Z_i$, then the next state to be entered, say $Z_j$ ($j \ne i$), is selected according to the probability $\mathcal{P}_{ij} \ge 0$ ($\mathcal{P}_{ii} \equiv 0$), and the stay (sojourn) time in $Z_i$ is a random variable with distribution function

$$\Pr\{\eta_0 \le x \mid (\xi_0 = Z_i \cap \xi_1 = Z_j)\} = 1 - e^{-\rho_i x};$$

as the process enters $Z_j$, the next state to be entered, say $Z_k$ ($k \ne j$), will be selected with probability $\mathcal{P}_{jk} \ge 0$ ($\mathcal{P}_{jj} \equiv 0$) and the stay (sojourn) time $\eta_1$ in $Z_j$ will be distributed according to

$$\Pr\{\eta_1 \le x \mid (\xi_1 = Z_j \cap \xi_2 = Z_k)\} = 1 - e^{-\rho_j x},$$

etc.

The sequence $\xi_n$, $n = 0, 1, \ldots$, of the states successively occupied by the process is that of the Markov chain embedded in $\xi(t)$, the so-called embedded Markov chain. The random variable $\eta_n$ is the stay (sojourn) time of the process in the state defined by $\xi_n$. From the above description it becomes clear that each state $Z_i$, $i = 0, \ldots, m$, is a regeneration state.
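The jump-process description lends itself to a direct simulation: draw an exponentially distributed stay time with the total exit rate of the current state, then select the next state with probability proportional to the corresponding transition rate. The sketch below is only an illustration; the function name `simulate_sojourn`, the 3-state rate matrix, the horizon, and the seed are hypothetical, and every state is assumed to have at least one positive outgoing rate.

```python
import random


def simulate_sojourn(rates, start, horizon, rng):
    """Simulate the jump process and return the time spent in each state.

    rates[i][j] is the transition rate from Z_i to Z_j (diagonal entries 0).
    """
    n = len(rates)
    time_in = [0.0] * n
    state, t = start, 0.0
    while t < horizon:
        rho_i = sum(rates[state])                        # total rate out of Z_i
        stay = min(rng.expovariate(rho_i), horizon - t)  # Exp(rho_i) stay time
        time_in[state] += stay
        t += stay
        # Embedded chain: next state Z_j chosen with probability rho_ij / rho_i
        state = rng.choices(range(n), weights=rates[state])[0]
    return time_in


# Hypothetical rates (illustration only): Z0 -> Z1 with 2, Z1 -> Z0 with 2,
# Z1 -> Z2 with 1, Z2 -> Z1 with 2
RATES = [[0.0, 2.0, 0.0],
         [2.0, 0.0, 1.0],
         [0.0, 2.0, 0.0]]
HORIZON = 20000.0
time_share = [x / HORIZON
              for x in simulate_sojourn(RATES, 0, HORIZON, random.Random(1))]
print(time_share)
```

For these rates the balance between neighboring states gives long-run time shares of 0.4, 0.4 and 0.2, which the simulated shares should approach as the horizon grows.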

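In practical applications, the following technique can be used to determine the quantities $Q_{ij}(x)$, $\mathcal{P}_{ij}$, and $F_{ij}(x)$ in Eq. (A7.95) [A7.2 (1985)]:

If the process enters the state $Z_i$ at an arbitrary time, say at $t = 0$, then a set of independent random times $\tau_{ij} > 0$, $j \ne i$, begin ($\tau_{ij}$ is the stay (sojourn) time in $Z_i$ with the next jump to $Z_j$); the process will then jump to $Z_j$ at the time $x$ if $\tau_{ij} = x$ and $\tau_{ik} > \tau_{ij}$ for (all) $k \ne j$.

In this interpretation, the quantities $Q_{ij}(x)$, $\mathcal{P}_{ij}$, and $F_{ij}(x)$ are given by

$$Q_{ij}(x) = \Pr\{\tau_{ij} \le x \cap \tau_{ik} > \tau_{ij},\ k \ne j\}, \quad \text{with } Q_{ij}(0) = 0, \tag{A7.99}$$

$$\mathcal{P}_{ij} = \Pr\{\tau_{ik} > \tau_{ij},\ k \ne j\}, \quad \text{with } \mathcal{P}_{ii} \equiv 0, \tag{A7.100}$$

$$F_{ij}(x) = \Pr\{\tau_{ij} \le x \mid \tau_{ik} > \tau_{ij},\ k \ne j\}, \quad \text{with } F_{ij}(0) = 0. \tag{A7.101}$$

Assuming for the time-homogeneous Markov process (memoryless property)

$$\Pr\{\tau_{ij} \le x\} = 1 - e^{-\rho_{ij} x}, \qquad j \ne i,$$

one obtains, as in Eq. (A7.95),

$$Q_{ij}(x) = \int_0^x \rho_{ij}\,e^{-\rho_{ij} y} \prod_{k \ne i,\, k \ne j} e^{-\rho_{ik} y}\,dy = \frac{\rho_{ij}}{\rho_i}\,(1 - e^{-\rho_i x}), \qquad j \ne i,$$

$$\mathcal{P}_{ij} = \frac{\rho_{ij}}{\rho_i} = Q_{ij}(\infty) \ \text{ for } j \ne i, \qquad \rho_i = \sum_{j=0,\, j \ne i}^{m} \rho_{ij}, \qquad \mathcal{P}_{ii} \equiv 0, \tag{A7.103}$$

$$F_{ij}(x) = 1 - e^{-\rho_i x}.$$

It should be emphasized that due to the memoryless property of the time-homogeneous Markov process, there is no difference whether the process enters $Z_i$ at $t = 0$ or it is already there. However, this is not true for semi-Markov processes (Eq. (A7.158)).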


Quite generally, a repairable system can be described by a time-homogeneous Markov process if and only if all random variables occurring (failure-free times and repair times) are independent and exponentially distributed. If some failure-free times or repair times of elements are Erlang distributed (Appendix A6.10.3), the time evolution of the system can be described by a time-homogeneous Markov process with an appropriate extension of the state space (Fig. 6.6).

A powerful tool when investigating time-homogeneous Markov processes is the diagram of transition probabilities in $(t, t + \delta t]$, where $\delta t \to 0$ ($\delta t > 0$, i.e. $\delta t \downarrow 0$) and $t$ is an arbitrary time point (e.g. $t = 0$). This diagram is a directed graph with nodes labeled by states $Z_i$, $i = 0, \ldots, m$, and arcs labeled by transition probabilities $P_{ij}(\delta t)$, where terms of order $o(\delta t)$ are omitted. It is related to the state transition diagram of the system involved, takes care of particular assumptions (such as repair priority, change of failure or repair rates at a state change, etc.), and has in general more than $2^n$ states, if $n$ elements in the reliability block diagram are involved (see for instance Fig. A7.6 and Section 6.7). Taking into account the properties of the random variables $\tau_{ij}$, introduced with Eq. (A7.99), it follows that for $\delta t \downarrow 0$

$$\Pr\{(\xi(\delta t) = Z_j \cap \text{only one jump in } (0, \delta t]) \mid \xi(0) = Z_i\} = \rho_{ij}\,\delta t + o(\delta t) \tag{A7.105}$$

and

$$\Pr\{(\xi(\delta t) = Z_j \cap \text{more than one jump in } (0, \delta t]) \mid \xi(0) = Z_i\} = o(\delta t). \tag{A7.106}$$

From this,

$$P_{ij}(\delta t) = \rho_{ij}\,\delta t + o(\delta t), \quad j \ne i, \quad \text{and} \quad P_{ii}(\delta t) = 1 - \rho_i\,\delta t + o(\delta t),$$

as with Eq. (A7.88). Although for $\delta t \to 0$ it holds that $P_{ij}(\delta t) = Q_{ij}(\delta t)$, the meanings of $P_{ij}(\delta t)$ as in Eq. (A7.79) or Eq. (A7.78) and $Q_{ij}(\delta t)$ as in Eq. (A7.95) or Eq. (A7.158) are basically different. With $Q_{ij}(x)$, $Z_j$ is the next state visited after $Z_i$; this is not the case for $P_{ij}(x)$.

Examples A7.3 to A7.5 give the diagram of transition probabilities in $(t, t + \delta t]$ for some typical structures for reliability applications. The states in which the system is down are hatched. In state $Z_0$ all elements are up (operating or in reserve state). +)

Example A7.3 Figure A7.4 shows several possibilities for a 1-out-of-2 redundancy. The difference with respect to the number of repair crews appears when leaving the states $Z_2$ and $Z_3$. Cases b) and c) are identical when two repair crews are available.

+) The memoryless property, characterizing the (time-homogeneous) Markov processes, is satisfied in all diagrams of Fig. A7.4 and in all similar diagrams given in this book. Assuming e.g. that at a given time $t$ the system of Fig. A7.4b left is in state $Z_4$, development after $t$ is independent of how many times before $t$ the system has oscillated between $Z_2$ and $Z_0$ or $Z_2$, $Z_0$, $Z_1$, $Z_3$.


[Figure A7.4: 1-out-of-2 redundancy, panels for one repair crew and two repair crews. Distribution of failure-free times: operating state $F(t) = 1 - e^{-\lambda t}$, reserve state $F(t) = 1 - e^{-\lambda_r t}$; distribution of repair time: $G(t) = 1 - e^{-\mu t}$]

Figure A7.4 Diagram of transition probabilities in $(t, t + \delta t]$ for a repairable 1-out-of-2 redundancy (constant failure rates $\lambda$, $\lambda_r$ and repair rate $\mu$): a) Warm redundancy with $E_1 = E_2$ ($\lambda_r = \lambda$ gives active redundancy, $\lambda_r \equiv 0$ gives standby redundancy); b) Active redundancy with $E_1 \ne E_2$; c) Active redundancy with $E_1 \ne E_2$ and repair priority on $E_1$ ($t$ arbitrary, $\delta t \downarrow 0$, Markov process)


Example A7.4 Figure A7.5 shows two cases of a k-out-of-n active redundancy with two repair crews. In the first case, the system operates up to the failure of all elements (with reduced performance from state $Z_{n-k+1}$). In the second case no further failures can occur when the system is down.

Example A7.5 Figure A7.6 shows a series/parallel structure consisting of the series connection (in the reliability sense) of a 1-out-of-2 active redundancy, with elements $E_2$ and $E_3$, and a switching element $E_1$. The system has only one repair crew. Since one of the redundant elements $E_2$ or $E_3$ can be down without having a system failure, in cases a) and b) the repair of element $E_1$ is given first priority. This means that if a failure of $E_1$ occurs during a repair of $E_2$ or $E_3$, the repair is stopped and $E_1$ will be repaired. In cases c) and d) the repair priority on $E_1$ has been dropped.

[Figure A7.5: k-out-of-n (active) redundancy with $E_1 = E_2 = \ldots = E_n = E$. Distribution of failure-free operating times: $F(t) = 1 - e^{-\lambda t}$; repair times: $G(t) = 1 - e^{-\mu t}$.
a) $\nu_i = (n - i)\lambda$ and $\rho_{i(i+1)} = \nu_i$ for $i = 0, 1, \ldots, n-1$; $\rho_{10} = \mu$; $\rho_{i(i-1)} = 2\mu$ for $i = 2, 3, \ldots, n$
b) $\nu_i = (n - i)\lambda$ and $\rho_{i(i+1)} = \nu_i$ for $i = 0, 1, \ldots, n-k$; $\rho_{10} = \mu$; $\rho_{i(i-1)} = 2\mu$ for $i = 2, 3, \ldots, n-k+1$]

Figure A7.5 Diagram of transition probabilities in $(t, t + \delta t]$ for a repairable k-out-of-n active redundancy with two repair crews (constant failure rate $\lambda$ and repair rate $\mu$): a) The system operates up to the failure of the last element; b) No further failures at system down ($t$ arbitrary, $\delta t \downarrow 0$, Markov process; in a k-out-of-n redundancy the system is up if at least $k$ elements are operating)


[Figure A7.6: series/parallel structure with a 1-out-of-2 (active) redundancy, $E_2 = E_3 = E$. Distribution of failure-free times: $F(t) = 1 - e^{-\lambda t}$ for $E$, $F(t) = 1 - e^{-\lambda_1 t}$ for $E_1$; repair times: $G(t) = 1 - e^{-\mu t}$ for $E$, $G(t) = 1 - e^{-\mu_1 t}$ for $E_1$. Panels: a) Repair priority on $E_1$; b) As a), but no further failures at system down; c) No repair priority (i.e. repair as per first-in first-out); d) As c), but no further failures at system down]

Figure A7.6 Diagram of transition probabilities in $(t, t + \delta t]$ for a repairable series/parallel structure with $E_2 = E_3 = E$ and one repair crew: a) Repair priority on $E_1$ and the system operates up to the failure of the last element; b) Repair priority on $E_1$ and at system failure no further failures can occur; c) and d) as a) and b), respectively, but without repair priority on $E_1$ (constant failure rates $\lambda$, $\lambda_1$ and repair rates $\mu$, $\mu_1$; $t$ arbitrary; $\delta t \downarrow 0$, Markov process)


A7.5.3 State Probabilities and Stay Times (Sojourn Times) in a Given Class of States

In reliability theory, two important quantities are the state probabilities and the distribution function of the stay (sojourn) times in the set of system up states. The state probabilities allow calculation of the point availability. The reliability function can be obtained from the distribution function of the stay time in the set of system up states. Furthermore, a combination of these quantities allows for time-homogeneous Markov processes a simple calculation of the interval reliability.

It is useful in such an analysis to subdivide the system state space into two complementary sets $U$ and $\bar{U}$

$$\begin{aligned} U &= \text{set of the system up states (up states at system level)} \\ \bar{U} &= \text{set of the system down states (down states at system level)}. \end{aligned} \tag{A7.107}$$

Partition of the state space in more than two classes is possible, see e.g. [A7.28]. Calculation of state probabilities and stay (sojourn) times can be carried out for Markov processes using the method of differential equations or of integral equations.

A7.5.3.1 Method of Differential Equations

The method of differential equations is the classical one used in investigating Markov processes. It is based on the diagram of transition probabilities in $(t, t + \delta t]$. Consider a time-homogeneous Markov process $\xi(t)$ with arbitrary initial distribution $P_i(0) = \Pr\{\xi(0) = Z_i\}$ and transition rates $\rho_{ij}$ and $\rho_i$. The state probabilities defined by Eq. (A7.83)

$$P_j(t) = \Pr\{\xi(t) = Z_j\}, \qquad j = 0, \ldots, m,$$

satisfy the system of differential equations

$$\dot{P}_j(t) = -\rho_j\,P_j(t) + \sum_{i=0,\, i \ne j}^{m} P_i(t)\,\rho_{ij}, \qquad \rho_j = \sum_{i=0,\, i \ne j}^{m} \rho_{ji}, \qquad j = 0, \ldots, m. \tag{A7.108}$$

The proof of Eq. (A7.108) is similar to that of Eq. (A7.91), see also Example A7.6. The point availability $PA_S(t)$, for arbitrary initial conditions at $t = 0$, follows then from

$$PA_S(t) = \Pr\{\xi(t) \in U\} = \sum_{Z_j \in U} P_j(t). \tag{A7.109}$$

In reliability analysis, particular initial conditions are often of interest. Assuming

$$P_i(0) = 1 \quad \text{and} \quad P_j(0) = 0 \ \text{ for } j \ne i, \tag{A7.110}$$

i.e. that the system is in $Z_i$ at $t = 0$ (usually in state $Z_0$ denoting "all elements are up"), the state probabilities $P_j(t)$ are the transition probabilities $P_{ij}(t)$ defined by Eqs. (A7.78) & (A7.79) and can be obtained as

$$P_{ij}(t) \equiv P_j(t), \tag{A7.111}$$

with $P_j(t)$ as the solution of Eq. (A7.108) with initial conditions as in Eq. (A7.110), or of Eq. (A7.92). The point availability, now designated with $PA_{Si}(t)$, is then given by

$$PA_{Si}(t) = \Pr\{\xi(t) \in U \mid \xi(0) = Z_i\} = \sum_{Z_j \in U} P_{ij}(t), \qquad i = 0, \ldots, m. \tag{A7.112}$$

$PA_{Si}(t)$ is the probability that the system is in one of the up states at $t$, given it was in $Z_i$ at $t = 0$. Example A7.6 illustrates the calculation of the point availability for a 1-out-of-2 active redundancy.

Example A7.6

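Assume a 1-out-of-2 active redundancy, consisting of 2 identical elements $E_1 = E_2 = E$ with constant failure rate $\lambda$ and repair rate $\mu$, and only one repair crew. Give the state probabilities of the involved Markov process ($E_1$ and $E_2$ are new at $t = 0$).

Solution

Figure A7.7 shows the diagram of transition probabilities in $(t, t + \delta t]$ for the investigation of the point availability. Because of the memoryless property of the involved Markov process, Fig. A7.7 and Eqs. (A7.83) & (A7.90) lead to (by omitting the terms in $o(\delta t)$, as per Eq. (A7.89))

$$\begin{aligned} P_0(t + \delta t) &= P_0(t)\,(1 - 2\lambda\,\delta t) + P_1(t)\,\mu\,\delta t \\ P_1(t + \delta t) &= P_1(t)\,(1 - (\lambda + \mu)\,\delta t) + P_0(t)\,2\lambda\,\delta t + P_2(t)\,\mu\,\delta t \\ P_2(t + \delta t) &= P_2(t)\,(1 - \mu\,\delta t) + P_1(t)\,\lambda\,\delta t, \end{aligned}$$

and then, as $\delta t \downarrow 0$,

$$\begin{aligned} \dot{P}_0(t) &= -2\lambda\,P_0(t) + \mu\,P_1(t) \\ \dot{P}_1(t) &= -(\lambda + \mu)\,P_1(t) + 2\lambda\,P_0(t) + \mu\,P_2(t) \\ \dot{P}_2(t) &= -\mu\,P_2(t) + \lambda\,P_1(t). \end{aligned} \tag{A7.113}$$

Equation (A7.113) also follows from Eq. (A7.108) with the $\rho_{ij}$ from Fig. A7.7. The solution of Eq. (A7.113) with given initial conditions at $t = 0$, e.g. $P_0(0) = 1$, $P_1(0) = P_2(0) = 0$, leads to state probabilities $P_0(t)$, $P_1(t)$, and $P_2(t)$, and then to the point availability according to Eqs. (A7.111) and (A7.112) with $i = 0$ (see also Example A7.9 and Table 6.2 for the solution).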
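The state equations of this example can also be integrated numerically. The following sketch is an illustration only: it uses a classical fourth-order Runge-Kutta scheme (a generic numerical method, not one prescribed by the text), the function names `derivative` and `integrate` are hypothetical, and the values of the failure and repair rates are assumed for the demonstration.

```python
def derivative(p, rates):
    """dP_j/dt = -rho_j P_j + sum_{i != j} P_i rho_ij for all states j."""
    n = len(p)
    return [sum(p[i] * rates[i][j] for i in range(n) if i != j)
            - p[j] * sum(rates[j][k] for k in range(n) if k != j)
            for j in range(n)]


def integrate(p0, rates, t_end, dt=0.01):
    """Classical fourth-order Runge-Kutta integration from 0 to t_end."""
    p = list(p0)
    for _ in range(round(t_end / dt)):
        k1 = derivative(p, rates)
        k2 = derivative([x + 0.5 * dt * k for x, k in zip(p, k1)], rates)
        k3 = derivative([x + 0.5 * dt * k for x, k in zip(p, k2)], rates)
        k4 = derivative([x + dt * k for x, k in zip(p, k3)], rates)
        p = [x + dt / 6 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p


# Assumed rates for illustration: failure rate 0.1, repair rate 1.0
LAM, MU = 0.1, 1.0
RATES = [[0.0, 2 * LAM, 0.0],   # Z0 -> Z1 with rate 2*lambda
         [MU, 0.0, LAM],        # Z1 -> Z0 with mu, Z1 -> Z2 with lambda
         [0.0, MU, 0.0]]        # Z2 -> Z1 with mu (one repair crew)

p_t = integrate([1.0, 0.0, 0.0], RATES, 40.0)  # both elements new at t = 0
point_availability = p_t[0] + p_t[1]           # up states are Z0 and Z1
print(p_t, point_availability)
```

For large $t$ the computed probabilities settle at values proportional to $\mu^2$, $2\lambda\mu$, and $2\lambda^2$, which is what setting the derivatives in the state equations to zero gives.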

Figure A7.7 Diagram of the transition probabilities in $(t, t + \delta t]$ for availability calculation of a 1-out-of-2 active redundancy with $E_1 = E_2 = E$, constant failure rate $\lambda$ and constant repair rate $\mu$, one repair crew ($t$ arbitrary, $\delta t \downarrow 0$, Markov process with $\rho_{01} = 2\lambda$, $\rho_{10} = \mu$, $\rho_{12} = \lambda$, $\rho_{21} = \mu$, $\rho_0 = 2\lambda$, $\rho_1 = \lambda + \mu$, $\rho_2 = \mu$)

A further important quantity for reliability analyses is the reliability function $R_S(t)$, i.e. the probability of no system failure in $(0, t]$. $R_S(t)$ can be calculated using the method of differential equations if all states in $\bar{U}$ are declared to be absorbing states. This means that the process will never leave $Z_k$ if it jumps into a state $Z_k \in \bar{U}$. It is not difficult to see that in this case, the events

{first system failure occurs before $t$}

and

{system is in one of the states $\bar{U}$ at $t$}

are equivalent, so that the sum of the probabilities to be in one of the states in $U$ is the required reliability function, i.e. the probability that up to the time $t$ the process has never left the set of up states $U$. To make this analysis rigorous, consider the modified Markov process $\xi'(t)$ with transition probabilities $P'_{ij}(t)$ and transition rates

$$\rho'_{ij} = \rho_{ij} \ \text{ if } Z_i \in U, \qquad \rho'_{ij} = 0 \ \text{ if } Z_i \in \bar{U}, \qquad \rho'_i = \sum_{j=0,\, j \ne i}^{m} \rho'_{ij}. \tag{A7.114}$$

The state probabilities $P'_j(t)$ of $\xi'(t)$ satisfy the following system of differential equations (see Example A7.7 for an application)

$$\dot{P}'_j(t) = -\rho'_j\,P'_j(t) + \sum_{i=0,\, i \ne j}^{m} P'_i(t)\,\rho'_{ij}, \qquad \rho'_j = \sum_{i=0,\, i \ne j}^{m} \rho'_{ji}, \qquad j = 0, \ldots, m. \tag{A7.115}$$

Assuming as initial conditions $P'_i(0) = 1$ and $P'_j(0) = 0$ for $j \ne i$ (with $Z_i \in U$), the solution of Eq. (A7.115) leads to the state probabilities $P'_j(t)$ and from these to the transition probabilities

$$P'_{ij}(t) \equiv P'_j(t). \tag{A7.116}$$

The reliability function $R_{Si}(t)$ is then given by

$$R_{Si}(t) = \Pr\{\xi(x) \in U \text{ for } 0 < x \le t \mid \xi(0) = Z_i\} = \sum_{Z_j \in U} P'_{ij}(t), \qquad Z_i \in U. \tag{A7.117}$$

The probabilities marked with $'$ ($P'_i(t)$) are reserved for reliability calculation, when using the method of differential equations. This should avoid confusion with the corresponding quantities for the point availability. Example A7.7 illustrates the calculation of the reliability function for a 1-out-of-2 active redundancy.

Example A7.7 Give the reliability function for the same case as in Example A7.6, i.e. the probability that the system has not left the states $Z_0$ and $Z_1$ up to time $t$.

Solution The diagram of transition probabilities in $(t, t + \delta t]$ of Fig. A7.7 is modified as in Fig. A7.8 by making the down state $Z_2$ absorbing. For the state probabilities it follows that (see Example A7.6)

$$\begin{aligned} \dot{P}'_0(t) &= -2\lambda\,P'_0(t) + \mu\,P'_1(t) \\ \dot{P}'_1(t) &= -(\lambda + \mu)\,P'_1(t) + 2\lambda\,P'_0(t) \\ \dot{P}'_2(t) &= \lambda\,P'_1(t). \end{aligned} \tag{A7.118}$$

The solution of Eq. (A7.118) with the given initial conditions at $t = 0$ ($P'_0(0) = 1$, $P'_1(0) = P'_2(0) = 0$) leads to the state probabilities $P'_0(t)$, $P'_1(t)$, and $P'_2(t)$, and then to the transition probabilities and to the reliability function according to Eqs. (A7.116) and (A7.117), respectively (the primed state probabilities should avoid confusion with the solution given by Eq. (A7.113)).
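As a numerical illustration of this example, the sketch below integrates the state equations with the down state made absorbing and compares the result with a closed-form expression. The closed form is derived here only as a cross-check, from the characteristic equation $s^2 + (3\lambda + \mu)s + 2\lambda^2 = 0$ of the two up-state equations with $R(0) = 1$ and $\dot{R}(0) = 0$; it is not quoted from the text. Function names, the Runge-Kutta scheme, and the rate values are assumptions made for the demonstration.

```python
import math


def rhs(p, lam, mu):
    """State equations of the 1-out-of-2 redundancy with Z2 made absorbing."""
    p0, p1, _ = p
    return [-2 * lam * p0 + mu * p1,
            -(lam + mu) * p1 + 2 * lam * p0,
            lam * p1]


def reliability(t_end, lam, mu, dt=0.01):
    """R(t_end) = P'_0 + P'_1 by Runge-Kutta integration from P'_0(0) = 1."""
    p = [1.0, 0.0, 0.0]
    for _ in range(round(t_end / dt)):
        k1 = rhs(p, lam, mu)
        k2 = rhs([x + 0.5 * dt * k for x, k in zip(p, k1)], lam, mu)
        k3 = rhs([x + 0.5 * dt * k for x, k in zip(p, k2)], lam, mu)
        k4 = rhs([x + dt * k for x, k in zip(p, k3)], lam, mu)
        p = [x + dt / 6 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p[0] + p[1]


def reliability_closed_form(t, lam, mu):
    """Cross-check: solution of the same equations via their two decay rates."""
    root = math.sqrt(lam ** 2 + 6 * lam * mu + mu ** 2)
    r1 = (3 * lam + mu - root) / 2
    r2 = (3 * lam + mu + root) / 2
    return (r2 * math.exp(-r1 * t) - r1 * math.exp(-r2 * t)) / (r2 - r1)


# Assumed rates for illustration: failure rate 0.1, repair rate 1.0
LAM, MU = 0.1, 1.0
R_NUM = reliability(10.0, LAM, MU)
R_REF = reliability_closed_form(10.0, LAM, MU)
print(R_NUM, R_REF)
```

Unlike the point availability, the reliability function computed this way decreases monotonically toward zero, because probability mass that reaches the absorbing down state never returns.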

Equations (A7.112) and (A7.117) can be combined to determine the probability that the process is in an up state (set $U$) at $t$ and does not leave the set $U$ in the time interval $[t, t + \theta]$, given $\xi(0) = Z_i$. This quantity is the interval reliability $IR_{Si}(t, t + \theta)$. Due to the memoryless property of the involved Markov process,

$$IR_{Si}(t, t + \theta) = \Pr\{\xi(x) \in U \text{ for } t \le x \le t + \theta \mid \xi(0) = Z_i\} = \sum_{Z_j \in U} P_{ij}(t)\,R_{Sj}(\theta), \tag{A7.119}$$

with $i = 0, 1, \ldots, m$ and $P_{ij}(t)$ as given in Eq. (A7.111).

Figure A7.8 Diagram of the transition probabilities in $(t, t + \delta t]$ for the reliability function of a 1-out-of-2 active redundancy with $E_1 = E_2 = E$, constant failure rate $\lambda$ and constant repair rate $\mu$, one repair crew ($t$ arbitrary, $\delta t \downarrow 0$, Markov process with $\rho_{01} = 2\lambda$, $\rho_{10} = \mu$, $\rho_{12} = \lambda$; $\rho_0 = 2\lambda$, $\rho_1 = \lambda + \mu$, $\rho_2 = 0$)

A7.5.3.2 Method of Integral Equations

The method of integral equations is based on the representation of the (time-homogeneous) Markov process $\xi(t)$ as a pure jump process by means of $\xi_n$ and $\eta_n$ as introduced in Appendix A7.5.2 (Eq. (A7.95), Fig. A7.10). From the memoryless property it uses only the fact that jump points (in a new state) are regeneration points of $\xi(t)$.

The transition probabilities $P_{ij}(t) = \Pr\{\xi(t) = Z_j \mid \xi(0) = Z_i\}$ can be obtained by solving the following system of integral equations

$$P_{ij}(t) = \delta_{ij}\,e^{-\rho_i t} + \sum_{k=0,\, k \ne i}^{m} \int_0^t \rho_{ik}\,e^{-\rho_i x}\,P_{kj}(t - x)\,dx, \qquad i, j \in \{0, \ldots, m\}, \tag{A7.120}$$

with $\rho_i = \sum_{j \ne i} \rho_{ij}$, $\delta_{ij} = 0$ for $j \ne i$, $\delta_{ii} = 1$. To prove Eq. (A7.120), consider that

$$\begin{aligned} P_{ij}(t) = {} & \Pr\{(\xi(t) = Z_j \cap \text{no jumps in } (0, t]) \mid \xi(0) = Z_i\} \\ & + \sum_{k=0,\, k \ne i}^{m} \Pr\{(\xi(t) = Z_j \cap \text{first jump in } (0, t] \text{ in } Z_k) \mid \xi(0) = Z_i\}. \end{aligned} \tag{A7.121}$$

The first term of Eq. (A7.121) only holds for $j = i$ and gives the probability that the process will not leave the state $Z_i$ ($e^{-\rho_i t} = \Pr\{\tau_{ij} > t \text{ for all } j \ne i\}$ according to the interpretation given by Eqs. (A7.99) - (A7.104)). The second term holds for any $j \ne i$; it gives the probability that the process will move first from $Z_i$ to $Z_k$ and takes into account that the occurrence of $Z_k$ is a regeneration point. According to Eq. (A7.95), $\Pr\{(\xi_1 = Z_k \cap \eta_0 \le x) \mid \xi(0) = Z_i\} = Q_{ik}(x) = \mathcal{P}_{ik}\,(1 - e^{-\rho_i x})$ and $\Pr\{\xi(t) = Z_j \mid (\xi_0 = Z_i \cap \eta_0 = x \cap \xi_1 = Z_k)\} = P_{kj}(t - x)$. Equation (A7.120) then follows from the theorem of total probability (Eq. (A6.17)).

In the same way as for Eq. (A7.121), it can be shown that the reliability function $R_{Si}(t)$, as defined in Eq. (A7.117), satisfies the following system of integral equations

$$R_{Si}(t) = e^{-\rho_i t} + \sum_{Z_j \in U,\, j \ne i} \int_0^t \rho_{ij}\,e^{-\rho_i x}\,R_{Sj}(t - x)\,dx, \qquad \rho_i = \sum_{j=0,\, j \ne i}^{m} \rho_{ij}, \quad Z_i \in U. \tag{A7.122}$$

Point availability $PA_{Si}(t)$ and $IR_{Si}(t, t + \theta)$ are given by Eqs. (A7.112) & (A7.119), with $P_{ij}(t)$ per Eq. (A7.120). The use of integral equations for $PA_{Si}(t)$ can lead to mistakes, since $R_{Si}(t)$ and $PA_{Si}(t)$ describe two different situations (summing for $PA_{Si}(t)$ over all states $j \in \{0, \ldots, m\}$ leads to $PA_{Si}(t) = 1$).

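The systems of integral equations (A7.120) and (A7.122) can be solved using Laplace transforms. Referring to Appendix A9.7,

$$\tilde{P}_{ij}(s) = \frac{\delta_{ij}}{s + \rho_i} + \sum_{k=0,\, k \ne i}^{m} \frac{\rho_{ik}}{s + \rho_i}\,\tilde{P}_{kj}(s), \qquad i, j \in \{0, \ldots, m\}, \tag{A7.123}$$

and

$$\tilde{R}_{Si}(s) = \frac{1}{s + \rho_i} + \sum_{Z_j \in U,\, j \ne i} \frac{\rho_{ij}}{s + \rho_i}\,\tilde{R}_{Sj}(s), \qquad Z_i \in U. \tag{A7.124}$$

A direct advantage of the method based on integral equations appears in the calculation of the mean stay (sojourn) time in the up states. Denoting by $MTTF_{Si}$ the system mean time to failure, provided the system is in state $Z_i \in U$ at $t = 0$, it follows that (Eq. (A6.38), Appendix A9.7)

$$MTTF_{Si} = \int_0^\infty R_{Si}(t)\,dt = \tilde{R}_{Si}(0). \tag{A7.125}$$

Thus, according to Eq. (A7.124), $MTTF_{Si}$ satisfies the following system of algebraic equations (see Example A7.9 for an application)

$$MTTF_{Si} = \frac{1}{\rho_i} + \sum_{Z_j \in U,\, j \ne i} \frac{\rho_{ij}}{\rho_i}\,MTTF_{Sj}, \qquad \rho_i = \sum_{j=0,\, j \ne i}^{m} \rho_{ij}, \quad Z_i \in U. \tag{A7.126}$$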
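The algebraic system for the mean time to failure can be solved directly once the transition rates and the set of up states are known: for each up state, the mean time to failure equals the mean stay time in that state plus the mean times to failure of the up states reachable from it, each weighted by the probability of jumping there. The sketch below is a hedged illustration; the helper names `solve_linear` and `mttf`, and the rate values for a 1-out-of-2 active redundancy with one repair crew, are assumptions made for the demonstration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def mttf(rates, up):
    """Mean time to failure from each up state.

    Solves MTTF_i - sum_{j in U, j != i} (rho_ij / rho_i) MTTF_j = 1 / rho_i.
    """
    up = sorted(up)
    n_all = len(rates)
    a, b = [], []
    for i in up:
        rho_i = sum(rates[i][k] for k in range(n_all) if k != i)
        a.append([1.0 if j == i else -rates[i][j] / rho_i for j in up])
        b.append(1.0 / rho_i)
    return dict(zip(up, solve_linear(a, b)))


# Assumed rates for illustration: failure rate 0.1, repair rate 1.0
LAM, MU = 0.1, 1.0
RATES = [[0.0, 2 * LAM, 0.0],
         [MU, 0.0, LAM],
         [0.0, MU, 0.0]]
result = mttf(RATES, up={0, 1})
print(result)
```

For this structure the two equations can also be solved by hand, giving $(3\lambda + \mu)/(2\lambda^2)$ when starting with both elements up and $(2\lambda + \mu)/(2\lambda^2)$ when starting with one element in repair, i.e. 65 and 60 time units for the assumed rates.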

A7.5.3.3 Stationary State and Asymptotic Behavior

The determination of time-dependent state probabilities or of the point availability of a system whose elements have constant failure and repair rates is still possible using differential or integral equations. However, it can become time-consuming. The situation is easier where the state probabilities are independent of time, i.e. when the process involved is stationary (the system of differential or integral equations reduces to a system of algebraic equations):

A time-homogeneous Markov process $\xi(t)$ with states $Z_0, \ldots, Z_m$ is stationary, if its state probabilities $P_i(t) = \Pr\{\xi(t) = Z_i\}$, $i = 0, \ldots, m$, do not depend on $t$.

This can be seen from the following relationship

$$\Pr\{\xi(t_1) = Z_{i_1} \cap \ldots \cap \xi(t_n) = Z_{i_n}\} = \Pr\{\xi(t_1) = Z_{i_1}\}\,P_{i_1 i_2}(t_2 - t_1) \cdots P_{i_{n-1} i_n}(t_n - t_{n-1})$$

which, according to the Markov property (Eq. (A7.77)) must be valid for arbitrary $t_1 < \ldots < t_n$ and $i_1, \ldots, i_n \in \{0, \ldots, m\}$. For any $a > 0$ this leads to

$$\Pr\{\xi(t_1) = Z_{i_1} \cap \ldots \cap \xi(t_n) = Z_{i_n}\} = \Pr\{\xi(t_1 + a) = Z_{i_1} \cap \ldots \cap \xi(t_n + a) = Z_{i_n}\}.$$

From $P_i(t + a) = P_i(t)$ it follows $P_i(t) = P_i(0) = P_i$, and in particular $\dot{P}_i(t) = 0$. Consequently, the process $\xi(t)$ is stationary (in steady-state) if and only if its initial distribution $P_i = P_i(0) = \Pr\{\xi(0) = Z_i\}$, $i = 0, \ldots, m$, satisfies for $t > 0$ the system (Eq. (A7.108))

$$P_j\,\rho_j = \sum_{i=0,\, i \ne j}^{m} P_i\,\rho_{ij}, \quad \text{with } P_j \ge 0, \quad \sum_{j=0}^{m} P_j = 1, \quad \rho_j = \sum_{i=0,\, i \ne j}^{m} \rho_{ji}, \qquad j = 0, \ldots, m. \tag{A7.127}$$

The system of Eq. (A7.127) must be solved by replacing one (arbitrarily chosen) equation by $\sum P_j = 1$. Every solution of Eq. (A7.127) with $P_j \ge 0$, $j = 0, \ldots, m$, is a stationary initial distribution of the Markov process. Equation (A7.127) expresses that

$$\Pr\{\text{to come out from state } Z_j\} = \Pr\{\text{to come in state } Z_j\},$$

also known as generalized cut sets theorem.

A Markov process is irreducible if for every pair $i, j \in \{0, \ldots, m\}$ there exists a $t$ such that $P_{ij}(t) > 0$, i.e. if every state can be reached from every other state. It can be shown that if $P_{ij}(t_0) > 0$ for some $t_0 > 0$, then $P_{ij}(t) > 0$ for any $t > 0$. A Markov process is irreducible if and only if its embedded Markov chain is irreducible. For an irreducible Markov process, there exist quantities $P_j > 0$, $j = 0, \ldots, m$, with $P_0 + \ldots + P_m = 1$, such that independently of the initial condition $P_i(0)$ the following holds (Markov theorem, see e.g. [A6.6 (Vol. I)])

$$\lim_{t \to \infty} P_j(t) = P_j > 0, \qquad j = 0, \ldots, m. \tag{A7.128}$$

For any $i = 0, \ldots, m$ it follows then that

$$\lim_{t \to \infty} P_{ij}(t) = P_j > 0, \qquad j = 0, \ldots, m. \tag{A7.129}$$

The set of values $P_0, \ldots, P_m$ from Eq. (A7.128) is the limiting distribution of the Markov process. From Eqs. (A7.74) and (A7.129) it follows that for an irreducible Markov process the limiting distribution is the only stationary distribution, i.e. the only solution of Eq. (A7.127) with $P_j > 0$, $j = 0, \ldots, m$.

Further important results follow from Eqs. (A7.174) - (A7.180). In particular the initial distribution in stationary state (Eq. (A7.181)), the frequency of consecutive occurrences of a given state (Eq. (A7.182)), and the relation between stationary values $P_j$ from Eq. (A7.127) and $\mathcal{P}_j$ for the embedded Markov chain (Eq. (A7.74)) given by

$$P_j = \frac{\mathcal{P}_j / \rho_j}{\sum_{k=0}^{m} \mathcal{P}_k / \rho_k}. \tag{A7.130}$$

From the results given by Eqs. (A7.127)-(A7.129), the asymptotic & steady-state value of the point availability $PA_S$ is given by

$$\lim_{t \to \infty} PA_{Si}(t) = PA_S = \sum_{Z_j \in U} P_j, \qquad i = 0, \ldots, m. \tag{A7.131}$$

If $K$ is a subset of $\{Z_0, \ldots, Z_m\}$, the Markov process is irreducible, and $P_0, \ldots, P_m$ are the limiting probabilities obtained from Eq. (A7.127) then,

$$\Pr\Big\{\lim_{t \to \infty} \frac{\text{total sojourn time in states } Z_j \in K \text{ in } (0, t]}{t} = \sum_{Z_j \in K} P_j\Big\} = 1 \tag{A7.132}$$

irrespective of the initial distribution $P_0(0), \ldots, P_m(0)$. From Eq. (A7.132) it follows

$$\Pr\Big\{\lim_{t \to \infty} \frac{\text{total operating time in } (0, t]}{t} = \sum_{Z_j \in U} P_j = PA_S\Big\} = 1.$$

The average availability of the system can be expressed as (see Eq. (6.24))

$$AA_{Si}(t) = \frac{1}{t}\,E[\text{total operating time in } (0, t] \mid \xi(0) = Z_i] = \frac{1}{t} \int_0^t PA_{Si}(x)\,dx. \tag{A7.133}$$

The above considerations lead to (for any $Z_i \in U$)

$$\lim_{t \to \infty} AA_{Si}(t) = AA_S = PA_S = \sum_{Z_j \in U} P_j. \tag{A7.134}$$

Expressions of the form $\sum k\,P_k$ are useful in practical applications, e.g. for cost optimizations. For reliability applications, irreducible Markov processes can be assumed for availability calculations. According to Eqs. (A7.127) and (A7.128), asymptotic & steady-state is used, for such cases, as a synonym for stationary.

A7.5.4 Frequency / Duration and Reward Aspects

In some applications, it is important to consider the frequency with which failures at system level occur and the mean duration (expected value) of the system down time (or of the system operating time) in the stationary state. Also of interest is the investigation of fault tolerant systems for which a reconfiguration can take place after a failure, allowing continuation of operation with defined loss of performance (reward). Basic considerations on these aspects are given in this section.

A7.5.4.1 Frequency / Duration

To introduce the concept of frequency / duration let us consider the one-item structure discussed in Appendix A7.3 as application of the alternating renewal process. As in Appendix A7.3 assume an item (system) which alternates between operating state, with mean time to failure (mean up time) $MTTF$, and repair state, with complete renewal and mean repair time (mean down time) $MTTR$. In the stationary state, the frequency at which item failures $f_{ud}$ or item repairs (restorations) $f_{du}$ occur is given as (Eq. (A7.60))

$$f_{ud} = f_{du} = \frac{1}{MTTF + MTTR}. \tag{A7.135}$$

Furthermore, for the one-item structure, the mean up time $MUT$ is

$$MUT = MTTF. \tag{A7.136}$$

Consequently, considering Eq. (A7.58) the basic relation

$$PA = \frac{MTTF}{MTTF + MTTR} = f_{ud} \cdot MUT \tag{A7.137}$$

can be established, where $PA$ is the point availability (probability to be up) in the stationary state. Similarly, for the mean failure duration $MDT$ one has

$$MDT = MTTR \tag{A7.138}$$

and thus

$$1 - PA = \frac{MTTR}{MTTF + MTTR} = f_{du} \cdot MDT. \tag{A7.139}$$

Constant failure rate $\lambda = 1 / MTTF$ and repair (restoration) rate $\mu = 1 / MTTR$ leads to

$$PA \cdot \lambda = (1 - PA) \cdot \mu = f_{ud} = f_{du}, \tag{A7.140}$$

which expresses the stationary property of time-homogeneous Markov processes, as particular case of Eq. (A7.127) with $j \in \{0, 1\}$.

For systems of arbitrary complexity with constant failure and repair (restoration) rates, described by time-homogeneous Markov processes (Appendix A7.5.2), it can be shown that the asymptotic & steady-state system failure frequency $f_{udS}$ and system mean up time $MUT_S$ are given as

$$f_{udS} = \sum_{Z_j \in U,\, Z_i \in \bar{U}} P_j\,\rho_{ji} = \sum_{Z_j \in U} P_j \Big(\sum_{Z_i \in \bar{U}} \rho_{ji}\Big) \tag{A7.141}$$

and

$$MUT_S = \Big(\sum_{Z_j \in U} P_j\Big) / f_{udS} = PA_S / f_{udS}, \tag{A7.142}$$

respectively. $U$ is the set of states considered as up states for $f_{udS}$ and $MUT_S$ calculation, $\bar{U}$ the complement to the totality of states considered. $MUT_S$ is the mean of the time in which the system is moving in the set of up states $Z_j \in U$ before a transition in the set of down states $Z_i \in \bar{U}$ occurs in the stationary case or for $t \to \infty$. In Eq. (A7.141), all transition rates $\rho_{ji}$ leaving state $Z_j \in U$ toward $Z_i \in \bar{U}$ are considered (cumulated states). Similar results hold for semi-Markov processes. Equations (A7.141) and (A7.142) have a great intuitive appeal: (i) Because of the memoryless property of the (time-homogeneous) Markov processes, the asymptotic steady-state probability to have a failure in $(t, t + \delta t]$ is $\sum_{Z_j \in U,\, Z_i \in \bar{U}} P_j\,\rho_{ji}\,\delta t$ and, thus, $f_{udS}\,\delta t$. (ii) Defining $UT$ as the total up time in $(0, t)$ and $\nu(t)$ as number of failures in $(0, t)$, and considering for $t \to \infty$ the limits $UT / t \to PA_S$ and $\nu(t) / t \to f_{udS}$, it follows $UT / \nu(t) \to MUT_S = PA_S / f_{udS}$ for $t \to \infty$.

Same results hold for the system repair (restoration) frequency $f_{duS}$ and system mean down time $MDT_S$ (mean repair (restoration) duration at system level), given as

$$f_{duS} = \sum_{Z_i \in \bar{U},\, Z_j \in U} P_i\,\rho_{ij} = \sum_{Z_i \in \bar{U}} P_i \Big(\sum_{Z_j \in U} \rho_{ij}\Big) \tag{A7.143}$$

and

$$MDT_S = \Big(\sum_{Z_i \in \bar{U}} P_i\Big) / f_{duS} = (1 - PA_S) / f_{duS}, \tag{A7.144}$$

respectively. $f_{duS}$ is the system failure intensity $z_S(t) = z_S$ as defined by Eq. (A7.230) in steady-state or for $t \to \infty$. Considering that each failure at system level is followed by a repair (restoration) at system level, one has $f_{udS} = f_{duS}$ and thus

$$f_{udS} = f_{duS} = \sum_{Z_j \in U,\, Z_i \in \bar{U}} P_j\,\rho_{ji} = \sum_{Z_i \in \bar{U},\, Z_j \in U} P_i\,\rho_{ij}. \tag{A7.145}$$

Equations (A7.142), (A7.144), and (A7.145) yield the following important relation between $MDT_S$ and $MUT_S$ (see also Eqs. (A7.137) - (A7.140))

$$MDT_S = MUT_S\,(1 - PA_S) / PA_S. \tag{A7.146}$$

Computation of the frequency of failures ($f_{duS}$) and mean failure duration ($MDT_S$) based on fault tree and corresponding minimal cut-sets (Sections 2.3.4, 2.6) is often used in power systems [6.22], where $f_f$, $d_f$ and $P_f$ appear for $f_{duS}$, $MDT_S$, and $1 - PA_S$. The central part of Eq. (A7.145) is known as theorem of cuts.
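The frequency and duration quantities can be computed from the stationary state probabilities and the transition rates crossing between the up and down sets. The sketch below is an illustration under assumptions: it first obtains the stationary probabilities from the balance equations (exit rate times probability of a state equals the probability flow into it), with one equation replaced by the normalization condition, and then sums the flows across the boundary. Function names and rate values are hypothetical; the example system is a 1-out-of-2 active redundancy with one repair crew.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def stationary(rates):
    """Stationary state probabilities from the balance equations."""
    n = len(rates)
    # Equation j: sum_{i != j} P_i rho_ij - P_j rho_j = 0
    a = [[rates[i][j] if i != j
          else -sum(rates[j][k] for k in range(n) if k != j)
          for i in range(n)] for j in range(n)]
    b = [0.0] * n
    a[-1], b[-1] = [1.0] * n, 1.0  # replace last equation by sum P_j = 1
    return solve_linear(a, b)


def frequency_duration(rates, up):
    """Return (PA, f_ud, f_du, MUT, MDT) for the given set of up states."""
    p = stationary(rates)
    down = [i for i in range(len(rates)) if i not in up]
    pa = sum(p[j] for j in up)
    f_ud = sum(p[j] * rates[j][i] for j in up for i in down)  # up -> down flow
    f_du = sum(p[i] * rates[i][j] for i in down for j in up)  # down -> up flow
    return pa, f_ud, f_du, pa / f_ud, (1.0 - pa) / f_du


# Assumed rates for illustration: failure rate 0.1, repair rate 1.0
LAM, MU = 0.1, 1.0
RATES = [[0.0, 2 * LAM, 0.0],
         [MU, 0.0, LAM],
         [0.0, MU, 0.0]]
PA, F_UD, F_DU, MUT, MDT = frequency_duration(RATES, up={0, 1})
print(PA, F_UD, F_DU, MUT, MDT)
```

With these rates the mean down time equals one repair time ($1/\mu = 1$) and the mean up time is $(2\lambda + \mu)/(2\lambda^2) = 60$, which differs from the mean time to failure starting with both elements new, $(3\lambda + \mu)/(2\lambda^2) = 65$.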

Although appealing, $\sum_{Z_i \in U} P_i\,MTTF_{Si}$, with $MTTF_{Si}$ from Eq. (A7.126) and $P_i$ from Eq. (A7.127), cannot be used to calculate $MUT_S$ (Eqs. (A7.126) and (A7.127) describe two different situations, see the remark with Eq. (A7.122)).

A7.5.4.2 Reward

Complex fault tolerant systems have been conceived to be able to reconfigure themselves at the occurrence of a failure and continue operation, if necessary with reduced performance. Such a feature is important for many systems, e.g. production, information, and power systems, which should assure continuation of operation after a system failure. Besides fail-safe aspects, investigation of such systems is

based on the superposition of pe$ormance behavior (often assumed deterministic) and stochastic dependability behavior (including reliability, maintainability, availability, and logistic support). A straightfonvard possibility is to assign to each state Zi of the dependability model a reward rate 5 which take care of the performance reduction in the state considered. From this, the expected (mean) instantaneous reward rate MIRs ( t ) can be calculated in stationary state as

thereby, ri= 0 for down states, 0< ri<l for partially down states, and ri=l for up states with 100% performance. The expected (mean) accumulated reward MARS(t) over the time interval (0, t] follows for the stationary state as

Other metrics, for instance reward impulses at state transition or the expected ratio of busy channels to jobs request, are possible (see e.g. [A7.15, 6.19 (1995), 6.26, 6.34]). The reward rate can be applied directly to differential equations. For the purpose of this book, application in Section 6.8.6.4 will be limited to Eq. (A7.147).

$P_i$ in Eq. (A7.147) is the asymptotic & steady-state probability in state $Z_i$ (Eq. (A7.127)), giving also the expected percentage of time the system stays at the performance level specified by $Z_i$ (Eq. (A7.132)).
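The reward concept can be illustrated with a few lines of Python. This is a minimal sketch, not taken from the book: it assumes that Eq. (A7.147) has the usual form $MIR_S = \sum_i r_i P_i$ and that the accumulated reward over $(0, t]$ in stationary state is $MIR_S \cdot t$. The function names and the three-state numerical values are hypothetical.

```python
def mean_instantaneous_reward(reward_rates, state_probs):
    """Stationary expected reward rate, assumed form: sum of r_i * P_i."""
    if len(reward_rates) != len(state_probs):
        raise ValueError("one reward rate per state is required")
    if abs(sum(state_probs) - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    return sum(r * p for r, p in zip(reward_rates, state_probs))


def mean_accumulated_reward(reward_rates, state_probs, t):
    """Expected reward accumulated over (0, t] in stationary state."""
    return mean_instantaneous_reward(reward_rates, state_probs) * t


# Hypothetical system: Z0 up (100 %), Z1 degraded (50 %), Z2 down (0 %).
reward_rates = [1.0, 0.5, 0.0]
state_probs = [0.95, 0.04, 0.01]

mir = mean_instantaneous_reward(reward_rates, state_probs)
mar = mean_accumulated_reward(reward_rates, state_probs, t=1000.0)
print(mir, mar)
```

With $r_i \in \{0, 1\}$ only, the same sum reduces to the steady-state point availability.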

A7.5.5 Birth and Death Process

A birth and death process is a Markov process characterized by the property that transitions from a state $Z_i$ can only occur to state $Z_{i+1}$ or $Z_{i-1}$. In the time-homogeneous case, it is used to investigate k-out-of-n redundancies with identical elements and constant failure and repair rates during the stay (sojourn) time in any given state (not necessarily at state transitions, e.g. because of load sharing). The diagram of transition probabilities in $(t, t+\delta t]$ is given in Fig. A7.9. $\nu_i$ and $\theta_i$ are the transition rates from state $Z_i$ to $Z_{i+1}$ and $Z_i$ to $Z_{i-1}$, respectively (transitions outside neighboring states can occur in $(t, t+\delta t]$ only with probability $o(\delta t)$). The system of

Figure A7.9 Diagram of transition probabilities in $(t, t+\delta t]$ for a birth and death process with $n+1$ states ($t$ arbitrary, $\delta t \downarrow 0$, Markov process)

differential equations describing the birth and death process given in Fig. A7.9 is

$$\dot P_j(t) = -(\nu_j + \theta_j)\,P_j(t) + \nu_{j-1}\,P_{j-1}(t) + \theta_{j+1}\,P_{j+1}(t), \quad \text{with } \theta_0 = \nu_{-1} = \nu_n = \theta_{n+1} = 0, \; j = 0, \ldots, n. \tag{A7.149}$$

The conditions $\nu_j > 0$ $(j = 0, \ldots, n-1)$ and $\theta_j > 0$ $(j = 1, \ldots, n < \infty)$ are sufficient for the existence of the limiting probabilities

$$\lim_{t \to \infty} P_j(t) = P_j, \quad \text{with } P_j > 0 \text{ and } \sum_{j=0}^{n} P_j = 1. \tag{A7.150}$$

It can be shown (Example A7.8) that the probabilities $P_j$, $j = 0, \ldots, n$, are given by

$$P_j = \pi_j P_0 = \pi_j \Big/ \sum_{i=0}^{n} \pi_i, \quad \text{with } \pi_i = \frac{\nu_0 \cdots \nu_{i-1}}{\theta_1 \cdots \theta_i} \text{ and } \pi_0 = 1. \tag{A7.151}$$

From Eq. (A7.151) one recognizes that

$$P_k\,\nu_k = P_{k+1}\,\theta_{k+1}, \quad k = 0, \ldots, n-1;$$

this holds quite generally for time-homogeneous Markov processes (Eq. (A7.127)).

Example A7.8

Assuming Eq. (A7.150) prove Eq. (A7.151).

Solution

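Considering Eqs. (A7.149) & (A7.150), $P_j$ are the solution of the following system of algebraic equations

$$\begin{aligned} 0 &= -\nu_0 P_0 + \theta_1 P_1 \\ &\;\;\vdots \\ 0 &= -(\nu_j + \theta_j)\,P_j + \nu_{j-1} P_{j-1} + \theta_{j+1} P_{j+1}, \quad j = 1, \ldots, n-1 \\ &\;\;\vdots \\ 0 &= -\theta_n P_n + \nu_{n-1} P_{n-1}. \end{aligned}$$

From the first equation it follows $P_1 = P_0\,\nu_0 / \theta_1$. With this $P_1$, the second equation leads to

$$P_2 = \frac{\nu_1 + \theta_1}{\theta_2}\,P_1 - \frac{\nu_0}{\theta_2}\,P_0 = \frac{\nu_0\,\nu_1}{\theta_1\,\theta_2}\,P_0.$$

Recursively one obtains

$$P_j = \frac{\nu_0 \cdots \nu_{j-1}}{\theta_1 \cdots \theta_j}\,P_0 = \pi_j P_0, \quad j = 1, \ldots, n.$$

Considering $P_0 + \ldots + P_n = 1$, $P_0$ follows and then Eq. (A7.151).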

The values of $P_j$ given by Eq. (A7.151) can be used in Eq. (A7.134) to calculate the stationary (asymptotic & steady-state) value of the point availability. The system mean time to failure follows from Eq. (A7.126). Examples A7.9 and A7.10 are applications of the birth and death process.
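The product form of Eq. (A7.151) is straightforward to evaluate numerically. The following Python lines are a minimal sketch and not part of the original text; the function name `birth_death_steady_state` and the numerical rates are illustrative assumptions. The rates chosen correspond to a 1-out-of-2 active redundancy with one repair crew ($\nu_0 = 2\lambda$, $\nu_1 = \lambda$, $\theta_1 = \theta_2 = \mu$).

```python
def birth_death_steady_state(nu, theta):
    """Steady-state probabilities P_0..P_n of a birth and death process.

    nu[j]    : rate Z_j -> Z_{j+1}, j = 0..n-1   (nu_0 .. nu_{n-1})
    theta[j] : rate Z_{j+1} -> Z_j, j = 0..n-1   (theta_1 .. theta_n)
    Uses pi_0 = 1, pi_i = (nu_0...nu_{i-1}) / (theta_1...theta_i).
    """
    if len(nu) != len(theta):
        raise ValueError("nu and theta must have the same length n")
    if any(v <= 0 for v in nu) or any(t <= 0 for t in theta):
        raise ValueError("all rates must be positive")
    pi = [1.0]
    for v, t in zip(nu, theta):
        pi.append(pi[-1] * v / t)
    total = sum(pi)
    return [x / total for x in pi]


# Hypothetical values: failure rate 1e-3 per hour, repair rate 0.5 per hour.
lam, mu = 1e-3, 0.5
nu, theta = [2 * lam, lam], [mu, mu]
P = birth_death_steady_state(nu, theta)
availability = P[0] + P[1]          # up states Z_0 and Z_1
print(P, availability)
```

The balance relation $P_k \nu_k = P_{k+1} \theta_{k+1}$ can be used as a quick consistency check on the returned values.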


Example A7.9

For the 1-out-of-2 active redundancy with one repair crew of Examples A7.6 and A7.7, i.e. for $\nu_0 = 2\lambda$, $\nu_1 = \lambda$, $\theta_1 = \theta_2 = \mu$, $U = \{Z_0, Z_1\}$ and $\bar U = \{Z_2\}$, give the asymptotic & steady-state value $PA_S$ of the point availability and the mean time to failure $MTTF_{S0}$ and $MTTF_{S1}$.

Solution

The asymptotic & steady-state value of point availability is given by Eqs. (A7.134) and (A7.151)

$$PA_S = P_0 + P_1 = \frac{1 + 2\lambda/\mu}{1 + 2\lambda/\mu + 2\lambda^2/\mu^2} = \frac{\mu^2 + 2\lambda\mu}{2\lambda(\lambda + \mu) + \mu^2}.$$

The system's mean time to failure follows from Eq. (A7.126) with $\rho_{01} = \rho_0 = 2\lambda$, $\rho_{12} = \lambda$, $\rho_{10} = \mu$, $\rho_1 = \lambda + \mu$,

$$\begin{aligned} MTTF_{S0} &= \frac{1}{2\lambda} + MTTF_{S1} \\ MTTF_{S1} &= \frac{1}{\lambda + \mu} + \frac{\mu}{\lambda + \mu}\,MTTF_{S0}, \end{aligned}$$

yielding

$$MTTF_{S0} = \frac{3\lambda + \mu}{2\lambda^2} \quad \text{and} \quad MTTF_{S1} = \frac{2\lambda + \mu}{2\lambda^2}.$$

Example A7.10

A computer system consists of 3 identical CPUs. Jobs arrive independently and the arrival times form a Poisson process with intensity $\lambda$. The duration of each individual job is distributed exponentially with parameter $\mu$. All jobs have the same memory requirements $D$. Give for $\lambda = 2\mu$ the minimum size $n$ of the memory required in units of $D$, so that in the stationary case (asymptotic & steady-state) a new job can immediately find storage space with a probability $\gamma$ of at least 95%. When overflow occurs, jobs are queued.

Solution

The problem can be solved using the following birth and death process

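[Diagram of transition probabilities in $(t, t+\delta t]$ for the birth and death process with states $Z_0, Z_1, Z_2, \ldots$: transitions $Z_i \to Z_{i+1}$ with probability $\lambda\,\delta t$; transitions $Z_i \to Z_{i-1}$ with probability $\mu\,\delta t$ for $i = 1$, $2\mu\,\delta t$ for $i = 2$, and $3\mu\,\delta t$ for $i \ge 3$.]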

In state $Z_i$, exactly $i$ memory units are occupied. $n$ is the smallest integer such that in the steady-state, $P_0 + \ldots + P_{n-1} = \gamma \ge 0.95$ (if the assumption were made that jobs are lost if overflow occurs, then the process would stop at state $Z_n$). For steady-state, Eq. (A7.127) yields

$$\begin{aligned} 0 &= -\lambda P_0 + \mu P_1 \\ 0 &= \lambda P_0 - (\lambda + \mu)\,P_1 + 2\mu P_2 \\ 0 &= \lambda P_1 - (\lambda + 2\mu)\,P_2 + 3\mu P_3 \\ 0 &= \lambda P_2 - (\lambda + 3\mu)\,P_3 + 3\mu P_4 \\ &\;\;\vdots \end{aligned}$$

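The solution leads to

$$P_1 = \frac{\lambda}{\mu}\,P_0 \quad \text{and} \quad P_i = \frac{9}{2} \Big( \frac{\lambda/\mu}{3} \Big)^{i} P_0 \quad \text{for } i \ge 2.$$

Assuming $\lim_{n \to \infty} \sum_{i=0}^{n} P_i = 1$ and considering $\lambda / 3\mu < 1$ it follows that

$$P_0 \Big[ 1 + \frac{\lambda}{\mu} + \sum_{i=2}^{\infty} \frac{9}{2} \Big( \frac{\lambda/\mu}{3} \Big)^{i} \Big] = P_0 \Big[ 1 + \frac{\lambda}{\mu} + \frac{3\,(\lambda/\mu)^2}{2\,(3 - \lambda/\mu)} \Big] = 1,$$

from which

$$P_0 = \frac{2\,(3 - \lambda/\mu)}{6 + 4\lambda/\mu + (\lambda/\mu)^2}.$$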

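The size of the memory $n$ can now be determined from

$$\frac{2\,(3 - \lambda/\mu)}{6 + 4\lambda/\mu + (\lambda/\mu)^2} \Big[ 1 + \frac{\lambda}{\mu} + \sum_{i=2}^{n-1} \frac{9}{2} \Big( \frac{\lambda/\mu}{3} \Big)^{i} \Big] \ge \gamma.$$

For $\lambda/\mu = 2$ and $\gamma = 0.95$, the smallest $n$ satisfying the above equation is $n = 9$ ($P_0 = 1/9$, $P_1 = 2/9$, $P_i = 2^{i-1}/3^{i}$ for $i \ge 2$).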
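The result of Example A7.10 can be checked numerically. The sketch below is an illustration only, under the assumptions of the example (birth rate $\lambda$ in every state, death rate $\min(i, 3)\,\mu$ in state $Z_i$, unlimited queue); the function name `min_memory_units` and its parameters are hypothetical.

```python
from math import factorial


def min_memory_units(a, gamma, servers=3):
    """Smallest n with P_0 + ... + P_{n-1} >= gamma, and P_0.

    a       : ratio lambda / mu (must satisfy 0 < a < servers)
    gamma   : required probability, 0 < gamma < 1
    servers : number of CPUs (3 in Example A7.10)
    """
    if not 0 < a < servers:
        raise ValueError("a steady state requires 0 < lambda/mu < servers")
    if not 0 < gamma < 1:
        raise ValueError("gamma must lie in (0, 1)")

    def weight(i):
        # Unnormalized P_i / P_0 from the birth and death product form.
        if i <= servers:
            return a**i / factorial(i)
        return a**servers / factorial(servers) * (a / servers) ** (i - servers)

    # Geometric tail from state `servers` onward summed in closed form.
    norm = sum(weight(i) for i in range(servers)) + weight(servers) / (1 - a / servers)
    cumulative, n = 0.0, 0
    while cumulative < gamma:
        cumulative += weight(n) / norm
        n += 1
    return n, 1.0 / norm


n, p0 = min_memory_units(a=2.0, gamma=0.95)
print(n, p0)
```

For $\lambda/\mu = 2$ the normalization constant is 9, so $P_0 = 1/9$ as stated in the example.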

As shown by Examples A7.9 and A7.10, reliability applications of birth and death processes identify $\nu_i$ as failure rates and $\theta_i$ as repair rates. In this case,

$$\nu_j \ll \theta_{j+1}, \quad j = 0, \ldots, n-1,$$

with $\nu_j$ and $\theta_j$ as in Fig. A7.9. Assuming $0 < r < 1$ and thus

the following relationships for the steady-state probability Pj can be obtained (Example A7.11)

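$$P_j \ge \frac{1 - r}{r\,(1 - r^{\,n-j})} \sum_{i=j+1}^{n} P_i, \quad 0 < r < 1, \; j = 0, \ldots, n-1. \tag{A7.156}$$

For $r \le 1/2$ it follows that

$$P_j \ge \sum_{i=j+1}^{n} P_i, \quad j = 0, \ldots, n-1. \tag{A7.157}$$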

Equation (A7.157) states that for $2\nu_j \le \theta_{j+1}$ the steady-state probability in a state $Z_j$ of a birth and death process described by Fig. A7.9 is $\ge$ the sum of the steady-state probabilities in all states following $Z_j$, $j = 0, \ldots, n-1$ [2.50 (1992)]. This relationship is useful in developing approximate expressions for system availability.

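Example A7.11

Assuming Eq. (A7.155), prove Eqs. (A7.156) and (A7.157).

Solution

Using Eq. (A7.150),

$$\frac{\sum_{i=j+1}^{n} P_i}{P_j} = \frac{\nu_j}{\theta_{j+1}} + \frac{\nu_j\,\nu_{j+1}}{\theta_{j+1}\,\theta_{j+2}} + \ldots + \frac{\nu_j \cdots \nu_{n-1}}{\theta_{j+1} \cdots \theta_n}.$$

Setting $\nu_i / \theta_{i+1} \le r$ for $0 < r < 1$ and $i = j, j+1, \ldots, n-1$, it follows that

$$\sum_{i=j+1}^{n} P_i \Big/ P_j \le r + r^2 + \ldots + r^{\,n-j} = \frac{r\,(1 - r^{\,n-j})}{1 - r},$$

and thus Eq. (A7.156). Furthermore, for $r \le 1/2$ it holds that

$$\sum_{i=j+1}^{n} P_i \Big/ P_j \le 1 - (1/2)^{n-j} \le 1,$$

and hence Eq. (A7.157).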
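The dominance property stated by Eq. (A7.157) can be observed numerically. The lines below are a small sketch under the assumption that the steady-state probabilities have the product form of Eq. (A7.151); the helper names and the rate values are hypothetical.

```python
def steady_state(nu, theta):
    """Product-form steady-state probabilities of a birth and death process."""
    pi = [1.0]
    for v, t in zip(nu, theta):
        pi.append(pi[-1] * v / t)
    total = sum(pi)
    return [x / total for x in pi]


def each_state_dominates_tail(probs):
    """True if P_j >= P_{j+1} + ... + P_n for every j = 0..n-1."""
    return all(probs[j] >= sum(probs[j + 1:]) for j in range(len(probs) - 1))


# Typical reliability case: failure rates much smaller than repair rates.
typical = steady_state(nu=[3e-3, 2e-3, 1e-3], theta=[0.1, 0.1, 0.1])
# Limiting case 2 * nu_j = theta_{j+1}, i.e. r = 1/2.
limiting = steady_state(nu=[1.0, 1.0, 1.0], theta=[2.0, 2.0, 2.0])
# Condition violated (nu_j > theta_{j+1}): the property is no longer guaranteed.
violated = steady_state(nu=[2.0, 2.0], theta=[1.0, 1.0])

print(each_state_dominates_tail(typical),
      each_state_dominates_tail(limiting),
      each_state_dominates_tail(violated))
```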

A7.6 Semi-Markov Processes with Finitely Many States

The description of Markov processes given in Appendix A7.5.2 allows a straightforward generalization to semi-Markov processes. In a semi-Markov process, the sequence of consecutively occurring states forms an embedded (time-homogeneous) Markov chain, just as with Markov processes. The stay (sojourn) time in a given state $Z_i$ is a positive random variable $\tau_{ij}$ whose distribution depends on $Z_i$ and on the following state $Z_j$, but in contrast to Markov processes it is arbitrarily and not exponentially distributed. Related to semi-Markov processes are Markov renewal processes (number of transitions in state $Z_i$ during $(0, t]$) [A7.23].

To define semi-Markov processes, let $\xi_0, \xi_1, \ldots$ be the sequence of consecutively occurring states, i.e. a sequence of random variables taking values in $\{Z_0, \ldots, Z_m\}$, and $\eta_0, \eta_1, \ldots$ the stay (sojourn) times between consecutive states, i.e. a sequence of positive random variables. A stochastic process $\xi(t)$ with state space $\{Z_0, \ldots, Z_m\}$ is a semi-Markov process if for $n = 0, 1, 2, \ldots$, arbitrary $i, j, i_0, \ldots, i_{n-1} \in \{0, \ldots, m\}$, and arbitrary $x_0, \ldots, x_{n-1} > 0$,

$\xi(t) = \xi_0$ for $0 \le t < \eta_0$ and $\xi(t) = \xi_n$ for $\eta_0 + \ldots + \eta_{n-1} \le t < \eta_0 + \ldots + \eta_n$ for $n \ge 1$ $(t \ge 0)$ is a pure jump process, as visualized in Fig. A7.10.


Figure A7.10 Possible realization for a semi-Markov process ($x$ starts at 0 at each state change)

The functions $Q_{ij}(x)$ in Eq. (A7.158), defined only for $j \ne i$, are the semi-Markov transition probabilities (see remarks with Eqs. (A7.93) - (A7.101)). Setting

and, for $p_{ij} > 0$,

leads to

$$Q_{ij}(x) = p_{ij}\,F_{ij}(x), \quad j \ne i, \quad Q_{ij}(0) = 0, \tag{A7.161}$$

with (Example A7.2)

and

$$F_{ij}(x) = \Pr\{\eta_n \le x \mid (\xi_n = Z_i \cap \xi_{n+1} = Z_j)\}, \quad j \ne i, \quad F_{ij}(0) = 0. \tag{A7.163}$$

Since $p_{ii} = 0$ is mandatory for a semi-Markov process, $Q_{ii}(x)$ and $F_{ii}(x)$ can be arbitrary. From Eq. (A7.158), the consecutive jump points at which the process enters $Z_i$ are regeneration points. This holds for any $i \in \{0, \ldots, m\}$. Thus,

all states of a semi-Markov process are regeneration states.

The renewal density of the embedded renewal process of consecutive jumps in $Z_i$ (i-renewals) will be denoted as $h_i(t)$ (Eq. (A7.177)). The interpretation of the quantities $Q_{ij}(x)$ given by Eqs. (A7.99) - (A7.101) is useful for practical applications (see for instance Eqs. (A7.183) - (A7.186)).


The initial distribution, i.e. the distribution of the vector $(\xi_0 \equiv \xi(0), \xi_1, \eta_0)$, is given, for the general case, by

$$A_{ij}(x) = \Pr\{\xi(0) = Z_i \cap \xi_1 = Z_j \cap \text{residual sojourn time } (\eta_0) \text{ in } Z_i \le x\} = P_i(0)\,p_{ij}\,F^{o}_{ij}(x), \tag{A7.164}$$

with $P_i(0) = \Pr\{\xi(0) = Z_i\}$, $p_{ij}$ according to Eq. (A7.162), and $F^{o}_{ij}(x) = \Pr\{\text{residual sojourn time in } Z_i \le x \mid (\xi(0) = Z_i \cap \xi_1 = Z_j)\}$. $\xi(0)$ is used here for clarity instead of $\xi_0$. The semi-Markov process is memoryless only at the transition points from one state to the other. To have the time $t = 0$ as a regeneration point, the initial condition $\xi(0) = Z_i$, sufficient for time-homogeneous Markov processes, must be reinforced for semi-Markov processes by

$Z_i$ is entered at $t = 0$.

The sequence $\xi_0, \xi_1, \ldots$ forms a Markov chain, embedded in the semi-Markov process, with transition probabilities $p_{ij}$ as per Eq. (A7.162) and initial probabilities $P_i(0)$, $i = 0, \ldots, m$. $F_{ij}(x)$ is the conditional distribution function of the stay (sojourn) time in $Z_i$ with consequent jump in $Z_j$ (next state to be visited).

A semi-Markov process is a Markov process if and only if $F_{ij}(x) = 1 - e^{-\rho_i x}$, for $i, j \in \{0, \ldots, m\}$.

An example of a two state semi-Markov process is the alternating renewal process given in Appendix A7.3 ($Z_0$ = up, $Z_1$ = down, $p_{01} = p_{10} = 1$, $F_{01}(x) = F(x)$, $F_{10}(x) = G(x)$, $F^{o}_{0}(x) = F_A(x)$, $F^{o}_{1}(x) = G_A(x)$, $P_0(0) = p$, $P_1(0) = 1 - p$).

In many applications, the quantities $Q_{ij}(x)$, or $p_{ij}$ and $F_{ij}(x)$, can be calculated using Eqs. (A7.99) - (A7.101), as shown in Appendix A7.7 and Sections 6.3 - 6.6. For the unconditional stay (sojourn) time in $Z_i$, the distribution function is

and the mean

In the following it will be assumed that

exists for all $i, j \in \{0, \ldots, m\}$. Consider first the case in which the process enters the state $Z_i$ at $t = 0$, i.e. that

$$P_i(0) = 1 \quad \text{and} \quad F^{o}_{ij}(x) = F_{ij}(x).$$

The transition probabilities

$$P_{ij}(t) = \Pr\{\xi(t) = Z_j \mid Z_i \text{ is entered at } t = 0\} \tag{A7.168}$$

can be obtained by generalizing Eq. (A7.120),

with $\delta_{ij}$ and $Q_i(t)$ per Eqs. (A7.85) & (A7.165). The state probabilities follow as

$$P_j(t) = \Pr\{\xi(t) = Z_j\} = \sum_{i=0}^{m} \Pr\{Z_i \text{ is entered at } t = 0\}\,P_{ij}(t), \tag{A7.170}$$

with $P_j(t) \ge 0$ and $P_0(t) + \ldots + P_m(t) = 1$. If the state space is divided into the complementary sets $U$ for the up states and $\bar U$ for the down states, as in Eq. (A7.107), the point availability follows from Eq. (A7.112)

$$PA_{Si}(t) = \Pr\{\xi(t) \in U \mid Z_i \text{ is entered at } t = 0\} = \sum_{Z_j \in U} P_{ij}(t), \quad i = 0, \ldots, m, \tag{A7.171}$$

with $P_{ij}(t)$ as in Eq. (A7.169). The probability that the first transition from a state in $U$ to a state in $\bar U$ occurs after time $t$, i.e. the reliability function, can be obtained by generalizing the system of integral equations (A7.122).

with $Q_i(t)$ as in Eq. (A7.165). The mean of the stay (sojourn) time in $U$, i.e. the system mean time to failure, follows from Eq. (A7.172) as solution of the following system of algebraic equations (with $T_i$ as per Eq. (A7.166))

$$MTTF_{Si} = T_i + \sum_{Z_j \in U,\; j \ne i} p_{ij}\,MTTF_{Sj}, \quad Z_i \in U. \tag{A7.173}$$
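Equation (A7.173) is a small linear system and can be solved directly. The following Python sketch is an illustration, not the author's procedure; `solve_linear` and `mttf_semi_markov` are hypothetical helper names. As a check, it is applied to the Markov special case of a 1-out-of-2 active redundancy with one repair crew ($T_0 = 1/2\lambda$, $T_1 = 1/(\lambda+\mu)$, $p_{01} = 1$, $p_{10} = \mu/(\lambda+\mu)$), for which $MTTF_{S0} = (3\lambda+\mu)/2\lambda^2$ and $MTTF_{S1} = (2\lambda+\mu)/2\lambda^2$ follow by hand.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-14:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def mttf_semi_markov(mean_sojourn, p, up_states):
    """MTTF_Si for each up state: MTTF_Si - sum_{j in U, j != i} p_ij MTTF_Sj = T_i."""
    rows, rhs = [], []
    for i in up_states:
        rows.append([1.0 if j == i else -p[i][j] for j in up_states])
        rhs.append(mean_sojourn[i])
    return dict(zip(up_states, solve_linear(rows, rhs)))


# Hypothetical rates: lambda = 1e-3 per hour, mu = 0.5 per hour.
lam, mu = 1e-3, 0.5
T = [1 / (2 * lam), 1 / (lam + mu), 1 / mu]
p = [[0.0, 1.0, 0.0],
     [mu / (lam + mu), 0.0, lam / (lam + mu)],
     [0.0, 1.0, 0.0]]
mttf = mttf_semi_markov(T, p, up_states=[0, 1])
print(mttf)
```

For a semi-Markov model with non-exponential sojourn times only `T` and `p` change; the linear system keeps the same structure.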

Consider now the case of a stationary semi-Markov process. Under the assumption that the embedded Markov chain is irreducible (each state can be reached from every other state with probability > 0), the semi-Markov process is stationary if and only if the initial distribution (Eq. (A7.164)) is given by [A7.22, A7.23, A7.28]

In Eq. (A7.174), $p_{ij}$ are the transition probabilities (Eq. (A7.162)) and $\mathcal{P}_j$ the stationary distribution of the embedded Markov chain; $\mathcal{P}_j$ are the unique solutions of


The system given by Eq. (A7.175) must be solved by dropping one (arbitrarily chosen) equation and replacing this by $\sum \mathcal{P}_j = 1$. For the stationary semi-Markov process, the state probabilities are independent of time and given by

with $T_i$ per Eq. (A7.166) and $\mathcal{P}_i$ from Eq. (A7.175). $T_{ii}$ is the mean of the time interval between two consecutive occurrences of the state $Z_i$ (in steady-state). These time points form a stationary renewal process with renewal density

$h_i$ is the frequency of successive occurrences of state $Z_i$. In Eq. (A7.176), $P_i$ can be (heuristically) interpreted as $P_i = \lim_{t \to \infty} [(t / T_{ii})\,T_i] / t = T_i / T_{ii}$ and as ratio of the mean time in which the embedded Markov chain is in state $Z_i$ to the mean time in all states, $P_i = \mathcal{P}_i T_i / \sum_k \mathcal{P}_k T_k$. Similar is for $A_{ij}(x)$ in Eqs. (A7.174) & (A7.179). The stationary (asymptotic and steady-state) value of the point availability $PA_S$ and average availability $AA_S$ follows from Eq. (A7.176)

Under the assumptions made above, i.e. continuous sojourn times with finite means and an irreducible embedded Markov chain, the following applies for $i = 0, \ldots, m$ regardless of the initial distribution at $t = 0$

$$\lim_{t \to \infty} \Pr\{\xi(t) = Z_i \cap \text{next transition in } Z_j \cap \text{residual sojourn time in } Z_i \le x\} = \frac{\mathcal{P}_i\,p_{ij}}{\sum_{k=0}^{m} \mathcal{P}_k T_k} \int_0^x (1 - F_{ij}(y))\,\mathrm{d}y = A_{ij}(x), \tag{A7.179}$$

and thus

$$\lim_{t \to \infty} \Pr\{\xi(t) = Z_i\} = P_i = \frac{T_i}{T_{ii}} \quad \text{and} \quad \lim_{t \to \infty} PA_S(t) = PA_S = \sum_{Z_i \in U} P_i. \tag{A7.180}$$

For reliability applications, irreducible semi-Markov processes can be assumed. According to Eqs. (A7.176) and (A7.180),

asymptotic & steady-state is used, for such cases, as a synonym for stationary.


For the alternating renewal process (Appendix A7.3 with $Z_0$ = up, $Z_1$ = down, $T_0 = MTTF$, and $T_1 = MTTR$) it holds that $\mathcal{P}_0 = \mathcal{P}_1 = 1/2$ (embedded Markov chain) and $T_{00} = T_{11} = T_0 + T_1$. Eq. (A7.178) (or (A7.180)) leads to $PA_S = P_0 = T_0 / T_{00} = T_0 / (T_0 + T_1) = \mathcal{P}_0 T_0 / (\mathcal{P}_0 T_0 + \mathcal{P}_1 T_1)$. This example shows the basic difference between $\mathcal{P}_i$ as stationary distribution of the embedded Markov chain and the limiting state probability $P_i$ in state $Z_i$ of the original process in continuous time.
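The two-step procedure described above (stationary distribution of the embedded chain, then weighting by the mean sojourn times) can be sketched in Python. This is an illustration under stated assumptions: the stationarity condition is taken in the usual form $\mathcal{P}_j = \sum_i \mathcal{P}_i\,p_{ij}$ with one equation replaced by $\sum_j \mathcal{P}_j = 1$, and the state probabilities as $P_i = \mathcal{P}_i T_i / \sum_k \mathcal{P}_k T_k$. All function names and numerical values are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-14:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def embedded_stationary(p):
    """Stationary distribution of the embedded Markov chain (irreducible chain assumed)."""
    m = len(p)
    # Row j encodes sum_i x_i p[i][j] - x_j = 0.
    a = [[p[i][j] - (1.0 if i == j else 0.0) for i in range(m)] for j in range(m)]
    b = [0.0] * m
    a[-1], b[-1] = [1.0] * m, 1.0   # replace the last equation by the normalization
    return solve_linear(a, b)


def state_probabilities(p, mean_sojourn):
    """Limiting state probabilities of the continuous-time process."""
    w = embedded_stationary(p)
    total = sum(wi * ti for wi, ti in zip(w, mean_sojourn))
    return [wi * ti / total for wi, ti in zip(w, mean_sojourn)]


# Alternating renewal process with hypothetical MTTF = 1000 h and MTTR = 2 h.
chain = embedded_stationary([[0.0, 1.0], [1.0, 0.0]])
probs = state_probabilities([[0.0, 1.0], [1.0, 0.0]], [1000.0, 2.0])
print(chain, probs)
```

The embedded chain spends half of its steps in each state, while the continuous-time process spends a fraction $1000/1002$ of the time in the up state, which is the difference pointed out in the text.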

For time-homogeneous Markov processes (Appendix A7.5), it follows $T_i = 1/\rho_i$ (Eqs. (A7.166), (A7.165), (A7.102)); for this case, Eqs. (A7.174) & (A7.177) yield

and $h_i(t) = h_i = P_i\,\rho_i = P_i / T_i = 1 / T_{ii}$, $i = 0, \ldots, m$, (A7.182)

respectively. Eq. (A7.181) follows also directly from Eq. (A7.164) by considering $F^{o}_{ij}(x) = F_{ij}(x) = 1 - e^{-\rho_i x}$. Eq. (A7.181) expresses the stationary property of time-homogeneous Markov processes (see also Eqs. (A7.151) and (A7.127)). Furthermore, Eq. (A7.161) holds with $p_{ij} = \rho_{ij} / \rho_i$, and Eq. (A7.176) reduces to Eq. (A7.130).

A7.7 Semi-regenerative Processes

As pointed out in Appendix A7.5.2, the time behavior of a repairable system can be described by a time-homogeneous Markov process only if failure-free times and repair times of all elements are exponentially distributed (constant failure and repair rates during the stay (sojourn) time in every state, with possible stepwise change at state transitions, e.g. because of load sharing). Except for the special case of the Erlang distribution (Section 6.3.3), non exponentially distributed repair and/or failure-free times lead in some few cases to semi-Markov processes and in general to processes with only few regeneration states or to nonregenerative processes. To make sure that the time behavior of a system can be described by a semi-Markov process, there must be no "running" failure-free time or repair time at any state transition (state change) which is not exponentially distributed; otherwise the sojourn time to the next state transition would depend on how long these non-exponentially distributed times have already run. Example A7.12 shows the case of a process with states $Z_0$, $Z_1$, $Z_2$ in which only states $Z_0$ and $Z_1$ are regeneration states. $Z_0$ and $Z_1$ form a semi-Markov process embedded in the original process, on which the investigation can be based. Processes with an embedded semi-Markov process are called semi-regenerative processes. Their investigation can become time-consuming and has to be performed in general on a case-by-case basis, see for instance Example A7.12 (Fig. A7.11), Fig. A7.12 and Sections 6.4.2, 6.4.3, 6.5.2.


[Figure legend: operating, reserve, repair; renewal points (for $Z_0$ and $Z_1$, respectively)]

Figure A7.11 a) Possible time schedule for a 1-out-of-2 warm redundancy with constant failure rates ($\lambda$, $\lambda_r$), arbitrary repair rate (density $g(x)$), only one repair crew (repair times greatly exaggerated); b) State transition diagram for the embedded semi-Markov process with regeneration states $Z_0$ and $Z_1$ ($Q_{12}'$ is not a semi-Markov transition probability; during a transition $Z_1 \to Z_2 \to Z_1$, the embedded Markov chain (on $\{Z_0, Z_1\}$) remains in $Z_1$); this model holds for a k-out-of-n warm redundancy with $n - k = 1$ as well

Example A7.12

Consider a 1-out-of-2 warm redundancy as in Fig. A7.4a with constant failure rates $\lambda$ in operating and $\lambda_r$ in reserve state, one repair crew, arbitrarily distributed repair time with distribution $G(x)$ and density $g(x)$. Give the transition probabilities for the embedded semi-Markov process.

Solution

As Fig. A7.11a shows, only states $Z_0$ and $Z_1$ are regeneration states. $Z_2$ is not a regeneration state because at the transition points into $Z_2$ a repair with arbitrary repair rate is running. Thus, the process involved is not a semi-Markov process. However, states $Z_0$ and $Z_1$ form an embedded semi-Markov process on which investigations can be based. The transition probabilities of the embedded semi-Markov process are obtained (using Eq. (A7.99) and Fig. A7.11) as

$$Q_{01}(x) = 1 - e^{-(\lambda + \lambda_r)x}, \quad Q_{10}(x) = \int_0^x g(y)\,e^{-\lambda y}\,\mathrm{d}y, \quad Q_{121}(x) = \Pr\{\eta_{121} \le x\} = \int_0^x g(y)\,(1 - e^{-\lambda y})\,\mathrm{d}y. \tag{A7.183}$$

$Q_{121}(x)$ is used to calculate the point availability (Section 6.4.2). It accounts for the process returning from state $Z_2$ to state $Z_1$ (Fig. A7.11a) and for the fact that $Z_2$ is not a regeneration state (during a transition $Z_1 \to Z_2 \to Z_1$, the embedded Markov chain (on $\{Z_0, Z_1\}$) remains in $Z_1$). $Q_{12}'(x)$ as given in Fig. A7.11b is not a semi-Markov transition probability ($Z_2$ is not a regeneration state). However, $Q_{12}'(x)$ expressed as (see Fig. A7.11a)

$$Q_{12}'(x) = \int_0^x \lambda e^{-\lambda y}\,(1 - G(y))\,\mathrm{d}y = 1 - e^{-\lambda x} - \int_0^x \lambda e^{-\lambda y}\,G(y)\,\mathrm{d}y, \tag{A7.184}$$

yields an equivalent $Q_1(x) = Q_{10}(x) + Q_{12}'(x)$ useful for calculation purposes (see Section 6.4.2).


[Figure legend: operating, repair; renewal points (for $Z_0$, $Z_1$ and $Z_{2'}$, respectively)]

Figure A7.12 a) Possible time schedule for a k-out-of-n warm redundancy with $n - k = 2$, constant failure rates ($\lambda$ & $\lambda_r$), arbitrary repair rate (density $g(x)$), only one repair crew, and no further failure at system down (repair times greatly exaggerated, operating and reserve elements not separately shown in the operating phases at system level); b) State transition diagram for the embedded semi-Markov process with regeneration states $Z_0$, $Z_1$, and $Z_{2'}$

Replacing in Eqs. (A7.183) and (A7.184) $\lambda$ with $k\lambda$ leads to a k-out-of-n warm redundancy with $n - k = 1$, constant failure rates ($\lambda$, $\lambda_r$), arbitrary repair times with density $g(x)$, only one repair crew, and no further failure at system down.

As a second example, Fig. A7.12 gives a possible time schedule for a k-out-of-n warm redundancy with $n - k = 2$, constant failure rates ($\lambda$, $\lambda_r$), arbitrary repair rate (density $g(x)$), only one repair crew, and no further failure at system down. Given is also the state transition diagram of the involved semi-regenerative process. States $Z_0$, $Z_1$, and $Z_{2'}$ are regeneration states, $Z_2$ and $Z_3$ are not regeneration states. The corresponding transition probabilities of the embedded semi-Markov process are

$Q_{121}(x)$, $Q_{1232'}(x)$, and $Q_{2'32'}(x)$ are used to calculate the point availability. They account for the transitions throughout the nonregenerative states $Z_2$ and $Z_3$.

Similarly as for $Q_{12}'(x)$ in Example A7.12, the quantities

are not semi-Markov transition probabilities, however they are useful for calculation purposes (to simplify, they are not shown in Fig. A7.12b). Results for $g(x) = \mu e^{-\mu x}$, i.e. for constant repair rate $\mu$, are given in Table 6.8 ($n - k = 2$).

In the following, some general considerations on semi-regenerative processes are given. A pure jump process $\xi(t)$, $t \ge 0$, with state space $Z_0, \ldots, Z_m$ is semi-regenerative, with regeneration states $Z_0, \ldots, Z_k$, $k < m$, if the following holds: Let $\zeta_0, \zeta_1, \ldots$ be the sequence of successively occurring regeneration states and $\varphi_0, \varphi_1, \ldots$ the times between their successive occurrences ($\varphi_n > 0$), then Eq. (A7.158) must be fulfilled for $n = 0, 1, 2, \ldots$, arbitrary $i, j, i_0, \ldots, i_{n-1} \in \{0, \ldots, k\}$, and arbitrary positive values $x_0, \ldots, x_{n-1}$ (where $\xi_n$, $\eta_n$ have been changed in $\zeta_n$, $\varphi_n$). In other words, the process taking the value $\zeta_n$ for $\varphi_0 + \ldots + \varphi_{n-1} \le t < \varphi_0 + \ldots + \varphi_n$ is a semi-Markov process with state space $Z_0, \ldots, Z_k$ and transition probabilities $Q_{ij}(x)$, $i, j \in \{0, \ldots, k\}$, embedded in the original process $\xi(t)$.

The piece $\xi(t)$, $\varphi_0 + \ldots + \varphi_{n-1} \le t < \varphi_0 + \ldots + \varphi_n$, $n \ge 1$, of the original process is a cycle (Appendix A7.4). Its distribution depends on $\zeta_n$, i.e. on the regeneration state involved, and its probabilistic structure can be very complicated. The epochs at which a fixed state $Z_i$, $0 \le i \le k$, occurs are regeneration points and constitute a renewal process (belonging to state $Z_i$) embedded in the original process $\xi(t)$.

Often the set of system up states $U$ is a subset of the regeneration states $Z_0, \ldots, Z_k$. The procedure used to develop Eqs. (A7.183) - (A7.186) can help to find the transition probabilities involved and from these the reliability function per Eq. (A7.172) and the point availability per Eq. (A7.171), see for instance Example A7.12 and Sections 6.4.2 and 6.4.3. A regenerative process with five states, of which only one is a regeneration state, was necessary to investigate the general 1-out-of-2 warm redundancy [6.5 (1975)] given in Section 6.4.3 (Fig. 6.10).

If the embedded semi-Markov process has an irreducible embedded Markov chain and continuous conditional distribution functions $F_{ij}(x) = \Pr\{\varphi_n \le x \mid (\zeta_{n+1} = Z_j \cap \zeta_n = Z_i)\}$, $i, j \in \{0, \ldots, k\}$, then

$$\lim_{t \to \infty} \Pr\{\xi(t) = Z_i\}, \quad i = 0, \ldots, k,$$

exists and does not depend on the initial distribution at $t = 0$, see e.g. [A6.6 (Vol. II)]. The proof is based on the key renewal theorem (Eq. (A7.29)). Denoting by $T_i$ the mean sojourn time in the state $Z_i$ and by $T_{ii}$ the mean of the time interval between two consecutive occurrences of $Z_i$ (cycle length), it holds for $i = 0, \ldots, k$ that

$$\lim_{t \to \infty} \Pr\{\xi(t) = Z_i\} = P_i = T_i / T_{ii}.$$


For the 1-out-of-2 warm redundancy of Example A7.12 it holds that $\mathcal{P}_0 = \mathcal{P}_1 = 1/2$ (embedded Markov chain), $T_0 = 1/(\lambda + \lambda_r)$, $T_1 = (1 - \tilde g(\lambda))/\lambda$, $T_{00} = 1/(\lambda + \lambda_r) + MTTR + [(1 - \tilde g(\lambda))/\tilde g(\lambda)]\,MTTR$, $T_{11} = \tilde g(\lambda)\,[1/(\lambda + \lambda_r) + MTTR] + (1 - \tilde g(\lambda))\,MTTR$, $P_0 = T_0 / T_{00}$, and $P_1 = T_1 / T_{11}$, with $\tilde g(\lambda)$ the Laplace transform of $g(x)$ at $\lambda$. The final result for $PA_S = P_0 + P_1$ is given by Eq. (6.109). For constant repair rate $\mu$, $\tilde g(\lambda) = \mu/(\lambda + \mu)$ and $T_0 = 1/(\lambda + \lambda_r)$, $T_1 = 1/(\lambda + \mu)$, $T_{00} = (\mu^2 + (\lambda + \lambda_r)(\lambda + \mu)) / \mu^2 (\lambda + \lambda_r)$, $T_{11} = (\mu^2 + (\lambda + \lambda_r)(\lambda + \mu)) / \mu (\lambda + \lambda_r)(\lambda + \mu)$, yielding $PA_S = P_0 + P_1 = T_0 / T_{00} + T_1 / T_{11}$ according to Eq. (6.88), or Eq. (A7.152) for $\lambda_r = \lambda$; this case can also be investigated considering 3 regeneration states as per Fig. 6.8a.
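The expressions for $T_0$, $T_1$, $T_{00}$, and $T_{11}$ can be evaluated for any repair time distribution once its Laplace transform at $\lambda$ and its mean are known. The Python lines below are a hedged sketch of such an evaluation; the function name `pa_warm_redundancy` and all numerical values are hypothetical. The constant repair rate case is compared with the birth and death result $(\mu^2 + \mu(\lambda+\lambda_r)) / (\mu^2 + (\lambda+\lambda_r)(\lambda+\mu))$, and a deterministic repair time $\tau$ (for which the transform is $e^{-\lambda\tau}$) is shown as a second case.

```python
from math import exp


def pa_warm_redundancy(lam, lam_r, g_tilde, mttr):
    """Steady-state point availability PA_S = T_0/T_00 + T_1/T_11.

    lam     : failure rate of the operating element
    lam_r   : failure rate of the element in reserve
    g_tilde : Laplace transform of the repair time density at s = lam
    mttr    : mean repair time
    """
    if not 0 < g_tilde < 1:
        raise ValueError("g_tilde must lie in (0, 1)")
    t0 = 1.0 / (lam + lam_r)
    t1 = (1.0 - g_tilde) / lam
    t00 = t0 + mttr + (1.0 - g_tilde) / g_tilde * mttr
    t11 = g_tilde * (t0 + mttr) + (1.0 - g_tilde) * mttr
    return t0 / t00 + t1 / t11


lam, lam_r, mu = 1e-3, 2e-4, 0.5   # hypothetical rates per hour

# Exponentially distributed repair time (constant repair rate mu).
pa_exponential = pa_warm_redundancy(lam, lam_r, mu / (lam + mu), 1.0 / mu)

# Deterministic repair time tau = 1/mu (same mean, different distribution).
tau = 1.0 / mu
pa_deterministic = pa_warm_redundancy(lam, lam_r, exp(-lam * tau), tau)

print(pa_exponential, pa_deterministic)
```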

A7.8 Nonregenerative Stochastic Processes

The assumption of arbitrarily (i.e. not exponentially) distributed failure-free and repair (restoration) times for the elements of a system already leads to nonregenerative stochastic processes for simple series or parallel structures. After some general considerations, nonregenerative processes used in reliability analysis are introduced.

A7.8.1 General Considerations

Solutions for nonregenerative stochastic processes are often problem-oriented. However, as a possible general method, transformation of the given stochastic process into a Markov or a semi-Markov process by a suitable state space extension can be used in some cases in one of the following ways:

1. Approximation of distribution functions: Approximating the involved distribution functions (for repair and/or failure-free times) by an Erlang distribution (Eq. (A6.102)) allows a transformation of the original process into a time-homogeneous Markov process through introduction of additional states.

2. Introduction of supplementary variables: Introducing for every element of a system as supplementary variables the failure-free time since the last repair and the repair time since the last failure, the original process can be transformed into a Markov process with state space consisting of discrete and continuous parameters. Investigations usually lead to partial differential equations which have to be solved with the corresponding boundary conditions.

The first method is best used when repair and/or failure rates are monotonically increasing from zero to a final value; its application is easy to understand (Fig. 6.6). The second method [A7.4 (1955)] is very general, but often time-consuming.


A further method is based on the general concept of point process. Considering the sequence of jump times $\tau_n^*$ and states $\xi_n$ entered at these points, an equivalent description of the process $\xi(t)$ is obtained by a marked point process $(\tau_n^*, \xi_n)$, $n = 0, 1, \ldots$. Analysis of the system's steady-state behavior follows using Korolyuk's theorem ($\Pr\{\text{jump into } Z_i \text{ during } (t, t+\delta t]\} = \lambda_i\,\delta t + o(\delta t)$, with $\lambda_i = \mathrm{E}[\text{number of jumps in } Z_i \text{ during the unit time interval}]$), see e.g. [A7.11, A7.12]. As an example, consider a repairable coherent system with $n$ totally independent elements (p. 61). Let $\zeta_1(t), \ldots, \zeta_n(t)$ and $\zeta(t)$ be the binary processes with states 0 (down) & 1 (up) describing elements and system, respectively. If the steady-state point availability of each element

$$\lim_{t \to \infty} PA_i(t) = \lim_{t \to \infty} \Pr\{\zeta_i(t) = 1\} = PA_i = \frac{MTTF_i}{MTTF_i + MTTR_i}, \quad i = 1, \ldots, n, \tag{A7.189}$$

exists, then the steady-state point availability of the system is given by Eq. (2.48) and can be expressed as $PA_S = MTTF_S / (MTTF_S + MTTR_S)$, see e.g. [6.4, A7.10].

Investigation of the time behavior of systems with arbitrary failure and/or repair rates can become time-consuming. In these cases, approximate expressions can help to get results (see Section 6.7 for some examples).

A7.8.2 Nonhomogeneous Poisson Processes (NHPP)

A nonhomogeneous Poisson process (NHPP) is a point process with independent Poisson distributed increments, i.e. a sequence of points (events) on the time axis, whose count function $\nu(t)$ has independent increments (in nonoverlapping intervals) and satisfies

$\nu(t)$ gives the number of events in $(0, t]$. In the following, $\nu(t)$ is assumed right continuous with unit jumps. $M(t)$ is the mean of $\nu(t)$, called mean value function,

$$M(t) = \mathrm{E}[\nu(t)], \quad t > 0, \; M(0) = 0, \tag{A7.191}$$

and it holds that (Example A6.20)

$$\mathrm{Var}[\nu(t)] = \mathrm{E}[\nu(t)] = M(t), \quad t > 0, \; M(0) = 0. \tag{A7.192}$$

$M(t)$ is a nondecreasing, continuous function with $M(0) = 0$, often assumed increasing, unbounded, and absolutely continuous. If

$$m(t) = \mathrm{d}M(t)/\mathrm{d}t \ge 0, \quad t > 0, \tag{A7.193}$$

exists, $m(t)$ is the intensity of the NHPP. Eqs. (A7.193) and (A7.191) yield

$$\Pr\{\nu(t + \delta t) - \nu(t) = 1\} = m(t)\,\delta t + o(\delta t), \quad t > 0, \; \delta t \downarrow 0, \tag{A7.194}$$

and no distinction is made between arrival rate and intensity. Equation (A7.194) gives the unconditional probability for one event (e.g. failure) in $(t, t + \delta t]$. $m(t)$ corresponds to the renewal density $h(t)$ (Eq. (A7.24)) but differs basically from the failure rate $\lambda(t)$, see remark on p. 356. Equation (A7.194) also shows that a NHPP is locally without aftereffect. This holds globally (Eq. (A7.195)) and characterizes the NHPP. However, memoryless (i.e. with independent and stationary increments) is only the homogeneous Poisson process (HPP), for which $M(t) = \lambda t$ holds.

Nonhomogeneous Poisson processes have been extensively investigated in the literature, see e.g. [6.3, A7.3, A7.12, A7.21, A7.25, A7.30, A8.1]. This appendix gives some important results useful for reliability analysis. These results hold for HPPs ($M(t) = \lambda t$) as well, and most of them are a direct consequence of the independent increments property. In particular, the number of events in a time interval $(a, b]$,

$$\Pr\{k \text{ events in } (a, b] \mid H_a\} = \Pr\{k \text{ events in } (a, b]\} = \frac{(M(b) - M(a))^k}{k!}\,e^{-(M(b) - M(a))}, \quad k = 0, 1, 2, \ldots, \; 0 \le a < b, \tag{A7.195}$$

and the time $\tau_R(t)$ from an arbitrary time point $t$ to the next event,

$$\Pr\{\tau_R(t) > x \mid H_t\} = \Pr\{\text{no event in } (t, t+x] \mid H_t\} = \Pr\{\text{no event in } (t, t+x]\} = e^{-(M(t+x) - M(t))}, \quad x > 0, \tag{A7.196}$$

are independent of the process development up to time $t$ (history $H_a$ or $H_t$). Thus, also the mean $\mathrm{E}[\tau_R(t)]$ is independent of the history and given by

Let $0 < \tau_1^* < \tau_2^* < \ldots$ be the occurrence times (arrival times) of the event considered (e.g. failures of a repairable system), measured from the origin $t = \tau_0^* = 0$ and taking values $0 < t_1^* < t_2^* < \ldots$ +). Furthermore, let $\eta_n = \tau_n^* - \tau_{n-1}^*$ be the $n$th interarrival time ($n \ge 1$). Considering $M(0) = 0$, $t \ge 0$, $\tau_0^* = t_0^* = 0$ and assuming $M(t)$ increasing, absolutely continuous, and unbounded ($\lim_{t \to \infty} M(t) = \infty$), the following holds:

1. The occurrence times (arrival times) $\tau_1^*, \tau_2^*, \ldots$ have joint density

$$f(t_1^*, t_2^*, \ldots, t_n^*) = \prod_{i=1}^{n} m(t_i^*)\,e^{-(M(t_i^*) - M(t_{i-1}^*))} = e^{-M(t_n^*)} \prod_{i=1}^{n} m(t_i^*), \quad t_0^* = 0 < t_1^* < \ldots < t_n^*, \tag{A7.198}$$

(follows from Eqs. (A7.194) & (A7.195)) and marginal distribution function

+) * is used to explicitly show that $\tau_1^*, \tau_2^*, \ldots$, or $t_1^*, t_2^*, \ldots$, are points on the time axis and not independent observations of a random variable $\tau$, e.g. as in Figs. 1.1, 7.12, 7.14.

with density $f_i(t_i^*) = m(t_i^*)\,M(t_i^*)^{i-1}\,e^{-M(t_i^*)} / (i-1)!$ & mean $\mathrm{E}[\tau_i^*] = \int_0^\infty x\,f_i(x)\,\mathrm{d}x$ (events $\{\tau_i^* \le t_i^*\}$ and {at least $i$ events have occurred in $(0, t_i^*]$} are equivalent).

2. The quantities

$$\psi_1^* = M(\tau_1^*) < \psi_2^* = M(\tau_2^*) < \ldots \tag{A7.200}$$

are the occurrence times in a HPP with intensity one ($M(t) = t$) (follows from $\nu_{\psi^*}(t) = \nu_{\tau^*}(M^{-1}(t))$ and thus $\mathrm{E}[\nu_{\psi^*}(t)] = \mathrm{E}[\nu_{\tau^*}(M^{-1}(t))] = M(M^{-1}(t)) = t$, see Eq. (A6.31)).

3. The conditional distribution functions of $\eta_{n+1}$ & $\tau_{n+1}^*$ given $\eta_1 = x_1, \ldots, \eta_n = x_n$ are

(follow from Eq. (A7.195) with $k = 0$ or from Eq. (A7.196)).

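4. For given (fixed) $t = T$ and $\nu(T) = n$ (time censoring), the joint density of the occurrence times $0 < \tau_1^* < \ldots < \tau_n^* < T$ under the condition $\nu(T) = n$ is given by

$$f(t_1^*, \ldots, t_n^* \mid n) = n! \prod_{i=1}^{n} \frac{m(t_i^*)}{M(T)}, \quad 0 < t_1^* < \ldots < t_n^* < T, \tag{A7.203}$$

(see Example A7.13) and that of $0 < \tau_1^* < \ldots < \tau_n^* < T$ and $\nu(T) = n$ is

$$f(t_1^*, \ldots, t_n^*, n) = e^{-M(T)} \prod_{i=1}^{n} m(t_i^*), \quad 0 < t_1^* < \ldots < t_n^* < T, \tag{A7.204}$$

(follows from Eqs. (A7.203) and (A7.190)). From Eq. (A7.203) one recognizes that for given (fixed) $t = T$ and $\nu(T) = n$, the occurrence times $0 < \tau_1^* < \ldots < \tau_n^* < T$ have the same distribution as if they were the order statistics of $n$ independent identically distributed random variables with density

$$m(t) / M(T), \quad 0 < t < T, \tag{A7.205}$$

& distribution function $M(t) / M(T)$ on $(0, T)$ (compare Eqs. (A7.210), (A7.211)).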

5. Furthermore, for given (fixed) $t = T$ and $\nu(T) = n$, the quantities

$$0 < M(\tau_1^*)/M(T) < \ldots < M(\tau_n^*)/M(T) < 1 \tag{A7.206}$$

have the same distribution as if they were the order statistics of $n$ independent identically distributed random variables with density one, i.e. uniformly distributed, on $(0, 1)$ (follows from Point 2 above (Eq. (A7.200)) and Eq. (A7.213)). For the case in which one takes $T = t_n^*$ (failure censoring), Eqs. (A7.203) - (A7.206) and (A7.210) - (A7.213) hold with $n - 1$ instead of $n$.

6. The quantity

$$(\nu(t) - M(t)) / \sqrt{M(t)} \tag{A7.207}$$

has for $t \to \infty$ a standard normal distribution (follows basically from Point 2 above and Eqs. (A7.34), (A7.191), (A7.192)).

7. The sum of $n$ independent NHPPs with mean value functions $M_i(t)$ and intensities $m_i(t)$ is a NHPP with mean value function and intensity

$$M(t) = \sum_{i=1}^{n} M_i(t) \quad \text{and} \quad m(t) = \sum_{i=1}^{n} m_i(t), \tag{A7.208}$$

respectively (follows from the independent increments property of NHPPs and Eq. (A7.190), see Eq. (7.27) for HPPs).
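Two of the properties above lend themselves to a short numerical illustration: the Poisson probability of Eq. (A7.195) and the time transformation of Point 2, which allows occurrence times of a NHPP to be generated from those of a HPP with intensity one. The Python lines below are a minimal sketch under stated assumptions, not a method given in the book: the power-law mean value function $M(t) = (\alpha t)^\beta$, its parameters, and all function names are hypothetical choices made for the example.

```python
import math
import random


def prob_k_events(mean_value, a, b, k):
    """Pr{k events in (a, b]} for a NHPP with mean value function mean_value."""
    expected = mean_value(b) - mean_value(a)
    return expected**k * math.exp(-expected) / math.factorial(k)


def simulate_occurrence_times(inverse_mean_value, mean_value_at_T, rng):
    """Occurrence times in (0, T], obtained by mapping the occurrence times
    psi_i of a unit-intensity HPP back through the inverse of M."""
    times = []
    psi = rng.expovariate(1.0)
    while psi <= mean_value_at_T:
        times.append(inverse_mean_value(psi))
        psi += rng.expovariate(1.0)
    return times


# Hypothetical power-law mean value function M(t) = (ALPHA * t) ** BETA.
ALPHA, BETA, T = 0.01, 1.5, 400.0


def M(t):
    return (ALPHA * t) ** BETA


def M_inv(y):
    return y ** (1.0 / BETA) / ALPHA


rng = random.Random(1)
runs = [simulate_occurrence_times(M_inv, M(T), rng) for _ in range(4000)]
mean_count = sum(len(r) for r in runs) / len(runs)

print(M(T), mean_count, prob_k_events(M, 0.0, T, 0))
```

The average number of simulated events per run should be close to $M(T)$, in agreement with Eq. (A7.191); with $\beta > 1$ the events become denser toward the end of the interval.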

From the above properties, the following conclusions can be drawn: (1) For $i = 1$, Eq. (A7.199) yields

$$\Pr\{\tau_1^* \le t\} = 1 - e^{-M(t)} = 1 - e^{-\int_0^t m(x)\,\mathrm{d}x}; \tag{A7.209}$$

Example A7.13 Show that for given (fixed) $T$ and $\nu(T) = n$, the occurrence times $0 < \tau_1^* < \ldots < \tau_n^* < T$ in a nonhomogeneous Poisson process with intensity $m(t)$ have the same joint density as the order statistics of $n$ independent identically distributed random variables with density $m(t)/M(T)$ on $(0, T)$.

Solution For a NHPP with intensity $m(t)$, the occurrence times $0 < \tau_1^* < \ldots < \tau_n^* < T$ given $T$ (fixed) and $\nu(T) = n$ have joint density (Eqs. (A7.194) & (A7.195) and considering $0 < t_1^* < \ldots < t_n^* < T$)

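$$f(t_1^*, \ldots, t_n^* \mid n) = \frac{f(t_1^*, \ldots, t_n^*, n)}{M(T)^n e^{-M(T)}/n!} = \frac{m(t_1^*) e^{-M(t_1^*)}\, m(t_2^*) e^{-(M(t_2^*) - M(t_1^*))} \cdots m(t_n^*) e^{-(M(t_n^*) - M(t_{n-1}^*))}\, e^{-(M(T) - M(t_n^*))}}{M(T)^n e^{-M(T)}/n!} = n! \prod_{i=1}^{n} \frac{m(t_i^*)}{M(T)}. \tag{A7.210}$$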

Considering that for a set of $n$ realizations of a given random variable there are $n!$ permutations giving the same order statistics, the joint density of the order statistics of $n$ independent identically distributed random variables with density $m(t)/M(T)$ on the interval $(0, T)$ is given by

$$f(t_1^*, \ldots, t_n^* \mid n) = n! \prod_{i=1}^{n} \frac{m(t_i^*)}{M(T)}, \qquad \text{on } (0, T),\ 0 < t_1^* < \ldots < t_n^* < T, \tag{A7.211}$$

yielding Eq. (A7.203).

Supplementary results for HPPs: For a HPP, Eq. (A7.205) yields

$$m(t)/M(T) = \lambda / \lambda T = 1/T \quad \text{and thus} \quad f(t_1^*, \ldots, t_n^* \mid n) = n!/T^n \ \text{ on } (0, T). \tag{A7.212}$$

Furthermore, when considering $\tau_i^*/T$, Eqs. (A7.205) and (A6.31) yield

$$T\, m(t\,T)/M(T) = T \lambda / \lambda T = 1 \quad \text{and} \quad f(t_1^*, \ldots, t_n^* \mid n) = n! \ \text{ on } (0, 1). \tag{A7.213}$$

Thus, for given (fixed) $T$ and $\nu(T) = n$, the arrival times $0 < \tau_1^* < \ldots < \tau_n^* < T$ of a HPP have the same distribution as if they were the order statistics of $n$ independent identically uniformly distributed random variables on $(0, T)$ (on $(0, 1)$ for $0 < \tau_1^*/T < \ldots < \tau_n^*/T < 1$).


thus, comparing Eqs. (A7.209) and (A6.26) it follows that the intensity of a NHPP is equal to the failure rate of the first occurrence time $\tau_1^*$ or interarrival time $\eta_1 = \tau_1^*$.

(2) Equation (A7.201) shows that the conditional density of the interarrival time $\eta_{n+1} = \tau_{n+1}^* - \tau_n^*$ given $\tau_n^* = t_n^*$ is independent of the process development up to the time $t_n^*$ and is equal to the conditional failure rate at time $t_n^* + x$ of the first occurrence time $\tau_1^*$ given $\tau_1^* > t_n^*$ (Eq. (A6.28)), for any $n \ge 1$; this leads to the concept of bad-as-old used in some considerations on repairable systems, see e.g. [6.3, A7.30].

(3) From Eq. (A7.202), the distribution of the occurrence time $\tau_{n+1}^*$ depends only on $\tau_n^*$; thus, $\tau_1^*, \tau_2^*, \ldots$ is a Markov sequence.

(4) From Eq. (A7.204) one can obtain Eq. (A7.198) by considering $\Pr\{\text{no event in } (t_n^*, T]\} = e^{-(M(T) - M(t_n^*))}$, and vice versa.

(5) Equations (A7.198) and (A7.199) show that for a NHPP, occurrence (arrival) times are not independent; the same holds for interarrival times, which are neither independent nor identically distributed.

Thus, the NHPP is not a regenerative process. On the other hand, the homogeneous Poisson process (HPP) is a renewal process, with independent interarrival times distributed according to the same exponential distribution (Eq. (A7.38)) and independent Gamma distributed occurrence times (Eq. (A7.39)). However, because of independent increments, the NHPP is without aftereffect (memoryless if HPP) and the sum of Poisson processes is a Poisson process, in both the homogeneous and the nonhomogeneous case (Eq. (7.27)). Convergence of a point process to a NHPP or to a HPP is discussed in Appendices A7.8.3 and A7.8.5.

Although appealing, the assumption of independent increments, mandatory for Poisson processes (HPP and NHPP), can limit the validity of models used in practical applications with arbitrary failure and/or repair rates (see e.g. Sections 7.6 and 7.7). However, the properties in Points 1-6 above (in particular Eqs. (A7.200) & (A7.206)) are useful for statistical tests on NHPPs, as well as for Monte Carlo simulations. In particular, results for exponential distributions or for HPPs can be used and the Kolmogorov-Smirnov test holds with $F_0(t) = M_0(t)/M_0(T)$ and $\hat F_n(t) = \hat\nu(t)/\hat\nu(T)$ (Sections 7.6 - 7.7). Equation (A7.205) is useful to generate realizations of a NHPP (generate $k$ for given $T$ and $M(T)$ (Eq. (A7.190)), then $k$ random variables with density $m(t)/M(T)$; the ordered values are the $k$ occurrence times of the NHPP on $(0, T)$).
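
The generation procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names are hypothetical, the power-law mean value function $M(t) = (\alpha t)^\beta$ and its parameter values are chosen only as an example, and the $k$ values with density $m(t)/M(T)$ are drawn by inverse transform, i.e. as $M^{-1}(u\,M(T))$ with $u$ uniform on $(0, 1)$.

```python
import random

def poisson_count(mean, rng):
    """Draw a Poisson(mean) count by counting unit-rate exponential arrivals in (0, mean]."""
    count = 0
    elapsed = rng.expovariate(1.0)
    while elapsed <= mean:
        count += 1
        elapsed += rng.expovariate(1.0)
    return count

def simulate_nhpp(mean_value, inverse_mean_value, horizon, rng):
    """Ordered occurrence times of a NHPP on (0, horizon).

    mean_value(t) is M(t); inverse_mean_value(y) is M^{-1}(y).
    """
    total = mean_value(horizon)            # M(T)
    k = poisson_count(total, rng)          # number of events in (0, T]
    # Each unordered time has distribution function M(t)/M(T): inverse transform.
    return sorted(inverse_mean_value(rng.random() * total) for _ in range(k))

# Assumed example: power-law mean value function M(t) = (ALPHA * t) ** BETA.
ALPHA, BETA = 0.01, 2.0

def power_law_mean(t):
    return (ALPHA * t) ** BETA

def power_law_inverse(y):
    return y ** (1.0 / BETA) / ALPHA

if __name__ == "__main__":
    times = simulate_nhpp(power_law_mean, power_law_inverse, 300.0, random.Random(1))
    print(len(times), [round(t, 1) for t in times])
```

With these assumed values $M(300) = 9$, so the number of generated events averages 9 over many runs, and the events cluster toward the end of the interval because the assumed intensity increases with $t$.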

A7.8.3 Superimposed Renewal Processes

Consider a repairable series system with $n$ totally independent elements (p. 52) and assume that repair times are negligible and that after each repair (renewal) the repaired element is as-good-as-new. Let $MTTF_i$ be the mean time to failure of element $E_i$ and $MTTF_S$ that of the system. The flow of system failures is given by the superposition of $n$ independent renewal processes, each of them related to an element of the system. If $\nu_S(t)$ is the count function at system level giving the number of system failures in $(0, t]$ and $\nu_i(t)$ that of element $E_i$, it holds that

$$\nu_S(t) = \sum_{i=1}^{n} \nu_i(t), \qquad t > 0,\ \nu_i(0) = 0,\ i = 1, 2, \ldots, n. \tag{A7.214}$$

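$\nu_i(t)$ is a random variable, distributed as per Eq. (A7.12). Thus, for the mean value function at system level $Z_S(t)$ it follows that (Eqs. (A6.68) and (A7.15))

$$Z_S(t) = E[\nu_S(t)] = \sum_{i=1}^{n} E[\nu_i(t)] = \sum_{i=1}^{n} H_i(t), \tag{A7.215}$$

yielding for the failure intensity at system level $z_S(t)$ (Eq. (A7.18))

$$z_S(t) = \frac{d Z_S(t)}{dt} = \sum_{i=1}^{n} h_i(t). \tag{A7.216}$$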

In Eqs. (A7.215) and (A7.216), $H_i(t)$ and $h_i(t)$ are the renewal function and renewal density of the renewal process related to element $E_i$. However, the point process yielding $\nu_S(t)$ is not a renewal process. Simple results hold only for homogeneous Poisson processes (HPP), whose sum is a HPP (Eq. (7.27)). The same holds for nonhomogeneous Poisson processes (NHPP), but a NHPP is not a renewal process.

For (stochastically) independent renewal processes, it can be shown that:

1. The sum of $n$ independent stationary renewal processes is a stationary renewal process with renewal density

(follows basically from Eq. (A7.36)).

2. For $n \to \infty$, the sum of $n$ independent renewal processes with very low occurrence (one occurrence of any type and $\ge 2$ occurrences of all types are unlikely), and for which $\lim_{n \to \infty} \sum_{i=1}^{n} \Pr\{\nu_i(t) - \nu_i(a) = 1\} = M(t) - M(a)$ holds for any fixed $t$ and $a < t$, converges to a NHPP with $E[\nu(t)] = M(t)$ for all $t > 0$ (Grigelionis [A7.14], see also [A7.12, A7.30]); furthermore, if all renewal densities $h_i(t)$ are bounded (at $t = 0$), the sum converges for $n \to \infty$ to a HPP [A7.14].

3. For $t \to \infty$ and $n \to \infty$, the sum of $n$ independent renewal processes with low occurrence (one occurrence of any type is unlikely) converges to a HPP with renewal density as per Eq. (A7.217) [A7.17], see also [A7.8, A7.12, A7.30].
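
The limiting behavior stated in Point 3 can be explored by simulation. The following sketch superimposes independent renewal processes and looks at the system interarrival times after a warm-up period; the function names, the Weibull element model (scale 100 h, shape 3), the number of elements, and the observation window are all assumptions made for illustration. For a HPP the interarrival times are exponential, whose coefficient of variation is 1, so a value near 1 is consistent with (but does not prove) the convergence.

```python
import math
import random

def renewal_times(draw_lifetime, horizon):
    """Failure times in (0, horizon] of one element renewed as-good-as-new at each failure."""
    times = []
    t = draw_lifetime()
    while t <= horizon:
        times.append(t)
        t += draw_lifetime()
    return times

def superimposed_times(n_elements, draw_lifetime, horizon):
    """System failure times: ordered union of n independent renewal processes."""
    merged = []
    for _ in range(n_elements):
        merged.extend(renewal_times(draw_lifetime, horizon))
    return sorted(merged)

def system_rate_and_cv(seed, n_elements=50, start=500.0, horizon=1500.0):
    """Observed system failure rate and coefficient of variation of the
    system interarrival times in (start, horizon]."""
    rng = random.Random(seed)
    # Assumed element model: Weibull, scale 100 h, shape 3 (MTTF about 89.3 h).
    draw = lambda: rng.weibullvariate(100.0, 3.0)
    events = [t for t in superimposed_times(n_elements, draw, horizon) if t > start]
    gaps = [b - a for a, b in zip(events, events[1:])]
    mean_gap = sum(gaps) / len(gaps)
    var_gap = sum((g - mean_gap) ** 2 for g in gaps) / (len(gaps) - 1)
    return len(events) / (horizon - start), math.sqrt(var_gap) / mean_gap

if __name__ == "__main__":
    rate, cv = system_rate_and_cv(seed=7)
    print(f"rate = {rate:.3f} per h, interarrival CV = {cv:.2f}")
```

The observed rate can be compared with the sum of the element rates $\sum_i 1/MTTF_i$, here $50/(100\,\Gamma(4/3)) \approx 0.56\ \mathrm{h}^{-1}$ under the assumed element model.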

A7.8.4 Cumulative Processes

Cumulative processes [A7.24, A7.4 (1962)], also known as compound processes [A7.3, A7.9 (Vol. 2), A7.21], are obtained when at the occurrence of each event in a point process, a random variable is generated and the stochastic process given by the sum of these random variables is considered. The involved point process is often limited to a renewal process (including the homogeneous Poisson process (HPP), yielding a cumulative or compound Poisson process) or to a nonhomogeneous Poisson process (NHPP). The generated random variable can have arbitrary distribution. Cumulative processes can be used to model some practical situations; for instance, the total maintenance cost for a repairable system over a given period of time or the cumulative damage caused by random shocks on a mechanical structure (assuming linear superposition of damage). If a subsidiary series of events is generated instead of a random variable and the two types of events are indistinguishable, the process is a branching process [A7.3, A7.21, A7.30], discussed e.g. in [6.3, A7.5] as a model to describe failure occurrence when secondary failures are triggered by primary failures.

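Let $\nu(t)$ be the count function giving the number of events (on the time axis) of the involved point process (Fig. A7.1), $\xi_i$ the generated random variable at the occurrence of the $i$th event, and $\xi_t$ the sum of $\xi_i$ over $(0, t]$

$$\xi_t = \sum_{i=1}^{\nu(t)} \xi_i, \qquad \xi_t = 0 \ \text{ for } \nu(t) = 0. \tag{A7.218}$$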

The stochastic process of value $\xi_t$ ($t > 0$) is a cumulative process. It is not difficult to recognize that for $\xi_i > 0$, $\xi_t$ is distributed as the total repair time (total down time) for failures occurred in a total operating time (total up time) $t$ of a repairable item, and is thus given by the work-mission availability (Eq. (6.32)).

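In the following some important results are given for the case in which the involved point process is a homogeneous Poisson process (HPP) with parameter $\lambda$ and the generated random variables are independent of $\nu(t)$ and have the same exponential distribution with parameter $\mu$. From Eq. (6.33), with $T_0 = t$, it follows that

$$\Pr\{\xi_t \le x\} = 1 - \sum_{n=1}^{\infty} \left[ \frac{(\lambda t)^n}{n!} e^{-\lambda t} \sum_{k=0}^{n-1} \frac{(\mu x)^k}{k!} e^{-\mu x} \right] = 1 - e^{-(\lambda t + \mu x)} \sum_{n=1}^{\infty} \left[ \frac{(\lambda t)^n}{n!} \sum_{k=0}^{n-1} \frac{(\mu x)^k}{k!} \right], \qquad t > 0 \text{ given},\ x > 0,\ \Pr\{\xi_t = 0\} = e^{-\lambda t}. \tag{A7.219}$$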

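Mean and variance of $\xi_t$ follow as (Eqs. (A7.219), (A6.38), (A6.45), (A6.41))

$$E[\xi_t] = \lambda t / \mu \qquad \text{and} \qquad \mathrm{Var}[\xi_t] = 2 \lambda t / \mu^2. \tag{A7.220}$$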

Furthermore, for $t \to \infty$ the distribution of $\xi_t$ approaches a normal distribution with mean and variance as per Eq. (A7.220), see also Eq. (7.22). Moments of $\xi_t$ can also be obtained using the moment generating function [A7.3, A7.4 (1962)] or directly by considering Eq. (A7.218), yielding (Example A7.14)

$$E[\xi_t] = E[\nu(t)]\, E[\xi_i] \qquad \text{and} \qquad \mathrm{Var}[\xi_t] = E[\nu(t)]\, \mathrm{Var}[\xi_i] + \mathrm{Var}[\nu(t)]\, E^2[\xi_i]. \tag{A7.221}$$
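
Equation (A7.221) can be checked numerically for the HPP case with exponentially distributed $\xi_i$, where $E[\nu(t)] = \mathrm{Var}[\nu(t)] = \lambda t$, $E[\xi_i] = 1/\mu$ and $\mathrm{Var}[\xi_i] = 1/\mu^2$, so that the mean becomes $\lambda t/\mu$ and the variance $2\lambda t/\mu^2$. The Monte Carlo sketch below is only an illustration; the function names and the parameter values ($\lambda = 0.5$, $\mu = 2$, $t = 10$) are assumptions.

```python
import random

def cumulative_value(lam, mu, t, rng):
    """One realization of xi_t: sum of Exp(mu) variables over the HPP(lam) events in (0, t]."""
    total = 0.0
    clock = rng.expovariate(lam)          # time of the first event
    while clock <= t:
        total += rng.expovariate(mu)      # random variable generated at this event
        clock += rng.expovariate(lam)     # next interarrival time
    return total

def sample_moments(lam, mu, t, runs, seed=0):
    """Sample mean and (unbiased) sample variance of xi_t over independent runs."""
    rng = random.Random(seed)
    values = [cumulative_value(lam, mu, t, rng) for _ in range(runs)]
    mean = sum(values) / runs
    variance = sum((v - mean) ** 2 for v in values) / (runs - 1)
    return mean, variance

if __name__ == "__main__":
    lam, mu, t = 0.5, 2.0, 10.0
    mean, variance = sample_moments(lam, mu, t, runs=20000)
    print(f"simulated mean {mean:.3f} vs {lam * t / mu:.3f}")
    print(f"simulated variance {variance:.3f} vs {2 * lam * t / mu ** 2:.3f}")
```

With the assumed values both theoretical moments equal 2.5, which makes the comparison easy to read.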

Of interest in some practical applications can also be the distribution of the time $\tau_C$ at which the process $\xi_t$ ($t > 0$) crosses a given (fixed) barrier $C$. For the case given by Eq. (A7.219), i.e. in particular for $\xi_i > 0$, the events

$$\{\tau_C > t\} \qquad \text{and} \qquad \{\xi_t \le C\} \tag{A7.222}$$

are equivalent. From Eq. (A7.219) it follows then

Cumulative processes are regenerative only if the involved point process is regenerative, in particular thus for the HPP investigated above. However, because of possible generalizations (NHPP, arbitrary point processes), they have been considered in this Appendix devoted to nonregenerative stochastic processes.

A7.8.5 General Point Processes

A point process is an ordered sequence of points on the time axis, giving for example the failure occurrence of a repairable system. Poisson and renewal processes are simple examples of point processes. Assuming that simultaneous events cannot

Example A7.14 Prove Eq. (A7.221).

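Solution Considering $\xi_i > 0$, continuous with finite mean & variance ($i = 1, 2, \ldots$), and independent of $\nu(t)$, for given $\nu(t) = n$ Eq. (A7.218) yields for the mean and variance of $\xi_t$ (Appendix A6.8)

$$E[\xi_t \mid \nu(t) = n] = n\, E[\xi_i] \qquad \text{and} \qquad \mathrm{Var}[\xi_t \mid \nu(t) = n] = n\, \mathrm{Var}[\xi_i]. \tag{A7.224}$$

From Eq. (A7.224) it follows then

$$E[\xi_t] = \sum_{n=1}^{\infty} \Pr\{\nu(t) = n\}\, n\, E[\xi_i] = E[\nu(t)]\, E[\xi_i]. \tag{A7.225}$$

For $\mathrm{Var}[\xi_t]$, it holds that (Eq. (A6.45)) $\mathrm{Var}[\xi_t] = E[\xi_t^2] - E^2[\xi_t]$; from which, considering $\xi_t^2 = (\xi_1 + \ldots + \xi_{\nu(t)})^2$ and Eq. (A7.225) for $E[\xi_t]$ (as well as Eq. (A6.69) for row 2 and Eq. (A6.45) for row 3 below)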


occur (with probability one) and assigning to the point process a count function $\nu(t)$ giving the number of events occurred in $(0, t]$, investigation of point processes can be performed on the basis of the involved count function $\nu(t)$. However, arbitrary point processes can lead to analytical difficulties, and results are known only for particular situations (low occurrence rate, stationary, regular, etc.). In reliability applications, general point processes can appear for example when investigating failure occurrence of repairable systems by neglecting repair times. In the following only some basic properties of general point processes will be discussed, see e.g. [A7.10, A7.11, A7.12, A7.30] for greater details.

Let $\nu(t)$ be a count function giving the number of events occurred in $(0, t]$, assume $\nu(0) = 0$ and that simultaneous occurrences are not possible. The underlying point process is stationary if $\nu(t)$ has stationary increments (Eq. (A7.5)) and without aftereffect if $\nu(t)$ has independent increments (Eq. (A7.2)). The sum of independent stationary point processes is a stationary point process. The same holds for processes without aftereffect. However, only the homogeneous Poisson process (HPP) is stationary and without aftereffect (memoryless).

For a general point process, a mean value function

giving the mean (expectation) of the number of points (events) in $(0, t]$ can be defined. $Z(t)$ is a nondecreasing, continuous function with $Z(0) = 0$, often assumed increasing, unbounded and absolutely continuous. If

exists, $z(t)$ is the intensity of the point process. Equations (A7.228) & (A7.227) yield

and no distinction is made between arrival rate and intensity. Equation (A7.229) gives the unconditional probability for one event (failure) in $(t, t + \delta t]$. $z(t)$ corresponds thus to $m(t)$ (Eq. (A7.193)) and $h(t)$ (Eq. (A7.24)), but differs basically from the failure rate $\lambda(t)$ (Eq. (A6.25)) which gives the conditional probability of failure in $(t, t + \delta t]$ given that the item was new at $t = 0$ and no failure has occurred in $(0, t]$. This distinction is important also for the case of a homogeneous Poisson process (Appendix A7.2.5), for which $\lambda(x) = \lambda$ holds for all interarrival times (with $x$ starting by 0 at each renewal point) and $h(t) = \lambda$ holds for the whole process. Misuses are known, in particular when dealing with reliability data analysis (see e.g. [6.3, A7.30] and comments on pp. 356 & 358, Appendix A7.8.2, and Sections 1.2.3, 7.6, 7.7). Thus, as a first rule to avoid confusion,

for repairable items, it is mandatory to use for interarrival times the variable $x$ starting by 0 at each failure (event), instead of $t$.


Some limit theorems on point processes are known, in particular on the convergence to a HPP, see e.g. [A7.10, A7.11, A7.12].

In reliability applications, $z(t)$ is called failure intensity [A1.4], ROCOF (rate of occurrence of failures) in [6.3]. $z(t)$ applies in particular to repairable systems when repair (restoration) times are neglected. In this case, $\nu_S(t)$ is the count function giving the number of system failures occurred in $(0, t]$, with $\nu_S(0) = 0$, and

is the system failure intensity.

A8 Basic Mathematical Statistics

Mathematical statistics deals basically with situations which can be described as follows: Given a population of statistically (stochastically) identical and independent elements with unknown statistical properties, measurements regarding these properties are made on a (random) sample of this population and on the basis of the collected data, conclusions are made for the remaining elements of the population. Examples are the estimation of an unknown probability (e.g. defective probability), the parameter estimation for the distribution function of an item's failure-free time $\tau$, or a decision whether the mean of $\tau$ is greater than a given value. Mathematical statistics thus goes from observations (realizations) of a given (random) event in a series of independent trials to search for a suitable probabilistic model for the event considered (inductive approach). Methods used are based on probability theory and results obtained can only be formulated in a probabilistic language. Minimization of the risk for a false conclusion is an important objective in mathematical statistics. This Appendix introduces the basic concepts of mathematical statistics necessary for the quality and reliability tests given in Chapter 7. It is a compendium of mathematical statistics, consistent from a mathematical point of view but still with reliability engineering applications in mind. Emphasis is on empirical methods, parameter estimation, and testing of hypotheses. To simplify the notation, the terms random and statistical will be omitted (in general) and the term mean is used as a synonym for expected value. Estimated values are marked with ^. Selected examples illustrate practical aspects.

A8.1 Empirical Methods

Empirical methods allow a quick and easy evaluation / estimation of the distribution function and of the mean, variance, and other moments characterizing a random variable. These estimates are based on the empirical distribution function and have a great intuitive appeal. An advantage of the empirical distribution function, when plotted on an appropriate probability chart (probability plot paper), is to give a simple visual rough check as to whether the assumed model seems correct.


A8.1.1 Empirical Distribution Function

A sample of size $n$ of a random variable $\tau$ with the distribution function $F(t)$ is a random vector $\vec\tau = (\tau_1, \ldots, \tau_n)$ whose components $\tau_i$ are assumed independent and identically distributed random variables with $F(t) = \Pr\{\tau_i \le t\}$, $i = 1, \ldots, n$.

For instance, $\tau_1, \ldots, \tau_n$ are the failure-free times (failure-free operating times) of $n$ items randomly selected from a lot of statistically identical items with a distribution function $F(t)$ for the failure-free time $\tau$. The observed failure-free times, i.e. the realization of the random vector $\vec\tau = (\tau_1, \ldots, \tau_n)$, is a set $t_1, \ldots, t_n$ of statistically independent real values ($> 0$ in the case of failure-free times). Distinction between random variables $\tau_1, \ldots, \tau_n$ and their observations $t_1, \ldots, t_n$ is important from a mathematical point of view. +)

When the sample elements (observations) are ordered by increasing magnitude, an order sample $t_{(1)}, \ldots, t_{(n)}$ is obtained. In life tests, observations $t_1, \ldots, t_n$ constitute often themselves an order sample. An advantage of an order sample of $n$ observations on independent, identically distributed random variables with density $f(t)$ is the simple form of the joint density $f(t_{(1)}, \ldots, t_{(n)}) = n! \prod_{i=1}^{n} f(t_{(i)})$.

With the purpose of saving test duration and cost, life tests can be terminated (stopped) at the occurrence of the $k$th ordered observation ($k$th failure) or at a given (fixed) time $T_{test}$. If the test is stopped at the $k$th failure, a type II censoring occurs (from the left if the time origin of all observations is not known). A type I censoring occurs if the test is stopped at $T_{test}$. A third possibility is to stop the test at a given (fixed) number $k$ of observations (failures) or at $T_{test}$, whichever occurs first. The corresponding test plans are termed $(n, \bar r, k)$, $(n, \bar r, T_{test})$, and $(n, \bar r, (k, T_{test}))$, respectively, where $\bar r$ stands for "without replacement". In many applications, failed items can be replaced (for instance in the case of a repairable item or system); in these cases $\bar r$ is changed to $r$ in the test plans.

For a set of ordered observations $t_{(1)}, \ldots, t_{(n)}$, the right continuous function

$$\hat F_n(t) = \begin{cases} 0 & \text{for } t < t_{(1)} \\ i/n & \text{for } t_{(i)} \le t < t_{(i+1)} \\ 1 & \text{for } t \ge t_{(n)} \end{cases} \tag{A8.1}$$

is the empirical distribution function (EDF) of the random variable $\tau$, see Fig. A8.1 for a graphical representation. $\hat F_n(t)$ expresses the relative frequency of the event $\{\tau \le t\}$ in $n$ independent trial repetitions, and provides a well defined estimate of the

+) The investigation of statistical methods and the discussion of their properties can only be based on the (random) sample $\tau_1, \ldots, \tau_n$. However, in applying the methods for a numerical evaluation (statistical decision), the observations $t_1, \ldots, t_n$ have to be used. For this reason, the same equation (or procedure) can be applied to $\tau_i$ or $t_i$ according to the situation.


Figure A8.1 Example of an empirical distribution function ($t_1, \ldots, t_n \equiv t_{(1)}, \ldots, t_{(n)}$ is assumed here)

distribution function $F(t) = \Pr\{\tau \le t\}$. The symbol ^ is hereafter used to denote an estimate of an unknown quantity. As stated in the footnote on p. 504, when investigating the properties of the empirical distribution function $\hat F_n(t)$ it is necessary in Eq. (A8.1) to replace the observations $t_{(1)}, \ldots, t_{(n)}$ by the sample elements $\tau_{(1)}, \ldots, \tau_{(n)}$.
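
As a minimal sketch of Eq. (A8.1), the empirical distribution function can be evaluated as the number of observations $\le t$ divided by $n$. The function name and the five observations used in the usage example are hypothetical.

```python
from bisect import bisect_right

def empirical_distribution(observations):
    """Return the right-continuous step function F_hat(t) of Eq. (A8.1)."""
    ordered = sorted(observations)        # order sample t_(1), ..., t_(n)
    n = len(ordered)

    def f_hat(t):
        # bisect_right counts the ordered observations that are <= t.
        return bisect_right(ordered, t) / n

    return f_hat

if __name__ == "__main__":
    f_hat = empirical_distribution([120.0, 45.0, 300.0, 80.0, 210.0])  # hypothetical data, in h
    for t in (44.0, 45.0, 100.0, 300.0):
        print(t, f_hat(t))
```

Using `bisect_right` rather than `bisect_left` is what makes the step function right continuous: at an observation $t_{(i)}$ the value already jumps to $i/n$.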

For given $F(t)$ and any fixed value of $t$, the number of observations $\le t$, i.e. $n \hat F_n(t)$, is binomially distributed (Eq. (A6.120)) with parameter $p = F(t)$, mean

$$E[n \hat F_n(t)] = n F(t), \tag{A8.2}$$

and variance

$$\mathrm{Var}[n \hat F_n(t)] = n F(t) (1 - F(t)). \tag{A8.3}$$

Moreover, application of the strong law of large numbers (Eq. (A6.146)) shows that for any given (fixed) value of $t$, $\hat F_n(t)$ converges to $F(t)$ with probability one for $n \to \infty$. This convergence is uniform in $t$ and holds for the whole distribution function $F(t)$. Proof of this important result is given in the Glivenko-Cantelli theorem [A8.4, A8.14, A8.16], which states that the largest absolute deviation between $\hat F_n(t)$ and $F(t)$ over all $t$, i.e.

$$D_n = \sup_{-\infty < t < \infty} |\hat F_n(t) - F(t)|, \tag{A8.4}$$

converges with probability one toward 0

$$\Pr\{\lim_{n \to \infty} D_n = 0\} = 1.$$


In life tests, observations $t_1, \ldots, t_n$ constitute often themselves an order sample. This is useful for statistical evaluation of data. However, if the test is stopped at the occurrence of the $k$th failure or at $T_{test}$ and $k$ or $T_{test}$ are small, the homogeneity of the sample can be questionable and the shape of $F(t)$ could change for $t > t_k$ or $t > T_{test}$ (e.g. because of wearout, see the remark on p. 320).

A8.1.2 Empirical Moments and Quantiles

The moments of a random variable $\tau$ are completely determined by the distribution function $F(t) = \Pr\{\tau \le t\}$. The empirical distribution $\hat F_n(t)$ introduced in Appendix A8.1.1 can be used to estimate the unknown moments of $\tau$.

The values $t_{(1)}, \ldots, t_{(n)}$ having been fixed, $\hat F_n(t)$ can be regarded as the distribution function of a discrete random variable with probability $p_k = 1/n$ at the points $t_{(k)}$, $k = 1, \ldots, n$. Using Eq. (A6.35), the corresponding mean is the empirical mean (empirical expectation) of $\tau$ and is given by

$$\hat E[\tau] = \frac{1}{n} \sum_{i=1}^{n} t_i. \tag{A8.6}$$

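Taking into account the footnote on p. 504, $\hat E[\tau]$ is a random variable with mean

$$E[\hat E[\tau]] = E\Big[\frac{1}{n} \sum_{i=1}^{n} \tau_i\Big] = E[\tau] \tag{A8.7}$$

and variance

$$\mathrm{Var}[\hat E[\tau]] = \mathrm{Var}\Big[\frac{1}{n} \sum_{i=1}^{n} \tau_i\Big] = \frac{\mathrm{Var}[\tau]}{n}.$$

Equation (A8.7) shows that $\hat E[\tau]$ is an unbiased estimate of $E[\tau]$, see Eq. (A8.18). Furthermore, from the strong law of large numbers (Eq. (A6.147)) it follows that for $n \to \infty$, $\hat E[\tau]$ converges with probability one toward $E[\tau]$

$$\Pr\Big\{\lim_{n \to \infty} \Big(\frac{1}{n} \sum_{i=1}^{n} \tau_i\Big) = E[\tau]\Big\} = 1.$$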

The exact distribution function of $\hat E[\tau]$ is known in a closed simple form only for some particular cases (normal, exponential, or Gamma distribution). However, the central limit theorem (Eq. (A6.148)) shows that for large values of $n$ the distribution of $\hat E[\tau]$ can always be approximated by a normal distribution with mean $E[\tau]$ and variance $\mathrm{Var}[\tau]/n$.

Based on $\hat F_n(t)$, Eqs. (A6.43) and (A8.6) provide an estimate of the variance as

$$\frac{1}{n} \sum_{i=1}^{n} (t_i - \hat E[\tau])^2.$$

The expectation of this estimate yields $\mathrm{Var}[\tau]\,(n-1)/n$. For this reason, the empirical variance of $\tau$ is usually defined as

$$\widehat{\mathrm{Var}}[\tau] = \frac{1}{n-1} \sum_{i=1}^{n} (t_i - \hat E[\tau])^2,$$

for which it follows that

$$E[\widehat{\mathrm{Var}}[\tau]] = \mathrm{Var}[\tau].$$

The higher-order moments (Eqs. (A6.41) and (A6.50)) can be estimated with

The empirical quantile $\hat t_q$ is defined as the $q$ quantile (Appendix A6.6.3) of the empirical distribution function $\hat F_n(t)$

$$\hat t_q = \inf\{t : \hat F_n(t) \ge q\}. \tag{A8.13}$$
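
The empirical mean, the empirical variance (with divisor $n - 1$), and the empirical quantile of Eq. (A8.13) can be sketched as follows. The function names and the data are hypothetical; the quantile uses the fact that $\hat F_n(t_{(i)}) = i/n$, so that the smallest $t$ with $\hat F_n(t) \ge q$ is the ordered observation with index $\lceil n q \rceil$. Floating-point rounding of $n q$ can shift this index for some values of $q$, which the sketch does not guard against.

```python
import math

def empirical_mean(observations):
    """Empirical mean: arithmetic mean of the observations."""
    return sum(observations) / len(observations)

def empirical_variance(observations):
    """Empirical variance with divisor n - 1 (unbiased for Var[tau])."""
    mean = empirical_mean(observations)
    return sum((t - mean) ** 2 for t in observations) / (len(observations) - 1)

def empirical_quantile(observations, q):
    """inf{t : F_hat(t) >= q} for 0 < q <= 1, i.e. the ceil(n q)-th ordered observation."""
    ordered = sorted(observations)
    index = math.ceil(q * len(ordered)) - 1
    return ordered[max(index, 0)]

if __name__ == "__main__":
    data = [120.0, 45.0, 300.0, 80.0, 210.0]   # hypothetical failure-free times, in h
    print(empirical_mean(data), empirical_variance(data), empirical_quantile(data, 0.5))
```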

A8.1.3 Further Applications of the Empirical Distribution Function

Comparison of the empirical distribution function $\hat F_n(t)$ with a given distribution function $F(t)$ is the basis for several non-parametric statistical methods. These include goodness-of-fit tests, confidence bands for distribution functions, and graphical methods using probability charts (probability plot papers).

A quantity often used in this context is the largest absolute deviation $D_n$ between $\hat F_n(t)$ and $F(t)$, defined by Eq. (A8.4). If the distribution function $F(t)$ of the random variable $\tau$ is continuous, then the random variable $F(\tau)$ is uniformly distributed in $(0, 1)$. It follows that $D_n$ has a distribution independent of $F(t)$. A.N. Kolmogorov showed [A8.20] that for $F(t)$ continuous and $x > 0$,

$$\lim_{n \to \infty} \Pr\{\sqrt{n}\, D_n \le x \mid F(t)\} = 1 + 2 \sum_{k=1}^{\infty} (-1)^k e^{-2 k^2 x^2}.$$

The series converges rapidly, so that for $x > 1/\sqrt{n}$,

$$\lim_{n \to \infty} \Pr\{\sqrt{n}\, D_n \le x \mid F(t)\} \approx 1 - 2 e^{-2 x^2}.$$


The distribution function of $D_n$ has been tabulated for small values of $n$ [A8.26], see Table A9.5 and Table A8.1. From the above it follows that:

For a given continuous distribution function $F(t)$, the band $F(t) \pm y_{1-\alpha}$ overlaps the empirical distribution function $\hat F_n(t)$ with probability $1 - \alpha_n$, where $\alpha_n \to \alpha$ as $n \to \infty$, with $y_{1-\alpha}$ defined by

$$\Pr\{D_n \le y_{1-\alpha} \mid F(t)\} = 1 - \alpha \tag{A8.15}$$

and given in Table A9.5 or Table A8.1.

From Table A8.1 one recognizes that the convergence $\alpha_n \to \alpha$ is good (for practical purposes) for $n > 50$. If $F(t)$ is not continuous, it can be shown that with $y_{1-\alpha}$ from Eq. (A8.15), the band $F(t) \pm y_{1-\alpha}$ overlaps $\hat F_n(t)$ with a probability $1 - \alpha'_n$, where $\alpha'_n \to \alpha' \le \alpha$ as $n \to \infty$.

The role of $F(t)$ and $\hat F_n(t)$ can be reversed, yielding:

The random band $\hat F_n(t) \pm y_{1-\alpha}$ overlaps the true (unknown) distribution function $F(t)$ with probability $1 - \alpha_n$, where $\alpha_n \to \alpha$ as $n \to \infty$.

This last consideration is an aspect of mathematical statistics, while the former one (in relation to Eq. (A8.15)) was a problem of probability theory. One has thus the possibility to estimate an unknown continuous distribution function $F(t)$ on the basis of the empirical distribution function $\hat F_n(t)$, see e.g. Figs. 7.12 and 7.14.

Example A8.1 How large is the confidence band around $\hat F_n(t)$ for $n = 30$ and for $n = 100$ if $\alpha = 0.2$?

Solution From Table A8.1, $y_{0.8} \approx 0.19$ for $n = 30$ and $y_{0.8} \approx 0.107$ for $n = 100$. This leads to the band $\hat F_n(t) \pm 0.19$ for $n = 30$ and $\hat F_n(t) \pm 0.107$ for $n = 100$.
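
The values of Example A8.1 can be compared with the large-sample approximation obtained by keeping only the first term of the Kolmogorov series, i.e. by solving $1 - 2 e^{-2 n y^2} = 1 - \alpha$ for $y$. The sketch below is an approximation under that assumption and does not replace the tabulated values (Table A8.1 is not reproduced here); the function names are hypothetical.

```python
import math

def kolmogorov_limit_cdf(x, terms=100):
    """Limiting distribution of sqrt(n) * D_n: 1 + 2 * sum_{k>=1} (-1)^k exp(-2 k^2 x^2)."""
    if x <= 0.0:
        return 0.0
    series = sum((-1) ** k * math.exp(-2.0 * k * k * x * x) for k in range(1, terms + 1))
    return 1.0 + 2.0 * series

def asymptotic_band_halfwidth(n, alpha):
    """Approximate y_{1-alpha} from the one-term relation 1 - 2 exp(-2 n y^2) = 1 - alpha."""
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n))

if __name__ == "__main__":
    for n in (30, 100):
        y = asymptotic_band_halfwidth(n, 0.2)
        print(n, round(y, 3), round(kolmogorov_limit_cdf(math.sqrt(n) * y), 4))
```

For $n = 100$ the approximation reproduces the value 0.107 used in the example; for $n = 30$ it gives a slightly wider band than the tabulated 0.19, in line with the remark that the convergence is good for $n > 50$.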


To simplify investigations, it is often useful to draw $\hat F_n(t)$ on a probability chart (probability plot paper). The method is as follows:

The empirical distribution function $\hat F_n(t)$ is drawn in a system of coordinates in which a postulated type of continuous distribution function is represented by a straight line; if the underlying distribution $F(t)$ belongs to this type of distribution function, then for a sufficiently large value of $n$ the points $(t_{(i)}, \hat F_n(t_{(i)}))$ will approximate to a straight line (a systematic deviation from a straight line, particularly in the domain $0.1 < \hat F_n(t) < 0.9$, leads to rejection of the type of distribution function assumed).

This can also be used as a simple rough visual check as to whether an assumed model ($F(t)$) seems correct. In many cases, estimates for unknown parameters of the underlying distribution function $F(t)$ can be obtained from the estimated straight line for $\hat F_n(t)$. Probability charts for the Weibull (including exponential), lognormal and normal distribution functions are given in Appendix A9.8, some applications are in Section 7.5. The following is a derivation of the Weibull probability chart. The function

$$F(t) = 1 - e^{-(\lambda t)^\beta}$$

can be transformed to $\log_{10}(1/(1 - F(t))) = (\lambda t)^\beta \log_{10}(e)$ and finally to

$$\log_{10} \log_{10}\Big(\frac{1}{1 - F(t)}\Big) = \beta \log_{10}(t) + \beta \log_{10}(\lambda) + \log_{10} \log_{10}(e). \tag{A8.16}$$

In the system of coordinates $\log_{10}(t)$ and $\log_{10} \log_{10}(1/(1 - F(t)))$, the Weibull distribution function given by $F(t) = 1 - e^{-(\lambda t)^\beta}$ appears as a straight line. Fig. A8.2 shows this for $\beta = 1.5$ and $\lambda = 1/800\ \mathrm{h}^{-1}$. As illustrated by Fig. A8.2, the parameters $\beta$ and $\lambda$ can be obtained graphically:

$\beta$ is the slope of the straight line, it appears on the scale $\log_{10} \log_{10}(1/(1 - F(t)))$ if $t$ is changed by one decade (e.g. from $10^2$ to $10^3$ in Fig. A8.2);

for $\log_{10} \log_{10}(1/(1 - F(t))) = \log_{10} \log_{10}(e)$, i.e. on the dashed line in Fig. A8.2, one has $\log_{10}(\lambda t) = 0$ and thus $\lambda = 1/t$.
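
The graphical procedure can be mimicked numerically by fitting a straight line to the chart coordinates and reading $\beta$ from the slope and $\lambda$ from the intercept. The sketch below is one possible way to do this and makes two assumptions that are not part of the text: it uses the plotting position $(i - 0.5)/n$ instead of $\hat F_n(t_{(i)}) = i/n$ (so that the largest observation keeps a finite ordinate), and it fits the line by ordinary least squares. All function names are hypothetical.

```python
import math

LOG_LOG_E = math.log10(math.log10(math.e))   # ordinate at which lambda * t = 1

def chart_points(observations):
    """Chart coordinates (log10 t, log10 log10(1/(1 - F))) of the ordered observations.

    Assumption: plotting position F = (i - 0.5)/n instead of i/n.
    """
    ordered = sorted(observations)
    n = len(ordered)
    points = []
    for i, t in enumerate(ordered, start=1):
        f = (i - 0.5) / n
        points.append((math.log10(t), math.log10(math.log10(1.0 / (1.0 - f)))))
    return points

def fit_line(points):
    """Ordinary least-squares slope and intercept of y on x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def estimate_weibull(observations):
    """Return (beta, lam): beta is the slope, lam follows from the intercept."""
    slope, intercept = fit_line(chart_points(observations))
    lam = 10.0 ** ((intercept - LOG_LOG_E) / slope)
    return slope, lam
```

Applied to observations that lie exactly on the Weibull quantiles for $\beta = 1.5$ and $\lambda = 1/800\ \mathrm{h}^{-1}$, the fit returns these two values; with real data the points scatter around the line and the estimates are only as good as the fit.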

The Weibull probability chart also applies to the exponential distribution ($\beta = 1$). For a three parameter Weibull distribution ($F(t) = 1 - e^{-(\lambda (t - \psi))^\beta}$, $t \ge \psi$) one can operate with the time axis $t' = t - \psi$, giving a straight line as before, or consider the concave curve obtained when using $t$ (see Fig. A8.2 for an example). Conversely, from a concave curve describing a Weibull distribution (e.g. in the case of an empirical data analysis) it is possible to find $\psi$ using the relationship $\psi = (t_1 t_2 - t_m^2)/(t_1 + t_2 - 2 t_m)$ existing between two arbitrary points $t_1$, $t_2$ and $t_m$ obtained from the mean of $F(t_1)$ and $F(t_2)$ on the scale $\log_{10} \log_{10}(1/(1 - F(t)))$, see Example A6.14 for a derivation and Fig. A8.2 for an application with $t_1 = 400$ h and $t_2 = 1000$ h, yielding $t_m = 600$ h and $\psi = 200$ h.
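
The relationship for $\psi$ can be made plausible as follows (a short sketch only; the derivation in Example A6.14 is not reproduced in this section). On the chart, $t_m$ has the mean ordinate of $t_1$ and $t_2$, and the ordinate is linear in $\log_{10}(t - \psi)$, so $t_m - \psi$ is the geometric mean of $t_1 - \psi$ and $t_2 - \psi$:

```latex
\begin{aligned}
(t_m - \psi)^2 &= (t_1 - \psi)(t_2 - \psi)
\;\;\Longrightarrow\;\;
\psi\,(t_1 + t_2 - 2 t_m) = t_1 t_2 - t_m^2 , \\[4pt]
\psi &= \frac{t_1 t_2 - t_m^2}{t_1 + t_2 - 2 t_m}
      = \frac{400 \cdot 1000 - 600^2}{400 + 1000 - 2 \cdot 600}\ \mathrm{h}
      = \frac{40\,000}{200}\ \mathrm{h}
      = 200\ \mathrm{h}.
\end{aligned}
```

The numerical values are those read from Fig. A8.2 as quoted in the text.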


Figure A8.2 Weibull probability chart: The distribution function $F(t) = 1 - e^{-(\lambda t)^\beta}$ appears as a straight line (in the example $\lambda = 1/800\ \mathrm{h}^{-1}$ and $\beta = 1.5$); for a three parameter distribution $F(t) = 1 - e^{-(\lambda (t - \psi))^\beta}$, $t \ge \psi$, one can use $t' = t - \psi$ or operate with a concave curve and determine (as necessary) $\psi$, $\lambda$, and $\beta$ graphically (dashed curve for $\lambda = 1/800\ \mathrm{h}^{-1}$, $\beta = 1.5$, and $\psi = 200$ h as an example)

A8.2 Parameter Estimation


In many applications it can be assumed that the type of distribution function $F(t)$ of the underlying random variable $\tau$ is known. This means that $F(t) = F(t, \theta_1, \ldots, \theta_r)$ is known in its functional form, the real-valued parameters $\theta_1, \ldots, \theta_r$ having to be estimated. The unknown parameters of $F(t)$ must be estimated on the basis of the observations $t_1, \ldots, t_n$. A distinction is made between point and interval estimation.

A8.2.1 Point Estimation

Consider first the case where the given distribution function $F(t)$ only depends on a parameter $\theta$, assumed hereafter as an unknown constant +). A point estimate for $\theta$ is a function (statistic)

$$\hat\theta_n = u(t_1, \ldots, t_n) \tag{A8.17}$$

of the observations $t_1, \ldots, t_n$ of the random variable $\tau$ (not of the unknown parameter $\theta$ itself). The estimate $\hat\theta_n$ is

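unbiased, if

$$E[\hat\theta_n] = \theta, \tag{A8.18}$$

consistent, if $\hat\theta_n$ converges to $\theta$ in probability, i.e. if for any $\varepsilon > 0$

$$\lim_{n \to \infty} \Pr\{|\hat\theta_n - \theta| > \varepsilon\} = 0,$$

strongly consistent, if $\hat\theta_n$ converges to $\theta$ with probability one, i.e.

$$\Pr\{\lim_{n \to \infty} \hat\theta_n = \theta\} = 1,$$

efficient, if

$$E[(\hat\theta_n - \theta)^2] \tag{A8.21}$$

is minimum over all possible point estimates for $\theta$,

sufficient (sufficient statistic for $\theta$), if $\hat\theta_n$ delivers the complete information about $\theta$ (available in the observations $t_1, \ldots, t_n$), i.e. if the conditional distribution of $\vec\tau$ for given $\hat\theta_n$ does not depend on $\theta$.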

+) Bayesian estimation theory, based on the Bayes theorem (Eqs. (A6.18), (A6.58-A6.59)) and which considers $\theta$ as a random variable and assigns to it an a priori distribution function, will not be considered in this book (as a function of the random sample, $\hat\theta_n$ is a random variable, while $\theta$ is an unknown constant). However, Bayesian statistics can be useful if knowledge on the a priori distribution function is well founded; for these cases one may refer e.g. to [A8.23, A8.24].


For an unbiased estimate, Eq. (A8.21) becomes

$$E[(\hat\theta_n - \theta)^2] = E[(\hat\theta_n - E[\hat\theta_n])^2] = \mathrm{Var}[\hat\theta_n].$$

An unbiased estimate is thus efficient if $\mathrm{Var}[\hat\theta_n]$ is minimum over all possible point estimates for $\theta$ and consistent if $\mathrm{Var}[\hat\theta_n] \to 0$ for $n \to \infty$. This last statement is a consequence of Chebyshev's inequality (Eq. (A6.49)). Efficiency can be checked using the Cramér-Rao inequality and sufficiency using the factorization criterion of the likelihood function, see e.g. [A8.1, A8.23]. Other useful properties of estimates are asymptotic unbiasedness and asymptotic efficiency.

Several methods are known for estimating $\theta$. To these belong the methods of moments, quantiles, least squares, and maximum likelihood. The maximum likelihood method [A8.1, A8.15, A8.23] is commonly used in engineering applications. It provides point estimates which, under relatively general conditions, are consistent, asymptotically unbiased, asymptotically efficient, and asymptotically normal-distributed. It can be shown that if an efficient estimate exists, then the likelihood equation (Eqs. (A8.25) or (A8.26)) has this estimate as a unique solution. Furthermore, an estimate $\hat\theta_n$ is sufficient if and only if the likelihood function (Eqs. (A8.23) or (A8.24)) can be written in two factors, one depending on $t_1, \ldots, t_n$ only, the other on $\theta$ and $\hat\theta_n = u(t_1, \ldots, t_n)$, see Examples A8.2 to A8.4.

The maximum likelihood method was developed by R.A. Fisher [A8.15 (1921)] and is based on the following idea:

Maximize, with respect to the unknown parameter $\theta$, the probability (Pr) that in a sample of size $n$, the (statistically independent) values $t_1, \ldots, t_n$ will be observed (i.e. maximize the probability of observing that record); this by maximizing the likelihood function ($L \sim \Pr$), defined as

$$L(t_1, \ldots, t_n, \theta) = \prod_{i=1}^{n} p(t_i, \theta), \qquad \text{with } p(t_i, \theta) = \Pr\{\tau = t_i\}, \tag{A8.23}$$

in the discrete case, and as

$$L(t_1, \ldots, t_n, \theta) = \prod_{i=1}^{n} f(t_i, \theta), \qquad \text{with } f(t_i, \theta) \text{ as density function}, \tag{A8.24}$$

in the continuous case.

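Since the logarithmic function is monotonically increasing, the use of $\ln(L)$ instead of $L$ leads to the same result. If $L(t_1, \ldots, t_n, \theta)$ is differentiable and the maximum likelihood estimate $\hat\theta_n$ exists, then it will satisfy the equation

$$\frac{\partial L(t_1, \ldots, t_n, \theta)}{\partial \theta}\Big|_{\theta = \hat\theta_n} = 0 \tag{A8.25}$$

or

$$\frac{\partial \ln L(t_1, \ldots, t_n, \theta)}{\partial \theta}\Big|_{\theta = \hat\theta_n} = 0. \tag{A8.26}$$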


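The maximum likelihood method can be generalized to the case of a distribution function with a finite number of unknown parameters $\theta_1, \ldots, \theta_r$. Instead of Eq. (A8.26), the following system of $r$ algebraic equations must be solved

$$\frac{\partial \ln L(t_1, \ldots, t_n, \theta_1, \ldots, \theta_r)}{\partial \theta_i}\Big|_{\theta_i = \hat\theta_{i n}} = 0, \qquad i = 1, \ldots, r.$$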

The existence and uniqueness of a maximum likelihood estimate is satisfied in most practical applications.

To simplify the notation, in the following the index n will be omitted for the estimated parameters.

Example A8.2

Let $t_1, \ldots, t_n$ be statistically independent observations of an exponentially distributed failure-free time $\tau$. Give the maximum likelihood estimate for the unknown parameter $\lambda$ of the exponential distribution.

Solution

With $f(t, \lambda) = \lambda e^{-\lambda t}$, Eq. (A8.24) yields $L(t_1, \ldots, t_n, \lambda) = \lambda^n e^{-\lambda (t_1 + \ldots + t_n)}$, from which

$$\hat{\lambda} = \frac{n}{t_1 + \ldots + t_n}. \tag{A8.28}$$

This case corresponds to a sampling plan with n elements without replacement, terminated at the occurrence of the nth failure. $\hat{\lambda}$ depends only on the sum $t_1 + \ldots + t_n$, not on the individual values of $t_i$; $t_1 + \ldots + t_n$ is a sufficient statistic and $\hat{\lambda}$ is a sufficient estimate ($L = 1 \cdot \lambda^n e^{-n\lambda/\hat{\lambda}}$). However, $\hat{\lambda} = n/(t_1 + \ldots + t_n)$ is a biased estimate; unbiased is $\hat{\lambda} = (n-1)/(t_1 + \ldots + t_n)$, as well as $\hat{E}[\tau] = (t_1 + \ldots + t_n)/n$ given by Eq. (A8.6).
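The estimates of Example A8.2 can be sketched in a few lines of Python. The function name and the sample data are illustrative assumptions, not part of the original text; the only inputs are the n observed failure-free times of a test run until the nth failure.

```python
def exponential_rate_estimates(times):
    """Return (lam_ml, lam_unbiased, mean_time) for complete exponential data.

    times: observed failure-free times t_1, ..., t_n (hypothetical input name).
    """
    n = len(times)
    if n < 2:
        raise ValueError("need at least two observed failure-free times")
    total = sum(times)              # sufficient statistic t_1 + ... + t_n
    lam_ml = n / total              # maximum likelihood estimate, biased
    lam_unbiased = (n - 1) / total  # unbiased estimate of the failure rate
    mean_time = total / n           # estimate of E[tau]
    return lam_ml, lam_unbiased, mean_time


lam_ml, lam_unbiased, mean_time = exponential_rate_estimates([100.0, 200.0, 300.0])
```

Both rate estimates use only the sum of the observations, which illustrates why that sum is a sufficient statistic.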

Example A8.3

Assuming that an event A has occurred exactly k times in n Bernoulli trials, give the maximum likelihood estimate for the unknown probability p for event A to occur (binomial distribution).

Solution

Using Eq. (A6.120), the likelihood function (Eq. (A8.23)) becomes

$$L = p_k = \binom{n}{k} p^k (1-p)^{n-k} \quad \text{or} \quad \ln L = \ln\binom{n}{k} + k \ln p + (n-k)\ln(1-p).$$

This leads to

$$\hat{p} = k/n. \tag{A8.29}$$

$\hat{p}$ is unbiased. It depends only on k, i.e. on the number of event occurrences in n independent trials; k is a sufficient statistic and $\hat{p}$ is a sufficient estimate ($L = \binom{n}{k} \cdot [p^{\hat{p}} (1-p)^{(1-\hat{p})}]^n$).


Example A8.4

Let $k_1, \ldots, k_n$ be independent observations of a random variable $\zeta$ distributed according to the Poisson distribution defined by Eq. (A6.125). Give the maximum likelihood estimate for the unknown parameter m of the Poisson distribution.

Solution

The likelihood function becomes

$$L = \frac{m^{k_1 + \ldots + k_n}}{k_1! \cdots k_n!}\, e^{-nm} \quad \text{or} \quad \ln L = (k_1 + \ldots + k_n)\ln m - mn - \ln(k_1! \cdots k_n!)$$

and thus

$$\hat{m} = \frac{k_1 + \ldots + k_n}{n}. \tag{A8.30}$$

$\hat{m}$ is unbiased. It depends only on the sum $k_1 + \ldots + k_n$, not on the individual $k_i$; $k_1 + \ldots + k_n$ is a sufficient statistic and $\hat{m}$ is a sufficient estimate ($L = (1/(k_1! \cdots k_n!)) \cdot (m^{n\hat{m}} e^{-nm})$).

Example A8.5

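Let $t_1, \ldots, t_n$ be statistically independent observations of a Weibull distributed failure-free time $\tau$. Give the maximum likelihood estimate for the unknown parameters $\lambda$ and $\beta$.

Solution

With $f(t, \lambda, \beta) = \beta\lambda(\lambda t)^{\beta-1} e^{-(\lambda t)^{\beta}}$ it follows from Eq. (A8.24) that

$$L(t_1, \ldots, t_n, \lambda, \beta) = (\beta\lambda^{\beta})^n\, e^{-\lambda^{\beta}(t_1^{\beta} + \ldots + t_n^{\beta})} \prod_{i=1}^{n} t_i^{\beta-1},$$

yielding

$$\hat{\beta} = \left[\frac{\sum_{i=1}^{n} t_i^{\hat{\beta}} \ln t_i}{\sum_{i=1}^{n} t_i^{\hat{\beta}}} - \frac{1}{n}\sum_{i=1}^{n} \ln t_i\right]^{-1} \quad \text{and} \quad \hat{\lambda} = \left[\frac{n}{\sum_{i=1}^{n} t_i^{\hat{\beta}}}\right]^{1/\hat{\beta}}. \tag{A8.31}$$

The solution for $\hat{\beta}$ is unique and can be found using Newton's approximation method (the value obtained from the empirical distribution function can give a good initial value, see Fig. 7.12).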
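A minimal numerical sketch of Example A8.5 follows. It solves the standard Weibull likelihood equation for the shape parameter, 1/β = Σ tᵢ^β ln tᵢ / Σ tᵢ^β − (1/n) Σ ln tᵢ, and then sets λ^β = n / Σ tᵢ^β. Note the assumption: instead of Newton's method mentioned in the text, the sketch uses plain bisection, which needs no starting value because the left-hand side minus the right-hand side is monotonic in β. Function name and sample data are hypothetical.

```python
import math


def weibull_mle(times, tol=1e-12):
    """Return (lam, beta) maximizing the Weibull likelihood for complete data."""
    n = len(times)
    if n < 2 or min(times) <= 0 or max(times) == min(times):
        raise ValueError("need at least two distinct positive times")
    logs = [math.log(t) for t in times]
    mean_log = sum(logs) / n
    t_max = max(times)

    def score(beta):
        # Times are scaled by t_max to avoid overflow; the ratio is unchanged.
        w = [(t / t_max) ** beta for t in times]
        weighted = sum(wi * li for wi, li in zip(w, logs)) / sum(w)
        return weighted - 1.0 / beta - mean_log

    lo, hi = 1e-6, 1.0
    while score(hi) < 0:          # expand until the root is bracketed
        hi *= 2.0
    while hi - lo > tol * hi:     # bisection (swapped in for Newton's method)
        mid = 0.5 * (lo + hi)
        if score(mid) < 0:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    lam = (n / sum(t ** beta for t in times)) ** (1.0 / beta)
    return lam, beta


sample = [50.0, 120.0, 180.0, 260.0, 400.0]   # illustrative data only
lam_hat, beta_hat = weibull_mle(sample)
```

A Newton iteration would converge faster from a good initial value; bisection is used here only because it cannot overshoot into negative β.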

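Due to cost and time limitations, the situation often arises in reliability applications in which the items under test are run in parallel and the test is stopped before all items have failed. If there are n items, and at the end of the test k have failed (at the individual failure times (times to failure) $t_1 < t_2 < \ldots < t_k$) and n - k are still working, then the operating times $T_1, \ldots, T_{n-k}$ of the items still working at the end of the test should also be accounted for in the evaluation. Considering a Weibull distribution as in Example A8.5, and assuming that the operating times $T_1, \ldots, T_{n-k}$ have been observed in addition to the failure-free times $t_1, \ldots, t_k$, then

$$L(t_1, \ldots, t_k, \lambda, \beta) \sim (\beta\lambda^{\beta})^k\, e^{-\lambda^{\beta}(t_1^{\beta} + \ldots + t_k^{\beta})} \prod_{i=1}^{k} t_i^{\beta-1} \prod_{j=1}^{n-k} e^{-(\lambda T_j)^{\beta}},$$

yielding

$$\hat{\beta} = \left[\frac{\sum_{i=1}^{k} t_i^{\hat{\beta}} \ln t_i + \sum_{j=1}^{n-k} T_j^{\hat{\beta}} \ln T_j}{\sum_{i=1}^{k} t_i^{\hat{\beta}} + \sum_{j=1}^{n-k} T_j^{\hat{\beta}}} - \frac{1}{k}\sum_{i=1}^{k} \ln t_i\right]^{-1} \quad \text{and} \quad \hat{\lambda} = \left[\frac{k}{\sum_{i=1}^{k} t_i^{\hat{\beta}} + \sum_{j=1}^{n-k} T_j^{\hat{\beta}}}\right]^{1/\hat{\beta}}. \tag{A8.32}$$

The calculation method used for Eq. (A8.32) applies for any distribution function, yielding

$$L(t_1, \ldots, t_k, \theta) \sim \prod_{i} f(t_i, \theta) \cdot \prod_{j} \bigl(1 - F(T_j, \theta)\bigr), \tag{A8.33}$$

where i runs over all observed times to failure, j runs over all failure-free times (operating times without failure), and $\theta$ can be a vector. However, the following two cases must be distinguished: 1) $T_1 = \ldots = T_{n-k} = t_k$, i.e. the test is stopped at the (random) occurrence of the kth failure (Type II censoring), and 2) $T_1 = \ldots = T_{n-k} = T_{test}$ is the fixed (given) test duration (Type I censoring). The two situations are basically different and this has to be considered in data analysis, see e.g. the discussion below with Eqs. (A8.34) and (A8.35).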

For the exponential distribution ($\beta = 1$), Eq. (A8.31) reduces to Eq. (A8.28) and Eq. (A8.32) to

$$\hat{\lambda} = \frac{k}{\sum_{i=1}^{k} t_i + \sum_{j=1}^{n-k} T_j}. \tag{A8.34}$$

If the test is stopped at the occurrence of the kth failure, $T_1 = \ldots = T_{n-k} = t_k$ (in general) and the quantity $T_r = t_1 + \ldots + t_k + (n-k)\,t_k$ is the random cumulative operating time over all items during the test. This situation corresponds to a sampling plan with n elements without replacement (renewal), censored at the occurrence of the kth failure (Type II censoring). Because of the memoryless property of the Poisson process (Eqs. (7.26) and (7.27)), $T_r$ can be calculated as $T_r = n t_1 + (n-1)(t_2 - t_1) + \ldots + (n-k+1)(t_k - t_{k-1})$. It can be shown that the estimate $\hat{\lambda} = k/T_r$ is biased. An unbiased estimate is given by

$$\hat{\lambda} = \frac{k-1}{T_r}. \tag{A8.35}$$

If the test is stopped at the fixed time $T_{test}$, then $T_r = t_1 + \ldots + t_k + (n-k)\,T_{test}$. In this case, $T_{test}$ is given (fixed) but k as well as $t_1, \ldots, t_k$ are random. This situation corresponds to a sampling plan with n elements without replacement, censored at a fixed (given) test time $T_{test}$ (Type I censoring). Also for this case, $k/T_r$ is biased.
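The two censoring schemes differ only in how the operating time of the surviving items enters the cumulative operating time. The sketch below, with hypothetical function and argument names, computes the cumulative operating time and the estimate k divided by it for both cases; the correction (k − 1) in the numerator is returned only for the test stopped at the kth failure, since the text states an unbiased estimate for that case only.

```python
def censored_exponential_estimates(failure_times, n, test_time=None):
    """Cumulative operating time and failure-rate estimates, no replacement.

    failure_times: the k observed times to failure (k >= 1).
    n:             number of items put on test at time 0.
    test_time:     fixed test duration (Type I); None means the test stops
                   at the k-th failure (Type II).
    """
    t = sorted(failure_times)
    k = len(t)
    if k == 0 or k > n:
        raise ValueError("need 1 <= k <= n observed failures")
    stop = t[-1] if test_time is None else test_time
    if stop < t[-1]:
        raise ValueError("test_time must not precede the last failure")
    t_r = sum(t) + (n - k) * stop      # survivors each ran until 'stop'
    lam_ml = k / t_r                   # biased in both censoring schemes
    lam_unbiased = (k - 1) / t_r if test_time is None else None
    return t_r, lam_ml, lam_unbiased


type2 = censored_exponential_estimates([10.0, 30.0, 60.0], n=5)
type1 = censored_exponential_estimates([10.0, 30.0, 60.0], n=5, test_time=100.0)
```

For the Type II call, the cumulative time 220 also follows from the interval form 5·10 + 4·20 + 3·30, as expected from the memoryless property.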

The case with replacement is important for practical applications, also because it yields unbiased results; see Appendix A8.2.2.2 and Section 7.2.3.

A8.2.2 Interval Estimation

As shown in Appendix A8.2.1, a point estimation has the advantage of providing an estimate quickly. However, it does not give any indication as to the deviation of the estimate from the true parameter. More information can be obtained from an interval estimation. With an interval estimation, a (random) interval $[\hat{\theta}_l, \hat{\theta}_u]$ is sought such that it overlaps (covers) the true value of the unknown parameter $\theta$ with a given probability $\gamma$. $[\hat{\theta}_l, \hat{\theta}_u]$ is the confidence interval, $\hat{\theta}_l$ and $\hat{\theta}_u$ are the lower and upper confidence limits, and $\gamma$ is the confidence level. $\gamma$ has the following interpretation:

In an increasing number of independent samples of size n (used to obtain confidence intervals), the relative frequency of the cases in which the confidence intervals $[\hat{\theta}_l, \hat{\theta}_u]$ overlap (cover) the unknown parameter $\theta$ converges to the confidence level $\gamma = 1 - \beta_1 - \beta_2$ ($0 < \beta_1 < 1 - \beta_2 < 1$).

$\beta_1$ and $\beta_2$ are the error probabilities related to the interval estimation. If $\gamma$ cannot be reached exactly, the true overlap probability should be near to, but not less than, $\gamma$.

The confidence interval can also be one-sided, i.e. $(0, \hat{\theta}_u]$ or $[\hat{\theta}_l, \infty)$ for $\theta \ge 0$. Figure A8.3 shows some examples of confidence intervals.

The concept of confidence intervals was introduced independently by J. Neyman and R. A. Fisher around 1930. In the following, some important cases for quality control and reliability tests are considered.

A8.2.2.1 Estimation of an Unknown Probability p

Consider a sequence of Bernoulli trials (Appendix A6.10.7) where a given event A can occur with constant probability p at each trial. The binomial distribution

$$p_k = \binom{n}{k} p^k (1-p)^{n-k}$$

gives the probability that the event A will occur exactly k times in n independent trials. From the expression for $p_k$, it follows that

$$\Pr\{k_1 \le \text{observations of A in n trials} \le k_2 \mid p\} = \sum_{i=k_1}^{k_2} \binom{n}{i} p^i (1-p)^{n-i}. \tag{A8.36}$$

Figure A8.3 Examples of confidence intervals for $\theta \ge 0$ (two-sided; one-sided $\theta \le \hat{\theta}_u$; one-sided $\theta \ge \hat{\theta}_l$)

However, in mathematical statistics, the parameter p is unknown. A confidence interval for p is sought, based on the observed number of occurrences of the event A in n Bernoulli trials. A solution to this problem has been presented by Clopper and Pearson [A8.6]. For given $\gamma = 1 - \beta_1 - \beta_2$ ($0 \le \beta_1 < 1 - \beta_2 < 1$) the following holds:

If in n Bernoulli trials the event A has occurred k times, there is a probability nearly equal to (but not smaller than) $\gamma = 1 - \beta_1 - \beta_2$ that the confidence interval $[\hat{p}_l, \hat{p}_u]$ overlaps the true (unknown) probability p, with $\hat{p}_l$ & $\hat{p}_u$ given by

$$\sum_{i=k}^{n} \binom{n}{i} \hat{p}_l^{\,i} (1-\hat{p}_l)^{n-i} = \beta_2, \quad \text{for } 0 < k < n, \tag{A8.37}$$

$$\sum_{i=0}^{k} \binom{n}{i} \hat{p}_u^{\,i} (1-\hat{p}_u)^{n-i} = \beta_1, \quad \text{for } 0 < k < n; \tag{A8.38}$$

for k = 0 take

$$\hat{p}_l = 0 \quad \text{and} \quad \hat{p}_u = 1 - \sqrt[n]{\beta_1}, \quad \text{with } \gamma = 1 - \beta_1, \tag{A8.39}$$

and for k = n take

$$\hat{p}_l = \sqrt[n]{\beta_2} \quad \text{and} \quad \hat{p}_u = 1, \quad \text{with } \gamma = 1 - \beta_2. \tag{A8.40}$$

Considering that k is a random variable, $\hat{p}_l$ and $\hat{p}_u$ are random variables. According to the footnote on p. 504, it would be more correct (from a mathematical point of view) to compute from Eqs. (A8.37) and (A8.38) the quantities $p_{k_l}$ and $p_{k_u}$, and then to set $\hat{p}_l = p_{k_l}$ and $\hat{p}_u = p_{k_u}$. For simplicity, this has been omitted here. Assuming p as a random variable, $\beta_1$ and $\beta_2$ would be the probabilities for p to be greater than $\hat{p}_u$ and smaller than $\hat{p}_l$, respectively (Fig. A8.3).
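The Clopper-Pearson limits can also be obtained numerically instead of from tables. The following sketch, under the assumption that the limits solve the two binomial tail equations (upper tail from k equal to β₂ for the lower limit, lower tail up to k equal to β₁ for the upper limit), uses bisection on the cumulative binomial sum. All function names are illustrative.

```python
import math


def binom_cdf(k, n, p):
    """Probability of at most k occurrences in n Bernoulli trials."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))


def _solve_decreasing(func, target, tol=1e-12):
    """Find p in (0, 1) with func(p) = target, func decreasing in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if func(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def clopper_pearson(k, n, beta1, beta2):
    """Return (p_l, p_u) for k occurrences in n trials."""
    if k == 0:
        p_l = 0.0
    else:
        # upper tail from k equals beta2  <=>  binom_cdf(k - 1) = 1 - beta2
        p_l = _solve_decreasing(lambda p: binom_cdf(k - 1, n, p), 1 - beta2)
    if k == n:
        p_u = 1.0
    else:
        p_u = _solve_decreasing(lambda p: binom_cdf(k, n, p), beta1)
    return p_l, p_u


p_l, p_u = clopper_pearson(5, 50, 0.05, 0.05)
```

The special cases k = 0 and k = n fall out of the same equations and reproduce the closed forms with the nth root of β₁ or β₂.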

The proof of Eqs. (A8.37) and (A8.38) is based on the monotonic property of the function

$$B_n(k, p) = \sum_{i=0}^{k} \binom{n}{i} p^i (1-p)^{n-i}.$$

For given (fixed) n, $B_n(k, p)$ decreases in p for fixed k and increases in k for fixed p (Fig. A8.4). Thus, for any $p > \hat{p}_u$, it follows that

$$B_n(k, p) < B_n(k, \hat{p}_u) = \beta_1.$$

For $p > \hat{p}_u$, the probability that the (random) number of observations in n trials will take one of the values 0, 1, ..., k is thus $< \beta_1$ (for $p > p'$ in Fig. A8.4, the statement would also be true for a $K > k$). This holds in particular for a number of observations equal to k and proves Eq. (A8.38). Proof of Eq. (A8.37) is similar.

Figure A8.4 Binomial distribution as a function of p for n fixed and two values of k

To determine $\hat{p}_l$ and $\hat{p}_u$ as in Eqs. (A8.37) and (A8.38), a table of the Fisher distribution (Appendix A9.4) or of the Beta function can be used. However, for $\beta_1 = \beta_2 = (1-\gamma)/2$ and n sufficiently large, one of the following approximate solutions can be used in practical applications:

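1. For large values of n ($\min(np, n(1-p)) \ge 5$), a good estimate for $\hat{p}_l$ and $\hat{p}_u$ can be found using the integral Laplace theorem. Rearranging Eq. (A6.149) and considering $\sum_{i=1}^{n} \delta_i = k$ and $(k/n - p)^2$ instead of $|k/n - p|$ yields

$$\lim_{n \to \infty} \Pr\left\{\left(\frac{k}{n} - p\right)^2 \le \frac{b^2\, p(1-p)}{n}\right\} = \frac{2}{\sqrt{2\pi}} \int_0^b e^{-x^2/2}\, dx. \tag{A8.41}$$

The right-hand side of Eq. (A8.41) is equal to the confidence level $\gamma$, i.e.

$$\frac{2}{\sqrt{2\pi}} \int_0^b e^{-x^2/2}\, dx = \gamma.$$

Thus, for a given $\gamma$, the value of b can be obtained from a table of the normal distribution (Table A9.1). b is the $1 - (1-\gamma)/2 = (1+\gamma)/2$ quantile of the standard normal distribution $\Phi(t)$, i.e. $b = t_{(1+\gamma)/2}$, giving e.g. b = 1.64 for $\gamma = 0.9$. On the left-hand side of Eq. (A8.41), the expression

$$\left(\frac{k}{n} - p\right)^2 = \frac{b^2\, p(1-p)}{n} \tag{A8.42}$$

is the equation of the confidence ellipse. For given values of k, n, and b, confidence limits $\hat{p}_l$ and $\hat{p}_u$ can be determined as roots of Eq. (A8.42)

$$\hat{p}_{l,u} = \frac{k + b^2/2 \mp b\sqrt{k(1 - k/n) + b^2/4}}{n + b^2}, \tag{A8.43}$$

see Figs. A8.5 and 7.1 for some examples.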

2. For small values of n, confidence limits can be determined graphically from the envelopes of Eqs. (A8.37) and (A8.38) for $\beta_1 = \beta_2 = (1-\gamma)/2$, see Fig. 7.1 for $\gamma = 0.8$ and $\gamma = 0.9$. For n > 50, the curves of Fig. 7.1 practically agree with the confidence ellipses given by Eq. (A8.43).

One-sided confidence intervals can also be obtained from the above values for $\hat{p}_l$ and $\hat{p}_u$. Figure A8.3 shows that

$$0 \le p \le \hat{p}_u, \text{ with } \gamma = 1 - \beta_1 \quad \text{and} \quad \hat{p}_l \le p \le 1, \text{ with } \gamma = 1 - \beta_2. \tag{A8.44}$$

Example A8.6 Using confidence ellipses, give the confidence interval $[\hat{p}_l, \hat{p}_u]$ for an unknown probability p for the case n = 50, k = 5, and $\gamma = 0.9$.

Solution Setting n = 50, k = 5, and b = 1.64 in Eq. (A8.43) yields the confidence interval [0.05, 0.19], see also Fig. A8.5 or Fig. 7.1 for a graphical solution. Corresponding one-sided confidence intervals would be $p \le 0.19$ or $p \ge 0.05$ with $\gamma = 0.95$.
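The computation of Example A8.6 can be reproduced with the sketch below. It solves the quadratic (k/n − p)² = b² p(1 − p)/n for p, which gives the roots (k + b²/2 ∓ b·sqrt(k(1 − k/n) + b²/4)) / (n + b²). The function name is an assumption; b is taken from Python's `statistics.NormalDist` rather than from a printed table, so it is 1.645 instead of the rounded 1.64.

```python
import math
from statistics import NormalDist


def confidence_ellipse_limits(k, n, gamma):
    """Approximate two-sided limits for p with beta1 = beta2 = (1 - gamma)/2."""
    b = NormalDist().inv_cdf((1 + gamma) / 2)   # (1 + gamma)/2 quantile
    center = k + b * b / 2
    spread = b * math.sqrt(k * (1 - k / n) + b * b / 4)
    denom = n + b * b
    return (center - spread) / denom, (center + spread) / denom


ell_l, ell_u = confidence_ellipse_limits(k=5, n=50, gamma=0.9)
```

Rounded to two decimals the result agrees with the interval [0.05, 0.19] of the example; the approximation should only be relied on for large n, as stated above.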

Figure A8.5 Confidence limits (ellipses) for an unknown probability p with a confidence level $\gamma = 0.9$ and for n = 10, 25, 50, 100 (according to Eq. (A8.43))

The role of k/n and p in Eq. (A8.42) can be reversed, and Eq. (A8.42) can be used to solve a problem of probability theory, i.e. to compute for a given probability $\gamma$, $\gamma = 1 - \beta_1 - \beta_2$ with $\beta_1 = \beta_2$, the limits $k_1$ and $k_2$ of the number of observations k in n independent trials for given (fixed) values of p and n (e.g. the number k of defective items in a sample of size n)

$$k_{1,2} = np \mp b\sqrt{np(1-p)}. \tag{A8.45}$$

As in Eq. (A8.43), the quantity b in Eq. (A8.45) is the $(1+\gamma)/2$ quantile of the normal distribution (e.g. b = 1.64 for $\gamma = 0.9$ from Table A9.1). For a graphical solution, Fig. A8.5 can be used, taking the ordinate p as known and by reading $k_1/n$ and $k_2/n$ from the abscissa. An exact solution follows from Eq. (A8.36).

A8.2.2.2 Estimation of the Parameter λ for an Exponential Distribution: Fixed Test Duration, with Replacement

Consider an item having a constant failure rate $\lambda$ and assume that at each failure it will be immediately replaced by a new, statistically equivalent item, in a negligible replacement time (Appendix A7.2). Because of the memoryless property (constant failure rate), the number of failures in (0, T] is Poisson distributed and given by $\Pr\{k \text{ failures in } (0, T] \mid \lambda\} = (\lambda T)^k e^{-\lambda T}/k!$ (Eq. (A7.41)). The maximum likelihood point estimate for $\lambda$ follows from Eq. (A8.30), with n = 1 and $m = \lambda T$, as

$$\hat{\lambda} = \frac{k}{T}. \tag{A8.46}$$

Similarly, estimation of the confidence interval for the failure rate $\lambda$ can be reduced to the estimation of the confidence interval for the parameter $m = \lambda T$ of a Poisson distribution. Considering Eqs. (A8.37) and (A8.38) and the similarity between the binomial and the Poisson distribution, the confidence limits $\hat{\lambda}_l$ and $\hat{\lambda}_u$ can be determined for given $\beta_1$, $\beta_2$, and $\gamma = 1 - \beta_1 - \beta_2$ ($0 < \beta_1 < 1 - \beta_2 < 1$) from

$$\sum_{i=k}^{\infty} \frac{(\hat{\lambda}_l T)^i}{i!}\, e^{-\hat{\lambda}_l T} = \beta_2, \quad \text{for } k > 0, \tag{A8.47}$$

and

$$\sum_{i=0}^{k} \frac{(\hat{\lambda}_u T)^i}{i!}\, e^{-\hat{\lambda}_u T} = \beta_1, \quad \text{for } k > 0; \tag{A8.48}$$

for k = 0 take

$$\hat{\lambda}_l = 0 \quad \text{and} \quad \hat{\lambda}_u = \frac{\ln(1/\beta_1)}{T}, \quad \text{with } \gamma = 1 - \beta_1. \tag{A8.49}$$

On the basis of the known relationship to the chi-square ($\chi^2$) distribution (Eqs. (A6.102), (A6.103), Appendix A9.2), the values $\hat{\lambda}_l$ and $\hat{\lambda}_u$ from Eqs. (A8.47) and (A8.48) follow from the quantiles of the chi-square distribution, yielding

$$\hat{\lambda}_l = \frac{\chi^2_{2k,\,\beta_2}}{2T}, \quad \text{for } k > 0, \tag{A8.50}$$

and

$$\hat{\lambda}_u = \frac{\chi^2_{2(k+1),\,1-\beta_1}}{2T}, \quad \text{for } k \ge 0. \tag{A8.51}$$

$\beta_1 = \beta_2 = (1-\gamma)/2$ is frequently used in practical applications. Fig. 7.6 gives the results obtained from Eqs. (A8.50) and (A8.51) for $\beta_1 = \beta_2 = (1-\gamma)/2$.

One-sided confidence intervals are given as in the previous section by

$$0 \le \lambda \le \hat{\lambda}_u, \text{ with } \gamma = 1 - \beta_1 \quad \text{and} \quad \hat{\lambda}_l \le \lambda < \infty, \text{ with } \gamma = 1 - \beta_2. \tag{A8.52}$$

The situation considered by Eqs. (A8.47) to (A8.51) corresponds also to that of a sampling plan with n elements with replacement, each of them with failure rate $\lambda' = \lambda/n$, terminated at a fixed test time $T_{test} = T$. This situation is statistically different from that presented by Eq. (A8.34) and in Section A8.2.2.3.
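Without a chi-square table at hand, the confidence limits for the failure rate can be found by solving the two Poisson tail equations directly: the lower limit makes the probability of k or more failures in (0, T] equal to β₂, the upper limit makes the probability of at most k failures equal to β₁. The sketch below does this by bisection; it is an illustration under these assumptions, with hypothetical names, and is numerically equivalent to using chi-square quantiles.

```python
import math


def poisson_cdf(k, m):
    """Probability of at most k events for a Poisson variable with mean m."""
    term = math.exp(-m)
    total = term
    for i in range(1, k + 1):
        term *= m / i
        total += term
    return total


def _solve_mean(k, target):
    """Find m with poisson_cdf(k, m) = target (the CDF decreases in m)."""
    lo, hi = 0.0, 1.0
    while poisson_cdf(k, hi) > target:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(k, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def failure_rate_limits(k, T, beta1, beta2):
    """Return (lam_l, lam_u) for k failures in (0, T], with replacement."""
    lam_u = _solve_mean(k, beta1) / T
    lam_l = 0.0 if k == 0 else _solve_mean(k - 1, 1 - beta2) / T
    return lam_l, lam_u


lam_l, lam_u = failure_rate_limits(k=3, T=1000.0, beta1=0.05, beta2=0.05)
```

For k = 0 the upper limit reduces to ln(1/β₁)/T, which gives a quick sanity check of the solver.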

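A8.2.2.3 Estimation of the Parameter λ for an Exponential Distribution: Fixed Number n of Failures, no Replacement

Let $\tau_1, \ldots, \tau_n$ be independent random variables distributed according to a common distribution function $F(t) = \Pr\{\tau_i \le t\} = 1 - e^{-\lambda t}$, i = 1, ..., n. From Eq. (A7.39),

$$\Pr\{\tau_1 + \ldots + \tau_n \le t\} = 1 - \sum_{i=0}^{n-1} \frac{(\lambda t)^i}{i!}\, e^{-\lambda t} = \frac{1}{(n-1)!} \int_0^{\lambda t} x^{n-1} e^{-x}\, dx \tag{A8.53}$$

and thus

$$\Pr\{a < \tau_1 + \ldots + \tau_n \le b\} = \frac{1}{(n-1)!} \int_{a\lambda}^{b\lambda} x^{n-1} e^{-x}\, dx.$$

Setting $a = n(1 - \varepsilon_2)/\lambda$ and $b = n(1 + \varepsilon_1)/\lambda$ it follows that

$$\Pr\left\{\frac{1 - \varepsilon_2}{\lambda} < \frac{\tau_1 + \ldots + \tau_n}{n} \le \frac{1 + \varepsilon_1}{\lambda}\right\} = \frac{1}{(n-1)!} \int_{n(1-\varepsilon_2)}^{n(1+\varepsilon_1)} x^{n-1} e^{-x}\, dx. \tag{A8.54}$$

Considering now $\tau_1, \ldots, \tau_n$ as a random sample of $\tau$ with $t_1, \ldots, t_n$ as observations, Eq. (A8.54) can be used to compute confidence limits $\hat{\lambda}_l$ and $\hat{\lambda}_u$ for the parameter $\lambda$. For given $\beta_1$, $\beta_2$, and $\gamma = 1 - \beta_1 - \beta_2$ ($0 \le \beta_1 < 1 - \beta_2 < 1$), this leads to

$$\hat{\lambda}_l = (1 - \varepsilon_2)\,\hat{\lambda} \quad \text{and} \quad \hat{\lambda}_u = (1 + \varepsilon_1)\,\hat{\lambda}, \tag{A8.55}$$

with

$$\hat{\lambda} = \frac{n}{t_1 + \ldots + t_n} \tag{A8.56}$$

and $\varepsilon_1$, $\varepsilon_2$ given by

$$\frac{1}{(n-1)!} \int_{n(1+\varepsilon_1)}^{\infty} x^{n-1} e^{-x}\, dx = \beta_1 \quad \text{and} \quad \frac{1}{(n-1)!} \int_0^{n(1-\varepsilon_2)} x^{n-1} e^{-x}\, dx = \beta_2. \tag{A8.57}$$

Using the definition of the chi-square distribution (Appendix A9.2), it follows that $1 + \varepsilon_1 = \chi^2_{2n,\,1-\beta_1}/2n$ and $1 - \varepsilon_2 = \chi^2_{2n,\,\beta_2}/2n$, and thus

$$\hat{\lambda}_l = \frac{\chi^2_{2n,\,\beta_2}}{2(t_1 + \ldots + t_n)} \quad \text{and} \quad \hat{\lambda}_u = \frac{\chi^2_{2n,\,1-\beta_1}}{2(t_1 + \ldots + t_n)}. \tag{A8.58}$$

$\varepsilon_2 = 1$ or $\varepsilon_1 = \infty$ lead to one-sided confidence intervals $[0, \hat{\lambda}_u]$ or $[\hat{\lambda}_l, \infty)$. Figure A8.6 gives the graphical relationship between n, $\gamma$, and $\varepsilon$ for the case $\varepsilon_1 = \varepsilon_2 = \varepsilon$.

The case considered by Eqs. (A8.53) to (A8.58) corresponds to the situation described in Example A8.2 (sampling plan with n elements without replacement, terminated at the nth failure), and differs statistically from that in Section A8.2.2.2.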

Figure A8.6 Probability $\gamma$ that the interval $(1 \pm \varepsilon)\hat{\lambda}$ overlaps the true value of $\lambda$ for the case of a fixed number n of failures ($\hat{\lambda} = n/(t_1 + \ldots + t_n)$, $\Pr\{\tau \le t\} = 1 - e^{-\lambda t}$, * for Example A8.7); axes: $\varepsilon$ (abscissa, 0.01 to 0.5) and $\gamma$ (ordinate, 0.01 to 1.0)


Example A8.7

For the case considered by Eqs. (A8.53) to (A8.58), give for n = 50 and $\gamma = 0.9$ the two-sided confidence interval for the parameter $\lambda$ of an exponential distribution as a function of $\hat{\lambda}$.

Solution

From Figure A8.6, $\varepsilon = 0.24$, yielding the confidence interval $[0.76\,\hat{\lambda}, 1.24\,\hat{\lambda}]$.
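The value read from Figure A8.6 can be cross-checked numerically. The sketch assumes that the sum of n exponentially distributed times, multiplied by the failure rate, has an Erlang distribution with n stages and unit rate, so that the coverage probability of the interval (1 ± ε) times the point estimate equals the probability mass of that distribution between n(1 − ε) and n(1 + ε). Function names are illustrative.

```python
import math


def erlang_cdf(n, x):
    """Pr{X <= x} for the sum X of n unit-rate exponential times."""
    if x <= 0:
        return 0.0
    term = math.exp(-x)          # i = 0 term of the Poisson sum
    total = term
    for i in range(1, n):        # terms i = 1, ..., n - 1
        term *= x / i
        total += term
    return 1.0 - total


def coverage_probability(n, eps):
    """Probability that (1 +/- eps) * lambda_hat covers the true lambda."""
    return erlang_cdf(n, n * (1 + eps)) - erlang_cdf(n, n * (1 - eps))


gamma_50 = coverage_probability(50, 0.24)
```

For n = 50 and ε = 0.24 the computed probability is close to the 0.9 used in the example; the small difference is within the reading accuracy of a graph.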

A8.2.2.4 Availability Estimation (Erlangian Failure-Free and Repair Times)

Considerations of Section A8.2.2.3 can be extended to estimate the availability of a repairable item (described by the alternating renewal process of Fig. 6.2) for the case of Erlangian distributed failure-free and/or repair times (Appendix A6.10.3), and in particular for the case of constant failure and repair rates (exponentially distributed failure-free and repair times).

Consider a repairable item in continuous operation, new at t = 0 (Fig. 6.2), and assume constant failure and repair rates ($\lambda(x) = \lambda$, $\mu(x) = \mu$). For this case, point and average unavailability converge rapidly ($1 - PA_{S0}(t)$ & $1 - AA_{S0}(t)$ in Table 6.3) to the asymptotic and steady-state value given by

$$1 - PA_S = 1 - AA_S = \frac{\lambda}{\lambda + \mu}.$$

$\lambda/(\lambda + \mu)$ is a probabilistic value and has its statistical counterpart in DT/(UT + DT), where DT is the down (repair) time and UT = t - DT the up (operating) time observed in (0, t]. To simplify considerations, it will be assumed in the following that $t \gg MTTR = 1/\mu$ (Table 6.3) and that at the time point t a repair is terminated and k failure-free and repair times have occurred (k = 1, 2, ...). Furthermore, $\lambda \ll \mu$, i.e.

$$\overline{PA}_S = 1 - PA_S \approx \overline{PA}_a = \lambda/\mu \tag{A8.59}$$

will be assumed here, yielding the counterpart DT/UT (relative error of magnitude $\overline{PA}_a$). Considering that at the time point t a repair is terminated, it holds that

$$\frac{DT}{UT} = \frac{t'_1 + \ldots + t'_k}{t_1 + \ldots + t_k},$$

where $t_i$ & $t'_i$ are the observed values of failure-free and repair times $\tau_i$ & $\tau'_i$, respectively. According to Eqs. (A6.102) - (A6.104), the quantity $2\lambda(\tau_1 + \ldots + \tau_k)$ has a $\chi^2$ distribution with $\nu = 2k$ degrees of freedom. The same holds for the repair times $2\mu(\tau'_1 + \ldots + \tau'_k)$. From this, it follows (Appendix A9.4) that the quantity

$$\overline{PA}_a \cdot \frac{UT}{DT} = \frac{\lambda}{\mu} \cdot \frac{\tau_1 + \ldots + \tau_k}{\tau'_1 + \ldots + \tau'_k} = \frac{2\lambda(\tau_1 + \ldots + \tau_k)/2k}{2\mu(\tau'_1 + \ldots + \tau'_k)/2k} \tag{A8.60}$$

is distributed according to a Fisher distribution (F) with $\nu_1 = \nu_2 = 2k$ degrees of freedom ($\overline{PA}_a$ is an unknown parameter, regarded here as a random variable)

$$\Pr\left\{\overline{PA}_a\, \frac{UT}{DT} \le t\right\} = \frac{(2k-1)!}{[(k-1)!]^2} \int_0^t \frac{x^{k-1}}{(1+x)^{2k}}\, dx. \tag{A8.61}$$

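Having observed for a repairable item described by Fig. 6.2 with constant failure rate $\lambda(x) = \lambda$ and repair rate $\mu(x) = \mu \gg \lambda$, an operating time $UT = t_1 + \ldots + t_k$ and a repair time $DT = t'_1 + \ldots + t'_k$, the maximum likelihood estimate for $\overline{PA}_a = \lambda/\mu$ is

$$\hat{\overline{PA}}_a = \widehat{(\lambda/\mu)} = DT/UT = (t'_1 + \ldots + t'_k)/(t_1 + \ldots + t_k), \tag{A8.62}$$

an unbiased point estimate being $(1 - 1/k)\, DT/UT$, k > 1 (Example A8.10). With the same considerations as for Eq. (A8.54), Eq. (A8.61) yields (k = 1, 2, ...)

$$\Pr\left\{\frac{DT}{UT}(1 - \varepsilon_2) < \overline{PA}_a \le \frac{DT}{UT}(1 + \varepsilon_1)\right\} = \frac{(2k-1)!}{[(k-1)!]^2} \int_{1-\varepsilon_2}^{1+\varepsilon_1} \frac{x^{k-1}}{(1+x)^{2k}}\, dx, \tag{A8.63}$$

and thus to the confidence limits $\hat{\overline{PA}}_{a_l} = (1 - \varepsilon_2)\,\hat{\overline{PA}}_a$ and $\hat{\overline{PA}}_{a_u} = (1 + \varepsilon_1)\,\hat{\overline{PA}}_a$, with $\hat{\overline{PA}}_a$ as in Eq. (A8.62) and $\varepsilon_1$, $\varepsilon_2$ related to the confidence level $\gamma = 1 - \beta_1 - \beta_2$ by

$$\frac{(2k-1)!}{[(k-1)!]^2} \int_{1+\varepsilon_1}^{\infty} \frac{x^{k-1}}{(1+x)^{2k}}\, dx = \beta_1 \quad \text{and} \quad \frac{(2k-1)!}{[(k-1)!]^2} \int_0^{1-\varepsilon_2} \frac{x^{k-1}}{(1+x)^{2k}}\, dx = \beta_2. \tag{A8.64}$$

From the definition of the Fisher distribution (Appendix A9.4), it follows that $\varepsilon_1 = F_{2k,2k,1-\beta_1} - 1$ and $\varepsilon_2 = 1 - F_{2k,2k,\beta_2}$; and thus, using $F_{\nu_1,\nu_2,\beta_2} = 1/F_{\nu_2,\nu_1,1-\beta_2}$,

$$\hat{\overline{PA}}_{a_l} = \hat{\overline{PA}}_a / F_{2k,2k,1-\beta_2} \quad \text{and} \quad \hat{\overline{PA}}_{a_u} = \hat{\overline{PA}}_a \cdot F_{2k,2k,1-\beta_1}, \tag{A8.65}$$

where $F_{2k,2k,1-\beta_2}$ & $F_{2k,2k,1-\beta_1}$ are the $1 - \beta_2$ & $1 - \beta_1$ quantiles of the Fisher (F) distribution (Appendix A9.4, [A9.3 - A9.6]). A graphical visualization of the confidence interval $[\hat{\overline{PA}}_{a_l}, \hat{\overline{PA}}_{a_u}]$ is given in Fig. 7.5. One-sided confidence intervals are

$$0 \le \overline{PA}_a \le \hat{\overline{PA}}_{a_u}, \text{ with } \gamma = 1 - \beta_1 \quad \text{and} \quad \hat{\overline{PA}}_{a_l} \le \overline{PA}_a < 1, \text{ with } \gamma = 1 - \beta_2. \tag{A8.66}$$

Corresponding values for the availability can be obtained using $PA_a = 1 - \overline{PA}_a$. If failure-free and/or repair times are Erlangian distributed (Eq. (A6.102)) with $\beta_\lambda$ & $\beta_\mu$, $F_{2k,2k,1-\beta_2}$ and $F_{2k,2k,1-\beta_1}$ have to be replaced by $F_{2k\beta_\mu,\,2k\beta_\lambda,\,1-\beta_2}$ and $F_{2k\beta_\lambda,\,2k\beta_\mu,\,1-\beta_1}$ (for unchanged MTTF & MTTR, see Example A8.11). $\hat{\overline{PA}}_a = DT/UT$ remains valid. Results based only on the distribution of DT (Eq. (7.22)) are not free of parameters (Section 7.2.2.3).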

Example A8.8 For the estimation of an availability PA, UT = 1750 h, DT = 35 h and k = 5 failures and repairs have been observed. Give for constant failure & repair rates the 90% lower limit of PA (Fig. 7.5, $\gamma = 0.8$).

Solution From Eqs. (A8.65) & (A8.66) and Table A9.4a follows $\hat{\overline{PA}}_{a_u} = 2\% \cdot 2.32$ and thus PA > 95.3%. Supplementary result: Erlangian distributed repair times with $\beta_\mu = 3$ yields $\hat{\overline{PA}}_{a_u} = 2\% \cdot 1.82$.
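The arithmetic of Example A8.8 is short enough to sketch in code. The function below assumes that the Fisher quantiles are read from a table and passed in by the caller (2.32 for 10 and 10 degrees of freedom at the 0.9 level, as used in the example); it does not compute F quantiles itself. The limits follow the rule used in the example: the point estimate DT/UT is divided by one quantile for the lower limit and multiplied by the other for the upper limit. All names are hypothetical.

```python
def unavailability_limits(ut, dt, k, f_lower, f_upper):
    """Point estimates and confidence limits for the unavailability lambda/mu.

    f_lower: tabulated F quantile (2k, 2k degrees of freedom) at 1 - beta2.
    f_upper: tabulated F quantile (2k, 2k degrees of freedom) at 1 - beta1.
    """
    point = dt / ut                      # maximum likelihood estimate DT/UT
    unbiased = (1 - 1 / k) * point       # unbiased estimate, needs k > 1
    return point, unbiased, point / f_lower, point * f_upper


point, unbiased, lower, upper = unavailability_limits(
    ut=1750.0, dt=35.0, k=5, f_lower=2.32, f_upper=2.32)
availability_lower_limit = 1 - upper     # one-sided 90% lower limit of PA
```

With the example data the upper unavailability limit is 4.64%, so the availability exceeds about 95.3% with confidence 0.9.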

A8.3 Testing Statistical Hypotheses


When testing a statistical hypothesis, the objective is to solve the following problem:

From one's own experience, the nature of the problem, or simply as a basic hypothesis, a specific null hypothesis $H_0$ is formulated for the statistical properties of the observed random variable; sought is a rule which allows rejection or acceptance of $H_0$ on the basis of the statistically independent observations made from a sample of the random variable under consideration.

If R is the unknown reliability of an item, the following null hypotheses $H_0$ are possible:

1a) $H_0$: $R = R_0$

1b) $H_0$: $R > R_0$

1c) $H_0$: $R < R_0$.

To test whether the failure-free time of an item is distributed according to an exponential distribution $F_0(t) = 1 - e^{-\lambda t}$ with unknown $\lambda$, or $F_0(t) = 1 - e^{-\lambda_0 t}$ with known $\lambda_0$, the following null hypotheses $H_0$ can for instance be formulated:

2a) $H_0$: the distribution function is $F_0(t)$

2b) $H_0$: the distribution function is different from $F_0(t)$

2c) $H_0$: $\lambda = \lambda_0$, provided the distribution is exponential

2d) $H_0$: $\lambda < \lambda_0$, provided the distribution is exponential

2e) $H_0$: the distribution function is $1 - e^{-\lambda t}$, parameter $\lambda$ unknown.

It is usual to subdivide hypotheses into parametric (1a, 1b, 1c, 2c, 2d) and non-parametric ones (2a, 2b, and 2e). For each of these types, a distinction is also made between simple hypotheses (1a, 2a, 2c) and composite hypotheses (1b, 1c, 2b, 2d, 2e). When testing a hypothesis, two kinds of errors can occur (Table A8.2):

type I error, i.e. the error of rejecting a true hypothesis $H_0$; the probability of this error is denoted by $\alpha$

type II error, i.e. the error of accepting a false hypothesis $H_0$; the probability of this error is denoted by $\beta$ (to compute $\beta$, an alternative hypothesis $H_1$ is necessary; $\beta$ is then the probability of accepting $H_0$ assuming $H_1$ is true).

If the sample space is divided into two complementary sets, $\mathcal{A}$ for acceptance and $\overline{\mathcal{A}}$ for rejection, the type I and type II errors are given by

$$\alpha = \Pr\{\text{sample in } \overline{\mathcal{A}} \mid H_0 \text{ true}\},$$

$$\beta = \Pr\{\text{sample in } \mathcal{A} \mid H_0 \text{ false } (H_1 \text{ true})\}.$$

Both kinds of error are possible and cannot be minimized simultaneously. Often $\alpha$ is selected and a test is sought so that, for a given $H_1$, $\beta$ will be minimized. It can be shown that such a test always exists if $H_0$ and $H_1$ are simple hypotheses [A8.22]. For given alternative hypothesis $H_1$, $\beta$ can often be calculated and the quantity $1 - \beta = \Pr\{\text{sample in } \overline{\mathcal{A}} \mid H_1 \text{ true}\}$ is referred to as the power of the test.

Table A8.2 Possible errors when testing a hypothesis

                                  | $H_0$ is rejected                | $H_0$ is accepted
$H_0$ is true                     | false → type I error ($\alpha$)  | correct
$H_0$ is false ($H_1$ is true)    | correct                          | false → type II error ($\beta$)

The following sections consider some important procedures for quality control and reliability tests, see Chapter 7 for applications. Such procedures are basically obtained by investigating the distribution of a suitable quantity observed in the sample.

A8.3.1 Testing an Unknown Probability p

Let A be an event which can occur at every independent trial with the constant, unknown probability p. A rule (test plan) is sought which allows testing of the hypothesis

$$H_0: \; p < p_0 \tag{A8.69}$$

against the alternative hypothesis

$$H_1: \; p > p_1 \quad (p_1 \ge p_0). \tag{A8.70}$$

The type I error should be nearly equal to (but not greater than) $\alpha$ for $p = p_0$. The type II error should be nearly equal to (but not greater than) $\beta$ for $p = p_1$. Such a situation often occurs in practical applications, in particular in:

quality control, where p refers to the defective probability or fraction of defective items,

reliability tests for a given fixed mission, where it is usual to set p = 1 - R (R = reliability).

In both cases, $\alpha$ is the producer's risk and $\beta$ the consumer's risk. The two most frequently used procedures for testing hypotheses defined by (A8.69) and (A8.70), with $p_1 > p_0$, are the simple two-sided sampling plan and the sequential test (one-sided sampling plans are considered in Appendix A8.3.1.3).

A8.3.1.1 Simple Two-sided Sampling Plan

The rule for the simple two-sided sampling plan (simple two-sided test) is:

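1. For given $p_0$, $p_1 > p_0$, $\alpha$, and $\beta$ ($0 < \alpha < 1 - \beta < 1$), compute the smallest integers c and n which satisfy

$$\sum_{i=0}^{c} \binom{n}{i} p_0^i (1-p_0)^{n-i} \ge 1 - \alpha \tag{A8.71}$$

and

$$\sum_{i=0}^{c} \binom{n}{i} p_1^i (1-p_1)^{n-i} \le \beta. \tag{A8.72}$$

2. Perform n independent trials (Bernoulli trials), determine the number k in which the event A (component defective for example) has occurred, and

reject $H_0$: $p < p_0$, if k > c,

accept $H_0$: $p < p_0$, if k ≤ c. (A8.73)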

As in the case of Eqs. (A8.37) and (A8.38), the proof of the above rule is based on the monotonic property of $B_n(c, p) = \sum_{i=0}^{c} \binom{n}{i} p^i (1-p)^{n-i}$, see also Fig. A8.4. For known n, c, and p, $B_n(c, p)$ gives the probability of having up to c defectives in a sample of size n. Thus, assuming $H_0$ true ($p < p_0$), it follows that the probability of rejecting $H_0$ (i.e. the probability of having more than c defectives in a sample of size n) is smaller than $\alpha$

$$\Pr\{\text{rejection of } H_0 \mid H_0 \text{ true}\} = \sum_{i=c+1}^{n} \binom{n}{i} p^i (1-p)^{n-i}\,\bigg|_{p < p_0} < \alpha.$$

Similarly, if $H_1$ is true ($p > p_1$), it follows that the probability of accepting $H_0$ is smaller than $\beta$

$$\Pr\{\text{acceptance of } H_0 \mid H_1 \text{ true}\} = \sum_{i=0}^{c} \binom{n}{i} p^i (1-p)^{n-i}\,\bigg|_{p > p_1} < \beta.$$

The type I error and the type II error are thus smaller than $\alpha$ for $p < p_0$ and smaller than $\beta$ for $p > p_1$, respectively. Figure A8.7 shows the results for $p_0 = 1\%$, $p_1 = 2\%$, and $\alpha = \beta \approx 20\%$. The curve of Fig. A8.7 is known as the operating characteristic (OC). If $p_0$ and $p_1$ are small (up to a few %) or close to 1, the Poisson approximation (Eq. (A6.129))

$$\binom{n}{i} p^i (1-p)^{n-i} \approx \frac{(np)^i}{i!}\, e^{-np}$$

is generally used.
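The search for the pair (c, n) of the simple two-sided sampling plan can be sketched as follows. The sketch assumes the two conditions of the rule above: the probability of at most c events must be at least 1 − α at p₀ and at most β at p₁. It also assumes one particular reading of "smallest integers c and n": for each c = 0, 1, ... it takes the smallest n that meets the p₁ condition and accepts the first c for which the p₀ condition then also holds. The exact binomial sum is used rather than the Poisson approximation, so the resulting numbers can differ slightly from plans taken from graphs or tables. Function names are illustrative.

```python
import math


def binom_cdf(c, n, p):
    """B_n(c, p): probability of at most c events in n Bernoulli trials."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(min(c, n) + 1))


def two_sided_plan(p0, p1, alpha, beta, max_c=200):
    """Return (c, n) for testing p < p0 against p > p1."""
    if not (0 < p0 < p1 < 1):
        raise ValueError("need 0 < p0 < p1 < 1")
    for c in range(max_c + 1):
        n = c + 1
        while binom_cdf(c, n, p1) > beta:      # smallest n with B_n(c, p1) <= beta
            n += 1
        if binom_cdf(c, n, p0) >= 1 - alpha:   # producer's condition at p0
            return c, n
    raise ValueError("no plan found up to max_c")


c_plan, n_plan = two_sided_plan(p0=0.01, p1=0.02, alpha=0.2, beta=0.2)
```

Because the cumulative binomial sum decreases in n for fixed c, taking the smallest n that satisfies the consumer's condition gives the producer's condition its best chance for that c.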


Figure A8.7 Operating characteristic (probability of acceptance, $\Pr\{\text{acceptance} \mid p\} = \sum_{i=0}^{c} \binom{n}{i} p^i (1-p)^{n-i}$) as a function of p for fixed n and c ($p_0 = 1\%$, $p_1 = 2\%$, $\alpha = \beta = 0.185$, n = 462, c = 6)

A8.3.1.2 Sequential Test

Assume that in a two-sided sampling plan with n = 50 and c = 2, a 3rd defect, i.e. k = 3, occurs at the 12th trial. Since k > c, the hypothesis $H_0$ will be rejected as per procedure (A8.73), independent of how often the event A will occur during the remaining 38 trials. This example brings up the question of whether a plan can be established for testing $H_0$ in which no unnecessary trials (the remaining 38 in the above example) have to be performed. To solve this problem, A. Wald proposed the sequential test [A8.32]. For this test, one element after another is taken from the lot and tested. Depending upon the actual frequency of the observed event, the decision is made to either

reject Ho

accept Ho

perform a further trial.

The testing procedure can be described as follows (Fig. A8.8):

In a system of Cartesian coordinates, the number n of trials is recorded on the abscissa and the number k of trials in which the event A occurred on the ordinate; the test is stopped with acceptance or rejection as soon as the resulting staircase curve k = f(n) crosses the acceptance or rejection line given in the Cartesian coordinates for specified values of $p_0$, $p_1$, $\alpha$, and $\beta$.

The acceptance and rejection lines can be determined from:

Acceptance line: $k = a\,n - b_1$,

Rejection line: $k = a\,n + b_2$,

with

$$a = \frac{\ln\frac{1-p_0}{1-p_1}}{\ln\frac{p_1}{p_0} + \ln\frac{1-p_0}{1-p_1}}, \quad b_1 = \frac{\ln\frac{1-\alpha}{\beta}}{\ln\frac{p_1}{p_0} + \ln\frac{1-p_0}{1-p_1}}, \quad b_2 = \frac{\ln\frac{1-\beta}{\alpha}}{\ln\frac{p_1}{p_0} + \ln\frac{1-p_0}{1-p_1}}. \tag{A8.76}$$

Figure A8.8 Sequential test for $p_0 = 1\%$, $p_1 = 2\%$, and $\alpha = \beta = 20\%$

Figure A8.8 shows acceptance and rejection lines for $p_0 = 1\%$, $p_1 = 2\%$, $\alpha = \beta = 20\%$. Practical remarks related to sequential tests are given in Sections 7.1.2.2 and 7.2.3.2.
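A small sketch of the sequential test follows. It assumes Wald's standard lines for Bernoulli trials: with D = ln(p₁/p₀) + ln((1 − p₀)/(1 − p₁)), the common slope is a = ln((1 − p₀)/(1 − p₁))/D and the intercepts are b₁ = ln((1 − α)/β)/D and b₂ = ln((1 − β)/α)/D. Treating a point exactly on a line as a crossing is a design choice of this sketch, and the function names are hypothetical.

```python
import math


def sequential_test_lines(p0, p1, alpha, beta):
    """Return (a, b1, b2): slope and intercepts of acceptance/rejection lines."""
    denom = math.log(p1 / p0) + math.log((1 - p0) / (1 - p1))
    a = math.log((1 - p0) / (1 - p1)) / denom
    b1 = math.log((1 - alpha) / beta) / denom
    b2 = math.log((1 - beta) / alpha) / denom
    return a, b1, b2


def sequential_decision(outcomes, p0, p1, alpha, beta):
    """Walk through trial outcomes (True = event A occurred) until a decision."""
    a, b1, b2 = sequential_test_lines(p0, p1, alpha, beta)
    k = 0
    for n, occurred in enumerate(outcomes, start=1):
        if occurred:
            k += 1
        if k >= a * n + b2:
            return "reject", n
        if k <= a * n - b1:
            return "accept", n
    return "continue", len(outcomes)


a, b1, b2 = sequential_test_lines(0.01, 0.02, 0.2, 0.2)
decision = sequential_decision([False] * 200, 0.01, 0.02, 0.2, 0.2)
```

For the parameters of Figure A8.8 the slope is about 0.0144 and both intercepts are about 1.97, so a run without any defect leads to acceptance at the 137th trial.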

A8.3.1.3 Simple One-sided Sampling Plan

In many practical applications only $p_0$ and $\alpha$ or $p_1$ and $\beta$ are specified, i.e. one wants to test $H_0$: $p < p_0$ against $H_1$: $p > p_0$ with type I error $\alpha$, or $H_0$: $p < p_1$ against $H_1$: $p > p_1$ with type II error $\beta$. In these cases, only Eq. (A8.71) or Eq. (A8.72) can be used and the test plan is a pair (c, n) for each selected value of c = 0, 1, ... and calculated value of n. Such plans are termed one-sided sampling plans.

Setting $p_1 = p_0$ in the relationship (A8.70) or, in other words, testing

$$H_0: \; p < p_0 \tag{A8.77}$$

against

$$H_1: \; p > p_0 \tag{A8.78}$$

with type I error $\alpha$, i.e. using one (c, n) pair (for c = 0, 1, ...) from Eq. (A8.71) and the test procedure (A8.73), the type II error can become very large and reach the value $1 - \alpha$ for $p = p_0$. Depending upon the value selected for c = 0, 1, ... and that calculated for n (the largest integer n which satisfies Eq. (A8.71)), different plans (pairs of (c, n)) are possible. Each of these plans yields different type II errors. Figure A8.9 shows this for some values of c (the type II error is the ordinate of the operating characteristic for $p > p_0$). In practical applications, it is common usage to define

$$p_0 = \text{AQL}, \tag{A8.79}$$

where AQL stands for Acceptable Quality Level. The above considerations show that with the choice of only $p_0$ and $\alpha$ (instead of $p_0$, $p_1$, $\alpha$, and $\beta$) the producer can realize an advantage, particularly if small values of c are used.

Figure A8.9 Operating characteristics for $p_0 = 1\%$, $\alpha = 0.1$ and c = 0 (n = 10), c = 1 (n = 53), c = 2 (n = 110), c = 3 (n = 174) and c = ∞

On the other hand, setting $p_0 = p_1$ in the relationship (A8.69), or testing

$$H_0: \; p < p_1 \tag{A8.80}$$

against

$$H_1: \; p > p_1 \tag{A8.81}$$

with type II error $\beta$, i.e. using one (c, n) pair (for c = 0, 1, ...) from Eq. (A8.72) and the test procedure (A8.73), the type I error can become very large and reach the value $1 - \beta$ for $p = p_1$. Depending upon the value selected for c = 0, 1, ... and that calculated for n (the smallest integer n which satisfies Eq. (A8.72)), different plans (pairs of (c, n)) are possible. Considerations here are similar to those of the previous case, where only $p_0$ and $\alpha$ were selected. For small values of c the consumer can realize an advantage. In practical applications, it is common usage to define

$$p_1 = \text{LTPD}, \tag{A8.82}$$

where LTPD stands for Lot Tolerance Percent Defective. Further remarks on one-sided sampling plans are in Section 7.1.3.


A8.3.1.4 Availability Demonstration (Erlangian Failure-Free and Repair Times)

Considerations of Section A8.2.2.4 on availability estimation can be extended to demonstrate the availability of a repairable item (described by the alternating renewal process of Fig. 6.2) for the case of Erlangian distributed failure-free and/or repair times (Appendix A6.10.3), and in particular for the case of constant failure and repair rates (exponentially distributed failure-free and repair times).

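Consider a repairable item in continuous operation, new at t = 0 (Fig. 6.2), and assume constant failure and repair rates ($\lambda(x) = \lambda$, $\mu(x) = \mu$). For this case, point and average unavailability converge rapidly ($1 - PA_{S0}(t)$ & $1 - AA_{S0}(t)$ in Table 6.3) to the asymptotic & steady-state value given by

$$\overline{PA} = 1 - PA = \frac{\lambda}{\lambda + \mu}. \tag{A8.83}$$

$\lambda/(\lambda + \mu)$ is a probabilistic value of the asymptotic & steady-state unavailability and has its statistical counterpart in DT/(UT + DT), where DT is the down (repair) time and UT the up (operating) time observed in (0, t]. From Eq. (A8.83) it follows that

$$\frac{\overline{PA}}{PA} = \frac{\overline{PA}}{1 - \overline{PA}} = \frac{\lambda}{\mu}.$$

As in Appendix A8.2.2.4, it will be assumed that at the time point t a repair is terminated, and exactly n failure-free and n repair times have occurred. However, for a demonstration test $\overline{PA}$ or PA will be specified (Eqs. (A8.88) - (A8.89)) and DT/UT observed. Similar as for Eq. (A8.60), the quantity

$$\frac{PA}{\overline{PA}} \cdot \frac{DT}{UT} = \frac{\mu}{\lambda} \cdot \frac{\tau'_1 + \ldots + \tau'_n}{\tau_1 + \ldots + \tau_n} = \frac{2\mu(\tau'_1 + \ldots + \tau'_n)/2n}{2\lambda(\tau_1 + \ldots + \tau_n)/2n}$$

is distributed according to a Fisher distribution (F-distribution) with $\nu_1 = \nu_2 = 2n$ degrees of freedom (Appendix A9.4). From this (with DT/UT as a random variable),

$$\Pr\left\{\frac{PA}{\overline{PA}} \cdot \frac{DT}{UT} \le x\right\} = \frac{(2n-1)!}{[(n-1)!]^2} \int_0^x \frac{y^{n-1}}{(1+y)^{2n}}\, dy. \tag{A8.85}$$

Setting $\delta = x \cdot \overline{PA}/PA$, Eq. (A8.85) yields

$$\Pr\left\{\frac{DT}{UT} \le \delta\right\} = \frac{(2n-1)!}{[(n-1)!]^2} \int_0^{\delta \cdot PA/\overline{PA}} \frac{y^{n-1}}{(1+y)^{2n}}\, dy. \tag{A8.87}$$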

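Considering $DT/UT = (\tau'_1 + \ldots + \tau'_n)/(\tau_1 + \ldots + \tau_n)$, i.e. the sum of n repair times divided by the sum of the corresponding n failure-free times, a rule for testing

$$H_0: \; \overline{PA} < \overline{PA}_0 \tag{A8.88}$$

against the alternative hypothesis

$$H_1: \; \overline{PA} > \overline{PA}_1 \quad (\overline{PA}_1 \ge \overline{PA}_0) \tag{A8.89}$$

can be established (as in Appendix A8.3.1) for given type I error (producer risk) nearly equal to (but not greater than) $\alpha$ for $\overline{PA} = \overline{PA}_0$ and type II error (consumer risk) nearly equal to (but not greater than) $\beta$ for $\overline{PA} = \overline{PA}_1$

$$\Pr\left\{\frac{DT}{UT} > \delta \;\Big|\; \overline{PA} = \overline{PA}_0\right\} \le \alpha \quad \text{and} \quad \Pr\left\{\frac{DT}{UT} \le \delta \;\Big|\; \overline{PA} = \overline{PA}_1\right\} \le \beta. \tag{A8.90}$$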

From Eqs. (A8.87) & (A8.90) it follows that (considering the definition of the Fisher distribution, Appendix A9.4), 6. P A ~ I P A ~ =F 2n,2n,1-a and 6. PA1 I PA1 = F 2n ,2n ,ß .

Eliminating F , using F v „ v „ =1 / F v „ v , , - B , and considering - - the conditions (A8.90), the rule for testing H o : PA = PAo against H , : PA = PA, follows as (see also [A8.28, A2.6 (IEC 61070)l):

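1. For given $\overline{PA}_0$, $\overline{PA}_1$, $\alpha$, and $\beta$ ($0 < \alpha < 1 - \beta < 1$), find the smallest integer $n$ (1, 2, ...) which satisfies

$$F_{2n,2n,1-\alpha} \cdot F_{2n,2n,1-\beta} \le \frac{\overline{PA}_1}{\overline{PA}_0} \cdot \frac{PA_0}{PA_1} = \frac{(1 - PA_1)\, PA_0}{(1 - PA_0)\, PA_1}, \tag{A8.91}$$

where $F_{2n,2n,1-\alpha}$ and $F_{2n,2n,1-\beta}$ are the $1-\alpha$ and $1-\beta$ quantiles of the F-distribution (Appendix A9.4, [A9.2-A9.6]), and compute the limiting value

$$\delta = F_{2n,2n,1-\alpha}\, \overline{PA}_0 / PA_0 = F_{2n,2n,1-\alpha}\, (1 - PA_0)/PA_0. \tag{A8.92}$$

2. Observe $n$ failure-free times $t_1, \dots, t_n$ and the corresponding repair times $t'_1, \dots, t'_n$, and

reject $H_0: \overline{PA} < \overline{PA}_0$ if $(t'_1 + \dots + t'_n)/(t_1 + \dots + t_n) > \delta$,
accept $H_0: \overline{PA} < \overline{PA}_0$ otherwise.

Corresponding values for the availability can be obtained using $PA = 1 - \overline{PA}$.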

If failure-free and/or repair times are Erlangian distributed (Eq. (A6.102)) with $\beta_\lambda$ and $\beta_\mu$, $F_{2n,2n,1-\alpha}$ and $F_{2n,2n,1-\beta}$ have to be replaced by $F_{2n\beta_\mu,\,2n\beta_\lambda,\,1-\alpha}$ and $F_{2n\beta_\lambda,\,2n\beta_\mu,\,1-\beta}$ (for unchanged MTTF and MTTR, see Example A8.11). Results based only on the distribution of $DT$ (Eq. (7.22)) are not free of parameters (Section 7.2.2.3).

Example A8.9 For the demonstration of an availability $PA$, customer and producer agree on the following parameters: $\overline{PA}_0 = 1\%$, $\overline{PA}_1 = 6\%$, $\alpha = \beta = 10\%$. Give, for the case of constant failure and repair rates ($\lambda(x) = \lambda$ and $\mu(x) = \mu \gg \lambda$), the number $n$ of failures and repairs that have to be observed and the acceptance limit $\delta$ for $(t'_1 + \dots + t'_n)/(t_1 + \dots + t_n)$.

Solution Eq. (A8.91) and Table A9.4a yield $n = 5$ ($(F_{10,10,0.9})^2 = 2.32^2 < 6 \cdot 99/(1 \cdot 94) < 2.59^2 = (F_{8,8,0.9})^2$). $\delta = F_{10,10,0.9}\, \overline{PA}_0/PA_0 = 2.32 \cdot 1/99 = 0.0234$ follows from Eq. (A8.92), see also Tab. 7.2.

Suppl. result: Erlangian distributed repair times with $\beta_\mu = 3$ yield $n = 3$, $\delta = 0.0288$ ($2.85 \cdot 2.13 < 6.32$).
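The search for $n$ and $\delta$ in Example A8.9 can be reproduced numerically. The following Python sketch is an illustration only, not part of the original procedure: the function names (`f_cdf`, `f_quantile`, `demonstration_plan`) are hypothetical, the F-quantiles are found by bisection instead of being read from Table A9.4a, and the closed-form distribution function used here holds only for even degrees of freedom (always the case for $2n$, $2n\beta_\lambda$, $2n\beta_\mu$ with integer $\beta_\lambda$, $\beta_\mu$).

```python
from math import comb


def f_cdf(x, nu1, nu2):
    """Distribution function of the F-distribution for even nu1 and nu2.

    With integer a = nu1/2 and b = nu2/2 the regularized incomplete beta
    function reduces to a binomial sum (compare the binomial relationship
    listed in Appendix A9.4).
    """
    a, b = nu1 // 2, nu2 // 2
    y = nu1 * x / (nu1 * x + nu2)
    m = a + b - 1
    return sum(comb(m, j) * y**j * (1.0 - y) ** (m - j) for j in range(a, m + 1))


def f_quantile(q, nu1, nu2):
    """q quantile of the F-distribution (even nu1, nu2), found by bisection."""
    lo, hi = 0.0, 1.0
    while f_cdf(hi, nu1, nu2) < q:  # enlarge the bracket until it holds the quantile
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if f_cdf(mid, nu1, nu2) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def demonstration_plan(ua0, ua1, alpha, beta, beta_lambda=1, beta_mu=1, n_max=500):
    """Return (n, delta) for testing unavailability ua0 against ua1 > ua0.

    n is the smallest integer with
        F(1-alpha) * F(1-beta) <= (ua1 / (1 - ua1)) * ((1 - ua0) / ua0),
    delta is the limiting value for the observed DT / UT.
    """
    bound = (ua1 / (1.0 - ua1)) * ((1.0 - ua0) / ua0)
    for n in range(1, n_max + 1):
        f_a = f_quantile(1.0 - alpha, 2 * n * beta_mu, 2 * n * beta_lambda)
        f_b = f_quantile(1.0 - beta, 2 * n * beta_lambda, 2 * n * beta_mu)
        if f_a * f_b <= bound:
            return n, f_a * ua0 / (1.0 - ua0)
    raise ValueError("no n <= n_max satisfies the condition")


if __name__ == "__main__":
    print(demonstration_plan(0.01, 0.06, 0.10, 0.10))             # Example A8.9
    print(demonstration_plan(0.01, 0.06, 0.10, 0.10, beta_mu=3))  # Erlangian repair times
```

Under these assumptions the sketch should agree with the example ($n = 5$, $\delta \approx 0.023$) and with the supplementary result ($n = 3$, $\delta \approx 0.029$); small differences in the last digit come from the rounded table values used in the text.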

Example A8.10 Give an unbiased estimate for $\overline{PA}_a = \lambda/\mu$.

Solution From Eq. (A8.61) it follows that

$$\Pr\Big\{\frac{UT}{DT} \le x\Big\} = \frac{(2k-1)!}{[(k-1)!]^2} \int_0^{x\lambda/\mu} \frac{y^{\,k-1}}{(1+y)^{2k}}\, dy.$$

The density of $UT/DT$, taken at $x$ equal to the observed $UT/DT$, is the likelihood function for the estimation of $\lambda/\mu$. From this, $\widehat{\lambda/\mu} = DT/UT$ (Eq. (A8.25)). Considering now $\lambda/\mu$ as a random variable with distribution function as per Eq. (A8.61) for given $UT/DT$, it follows that (Table A9.4)

$$E[\lambda/\mu] = \frac{DT}{UT} \cdot \frac{k}{k-1}, \qquad k > 1.$$

$\widehat{\lambda/\mu} = DT/UT$ is thus biased; unbiased is $(1 - 1/k)\, DT/UT$.
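The bias factor $k/(k-1)$ of Example A8.10 can be made visible with a small Monte Carlo experiment. The sketch below is only an illustration: it assumes exponentially distributed failure-free and repair times, and the function name as well as the numerical values ($\lambda = 0.01\,h^{-1}$, $\mu = 1\,h^{-1}$, $k = 5$, 20000 runs, fixed seed) are hypothetical choices.

```python
import random


def mean_ratio_estimates(lam, mu, k, runs, seed=1):
    """Average DT/UT and (1 - 1/k) DT/UT over many simulated tests.

    Each test observes k failure-free times (rate lam) and k repair
    times (rate mu); UT and DT are the corresponding sums.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        ut = sum(rng.expovariate(lam) for _ in range(k))
        dt = sum(rng.expovariate(mu) for _ in range(k))
        total += dt / ut
    mean_ml = total / runs                       # mean of the estimate DT/UT
    return mean_ml, (1.0 - 1.0 / k) * mean_ml    # mean of the corrected estimate


if __name__ == "__main__":
    ml, corrected = mean_ratio_estimates(lam=0.01, mu=1.0, k=5, runs=20000)
    print("true lambda/mu      :", 0.01)
    print("mean of DT/UT       :", ml)
    print("mean of (1-1/k)DT/UT:", corrected)
```

With these values the mean of $DT/UT$ should lie near $0.01 \cdot 5/4 = 0.0125$, while the corrected estimate should average close to the true ratio $0.01$.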

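Example A8.11 Give the degrees of freedom of the F-distribution for the case of Erlangian distributed failure-free and repair times with parameters $\lambda^*, \beta_\lambda$ and $\mu^*, \beta_\mu$, respectively (with $\lambda^* = \lambda\beta_\lambda$ and $\mu^* = \mu\beta_\mu$ because of the unchanged MTTF and MTTR).

Solution Let $\tau_1, \dots, \tau_n$ be the exponentially distributed failure-free times with mean $MTTF = 1/\lambda$. If the actual failure-free times are Erlangian distributed with parameters $\lambda^*, \beta_\lambda$ and mean $MTTF = \beta_\lambda/\lambda^* = 1/\lambda$ (Appendix A6.10.3, Table A6.1), the quantity

$$2\lambda^* (\tau_{11} + \tau_{12} + \dots + \tau_{1\beta_\lambda} + \dots + \tau_{n1} + \tau_{n2} + \dots + \tau_{n\beta_\lambda}),$$

corresponding to the sum of $n$ Erlangian distributed failure-free times, has a $\chi^2$ distribution with $\nu = 2n\beta_\lambda$ degrees of freedom (Eq. (A6.104)). The same holds for the repair times $\tau'_i$. Thus, the quantity

$$\frac{PA}{\overline{PA}} \cdot \frac{DT}{UT} = \frac{\mu^*}{\lambda^*} \cdot \frac{2\,(\tau'_{11} + \tau'_{12} + \dots + \tau'_{1\beta_\mu} + \dots + \tau'_{n1} + \tau'_{n2} + \dots + \tau'_{n\beta_\mu})/2n\beta_\mu}{2\,(\tau_{11} + \tau_{12} + \dots + \tau_{1\beta_\lambda} + \dots + \tau_{n1} + \tau_{n2} + \dots + \tau_{n\beta_\lambda})/2n\beta_\lambda},$$

obtained from Eq. (A6.84) by considering $\lambda = \lambda^*/\beta_\lambda$ and $\mu = \mu^*/\beta_\mu$, has an F-distribution with $\nu_1 = 2n\beta_\mu$ and $\nu_2 = 2n\beta_\lambda$ degrees of freedom (Appendix A9.4).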

A8.3.2 Goodness-of-fit Tests for Completely Specified $F_0(t)$

Goodness-of-fit tests have the purpose to verify agreement of observed data with a postulated (completely specified or only partially known) model. A typical example is as follows: Given $t_1, \dots, t_n$ as $n$ (stochastically) independent observations of a random variable $\tau$, a rule is sought to test the null hypothesis

$$H_0: \text{the distribution function of } \tau \text{ is } F_0(t), \tag{A8.94}$$

against the alternative hypothesis

$$H_1: \text{the distribution function of } \tau \text{ is not } F_0(t). \tag{A8.95}$$

$F_0(t)$ can be completely defined (as in this section) or depend on some unknown parameters which must be estimated from the observed data (as in the next section). In general, less can be said about the risk of accepting a false hypothesis $H_0$ (to compute the type II error $\beta$, a specific alternative hypothesis $H_1$ must be assumed). For some distribution functions used in reliability theory, particular procedures have been developed, often with different alternative hypotheses $H_1$ and investigation of the corresponding test power, see e.g. [A8.1, A8.9, A8.23]. Among the distribution-free procedures, the Kolmogorov-Smirnov, Cramér - von Mises, and chi-square ($\chi^2$) tests are frequently used in practical applications to solve the goodness-of-fit problem given by Eqs. (A8.94) & (A8.95). These tests are based upon comparison of the empirical distribution function (EDF) $\hat F_n(t)$, defined by Eq. (A8.1), with a postulated distribution function $F_0(t)$.

1. The Kolmogorov-Smirnov test uses the (supremum) statistic

$$D_n = \sup_{-\infty < t < \infty} |\hat F_n(t) - F_0(t)|$$

introduced in Appendix A8.1.1. A. N. Kolmogorov showed [A8.20] that if $F_0(t)$ is continuous, the distribution of $D_n$ under the hypothesis $H_0$ is independent of $F_0(t)$. For a given type I error $\alpha$, the hypothesis $H_0$ must be rejected for

$$D_n > y_{1-\alpha}, \tag{A8.97}$$

where $y_{1-\alpha}$ is defined by

$$\Pr\{D_n > y_{1-\alpha} \mid H_0 \text{ is true}\} = \alpha. \tag{A8.98}$$

Values for $y_{1-\alpha}$ are given in Tables A8.1 and A9.5. Figure A8.10 illustrates the Kolmogorov-Smirnov test with hypothesis $H_0$ not rejected. Because of its graphical visualization, in particular when probability charts are used (Appendix A8.1.3, Section 7.5), the Kolmogorov-Smirnov test is often used in reliability data analysis.

2. The Cramér - von Mises test uses the (quadratic) statistic

$$W_n^2 = n \int_{-\infty}^{\infty} [\hat F_n(t) - F_0(t)]^2\, f_0(t)\, dt.$$

As in the case of the $D_n$ statistic, for $F_0(t)$ continuous the distribution of $W_n^2$ is independent of $F_0(t)$ and tabulated (see for instance [A9.5]). The Cramér - von Mises statistic belongs to the so-called quadratic statistics defined by

$$Q_n = n \int_{-\infty}^{\infty} [\hat F_n(t) - F_0(t)]^2\, \psi(t)\, f_0(t)\, dt,$$

where $f_0(t) = dF_0(t)/dt$ and $\psi(t)$ is a suitable weight function. $\psi(t) = 1$ yields the $W_n^2$ statistic and $\psi(t) = [F_0(t)(1 - F_0(t))]^{-1}$ yields the Anderson-Darling statistic $A_n^2$. Using the transformation $z_{(i)} = F_0(t_{(i)})$, calculation of $W_n^2$ and in particular of $A_n^2$ becomes easy, see e.g. [A8.10]. This transformation can also be used for the Kolmogorov-Smirnov test, although here no change occurs in $D_n$.

Figure A8.10 Kolmogorov-Smirnov test ($n = 20$, $\alpha = 20\%$)

3. The chi-square ($\chi^2$) goodness-of-fit test starts from a selected partition $(a_1, a_2], (a_2, a_3], \dots, (a_k, a_{k+1}]$ of the set of possible values of $\tau$ and uses the statistic

$$X_n^2 = \sum_{i=1}^{k} \frac{(k_i - n p_i)^2}{n p_i} = \sum_{i=1}^{k} \frac{k_i^2}{n p_i} - n, \tag{A8.101}$$

where $k_i$ is the number of observations (realizations of $\tau$) in $(a_i, a_{i+1}]$ and

$$n p_i = n\, (F_0(a_{i+1}) - F_0(a_i)) \tag{A8.103}$$

is the expected number of observations in $(a_i, a_{i+1}]$ (obviously $k_1 + \dots + k_k = n$ and $p_1 + \dots + p_k = 1$). Under the hypothesis $H_0$, K. Pearson [A8.27] has shown that the asymptotic distribution of $X_n^2$ for $n \to \infty$ is a $\chi^2$ distribution with $k - 1$ degrees of freedom. Thus, for given type I error $\alpha$,

$$\lim_{n \to \infty} \Pr\{X_n^2 > \chi^2_{k-1,1-\alpha} \mid H_0 \text{ true}\} = \alpha \tag{A8.104}$$

holds, and the hypothesis $H_0$ must be rejected if

$$X_n^2 > \chi^2_{k-1,1-\alpha}.$$

$\chi^2_{k-1,1-\alpha}$ is the $(1 - \alpha)$ quantile of the $\chi^2$ distribution with $k - 1$ degrees of freedom (Table A9.2). The classes $(a_1, a_2], (a_2, a_3], \dots, (a_k, a_{k+1}]$ are to be chosen before the test is performed, in such a way that all $p_i$ are approximately equal. Convergence is generally good, even for relatively small values of $n$ ($n p_i \ge 5$). Thus, by selecting the classes $(a_1, a_2], (a_2, a_3], \dots, (a_k, a_{k+1}]$ one should take care that all $n p_i$ (Eq. (A8.103)) are almost equal and $\ge 5$.

Example A8.12 shows an application of the chi-square test. When in a goodness-of-fit test the deviation between $\hat F_n(t)$ and $F_0(t)$ seems abnormally small, a verification against superconform (superuniform if the transformation $z_{(i)} = F_0(t_{(i)})$ is used) can become necessary. Tabulated values for the lower limit $l_{1-\alpha}$ for $D_n$ are e.g. in [A8.1] (for instance, $\alpha = 0.1 \to l_{1-\alpha} \approx 0.57/\sqrt{n}$).
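The EDF statistics discussed in points 1 and 2 above are easy to compute once the ordered observations are transformed to $z_{(i)} = F_0(t_{(i)})$. The Python sketch below is an illustration only. The function names are hypothetical, and the computational forms used for $W_n^2$ and $A_n^2$ are the usual ones from the EDF literature, not formulas quoted in this section: $W_n^2 = 1/(12n) + \sum_i (z_{(i)} - (2i-1)/(2n))^2$ and $A_n^2 = -n - (1/n)\sum_i (2i-1)[\ln z_{(i)} + \ln(1 - z_{(n+1-i)})]$. The data and the Weibull hypothesis are taken from Example A8.12.

```python
from math import exp, log, sqrt


def ks_statistic(sample, cdf):
    """D_n = sup |F_n(t) - F0(t)| for a continuous F0."""
    z = sorted(cdf(t) for t in sample)
    n = len(z)
    # the EDF jumps from (i-1)/n to i/n at the i-th ordered observation
    return max(max(i / n - zi, zi - (i - 1) / n) for i, zi in enumerate(z, start=1))


def cvm_statistic(sample, cdf):
    """Cramer - von Mises statistic W_n^2 (computational form)."""
    z = sorted(cdf(t) for t in sample)
    n = len(z)
    return 1.0 / (12 * n) + sum(
        (zi - (2 * i - 1) / (2 * n)) ** 2 for i, zi in enumerate(z, start=1)
    )


def ad_statistic(sample, cdf):
    """Anderson-Darling statistic A_n^2 (computational form)."""
    z = sorted(cdf(t) for t in sample)
    n = len(z)
    s = sum(
        (2 * i - 1) * (log(z[i - 1]) + log(1.0 - z[n - i])) for i in range(1, n + 1)
    )
    return -n - s / n


def weibull_cdf(t):
    """Hypothesis H0 of Example A8.12: F0(t) = 1 - exp(-(1e-3 t)^1.2)."""
    return 1.0 - exp(-((1e-3 * t) ** 1.2))


if __name__ == "__main__":
    data = [59, 71, 153, 235, 347, 589, 837, 913, 1185, 1273, 1399, 1713, 2567]
    d_n = ks_statistic(data, weibull_cdf)
    limit = 1.22 / sqrt(len(data))  # large-n value for alpha = 0.1 (Table A8.1)
    print("D_n =", d_n, " limit =", limit, " reject H0:", d_n > limit)
    print("W_n^2 =", cvm_statistic(data, weibull_cdf))
    print("A_n^2 =", ad_statistic(data, weibull_cdf))
```

For the 13 capacitor lifetimes, $D_n$ comes out at about 0.15, well below $1.22/\sqrt{13} \approx 0.34$, so $H_0$ is not rejected; note that $1.22/\sqrt{n}$ is only the large-sample approximation, and exact values for small $n$ should be taken from the tables.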

Example A8.12 Accelerated life testing of a wet Al electrolytic capacitor leads to the following 13 ordered observations of lifetime: 59, 71, 153, 235, 347, 589, 837, 913, 1185, 1273, 1399, 1713, and 2567 h. Using the chi-square test and the 4 classes $(0, 200]$, $(200, 600]$, $(600, 1200]$, $(1200, \infty)$, verify at the level $\alpha = 0.1$ (i.e. with first kind error $\alpha = 0.1$) whether or not the failure-free time $\tau$ of the capacitors is distributed according to the Weibull distribution $F_0(t) = \Pr\{\tau \le t\} = 1 - e^{-(10^{-3} t)^{1.2}}$ (hypothesis $H_0: F_0(t) = 1 - e^{-(10^{-3} t)^{1.2}}$).

Solution The given classes yield numbers of observations of $k_1 = 3$, $k_2 = 3$, $k_3 = 3$, and $k_4 = 4$. The numbers of expected observations in each class are, according to Eq. (A8.103), $n p_1 = 1.754$, $n p_2 = 3.684$, $n p_3 = 3.817$, and $n p_4 = 3.745$. From Eq. (A8.101) it follows that $X_{13}^2 = 1.204$ and from Table A9.2, $\chi^2_{3,0.9} = 6.251$. $H_0: F_0(t) = 1 - e^{-(10^{-3} t)^{1.2}}$ can be accepted since $X_n^2 < \chi^2_{k-1,1-\alpha}$ (in agreement with Example 7.15).
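The numbers of Example A8.12 can be recomputed in a few lines. The following Python sketch is an illustration (function names are hypothetical); it counts the observations per class, evaluates the expected numbers $n p_i$ as per Eq. (A8.103), and forms the statistic of Eq. (A8.101). The quantile $\chi^2_{3,0.9} = 6.251$ is taken from the text, not computed.

```python
from math import exp, inf


def weibull_cdf(t, lam=1e-3, beta=1.2):
    """Postulated distribution F0(t) = 1 - exp(-(lam t)^beta)."""
    return 1.0 - exp(-((lam * t) ** beta))


def chi_square_statistic(sample, bounds, cdf):
    """Return (counts k_i, expected numbers n p_i, X_n^2) for classes (a_i, a_{i+1}]."""
    n = len(sample)
    counts, expected = [], []
    for a, b in zip(bounds[:-1], bounds[1:]):
        counts.append(sum(1 for t in sample if a < t <= b))
        expected.append(n * (cdf(b) - cdf(a)))
    x2 = sum((k - e) ** 2 / e for k, e in zip(counts, expected))
    return counts, expected, x2


if __name__ == "__main__":
    data = [59, 71, 153, 235, 347, 589, 837, 913, 1185, 1273, 1399, 1713, 2567]
    classes = [0, 200, 600, 1200, inf]
    k, np_i, x2 = chi_square_statistic(data, classes, weibull_cdf)
    print("k_i   =", k)
    print("n p_i =", [round(e, 3) for e in np_i])
    print("X^2   =", round(x2, 3), " accept H0:", x2 < 6.251)
```

Note that the expected numbers here are below the recommended $n p_i \ge 5$, so the example is best read as a demonstration of the mechanics rather than of a well-conditioned test.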

A8.3.3 Goodness-of-fit Tests for a Distribution $F_0(t)$ with Unknown Parameters

The Kolmogorov-Smirnov test and the tests based on the quadratic statistics can be used with some modification when the underlying distribution function $F_0(t)$ is not completely known (unknown parameters). The distribution of the involved statistic $D_n$, $W_n^2$, $A_n^2$ must be calculated (often using Monte Carlo simulation) for each type of distribution and can depend on the true values of the parameters [A8.1]. For instance, in the case of an exponential distribution $F_0(t, \lambda) = 1 - e^{-\lambda t}$ with parameter $\lambda$ estimated as per Eq. (A8.28), $\hat\lambda = n/(t_1 + \dots + t_n)$, the values of $y_{1-\alpha}$ for the Kolmogorov-Smirnov test have to be modified from those given in Table A8.1, e.g. from $y_{1-\alpha} = 1.36/\sqrt{n}$ for $\alpha = 0.05$ and $y_{1-\alpha} = 1.22/\sqrt{n}$ for $\alpha = 0.1$ to [A8.1]


Also a modification of $D_n$ in $D'_n = (D_n - 0.2/n)(1 + 0.26/\sqrt{n} + 0.5/n)$ is recommended [A8.1]. A heuristic procedure is to use half of the sample (randomly selected) to estimate the parameters and continue with the whole sample and the basic procedure given in Appendix A8.3.2 [A8.11 (p. 59), A8.31].

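The chi-square ($\chi^2$) test offers a more general approach. Let $F_0(t, \theta_1, \dots, \theta_r)$ be the assumed distribution function, known up to the parameters $\theta_1, \dots, \theta_r$. If

the unknown parameters $\theta_1, \dots, \theta_r$ are estimated according to the maximum likelihood method on the basis of the observed frequencies $k_i$ using the multinomial distribution (Eq. (A6.124)), i.e. from the following system of $r$ algebraic equations (Example A8.13)

$$\sum_{i=1}^{k} \frac{k_i}{p_i(\theta_1, \dots, \theta_r)} \cdot \frac{\partial p_i(\theta_1, \dots, \theta_r)}{\partial \theta_j}\bigg|_{\theta = \hat\theta} = 0, \qquad j = 1, \dots, r, \tag{A8.107}$$

with $p_i = F_0(a_{i+1}, \theta_1, \dots, \theta_r) - F_0(a_i, \theta_1, \dots, \theta_r) > 0$, $p_1 + \dots + p_k = 1$, and $k_1 + \dots + k_k = n$,

$\partial p_i/\partial \theta_j$ and $\partial^2 p_i/\partial \theta_j \partial \theta_m$ exist ($i = 1, \dots, k$; $j, m = 1, \dots, r < k - 1$),

the matrix with elements $\partial p_i/\partial \theta_j$ is of rank $r$,

then the statistic

$$\hat X_n^2 = \sum_{i=1}^{k} \frac{(k_i - n \hat p_i)^2}{n \hat p_i} = \sum_{i=1}^{k} \frac{k_i^2}{n \hat p_i} - n,$$

calculated with $\hat p_i = F_0(a_{i+1}, \hat\theta_1, \dots, \hat\theta_r) - F_0(a_i, \hat\theta_1, \dots, \hat\theta_r)$, has under $H_0$ asymptotically for $n \to \infty$ a $\chi^2$ distribution with $k - 1 - r$ degrees of freedom [A8.15 (1924)], see Example 7.18 for a practical application. Thus, for a given type I error $\alpha$,

$$\lim_{n \to \infty} \Pr\{\hat X_n^2 > \chi^2_{k-1-r,1-\alpha} \mid H_0 \text{ true}\} = \alpha$$

holds, and the hypothesis $H_0$ must be rejected if

$$\hat X_n^2 > \chi^2_{k-1-r,1-\alpha}.$$

$\chi^2_{k-1-r,1-\alpha}$ is the $(1 - \alpha)$ quantile of the $\chi^2$ distribution with $k - 1 - r$ degrees of freedom. Calculation of the parameters $\theta_1, \dots, \theta_r$ directly from the observations $t_1, \dots, t_n$ can lead to wrong decisions.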

Example A8.13 Prove Eq. (A8.107).

Solution The observed frequencies $k_1, \dots, k_k$ in the classes $(a_1, a_2], (a_2, a_3], \dots, (a_k, a_{k+1}]$ result from $n$ trials, where each observation falls into one of the classes $(a_i, a_{i+1}]$ with probability $p_i = F_0(a_{i+1}, \theta_1, \dots, \theta_r) - F_0(a_i, \theta_1, \dots, \theta_r)$, $i = 1, \dots, k$. The multinomial distribution applies. Taking into account Eq. (A6.124),

$$\Pr\{\text{in } n \text{ trials } A_1 \text{ occurs } k_1 \text{ times}, \dots, A_k \text{ occurs } k_k \text{ times}\} = \frac{n!}{k_1! \cdots k_k!}\, p_1^{k_1} \cdots p_k^{k_k}$$

with $k_1 + \dots + k_k = n$ and $p_1 + \dots + p_k = 1$, the likelihood function (Eq. (A8.23)) becomes

$$L(p_1, \dots, p_k) = \frac{n!}{k_1! \cdots k_k!}\, p_1^{k_1} \cdots p_k^{k_k}, \qquad \ln L = \ln \frac{n!}{k_1! \cdots k_k!} + k_1 \ln p_1 + \dots + k_k \ln p_k,$$

with $p_i = p_i(\theta_1, \dots, \theta_r)$, $p_1 + \dots + p_k = 1$, and $k_1 + \dots + k_k = n$. Equation (A8.107) follows then from

$$\frac{\partial \ln L}{\partial \theta_j} = 0 \quad \text{for } \theta_j = \hat\theta_j \text{ and } j = 1, \dots, r,$$

which completes the proof. A practical application with $r = 1$ is given in Example 7.18.
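For $r = 1$ the condition $\partial \ln L/\partial\theta = 0$ is a single equation and can be solved numerically. The sketch below is a hypothetical illustration, not Example 7.18: it assumes an exponential model $F_0(t, \lambda) = 1 - e^{-\lambda t}$, reuses the class limits and frequencies of Example A8.12 purely as sample input, and solves $\sum_i (k_i/p_i)\, \partial p_i/\partial\lambda = 0$ by bisection. All function names are invented for this sketch.

```python
from math import exp, inf


def class_probabilities(lam, bounds):
    """p_i = F0(a_{i+1}) - F0(a_i) for F0(t) = 1 - exp(-lam t)."""
    surv = [0.0 if a == inf else exp(-lam * a) for a in bounds]
    return [s0 - s1 for s0, s1 in zip(surv[:-1], surv[1:])]


def score(lam, bounds, counts):
    """Sum over the classes of (k_i / p_i) * dp_i/dlam, i.e. d(ln L)/dlam."""

    def d_surv(a):  # derivative of exp(-lam a) with respect to lam
        return 0.0 if a == inf else -a * exp(-lam * a)

    p = class_probabilities(lam, bounds)
    dp = [d_surv(a) - d_surv(b) for a, b in zip(bounds[:-1], bounds[1:])]
    return sum(k * d / q for k, d, q in zip(counts, dp, p))


def ml_estimate(bounds, counts, lo=1e-5, hi=1e-2, iterations=200):
    """Root of the score by bisection; assumes score(lo) > 0 > score(hi)."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if score(mid, bounds, counts) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def chi_square_estimated(bounds, counts, lam_hat):
    """Statistic computed with the estimated class probabilities."""
    n = sum(counts)
    p_hat = class_probabilities(lam_hat, bounds)
    return sum((k - n * p) ** 2 / (n * p) for k, p in zip(counts, p_hat))


if __name__ == "__main__":
    bounds = [0, 200, 600, 1200, inf]   # class limits in h
    counts = [3, 3, 3, 4]               # observed frequencies k_i
    lam_hat = ml_estimate(bounds, counts)
    print("lambda_hat =", lam_hat)
    print("X^2 =", chi_square_estimated(bounds, counts, lam_hat),
          "with k - 1 - r =", len(counts) - 2, "degrees of freedom")
```

The bracket `[lo, hi]` is an assumption that fits these data; for other inputs it has to be chosen so that the score changes sign inside it.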

A9 Tables and Charts

A9.1 Standard Normal Distribution

Parameters: $E[\tau] = 0$, $\mathrm{Var}[\tau] = 1$, Modal value $= E[\tau]$

Properties: $\Phi(0) = 0.5$, $\Phi(-t) = 1 - \Phi(t)$, $\Phi(t_\alpha) = \alpha \Rightarrow t_{1-\alpha} = -t_\alpha$

Table A9.1 Standard normal distribution $\Phi(t)$ for $t = 0.00 - 2.99$

Examples: $\Pr\{\tau \le 2.33\} = 0.9901$; $\Pr\{\tau \le -1\} = 1 - \Pr\{\tau \le 1\} = 1 - 0.8413 = 0.1587$
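Where the table is not at hand, $\Phi(t)$ can be evaluated through the error function. The short Python sketch below (function name hypothetical) uses the identity $\Phi(t) = \frac{1}{2}[1 + \mathrm{erf}(t/\sqrt{2})]$ and rechecks the two examples above.

```python
from math import erf, sqrt


def std_normal_cdf(t):
    """Phi(t) for the standard normal distribution via the error function."""
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))


if __name__ == "__main__":
    print(round(std_normal_cdf(2.33), 4))   # Pr{tau <= 2.33}
    print(round(std_normal_cdf(-1.0), 4))   # Pr{tau <= -1} = 1 - Phi(1)
```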


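A9.2 $\chi^2$-Distribution (Chi-Square Distribution)

Definition: $F(t) = \Pr\{\chi^2_\nu \le t\} = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)} \int_0^t x^{\nu/2 - 1} e^{-x/2}\, dx$, $t > 0$, $F(0) = 0$, $\nu = 1, 2, \dots$ (degrees of freedom)

Parameters: $E[\chi^2_\nu] = \nu$, $\mathrm{Var}[\chi^2_\nu] = 2\nu$, Modal value $= \nu - 2$ ($\nu > 2$)

Relationships:
Normal distribution: $\chi^2_\nu = \frac{1}{\sigma^2} \sum_{i=1}^{\nu} (\xi_i - m)^2$, $\xi_1, \dots, \xi_\nu$ independent normal distributed with $E[\xi_i] = m$ and $\mathrm{Var}[\xi_i] = \sigma^2$
Poisson distribution: $\sum_{i=0}^{\nu/2 - 1} \frac{(t/2)^i}{i!}\, e^{-t/2} = 1 - F(t)$, $\nu = 2, 4, \dots$
Incomplete Gamma function (Appendix A9.6): $\gamma(\frac{\nu}{2}, \frac{t}{2}) = F(t)\, \Gamma(\frac{\nu}{2})$

Table A9.2 0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.975 quantiles of the $\chi^2$-distribution ($t_{\nu,q} = \chi^2_{\nu,q}$ for which $F(\chi^2_{\nu,q}) = q$; $\chi^2_{\nu,q} \approx (x + \sqrt{2\nu - 1})^2/2$ for $\nu > 100$)

Example: $\sum_{i=0}^{8} \frac{13^i}{i!}\, e^{-13} = 1 - F(26) \approx 0.1$ for $\nu = 18$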
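The relationship with the Poisson distribution gives a closed form for $F(t)$ whenever $\nu$ is even. The Python sketch below (function name hypothetical, valid for even $\nu$ only) uses it to recheck the example with $\nu = 18$ and $t = 26$.

```python
from math import exp, factorial


def chi2_cdf_even(t, nu):
    """F(t) of the chi-square distribution for even nu, via the Poisson sum."""
    if nu <= 0 or nu % 2 != 0:
        raise ValueError("this closed form holds for nu = 2, 4, ... only")
    half = t / 2.0
    poisson_sum = sum(half**i / factorial(i) for i in range(nu // 2)) * exp(-half)
    return 1.0 - poisson_sum


if __name__ == "__main__":
    print(1.0 - chi2_cdf_even(26.0, 18))   # 1 - F(26) for nu = 18, about 0.1
```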


A9.3 t-Distribution (Student distribution)

Parameters: $E[t_\nu] = 0$, $\mathrm{Var}[t_\nu] = \frac{\nu}{\nu - 2}$ ($\nu > 2$), Modal value $= 0$

Properties: $F(0) = 0.5$, $F(-t) = 1 - F(t)$

Relationships:
Normal distribution and $\chi^2$-distribution: $t_\nu = \xi/\sqrt{\chi^2_\nu/\nu}$; $\xi$ is normal distributed with $E[\xi] = 0$ and $\mathrm{Var}[\xi] = 1$; $\chi^2_\nu$ is $\chi^2$ distributed with $\nu$ degrees of freedom, $\xi$ and $\chi^2_\nu$ independent
Cauchy distribution: $F(t)$ with $\nu = 1$

Table A9.3 0.7, 0.8, 0.9, 0.95, 0.975, 0.99, 0.995, 0.999 quantiles of the t-distribution

Examples: $F(t_{16,0.9}) = 0.9 \to t_{16,0.9} = 1.3368$; $F(t_{16,0.1}) = 0.1 \to t_{16,0.1} = -t_{16,0.9} = -1.3368$


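A9.4 F-Distribution (Fisher distribution)

Definition: $F(t) = \Pr\{F_{\nu_1,\nu_2} \le t\} = \frac{\Gamma(\frac{\nu_1 + \nu_2}{2})}{\Gamma(\frac{\nu_1}{2})\, \Gamma(\frac{\nu_2}{2})}\, \nu_1^{\nu_1/2}\, \nu_2^{\nu_2/2} \int_0^t \frac{x^{(\nu_1 - 2)/2}}{(\nu_1 x + \nu_2)^{(\nu_1 + \nu_2)/2}}\, dx$, $t > 0$, $F(0) = 0$, $\nu_1, \nu_2 = 1, 2, \dots$ (degrees of freedom)

Parameters: $E[F_{\nu_1,\nu_2}] = \frac{\nu_2}{\nu_2 - 2}$ ($\nu_2 > 2$), $\mathrm{Var}[F_{\nu_1,\nu_2}] = \frac{2\nu_2^2\, (\nu_1 + \nu_2 - 2)}{\nu_1 (\nu_2 - 2)^2 (\nu_2 - 4)}$ ($\nu_2 > 4$)

Relationships:
$\chi^2$-distribution: $F_{\nu_1,\nu_2} = \frac{\chi^2_{\nu_1}/\nu_1}{\chi^2_{\nu_2}/\nu_2}$, with $\chi^2_{\nu_1}$ and $\chi^2_{\nu_2}$ as in Appendix A9.2
Binomial distribution: $\sum_{i=0}^{k} \binom{n}{i} p^i (1 - p)^{n-i} = 1 - F\Big(\frac{p\,(n - k)}{(1 - p)(k + 1)}\Big)$ with $\nu_1 = 2(k + 1)$ and $\nu_2 = 2(n - k)$

Table A9.4a 0.90 quantiles of the F-distribution ($t_{\nu_1,\nu_2,0.9} = F_{\nu_1,\nu_2,0.9}$ for which $F(t_{\nu_1,\nu_2,0.9}) = 0.9$)

Table A9.4b 0.95 quantiles of the F-distribution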

A9.5 Table for the Kolmogorov-Smirnov Test

$D_n = \sup_{-\infty < t < \infty} |\hat F_n(t) - F_0(t)|$, $\hat F_n(t)$ = empirical distribution function (Eq. (A8.1)), $F_0(t)$ = postulated continuous distribution function

Table A9.5 $1 - \alpha$ quantiles of the distribution function of $D_n$ ($\Pr\{D_n \le y_{1-\alpha} \mid H_0 \text{ true}\} = 1 - \alpha$)

$1.220/\sqrt{n}$, $1.630/\sqrt{n}$


A9.6 Gamma function

Definition: $\Gamma(z) = \int_0^\infty x^{z-1} e^{-x}\, dx$, $\mathrm{Re}(z) > 0$ (Euler's integral), solution of $\Gamma(z + 1) = z\, \Gamma(z)$ with $\Gamma(1) = 1$

Special values: $\Gamma(1/2) = \sqrt{\pi}$, $\Gamma(1) = \Gamma(2) = 1$

Factorial: $n! = 1 \cdot 2 \cdot 3 \cdots n = \Gamma(n + 1) = \sqrt{2\pi}\; n^{n + 1/2}\, e^{-n + \theta/12n}$, $0 < \theta < 1$ (Stirling's formula)

Relationships:
Beta function: $B(z, w) = \int_0^1 x^{z-1} (1 - x)^{w-1}\, dx = \frac{\Gamma(z)\, \Gamma(w)}{\Gamma(z + w)}$
Psi function: $\psi(z) = \frac{d\, (\ln \Gamma(z))}{dz}$
Incomplete Gamma function: $\gamma(z, t) = \int_0^t x^{z-1} e^{-x}\, dx$, $\mathrm{Re}(z) > 0$
$\chi^2$-distribution ($F(t)$ as in Appendix A9.2): $\gamma(\frac{\nu}{2}, \frac{t}{2}) = F(t)\, \Gamma(\frac{\nu}{2})$

Table A9.6 Gamma function for $1.00 \le t \le 1.99$ ($t$ real), for other values use $\Gamma(z + 1) = z\, \Gamma(z)$

t        0      1      2      3      4      5      6      7      8      9
1.00  1.0000  .9943  .9888  .9835  .9784  .9735  .9687  .9641  .9597  .9554
1.10   .9513  .9474  .9436  .9399  .9364  .9330  .9298  .9267  .9237  .9209
1.20   .9182  .9156  .9131  .9107  .9085  .9064  .9044  .9025  .9007  .8990
1.30   .8975  .8960  .8946  .8934  .8922  .8911  .8902  .8893  .8885  .8878
1.40   .8873  .8868  .8863  .8860  .8858  .8857  .8856  .8856  .8857  .8859
1.50   .8862  .8866  .8870  .8876  .8882  .8889  .8896  .8905  .8914  .8924
1.60   .8935  .8947  .8959  .8972  .8986  .9001  .9017  .9033  .9050  .9068
1.70   .9086  .9106  .9126  .9147  .9168  .9191  .9214  .9238  .9262  .9288
1.80   .9314  .9341  .9368  .9397  .9426  .9456  .9487  .9518  .9551  .9584
1.90   .9618  .9652  .9688  .9724  .9761  .9799  .9837  .9877  .9917  .9958

Examples: $\Gamma(1.25) = 0.9064$; $\Gamma(0.25) = \Gamma(1.25)/0.25 = 3.6256$; $\Gamma(2.25) = 1.25 \cdot \Gamma(1.25) = 1.133$
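The reduction of an arbitrary positive argument to the tabulated range $1 \le t < 2$ with $\Gamma(z + 1) = z\, \Gamma(z)$ can be sketched as follows. This is an illustration only: the function name is hypothetical, and Python's `math.gamma` stands in for a lookup in Table A9.6.

```python
from math import gamma


def gamma_via_recursion(z, table=gamma):
    """Gamma(z) for real z > 0, using only values of `table` on [1, 2)."""
    if z <= 0.0:
        raise ValueError("this sketch handles z > 0 only")
    factor = 1.0
    while z >= 2.0:      # Gamma(z) = (z - 1) * Gamma(z - 1)
        z -= 1.0
        factor *= z
    while z < 1.0:       # Gamma(z) = Gamma(z + 1) / z
        factor /= z
        z += 1.0
    return factor * table(z)


if __name__ == "__main__":
    for z in (1.25, 0.25, 2.25):
        print(z, round(gamma_via_recursion(z), 4))
```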

A9.7 Laplace Transform

Definition: $\tilde F(s) = \int_0^\infty e^{-st} F(t)\, dt$; $F(t)$ defined for $t \ge 0$, piecewise continuous, $|F(t)| < A e^{Bt}$ ($0 < A, B < \infty$)

Inverse transf.: $F(t) = \frac{1}{2\pi i} \int_{C - i\infty}^{C + i\infty} \tilde F(s)\, e^{st}\, ds$; $\tilde F(s)$ exists in the half-plane $\mathrm{Re}(s) = C > B$, $i = \sqrt{-1}$

Moment generating function: Considering $f(t)$ as density of $\tau > 0$, it follows (under weak conditions) that

$$\tilde f(s) = \int_0^\infty e^{-st} f(t)\, dt = E[e^{-s\tau}] = E\Big[\sum_{k=0}^{\infty} \frac{(-s\tau)^k}{k!}\Big] = \sum_{k=0}^{\infty} \frac{(-s)^k\, E[\tau^k]}{k!};$$

thus, $E[\tau^k]/k!$ is, except for the sign, the $k$th coefficient of the MacLaurin expansion of $\tilde f(s)$; for arbitrary $\tau$, the characteristic function $E[e^{it\tau}] = \int_{-\infty}^{\infty} e^{itx} f(x)\, dx$ applies

Table A9.7a Properties of the Laplace Transform (columns: Transform Domain, Time Domain)

Linearity
Scale Change
Shift
Differentiation
Integration
Convolution ($F_1 * F_2$)
Initial Val. Theorem*: $\lim_{s \to \infty} s\, \tilde F(s)$
Final Val. Theorem*: $\lim_{s \downarrow 0} s\, \tilde F(s)$

* Existence of the limit is assumed; ** $u(t)$ is the unit step function (Table A9.7b)

Table A9.7b Important Laplace Transforms (columns: Transform Domain, Time Domain)

$F(t)$ (understood as $u(t)\, F(t)$, with $u(t)$ as unit step)
Impulse $\delta(t)$ (for $a > 0$, $\delta(t - a) \Rightarrow e^{-sa}$)
Unit step $u(t)$ ($u(t) = 0$ for $t < 0$, $u(t) = 1$ for $t \ge 0$) (e.g. $\lambda e^{-\lambda t}\, [1 - u(t - a)] \Rightarrow \frac{\lambda}{s + \lambda}\, (1 - e^{-(s + \lambda) a})$)
$\frac{a}{b}\, t + \frac{b - a}{b^2}\, (1 - e^{-bt})$
$1 - e^{-\mu t}$ for $0 \le t < A$ (truncated distribution function)


A9.8 Probability Charts

A distribution function appears as a straight line when plotted on a probability chart belonging to its family. The use of probability charts (probability plot papers) simplifies the analysis and interpretation of data, in particular of life times or failure-free times (failure-free operating times). In the following the charts for lognormal, Weibull, and normal distributions are given.

A9.8.1 Lognormal Probability Chart

The distribution function (Eq. (A6.110), Table A6.1)

$$F(t) = \frac{1}{\sigma\sqrt{2\pi}} \int_0^t \frac{1}{y}\, e^{-(\ln y + \ln \lambda)^2/2\sigma^2}\, dy = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\ln(\lambda t)/\sigma} e^{-x^2/2}\, dx, \qquad t > 0,\; F(0) = 0;\; \lambda, \sigma > 0$$

appears as a straight line on the chart of Fig. A9.1 ($\lambda$ in $h^{-1}$). For $F(t) = 0.5$, $\lambda t_{0.5} = 1$ and thus $\lambda = 1/t_{0.5}$; moreover, for $F(t) = 0.99$, $\ln(t_{0.99}/t_{0.5})/\sigma = 2.33$ and thus $\sigma = \ln(t_{0.99}/t_{0.5})/2.33$ (this can be used for a graphical estimation of $\lambda$ and $\sigma$).

Figure A9.1 Lognormal probability chart


A9.8.2 Weibull Probability Chart

The distribution function $F(t) = 1 - e^{-(\lambda t)^\beta}$, $t > 0$, $F(0) = 0$, $\lambda, \beta > 0$ (Eq. (A6.89), Table A6.1) appears as a straight line on the chart of Fig. A9.2 ($\lambda$ in $h^{-1}$), see Appendix A8.1.3. On the dashed line one has $\lambda = 1/t$; moreover, $\beta$ appears on the scale $\log_{10} \log_{10}\big(\frac{1}{1 - F(t)}\big)$ when $t$ is varied by one decade (Figs. A8.2, 7.12, 7.13).

Figure A9.2 Weibull probability chart


A9.8.3 Normal Probability Chart

The distribution function (Eq. (A6.105), Table A6.1)

$$F(t) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{t} e^{-(y - m)^2/2\sigma^2}\, dy = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{(t - m)/\sigma} e^{-x^2/2}\, dx$$

appears as a straight line on the chart of Fig. A9.3. For $F(t) = 0.5$, $t_{0.5} - m = 0$ and thus $m = t_{0.5}$; moreover, for $F(t) = 0.99$, $(t_{0.99} - t_{0.5})/\sigma = 2.33$ and thus $\sigma = (t_{0.99} - t_{0.5})/2.33$. For a statistical evaluation of data, it is often useful to estimate $m$ and $\sigma$ as per Eqs. (A8.6), (A8.8), (A6.108) and to operate with $\Phi(t)$.

Figure A9.3 Normal probability chart (standard normal distribution)

A10 Basic Technological Component's Properties

Table A10.1 gives some basic technological properties of electronic components to support reliability evaluations.

Table A10.1 Basic technological properties of electronic components

Fixed resistors

Carbon film
Technology, characteristics: A layer of carbon film deposited at high temperature on ceramic rods; ±5% usual; medium TC; relatively low drift (−1 to +4%); failure modes: opens, drift, rarely shorts; elevated noise; 1 Ω to 22 MΩ; low λ (0.2 to 0.4 FIT)
Sensitive to: Load, temperature, overvoltage, freq. (> 50 MHz), moisture
Application: Low power (≤ 1 W), moderate temperature (45°C) and frequency (≤ 50 MHz)

Metal film
Technology, characteristics: Evaporated NiCr film deposited on aluminum oxide ceramic; ±5% usual; low TC; low drift (±1%); failure modes: drift, opens, rarely shorts; low noise; 10 Ω to 2.4 MΩ; low λ (0.2 FIT)
Sensitive to: Load, temperature, current peaks, ESD, moisture
Application: Low power (≤ 0.5 W), high accuracy and stability, high freq. (≤ 500 MHz)

Wire-wound
Technology, characteristics: Usually NiCr wire wound on glass fiber substrate (sometimes ceramic); precision (±0.1%) or power (±5%); low TC; failure modes: opens, rarely shorts between adjacent windings; low noise; 0.1 Ω to 250 kΩ; medium λ (2 to 4 FIT)
Sensitive to: Load, temperature, overvoltage, mechanical stress (wire < 25 μm), moisture
Application: High power, high stability, low frequency (≤ 20 kHz)

Thermistors (PTC, NTC)
Technology, characteristics: PTC: Ceramic materials (BaTiO3 or SrTiO3 with metal salts) sintered at high temperatures, showing strong increase of resistance (10^3 to 10^4) within 50°C; medium λ (4 to 10 FIT, large values for disk and rod packages). NTC: Rods pressed from metal oxides and sintered at high temperature, large neg. TC (TC ~ −1/T^2); failure rate as for PTC
Sensitive to: Current and voltage load, moisture
Application: PTC: Temperature sensor, overload protection, etc. NTC: Compensation, control, regulation, stabilization

Variable resistors

Cermet pot., cermet trim
Technology, characteristics: Metallic glazing (often ruthenium oxide) deposited as a thick film on ceramic rods and fired at about 800°C; usually ±10%; poor linearity (5%); medium TC; failure modes: opens, localized wearout, drift; relatively high noise (increases with age); 20 Ω to 2 MΩ; low to medium λ (5 to 20 FIT)

Wirewound pot.
Technology, characteristics: CuNi / NiCr wire wound on ceramic rings or cylinders (spindle-operated potentiom.); normally ±10%; good linearity (1%); precision or power; low, nonlinear TC; low drift; failure modes: opens, localized wearout, relatively low noise; 10 Ω to 50 kΩ; medium to large λ (10 to 100 FIT)

Sensitive to (variable resistors): Load, current, fritting voltage (< 1.5 V), temperature, vibration, noise, dust, moisture, frequency (wire)
Application (variable resistors): Should only be employed when there is a need for adjustment during operation, fixed resistors have to be preferred for calibration during testing; load capability proportional to the part of the resistor used

Capacitors

Plastic (KS, KP, KT, KC)
Technology, characteristics: Wound capacitors with plastic film (K) of polystyrene (S), polypropylene (P), polyethylene-terephthalate (T) or polycarbonate (C) as dielectric and Al foil; very low loss factor (S, P, C); failure modes: opens, shorts, drift; pF to μF; low λ (1 to 3 FIT)
Sensitive to: Voltage stress, pulse stress (T, C), temperature (S, P), moisture* (S, P), cleaning agents (S)
Application: Tight capacitance tolerances, high stability (S, P), low loss (S, P), well-defined temperature coefficient

Metallized plastic (MKP, MKT, MKC, MKU)
Technology, characteristics: Wound capacitors with metallized film (MK) of polypropylene (P), polyethylene-terephthalate (T), polycarbonate (C) or cellulose acetate (U); self-healing; low loss factor; failure modes: opens, shorts; nF to μF; low λ (1 to 2 FIT)
Sensitive to: Voltage stress, frequency (T, C, U), temperature (P), moisture* (P, U)
Application: High capacitance values, low loss, relatively low frequencies (< 20 kHz for T, U)

Metallized paper (MP, MKV)
Technology, characteristics: Wound capacitors with metallized paper (MP) and in addition polypropylene film as dielectric (MKV); self-healing; low loss factor; failure modes: shorts, opens, drift; 0.1 μF to mF; low λ (1 to 3 FIT)
Sensitive to: Voltage stress and temperature (MP), moisture
Application: Coupling, smoothing, blocking (MP), oscillator circuits, commutation, attenuation (MKV)

Ceramic
Technology, characteristics: Often manufactured as multilayer capacitors with metallized ceramic layers by sintering at high temperature with controlled firing process (class 1: εr < 200, class 2: εr > 200); very low loss factor (class 1); temperature compensation (class 1); high resonance frequency; failure modes: shorts, drift, opens; pF to μF; low λ (0.5 to 2 FIT)
Sensitive to: Voltage stress, temperature (even during soldering), moisture*, aging at high temperature (class 2)
Application: Class 1: high stability, low loss, low aging; class 2: coupling, smoothing, buffering, etc.

Tantalum (dry)
Technology, characteristics: Manufactured from a porous, oxidized cylinder (sintered tantalum powder) as anode, with manganese dioxide as electrolyte and a metal case as cathode; polarized; medium frequency-dependent loss factor; failure modes: shorts, opens, drift; 0.1 μF to mF; low to medium λ (2 to 5 FIT, 20 to 40 FIT for bead)
Sensitive to: Incorrect polarity, voltage stress, AC resistance (Z0) of the el. circuit (new types less sensitive), temperature, frequency (> 1 kHz), moisture*
Application: Relatively high capacitance per unit volume, high requirements with respect to reliability, Z0 ≥ 1 Ω/V

Aluminum (wet)
Technology, characteristics: Wound capacitors with oxidized Al foil (anode and dielectric) and conducting electrolyte (cathode); also available with two formed foils (nonpolarized); large, frequency and temperature dependent loss factor; failure modes: drift, shorts, opens; μF to 200 mF; medium to large λ (5 to 10 FIT); limited useful life (function of temperature and ripple)
Sensitive to: Incorrect polarity (if polarized), voltage stress, temperature, cleaning agent (halogen), storage time, frequency (> 1 kHz), moisture*
Application: Very high capacitance per unit volume, uncritical applications with respect to stability, relatively low ambient temperature (0 to 55°C)

Diodes (Si)

General purpose
Technology, characteristics: PN junction produced from high purity Si by diffusion; diode function based on the recombination of minority carriers in the depletion regions; failure modes: shorts, opens; low λ (1 to 3 FIT for θJ = 40°C, 10 FIT for rectifiers with θJ = 100°C)
Sensitive to: Forward current, reverse voltage, temperature, transients, moisture*
Application: Signal diodes (analog, switch), rectifier, fast switching diodes (Schottky, ...)

Zener
Technology, characteristics: Heavily doped PN junction (charge carrier generation in strong electric field and rapid increase of the reverse current at low reverse voltages); failure modes: shorts, opens, drift; low to medium λ (2 to 4 FIT for voltage regulators (θJ = 40°C), 20 to 50 FIT for voltage ref. (θJ = 100°C))
Sensitive to: Load, temperature, moisture*
Application: Level control, voltage reference (allow for ±5% drift)

Transistors

Bipolar
Technology, characteristics: PNP or NPN junctions manufactured using planar technology (diffusion or ion implantation); failure modes: shorts, opens, thermal fatigue for power transistors; low to medium λ (2 to 6 FIT for θJ = 40°C, 20 to 60 FIT for power transistors and θJ = 100°C)
Sensitive to: Load, temperature, breakdown voltage (VBCEO, VBEBO), moisture*
Application: Switch, amplifier, power stage (allow for ±20% drift, ±500% for ICBO)

FET
Technology, characteristics: Voltage controlled semiconductor resistance, with control via diode (JFET) or isolated layer (MOSFET); transist. function based on majority carrier transport; N or P channel; depletion or enhancement type (MOSFET); failure modes: shorts, opens, drift; medium λ (3 to 10 FIT for θJ = 40°C, 30 to 60 FIT for power transistors and θJ = 100°C)
Sensitive to: Load, temperature, breakdown voltage, ESD, radiation, moisture*
Application: Switch (MOS) and amplifier (JFET) for high-resistance circuits (allow for ±20% drift)

Controlled rectifiers (thyristors, triacs, etc.)
Technology, characteristics: NPNP junctions with lightly doped inner zones (P, N), which can be triggered by a control pulse (thyristor), or a special antiparallel circuit consisting of two thyristors with a single firing circuit (triac); failure modes: drift, shorts, opens; large λ (20 to 100 FIT for θJ = 100°C)
Sensitive to: Temperature, reverse voltage, rise rate of voltage and current, commutation effects, moisture*
Application: Controlled rectifier, overvoltage and overcurrent protection (allow for ±20% drift)

Opto-semiconductors (LED, IRED, photo-sensitive devices, opto-couplers, etc.)
Technology, characteristics: Electrical/optical or optical/electrical converter made with photosensitive semiconductor components; transmitter (LED, IRED, laser diode etc.), receiver (photo-resistor, photo-transistor, solar cells etc.), opto-coupler, displays; failure modes: opens, drift, shorts; medium to large λ (2 to 100 FIT, 20 / no. of pixels for LCD); limited useful life
Sensitive to: Temperature, current, ESD, moisture*, mechanical stress
Application: Displays, sensors, galvanic separation, noise rejection (allow for ±30% drift)

Digital ICs

Bipolar
Technology, characteristics: Monolithic ICs with bipolar transistors (TTL, ECL, I²L), important AS TTL (6 mW, 2 ns, 1.3 V) and ALS TTL (1 mW, 3 ns, 1.8 V); VCC = 4.5 to 5.5 V; Zout < 150 Ω for both states; low to medium λ (2 to 6 FIT for SSI/MSI, 20 to 100 FIT for LSI/VLSI)
Sensitive to: Supply voltage, noise (> 1 V), temperature (0.5 eV), ESD, rise and fall times, breakdown BE diode, moisture*
Application: Fast logic (LS TTL, ECL) with uncritical power consump., rel. high cap. loading, θJ < 175°C (< 200°C for SOI)

MOS
Technology, characteristics: Monolithic ICs with MOS transistors, mainly N channel depletion type (formerly also P channel); often TTL compatible and therefore VDD = 4.5 to 5.5 V (100 μW, 10 ns); very high Zin; medium Zout (1 to 10 kΩ); medium to high λ (50 to 200 FIT)
Sensitive to: ESD, noise (> 2 V), temperature (0.4 eV), rise and fall times, radiation, moisture*
Application: Memories and microprocessors, high source impedance, low capacitive loading

CMOS
Technology, characteristics: Monolithic ICs with complementary enhancement-type MOS transistors; often TTL compatible and therefore VDD = 4.5 to 5.5 V; power consumption ~ f (10 μW at 10 kHz, VDD = 5.5 V, CL = 15 pF); fast CMOS (HCMOS, HCT) for 2 to 6 V with 6 ns at 5 V and 20 μW at 10 kHz; large static noise immunity (0.4 VDD); very high Zin; medium Zout (0.5 to 5 kΩ); low to medium λ (2 to 6 FIT for SSI/MSI, 10 to 100 FIT for LSI/VLSI)
Sensitive to: ESD, latch-up, temperature (0.4 eV), rise and fall times, noise (> 0.4 VDD), moisture*
Application: Low power consumption, high noise immunity, not extremely high frequency, high source impedance, low cap. load, θJ < 175°C, for memories: < 125°C

BiCMOS
Technology, characteristics: Monolithic ICs with bipolar and CMOS devices; trend to less than 2 V supplies; combine the advantages of both bipolar and CMOS technologies
Sensitive to: similar to CMOS
Application: similar to CMOS but also for very high frequencies

Analog ICs (operational amplifiers, comparators, voltage regulators, etc.)
Technology, characteristics: Monolithic ICs with bipolar and/or FET transistors for processing analog signals (operational amplifiers, special amplifiers, comparators, voltage regulators, etc.); up to about 200 transistors; often in metal packages; medium to high λ (10 to 50 FIT)
Sensitive to: Temperature (0.6 eV), input voltage, load current, moisture*
Application: Signal processing, voltage reg., low to medium power consump. (allow for drift), θJ < 175°C (< 125°C for low power)

Hybrid ICs (thick film, thin film)
Technology, characteristics: Combination of chip components (ICs, transistors, diodes, capacitors) on a thick film (5 to 20 μm) or thin film (0.2 to 0.4 μm) substrate with deposited resistors and connections; substrate area up to 10 cm²; medium to high λ (usually determined by the chip components)
Sensitive to: Manufacturing quality, temperature, mechanical stress, moisture*
Application: Compact and reliable devices, e.g. for avionics or automotive (allow for ±20% drift)

λ for θA = 40°C and GB, indicative values; for failure modes see also Table 3.4; * nonhermetic packages; ESD = electrostatic discharge; TC = temperature coefficient; λ in 10^-9 h^-1 for standard ind. envir.

A11 Problems for Home-Work

In addition to the 120 solved examples in this book, the following are some selected problems for home-work, ordered for Chapters 2, 4, 6, 7 and Appendices A6, A7, A8 (* denotes time-consuming).

Problem 2.1 Draw the reliability block diagram corresponding to the fault tree given by Fig. 6.39b (p. 271).

Problem 2.2 Compare the mean time to failure $MTTF_S$ and the reliability function $R_S(t)$ of the following two reliability block diagrams for the case nonrepairable and constant failure rate for elements $E_1, \dots, E_4$ (Hint: For a graphical comparison of $R_S(t)$, Fig. 2.7 can be modified for the 1-out-of-2 redundancy).

[Figure: two reliability block diagrams, one with a 1-out-of-2 active redundancy ($E_1 = E_2 = E_3 = E_4 = E$) and one with a 2-out-of-3 active redundancy ($E_1 = E_2 = E_3 = E$)]

Problem 2.3

Compare the mean time to failure $MTTF_S$ for cases 7 and 8 of Table 2.1 (p. 31) for $E_1 = \dots = E_5 = E$ and constant failure rates $\lambda_1 = \dots = \lambda_5 = \lambda$.

Problem 2.4 Compute the reliability function $R_S(t)$ for case 4 of Table 2.1 (p. 31) for $n = 3$, $k = 2$, $E_1 \ne E_2 \ne E_3$.

Problem 2.5* Demonstrate the result given by Eq. (2.62), p. 63, and apply this to the active and standby redundancy.

Problem 2.6* Compute the reliability function Rs(t) for the n circuit with bidirectional connections given below (Hint: Use Eq. (2.29) as for Example 2.15).

Problem 2.7* Give a realization for the circuit to detect the occurrence of the second failure in a majority redundancy 2-out-of-3 (Example 2.5, p. 47), allowing an expansion of a 2-out-of-3 to a 1-out-of-3 redundancy (Hint: Isolate the first failure and detect the occurrence of the second failure using e.g. 6 two-input AND, 3 two-input EXOR, 1 three-input OR, and adding a delay δ for an output pulse of width δ).


Problem 4.1 Compute the MTTR_S for case 5 of Table 2.1 (p. 31) for λ1 = 10^-? h^-1, λ2 = 10^-? h^-1, λ3 = 10^-2 h^-1, λ4 = 10^-3 h^-1, λ5 = 10^-? h^-1, λ6 = 10^-? h^-1, λ7 = 10^-? h^-1, and μ1 = ... = μ7 = 0.5 h^-1. Compare the obtained MTTR_S with the mean repair (restoration) duration at system level MDT_S (Hint: Use results of Table 6.10 to compute MTTF_S0 and PA_S, and assume (as an approximation) MUT_S = MTTF_S0 in Eq. (6.291)).

Problem 4.2 Give the number of spare parts necessary to cover with a probability γ ≥ 0.9 an operating time of 50,000 h for the system given by case 6 of Table 2.1 (p. 31) for λ1 = λ2 = λ3 = 10^-? h^-1, λv = 10^-? h^-1 (Hint: Assume equal allocation of γ between Ev and the 2-out-of-3 active redundancy).

Problem 4.3 Same as for Problem 4.2 by assuming that spare parts are repairable with μ1 = μ2 = μ3 = μv = 0.5 h^-1 (Hint: Consider only the case with R_S(t) and assume equal allocation of γ between Ev and the 2-out-of-3 active redundancy).

Problem 4.4 Give the number of spare parts necessary to cover with a probability γ ≥ 0.9 an operating time of 50,000 h for a 1-out-of-2 standby redundancy with constant failure rate λ = 10^-? h^-1 for the operating element (λ = 0 for the reserve element). Compare the results with those obtained for an active 1-out-of-2 redundancy with failure rate λ = 10^-? h^-1 for the active and the reserve element.
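For a single item with constant failure rate λ whose failed unit is replaced immediately by a spare, the number of failures in an operating time T is Poisson distributed with mean λT, so the number of spares is the smallest n whose cumulative Poisson probability reaches γ. The sketch below shows only this basic step; the function name and the numerical value of λ are assumptions chosen for illustration, and the redundant structures of Problems 4.2 to 4.4 need the additional reasoning asked for in the problems.

```python
import math

def spares_needed(lam, T, gamma):
    """Smallest n with P(at most n failures in operating time T) >= gamma.

    Assumes a Poisson failure count with mean lam*T and 0 < gamma < 1.
    """
    mean = lam * T
    term = math.exp(-mean)  # P(0 failures)
    cdf = term
    n = 0
    while cdf < gamma:
        n += 1
        term *= mean / n     # P(n failures) from P(n-1 failures)
        cdf += term
    return n

# Assumed value lam = 1e-3 1/h; T and gamma as in the problem statements.
print(spares_needed(lam=1e-3, T=50_000, gamma=0.9))
```

The recursion on `term` avoids computing large factorials and powers separately, which keeps the sum stable for means of a few tens.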

Problem 4.5 Give the number of spare parts necessary to cover with a probability γ ≥ 0.9 an operating time of 50,000 h for an item with Erlangian distributed failure-free times with λ = 10^-? h^-1 and n = 3 (Hint: Consider Appendix A6.10.3).

Problem 4.6* Develop the expression allowing the computation of the number of spare parts necessary to cover with a probability ≥ γ an operating time T for an item with failure-free times distributed according to a Gamma distribution (Hint: Consider Appendix A6.10.3, and Table A9.7b).

Problem 4.7* A series-system consists of operationally independent elements E1, ..., En with constant failure rates λ1, ..., λn. Let ci be the cost for a repair of element Ei. Give the mean (expected value) of the repair cost for the whole system during a total operating time T (Hint: Use results of Section 2.2.6.1 and Appendix A7.2.5).

Problem 4.8* A system has a constant failure rate λ and a constant repair rate μ. Compute the mean (expected value) of the repair cost during a total operating time T, given the fixed cost c0 for each repair. Assuming that down time for repair has a cost cd per hour, compute the mean value of the total cost for repair and down time during a total operating time T (Hint: Consider Appendices A7.2.5 and A7.8.4).
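One possible starting point for Problems 4.7 and 4.8, to be checked against the cited appendices: with constant failure rates, the failures of each element counted over cumulative operating time form a homogeneous Poisson process, so the expected number of repairs in T is the failure rate times T. The symbols C_repair and C_total are names introduced here for illustration, and 1/μ is taken as the mean down time per repair.

```latex
% Problem 4.7: series system, element E_i with failure rate \lambda_i and repair cost c_i
\mathrm{E}[C_{\mathrm{repair}}] = T \sum_{i=1}^{n} c_i \, \lambda_i

% Problem 4.8: fixed cost c_0 per repair, down-time cost c_d per hour, mean repair time 1/\mu
\mathrm{E}[C_{\mathrm{total}}] = \lambda T \left( c_0 + \frac{c_d}{\mu} \right)
```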


Problem 6.1 Compare the mean time to failure MTTF_S0 and the asymptotic & steady-state point and average availability PA_S = AA_S for the two reliability block diagrams of Problem 2.2, by assuming constant failure rate λ and constant repair rate μ for each element and only one repair crew (Hint: Use the results of Table 6.10).
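For a k-out-of-n active redundancy with identical elements, one repair crew, and no further failures at system down, the model is a birth and death process whose states count the failed elements. The following sketch computes the steady-state availability from the product-form solution and MTTF_S0 from the mean first-passage times. Function names and numerical values are assumptions for illustration, and the sketch does not reproduce the four-element diagram of Problem 2.2.

```python
def availability_one_crew(lam, mu, k, n):
    """Steady-state PA_S = AA_S of a k-out-of-n active redundancy, one repair crew."""
    down = n - k + 1                  # number of failed elements at system down
    weights = [1.0]                   # unnormalised state probabilities P_0, P_1, ...
    for j in range(down):
        fail_rate = (n - j) * lam     # n-j elements still operating in state j
        weights.append(weights[-1] * fail_rate / mu)
    return 1.0 - weights[-1] / sum(weights)

def mttf_one_crew(lam, mu, k, n):
    """MTTF_S0: mean time from all elements up to the first system failure."""
    down = n - k + 1
    total, h_prev = 0.0, 0.0
    for j in range(down):
        fail_rate = (n - j) * lam
        repair_rate = mu if j > 0 else 0.0
        # h_j = mean time to move from j to j+1 failed elements
        h_j = 1.0 / fail_rate + (repair_rate / fail_rate) * h_prev
        total += h_j
        h_prev = h_j
    return total

lam, mu = 1e-3, 0.5  # assumed rates in 1/h
for k, n in [(1, 2), (2, 3)]:
    print(f"{k}-out-of-{n}:", mttf_one_crew(lam, mu, k, n), availability_one_crew(lam, mu, k, n))
```

For these two structures the results can be compared with the closed forms (3λ + μ)/(2λ²) and (5λ + μ)/(6λ²) for MTTF_S0.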

Problem 6.2 Give the asymptotic & steady-state point and average availability PA_S = AA_S for the bridge given by Fig. 2.10, p. 53, by assuming identical and independent elements with constant failure rate λ and constant repair rate μ (each element has its own repair crew).
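When every element has its own repair crew the elements are independent, so the system availability follows from the structure of the bridge with each element's availability μ/(λ + μ) in place of its reliability. The sketch below does this by enumerating all element states. The element numbering (E1, E2 on the input side, E4, E5 on the output side, E3 as the bidirectional bridging element) is an assumption and may differ from Fig. 2.10.

```python
from itertools import product

# Minimal path sets of a bridge under the assumed numbering.
MIN_PATHS = [{1, 4}, {2, 5}, {1, 3, 5}, {2, 3, 4}]

def bridge_probability(p):
    """Probability that the bridge is up when each element is up with probability p."""
    total = 0.0
    for state in product([0, 1], repeat=5):
        up = {i + 1 for i, s in enumerate(state) if s}
        if any(path <= up for path in MIN_PATHS):
            prob = 1.0
            for s in state:
                prob *= p if s else (1.0 - p)
            total += prob
    return total

def bridge_availability(lam, mu):
    """PA_S = AA_S for identical, independent elements, one repair crew per element."""
    return bridge_probability(mu / (lam + mu))

print(bridge_availability(lam=1e-3, mu=0.5))  # assumed rates in 1/h
```

Enumeration is used instead of a closed-form polynomial so that the same code works unchanged if the path sets are edited to match a different numbering.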

Problem 6.3 Give the mean time to failure MTTF_S0 and the asymptotic & steady-state point and average availability PA_S = AA_S for the reliability block diagram given by case 5 of Table 2.1 (p. 31) by assuming constant failure rates λ1, ..., λ7 and constant repair rates μ1, ..., μ7: (i) For independent elements (Table 6.9, p. 225); (ii) Using results for macro-structures (Table 6.10, p. 227); (iii) Using a Markov model with only one repair crew, repair priority on elements E6 and E7, and no further failure at system down. Compare the results by means of numerical examples (Hint: For (iii), consider Point 2 of Section 6.8.8).

Problem 6.4 Develop the expressions for mean and variance of the down time in (0, t] for a repairable item with constant failure rate λ and constant repair rate μ, starting up at t = 0, i.e. prove Eq. (A7.220), p. 499.

Problem 6.5* Show that both diagrams of transition rates of Fig. 6.37 (p. 264) are equivalent for the computation of MTTF_S0. Is this also the case for the availability?

Problem 6.6* Give the asymptotic & steady-state point and average availability PA_S = AA_S for the n circuit with bidirectional connections given by Problem 2.6, by assuming identical and independent elements with constant failure rate λ and constant repair rate μ (each element has its own repair crew).

Problem 6.7* For the 1-out-of-2 warm redundancy of Fig. 6.8a (p. 191) show that Σ Pi MTTF_Si = P0 MTTF_S0 + P1 MTTF_S1 differs from MUT_S (Hint: Consider Appendix A7.5.4.1 or Point 9 in Section 6.8.8).

Problem 6.8* For the 1-out-of-2 warm redundancy given by Fig. 6.8a (p. 191) compute for states Z0, Z1, Z2: (i) The state probabilities 𝒫0, 𝒫1, 𝒫2 of the embedded Markov chain; (ii) The steady-state probabilities P0, P1, P2; (iii) The mean stay (sojourn) times T0, T1, T2; (iv) The mean recurrence times T00, T11, T22. Prove that T22 = MUT_S + T2 holds (with MUT_S from Eq. (6.287)) (Hint: Consider Appendices A7.5.3.3, A7.5.4.1, and A7.6).

Problem 6.9* Prove the results given by Eqs. (6.206) and (6.209), pp. 238 and 239.


Problem 7.1 For an incoming inspection one has to demonstrate a defective probability p = 0.01. Customer and producer agree AQL = 0.01 with producer risk α = 0.1. Give the sample size n for a number of acceptable defectives c = 0, 1, 2, 5, 10, 14. Compute the consumer risk β for the corresponding values of c (Hint: Use the Poisson approximation (Eq. (A6.129)) and Fig. 7.3).
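The sample size for an AQL plan can also be found numerically instead of graphically. With the Poisson approximation, a lot with defective probability p is accepted with probability equal to the Poisson cumulative sum up to c with mean np; the sketch below searches for the largest n for which this probability at p = AQL is still at least 1 − α. The function names are assumptions, and the search replaces the reading from Fig. 7.3 suggested in the hint.

```python
import math

def acceptance_probability(p, n, c):
    """P(accept) = P(at most c defectives in a sample of size n), Poisson approximation."""
    m = n * p
    term = math.exp(-m)
    total = term
    for i in range(1, c + 1):
        term *= m / i
        total += term
    return total

def sample_size_aql(aql, alpha, c):
    """Largest n with acceptance probability at p = AQL of at least 1 - alpha."""
    n = 0
    while acceptance_probability(aql, n + 1, c) >= 1.0 - alpha:
        n += 1
    return n

for c in (0, 1, 2, 5, 10, 14):
    print(c, sample_size_aql(aql=0.01, alpha=0.1, c=c))
```

For a given plan (n, c), `acceptance_probability(p, n, c)` evaluated at a larger defective probability p gives the corresponding consumer risk.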

Problem 7.2 For the demonstration of an MTBF = 1/λ = 4'000 h one agrees with the producer the following rule: MTBF0 = 4'000 h, MTBF1 = 2'000 h, α = β = 0.2. Give the cumulative test time T and the number c of allowed failures. How large would the acceptance probability be for a true MTBF of 5'000 h and of 1'500 h, respectively? (Hint: Use Table 7.3 and Fig. 7.3).

Problem 7.3 During an accelerated reliability test at an operating temperature θJ = 125°C, 3 failures have occurred within the cumulative test time of 100'000 h (failed devices have been replaced). Assuming an activation energy Ea = 0.5 eV, give for a constant failure rate λ the maximum likelihood point estimate and the confidence limits at the confidence levels γ = 0.8 and γ = 0.2 for θJ = 35°C. How large is the upper confidence limit at the confidence levels γ = 0.9 and γ = 0.6? (Hint: Use Eq. (7.56), Fig. 7.6, and Table A9.2).
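The point-estimate part of Problem 7.3 can be sketched numerically. The code below assumes the usual Arrhenius form of the acceleration factor, A = exp[(Ea/k)(1/T1 − 1/T2)] with absolute temperatures, and the point estimate k/T for a constant failure rate; the function name and the Boltzmann constant value are supplied here, and the confidence limits asked for in the problem are not computed.

```python
import math

K_BOLTZMANN_EV = 8.617333e-5  # Boltzmann constant in eV/K

def arrhenius_factor(ea_ev, theta_low_c, theta_high_c):
    """Acceleration factor between two temperatures given in degrees Celsius."""
    t_low = theta_low_c + 273.15
    t_high = theta_high_c + 273.15
    return math.exp(ea_ev / K_BOLTZMANN_EV * (1.0 / t_low - 1.0 / t_high))

failures, test_time = 3, 100_000.0                      # data of the problem
lam_125 = failures / test_time                          # point estimate at 125 °C, in 1/h
accel = arrhenius_factor(0.5, 35.0, 125.0)              # Ea = 0.5 eV
lam_35 = lam_125 / accel                                # transferred to 35 °C
print(accel, lam_125, lam_35)
```

Confidence limits obtained at the test temperature scale by the same factor, since the acceleration factor is treated as a known constant.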

Problem 7.4 For the demonstration of an MTTR one agrees with the producer the following rule: MTTR0 = 1 h, MTTR1 = 1.5 h, α = β = 0.2. Assuming a lognormal distribution for the repair times with σ² = 0.2, give the number of repairs and the allowed cumulative repair time. Draw the operating characteristic as a function of the true MTTR (Hint: Use results of Section 7.3.2).

Problem 7.5* For the demonstration of an MTBF = 1/λ = 10'000 h one agrees with the producer the following rule: MTBF = 10'000 h, acceptance risk 20%. Give the cumulative test time T for a number of allowed failures c = 0, 1, 2, 6 by assuming that the acceptance risk is: (i) The producer risk α (AQL case); (ii) The consumer risk β (LTPD case) (Hint: Use Fig. 7.3).

Problem 7.6* For a reliability test of a nonrepairable item, the following 20 failure-free times have been observed (ordered by increasing magnitude): 300, 580, 700, 900, 1'300, 1'500, 1'800, 2'000, 2'200, 3'000, 3'300, 3'800, 4'200, 4'600, 4'800, 5'000, 6'400, 8'000, 9'100, 9'800 h. Assuming a Weibull distribution, plot the values on a Weibull probability chart (p. 548) and determine graphically the parameters λ and β. Compute the maximum likelihood estimates for λ and β and draw the corresponding straight line. Draw the random band obtained using the Kolmogorov theorem (p. 508) for α = 0.2. Is it possible to affirm, or can one just believe, that the observed distribution function belongs to the Weibull family? (Hint: Use results in Appendix A8.1 and Section 7.5.1).
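The maximum likelihood part of Problem 7.6 has no closed-form solution for β and is usually solved numerically. The sketch below assumes the parameterization F(t) = 1 − exp(−(λt)^β), for which β solves Σ t^β ln t / Σ t^β − 1/β = (1/n) Σ ln t and λ = (n / Σ t^β)^(1/β). Bisection is used because the left-hand side minus the right-hand side is increasing in β. The function name and the search interval are assumptions.

```python
import math

times = [300, 580, 700, 900, 1300, 1500, 1800, 2000, 2200, 3000,
         3300, 3800, 4200, 4600, 4800, 5000, 6400, 8000, 9100, 9800]

def weibull_mle(t, lo=0.05, hi=20.0, tol=1e-10):
    """Maximum likelihood estimates (lam, beta) for complete (uncensored) data."""
    n = len(t)
    mean_log = sum(math.log(x) for x in t) / n
    t_max = max(t)

    def g(beta):
        # Scaling by t_max leaves the ratio unchanged and avoids large powers.
        w = [(x / t_max) ** beta for x in t]
        weighted_log = sum(wi * math.log(x) for wi, x in zip(w, t)) / sum(w)
        return weighted_log - 1.0 / beta - mean_log

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    beta = 0.5 * (lo + hi)
    lam = (n / sum(x ** beta for x in t)) ** (1.0 / beta)
    return lam, beta

lam_hat, beta_hat = weibull_mle(times)
print(lam_hat, beta_hat)
```

On a Weibull probability chart the fitted distribution appears as a straight line with slope β, which allows a visual comparison with the graphically determined parameters.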

Problem 7.7* For a repairable electromechanical system, the following arrival times t* of successive failures have been observed during T = 3'000 h: 450, 800, 1'400, 1'700, 1'950, 2'150, 2'450, 2'600, 2'850, 2'950 h. Test the hypothesis H0: the underlying point process is a HPP, against H1: the underlying process is a NHPP with increasing density. Fit a possible M(t) (Hint: Use results of Sections 7.6.3 & 7.7).
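One common way to test a homogeneous Poisson process against an increasing intensity is the Laplace trend test, sketched below for the data of Problem 7.7. This is offered as an illustration and is not necessarily the statistic used in the sections cited in the hint. Under H0 and for observation on a fixed interval (0, T], the arrival times behave like ordered uniform variables on (0, T), so their standardized mean is approximately standard normal; large positive values point to failures concentrating late in the interval.

```python
import math

arrivals = [450, 800, 1400, 1700, 1950, 2150, 2450, 2600, 2850, 2950]
T = 3000.0

def laplace_statistic(times, horizon):
    """Standardized mean arrival time for a time-truncated observation (0, horizon]."""
    n = len(times)
    return (sum(times) / n - horizon / 2.0) / (horizon * math.sqrt(1.0 / (12.0 * n)))

u = laplace_statistic(arrivals, T)
p_value = 0.5 * math.erfc(u / math.sqrt(2.0))  # one-sided, upper tail of N(0, 1)
print(u, p_value)
```

With only ten failures the normal approximation is rough, so the resulting p-value should be read as indicative.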

Problem A6.1 Devices are delivered from source A with probability p and from source B with probability 1 - p. Devices from source A have constant failure rate λA, those from source B have early failures and their failure-free time is distributed according to a Gamma distribution (Eq. (A6.97), p. 422) with parameters λB and β < 1. The devices are mixed. Give the resulting distribution of the failure-free time and the MTTF for a device randomly selected.

Problem A6.2 Show that only the exponential distribution (Eq. (A6.81), p. 419), in the continuous case, and the geometric distribution (Eq. (A6.131), p. 431), in the discrete case, possess the memoryless property (Hint: Use Eq. (A6.27) and considerations in Appendices A6.5 and A7.2).

Problem A6.3 Show that the failure-free time of a series-system with operationally independent elements E1, ..., En, each with Weibull distributed failure-free times with parameters λi and β, is distributed according to a Weibull distribution with parameters λS and β; give λS (Hint: Consider Appendix A6.10.2).

Problem A6.4 Prove cases (i), (iii), and (v) given in Example A6.17 (p. 426).

Problem A6.5* Show that the sum of independent random variables having a common exponential distribution is Erlangian distributed. Same for Gamma distributed random variables, giving a Gamma distribution. Same for normal distributed random variables, giving a normal distribution (Hint: Use results of Appendix A6.10 and Table A9.7b).

Problem A6.6* Show that the mean and the variance of a lognormally distributed random variable are given by Eqs. (A6.112) and (A6.113), p. 426, respectively (Hint: Use the substitutions x = ... and y = x − σ/√2 for the mean and similarly for the variance).

Problem A7.1 Prove that for a homogeneous Poisson process with parameter λ, the probability to have k events (failures) in (0, T] is Poisson distributed with parameter λT, i.e. prove Eq. (A7.41), p. 450.

Problem A7.2 Determine graphically from Fig. A7.2 (p. 446) the mean time to failure of the item considered in Example V (Hint: Use Eq. (A7.30)). Compare this result with that obtained for Case V with h, = 0, i.e. as if no early failures were present. Same for Case IV, and compare the result with that obtained for Case IV with -t W, i.e. as if the wearout period would never occur.

Problem A7.3 Investigate for t → ∞ the mean of the forward recurrence time τR(t) for a renewal process, i.e. prove Eq. (A7.33), p. 448. Show that for a homogeneous Poisson process the mean of τR(t) is independent of t and equal to the mean of the successive interarrival times (1/λ). Explain the waiting time paradox (p. 448).


Problem A7.4 Prove that for a nonhomogeneous Poisson process with intensity m(t) = dM(t)/dt, the probability to have k events (failures) in the interval (0, T] is Poisson distributed with parameter M(T) - M(0).

Problem A7.5 Investigate the cumulative damage caused by Poisson distributed shocks with intensity λ, each of which causes a damage ξ > 0 exponentially distributed with parameter η > 0, independent of the shock and of the present damage (Hint: Consider Appendix A7.8.4).

Problem A7.6* Investigate the renewal densities h_ud(t) and h_du(t) (Eqs. (A7.52) & (A7.53), p. 454) for the case of constant failure and repair (restoration) rates λ and μ. Show that they converge exponentially for t → ∞ with a time constant 1/(λ + μ) ≈ 1/μ toward their final value λμ/(λ + μ) ≈ λ (Hint: Use Table A9.7b).

Problem A7.7* Let 0 < τ1* < τ2* < ... be the occurrence times (failure times of a repairable system) of a nonhomogeneous Poisson process with intensity m(t) = dM(t)/dt > 0 (measured from the origin t = τ0* = 0). Show that the quantities M(τ1*) < M(τ2*) < ... are the occurrence times in a homogeneous Poisson process with intensity one, i.e. with M(t) = t (Hint: Consider the remarks to Eq. (A7.200)).
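The time transformation of Problem A7.7 can be used in the reverse direction to generate a nonhomogeneous Poisson process: draw the arrival times of a homogeneous Poisson process with intensity one and map each of them through the inverse of M(t). The sketch below illustrates this with a power-law mean value function M(t) = (a t)^b, which is an assumption made here for the example, as are all names and parameter values.

```python
import random

def simulate_nhpp(m_inverse, m_of_horizon, rng):
    """Arrival times of an NHPP on (0, T], given the inverse of M and the value M(T)."""
    arrivals = []
    psi = 0.0
    while True:
        psi += rng.expovariate(1.0)    # next arrival of the unit-intensity HPP
        if psi > m_of_horizon:
            break
        arrivals.append(m_inverse(psi))
    return arrivals

# Assumed power-law model M(t) = (a*t)**b with increasing intensity (b > 1).
a, b, T = 1e-3, 2.0, 3000.0

def power_law_inverse(psi):
    return psi ** (1.0 / b) / a

rng = random.Random(1)                 # fixed seed for reproducibility
failure_times = simulate_nhpp(power_law_inverse, (a * T) ** b, rng)
print(failure_times)
```

Applying M to the simulated failure times returns the unit-intensity arrival times, which is the statement of the problem read backwards.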

Problem A7.8* In the interval (0, T], the failure times (arrival times) τ1* < ... < τn* < T of a repairable system have been observed. Assuming a nonhomogeneous Poisson process with intensity m(t) = dM(t)/dt > 0, show that (for given T and ν(T) = n), the quantities 0 < M(τ1*)/M(T) < ... < M(τn*)/M(T) < 1 have the same distribution as if they were the order statistics of n independent identically distributed random variables uniformly distributed on (0, 1) (Hint: Consider the remarks to Eq. (A7.206)).

Problem A8.1 Prove that the empirical variance given by Eq. (A8.10), p. 507, is unbiased (i.e. prove Eq. (A8.11)).

Problem A8.2 Give the maximum likelihood point estimate for the parameters λ and β of a Gamma distribution (Eq. (A6.97), p. 422) and for m and σ of a normal distribution (Eq. (A6.105), p. 424).

Problem A8.3 Give the procedure (Eqs. (A8.91) - (A8.93), p. 532) for the demonstration of an availability PA for the case of constant failure rate and Erlangian distributed repair times with parameter PP.

Problem A8.4* Investigate mean and variance of the point estimate λ̂ = k/T given by Eq. (7.28), p. 296.

Problem A8.5* Investigate mean and variance of the point estimate λ̂ = (k - 1)/(t1 + ... + tk + (n - k) tk) given by Eq. (A8.35), p. 515. Apply this result to λ̂ = n/(t1 + ... + tn) given by Eq. (A8.28), p. 513.

Acronyms

ACM: Association for Computing Machinery, New York, NY 10036
AFCIQ: Association Française pour le Contrôle Industriel de la Qualité, F-92080 Paris
ANSI: American National Standards Institute, New York, NY 10036
AQAP: Allied Quality Assurance Publications (NATO-Countries)
ASQC: American Society for Quality Control, Milwaukee, WI 53203
BWB: Bundesamt für Wehrtechnik und Beschaffung, D-56000 Koblenz
CECC: Cenelec Electronic Components Committee, B-1050 Bruxelles
CENELEC: European Committee for Electrotechnical Standardization, B-1050 Bruxelles
CNET: Centre National d'Etudes des Telecommunications, F-22301 Lannion
DGQ: Deutsche Gesellschaft für Qualität, D-60549 Frankfurt a. M.
DIN: Deutsches Institut für Normung, D-14129 Berlin 30
DOD: Department of Defense, Washington, D.C. 20301
EOQC: European Organization for Quality Control, B-1000 Brussel
EOS/ESD: Electrical Overstress/Electrostatic Discharge
ESA: European Space Agency, NL-2200 AG Noordwijk
ESREF: European Symp. on Rel. of Electron. Devices, Failure Physics and Analysis
ETH: Swiss Federal Institute of Technology, CH-8092 Zürich
EXACT: Int. Exchange of Authentic. Electronic Comp. Perf. Test Data, London, NW4 4AP
GIDEP: Government-Industry Data Exchange Program, Corona, CA 91720
GPO: Government Printing Office, Washington, D.C. 20402
GRD: Gruppe Rüstung, CH-3000 Bern 25
IEC (CEI): International Electrotechnical Commission, CH-1211 Genève 20, P.O.Box 131
IECEE: IEC System for Conformity Testing and Certif. of Electrical Equip., CH-1211 Genève 20
IECQ: IEC Quality Assessment System for Electronic Components, CH-1211 Genève 20
IEEE: Institute of Electrical and Electronics Engineers, Piscataway, NJ 08855-0459
IES: Institute of Environmental Sciences, Mount Prospect, IL 60056
IPC: Institute for Interconnecting and Packaging El. Circuits, Lincolnwood, IL 60646
IRPS: International Reliability Physics Symposium (IEEE), USA
ISO: International Organisation for Standardization, CH-1211 Genève 20, P.O.Box 56
MIL-STD: Military (USA) Standard, Standardiz. Doc. Order Desk, Philadelphia, PA 19111-5094
NASA: National Aeronautics and Space Administration, Washington, D.C. 20546
NTIS: National Technical Information Service, Springfield, VA 22161-2171
RAMS: Reliability, Availability, Maintainability, Safety; also Rel. & Maint. Symposium, IEEE
RIAC: Reliability Information Analysis Center, Utica, NY 13502-1348 (formerly RAC)
Rel. Lab.: Reliability Laboratory at the ETH (since 1999 at EMPA S173, CH-8600 Dübendorf)
RL: Rome Laboratory, Griffiss AFB, NY 13441-4505
SAQ: Schweizerische Arbeitsgemeinschaft für Qualitätsförderung, CH-4600 Olten
SEV: Schweizerischer Elektrotechnischer Verein, CH-8320 Fehraltorf
SNV: Schweizerische Normen-Vereinigung, CH-8008 Zürich
SOLE: Society of Logistic Engineers, Huntsville, AL 35806
VDI/VDE: Verein Deutscher Ing./Verband Deut. Elektrotechniker, D-60549 Frankfurt a. M.

Index

(less relevant places (not bold) are omitted by some terms)

A priori / a posteriori probability 401,511 Absolutely continuous 403 Absorbing state 471-72 Accelerated test 35,81,86,98,99,102,307-12,

352,426,535 (see also 312-34) Acceleration factor 37,99,308-11 Acceptable Quality Level (AQL) 86,284-86,530 Acceptance line 283-84,301-02,528-29 Acceptance test → Demonstration Accessibility 8,118,151-52 Accident prevention 9,362 Accumulated → Cumulative Acquisition cost 11,13,14,357 Activation energy 37,97,99,103,308-09 Active redundancy 43,44,61-64,195,206,210,

211,225,227,361 Addition theorem 397,399 Adjustment 118,152 Age replacement 134,234 Aging 6,405 (see also Wearout and Bad-as-old) Alarm circuit 47 Allocation (reliability) 67 Alternating renewal process 168,452-56 Alternative hypothesis 280,291,298,305,525-26 Alternative investigation methods 267-76 AMSAA model 330 Anderson-Darling statistic 534 Antistatic container 148 AOQ / AOQL 281-82 Aperture in shielded enclosure 144 Approximate expressions 59,61,131-34,179-

80,188,192,195,198-200,206,211,227,219-30,236,238,240,243,245,266,430

Approximation of a reliability function 192 Approximation of a repair funct. 114-15,198-200 AQL → Acceptable Quality Level Arbitrary failure and repair rates 164,168,186,

200,231-32 Arbitrary initial conditions (one item) 176-78 Arbitrary repair rate 177,185,188,200,206,

215,241,489,490 Arithmetic random variable 402,428-31 Arrhenius model 37,97,102,307-09 Arrival rate 494,501

Arrival time 321,331,442,494,496 As-bad-as-old 40,497 As-good-as-new 5,6,8,9,40,164,171,232,

234,236,242,249,251,253,254,365,294, 319,353,356,358,359,404,497

Assessed reliability 3 Asymptotic behavior 178-80,447-50,455,474-76,

486-88,491-92 (see also Stationary and Steady-state)

Asynchronous logic 149 Automatic test equipment (ATE) 88 Availability

demonstration 291-92,293,531-32 estimation 289-90,293,523-24

Average availability (AA) 9,171-72,176,177, 476,487 (see also Intrinsic, Operational, Overall, Point, Technical availability)

Average Outgoing Quality (AOQ) 281-82 Axioms of probability theory 394

Backdriving 340 Backward recurrence time 447 Bad-as-old (BAO) 40,405,497 Bathtub curve 6-7,422 Bayes theorem 401,414 Bayesian estimate / statistics 414,511 Bernoulli distribution → Binomial distribution Bernoulli trials 427-28,431,433,513 Bernoulli variable 427 Bidirectional connection 31,53,554 Binary decision diagram (BDD) 271 Binomial distribution 408-09,427-29,517-20,

527-28 Birth and death process 131,207,211,479-83 BIST → Built-in self-test BIT → Built-in test BITE → Built-in test equipment Black model 97 Bonding 95,100 Boolean function method 58-61 Bottom-up 72,156,158,355 Boundary-scan 149 Bounds 59,179-80 Branching process 499


Breakdown 96-97,102,106,140,144,145 Bridge structure 31,53-54 Built-in self-test (BIST) 150 Built-in test 66,116-118,149-51 Built-in test equipment (BITE) 116 Burn-in 6,339,342,352

Capability 13,66,72,154,248,352,376,479 Capacitors 140,141,143,146,521 Captured 181 CASE 157 Cataleptic failure 4,6 Causes for defects 66,155-57,329,341 Causes for failures 3-4 Cause-to-effects-analysis 15,66,72-80,153,

157-58,329,356 Cause-to-effects-chart 76,356 CDM → Charged device model Censoring 295,296,298,323,324,331,504,515 Central limit theorem 126,434-37,449 Centralized logistic support 125-29,130 Ceramic capacitor 140,143,146,147,521 Change 379 Chapman-Kolmogorov equations 462 Characteristic function 539,545 Characterization 90-92,108 Charge spreading 103 Charged device model 94 Chebyshev inequality 411,293,433,434 Check list 79,120,372-75,376-82,383-87 Chi-square (χ²) distribution 408-09,423,540 Chi-square (χ²) test 316-18,535-38 Classical probability 395 Clock 66,144,146,150,151 Clustering of states 222 CMOS terminals 145 Coating 142 Coefficient of variation 128,411 Coffin-Manson 109,311 Coherent system 57,61 Cold redundancy → Standby redundancy Common-cause 72,260-64 Common-mode currents 144 Common mode failures 42,66,72,260,361 Comparative studies 15,25,26,31,44,48-49,78,

103,116,119,130,133,164,194,220-21,234, 261,446,550-54

Complement (complementary events) 392 Complex structure 52,231 Complex system 64-66 Composite shmoo-plots 91 Compound failure rate 310 Compound process → Cumulative process

Computer-aided reliability prediction 272-76 Concurrent engineering 1,11,16,17,19,21,

353,357,360,376 Conditional density / distribution function

404,412-14,485,491,495,497 Conditional expected value 405,414 Conditional failure rate 405,497 Conditional probability 396-97,444,460,494,501 Confidence ellipse 278-79,518-19 Confidence interval 279,290,297,516-24 Confidence level 516 Confidence limits 516

availability 289-90,523-24 failure rate λ 296-97,520-23 failure rate λS at system level 298 parameters lognormal distribution 305 unknown probability 278-80,516-520

Configuration accounting 379 Configuration auditing 374,378-79 Configuration control 158,374,379 Configuration management 16,21,152,157,

158,335,353,378-81 Conformal coating -+ Coating Congruential relation 274 Connector 140,145,146,148,152 Consecutive k-out-of-n system 45 Consistent estimates 5 11-12 Constant acceleration test 339 Constant failure rate 6-7,35,40,172,165,177,

179-80,294-303,405,419-20,450-51,460-83 Constant repair rate 171,181,177,182-84,189-

96,207-1 1,213-30,238-40,243-71,46043 Consumer risk 86,281,284,291,299,302,

526,532 Contamination 85,93,98 Continuity test 88 Continuous random variable 403-04,408,412-14 ControIlability 149 Convergence almost sure T, Conv. with prob. one Convergence in probability 433 Convergence quickness 127,179-80,279,290,

297,303,313,394,506,507-08,519,522,530 Convergence with probability one 434 Convolution 416-17,545 Cooling 84,140-42,146 Corrective actions 16,21,22,72,73,77,80,104-

05,153,336,389-90 Corrective maintenance 8,113,118,120,154,353 Correlation 78,415,425,440-41 Corrosion 83,85,98-99,102,103,142,311 Cost I cost equation 12,14,136-38,235,242-43,

342-49,357,364,369-70,372,376,428,476 Cost effectiveness 13,353

Cost optimization 11,13,16,353,357 Count function 442,451,493,499 Covariance matrix 4 15 Coverage -+ Incomplete coverage, Test Cover. Cracks 85,93,102,104,106,108,109,111 Cramer - von Mises test 322,534 Critical design review -+ Design review Critical operating states 264 Criticality 72-73,78,153,158,161 Criticality grid 1 criticality matrix 72-73 Cumulated states 259,477 Cumulative damage 499,559 Cumulative operating time 294-303,309,515 Cumulative process 237-38,498-500 Customer requirements 365-68,369-7 1 Cut Sets -+ Minimal cut sets Cut sets theorem 509,512 Cutting of states 222,230,273 Cycle 275,455-57,491

Damage 85,93,94,95,100,104,106,107,109, 110,311,312,329,336,337,340

Damp test -T) Humidity test Data collection 21,22,23,360-61,388-90 Data retention 89,97-98 DC Parameter 88,92 De Moivre-Laplace theorem 434,518 Death process 61-63 Debug test 159 Debugging 153,158 Decentralized logistic Support 129-30,134 Decoupling capacitor 66,143,146 Defect 354

152-61,302-04,341,343,344-49,362 localization 337 prevention 66,78,155-59

(see also Dynamic defect) Defect tolerant 152-53,155 Defective prob. 12,86,277-86,337,341,343 Deferred cost 12,14,342,342,344-47 Definition of probability 394-95 Deformation mechanisms I energy 109,3 41 Degradation 4,7,66,92,96,101,112,248,264 Degree of freedom 423,540-42 Demonstration

availability 291-92,293,531-32 defective (or unknown) probability p 283, 280-86,287-88,526-30 . const. failure rate h or MTBF=l I h 301, 298-303,370-71 M7TR 305-07,371

Dendrites 95,100

Density 403,408,413 Dependability 9,11,13,19,354,366,367,479 Derating 33,82,84,86,139-40,354 Design FMEAIEMECA 72,78 Design guidelines 25-27,66,77,80,84,374,377

maintainability 149-52 reliability 139-48 software quality 152-61

Design reviews 21,27,77,79,107,120,153,159, 354,374,378,381,383-87

Design d e s + Rules Destructive analysis 104 Device under test (DUT) 88 Diagnosis -+ Fault isolation Diagram of

state transition 187,201,215,244,489,490 transition probabilities 62,183,191,196, 208,214,229,465-68,471,472,479,481 transition rates 231,239,240,245,246, 247,250,252,256,261,263,264,269

Differente between + Distinction between Different elements 194-96,225,227 Differential equations (method of) 190,469-72 Directed connection 31,55 Discrete random variable 402,408-09 Discrimination ratio 281,300 Dislocation climbing 109,341 Distinction between . arrival times and interarrival times 494-95

time and failure censoring 295,515,520-22 h(t ) and f(t) 404 h(t) arid zS(t) 7,501,356 z s ( t ) , m(t) and h(t) 7,356,444-45,501 . examples 3-4,21,23,66,67,72,78,113, 117, P;(&) and QO(6t) 465 . Pi and 1: 475,487,488 . t;,tz ,... andt l , t2 , ... 319,331,494 . T: , 22, ... and zl ,z2, ... 319,331,494

Distributed system I structure 52 Distribution function 401-02,408-09,412,419-32 Documentation 6,15,118,154-56,375,378,

379,380,381 Dominant failure mechanism 37-38,310 Dormant state 33,36,140 Double one-sided sampling plan 285-86 Down state 265-66,452-53,469 Down time 123,124,136,173-74,235,476,499

(see also MDT) Drift 52,67,71,76,83,100,113,142,146,550-54 Drying material 142 Duane model 330-32 Duration -+ Frequency lduration Duration (sojourn, stay) -+ Stay time Duty cycle 38,67,273,370


Dwell time 98,108,109,339,341 Dynamic burn-in 101,109,339 Dynamic defect 3-4,152,354,362,363,410 Dynamic fault tree 270-71 Dynamic Parameter 88,145 Dynamic stress 69,144

Early failures 6-7,35,315-16,323,326,328, 329,337,342,352,354,355,406,445-46

Early failure period 6-7,315,323,328 Ecological IEcologically acceptable 10,369,370 EDF + Empirical distribution function EDX spectrometry 104 Effect + Failure effect Effectiveness -+ Cost effectiveness Efficient estimates 51 1,512 Electrical overstress 148 Electrical test

assemblies 340-41 . components 88-92 Electromagnetic compatibility (EMC) 82,84,

108,139,143-44 Electromigration 6,95,97,103,311 Electron beam induced current (EBIC) 104 Electron beam tester 91,104 Electrostatic Discharge (ESD) 89,94,102,104,

106-07,108,139,144,148,335 Elementary event 392 Elementary renewal theorem 447 Elements of a quality assurance system 21 Embedded Markov chain 274,464,475,483,

486,487,488,491 Embedded renewal process 169,203,452,453,

456,484,491 Embedded semi-Markov process 197,215,440,

485,488,489-91 Embedded software 153,157 EMC -+ Electrornagnetic compatibility Emission + EMC Emission microscopy (EMMI) 104 Empincal distribution function 3 12- 17,504-10 Empirical evaluation of data 314-17,421,

503-10,547-49 Empirical failure rate 5 Empirical mean I variance 4,303,304,506-07 Empincal reliability function 4-5 Empty set 392 Environmental . conditions lstress 10,28,33,36,82.83

stress screening -+ ESS Environmental and special tests

assemblies 108-09 . components 92-100

Equivalence between asymptotic, steady-state, stationary 180-81,450,476,487

Equivalent event 392 Erlang distribution 186,423 Error I mistake 3,6,9,76,78,95,153,156-57,

329,354,355,356,362,386 E m r correcting code 153 ESD -+ Electrostatic discharge ESS 6,341,349,352,35445,362 Estimate 511,503-24 Estimation

availability 289-90,293,523-24 defective probability p 279,278-80, 287-88,513,516-20 failure rate h or MTBF = I 1 h (T fmed) 297,295-98,513,515,520-21 failure rate h (k fixed) 295,521-22 MiTR 303-05 Nonhomog. Poisson process 33 l-32,497 pointlinterval (basic theory) 511-24

Euler integral 544 Event field 391-94 Exchangeability 118,151-52 Expanding 2-out-of-3 to I-out-of-3 red. 47,544 Expected percentage of performance 513 Expected percentage of time in a state 206,513 Expected value (mean) 4,406,415,416, 506 Exponential distribution 408-09 Extreme value distributions 421 Extrinsic 3-4,86,355,389 Eyring model 99,102,311

Fail-safe 1,9,66,72,157 Failure 1,3-4,6-7,22,23,61-62,64-65,78,355 Failure analysis 87,89,95,102-07,111 Failure cause 3-4,72-73,78,102-05,355-56,389 Failure effect 4,72-80,87,101,355-56,389,363 Failure-free operating time → Failure-free time Failure-free time 3-6,39-40,404,420 Failure frequency → System failure frequency Failure hypothesis 69-70 Failure intensity 5,7,355,501-02 Failure isolation → Fault isolation Failure mechanism 4,33-38,92,96-100,102,

103,307-12,337,339,406 Failure mode 3,27,42,51,101,356,362,389

examples 30,51,64-66,550-54 distribution 100,550-54 investigations 64-66,72-77,236-47,255-58

Failure mode analysis → FMEA / FMECA Failure propagation → Secondary failures Failure rate 4-7,33-38,355,404-05,409,419-20 Failure rate analysis 26,28-67


Failure rate confidence limits at components level 296-98 . at system level 298

Failure rate estimation 296-98,513,520-22 Failure rate demonstration 298-303 Failure rate models IHDBKs 35-38,99,310-12 Failure rate of mixed distributions 41,404-06 Failure recognition 101,116-18,149-51,236-46 Failures with constant failure rate h 6-7,35 False alarm 66,232,241,246 Fatigue 88,98,311,421 (see also Wearout) Fault 4,72,356 Fault coverage -t Incomplete coverage Fault isolation 116-17 Fault model 90,91,236-64 Fault modes and effects analysis -t FMEA Fault recognition 112,115,116-18,119,149 Fault tolerant system 47,64-65,66,101,153,

157,162,165,231,233,248-60,264,476,478 Fault tree /Fault tree analysis (FTA) 66,76,78,

270-71,356 Feasibility I feasibility check 10, 19,77,121,

154,354,378,381,383,384 Field of events 391-94 Fine leak test 339-40 Finite element analysis 69 First delivery 350 First-in / first-out 164,232,273 Fishbone diagram -+ Ishikawa diagram Fisher distribution 290,291,523,532,429,542-43 FIT (Failures in time) 36 Fitness for use 11,360 fixed length test -+ Simple two-sided test Flow of system failures 16 1,294,330,497 FMEAFMECA 27,42,66,69,72-75,78,117,

237,248,264,355,377 Force of mortality 7,356 Forward recurrence time 175,178,180,446-47,

448,45 1,454 (see also Rest waiting time) Frequency / duration 231,148,255,259-60,266,

475,476-78,487 FTA -t Fault Tree Analysis Functional block diagram 29,68,256,271 Function of a random variable 405,410,426 Functional test 88

Gamma distribution 408-09,422-23 Gamma function 544 Gate review 378,381 Gaussian distribution + Normal distribution General reliability data 3 19-28 Generation of nonhomog. Poisson processes 497 Generator for stochastic processes 275-76

Geometric distribution 408-09,431 Geometric probability 395,408-09,43 1 Glassivation + Passivation Glitches 66, 146 Glivenko-Cantelli theorem 505 Gold-plated pins 94,147 Gold wires 100 Good-as-new -+ As-good-as-new Goodness-of-fit tests 312-18,322,533-38 Graceful degradation 66,248 Grain boundary sliding 109,341 Grigelionis theorem 498 Gross leak 339-40 Ground 143-45,146,147,152 Guard rings 144,146 Guidelines -t Design guidelines

HALT 312 HAST 89,98-99,312 Hazard rate 5 HBM → Human body model HPP → Poisson process (homogeneous PP) Hermetic enclosure 142,148 Hermetic package 85,102,104,142,337,339 Hidden defect 14,117 Hidden failures 8,66,79,107,113,116,117,120,

149,150,233,241-46,243,359 High temperature storage 89,98,337 Higher-order moments 410,411,507 Highly accelerated tests 312 Historical development 16,17,85 Homogeneous → Time-homogeneous Homogeneous Poisson process → Poisson proc. Hot carriers 96,102,103 Hot redundancy → Active redundancy Human aspects / factors 2,3,9,27,73,76,77,

152,153,157-58,352,361,363,373,385 Human body model (HBM) 94 Human errors 10,119,157 Human reliability → Risk management Humidity tests 89,98-100 (see also HAST) Hypergeometric distribution 408-09,432

Idempotency 61,392 Imperfect switching → Switching In-circuit test 340 Inclusion / Exclusion 400 Incoming inspection 90,101,145,336,340,

343,344-49 Incomplete coverage 241-46,267 Independent elements 52 Independent events 397,398 Independent increments 439-40

Index

Independent random variable 394,413,415, 416,416-18,419,422,423,434,465

Indicator 56,57,58,61 Indices 167 Indirect plug connectors 152 Inductive / capacitive coupling 91,143,146 Industrial applications (environment) 37,38,140 Influence of prev. maintenance 134-36,233-36 Influence of repair time distribution 114-15,

133-34,198-200 Information feedback 22,360-61,390 Infrared thermography (IRT) 104 Inherent → Intrinsic Initial conditions 63,176,178,180,190,191,

208,449-50,454-55,462,469,471,485 Initial distribution 449,459,460-61,463,

475-76,485,486-87,491-92 Input/output driver 146 Inserted components 84,108,110 Integral equations (method of) 166,185,193-94,

211-12,216-17,473-74 Integral Laplace theorem 434,518 Integrated circuits (ICs) 34-37,84-85,89,

90-100,142,149,336,337-40 Intensity 7,296,321,451,493,497,498,501,502 Interaction 66,253,156 Interarrival time 5,294,319,323,328,442,494 Interchangeability 8 Interface 78,82,96,97,103,118,139,146,154,157 Intermetallic compound / layer 100,103,109 Internal redundancy → Active redundancy Internal visual inspection 89,93,104 Intersection of events 392 Interval estimation 278-80,289-90,293,296-98,

305,516-24 Interval estimate at system level 298 Interval reliability 166-67,172,177,181,188,

193,195,198,211,265,454 Intrinsic 3-4,9,86,139,355,389 Inverse function 405 Ion migration 103 Irreducible Markov chain 459-60,475-76,

486-87,491 IRT → Infrared thermography Ishikawa diagram 76-77,78,356 ISO 9000: 2000 family 11,366-67 Item 2,357

Jelinski-Moranda 160 Joint availability 174-175,177 Joint density / distribution 412-13,494-95 Junction temperature 33,34,35,37,79,84,85,

140-42,145,309

k-out-of-n: G → k-out-of-n redundancy k-out-of-n redundancy 31,44,61-64,130,

206-12,211,225,227,271,479,489-90 Kepner-Tregoe 76,78 Key item method 52-55,60,68-69 Key renewal theorem 178,179,448,455,457,491 Khintchine theorem 451,498 Kirkendall voids 100 Kolmogorov backward / forward eqs. 462 Kolmogorov-Smirnov test 312-17,322,332,497,

534,536-37,543 Korolyuk theorem 493 kth moment / central moment 410-11

Laplace test 324 Laplace transform 545-46 Last repairable unit → Line replaceable unit Last replaceable unit → Line replaceable unit Latch-up 89,96,145,148 Latent damage → Damage Law of large numbers 433-34 Leak test → Seal test Liability → Product liability Life cycle cost (LCC) 11,13,16,112,353,357,

364,369,370,377 Life-cycle phases 19 (hardware), 154 (software) Lifetime 357 Like new → As-good-as-new Likelihood function → Max. likelihood function Limit theorems of probability theory 432-37 Line repairable unit → Line replaceable unit Line replaceable unit (LRU) 115,116,118,120,

125,149 Liquid crystals 104 List of preferred parts (LPP) → Qualified part list

Load capability 33 Load sharing 43,45,52,61-64,163,164,190,

194,207,458,488 Logarithmic Poisson model 333 Logistic support 8,115,119,125,129,235,357 Lognormal distribution 113-15,303-07,408-09,

425-26,547 Long-term stability 86 Lot tolerance percent defective 284-85,530 Lowest replaceable unit → Line replaceable unit LRU → Line replaceable unit LTPD → Lot tolerance percent defective

Macro-structures 165,222,227,264 Maintainability 1,2,8, 9,12,13,21,112-15,

357,366,367,368

Maintainability analysis 72,115-24,149-52, 373,375

Maintainability estimation/demonstr. 303-07,371 Maintainability program → Maintenance concept Maintenance 8,113 Maintenance concept 8,112,115-20,373,375 Maintenance levels 119-20 Maintenance strategy 35,134-36,233-36 Majority 31,47,66,215 Manufacturing processes 106-11,147-48,335-50,378

Manufacturing quality 16,20,86,335-50 Margin voltage 98 Marginal density / distribution function 413 Marking 306 Markov chain 244,268,274,458-60,461,463,

464,475,483,485,486,487,488 Markov models 61-64,166-67,170-71,189-93,

195,211,220-21,225,227,226-30,238-40, 260-63,264-67,440,460,466-68,471,479

Markov process 166-67,440,460-83,487 Markov renewal property 465

(see also Memoryless property) Markov renewal processes 483 Match / Matching 144,146 Mathematical statistics 503-38 Maximum likelihood function / method 278,89,

296,304,305,313,319,322,331,512-15,536 Mean (expected value) 406-07,410,415,416 Mean down time (MDT) 124,259,266,478 Mean (for rel. applications) → MDT, MTBF,

MTBUR, MTTF, MTTPM, MTTR, MUT Mean logistic delay 235 Mean operating time between failures (MTBF)

6,39-40,358 (see also 294-303,369-71 for estimation & demonstration of MTBF = 1/λ)

Mean time to failure (MTTF) 6,39,40,63,166-67,195,211,220-21,225,227,358,474,486

Mean time to preventive maintenance (MTTPM) 113,121,125,358

Mean time to repair (MTTR) 8-9,113,121-24, 359 (303-07 for estimation & demonstration)

Mean time to restoration → Mean time to repair Mean up time 6,265,477 Mean value function 321,324,328,333,493,496 Mechanical reliability 67-71 Mechanism → Failure mechanism Median 412 Meshed structure 52 Memories 90-91,93,97-98,146 Memoryless property 7,40,63,136,172,192,234,

2!?5,298,405,420,431,440,451,464,465,478 Metal migration 103 (see also Electromigration)

Metallographic investigation 104-05,108-09 Method of differential eqs. 167,190-81,469-72 Method of integral eqs. 166,193-94,473-74,486 Metrics (software quality) 153 Microcracks → Cracks Microsection 104,105,108,110 Minimal cut sets 59,60,76 Minimal operating state → Critical Oper. states Minimal path sets 58,60,76 Mission availability 173 Mission profile 3,15,28,38,68,79,231,357,370 Mistake → Error Mixed distribution function 403 Mixture of distributions 7,41,316,406 Modal value 412 Mode → Failure mode Models for failure rates 35-38 (see also Mixture) Models for faults → Fault model Modification 379 Moisture 98-99,142 Module / Modular 118,120,149,150-59 Moment generating function 545 Monotony 57 Monte Carlo simulation 165,231,233,272,

273-76,426,435,436 (see also Generation and Generator)

Motivation and training 24,119,375 MDT → Mean down time MTBF → Mean operating time between failures MTBUR 8,358 MTTF → Mean time to failure MTTPM → Mean time to prev. maintenance MTTR → Mean time to repair / restoration MUT → Mean up time

Multidimensional random var. 412-16,438-41 Multifunction system → Phased-mission system Multilayer 143,148 Multimodal 412 Multinomial distribution 318,429,537,538 Multiple failure mechanism 64-65,310,312,

319,341,406 Multiple failure mode 52,64-65,66,246-47,

255-58 Multiple faults / consequences 76 Multiple one-sided sampling plans 285-86 Multiplication theorem 398-99 Mutually exclusive events 57,171,174,237,

392,393,394,397-98,400,446 MUX 150,151

Nitride passivation → Passivation Nonconformity 354,359 Nondestructive analysis 102-05

Nonhomogeneous Poisson process 161,321-34, 451,493-97

Nonregenerative state 201,210,440,490 Nonregenerative stochastic process 164,186,

200,212,488,492-502 Nonrepairable item (up to system failure) 5,7,

39-57,61-71,236-37,240,243,245,254,260, 270,272

Normal distribution 113,126-127,408-09, 424-25,434-35,449,496,539,549

Number of states 56,219 N-version programming (NVP) 47

OBIC 105 Object oriented programming 157 Observability 149 Obsolescence 8,118,138,145,357 Occurrence time → Arrival time One-item structure 39-41,168-82

Parameter estimation 278-80,289-90,293, 294-98,303-05,331-32,511-24

Pareto 76,78 Part Count method 51 Part Stress method 33-38,50-51 (see also 69-71) Partitioning 115,118,157,158 Partitioning cumulative operating time 294,

295,301,371 Passivation / Passivation test 89,93,104,106 Path set → Minimal path sets Pattern sensitivity 91,93 PCBs → Populated printed circuit boards Pearson 517,535 Percentage point 412 Performability 259 Performance → Capability Performance effectiveness → Reward Performance test 108 Petri nets 267-69

One-out-of-2 redundancy (1-out-of-2 redundancy) 42-43,189-206,225,227,236-45,247,260-64,466,470-72,488-92
Phased-mission systems 28,30,38,231,248-55

One-sided confidence interval 280,290,297, 316,319,321,322,324,516,519

One-sided sampling plan (for p) 284-86,529-30 One-sided tests to demonstrate λ or MTBF = 1/λ

302-03 Only one repair crew → models of Chapter 6 except pp. 210,224-25

Operating characteristic / curve 281-82,284-85, 300,306-07,527-28,530

Operating conditions 2,3,7,26,28,33,35,79, 84,90,96,99,102,354,365

Operation monitoring 116 Operational availability 235 Operational profile 28 Optical beam induced current (OBIC) 104 Optimal derating 33,140 Optimal preventive maintenance 234-36,242-43 Optimization 12-15,67,120,136,138,342-49,

353,364 Optocoupler 146 Order observations / sample / statistics 312,313,

321,323,324-25,332,495,496,504,506,535 Organizational structure (company) 20 Overall availability 9,235 Overstress 33,103,148,336 Oxide breakdown 96-97,102,103,106,311

Packaging 84-85,89,100,142 Parallel model 43-45,61-64,195,206,211,225,

227,236-43,247,466-39,470-42,489,490 Parallel redundancy → Active redundancy

Physics-of-failures 102-07 (see Failure mech.) Pitch 84,109,147,341 Plastic packages → Packaging Point availability 9,170,178,181,166-67,190,

289-93,352,454 Point estimate 278,289,296,303,332,511-15 Point estimate at system level 298 Point process (general) 500-02 Poisson approximation 430 Poisson distribution 283,294,408-09,429-30 Poisson integral 544 Poisson process

  homogeneous (HPP) 7,294,295-96,320,323-27,356,445,448,450-51,515, 493-97 for m(t)=λ
  nonhomogeneous (NHPP) 161,321-34,451,493-97

Populated printed circuit board (PCB) 84,85,90, 94,107-11,116,144,146-48,152,336,340-41

Power devices / supply 96,98,99,108,143,145,146,147,150,152

Power Law process 230 ppm 337,424 Precision measurement unit (PMU) 88 Predicted maintainability 121-25 Predicted reliability 3,25-27,28-71,172-276,

372 Preferred list → Qualified part list (QPL) Preheating 147 Preliminary design reviews → Design reviews Pressure cooker → HAST Preventive action 16,22,72,77,112,139-52,

155-58,341,371-82

Preventive maintenance 8,112-13,233-36,241- 43,359

Printed circuit board → Populated printed circuit board Probability 393-96 Probability chart 314-15,317,421,509-10,547-49 Probability density → Density Probability plot paper → Probability chart Problems for Home-Work 554-59 Procedure for

  analysis of complex systems 264-66
  analysis of mechanical systems 69
  electrical test of complex ICs 88-90
  demonstration of
    availability (PA=AA) 291-93
    MTTR 305-07
    probability p 280-86,287-88,526-30
    λ or MTBF = 1/λ 298-303
  estimation of
    availability (PA=AA) 289-90
    MTTR 303-05
    probability p 278-80,287-88,513,516-20
    λ or MTBF = 1/λ 296-98 (see in particular 279,290,297)
  ESD test 94
  FMEA/FMECA 72-75
  frequency / duration 265-66,476-79
  graphical estimation of F(t) 507-10 (see also 312-17,533-34,547-49)
  Goodness-of-fit tests
    Anderson-Darling 534
    Cramér - von Mises 322,534-35
    Kolmogorov-Smirnov 312-17,322,333-34,534,536-37 (see also 504-10)
    χ² test 316-18,535-38
  mechanical system's analysis 67-68,69
  modeling complex rep. systems 264-66
  qualification test
    assemblies 107-11
    complex ICs 89,87-107
    first delivery 349-50
  reliability allocation 67
  reliability prediction 3,25-27,28-71,172-276,372-73 (see 67-71 for mechanical reliability)
  reliability test
    accelerated tests 307-12
    technical aspects 101,109,337-40
    statistical aspects 277-334,503-38 (see in particular 283,297,301)
  screening of
    assemblies 333-41 (see also 107-11)
    components 366-40 (see also 92-100)
  sequential test 283-84,300-01,528-29
  simple one-sided test plan 284-86,302-03,529-30
  simple two-sided test plan 280-83,298-301,527-28
  software development / test 156-58
  test and screening strategy 342-49
  transition probabilities (determination of) 185,187,193,244,464,489-91

Process FMEA/FMECA 72,78 Process reliability 3 Process with independent increments 333-34,

439-40,451,493-97 Process with stationary increments 441,451,494 Producer risk 86,281,284,291,299,302,526,532 Product assurance 16,359,367,368 Product liability 9-10,15,354,359,360,379 Production process 6,21,87,98,106-07,108,

335-36,342-44,354-55,360,365,368,378 Program / erase cycles 97-98,338 Project management 17-24,152-61,369-82 Prototype 18,19,87,107,312,329,343,374,

375,377,380,381,384,386,387 Pseudo redundancy 42,361 Pseudorandom number 274 Pull-up / pull-down resistor 145,147,150 Purple plague 100,103

Quad redundancy 65,66,101 Quadrate statistics 534 Qualification tests 21,343,374,378,380,381

  assemblies 107-11
  components 89,87-107

Qualified part list (QPL) 87,145,372,378,385 Quality 11,360 Quality & reliability assurance progr. 17,371-82 Quality & reliability requirements 365-68,369-71 Quality and reliability standards 365-68 Quality assurance 11,13,16,17-24,152-61,

360,372-75,376-82 Quality assurance system 21,366 Quality attributes for software 157 Quality control 13,16,21,158,277-86,336,360 Quality cost optimization → Cost / cost equations Quality data reporting system 22,360-61,388-90 Quality growth (software) 159-61 Quality handbook 21 Quality management 16,20,21,24,360,361,366

(see also Quality assurance and TQM) Quality of manufacturing 16,21,86,335-36 Quality metric for software 153 Quality tests 21,361,376,380 Quantile 412,540-43 Quick test 116

Random duration (phased-mission systems) 274
Random sample → Sample
Random variable 401-03
Random vector 412-15,438-41
Rare event 10,272,273,275
Reachability tree 268-69
Reconfiguration 66,118,157,231,248-60
  time censored (phased-mission system) 248-55
  failure censored 255-58
  with reward and frequency / duration 259-60
Recrystallization 109
Recurrence time 174,175,178,446-51,494
Recycling 10,19,357
Redesign 8,329
Reduction (diagram of transition rates) 264(P.2)
Redundancy 42-45,47,51,61-64,65,66,68,189-92,195,211,220-21,225,227,236-46,260-64,361
  in software 47,153,157
Reflow soldering 147
Refuse to start 239
Regeneration / renewal point 201,440,442,453,456,473,484,489,490
Regeneration state 200,215-16,440,456,464,484,489
Regenerative process 456-57
Rejection line 283-84,301-02,528-29
Relation between 4 and Pi 475 (see also Distinction between)
Relative frequency 278-79,393,394-96,513,516
Relaxation 109
Reliability 2,13,27,66,69,72,231,361,367,372
Reliability allocation 67
Reliability analysis 13,25-27,66,67-71,80,139-48,162-67,372-73,377-78
Reliability block diagram (RBD) 28-32,68-69,362 (see 231-76 if the RBD doesn't exist)
Reliability function 2-3,166-67,169,176,361,404,471-72,473-74,486
Reliability growth 329-34,362 (see also 159-61)
Reliability prediction → Procedure for
Reliability tests → Procedure for
Remote control / diagnostic 117-18,120
Renewal density 443-44
Renewal density theorem 179,448
Renewal equation 444
Renewal function 443
Renewal point → Regeneration point
Renewal process 164,441-51
  embedded 203,452-52,456,484,487,491
Repair 8,113,163-64,353,359
Repair frequency → System repair frequency
Repair priority 214,227,229,232,239,240,247,256,264,466,468
Repair rate 115,170-71,177,214,466,468
Repair strategy → Maintenance strategy
Repair time 8,113-14,121-24,303-07,359
Repairable spare parts → Spare parts
Repairable systems 5,162-276
Repairable versus nonrepairable 40
Repairability → Corrective maintenance
Replaceability 152
Replacement policy 236
Requalification 87
Required function 28,362
Requirements → Quality and rel. requirements
Reserve contacts 152
Reserve/reserve state 43,62,163,190,201
Rest waiting time 221,494
Restart anew 171,440,456
Restoration 8,112,353
Restoration frequency → System repair freq.
Results (tables / graphs) 31,44,48-49,111,127,166-67,177,181,188,195,206,211,220-21,225,227,230,234,258,279,283,290,292,297,301,302,309,315,408-09,451,468,510,522
Reuse 10,116,119,130
Reward 231,255,259-60,266,476,478-79
Rework 108,148,341
Rise time 143,144
Risk 9-11,15,67,72,145,148,273,278,347,363,369,373,384 (see α, β & β1, β2, γ for statistical risk)
ROCW 542
Rules for
  convergence PA(t) → PA 179-80,195
  data analysis 320
  derating 33,140
  FMEA/FMECA 72
  imperfect switching 238,240,247
  incomplete coverage 245
  junction temperature 37,141,145
  partition of cumulative operating time 294,295,301,371
  power-up / power-down 145,147
  quality and reliability assurance 19
  series / parallel structures 46,219
  (see also Design guidelines)
Run-in 341,352

Safety 9-10,13,15,66,72,78,362-63,366,379
Safety analysis 15,66,72-78,373,377,378
Safety factor 69
Same element in rel. block diagram 30,32,55,60,69

Same stress 45,71 Sample 504 Sample space 391-92 Sampling tests 277,280-88,344-49,527-30 Scan path 150-51 Scanning electron microscope (SEM) 104 Schmitt-trigger 92,143 Scrambling table 91 Screening (see also ESS)

assemblies 340-41 components 337-340 (see also 92-100)

Screening strategy → Test and screening strategy Seal test 339-40 Secondary failure 4,66,73 Selection criteria for electronic comp. 550-53 Semidestructive analysis 104 Semi-Markov process 164,166-67,440,483-88 Semi-Markov proc. embedded → Semi-reg. proc. Semi-Markov transition probability 166,185,

187,197,244,463-64,484-85,489,490 Semi-regenerative process 162,163,164,197,

215,233,264,273,274-75,438,440,488-92 Sequential test 283-84,300-02,528-29 Series model 41-42,64,71,182-88,320,406,421 Series - parallel structure 45-49,213-30,468

(see 48-49 and 220-21 for comparisons) Series - parallel system → Series - paral. structure Serviceability → Preventive maintenance Services reliability 3 Set operations 392 Shewhart cycles 76 Shielded enclosure 144 Shmoo plot 91,93 Short-term test 312 Silicon nitride glassivation → Passivation Simple one-sided test 284-86,302-03,529-30 Simple structure 28,39-51,168-236 Simple two-sided test 280-83,298-301,527-28 Simulation → Monte Carlo Single-point failure 42,66,79 Single-point ground 143 Six-σ approach 424 Sleeping state → Dormant state SMD / SMT 84,109-11,146-47,341 Sneak analyses 76,79,377 Soft error 97 Software

  attributes → quality attributes
  defects 67,117,149,152-53,155-61,329
  defect prevention 155-58,160
  design reviews 154,157,158,159
  development procedure 153-56
  documentation 154,155,156
  FMEA/FMECA 72-73
  interaction 156
  life-cycle phases 154
  metrics 153
  quality assurance 21,152-61,153,362
  quality attributes 153,155
  quality metrics 153
  quality growth 159-61,329-334
  specifications 154,156,157,159
  standards 143,152,153,158,159
  testing / validation 158-59
  time / space domain 153,157

Sojourn time → Stay time Solder joint 84-85,108-11,147,340-41 Solder-stop pads 146,147 Solderability test 94 Soldering temperature profile 147,148 Spare parts provisioning 125-34 Special diodes 145 Special manufacturing processes 378 Specifications 3,154,156,157,159,365,372,

376,379,381,386 Standard deviation 411 Standard industrial environment 36 Standard normal distribution 424-25,539 Standardization 117,120,149,152,155,365,386 Standards 365-68 Standby redundancy 43,62,195,206,

211,237,361,418 (see also Active & Warm) State probability 63,190,461,475-76,486-87 State space 438-41 State space extension 492 State space method 56-57 State space reduction 264 State transition diagram → Diagram of Static fault tree 270 Stationary (or in steady-state)

  alternating renewal process 180-81,454-55
  distribution 459,475,486,488
  increments (time-homogeneous) 441
  initial distribution 459,475,486,488
  Markov chain 459
  Markov process 166-67,474-76,488
  one-item structure 180-81
  process 440-41
  regenerative process 457
  renewal process 449-51
  semi-Markov process 166-67,486-88

Statistical decision 504 Statistical error → Statistical risk Statistical hypothesis 525-26 Statistical maintainability tests 303-307 Statistical quality control 16,277-86

Statistical reliability tests 277-334,503-38 Statistical risk 503 (see also α, β, β1, β2, γ) Statistically independent 397,504,512,525

(see also Stochastically independent) Statistics → Mathematical statistics Status test 116,119 Stay time (sojourn time) 163,166-67,249,264,

274,275,458,463-64,474,479,483,486,488 Steady-state → Stationary Steady-state property of Markov processes

477,480,488 Step-stress tests 312 Strategy

maintenance 35,134-36,233-36 test & screening 342-44,347-49,361,373,380 Stirling's formula 319,544 Stochastic demand 174 Stochastic matrix 458,460 Stochastic process 438-41,441-502 Stochastically independent 397,399,413 Storage temperature 148 Stress factor 33,139-40,145 Stress-strength method 69-71,76 Strict liability 15,360 Strong law of large numbers 434,505 Structure function → System function Stuck-at-state 238,247 90 Stuck-at-zero / at-one 90 Student distribution 541 Successful path method 55-56 Sufficient statistic 295,324-27,511-12,513,514 Sum of

Homogen. Poisson proc. 296,451,498 Nonhomogen. Poisson proc. 451,496,498 Point processes 501 Random variables 416-18,443 Renewal processes 497-98

Superconform 535 Superimposed processes → Sum of Superposition → Sum of Supplementary states 186-88,492 Supplementary variables 186,492 Suppressor diodes 143,144 Surface mount devices / techn. → SMD / SMT Survival function → Reliability function Susceptibility → EMC Sustainable development 10,357,385 Switch 47,48-49,213-19,220-21,236-40,255-58 Switching → Switch System 2,31,166-67,264-66,363 System's confidence limits 298 System design review 381,383-87 System effectiveness → Cost effectiveness

System failure frequency 265-66,477-78 System function 58 System mean time to failure (MTTFS) → MTTF System reconfiguration → Reconfiguration System repair frequency 266,478 System restoration frequency → System rep. freq. System specifications → Specifications Systems engineering 11,16,357,363 Systems with complex structure 31,52-67,69,

231-33,236-76 Systems with hardware and software 161 System without redundancy → Series model Systematic failure 1,3,6,109,115,329,331,342,

352,354,355,362,363

Tasks / task assignment 17-20,372-75 Technical availability 235 Technical safety → Safety Technical system → System Technological characterization 96-98 Technological properties / limits 10,38,84-85,

92,96-100,107-111,550-54 Test and screening procedures → Screening Test and screening strategy 342-44,347-49,361,

373,380 (see also Screening) Test coverage 90,91,117,231,233,241-46 Test pattern 90-93 Test plan 281,283,291,292,299,301,306,527,

528,529-30,532 Test point 147,150 Test time partitioning → Partitioning Test vector 88 Testability 117,147,149-51,155,157,158 Testing

  unknown availability 291-92,293,531-32
  unknown distr. function 312-18,533-38
  unknown MTTR 305-07
  unknown probability 280-86,287,291-92,526-30,531-32
  unknown λ or MTBF = 1/λ 298-303
  statistical hypotheses (basic theory) 525-38

Tchebycheff → Chebyshev Theorem of cut sets → Cut sets theorem Thermal cycles 83,95,98,100,108,109,110,

337-39,341 Thermal design concept / management 141 Thermal resistance 141-42 Thermal stress 145 Three parameter Weibull distrib. 421,509-10 Time censoring → Censoring Time-dep. dielectric breakdown 96-97,103 Time-homogeneous Markov process 164,166-67,440,460-83

Time-homogeneous process 440 Time schedule (diagram) 169,175,201,202,

212,242,442,453,489,490 Time to market 10,19,369 Timing diagram 146 Top-down 76,78,156,157,356 Top event 76,78,270 Tort liability → Product liability Total additivity 394 Total down time 124,174,235,499 Total expectation 415,418 Total operating time → Total up time Total probability 170,400,447,459,473 Total up time 173,174,235,478,499 Totally independent elements 52,61,210,219,225 TQM (Total Quality Management) 16,17,18,

19,20,21,353,354,363,365,366,369,372 Traceability 379,380 Training 24,119,375 Transformation of random variables 274,405,426 Transition diagram → Diagram of Transition probability 166-67,458-59,460-65,

469-71,473,483-86 (see also Diagram of) Transition rate 461-65 (see also Diagram of) Trend test 323-328 True reliability 26 Truncated distribution / random variable 71,

250,273,275,406 Truth table 88,92 Two-sided test
  const. failure rate λ or MTBF = 1/λ 298-301
  unknown probability p 280-86,526-30 (see in particular 283,301)

Type I / II error (α / β) 281-84,291-92,298-303,305-07,312-18,323-27,525-26,527,530-37

Unavailability 61,219,223,230 Unbiased 511 Unconditional
  expected value 415
  density 404
  probability 396

Uniform distribution 427 Uniformly distributed
  random numbers 274
  random variables 324

Union of events 392 Unused logic inputs 145 Up state 265-66,452-53,469 UPS 223 Useful life 8,14,35,39,81,85,118,141,169,364

(comp. with limited useful life 142, 145,146) User documentation 15,117,118-19,375,379

Value Analysis 364 Value Engineering 364 Variable resistor 100,140,146,550 Variance 410-11,415,416,506 Vibrations 82,83,108,109,341 Viscoplastic deformation 109 Voter 47,215

Wafer 97,106,148 Waiting redundancy → Warm redundancy Waiting time paradox 448 Waiting time → Stay time Warm redundancy 43,61-64,189-93,195,206,

211,361 (see also Active & Standby) Washing liquid 148 Weaknesses analysis 3,6,26-28,69,72-80,96,

139,329,380 Wearout / wearout failures 3,6-7,8,35,98,233

311,315,320,323,328,329,355,406,421,445-46

Wearout period 6,315,323,328 Weibull distribution 126-28,314-15,408-09,

420-21,509-10,548 Weibull prob. chart 314-15,421,509-10,548 Weibull process 330 Weighted sum 7,12,14,41,315-16,343-49,

403,406 (see also Cost & Mixture) Without aftereffect 320,334,451,494,497,501 Work-mission availability 173-74,499 Worst case analysis 76,384

X-ray inspection 102

Zener diodes 140,144,145 Zero defects 86 Zero hypothesis 525-27

1-out-of-2 → one-out-of-two 6-σ approach 424 85/85 test → Humidity test α particles 103 α, β 525-26 β1, β2, γ 516-17 χ² → Chi-square o(δt) (Landau notation) 461 i = 539,545

circuit 554 t t (realizations of z ) 4-5,503-38 " 2 ' " ' tl;, t2, ... (arbitrary points on the time axis, e.g.

arrival times, realizations of z T, T 2, ...) 494 7,n, T„, 418
