Handbook of Lessons
for Information Processing
System Reliability Enhancement
(Product / Control System edition)
~Excerpts from Collection of Lessons for Information
Processing System Reliability Enhancement
(Product / Control System edition)~
2013 abridged version
June 20, 2014
Information-technology Promotion Agency, Japan (IPA), Incorporated Administrative Agency
Copyright© Information-Technology Promotion Agency, Japan. All Rights Reserved 2014
PART I Collection of Lessons - Main Text (Product / Control System edition)
1. Introduction
1.1 Background & Objective
1.2 Significance of the Approach Taken by IPA/SEC
1.3 Key Features of this Handbook
2. Collection of Lessons for Information Processing System Reliability Enhancement (Product / Control System edition)
2.1 Policy on Gathering Failure Prevention Knowledge
2.1.1 Overview
2.1.2 Targeted Users of Failure Prevention Knowledge
2.1.3 Procedure to Gather and Organize Information
2.2 Lessons
PART I Collection of Lessons - Main Text
(Product / Control System edition)
1. Introduction
1.1 Background & Objective
Product / control systems (embedded systems) are so widely used in literally every corner of our lives and society
that they have now become an indispensable key infrastructure. At the same time, it is becoming increasingly
difficult to maintain the reliability of these systems as a whole due to their growing complexity.
In the past, the analysis to identify the factors that undermined the reliability of individual products and the
formulation of countermeasures against the identified causal factors were performed privately and were not
disclosed to the public. Because of this closed nature, even when similar failures occurred in different
products or industries, they could not be prevented or solved by using the countermeasures developed by those
who had encountered similar failures earlier. This was especially true of failures caused by problems that
were difficult to detect in preliminary verification or testing.
Moreover, projects to develop products entirely from scratch are becoming fewer and fewer, and opportunities
to apply accumulated knowledge and techniques to maintain or improve the reliability of products are
becoming scarce. It is therefore becoming necessary to share the experiences of product manufacturers, system
developers and user companies and to hand them down to the current and future generations.
In order to maintain the reliability of products, it is becoming increasingly important for product
manufacturers, system developers and user companies to share their experiences with other companies, industries
and future generations. To do so, they need to summarize their individual experiences and know-how in a form
that can be understood and practiced by other manufacturers. In response to this trend, IPA/SEC has decided to
collect and analyze information on failure cases, obtained through the cooperation of corporations engaged in
the architecture and development of products and control systems that support key infrastructures. IPA/SEC
has also decided to organize and systematize the countermeasures that have already been formulated and
implemented to address the identified causes of the system failures and prevent them from recurring. This
approach aims at sharing the “lessons” learned from these activities across all industries and business fields,
and at building a mechanism that helps prevent similar failures from occurring and minimizes the magnitude and
extent of the negative impact of such failures, should they occur. (Refer to Fig. 1.1.)
1.2 Significance of the Approach Taken by IPA/SEC
In order to maintain the reliability of products, it is becoming increasingly important for enterprises
(product manufacturers, system developers, user companies, etc.) to share their experiences with other companies,
industries and future generations. To do so, they need to summarize their individual experiences and know-how
in a form that can be understood and practiced by a third party. In response to this trend, IPA/SEC has
decided to collect and analyze information on failure cases, obtained through the cooperation of corporations
engaged in the architecture and development of products and control systems that support key infrastructures.
IPA/SEC has also decided to organize and systematize the countermeasures that have already been formulated and
implemented to address the identified causes of the system failures and prevent them from recurring. This
approach aims at sharing the “lessons” learned from these activities across all industries and business fields,
and at building a mechanism that helps prevent similar failures from occurring and minimizes the magnitude and
extent of the negative impact of such failures, should they occur.
In general, enterprises very rarely provide raw data on their failure cases. IPA/SEC therefore asked the
enterprises for information on failure cases that they had already analyzed internally (including the
countermeasures devised and carried out based on the analytical results). IPA/SEC received this information
on the condition that, as a public agency, it would strictly abide by the rules on preservation of
confidentiality (stipulated in the National Public Service Act, the IPA Committee Regulations, etc.). It then
used the information as the source for organizing, in an orderly manner, a generalized and abstract set of
“lessons” learned from these past failure cases, and documented the lessons in a handbook titled “Collection of
Lessons for Information Processing System Reliability Enhancement (Product / Control System edition)”.
[Figure: Mechanism of sharing and utilizing failure cases and lessons. Labels: Enterprises, engineers
(Experience, Reflect, Learn the lesson); Lessons DB; Industry lessons / failure cases / countermeasures DB;
SEC lessons / failure cases / countermeasures DB; Share (Generalize into abstract form); Utilize (Translate
into reality)]
Fig. 1.1 Sharing and utilization of knowledge
1.3 Key Features of this Handbook
One of the limitations in conducting a comprehensive set of tests to verify the appropriateness of product /
control systems in advance is that it is often difficult to specify the users of product / control systems and the
environment they will be used in. In many cases, the failures occurring in these systems are caused by factors
that have been difficult to detect during the preliminary verification and testing processes.
In practice, when a failure occurred in a product / control system, the enterprise using the system would
analyze the causes and formulate the countermeasures needed to prevent a recurrence on its own, without
disclosing the remedial actions it had taken. Since the experience of this enterprise was not shared
with other companies in the same industry or with other industries, similar failures caused by factors
difficult to detect in preliminary verification and testing processes continued to occur in other products or
industries. One of the key features of this handbook is that it is a collection of lessons that can be applied to
prevent such failures.
Moreover, projects to develop products entirely from scratch are becoming fewer and fewer, and opportunities
to apply accumulated knowledge and techniques to maintain or improve the reliability of products are
becoming scarce. It is therefore becoming necessary to develop a mechanism to share the experiences of
product manufacturers, system developers and user companies and to hand them down to the current and future
generations. Another feature of this handbook is that it has been created with the objective of handing on such
useful and valuable knowledge to future generations.
2. Collection of Lessons for Information Processing System Reliability Enhancement
(Product / Control System edition)
2.1 Policy on Gathering Failure Prevention Knowledge
2.1.1 Overview
This handbook on lessons learned from past failure cases has been prepared to provide a generalized, abstract
set of lessons for use by enterprises developing a wide range of embedded systems. Its shared source of
information is the knowledge on measures implemented in various industries to prevent system quality problems
from arising.
In this handbook, the knowledge on measures to prevent system quality problems from arising is referred to as
the “failure prevention knowledge”.
There are two types of failure prevention knowledge: i) knowledge to prevent the same or similar failures from
recurring; and ii) knowledge to prevent a foreseeable problem from arising. The former type of knowledge is
based on the knowledge gained from past incidents and failures that occurred in the same field of business, same
company or same organization, and is used to prevent the same or similar problems from recurring. The latter
type of knowledge is based on the knowledge gained from past incidents and failures that occurred in other
fields of business, companies or organizations, which have been generalized into an abstract set of knowledge to
prevent the same or similar problems from arising in a specific field of business, company or organization that
never in the past has experienced the incident or failure addressed by the acquired knowledge.
Moreover, the knowledge gathered for the preparation of this handbook includes both the knowledge to
improve the quality of the system by eliminating the intrinsic risks of failure and the knowledge to make the
system more robust and fault-tolerant, so that these risks do not cause an observable adverse effect on the
system even when they materialize.
There are many expressions used to describe a problematic state, including the terms “fault”, “failure” and
“defect”. In this handbook, the problematic state is expressed as either “fault” or “failure” in accordance with the
terminology defined in JIS X 0133 and IEEE 1044, where “fault” is defined as the problematic state caused by one
or more incorrect steps, processes or data existing in any of the programs of the computing system, and where
“failure” refers to the state when a product is unable to execute a required function or when a system does not
have the ability to execute a function within the limitations specified beforehand.
2.1.2 Targeted Users of Failure Prevention Knowledge
The failure prevention knowledge described in this handbook has been gathered on the assumption that it will be
used mainly by the following three types of engineers engaged in the development of embedded systems:
・Software architects
・System designers belonging to the vendor
・System designers belonging to the user company
End users who have not received special training on how to use the system have been excluded from the
list of targeted users, based on the assumption that it would be difficult for them to make effective use of the
failure prevention knowledge.
2.1.3 Procedure to Gather and Organize Information
It is desirable for each enterprise to work on defining the lessons based on the failure prevention knowledge.
This handbook gives an example of a procedure and method for gathering and organizing the information
that can be translated into failure prevention knowledge.
[Background of the knowledge gathering procedure]
Failure prevention knowledge should not simply be a set of findings and practices
gathered from various corporate entities that encountered similar failure cases in the past. Rather, it should be
derived from the core information extracted from various similar failure cases, which is then
translated into failure prevention knowledge that is relevant to users intending to prevent their own specific
failures. This translation step is needed so that the enterprises providing information on their
failure cases and the knowledge gained from their experiences can feel secure in sharing their internal
information. If the shared information were disclosed to the public as it is, it is easy to imagine that they
would be reluctant to reveal their failures openly from then on. Moreover, since products with embedded systems
range widely, it would be difficult for the users of certain products to relate to the knowledge gained from
the failure of other types of product, even within the same company. Failure prevention knowledge should
therefore be rewritten as much as possible in a way that is relatable to users in similar fields of business.
[Framework of the knowledge gathering procedure]
The process of gathering and organizing the information that can be translated into failure prevention knowledge
should be repeated to find where it can be improved, and ultimately established as a fixed procedure that any
enterprise can use to gather and utilize failure prevention knowledge. To establish the knowledge gathering
process, a set of methods needs to be put in place that clearly defines how to classify the types of failure
prevention knowledge, filter them, extract the core information that can be translated into knowledge, generalize
the knowledge into abstract form and rewrite it as knowledge on failure prevention measures
that is specifically relatable. Since it is difficult to establish all of these methods at once, the process
of gathering and organizing the information needs to be executed repeatedly and refined in the course of
repetition.
[Specific steps of knowledge gathering procedure]
Based on the aforesaid background, failure prevention knowledge should be gathered according to the
following procedure:
1) Conduct an interview using the failure prevention knowledge sheet
By using the failure prevention knowledge sheet explained later on, interview the stakeholders or request
them to fill out the sheet directly. At this stage, the raw information should be recorded without translating it
into information on other relatable failure cases.
2) Make the information abstract by rewriting it into information that is relatable to other business fields and
failure cases
3) Make the information abstract by keeping only the core essence of the gathered failure prevention
knowledge and translating it into the failure prevention knowledge that is relatable to other business fields
and failure cases.
4) Extract the failure prevention knowledge in each process
Extract the failure prevention knowledge in each process to create a matrix that shows what kind of problem
may occur in a process when no prevention measures are taken compared to the processes where
prevention measures have been taken.
5) Conduct a review
Request the experts to review the failure prevention knowledge, and refine the descriptions by trimming the
information to an appropriate length and rephrasing the parts that are difficult to comprehend.
6) Revise the knowledge
Revise the description of the failure prevention knowledge according to the result of the review.
[Designing the failure prevention knowledge sheet]
The failure prevention knowledge sheet is a spreadsheet used to enter information on failure prevention. Its
cells should be filled out with just the right amount of primary information on failure prevention by
those who can provide such information. It is desirable to design this sheet with entry items that take the
characteristics of each domain into consideration. Some examples of these entry items are explained later for
reference.
Fig. 2.1 shows a schematic drawing that illustrates how the failure prevention knowledge sheet should be
designed. The preparation of this sheet should be considered complete when the entered information is sorted in
the order exemplified in Fig. 2.1.
As the first step, start with sorting the information that describes the failed state right when it occurred. Some
examples of such information include the phenomenon that could be observed immediately after the problem
became visible, like when the system got out of control, or when the system stopped all of a sudden. Next, sort
the information that describes the event that occurred internally, which became the direct trigger to cause the
phenomenon.
Then sort the information that describes the factors that became the direct cause of failure. These causal
factors can be further broken down into factors related to technical aspect, personal (or human) aspect, or
organizational aspect of system development.
After sorting the information describing the failed state and the factors causing the failure, examine the
permanent measures to prevent the described failure from occurring.
Fig. 2.1 Policy on designing the failure prevention knowledge sheet
[Items to enter in the failure prevention knowledge sheet]
Based on the design policy explained above, create the failure prevention knowledge sheet, keeping in mind
what kind of entry items it should be composed of. The standard set of entry items is exemplified below:
Lesson title
Key words to briefly describe the background of the failure
Product domain
Product features
Observable phenomena
Event that occurred internally
Causal factors
Preventive measures
Besides the entry items explained above in the discussion of how the failure prevention knowledge sheet
should be designed (observable phenomena, event that occurred internally, causal factors, preventive measures),
items for entering the background information of the identified failure and the failure prevention knowledge
that can be useful in each process should also be added.
[Contents of Fig. 2.1]
Incidents
Column headings: Types of incidents / Causal factors / Countermeasures against these factors
Types of incidents: Observable phenomena *1)
✓ System goes out of control
✓ System stoppage
✓ Partial malfunction
✓ Inconsistency with operating environment
✓ System does not activate
✓ Screen freeze
✓ ・・・・
*1) or phenomena that may have occurred if left unattended
Types of incidents: Event that occurred internally
✓ Data erased or destroyed
✓ Illegal interrupt
✓ Incomplete data transmission / reception
✓ Log data processing error
✓ Memory leak
✓ ・・・・
Causal factors: [Technical aspect]
(1) Requirements
✓ Misunderstanding requirements
✓ Missing requirements
✓ ・・・・
(2) Design
✓ Misunderstanding specifications
✓ Unconsidered feature interactions
✓ Requirements that have not been presented
✓ Unconsidered exception processing
✓ Unconsidered conditional judgment processing
✓ File export processing error
✓ ・・・・
(3) Implementation
✓ Gap between code and implementation
✓ Incorrectly described ‘null’ versus ‘blank’
✓ ・・・・
Causal factors: [Personal / organizational aspect]
✓ Lack of communication
✓ ・・・・
Countermeasures against these factors
◆ Express the specifications more clearly
・Review the Japanese language level
・Reorganize the information
◆ Reflect the design conventions
◆ Use the document templates
◆ Improve the completeness of tests
Failure prevention knowledge ⇒ Generalization / abstraction ⇒ Lessons
The background information on the identified failure refers, for example, to the level of system reliability
required in the product domain and to the hardware features. Such background information should
also be provided because the failure prevention knowledge required to devise effective
preventive measures can vary greatly depending on these quality requirements and product features.
Moreover, the failure prevention knowledge that can be useful in each process should also be added to help
understand what kind of preventive measures would be effective at which stage of development to prevent the
problem specified in the failure prevention knowledge sheet from occurring.
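As an illustration only, the entry items listed above could be held in a simple record type. The following is a minimal sketch, assuming Python; the class name `FailurePreventionSheet`, its field names and the `missing_items` helper are hypothetical and are not defined by this handbook. The sample values are taken from Lesson 1, and the items that Lesson 1 does not state (key words, product domain) are deliberately left empty.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Standard entry items named in the handbook (attribute names are hypothetical).
STANDARD_ITEMS = (
    "lesson_title", "keywords", "product_domain", "product_features",
    "observable_phenomena", "internal_events", "causal_factors",
    "preventive_measures",
)


@dataclass
class FailurePreventionSheet:
    """One entry of a failure prevention knowledge sheet (illustrative only)."""
    lesson_title: str
    keywords: List[str]
    product_domain: str
    product_features: str
    observable_phenomena: List[str]
    internal_events: List[str]
    # Causal factors grouped by aspect: technical, personal, organizational.
    causal_factors: Dict[str, List[str]]
    preventive_measures: List[str]
    # Additional items: background of the failure, knowledge useful per process.
    background: str = ""
    knowledge_by_process: Dict[str, List[str]] = field(default_factory=dict)

    def missing_items(self) -> List[str]:
        """Return the standard entry items that are still empty."""
        return [name for name in STANDARD_ITEMS if not getattr(self, name)]


sheet = FailurePreventionSheet(
    lesson_title="Verify changed conditional logic with a decision table",
    keywords=[],        # not stated in Lesson 1, left empty on purpose
    product_domain="",  # not stated in Lesson 1, left empty on purpose
    product_features="Opens / closes emergency doors by using air pressure",
    observable_phenomena=["Several emergency doors did not open"],
    internal_events=["Conflicting condition in the air pressure control logic"],
    causal_factors={"technical": ["One line of the condition was deleted"]},
    preventive_measures=["Prepare a decision table when changing conditions"],
)
print(sheet.missing_items())
```

A check such as `missing_items` is one possible way to notice, before the review step, that an interviewee has left a standard entry item blank.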
[Policy on organizing the failure prevention knowledge by process]
Create a table that shows the failure prevention measures categorized by process and the potential
problems that can occur in each process if these preventive measures are not taken. Organize the
information by process so that the users of the failure prevention knowledge sheet can gain an
overview of the failure prevention knowledge that is useful in each process, and trace back from the knowledge
sorted by process to the original failure prevention knowledge. (Fig. 2.2)
[Figure: Failure prevention knowledge in each process (e.g. "Analysis of system requirements: ・・・・";
"System architectural design: 1) Conduct an impact analysis when changes are made to the hardware. 2) ・・・・
3) ・・・・") is linked to the collection of failure cases (Case 1, Case 2, Case 3, Case 4), each with a concrete
description of the failure (e.g. "The terminal does not start up even when the power button is pressed.
→ Case ????-?")]
Fig. 2.2 Association between failure prevention knowledge
and past failure cases categorized by process
For organizing the failure prevention knowledge by process, the example shown in this handbook uses the
process model [1] in “Embedded System development Process Reference guide (ESPR) Ver. 2.0”. For the
processes, it is desirable to include not only the system engineering process (SYP) and software engineering
process (SWP), but also the support process (SUP) as much as possible. The educational process is not
defined in ESPR, but including it should also be considered in order to expand the variety of failure
prevention knowledge defined by process. Moreover, since embedded system development nowadays often
consists of partial development of areas newly added to an existing system, it is also desirable to note
whether the failure prevention knowledge in each process is effective only for the newly developed areas
or applicable to the entire process.
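The association between process-level knowledge and the original failure cases can be pictured as a simple index. The following is a minimal sketch, assuming Python; the function names `build_process_index` and `trace_cases`, the case identifiers and the assignment of measures to cases are all hypothetical examples, not data from this handbook.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (case id, {process name: [preventive measures useful in that process]})
CaseEntry = Tuple[str, Dict[str, List[str]]]


def build_process_index(cases: Iterable[CaseEntry]) -> Dict[str, List[Tuple[str, str]]]:
    """Group preventive measures by process, remembering the source case."""
    index = defaultdict(list)
    for case_id, by_process in cases:
        for process, measures in by_process.items():
            for measure in measures:
                index[process].append((measure, case_id))
    return dict(index)


def trace_cases(index: Dict[str, List[Tuple[str, str]]], process: str) -> List[str]:
    """Trace back from a process to the failure cases behind its knowledge."""
    return sorted({case_id for _, case_id in index.get(process, [])})


# Hypothetical sample data; the case-to-measure assignment is invented.
impact = "Conduct an impact analysis when changes are made to the hardware."
cases = [
    ("Case 1", {"System architectural design": [impact]}),
    ("Case 2", {"System testing": ["Perform boundary value tests."]}),
    ("Case 3", {"System architectural design": [impact],
                "System testing": ["Perform boundary value tests."]}),
]

index = build_process_index(cases)
print(trace_cases(index, "System architectural design"))
```

Keeping the case identifier next to each measure is what allows a reader of the per-process overview to return to the concrete description of the failure.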
2.2 Lessons
This chapter introduces a series of lessons derived by abstracting the failure
prevention knowledge obtained from the analysis of the wide range of failure cases that have been gathered.
Table 2.1 below lists these lessons and the processes in which the measures formulated from each
lesson need to be implemented to prevent the identified failures from occurring.
Table 2.1 List of lessons and corresponding processes
Process column headings: System testing; Education / training; Project management; Operation; System requirements definition; System architecture design; Software architecture design; Software architecture design (modifications); Implementation (coding); Review
Lesson # Lesson title (○ = corresponding process)
1 Verification using, for example, a decision table is effective when the logic composed of complex conditional expressions is changed. ○ ○
2 To modify a function that has a total of over a hundred unsorted conditions, or a function that has 10 or more conditions, check whether there is any inconsistency by identifying and sorting all the conditions that are related. ○ ○
3 When integrating multi-functional modules, check whether there are any conditions missing when the sum of the number of conditions differs before and after the integration. ○ ○
4 When the range of values of variables is wide and there are very numerous variations of parameter combinations, divide the range of variables into appropriate sizes, and perform boundary value tests. ○
5 To use an internal battery, take into consideration the boot sequence that occurs when the battery is in a deeply discharged state. ○ ○ ○ ○ ○
6 When using flash memory, keep in mind the finite number of times data can be written within its life cycle. ○ ○ ○ ○
7 When adding a function that consumes a lot of power, keep in mind the impact of temporary voltage drop (reset, freeze, etc.), the type of power source used, and remaining battery capacity. ○
8 Formally analyze all the exceptions that can be assumed. ○ ○
9 When adopting a redundant system, appropriately set the area for data synchronization. ○
10 Even when the same hardware is specified as the control target, reconfirm the hardware specifications if the operation conditions are going to be changed. ○ ○ ○ ○
11 When sharing (or passing) data between processes or threads, keep a close eye on whether the exclusive control / synchronous processing is being carried out correctly or whether deadlock is occurring. ○ ○ ○
12 The instrument used for testing yield-type products to determine whether they are defective or non-defective products should be suspected to be abnormal when it outputs test results indicating that all the tested products are either defective or non-defective. ○ ○
13 When improving the performance of existing software, check when the idling time occurs, when the process goes out of sync, and how they affect the software. ○ ○ ○ ○ ○
14 ・When handling a large volume of data via a communication device, be careful not to create any bottlenecks in the sequential processing flow.
   ・Also take into consideration the load fluctuation at different time zones. ○ ○ ○ ○
15 Be prepared with a recovery plan to support all the foreseeable abnormal system operations (reset, power shutdown, state of being left uncontrolled, etc.) that may occur during sequential execution of business systems that are used by the customers after delivery. ○ ○
16 Prepare a specifications document and conduct impact evaluation even for the processing of maintenance log data that is used for failure analysis. ○
17 Do not only extract the conditions required to perform the determination process, but also identify all the conditions that should be processed as limitations. ○
18 Be careful of the fragmentation of log files. ○
Lesson 1
Lesson title: Verification using, for example, a decision table is effective when the logic composed of complex
conditional expressions is changed.
Product features:
A system designed to open / close a number of emergency doors by using air pressure
In this system, the movement to open / close the emergency doors is controlled by the input of
multiple sensors, including the speed sensor and various interlocks. A high level of certainty that
these doors open / close without fail in an emergency must be secured.
Observable phenomena:
During trial use, several emergency doors did not open when they were supposed to. This
incident not only lowered system reliability but also damaged the credibility of the
manufacturer.
Event that occurred internally:
Through troubleshooting, it was found that a conflicting condition was set in the logic that
controlled the air pressure for opening / closing the emergency doors.
Causal factors:
The power source used to drive the compressor for opening / closing the emergency doors was also
used by other systems in order to use the available power efficiently. The electric current
for providing residual heat to the solenoid was also supplied from this power source. Then the
power source configuration was changed. This configuration change required the deletion of the
condition that had been set to secure the current for the residual heat. The system engineer
assigned to the program modification job simply thought that cutting one line of the
conditional expressions would suffice, and deleted this line without realizing that the modified
pattern no longer met the overall requirement specifications.
Before the change:
(Precondition)
&& (Air pressure is equal to or above the set value)
&& ( (●No request for electric current)
|| (Abnormal value detected by speed sensor)
|| (■Abnormal value detected by thermal sensor) )
Error state after the change:
(Precondition)
&& (Air pressure is equal to or above the set value)
&& ( (Abnormal value detected by speed sensor)
|| (■Abnormal value detected by thermal sensor) )
After implementing the countermeasure:
(Precondition)
&& ( (Abnormal value detected by speed sensor)
|| (■Abnormal value detected by thermal sensor)
|| (Air pressure is equal to or above the set value) )
Preventive measures:
Measures against the direct cause:
・Correct the logic after confirming the conditional expressions.
Permanent measures against causal factors (clearly describe the process to be taken):
When changing a complicated set of conditions, check whether there is any conflict or
non-compliance in the logic after the change by preparing a decision table like the one shown
below. Using such a decision table makes it easier to compare what has been deleted
from or added to the original pattern of conditions (the change becomes easier to notice by
using different colors, for example), and therefore helps prevent system failures
caused by incorrectly written logic.
Before the change:
Conditional expressions                            Established condition
                                                   #1  #2  #3  #4  #5
&& (Precondition)                                  ○   ○   ○
&& Air pressure >= Set value                       ○   ○   ○
&&
   || ●Request for electric current == FALSE       ○
   || Abnormal speed == TRUE                           ○
   || ■Abnormal temperature == TRUE                        ○
Error state after the change:
Conditional expressions                            Established condition
                                                   #1  #2  #3  #4  #5
&& (Precondition)                                  ○   ○
&& Air pressure >= Set value                       ○   ○
&&
   || Abnormal speed == TRUE                       ○
   || ■Abnormal temperature == TRUE                    ○
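The comparison that a decision table supports can also be sketched in code by enumerating every combination of the conditions and listing the combinations whose result differs before and after a change. The following is a minimal sketch, assuming Python; the function and variable names are hypothetical, and the two expressions are transcribed from the "Before the change" table and the "Error state after the change" table of this lesson.

```python
from itertools import product

# Condition names (hypothetical identifiers for the conditions in the tables).
NAMES = ("precondition", "pressure_ok", "no_current_request",
         "speed_abnormal", "thermal_abnormal")


def before_change(pre, pressure_ok, no_req, speed, thermal):
    """Logic before the change: all three OR-ed conditions are present."""
    return pre and pressure_ok and (no_req or speed or thermal)


def error_state(pre, pressure_ok, no_req, speed, thermal):
    """Logic after one line (no request for electric current) was deleted."""
    return pre and pressure_ok and (speed or thermal)


def changed_rows(old, new):
    """Return every input combination for which old and new logic disagree."""
    return [row for row in product((False, True), repeat=len(NAMES))
            if old(*row) != new(*row)]


diff = changed_rows(before_change, error_state)
for row in diff:
    print(dict(zip(NAMES, row)),
          "before:", before_change(*row), "after:", error_state(*row))
```

Under these assumptions, the only combination that changes is the one where the precondition and the air pressure condition hold, there is no request for electric current, and neither sensor reports an abnormal value. Listing such rows makes the effect of deleting a single line visible before the change is released.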
Lesson 2
Lesson title To modify a function that has a total of over hundred unassociated conditions, or a function that
has 10 or more conditions, check whether there is any inconsistency by identifying and associating
all the conditions that are related.
Product features
A system designed to control the process that involves solvent treatment
A high reliability is required as it is used in rigorous environmental conditions including
temperature.
Observable
phenomena
While the system was in operation, a drainage pipe cracked. As a result, part of the pump used
for treating exhaust air continued to operate even after the predefined exhaust treatment had
completed, and burnt out due to overheating. Although this incident did not escalate into a fire,
the functionality of the exhaust treatment system was partially lost. Had a fire broken out and
spread, it might have led to a major accident.
Event that
occurred
internally
This exhaust treatment system was composed of two pumps, A and B. Each pump was activated when
the signal instructing the start of the draining operation was switched on. According to the
specifications, the exhaust monitor control would operate for a predefined length of time to
activate the pump. Pump A, once activated, would stop when this monitor control completed its
function, and pump B would then activate and proceed with the exhaust treatment. In this failure,
the crack in the drainage pipe damaged the thermal sensor. Because the thermal sensor was broken,
a part of the control sequence that was meant to function in the abnormal state did not work.
Consequently, the signal instructing the start of the draining operation was switched off before
the exhaust monitor completed its control function after activating pump A, and a part of pump A
burnt out because the pump continued to operate while the exhaust monitor was left in an
incomplete state.
Causal factors ・The original logic was written on the assumption that the signal instructing the start of the
draining operation would remain switched on until the exhaust treatment completed. This logic
kept the system operating normally even when the thermal sensor malfunctioned. But when the
specifications of the exhaust treatment process were changed and the pump activation sequence
had to be modified, the logic was unintentionally rewritten so that it would switch off the signal
instructing the start of the draining operation without waiting for the exhaust treatment to
complete.
・This exhaust treatment was controlled in conjunction with the aforesaid reaction process. But the
system engineer assigned to the program modification had little knowledge of, or experience with,
the control mechanism as a whole, and performed the modification taking only the reaction
function into consideration.
Preventive
measures
Measures against the direct cause:
・Correct the logic after confirming the conditional expressions.
Permanent measures against causal factors (clearly describe the process to be taken):
① Comparing the number of variables before and after the change is effective to a certain
extent. But in this case, where the number of variables hardly differs before and after the
change, it is difficult to notice the change just by counting variables.
In this case, complex control logic composed of a growing number of conditions is concentrated
too heavily in a specific module. From the standpoint of maintenance, it is therefore desirable
to split the conditions appropriately across separate modules. Reducing the scope of each module
reduces the possibility of processing errors, and the design work can be completed more easily
and in a shorter time.
Variables   Scale of change   Complexity before the change (Input variables, Number of conditions)   Complexity after the change (Input variables, Number of conditions)
State when pump A is operating   None   8   7   8   7
Activation of pump A   Small   6   5   7   6
Operation of ○○ valve   Small   7   8   9   11
State when ○○ valve is open   None   5   4   5   4
Instruction to start draining   -   26   43
State when draining operation is executed   -   30   58
② Implement logic that stops pump A and pump B for a certain period of time so that they do
not operate continuously to the point where both burn out. Also assign seasoned engineers to
teach the designers the technical knowledge required to understand the entire process in depth.
Putting this educational activity into practice on a regular basis allows the knowledge of
maintaining system reliability that the organization has accumulated over the years to be passed
on continuously to the next generation of engineers.
Not much difference before and after the change
Definition of complexity and indication of threshold:
A function that has a total of over a hundred
unassociated conditions, or a function that has 10 or
more conditions
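Measure ② above asks for logic that keeps a pump from running continuously until it burns out. One possible shape for such a guard is sketched below. The class name, the time limits and the idea of an enforced rest period are assumptions made for illustration, not details given in the original case.

```python
class PumpRunGuard:
    """Force a pump off after a maximum continuous run time (hypothetical limits)."""

    def __init__(self, max_run_s, rest_s):
        self.max_run_s = max_run_s   # longest allowed continuous run, in seconds
        self.rest_s = rest_s         # enforced rest after a forced stop, in seconds
        self.started_at = None       # time the current run began
        self.locked_until = 0.0      # the pump may not run before this time

    def allow_run(self, requested, now):
        """Return True if the pump may run at time `now` (seconds)."""
        if not requested or now < self.locked_until:
            self.started_at = None
            return False
        if self.started_at is None:
            self.started_at = now
        if now - self.started_at >= self.max_run_s:
            # The pump has run too long: stop it and enforce the rest period.
            self.locked_until = now + self.rest_s
            self.started_at = None
            return False
        return True
```

The guard deliberately ignores why the pump is still being requested. Even if the control sequence is left in an incomplete state, as in this case, the run time stays bounded.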
Lesson 3
Lesson title When integrating multi-functional modules, check whether there are any conditions missing when
the sum of the number of conditions differs before and after the integration.
Product features
A system to control the automatic guided vehicles (AGVs) used to convey intermediate and finished
products to different areas of the plant
This system must be capable of controlling the optimal conveying routes and operating at a very
high rate in order to achieve high plant productivity.
Observable
phenomena
Some of the many position sensors malfunctioned while the system was in operation. While the
maintenance team was preparing to fix the defective sensors, automatic guided vehicles collided
with each other. This incident stopped the conveying operation for a long time and slowed down
the shipping process.
Event that
occurred
internally
The automatic guided vehicles were designed to move by learning from the information coming in
from the sensors, which detect other running vehicles, people or objects and measure the distance
to them, and by processing this input in an optimal control routine. Among the conditions that
were supposed to be input from the sensors that became defective were conditions instructing the
vehicles to learn the forward positions, and these conditions turned out to be missing.
Causal factors ・The self-learning function was controlled by multiple software modules. Changes in business
needs then required this function to be integrated. When the configuration of the modules was
modified to meet the functional integration requirement, some of the conditions included in the
original behavioral specifications were omitted in the transcription.
・The modified program was debugged in the normal development environment and checked with a
structure visualization tool. But the system engineer in charge of this verification did not
realize that a condition was missing from the specifications, because collisions of automatic
guided vehicles rarely occurred.
Before integration: Error state after integration: After modification:
Module A
Module B
Module C
Preventive
measures
Measures against the direct cause:
The conditions before and after the change were compared, and the program was corrected by
adding the missing condition that was identified in this comparative analysis.
Permanent measures against causal factors (clearly describe the process to be taken):
・Normally, the number of parameter combinations that have to be checked to verify the consistency
of the learning function tends to become very large. Unless a real machine is used for this
verification, it is often difficult to detect missing conditions. The organization that was using
this system had built a virtual environment in advance to perform the tests required for the
verification. But there is a limit to the completeness of the tests that can be conducted in a
virtual environment.
・As an effective measure to prevent conditions from going missing in the modified software
program, use a structure visualization tool for visual examination. Use this tool to check whether
X ≒ Y (see below for the definitions of X and Y), and record the results.
X: Sum of the specified number of conditions arising in the modules before integration
Y: Specified number of conditions arising in the modules after integration
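The X ≒ Y check can also be recorded with a small script alongside the visual examination. The sketch below assumes that the conditions of each module can be exported as a list of names, for example from the visualization tool. The function name, the data layout and the condition names used in the usage note are hypothetical.

```python
def check_condition_counts(before_modules, after_module, tolerance=0):
    """Compare X (sum of conditions before integration) with Y (after integration).

    before_modules: dict mapping module name -> list of condition names
    after_module:   list of condition names in the integrated module
    Returns (x, y, missing), where missing lists the conditions that are
    present before integration but absent afterwards.
    """
    x = sum(len(conditions) for conditions in before_modules.values())
    y = len(after_module)
    expected = {c for conditions in before_modules.values() for c in conditions}
    missing = sorted(expected - set(after_module))
    if abs(x - y) > tolerance or missing:
        print(f"WARNING: X={x}, Y={y}, missing={missing}")
    return x, y, missing
```

For example, with `{"A": ["cond_circle"], "B": ["cond_triangle"], "C": ["cond_square"]}` before integration and `["cond_circle", "cond_square"]` afterwards, the function returns X = 3, Y = 2 and reports `cond_triangle` as missing. Comparing names as well as counts catches the case where one condition was dropped and an unrelated one was added, which would leave X and Y equal.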
Before integration:
FW learning function ●: FW learning instruction condition ●
FW learning function ▲: FW learning instruction condition ▲
FW learning function ■: FW learning instruction condition ■
Error state after integration:
FW learning function: FW learning instruction conditions ●, ▲, ■ (callout: Missing condition)
After modification:
FW learning function: FW learning instruction conditions ●, ▲, ■
Comparison item   Before integration and error state after integration   Before integration and after correction
Comparison of the code size that appears on the visualization tool   Differs largely   Almost the same
Comparison of the number of conditions that appears on the visualization tool   Before integration: Sum>19 or more; Error state after integration: Sum=4   Before integration: Sum>19 or more; After correction: Sum>19 or more
This simple visual method makes it easy to find whether there are any conditions missing in the
modified software program or not.
Lesson 4
Lesson title When the range of values of variables is wide and there is a very large number of parameter
combinations, divide the range of variables into appropriately sized subranges and perform
boundary value tests.
Product features
A system designed to transfer a specified volume of chemical substance on conveying trays along a
designated route at pre-defined intervals
A high level of certainty in movement and accuracy in response time are required for this transfer
system.
Observable
phenomena
One day, a conveyor line that was transferring the chemical substance stopped all of a sudden due
to the activation of the emergency stop mechanism. According to the specifications, this transfer
system had been programmed to enable the line to restart the pre-defined sequence from where it
stopped, based on the direction the trays were heading, the volume of chemical substance they
were carrying, and their respective positions on the line when it stopped. However, on this
particular day, the line did not restart for a long time and the extended downtime led to an
increasingly large production loss. In addition, the chemical substances remaining still on the
conveyor line had to be removed by manual labor, which also required a long time to complete
since the chemical substances were hazardous and therefore had to be handled very carefully.
Event that
occurred
internally
The system was programmed to restart the transfer operation sequence after the line stops, after
calculating the most probable positions and direction of the trays by referring to the parameter
table. But in this failure, the wrong module was used to refer to the parameter table and
calculate the tray positions and direction. As a result, tray #5 was targeted as the tray to
restart instead of tray #4, which was the tray that actually had to be restarted.
Causal factors ・To improve the calculation process, the modules were merged without changing the original
functionality. But in the course of this integration, the calculation logic was unintentionally
modified into logic that was non-compliant with the original behavioral specifications.
・This calculation logic determines which sequence to restart, based on information about the state
in which the line stopped (the direction of the trays, the volume of chemical substance they
were carrying, their respective positions on the stopped line, etc.). Combination tests were
conducted to check the modified calculation logic, but the completeness of these tests was
limited because there were so many parameter combinations to test and, in addition, the range of
values of these parameters was very wide.
Preventive
measures
Measures against the direct cause:
After re-examining the modified logic, the non-compliant parts of the logic were rewritten
correctly.
Permanent measures against causal factors (clearly describe the process to be taken):
Because the missing conditions in the modified calculation logic went undetected when there were
so many parameter combinations to test and the range of values of these parameters was very wide,
adopt the following two verification approaches, which make use of model testing techniques.
Property verification:
Perfect match verification:
Moreover, if the range of values of the variables is wide and the parameter combinations are very
numerous, the number of parameter combinations examined in the model test may explode.
Therefore, in such a situation, it is advisable to generalize the conditions into an abstract form.
In this particular case, the model test was performed after dividing the range of variables into
appropriately sized subranges. By doing so, the verification could be completed easily in a
relatively short period of time.
[Figure, property verification: Module before the change; Module after the change; Constrained
conditions (prior conditions and post conditions); Established?]
[Figure, perfect match verification: Module before the change; Module after the change; Merged
module; Prior conditions; Do the targeted variables before the change and after the change match
perfectly?]
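The two ideas in this lesson, dividing a wide range into subranges and checking that the module before and after the change match perfectly, can be combined in a simple test harness. The sketch below is an illustration under stated assumptions: the tray-index functions, the 250 mm pitch and the 0 to 1000 mm range are invented for the example, and the "new" function contains a deliberate off-by-one error of the kind that selects the wrong tray.

```python
def boundary_values(low, high, partitions):
    """Split [low, high] into roughly equal integer partitions and return
    the values on and next to each partition boundary."""
    step = (high - low) // partitions
    edges = [low + i * step for i in range(partitions)] + [high]
    values = set()
    for edge in edges:
        for v in (edge - 1, edge, edge + 1):
            if low <= v <= high:
                values.add(v)
    return sorted(values)


def perfect_match(old_module, new_module, test_values):
    """Return the inputs for which the changed module disagrees with the original."""
    return [v for v in test_values if old_module(v) != new_module(v)]


# Hypothetical restart logic: derive a tray index from a position in millimetres.
def old_tray_index(position_mm):
    return position_mm // 250


def new_tray_index(position_mm):
    # Deliberate off-by-one error introduced for the example.
    return (position_mm + 1) // 250
```

Running `perfect_match(old_tray_index, new_tray_index, boundary_values(0, 1000, 4))` tests 13 values instead of 1001 and still exposes the mismatch just below each boundary. The partition size has to follow the real structure of the parameter table. If the subranges do not line up with the points where the behavior changes, boundary tests can miss the defect.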
Lesson 5
Lesson title When using an internal battery, take into consideration the boot sequence that occurs when the
battery is in a deeply discharged state.
Product features
Portable terminal for business use that is equipped with a display, wireless communication
function and an internal battery that enables it to be used without connecting to an AC charger
Observable
phenomena
The terminal does not activate even when the power button is pressed.
Power cannot be charged even when the AC charger is connected.
Event that
occurred
internally
・When the voltage of the battery is extremely low (i.e. when the battery is in a deeply discharged
state), the software is programmed to prevent the terminal from activating until the voltage
recovers to a certain level through charging or other means (i.e. until the voltage rises from the
deeply discharged state to the threshold level set as the voltage to activate the terminal).
・When the AC adapter was connected while the battery was in a deeply discharged state, the
voltage increased due to the power supplied through the AC adapter. When the user pushed the
power button, the software checked whether the terminal could be activated based on the voltage
of the internal battery, which included the contribution of the charging power, determined that
it could because the measured voltage was higher than the threshold level set as the voltage to
activate the terminal, and started the boot process.
・However, according to the specifications of the terminal boot process, charging is temporarily
stopped during the process. When charging stops, the voltage of the battery returns to its
original (deeply discharged) level and falls below the threshold level set in the power IC as
the voltage to activate the terminal.
・When the voltage of the battery falls below the threshold level set as the voltage to activate
the terminal, the hardware determines that the battery voltage is insufficient, stops activating
the terminal, and stops supplying power to the entire system. As a result, the system stops
without charging the battery.
Causal factors ・Did not consider the battery voltage dropping to an extremely low level (in deeply discharged
state) as one of the scenarios for evaluating the operability of the terminal.
・Lacked consideration of the battery voltage dropping to an extremely low level (in deeply
discharged state) when the terminal is connected to an AC charger.
Preventive
measures
Measures against the direct cause:
・Change the threshold level of the voltage to activate the terminal when the terminal is connected
to an AC adapter.
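The direct measure above can be pictured as a boot gate that applies a different threshold while the AC adapter is connected. The sketch below assumes that the change is to raise the threshold in that situation, so that the voltage still exceeds the power IC limit once charging pauses during boot. Both threshold values and the function name are hypothetical.

```python
# Hypothetical thresholds in millivolts; real values depend on the battery and power IC.
BOOT_THRESHOLD_MV = 3400         # minimum reading to boot on battery alone
BOOT_THRESHOLD_ON_AC_MV = 3600   # stricter limit while the adapter inflates the reading


def may_start_boot(measured_mv, ac_connected):
    """Decide whether to start the boot sequence from the measured battery voltage."""
    threshold = BOOT_THRESHOLD_ON_AC_MV if ac_connected else BOOT_THRESHOLD_MV
    return measured_mv >= threshold
```

With these values, a reading of 3500 mV allows a boot on battery alone but not while charging, where part of the reading is assumed to come from the adapter. The margin between the two thresholds would have to be confirmed on the actual terminal and battery, as the permanent measures below require.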
Permanent measures against causal factors (clearly describe the process to be taken):
・Recognize the need to take corrective action against specifications of a battery-run terminal
that allow the battery voltage to drop to an extremely low level (to a deeply discharged state),
such as when the battery is left unused for a long period of time.
・Need to devise ways to prevent the battery voltage from dropping to extremely low level (to
deeply discharged state), consider methods to recover the voltage if it drops even after devising
preventive measures, and be sure to verify the effectiveness of these preventive measures and
recovery methods with the actual terminal and its internal battery.
・Take into consideration the operating characteristics of the terminal’s starting current, the
possibility of its internal battery voltage dropping, and the variability of the extent to which it
drops, and incorporate these elements in the design.
・Investigate / review the customer’s use case, and evaluate the solution to this problem under the
same conditions / environment in which the terminal is actually used by the customer (end user).
Implementation of the preventive measures described above should make it easier to design the
product in such a way so that the following problems related to power charge can be prevented
from occurring:
・The terminal does not activate even when the power button is pressed;
・Power cannot be charged even when the AC charger is connected.
Lesson 6
Lesson title When using flash memory, keep in mind the finite number of times data can be written within its
life cycle.
Product features
Portable terminal for business use that is equipped with a display, wireless communication
function, built-in flash memory for data storage and an internal battery that enables it to be used
without connecting to an AC charger
Observable
phenomena
・The terminal does not activate even when the power button is pressed.
・The terminal freezes while it is starting up.
・The terminal blacks out (due to power outage), resets or freezes while it is used.
Event that
occurred
internally
・Flash memory is used as the storage for this terminal (to save data of the operating system (OS),
apps, etc.)
・The values in a specific area of the flash memory are corrupted, making the terminal unable to
start or causing other kinds of problems.
Causal factors ・The values in a specific area of the flash memory were found to be corrupted.
・Data had been written to that specific area of the flash memory more times than the number of
writes the flash memory can endure within its life cycle.
Preventive
measures
Measures against the direct cause:
・Reduce the number of times data is written into the flash memory by finding why it increased and
correcting the identified causal factors.
Permanent measures against causal factors (clearly describe the process to be taken):
・When using flash memory as the storage of a terminal, recognize that it is a component with a
finite life cycle.
・Be careful not to exceed the finite number of times data can be written into the flash memory
within its lifecycle.
・If there is a possibility that data may be written to the flash memory more times than it can
endure within its life cycle, prepare a function that monitors the number of writes, or another
mechanism that prevents the write count from exceeding this limit.
・Ask the manufacturer of the flash memory in advance about the finite number of times data can be
written to the selected flash memory within its life cycle, the method of evenly distributing the
stored data across the available space within the flash memory, and any other points that require
care when using it as the storage of the product you are planning.
・When adopting the flash memory, gain a good understanding of the environment in which the
customer is actually going to use it, the types of data handled by the customer, and the likely
timing and frequency with which the customer reads and writes data in the flash memory.
・When selecting the flash memory, do not only consider the software area developed by your
company but also the operating system (OS) you are planning to use for the product, as well as the
behavior of other software you have purchased and freeware that you are planning to use in the
same product.
・If there are different variations of flash memory to choose from, understand the differences and
evaluate them.
・When replacing the flash memory (from NOR flash →NAND flash; from SLC→MLC→TLC; etc.), be
especially attentive about the different life cycle of each type.
・When using the flash memory, have the software development team and the hardware development
team work together to prevent it from causing problems during use, keeping in mind that this
collaboration is very important for preventing system failures caused by the flash memory.
Implementation of the preventive measures described above should make it easier to design the
product in such a way so that the following problems attributable to the life cycle of product
components can be prevented from occurring:
・The terminal does not activate even when the power button is pressed.
・The terminal freezes while it is starting up.
・The terminal blacks out (due to power outage), resets or freezes while it is used.
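One of the measures above calls for a function that monitors the number of writes. A minimal sketch of such a monitor follows. The per-block counting, the warning ratio and the rated cycle count are assumptions for illustration. The real endurance figure has to come from the flash manufacturer, and real devices often perform wear leveling internally, which this sketch does not model.

```python
class FlashWriteMonitor:
    """Count writes per block and warn before a rated limit is reached.

    rated_cycles is a hypothetical figure; the real value must come from
    the flash manufacturer.
    """

    def __init__(self, rated_cycles, warn_ratio=0.8):
        self.rated_cycles = rated_cycles
        self.warn_ratio = warn_ratio
        self.counts = {}  # block number -> number of writes so far

    def record_write(self, block):
        """Record one write and return 'ok', 'warn' or 'refuse'."""
        used = self.counts.get(block, 0)
        if used >= self.rated_cycles:
            return "refuse"  # do not wear the block any further
        self.counts[block] = used + 1
        if self.counts[block] >= self.rated_cycles * self.warn_ratio:
            return "warn"
        return "ok"
```

A monitor like this also helps with the direct measure, because the per-block counts show which area is being written far more often than expected and therefore where to look for the cause.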
Lesson 7
Lesson title When adding a function that consumes a lot of power, keep in mind the impact of temporary
voltage drop (reset, freeze, etc.), the type of power source used, and remaining battery capacity.
Product features
Portable terminal for business use that is equipped with a display, 3G wireless communication
function, and an internal battery that enables it to be used without connecting to an AC charger
Observable
phenomena
・When the terminal is activated, or while it is used, an error message saying “3G modem has
stopped” appears on the display, and 3G communication will be disabled after this message.
・The terminal does not activate even when the power button is pressed.
Event that
occurred
internally
・As soon as the 3G modem started running, the terminal began writing data into the flash
memory. During the writing process, the terminal powered off. As a result, the file system was
corrupted and the 3G modem became unusable.
・A reset occurred due to the sudden drop in power voltage caused by the inrush current that
flowed when the 3G modem started running.
・To prevent the 3G modem from resetting, a condenser was added to the power source of the
modem. But this time, a system reset occurred due to the sudden drop in power voltage caused by
the inrush current that flowed when the terminal started running with low battery capacity. As a
result, the terminal powered off.
Causal factors ・There were two product models developed for this terminal: one with the 3G modem and one
without the 3G modem.
・The model without the 3G modem was developed first. The later model that was developed with
the new 3G-communication feature had been released without evaluating it as thoroughly as it
should have been.
・The evaluation of the built-in 3G modem was also inadequate. The power voltage varied
depending on the parts assembled in the modem.
・When the condenser was added to the power source of this 3G modem, the impact on the
peripheral circuits was not analyzed as stringently as it should have been.
・The terminal after using up much of its battery capacity (and capable of outputting only low
power voltage) was not evaluated as closely as it should have been.
Preventive
measures
Measures against the direct cause:
・To prevent the 3G modem from resetting, add a condenser to the power source of the modem.
・To prevent the system reset from occurring when the remaining battery capacity is low, add a
condenser at the root of the power source IC, and also modify the software in such a way so that
the terminal will not activate when the remaining battery capacity is low.
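The software part of the measure above, not activating the terminal when the remaining battery capacity is low, amounts to estimating whether the supply voltage will stay above the reset threshold during the inrush. The following worked sketch uses a deliberately simplified model in which the voltage under load is the open-circuit voltage minus inrush current times internal resistance. The model, the function names and all numeric values in the usage note are assumptions, not figures from the original case.

```python
def voltage_under_load(open_circuit_v, inrush_a, internal_resistance_ohm):
    """Estimate the supply voltage during an inrush event (simple V = Voc - I * R model)."""
    return open_circuit_v - inrush_a * internal_resistance_ohm


def safe_to_start(open_circuit_v, inrush_a, internal_resistance_ohm, reset_threshold_v):
    """Return True if the estimated voltage stays above the reset threshold."""
    loaded_v = voltage_under_load(open_circuit_v, inrush_a, internal_resistance_ohm)
    return loaded_v > reset_threshold_v
```

With an assumed inrush of 2.0 A, an internal resistance of 0.25 Ω and a reset threshold of 3.3 V, a battery at 4.1 V sags to about 3.6 V and the start is allowed, while a battery at 3.6 V sags to about 3.1 V and the start should be refused. A real design would also have to account for the part-to-part variation in the modem that the causal factors mention.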
Permanent measures against causal factors (clearly describe the process to be taken):
■ System architecture design
・When there are multiple models developed for a particular product (terminal in this case),
evaluate all the models after gaining a good understanding on how they differ from one another.
・When designing the system architecture of the terminal, take the variable aspects of the terminal
into consideration, including the inrush current of each function, voltage drop, inconsistency of the
parts assembled in the circuits as well as the starting sequence of the terminal.
・When adding a component to the power line to prevent the voltage from dropping, check where
the most appropriate position is to install this component and also the impact it may have on other
power lines.
・In case the terminal is a type that also runs on battery, recognize that there is a need to take
measures to prevent it from powering off abruptly when its remaining battery capacity is low.
Implementation of the preventive measures described above should make it easier to design the
product in such a way so that the following problems related to its power capacity can be
prevented from occurring:
・The communication function becomes disabled when the terminal is activated or while it is used;
・The terminal does not activate even when the power button is pressed.
Lesson 8
Lesson title Formally analyze all the exceptions that can be assumed.
Product features
Embedded system for controlling physical phenomena that is designed to achieve the system
objectives while running many functions concurrently and operating or referencing various
peripheral devices and subsystems.
Observable
phenomena
An aircraft controlled by this system nosedived while it was flying in autopilot mode. This abnormal
behavior was attributable to the malfunction of the pose information control unit. Normally, this
unit receives a set of information from the sensors and translates it into pose information,
which the automatic piloting system reads to control the flight while the aircraft is in
autopilot mode. But in this system failure, the pose information control unit continued to
output unexpectedly large values as the pose information, making the automatic piloting system
nosedive the aircraft based on this incorrect pose information. The pose information control unit
recovered after its power was shut off and its internal data was reset. One of the following
three phenomena (1, 2 or 3) can be assumed to have occurred and made the pose information
control unit malfunction.
Phenomenon 1: System failure occurred when the system suddenly stopped operating. If this was
the case, the system should return to normal operation by reboot.
Phenomenon 2: System failure occurred because the system could not refer to / process the
external I/O correctly and malfunctioned as a result. It is not possible to identify the particular type
of external I/O and function that led to this phenomenon.
Phenomenon 3: System failure occurred because a particular function that was executed during
operation could not be completed, and other functions, as a result, could not be activated. The
system cannot recover from this failure unless the power is reset.
Events that
occurred
internally
The following six events (Event 1 - Event 6) are potential events inside the pose information
control unit that may have caused its continuous output of abnormal pose information.
Event 1: Corruption of data inside MPU
If noise had intruded into MPU, it would have been through the pin of the MPU. The noise would
have first damaged the IO direction register. If the noise had the energy to reach inside the MPU, it
would have damaged the internal register as well, making the MPU malfunction (input does not
change, output does not change, no interrupt starts, etc.) or uncontrollable.
→Phenomenon 1, Phenomenon 2
Event 2: Execution of unexpected interrupt program
External interrupt occurs due to noise. The system processes this input but because the input value
is indefinite, system failure occurs.
→Phenomenon 2
Event 3: CPU fully occupied by continuous interrupt
A burst of interrupt occurs and the CPU becomes fully occupied with interrupt programs. As a
result, the time constraints of peripheral devices connected to the MPU cannot be met, making
the system unable to operate the inputs and outputs correctly.
→Phenomenon 2
Event 4: Defect of input device
System failure occurs when the input values become indefinite due to hardware damage.
→Phenomenon 2
Event 5: Defect of output device
System failure occurs because the damage in the hardware makes the hardware unable to operate
even when the system tries to operate the outputs.
→Phenomenon 2
Event 6: Defect or disconnection of the input / output devices and subsystems
The system waits for the input / output devices or subsystems to respond but the response never
comes. The function that the system is executing therefore cannot be completed, making other
functions unable to activate. As a result, system failure occurs.
→Phenomenon 3
Causal factors The exceptions that can be assumed to occur have not been defined in the specifications.
This failure case occurred because no consideration was made on the exceptions of input / output
devices and subsystems that the embedded system might reference or operate.
Preventive
measures
Measures against the direct cause:
Event 1: Corruption of data inside MPU
Recover the register damaged by noise by refreshing the register inside the MPU.
Recover the register by monitoring the interrupts and tasks, and resetting the register when none
of these interrupts and tasks become active for a specified length of time.
Eliminate the noise included in the data obtained from the input device.
Round the input range if the noise is coming from outside the input data range.
Event 2: Execution of unexpected interrupt program
When an interrupt occurs, make the system determine whether it is a normal interrupt or not.
When the interrupt is caused by noise, do nothing and end the interrupt program.
Event 3: CPU fully occupied by continuous interrupt
Same as above.
Event 4: Defect of input device
Round the input range if the noise is coming from outside the input data range.
Monitor the input device at a fixed cycle. Reset the input device if there is no response from
this device. If the input device still does not recover, reset the MPU.
Event 5: Defect of output device
Monitor the output device at fixed cycle. Reset the output device if there is no response from this
device. If the output device still does not recover, reset the MPU.
Event 6: Defect or disconnection of the input / output devices and subsystems
Set a timer that waits for the input / output devices and subsystems to respond. If there is no
response for a predefined length of time, make the system suspend the execution of the functions
that will be affected by any of these responses, and handle the “time out” error.
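Several of the measures above share one pattern: bound the wait, bound the input, and escalate the recovery. The sketch below illustrates the timeout from Event 6, the rounding of out-of-range inputs from Events 1 and 4, and the escalation from device reset to MPU reset from Events 4 and 5. It is written in Python for readability rather than as embedded code, and `poll`, `reset_device` and `reset_mpu` are hypothetical callables standing in for real drivers.

```python
import time


class DeviceTimeout(Exception):
    """Raised when a device does not respond within the allowed time."""


def wait_for_response(poll, timeout_s, interval_s=0.01):
    """Poll a device until it answers or the timeout expires.

    poll: hypothetical callable that returns a response, or None if not ready.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        response = poll()
        if response is not None:
            return response
        if time.monotonic() >= deadline:
            raise DeviceTimeout(f"no response within {timeout_s} s")
        time.sleep(interval_s)


def clamp(value, low, high):
    """Round an out-of-range input back into the valid input range."""
    return max(low, min(high, value))


def read_with_recovery(poll, reset_device, reset_mpu, timeout_s):
    """Retry once after a device reset, then reset the MPU and report the failure."""
    try:
        return wait_for_response(poll, timeout_s)
    except DeviceTimeout:
        reset_device()
    try:
        return wait_for_response(poll, timeout_s)
    except DeviceTimeout:
        reset_mpu()
        raise
```

The point of the structure is that no call can block forever: every wait either returns a value or raises `DeviceTimeout`, so functions that do not depend on the failed device can keep running.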
Permanent measures against causal factors (clearly describe the process to be taken):
Prevent exceptions from occurring by performing the following as a part of requirements analysis
and definition process.
・Define the exceptional items from physical and environmental perspectives.
・Create a list of defined exceptional items.
[Figure: Exceptional item list. Labels: Physical item (xxx system), Environmental item (xxx system),
Exceptional item (xxx system), Exception, Device, Type, Item name.]
・Create a matrix of functional item list and exceptional item list, and use it to define the functional
specifications of each exception.
Implementation of the preventive measures described above should make it easier to reduce the
possibility of system failure caused by functions that do not operate normally because the system
cannot reference / operate the external inputs / outputs correctly.
[Figure: Matrix of functional items (xxx system) against exceptional items (xxx system; Item name, Type).
○: Affected ×: Not affected]
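Once the matrix of functional items and exceptional items exists, a simple check can confirm that every cell marked as affected has a defined specification. The sketch below assumes the matrix is held as a set of (function, exception) pairs and the specifications as a dictionary. These data structures and the item names in the usage note are hypothetical.

```python
def undefined_cells(functions, exceptions, affected, handling):
    """List (function, exception) pairs marked as affected but lacking a specification.

    affected: set of (function, exception) pairs marked "○" in the matrix
    handling: dict mapping (function, exception) -> specified behavior
    """
    gaps = []
    for function in functions:
        for exception in exceptions:
            cell = (function, exception)
            if cell in affected and cell not in handling:
                gaps.append(cell)
    return gaps
```

For example, if `("output_pose", "sensor_disconnected")` is marked as affected but has no entry in `handling`, it is returned as a gap that the requirements definition still has to close.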
Lesson 9
Lesson title When adopting a redundant system, appropriately set the data area to be synchronized.
Product features
A remote monitoring system that must maintain high availability (long continuous uptime)
This system must be highly reliable: it must not only minimize the frequency of the minor defects,
errors and malfunctions that tend to occur on a regular basis, but also recover quickly from a
failed state and continue operating even when a failure occurs.
Observable phenomena
To achieve high availability with a redundant system, control must be able to switch from the
master system to the slave system and continue without interruption even when the master side
becomes defective. In this failure, however, an abnormality alarm went off immediately after
control switched from the master system to the slave system.
Event that occurred internally
When control switched from the master system to the slave system, the values of data that had
not been synchronized were detected as invalid and triggered the parameter abnormality alarm.
Causal factors
When new functions were added to the system, a data area for managing them was also added,
but the data area subject to synchronization was not updated to include it.
Moreover, the test items did not include a test that checked the state when the data area added on
the master side was in use. During testing, therefore, there was no way of noticing that the master
data and the slave data were out of sync when control switched from the master system to the
slave system.
Preventive measures
Measures against the direct cause:
Revise the data area required for data synchronization.
Permanent measures against causal factors (clearly describe the process to be taken):
Add a data synchronization test item that checks whether the master data and the slave data are
properly synchronized when the master system uses the data area all the way up to the end of
the added area.
Also include, as a test item, a check of the boundary values of the data range.
Implementation of the preventive measures described above should make it easier to prevent
malfunctions from occurring when the data is transferred in a redundant system.
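One way to automate the checks above is to compare the layout of the data areas with the range that is actually synchronized, and to test synchronization up to the last byte of the added area. The sketch below assumes a flat byte-addressed data area; the layout in `DATA_AREAS`, the sizes, and the function names are hypothetical.

```python
# Hypothetical layout: area name -> (offset, size) in bytes.
DATA_AREAS = {
    "setpoints": (0, 64),
    "alarm_limits": (64, 32),
    "new_function_params": (96, 16),  # area added together with the new functions
}


def unsynchronized_areas(data_areas, sync_start, sync_size):
    """Return the areas that are not fully inside the synchronized range."""
    sync_end = sync_start + sync_size
    return sorted(
        name
        for name, (offset, size) in data_areas.items()
        if offset < sync_start or offset + size > sync_end
    )


def synchronize(master, slave, sync_start, sync_size):
    """Copy the synchronized range from master memory to slave memory."""
    end = sync_start + sync_size
    slave[sync_start:end] = master[sync_start:end]
```

If the synchronized range is left at its old size of 96 bytes, `unsynchronized_areas(DATA_AREAS, 0, 96)` reports `new_function_params`, which corresponds to the omission described in this lesson.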
Lesson 10
Lesson title Even when the same hardware is specified as the control target, reconfirm the hardware
specifications if the operation conditions are going to be changed.
Product features
・A product developed from a base product that communicates via wireless LAN for other purposes
・Hardware and software have both been developed from the base product
・The maximum number of handy terminals that can be attributed to the developed product is
more than that of the base product.
Observable phenomena
The product sometimes reboots when the number of handy terminals attributed to it reaches the
maximum.
Event that occurred internally
To make a handy terminal attributable to the product, the key entry of the terminal must be
registered in the wireless LAN chip. In this failure, that registration failed.
Causal factors
〔Technical aspect〕
(1) Design: Failure to check the items in the wireless LAN chip datasheet, cited in the
requirements specifications, that affect key entry registration
The management table in the memory inside the wireless LAN chip is used to manage the key
entries of the handy terminals that make them attributable to the product. The key entry
method varies with each cipher system. Depending on the sequence of the cipher systems used
for attribution, key entry registration may fail when the number of attributed terminals
reaches the maximum. The software designers, who had little understanding of the wireless
LAN chip specifications, were not aware of this risk.
(2) Design: Failure to select reliable reviewers
The reviewers (who were members of the development team) had little understanding of the
wireless LAN chip specifications.
(3) Test: Failure in the combination of conditions
The integration test included a scenario that tested communication between the handy
terminals and the product with the maximum number of terminals attributed. However, this
scenario did not consider the sequence of the cipher systems used for attribution, so the
integration test was incomplete.
Preventive measures
Measures against the direct cause:
Modified the part of the software that controls the wireless LAN chip so that it uses the chip's
external memory, instead of its internal memory, for the management table of the wireless LAN
chip.
Permanent measures against causal factors (clearly describe the process to be taken):
1. Specify the areas in the requirements specifications that will be affected. (Requirements
specifications process)
Add the following check item in the review check sheet: “Check the impact of the difference in
requirements specifications from the hardware specifications of the base product.”
2. Improve the completeness of the tests. (Integration testing process)
Add combination tests to the integration test to check all the different combinations of
cipher systems used for attribution when the maximum number of handy terminals is
attributed to the targeted product.
3. Select the most appropriate reviewers. (Requirements definition process, integration testing
process)
Invite the software designers of the base product and the designers in charge of the system
tests to the integration test specifications review as participants.
Implementation of the preventive measures described above should make it easier to reduce the
possibility of system failure being caused by lack of memory capacity.
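The combination tests in measure 2 can be generated mechanically rather than listed by hand. The sketch below enumerates every sequence of cipher systems that fills all terminal slots and collects the sequences for which registration fails. The cipher names, the maximum of four terminals, and `fake_register_all` (a stand-in for the real device under test, with an invented slot model) are all assumptions made for illustration.

```python
from itertools import product

CIPHER_SYSTEMS = ["cipher_A", "cipher_B", "cipher_C"]  # hypothetical names
MAX_TERMINALS = 4                                      # hypothetical maximum


def attribution_sequences(ciphers, max_terminals):
    """Yield every ordered sequence of cipher systems that fills all slots."""
    return product(ciphers, repeat=max_terminals)


def failing_sequences(register_all, ciphers, max_terminals):
    """Return the sequences for which registering all terminals fails."""
    return [
        seq
        for seq in attribution_sequences(ciphers, max_terminals)
        if not register_all(seq)
    ]


def fake_register_all(sequence, table_slots=6):
    """Illustrative stand-in for the device: cipher_C needs two table slots."""
    slots_needed = {"cipher_A": 1, "cipher_B": 1, "cipher_C": 2}
    used = 0
    for cipher in sequence:
        used += slots_needed[cipher]
        if used > table_slots:
            return False  # key entry registration fails
    return True
```

In a real integration test, `fake_register_all` would be replaced by a routine that attributes actual terminals to the product in the given order. The number of sequences grows exponentially with the maximum number of terminals, so some pruning may be needed for large limits.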
Lesson 11
Lesson title When sharing data between processes or threads, keep a close eye on whether the exclusive control /
synchronous processing is being carried out correctly or whether deadlock is occurring.
Product features
A production management system that uses a Windows server to process information received from
an on-site process controller through a communication network, in order to control the routes taken
by the conveyor at diverging points and to collect performance data
Observable phenomena
The process controller and the server stopped communicating. As a result, the line in the plant had to
be stopped.
Event that occurred internally
① The transmission buffer entered an invalid state when a deletion by the data transmission thread
and a registration by the data request thread overlapped in time.
② The data transmission thread tried to read data from the transmission buffer while it was in the
invalid state. This triggered error handling, and the connection to the process controller was
repeatedly established and dropped.
Causal factors
An ArrayList was used as the transmission buffer for passing data between multiple threads in the
process. Because ArrayList is not thread-safe, exclusive control was required. However, the
engineers in charge of coding did not know that ArrayList is not thread-safe.
Preventive measures
Measures against the direct cause:
1) Added exclusive control so that the ArrayList can be used safely by multiple threads.
2) Reviewed similar code for thread-safety issues.
[Figure: Sequence diagram of the data request thread, the transmission buffer, the data transmission thread, and the process controller. Data request thread: create request data, register in transmission buffer. Data transmission thread: check whether there is any transmission data (number of data items), connect to the process controller, receive connection completed (ACK), read out the data in the transmission buffer, transmit the data, receive the response, delete the data from the transmission buffer. When registration and deletion overlap, the transmission buffer enters an invalid state, exception handling occurs, and communication with the process controller is disconnected.]
Permanent measures against causal factors (clearly describe the process to be taken):
Added a thread-safety check for multithreaded code to the program check sheet and the coding
conventions.
Implementation of the preventive measures described above should make it easier during the code
review to detect the potential problems that may arise when data shared between multiple threads is
not protected by exclusive control.
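The exclusive control added as the direct measure can be illustrated with a small buffer class in which registration and read-out-with-deletion are each performed under the same lock. This is a sketch in Python rather than the code of the actual system (which used ArrayList); the class and function names are hypothetical.

```python
import threading


class TransmissionBuffer:
    """Buffer shared by data request threads and a data transmission thread."""

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def register(self, item):
        with self._lock:  # registration cannot overlap with deletion
            self._items.append(item)

    def take_all(self):
        """Read out and delete all queued items as one atomic step."""
        with self._lock:
            items, self._items = self._items, []
        return items


def run_demo(n_request_threads=4, items_per_thread=500):
    """Register from several threads while one thread transmits."""
    buffer = TransmissionBuffer()
    transmitted = []
    requests_done = threading.Event()

    def request_thread(thread_id):
        for i in range(items_per_thread):
            buffer.register((thread_id, i))

    def transmission_thread():
        while not requests_done.is_set():
            transmitted.extend(buffer.take_all())
        transmitted.extend(buffer.take_all())  # drain what is left

    sender = threading.Thread(target=transmission_thread)
    requesters = [
        threading.Thread(target=request_thread, args=(t,))
        for t in range(n_request_threads)
    ]
    sender.start()
    for t in requesters:
        t.start()
    for t in requesters:
        t.join()
    requests_done.set()
    sender.join()
    return transmitted
```

The point of the design is that the check, the read-out, and the deletion happen inside one critical section, so the buffer is never observed in a half-updated state. Every registered item is transmitted exactly once.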
Lesson 12
Lesson title The instrument used for testing products that require yield management to determine whether
they are defective or non-defective products should be suspected to be abnormal when it outputs
test results indicating that all the tested products are either defective or non-defective.
Product features
Instrument used to test semiconductors to check whether their functions and performance meet
their specifications
The semiconductor testing process consists of multiple tests. If the tested products pass all these
tests, they are determined to be non-defective. If they fail any of these tests, they are determined
to be defective. Normally, when a product is found to be defective, the remaining tests in the
testing process are skipped to save time and streamline the entire process.
Observable phenomena
In a semiconductor testing process, a certain proportion of the tested products are normally
determined to be defective.
However, there was a case where all the products passed the tests and were determined to be
non-defective.
When all the tested products are determined to be non-defective, there is a need to consider the
testing process itself to be abnormal, just like when all the tested products are determined to be
defective.
Event that occurred internally
The personnel in charge of investigating defective products masked the products during the
troubleshooting process to prevent them from being detected as defective products. After
completing the investigation, the mask should have been released but was not, and mass
production began without releasing the mask. As a result, all the products were determined to be
non-defective in the testing process. The personnel forgot to set a warning to notify that the
testing process may be abnormal if all the tested products are determined to be either defective or
non-defective.
Causal factors
No warning had been set to notify that the testing process itself may be abnormal if all the tested
products are determined to be non-defective.
When all the tested products were determined to be defective, the testing process could be
intuitively suspected to be abnormal. But it did not occur to anyone that the testing process
might also be abnormal when all the products were determined to be non-defective.
Preventive measures
Measures against the direct cause:
Modified the program by adding a warning to notify that the testing process may be abnormal not
only when all the tested products are determined to be defective but also when they are all
determined to be non-defective.
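A warning of this kind might be implemented along the following lines. The function name, the message texts, and the `min_lot_size` threshold (below which an all-pass or all-fail result is not considered suspicious) are assumptions made for this sketch and would need to be set from the actual yield data.

```python
def check_lot(results, min_lot_size=20):
    """Return a warning if every product passed or every product failed.

    results: list of booleans, True meaning "non-defective".
    min_lot_size: hypothetical threshold; smaller lots are not judged.
    """
    total = len(results)
    if total < min_lot_size:
        return None  # too few samples to say anything about the tester
    passed = sum(results)
    if passed == total:
        return "WARNING: all products non-defective; verify the tester (e.g. a mask left set)"
    if passed == 0:
        return "WARNING: all products defective; verify the tester"
    return None
```

A lot with even one defective product among a hundred produces no warning, while a lot of a hundred passes does.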
Permanent measures against causal factors (clearly describe the process to be taken):
・Clarify the specifications on the criteria for determining whether the tested products are
defective or non-defective.
・In addition to clarifying the items for verifying the testing instrument, automate the verification
process as much as possible.
Implementation of the preventive measures described above should make it easier to reduce the
problems caused by human error of the maintenance personnel during their work.
Lesson 13
Lesson title When improving the performance of existing software, check when the idling time occurs, when
the process goes out of sync, and how they affect the software.
Product features
An inspection system used during the manufacturing process to inspect the electronic products
in the making
This system is composed of a PC and various embedded devices (signal generator, voltage/current
measuring instrument, current measuring instrument, communication device, camera, etc.) and
requires the ability to check the precision of millisecond-order (ms) measurements and control flow.
An upgraded version of the current system is being developed with additional features.
Observable phenomena
While inspecting the electronic products, the inspection system output “no data” even though
the electronic products were transmitting valid communications data. As a result, the line
had to be stopped for some time.
Event that occurred internally
The PC of the inspection system cyclically retrieves the communications data received within a
fixed period via the buffer of the communication device and stores the data in log files. These
log files are analyzed closely after the inspection has been executed.
There are times when the electronic products do not communicate at all within the fixed period,
even when they are operating normally. In this failure, the PC, having received no
communications data within the fixed period, generated an empty data record and stored it in the
log file as valid data with a time stamp of “0s”. As a result, the normal 2-minute search could
not be performed and no valid data could be detected.
Causal factors
As a part of the development of the upgraded version of the current inspection system, the
processing speed was also enhanced. However, this enhancement caused the performance to
degrade due to the following factors. Moreover, the adverse effects of this enhancement could
not be detected in the system tests.
〔Technical aspect〕
■Lack of test items for testing the system when communications data do not exist
〔Personal / organizational aspects〕
■Design intentions of the current system have not been documented for future reference.
■Change management lacked scrutiny.
■Too dependent on the individual engineers in charge of modification (due to the judgment
that the scale of change was small)
Preventive measures
Measures against the direct cause:
Modify the software to restore the check (present in the original system) for whether any
communications data exists when retrieving data via the buffer of the communication device, so
that the PC of the inspection system does not store empty data in the log file when there is no
communications data to retrieve.
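The restored check amounts to a guard before the log write. A minimal sketch follows, assuming the log is a list of records and each polling cycle delivers the received bytes; the function name and record layout are hypothetical.

```python
def store_cycle(log, received, timestamp_s):
    """Append one polling cycle's data to the log; skip cycles with no data."""
    if not received:
        # Nothing arrived in this fixed period: do not fabricate an
        # empty record that would later be read as valid data.
        return False
    log.append({"t": timestamp_s, "data": bytes(received)})
    return True
```

A test item for "no communications data within the period" then only needs to call `store_cycle` with empty input and confirm that the log is unchanged.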
Permanent measures against causal factors (clearly describe the process to be taken):
■Make sure all the modifications are entered in the change management list without any
omission (software design modification process)
■Describe the design intentions in the design specifications (software design process)
■Be sure to “check the modifications” and “check the extent of the impact of the modifications”
when reviewing the design and implementation. (review process)
■Improve the completeness of the tests (testing process)
・Add the following perspectives in the test items:
⇒Check the behavior when data does not exist;
⇒Intentionally insert a testing period when no communications data exists.
■Improve the reviewers’ skill level (create reviewer’s skill map, select reliable reviewers)
Implementation of the preventive measures described above should make it easier to avoid
overlooking data abnormalities in communication systems.
Lesson 14
Lesson title When handling a large volume of data via a communication device, be careful not to create any
bottlenecks in the sequential processing flow. Also take into consideration how the load fluctuates
at different times of day.
Product features
A system composed of mobile terminals for business use carried by multiple sales reps, a database
server that provides remote data transmission / reception services, and client terminals (in head /
branch offices) that receive the information sent from the sales reps and support the sales
activities
This system is required to operate continuously and provide data transmission / reception services
without delay.
Observable phenomena
During a time period when many users were communicating through this system, the response
time of the server increased significantly. As a result, the transmission and reception
of sales information and sales support information took a very long time to complete, causing the
services to stop intermittently.
Event that occurred internally
・Because the processing of data for analysis became a high-load operation in the server process, a
large volume of data transmitted from the mobile terminals for business use was temporarily
held in memory (in the communication buffer).
・The program for accessing the server contained code that was not optimized. This code became
a bottleneck that slowed down processing. (The processing took unnecessarily long because
the character strings of the data were first converted into numerals before comparison, instead
of being compared directly.)
Causal factors
・The number of users increased during a particular time period, and an unexpectedly large volume
of data was transmitted to the server.
・The specifications did not describe the conditions for accessing the server in a way that took
account of the potential risk of a bottleneck.
・People who were knowledgeable about the server did not participate in the implementation
reviews.
・The integration load test covering the entire system was not conducted sufficiently.
⇒ A test environment close to the real environment could not be prepared in time.
Preventive measures
Measures against the direct cause:
・Modify the program so that the character strings of the data are compared directly.
(Program optimization)
・Measure the processing time to access the server in a test environment that is close to the real
environment and confirm that the processing speed has improved.
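The two comparison styles and a simple way to measure them can be sketched as follows. The function names and the record format (decimal digit strings) are assumptions; the actual program and its data are not described in the handbook.

```python
import timeit


def match_converted(records, key):
    """Original style: convert both strings to numbers, then compare."""
    return [r for r in records if int(r) == int(key)]


def match_direct(records, key):
    """Optimized style: compare the character strings directly."""
    return [r for r in records if r == key]


def measure(fn, records, key, repeat=20):
    """Return the total time in seconds for `repeat` runs of fn."""
    return timeit.timeit(lambda: fn(records, key), number=repeat)
```

Direct comparison is only equivalent when the strings are in a canonical form: `"0042"` and `"42"` are equal as numbers but not as strings. That condition should be confirmed against the data format before applying this optimization, and the gain should be confirmed by measurement in an environment close to the real one.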
Permanent measures against causal factors (clearly describe the process to be taken):
・Describe in the specifications the conditions for accessing the server, taking account of the
potential risk of a bottleneck.
⇒ Perform programming that strictly adheres to the specifications.
・Build a workflow (make it a rule) to always call on people who are knowledgeable about servers
to participate in reviews.
・Make quick arrangements to prepare a test environment that simulates the real environment.
Implementation of the preventive measures described above should make it easier to secure the
required level of performance or improve it further with higher certainty through a
performance-based design approach.
Lesson 15
Lesson title Be prepared with a recovery plan to support all the foreseeable abnormal system operations
(reset, power shutdown, state of being left uncontrolled, etc.) that may occur during sequential
execution of business systems that are used by the customers after delivery.
Product features
A system that supports information-desk-type services at the customer’s store
This system is used to receive the data necessary for processing the daily business operations and
to automatically execute a batch process that, at a specified time, sends the aggregated data on
the business operations processed during the day to the central server.
Observable phenomena
One day, the person in charge of information desk duties finished the day’s work and left the
store after switching off the power of this business system. When this person tried to start the
business system the next morning, it did not start normally. As a result, the customer could not
use the business system to process the daily store operations that day.
Event that occurred internally
The business system did not operate normally because the batch transmission of the aggregated
data on daily business operations to the central server had been left in an incomplete state from
the day before.
Causal factors
The investigation found that the direct cause was that the batch process was forcibly terminated
while the system was sending the aggregated data on the business operations processed the day
before. Because of this forced termination, the data transmission was left in an incomplete state.
In such a case, the system should have executed a recovery process when it was started the next
morning (such as sending the remaining data that had not yet been transmitted to complete the
transmission), even though the data reception or transmission job had been forcibly terminated
before completion. However, because this situation was not considered as a possible scenario
during system design, the recovery function was not implemented.
Preventive measures
Measures against the direct cause:
Implement a recovery program to send the remaining data that had not been transmitted yet due
to forced termination of data exchange with the central server once the system is restarted.
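One possible shape for such a recovery program is a small journal of records that are still to be sent, kept on disk and updated after each successful transmission, so that the remaining records can be sent at the next start-up. The sketch below makes several assumptions: the journal is a JSON file, `send` is a hypothetical callback that transmits one record to the central server, and all function names are invented for illustration.

```python
import json
import os
import tempfile


def load_pending(journal_path):
    """Return the records that have not been confirmed as sent."""
    if not os.path.exists(journal_path):
        return []
    with open(journal_path, encoding="utf-8") as f:
        return json.load(f)


def save_pending(journal_path, pending):
    """Write the journal via a temporary file and an atomic rename."""
    tmp_path = journal_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(pending, f)
    os.replace(tmp_path, journal_path)


def send_pending(journal_path, send):
    """Send each pending record; drop it from the journal only after sending."""
    pending = load_pending(journal_path)
    while pending:
        send(pending[0])
        pending = pending[1:]
        save_pending(journal_path, pending)


def demo():
    """Simulate a forced termination after the first record, then a restart."""
    path = os.path.join(tempfile.mkdtemp(), "pending.json")
    save_pending(path, ["rec1", "rec2", "rec3"])
    sent = []

    def interrupted_send(record):
        if len(sent) == 1:
            raise RuntimeError("forced termination")
        sent.append(record)

    try:
        send_pending(path, interrupted_send)
    except RuntimeError:
        pass
    left_after_termination = load_pending(path)
    send_pending(path, sent.append)  # recovery at the next start-up
    return sent, left_after_termination, load_pending(path)
```

Because a record is removed from the journal only after it has been sent, a termination between the two steps causes that record to be sent again on recovery. The receiving side would therefore need to tolerate duplicates, which is a design choice of this sketch rather than something stated in the handbook.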
Permanent measures against causal factors (clearly describe the process to be taken):
Include the process to recover from forced termination as one of the check items to examine when
reviewing the completeness of the requirements definition.
Implementation of the preventive measures described above should make it easier to improve the
response capability against incidents, such as, forced termination of the system caused by the
operator’s human error while the system was executing the business services.
Lesson 16
Lesson title Prepare specifications document and conduct impact evaluation even for the processing of
maintenance log data that is used for failure analysis.
Product features
A system designed for quality management at each production process in a manufacturing plant
This system is used in a plant that manufactures finished products through numerous production
processes. These production processes include not only ordinary processing steps but also a
chemical treatment process that uses substances harmful to the human body. To manage the
different jobs carried out in each process, traceable log data are recorded and saved in the system,
then aggregated by lot and sent to the server for use in the subsequent processes.
Observable phenomena
There was a time when the processing of the log data taken in a specific production process ended
abnormally while aggregating them. Due to this abnormal termination of data processing, the line
stopped at this particular production process.
Event that occurred internally
A part of the log data that recorded the latest information about the products in the making and
needed to be sent to the next process had been lost.
Causal factors
In the production process where log data processing ended abnormally, the log data was always
recorded and saved after confirming the state of the products upon completion of the final step of
the process, because the quality of products treated with chemical substances was often
inconsistent, depending on the outcome of the chemical treatment. As a rule, therefore, the log
data from this process was saved in two separate log files so that the data would not be mixed in
one file: one containing the data of products confirmed to be compliant (those successfully treated
in the chemical treatment process) and the other containing the data of products found to be
non-compliant (those that failed in the chemical treatment process). The investigation found that
the processing of log data from this process ended abnormally because the two log files (for
compliant and non-compliant products respectively), which were supposed to be written by
separate modules, were for some reason written by a single shared module. It also became clear
that the log output function for non-compliant products had been regarded as useful only for
failure analysis and was therefore not described in the specifications document.
Preventive measures
Measures against the direct cause:
Modify the program to make the system use two separate modules to write the log data on normal
production (of compliant products) and abnormal production (of non-compliant products)
respectively in two different log files.
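The separation described above might look like the following sketch, in which each log file is owned by its own writer object and the routing decision is made in one place. The class names, file names, and record format are hypothetical.

```python
import os
import tempfile


class LogWriter:
    """Appends records to one log file; each instance owns exactly one file."""

    def __init__(self, path):
        self.path = path

    def write(self, lot_id, detail):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(f"{lot_id},{detail}\n")


class ProcessLogger:
    """Routes each record to the writer for compliant or non-compliant products."""

    def __init__(self, log_dir):
        self.compliant = LogWriter(os.path.join(log_dir, "compliant.log"))
        self.non_compliant = LogWriter(os.path.join(log_dir, "non_compliant.log"))

    def record(self, lot_id, is_compliant, detail):
        writer = self.compliant if is_compliant else self.non_compliant
        writer.write(lot_id, detail)


def demo():
    """Log one compliant and one non-compliant lot and read both files back."""
    log_dir = tempfile.mkdtemp()
    logger = ProcessLogger(log_dir)
    logger.record("LOT-001", True, "treated")
    logger.record("LOT-002", False, "treatment failed")

    def read(name):
        with open(os.path.join(log_dir, name), encoding="utf-8") as f:
            return f.read().splitlines()

    return read("compliant.log"), read("non_compliant.log")
```

Keeping the routing in a single `record` method also gives the specifications document one place to describe, including the log output for non-compliant products.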
Permanent measures against causal factors (clearly describe the process to be taken):
Prevent log data on normal production (of compliant products) and abnormal production (of
non-compliant products) from being mixed in the same log file: as a part of the software design
process, prepare a specifications document that describes the procedure for processing
maintenance log data used in failure analysis, after evaluating the impact of this processing
without omission. Also conduct reviews and tests thoroughly so that non-compliant products,
which should be rejected from the line at the point of detection, are not fed to subsequent
production processes, developed into finished products, and shipped to the market as commercial
products.
Implementation of the preventive measures described above should make it easier to improve the
certainty of quality assurance required by the manufacturer to reinforce its maintenance functions.
Lesson 17
Lesson title Do not only extract the conditions required to perform the determination process, but also identify
all the conditions that should be processed as limitations.
Product features
An entry / exit gate control system used in an establishment with multiple important facilities
constructed within its premises
This system is designed to allow the employees of the establishment to access the building
facilities within the premises by associating the identification (ID) information on the electronic
gate pass provided to each employee with the information of each building they enter. Within the
premises of this establishment are a number of important building facilities that are physically
connected but are separated with an access gate built in between each compartmentalized facility.
This system monitors and controls the entry / exit to and from each of these compartments based
on various sets of access rules, identification data on who can have access to which zone for how
long (period when the access permit is valid) as well as restrictions in specific facilities that limit
the number of times one can pass the gate on the same day.
Observable phenomena
One day, an employee tried to enter facility A but could not, because an alarm went off to notify
that an abnormal situation had arisen. This employee was not the only one affected: for a certain
period of time, none of the employees of facility A could get in or out of the building. As a
result, the business activities that these employees were supposed to handle in this facility were
seriously affected that day.
Event that occurred internally
What actually occurred was that this employee tried to enter facility A after entering and exiting
facilities X and Y several times and had already by then reached the upper limit of the number of
times this employee was allowed to enter or exit the restricted facilities that particular day. This
was why the system had processed this employee’s attempt to enter facility A as an abnormal
event, determining it to be an unacceptable entry that was over the limit.
Causal factors
The system was programmed to check the upper limit of the number of times each employee was
allowed to enter or exit a facility, and normally activated this check when an employee tried to
enter a facility after dropping by another facility. However, this check had not been implemented
for facility Y. As a result, the system terminated abnormally while processing the employee’s
attempt to enter a facility from another facility, because it could not recognize facility Y in the
determination condition processing routine.
Preventive measures
Measures against the direct cause:
Check whether the program is missing any code for checking the number of times each employee
is allowed to enter or exit other facilities such as X and Y, or any code for checking other
conditions, and implement the missing parts of the program.
Permanent measures against causal factors (clearly describe the process to be taken):
Review the business rules on entering / exiting the facilities and revise the manual according to the
result of the review.
Detect other similar errors through the following methods:
1. Decompose the conditions into attributes like ‘facility’, ‘limit in the number of times of
entry / exit’, and ‘types of gate pass’;
2. Check whether each of these attributes is a condition that should be processed as a
limitation or not.
Implementation of the preventive measures described above should make it easier to prevent any
processing from being left out, because the relation between conditions and attributes can be
checked visually.
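The decomposition into attributes lends itself to a mechanical completeness check: enumerate every combination of attribute values and list the combinations for which no limit has been defined. In the sketch below the facilities follow the example in this lesson, while the gate pass types, the limit values, and the function names are hypothetical. Denying passage when no rule exists is one possible policy chosen for this sketch, not a rule from the handbook.

```python
from itertools import product

FACILITIES = ["A", "X", "Y"]
PASS_TYPES = ["employee", "visitor"]  # hypothetical gate pass types

# Daily entry / exit limit per (facility, pass type); hypothetical values.
# The rule for ("Y", "visitor") is left out on purpose.
LIMITS = {
    ("A", "employee"): 10,
    ("A", "visitor"): 2,
    ("X", "employee"): 5,
    ("X", "visitor"): 1,
    ("Y", "employee"): 5,
}


def missing_rules(facilities, pass_types, limits):
    """List attribute combinations that have no limit defined."""
    return [c for c in product(facilities, pass_types) if c not in limits]


def may_pass(facility, pass_type, passes_today, limits):
    """Decide whether a gate may open; an undefined rule is a limitation."""
    limit = limits.get((facility, pass_type))
    if limit is None:
        return False  # deny this one passage instead of ending abnormally
    return passes_today < limit
```

Running `missing_rules(FACILITIES, PASS_TYPES, LIMITS)` during review exposes the gap for facility Y before it can surface at a gate.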
Lesson 18
Lesson title Be careful of the fragmentation of log files.
Product features
A production management system that uses a Windows server to process information received from
an on-site process controller through a communication network, in order to control the routes taken
by the conveyor at diverging points and to collect performance data
Observable phenomena
A problem occurred in the system. For troubleshooting, the log file was copied to analyze what had
caused the problem. The line in the plant then stopped because the processing of control data
became slow and delayed.
Event that occurred internally
The processing was slow because the log file copied on the disk was fragmented. Due to this
problem, the disk input/output load increased sharply while the log file was being copied. As a
result, the online process to write the log data of the ongoing performance slowed down and
delayed the data processing.
Causal factors
① On each day of plant operation, 30 variable-length log files covering multiple production
processes, ranging in size from around 1 to 300 MB, are created. These log files are deleted
automatically after being kept for 30 days.
② Because they contain variable-length data, the log files are extended automatically and become
fragmented as they grow.
③ Fragmented files were deleted automatically after being kept for 30 days. However, log files
continued to be created and to fragment, so fragmentation accelerated.
Preventive measures
Measures against the direct cause:
1) Performed defragmentation on a day when the plant was not operating.
Permanent measures against causal factors (clearly describe the process to be taken):
1) Avoided the daily fragmentation of log files by moving them to different partitions separated by
day.
2) Reduced the number of log files by deleting unnecessary log data.
Implementation of the preventive measures described above should make it easier to reduce
problems arising from processing variable-length data for record-keeping purposes.