
Handbook of Lessons

for Information Processing

System Reliability Enhancement

(Product / Control System edition)

～Excerpts from Collection of Lessons for Information

Processing System Reliability Enhancement

(Product / Control System edition)～

2013 abridged version

June 20, 2014

情報処理システム高信頼化教訓ハンドブック

(製品・制御システム編)

独立行政法人情報処理推進機構

Copyright© Information-Technology Promotion Agency, Japan. All Rights Reserved 2014


PART I Collection of Lessons - Main Text (Product / Control System edition)

1. Introduction

1.1 Background & Objective

1.2 Significance of the Approach Taken by IPA/SEC

1.3 Key Features of this Handbook

2. Collection of Lessons for Information Processing System Reliability Enhancement (Product / Control System edition)

2.1 Policy on Gathering Failure Prevention Knowledge

2.1.1 Overview

2.1.2 Targeted Users of Failure Prevention Knowledge

2.1.3 Procedure to Gather and Organize Information

2.2 Lessons


PART I Collection of Lessons - Main Text


1. Introduction

1.1 Background & Objective

Product / control systems (embedded systems) are so widely used in every corner of our lives and society that they have become an indispensable key infrastructure. At the same time, their growing complexity makes it increasingly difficult to maintain the reliability of the system as a whole.

In the past, the attribution analysis to identify the factors that undermined the reliability of individual products, and the formulation of countermeasures against those factors, were performed privately and not disclosed to the public. Because of this closed nature, even when similar failures occurred in different products or industries, they could be neither prevented nor solved using the countermeasures developed by those who had encountered similar failures earlier. This was especially true of failures caused by problems that were difficult to detect in preliminary verification or testing.

Moreover, projects that develop products entirely from scratch are becoming fewer, and opportunities to apply accumulated knowledge and techniques to maintain or improve product reliability are becoming scarce. It is therefore becoming necessary to share the experiences of product manufacturers, system developers and user companies and hand them down to current and future generations.

In order to maintain the reliability of products, it is becoming increasingly important for product manufacturers, system developers and user companies to share their experiences with other companies, industries and future generations. To do so, they need to summarize their individual experiences and know-how in a form that can be understood and practiced by other manufacturers. In response to this trend, IPA/SEC has decided to collect and analyze information on failure cases obtained through the cooperation of corporations engaged in building the architecture of, and developing, the products and control systems that support key infrastructures, and to organize and systematize the countermeasures that have already been formulated and implemented to address the identified causes of system failures and prevent them from recurring. This approach aims to share the “lessons” learned from these activities across all industries and business fields, and to build a mechanism that helps prevent similar failures from occurring and minimizes the magnitude and extent of their negative impact should they occur. (Refer to Fig. 1.1.)


1.2 Significance of the Approach Taken by IPA/SEC

In order to maintain the reliability of products, it is becoming increasingly important for enterprises (product manufacturers, system developers, user companies, etc.) to share their experiences with other companies, industries and future generations. To do so, they need to summarize their individual experiences and know-how in a form that can be understood and practiced by third parties. In response to this trend, IPA/SEC has decided to collect and analyze information on failure cases obtained through the cooperation of corporations engaged in building the architecture of, and developing, the products and control systems that support key infrastructures, and to organize and systematize the countermeasures that have already been formulated and implemented to address the identified causes of system failures and prevent them from recurring. This approach aims to share the “lessons” learned from these activities across all industries and business fields, and to build a mechanism that helps prevent similar failures from occurring and minimizes the magnitude and extent of their negative impact should they occur.

In general, enterprises very rarely provide raw data on their failure cases. IPA/SEC therefore asked the enterprises for information on failure cases that they had already analyzed internally (including the countermeasures devised and carried out based on the analytical results). It received this information on the condition that IPA/SEC would, as a public agency, strictly abide by the rules on preservation of confidentiality (stipulated in the National Public Service Act, IPA Committee Regulations, etc.). It then used the information as the source for organizing a generalized, abstract set of “lessons” learned from past failure cases and documented these lessons in a handbook titled “Collection of Lessons for Information Processing System Reliability Enhancement (Product / Control System edition)”.

Fig. 1.1 Sharing and utilization of knowledge (mechanism of sharing and utilizing failure cases and lessons). Elements shown: enterprises and engineers, who experience failures, reflect on them and learn the lesson; a lessons DB for each enterprise; a lessons / failure cases / countermeasures DB for each industry; and the SEC lessons / failure cases / countermeasures DB. Arrows between them are labeled "Share (generalize into abstract form)" and "Utilize (translate into reality)".

1.3 Key Features of this Handbook

One of the limitations in conducting a comprehensive set of tests to verify the appropriateness of product /

control systems in advance is that it is often difficult to specify the users of product / control systems and the

environment they will be used in. In many cases, the failures occurring in these systems are caused by factors

that have been difficult to detect during the preliminary verification and testing processes.

In practice, when a failure occurs in a product / control system, the enterprise that was using this system would

analyze the causes and formulate necessary countermeasures to prevent it from recurring, on its own without

disclosing the remedial actions it has taken individually. Since the experience of this enterprise was not shared

with other companies in the same industry nor with any other industries, similar failures caused by factors

difficult to detect in preliminary verification and testing processes continued to occur in other products or

industries. One of the key features of this handbook is that it is a collection of lessons that can be applied to

prevent such failures caused by factors difficult to detect in preliminary verification and testing processes from

occurring.

Moreover, projects that develop products entirely from scratch are becoming fewer, and opportunities to apply accumulated knowledge and techniques to maintain or improve product reliability are becoming scarce. It is therefore becoming necessary to develop a mechanism to share the experiences of product manufacturers, system developers and user companies and hand them down to current and future generations. Another feature of this handbook is that it has been created with the objective of handing down such useful and valuable knowledge to future generations.


2. Collection of Lessons for Information Processing System Reliability Enhancement


2.1 Policy on Gathering Failure Prevention Knowledge

2.1.1 Overview

This handbook on lessons learned from past failure cases has been prepared to provide a generalized abstract

set of lessons to be utilized in the enterprises developing a wide range of embedded systems, using the

knowledge on measures implemented in various industries to prevent system quality problems from arising, as

the shared source of information.

In this handbook, the knowledge on measures to prevent system quality problems from arising is referred to as

the “failure prevention knowledge”.

There are two types of failure prevention knowledge: i) knowledge to prevent the same or similar failures from

recurring; and ii) knowledge to prevent a foreseeable problem from arising. The former type of knowledge is

based on the knowledge gained from past incidents and failures that occurred in the same field of business, same

company or same organization, and is used to prevent the same or similar problems from recurring. The latter

type of knowledge is based on the knowledge gained from past incidents and failures that occurred in other

fields of business, companies or organizations, which have been generalized into an abstract set of knowledge to

prevent the same or similar problems from arising in a specific field of business, company or organization that

never in the past has experienced the incident or failure addressed by the acquired knowledge.

Moreover, the knowledge gathered for the preparation of this handbook includes both knowledge for improving the quality of the system by eliminating intrinsic risks of failure and knowledge for making the system more robust and fault-tolerant, so that these risks cause no observable adverse effect on the system even when they materialize.

There are many expressions used to describe a problematic state, including the terms “fault”, “failure” and

“defect”. In this handbook, the problematic state is expressed as either “fault” or “failure” in accordance with the

terminology defined in JIS X 0133 and IEEE 1044, where “fault” is defined as the problematic state caused by one

or more incorrect steps, processes or data existing in any of the programs of the computing system, and where

“failure” refers to the state when a product is unable to execute a required function or when a system does not

have the ability to execute a function within the limitations specified beforehand.

2.1.2 Targeted Users of Failure Prevention Knowledge

The failure prevention knowledge described in this handbook has been gathered on the assumption that it will be used mainly by the following three types of engineers engaged in the development of embedded systems:

Software architects

System designers belonging to the vendor

System designers belonging to the user company

End users who have not received special training on how to use the system have been excluded from the list of targeted users, on the assumption that it would be difficult for them to make effective use of the failure prevention knowledge.

2.1.3 Procedure to Gather and Organize Information

It is desirable for each enterprise to work on defining lessons based on the failure prevention knowledge. This handbook gives an example of the procedure and method for gathering and organizing the information that can be translated into failure prevention knowledge.

[Background of the knowledge gathering procedure]

Failure prevention knowledge should not simply be a set of information on the findings and practices

gathered from various corporate entities that encountered similar failure cases in the past. Rather, it should be

derived from the set of core information extracted from various similar failure cases that has then been

translated into the failure prevention knowledge that is relevant to the users intending to prevent their specific

failures. This translation process should be taken to make the enterprises providing the information on their

failure cases and the knowledge from their experiences feel secure to share their internal information. If the

shared information were disclosed to the public as it is, it would be easy to imagine that they would feel reluctant

to reveal their failures openly from then on. Moreover, since the products with embedded system range widely, it

would be difficult for the users of certain products to relate to the knowledge gained from the failure of other

types of product, even if they are in the same company. Failure prevention knowledge, therefore, should be

rewritten as much as possible in a way that is relatable to the users in similar fields of business.

[Framework of the knowledge gathering procedure]

The process to gather and organize the information that can be translated into failure prevention knowledge

should be repeated to find where it could be improved and ultimately established as a fixed procedure for any

enterprises to gather and utilize the failure prevention knowledge. In order to establish the knowledge gathering

process, there is a need to put in place a set of methods that clearly define how to classify the types of failure

prevention knowledge, filter them, extract the core information that can be translated into knowledge, generalize

the knowledge into abstract form and rewrite them as the knowledge on measures for preventing failure cases

that are specifically relatable. But since it is difficult to establish all of these methods at once, the process of gathering and organizing the information that can be translated into failure prevention knowledge needs to be executed repeatedly and refined in the course of repetition.

[Specific steps of knowledge gathering procedure]

Based on the aforesaid background, failure prevention knowledge should be gathered according to the

following procedure:

1) Conduct an interview using the failure prevention knowledge sheet

By using the failure prevention knowledge sheet explained later on, interview the stakeholders or request

them to fill out the sheet directly. At this stage, the raw information should be recorded without translating it

into information on other relatable failure cases.

2) Make the information abstract by rewriting it into information that is relatable to other business fields and failure cases.

3) Make the information abstract by keeping only the core essence of the gathered failure prevention knowledge and translating it into failure prevention knowledge that is relatable to other business fields and failure cases.

4) Extract the failure prevention knowledge in each process

Extract the failure prevention knowledge in each process to create a matrix that shows what kind of problem

may occur in a process when no prevention measures are taken compared to the processes where

prevention measures have been taken.

5) Conduct a review

Request experts to review the failure prevention knowledge, and refine the descriptions by trimming the information to an appropriate amount and rephrasing the parts that are difficult to comprehend.

6) Revise the knowledge

Revise the description of the failure prevention knowledge according to the result of the review.

[Designing the failure prevention knowledge sheet]

Failure prevention knowledge sheet is a spreadsheet used to enter the information on failure prevention. It

should be composed of cells filled out with just the right amount of primary information on failure prevention by

those who can provide such information. It is desirable to design this sheet with entry items that take the

characteristics of each domain into consideration. Some examples of these entry items are explained later for

reference.

Fig. 2.1 shows a schematic drawing that illustrates how the failure prevention knowledge sheet should be

designed. The preparation of this sheet should be considered complete when the entered information is sorted in

the order exemplified in Fig. 2.1.

As the first step, start with sorting the information that describes the failed state right when it occurred. Some

examples of such information include the phenomenon that could be observed immediately after the problem

became visible, like when the system got out of control, or when the system stopped all of a sudden. Next, sort

the information that describes the event that occurred internally, which became the direct trigger to cause the

phenomenon.

Then sort the information that describes the factors that became the direct cause of failure. These causal

factors can be further broken down into factors related to technical aspect, personal (or human) aspect, or

organizational aspect of system development.

After sorting the information describing the failed state and the factors causing the failure, examine the

permanent measures to prevent the described failure from occurring.


Fig. 2.1 Policy on designing the failure prevention knowledge sheet

[Items to enter in the failure prevention knowledge sheet]

Based on the design policy explained above, create the failure prevention knowledge sheet, keeping in mind which entry items it should be composed of. A standard set of entry items is exemplified below:

Lesson title

Key words to briefly describe the background of the failure

Product domain

Product features

Observable phenomena

Event that occurred internally

Causal factors

Preventive measures

Besides the entry items explained above in the discussion of how the failure prevention knowledge sheet should be designed (observable phenomena, event that occurred internally, causal factors, preventive measures), items for entering the background information of the identified failure and the failure prevention knowledge that can be useful in each process should also be added.

Contents of Fig. 2.1 (Failure prevention knowledge ⇒ Generalization / abstraction ⇒ Lessons):

Incidents (types of incidents)

Observable phenomena *1)
✓ System goes out of control
✓ System stoppage
✓ Partial malfunction
✓ Inconsistency with operating environment
✓ System does not activate
✓ Screen freeze
✓ ・・・・
*1) or phenomena that may have occurred if left unattended

Event that occurred internally
✓ Data erased or destroyed
✓ Illegal interrupt
✓ Incomplete data transmission / reception
✓ Log data processing error
✓ Memory leak
✓ ・・・・

Causal factors

[Technical aspect]
(1) Requirements
✓ Misunderstanding requirements
✓ Missing requirements
✓ ・・・・
(2) Design
✓ Misunderstanding specifications
✓ Unconsidered feature interactions
✓ Requirements that have not been presented
✓ Unconsidered exception processing
✓ Unconsidered conditional judgment processing
✓ File export processing error
✓ ・・・・
(3) Implementation
✓ Gap between code and implementation
✓ Incorrectly described ‘null’ versus ‘blank’
✓ ・・・・

[Personal / organizational aspect]
✓ Lack of communication
✓ ・・・・

Countermeasures against these factors
◆ Express the specifications more clearly
・Review the Japanese language level
・Reorganize the information
◆ Reflect the design conventions
◆ Use the document templates
◆ Improve the completeness of tests


The background information on the identified failure refers, for example, to the level of reliability of the system

required in the product domain and the hardware features. The reason why such background information should

also be provided additionally is because the knowledge on failure prevention required to devise effective

preventive measures can vary largely depending on these quality requirements and product features.

Moreover, the failure prevention knowledge that can be useful in each process should also be added to help

understand what kind of preventive measures would be effective at which stage of development to prevent the

problem specified in the failure prevention knowledge sheet from occurring.
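As a rough illustration of the entry items discussed above, one entry of the sheet could be modeled as a simple record. The sketch below is written in Python; the class name `FailurePreventionSheet`, its field names and the `missing_items` helper are assumptions made for this example and are not defined by the handbook. The sample values loosely paraphrase Lesson 1, and the keywords and product domain label are purely illustrative.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, List


@dataclass
class FailurePreventionSheet:
    """One entry of a failure prevention knowledge sheet (hypothetical layout)."""
    lesson_title: str = ""
    keywords: List[str] = field(default_factory=list)
    product_domain: str = ""
    product_features: str = ""
    observable_phenomena: List[str] = field(default_factory=list)
    internal_events: List[str] = field(default_factory=list)
    # Causal factors keyed by aspect: "technical", "personal", "organizational".
    causal_factors: Dict[str, List[str]] = field(default_factory=dict)
    preventive_measures: List[str] = field(default_factory=list)
    # Additional items: background information and knowledge useful in each process.
    background: str = ""
    knowledge_by_process: Dict[str, List[str]] = field(default_factory=dict)

    def missing_items(self) -> List[str]:
        """Return the names of entry items that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


sheet = FailurePreventionSheet(
    lesson_title="Verify changed conditional logic with a decision table",
    keywords=["logic change", "conditional expressions"],  # illustrative values
    product_domain="Emergency door control",  # illustrative label
    product_features="Opens / closes emergency doors by using air pressure",
    observable_phenomena=["Several emergency doors did not open"],
    internal_events=["Conflicting condition in the air pressure control logic"],
    causal_factors={"technical": ["One line of the conditional expressions was deleted"]},
    preventive_measures=["Prepare a decision table when changing complex conditions"],
)

# Lists the additional items that still need to be filled in.
print(sheet.missing_items())
```

A check such as `missing_items` is one possible way to notice that the background information and the process-specific knowledge have not yet been added to an entry.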

[Policy on organizing the failure prevention knowledge by process]

Create a table that shows the failure prevention measures that are categorized by process and the potential

problems that can occur in each process if these preventive measures are not taken, and organize the

information by process so that the users of the failure prevention knowledge sheet will be able to gain an

overview of the failure prevention knowledge that is useful in each process, and trace back to the original failure

prevention knowledge from the knowledge sorted to be useful in each process (Fig. 2.2).

Fig. 2.2 Association between failure prevention knowledge and past failure cases categorized by process

The figure associates a collection of failure cases (Case 1, Case 2, Case 3, Case 4) with the failure prevention knowledge in each process. Example entries shown in the figure:

Analysis of system requirements: ・・・・

System architectural design: 1) Conduct an impact analysis when changes are made to the hardware. 2) ・・・・ 3) ・・・・

Concrete description of the failure: The terminal does not start up even when the power button is pressed. → Case ????-?

For organizing the failure prevention knowledge by process, the example shown in this handbook uses the process model [1] in “Embedded System development Process Reference guide (ESPR) Ver. 2.0”. It is desirable to include, as far as possible, not only the system engineering process (SYP) and the software engineering process (SWP) but also the support process (SUP). ESPR does not define an educational process, but including one should also be considered in order to expand the variety of failure prevention knowledge defined by process. Moreover, since embedded system development nowadays often consists of partial development of areas newly added to an existing system, it is also desirable to note whether the failure prevention knowledge in each process is effective only for the areas additionally developed in that process or applicable to the entire process.
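The process-by-process organization described above can be sketched as a small index that groups preventive measures by process and keeps a link back to the original failure case. This Python sketch is only an illustration: the names `Knowledge`, `organize_by_process` and `trace_cases`, the `scope` values, the case identifiers and the placeholder measures are all assumptions and do not come from the handbook or from ESPR.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Knowledge(NamedTuple):
    process: str   # process in which the measure is useful
    measure: str   # failure prevention knowledge for that process
    case_id: str   # original failure case the knowledge was derived from
    scope: str     # "added areas only" or "entire process"


def organize_by_process(items: List[Knowledge]) -> Dict[str, List[Knowledge]]:
    """Group failure prevention knowledge by development process."""
    table = defaultdict(list)
    for item in items:
        table[item.process].append(item)
    return dict(table)


def trace_cases(table: Dict[str, List[Knowledge]], process: str) -> List[str]:
    """Trace back from a process to the failure cases behind its knowledge."""
    return sorted({item.case_id for item in table.get(process, [])})


# Hypothetical entries; only the first measure is quoted from Fig. 2.2.
items = [
    Knowledge("System architectural design",
              "Conduct an impact analysis when changes are made to the hardware.",
              "Case 1", "entire process"),
    Knowledge("System architectural design",
              "(placeholder measure)", "Case 3", "added areas only"),
    Knowledge("Analysis of system requirements",
              "(placeholder measure)", "Case 2", "entire process"),
]

table = organize_by_process(items)
print(trace_cases(table, "System architectural design"))
```

Keeping the case identifier on every entry is what allows a reader to go from the per-process overview back to the original failure prevention knowledge.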


2.2 Lessons

In this chapter, readers are introduced to a series of lessons derived by abstracting the failure prevention knowledge obtained from the analysis of a wide range of gathered failure cases.

Table 2.1 below shows a list of these lessons and the processes in which measures formulated from these

lessons need to be implemented respectively to prevent the identified failures from occurring.


Table 2.1 List of lessons and corresponding processes

Process columns: System testing; Education / training; Project management; Operation; System requirements definition; System architecture design; Software architecture design; Software architecture design (modifications); Implementation (coding); Review

Lesson # / Lesson title / Corresponding processes (one ○ per process)

1 Verification using a tool such as a decision table is effective when logic composed of complex conditional expressions is changed. ○ ○

2 To modify a function that has a total of over a hundred unsorted conditions, or a function that has 10 or more conditions, check whether there is any inconsistency by identifying and sorting all the conditions that are related. ○ ○

3 When integrating multi-functional modules, check whether there are any conditions missing when the sum of the number of conditions differs before and after the integration. ○ ○

4 When the range of values of variables is wide and there are very numerous variations of parameter combinations, divide the range of variables into appropriate sizes, and perform boundary value tests. ○

5 To use an internal battery, take into consideration the boot sequence that occurs when the battery is in a deeply discharged state. ○ ○ ○ ○ ○

6 When using flash memory, keep in mind the finite number of times data can be written within its life cycle. ○ ○ ○ ○

7 When adding a function that consumes a lot of power, keep in mind the impact of temporary voltage drop (reset, freeze, etc.), the type of power source used, and the remaining battery capacity. ○

8 Formally analyze all the exceptions that can be assumed. ○ ○

9 When adopting a redundant system, appropriately set the area for data synchronization. ○

10 Even when the same hardware is specified as the control target, reconfirm the hardware specifications if the operation conditions are going to be changed. ○ ○ ○ ○

11 When sharing (or passing) data between processes or threads, keep a close eye on whether the exclusive control / synchronous processing is being carried out correctly and whether deadlock is occurring. ○ ○ ○

12 The instrument used for testing yield-type products to determine whether they are defective or non-defective should be suspected to be abnormal when it outputs test results indicating that all the tested products are either defective or non-defective. ○ ○

13 When improving the performance of existing software, check when the idling time occurs, when the process goes out of sync, and how they affect the software. ○ ○ ○ ○ ○

14 ・When handling a large volume of data via a communication device, be careful not to create any bottlenecks in the sequential processing flow. ・Also take into consideration the load fluctuation at different time zones. ○ ○ ○ ○

15 Be prepared with a recovery plan to support all the foreseeable abnormal system operations (reset, power shutdown, state of being left uncontrolled, etc.) that may occur during sequential execution of business systems that are used by the customers after delivery. ○ ○

16 Prepare a specifications document and conduct impact evaluation even for the processing of maintenance log data that is used for failure analysis. ○

17 Do not only extract the conditions required to perform the determination process, but also identify all the conditions that should be processed as limitations. ○

18 Be careful of the fragmentation of log files. ○


Lesson 1

Lesson title Verification using a tool such as a decision table is effective when logic composed of complex conditional expressions is changed.

Product features

A system designed to open / close a number of emergency doors by using air pressure

In this system, the movement to open / close the emergency doors is controlled by the input of multiple sensors, including the speed sensor and various interlocks. A high level of certainty that these doors open / close without fail in an emergency must be secured.

Observable phenomena

During the trial use, several emergency doors did not open when they were supposed to. This

incident not only led to the decline of system reliability but also stained the credibility of the

manufacturer.

Event that occurred internally

Through troubleshooting, it was found that a conflicting condition was set in the logic that

controlled the air pressure for opening / closing the emergency doors.

Causal factors The power source used to drive the compressor for opening / closing the emergency doors was also

used by other systems for the purpose of using the available power efficiently. The electric current

for providing residual heat to the solenoid was also supplied from this power source. Then the

power source configuration was changed. This configuration change required the condition that

was set to secure the power current for the residual heat to be deleted. The system engineer that

was assigned to the program modification job simply thought that cutting one line of the

conditional expressions would suffice, and deleted this line without realizing that the modified

pattern no longer met the overall requirement specifications.

Before the change:

(Precondition)

&& (Air pressure is equal to or above the set value)

&& ( (●No request for electric current)

|| (Abnormal value detected by speed sensor)

|| (■Abnormal value detected by thermal sensor) )

Error state after the change:

(Precondition)

&& (Air pressure is equal to or above the set value)

&& ( (Abnormal value detected by speed sensor)

|| (■Abnormal value detected by thermal sensor) )


After implementing the countermeasure:

(Precondition)

&& ( (Abnormal value detected by speed sensor)

|| (■Abnormal value detected by thermal sensor)

|| (Air pressure is equal to or above the set value) )

Preventive measures

Measures against the direct cause:

・Correct the logic after confirming the conditional expressions.

Permanent measures against causal factors (clearly describe the process to be taken):

When changing a complicated set of conditions, check whether there is any conflict or non-compliance in the logic after the change by preparing a decision table like the one shown below. Such a decision table makes it easier to compare what has been deleted from or added to the original pattern of conditions (the change becomes easier to notice when different colors are used, for example), and therefore helps prevent system failures caused by incorrectly written logic.

Before the change:

Conditional expressions                          Established condition
                                                 #1  #2  #3  #4  #5
&& (Precondition)                                ○   ○   ○   -   -
&& Air pressure >= Set value                     ○   ○   ○   -   -
&&
   || ●Request for electric current == FALSE     ○   -   -   -   -
   || Abnormal speed == TRUE                     -   ○   -   -   -
   || ■Abnormal temperature == TRUE              -   -   ○   -   -

Error state after the change:

Conditional expressions                          Established condition
                                                 #1  #2  #3  #4  #5
&& (Precondition)                                ○   ○   -   -   -
&& Air pressure >= Set value                     ○   ○   -   -   -
&&
   || Abnormal speed == TRUE                     ○   -   -   -   -
   || ■Abnormal temperature == TRUE              -   ○   -   -   -
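The comparison that the decision tables support can also be done mechanically by enumerating every combination of inputs and listing the combinations whose result changed. The following Python sketch is a minimal illustration under assumptions: the input names, the functions `before`, `error_state`, `decision_table` and `differences` are hypothetical, and the "error state" expression is taken from the decision table above, in which the row for the electric current request has been removed.

```python
from itertools import product

# Hypothetical input names standing in for the conditions in the tables above.
INPUTS = (
    "precondition",
    "air_pressure_ok",      # air pressure >= set value
    "no_current_request",   # request for electric current == FALSE
    "abnormal_speed",
    "abnormal_temp",
)


def before(c):
    """Logic before the change."""
    return c["precondition"] and c["air_pressure_ok"] and (
        c["no_current_request"] or c["abnormal_speed"] or c["abnormal_temp"]
    )


def error_state(c):
    """Logic after one line of the conditional expressions was deleted."""
    return c["precondition"] and c["air_pressure_ok"] and (
        c["abnormal_speed"] or c["abnormal_temp"]
    )


def decision_table(logic):
    """Evaluate the logic for every combination of input values."""
    rows = []
    for values in product([False, True], repeat=len(INPUTS)):
        conditions = dict(zip(INPUTS, values))
        rows.append((conditions, logic(conditions)))
    return rows


def differences(old, new):
    """Return the input combinations for which the two logics disagree."""
    return [
        conditions
        for (conditions, a), (_, b) in zip(decision_table(old), decision_table(new))
        if a != b
    ]


for conditions in differences(before, error_state):
    print(conditions)
```

Any combination printed here is a pattern whose outcome changed, so each one has to be checked against the requirement specifications before the modification is accepted.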


Lesson 2

Lesson title To modify a function that has a total of over a hundred unassociated conditions, or a function that has 10 or more conditions, check whether there is any inconsistency by identifying and associating all the conditions that are related.

Product features

A system designed to control the process that involves solvent treatment

A high reliability is required as it is used in rigorous environmental conditions including

temperature.

Observable phenomena

When the system was in operation, a drainage pipe cracked. As a result, a part of the pump used for treating exhaust air continued to operate even after completing the predefined exhaust treatment and burnt out due to overheating. Although this incident did not escalate into a fire, the functionality of the exhaust treatment system was partially lost. If a fire had spread elsewhere, it might have led to a major accident.

Event that

occurred

internally

This exhaust treatment system was composed of two pumps, A and B. Each pump was activated when the signal instructing the start of the draining operation was switched on. According to the specifications, the exhaust monitor control would operate for a predefined length of time to activate the pump. Pump A, once activated, would stop when this monitor control completed its function, and pump B would then activate and proceed with the exhaust treatment. In this failure, the crack in the drainage pipe damaged the thermal sensor. Because the thermal sensor was broken, a part of the control sequence intended for the abnormal state did not work. Consequently, the signal instructing the start of the draining operation was switched off before the exhaust monitor completed its control function after activating pump A, and a part of pump A burnt out because the pump continued to operate while the exhaust monitor was left in an incomplete state.

Causal factors ・The original logic was written based on the assumption that the signal for instructing the start of

draining operation would continue to be switched on until the completion of the exhaust

treatment. This logic was actually able to make the system continue operating normally even

when the thermal sensor malfunctioned. But when the specifications of the exhaust treatment

process were changed and the pump activation sequence had to be modified, it was rewritten

unintentionally into a logic that would switch off the signal for instructing the start of draining

operation without waiting for the completion of the exhaust treatment.

・This exhaust treatment was controlled in conjunction with the aforesaid reaction process. But the system engineer assigned to the program modification had little knowledge of or experience with the control mechanism as a whole, and performed the modification with only the reaction function in mind.


Preventive

measures


・Correct the logic after confirming the conditional expressions.


① Comparing the number of variables before and after the change is effective to a certain extent. But in a case like this one, where the number of variables barely differs before and after the change, the change is difficult to notice just by counting variables.
In this case, control logic composed of a growing number of conditions was concentrated too heavily in one specific module. From the standpoint of maintenance, it is therefore desirable to split the conditions appropriately into separate modules. Reducing the scope that each module processes also reduces the likelihood of processing errors, and the design work can be completed more easily and in a shorter time.

Variables                                  Scale of   Complexity before the change   Complexity after the change
                                           change     Input      Number of           Input      Number of
                                                      variables  conditions          variables  conditions
State when pump A is operating             None       8          7                   8          7
Activation of pump A                       Small      6          5                   7          6
Operation of ○○ valve                      Small      7          8                   9          11
State when ○○ valve is open                None       5          4                   5          4
Instruction to start draining              －         26 43
State when draining operation is executed  －         30 58

② Implement a logic that would stop pump A and pump B from operating for a certain period of

time so that they would not operate continuously to the extent that would make both of them

burn out. Also assign seasoned engineers to teach the designers about the technical

knowledge required to deeply understand the entire process. By putting this educational

activity into practice on a regular basis, the knowledge on maintaining the system reliability

that has been accumulated over the years in the organization can be constantly transferred to

the engineers of the next generation.
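The forced stop described in measure ② can be sketched as a small interlock that is independent of the sequence logic. This is a minimal illustration, not the logic used in the actual system: the class name, the time limits and the idea of passing the current time in as an argument are all assumptions.

```python
class RunTimeInterlock:
    """Forces a pump off once it has run continuously for too long."""

    def __init__(self, max_run_s, rest_s):
        self.max_run_s = max_run_s   # longest permitted continuous run
        self.rest_s = rest_s         # forced rest after the limit is hit
        self.started_at = None       # time the current run began
        self.locked_until = None     # time until which the pump must rest

    def allow_run(self, requested, now):
        """Return True if the pump may run at time `now` (seconds)."""
        if self.locked_until is not None:
            if now < self.locked_until:
                return False
            self.locked_until = None
        if not requested:
            self.started_at = None
            return False
        if self.started_at is None:
            self.started_at = now
        if now - self.started_at >= self.max_run_s:
            # Limit reached: stop the pump and start the rest period.
            self.started_at = None
            self.locked_until = now + self.rest_s
            return False
        return True

pump_a = RunTimeInterlock(max_run_s=10, rest_s=5)
print([pump_a.allow_run(True, t) for t in (0, 9, 10, 12, 15)])
```

Because the interlock looks only at elapsed run time, it still stops the pump when the start signal or a sensor value is wrong, which is the situation that occurred in this lesson.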

Not much difference before and after the change

Definition of complexity and indication of threshold:

A function that has a total of over a hundred unassociated conditions, or a function that has 10 or more conditions
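The threshold above can be checked mechanically once the number of conditions per function is known. The sketch below is one possible reading of the definition: it treats "a total of over a hundred" as the sum over the functions under review, and the function names and counts in the example are hypothetical.

```python
# Threshold values taken from the definition above.
PER_FUNCTION_LIMIT = 10   # a function that has 10 or more conditions
TOTAL_LIMIT = 100         # a total of over a hundred conditions

def needs_association_check(condition_counts):
    """condition_counts maps function name -> number of conditions.

    Returns the functions at or above the per-function limit, and
    whether the total exceeds the total limit.
    """
    flagged = sorted(name for name, count in condition_counts.items()
                     if count >= PER_FUNCTION_LIMIT)
    total = sum(condition_counts.values())
    return flagged, total > TOTAL_LIMIT

# Hypothetical counts for three functions touched by a change.
flagged, total_exceeded = needs_association_check(
    {"valve_operation": 11, "pump_activation": 6, "drain_start": 43})
print(flagged, total_exceeded)
```

A check of this kind only says where to look. The association of related conditions and the search for inconsistencies still have to be done by a reviewer who understands the whole process.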


Lesson 3

Lesson title When integrating multi-functional modules, check whether there are any conditions missing when

the sum of the number of conditions differs before and after the integration.

Product features

A system to control the automatic guided vehicles (AGVs) used to convey intermediate and finished products to different areas of the plant
This system must be capable of controlling the optimal conveying routes and operating at a very high rate in order to achieve high plant productivity.

Observable

phenomena

Some of the many position sensors malfunctioned while the system was in operation. While the maintenance team was preparing to fix the defective sensors, automatic guided vehicles collided with each other. This incident stopped the conveying operation for a long time and slowed down the shipping process.

Event that

occurred

internally

The automatic guided vehicles were designed to move by learning from the information coming in from sensors that detect other running vehicles, people or objects and measure the distance to them, and by processing this input in an optimal control routine. The conditions that should have been input from the sensors that became defective included conditions instructing the vehicles to learn their forward positions, and these conditions turned out to be missing.

Causal factors ・The self-learning function was controlled by multiple software modules. Changes in business needs then required this function to be integrated. When the configuration of the modules was modified to meet the functional integration requirement, a part of the conditions included in the original behavioral specifications was omitted in transcription.
・The modified program was debugged in the normal development environment and checked with a structure visualization tool. But the system engineer in charge of this verification process did not realize that a condition was missing in the specifications, because collisions of automatic guided vehicles rarely occurred.


[Figure: Module A, Module B and Module C shown in three states: before integration, error state after integration, and after modification]

Preventive

measures


The conditions before and after the change were compared, and the program was corrected by

adding the missing condition that was identified in this comparative analysis.


・Normally, the combination of parameters that has to be checked to verify the consistency of the

learning function tends to get very numerous. Unless a real machine is used for this verification

process, it is often difficult to detect the conditions that are missing. In the organization that was

using this system, a virtual environment was built in advance to perform the tests required for

the verification. But there is a limit in the completeness of the tests that can be conducted in a

virtual environment.

・As an effective measure to prevent conditions from going missing in the modified software program, use a structure visualization tool for visual examination. Use this tool to check whether X ≒ Y or not (see below for the definition of X and Y), and record the results.
X: Sum of the specified number of conditions arising in the modules before integration
Y: Specified number of conditions arising in the modules after integration
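The X ≒ Y check can be sketched in a few lines when the conditions of each module are available as a list of identifiers. This is a minimal sketch under that assumption: the condition names are hypothetical, and in practice the counts would come from the structure visualization tool.

```python
def compare_conditions(before_modules, after_conditions):
    """Compare conditions before and after integration.

    before_modules: module name -> set of condition identifiers
    after_conditions: set of condition identifiers in the merged module
    Returns (X, Y, missing), where missing lists the conditions that
    existed before integration but are absent afterwards.
    """
    x = sum(len(conditions) for conditions in before_modules.values())
    y = len(after_conditions)
    all_before = set().union(*before_modules.values())
    missing = sorted(all_before - set(after_conditions))
    return x, y, missing

# Hypothetical condition identifiers for modules A, B and C.
before_modules = {
    "A": {"fw_learning_instruction_a"},
    "B": {"fw_learning_instruction_b"},
    "C": {"fw_learning_instruction_c"},
}
merged = {"fw_learning_instruction_a", "fw_learning_instruction_b"}
print(compare_conditions(before_modules, merged))
```

Comparing only the counts X and Y is the quick screening step. Listing the missing identifiers, as above, is what points the reviewer to the condition that has to be restored.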

[Figure: before integration, three modules each contain an FW learning function and an FW learning instruction condition (●, ▲, ■); after integration, a single FW learning function remains, with one condition marked "Missing condition"]





Comparison of the code size that appears on the visualization tool:
  Before integration and error state after integration: Differs largely
  Before integration and after correction: Almost the same
Comparison of the number of conditions that appears on the visualization tool:
  Before integration and error state after integration: Before integration: Sum＞19 or more; Error state after integration: Sum＝4
  Before integration and after correction: Before integration: Sum＞19 or more; After correction: Sum＞19 or more

This simple visual method makes it easy to find whether there are any conditions missing in the

modified software program or not.


Lesson 4

Lesson title When the range of values of variables is wide and the number of parameter combinations is very large, divide the range of each variable into partitions of appropriate size, and perform boundary value tests.

Product features

A system designed to transfer a specified volume of chemical substance on conveying trays along a

designated route at pre-defined intervals

A high level of certainty in movement and accuracy in response time are required for this transfer

system.

Observable

phenomena

One day, a conveyor line that was transferring the chemical substance stopped all of a sudden due

to the activation of emergency stop mechanism. According to the specifications, this transfer

system had been programmed to enable the line to restart the pre-defined sequence from where it

stopped, based on the direction the trays were heading, the volume of chemical substance they

were carrying, and their respective positions on the line when it stopped. However, on this

particular day, the line did not restart for a long time and the extended downtime led to an

increasingly large production loss. In addition, the chemical substances remaining still on the

conveyor line had to be removed by manual labor, which also required a long time to complete

since the chemical substances were hazardous and therefore had to be handled very carefully.

Event that

occurred

internally

The system was programmed to restart the transfer operation sequence after the line stops, after

calculating the highly probable positions and direction of the trays by referring to the parameter

table. But in this event of failure, a wrong module was used to refer to the parameter table and

calculate the tray positions and direction. As a result, tray #5 was targeted as the tray to restart

instead of tray #4 that actually had to be restarted.

Causal factors ・To improve the calculation process, the modules were merged without changing the original functionality. But in the course of this integration, the calculation logic was modified unintentionally into a logic that was non-compliant with the original behavioral specifications.

・This calculation logic determines which sequence to restart, based on the information of the state

in which the line had stopped (the direction of the trays, the volume of chemical substance they

were carrying, their respective position on the static line, etc.). Combination tests were

conducted to check the modified calculation logic, but the completeness of this test was limited

because there were so many parameter combinations to test, and besides, the range of values

of these parameters was very wide.

Preventive

measures


After re-examining the modified logic, the non-compliant parts of the logic were rewritten

correctly.



Because the missing conditions in the modified calculation logic went undetected, owing to the large number of parameter combinations to test and the wide range of parameter values, adopt the following two verification approaches, which make use of model testing techniques.

Property verification:

Perfect match verification:

Moreover, if the range of value of variables is wide and the variation of parameter combinations

is very abundant, the parameter combinations examined in the model test may explode.

Therefore, in such situation, it is advisable to generalize the conditions into an abstract form.

For this particular case, the model test was performed after dividing the range of each variable into partitions of appropriate size. By doing so, the verification process could be completed easily in a relatively short period of time.
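Dividing the ranges and testing at the boundaries can be combined with the perfect match verification as sketched below. This is an illustration only: the two calculation functions stand in for the module before and after the change, and the ranges, partition points and the off-by-one fault are hypothetical.

```python
from itertools import product

def boundary_values(lo, hi, cuts):
    """Boundary values for the integer range [lo, hi] partitioned at `cuts`."""
    edges = sorted({lo, hi, *cuts})
    values = set()
    for edge in edges:
        for value in (edge - 1, edge, edge + 1):
            if lo <= value <= hi:
                values.add(value)
    return sorted(values)

def perfect_match(old, new, ranges):
    """Return the boundary-value combinations on which old and new differ."""
    grids = [boundary_values(*r) for r in ranges]
    return [args for args in product(*grids) if old(*args) != new(*args)]

# Hypothetical tray selection: one tray per 100 position units.
def select_tray_before(position, volume):
    return position // 100

def select_tray_after(position, volume):
    return (position + 1) // 100   # deliberate off-by-one fault

ranges = [(0, 499, [100, 200, 300, 400]),   # position
          (0, 50, [25])]                    # volume
mismatches = perfect_match(select_tray_before, select_tray_after, ranges)
print(sorted({position for position, _ in mismatches}))
```

Here 16 position values and 7 volume values are tested instead of all 500 × 51 combinations, and the fault still shows up because it sits exactly on a partition boundary. Faults that lie inside a partition would not be found this way, so the choice of partition points matters.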

[Figure, property verification: are the constrained conditions (prior conditions and post conditions) established for the module before the change and for the module after the change?]
[Figure, perfect match verification: given the same prior conditions, do the targeted variables of the module before the change and of the module after the change (merged module) match perfectly?]



Lesson 5

Lesson title When using an internal battery, take into consideration the boot sequence that occurs when the battery is in a deeply discharged state.

Product features

Portable terminal for business use that is equipped with a display, wireless communication

function and an internal battery that enables it to be used without connecting to an AC charger

Observable

phenomena

The terminal does not activate even when the power button is pressed.

The battery cannot be charged even when the AC charger is connected.

Event that

occurred

internally

・When the battery voltage is extremely low (i.e. when the battery is in a deeply discharged state), the software is programmed to prevent the terminal from activating until the voltage recovers, through charging or other means, to the threshold level set as the voltage to activate the terminal.
・When the AC adapter was connected while the battery was in a deeply discharged state, the measured voltage rose because of the power supplied through the AC adapter. When the user pushed the power button, the software checked whether the terminal could be activated, based on the internal battery voltage that included the contribution of the charging power. Because the measured voltage was higher than the threshold level set as the voltage to activate the terminal, the software determined that activation was possible and started the boot process.
・However, according to the specifications of the terminal boot process, charging is temporarily stopped during the process. When charging stops, the battery voltage returns to its original (deeply discharged) level and falls below the threshold level set in the power IC as the voltage to activate the terminal.
・When the battery voltage falls below the threshold level set as the voltage to activate the terminal, the hardware determines that the battery voltage is insufficient, stops activating the terminal and stops supplying power to the entire system. As a result, the system stops without charging the battery.

Causal factors ・Did not consider the battery voltage dropping to an extremely low level (in deeply discharged

state) as one of the scenarios for evaluating the operability of the terminal.

・Lacked consideration of the battery voltage dropping to an extremely low level (in deeply

discharged state) when the terminal is connected to an AC charger.

Preventive

measures


・Change the threshold level of the voltage to activate the terminal when the terminal is connected

to an AC adapter.


・Recognize the need to take corrective action against specifications of a battery-run terminal that allow the battery voltage to drop to an extremely low level (deeply discharged state), such as when the battery is left unused for a long period of time.
・Devise ways to prevent the battery voltage from dropping to an extremely low level (deeply discharged state), consider methods to recover the voltage if it drops despite these preventive measures, and be sure to verify the effectiveness of the preventive measures and recovery methods with the actual terminal and its internal battery.

・Take into consideration the operating characteristics of the terminal’s starting current, the

possibility of its internal battery voltage dropping, and the variability of the extent to which it

drops, and incorporate these elements in the design.

・Investigate / review the customer’s use case, and evaluate the solution to this problem under the

same conditions / environment in which the terminal is actually used by the customer (end user).

Implementation of the preventive measures described above should make it easier to design the product in such a way that the following problems related to charging can be prevented from occurring:
・The terminal does not activate even when the power button is pressed;
・The battery cannot be charged even when the AC charger is connected.
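The failure sequence in this lesson, and the effect of changing the activation threshold while the AC adapter is connected, can be reproduced with a very small model. Everything numeric here is an assumption made for illustration: the threshold, the voltage uplift seen while charging, and the choice to raise the threshold by that uplift are not taken from the actual product.

```python
BOOT_THRESHOLD_V = 3.4   # hypothetical activation threshold
CHARGE_UPLIFT_V = 0.5    # hypothetical rise in measured voltage while charging

def may_boot(measured_v, on_ac, ac_margin_v):
    """Decide whether to start booting from the measured battery voltage."""
    threshold = BOOT_THRESHOLD_V + (ac_margin_v if on_ac else 0.0)
    return measured_v >= threshold

def boot_outcome(rest_v, on_ac, ac_margin_v):
    """Simulate one press of the power button.

    rest_v is the battery voltage without charging. During boot,
    charging pauses, so the voltage falls back to rest_v.
    """
    measured_v = rest_v + (CHARGE_UPLIFT_V if on_ac else 0.0)
    if not may_boot(measured_v, on_ac, ac_margin_v):
        return "stay off, keep charging"
    if rest_v >= BOOT_THRESHOLD_V:
        return "booted"
    return "power cut during boot"

# Original behavior: same threshold with or without the AC adapter.
print(boot_outcome(rest_v=3.0, on_ac=True, ac_margin_v=0.0))
# Countermeasure: higher threshold while the AC adapter is connected.
print(boot_outcome(rest_v=3.0, on_ac=True, ac_margin_v=0.5))
```

With the raised threshold, a deeply discharged terminal refuses to boot and keeps charging until the battery itself can carry the boot sequence. The real margin has to be determined by measurement on the actual terminal and battery, as the preventive measures state.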


Lesson 6

Lesson title When using flash memory, keep in mind the finite number of times data can be written within its

life cycle.

Product features

Portable terminal for business use that is equipped with a display, wireless communication

function, built-in flash memory for data storage and an internal battery that enables it to be used

without connecting to an AC charger

Observable

phenomena

・The terminal does not activate even when the power button is pressed.

・The terminal freezes while it is starting up.

・The terminal blacks out (due to power outage), resets or freezes while it is used.

Event that

occurred

internally

・Flash memory is used as the storage for this terminal (to save data of the operating system (OS),

apps, etc.)

・The values in a specific area of the flash memory are corrupted, making the terminal unable to start or causing other kinds of problems.

Causal factors ・The values in a specific area of the flash memory were found to be corrupted.
・Data had been written to that specific area of the flash memory more times than data can be written within the life cycle of this flash memory.

Preventive

measures


・Reduce the number of times data is written into the flash memory by finding why it increased and

correcting the identified causal factors.


・When using flash memory as the storage of a terminal, recognize that it is a component with a

finite life cycle.

・Be careful not to exceed the finite number of times data can be written into the flash memory

within its lifecycle.

・If data may be written to the flash memory more times than it can be written within its life cycle, prepare a function that monitors the number of writes, or another mechanism that prevents this number from being exceeded.
・Ask the manufacturer of the flash memory in advance about the finite number of times data can be written to the selected flash memory within its life cycle, about the method of evenly distributing writes across the available space in the flash memory (wear leveling), and about any points that need care when using it as the storage of the product you are planning.
・When adopting flash memory, gain a good understanding of the environment in which the customer is actually going to use it, the types of data handled by the customer, and the likely timing and frequency with which the customer reads and writes data in the flash memory.

・When selecting the flash memory, do not only consider the software area developed by your

company but also the operating system (OS) you are planning to use for the product, as well as the

behavior of other software you have purchased and freeware that you are planning to use in the

same product.

・If there are different variations of flash memory to choose from, understand the differences and

evaluate them.

・When replacing the flash memory (from NOR flash →NAND flash; from SLC→MLC→TLC; etc.), be

especially attentive about the different life cycle of each type.

・When using flash memory, have the software development team and the hardware development team work together to prevent it from causing problems in use. This collaboration is very important for preventing system failures caused by flash memory.


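Implementation of the preventive measures described above should make it easier to design the product in such a way that the following problems attributable to the life cycle of product components can be prevented from occurring:
・The terminal does not activate even when the power button is pressed.
・The terminal freezes while it is starting up.
・The terminal blacks out (due to power outage), resets or freezes while it is used.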
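A function that monitors the number of writes, as recommended above, could look like the following sketch. It is a simplified illustration: the class name, the per-block bookkeeping, the warning ratio and the write limit used in the example are assumptions, and a real product would take the limit from the flash memory manufacturer and keep the counters in persistent storage.

```python
class WriteCounter:
    """Tracks writes per flash block against a rated write limit."""

    def __init__(self, rated_cycles, warn_ratio=0.8):
        self.rated_cycles = rated_cycles   # limit stated by the manufacturer
        self.warn_ratio = warn_ratio       # fraction at which to warn
        self.counts = {}                   # block id -> writes so far

    def record_write(self, block):
        """Count one write and return 'ok', 'warn' or 'refuse'."""
        used = self.counts.get(block, 0)
        if used >= self.rated_cycles:
            return "refuse"                # limit reached, do not write
        self.counts[block] = used + 1
        if used + 1 >= self.rated_cycles * self.warn_ratio:
            return "warn"
        return "ok"

counter = WriteCounter(rated_cycles=10)    # tiny limit for demonstration
print([counter.record_write(block=0) for _ in range(11)])
```

The warning state gives the software time to reduce the write frequency or move the data before the block is worn out, rather than discovering the problem when the terminal no longer starts.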


Lesson 7

Lesson title When adding a function that consumes a lot of power, keep in mind the impact of temporary

voltage drop (reset, freeze, etc.), the type of power source used, and remaining battery capacity.

Product features

Portable terminal for business use that is equipped with a display, 3G wireless communication

function, and an internal battery that enables it to be used without connecting to an AC charger

Observable

phenomena

・When the terminal is activated, or while it is used, an error message saying “3G modem has

stopped” appears on the display, and 3G communication will be disabled after this message.


Event that

occurred

internally

・As soon as the 3G modem started running, the terminal began writing data into the flash memory. During the writing process, the terminal powered off. As a result, the file system was corrupted and the 3G modem became unusable.
・The reset occurred because of a sudden drop in supply voltage caused by the inrush current that flowed when the 3G modem started running.
・To prevent the 3G modem from resetting, a condenser was added to the power source of the modem. But this time, a system reset occurred because of a sudden drop in supply voltage caused by the inrush current that flowed when a terminal with low battery capacity started running. As a result, the terminal powered off.

Causal factors ・There were two product models developed for this terminal: one with the 3G modem and one

without the 3G modem.

・The model without the 3G modem was developed first. The later model that was developed with

the new 3G-communication feature had been released without evaluating it as thoroughly as it

should have been.

・The evaluation of the built-in 3G modem was also inadequate. The power voltage varied

depending on the parts assembled in the modem.

・When the condenser was added to the power source of this 3G modem, the impact on the

peripheral circuits was not analyzed as stringently as it should have been.

・The terminal was not evaluated as closely as it should have been in the state where much of its battery capacity had been used up (and only a low supply voltage could be output).

Preventive

measures


・To prevent the 3G modem from resetting, add a condenser to the power source of the modem.

・To prevent the system reset from occurring when the remaining battery capacity is low, add a condenser at the root of the power source IC, and also modify the software so that the terminal will not activate when the remaining battery capacity is low.



■ System architecture design

・When there are multiple models developed for a particular product (terminal in this case),

evaluate all the models after gaining a good understanding on how they differ from one another.

・When designing the system architecture of the terminal, take the variable aspects of the terminal

into consideration, including the inrush current of each function, voltage drop, inconsistency of the

parts assembled in the circuits as well as the starting sequence of the terminal.

・When adding a component to the power line to prevent the voltage from dropping, check where

the most appropriate position is to install this component and also the impact it may have on other

power lines.

・In case the terminal is a type that also runs on battery, recognize that there is a need to take

measures to prevent it from powering off abruptly when its remaining battery capacity is low.


Implementation of the preventive measures described above should make it easier to design the product in such a way that the following problems related to its power capacity can be prevented from occurring:
・The communication function becomes disabled when the terminal is activated or while it is used;



Lesson 8

Lesson title Formally analyze all the exceptions that can be assumed.

Product features

Embedded system for controlling physical phenomena that is designed to achieve the system

objectives while running many functions concurrently and operating or referencing various

peripheral devices and subsystems.

Observable

phenomena

An aircraft controlled by this system nosedived while it was flying in autopilot mode. This abnormal behavior was attributable to a malfunction of the pose information control unit. Normally, this unit receives a set of information from the sensors and translates it into pose information, which the automatic piloting system reads to control the flight while the aircraft is in autopilot mode. But in this failure, the pose information control unit continued to output unexpectedly large values as the pose information, and the automatic piloting system put the aircraft into a nosedive based on this incorrect pose information. The pose information control unit recovered after its power was shut off and its internal data were reset. One of the following three phenomena (1, 2 or 3) can be assumed to have caused the pose information control unit to malfunction.

Phenomenon 1: System failure occurred when the system suddenly stopped operating. If this was

the case, the system should return to normal operation by reboot.

Phenomenon 2: System failure occurred because the system could not refer to / process the

external I/O correctly and malfunctioned as a result. It is not possible to identify the particular type

of external I/O and function that led to this phenomenon.

Phenomenon 3: System failure occurred because a particular function that was executed during

operation could not be completed, and other functions, as a result, could not be activated. The

system cannot recover from this failure unless the power is reset.

Events that

occurred

internally

The following six events (Event 1 to Event 6) are potential events inside the pose information control unit that could have caused its continuous output of abnormal pose information.

Event 1: Corruption of data inside MPU

If noise had intruded into MPU, it would have been through the pin of the MPU. The noise would

have first damaged the IO direction register. If the noise had the energy to reach inside the MPU, it

would have damaged the internal register as well, making the MPU malfunction (input does not

change, output does not change, no interrupt starts, etc.) or uncontrollable.


→Phenomenon 1, Phenomenon 2

Event 2: Execution of unexpected interrupt program

External interrupt occurs due to noise. The system processes this input but because the input value

is indefinite, system failure occurs.

→Phenomenon 2

Event 3: CPU fully occupied by continuous interrupt

A burst of interrupts occurs and the CPU becomes fully occupied with interrupt programs. As a result, the time constraints of the peripheral devices connected to the MPU cannot be met, making the system unable to operate the inputs and outputs correctly.

→Phenomenon 2

Event 4: Defect of input device

System failure occurs when the input values become indefinite due to hardware damage.

→Phenomenon 2

Event 5: Defect of output device

System failure occurs because the damage in the hardware makes the hardware unable to operate

even when the system tries to operate the outputs.

→Phenomenon 2

Event 6: Defect or disconnection of the input / output devices and subsystems

The system waits for the input / output devices or subsystems to respond but the response never

comes. The function that the system is executing cannot therefore be completed, making other

functions unable to activate. As a result, system failure occurs.

Causal factors The exceptions that can be assumed to occur have not been defined in the specifications.

This failure case occurred because no consideration was made on the exceptions of input / output

devices and subsystems that the embedded system might reference or operate.

Preventive

measures


Event 1: Corruption of data inside MPU

Recover the register damaged by noise by refreshing the register inside the MPU.

Recover the register by monitoring the interrupts and tasks, and resetting the register when none

of these interrupts and tasks become active for a specified length of time.

Eliminate the noise included in the data obtained from the input device.

Round (clamp) the input values into the valid input range if the noise pushes them outside the input data range.


Event 2: Execution of unexpected interrupt program

When an interrupt occurs, make the system determine whether it is a normal interrupt or not.

When the interrupt is caused by noise, do nothing and end the interrupt program.

Event 3: CPU fully occupied by continuous interrupt

Same as above.

Event 4: Defect of input device

Round (clamp) the input values into the valid input range if the noise pushes them outside the input data range.
Monitor the input device at a fixed cycle. Reset the input device if there is no response from this device. If the input device still does not recover, reset the MPU.

Event 5: Defect of output device

Monitor the output device at a fixed cycle. Reset the output device if there is no response from this device. If the output device still does not recover, reset the MPU.

Event 6: Defect or disconnection of the input / output devices and subsystems

Set a timer that waits for the input / output devices and subsystems to respond. If there is no

response for a predefined length of time, make the system suspend the execution of the functions

that will be affected by any of these responses, and handle the “time out” error.
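Two of the measures above, rounding out-of-range inputs and waiting for a device with a timer, can be sketched as follows. This is a minimal sketch, not the implementation of any particular system: the function names, the polling interface (a callable that returns None until the device answers) and the time values are assumptions.

```python
import time

def clamp(value, lo, hi):
    """Round an out-of-range input back into [lo, hi]."""
    return max(lo, min(hi, value))

class DeviceTimeout(Exception):
    """Raised when a device does not respond within the time limit."""

def wait_for_response(poll, timeout_s, interval_s=0.005):
    """Poll a device until it answers or timeout_s elapses.

    poll is a callable returning the reply, or None if there is none yet.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        reply = poll()
        if reply is not None:
            return reply
        if time.monotonic() >= deadline:
            raise DeviceTimeout(f"no response within {timeout_s} s")
        time.sleep(interval_s)

# Hypothetical use: a sensor that never answers.
try:
    wait_for_response(lambda: None, timeout_s=0.02)
except DeviceTimeout:
    print("time out: suspend dependent functions and report the error")
```

Raising a dedicated exception keeps the waiting function from blocking the caller forever, so the functions that depend on the response can be suspended while the others continue, which is the behavior Event 6 calls for.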


Prevent exceptions from occurring by performing the following as a part of requirements analysis

and definition process.

・Define the exceptional items from physical and environmental perspectives.

・Create a list of defined exceptional items.

[Figure: the physical items and environmental items of the xxx system are used to define the exceptional items of the xxx system, which are recorded in an exceptional item list (Device, Type, Item name, Exception)]


・Create a matrix of functional item list and exceptional item list, and use it to define the functional

specifications of each exception.

Implementation of the preventive measures described above should make it easier to reduce the

possibility of system failure caused by functions that do not operate normally because the system

cannot reference / operate the external inputs / outputs correctly.

[Figure: matrix of the functional items of the xxx system against the exceptional items of the xxx system (Type, Item name); ○：Affected ×：Not affected]
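The matrix of functional items against exceptional items can be kept as data, so that cells marked as affected but still lacking a specification are easy to list. The sketch below assumes a simple representation with hypothetical function and exception names. It only illustrates the bookkeeping, not the analysis itself.

```python
def build_matrix(functions, exceptions, affected):
    """Build the function x exception matrix.

    affected is a set of (function, exception) pairs judged as affected.
    """
    return {f: {e: ("○" if (f, e) in affected else "×") for e in exceptions}
            for f in functions}

def unspecified(matrix, specs):
    """Affected cells that have no exception-handling specification yet."""
    return sorted((f, e) for f, row in matrix.items()
                  for e, mark in row.items()
                  if mark == "○" and (f, e) not in specs)

# Hypothetical items for illustration.
functions = ["read_sensor", "output_pose"]
exceptions = ["sensor_no_response", "noise_on_input"]
affected = {("read_sensor", "sensor_no_response"),
            ("read_sensor", "noise_on_input"),
            ("output_pose", "sensor_no_response")}
specs = {("read_sensor", "sensor_no_response"): "time out after 50 ms"}

matrix = build_matrix(functions, exceptions, affected)
print(unspecified(matrix, specs))
```

Every pair returned by `unspecified` is an exception that a function can encounter but whose behavior has not been defined, which is the gap identified as the causal factor in this lesson.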


Lesson 9

Lesson title When adopting a redundant system, appropriately set the data area to be synchronized.

Product features

A remote monitoring system that must maintain high availability (long continuous uptime)

This system requires high reliability by not only minimizing the frequency of minor defects, errors

and malfunctions that tend to occur on a regular basis, but also by being able to recover quickly

from the failed state and continue operating even when failure occurs.

Observable

phenomena

In order to achieve high system availability by adopting a redundant system, the master system

must be able to switch to the slave system and maintain the continuity of control even when the

master side becomes defective. But in this case of failure, an alarm to notify abnormality went off

immediately after the master system switched to the slave system.

Event that

occurred

internally

When the master system was switched to the slave system, the values of the data that have not

been synchronized were detected as invalid values and triggered the alarm to notify parameter

abnormality.

Causal factors When new functions were added to the system, a data area for managing these new functions was also added. But the data area used for synchronization was not changed at that time. Moreover, the test items did not include a test of the state in which the data area added on the master side was in use. Therefore, during the testing process, there was no way of knowing that the master data and slave data were not synchronized when the master system was switched to the slave system.

Preventive measures

Revise the data area required for data synchronization.

As an additional data synchronization test item, add a test to check that the master data and the slave data are properly synchronized when the master system is using the data area all the way up to and including the added area.

Also include a test of the boundary values of the data range as a test item.

Implementation of the preventive measures described above should make it easier to prevent malfunctions from occurring when the data is transferred in a redundant system.
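One way to catch a data area that was added without extending the synchronization range is to compare the list of synchronized fields with the full state definition. The following Python sketch assumes a hypothetical state record (`ControlState`) and a hypothetical synchronization list (`SYNC_FIELDS`); neither is taken from the original system.

```python
from dataclasses import dataclass, fields


@dataclass
class ControlState:
    # Hypothetical state held by both the master and the slave.
    setpoint: float = 0.0
    alarm_limit: float = 100.0
    trend_window: int = 60  # area added together with a new function


# Fields copied to the slave on each synchronization cycle (assumed list).
SYNC_FIELDS = ["setpoint", "alarm_limit"]


def synchronize(master, slave, sync_fields):
    """Copy only the listed fields from the master to the slave."""
    for name in sync_fields:
        setattr(slave, name, getattr(master, name))


def unsynchronized_fields(state_type, sync_fields):
    """Return state fields that the synchronization list does not cover."""
    return [f.name for f in fields(state_type) if f.name not in sync_fields]


print(unsynchronized_fields(ControlState, SYNC_FIELDS))
```

A test that fails whenever `unsynchronized_fields` returns a non-empty list would flag the omission at the time the new data area is added, rather than at switchover.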


Lesson 10

Lesson title Even when the same hardware is specified as the control target, reconfirm the hardware

specifications if the operation conditions are going to be changed.

Product features

・A product developed from a base product that communicates via wireless LAN for other purposes

・Hardware and software have both been developed from the base product

・The maximum number of handy terminals that can be attributed to the developed product is

more than that of the base product.

Observable phenomena

The product sometimes reboots when the number of handy terminals attributed to this product reaches the maximum.

Event that occurred internally

To make a handy terminal attributable to the product, the key entry of the terminal must be registered in the wireless LAN chip. In this failure, that registration failed.

Causal factors

〔Technical aspect〕

(1) Design: Failure to check the items that affect the key entry registration on the wireless LAN chip datasheet described in the requirements specifications

The management table in the internal memory of the wireless LAN chip is used to manage the key entries of the handy terminals that make them attributable to the product. The key entry method varies with each cipher system. Depending on the sequence of the cipher systems used for attribution, key entry registration may fail when the number of attributed terminals reaches the maximum. The software designers, who had little understanding of the wireless LAN chip specifications, were not aware of this risk.

(2) Design: Failure to select reliable reviewers

The reviewers (who were members of the development team) had little understanding of the wireless LAN chip specifications.

(3) Test: Failure in the combination of conditions

The integration test included a scenario that tested communication between the handy terminals and the product with the maximum number of terminals attributed. However, this scenario did not consider the sequence of the cipher systems used for attribution, so the integration test was incomplete.

Preventive measures

Modified the part of the software that controls the wireless LAN chip so that the software uses the external memory of the chip, instead of the internal memory, as the management table of the wireless LAN chip.



1. Specify the areas in the requirements specifications that will be affected. (Requirements

specifications process)

Add the following check item in the review check sheet: “Check the impact of the difference in

requirements specifications from the hardware specifications of the base product.”

2. Improve the completeness of the tests. (Integration testing process)

Add combination tests to the integration test to check all the different combinations of cipher systems used for attribution when the maximum number of handy terminals is attributed to the targeted product.

3. Select the most appropriate reviewers. (Requirements definition process, integration testing

process)

Invite the software designers of the base product and the designers in charge of the system

tests to the integration test specifications review as participants.


Implementation of the preventive measures described above should make it easier to reduce the possibility of system failure being caused by lack of memory capacity.
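The combination tests called for above could be generated rather than written by hand. The following Python sketch is only an illustration: the cipher system labels and the maximum number of terminals are hypothetical, because the source does not name the cipher systems or the actual limit.

```python
from itertools import permutations

# Hypothetical cipher system labels; the source does not name the ones involved.
CIPHER_SYSTEMS = ["cipher_A", "cipher_B", "cipher_C"]
MAX_TERMINALS = 6  # assumed maximum, for illustration only


def attribution_plans(ciphers, max_terminals):
    """Yield one attribution plan per ordering of the cipher systems.

    Each plan lists the cipher used by terminal 1..max_terminals, cycling
    through the ciphers in the given order so that every plan reaches the
    maximum number of attributed terminals.
    """
    for order in permutations(ciphers):
        yield [order[i % len(order)] for i in range(max_terminals)]


for plan in attribution_plans(CIPHER_SYSTEMS, MAX_TERMINALS):
    print(plan)
```

Each generated plan would then drive one integration test run in which terminals are attributed in the listed order until the maximum is reached.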


Lesson 11

Lesson title When sharing data between processes or threads, keep a close eye on whether the exclusive control /

synchronous processing is being carried out correctly or whether deadlock is occurring.

Product features

A production management system that uses Windows server to process information received from an

on-site process controller through a communication network to control the routes taken by the

conveyor at diverging points as well as to collect performance data

Observable phenomena

The process controller and the server stopped communicating. As a result, the line in the plant had to

be stopped.

Event that occurred internally

① The transmission buffer entered an invalid state when deletion by the data transmission thread and registration by the data request thread overlapped in time.

② The data transmission thread tried to read data from the transmission buffer while the buffer was in the invalid state. As a result, error handling was triggered, and the connection to the process controller was repeatedly established and dropped.

Causal factors

An ArrayList was used as the transmission buffer for passing data between multiple threads in the process. Because ArrayList is not thread-safe, exclusive control was required. However, the engineers in charge of coding did not know that ArrayList is not thread-safe.

Preventive measures

1) Added exclusive control so that the ArrayList can be used safely by multiple threads.

2) Reviewed similar code for thread-safety problems.

[Figure: Sequence diagram showing the data request thread, the transmission buffer, the data transmission thread and the process controller. Labeled steps: create request data; register in transmission buffer; check whether there is any transmission data (number of data); connect to process controller; connection completed (ACK) received; is ACK normal? (normal / abnormal); readout of data in transmission buffer; data transmission; response reception; delete from transmission buffer; transmission buffer in invalid state; exception handling; communication with process controller disconnected.]



Added a check item on whether code that runs in multiple threads is thread-safe to the program check sheet and the coding conventions.

Implementation of the preventive measures described above should make it easier during code review to detect the potential problems that may arise when data shared between multiple threads is not under exclusive control.
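The exclusive control described in this lesson can be sketched as a buffer whose every access goes through one lock. The source does not state the implementation language, so the sketch below uses Python, with a plain list standing in for the ArrayList; the class and method names are hypothetical.

```python
import threading


class TransmissionBuffer:
    """A list guarded by a lock so that registration and removal cannot interleave."""

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def register(self, item):
        """Called by the data request thread to queue data for transmission."""
        with self._lock:
            self._items.append(item)

    def take_next(self):
        """Check for data and remove the oldest item as one atomic step.

        Returns None when the buffer is empty.
        """
        with self._lock:
            if not self._items:
                return None
            return self._items.pop(0)


buffer = TransmissionBuffer()
buffer.register("request-1")
print(buffer.take_next())
```

The point of `take_next` is that the "is there any data?" check and the removal happen under the same lock, so another thread cannot change the buffer between the two steps.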


Lesson 12

Lesson title The instrument used for testing products that require yield management to determine whether

they are defective or non-defective products should be suspected to be abnormal when it outputs

test results indicating that all the tested products are either defective or non-defective.

Product features

Instrument used to test semiconductors to check if their functions and performance meet their

specifications or not

The semiconductor testing process consists of multiple tests. Products that pass all of these tests are determined to be non-defective, whereas products that fail any of them are determined to be defective. Normally, once a product is found to be defective, the remaining tests in the testing process are skipped to save time and streamline the entire process.

Observable phenomena

In a semiconductor testing process, a certain proportion of the tested products are normally

determined to be defective.

However, there was a case where all the products passed the tests and were determined to be

non-defective.

When all the tested products are determined to be non-defective, there is a need to consider the

testing process itself to be abnormal, just like when all the tested products are determined to be

defective.

Event that occurred internally

During troubleshooting, the person in charge of investigating defective products applied a mask so that the products would not be detected as defective. The mask should have been released after the investigation was completed, but it was not, and mass production began with the mask still in place. As a result, all the products were determined to be non-defective in the testing process. In addition, no warning had been set to notify that the testing process may be abnormal when all the tested products are determined to be either defective or non-defective.

Causal factors

No warning had been set to notify that the testing process itself may be abnormal when all the tested products are determined to be non-defective.

When all the tested products were determined to be defective, it was intuitive to suspect that the testing process was abnormal. However, it did not occur to anyone that the testing process might also be abnormal when all the products were determined to be non-defective.

Preventive measures


Modified the program by adding a warning to notify that the testing process may be abnormal not

only when all the tested products are determined to be defective but also when they are all

determined to be non-defective.



・Clarify the specifications on the criteria for determining whether the tested products are

defective or non-defective.

・In addition to clarifying the items for verifying the testing instrument, automate the verification

process as much as possible.


Implementation of the preventive measures described above should make it easier to reduce problems caused by human error of the maintenance personnel during their work.
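The warning added to the program can be sketched as a check on each lot's results that treats both "all defective" and "all non-defective" as suspicious. This Python sketch is illustrative only; the function name and the message texts are assumptions.

```python
def check_lot_results(results):
    """Return a warning message when a lot's results look implausible.

    `results` is a list of booleans, one per tested product
    (True = non-defective, False = defective). Returns None when
    the lot contains a mix of both outcomes.
    """
    if not results:
        return "no results: the testing process may be abnormal"
    passed = sum(results)
    if passed == len(results):
        return "all products non-defective: the testing process may be abnormal"
    if passed == 0:
        return "all products defective: the testing process may be abnormal"
    return None


print(check_lot_results([True, True, True]))
```

In practice a minimum lot size would probably be needed before raising the warning, since a very small lot can legitimately contain no defective products.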


Lesson 13

Lesson title When improving the performance of existing software, check when the idling time occurs, when

the process goes out of sync, and how they affect the software.

Product features

An inspection system used during the manufacturing process to inspect electronic products being manufactured

This system is composed of a PC and various embedded devices (signal generator, voltage/current measuring instrument, current measuring instrument, communication device, camera, etc.) and requires the ability to check the precision of measurements and control flow on the order of milliseconds (ms).

An upgraded version of the current system is being developed with additional features.

Observable phenomena

When the inspection system was inspecting the electronic products, it output “no data” even though the electronic products were transmitting valid communications data. As a result, the line had to be stopped for some time.

Event that occurred internally

The PC of the inspection system cyclically retrieves the communications data received within a fixed period via the buffer of the communication device and stores the data in log files. These log files are analyzed closely after the inspection has been executed.

There are times when the electronic products do not communicate at all within the fixed period, even when they are operating normally. In this failure, the PC received no communications data within the fixed period, generated an empty data record and stored it in the log file as valid data with a time stamp of “0s”. As a result, the normal 2-minute search could not be performed and no valid data could be detected.

Causal factors

As part of developing the upgraded version of the current inspection system, the processing speed was also enhanced. However, this enhancement caused the performance to degrade due to the following factors. Moreover, its adverse effects could not be detected in the system tests.

〔Technical aspect〕

■Lack of test items for testing the system when communications data do not exist

〔Personal / organizational aspects〕

■Design intentions of the current system had not been documented for future reference.

■Change management lacked scrutiny.

■Too dependent on the individual engineers in charge of modification (due to the judgment that the scale of change was small)

Preventive measures

Modify the software to restore the check, which existed in the original version, for whether there is any communications data when retrieving data via the buffer of the communication device, so that the PC of the inspection system does not store empty data in the log file when there is no communications data to retrieve.


■Make sure all the modifications are entered in the change management list without any

omission (software design modification process)

■Describe the design intentions in the design specifications (software design process)

■Be sure to “check the modifications” and “check the extent of the impact of the modifications”

when reviewing the design and implementation. (review process)

■Improve the completeness of the tests (testing process)

・Add the following perspectives in the test items:

⇒Check the behavior when data does not exist;

⇒Intentionally insert a testing period when no communications data exists.

■Improve the reviewers’ skill level (create reviewer’s skill map, select reliable reviewers)

Implementation of the preventive measures described above should make it easier to prevent the

lack of consideration of data abnormalities in communication systems.
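The restored check can be sketched as a polling step that writes a log record only when data was actually received. The function and parameter names in this Python sketch are hypothetical and are not taken from the original software.

```python
def poll_once(read_buffer, log, now):
    """Retrieve the data for one polling period and log it only if data exists.

    `read_buffer` is a callable returning the bytes received in the period,
    `log` is a list standing in for the log file, and `now` is the time stamp.
    Returns True when a record was written.
    """
    data = read_buffer()
    if not data:  # nothing received in this period: write no record
        return False
    log.append((now, data))
    return True


log = []
poll_once(lambda: b"", log, 1.0)       # silent period, nothing is logged
poll_once(lambda: b"\x01", log, 2.0)   # one record with its real time stamp
print(log)
```

A test item for "a period with no communications data" then reduces to calling the polling step with an empty buffer and confirming that the log is unchanged.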


Lesson 14

Lesson title When handling a large volume of data via a communication device, be careful not to create any bottlenecks in the sequential processing flow. Also take into consideration load fluctuations across different times of day.

Product features

A system composed of mobile terminals for business use carried by multiple sales reps, a database server that provides remote data transmission / reception services, and client terminals (in head / branch offices) that receive the information sent from the sales reps and support the sales activities

This system is required to operate continuously and provide data transmission / reception services

without delay.

Observable phenomena

During a time period when many users were communicating through this system, the response time of the server increased significantly. As a result, the transmission and reception of sales information and sales support information took a very long time to complete, causing the services to stop intermittently.

Event that occurred internally

・Because the processing of data for analysis became a high-load operation in the server process, a large volume of data transmitted from the mobile terminals for business use was temporarily retained in the memory (of the communication buffer).

・The program for accessing the server contained code that was not optimized. This code became a bottleneck that slowed down processing. (The processing took unnecessarily long because the character strings of the data were first converted into numerals before comparison, instead of being compared directly.)

Causal factors

・The number of users increased during a particular time period, and an unexpectedly large volume of data was transmitted to the server.

・The specifications did not describe the conditions for accessing the server in a way that took account of the potential risk of a bottleneck.

・People who were knowledgeable about the server did not participate in the implementation reviews.

・The integration load test covering the entire system was not conducted sufficiently.

⇒ A test environment close to the real environment could not be prepared in time.

Preventive measures

・Modify the program so that the character strings of the data are compared directly. (Program optimization)

・Measure the processing time to access the server in a test environment that is close to the real environment and confirm that the processing speed has improved.



・Describe in the specifications the conditions for accessing the server, taking account of the potential risk of a bottleneck.

⇒ Perform programming that strictly adheres to the specifications.

・Build a workflow (make it a rule) to always call on people who are knowledgeable about servers

to participate in reviews.

・Make quick arrangements to prepare a test environment that simulates the real environment.

Implementation of the preventive measures described above should make it easier to secure the

required level of performance or improve it further with higher certainty through a

performance-based design approach.
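The program optimization in this lesson, comparing character strings directly instead of converting them to numerals first, can be illustrated as follows. The record layout and the field name `code` in this Python sketch are assumptions made for the example.

```python
def find_by_numeric_conversion(records, key):
    """Pattern described in the lesson: convert both sides to numerals, then compare."""
    return [r for r in records if int(r["code"]) == int(key)]


def find_by_direct_comparison(records, key):
    """Optimized pattern: compare the stored character strings directly."""
    return [r for r in records if r["code"] == key]


# Hypothetical records whose codes are stored as fixed-width strings.
records = [{"code": "0012"}, {"code": "0345"}, {"code": "1200"}]
print(find_by_direct_comparison(records, "0345"))
```

The two functions return the same result only when the stored strings use one canonical form, such as a fixed width with leading zeros. If "345" and "0345" must be treated as equal, the direct comparison is not a drop-in replacement, and that condition would belong in the specifications.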


Lesson 15

Lesson title Be prepared with a recovery plan to support all the foreseeable abnormal system operations

(reset, power shutdown, state of being left uncontrolled, etc.) that may occur during sequential

execution of business systems that are used by the customers after delivery.

Product features

A system used for supporting the information desk response type services at the customer’s store

This system is used to receive the data necessary for processing the daily business operations and to automatically execute a batch process that sends the aggregated data on the day's business operations to the central server at a specified time.

Observable phenomena

One day, the person in charge of information desk response duties finished the day's work and left the store after switching off the power of this business system. When this person tried to start the business system the next morning, the system did not start normally. As a result, the customer could not use the business system to process the daily store operations that day.

Event that occurred internally

The business system did not operate normally because the batch transmission of the aggregated data on daily business operations to the central server had been left in an incomplete state since the day before.

Causal factors

The investigation found that the direct cause was that the batch process was forcibly terminated while the system was sending the aggregated data on daily business operations processed the day before. Because of this forced termination, the data transmission process was left in an incomplete state. In such a case, the system should have executed a recovery process (such as sending the remaining untransmitted data to complete the transmission) when it was started the next morning, even if the data reception or transmission job had been forced to terminate before completion. However, this situation was not considered as a possible scenario during system design, so the recovery function was not implemented.

Preventive measures

Implement a recovery program that, once the system is restarted, sends the remaining data that had not yet been transmitted because the data exchange with the central server was forcibly terminated.

Include the process to recover from forced termination as one of the check items to examine when reviewing the completeness of the requirements definition.

Implementation of the preventive measures described above should make it easier to improve the response capability against incidents such as forced termination of the system caused by the operator’s human error while the system is executing the business services.
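One common way to build such a recovery function is to journal transmission progress so that a restart can resume from the first unsent record. The Python sketch below assumes a hypothetical `send` callable and a journal file path; it illustrates the idea only and is not the recovery program actually implemented.

```python
import json
import os


def send_batch(records, send, journal_path):
    """Send records in order, journaling progress so that a restart can resume.

    If a journal file from an interrupted run exists, transmission resumes
    from the index recorded in it instead of starting over.
    """
    start = 0
    if os.path.exists(journal_path):  # an earlier run was interrupted
        with open(journal_path) as f:
            start = json.load(f)["next_index"]
    for index in range(start, len(records)):
        send(records[index])
        with open(journal_path, "w") as f:  # record progress after each send
            json.dump({"next_index": index + 1}, f)
    if os.path.exists(journal_path):
        os.remove(journal_path)  # transmission complete
```

Calling `send_batch` again at startup with the same records and journal path completes an interrupted transmission. Because progress is written after each send, a record could be sent twice if power is lost between the send and the journal update, so the receiving side would need to tolerate duplicates.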


Lesson 16

Lesson title Prepare a specifications document and conduct an impact evaluation even for the processing of maintenance log data that is used for failure analysis.

Product features

A system designed for quality management at each production process in a manufacturing plant

This system is used in a plant that manufactures finished products through numerous production processes. These production processes include not only processing treatment but also treatment with chemical substances that are harmful to the human body. To manage the different jobs carried out in each process, traceable log data are taken and saved in the system, then aggregated by lot and sent to the server for use in the subsequent processes.

Observable phenomena

On one occasion, the processing of the log data taken in a specific production process ended abnormally during aggregation. Because of this abnormal termination, the line stopped at that production process.

Event that occurred internally

Part of the log data, which recorded the latest information about the products being manufactured and needed to be sent to the next process, had been lost.

Causal factors

In the production process where log data processing ended abnormally, the log data was always taken and saved after confirming the state of the products upon completion of the final step of the process, because the quality of products treated with chemical substances was often inconsistent, depending on the outcome of the chemical treatment. As a rule, therefore, the log data taken in this process was saved in two separate log files so that the data would not be mixed in one file: one for products confirmed to be compliant (those treated successfully in the chemical treatment process) and one for products detected to be non-compliant (those that failed in the chemical treatment process).

The investigation found that the processing of log data taken in this process ended abnormally because the two log files (for compliant and non-compliant products respectively), which were supposed to be output by separate modules, were for some reason written by a single shared module. It also became clear that the log output function for non-compliant products was considered to be of use only for failure analysis and was not described in the specifications document.

Preventive measures


Modify the program to make the system use two separate modules to write the log data on normal

production (of compliant products) and abnormal production (of non-compliant products)

respectively in two different log files.



As part of the software design process, prepare a specifications document that describes the procedure for processing maintenance log data used in failure analysis, after evaluating the impact of that processing without omission, so that log data on normal production (of compliant products) and abnormal production (of non-compliant products) are not mixed in the same log file. Also conduct review tests thoroughly so that non-compliant products, which should be rejected from the line at the point of detection, are not fed to subsequent production processes, developed into finished products and shipped to the market as commercial products.

Implementation of the preventive measures described above should make it easier to improve the

certainty of quality assurance required by the manufacturer to reinforce its maintenance functions.
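The separation of the two log outputs can be sketched with Python's standard `logging` module, giving each product category its own logger and its own file. The logger names, file names and message format below are hypothetical.

```python
import logging


def make_lot_logger(name, path):
    """Create a logger that writes only to its own file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of any shared root handlers
    logger.addHandler(logging.FileHandler(path))
    return logger


def record_result(lot_id, compliant, compliant_log, noncompliant_log):
    """Route one result to the log that matches its compliance state."""
    target = compliant_log if compliant else noncompliant_log
    target.info("lot=%s compliant=%s", lot_id, compliant)
```

Keeping the routing decision in one small function makes it something that can be described in the specifications document and covered by a test, instead of being treated as a side feature for failure analysis.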


Lesson 17

Lesson title Do not only extract the conditions required to perform the determination process, but also identify

all the conditions that should be processed as limitations.

Product features

An entry / exit gate control system used in an establishment with multiple important facilities

constructed within its premises

This system is designed to allow the employees of the establishment to access the building

facilities within the premises by associating the identification (ID) information on the electronic

gate pass provided to each employee with the information of each building they enter. Within the

premises of this establishment are a number of important building facilities that are physically

connected but are separated with an access gate built in between each compartmentalized facility.

This system monitors and controls the entry / exit to and from each of these compartments based

on various sets of access rules, identification data on who can have access to which zone for how

long (period when the access permit is valid) as well as restrictions in specific facilities that limit

the number of times one can pass the gate on the same day.

Observable phenomena

One day, an employee tried to enter facility A but could not, because an alarm went off to notify that an abnormal situation had arisen. This employee was not the only one affected: for a certain period of time, none of the employees of facility A could get in or out of the building. As a result, the business activities that these employees were supposed to handle in this facility were seriously affected that day.

Event that occurred internally

What actually occurred was that this employee tried to enter facility A after entering and exiting

facilities X and Y several times and had already by then reached the upper limit of the number of

times this employee was allowed to enter or exit the restricted facilities that particular day. This

was why the system had processed this employee’s attempt to enter facility A as an abnormal

event, determining it to be an unacceptable entry that was over the limit.

Causal factors

The system was programmed to check the upper limit of the number of times each employee was allowed to enter or exit a facility, and it normally activated this check when an employee tried to enter a facility after passing through another facility. However, this check had not been written in the program for facility Y. As a result, an abnormal end occurred when the system was processing the employee's attempt to enter a facility from another facility, because it could not recognize facility Y in the determination condition processing routine.

Preventive measures

Check whether the program is missing any code for checking the number of times each employee is allowed to enter or exit other facilities such as X and Y, or any code for checking other conditions, and implement the missing parts of the program.



Review the business rules on entering / exiting the facilities and revise the manual according to the

result of the review.

Detect other similar errors through the following methods:

1. Decompose the conditions into attributes like ‘facility’, ‘limit in the number of times of

entry / exit’, and ‘types of gate pass’;

2. Check whether each of these attributes is a condition that should be processed as a

limitation or not.

Implementation of the preventive measures described above should make it easier to prevent any

processing from being left out by being able to visually check the relation between conditions and

attributes.
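Decomposing the conditions into attributes makes it possible to check mechanically that every combination has a defined rule. In the Python sketch below, the facility names follow the example in this lesson, while the gate pass types and the limit values are hypothetical.

```python
FACILITIES = ["A", "X", "Y"]
PASS_TYPES = ["employee", "visitor"]  # hypothetical types of gate pass

# (facility, pass type) -> daily entry / exit limit; None = explicitly unlimited.
DAILY_LIMITS = {
    ("A", "employee"): None,
    ("A", "visitor"): 2,
    ("X", "employee"): 5,
    ("X", "visitor"): 2,
    ("Y", "employee"): 5,
    # ("Y", "visitor") is left out on purpose to show what the check reports.
}


def undefined_conditions(facilities, pass_types, limits):
    """Return attribute combinations that have no rule defined."""
    return [(f, p) for f in facilities for p in pass_types if (f, p) not in limits]


def may_pass(facility, pass_type, count_today, limits):
    """Return False for an unknown combination instead of ending abnormally."""
    if (facility, pass_type) not in limits:
        return False
    limit = limits[(facility, pass_type)]
    return limit is None or count_today < limit


print(undefined_conditions(FACILITIES, PASS_TYPES, DAILY_LIMITS))
```

Denying entry for an undefined combination is one possible design choice; the essential point is that an unrecognized facility produces a defined result for one person rather than an abnormal end that blocks everyone.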


Lesson 18

Lesson title Be careful of the fragmentation of log files.

Product features

A production management system that uses Windows server to process information received from

an on-site process controller through a communication network to control the routes taken by the

conveyor at diverging points as well as to collect performance data

Observable phenomena

A problem occurred in the system. For troubleshooting, the log file was copied in order to analyze what had caused the problem. The line in the plant then stopped because the processing of control data became slow and delayed.

Event that occurred internally

The processing was slow because the log file being copied was fragmented on the disk. Because of this, the disk input/output load increased sharply while the log file was being copied. As a result, the online process that writes the log data of the ongoing performance slowed down and delayed the data processing.

Causal factors

① On each day of plant operation, 30 variable-length log files covering multiple production processes, ranging in size from around 1 MB to 300 MB, are created. These log files are deleted automatically after being saved for 30 days.

② Because the data is variable-length, the log files are extended automatically and become fragmented as they grow.

③ Fragmented files were deleted automatically after being saved for 30 days, but new log files kept being created in the fragmented free space. As a result, fragmentation accelerated.

Preventive measures

1) Performed defragmentation on a day when the plant was not operating.


1) Avoided the daily fragmentation of log files by moving them to different partitions separated by

day.

2) Reduced the number of log files by deleting unnecessary log data.

Implementation of the preventive measures described above should make it easier to reduce

problems arising from processing variable-length data for record-keeping purpose.
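The separation of log files by day and the 30-day retention can be sketched as two small helpers. In this Python sketch, a per-day directory stands in for the per-day partition described above (each directory could be a mount point for a separate partition); the root path, the process name and the exact retention boundary are assumptions.

```python
from datetime import date, timedelta
from pathlib import Path

RETENTION_DAYS = 30  # retention period stated in this lesson


def daily_log_path(root, process_name, day):
    """Place each day's log files under their own per-day directory."""
    return Path(root) / day.isoformat() / f"{process_name}.log"


def expired_days(existing_days, today, retention_days=RETENTION_DAYS):
    """Return the days whose logs are older than the retention period."""
    cutoff = today - timedelta(days=retention_days)
    return [d for d in existing_days if d < cutoff]


print(daily_log_path("logs", "process_1", date(2014, 6, 20)))
```

Removing a whole day's directory or partition at once frees contiguous space, which is the property that helps keep newly created log files from becoming fragmented.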
