Self-Learning and Adaptive Functional Fault Diagnosis: A Look at What is Possible with “Data”


Self-Learning and Adaptive Functional Fault Diagnosis: A Look at What is Possible with “Data”

Krishnendu Chakrabarty

Department of Electrical & Computer Engineering, Duke University

Durham, NC

2

Acknowledgments

• Fangming Ye, Duke University
• Zhaobo Zhang, Huawei
• Xinli Gu, Huawei
• Sponsors: Cisco (until 2011), Huawei (since 2011)

3

ITC 2014 Tutorial

• Tutorial 9: TEST, DIAGNOSIS, AND ROOT-CAUSE IDENTIFICATION OF FAILURES FOR BOARDS AND SYSTEMS
• Presenters: KRISHNENDU CHAKRABARTY, WILLIAM EKLOW, ZOE CONROY

• The gap between working silicon and a working board/system is becoming more significant as technology scales and complexity grows. The result of this increasing gap is failures at the system level that cannot be duplicated at the component level. These failures are most often referred to as “NTFs” (No Trouble Found). The problem will only get worse as technology scales and will be compounded as new packaging techniques (SiP, SoC, 3D) extend and expand Moore’s law. This tutorial will provide detailed background on NTFs and will present DFT, test, and root-cause identification solutions at the board/system level.

4

Automated Diagnosis: Wishlist

• Automated and accurate diagnosis
• Reduced diagnosis and repair costs
• Identify the most-effective syndromes
• Accelerate product release
• Self-learning

[Diagram: a failed board goes through functional test into the automated diagnosis system, which produces a fixed board; the system reports ambiguity, develops new tests, and analyzes and updates the current test set]

6

What Data Should We Collect?

• Pass/fail information, extent of mismatch with expected values, performance marginalities, etc.
• Counter values, BIST signatures, sensor data
  – Example: a segment of a traffic-test log

## Summary: Interfaces< r2d2 -- metro > counts - Fail(mismatch)

……464. (00000247) ERR EG R2D2_ARIC_CP_DBUS_CRC_ERR ……

Error: (0000010A) DIAGERR_ERRISO_INVALID_PKT_CNT: Packet count invalid
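Syndromes like these can be extracted from raw logs mechanically. A minimal sketch, assuming syndromes appear as underscore-separated ALL-CAPS mnemonics (the real log format is only hinted at in the abbreviated segment above):

```python
import re

def extract_syndromes(log_text):
    """Parse a traffic-test log segment into a set of syndrome names.

    Assumption: syndromes are ALL-CAPS identifiers with at least one
    underscore, e.g. R2D2_ARIC_CP_DBUS_CRC_ERR; short tokens such as
    ERR or EG are severity/direction markers, not syndromes.
    """
    pattern = re.compile(r"\b[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+\b")
    return set(pattern.findall(log_text))

log = (
    "464. (00000247) ERR EG R2D2_ARIC_CP_DBUS_CRC_ERR\n"
    "Error: (0000010A) DIAGERR_ERRISO_INVALID_PKT_CNT: Packet count invalid\n"
)
syndromes = extract_syndromes(log)
```

Each extracted name then becomes one binary feature (observed / not observed) for the learning models discussed next.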

7

What Can We Do With This Data?

• Train machine-learning models for root-cause localization
• Identify redundant syndromes (i.e., test-outcome data) and do data pruning
• Identify deficiencies of tests in terms of diagnostic ambiguity and provide guidance for test redesign
• “Fill in the gaps” when the data is not sufficient for precise diagnosis

8

Key Challenges and Solutions

• How to improve diagnosis accuracy?
  – Support-vector machines and incremental learning [Ye et al., TCAD’14]
  – Multiple classifiers and weighted-majority voting [Ye et al., TCAD’13]
• How to speed up diagnosis?
  – Decision trees [Ye et al., ATS’12]
• How to do diagnosis using incomplete information?
  – Imputation methods [Ye et al., ATS’13]
• What can we learn from past diagnosis?
  – Root-cause and syndrome analysis [Ye et al., ETS’13]
  – Knowledge discovery and knowledge transfer [Ye et al., ITC’14]

9

Syndrome and Root-Cause

• A segment of traffic-test log

• Syndromes are test outcomes parsed from the log
• Root causes are replaced components, e.g., C37

## Summary: Interfaces< r2d2 -- metro > counts - Fail(mismatch)

……464. (00000247) ERR EG R2D2_ARIC_CP_DBUS_CRC_ERR ……

Error: (0000010A) DIAGERR_ERRISO_INVALID_PKT_CNT: Packet count invalid

         s1  s2  s3   Root cause
Case 1    1   1   1   A
Case 2    1   1   1   A
Case 3    1   1   1   B
Case 4    0   1   1   B

10

Success Ratio

• Success ratio = ¾ = 75%

         s1  s2  s3   Actual root cause   Predicted root cause
Case 1    1   1   1   A                   A
Case 2    1   1   1   A                   A
Case 3    1   1   1   B                   A
Case 4    0   1   1   B                   B
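The success ratio above can be computed directly from the actual and predicted columns; a minimal sketch:

```python
def success_ratio(actual, predicted):
    """Fraction of boards whose predicted root cause matches the repair record."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)

# Cases 1-4 from the table: Case 3 shares its syndromes with Cases 1-2,
# so it is misdiagnosed as A.
actual    = ["A", "A", "B", "B"]
predicted = ["A", "A", "A", "B"]
ratio = success_ratio(actual, predicted)  # 3/4 = 0.75
```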

Using Machine Learning (Support-Vector Machines)

– A binary classifier based on statistical learning
– Train the model with data
– Use the trained model for diagnosis

11

[Figure: (a) Line 1, a feasible separating line with a small margin; (b) Line 2, the optimal separating line that maximizes the margin]

Example

• Suppose we have 6 cases (successfully debugged boards) for training
• Let x1, x2, x3 be three syndromes. If a syndrome manifests itself, we record it as 1, and 0 otherwise.
• Suppose the board has two candidate root causes A and B, and we encode them as y = 1 and y = −1
• Merge the syndromes and the known root causes into a matrix A = [B|C], where the left side (B) holds the syndromes and the right side (C) holds the corresponding fault classes

Example (Contd.)

• SVM calculation: w1 = 1.99, w2 = 0, w3 = 0, and b = −1.00
• Therefore, the classifier is f(x) = 1.99x1 − 1.00

[Figure: the separating plane in (x1, x2, x3) space, with the y = 1 and y = −1 regions and the distance between the classifier and the support vectors]

• Given a new failing system with syndrome [1 0 0], then f(x) > 0. The root cause for this failing system is A.

• Given another new failing system with syndrome [0 1 0], f(x) < 0. The root cause for this failing system is B.
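The resulting classifier is a plain dot product plus a bias; a sketch using the weights computed in the example:

```python
# Trained classifier from the worked example: f(x) = w . x + b with
# w = (1.99, 0, 0) and b = -1.00; sign(f) maps to root cause A (y = 1)
# or root cause B (y = -1).
W = (1.99, 0.0, 0.0)
B = -1.00

def classify(x):
    f = sum(wi * xi for wi, xi in zip(W, x)) + B
    return "A" if f > 0 else "B"

root1 = classify([1, 0, 0])  # f = 0.99 > 0  -> A
root2 = classify([0, 1, 0])  # f = -1.00 < 0 -> B
```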

Results for Different Kernels Circa 2011 (Manufactured Boards)

• 811 boards for training, 212 for test

• Diagnostic accuracy under different kernel functions

14

[Bar chart: diagnostic accuracy (y-axis 40%–100%) under the polynomial (degree = 1), Gaussian, and exponential kernels; the values shown are 92.23%, 72.64%, 77.83%, 81.13%, 94.20%, 61.79%, 72.64%, 75.47%, 93.96%, 57.08%, 67.45%, and 69.81%]

Incremental SVMs

[Diagram: support vectors extracted from the initial training set are combined with new training cases to form the combined training set for incremental learning]

Flow Chart

1. Preparation stage: extract fault syndromes and repair actions from new failing boards as an additional training set S
2. Learning stage: formulate the optimization problem for the combined training set S ∪ S* (new data plus the existing support vectors S*), then solve and update the SVM model
3. More new training data? If yes, repeat from the preparation stage; if no, the final SVM model is ready
4. Diagnosis stage for new systems: determine the root cause based on the output of the final SVM model
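The flow above can be sketched structurally. The real method trains an SVM incrementally [Ye et al., TCAD’14]; since no solver is shown here, a nearest-centroid classifier stands in for the model, and the points of each class nearest the opposing class stand in for support vectors. Both stand-ins are illustrative assumptions, not the talk’s method:

```python
def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    return [sum(c) / len(points) for c in zip(*points)]

def train(cases, keep=2):
    """'Train' on (feature_vector, label) cases and return (model, support).

    Support cases are approximated as the `keep` points of each class
    nearest the opposing class centroid -- the boundary-defining points
    a real SVM would retain.
    """
    by_label = {}
    for x, y in cases:
        by_label.setdefault(y, []).append(x)
    centroids = {y: centroid(xs) for y, xs in by_label.items()}
    support = []
    for y, xs in by_label.items():
        other = [c for l, c in centroids.items() if l != y]
        xs_sorted = sorted(xs, key=lambda x: min(dist2(x, c) for c in other))
        support += [(x, y) for x in xs_sorted[:keep]]
    return centroids, support

def predict(model, x):
    return min(model, key=lambda y: dist2(x, model[y]))

# Incremental flow: S* starts empty; each batch S is trained with S U S*,
# and only the support cases carry over to the next round.
support = []
batches = [
    [([1, 1, 0], "A"), ([1, 0, 0], "A"), ([0, 1, 1], "B")],
    [([0, 0, 1], "B"), ([1, 1, 1], "A")],
]
for batch in batches:
    model, support = train(support + batch)
```

The point of the structure is the retraining cost: each round touches only the retained support set plus the new batch, never the full history.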

Comparison of Training Time

[Plot: SVM model training time (seconds, 0–25) vs. number of training cases (400–1400), for non-incremental and incremental learning]

Comparison of Success Rates

[Plot: SVM model success rate (30%–100%) vs. number of training cases (400–1400), for non-incremental and incremental learning]

19

Typical Diagnosis Systems

• Number of syndromes (1,000 per board)
• Diagnosis time (up to several hours per board)
• Often requires manual diagnosis

[Diagram: how to select the useful syndromes from the complete set of syndromes?]

Comparison of Diagnosis Procedures

Traditional diagnosis: start diagnosis → observe ALL syndromes → predicted root cause
Decision tree-based diagnosis: start diagnosis → observe ONE syndrome → confident about the root cause? If yes, predict the root cause; if no, observe another syndrome

20

21

Decision Trees

• Internal nodes
  – Can branch to two child nodes
  – Represent syndromes
• Terminal nodes
  – Do not branch
  – Contain class information

[Diagram: a tree with internal nodes S1–S4, each branching yes/no into terminal nodes A2, A3, A4, and A1 (A1 appears at two leaves)]

22

Decision Trees

• We may reach root cause A1 in two different test sequences.
  1) Start from the most discriminative syndrome S1
  2) If S1 manifests itself, we then consider syndrome S2
  3) If S2 manifests itself, we can determine A1 to be the root cause

[Diagram: the path S1 –yes→ S2 –yes→ A1 highlighted in the tree]

23

Decision Trees

• We may reach root cause A1 in two different test sequences
  1) Start from the most discriminative syndrome S1
  2) If S1 passes, we then consider syndrome S3
  3) If S3 manifests itself, we then consider syndrome S4
  4) If S4 manifests itself, we can determine A1 to be the root cause

[Diagram: the path S1 –no→ S3 –yes→ S4 –yes→ A1 highlighted in the tree]
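The two traversals above can be traced on a small encoding of the tree. The yes-paths to A1 match the slides; which of A2–A4 sits on each no-branch is not shown, so that assignment is an assumption here:

```python
# Decision tree encoded as nested tuples:
# (syndrome, subtree_if_manifests, subtree_if_passes);
# bare strings are terminal root causes.
TREE = ("S1",
        ("S2", "A1", "A2"),      # S1 yes -> S2; S2 yes -> A1
        ("S3",
         ("S4", "A1", "A4"),     # S1 no -> S3 -> S4; S4 yes -> A1
         "A3"))

def diagnose(tree, syndromes):
    """Adaptively observe one syndrome per step until a leaf is reached."""
    while isinstance(tree, tuple):
        name, if_yes, if_no = tree
        tree = if_yes if syndromes[name] else if_no
    return tree

# Path 1: S1 and S2 manifest -> A1 after only two observations.
path1 = diagnose(TREE, {"S1": 1, "S2": 1, "S3": 0, "S4": 0})
# Path 2: S1 passes, S3 and S4 manifest -> A1.
path2 = diagnose(TREE, {"S1": 0, "S2": 0, "S3": 1, "S4": 1})
```

This is the source of the speedup over traditional diagnosis: only the syndromes along one root-to-leaf path are ever observed.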

24

Training of Decision Trees (Syndrome Identification)

• Goals:
  – Rank syndromes
  – Minimize ambiguity
  – Reduce tree depth
• Three popular criteria can be used for training decision trees:
  – Information gain
  – Gini index
  – Twoing

[Diagram: a candidate split partitioning Class 1 and Class 2 samples]
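As an example of one splitting criterion, the Gini index can rank the three syndromes of the earlier four-case table. This is the standard Gini formulation, not necessarily the exact variant used in the talk:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum over classes of p_c^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_split(cases, syndrome):
    """Weighted Gini impurity after splitting on one syndrome.

    cases: list of (syndrome_dict, root_cause). Lower is better, so
    training picks the syndrome that minimizes this value.
    """
    groups = {0: [], 1: []}
    for syn, rc in cases:
        groups[syn[syndrome]].append(rc)
    n = sum(len(g) for g in groups.values())
    return sum(len(g) / n * gini(g) for g in groups.values() if g)

# The four cases from the earlier syndrome/root-cause table.
cases = [
    ({"s1": 1, "s2": 1, "s3": 1}, "A"),
    ({"s1": 1, "s2": 1, "s3": 1}, "A"),
    ({"s1": 1, "s2": 1, "s3": 1}, "B"),
    ({"s1": 0, "s2": 1, "s3": 1}, "B"),
]
best = min(["s1", "s2", "s3"], key=lambda s: gini_of_split(cases, s))
```

Here s2 and s3 never vary, so splitting on them leaves the impurity at 0.5, while s1 isolates one B case and is ranked first.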

25

Diagnosis Using Decision Trees

1. Training data preparation: extract all fault syndromes and repair actions from historical data
2. DT architecture design: design inputs, outputs, splitting criterion, pruning
3. DT training: generate a tree-based predictive model and assess its performance
4. DT-based diagnosis: traverse from the root node of the DT and obtain the root cause at a leaf node

Diagnosis Using Decision Trees

26

1. Start diagnosis: observe the syndrome at the root of the DT
2. Adaptive diagnosis: select and observe the next syndrome based on the observation of the current syndrome
3. Repeat until a leaf node is reached
4. Predict root cause: generate the root cause for the failing board

Experiments

• Experiments performed on industrial boards currently in production
  – Tens of ASICs, hundreds of passive components
• All boards under analysis failed traffic test
  – A comprehensive functional test set for fault isolation, run through all components

27

                                  Board 1   Board 2   Board 3
Number of test items              420       207       420
Number of root-cause components   10        14        10
Number of failed boards           130       40        1000

28

Comparison Of Different Decision-Tree Architectures

[Bar chart: total number of syndromes used for diagnosis (0–450) for DT (Gini index), DT (info. gain), DT (twoing), ANNs, and SVMs on Boards 1–3; per the chart annotations, the DTs use about 5x, 6x, and 6x fewer syndromes]

29

Comparison Of Different Decision-Tree Architectures

[Bar chart: average number of syndromes used for diagnosis for the same methods on Boards 1–3; per the chart annotations, the DTs use about 15x, 17x, and 18x fewer syndromes on average]

30

Comparison Between DTs And SVMs

• Success rates (SR) obtained for Board 3

[Bar chart: success rates SR1, SR2, SR3 (0%–100%) for DTs and for SVM_1, SVM_2, and SVMs_3 (min, max, avg.)]

• SR obtained by DTs are similar to SR obtained by SVMs

31

Comparison Between DTs And ANNs

• Success rates (SR) obtained for Board 3

[Bar chart: success rates SR1, SR2, SR3 (0%–100%) for DTs and for ANNs_1, ANNs_2, and ANNs_3 (min, max, avg.)]

• SR obtained by DTs are similar to SR obtained by ANNs

Information-Theoretic Syndrome and Root-Cause Analysis for Guiding Diagnosis

• Analysis of diagnosis performance
• Feedback for guiding test improvement

33

Problem Statement

• Lack of diagnosis-performance evaluation for individual root causes or individual syndromes
• Redundant syndromes

[Venn diagram: syndrome sets from teams 1–3 overlap within the complete set of syndromes, yielding a reduced syndrome set and a redundant syndrome set]

34

[Diagram: the complete set of root causes; which root cause is hard to diagnose?]

Problem Statement

• Lack of diagnosis-performance evaluation for individual root causes or individual syndromes
• Redundant syndromes
• Ambiguous root-cause pairs

35

Analysis Framework

[Diagram: syndromes from failing and synthetic boards feed the automated diagnosis system, which produces root-cause predictions. Syndrome analysis (minimum-redundancy maximum-relevance) yields a reduced set of syndromes and a set of redundant syndromes; root-cause analysis (class-relevance analysis: precision, recall) identifies root causes with low and high ambiguity. Both results feed back to the test design team, which adds/drops tests and updates the test set]

Syndrome Analysis

• Problem
  – Which syndrome is useful for diagnosis? And which is not?
• Method
  – Select useful syndromes
  – Minimum-redundancy maximum-relevance (mRMR) method

37

mRMR Method (demo)

         s1  s2  s3  s4  s5   RC
Case 1    0   1   1   0   1   A
Case 2    0   1   1   0   1   A
Case 3    0   1   1   0   1   A
Case 4    1   1   0   1   1   B
Case 5    1   0   0   1   1   B
Case 6    1   0   1   1   0   C
Case 7    1   1   0   1   0   C

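A textbook greedy mRMR pass over the demo table looks like this. Relevance and redundancy are measured with mutual information, which may differ in detail from the talk’s exact formulation:

```python
from collections import Counter
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def mutual_info(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def mrmr(syndromes, root_causes, k):
    """Greedy minimum-redundancy maximum-relevance selection.

    syndromes: dict name -> list of 0/1 observations per case.
    Each step picks the syndrome maximizing MI(s; root cause) minus the
    average MI(s; already-selected syndromes).
    """
    selected = []
    while len(selected) < k:
        def score(name):
            rel = mutual_info(syndromes[name], root_causes)
            if not selected:
                return rel
            red = sum(mutual_info(syndromes[name], syndromes[s])
                      for s in selected) / len(selected)
            return rel - red
        remaining = [s for s in syndromes if s not in selected]
        selected.append(max(remaining, key=score))
    return selected

# Demo table: 7 cases, syndromes s1-s5, root causes A/B/C.
syn = {
    "s1": [0, 0, 0, 1, 1, 1, 1],
    "s2": [1, 1, 1, 1, 0, 0, 1],
    "s3": [1, 1, 1, 0, 0, 1, 0],
    "s4": [0, 0, 0, 1, 1, 1, 1],
    "s5": [1, 1, 1, 1, 1, 0, 0],
}
rc = ["A", "A", "A", "B", "B", "C", "C"]
picked = mrmr(syn, rc, k=2)
```

On this data s1 (which isolates A) is picked first; s4 is then skipped as fully redundant with s1, and s5 (which separates B from C) is picked second.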

Experimental Setup

• Dataset
  – Industrial boards in high-volume production
  – Each board has tens of ASICs, hundreds of passive components
  – A comprehensive functional test run through all components

Number of syndromes        546
Number of root causes      153
Number of failing boards   1613

42

Experimental Setup

• Selected diagnosis systems
  – Support-vector machines
  – Artificial neural networks
  – Decision trees
  – Weighted-majority voting

43

Results (Syndrome Analysis)

[Plots: success ratio (0%–100%) vs. number of selected syndromes (0–500) for mRMR, MaxRel, and random selection, shown for support-vector machines, artificial neural networks, decision trees, and weighted-majority voting]

44


45

Root-Cause Analysis

• Problem
  – Which root cause is hard to isolate from another root cause?
  – Which root cause is hard to isolate from all other root causes?
• Method
  – Screen root causes with high ambiguity
  – Statistical metrics (e.g., precision, recall)

46

Metrics for Root-Cause Analysis

                              Predicted: root cause A    Predicted: other root causes (B, C, …)
Actual: root cause A          TP (true positive)         FN (false negative)
Actual: other root causes     FP (false positive)        TN (true negative)

47

Metrics for Root-Cause Analysis

• Precision = TP / (TP + FP): of the boards diagnosed as root cause A, the fraction that actually had root cause A
• Recall = TP / (TP + FN): of the boards that actually had root cause A, the fraction that were diagnosed as A

49

Root-Cause Analysis (demo)

         s1  s2  s3  s4  s5   Actual root cause   Predicted root cause
Case 1    1   1   0   0   1   A                   A (TP)
Case 2    1   1   0   0   1   A                   A (TP)
Case 3    1   1   0   0   1   A                   B (FN)
Case 4    0   1   1   1   1   B                   B (TN)
Case 5    0   0   1   1   1   B                   B (TN)
Case 6    1   0   1   1   0   C                   C (TN)
Case 7    0   1   1   1   0   C                   C (TN)

(The TP/FN/TN labels are with respect to root cause A.)

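Precision and recall for root cause A follow straight from the demo table (TP = 2, FP = 0, FN = 1):

```python
def precision_recall(actual, predicted, target):
    """Per-root-cause precision and recall over diagnosed boards."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == target and p == target for a, p in pairs)
    fp = sum(a != target and p == target for a, p in pairs)
    fn = sum(a == target and p != target for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Cases 1-7 from the demo: A is predicted correctly twice and missed once
# (Case 3 is diagnosed as B).
actual    = ["A", "A", "A", "B", "B", "C", "C"]
predicted = ["A", "A", "B", "B", "B", "C", "C"]
p, r = precision_recall(actual, predicted, "A")  # p = 1.0, r = 2/3
```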

51

Results I (Root-cause analysis)

[Histograms: number of root causes (0–80) falling in each precision and recall bin (0–0.1 through 0.9–1.0, and exactly 1.0), for support-vector machines, artificial neural networks, decision trees, and weighted-majority voting]


53

Results II (Root-cause analysis)

[Scatter plots: precision vs. recall (0–1) for each root cause, for support-vector machines, artificial neural networks, decision trees, and weighted-majority voting]

54

Results II (Root-cause analysis)

[Annotated scatter plot (support-vector machines): root causes with high precision and high recall need no change to the tests; high false negatives call for added tests, as do high false positives; tests behind root causes with both low precision and low recall are useless and mislead the diagnosis toward incorrect root causes]

55

Results III (Root-cause analysis)

[Pie chart (support-vector machines): 97.32% of root-cause pairs can be differentiated; the remaining 2.68% are ambiguous pairs, split among ambiguity groups of size 2, 3, 4, and 8 (0.37%–1.29% each)]

56

Summary

• Obtained a reduced syndrome set with high discriminative ability
• Analyzed ambiguity among root causes
• Provided guidelines for test-design teams to redesign tests

Knowledge Discovery and Knowledge Transfer in Board-Level Diagnosis

• Acceleration of manufacturing time
• Quick improvement of diagnosis accuracy

58

Problem Statement

• Low diagnosis accuracy at the product ramp-up stage

[Plot: diagnosis accuracy (0 to 100%) over the timeline through the preparation, learning, and mature stages; Products 1 and 2 have reached high accuracy while a new product starts low]

59

Knowledge Source?

[Diagram: knowledge discovery derives cases for the new product from its components and test program; knowledge transfer maps cases from an old product to the new product via component mapping and test-program mapping]

• Discover relationships between root causes and syndromes from the text of test logs through keywords
• Transfer diagnosis experience from other products through component and test-program mapping
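The component and test-program mappings can be sketched as simple renamings of old-product cases. The mapping tables and names below (U12, s_crc, s_pkt_cnt, …) are hypothetical; only the replaced component C37 appears earlier in the talk:

```python
# Hypothetical mappings between an old and a new product; the slides only
# state that such component and test-program mappings exist.
component_map = {"C37": "U12"}          # old component -> new component
syndrome_map = {"s_crc_err": "s_crc"}   # old test syndrome -> new test

def transfer_case(case, syndrome_map, component_map):
    """Rewrite one old-product diagnosis case (syndromes + root cause)
    into the new product's vocabulary; unmapped items are dropped."""
    syndromes = {syndrome_map[s]: v for s, v in case["syndromes"].items()
                 if s in syndrome_map}
    root = component_map.get(case["root_cause"])
    return {"syndromes": syndromes, "root_cause": root} if root else None

old_case = {"syndromes": {"s_crc_err": 1, "s_pkt_cnt": 0}, "root_cause": "C37"}
new_case = transfer_case(old_case, syndrome_map, component_map)
```

Transferred cases like this seed the new product’s training set, which is what lifts diagnosis accuracy during ramp-up before native failure data accumulates.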

[Diagram: a new diagnosis engine built from keywords and tools; desired properties: reliable, scalable, collaborative, doable]

From Academia to Industry: Toolkit

Tool categories:
• Generic programming languages
• Shell programming languages
• Statistics programming languages
• Machine learning-related tools
• Integrated design environments
• Electronic design-aid tools
• Miscellaneous
