Self-Learning and Adaptive Functional Fault Diagnosis: A Look at What is Possible with “Data”


Self-Learning and Adaptive Functional Fault Diagnosis: A Look at What is Possible with “Data”

Krishnendu Chakrabarty

Department of Electrical & Computer Engineering, Duke University

Durham, NC

2

Acknowledgments

• Fangming Ye, Duke University
• Zhaobo Zhang, Huawei
• Xinli Gu, Huawei
• Sponsors: Cisco (until 2011), Huawei (since 2011)

3

ITC 2014 Tutorial

• Tutorial 9: TEST, DIAGNOSIS, AND ROOT-CAUSE IDENTIFICATION OF FAILURES FOR BOARDS AND SYSTEMS
• Presenters: KRISHNENDU CHAKRABARTY, WILLIAM EKLOW, ZOE CONROY

• The gap between working silicon and a working board/system is becoming more significant as technology scales and complexity grows. The result of this increasing gap is failures at the system level that cannot be duplicated at the component level. These failures are most often referred to as “NTFs” (No Trouble Found). The problem will only get worse as technology scales and will be compounded as new packaging techniques (SiP, SoC, 3D) extend and expand Moore’s law. This tutorial will provide detailed background on NTFs and will present DFT, test, and root-cause identification solutions at the board/system level.

4

Automated Diagnosis: Wishlist

• Automated and accurate diagnosis
• Reduced diagnosis and repair costs
• Identify the most-effective syndromes
• Accelerate product release
• Self-learning

[Diagram: a failed board goes through functional test into the automated diagnosis system, which produces a fixed board; the system reports ambiguity, develops new tests, and analyzes and updates the current test set]

6

What Data Should We Collect?

• Pass/fail information, extent of mismatch with expected values, performance marginalities, etc.
• Counter values, BIST signatures, sensor data
  – Example: a segment of a traffic-test log

## Summary: Interfaces< r2d2 -- metro > counts - Fail(mismatch)

……464. (00000247) ERR EG R2D2_ARIC_CP_DBUS_CRC_ERR ……

Error: (0000010A) DIAGERR_ERRISO_INVALID_PKT_CNT: Packet count invalid
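Syndromes like these can be extracted from raw logs mechanically. A minimal sketch, assuming syndromes appear as underscore-separated ALL-CAPS mnemonics (the real log format is only hinted at in the abbreviated segment above):

```python
import re

def extract_syndromes(log_text):
    """Parse a traffic-test log segment into a set of syndrome names.

    Assumption: syndromes are ALL-CAPS identifiers with at least one
    underscore, e.g. R2D2_ARIC_CP_DBUS_CRC_ERR; short tokens such as
    ERR or EG are severity/direction markers, not syndromes.
    """
    pattern = re.compile(r"\b[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)+\b")
    return set(pattern.findall(log_text))

log = (
    "464. (00000247) ERR EG R2D2_ARIC_CP_DBUS_CRC_ERR\n"
    "Error: (0000010A) DIAGERR_ERRISO_INVALID_PKT_CNT: Packet count invalid\n"
)
syndromes = extract_syndromes(log)
```

Each extracted name then becomes one binary feature (observed / not observed) for the learning models discussed next.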

7

What Can We Do With This Data?

• Train machine-learning models for root-cause localization
• Identify redundant syndromes (i.e., test-outcome data) and do data pruning
• Identify deficiencies of tests in terms of diagnostic ambiguity and provide guidance for test redesign
• “Fill in the gaps” when the data is not sufficient for precise diagnosis

8

Key Challenges and Solutions

• How to improve diagnosis accuracy?
  – Support-vector machines and incremental learning [Ye et al., TCAD’14]
  – Multiple classifiers and weighted-majority voting [Ye et al., TCAD’13]
• How to speed up diagnosis?
  – Decision trees [Ye et al., ATS’12]
• How to do diagnosis using incomplete information?
  – Imputation methods [Ye et al., ATS’13]
• What can we learn from past diagnosis?
  – Root-cause and syndrome analysis [Ye et al., ETS’13]
  – Knowledge discovery and knowledge transfer [Ye et al., ITC’14]

9

Syndrome and Root-Cause

• A segment of traffic-test log

• Syndromes are test outcomes parsed from the log
• Root causes are replaced components, e.g., C37

## Summary: Interfaces< r2d2 -- metro > counts - Fail(mismatch)

……464. (00000247) ERR EG R2D2_ARIC_CP_DBUS_CRC_ERR ……

Error: (0000010A) DIAGERR_ERRISO_INVALID_PKT_CNT: Packet count invalid

         s1  s2  s3   Root cause
Case 1    1   1   1   A
Case 2    1   1   1   A
Case 3    1   1   1   B
Case 4    0   1   1   B

10

Success Ratio

• Success ratio = ¾ = 75%

         s1  s2  s3   Actual root cause   Predicted root cause
Case 1    1   1   1   A                   A
Case 2    1   1   1   A                   A
Case 3    1   1   1   B                   A
Case 4    0   1   1   B                   B
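The success ratio above can be computed directly from the actual and predicted columns; a minimal sketch:

```python
def success_ratio(actual, predicted):
    """Fraction of boards whose predicted root cause matches the repair record."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)

# Cases 1-4 from the table: Case 3 shares its syndromes with Cases 1-2,
# so it is misdiagnosed as A.
actual    = ["A", "A", "B", "B"]
predicted = ["A", "A", "A", "B"]
ratio = success_ratio(actual, predicted)  # 3/4 = 0.75
```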

Using Machine Learning (Support-Vector Machines)

– A binary classifier based on statistical learning
– Train the model with data
– Use the trained model for diagnosis

11

[Figure: (a) Line 1, a feasible separating line with a small margin; (b) Line 2, the optimal separating line that maximizes the margin]

Example

• Suppose we have 6 cases (successfully debugged boards) for training
• Let x1, x2, x3 be three syndromes. If a syndrome manifests itself, we record it as 1, and 0 otherwise.
• Suppose the board has two candidate root causes A and B, and we encode them as y = 1 and y = −1
• Merge the syndromes and the known root causes into a matrix A = [B|C], where the left side (B) holds the syndromes and the right side (C) holds the corresponding fault classes

Example (Contd.)

• SVM calculation: w1 = 1.99, w2 = 0, w3 = 0, and b = −1.00
• Therefore, the classifier is f(x) = 1.99x1 − 1.00

[Figure: the separating plane in (x1, x2, x3) space, with the y = 1 and y = −1 regions and the distance between the classifier and the support vectors]

• Given a new failing system with syndrome [1 0 0], then f(x) > 0. The root cause for this failing system is A.

• Given another new failing system with syndrome [0 1 0], f(x) < 0. The root cause for this failing system is B.
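The resulting classifier is a plain dot product plus a bias; a sketch using the weights computed in the example:

```python
# Trained classifier from the worked example: f(x) = w . x + b with
# w = (1.99, 0, 0) and b = -1.00; sign(f) maps to root cause A (y = 1)
# or root cause B (y = -1).
W = (1.99, 0.0, 0.0)
B = -1.00

def classify(x):
    f = sum(wi * xi for wi, xi in zip(W, x)) + B
    return "A" if f > 0 else "B"

root1 = classify([1, 0, 0])  # f = 0.99 > 0  -> A
root2 = classify([0, 1, 0])  # f = -1.00 < 0 -> B
```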

Results for Different Kernels Circa 2011 (Manufactured Boards)

• 811 boards for training, 212 for test

• Diagnostic accuracy under different kernel functions

14

[Bar chart: diagnostic accuracy (y-axis 40%–100%) under the polynomial (degree = 1), Gaussian, and exponential kernels; the values shown are 92.23%, 72.64%, 77.83%, 81.13%, 94.20%, 61.79%, 72.64%, 75.47%, 93.96%, 57.08%, 67.45%, and 69.81%]

Incremental SVMs

[Diagram: support vectors extracted from the initial training set are combined with new training cases to form the combined training set for incremental learning]

Flow Chart

1. Preparation stage: extract fault syndromes and repair actions from new failing boards as an additional training set S
2. Learning stage: formulate the optimization problem for the combined training set S ∪ S* (new data plus the existing support vectors S*), then solve and update the SVM model
3. More new training data? If yes, repeat from the preparation stage; if no, the final SVM model is ready
4. Diagnosis stage for new systems: determine the root cause based on the output of the final SVM model
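The flow above can be sketched structurally. The real method trains an SVM incrementally [Ye et al., TCAD’14]; since no solver is shown here, a nearest-centroid classifier stands in for the model, and the points of each class nearest the opposing class stand in for support vectors. Both stand-ins are illustrative assumptions, not the talk’s method:

```python
def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    return [sum(c) / len(points) for c in zip(*points)]

def train(cases, keep=2):
    """'Train' on (feature_vector, label) cases and return (model, support).

    Support cases are approximated as the `keep` points of each class
    nearest the opposing class centroid -- the boundary-defining points
    a real SVM would retain.
    """
    by_label = {}
    for x, y in cases:
        by_label.setdefault(y, []).append(x)
    centroids = {y: centroid(xs) for y, xs in by_label.items()}
    support = []
    for y, xs in by_label.items():
        other = [c for l, c in centroids.items() if l != y]
        xs_sorted = sorted(xs, key=lambda x: min(dist2(x, c) for c in other))
        support += [(x, y) for x in xs_sorted[:keep]]
    return centroids, support

def predict(model, x):
    return min(model, key=lambda y: dist2(x, model[y]))

# Incremental flow: S* starts empty; each batch S is trained with S U S*,
# and only the support cases carry over to the next round.
support = []
batches = [
    [([1, 1, 0], "A"), ([1, 0, 0], "A"), ([0, 1, 1], "B")],
    [([0, 0, 1], "B"), ([1, 1, 1], "A")],
]
for batch in batches:
    model, support = train(support + batch)
```

The point of the structure is the retraining cost: each round touches only the retained support set plus the new batch, never the full history.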

Comparison of Training Time

[Plot: SVM model training time (seconds, 0–25) vs. number of training cases (400–1400), for non-incremental and incremental learning]

Comparison of Success Rates

[Plot: SVM model success rate (30%–100%) vs. number of training cases (400–1400), for non-incremental and incremental learning]

19

Typical Diagnosis Systems

• Number of syndromes (1,000 per board)
• Diagnosis time (up to several hours per board)
• Often requires manual diagnosis

[Diagram: how to select the useful syndromes from the complete set of syndromes?]

Comparison of Diagnosis Procedures

Traditional diagnosis: start diagnosis → observe ALL syndromes → predicted root cause
Decision tree-based diagnosis: start diagnosis → observe ONE syndrome → confident about the root cause? If yes, predict the root cause; if no, observe another syndrome

20

21

Decision Trees

• Internal nodes
  – Can branch to two child nodes
  – Represent syndromes
• Terminal nodes
  – Do not branch
  – Contain class information

[Diagram: a tree with internal nodes S1–S4, each branching yes/no into terminal nodes A2, A3, A4, and A1 (A1 appears at two leaves)]

22

Decision Trees

• We may reach root cause A1 in two different test sequences.
  1) Start from the most discriminative syndrome S1
  2) If S1 manifests itself, we then consider syndrome S2
  3) If S2 manifests itself, we can determine A1 to be the root cause

[Diagram: the path S1 –yes→ S2 –yes→ A1 highlighted in the tree]

23

Decision Trees

• We may reach root cause A1 in two different test sequences
  1) Start from the most discriminative syndrome S1
  2) If S1 passes, we then consider syndrome S3
  3) If S3 manifests itself, we then consider syndrome S4
  4) If S4 manifests itself, we can determine A1 to be the root cause

[Diagram: the path S1 –no→ S3 –yes→ S4 –yes→ A1 highlighted in the tree]
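The two traversals above can be traced on a small encoding of the tree. The yes-paths to A1 match the slides; which of A2–A4 sits on each no-branch is not shown, so that assignment is an assumption here:

```python
# Decision tree encoded as nested tuples:
# (syndrome, subtree_if_manifests, subtree_if_passes);
# bare strings are terminal root causes.
TREE = ("S1",
        ("S2", "A1", "A2"),      # S1 yes -> S2; S2 yes -> A1
        ("S3",
         ("S4", "A1", "A4"),     # S1 no -> S3 -> S4; S4 yes -> A1
         "A3"))

def diagnose(tree, syndromes):
    """Adaptively observe one syndrome per step until a leaf is reached."""
    while isinstance(tree, tuple):
        name, if_yes, if_no = tree
        tree = if_yes if syndromes[name] else if_no
    return tree

# Path 1: S1 and S2 manifest -> A1 after only two observations.
path1 = diagnose(TREE, {"S1": 1, "S2": 1, "S3": 0, "S4": 0})
# Path 2: S1 passes, S3 and S4 manifest -> A1.
path2 = diagnose(TREE, {"S1": 0, "S2": 0, "S3": 1, "S4": 1})
```

This is the source of the speedup over traditional diagnosis: only the syndromes along one root-to-leaf path are ever observed.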

24

Training of Decision Trees (Syndrome Identification)

• Goals:
  – Rank syndromes
  – Minimize ambiguity
  – Reduce tree depth
• Three popular criteria can be used for training decision trees:
  – Information gain
  – Gini index
  – Twoing

[Diagram: a candidate split partitioning Class 1 and Class 2 samples]
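As an example of one splitting criterion, the Gini index can rank the three syndromes of the earlier four-case table. This is the standard Gini formulation, not necessarily the exact variant used in the talk:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum over classes of p_c^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_of_split(cases, syndrome):
    """Weighted Gini impurity after splitting on one syndrome.

    cases: list of (syndrome_dict, root_cause). Lower is better, so
    training picks the syndrome that minimizes this value.
    """
    groups = {0: [], 1: []}
    for syn, rc in cases:
        groups[syn[syndrome]].append(rc)
    n = sum(len(g) for g in groups.values())
    return sum(len(g) / n * gini(g) for g in groups.values() if g)

# The four cases from the earlier syndrome/root-cause table.
cases = [
    ({"s1": 1, "s2": 1, "s3": 1}, "A"),
    ({"s1": 1, "s2": 1, "s3": 1}, "A"),
    ({"s1": 1, "s2": 1, "s3": 1}, "B"),
    ({"s1": 0, "s2": 1, "s3": 1}, "B"),
]
best = min(["s1", "s2", "s3"], key=lambda s: gini_of_split(cases, s))
```

Here s2 and s3 never vary, so splitting on them leaves the impurity at 0.5, while s1 isolates one B case and is ranked first.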

25

Diagnosis Using Decision Trees

1. Training data preparation: extract all fault syndromes and repair actions from historical data
2. DT architecture design: design inputs, outputs, splitting criterion, pruning
3. DT training: generate a tree-based predictive model and assess its performance
4. DT-based diagnosis: traverse from the root node of the DT and obtain the root cause at a leaf node

Diagnosis Using Decision Trees

26

1. Start diagnosis: observe the syndrome at the root of the DT
2. Adaptive diagnosis: select and observe the next syndrome based on the observation of the current syndrome
3. Repeat until a leaf node is reached
4. Predict root cause: generate the root cause for the failing board

Experiments

• Experiments performed on industrial boards currently in production
  – Tens of ASICs, hundreds of passive components
• All boards under analysis failed traffic test
  – A comprehensive functional test set for fault isolation, run through all components

27

                                  Board 1   Board 2   Board 3
Number of test items              420       207       420
Number of root-cause components   10        14        10
Number of failed boards           130       40        1000

28

Comparison Of Different Decision-Tree Architectures

[Bar chart: total number of syndromes used for diagnosis (0–450) for DT (Gini index), DT (info. gain), DT (twoing), ANNs, and SVMs on Boards 1–3; per the chart annotations, the DTs use about 5x, 6x, and 6x fewer syndromes]

29

Comparison Of Different Decision-Tree Architectures

[Bar chart: average number of syndromes used for diagnosis for the same methods on Boards 1–3; per the chart annotations, the DTs use about 15x, 17x, and 18x fewer syndromes on average]

30

Comparison Between DTs And SVMs

• Success rates (SR) obtained for Board 3

[Bar chart: success rates SR1, SR2, SR3 (0%–100%) for DTs and for SVM_1, SVM_2, and SVMs_3 (min, max, avg.)]

• SR obtained by DTs are similar to SR obtained by SVMs

31

Comparison Between DTs And ANNs

• Success rates (SR) obtained for Board 3

[Bar chart: success rates SR1, SR2, SR3 (0%–100%) for DTs and for ANNs_1, ANNs_2, and ANNs_3 (min, max, avg.)]

• SR obtained by DTs are similar to SR obtained by ANNs

Information-Theoretic Syndrome and Root-Cause Analysis for Guiding Diagnosis

• Analysis of diagnosis performance
• Feedback for guiding test improvement

33

Problem Statement

• Lack of diagnosis-performance evaluation for individual root causes or individual syndromes
• Redundant syndromes

[Venn diagram: syndrome sets from teams 1–3 overlap within the complete set of syndromes, yielding a reduced syndrome set and a redundant syndrome set]

34

[Diagram: the complete set of root causes; which root cause is hard to diagnose?]

Problem Statement

• Lack of diagnosis-performance evaluation for individual root causes or individual syndromes
• Redundant syndromes
• Ambiguous root-cause pairs

35

Analysis Framework

[Diagram: syndromes from failing and synthetic boards feed the automated diagnosis system, which produces root-cause predictions. Syndrome analysis (minimum-redundancy maximum-relevance) yields a reduced set of syndromes and a set of redundant syndromes; root-cause analysis (class-relevance analysis: precision, recall) identifies root causes with low and high ambiguity. Both results feed back to the test design team, which adds/drops tests and updates the test set]

Syndrome Analysis

• Problem
  – Which syndrome is useful for diagnosis? And which is not?
• Method
  – Select useful syndromes
  – Minimum-redundancy maximum-relevance (mRMR) method

37

mRMR Method (demo)

         s1  s2  s3  s4  s5   RC
Case 1    0   1   1   0   1   A
Case 2    0   1   1   0   1   A
Case 3    0   1   1   0   1   A
Case 4    1   1   0   1   1   B
Case 5    1   0   0   1   1   B
Case 6    1   0   1   1   0   C
Case 7    1   1   0   1   0   C

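A textbook greedy mRMR pass over the demo table looks like this. Relevance and redundancy are measured with mutual information, which may differ in detail from the talk’s exact formulation:

```python
from collections import Counter
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def mutual_info(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def mrmr(syndromes, root_causes, k):
    """Greedy minimum-redundancy maximum-relevance selection.

    syndromes: dict name -> list of 0/1 observations per case.
    Each step picks the syndrome maximizing MI(s; root cause) minus the
    average MI(s; already-selected syndromes).
    """
    selected = []
    while len(selected) < k:
        def score(name):
            rel = mutual_info(syndromes[name], root_causes)
            if not selected:
                return rel
            red = sum(mutual_info(syndromes[name], syndromes[s])
                      for s in selected) / len(selected)
            return rel - red
        remaining = [s for s in syndromes if s not in selected]
        selected.append(max(remaining, key=score))
    return selected

# Demo table: 7 cases, syndromes s1-s5, root causes A/B/C.
syn = {
    "s1": [0, 0, 0, 1, 1, 1, 1],
    "s2": [1, 1, 1, 1, 0, 0, 1],
    "s3": [1, 1, 1, 0, 0, 1, 0],
    "s4": [0, 0, 0, 1, 1, 1, 1],
    "s5": [1, 1, 1, 1, 1, 0, 0],
}
rc = ["A", "A", "A", "B", "B", "C", "C"]
picked = mrmr(syn, rc, k=2)
```

On this data s1 (which isolates A) is picked first; s4 is then skipped as fully redundant with s1, and s5 (which separates B from C) is picked second.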

Experimental Setup

• Dataset
  – Industrial boards in high-volume production
  – Each board has tens of ASICs, hundreds of passive components
  – A comprehensive functional test run through all components

Number of syndromes        546
Number of root causes      153
Number of failing boards   1613

42

Experimental Setup

• Selected diagnosis systems
  – Support-vector machines
  – Artificial neural networks
  – Decision trees
  – Weighted-majority voting

43

Results (Syndrome Analysis)

[Plots: success ratio (0%–100%) vs. number of selected syndromes (0–500) for mRMR, MaxRel, and random selection, shown for support-vector machines, artificial neural networks, decision trees, and weighted-majority voting]

44


45

Root-Cause Analysis

• Problem
  – Which root cause is hard to isolate from another root cause?
  – Which root cause is hard to isolate from all other root causes?
• Method
  – Screen root causes with high ambiguity
  – Statistical metrics (e.g., precision, recall)

46

Metrics for Root-Cause Analysis

                              Predicted: root cause A    Predicted: other root causes (B, C, …)
Actual: root cause A          TP (true positive)         FN (false negative)
Actual: other root causes     FP (false positive)        TN (true negative)

47

Metrics for Root-Cause Analysis

• Precision = TP / (TP + FP): of the boards diagnosed as root cause A, the fraction that actually had root cause A
• Recall = TP / (TP + FN): of the boards that actually had root cause A, the fraction that were diagnosed as A

49

Root-Cause Analysis (demo)

         s1  s2  s3  s4  s5   Actual root cause   Predicted root cause
Case 1    1   1   0   0   1   A                   A (TP)
Case 2    1   1   0   0   1   A                   A (TP)
Case 3    1   1   0   0   1   A                   B (FN)
Case 4    0   1   1   1   1   B                   B (TN)
Case 5    0   0   1   1   1   B                   B (TN)
Case 6    1   0   1   1   0   C                   C (TN)
Case 7    0   1   1   1   0   C                   C (TN)

(The TP/FN/TN labels are with respect to root cause A.)

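Precision and recall for root cause A follow straight from the demo table (TP = 2, FP = 0, FN = 1):

```python
def precision_recall(actual, predicted, target):
    """Per-root-cause precision and recall over diagnosed boards."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == target and p == target for a, p in pairs)
    fp = sum(a != target and p == target for a, p in pairs)
    fn = sum(a == target and p != target for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Cases 1-7 from the demo: A is predicted correctly twice and missed once
# (Case 3 is diagnosed as B).
actual    = ["A", "A", "A", "B", "B", "C", "C"]
predicted = ["A", "A", "B", "B", "B", "C", "C"]
p, r = precision_recall(actual, predicted, "A")  # p = 1.0, r = 2/3
```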

51

Results I (Root-cause analysis)

[Histograms: number of root causes (0–80) falling in each precision and recall bin (0–0.1 through 0.9–1.0, and exactly 1.0), for support-vector machines, artificial neural networks, decision trees, and weighted-majority voting]


53

Results II (Root-cause analysis)

[Scatter plots: precision vs. recall (0–1) for each root cause, for support-vector machines, artificial neural networks, decision trees, and weighted-majority voting]

54

Results II (Root-cause analysis)

[Annotated scatter plot (support-vector machines): root causes with high precision and high recall need no change to the tests; high false negatives call for added tests, as do high false positives; tests behind root causes with both low precision and low recall are useless and mislead the diagnosis toward incorrect root causes]

55

Results III (Root-cause analysis)

[Pie chart (support-vector machines): 97.32% of root-cause pairs can be differentiated; the remaining 2.68% are ambiguous pairs, split among ambiguity groups of size 2, 3, 4, and 8 (0.37%–1.29% each)]

56

Summary

• Obtained a reduced syndrome set with high discriminative ability
• Analyzed ambiguity among root causes
• Provided guidelines for test-design teams to redesign tests

Knowledge Discovery and Knowledge Transfer in Board-Level Diagnosis

• Acceleration of manufacturing time
• Quick improvement of diagnosis accuracy

58

Problem Statement

• Low diagnosis accuracy at the product ramp-up stage

[Plot: diagnosis accuracy (0 to 100%) over the timeline through the preparation, learning, and mature stages; Products 1 and 2 have reached high accuracy while a new product starts low]

59

Knowledge Source?

[Diagram: knowledge discovery derives cases for the new product from its components and test program; knowledge transfer maps cases from an old product to the new product via component mapping and test-program mapping]

• Discover relationships between root causes and syndromes from the text of test logs through keywords
• Transfer diagnosis experience from other products through component and test-program mapping
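The component and test-program mappings can be sketched as simple renamings of old-product cases. The mapping tables and names below (U12, s_crc, s_pkt_cnt, …) are hypothetical; only the replaced component C37 appears earlier in the talk:

```python
# Hypothetical mappings between an old and a new product; the slides only
# state that such component and test-program mappings exist.
component_map = {"C37": "U12"}          # old component -> new component
syndrome_map = {"s_crc_err": "s_crc"}   # old test syndrome -> new test

def transfer_case(case, syndrome_map, component_map):
    """Rewrite one old-product diagnosis case (syndromes + root cause)
    into the new product's vocabulary; unmapped items are dropped."""
    syndromes = {syndrome_map[s]: v for s, v in case["syndromes"].items()
                 if s in syndrome_map}
    root = component_map.get(case["root_cause"])
    return {"syndromes": syndromes, "root_cause": root} if root else None

old_case = {"syndromes": {"s_crc_err": 1, "s_pkt_cnt": 0}, "root_cause": "C37"}
new_case = transfer_case(old_case, syndrome_map, component_map)
```

Transferred cases like this seed the new product’s training set, which is what lifts diagnosis accuracy during ramp-up before native failure data accumulates.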

[Diagram: a new diagnosis engine built from keywords and tools; desired properties: reliable, scalable, collaborative, doable]

From Academia to Industry: Toolkit

Tool categories:
• Generic programming languages
• Shell programming languages
• Statistics programming languages
• Machine learning-related tools
• Integrated design environments
• Electronic design-aid tools
• Miscellaneous
