

Research Article

Published online 21 January 2011 in Wiley Online Library (wileyonlinelibrary.com) DOI: 10.1002/qre.1178

Feature Extraction and Classification Models for High-dimensional Profile Data

Amit Shinde,a∗† George Church,b Mani Janakiramb and George Rungera

As manufacturing transitions to real-time sensing, it becomes more important to handle multiple, high-dimensional (non-stationary) time series that generate thousands of measurements for each batch. Predictive models are often challenged by such high-dimensional data and it is important to reduce the dimensionality for better performance. With thousands of measurements, even wavelet coefficients do not reduce the dimensionality sufficiently. We propose a two-stage method that uses energy statistics from a discrete wavelet transform to identify process variables and appropriate resolutions of wavelet coefficients in an initial (screening) model. Variable importance scores from a modern random forest classifier are exploited in this stage. Coefficients that correspond to the identified variables and resolutions are then selected for a second-stage predictive model. The approach is shown to provide good performance, along with interpretable results, in an example where multiple time series are used to indicate the need for preventive maintenance. In general, the two-stage approach can handle high dimensionality and still provide interpretable features linked to the relevant process variables and wavelet resolutions that can be used for further analysis. Copyright © 2011 John Wiley & Sons, Ltd.

Keywords: discrete wavelet transformation; random forest; preventive maintenance

1. Introduction

Modern manufacturing organizations (such as semiconductor manufacturers) collect massive amounts of in-process data with the foresight of potentially using the information hidden in it to drive product and process improvements. Many operations require inputs from multiple process variables such as temperature, pressure, gas flow rates, etc., with each variable being a time-dependent profile rather than a single set value. Advances in measurement and data storage capabilities enable recording these variables at a high sampling frequency, which results in hundreds of measurements per variable per batch. It is valuable to look for patterns in the data collected on the process variables that can help make better-informed process decisions.

For example, consider the maintenance scheduling process. Maintenance strategies are broadly categorized as corrective and preventive. Corrective maintenance refers to unscheduled maintenance events that are carried out following a machine failure with the intention of restoring production. On the other hand, the preventive maintenance (PM) strategy refers to proactively scheduling and carrying out periodic maintenance on machines in order to prevent machine failure [1]. Optimal PM scheduling and PM prediction are key enablers of factory efficiency and effectiveness. A well-designed PM process has a number of benefits. It can significantly increase the system's life, reduce system downtime and reduce product failures, all of which eventually translate into reduced production costs [2].

Preventive maintenance is usually carried out using one of two approaches. The traditional approach is a time-based approach in which the system is pulled down for maintenance either after a fixed number of cycles or after a fixed time interval. The second approach, condition-based maintenance (CBM), monitors the system over time and bases maintenance decisions on the system's actual deterioration compared with a predefined threshold for PM [3-6]. A modification to this approach is to base maintenance decisions on the predicted future deterioration status instead of the current deterioration status [7]. This approach is called predictive condition-based maintenance (PCBM).

The existing PM process used in semiconductor manufacturing is based on the time-based approach. The PM process involves pulling a tool offline for maintenance after a set number of lots (units) have been processed. The tool chamber is then conditioned and the tool is re-qualified based on the metrology of the outputs, such as film thickness/uniformity. The tool processes test wafers until the desired output is again within specified limits. Subsequently, the tool is put online for the processing of production wafers. Repeating the PM cycle based on a fixed number of lots can be either too frequent or too infrequent.

aArizona State University, Tempe, AZ, U.S.A.
bIntel Corporation, Chandler, AZ, U.S.A.
∗Correspondence to: Amit Shinde, Ira A. Fulton Schools of Engineering, P.O. Box 9309, Arizona State University, Tempe, AZ 85287-9309, U.S.A.
†E-mail: [email protected]


A better strategy would be to build a predictive model that monitors the time signatures of the multiple variables and indicates whether PM needs to be scheduled based on deterioration of these signatures. This can considerably improve tool availability, lead to faster decisions and reduce scrap and rework.

In general, data collected on these time series have two sources of variation embedded in them. The first source of variation is in the form of structured signals from the controlled variables. The second source is more random and is associated with noise. This could be the result of measurement system error, difficulty in setting and controlling the process variables to the desired values, etc. Another issue arises from the lack of time synchronization of the tool transitions. This results in lack of alignment of the variables from one lot to another. Many predictive models built on such high-dimensional noisy data suffer from high error rates. Therefore, prior to building any predictive model, a method is required to align the signals, denoise them and reduce their dimensionality. Principal Component Analysis [8] has been widely used as a tool for summarizing and extracting important features from such data as well as denoising it. However, each principal component is a linear combination of all variables and is therefore difficult to interpret.

Additionally, most variables are represented as a non-stationary profile formed by the superposition of many effects at different frequencies and different scales. Hence, models that can capture time-scale information by focusing on local time effects are useful. ARIMA models [9] are rendered ineffective because of non-stationarity. Smoothing filters used by traditional univariate and multivariate statistical process control charts, such as the exponentially weighted moving average, have a fixed frequency, i.e. the charts are single scale, and do not summarize the time series. These control charts are restricted to detecting mean shifts and are insensitive to frequency changes. Fourier transformations [10] map a signal from the time domain to the frequency domain and hence can be used to capture frequency changes. On the other hand, the time information in the signal is lost. The short-time Fourier transform (STFT) 'windows' the signal and maps it to the time-frequency domain [11]. However, the precision of the result depends on the size of the window. STFT maintains a constant window size for analyzing the entire signal, i.e. the window size is the same for all frequencies, and is therefore ineffective in analyzing signals that have features at different frequencies.

A discrete wavelet transformation (DWT) is a starting point to potentially address these problems. A DWT uses appropriate low-pass and high-pass filters to map the signal from the time domain into the time-scale domain [12]. Wavelet coefficients can be denoised using appropriate thresholding methods. The coefficients that fall below the set threshold are discarded, thus providing a sparse representation of the data [13]. By using the inverse DWT, the original signal can be synthesized from the denoised coefficients. Moreover, each wavelet coefficient can be traced back to a specific period of the original signal. While a DWT significantly reduces the dimensionality of the data, the number of wavelet coefficients retained can still be significantly large. Hence, additional dimensionality reduction effort is necessary to summarize the wavelet coefficients. Consequently, a DWT is only a start and further analysis of the coefficients is needed for the predictive model.

The objective of our research is to prototype a robust method that will help identify which variables and specifically what periods are important to monitor when the inputs to the process are in the form of time-dependent profiles. This method will address issues such as misalignment of signals, noise reduction, dimensionality reduction and feature selection. The remainder of this paper is organized as follows: Section 2 summarizes DWTs and random forest (RF) classification; Section 3 provides the proposed methodology; Section 4 demonstrates the application of this method to high-dimensional data for PM; and Section 5 provides conclusions.

2. Background

2.1. Discrete wavelet transformation

Time series for all variables contain certain low-frequency deterministic features, such as spikes and mean shifts, as well as certain high-frequency stochastic components. Wavelet analysis facilitates local analysis by extracting high-frequency components using shorter windows and low-frequency components using longer windows. Using a DWT, the signal $S_n$ is decomposed into different scales using appropriate low-pass filters (h) and high-pass filters (g) and downsampling (↓) by a factor of 2 [12, 14, 15]. At each level, the high-pass filter produces detail coefficients ($d_n$) and the low-pass filter produces approximation coefficients ($a_n$).

$$a_1=(a_{1,n})=(S*h)\downarrow 2 \qquad (1)$$

$$d_1=(d_{1,n})=(S*g)\downarrow 2 \qquad (2)$$

Approximations can be further decomposed into components of lower resolution using the same process. The coefficients that fall below the set threshold are discarded, thus providing a sparse representation of the data [13, 16]. Conversely, it is also possible to reconstruct the original signal from the denoised wavelet coefficients. At each level, the approximation and detail coefficients are up-sampled (↑) by a factor of 2, passed through reconstruction filters $g'$ and $h'$ and then added.

$$S=(a_1\uparrow 2)*h'+(d_1\uparrow 2)*g' \qquad (3)$$

Moreover, an important advantage is that each wavelet coefficient can be traced back to a specific period of the original signal. For example, if the signal consists of 256 data points, the first approximation coefficient at level 3 relates to the first 8 data points of the signal. Thus, wavelets provide a solution to denoising as well as an easy-to-interpret method of reducing dimensionality.
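As a concrete illustration of Equations (1)-(3), the following minimal sketch decomposes and reconstructs a 256-point signal with the Haar wavelet. The use of the PyWavelets (pywt) package is our assumption; the paper does not name an implementation.

```python
# Minimal sketch of a Haar DWT and its inverse, assuming PyWavelets is available.
import numpy as np
import pywt

signal = np.random.randn(256)              # stand-in for one aligned variable profile

# Single-level decomposition, as in Equations (1) and (2)
a1, d1 = pywt.dwt(signal, 'haar')          # approximation and detail coefficients

# Reconstruction, as in Equation (3)
reconstructed = pywt.idwt(a1, d1, 'haar')
assert np.allclose(reconstructed, signal)

# Level-3 decomposition used later in the paper: blocks [a3, d3, d2, d1]
coeffs = pywt.wavedec(signal, 'haar', level=3)
print([len(c) for c in coeffs])            # [32, 32, 64, 128] for a 256-point signal
```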


2.2. Random forests

RF [17] is a method used for predictive modeling. An RF combines the predictions made by multiple, fully grown decision trees [18]. Decision trees partition rows of data successively based on the predictor variables to achieve a consistent response value in each partition. In an RF, each tree is grown as follows. Let N and M represent the number of cases and the number of predictors, respectively, in the training data set. A random sample of size N is drawn with replacement from the original data. This is also referred to as a bootstrap sample and forms the training set for growing the tree. Each decision tree is built on a separate bootstrap sample. The cases not selected in the bootstrap sample are referred to as out-of-bag (OOB) data. Bootstrap samples lead to correlation between trees, which, in turn, inflates the variance of the RF model. To compensate for this, RF injects additional randomness into the model-building process by randomly selecting from a smaller subset (m < M) of input variables at each partition, in each tree. Each tree partitions the data to a maximum depth, but the prediction is smoothed because it is averaged over the trees.

For every tree grown, the probability of a case being out of the bootstrap sample is approximately 1/3. The OOB samples can serve as a test set for the tree grown on the non-OOB data. This can be used to get an unbiased estimate of the test set classification error. Along with providing accurate predictions, RF uses the OOB set to provide an estimate of variable importance to a predictive model. This is a useful feature, as it allows one to learn additional information regarding the structure of the process that generates the data.
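A minimal sketch of these ideas follows, assuming the scikit-learn implementation of random forests (the paper does not specify a library); the data shapes and labels are placeholders.

```python
# Random forest with bootstrap sampling, m = sqrt(M) candidate splits per node,
# an OOB error estimate and variable importance scores (scikit-learn assumed).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(182, 116))            # e.g., 116 energy statistics per lot
y = rng.integers(0, 2, size=182)           # due-for-maintenance vs post-maintenance

rf = RandomForestClassifier(
    n_estimators=5000,                     # many trees stabilize the importance scores
    max_features='sqrt',                   # m = sqrt(M) variables tried at each split
    bootstrap=True,
    oob_score=True,                        # internal (OOB) estimate of test error
    random_state=0,
)
rf.fit(X, y)
print('OOB accuracy:', rf.oob_score_)
print('Most important features:', np.argsort(rf.feature_importances_)[::-1][:5])
```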

3. Two-stage modeling to reduce dimensionality

Generally, the time traces for variables between batches are out of alignment. This can be due either to lack of time synchronization of tool transitions or to manual adjustments made by the operators during the process. This misalignment can distort any statistical measure, which in turn can lead to erroneous models. Therefore, addressing this problem of misalignment becomes a crucial pre-processing step.

Aligning two input signals requires a measure of similarity/dissimilarity between them. Dynamic Time Warping (DTW) [19] has been widely used to align profiles and compute a similarity/dissimilarity measure that is robust to shifts in the time axis. DTW nonlinearly stretches and shrinks the time axis so as to maximize the similarity between the two signals. However, this can lead to loss of information in cases where the misalignment is due to operators adjusting the process such that a particular variable is run at a particular setting for a longer time. The alignment algorithm we propose is based on translation of the profiles to maximize the cross-correlation between them.

Consider two time traces of length n given by x(i) and y(i), where i = 1, 2, ..., n. Let $\bar{x}$ and $\bar{y}$ represent the averages of the two time traces. The sample cross-correlation is a listing, for lags k = -d to d, of the values of

$$r_k(x,y)=\frac{\sum_{i=1}^{n-k}\bigl(x(i)-\bar{x}\bigr)\bigl(y(i+k)-\bar{y}\bigr)}{\sqrt{\sum_{i=1}^{n}\bigl(x(i)-\bar{x}\bigr)^{2}}\,\sqrt{\sum_{i=1}^{n}\bigl(y(i)-\bar{y}\bigr)^{2}}} \qquad (4)$$

Given multiple variables, the time traces for each variable are concatenated into a single signal. The next step is to identify a reference signal with which to align all the profiles. Initially, the median profile is computed (from the median at each time point) to act as the representative signal across all lots. We estimate the cross-correlation of each lot against the median signal, which helps determine the lot that has the maximum similarity with the median signal. This lot is then defined as the reference lot. The signals from each lot are then translated along the time axis so as to maximize the cross-correlation with the reference lot. Certain portions of the signal that may be affected by this shifting (edge effect) are discarded.
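A hedged sketch of this alignment step is given below. The function names, the lag search range and the use of a circular shift are illustrative assumptions; as in the text, the edge regions affected by the shift should be discarded afterwards.

```python
# Cross-correlation-based alignment of concatenated per-lot signals (illustrative).
import numpy as np

def best_lag(ref, sig, max_lag=20):
    """Lag in [-max_lag, max_lag] that maximizes the correlation of ref with the shifted sig."""
    lags = list(range(-max_lag, max_lag + 1))
    scores = [np.corrcoef(ref, np.roll(sig, k))[0, 1] for k in lags]
    return lags[int(np.argmax(scores))]

def align_lots(lots, max_lag=20):
    """Translate each lot to best match the lot most similar to the pointwise median profile."""
    lots = np.asarray(lots, dtype=float)
    median_profile = np.median(lots, axis=0)
    ref = lots[np.argmax([np.corrcoef(median_profile, lot)[0, 1] for lot in lots])]
    aligned = np.array([np.roll(lot, best_lag(ref, lot, max_lag)) for lot in lots])
    return aligned   # in practice, trim the edges distorted by the shift (edge effect)
```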

The aligned signals can then be mapped from the time domain into a time-scale domain using an appropriate DWT. The low-frequency deterministic components of the signal can be effectively summarized using the approximation coefficients at the appropriate level of decomposition. Since noise constitutes the higher-frequency component of the signal, denoising can be achieved by applying appropriate thresholds to the wavelet coefficients. A universal threshold value [20] is computed as

$$\mathrm{thr}=\sqrt{2\ln(n)}\;\sigma_{\text{noise}} \qquad (5)$$

where $\sigma_{\text{noise}}$ is an estimate of the noise level and n is the length of the signal. Let $d_{j,k}$ represent the wavelet coefficients corresponding to all k translations at scale j. Only the significant wavelet coefficients $d_{j,k}$ are retained, using the soft thresholding rule

$$\hat{d}_{j,k}=\begin{cases}\operatorname{sign}(d_{j,k})\,\bigl(|d_{j,k}|-\mathrm{thr}\bigr), & |d_{j,k}|\geq \mathrm{thr}\\ 0, & |d_{j,k}|<\mathrm{thr}\end{cases} \qquad (6)$$

Using the right level of decomposition followed by thresholding can significantly reduce the number of wavelet coefficients. However, when the input signal consists of thousands of dimensions, the number of wavelet coefficients retained can still be very large. Based on the nature of the signal and which features are important for process monitoring, only the approximations or the details at a particular level can be used. For example, if the variable time traces are in the form of piecewise constants, then only the approximation coefficients at the jth level can be used.
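The sketch below applies Equations (5) and (6) to one profile. PyWavelets is again an assumption, as is the common choice of estimating the noise level from the median absolute deviation of the finest-level detail coefficients.

```python
# Universal-threshold soft denoising of wavelet coefficients (PyWavelets assumed).
import numpy as np
import pywt

def denoise(signal, wavelet='haar', level=3):
    coeffs = pywt.wavedec(signal, wavelet, level=level)        # [a3, d3, d2, d1]
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745             # noise estimate from d1 (assumed MAD rule)
    thr = np.sqrt(2 * np.log(len(signal))) * sigma             # Equation (5)
    # Soft-threshold the detail coefficients, Equation (6); approximations are kept
    denoised = [coeffs[0]] + [pywt.threshold(d, thr, mode='soft') for d in coeffs[1:]]
    return pywt.waverec(denoised, wavelet), denoised

reconstructed, coeffs = denoise(np.random.randn(256))
print(sum(int(np.count_nonzero(c == 0)) for c in coeffs[1:]), 'coefficients set to zero')
```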


However, we apply a more general approach that uses a two-stage model-building strategy. The first stage is to identify the important process variables and also the wavelet levels for those variables (called the screening model). The second stage generates a predictive model based on features extracted from only the process variables and wavelet levels retained by the screening model. One approach to screening is based on the energy of the wavelet coefficients [21] at each level of decomposition. Energy is defined as the sum of squares of the wavelet coefficients d at level j:

$$E_j=\sum_{k} d_{j,k}^{2} \qquad (7)$$

where the sum runs over the k coefficients at level j [21]. For example, consider a level-3 decomposition of a signal with 256 data points. This results in 32 approximation coefficients at level 3 (a3), 32 detail coefficients at level 3 (d3), 64 detail coefficients at level 2 (d2) and 128 detail coefficients at level 1 (d1). Using Equation (7), a total of four energy statistics are defined: one for the 32 a3 coefficients, one for the 32 d3 coefficients, one for the 64 d2 coefficients and one for the 128 d1 coefficients. Not only are these energy statistics useful to reduce dimensionality, but they are also relatively insensitive to minor misalignments of the signals.
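A short sketch of these energy features for one lot follows; the array shapes match the application in Section 4 (29 variables, 256 points each), and the helper name is illustrative.

```python
# Energy statistics (Equation (7)): one sum of squares per coefficient block
# (a3, d3, d2, d1) per variable, giving 4 x 29 = 116 features per lot.
import numpy as np
import pywt

def energy_features(lot, wavelet='haar', level=3):
    """lot: array of shape (n_variables, n_points), e.g. (29, 256)."""
    feats = []
    for profile in lot:
        coeffs = pywt.wavedec(profile, wavelet, level=level)    # [a3, d3, d2, d1]
        feats.extend(float(np.sum(c ** 2)) for c in coeffs)
    return np.array(feats)

print(energy_features(np.random.randn(29, 256)).shape)          # (116,)
```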

The primary function of any predictive or classification model is to accurately predict some output based on the input variables. However, it is just as valuable to learn which input variables are important to providing these results. In most applications, this latter function can provide significant insights into the mechanisms of the process. Hence, our aim is to build a model that provides accurate results as well as an effective way to select among the set of input features. Amongst the many supervised learners, support vector machines (SVM) [22] and RF provide consistently accurate predictions for a wide array of applications [17]. However, RF provides an inherent capability to estimate variable importance. Also, it can handle a large number of input variables, is relatively insensitive to noise and provides an internal unbiased estimate of the test set error from the OOB cases. The internal estimate is particularly useful for our two-stage strategy to evaluate results from the screening stage. For these reasons, we chose RF as our method for the screening and predictive models.

Consequently, the first model is a screening model based on energy statistics. The objective of this model is to identify variables and the corresponding energy levels that are important for prediction. Therefore, a modeling method that has the capability to identify important features is used in this first (screening) stage. Using the information from the screening model, a second, more detailed predictive model is then built. The inputs for this model are the approximation and/or detail wavelet coefficients from the energy levels that were identified as important in the screening model. The intent is to identify important process variables through the lower-dimensional energy statistics, and then identify characteristics of the variables for the final model.

A variation of this two-stage strategy is to use only approximation coefficients in the screening model. This model can be used to screen important process variables rather than important coefficients. Hence, for a process variable to be retained for use as input for the next modeling stage, at least one of its approximation coefficients must be identified as important. The inputs for the predictive model would then include the approximation and detail coefficients at all levels for these important variables. Depending on the application, other two-stage strategies can be designed and evaluated. For this research, we provide a comparison of the two approaches outlined above as well as a model from the full, original variable profiles.
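The following sketch strings the pieces together for the energy-statistic variant of the two-stage strategy. The shapes, the importance cut-off (top_k) and the helper names are assumptions for illustration, not the exact procedure used in Section 4.

```python
# Two-stage modeling: RF screening on 116 energy statistics, then an RF
# predictive model on the wavelet-coefficient blocks behind the selected
# (variable, level) pairs. scikit-learn and PyWavelets are assumed.
import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier

def wavelet_blocks(lot, wavelet='haar', level=3):
    """Per variable, the list of coefficient blocks [a3, d3, d2, d1]."""
    return [pywt.wavedec(profile, wavelet, level=level) for profile in lot]

def two_stage_fit(lots, y, top_k=4):
    # Stage 1: screening model on energy statistics (4 per variable)
    energies = np.array([[float(np.sum(c ** 2)) for blocks in wavelet_blocks(lot)
                          for c in blocks] for lot in lots])
    screen = RandomForestClassifier(n_estimators=5000, max_features='sqrt',
                                    oob_score=True, random_state=0).fit(energies, y)
    keep = np.argsort(screen.feature_importances_)[::-1][:top_k]   # (variable, level) pair indices

    # Stage 2: predictive model on the coefficients behind the retained pairs
    X2 = np.array([np.concatenate([wavelet_blocks(lot)[idx // 4][idx % 4]
                                   for idx in keep]) for lot in lots])
    final = RandomForestClassifier(n_estimators=5000, max_features='sqrt',
                                   oob_score=True, random_state=0).fit(X2, y)
    return screen, keep, final
```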

4. Application

We demonstrate the proposed method on the PM cycle prediction problem introduced in Section 1. Processing a single wafer lot (batch) requires inputs from 29 variables. Each variable is itself a profile consisting of approximately 300 data points, thus resulting in a total of 29×300 = 8700 measurements per lot. The objective of this study is to identify variables that provide contrast between the time signatures of lots that are due for maintenance and lots post maintenance. Moreover, we would also like to identify time windows during the operation in which to track these variables. Once identified, these time windows can be monitored to provide online feedback on the health of the tool.

Figure 1 illustrates that the profiles between lots are out of alignment. The signals were aligned using the cross-correlation-based alignment algorithm discussed in Section 3. After alignment, only the central 256 points were retained for each variable to prevent edge effects due to the shifting of the signals. The aligned signals for variables 1 and 2 are shown in Figure 2.

To obtain wavelet coefficients, the profiles from each variable were decomposed using Haar wavelets at three levels. For these data, level 3 corresponded to a sufficiently small time interval to capture the important features; the methodology may be applied in the same manner with other choices. A universal threshold value was set using Equation (5). The wavelet coefficients were then denoised using soft thresholding (Equation (6)), which reduced 4798 of the wavelet coefficients to zero. Hence, the original 7424 = 29×256 dimensions were reduced to 2626.

To train the RF classification model, a total of 182 lots were used: 92 were prior to maintenance, while 90 lots were post maintenance. To validate the model, a test set consisting of 88 lots was used. The test set consisted of 42 lots due for maintenance and 46 post-maintenance lots. The response variable consisted of two classes, namely, due for maintenance and post maintenance. For the RF parameters, throughout our experiments, we applied the recommended default for the number of variables tried at each split equal to the square root of the number of input variables, i.e. m = √M [17]. The number of trees to be grown per run influences the stability of the variable importance scores; the higher the number of trees, the better the stability. Therefore, we set this value at 5000 trees throughout the experiments. A smaller value would no doubt be acceptable, but the runs are suitably fast (seconds) even for the larger value.


Figure 1. Original profiles for variable 1 (a) and variable 2 (b)

Figure 2. Aligned profiles for variable 1 (a) and variable 2 (b)

Table I. Variable importance scores for model from energy statistics

Variable    Level
1           Approximation level 3, detail level 3
15          Approximation level 3
23          Approximation level 3

The simplest approach was to use the aligned variables as inputs for the RF classifier. This method resulted in a test error rate of 10.23%. Two other approaches, one using energy statistics and the other using only approximation coefficients, are discussed in detail below.

4.1. Screening with energy statistics

In this approach, the energy sum of squares was computed from the wavelet coefficients at each level of decomposition for each variable using Equation (7). Since decomposition was carried out at level 3, there were a total of 4 energy statistics for each variable. Thus, the data could be summarized in 4×29 = 116 dimensions. The energy statistics were then used as inputs for a screening model built using an RF. A total of 5000 trees were built, with each tree considering a maximum of 10 variables to split a node. The OOB estimate of error on the training set was 2.75%. Important variables and their corresponding levels can be identified using the variable importance scores in Table I.


Table II. Error matrix from test data based on the predictive model from screening with energy statistics

Actual ↓    Predicted Due    Predicted Post
Due               42                0
Post               8               38

Table III. Variable importance for model from approximation coefficients at level 3 for variables 1, 15 and 23 and detail coefficients at level 3 for variable 1

Coefficient    Approximation/detail       Variable    Corresponding time window
10             Approximation level 3      1           73-80
11             Approximation level 3      1           81-88
14             Approximation level 3      1           105-112
15             Approximation level 3      1           113-120
16             Approximation level 3      1           121-128
25             Approximation level 3      1           193-200
26             Approximation level 3      1           201-208
27             Approximation level 3      1           209-216
28             Approximation level 3      1           217-224
32             Approximation level 3      1           249-256

Figure 3. Reconstructed signal for variable 1 (screening with energy statistics)

A second model was built using the wavelet coefficients for the identified variables and their levels; that is, 32 a3 coefficients each for variables 1, 15 and 23, and 32 d3 coefficients for variable 1. Hence, only 128 dimensions were used as compared with the original 7424. Since this model uses wavelet coefficients rather than summary statistics, it has greater time resolution. Also, since wavelet coefficients can be traced to time intervals of the original signal, it can pinpoint features in the original signal that contribute to the classification. This model would be the final predictive classification model. The RF model with 5000 trees was built with 11 variables being tried at each split. The test error rate was 9.09% (Table II). This represents an 11.14% reduction in test error rate over the 10.23% error rate of the model using all variables. Table III identifies the important coefficients and their corresponding time windows. To visualize the features, a denoised signal was reconstructed from the wavelet coefficients using the inverse wavelet transformation. Figure 3 highlights the time window of the signal that corresponds to the important wavelet coefficients.
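The time windows in Table III follow directly from the coefficient indices: at level 3 each Haar approximation coefficient summarizes a block of 8 consecutive points of the 256-point signal. A small illustrative helper (the function name is ours) reproduces that mapping.

```python
# Map a 1-based level-3 coefficient index to the 8-point time window it summarizes.
def level3_time_window(coef_index, block=8):
    start = (coef_index - 1) * block + 1
    return start, start + block - 1

print(level3_time_window(10))   # (73, 80), matching the first row of Table III
```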

4.2. Screening with approximation coefficients

Because the variable profiles tend towards piecewise constant over time, it was assumed that most of the information could be captured by the approximation coefficients. Hence, a screening model was trained only on the approximation coefficients corresponding to level 3 for all variables. Thus, a total of 32×29 = 928 input variables were used. The objective of this model was to identify significant variables rather than significant coefficients. Even if only one coefficient from a variable was identified as being significant, that variable was retained for further investigation. The number of trees built was 5000, with 29 variables tried at each split node.


Table IV. Error matrix from test data based on the predictive model with screening on approximation coefficients

Actual ↓    Predicted Due    Predicted Post
Due               42                0
Post               6               40

Table V. Variable importance for the model using wavelet coefficients at level 3 for variables 1, 15, 22 and 23

Coefficient    Approximation/detail       Variable    Corresponding time window
10             Approximation level 3      1           73-80
11             Approximation level 3      1           81-88
14             Approximation level 3      1           105-112
15             Approximation level 3      1           113-120
16             Approximation level 3      1           121-128
22             Approximation level 3      1           169-176
23             Approximation level 3      1           177-184
24             Approximation level 3      1           185-192
26             Approximation level 3      1           201-208
27             Approximation level 3      1           209-216
28             Approximation level 3      1           217-224
31             Approximation level 3      1           241-248
32             Approximation level 3      1           249-256
16             Detail level 3             1           121-128
27             Detail level 2             1           105-108
31             Detail level 2             1           121-124

Figure 4. Reconstructed signal for variable 1 (screening with approximation coefficients)

The OOB estimate of error on the training data was 1.1%. Using the variable importance scores, variables 1, 15, 22 and 23 were identified as important.
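A hedged sketch of this variant of screening (a variable is retained if any of its level-3 approximation coefficients ranks among the most important features) is given below; the cut-off of 50 top features and the helper name are our assumptions.

```python
# Screening on level-3 approximation coefficients only: keep a variable if any
# of its 32 a3 coefficients is highly ranked by RF variable importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def screen_variables(A3, y, coeffs_per_var=32, top_n=50):
    """A3: array of shape (n_lots, n_variables * coeffs_per_var) of a3 coefficients."""
    rf = RandomForestClassifier(n_estimators=5000, max_features='sqrt',
                                oob_score=True, random_state=0).fit(A3, y)
    top = np.argsort(rf.feature_importances_)[::-1][:top_n]
    keep = sorted({int(idx) // coeffs_per_var for idx in top})   # 0-based variable indices
    return keep, rf.oob_score_
```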

To identify significant coefficients and to improve the resolution of the model, all wavelet coefficients from a level-3 decomposition of variables 1, 15, 22 and 23 were used to train the model. This translates into 1024 input variables. This RF classification model used 5000 trees, with 32 variables tried at each split. The test error rate decreased to 6.82% (Table IV), which is a 33.3% decrease from the error rate of the model using the profiles from all variables. Table V identifies the important coefficients and the corresponding time periods of the variables.

Figure 4 indicates the time periods of the reconstructed signal that contribute significantly to the classification of the pre- and post-maintenance lots.

5. Conclusion

For the high-dimensional data considered here, latent variables are often difficult to interpret, thus the need to focus on alternatives. But with thousands of measurements, even wavelet coefficients require further analysis.


The proposed two-stage approach is simple but effective for extracting information from the time series for predictive models. We also point out that the ability to detect important variables is an important element in a high-dimensional analysis such as this, and a modern random forest classifier was used in this role. The important practical issue of alignment of data between batches is also considered. Two approaches to using wavelet coefficients as inputs to RF trees for classifying lots were demonstrated. The first approach utilizes energy statistics to build a screening model in the first stage, followed by a predictive model built on wavelet coefficients. The second approach used approximation coefficients at a specified level from all variables as input to the first-stage screening model; approximation and detail coefficients at all levels for the important variables were then used to build the predictive model. Test error rates were shown to be substantially improved by the two-stage procedure. The two-stage methods presented here are prototypes and, certainly, related alternatives for this type of methodology are feasible.

In a manufacturing application, such as PM decisions, numerous process variables may need to be evaluated for appropriate decision-making. In our illustrative example, 29 time series (each with several hundred measurements) were available from each batch. The relevant information to predict the need for maintenance had to be extracted from this high-dimensional collection of inputs. We illustrated how model performance could be improved so that better maintenance decisions could be made. Furthermore, we illustrated that effective models could be derived from only a subset of the input measurements available. Consequently, simpler models for maintenance decisions could be applied. In addition, process engineers could focus on the important variables/features for a better understanding of the signatures that indicate that maintenance is needed, and this in turn could suggest process improvements.

References

1. Blanchard B, Fabrycky W. Systems Engineering and Analysis (3rd edn). Prentice-Hall: New York, 1998.
2. Ebeling E. An Introduction to Reliability and Maintainability Engineering. McGraw-Hill: New York, 1997.
3. Tsang C. Condition-based maintenance: Tools and decision-making. Journal of Quality in Maintenance Engineering 1995; 1(3):3--17.
4. Rajan S, Roylance J. Condition-based maintenance: A systematic method for counting the cost and assessing the benefits. Journal of Process Mechanical Engineering 2000; 214:97--108.
5. Matzelevich M. Real-time condition based maintenance for high value systems. The Shock and Vibration Digest 2001; 33(5):387--443.
6. Saranga H. Relevant condition-parameter strategy for an effective condition-based maintenance. Journal of Quality in Maintenance Engineering 2002; 8(1):92--105.
7. Lu S, Tu Y, Lu H. Predictive condition-based maintenance for continuously deteriorating systems. Quality and Reliability Engineering International 2007; 23(1):71--81.
8. Jolliffe I. Principal Component Analysis (2nd edn). Springer: New York, 2002.
9. Box G, Jenkins G. Time Series Analysis: Forecasting and Control. Holden-Day: San Francisco, 1970.
10. Bochner S, Chandrasekharan K. Fourier Transforms. Annals of Mathematics Studies, vol. 19. Princeton University Press: Princeton, 1949.
11. Gabor D. Theory of communication. Journal of the Institution of Electrical Engineers 1946; 93:429--457.
12. Mallat S. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 1989; 11:674--693.
13. Donoho D, Johnstone I, Kerkyacharian G, Picard D. Wavelet shrinkage: asymptopia? Journal of the Royal Statistical Society, Series B (Methodological) 1995; 57(2):301--369.
14. Strang G, Nguyen T. Wavelets and Filter Banks. Wellesley-Cambridge Press: Wellesley, MA, 1996.
15. Burrus C, Gopinath R, Guo H. Introduction to Wavelets and Wavelet Transforms: A Primer. Prentice-Hall: Upper Saddle River, NJ, 1998.
16. Donoho D. De-noising by soft-thresholding. IEEE Transactions on Information Theory 1995; 41:613--627.
17. Breiman L. Random forests. Machine Learning 2001; 45(1):5--32.
18. Breiman L, Friedman J, Olshen R, Stone C. Classification and Regression Trees. Wadsworth: Belmont, MA, 1984.
19. Kruskal J, Liberman M. The symmetric time-warping problem: From continuous to discrete. In Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison. CSLI Publications: Stanford, 1999; 125--161.
20. Donoho D, Johnstone I. Ideal spatial adaptation by wavelet shrinkage. Biometrika 1994; 81(3):425--455.
21. Pittner S, Kamarthi S. Feature extraction from wavelet coefficients for pattern recognition tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 1999; 21:83--89.
22. Vapnik V. Statistical Learning Theory. Wiley: New York, NY, 1998.

Authors’ biographies

Amit Shinde is a doctoral student in the Industrial Engineering program at the School of Computing, Informatics and Decision Systems Engineering, Arizona State University. He received his BE degree in Mechanical Engineering from the University of Mumbai and his MS degree in Industrial Engineering from Arizona State University. His research interests include real-time process monitoring and the application of data mining tools towards supply chain modeling.

George Church received his BS degree in Finance and an MS degree in Statistics from Arizona State University. From 1988 to 2002 he worked at American Express in Phoenix, Arizona. Since 2004 he has been a software engineer at Intel Corporation's Arizona site working on statistical process control applications. His interests are in the general areas of statistical model building and both supervised and unsupervised learning methods.

Mani Janakiram is a Director/Principal Engineer of Supply Chain Strategy at Intel and, in his 11+ years at Intel, he has managed several projects in supply chain, strategy roadmap, modeling, capacity planning, process control, analytics and research. He has 20+ years of experience (including Honeywell and Motorola), has published 50+ papers and holds two patents. Mani is a Six Sigma Master Black Belt and has served on several committees including ITRS FI, AZ Tech Council, Stanford AIM, ISMI, NSF research panels and Factory Systems of SRC. He is also an adjunct professor at the Thunderbird School of Global Management.


Mani holds a PhD and an MS in Industrial & Systems Engineering from Arizona State University, an MBA from the Thunderbird School of Global Management and a BS in Mechanical Engineering.

George C. Runger, PhD, is a Professor in the School of Computing, Informatics, and Decision Systems Engineering at Arizona State University. His research is on machine learning, real-time monitoring and control, and other data-analysis methods with a focus on large, complex, multivariate data streams. His work has been funded by grants from the National Science Foundation and other national and state organizations, as well as corporations. In addition to academic work, he was a senior engineer at IBM. He holds degrees in Industrial Engineering and Statistics.
