
A Constraint-Based Genetic Algorithm Approach for Rule Induction

Abstract

Data mining aims to discover hidden valuable knowledge from a database. Existing genetic algorithms (GAs) designed for rule induction evaluate the rules via a fitness function. The major drawbacks of applying GAs to rule induction include computational inefficiency, limited accuracy, and restricted rule expression. In this paper we propose a constraint-based genetic algorithm (CBGA) approach to reveal more accurate and significant prediction rules. This approach allows the constraints to be specified as relationships among attributes according to predefined requirements, the user's preferences, or partial knowledge in the form of a constraint network. Constraint-based reasoning is employed to produce valid chromosomes, using constraint propagation to assure that the genes comply with the predefined constraint network. The proposed approach is compared with a regular GA using a medical data set. Better computational efficiency and more accurate prediction results from the CBGA are demonstrated.

Keywords: Constraint-Based Reasoning; Genetic Algorithms; Rule Induction

1. INTRODUCTION

Revealing valuable knowledge hidden in corporate data becomes more critical for enterprise decision making. As more data is collected and accumulated, extensive data analysis is not easy without effective and efficient data mining methods. In addition to statistical and other machine learning methods, the recent development of novel or improved data mining methods such as Bayesian networks [7], frequent patterns [30], decision or regression trees [13,14], and evolutionary algorithms [2,4,8,11] has drawn more attention from academics and industry.

Rule induction is one of the most common forms of knowledge discovery. It is a method for discovering a set of "If/Then" rules that can be used for classification or estimation. That is, rule induction is able to convert the data into a rule-based representation that can be used either as a knowledge base for decision support or as an easily understood description of the system behavior. Basically, rule induction features the capability to search for all possible interesting patterns in data sets. Most rule induction methods discover the rules by adopting local heuristic search techniques. Some other rule induction methods employ global search techniques, such as the genetic algorithm (GA) [17], which evaluates the entire rule set via a fitness function rather than evaluating the impact of adding or removing one condition to or from a rule. However, the GA's major drawback is its heavy computation load when the search space is huge.


In general, a rule induction method aims to produce a rule set that can predict the expected outcomes as accurately as possible. However, the emphasis on revealing novel or interesting knowledge has become a recent research issue in data mining. These concerns could result in additional rule discovery constraints and thereby produce additional computation overhead. In regular GA operations, constraint validation proceeds after a candidate chromosome is produced. That is, several iterations may be required to determine a valid chromosome. One way to reduce the computation load is to prevent invalid chromosomes from being produced in the first place, thereby accelerating the evolution process. In other words, this can be achieved by embedding a well-designed constraint mechanism into the chromosome-encoding scheme.

In this research we propose a novel approach that integrates constraint-based reasoning with GAs to discover rule sets. Constraint-based reasoning is a process that incorporates various inference techniques including local propagation, backtrack-free search, and tree-structured reduction. The constraint-based reasoning mechanism is used to push constraints, along with data insights, into the rule set construction. This research applies hybrid techniques from local propagation and tree search approaches to assure local consistency before continuing the search in a GA process. Local propagation can remove from the search space those gene values that cannot meet the predefined constraints. This approach allows constraints to be specified as relationships among attributes according to predefined requirements, user preferences, or partial knowledge in the form of a constraint network. In essence, this approach provides a chromosome-filtering mechanism prior to generating or evaluating a chromosome. Thus insignificant or irrelevant rules can be precluded in advance via the constraint network.
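As an illustration of this idea only (not the authors' implementation), the following Python sketch prunes gene domains by local propagation over a toy constraint network; the function name prune_domains and the example constraint are hypothetical.

```python
from itertools import product

def prune_domains(domains, constraints):
    """Remove gene values that cannot appear in any assignment satisfying all constraints."""
    names = list(domains)
    changed = True
    while changed:                          # iterate to a fixed point
        changed = False
        for g in names:
            other_names = [o for o in names if o != g]
            kept = []
            for v in domains[g]:
                # keep v only if some combination of the remaining genes satisfies every constraint
                ok = any(
                    all(c({g: v, **dict(zip(other_names, combo))}) for c in constraints)
                    for combo in product(*(domains[o] for o in other_names))
                )
                if ok:
                    kept.append(v)
            if kept != domains[g]:
                domains[g], changed = kept, True
    return domains

# Toy constraint in the spirit of "Drug A requires high blood pressure"
domains = {"BP": ["Normal", "Low"], "Drug": ["A", "C"]}
constraints = [lambda a: not (a["Drug"] == "A" and a["BP"] != "High")]
print(prune_domains(domains, constraints))   # the value "A" is pruned from the Drug domain
```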

Propositional logic [11] is a popular representation used in rule induction systems. However, our proposed approach allows first order logic to formulate the knowledge in the form of linear inequations. This enhances the expressive power of the rule set to model the data behavior in a more effective way.

The remainder of this paper is organized as follows. In Section 2, previous research works and related techniques are reviewed. The detailed constraint-based GA procedure is then introduced in Section 3. Section 4 presents the experiments and results with medical data records followed by discussion and conclusions.

2. Literature Review

2.1 Genetic Algorithm for Rule Induction

Rule induction methods may be categorized into either tree based or non-tree based methods. Quinlan [29] introduced techniques to transform an induced decision tree into a set of production rules. Some of the often-mentioned decision tree induction methods include the C4.5 [29], CART [3] and GOTA [16] algorithms.

Michalski et al. [24] proposed AQ15 algorithms to generate a disjunctive set of classification rules. The CN2 rule induction algorithms also use a modified AQ algorithm that involves a top-down beam search procedure [6]. It adopts entropy as its search heuristic and is only able to generate an ordered list of rules. The Basic Exclusion Algorithm (BEXA) is another type of rule induction method proposed by Theron and Cloete [30]. It follows a general-to-specific search procedure in which disjunctive conjunctions are allowed. Every conjunction is evaluated using the Laplace error estimate. More recently Witten and Frank [31] described covering algorithms for discovering rule sets in a conjunctive form.

GAs have been successfully applied to data mining for rule discovery. Greene and Smith [15] and Noda et al. [26] proposed algorithms that used one-rule-per-individual encoding. In this approach, a chromosome can usually be identical to a linear string of rule conditions, where each condition is often an attribute-value pair. Although the individual encoding is simpler and syntactically shorter, the problem is that the fitness of a single rule is not necessarily the best indicator of the quality of the discovered rule set. The several-rules-per-individual approach [9,20] has the advantage of considering its rule set as a whole by taking the rule interactions into account. However, this approach makes the chromosome encoding more complicated and syntactically longer, usually requiring more complex genetic operators.

Hu [19] proposed a Genetic Programming (GP) approach in which a program can be represented by a tree with rule conditions and/or attribute values in the leaf nodes and functions in the internal nodes. The challenge is that a tree can grow in size and shape in a very dynamic way. An efficient tree-pruning algorithm would be required to prune unsatisfactory parts from the tree to avoid infeasible solutions. Bojarczuk et al. [2] proposed a constrained–syntax GP approach to build a decision model. The emphasis was on the discovery of comprehensible knowledge. The constraint-syntax mechanism was applied to verify the relationship between operators and operand data types during tree building.

To discover high-level prediction rules, Freitas [11] applied first-order relationships such as “Salary > Age” by checking an Attribute Compatibility Table (ACT) during the discovery process with GA-Nuggests. ACT was claimed especially effective in knowledge representation. However, other complicated attribute relationships such as the linear or non-linear quantitative relationships among multiple attributes were not discussed in the ACT.


2.2 Constraint Satisfaction in Genetic Algorithms

Constraint satisfaction involves finding values for problem variables subject to constraints on acceptable solutions. Constraint satisfaction problems (CSPs) belong to the class of NP-complete problems and generally lack efficient general-purpose solution methods. A number of different approaches have been developed for solving CSPs. Some adopt constraint propagation to reduce the solution space. Others try backtracking to directly search for possible solutions. Some apply a combination of these two techniques, including tree-search and consistency algorithms, to efficiently determine one or more feasible solutions. Nadel [25] compared the performance of several algorithms including 'generate and test', 'simple backtracking', 'forward checking', 'partial lookahead', 'full lookahead', and 'really full lookahead'. The major difference among these algorithms is the degree of consistency enforced at each node during the tree-solving process. Apart from the 'generate and test' method, the others perform hybrid techniques. In other words, whenever a new value is assigned to a variable, the domains of all unassigned variables are filtered and left only with those values that are consistent with the variable being assigned. If the domain of any of these uninstantiated variables becomes empty, a contradiction is recognized and backtracking occurs. Freuder [12] showed that if a given CSP has a tree-structured graph, then it can be solved without any backtracking. That is, solutions can be retrieved in a backtrack-free manner. Dechter and Pearl [10] used this theory coupled with the notion of directional consistency to generate backtrack-free solutions more efficiently.

Dealing with constraints for search space reduction is an important research issue in many artificial intelligence areas. GAs maintain a set of chromosomes (solutions) called a population. The population consists of parents and offspring. As the evolution process proceeds, the best N chromosomes in the current population are selected as parents. Through genetic operators, offspring are selected according to a filtering criterion that is usually expressed as a fitness function along with some predefined constraints. The GA evolves over several generations until the stopping criteria are met. However, valid chromosomes are usually produced by trial and error. That is, a candidate chromosome is produced and then tested against the filtering criterion. Therefore a GA may require more computation, especially when dealing with complicated or severe filtering criteria. To resolve this problem, an effective chromosome construction process can be applied to the initialization, crossover, and mutation stages.
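The contrast can be sketched as follows; both helper functions are illustrative assumptions rather than the paper's code. A regular GA draws and tests whole candidates, while a constraint-filtered construction samples each gene only from a domain that has already been pruned by the constraint network.

```python
import random

def generate_and_test(sample, is_valid, max_tries=1000):
    """Regular GA style: draw a whole candidate chromosome, then test it against the constraints."""
    for _ in range(max_tries):
        candidate = sample()
        if is_valid(candidate):
            return candidate
    raise RuntimeError("no valid chromosome found within the trial budget")

def constrained_construct(filtered_domains):
    """CBGA style: sample each gene only from a domain already filtered by the constraint network."""
    return {gene: random.choice(values) for gene, values in filtered_domains.items()}

# Usage sketch: the constrained construction never spends fitness evaluations on invalid candidates
chromosome = constrained_construct({"BP": ["High"], "Drug": ["A", "B"]})
```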

Garofalakis [13] provided a model constraint-based algorithm inside the mining process to specify the expected tree size and accuracy. In other words, constraints can be used to express the trade-off between model accuracy and the computational efficiency associated with the tree-building or tree-pruning process. In similar research, CADSYN [23] adopted a constraint satisfaction algorithm for case adaptation in case-based reasoning. Purvis [28] applied a repair-based constraint satisfaction algorithm to aid in case adaptation. By pushing constraints into the case adaptation process, the constraint satisfaction algorithm is able to help a CBR system produce good solution cases more efficiently for discrete constraint problems.

Barnier and Brisset [1] developed a hybrid system of a genetic algorithm and constraint satisfaction techniques. This method was adopted to resolve optimization problems for vehicle routing and radio link frequency assignment. Their approach applied a GA to reduce the search space for a large CSP rather than applying constraint satisfaction techniques to improve the GA's computational efficiency. Kowalczyk [21] introduced the concept of using constraint satisfaction principles to support a GA in handling constraints. However, few studies have applied constraint-based reasoning to effectively handle the GA's computational inefficiency or examined how the user's knowledge can be represented and processed within a constraint network.

3. The Proposed Rule Induction System

We developed a rule induction system that consists of three modules: the user interface, the symbol manager, and the constraint-based GA (CBGA). According to Fig. 1, the user interface module allows users to execute the following system operations: loading a constraint program, adding or retracting constraints, controlling the GA's parameter settings, and monitoring the best solutions.

The constraint program here is a set of first order logic sentences (atomic, compound, or quantified) about a many-sorted universe of discourse that includes integers, real numbers, and arbitrary application-specific sorts. The details can be found in Lai [22].

The symbol manager examines the syntax of the first order logic sentences in the constraint program and translates the syntax into a constraint network for further processing.

In the CBGA module, the constraint-based reasoning filters each gene value and processes both the GA initial population and regular populations. To speed up the reasoning process, both the variable ordering [27] and backtrack-free search [12] methods are adopted in the CBGA to derive contradiction-free chromosomes.

(INSERT Fig. 1. The Conceptual Diagram of the Proposed Rule Induction System)

As shown in Fig. 2, each chromosome Ci of a regular population with size N is processed by constraint-based reasoning in sequence. The genes in a chromosome can be viewed as a subset of variables. The valuation scope of each gene gij is restricted via constraint-based reasoning. This system efficiently transforms human knowledge, such as expert experience or common sense, into a constraint network, leading to more significant rule sets. The search effort for valid chromosomes can be reduced without having to activate the fitness evaluation procedure on each candidate chromosome.

(INSERT Fig. 2. The Framework of Constraint-Based Preprocessing for GA Operators)

Fig. 3 illustrates the details of the chromosome construction process. In essence, the CBGA applies local propagation for chromosome filtering. The basic concept of local propagation [12] involves using the information local to a constraint to validate gene values. Local propagation also restricts the valid gene range referenced in the constraint. The satisfactory gene values (i.e., the satisfactory universe) are propagated through the network, enabling other constraints to restrict the valid ranges for the remaining genes.

According to the restricted valid range denoted by the satisfactory universe SGj, the valuation process for gene gij is then activated to examine the inferred gene value. If the inferred gene value is inconsistent with the constraint network, it is replaced by a new value g'ij randomly selected from SGj. By repeatedly applying local propagation and the valid valuation process to chromosome Ci in the sequence gi1, gi2, ..., gim, the new chromosome C'i is able to satisfy the constraint network. As a result, local propagation offers an efficient way to guide the GA toward the best chromosome by reducing the search space already filtered by the constraint network.

(INSERT Fig. 3. The Detail Illustration for Chromosome Screening)
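A minimal sketch of the screening step described above, assuming the satisfactory universe SGj is computed by filtering each gene's domain against the constraints already satisfied by earlier genes; the helper names and the toy constraint are illustrative.

```python
import random

def screen_chromosome(chromosome, gene_order, domains, constraints):
    """Repair genes in sequence: propagate earlier assignments and, if the current
    gene value is inconsistent with the constraint network, replace it with a
    random value from the remaining satisfactory universe SGj."""
    repaired = {}
    for gene in gene_order:                                   # g_i1, g_i2, ..., g_im
        # SGj: values of this gene still consistent with the genes fixed so far
        sg_j = [v for v in domains[gene]
                if all(c({**repaired, gene: v}) for c in constraints)]
        value = chromosome[gene]
        if value not in sg_j:                                 # the inferred value is inconsistent
            value = random.choice(sg_j)                       # assumes SGj is non-empty
        repaired[gene] = value
    return repaired

# Toy example in the spirit of "Drug A requires high blood pressure"
domains = {"BP": ["High", "Normal", "Low"], "Drug": ["A", "C"]}
constraints = [lambda a: not (a.get("Drug") == "A" and a.get("BP") != "High")]
print(screen_chromosome({"BP": "Low", "Drug": "A"}, ["BP", "Drug"], domains, constraints))
# -> {'BP': 'Low', 'Drug': 'C'}  (the invalid value 'A' was repaired)
```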

4. Design of the CBGA for Rule Induction

4.1 The Medical Data Set

A synthetic medical database contains patient information including age, sex, blood pressure (BP), cholesterol (Cho) status, Na and K values, and the quantity (Qty) and frequency (Freq) for taking a drug. The prediction attribute is one of five drug types: Drug A, Drug B, Drug C, Drug D, and Drug E.

This data set assumes that physicians may prefer a certain medical prescription expressed in the following description:

Drug A and Drug B are used if the blood pressure is high. If the blood pressure is low and cholesterol is high, then Drug C is used. If cholesterol is low and Drug D is being used, the suggested quantity is up to 3 units. If the blood pressure or cholesterol is high, Drug E is used under the condition that "the frequency is less than twice a day."

4.2 Knowledge Representation as the First Order Logic

The above medical professional preferences can be considered predefined rules that can be represented by a constraint program. The rules can be represented using propositional logic, which allows a data mining algorithm to perform more efficiently. However, propositional logic limits the expressive power of the rule set to model the data behavior. For example, a condition between two different attributes, such as "Na > K", cannot be expressed. First order logic representation overcomes this limitation, but it usually requires more computational effort to search for any possible relationship between two attributes.

In the CBGA the relationship between any two attributes can be formulated in first order logic, except for the relationship between quantitative and qualitative attributes. For instance, age with a numerical value (1~100) cannot be directly compared with a BP status that uses the symbolic values {High, Normal, Low}. To account for the different numerical ranges among attributes, we propose a weighted linear form such as "Attribute i <= w * Attribute j" to reveal further data insights not available from conventional classification techniques such as decision trees.

In addition to allowing the relationship between any two attributes to be represented, rule sentences based on first order logic that model common human knowledge can be easily formulated using a constraint program. The program is then translated into a constraint network to be processed by the constraint-based inference engine developed in our earlier work [18,22]. In this example the predefined rules can be described using the following universal quantifications.

ALL X: DrugA(X) and X.BP=High;
ALL X: DrugB(X) and X.BP=High;
ALL X: DrugC(X) and X.BP=Low and X.Cho=High;
ALL X: DrugD(X) and X.Cho=Low implies X.Qty=3;
ALL X: DrugE(X) and (X.Cho=High or X.BP=High) implies X.Freq<=2;
Domain CompareVar =:= {Age, Na, K, Qty, Freq};
ALL X, Y: CompareVar(X) and CompareVar(Y) and X<>Y.
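For illustration only, the quantified sentences above could be checked against a single patient record roughly as follows; the dictionary keys and the implication encoding are assumptions, not the system's actual constraint-engine API.

```python
# Illustrative only: the quantified sentences above written as Python predicates
# over one record X (a dict of attribute values); keys such as "Drug" are assumed.
rules = [
    lambda X: X["Drug"] != "A" or X["BP"] == "High",                 # Drug A is used only with high BP
    lambda X: X["Drug"] != "B" or X["BP"] == "High",                 # Drug B is used only with high BP
    lambda X: X["Drug"] != "C" or (X["BP"] == "Low" and X["Cho"] == "High"),
    lambda X: X["Drug"] != "D" or X["Cho"] != "Low" or X["Qty"] <= 3,   # "up to 3 units" per the prose (the program writes X.Qty=3)
    lambda X: X["Drug"] != "E" or not (X["Cho"] == "High" or X["BP"] == "High") or X["Freq"] <= 2,
]

def satisfies_constraint_network(record):
    """True when the record violates none of the predefined constraints."""
    return all(rule(record) for rule in rules)

print(satisfies_constraint_network(
    {"Drug": "E", "BP": "High", "Cho": "Low", "Qty": 1, "Freq": 2}))   # True
print(satisfies_constraint_network(
    {"Drug": "A", "BP": "Low", "Cho": "Low", "Qty": 1, "Freq": 1}))    # False (Drug A needs high BP)
```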

4.3 A Chromosome Encoding

In this method a rule set is encoded as a chromosome based on the several-rules-per-individual approach. A rule set describes all possible conditions associated with a single drug type and consists of at least one rule. Each drug type possesses its own rule set, typically represented in the following format:

("IF cond1,1 AND ... AND cond1,n THEN drug=A" or
 "IF cond2,1 AND ... AND cond2,n THEN drug=A" or
 "IF cond3,1 AND ... AND cond3,n THEN drug=A")
and
("IF cond4,1 AND ... AND cond4,n THEN drug=B" or
 "IF cond5,1 AND ... AND cond5,n THEN drug=B")
... and
("IF condm-1,1 AND ... AND condm-1,n THEN drug=E" or
 "IF condm,1 AND ... AND condm,n THEN drug=E")

For example, the rule set for Drug A can be encoded in the following rules shown in Table 1.

IF (Age>=25) AND (Sex="M") AND (K>=0.03) THEN Drug A
or IF (BP=High) AND (Freq>=3) AND (Na<=1.2*K) THEN Drug A

(INSERT Table 1: The Rule Representation for Drug A)

Table 2 presents the predefined range (domain) for each user-defined gene in one rule. The gene name "Attribute enabled/disabled" denotes whether an attribute is adopted in the condition part of a rule.

(INSERT Table 2: The Detail Specification of Gene Encoding for One Rule)
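A hypothetical sketch of the gene layout just described; the field names are illustrative rather than the paper's exact encoding.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ConditionGene:
    enabled: bool          # the "Attribute enabled/disabled" gene
    attribute: str         # e.g. "Age", "BP", "Na"
    operator: str          # e.g. ">=", "<=", "="
    value: object          # a numeric threshold or a symbolic level

@dataclass
class Rule:
    conditions: List[ConditionGene]
    drug: str              # consequent: the predicted drug type

@dataclass
class Chromosome:
    rule_sets: List[List[Rule]]   # one disjunctive rule set per drug type (A to E)

rule_for_drug_a = Rule(
    conditions=[ConditionGene(True, "Age", ">=", 25),
                ConditionGene(True, "Sex", "=", "M"),
                ConditionGene(True, "K", ">=", 0.03)],
    drug="A",
)
```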

4.4 Initial Population

To generate valid chromosomes in the beginning, an initial population is derived through an individual screening process using constraint-based reasoning. Whenever a gene is randomly instantiated from the satisfactory universe, the potential values applied to the remaining genes are restricted by constraint propagation to assure constraint network consistency. In this way the constructed chromosomes in the CBGA are guaranteed to be valid.

4.5 Fitness Function

When using a rule to classify a given patient record, four types of results can be observed for the prediction model. These include:

True positive: this rule predicts that the patient uses a given drug and the patient does use it;

False positive: this rule predicts that the patient uses a given drug but the patient does not use it;

True negative: this rule predicts that the patient does not use a given drug and the patient does not use it;

False negative: this rule predicts that the patient does not use a given drug, but the patient does use it.

In our approach, the fitness function is defined as the number of errors consisting of false positive and false negative cases. The formal definition that specifies the valid rules can be stated as follows.

\[
\Bigl(\bigvee_{j=1}^{l_i} R_{ij}\Bigr) \;\wedge\; \Bigl(\bigvee_{j=1}^{l_k} \neg R_{kj}\Bigr),
\qquad k = 1, \ldots, i-1, i+1, \ldots, n, \qquad l_1 + l_2 + \cdots + l_n \le m \times n,
\]

where R_ij denotes the j-th rule associated with Drug i and ¬ denotes logical negation.
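A hedged sketch of this error-count fitness; rule_set_predicts is an assumed helper standing in for evaluating the chromosome's rule set for one drug, not part of the paper's code.

```python
def fitness(rule_set_predicts, records, drugs=("A", "B", "C", "D", "E")):
    """Number of errors (false positives + false negatives); lower is better.
    rule_set_predicts(record, drug) is an assumed helper that returns True when
    the chromosome's rule set for `drug` fires on `record`."""
    errors = 0
    for record in records:
        for drug in drugs:
            predicted = rule_set_predicts(record, drug)
            actual = record["Drug"] == drug
            if predicted != actual:          # a false positive or a false negative
                errors += 1
    return errors
```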

4.6 Genetic Operators

To exchange information between different rules, a uniform crossover is used to swap the gene values and generate new valid offspring. The mutation operator replaces a gene with a random value selected from the predefined range for that gene type.

For example, the gene values of the following offspring are determined by applying a uniform crossover according to a flag that denotes whether a certain gene value will be swapped. The detailed operations are illustrated as follows.

Step 1: the parents are defined as follows.

Parent 1: Age<=10 | Sex=F | BP=High   | Cho=Low  | Na<=0.67 | K>=0.3 | Qty=2 | Freq=2 | Age>0.5*Qty
Parent 2: Age>=20 | Sex=M | BP=Normal | Cho=High | Na<=0.3  | K>=0.5 | Qty<3 | Freq=5 | Qty>1.2*Freq

Step 2: the offspring are determined by swapping the gene values based on the flags.

Flags:       1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0
Offspring 1: Age>=20 | Sex=F | BP=High   | Cho=High | Na<=0.3  | K>=0.3 | Qty<3 | Freq=2 | Age>0.5*Qty
Offspring 2: Age<=10 | Sex=M | BP=Normal | Cho=Low  | Na<=0.67 | K>=0.5 | Qty=2 | Freq=5 | Qty>1.2*Freq

Step 3: mutation occurs at the BP attribute of Offspring 1 and its value is replaced by a randomly assigned value 'Low', as described below.

Offspring 1: Age>=20 | Sex=F | BP=Low    | Cho=High | Na<=0.3  | K>=0.3 | Qty<3 | Freq=2 | Age>0.5*Qty
Offspring 2: Age<=10 | Sex=M | BP=Normal | Cho=Low  | Na<=0.67 | K>=0.5 | Qty=2 | Freq=5 | Qty>1.2*Freq
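The crossover and mutation steps above can be sketched as follows; this is an illustrative reimplementation, not the authors' code.

```python
import random

def uniform_crossover(parent1, parent2, flags):
    """Swap gene j between the parents whenever flags[j] == 1 (as in Step 2)."""
    child1, child2 = list(parent1), list(parent2)
    for j, flag in enumerate(flags):
        if flag == 1:
            child1[j], child2[j] = parent2[j], parent1[j]
    return child1, child2

def mutate(chromosome, gene_domains, rate=0.01):
    """Replace each gene, with probability `rate`, by a random value from its predefined range."""
    return [random.choice(gene_domains[j]) if random.random() < rate else g
            for j, g in enumerate(chromosome)]

p1 = ["Age<=10", "Sex=F", "BP=High", "Cho=Low", "Na<=0.67", "K>=0.3", "Qty=2", "Freq=2", "Age>0.5*Qty"]
p2 = ["Age>=20", "Sex=M", "BP=Normal", "Cho=High", "Na<=0.3", "K>=0.5", "Qty<3", "Freq=5", "Qty>1.2*Freq"]
c1, c2 = uniform_crossover(p1, p2, [1, 0, 0, 1, 1, 0, 1, 0, 0])
```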

5. Experimental Results and Discussions

This research adopted 600 synthetic medical data records. Each record contains eight variables: age, sex, blood pressure, cholesterol level, Na and K values, and the quantity and frequency for taking a given drug.

Two thirds of the data were selected as training examples and the remainder were used as test examples. The model evaluation was based on three-fold cross validation. The GA control parameters used a population of one hundred chromosomes, a crossover rate of 0.6, and a mutation rate ranging from 0.01 to 0.05 in the initial settings. The entire training process proceeded until either a fixed number of generations or a fixed run time was reached.
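A small sketch of the three-fold split, assuming a simple random partition of the 600 records; the helper name is illustrative and not the paper's procedure.

```python
import random

def three_fold_indices(n, seed=0):
    """Split record indices into three folds (an illustrative split, not the paper's exact one)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::3] for i in range(3)]

folds = three_fold_indices(600)
for k in range(3):
    test_idx = folds[k]
    train_idx = [i for j in range(3) if j != k for i in folds[j]]
    # train the CBGA (or the regular GA) on train_idx and measure accuracy on test_idx
    print(f"fold {k}: {len(train_idx)} training records, {len(test_idx)} test records")
```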

Fig. 4 shows the results up to 200 generations. The X-axis denotes the GA generations and the Y-axis represents the average errors based on the results from three-fold cross validation. Fig. 5 displays the comparison results from a regular GA and CBGA with a fixed run time. The results indicate that the CBGA produces better performance with a smaller number of generations and less time required.

(INSERT Fig. 4. Experimental Results for Fixed Generation with Mutation Rate 0.01)

(INSERT Fig. 5. Experimental Results for Fixed Computation Time with Mutation Rate 0.01)

The accuracy is simply the ratio of the number of correctly classified training (or test) examples to the total number of training (or test) examples. The CBGA achieved an average accuracy of 81.07% on the training data and 79.5% on the test data. Both results outperformed the regular GA. The details are shown in Table 3.

(INSERT Table 3: The Accuracy for Three-Fold Cross Validation (Mutation Rate=0.01))

We also applied the same three-fold cross validation with a mutation rate of 0.05. The results are shown in Figs. 6 and 7 and Table 4. The CBGA again exhibited a more significant improvement than the regular GA.

(INSERT Fig. 6. Experimental Results for Fixed Number of Generations with Mutation Rate 0.05)

(INSERT Fig. 7. Experimental Results for Fixed Computation Time with Mutation Rate 0.05)

Fig. 8 shows the average accuracy with different mutation rates. For the regular GA the higher mutation rate resulted in higher accuracy, but for the CBGA it resulted in lower accuracy for both the training and test data. It seems normal that an increase in the mutation rate can improve the accuracy of a regular GA. However, the opposite occurred for the CBGA. The reason could be that a higher mutation rate produces a greater number of disordered chromosomes that require more computation during evolution. Because the number of generations was fixed, a higher mutation rate could force the CBGA to handle more invalid chromosomes, thereby reducing the CBGA's effectiveness in rule induction. More experiments can be performed to further verify this assumption in future work.

(INSERT Table 4: The Accuracy for Three-Fold Cross Validation (Mutation Rate=0.05))


(INSERT Fig. 8. The Comparative Accuracy vs. Two Different Mutation Rates for the Regular GA and CBGA)

(INSERT Fig. 9. The Comparative Time vs. Two Different Mutation Rates for the Regular GA and CBGA)

6. Conclusions

We introduced the CBGA approach, which hybridizes constraint-based reasoning with a genetic algorithm for rule induction. Incorporating user-control information into the mining process is not straightforward and typically requires a novel algorithm design. This approach infers rule sets by pushing partial knowledge into chromosome construction, guiding the rules to reveal more significant meanings. Furthermore, computational efficiency is improved by using the constraint network to prevent the production of invalid chromosomes.

Compared with a regular GA, the CBGA achieves higher predictive accuracy and requires less computation time for rule induction using a medical data set. The rule sets discovered by the CBGA exhibited higher predictive accuracy with more significant knowledge in accordance with the user's preferences.

The proposed CBGA is generic and problem independent. It is flexible, incorporating user information or domain knowledge via the expressive power of first order logic into a rule induction process. Proprietary genetic operators or chromosome representation are not needed to interact with the constraint-based reasoning. Most importantly, regular data mining methods construct predictive models based on the data behavior. However, data quality can never be completely assured before the mining results are available. Even when the results are available it is difficult to verify model reliability. The CBGA provides a way to minimize this effect by allowing domain experts to input professional knowledge or constraints to prevent possible anomalies from inappropriate data quality.

At this stage the CBGA is able to reveal complex rule sets in formats such as "Attribute_i <= w * Attribute_j", which can be extended to express more complicated multivariate inequations in either a linear or nonlinear form in the future. Further investigation of the proposed CBGA using more real world data sets or data sets from benchmark databanks is required to further demonstrate the CBGA's generalization capability.

ACKNOWLEDGEMENTS

This research was partially supported by the National Science Council, Taiwan, Republic of China, under contract number NSC90-2745-P-155-003.


The Hybrid of Association Rule Algorithms and Genetic Algorithms for Tree Induction: An Example of Predicting Students Learning Performance

Abstract

Revealing valuable knowledge hidden in corporate data becomes more critical for enterprise decision making. As more data is collected and accumulated, extensive data analysis is not easy without effective and efficient data mining methods. This paper proposes a hybrid association rule and genetic algorithm (GA) approach to discover a classification tree. The association rule algorithm is adopted to obtain useful clues based on which the GA is able to proceed with its search tasks in a more efficient way. The association rule algorithm is employed to acquire insights into those input variables most associated with the outcome variable before executing the evolutionary process. These derived insights are converted into the GA's seeding chromosomes. The proposed approach is tested and compared with a regular genetic algorithm in predicting a student's course performance.

Keywords: Genetic Algorithms; Association Rule; Classification Trees; Student Course Performance

1. Introduction

Revealing valuable knowledge hidden in corporate data becomes more critical for enterprise decision making. As more data is collected and accumulated, extensive data analysis is not easy without effective and efficient data mining methods.

Tree induction is one of the most common methods of knowledge discovery. It is a method for discovering a tree-like pattern that can be used for classification or estimation. Some of the often-mentioned tree induction methods, such as C4.5 (Quinlan, 1993), CART (Breiman et al., 1984), and QUEST (Loh & Shih, 1997), are not evolutionary-based approaches. Basically, an ideal tree induction technique has to carefully address aspects such as model comprehensibility and interestingness, attribute selection, and learning efficiency and effectiveness. Genetic algorithms (GAs), one of the most often used evolutionary computation techniques, have drawn increasing attention for their flexibility and expressiveness of problem representation as well as their fast search capability for knowledge discovery. In the past, genetic algorithms were mostly employed to enhance the learning process of data mining algorithms such as neural nets or fuzzy expert systems, rather than to discover models or patterns. That is, genetic algorithms act as a method for performing a guided search for good models in the solution space. While genetic algorithms are an interesting approach for discovering hidden valuable knowledge, they have to handle computational efficiency with large data volumes.

Generally, tree induction methods are used to automatically produce rule sets for predicting the expected outcomes as accurately as possible. However, the emphasis on revealing novel or interesting knowledge has become a recent research issue in data mining. These attempts may impose additional tree discovery constraints and thereby produce additional computation overhead. In regular GA operations, constraint validation proceeds after a candidate chromosome (i.e., pattern) is produced. That is, several iterations may be required to determine a valid chromosome. One way to reduce the computation load is to obtain associated information, such as the relevant attributes or attribute values, before the initial chromosomes are generated, thereby improving the efficiency and effectiveness of evolution. Potentially, this can be done by applying association rule algorithms to find clues related to the classification values.

This research proposes a novel approach that integrates an association rule algorithm with a GA to discover a classification tree. An association rule algorithm, also known as market basket analysis, is used for attribute selection, so that the related input variables can be determined before the GA's evolution proceeds.

The Apriori algorithm is a popular association rule technique for discovering attribute relationships, which are converted to formulate the initial GA population. The proposed method attempts to enhance the GA's search performance by gaining important clues leading to the final patterns. A prototype system based on the AGA (Association-based GA) approach is developed to predict the learning performance of college students. The application data, consisting of student learning profiles and related course information, was derived from one university in Taiwan.

The remainder of this paper is organized as follows. In Section 2, previous research and related techniques are reviewed. The detailed AGA procedures are then introduced in Section 3. Section 4 presents the experiments and results, followed by discussion and conclusions.

2. The Literature Review

2.1. Genetic Algorithm for Rule Induction


Current rule induction systems typically fall into two categories: "divide and conquer" (Quinlan, 1993) and "separate and conquer" (Clark & Niblett, 1989). The former recursively partitions the instance space until regions of roughly uniform class membership are obtained. The latter induces one rule at a time and separates out the covered instances. Rule induction methods may also be categorized into either tree-based or non-tree-based methods (Abdullah, 1999). Some of the often-mentioned decision tree induction methods include the C4.5 (Quinlan, 1993), CART (Breiman et al., 1984), and GOTA (Hartmann et al., 1982) algorithms. Both decision trees and rules can be described as disjunctive normal form (DNF) models. Decision trees are generated from data in a top-down, general-to-specific direction (Chidanand & Sholom, 1997). Each path to a terminal node is represented as a rule consisting of a conjunction of tests on the path's internal nodes. These ordered rules are mutually exclusive (Clark & Boswell, 1991). Quinlan (1993) introduced techniques to transform an induced decision tree into a set of production rules.

Michalski et al. (1986) proposed the AQ15 algorithm to generate a disjunctive set of classification rules. The CN2 rule induction algorithm also uses a modified AQ algorithm that involves a top-down beam search procedure (Clark & Niblett, 1989). It adopts entropy as its search heuristic and is only able to generate an ordered list of rules. The Basic Exclusion Algorithm (BEXA) is another type of rule induction method, proposed by Theron & Cloete (1996). It follows a general-to-specific search procedure in which disjunctive conjunctions are allowed. Every conjunction is evaluated using the Laplace error estimate. More recently, Witten and Frank (1999) described covering algorithms for discovering rule sets in conjunctive form.

GAs have been successfully applied to data mining for rule discovery in the literature. Some techniques use one-rule-per-individual encoding, as proposed by Greene & Smith (1993) and Noda et al. (1999). In the one-rule-per-individual encoding approach, a chromosome can usually be identified with a linear string of rule conditions, where each condition is often an attribute-value pair, to represent a rule or a rule set. Although the individual encoding is simpler and syntactically shorter, the problem is that the fitness of a single rule is not necessarily the best indicator of the quality of the discovered rule set. The several-rules-per-individual approach (De Jong et al., 1993; Janikow, 1993) has the advantage of considering the rule set as a whole by taking rule interactions into account. However, this approach makes the chromosome encoding more complicated and syntactically longer, which usually requires more complex genetic operators.


Hu (1998) proposed a Genetic Programming (GP) approach in which a program can be represented by a tree with rule conditions and/or attribute values in the leaf nodes and functions in the internal nodes. The challenge is that a tree can grow in size and shape in a very dynamic way. Thus, an efficient tree-pruning algorithm would be required to prune unsatisfactory parts of a tree to avoid infeasible solutions. Bojarczuk et al. (2001) proposed a constrained-syntax GP approach to build a decision model, with particular emphasis on the discovery of comprehensible knowledge. The constrained-syntax mechanism was applied to verify the relationship between operators and operand data types during tree building.

In order to discover high-level prediction rules, Freitas (1999) applied first-order relationships such as "Salary > Age" by checking an Attribute Compatibility Table (ACT) during the discovery process with GA-Nuggets. The ACT was claimed to be particularly effective for its knowledge representation capability. By extending the use of the ACT, our proposed approach allows other complicated attribute relationships, such as linear or non-linear quantitative relationships among multiple attributes. This mechanism attempts to help reduce the search space during the GA's evolution process.

2.2. Classification Trees

Among data mining techniques, a decision tree is one of the most commonly used methods for knowledge discovery. A decision tree is used to discover rules and relationships by systematically breaking down and subdividing the information contained in data (Chou, 1991). A decision tree features easy understanding and a simple top-down tree structure where decisions are made at each node. The nodes at the bottom of the resulting tree provide the final outcome, either a discrete or a continuous value. When the outcome is a discrete value, a classification tree is developed (Hunt, 1993), while a regression tree is developed when the outcome is numerical and continuous (Bala & De Jong, 1996).

Classification is a critical type of prediction problem. Classification aims to examine the features of a newly presented object and assign it to one of a predefined set of classes (Michael & Gordon, 1997). Classification trees are used to predict the membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. The classification tree induction process often selects the attribute at a given node according to criteria such as Quinlan's information gain (ID3), the gain ratio (C4.5), or QUEST's statistics-based approach to determine proper attributes and split points for a tree (Quinlan, 1986; 1992; Loh & Shih, 1997).


2.3. Association Rules Algorithms for Attributes Selection

Many algorithms can be used to discover association rules from data in order to identify patterns of behavior. One of the most used and best known is the Apriori algorithm, detailed in Agrawal et al. (1993) and Agrawal & Srikant (1994). For instance, an association rule algorithm is able to produce a rule such as: when people buy Bankers Trust they also buy Dow Chemical 20 percent of the time.

An association rule algorithm, given minimum support and confidence levels, is able to quickly produce rules from a set of data through the discovery of so-called itemsets. A rule has two measures, called confidence and support. Support (or prevalence) measures how often items occur together, as a percentage of the total records. Confidence (or predictability) measures how much a particular item depends on another.
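A minimal sketch of these two measures over a toy transaction set; the example numbers are illustrative and not taken from the paper.

```python
def support(itemset, transactions):
    """Fraction of records that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among records containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

transactions = [{"BankersTrust", "DowChemical"}, {"BankersTrust"}, {"DowChemical"},
                {"BankersTrust", "DowChemical"}, {"BankersTrust"}]
print(support({"BankersTrust", "DowChemical"}, transactions))        # 0.4
print(confidence({"BankersTrust"}, {"DowChemical"}, transactions))   # 0.5
```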

Due to the association rule algorithms' advantage in efficiently deriving associations among data items, recent data mining research on classification tree construction has attempted to adopt this mechanism for knowledge preprocessing. For example, the Apriori algorithm was applied to produce association rules that can be converted into the initial population of a GP (Niimi & Tazaki, 2000; Niimi & Tazaki, 2001). Improved learning efficiency for the proposed method was demonstrated when compared with a GP that does not use association rule algorithms. However, the handling of multivariate classification problems and better learning accuracy were not addressed in these studies.

3. The Hybrid of Association Rule Algorithms and Genetic Algorithms (AGA)

The proposed AGA approach consists of three modules. According to Fig. 1, these modules are: Association Rule Mining, Tree Initialization, and Classification Tree Mining.

(Fig. 1. The Conceptual Framework of AGA: category attributes from the data set feed the Apriori algorithm, producing rules A1->C1, ..., An->Ck; these rules, together with the continuous attributes, initialize candidate classification trees, and the GA then searches for the optimal tree.)

The association rule mining module generates association rules using the Apriori algorithm. In this research, the items derived from category attributes are used to construct the association rules. An association rule here is an implication of the form X->Y, where X is a conjunction of conditions and Y is the classification type. The rule X->Y has to satisfy user-specified minimum support and minimum confidence levels. The tree initialization module constructs potential candidate classification trees for better predictive accuracy. Figure 2 illustrates a classification tree with n nodes. Each node consists of two types of attributes: categorical and continuous. Categorical attributes provide a partial conjunction of conditions, and the remaining conditions are contributed by continuous attributes in the form of inequalities. The formal form of a tree node is

A and B -> C

where A denotes the antecedent part of the association rule; B is the conjunction of inequality functions in which the continuous attributes, relational operators, and splitting values are determined by the GA; and C is the classification result obtained directly from the association rule.

Fig. 2. The Illustration for the Classification Tree

For example, suppose x1, x2, and x3 are categorical attributes and x4 and x5 are continuous attributes. Assume the association rule "x1=5 and x3=3 -> Ck" is selected for a tree node, specified as follows.

Nodei: IF x1=5 and x3=3 and x4<=10 and x5>1 Then Class Ck;
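A hypothetical sketch of such a node in the form "A and B -> C"; the class and field names are illustrative, not the paper's data structure.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class TreeNode:
    categorical: Dict[str, object]            # A: e.g. {"x1": 5, "x3": 3} from an association rule
    continuous: List[Tuple[str, str, float]]  # B: e.g. [("x4", "<=", 10), ("x5", ">", 1)] chosen by the GA
    label: str                                # C: the classification result, e.g. "Ck"

    def matches(self, record):
        """True when the record satisfies both the categorical and continuous conditions."""
        if any(record[a] != v for a, v in self.categorical.items()):
            return False
        for attr, op, split in self.continuous:
            ok = record[attr] <= split if op == "<=" else record[attr] > split
            if not ok:
                return False
        return True

node = TreeNode({"x1": 5, "x3": 3}, [("x4", "<=", 10), ("x5", ">", 1)], "Ck")
print(node.matches({"x1": 5, "x2": 0, "x3": 3, "x4": 7, "x5": 2}))   # True
```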

The execution steps of the classification tree initialization module are stated as follows.

Step (a): For each node, the antecedent condition is obtained from one association rule that is randomly selected from the entire association rule set generated in advance. That is, the "A" part of the tree node form is determined.

Step (b): By applying the GA, the relational operator (<= or >) and splitting point are determined for each continuous attribute. Therefore the "B" part of the tree node form is determined.


Step (c): The classification result for a tree node comes directly from the consequent of the derived association rule. Each tree node is then generated by repeatedly applying Steps (a) and (b) n times, where n is a value automatically determined by the GA.
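Steps (a) to (c) can be sketched as follows; the rule format, attribute ranges, and helper names are assumptions introduced for illustration only.

```python
import random

def initialize_tree(association_rules, continuous_attrs, max_nodes=5):
    """Sketch of Steps (a)-(c): each node takes its categorical antecedent and class
    from a randomly chosen association rule, while an operator and split point are
    drawn for every continuous attribute (standing in for the GA's choices)."""
    n = random.randint(1, max_nodes)                         # node count, left to the GA
    nodes = []
    for _ in range(n):
        rule = random.choice(association_rules)              # Step (a): parts A and C
        splits = [(attr, random.choice(["<=", ">"]), round(random.uniform(lo, hi), 2))
                  for attr, (lo, hi) in continuous_attrs.items()]       # Step (b): part B
        nodes.append((rule["antecedent"], splits, rule["consequent"]))  # Step (c)
    return nodes

rules = [{"antecedent": {"x1": 5, "x3": 3}, "consequent": "Ck"}]
print(initialize_tree(rules, {"x4": (0, 20), "x5": (0, 5)}))
```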

The classification tree search module applies the GA to search for a superior classification tree based on the potential candidates generated by the tree initialization module. Fig. 3 shows the proposed AGA approach for classification tree mining. The details of each step are illustrated as follows.

Fig. 3. The Conceptual Diagram of the Proposed Classification Tree Mining.

Step (a): Chromosome Encoding -- The tree nodes can be easily presented in the form of “Node1, Node2, … , Noden” where n is the total number of the tree node, and is automatically determined by the GA.

Step (b): GA Initialization -- To generate potential chromosomes in the beginning, the initial population is obtained from the tree initialization module.

Step (c): Fitness Evaluation -- Calculate the fitness value for each chromosome in the current population. The fitness function is defined as the total number of misclassifications.

Step (d): Stop Condition Met -- If the specified stopping condition is satisfied, then the entire process is terminated, and the optimal classification tree is confirmed; otherwise, the GA operations are continued.

Step (e): GA Operators -- Each GA operation contains chromosomes selection, crossover, and mutation in order to produce offspring generation based on different GA parameter settings.

Step (f): Chromosome Decoding -- The best chromosome is thus transformed to the optimal classification tree.
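A minimal sketch of this loop; elitist truncation selection is used here as a simplification of Step (e) (the paper's Table 1 specifies roulette wheel selection), and the callable parameters fitness, crossover, and mutate are placeholders.

```python
import random

def aga_search(initial_population, fitness, crossover, mutate, generations=100):
    """Minimal GA loop mirroring Steps (b)-(f): evaluate, select, recombine, mutate,
    then decode the best chromosome."""
    population = list(initial_population)                # Step (b): seeded by tree initialization
    for _ in range(generations):
        ranked = sorted(population, key=fitness)         # Step (c): misclassification count
        parents = ranked[:max(2, len(ranked) // 2)]      # Step (e): keep the better half
        offspring = []
        while len(parents) + len(offspring) < len(population):
            a, b = random.sample(parents, 2)
            offspring.append(mutate(crossover(a, b)))    # crossover returns one child here
        population = parents + offspring
    return min(population, key=fitness)                  # Step (f): best chromosome to decode
```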

4. The Experiments and Results

Before mining the student learning performance data set, the house-votes data set from the UCI repository (Blake & Merz, 1998) was used to validate our proposed approach. This data set was derived from the 1984 United States Congressional Voting Records database. For comparison purposes, a simple GA (SGA) was applied to both data sets.

Generally, the association rules extracted by the Apriori algorithm vary depending on the defined support and confidence values. Different extracted association rules may have different impacts on AGA learning performance. Therefore this research experimented with different sets of minimum support and confidence values for both prediction problems. The classification trees generated by the SGA and AGA approaches were evaluated using five-fold cross validation. That is, each training stage used 4/5 of the data records, with the remaining 1/5 used for the testing stage. The GA parameter settings for both applications are summarized in Table 1.

Table 1. The GA Parameter Settings

Item                                           Value
Population size                                100
Generations                                    100/200/300/400/500
Crossover rate                                 0.6
Mutation rate                                  0.01
Selection method                               Roulette Wheel
Training time (House-Votes)                    1.5 minutes*
Training time (Student Learning Performance)   0.8 minutes*

* The hardware platform is a Pentium III 1.0 GHz with 256 MB RAM.

The Application of House-Votes Data Set

The collected 435 House-Votes data records consist of 16 categorical attributes expressed by 'Y', 'N', and '?' for the input part. The output part is a categorical attribute with 2 classes (Democrat, Republican). Among the entire set of records, 267 are Democrats and 168 are Republicans. Instead of precluding records containing '?' values from training, this research denotes these values by 'Others'. Data for the 16 categorical attributes were fed into the Apriori algorithm to produce the association rules. After several trials with different sets of minimum support and confidence values, the best AGA learning performance was obtained. Both the training and testing results are summarized in Table 2, along with their corresponding representations depicted in Figs. 4 and 5. These results are based on a minimum support value of 20 and a confidence value of 100. The derived association rule sets consist of 1743 rules for the 'Democrat' output category and 2215 rules for the 'Republican' output category. In order to obtain more details about the learning progress of the two approaches, the learning track behavior was recorded in sessions. Figs. 6 and 7 depict the learning progress monitored over generations and time. Table 3 presents the detailed tree node notation for one of the better classification trees derived. Based on this classification tree, the accuracy rates for the training and testing stages reach 98.28% and 100.00%, respectively.


Table 2. The 5-Fold Training/Testing Results for SGA and AGA (House-Votes Data)

Gen.    100              200              300              400              500
        Train    Test    Train    Test    Train    Test    Train    Test    Train    Test
SGA     95.98%   94.25%  97.36%   94.71%  97.82%   94.71%  97.87%   94.71%  97.99%   94.71%
AGA     97.93%   95.63%  98.28%   95.17%  98.51%   95.40%  98.56%   95.40%  98.62%   95.40%

Fig. 4. Training Results with Various Generations

Fig. 5. Testing Results with Various Generations

Fig. 6. The Learning Progress over Generations (based on 5-fold average)

Fig. 7. The Learning Progress over Time (based on 5-fold average)

Table 3. The Notation for Each Tree Node

The Application of Student Learning Performance Data Set

The Data Description

In order to better monitor students' learning performance, one university in Taiwan designed a pre-warning system that requires each course instructor to input a student's up-to-present learning performance grade one week after the mid-term exam. The student's mid-term rating is categorized into three levels: 'A', 'B', and 'C', where 'A' implies EXCELLENT, 'B' means O.K., and 'C' means POOR. Generally, the more 'C's a student receives, the higher the probability that the student will fail the course. However, relying purely on this mid-term grading information is not sufficient to determine whether the student will pass the course at the end of the semester. Therefore, other supporting information, such as the course difficulty, the grading track record of the instructor, and the student profile information, was collected. Each data record contains 5 categorical attributes and 2 continuous attributes for the input part. The output part is one categorical attribute, the class of 'pass' or 'fail'. The notation for the variables used in this prediction model is specified in Table 4. A total of 410 student records, containing both passed (146) and failed (264) records from 48 freshman and sophomore drop-out students in the Engineering School, were collected for analysis.

Table 4. The Variables Used in the Model

Descriptions                                    Data Type
X1: Department Code (5 categories)              category
X2: Gender (Male/Female)                        category
X3: Mid-Term Rating (3 levels)                  category
X4: Course Credits (1-3 credits)                category
X5: Course Type (Required/Optional)             category
X6: Course Difficulty                           continuous
X7: Instructor's Track Record of Flunk Ratio    continuous

After several trials with different sets of minimum support and confidence values, the best AGA learning performance was obtained. Both the training and testing performance are summarized in Table 5, along with their corresponding representations depicted in Figs. 8 and 9. These results are based on a minimum support value of 5 and a confidence value of 100. The derived association rule sets consist of 100 rules for the 'Pass' output category and 58 rules for the 'Fail' output category. In order to obtain more details about the learning progress of the two approaches, the learning track behavior was recorded in sessions. Figs. 10 and 11 depict the learning progress monitored over generations and time. Table 6 presents the detailed tree node notation for one of the superior classification trees derived. Based on this classification tree, the accuracy rates for the training and testing stages reach 79.27% and 80.49%, respectively.

Table 5. The 5-Fold Training/Testing Results for SGA and AGA (Student Learning Performance data)

Gen.    100              200              300              400              500
        Train    Test    Train    Test    Train    Test    Train    Test    Train    Test
SGA     76.34%   69.76%  76.59%   69.76%  77.20%   69.76%  77.32%   69.51%  77.56%   70.00%
AGA     76.89%   73.17%  77.68%   73.17%  78.29%   73.17%  78.78%   73.41%  79.15%   73.66%

5. Discussion

According to the results indicated above, AGA achieves better learning performance than SGA in terms of computational efficiency and accuracy. By applying the association rule process, partial knowledge is extracted and transformed into seeding chromosomes. According to Fig. 6, the initial average numbers of errors for SGA and AGA are 230 and 50, respectively, with the House-Votes data. In Fig. 10, the initial average numbers of errors for SGA and AGA are 127 and 113, respectively, with the student learning performance data. This improvement in initial learning performance can be attributed to the derived association rules that are transformed into the GA's seeding chromosomes.

According to Fig. 6, for the training stage SGA takes 500 generations to reach a performance similar to what AGA reaches in only 40 generations. This outcome can be attributed to the adoption of the Apriori algorithm, by which the GA search space is substantially reduced. It can also be seen that AGA consistently outperforms SGA over generations. Furthermore, in terms of computation time, AGA takes 0.19 minutes to reach the learning performance that takes SGA at least 1.5 minutes to reach.

As shown in Fig. 10, for the training stage SGA takes 500 generations to reach a performance similar to what AGA reaches in 100 generations. It can also be seen that AGA consistently outperforms SGA over generations. Furthermore, in terms of computation time, AGA takes 0.17 minutes to reach the learning performance that takes SGA 0.80 minutes to reach.

6. Conclusions and Future Development

We have introduced the AGA approach that hybridizes apriori algorithm and the genetic algorithm for classification tree induction. Incorporating the associated knowledge related to the classification results is crucial for improving evolutionary-based mining tasks. By employing the association rule algorithm to acquire partial knowledge from data, our proposed approach is able to more effectively and efficiently induce a classification tree by converting the derived association rules into the GA’s seeding chromosomes.

Compared with SGA, AGA achieves higher predictive accuracy and requires less computation time for classification tree induction in experiments with a UCI benchmark data set as well as the student learning performance data set.

Predicting a student's course performance from data derived from the student/course profiles as well as the mid-term rating information is a novel way to help both the students and the university grasp further information about approximate student course performance before it is too late to recover. According to the experimental results, AGA has proved to be a feasible way to provide an acceptable solution for predicting a student's course performance with nearly 80% classification accuracy.

In addition, the classification trees discovered by AGA not only achieve higher predictive accuracy and computational efficiency, but may also yield knowledge that is more transparent and significant to users.

The proposed AGA is generic and problem independent. Besides integrating with association rule algorithms for knowledge preprocessing, AGA can flexibly incorporate user information or domain knowledge into the tree induction process through the expressive power of first-order logic. The approach is applicable not only to binary classification problems but also to multi-category classification problems.

Currently, the AGA approach is able to reveal tree-splitting nodes that allow complex discriminating formats such as the "Attributei <= w ∗ Attributej" relationship, which can be extended to express more complicated multivariate


inequations of either linear or nonlinear form in the future.

Mining Three-Dimensional Anthropometric Body Surface Scanning Data for Hypertension Detection: An Evolutionary-Based Classification Approach

Abstract

Hypertension is a major disease and one of the top ten causes of death in Taiwan. Exploring 3D anthropometric data along with other existing subject medical profiles using data mining techniques has therefore become an important research issue for medical decision support. This research attempts to construct a prediction model for hypertension using anthropometric body surface scanning data. It adopts classification trees to reveal the correlation between a subject's three-dimensional (3D) scanning data and hypertension, using a hybrid of the association rule algorithm and genetic algorithms (GAs). The association rule algorithm is adopted to obtain useful clues with which the GA can carry out its search more efficiently. The proposed approach is tested and compared with a regular genetic algorithm in predicting a subject's hypertension. Better computational efficiency and more accurate prediction results from the proposed approach are demonstrated.

Keywords: Hypertension; Anthropometric Data; Genetic Algorithms; Association Rule; Classification Trees

1. Introduction

Hypertension is a major disease and one of the top ten causes of death in Taiwan. It also contributes to other major causes of death, such as cardiovascular diseases, and is considered a factor in Syndrome X, which has been investigated for years in epidemiologic studies (Srinivasan, 1993; Mykkanen et al., 1997; Chen et al., 2000; Jeppesen et al., 2000). While earlier identification of the disease is gaining attention in clinical research, the investigation of factors for prevention and intervention is also a crucial issue for preventive medicine. Modifiable factors for reducing the risk of the disease, such as life-style variables and body measurements, are of particular interest to public health professionals.

Thanks to the recent development of a new 3D scanning technology, which has many advantages over older anthropometric measurement systems (Coombes et al., 1991; Jones et al., 1994) based on tape measures, anthropometers (a special measuring ruler) (Kroemer, 1989), and similar instruments, the Department of Health


Management of Chang Gung Medical Center in Taiwan is able to collect 3D anthropometric body surface scanning data easily and accurately (Lin et al., 2002). Unlike other anthropometric databases that contain only geometric and demographic data on human subjects, each scanned body data set in the Chang Gung Medical Center database is linked to the subject's health and clinical records through the medical center's computer system.

Therefore, exploring these 3D data along with other existing subject medical profiles using data mining techniques becomes an important research issue for medical decision support (Jones & Rioux, 1997; Meaney & Farrer, 1986). Traditional approaches usually apply statistical techniques to determine the correlation between a target disease and the corresponding anthropometric data.

Tree induction is one of the most common methods of knowledge discovery. It is a method for discovering a tree-like pattern that can be used for classification or estimation. Some of the often-mentioned tree induction methods, such as C4.5 (Quinlan, 1993), CART (Breiman et al., 1984), and QUEST (Loh & Shih, 1997), are not evolutionary approaches. An ideal tree induction technique has to carefully address aspects such as model comprehensibility and interestingness, attribute selection, and learning efficiency and effectiveness. The genetic algorithm (GA), one of the most often used evolutionary computation techniques, has gained increasing recognition for its flexibility and expressiveness in problem representation as well as its fast search capability for knowledge discovery. In the past, genetic algorithms were mostly employed to enhance the learning process of data mining algorithms such as neural networks or fuzzy expert systems, rather than to discover models or patterns directly. In that role, genetic algorithms act as a method for performing a guided search for good models in the solution space. While genetic algorithms are an interesting approach for discovering hidden valuable knowledge, they must cope with computational efficiency on large data volumes.

Generally, tree induction methods are used to automatically produce rule sets that predict the expected outcomes as accurately as possible. However, the emphasis on revealing novel or interesting knowledge has become a recent research issue in data mining. Such attempts may impose additional tree-discovery constraints and thereby add computational overhead. In regular GA operations, constraint validation is performed after a candidate chromosome is produced; that is, several iterations may be required to obtain a valid chromosome (i.e., pattern). One way to reduce the computational load is to obtain associated information, such as relevant attributes or attribute values, before the initial chromosomes are generated,


thereby accelerating the efficiency and effectiveness of the evolution. Potentially, this can be done by applying association rule algorithms to find clues related to the classification results.

This research proposes a novel approach that integrates an association rule algorithm with a GA to discover classification trees. An association rule algorithm, also known as market basket analysis, is used for attribute selection, so that the related input variables can be determined before the GA's evolution proceeds.

The apriori algorithm is a popular association rule technique for discovering attribute relationships, which are then converted to form the initial GA population. The proposed method attempts to enhance the GA's search performance by supplying important clues leading to the final patterns. A prototype system based on the AGA (Association-based GA) approach is developed to predict hypertension. The application data, consisting of 3D anthropometric data and related profile information, were obtained from Chang Gung Memorial Hospital, Taiwan.
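To make the seeding step concrete, the sketch below shows one plausible way to turn a mined rule into a seeding chromosome, assuming a simple attribute-value encoding; the attribute names, domains, and wildcard convention are hypothetical illustrations rather than the paper's actual representation.

```python
import random

# Hypothetical attribute domains; a real run would take these from the data set.
DOMAINS = {
    "Sex": ["Male", "Female"],
    "FHHT": ["Yes", "No"],
    "UT": [1, 2, 3],
}
ATTRS = list(DOMAINS)
WILDCARD = None  # gene value meaning "attribute not used in this rule"

def rule_to_seed(rule_conditions):
    """Turn one association rule (dict attr -> value) into a seeding chromosome.

    Attributes mentioned in the rule keep their value; the rest are filled
    randomly so the chromosome is a complete individual.
    """
    chromosome = []
    for attr in ATTRS:
        if attr in rule_conditions:
            chromosome.append(rule_conditions[attr])
        else:
            chromosome.append(random.choice(DOMAINS[attr] + [WILDCARD]))
    return chromosome

# Example: the mined rule "Sex = Female AND UT = 2 -> Normal" becomes a seed.
seed = rule_to_seed({"Sex": "Female", "UT": 2})
```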

In addition, in order to compare with a commonly used classification tree technique, C4.5 is applied to the same data sets.

The remainder of this paper is organized as follows. In Section 2, previous research works and related techniques are reviewed. The detailed AGA procedures are then introduced in Section 3. Section 4 presents the experiments and results with medical data sets followed by discussion and conclusions.

2. The Background

2.1. Instruments and Procedures

The Chang Gung Whole Body 3D Laser Scanner scans a cylindrical volume 1.9 meters high and 1.0 meter in diameter. These dimensions accommodate the vast majority of human subjects. A platform structure supports the subject and provides alignment for the towers. The system is built to withstand shipping and repeated use without alignment or adjustment. The standard scanning apparel for both men and women included light gray cotton biker shorts and a gray sports bra for women. Latex caps were used to cover the hair on subjects’ heads. Each subject was measured in three different scanning postures. Automatic landmark recognition (ALR) technology was used to automatically extract anatomical landmarks from the 3D body scan data. The landmarks were then placed on the subject. More than 30 measurements of the


above results were new anthropometric factors that traditional measurements did not provide. However, not all the collected anthropometric factors were adopted in our research. The factors included in this research are body mass index (BMI), left arm volume (LAV), trunk surface area (TSA), weight (W), waist circumference (WC), waist-hip ratio (WHR), and waist width (WW). These and the other factors used, with their descriptions, are detailed in Table 4. The body scanner scans the human body from head to feet by laser in the horizontal plane around the whole body, and the computer processes 3D data at the speed of 60 laser beams per second. Taking a body height of 180 cm for the computation, when the Gemini 10085 scanner is set to a 4-mm vertical scanning resolution it takes 7.5 s to scan the whole body; at a vertical scanning resolution of 2.5 mm it takes 12 s. Subjects were also asked to provide demographic data such as age, ethnic group, sex, area of residence, education level, present occupation, and family income. Related hospital health records and clinical records were obtained for each subject, if available.

The 3D laser scanning system is based on optical triangulation of reflected photo profiles of an incident, cross-sectional plane of laser light that travels around the segment. The profile is collected by a digitizer camera and used to characterize the 3D spatial surface geometry. The markers developed for use with the ALR algorithm are flat, adhesive backed disks with a concentric circular, high contrast pattern, consisting of a 6-mm diameter black center circle and a 12-mm diameter outer white annulus.

Anthropometric measurements were performed to determine BMI and WHR. Data were coded by computer. The results were correlated with data on blood pressure, blood glucose, lipid, and uric acid levels. The health index (HI) was determined using the following equation:

HI = (body weight × 2 × waist profile area) / [body height² × (breast profile area + hip profile area)].
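A minimal sketch of the health index computation as reconstructed above; note that the "+" joining the breast and hip profile areas is our assumption, since the operator was lost in the source text.

```python
def health_index(weight_kg, waist_area, height_m, breast_area, hip_area):
    """Health index as reconstructed from the text.

    The '+' combining the breast and hip profile areas is an assumption;
    the operator was missing in the source."""
    return (weight_kg * 2 * waist_area) / (height_m ** 2 * (breast_area + hip_area))
```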

2.2. Genetic Algorithm for Rule Induction

Current rule induction systems typically fall into two categories: "divide and conquer" (Quinlan, 1993) and "separate and conquer" (Clark & Niblett, 1989). The former recursively partitions the instance space until regions of roughly uniform class membership are obtained. The latter induces one rule at a time and separates out the covered instances. Rule induction methods may also be categorized into either tree-based or non-tree-based methods (Abdullah, 1999). Some of the often-mentioned decision tree induction methods include the C4.5 (Quinlan, 1993), CART (Breiman et al., 1984), and GOTA (Hartmann et al., 1982) algorithms. Both decision trees and rules


can be described as disjunctive normal form (DNF) models. Decision trees are generated from data in a top-down, general to specific direction (Chidanand & Sholom, 1997). Each path to a terminal node is represented as a rule consisting of a conjunction of tests on the path’s internal nodes. These ordered rules are mutually exclusive (Clark & Boswell, 1991). Quinlan (1993) introduced techniques to transform an induced decision tree into a set of production rules.

GAs have been successfully applied to data mining for rule discovery in the literature. Some techniques use the Michigan approach (one-rule-per-individual encoding) proposed by Greene & Smith (1993) and Noda et al. (1999). In the Michigan approach, a chromosome is usually a linear string of rule conditions, where each condition is often an attribute-value pair, representing a rule or a rule set. Although this individual encoding is simpler and syntactically shorter, the problem is that the fitness of a single rule is not necessarily the best indicator of the quality of the discovered rule set. The Pittsburgh approach (several-rules-per-individual encoding) (De Jong et al., 1993; Janikow, 1993) has the advantage of considering the rule set as a whole, taking rule interactions into account. However, it makes the chromosome encoding more complicated and syntactically longer, which usually requires more complex genetic operators.

In order to discover high-level prediction rules, Freitas (1999) applied first-order relationships such as "Salary > Age" by checking an Attribute Compatibility Table (ACT) during the discovery process with GA-Nuggets. The ACT was claimed to be particularly effective for its knowledge representation capability. By extending the use of the ACT, our proposed approach allows other complicated attribute relationships, such as linear or non-linear quantitative relationships among multiple attributes. This mechanism helps reduce the search space during the GA's evolution.

2.3. Classification Trees

Among data mining techniques, a decision tree is one of the most commonly used methods for knowledge discovery. A decision tree is used to discover rules and relationships by systematically breaking down and subdividing the information contained in data (Chou, 1991). A decision tree features easy interpretability and a simple top-down tree structure in which decisions are made at each node. The nodes at the bottom of the resulting tree provide the final outcome, either a discrete or a continuous value. When the outcome is a discrete value, a classification tree is


developed (Hunt, 1993), while a regression tree is developed when the outcome is numerical and continuous (Bala & De Jong, 1996).

Classification is a critical type of prediction problem. It aims to examine the features of a newly presented object and assign it to one of a predefined set of classes (Michael & Gordon, 1997). Classification trees are used to predict the membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. The classification tree induction process typically selects the attribute at a given node according to criteria such as Quinlan's information gain (ID3), the gain ratio (C4.5), or QUEST's statistics-based approach, in order to determine proper attributes and split points for the tree (Quinlan, 1986; 1993; Loh & Shih, 1997).
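As an illustration of the entropy-based split criteria mentioned above (not the authors' implementation), the following sketch computes Quinlan-style information gain for a categorical attribute; the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute_index):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    base = entropy(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return base - remainder

# Toy usage: gain of splitting on the first (hypothetical) attribute.
rows = [("female", "smoker"), ("female", "non"), ("male", "non")]
labels = ["abnormal", "normal", "normal"]
print(information_gain(rows, labels, 0))  # about 0.25 bits
```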

2.4. Association Rule Algorithms for Attribute Selection

Many algorithms can be used to discover association rules from data in order to identify patterns of behavior. One of the most widely used is the apriori algorithm, detailed in Agrawal et al. (1993) and Agrawal & Srikant (1994). For instance, an association rule algorithm is able to produce a rule such as:

When gender is female and age is greater than 55, hypertension occurs 20 percent of the time.

An association rule algorithm, given the minimum support and confidence levels, is able to quickly produce rules from a set of data through the discovery of so-called itemsets. A rule has two measures, called confidence and support. Support (or prevalence) measures how often items occur together, as a percentage of the total records. Confidence (or predictability) measures how much a particular item depends on another.
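The two measures can be stated compactly in code. This is a generic sketch over transactional records, not the apriori implementation used in the paper; the attribute=value item encoding is an illustrative assumption.

```python
def support(records, itemset):
    """Fraction of records containing every item in the itemset."""
    hits = sum(1 for r in records if itemset <= r)
    return hits / len(records)

def confidence(records, antecedent, consequent):
    """Support of the whole rule divided by support of its antecedent."""
    return support(records, antecedent | consequent) / support(records, antecedent)

# Example with attribute=value items encoded as strings (hypothetical data).
data = [{"sex=female", "age>55", "hypertension"},
        {"sex=female", "age>55"},
        {"sex=male", "hypertension"}]
print(confidence(data, {"sex=female", "age>55"}, {"hypertension"}))  # 0.5
```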

Because association rule algorithms can derive associations among data items efficiently, recent data mining research on classification tree construction has attempted to adopt this mechanism for knowledge preprocessing. For example, the apriori algorithm was applied to produce association rules that were converted into the initial population of a genetic programming (GP) system (Niimi & Tazaki, 2000; Niimi & Tazaki, 2001). Improved learning efficiency was demonstrated compared with a GP that did not use association rule algorithms. However, the handling of multivariate classification problems and improved learning accuracy were not addressed in those studies.

3. The Hybrid of Association Rule Algorithms and Genetic Algorithms (AGA)


4. The Experiments and Results

Before mining the hypertension data set, the heart disease data set from the Cleveland Clinic Foundation in the UCI repository (Blake & Merz, 1998) was used to validate the proposed approach. For comparison purposes, a simple GA (SGA) was applied to both data sets.

Generally, the association rules extracted by the apriori algorithm vary with the chosen support and confidence values, and different extracted rules may affect AGA learning performance differently. Therefore this research experimented with different sets of minimum support and confidence values for both applications. The evaluation of the classification trees generated by the SGA and AGA approaches was based on five-fold cross validation; that is, each training stage used 4/5 of the data records, with the remaining 1/5 used for the testing stage. The GA parameter settings for both applications are summarized in Table 1. In addition, the same data sets were learned with the C4.5 technique.

The Application of the Hypertension Data Set

The Data Description

This study collected data from 1152 subjects at the Department of Health Examination, Chang Gung Memorial Hospital, from July 2000 to July 2001. 489 subjects have hypertension (abnormal) and 650 are without hypertension (normal). All subjects are local Taiwan residents with no history of systemic disease. Standardized 3D anthropometric scanning protocols were performed, and the data were collected by trained staff. Subjects were instructed to fast for 12 to 14 hours prior to testing, and compliance with fasting was determined by interview on the morning of the examination. Measurements of height (to 0.1 cm) and weight (to 0.1 kg) were performed according to specified protocols (Berenson et al., 1979). BMI [weight (kg)/height (m)²] was used as an indicator of obesity. Blood pressure measurements were made on the right arm of seated, relaxed subjects. For both systolic and diastolic blood pressure, the average of six replicate mercury readings taken by two randomly assigned, trained nurses was used in the analyses. Blood pressure levels were classified according to the 1999 World Health Organization-International Society of Hypertension guidelines (Guidelines, 1999). Hypertension was defined as a systolic blood pressure (SBP) of 140 mmHg or greater and/or a diastolic blood pressure (DBP) of 90 mmHg or greater. High-normal blood pressure was defined as an SBP of 130–139 mmHg or a DBP of 85–89 mmHg. Normal blood pressure was defined as an SBP of 120–129 mmHg and a DBP of 80–84 mmHg;


optimal blood pressure was defined as an SBP of less than 120 mmHg and a DBP of less than 80 mmHg. When SBP and DBP fell into different categories, the higher category was applied.
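The category rules quoted above can be written as a small function. This is an illustrative sketch of the stated cut-offs and the "higher category wins" convention, not part of the original system.

```python
def bp_category(sbp, dbp):
    """Classify blood pressure following the 1999 WHO-ISH cut-offs quoted in
    the text; when SBP and DBP fall into different categories, the higher
    (more severe) one is applied."""
    def level(value, cuts):
        # cuts are the lower bounds of normal, high-normal, and hypertension
        return sum(value >= c for c in cuts)
    severity = max(level(sbp, (120, 130, 140)), level(dbp, (80, 85, 90)))
    return ["optimal", "normal", "high-normal", "hypertension"][severity]

print(bp_category(128, 92))  # "hypertension" (the DBP drives the category)
```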

Each data record contains 5 categorical attributes and 11 continuous attributes for the input part. The output part is one categorical attribute whose value is either normal or abnormal.

After several trials with different sets of minimum support and confidence values, the best AGA learning performance is obtained. Both the training and testing performance are summarized in Table 5, with the corresponding plots in Figs. 8 and 9. These results are based on a minimum support value of 8 and a confidence value of 100. The derived association rule set consists of 75 rules for the "normal" output category and 51 rules for the "abnormal" output category. To obtain more detail about the learning progress of the two approaches, the learning-trace behavior was recorded in sessions. Figs. 10 and 11 depict the entire learning progress monitored over generations and over time. Table 6 presents the detailed node notation for one of the superior classification trees derived. Based on this classification tree, the accuracy rates for the training and testing stages reach 98.37% and 97.84%, respectively. Records matched by no rule are labeled "Unknown" and counted as misclassifications. The accuracy rate of 5-fold testing for C4.5 is 88.1%.

Table 4. The Variables Used in the Hypertension Detection Model

Variable Type               Variable  Description                               Data Type
Demographic Data            Age       Age                                       Continuous
                            Sex       Sex (Male/Female)                         Category
Biochemical Tests           TC        Total Cholesterol                         Continuous
                            UT        Urine Turbidity (Level 1, 2, or 3)        Category
Three-Dimensional           BMI       Body Mass Index                           Continuous
Anthropometry               LAV       Left Arm Volume                           Continuous
                            TSA       Trunk Surface Area                        Continuous
                            W         Weight                                    Continuous
                            WC        Waist Circumference                       Continuous
                            WHR       Waist-Hip Ratio                           Continuous
                            WW        Waist Width                               Continuous
Risk Factors                Diet      Dietary Pattern (Yes/No)                  Category
                            SMOK      Number of Cigarettes per Day              Continuous
                            WINE      Number of Cups of Wine per Day            Continuous
Family History of Disease   FHHT      Family Hypertension History (Yes/No)      Category
                            FHSK      Family Paralysis History (Yes/No)         Category


In order to compare the prediction accuracy of classification trees based on the entire hypertension data set with that of trees based on the same data set excluding 3D information, another experiment was conducted to construct classification trees using the data without 3D information. The results in Table 7 indicate that the model based on data without 3D information exhibits inferior prediction accuracy; the best testing result shown in Table 7 is 82.11%.

5. Discussion

According to the results above, AGA achieves better learning performance than SGA in terms of both computational efficiency and accuracy. By applying the association rule process, partial knowledge is extracted and transformed into seeding chromosomes. According to Fig. 6, the initial average accuracies for SGA and AGA are 60% and 80%, respectively, on the heart disease data. In Fig. 10, the initial average accuracies for SGA and AGA are 43% and 80%, respectively, on the 3D data. This improvement in initial learning performance results from the derived association rules being transformed into the GA's seeding chromosomes.

Table 6. The Notation for Each Tree Node for a Given Data Fold (Hypertension Data)

Tree Node   Records Matched   Rule Illustration
Node 1      17                IF Sex = Female AND UT = 2 AND FHHT = No AND FHSK = No AND Age <= 64 AND WHR <= 2.16152 AND LAV > 2065.04 AND TC > 189 AND WINE <= 57 THEN Normal
Node 2      503               IF W <= 90.8105 AND TSA <= 7620.91 AND W <= 2.56355 * BMI THEN Normal
Node 3      27                IF FHHT = No AND BMI > 28.9181 AND TSA > 4601.92 THEN Abnormal
Node 4      1                 IF FHHT = Yes AND Age > 77 AND WHR <= 1.43129 AND WINE > 34 THEN Abnormal
Node 5      358               IF WW > 20.517 AND BMI <= 2.5251 * Age THEN Abnormal ELSE Unknown
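For illustration, Node 2 of Table 6 can be evaluated as a predicate over a data record. This sketch simply restates the rule and is not the authors' tree evaluator; it shows the "Attributei <= w ∗ Attributej" style of split that the AGA trees can contain.

```python
def node2_predicate(record):
    """Node 2 of Table 6 expressed as a predicate over a record (a dict of
    attribute values); a True result corresponds to the "Normal" class."""
    return (record["W"] <= 90.8105
            and record["TSA"] <= 7620.91
            and record["W"] <= 2.56355 * record["BMI"])

# Hypothetical record just to exercise the predicate.
print(node2_predicate({"W": 70.0, "TSA": 7000.0, "BMI": 28.0}))  # True -> Normal
```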

According to Fig. 6, in the training stage SGA takes 500 generations to reach the performance that AGA reaches in only 245 generations. This outcome can be attributed to the adoption of the apriori algorithm, by which the GA search space is substantially reduced. It can also be seen that AGA consistently outperforms SGA over generations. In terms of computation time, AGA takes 0.4 minutes to reach the learning performance that SGA needs at least 0.8 minutes to reach.

As shown in Fig. 10, in the training stage SGA takes 500 generations to reach the performance that AGA reaches in 50 generations. Again, AGA consistently outperforms SGA over generations. In terms of computation


time, AGA takes 0.1 minutes to reach the learning performance that SGA needs 1.5 minutes to reach.

In the 5-fold cross-validation testing results for the heart disease data, SGA's performance ranges from 77.1% to 78.1% and AGA's from 80.46% to 81.48% across the five generation settings. Both SGA and AGA outperform C4.5, whose 5-fold cross-validation testing accuracy is 76.8%.

For the hypertension (with 3D) data, SGA's 5-fold cross-validation testing performance ranges from 82.64% to 83.51% and AGA's from 85.77% to 92.71% across the five generation settings. Only AGA outperforms C4.5, whose 5-fold cross-validation testing accuracy is 88.1%.

The results reveal clues for preventive medicine that are both distinctive and coherent with current knowledge of hypertension. Body shape plays a major role in predicting hypertension for subjects without a family disease history but with a very high BMI (> 28.9) and a relatively large trunk surface area (> 4601.9 cm²). The data also show that hypertension is found in subjects with a larger waist width (> 20.5 cm) and a small body mass index relative to their age. The mechanism behind the relationship between three-dimensional body measures and hypertension is still imperfectly understood. Nevertheless, the findings suggest that a set of indicators capable of extending the current knowledge boundary of diagnostic decision support in medicine is close at hand.

6. Conclusions and Future Development

We have introduced the AGA approach, which hybridizes the apriori algorithm with a genetic algorithm for classification tree induction. The results of predicting hypertension from anthropometric and 3D measurements are promising and innovative in the field of biomedical sciences. Specifically, significant predictors of hypertension are AGE, FHHT, WHR, WINE, WW, W, BMI, TSA, and TC.

Incorporating knowledge associated with the classification results is crucial for improving evolutionary-based mining tasks. By employing the association rule algorithm to acquire partial knowledge from the data, the proposed approach induces a classification tree more effectively and efficiently by converting the derived association rules into the GA's seeding chromosomes.

Compared with SGA, AGA achieves higher predictive accuracy and requires less computation time for classification tree induction, as demonstrated on a UCI benchmark data set as well as the hypertension data set.

Predicting a subject's hypertension from 3D anthropometric data and other related profile information is a novel form of medical decision support. Without


3D information, the remaining hypertension data cannot produce better prediction models. According to the experimental results, AGA provides a feasible and acceptable solution for predicting a subject's hypertension, with about 90% classification accuracy. The accuracy rates of the models derived for the heart disease and 3D data sets also outperform those of the models constructed with the C4.5 approach.

In addition, the classification trees discovered by AGA not only achieve higher predictive accuracy and computational efficiency, but may also yield knowledge that is more transparent and significant to users.

The proposed AGA is generic and problem independent. Besides integrating with association rule algorithms for knowledge preprocessing, AGA can flexibly incorporate user information or domain knowledge into the tree induction process through the expressive power of first-order logic. The approach is applicable not only to binary classification problems but also to multi-category classification problems.

Currently, the AGA approach is able to reveal tree-splitting nodes that allow complex discriminating formats such as the "Attributei <= w ∗ Attributej" relationship, which can be extended in the future to express more complicated multivariate inequations of either linear or nonlinear form.

A Constraint-Based Evolutionary Classification Tree (CECT) for Financial Performance Prediction

Abstract

Most evolutionary computation methods for discovering hidden valuable knowledge from large volumes of data must deal with computational efficiency, prediction accuracy, and the expressiveness of the derived models. In this paper we propose a constraint-based evolutionary classification tree (CECT) approach that combines constraint-based reasoning with evolutionary techniques to generate useful patterns from data more effectively. Constraint-based reasoning is employed to reduce the solution-irrelevant search space by filtering invalid chromosomes. The CECT approach allows the problem constraints to be specified as relationships among attributes according to predefined requirements, user preferences, or partial knowledge in the form of a constraint network. In addition, an association rule algorithm is employed to identify the input variables most associated with the outcome variable before executing the evolutionary process. The derived insights are then converted into partial knowledge that can be translated into the constraint network. The proposed approach is implemented, tested, and compared


with a regular genetic algorithm (GA) to predict corporate financial performance using data from the Taiwan Economic Journal (TEJ).

Keywords: Constraint-Based Reasoning; Genetic Algorithms; Rule Induction; Classification Trees; Financial Performance; Association Rule

1. Introduction

Revealing valuable knowledge hidden in corporate data is becoming more critical

for enterprise decision making. As more data are collected and accumulated, extensive data analysis is not feasible without effective and efficient data mining methods. In addition to statistical and other machine learning methods, the recent development of novel or improved data mining methods such as Bayesian networks (Cooper, 1991), frequent patterns (Srikant, 1996), decision or regression trees (Gehrke, 1999; Garofalakis, 2000), and evolutionary computation algorithms (Bojarczuk, 2001; Carvalho, 1999; Correa, 2001; Freitas, 1999) has drawn increasing attention from academia and industry.

Rule induction is one of the most common methods of knowledge discovery. It is

a method for discovering a set of "If/Then" rules that can be used for classification or estimation. That is, rule induction converts the data into a rule-based representation that can serve either as a knowledge base for decision support or as an easily understood description of the system behavior. An ideal rule induction technique has to carefully address aspects such as model comprehensibility and interestingness, attribute selection, and learning efficiency and effectiveness. Genetic algorithms (GAs), one of the most often used evolutionary computation techniques, have gained increasing recognition for their flexibility and expressiveness in problem representation as well as their fast search capability for knowledge discovery. In the past, genetic algorithms were mostly employed to enhance the learning process of data mining algorithms such as neural networks or fuzzy expert systems, rather than to discover models or patterns directly. In that role, genetic algorithms act as a method for performing a guided search for good models in the solution space. While genetic algorithms are an interesting approach to optimizing models, they impose a heavy computational load when the search space is huge.

Generally, rule induction methods are used to automatically produce rule sets that predict the expected outcomes as accurately as possible. However, the emphasis on revealing novel or interesting knowledge has become a recent research issue in data mining. Such attempts may impose additional rule-discovery constraints and thereby add computational overhead. In regular GA operations, constraint


validation is performed after a candidate chromosome is produced; that is, several iterations may be required to obtain a valid chromosome. One way to reduce the computational load is to prevent the production of invalid chromosomes before a chromosome is generated, thereby accelerating the efficiency and effectiveness of the evolution process. Potentially, this can be done by embedding a well-designed constraint mechanism into the chromosome-encoding scheme.

In this research we propose a novel approach that integrates an association rule algorithm and constraint-based reasoning with GAs to discover classification trees. An association rule algorithm, also known as market basket analysis, is used for attribute selection, so that the related input variables can be determined before the GA's evolution proceeds. Constraint-based reasoning is a process that incorporates various inference techniques including local propagation, backtrack-free search, and tree-structured reduction. The constraint-based reasoning mechanism is used to push constraints, along with data insights, into the rule set construction. This research applies hybrid techniques of local propagation and tree search to ensure local consistency before continuing the search in the GA process. Local propagation can remove from the search space those gene values that cannot meet the predefined constraints. This approach allows constraints to be specified as relationships among attributes according to predefined requirements, user preferences, or partial knowledge in the form of a constraint network. In essence, it provides a chromosome-filtering mechanism prior to generating or evaluating a chromosome, so insignificant or irrelevant rules can be precluded in advance via the constraint network.
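A minimal sketch of the local-propagation idea follows, assuming a hypothetical encoding in which each constraint is a predicate over named gene values; it prunes any domain value that cannot participate in a consistent assignment. It is an illustration of the general technique, not the paper's implementation.

```python
from itertools import product

def propagate(domains, constraints):
    """Repeatedly prune gene domains by local propagation: a value is removed
    when no combination of the other genes' remaining values can satisfy some
    constraint together with it. `constraints` is a list of (vars, predicate)
    pairs, a hypothetical encoding of the constraint network."""
    changed = True
    while changed:
        changed = False
        for variables, pred in constraints:
            for var in variables:
                others = [v for v in variables if v != var]
                keep = []
                for value in domains[var]:
                    combos = product(*(domains[o] for o in others))
                    if any(pred(**{var: value, **dict(zip(others, c))}) for c in combos):
                        keep.append(value)
                if len(keep) < len(domains[var]):
                    domains[var] = keep
                    changed = True
    return domains

# Example: a hypothetical constraint "W <= 2.6 * BMI" prunes impossible weights.
doms = {"W": [60, 90, 120], "BMI": [20, 30]}
print(propagate(doms, [(("W", "BMI"), lambda W, BMI: W <= 2.6 * BMI)]))
# {'W': [60], 'BMI': [30]}
```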

Propositional logic is a popular representation in rule induction systems. Our proposed approach allows first-order logic to be extended to formulate knowledge in the form of linear inequations. This enhances the expressive power of the rule set, modeling the data behavior more effectively.

A prototype system based on the CECT (Constraint-based Evolutionary Classification Tree) approach is developed to predict corporate financial performance using TEJ financial data for the year 2001.

The remainder of this paper is organized as follows. In Section 2, previous research works and related techniques are reviewed. The detailed CECT procedures are then introduced in Section 3. Section 4 presents the experiments and results with financial data sets followed by discussion and conclusions.

2. The Literature Review

2.1. Genetic Algorithm for Rule Induction


Current rule induction systems typically fall into two categories: "divide and conquer" (Quinlan, 1993) and "separate and conquer" (Clark & Niblett, 1989). The former recursively partitions the instance space until regions of roughly uniform class membership are obtained. The latter induces one rule at a time and separates out the covered instances. Rule induction methods may also be categorized into either tree-based or non-tree-based methods (Abdullah, 1999). Some of the often-mentioned decision tree induction methods include the C4.5 (Quinlan, 1993), CART (Breiman et al., 1984) and GOTA (Hartmann et al., 1982) algorithms. Both decision trees and rules can be described as disjunctive normal form (DNF) models. Decision trees are generated from data in a top-down, general-to-specific direction (Chidanand & Sholom, 1997). Each path to a terminal node is represented as a rule consisting of a conjunction of tests on the path's internal nodes. These ordered rules are mutually exclusive (Clark & Boswell, 1991). Quinlan (1993) introduced techniques to transform an induced decision tree into a set of production rules.

Michalski et al. (1986) proposed the AQ15 algorithm to generate a disjunctive set

of classification rules. The CN2 rule induction algorithm also uses a modified AQ algorithm that involves a top-down beam search procedure (Clark & Niblett, 1989). It adopts entropy as its search heuristic and is only able to generate an ordered list of rules. The Basic Exclusion Algorithm (BEXA), another type of rule induction method, was proposed by Theron & Cloete (1996). It follows a general-to-specific search procedure in which disjunctive conjunctions are allowed, and every conjunction is evaluated using the Laplace error estimate. More recently, Witten and Frank (1999) described covering algorithms for discovering rule sets in conjunctive form.

GAs have been successfully applied to data mining for rule discovery in the literature. Some techniques use one-rule-per-individual encoding, as proposed by Greene & Smith (1993) and Noda et al. (1999). In this approach, a chromosome is usually a linear string of rule conditions, where each condition is often an attribute-value pair, representing a rule or a rule set. Although this individual encoding is simpler and syntactically shorter, the problem is that the fitness of a single rule is not necessarily the best indicator of the quality of the discovered rule set. The several-rules-per-individual approach (De Jong et al., 1993; Janikow, 1993) has the advantage of considering the rule set as a whole, taking rule interactions into account. However, it makes the chromosome encoding more complicated and syntactically longer, which usually requires more complex genetic operators.

Hu (1998) proposed a Genetic Programming (GP) approach in which a program


can be represented by a tree with rule conditions and/or attribute values in the leaf nodes and functions in the internal nodes. The challenge is that a tree can grow in size and shape in a very dynamic way; thus, an efficient tree-pruning algorithm is required to prune unsatisfactory parts of a tree and avoid infeasible solutions. Bojarczuk et al. (2001) proposed a constrained-syntax GP approach to build a decision model, with particular emphasis on the discovery of comprehensible knowledge. The constrained-syntax mechanism verifies the relationship between operators and the data types of operands during tree building.

In order to discover high-level prediction rules, Freitas (1999) applied first-order relationships such as "Salary > Age" by checking an Attribute Compatibility Table (ACT) during the discovery process with GA-Nuggets. The ACT was claimed to be especially effective for its knowledge representation capability. By extending the use of the ACT, our proposed approach allows other complicated attribute relationships, such as linear or non-linear quantitative relationships among multiple attributes. This mechanism helps reduce the search space during the GA's evolution.

2.2. Classification Trees

Among data mining techniques, a decision tree is one of the most commonly used methods for knowledge discovery. A decision tree is used to discover rules and relationships by systematically breaking down and subdividing the information contained in data (Chou, 1991). A decision tree features easy interpretability and a simple top-down tree structure in which decisions are made at each node. The nodes at the bottom of the resulting tree provide the final outcome, either a discrete or a continuous value. When the outcome is a discrete value, a classification tree is developed (Hunt, 1993), while a regression tree is developed when the outcome is numerical and continuous (Bala & De Jong, 1996).

Classification is a critical type of prediction problem. It aims to examine the features of a newly presented object and assign it to one of a predefined set of classes (Michael & Gordon, 1997). Classification trees are used to predict the membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. The classification tree induction process typically selects the attribute at a given node according to criteria such as Quinlan's information gain (ID3), the gain ratio (C4.5), or QUEST's statistics-based approach, in order to determine proper attributes and split points for the tree (Quinlan, 1986; 1992; Loh & Shih, 1997).


2.3. Association Rule Algorithms for Attribute Selection

Many algorithms can be used to discover association rules from data in order to identify patterns of behavior. One of the most widely used is the apriori algorithm, detailed in Agrawal et al. (1993) and Agrawal & Srikant (1994). For instance, an association rule algorithm is able to produce a rule such as: when people buy Bankers Trust, they also buy Dow Chemical 20 percent of the time.

An association rule algorithm, given the minimum support and confidence levels, is able to quickly produce rules from a set of data through the discovery of so-called itemsets. A rule has two measures, called confidence and support. Support (or prevalence) measures how often items occur together, as a percentage of the total records. Confidence (or predictability) measures how much a particular item depends on another.

Because association rule algorithms can derive associations among data items efficiently, recent data mining research on classification tree construction has attempted to adopt this mechanism for knowledge preprocessing. For example, the apriori algorithm was applied to produce association rules that were converted into the initial population of a genetic programming (GP) system (Niimi & Tazaki, 2000; Niimi & Tazaki, 2001). Improved learning efficiency was demonstrated compared with a GP that did not use association rule algorithms. However, the handling of multivariate classification problems and improved learning accuracy were not addressed in those studies.

2.4. Constraint Satisfaction in Genetic Algorithms

Constraint satisfaction involves finding values for problem variables, subject to constraints on acceptable solutions, such that all the constraints are satisfied. Solving a constraint satisfaction problem (CSP), which in general is NP-complete, often lacks a single well-suited method. A number of different approaches have been developed for solving CSPs. Some adopt constraint propagation to reduce the solution space. Others use backtracking to search directly for possible solutions. Still others apply a combination of these two techniques, including tree search and consistency algorithms, to find one or more feasible solutions efficiently. Nadel (1988) compared the performance of several algorithms, including "generate and test", "simple backtracking", "forward checking", "partial lookahead", "full lookahead", and "really full lookahead." The major differences among


these algorithms lie in the degree of consistency enforced at each node during the tree-solving process. Apart from the "generate and test" method, the others employ hybrid techniques. In other words, whenever a new value is assigned to a variable, the domains of all unassigned variables are filtered, leaving only those values consistent with the values already assigned. If the domain of any uninstantiated variable becomes empty, a contradiction is recognized and backtracking occurs. Freuder (1982) showed that if a given CSP has a tree-structured graph, it can be solved without any backtracking; that is, solutions can be retrieved in a backtrack-free manner. Dechter & Pearl (1988) used this theory, coupled with the notion of directional consistency, to generate backtrack-free solutions more efficiently.

Dealing with constraints for search space reduction is an important research issue in many areas of artificial intelligence. GAs maintain a set of chromosomes (solutions), called a population, consisting of parents and offspring. As the evolution process proceeds, the best N chromosomes in the current population are selected as parents. Through the genetic operators, offspring are selected according to a filtering criterion usually expressed as a fitness function together with some predefined constraints. The GA evolves over generations until the stopping criteria are met. However, valid chromosomes are usually produced by trial and error: a candidate chromosome is produced and then tested against the filtering criteria. A GA may therefore require more computation, especially when the filtering criteria are complicated or strict. To address this problem, an effective chromosome construction process can be applied at the initialization, crossover, and mutation stages.

Garofalakis (2000) provided a model-constraint-based algorithm inside the mining process to specify the expected tree size and accuracy. In other words, constraints can express the trade-off between model accuracy and the computational efficiency associated with tree building or tree pruning. In related work, CADSYN (Maher, 1993) adopted a constraint satisfaction algorithm for case adaptation in case-based reasoning. Purvis (1995) also applied a repair-based constraint satisfaction algorithm to aid case adaptation. By pushing constraints into the case adaptation process, the constraint satisfaction algorithm helps a case-based reasoning system produce better solution cases more efficiently for discrete constraint problems.

Barnier & Brisset (1998) developed a hybrid system combining a genetic algorithm with constraint satisfaction techniques. The method was used to solve optimization problems in vehicle routing and radio link frequency assignment. Essentially, this approach applied a GA to reduce the search space of a large CSP, rather than applying


constraint satisfaction techniques to improve the GA's computational efficiency. Kowalczyk (1996) introduced the concept of using constraint satisfaction principles to support a GA in handling constraints. However, little research has applied constraint-based reasoning to effectively address the computational inefficiency of GAs, or examined how the user's knowledge can be represented and processed within a constraint network.

3. The Proposed Constraint-based Evolutionary Classification Tree (CECT) Approach

The proposed CECT approach consists of three modules: the user interface, the symbol manager, and the constraint-based GA (CBGA). As shown in Fig. 1, the user interface module allows users to perform the following system operations:

- Loading a constraint program,
- Adding or retracting constraints,
- Controlling the GA's parameter settings, and
- Monitoring the best solutions.

The constraint program here is a set of first-order logic sentences (atomic, compound, or quantified) about a many-sorted universe of discourse that includes integers, real numbers, and arbitrary application-specific sorts. The details can be found in Lai (1992).


[Fig. 1 depicts the CECT system: the user works through the user interface with the constraint program, symbol manager, and constraint network, which feed the constraint-based GA. The CBGA comprises (a) GA initialization, (b) chromosome filtering using constraint-based reasoning, (c) fitness evaluation, (d) the stop-condition test, and (e) the GA operators (selection, crossover, mutation); when the stop condition is met, the best solution is returned.]

Fig. 1. The Conceptual Diagram of the Proposed CECT System

Fig. 2 illustrates the knowledge preprocessing for constructing the constraint

program. Three types of data sources (GA parameter settings, human knowledge, and data sets) are converted into the constraint program. Each gene in the GA corresponds to an object of the constraint program, and the range of each gene can be viewed as a domain constraint on that object. The predefined hard constraints are represented by first-order logic sentences.

The human knowledge specifies user preferences or partial expert experience. For example, a user preference such as "people with high blood pressure cannot take certain drugs" can be treated as one type of expert knowledge. It can be translated into user-defined constraints in the form of first-order logic sentences.

The association rule mining module generates association rules using the apriori algorithm. In this research the derived association rules must satisfy user-specified minimum support and minimum confidence constraints. The format of the developed constraints is given by:

A1 ^ ... ^ Ak -> Ci, where A1, A2, ..., Ak is a conjunction of conditions, each condition is gi = vi (vi is a value from the domain of gene gi), and Ci is the classification result.
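One plausible way to encode such a mined rule as a chromosome constraint is sketched below; the dictionary encoding and the reserved "class" gene are assumptions for illustration, not the paper's exact representation.

```python
def rule_to_constraint(conditions, consequent):
    """Encode one mined rule A1 ^ ... ^ Ak -> Ci as a predicate over a
    chromosome (dict gene -> value): if every antecedent gene matches,
    the class gene must take the rule's consequent value."""
    def constraint(chromosome):
        if all(chromosome.get(g) == v for g, v in conditions.items()):
            return chromosome.get("class") == consequent
        return True  # rule not triggered, constraint trivially satisfied
    return constraint

# Example: the rule "Sex = Female AND UT = 2 -> Normal" as a constraint.
c = rule_to_constraint({"Sex": "Female", "UT": 2}, "Normal")
print(c({"Sex": "Female", "UT": 2, "class": "Normal"}))  # True
```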


Fig. 2. The Knowledge Preprocessing for Constraint Program Construction

The symbol manager examines the syntax of the first order logic sentences in the

constraint program and translates the syntax into a constraint network for further processing.

In the CBGA module, constraint-based reasoning filters each gene value and processes both the initial GA population and the subsequent regular populations. The details of each module are as follows.

Module (a) -- GA Initialization: To generate a valid chromosome in the beginning, an initial population can be derived through a chromosome filtering process using constraint-based reasoning to produce valid candidates.

Module (b) -- Chromosome Filtering: During the chromosome construction process, any determined gene may affect the valuation of the remaining genes. Whenever a gene is randomly instantiated from the satisfactory universe, the possible values for the remaining genes are restricted by constraint propagation to assure consistency with the constraint network. In this way the chromosomes constructed in the CBGA are guaranteed to be valid, and the hard and soft constraints can all be examined and resolved during the chromosome construction process.

Module (c) -- Fitness Evaluation: Calculate the fitness value for each chromosome in the current population.

Module (d) -- Stop Condition Met: If a specified stopping condition is satisfied, then terminate the CBGA and return the best solution; otherwise, continue the GA operations.

Module (e) -- GA Operators: Each operation, such as crossover or mutation, generates new offspring. The offspring produced by the CBGA are validated by the chromosome-filtering operation to ensure that valid individuals are derived for further processing.

To speed up the reasoning process, both the variable ordering (Purdom, 1983) and backtrack-free search methods (Freuder, 1982) are adopted in the CBGA to derive contradiction-free chromosomes.

As shown in Fig. 3, each chromosome Ci of a regular population of size N is processed by constraint-based reasoning in sequence. The genes in a chromosome can be viewed as a subset of variables, and the valuation scope of each gene gij is restricted via constraint-based reasoning. This system efficiently transforms human knowledge, such as expert experience or common sense, into a constraint network leading to more


significant rule sets. The search effort for valid chromosomes can be reduced without having to activate the fitness evaluation procedure on every candidate chromosome.

Fig. 4 illustrates the details of the chromosome construction process. In essence, the CECT system applies local propagation for chromosome filtering. The basic concept of local propagation (Freuder, 1982) is to use the information local to a constraint to validate gene values; local propagation also restricts the valid gene range referenced in the constraint. The satisfactory gene values (i.e., the satisfactory universe) are propagated through the network, enabling other constraints to bound the valid ranges of the remaining genes. According to the restricted valid range denoted by the satisfactory universe SGj, the valuation process for gene gij is then activated to examine the inferred gene value. The new value g'ij is replaced by a value randomly selected from SGj if the inferred gene value is inconsistent with the constraint network. By repeatedly applying local propagation and the valuation process to chromosome Ci in the sequence gi1, gi2, ..., gim, the new chromosome C'i satisfies the constraint network. As a result, local propagation offers an efficient way to guide the GA toward the best chromosome by reducing the search space already filtered by the constraint network.
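The repair loop of Fig. 4 might look roughly like the following sketch, assuming genes are processed in a fixed order and each constraint is a predicate over a partial assignment; the dictionary-based constraint encoding and the example constraint are hypothetical.

```python
import random

def filter_chromosome(chromosome, domains, constraints, rng=random.Random(0)):
    """Repair a chromosome gene by gene, in the spirit of Fig. 4: for gene
    g_ij the satisfactory universe SG_j is computed from the constraints and
    the genes already fixed, and an inconsistent value is replaced by a
    random member of SG_j."""
    fixed = {}
    for gene, value in chromosome.items():       # processed in order g_i1 ... g_im
        satisfactory = [v for v in domains[gene]
                        if all(c({**fixed, gene: v}) for c in constraints)]
        if value not in satisfactory:
            value = rng.choice(satisfactory)     # resample from SG_j
        fixed[gene] = value
    return fixed

# Example: enforce the hypothetical constraint WW <= BMI on one chromosome.
doms = {"BMI": [20, 30], "WW": [15, 25, 35]}
cons = [lambda a: a.get("WW", 0) <= a.get("BMI", float("inf"))]
print(filter_chromosome({"BMI": 20, "WW": 35}, doms, cons))  # {'BMI': 20, 'WW': 15}
```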

Fig. 3. The Framework of Constraint-Based Preprocessing for GA Operators

Fig. 4. The Detail Illustration for Chromosome Screening

4. The Experiments and Results

Before analyzing the TEJ data sets, the credit screening data set from the UCI

repository (Blake & Merz, 1998) was adopted to validate the proposed approach. In addition, two other approaches, a simple GA (SGA) and the apriori algorithm with a GA (AGA), were employed for comparison with our proposed approach, denoted ACECT (i.e., the apriori algorithm with CECT).

Generally, the association rules extracted by the apriori algorithm vary with the chosen support and confidence values, and different extracted rules may affect CECT learning performance differently. Therefore this research experimented with different sets of minimum support and confidence values for both the credit screening and financial performance prediction problems. The evaluation of the classification trees generated by each of the three approaches was based on five-fold cross validation; that is, each training stage used 4/5 of the data records, with the remaining 1/5 used for the testing stage. The GA parameter settings for both applications are summarized in Table 1.
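A minimal sketch of the five-fold protocol described above, assuming records are indexed 0..n-1; the shuffling and fold assignment here are illustrative, not the authors' exact split.

```python
import random

def five_fold_indices(n_records, seed=0):
    """Yield (train_idx, test_idx) pairs for five-fold cross validation:
    each fold uses 4/5 of the records for training and 1/5 for testing."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    fold_size = n_records // 5
    for k in range(5):
        test = idx[k * fold_size:(k + 1) * fold_size]
        train = idx[:k * fold_size] + idx[(k + 1) * fold_size:]
        yield train, test
```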


Application to Credit Screening Data Sets

The 690 collected credit screening data records consist of nine categorical attributes and six continuous attributes for the input part; the output part is one categorical attribute. The nine categorical attributes are fed into the apriori algorithm to produce the association rules. Among the records, 37 incomplete records were excluded from all learning processes. After several trials with different sets of minimum support and confidence values, the best learning performance for AGA and ACECT is obtained. Both the training and testing results are summarized in Table 2, with the corresponding plots in Figs. 5 and 6. These results are based on a minimum support value of 3 and a confidence value of 100. The derived association rule set consists of 480 rules for the "Approve" output category and 4096 rules for the "Disapprove" output category. To obtain more detail about the learning progress of the three approaches, the learning-trace behavior was recorded in sessions. Figs. 7 and 8 depict the entire learning progress monitored over generations and over time. The optimal classification tree derived and its detailed notation are given in Fig. 9 and Table 3.

Table 1. The GA Parameter Settings

Item                                                Value
Population Size                                     100
Generations                                         100/200/300/400/500
Crossover rate                                      0.6
Mutation rate                                       0.01
Selection method                                    Roulette wheel
Training time (Credit Screening)                    1.4 minutes*
Training time (Financial Performance Prediction)    1.7 minutes*

* The hardware platform is a Pentium III 800 MHz with 512 MB RAM.

Table 2. The 5-Fold Learning Results for SGA, AGA, and ACECT

Application to Financial Performance Data Sets

Financial ratios are commonly employed to measure corporate financial

performance. In recent years a considerable amount of research has been directed towards analyzing the predictive power of financial ratios as influential factors in corporate stock market behavior. Some financial ratios, such as the current ratio, receivables turnover, times interest earned, and capital, have been used for bankruptcy prediction (Shah & Murtaza, 2000) and financial distress prediction (Coats & Fant, 1991; Ganesalingam, 2001). This research applies the CECT approach to construct a


classification model for predicting corporate financial performance using various financial ratios. The notation for the variables is specified in Table 4.

The dependent variable is a categorical type of data labeled by either “Good” or “Bad” according to the Tobin’s Q value. Tobin’s Q is a measure for evaluating a corporate’s financial performance (Ciccolo & Fromm, 1979; Lindenberg & Ross, 1981). The higher value a Tobin’s Q is, the better a corporate financial performance is. On the other hand, the lower value a Tobin’s Q is, the inferior a corporate financial

Table 4. The Various Financial Ratios Used in the Model

Descriptions Data Type X1: Industry type (22 types) category X2: Credit rating (1-10 rating) category X3: Employee size (1-4 level) category X4: Capital scope (1-4 level) category X5: Current ratio continuous X6: Debt ratio continuous X7: Times interest earned continuous X8: Receivables turnover continuous

This research denotes the dependent variable as "Good" if Tobin's Q > 1 and "Bad" if Tobin's Q <= 1. The data used was derived from the Taiwan Economic Journal (TEJ) database, a standard source of financial market data. A total of 502 financial data records of companies listed on the Taiwan Stock Market for the entire year 2001 were collected. Each data record includes the eight input variables and one output Tobin's Q value. The Tobin's Q value is converted into either "Good" or "Bad" before executing the learning process. Among the entire data set, 181 records are "Good" while 329 records are "Bad".
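As a concrete illustration of this labeling step, the following minimal Python sketch converts Tobin's Q values into the "Good"/"Bad" categories and tallies the class distribution. The record layout and values are assumptions for illustration, not the authors' actual preprocessing code.

def label_tobins_q(q_value, threshold=1.0):
    # Convert a Tobin's Q value into the categorical outcome used for learning.
    return "Good" if q_value > threshold else "Bad"

# Hypothetical records: (company_id, tobins_q)
records = [("A", 1.35), ("B", 0.82), ("C", 1.02), ("D", 0.67)]
labels = [label_tobins_q(q) for _, q in records]

counts = {c: labels.count(c) for c in ("Good", "Bad")}
print(counts)  # e.g. {'Good': 2, 'Bad': 2}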

After several trials with different sets of minimum support and confidence values, the best ACECT learning performance is obtained. Both the training and testing performance are summarized in Table 5, along with their corresponding representation depicted in Fig. 10 & 11. These results are based on a minimum support value of 7 and a confidence value of 100. The derived association rule sets consist of 12 rules for the "Good" output category and 14 association rules for the "Bad" output category. In order to obtain more details about the learning progress of the three approaches, learning trace behavior was recorded in sessions. Fig. 12 & 13 depict the entire learning progress monitored over generations and time. The details of the optimal results derived are illustrated in Appendix A.

Table 5. The Summarized Learning Performance for SGA, AGA, and ACECT


5. Discussion

According to the results indicated above, ACECT achieves superior learning performance to SGA and AGA in terms of computation efficiency and accuracy. According to Fig. 5 & 6, the testing performance becomes worse for the training models of higher generations, mainly due to over-training.

By applying the association rule process, partial knowledge is extracted and transformed into seeding chromosomes. It can be seen that the initial training results for both AGA and ACECT exhibit significantly higher accuracy and efficiency than those of SGA.

As displayed in Fig. 7, both AGA and ACECT approach relative convergence within 20 generations, while SGA requires 160 generations to reach a similar result in the training stage. Further, as shown in Fig. 12, both AGA and ACECT approach relative convergence within 50 generations, while SGA requires 500 generations to reach a similar result in the training stage. These outcomes can be attributed to the adoption of the apriori algorithm, by which the GA search space is substantially reduced.

For the ACECT approach, the derived partial knowledge is not only encoded as seeding chromosomes but also converted into the constraint network. As shown in the figures displaying the learning progress, ACECT outperforms AGA by a smaller margin than it outperforms SGA. Nevertheless, the improvement of ACECT over AGA demonstrates its effectiveness on both application data sets.

6. Conclusions and Future Development

We have introduced the CECT approach, which hybridizes constraint-based reasoning within a genetic algorithm for classification tree induction. Incorporating partial knowledge or user-control information into the mining process is not straightforward and typically requires the design of novel approaches. By employing an association rule algorithm to acquire partial knowledge from data, the proposed approach is able to induce a classification tree by pushing the partial knowledge into chromosome construction. Most importantly, the adoption of a constraint-based reasoning mechanism in the GA process can filter invalid chromosomes; therefore feasible solutions can be derived more efficiently.

Compared with SGA and AGA, ACECT achieves higher predictive accuracy and requires less computation time for classification tree induction on both a benchmark data set and a real financial data set. In addition, the classification trees discovered by ACECT not only obtain higher predictive accuracy and computational efficiency, but may also produce more transparent or significant knowledge for users.

The proposed CECT is generic and problem independent. It can flexibly incorporate user information or domain knowledge into the tree induction process via the expressive power of first-order logic. In addition, no proprietary genetic operator or chromosome representation needs to be designed to interact with constraint-based reasoning. Besides integrating with association rule algorithms for knowledge preprocessing, the CECT approach offers a way for domain experts to input professional knowledge or constraints so as to reveal further interesting knowledge. Although the experimental examples are binary classification problems, the approach is also applicable to multi-category classification problems.

Currently the CECT approach is able to reveal tree-splitting nodes that allow complex rule-like discriminating formats such as the relationship "Attributei <= w ∗ Attributej", which can be extended in the future to express more complicated multivariate inequalities in either a linear or nonlinear form.

Improving Financial Performance by Exploring the Financial Ratios

Basically, prediction models map the input values to produce the outcome(s). When the model is complex, it is not easy to figure out the appropriate inputs that can best approximate the expected output. This type of research is usually called parameter design. The classification tree constructed by the proposed CECT approach can be a multivariate-split-based classification tree, so it would be relatively difficult to find suitable input values that match an expected outcome. A mechanism that supports "what-if" as well as "goal-seeking" analysis can be a useful aid for financial managers in further exploring the financial ratios that are most likely, or most unlikely, to be adjustable to improve corporate financial performance. In addition to the proposed CECT approach, this research is currently working on adopting another optimization technique to support the "goal-seeking" function. It is believed that such information provides high strategic value for corporate financial management.


The Hybrid of Apriori Algorithm and Constraint-Based Genetic Algorithm for Aircraft Electronic Ballasts Troubleshooting

Abstract

The maintenance of aircraft components is crucial for avoiding aircraft accidents and aviation fatalities. Reliable and effective maintenance support is vital to airline operations and flight safety. Sharing previous repair experiences with state-of-the-art computer technology can improve aircraft maintenance productivity. This research proposes a hybrid of the apriori algorithm and a constraint-based genetic algorithm (ACBGA) to discover a classification tree for electronic ballast troubleshooting. The constraint-based genetic algorithm (CBGA) is employed to reduce the solution-irrelevant search space by filtering invalid chromosomes. The apriori algorithm is used to acquire insights into which input itemsets are most associated with the outcome itemsets before executing the CBGA for tree induction. Compared with a simple GA and the apriori algorithm plus a simple GA, the proposed approach is able to discover the troubleshooting classification tree with superior computational efficiency and prediction accuracy.

Keywords: Constraint-Based Reasoning; Genetic Algorithms; Classification Trees; Aircraft Electronic Ballast; Apriori Algorithm

1. Introduction

Every airplane in operation throughout the world calls for appropriate maintenance in order to assure flight safety. When an aircraft fault emerges, actions for fault diagnosis and troubleshooting must be executed promptly and effectively. An airplane consists of many electronic components, among which the electronic ballast is one common component for controlling the cabin fluorescent lamps. The electronic ballast plays an important role in providing proper lighting for passengers and flight crews during a flight. Unstable cabin lighting, such as flashing and ON/OFF problems, would degrade the flight quality. An airplane usually has hundreds of electronic ballasts mounted in panels such as the light deflector of a fluorescent lamp fixture. When an electronic ballast is abnormal, it has to be removed and sent to the accessory shop for further investigation.

Revealing valuable knowledge hidden in corporate data is becoming more critical for enterprise decision making. As more data is collected and accumulated, extensive data analysis is not easy without effective and efficient data mining methods. The maintenance records of electronic ballasts generally contain information about the number of defective units found, the procedures taken, and the inspection or repair status. Basically these records are stored and used to assist mechanics in identifying faults and determining whether component repair or replacement is necessary, because previous similar solutions may provide valuable troubleshooting clues for new faults.

Rule induction is one of the most common methods of knowledge discovery. It is a method for discovering a set of "If/Then" rules that can be used for classification or estimation. Basically an ideal technique for rule induction has to carefully address aspects such as model comprehensibility and interestingness, attribute selection, and learning efficiency and effectiveness. The genetic algorithm (GA), one of the often used evolutionary computation techniques, has been increasingly recognized for its superior flexibility and expressiveness of problem representation as well as its fast search capability for knowledge discovery. In the past, genetic algorithms were mostly employed to enhance the learning process of data mining algorithms such as neural nets or fuzzy expert systems, rather than to discover models or patterns directly. That is, genetic algorithms acted as a technique for performing a guided search for suitable patterns in the solution space. While genetic algorithms are a feasible approach for discovering patterns, they incur a heavy computation load when the search space is huge.

Generally, rule induction methods are used to automatically produce rule sets for predicting the expected outcomes as accurately as possible. However, the emphasis on revealing crucial or interesting knowledge has become a recent research issue in data mining. These attempts may impose additional rule discovery constraints and thereby produce additional computation overhead. In regular GA operations, constraint validation proceeds after a candidate chromosome is produced. That is, several iterations may be required to determine a valid chromosome. One way to relieve this computation load is to prevent the production of invalid chromosomes before a chromosome is generated, thereby improving the efficiency and effectiveness of the evolution process. Potentially, this can be done by embedding a well-designed constraint mechanism into the chromosome-encoding scheme.

In this research we propose a novel approach that integrates the apriori algorithm and a constraint-based genetic algorithm (ACBGA) to discover a troubleshooting classification tree. The apriori algorithm, one of the most commonly used association rule algorithms, is used for attribute selection, so that the relevant input attributes can be determined before the GA's evolution proceeds. Constraint-based reasoning is used to push constraints, along with data insights, into the rule set construction. This research applied tree search and forward checking techniques (Haralick & Elliott, 1980; Brailsford et al., 1999) to remove from the search space those gene values that cannot meet the predefined constraints during the evolution process. This approach allows constraints to be specified as relationships among attributes according to predefined requirements, user preferences, or partial knowledge in the form of a constraint network. In essence, this approach provides a chromosome-filtering mechanism prior to generating and evaluating a chromosome. Thus insignificant or irrelevant rules can be precluded in advance via the constraint network.

A prototype troubleshooting system based on the ACBGA approach was developed using the aircraft maintenance records of electronic ballasts provided by one major airline company in Taiwan.

2. The Background

2.1. The Aircraft Electronic Ballasts

The aircraft electronic ballasts used to drive fluorescent lamps can be mounted on a panel such as the light deflector of a fluorescent lamp fixture. The fluorescent lamps initially require a high voltage to strike the lamp arc and subsequently maintain a constant current. Usually there is a connector at one end of the unit for the routing of all switching and power connections. As shown in Fig. 1, the electronic ballast operates from control lines of 115 VAC/400 Hz aircraft power. When the operating power is supplied, the electronic ballast will start and operate two rapid-start fluorescent lamps or one single lamp in the passenger cabin of various commercial aircraft, such as the Boeing 747-400, 737-300, 737-400, and 747-500. Two control lines connect the ballast set with the control panel for the ON/OFF and BRIGHT/DIM modes, of which the DIM mode is used at night when the cabin personnel attempt to decrease the level of ambient light.

To diagnose which component has malfunctioned, mechanics usually measure the alternating current in the BRIGHT and DIM modes when the electronic ballast is turned on or off. In addition, checking light stability and illumination status is also important to support the maintenance decision. The detailed maintenance record description for troubleshooting is summarized in Table 1.

Each maintenance record contains seven attributes identified as highly related to abnormal electronic ballasts. These attribute values are all categorical. Each category of the outcome attribute represents a different set of replacement parts. For instance, category C1 denotes the replacement parts of a transformer (labeled T101 on the printed circuit board) and a capacitor (labeled C307). Category C2 denotes the replacement parts of an integrated circuit (labeled U300), a transistor (labeled Q301), and a fuse (labeled F401).


Fig. 1. The Operational Setup for Electronic Ballast (lamp set, ballast set, control lines, and control panel)

Table 1. The Record Description

Input Attributes                                                           Data Type     Range
Alternating current in BRIGHT mode when the electronic ballast turns on    Categorical   15 intervals (amp)
Alternating current in DIM mode when the electronic ballast turns on       Categorical   11 intervals (amp)
Alternating current in BRIGHT mode when the electronic ballast turns off   Categorical   16 intervals (amp)
Alternating current in DIM mode when the electronic ballast turns off      Categorical   16 intervals (amp)
Is the light unstable when the electronic ballast turns on                 Categorical   0 and 1
Is it not illuminated when the electronic ballast turns on                 Categorical   0 and 1

Outcome Attribute                                                          Data Type     Range
Replacement parts                                                          Categorical   C1, C2, ..., C10

2.2. Genetic Algorithm for Rule Induction

Rule induction methods can be categorized into either tree-based or non-tree-based methods (Abdullah, 1999). Some of the often-mentioned decision tree induction methods include the C4.5 (Quinlan, 1993), CART (Breiman et al., 1984), and GOTA (Hartmann et al., 1982) algorithms. Both decision trees and rules can be described as disjunctive normal form (DNF) models. Decision trees are generated from data in a top-down, general-to-specific direction (Chidanand & Sholom, 1997). Each path to a terminal node is represented as a rule consisting of a conjunction of tests on the path's internal nodes.

Michalski et al. (1986) proposed the AQ15 algorithm to generate a disjunctive set of classification rules. The CN2 rule induction algorithm also used a modified AQ algorithm that involves a top-down beam search procedure (Clark & Niblett, 1989). The Basic Exclusion Algorithm (BEXA) is another rule induction method, proposed by Theron & Cloete (1996); it follows a general-to-specific search procedure in which disjunctive conjunctions are allowed.

GAs have been successfully applied to data mining for rule discovery in the literature. Some techniques use one-rule-per-individual encoding, as proposed by Greene & Smith (1993) and Noda et al. (1999). In the one-rule-per-individual encoding approach, a chromosome can usually be identical to a linear string of rule conditions, where each condition is often an attribute-value pair, to represent a rule or a rule set. However, the more complicated and syntactically longer the chromosome encoding, the more complex the genetic operators that are required.

Hu (1998) proposed a Genetic Programming (GP) approach in which a program can be represented by a tree with rule conditions and/or attribute values in the leaf nodes and functions in the internal nodes. The challenge is that such a tree can grow in size and shape in a very dynamic way, so an efficient tree-pruning algorithm is needed to readjust the unsatisfied parts of a tree to avoid infeasible solutions. Bojarczuk et al. (2001) proposed a constrained-syntax GP approach to build a decision model, with particular emphasis on the discovery of comprehensible knowledge.

In order to discover high-level prediction rules, Freitas (1999) applied first-order logic relationships such as "Salary > Age" by checking an Attribute Compatibility Table (ACT) during the discovery process with GA-Nuggets. The ACT was claimed to be especially effective for its knowledge representation capability. By extending the use of the ACT, our proposed approach allows other complicated attribute relationships, such as linear or non-linear quantitative relationships among attributes. This mechanism aims to help reduce the search space during the GA's evolution process.

2.3. Classification Trees

Among data mining techniques, the decision tree is one of the most commonly used methods for knowledge discovery. A decision tree is used to discover rules and relationships by systematically breaking down and subdividing the information contained in the data (Chou, 1991). A decision tree features easy interpretation and a simple top-down tree structure in which decisions are made at each node. The nodes at the bottom of the resulting tree provide the final outcome, of either a discrete or a continuous value. When the outcome is a discrete value, a classification tree is developed (Hunt, 1993), while a regression tree is developed when the outcome is a continuous value (Bala & De Jong, 1996).

Classification is a critical type of prediction problem. Classification aims to examine the features of a newly presented object and assign it to one of a predefined set of classes (Michael & Gordon, 1997). Classification trees are used to predict the membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. The classification tree induction process often selects the attributes at a given node according to criteria such as Quinlan's information gain (ID3), the gain ratio (C4.5) criterion, or QUEST's statistics-based approach to determine proper attributes and split points for a tree (Quinlan, 1986; 1992; Loh & Shih, 1997).

2.4. Apriori Algorithms for Attribute Selection

Many algorithms can be used to discover association rules from data in order to identify patterns of behavior. One of the most often used approaches is the apriori algorithm, which is detailed in (Agrawal et al., 1993; Agrawal & Srikant, 1994). For instance, the apriori algorithm is able to produce a rule of the following form: WHEN component A is abnormal AND component B is normal THEN replace UNIT-1, with 70 percent probability of being true.

The apriori algorithm, given minimum support and confidence levels, is able to produce rules from a data set through the discovery of so-called itemsets. A rule has two measures, called support and confidence. Support (or prevalence) measures how often items occur together, as a percentage of the total records. Confidence (or predictability) measures how much a particular item depends on another. Because of the apriori algorithm's advantage in deriving associations among data items efficiently, recent data mining research on classification tree construction has attempted to adopt this mechanism for knowledge preprocessing. For example, the apriori algorithm was applied to produce association rules that can be converted into the initial population of a genetic programming (GP) run (Niimi & Tazaki, 2000; Niimi & Tazaki, 2001). Improved learning efficiency for the proposed method was demonstrated when compared with a GP that did not use association rule algorithms. However, the handling of multivariate classification problems and better learning accuracy were not addressed in these studies.
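To make the support and confidence measures concrete, the following Python sketch counts how often a candidate rule's antecedent and consequent co-occur in a small data set. It is a simplified illustration under assumed record formats, not the apriori implementation used in this research.

records = [
    {"A": "abnormal", "B": "normal",   "repair": "UNIT-1"},
    {"A": "abnormal", "B": "normal",   "repair": "UNIT-1"},
    {"A": "abnormal", "B": "abnormal", "repair": "UNIT-2"},
    {"A": "normal",   "B": "normal",   "repair": "UNIT-3"},
]

def matches(record, conditions):
    # True if the record satisfies every attribute=value condition.
    return all(record.get(attr) == val for attr, val in conditions.items())

def support_confidence(records, antecedent, consequent):
    n_total = len(records)
    n_x = sum(matches(r, antecedent) for r in records)
    n_xy = sum(matches(r, {**antecedent, **consequent}) for r in records)
    support = n_xy / n_total                  # how often X and Y occur together
    confidence = n_xy / n_x if n_x else 0.0   # how often Y holds given X
    return support, confidence

print(support_confidence(records,
                         {"A": "abnormal", "B": "normal"},
                         {"repair": "UNIT-1"}))  # (0.5, 1.0)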

2.5. Constraint Satisfaction in Genetic Algorithms

Constraint satisfaction involves finding values for problem variables that are subject to constraints on acceptable solutions, so that all of these constraints are satisfied. A constraint satisfaction problem (CSP) basically belongs to the class of NP-complete problems and normally lacks a universally suitable solution method. A number of different approaches have been developed for solving CSPs. Some of them adopt constraint propagation (Bowen et al., 1990; Dechter & Pearl, 1988; Lai, 1992) to reduce the solution search space. Others use "backtracking" to search directly for possible solutions (Nadel, 1988). Still others apply a combination of these two techniques, including tree-search and consistency algorithms, to find one or more feasible solutions efficiently (Purdom, 1983; Freuder, 1982). Nadel (1988) compared the performance of several algorithms including "generate and test", "simple backtracking", "forward checking", "partial lookahead", "full lookahead", and "really full lookahead". The major difference among these algorithms is the degree of consistency checking performed at each node during the tree-solving process. Apart from the "generate and test" method, all of them perform hybrid techniques. Whenever a new value is assigned to a variable, the domains of all unassigned variables are filtered and left only with those values that are consistent with the assignments already made. If the domain of any of these unassigned variables becomes empty, a contradiction is recognized and backtracking occurs. Freuder (1982) showed that if a given CSP has a tree-structured graph, it can be solved without any backtracking; that is, solutions can be retrieved in a backtrack-free manner. Dechter & Pearl (1988) used this theory coupled with the notion of directional consistency to generate backtrack-free solutions more efficiently.

Dealing with constraints for search space reduction has become a crucial issue in artificial intelligence research. GAs maintain a set of chromosomes (solutions), called a population, which consists of parents and offspring. As the evolution process proceeds, the best N chromosomes in the current population are selected as parents. Through the genetic operators, offspring are selected according to a filtering criterion that is usually expressed as a fitness function along with some predefined constraints. The GA evolves over generations until the stopping criteria are met. However, valid chromosomes are usually produced by trial and error: a candidate chromosome is produced and then tested against the filtering criteria. Therefore a GA may require more computation, especially when dealing with complicated or severe filtering criteria. To resolve this problem, an effective chromosome construction process can be applied to the initialization, crossover, and mutation stages respectively.
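The computational drawback described here can be seen in the usual generate-and-test pattern, sketched below in simplified Python. The chromosome encoding and constraint check are placeholders, not the paper's implementation; the point is that invalid candidates are only detected after they have been fully generated.

import random

def random_chromosome(length=8):
    # Placeholder encoding: a fixed-length binary string.
    return [random.randint(0, 1) for _ in range(length)]

def satisfies_constraints(chrom):
    # Placeholder constraint: at least half of the genes must be 1.
    return sum(chrom) >= len(chrom) // 2

def generate_valid_chromosome(max_trials=1000):
    # Generate-and-test: produce candidates until one passes the constraint check.
    for _ in range(max_trials):
        candidate = random_chromosome()
        if satisfies_constraints(candidate):   # validation happens only after generation
            return candidate
    raise RuntimeError("No valid chromosome found within the trial budget")

print(generate_valid_chromosome())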

Garofalakis (2000) provided a model-constraint-based algorithm inside the mining process to specify the expected tree size and accuracy. In other words, constraints can be used to express the trade-off between model accuracy and the computational efficiency associated with the tree-building or tree-pruning process. In similar research, CADSYN (Maher, 1993) adopted a constraint satisfaction algorithm for case adaptation in case-based reasoning. Purvis (1995) also applied a repair-based constraint satisfaction algorithm to aid case adaptation. By pushing constraints into the case adaptation process, the constraint satisfaction algorithm is able to help a case-based reasoning system produce suitable solution cases more efficiently for discrete constraint problems. Barnier & Brisset (1998) developed a hybrid system of a genetic algorithm and constraint satisfaction techniques to solve optimization problems in vehicle routing and radio link frequency assignment. Basically, this approach applied a GA to reduce the search space of a large CSP, rather than applying constraint satisfaction techniques to improve the GA's computational efficiency. Kowalczyk (1996) pointed out the concept of using constraint satisfaction principles to support a GA in handling constraints. However, few research works have applied constraint-based reasoning to handle the GA's computational inefficiency effectively, or addressed how the user's knowledge can be represented and processed within a constraint network. This research applies constraint-based reasoning to help a GA deal more efficiently with chromosome screening during initialization and the subsequent genetic operations.

3. The Hybrid of Association Rule Algorithms and Constraint-Based Genetic Algorithms (ACBGA)

The proposed ACBGA approach consists of the following modules. According to Fig. 2, these modules are:

Rule Association; GA Initialization; Fitness Evaluation; Chromosome Screening; and GA Operation.

By executing the apriori algorithm, the Rule Association module produces association rules. In this research, the data items of the categorical attributes are used to construct the association rules. An association rule here is an implication of the form X→Y, where X is a conjunction of conditions and Y is the classification result. The rule X→Y has to satisfy the user-defined minimum support and minimum confidence levels.

Fig. 3 illustrates a classification tree with n nodes. Each node involves two types of attributes: categorical and continuous. The categorical attributes are used for constructing the partial combination of conditions, while the continuous attributes are used for constructing the inequality relationships. The formal form of a tree node is expressed as follows:

A and B → C;

where A denotes the antecedent part of the association rule; B is the conjunction of inequality functions in which the continuous attributes, relational operators, and splitting values are determined by the GA; and C is the classification result obtained directly from the association rule.



Fig. 2. The Conceptual Diagram of ACBGA

Fig. 3. The Illustration for the Classification Tree

For example, suppose X1, X2, and X3 are categorical attributes and X4 and X5 are continuous attributes. Assume that one association rule, "X1=5 and X3=3 → Ck", is selected first and then combined with the conjunction of inequalities "X4<=10 and X5>1", which is randomly determined by the GA, to form a tree node expressed as follows.

Nodei: IF X1=5 and X3=3 and X4<=10 and X5>1 THEN the result is Ck;

The GA Initialization module generates a chromosome that is equivalent to a classification tree. The tree nodes can be presented simply in the form "Node1, Node2, …, Noden", where n is the total number of tree nodes in the classification tree and is determined automatically by the GA. The chromosome encoding steps for a classification tree are stated as follows.

Step (a): For each node, the antecedent condition is obtained from one association rule that is randomly selected from the entire set of association rules generated in advance. That is, the "A" part of the tree node's formal form is determined.

Step (b): By applying the GA, the relational operator (<= or >) and splitting point are determined for each continuous attribute. Therefore the "B" part of the tree node's formal form is determined.

Step (c): The classification result of the tree node comes directly from the consequence of the selected association rule. Subsequently, each tree node is generated by repeatedly applying Steps (a) and (b) n times, where n is a value determined automatically by the GA.
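The following Python sketch illustrates steps (a)–(c) under simplified assumptions. The attribute names, value ranges, rule set, and random choices are illustrative only; they are not the encoding used in the actual system.

import random

# Hypothetical association rules: (antecedent on categorical attributes, class label)
association_rules = [
    ({"X1": 5, "X3": 3}, "Ck"),
    ({"X1": 2, "X2": 1}, "Cj"),
]
continuous_attrs = {"X4": (0.0, 100.0), "X5": (0.0, 10.0)}  # assumed value ranges

def build_node():
    antecedent, label = random.choice(association_rules)        # step (a): the "A" part
    inequalities = {attr: (random.choice(["<=", ">"]),           # step (b): the "B" part
                           round(random.uniform(lo, hi), 2))
                    for attr, (lo, hi) in continuous_attrs.items()}
    return {"A": antecedent, "B": inequalities, "C": label}      # step (c): the "C" part

def build_chromosome(max_nodes=5):
    n = random.randint(1, max_nodes)   # the GA decides the number of tree nodes
    return [build_node() for _ in range(n)]

print(build_chromosome())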

The association rules for each classification result are used not only to construct the candidate classification trees for GA initialization, but also to filter out the insignificant or irrelevant attributes in rules. This filtering is done by applying a hybrid of tree search and forward checking during the GA operation. In forward checking, when a value is assigned to the current gene, any value of the remaining genes that conflicts with the current gene is removed from the pool of potential candidates. That is, whenever a new gene is considered, all the values remaining in its pool are guaranteed to be consistent with the constraint network and the past gene assignments. We illustrate this reasoning process with the following example.

Assume that the data set consists of three attributes X1, X2, and Y, where Y is a binary classification result. The domain universe for each attribute is as follows.

X1 ∈ {a, b, c},  X2 ∈ {d, e},  Y ∈ {yes, no}

By applying the apriori algorithm, the association rules inferred under the given minimum support and confidence are expressed as follows.


X1 = a ∧ X2 = d → Y = yes
X1 = b ∧ X2 = d → Y = yes
X1 = b ∧ X2 = e → Y = yes
X1 = a → Y = no
X1 = c ∧ X2 = e → Y = no

These association rules for the different classification results can be translated into a set of constraints (i.e., the constraint network) that is used to confine the generation of feasible relationships among attributes. In the regular approach to solving a CSP with a given set of constraints, a constraint processing technique (i.e., constraint-based reasoning) is applied to determine whether these constraints can all be satisfied. For handling a dynamic environment, application programs in a constraint-based language consist simply of declarative sets of constraints, and the necessary processing is performed by general-purpose reasoning techniques.

In this example, the association rules can be translated into a constraint program. The syntax of this program is based on a constraint programming language (Bowen, 1990; Lai, 1992). The program is a set of first-order logic sentences about a many-sorted universe of discourse that includes integers, real numbers, and arbitrary application-specific sorts. The association rules here are simply formulated as a set of first-order logic sentences described by the following constraint program.

Domain DomY =:= {yes, no};
Domain DomX1 =:= {a, b, c};
Domain DomX2 =:= {d, e};
Relation assorule1(DomY, DomX1, DomX2) =:= {(yes, a, d), (yes, b, d), (yes, b, e)};
Relation assorule2(DomY, DomX1, DomX2) =:= {(no, a, d), (no, a, e), (no, c, e)};
assorule1(Y, X1, X2);
assorule2(Y, X1, X2);

The domain for each attribute is defined first. The domain names "DomX1", "DomX2", and "DomY" are described by sets of application-specific symbols, and the operator "=:=" stands for "is defined as". The relational symbols "assorule1" and "assorule2" are used to describe the relationships of relevant values for the classification results "yes" and "no", respectively. Further details about the constraint program can be found in (Bowen, 1990; Lai, 1992).
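For readers more familiar with a general-purpose language, the same constraint network can be represented with ordinary data structures, as in the brief Python sketch below. This mirrors the example relations above but is not the constraint language interpreter referenced by the authors.

# Domains and relations of the example constraint network, expressed as plain data.
domains = {"Y": {"yes", "no"}, "X1": {"a", "b", "c"}, "X2": {"d", "e"}}

# Each relation lists the (Y, X1, X2) tuples that are jointly consistent.
assorule1 = {("yes", "a", "d"), ("yes", "b", "d"), ("yes", "b", "e")}
assorule2 = {("no", "a", "d"), ("no", "a", "e"), ("no", "c", "e")}
constraint_network = assorule1 | assorule2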


To improve computational efficiency in the search for the optimal classification tree, effective reduction of the domain range of each attribute is required. Fig. 4 illustrates how to reduce the search space by incorporating the hybrid techniques of tree search and constraint propagation. Assume that nodei is used to predict the classification result "yes". Then the satisfying values "a" and "b" for attribute X1 can be inferred by constraint propagation with the relation "assorule1". That is, the infeasible value "c" is removed from the domain range of X1 to reduce the search space.

Fig. 4. The Hybrid of Tree Search and Constraint Propagation
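A minimal sketch of this domain-reduction step, assuming the relation tuples defined earlier (and not reproducing the system's actual propagation engine), is shown below: fixing the classification result to "yes" prunes value "c" from the domain of X1.

# Constraint propagation over the example relations: restrict attribute domains
# to the values that appear in at least one tuple consistent with the target class.
assorule1 = {("yes", "a", "d"), ("yes", "b", "d"), ("yes", "b", "e")}
assorule2 = {("no", "a", "d"), ("no", "a", "e"), ("no", "c", "e")}

def reduced_domains(relations, target_class):
    consistent = [t for t in relations if t[0] == target_class]
    return {
        "X1": {t[1] for t in consistent},
        "X2": {t[2] for t in consistent},
    }

print(reduced_domains(assorule1 | assorule2, "yes"))  # e.g. {'X1': {'a', 'b'}, 'X2': {'d', 'e'}}
print(reduced_domains(assorule1 | assorule2, "no"))   # e.g. {'X1': {'a', 'c'}, 'X2': {'d', 'e'}}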

The Fitness Evaluation module calculates the fitness value of each chromosome in the current population. The fitness function is defined as the total number of misclassifications. If the specified stopping condition is satisfied, the entire process terminates and the optimal classification tree is confirmed; otherwise, the GA operations continue.
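As a simplified illustration of this fitness definition, the sketch below counts how many records a candidate tree misclassifies, using the node dictionary format assumed in the earlier encoding sketch. The record layout (including the "class" key) is an assumption for illustration, not the authors' implementation.

def classify(tree, record):
    # Return the class of the first node whose conditions the record satisfies.
    for node in tree:
        cat_ok = all(record.get(a) == v for a, v in node["A"].items())
        num_ok = all((record[a] <= t) if op == "<=" else (record[a] > t)
                     for a, (op, t) in node["B"].items())
        if cat_ok and num_ok:
            return node["C"]
    return None  # no node fires; counted as a misclassification

def fitness(tree, records):
    # Fitness = total number of misclassified records (lower is better).
    return sum(classify(tree, r) != r["class"] for r in records)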

Each GA operation involves chromosome selection, crossover, and mutation in order to produce the offspring generation based on the GA parameter settings. The offspring produced by ACBGA need to be validated by the chromosome filtering operation to ensure that valid or promising individuals are carried forward for further processing.

As shown in Fig. 5, each classification tree containing n nodes can be represented by a chromosome. Each node represents a rule that is used to determine one classification result. The rule consists of a conjunction of conditions in the form of equality or inequality relations. The encoding of a condition requires three genes to specify "the enable/disable status", "the equality/inequality operator", and "the splitting value". When there are k attributes, the number of genes for each nodei is 3*k+1, where the additional gene (gi1) indicates the classification result. Therefore a total of n*(3*k+1) genes are required to form a chromosome.


Fig. 5. The Framework of Constraint-Based Preprocessing for GA Operators

The constraint network contains the information extracted from the association rules. The valuation scope of each gene gij is confined via constraint-based reasoning. The search effort for potential chromosomes can thus be reduced without having to activate the fitness evaluation procedure for every candidate chromosome.

Fig. 6 illustrates the details of the chromosome construction process. In essence, the ACBGA system applies forward checking for chromosome filtering. In forward checking, when a value is assigned to the current gene, any possible value of the remaining genes that conflicts with the current gene is removed from the pool of potential candidates. In addition, the valid gene values are propagated through the network, thus enabling other constraints to restrict further the valid range sets of the remaining genes. That is, whenever a new gene is considered, all the gene values remaining in its pool are guaranteed to be consistent with the constraint network and the past gene assignments. According to the restricted valid range, denoted by SGj, the valuation process for gene gij is then activated to examine the inferred gene value. The new value g'ij is replaced by a value randomly selected from SGj if the inferred gene value is inconsistent with the constraint network.

By repeatedly applying forward checking and the valuation process to chromosome C in the sequence gi1, gi2, …, gnm, the new chromosome C' is thus able to satisfy the constraint network. As a result, the constraint-based reasoning approach offers an efficient way to guide the GA toward the optimal chromosome by reducing the search space already filtered by the constraint network.
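A condensed sketch of this repair loop is given below, under the assumption that each gene has a known domain and that a forward-checking routine returns the values still consistent with the constraint network given earlier gene assignments. Both are placeholders; the actual SGj computation in the system is more involved.

import random

def forward_check(constraint_network, assigned, gene_index, domain):
    # Placeholder: return the subset of `domain` consistent with the assignments so far.
    return {v for v in domain if constraint_network(assigned, gene_index, v)}

def repair_chromosome(chrom, domains, constraint_network):
    # Walk the genes in order; replace any value that falls outside its restricted range SGj.
    repaired, assigned = [], {}
    for j, gene in enumerate(chrom):
        s_gj = forward_check(constraint_network, assigned, j, domains[j])
        # Assumes propagation leaves at least one consistent value in SGj.
        new_gene = gene if gene in s_gj else random.choice(sorted(s_gj))
        repaired.append(new_gene)
        assigned[j] = new_gene
    return repaired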


Fig. 6. The Detail Illustration for Chromosome Screening

4. The Experiments and Results

Before experimenting with the electronic ballast data, the car evaluation data set from the UCI repository (Blake & Merz, 1998) was used to validate the proposed approach. In addition, two other approaches, a simple GA (SGA) and the apriori algorithm with a GA (AGA), were employed for comparison against the ACBGA approach.

Generally, the association rules extracted by the apriori algorithm can vary depending on the defined support and confidence values, and different extracted association rules may have different impacts on ACBGA learning performance. Basically this can be explored by trying different combinations of minimum support and confidence values. However, this research first computed the ratio of the number of records in each classification result to the total number of data records. In other words, each ratio is denoted by

Ratioi = (the number of data records whose classification result is Classi) / (the total number of data records);

where i = 1 to n, and n is the total number of classification results.

Then the support level for each classification result is defined as SupportDiscountj * Ratioi, where SupportDiscountj = j * 10%, j = 1 to 10.

That is, the support level for Classi is incrementally increased from 10% to 100% of Ratioi. Although each Ratioi may apply a different SupportDiscountj, for simplicity this research applied the same SupportDiscountj to each Ratioi. Similarly, the confidence level for each classification result is first set to 100% and is then decreased if the learned pattern is not satisfactory.
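The threshold schedule described above can be transcribed directly into Python as follows; the class counts shown are illustrative rather than the experimental data.

# Per-class support thresholds: SupportDiscount_j * Ratio_i, for j = 1..10.
class_counts = {"unacc": 1210, "acc": 384, "good": 69, "v-good": 65}  # illustrative counts
total = sum(class_counts.values())

ratios = {c: n / total for c, n in class_counts.items()}
support_levels = {
    c: [round(j * 0.10 * ratio, 4) for j in range(1, 11)]
    for c, ratio in ratios.items()
}
print(support_levels["good"][:3])  # thresholds at 10%, 20%, and 30% of Ratio_good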

The evaluation of the classification trees generated by each of the three approaches was based on a five-fold cross validation. That is, each training stage used 4/5 of the entire data records, with the remaining 1/5 used for the testing stage. The GA parameter settings for both applications are summarized in Table 2.

Table 2. The GA Parameter Settings

Item                                                      Value
Population Size                                           100
Generations                                               100/200/300/400/500
Crossover rate                                            0.6
Mutation rate                                             0.01
Crossover method                                          Uniform
Selection method                                          Roulette wheel
Maximum tree node number (for UCI Car Evaluation Data)    50
Maximum tree node number (for Electronic Ballast Data)    30
Average training time used (for UCI Car Evaluation Data)  1.9 minutes*
Average training time used (for Electronic Ballast Data)  0.37 minutes*

* The hardware platform is a Pentium III 1.0 GHz with 512 MB RAM.

The Car Evaluation Problem

The collected 1728 car evaluation data records consist of six categorical attributes in the input part. The output part is one categorical attribute. Data of the six categorical attributes were fed into the apriori algorithm to produce the association rules. Both the final training and testing results are summarized in Table 3, along with their corresponding representation depicted in Fig. 7 & 8. These results are based on a SupportDiscount of 10% and a confidence value of 100. The derived association rule sets consist of 62 rules for the "unacc", 83 rules for the "acc", 155 rules for the "good", and 159 association rules for the "v-good" output category. To obtain more details about the learning progress of the three approaches, learning behavior was recorded in sessions. Fig. 9 & 10 depict the entire learning progress monitored over various generations and time. Appendix A presents the details of each tree node for one of the better classification trees derived. Based on this classification tree, the accuracy rates for the training and testing stages reach 93.2% and 91.91%, respectively.

Table 3. The Summarized Learning Performance for SGA, AGA, and ACBGA

(based on 5-fold average)

Fig. 7. Training Results with Various Generations (based on 5-fold average)

Fig. 8. Testing Results with Various Generations (based on 5-fold average)

Fig. 9. Training Results with Various Generations (based on 5-fold average)

Fig. 10. Testing Results with Various Generations (based on 5-fold average)

The Electronic Ballast Problem

Two hundred and fifty electronic ballast maintenance records of Boeing 747-400 aircraft from the accessory shop of one major airline in Taiwan were used to construct the troubleshooting model. Both the training and testing performance are summarized in Table 4, along with their corresponding representation depicted in Fig. 11 & 12. These results are also based on a SupportDiscount of 10% and a confidence value of 100. The derived association rule sets consist of 157 association rules for "C1", 62 for "C2", 62 for "C3", 117 for "C4", 187 for "C5", 107 for "C6", 119 for "C7", 178 for "C8", 177 for "C9", and 196 for the "C10" output category. In order to obtain more details about the learning progress of the three approaches, learning behavior was recorded in sessions. Fig. 13 & 14 depict the entire learning progress monitored over generations and time. Appendix B presents the details of each tree node for one superior classification tree derived. Based on this classification tree, the accuracy rates for the training and testing stages reach 84.5% and 84%, respectively.

Table 4. The Summarized Learning Performance for SGA, AGA, and ACBGA (based on 5-fold average)

Gen.      100              200              300              400              500
          Train    Test    Train    Test    Train    Test    Train    Test    Train    Test
SGA       58.30%   56.40%  67.60%   63.20%  68.80%   64.00%  71.60%   68.00%  72.40%   68.00%
AGA       76.90%   76.40%  78.60%   75.20%  79.30%   74.80%  79.30%   74.80%  80.00%   74.80%
ACBGA     83.30%   79.60%  84.10%   79.60%  84.50%   80.40%  85.00%   80.80%  85.10%   80.40%

Fig. 11. Training Results with Various Generations (based on 5-fold average)
Fig. 12. Testing Results with Various Generations (based on 5-fold average)

Fig. 13. The Learning Progress over Generations (based on 5-fold average)

5. Discussion

According to the results indicated above, ACBGA achieves superior learning performance to SGA and AGA in terms of computation efficiency and accuracy for both the car evaluation and electronic ballast data sets. By employing the apriori algorithm, partial knowledge is extracted and transformed into seeding chromosomes. The difference can be seen in the initial training accuracy, which is greatly improved from 50% to 67% (shown in Fig. 9) for the car evaluation data and from 21% to 54% (shown in Fig. 13) for the electronic ballast data.

Fig. 14. The Learning Progress over Time (based on 5-fold average)

As displayed in Fig. 9, the ACBGA approach converges at around 300 generations, while AGA and SGA are still progressing at 500 generations with inferior learning accuracy for the car evaluation data. This situation is even more salient for the electronic ballast data. As depicted in Fig. 10 & 14, the testing performances for both data sets exhibit patterns similar to those in the training stages.

For the ACBGA approach, the derived partial knowledge is not only encoded as seeding chromosomes but also converted into the constraint network. As shown in the figures displaying the learning progress, ACBGA outperforms AGA by a smaller margin than it outperforms SGA. Nevertheless, the improvement of ACBGA over AGA demonstrates its learning effectiveness on both application data sets.

6. Conclusions

Aircraft maintenance is now recognized as one of the most important airline activities for improving flight safety as well as gaining worldwide competitive strength. Sharing repair experiences with state-of-the-art computer technology helps to improve the productivity of aircraft maintenance. This study has proposed the ACBGA approach to aid aircraft electronic ballast maintenance. The ACBGA approach can be employed by maintenance mechanics in the accessory shop to help them obtain the knowledge, skills, and experience required for effective electronic ballast repair and maintenance. In addition to the electronic ballast, there are millions of other components embedded in an aircraft system, most of which need a short repair time in order to reduce opportunity costs. This is because inefficient aircraft maintenance services will lead to flight delays, cancellations, or even flight accidents.

The ACBGA approach hybridizes constraint-based reasoning within a genetic algorithm for classification tree induction. Incorporating partial knowledge or user-control information into the mining process is not straightforward and typically requires the design of novel approaches. By employing an association rule algorithm to acquire partial knowledge from data in advance, the proposed approach is able to induce a classification tree by pushing the partial knowledge into chromosome construction. Most importantly, the adoption of a constraint-based reasoning mechanism in the GA process can filter invalid chromosomes; therefore feasible solutions can be derived more efficiently.

Compared with SGA and AGA, ACBGA achieves higher predictive accuracy and requires less computation time in constructing a classification tree for electronic ballast troubleshooting. In addition, the classification trees discovered by ACBGA not only obtain higher predictive accuracy and computational efficiency, but may also produce more transparent or significant knowledge for users.

The proposed ACBGA is generic and problem independent. No proprietary genetic operator or chromosome representation needs to be designed to interact with constraint-based reasoning. Besides integrating with the apriori algorithm for knowledge preprocessing, the ACBGA approach offers a way for domain experts to provide professional knowledge or constraints that facilitate revealing further crucial or interesting knowledge. This approach is applicable not only to binary classification problems but also to multi-category classification problems.

Currently the ACBGA approach is able to determine tree-splitting nodes that allow complex rule-like discriminating formats such as the relationship "Attributei <= w ∗ Attributej". This format can be extended in the future to express more complicated multivariate inequalities in either a linear or nonlinear form. In addition, GA parameter settings such as the maximum number of tree nodes allowed, the minimum (or maximum) number of data records in a tree node, and the granularity of SupportDiscount are aspects that can be further investigated in terms of their impact on the required learning time and the corresponding learning effectiveness.

Intelligent Aircraft Maintenance Support System Using Genetic Algorithms and Case-Based Reasoning

Abstract

The maintenance of aircraft components is crucial for avoiding aircraft accidents and aviation fatalities. To provide reliable and effective maintenance support, it is important for airline companies to utilize previous repair experiences with the aid of advanced decision support technology. Case-Based Reasoning (CBR) is a machine learning method that adapts previous similar cases to solve current problems. For the effective retrieval of similar aircraft maintenance cases, this research proposes a CBR system to aid electronic ballast fault diagnosis for Boeing 747-400 airplanes. By employing a genetic algorithm (GA) to enhance dynamic feature weighting as well as the design of non-linear similarity functions, the proposed CBR system achieves superior learning performance to systems with either equal/varied weights or linear similarity functions.

Keywords: Aircraft Maintenance, Electronic Ballast, Case-Based Reasoning, Genetic Algorithms

1. INTRODUCTION

Airplanes in operation throughout the world call for appropriate maintenance to assure flight safety and quality. When aircraft component faults emerge, actions for fault diagnosis and troubleshooting must be executed promptly and effectively. An airplane consists of many electronic components, among which the electronic ballast is one common component for controlling the cabin fluorescent lamps. The electronic ballast plays an important role in providing proper lighting for passengers and flight crews during a flight. Unstable cabin lighting, such as flashing and ON/OFF problems, is a common problem in airplanes. An airplane usually has hundreds of


electronic ballasts mounted in panels such as the light deflector of a fluorescent lamp fixture. When an electronic ballast is abnormal, it has to be removed and sent to the accessory shop for further investigation.

The maintenance records of electronic ballasts generally contain information about the number of defective units found, the procedures taken, and the inspection or repair status. Basically these records are stored and used to assist mechanics in identifying faults and determining whether component repair or replacement is necessary, because previous similar solutions may provide valuable troubleshooting clues for new faults.

Similar to analogy, CBR is a machine learning method that adapts previous similar cases to solve current problems. CBR shows significant promise for improving the effectiveness of complex and unstructured decision making. It is a problem-solving technique that resembles the decision making process used in many real-world applications. This study considers CBR an appropriate approach to aid aircraft mechanics in dealing with the electronic ballast maintenance problem. Basically, CBR systems make inferences using analogy to obtain similar experiences for solving problems. Similarity measurements between pairs of features play a central role in CBR (Kolodner, 1992). However, the design of an appropriate case-matching process in the retrieval step is still challenging. For the effective retrieval of previous similar cases, this research develops a CBR system with GA mechanisms used to enhance dynamic feature weighting as well as the design of non-linear similarity functions. The GA is an optimization technique inspired by biological evolution (Holland, 1975). Based on the concept of natural evolution, a GA works by breeding a population of new answers from the old ones using a methodology based on survival of the fittest. In this research the GA is used to determine not only the fittest non-linear similarity functions but also the optimal feature weights.

By using GA mechanisms to enhance the case retrieval process, a CBR system is developed to aid electronic ballast fault diagnosis for Boeing 747-400 airplanes. Three hundred electronic ballast maintenance records of Boeing 747-400 airplanes were gathered from the accessory shop of one major airline in Taiwan. The results demonstrate that the approach with non-linear similarity functions and dynamic weights achieves better learning performance than approaches with either linear similarity functions or equal/varied weights.

2. LITERATURE REVIEW

2.1 Case-Based Reasoning

CBR is a relatively new method in artificial intelligence (AI). It is a general problem-solving method that takes advantage of the knowledge gained from experience and attempts to adapt previous similar solutions to solve a particular current problem. As shown in Figure 1, CBR can be conceptually described by a CBR cycle composed of several activities (Dhar & Strin, 1997). These activities include (A) retrieving similar cases from the case base, (B) matching the input and retrieved cases, (C) adapting the solutions suggested by the retrieved similar cases to better fit the new problem, and (D) retaining the new solution once it has been confirmed or validated.

A CBR system gains an understanding of the problem by collecting and analyzing case feature values. In a CBR system, the retrieval of similar cases relies on a similarity metric, which is used to compute the distance between pairs of case features. Generally, the performance of the similarity metric and the feature weights are key to CBR (Kim & Shin, 2000). A CBR system can be ineffective in retrieving similar cases if the case-matching mechanism is not appropriately designed.

For an aircraft maintenance problem, CBR is a potential approach for retrieving similar cases to diagnose faults as well as to provide appropriate repair solutions. Several studies have applied CBR to solve different airline industry problems. Richard (1997) developed CBR diagnostic software for aircraft maintenance. Magaldi (1994) proposed applying CBR to aircraft troubleshooting on the flight line. Other CBR applications include flight condition monitoring and fault diagnosis for aircraft engines (Vingerhoeds et al., 1995), service parts diagnosis for improving service productivity (Hiromitsu et al., 1994), and data mining for predicting aircraft component replacement (Sylvain et al., 1999).

Figure 1. A CBR Cycle

Most of these CBR systems applied an n-dimensional vector space to measure the similarity distance between the input and retrieved cases. For example, Sylvain et al. (1999) adopted the nearest neighbor method. However, few studies have attempted to employ dynamic weighting with non-linear similarity functions to develop fault diagnosis models for aircraft maintenance.

2.2 Genetic Algorithms for Feature Weighting

In general, feature weights can be used to denote the relevance of case features to a particular problem. Wettschereck et al. (1997) made an empirical evaluation of feature-weighting methods and concluded that feature-weighting methods have a substantially higher learning rate than un-weighted k-nearest neighbor methods. Kohavi et al. (1995) observed that feature weighting methods have superior performance compared to feature selection methods. When some features are irrelevant to the prediction task, Langley and Iba (1993) pointed out that appropriate feature weights can substantially increase the learning rate.

Several studies have applied GAs to determine the most suitable feature weights. The GA is a technique that models the genetic evolution and natural selection processes. A GA procedure usually consists of the chromosomes in a population, a 'fitness' evaluation function, and the three basic genetic operators of 'reproduction', 'crossover', and 'mutation'. Initially, chromosomes in the form of binary strings are generated randomly as candidate solutions to the addressed problem. A fitness value associated with each chromosome is subsequently computed through the fitness function, representing the goodness of the candidate solution. Chromosomes with higher fitness values are selected to generate better offspring for the new population through the genetic operators. Conceptually, the unfit are eliminated and the fit survive to contribute genetic material to subsequent generations.

Wilson and Martinez (1996) proposed a GA-based weighting approach that performed better than the un-weighted k-nearest neighbor method. For large-scale feature selection, Siedlecki and Sklansky (1989) introduced a 0-1 weighting process based on GAs. Kelly and Davis (1991) proposed a GA-based weighted K-NN approach (GA-WK-NN) with lower error rates than the standard K-NN. Brill et al. (1992) demonstrated fast feature selection using GAs for neural network classifiers.

Though the above research works used GA mechanisms to determine the feature weights for case retrieval, seldom has a study applied the GA to simultaneously determine the feature weights as well as the corresponding similarity functions in a non-linear way. This paper applies GA mechanisms to determine both the optimal feature weights and the most appropriate non-linear similarity functions for case features. A CBR system is developed to diagnose the faulty accessories of electronic ballasts for Boeing 747-400 airplanes.
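A minimal sketch of how such a GA individual might encode both the feature weights and the per-feature exponents is given below. The encoding, the candidate exponent set, and the random initialization are assumptions for illustration; the fitness evaluation against retrieval accuracy is omitted.

import random

EXPONENTS = [0.2, 0.25, 1/3, 0.5, 1, 2, 3, 4, 5]  # candidate k values (1/5 ... 5)

def random_individual(n_features):
    # One GA individual: a weight in [0, 1] and an exponent choice per feature.
    return {
        "weights": [random.random() for _ in range(n_features)],
        "exponents": [random.choice(EXPONENTS) for _ in range(n_features)],
    }

def mutate(ind, rate=0.01):
    for i in range(len(ind["weights"])):
        if random.random() < rate:
            ind["weights"][i] = random.random()
        if random.random() < rate:
            ind["exponents"][i] = random.choice(EXPONENTS)
    return ind

population = [random_individual(n_features=6) for _ in range(100)]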

3. METHODOLOGY

3.1 Linear Similarity

From the case base, a CBR system retrieves an old case that is similar to the input case. As shown in Figure 2, the retrieval process is based on comparing the similarities of all feature values between the retrieved case and the input case, where f_i^I and f_i^R are the values of feature i in the input and retrieved case, respectively. There are many evaluation functions for measuring the degree of similarity. One numerical function using the standard Euclidean distance metric is shown in the following formula (Eq.1), where W_i is the ith feature weight. The feature weights are usually statically assigned to a set of previously known fixed values, or all set equal to 1 if no priorities are determined.

Figure 2. Feature Values of the Retrieved Case (f_1^R, f_2^R, ..., f_i^R, ..., f_n^R) and the Input Case (f_1^I, f_2^I, ..., f_i^I, ..., f_n^I)

$$\sum_{i=1}^{n} W_i \times \left(f_i^I - f_i^R\right)^2 \qquad \text{(Eq.1)}$$
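As a small illustration of Eq.1, the sketch below computes the weighted squared-Euclidean distance between an input case and a retrieved case; the feature values and weights shown are hypothetical.

```python
def linear_distance(input_case, retrieved_case, weights):
    """Weighted squared-Euclidean distance of Eq.1:
    sum_i W_i * (f_i^I - f_i^R)^2, where a smaller value means more similar."""
    return sum(w * (fi - fr) ** 2
               for w, fi, fr in zip(weights, input_case, retrieved_case))

# Hypothetical example: four numeric features, all weights set to 1
print(linear_distance([1.2, 0.8, 0.0, 1.5], [1.0, 0.9, 0.1, 1.4],
                      [1.0, 1.0, 1.0, 1.0]))
```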

3.2 Non-Linear Similarity

Based on the formula (Eq.1), this study proposes a non-linear similarity approach. The difference between the linear and non-linear similarity lies in the definition of the distance function. In the non-linear similarity approach, (f_i^I - f_i^R)^2 is replaced by the distance measurement [(f_i^I - f_i^R)^2]^k, as shown in formula (Eq.2).

$$\sum_{i=1}^{n} W_i \times \left[\left(f_i^I - f_i^R\right)^2\right]^{k} \qquad \text{(Eq.2)}$$

where k is the exponent applied to the standard Euclidean distance term for the corresponding input and retrieved feature values. A GA mechanism is proposed to compute the optimal k value for each case feature. The exponent k takes a value from {1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5}. Figure 3 depicts the example equation y = x^k, where x ∈ [0, 1], for various values of k.


Figure 3. The Illustration for Linear and Non-Linear Functions
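Extending the linear sketch above, a hypothetical version of Eq.2 with one exponent per feature might look as follows; the exponent values are assumptions drawn from the set listed in the text.

```python
def nonlinear_distance(input_case, retrieved_case, weights, exponents):
    """Weighted non-linear distance of Eq.2:
    sum_i W_i * [(f_i^I - f_i^R)^2]^k_i, with a per-feature exponent k_i."""
    return sum(w * ((fi - fr) ** 2) ** k
               for w, fi, fr, k in zip(weights, input_case,
                                       retrieved_case, exponents))

# Hypothetical exponents chosen from {1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5}
print(nonlinear_distance([1.2, 0.8, 0.0, 1.5], [1.0, 0.9, 0.1, 1.4],
                         [1.0, 1.0, 1.0, 1.0], [0.5, 2, 1, 3]))
```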

3.3 Static Feature Weighting

In addition to the linear or non-linear type of similarity function, the feature weights W_i can also influence the distance metric. Feature weighting can be either static or dynamic. The static weighting approach assigns fixed feature weights to all case features throughout the entire retrieval process. For static feature weighting, the feature weights can be either identical or varied. The weights are usually statically assigned to a set of previously known fixed values, or set equal to 1 if no priorities are determined. For varied feature weighting, this study proposes another GA mechanism to determine the most appropriate weight for each feature.

3.4 Dynamic Feature Weighting

For the dynamic weighting approach, feature weights are determined according to the context of each input case. As shown in Figure 4, for a given input case there are m retrieved cases in the case base, where i = 1 to n (n is the total number of features in a case) and j = 1 to m (m is the total number of retrieved cases in the case base). f_ij^R is the ith feature value of the retrieved case j, and f_i^I is the ith feature value of the input case. O_j^R is the outcome feature value of the jth retrieved case and O^I is the outcome feature value of the input case.

Figure 4. The Denotation of Features and Outcome Feature Values


Assume that the outcome feature value is categorical data with p categories. For those features of categorical values, their weights are computed using the formula (Eq.3).

$$W_i = \max_{t}\left(\frac{L_{it}}{E_i}\right) \qquad \text{(Eq.3)}$$

where i = 1 to n, n is the number of case features in a case; t = 1 to p, p is the number of categories for the outcome feature. E_i is the number of retrieved cases whose f_ij^R is equal to f_i^I. L_it is the number of retrieved cases whose f_ij^R is equal to f_i^I and whose O_j^R belongs to the tth category. For features with continuous values, the weights cannot be generated in the same way as described above unless the feature values are discretized in advance. Though there may exist various ways of discretization, this study proposes another GA mechanism to discretize the continuous feature values. For the ith feature, a GA procedure is used to compute an optimal value, say A_i, to form a range centered on f_i^I. Let K_i denote the number of cases whose f_ij^R is between (f_i^I - A_i) and (f_i^I + A_i). Thus, E_i is replaced by K_i in the formula (Eq.3), and the feature weights are computed as shown in formula (Eq.4).

$$W_i = \max_{t}\left(\frac{L_{it}}{K_i}\right) \qquad \text{(Eq.4)}$$
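Putting Eq.3 and Eq.4 together, a minimal sketch of the dynamic weight computation for one feature of one input case could look like this; the case representation (lists of feature values plus a separate outcome list) and the handling of an empty match set are assumptions.

```python
def dynamic_weight(i, input_case, retrieved_cases, outcomes, A_i=None):
    """Dynamic weight of feature i for one input case (Eq.3 / Eq.4).

    retrieved_cases: list of feature-value lists (f_ij^R); outcomes: O_j^R.
    For categorical features A_i is None and a match means equality (Eq.3);
    for continuous features a match means |f_ij^R - f_i^I| <= A_i (Eq.4)."""
    fi_I = input_case[i]
    if A_i is None:
        matches = [j for j, case in enumerate(retrieved_cases)
                   if case[i] == fi_I]
    else:
        matches = [j for j, case in enumerate(retrieved_cases)
                   if abs(case[i] - fi_I) <= A_i]
    denom = len(matches)                 # E_i in Eq.3, or K_i in Eq.4
    if denom == 0:
        return 0.0                       # assumption: weight 0 when nothing matches
    counts = {}                          # L_it: matches falling in outcome category t
    for j in matches:
        counts[outcomes[j]] = counts.get(outcomes[j], 0) + 1
    return max(counts.values()) / denom  # W_i = max_t(L_it) / E_i (or K_i)
```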

Based on formulas (Eq.3) and (Eq.4), each input case has its own corresponding set of feature weights in this dynamic weighting approach.

3.5 The Experiment Design

Since both the feature weights and similarity measurements between pairs of features play a vital role in case retrieval, this research investigated the CBR performance by observing the effects resulting from the combinations of different feature weighting approaches and similarity functions.

As indicated in Figure 5, there are six approaches that combine different types of similarity functions and feature weighting methods. These are the Linear Similarity Function with Equal Weights (Approach A); Linear Similarity Function with Varied Weights (Approach B); Non-Linear Similarity Function with Equal Weights (Approach C); Non-Linear Similarity Function with Varied Weights (Approach D); Linear Similarity Function with Dynamic Weights (Approach E); and Non-Linear Similarity Function with Dynamic Weights (Approach F).

The differences between the three feature weighting approaches are described as follows. For the equal weights approach, feature weights are all set equal to 1. For the varied weights approach, there is only one set of feature weights determined by a proposed GA procedure. For the dynamic weights approach, there is a corresponding set of feature weights for each input case. That is, there are sets of feature weights dynamically determined according to the input case.

Figure 5. The Combinations of Similarity Functions and Feature Weighting Methods

Similarity Function     Equal Weights   Varied Weights   Dynamic Weights
Linear Similarity       (A)             (B)              (E)
Non-Linear Similarity   (C)             (D)              (F)

4. THE EXPERIMENT AND RESULTS

4.1 Case Description

The aircraft electronic ballasts used to drive fluorescent lamps can be mounted on a panel such as the light deflector of a fluorescent lamp fixture. The fluorescent lamps initially require a high voltage to strike the lamp arc and then maintain a constant current. Usually there is a connector at one end of the unit for the routing of all switching and power connections. As shown in Figure 6, the electronic ballast operates from 115 VAC/400 Hz aircraft power. When the operating power is supplied, the electronic ballast starts and operates two rapid-start fluorescent lamps or a single lamp in the passenger cabin of various commercial aircraft, such as the Boeing 747-400, 737-300, 737-400, and 747-500. Two control lines connect the ballast set and the control panel for the ON/OFF and BRIGHT/DIM modes, where the DIM mode is used at night when the cabin personnel want to decrease the level of ambient light in the cabin.

Figure 6. The Operational Setup for Electronic Ballast (the control panel connects through control lines to a ballast set of electronic ballasts driving a set of fluorescent lamps)

Three hundred electronic ballast maintenance records of the Boeing 747-400 from the accessory shop of one major airline company in Taiwan were used to construct the trouble-shooting system. Each maintenance case contains seven features identified as highly related to abnormal electronic ballast operations. As shown in Table 1, these features are either continuous or categorical. The outcome feature is the category of the replaced parts set. For instance, category C1 denotes the replaced parts of a transformer (illustrated as T101 on a printed circuit board) and a capacitor (illustrated as C307 on a printed circuit board). Category C2 denotes the replaced parts of an integrated circuit (illustrated as U300 on a printed circuit board), a transistor (illustrated as Q301 on a printed circuit board) and a fuse (illustrated as F401 on a printed circuit board). Each category in the outcome feature represents a different set of replaced parts.

4.2 The GA Implementation

According to the experiment design, this study implements three GA procedures to determine (1) the optimal exponent k in the non-linear similarity functions; (2) the most appropriate set of varied weights for static feature weighting; and (3) the sets of feature weights for dynamic feature weighting. Several steps are required in developing a GA computer program. These steps include chromosome encoding, fitness function specification, and internal control parameter specification.

Table 1. The Case Description

Feature                                                                Data Type    Range
Input Features
  Alternating Current on Bright Mode When Electronic Ballast Turns On  Continuous   0 to 2 (amp)
  Alternating Current on DIM Mode When Electronic Ballast Turns On     Continuous   0 to 2 (amp)
  Alternating Current on Bright Mode When Electronic Ballast Turns Off Continuous   0 to 2 (amp)
  Alternating Current on DIM Mode When Electronic Ballast Turns Off    Continuous   0 to 2 (amp)
  Is Light Unstable When Electronic Ballast Turns On                   Categorical  0 and 1
  Is It Not Illuminated When Electronic Ballast Turns On               Categorical  0 and 1
Outcome Feature
  Components Replacement                                               Categorical  C1, C2, ..., C10

The details of each step, in the order of the three GA applications, are described as follows.

Non-Linear Similarity

Chromosomes are designed to encode the exponent k in the non-linear similarity functions. Because there are six input features in a case, a chromosome is composed of six genes encoding the exponents of the six corresponding non-linear functions. Each chromosome is assigned a fitness value based on the formula (Eq.5). The population size was set to 50; the selection method was roulette wheel; the probability of mutation was 0.06, and the probability of crossover was 0.5. Uniform crossover was used, and the entire learning process stopped after 10,000 generations.

$$\text{Minimize}\quad fitness = \frac{\sum_{j=1}^{q} C_j}{q} \qquad \text{(Eq.5)}$$

where j = 1 to q, q is the number of training cases. C_j is set to 1 if the expected outcome feature is equal to the real outcome feature for the jth training case; otherwise, C_j is set to 0.
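As a hedged illustration only, the sketch below scores one chromosome of exponents against Eq.5 by labelling each training case with the outcome of its nearest retrieved case under the non-linear distance (reusing the nonlinear_distance sketch from Section 3.2); the data layout and the use of equal feature weights are assumptions, not the paper's implementation.

```python
def eq5_score(exponents, training_cases, case_base, weights):
    """Score of Eq.5 for one chromosome (one exponent per case feature).

    training_cases and case_base are lists of (feature_values, outcome) pairs.
    Each training case receives the outcome of its nearest case-base entry;
    C_j = 1 when that outcome matches the real one, and the score is sum(C_j)/q."""
    correct = 0
    for features, real_outcome in training_cases:
        nearest = min(case_base,
                      key=lambda c: nonlinear_distance(features, c[0],
                                                       weights, exponents))
        if nearest[1] == real_outcome:   # C_j = 1 when the outcomes agree
            correct += 1
    return correct / len(training_cases)
```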

Varied Weights

Chromosomes are designed to encode a set of feature weights whose values range between [0, 1]. The fitness function is also defined as indicated in formula (Eq.5). As for the GA parameters, the population size was set to 50, the probability of mutation was 0.06, and the probability of crossover was 0.5. The entire learning process stopped after 10,000 generations.

Dynamic Weights

Chromosomes are designed to encode the values A_i that form a range centered on f_i^I for features with continuous data. The fitness value is also calculated by the formula (Eq.5) for each chromosome in the population. As for the GA parameters, the mutation rate was 0.009, and the other settings were the same as the ones used for varied weights.

4.3 The Results

The case base is divided into two data sets for training and testing with a ratio of 2:1. That is, 200 maintenance cases of Boeing 747-400 aircraft electronic ballasts are used for training and the remaining 100 cases for testing. The results are illustrated in Table 2. All approaches are evaluated with 3-fold cross-validation. The result of approach (F), with non-linear similarity functions and dynamic weights, is the best, where the Mean Error (ME) is equal to 0.193 for training and 0.180 for testing.

Table 2. The Mean Errors of Different Approaches

Approach                                                   Training   Testing
(A) Linear Similarity Function with Equal Weights          0.240      0.223
(B) Linear Similarity Function with Varied Weights         0.213      0.220
(C) Non-Linear Similarity Function with Equal Weights      0.207      0.210
(D) Non-Linear Similarity Function with Varied Weights     0.200      0.203
(E) Linear Similarity Function with Dynamic Weights        0.233      0.230
(F) Non-Linear Similarity Function with Dynamic Weights    0.193      0.180

Investigating the results further, approach (A) with a linear similarity function and equal weights has an inferior training result. There is no obvious difference among the testing results of approaches (A), (B), and (E), all of which adopt linear similarity functions. However, among the approaches adopting non-linear similarity functions, approach (F) with dynamic weights yields a superior result to both approach (D) with varied weights and approach (C) with equal weights. It can be inferred that both non-linear similarity functions and the dynamic weighting process are crucial for a CBR system to effectively retrieve previous associated cases.

5. CONCLUSIONS

An inefficient aircraft maintenance service may lead to flight delays, cancellations or even accidents. Aircraft maintenance is therefore one of the most important activities for airlines to improve flight safety as well as obtain worldwide competitive strength. To improve maintenance productivity, this research developed a CBR system with GA mechanisms to enhance the retrieval of similar aircraft electronic ballast maintenance cases. Three GA procedures are proposed to determine the optimal non-linear similarity functions, the varied weights, and the dynamic feature weights, respectively. The experimental results demonstrated that the approach adopting both non-linear similarity functions and dynamic weights achieves better performance than the approaches with either linear similarity functions or equal/varied weights.

In addition to the electronic ballast, there are numerous components embedded in an aircraft system. The proposed method could also be employed for them to obtain a shorter repair time and lower maintenance costs. Besides, aircraft preventive maintenance is also an important issue. In the future, it may be possible to embed such a trouble-shooting component into an aircraft preventive maintenance system based on the history data in flight data recorders (FDR) to help ensure a safer and more comfortable flight.

Workflow Mining in a Wafer Fabrication Factory using Constraint-Based Evolutionary Regression Trees

Abstract

Understanding the factors associated with the flow-time of wafer production is crucial for workflow design and analysis in wafer fabrication factories. Owing to the complexity of wafer fabrication, the traditional human approach to assigning due-dates is imprecise and prone to failure, especially when the shop status is dynamically changing. Therefore, assigning a due-date to each customer order becomes a challenge for production planning. This paper proposes a constraint-based genetic algorithm (CBGA) approach to determine the flow-time. The flow-time prediction model is constructed and compared with other approaches. Better computational effectiveness and prediction results from the CBGA are demonstrated using experimental data from a wafer manufacturing factory.

Keywords: Due-date assignment, genetic algorithms, constraint-based reasoning, rule induction, regression trees.

1. INTRODUCTION

The determination of customer order due-dates is a crucial issue in wafer fabrication factories. More precise due-date assignment provides significant competitive advantages in the wafer manufacturing industry. In general, wafer manufacturing processes are complicated and time-consuming. The processing steps of each wafer depend on the workstation layout, the production capacity of the shop floor, order types, etc. Modeling the wafer production workflow processes is commonly done by rules of thumb or through lengthy discussions with senior staff.


Basically, the wafer manufacturing processes can be divided into two sections, i.e., the front-end and the back-end processes. In the front-end, bare wafers are processed and packaged. A flowchart of the basic front-end processes is described in Figure 1.

The front-end processes include (1) photolithography, (2) thermal processes, (3) implantation, (4) chemical vapor deposition, (5) etching, (6) physical vapor deposition, (7) chemical mechanical polishing, (8) process diagnostics and control (metrology), and (9) cleaning. The production steps introduced above are abstract and simplistic. Real floor shop manufacturing processes are more complicated with many detailed processing procedures.

After the front-end processes, wafers are fed into the back-end processes, as shown in Figure 2, which include processes such as (1) test, (2) wafer dicing, (3) die attach, (4) wire bonding, and (5) encapsulation.

Fig. 1. Basic Front-End Processes (photolithography, thermal process, implantation, chemical vapor deposition, etching, physical vapor deposition, chemical mechanical polishing, metrology, and cleaning)

Fig. 2. Basic Back-End Processes (test, wafer dicing, die attach, wire bonding, and encapsulation)

Revealing valuable knowledge hidden in corporate data becomes more critical for enterprise decision making. The analysis of previous workflow orders may provide valuable insights into due-date assignment. These insights can be inferred by mining the related operational information collected from manufacturing workflow processes. The recent development of novel or improved data mining methods such as Bayesian networks (Cooper et al., 1991), frequent patterns (Srikant and Agrawal, 1996), decision trees (Garofalakis et al., 2000; Gehrke et al., 1999), and evolution algorithms (Bojarczuk et al., 2001; Carvalho et al., 1999; Correa et al., 2001; Freitas, 1999) has drawn more attention from academics and industry. Among these techniques, the decision tree is one of the most commonly used methods for knowledge discovery.

A decision tree features easy understanding and a simple top-down tree structure where decisions are made at each node. The nodes at the bottom of the resulting tree provide the final outcome, either a discrete or a continuous value. When the outcome is a discrete value, a classification tree is developed, while a regression tree is developed when the outcome is numerical and continuous (Breiman et al., 1984).

Most tree induction methods discover the rules by adopting local heuristic search techniques. Some other rule induction methods employ global search techniques, such as the genetic algorithm (GA) (Holland, 1975), which evaluates the entire rule set via a fitness function rather than evaluating the impact of adding or removing one condition to or from a rule.

For regular GA operations, constraint validation proceeds after a candidate chromosome is produced. That is, several iterations may be required to determine a valid chromosome. To prevent the production of invalid chromosomes before a chromosome is generated, one possible way is to embed a well-designed constraint mechanism into the chromosome-encoding scheme.

In this research we propose a novel approach that integrates constraint-based reasoning with GAs to discover rule sets. The constraint-based reasoning mechanism is used to push constraints along with data insights into the rule set construction. This research applies tree search and forward checking techniques (Haralick & Elliott, 1980; Brailsford et al., 1999) to remove from the search space those gene values that cannot meet the predefined constraints during the evolution process. This approach allows constraints to be specified as relationships among attributes according to predefined requirements, user preferences, or partial knowledge in the form of a constraint network. In essence, this approach provides a chromosome-filtering mechanism prior to generating and evaluating a chromosome. Thus insignificant or irrelevant trees can be precluded in advance via the constraint network.

This paper proposes the CBGA approach to model the workflow process data for customer order due-date assignment using data from one major wafer fabrication factory in Taiwan. For comparison purposes, other soft computing approaches, including a simple GA for regression trees (GA), a back-propagation neural network (BPN), case-based reasoning (CBR), statistical regression, and traditional due-date assignment rules, are also used to construct the model.

The remainder of this paper is organized as follows. In Section 2, previous research works and related techniques are reviewed. The detailed constraint-based GA procedure is then introduced in Section 3. Section 4 presents the experiments and results followed by discussion and conclusions.

2. The Background

2.1. The Wafer Fabrication

The production characteristics of wafer factories differ from the traditional job shops in: (1) reentry, (2) rework, (3) lot sizing, (4) common machines, (5) work-in-process (WIP) control, (6) random yield, (7) multi-function machines, and (8) diversities of machine types. Owing to the complicated production steps and various factors involved, due-date assignment becomes a great challenge to the production planning and scheduling department.

According to the above description, it is apparent that the due-date of each order in a wafer fabrication factory is essentially affected by the following factors: (1) routing of the product, (2) order quantities, (3) current shop loading, (4) jobs in the queue of the bottleneck machine, and so on. Generally, the production planning and control staff are responsible for determining each order's due-date based on their knowledge of the manufacturing processes, taking the information about the above factors into consideration. Previous product order information also serves as an important reference in estimating the flow-time of each order. If a new order specification is similar to a previous one, then the new flow-time can be approximately determined based on that of the previous order. Understandably, the status of the shop floor, such as jobs in the system, shop loading and jobs at the bottleneck machine, may not be all the same. As a result, the estimated due-date could be subject to errors. Cheng and Gupta (1985) proposed several different due-date assignment approaches: the TWK (total processing time), NOP (number of operations), CON (constant allowance), and RDM (random allowance) rules. As soon as the processing times are estimated by these rules, the due-date is set equal to the order release time plus the estimated processing time, i.e.,

$$d_i = r_i + p_i \qquad (1)$$

where d_i is the due-date of the ith order, and r_i and p_i are the release time and processing time of order i, respectively.

Many other discussions have focused on the relationships between shop status information and due-dates. Several significant factors for due-date assignment, for example, jobs-in-queue (JIQ), jobs-in-system (JIS), delay-in-queue (DIQ), and processing plus waiting times (PPW), have been explored. Conway et al. (1967) revealed that due-date rules incorporating job characteristics performed better than those ignoring job characteristics.

2.2 Genetic Algorithm for Rule Induction

Rule induction methods can be categorized into either tree-based or non-tree-based methods (Abdullah, 1999). Some of the often-mentioned decision tree induction methods include the C4.5 (Quinlan, 1993), CART (Breiman et al., 1984) and GOTA (Hartmann et al., 1982) algorithms. Both decision trees and rules can be described as disjunctive normal form (DNF) models. Decision trees are generated from data in a top-down, general-to-specific direction (Chidanand & Sholom, 1997). Each path to a terminal node is represented as a rule consisting of a conjunction of tests on the path's internal nodes.

Classification and regression are critical types of prediction problems. Classification aims to examine the features of a newly presented object and assign it to one of a predefined set of classes (Michael and Gordon, 1997). Regression, also known as function approximation, is usually more difficult than classification prediction. It describes the relationship between input and output variables as a formula. Classification trees and regression trees are extensions of conventional classification and regression.

Regression tree induction is similar to classification tree induction. A complex problem can be partitioned into sub-problems that can be solved individually by a regression tree technique. CART, one of the most often mentioned decision tree techniques, can be applied to constructing classification and regression trees, respectively (Breiman et al., 1984). CART's regression tree approach tries to reduce the statistical variance of the data at each segment and separate the segments as much as possible by maximizing the distance between them. Each produced terminal node is assigned a constant value. However, different techniques may employ different types of algorithms (i.e., regression models) in the regression tree leaves (Karalic, 1996; Torgo, 1997).

GAs have been successfully applied to data mining for rule discovery in the literature. Some techniques use the one-rule-per-individual encoding proposed by Greene & Smith (1993) and Noda et al. (1999). In the one-rule-per-individual encoding approach, a chromosome can usually be identified with a linear string of rule conditions, where each condition is often an attribute-value pair, to represent a rule or a rule set. However, the more complicated and syntactically longer the chromosome encoding, the more complex the genetic operators required.

Hu (1998) proposed a Genetic Programming (GP) approach in which a program can be represented by a tree with rule conditions and/or attribute values in the leaf nodes and functions in the internal nodes. The challenge is that such a tree can grow in size and shape in a very dynamic way. Thus, an efficient tree-pruning algorithm is needed to readjust unsatisfied parts within a tree to avoid infeasible solutions. Bojarczuk et al. (2001) proposed a constrained-syntax GP approach to build a decision model, with particular emphasis on the discovery of comprehensible knowledge.

In order to discover high-level prediction rules, Freitas (1999) applied first-order logic relationships such as "Salary > Age" by checking an Attribute Compatibility Table (ACT) during the discovery process with GA-Nuggets. The ACT was claimed to be especially effective for its knowledge representation capability in classification problems. However, little research has attempted to hybridize constraint-based reasoning and evolutionary techniques for estimation problems.

2.3 Constraint Satisfaction in Genetic Algorithms

Constraint satisfaction involves finding values for problem variables that are subject to constraints on acceptable solutions so that these constraints are satisfied. A constraint satisfaction problem (CSP) basically belongs to the class of NP-complete problems and normally lacks generally suitable solution methods. A number of different approaches have been developed for solving CSPs. Some of them adopt constraint propagation (Bowen et al., 1990; Dechter & Pearl, 1988; Lai, 1992) to reduce the solution search space. Others try backtracking to directly search for possible solutions (Nadel, 1988). Some apply a combination of these two techniques, including tree search and consistency algorithms, to efficiently find one or more feasible solutions (Purdom, 1983; Freuder, 1982). Nadel (1988) compared the performance of several algorithms, including "generate and test", "simple backtracking", "forward checking", "partial lookahead", "full lookahead", and "really full lookahead". The major difference among these algorithms is the degree of consistency checking performed at each node during the tree-solving process. Apart from the "generate and test" method, the others perform hybrid techniques. Whenever a new value is assigned to a variable, the domains of all unassigned variables are filtered and left only with those values that are consistent with the one already assigned. If the domain of any of these unassigned variables becomes empty, a contradiction is recognized and backtracking occurs. Freuder (1982) mentioned that if a given CSP has a tree-structured graph, then it can be solved without any backtracking. That is, solutions can be retrieved in a backtrack-free manner. Dechter & Pearl (1988) used this theory coupled with the notion of directional consistency to generate backtrack-free solutions more efficiently.

Dealing with constraints for search space reduction has become a crucial issue in artificial intelligence research. GAs maintain a set of chromosomes (solutions) called a population. The population consists of parents and offspring. As the evolution process proceeds, the best N chromosomes in the current population are selected as parents. Through the genetic operators, offspring are selected according to the filtering criterion, which is usually expressed as a fitness function along with some predefined constraints. The GA evolves over generations until the stopping criteria are met. However, valid chromosomes are usually produced by trial and error. That is, a candidate chromosome is produced and then tested against the filtering criteria. Therefore a GA may require more computation, especially when dealing with complicated or severe filtering criteria. To resolve this problem, an effective chromosome construction process can be applied to the initialization, crossover, and mutation stages, respectively.


Garofalakis et al. (2000) provided a constraint-based algorithm inside the mining process to specify the expected tree size and accuracy. In other words, constraints can be used to express the trade-off between the model accuracy and the computational efficiency associated with the tree-building or tree-pruning process. In similar research work, CADSYN (Maher, 1993) adopted a constraint satisfaction algorithm for case adaptation in case-based reasoning. By pushing constraints into the case adaptation process, the constraint satisfaction algorithm is able to help a case-based reasoning system produce suitable solution cases more efficiently for discrete constraint problems. Barnier & Brisset (1998) developed a hybrid system of a genetic algorithm and constraint satisfaction techniques. Their method was adopted to solve optimization problems in vehicle routing and radio link frequency assignment. Basically, that approach applied a GA to reduce the search space for a large CSP, rather than applying constraint satisfaction techniques to improve the GA's computational efficiency. Kowalczyk (1996) pointed out the concept of using constraint satisfaction principles to support a GA in handling constraints. However, few research works have applied constraint-based reasoning to improve GA modeling performance, nor addressed how the user's knowledge can be represented and processed within a constraint network. This research attempts to embed constraint-based reasoning into a GA to aid chromosome screening in the initialization and ongoing evolution stages.

3. The Proposed CBGA Approach

The proposed CBGA approach consists of five modules. According to Fig. 3, these modules are: Regression Tree Initialization, Chromosome Filtering, Fitness Evaluation, GA Operation, and Chromosome Decoding.



Fig. 3. The Conceptual Framework of CBGA

The regression tree initialization module constructs the potential candidates for the initial regression trees. Figure 4 illustrates a regression tree with n nodes. Each node may consist of two types of attributes: categorical and continuous. Categorical attributes provide part of the conjunction of conditions, and the remaining conditions are contributed by continuous attributes in the form of inequalities. That is, the categorical or continuous attribute, the selected relational operator (=, <= or >), and the splitting point in each node are determined by the GA Initialization module. The format of each regression tree is presented as follows.

IF cond_1,1 AND ... AND cond_1,m THEN y = f_1(x_1, x_2, ..., x_m)
ELSEIF cond_2,1 AND ... AND cond_2,m THEN y = f_2(x_1, x_2, ..., x_m)
...
ELSEIF cond_i,1 AND ... AND cond_i,m THEN y = f_i(x_1, x_2, ..., x_m)
...
ELSEIF cond_n,1 AND ... AND cond_n,m THEN y = f_n(x_1, x_2, ..., x_m)
ELSE y = f_n+1(x_1, x_2, ..., x_m)

where m is the total number of attributes, n is the total number of the tree nodes, and condi,j is the form of equality or inequality for each attribute xj in a tree Nodei.
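As a hypothetical sketch (not the authors' implementation), such a rule list can be represented as an ordered list of nodes, each carrying a conjunction of conditions and a leaf regression function f_i, with f_(n+1) as the default when no node fires; the attribute indices and leaf models below are assumptions.

```python
import operator

OPS = {"=": operator.eq, "<=": operator.le, ">": operator.gt}

def predict(tree_nodes, default_model, x):
    """Evaluate the IF/ELSEIF/ELSE rule list of a regression tree.

    tree_nodes: list of (conditions, model) pairs, where conditions is a list
    of (attribute_index, op, split_value) tuples and model maps x to a value.
    default_model plays the role of f_(n+1)."""
    for conditions, model in tree_nodes:
        if all(OPS[op](x[j], value) for j, op, value in conditions):
            return model(x)            # y = f_i(x_1, ..., x_m)
    return default_model(x)            # y = f_(n+1)(x_1, ..., x_m)

# Hypothetical two-node tree with simple linear leaf models
tree = [
    ([(0, "<=", 100), (2, ">", 0.7)], lambda x: 2.0 * x[0] + 5.0),
    ([(1, ">", 3000)],                lambda x: 0.5 * x[1] + 40.0),
]
print(predict(tree, lambda x: 300.0, [80, 2500, 0.9]))   # first node fires
```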


Fig. 4. A Formal Regression Tree Representation (a chain of Nodes 1 to n; each node's 'yes' branch leads to its regression function f_i, each 'no' branch leads to the next node, and the final 'no' branch leads to f_n+1)

The GA Encoding module generates a chromosome that is equivalent to a regression tree. The tree nodes can be presented in the form "Node_1, Node_2, ..., Node_n", where n is the total number of regression tree nodes and is automatically determined by the GA.

The chromosome filtering module filters out the insignificant or irrelevant nodes in regression trees. The filtering process applies the tree search and forward checking algorithm during the GA operation. In forward checking, when a value is assigned to the current gene, any possible value of the remaining genes that conflicts with the current gene is removed from the pool of potential candidates. That is, whenever a new gene is considered, all candidate values remaining in the pool are guaranteed to be consistent with the constraint network given the past gene assignments.
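A minimal, generic sketch of the forward-checking idea just described (an illustration under assumed data structures, not the authors' implementation): after a value is assigned to the current gene, the candidate pools of the remaining genes are pruned of values that violate any binary constraint with that assignment, and an emptied pool signals a dead end.

```python
def forward_check(assignment, gene, value, domains, constraints):
    """Prune candidate pools of unassigned genes after assigning `value` to `gene`.

    domains: {gene_id: set of candidate values} (copied, then filtered).
    constraints: {(g1, g2): predicate(v1, v2)} binary constraints.
    Returns pruned domains, or None if some pool becomes empty (contradiction)."""
    pruned = {g: set(vals) for g, vals in domains.items()}
    pruned[gene] = {value}
    for other in pruned:
        if other == gene or other in assignment:
            continue
        vals = pruned[other]
        for (g1, g2), ok in constraints.items():
            if g1 == gene and g2 == other:
                vals = {v for v in vals if ok(value, v)}
            elif g2 == gene and g1 == other:
                vals = {v for v in vals if ok(v, value)}
        if not vals:
            return None                # inconsistent: resample or backtrack
        pruned[other] = vals
    return pruned

# Toy usage: gene 1's candidate value must stay below gene 0's assigned value
domains = {0: {1, 2, 3}, 1: {1, 2, 3}}
print(forward_check({}, 0, 2, domains, {(0, 1): lambda a, b: b < a}))
```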

In this research the constraint network may consist of predefined constraints or user’s preferences as follows.

Predefined constraints: Certain constraints are imposed by the linear regression model, including the minimum number of data records needed to correctly produce forecast values based on the least-squares error measurement. That is, the data set in a node must contain at least as many records as the number of attributes to be estimated.

User's preferences: Our approach allows users to specify expected values during the regression tree induction process in order to reveal more reasonable and accurate results. These parameter settings are listed as follows.

1. The Law of Large Numbers: this specifies a large enough number of records to be contained within a tree node, so that the error term involved in the estimation approaches a normal distribution for the regression model.


2. The thresholds of statistics (e.g., the R-square and F values) imposed on the regression model in each tree leaf. This assures that the regression model is more significant from a statistical point of view.

3. The minimum/maximum number of attributes that should be retained in each tree node.

4. The maximum regression tree size, such as the tree depth and width.

In essence, this approach provides a chromosome-filtering mechanism prior to generating and evaluating a chromosome. As shown in Fig. 5, each regression tree containing n nodes can be represented by a chromosome. Each node represents a rule that determines the estimation result of one linear regression function. The rule consists of a conjunction of conditions in the form of equality or inequality relations. The encoding of each condition requires three genes to specify "the enable/disable status", "the equality/inequality operator", and "the splitting value". The constraint network contains the information extracted from the predefined constraints and the user's preferences. The valuation scope of each node is confined via constraint-based reasoning. For example, assume that node i is inconsistent with the constraint network; it is then filtered and marked as an invalid node. In this way, the search effort for potential chromosomes can be reduced without having to activate the fitness evaluation procedure for every candidate chromosome.
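To illustrate the three-genes-per-condition encoding described above, a hypothetical decoding routine is sketched below; the gene layout (enable flag, operator code, splitting value) follows the text, while the attribute names, operator codes and numeric values are assumptions.

```python
def decode_node(genes, attribute_names):
    """Decode one tree node from a flat gene list.

    Each condition uses three genes: an enable/disable flag (0 or 1),
    an operator code (0 -> '=', 1 -> '<=', 2 -> '>'), and a splitting value."""
    operators = ["=", "<=", ">"]
    conditions = []
    for i, name in enumerate(attribute_names):
        enabled, op_code, split = genes[3 * i: 3 * i + 3]
        if enabled:                    # disabled conditions are skipped
            conditions.append((name, operators[int(op_code) % 3], split))
    return conditions

# Hypothetical chromosome fragment for one node over three attributes
genes = [1, 1, 120.0,   0, 2, 0.4,   1, 2, 3000.0]
print(decode_node(genes, ["order_qty", "shop_load", "queue_len"]))
# -> [('order_qty', '<=', 120.0), ('queue_len', '>', 3000.0)]
```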

The Fitness Evaluation module calculates the fitness value of each chromosome in the current population. The fitness function is defined as the root mean square error (RMSE), which measures the performance of the regression tree. If the specified stopping condition is satisfied, the entire process is terminated and the best chromosome is confirmed and transformed into the optimal regression tree; otherwise, the GA operations continue.
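The RMSE fitness itself is straightforward; a minimal sketch with made-up numbers:

```python
import math

def rmse(predictions, actuals):
    """Root mean square error used as the chromosome fitness (lower is better)."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predictions, actuals))
                     / len(actuals))

print(rmse([420.0, 380.0, 510.0], [435.0, 400.0, 480.0]))   # illustrative values only
```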



Fig. 5. The Framework of Constraint-Based Preprocessing for GA Operators

Each GA operation involves chromosome selection, crossover, and mutation in order to produce the offspring generation based on different GA parameter settings. The offspring produced by the CBGA need to be validated by the chromosome filtering operation to assure that valid or potential individuals are derived for further processing.

4. The Experiments and Results

Data Description

241 data records were collected from a steady-state simulated factory located in Hsin-Chu Science Park, Taiwan. Each record provides information about an order's attributes, including seven variables: order quantity, existing order quantities when the order arrived, average shop workload when the order arrived, average queuing length when the order arrived, queuing quantities of work stations when the order arrived, utilization rate of the work stations when the order arrived, and flow-time. All of these features, dependent and independent, are of the continuous type. The time series plot of the dependent feature, flow-time, is shown in Figure 6. As can be seen, the pattern of the flow-time is fluctuating and non-stationary. All methods were tested with 5-fold cross-validation. The RMSE is derived individually for the GA, CBGA, CBR, BPN, regression, TWK, NOP and JIQ approaches. Generally, the essential part of the CBR approach involves the appropriate design of similarity functions and corresponding weights for individual features. In this experiment the CBR applied is based on non-linear similarity functions and varied feature weights. The detailed algorithm can be found in Chiu (2001). For the GA parameters, there are 100 chromosomes in each population. The crossover rate was varied from 0.5 to 0.6, whereas the mutation rate ranged from 0.01 to 0.05 in the initial settings. The entire learning process was stopped after five different numbers of generations (100, 200, 300, 400, and 500). The RMSE in both the training and testing stages is shown in Table 1. The comparative performance of each approach is depicted in Fig. 7. Note that in this figure the CBGA's optimal performance is based on the result of 5-fold cross-validation with 300 generations, while the GA's optimal performance is based on 5-fold cross-validation with 100 generations. The time required for both the GA and the CBGA is approximately 0.2 minutes.


Fig. 6. Time Series Plot of Flow-Time

Table 1. The RMSE for Different Approaches

Methods      Training Stage   Testing Stage
Regression   444.58           458.83
GA           393.2            468.9
CBGA         392.2            435.4
CBR          400.17           447.49
BPN          343.07           467.82
TWK          641.29           748.82
NOP          730.00           782.13
JIQ          484.72           482.71

It can be seen that the CBGA outperforms the other approaches. Although the BPN approach has the best training performance, its testing performance is inferior to that of the CBGA. This could result from over-learning.


Fig. 7. The Optimal Performances of Different Approaches

Compared with a simple GA, the CBGA achieves higher predictive accuracy. In addition, the regression tree discovered by the CBGA also exhibits higher predictive accuracy and more transparent knowledge than the regression approach.


Fig. 8. The Optimal Performance for GA and CBGA in the Training Stage with Different Generations (Based on Mutation Rate 0.01)

Fig. 9. The Optimal Performance for GA and CBGA in the Testing Stage with Different Generations (Based on Mutation Rate 0.01)

5. Discussions and Conclusions

The due-date of a wafer production order is conventionally assigned by the production planning and scheduling staff. These staff mostly rely on previous experience and the ordered product specification to estimate the flow-time. However, the flow-time is greatly dependent on the current shop status, and ignoring the current shop status is the major reason why most assigned due-dates cannot be met by the production department. This study proposed the CBGA, which hybridizes constraint-based reasoning within a genetic algorithm approach, to assist flow-time prediction in consideration of the workflow status. Incorporating run-time production information into the process mining is not straightforward and typically requires novel algorithm design. The proposed approach is able to infer regression trees by pushing partial knowledge into chromosome construction to guide the search for suitable flow-time models. As compared with the BPN, CBR, regression and other traditional due-date assignment rules such as TWK, NOP and JIQ, the regression tree discovered by the CBGA exhibited higher predictive accuracy and more significant knowledge. As for the GA versus the CBGA, the latter achieved higher predictive accuracy, though the computational efficiency was not significantly improved. This may be due to the relatively small number of data records, which requires fewer computational resources during evolution.

The Future Development

The proposed CBGA is generic and problem independent. Proprietary genetic operators or chromosome representations are not needed to interact with the constraint-based reasoning. By extending the use of the ACT, our proposed CBGA approach not only allows other complicated attribute relationships, such as linear or non-linear quantitative relationships among attributes, but is also able to integrate regression expressions in the tree leaves. Basically, estimation models map the input values onto the outcome(s). When the model is complex, it is not easy to figure out the appropriate inputs that can best approximate the expected outcome. Usually this type of research is called parameter design. The regression tree constructed by our proposed CBGA approach can be treated as a multivariate-split-based regression tree. From a decision support point of view, the estimation model belongs to the "what-if" type that maps input value(s) to output value(s). However, for a "goal-seeking" mechanism that aims to determine suitable input values to approximate expected outcome value(s), this would be relatively difficult with a complicated regression tree model. To better manage the wafer workflow process, one might try to execute a "goal-seeking" analysis to support re-adjustment decision making. The proposed CBGA can be further extended to create a feedback loop that adapts the workflow model to changing circumstances with the aid of an optimization scheme. The GA could also be one of the potential techniques for exploring the most suitable values of the factors associated with the workflow process. We are currently working on adopting a similar optimization technique to develop the "goal-seeking" mechanism. It is believed that such information would provide highly strategic value for wafer production management.

Dynamic Instruction Strategy for Computer-Aided Learning Systems: The Genetic Algorithms Approach

Abstract

Individualized instruction is the optimum goal of computer-aided learning systems. Many education specialists agree that learning effectiveness can be greatly improved if the instruction materials fit the individual student's capability. This study discusses how genetic algorithms can support the planning of instruction strategies, based upon which suitable learning contents are provided to individual students. Currently a prototype system has been developed to aid the arithmetic operations of entry-level students in elementary schools. The proposed methodology can dynamically monitor a student's learning performance, and the observed dialog information can then be processed by the planning system to produce the optimum navigation plan.

1. Introduction

Individualized instruction is the optimum goal for computer-aided learning systems (CALSs). It is believed that learning performance would be significantly improved when the instruction material is appropriately provided to students according to their knowledge levels. Therefore, CALSs with flexible instruction strategies that are able to dynamically branch into the appropriate contents are crucial to achieving learning effectiveness (Farley, 1981). Studies on adaptive computer learning systems aim to construct CALSs that can provide flexibility for students with diverse backgrounds and capabilities. Due to the interdisciplinary nature of CALSs, most research works adopt various approaches from social science, psychology, artificial intelligence, or software engineering to improve CALS effectiveness. However, the issues involved in building effective CALSs are more complex than the technical issues alone. Because of the diversity in individual differences, providing cooperative dialog behavior between a student and a system is one of the primary goals of a CALS. To accomplish this goal, a system needs to acquire and maintain a sufficient structural understanding of a student in order to adapt to the student's problem-solving capability and level of expertise within particular learning contexts. Individual differences may include cognitive style, learning capability, motivation, attitude and so on. Collectively, in this study, these factors are termed individual difference variables (IDVs). Though it has been recognized that IDVs can explicitly or implicitly influence the interaction in student-system communication, there remain several fundamental research issues that need to be addressed in CALS design:

(a) incomplete understanding about the details of what kind of personal traits and student dialog behavior constitutes an IDV. It is those tangible and intangible human factors that characterize and differentiate one student from others;

(b) little theory has been developed that demonstrates what and how these IDVs affect dialog behavior. The relationship between each IDV and dialog behavior is complicated in nature and this makes the modeling of system dialog dynamics more difficult;

(c) there is no principle for quantifying these variables by merely applying the directly observed information during an interactive session. Since observing a student's interactive behavior is the most direct way a system can detect a student's actual performance, objective methods are needed to formally model this problem; and

(d) it is not clear how to represent these variables appropriately in a system. As Norcio and Stanley pointed out, current adaptive interface systems have not fully incorporated various cognitive types into their design due to the cognitive nature of high-level processing (Norcio and Stanley, 1989). A suitable schema for representing knowledge about a student has direct implications for a system's adaptation.

To achieve learning effectiveness, the student's learning performance has to be accurately modeled so that the system can adapt correspondingly. There are several ways a student's learning status can be derived. Today most of the advanced CALSs employ artificial intelligence (AI) techniques to aid further understanding of a student's learning progress via process monitoring (Chiu et al., 1994). However, there are some crucial issues that need to be carefully tackled. These include (1) the fuzzy nature of a student's capability; (2) the appropriateness of the knowledge reasoning scheme; and (3) dialog navigation planning. This research provides a methodology to support instruction strategy planning using the genetic algorithm technique (Holland, 1975). This approach can dynamically determine a student's current learning performance by continuously observing the dialog progression. Based upon the concluded understanding of the student, the system is able to develop an optimum instruction strategy and the corresponding learning contents.

2. The Problems with Traditional CALSs

In general, CALSs interact with a student according to the student's dialog inputs. For traditional CALSs, the instruction sequences and the corresponding instruction contents are pre-determined based on the courseware structure. According to Yao's experiment on mathematics instruction, a partial omission of learning material may not affect the student's overall learning performance (Yao, 1989). This effect is more significant for those students who have better grades in math subjects. Such findings indicate that formal sequential instruction design may not be the preferred strategy for knowledge delivery. However, in practice most existing CALSs adopt a design that provides the content closest to the current level of difficulty. That is, the decision for a CALS to determine the most appropriate level among the available contents follows a predefined mapping of the student's dialog input to the courseware design, usually the content having the shortest learning distance from the current content. This research argues that the traditional courseware design can be improved by a courseware design that automatically generates an optimum learning path. The optimum learning path not only maintains learning performance but also shortens the learning period. That is, a sequential instruction strategy is only able to achieve a local optimum for learning performance, instead of the global optimum derived via a dynamic instruction strategy. To demonstrate this difference, an adaptive learning and testing environment embedded with the dynamic instruction strategy was developed with the aid of Genetic Algorithms (GAs). This environment was constructed especially to improve traditional testing environments that do not consider individualized capabilities and the synergy with the instruction purposes. This environment includes a monitoring utility to collect interaction behavior as the input for reasoning about the student's learning performance. An example of a CAL system designed to assist elementary students in arithmetic operations is provided for discussion.

3. The Proposed Research Methodology

The GA is an evolutionary computing technique introduced by Holland in the 1970s (Holland, 1975). The GA, inspired by biological evolution, is a procedure that can improve search results by constantly trying various possible solutions, reproducing and mixing the elements found in superior solutions. Based upon the natural evolution concept, the GA is able to rapidly converge on solutions that are globally optimal within a large search space. This technique offers an abundance of opportunities to solve a variety of problems, including resource allocation problems, scheduling problems, optimization problems, machine learning problems and so on (Kumar et al., 1995; Levitin and Rubinovitz, 1993; Goldberg, 1989). For certain problems, the GA cannot promise an absolute optimum, nor can it assure that the derived solution is the optimum, but it can usually determine an acceptable solution within a short period of time. Successful examples include NP-complete problems, the TSP (Traveling Salesperson Problem), shortest paths in weighted graphs, etc. The problem to be solved in this research is a typical shortest path problem that may require an exhaustive search process to determine the most efficient way of learning.

3.1 The GA Process

Typical GA operations, as shown in Figure 1, are a set of life-cycle based computational procedures that include the tasks of generating chromosomes, evaluating their fitness, selecting the better ones, pairing, mating and mutating. According to the GA rationale for generating better outcomes (offspring), the chromosomes that best fit the evaluation criterion usually have a higher possibility of survival. The ones with a lower degree of fit are eliminated from further evolutionary processes. The survivors are chosen to exchange (i.e., cross over) some bits (e.g., digit 0 or 1) of their two individual chromosomes on a pair-wise basis. The two new chromosomes inherit the features of their ancestors. In addition, the mutation stage can occasionally modify certain bit(s) from 0 to 1 (or 1 to 0) by random selection. The selection stage has its own evaluation criterion (function) to assess the fitness of the newly generated chromosomes, and is also the portion that users can develop to meet the needs of their applications.


Figure 1. GA Process Life Cycle

3.2 The Learning/Testing Environment

The experimental platform constructed in this research is an arithmetic addition drilling system for the addition of two numbers, each with at most three digits. The test contents and problem order are based on Chen's classification schema, which structures arithmetic addition problems along two dimensions (Chen, 1978). As shown in Figure 2, the horizontal dimension (with 7 stages) indicates the number of digits involved in the addition operation. The vertical dimension (with 5 situations) displays the types of carryover. That is, the level of learning difficulty is determined according to the type of carryover and the number of digits involved. Therefore the operation should be easier to accomplish when fewer digits and carryover operations are incurred. As indicated in each unit, there are three scores indicating the relative degree of confidence in accomplishing the problems for students in three different grades. These scores were derived from an experimental design observing student subjects' practice on each type of problem scenario. Following the so-called Learning Hierarchy Theory (Gagne, 1985), all of the units were structured in terms of their level of difficulty (i.e., the consideration of carryover status and number of digits involved). For each single unit there exist various different tests that follow the same notation. The computer program can automatically generate these tests. Examples of some tests in the unit "two digits + one digit & without carryover" are illustrated in Figure 3.


Figure 2. The Conceptual Framework for the Learning Hierarchy for Arithmetic Addition


Figure 3. Some Tests in Unit "two digits + one digit & without carryover" (e.g., 26 + 2 = 28, 17 + 1 = 18, 56 + 3 = 59)

3.3 The Modeling of Learning Flow Control

3.3.1 The Development of the Learning Network Diagram

Based on the conceptual framework for the Learning Hierarchy for Arithmetic Addition, the distance between two different content units can be defined as a weight w(eij) of a directed edge e<ui,uj>, where ui & uj indicate different units. The larger the difference between two units, the higher the value of the weight. As shown in Figure 4, because learning from u1 to u4 is easier than from u1 to u5, the w(e14) (i.e., 5) is less than w(e15) (i.e., 7). These weights and the relationships among the units can be determined either by real data analysis or encoded by domain experts.

Figure 4. The Relationships among Units in the Learning Network Diagram (units u1–u6 connected by weighted directed edges, e.g., w(e14) = 5 and w(e15) = 7)

3.3.2 The Modification of Context Diagram

In order to provide the most suitable learning paths, the proposed system has to dynamically acquire the user's current performance level. A user's performance level may contain information about the user's preferences, level of proficiency, frequency of dialog errors, time spent, problem-solving approaches, etc. In this research the user performance level (UPL) is defined as the user's overall observed learning performance in terms of the number of occurrences of: C1 -- tests correctly completed; C2 -- pre-test hints; C3 -- on-line hints; C4 -- detailed post-test explanations used; and C5 -- tests failed. The formula used to calculate the UPL is as follows:

$$\mathrm{UPL} = \frac{\sum_{i} s_i \times n_i}{\sum_{i} n_i}$$

n_i : the number of occurrences of the ith evaluation category (C1 ~ C5).
s_i : the weighting score of the ith evaluation category. This research provisionally allocates 1 point for C1, 0.9 point for C2, 0.7 point for C3, 0.5 point for C4, and null (0.0) point for C5 as a relative conceptual differentiation among these categories. These weighting points can be further adjusted by conducting more rigorous experiments.

Example

Assume that a user's up-to-present performance record (C1, C2, C3, C4, C5) is (15, 10, 2, 6, 7), which indicates that this user has correctly finished 15 tests, used pre-test hints 10 times, activated on-line hints twice, used post-test explanations 6 times and failed 7 tests. Thus

UPL = (1×15 + 0.9×10 + 0.7×2 + 0.5×6 + 0.0×7) / 40 = 0.71
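A minimal sketch of this computation, assuming the five category counts are supplied as a tuple; the function name is illustrative and the default weights are the scores allocated above.

```python
def user_performance_level(counts, weights=(1.0, 0.9, 0.7, 0.5, 0.0)):
    """UPL = sum(s_i * n_i) / sum(n_i) over the five categories C1..C5."""
    return sum(s * n for s, n in zip(weights, counts)) / sum(counts)

# The record (C1..C5) = (15, 10, 2, 6, 7) from the example above
print(round(user_performance_level((15, 10, 2, 6, 7)), 2))  # 0.71
```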

Whenever the UPL is determined, the system adjusts the learning distances (i.e., w(eij), where i is the current unit and j = 1 to n, excluding j = i) from the current unit ui to the other units. The reason w(eij) is modified is that different users may exhibit different degrees of learning achievement in a given unit; that is, different users learn the contents of the same unit to varying degrees. To conform to this rationale, the system updates the user's learning network diagram by eliminating redundant (or, to some degree, coherent) units that the user might not need to learn. This is accomplished by enlarging the learning distances from the current unit to neighboring units in order to hinder unnecessary learning. Therefore, the user can proceed to those units that lead more efficiently to the final objective without having to re-learn what he/she might already have understood. For example, in Figure 5, u1 and u4 are two adjacent units; w14, w24 and w34 are existing learning distances. r(eij), such as r14, r24 and r34, are newly added learning distances that are allotted by reasoning from the knowledge bases shown in Tables 1 and 2. These knowledge bases were encoded through intensive interviews with several experienced arithmetic instructors from local county elementary schools and are used to guide the learning distance adjustment.

Figure 5. The Illustration of Newly Added Learning Distances (adjusted distances such as w14+r14, w24+r24 and w34+r34 on the edges into u4)

Table 1. The Knowledge Base Used for Adjusting the Learning Distances


If UPL > 0.95 then r1 = 5 : r2 = 3 : r3 = 1      'r1: most correlated unit
Else If UPL > 0.8 then r1 = 3 : r2 = 1 : r3 = 0  'r2: less correlated unit
Else If UPL > 0.7 then r1 = 1 : r2 = 0 : r3 = 0  'r3: least correlated unit
Else r1 = 0 : r2 = 0 : r3 = 0
End If

Table 2. The Mapping of the Correlationship among Units

Unit | Most Correlated Units (r1) | Less Correlated Units (r2) | Least Correlated Units (r3)
2a   | 2c,3a     | 2b,3c     | 3d,4c
2b   | 2c,3b     | 2e,3e     | 4b,4e
2c   | 2d,3c     | 3a,3d     | 4d
2d   | 2e,3d     | 3a,3c     | 4d
2e   | 3e        | 3b        | 4b,4d
3a   | 3b,3c     | 3d        | 4b
3b   | 3c,3e,4b  | 3c,4d     | 5e
3c   | 3d        | 3e        | 4d
3d   | 4d,3e     | 4b,4e     | 6d
3e   | 4e        | 4b,4d     | 5e,6b
4b   | 5e,6b     | 4d        | 6b,6e
4d   | 4e,6d     | 5e        | 6a
4e   | 5e        | 6d,6e     | 6b
5e   | 6e,6b     | 6d,6a     | 7c,7d,7e
6a   | 6b,6c     | 6d,7c,7d  | 7e
6b   | 6c,6e     | 7e        | 7c,7d
6c   | 6d,7c     | 7d        | 7e
6d   | 6e,7d     | 6e,7c     | 7e
6e   | 7e        | 7d        | •
7c   | •         | 7d        | •
7d   | •         | •         | •
7e   | •         | •         | •

"•" indicates no impact on other nodes
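A minimal sketch of how Tables 1 and 2 might be applied to adjust the learning distances; the dictionary-based representation, function names and the abridged Table 2 excerpt are illustrative assumptions, not the system's actual data structures.

```python
# Abridged excerpt of Table 2: unit -> (most, less, least) correlated units
CORRELATION = {
    "2a": (["2c", "3a"], ["2b", "3c"], ["3d", "4c"]),
    "2b": (["2c", "3b"], ["2e", "3e"], ["4b", "4e"]),
}

def adjustment_increments(upl):
    """Table 1 rules: map the UPL to the (r1, r2, r3) distance increments."""
    if upl > 0.95:
        return 5, 3, 1
    if upl > 0.8:
        return 3, 1, 0
    if upl > 0.7:
        return 1, 0, 0
    return 0, 0, 0

def adjust_distances(distances, current_unit, upl):
    """Add r(e_ij) to w(e_ij) for the units correlated with the current unit."""
    for increment, units in zip(adjustment_increments(upl), CORRELATION[current_unit]):
        for unit in units:
            edge = (current_unit, unit)
            if edge in distances:          # only adjust links that actually exist
                distances[edge] += increment
    return distances

# Usage with hypothetical starting distances keyed by (from_unit, to_unit)
distances = {("2a", "2c"): 2, ("2a", "3a"): 1, ("2a", "2b"): 9, ("2a", "3c"): 15}
adjust_distances(distances, current_unit="2a", upl=0.96)
```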

3.3.3 The Shortest Learning Path

To search for the shortest learning path, an evaluation function is used to measure the learning efficiency. For a given starting unit (say uj), the total length (L) of a path leading to the final learning objective can be described in Eq1 as follows. This equation is used as the fitness function in the GA process.

$$L = \sum_{k=\mathrm{current\_u}}^{\mathrm{end\_u}-1} \left[\, w(e_{j_k j_{k+1}}) + r(e_{j_k j_{k+1}}) \,\right] \qquad \text{(Eq1)}$$

Example


This example explains how the modified learning distance dij, which is equal to the sum of w(eij) and r(eij), is obtained by referencing Tables 1 and 2. In the partial learning network diagram shown in Figure 6, a user is currently learning the starting unit (say unit 2a). Assume that the user's UPL is measured at 0.8. The learning distances for the major links (the most correlated), from 2a to 3a and 2c, are each increased by 3 (as expressed by the bold lines), whereas the distances for the secondary links (i.e., the less correlated), from 2a to units such as 2b and 3c, are increased by only 1 (as expressed by the single dashed lines). Therefore, both the user's learning performance and the structure of the learning contents heavily influence the newly modified learning distance dij.

Figure 6. The Explanation of Learning Distance Adjustment (a partial learning network over units 2a–4e with node indexes 6–18; adjusted distances such as 9+3, 7+3 and 5+3 appear on the most correlated links and 15+1, 11+1, 10+1 and 12+1 on the less correlated links)

After the user's overall learning content diagram has been updated, the system is able to proceed with searching for the optimum learning path.

4. The Implementation of the GA

To implement the GA process, several steps are required in developing a GA computer program. These steps include chromosome encoding, fitness function specification, and internal control parameter specification. The details of each step are described as follows.

4.1 Chromosome Encoding

Because the shortest path problem on the global learning network diagram involves a number of learning units that may be navigated, a coding scheme based on binary numbers is used to represent the path. Based on the


conceptual framework of the learning hierarchy shown in Figure 2, there are 27 learning units, labeled 1a, 1b, 1c, 1d, 1e; 2a, 2b, 2c, 2d, 2e; 3a, 3b, 3c, 3d, 3e; 4b, 4d, 4e; 5e; 6a, 6b, 6c, 6d, 6e; 7c, 7d, 7e. For example, one route starting from unit 2b toward the ending unit 7e, with its corresponding mapping indexes, is represented as:

2b→2d→2e→3b→3d→4e→5e→6a→6b→6c→6d→6e→7c→7d→7e
7   9   10  12  14  18  19  20  21  22  23  24  25  26  27

Thus its chromosome is expressed as:

(101101010001111111111)

where '0' indicates a unit that is not learned and '1' indicates a unit that is learned.
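A minimal sketch of decoding such a chromosome and evaluating its path length against Eq1; the unit ordering follows the 27 labels listed above, while the function names and the distance dictionary are illustrative assumptions.

```python
# Units in learning-hierarchy order; unit number u corresponds to UNITS[u - 1]
UNITS = ["1a", "1b", "1c", "1d", "1e", "2a", "2b", "2c", "2d", "2e",
         "3a", "3b", "3c", "3d", "3e", "4b", "4d", "4e", "5e",
         "6a", "6b", "6c", "6d", "6e", "7c", "7d", "7e"]

def decode(chromosome, start_index):
    """Units whose bit is '1'; bit i maps to unit number start_index + i - 1."""
    return [UNITS[start_index + i - 2]
            for i, bit in enumerate(chromosome, start=1) if bit == "1"]

def path_length(path, distance):
    """Eq1: total length as the sum of adjusted distances d_ij along the path.
    'distance' is a hypothetical dict keyed by (from_unit, to_unit)."""
    return sum(distance[(a, b)] for a, b in zip(path, path[1:]))

# The chromosome from the text covers indexes 7..27 (starting unit 2b)
path = decode("101101010001111111111", start_index=7)
# path == ['2b','2d','2e','3b','3d','4e','5e','6a','6b','6c','6d','6e','7c','7d','7e']
```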

When a user has finished unit 2b, the learning distance dij is then modified. By

referencing both the knowledge base for adjusting the learning distances (shown in Table 1) and the unit correlationship mapping (shown in Table 2), the learning network diagram is restructured and the learning distances among the units are updated. Based on these rules guiding the relationship modification, changes in the links apply only to the units from 2b to 4e. Accordingly, the mapping indexes for nodes 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17 and 18 are affected. To further illustrate the GA's evolution details for this application, each step of the process is demonstrated below.

Population Generation

Chromosomes      Evaluation                   Fitness Values
(101101001111)   7,9,10,12,15,16,17,18        21
(101101110101)   7,9,10,12,13,14,16,18        27
(100111010101)   7,9,10,11,12,14,16,18        28
(110100100111)   7,8,10,13,16,17,18           28
(111111101101)   7,8,9,10,11,12,13,15,16      24
(101101011011)   7,9,10,12,14,15,17,18        21
…                …                            …
Average                                       24.8

The fitness values are the overall distance from the initial node to the end node. These values are calculated according to the predefined distances among the nodes


and the newly modified distances.

Selection

To perform the search evolution over generations, the best-performing chromosomes are retained in the long run. In this example, the chromosomes with a distance shorter than the average (i.e., 24.8) are selected. For instance, some of these results are shown below:

Chromosomes      Evaluation                   Fitness Values
(101101001111)   7,9,10,12,15,16,17,18        21
(111111101101)   7,8,9,10,11,12,13,15,16      24
(101101011011)   7,9,10,12,14,15,17,18        21
…                …                            …

Mating

To exchange information between population members, i.e., chromosomes, several methods, such as single-point switching, a specified cut point or uniform crossover, can be used to cross gene values over to a partner, resulting in new and diverse "offspring". In this example a commonly used approach, the cut-point method, is used to illustrate the process.

Chromosomes        Evaluation                       Fitness Values
(101101|001111)    7,9,10,12,15,16,17,18            21
(101101|101101)    7,9,10,12,13,15,16,18            22
(111111|011011)    7,8,9,10,11,12,14,15,17,18       23
(101101|011011)    7,9,10,12,14,15,17,18            21
…                  …                                …
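A minimal sketch of the cut-point step; the parent chromosomes are taken from the selection table above, and the cut after the sixth bit matches the illustration, though the exact pairing is an assumption.

```python
def cut_point_crossover(parent1, parent2, cut):
    """Swap the segments after the cut point between two bit-string chromosomes."""
    return parent1[:cut] + parent2[cut:], parent2[:cut] + parent1[cut:]

# Parents drawn from the selected chromosomes above, cut after the sixth bit
child1, child2 = cut_point_crossover("101101001111", "111111101101", cut=6)
# child1 == "101101101101"  (this offspring appears in the mating table above)
# child2 == "111111001111"
```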

4.2 The GA Control Parameters:

Many parameters need to be defined when developing a GA computer program, all of which can affect its search performance. The key parameters are the population size, crossover rate and mutation rate. Currently, a theory that can concisely guide the assignment of these values has rarely been seen (Srinivas, 1994). The following values were initially used in this research.

population size: 30; crossover rate: 1.0; mutation rate: 0.05

This system was developed especially for grade one students in learning


the basic arithmetic operations. As depicted in Figure 7, a student can initiate the system by pointing to the right-hand keypad with the mouse at any stage of the operations. It is recommended that a student begin with the stage nearest to her/his existing level. On the left-hand side of the front end, the system constantly displays the student's up-to-present learning performance status as well as the current location on the learning hierarchy map. Simultaneously, the internal planning agent calculates the shortest learning path for the student. For illustration purposes the shortest learning path calculation is demonstrated in Figure 8. The shortest path is determined according to the on-line input information consisting of the starting node, the ending node and the user's current knowledge level. However, other information that is not displayed on this screen, such as the newly adjusted learning distances among the nodes, is also required during the calculation.

Figure 7. The Front End of the Adaptive Computer-Aided Learning System

Figure 8. The GA's Calculation Process for the Shortest Learning Path

5. Discussion

Although there are many methods, such as the Greedy method and heuristic-based search, that can be used to look for the shortest path, this research adopted the GA to


develop the shortest learning navigation plan out of concern for the following aspects:

5.1 Ease of Design

In contrast to the Greedy method, which usually selects the locally optimum solution, and the heuristic approach, which determines the next optimum based on subjective experience, intuition or predefined rules, the GA determines the optimum solution by globally considering all candidate solutions in a more efficient way (Thomas, 1989). Thus, it is relatively easy for the GA to determine the optimum solution. Most traditional views on optimum learning paths select the next nearest contents that are close to the learner's knowledge level. Although the student would likely be more comfortable with such a system, long learning periods and less efficient ways of learning may result. On the other hand, a system with an adaptive learning function gives superior students a higher sense of achievement by presenting them with more challenging contents. For less skilled students, instead of providing the next level of difficulty, the system tries to reduce the sense of learning frustration by varying the contents within a similar degree of difficulty. That is, the system considers the individual's learning performance and constantly adjusts the most suitable learning routes to fit the student's current capabilities from the viewpoint of global learning efficiency.

5.2 Learning Performance

In order to shorten the overall learning period required to complete a learning objective, this system produces the most suitable navigation plan according to a user's present capability. For example, when a student starts from unit 8 and is identified with a UPL of 0.6, the system proposes a learning path of 8→9→10→11→12→13→14→15→16→17 (final unit), while a student with a UPL of 0.96 is given the path 8→10→12→13→15→17→18. This suggests that a CAL system can provide fewer units for more capable students and more units for less capable students. For students with varying learning performance, the system is able to arrange varying learning paths. Currently a prototype system is being constructed and is under evaluation by one local elementary school.

5.3 Computing Efficiency

Usually there are several types of learning structures, such as the linear structure, block structure and a mixture of both. In the above illustration the linear type was used. For a linear type problem, shown in Figure 9, with 15 units, the total number of navigation plans will add up to 32,768 according to the following formula.


$$\sum_{i=0}^{15} C_i^{15} = 2^{15} = 32768$$

start→○→○→○→○→○→○→○→○→○→○→○→○→○→end

Figure 9. Linear Type

In the block type problem, shown in Figure 10, with 15 units, navigation between units is quite flexible; that is, navigation can have a backward or forward orientation within a block, and learning every unit is not required for this type of problem to be mastered. According to Barker's Learning Hierarchy Theory, navigation between blocks is allowed in the forward direction only (Barker, 1979). Thus the total number of navigation plans is 34,328,125 based on the following formula, where l, m, n are the numbers of units in the row blocks and l = m = n = 5.

$$\left(\sum_{i=1}^{n} P_i^{n}\right)\left(\sum_{j=1}^{m} P_j^{m}\right)\left(\sum_{k=1}^{l} P_k^{l}\right) = \left(\sum_{i=1}^{n} C_i^{n}\, i!\right)\left(\sum_{j=1}^{m} C_j^{m}\, j!\right)\left(\sum_{k=1}^{l} C_k^{l}\, k!\right) = 34{,}328{,}125$$

Figure 10. The Block Type Problem

Without an appropriate, computationally efficient approach, handling such a problem requires searching for an optimum learning strategy in an enormous solution space, which is not possible within a short period of time. The GA is proposed as one alternative for solving this problem. Admittedly, the GA may not be fully ideal for problems in which a fast response time is critical; however, the computation can take place during the user's interaction with the system, when more time is available. The computation time can also be decreased through technical improvements in the GA's internal algorithms, such as seeding from the previous population (Ciesielski and Scerri, 1998).
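To sanity-check both counts, a short sketch reproduces the figures above (2^15 for the linear type and 325^3 for the block type with l = m = n = 5); variable names are illustrative.

```python
from math import comb, factorial

# Linear type: each of the 15 units is either included or skipped
linear_plans = sum(comb(15, i) for i in range(16))                 # 2**15 = 32768

# Block type: three row blocks of l = m = n = 5 units; within a block the
# visiting order matters, so each block contributes sum_k C(5, k) * k!
per_block = sum(comb(5, k) * factorial(k) for k in range(1, 6))    # 325
block_plans = per_block ** 3                                       # 34,328,125
```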


6. Conclusions

Individualized instruction is the ultimate goal of computer-aided learning systems. Many education specialists agree that learning effectiveness can be greatly improved if the instruction materials fit the individual student's capabilities. This research proposes a different concept, the global shortest learning path, in which the learning path is dynamically updated. A novel method was presented that provides more efficient learning path planning for CALSs. When the learning material is constructed according to the learning hierarchy concept, a student's learning navigation plan can be dynamically adjusted so that the student interacts with the most appropriate contents. This study also maintains that a learner's best navigation plan should be constructed on the basis of global learning objectives rather than on the next locally optimum learning unit. That is, a learner may enter a unit that is not best suited to his/her current status; however, this could be the best way to achieve shorter distances, and shorter periods, to the final goal. Based on this assumption, a prototype using the GA technique was developed to support instruction strategy planning. The GA assists with the path arrangement after each dialog event is completed. The proposed methodology was implemented and is able to dynamically monitor a student's learning performance. The observed dialog information can then be processed by the planning system to determine the optimum navigation plan. Currently the prototype system is being used to aid arithmetic operations instruction in local elementary schools, where field experiments will be conducted. Further improvements, including expanding the applications to other learning domains and dynamically adjusting the GA control parameters, are under investigation.

Intelligent Aircraft Maintenance Support System Using Genetic Algorithms and Case-Based Reasoning

Abstract

The maintenance of aircraft components is crucial for avoiding aircraft accidents and aviation fatalities. To provide reliable and effective maintenance support, it is important for airline companies to utilize previous repair experiences with the aid of advanced decision support technology. Case-Based Reasoning (CBR) is a machine learning method that adapts previous similar cases to solve current problems. For the effective retrieval of similar aircraft maintenance cases, this research proposes a CBR system to aid electronic ballast fault diagnoses of Boeing 747-400 airplanes. By employing the genetic algorithm (GA) to enhance the dynamic weighting as well as


the design of non-linear similarity functions, the proposed CBR system is able to achieve superior learning performance compared to systems with either equal/varied weights or linear similarity functions.

1. INTRODUCTION

Airplanes in operation throughout the world call for appropriate maintenance to assure flight safety and quality. When aircraft component faults emerge, fault diagnosis and troubleshooting actions must be executed promptly and effectively. An airplane consists of many electronic components, among which the electronic ballast is a common component controlling the cabin fluorescent lamps. The electronic ballast plays an important role in providing proper lighting for passengers and flight crews during a flight. Unstable cabin lighting, such as flashing and ON/OFF problems, is a common problem in airplanes. An airplane usually has hundreds of electronic ballasts mounted in panels such as the light deflector of a fluorescent lamp fixture. When an electronic ballast is abnormal, it has to be removed and sent to the accessory shop for further investigation.

The maintenance records of electronic ballasts generally contain information about the number of defective units found, the procedures taken, and the inspection or repair status. Basically these records are stored and used to assist mechanics in identifying faults and determining whether component repair or replacement is necessary, because previous similar solutions may provide valuable troubleshooting clues for new faults.

Working much like analogy, CBR is a machine learning method that adapts previous similar cases to solve current problems. CBR shows significant promise for improving the effectiveness of complex and unstructured decision making. It is a problem-solving technique that resembles the decision-making process used in many real-world applications. This study considers CBR an appropriate approach to aid aircraft mechanics in dealing with the electronic ballast maintenance problem. Basically, CBR systems make inferences by analogy to obtain similar experiences for solving problems. Similarity measurements between pairs of features play a central role in CBR (Kolodner, 1992). However, the design of an appropriate case-matching process in the retrieval step is still challenging. For the effective retrieval of previous similar cases, this research develops a CBR system with GA mechanisms used to enhance the dynamic feature weighting as well as the design of non-linear similarity functions. The GA is an optimization technique inspired by biological evolution (Holland, 1975). Based upon the natural evolution concept, the GA works by breeding a population of new answers from the old ones using a methodology based on survival of the fittest. In this research the GA is used to determine not only the fittest non-linear similarity functions, but also


the optimal feature weights. By using GA mechanisms to enhance the case retrieval process, a CBR system is

developed to aid electronic ballast fault diagnoses of Boeing 747-400 airplanes. Three hundred electronic ballast maintenance records of Boeing 747-400 airplanes were gathered from the accessory shop of one major airline in Taiwan. The results demonstrated that the approach with non-linear similarity functions and dynamic weights achieved better learning performance than the other approaches with either linear similarity functions or equal/varied weights.

2. LITERATURE REVIEW

2.1 Case-Based Reasoning

CBR is a relatively new method in artificial intelligence (AI). It is a general problem-solving method that takes advantage of the knowledge gained from experience and attempts to adapt previous similar solutions to solve a particular current problem. As shown in Figure 1, CBR can be conceptually described by a CBR cycle composed of several activities (Dhar & Strin, 1997). These activities include (A) retrieving similar cases from the case base, (B) matching the input and retrieved cases, (C) adapting solutions suggested by the retrieved similar cases to better fit the new problem, and (D) retaining the new solution once it has been confirmed or validated.

A CBR system gains an understanding of the problem by collecting and analyzing case feature values. In a CBR system, the retrieval of similar cases relies on a similarity metric which is used to compute the distance between pairs of case features. Generally, the performance of the similarity metric and the feature weights are keys to the CBR (Kim & Shin, 2000). A CBR system could be ineffective in retrieving similar cases if the case-matching mechanism is not appropriately designed.

For the aircraft maintenance problem, CBR is a promising approach for retrieving similar cases for diagnosing faults as well as providing appropriate repair solutions. Several studies have applied CBR to different airline industry problems. Richard (1997) developed CBR diagnostic software for aircraft maintenance. Magaldi (1994) proposed applying CBR to aircraft troubleshooting on the flight line. Other CBR applications include flight condition monitoring and fault diagnosis for aircraft engines (Vingerhoeds et al., 1995), service parts diagnosis for improving service productivity (Hiromitsu et al., 1994), and data mining for predicting aircraft component replacement (Sylvain et al., 1999).


Figure 1. A CBR Cycle (input cases, matching and retrieval of similar cases, and retention in the case base)

Most of these CBR systems applied an n-dimensional vector space to measure the similarity distance between input and retrieved cases. For example, Sylvain et al. (1999) adopted the nearest neighbor method. However, few studies have attempted to employ dynamic weighting with non-linear similarity functions to develop fault diagnosis models for aircraft maintenance.

2.2 Genetic Algorithms for Feature Weighting

In general, feature weights can be used to denote the relevance of case features to a particular problem. Wettschereck et al. (1997) made an empirical evaluation of feature-weighting methods and summarized that feature-weighting methods have a substantially higher learning rate than un-weighted k-nearest neighbor methods. Kohavi et al. (1995) observed that feature weighting methods have superior performance as compared to feature selection methods. When some features are irrelevant to the prediction task, Langley and Iba (1993) pointed out that appropriate feature weights can substantially increase the learning rate.

Several studies have applied the GA to determine the most suitable feature weights. The GA is a technique that models the genetic evolution and natural selection processes. A GA procedure usually consists of chromosomes in a population, a 'fitness' evaluation function, and the three basic genetic operators of 'reproduction', 'crossover' and 'mutation'. Initially, chromosomes in the form of binary strings are generated randomly as candidate solutions to the addressed problem. A fitness value associated with each chromosome is subsequently computed through the fitness function representing the goodness of the candidate solution. Chromosomes with higher fitness values are selected to generate better offspring for the new population through the genetic operators. Conceptually, the unfit are eliminated and the fit survive to contribute genetic material to subsequent generations.

Wilson and Martinez (1996) proposed a GA-based weighting approach which had better performance than the un-weighted k-nearest neighbor method. For large-scale feature selection, Siedlecki and Sklansky (1989) introduced a 0-1 weighting process based on GAs. Kelly and Davis (1991) proposed a GA-based weighted K-NN approach (GA-WK-NN) which had lower error rates than the standard K-NN approach. Brill et al. (1992) demonstrated fast feature selection using GAs for neural network classifiers.

Though the above works used GA mechanisms to determine the feature weights for case retrieval, few studies have applied the GA to simultaneously determine feature weights as well as the corresponding similarity functions in a


non-linear way. This paper attempts to apply GA mechanisms to determine both the optimal feature weights and the most appropriate non-linear similarity functions for case features. A CBR system is developed to diagnose the faulty accessories of electronic ballasts for the Boeing 747-400 airplanes.

3. METHODOLOGY

3.1 Linear Similarity

From the case base, a CBR system retrieves an old case that is similar to the input case. As shown in Figure 2, the retrieval process is based on comparing the similarities of all feature values between the retrieved case and the input case, where f_i^I and f_i^R are the values of feature i in the input and retrieved case, respectively. There are many evaluation functions for measuring the degree of similarity. One numerical function using the standard Euclidean distance metric is shown in the following formula (Eq.1), where W_i is the ith feature weight. The feature weights are usually statically assigned to a set of known fixed values, or all set equal to 1 if no priorities are determined.

Figure 2. Feature Values

Features:        f_1, f_2, f_3, …, f_i, …, f_n
Retrieved Case:  f_1^R, f_2^R, f_3^R, …, f_i^R, …, f_n^R
Input Case:      f_1^I, f_2^I, f_3^I, …, f_i^I, …, f_n^I

$$\sum_{i=1}^{n} W_i \times \left(f_i^I - f_i^R\right)^2 \qquad \text{(Eq.1)}$$

3.2 Non-Linear Similarity

Based on formula (Eq.1), this study proposes a non-linear similarity approach. The difference between the linear and non-linear similarity lies in the definition of the distance function. In the non-linear similarity approach, (f_i^I − f_i^R)^2 is replaced by the distance measurement [(f_i^I − f_i^R)^2]^k, as shown in formula (Eq.2).

$$\sum_{i=1}^{n} W_i \times \left[\left(f_i^I - f_i^R\right)^2\right]^{k} \qquad \text{(Eq.2)}$$

where k is the exponent applied to the squared difference between the corresponding input and retrieved feature values. A GA mechanism is proposed to compute the optimal k value for each case feature. The exponent k is taken from the set {1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5}. Figure 3 depicts the example equation y = x^k,


where x ∈ [0, 1] with various combinations of k.


Figure 3. The Illustration for Linear and Non-Linear Functions
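A minimal sketch of the two distance computations (Eq.1 and Eq.2); the feature values, weights and exponents in the usage line are purely illustrative, and the function names are assumptions rather than the system's actual code.

```python
def linear_distance(input_case, retrieved_case, weights):
    """Eq.1: weighted squared-difference distance between two feature vectors."""
    return sum(w * (fi - fr) ** 2
               for w, fi, fr in zip(weights, input_case, retrieved_case))

def nonlinear_distance(input_case, retrieved_case, weights, exponents):
    """Eq.2: each squared difference is raised to a per-feature exponent k."""
    return sum(w * ((fi - fr) ** 2) ** k
               for w, fi, fr, k in zip(weights, input_case, retrieved_case, exponents))

# Hypothetical normalized feature values and GA-tuned exponents
d = nonlinear_distance([0.8, 0.2, 1.0], [0.6, 0.1, 1.0],
                       weights=[1.0, 1.0, 1.0], exponents=[0.5, 2, 1])
```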

3.3 Static Feature Weighting

In addition to the linear or non-linear type of similarity function, the feature weights W_i can also influence the distance metric. Feature weighting can be either static or dynamic. The static weighting approach assigns fixed feature weights to all case features throughout the entire retrieval process; each feature's weight can be either identical or varied. The feature weights are usually statically assigned to a set of known fixed values, or set equal to 1 if no priorities are determined. For varied feature weighting, this study proposes another GA mechanism to determine the most appropriate weight for each feature.

3.4 Dynamic Feature Weighting

For the dynamic weighting approach, feature weights are determined according to the context of each input case. As shown in Figure 4, for a given input case there are m retrieved cases in the case base, where i = 1 to n, n is the total number of features in a case, and j = 1 to m, m is the total number of retrieved cases in the case base. f_ij^R is the ith feature value of the jth retrieved case, and f_i^I is the ith feature value of the input case. O_j^R is the outcome feature value of the jth retrieved case and O^I is the outcome feature value of the input case.

Figure 4. The Denotation of Feature and Outcome Feature Values

Assume that the outcome feature value is categorical data with p categories. For those features of categorical values, their weights are computed using the formula (Eq.3).

$$W_i = \max_{t} \left( \frac{L_{it}}{E_i} \right) \qquad \text{(Eq.3)}$$

where i = 1 to n, n is the number of case features in a case; t = 1 to p, p is the number of categories for the outcome feature. E_i is the number of retrieved cases whose f_ij^R is equal to f_i^I. L_it is the number of retrieved cases whose f_ij^R is equal to f_i^I and whose O_j^R is in the tth category. For features with continuous values, the weights cannot be generated in the same way unless the feature values are discretized in advance. Though there are various ways to discretize, this study proposes another GA mechanism to discretize the continuous feature values. For the ith feature, a GA procedure is used to compute an optimal value, say A_i, to form a range centered on f_i^I. Let K_i denote the number of cases whose f_ij^R lies between (f_i^I − A_i) and (f_i^I + A_i). Thus, E_i is replaced by K_i in formula (Eq.3), and the feature weights are computed as shown in formula (Eq.4).

$$W_i = \max_{t} \left( \frac{L_{it}}{K_i} \right) \qquad \text{(Eq.4)}$$
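A minimal sketch of the dynamic weight computation for a single feature, following Eq.3 for categorical features and the K_i substitution (Eq.4) for continuous ones; the function names, list-based case representation and GA-tuned half-width a_i are illustrative assumptions.

```python
def dynamic_weight_categorical(i, input_case, cases, outcomes, categories):
    """Eq.3: W_i = max_t(L_it / E_i) for categorical feature i."""
    matching = [j for j, case in enumerate(cases) if case[i] == input_case[i]]
    e_i = len(matching)
    if e_i == 0:
        return 0.0  # no retrieved case shares this feature value (assumption)
    return max(sum(1 for j in matching if outcomes[j] == t)
               for t in categories) / e_i

def dynamic_weight_continuous(i, input_case, cases, outcomes, categories, a_i):
    """Eq.4: E_i is replaced by K_i, the count of cases within f_i^I +/- A_i."""
    lo, hi = input_case[i] - a_i, input_case[i] + a_i
    matching = [j for j, case in enumerate(cases) if lo <= case[i] <= hi]
    k_i = len(matching)
    if k_i == 0:
        return 0.0
    return max(sum(1 for j in matching if outcomes[j] == t)
               for t in categories) / k_i

# Hypothetical case base: two features per case, outcome categories "C1"/"C2"
w = dynamic_weight_categorical(0, [1, 0.4], [[1, 0.5], [1, 0.3], [0, 0.9]],
                               ["C1", "C2", "C1"], ["C1", "C2"])   # -> 0.5
```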

Based on formulas (Eq.3) and (Eq.4), each input case has a corresponding set of feature weights in this dynamic weighting approach.

3.5 The Experiment Design

Since both the feature weights and similarity measurements between pairs of features play a vital role in case retrieval, this research investigated the CBR performance by observing the effects resulting from the combinations of different feature weighting approaches and similarity functions.

As indicated in Figure 5, there are six approaches that combine different types of similarity functions and feature weighting methods. These are the Linear Similarity


Function with Equal Weights (Approach A); Linear Similarity Function with Varied Weights (Approach B); Non-Linear Similarity Function with Equal Weights (Approach C); Non-Linear Similarity Function with Varied Weights (Approach D); Linear Similarity Function with Dynamic Weights (Approach E); and Non-Linear Similarity Function with Dynamic Weights (Approach F).

The differences between the three feature weighting approaches are described as follows. For the equal weights approach, feature weights are all set equal to 1. For the varied weights approach, there is only one set of feature weights determined by a proposed GA procedure. For the dynamic weights approach, there is a corresponding set of feature weights for each input case. That is, there are sets of feature weights dynamically determined according to the input case.


Figure 5. The Combinations of Similarity Functions and Feature Weighting Methods

4. THE EXPERIMENT AND RESULTS

4.1 Case Description

The aircraft electronic ballasts used to drive fluorescent lamps can be mounted on a panel such as the light deflector of a fluorescent lamp fixture. The fluorescent lamps initially require a high voltage to strike the lamp arc and then maintain a constant current. Usually there is a connector at one end of the unit for the routing of all switching and power connections. As shown in Figure 6, the electronic ballast operates from control lines of 115 VAC / 400 Hz aircraft power. When the operating power is supplied, the electronic ballast starts and operates two rapid-start fluorescent lamps or a single lamp in the passenger cabin of various commercial aircraft, such as the Boeing 747-400, 737-300, 737-400 and 747-500. There are two control lines connecting the ballast set and the control panel for the ON/OFF and BRIGHT/DIM modes, where the DIM mode is used at night when the cabin personnel want to decrease the level of ambient light in the cabin.


Figure 6. The Operational Setup for the Electronic Ballast (a control panel connected by control lines to the ballast set, which drives the fluorescent lamp set)

Three hundred electronic ballast maintenance records of Boeing 747-400 airplanes from the accessory shop of one major airline company in Taiwan were used to construct the trouble-shooting system. Each maintenance case contains seven features identified as highly related to abnormal electronic ballast operation. As shown in Table 1, these features are either continuous or categorical. The outcome feature is the category of the replaced parts set. For instance, category C1 denotes the replacement of a transformer (labeled T101 on the printed circuit board) and a capacitor (C307); category C2 denotes the replacement of an integrated circuit (U300), a transistor (Q301) and a fuse (F401). Each category in the outcome feature represents a different set of replaced parts.

4.2 The GA Implementation

According to the experiment design, this study implements three GA procedures to determine (1) the optimal exponent k in the non-linear similarity functions; (2) the most appropriate set of varied weights for static feature weighting; and (3) sets of feature weights for dynamic feature weighting. Several steps are required in developing a GA computer program: chromosome encoding, fitness function specification, and internal control parameter specification. The details of each step for each of the three GA applications, in order, are described below.

Table 1. The Case Description

Input Features                                                          Data Type     Range
Alternating Current on Bright Mode When Electronic Ballast Turns On     Continuous    0 to 2 (amp)
Alternating Current on DIM Mode When Electronic Ballast Turns On        Continuous    0 to 2 (amp)
Alternating Current on Bright Mode When Electronic Ballast Turns Off    Continuous    0 to 2 (amp)
Alternating Current on DIM Mode When Electronic Ballast Turns Off       Continuous    0 to 2 (amp)
Is Light Unstable When Electronic Ballast Turns On                      Categorical   0 and 1
Is It Not Illuminated When Electronic Ballast Turns On                  Categorical   0 and 1

Outcome Feature                                                         Data Type     Range
Components Replacement                                                  Categorical   C1, C2, …, C10



Non-Linear Similarity

Chromosomes are designed to encode the exponent k in the non-linear similarity functions. Because there are six features in a case, a chromosome is composed of six genes encoding the exponents of the six corresponding non-linear functions. Each chromosome is assigned a fitness value based on formula (Eq.5). The population size was set to 50; the selection method was roulette wheel; the probability of mutation was 0.06 and the probability of crossover was 0.5; the crossover method was uniform; and the entire learning process stopped after 10,000 generations.
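As an illustration of this gene design, a minimal sketch follows (the fitness formula, Eq.5, is given below). Representing each gene as an integer index into the candidate exponent set is a simplifying assumption; the paper's chromosomes are binary strings.

```python
import random

# The candidate exponents named in the text for the non-linear distance (Eq.2)
EXPONENTS = [1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5]

def random_chromosome(n_features=6):
    """One gene per feature; each gene indexes a candidate exponent."""
    return [random.randrange(len(EXPONENTS)) for _ in range(n_features)]

def decode_exponents(chromosome):
    """Map the genes back to the per-feature exponents k used in Eq.2."""
    return [EXPONENTS[g] for g in chromosome]

exponents = decode_exponents(random_chromosome())
```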

$$\text{Minimize} \quad \mathit{fitness} = \frac{\sum_{j=1}^{q} C_j}{q} \qquad \text{(Eq.5)}$$

where j = 1 to q, q is the number of training cases. C_j is set to 1 if the expected outcome feature is equal to the real outcome feature for the jth training case; otherwise, C_j is set to 0.

Varied Weights

Chromosomes are designed to encode a set of feature weights whose values range within [0, 1]. The fitness function is again defined by formula (Eq.5). As to the GA parameters, the population size was set to 50, the probability of mutation was 0.06, and the probability of crossover was 0.5. The entire learning process stopped after 10,000 generations.

Dynamic Weights

Chromosomes are designed to encode the values A_i that form a range centered on f_i^I for features with continuous data. The fitness value is again calculated by formula (Eq.5) for each chromosome in the population. As for the GA parameters, the mutation rate was 0.009, and the other settings were the same as those used for the varied weights.

4.3 The Results

The case base is divided into two data sets for training and testing with a ratio of 2:1. That is, 200 maintenance cases of Boeing 747-400 aircraft electronic ballasts are used for training and the remaining 100 cases for testing. The results are illustrated in Table 2. All approaches are evaluated with 3-fold cross-validation. The result of approach (F), with non-linear similarity functions and dynamic weights, is the best,


with a Mean Error (ME) equal to 0.193 for training and 0.180 for testing.

Table 2. The Mean Errors of Different Approaches

Approach                                                   Training ME   Testing ME
(A) Linear Similarity Function with Equal Weights          0.240         0.223
(B) Linear Similarity Function with Varied Weights         0.213         0.220
(C) Non-Linear Similarity Function with Equal Weights      0.207         0.210
(D) Non-Linear Similarity Function with Varied Weights     0.200         0.203
(E) Linear Similarity Function with Dynamic Weights        0.233         0.230
(F) Non-Linear Similarity Function with Dynamic Weights    0.193         0.180

Examining the results further, approach (A), with a linear similarity function and equal weights, has the poorest training result. There is no obvious difference among the testing results of approaches (A), (B) and (E), all of which adopt linear similarity functions. However, among the approaches adopting non-linear similarity functions, approach (F) with dynamic weights yields a better result than both approach (D) with varied weights and approach (C) with equal weights. It can be inferred that both non-linear similarity functions and the dynamic weighting process are crucial for a CBR system to effectively retrieve relevant previous cases.

5. CONCLUSIONS

Inefficient aircraft maintenance service may lead to flight delays, cancellations or even accidents. Aircraft maintenance is therefore one of the most important activities for airlines seeking to improve flight safety as well as worldwide competitive strength. To improve maintenance productivity, this research developed a CBR system with GA mechanisms to enhance the retrieval of similar aircraft electronic ballast maintenance cases. Three GA procedures are proposed to determine the optimal non-linear similarity functions, the varied feature weights, and the dynamic feature weights, respectively. The experimental results demonstrated that the approach adopting both non-linear similarity functions and dynamic weights achieves better performance than the approaches with either linear similarity functions or equal/varied weights.

In addition to the electronic ballast, there are numerous components embedded in an aircraft system. The proposed method could also be employed for them to achieve shorter repair times and lower maintenance costs. Aircraft preventive maintenance is also an important issue. In the future, it may be possible to embed such a trouble-shooting component into an aircraft preventive maintenance system based on the historical data


in flight data recorders (FDR) to help ensure a safer and more comfortable flight.

Applying Genetic Algorithms to Nested Case-Based Reasoning for the Optimum Information Systems Outsourcing Decision

Abstract

In recent years, much attention has been focused on information systems (IS) outsourcing by practitioners as well as academics. However, understanding of the factors that affect IS outsourcing success is still incomplete, and the causal relationships among these factors are not obvious. In a domain that lacks a strong domain theory and empirical results, forecasting IS outsourcing success is difficult. Case-Based Reasoning (CBR) is a machine reasoning method that adapts previous similar cases to infer further similarity; therefore, CBR can be very useful for solving complex and unstructured problems. This study aims to analogize the implications of IS attributes to the consequences of IS outsourcing practices. A nested CBR system is developed to forecast IS outsourcing success, with a Genetic Algorithm (GA) mechanism used to enhance the case-matching process. One hundred forty-six real IS outsourcing cases, each with 22 IS attributes, are gathered in the case base. The results indicate that the nested CBR approach is practically suitable and can support optimum IS outsourcing decisions.

1. Introduction

IS outsourcing is a growing phenomenon. Much attention has been given to this

issue. After the Kodak deal, numerous corporations outsourced all or part of their IS functions to external suppliers. The reasons for corporations outsourcing their IS functions usually include cost savings, access to expertise and new technologies, a decrease in IS professional recruitment, flexibility in managing IS resources and an increase in capital utilization [Jurison, 1995; Quinn & Hilmer, 1994; Sobol & Apte, 1995]. However, the associated shortcomings must also be carefully evaluated when managers pursue the expected benefits. There could be some disadvantages, for example, hidden costs, loss of strategic control, data security problems, etc. [McLellan & Marcolin, 1994; Collins & Millen, 1995]. In addition, there is also the possibility of weak vendor management skills, IS services, contracts and relationships with IS suppliers [Earl, 1996]. Lacity & Hirschheim [1995] already reported several successes as well as failures in their multicase study of IS outsourcing practices.

In an attempt to gain greater understanding of the factors affecting IS outsourcing success, Grover et al.[1996] investigated the relationship between the extent of IS outsourcing and IS outsourcing success. Their result showed that the relationship between the extent and success of outsourcing is likely to vary with different IS functions especially with system operations/ network management and maintenance. Lee[2001] also examined the relationship between knowledge sharing and IS outsourcing success. His findings indicate that the service receiver’s ability to absorb the needed knowledge has a significant effect. However, the understanding of the factors that affect IS outsourcing success is still incomplete. The causal relationships among these factors are also not obvious. In a domain that lacks a strong domain theory and empirical results, forecasting IS outsourcing success is difficult.

Different from the methodology for building research models to show positive and negative effects, CBR allows a computer program to propose solutions in domains that are not completely understood [Kolodner, 1992]. For forecasting IS outsourcing success, drawing an analogy to similar prior cases may be more persuasive than model-based arguments. Related to the analogy concept used in real


human reasoning, CBR is a machine reasoning problem solver that adapts previous similar cases to infer further similarity.

When developing a CBR system, a set of useful case features must first be determined to differentiate one case from the others. Furthermore, weights representing the importance of features must be assigned in the case-matching process. The weights are usually determined using subjective judgments or trial and error. To provide an alternative solution, this study used GA to enhance the case-matching process for dynamically determining a set of weights by learning the historical cases.

This study aims to analogize the implications of IS attributes to the consequences of IS outsourcing practices. Using a GA mechanism to enhance the case-matching process, a nested CBR system with a two-level weight design is developed to forecast IS outsourcing success. By investigating the IS outsourcing practices of large corporations in Taiwan, 146 real IS outsourcing cases, each with 22 IS attributes as case features, were gathered at an operational level. The results indicate that the GA-CBR approach is practically feasible and can support optimum IS outsourcing decisions. This approach also successfully demonstrated that insights can be revealed from the experiences of others in the form of similar IS outsourcing cases.

2. An Overview of CBR

CBR can be considered a five-step reasoning process, shown in Figure 1 [Bradley, 1994]. In the presentation stage, a description of the current problem is input to the CBR system. The system then retrieves the closest-matching cases stored in a case base and uses the current problem and the closest-matching cases to generate a solution to the current problem. The solution is later validated through feedback from the user or the environment. Finally, the validated solution is added to the case base for use in future problem solving, if appropriate.

Figure 1. The General CBR Process (Presentation → Retrieval → Adaptation → Validation → Update, supported by the Case Base)

In general, a CBR system consists of a database of previous cases and their results,

features for retrieving previous cases and storing new cases, a function or functions for measuring the degree of match, and methods for adapting recalled case solutions.


A case represents specific knowledge tied to a context and records knowledge at an operational level [Kolodner, 1992]. A CBR system first gains an understanding of the problem by collecting case feature values. A function or functions are used to compute the degree of match between the input case and the target case.

For each feature in the input case, a corresponding feature is found in the retrieved case; the two values are then compared and the degree of match is computed. A weight is usually assigned to each case feature representing the importance of that feature to the match. A nearest-neighbor matching function with weights inserted into the formula is shown in the following equation (Eq.1) [Kolodner, 1993]. Usually, the cases with the highest degrees of match are retrieved.

$$\frac{\sum_{i=1}^{n} W_i \times \mathrm{sim}\!\left(f_i^I, f_i^R\right)}{\sum_{i=1}^{n} W_i} \qquad \text{(Eq.1)}$$

where W_i is the weight of the ith feature, f_i^I is the value of the ith feature for the input case, f_i^R is the value of the ith feature for the retrieved case, and sim( ) is the similarity function for f_i^I and f_i^R. There are many evaluation functions for measuring the degree of match. Another

numerical function using the standard Euclidean distance metric is shown in the following formula (Eq.2).

$$\sum_{i=1}^{n} W_i \times \left(f_i^I - f_i^R\right)^2 \qquad \text{(Eq.2)}$$

where W_i is the weight of the ith feature, and f_i^I and f_i^R are the values of feature i in the input and retrieved case, respectively.

Several recent studies have applied CBR to different business domains, including business acquisitions [Pal & Palmer, 2000], transfer pricing [Curet & Elliott, 1997], marketing research [Ville, 1997] and predicting high-risk software components [Eman, Benlarbi, Goel & Rai, 2001]. It has also been applied to help desks, software cost control and software development processes [Kesh, 1995].

3. Feature Weighting Problem

In a case-based approach, representing a case with features tied to a context is an important issue. However, designing an appropriate case-matching mechanism in the retrieval stage is challenging. Several approaches have been presented to improve the effectiveness of case retrieval. These include the parallel approach [Kolodner, 1988], the goal-oriented model [Seifert, 1988], the decision tree induction approach [Quinlan, 1986; Utgoff, 1989], instance-based learning algorithms [Aha, 1992], the fuzzy logic method [Jeng & Lian, 1995], etc. These methods have been demonstrated to be effective in retrieving cases. However, most of these studies focused on similarity functions rather than on determining a set of optimum weights for the case features.

The search space for determining the most appropriate weight for each case feature is usually very large, because the search process must consider countless combinations of possible values for a set of weights. Therefore, traditional approaches such as heuristic or enumerative search methods lack efficiency due to the enormous computing time. The objective of the proposed GA-CBR approach is to efficiently determine a set of optimal weights in the case-matching process.

Though it is argued that the nearest-neighbor matching method is sensitive to the


similarity functions [Brieman et al., 1984], a mechanism for determining a set of optimal weights could also improve case retrieval effectiveness. Kohavi et al.[1995] observed that feature weighting methods have superior performance compared to feature selection methods.

The feature weights are usually statically assigned to a set of known fixed values, or all set equal to 1 if no priorities are determined. However, the quality of the retrieved solution cannot always be guaranteed if most weights are determined using human judgment. Feature weighting can be made a dynamic process by using the GA to determine the most appropriate weights. The GA procedure improves the search results by constantly examining various possible solutions through the reproduction, crossover and mutation operations. Kelly and Davis [1991] proposed a GA-based approach for the k-nearest neighbor method with a lower error rate than the standard method. This study focuses on applying the GA to a nested CBR system to support determining the most appropriate weights at two levels. Further details about this method are discussed in the following sections.

4. Two-level Weight Design

A two-level weight design is shown in Figure 2. There are several groups of features according to the specific domain knowledge tied to a context. In the first level, each feature has a weight representing the comparative importance of that feature to the match within its corresponding group. In the second level, each group also has a weight representing the aggregate importance of the group to the match between the input and retrieved cases. Here i = 1 to m, m is the total number of feature groups in a case, and Group_i is the ith feature group; j = 1 to n_i, where n_i is the number of features in the ith group; f_ij is the value of the jth feature in the ith group; w_ij is the weight of the jth feature in Group i; and capital W_i is the weight of the ith group.

A similar notation system is applied to the outcome features, as shown in Figure 3. Assume that the CBR system is required to perform diagnosis to achieve only one goal. Therefore, all of the outcome features describe the goal and belong to the same group. Here i = 1 to q, q is the total number of outcome features, and O_i is the ith outcome feature value.

Figure 2. Two-level Weight Design

Figure 3. Outcome Feature

5. GA-CBR Approach

The basic principle is that the higher the S_I,R, the more likely the retrieved case matches the input case. Therefore, the challenge is to determine the most appropriate values for the two weight levels. This study introduces a GA-CBR approach to determine the most appropriate set of weights.

The overall system framework is presented in Figure 4. The case base is divided into two data sets for training and testing with a ratio of 2:1. The system is composed of four major processes: similarity process, weighting process, adapting process, and evaluation process. A detailed description of the four basic processes follows.


Figure 4. The System Architecture

The similarity process computes the similarity for each feature group (S_Group_i) in the first level and the aggregate similarity (S_I,R) between the input and retrieved cases in the second level. As seen in Eq.3, for each feature group in the input case, the corresponding feature group is found in the retrieved case; their feature values are then compared and multiplied by the corresponding feature weights to compute the similarity for each feature group. As seen in Eq.4, the similarity of each feature group is multiplied by the corresponding group weight to determine the aggregate degree of similarity between the retrieved and input cases.

$$S_{Group_i} = \sum_{j=1}^{n_i} w_{ij} \times \left(f_{ij}^I - f_{ij}^R\right)^2 \qquad \text{(Eq.3)}$$

$$S_{I,R} = \frac{\sum_{i=1}^{m} W_i \times S_{Group_i}}{\sum_{i=1}^{m} W_i} \qquad \text{(Eq.4)}$$

where i = 1 to m, m is the total number of feature groups in a case; j = 1 to n_i, n_i is the number of features in the ith group; w_ij is the weight of the jth feature in Group i; f_ij^I and f_ij^R are the values of the jth feature in Group i for the input and retrieved cases, respectively; S_Group_i is the degree of similarity for the ith group; W_i is the weight of the ith group; and S_I,R is the aggregate degree of similarity between the input and retrieved cases.
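A minimal sketch of the two-level similarity computation (Eq.3 and Eq.4); the nested-list case representation and the values in the usage example are illustrative assumptions.

```python
def group_similarity(input_group, retrieved_group, feature_weights):
    """Eq.3: weighted squared-difference score within one feature group."""
    return sum(w * (fi - fr) ** 2
               for w, fi, fr in zip(feature_weights, input_group, retrieved_group))

def aggregate_similarity(input_case, retrieved_case, feature_weights, group_weights):
    """Eq.4: group scores combined with group weights, normalized by their sum."""
    scores = [group_similarity(ig, rg, fw)
              for ig, rg, fw in zip(input_case, retrieved_case, feature_weights)]
    return sum(W * s for W, s in zip(group_weights, scores)) / sum(group_weights)

# Hypothetical case with two feature groups
s = aggregate_similarity(
    input_case=[[0.8, 0.6], [0.4]], retrieved_case=[[0.7, 0.6], [0.5]],
    feature_weights=[[0.9, 0.5], [1.0]], group_weights=[0.7, 0.3])
```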

The values of the feature and group weights (w_ij and W_i) are generated by the weighting process. Assume that there are p training cases in the case set. For each specific input case from the training cases, there remain (p-1) cases that can be retrieved for comparison. The similarity process is executed (p-1) times and (p-1) values of S_I,R are produced. There are usually several retrieved cases inferred to be highly similar to the input case. The outcome features of these cases can be proposed as a solution, that is, as the expected outcome features for a given input case.

The adapting process uses the cases with the top 10% of S_I,R values to generate a solution for a given input case. If the number of cases with the top 10% of S_I,R values is k, each expected outcome feature O_i' is generated as shown in equation (Eq.5). The composite outcome feature O' is computed by averaging all of the expected outcome features, as shown in equation (Eq.6).

$$O_i' = \frac{\sum_{t=1}^{k} S_{I,R}^{t} \times O_i^{t}}{\sum_{t=1}^{k} S_{I,R}^{t}} \qquad \text{(Eq.5)}$$


$$O' = \frac{\sum_{i=1}^{q} O_i'}{q} \qquad \text{(Eq.6)}$$

In equation (Eq.5), t = 1 to k, k is the number of cases with the top 10% of S_I,R values; S_I,R^t is the S_I,R value of the tth such case; O_i^t is the ith outcome feature of the tth case with the top 10% of S_I,R values; and O_i' is the ith expected outcome feature. In equation (Eq.6), i = 1 to q, q is the number of outcome features, and O' is the composite outcome feature.
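A minimal sketch of the adapting process (Eq.5 and Eq.6), assuming the similarities to all retrieved cases and their outcome feature vectors are already available; the function name, the handling of the top-10% cut-off for small case sets, and the example data are illustrative.

```python
def adapt_solution(similarities, outcome_features, top_fraction=0.10):
    """Eq.5 / Eq.6: similarity-weighted average of the outcome features of the
    most similar retrieved cases, then averaged into one composite value."""
    ranked = sorted(zip(similarities, outcome_features), reverse=True)
    k = max(1, int(len(ranked) * top_fraction))   # cases in the top 10% of S_I,R
    top = ranked[:k]
    total_sim = sum(s for s, _ in top)
    q = len(top[0][1])                            # number of outcome features
    expected = [sum(s * o[i] for s, o in top) / total_sim
                for i in range(q)]                # Eq.5: O_i'
    return sum(expected) / q                      # Eq.6: composite O'

# Hypothetical similarities and 1-5 scale outcome feature vectors of retrieved cases
composite = adapt_solution([0.9, 0.8, 0.4], [[4, 5], [3, 4], [2, 2]])
```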

The evaluation process is applied to minimize the overall difference between the real composite outcome feature and the expected one. The evaluation function is expressed in equation (Eq.7).

Minimize
$$Y = \frac{\sum_{t=1}^{p} \left| O_t' - O_t \right|}{p} \qquad \text{(Eq.7)}$$

where t = 1 to p, p is the number of training cases. Ot' and Ot are the expected and real composite outcome features of the tth training case, respectively. Y is the Mean Absolute Error (MAE) used to evaluate the training stage result.
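The adapting and evaluation steps can be sketched as follows. This is a minimal illustration of Eq.5 through Eq.7 under our own data layout (a list of similarity values and a parallel list of outcome-feature vectors); the function names and toy numbers are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch of the adapting and evaluation steps (Eq.5 - Eq.7).
# similarities[t] plays the role of S_{I,R} for retrieved case t; outcomes[t]
# is its list of q outcome feature values.

def adapt(similarities, outcomes, top_fraction=0.10):
    # Keep the cases with the top 10% similarity values.
    k = max(1, int(len(similarities) * top_fraction))
    ranked = sorted(zip(similarities, outcomes), key=lambda p: p[0], reverse=True)[:k]
    q = len(ranked[0][1])
    denom = sum(s for s, _ in ranked)
    # Eq.5: similarity-weighted average of each outcome feature.
    expected = [sum(s * o[i] for s, o in ranked) / denom for i in range(q)]
    # Eq.6: composite outcome is the mean of the expected outcome features.
    return sum(expected) / q

def mae(expected_composites, real_composites):
    # Eq.7: mean absolute error over the p training cases.
    p = len(real_composites)
    return sum(abs(e - r) for e, r in zip(expected_composites, real_composites)) / p

# Toy usage with 10 retrieved cases and q = 2 outcome features each.
sims = [0.9, 0.85, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
outs = [[3.0, 4.0]] * 10
print(adapt(sims, outs), mae([3.5, 3.1], [3.4, 3.0]))
```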

6. The Experiment and Results

6.1 Case Description

To differentiate IS outsourcing from other kinds of service or product outsourcing, the characteristics of the outsourcing target, i.e., the IS attributes, are the required descriptors for the IS outsourcing situation. Therefore, this study proposes IS attributes as the features for the CBR system to perform diagnosis and achieve the situation's goal. As shown in Table 1, the cases cover 22 features divided into four groups of IS attributes. The situation's goal, IS outsourcing success, is a set of 8 outcome features belonging to one group. The alphanumeric marks in brackets are the same as those used in Figures 2 and 3.

IS outsourcing success. IS outsourcing success is the situation’s goal that describes the eight benefits of IS outsourcing. The benefit items are primarily adapted from Grover et al.[1996].

IS asset specificity. As one of the most important concepts in transaction cost theory, asset specificity refers to the degree to which an asset can be redeployed to alternative uses or users only with a loss of value. Once an asset-specific investment is made, it may be counterproductive for the vendor to behave opportunistically. In the specific context of IS outsourcing, specificity arises when firms tailor their IS assets for customized usage. The asset items are selectively adapted from Loh [1994], Aubert et al. [1996], and Ang and Cummings [1997].

Table 1. Case Descriptions

IS measurement problem. For most IS activities in a corporation, the IS measurement problem indicates the difficulty (or problems) of measuring the IS services and their output qualities. The higher the measurement problem, the more time, cost and human resources may be required for monitoring and managing an IS supplier. The items for the IS measurement problem are derived from Aubert et al. [1996].

IS strategic importance. IS strategic importance refers to the importance of using


IS for a corporation to achieve its strategic goals. The items are integrated from IT strategic disposition [Loh, 1994] and IS strategic impact measures [Nam et al., 1996].

IS capability. This refers to the IS knowledge and skills capacity possessed by the IS personnel in a corporation. The items are primarily adapted from the tacit knowledge measure [Nam et al., 1996].

6.2 The Case Collection

A cross-sectional questionnaire was developed for collecting IS outsourcing cases from a group of large-sized organizations in Taiwan based on the items in Table 1. Formal interviews with five IS managers were conducted to provide valuable ideas and insights into developing the questionnaire. Ten other selected IS managers were pretested with the entire questionnaire to ensure its face validity. The sample population consisted of 1729 large organizations in Taiwan, including the top 1000 firms in the manufacturing industry, the top 500 in the service industry, the top 100 in the financial industry, and 129 government institutions. The final questionnaires were distributed to IS managers at 576 organizations by systematically sampling one third of this population. Each IS manager was asked to rate on a scale of 1-5 his or her agreement with the questionnaire items. One hundred forty-six completed questionnaires were usable, yielding an effective response rate of 25.57% (146/571).

6.3 GA Control Parameters

The key GA parameters to be defined consisted of the population size, crossover rate, and mutation rate. A theory that can concisely guide the assignment of these values is rarely seen [Srinivas, 1994]. Initially, the following values were adopted in this research. The population size was set at 50, the crossover rate was 0.6, and the mutation rate was 0.05. The entire learning process stopped after 1000 generations.

6.4 Results

After the 1000th generation in the GA training process, the best approximate two-level weight values were generated. Once these derived values were applied to the case features, the GA-CBR system demonstrated more accurate results than the methods with equal weights. These results are illustrated in Table 2. All methods were tested with a 3-fold cross-validation method. The GA-CBR results were the best, with an MAE of 0.464 for training and 0.438 for testing.

Table 2. MAE for Different Methods

Method                                          Mean Absolute Error
                                                Training    Testing
One-level approach: wi = 1                      0.682       0.724
Two-level approach: wij = 1 & Wi = 1            0.661       0.688
Two-level approach: varied wij & Wi (GA-CBR)    0.464       0.438

7. Conclusion

The study applied a GA to nested CBR to forecast IS outsourcing success. For effective case retrieval, the GA was proposed to define the most appropriate weight values. A prototype GA-CBR system was developed for supporting the optimum IS outsourcing decision. Compared to the methods with equal weights, the proposed approach exhibited better learning and testing performance. The results demonstrate that there is potential to help corporations set IS outsourcing goals more easily in the future.


There is also potential to provide pre-warning of a possible failure or to point out unforeseen problems. However, there are also drawbacks in using this system. If there is a lack of sufficiently relevant cases, the system may not be able to recognize a new IS outsourcing case, and the adapted solution may therefore be inappropriate. In the future, it will also be possible to determine the inputs that can best approximate the expected target. This is a parameter design problem that could transform the target value into the corresponding input values. A future study could adopt another GA procedure to aid the search process and determine the optimum values for the adjustable features to approximate the target outcome features.

APPLYING GENETIC ALGORITHMS TO COVERING ALGORITHMS FOR DATA DISCRETIZATION: A CASE STUDY OF MODELING AIRPLANE LANDING GRAVITIES

Abstract

The covering algorithm is one of the often-used methods for rule induction. The main shortcoming of this algorithm is its inability to handle continuous data. This research proposes a novel method that integrates genetic algorithms with covering algorithms to support rule induction on both continuous and categorical data. We illustrate this method and demonstrate its effectiveness with data obtained directly from the flight data recorders of Boeing 747-400 airplanes. The results indicate that the hybrid of genetic algorithms and covering algorithms is feasible as a complete rule induction method.

1. INTRODUCTION

Rule induction is one of the most common forms of knowledge discovery. It is a technique for discovering a set of "IF / THEN" rules that can be used for classification or prediction. That is, rule induction is able to convert the system behavior into a rule-based representation that can be used either as a knowledge base for decision support or as an easily understood description of the system behavior. This method features the capability to search for all possible interesting patterns from data sets.

Rule induction methods may be categorized into either tree based or non-tree based methods. Quinlan (1993) introduced techniques to transform an induced decision tree into a set of production rules. Some of the often-mentioned decision tree induction methods include C4.5 (Quinlan, 1993), CART (Breiman et al., 1984) and GOTA (Hartmann et al., 1982) algorithms.

Michalski et al. (1986) proposed the AQ15 algorithms to generate a disjunctive set of classification rules. The CN2 rule induction algorithms also use a modified AQ algorithm that involves a top-down beam search procedure (Clark and Niblett, 1989). It adopts entropy as its search heuristic and is only able to generate an ordered list of rules. Clark and Boswell (1991) improved these algorithms by using the Laplace error estimate as a heuristic instead of entropy. The Basic Exclusion Algorithm (BEXA) is


another type of rule induction method proposed by Theron and Cloete (1996). It follows a general-to-specific search procedure in which disjunctive conjunctions are allowed. Every conjunction is evaluated using the Laplace error estimate. More recently, Witten and Frank (1999) described covering algorithms for discovering rule sets in a conjunctive form. These algorithms can only process nominal data and generate unordered rule sets. Most rule induction methods perform well for categorical data but poorly for continuous data. Basically, covering algorithms can only process categorical data. That is, the main shortcoming of covering algorithms is their inability to process continuous data for the independent variables as well as for the dependent variable. Several well-known methods have been used to solve this problem by segmenting continuous data into a finite number of classes. Francis et al. (1998) stated that continuous data can be linearly quantized into integer values. Fayyad and Irani (1993) presented a method for discretizing continuous data using the minimum description length principle to determine the appropriate granularity. Quinlan (1992) proposed the entropy-based C4.5, in which continuous values are split into two intervals. Skowron and Son (1995) proposed a method to discretize continuous data based on Rough Sets and Boolean reasoning. Little research has been conducted on treating continuous data and categorical data in a coherent manner using related rule induction methods, especially covering algorithms. The quantization of continuous variables for covering algorithms therefore becomes an issue in rule induction research.

In this paper, we propose a novel method that adopts genetic algorithms to preprocess continuous data into categorical data. The GA is a general-purpose optimization technique based on the principles of natural evolution used to find the optimal (or near-optimal) solution (Holland, 1975). The proposed method attempts to automatically segment continuous flight data and then employs covering algorithms to construct rule sets for predicting the landing gravity of Boeing 747-400 airplanes. The landing gravity is the vertical acceleration of the airplane at the moment of landing. Basically, it is one of the most important factors influencing landing safety. Therefore, the effective prediction of landing gravity becomes a crucial concern for the prevention of landing accidents.

2. THE LITERATURE REVIEW

2.1 The Discretization of Continuous Data

Discretization, usually called quantization, is the process of converting continuous data into categorical data. In some circumstances, discretization may become essential when treated as a preprocessing step before initiating a data learning process. Data in many real-world problems is continuous. It may be desirable to present this data as categorical data in tune with the required modeling format. In order to improve a


model's learning performance, discretization can help simplify the data representation, improve the interpretation of results, and make the data accessible to a greater number of data mining methods. The benefits may include shorter induction time, smaller induced tree or rule set sizes, and even improved predictive accuracy.

Discretization procedures may fall into the categories of supervised/unsupervised and global/local (Dougherty et al., 1995). Supervised discretization groups training examples into intervals, taking into account the respective classes of the training examples. Unsupervised discretization is class-blind in that it groups training examples into intervals without taking their classes into account. Catlett (1991) presented a supervised method for global discretization, which was improved by Fayyad and Irani (1993). Dougherty et al. (1995) compared several supervised and unsupervised methods and concluded that Fayyad and Irani's (1993) method was more effective than the others. Table 1 lists several methods according to whether they perform supervised or unsupervised quantization. These include equal-width interval, equal-frequency, maximum marginal entropy, the Naive algorithm, entropy-based discretization, the ChiMerge and Chi2 algorithms, and orthogonal hyperplanes.

Table 1. Supervised Discretization vs. Unsupervised Discretization

Supervised Discretization:
Naive algorithm (Cestnik, 1990)
Entropy-based discretization (Fayyad and Irani, 1993)
ChiMerge and Chi2 algorithms (Kerber, 1992)
Orthogonal Hyperplanes (Skowron and Son, 1995)

Unsupervised Discretization:
Equal-Width (Catlett, 1991)
Equal-Frequency (Wong and Chiu, 1987)
Maximum Marginal Entropy (Chmielewski and Grzymala-Busse, 1994)

Local discretization, as exemplified by C4.5 (Quinlan, 1993), performs locally on partial regions during induction. Global discretization (Chmielewski and Grzymala-Busse, 1994) performs globally through the entire dataset prior to proceeding with induction. Kohonen (1989) proposed a local discretization method to partition N-dimensional continuous data. Holte (1993) presented a global discretization method for the 1R algorithm that attempts to divide the domain of every continuous variable into bins. The ChiMerge system (Kerber 1992) is also a global discretization method that provides a statistically justified heuristic method for supervised discretization. Richeldi and Rossotto (1995) proposed the StatDisc method, using statistical tests as a means for determining discretization intervals. The D-2 (Catlett, 1991) uses entropy-based discretization and several conditions as criteria for stopping the recursive formation of partitions on each variable. Some of these


methods are listed in Table 2.

Table 2. Global Discretization vs. Local Discretization

Global Discretization:
D-2 (Catlett, 1991)
1R (Holte, 1993)
ChiMerge (Kerber, 1992)

Local Discretization:
C4.5 (Quinlan, 1993)
ID3 (Quinlan, 1992)

Weiss et al. (1990) proposed the predictive value maximization algorithm, a supervised discretization method that finds partition boundaries with locally maximal correct classification accuracy. Chan et al. (1991) proposed an adaptive quantizer method that combines supervised and unsupervised discretization. Chmielewski and Grzymala-Busse (1994) used a similar, cluster-based method to find candidate interval boundaries. Maass (1994) introduced a dynamic programming algorithm that finds the minimum training set error in partitioning a continuous variable. Pfahringer (1995) used entropy to select the split points and employed a Minimum Description Length (MDL) heuristic to determine the most suitable discretization.

Among these existing discretization methods, some belong to the supervised category and others to the unsupervised category. Few discretization methods adopt GAs for converting continuous data into categorical data in both the independent and dependent variables. A GA can be a suitable method for discretizing continuous variables due to its optimization capability. It can discretize the continuous variables using a supervised method for the independent variables and an unsupervised method for the dependent variable.

2.2 Optimization Methods

Optimization is the process of finding the best solution (or optimum) from a set of candidate solutions. It is usually used when the problem structure is complex or there are millions of possible solutions. Optimization is divided into two classes: global and local. Global optimization finds the best solution from the set of all candidate solutions; it finds the best solution regardless of the search starting point. Local optimization searches from a starting point through neighboring solutions, so the final solution depends heavily on the starting point. This strategy is usually called a Descent Algorithm or a steepest descent strategy.

In 1983, Kirkpatrick et al. proposed the Simulated Annealing (SA) algorithm to solve optimization problems. It is an alteration of the descent algorithms: it does not simply search for the best solution in the neighborhood of the current solution. Glover (1986) proposed Tabu Search (TS), a method comparable to SA. Similar views


were developed by Hansen (1986), who formulated the steepest ascent/mildest descent heuristic principle. The GA is an optimization mechanism that mimics the genetic evolution of species. The GA deals with a population of solutions rather than with single solutions as in SA or TS. The final solution is not derived from the neighborhood of a single solution, but from the neighborhood of a whole population of solutions.

The GA is a procedure for modeling genetic evolution and the natural selection process. The basic GA consists of several components, including a population of chromosomes in each generation, a 'fitness' evaluation unit, and genetic operators for 'reproduction', 'crossover' and 'mutation'. Initially, a set of number strings generated by a random number generator is treated as the candidate solutions for the optimization problem being addressed. Associated with each string is a fitness value that is a measure of the goodness of the candidate solution. The aim of the genetic operators is to transform the set of strings into sets with higher fitness values. In essence, the procedure selects highly fit individuals and their chromosomes at random to generate better offspring within the new population. The unfit are eliminated and the fittest survive to contribute genetic material to the subsequent generation.

The GA has been successfully applied to many optimization problems. Adeli and Cheng (1993) presented single-objective optimization. Fonseca and Fleming (1993) proposed multi-objective optimization without constraints. However, little research has applied the GA to optimize continuous data discretization and to treat continuous-valued variables and categorical variables in a coherent manner within the framework of covering algorithms. This research attempts to exploit the GA's optimization strength to discretize continuous data and then employ covering algorithms for rule generation.

3. THE PROPOSED METHOD

As shown in Figure 1, there could be many potential research orientations from various combinations of discretization, optimization, and rule induction methods. Covering algorithms can mainly be applied to problems with categorical variables. Continuous variables must be transformed into categorical variables in order to be processed by covering algorithms. In this study, the basic genetic algorithm is used to aid discretization. Variable discretization using the GA covers both the independent variables and the dependent variable. For the dependent variable, the GA discretizes the continuous data using unsupervised discretization. This is conducted by determining the most suitable discriminating points for the dependent


variable.

Figure 1. A 3-Dimension Representation of Potential Research Orientations (axes: discretization methods, local/global and supervised/unsupervised; optimization methods, SA, TS and GA; rule induction methods, CN2, AQR, BEXA, RULES, CA, C4.5 Rules, Rep, IREP and RIPPERk)

For the independent variables, the GA discretizes the continuous data using supervised discretization. This proceeds by determining the most suitable discriminating points according to the value of the dependent variable. After these procedures have been completed, a covering algorithm is applied to the transformed data to produce a set of rules. A detailed illustration is depicted in Figure 2.

Figure 2. The Rule Induction for Continuous Data vs. Categorical Data using Covering Algorithms (continuous data is discretized by GAs and combined with the categorical data in the examples set; the covering algorithms then generate the rules set)

3.1 Discretization by Genetic Algorithms


To process continuous data, a dynamic discriminating approach is proposed in this study. As shown in Figure 3, each gene value in the chromosome represents a discriminating value for a particular continuous variable. For each chromosome evolution, a set of discriminating values is generated. Each discriminating value is mapped to a value of the real continuous variable.

Figure 3. Data Discretization in a Chromosome (chromosome before sorting: 0.6 0 0 0 1.1 2.9 0 0 6.8 3.4 9.2 0 10 0 0; chromosome after sorting: 10 9.2 6.8 3.4 2.9 1.1 0.6 followed by zeros; the non-zero gene values map to the real values 56, 61, 79, 84, 118, 142 and 150 between the minimum value 50 and the maximum value 150)
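As an illustration of Figure 3, the sketch below decodes a chromosome into cut points, assuming the linear mapping from a 0-10 gene scale onto the variable's observed [minimum, maximum] range that the figure's values suggest; zero genes are treated as unused, so the number of intervals is also left to the GA. The function name and scale parameter are our assumptions.

```python
# Sketch of decoding a chromosome into discretization cut points (Figure 3).
# Assumes gene values lie on a 0-10 scale mapped linearly onto the observed
# [minimum, maximum] range of the continuous variable; zero genes are treated
# as unused, so the interval count is decided by the GA as well.

def decode_cut_points(chromosome, v_min, v_max, gene_scale=10.0):
    genes = sorted((g for g in chromosome if g > 0), reverse=True)
    return [round(v_min + g / gene_scale * (v_max - v_min), 1) for g in genes]

chromosome = [0.6, 0, 0, 0, 1.1, 2.9, 0, 0, 6.8, 3.4, 9.2, 0, 10, 0, 0]
print(decode_cut_points(chromosome, v_min=50, v_max=150))
# -> [150.0, 142.0, 118.0, 84.0, 79.0, 61.0, 56.0]
```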

The data discretization for the independent variables is based on supervised discretization. The best discriminating points are determined according to the value of the dependent variable. During this process, the value of the dependent variable has to remain unchanged. Therefore, the dependent variable with continuous values has to be processed first for data discretization. The independent variables with continuous values are then discretized accordingly.

In the case of data discretization for the dependent variable, the fitness function of the unsupervised discretization is evaluated by minimizing the total standard deviation over the split groups. For instance, as shown in Table 3, assume that there are twelve examples in the training data set. This data set is sorted based on the dependent variable Y. In the first trial, three subgroups "A", "B" and "C" are determined. Subgroup A contains four examples, two with a value of 13 and two with a value of 14, so its standard deviation is 0.5. This procedure is applied to the remaining subgroups; the standard deviations computed for subgroups B and C are 1 and 0, respectively. Thus the total standard deviation for the first trial is 1.5. After several consecutive trials, the discriminating points for the dependent variable with the minimum total standard deviation are determined.


Table 3. Unsupervised Discretization for the Dependent Variable

Dependent Variable Y    Discriminating Subgroups    Mapping Value
13                      A                           13.5
13                      A                           13.5
14                      A                           13.5
14                      A                           13.5
20                      B                           21
20                      B                           21
22                      B                           21
22                      B                           21
33                      C                           33
33                      C                           33
33                      C                           33
33                      C                           33
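A minimal sketch of this unsupervised fitness computation follows; the split representation (indices into the sorted list) and the function name are assumptions made for illustration, and the toy data reproduce the Table 3 example.

```python
# Sketch of the unsupervised fitness for the dependent variable (Table 3):
# total (population) standard deviation of the subgroups defined by a set of
# split indices.
import statistics

def total_subgroup_std(sorted_y, split_indices):
    """Sum of subgroup standard deviations for one candidate split."""
    bounds = [0] + sorted(split_indices) + [len(sorted_y)]
    total = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        group = sorted_y[lo:hi]
        total += statistics.pstdev(group) if len(group) > 1 else 0.0
    return total

y = [13, 13, 14, 14, 20, 20, 22, 22, 33, 33, 33, 33]  # sorted dependent variable
print(total_subgroup_std(y, [4, 8]))  # subgroups A, B, C -> 0.5 + 1.0 + 0.0 = 1.5
```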

For the data discretization of an independent variable, the fitness of an individual split group must be evaluated at each discretization iteration. The entire process adaptively decreases the resulting classification error for each group. After completing the discretization of all continuous variables, the transformed data are learned by the covering algorithms. For instance, as shown in Table 4, assume that there are twelve examples in the training data set. These data are sorted based on a given independent variable Xi. In the first trial, three subgroups "A", "B" and "C" are determined for variable Xi. Subgroup A contains four examples, of which one is classed as "Low" and the rest are classed as "High", so the classification error rate for this subgroup is 1/4. This procedure is applied to the remaining subgroups; the classification error rates for subgroups B and C are 2/4 and 1/4, respectively. Thus the total classification error rate for the first trial is 1 (1/4 + 2/4 + 1/4). After several consecutive trials, the discriminating points for the independent variable with the minimum total classification error rate are determined.

Table 4. Supervised Discretization for a Given Independent Variable (Xi)

Independent Variables                 Dependent     Discriminating
X1    X2    ...    Xi    ...    Xn    Variable Y    Subgroups
…     …     …      1     …      …     Low           A
…     …     …      2     …      …     High          A
…     …     …      2     …      …     High          A
…     …     …      4     …      …     High          A
…     …     …      5     …      …     Low           B
…     …     …      8     …      …     Middle        B
…     …     …      8     …      …     Middle        B
…     …     …      12    …      …     High          B
…     …     …      17    …      …     Low           C
…     …     …      19    …      …     Low           C
…     …     …      23    …      …     Low           C
…     …     …      24    …      …     Middle        C
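The corresponding supervised fitness can be sketched in the same style; the split representation is again an illustrative assumption, and the toy data reproduce the Table 4 example.

```python
# Sketch of the supervised fitness for an independent variable (Table 4):
# total misclassification rate when each subgroup is labeled with its
# majority class.
from collections import Counter

def total_classification_error(classes_sorted_by_x, split_indices):
    bounds = [0] + sorted(split_indices) + [len(classes_sorted_by_x)]
    total = 0.0
    for lo, hi in zip(bounds, bounds[1:]):
        group = classes_sorted_by_x[lo:hi]
        majority = Counter(group).most_common(1)[0][1]
        total += (len(group) - majority) / len(group)
    return total

classes = ["Low", "High", "High", "High",
           "Low", "Middle", "Middle", "High",
           "Low", "Low", "Low", "Middle"]   # classes sorted by Xi
print(total_classification_error(classes, [4, 8]))  # 1/4 + 2/4 + 1/4 = 1.0
```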

3.2 Covering Algorithms

Covering algorithms operate by adding variables to the rule under construction, striving to create a rule that achieves maximum classification accuracy. The basic concept is to include as many instances of the desired class as possible and exclude as many instances of the other classes as possible. Suppose that the rule Ri covers a total of Ci instances, of which Pi are positive examples of the class and Ci - Pi belong to other classes; that is, the latter are the negative or wrong conclusions inferred from the rule. A new rule is then chosen for its optimum Pi/Ci ratio. The fitness function T of a given rule for the training sets is shown in Formula (1).

$$T = \text{Maximize}\left( \frac{P_{ij}}{C_i} \right) \qquad (1)$$

where i = 1 to m, m is the total number of rules derived from all independent variables, and j = 1 to the total number of classes for the dependent variable.

In the learning phase, the rule with the maximum classification accuracy is adopted. The examples that are covered by the selected rule are removed from the training set. These procedures are executed repeatedly on the remaining training data. However, in certain circumstances, a test example may not match any rule in the entire rule set. To resolve this problem, one possible way is to assign the most frequently occurring target class as an alternative solution.
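One covering iteration under Formula (1) might look like the following sketch; the single-condition rule representation and the toy data are our simplification of the general conjunctive rules the algorithm builds, not the paper's implementation.

```python
# Sketch of one covering step: among candidate single-condition rules, pick the
# one with the best P/C ratio (Formula 1), then drop the examples it covers.

def best_rule(examples, classes):
    """examples: list of feature tuples; classes: parallel list of labels."""
    candidates = {(i, v, c) for x, c in zip(examples, classes)
                  for i, v in enumerate(x)}
    def ratio(rule):
        i, v, c = rule
        covered = [cls for x, cls in zip(examples, classes) if x[i] == v]
        return len([cls for cls in covered if cls == c]) / len(covered)
    return max(candidates, key=ratio)

def remove_covered(examples, classes, rule):
    i, v, _ = rule
    kept = [(x, c) for x, c in zip(examples, classes) if x[i] != v]
    return [x for x, _ in kept], [c for _, c in kept]

X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
y = ["no", "no", "yes", "yes"]
rule = best_rule(X, y)           # e.g. (0, 'sunny', 'no') with P/C = 1.0
X, y = remove_covered(X, y, rule)
print(rule, X, y)
```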

When a test example is inferred with multiple classifications, the Laplace accuracy is adopted for choosing the best rule as the result. The Laplace accuracy allows a rule's estimated classification performance to converge toward the "random guess" value of 1/Cn (where Cn is the number of classes) as the evidence decreases; thus rules with higher Laplace accuracy are favored. Therefore, a test example that receives multiple classifications adopts the classification result from the rule with the highest Laplace accuracy. The Laplace accuracy is defined in Formula (2).

$$L(N_w, N_c) = \frac{N_c + 1}{N_w + C_n} \qquad (2)$$

where L is the Laplace accuracy of a rule, Cn is the number of classes, Nw is the total number of examples covered by the rule, and Nc is the number of examples correctly classified by the rule. The Laplace accuracy of each rule over the examples it covers is measured in the learning phase, so each rule is given a Laplace accuracy value.
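A direct transcription of Formula (2), with argument names of our own choosing, is shown below.

```python
# Sketch of the Laplace accuracy in Formula (2): L = (Nc + 1) / (Nw + Cn).
# When a test example matches several rules, the rule with the highest
# Laplace accuracy decides its class.

def laplace_accuracy(n_covered, n_correct, n_classes):
    return (n_correct + 1) / (n_covered + n_classes)

# A rule covering 10 examples, 8 of them correctly, in a 3-class problem:
print(laplace_accuracy(n_covered=10, n_correct=8, n_classes=3))  # 9/13, about 0.692
```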

Ordinarily, covering algorithms induce rules in which "=" is the only operator allowed to appear in either the "IF" or "THEN" part. In our method, more operators are allowed in the rules: "=", ">", "< & >", and "<". For instance, a typical rule set containing multiple operators is shown in Figure 4.

Figure 4. Rules Set with Multiple Operators

IF Outlook = Cloudy THEN Temperature = 32

IF Humidity > 92 THEN Temperature = 34

IF Humidity < 63 AND Humidity > 55 THEN Temperature = 24

IF Humidity < 43 THEN Temperature = 16

……………………...

ELSE Temperature = 26
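A small sketch of how such a multi-operator rule set could be evaluated is given below; the rule encoding and the interval convention for "< & >" are illustrative assumptions, not the paper's internal representation.

```python
# Sketch of evaluating a rule set that allows the operators "=", ">", "<",
# and the interval form "< & >" (Figure 4).

def matches(condition, record):
    feature, op, value = condition
    x = record[feature]
    if op == "=":   return x == value
    if op == ">":   return x > value
    if op == "<":   return x < value
    if op == "<&>": return value[1] < x < value[0]   # value = (upper, lower)
    raise ValueError(op)

def classify(rules, record, default):
    for conditions, conclusion in rules:
        if all(matches(c, record) for c in conditions):
            return conclusion
    return default

rules = [
    ([("Outlook", "=", "Cloudy")], 32),
    ([("Humidity", ">", 92)], 34),
    ([("Humidity", "<&>", (63, 55))], 24),   # 55 < Humidity < 63
    ([("Humidity", "<", 43)], 16),
]
print(classify(rules, {"Outlook": "Sunny", "Humidity": 58}, default=26))  # -> 24
```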

4. THE EXPERIMENTS

Most flight accidents occurring worldwide are due to an inappropriate approach in the landing phase. A study commissioned by the U.K. Civil Aviation Authority for the Flight Safety Foundation examined in detail 287 fatal approach and landing accidents worldwide. Among these findings, 75 percent of these accidents were due to inappropriate alert prediction mechanisms (Ashford, 1998). Flight safety problems are thus very important, yet little work has been done on landing gravity prediction. A total of 100 records from several Boeing 747-400 aircraft flight data recorders (FDRs) from one specific airline in Taiwan were used to construct the landing gravity model. Each FDR record contained seven variables identified as highly related to landing gravity: pitch, roll, airspeed, weight, fuel flow, acceleration and landing gravity at the moment of landing. All of the values for these variables were of the continuous data type. As depicted in Figure 5, most data were around a gravity of 1.2. As for the GA control parameters, 50 organisms in the population were used. The crossover rate ranged from 0.5-0.7 and the mutation rate ranged from 0.06-0.1 for the initial settings. The entire learning process stopped after 20,000 trials. Since the amount of collected data was not large, the model construction and evaluation were based on a 10-fold cross-validation. Linear regression was


adopted for comparison with our method.

Figure 5. The Data Distribution of Landing Gravity (percentage of records by landing gravity value, from 1 to 3)

Conventionally, no training error is reported for rule induction using covering algorithms, so the average training error for the 10-fold cross-validation is not available in our experimental results. According to Table 5, the GA method with covering algorithms (GA-CA) is a non-tree-based rule induction method. Our GA-CA method has a higher average elapsed time than linear regression. The average error for GA-CA is defined as the mean absolute error between the existing and the forecasted landing gravities. It can be seen that the average error of GA-CA is lower than that of linear regression. It is evident that the discretization of the dependent variable in GA-CA improved the model performance.

Table 5. The Comparison Results for the GA-CA vs. Linear Regression Methods

Evaluation Items               GA-CA    Linear Regression
Average Elapsed Time (min.)    0.14     0.03
Average Error                  0.31     0.33

To further explore GA-CA's modeling capability in predicting categorical data, the

landing gravity data was segmented into two and three category types according to the safety criteria denoted by experts in Table 6. For the two-category type of landing gravity, the value 2 was the cut-off point between group ‘A’ and group ‘B’. In the three-category type of landing gravity, a value of 2 was the cut-off point between group ‘A’ and group ‘B’. A value of 3 was the cut-off point between group ‘B’ and group ‘C’.

Table 6. The Range of Categories

Categories      Range       Groups
2-Categories    <2          A
                >2          B
3-Categories    <2          A
                >2 & <3     B
                >3          C
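For completeness, a tiny sketch of mapping a landing gravity value onto the categories of Table 6 follows; how the exact boundary values 2 and 3 are assigned is not stated in the table, so the boundary handling here is an assumption.

```python
# Sketch of the category mapping in Table 6 (boundary handling assumed).

def gravity_category(g, three_way=True):
    if g < 2:
        return "A"
    if three_way and g > 3:
        return "C"
    return "B"

print([gravity_category(g) for g in (1.3, 2.4, 3.2)])   # -> ['A', 'B', 'C']
```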


Both the two-category and three-category types of flight data were randomly divided into 10 folds. These flight data were learned by C5.0, Quest (Loh and Shih, 1997) and our proposed method. The experiment results are shown in Table 7. The average errors for these methods are defined as the number of overall misclassified cases divided by the total number of cases between the existing and the forecasted categories of landing gravities. Among the three methods, GA-CA had a higher average elapsed time than the other methods. It can be seen that the average errors for the three methods are close to one another for the two-category type of data. It is evident that the results from GA-CA for the three-category type of flight data are marginally better than those of the C5.0 and Quest methods, though the elapsed time is higher.

Table 7. Results of Categorical Landing Gravity

Evaluation Items                              Methods
                                              GA-CA    C5.0     Quest
Average Elapsed Time (min.)   2-Categories    0.040    0.006    0.003
                              3-Categories    0.131    0.006    0.003
Average Rule Number           2-Categories    2.8      3.4      4
                              3-Categories    7        4        4
Average Errors                2-Categories    0        0        0
                              3-Categories    0        0.01     0.04

The contents of the average errors in Table 5 and Table 7 are different. The average error in Table 5 was based on the continuous dependent variable, the landing gravity. The average errors in Table 7, however, were based on the categorical dependent variable that was segmented according to the safety criteria listed in Table 6; they denote the number of misclassified records divided by the entire number of records. Also, as shown in Figure 5 and Table 6, about 90 percent of the flight data was segmented into group A, that is, into one large bin. It can be seen that the overall average errors in Table 7 are lower than the overall average error in Table 5.

5. DISCUSSION AND CONCLUSIONS

Discretization is essential for machine learning algorithms that cannot directly handle continuous data. Discretization may also be useful in accelerating induction processes and producing simpler and more effective models, even though discretization is not a must for certain types of learning algorithms. This paper presented a way in which continuous and categorical flight data variables can be treated cohesively with covering algorithms by using GAs. The empirical results show that the discretization


using GAs with covering algorithms is an effective approach for both the supervised discretization of the independent variables and the unsupervised discretization of the dependent variable. The local discrimination of hybrid GAs with covering algorithms is a novel method for quantizing continuous flight data variables into categorical variables.

Most unsupervised discretization methods rely heavily upon expert experiences or human intuitions to decide how many discrimination intervals are needed. In our proposed method, GAs can simultaneously determine both the best discriminating points and interval number for continuous variables.

The performance of the proposed method marginally outperforms some of the competing methods. Rather than demonstrating the modeling performance of our method, this research focuses on how the existing capabilities of covering algorithms can be expanded by using genetic algorithms to address their insufficiency in data discretization.

Most landing gravity data in FDRs are normal flight data; high-value landing gravity data are scarce. Yet this minority of high landing gravity cases has greater importance than the normal flight data. To overcome this issue, an over-sampling approach could be one way to balance the minority data shortfall in training the models. Essentially, it is believed that a collection of more accident data is required to better reveal landing insights, though such data are rarely collected. Our future research direction is to embed GAs with covering algorithms into flight simulators based on FDR data collected from incident cases to improve flight safety around the world.

A Case-Based Customer Classification Approach for Direct Marketing

Abstract

Case-based reasoning (CBR) shows significant promise for improving the effectiveness of complex and unstructured decision making. CBR is both a paradigm for computer-based problem-solvers and a model of human cognition. However, the design of appropriate case retrieval mechanisms is still challenging. This paper presents a genetic algorithm (GA)-based approach to enhance the case-matching process. A prototype GA-CBR system used to predict customer purchasing behavior is developed and tested with real cases provided by the Taiwan branch of a worldwide insurance direct marketing company. The results demonstrate better prediction accuracy over the results from the regression-based CBR system. Also, an optimization mechanism is integrated into the classification system to reveal the customers who are most likely and most unlikely

135

to purchase insurance.

Keywords: Direct marketing, case-based reasoning, genetic algorithms, customer classification

1. Introduction

Case-based reasoning (CBR) shows significant promise for improving the effectiveness of complex and unstructured decision making. It is a problem-solving technique similar to the decision making process used in many real world applications. CBR is both a paradigm for computer-based problem-solvers and a model of human cognition. The reasoning mechanism in a CBR system is based on the synergy of various case features. This method therefore differs from a rule-based system because of its inductive nature; that is, CBR systems reason using analogy rather than the pure decision trees (or IF-THEN rules) usually adopted in rule-based systems.

Basically, the CBR core steps are (1) retrieving past cases that resemble the current problem; (2) adapting past solutions to the current situation; (3) applying these adapted solutions and evaluating the results; and (4) updating the case base. CBR systems make inferences using analogy to obtain similar experiences for solving problems. Similarity measurements between pairs of features play a central role in CBR (Kolodner, 1992). However, the design of an appropriate case-matching process in the retrieval step is still challenging. Some CBR systems represent cases using features and employ a similarity function to measure the similarities between new and prior cases (Shin & Han, 1999). Several approaches have been presented to improve the case retrieval effectiveness. These include the parallel approach (Kolodner, 1988), the goal-oriented model (Seifert, 1988), decision tree induction approaches (Quinlan, 1986; Utgoff, 1989), the domain semantics approach (Pazzani & Silverstein, 1991), instance-based learning algorithms (Aha, 1992), fuzzy logic methods (Jeng & Lian, 1995), etc. These methods have been demonstrated to be effective in retrieval processes. However, most of these works focused on the similarity function aspect rather than on synergizing the matching results from individual case features. In essence, when developing a CBR system, determining useful case features that are able to differentiate one case from others must be resolved first. Furthermore, the weighting values used to determine the relevance of each selected feature have to be assigned before proceeding with the case matching process. Rather than being precisely or optimally constructed, the weighting values are usually determined using subjective judgment or on a trial-and-error basis. To provide an alternative solution, this article presents a GA-based approach to automatically construct the


weights by learning the historical data. A prototype CBR system used to predict which customers are most likely to buy life insurance products is developed. The data provided by a worldwide insurance direct marketing subsidiary in Taiwan was used for constructing this model. The results show that the GA-based design of the CBR system generates more accurate and consistent decisions than the regression-based CBR system.

2. An Overview of CBR

Analogy, one way of human reasoning, is the inference that a certain resemblance implies further similarity. CBR is a similar form of machine reasoning that adapts previous similar cases to solve new problems. It can be considered a five-step reasoning process, as shown in Figure 1 (Bradley, 1994).

Figure 1. The General CBR Process (Presentation, Retrieval, Adaptation, Validation and Update, all drawing on the Case Base)

Presentation: a description of the current problem is input into the system.
Retrieval: the system retrieves the closest-matching cases stored in a case base (i.e., a database of cases).
Adaptation: the system uses the current problem and closest-matching cases to generate a solution to the current problem.
Validation: the solution is validated through feedback from the user or the environment.
Update: if appropriate, the validated solution is added to the case base for use in future problem solving.

Case retrieval searches the case base to select existing cases sharing significant features with the new case. Through the retrieval step, similar cases that are potentially useful to the current problem are retrieved from the case base. That is, previous experience can be recalled or adapted for the solution(s) to the current problem, and mistakes made previously can be avoided. The degree of similarity between the input case and the target case can usually be calculated using various similarity functions, among which nearest-neighbor matching is one of the most frequently used methods.

Nearest-Neighbor Matching

Nearest-neighbor matching is a quite direct method that uses a numerical function to compute the degree of similarity. Usually, cases with higher degrees of similarity are retrieved. A typical numerical function (Eq1) is shown in the following formula (Kolodner, 1993).

$$\text{Similarity} = \frac{\sum_{i=1}^{n} W_i \times sim\left(f_i^{I}, f_i^{R}\right)}{\sum_{i=1}^{n} W_i} \qquad \text{(Eq1)}$$

where Wi is the weight of the ith feature, fi^I is the value of the ith feature for the input case, fi^R is the value of the ith feature for the retrieved case, and sim() is the similarity function for fi^I and fi^R.
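A minimal sketch of Eq1 follows; the numeric similarity function and the toy feature values are illustrative assumptions, since the formula leaves sim() application-specific.

```python
# Sketch of the nearest-neighbor matching function in (Eq1): a weighted,
# normalized sum of per-feature similarities.

def nn_similarity(weights, input_case, retrieved_case, sim):
    num = sum(w * sim(fi, fr)
              for w, fi, fr in zip(weights, input_case, retrieved_case))
    return num / sum(weights)

numeric_sim = lambda a, b: 1 - abs(a - b)      # assumes features scaled to [0, 1]
print(nn_similarity([0.5, 0.3, 0.2],
                    [0.2, 0.8, 0.5],
                    [0.3, 0.6, 0.5],
                    numeric_sim))
```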

The implicit meaning of nearest-neighbor matching is that each feature of a case is a dimension in the search space. A new case can be added into the same space according to its feature values and the relative importance of the features. The nearest neighbors identified can then be presented as similar cases. However, the matching process can only be executed with weighting values known in advance as well as clearly defined similarity functions. Most of the time, weighting values are determined using human judgement, and thereby the retrieved solution(s) cannot always be guaranteed. Though Breiman et al. argued that nearest neighbor algorithms are sensitive to the similarity functions (Breiman et al., 1984), the additional effects from weighting synergy could leverage the potential


uncertainty. Wettschereck et al. (1997) surveyed feature weighting methods and concluded that they have a substantially higher learning rate than the plain k-nearest neighbor method. Kohavi et al. (1995) described evidence that feature weighting methods lead to superior performance compared to feature selection methods for tasks where some features are useful but less important than others. Though Kelly and Davis (1991) proposed a GA-based, weighted K-NN approach to attain lower error rates than the standard K-NN approach, few other studies have focused on the non-linear feature value distance relationship between an old case and an input case. To overcome this shortcoming in the traditional case retrieval process, this study presents a GA approach to support the determination of the most appropriate weighting values for each case feature.

3. The Genetic Algorithm Approach

The GA is an optimization technique inspired by biological evolution (Holland, 1975). Its procedure can improve the search results by constantly trying various possible solutions with the reproduction operations and mixing the elements of the superior solutions. In contrast to traditional mathematical optimization methods that search for solutions blindly, the GA works by breeding a population of new answers from the old ones using a methodology based on survival of the fittest. Based upon the natural evolution concept, the GA is computationally simple and powerful in its search for improvement, and it is able to converge rapidly by continuously identifying solutions that are globally optimal within a large search space. By using the random selection mechanism, the GA has been proven to be theoretically robust and empirically applicable for searching in complex spaces (Goldberg, 1989).

To determine a set of optimum weighting values, the search space is usually quite huge. This is because the search process must consider countless combinations of possible weighting values for each feature against all of the cases stored in the case base. Therefore, traditional approaches such as heuristic or enumerative search methods lack efficiency due to the enormous computing time.

To solve a problem, the GA randomly generates a set of solutions for the first generation. Each solution is called a chromosome that is usually in the form of a binary string. According to a fitness function, a fitness value is assigned to each solution. The fitness values of these initial solutions may be poor. However, the fitness values will rise as better solutions survive in the next generation. A new generation is produced through the following three basic operations.


(1) Reproduction - Solutions with higher fitness values will be reproduced with a higher probability. Solutions with lower fitness values will be eliminated with a higher probability.
(2) Crossover - Crossover is applied to each random mating pair of solutions. For example, consider solutions S1 and S2 (Figure 2a). By randomly choosing the location for the separator (shown as the symbol | in Figure 2b), a simple crossover can be applied

to yield S'1 and S'2 as new offspring (Figure 2c).

S1 = 0 1 1 0 1 0 1 1 S2 = 0 0 1 0 1 1 0 0

Figure 2a

S1 = 0 1 1 0 1 | 0 1 1   S2 = 0 0 1 0 1 | 1 0 0

Figure 2b

S'1= 0 1 1 0 1 1 0 0

S'2= 0 0 1 0 1 0 1 1

Figure 2c
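The crossover in Figures 2a-2c corresponds to the following minimal sketch; the crossover point used below (after the fifth bit) is inferred from the offspring strings in Figure 2c.

```python
# Minimal sketch of the single-point crossover shown in Figures 2a-2c.

def crossover(s1, s2, point):
    return s1[:point] + s2[point:], s2[:point] + s1[point:]

S1 = "01101011"
S2 = "00101100"
c1, c2 = crossover(S1, S2, point=5)
print(c1, c2)   # -> 01101100 00101011, matching S'1 and S'2 in Figure 2c
```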

(3) Mutation - With a very small mutation rate Pm, a mutation occurs to arbitrarily change a solution that may result in a much higher fitness value.

The Fitness Function

The objective of the proposed GA approach is to determine a set of weighting values that can best formalize the match between the input case and the previously stored cases. The GA is used to search for the best set of weighting values that are able to promote the association consistency among the cases. The fitness value in this study is defined as the number of old cases whose solutions match the input case(s) solution, i.e., the training case(s). In order to obtain the fitness value, many procedures have to be executed beforehand. Figure 3 presents the overall system framework. The details of the system processes are illustrated in the system architecture as follows.

The System Architecture

The system is composed of three major processes. The case base includes both training cases and old cases. The Similarity Process computes the similarity between an input training case and an old case. The similarity value (named the Overall Similarity Degree, OSD) is derived by summing each degree of similarity resulting from


comparing each pair of corresponding case features out of the selected training case and old case. The OSD is expressed as the following equation (Eq2).

$$OSD = \sum_{i=1}^{n} W_i \times S_{i,j,k}^{\,e_i} \qquad \text{(Eq2)}$$

where i = 1 to n, n is the total number of features in a case. Wi is the weighting value assigned to the ith feature; this value is generated from the Weighting Process by the GA. ei represents the power of Si,j,k, which is the degree of similarity for the ith feature between the training case j and the old case k (j = 1 to p; k = 1 to q). p is the total number of training cases, while q is the total number of old cases. Si,j,k is used as an index describing the similarity level of a case feature of one training case against that of one old case. It can be expressed as the following equation (Eq3).

$$S_{i,j,k} = 1 - \left[\, \left| Feature_i^{\,Case\,j} - Feature_i^{\,Case\,k} \right| / Range_i \,\right] \qquad \text{(Eq3)}$$

where Rangei is the longest distance between two extreme values for the ith feature.
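A minimal sketch of Eq2 and Eq3 follows; the feature set, weights, powers, and ranges used here are illustrative assumptions (the actual learned weights and powers appear later in Table 3), and the function names are ours.

```python
# Sketch of the case-matching similarity in (Eq2) and (Eq3): each feature
# similarity S_ijk = 1 - |difference| / range is raised to a learned power e_i,
# weighted by W_i, and summed into the Overall Similarity Degree (OSD).

def feature_similarity(value_j, value_k, value_range):
    """Eq3 for one numeric feature of training case j and old case k."""
    return 1 - abs(value_j - value_k) / value_range

def osd(training_case, old_case, weights, powers, ranges):
    """Eq2: weighted sum of powered feature similarities."""
    total = 0.0
    for i, feature in enumerate(ranges):
        s = feature_similarity(training_case[feature], old_case[feature],
                               ranges[feature])
        total += weights[i] * s ** powers[i]
    return total

ranges = {"Age": 70.0, "Area Saving Rate": 27.0}
training = {"Age": 40, "Area Saving Rate": 20}
old = {"Age": 52, "Area Saving Rate": 26}
print(osd(training, old, weights=[0.07, 0.72], powers=[1, 3], ranges=ranges))
```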

Figure 3. The System Architecture (the case base supplies training cases and old cases; the Similarity Process computes Si,j,k between case j and case k, the Weighting Process supplies the weights Wi, and the Evaluation Process accepts or rejects them until the end of training)

The Evaluation Process

As mentioned above, the OSD is the key determinant for assessing the similarity between the input case and the old case. For each Similarity Process batch executed with a specific input case Casej, q OSDs are produced, since q old cases have been compared with the input Casej. The basic notion is that the higher the


OSD, the more likely the retrieved old case matches the input case. Therefore the challenge becomes how to determine the most appropriate case(s). The purpose for introducing the GA is to determine the most appropriate set of weighting values that can direct a more effective search for higher OSDs to match the input case.

Usually there exist several old cases that are inferred to be similar (either exactly or nearly identical) to the input case. In other words, the solution (i.e., the outcome feature) of each such old case could be proposed as a solution for a certain training case. To determine whose outcome feature should be adopted as the outcome feature for the input case, this research proposes that the majority outcome feature among the old cases with the top 10% OSDs be used to represent the final solution for each batch of the Similarity Process executed for a given training Casej. The derived final expected outcome feature is denoted as O'j, as opposed to the real outcome feature Oj for a given training Case j.

For a complete run of the Similarity Process, p expected outcome features will be determined, one for each training case. Hence the Weighting Process is applied to minimize the overall difference between the original real outcome features and the expected outcome features. In other words, the more often O'j equals Oj, the higher the probability that the appropriate weighting values have been produced. According to the illustration in Table 1, the evaluation function is expressed as the following equation (Eq4).

Maximize
$$Y = \sum_{j=1}^{p} Y_j \qquad \text{(Eq4)}$$

where p -- the total number of training cases;

Yj -- the matched result between the expected outcome and the real outcome, if O’j = Oj then Yj is 1; otherwise Yj is 0.

Table 1. The Illustration of the Evaluation Function

Training Cases    Expected Outcome (O'j)    Real Outcome (Oj)    Matched (Yj)
Case 1            Yes                       Yes                  1
Case 2            No                        Yes                  0
...               ...                       ...                  ...
Case j            No                        No                   1
...               ...                       ...                  ...
Case p            ...                       ...                  ...
Total Evaluation Function: Y = Y1 + Y2 + ... + Yp


4. The Experiment and Results

Data Description

Customer classification is an important issue in real-world marketing. It is believed that the more understanding the corporation has about its customer behavior patterns, the greater the chance that more effective marketing strategies can be developed. This research adopts the GA-CBR method to classify potential customers into either purchasing or non-purchasing categories. Data on real-world insurance customers were collected by the direct marketing department and separated into learning and testing groups for model construction. In order to develop a model able to effectively differentiate purchasing customers from non-purchasing customers, all possible factors such as customer demographics and other supporting information were collected. Supporting information from experienced domain experts and from professional reports was collected to support the feature selection process (Reynolds & Wells, 1977; Kenneth & David, 1987). These factors were further investigated using statistical factor analysis to reveal the most influential factors that would have a substantial impact upon the outcome.

The substantial factors derived as case features include Gender, Marital Status, Number of Child, Age, Profession, Zip Code, and Area Saving Rate. The data model and a detailed explanation of these features are described in Table 2. Four hundred and forty customer profiles, evenly mixed with purchasing and non-purchasing cases, were learned by multiple logistic regression and the GA-CBR. For the GA-CBR, 400 cases were used for model construction and 40 cases for testing.

GA Control Parameters

The key parameters, consisting of the population size, crossover rate, and mutation rate, needed to be defined first when developing the GA computer programs. A theory that can concisely guide the assignment of these values is rarely seen (Srinivas, 1994). Initially, the following values were adopted in this research.

Table 2. The Description of Case Features

Feature (I)                                Data Type    Content                             Range
Gender                                     Character    F: Female; M: Male                  *
Marital Status                             Character    Y: Married; N: Single; U: Unknown   *
Number of Child                            Integer      Range: [1..8]                       8.0
Age                                        Integer      Range: [1..70]                      70.0
Profession                                 List         Range: [1..10]                      *
Zip Code                                   Integer      Three-digit zip code                *
Area Saving Rate                           Integer      Range: [1..27]                      27
Purchasing Potential (predicted outcome)   Character    Y: Yes; N: No

'*' Indicates that the range is not required. These features were matched using heuristic rules.

The Developed Model and Forecasting Results

After the 1000th generation in the GA training process, the best approximate weighting values and powers for the similarity values are shown in Table 3.

Table 3. The Learning Results of Weighting Values and Powers

Weight    Value    Power    Value
w1        .01      e1       2
w2        .01      e2       1
w3        .01      e3       3
w4        .07      e4       1
w5        .72      e5       3
w6        .17      e6       2
w7        .01      e7       2

Once these derived values were applied to the case features, the GA-CBR system produced an accuracy, or "Best-Fit", of 77% on the targets (purchasing outcomes) over the training data, while the regression model produced an accuracy of 50%. The GA-CBR demonstrated an accuracy of 65% with the test data; the regression model accuracy was 45%. These results are illustrated in Figure 4.

Exploring the Customers with the Most Potential

Basically, prediction models are used to map the inputs to determine the outcome(s). However, when the model is complex, it is not possible to easily figure out the appropriate inputs that can best approximate the expected target. This situation applies to GA-CBR as well.

Figure 4. Learning Accuracy vs. Testing Accuracy for GA-CBR and Regression Model

Table 4. The Most Likely and Unlikely Purchasing Customer Types

                           Type    Gender    Marital Status    Number of Child    Age    Profession    Zip Code    Area Saving Rate
Most Likely Customers      A       F         Y                 1                  40     3             540         27
                           B       M         N                 4                  64     7             540         27
                           C       M         Y                 4                  52     2             570         26
Most Unlikely Customers    D       F         Y                 3                  58     2             120         19
                           E       M         N                 4                  60     2             120         19
                           F       F         N                 4                  55     6             650         23

In order to further explore those customer types that are most likely and most unlikely to purchase the insurance products, an additional procedure is required to aid the search process. This research adopted another GA computer program to determine the three customer types most likely and the three most unlikely to purchase insurance. These customer types are presented in Table 4. Such information provides high strategic value for subsequent campaign management.

5. Discussion and Conclusion

Defining appropriate feature weighting values is a crucial issue for effective case retrieval. This paper proposed a GA-based approach to determine the fittest weighting values for improving case identification accuracy. Compared to the regression model, the proposed method has better learning and testing performance. In this study the proposed GA-based CBR system is employed to classify potential customers in insurance direct marketing. The results show significant promise for mining customer purchasing insights that are complex, unstructured, and mixed with qualitative and quantitative information. By using the GA's rapid search strength, this system is able to determine the optimum customer characteristics, that is, the profiles of the customers most likely and most unlikely to buy the insurance products. This system has not only demonstrated better prediction performance but also the ability to make the model understandable. While traditional approaches may provide many similar capabilities, other types of business data can be intensively investigated and tested to confirm the GA-CBR strength in modeling classification problems. Because the similarity functions may influence the case association process, future research may work on different combinations of similarity functions between case features to examine their retrieval effectiveness.

The Effect of Transaction Risks on Contract Management in Information Systems Outsourcing

C. Hsu T. Wang


Dept. of Management Information Systems, Chung Yuan Christian University

Chungli, Taiwan 320 [email protected]

Dept. of Information Management

National Central University

Chungli, Taiwan 320

Abstract

Accumulated evidence appears to indicate that information systems (IS) outsourcing entails a significant amount of risk. Having examined IS outsourcing risks, corporations are able to adopt appropriate actions for reducing these risks. A research model is proposed for the risk effect in IS outsourcing based on the transaction cost perspective. The risk-related variables are identified from the relevant literature. By investigating the IS outsourcing practices of large businesses in Taiwan, the effect of transaction risks on contract management is then examined. A total of 139 survey questionnaires from the manufacturing, service, and financial industries are analyzed using the structural equation modeling technique. The empirical results indicate that the effectiveness of contract management is not risk-free and is influenced by asset specificity, loss of control, and operations risk.

Keywords: information systems outsourcing; risk management; contract management; structural equation model; transaction cost theory.

* Please send all correspondence to the first author.

Introduction

Information technology is intended to produce not only cost savings and efficiency gains but also external market results such as increased market share, profitability, and customer satisfaction. To assure competitive survival and success, effective information management is needed to offer more efficient, faster, and higher quality information services. However, faced with these increased performance requirements, managers at some firms have found that their internal IS unit is not able to provide proper information services. Thus, corporations may consider outsourcing all or part of their IS functions to external suppliers.

The advantages of IS outsourcing usually include cost savings, access to expertise and new technologies, a decrease in IS professional recruitment, flexibility in managing IS resources, and an increase in capital utilization (Tate, 1992; Quinn & Hilmer, 1994; Jurison, 1995; Sobol & Apte, 1995). When managers pursue the expected benefits of IS outsourcing, the associated risks must also be carefully evaluated. In practice, the management of risks is a central issue in the management decisions for any venture. It is possible to gain some insight by considering the various types and degree of risks inherent in IS outsourcing. Having examined IS outsourcing risks, corporations are able to adopt appropriate actions for reducing these risks. The potential benefits of IS outsourcing can be further secured by understanding the impact of risks upon IS outsourcing. In an attempt to gain greater understanding of the risk effect in IS outsourcing, this research investigated the outsourcing practices of large businesses in Taiwan.

Many MIS studies have employed correlational analysis to examine the relationships between research variables (e.g., see Premkumar & King, 1992; Raymond, 1985, 1990; Thong & Yap, 1995). However, compared with recent advances in structural equation modeling methodology, correlational analysis is relatively less useful in its ability to examine the construct validity of measures, measurement errors in variables, and causal relationships between latent variables.


This research also attempted to develop a structural equation model and demonstrate how the use of EQS allows the examination of the measurement properties of constructs and the substantive relationships between latent variables.

Using the structural equation modeling technique, this research explored the phenomenon of IS outsourcing in which the links between transaction risks and contract management are constructed. In the following section, the general background concerning IS outsourcing risks is introduced. Next we propose an IS outsourcing model and describe the research method, including the development of the research hypotheses, measures for the variables, and data collection. Finally, the results of the data analysis are presented and discussed, followed by concluding remarks.

Outsourcing Risks

Accumulated evidence from the literature appears to indicate that IS outsourcing entails a significant amount of risk (Earl, 1996; Quinn & Hilmer, 1994; McLellan & Marcolin, 1994). Lacity & Hirschheim (1995) also reported several successes as well as failures in their multicase study of IS outsourcing practices. To set the stage for this study, Table 1 provides a description of the eleven outsourcing risks identified by Earl (1996).

In addition to the outsourcing risks listed in Table 1, there are additional risks cited in the literature. For example, Harriss, Giunipero & Hult(1998) noted long-term contract inflexibility as one outsourcing risk and proposed that firms need to consider both their internal organizational stability and contract flexibility when developing outsourcing contracts in the IS/IT areas. Quinn & Hilmer(1994) observed that three strategic risks must be considered: (1) loss of critical skills or developing the wrong skills; (2) loss of cross-functional skills; and (3) loss of control over a supplier.

Most of the outsourcing risks discussed above were derived from observations without the support of theory. However, a non-theory-based risk is literally any issue about which a doubt exists in some context. From the perspective of transaction cost theory, this research suggests that the concept of transaction risks applies well to the context of IS outsourcing, since outsourcing is indeed a transaction between buyers and vendors in the marketplace.

Clemons, Reddi & Row (1993) proposed the concept of transaction risks. They decomposed transaction risks into opportunism risk and operations risk. Operations risk is defined as "the risk that the other parties in the transaction willfully misrepresent or withhold information, or underperform—that is, 'shirk'—their agreed-upon responsibilities." Opportunism risk is defined as "the risks associated with a lack of bargaining power or the loss of bargaining power directly resulting from the execution of a relationship." Based on Clemons et al. (1993), this research considered transaction risks in terms of these two types of risk, and proposed that asset specificity, loss of control, and operations risk have major impacts on the outcomes realized from contract management in IS outsourcing.

Table 1: Eleven Risks of Outsourcing IT (Earl, 1996)

Research Model

This section develops tentative hypotheses that address the relationships between the transaction risks and the contract management of IS outsourcing. Figure 1 summarizes the research model and the specific propositions related to contract management in IS outsourcing.

Opportunism Risk


Opportunism is one of the two critical behavioral assumptions in transaction cost theory (Williamson, 1975). This assumption holds that human agents are given to opportunism, which is a condition of self-interest seeking with guile. One source of opportunism risk that has been examined extensively in the transaction cost literature is asset specificity (Ang & Beath, 1993; Loh, 1994). Another source of opportunism risk less considered in the economics literature, but widely recognized in the management literature, is loss of control (Kelly, 1990; Quinn & Hilmer, 1994; McLellan & Marcolin, 1994). Asset specificity and loss of control are thus identified as important sources of opportunism risk (Clemons et al., 1993).

Asset specificity is one of the most important concepts in transaction cost theory. It refers to the degree to which an asset cannot be redeployed to alternative uses or users without loss of value. In the specific context of IS outsourcing, specificity arises when firms tailor their overall architecture (e.g., hardware, software, and communication architectures or platforms), operating procedures, IT knowledge/experience base, and IS staff training for customized usage (Loh, 1994). Both the user and the vendor may need to make asset-specific investments. Once an asset-specific investment is made, it may be counterproductive for the vendor to behave opportunistically. Consequently, the higher the degree of asset specificity (referring here to the asset-specific investment made by the vendor), the more likely it will be for the user to effectively manage the contract.

H1: Asset specificity inherent in the structuring of a user-vendor relationship is positively related to the effectiveness of contract management.

Loss of control is one potential problem facing firms after they have jumped onto the IS outsourcing bandwagon. IS outsourcing may result in the loss of control over data security, operations, strategic flexibility, disaster recovery, and future direction (Kelly, 1990; Martinsons, 1993). Because of these potential losses of control, firms may expose themselves to the threat of a vendor's opportunistic behavior, which may result in a poor contract management outcome.

H2: Loss of control inherent in the structuring of a user-vendor relationship is negatively related to the effectiveness of contract management.

Operations Risk

Operations risk may result from differences in objectives among the parties, supported by information asymmetries between the parties or by difficulties in reaching mutual agreement (Clemons et al., 1993). An example of operations risk is the potential for quality shirking due to the difficulty of measuring the quality of deliverables. In the presence of operations risk, the contract may cover certain contingencies only ambiguously. The following proposition is therefore suggested:

H3: Operations risk inherent in the structuring of a user-vendor relationship is negatively related to contract management effectiveness.


Methodology

Measurements

Asset Specificity (ASS). This variable was measured as the level of asset specific investment made by the vendor. The measures were based on a five-point Likert scale ranging from insignificant to very significant levels of investment in (1) IS staff training for understanding the user’s specific operation procedure/business, (2) human resources for understanding the user’s specific information requirements, and (3) time, effort, and money for building a relationship with the user. These items were primarily adapted from Loh(1994).

Loss of Control (LOS). The measures for this variable were based on a five-point Likert scale ranging from no loss to complete loss of control over (1) data; (2) operations; (3) strategic use of resources; (4) disaster recovery; and (5) future planning. These items were adapted primarily from Kelly (1990).

Operations Risk (OPE). The measures for this variable were based on a five-point Likert scale ranging from strongly disagree to strongly agree with the difficulty of (1) measuring the vendor's performance; (2) assessing the fulfillment of the vendor's agreed-upon responsibilities; and (3) enforcing the agreements. These items were derived from Clemons et al. (1993).

Contract Management (CON). This primary dependent variable refers to the effectiveness or cost-effectiveness of contract management between a buyer and a supplier in IS outsourcing. This variable was measured subjectively, by assessing the IS manager's recognition of the contract management outcome in IS outsourcing. These measures were based on a five-point Likert scale ranging from strongly disagree to strongly agree with the effectiveness or cost-effectiveness of (1) human costs for contract management; (2) other relevant costs for IS outsourcing; (3) schedule control; (4) conflict resolution; and (5) change management. The items were derived from general considerations of the cost, time, and implementation issues in contract management.

Data Collection

To explore the relationship between transaction risks and contract management in IS outsourcing, a cross-sectional questionnaire was developed for collecting data from a group of large-sized corporations in Taiwan. The final instrument was distributed to the CIOs of 880 firms randomly selected from the directories of the 1996 Common Wealth 1600 large businesses in Taiwan. This directory includes the top 1000 firms in the manufacturing industry, the top 500 in the service industry, and the top 100 in the financial industry. This sample is thought to represent the large businesses in Taiwan.

207 survey questionnaires were collected, for an overall return rate of 23.6%. Of the 207 responding corporations, 146 had outsourcing experience and 139 returned complete data usable for analysis, yielding an effective response rate of 15.9%. 89 firms (64%) were from the manufacturing sector, 34 firms (24%) were in the service sector, and 16 firms (11%) were in the banking and insurance industries. Approximately one fifth (29 firms) of the respondents had total assets of less than NT$1 billion. Half (70 firms) had more than NT$1 billion but less than NT$5 billion. Fourteen percent (19 firms) had more than NT$5 billion but less than NT$10 billion, and fifteen percent (21 firms) had total assets of over NT$10 billion. Moreover, approximately one fourth (35 firms) of the respondents had fewer than 200 employees. Thirty-two percent (44 firms) had more than 200 but fewer than 500.


Nineteen percent (27 firms) had more than 500 but fewer than 1000 employees, and twenty-four percent (33 firms) had more than 1000 employees.

Analysis

The operational model of this study is depicted in Figure 2, which shows the expected signs for all of the path coefficients in the model. Following the approach recommended by Anderson & Gerbing (1988), the analysis first examines the measurement properties of the constructs through confirmatory factor analysis. After ensuring that all of the measures have sufficient unidimensionality, validity, and reliability, a full structural equation model was constructed for hypothesis testing. The software package used to perform the statistical analysis was EQS for Windows, developed by Bentler (1995). This software was specifically designed for analyzing structural equation models.
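To make the model setup concrete, the following is a minimal sketch of an equivalent specification written in Python with the open-source semopy package rather than EQS. The indicator-to-construct assignment follows the ordering in Table 2, and the data file name is a hypothetical placeholder; this is an illustration under those assumptions, not the authors' analysis code.

```python
# Sketch of the measurement + structural model in semopy, a lavaan-style
# Python SEM package. The original analysis was performed in EQS, so this
# is only an illustrative equivalent, not the study's actual script.
import pandas as pd
from semopy import Model, calc_stats

model_desc = """
ASS =~ V1 + V2 + V3
LOS =~ V4 + V5 + V6 + V7 + V8
OPE =~ V9 + V10 + V11
CON =~ V12 + V13 + V14 + V15 + V16
CON ~ ASS + LOS + OPE
"""

data = pd.read_csv("outsourcing_survey.csv")   # hypothetical file of item scores
model = Model(model_desc)
model.fit(data)                # maximum-likelihood estimation by default
print(model.inspect())         # loadings, path coefficients, standard errors
print(calc_stats(model))       # chi-square and incremental fit indexes
```

The first block of the description defines the measurement model (H1–H3 constructs and their indicators); the final line specifies the structural paths tested as hypotheses.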

Measurement Model

To avoid confusion in interpretation, a confirmatory measurement model was created to examine the measurement properties of all of the constructs postulated in this study. This model is a first-order factor model that specifies the relationships of the observed measures to their posited underlying constructs, with all four constructs allowed to intercorrelate freely. This approach affords an evaluation of the product rules of internal and external consistency, which is necessary for scale unidimensionality (Gerbing & Anderson, 1988). With this approach, it is possible to determine empirically whether the measures have convergent and discriminant validity (Bagozzi & Phillips, 1982). Only when scale unidimensionality has been acceptably established can the reliability of the measures be assessed (Gerbing & Anderson, 1988). Based on confirmatory factor analysis, the model was estimated using a full-information maximum likelihood (ML) method, which provides the most efficient parameter estimates that best explain the observed covariance (Anderson & Gerbing, 1988).

With the variance of all four factors normalized to 1, the various fit indexes and the ML estimates of the model parameters are given in Table 2. Although the chi-square statistic (χ2 (df: 98) = 175.444) is highly significant (p < 0.001), it is well known that this statistic is sensitive to sample size, degrees of freedom, and non-normality of distributions, and thereby cannot be relied upon solely in assessing model fit (Gerbing & Anderson, 1988). Moreover, the Normed Fit Index (NFI) (0.868), which also depends on sample size, again suggests an unacceptable model fit. For EQS, however, Bentler (1995) suggested the Comparative Fit Index (CFI) as the choice for an overall model fit assessment. CFI belongs to a class of incremental fit indexes based on the noncentrality parameter of the noncentral χ2 distribution. According to Gerbing & Anderson (1988), this class of indexes, including CFI, McDonald & Marsh's (1990) RNI (the bounded counterpart of CFI), and Bollen's (1989) DELTA2 (termed IFI in EQS), are recommended as the best candidates for overall fit assessment. Since the values of CFI (0.936) and IFI (0.937) are both greater than 0.9, the measurement model can be accepted as providing reasonable fit. Moreover, the Nonnormed Fit Index (NNFI) (0.922) is greater than 0.9, which also indicates an acceptable model fit.
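For reference, the sketch below gives the standard formulas behind the incremental fit indexes just discussed (NFI, NNFI/TLI, and CFI), computed from the target model's chi-square and a baseline (independence) model's chi-square. The baseline values are not reported in the paper, so the ones used here are arbitrary placeholders for illustration only.

```python
# Standard definitions of the incremental fit indexes discussed above.
# chi2_b / df_b refer to the baseline (independence) model, which the
# paper does not report, so the example baseline values are placeholders.

def nfi(chi2_m, chi2_b):
    return (chi2_b - chi2_m) / chi2_b

def nnfi(chi2_m, df_m, chi2_b, df_b):            # also known as TLI
    return (chi2_b / df_b - chi2_m / df_m) / (chi2_b / df_b - 1.0)

def cfi(chi2_m, df_m, chi2_b, df_b):
    num = max(chi2_m - df_m, 0.0)
    den = max(chi2_b - df_b, chi2_m - df_m, 0.0)
    return 1.0 - num / den

# Reported measurement-model values: chi-square 175.444 with 98 df.
chi2_m, df_m = 175.444, 98
chi2_b, df_b = 1500.0, 120       # placeholder baseline values for illustration
print(round(nfi(chi2_m, chi2_b), 3),
      round(nnfi(chi2_m, df_m, chi2_b, df_b), 3),
      round(cfi(chi2_m, df_m, chi2_b, df_b), 3))
```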

For each of the constructs in the model, both the (monomethod) convergent and discriminant validity of the measures must be appropriately established in order to ensure their scale unidimensionality.

Table 2: Parameter Estimates of the Measurement Model

Construct              Variable   ML estimate   Std. Err.   z-value
Asset Specificity      V1         0.900         0.071       12.721
                       V2         0.917         0.068       13.440
                       V3         0.702         0.078        9.049
Loss of Control        V4         0.805         0.075       10.702
                       V5         0.700         0.073        9.648
                       V6         0.979         0.073       13.337
                       V7         0.889         0.080       11.106
                       V8         0.874         0.080       10.854
Operations Risk        V9         0.843         0.089        9.463
                       V10        0.984         0.084       11.659
                       V11        0.949         0.097        9.757
Contract Management    V12        0.507         0.083        6.119
                       V13        0.591         0.079        7.521
                       V14        0.637         0.080        7.959
                       V15        0.808         0.063       12.761
                       V16        0.776         0.066       11.813

χ2 (df: 98) = 175.444; p < 0.001; χ2 / df = 1.79
NFI = 0.868; NNFI = 0.922; CFI = 0.936; IFI = 0.937

It is clear from Table 2 that all of the parameter estimates are highly significant, indicating convergent validity for these measures. Table 3 gives the estimated correlation coefficients among the four latent variables. Except for the correlation between asset specificity and loss of control, all of the correlations were statistically significant (p < 0.05). To ensure discriminant validity, no pair of the latent variables should be too highly correlated, allowing for sampling error. In this study, all of the correlation coefficients were less than 0.6, so the measures appear to have reasonable discriminant validity and the constructs are distinct.

According to the above confirmatory factor analysis, all of the measures used in this study are acceptably unidimensional, and the reliability of the measures can now be assessed. The typical scale reliability index is Cronbach's α (Cronbach, 1947). Accordingly, Cronbach's α is reported on the diagonal of Table 3. The reliability indexes are all greater than 0.7 and thereby satisfy the reliability requirement.
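As a reference for the reliability assessment, the following is a minimal sketch of the standard Cronbach's α computation from raw item scores; the item matrix here is randomly generated placeholder data standing in for one five-item scale, not the study's survey responses.

```python
# Standard Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance
# of the summed scale). Placeholder data only; not the study's responses.
import numpy as np

def cronbach_alpha(items):
    """items: (n_respondents, k_items) array of item scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # per-item sample variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the scale total
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
# 139 simulated respondents, 5 Likert items (placeholder for one scale)
fake_scale = rng.integers(1, 6, size=(139, 5))
print(round(cronbach_alpha(fake_scale), 2))
```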

Assessment of Model Fit

Table 4 shows the statistical results for the full model estimated with EQS. As in the measurement model, although the chi-square statistic is highly significant and the NFI is below 0.9, the other indexes demonstrate an acceptable fit.

Results

The results of hypothesis testing are summarized in Table 5. All of the path coefficients are significant at the 0.05 level and have the expected sign. The empirical evidence thus supports the hypotheses concerning the effect of transaction risks on IS outsourcing contract management.


Table 3: Estimated Correlation and Cronbach's α

Factor   ASS        LOS        OPE        CON
ASS      (0.87)
LOS       0.015     (0.90)
OPE      -0.299*     0.298*    (0.83)
CON       0.357*    -0.354*    -0.548*    (0.84)

ASS: Asset Specificity; LOS: Loss of Control; OPE: Operations Risk; CON: Contract Management
* Significant at the 0.05 level; Cronbach's α on the diagonal (in parentheses)

Table 4: Model Statistics

χ2 (df: 98)   p         χ2 / df   NFI     NNFI    CFI     IFI
175.442       < 0.001   1.79      0.868   0.922   0.936   0.937

Table 5: Empirical Results

Hypothesis         ML estimate   Std. Err.   z-value   Supported / Not Supported
(H1) ASS → CON      0.124        0.047        2.654    Supported
(H2) LOS → CON     -0.125        0.047       -2.663    Supported
(H3) OPE → CON     -0.196        0.056       -3.505    Supported
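As a quick consistency check (not part of the original paper), the z-values in Table 5 are simply the ML estimates divided by their standard errors; the small differences from the printed values come from rounding of the displayed estimates and standard errors.

```python
# z-values in Table 5 equal the ML path estimates divided by their standard
# errors; tiny discrepancies versus the printed z-values reflect rounding
# of the displayed (three-decimal) estimates.
paths = {"ASS -> CON": (0.124, 0.047),
         "LOS -> CON": (-0.125, 0.047),
         "OPE -> CON": (-0.196, 0.056)}
for name, (est, se) in paths.items():
    print(f"{name}: z = {est / se:.2f}")
```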

Discussions and Conclusions

The support for the hypotheses indicates that the effectiveness of contract management is not risk-free and is influenced by asset specificity, loss of control, and operations risk. Based on the above findings, it can be postulated that:

A vendor making an asset-specific investment with a particular customer becomes vulnerable. In the presence of asset specificity, the customer may require less effort to manage the outsourcing contract once the vendor has made the asset-specific investment.

The loss of control over data, operations, the strategic use of resources, disaster recovery, and future plans may generate a relationship in which the contract is difficult to manage after the fact.

Operations risk, stemming from difficulties in measuring the vendor's performance, assessing the fulfillment of a vendor's agreed-upon responsibilities, and enforcing the agreements, can also lead to poor contract management.

The results involve two facets of interest here: the heuristics for IS outsourcing decisions and the formal imposition of risk reduction as a contract concern. The former usually weighs risk as a factor in the decision to use internal or external resources. The latter addresses the issue further by designing and enforcing contracts between the IS outsourcing parties.

According to transaction cost theory, the degree of asset specificity is an important consideration in the outsourcing decision. Decision-makers need to be quite sensitive to asset-specific investments and have a clear investment strategy in mind. In addition, it is unwise to contract out IS functions of strategic importance because of the danger of losing control over future IT resource applications (Quinn & Hilmer, 1994; King, 1994). Generally, these decision concerns stem from potential opportunistic behavior by the IS supplier.

From the perspective of transaction cost theory, transactions that are subject to ex-post opportunism will benefit if appropriate safeguards can be devised ex ante. Considering the contract issues surrounding transaction risks, there are some common topics that could be covered in the terms and conditions. For example, premises and liquidated damages are two topics that cover legal details concerning the loss of control over data and disaster recovery. Adapted from Engelke (1999), premises refers to "Where the supplier is to be granted rights of occupation of, for example a data centre, then the terms of this will need to be agreed, not least to ensure that the user can regain the property should the outsourcing come to an end." Liquidated damages refer to "Specify the damages the outsourcer may have to pay if certain preventable disasters occur."

In addition, incentives may be realigned to safeguard the transactions in question against the hazards of operations risk. The use of incentive contracts means that the contract can include mechanisms by which the vendor shares in any cost savings or increased profits made under the agreement (Harriss, Giunipero & Hult, 1998). When some type of incentive is built into the contract (e.g., linking service levels to predetermined credits against charges), the contract directly steers the IS supplier toward the same goals as the client. Examples of such goals include technical/managerial performance, the project development schedule, and profit/cost realization.

Collectively, these results carry managerial implications both for the ex ante consideration of transaction risks in outsourcing decisions and, even more, for the ex post negotiation of safeguards in an outsourcing contract. Several features of this study can be extended further. For instance, additional risk factors such as uncertainty can be included for further investigation. Because a relationship between risk and success is believed to exist, further examination of the risk-versus-success effect may reveal more insights for IS outsourcing management. Finally, this analysis was based on cross-sectional data without controlling for the effect of industry type, owing to the limited sample size. Future research could collect more data in order to compare the risk effect on IS outsourcing across industries.